I’ve long been a big fan of the Azure Site Recovery service. The product has evolved a lot since its initial release (when it could only orchestrate a failover between two on-premises sites), adding replication of on-premises workloads to Azure and many more features. Until recently, we had to live with some limitations (a maximum of 64 disks of 1 TB each per protected workload) and one major scenario we could not address. Note that larger disk sizes, up to 4 TB per disk, have just been announced in preview.
Ever since ASR gained replication capabilities (for Hyper-V, VMware or physical servers), we have not been able to protect an Azure virtual machine with Azure Site Recovery. The ASR team has just announced a preview of Azure Site Recovery for Azure virtual machines, which was one of the top customer requests for the service. This new capability addresses the following scenarios:
- Survive a major Azure region failure by replicating virtual machines to another region
- Migrate virtual machines between Azure regions
- Comply with regulations that require a Disaster Recovery strategy
- Clone a virtual machine based application for validation purposes
The feature is now available in public preview. Let’s see how to protect an Azure virtual machine and build a Disaster Recovery Plan. Note that even though the feature is exposed in the Azure portal, it’s still a preview. For example, PowerShell and CLI support is not yet available.
Before starting, we need the following:
- An instance of the Azure Site Recovery service: the instance must be located in the Azure region we want to replicate to. This may seem strange if you remember the Azure Backup limitation, where the Azure Backup service instance must be in the same region as the virtual machines we wish to protect.
- A resource group in the target Azure region
- A virtual network (and subnet) in the target Azure region to connect the future network interface. It’s not mandatory if you plan to establish RDP between two paired Azure regions (Azure virtual networks span paired Azure regions).
- A storage account created in the target Azure region to store the replicated virtual machines
- A storage account created in the source Azure region to cache the content to be replicated to the target storage account
- An availability set in the target Azure region, if the protected workloads in the source Azure region belong to an existing availability set
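The checklist above can be expressed as a small validation routine. This is only a minimal sketch: `DrPrerequisites` and `check_prerequisites` are hypothetical names invented for illustration (they are not part of any Azure SDK), and the rules simply encode the prerequisites as this post describes them.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DrPrerequisites:
    """Hypothetical description of a DR setup (illustration only, not an Azure API)."""
    source_region: str
    target_region: str
    vault_region: str              # region of the Site Recovery service instance
    target_rg_region: str          # region of the target resource group
    target_vnet_region: str        # region of the target virtual network
    target_storage_region: str     # storage account holding the replicated disks
    cache_storage_region: str      # storage account used as replication cache
    source_availability_set: Optional[str] = None
    target_availability_set: Optional[str] = None


def check_prerequisites(p: DrPrerequisites) -> List[str]:
    """Return the list of unmet prerequisites (an empty list means ready)."""
    problems: List[str] = []
    if p.vault_region == p.source_region:
        problems.append("Site Recovery instance is in the source region")
    for name, region in [
        ("resource group", p.target_rg_region),
        ("virtual network", p.target_vnet_region),
        ("target storage account", p.target_storage_region),
    ]:
        if region != p.target_region:
            problems.append(f"{name} is not in the target region")
    if p.cache_storage_region != p.source_region:
        problems.append("cache storage account is not in the source region")
    if p.source_availability_set and not p.target_availability_set:
        problems.append("target availability set is missing")
    return problems


if __name__ == "__main__":
    setup = DrPrerequisites(
        source_region="centralus",
        target_region="westeurope",
        vault_region="westeurope",
        target_rg_region="westeurope",
        target_vnet_region="westeurope",
        target_storage_region="westeurope",
        cache_storage_region="centralus",
    )
    print(check_prerequisites(setup))
```

Running such a check before opening the portal is a cheap way to catch a storage account created in the wrong region.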
This blog post aims to describe the setup itself, but also:
- The failover of an Azure virtual machine to the target Azure region
- The rollback of an Azure virtual machine to its initial Azure region
Enabling and managing this new feature is so simple, and so attractive from a financial point of view (no extra cost for compute), that it would be a crime not to use it.
If you know the Azure Site Recovery portal experience, you will discover that Azure is now available as a source for protected workloads. Note that the service is currently in preview. For this blog post, the workload to protect is located in the Central US Azure region. My goal is to use the West Europe Azure region for my DR plan.
The Azure Site Recovery service instance cannot be in the same Azure region as the workloads to be protected, which makes sense. At this first stage, we only select the source Azure region, the deployment model and the resource group containing the workloads we wish to protect. The Azure Site Recovery instance must belong to the same Azure subscription (and of course the same Azure environment) as the workloads. At the time of writing, it was not yet possible to protect workloads from different subscriptions. That’s a scenario Azure CSP vendors are waiting for in order to offer services. At this stage, the user experience does not differ much from protecting other workloads.
This is where many UX changes are introduced. For this blog post, I chose a target Azure region that is not paired with the source Azure region; we are not limited to paired Azure regions. The Azure Site Recovery portal proposes default values for some resources.
If you click on the « Customize » link in the upper part of the UX, you can change the proposed values for the required resources:
- A resource group in the target Azure region
- A virtual network in the target Azure region
For each virtual machine to protect, we can customize:
- The storage account used to store the VHDs of the future replicated virtual machines
- The storage account used as the ASR cache
For these two storage accounts, you don’t need more than Locally Redundant Storage (LRS).
If a virtual machine to be protected belongs to an availability set, you will need to select an availability set in the target Azure region to offer the same SLA for your virtual machine.
Each Azure virtual machine to be protected must be associated with a replication policy.
We can customize the following parameters:
- Retention policy for the recovery points
- Frequency for app-consistent snapshot
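To make these two settings concrete, here is a minimal sketch that models a replication policy and estimates how many app-consistent snapshots fit in the retention window. `ReplicationPolicy` is a hypothetical class, not an Azure SDK type. The 24-hour retention matches the default policy mentioned in this post; the 60-minute snapshot frequency is only an assumed placeholder value.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReplicationPolicy:
    """Hypothetical model of the two policy settings (not an Azure SDK type)."""
    retention_hours: int = 24                    # recovery point retention
    app_consistent_frequency_minutes: int = 60   # assumed value, illustration only

    def __post_init__(self) -> None:
        if self.retention_hours <= 0:
            raise ValueError("retention_hours must be positive")
        if self.app_consistent_frequency_minutes <= 0:
            raise ValueError("app_consistent_frequency_minutes must be positive")

    def app_consistent_points_retained(self) -> int:
        """Upper bound on app-consistent snapshots kept within the retention window."""
        return (self.retention_hours * 60) // self.app_consistent_frequency_minutes


if __name__ == "__main__":
    policy = ReplicationPolicy(retention_hours=24, app_consistent_frequency_minutes=240)
    print(policy.app_consistent_points_retained())
```

A lower snapshot frequency reduces the number of app-consistent points to choose from, which matters when several servers must be restored to the same application state.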
Initial configuration is now over. We can enable replication.
Once replication is enabled, we can follow the Azure Site Recovery jobs. The job list shows every job related to setting up the protection of our Azure virtual machine.
If we take a closer look at the jobs related to replication, we discover that ASR deploys a virtual machine extension in our workload.
Watch out: just because replication is reported as enabled does not mean the initial replication is complete. Initial replication takes some time.
In my case (standard Windows Virtual machine without any data disks), initial replication took around fifty minutes.
Once initial replication is complete, we can consider our workload protected.
If we take a closer look at the virtual machine information page, we discover two types of recovery points: crash-consistent and app-consistent.
In a DR situation, we can choose to favor either RTO or RPO.
If you are familiar with Azure SQL, that map should look familiar. I now have an Azure virtual machine located in the Central US Azure region with a replica in the East US Azure region.
Now that you have a DRP, you can test it. Microsoft recommends performing a test failover from time to time to validate the DR plan. You can choose to ignore the recommendation if you wish, but testing a DRP is a good idea.
We can select the recovery point using:
- The latest recovery point (for the lowest RPO)
- The latest processed recovery point (for the lowest RTO)
- The latest app-consistent recovery point
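The three options above can be illustrated with a simplified selection routine. This is a sketch under stated assumptions, not how ASR is implemented: `RecoveryPoint`, `Choice` and `pick_recovery_point` are hypothetical names, and the `processed` flag is a simplification standing for "already processed by the service, so usable without extra work".

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import List


class Choice(Enum):
    LATEST = "latest"                              # lowest RPO
    LATEST_PROCESSED = "latest_processed"          # lowest RTO
    LATEST_APP_CONSISTENT = "latest_app_consistent"


@dataclass(frozen=True)
class RecoveryPoint:
    """Hypothetical recovery point record (illustration only)."""
    taken_at: datetime
    processed: bool        # assumed flag: ready to use without further processing
    app_consistent: bool


def pick_recovery_point(points: List[RecoveryPoint], choice: Choice) -> RecoveryPoint:
    """Return the most recent recovery point that matches the chosen option."""
    if choice is Choice.LATEST:
        candidates = list(points)
    elif choice is Choice.LATEST_PROCESSED:
        candidates = [p for p in points if p.processed]
    else:
        candidates = [p for p in points if p.app_consistent]
    if not candidates:
        raise ValueError(f"no recovery point available for {choice.value}")
    return max(candidates, key=lambda p: p.taken_at)


if __name__ == "__main__":
    history = [
        RecoveryPoint(datetime(2017, 6, 1, 10, 0), processed=True, app_consistent=True),
        RecoveryPoint(datetime(2017, 6, 1, 10, 30), processed=True, app_consistent=False),
        RecoveryPoint(datetime(2017, 6, 1, 10, 45), processed=False, app_consistent=False),
    ]
    for option in Choice:
        print(option.value, pick_recovery_point(history, option).taken_at)
```

The trade-off is visible in the sample data: the newest point loses the least data but still needs processing, while the newest processed point is slightly older but ready sooner.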
For this test, the recommendation is to select a dedicated virtual network that is not connected to the source virtual network (through a Virtual Network Gateway or peering). We can track the progress in the jobs.
At this stage of the test, we have two virtual machines running: the original and the test. Because the second one is connected to an isolated network, you can check the health of the application and validate the DR plan.
The test is now complete, so we can clean up the resources it used.
Our workload is now protected. Let’s see how to initiate a « clean failover ».
The « Clean Failover »
We will assume that the failover is initiated manually. That said, using Log Analytics we could also:
- Detect an unhealthy virtual machine and initiate a single failover
- Detect a major outage in Azure Activity and initiate a complete relocation
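The decision behind these two cases can be sketched as a small function. Everything here is hypothetical: `plan_failover` is an illustrative name, and how the health flags and the outage signal would be derived from Log Analytics is deliberately left out of this sketch.

```python
from typing import Dict, List


def plan_failover(vm_healthy: Dict[str, bool], region_outage: bool) -> List[str]:
    """Return the names of the VMs to fail over, sorted for stable output.

    vm_healthy maps a VM name to its health flag; region_outage signals a
    major outage in the source region. Both inputs are assumptions made for
    this illustration.
    """
    if region_outage:
        # Major outage: relocate every protected VM.
        return sorted(vm_healthy)
    # Otherwise fail over only the unhealthy VMs.
    return sorted(name for name, healthy in vm_healthy.items() if not healthy)


if __name__ == "__main__":
    print(plan_failover({"web01": True, "sql01": False}, region_outage=False))
```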
When initiating the failover process, we need to decide whether fast service recovery is more important than full service restoration.
Because we have both app-consistent snapshots and recovery points (up to 24 hours with the default policy), a custom recovery point allows us to choose a specific version of the application. That’s great because, nowadays, applications are distributed across many servers, and in a DR situation we will need to restore many of them.
In the Azure Site Recovery Jobs view, we can see that the failover was initiated.
If we look at the details, we see that the virtual machine located in the source Azure region was shut down properly and a new virtual machine was provisioned in the target Azure region. In my case, the complete failover operation took five and a half minutes (I chose to minimize the Recovery Point Objective).
Once finished, the job appears as completed in the Azure Site Recovery Jobs view. From a virtual machine point of view, we now have two virtual machines, but only the one located in the target Azure region is running.
Once the failover is completed, we need to commit the operation. Once the commit is performed, we can no longer select another recovery point.
The clean Rollback
The rollback process is named « re-protect ». It’s actually the same process, with the source and target parameters swapped.
By default, ASR will use the same parameters but you can override them if necessary.
It takes a little longer than the initial protection phase, but remains acceptable.
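The idea behind re-protect (same settings, source and target swapped, with optional overrides) can be sketched in a few lines. `ProtectionSettings` and `reprotect` are hypothetical names used for illustration only; the real settings contain more than these three fields.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ProtectionSettings:
    """Hypothetical subset of the replication settings (illustration only)."""
    source_region: str
    target_region: str
    replication_policy: str


def reprotect(current: ProtectionSettings, **overrides: str) -> ProtectionSettings:
    """Swap source and target, keep everything else, then apply any overrides."""
    swapped = replace(
        current,
        source_region=current.target_region,
        target_region=current.source_region,
    )
    return replace(swapped, **overrides)


if __name__ == "__main__":
    initial = ProtectionSettings("centralus", "westeurope", "default-policy")
    print(reprotect(initial))
    print(reprotect(initial, replication_policy="custom-policy"))
```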
That’s the bonus. When the failover is committed, you have the option to disable replication. This option is designed for moving a workload from one Azure region to another.
What can I say? Even though it’s a preview, it’s almost feature complete (CLI and PowerShell support is missing, as is Managed Disks support). Once the feature is Generally Available, we will have a first-class Disaster Recovery service that we cannot afford not to implement.
BenoitS – Simple and Secure by design but Business compliant (with disruptive flag enabled)