Introduction
IT disasters such as data center failures, server corruptions, or cyber-attacks can not only disrupt your business, but also cause data loss, impact your revenue, and damage your reputation. AWS Elastic Disaster Recovery (commonly referred to as DRS) minimizes downtime and data loss by providing fast, reliable recovery of physical, virtual, and cloud-based servers into the AWS Cloud.
DRS continuously replicates your machines (including operating system, system state configuration, databases, applications, and files) into a low-cost staging area in your target AWS account and preferred Region. In the case of a disaster, you can instruct DRS to automatically launch all necessary machines in their fully provisioned state in minutes.
This post focuses on reliability patterns and how to apply them to your solutions using AWS services, with the goal of creating a disaster recovery environment. The material below is based on the Reliability Pillar of the AWS Well-Architected Framework, which provides a set of mechanisms to help customers apply best practices in the design, delivery, and maintenance of AWS environments.
Terms
Having a disaster recovery plan is more than having backup routines and redundant components. You should define your RTO and RPO objectives for disaster recovery and set those objectives based on business metrics. By defining the Recovery Time Objective (RTO) and Recovery Point Objective (RPO) for each application, you will have a basis for deciding which disaster recovery strategy to follow.
- Recovery Time Objective (RTO) is the maximum acceptable delay between service interruption and service restoration. This determines what is considered an acceptable time window when the service is unavailable.
- When launching a recovery job, the DRS orchestration process creates cloned volumes by using the replicated volumes in the replication Staging Area. During this process, DRS also initiates a process that converts all volumes that originated outside of AWS into AWS-compatible volumes, which are attached to EC2 instances that can boot natively on AWS. The job and boot time depend on the following environment conditions:
- OS type: The average recovered Linux server normally boots within 5 minutes, while the average recovered Windows server normally boots within 20 minutes because it is tied to the more resource-intensive Windows boot process.
- OS configuration: The OS configuration and application components it runs can impact the boot time. For example, some servers run heavier workloads and start additional services when booted, which may increase their total boot time.
- Target instance performance: DRS sets a default instance type based on the CPU and RAM provisioned on the source server. Changing to a lower performance instance type will result in a slower boot time than that of a higher performance instance type.
- Target volume performance: Using a lower performance volume type will result in a slower boot time than that of a higher performance volume type with more provisioned IOPS.
- Recovery Point Objective (RPO) is the maximum acceptable time since the last data recovery point. This determines what is considered an acceptable loss of data between the last recovery point and the service outage.
- The AWS Replication Agent continuously monitors the blocks written to the source server volume(s), and immediately attempts to copy the blocks across the network and into the replication Staging Area Subnet located in the customer’s target AWS account. This continuous replication approach enables an RPO of seconds as long as the written data can be immediately copied across the network and into the replication Staging Area volumes.
- Point in Time (PIT) is a disaster recovery feature that allows launching an instance from a snapshot captured at a specific point in time. As source servers are replicated, Point in Time states are chronicled over time, and a retention policy determines which Points in Time are discarded after a defined duration. You can increase or decrease the default 7-day snapshot retention period to anywhere between 1 day and 365 days in the Replication Settings. Elastic Disaster Recovery has the following PIT state schedule:
- Every 10 minutes for the last hour
- Once an hour for the last 24 hours
- Once a day for the last 7 days (or a different retention period, as configured)
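If you want to see which Point in Time snapshots are currently available for a given source server, the DRS CLI exposes this information. The command below is a sketch; the source server ID is a placeholder, so verify the exact syntax against the current AWS CLI reference:
- aws drs describe-recovery-snapshots --source-server-id s-1234567890abcdef0
- Replace s-1234567890abcdef0 with the ID shown on the DRS Source Servers page.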
Elastic Disaster Recovery
AWS Elastic Disaster Recovery (AWS DRS) is a service that helps minimize downtime and data loss with fast, reliable recovery of on-premises and cloud-based applications using affordable storage, minimal compute, and point-in-time recovery. IT resilience is increased by using AWS Elastic Disaster Recovery to replicate on-premises or cloud-based applications running on supported operating systems.
The AWS Management Console can be used to configure replication and launch settings, monitor data replication, and launch instances for drills or recovery.
AWS Elastic Disaster Recovery is set up by installing the AWS Replication Agent on your source servers, which initiates secure data replication. The data is replicated to a staging area subnet in a VPC in your target AWS Region. The staging area design reduces costs by using affordable storage and minimal compute resources to maintain ongoing replication.
Once the servers are in sync, non-disruptive tests can be performed to confirm that the DRS implementation is correctly configured and complete. During normal operation, monitor replication and periodically perform non-disruptive recovery and failback drills. AWS Elastic Disaster Recovery automatically converts source servers to boot and run natively on AWS when launching instances for drills or recovery. If recovery is needed, recovery instances can be launched on AWS within minutes, using the most up-to-date server state or a previous point in time. After applications are running on AWS, they can remain on AWS, or data replication back to your primary site can be initiated.
Setup and Configuration
DRS Workflow
AWS Account
Obtain an AWS account in which to create the infrastructure needed for your DRS environment.
IAM
Create two IAM users with programmatic access:
- DRSAgentInstallUser
- This user will be used for agent installation.
- An AWS managed policy called AWSElasticDisasterRecoveryAgentInstallationPolicy can be directly added to this user
- FailbackAgentUser
- This user will be used for failback scenarios
- An AWS managed policy called AWSElasticDisasterRecoveryFailbackInstallationPolicy can be directly added to this user
Note: For this post, I have created a single user object called awsdrs and attached both policies to this account
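If you prefer to script this step, the same user and policy attachments can be created with the AWS CLI. This is a minimal sketch that mirrors the single awsdrs user described in the note above; run it with credentials that have IAM permissions:
- aws iam create-user --user-name awsdrs
- aws iam attach-user-policy --user-name awsdrs --policy-arn arn:aws:iam::aws:policy/AWSElasticDisasterRecoveryAgentInstallationPolicy
- aws iam attach-user-policy --user-name awsdrs --policy-arn arn:aws:iam::aws:policy/AWSElasticDisasterRecoveryFailbackInstallationPolicy
- aws iam create-access-key --user-name awsdrs
- The create-access-key output contains the Access Key ID and Secret Access Key used later during agent installation; store them securely.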
Infrastructure
VPC
The virtual private cloud (VPC) will be the virtual network that houses your DRS environment. The VPC comprises a CIDR block that is divided into smaller ranges called subnets.
Public Subnet
This subnet will host any servers or services that will be public facing. This subnet will also host the NAT Gateway, which will be used for any servers or services sitting in the private subnet.
Private Subnet
This subnet will host any servers or services that should not be public facing. Access to the internet will be through the NAT Gateway which will be sitting in the public subnet.
Staging Subnet
This subnet will be used by DRS as a staging area for replicated data from the source servers into AWS. This subnet will be called upon by the Replication settings template.
Recovery Subnet
Drill and recovery instances will be launched in this subnet, defined by the launch templates associated with each source server.
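If you want to build this network layout with the AWS CLI rather than the console, a minimal sketch looks like the following. The CIDR ranges and the VPC ID are example values; substitute your own addressing plan:
- aws ec2 create-vpc --cidr-block 10.0.0.0/16
- aws ec2 create-subnet --vpc-id vpc-0123456789abcdef0 --cidr-block 10.0.0.0/24   # public subnet
- aws ec2 create-subnet --vpc-id vpc-0123456789abcdef0 --cidr-block 10.0.1.0/24   # private subnet
- aws ec2 create-subnet --vpc-id vpc-0123456789abcdef0 --cidr-block 10.0.2.0/24   # staging subnet
- aws ec2 create-subnet --vpc-id vpc-0123456789abcdef0 --cidr-block 10.0.3.0/24   # recovery subnet
- An internet gateway, the NAT Gateway in the public subnet, and the corresponding route tables still need to be created and associated, as described above.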
DRS
Enablement
To enable DRS, confirm the Region in which you want to configure the DRS environment. This is usually the same Region as your VPC.
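The console performs this initialization when you choose your Region and complete the setup wizard. If you are scripting the setup instead, the DRS API exposes an equivalent call; the command below is a sketch and assumes your CLI credentials target the chosen Region:
- aws drs initialize-service --region us-east-2
- Replace us-east-2 with your chosen Region.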
Replication Template
Set the default replication settings so that DRS will use these settings to create the replication objects.
- Staging subnet
- State the subnet that was created for staging purposes
- Volumes
- State the size and type of the volumes to be used for replication
- Security Groups
- Use either the default DRS security group or use your own
- Data routing and throttling
- Configure these settings if you have a Direct Connect or Site-to-Site VPN connection, or if you need to throttle replication bandwidth
- Snapshot Retention (Point in Time) Policy
- By default, DRS takes a snapshot every 10 minutes for the last hour, once an hour for the last 24 hours, and once a day thereafter.
- By default, DRS retains the daily snapshots for 7 days; however, this can be changed to anywhere between 1 day and 365 days.
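Most customers set these values in the console, but they can also be adjusted with the DRS CLI. The sketch below changes the daily snapshot retention from 7 to 14 days; the parameter names, the pitPolicy rule fields, and the rct- template ID format are assumptions based on my reading of the DRS API reference, so verify them against the current documentation before scripting this:
- aws drs update-replication-configuration-template --replication-configuration-template-id rct-0123456789abcdef0 --pit-policy "ruleID=1,units=MINUTE,interval=10,retentionDuration=60,enabled=true" "ruleID=2,units=HOUR,interval=1,retentionDuration=24,enabled=true" "ruleID=3,units=DAY,interval=1,retentionDuration=14,enabled=true"
- The three rules correspond to the 10-minute, hourly, and daily snapshot tiers described above.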
On-Premises/Source Servers
Source Servers
Obtain a list of servers that will need to be protected/replicated to AWS. The Operating System of these source servers will determine the agent and the steps to install the replication agents.
Windows Servers
- In PowerShell as an administrator
- Start-BitsTransfer https://aws-elastic-disaster-recovery-us-east-2.s3.amazonaws.com/latest/windows/AwsReplicationWindowsInstaller.exe
- .\AwsReplicationWindowsInstaller.exe --region us-east-2 --aws-access-key-id [akid] --aws-secret-access-key [sakid] --no-prompt --devices c:
- Replace --region us-east-2 with another region if so desired
- Replace akid and sakid with the appropriate IDs. These were purposely left out in this document for security reasons.
- The --no-prompt flag is used so that there isn't any user interaction. Please remove this flag if you want to have a verbose installation.
- The --devices c: flag is used to only replicate the c:\ drive. Please remove this if you want to replicate all drives.
Linux Servers
- sudo wget -O ./aws-replication-installer-init.py https://aws-elastic-disaster-recovery-us-east-2.s3.amazonaws.com/latest/linux/aws-replication-installer-init.py
- sudo python3 aws-replication-installer-init.py --region us-east-2 --aws-access-key-id [akid] --aws-secret-access-key [sakid] --no-prompt
- Replace --region us-east-2 with another region if so desired
- Replace akid and sakid with the appropriate IDs. These were purposely left out in this document for security reasons
- The --no-prompt flag is used so that there isn't any user interaction. Please remove this flag if you want to have a verbose installation.
Networking/Firewall
In order for the source servers to replicate to AWS, please make sure:
- The source servers can access the internet
- The source servers are allowed outbound on ports 443 and 1500
- 443 is used for connecting to the AWS services
- 1500 is used for replicating to the staging area
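A quick way to validate the outbound 443 path from a source server before installing the agent is to test the regional DRS endpoint (the drs.<region>.amazonaws.com endpoint name is an assumption; confirm it for your Region). Port 1500 can only be tested once replication servers are running in the staging subnet, because that is where the traffic terminates:
- Windows: Test-NetConnection drs.us-east-2.amazonaws.com -Port 443
- Linux: nc -vz drs.us-east-2.amazonaws.com 443
- After replication starts, repeat the test on port 1500 against the IP of a replication server launched in the staging subnet.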
Synchronization
Initial Sync
Once the agents have been installed, the replication job can be monitored in the AWS console.
The DRS Source Servers page can be used to review the initial sync process.
The initial sync is comprised of several tasks:
- Create security groups
- Launch replication server
- Boot replication server
- Authenticate with service
- Download replication software
- Create staging disks
- Attach staging disks
- Pair replication server with Agent
- Connect agent with replication server
- Start data transfer
Once the initial sync is complete, the final status will be shown as “Healthy”. That is the desired state as it indicates that all changes in the source server are being automatically replicated via the DRS replication server.
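The same replication state can be pulled from the CLI, which is handy when onboarding many servers. The --query expression below assumes the output field names I recall from the DRS API (hostname under sourceProperties.identificationHints and dataReplicationState under dataReplicationInfo); adjust it if your output differs:
- aws drs describe-source-servers --query "items[].{host:sourceProperties.identificationHints.hostname,state:dataReplicationInfo.dataReplicationState}" --output table
- A state of CONTINUOUS generally corresponds to the Healthy status shown in the console.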
Launch Settings/Launch Template
An EC2 launch template is required to configure the target instance which will be launched in the event of a drill/disaster. To configure the launch template for each replicating source server:
- In the DRS console, click into one of your replicating source machines on the “Source Servers” tab
- Select the “Launch Settings” tab
- Edit the General launch settings and set "Instance type right sizing" to "None" if you want more control over which EC2 instance type will be used. Save the settings
- General Launch Settings
- In the EC2 launch template box click “Edit”
- In the “Template version description” box enter today’s date and the time you are creating the version (or you can label this anything you like)
- For the Instance type (if you set "Instance type right sizing" to "None" in step 3), select an EC2 instance type that is similar in size to your source server
- Under Network Settings select your Recovery Subnet for Subnet
- Select existing security group for the firewall section
- Then in the Common security groups field select the security group of your choice; this will be the security group that will be tied to the recovery instance once it boots up
- Under “Configure Storage” change the “Volume type” to “gp3” (or any volume type of your choice)
- Leave all other settings untouched and click “Create template version”
- Click the link to the template version you just created in the Success notification box
- Select “Actions > Set Default Version”, then choose the name of the template you provided in step 6 as the Template version.
- Click the “Set as default version” button
- Repeat this process for all source servers
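The console steps above can also be scripted with the standard EC2 launch template commands once you know the launch template ID shown in the Launch Settings tab. The IDs, instance type, and description below are example values:
- aws ec2 create-launch-template-version --launch-template-id lt-0123456789abcdef0 --source-version 1 --version-description "drill-recovery-settings" --launch-template-data '{"InstanceType":"m5.large"}'
- aws ec2 modify-launch-template --launch-template-id lt-0123456789abcdef0 --default-version 2
- The subnet, security group, and volume type can be added to the --launch-template-data JSON in the same way; the default version is what DRS uses at launch time, which is why the last step matters.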
Drill/Failover
Drills
Elastic Disaster Recovery helps you stay ready for a failover event by making drills easy to run. It allows you to launch instances frequently for test and drill purposes without redirecting production traffic.
To be prepared for a failover, perform regular drills by launching Drill instances in AWS through Elastic Disaster Recovery and testing those instances.
Performing drills is a key aspect of being prepared for a disaster. When an actual disaster strikes, you can immediately perform a failover by launching Recovery instances in AWS based on a chosen Point in Time snapshot.
Disaster/Failover
In the event of a disaster, a failover to AWS will be performed with the help of Elastic Disaster Recovery. Once the disaster has been mitigated, a failback to your original source infrastructure will be completed.
Elastic Disaster Recovery ensures that your recovery systems are ready in the case of a disaster. Launch recovery instances with Elastic Disaster Recovery, up to the latest second, or to a certain Point in Time. Since new data will most likely have been written in AWS and this data needs to be copied back to your primary system, a failback replication will need to be performed.
Drill/Failover Tasks
In the event of a disaster or a drill, these steps are completed to perform a failover to AWS with the help of AWS Elastic Disaster Recovery:
- In the DRS console, click on one of the source servers to bring up the “Server info” page
- Click “Initiate recovery job > Initiate Drill”
- For Points in time select “Use most recent data”
- Click the “Initiate drill” button
- Use “Recovery job history” in the DRS console to monitor the job log and ensure that your launch is successful
Failback
Once the disaster is over, a Failback can be performed to the original source server or to any other server that meets the prerequisites by installing the Elastic Disaster Recovery Failback Client on the server. A cross-Region or cross-AZ failover and failback can be done directly with the aid of the DRS Console. In addition, DRS allows a scalable failback for vCenter with the DRS Mass Failback Automation client (DRSFA client). Once your Failback is complete, one has the option to either terminate, delete, or disconnect the Recovery instance.
Failback is the act of redirecting traffic from your recovery system to your primary system. This is an operation that is performed outside of Elastic Disaster Recovery. Elastic Disaster Recovery aids you in performing the failback by ensuring that the state of your primary system is up to date with the state of your recovery system.
Before performing a failback, make sure that any data written to your failover systems during the failover is replicated back to your original systems before you redirect users to your primary systems. Elastic Disaster Recovery helps you prepare for failback by replicating the data from your Recovery instances on AWS back to your source servers with the aid of the Failback Client.
Failback replication is performed by booting the Failback Client on the source server into which you want to replicate your data from AWS. In order to use the Failback Client, you must meet the failback prerequisites and generate failback AWS credentials as described below. The DRS Console lets you track the progress of your failback replication on the Recovery Instances page.
Failback Prerequisites
Prior to performing a failback, ensure that you meet all replication network requirements and the following failback-specific requirements:
- Ensure that the volumes on the server you are failing back to are the same size as, or larger than, the volumes on the Recovery instance
- The Failback Client must be able to communicate with the Recovery instance on TCP 1500; this can be done either via a private route (VPN/DX) or a public route (a public IP assigned to the Recovery instance)
- TCP Port 1500 inbound and TCP Port 443 outbound must be open on the Recovery instance for the pairing to succeed
- You must allow traffic to S3 from the server you are failing back to
- The server on which the Failback Client is run must have at least 4 GB of dedicated RAM
Failback Tasks
Once you are ready to perform a failback to your original source servers or to different servers, follow this flow:
- Complete the Recovery
- Configure your Failback Replication Settings on the Recovery instances you want to fail back
- Download the Elastic Disaster Recovery Failback Client ISO (aws-failback-livecd-64bit.iso) from the S3 bucket that corresponds to the AWS Region in which your Recovery instances are located
- Direct download link: https://aws-elastic-disaster-recovery-{REGION}.s3.amazonaws.com/latest/failback_livecd/aws-failback-livecd-64bit.iso (a concrete example for us-east-2 follows this list)
- Boot the Failback Client ISO on the server you want to fail back to. This can be the original source server that is paired with the Recovery instance, or a different server.
- Important: Ensure that the server you are failing back to has the same number of volumes or more than the Recovery Instance and that the volume sizes are equal to or larger than the ones on the Recovery Instance.
- Note:
- When performing a recovery for a Linux server, you must boot the Failback Client with BIOS boot mode.
- When performing a recovery for a Windows server, you must boot the Failback Client with the same boot mode (BIOS or UEFI) as the Windows source server.
- Enter the AWS credentials, including the AWS Access Key ID and AWS Secret Access Key that were created earlier in this document for the awsdrs user, and the AWS Region in which your Recovery instance resides
- Enter the custom endpoint or press Enter to use the default endpoint
- If you are failing back to the original source machine, the Failback Client will automatically choose the correct corresponding Recovery instance
- If the Failback Client is unable to automatically map the instance, then you will be prompted to select the Recovery instance to fail back from. The Failback Client will display a list with all Recovery instances. Select the correct Recovery instance by either entering the numerical choice from the list that corresponds to the correct Recovery instance or by typing in the full Recovery instance ID
- Note: The Failback Client will only display Recovery instances whose volume sizes are equal to or smaller than the volume sizes of the server you’re failing back to. If the Recovery instance has volume sizes that are larger than that of the server you are failing back to, then these Recovery instances will not be displayed.
- If you are failing back to the original source server, then the Failback Client will attempt to automatically map the volumes of the instance
- If the Failback Client is unable to automatically map the volumes, then you will need to manually enter a local block device (example /dev/sdg) to replicate to from the remote block device. Enter the EXCLUDE command to specifically exclude local block devices.
- Note: The local volumes must be the same in size or larger than the Recovery instance volumes
- The Failback Client will verify connectivity between the Recovery instance and the Elastic Disaster Recovery service
- Note: For the Failback Client to successfully establish connectivity, a public IP must be set on the Recovery instance in EC2. In addition, TCP Port 443 outbound must be open on the Recovery instance
- The Failback Client will download the replication software from a public S3 bucket onto the source server
- Note: You must allow traffic to S3 from the source server for this step to succeed
- The Failback Client will configure the replication software
- The Failback Client will pair with the AWS Replication Agent running on the Recovery instance and will establish a connection
- Note: TCP Port 1500 inbound must be open on the Recovery instance for the pairing to succeed
- Data replication will begin
- You can monitor data replication progress on the Recovery Instances page in the Elastic Disaster Recovery Console
- Once data replication has been completed, the Recovery instance on the Recovery Instances page will show the Ready status under the Failback state column and the Healthy status under the Data replication status column
- Once all of the Recovery instances you are planning to fail back show the statuses above, select the checkbox to the left of each Instance ID and choose Failback. This will stop data replication and will start the conversion process. This will finalize the failback process and will create a replica of each Recovery instance on the corresponding source server
- On the Continue with failback for X instances dialog, choose Failback
- This action will create a Job, which you can follow on the Recovery job history page
- Once the failback is complete, the Failback Client will show that the failback has been completed successfully
- You can opt to either terminate, delete, or disconnect the Recovery instance
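For reference, the direct download link above looks like this for us-east-2 (substitute your Region in the bucket name), and the failback progress shown in the console can also be checked from the CLI; the describe-recovery-instances command is a sketch, so verify it against the current CLI reference:
- wget https://aws-elastic-disaster-recovery-us-east-2.s3.amazonaws.com/latest/failback_livecd/aws-failback-livecd-64bit.iso
- aws drs describe-recovery-instances
- Look for the failback state and data replication status fields in the output, which mirror the Failback state and Data replication status columns on the Recovery Instances page.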
Failing Back to the Original Source Server vs. a Different Source Server
You can fail back to the original source server or a different source server using Elastic Disaster Recovery. If the original source server has been deleted or no longer exists, then you will not be able to fail back to it and it will show as having Lag and being Stalled in the Elastic Disaster Recovery Console. If the original source server is healthy and you decide to fail back to it, then it will undergo a rescan until it reaches the Ready status. You can tell whether you are failing back to the original or a new source server in the Recovery Instance details view under Failback status.