AWS - Elastic Disaster Recovery
AWS Elastic Disaster Recovery (DRS) provides fast, reliable recovery of on-premises and cloud servers to AWS. It continuously replicates your machines to a low-cost staging area, enabling an RPO of seconds and an RTO of minutes. Key steps: create IAM users, set up a VPC with staging and recovery subnets, install the replication agent on source servers, configure launch templates, and perform regular drills.
Introduction
IT disasters such as data center failures, server corruption, or cyber-attacks can disrupt your business, cause data loss, reduce revenue, and damage your reputation. AWS Elastic Disaster Recovery (commonly referred to as DRS) minimizes downtime and data loss by providing fast, reliable recovery of physical, virtual, and cloud-based servers into the AWS Cloud.
DRS continuously replicates your machines (including operating system, system state configuration, databases, applications, and files) into a low-cost staging area in your target AWS account and preferred Region. In the case of a disaster, you can instruct DRS to automatically launch all necessary machines in their fully provisioned state in minutes.
Key Terms
Having a disaster recovery plan is more than having backup routines and redundant components. Define your RTO and RPO objectives based on business metrics to guide your disaster recovery strategy decisions.
Recovery Time Objective (RTO)
The maximum acceptable delay between service interruption and service restoration. This determines what is considered an acceptable time window when the service is unavailable.
Factors affecting RTO:
- OS type - Linux servers typically boot within 5 minutes; Windows servers within 20 minutes
- OS configuration - Heavier workloads and additional services increase boot time
- Target instance performance - Lower performance instance types result in slower boot times
- Target volume performance - Lower performance volume types result in slower boot times
Recovery Point Objective (RPO)
The maximum acceptable time since the last data recovery point. This determines what is considered an acceptable loss of data between the last recovery point and the service outage.
The AWS Replication Agent continuously monitors blocks written to source server volumes and immediately copies them to the replication Staging Area. This enables an RPO of seconds as long as data can be immediately copied across the network.
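To illustrate why the network matters for RPO, here is a rough back-of-the-envelope sketch. It assumes a simplified model in which a backlog builds whenever the source writes faster than the link can carry data. The function name and the figures are illustrative assumptions, not DRS measurements or a DRS API.

```python
def replication_lag_seconds(write_mb_per_s: float,
                            bandwidth_mb_per_s: float,
                            burst_seconds: float) -> float:
    """Estimate how far replication falls behind after a write burst.

    Simplified model (an assumption, not how DRS reports lag): during the
    burst, data is written at write_mb_per_s while the network drains at
    bandwidth_mb_per_s. The lag is the time needed to send the backlog
    that has accumulated by the end of the burst.
    """
    if bandwidth_mb_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    backlog_mb = max(0.0, (write_mb_per_s - bandwidth_mb_per_s) * burst_seconds)
    return backlog_mb / bandwidth_mb_per_s


# Writes slower than the link: nothing queues up, so the lag stays at zero.
print(replication_lag_seconds(20, 50, 600))   # 0.0
# Writes at 80 MB/s over a 50 MB/s link for 10 minutes: 18000 MB backlog.
print(replication_lag_seconds(80, 50, 600))   # 360.0
```

In the second case the effective recovery point would trail the source by several minutes until the backlog drains, which is why sustained write rates should be compared against available bandwidth when sizing the link.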
Point in Time (PIT)
A feature that lets you launch an instance from a snapshot captured at a specific point in time. Default retention is 7 days (configurable from 1 to 365 days).
PIT snapshot schedule:
- Every 10 minutes for the last hour
- Once an hour for the last 24 hours
- Once a day for the configured retention period
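The schedule above can be sketched as a small calculation of which recovery points are roughly available at a given moment. This is an approximation under stated assumptions: the helper name `pit_offsets` is hypothetical, and the real snapshot timestamps are chosen by the service and will not fall on exact boundaries.

```python
from datetime import datetime, timedelta


def pit_offsets(retention_days: int = 7) -> list[timedelta]:
    """Return the approximate ages of the PIT snapshots kept at any moment."""
    if not 1 <= retention_days <= 365:
        raise ValueError("retention_days must be between 1 and 365")
    offsets = set()
    # Every 10 minutes for the last hour.
    offsets.update(timedelta(minutes=10 * k) for k in range(1, 7))
    # Once an hour for the last 24 hours.
    offsets.update(timedelta(hours=h) for h in range(1, 25))
    # Once a day for the configured retention period.
    offsets.update(timedelta(days=d) for d in range(1, retention_days + 1))
    # The set removes overlaps (60 minutes == 1 hour, 24 hours == 1 day).
    return sorted(offsets)


now = datetime(2024, 1, 15, 12, 0)
points = [now - off for off in pit_offsets(7)]
print(len(points))   # 35 (6 ten-minute + 23 more hourly + 6 more daily)
print(points[0])     # newest: 2024-01-15 11:50:00
print(points[-1])    # oldest: 2024-01-08 12:00:00
```

The practical takeaway is that granularity drops quickly with age: a point from 30 minutes ago can be chosen to within 10 minutes, while a point from three days ago can only be chosen to within a day.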
Setup and Configuration
AWS Account Setup
IAM Users
Create two IAM users with programmatic access:
- DRSAgentInstallUser - For agent installation
  - Attach: AWSElasticDisasterRecoveryAgentInstallationPolicy
- FailbackAgentUser - For failback scenarios
  - Attach: AWSElasticDisasterRecoveryFailbackInstallationPolicy
For simplicity, you can create a single user (e.g., awsdrs) and attach both policies to that user.
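If you prefer to script the user creation, the following is a minimal sketch. It assumes the `iam` argument behaves like `boto3.client("iam")` and that both policies are AWS managed policies under the standard `arn:aws:iam::aws:policy/` prefix; confirm the ARNs in the IAM console before relying on them. The helper name `create_drs_user` is hypothetical.

```python
AGENT_POLICY = "AWSElasticDisasterRecoveryAgentInstallationPolicy"
FAILBACK_POLICY = "AWSElasticDisasterRecoveryFailbackInstallationPolicy"


def create_drs_user(iam, user_name: str, policy_names: list[str]) -> dict:
    """Create an IAM user, attach managed policies, and return an access key.

    `iam` is expected to behave like boto3.client("iam").
    """
    iam.create_user(UserName=user_name)
    for name in policy_names:
        # Assumption: AWS managed policies use the shared "aws" ARN prefix.
        iam.attach_user_policy(
            UserName=user_name,
            PolicyArn=f"arn:aws:iam::aws:policy/{name}",
        )
    key = iam.create_access_key(UserName=user_name)["AccessKey"]
    return {"access_key_id": key["AccessKeyId"],
            "secret_access_key": key["SecretAccessKey"]}


# Example (requires boto3 and IAM permissions; not run here):
#   import boto3
#   iam = boto3.client("iam")
#   creds = create_drs_user(iam, "DRSAgentInstallUser", [AGENT_POLICY])
#   # Single-user variant described above:
#   creds = create_drs_user(iam, "awsdrs", [AGENT_POLICY, FAILBACK_POLICY])
```

The secret access key is only returned once at creation time, so store it somewhere safe before running the agent installer.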
Infrastructure
Set up your VPC with the following subnets:
- Public Subnet - For public-facing servers and NAT Gateway
- Private Subnet - For non-public servers (internet via NAT Gateway)
- Staging Subnet - For DRS replicated data staging area
- Recovery Subnet - For drill and recovery instances
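As a planning aid, the four subnets can be carved out of a VPC CIDR block with Python's standard `ipaddress` module. The `10.0.0.0/16` range, the `/24` subnet size, and the helper name `plan_subnets` are assumptions for illustration; use whatever addressing fits your network.

```python
import ipaddress

SUBNET_ROLES = ["public", "private", "staging", "recovery"]


def plan_subnets(vpc_cidr: str, new_prefix: int = 24) -> dict[str, str]:
    """Assign the first four equally sized blocks of the VPC to the roles."""
    vpc = ipaddress.ip_network(vpc_cidr)
    blocks = vpc.subnets(new_prefix=new_prefix)
    # zip() stops after the four roles; the rest of the VPC stays unallocated.
    return {role: str(block) for role, block in zip(SUBNET_ROLES, blocks)}


for role, cidr in plan_subnets("10.0.0.0/16").items():
    print(f"{role:<9} {cidr}")
# public    10.0.0.0/24
# private   10.0.1.0/24
# staging   10.0.2.0/24
# recovery  10.0.3.0/24
```

The sketch only computes address ranges. Creating the subnets, route tables, and NAT Gateway is still done in the VPC console or with your infrastructure tooling.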
DRS Configuration
Replication Template Settings
- Staging subnet - Select the staging subnet created above
- Volumes - Configure size and type for replication
- Security Groups - Use default DRS security group or custom
- Data routing - Configure if using Direct Connect or VPN
- Snapshot Retention - Default 7 days (configurable 1-365 days)
Source Server Setup
Windows Servers
# Run in PowerShell as Administrator
Start-BitsTransfer https://aws-elastic-disaster-recovery-us-east-2.s3.amazonaws.com/latest/windows/AwsReplicationWindowsInstaller.exe
.\AwsReplicationWindowsInstaller.exe --region 'us-east-2' --aws-access-key-id [akid] --aws-secret-access-key [sakid] --no-prompt --devices c:
Linux Servers
sudo wget -O ./aws-replication-installer-init.py https://aws-elastic-disaster-recovery-us-east-2.s3.amazonaws.com/latest/linux/aws-replication-installer-init.py
sudo python3 aws-replication-installer-init.py --region 'us-east-2' --aws-access-key-id [akid] --aws-secret-access-key [sakid] --no-prompt
Source servers must be able to reach AWS and must allow outbound TCP traffic on port 443 (AWS service endpoints) and port 1500 (replication to the staging area).
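Before installing the agent, it can help to confirm that the required outbound ports are actually reachable from the source server. The sketch below uses only the Python standard library. The regional endpoint name and the replication server IP in the example are assumptions; substitute the values for your Region and staging subnet.

```python
import socket


def check_tcp(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False


def check_targets(targets: list[tuple[str, int]]) -> dict[str, bool]:
    """Check several host/port pairs and report reachability for each."""
    return {f"{host}:{port}": check_tcp(host, port) for host, port in targets}


# Example (run on a source server; not executed here):
#   results = check_targets([
#       ("drs.us-east-2.amazonaws.com", 443),  # assumed regional DRS endpoint
#       ("10.0.2.15", 1500),                   # hypothetical replication server IP
#   ])
#   for target, ok in results.items():
#       print(target, "reachable" if ok else "BLOCKED")
```

Replication servers only exist in the staging subnet once replication has been started, so a port 1500 check is most useful when troubleshooting a server that is already registered.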
Launch Template Configuration
- In the DRS console, open a replicating source server
- Select "Launch Settings" tab
- Edit General launch settings - set "Instance type right sizing" to "None" for more control
- Click "Edit" on EC2 launch template
- Configure instance type, recovery subnet, security groups, and volume type
- Create template version and set as default
- Repeat for all source servers
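Repeating these steps by hand for many servers is tedious, so the same sequence can be scripted. The following is a sketch, assuming `drs` and `ec2` behave like `boto3.client("drs")` and `boto3.client("ec2")`. The helper name and all IDs are hypothetical, and volume type changes are left out for brevity.

```python
def set_recovery_launch_settings(drs, ec2, source_server_id: str,
                                 instance_type: str, subnet_id: str,
                                 security_group_ids: list[str]) -> int:
    """Apply launch settings to one source server; return the new version number.

    `drs` and `ec2` are expected to behave like boto3 "drs" and "ec2" clients.
    """
    # Turn off instance type right sizing so the template's type is used.
    drs.update_launch_configuration(
        sourceServerID=source_server_id,
        targetInstanceTypeRightSizingMethod="NONE",
    )
    # Look up the EC2 launch template that DRS manages for this server.
    template_id = drs.get_launch_configuration(
        sourceServerID=source_server_id)["ec2LaunchTemplateID"]

    current = ec2.describe_launch_templates(LaunchTemplateIds=[template_id])
    base_version = current["LaunchTemplates"][0]["DefaultVersionNumber"]

    # Create a new version based on the current default, overriding
    # the instance type and the network placement.
    created = ec2.create_launch_template_version(
        LaunchTemplateId=template_id,
        SourceVersion=str(base_version),
        LaunchTemplateData={
            "InstanceType": instance_type,
            "NetworkInterfaces": [{
                "DeviceIndex": 0,
                "SubnetId": subnet_id,
                "Groups": security_group_ids,
            }],
        },
    )
    new_version = created["LaunchTemplateVersion"]["VersionNumber"]

    # DRS launches from the default version, so make the new one the default.
    ec2.modify_launch_template(LaunchTemplateId=template_id,
                               DefaultVersion=str(new_version))
    return new_version


# Example (requires boto3 and permissions; all IDs are placeholders):
#   import boto3
#   set_recovery_launch_settings(
#       boto3.client("drs"), boto3.client("ec2"),
#       "s-1234567890abcdef0", "t3.medium",
#       "subnet-0123456789abcdef0", ["sg-0123456789abcdef0"])
```

Run a drill after changing launch settings this way to confirm that the instance comes up in the recovery subnet with the expected type.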
Drill and Failover
Running Drills
Performing drills is key to being prepared for a disaster:
- In DRS console, click on a source server
- Click "Initiate recovery job > Initiate Drill"
- Select "Use most recent data" for Points in time
- Click "Initiate drill"
- Monitor in "Recovery job history"
Run drills frequently to ensure your DRS configuration is correct. Drill instances don't affect production traffic or ongoing replication.
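Because drills should be frequent, it can be worth scripting them. The sketch below assumes `drs` behaves like `boto3.client("drs")`. The helper name `run_drill`, the polling interval, and the server IDs are hypothetical.

```python
import time


def run_drill(drs, source_server_ids: list[str],
              poll_seconds: float = 30.0, max_polls: int = 120) -> str:
    """Start a drill for the given source servers and wait for the job to end.

    `drs` is expected to behave like boto3.client("drs"). Returns the job ID.
    """
    job = drs.start_recovery(
        isDrill=True,
        # No recoverySnapshotID is given, so the most recent data is used.
        sourceServers=[{"sourceServerID": sid} for sid in source_server_ids],
    )["job"]
    job_id = job["jobID"]

    for _ in range(max_polls):
        items = drs.describe_jobs(filters={"jobIDs": [job_id]})["items"]
        if items and items[0]["status"] == "COMPLETED":
            return job_id
        time.sleep(poll_seconds)
    raise TimeoutError(f"drill job {job_id} did not complete in time")


# Example (requires boto3 and permissions; the ID is a placeholder):
#   import boto3
#   job_id = run_drill(boto3.client("drs"), ["s-1234567890abcdef0"])
#   print("Drill finished:", job_id)
```

A completed job does not by itself prove that every server launched successfully, so still review the per-server results in "Recovery job history" and test the drill instances before terminating them.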
Disaster Failover
In a disaster event, launch recovery instances using DRS, either from the most recent replicated data or from a specific point in time. The service automatically converts source servers to boot natively on AWS.
Failback
Prerequisites
- Failback target volumes must be the same size as, or larger than, the corresponding Recovery instance volumes
- Failback Client must communicate with Recovery instance on TCP 1500
- TCP 1500 inbound and TCP 443 outbound must be open on Recovery instance
- Allow S3 traffic from the failback target server
- Failback Client server needs at least 4 GB dedicated RAM
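The volume-size prerequisite is easy to check ahead of time. The following sketch compares volume sizes collected by hand from the Recovery instance and the failback target. Pairing volumes by device name is a simplifying assumption of this example, and the helper name and sizes are hypothetical.

```python
def undersized_volumes(recovery_gib: dict[str, int],
                       target_gib: dict[str, int]) -> list[str]:
    """Return recovery volumes with no large-enough counterpart on the target.

    Volumes are paired by name here; this is a simplification for the sketch.
    """
    problems = []
    for name, size in recovery_gib.items():
        # A missing target volume counts as size 0 and is reported as well.
        if target_gib.get(name, 0) < size:
            problems.append(name)
    return problems


recovery = {"/dev/sda": 100, "/dev/sdb": 500}
target = {"/dev/sda": 120, "/dev/sdb": 400}
print(undersized_volumes(recovery, target))   # ['/dev/sdb']
```

Any volume reported here needs to be resized or replaced on the target server before booting the Failback Client.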
Failback Process
- Complete the recovery to AWS (launch Recovery instances as described under Disaster Failover)
- Configure Failback Replication Settings on Recovery instances
- Download Failback Client ISO from the appropriate region S3 bucket
- Boot Failback Client ISO on the target server
- Enter AWS credentials and region
- Select or confirm Recovery instance mapping
- Map volumes (automatic or manual)
- Wait for data replication to complete (monitor in DRS console)
- Select Recovery instances and choose "Failback" to finalize
- Terminate, delete, or disconnect Recovery instance when complete
Troubleshooting
- Agent installation fails - Verify outbound connectivity on ports 443 and 1500. Ensure the IAM credentials have the correct policy attached.
- Replication stuck at initial sync - Check network bandwidth. Large source volumes take longer. Verify staging subnet has internet access.
- Source server shows "Stalled" - The agent lost connectivity. Check network, verify the agent service is running, and ensure ports 443/1500 are open.
- Launch fails at conversion - Ensure root drive has at least 2 GB free space. Check the conversion server can reach required AWS endpoints.
- Recovery instance won't boot - Verify launch template settings. Check security groups allow necessary traffic. Review conversion logs in the job history.
- Failback pairing fails - Verify TCP 1500 connectivity between the Failback Client and the Recovery instance. Ensure the Recovery instance has a public IP or is reachable over a private route (VPN or Direct Connect).
- Failback volumes don't match - Target server volumes must be equal to or larger than Recovery instance volumes. Resize if necessary.