AWS - Elastic Disaster Recovery

TL;DR

AWS Elastic Disaster Recovery (DRS) provides fast, reliable recovery of on-premises and cloud servers to AWS. It continuously replicates your machines to a low-cost staging area, enabling an RPO of seconds and an RTO of minutes. Key steps: create IAM users, set up a VPC with staging and recovery subnets, install the replication agent on source servers, configure launch templates, and perform regular drills.

Introduction

IT disasters such as data center failures, server corruption, or cyberattacks can disrupt your business, cause data loss, reduce revenue, and damage your reputation. AWS Elastic Disaster Recovery (commonly referred to as DRS) minimizes downtime and data loss by providing fast, reliable recovery of physical, virtual, and cloud-based servers into the AWS Cloud.

DRS continuously replicates your machines (including operating system, system state configuration, databases, applications, and files) into a low-cost staging area in your target AWS account and preferred Region. In the case of a disaster, you can instruct DRS to automatically launch all necessary machines in their fully provisioned state in minutes.

Key Terms

Note

Having a disaster recovery plan is more than having backup routines and redundant components. Define your RTO and RPO objectives based on business metrics to guide your disaster recovery strategy decisions.

Recovery Time Objective (RTO)

The maximum acceptable delay between service interruption and service restoration. This determines what is considered an acceptable time window when the service is unavailable.

Factors affecting RTO:

OS type - Linux servers typically boot within 5 minutes; Windows servers within 20 minutes
OS configuration - Heavier workloads and additional services increase boot time
Target instance performance - Lower performance instance types result in slower boot times
Target volume performance - Lower performance volume types result in slower boot times

Recovery Point Objective (RPO)

The maximum acceptable time since the last data recovery point. This determines what is considered an acceptable loss of data between the last recovery point and the service outage.

The AWS Replication Agent continuously monitors blocks written to source server volumes and immediately copies them to the replication Staging Area. This enables an RPO of seconds as long as data can be immediately copied across the network.
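The "as long as data can be immediately copied" condition can be made concrete with a rough back-of-the-envelope sketch. The function names and the constant-rate model below are assumptions for illustration only; they are not part of DRS, and real workloads are far burstier.

```python
def replication_backlog_megabits(write_rate_mbps: float,
                                 bandwidth_mbps: float,
                                 duration_s: float) -> float:
    """Megabits still waiting to replicate after duration_s seconds.

    Assumes a constant write rate and constant usable bandwidth
    (both in megabits per second). Treat the result as a planning
    number, not a measurement.
    """
    surplus_mbps = write_rate_mbps - bandwidth_mbps
    return max(0.0, surplus_mbps * duration_s)


def estimated_lag_s(backlog_megabits: float, bandwidth_mbps: float) -> float:
    """Seconds needed to drain a backlog if no new writes arrive."""
    if bandwidth_mbps <= 0:
        raise ValueError("bandwidth_mbps must be positive")
    return backlog_megabits / bandwidth_mbps
```

If sustained writes exceed the available bandwidth, the backlog grows and the effective RPO drifts from seconds toward minutes, which is the case this sketch is meant to flag.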

Point in Time (PIT)

A feature that lets you launch an instance from a snapshot captured at a specific point in time. The default retention period is 7 days (configurable from 1 to 365 days).

PIT snapshot schedule:

Every 10 minutes for the last hour
Once an hour for the last 24 hours
Once a day for the configured retention period
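As a minimal sketch of the schedule above, the following function returns the snapshot spacing you could expect for a recovery point of a given age. The function name is hypothetical, and the exact handling of the boundaries (for example, a snapshot that is exactly one hour old) is an assumption.

```python
from datetime import timedelta
from typing import Optional


def pit_granularity(age: timedelta,
                    retention_days: int = 7) -> Optional[timedelta]:
    """Return the expected spacing between PIT snapshots of this age.

    Returns None when the age falls outside the retention period.
    """
    if not 1 <= retention_days <= 365:
        raise ValueError("retention_days must be between 1 and 365")
    if age < timedelta(0):
        raise ValueError("age must not be negative")
    if age <= timedelta(hours=1):
        return timedelta(minutes=10)   # every 10 minutes for the last hour
    if age <= timedelta(hours=24):
        return timedelta(hours=1)      # once an hour for the last 24 hours
    if age <= timedelta(days=retention_days):
        return timedelta(days=1)       # once a day until retention ends
    return None
```

For example, with the default retention a recovery point from three days ago is available at daily granularity, while one from eight days ago is no longer available.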

Setup and Configuration

AWS Account Setup

IAM Users

Create two IAM users with programmatic access:

DRSAgentInstallUser - For agent installation
- Attach: AWSElasticDisasterRecoveryAgentInstallationPolicy
FailbackAgentUser - For failback scenarios
- Attach: AWSElasticDisasterRecoveryFailbackInstallationPolicy

Pro Tip

For simplicity, you can create a single user (e.g., awsdrs) and attach both policies to that user.
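If you prefer to script this step, the sketch below shows one possible approach. It assumes you pass in a boto3 IAM client (for example, boto3.client("iam")) and that both managed policies live under the standard arn:aws:iam::aws:policy/ path; the function name is hypothetical.

```python
# Assumed ARNs: AWS managed policies under the default policy path.
AGENT_POLICY_ARN = (
    "arn:aws:iam::aws:policy/"
    "AWSElasticDisasterRecoveryAgentInstallationPolicy"
)
FAILBACK_POLICY_ARN = (
    "arn:aws:iam::aws:policy/"
    "AWSElasticDisasterRecoveryFailbackInstallationPolicy"
)


def create_drs_users(iam, single_user_name=None):
    """Create the DRS IAM users and attach the managed policies.

    `iam` is expected to behave like a boto3 IAM client. When
    single_user_name is given, one user receives both policies.
    Returns the plan that was applied: {user name: [policy ARNs]}.
    """
    if single_user_name:
        plan = {single_user_name: [AGENT_POLICY_ARN, FAILBACK_POLICY_ARN]}
    else:
        plan = {
            "DRSAgentInstallUser": [AGENT_POLICY_ARN],
            "FailbackAgentUser": [FAILBACK_POLICY_ARN],
        }
    for user_name, policy_arns in plan.items():
        iam.create_user(UserName=user_name)
        for arn in policy_arns:
            iam.attach_user_policy(UserName=user_name, PolicyArn=arn)
    return plan
```

This only creates the users and attaches policies. Access keys for programmatic access would still need to be generated separately (for instance with the IAM client's create_access_key call) and stored securely.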

Infrastructure

Set up your VPC with the following subnets:

Public Subnet - For public-facing servers and NAT Gateway
Private Subnet - For non-public servers (internet via NAT Gateway)
Staging Subnet - For DRS replicated data staging area
Recovery Subnet - For drill and recovery instances
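To plan address space for these four subnets, a small sketch using Python's standard ipaddress module may help. The VPC CIDR 10.0.0.0/16 and the /24 subnet size are illustrative assumptions, not recommendations from the original setup.

```python
import ipaddress
import itertools

SUBNET_NAMES = ["public", "private", "staging", "recovery"]


def plan_subnets(vpc_cidr: str = "10.0.0.0/16", new_prefix: int = 24) -> dict:
    """Carve the first four equally sized subnets out of a VPC CIDR."""
    vpc = ipaddress.ip_network(vpc_cidr)
    candidates = list(
        itertools.islice(vpc.subnets(new_prefix=new_prefix), len(SUBNET_NAMES))
    )
    if len(candidates) < len(SUBNET_NAMES):
        raise ValueError("VPC CIDR is too small for four subnets of this size")
    return dict(zip(SUBNET_NAMES, map(str, candidates)))
```

In practice you would also spread subnets across Availability Zones and size the staging subnet for the number of replication servers you expect; this sketch only handles the address arithmetic.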

DRS Configuration

Replication Template Settings

Staging subnet - Select the staging subnet created above
Volumes - Configure size and type for replication
Security Groups - Use default DRS security group or custom
Data routing - Configure if using Direct Connect or VPN
Snapshot Retention - Default 7 days (configurable 1-365 days)

Source Server Setup

Windows Servers

# Run in PowerShell as Administrator
Start-BitsTransfer https://aws-elastic-disaster-recovery-us-east-2.s3.amazonaws.com/latest/windows/AwsReplicationWindowsInstaller.exe

.\AwsReplicationWindowsInstaller.exe --region 'us-east-2' --aws-access-key-id [akid] --aws-secret-access-key [sakid] --no-prompt --devices c:

Linux Servers

sudo wget -O ./aws-replication-installer-init.py https://aws-elastic-disaster-recovery-us-east-2.s3.amazonaws.com/latest/linux/aws-replication-installer-init.py

sudo python3 aws-replication-installer-init.py --region 'us-east-2' --aws-access-key-id [akid] --aws-secret-access-key [sakid] --no-prompt
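The commands above hard-code us-east-2. If you replicate to a different Region, the download URLs need to change as well. The sketch below simply substitutes the Region into the same URL pattern used above; that the pattern holds for every Region is an assumption you should verify, and the function name is hypothetical.

```python
import re


def installer_urls(region: str) -> dict:
    """Build agent installer URLs by following the pattern shown above."""
    # Loose sanity check on the Region format, e.g. "us-east-2".
    if not re.fullmatch(r"[a-z]{2}(-[a-z]+)+-\d", region):
        raise ValueError(f"unexpected Region format: {region!r}")
    base = (
        f"https://aws-elastic-disaster-recovery-{region}"
        ".s3.amazonaws.com/latest"
    )
    return {
        "windows": f"{base}/windows/AwsReplicationWindowsInstaller.exe",
        "linux": f"{base}/linux/aws-replication-installer-init.py",
    }
```

The check does not confirm that DRS is available in the Region you pass in; it only guards against obvious typos.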

Important

Source servers need outbound network access, over the internet or a private connection such as VPN or Direct Connect, on TCP port 443 (AWS service endpoints) and TCP port 1500 (replication to the staging area).
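Before installing the agent, it can save time to confirm that the required ports are reachable from the source server. The following is a minimal sketch using only the Python standard library; the host names you pass in are your own to supply.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False
```

For example, you might call can_connect("drs.us-east-2.amazonaws.com", 443) to check the service endpoint (the endpoint name is assumed from the usual service.region.amazonaws.com pattern), and can_connect with a replication server's IP address and port 1500 once a replication server exists in the staging subnet. A successful TCP connection only shows that the port is open; it does not validate credentials or proxy settings.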

Launch Template Configuration

In DRS console, click into a replicating source machine
Select "Launch Settings" tab
Edit General launch settings - set "Instance type right sizing" to "None" for more control
Click "Edit" on EC2 launch template
Configure instance type, recovery subnet, security groups, and volume type
Create template version and set as default
Repeat for all source servers
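Repeating the right-sizing change for many servers by hand is tedious, so here is a hedged sketch of how that one step might be scripted. It assumes `drs` is a boto3 DRS client (boto3.client("drs")) and uses its describe_source_servers and update_launch_configuration calls; the function name is hypothetical, and the EC2 launch template edits (instance type, subnet, security groups) are not covered.

```python
def disable_right_sizing(drs):
    """Set instance type right sizing to NONE for every source server.

    `drs` is expected to behave like a boto3 DRS client.
    Returns the list of source server IDs that were updated.
    """
    updated = []
    kwargs = {"filters": {}}
    while True:
        page = drs.describe_source_servers(**kwargs)
        for server in page.get("items", []):
            server_id = server["sourceServerID"]
            drs.update_launch_configuration(
                sourceServerID=server_id,
                targetInstanceTypeRightSizingMethod="NONE",
            )
            updated.append(server_id)
        token = page.get("nextToken")
        if not token:
            return updated
        kwargs["nextToken"] = token  # fetch the next page of servers
```

Consider running it against a single test server first and confirming the result in the console before applying it across your fleet.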

Drill and Failover

Running Drills

Performing drills is key to being prepared for a disaster:

In DRS console, click on a source server
Click "Initiate recovery job > Initiate drill"
Select "Use most recent data" for Points in time
Click "Initiate drill"
Monitor in "Recovery job history"

Pro Tip

Run drills frequently to ensure your DRS configuration is correct. Drill instances don't affect production traffic or ongoing replication.
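Frequent drills are easier to sustain when they are scripted. The sketch below is one possible shape, assuming `drs` is a boto3 DRS client and using its start_recovery and describe_jobs calls; the function name, polling interval, and timeout are illustrative choices.

```python
import time


def run_drill(drs, source_server_ids, poll_seconds=30.0, max_polls=60):
    """Start a drill for the given servers and wait for the job to finish.

    `drs` is expected to behave like a boto3 DRS client. Omitting a
    snapshot ID per server is assumed to use the most recent data.
    Returns the job ID once the job status is COMPLETED.
    """
    response = drs.start_recovery(
        isDrill=True,
        sourceServers=[{"sourceServerID": sid} for sid in source_server_ids],
    )
    job_id = response["job"]["jobID"]
    for _ in range(max_polls):
        jobs = drs.describe_jobs(filters={"jobIDs": [job_id]})["items"]
        if jobs and jobs[0].get("status") == "COMPLETED":
            return job_id
        time.sleep(poll_seconds)
    raise TimeoutError(f"drill job {job_id} did not complete in time")
```

A COMPLETED job does not by itself prove that every server launched successfully, so it is still worth checking per-server launch status in the recovery job history and validating the drill instances, then terminating them when you are done.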

Disaster Failover

In a disaster, launch recovery instances from DRS using either the most recent replicated data or a specific Point in Time snapshot. During launch, the service automatically converts source servers so that they boot natively on AWS.

Failback

Prerequisites

Failback target volumes must be the same size as or larger than the corresponding Recovery instance volumes
The Failback Client must be able to communicate with the Recovery instance on TCP 1500
TCP 1500 inbound and TCP 443 outbound must be open on the Recovery instance
The failback target server must be allowed to reach Amazon S3
The server running the Failback Client needs at least 4 GB of dedicated RAM
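The volume size prerequisite is easy to check ahead of time. The sketch below compares two inventories that you would collect yourself; the function name, the volume labels, and the idea of matching volumes by label are assumptions for illustration, since the actual mapping happens in the Failback Client.

```python
def failback_volume_problems(recovery_volumes: dict,
                             target_volumes: dict) -> list:
    """List reasons a failback target cannot hold the recovery volumes.

    Both arguments map a volume label to its size in GiB. An empty
    list means every recovery volume has a large-enough target.
    """
    problems = []
    for name, size_gib in recovery_volumes.items():
        target_gib = target_volumes.get(name)
        if target_gib is None:
            problems.append(f"{name}: no matching target volume")
        elif target_gib < size_gib:
            problems.append(
                f"{name}: target {target_gib} GiB < recovery {size_gib} GiB"
            )
    return problems
```

Running a check like this before booting the Failback Client ISO can avoid discovering an undersized disk halfway through the process.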

Failback Process

Complete the Recovery
Configure Failback Replication Settings on Recovery instances
Download the Failback Client ISO from the S3 bucket for the appropriate Region
Boot Failback Client ISO on the target server
Enter AWS credentials and region
Select or confirm Recovery instance mapping
Map volumes (automatic or manual)
Wait for data replication to complete (monitor in DRS console)
Select Recovery instances and choose "Failback" to finalize
Terminate, delete, or disconnect Recovery instance when complete

Troubleshooting

Agent installation fails - Verify outbound connectivity on ports 443 and 1500. Ensure the IAM credentials have the correct policy attached.
Replication stuck at initial sync - Check network bandwidth. Large source volumes take longer. Verify staging subnet has internet access.
Source server shows "Stalled" - The agent lost connectivity. Check network, verify the agent service is running, and ensure ports 443/1500 are open.
Launch fails at conversion - Ensure the root drive has at least 2 GB of free space. Check that the conversion server can reach the required AWS endpoints.
Recovery instance won't boot - Verify launch template settings. Check that security groups allow the necessary traffic. Review conversion logs in the job history.
Failback pairing fails - Verify TCP 1500 connectivity between the Failback Client and the Recovery instance. Ensure the Recovery instance has a public IP or a private route (VPN or Direct Connect).
Failback volumes don't match - Target server volumes must be equal to or larger than Recovery instance volumes. Resize if necessary.
