Arun's Blog

Running a Full AWS DRS Failover and Failback, AZ to AZ

Tags: AWS, Continuity, Migration
TL;DR

I ran AWS Elastic Disaster Recovery (DRS) through a complete loop in a lab: protect two EC2 instances in one Availability Zone, simulate the AZ going down, fail over into a second AZ, then fail back. Each instance ran a tiny heartbeat app that printed its live instance ID, AZ, and an ever-incrementing counter, so I could watch state survive the move. The mechanics are simple once a few things click: the launch template subnet decides which AZ you recover into, failback launches a brand-new instance rather than reviving the old one, the reverse replication button stays greyed out until the failback agent checks in, and DRS never redirects traffic for you. I ran this across two Availability Zones, but the same procedure works Region to Region: you point the launch template at a subnet in the other Region and drive it from that Region's console. This is the hands-on walkthrough with the gotchas that cost me time.

The Lab

I wanted to actually feel a DRS failover and failback rather than read about it, so I built something small and observable. Two source instances in us-east-1a: one Linux, one Windows. Each ran a trivial web app that, on every request, read its own instance metadata and rendered the instance ID, the Availability Zone, the private IP, and a counter that ticked up every five seconds. A stable Elastic IP fronted each app so I had one URL per workload that didn't change as instances came and went.

That counter is the whole trick. If the app comes back in the other AZ and the counter resumes near where it left off, I know two things at a glance: the data replicated, and the app restarted cleanly on the recovered disk. No guessing.
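A minimal sketch of what such a heartbeat could look like on the Linux side. This is not the lab's actual app: the function names, the counter file, and the page format are all assumptions made for illustration. The metadata calls use the standard IMDSv2 token flow.

```shell
# Hypothetical heartbeat helpers; names, paths, and page format are illustrative.

# Fetch one instance metadata key via IMDSv2 (token first, then the GET).
imds_get() {
  token=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
    -H "X-aws-ec2-metadata-token-ttl-seconds: 60")
  curl -s -H "X-aws-ec2-metadata-token: $token" \
    "http://169.254.169.254/latest/meta-data/$1"
}

# Increment a counter kept on disk, so its value travels with the replicated volume.
bump_counter() {
  count=$(cat "$1" 2>/dev/null || echo 0)
  count=$((count + 1))
  echo "$count" > "$1"
  echo "$count"
}

# Render one status line from values that were already fetched.
render_page() {
  printf 'instance=%s az=%s ip=%s counter=%s\n' "$1" "$2" "$3" "$4"
}

# Example loop (commented out so the block only defines functions):
# while true; do
#   render_page "$(imds_get instance-id)" \
#     "$(imds_get placement/availability-zone)" \
#     "$(imds_get local-ipv4)" \
#     "$(bump_counter /var/tmp/heartbeat.count)" > /var/www/html/index.html
#   sleep 5
# done
```

Keeping the counter in a file rather than in memory is the design choice that matters here: only data on the replicated disk can show up on the recovered instance.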

How DRS Models a Failover

Three nouns carry the whole service, and getting them straight up front makes everything else obvious.

| Thing | What it is |
| --- | --- |
| Source server | DRS's representation of a protected machine. Holds replication state and the recovery launch settings. This is not the instance, it's the record about the instance. |
| Recovery instance | The actual EC2 instance DRS launches in the DR location during a recovery or a drill. |
| Drill vs recovery | A drill is a non-destructive test launch, your source keeps running and is untouched. A recovery is the real failover. Mechanically identical, only the intent differs. |

The flow, end to end, is just: replicate, fail over, redirect traffic, reverse-replicate so you can come home, fail back, redirect traffic again, re-protect. Everything below is that sentence in slow motion.

AZ to AZ or Region to Region, same playbook

I ran this between two Availability Zones in a single Region because it's the cheapest way to rehearse, but nothing in the procedure is AZ-specific. To do it Region to Region instead, you point the recovery launch template at a subnet in the other Region and operate the failover steps from that Region's DRS console. The buttons, the phases, and every gotcha below are identical. The only real differences: cross-Region replication crosses the Region boundary, so it bills inter-Region data transfer, and your traffic-redirect step is usually a DNS change rather than an Elastic IP move. Mentally, every "AZ-b" below reads as "the recovery Region" and "AZ-a" as "the source Region."

The Launch Template Decides Your AZ (or Region)

This is the single most important setting and the one people miss. Each source server carries a recovery launch template. The subnet in that template is what determines which Availability Zone (or which Region) your recovery instance lands in. There is no separate "recover into AZ-b" button. You point the launch template at an AZ-b subnet, and that's where recovery goes.
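A read-only way to check where a recovery would land, sketched with the AWS CLI. The function name `recovery_az` is hypothetical. The sketch assumes that DRS launches from the template's default version and that the subnet sits on the template's first network interface, and it uses whatever Region your CLI is configured for.

```shell
# Print the Availability Zone that a source server's recovery launch template points at.
# Usage: recovery_az SOURCE_SERVER_ID
recovery_az() {
  # 1. Ask DRS which EC2 launch template belongs to this source server.
  lt_id=$(aws drs get-launch-configuration --source-server-id "$1" \
    --query 'ec2LaunchTemplateID' --output text) || return 1
  # 2. Read the subnet from the template's default version
  #    (assumption: the subnet is set on the first network interface).
  subnet=$(aws ec2 describe-launch-template-versions \
    --launch-template-id "$lt_id" --versions '$Default' \
    --query 'LaunchTemplateVersions[0].LaunchTemplateData.NetworkInterfaces[0].SubnetId' \
    --output text) || return 1
  # 3. Resolve the subnet to its Availability Zone.
  aws ec2 describe-subnets --subnet-ids "$subnet" \
    --query 'Subnets[0].AvailabilityZone' --output text
}
```

Running `recovery_az s-...` before you initiate anything makes a wrong-AZ launch visible while it is still cheap to fix.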

Set this once, ahead of time

In a real event you do not want to be editing launch templates while the business is down. Configure the recovery subnet when you onboard the server, then rehearse. I only edited it live in the lab because that was the point of the lab.

Simulating the Disaster

Before touching anything, I confirmed both source servers showed CONTINUOUS replication at zero lag, and I wrote down the baseline: Linux counter at 236, Windows at 220. Then I pulled the plug, which in a lab means stopping the two primaries.
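The same pre-flight check can be scripted. This is a sketch with hypothetical function names. It only tests the replication state, because the exact format of the lag field is something I would verify against real output before parsing it.

```shell
# List each source server with its replication state and reported lag.
replication_status() {
  aws drs describe-source-servers \
    --query 'items[].[sourceServerID, dataReplicationInfo.dataReplicationState, dataReplicationInfo.lagDuration]' \
    --output text
}

# Succeed only if at least one server is listed and every one reports CONTINUOUS.
all_continuous() {
  replication_status |
    awk '{ n++ } $2 != "CONTINUOUS" { bad = 1 } END { exit (bad || n == 0) }'
}
```

For example, `all_continuous && echo "safe to start"` guards the rest of a runbook script. Read the lag column by eye until you know what zero looks like in your account.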

aws ec2 stop-instances --instance-ids i-aaaa i-bbbb --region us-east-1

Both went to stopped, both stable URLs went dark, and the DRS source servers flipped to STALLED because the replication agent stopped reporting. All expected. Recovery uses the last snapshot taken before the stop, so nothing is lost up to that moment.

If you can't take an outage

Don't stop anything. Use Initiate drill instead of Initiate recovery in the next step. The source keeps serving production while you validate the DR launch in isolation. The rest of the flow is the same.

Failover: Initiate Recovery

On the Source servers page I selected both servers, chose Initiate recovery job, then Initiate recovery, and picked "use most recent data." DRS creates a job that runs three phases per server: snapshot, conversion, launch.
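The console action maps to the `start-recovery` API call. A sketch of a wrapper follows; `start_recovery_job` is a hypothetical name. Leaving out a snapshot ID for each server is what requests the most recent data, and the drill flag is the only difference between a drill and a real recovery.

```shell
# Start a recovery or a drill for one or more source servers and print the job ID.
# Usage: start_recovery_job recovery|drill s-... [s-... ...]
start_recovery_job() {
  mode=$1; shift
  servers=""
  for id in "$@"; do
    # No recoverySnapshotID given, so DRS launches from the latest data.
    servers="$servers sourceServerID=$id"
  done
  if [ "$mode" = "drill" ]; then
    drill_flag="--is-drill"
  else
    drill_flag="--no-is-drill"
  fi
  # $servers is intentionally unquoted so each entry becomes its own argument.
  aws drs start-recovery --source-servers $servers $drill_flag \
    --query 'job.jobID' --output text
}
```

Usage would look like `job_id=$(start_recovery_job recovery s-1111 s-2222)`, with placeholder server IDs.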

Conversion is the long pole

The snapshot finishes in seconds. Then conversion sat for about twenty minutes before anything launched. That is normal, especially the first time: DRS spins up a conversion server, converts the volumes so they'll boot in the target, then launches. Budget 15 to 25 minutes and don't panic at a long PENDING. Only worry if the job log shows errors or you blow past 30 minutes.
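Rather than refreshing the console, the wait can be scripted. This is a sketch: `wait_for_job` is a hypothetical name, and the 30-minute default timeout simply mirrors the budget above. A COMPLETED job only means the job finished, so the launch status of each participating server still deserves a look afterwards.

```shell
# Poll a DRS job until it reports COMPLETED or the timeout expires.
# Usage: wait_for_job JOB_ID [TIMEOUT_SECONDS] [INTERVAL_SECONDS]
# Returns 0 on COMPLETED, 1 on timeout, 2 if the CLI call itself fails.
wait_for_job() {
  job_id=$1; timeout=${2:-1800}; interval=${3:-30}; waited=0
  while [ "$waited" -lt "$timeout" ]; do
    status=$(aws drs describe-jobs --filters "jobIDs=$job_id" \
      --query 'items[0].status' --output text) || return 2
    echo "job $job_id: $status after ${waited}s"
    [ "$status" = "COMPLETED" ] && return 0
    sleep "$interval"
    waited=$((waited + interval))
  done
  return 1
}
```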

When it finished, both recovery instances were running in us-east-1b with fresh instance IDs and fresh private IPs. I moved each Elastic IP onto its recovery instance, which is the traffic-redirect step (more on that below), and went to validate.

Proving It Worked

The Linux app told the whole story in one page load: new instance ID, AZ now reading us-east-1b, and the counter resuming at 241 from a pre-failure 236. State survived the "disaster" and the app picked up in the other AZ. That's a clean failover.

The Windows app was more interesting, and it's a gotcha worth knowing.

Windows scheduled-task apps serve a stale page for a few minutes

For the first couple of minutes after recovery, the Windows app served its old instance ID and AZ. The instance was fine. The page is generated by a scheduled task, and boot-triggered scheduled tasks don't always fire the instant a recovered instance comes up. So the web server was happily serving the last page the task wrote before the snapshot. It self-corrected in about three minutes once the task ran. If you need correctness immediately, run the generator as a service or trigger the task at startup.

At this point I was failed over and serving from AZ-b. For a drill, you'd stop here and clean up. This was a real failover, so I kept going.

Arming Failback: Reverse Replication

To fail back, you first reverse replication so the recovered instance starts syncing its data back toward the original AZ. The console exposes this two ways that do the same thing: a "Start reversed replication" button on the Recovery instances page, or "Protect recovered instance" in the source server's Replication menu.

The reverse replication button is greyed out right after recovery

I selected the recovery instances, and the button was dead. The reason: the failback agent on the recovered instance has to phone home before DRS will let you reverse anything, and it had only just booted. The pending action reads "Protect recovered instance" until the agent checks in. Give it two or three minutes, refresh, and it lights up. The tell is a recent failback agent "last seen" timestamp on the recovery instance.
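The same check can be made from the CLI before asking for reverse replication. This is a sketch with hypothetical function names. It only tests whether the agent has been seen at all, not how recently, and it assumes the API gates the call the same way the console gates the button.

```shell
# Print when the failback agent on a recovery instance was last seen by DRS.
agent_last_seen() {
  aws drs describe-recovery-instances \
    --filters "recoveryInstanceIDs=$1" \
    --query 'items[0].failback.agentLastSeenByServiceDateTime' --output text
}

# Start reversed replication only once the agent has reported at least once.
# Returns 1 if the agent has not checked in, 2 if the lookup itself fails.
reverse_when_seen() {
  seen=$(agent_last_seen "$1") || return 2
  if [ -z "$seen" ] || [ "$seen" = "None" ]; then
    echo "agent on $1 has not checked in yet; retry in a few minutes" >&2
    return 1
  fi
  aws drs reverse-replication --recovery-instance-id "$1"
}
```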

Once it kicked off, DRS created a source server back in the original location and a fresh staging server, then synced the disks in reverse. I watched it go from RESCAN to CONTINUOUS at zero lag, which is the green light for the failback launch.

Reverse replication isn't free or instant

It re-transfers the disks and spins up its own EBS volumes, snapshots, and a staging server. Cross-Region adds data transfer charges on top. And once you stop replication, DRS deletes all prior points in time to save cost, so validate your failover instances before you stop anything.

Failing Back

Now the launch template matters again. It still pointed at the AZ-b subnet from the failover, so I flipped it to an AZ-a subnet. Otherwise the failback would have landed right back in the AZ I was trying to leave.

Then, on the source server that the recovery instance was replicating to, I chose Initiate recovery job, then Launch for failback. Same snapshot, conversion, launch cycle. A new instance came up in us-east-1a, I moved the Elastic IPs back onto it, and confirmed the apps were serving from AZ-a again with their counters intact.

Failback does not revive your original instance

This surprised me the first time. The old, stopped originals are still sitting there. Failback launches a brand-new instance from the data that replicated back. The originals are now obsolete and get terminated during cleanup. Don't expect your old instance IDs to come back to life.

DRS Never Redirects Traffic for You

Worth stating plainly because it's easy to assume otherwise. DRS gets an instance running in the new location. That's it. Pointing users at it is a separate, deliberate step you own. In the lab I moved an Elastic IP. In a real environment that's a Route 53 record update, a load balancer target swap, or an ENI move. If you forget this step, your perfectly healthy recovered instance sits there receiving no traffic while everyone stares at a dead URL.
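For the Elastic IP variant, the redirect is a single EC2 call. A sketch follows; `move_eip` is a hypothetical wrapper, and the allocation and instance IDs are whatever yours happen to be.

```shell
# Re-point an Elastic IP at a different instance.
# Usage: move_eip ALLOCATION_ID INSTANCE_ID
move_eip() {
  # --allow-reassociation lets the address move even if it is still
  # associated with the old instance.
  aws ec2 associate-address --allocation-id "$1" --instance-id "$2" \
    --allow-reassociation
}
```

Whichever mechanism you use, scripting it ahead of time keeps the redirect from being the step you improvise under pressure.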

Re-Protect and Clean Up

After a real failback, the new AZ-a instance is running unprotected. The last real step is to re-arm it by starting reversed replication again from the recovery side, so you're covered for the next event. Skip this on a drill, since on a drill your original is still production and you don't want to repoint protection away from it.

Then cleanup, in order, because you have to stop replication before terminating what it protects:

  1. Stop replication on the leftover reverse source servers you no longer need.
  2. Terminate the recovery instances in the DR AZ.
  3. Terminate the old, stopped originals; they've been replaced.
  4. Disconnect and delete any leftover source servers.
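The four steps above can be sketched as one function that stops at the first failure. `cleanup_drs` is a hypothetical name, and the sketch handles a single server and instance per call for brevity. Some of these calls start asynchronous work, so a real script may need to wait between steps.

```shell
# Clean up after a completed failback, in the order listed above.
# Usage: cleanup_drs REVERSE_SOURCE_SERVER_ID RECOVERY_INSTANCE_ID ORIGINAL_INSTANCE_ID
cleanup_drs() {
  # 1. Stop replication on the leftover reverse source server.
  aws drs stop-replication --source-server-id "$1" &&
  # 2. Terminate the recovery instance in the DR AZ.
  aws drs terminate-recovery-instances --recovery-instance-ids "$2" &&
  # 3. Terminate the old, stopped original.
  aws ec2 terminate-instances --instance-ids "$3" &&
  # 4. Disconnect, then delete, the leftover source server.
  aws drs disconnect-source-server --source-server-id "$1" &&
  aws drs delete-source-server --source-server-id "$1"
}
```

Chaining with `&&` preserves the ordering constraint: nothing gets terminated unless stopping replication succeeded first.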

Gotchas, Collected

| Symptom | Cause and fix |
| --- | --- |
| Reverse replication button greyed out | Failback agent on the recovery instance hasn't checked in yet. Wait two to three minutes, refresh, retry. |
| Recovery stuck at PENDING for 15-20 minutes | Normal. Conversion is slow, especially the first run. Watch the job log for errors, otherwise let it cook. |
| Windows page shows old instance ID, AZ, or counter | The page-generating scheduled task hasn't fired on the fresh boot. Self-corrects in a few minutes. Use a service or a startup trigger for instant correctness. |
| Recovery lands in the wrong AZ | Launch template subnet wasn't updated to the target AZ. |
| Source shows STALLED after you stop the primary | Expected, the agent stopped reporting. Recovery uses the last snapshot. |
| Traffic still hitting the old instance | DRS doesn't redirect traffic. Move the EIP, update DNS, or swap the load balancer target yourself. |

None of this is hard once you've done it once. The value of a dry run is exactly this list: the small surprises that would otherwise eat the first hour of a real event, when you have the least patience for them. If you operate anything you'd need to recover, run the loop in a lab before you need it for real.
