The AWS-provided EC2Rescue runbooks (AWSSupport-ExecuteEC2Rescue and AWSSupport-StartEC2RescueWorkflow) fail immediately on encrypted EBS volumes — including those using AWS-managed keys (aws/ebs). This post walks through building your own rescue workflow using AWS Step Functions that handles encrypted volumes natively, with full KMS integration, no Lambda functions required, and a visual workflow you can monitor in real time.
The Problem
If you've read my previous post on fixing broken Windows EC2 instances with SSM automation, you know that AWS provides excellent runbooks for offline registry repair. There's just one catch buried in the limitations section:
Both AWSSupport-ExecuteEC2Rescue and AWSSupport-StartEC2RescueWorkflow check the volume's Encrypted flag and fail immediately if it's true. This applies to all encryption types — AWS-managed keys (aws/ebs), customer-managed KMS keys, even keys shared from other accounts. The helper instance's IAM role simply doesn't have KMS permissions.
With the push toward encryption-by-default (many organizations enable default EBS encryption at the account level), this limitation affects a growing number of instances. The usual advice is “do a manual volume swap” — a tedious, error-prone, 10+ step process that nobody wants to do at 2 AM during an incident.
What if we could build our own automated rescue workflow that handles encrypted volumes?
Why Step Functions?
AWS Step Functions is the right tool for this job for several reasons:
- Native AWS SDK integrations: Step Functions can call EC2, SSM, and KMS APIs directly without Lambda functions. Every API call in our workflow is a native SDK integration.
- Visual workflow: You can watch each state execute in real time in the console. During a 2 AM incident, this visibility is invaluable.
- Built-in error handling: Retry policies, catch blocks, and compensation logic are first-class citizens. If a step fails, the workflow can clean up after itself.
- IAM role you control: Unlike the AWS-managed runbooks, you define the execution role. You can grant kms:CreateGrant, kms:Decrypt, and kms:DescribeKey, the permissions needed to mount encrypted volumes on the helper instance.
- No compute cost during waits: The workflow uses polling loops to wait for instances and volumes. Unlike a Lambda that would sit idle (and bill you), Step Functions only charges per state transition.
Architecture Overview
The workflow follows the same logical steps as the manual volume swap, fully automated:
- Snapshot — Create a backup snapshot of the broken instance's root volume
- Stop — Stop the broken instance
- Detach — Detach the encrypted root volume
- Launch helper — Launch a temporary Windows instance in the same AZ with KMS permissions
- Attach — Attach the encrypted volume to the helper as a secondary drive
- Fix — Run your repair script on the helper via SSM Run Command
- Detach from helper — Detach the volume from the helper
- Reattach — Reattach the volume to the original instance as the root device
- Start — Start the original instance
- Cleanup — Terminate the helper instance
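Most steps are followed by a polling loop (a Wait state plus a describe call that checks instance or volume state), and the script-failure path reattaches the volume and terminates the helper. The KMS permissions sit on the Step Functions execution role, which makes the AttachVolume call; the helper's instance profile only needs SSM permissions.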
Prerequisites
- An AWS account with Step Functions, EC2, SSM, and KMS access
- A Windows AMI ID for the helper instance (e.g., latest Windows Server 2022 Base)
- A VPC subnet in the same Availability Zone as the target instance
- An IAM instance profile for the helper with SSM agent permissions
- If using customer-managed KMS keys: the key policy must allow the Step Functions role to create grants
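The AMI and subnet prerequisites can be looked up from the command line. The following is a minimal sketch, assuming the AWS CLI is installed and configured for the target instance's Region. The function names are hypothetical, and the SSM public parameter path assumes you want the English Windows Server 2022 Base image.

```shell
#!/usr/bin/env bash
# Hypothetical helper functions for gathering the workflow prerequisites.

# Print the latest Windows Server 2022 Base AMI ID, read from the public
# SSM parameter that AWS maintains for Windows AMIs.
latest_windows_ami() {
  aws ssm get-parameter \
    --name /aws/service/ami-windows-latest/Windows_Server-2022-English-Full-Base \
    --query 'Parameter.Value' --output text
}

# Print the Availability Zone of the broken instance ($1 = instance ID).
instance_az() {
  aws ec2 describe-instances --instance-ids "$1" \
    --query 'Reservations[0].Instances[0].Placement.AvailabilityZone' \
    --output text
}

# Print the IDs of the subnets in the same AZ as the instance ($1).
helper_subnets_for() {
  local az
  az=$(instance_az "$1") || return 1
  aws ec2 describe-subnets \
    --filters "Name=availability-zone,Values=${az}" \
    --query 'Subnets[].SubnetId' --output text
}
```

Calling `helper_subnets_for i-0abc123def456` lists candidate subnets; which one to use still depends on your own routing, since the helper must be able to reach the SSM endpoints.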
IAM Role for the State Machine
The Step Functions execution role needs permissions across EC2, SSM, and KMS. Here's the IAM policy:
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "EC2Permissions",
"Effect": "Allow",
"Action": [
"ec2:DescribeInstances",
"ec2:DescribeVolumes",
"ec2:DescribeSnapshots",
"ec2:StopInstances",
"ec2:StartInstances",
"ec2:DetachVolume",
"ec2:AttachVolume",
"ec2:CreateSnapshot",
"ec2:RunInstances",
"ec2:TerminateInstances",
"ec2:CreateTags"
],
"Resource": "*"
},
{
"Sid": "SSMRunCommand",
"Effect": "Allow",
"Action": [
"ssm:SendCommand",
"ssm:GetCommandInvocation",
"ssm:DescribeInstanceInformation"
],
"Resource": "*"
},
{
"Sid": "KMSForEncryptedVolumes",
"Effect": "Allow",
"Action": [
"kms:Decrypt",
"kms:DescribeKey",
"kms:CreateGrant",
"kms:GenerateDataKeyWithoutPlaintext",
"kms:ReEncryptFrom",
"kms:ReEncryptTo"
],
"Resource": "*"
},
{
"Sid": "PassRoleForHelper",
"Effect": "Allow",
"Action": "iam:PassRole",
"Resource": "arn:aws:iam::*:role/EC2RescueHelperRole"
}
]
}
If using customer-managed KMS keys, the key policy must include the Step Functions execution role (or the helper instance's role) as a principal allowed to call kms:CreateGrant and kms:Decrypt. For AWS-managed keys (aws/ebs), the default key policy typically allows any principal in the account to use the key for EBS operations, so no key policy changes are needed.
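To find out which of the two cases applies to a given volume, you can ask EC2 for the volume's key and then ask KMS who manages it. This is a sketch rather than a definitive check: `volume_key_manager` is a hypothetical name, and the caller is assumed to have ec2:DescribeVolumes and kms:DescribeKey.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report which kind of KMS key protects a volume.
# Prints "unencrypted", "AWS" (an AWS-managed key such as aws/ebs), or
# "CUSTOMER" (a customer-managed key that may need a key policy change).
volume_key_manager() {
  local volume_id="$1" key_id
  key_id=$(aws ec2 describe-volumes --volume-ids "$volume_id" \
    --query 'Volumes[0].KmsKeyId' --output text) || return 1
  # With --output text the CLI prints "None" when the field is absent.
  if [ -z "$key_id" ] || [ "$key_id" = "None" ]; then
    echo "unencrypted"
    return 0
  fi
  aws kms describe-key --key-id "$key_id" \
    --query 'KeyMetadata.KeyManager' --output text
}
```

A result of `CUSTOMER` is the signal to review the key policy before starting the workflow.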
The trust policy for the execution role:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"Service": "states.amazonaws.com"
},
"Action": "sts:AssumeRole"
}
]
}
Helper Instance Profile
The helper instance needs an IAM instance profile with SSM permissions so the Run Command can execute your repair script:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"ssm:UpdateInstanceInformation",
"ssmmessages:CreateControlChannel",
"ssmmessages:CreateDataChannel",
"ssmmessages:OpenControlChannel",
"ssmmessages:OpenDataChannel",
"ec2messages:GetMessages",
"ec2messages:AcknowledgeMessage",
"ec2messages:SendReply"
],
"Resource": "*"
}
]
}
You can also attach the AmazonSSMManagedInstanceCore managed policy instead of the inline policy above.
The State Machine Definition
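Below is the complete Amazon States Language (ASL) definition, preceded by the input it expects. Note that GetInstanceDetails takes the first entry in BlockDeviceMappings as the root volume, which assumes the root device is listed first.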
Input Format
The state machine expects this input when started:
{
"InstanceId": "i-0abc123def456",
"HelperAmiId": "ami-0abcdef1234567890",
"HelperInstanceType": "t3.medium",
"HelperSubnetId": "subnet-0abc123",
"HelperSecurityGroupId": "sg-0abc123",
"HelperInstanceProfileArn": "arn:aws:iam::123456789012:instance-profile/EC2RescueHelperRole",
"FixScript": "base64-encoded-powershell-script"
}
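Typing this JSON by hand during an incident is error-prone, so it can help to generate it. The sketch below assembles the input from positional arguments; `build_rescue_input` is a hypothetical helper, and it assumes the values contain no double quotes or backslashes (true for AWS resource IDs, ARNs, and base64 text).

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the execution input as a JSON document.
# Arguments: instance ID, helper AMI ID, subnet ID, security group ID,
# instance profile ARN, base64-encoded fix script, and an optional
# instance type (defaults to t3.medium, as in the example input).
# Values are inserted verbatim, so they must not contain double quotes
# or backslashes.
build_rescue_input() {
  local instance_id="$1" ami_id="$2" subnet_id="$3" sg_id="$4"
  local profile_arn="$5" fix_script_b64="$6" instance_type="${7:-t3.medium}"
  printf '{\n'
  printf '  "InstanceId": "%s",\n' "$instance_id"
  printf '  "HelperAmiId": "%s",\n' "$ami_id"
  printf '  "HelperInstanceType": "%s",\n' "$instance_type"
  printf '  "HelperSubnetId": "%s",\n' "$subnet_id"
  printf '  "HelperSecurityGroupId": "%s",\n' "$sg_id"
  printf '  "HelperInstanceProfileArn": "%s",\n' "$profile_arn"
  printf '  "FixScript": "%s"\n' "$fix_script_b64"
  printf '}\n'
}
```

The output can be passed to `aws stepfunctions start-execution` through `--input "$(build_rescue_input ...)"`.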
Complete ASL Definition
{
"Comment": "EC2 Rescue Workflow for Encrypted EBS Volumes",
"StartAt": "GetInstanceDetails",
"States": {
"GetInstanceDetails": {
"Type": "Task",
"Resource": "arn:aws:states:::aws-sdk:ec2:describeInstances",
"Parameters": {
"InstanceIds.$": "States.Array($.InstanceId)"
},
"ResultPath": "$.InstanceDetails",
"ResultSelector": {
"AvailabilityZone.$": "$.Reservations[0].Instances[0].Placement.AvailabilityZone",
"RootDeviceName.$": "$.Reservations[0].Instances[0].RootDeviceName",
"VolumeId.$": "$.Reservations[0].Instances[0].BlockDeviceMappings[0].Ebs.VolumeId"
},
"Next": "CreateBackupSnapshot"
},
"CreateBackupSnapshot": {
"Type": "Task",
"Resource": "arn:aws:states:::aws-sdk:ec2:createSnapshot",
"Parameters": {
"VolumeId.$": "$.InstanceDetails.VolumeId",
"Description.$": "States.Format('EC2Rescue backup - {}', $.InstanceId)",
"TagSpecifications": [
{
"ResourceType": "snapshot",
"Tags": [
{
"Key": "Name",
"Value.$": "States.Format('EC2Rescue-Backup-{}', $.InstanceId)"
},
{
"Key": "CreatedBy",
"Value": "StepFunctions-EC2Rescue"
}
]
}
]
},
"ResultPath": "$.Snapshot",
"ResultSelector": {
"SnapshotId.$": "$.SnapshotId"
},
"Next": "StopInstance"
},
"StopInstance": {
"Type": "Task",
"Resource": "arn:aws:states:::aws-sdk:ec2:stopInstances",
"Parameters": {
"InstanceIds.$": "States.Array($.InstanceId)"
},
"ResultPath": null,
"Next": "WaitForInstanceStopped"
},
"WaitForInstanceStopped": {
"Type": "Wait",
"Seconds": 15,
"Next": "CheckInstanceStopped"
},
"CheckInstanceStopped": {
"Type": "Task",
"Resource": "arn:aws:states:::aws-sdk:ec2:describeInstances",
"Parameters": {
"InstanceIds.$": "States.Array($.InstanceId)"
},
"ResultPath": "$.InstanceState",
"ResultSelector": {
"State.$": "$.Reservations[0].Instances[0].State.Name"
},
"Next": "IsInstanceStopped"
},
"IsInstanceStopped": {
"Type": "Choice",
"Choices": [
{
"Variable": "$.InstanceState.State",
"StringEquals": "stopped",
"Next": "DetachRootVolume"
}
],
"Default": "WaitForInstanceStopped"
},
"DetachRootVolume": {
"Type": "Task",
"Resource": "arn:aws:states:::aws-sdk:ec2:detachVolume",
"Parameters": {
"VolumeId.$": "$.InstanceDetails.VolumeId",
"InstanceId.$": "$.InstanceId"
},
"ResultPath": null,
"Next": "WaitForVolumeDetached"
},
"WaitForVolumeDetached": {
"Type": "Wait",
"Seconds": 10,
"Next": "CheckVolumeDetached"
},
"CheckVolumeDetached": {
"Type": "Task",
"Resource": "arn:aws:states:::aws-sdk:ec2:describeVolumes",
"Parameters": {
"VolumeIds.$": "States.Array($.InstanceDetails.VolumeId)"
},
"ResultPath": "$.VolumeState",
"ResultSelector": {
"State.$": "$.Volumes[0].State"
},
"Next": "IsVolumeDetached"
},
"IsVolumeDetached": {
"Type": "Choice",
"Choices": [
{
"Variable": "$.VolumeState.State",
"StringEquals": "available",
"Next": "LaunchHelperInstance"
}
],
"Default": "WaitForVolumeDetached"
},
"LaunchHelperInstance": {
"Type": "Task",
"Resource": "arn:aws:states:::aws-sdk:ec2:runInstances",
"Parameters": {
"ImageId.$": "$.HelperAmiId",
"InstanceType.$": "$.HelperInstanceType",
"MinCount": 1,
"MaxCount": 1,
"SubnetId.$": "$.HelperSubnetId",
"SecurityGroupIds.$": "States.Array($.HelperSecurityGroupId)",
"IamInstanceProfile": {
"Arn.$": "$.HelperInstanceProfileArn"
},
"TagSpecifications": [
{
"ResourceType": "instance",
"Tags": [
{
"Key": "Name",
"Value.$": "States.Format('EC2Rescue-Helper-{}', $.InstanceId)"
},
{
"Key": "CreatedBy",
"Value": "StepFunctions-EC2Rescue"
}
]
}
]
},
"ResultPath": "$.Helper",
"ResultSelector": {
"InstanceId.$": "$.Instances[0].InstanceId"
},
"Next": "WaitForHelperRunning"
},
"WaitForHelperRunning": {
"Type": "Wait",
"Seconds": 30,
"Next": "CheckHelperRunning"
},
"CheckHelperRunning": {
"Type": "Task",
"Resource": "arn:aws:states:::aws-sdk:ec2:describeInstances",
"Parameters": {
"InstanceIds.$": "States.Array($.Helper.InstanceId)"
},
"ResultPath": "$.HelperState",
"ResultSelector": {
"State.$": "$.Reservations[0].Instances[0].State.Name"
},
"Next": "IsHelperRunning"
},
"IsHelperRunning": {
"Type": "Choice",
"Choices": [
{
"Variable": "$.HelperState.State",
"StringEquals": "running",
"Next": "WaitForSSMAgent"
}
],
"Default": "WaitForHelperRunning"
},
"WaitForSSMAgent": {
"Type": "Wait",
"Seconds": 60,
"Comment": "Wait for Windows to boot and SSM agent to register",
"Next": "AttachVolumeToHelper"
},
"AttachVolumeToHelper": {
"Type": "Task",
"Resource": "arn:aws:states:::aws-sdk:ec2:attachVolume",
"Parameters": {
"VolumeId.$": "$.InstanceDetails.VolumeId",
"InstanceId.$": "$.Helper.InstanceId",
"Device": "xvdf"
},
"ResultPath": null,
"Next": "WaitForVolumeAttached"
},
"WaitForVolumeAttached": {
"Type": "Wait",
"Seconds": 10,
"Next": "CheckVolumeAttached"
},
"CheckVolumeAttached": {
"Type": "Task",
"Resource": "arn:aws:states:::aws-sdk:ec2:describeVolumes",
"Parameters": {
"VolumeIds.$": "States.Array($.InstanceDetails.VolumeId)"
},
"ResultPath": "$.VolumeAttachState",
"ResultSelector": {
"State.$": "$.Volumes[0].Attachments[0].State"
},
"Next": "IsVolumeAttached"
},
"IsVolumeAttached": {
"Type": "Choice",
"Choices": [
{
"Variable": "$.VolumeAttachState.State",
"StringEquals": "attached",
"Next": "RunFixScript"
}
],
"Default": "WaitForVolumeAttached"
},
"RunFixScript": {
"Type": "Task",
"Resource": "arn:aws:states:::aws-sdk:ssm:sendCommand",
"Parameters": {
"InstanceIds.$": "States.Array($.Helper.InstanceId)",
"DocumentName": "AWS-RunPowerShellScript",
"Parameters": {
"commands.$": "States.Array(States.Format('$script = [System.Text.Encoding]::ASCII.GetString([System.Convert]::FromBase64String(\\'{}\\'));Invoke-Expression $script', $.FixScript))"
},
"TimeoutSeconds": 600
},
"ResultPath": "$.CommandResult",
"ResultSelector": {
"CommandId.$": "$.Command.CommandId"
},
"Next": "WaitForScriptExecution"
},
"WaitForScriptExecution": {
"Type": "Wait",
"Seconds": 30,
"Next": "CheckScriptStatus"
},
"CheckScriptStatus": {
"Type": "Task",
"Resource": "arn:aws:states:::aws-sdk:ssm:getCommandInvocation",
"Parameters": {
"CommandId.$": "$.CommandResult.CommandId",
"InstanceId.$": "$.Helper.InstanceId"
},
"ResultPath": "$.ScriptStatus",
"ResultSelector": {
"Status.$": "$.Status",
"Output.$": "$.StandardOutputContent",
"Error.$": "$.StandardErrorContent"
},
"Next": "IsScriptComplete"
},
"IsScriptComplete": {
"Type": "Choice",
"Choices": [
{
"Variable": "$.ScriptStatus.Status",
"StringEquals": "Success",
"Next": "DetachVolumeFromHelper"
},
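{
"Or": [
{ "Variable": "$.ScriptStatus.Status", "StringEquals": "Failed" },
{ "Variable": "$.ScriptStatus.Status", "StringEquals": "Cancelled" },
{ "Variable": "$.ScriptStatus.Status", "StringEquals": "TimedOut" }
],
"Next": "ScriptFailed"
}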
],
"Default": "WaitForScriptExecution"
},
"ScriptFailed": {
"Type": "Task",
"Resource": "arn:aws:states:::aws-sdk:ec2:detachVolume",
"Parameters": {
"VolumeId.$": "$.InstanceDetails.VolumeId",
"InstanceId.$": "$.Helper.InstanceId"
},
"ResultPath": null,
"Next": "WaitForFailedDetach"
},
"WaitForFailedDetach": {
"Type": "Wait",
"Seconds": 15,
"Next": "ReattachAfterFailure"
},
"ReattachAfterFailure": {
"Type": "Task",
"Resource": "arn:aws:states:::aws-sdk:ec2:attachVolume",
"Parameters": {
"VolumeId.$": "$.InstanceDetails.VolumeId",
"InstanceId.$": "$.InstanceId",
"Device.$": "$.InstanceDetails.RootDeviceName"
},
"ResultPath": null,
"Next": "TerminateHelperAfterFailure"
},
"TerminateHelperAfterFailure": {
"Type": "Task",
"Resource": "arn:aws:states:::aws-sdk:ec2:terminateInstances",
"Parameters": {
"InstanceIds.$": "States.Array($.Helper.InstanceId)"
},
"ResultPath": null,
"Next": "WorkflowFailed"
},
"WorkflowFailed": {
"Type": "Fail",
"Error": "ScriptExecutionFailed",
"CausePath": "States.Format('Fix script failed. Output: {} Error: {}', $.ScriptStatus.Output, $.ScriptStatus.Error)"
},
"DetachVolumeFromHelper": {
"Type": "Task",
"Resource": "arn:aws:states:::aws-sdk:ec2:detachVolume",
"Parameters": {
"VolumeId.$": "$.InstanceDetails.VolumeId",
"InstanceId.$": "$.Helper.InstanceId"
},
"ResultPath": null,
"Next": "WaitForHelperDetach"
},
"WaitForHelperDetach": {
"Type": "Wait",
"Seconds": 10,
"Next": "CheckHelperDetach"
},
"CheckHelperDetach": {
"Type": "Task",
"Resource": "arn:aws:states:::aws-sdk:ec2:describeVolumes",
"Parameters": {
"VolumeIds.$": "States.Array($.InstanceDetails.VolumeId)"
},
"ResultPath": "$.HelperDetachState",
"ResultSelector": {
"State.$": "$.Volumes[0].State"
},
"Next": "IsHelperDetachComplete"
},
"IsHelperDetachComplete": {
"Type": "Choice",
"Choices": [
{
"Variable": "$.HelperDetachState.State",
"StringEquals": "available",
"Next": "ReattachToOriginal"
}
],
"Default": "WaitForHelperDetach"
},
"ReattachToOriginal": {
"Type": "Task",
"Resource": "arn:aws:states:::aws-sdk:ec2:attachVolume",
"Parameters": {
"VolumeId.$": "$.InstanceDetails.VolumeId",
"InstanceId.$": "$.InstanceId",
"Device.$": "$.InstanceDetails.RootDeviceName"
},
"ResultPath": null,
"Next": "WaitForReattach"
},
"WaitForReattach": {
"Type": "Wait",
"Seconds": 10,
"Next": "CheckReattach"
},
"CheckReattach": {
"Type": "Task",
"Resource": "arn:aws:states:::aws-sdk:ec2:describeVolumes",
"Parameters": {
"VolumeIds.$": "States.Array($.InstanceDetails.VolumeId)"
},
"ResultPath": "$.ReattachState",
"ResultSelector": {
"State.$": "$.Volumes[0].Attachments[0].State"
},
"Next": "IsReattachComplete"
},
"IsReattachComplete": {
"Type": "Choice",
"Choices": [
{
"Variable": "$.ReattachState.State",
"StringEquals": "attached",
"Next": "StartOriginalInstance"
}
],
"Default": "WaitForReattach"
},
"StartOriginalInstance": {
"Type": "Task",
"Resource": "arn:aws:states:::aws-sdk:ec2:startInstances",
"Parameters": {
"InstanceIds.$": "States.Array($.InstanceId)"
},
"ResultPath": null,
"Next": "WaitForOriginalRunning"
},
"WaitForOriginalRunning": {
"Type": "Wait",
"Seconds": 15,
"Next": "CheckOriginalRunning"
},
"CheckOriginalRunning": {
"Type": "Task",
"Resource": "arn:aws:states:::aws-sdk:ec2:describeInstances",
"Parameters": {
"InstanceIds.$": "States.Array($.InstanceId)"
},
"ResultPath": "$.OriginalState",
"ResultSelector": {
"State.$": "$.Reservations[0].Instances[0].State.Name"
},
"Next": "IsOriginalRunning"
},
"IsOriginalRunning": {
"Type": "Choice",
"Choices": [
{
"Variable": "$.OriginalState.State",
"StringEquals": "running",
"Next": "TerminateHelper"
}
],
"Default": "WaitForOriginalRunning"
},
"TerminateHelper": {
"Type": "Task",
"Resource": "arn:aws:states:::aws-sdk:ec2:terminateInstances",
"Parameters": {
"InstanceIds.$": "States.Array($.Helper.InstanceId)"
},
"ResultPath": null,
"Next": "RescueComplete"
},
"RescueComplete": {
"Type": "Succeed"
}
}
}
How the Fix Script Works
The fix script runs on the helper instance via SSM Run Command. The encrypted volume from the broken instance is attached as a secondary drive. The script needs to:
- Find the secondary drive (the offline volume)
- Bring it online and assign a drive letter
- Load the registry hive from the offline volume
- Make the fix
- Unload the hive
- Take the disk offline
Here’s a complete example script that re-enables RDP:
# fix-rdp-encrypted.ps1
# Runs on the helper instance against the attached encrypted volume
# Step 1: Find the offline disk (it's the secondary disk, not the boot disk)
$offlineDisk = Get-Disk | Where-Object { $_.OperationalStatus -eq 'Offline' } | Select-Object -First 1
if (-not $offlineDisk) {
    Write-Error "No offline disk found. The volume may not be attached yet."
    exit 1
}
# Step 2: Bring the disk online and assign a drive letter
Set-Disk -Number $offlineDisk.Number -IsOffline $false
Set-Disk -Number $offlineDisk.Number -IsReadOnly $false
# Find the partition with the Windows directory (assumed to be the largest non-system partition over 10 GB)
$partition = Get-Partition -DiskNumber $offlineDisk.Number |
    Where-Object { $_.Type -ne 'System' -and $_.Size -gt 10GB } |
    Sort-Object -Property Size -Descending |
    Select-Object -First 1
if (-not $partition) {
    Write-Error "No Windows partition found on disk $($offlineDisk.Number)."
    exit 1
}
if (-not $partition.DriveLetter) {
    $partition | Set-Partition -NewDriveLetter D
}
# Re-read the partition so a newly assigned drive letter is picked up
$driveLetter = (Get-Partition -DiskNumber $offlineDisk.Number -PartitionNumber $partition.PartitionNumber).DriveLetter
Write-Host "Offline volume mounted at ${driveLetter}:\"
# Step 3: Verify Windows directory exists
$registryDir = "${driveLetter}:\Windows\System32\config"
if (-not (Test-Path $registryDir)) {
    Write-Error "Windows registry directory not found at $registryDir"
    exit 1
}
# Step 4: Load the SYSTEM registry hive and apply the fix
$failed = $false
try {
    reg load "HKLM\OfflineSystem" "$registryDir\SYSTEM"
    # reg.exe reports failure through its exit code, not an exception
    if ($LASTEXITCODE -ne 0) { throw "reg load failed with exit code $LASTEXITCODE" }
    # Set TermService (RDP) to Automatic start
    reg add "HKLM\OfflineSystem\ControlSet001\Services\TermService" /v Start /t REG_DWORD /d 2 /f
    if ($LASTEXITCODE -ne 0) { throw "reg add failed with exit code $LASTEXITCODE" }
    Write-Host "SUCCESS: TermService start type set to Automatic"
}
catch {
    Write-Error $_
    $failed = $true
}
finally {
    # Always unload the hive to prevent corruption
    [GC]::Collect()
    Start-Sleep -Seconds 2
    reg unload "HKLM\OfflineSystem"
    if ($LASTEXITCODE -eq 0) { Write-Host "Registry hive unloaded successfully" }
}
# Step 5: Take the disk offline before detaching
Set-Disk -Number $offlineDisk.Number -IsOffline $true
Write-Host "Disk taken offline. Ready for detach."
# Report failure to SSM so the state machine takes the ScriptFailed path
if ($failed) { exit 1 }
Disable Smart Card Logon (scforceoption)
Swap the registry fix section for this:
# Inside the try block, replace the reg load and reg add lines with:
reg load "HKLM\OfflineSoftware" "$registryDir\SOFTWARE"
reg add "HKLM\OfflineSoftware\Microsoft\Windows\CurrentVersion\Policies\System" /v scforceoption /t REG_DWORD /d 0 /f
Write-Host "SUCCESS: scforceoption disabled"
# In the finally block, unload OfflineSoftware instead of OfflineSystem
Disable Windows Firewall
# Inside the try block:
reg load "HKLM\OfflineSystem" "$registryDir\SYSTEM"
reg add "HKLM\OfflineSystem\ControlSet001\Services\SharedAccess\Parameters\FirewallPolicy\DomainProfile" /v EnableFirewall /t REG_DWORD /d 0 /f
reg add "HKLM\OfflineSystem\ControlSet001\Services\SharedAccess\Parameters\FirewallPolicy\StandardProfile" /v EnableFirewall /t REG_DWORD /d 0 /f
reg add "HKLM\OfflineSystem\ControlSet001\Services\SharedAccess\Parameters\FirewallPolicy\PublicProfile" /v EnableFirewall /t REG_DWORD /d 0 /f
Write-Host "SUCCESS: All firewall profiles disabled"
Base64 Encoding the Script
The state machine expects a base64-encoded script in the FixScript input parameter:
# Encode your script for the Step Functions input
$scriptContent = Get-Content -Path '.\fix-rdp-encrypted.ps1' -Raw
$base64 = [System.Convert]::ToBase64String(
[System.Text.Encoding]::ASCII.GetBytes($scriptContent)
)
Write-Host $base64
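If you prepare the input from a Linux or macOS workstation, the same encoding can be done in the shell. The sketch below uses a hypothetical `encode_fix_script` function. Because the state machine decodes the payload as ASCII, the function rejects files containing non-ASCII bytes, which also catches a UTF-8 byte order mark.

```shell
#!/usr/bin/env bash
# Hypothetical helper: base64-encode a fix script for the FixScript input.
# The state machine decodes the payload as ASCII, so the function refuses
# files that contain non-ASCII bytes (including a UTF-8 byte order mark).
encode_fix_script() {
  local file="$1" non_ascii
  [ -r "$file" ] || { echo "cannot read $file" >&2; return 1; }
  # Delete every ASCII byte; whatever remains is outside the ASCII range.
  non_ascii=$(LC_ALL=C tr -d '\000-\177' < "$file" | wc -c)
  if [ "$non_ascii" -gt 0 ]; then
    echo "$file contains non-ASCII bytes" >&2
    return 1
  fi
  # Strip line wraps so the result is a single-line string for JSON.
  base64 < "$file" | tr -d '\n'
}
```

For example, `encode_fix_script ./fix-rdp-encrypted.ps1` prints one line of base64 that can be used as the FixScript value.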
Deploying with CloudFormation
Here's a CloudFormation template that creates the state machine, execution role, and helper instance profile:
AWSTemplateFormatVersion: '2010-09-09'
Description: EC2 Rescue Workflow for Encrypted EBS Volumes using Step Functions
Parameters:
  StateMachineName:
    Type: String
    Default: EC2RescueEncryptedVolumes
    Description: Name for the Step Functions state machine
Resources:
  # IAM Role for Step Functions execution
  StepFunctionsExecutionRole:
    Type: AWS::IAM::Role
    Properties:
      RoleName: !Sub '${StateMachineName}-ExecutionRole'
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: states.amazonaws.com
            Action: sts:AssumeRole
      Policies:
        - PolicyName: EC2RescuePolicy
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Sid: EC2Permissions
                Effect: Allow
                Action:
                  - ec2:DescribeInstances
                  - ec2:DescribeVolumes
                  - ec2:DescribeSnapshots
                  - ec2:StopInstances
                  - ec2:StartInstances
                  - ec2:DetachVolume
                  - ec2:AttachVolume
                  - ec2:CreateSnapshot
                  - ec2:RunInstances
                  - ec2:TerminateInstances
                  - ec2:CreateTags
                Resource: '*'
              - Sid: SSMRunCommand
                Effect: Allow
                Action:
                  - ssm:SendCommand
                  - ssm:GetCommandInvocation
                  - ssm:DescribeInstanceInformation
                Resource: '*'
              - Sid: KMSForEncryptedVolumes
                Effect: Allow
                Action:
                  - kms:Decrypt
                  - kms:DescribeKey
                  - kms:CreateGrant
                  - kms:GenerateDataKeyWithoutPlaintext
                  - kms:ReEncryptFrom
                  - kms:ReEncryptTo
                Resource: '*'
              - Sid: PassRoleForHelper
                Effect: Allow
                Action: iam:PassRole
                Resource: !GetAtt HelperInstanceRole.Arn
  # IAM Role for the helper EC2 instance (SSM agent)
  HelperInstanceRole:
    Type: AWS::IAM::Role
    Properties:
      RoleName: !Sub '${StateMachineName}-HelperRole'
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: ec2.amazonaws.com
            Action: sts:AssumeRole
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
  HelperInstanceProfile:
    Type: AWS::IAM::InstanceProfile
    Properties:
      InstanceProfileName: !Sub '${StateMachineName}-HelperProfile'
      Roles:
        - !Ref HelperInstanceRole
  # Step Functions State Machine
  EC2RescueStateMachine:
    Type: AWS::StepFunctions::StateMachine
    Properties:
      StateMachineName: !Ref StateMachineName
      RoleArn: !GetAtt StepFunctionsExecutionRole.Arn
      Definition:
        Comment: EC2 Rescue Workflow for Encrypted EBS Volumes
        StartAt: GetInstanceDetails
        States:
          GetInstanceDetails:
            Type: Task
            Resource: arn:aws:states:::aws-sdk:ec2:describeInstances
            Parameters:
              InstanceIds.$: States.Array($.InstanceId)
            ResultPath: $.InstanceDetails
            ResultSelector:
              AvailabilityZone.$: $.Reservations[0].Instances[0].Placement.AvailabilityZone
              RootDeviceName.$: $.Reservations[0].Instances[0].RootDeviceName
              VolumeId.$: $.Reservations[0].Instances[0].BlockDeviceMappings[0].Ebs.VolumeId
            Next: CreateBackupSnapshot
          CreateBackupSnapshot:
            Type: Task
            Resource: arn:aws:states:::aws-sdk:ec2:createSnapshot
            Parameters:
              VolumeId.$: $.InstanceDetails.VolumeId
              Description.$: "States.Format('EC2Rescue backup - {}', $.InstanceId)"
              TagSpecifications:
                - ResourceType: snapshot
                  Tags:
                    - Key: Name
                      Value.$: "States.Format('EC2Rescue-Backup-{}', $.InstanceId)"
                    - Key: CreatedBy
                      Value: StepFunctions-EC2Rescue
            ResultPath: $.Snapshot
            ResultSelector:
              SnapshotId.$: $.SnapshotId
            Next: StopInstance
          StopInstance:
            Type: Task
            Resource: arn:aws:states:::aws-sdk:ec2:stopInstances
            Parameters:
              InstanceIds.$: States.Array($.InstanceId)
            # CloudFormation rejects null values in templates, so results that
            # the JSON definition discards with "ResultPath": null are written
            # to an unused scratch path here instead.
            ResultPath: $.StopResult
            Next: WaitForInstanceStopped
          WaitForInstanceStopped:
            Type: Wait
            Seconds: 15
            Next: CheckInstanceStopped
          CheckInstanceStopped:
            Type: Task
            Resource: arn:aws:states:::aws-sdk:ec2:describeInstances
            Parameters:
              InstanceIds.$: States.Array($.InstanceId)
            ResultPath: $.InstanceState
            ResultSelector:
              State.$: $.Reservations[0].Instances[0].State.Name
            Next: IsInstanceStopped
          IsInstanceStopped:
            Type: Choice
            Choices:
              - Variable: $.InstanceState.State
                StringEquals: stopped
                Next: DetachRootVolume
            Default: WaitForInstanceStopped
          DetachRootVolume:
            Type: Task
            Resource: arn:aws:states:::aws-sdk:ec2:detachVolume
            Parameters:
              VolumeId.$: $.InstanceDetails.VolumeId
              InstanceId.$: $.InstanceId
            ResultPath: $.DetachRootResult
            Next: WaitForVolumeDetached
          WaitForVolumeDetached:
            Type: Wait
            Seconds: 10
            Next: CheckVolumeDetached
          CheckVolumeDetached:
            Type: Task
            Resource: arn:aws:states:::aws-sdk:ec2:describeVolumes
            Parameters:
              VolumeIds.$: States.Array($.InstanceDetails.VolumeId)
            ResultPath: $.VolumeState
            ResultSelector:
              State.$: $.Volumes[0].State
            Next: IsVolumeDetached
          IsVolumeDetached:
            Type: Choice
            Choices:
              - Variable: $.VolumeState.State
                StringEquals: available
                Next: LaunchHelperInstance
            Default: WaitForVolumeDetached
          LaunchHelperInstance:
            Type: Task
            Resource: arn:aws:states:::aws-sdk:ec2:runInstances
            Parameters:
              ImageId.$: $.HelperAmiId
              InstanceType.$: $.HelperInstanceType
              MinCount: 1
              MaxCount: 1
              SubnetId.$: $.HelperSubnetId
              SecurityGroupIds.$: States.Array($.HelperSecurityGroupId)
              IamInstanceProfile:
                Arn.$: $.HelperInstanceProfileArn
              TagSpecifications:
                - ResourceType: instance
                  Tags:
                    - Key: Name
                      Value.$: "States.Format('EC2Rescue-Helper-{}', $.InstanceId)"
                    - Key: CreatedBy
                      Value: StepFunctions-EC2Rescue
            ResultPath: $.Helper
            ResultSelector:
              InstanceId.$: $.Instances[0].InstanceId
            Next: WaitForHelperRunning
          WaitForHelperRunning:
            Type: Wait
            Seconds: 30
            Next: CheckHelperRunning
          CheckHelperRunning:
            Type: Task
            Resource: arn:aws:states:::aws-sdk:ec2:describeInstances
            Parameters:
              InstanceIds.$: States.Array($.Helper.InstanceId)
            ResultPath: $.HelperState
            ResultSelector:
              State.$: $.Reservations[0].Instances[0].State.Name
            Next: IsHelperRunning
          IsHelperRunning:
            Type: Choice
            Choices:
              - Variable: $.HelperState.State
                StringEquals: running
                Next: WaitForSSMAgent
            Default: WaitForHelperRunning
          WaitForSSMAgent:
            Type: Wait
            Seconds: 60
            Comment: Wait for Windows to boot and SSM agent to register
            Next: AttachVolumeToHelper
          AttachVolumeToHelper:
            Type: Task
            Resource: arn:aws:states:::aws-sdk:ec2:attachVolume
            Parameters:
              VolumeId.$: $.InstanceDetails.VolumeId
              InstanceId.$: $.Helper.InstanceId
              Device: xvdf
            ResultPath: $.AttachHelperResult
            Next: WaitForVolumeAttached
          WaitForVolumeAttached:
            Type: Wait
            Seconds: 10
            Next: CheckVolumeAttached
          CheckVolumeAttached:
            Type: Task
            Resource: arn:aws:states:::aws-sdk:ec2:describeVolumes
            Parameters:
              VolumeIds.$: States.Array($.InstanceDetails.VolumeId)
            ResultPath: $.VolumeAttachState
            ResultSelector:
              State.$: $.Volumes[0].Attachments[0].State
            Next: IsVolumeAttached
          IsVolumeAttached:
            Type: Choice
            Choices:
              - Variable: $.VolumeAttachState.State
                StringEquals: attached
                Next: RunFixScript
            Default: WaitForVolumeAttached
          RunFixScript:
            Type: Task
            Resource: arn:aws:states:::aws-sdk:ssm:sendCommand
            Parameters:
              InstanceIds.$: States.Array($.Helper.InstanceId)
              DocumentName: AWS-RunPowerShellScript
              Parameters:
                commands.$: "States.Array(States.Format('$script = [System.Text.Encoding]::ASCII.GetString([System.Convert]::FromBase64String(\\'{}\\')); Invoke-Expression $script', $.FixScript))"
              TimeoutSeconds: 600
            ResultPath: $.CommandResult
            ResultSelector:
              CommandId.$: $.Command.CommandId
            Next: WaitForScriptExecution
          WaitForScriptExecution:
            Type: Wait
            Seconds: 30
            Next: CheckScriptStatus
          CheckScriptStatus:
            Type: Task
            Resource: arn:aws:states:::aws-sdk:ssm:getCommandInvocation
            Parameters:
              CommandId.$: $.CommandResult.CommandId
              InstanceId.$: $.Helper.InstanceId
            ResultPath: $.ScriptStatus
            ResultSelector:
              Status.$: $.Status
              Output.$: $.StandardOutputContent
              Error.$: $.StandardErrorContent
            Next: IsScriptComplete
          IsScriptComplete:
            Type: Choice
            Choices:
              - Variable: $.ScriptStatus.Status
                StringEquals: Success
                Next: DetachVolumeFromHelper
              # Treat every terminal non-success status as a failure so the
              # polling loop cannot spin forever
              - Or:
                  - Variable: $.ScriptStatus.Status
                    StringEquals: Failed
                  - Variable: $.ScriptStatus.Status
                    StringEquals: Cancelled
                  - Variable: $.ScriptStatus.Status
                    StringEquals: TimedOut
                Next: ScriptFailed
            Default: WaitForScriptExecution
          ScriptFailed:
            Type: Task
            Resource: arn:aws:states:::aws-sdk:ec2:detachVolume
            Parameters:
              VolumeId.$: $.InstanceDetails.VolumeId
              InstanceId.$: $.Helper.InstanceId
            ResultPath: $.FailedDetachResult
            Next: WaitForFailedDetach
          WaitForFailedDetach:
            Type: Wait
            Seconds: 15
            Next: ReattachAfterFailure
          ReattachAfterFailure:
            Type: Task
            Resource: arn:aws:states:::aws-sdk:ec2:attachVolume
            Parameters:
              VolumeId.$: $.InstanceDetails.VolumeId
              InstanceId.$: $.InstanceId
              Device.$: $.InstanceDetails.RootDeviceName
            ResultPath: $.FailedReattachResult
            Next: TerminateHelperAfterFailure
          TerminateHelperAfterFailure:
            Type: Task
            Resource: arn:aws:states:::aws-sdk:ec2:terminateInstances
            Parameters:
              InstanceIds.$: States.Array($.Helper.InstanceId)
            ResultPath: $.FailedTerminateResult
            Next: WorkflowFailed
          WorkflowFailed:
            Type: Fail
            Error: ScriptExecutionFailed
            Cause: Fix script failed - check execution output for details
          DetachVolumeFromHelper:
            Type: Task
            Resource: arn:aws:states:::aws-sdk:ec2:detachVolume
            Parameters:
              VolumeId.$: $.InstanceDetails.VolumeId
              InstanceId.$: $.Helper.InstanceId
            ResultPath: $.HelperDetachResult
            Next: WaitForHelperDetach
          WaitForHelperDetach:
            Type: Wait
            Seconds: 10
            Next: CheckHelperDetach
          CheckHelperDetach:
            Type: Task
            Resource: arn:aws:states:::aws-sdk:ec2:describeVolumes
            Parameters:
              VolumeIds.$: States.Array($.InstanceDetails.VolumeId)
            ResultPath: $.HelperDetachState
            ResultSelector:
              State.$: $.Volumes[0].State
            Next: IsHelperDetachComplete
          IsHelperDetachComplete:
            Type: Choice
            Choices:
              - Variable: $.HelperDetachState.State
                StringEquals: available
                Next: ReattachToOriginal
            Default: WaitForHelperDetach
          ReattachToOriginal:
            Type: Task
            Resource: arn:aws:states:::aws-sdk:ec2:attachVolume
            Parameters:
              VolumeId.$: $.InstanceDetails.VolumeId
              InstanceId.$: $.InstanceId
              Device.$: $.InstanceDetails.RootDeviceName
            ResultPath: $.ReattachResult
            Next: WaitForReattach
          WaitForReattach:
            Type: Wait
            Seconds: 10
            Next: CheckReattach
          CheckReattach:
            Type: Task
            Resource: arn:aws:states:::aws-sdk:ec2:describeVolumes
            Parameters:
              VolumeIds.$: States.Array($.InstanceDetails.VolumeId)
            ResultPath: $.ReattachState
            ResultSelector:
              State.$: $.Volumes[0].Attachments[0].State
            Next: IsReattachComplete
          IsReattachComplete:
            Type: Choice
            Choices:
              - Variable: $.ReattachState.State
                StringEquals: attached
                Next: StartOriginalInstance
            Default: WaitForReattach
          StartOriginalInstance:
            Type: Task
            Resource: arn:aws:states:::aws-sdk:ec2:startInstances
            Parameters:
              InstanceIds.$: States.Array($.InstanceId)
            ResultPath: $.StartResult
            Next: WaitForOriginalRunning
          WaitForOriginalRunning:
            Type: Wait
            Seconds: 15
            Next: CheckOriginalRunning
          CheckOriginalRunning:
            Type: Task
            Resource: arn:aws:states:::aws-sdk:ec2:describeInstances
            Parameters:
              InstanceIds.$: States.Array($.InstanceId)
            ResultPath: $.OriginalState
            ResultSelector:
              State.$: $.Reservations[0].Instances[0].State.Name
            Next: IsOriginalRunning
          IsOriginalRunning:
            Type: Choice
            Choices:
              - Variable: $.OriginalState.State
                StringEquals: running
                Next: TerminateHelper
            Default: WaitForOriginalRunning
          TerminateHelper:
            Type: Task
            Resource: arn:aws:states:::aws-sdk:ec2:terminateInstances
            Parameters:
              InstanceIds.$: States.Array($.Helper.InstanceId)
            ResultPath: $.TerminateResult
            Next: RescueComplete
          RescueComplete:
            Type: Succeed
Outputs:
  StateMachineArn:
    Value: !Ref EC2RescueStateMachine
    Description: ARN of the EC2 Rescue state machine
  HelperInstanceProfileArn:
    Value: !GetAtt HelperInstanceProfile.Arn
    Description: ARN of the helper instance profile (use in state machine input)
Running the Workflow
Step 1: Deploy the CloudFormation Stack
aws cloudformation deploy \
--template-file ec2-rescue-step-functions.yaml \
--stack-name ec2-rescue-encrypted \
--capabilities CAPABILITY_NAMED_IAM
Step 2: Get the Helper Instance Profile ARN
aws cloudformation describe-stacks \
--stack-name ec2-rescue-encrypted \
--query "Stacks[0].Outputs[?OutputKey=='HelperInstanceProfileArn'].OutputValue" \
--output text
Step 3: Prepare Your Fix Script
Save your PowerShell fix script (like the RDP fix above) and base64 encode it:
$base64 = [System.Convert]::ToBase64String(
[System.Text.Encoding]::ASCII.GetBytes(
(Get-Content -Path '.\fix-rdp-encrypted.ps1' -Raw)
)
)
Write-Host $base64
Step 4: Start the Execution
You'll need the instance ID of the broken instance, a Windows AMI ID in the same region, and network details for the helper instance (same AZ as the target):
aws stepfunctions start-execution \
--state-machine-arn "arn:aws:states:us-east-1:123456789012:stateMachine:EC2RescueEncryptedVolumes" \
--input '{
"InstanceId": "i-0abc123def456",
"HelperAmiId": "ami-0abcdef1234567890",
"HelperInstanceType": "t3.medium",
"HelperSubnetId": "subnet-0abc123",
"HelperSecurityGroupId": "sg-0abc123",
"HelperInstanceProfileArn": "arn:aws:iam::123456789012:instance-profile/EC2RescueEncryptedVolumes-HelperProfile",
"FixScript": "YOUR_BASE64_ENCODED_SCRIPT_HERE"
}'
Step 5: Monitor in the Console
Open the Step Functions console and click on the running execution. You'll see each state light up in green as it completes. If a state is in a polling loop (waiting for an instance to stop, a volume to detach, etc.), you'll see it cycle between the Wait and Check states.
The entire workflow typically completes in 8-12 minutes, depending on how long Windows takes to boot on the helper instance.
KMS Considerations
AWS-Managed Keys (aws/ebs)
If your encrypted volumes use the default aws/ebs AWS-managed key, no additional KMS configuration is needed beyond the IAM policy above. The AWS-managed key's default policy allows any principal in the account to use it for EBS operations. The Step Functions role and the helper instance can both access the volume transparently.
Customer-Managed KMS Keys
If the volume is encrypted with a customer-managed KMS key, you need to ensure the key policy allows the Step Functions execution role to create grants. Add this statement to your KMS key policy:
{
"Sid": "AllowStepFunctionsEC2Rescue",
"Effect": "Allow",
"Principal": {
"AWS": "arn:aws:iam::123456789012:role/EC2RescueEncryptedVolumes-ExecutionRole"
},
"Action": [
"kms:Decrypt",
"kms:DescribeKey",
"kms:CreateGrant",
"kms:GenerateDataKeyWithoutPlaintext"
],
"Resource": "*"
}
Cross-Account KMS Keys
If the KMS key lives in a different account (e.g., a centralized key management account), both the key policy in the source account and the IAM policy in the local account need to grant access. The key policy in the source account must trust the local account's Step Functions role, and the local IAM policy must reference the cross-account key ARN.
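As a rough sketch of the local-account half, the execution role's IAM policy could carry statements like the ones below. The key ARN, the account ID 999999999999, and the Sid values are placeholders, and the action list simply mirrors the key policy statement shown for customer-managed keys. The grant condition is an optional tightening, not something the template above is known to include.

```yaml
# Hypothetical statements for the Step Functions execution role's policy
# (local account). The key ARN is a placeholder for the key that lives in
# the central key-management account.
- Sid: AllowCrossAccountRescueKey
  Effect: Allow
  Action:
    - kms:Decrypt
    - kms:DescribeKey
    - kms:GenerateDataKeyWithoutPlaintext
  Resource: arn:aws:kms:us-east-1:999999999999:key/EXAMPLE-KEY-ID
# CreateGrant is split out so it can be limited to grants that AWS services
# (here, EC2/EBS) create on the role's behalf.
- Sid: AllowCrossAccountRescueKeyGrants
  Effect: Allow
  Action:
    - kms:CreateGrant
  Resource: arn:aws:kms:us-east-1:999999999999:key/EXAMPLE-KEY-ID
  Condition:
    Bool:
      kms:GrantIsForAWSResource: 'true'
```

The key policy in the key-owning account still needs a matching statement that names this role as the principal. Neither side is sufficient on its own for cross-account access.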
Error Handling and Rollback
The state machine includes built-in compensation logic:
- Script failure: If the fix script fails on the helper, the workflow automatically detaches the volume, reattaches it to the original instance, terminates the helper, and transitions to a Fail state with the script's error output.
- Backup snapshot: The very first step creates a snapshot before any changes. If something goes catastrophically wrong, you can always create a new volume from this snapshot.
- Tagged resources: The helper instance and snapshot are tagged with CreatedBy: StepFunctions-EC2Rescue so you can easily find and clean up any orphaned resources.
For production use, consider adding:
- SNS notifications on success/failure (add a Publish task before the terminal states)
- Timeout on the overall execution (set TimeoutSeconds on the state machine definition)
- Retry policies on individual API calls (add Retry blocks for throttling and transient errors)
- CloudWatch alarms on failed executions
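Three of these additions might look roughly like the fragment below. It is a sketch, not part of the template above: the NotifySuccess state and the NotificationTopicArn input field are hypothetical names, the 30-minute timeout is an arbitrary choice, and States.TaskFailed is used as a deliberately broad retry match.

```yaml
# Sketch only: fragments of the state machine definition, not a full template.
# NotifySuccess and $.NotificationTopicArn are hypothetical names.

# Top level of the definition: cap the whole execution at 30 minutes.
TimeoutSeconds: 1800

States:
  TerminateHelper:
    Type: Task
    Resource: arn:aws:states:::aws-sdk:ec2:terminateInstances
    Parameters:
      InstanceIds.$: States.Array($.Helper.InstanceId)
    # Retry failed calls with exponential backoff (waits of 5s, 10s, 20s).
    # States.TaskFailed is a broad match; a narrower error name could be
    # substituted once you know which errors you actually see.
    Retry:
      - ErrorEquals:
          - States.TaskFailed
        IntervalSeconds: 5
        MaxAttempts: 3
        BackoffRate: 2
    ResultPath: null
    Next: NotifySuccess

  # Publish to an SNS topic whose ARN is passed in the execution input.
  NotifySuccess:
    Type: Task
    Resource: arn:aws:states:::sns:publish
    Parameters:
      TopicArn.$: $.NotificationTopicArn
      Message.$: "States.Format('EC2 rescue completed for {}', $.InstanceId)"
    ResultPath: null
    Next: RescueComplete
```

If you add the notification state, the execution role also needs sns:Publish on the topic. A matching notification on the failure path would sit in front of the Fail state in the same way.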
Comparison: SSM Runbooks vs Step Functions
| Feature | SSM Runbooks | Step Functions Workflow |
|---|---|---|
| Encrypted volumes | Not supported | Fully supported (AWS-managed and customer-managed keys) |
| Setup effort | Zero (AWS-provided) | Deploy CloudFormation stack + IAM roles |
| Customization | Limited to base64 script parameter | Full control over every step |
| Visual monitoring | Step-by-step in SSM console | Visual workflow graph in Step Functions console |
| Error handling | Basic (fails at encrypted check) | Custom rollback and compensation logic |
| Cost | Free (SSM is free, pay for helper instance) | ~$0.025 per 1,000 state transitions (Standard workflow pricing) + helper instance time |
| Lambda required | Yes (internally) | No (native SDK integrations) |
| Backup AMI | Automatic | Snapshot (you can add AMI creation if needed) |
| When to use | Unencrypted volumes, quick fix | Encrypted volumes, custom workflows, compliance requirements |
Testing the Workflow
Before you need this at 2 AM, test it on a throwaway instance:
- Launch a test instance with an encrypted root volume (enable “Encrypt this volume” in the launch wizard or use account-level default encryption)
- Break something — disable RDP, enable scforcelogon, block the firewall
- Run the workflow with the appropriate fix script
- Verify you can RDP back in after the workflow completes
- Clean up — terminate the test instance, delete the backup snapshot
Keep a library of base64-encoded fix scripts in S3 or Parameter Store. When an incident happens, you just grab the right script and paste it into the Step Functions input — no scrambling to write and encode a script under pressure.
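One way to keep such a library next to the workflow is to define each script as a Parameter Store entry in a CloudFormation template. The sketch below assumes a hypothetical parameter name and logical ID, and leaves the value as the same placeholder used in the execution input.

```yaml
# Hypothetical resource: one parameter per fix script.
Resources:
  RdpFixScriptParameter:
    Type: AWS::SSM::Parameter
    Properties:
      Name: /ec2-rescue/fix-scripts/rdp
      Description: Base64-encoded PowerShell fix script for broken RDP
      Type: String
      Tier: Advanced  # Standard caps values at 4 KB; Advanced allows 8 KB
      Value: YOUR_BASE64_ENCODED_SCRIPT_HERE
```

During an incident, aws ssm get-parameter with --query Parameter.Value --output text returns the encoded script, ready to paste into the FixScript field. Advanced-tier parameters carry a small monthly charge, so short scripts may be better left in the Standard tier.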
Conclusion
The AWS-provided EC2Rescue runbooks are excellent for unencrypted volumes, but as more organizations adopt EBS encryption by default, the gap is real. Building your own rescue workflow with Step Functions gives you:
- Encrypted volume support — The whole reason we're here
- Full KMS integration — Works with AWS-managed keys, customer-managed keys, and cross-account keys
- No Lambda functions — Every API call is a native Step Functions SDK integration
- Visual workflow monitoring — Watch each state execute in real time
- Automated rollback — If the fix script fails, the volume goes back where it came from
- One CloudFormation stack — Deploy once, use whenever you need it
The workflow handles the same tedious volume-swap dance that the SSM runbooks do — it just doesn't bail out when it sees an encrypted volume. Deploy it before you need it, test it on a throwaway instance, and keep your fix scripts ready. Your future 2 AM self will thank you.
For the companion post covering SSM runbooks for unencrypted volumes, see Fixing Broken Windows EC2 Instances with Offline Registry Edits via SSM Automation.