The AWS-provided EC2Rescue runbooks (AWSSupport-ExecuteEC2Rescue and AWSSupport-StartEC2RescueWorkflow) fail immediately on encrypted EBS volumes - including those using AWS-managed keys (aws/ebs). This post walks through building your own rescue workflow using AWS Step Functions that handles encrypted volumes natively, with full KMS integration, no Lambda functions required, and a visual workflow you can monitor in real time.
The Problem
If you've read my previous post on fixing broken Windows EC2 instances with SSM automation, you know that AWS provides excellent runbooks for offline registry repair. There's just one catch buried in the limitations section:
Both AWSSupport-ExecuteEC2Rescue and AWSSupport-StartEC2RescueWorkflow check the volume's Encrypted flag and fail immediately if it's true. This applies to all encryption types - AWS-managed keys (aws/ebs), customer-managed KMS keys, even keys shared from other accounts. The helper instance's IAM role simply doesn't have KMS permissions.
With the push toward encryption-by-default (many organizations enable default EBS encryption at the account level), this limitation affects a growing number of instances. The usual advice is “do a manual volume swap” - a tedious, error-prone, 10+ step process that nobody wants to do at 2 AM during an incident.
What if we could build our own automated rescue workflow that handles encrypted volumes?
Why Step Functions?
AWS Step Functions is the right tool for this job for several reasons:
- Native AWS SDK integrations - Step Functions can call EC2, SSM, and KMS APIs directly without Lambda functions. Every API call in our workflow is a native SDK integration.
- Visual workflow - You can watch each state execute in real time in the console. During a 2 AM incident, this visibility is invaluable.
- Built-in error handling - Retry policies, catch blocks, and compensation logic are first-class citizens. If a step fails, the workflow can clean up after itself.
- IAM role you control - Unlike the AWS-managed runbooks, you define the execution role. You can grant kms:CreateGrant, kms:Decrypt, and kms:DescribeKey - the permissions needed to mount encrypted volumes on the helper instance.
- No compute cost during waits - The workflow uses polling loops to wait for instances and volumes. Unlike a Lambda that would sit idle (and bill you), Step Functions only charges per state transition.
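To make the "native SDK integration" point concrete, here is a minimal sketch of what a single state could look like. The state name DescribeBrokenInstance comes from the flow described later in this post; the `$.Broken` result path and the assumption that the first block device mapping is the root volume are mine, not part of the original definition.

```json
{
  "DescribeBrokenInstance": {
    "Type": "Task",
    "Comment": "Sketch only: assumes BlockDeviceMappings[0] is the root volume and stores results under the hypothetical $.Broken path",
    "Resource": "arn:aws:states:::aws-sdk:ec2:describeInstances",
    "Parameters": {
      "InstanceIds.$": "States.Array($.InstanceId)"
    },
    "ResultSelector": {
      "AvailabilityZone.$": "$.Reservations[0].Instances[0].Placement.AvailabilityZone",
      "RootDeviceName.$": "$.Reservations[0].Instances[0].RootDeviceName",
      "RootVolumeId.$": "$.Reservations[0].Instances[0].BlockDeviceMappings[0].Ebs.VolumeId"
    },
    "ResultPath": "$.Broken",
    "Next": "StopBrokenInstance"
  }
}
```

The `aws-sdk:ec2:describeInstances` resource ARN is what replaces the Lambda function you would otherwise write: Step Functions calls the EC2 API itself, using the execution role's permissions.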
Architecture Overview
The workflow follows the same logical steps as the manual volume swap, fully automated:
- Snapshot - Create a backup snapshot of the broken instance's root volume
- Stop - Stop the broken instance
- Detach - Detach the encrypted root volume
- Launch helper - Launch a temporary Windows instance in the same AZ with KMS permissions
- Attach - Attach the encrypted volume to the helper as a secondary drive
- Fix - Run your repair script on the helper via SSM Run Command
- Detach from helper - Detach the volume from the helper
- Reattach - Reattach the volume to the original instance as the root device
- Start - Start the original instance
- Cleanup - Terminate the helper instance
Each step includes wait loops (polling instance/volume state) and error handling with cleanup.
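A wait loop in ASL is typically three states: a Wait, a Task that describes the resource, and a Choice that either moves on or loops back. The sketch below shows that pattern for the "wait until stopped" step. WaitForStop and DetachRootVolume are names from the flow in this post; CheckStopped, IsStopped, the 15-second interval, and the `$.Poll` path are illustrative assumptions.

```json
{
  "WaitForStop": {
    "Type": "Wait",
    "Comment": "Sketch only: the 15-second interval is an arbitrary choice",
    "Seconds": 15,
    "Next": "CheckStopped"
  },
  "CheckStopped": {
    "Type": "Task",
    "Comment": "Hypothetical state name; stores the instance state under the assumed $.Poll path",
    "Resource": "arn:aws:states:::aws-sdk:ec2:describeInstances",
    "Parameters": {
      "InstanceIds.$": "States.Array($.InstanceId)"
    },
    "ResultSelector": {
      "State.$": "$.Reservations[0].Instances[0].State.Name"
    },
    "ResultPath": "$.Poll",
    "Next": "IsStopped"
  },
  "IsStopped": {
    "Type": "Choice",
    "Comment": "Hypothetical state name; loops back to the Wait until EC2 reports 'stopped'",
    "Choices": [
      {
        "Variable": "$.Poll.State",
        "StringEquals": "stopped",
        "Next": "DetachRootVolume"
      }
    ],
    "Default": "WaitForStop"
  }
}
```

In a real definition you would probably also cap the number of iterations (or rely on an overall execution timeout) so a stuck instance cannot keep the loop spinning forever.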
Prerequisites
- An AWS account with Step Functions, EC2, SSM, and KMS access
- A Windows AMI ID for the helper instance (e.g., latest Windows Server 2022 Base)
- A VPC subnet in the same Availability Zone as the target instance
- An IAM instance profile for the helper with SSM agent permissions
- If using customer-managed KMS keys: the key policy must allow the Step Functions role to create grants
IAM Role for the State Machine
The Step Functions execution role needs permissions across EC2, SSM, and KMS. In summary:
The role's trust policy lets states.amazonaws.com assume it. The inline policy needs the usual EC2 actions for the rescue dance: DescribeInstances, StopInstances, StartInstances, DetachVolume, AttachVolume, and CreateVolume, plus CreateSnapshot and CreateTags for the tagged backup snapshot, RunInstances and TerminateInstances for the helper, and SendCommand / GetCommandInvocation on SSM. The polling loops also need DescribeVolumes and ssm:DescribeInstanceInformation. iam:PassRole is needed so the state machine can hand the helper instance profile's role to the helper instance. If your volumes use a customer-managed KMS key, add kms:CreateGrant, kms:Decrypt, and kms:DescribeKey on that key ARN.
If using customer-managed KMS keys, the key policy must include the Step Functions execution role (or the helper instance's role) as a principal allowed to call kms:CreateGrant and kms:Decrypt. For AWS-managed keys (aws/ebs), the default key policy typically allows any principal in the account to use the key for EBS operations, so no key policy changes are needed.
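As a rough sketch, an inline policy covering the actions described above might look like the following. The role ARN, key ARN, and Sid values are placeholders. The snapshot, tagging, and describe actions in the second and third statements are my assumptions based on the snapshot, tagging, and polling steps in the workflow, so trim or extend them to match your own definition.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "RescueEc2Actions",
      "Effect": "Allow",
      "Action": [
        "ec2:DescribeInstances",
        "ec2:StopInstances",
        "ec2:StartInstances",
        "ec2:DetachVolume",
        "ec2:AttachVolume",
        "ec2:CreateVolume",
        "ec2:RunInstances",
        "ec2:TerminateInstances"
      ],
      "Resource": "*"
    },
    {
      "Sid": "AssumedSnapshotTaggingAndPolling",
      "Effect": "Allow",
      "Action": [
        "ec2:CreateSnapshot",
        "ec2:CreateTags",
        "ec2:DescribeVolumes",
        "ec2:DescribeSnapshots"
      ],
      "Resource": "*"
    },
    {
      "Sid": "RunFixScript",
      "Effect": "Allow",
      "Action": [
        "ssm:SendCommand",
        "ssm:GetCommandInvocation",
        "ssm:DescribeInstanceInformation"
      ],
      "Resource": "*"
    },
    {
      "Sid": "PassHelperRole",
      "Effect": "Allow",
      "Action": "iam:PassRole",
      "Resource": "arn:aws:iam::123456789012:role/EC2RescueHelperRole",
      "Condition": {
        "StringEquals": { "iam:PassedToService": "ec2.amazonaws.com" }
      }
    },
    {
      "Sid": "UseVolumeKey",
      "Effect": "Allow",
      "Action": ["kms:CreateGrant", "kms:Decrypt", "kms:DescribeKey"],
      "Resource": "arn:aws:kms:us-east-1:123456789012:key/1234abcd-12ab-34cd-56ef-1234567890ab"
    }
  ]
}
```

Using "Resource": "*" keeps the sketch short. For production you would likely scope the EC2 statements with tag or resource conditions.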
The trust policy for the execution role:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"Service": "states.amazonaws.com"
},
"Action": "sts:AssumeRole"
}
]
}
Helper Instance Profile
The helper instance needs an IAM instance profile with SSM permissions so the Run Command can execute your repair script:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"ssm:UpdateInstanceInformation",
"ssmmessages:CreateControlChannel",
"ssmmessages:CreateDataChannel",
"ssmmessages:OpenControlChannel",
"ssmmessages:OpenDataChannel",
"ec2messages:GetMessages",
"ec2messages:AcknowledgeMessage",
"ec2messages:SendReply"
],
"Resource": "*"
}
]
}
You can also attach the AmazonSSMManagedInstanceCore managed policy instead of the inline policy above.
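Whichever permissions policy you pick, the role behind the instance profile also needs a trust policy that lets EC2 assume it. A minimal version:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "ec2.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
```

This is the counterpart of the states.amazonaws.com trust policy on the execution role: same shape, different service principal.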
The State Machine Definition
Below is an outline of the Amazon States Language (ASL) definition, broken into logical sections with commentary.
Input Format
The state machine expects this input when started:
{
"InstanceId": "i-0abc123def456",
"HelperAmiId": "ami-0abcdef1234567890",
"HelperInstanceType": "t3.medium",
"HelperSubnetId": "subnet-0abc123",
"HelperSecurityGroupId": "sg-0abc123",
"HelperInstanceProfileArn": "arn:aws:iam::123456789012:instance-profile/EC2RescueHelperRole",
"FixScript": "base64-encoded-powershell-script"
}
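To show how those input fields might be consumed, here is a sketch of a helper-launch state that maps them onto the EC2 RunInstances call. The state names LaunchHelperInstance and WaitForHelperReady appear in the flow in this post and the tag value is the one this post uses for cleanup. The `$.Helper` result path is an assumption of mine.

```json
{
  "LaunchHelperInstance": {
    "Type": "Task",
    "Comment": "Sketch only: stores the new instance ID under the hypothetical $.Helper path",
    "Resource": "arn:aws:states:::aws-sdk:ec2:runInstances",
    "Parameters": {
      "ImageId.$": "$.HelperAmiId",
      "InstanceType.$": "$.HelperInstanceType",
      "SubnetId.$": "$.HelperSubnetId",
      "SecurityGroupIds.$": "States.Array($.HelperSecurityGroupId)",
      "IamInstanceProfile": {
        "Arn.$": "$.HelperInstanceProfileArn"
      },
      "MinCount": 1,
      "MaxCount": 1,
      "TagSpecifications": [
        {
          "ResourceType": "instance",
          "Tags": [
            { "Key": "CreatedBy", "Value": "StepFunctions-EC2Rescue" }
          ]
        }
      ]
    },
    "ResultSelector": {
      "InstanceId.$": "$.Instances[0].InstanceId"
    },
    "ResultPath": "$.Helper",
    "Next": "WaitForHelperReady"
  }
}
```

Because the subnet determines the Availability Zone, passing a HelperSubnetId in the wrong AZ is the easiest way to break the later attach step, so it is worth validating before you start an execution.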
ASL Definition Walkthrough
The state machine has roughly a dozen states. The flow:
- DescribeBrokenInstance, pull the AZ, root volume ID, and KMS key
- StopBrokenInstance + a WaitForStop poll loop
- DetachRootVolume from the broken instance
- LaunchHelperInstance in the same AZ with the helper instance profile
- WaitForHelperReady, poll until SSM reports it Online (this is the gotcha state, SSM agent registration takes 60-90 seconds)
- AttachVolumeToHelper as xvdf
- RunFixScript via AWS-RunPowerShellScript with the base64 script as a parameter
- Poll command status, then DetachFromHelper, ReattachToOriginal as /dev/sda1
- StartOriginal + TerminateHelper in parallel
Each step has a Retry block with exponential backoff and a Catch that routes to a cleanup state. The cleanup state re-attaches the root volume to the original instance even on partial failure, so you don't end up with an orphaned root volume sitting in the AZ. I won't paste the full ASL JSON here; it's about 400 lines, and the structure above is what matters.
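For reference, a Retry-plus-Catch pair on one task could be sketched like this. DetachRootVolume and LaunchHelperInstance are names from the flow above. CleanupAndReattach, the `$.Broken.RootVolumeId` input path, and the specific retry numbers are hypothetical values chosen for illustration.

```json
{
  "DetachRootVolume": {
    "Type": "Task",
    "Comment": "Sketch only: assumes an earlier state stored the root volume ID at $.Broken.RootVolumeId",
    "Resource": "arn:aws:states:::aws-sdk:ec2:detachVolume",
    "Parameters": {
      "VolumeId.$": "$.Broken.RootVolumeId",
      "InstanceId.$": "$.InstanceId"
    },
    "ResultPath": "$.Detach",
    "Retry": [
      {
        "ErrorEquals": ["States.TaskFailed"],
        "IntervalSeconds": 5,
        "MaxAttempts": 4,
        "BackoffRate": 2.0
      }
    ],
    "Catch": [
      {
        "ErrorEquals": ["States.ALL"],
        "ResultPath": "$.Error",
        "Next": "CleanupAndReattach"
      }
    ],
    "Next": "LaunchHelperInstance"
  }
}
```

With these numbers the retries wait 5, 10, and 20 seconds before giving up. Setting ResultPath on the Catch keeps the original input intact, so the cleanup state still knows which instance and volume to put back together.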
How the Fix Script Works
The fix script runs on the helper instance via SSM Run Command. The encrypted volume from the broken instance is attached as a secondary drive. The script needs to:
- Find the secondary drive (the offline volume)
- Bring it online and assign a drive letter
- Load the registry hive from the offline volume
- Make the fix
- Unload the hive
- Take the disk offline
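Here's the shape of an example script that re-enables RDP. It runs on the helper instance against the attached offline volume:
# Bring the attached disk online (assumes disk 0 is the helper's own root disk)
Get-Disk | Where-Object Number -ne 0 | Set-Disk -IsOffline $false
# Assumes Windows mounted the rescued volume as D:
$registryDir = 'D:\Windows\System32\config'
try {
    # Load the SYSTEM hive from the offline volume and re-enable RDP
    reg load "HKLM\OfflineSystem" "$registryDir\SYSTEM"
    reg add "HKLM\OfflineSystem\ControlSet001\Control\Terminal Server" /v fDenyTSConnections /t REG_DWORD /d 0 /f
}
finally {
    # Unload the hive cleanly (critical, otherwise the volume won't detach)
    reg unload "HKLM\OfflineSystem"
}
Get-Disk | Where-Object Number -ne 0 | Set-Disk -IsOffline $true  # take the disk offline again
The registry work is wrapped in try/finally so a failed registry edit still runs the unload step. Skip the unload and the detach step later in the state machine can hang indefinitely.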
Disable Smart Card Logon (scforcelogon)
Swap the registry fix section for this:
# Inside the try block, replace the reg load and reg add lines with:
reg load "HKLM\OfflineSoftware" "$registryDir\SOFTWARE"
reg add "HKLM\OfflineSoftware\Microsoft\Windows\CurrentVersion\Policies\System" /v scforcelogon /t REG_DWORD /d 0 /f
Write-Host "SUCCESS: scforcelogon disabled"
# In the finally block, unload OfflineSoftware instead of OfflineSystem
Reset Windows Firewall
# Inside the try block:
reg load "HKLM\OfflineSystem" "$registryDir\SYSTEM"
reg add "HKLM\OfflineSystem\ControlSet001\Services\SharedAccess\Parameters\FirewallPolicy\DomainProfile" /v EnableFirewall /t REG_DWORD /d 0 /f
reg add "HKLM\OfflineSystem\ControlSet001\Services\SharedAccess\Parameters\FirewallPolicy\StandardProfile" /v EnableFirewall /t REG_DWORD /d 0 /f
reg add "HKLM\OfflineSystem\ControlSet001\Services\SharedAccess\Parameters\FirewallPolicy\PublicProfile" /v EnableFirewall /t REG_DWORD /d 0 /f
Write-Host "SUCCESS: All firewall profiles disabled"
Base64 Encoding the Script
The state machine expects a base64-encoded script in the FixScript input parameter:
# Encode your script for the Step Functions input
$scriptContent = Get-Content -Path '.\fix-rdp-encrypted.ps1' -Raw
$base64 = [System.Convert]::ToBase64String(
[System.Text.Encoding]::ASCII.GetBytes($scriptContent)
)
Write-Host $base64
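On the state machine side, one possible way to consume that value is to decode it with the States.Base64Decode intrinsic and pass the result to AWS-RunPowerShellScript as a single command. This is a sketch of that approach, not necessarily how the original definition does it: the `$.Helper.InstanceId` path and the WaitForFixScript state are hypothetical, and States.Base64Decode only accepts input up to 10,000 characters, so longer scripts would need to be decoded on the instance instead.

```json
{
  "RunFixScript": {
    "Type": "Task",
    "Comment": "Sketch only: assumes the helper's ID is at $.Helper.InstanceId and the encoded script is under 10,000 characters",
    "Resource": "arn:aws:states:::aws-sdk:ssm:sendCommand",
    "Parameters": {
      "DocumentName": "AWS-RunPowerShellScript",
      "InstanceIds.$": "States.Array($.Helper.InstanceId)",
      "Parameters": {
        "commands.$": "States.Array(States.Base64Decode($.FixScript))"
      }
    },
    "ResultSelector": {
      "CommandId.$": "$.Command.CommandId"
    },
    "ResultPath": "$.Fix",
    "Next": "WaitForFixScript"
  }
}
```

The CommandId captured here is what a later polling state would hand to GetCommandInvocation to find out whether the script succeeded.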
Deploying with CloudFormation
The CloudFormation template creates the state machine, execution role, and helper instance profile. It provisions four things:
- The Step Functions state machine (with the ASL definition inline)
- The state machine's IAM role with the policy described above
- The helper instance profile and its role (just AmazonSSMManagedInstanceCore)
- A CloudWatch log group for state machine execution history
I won't paste the 400+ lines of YAML here. Two gotchas worth calling out:
- The ASL definition goes in DefinitionString as a JSON string. CloudFormation's !Sub works fine inside it, but any literal ${...} sequence that isn't a substitution target has to be escaped (write ${!Foo} to emit ${Foo}).
- The helper role needs the SSM core policy plus, if your volumes use customer-managed KMS keys, a grant on those keys. CloudFormation can't create KMS grants directly, so I do that as a post-deploy step.
Running the Workflow
Step 1: Deploy the CloudFormation Stack
aws cloudformation deploy \
--template-file ec2-rescue-step-functions.yaml \
--stack-name ec2-rescue-encrypted \
--capabilities CAPABILITY_NAMED_IAM
Step 2: Get the Helper Instance Profile ARN
aws cloudformation describe-stacks \
--stack-name ec2-rescue-encrypted \
--query "Stacks[0].Outputs[?OutputKey=='HelperInstanceProfileArn'].OutputValue" \
--output text
Step 3: Prepare Your Fix Script
Save your PowerShell fix script (like the RDP fix above) and base64 encode it:
$base64 = [System.Convert]::ToBase64String(
[System.Text.Encoding]::ASCII.GetBytes(
(Get-Content -Path '.\fix-rdp-encrypted.ps1' -Raw)
)
)
Write-Host $base64
Step 4: Start the Execution
You'll need the instance ID of the broken instance, a Windows AMI ID in the same region, and network details for the helper instance (same AZ as the target):
aws stepfunctions start-execution \
--state-machine-arn "arn:aws:states:us-east-1:123456789012:stateMachine:EC2RescueEncryptedVolumes" \
--input '{
"InstanceId": "i-0abc123def456",
"HelperAmiId": "ami-0abcdef1234567890",
"HelperInstanceType": "t3.medium",
"HelperSubnetId": "subnet-0abc123",
"HelperSecurityGroupId": "sg-0abc123",
"HelperInstanceProfileArn": "arn:aws:iam::123456789012:instance-profile/EC2RescueEncryptedVolumes-HelperProfile",
"FixScript": "YOUR_BASE64_ENCODED_SCRIPT_HERE"
}'
Step 5: Monitor in the Console
Open the Step Functions console and click on the running execution. You'll see each state light up in green as it completes. If a state is in a polling loop (waiting for an instance to stop, a volume to detach, etc.), you'll see it cycle between the Wait and Check states.
The entire workflow typically completes in 8-12 minutes, depending on how long Windows takes to boot on the helper instance.
KMS Considerations
AWS-Managed Keys (aws/ebs)
If your encrypted volumes use the default aws/ebs AWS-managed key, no additional KMS configuration is needed beyond the IAM policy above. The AWS-managed key's default policy allows any principal in the account to use it for EBS operations. The Step Functions role and the helper instance can both access the volume transparently.
Customer-Managed KMS Keys
If the volume is encrypted with a customer-managed KMS key, you need to ensure the key policy allows the Step Functions execution role to create grants. Add this statement to your KMS key policy:
{
"Sid": "AllowStepFunctionsEC2Rescue",
"Effect": "Allow",
"Principal": {
"AWS": "arn:aws:iam::123456789012:role/EC2RescueEncryptedVolumes-ExecutionRole"
},
"Action": [
"kms:Decrypt",
"kms:DescribeKey",
"kms:CreateGrant",
"kms:GenerateDataKeyWithoutPlaintext"
],
"Resource": "*"
}
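If you want something tighter than an unconditional kms:CreateGrant, one common pattern is to move the grant actions into their own statement and restrict them with the kms:GrantIsForAWSResource condition, so grants can only be created on behalf of AWS services such as EC2. A sketch, reusing the same example role ARN (the Sid is a placeholder):

```json
{
  "Sid": "AllowGrantsForAwsResourcesOnly",
  "Effect": "Allow",
  "Principal": {
    "AWS": "arn:aws:iam::123456789012:role/EC2RescueEncryptedVolumes-ExecutionRole"
  },
  "Action": [
    "kms:CreateGrant",
    "kms:ListGrants",
    "kms:RevokeGrant"
  ],
  "Resource": "*",
  "Condition": {
    "Bool": {
      "kms:GrantIsForAWSResource": "true"
    }
  }
}
```

This mirrors the "attachment of persistent resources" statement that the KMS console adds to default key policies. If you adopt it, drop kms:CreateGrant from the broader statement so the condition is not bypassed.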
Cross-Account KMS Keys
If the KMS key lives in a different account (e.g., a centralized key management account), both the key policy in the source account and the IAM policy in the local account need to grant access. The key policy in the source account must trust the local account's Step Functions role, and the local IAM policy must reference the cross-account key ARN.
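On the local-account side, the IAM half of that pairing could look something like the statement below, attached to the Step Functions execution role. The account ID 999999999999 and the key ID are placeholders for the key-owning account. The action list simply mirrors the key policy statement shown earlier.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "UseCrossAccountEbsKey",
      "Effect": "Allow",
      "Action": [
        "kms:CreateGrant",
        "kms:Decrypt",
        "kms:DescribeKey",
        "kms:GenerateDataKeyWithoutPlaintext"
      ],
      "Resource": "arn:aws:kms:us-east-1:999999999999:key/1234abcd-12ab-34cd-56ef-1234567890ab"
    }
  ]
}
```

Neither side is sufficient alone: the key policy in the owning account has to name the role (or its account) as a principal, and this IAM policy has to name the full key ARN rather than an alias.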
Error Handling and Rollback
The state machine includes built-in compensation logic:
- Script failure - If the fix script fails on the helper, the workflow automatically detaches the volume, reattaches it to the original instance, terminates the helper, and transitions to a Fail state with the script's error output.
- Backup snapshot - The very first step creates a snapshot before any changes. If something goes catastrophically wrong, you can always create a new volume from this snapshot.
- Tagged resources - The helper instance and snapshot are tagged with CreatedBy: StepFunctions-EC2Rescue so you can easily find and clean up any orphaned resources.
For production use, consider adding:
- SNS notifications on success/failure (add a Publish task before the terminal states)
- Timeout on the overall execution (set TimeoutSeconds on the state machine definition)
- Tuned retry policies on individual API calls (adjust the Retry blocks for throttling and transient errors)
- CloudWatch alarms on failed executions
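As a sketch of the first two items, here is a skeleton that sets an overall timeout and publishes to SNS before failing. It is trimmed to just the failure path, so StartAt points straight at the notification state. The topic ARN, the state names, and the 30-minute timeout are all placeholder assumptions.

```json
{
  "Comment": "Sketch only: trimmed to the failure path; topic ARN, state names, and timeout are placeholders",
  "StartAt": "NotifyFailure",
  "TimeoutSeconds": 1800,
  "States": {
    "NotifyFailure": {
      "Type": "Task",
      "Resource": "arn:aws:states:::sns:publish",
      "Parameters": {
        "TopicArn": "arn:aws:sns:us-east-1:123456789012:ec2-rescue-alerts",
        "Message.$": "States.Format('EC2 rescue failed for {}', $.InstanceId)"
      },
      "ResultPath": null,
      "Next": "RescueFailed"
    },
    "RescueFailed": {
      "Type": "Fail",
      "Error": "RescueFailed",
      "Cause": "See the execution history for the failing state"
    }
  }
}
```

The execution role would also need sns:Publish on the topic. A matching success notification would be the same Publish task placed ahead of the terminal Succeed state.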
Comparison: SSM Runbooks vs Step Functions
| Feature | SSM Runbooks | Step Functions Workflow |
|---|---|---|
| Encrypted volumes | Not supported | Fully supported (AWS-managed and customer-managed keys) |
| Setup effort | Zero (AWS-provided) | Deploy CloudFormation stack + IAM roles |
| Customization | Limited to base64 script parameter | Full control over every step |
| Visual monitoring | Step-by-step in SSM console | Visual workflow graph in Step Functions console |
| Error handling | Basic (fails at encrypted check) | Custom rollback and compensation logic |
| Cost | Free (SSM is free, pay for helper instance) | ~$0.025 per execution + helper instance time |
| Lambda required | Yes (internally) | No (native SDK integrations) |
| Backup AMI | Automatic | Snapshot (you can add AMI creation if needed) |
| When to use | Unencrypted volumes, quick fix | Encrypted volumes, custom workflows, compliance requirements |
Testing the Workflow
Before you need this at 2 AM, test it on a throwaway instance:
- Launch a test instance with an encrypted root volume (enable “Encrypt this volume” in the launch wizard or use account-level default encryption)
- Break something - disable RDP, enable scforcelogon, block the firewall
- Run the workflow with the appropriate fix script
- Verify you can RDP back in after the workflow completes
- Clean up - terminate the test instance, delete the backup snapshot
Keep a library of base64-encoded fix scripts in S3 or Parameter Store. When an incident happens, you just grab the right script and paste it into the Step Functions input - no scrambling to write and encode a script under pressure.
Conclusion
The AWS-provided EC2Rescue runbooks are excellent for unencrypted volumes, but as more organizations adopt EBS encryption by default, the gap is real. Building your own rescue workflow with Step Functions gives you:
- Encrypted volume support - The whole reason we're here
- Full KMS integration - Works with AWS-managed keys, customer-managed keys, and cross-account keys
- No Lambda functions - Every API call is a native Step Functions SDK integration
- Visual workflow monitoring - Watch each state execute in real time
- Automated rollback - If the fix script fails, the volume goes back where it came from
- One CloudFormation stack - Deploy once, use whenever you need it
The workflow handles the same tedious volume-swap dance that the SSM runbooks do - it just doesn't bail out when it sees an encrypted volume. Deploy it before you need it, test it on a throwaway instance, and keep your fix scripts ready. Your future 2 AM self will thank you.
For the companion post covering SSM runbooks for unencrypted volumes, see Fixing Broken Windows EC2 Instances with Offline Registry Edits via SSM Automation.