Arun's Blog

Building a Custom EC2 Rescue Workflow with Step Functions for Encrypted EBS Volumes

20 min read
Tags: AWS, EC2, Step Functions, Automation, Security
TL;DR

The AWS-provided EC2Rescue runbooks (AWSSupport-ExecuteEC2Rescue and AWSSupport-StartEC2RescueWorkflow) fail immediately on encrypted EBS volumes — including those using AWS-managed keys (aws/ebs). This post walks through building your own rescue workflow using AWS Step Functions that handles encrypted volumes natively, with full KMS integration, no Lambda functions required, and a visual workflow you can monitor in real time.

The Problem

If you've read my previous post on fixing broken Windows EC2 instances with SSM automation, you know that AWS provides excellent runbooks for offline registry repair. There's just one catch buried in the limitations section:

The Encrypted Volume Gap

Both AWSSupport-ExecuteEC2Rescue and AWSSupport-StartEC2RescueWorkflow check the volume's Encrypted flag and fail immediately if it's true. This applies to all encryption types — AWS-managed keys (aws/ebs), customer-managed KMS keys, even keys shared from other accounts. The helper instance's IAM role simply doesn't have KMS permissions.

With the push toward encryption-by-default (many organizations enable default EBS encryption at the account level), this limitation affects a growing number of instances. The usual advice is “do a manual volume swap” — a tedious, error-prone, 10+ step process that nobody wants to do at 2 AM during an incident.

What if we could build our own automated rescue workflow that handles encrypted volumes?

Why Step Functions?

AWS Step Functions is the right tool for this job for several reasons:

  • Native AWS SDK integrations — Step Functions can call EC2, SSM, and KMS APIs directly without Lambda functions. Every API call in our workflow is a native SDK integration.
  • Visual workflow — You can watch each state execute in real time in the console. During a 2 AM incident, this visibility is invaluable.
  • Built-in error handling — Retry policies, catch blocks, and compensation logic are first-class citizens. If a step fails, the workflow can clean up after itself.
  • IAM role you control — Unlike the AWS-managed runbooks, you define the execution role. You can grant kms:CreateGrant, kms:Decrypt, and kms:DescribeKey — the permissions needed to mount encrypted volumes on the helper instance.
  • No compute cost during waits — The workflow uses polling loops to wait for instances and volumes. Unlike a Lambda that would sit idle (and bill you), Step Functions only charges per state transition.

Architecture Overview

The workflow follows the same logical steps as the manual volume swap, fully automated:

  1. Snapshot — Create a backup snapshot of the broken instance's root volume
  2. Stop — Stop the broken instance
  3. Detach — Detach the encrypted root volume
  4. Launch helper — Launch a temporary Windows instance in the same AZ with KMS permissions
  5. Attach — Attach the encrypted volume to the helper as a secondary drive
  6. Fix — Run your repair script on the helper via SSM Run Command
  7. Detach from helper — Detach the volume from the helper
  8. Reattach — Reattach the volume to the original instance as the root device
  9. Start — Start the original instance
  10. Cleanup — Terminate the helper instance

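Most actions are followed by a wait loop that polls instance or volume state until it reaches the expected value. If the fix script fails, a cleanup path reattaches the volume to the original instance and terminates the helper.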
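
The wait, check, and choice states that follow most actions amount to a polling loop. As a rough illustration outside Step Functions, the Python sketch below expresses the same pattern. `poll_until` and its parameters are hypothetical names chosen for this example, and the state sequence is simulated rather than read from EC2. It also adds an attempt limit, which the state machine's loops as written do not have.

```python
import time
from typing import Callable


def poll_until(get_state: Callable[[], str], target: str,
               interval_seconds: float, max_attempts: int,
               sleep: Callable[[float], None] = time.sleep) -> int:
    """Poll get_state() until it returns target; return the number of checks made.

    Mirrors the Wait -> Check -> Choice loop: wait first, then check.
    Raises TimeoutError after max_attempts checks.
    """
    for attempt in range(1, max_attempts + 1):
        sleep(interval_seconds)        # the Wait state
        if get_state() == target:      # the Check task plus the Choice state
            return attempt
    raise TimeoutError(f"state never reached {target!r} after {max_attempts} checks")


# Simulated instance that reports "stopping" twice before "stopped".
states = iter(["stopping", "stopping", "stopped"])
attempts = poll_until(lambda: next(states), "stopped",
                      interval_seconds=15, max_attempts=10,
                      sleep=lambda seconds: None)  # skip real sleeping in the demo
print(attempts)
```

A similar bound could be added to the state machine by counting iterations in the state data, though that is a design choice rather than something the definition in this post does.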

Prerequisites

  • An AWS account with Step Functions, EC2, SSM, and KMS access
  • A Windows AMI ID for the helper instance (e.g., latest Windows Server 2022 Base)
  • A VPC subnet in the same Availability Zone as the target instance
  • An IAM instance profile for the helper with SSM agent permissions
  • If using customer-managed KMS keys: the key policy must allow the Step Functions role to create grants
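
The same-AZ prerequisite matters because an EBS volume can only be attached to an instance in its own Availability Zone. A small pre-flight check, sketched here in Python, compares the two before the workflow starts. The function name is hypothetical and the dictionaries are trimmed, made-up stand-ins for the DescribeInstances and DescribeSubnets responses.

```python
def helper_subnet_matches_instance(instances_response: dict, subnets_response: dict) -> bool:
    """Return True when the helper subnet is in the instance's Availability Zone."""
    instance = instances_response["Reservations"][0]["Instances"][0]
    subnet = subnets_response["Subnets"][0]
    return instance["Placement"]["AvailabilityZone"] == subnet["AvailabilityZone"]


# Trimmed sample data in the shape returned by DescribeInstances / DescribeSubnets.
instances = {"Reservations": [{"Instances": [{"Placement": {"AvailabilityZone": "us-east-1a"}}]}]}
subnet_ok = {"Subnets": [{"SubnetId": "subnet-0abc123", "AvailabilityZone": "us-east-1a"}]}
subnet_bad = {"Subnets": [{"SubnetId": "subnet-0def456", "AvailabilityZone": "us-east-1b"}]}

print(helper_subnet_matches_instance(instances, subnet_ok))
print(helper_subnet_matches_instance(instances, subnet_bad))
```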

IAM Role for the State Machine

The Step Functions execution role needs permissions across EC2, SSM, and KMS. Here's the IAM policy:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "EC2Permissions",
      "Effect": "Allow",
      "Action": [
        "ec2:DescribeInstances",
        "ec2:DescribeVolumes",
        "ec2:DescribeSnapshots",
        "ec2:StopInstances",
        "ec2:StartInstances",
        "ec2:DetachVolume",
        "ec2:AttachVolume",
        "ec2:CreateSnapshot",
        "ec2:RunInstances",
        "ec2:TerminateInstances",
        "ec2:CreateTags"
      ],
      "Resource": "*"
    },
    {
      "Sid": "SSMRunCommand",
      "Effect": "Allow",
      "Action": [
        "ssm:SendCommand",
        "ssm:GetCommandInvocation",
        "ssm:DescribeInstanceInformation"
      ],
      "Resource": "*"
    },
    {
      "Sid": "KMSForEncryptedVolumes",
      "Effect": "Allow",
      "Action": [
        "kms:Decrypt",
        "kms:DescribeKey",
        "kms:CreateGrant",
        "kms:GenerateDataKeyWithoutPlaintext",
        "kms:ReEncryptFrom",
        "kms:ReEncryptTo"
      ],
      "Resource": "*"
    },
    {
      "Sid": "PassRoleForHelper",
      "Effect": "Allow",
      "Action": "iam:PassRole",
      "Resource": "arn:aws:iam::*:role/EC2RescueHelperRole"
    }
  ]
}

KMS Key Policy Requirement

If using customer-managed KMS keys, the key policy must allow the Step Functions execution role to call kms:CreateGrant and kms:Decrypt, since that role is the principal making the AttachVolume and StartInstances calls. For AWS-managed keys (aws/ebs), the default key policy typically allows any principal in the account to use the key for EBS operations, so no key policy changes are needed.
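
The IAM policy above grants the KMS actions on every key. If you know which key protects the volume, the statement can be narrowed. The Python sketch below generates such a statement. It is an assumption-laden example, not a drop-in replacement: `scoped_kms_statement` is a hypothetical helper, the key ARN is a placeholder, and restricting every action with `kms:ViaService` should be tested in your own account before you rely on it.

```python
import json

KMS_ACTIONS = [
    "kms:Decrypt",
    "kms:DescribeKey",
    "kms:CreateGrant",
    "kms:GenerateDataKeyWithoutPlaintext",
    "kms:ReEncryptFrom",
    "kms:ReEncryptTo",
]


def scoped_kms_statement(key_arn: str, region: str) -> dict:
    """Build a KMS statement limited to one key and to requests made through EC2."""
    return {
        "Sid": "KMSForEncryptedVolumes",
        "Effect": "Allow",
        "Action": KMS_ACTIONS,
        "Resource": key_arn,
        "Condition": {
            "StringEquals": {"kms:ViaService": f"ec2.{region}.amazonaws.com"}
        },
    }


# Placeholder ARN for illustration only.
statement = scoped_kms_statement(
    "arn:aws:kms:us-east-1:123456789012:key/EXAMPLE-KEY-ID", "us-east-1"
)
print(json.dumps(statement, indent=2))
```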

The trust policy for the execution role:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "states.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

Helper Instance Profile

The helper instance needs an IAM instance profile with SSM permissions so the Run Command can execute your repair script:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ssm:UpdateInstanceInformation",
        "ssmmessages:CreateControlChannel",
        "ssmmessages:CreateDataChannel",
        "ssmmessages:OpenControlChannel",
        "ssmmessages:OpenDataChannel",
        "ec2messages:GetMessages",
        "ec2messages:AcknowledgeMessage",
        "ec2messages:SendReply"
      ],
      "Resource": "*"
    }
  ]
}

You can also attach the AmazonSSMManagedInstanceCore managed policy instead of the inline policy above.

The State Machine Definition

Below is the complete Amazon States Language (ASL) definition. It's broken into logical sections with commentary.

Input Format

The state machine expects this input when started:

{
  "InstanceId": "i-0abc123def456",
  "HelperAmiId": "ami-0abcdef1234567890",
  "HelperInstanceType": "t3.medium",
  "HelperSubnetId": "subnet-0abc123",
  "HelperSecurityGroupId": "sg-0abc123",
  "HelperInstanceProfileArn": "arn:aws:iam::123456789012:instance-profile/EC2RescueHelperRole",
  "FixScript": "base64-encoded-powershell-script"
}
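
Because a malformed input only surfaces once the execution is already running, it can help to check the payload first. The Python sketch below is one way to do that. `find_input_problems` is a hypothetical helper, and the prefix checks are deliberately loose so that placeholder IDs like the ones above still pass.

```python
import base64
import binascii

EXPECTED_PREFIXES = {
    "InstanceId": "i-",
    "HelperAmiId": "ami-",
    "HelperSubnetId": "subnet-",
    "HelperSecurityGroupId": "sg-",
    "HelperInstanceProfileArn": "arn:",
}
REQUIRED_KEYS = tuple(EXPECTED_PREFIXES) + ("HelperInstanceType", "FixScript")


def find_input_problems(payload: dict) -> list:
    """Return a list of human-readable problems; an empty list means the input looks usable."""
    problems = []
    for key in REQUIRED_KEYS:
        if not isinstance(payload.get(key), str) or not payload[key]:
            problems.append(f"{key}: missing or empty")
    for key, prefix in EXPECTED_PREFIXES.items():
        value = payload.get(key)
        if isinstance(value, str) and value and not value.startswith(prefix):
            problems.append(f"{key}: expected a value starting with {prefix!r}")
    script = payload.get("FixScript")
    if isinstance(script, str) and script:
        try:
            base64.b64decode(script, validate=True)
        except binascii.Error:
            problems.append("FixScript: not valid base64")
    return problems


example = {
    "InstanceId": "i-0abc123def456",
    "HelperAmiId": "ami-0abcdef1234567890",
    "HelperInstanceType": "t3.medium",
    "HelperSubnetId": "subnet-0abc123",
    "HelperSecurityGroupId": "sg-0abc123",
    "HelperInstanceProfileArn": "arn:aws:iam::123456789012:instance-profile/EC2RescueHelperRole",
    "FixScript": "base64-encoded-powershell-script",  # placeholder, not real base64
}
print(find_input_problems(example))
```

Run against the sample input, the only complaint is the placeholder FixScript value, which is the one field you always have to replace.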

Complete ASL Definition

{
  "Comment": "EC2 Rescue Workflow for Encrypted EBS Volumes",
  "StartAt": "GetInstanceDetails",
  "States": {
    "GetInstanceDetails": {
      "Type": "Task",
      "Resource": "arn:aws:states:::aws-sdk:ec2:describeInstances",
      "Parameters": {
        "InstanceIds.$": "States.Array($.InstanceId)"
      },
      "ResultPath": "$.InstanceDetails",
      "ResultSelector": {
        "AvailabilityZone.$": "$.Reservations[0].Instances[0].Placement.AvailabilityZone",
        "RootDeviceName.$": "$.Reservations[0].Instances[0].RootDeviceName",
        "VolumeId.$": "$.Reservations[0].Instances[0].BlockDeviceMappings[0].Ebs.VolumeId"
      },
      "Next": "CreateBackupSnapshot"
    },

    "CreateBackupSnapshot": {
      "Type": "Task",
      "Resource": "arn:aws:states:::aws-sdk:ec2:createSnapshot",
      "Parameters": {
        "VolumeId.$": "$.InstanceDetails.VolumeId",
        "Description.$": "States.Format('EC2Rescue backup - {}', $.InstanceId)",
        "TagSpecifications": [
          {
            "ResourceType": "snapshot",
            "Tags": [
              {
                "Key": "Name",
                "Value.$": "States.Format('EC2Rescue-Backup-{}', $.InstanceId)"
              },
              {
                "Key": "CreatedBy",
                "Value": "StepFunctions-EC2Rescue"
              }
            ]
          }
        ]
      },
      "ResultPath": "$.Snapshot",
      "ResultSelector": {
        "SnapshotId.$": "$.SnapshotId"
      },
      "Next": "StopInstance"
    },

    "StopInstance": {
      "Type": "Task",
      "Resource": "arn:aws:states:::aws-sdk:ec2:stopInstances",
      "Parameters": {
        "InstanceIds.$": "States.Array($.InstanceId)"
      },
      "ResultPath": null,
      "Next": "WaitForInstanceStopped"
    },

    "WaitForInstanceStopped": {
      "Type": "Wait",
      "Seconds": 15,
      "Next": "CheckInstanceStopped"
    },

    "CheckInstanceStopped": {
      "Type": "Task",
      "Resource": "arn:aws:states:::aws-sdk:ec2:describeInstances",
      "Parameters": {
        "InstanceIds.$": "States.Array($.InstanceId)"
      },
      "ResultPath": "$.InstanceState",
      "ResultSelector": {
        "State.$": "$.Reservations[0].Instances[0].State.Name"
      },
      "Next": "IsInstanceStopped"
    },

    "IsInstanceStopped": {
      "Type": "Choice",
      "Choices": [
        {
          "Variable": "$.InstanceState.State",
          "StringEquals": "stopped",
          "Next": "DetachRootVolume"
        }
      ],
      "Default": "WaitForInstanceStopped"
    },

    "DetachRootVolume": {
      "Type": "Task",
      "Resource": "arn:aws:states:::aws-sdk:ec2:detachVolume",
      "Parameters": {
        "VolumeId.$": "$.InstanceDetails.VolumeId",
        "InstanceId.$": "$.InstanceId"
      },
      "ResultPath": null,
      "Next": "WaitForVolumeDetached"
    },

    "WaitForVolumeDetached": {
      "Type": "Wait",
      "Seconds": 10,
      "Next": "CheckVolumeDetached"
    },

    "CheckVolumeDetached": {
      "Type": "Task",
      "Resource": "arn:aws:states:::aws-sdk:ec2:describeVolumes",
      "Parameters": {
        "VolumeIds.$": "States.Array($.InstanceDetails.VolumeId)"
      },
      "ResultPath": "$.VolumeState",
      "ResultSelector": {
        "State.$": "$.Volumes[0].State"
      },
      "Next": "IsVolumeDetached"
    },

    "IsVolumeDetached": {
      "Type": "Choice",
      "Choices": [
        {
          "Variable": "$.VolumeState.State",
          "StringEquals": "available",
          "Next": "LaunchHelperInstance"
        }
      ],
      "Default": "WaitForVolumeDetached"
    },

    "LaunchHelperInstance": {
      "Type": "Task",
      "Resource": "arn:aws:states:::aws-sdk:ec2:runInstances",
      "Parameters": {
        "ImageId.$": "$.HelperAmiId",
        "InstanceType.$": "$.HelperInstanceType",
        "MinCount": 1,
        "MaxCount": 1,
        "SubnetId.$": "$.HelperSubnetId",
        "SecurityGroupIds.$": "States.Array($.HelperSecurityGroupId)",
        "IamInstanceProfile": {
          "Arn.$": "$.HelperInstanceProfileArn"
        },
        "TagSpecifications": [
          {
            "ResourceType": "instance",
            "Tags": [
              {
                "Key": "Name",
                "Value.$": "States.Format('EC2Rescue-Helper-{}', $.InstanceId)"
              },
              {
                "Key": "CreatedBy",
                "Value": "StepFunctions-EC2Rescue"
              }
            ]
          }
        ]
      },
      "ResultPath": "$.Helper",
      "ResultSelector": {
        "InstanceId.$": "$.Instances[0].InstanceId"
      },
      "Next": "WaitForHelperRunning"
    },

    "WaitForHelperRunning": {
      "Type": "Wait",
      "Seconds": 30,
      "Next": "CheckHelperRunning"
    },

    "CheckHelperRunning": {
      "Type": "Task",
      "Resource": "arn:aws:states:::aws-sdk:ec2:describeInstances",
      "Parameters": {
        "InstanceIds.$": "States.Array($.Helper.InstanceId)"
      },
      "ResultPath": "$.HelperState",
      "ResultSelector": {
        "State.$": "$.Reservations[0].Instances[0].State.Name"
      },
      "Next": "IsHelperRunning"
    },

    "IsHelperRunning": {
      "Type": "Choice",
      "Choices": [
        {
          "Variable": "$.HelperState.State",
          "StringEquals": "running",
          "Next": "WaitForSSMAgent"
        }
      ],
      "Default": "WaitForHelperRunning"
    },

    "WaitForSSMAgent": {
      "Type": "Wait",
      "Seconds": 60,
      "Comment": "Wait for Windows to boot and SSM agent to register",
      "Next": "AttachVolumeToHelper"
    },

    "AttachVolumeToHelper": {
      "Type": "Task",
      "Resource": "arn:aws:states:::aws-sdk:ec2:attachVolume",
      "Parameters": {
        "VolumeId.$": "$.InstanceDetails.VolumeId",
        "InstanceId.$": "$.Helper.InstanceId",
        "Device": "xvdf"
      },
      "ResultPath": null,
      "Next": "WaitForVolumeAttached"
    },

    "WaitForVolumeAttached": {
      "Type": "Wait",
      "Seconds": 10,
      "Next": "CheckVolumeAttached"
    },

    "CheckVolumeAttached": {
      "Type": "Task",
      "Resource": "arn:aws:states:::aws-sdk:ec2:describeVolumes",
      "Parameters": {
        "VolumeIds.$": "States.Array($.InstanceDetails.VolumeId)"
      },
      "ResultPath": "$.VolumeAttachState",
      "ResultSelector": {
        "State.$": "$.Volumes[0].Attachments[0].State"
      },
      "Next": "IsVolumeAttached"
    },

    "IsVolumeAttached": {
      "Type": "Choice",
      "Choices": [
        {
          "Variable": "$.VolumeAttachState.State",
          "StringEquals": "attached",
          "Next": "RunFixScript"
        }
      ],
      "Default": "WaitForVolumeAttached"
    },

    "RunFixScript": {
      "Type": "Task",
      "Resource": "arn:aws:states:::aws-sdk:ssm:sendCommand",
      "Parameters": {
        "InstanceIds.$": "States.Array($.Helper.InstanceId)",
        "DocumentName": "AWS-RunPowerShellScript",
        "Parameters": {
          "commands.$": "States.Array(States.Format('$script = [System.Text.Encoding]::ASCII.GetString([System.Convert]::FromBase64String(\\'{}\\'));Invoke-Expression $script', $.FixScript))"
        },
        "TimeoutSeconds": 600
      },
      "ResultPath": "$.CommandResult",
      "ResultSelector": {
        "CommandId.$": "$.Command.CommandId"
      },
      "Next": "WaitForScriptExecution"
    },

    "WaitForScriptExecution": {
      "Type": "Wait",
      "Seconds": 30,
      "Next": "CheckScriptStatus"
    },

    "CheckScriptStatus": {
      "Type": "Task",
      "Resource": "arn:aws:states:::aws-sdk:ssm:getCommandInvocation",
      "Parameters": {
        "CommandId.$": "$.CommandResult.CommandId",
        "InstanceId.$": "$.Helper.InstanceId"
      },
      "ResultPath": "$.ScriptStatus",
      "ResultSelector": {
        "Status.$": "$.Status",
        "Output.$": "$.StandardOutputContent",
        "Error.$": "$.StandardErrorContent"
      },
      "Next": "IsScriptComplete"
    },

    "IsScriptComplete": {
      "Type": "Choice",
      "Choices": [
        {
          "Variable": "$.ScriptStatus.Status",
          "StringEquals": "Success",
          "Next": "DetachVolumeFromHelper"
        },
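        {
          "Or": [
            { "Variable": "$.ScriptStatus.Status", "StringEquals": "Failed" },
            { "Variable": "$.ScriptStatus.Status", "StringEquals": "TimedOut" },
            { "Variable": "$.ScriptStatus.Status", "StringEquals": "Cancelled" }
          ],
          "Next": "ScriptFailed"
        }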
      ],
      "Default": "WaitForScriptExecution"
    },

    "ScriptFailed": {
      "Type": "Task",
      "Resource": "arn:aws:states:::aws-sdk:ec2:detachVolume",
      "Parameters": {
        "VolumeId.$": "$.InstanceDetails.VolumeId",
        "InstanceId.$": "$.Helper.InstanceId"
      },
      "ResultPath": null,
      "Next": "WaitForFailedDetach"
    },

    "WaitForFailedDetach": {
      "Type": "Wait",
      "Seconds": 15,
      "Next": "ReattachAfterFailure"
    },

    "ReattachAfterFailure": {
      "Type": "Task",
      "Resource": "arn:aws:states:::aws-sdk:ec2:attachVolume",
      "Parameters": {
        "VolumeId.$": "$.InstanceDetails.VolumeId",
        "InstanceId.$": "$.InstanceId",
        "Device.$": "$.InstanceDetails.RootDeviceName"
      },
      "ResultPath": null,
      "Next": "TerminateHelperAfterFailure"
    },

    "TerminateHelperAfterFailure": {
      "Type": "Task",
      "Resource": "arn:aws:states:::aws-sdk:ec2:terminateInstances",
      "Parameters": {
        "InstanceIds.$": "States.Array($.Helper.InstanceId)"
      },
      "ResultPath": null,
      "Next": "WorkflowFailed"
    },

    "WorkflowFailed": {
      "Type": "Fail",
      "Error": "ScriptExecutionFailed",
      "CausePath": "States.Format('Fix script failed. Output: {} Error: {}', $.ScriptStatus.Output, $.ScriptStatus.Error)"
    },

    "DetachVolumeFromHelper": {
      "Type": "Task",
      "Resource": "arn:aws:states:::aws-sdk:ec2:detachVolume",
      "Parameters": {
        "VolumeId.$": "$.InstanceDetails.VolumeId",
        "InstanceId.$": "$.Helper.InstanceId"
      },
      "ResultPath": null,
      "Next": "WaitForHelperDetach"
    },

    "WaitForHelperDetach": {
      "Type": "Wait",
      "Seconds": 10,
      "Next": "CheckHelperDetach"
    },

    "CheckHelperDetach": {
      "Type": "Task",
      "Resource": "arn:aws:states:::aws-sdk:ec2:describeVolumes",
      "Parameters": {
        "VolumeIds.$": "States.Array($.InstanceDetails.VolumeId)"
      },
      "ResultPath": "$.HelperDetachState",
      "ResultSelector": {
        "State.$": "$.Volumes[0].State"
      },
      "Next": "IsHelperDetachComplete"
    },

    "IsHelperDetachComplete": {
      "Type": "Choice",
      "Choices": [
        {
          "Variable": "$.HelperDetachState.State",
          "StringEquals": "available",
          "Next": "ReattachToOriginal"
        }
      ],
      "Default": "WaitForHelperDetach"
    },

    "ReattachToOriginal": {
      "Type": "Task",
      "Resource": "arn:aws:states:::aws-sdk:ec2:attachVolume",
      "Parameters": {
        "VolumeId.$": "$.InstanceDetails.VolumeId",
        "InstanceId.$": "$.InstanceId",
        "Device.$": "$.InstanceDetails.RootDeviceName"
      },
      "ResultPath": null,
      "Next": "WaitForReattach"
    },

    "WaitForReattach": {
      "Type": "Wait",
      "Seconds": 10,
      "Next": "CheckReattach"
    },

    "CheckReattach": {
      "Type": "Task",
      "Resource": "arn:aws:states:::aws-sdk:ec2:describeVolumes",
      "Parameters": {
        "VolumeIds.$": "States.Array($.InstanceDetails.VolumeId)"
      },
      "ResultPath": "$.ReattachState",
      "ResultSelector": {
        "State.$": "$.Volumes[0].Attachments[0].State"
      },
      "Next": "IsReattachComplete"
    },

    "IsReattachComplete": {
      "Type": "Choice",
      "Choices": [
        {
          "Variable": "$.ReattachState.State",
          "StringEquals": "attached",
          "Next": "StartOriginalInstance"
        }
      ],
      "Default": "WaitForReattach"
    },

    "StartOriginalInstance": {
      "Type": "Task",
      "Resource": "arn:aws:states:::aws-sdk:ec2:startInstances",
      "Parameters": {
        "InstanceIds.$": "States.Array($.InstanceId)"
      },
      "ResultPath": null,
      "Next": "WaitForOriginalRunning"
    },

    "WaitForOriginalRunning": {
      "Type": "Wait",
      "Seconds": 15,
      "Next": "CheckOriginalRunning"
    },

    "CheckOriginalRunning": {
      "Type": "Task",
      "Resource": "arn:aws:states:::aws-sdk:ec2:describeInstances",
      "Parameters": {
        "InstanceIds.$": "States.Array($.InstanceId)"
      },
      "ResultPath": "$.OriginalState",
      "ResultSelector": {
        "State.$": "$.Reservations[0].Instances[0].State.Name"
      },
      "Next": "IsOriginalRunning"
    },

    "IsOriginalRunning": {
      "Type": "Choice",
      "Choices": [
        {
          "Variable": "$.OriginalState.State",
          "StringEquals": "running",
          "Next": "TerminateHelper"
        }
      ],
      "Default": "WaitForOriginalRunning"
    },

    "TerminateHelper": {
      "Type": "Task",
      "Resource": "arn:aws:states:::aws-sdk:ec2:terminateInstances",
      "Parameters": {
        "InstanceIds.$": "States.Array($.Helper.InstanceId)"
      },
      "ResultPath": null,
      "Next": "RescueComplete"
    },

    "RescueComplete": {
      "Type": "Succeed"
    }
  }
}
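
One detail worth checking before you rely on the definition: `GetInstanceDetails` takes the volume from `BlockDeviceMappings[0]`, which is only the root volume when the root device happens to be listed first. A pre-flight lookup that matches on `RootDeviceName` avoids that assumption. The Python sketch below shows the idea on a trimmed, made-up DescribeInstances response; `find_root_volume_id` is a hypothetical helper, not part of the workflow.

```python
def find_root_volume_id(instances_response: dict) -> str:
    """Return the EBS volume ID mapped to the instance's root device name."""
    instance = instances_response["Reservations"][0]["Instances"][0]
    root_device = instance["RootDeviceName"]
    for mapping in instance.get("BlockDeviceMappings", []):
        if mapping["DeviceName"] == root_device and "Ebs" in mapping:
            return mapping["Ebs"]["VolumeId"]
    raise LookupError(f"no EBS mapping found for root device {root_device!r}")


# Sample response where the data volume is listed before the root volume.
response = {
    "Reservations": [{
        "Instances": [{
            "RootDeviceName": "/dev/sda1",
            "BlockDeviceMappings": [
                {"DeviceName": "xvdf", "Ebs": {"VolumeId": "vol-0data000000000000"}},
                {"DeviceName": "/dev/sda1", "Ebs": {"VolumeId": "vol-0root000000000000"}},
            ],
        }]
    }]
}
print(find_root_volume_id(response))
```

If the lookup and index 0 disagree for your instance, one option is to resolve the volume ID up front and pass it in as an extra input field, which would require a matching change to the definition.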

How the Fix Script Works

The fix script runs on the helper instance via SSM Run Command. The encrypted volume from the broken instance is attached as a secondary drive. The script needs to:

  1. Find the secondary drive (the offline volume)
  2. Bring it online and assign a drive letter
  3. Load the registry hive from the offline volume
  4. Make the fix
  5. Unload the hive
  6. Take the disk offline

Here’s a complete example script that re-enables RDP:

# fix-rdp-encrypted.ps1
# Runs on the helper instance against the attached encrypted volume

# Step 1: Find the offline disk (it's the secondary disk, not the boot disk)
$offlineDisk = Get-Disk | Where-Object { $_.OperationalStatus -eq 'Offline' }
if (-not $offlineDisk) {
    Write-Error "No offline disk found. The volume may not be attached yet."
    exit 1
}

# Step 2: Bring the disk online and assign a drive letter
Set-Disk -Number $offlineDisk.Number -IsOffline $false
Set-Disk -Number $offlineDisk.Number -IsReadOnly $false

# Find the partition with the Windows directory
$partition = Get-Partition -DiskNumber $offlineDisk.Number | Where-Object { $_.Type -ne 'System' -and $_.Size -gt 10GB }
if (-not $partition.DriveLetter) {
    $partition | Set-Partition -NewDriveLetter D
}
$driveLetter = (Get-Partition -DiskNumber $offlineDisk.Number | Where-Object { $_.DriveLetter }).DriveLetter
Write-Host "Offline volume mounted at ${driveLetter}:\"

# Step 3: Verify Windows directory exists
$registryDir = "${driveLetter}:\Windows\System32\config"
if (-not (Test-Path $registryDir)) {
    Write-Error "Windows registry directory not found at $registryDir"
    exit 1
}

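# Step 4: Load the SYSTEM registry hive and apply the fix
try {
    reg load "HKLM\OfflineSystem" "$registryDir\SYSTEM"
    # reg.exe does not throw on failure, so check its exit code explicitly
    if ($LASTEXITCODE -ne 0) { throw "reg load failed with exit code $LASTEXITCODE" }

    # Set TermService (RDP) to Automatic start
    reg add "HKLM\OfflineSystem\ControlSet001\Services\TermService" /v Start /t REG_DWORD /d 2 /f
    if ($LASTEXITCODE -ne 0) { throw "reg add failed with exit code $LASTEXITCODE" }

    Write-Host "SUCCESS: TermService start type set to Automatic"
}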
finally {
    # Always unload the hive to prevent corruption
    [GC]::Collect()
    Start-Sleep -Seconds 2
    reg unload "HKLM\OfflineSystem"
    Write-Host "Registry hive unloaded successfully"
}

# Step 5: Take the disk offline before detaching
Set-Disk -Number $offlineDisk.Number -IsOffline $true
Write-Host "Disk taken offline. Ready for detach."
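
The script signals failure with `exit 1`, which Run Command reports as a `Failed` invocation, and the workflow's polling loop branches on that status. GetCommandInvocation can report other end states as well, such as `TimedOut` or `Cancelled`, so the loop needs a rule for each. The Python sketch below spells out one possible mapping. The function name and the decision to treat `Cancelled` as a failure are my assumptions; the status strings are the ones documented for the API, and the returned names are the states from the definition above.

```python
# Status values documented for the SSM GetCommandInvocation API.
STILL_RUNNING = {"Pending", "InProgress", "Delayed", "Cancelling"}
ENDED_BADLY = {"Failed", "TimedOut", "Cancelled"}


def next_state_for(status: str) -> str:
    """Map a command invocation status to the workflow state to run next."""
    if status == "Success":
        return "DetachVolumeFromHelper"
    if status in ENDED_BADLY:
        return "ScriptFailed"
    if status in STILL_RUNNING:
        return "WaitForScriptExecution"
    raise ValueError(f"unrecognized command status: {status!r}")


for status in ("InProgress", "Success", "TimedOut"):
    print(status, "->", next_state_for(status))
```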

Disable Smart Card Logon (scforcelogon)

Swap the registry fix section for this:

# Inside the try block, replace the reg add line with:
reg load "HKLM\OfflineSoftware" "$registryDir\SOFTWARE"
reg add "HKLM\OfflineSoftware\Microsoft\Windows\CurrentVersion\Policies\System" /v scforcelogon /t REG_DWORD /d 0 /f
Write-Host "SUCCESS: scforcelogon disabled"
# In the finally block, unload OfflineSoftware instead of OfflineSystem

Disable Windows Firewall Profiles

# Inside the try block, replace the existing reg load and reg add lines with:
reg load "HKLM\OfflineSystem" "$registryDir\SYSTEM"
reg add "HKLM\OfflineSystem\ControlSet001\Services\SharedAccess\Parameters\FirewallPolicy\DomainProfile" /v EnableFirewall /t REG_DWORD /d 0 /f
reg add "HKLM\OfflineSystem\ControlSet001\Services\SharedAccess\Parameters\FirewallPolicy\StandardProfile" /v EnableFirewall /t REG_DWORD /d 0 /f
reg add "HKLM\OfflineSystem\ControlSet001\Services\SharedAccess\Parameters\FirewallPolicy\PublicProfile" /v EnableFirewall /t REG_DWORD /d 0 /f
Write-Host "SUCCESS: All firewall profiles disabled"

Base64 Encoding the Script

The state machine expects a base64-encoded script in the FixScript input parameter:

# Encode your script for the Step Functions input
$scriptContent = Get-Content -Path '.\fix-rdp-encrypted.ps1' -Raw
$base64 = [System.Convert]::ToBase64String(
    [System.Text.Encoding]::ASCII.GetBytes($scriptContent)
)
Write-Host $base64
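
If you prepare the input on a machine without PowerShell, the same encoding can be done in Python. This is a minimal sketch with hypothetical function names. It sticks to ASCII because the state machine decodes the script with the ASCII encoding, and .NET's ASCII encoder substitutes `?` for anything outside that range. Here a non-ASCII character (a curly quote pasted from a web page, for example) raises an error instead of being silently altered.

```python
import base64


def encode_fix_script(script_text: str) -> str:
    """Base64-encode a script, refusing characters that ASCII decoding would lose."""
    return base64.b64encode(script_text.encode("ascii")).decode("ascii")


def decode_fix_script(encoded: str) -> str:
    """Reverse of encode_fix_script, matching the ASCII decode done on the helper."""
    return base64.b64decode(encoded, validate=True).decode("ascii")


script = 'Write-Host "hello from the helper"\n'
encoded = encode_fix_script(script)
print(encoded)
print(decode_fix_script(encoded) == script)
```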

Deploying with CloudFormation

Here's a CloudFormation template that creates the state machine, execution role, and helper instance profile:

AWSTemplateFormatVersion: '2010-09-09'
Description: EC2 Rescue Workflow for Encrypted EBS Volumes using Step Functions

Parameters:
  StateMachineName:
    Type: String
    Default: EC2RescueEncryptedVolumes
    Description: Name for the Step Functions state machine

Resources:
  # IAM Role for Step Functions execution
  StepFunctionsExecutionRole:
    Type: AWS::IAM::Role
    Properties:
      RoleName: !Sub '${StateMachineName}-ExecutionRole'
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: states.amazonaws.com
            Action: sts:AssumeRole
      Policies:
        - PolicyName: EC2RescuePolicy
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Sid: EC2Permissions
                Effect: Allow
                Action:
                  - ec2:DescribeInstances
                  - ec2:DescribeVolumes
                  - ec2:DescribeSnapshots
                  - ec2:StopInstances
                  - ec2:StartInstances
                  - ec2:DetachVolume
                  - ec2:AttachVolume
                  - ec2:CreateSnapshot
                  - ec2:RunInstances
                  - ec2:TerminateInstances
                  - ec2:CreateTags
                Resource: '*'
              - Sid: SSMRunCommand
                Effect: Allow
                Action:
                  - ssm:SendCommand
                  - ssm:GetCommandInvocation
                  - ssm:DescribeInstanceInformation
                Resource: '*'
              - Sid: KMSForEncryptedVolumes
                Effect: Allow
                Action:
                  - kms:Decrypt
                  - kms:DescribeKey
                  - kms:CreateGrant
                  - kms:GenerateDataKeyWithoutPlaintext
                  - kms:ReEncryptFrom
                  - kms:ReEncryptTo
                Resource: '*'
              - Sid: PassRoleForHelper
                Effect: Allow
                Action: iam:PassRole
                Resource: !GetAtt HelperInstanceRole.Arn

  # IAM Role for the helper EC2 instance (SSM agent)
  HelperInstanceRole:
    Type: AWS::IAM::Role
    Properties:
      RoleName: !Sub '${StateMachineName}-HelperRole'
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: ec2.amazonaws.com
            Action: sts:AssumeRole
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore

  HelperInstanceProfile:
    Type: AWS::IAM::InstanceProfile
    Properties:
      InstanceProfileName: !Sub '${StateMachineName}-HelperProfile'
      Roles:
        - !Ref HelperInstanceRole

  # Step Functions State Machine
  EC2RescueStateMachine:
    Type: AWS::StepFunctions::StateMachine
    Properties:
      StateMachineName: !Ref StateMachineName
      RoleArn: !GetAtt StepFunctionsExecutionRole.Arn
      Definition:
        Comment: EC2 Rescue Workflow for Encrypted EBS Volumes
        StartAt: GetInstanceDetails
        States:
          GetInstanceDetails:
            Type: Task
            Resource: arn:aws:states:::aws-sdk:ec2:describeInstances
            Parameters:
              InstanceIds.$: States.Array($.InstanceId)
            ResultPath: $.InstanceDetails
            ResultSelector:
              AvailabilityZone.$: $.Reservations[0].Instances[0].Placement.AvailabilityZone
              RootDeviceName.$: $.Reservations[0].Instances[0].RootDeviceName
              VolumeId.$: $.Reservations[0].Instances[0].BlockDeviceMappings[0].Ebs.VolumeId
            Next: CreateBackupSnapshot
          CreateBackupSnapshot:
            Type: Task
            Resource: arn:aws:states:::aws-sdk:ec2:createSnapshot
            Parameters:
              VolumeId.$: $.InstanceDetails.VolumeId
              Description.$: "States.Format('EC2Rescue backup - {}', $.InstanceId)"
              TagSpecifications:
                - ResourceType: snapshot
                  Tags:
                    - Key: Name
                      Value.$: "States.Format('EC2Rescue-Backup-{}', $.InstanceId)"
                    - Key: CreatedBy
                      Value: StepFunctions-EC2Rescue
            ResultPath: $.Snapshot
            ResultSelector:
              SnapshotId.$: $.SnapshotId
            Next: StopInstance
          StopInstance:
            Type: Task
            Resource: arn:aws:states:::aws-sdk:ec2:stopInstances
            Parameters:
              InstanceIds.$: States.Array($.InstanceId)
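            # CloudFormation rejects null values in templates, so store the result instead of discarding it
            ResultPath: $.StopInstanceResult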
            Next: WaitForInstanceStopped
          WaitForInstanceStopped:
            Type: Wait
            Seconds: 15
            Next: CheckInstanceStopped
          CheckInstanceStopped:
            Type: Task
            Resource: arn:aws:states:::aws-sdk:ec2:describeInstances
            Parameters:
              InstanceIds.$: States.Array($.InstanceId)
            ResultPath: $.InstanceState
            ResultSelector:
              State.$: $.Reservations[0].Instances[0].State.Name
            Next: IsInstanceStopped
          IsInstanceStopped:
            Type: Choice
            Choices:
              - Variable: $.InstanceState.State
                StringEquals: stopped
                Next: DetachRootVolume
            Default: WaitForInstanceStopped
          DetachRootVolume:
            Type: Task
            Resource: arn:aws:states:::aws-sdk:ec2:detachVolume
            Parameters:
              VolumeId.$: $.InstanceDetails.VolumeId
              InstanceId.$: $.InstanceId
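            # CloudFormation rejects null values in templates, so store the result instead of discarding it
            ResultPath: $.DetachRootVolumeResult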
            Next: WaitForVolumeDetached
          WaitForVolumeDetached:
            Type: Wait
            Seconds: 10
            Next: CheckVolumeDetached
          CheckVolumeDetached:
            Type: Task
            Resource: arn:aws:states:::aws-sdk:ec2:describeVolumes
            Parameters:
              VolumeIds.$: States.Array($.InstanceDetails.VolumeId)
            ResultPath: $.VolumeState
            ResultSelector:
              State.$: $.Volumes[0].State
            Next: IsVolumeDetached
          IsVolumeDetached:
            Type: Choice
            Choices:
              - Variable: $.VolumeState.State
                StringEquals: available
                Next: LaunchHelperInstance
            Default: WaitForVolumeDetached
          LaunchHelperInstance:
            Type: Task
            Resource: arn:aws:states:::aws-sdk:ec2:runInstances
            Parameters:
              ImageId.$: $.HelperAmiId
              InstanceType.$: $.HelperInstanceType
              MinCount: 1
              MaxCount: 1
              SubnetId.$: $.HelperSubnetId
              SecurityGroupIds.$: States.Array($.HelperSecurityGroupId)
              IamInstanceProfile:
                Arn.$: $.HelperInstanceProfileArn
              TagSpecifications:
                - ResourceType: instance
                  Tags:
                    - Key: Name
                      Value.$: "States.Format('EC2Rescue-Helper-{}', $.InstanceId)"
                    - Key: CreatedBy
                      Value: StepFunctions-EC2Rescue
            ResultPath: $.Helper
            ResultSelector:
              InstanceId.$: $.Instances[0].InstanceId
            Next: WaitForHelperRunning
          WaitForHelperRunning:
            Type: Wait
            Seconds: 30
            Next: CheckHelperRunning
          CheckHelperRunning:
            Type: Task
            Resource: arn:aws:states:::aws-sdk:ec2:describeInstances
            Parameters:
              InstanceIds.$: States.Array($.Helper.InstanceId)
            ResultPath: $.HelperState
            ResultSelector:
              State.$: $.Reservations[0].Instances[0].State.Name
            Next: IsHelperRunning
          IsHelperRunning:
            Type: Choice
            Choices:
              - Variable: $.HelperState.State
                StringEquals: running
                Next: WaitForSSMAgent
            Default: WaitForHelperRunning
          WaitForSSMAgent:
            Type: Wait
            Seconds: 60
            Comment: Wait for Windows to boot and SSM agent to register
            Next: AttachVolumeToHelper
          AttachVolumeToHelper:
            Type: Task
            Resource: arn:aws:states:::aws-sdk:ec2:attachVolume
            Parameters:
              VolumeId.$: $.InstanceDetails.VolumeId
              InstanceId.$: $.Helper.InstanceId
              Device: xvdf
            ResultPath: null
            Next: WaitForVolumeAttached
          WaitForVolumeAttached:
            Type: Wait
            Seconds: 10
            Next: CheckVolumeAttached
          CheckVolumeAttached:
            Type: Task
            Resource: arn:aws:states:::aws-sdk:ec2:describeVolumes
            Parameters:
              VolumeIds.$: States.Array($.InstanceDetails.VolumeId)
            ResultPath: $.VolumeAttachState
            ResultSelector:
              State.$: $.Volumes[0].Attachments[0].State
            Next: IsVolumeAttached
          IsVolumeAttached:
            Type: Choice
            Choices:
              - Variable: $.VolumeAttachState.State
                StringEquals: attached
                Next: RunFixScript
            Default: WaitForVolumeAttached
          RunFixScript:
            Type: Task
            Resource: arn:aws:states:::aws-sdk:ssm:sendCommand
            Parameters:
              InstanceIds.$: States.Array($.Helper.InstanceId)
              DocumentName: AWS-RunPowerShellScript
              Parameters:
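                # \\' is a YAML-escaped backslash, so Step Functions receives \' (an escaped single quote inside States.Format)
                commands.$: "States.Array(States.Format('$script = [System.Text.Encoding]::ASCII.GetString([System.Convert]::FromBase64String(\\'{}\\')); Invoke-Expression $script', $.FixScript))"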
              TimeoutSeconds: 600
            ResultPath: $.CommandResult
            ResultSelector:
              CommandId.$: $.Command.CommandId
            Next: WaitForScriptExecution
          WaitForScriptExecution:
            Type: Wait
            Seconds: 30
            Next: CheckScriptStatus
          CheckScriptStatus:
            Type: Task
            Resource: arn:aws:states:::aws-sdk:ssm:getCommandInvocation
            Parameters:
              CommandId.$: $.CommandResult.CommandId
              InstanceId.$: $.Helper.InstanceId
            ResultPath: $.ScriptStatus
            ResultSelector:
              Status.$: $.Status
              Output.$: $.StandardOutputContent
              Error.$: $.StandardErrorContent
            Next: IsScriptComplete
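          IsScriptComplete:
            Type: Choice
            Choices:
              - Variable: $.ScriptStatus.Status
                StringEquals: Success
                Next: DetachVolumeFromHelper
              # Treat every terminal non-success status as a failure so the polling loop cannot spin forever
              - Or:
                  - Variable: $.ScriptStatus.Status
                    StringEquals: Failed
                  - Variable: $.ScriptStatus.Status
                    StringEquals: TimedOut
                  - Variable: $.ScriptStatus.Status
                    StringEquals: Cancelled
                Next: ScriptFailed
            Default: WaitForScriptExecution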
          ScriptFailed:
            Type: Task
            Resource: arn:aws:states:::aws-sdk:ec2:detachVolume
            Parameters:
              VolumeId.$: $.InstanceDetails.VolumeId
              InstanceId.$: $.Helper.InstanceId
            ResultPath: null
            Next: WaitForFailedDetach
          WaitForFailedDetach:
            Type: Wait
            Seconds: 15
            Next: ReattachAfterFailure
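          ReattachAfterFailure:
            Type: Task
            Resource: arn:aws:states:::aws-sdk:ec2:attachVolume
            Parameters:
              VolumeId.$: $.InstanceDetails.VolumeId
              InstanceId.$: $.InstanceId
              Device.$: $.InstanceDetails.RootDeviceName
            ResultPath: null
            # The fixed 15-second wait may be too short; retry while the volume is still detaching
            Retry:
              - ErrorEquals: [States.TaskFailed]
                IntervalSeconds: 10
                MaxAttempts: 6
                BackoffRate: 1.5
            Next: TerminateHelperAfterFailure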
          TerminateHelperAfterFailure:
            Type: Task
            Resource: arn:aws:states:::aws-sdk:ec2:terminateInstances
            Parameters:
              InstanceIds.$: States.Array($.Helper.InstanceId)
            ResultPath: null
            Next: WorkflowFailed
          WorkflowFailed:
            Type: Fail
            Error: ScriptExecutionFailed
            Cause: Fix script failed - check execution output for details
          DetachVolumeFromHelper:
            Type: Task
            Resource: arn:aws:states:::aws-sdk:ec2:detachVolume
            Parameters:
              VolumeId.$: $.InstanceDetails.VolumeId
              InstanceId.$: $.Helper.InstanceId
            ResultPath: null
            Next: WaitForHelperDetach
          WaitForHelperDetach:
            Type: Wait
            Seconds: 10
            Next: CheckHelperDetach
          CheckHelperDetach:
            Type: Task
            Resource: arn:aws:states:::aws-sdk:ec2:describeVolumes
            Parameters:
              VolumeIds.$: States.Array($.InstanceDetails.VolumeId)
            ResultPath: $.HelperDetachState
            ResultSelector:
              State.$: $.Volumes[0].State
            Next: IsHelperDetachComplete
          IsHelperDetachComplete:
            Type: Choice
            Choices:
              - Variable: $.HelperDetachState.State
                StringEquals: available
                Next: ReattachToOriginal
            Default: WaitForHelperDetach
          ReattachToOriginal:
            Type: Task
            Resource: arn:aws:states:::aws-sdk:ec2:attachVolume
            Parameters:
              VolumeId.$: $.InstanceDetails.VolumeId
              InstanceId.$: $.InstanceId
              Device.$: $.InstanceDetails.RootDeviceName
            ResultPath: null
            Next: WaitForReattach
          WaitForReattach:
            Type: Wait
            Seconds: 10
            Next: CheckReattach
          CheckReattach:
            Type: Task
            Resource: arn:aws:states:::aws-sdk:ec2:describeVolumes
            Parameters:
              VolumeIds.$: States.Array($.InstanceDetails.VolumeId)
            ResultPath: $.ReattachState
            ResultSelector:
              State.$: $.Volumes[0].Attachments[0].State
            Next: IsReattachComplete
          IsReattachComplete:
            Type: Choice
            Choices:
              - Variable: $.ReattachState.State
                StringEquals: attached
                Next: StartOriginalInstance
            Default: WaitForReattach
          StartOriginalInstance:
            Type: Task
            Resource: arn:aws:states:::aws-sdk:ec2:startInstances
            Parameters:
              InstanceIds.$: States.Array($.InstanceId)
            ResultPath: null
            Next: WaitForOriginalRunning
          WaitForOriginalRunning:
            Type: Wait
            Seconds: 15
            Next: CheckOriginalRunning
          CheckOriginalRunning:
            Type: Task
            Resource: arn:aws:states:::aws-sdk:ec2:describeInstances
            Parameters:
              InstanceIds.$: States.Array($.InstanceId)
            ResultPath: $.OriginalState
            ResultSelector:
              State.$: $.Reservations[0].Instances[0].State.Name
            Next: IsOriginalRunning
          IsOriginalRunning:
            Type: Choice
            Choices:
              - Variable: $.OriginalState.State
                StringEquals: running
                Next: TerminateHelper
            Default: WaitForOriginalRunning
          TerminateHelper:
            Type: Task
            Resource: arn:aws:states:::aws-sdk:ec2:terminateInstances
            Parameters:
              InstanceIds.$: States.Array($.Helper.InstanceId)
            ResultPath: null
            Next: RescueComplete
          RescueComplete:
            Type: Succeed

Outputs:
  StateMachineArn:
    Value: !Ref EC2RescueStateMachine
    Description: ARN of the EC2 Rescue state machine
  HelperInstanceProfileArn:
    Value: !GetAtt HelperInstanceProfile.Arn
    Description: ARN of the helper instance profile (use in state machine input)

Running the Workflow

Step 1: Deploy the CloudFormation Stack

aws cloudformation deploy \
  --template-file ec2-rescue-step-functions.yaml \
  --stack-name ec2-rescue-encrypted \
  --capabilities CAPABILITY_NAMED_IAM

Step 2: Get the Helper Instance Profile ARN

aws cloudformation describe-stacks \
  --stack-name ec2-rescue-encrypted \
  --query "Stacks[0].Outputs[?OutputKey=='HelperInstanceProfileArn'].OutputValue" \
  --output text

Step 3: Prepare Your Fix Script

Save your PowerShell fix script (like the RDP fix above) and base64-encode it. The state machine decodes the payload as ASCII, so keep the script ASCII-only:

$base64 = [System.Convert]::ToBase64String(
    [System.Text.Encoding]::ASCII.GetBytes(
        (Get-Content -Path '.\fix-rdp-encrypted.ps1' -Raw)
    )
)
Write-Host $base64
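
If you are encoding from a machine without PowerShell, the same payload can be produced elsewhere. The following is a minimal Python sketch, not part of the original workflow; the function names are hypothetical. It rejects non-ASCII input because the helper decodes with [System.Text.Encoding]::ASCII.

```python
import base64


def encode_fix_script(script_text: str) -> str:
    """Return the base64 form of an ASCII-only PowerShell script."""
    # encode("ascii") raises UnicodeEncodeError on non-ASCII characters,
    # which is preferable to having them silently mangled on the helper.
    raw = script_text.encode("ascii")
    return base64.b64encode(raw).decode("ascii")


def decode_fix_script(payload: str) -> str:
    """Mirror the decode step the helper performs before Invoke-Expression."""
    return base64.b64decode(payload).decode("ascii")


sample = "Write-Output 'hello from the helper'"
payload = encode_fix_script(sample)
# Round-trip check: what the helper decodes must equal what was encoded.
assert decode_fix_script(payload) == sample
print(payload)
```

Running the round-trip check before an incident is a cheap way to confirm that the encoded string decodes back to the script you intended.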

Step 4: Start the Execution

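You'll need the instance ID of the broken instance, a Windows AMI ID in the same region, and network details for the helper instance. The helper's subnet must be in the same Availability Zone as the target, because an EBS volume can only be attached to instances in its own AZ: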

aws stepfunctions start-execution \
  --state-machine-arn "arn:aws:states:us-east-1:123456789012:stateMachine:EC2RescueEncryptedVolumes" \
  --input '{
    "InstanceId": "i-0abc123def456",
    "HelperAmiId": "ami-0abcdef1234567890",
    "HelperInstanceType": "t3.medium",
    "HelperSubnetId": "subnet-0abc123",
    "HelperSecurityGroupId": "sg-0abc123",
    "HelperInstanceProfileArn": "arn:aws:iam::123456789012:instance-profile/EC2RescueEncryptedVolumes-HelperProfile",
    "FixScript": "YOUR_BASE64_ENCODED_SCRIPT_HERE"
  }'
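
A typo in a hand-written JSON input only surfaces several states into the run, so it can help to build and validate the input in code. This is a sketch under assumptions: build_rescue_input and start_rescue are hypothetical helper names, and sfn_client is assumed to be a boto3 Step Functions client (for example boto3.client("stepfunctions")) passed in by the caller.

```python
import json

# Field names taken from the execution input shown above.
REQUIRED_KEYS = (
    "InstanceId",
    "HelperAmiId",
    "HelperInstanceType",
    "HelperSubnetId",
    "HelperSecurityGroupId",
    "HelperInstanceProfileArn",
    "FixScript",
)


def build_rescue_input(**fields: str) -> str:
    """Serialize the execution input, failing early if a field is missing or empty."""
    missing = [key for key in REQUIRED_KEYS if not fields.get(key)]
    if missing:
        raise ValueError("missing input fields: " + ", ".join(missing))
    return json.dumps({key: fields[key] for key in REQUIRED_KEYS})


def start_rescue(sfn_client, state_machine_arn: str, rescue_input: str) -> str:
    """Start the execution and return its ARN for later monitoring."""
    response = sfn_client.start_execution(
        stateMachineArn=state_machine_arn,
        input=rescue_input,
    )
    return response["executionArn"]
```

Passing the client in, rather than creating it inside the function, keeps the helper usable with whatever credentials or region your incident tooling already configures.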

Step 5: Monitor in the Console

Open the Step Functions console and click on the running execution. You'll see each state light up in green as it completes. If a state is in a polling loop (waiting for an instance to stop, a volume to detach, etc.), you'll see it cycle between the Wait and Check states.

The entire workflow typically completes in 8-12 minutes, depending on how long Windows takes to boot on the helper instance.
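
If you would rather wait on the result from a terminal than watch the console, the execution status can be polled. The sketch below is an illustration, not part of the original post: wait_for_execution is a hypothetical name, sfn_client is assumed to be a boto3 Step Functions client, and the 30-minute default timeout is an arbitrary choice that comfortably exceeds the typical runtime quoted above.

```python
import time

# Execution statuses after which no further progress will happen.
TERMINAL_STATUSES = {"SUCCEEDED", "FAILED", "TIMED_OUT", "ABORTED"}


def wait_for_execution(sfn_client, execution_arn, poll_seconds=15,
                       timeout_seconds=1800, sleep=time.sleep):
    """Poll describe_execution until the run reaches a terminal status."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = sfn_client.describe_execution(executionArn=execution_arn)["status"]
        if status in TERMINAL_STATUSES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"execution still {status} after {timeout_seconds}s")
        sleep(poll_seconds)
```

The sleep parameter exists only so the loop can be exercised without real delays; in normal use the default is fine.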

KMS Considerations

AWS-Managed Keys (aws/ebs)

If your encrypted volumes use the default aws/ebs AWS-managed key, no additional KMS configuration is needed beyond the IAM policy above. The key policy of an AWS-managed key cannot be edited, and it already lets principals in the account use the key through EBS, as long as their IAM permissions allow the underlying EC2 calls. The Step Functions role can therefore detach and attach the volume, and the helper instance can read it, without any key policy changes.

Customer-Managed KMS Keys

If the volume is encrypted with a customer-managed KMS key, you need to ensure the key policy allows the Step Functions execution role to create grants. Add this statement to your KMS key policy:

{
  "Sid": "AllowStepFunctionsEC2Rescue",
  "Effect": "Allow",
  "Principal": {
    "AWS": "arn:aws:iam::123456789012:role/EC2RescueEncryptedVolumes-ExecutionRole"
  },
  "Action": [
    "kms:Decrypt",
    "kms:DescribeKey",
    "kms:CreateGrant",
    "kms:GenerateDataKeyWithoutPlaintext"
  ],
  "Resource": "*"
}
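
Key policies are replaced as a whole document, so the statement has to be merged into the existing policy rather than written on its own. The following sketch shows one way to do that merge idempotently; add_rescue_statement is a hypothetical helper, and fetching and writing the policy (for example with the KMS get_key_policy and put_key_policy calls) is left to the caller.

```python
import json

# Same actions as the key policy statement shown above.
RESCUE_ACTIONS = [
    "kms:Decrypt",
    "kms:DescribeKey",
    "kms:CreateGrant",
    "kms:GenerateDataKeyWithoutPlaintext",
]


def add_rescue_statement(policy_json: str, role_arn: str,
                         sid: str = "AllowStepFunctionsEC2Rescue") -> str:
    """Return the key policy with the rescue statement added exactly once."""
    policy = json.loads(policy_json)
    existing = policy.get("Statement", [])
    if isinstance(existing, dict):  # a single statement may be stored as an object
        existing = [existing]
    statement = {
        "Sid": sid,
        "Effect": "Allow",
        "Principal": {"AWS": role_arn},
        "Action": RESCUE_ACTIONS,
        "Resource": "*",
    }
    # Drop any earlier copy with the same Sid so re-running does not duplicate it.
    policy["Statement"] = [s for s in existing if s.get("Sid") != sid] + [statement]
    return json.dumps(policy, indent=2)
```

Depending on your security posture, you may also want to scope kms:CreateGrant with the kms:GrantIsForAWSResource condition key; that is an optional tightening and not something the workflow above requires.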

Cross-Account KMS Keys

If the KMS key lives in a different account (e.g., a centralized key management account), both the key policy in the source account and the IAM policy in the local account need to grant access. The key policy in the source account must trust the local account's Step Functions role, and the local IAM policy must reference the cross-account key ARN.
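
For the local half of that pairing, the IAM policy attached to the Step Functions role has to name the remote key by its full ARN, since aliases and bare key IDs do not resolve across accounts. A minimal sketch of generating that policy document, assuming the same four KMS actions as the key policy statement above (local_iam_policy_for_key is a hypothetical name):

```python
import json


def local_iam_policy_for_key(key_arn: str) -> str:
    """Build an IAM policy document granting the rescue actions on one remote key."""
    parts = key_arn.split(":")
    # Expect arn:<partition>:kms:<region>:<account>:key/<id>
    if len(parts) != 6 or parts[0] != "arn" or parts[2] != "kms" or not parts[5].startswith("key/"):
        raise ValueError("expected a full KMS key ARN, not an alias or key ID")
    document = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "kms:Decrypt",
                    "kms:DescribeKey",
                    "kms:CreateGrant",
                    "kms:GenerateDataKeyWithoutPlaintext",
                ],
                "Resource": key_arn,
            }
        ],
    }
    return json.dumps(document, indent=2)
```

The key policy in the key-owning account still has to allow the role; this document alone grants nothing.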

Error Handling and Rollback

The state machine includes built-in compensation logic:

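  • Script failure — If the fix script fails on the helper, the workflow automatically detaches the volume, reattaches it to the original instance, terminates the helper, and transitions to a Fail state. The Fail state's Cause is a fixed message; the script's stdout and stderr are captured under $.ScriptStatus and are visible in the output of the CheckScriptStatus step. The original instance is left stopped so you can investigate before starting it.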
  • Backup snapshot — The very first step creates a snapshot before any changes. If something goes catastrophically wrong, you can always create a new volume from this snapshot.
  • Tagged resources — The helper instance and snapshot are tagged with CreatedBy: StepFunctions-EC2Rescue so you can easily find and clean up any orphaned resources.
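
The tag makes leftover helpers easy to find after an aborted run. A sketch of that lookup, assuming ec2_client is a boto3 EC2 client supplied by the caller (find_orphaned_helpers is a hypothetical name):

```python
def find_orphaned_helpers(ec2_client, tag_value="StepFunctions-EC2Rescue"):
    """Return IDs of non-terminated instances carrying the CreatedBy tag."""
    kwargs = {
        "Filters": [
            {"Name": "tag:CreatedBy", "Values": [tag_value]},
            # Skip instances that are already terminated or shutting down.
            {"Name": "instance-state-name",
             "Values": ["pending", "running", "stopping", "stopped"]},
        ]
    }
    instance_ids = []
    while True:
        page = ec2_client.describe_instances(**kwargs)
        for reservation in page.get("Reservations", []):
            for instance in reservation.get("Instances", []):
                instance_ids.append(instance["InstanceId"])
        token = page.get("NextToken")
        if not token:
            return instance_ids
        kwargs["NextToken"] = token  # continue with the next page of results
```

Anything this returns while no execution is running is a candidate for manual termination.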

For production use, consider adding:

  • SNS notifications on success/failure (add a Publish task before the terminal states)
  • Timeout on the overall execution (set TimeoutSeconds on the state machine definition)
  • Retry policies on individual API calls (add Retry blocks for throttling and transient errors)
  • CloudWatch alarms on failed executions
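
Two of these additions, the overall timeout and the retry policies, can be applied mechanically to the state machine definition. The sketch below works on the definition as a Python dict (for example, the template's Definition loaded from YAML or JSON). It is an illustration under assumptions: harden_definition is a hypothetical name, the retry values are arbitrary, and retries are added only to read-only describe/get calls, because blindly retrying a call such as runInstances could launch a second helper.

```python
import copy

# Illustrative retry values; tune them for your account's API limits.
RETRY_POLICY = {
    "ErrorEquals": ["States.TaskFailed"],
    "IntervalSeconds": 5,
    "MaxAttempts": 3,
    "BackoffRate": 2.0,
}


def harden_definition(definition: dict, timeout_seconds: int = 3600) -> dict:
    """Return a copy with an execution timeout and retries on read-only tasks."""
    hardened = copy.deepcopy(definition)
    hardened.setdefault("TimeoutSeconds", timeout_seconds)
    for state in hardened.get("States", {}).values():
        if state.get("Type") != "Task" or "Retry" in state:
            continue
        # The API action is the last segment of the integration ARN.
        action = state.get("Resource", "").rsplit(":", 1)[-1]
        if action.startswith(("describe", "get")):
            state["Retry"] = [copy.deepcopy(RETRY_POLICY)]
    return hardened
```

States that already define Retry are left alone, so hand-tuned policies are not overwritten.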

Comparison: SSM Runbooks vs Step Functions

| Feature | SSM Runbooks | Step Functions Workflow |
| --- | --- | --- |
| Encrypted volumes | Not supported | Fully supported (AWS-managed and customer-managed keys) |
| Setup effort | Zero (AWS-provided) | Deploy CloudFormation stack + IAM roles |
| Customization | Limited to base64 script parameter | Full control over every step |
| Visual monitoring | Step-by-step in SSM console | Visual workflow graph in Step Functions console |
| Error handling | Basic (fails at encrypted check) | Custom rollback and compensation logic |
| Cost | Free (SSM is free, pay for helper instance) | Billed per state transition (about $0.025 per 1,000 for Standard Workflows, so typically well under a cent per run) + helper instance time |
| Lambda required | Yes (internally) | No (native SDK integrations) |
| Backup AMI | Automatic | Snapshot (you can add AMI creation if needed) |
| When to use | Unencrypted volumes, quick fix | Encrypted volumes, custom workflows, compliance requirements |
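
The Step Functions cost per run depends on how many state transitions the execution makes, and the polling loops dominate that count: each pass through a Wait, Check, and Choice trio is three transitions. The arithmetic can be sketched as follows. The price of $0.025 per 1,000 transitions is an assumed Standard Workflow list price (check current pricing for your region), and the state counts in the example are illustrative, not measured from a real run.

```python
def estimate_transitions(linear_states: int, poll_iterations: int) -> int:
    """Each polling pass runs three states: Wait, Check (Task), and Choice."""
    return linear_states + 3 * poll_iterations


def estimate_cost(transitions: int, price_per_1000: float = 0.025) -> float:
    """Cost in USD for one execution at the assumed per-transition price."""
    return transitions * price_per_1000 / 1000


# Illustrative numbers: 20 one-off states plus 30 polling passes in total.
transitions = estimate_transitions(linear_states=20, poll_iterations=30)
print(transitions, f"${estimate_cost(transitions):.5f}")
```

Under these assumptions the helper instance's runtime, not the state machine, is the larger share of the bill.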

Testing the Workflow

Before you need this at 2 AM, test it on a throwaway instance:

  1. Launch a test instance with an encrypted root volume (enable “Encrypt this volume” in the launch wizard or use account-level default encryption)
  2. Break something — disable RDP, enable scforceoption (forced smart card logon), block RDP in Windows Firewall
  3. Run the workflow with the appropriate fix script
  4. Verify you can RDP back in after the workflow completes
  5. Clean up — terminate the test instance, delete the backup snapshot

Pro Tip

Keep a library of base64-encoded fix scripts in S3 or Parameter Store. When an incident happens, you just grab the right script and paste it into the Step Functions input — no scrambling to write and encode a script under pressure.
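
Pulling a stored script can itself be scripted so that nothing is pasted by hand. A minimal sketch for the Parameter Store variant, assuming ssm_client is a boto3 SSM client and that the parameter name (something like /ec2rescue/fix-scripts/rdp) follows a naming scheme of your own; both the function name and the path are hypothetical.

```python
import base64
import binascii


def load_fix_script(ssm_client, parameter_name: str) -> str:
    """Fetch a stored base64 fix script and confirm it is valid base64."""
    response = ssm_client.get_parameter(Name=parameter_name, WithDecryption=True)
    payload = response["Parameter"]["Value"].strip()
    try:
        # validate=True rejects characters outside the base64 alphabet.
        base64.b64decode(payload, validate=True)
    except (binascii.Error, ValueError) as exc:
        raise ValueError(f"{parameter_name} does not contain valid base64") from exc
    return payload
```

The returned string can go straight into the FixScript field of the execution input. Keep in mind that Parameter Store values are size-limited (4 KB for standard parameters, 8 KB for advanced ones) and that base64 inflates a script by about a third, so longer scripts may fit better in S3.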

Conclusion

The AWS-provided EC2Rescue runbooks are excellent for unencrypted volumes, but as more organizations adopt EBS encryption by default, the gap is real. Building your own rescue workflow with Step Functions gives you:

  • Encrypted volume support — The whole reason we're here
  • Full KMS integration — Works with AWS-managed keys, customer-managed keys, and cross-account keys
  • No Lambda functions — Every API call is a native Step Functions SDK integration
  • Visual workflow monitoring — Watch each state execute in real time
  • Automated rollback — If the fix script fails, the volume goes back where it came from
  • One CloudFormation stack — Deploy once, use whenever you need it

The workflow handles the same tedious volume-swap dance that the SSM runbooks do — it just doesn't bail out when it sees an encrypted volume. Deploy it before you need it, test it on a throwaway instance, and keep your fix scripts ready. Your future 2 AM self will thank you.

For the companion post covering SSM runbooks for unencrypted volumes, see Fixing Broken Windows EC2 Instances with Offline Registry Edits via SSM Automation.
