Fixing Broken Windows EC2 Instances with Offline Registry Edits via SSM Automation

Tags: AWS, EC2, Windows, Automation, Troubleshooting
TL;DR

When a Windows EC2 instance won't boot or you're locked out due to a bad registry change, AWS Systems Manager provides automation runbooks that handle the entire offline rescue workflow for you. AWSSupport-ExecuteEC2Rescue auto-fixes common issues (RDP, firewall, services). AWSSupport-StartEC2RescueWorkflow lets you run custom PowerShell scripts against an offline volume — including loading and editing registry hives. This guide covers both approaches plus a hands-on test you can run in your own account.

Introduction

A Windows EC2 instance that won't boot or accept RDP connections is one of the most stressful scenarios in cloud operations. Maybe someone pushed a bad Group Policy, disabled the RDP service, or a driver update corrupted the boot sequence. One particularly nasty example: the scforceoption registry value (the "Interactive logon: Require smart card" policy), which enforces smart card authentication for all interactive logons. When networking, clock drift, or PKI issues prevent certificate verification, nobody can log in, not even local administrators. On physical hardware, you'd boot into safe mode and flip the registry key. In AWS, there’s no console or KVM access to do that, and if the SSM agent isn’t reachable either, you’re stuck.

The traditional fix looks like this:

  1. Stop the broken instance
  2. Detach the root EBS volume
  3. Launch a temporary “rescue” instance
  4. Attach the volume as a secondary drive
  5. Load the offline registry hive and make your fix
  6. Detach, reattach to the original instance
  7. Start the original instance and hope it works

That’s a 7-step manual process with plenty of room for error. AWS Systems Manager can automate the entire thing with a single click.

This post covers two SSM automation runbooks:

  • AWSSupport-ExecuteEC2Rescue — One-click automated repair for common Windows issues
  • AWSSupport-StartEC2RescueWorkflow — Custom PowerShell scripts against an offline volume (for when you need full control)

Plus a hands-on walkthrough where we intentionally break an instance and fix it with SSM.

Prerequisites

  • AWS account with Systems Manager access
  • IAM permissions for SSM Automation (see IAM Permissions section below)
  • Target instance must be EBS-backed (instance store not supported)
  • Root volume must be unencrypted (encrypted volumes are not supported by either runbook)
  • Instance must not be from an AWS Marketplace AMI

Option 1: AWSSupport-ExecuteEC2Rescue (Automated Fix)

This is the “easy button.” It automatically diagnoses and repairs common Windows connectivity issues without you needing to write any scripts. Behind the scenes, it spins up a helper instance, mounts your volume, runs EC2Rescue, and puts everything back.

What It Auto-Fixes

Category | What It Fixes
Remote Desktop (RDP) | Enables RDP service (sets to Automatic start), enables Remote Desktop connections, verifies TCP port 3389
Windows Firewall | Detects and resets firewall profiles (Domain, Private, Public)
Network Interface | Fixes DHCP service startup
System Time | Fixes the RealTimeIsUniversal registry key (prevents clock drift)
EC2Config / EC2Launch | Fixes service startup, password generation, user data execution
Disk Signature | Compares disk signature with BCD and corrects mismatches (fixes boot failures from cloned volumes)
Registry Restore | Can restore registry from backup (\Windows\System32\config\RegBack)
Boot Config | Can set instance to boot to Last Known Good Configuration

Parameters

Parameter | Required | Default | Description
UnreachableInstanceId | Yes | - | ID of the broken instance
EC2RescueInstanceType | No | t2.small | Helper instance type (t2.small, t2.medium, t2.large)
SubnetId | No | CreateNewVPC | CreateNewVPC, SelectedInstanceSubnet, or a specific subnet ID (must be same AZ)
LogDestination | No | - | S3 bucket name for troubleshooting logs
AutomationAssumeRole | No | - | IAM role ARN for the automation

Step-by-Step in the Console

  1. Open AWS Systems Manager → Automation (left sidebar)
  2. Click Execute automation
  3. Under Owned by Amazon, search for AWSSupport-ExecuteEC2Rescue
  4. Select it and click Next
  5. Choose Simple execution
  6. Fill in UnreachableInstanceId with your instance ID (e.g., i-0abc123def456)
  7. Optionally set LogDestination to an S3 bucket for detailed logs
  8. Click Execute
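
The same execution can also be started from a script. The following is a minimal sketch, assuming the AWS Tools for PowerShell module AWS.Tools.SimpleSystemsManagement is installed and that credentials and a default Region are already configured. The function names are made up for this example, and the instance ID is a placeholder.

```powershell
# Hypothetical helper: builds the parameter hashtable for AWSSupport-ExecuteEC2Rescue.
function New-ExecuteEC2RescueParameter {
    param(
        [Parameter(Mandatory)] [string] $UnreachableInstanceId,
        [string] $LogDestination,
        [string] $SubnetId
    )
    $params = @{ UnreachableInstanceId = $UnreachableInstanceId }
    # Optional values are only sent when set, so the runbook defaults apply otherwise.
    if ($LogDestination) { $params.LogDestination = $LogDestination }
    if ($SubnetId)       { $params.SubnetId = $SubnetId }
    return $params
}

# Hypothetical wrapper: starts the runbook and returns the execution ID.
# Requires AWS credentials, so nothing is called until you invoke it.
function Start-ExecuteEC2Rescue {
    param([Parameter(Mandatory)] [hashtable] $Parameter)
    Start-SSMAutomationExecution -DocumentName 'AWSSupport-ExecuteEC2Rescue' -Parameter $Parameter
}

# Example:
# $p = New-ExecuteEC2RescueParameter -UnreachableInstanceId 'i-0abc123def456'
# $executionId = Start-ExecuteEC2Rescue -Parameter $p
```

Keeping the parameter construction separate from the AWS call makes it easy to review exactly what will be sent before anything is stopped or detached.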

What Happens Behind the Scenes

  1. Creates a backup AMI of your instance (named AWSSupport-EC2Rescue:<InstanceId>)
  2. Creates a temporary VPC (if using CreateNewVPC)
  3. Launches a helper instance in the same Availability Zone
  4. Stops your original instance
  5. Detaches the root volume and attaches it to the helper
  6. Runs EC2Rescue with the /rescue:all action against the offline volume
  7. Reattaches the root volume to the original instance
  8. Starts the original instance
  9. Cleans up — terminates helper, deletes temporary VPC and Lambda functions

The backup AMI persists in your account after the automation completes, giving you a rollback point.

You can expand each step in the Execution details panel to watch progress in real time.
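
If you prefer to follow progress from a terminal, the sketch below polls an execution until it reaches a final state. It assumes the AWS Tools for PowerShell module AWS.Tools.SimpleSystemsManagement and configured credentials; the function names and the 30-second interval are arbitrary choices for illustration, and only the four common terminal statuses are treated as final.

```powershell
# Returns $true when an Automation status string is one of the common final states.
function Test-AutomationFinished {
    param([Parameter(Mandatory)] [string] $Status)
    return $Status -in @('Success', 'Failed', 'Cancelled', 'TimedOut')
}

# Hypothetical helper: polls an execution and prints each step's status.
function Wait-AutomationExecution {
    param(
        [Parameter(Mandatory)] [string] $ExecutionId,
        [int] $PollSeconds = 30
    )
    do {
        $exec = Get-SSMAutomationExecution -AutomationExecutionId $ExecutionId
        $exec.StepExecutions | ForEach-Object {
            Write-Host ('{0,-45} {1}' -f $_.StepName, $_.StepStatus)
        }
        $status = [string] $exec.AutomationExecutionStatus
        $finished = Test-AutomationFinished -Status $status
        if (-not $finished) { Start-Sleep -Seconds $PollSeconds }
    } until ($finished)
    return $status
}

# Example (placeholder execution ID):
# Wait-AutomationExecution -ExecutionId '11111111-2222-3333-4444-555555555555'
```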

Option 2: AWSSupport-StartEC2RescueWorkflow (Custom Script)

When you need to make a specific registry change — not just run the automated fixer — this is the runbook to use. It performs the same mount/unmount dance but lets you provide a custom PowerShell script that runs against the offline volume.

Parameters

Parameter | Required | Default | Description
InstanceId | Yes | - | ID of the instance to rescue
OfflineScript | Yes | - | Base64-encoded PowerShell script
EC2RescueInstanceType | No | t3.medium | Helper instance type
SubnetId | No | SelectedInstanceSubnet | Must be same AZ as target
CreatePreEC2RescueBackup | No | false | Create AMI before running script
CreatePostEC2RescueBackup | No | false | Create AMI after running script
S3BucketName | No | - | S3 bucket for logs
AutomationAssumeRole | No | - | IAM role ARN

Environment Variables Available in Your Script

When your script runs on the helper instance, the offline volume is already mounted. These environment variables tell your script where everything is:

Variable | Description | Example
$env:EC2RESCUE_OFFLINE_DRIVE | Offline Windows drive letter | D:\
$env:EC2RESCUE_OFFLINE_SYSTEM_ROOT | Offline Windows system root | D:\Windows
$env:EC2RESCUE_OFFLINE_REGISTRY_DIR | Offline registry config folder | D:\Windows\System32\config
$env:EC2RESCUE_OFFLINE_CURRENT_CONTROL_SET | Current control set path | ControlSet001
$env:EC2RESCUE_SOURCE_INSTANCE | Source instance ID | i-0abc123def456
$env:EC2RESCUE_REGION | AWS Region | us-east-1
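
When a script misbehaves, it helps to log what the runbook actually handed to it. Here is a small sketch that prints only the variables listed in the table above; the function name is hypothetical, and any variable that is missing is reported as not set rather than guessed.

```powershell
# Hypothetical helper: formats the named environment variables as "NAME=value" lines.
function Get-RescueEnvironment {
    param([Parameter(Mandatory)] [string[]] $Name)
    foreach ($n in $Name) {
        $value = [Environment]::GetEnvironmentVariable($n)
        if ([string]::IsNullOrEmpty($value)) { "$n=<not set>" } else { "$n=$value" }
    }
}

# Variable names taken from the table above.
$rescueVariables = @(
    'EC2RESCUE_OFFLINE_DRIVE',
    'EC2RESCUE_OFFLINE_SYSTEM_ROOT',
    'EC2RESCUE_OFFLINE_REGISTRY_DIR',
    'EC2RESCUE_OFFLINE_CURRENT_CONTROL_SET',
    'EC2RESCUE_SOURCE_INSTANCE',
    'EC2RESCUE_REGION'
)

# At the top of an offline script: write the values to the execution output.
Get-RescueEnvironment -Name $rescueVariables | ForEach-Object { Write-Host $_ }
```

Putting this at the top of an offline script means the values appear in the automation output, which makes path problems easier to spot.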

Registry Hives You Can Load

The standard Windows registry hive files are located in the offline volume’s \Windows\System32\config\ directory:

Hive File | Registry Key | Contains
SYSTEM | HKLM\SYSTEM | Hardware config, services, drivers, boot config
SOFTWARE | HKLM\SOFTWARE | Installed software, Windows settings, Group Policy
SAM | HKLM\SAM | Local user accounts and groups
SECURITY | HKLM\SECURITY | Security policies, LSA secrets
DEFAULT | HKU\.DEFAULT | Default user profile

Backup copies also exist in \Windows\System32\config\RegBack\.

Writing a Custom Offline Script

Here’s the general pattern for loading, editing, and unloading a registry hive:

# Load the SYSTEM hive from the offline volume
reg load "HKLM\OfflineSystem" "$env:EC2RESCUE_OFFLINE_REGISTRY_DIR\SYSTEM"

# Make your registry change
reg add "HKLM\OfflineSystem\ControlSet001\Services\TermService" /v Start /t REG_DWORD /d 2 /f

# CRITICAL: Always unload the hive when done (failure to unload = corruption risk)
reg unload "HKLM\OfflineSystem"
Critical

You must reg unload every hive you load before your script exits. If you skip this step, the hive file can be left in a dirty state and the volume may become corrupted. Always wrap your hive operations in try/finally blocks for safety.

The key name you mount under (OfflineSystem, OfflineSoftware, etc.) is arbitrary — pick any name you want. All paths inside the loaded hive are relative to your chosen mount point.
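
One detail worth knowing about the try/finally advice: reg.exe is a native program, so a failed reg load or reg add does not raise a PowerShell exception by itself. It only sets $LASTEXITCODE. The sketch below shows one way to wrap the load, edit, and unload steps so failures surface as errors and the unload is always attempted. Both function names are hypothetical helpers, not part of EC2Rescue, and the garbage collection call is a common precaution for releasing registry handles rather than a documented requirement.

```powershell
# Run a native command and throw if it exits with a non-zero code.
function Invoke-Native {
    param(
        [Parameter(Mandatory)] [string] $FilePath,
        [string[]] $ArgumentList
    )
    & $FilePath @ArgumentList
    if ($LASTEXITCODE -ne 0) {
        throw "$FilePath $($ArgumentList -join ' ') failed with exit code $LASTEXITCODE"
    }
}

# Load an offline hive, run the edit, and always attempt the unload afterwards.
function Invoke-OfflineHiveEdit {
    param(
        [Parameter(Mandatory)] [string] $HiveFile,      # e.g. ...\config\SYSTEM
        [Parameter(Mandatory)] [string] $MountKey,      # e.g. HKLM\OfflineSystem
        [Parameter(Mandatory)] [scriptblock] $Edit
    )
    # If the load fails we throw here, before there is anything to unload.
    Invoke-Native reg.exe @('load', $MountKey, $HiveFile)
    try {
        & $Edit
    }
    finally {
        # Release handles PowerShell may still hold on the hive before unloading.
        [gc]::Collect()
        Invoke-Native reg.exe @('unload', $MountKey)
    }
}

# Example use inside an offline script:
# Invoke-OfflineHiveEdit -HiveFile "$env:EC2RESCUE_OFFLINE_REGISTRY_DIR\SYSTEM" -MountKey 'HKLM\OfflineSystem' -Edit {
#     Invoke-Native reg.exe @('add', 'HKLM\OfflineSystem\ControlSet001\Services\TermService', '/v', 'Start', '/t', 'REG_DWORD', '/d', '2', '/f')
# }
```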

Base64 Encoding Your Script

The OfflineScript parameter requires base64-encoded input:

# PowerShell: encode a script file to base64
[System.Convert]::ToBase64String(
    [System.Text.Encoding]::ASCII.GetBytes(
        [System.IO.File]::ReadAllText('C:\path\to\your-script.ps1')
    )
)
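
Because the snippet above uses ASCII encoding, any non-ASCII character in the script (a curly quote pasted from a web page, for example) is silently replaced with a question mark. A hedged sketch of a safer wrapper follows; the function name is hypothetical, and it simply rejects such files and verifies the round trip before you paste the result.

```powershell
# Hypothetical helper: encodes a script file for the OfflineScript parameter.
function ConvertTo-OfflineScript {
    param([Parameter(Mandatory)] [string] $Path)

    # Resolve against the PowerShell location, not the process working directory.
    $fullPath = (Resolve-Path -LiteralPath $Path).Path
    $text = [System.IO.File]::ReadAllText($fullPath)

    # ASCII encoding would turn these characters into '?', so stop early instead.
    if ($text -match '[^\x00-\x7F]') {
        throw "Script contains non-ASCII characters: $fullPath"
    }

    $encoded = [System.Convert]::ToBase64String([System.Text.Encoding]::ASCII.GetBytes($text))

    # Round-trip check: decoding must give back exactly what was read.
    $decoded = [System.Text.Encoding]::ASCII.GetString([System.Convert]::FromBase64String($encoded))
    if ($decoded -cne $text) { throw "Round-trip mismatch for $fullPath" }

    return $encoded
}

# Example:
# ConvertTo-OfflineScript -Path 'C:\path\to\your-script.ps1' | Set-Clipboard
```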

Step-by-Step in the Console

  1. Open AWS Systems Manager → Automation
  2. Click Execute automation
  3. Search for AWSSupport-StartEC2RescueWorkflow
  4. Select it and click Next
  5. Fill in:
    • InstanceId: your instance ID
    • OfflineScript: paste the base64-encoded string
    • CreatePreEC2RescueBackup: true (recommended)
  6. Click Execute
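
The console steps can be approximated from a script as well. This sketch assumes the AWS Tools for PowerShell module AWS.Tools.SimpleSystemsManagement and configured credentials. The helper name is hypothetical, and passing the backup flag as the lowercase string true mirrors the value used in the console steps rather than anything verified against the runbook schema.

```powershell
# Hypothetical helper: builds the parameter hashtable for AWSSupport-StartEC2RescueWorkflow.
function New-RescueWorkflowParameter {
    param(
        [Parameter(Mandatory)] [string] $InstanceId,
        [Parameter(Mandatory)] [string] $ScriptPath,
        [bool] $PreBackup = $true
    )
    $fullPath = (Resolve-Path -LiteralPath $ScriptPath).Path
    $bytes = [System.Text.Encoding]::ASCII.GetBytes([System.IO.File]::ReadAllText($fullPath))
    return @{
        InstanceId               = $InstanceId
        OfflineScript            = [System.Convert]::ToBase64String($bytes)
        # Sent as a lowercase string, matching the value entered in the console.
        CreatePreEC2RescueBackup = $PreBackup.ToString().ToLowerInvariant()
    }
}

# Example (requires AWS credentials; the instance ID and path are placeholders):
# $p = New-RescueWorkflowParameter -InstanceId 'i-0abc123def456' -ScriptPath 'C:\path\to\your-script.ps1'
# $executionId = Start-SSMAutomationExecution -DocumentName 'AWSSupport-StartEC2RescueWorkflow' -Parameter $p
```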

Hands-On Test: Break and Fix an Instance

Let’s walk through a complete test: intentionally break RDP on a Windows instance, then fix it with SSM automation.

Step 1: Launch a Test Instance

  • Launch a t3.small with Windows Server 2022 (use an unencrypted root volume)
  • Wait for it to pass both status checks
  • RDP in and confirm connectivity works

Step 2: Break RDP via Registry

From an RDP session on the test instance, open PowerShell as Administrator and run:

# Disable the Remote Desktop service on boot
Set-ItemProperty -Path "HKLM:\SYSTEM\CurrentControlSet\Services\TermService" -Name "Start" -Value 4
Restart-Computer

This sets the TermService (RDP) start type to Disabled. After the restart, RDP connections will fail — simulating a broken instance.

Step 3: Verify It's Broken

Try to RDP into the instance. It should fail with a connection timeout or “Remote Desktop can’t connect to the remote computer.”

Step 4a: Fix with AWSSupport-ExecuteEC2Rescue (Easy Way)

Since a disabled RDP service is one of the issues EC2Rescue auto-fixes, you can use the simple runbook:

  1. Go to Systems Manager → Automation → Execute automation
  2. Search for AWSSupport-ExecuteEC2Rescue
  3. Enter the UnreachableInstanceId
  4. Click Execute
  5. Wait for completion (typically 10-15 minutes)

Step 4b: Fix with AWSSupport-StartEC2RescueWorkflow (Custom Script)

If you want to practice the custom script approach, create this PowerShell script (fix-rdp.ps1):

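# fix-rdp.ps1 - Re-enable the RDP service on an offline Windows volume

# Load the SYSTEM registry hive from the offline volume.
# reg.exe signals failure through its exit code, not a PowerShell exception.
reg load "HKLM\OfflineSystem" "$env:EC2RESCUE_OFFLINE_REGISTRY_DIR\SYSTEM"
if ($LASTEXITCODE -ne 0) { throw "reg load failed with exit code $LASTEXITCODE" }

try {
    # Set TermService (RDP) start type back to Automatic (2)
    reg add "HKLM\OfflineSystem\ControlSet001\Services\TermService" /v Start /t REG_DWORD /d 2 /f
    if ($LASTEXITCODE -ne 0) { throw "reg add failed with exit code $LASTEXITCODE" }

    Write-Host "SUCCESS: TermService start type set to Automatic"
}
finally {
    # Always unload the hive once it has been loaded
    reg unload "HKLM\OfflineSystem"
    Write-Host "Registry hive unloaded"
}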

Base64 encode it:

[System.Convert]::ToBase64String(
    [System.Text.Encoding]::ASCII.GetBytes(
        [System.IO.File]::ReadAllText('C:\path\to\fix-rdp.ps1')
    )
)

Then run the workflow:

  1. Go to Systems Manager → Automation → Execute automation
  2. Search for AWSSupport-StartEC2RescueWorkflow
  3. Enter the InstanceId
  4. Paste the base64 string into OfflineScript
  5. Set CreatePreEC2RescueBackup to true
  6. Click Execute

Step 5: Verify the Fix

After the automation completes, RDP into the instance. If it connects, you’ve successfully fixed the registry offline using SSM automation.
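
Before opening an RDP client, you can check whether anything is listening on port 3389 at all. The sketch below uses the .NET TcpClient class so it runs in both Windows PowerShell and PowerShell 7. The function name is hypothetical, and a successful TCP connection only shows that the port is reachable, not that a logon will succeed.

```powershell
# Hypothetical helper: returns $true if a TCP connection succeeds within the timeout.
function Test-TcpPort {
    param(
        [Parameter(Mandatory)] [string] $HostName,
        [int] $Port = 3389,
        [int] $TimeoutMs = 5000
    )
    $client = New-Object System.Net.Sockets.TcpClient
    try {
        $task = $client.ConnectAsync($HostName, $Port)
        # Wait returns $false on timeout; a refused connection throws and is caught below.
        return ($task.Wait($TimeoutMs) -and $client.Connected)
    }
    catch {
        return $false
    }
    finally {
        $client.Close()
    }
}

# Example (placeholder address):
# Test-TcpPort -HostName '203.0.113.10' -Port 3389
```

If this returns False after a successful automation run, the security group and network ACLs are worth checking before assuming the registry fix did not apply.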

More Custom Script Examples

Disable Smart Card Forced Logon (scforceoption)

In environments that enforce smart card authentication through the "Interactive logon: Require smart card" policy, which is stored in the scforceoption registry value, a PKI outage, clock drift, or networking issue can make it impossible for anyone to log in: the machine demands a smart card certificate that can’t be verified. On-premises, you’d boot into safe mode and disable it. In AWS, this offline script does the same thing without needing any access to the broken instance:

# Load the SOFTWARE hive (scforceoption lives under HKLM\SOFTWARE)
reg load "HKLM\OfflineSoftware" "$env:EC2RESCUE_OFFLINE_REGISTRY_DIR\SOFTWARE"
if ($LASTEXITCODE -ne 0) { throw "reg load failed with exit code $LASTEXITCODE" }

try {
    # Disable forced smart card logon (0 = not required, 1 = required)
    reg add "HKLM\OfflineSoftware\Microsoft\Windows\CurrentVersion\Policies\System" /v scforceoption /t REG_DWORD /d 0 /f
    if ($LASTEXITCODE -ne 0) { throw "reg add failed with exit code $LASTEXITCODE" }

    Write-Host "SUCCESS: scforceoption set to 0 - smart card no longer required for interactive logon"
}
finally {
    reg unload "HKLM\OfflineSoftware"
    Write-Host "Registry hive unloaded"
}
Why This Works When SSM Run Command Doesn’t

If the instance has connectivity issues or the SSM agent isn’t running, you can’t use Run Command to execute a script on the machine. The EC2Rescue workflow sidesteps this entirely — it stops the broken instance, mounts its disk to a healthy helper instance, and runs your script there. The broken machine doesn’t need to be on the network, have a running agent, or even be bootable.

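After the automation completes and the instance restarts, users will be able to log in with username/password while the PKI or networking issue is resolved. Once the underlying problem is fixed, re-enable the requirement by setting scforceoption back to 1. If the setting is delivered by a domain Group Policy Object, expect it to be reapplied at the next policy refresh.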

Disable Windows Firewall Profiles

reg load "HKLM\OfflineSystem" "$env:EC2RESCUE_OFFLINE_REGISTRY_DIR\SYSTEM"
if ($LASTEXITCODE -ne 0) { throw "reg load failed with exit code $LASTEXITCODE" }

try {
    # Disable all three firewall profiles (StandardProfile is the Private profile)
    foreach ($profileName in 'DomainProfile', 'StandardProfile', 'PublicProfile') {
        reg add "HKLM\OfflineSystem\ControlSet001\Services\SharedAccess\Parameters\FirewallPolicy\$profileName" /v EnableFirewall /t REG_DWORD /d 0 /f
        if ($LASTEXITCODE -ne 0) { throw "reg add failed for $profileName with exit code $LASTEXITCODE" }
    }

    Write-Host "SUCCESS: All firewall profiles disabled"
}
finally {
    reg unload "HKLM\OfflineSystem"
}

Disable a Problematic Service

reg load "HKLM\OfflineSystem" "$env:EC2RESCUE_OFFLINE_REGISTRY_DIR\SYSTEM"
if ($LASTEXITCODE -ne 0) { throw "reg load failed with exit code $LASTEXITCODE" }

try {
    # Disable a service that's causing boot loops (replace ServiceName)
    reg add "HKLM\OfflineSystem\ControlSet001\Services\ServiceName" /v Start /t REG_DWORD /d 4 /f
    if ($LASTEXITCODE -ne 0) { throw "reg add failed with exit code $LASTEXITCODE" }

    Write-Host "SUCCESS: Service disabled"
}
finally {
    reg unload "HKLM\OfflineSystem"
}

Revert a Bad Group Policy Change

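reg load "HKLM\OfflineSoftware" "$env:EC2RESCUE_OFFLINE_REGISTRY_DIR\SOFTWARE"
if ($LASTEXITCODE -ne 0) { throw "reg load failed with exit code $LASTEXITCODE" }

try {
    # Remove a problematic Group Policy setting
    # (example key; substitute the policy key you actually need to remove)
    reg delete "HKLM\OfflineSoftware\Policies\Microsoft\Windows\RemoteDesktop" /f
    if ($LASTEXITCODE -ne 0) { throw "reg delete failed with exit code $LASTEXITCODE" }

    Write-Host "SUCCESS: Group Policy key removed"
}
finally {
    reg unload "HKLM\OfflineSoftware"
}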

Service Start Type Reference

Value | Start Type | Description
0 | Boot | Loaded by the boot loader (kernel drivers)
1 | System | Started during kernel initialization
2 | Automatic | Started by Service Control Manager at boot
3 | Manual | Started on demand
4 | Disabled | Cannot be started

Alternative: The User Data Method

If your instance can still boot into Windows (it just won’t accept RDP), there’s an even simpler approach that avoids the volume swap entirely:

  1. Stop the instance (EC2 Console → Actions → Instance state → Stop)
  2. Edit User Data (Actions → Instance settings → Edit user data)
  3. Paste a PowerShell script:
<powershell>
# Re-enable RDP and fix firewall
# Set-Service updates the startup type through the Service Control Manager,
# so the service can be started right away without a reboot.
Set-Service -Name TermService -StartupType Automatic
Start-Service -Name TermService
Set-ItemProperty -Path 'HKLM:\SYSTEM\CurrentControlSet\Control\Terminal Server' -Name 'fDenyTSConnections' -Value 0
Enable-NetFirewallRule -DisplayGroup "Remote Desktop"
# No Restart-Computer here: with <persist>true</persist> the script runs on every boot,
# so a forced restart would put the instance into a reboot loop.
</powershell>
<persist>true</persist>
  4. Start the instance; the script runs during boot
  5. After you regain access, clear the user data (stop, edit, remove script, start) so it doesn’t run on every boot
Important

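The User Data method only works if Windows can boot and the launch agent runs. By default, user data runs only at first launch; the <persist>true</persist> tag is what asks the agent to run it again on later boots. EC2Launch v2 (preinstalled on Windows Server 2022 and later AMIs) reads user data on every boot and reruns the script when that tag is present. EC2Launch v1 (Windows Server 2016 and 2019 AMIs) and EC2Config (Windows Server 2012 R2 and earlier) require user data execution to be pre-enabled. If the instance is stuck in a boot loop, blue screen, or can’t load Windows at all, this method won’t work, so use the EC2Rescue runbooks instead.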

When to Use Which Approach

Scenario | Best Approach
RDP broken, firewall blocking, common service issues | AWSSupport-ExecuteEC2Rescue (one click, no scripting)
Specific registry change needed (known key/value) | AWSSupport-StartEC2RescueWorkflow (custom PowerShell script)
Instance boots but RDP fails (EC2Launch agent works) | User Data method (simplest, no volume swap)
Blue screen, boot loop, can’t load Windows | AWSSupport-StartEC2RescueWorkflow or AWSSupport-ExecuteEC2Rescue
Smart card lockout (scforceoption), instance unreachable, SSM agent not running | AWSSupport-StartEC2RescueWorkflow (offline registry edit to set scforceoption to 0)
Corrupted registry, need to restore from backup | AWSSupport-ExecuteEC2Rescue (has built-in registry restore)
Encrypted root volume | Manual volume swap (SSM runbooks don’t support encrypted volumes)

IAM Permissions

The automation needs permissions to create VPCs, launch instances, manage volumes, and create Lambda functions. AWS provides a CloudFormation template that creates the required IAM role automatically:

  1. Go to the SSM EC2Rescue documentation
  2. Download the AWSSupport-EC2RescueRole.zip CloudFormation template
  3. Deploy the stack in CloudFormation
  4. Copy the AssumeRole ARN from the stack’s Outputs tab
  5. Use that ARN as the AutomationAssumeRole parameter

Alternatively, attach the AmazonSSMAutomationRole managed policy to your execution role and add permissions for:

  • ec2:* — VPC, subnet, instance, volume, and AMI operations
  • lambda:CreateFunction, lambda:InvokeFunction, lambda:DeleteFunction — for the automation’s internal functions
  • iam:CreateRole, iam:PassRole, iam:DeleteRole — for the helper instance profile
  • s3:GetObject — to pull the EC2Rescue tooling from AWS-managed buckets

Limitations

Limitation | Details
Encrypted root volumes | Not supported, including volumes encrypted with AWS-managed keys (aws/ebs). Both runbooks check the volume’s Encrypted flag and fail immediately if it’s true, regardless of whether the encryption uses AWS-managed or customer-managed KMS keys. The helper instance’s IAM role lacks the KMS permissions needed to mount the volume. For encrypted volumes, use the manual volume swap method instead.
Instance store volumes | Data on instance store will be lost when the automation stops the instance.
Marketplace AMIs | Instances from AWS Marketplace AMIs are not supported.
Public IP | The public IP changes after stop/start unless an Elastic IP is associated.
VPC quota | The CreateNewVPC option fails if the account has already reached its VPC quota for the Region (5 by default).
Same AZ required | The helper instance/subnet must be in the same Availability Zone as the target.
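
Some of these limitations can be checked before starting a runbook. The sketch below covers only the root device type and the encryption flag. It assumes the AWS Tools for PowerShell module AWS.Tools.EC2 and configured credentials, both function names are hypothetical, and it does not attempt to detect Marketplace AMIs.

```powershell
# Pure check: returns a list of problems (empty when none were found).
function Test-RescueEligibility {
    param(
        [Parameter(Mandatory)] [string] $RootDeviceType,
        [bool] $RootVolumeEncrypted = $false
    )
    $problems = @()
    if ($RootDeviceType -ne 'ebs') {
        $problems += "Root device type is '$RootDeviceType'; the instance must be EBS-backed."
    }
    if ($RootVolumeEncrypted) {
        $problems += 'Root volume is encrypted; use the manual volume swap instead.'
    }
    $problems
}

# Hypothetical wrapper: looks up the instance and its root volume, then runs the check.
function Get-RescueEligibility {
    param([Parameter(Mandatory)] [string] $InstanceId)
    $instance = (Get-EC2Instance -InstanceId $InstanceId).Instances[0]
    $rootType = [string] $instance.RootDeviceType
    if ($rootType -ne 'ebs') {
        return Test-RescueEligibility -RootDeviceType $rootType
    }
    $rootMapping = $instance.BlockDeviceMappings |
        Where-Object { $_.DeviceName -eq $instance.RootDeviceName }
    $volume = Get-EC2Volume -VolumeId $rootMapping.Ebs.VolumeId
    Test-RescueEligibility -RootDeviceType $rootType -RootVolumeEncrypted $volume.Encrypted
}

# Example (placeholder instance ID):
# @(Get-RescueEligibility -InstanceId 'i-0abc123def456')
```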

Troubleshooting

Issue | Solution
Automation fails at “assert volume not encrypted” | Root volume is encrypted. Use the manual volume swap method instead.
Automation fails creating VPC | VPC limit reached. Use SelectedInstanceSubnet instead of CreateNewVPC.
Script runs but registry change didn’t take effect | Verify you targeted the correct ControlSet. Check $env:EC2RESCUE_OFFLINE_CURRENT_CONTROL_SET in your script.
“The process cannot access the file because it is being used” | Failed to reg unload the hive. May need to force-kill processes holding handles.
Instance still won’t boot after fix | Check the backup AMI created by the automation and launch a new instance from it as a fallback.
Automation times out | Check the execution steps in SSM to see where it stalled. Verify IAM permissions and subnet connectivity.
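
For the control set issue in particular, the scripts in this post hardcode ControlSet001. One way to avoid guessing is to read the Current value under the Select key of the loaded SYSTEM hive, which records the control set Windows actually boots from. The sketch below assumes the hive is already loaded at HKLM\OfflineSystem; the function names are hypothetical, and it deliberately does not depend on the exact format of $env:EC2RESCUE_OFFLINE_CURRENT_CONTROL_SET.

```powershell
# Turns the numeric Select\Current value into a key name such as ControlSet001.
function ConvertTo-ControlSetName {
    param([Parameter(Mandatory)] [int] $Current)
    return 'ControlSet{0:D3}' -f $Current
}

# Extracts the Current value from "reg query ...\Select /v Current" output text.
function ConvertFrom-RegQuerySelect {
    param([Parameter(Mandatory)] [string] $Text)
    $match = [regex]::Match($Text, '(?m)^\s*Current\s+REG_DWORD\s+0x([0-9a-fA-F]+)')
    if (-not $match.Success) { throw 'Could not find the Current value in reg query output' }
    return ConvertTo-ControlSetName -Current ([Convert]::ToInt32($match.Groups[1].Value, 16))
}

# Queries the loaded offline hive (requires reg.exe and a loaded hive).
function Get-OfflineControlSetName {
    param([string] $MountKey = 'HKLM\OfflineSystem')
    $output = & reg.exe query "$MountKey\Select" /v Current
    if ($LASTEXITCODE -ne 0) { throw "reg query failed with exit code $LASTEXITCODE" }
    return ConvertFrom-RegQuerySelect -Text ($output -join "`n")
}

# Example, after reg load "HKLM\OfflineSystem" ...:
# $controlSet = Get-OfflineControlSetName
# reg add "HKLM\OfflineSystem\$controlSet\Services\TermService" /v Start /t REG_DWORD /d 2 /f
```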

Conclusion

The days of manually swapping EBS volumes between instances to fix a broken Windows registry are over. AWS Systems Manager gives you two automation runbooks that handle the entire rescue workflow:

  • AWSSupport-ExecuteEC2Rescue — One click to auto-fix common issues (RDP, firewall, services, disk signature, registry restore)
  • AWSSupport-StartEC2RescueWorkflow — Full control with custom PowerShell scripts for specific registry edits

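Both runbooks handle all the volume mounting logistics and clean up after themselves. ExecuteEC2Rescue creates a backup AMI automatically, while StartEC2RescueWorkflow creates one only when CreatePreEC2RescueBackup or CreatePostEC2RescueBackup is set to true. For instances that can still boot, the User Data method is even simpler, with no volume swap at all.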

Key takeaways:

  • Use ExecuteEC2Rescue first — It fixes the most common issues automatically
  • Use StartEC2RescueWorkflow for custom fixes — Load any registry hive and make targeted changes
  • Always reg unload your hives — Failure to unload risks corruption
  • Encrypted volumes aren’t supported — You'll need the manual approach for those
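  • Backup AMIs persist, so you have a rollback point whenever one was created (always with ExecuteEC2Rescue; with StartEC2RescueWorkflow only if CreatePreEC2RescueBackup is true)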
  • Test it before you need it — Run the hands-on exercise on a throwaway instance so you’re ready when it matters at 2 AM
