Fixing Broken Windows EC2 Instances with Offline Registry Edits via SSM Automation

TL;DR

When a Windows EC2 instance won't boot or you're locked out due to a bad registry change, AWS Systems Manager provides automation runbooks that handle the entire offline rescue workflow for you. AWSSupport-ExecuteEC2Rescue auto-fixes common issues (RDP, firewall, services). AWSSupport-StartEC2RescueWorkflow lets you run custom PowerShell scripts against an offline volume - including loading and editing registry hives. This guide covers both approaches plus a hands-on test you can run in your own account.

Introduction

A Windows EC2 instance that won't boot or accept RDP connections is one of the most stressful scenarios in cloud operations. Maybe someone pushed a bad Group Policy, disabled the RDP service, or a driver update corrupted the boot sequence. One particularly nasty example is the scforceoption registry setting (the “Interactive logon: Require smart card” security policy), which enforces smart card authentication for all interactive logons. When networking, clock drift, or PKI issues prevent certificate verification, nobody can log in - not even local administrators. On physical hardware, you'd boot into safe mode and flip the registry key. In AWS, there’s no KVM or physical console to do that from, and if the SSM agent isn’t reachable either, you’re stuck.

The traditional fix looks like this:

Stop the broken instance
Detach the root EBS volume
Launch a temporary “rescue” instance
Attach the volume as a secondary drive
Load the offline registry hive and make your fix
Detach the volume and reattach it to the original instance as its root device (usually /dev/sda1 on Windows AMIs)
Start the original instance and hope it works

That’s a 7-step manual process with plenty of room for error. AWS Systems Manager can automate the entire thing with a single click.

This post covers two SSM automation runbooks:

AWSSupport-ExecuteEC2Rescue - One-click automated repair for common Windows issues
AWSSupport-StartEC2RescueWorkflow - Custom PowerShell scripts against an offline volume (for when you need full control)

Plus a hands-on walkthrough where we intentionally break an instance and fix it with SSM.

Prerequisites

AWS account with Systems Manager access
IAM permissions for SSM Automation (see IAM Permissions section below)
Target instance must be EBS-backed (instance store not supported)
Root volume must be unencrypted (encrypted volumes are not supported by either runbook)
Instance must not be from an AWS Marketplace AMI
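Before starting either runbook, you could check these prerequisites from a workstation. The function below is a hypothetical helper (Test-EC2RescuePrerequisite is not part of any AWS module). It assumes objects shaped like those returned by Get-EC2Instance and Get-EC2Volume from AWS Tools for PowerShell, and it treats a `marketplace` product code on the instance as the sign of a Marketplace AMI.

```powershell
# Hypothetical helper: report which runbook prerequisites an instance fails.
# $Instance:   an EC2 instance object, e.g. (Get-EC2Instance -InstanceId ...).Instances[0]
# $RootVolume: the EBS volume attached as its root device, e.g. from Get-EC2Volume
function Test-EC2RescuePrerequisite {
    [CmdletBinding()]
    param(
        [Parameter(Mandatory)] $Instance,
        [Parameter(Mandatory)] $RootVolume
    )
    $problems = [System.Collections.Generic.List[string]]::new()

    # The runbooks stop the instance and move its root volume, so it must be EBS-backed.
    if ("$($Instance.RootDeviceType)" -ne 'ebs') {
        $problems.Add("Root device type is '$($Instance.RootDeviceType)', not 'ebs'.")
    }
    # Neither runbook supports encrypted root volumes.
    if ($RootVolume.Encrypted) {
        $problems.Add("Root volume $($RootVolume.VolumeId) is encrypted.")
    }
    # Assumption: a 'marketplace' product code marks an AWS Marketplace AMI.
    foreach ($code in @($Instance.ProductCodes)) {
        if ($code -and "$($code.ProductCodeType)" -eq 'marketplace') {
            $problems.Add("Instance carries Marketplace product code $($code.ProductCodeId).")
        }
    }
    # Empty output means none of these checks tripped.
    $problems.ToArray()
}
```

For a real instance, take `(Get-EC2Instance -InstanceId i-0abc123def456).Instances[0]`, find the entry in its BlockDeviceMappings whose DeviceName equals RootDeviceName, and pass `Get-EC2Volume -VolumeId` for that volume as `-RootVolume`.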

Option 1: AWSSupport-ExecuteEC2Rescue (Automated Fix)

This is the “easy button.” It automatically diagnoses and repairs common Windows connectivity issues without you needing to write any scripts. Behind the scenes, it spins up a helper instance, mounts your volume, runs EC2Rescue, and puts everything back.

What It Auto-Fixes

Category	What It Fixes
Remote Desktop (RDP)	Enables RDP service (sets to Automatic start), enables Remote Desktop connections, verifies TCP port 3389
Windows Firewall	Detects and resets firewall profiles (Domain, Private, Public)
Network Interface	Fixes DHCP service startup
System Time	Fixes the `RealTimeIsUniversal` registry key (prevents clock drift)
EC2Config / EC2Launch	Fixes service startup, password generation, user data execution
Disk Signature	Compares disk signature with BCD and corrects mismatches (fixes boot failures from cloned volumes)
Registry Restore	Can restore registry from backup (`\Windows\System32\config\RegBack`), provided that folder holds real backups; Windows Server 2019 and later leave it empty by default
Boot Config	Can set instance to boot to Last Known Good Configuration

Parameters

Parameter	Required	Default	Description
`UnreachableInstanceId`	Yes	-	ID of the broken instance
`EC2RescueInstanceType`	No	`t2.small`	Helper instance type (`t2.small`, `t2.medium`, `t2.large`)
`SubnetId`	No	`CreateNewVPC`	`CreateNewVPC`, `SelectedInstanceSubnet`, or a specific subnet ID (must be same AZ)
`LogDestination`	No	-	S3 bucket name for troubleshooting logs
`AutomationAssumeRole`	No	-	IAM role ARN for the automation

Step-by-Step in the Console

Open AWS Systems Manager → Automation (left sidebar)
Click Execute automation
Under Owned by Amazon, search for AWSSupport-ExecuteEC2Rescue
Select it and click Next
Choose Simple execution
Fill in UnreachableInstanceId with your instance ID (e.g., i-0abc123def456)
Optionally set LogDestination to an S3 bucket for detailed logs
Click Execute
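The same execution can also be started from a script. This is a sketch assuming the AWS.Tools.SimpleSystemsManagement module; New-ExecuteEC2RescueParameter is a hypothetical helper that only assembles and sanity-checks the parameters from the table above, using the defaults listed there.

```powershell
# Hypothetical helper: build the parameter table for AWSSupport-ExecuteEC2Rescue.
function New-ExecuteEC2RescueParameter {
    [CmdletBinding()]
    param(
        # Loose format check: "i-" followed by 8 to 17 hex characters.
        [Parameter(Mandatory)]
        [ValidatePattern('^i-[0-9a-f]{8,17}$')]
        [string] $UnreachableInstanceId,

        # Allowed values and default as listed in the parameter table.
        [ValidateSet('t2.small', 't2.medium', 't2.large')]
        [string] $EC2RescueInstanceType = 't2.small',

        [string] $SubnetId = 'CreateNewVPC',
        [string] $LogDestination,
        [string] $AutomationAssumeRole
    )
    $params = @{
        UnreachableInstanceId = $UnreachableInstanceId
        EC2RescueInstanceType = $EC2RescueInstanceType
        SubnetId              = $SubnetId
    }
    # Only send optional parameters that were actually supplied.
    if ($LogDestination)       { $params['LogDestination'] = $LogDestination }
    if ($AutomationAssumeRole) { $params['AutomationAssumeRole'] = $AutomationAssumeRole }
    return $params
}
```

Pass the result to `Start-SSMAutomationExecution -DocumentName 'AWSSupport-ExecuteEC2Rescue' -Parameter $params`, which returns the execution ID. The instance-ID pattern is deliberately loose; it only catches obvious typos.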

What Happens Behind the Scenes

Creates a backup AMI of your instance (named AWSSupport-EC2Rescue:<InstanceId>)
Creates a temporary VPC (if using CreateNewVPC)
Launches a helper instance in the same Availability Zone
Stops your original instance
Detaches the root volume and attaches it to the helper
Runs EC2Rescue with the /rescue:all action against the offline volume
Reattaches the root volume to the original instance
Starts the original instance
Cleans up - terminates helper, deletes temporary VPC and Lambda functions

The backup AMI persists in your account after the automation completes, giving you a rollback point. Its EBS snapshots keep incurring storage charges until you deregister the AMI and delete them.

You can expand each step in the Execution details panel to watch progress in real time.
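If you started the automation from a script, you can poll it instead of watching the console. A sketch assuming AWS.Tools.SimpleSystemsManagement; Wait-AutomationExecution and Test-AutomationFinished are hypothetical helpers, and the list of final statuses is an assumption covering the common outcomes, not the API's complete status list.

```powershell
# Hypothetical helper: is this Automation status a final one?
function Test-AutomationFinished {
    param([Parameter(Mandatory)][string] $Status)
    # Assumed set of final statuses (common outcomes only).
    $final = 'Success', 'Failed', 'TimedOut', 'Cancelled', 'CompletedWithSuccess', 'CompletedWithFailure'
    return $final -contains $Status
}

# Hypothetical helper: poll an execution until it reaches a final status.
function Wait-AutomationExecution {
    param(
        [Parameter(Mandatory)][string] $ExecutionId,
        [int] $PollSeconds = 30
    )
    while ($true) {
        $exec = Get-SSMAutomationExecution -AutomationExecutionId $ExecutionId
        $status = "$($exec.AutomationExecutionStatus)"
        Write-Host "$(Get-Date -Format T)  $status"
        if (Test-AutomationFinished -Status $status) {
            # List the steps that did not succeed, to show where it stopped.
            $exec.StepExecutions | Where-Object { "$($_.StepStatus)" -ne 'Success' } |
                ForEach-Object { Write-Host "  $($_.StepName): $($_.StepStatus) $($_.FailureMessage)" }
            return $exec
        }
        Start-Sleep -Seconds $PollSeconds
    }
}
```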

Option 2: AWSSupport-StartEC2RescueWorkflow (Custom Script)

When you need to make a specific registry change - not just run the automated fixer - this is the runbook to use. It performs the same mount/unmount dance but lets you provide a custom PowerShell script that runs against the offline volume.

Parameters

Parameter	Required	Default	Description
`InstanceId`	Yes	-	ID of the instance to rescue
`OfflineScript`	Yes	-	Base64-encoded PowerShell script
`EC2RescueInstanceType`	No	`t3.medium`	Helper instance type
`SubnetId`	No	`SelectedInstanceSubnet`	Must be same AZ as target
`CreatePreEC2RescueBackup`	No	`false`	Create AMI before running script
`CreatePostEC2RescueBackup`	No	`false`	Create AMI after running script
`S3BucketName`	No	-	S3 bucket for logs
`AutomationAssumeRole`	No	-	IAM role ARN
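To start this runbook from PowerShell, the parameters can be assembled in code. New-EC2RescueWorkflowParameter is a hypothetical helper; it assumes the two backup flags are passed as the strings `true`/`false` (matching the defaults in the table) and checks that the script really is base64 before anything is launched.

```powershell
# Hypothetical helper: build the parameter table for AWSSupport-StartEC2RescueWorkflow.
function New-EC2RescueWorkflowParameter {
    [CmdletBinding()]
    param(
        [Parameter(Mandatory)][string] $InstanceId,
        [Parameter(Mandatory)][string] $OfflineScriptBase64,
        [switch] $CreatePreEC2RescueBackup,
        [switch] $CreatePostEC2RescueBackup,
        [string] $S3BucketName
    )
    # Fail before launching anything if the script isn't valid base64.
    try { [void][System.Convert]::FromBase64String($OfflineScriptBase64) }
    catch { throw "OfflineScriptBase64 is not valid base64: $($_.Exception.Message)" }

    $params = @{
        InstanceId                = $InstanceId
        OfflineScript             = $OfflineScriptBase64
        # Assumption: the backup flags are passed as the strings 'true' / 'false'.
        CreatePreEC2RescueBackup  = $CreatePreEC2RescueBackup.IsPresent.ToString().ToLowerInvariant()
        CreatePostEC2RescueBackup = $CreatePostEC2RescueBackup.IsPresent.ToString().ToLowerInvariant()
    }
    if ($S3BucketName) { $params['S3BucketName'] = $S3BucketName }
    return $params
}
```

For example, `Start-SSMAutomationExecution -DocumentName 'AWSSupport-StartEC2RescueWorkflow' -Parameter (New-EC2RescueWorkflowParameter -InstanceId i-0abc123def456 -OfflineScriptBase64 $b64 -CreatePreEC2RescueBackup)`, where `$b64` holds your encoded script.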

Environment Variables Available in Your Script

When your script runs on the helper instance, the offline volume is already mounted. These environment variables tell your script where everything is:

Variable	Description	Example
`$env:EC2RESCUE_OFFLINE_DRIVE`	Offline Windows drive letter	`D:\`
`$env:EC2RESCUE_OFFLINE_SYSTEM_ROOT`	Offline Windows system root	`D:\Windows`
`$env:EC2RESCUE_OFFLINE_REGISTRY_DIR`	Offline registry config folder	`D:\Windows\System32\config`
`$env:EC2RESCUE_OFFLINE_CURRENT_CONTROL_SET`	Current control set path	`ControlSet001`
`$env:EC2RESCUE_SOURCE_INSTANCE`	Source instance ID	`i-0abc123def456`
`$env:EC2RESCUE_REGION`	AWS Region	`us-east-1`
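Rather than hard-coding `D:\` or `ControlSet001`, an offline script can derive its paths from these variables. A minimal sketch with two hypothetical helpers; both fail loudly when a variable is missing, which also catches running the script somewhere other than the helper instance. The `ControlSetNNN` format check is an assumption based on the example value in the table.

```powershell
# Hypothetical helper: full path of a hive file on the offline volume.
function Get-OfflineHiveFile {
    param(
        [Parameter(Mandatory)]
        [ValidateSet('SYSTEM', 'SOFTWARE', 'SAM', 'SECURITY', 'DEFAULT')]
        [string] $Hive
    )
    $dir = $env:EC2RESCUE_OFFLINE_REGISTRY_DIR
    if (-not $dir) { throw 'EC2RESCUE_OFFLINE_REGISTRY_DIR is not set; is this running on the EC2Rescue helper?' }
    # Build a Windows path explicitly (the helper instance runs Windows).
    return $dir.TrimEnd('\') + '\' + $Hive.ToUpperInvariant()
}

# Hypothetical helper: the offline volume's current control set name.
function Get-OfflineControlSet {
    $cs = $env:EC2RESCUE_OFFLINE_CURRENT_CONTROL_SET
    # Assumed format: ControlSet001, ControlSet002, ...
    if ($cs -notmatch '^ControlSet\d{3}$') { throw "Unexpected control set value: '$cs'" }
    return $cs
}
```

In an offline script you might then write `reg load "HKLM\OfflineSystem" (Get-OfflineHiveFile SYSTEM)` and build keys as `"HKLM\OfflineSystem\$(Get-OfflineControlSet)\Services\..."`.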

Registry Hives You Can Load

The standard Windows registry hive files are located in the offline volume’s \Windows\System32\config\ directory:

Hive File	Registry Key	Contains
`SYSTEM`	`HKLM\SYSTEM`	Hardware config, services, drivers, boot config
`SOFTWARE`	`HKLM\SOFTWARE`	Installed software, Windows settings, Group Policy
`SAM`	`HKLM\SAM`	Local user accounts and groups
`SECURITY`	`HKLM\SECURITY`	Security policies, LSA secrets
`DEFAULT`	`HKU\.DEFAULT`	Default user profile

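Older Windows versions also keep backup copies in \Windows\System32\config\RegBack\. Starting with Windows 10 version 1803, and therefore on Windows Server 2019 and later, Windows no longer writes these backups by default, so the files there are typically 0 bytes.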
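Because newer Windows versions leave the RegBack files empty by default, it is worth checking whether there is anything to restore before relying on them. Get-UsableRegBackHive is a hypothetical helper that lists the backup files with a non-zero size; on the helper instance you would pass it `$env:EC2RESCUE_OFFLINE_REGISTRY_DIR`.

```powershell
# Hypothetical helper: names of RegBack hive copies that actually contain data.
# On Windows 10 1803+ / Server 2019+ these files are usually 0 bytes by default.
function Get-UsableRegBackHive {
    param([Parameter(Mandatory)][string] $RegistryDir)
    $regBack = Join-Path $RegistryDir 'RegBack'
    if (-not (Test-Path -LiteralPath $regBack)) { return }
    Get-ChildItem -LiteralPath $regBack -File |
        Where-Object { $_.Length -gt 0 } |
        Select-Object -ExpandProperty Name
}
```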

Writing a Custom Offline Script

Here’s the general pattern for loading, editing, and unloading a registry hive:

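# Load the SYSTEM hive from the offline volume under a temporary key
reg load "HKLM\OfflineSystem" "$env:EC2RESCUE_OFFLINE_REGISTRY_DIR\SYSTEM"
if ($LASTEXITCODE -ne 0) { throw "reg load failed (exit code $LASTEXITCODE)" }

# Make your registry change. Use the offline volume's current control set
# instead of assuming it is ControlSet001.
reg add "HKLM\OfflineSystem\$env:EC2RESCUE_OFFLINE_CURRENT_CONTROL_SET\Services\TermService" /v Start /t REG_DWORD /d 2 /f

# CRITICAL: Always unload the hive when done (failure to unload = corruption risk)
reg unload "HKLM\OfflineSystem"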

Critical

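You must reg unload every hive you load before your script exits. If you skip this step, the hive file can be left in a dirty state and the volume may become corrupted. Always wrap your hive operations in try/finally blocks for safety. Note that reg.exe reports failure through its exit code rather than by throwing, so a try block alone won't notice a failed reg load or reg add; check $LASTEXITCODE after each reg call.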

The key name you mount under (OfflineSystem, OfflineSoftware, etc.) is arbitrary - pick any name you want. All paths inside the loaded hive are relative to your chosen mount point.
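The load/edit/unload pattern can be wrapped once and reused. This is a sketch, not part of EC2Rescue: Invoke-WithOfflineHive is a hypothetical helper that loads the hive outside the try (so a failed load never triggers an unload of something that isn't there), checks reg.exe exit codes because native commands don't throw in PowerShell, and forces a garbage collection before unloading in case PowerShell objects still hold handles into the hive. The retry count and delay are arbitrary choices.

```powershell
# Hypothetical helper: load a hive, run an action against it, always unload it.
function Invoke-WithOfflineHive {
    param(
        [Parameter(Mandatory)][string] $MountKey,      # e.g. HKLM\OfflineSystem
        [Parameter(Mandatory)][string] $HiveFile,      # e.g. D:\Windows\System32\config\SYSTEM
        [Parameter(Mandatory)][scriptblock] $Action    # receives $MountKey as its only argument
    )
    # Load outside the try: if this fails there is nothing to unload.
    reg load $MountKey $HiveFile
    if ($LASTEXITCODE -ne 0) { throw "reg load failed for $HiveFile (exit code $LASTEXITCODE)" }
    try {
        & $Action $MountKey
    }
    finally {
        # Release handles that PowerShell/.NET objects may still hold on the hive.
        [System.GC]::Collect()
        [System.GC]::WaitForPendingFinalizers()
        # Retry a few times; an unload can fail briefly while handles close.
        for ($attempt = 1; $attempt -le 3; $attempt++) {
            reg unload $MountKey
            if ($LASTEXITCODE -eq 0) { break }
            Start-Sleep -Seconds 2
        }
        if ($LASTEXITCODE -ne 0) { Write-Warning "reg unload failed for $MountKey; the hive may still be loaded." }
    }
}

# Example: runs only on the EC2Rescue helper, where these variables are set.
if ($env:EC2RESCUE_OFFLINE_REGISTRY_DIR) {
    $hive = @{ MountKey = 'HKLM\OfflineSystem'; HiveFile = "$env:EC2RESCUE_OFFLINE_REGISTRY_DIR\SYSTEM" }
    Invoke-WithOfflineHive @hive -Action {
        param($key)
        reg add "$key\$env:EC2RESCUE_OFFLINE_CURRENT_CONTROL_SET\Services\TermService" /v Start /t REG_DWORD /d 2 /f
        if ($LASTEXITCODE -ne 0) { throw "reg add failed (exit code $LASTEXITCODE)" }
    }
}
```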

Base64 Encoding Your Script

The OfflineScript parameter requires base64-encoded input:

# PowerShell: encode a script file to base64
[System.Convert]::ToBase64String(
    [System.Text.Encoding]::ASCII.GetBytes(
        [System.IO.File]::ReadAllText('C:\path\to\your-script.ps1')
    )
)
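One pitfall with ASCII.GetBytes: any character outside ASCII (a curly quote or long dash pasted from a web page or word processor, for instance) is silently replaced with `?`, which can break the script on the helper. Below is a sketch of a hypothetical ConvertTo-OfflineScriptBase64 helper that rejects such characters and confirms that the encoding round-trips.

```powershell
# Hypothetical helper: base64-encode an offline script for the OfflineScript parameter,
# refusing input that ASCII encoding would silently corrupt.
function ConvertTo-OfflineScriptBase64 {
    param([Parameter(Mandatory)][string] $Path)
    # .NET resolves relative paths against the process directory, so resolve via PowerShell first.
    $fullPath = (Resolve-Path -LiteralPath $Path -ErrorAction Stop).ProviderPath
    $text = [System.IO.File]::ReadAllText($fullPath)

    $nonAscii = [regex]::Matches($text, '[^\x00-\x7F]')
    if ($nonAscii.Count -gt 0) {
        throw "Script contains $($nonAscii.Count) non-ASCII character(s), first: '$($nonAscii[0].Value)'. Replace them before encoding."
    }

    $encoded = [System.Convert]::ToBase64String([System.Text.Encoding]::ASCII.GetBytes($text))
    # Round-trip check: decoding must reproduce the original text exactly.
    $decoded = [System.Text.Encoding]::ASCII.GetString([System.Convert]::FromBase64String($encoded))
    if ($decoded -cne $text) { throw 'Base64 round-trip produced different text.' }
    return $encoded
}
```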

Step-by-Step in the Console

Open AWS Systems Manager → Automation
Click Execute automation
Search for AWSSupport-StartEC2RescueWorkflow
Select it and click Next
Fill in:
- InstanceId: your instance ID
- OfflineScript: paste the base64-encoded string
- CreatePreEC2RescueBackup: true (recommended)
Click Execute

Hands-On Test: Break and Fix an Instance

Let’s walk through a complete test: intentionally break RDP on a Windows instance, then fix it with SSM automation.

Step 1: Launch a Test Instance

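Launch a t3.small with Windows Server 2022 (use an unencrypted root volume; if EBS encryption by default is turned on for the Region, new root volumes will be encrypted and both runbooks will refuse them)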
Wait for it to pass both status checks
RDP in and confirm connectivity works

Step 2: Break RDP via Registry

From an RDP session on the test instance, open PowerShell as Administrator and run:

# Disable the Remote Desktop service on boot
Set-ItemProperty -Path "HKLM:\SYSTEM\CurrentControlSet\Services\TermService" -Name "Start" -Value 4
Restart-Computer

This sets the TermService (RDP) start type to Disabled. After the restart, RDP connections will fail - simulating a broken instance.

Step 3: Verify It's Broken

Try to RDP into the instance. It should fail with a connection timeout or “Remote Desktop can’t connect to the remote computer.”
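To confirm that nothing is answering on port 3389, independent of the RDP client's error message, you can probe the port directly. This sketch uses .NET's TcpClient, so it also works in PowerShell 7 on macOS or Linux; Test-TcpPort is a hypothetical name and the 3-second timeout is an arbitrary choice. Your security group still has to allow 3389 from your address for the result to mean anything.

```powershell
# Hypothetical helper: does anything accept a TCP connection on this port?
function Test-TcpPort {
    param(
        [Parameter(Mandatory)][string] $ComputerName,
        [int] $Port = 3389,
        [int] $TimeoutMs = 3000
    )
    $client = [System.Net.Sockets.TcpClient]::new()
    try {
        $task = $client.ConnectAsync($ComputerName, $Port)
        # Wait() returns $false on timeout and throws if the connection is refused.
        return ($task.Wait($TimeoutMs) -and $client.Connected)
    }
    catch {
        return $false
    }
    finally {
        $client.Dispose()
    }
}
```

`Test-TcpPort -ComputerName <public IP>` should return False while TermService is disabled and True again once the fix has been applied.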

Step 4a: Fix with AWSSupport-ExecuteEC2Rescue (Easy Way)

Since a disabled RDP service is one of the issues EC2Rescue auto-fixes, you can use the simple runbook:

Go to Systems Manager → Automation → Execute automation
Search for AWSSupport-ExecuteEC2Rescue
Enter the UnreachableInstanceId
Click Execute
Wait for completion (typically 10-15 minutes)

Step 4b: Fix with AWSSupport-StartEC2RescueWorkflow (Custom Script)

If you want to practice the custom script approach, create this PowerShell script (fix-rdp.ps1):

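# fix-rdp.ps1 - Re-enable the RDP service on an offline Windows volume

# Load the SYSTEM registry hive from the offline volume
reg load "HKLM\OfflineSystem" "$env:EC2RESCUE_OFFLINE_REGISTRY_DIR\SYSTEM"
if ($LASTEXITCODE -ne 0) { throw "reg load failed (exit code $LASTEXITCODE)" }
try {
    # Target the offline volume's current control set, not a hard-coded ControlSet001
    $svcKey = "HKLM\OfflineSystem\$env:EC2RESCUE_OFFLINE_CURRENT_CONTROL_SET\Services\TermService"

    # Set TermService (RDP) start type back to Automatic (2)
    reg add $svcKey /v Start /t REG_DWORD /d 2 /f
    if ($LASTEXITCODE -ne 0) { throw "reg add failed (exit code $LASTEXITCODE)" }

    Write-Host "SUCCESS: TermService start type set to Automatic"
}
finally {
    # Always unload the hive, even if the edit failed
    reg unload "HKLM\OfflineSystem"
    Write-Host "Registry hive unloaded"
}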
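To log what the script actually wrote, you can read the value back with reg query before the hive is unloaded. The parser below is a hypothetical helper; it assumes reg query's usual layout of one `Name    REG_DWORD    0x2` line per value.

```powershell
# Hypothetical helper: pull a REG_DWORD value out of `reg query` output.
function Get-RegDwordFromQueryOutput {
    param(
        [string[]] $Lines,
        [Parameter(Mandatory)][string] $ValueName
    )
    $pattern = '^\s+' + [regex]::Escape($ValueName) + '\s+REG_DWORD\s+0x([0-9a-fA-F]+)\s*$'
    foreach ($line in $Lines) {
        # Typical line: "    Start    REG_DWORD    0x2"
        if ($line -match $pattern) {
            return [Convert]::ToUInt32($Matches[1], 16)
        }
    }
    return $null   # value not present in the output
}
```

Inside the try block, after the reg add, something like `$out = reg query "HKLM\OfflineSystem\$env:EC2RESCUE_OFFLINE_CURRENT_CONTROL_SET\Services\TermService" /v Start` followed by `Get-RegDwordFromQueryOutput -Lines $out -ValueName Start` should return 2, and that number can go into the success message.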

Base64 encode it:

[System.Convert]::ToBase64String(
    [System.Text.Encoding]::ASCII.GetBytes(
        [System.IO.File]::ReadAllText('C:\path\to\fix-rdp.ps1')
    )
)

Then run the workflow:

Go to Systems Manager → Automation → Execute automation
Search for AWSSupport-StartEC2RescueWorkflow
Enter the InstanceId
Paste the base64 string into OfflineScript
Set CreatePreEC2RescueBackup to true
Click Execute

Step 5: Verify the Fix

After the automation completes, RDP into the instance. If it connects, you’ve successfully fixed the registry offline using SSM automation.

More Custom Script Examples

Disable Smart Card Forced Logon (scforceoption)

In environments that enforce smart card authentication via the scforceoption registry setting, a PKI outage, clock drift, or networking issue can make it impossible for anyone to log in - the machine demands a smart card certificate that can’t be verified. On-premises, you’d boot into safe mode and disable it. In AWS, this offline script does the same thing without needing any access to the broken instance:

# Load the SOFTWARE hive (scforceoption lives under HKLM\SOFTWARE)
reg load "HKLM\OfflineSoftware" "$env:EC2RESCUE_OFFLINE_REGISTRY_DIR\SOFTWARE"
if ($LASTEXITCODE -ne 0) { throw "reg load failed (exit code $LASTEXITCODE)" }
try {
    # Disable forced smart card logon (0 = not required, 1 = required)
    reg add "HKLM\OfflineSoftware\Microsoft\Windows\CurrentVersion\Policies\System" /v scforceoption /t REG_DWORD /d 0 /f
    if ($LASTEXITCODE -ne 0) { throw "reg add failed (exit code $LASTEXITCODE)" }

    Write-Host "SUCCESS: scforceoption disabled - smart card no longer required for interactive logon"
}
finally {
    reg unload "HKLM\OfflineSoftware"
    Write-Host "Registry hive unloaded"
}

Why This Works When SSM Run Command Doesn’t

If the instance has connectivity issues or the SSM agent isn’t running, you can’t use Run Command to execute a script on the machine. The EC2Rescue workflow sidesteps this entirely - it stops the broken instance, mounts its disk to a healthy helper instance, and runs your script there. The broken machine doesn’t need to be on the network, have a running agent, or even be bootable.

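After the automation completes and the instance restarts, users will be able to log in with a username and password while the PKI or networking issue is resolved. Once the underlying problem is fixed, re-enable scforceoption by setting it back to 1. If the setting comes from a domain Group Policy Object rather than local policy, Group Policy will write it back at the next refresh once the instance can reach a domain controller, so adjust the GPO as well.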

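Disable Windows Firewall Profiles

This script turns the firewall off for all three profiles rather than resetting its rules, so turn it back on once you've regained access and fixed the rule that locked you out. If the firewall is managed by Group Policy, the policy values under HKLM\SOFTWARE\Policies\Microsoft\WindowsFirewall take precedence over these settings.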

reg load "HKLM\OfflineSystem" "$env:EC2RESCUE_OFFLINE_REGISTRY_DIR\SYSTEM"
if ($LASTEXITCODE -ne 0) { throw "reg load failed (exit code $LASTEXITCODE)" }
try {
    $fwPolicy = "HKLM\OfflineSystem\$env:EC2RESCUE_OFFLINE_CURRENT_CONTROL_SET\Services\SharedAccess\Parameters\FirewallPolicy"

    # Disable all three firewall profiles (StandardProfile is the Private profile)
    foreach ($fwProfile in 'DomainProfile', 'StandardProfile', 'PublicProfile') {
        reg add "$fwPolicy\$fwProfile" /v EnableFirewall /t REG_DWORD /d 0 /f
        if ($LASTEXITCODE -ne 0) { throw "reg add failed for $fwProfile (exit code $LASTEXITCODE)" }
    }

    Write-Host "SUCCESS: All firewall profiles disabled"
}
finally {
    reg unload "HKLM\OfflineSystem"
}

Disable a Problematic Service

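reg load "HKLM\OfflineSystem" "$env:EC2RESCUE_OFFLINE_REGISTRY_DIR\SYSTEM"
if ($LASTEXITCODE -ne 0) { throw "reg load failed (exit code $LASTEXITCODE)" }
try {
    # Disable a service that's causing boot loops (replace ServiceName)
    $svcKey = "HKLM\OfflineSystem\$env:EC2RESCUE_OFFLINE_CURRENT_CONTROL_SET\Services\ServiceName"

    # Make sure the service key exists, so a typo doesn't silently create a new key
    reg query $svcKey | Out-Null
    if ($LASTEXITCODE -ne 0) { throw "Service key not found: $svcKey" }

    reg add $svcKey /v Start /t REG_DWORD /d 4 /f
    if ($LASTEXITCODE -ne 0) { throw "reg add failed (exit code $LASTEXITCODE)" }

    Write-Host "SUCCESS: Service disabled"
}
finally {
    reg unload "HKLM\OfflineSystem"
}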

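Revert a Bad Group Policy Change

Policy settings are stored under HKLM\SOFTWARE\Policies, inside the SOFTWARE hive. Removing a value offline clears it locally, but if it comes from a domain GPO, Group Policy will write it back at the next refresh once the instance reaches a domain controller, so fix the GPO too. This example removes the fDenyTSConnections policy value that blocks Remote Desktop connections.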

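reg load "HKLM\OfflineSoftware" "$env:EC2RESCUE_OFFLINE_REGISTRY_DIR\SOFTWARE"
if ($LASTEXITCODE -ne 0) { throw "reg load failed (exit code $LASTEXITCODE)" }
try {
    # Remove the policy value that denies Remote Desktop connections.
    # Remote Desktop Services policies live under Policies\Microsoft\Windows NT\Terminal Services.
    reg delete "HKLM\OfflineSoftware\Policies\Microsoft\Windows NT\Terminal Services" /v fDenyTSConnections /f
    if ($LASTEXITCODE -ne 0) { throw "reg delete failed (exit code $LASTEXITCODE)" }

    Write-Host "SUCCESS: Group Policy value removed"
}
finally {
    reg unload "HKLM\OfflineSoftware"
}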

Service Start Type Reference

Value	Start Type	Description
`0`	Boot	Loaded by the boot loader (kernel drivers)
`1`	System	Started during kernel initialization
`2`	Automatic	Started by Service Control Manager at boot
`3`	Manual	Started on demand
`4`	Disabled	Cannot be started
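For logging from an offline script, a small lookup can turn the numeric Start value into the names in this table. ConvertTo-ServiceStartTypeName is a hypothetical helper. “Automatic (Delayed Start)” is not a separate number: it is Start = 2 plus a DelayedAutostart value of 1 under the same service key.

```powershell
# Hypothetical helper: map a service's numeric Start value to its start type name.
function ConvertTo-ServiceStartTypeName {
    param([Parameter(Mandatory)][int] $Start)
    switch ($Start) {
        0 { 'Boot' }        # drivers loaded by the boot loader
        1 { 'System' }      # drivers started during kernel initialization
        2 { 'Automatic' }   # also used for Delayed Start (with DelayedAutostart = 1)
        3 { 'Manual' }
        4 { 'Disabled' }
        default { "Unknown ($Start)" }
    }
}
```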

Alternative: The User Data Method

If your instance can still boot into Windows (it just won’t accept RDP), there’s an even simpler approach that avoids the volume swap entirely:

Stop the instance (EC2 Console → Actions → Instance state → Stop)
Edit User Data (Actions → Instance settings → Edit user data)
Paste a PowerShell script:

<powershell>
# Re-enable RDP and open the Remote Desktop firewall rules
Set-ItemProperty -Path 'HKLM:\SYSTEM\CurrentControlSet\Services\TermService' -Name 'Start' -Value 2
Set-ItemProperty -Path 'HKLM:\SYSTEM\CurrentControlSet\Control\Terminal Server' -Name 'fDenyTSConnections' -Value 0
Enable-NetFirewallRule -DisplayGroup "Remote Desktop"
Start-Service TermService
# No Restart-Computer here: with <persist>true</persist> this script runs on
# every boot, so a reboot at the end would put the instance in a reboot loop.
</powershell>
<persist>true</persist>

Start the instance - the script runs during boot
After you regain access, clear the user data (stop, edit, remove script, start) so it doesn’t run on every boot

Important

The User Data method only works if Windows can boot and the launch agent runs. The <persist>true</persist> tag is what makes the script run on boots after the first launch: EC2Launch v2 (used by current Windows Server AMIs) honors it, while EC2Launch v1 (Server 2016) and EC2Config (Server 2012 R2) require user data execution to be pre-enabled. If the instance is stuck in a boot loop, blue screen, or can’t load Windows at all, this method won’t work - use the EC2Rescue runbooks instead.

When to Use Which Approach

Scenario	Best Approach
RDP broken, firewall blocking, common service issues	AWSSupport-ExecuteEC2Rescue - one click, no scripting
Specific registry change needed (known key/value)	AWSSupport-StartEC2RescueWorkflow - custom PowerShell script
Instance boots but RDP fails (EC2Launch agent works)	User Data method - simplest, no volume swap
Blue screen, boot loop, can’t load Windows	AWSSupport-StartEC2RescueWorkflow or ExecuteEC2Rescue
Smart card lockout (scforceoption) - instance unreachable, SSM agent not running	AWSSupport-StartEC2RescueWorkflow - offline registry edit to disable scforceoption
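Corrupted registry, need to restore from backup	AWSSupport-ExecuteEC2Rescue (EC2Rescue can restore from RegBack, but only if that folder holds real backups; it is empty by default on Windows Server 2019 and later)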
Encrypted root volume	Manual volume swap (SSM runbooks don’t support encrypted volumes)

IAM Permissions

The automation needs permissions to create VPCs, launch instances, manage volumes, and create Lambda functions. AWS provides a CloudFormation template that creates the required IAM role automatically:

Go to the SSM EC2Rescue documentation
Download the AWSSupport-EC2RescueRole.zip CloudFormation template
Deploy the stack in CloudFormation
Copy the AssumeRole ARN from the stack’s Outputs tab
Use that ARN as the AutomationAssumeRole parameter

Alternatively, attach the AmazonSSMAutomationRole managed policy to your execution role and add permissions for:

ec2:* - VPC, subnet, instance, volume, and AMI operations
lambda:CreateFunction, lambda:InvokeFunction, lambda:DeleteFunction - for the automation’s internal functions
iam:CreateRole, iam:PassRole, iam:DeleteRole - for the helper instance profile
s3:GetObject - to pull the EC2Rescue tooling from AWS-managed buckets
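If you go the managed-policy route, the extra permissions in the list above can be expressed as an inline IAM policy. This sketch mirrors the list as-is with Resource set to `*`; narrow it where you can. The helper name is hypothetical, and attaching it with Write-IAMRolePolicy assumes the AWS.Tools.IdentityManagement module.

```powershell
# Hypothetical helper: the extra permissions listed above as an IAM policy document (JSON).
function New-EC2RescueExtraPolicyJson {
    $policy = [ordered]@{
        Version   = '2012-10-17'
        Statement = @(
            [ordered]@{
                Effect   = 'Allow'
                Action   = @(
                    'ec2:*',
                    'lambda:CreateFunction', 'lambda:InvokeFunction', 'lambda:DeleteFunction',
                    'iam:CreateRole', 'iam:PassRole', 'iam:DeleteRole',
                    's3:GetObject'
                )
                # Broad on purpose to mirror the list; scope down for production.
                Resource = '*'
            }
        )
    }
    # The default -Depth (2) is too shallow for this nesting.
    return ($policy | ConvertTo-Json -Depth 5)
}
```

`Write-IAMRolePolicy -RoleName <your automation role> -PolicyName EC2RescueExtras -PolicyDocument (New-EC2RescueExtraPolicyJson)` would attach it; the policy name is arbitrary.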

Limitations

Limitation	Details
Encrypted root volumes	Not supported - including AWS-managed keys (`aws/ebs`). Both runbooks check the volume’s `Encrypted` flag and fail immediately if it’s `true`, regardless of whether the encryption uses AWS-managed or customer-managed KMS keys. The helper instance’s IAM role lacks the KMS permissions needed to mount the volume. For encrypted volumes, use the manual volume swap method instead.
Instance store volumes	Data on instance store will be lost when the automation stops the instance.
Marketplace AMIs	Instances from AWS Marketplace AMIs are not supported.
Public IP	The public IP changes after stop/start unless an Elastic IP is associated.
VPC quota	The `CreateNewVPC` option fails if you’ve reached the Region's VPC quota (5 per Region by default).
Same AZ required	The helper instance/subnet must be in the same Availability Zone as the target.

Troubleshooting

Issue	Solution
Automation fails at “assert volume not encrypted”	Root volume is encrypted. Use the manual volume swap method instead.
Automation fails creating VPC	VPC limit reached. Use `SelectedInstanceSubnet` instead of `CreateNewVPC`.
Script runs but registry change didn’t take effect	Verify you targeted the correct ControlSet. Check `$env:EC2RESCUE_OFFLINE_CURRENT_CONTROL_SET` in your script.
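“The process cannot access the file because it is being used by another process” on reg load	The hive is probably still loaded from an earlier step or a previous run that didn't `reg unload` it. Unload it first; if the unload itself fails, run `[gc]::Collect()` in PowerShell to release handles and retry.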
Instance still won’t boot after fix	Launch a new instance from the backup AMI as a fallback. ExecuteEC2Rescue always creates one; StartEC2RescueWorkflow only does if you set `CreatePreEC2RescueBackup` to `true`.
Automation times out	Check the execution steps in SSM to see where it stalled. Verify IAM permissions and subnet connectivity.

Conclusion

Manually swapping EBS volumes between instances to fix a broken Windows registry is now the exception rather than the rule. For unencrypted, EBS-backed instances, AWS Systems Manager gives you two automation runbooks that handle the entire rescue workflow:

AWSSupport-ExecuteEC2Rescue - One click to auto-fix common issues (RDP, firewall, services, disk signature, registry restore)
AWSSupport-StartEC2RescueWorkflow - Full control with custom PowerShell scripts for specific registry edits

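AWSSupport-ExecuteEC2Rescue always creates a backup AMI (StartEC2RescueWorkflow does when you set CreatePreEC2RescueBackup to true), and both runbooks handle the volume mounting logistics and clean up after themselves. For instances that can still boot, the User Data method is even simpler - no volume swap at all.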

Key takeaways:

Use ExecuteEC2Rescue first - It fixes the most common issues automatically
Use StartEC2RescueWorkflow for custom fixes - Load any registry hive and make targeted changes
Always reg unload your hives - Failure to unload risks corruption
Encrypted volumes aren’t supported - You'll need the manual approach for those
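Backup AMIs persist - ExecuteEC2Rescue always leaves one, and StartEC2RescueWorkflow does when CreatePreEC2RescueBackup is true, so you have a rollback point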
Test it before you need it - Run the hands-on exercise on a throwaway instance so you’re ready when it matters at 2 AM
