Arun's Blog

Cutting an MGN Migration Tracker From EC2 to Serverless (90% Cost Drop)

Tags: AWS, Migration, Cost Optimization, Serverless

TL;DR

I replaced a two-EC2 Flask app that coordinated AWS Application Migration Service (MGN) cutover waves across 23 spoke accounts with a single Lambda + API Gateway + S3 stack. A second Lambda runs every 5 minutes via EventBridge to refresh inventory by assuming a read-only role in each spoke. Run cost dropped from roughly $500-700/year to $15-65/year. The tracker also stopped depending on a consultant's laptop staying awake.

Introduction

If you've ever run an MGN cutover wave across more than a handful of accounts, you know the operational pain. AWS gives you the Application Migration Service to move on-prem servers into AWS, but it gives you almost nothing for coordinating which servers move on which weekend, who's on the hook for each launch, which ENI was pre-staged for each server, and whether the launch template has been modified to your reference pattern yet. That's where a migration tracker comes in.

This post walks through how I rebuilt one. I'll cover the old EC2-based architecture, the new fully-serverless architecture, the IAM trust chain that makes cross-account writes safe, and the cost math that justified the rewrite. By the end you should have a clear template for doing the same thing in your own environment.

The Problem Being Solved

The setup is a 23-account AWS organization in the middle of a phased lift-and-shift from a corporate data centre. Roughly 250 source servers need to be replicated via MGN and cut over in waves of 30-50 servers at a time, every other Friday night. Per server, the engineer doing the cutover needs to know:

  • Which AWS account, VPC, and subnet the server lands in
  • Which Elastic Network Interface (ENI) was pre-staged for it so the cutover instance comes up with the right IP
  • Whether the MGN-managed launch template was already patched to the standard pattern (gp3 disks, fixed tag set, SSM instance profile, m7i instance type)
  • Whether the source server is fully replicated (CONTINUOUS) or still syncing
  • Whether a test EC2 has been launched and validated, and whether termination protection is on it
  • Whether the 26 cutover-night tasks are checked off

That's roughly 14 columns of state per server, changing in real time as MGN replicates and as engineers click. A spreadsheet gets noisy fast. The AWS console doesn't aggregate across accounts. So the team built a small internal web app to surface it all on one screen.

The Old Architecture (EC2-Based)

The first version was the obvious one:

  • Two t3.medium EC2 instances running 24/7 in a shared services account, each behind an internal ALB
  • One ran the infrastructure dashboard (EC2 / ELB / S3 / VPC / MGN inventory), the other ran the migration tracker
  • Both were Flask apps with hand-rolled cookie auth
  • Inventory data came from a PowerShell script, _5minrun.ps1, that ran on a consultant laptop on a 5-minute loop, drove boto3 calls against each account, and uploaded CSVs to an S3 bucket
  • The Flask apps on the EC2 hosts polled S3 every few minutes to refresh their in-memory state

The Hidden Dependency

The whole system silently depended on the consultant's laptop being on, on the corporate network, and not asleep. If the laptop was closed for the weekend, the dashboard data went stale until Monday. Nobody outside the consulting team realised that was the dependency until the laptop went to sleep mid-cutover.

The old architecture worked, but it had four structural problems:

  1. Laptop dependency. The collector was not a server; it was a script on a workstation.
  2. Two boxes running 24/7 for an app used maybe 20 hours a week.
  3. No CloudFormation, no infrastructure-as-code for the EC2 hosts. Recreating the environment was a tribal-knowledge ritual.
  4. Cost. ~$500-700/year for compute we mostly weren't using. Trivial in absolute terms but architecturally embarrassing.

The New Architecture (Serverless)

The replacement runs entirely on managed services:

  • API Gateway HTTP API with two custom domains attached (the second is a backup for users behind a Zscaler policy that blocks the primary's TLD)
  • Web Lambda running the Flask app via the AWS Lambda Web Adapter (no rewrite of the Flask code was needed)
  • S3 bucket in a shared-services account holding seven inventory CSVs plus the per-wave task-progress JSON
  • Collector Lambda fired every 5 minutes by an EventBridge Scheduler schedule. It assumes a read-only role in each of the 23 spokes in parallel, gathers EC2 / ELB / S3 / VPC / EFS / MGN inventory, and writes CSVs to S3
  • CloudFormation StackSet owns the read-only role across the org; service-managed, auto-deployment enabled so new accounts pick it up automatically
  • ACM certs + Route 53 ALIAS records for the two custom domains
  • 14-day CloudWatch Logs retention on both Lambdas, set explicitly so we don't accidentally keep noise forever
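
The collector's fan-out can be sketched roughly as below. This is a minimal illustration, not the production collector: `spoke_session`, `collect_all`, the session name, and the worker count are hypothetical names and values of mine; only the role name comes from this post. The per-account work is passed in as a function so the parallel part can be exercised without AWS credentials.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

ROLE_NAME = "DashboardCollectorRole"  # role name used in this post


def spoke_session(account_id, role_name=ROLE_NAME):
    """Assume the read-only role in one spoke and return a boto3 Session."""
    import boto3  # lazy import: only needed when actually talking to AWS

    creds = boto3.client("sts").assume_role(
        RoleArn=f"arn:aws:iam::{account_id}:role/{role_name}",
        RoleSessionName="inventory-collector",  # hypothetical session name
    )["Credentials"]
    return boto3.Session(
        aws_access_key_id=creds["AccessKeyId"],
        aws_secret_access_key=creds["SecretAccessKey"],
        aws_session_token=creds["SessionToken"],
    )


def collect_all(account_ids, collect_one, max_workers=8):
    """Run collect_one(account_id) for every spoke in parallel.

    Returns (results, errors) so one broken spoke does not sink the run.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(collect_one, acct): acct for acct in account_ids}
        for future in as_completed(futures):
            acct = futures[future]
            try:
                results[acct] = future.result()
            except Exception as exc:  # record the failure and keep going
                errors[acct] = repr(exc)
    return results, errors
```

In a real handler, `collect_one` would call `spoke_session(account_id)` and then run the Describe/List calls; collecting errors per account keeps a single mis-deployed role from blanking the whole dashboard.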

Lambda Web Adapter Was the Trick

The Lambda Web Adapter is a small AWS-published layer that translates API Gateway events into vanilla HTTP requests on a local port. Your Flask app listens on $PORT the same way it would on EC2; the layer handles the Lambda lifecycle. That meant migrating the existing Flask code took an afternoon, not a sprint.
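
To make the "listens on $PORT" point concrete, here is a small sketch using only the standard library. The `app` function is a stand-in of mine for the real Flask app (a Flask application object is a WSGI callable as well), and the 8080 fallback is an assumption about the adapter's default, not something taken from the tracker's configuration.

```python
import os
from wsgiref.simple_server import make_server


def listen_port(environ=os.environ, default=8080):
    """Port the app binds to; the 8080 fallback is an assumption."""
    return int(environ.get("PORT", default))


def app(environ, start_response):
    """Stand-in WSGI app; the real tracker would expose its Flask app here."""
    body = b"tracker ok\n"
    start_response(
        "200 OK",
        [("Content-Type", "text/plain"), ("Content-Length", str(len(body)))],
    )
    return [body]


def serve():
    """Blocks forever: serve the app on localhost at the configured port."""
    with make_server("127.0.0.1", listen_port(), app) as httpd:
        httpd.serve_forever()
```

Nothing in this code knows it is running in Lambda, which is the whole appeal: the same entry point works on EC2, on a laptop, and behind the adapter.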

The Two-Role IAM Pattern

Cross-account access is the only thing in the design that has any meaningful security surface. There are two roles in each spoke account:

| Role | Trusted by | Permissions | Why it exists |
| --- | --- | --- | --- |
| DashboardCollectorRole | The collector Lambda's execution role | Read-only: ec2:Describe*, mgn:Describe*, elasticloadbalancing:Describe*, s3:List*, elasticfilesystem:Describe* | Inventory gathering. Pure reads. Cannot mutate anything. |
| DashboardWriterRole | The web Lambda's execution role | Scoped writes: create/delete tagged ENIs, modify launch templates, MGN StartTest / TerminateTargetInstances, ec2:RunInstances via MGN, kms scoped via kms:ViaService = ec2.us-east-{1,2}.amazonaws.com, iam:PassRole conditioned on iam:PassedToService = ec2.amazonaws.com | Lets engineers drive a small set of cutover actions from the UI without console hopping. |

Both roles are deployed by CloudFormation StackSet, service-managed, with auto-deployment enabled. New accounts added to the organisation pick up both roles within minutes. Neither role is editable by hand in any spoke - if someone tries, the next StackSet reconciliation overwrites the change.
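
For readers who want to see the shape of these policies, here is a hedged sketch that builds them as plain Python dicts. The account ID, the execution-role name, the Sids, and the KMS action list are placeholders I made up for illustration; only the two condition keys and their values come from the table above.

```python
import json

HUB_ACCOUNT_ID = "111111111111"                # hypothetical shared-services account
COLLECTOR_EXEC_ROLE = "collector-lambda-exec"  # hypothetical execution-role name


def trust_policy(hub_account_id, exec_role_name):
    """Trust policy: only the named Lambda execution role may assume the spoke role."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {
                "AWS": f"arn:aws:iam::{hub_account_id}:role/{exec_role_name}"
            },
            "Action": "sts:AssumeRole",
        }],
    }


def writer_guardrail_statements():
    """Two of the writer role's conditioned statements from the table."""
    return [
        {
            "Sid": "KmsOnlyViaEc2",
            "Effect": "Allow",
            # Assumed action list; the post only states the ViaService scoping.
            "Action": ["kms:CreateGrant", "kms:GenerateDataKeyWithoutPlaintext"],
            "Resource": "*",
            "Condition": {"StringEquals": {"kms:ViaService": [
                "ec2.us-east-1.amazonaws.com",
                "ec2.us-east-2.amazonaws.com",
            ]}},
        },
        {
            "Sid": "PassRoleOnlyToEc2",
            "Effect": "Allow",
            "Action": "iam:PassRole",
            "Resource": "*",  # a real policy would likely name the profile's role
            "Condition": {"StringEquals": {"iam:PassedToService": "ec2.amazonaws.com"}},
        },
    ]


print(json.dumps(trust_policy(HUB_ACCOUNT_ID, COLLECTOR_EXEC_ROLE), indent=2))
```

The same trust-policy shape serves both spoke roles; what differs is which execution role is named as the principal and which permission statements are attached.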

Separation Is the Whole Point

Keep the collector role and the writer role separate. The collector runs every 5 minutes on a schedule with no human in the loop, so it gets the smaller blast radius (reads only). The writer role activates only when a real engineer clicks a button, so it can carry the heavier permissions. Conflating them into one role with both read and write would make every collector run a potential write incident.

What the Tracker Actually Does

The migration tracker page is a sortable / filterable table, one row per source server, grouped by wave. Each row has clickable cells that perform live AWS actions in the right spoke account through the writer role:

  • ENI cell. Pre-stage an ENI in the target subnet so the cutover instance launches with a known IP. Tags the ENI with map-migrated for cost attribution.
  • Launch Template cell. Modify the MGN-managed launch template to the reference pattern in one click: all EBS volumes become gp3, the pre-staged ENI is wired as NetworkInterfaces[0], the standard tag set lands, the SSM instance profile is set, and right-sizing is set to NONE. As of June 2026 it also flips the MGN replication staging disks to gp3 in the same click.
  • Test EC2 cell. Start a Test launch via mgn:StartTest or terminate one via mgn:TerminateTargetInstances. Both are password-gated.
  • Term Protect cell. Flip DisableApiTermination on the test EC2 so an errant Terminate doesn't blow it away mid-validation.
  • Task panel. A 26-task cutover-night checklist per server. Each click writes to a per-wave JSON file in S3 so multiple engineers see updates in near-real-time.
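
The Launch Template cell's rewrite can be approximated as a pure function over the template's LaunchTemplateData. This is a sketch under my own assumptions, not the tracker's code: the function name and arguments are hypothetical, and dropping launch-level security groups reflects my understanding that EC2 rejects them alongside an explicit network interface.

```python
import copy


def patch_launch_template_data(data, eni_id, instance_profile_name, tags):
    """Return a copy of LaunchTemplateData rewritten to the reference pattern."""
    patched = copy.deepcopy(data)  # never mutate the caller's copy

    # Every EBS volume becomes gp3.
    for mapping in patched.get("BlockDeviceMappings", []):
        if "Ebs" in mapping:
            mapping["Ebs"]["VolumeType"] = "gp3"

    # Wire the pre-staged ENI as NetworkInterfaces[0].
    patched["NetworkInterfaces"] = [
        {"DeviceIndex": 0, "NetworkInterfaceId": eni_id}
    ]
    # Assumption: launch-level security groups conflict with an explicit ENI.
    patched.pop("SecurityGroupIds", None)
    patched.pop("SecurityGroups", None)

    patched["IamInstanceProfile"] = {"Name": instance_profile_name}
    tag_list = [{"Key": k, "Value": v} for k, v in sorted(tags.items())]
    patched["TagSpecifications"] = [
        {"ResourceType": resource_type, "Tags": tag_list}
        for resource_type in ("instance", "volume")
    ]
    return patched
```

The patched data would then go to the EC2 `create_launch_template_version` call, followed by `modify_launch_template` to make the new version the default. Right-sizing lives in MGN's launch configuration rather than in the template, so it needs a separate call.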

Batch operations live in a toolbar that appears when you select rows: bulk-flip termination protection, bulk-modify launch templates, bulk-terminate test instances. The destructive ones share a single environment-variable password, so rotation happens in one place.
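
A password gate like this is only a few lines. The sketch below is one way to write it, assuming a hypothetical variable name `TRACKER_ACTION_PASSWORD`; it fails closed when the variable is unset and uses a constant-time comparison.

```python
import hmac
import os


def destructive_action_allowed(supplied, environ=os.environ,
                               var="TRACKER_ACTION_PASSWORD"):
    """True only if the supplied password matches the configured one."""
    expected = environ.get(var, "")
    if not expected:
        return False  # fail closed: no configured password means no access
    # compare_digest avoids leaking match length through timing
    return hmac.compare_digest(supplied.encode("utf-8"), expected.encode("utf-8"))
```

Rotation then means updating one environment variable on the web Lambda.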

Cost Comparison

This is the section the post exists for. Numbers are for us-east-1 as of mid-2026.

Old EC2 Deployment - Annual Cost

| Item | Sizing | Cost |
| --- | --- | --- |
| EC2 (dashboard host) | t3.medium, 24/7 | ~$30/month |
| EC2 (tracker host) | t3.medium, 24/7 | ~$30/month |
| EBS gp3 root volumes | 30 GiB × 2 | ~$6/month |
| CloudWatch Logs / agent metrics | Default retention, unscoped | ~$3/month |
| Data transfer + minor misc | Low-volume internal app | ~$2/month |
| Subtotal | Roughly $70/month | ~$840/year (real-world range $500-700 with reserved-instance / savings-plan effects) |

New Serverless Deployment - Annual Cost

| Item | Sizing | Cost |
| --- | --- | --- |
| Domain renewal (.click TLD) | 1 domain | ~$3/year |
| Route 53 hosted zones | 2 zones (primary + backup) | $12/year |
| Web Lambda compute | ~20 hours of clicks/week, sub-1s requests | ~$0 (free tier) |
| Collector Lambda compute | 5-minute schedule × 30-80s runs × 23 accounts | $0-4/month |
| API Gateway HTTP API | ~$1 per million requests | <$1/year at this scale |
| S3 storage + requests | Tiny CSVs, frequent PUTs | ~$4/year |
| EventBridge Scheduler | $1 per million invocations | ~$0.15/year |
| CloudWatch Logs | 14-day retention, both functions | ~$1.20/year |
| ACM public certs | 2 certs | $0 (free) |
| Subtotal | Roughly $1-5/month + $3/year | ~$15-65/year |
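
The collector line is the only one that moves much, so here is a rough way to sanity-check it. Everything in this sketch is an assumption on my part: the per-GB-second and per-request rates are my reading of public x86 on-demand pricing, the free-tier allowances are shared with every other function in the account, and the 0.5 GB memory size in the example is a guess, since the table does not state one.

```python
PRICE_PER_GB_SECOND = 0.0000166667    # assumed x86 on-demand rate
PRICE_PER_REQUEST = 0.20 / 1_000_000  # assumed request rate
FREE_GB_SECONDS = 400_000             # assumed monthly free-tier compute
FREE_REQUESTS = 1_000_000             # assumed monthly free-tier requests


def collector_monthly_cost(run_seconds, memory_gb, interval_minutes=5,
                           days=30, free_tier=True):
    """Estimated monthly Lambda cost for one scheduled collector function."""
    runs = (24 * 60 // interval_minutes) * days
    gb_seconds = runs * run_seconds * memory_gb
    requests = runs
    if free_tier:
        gb_seconds = max(0.0, gb_seconds - FREE_GB_SECONDS)
        requests = max(0, requests - FREE_REQUESTS)
    return gb_seconds * PRICE_PER_GB_SECOND + requests * PRICE_PER_REQUEST


for seconds in (30, 80):
    with_free = collector_monthly_cost(seconds, 0.5)
    without_free = collector_monthly_cost(seconds, 0.5, free_tier=False)
    print(f"{seconds}s runs: ${with_free:.2f} with free tier, "
          f"${without_free:.2f} without")
```

Under these assumptions the function stays inside the free tier at 0.5 GB and costs a few dollars a month without it, which is the same ballpark as the table's range.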

Net Savings

~$500-700/year → ~$15-65/year

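That's roughly an 87-98% reduction in annual run cost, depending on which ends of the two ranges you compare (1 - 65/500 at worst, 1 - 15/700 at best), so "90%" is a fair headline. Just as important: there is no longer always-on compute sitting idle. The fixed floor is a few dollars of Route 53, S3 storage, and scheduled-Lambda invocations, and everything above that is paid per request.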
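
The bounds on the reduction follow directly from the two stated ranges. A tiny sketch of the arithmetic (the function name is mine):

```python
def reduction_range(old_low, old_high, new_low, new_high):
    """Worst-case and best-case fractional savings between two cost ranges."""
    worst = 1 - new_high / old_low   # cheapest old setup vs. priciest new one
    best = 1 - new_low / old_high    # priciest old setup vs. cheapest new one
    return worst, best


worst, best = reduction_range(500, 700, 15, 65)
print(f"{worst:.0%} to {best:.0%}")  # prints "87% to 98%"
```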

What You Lose

This isn't a free lunch. The serverless rebuild gives up a few things worth being honest about:

  • Cold start. The first request to a fresh container takes ~1.5s. Subsequent requests are ~50ms while the container stays warm. Not a problem for an internal tool; would be for a customer-facing API.
  • 20-second S3 cache. The web Lambda caches S3 reads for 20s to avoid hammering S3 on every page view. So a CSV update may not show up immediately. Acceptable for cutover-coordination work where the data was already on a 5-minute refresh.
  • Shared password instead of SSO. Deliberate for team size (under 10 people) and data sensitivity (operational metadata, no PII). I would not ship the same model to a 200-person org.
  • No durable state in memory. Execution environments are recycled and several can run at once, so anything held in memory is at best a short-lived per-container cache. Anything you want to persist has to go to S3 (or DynamoDB). Forced me to think harder about what truly needed persistence vs. what was incidental.
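
The 20-second cache is easy to picture as a small time-to-live wrapper around whatever function reads from S3. This is a minimal sketch with names of my own choosing, not the tracker's implementation; the loader and the clock are injected so the expiry logic can be tested without S3 or real time passing.

```python
import time


class TTLCache:
    """Cache loader(key) results for ttl_seconds, then reload on next access."""

    def __init__(self, loader, ttl_seconds=20.0, clock=time.monotonic):
        self._loader = loader
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (expires_at, value)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # still fresh
        value = self._loader(key)  # expired or never loaded
        self._entries[key] = (now + self._ttl, value)
        return value
```

An instance created at module level lives only as long as its container stays warm, so the cache is per-container and best-effort: two concurrent containers can briefly show different data.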

Lessons Learned

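  1. Lambda Web Adapter saves the rewrite. If you've already got a Flask or Express app working on EC2, you don't have to rip it apart. Add the layer, set PORT, and point the function's handler at the app's startup script (for zip-packaged functions, with AWS_LAMBDA_EXEC_WRAPPER set to the adapter's /opt/bootstrap). Day-one productivity.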
  2. Make the IAM separation deliberate. A read role and a write role, not one fat admin role. The collector should not be able to mutate anything even if it's compromised.
  3. Use CloudFormation StackSets early. Auto-deployment means new accounts get coverage without you remembering. We had two new accounts onboard during the migration; both showed up in the dashboard within the hour with zero manual steps.
  4. S3 is fine for state at this scale. Per-wave JSON files holding task progress, CSV inventories - the latency and durability are both more than enough. DynamoDB would be the right call if you needed sub-second updates or true real-time fan-out across users.
  5. The laptop dependency is the architecture failure mode you don't see coming. Anything that requires a workstation to be online is a production dependency. Make it a managed service.
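
Lesson 4's per-wave JSON state boils down to a read-modify-write against one S3 object. Here is a hedged sketch: the document layout (`servers` → server name → task id) and the function names are invented for illustration, the S3 client is passed in, and the per-wave file is assumed to already exist.

```python
import json


def apply_task_update(doc, server, task_id, done, user, when):
    """Return a new per-wave document with one task's state recorded."""
    updated = json.loads(json.dumps(doc))  # cheap deep copy for JSON data
    server_tasks = updated.setdefault("servers", {}).setdefault(server, {})
    server_tasks[str(task_id)] = {"done": bool(done), "by": user, "at": when}
    return updated


def update_task_in_s3(s3, bucket, key, **change):
    """Read the wave file, apply one change, write it back (last writer wins)."""
    current = json.loads(s3.get_object(Bucket=bucket, Key=key)["Body"].read())
    updated = apply_task_update(current, **change)
    s3.put_object(
        Bucket=bucket,
        Key=key,
        Body=json.dumps(updated).encode("utf-8"),
        ContentType="application/json",
    )
    return updated
```

With a real client this would be called as `update_task_in_s3(boto3.client("s3"), bucket, key, server=..., task_id=..., done=True, user=..., when=...)`. Two engineers clicking at the same instant can still overwrite each other in this simple form, which is part of why DynamoDB becomes the better fit once updates get frequent.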

Conclusion

Moving an internal MGN coordination tool from EC2 to Lambda + API Gateway + S3 cut annual run cost from roughly $500-700 to $15-65, eliminated the laptop dependency that was silently load-bearing, and made the whole stack reproducible through CloudFormation. The migration took about a week of focused work and a few smaller follow-ups to harden the IAM and add UI polish.

If you're running anything similar - a cutover dashboard, an inventory tool, a small internal status page on a permanent EC2 - it's worth pricing out the serverless equivalent. The cost delta usually pays for the rewrite in months, and the operational surface goes from a couple of always-on hosts to a Lambda you forget exists until you change it.

Pro Tip

Before you build, search your own operations for a workstation script that runs on a cron. If you find one, it's almost always the same pattern as the laptop collector here. Wrap it in an EventBridge-scheduled Lambda and you've already retired your most fragile dependency.
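
If you go the EventBridge Scheduler route, the schedule itself is a single API call. The sketch below builds the arguments for boto3's `scheduler` client; the schedule name and both ARNs are placeholders you would supply, and the invoke role is assumed to already allow EventBridge Scheduler to call the function.

```python
def collector_schedule_kwargs(lambda_arn, invoke_role_arn,
                              name="inventory-collector", minutes=5):
    """Arguments for scheduler.create_schedule: run a Lambda every N minutes."""
    unit = "minute" if minutes == 1 else "minutes"
    return {
        "Name": name,  # hypothetical schedule name
        "ScheduleExpression": f"rate({minutes} {unit})",
        "FlexibleTimeWindow": {"Mode": "OFF"},  # fire on the dot, no jitter window
        "Target": {"Arn": lambda_arn, "RoleArn": invoke_role_arn},
    }


def create_collector_schedule(lambda_arn, invoke_role_arn):
    """Create the schedule; needs AWS credentials, so not run in this sketch."""
    import boto3  # lazy import keeps the builder above dependency-free

    kwargs = collector_schedule_kwargs(lambda_arn, invoke_role_arn)
    return boto3.client("scheduler").create_schedule(**kwargs)
```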
