'Trust Me, Bro! Replication Complete!': Why You Should Validate MGN Volumes Before Cutover

TL;DR

Two RHEL 8 servers migrated with AWS Application Migration Service (MGN) kernel-panicked on their test launches: Unable to mount root fs on unknown-block(0,0). The boot disk's rootvg came across as an empty LVM physical volume and the boot kernel's initramfs was 0 bytes, even though MGN reported the disk fully replicated. A sudoers error in the conversion log sent me down a wrong path. The real cause was an incomplete initial replication of the partitioned boot disk, fixed by a full re-replication (Stop, then Start). The reusable trick: instead of waiting 15-20 minutes for a Test launch just to learn whether the volumes replicated, mount the MGN staging snapshot of the boot disk on a throwaway helper and check it in two minutes.

The Panic

I ran an MGN Test launch on a migrated RHEL 8 box and the instance never reached a login. The serial console showed the kernel dying about two seconds in:

Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(0,0)

The graphical console screenshot was frozen at Probing EDD (edd=off to disable)... ok, which is a red herring on Nitro (output goes to the serial console, so the screenshot stalls early even on healthy boots). The serial log is the source of truth, and it was a clean panic. A second server with the same build hit the exact same thing. Two for two is not a fluke, it's a pattern.

The Disk Layout

These were standard RHEL LVM builds. The boot disk is partitioned, with the OS root on an LVM volume group inside a partition:

/dev/sda (boot disk): sda2 = /boot (xfs), sda4 = LVM physical volume holding rootvg (lv_root, lv_usr, lv_var, lv_opt, and friends)
/dev/sdb, /dev/sdc (data disks): whole-disk LVM PVs, each with its own volume group

On the source server, everything was healthy. pvs showed rootvg sitting on /dev/sda4 with all its logical volumes. No multipath, no SAN, nothing exotic.

First Clue: an Empty Root Volume Group

To see what actually landed, I took the launched test instance's root volume, attached it to a helper, and scanned it. The data disks were perfect. The boot disk was not. The rootvg physical volume was empty:

  PV         VG     PSize    PFree
  /dev/sdf4         <47.50g  <47.50g     # no VG, 100% free

An LVM PV that is 100% free has no logical volume extents on it at all. No rootvg, no lv_root, no operating system. On top of that, the initramfs for the default boot kernel was 0 bytes. So even if the root volume had been there, GRUB had no working initramfs to assemble LVM with. Either failure alone is enough to panic; this had both.

The part that made me distrust everything: MGN's console reported all three disks fully replicated, zero backlog, including this one.

The Sudoers Red Herring

MGN runs a Linux conversion step on the replicated disk at launch, the part that injects AWS drivers, rewrites GRUB, and rebuilds the boot initramfs. Its log had a line that looked like a smoking gun:

root is not in the sudoers file.  This incident will be reported.

That sent me straight to the MGN docs, which do state the source user must be root or in the sudoers list. These servers were hardened, and root had been stripped of sudo. The theory wrote itself: the conversion can't finish its sudo steps, so the boot config never gets built. I almost had the client change their sudoers baseline.

Then I checked a server that had already migrated cleanly from the same fleet. It had the identical hardening:

$ sudo -ll -U root
User root is not allowed to run sudo on this host.

Same locked-down root, and it migrated fine. So the sudo line was benign noise (the conversion script even exited 0 around it). The hardening baseline was a shared org standard, not the differentiator. That comparison saved me from shipping a wrong fix.

Always compare against a known-good

When you think you've found the cause on a broken server, find a working server in the same fleet and check the same thing. If the working one has the same "problem," you haven't found the cause. A 30-second comparison killed a theory I'd already half-committed to.

The Actual Cause

The whole-disk data volumes replicated perfectly. Only the partitioned boot disk came over with empty LVM extents and 0-byte boot files. That shape, complete data disks but a hollow boot disk, points at an incomplete or interrupted initial sync that MGN nonetheless marked as done. The bytes-replicated counter said 100%, but the blocks that mattered for LVM and the boot files weren't actually there.

The Fix: Force a Full Re-Replication

You can make MGN re-copy every block from scratch by stopping replication and starting it again. Stopping discards the staging copy and all points in time; starting kicks off a fresh full initial sync. It only resets the AWS-side replication and never touches the running source server.

aws mgn stop-replication  --source-server-id s-xxxxxxxx   # discards staging copy
aws mgn start-replication --source-server-id s-xxxxxxxx   # fresh full initial sync

A few hours later (it re-transfers everything, so budget for the full disk size over your link), replication was healthy again. This time the boot disk came over correctly.

The Better Move: Check the Volumes Without a Test Launch

Here is the part I'll reuse on every migration from now on. A full MGN Test launch takes 15 to 20 minutes because it converts the disk and boots an instance. If all you want to know is "did the volumes replicate correctly," you do not need any of that. The replicated data already exists as an MGN staging snapshot, and you can inspect it directly.

This is a good way to confirm volumes came over successfully, instead of waiting 20 minutes to run an MGN Test just to find out.

The flow takes about two minutes of work plus a short wait:

Find the MGN staging snapshot of the boot disk (filter by the source server tag, pick the /dev/sda one).
Create an EBS volume from that snapshot.
Attach it to a cheap throwaway Amazon Linux 2023 helper.
Activate and scan LVM, then mount /boot and check the initramfs.

Finding the snapshot:

aws ec2 describe-snapshots --owner-ids self \
  --filters "Name=tag:AwsApplicationMigrationService-SourceServerID,Values=s-xxxxxxxx" \
  --query "reverse(sort_by(Snapshots,&StartTime))[].{Snap:SnapshotId,Size:VolumeSize,Dev:Tags[?Key=='AwsApplicationMigrationService-SourcePath']|[0].Value}" \
  --output table

Then on the helper, the whole verdict is a few commands:

vgchange -ay                 # activate VGs on the attached disk first
vgs ; lvs                     # rootvg present, with lv_root etc.?
mount -o ro,norecovery /dev/nvme1n1p2 /mnt   # /boot
find /mnt -maxdepth 1 -name 'initramfs-*.img' -printf '%s  %p\n'   # non-zero?

A pass looks like rootvg present with all its logical volumes and a non-zero initramfs for the default boot kernel. A fail looks like the empty PV from earlier. Same evidence the panic gave me, minus the 20-minute launch.

One honest caveat

This checks the raw replicated data, before MGN's boot conversion runs. It tells you the volumes came over, which is the failure mode I was chasing. It does not exercise the conversion or an actual boot, so a final Test launch is still the definitive "it boots" proof. I use the snapshot scan to decide whether a Test launch is even worth running yet.

Gotchas From Doing This

Gotcha	What bites you
Helper OS choice	Use Amazon Linux 2023: the SSM agent is preinstalled and its root is not on LVM, so there's no `rootvg` name collision with the disk you're scanning. A RHEL helper has neither property.
Mounting the replicated xfs	A replicated filesystem is crash-consistent, so a normal mount tries log replay and fails. Use `mount -o ro,norecovery`.
LVM looks empty at first	The volume group is inactive on a freshly attached disk. Run `vgchange -ay` before `lvs`, or you'll think `rootvg` is missing when it's just not activated.
Snapshot tag casing	The MGN tags are inconsistent: `AwsApplicationMigrationService-SourceServerID` and `-SourcePath` are mixed-case, but `AWSApplicationMigrationServiceManaged` is all-caps. Filtering on two tags at once returns nothing, so filter by server ID and pick the device out of the output.
0-byte initramfs	A 0-byte initramfs for a non-default kernel is harmless. Only the kernel in `grubenv`'s `saved_entry` has to be non-zero.

What I'd Tell Past Me

Trust the bytes-replicated counter less than you'd like. It said 100% on a disk that was missing its root volume group. When a migrated Linux box panics on unable to mount root fs, scan the staging snapshot before you theorize, compare the broken server against a working one before you commit to a cause, and reach for a full Stop/Start re-replication when a partitioned boot disk comes over hollow. The whole thing went from a confident wrong diagnosis to a clean boot once I stopped guessing and looked at the actual blocks.

Want Help With This?

If you're working on something similar and want a second set of eyes, or you'd like to talk through how this applies to your environment, reach out via the contact form. Happy to help.