Network failure after live migration of diskless hosts



I’m running KVM on Ubuntu 9.10 in a three-node cluster. Each server has one NIC for its own host traffic and a separate NIC for the KVM bridges. All nodes are diskless, served by an OpenSolaris NFS server. My virtual machines are also diskless.

My Ubuntu and FreeBSD VMs work perfectly until I live-migrate them to another server. After the migration, the VM’s network appears to shut down: the VM console becomes unresponsive in vncviewer, all network connections drop, and the machine stops responding to pings. On the network side, other machines lose their ARP entries for the VM’s IP address. Running tcpdump on the interface underlying the bridge, I can see ARP who-has requests for the VM’s IP, but no reply ever comes back from the VM on its new host. This happens with both the Ubuntu and the FreeBSD VMs. To my untutored eye, it looks as though the VMs are not reattached to the bridge after a live migration.
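One way to test the "not reattached to the bridge" theory is to check, on the destination node right after the migration, whether the VM’s tap device is actually a port of the expected bridge. The minimal sketch below does that through sysfs, where an interface that is a bridge port has a brport/bridge symlink pointing at its bridge. It assumes the guest interfaces are named vnet* or tap* (adjust the globs for your nodes); SYSFS_NET is a hypothetical override added only so the functions can be tried against a fake directory tree.

```shell
# Sketch: report which bridge each tap-style interface is a port of.
# Assumptions: guest interfaces are named vnet* or tap* (adjust the globs),
# and SYSFS_NET is a hypothetical override used only for testing; it
# defaults to the kernel's sysfs network directory.

# bridge_of IFACE: print the name of the bridge IFACE belongs to, or nothing.
bridge_of() {
    net=${SYSFS_NET:-/sys/class/net}
    link="$net/$1/brport/bridge"
    # brport/ only exists while the interface is a bridge port, and
    # brport/bridge is a symlink to the bridge's own sysfs directory.
    if [ -L "$link" ]; then
        basename "$(readlink "$link")"
    fi
}

# check_taps: list every tap-style interface with its bridge, or flag it.
check_taps() {
    net=${SYSFS_NET:-/sys/class/net}
    for dir in "$net"/vnet* "$net"/tap*; do
        [ -e "$dir" ] || continue    # skip an unmatched glob pattern
        ifc=$(basename "$dir")
        br=$(bridge_of "$ifc")
        if [ -n "$br" ]; then
            echo "$ifc -> $br"
        else
            echo "$ifc -> NOT BRIDGED"
        fi
    done
}
```

Sourcing this on the destination node and running check_taps after a migration would show whether the new tap device landed on a bridge at all, and on which one; an interface reported as NOT BRIDGED, or attached to a differently named bridge than on the source node, would support the theory. brctl show from bridge-utils reports the same membership in table form.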

The VM’s log shows it starting on the new node, but contains no debugging information:

LC_ALL=C PATH=/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin /usr/bin/kvm -S -M pc-0.11 -m 512 -smp 1 -name one-12 -uuid e9958ad6-514e-9142-2265-56a7efd25ad4 -monitor unix:/var/run/libvirt/qemu/one-12.monitor,server,nowait -boot n -net nic,macaddr=00:16:8b:ab:c7:14,vlan=0,name=nic.0 -net tap,fd=17,vlan=0,name=tap.0 -serial none -parallel none -usb -vnc -vga cirrus -incoming tcp:

The command line is generated by OpenNebula, but I see the same failure when migrating with virsh.
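To narrow down where the frames stop, it can help to capture the ARP traffic and the guest’s own frames both on the physical bridge NIC and on the tap device. The sketch below pulls the -net options and the guest MAC address out of a logged kvm command line such as the one above. The function names are hypothetical, and it assumes the options are separated by single spaces and contain no quoted spaces.

```shell
# Sketch: extract network settings from a logged kvm command line.
# Assumption: options are separated by single spaces and contain no
# quoted spaces, as in the libvirt/OpenNebula line above.

# net_opts CMDLINE: print the argument of every "-net" option, one per line.
net_opts() {
    printf '%s\n' "$1" | tr ' ' '\n' |
        awk 'prev == "-net" { print } { prev = $0 }'
}

# guest_mac CMDLINE: print the macaddr= value(s) from the -net options.
guest_mac() {
    net_opts "$1" | tr ',' '\n' | sed -n 's/^macaddr=//p'
}
```

With the MAC in hand, running tcpdump -n -e -i IFACE "arp or ether host $mac" once on the bridge NIC and once on the VM’s tap interface (IFACE is a placeholder for your own interface names) separates two cases. If the broadcast who-has requests appear on the NIC but never on the tap, they are not being forwarded to the guest’s port. If they do reach the tap but the guest never answers, the problem more likely lies inside the migrated guest.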

If I kill the KVM process and restart the VM on the new hardware, it boots and runs correctly.

Any suggestions on how to debug this? What facilities are there for looking inside a running KVM process to see what’s happening?