NetBooting
Network booting techniques used in Emulab
How things boot: a brief summary of the various techniques we use to network boot nodes and load MFSes.

0. What we can boot.

All of our nodes are configured to network (PXE) boot and, wherever possible, to NEVER boot from the hard drive. The latter ensures that if the network boot fails, we won't boot some potentially unknown OS from the hard drive and that we will try the network boot again. In the olden days, for nodes that we could not prevent from falling back to a disk boot, we would install an MBR boot loader that just immediately rebooted (well, it would give the user a prompt for a couple of seconds and then reboot--just in case they really wanted to boot from disk).

The first level boot loader, loaded via PXE, is typically specified via DHCP as the "filename" argument. However, this is not always true, as we will see below. This "pxeboot" will then talk to Emulab via the bootinfo protocol to see what to do next. The options are:

* Boot from a disk/partition combo. The disk number is a BIOS disk number (e.g., 0x80). If no disk is specified, then boot from disk 0.

* Boot from a disk/partition with a kernel command line. This was mostly useful for OSKit kernels "back in the day", but is also used to select an alternate kernel or select the HZ rate of a delay node. The actual semantics of the command line depend on the combination of the boot loader and kernel. At the very least an alternate kernel name (the first arg) should work. Additional command line arguments may or may not make it to the kernel, and some may be interpreted as environment settings in the loader, which in turn may or may not make it to the kernel. In other words, use of a command line for anything other than setting the HZ rate of a delay node kernel may or may not work!

* Boot a kernel using a memory-based root filesystem (MFS). MFSes are used typically when loading a disk or creating an image of a disk.
The former, the so-called "frisbee" MFS, is not intended to be a general OS; its sole purpose is to run the frisbee disk loader. The latter is known as the "admin" MFS because it can be used for purposes other than just capturing a disk image; e.g., if you have screwed up the disk somehow and want to get on and look at or fix it. This MFS is intended to be more general purpose. There is a third "newnode" MFS that is used only when a machine is first booted, to report info back to Emulab about this new node.

Two state transitions are related to the initial boot process: PXEBOOTING and BOOTING. The first says that a node has made a PXE request; the latter says that the OS is booting. Since the node cannot self-report these transitions (as it does all others), we have to report them by proxy. Typically, but not always, the bootinfo server on boss is the proxy. Whenever a node makes a "what do I do next" request, the server reports both transitions.

1. The "classic" Emulab legacy BIOS boot path.

The original and still most heavily used bootstrap path involves PXE booting via the BIOS, a FreeBSD-based boot loader, and FreeBSD-based MFSes.

In this path, /tftpboot/pxeboot.emu is specified in the dhcpd.conf file as the program to download. pxeboot is a PXE-savvy version of the FreeBSD "stage 2" boot loader which has been modified to talk to the bootinfo server and handle the boot scenarios above.

If bootinfo tells the loader to boot from a disk without any additional command line arguments, then pxeboot loads the first sector of that partition (or MBR) and jumps to it. Job done.

If there is a command line, and we are booting FreeBSD, then the loader can directly boot the kernel (first arg), after first parsing the remaining arguments and converting key=val strings into loader variables that can be passed to the kernel. This is how delay nodes get booted--we specify either a custom kernel name or we pass "kern.hz=10000" as an argument.
However, if it is a Linux kernel being booted, then the presence of a command line will likely cause the boot to fail. You cannot even specify an alternate kernel. (Something to keep in mind if we ever want a Linux delay node!) Note that the only case where a Linux command line works is if the on-disk boot loader is LILO. pxeboot has code to pass LILO command line arguments, but only very ancient images have LILO boot blocks.

Otherwise pxeboot uses TFTP to load a kernel and "mfsroot". pxeboot will only load FreeBSD-based MFSes because it only knows how to direct boot a FreeBSD kernel. Thus the FreeBSD pxeboot must be paired with FreeBSD MFSes.

MFS booting basically works by fooling pxeboot into thinking that, e.g., boss:/tftpboot/frisbee is the root filesystem. Hence it tries to read the normal boot time things out of "/boot", including loader.conf and the kernel. loader.conf contains special variables to tell it to use a file as an in-memory root filesystem ("mfsroot"). So pxeboot.emu reads all these files from "the filesystem", which is handled behind the scenes via the libstand TFTP code. Once the kernel is booted, the OS looks pretty normal, modulo the fact that the command set is extremely limited and the disk is extremely small.

2. The "Linux MFS" BIOS boot path.

Back around 2008, Ryan Jackson made a valiant effort to move us out of the FreeBSD world and into Linux by creating a boot chain that used Linux-related tools. These include a version of Grub (pre-2.0) modified to support our bootinfo protocol, a custom grub.cfg to handle booting from a partition or an MFS, a busybox-based Linux filesystem that serves as the MFS, and a Linux 3.2.7 kernel.

PXE downloads the custom Grub (/tftpboot/pxeboot_grub2pxe) specified in the DHCP config file. It uses /tftpboot/grub2/grub.cfg as its initial config. This is the custom script which invokes bootinfo and then either boots from a partition or loads an MFS.
For the MFS case, it reloads a config file from one of {admin,frisbee,newnode}_linux and that config file loads the Linux kernel and initrd (MFS). The kernel is passed a special "elab_mode=" command line parameter set to one of "frisbee", "admin", or "newnode" so that only one kernel and one MFS are needed for all three uses. This one MFS is a much more complete system than what any of the FreeBSD MFSes provide. However, it is busybox, so many of the standard commands are subsets of the real versions.

For booting a FreeBSD system from disk, grub uses the kFreeBSD command to load /boot/loader from the disk, passing along environment variables via kFreeBSD.* variables.

*** TODO: figure out exactly what command lines we can handle and finish this section. ***

For booting from disk without a command line in Linux, we typically just chainload the partition boot block. ...

Of note for chain booting: in order to pass (command line?) arguments to the FreeBSD bootloader from Grub, we had to hack a special version of the BSD bootloader for some older FreeBSD images. This is /boot/emuboot in those images. I am not sure if this is needed for newer versions of the loader.

3. The Moonshot m400 ARM U-boot/pxelinux boot path.

In 2014, we got the HP Moonshot ARM cartridges that use U-boot and PXE boot via a builtin version of pxelinux. Since there is no working version of FreeBSD for these boxes, we use Linux-based MFSes--one an initramfs (frisbee) and the other an NFS-mounted root filesystem (admin).

When booted, the nodes still DHCP, but they now ignore the "filename" argument that is returned. Instead, the first contact is via a TFTP read of a file from /tftpboot/pxelinux.cfg. There is a sequence of files it attempts to read, starting with the very specific (a file name that is the same as the MAC address) up to a generic "default". We use the individual files named after the MAC that are cloned from a template file /tftpboot/pxelinux.cfg/boot.template.
This template has different menu entries for booting from the disk (via the boot partition, which is also the root partition for us), booting from NFS (the admin MFS), or booting from an initramfs (frisbee MFS). It also has a special PXEWAIT entry, but we won't talk about that. Whenever a node changes its boot designation (e.g., set/clear node_admin, during reloading) we create a new version of the template with the correct default boot menu item (and correct NFS root FS path for NFS booting).

One interesting aspect of the PXE boot is that, since bootinfo is not called on the boot path, we handle sending the initial PXEBOOTING and BOOTING state transition events via dhcpd. The dhcpd.conf file has an "on commit" section that allows us to invoke a script whenever a client has accepted a lease. In our case we call /usr/testbed/sbin/reportboot, which will send the appropriate events. Since the Utah Moonshot cluster now supports both ARM and x86 (see #5 below), we have to be careful to only send these events from dhcpd for U-boot nodes.

The frisbee MFS is a more conventional Linux initramfs, but it is utterly unrelated to the "Linux MFS" (#2) that we use on the x86 nodes. Trying to recreate Ryan's build environment for the ARM architecture was a non-starter. Instead, I started with the default "initrd" and just kept adding stuff until rc.frisbee worked! So it is just as limited as the FreeBSD frisbee MFS, or even more so.

I took a different tack for the admin MFS. Rather than trying to build up the initramfs further (with perl, etc.) I took advantage of two things: 1) the fact that we had an NFS bootable version of Ubuntu 14 that we got from HP and 2) the fact that we are using ZFS on the fs node, making cloned filesystems fast and easy. The z/nfsroot/m400 zfs, mounted at /nfsroot/m400, is the base volume for the m400 admin FS.
z/nfsroot/m400@current is the snapshot which is cloned on demand to create filesystems for the individual nodes; e.g., z/nfsroot/ms0102 would be the volume for ms0102 when it is in admin mode. So the node_admin command will "zfs clone" a new version of the snapshot when admin mode is entered, and "zfs destroy" when done. This is triggered if the osid for the admin MFS contains the path "fs:/nfsroot".

4. The Intel NUC got-a-lame-BIOS boot path.

In 2015, we got some Intel NUCs for PhantomNet. The plan here was just to use the traditional boot path (#1) but we ran into a problem where, when configured for legacy BIOS and booting via the network, the SATA (AHCI?) option ROM (aka, driver) was not loaded. So we could not access the disk from the PXE booted loader. This is fine for the MFSes, which never access the disk until the OS is loaded, but made it impossible to boot from a disk partition since we could not load the boot sector.

Fortunately, Grub2 has the option to use "native" device drivers rather than the BIOS, so using the Linux MFS (#2) seemed feasible...except that the pre-2.0 version of Grub Ryan used did not support native SATA. Thus I embarked on a project to figure out how Ryan built pxeboot_grub2pxe and transfer his changes into Grub 2.02. The result is now in the https://gitlab.flux.utah.edu/emulab/emulab-grub2.git repo.

To not clash with the older grub-pre-2 install, we set the filename differently in dhcpd.conf. For these nodes we use /tftpboot/grub2pxe-native-vga/grub2pxe as the filename. That directory also contains a further tweaked version of grub.cfg which is configured to use native disk drivers via a variable setting at the top of the file. The naming convention for directories, grub2pxe-<driver>-<console>, allows us to have a different grub.cfg in each, specifying the drivers to use ("bios" or "native") and the console ("vga", "sio1", "sio2") via variables. They share modules, fonts, etc. via symlinks to the grub2.0 directory.
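
As a rough illustration of that directory naming convention, here is a minimal shell sketch that splits a directory name into its driver and console fields and builds the matching DHCP "filename". The function names are made up for this example and are not part of Emulab; the "efi" driver value is the one introduced for the UEFI nodes in #5.

```shell
#!/bin/sh
# Sketch only: grub2pxe_split and grub2pxe_filename are hypothetical helpers,
# not Emulab tools. They just encode the grub2pxe-<driver>-<console> pattern.

# Print "<driver> <console>" for a directory name, or fail on a bad name.
grub2pxe_split() {
    case "$1" in
        grub2pxe-?*-?*) ;;
        *) echo "not a grub2pxe-<driver>-<console> name: $1" >&2; return 1 ;;
    esac
    rest=${1#grub2pxe-}        # e.g. "native-vga"
    driver=${rest%%-*}         # text before the first dash
    console=${rest#*-}         # text after the first dash
    case "$driver" in
        bios|native|efi) ;;    # driver values named in this document
        *) echo "unknown driver: $driver" >&2; return 1 ;;
    esac
    printf '%s %s\n' "$driver" "$console"
}

# Build the DHCP "filename" value for a given driver and console.
grub2pxe_filename() {
    printf '/tftpboot/grub2pxe-%s-%s/grub2pxe\n' "$1" "$2"
}

grub2pxe_split grub2pxe-native-vga
grub2pxe_filename efi sio1
```

The console value is passed through unchecked, on the assumption that new console variants may show up over time.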
The MFSes used are just the VGA versions of the ones discussed in #2 (/tftpboot/admin_linux_vga and /tftpboot/frisbee_linux_vga -- I hope you were not looking for a consistent naming convention from me!)

Booting from disk, however, is harder. We cannot just chainload the on-disk loader since it most likely expects to use the BIOS disk interface. So for Linux, we just boot the kernel directly. We load the grub.cfg file from the disk and hope the config file is compatible with our version of Grub! For FreeBSD...well, we just don't boot FreeBSD on these nodes. Grub2 is capable of directly booting FreeBSD kernels, so it would be possible.

5. The Moonshot m510 x86 UEFI-only boot path.

Now we are in 2016 and we have yet another variant! The new HP Moonshot x86 cartridges are UEFI only, with no legacy BIOS setting. So it is time to suck it up and figure out how to handle UEFI.

The goal is to have common images that boot on both BIOS and UEFI machines. We can actually pull this off if we never want to boot directly from the disk. When booting from disk with UEFI, you not only have to have a boot loader that speaks the UEFI API but also a dedicated boot partition formatted with a FAT filesystem that contains the boot loader. Making that work, along with having an MBR boot block with a BIOS-speaking boot loader, was going to be a task.

Fortunately, it is simpler if we assume that we always boot from the network (we do). Now we just have to load an EFI-savvy boot loader via PXE (which UEFI still supports) and then (hopefully) just use that to directly load/boot the kernel and MFS. My travails are documented in https://gitlab.flux.utah.edu/emulab/emulab-devel/issues/53.

We wound up needing to use Grub here as well for the first level boot. But of course Grub has to be compiled differently to get EFI support. So now we have a new "driver" category: "efi". For these particular nodes the DHCP "filename" is set to /tftpboot/grub2pxe-efi-sio1/grub2pxe.
And grub.cfg needed some further tweaks.

To boot FreeBSD from disk, we just need to chainload /boot/loader.efi from disk. It exists in all our existing images for newer versions of FreeBSD.

Since I just could not stomach the prospect of upgrading the Linux MFS, or at least the kernel, to support newer hardware (NVME disk, Mellanox 10Gb Ethernet), I took advantage of Grub's support for booting FreeBSD and figured out how to load a FreeBSD kernel and MFS with Grub.

How to make things.

1. pxeboot.emu, the FreeBSD-based PXE boot loader.

pxeboot.emu is still built using the FreeBSD 7.2 sources from ops.emulab.net:/share/freebsd/7.2/src/sys/boot/i386/emuboot along with a tweaked version of libstand.

IMPORTANT: these changes only exist in that source tree (and the corresponding RCS directories). They have never been isolated and put in a git repo anywhere!

To rebuild pxeboot, you will have to allocate a node running the FBSD72-STD image (a 32-bit version of FreeBSD 7.2--anybody see any weak links in this chain?), log in and:

    mount -o ro fs:/share/freebsd/7.2/src /usr/src
    cd /usr/src/lib/libstand
    make obj all install
    cd /usr/src/sys/boot
    make obj all
    cd /usr/obj/usr/src/sys/boot/i386/emuboot
    cp pxeboot

By default this will create the "sio1" serial port console version. To create "null", "vga", "sio[234]" versions you will need to tweak /share/freebsd/7.2/src/sys/boot/i386/Makefile.inc as follows:

"vga" version: comment out:

    FORCE_SERIAL_CONSOLE= 1
    BTX_SERIAL= 1
    BOOT_PXELDR_ALWAYS_SERIAL= 1

and build.

"null" version: uncomment:

    #CFLAGS+= -DNONINTERACTIVE

To be safe, also comment out the serial console lines as per the "vga" build above.

"sio2" version: make sure the three serial console lines above are *not* commented out and then uncomment the "2f8" line from:

    #BOOT_COMCONSOLE_PORT= 0x2f8
    #BOOT_COMCONSOLE_PORT= 0x3e8
    #BOOT_COMCONSOLE_PORT= 0x2e8

"sio3" version: include serial console lines, uncomment the "3e8" line.

"sio4" version: include serial console lines, uncomment the "2e8" line.

That is it! You should now have all six versions of pxeboot (sio1 through sio4, vga, and null).

2. The FreeBSD MFSes.

The FreeBSD MFSes were hand-rolled long ago (circa 2000) and have been lovingly maintained ever since. They all started life as FreeBSD 6(?) filesystems. Since then, they have been hand updated a couple of times: once to move them to FreeBSD 8 binaries and once to create a version with 64-bit binaries. They are waaay past due for another update, or even better, a reproducible reimplementation using NanoBSD or something.

The disk-loading ("frisbee") MFS was mercilessly (and non-systematically) hacked to remove all packages and standard binaries that were not needed to run our rc.frisbee script. It is truly a unicorn--a 12MB filesystem (4MB compressed) that is fast to download with TFTP. In fact, it is now smaller than the kernel that is downloaded with it. The frisbee MFS does NOT include perl or ssh, among many other utilities normally thought of as indispensable.

The admin ("freebsd") MFS was basically the same, but without taking out the more useful things like perl and ssh. It is a 40MB filesystem.

The "newnode" MFS, used only when adding new nodes to the testbed, is very similar to the admin MFS--basically the same with different Emulab apps loaded.

"Rebuilding" these basically means updating the Emulab client-side that is installed. Start with a FBSD83-STD or FBSD83-64-STD image and copy over the "mfsroot" files from /tftpboot/{frisbee,freebsd,freebsd.newnode}/boot. Then set up an Emulab clientside build tree ala:

    mkdir obj; cd obj
    /clientside/configure --with-TBDEFS=/defs-utahclient

and for each MFS (e.g. frisbee):

    mdconfig -a -t vnode -f mfsroot.frisbee -u 5
    mount /dev/md5 /mnt
    cd
    setenv DESTDIR /mnt
    make frisbee-mfs-install
    umount /dev/md5
    mdconfig -d -u 5

And then copy back the updated mfsroot file. There are make targets "mfs-install" (admin MFS) and "newnode-mfs-install" (newnode MFS) for the others.
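
To keep the three MFS make targets straight, here is a small shell sketch that maps an MFS name to its target and prints (without running) the mount/install/unmount steps from above. The helper names and the short "admin"/"newnode" labels are mine, purely for illustration; the commands it prints are the ones listed above, minus the step that changes into the build tree, which is left to the caller.

```shell
#!/bin/sh
# Sketch only: mfs_make_target and print_mfs_update_steps are hypothetical
# helpers. They run nothing; they only print the steps for review.

# Map an MFS name to the clientside make target named in the text.
mfs_make_target() {
    case "$1" in
        frisbee) echo frisbee-mfs-install ;;
        admin)   echo mfs-install ;;
        newnode) echo newnode-mfs-install ;;
        *)       echo "unknown MFS: $1" >&2; return 1 ;;
    esac
}

# Print the update steps for one mfsroot image.
# Args: MFS name, mfsroot image file, optional md unit number (default 5).
print_mfs_update_steps() {
    mfs=$1
    image=$2
    unit=${3:-5}
    target=$(mfs_make_target "$mfs") || return 1
    cat <<EOF
mdconfig -a -t vnode -f $image -u $unit
mount /dev/md$unit /mnt
setenv DESTDIR /mnt
make $target
umount /dev/md$unit
mdconfig -d -u $unit
EOF
}

print_mfs_update_steps frisbee mfsroot.frisbee
```

Printing rather than executing keeps the sketch safe to try on a non-FreeBSD machine, where mdconfig does not exist.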
If you only need to update a shell/perl script, then you don't need a FreeBSD 8 node. You can just mount up the MFS and copy in the script by hand.

It is possible to pair a FreeBSD 8, 9, or 10 kernel with the mfsroot. We have had to use increasingly newer kernels to pick up support for newer hardware over time. These kernels used to be of the custom unicorn variety, reflecting just the hardware we had, but nowadays we use a variant of the GENERIC config so we get a broader range of hardware coverage. There should be a {i386,amd64}/conf/TESTBED-PXEBOOT-GENERIC configuration file to use. This includes a custom Emulab enhancement (ipod) as well as additional drivers and settings we need.

To build the kernel, start with a node running the appropriate image and then:

    mount fs:/share/freebsd//src /usr/src
    cd /usr/src
    make -j8 buildkernel KERNCONF=TESTBED-PXEBOOT-GENERIC
    cd /usr/obj/usr/src/sys/TESTBED-PXEBOOT-GENERIC
    cp kernel

3. The Linux MFS.

Swap in Mike's emulab-ops/Linux-MFS experiment. Do magic shit. (I think there is a README out in the persistent blockstore that has the environment...)

4. The HP Moonshot m400 MFSes.

How to update the initramfs.
- unwrap from u-boot, unpack, update, re-pack, rewrap for u-boot
- all in boss.utah.cloudlab.us:hibler/Moonshot/mfs

How to update the NFS filesystem.
- install new stuff into /nfsroot/m400 on fs (no makefile target yet, I don't think)
- move old @current snapshot to @old
- create new @current snapshot

How to update the kernel.
- don't! I tried once and broke it.
- need to have NFS (NFSROOT?) built into the kernel

5. Building "grub2pxe", the modified version of Grub for PXE boot.

Hopefully you will never have to build the 1.97+ version Ryan used for the original Linux MFS environment. The source repo is: git-public.flux.utah.edu:/flux/git/users/rdjackso/grub and I have it checked out in ops:~mike/grub-ryan.
For the 2.02 version, the repo is http://gitlab.flux.utah.edu/emulab/emulab-grub2.git and there is a README.md file that describes how to build and install it.