Boot scaling issues

Boot sequence overview

A. PXE BIOS interacts with DHCP to get initial boot program name:

    1. DHCPDISCOVER to server, server replies with DHCP info
       Up to four retries with timeouts of 4, 8, 16, 32 seconds
       (total of 60 seconds to get an initial reply)

    2. DHCPREQUEST to server, server replies
       This is apparently necessary in the DHCP protocol even though the
       client got its info from the DHCPDISCOVER reply.  It's not clear what
       the timeout used here is.  If it doesn't get a reply, it restarts
       with the discover.

    3. DHCPREQUEST to the boot server (proxydhcp), server replies
       This gets the boot file name.  Up to four retries with 1, 2, 3, 4
       second timeouts.

   Notes:
       - In step 2, if the client gets a BOOTP reply rather than a DHCP
         reply to the first query, this step isn't needed.

       - In step 3, the extremely short timeouts are why any bogging down
         of elvind/stated gets us in trouble; the timeouts don't cut us a
         whole lot of slack.
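
   To make the slack in steps 1 and 3 concrete, here is a small Python
   sketch that just adds up the retry timeouts listed above.  It assumes
   each listed timeout is the wait following one transmission; the constant
   and function names are made up for illustration.

```python
# Retry timeouts in seconds, as listed in steps 1 and 3 above.
DISCOVER_TIMEOUTS = [4, 8, 16, 32]  # step 1: DHCPDISCOVER to dhcpd
PROXY_TIMEOUTS = [1, 2, 3, 4]       # step 3: DHCPREQUEST to proxydhcp


def total_wait(timeouts):
    """Worst-case seconds the client waits before giving up entirely."""
    return sum(timeouts)


discover_budget = total_wait(DISCOVER_TIMEOUTS)
proxy_budget = total_wait(PROXY_TIMEOUTS)
print("step 1 budget:", discover_budget, "seconds")
print("step 3 budget:", proxy_budget, "seconds")
```

   Under that assumption proxydhcp has one sixth of the time dhcpd has to
   get a reply out before the client gives up.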

B. PXE BIOS requests/loads bootfile via TFTP:

    1. An initial request is made for block 1 with the TSIZE=0 option,
       presumably to learn the size of the file.  Retries?

    2. A second request is made for block 1 with the BLKSIZE=1456 option,
       to set the transfer block size and begin transferring the file.

    3. The remaining blocks are requested and transferred.

   Notes:
       - Our tftpd forks a new copy for each request on a new port.
         Both steps 1 and 2 cause such a fork.

       - Our tftpd doesn't recognize any options.  Could be that step
         2 wouldn't happen if we responded correctly to step 1.
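
   For reference, a Python sketch of what the two requests in steps 1 and 2
   would look like on the wire, assuming the BIOS follows the standard TFTP
   option extension (RFC 2347, with tsize from RFC 2349 and blksize from
   RFC 2348).  The file name "pxeboot" and the helper name build_rrq are
   placeholders, not taken from a packet trace.

```python
import struct

RRQ = 1  # TFTP read-request opcode


def build_rrq(filename, options, mode="octet"):
    """Build a TFTP read request followed by NUL-terminated option pairs."""
    packet = struct.pack("!H", RRQ)
    packet += filename.encode("ascii") + b"\0"
    packet += mode.encode("ascii") + b"\0"
    for name, value in options.items():
        packet += name.encode("ascii") + b"\0"
        packet += str(value).encode("ascii") + b"\0"
    return packet


# Step 1: ask for the file size.  Step 2: ask for a larger block size.
size_probe = build_rrq("pxeboot", {"tsize": 0})
transfer = build_rrq("pxeboot", {"blksize": 1456})
print(size_probe)
print(transfer)
```

   Per RFC 2347, a server that understands the options answers with an
   OACK, while one that does not simply ignores them and starts sending
   data.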

C1. Normal pxeboot (emuboot) executes:

    1. The FreeBSD libstand code does the DHCP DISCOVER/REQUEST dance with
       the server, two messages are exchanged.  Retries?

    2. Emuboot sends a bootinfo request (retries?) and boots as indicated
       (usually from disk).

D1. OS boots:

    1. OS boots from disk, once again doing the DHCP dance (from dhclient
       or pump).

    2. Testbed specific startup issues a series of TMCD commands.
       From power on of an already allocated node, I counted 26 TCP
       TMCD requests in a seven second period:
	    +0 reboot
	    +0 status
	    +1 ntpinfo
	    +3 state
	    +3 reboot
	    +3 status
	    +3 mounts
	    +4 accounts
	    +4 ifconfig
	    +5 tunnels
	    +5 hostnames
	    +5 routing
	    +5 status
	    +5 ifconfig
	    +5 routelist
	    +5 trafgens
	    +5 nseconfigs
	    +5 rpms
	    +6 tarballs
	    +6 startupcmd
	    +6 delay
	    +6 ipodinfo
	    +7 vnodelist
	    +7 isalive
	    +7 creator
	    +7 state

   Notes:
       - The DHCP transaction done here, at least under FreeBSD, will take
         a long time (20-30 seconds instead of 1-3) if the cisco2 control
         net port is not configured properly ("set port host <mod/port>").
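
   The trace in step 2 can be tallied to see how bursty the requests are.
   A sketch in Python, with the (offset, command) pairs copied from the
   list above; the variable names are illustrative.

```python
from collections import Counter

# (seconds since the first request, TMCD command), copied from the trace.
TRACE = [
    (0, "reboot"), (0, "status"), (1, "ntpinfo"), (3, "state"),
    (3, "reboot"), (3, "status"), (3, "mounts"), (4, "accounts"),
    (4, "ifconfig"), (5, "tunnels"), (5, "hostnames"), (5, "routing"),
    (5, "status"), (5, "ifconfig"), (5, "routelist"), (5, "trafgens"),
    (5, "nseconfigs"), (5, "rpms"), (6, "tarballs"), (6, "startupcmd"),
    (6, "delay"), (6, "ipodinfo"), (7, "vnodelist"), (7, "isalive"),
    (7, "creator"), (7, "state"),
]

per_second = Counter(offset for offset, _ in TRACE)
per_command = Counter(command for _, command in TRACE)

# Busiest second and the commands that are issued more than once.
peak_second, peak_count = per_second.most_common(1)[0]
repeated = {cmd: n for cmd, n in per_command.items() if n > 1}

print("total requests:", len(TRACE))
print("busiest second: +%d with %d requests" % (peak_second, peak_count))
print("repeated commands:", repeated)
```

   By this tally the busiest second is +5 with 9 requests, and four
   commands (reboot, status, state, ifconfig) are issued more than once.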


E1. First time boot of an experiment

    1. Optional download and installation of tarballs and RPM files
       across NFS.

   Notes:
       - The loading of tarballs/RPMs across NFS has been shown to put a
         hurtin' on our server with as few as 30 nodes and 5MB of RPMs.
	 This is also a problem when people log to files in /proj.


C2. Frisbee pxeboot (pxeboot.frisbee) executes:

    1. The FreeBSD libstand code does the DHCP DISCOVER/REQUEST dance with
       the server, two messages are exchanged.  Retries?

    2. The FreeBSD loader issues a series of requests for files via TFTP:
        /tftpboot/frisbee/boot/boot.4th.gz
	/tftpboot/frisbee/boot/loader.rc.gz
	/tftpboot/frisbee/boot/loader.4th.gz
	/tftpboot/frisbee/boot/support.4th.gz
	/tftpboot/frisbee/boot/defaults/loader.conf
	/tftpboot/frisbee/boot/loader.conf
	/tftpboot/frisbee/boot/loader.conf.local
	/tftpboot/frisbee/boot/kernel.ko.gz	# check to see if it exists
	/tftpboot/frisbee/boot/kernel.ko.gz	# read it
	/tftpboot/frisbee/boot/mfsroot.gz	# check to see if it exists
	/tftpboot/frisbee/boot/mfsroot.gz	# read it

    Notes:
        - Each of the 11 file requests uses a different instance of tftpd.

        - Use of .gz and .ko files ensures a minimum number of requests;
	  i.e., if the .gz or .ko file didn't exist it would try again
	  without the suffix, doubling (or more) the number of requests
	  (and servers).

Scaling Issues

PXE BIOS interaction (Step A):

   The big concern here is losing the initial (larger timeout) or later
   (smaller timeout) DHCP requests.  This seems to happen at about 40
   machines.  Not sure about all the drop scenarios, but we know that
   proxydhcp will overload.  Presumably dhcpd does as well.

   We have a hack right now that if PXE fails, it boots from the hard disk
   where we have a special MBR which just reboots the machine, thus trying
   again.  This is ok as a last ditch effort, but we need to scale a lot
   higher than 40 machines before we hit this.

   Possible fixes/optimizations:

   1. Have dhcpd send bootp replies in step 1, eliminating the second step.
      As Leigh points out in the message below, this won't work if we
      continue to have proxydhcp provide the boot file.

   2. Let regular dhcpd provide the boot file (pxeboot).
      I found a way to do this even with dhcpd version 2.  This eliminates
      proxydhcp.  The downside is that the boot file must be specified in
      the dhcpd.conf file.  While we could dynamically generate the dhcpd.conf
      file, we're opening ourselves up to the same headaches we have with
      mountd, named and anything else that we have to kill/restart.  Or we
      can hack dhcpd to directly access the DB for its info.  I looked at
      both V2 and V3 to see if there were any existing hook mechanisms or if
      there were any obvious places at which to add hooks, but it looks like
      it will be a PITA.

   3. We can get rid of regular dhcpd and just use proxydhcp.
      This should work though I haven't tried it.  This would reduce it to a
      single transaction.  Downside is that we lose the ability to handle
      regular DHCP traffic (since proxydhcp would have to run on the standard
      port).  Not an issue for us, at least right now.

   4. Get the PXE BIOS source or a custom version from Intel.
      In this scenario we customize PXE with larger timeouts or different
      failure behavior.  I don't think this is practical personally.  It
      applies only to Intel NICs (granted, that is what we mostly have)
      and could make us incompatible with standard PXE, a bad move if we
      want other people to use our stuff.
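
   A rough Python sketch of what dynamically generating the dhcpd.conf host
   entries for option 2 might look like.  The node names, MAC and IP
   addresses, and boot file path are invented; the host, hardware ethernet,
   fixed-address and filename statements are standard ISC dhcpd syntax but
   are assumed, not verified, to behave the same in the version we run.
   The real data would come from the DB.

```python
# Hypothetical node records; in practice these would be read from the DB.
NODES = [
    {"name": "pc1", "mac": "00:d0:b7:00:00:01", "ip": "10.0.0.1"},
    {"name": "pc2", "mac": "00:d0:b7:00:00:02", "ip": "10.0.0.2"},
]


def host_entry(node, bootfile):
    """Render one dhcpd.conf host declaration that names the boot file."""
    return (
        "host %s {\n" % node["name"]
        + "    hardware ethernet %s;\n" % node["mac"]
        + "    fixed-address %s;\n" % node["ip"]
        + '    filename "%s";\n' % bootfile
        + "}\n"
    )


def render_conf(nodes, bootfile):
    """Concatenate the host declarations for all nodes."""
    return "\n".join(host_entry(node, bootfile) for node in nodes)


conf = render_conf(NODES, "/tftpboot/pxeboot")
print(conf)
```

   dhcpd would still have to be restarted after the file is rewritten,
   which is the headache noted in option 2.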

   A thread discussing PXE/TFTP problems and potential solutions in the
   cluster space starts here:
       http://www.beowulf.org/pipermail/beowulf/2002-November/005063.html
   See the later messages in the thread.

TFTP interaction (Steps B, C2):

   The biggest single problem right now is that inetd forks a new tftpd
   for every new request (on a new port).  In the PXE case (Step B) this
   is not so bad, but for the BSD frisbee MFS we make a new request
   every time we even want to check for the existence of a file ("stat").
   The BSD boot loader is oriented toward having a local disk filesystem,
   where all this activity isn't an issue.
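
   To put a number on it, a back-of-the-envelope Python sketch using the
   counts from Steps B and C2 (2 forks for the PXE BIOS, 11 for the
   loader).  The 40-node figure is only an example, borrowed from where we
   start seeing PXE trouble.

```python
PXE_FORKS = 2      # Step B: the TSIZE probe and the real transfer
LOADER_FORKS = 11  # Step C2: one tftpd per loader file request


def tftpd_spawns(nodes):
    """tftpd processes forked by inetd if `nodes` machines load the frisbee MFS."""
    return nodes * (PXE_FORKS + LOADER_FORKS)


spawns_for_40 = tftpd_spawns(40)
print("tftpd spawns for 40 nodes:", spawns_for_40)
```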

   One optimization already enacted is to try to minimize the number of
   files it attempts to access.  At one time, when attempting to locate
   a file "foo", it would try: foo.split.gz, foo.split, foo.gz, foo,
   often looking in both /boot and /boot/defaults for the file.  I tweaked
   the loader configuration to minimize the number of configuration files
   it tries to load, eliminated the splitfs support to get rid of attempts
   to access .split files, and then made sure there was a .gz (or .ko.gz
   for kernel modules) version of every file that did exist.  This
   eliminated most of the "stat" type calls.
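
   The effect of the suffix search can be sketched in Python as follows.
   This is a simplified model of the lookup order described above, not the
   loader's actual code: it ignores the second directory and the separate
   "check, then read" requests, and the suffix list for the trimmed loader
   is an assumption.

```python
OLD_SUFFIXES = [".split.gz", ".split", ".gz", ""]  # original search order
NEW_SUFFIXES = [".gz", ""]                         # assumed order without splitfs


def probes_until_found(name, suffixes, existing):
    """Return the names requested, in order, stopping at the first that exists."""
    tried = []
    for suffix in suffixes:
        candidate = name + suffix
        tried.append(candidate)
        if candidate in existing:
            break
    return tried


on_server = {"mfsroot.gz"}
old_probes = probes_until_found("mfsroot", OLD_SUFFIXES, on_server)
new_probes = probes_until_found("mfsroot", NEW_SUFFIXES, on_server)
print(len(old_probes), old_probes)
print(len(new_probes), new_probes)
```

   In this model every failed probe is a wasted TFTP request (and a wasted
   tftpd), which is why having the .gz version present for every file that
   exists pays off.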

   Another thing to look into is a better tftpd; specifically, we want to
   minimize the number of spawns yet maintain reasonable parallelism.
   Prioritizing incoming requests, as discussed in the beowulf thread
   cited above, is probably less important right now.  I looked at tftp-hpa
   (http://freshmeat.net/projects/tftp-hpa/) some.  While it doesn't have
   a FreeBSD port, it does use configure and is supposed to work on BSD
   (it is derived from the BSD tftp code).  The server supports some of
   the newer TFTP options like increased transfer blocksize (> 512 bytes)
   and the ability to test for the existence of a file.  It can operate
   standalone (rather than under inetd) where it will just fork to handle
   new requests (rather than fork/exec) and/or it can have copies of the
   daemon persist rather than immediately exit.  We can probably tweak-out
   the BSD boot code to take advantage of the options, and hopefully
   persistent and faster spawning daemons would take care of some of the
   load issues.


TMCD interaction (Step D1):

   The sheer volume of requests is the big concern here.  At 26 requests
   per node, per boot, it would be easy to overwhelm the server and cause
   requests to time out on the clients.

   An obvious boot-time-only solution is to have a meta-request which
   will get all the info for a host in one request at boot time.  This
   could take the form of a script which makes the call and records the
   data into a file for all other scripts to use, or it could be a
   proxy tmcd which starts up first and services all the usual tmcc calls.
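
   A minimal Python sketch of the script variant of the meta-request idea.
   Everything here is hypothetical: FakeTmcd stands in for the real server,
   fetch_all is an invented meta-request, and the snapshot file format
   (JSON) is just a convenient choice for the example.

```python
import json
import os
import tempfile


class FakeTmcd:
    """Stand-in for the real server; counts round trips."""

    def __init__(self):
        self.requests = 0

    def fetch_all(self, commands):
        # One request returns the data for every command (the meta-request).
        self.requests += 1
        return {cmd: "<%s data>" % cmd for cmd in commands}


def take_snapshot(server, path, commands):
    """Issue the single meta-request and record the replies in a file."""
    with open(path, "w") as f:
        json.dump(server.fetch_all(commands), f)


def lookup(path, command):
    """What the other boot scripts would call instead of contacting tmcd."""
    with open(path) as f:
        return json.load(f)[command]


server = FakeTmcd()
snapshot_path = os.path.join(tempfile.mkdtemp(), "tmcd-snapshot.json")
take_snapshot(server, snapshot_path, ["mounts", "accounts", "ifconfig"])
replies = [lookup(snapshot_path, c) for c in ("mounts", "accounts", "ifconfig")]
print(server.requests, replies)
```

   However many scripts read the snapshot, the server sees one request per
   boot instead of one per command.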

   A more general solution is a caching proxy which runs all the time.
   This would presumably be an extension of the current watchdog using
   the keep-alive packets to find out if cached data needs updating.
   For jails, we would presumably run only a single proxy outside the
   jails since I doubt we will ever support enough jails per node to
   require a proxy in each.
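
   A sketch of the caching-proxy idea in Python.  The invalidation scheme
   is an assumption: it supposes the keep-alive reply carries some version
   value that changes whenever the node's data changes.  The fetch callable
   and the class name are placeholders for whatever actually talks to tmcd.

```python
class CachingProxy:
    """Answer repeated requests locally; refetch only after an update notice."""

    def __init__(self, fetch):
        self.fetch = fetch   # callable(command) -> reply from the server
        self.cache = {}
        self.version = None  # last version seen in a keep-alive reply

    def get(self, command):
        # Only go to the server for commands we have no cached reply for.
        if command not in self.cache:
            self.cache[command] = self.fetch(command)
        return self.cache[command]

    def on_keepalive(self, server_version):
        # A changed version means the cached replies may be stale.
        if server_version != self.version:
            self.cache.clear()
            self.version = server_version


calls = []


def fake_fetch(command):
    calls.append(command)
    return "<%s data>" % command


proxy = CachingProxy(fake_fetch)
proxy.on_keepalive(1)
proxy.get("ifconfig")
proxy.get("ifconfig")   # served from the cache
proxy.on_keepalive(1)   # nothing changed, cache kept
proxy.get("ifconfig")
proxy.on_keepalive(2)   # data changed on the server, cache dropped
proxy.get("ifconfig")
print(calls)
```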


NFS traffic (Step E1):

   When lots of nodes attempt to use NFS, we get in trouble in a hurry.
   The NFS traffic, being UDP, interferes with most other boot-time or
   run-time traffic, causing timeouts and lost packets.

   One thing we could do is switch to using TCP-based NFS where we would
   get some congestion control.  FreeBSD TCP NFS works fine; we would need
   to be sure of Linux NFS.  We should also make sure we are using NFS v3
   which reduces the amount of control traffic some.  We could also get
   a serious file server to handle the load.

   Alternatives to NFS?  We can use ssh to download tarballs and RPMs.
   We could provide a logging facility that uses something other than
   NFS.  But in general, if we want to continue to present a filesystem
   interface to shared space, there are not many good options.

Related Mail messages

Date: Wed, 18 Dec 2002 10:50:52 -0700 (MST)
From: Mike Hibler <mike@flux.utah.edu>
Message-Id: <200212181750.gBIHoqFC008329@bas.flux.utah.edu>
To: testbed-ops@flux.utah.edu
Subject: PXE and DHCP and TFTP

Was looking into this a bit yesterday.

Summary: we can probably eliminate 1 or 2 of the 3 PXE/DHCP transactions
at boot and there is a potentially better TFTP daemon out there.

PXE/DHCP: the normal procedure is three transactions:

1. DHCPDISCOVER to server, server replies with DHCP info
   Up to four retries with timeouts of 4, 8, 16, 32 seconds
   (total of 60 seconds to get an initial reply)

2. DHCPREQUEST to server, server replies
   This is apparently necessary in the DHCP protocol even though the
   client got its info from the DHCPDISCOVER reply.  Its not clear what
   the timeout used here is.  If it doesn't get a reply, it restarts
   with the discover.  One note: if the client gets a BOOTP reply rather
   than a DHCP reply to the first query, this step isn't needed.

3. DHCPREQUEST to the boot server (proxydhcp), server replies
   This gets the boot file name.  Up to four retries with 1, 2, 3, 4
   second timeouts.  This is why elvind/stated bogging down at all
   gets us in trouble, they don't cut us a whole lot of slack.

Possible optimizations:

1. Have dhcpd send bootp replies in step 1, eliminating the second
   step.  Not sure this is possible.

2. Let regular dhcpd provide the boot file (pxeboot).  I found a way to do
   this even with dhcpd version 2.  This eliminates proxydhcp.  The downside
   is that the boot file must be specified in the dhcpd.conf file.  While we
   could dynamically generate the dhcpd.conf file, we're opening ourselves
   up to the same headaches we have with mountd, named and anything else
   that we have to kill/restart.  Or we can hack dhcpd to directly access
   the DB for its info.  I looked at both V2 and V3 to see if there were
   any existing hook mechanisms or if there were any obvious places at
   which to add hooks, but it looks like it will be a PITA.

3. We can get rid of regular dhcpd and just use proxydhcp.  This should
   work though I haven't tried it.  This would reduce it to a single
   transaction.  Downside is that we lose the ability to handle regular
   DHCP traffic (since proxydhcp would have to run on the standard port).
   Not an issue for us, at least right now.

TFTP: I looked at tftp-hpa (http://freshmeat.net/projects/tftp-hpa/) some.
While it doesn't have a FreeBSD port, it does use configure and is supposed
to work on BSD (it is derived from the BSD tftp code).  The server supports
some of the newer TFTP options like increased transfer blocksize (> 512
bytes) and the ability to test for the existence of a file.  It can operate
standalone (rather than under inetd) where it will just fork to handle new
requests (rather than fork/exec) and/or it can have copies of the daemon
persist rather than immediately exit.  We can probably tweak-out the BSD
boot code to take advantage of the options, and hopefully persistent and
faster spawning daemons would take care of some of the load issues.

Date: Wed, 18 Dec 2002 09:58:41 -0800
From: Leigh Stoller <stoller@flux.utah.edu>
To: mike@cs.utah.edu
Cc: testbed-ops@flux.utah.edu
Subject: Re: PXE and DHCP and TFTP
In-Reply-To: <200212181750.gBIHoqFC008329@bas.flux.utah.edu>
References: <200212181750.gBIHoqFC008329@bas.flux.utah.edu>

> From: Mike Hibler <mike@flux.utah.edu>
> Subject: PXE and DHCP and TFTP
> Date: Wed, 18 Dec 2002 10:50:52 -0700 (MST)
> 
> 1. Have dhcpd send bootp replies in step 1, eliminating the second
>    step.  Not sure this is possible.

No, its not. The client won't go to the proxydhcp. 

> 2. Let regular dhcpd provide the boot file (pxeboot).  I found a way to do
>    this even with dhcpd version 2.  This eliminates proxydhcp.  The downside
>    is that the boot file must be specified in the dhcpd.conf file.  While we
>    could dynamically generate the dhcpd.conf file, we're opening ourselves
>    up to the same headaches we have with mountd, named and anything else
>    that we have to kill/restart.  Or we can hack dhcpd to directly access
>    the DB for its info.  I looked at both V2 and V3 to see if there were
>    any existing hook mechanisms or if there were any obvious places at
>    which to add hooks, but it looks like it will be a PITA.

We could hardwire the path if the bsd based pxeboot you put together could
handle everything after that. That is, we run the same pxeboot all the
time, and have it contact bootinfo to see if it should run the frisbee MFS,
or the freebsd MFS, or whatever, and then hand off to that. That gives us
lots of control over timeouts and retries.

Don't know if thats possible though. It would require that pxeboot be able
to load and run another pxeboot (say, the one in the frisbee directory).