Swapintimeouts

Timeouts (or lack thereof) for the swapin process:

1. stated, state_timeouts for the RELOAD op_mode:

    BOOTING:    180
    RELOADSETUP: 60
    RELOADING:  600
    SHUTDOWN:   120

   Currently these only cause a log message.
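
   As a concrete (illustrative) sketch, assuming a simple table keyed by
   state, the check might look like the following; the names are made up
   and this is not the actual stated code:

      import time

      RELOAD_STATE_TIMEOUTS = {          # seconds, from the table above
          "BOOTING":     180,
          "RELOADSETUP":  60,
          "RELOADING":   600,
          "SHUTDOWN":    120,
      }

      def check_state_timeout(node, state, entered_at, now=None):
          """Return True if 'node' has sat in 'state' longer than allowed."""
          now = now or time.time()
          limit = RELOAD_STATE_TIMEOUTS.get(state)
          if limit is not None and now - entered_at > limit:
              # today this only warrants a log message
              print("WARNING: %s stuck in %s for %ds (limit %ds)" %
                    (node, state, now - entered_at, limit))
              return True
          return False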

2. snmpit

   There does not appear to be a timeout on the snmpit run to setup VLANs.

3. libreboot::nodereboot 

   The first step of reloading a node is to reboot it into the reload MFS.
   This is done by invoking nodereboot with waitmode == 0, which means
   return as soon as we have determined that the node is rebooting.
   Specifically, that means one of the following (sketched in code after
   the list):

    1) A node in a PXEWAIT/WAKEUP/LIMBO state was successfully sent a
       bootinfo "wakeup and query again". Success means that the bootinfo
       packet was sent.

    2) A node that did not respond to ping has been successfully
       power cycled. Success means that the power controller issued its
       "cycle" command.

    3) A child process exec'ing "ssh reboot" exits within (hardwired)
       20 seconds and the node stops pinging within another (hardwired)
       30 seconds. Success means that the node went from pinging to
       not pinging.

    4) The "ssh reboot" timed out or the node didn't stop pinging, but
       an "ipod" caused the node to stop pinging within (hardwired)
       30 seconds. Success means that the node went from pinging to
       not pinging.

    5) All of the above failed but a power cycle command was successfully
       issued. Success means that the power controller issued its
       "cycle" command.

   What happens if nodereboot returns but a node really isn't rebooting?

   For #1, this is quite possible due to a dropped packet or the node
   being hung in some way. Stated provides some transparent (to libreboot)
   recovery for this. Prior to sending the "wakeup and query again" message,
   the node's state is set to PXEWAKEUP. The node should respond by querying
   and thus causing a transition to PXEBOOTING. If that transition does not
   happen within 20 seconds, stated steps in. Even though the general
   reboot-on-timeout code in stated is commented out, there is special-case
   code to handle the PXEWAKEUP state: it will try waking the node up three
   times before forcing a power cycle. So a node hung in this state will
   sit for 60 seconds before being power cycled, and it could easily be
   2-3 minutes after nodereboot returns before the node is actually
   rebooted.
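
   A sketch of that special case (timer values from the description above;
   helper names are hypothetical, this is not the actual stated code):

      PXEWAKEUP_TIMEOUT = 20    # seconds to wait for the PXEBOOTING transition
      MAX_WAKEUP_TRIES  = 3

      def send_bootinfo_wakeup(node): ...    # stub
      def power_cycle(node): ...             # stub

      def on_pxewakeup_timeout(node):
          """Called each time a node sits in PXEWAKEUP for 20 seconds."""
          node.wakeup_tries = getattr(node, "wakeup_tries", 0) + 1
          if node.wakeup_tries < MAX_WAKEUP_TRIES:
              send_bootinfo_wakeup(node)     # maybe the packet was dropped
          else:
              power_cycle(node)              # 3 x 20s elapsed; give up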

   For #2 and #5 where a power cycle is performed, failure to initiate
   an actual reboot is unlikely unless the node is well and truly dead.
   But there are certainly reasons why even a successful physical reboot
   will not advance the reload/swapin process (failure in the BIOS, not
   booting from the network, etc.), so we do want to recover from these
   situations ASAP. For #3 and #4, failure is more likely as the OS
   could hang after the network is shut down.

   For all of #2-5, stated has a timeout mechanism, but it is disabled.
   A successful "ssh reboot", ipod, or power cycle puts the node in the
   SHUTDOWN state. Depending only on the op-mode of the node, stated will
   time out in the SHUTDOWN state after 120-300 seconds, a value stored in
   the DB. A timeout (300 seconds) from one of the SECURE* op-modes will
   put the node in the SECVIOLATION state and (in theory) power it off,
   but I don't consider this case any further here. Other op-mode timeouts
   are supposed to trigger a node reboot, but the reboot code is disabled.
   So the node will remain in the SHUTDOWN state forever unless a higher
   level timeout takes care of it.
       
4. libosload::WaitTillReloadDone

   Calculates a per-node max wait time in seconds. This timeout starts when
   the node is determined to be rebooting (#3 above) and runs until stated
   clears the node's current_reloads entry in the DB (or until the node
   enters the TBFAILED state or fails to make a state transition for five
   minutes). The value of
   the timeout is:

     If we are zeroing the disk, take the disksize in GB (from the node_type
     info) and multiply by 60; i.e., uses a fixed 1GB/min or ~17MB/sec.

     Otherwise, use the per-image "maxloadwait" attribute if set.

     Otherwise, use 65% of the number of 1MB compressed chunks in the image
     plus a constant factor (TBLOADWAIT == 600 seconds). The former is about
     65 seconds/GB or ~15.4MB/sec.

   How random is that?!
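
   Roughly, the calculation described above as a sketch (names and layout
   are mine; the constants just follow the description, the real libosload
   code may differ):

      TBLOADWAIT = 600      # constant slop factor, seconds

      def reload_max_wait(zeroing_disk, disk_size_gb, image_chunks,
                          maxloadwait=None):
          """Seconds to wait for a reload before declaring it failed."""
          if zeroing_disk:
              # fixed 1 GB/min for wiping the disk
              return disk_size_gb * 60
          if maxloadwait is not None:
              # per-image "maxloadwait" override
              return maxloadwait
          # otherwise scale with the number of 1MB compressed chunks
          return int(image_chunks * 0.65) + TBLOADWAIT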


What we really need:

1. We need an overarching stated timeout in case os_load or snmpit hangs for
   some reason. We really cannot pick a constant value here; it needs to
   reflect lack of progress. I guess this means that os_load/snmpit need to
   report a heartbeat to stated so that it will reset its timeout. Then we
   can say that if os_load/snmpit doesn't report for 5-10 minutes, we fail
   the swapin.
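
   Roughly the sort of heartbeat watchdog I mean (a sketch only; names and
   the 10-minute limit are illustrative):

      import time

      class SwapinWatchdog:
          def __init__(self, limit=600):        # 10 minutes of silence
              self.limit = limit
              self.last_report = {}             # experiment -> last heartbeat

          def heartbeat(self, experiment):
              """Called whenever os_load or snmpit reports progress."""
              self.last_report[experiment] = time.time()

          def expired(self, experiment):
              """True if the experiment has gone silent past the limit."""
              last = self.last_report.get(experiment)
              return last is not None and time.time() - last > self.limit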

2. Do we need a snmpit timeout? Individual switch ops will time out and fail
   the snmpit run. Can snmpit hang doing nothing? This timeout would have to
   be static based on the number of VLANs to set up, or be a heartbeat.

3. The os_load timeout should be broken into two pieces: the (relatively)
   fixed-length (per node-type) "reboot the node into the MFS" part,
   and the image-size/disk-speed/network-speed dependent "load the disk"
   part. If we can tightly constrain the first part (1-5 minutes), we can
   detect and repair non-transient errors much more quickly.

4. The initial "reboot the node into the MFS" could be handled by specifying
   waitmode == 1 to nodereboot, except that this serializes the waits on
   the nodes. It would also require a single wait time for all nodes in
   a single call to nodereboot. We need to track all nodes in parallel
   with per-node-type timeout values. Of course, this is exactly what
   stated was intended for.
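
   For example, something along these lines (a sketch; the per-type deadline
   values and node attributes are made up):

      import time

      REBOOT_DEADLINE = {"pc3000": 180, "d710": 300, "default": 240}  # secs

      def overdue_nodes(nodes, started_at, now=None):
          """Nodes that have not reached the reload MFS within their deadline."""
          now = now or time.time()
          late = []
          for node in nodes:
              limit = REBOOT_DEADLINE.get(node.type, REBOOT_DEADLINE["default"])
              if not node.in_reload_mfs and now - started_at[node.name] > limit:
                  late.append(node)
          return late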

5. The "load the disk" timeout could be static, but needs to be a function
   of the uncompressed size of the image data and a per-node disk write rate.
   This rate could be a fixed value or could be updated periodically based
   on reported numbers from frisbee or some benchmark.
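
   For instance (a sketch; the 20 MB/sec default rate and the slop factor
   are assumptions, not measured values):

      def load_timeout(uncompressed_mb, write_rate_mb_per_sec=20, slop=300):
          """Seconds allowed to write the image, plus a fixed slop factor."""
          # e.g. a 6000 MB image at 20 MB/sec -> 300 + 300 = 600 seconds
          return int(uncompressed_mb / write_rate_mb_per_sec) + slop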

   The timeout could also be a "no progress" timeout. Here, the frisbee
   client would have to make periodic reports of how much data it has
   received and how much it has written to disk. This could be reported
   as a UDP packet to the frisbee server (which might get lost under load)
   or as a TCP packet to the master server. One bad thing about this is
   that it doesn't scale well; we would need to keep the reports relatively
   infrequent and unsynchronized among clients. Another issue is backward
   compatibility: we would have to say that lack of reports means don't
   time out (or fall back on a static timeout?). But this is
   indistinguishable from a client that is hung and making no progress,
   which is exactly what we want the timeout to catch! We could make a new
   version of the JOIN message that says "I do heartbeat reports".
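
   Server side, the no-progress check might look something like this (a
   sketch; the report handling and the JOIN-version flag follow the idea
   above, everything else is invented):

      import time

      class ClientProgress:
          def __init__(self, does_heartbeat, static_timeout=1800):
              self.does_heartbeat = does_heartbeat  # set from the JOIN version
              self.static_timeout = static_timeout  # fallback for old clients
              self.started = time.time()
              self.last_change = self.started
              self.bytes_written = 0

          def report(self, bytes_written):
              """Handle a periodic progress report from the client."""
              if bytes_written > self.bytes_written:
                  self.bytes_written = bytes_written
                  self.last_change = time.time()

          def hung(self, no_progress_limit=300):
              if self.does_heartbeat:
                  # no forward progress for 5 minutes -> presumed hung
                  return time.time() - self.last_change > no_progress_limit
              # old client: all we can do is a static overall timeout
              return time.time() - self.started > self.static_timeout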