Timeouts (or lack thereof) for the swapin process:
1. stated: state_timeouts for the RELOAD op-mode (values in seconds):
BOOTING: 180
RELOADSETUP: 60
RELOADING: 600
SHUTDOWN: 120
Currently these only cause a log message.
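As a sketch, the table above is just a per-state map of seconds; the dict name and handler below are illustrative, not stated's actual code:

```python
# Per-state timeouts (seconds) for the RELOAD op-mode, as listed above.
# stated currently does nothing but log when one of these expires.
RELOAD_STATE_TIMEOUTS = {
    "BOOTING": 180,
    "RELOADSETUP": 60,
    "RELOADING": 600,
    "SHUTDOWN": 120,
}

def on_timeout(state):
    """Hypothetical handler: today the only action is a log message."""
    timeout = RELOAD_STATE_TIMEOUTS.get(state)
    return f"node stuck in {state} for {timeout} seconds" if timeout else None
```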
2. snmpit
There does not appear to be a timeout on the snmpit run that sets up VLANs.
3. libreboot::nodereboot
The first step of reloading a node is to reboot it into the reload MFS.
This is done by invoking nodereboot with waitmode == 0, which means it
returns as soon as we have determined that the node is rebooting.
Specifically, that means one of:
1) A node in a PXEWAIT/WAKEUP/LIMBO state was successfully sent a
bootinfo "wakeup and query again". Success means that the bootinfo
packet was sent.
2) A node that did not respond to ping has been successfully
power cycled. Success means that the power controller issued its
"cycle" command.
3) A child process exec'ing "ssh reboot" exits within (hardwired)
20 seconds and the node stops pinging within another (hardwired)
30 seconds. Success means that the node went from pinging to
not pinging.
4) If the "ssh reboot" timed out or the node didn't stop pinging,
but an "ipod" causes the node to stop pinging within (hardwired)
30 seconds. Success means that the node went from pinging to
not pinging.
5) All of the above failed but a power cycle command was successfully
issued. Success means that the power controller issued its
"cycle" command.
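The five cases form a cascade. A rough Python sketch of the waitmode == 0 logic follows; the node methods here are assumptions for illustration, not the real libreboot interface:

```python
def nodereboot_nowait(node):
    """Return True as soon as we believe `node` is rebooting (waitmode == 0).
    Each branch mirrors one of cases 1-5 above."""
    if node.state in ("PXEWAIT", "PXEWAKEUP", "LIMBO"):
        # Case 1: success == the bootinfo "wakeup and query again" packet was sent.
        return node.send_bootinfo_wakeup()
    if not node.pings():
        # Case 2: success == the power controller issued its "cycle" command.
        return node.power_cycle()
    # Case 3: "ssh reboot" returns within 20s, then node stops pinging within 30s.
    if node.ssh_reboot(timeout=20) and node.stops_pinging(timeout=30):
        return True
    # Case 4: fall back to an ipod; again the node must stop pinging within 30s.
    if node.send_ipod() and node.stops_pinging(timeout=30):
        return True
    # Case 5: last resort, power cycle.
    return node.power_cycle()
```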
What happens if nodereboot returns but a node really isn't rebooting?
For #1, this is quite possible due to a dropped packet or the node
being hung in some way. Stated provides some transparent (to libreboot)
recovery for this. Prior to sending the "wakeup and query again" message,
the node's state is set to PXEWAKEUP. The node should respond by querying
and thus causing a transition to PXEBOOTING. If that transition does not
happen in 20 seconds, stated will reboot the node. Even though the
reboot-on-timeout code in stated is commented out, there is special case
code to handle the PXEWAKEUP state. It will try waking the node up three
times before it will then force a power cycle. So a node hung in this
state will sit for 60 seconds before being power cycled, and it could
easily be 2-3 minutes after nodereboot returns before the node is
actually rebooted.
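That recovery path could be sketched as follows; the names and structure are illustrative, the real logic lives in stated's special-case PXEWAKEUP handling:

```python
PXEWAKEUP_TIMEOUT = 20   # seconds allowed for the PXEWAKEUP -> PXEBOOTING transition
MAX_WAKEUPS = 3

def handle_pxewakeup_timeout(node):
    """Called each time the 20-second transition timeout expires for a node
    stuck in PXEWAKEUP. node.wakeup_attempts counts wakeups already sent
    (the initial one from libreboot counts as the first). After three
    attempts, force a power cycle: that is the 60-second mark."""
    if node.wakeup_attempts < MAX_WAKEUPS:
        node.wakeup_attempts += 1
        node.send_bootinfo_wakeup()      # try waking it again
        return "wakeup"
    node.power_cycle()                   # three strikes: force a power cycle
    return "powercycle"
```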
For #2 and #5 where a power cycle is performed, failure to initiate
an actual reboot is unlikely unless the node is well and truly dead.
But there are certainly reasons why even a successful physical reboot
will not advance the reload/swapin process (failure in the BIOS, not
booting from the network, etc.), so we do want to recover from these
situations ASAP. For #3 and #4, failure is more likely, since the OS
could hang after the network is shut down.
For all of #2-5, stated has a timeout mechanism, but it is disabled.
A successful "ssh reboot", ipod, or power cycle puts the node in the
SHUTDOWN state. Depending only on the op-mode of the node, stated will
timeout in the SHUTDOWN state after 120-300 seconds, a value stored in
the DB. A timeout (300 seconds) from one of the SECURE* op-modes will
put the node in the SECVIOLATION state and (in theory) power it off,
but I don't consider this case any further here. Other op-mode timeouts
are supposed to trigger a node reboot, but the reboot code is disabled.
So the node will remain in the SHUTDOWN state forever unless a higher
level timeout takes care of it.
4. libosload::WaitTillReloadDone
Calculates a per-node max wait time in seconds. This timeout starts when
the node is determined to be rebooting (#3 above) and runs until stated
clears the node's current_reloads entry in the DB (or until the node
enters the TBFAILED state or goes five minutes without a state
transition). The value of
the timeout is:
If we are zeroing the disk, take the disksize in GB (from the node_type
info) and multiply by 60; i.e., uses a fixed 1GB/min or ~17MB/sec.
Otherwise, use the per-image "maxloadwait" attribute if set.
Otherwise, use 65% of the number of 1MB compressed chunks in the image
plus a constant factor (TBLOADWAIT == 600 seconds). The former is about
65 seconds/GB or ~15.4MB/sec.
How random is that?!
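As described, the computation reduces to something like the following; TBLOADWAIT and the 0.65 factor come from the text, while the function name and parameters are assumptions:

```python
TBLOADWAIT = 600  # constant slop factor, in seconds

def reload_max_wait(zero_disk, disksize_gb, maxloadwait, compressed_mb):
    """Per-node reload timeout as computed by libosload::WaitTillReloadDone.
    compressed_mb is the number of 1MB compressed chunks in the image;
    maxloadwait is the per-image attribute, or None if unset."""
    if zero_disk:
        return disksize_gb * 60          # fixed 1GB/min while zeroing
    if maxloadwait is not None:
        return maxloadwait               # per-image override
    return int(compressed_mb * 0.65) + TBLOADWAIT
```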
What we really need:
1. We need an overarching stated timeout in case os_load or snmpit hangs for
some reason. We really cannot pick a constant value here; it needs to
reflect lack of progress. I guess this means that os_load/snmpit need to
report a heartbeat to stated so that it will reset its timeout. Then we
can say if os_load/snmpit doesn't report for 5-10 minutes, we fail the
swapin.
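A minimal sketch of such a heartbeat-reset watchdog (all names here are hypothetical):

```python
import time

HEARTBEAT_GRACE = 600  # fail the swapin if no report for 10 minutes

class HeartbeatWatch:
    """Proposed stated-side watchdog: os_load/snmpit report progress
    periodically; silence beyond the grace period fails the swapin."""
    def __init__(self, grace=HEARTBEAT_GRACE, clock=time.time):
        self.grace = grace
        self.clock = clock
        self.last = clock()

    def heartbeat(self):
        self.last = self.clock()          # any report resets the timeout

    def expired(self):
        return self.clock() - self.last > self.grace
```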
2. Do we need a snmpit timeout? Individual switch ops will timeout and fail
the snmpit. Can snmpit hang doing nothing? This timeout would have to be
static based on the number of VLANs to setup, or be a heartbeat.
3. The os_load timeout should be broken into two pieces: the (relatively)
fixed-length (per node-type) "reboot the node into the MFS" part,
and the image-size/disk-speed/network-speed dependent "load the disk"
part. If we can tightly constrain the first part (1-5 minutes), we can
detect and repair non-transient errors much more quickly.
4. The initial "reboot the node into the MFS" could be handled by specifying
waitmode == 1 to nodereboot, except that that serializes the waits on
the nodes. It would also require a single wait time for all nodes in
a single call to nodereboot. We need to track all nodes in parallel
with per-node-type timeout values. Of course, this is exactly what
stated was intended for.
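What parallel, per-node-type deadline tracking might look like, as an entirely illustrative sketch (this is the kind of bookkeeping stated already does internally):

```python
import time

def wait_for_reboots(nodes, timeouts, clock=time.time, poll=lambda: None):
    """Wait on many nodes at once; timeouts[node.type] gives each node's
    per-node-type reboot deadline in seconds. Returns (booted, failed)."""
    deadline = {n: clock() + timeouts[n.type] for n in nodes}
    pending, booted, failed = set(nodes), set(), set()
    while pending:
        poll()                             # e.g. sleep, or pump events
        for n in list(pending):
            if n.in_mfs():                 # node made it into the reload MFS
                booted.add(n); pending.remove(n)
            elif clock() > deadline[n]:    # per-node timeout expired
                failed.add(n); pending.remove(n)
    return booted, failed
```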
5. The "load the disk" timeout could be static, but needs to be a function
of the uncompressed size of the image data and a per-node disk write rate.
This rate could be a fixed value or could be updated periodically based
on reported numbers from frisbee or some benchmark.
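Under those assumptions, the static version of this timeout is nearly a one-liner (names and the slop constant are illustrative):

```python
def load_disk_timeout(uncompressed_mb, write_mb_per_sec, slop=600):
    """Proposed 'load the disk' timeout: uncompressed image size divided by
    a per-node disk write rate, plus fixed slop. The rate could be a fixed
    value or be refreshed from frisbee's reported numbers."""
    return int(uncompressed_mb / write_mb_per_sec) + slop
```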
The timeout could also be a "no progress" timeout. Here, the frisbee
client would have to make periodic reports of how much data it has
received and how much it has written to disk. This could be reported
as a UDP packet to the frisbee server (which might get lost under load)
or as a TCP packet to the master server. One bad thing about this is
that it doesn't scale; we would need to keep the reports relatively
infrequent and unsynchronized among clients. Another issue is how
do we handle backward compat? We would have to say that lack of reports
means don't timeout (or fall back on a static timeout?). But this is
indistinguishable from a client that is hung and making no progress,
which is what we really want to catch with the timeout! We could make
a new version of the JOIN message that says "I do heartbeat reports".
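A sketch of the server-side bookkeeping under this proposal, including the versioned-JOIN capability flag (everything here is hypothetical, not the frisbee protocol):

```python
import time

class ClientProgress:
    """Server-side view of one frisbee client's heartbeat reports.
    A client advertises heartbeat support via a (hypothetical) new JOIN
    version; only then is a no-progress timeout applied, so old clients
    that never report are not mistaken for hung ones."""
    def __init__(self, does_heartbeat, grace=300, clock=time.time):
        self.does_heartbeat = does_heartbeat
        self.grace = grace
        self.clock = clock
        self.written = 0
        self.last_progress = clock()

    def report(self, bytes_received, bytes_written):
        if bytes_written > self.written:      # count only real disk progress
            self.written = bytes_written
            self.last_progress = self.clock()

    def stalled(self):
        if not self.does_heartbeat:
            return False                      # old client: silence != hung
        return self.clock() - self.last_progress > self.grace
```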