Overview

Overview of the Emulab software, and the way various pieces fit together.

Some Key Emulab Programs and Libraries

tmcd - daemon that runs on boss, and is essentially a proxy for the database. Nodes are not, for security reasons, allowed to contact the database directly. Also, nodes should not have to know details of the database to get, for example, a list of accounts to create. Used by nodes to get information such as user accounts, NFS mounts, hostnames, etc. Uses SSL to authenticate the server and clients, and to encrypt transmissions. The client side, used on the nodes, is called tmcc.
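
To make the split concrete, here is a rough sketch of what a tmcc-style client does: open an SSL connection to boss, send a short command, and read back the reply. This is illustrative only; the hostname, port, certificate path, and wire format below are placeholders, not the real tmcd protocol.

    import socket, ssl

    BOSS = "boss.example.net"     # placeholder boss hostname
    TMCD_PORT = 7777              # placeholder port

    def tmcc(command):
        """Send one command (e.g. 'accounts' or 'mounts') and return the reply text."""
        ctx = ssl.create_default_context(cafile="/etc/emulab/ca.pem")  # hypothetical CA cert
        with socket.create_connection((BOSS, TMCD_PORT)) as raw:
            with ctx.wrap_socket(raw, server_hostname=BOSS) as conn:
                conn.sendall((command + "\n").encode())
                chunks = []
                while True:
                    data = conn.recv(4096)
                    if not data:
                        break
                    chunks.append(data)
        return b"".join(chunks).decode()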

snmpit - The program we use to configure switches via SNMP. It gets used on the experimental net to create VLANs and set port speed and duplex. It is generally not used on the control net switch.
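
Conceptually, creating a VLAN or setting a port's duplex boils down to a handful of SNMP SET operations against the switch. The sketch below just shells out to the net-snmp 'snmpset' tool; the switch name, community string, and OIDs are invented placeholders, since the real ones depend on the switch vendor's MIBs.

    import subprocess

    SWITCH = "exp-switch-1"    # placeholder switch name
    COMMUNITY = "private"      # placeholder SNMP write community

    def snmp_set(oid, type_code, value):
        """Perform one SNMP SET using the net-snmp command-line tool."""
        subprocess.run(["snmpset", "-v2c", "-c", COMMUNITY, SWITCH,
                        oid, type_code, str(value)], check=True)

    # Hypothetical OIDs - the real ones come from the vendor's VLAN and interface MIBs.
    snmp_set("EXAMPLE-VLAN-MIB::vlanCreate.42", "i", 1)   # create VLAN 42
    snmp_set("EXAMPLE-IF-MIB::portDuplex.3", "i", 2)      # set port 3 to full duplex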

suexec - Invoked by the web interface to execute commands as other users. All commands run on boss from the web interface (such as the ones to create and terminate experiments) go through suexec, and are executed as the user logged into the webserver, not as the webserver itself.

assign - Our simulated annealing program that maps the user's requested topology onto the available hardware. Its main purposes are to minimize inter-switch bandwidth in environments with multiple experimental switches, resolve hardware types, and make sure that any special features of nodes are used efficiently (i.e., are not used by an experiment that did not request them.) It is called by assign_wrapper, which generates the list of available resources and reserves the resources picked by assign.
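
The heart of assign is a simulated-annealing search over mappings from virtual nodes to physical nodes, scoring each candidate mapping by (among other things) how many virtual links end up crossing between switches. The toy sketch below shows only the annealing loop; the real cost function, type resolution, and feature handling are much more involved.

    import math, random

    def anneal(virt_nodes, phys_nodes, cost, steps=10000, temp=1.0, cooling=0.999):
        """Toy simulated annealing: return a dict mapping virtual -> physical nodes."""
        mapping = {v: random.choice(phys_nodes) for v in virt_nodes}
        cur_cost = cost(mapping)
        best, best_cost = dict(mapping), cur_cost
        for _ in range(steps):
            v = random.choice(virt_nodes)
            old = mapping[v]
            mapping[v] = random.choice(phys_nodes)   # propose moving one virtual node
            new_cost = cost(mapping)
            # Always accept improvements; accept worse moves with a temperature-dependent probability.
            if new_cost <= cur_cost or random.random() < math.exp((cur_cost - new_cost) / temp):
                cur_cost = new_cost
                if new_cost < best_cost:
                    best, best_cost = dict(mapping), new_cost
            else:
                mapping[v] = old                     # reject: undo the move
            temp *= cooling                          # cool down over time
        return best

Here cost(mapping) would, for example, count the virtual links whose endpoints land on physical nodes attached to different switches.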

parse.tcl - Our NS parser, implemented as a TCL script that loads libraries that mimic NS commands, and pulls a few other tricks (such as overriding variable assignment.) Evaluates the user's input NS script, and places the results into the database.

frisbeed - Server for our multicast disk loading system. More on this below in the 'Images' section.

libdb - Big library that is used to interface with the database. It hides details such as the name of the database from scripts, retries failed connections, and can send mail and/or terminate the script when queries fail. It also contains functions for doing permissions checks, getting information about the state of an experiment, project, or node, and so forth. Almost all perl scripts use this library.
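
The flavor of libdb's query helpers, reduced to a Python sketch (the real code is Perl, and the mail helper here is hypothetical): retry transient connection problems, and on a hard failure either notify testbed-ops or kill the script, depending on which variant the caller asked for.

    import time

    def db_query(dbh, query, retries=3, fatal=True):
        """Run a query, retrying transient failures; optionally mail and die on hard failure."""
        for attempt in range(retries):
            try:
                return dbh.execute(query)       # dbh stands in for whatever DB handle the script holds
            except ConnectionError:
                time.sleep(2 ** attempt)        # back off, then retry the connection
            except Exception as err:
                notify_testbed_ops(query, err)  # hypothetical mail helper
                if fatal:
                    raise SystemExit(f"query failed: {err}")
                return None
        raise SystemExit("database unreachable after retries")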

libtestbed - A small library that contains handy functions for sending mail to the user, going into the background, and so forth.

Some Useful Administrative Programs

sched_reload - Schedules a reload of a node with the default image. If the node is free, it moves the node to the reloading experiment and starts the reload immediately. If the node is reserved, it puts an entry into the scheduled_reloads table. When the node is freed from an experiment (by 'nfree'), this table is checked to see whether the node should be reloaded rather than released into the free pool. This is the preferred way to get nodes reloaded when you have a new version of the default image.
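
In pseudocode, the decision sched_reload makes is small (experiment and table names as described above; the helper functions are made up for illustration):

    def sched_reload(node, image):
        """Reload a free node now, or queue the reload for when the node is freed."""
        if node_is_free(node):
            reserve(node, "emulab-ops", "reloading")   # move it into the reloading experiment
            start_reload(node, image)                  # and start the load immediately
        else:
            # nfree consults this table later, when the node is released.
            insert_row("scheduled_reloads", node=node, image=image)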

sched_reserve - Works like sched_reload, but re-allocates a node to another experiment when it gets freed (or, immediately, if the node is already free.) Most often used to move a suspect node to emulab-ops/hwdown when an experimenter reports something that may be a hardware problem.

Node Boot Process

We boot nodes via PXE, a firmware feature that lets a network card download the code it boots from. Thus, the control network card in each node needs to have PXE, but it's best to have it disabled on the experimental interfaces (because you'll just waste time at boot, waiting for DHCP to time out.) PXE contacts dhcpd on boss, which gives it an IP address and so forth, and then hands it off to 'proxydhcp' (also running on boss.) This daemon looks into the nodes table, at the pxe_boot_path and next_pxe_boot_path fields, to tell the card where to load its boot program from. next_pxe_boot_path is intended to be used by the Emulab software to temporarily override the user's settings. PXE on the NIC then downloads the boot program via TFTP from boss. Normally, we load something called 'pxeboot', which is a little custom OSKit boot loader. But we can also boot loaders that load FreeBSD into memory and run it from there - more on this in the disk image section.
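
The per-node decision proxydhcp makes is simple: prefer next_pxe_boot_path if set (the software's temporary override), otherwise pxe_boot_path, otherwise the stock pxeboot loader. A sketch, with a made-up database helper and default path:

    DEFAULT_LOADER = "/tftpboot/pxeboot"    # placeholder path for the stock loader

    def boot_program_for(node_id):
        """Pick the boot program this node's PXE ROM should download next."""
        row = lookup_nodes_table(node_id)        # hypothetical DB lookup
        if row["next_pxe_boot_path"]:
            return row["next_pxe_boot_path"]     # temporary override set by Emulab itself
        if row["pxe_boot_path"]:
            return row["pxe_boot_path"]          # the per-node setting
        return DEFAULT_LOADER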

pxeboot contacts another daemon on boss called 'bootinfo' to find out what to boot. bootinfo looks at the nodes table to figure this out. Usually, this is done by looking at the 'def_boot_osid' field, then looking in the partitions table to discover which partition that OS can be found in. However, pxeboot can also boot from other sources, such as kernels loaded via TFTP. You can also use pxeboot interactively, by pressing any key when prompted to do so during boot.
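
bootinfo's common case can be sketched the same way: map def_boot_osid to a disk partition through the partitions table and tell pxeboot to boot it (the helpers and the fallback field are illustrative):

    def boot_what(node_id):
        """Return a (method, argument) pair telling pxeboot what to boot."""
        node = lookup_nodes_table(node_id)               # hypothetical DB lookups
        osid = node["def_boot_osid"]
        part = lookup_partitions_table(node_id, osid)
        if part is not None:
            return ("partition", part["partition"])      # boot partition N from the disk
        return ("tftp", node["def_boot_path"])           # e.g. a kernel loaded via TFTP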

When the OS booted is our standard FreeBSD or Linux installation, /etc/testbed/rc.testbed is called to perform Emulab-specific configuration. First, the nodes contact cvsup on boss to look for incremental updates (we do this so that we don't have to create a new image every time we update any single file.) Next, they run scripts that set up things like routes, delay pipes, accounts, NFS mounts, etc. Most of this information is obtained from tmcd on boss.

Images

We create images with a program called 'imagezip' that does filesystem-specific compression.

To create an image, we boot into a special FreeBSD that is loaded over the network and run solely out of memory, not touching the disk at all. This way, we don't depend on specific disk contents, and aren't trying to zip up mounted filesystems. This is a stripped-down FreeBSD kernel and root filesystem that are loaded from boss via TFTP. The filesystem is decompressed into a memory file system (MFS) by the boot loader. You can get a node into and out of the MFS FreeBSD with the 'node_admin' command. It runs a (usually out-of-date) version of the node setup software, so it looks a lot like a regular node, with user accounts, NFS mounts, etc. Inside this MFS FreeBSD, we run imagezip and write the image via NFS to the project directory on ops.

To load an image, we boot another FreeBSD MFS, but this one is _much_ more stripped down, as it may get loaded by dozens of reloading nodes at once, and TFTP is unicast. This MFS contacts tmcd to find out which address (multicast address and port number) to get its disk image from. A program called 'frisbee' (think: flying disks) is invoked with this address, and grabs the image from frisbeed running on boss. The multicast protocol used by Frisbee is designed so that no global synchronization is required, and nodes can join at any time. Once frisbee is done, the node reboots into the new OS.
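
On the client side, joining the distribution really is just joining the multicast group given out by tmcd and reading blocks as they arrive; the chunk-request and reassembly logic that makes Frisbee work is omitted here. A minimal receive sketch using standard sockets:

    import socket, struct

    def join_and_receive(mcast_addr, port):
        """Join the image's multicast group and yield data blocks as they arrive."""
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        sock.bind(("", port))
        # Ask the kernel to join the group on the default interface.
        mreq = struct.pack("4s4s", socket.inet_aton(mcast_addr), socket.inet_aton("0.0.0.0"))
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
        while True:
            block, _ = sock.recvfrom(65536)
            yield block        # a real client reassembles chunks and writes them to the disk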

To initiate a disk reload, os_load gets run on boss. It sets next_pxe_boot_path so that the node will boot into the reloading MFS (pxeboot.frisbee), then reboots the node. It also sets the node's default boot OS to the default for the image (specified in the images table.)

os_load then runs frisbee_launcher, a wrapper around the real frisbee server, frisbeed. This way, frisbeed itself does not need to know any Emulab specifics, and can be restarted by frisbee_launcher if it dies. One frisbee_launcher/frisbeed process is run per image. frisbee_launcher looks at the database to determine if an instance is already running for this image, and exits if it is. If not, frisbee_launcher picks a multicast address to use, registers it in the database (images table), starts frisbeed, and goes into the background. When frisbeed exits (which it may do after being idle for a long time), frisbee_launcher updates the database to indicate that no frisbeed is running for the image anymore. If you change the path to an image in the images table, or replace its file, check for instances of frisbee_launcher running for the image (you can tell from the command line shown by 'ps'), and kill the frisbee_launcher process (NOT the frisbeed process.)
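
The launcher logic described above, reduced to a sketch (the database helpers, the address-allocation policy, and the exact frisbeed arguments are placeholders):

    import subprocess

    def frisbee_launcher(image):
        """Ensure exactly one frisbeed serves this image; clean up the DB when it exits."""
        if image_has_running_server(image):       # hypothetical check of the images table
            return                                # someone else is already serving it
        addr, port = pick_multicast_address()     # hypothetical allocation from a pool
        record_server(image, addr, port)          # register the address in the images table
        daemonize()                               # go into the background
        proc = subprocess.Popen(["frisbeed", "-m", addr, "-p", str(port), image_path(image)])
        proc.wait()                               # frisbeed exits after being idle long enough
        clear_server(image)                       # note that no server is running anymore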

There is a reload daemon (called reload_daemon) that runs on boss, and does the job of reloading nodes after they are freed from experiments. Nodes are placed into the emulab-ops/reloadpending experiment, then moved over to the emulab-ops/reloading experiment by the reload daemon (this is largely a relic of when we had a unicast disk loader, and could only reload a few nodes at a time.) This is a common place to notice hardware and software problems, so the reload daemon sends mail if any nodes get stuck in the reloading experiment for too long.
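
A sketch of the reload daemon's loop (experiment names as above; the helpers and the "stuck" threshold are invented):

    import time

    STUCK_AFTER = 20 * 60     # invented threshold: complain after 20 minutes

    def reload_daemon_loop():
        while True:
            # Pick up freed nodes and start their disk loads.
            for node in nodes_in_experiment("emulab-ops", "reloadpending"):
                move_to_experiment(node, "emulab-ops", "reloading")
                start_reload(node, default_image(node))
            # Complain about nodes that have been reloading for too long.
            for node in nodes_in_experiment("emulab-ops", "reloading"):
                if time_in_experiment(node) > STUCK_AFTER:
                    mail_testbed_ops(f"{node} appears stuck in reloading")
            time.sleep(60)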

Experiment Lifetime

Once the user hits the 'submit' button on the experiment creation form, 'batchexp' is fired off on the user's behalf to actually start the experiment. If the experiment was marked as a batch experiment, it is submitted to the batch queue, to be run by batch_daemon when there are enough free nodes. Otherwise, 'startexp' is run.

The first thing that startexp does is run tbprerun, which sets up the experiment in the database. At this point, the experiment has been created, but is not swapped in (if you checked the 'preload' box on the webpage, things stop here.) In general, this step fills out the virt_* tables in the database.

Next, startexp calls tbswapin to realize the experiment in hardware. It calls the programs to do resource assignment (assign_wrapper), set up VLANs (snmpit), set up NFS exports (exports_setup), set up DNS records (named_setup), etc. tbswapin waits for nodes to come up - in older versions of our software, this is done by pinging them; in newer versions, the nodes report back in with the event system. If any nodes fail to come up, they are rebooted once, and if they fail again, they are moved to the emulab-ops/hwdown experiment, and the experiment swapin fails.
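
The node-wait portion of swapin behaves roughly like this (event-system details omitted; the helpers are hypothetical):

    def wait_for_nodes(nodes, timeout):
        """Wait for each node to report ISUP; reboot a straggler once, then give up on it."""
        failed = []
        for node in nodes:
            if wait_for_state(node, "ISUP", timeout):        # hypothetical helper
                continue
            reboot(node)                                     # one retry
            if wait_for_state(node, "ISUP", timeout):
                continue
            move_to_experiment(node, "emulab-ops", "hwdown") # likely hardware trouble
            failed.append(node)
        if failed:
            raise RuntimeError(f"swapin failed; nodes moved to hwdown: {failed}")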

Finally, when the experiment is configured, startexp sends the user email.

If the user swaps the experiment in and out during its lifetime, 'tbswap' is called to do the job - since a failed swapin requires a lot of the same cleanup as a swapout, one script handles both, called with 'in' or 'out' as its first argument.
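
So the top of tbswap is, in spirit, a simple dispatcher (sketch only; the real script also takes the project and experiment IDs):

    import sys

    def main():
        direction = sys.argv[1]        # 'in' or 'out'
        if direction == "in":
            try:
                swap_in()              # assign resources, VLANs, exports, DNS, ...
            except Exception:
                swap_out()             # a failed swapin needs the same cleanup as a swapout
                raise
        elif direction == "out":
            swap_out()
        else:
            raise SystemExit("usage: tbswap in|out")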

When the user terminates the experiment, endexp gets called. It calls tbswapout to free hardware resources, then calls tbend, which cleans up the experiment's data in the virt_* tables.

While an experiment is swapped in, its activity is tracked using slothd, sdcollectd, etc., and the owners of inactive experiments are sent email asking them to swap out. After a while, these experiments are swapped out, either by administrators or automatically by the system itself (coming soon).

The Event System

Our event system is implemented as a thin layer on top of the 'elvin' system. Elvin is a publish/subscribe system, meaning that programs that want to receive events connect to the elvin server and subscribe to, for example, all events for a particular node or a particular traffic generator.

We run an event scheduler for each experiment. It is this scheduler's job to keep track of future events for the experiment, and send them at the appropriate time. The most basic example of this is events specified through the NS file, but tevc can also be used to schedule events for some future time. eventsys_control can be used to control the scheduler, stopping it, starting it, or replaying all events listed in the NS file.
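
At its heart the scheduler is a time-ordered queue: sleep until the next event is due, then publish it. A minimal sketch, with the publish callback standing in for the Elvin client library:

    import heapq, itertools, time

    class EventScheduler:
        """Time-ordered event queue: sleep until the next event is due, then publish it."""
        def __init__(self, publish):
            self.queue = []
            self.counter = itertools.count()   # tie-breaker so equal times never compare events
            self.publish = publish             # callback standing in for the Elvin client

        def schedule(self, delay, event):
            heapq.heappush(self.queue, (time.time() + delay, next(self.counter), event))

        def run(self):
            while self.queue:
                fire_at, _, event = heapq.heappop(self.queue)
                wait = fire_at - time.time()
                if wait > 0:
                    time.sleep(wait)           # sleep until the event is due
                self.publish(event)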

We also use the event system to communicate information about the state of nodes. At several points during the boot process, nodes contact tmcd to inform it of their current state. State changes are also inferred from external events, such as a node DHCPing, or someone running 'power' to power cycle a stuck node. All of these state changes are received by a daemon on boss, 'stated'. In addition to receiving state transition events, stated can generate them for nodes that cannot do so themselves - for example, it can ping nodes running OSes that do not report in, to detect when they come up. stated reads state machines from the database, and can detect nodes that time out in specific states (i.e., they begin to boot, but then hang), or that somehow make invalid state transitions. During experiment swapin, when all nodes have reached the ISUP state, a 'time start' event is sent, so that all agents have a similar idea of when the experiment began.
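
A tiny model of what stated tracks per node: a state machine loaded from the database, a per-state timeout, and a check that each reported transition is legal. The machine below is a made-up fragment, not the real one, and the mail helper is hypothetical.

    import time

    # Made-up fragment of a boot state machine: state -> (allowed next states, timeout in seconds)
    MACHINE = {
        "SHUTDOWN": ({"BOOTING"}, 300),
        "BOOTING":  ({"ISUP", "TBFAILED"}, 600),
        "ISUP":     ({"SHUTDOWN"}, None),
    }

    class NodeState:
        def __init__(self, node, state="SHUTDOWN"):
            self.node, self.state, self.since = node, state, time.time()

        def transition(self, new_state):
            allowed, _ = MACHINE[self.state]
            if new_state not in allowed:
                mail_testbed_ops(f"{self.node}: invalid transition {self.state} -> {new_state}")
            self.state, self.since = new_state, time.time()

        def check_timeout(self):
            _, limit = MACHINE[self.state]
            if limit is not None and time.time() - self.since > limit:
                mail_testbed_ops(f"{self.node}: stuck in state {self.state}")  # e.g. hung during boot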

On the nodes, the main things that interact with the event system are delay nodes, traffic generators, and the program agent. Delay nodes subscribe to events for their link, so that the delay can be changed, the link brought down, etc. Traffic generators can be turned on and off, and have their parameters changed, through this method. And, if the user specified any 'Program' objects in the NS file, program agents are started so that those programs can be dynamically started and stopped.