
Emulab FAQ: Troubleshooting: My experiment setup failed, what did I do wrong?

Experiments can fail in many, many ways, but before you send the above vague question off to us, consider a couple of things. First, look carefully at the "experiment failed" e-mail that you received. It includes a log of the setup process which, while not a model of clarity, often contains an obvious indication of what happened.

One potential point of failure is the mapping phase where Emulab attempts to map your topology to the available resources. Look in the log for where it runs assign. Common errors here include:

Your topology requires more physical nodes than are currently available. There should be a message of the form:
```
      *** NN nodes of type XX requested, but only MM found
```
in the log. You should always check the free node count in the left menu before attempting an experiment swapin. Keep in mind that shaped links might require additional traffic-shaping nodes beyond the nodes that are explicit in your topology.

Your topology requires too many links on one node. Currently you can have no more than four links per node unless you use multiplexed links.

If the setup log shows assign failing repeatedly and eventually giving up, contact us.
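
If you save the setup log from the failure e-mail to a file, a short script can pull out the node-shortage lines for you. The following is only a minimal sketch: the function name `find_node_shortages` is made up for illustration, the sample node type and counts are invented, and the pattern assumes the log line has exactly the form quoted above.

```python
import re

# Matches the shortage line quoted in this FAQ, for example:
#   *** 12 nodes of type pc850 requested, but only 7 found
# (the numbers and the type name here are illustrative, not real output)
SHORTAGE_RE = re.compile(
    r"\*\*\*\s+(\d+) nodes of type (\S+) requested, but only (\d+) found"
)


def find_node_shortages(log_text):
    """Return a list of (node_type, requested, found) tuples found in log_text."""
    return [
        (m.group(2), int(m.group(1)), int(m.group(3)))
        for m in SHORTAGE_RE.finditer(log_text)
    ]


if __name__ == "__main__":
    sample = "      *** 12 nodes of type pc850 requested, but only 7 found\n"
    for node_type, requested, found in find_node_shortages(sample):
        print(f"{node_type}: short by {requested - found} node(s)")
```

An empty result does not mean the mapping phase succeeded; it only means this particular message was not present.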

The next potential failure point is the setup of the physical nodes. If you are explicitly setting the OS image with tb-set-node-os, make sure you have specified a valid image (e.g., did you spell the OS identifier correctly?). Again, the log output should include an error if the OSID is invalid.

Click List ImageIDs and OSIDs in the Emulab web interface "Interaction" pane to see the current list of Emulab-supplied OSs.

If the OSID is correct, but the log contains messages of the form:

    *** Giving up on pcXXX - it's been NN minute(s).
    *** WARNING: pcXXX may be down.
    This has been reported to testbed-ops.

then a node failed to reach the point where it would report a successful setup to Emulab.
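
To see at a glance which nodes were given up on, you can scan a saved copy of the log for the first message above. This is a rough sketch under the assumption that the line has exactly the form shown; `find_timed_out_nodes` is a hypothetical helper name and the sample node name is invented.

```python
import re

# Matches lines such as:
#   *** Giving up on pc42 - it's been 6 minute(s).
# (node name and minute count are illustrative)
GIVING_UP_RE = re.compile(
    r"\*\*\* Giving up on (\S+) - it's been (\d+) minute\(s\)\."
)


def find_timed_out_nodes(log_text):
    """Map each node that was given up on to the minutes reported in the log."""
    return {m.group(1): int(m.group(2)) for m in GIVING_UP_RE.finditer(log_text)}


if __name__ == "__main__":
    sample = (
        "    *** Giving up on pc42 - it's been 6 minute(s).\n"
        "    *** WARNING: pc42 may be down.\n"
    )
    for node, minutes in find_timed_out_nodes(sample).items():
        print(f"{node} did not report in after {minutes} minute(s)")
```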

Near the end of the experiment setup, Emulab's event system can fail to start up with a message like this:

      Starting the event system.
      *** ~/.ssh/identity is not a passphrase-less key
          You will need to regenerate the key manually
      *** /usr/testbed/devel/stack/sbin/eventsys.proxy:
          Failed to start event system for foo/bar

Or, like this:

      Starting the event system.
      Permission denied, please try again.
      Permission denied, please try again.
      Permission denied.
      *** /usr/testbed/devel/stack/sbin/eventsys.proxy:
          Failed to start event system for foo/bar

This failure occurs because you have manually changed your default SSH identity (~/.ssh/identity) or edited the authorized_keys file in your Emulab home directory without going through the "Edit SSH Keys" web form on your user page. The easiest way to fix this is to make sure the passphrase is empty, using ssh-keygen(1) on users:

      users$ ssh-keygen -p -P "" -N "" -f ~/.ssh/identity

Then, make sure the corresponding public key in your Emulab home directory ("~/.ssh/identity.pub") is listed in the "Edit SSH Keys" form.
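
The two event-system failure signatures shown above can be told apart mechanically if you have the log saved as text. The sketch below is illustrative only: `diagnose_event_system_failure` and its return labels are invented for this example, and it relies solely on the message fragments quoted in this FAQ.

```python
def diagnose_event_system_failure(log_text):
    """Classify an event-system startup failure from the setup log text.

    Returns None if the log shows no event-system failure, otherwise one of
    the (hypothetical) labels "passphrase", "permission", or "unknown".
    """
    if "Failed to start event system" not in log_text:
        return None
    if "is not a passphrase-less key" in log_text:
        # First form above: ~/.ssh/identity has a passphrase set.
        return "passphrase"
    if "Permission denied" in log_text:
        # Second form above: the key was rejected, e.g. authorized_keys was edited by hand.
        return "permission"
    return "unknown"


if __name__ == "__main__":
    sample = (
        "Starting the event system.\n"
        "Permission denied, please try again.\n"
        "Permission denied.\n"
        "*** /usr/testbed/devel/stack/sbin/eventsys.proxy:\n"
        "    Failed to start event system for foo/bar\n"
    )
    print(diagnose_event_system_failure(sample))
```

Either label points to the same remedy described above: clear the passphrase and confirm the public key is listed in the "Edit SSH Keys" form.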

Node setup failures of this kind (the "Giving up on pcXXX" messages described earlier) can have many causes. Sometimes a transient load on an Emulab server can push a node over its timeout, though this is happening less and less as we improve our infrastructure. Most often, these failures are caused by custom images that either do not boot or do not self-configure properly. These are harder to diagnose because you often need access to the console logs to see what happened, and these logs aren't available after an experiment fails. However, it is possible to interactively monitor the console while the experiment is setting up, since console access is granted early in the setup process. You can use the console command on users, use the tiptunnel client application, or just run "tail -f" on the /var/log/tiplogs/pcXXX.run file.