Jail

FreeBSD Jail-based Virtual Node Implementation

This page describes the changes we made to FreeBSD jails to support Emulab virtual nodes and describes the boot time setup process for those jails.

Jail Changes

Following is a list of the features we added and the bugs we fixed in FreeBSD jails. All of the new features are optional, controlled by sysctl MIBs and per-jail flags. The new jail implementation is backward compatible with the original: all new features are disabled by default.

Allow a jailed process to bind to multiple IP addresses. The default implementation of jail allows processes inside a jail to bind to just one IP, the IP that was specified to the jail command. In that implementation, if a process specifies INADDR_ANY, the kernel silently changes it to the jail IP. If, however, there are other interfaces on the node, or if tunnels are being used to construct an overlay for the experiment, it is necessary to allow processes inside the jail to bind to those interfaces. In our modified implementation, when the jail is created, a list of auxiliary IPs can be specified on the command line, telling the kernel to allow processes inside the jail to bind to any of those IPs (including the jail IP). When the bind happens, the kernel checks the jail's list of IPs; this applies to sockets bound for outgoing traffic as well as incoming traffic. Further, the set of accessible IPs determines the list of interfaces that a jail can see so that, for example, ifconfig inside a jail will only list the interfaces and IPs available to the jail.
Allow jails to bind to INADDR_ANY. The default behavior (and original implementation) of jail maps INADDR_ANY to the jail's main IP address. However, when a jail is allowed to access other IPs, then INADDR_ANY actually means a subset of all the interfaces on the node that the jail is allowed to use (which might also be tunnels). There are two situations in which this matters:
- A process is connecting to another address, and has specified its local address as INADDR_ANY (which is typical). Instead of binding the local address of packets to the jail IP, the local address is set to the actual address of the interface that the packet is routed out of. If there are IP aliases on the interface, the list of aliases is searched for a match against one of the allowed prison IPs. If there is a match, the local address is set to that IP. Otherwise the address is set to the main address of the interface (this is not correct; it should be an error). This is to support multiplexing links using IP aliases. If we were to use IP tunnels or some other form of virtual interface, there would be no need to search the list of aliases.
- A process is binding a local socket for an incoming connection. In this case, any of the prison IPs can be the local target of the connection, but it is not until the connection is actually made that the address can be checked. This is done in the pcb lookup routine. For each pcb, if the port matches and the local address is INADDR_ANY, and the pcb was created within a jail, then the list of the prison IPs is searched, looking for a match. If no match is found, the pcb is skipped. This behavior improves compatibility with existing server applications which typically specify INADDR_ANY. If the kernel were to continue binding INADDR_ANY sockets to the main IP address of the jail, such applications would only be able to receive packets on the primary jail interface.
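The two INADDR_ANY cases above can be modeled in a few lines of Python. This is only an illustrative sketch, not the kernel code: the names `Jail`, `Interface`, `select_source`, and `inbound_match` are hypothetical, and it assumes each interface has one main address plus a list of aliases.

```python
from dataclasses import dataclass, field
from ipaddress import IPv4Address
from typing import List

INADDR_ANY = IPv4Address("0.0.0.0")


@dataclass
class Interface:
    main: IPv4Address                                  # primary address
    aliases: List[IPv4Address] = field(default_factory=list)


@dataclass
class Jail:
    main_ip: IPv4Address                               # IP given to the jail command
    aux_ips: List[IPv4Address] = field(default_factory=list)

    @property
    def allowed(self) -> List[IPv4Address]:
        """The jail IP plus every auxiliary IP."""
        return [self.main_ip, *self.aux_ips]


def select_source(jail: Jail, out_if: Interface) -> IPv4Address:
    """Case 1: local address for an outgoing socket bound to INADDR_ANY."""
    for alias in out_if.aliases:
        if alias in jail.allowed:
            return alias
    # Fallback described in the text (which notes it should really be an error).
    return out_if.main


def inbound_match(jail: Jail, pcb_local: IPv4Address, pcb_port: int,
                  dst: IPv4Address, dst_port: int) -> bool:
    """Case 2: does a pcb created inside the jail accept dst:dst_port?"""
    if pcb_port != dst_port:
        return False
    if pcb_local == INADDR_ANY:
        # Wildcard pcb: accept only destinations in the jail's IP list.
        return dst in jail.allowed
    return pcb_local == dst
```

A server that listens on INADDR_ANY inside the jail therefore accepts connections on any of the jail's addresses, while packets addressed to other IPs on the node skip its pcb.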
Allow access to raw sockets. The jail is allowed to both read and write, but is restricted from accessing the firewall, dummynet, route, and RSVP interfaces. We also ensure that the packet header reflects a source IP address appropriate for the jail: INADDR_ANY is mapped to an appropriate address for the outgoing interface and fixed addresses that are not part of the jail set are rejected. This feature allows ping, traceroute and gated to work in jails.
Allow read-only access to BPF devices. The interface is not put into promiscuous mode, so the jail is not able to see all of the packets on the wire, but only those addressed to the node. However, if the interface is already in promiscuous mode (say, because someone outside the jail is using tcpdump), then the jail will also be able to see any packet that goes by. Even when not in promiscuous mode, a jail will see all packets destined for the interface, whether targeted to a valid jail IP address or not. This could be fixed, and the promiscuous-mode problem avoided, by augmenting the filter given when the bpf device is set up. Allowing BPF access enables use of tcpdump and other packet trace tools within jails.
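The idea of augmenting the filter can be illustrated at the level of a tcpdump-style filter expression. The sketch below is an assumption about how such a fix might look, not something the implementation does: the kernel would have to operate on the compiled BPF program rather than on text, and the function name `restrict_filter` is hypothetical.

```python
import ipaddress
from typing import Iterable


def restrict_filter(user_filter: str, jail_ips: Iterable[str]) -> str:
    """AND the user's filter with a clause matching only the jail's IPs."""
    # Parsing each address rejects anything that is not a literal IP.
    hosts = " or ".join(f"host {ipaddress.ip_address(ip)}" for ip in jail_ips)
    if not hosts:
        raise ValueError("jail has no IP addresses")
    if user_filter.strip():
        return f"({user_filter}) and ({hosts})"
    return hosts
```

For example, a user filter of `tcp port 22` in a jail with two addresses becomes `(tcp port 22) and (host 10.0.0.1 or host 10.1.0.2)`.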
Restrict the port range to which a jail can bind. This allows multiple jails on the same node to safely share the port space without stepping on each other in environments where jails cannot be assigned their own IP addresses. Since the ultimate goal is to allow different experiments to coexist in jails on the same node, the port space has to be allocated globally, with the same port space assigned to all jails across an experiment, so as not to conflict with any other experiments. This assignment is done when the experiment is swapped in so that swapped experiments are not holding ranges (16 bits of port space does not go very far).
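A minimal sketch of per-experiment port allocation follows. It assumes fixed-size blocks handed out at swap-in and returned at swap-out; the class name, block size, and port bounds are hypothetical, since the text does not give the actual values.

```python
from typing import Dict, List, Tuple


class PortRangeAllocator:
    """Hands each swapped-in experiment one block of the global port space."""

    def __init__(self, low: int = 30000, high: int = 65535, block: int = 256):
        self.block = block
        # Start of every block that fits entirely within [low, high].
        self.free: List[int] = list(range(low, high - block + 2, block))
        self.in_use: Dict[str, int] = {}

    def swap_in(self, experiment: str) -> Tuple[int, int]:
        """Return the (low, high) range shared by all jails of the experiment."""
        if experiment not in self.in_use:
            if not self.free:
                raise RuntimeError("port space exhausted")
            self.in_use[experiment] = self.free.pop(0)
        start = self.in_use[experiment]
        return start, start + self.block - 1

    def swap_out(self, experiment: str) -> None:
        """Release the range so swapped-out experiments hold no ports."""
        self.free.append(self.in_use.pop(experiment))


def may_bind(port: int, port_range: Tuple[int, int]) -> bool:
    """The check a jail's bind request would have to pass."""
    low, high = port_range
    return low <= port <= high
```

With 256-port blocks the whole 16-bit space holds at most 256 experiments, which is why ranges are released at swap-out.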
Disallow FS unmounts inside a jail unless the mount was created in the jail. This is a bug fix that prevents a jail from unmounting a filesystem and exposing the underlying mount point to which it likely shouldn't have access.
Add per-jail flags to control various existing and new jail features. These are in addition to the sysctls that control the global availability of a given feature. Existing features thus controlled are: access to SYSV IPC facilities, access to routing sockets, and the ability to turn filesystem quotas on and off. New features controlled are: access to raw sockets, read-only access to BPF, and the ability to use INADDR_ANY. Additionally, there is a new global sysctl to allow jails to be configured with multiple IP addresses.
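The interaction between global sysctls and per-jail flags can be sketched as a simple conjunction: a feature is usable in a jail only if it is enabled globally and also granted to that jail. The flag names below are hypothetical stand-ins, since the text does not list the actual sysctl MIB or flag names.

```python
from enum import Flag, auto


class JailFlag(Flag):
    """Hypothetical names for the features listed in the text."""
    SYSVIPC = auto()
    ROUTE_SOCKETS = auto()
    QUOTAS = auto()
    RAW_SOCKETS = auto()
    BPF_READ = auto()
    INADDR_ANY = auto()


def feature_enabled(feature: JailFlag, global_sysctls: JailFlag,
                    jail_flags: JailFlag) -> bool:
    """A feature needs both the global sysctl and the per-jail flag."""
    return bool(global_sysctls & feature) and bool(jail_flags & feature)


# Backward-compatible default: nothing is enabled anywhere.
DEFAULT = JailFlag(0)
```

For instance, with `global_sysctls = JailFlag.RAW_SOCKETS | JailFlag.BPF_READ` and `jail_flags = JailFlag.RAW_SOCKETS`, only raw sockets are available inside the jail.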

Starting a FreeBSD Virtual Node

The goal for Emulab jail-based virtual nodes (henceforth known just as "jails") is to set up an environment that is as much like the standard Emulab node environment as possible. This makes things easier for the Emulab infrastructure as well as for the Emulab user. Also note that the intent is to use jails both locally (Emulab cluster nodes) and remotely (wide-area, RON nodes), where there are going to be different security considerations; hence the need for the per-jail permission bits mentioned above. Setting up the jail is broken into two parts: the work that must be done outside the jail because the jail does not have enough permission (creating the jail filesystem, setting up interfaces, tunnels, and routes, mounting shared filesystems), and the work that can be done inside the jail (creating accounts, installing software, starting programs and traffic generators). Following is a description of those two phases.

Setting up the jail, phase one:

To set up the outer environment it is necessary to:

Create the tunnels if the experiment requested tunnels. This applies only to wide-area nodes, not to local nodes. At the same time, routes are set up if the user requested them (static and manual only; we do not run gated on wide-area nodes!). At present, the routing setup is done via the vtun config file, which specifies external commands to run as each tap interface is configured and torn down.
Ask tmcd for the set of jail options that apply. Different users and/or experiments might get differing levels of permission to access the extended jail features mentioned above.
Create a base filesystem for the jail, and then apply some customizations to it. In addition to customizations based on the permissions that tmcd said to use, there are the usual things like setting the hostname of the jail, giving it a proper rc.conf and resolv.conf, etc. More on this below.
Mount filesystems. Locally, we mount the /user and /proj filesystems into the jail so that the users get the standard directories.
Start the tmcd proxy. More on this below.

Setting up the filesystem for the jail is a long, arduous process:

Create a zero-filled vnode file (currently set to 64MB) and find a free vn device to configure. The root of the filesystem is mounted under /var/emulab/jails//root.
Copy /root and /etc into the new jail filesystem so that each jail gets to munge its own copy of them.
NFS mount read-only /bin, /sbin, and /usr into the jailed filesystem. This gives each jail shared access to the bulk of the filesystem so that we do not have to duplicate. If the user wishes to install their own software, they will need to do it into /opt. This is perhaps not ideal.
Mount a proc filesystem inside the jail. This gives the jail a private view of its process world.
Populate the jail's /dev filesystem. The jail is not allowed to run the mknod system call, so device entries must be created for it.
Create a pristine /var filesystem. Create stub entries for several files in /etc including the passwd and group file. Create a resolv.conf file that points to the outer host.
Create an sshd config file and make sure X11 forwarding is off. Also arrange for sshd to be started up (inside the jail) on its per-jail assigned port (which is within the port range for the jail).
NFS mount (via a call in the tmcd library) all of the proj and user directories for the experiment. Again, since the jail cannot do NFS mounts inside, this is done outside. Clean out various files for security (pem files, cvsup auth file, etc).
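As one concrete piece of the steps above, the per-jail sshd configuration might be generated as in the following sketch. `Port` and `X11Forwarding` are standard sshd_config keywords; the function name and the check against the jail's port range are assumptions made for illustration.

```python
from typing import Tuple


def jail_sshd_config(port: int, port_range: Tuple[int, int]) -> str:
    """Build a minimal sshd_config for a jail with X11 forwarding disabled."""
    low, high = port_range
    if not low <= port <= high:
        raise ValueError(f"sshd port {port} is outside the jail range {low}-{high}")
    lines = [
        f"Port {port}",          # per-jail assigned port
        "X11Forwarding no",      # X11 forwarding must be off inside the jail
    ]
    return "\n".join(lines) + "\n"
```

The resulting text would be written into the jail's /etc before the jail is started, so that sshd inside the jail picks it up.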

The other complication in setting up the jailed environment involves access to tmcd. Wide-area testbed nodes are not allowed to contact tmcd without an SSL certificate, but we do not want to hand out per-jail certificates that could be easily copied. My approach was to not allow a jail to contact tmcd directly, but to instead go through a proxy running outside the jail. This has the added benefit of ensuring that the jail is not able to spoof another jail in another experiment. The implementation of this was to add a proxy mode to the tmcc client. Outside the jail, a tmcc proxy is started that creates a Unix domain socket whose path is inside the jail filesystem. In other words, the socket is named such that a tmcc client running inside the jail sees it too. The tmcc client inside the jail connects to the tmcc proxy running outside the jail via the Unix domain socket. The proxy relays the request to tmcd (sanitizing the request string), and then relays the answer back to the tmcc inside the jail. The proxy ensures that there is no spoofing of the jail ID. There are many other alternatives for accomplishing this, but this was fairly easy to do.
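The proxy arrangement can be sketched as follows. This is a simplified model, not the tmcc implementation: the `VNODEID=` request prefix, the function names, and the `forward` callable (standing in for the authenticated connection to tmcd) are all assumptions made for illustration.

```python
import socket
from typing import Callable


def sanitize(jail_id: str, request: bytes) -> bytes:
    """Drop any client-supplied identity and stamp the proxy's own jail ID."""
    words = request.decode("ascii", errors="ignore").split()
    kept = [w for w in words if not w.upper().startswith("VNODEID=")]
    return f"VNODEID={jail_id} {' '.join(kept)}\n".encode("ascii")


def serve_one(conn: socket.socket, jail_id: str,
              forward: Callable[[bytes], bytes]) -> None:
    """Relay one request from a client inside the jail and return the answer."""
    with conn:
        request = conn.recv(4096)
        if request:
            conn.sendall(forward(sanitize(jail_id, request)))


def run_proxy(sock_path: str, jail_id: str,
              forward: Callable[[bytes], bytes]) -> None:
    """Listen on a Unix domain socket whose path lies under the jail's root."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as srv:
        srv.bind(sock_path)
        srv.listen()
        while True:
            conn, _ = srv.accept()
            serve_one(conn, jail_id, forward)
```

Because the proxy is started once per jail and knows which jail it serves, the identity attached to each request never depends on anything the jailed client sends.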

Setting up the jail, phase two:

Once the jail system call has been issued, it is up to the inner environment to finish getting it set up. Inside the jail, the first program to run is a little program (injail) that is intended to act like "init" in that it starts the initial shell and then waits until it receives a signal to terminate. The easiest way to ensure that all processes inside the jail are terminated is for injail to send a TERM to the entire process group, and then a KILL to pick up any stragglers. This is because killing all of the processes from outside the jail is difficult (it is hard to see inside the jail), and because the jail will not actually terminate until all the processes inside are really dead.
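The TERM-then-KILL shutdown can be sketched in Python as below. This is not the injail source: injail runs inside the jail and signals its own process group, whereas this sketch has a parent start the initial command in a new process group and signal that group. The function names and the grace period are hypothetical.

```python
import os
import signal
import subprocess
from typing import List


def start_init(argv: List[str]) -> subprocess.Popen:
    """Start the initial command as the leader of a new process group."""
    return subprocess.Popen(argv, start_new_session=True)


def _signal_group(pgid: int, sig: int) -> None:
    try:
        os.killpg(pgid, sig)
    except ProcessLookupError:
        pass  # every process in the group is already gone


def shutdown(proc: subprocess.Popen, grace: float = 2.0) -> int:
    """Send TERM to the whole group, then KILL to pick up stragglers."""
    pgid = proc.pid  # the child is the group leader, so pgid equals its pid
    _signal_group(pgid, signal.SIGTERM)
    try:
        proc.wait(timeout=grace)
    except subprocess.TimeoutExpired:
        pass  # the leader ignored TERM; the KILL below handles it
    _signal_group(pgid, signal.SIGKILL)
    return proc.wait()
```

Signaling the group rather than individual PIDs means descendants started by the initial shell are covered without having to enumerate them.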

The initial shell mentioned above is /etc/rc, which proceeds to do all of the same boot time configuration that normally happens when a node boots. The difference, of course, is that the jail has a heavily constrained /etc/rc.conf that starts up just a few essential services such as syslogd, cron, and sshd (on the specific port assigned to sshd for the jail; see above). The last part of the configuration to run is the standard testbed setup, although again in a somewhat restricted manner. Currently the following testbed mechanisms are supported within the jailed environment: