Hi folks,

Short version: I'm starting to have all the pieces necessary for Condor to manage jobs inside containers on Linux, a powerful alternative to using full virtualization for simply running jobs.

Long version: I just wanted to update everyone on the continued progress of job isolation in Condor:

0) cgroups: Added in 7.7.0, but undocumented. This uses cgroups to track CPU, memory, and block I/O for vanilla universe jobs, and also uses cgroups to do kernel-side killing of the job.
- Note that the plumbing is all there to do isolation/throttling of memory and block I/O. CPU locking has been implemented for a while, but it could be extended with per-job CPU fairsharing (i.e., job X gets 2x the CPU shares of job Y).

1) Filesystem namespaces (#2015): Reviewed once, one known issue. This allows sysadmins to remount well-known directories (think /tmp, /var/tmp) onto a subdirectory of $_CONDOR_SCRATCH_DIR. This way each job gets its own private /tmp that disappears when the job exits.
- I think this is very close to being commit-able.
- This is the only item in this list that is available on RHEL5.

2) PID namespaces (#1959): Have the job see its own copy of the process tree; for example, getpid() for the starter will return 1. This prevents jobs from seeing other jobs on the system. It also introduces a new synchronization barrier into the post-clone() environment so that the parent can perform actions before the child exec's, in a race-free manner.

3) Network namespaces (https://github.com/bbockelm/condor-network-accounting): This allows Condor to manage network devices/routing per job and do per-job network accounting. Depends on the synchronization mechanism in (2).
- While the initial goal was per-job network accounting, Matt has pointed out that this could be used to have Condor manage overlay networks, i.e., isolate groups of jobs into their own sub-network. Intriguing, but I'm not sure how to make it pragmatic yet.
- Likely the subject of a follow-up email; there are a lot of details and design decisions here. The take-home story is that you can currently do to-the-byte network accounting of the data passed through the network device (*not* the data passed through POSIX calls).

4) Condor management of chroot jails: Allow Condor to map jobs into chroot jails. This would allow the userland presented to jobs to come from a different operating system than the host. I think CMS would be especially interested in this: the experiment won't upgrade to RHEL6 until 2013, but this would give sites the ability to take advantage of RHEL6 features while presenting a RHEL5 userland to jobs. Depends on (1).

I believe this sums to a significant leap in capability for Condor (and I would appreciate help getting items (1) through (4) into mainline). It demonstrates an alternate approach to the VM "craze": sites can pick and choose the precise level of isolation they need for their jobs. Containers won't have the I/O overhead present in VMs, and won't have the issues of getting the memory subsystems of several kernels to cooperate (as there is only one kernel). Further, for many use cases it's much more sane to start a single process in a container than an entire operating system (and all the corresponding overhead) to run a job. The time taken to set up and tear down a container is measured in milliseconds, as opposed to tens of seconds to start an OS.

For the curious, I've appended minimal sketches of the kernel interfaces behind each of these items at the end of this mail.

Thanks,
Brian
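
Sketch for (0), cgroups-based CPU fairsharing: this only illustrates the kernel interface, assuming a cgroup "cpu" controller mounted at /cgroup/cpu; the directory names and share values are made up and are not what the condor_starter actually uses.

    /* Give one job's cgroup twice the CPU shares of another's.  Paths and
     * group names are illustrative only; a real implementation would also
     * write the job's PID into each group's "tasks" file. */
    #include <stdio.h>
    #include <sys/stat.h>

    static int write_value(const char *path, const char *value)
    {
        FILE *f = fopen(path, "w");
        if (!f) { perror(path); return -1; }
        fprintf(f, "%s\n", value);
        return fclose(f);
    }

    int main(void)
    {
        mkdir("/cgroup/cpu/condor", 0755);
        mkdir("/cgroup/cpu/condor/job_x", 0755);
        mkdir("/cgroup/cpu/condor/job_y", 0755);

        /* When both jobs are runnable, job_x gets roughly 2x the CPU. */
        write_value("/cgroup/cpu/condor/job_x/cpu.shares", "2048");
        write_value("/cgroup/cpu/condor/job_y/cpu.shares", "1024");
        return 0;
    }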
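
Sketch for (1), a private /tmp via a mount namespace: this toy wrapper assumes the scratch directory is in $_CONDOR_SCRATCH_DIR and simply bind-mounts a "tmp" subdirectory of it over /tmp before exec'ing the payload. It needs root (CAP_SYS_ADMIN) and is not the starter's actual code path.

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mount.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        const char *scratch = getenv("_CONDOR_SCRATCH_DIR");
        if (argc < 2 || !scratch) {
            fprintf(stderr, "usage: _CONDOR_SCRATCH_DIR=... %s cmd [args]\n", argv[0]);
            return 1;
        }

        char private_tmp[4096];
        snprintf(private_tmp, sizeof(private_tmp), "%s/tmp", scratch);
        mkdir(private_tmp, 01777);

        /* New mount namespace: mounts made below are invisible to the
         * rest of the system and vanish when the job exits. */
        if (unshare(CLONE_NEWNS) < 0) { perror("unshare"); return 1; }

        /* Best effort: keep our mounts from propagating back out. */
        mount(NULL, "/", NULL, MS_REC | MS_PRIVATE, NULL);

        /* Remount the scratch subdirectory over the well-known /tmp. */
        if (mount(private_tmp, "/tmp", NULL, MS_BIND, NULL) < 0) {
            perror("bind mount"); return 1;
        }

        execvp(argv[1], argv + 1);   /* the job now sees its private /tmp */
        perror("execvp");
        return 1;
    }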
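
Sketch for (2), a PID namespace plus the post-clone() barrier: the child blocks on a pipe until the parent has finished its setup, then exec's; inside the new namespace its getpid() is 1. Needs a 2.6.24+ kernel and root; the names here are mine, not the starter's.

    #define _GNU_SOURCE
    #include <sched.h>
    #include <signal.h>
    #include <stdio.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static int barrier[2];            /* child reads, parent writes */

    static int child_main(void *arg)
    {
        char c;
        close(barrier[1]);
        read(barrier[0], &c, 1);      /* wait for the parent to release us */

        printf("child sees getpid() = %d\n", (int)getpid());  /* prints 1 */

        char *const cmd[] = { (char *)arg, NULL };
        execvp(cmd[0], cmd);
        return 127;
    }

    int main(int argc, char **argv)
    {
        static char stack[64 * 1024];
        if (argc < 2 || pipe(barrier) < 0) return 1;

        pid_t pid = clone(child_main, stack + sizeof(stack),
                          CLONE_NEWPID | SIGCHLD, argv[1]);
        if (pid < 0) { perror("clone"); return 1; }

        /* Parent-side setup (cgroup placement, accounting hooks, ...)
         * happens here, before the child has exec'd -- no race. */

        close(barrier[1]);            /* EOF on the pipe releases the child */
        waitpid(pid, NULL, 0);
        return 0;
    }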
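
Sketch for (3), the accounting half only: once a job has its own network device (e.g., one end of a veth pair moved into its namespace), the to-the-byte counters are just sysfs reads on the host-side device. The device name below is hypothetical, and the veth/routing setup, which is where all the design decisions live, is omitted.

    #include <stdio.h>

    /* Read one statistics counter for a network device from sysfs. */
    static long long read_counter(const char *dev, const char *stat)
    {
        char path[256];
        long long value = -1;
        snprintf(path, sizeof(path),
                 "/sys/class/net/%s/statistics/%s", dev, stat);
        FILE *f = fopen(path, "r");
        if (!f) { perror(path); return -1; }
        if (fscanf(f, "%lld", &value) != 1) value = -1;
        fclose(f);
        return value;
    }

    int main(void)
    {
        const char *dev = "veth_job0";    /* hypothetical per-job device */
        printf("%s: rx=%lld bytes, tx=%lld bytes\n", dev,
               read_counter(dev, "rx_bytes"),
               read_counter(dev, "tx_bytes"));
        return 0;
    }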
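
Sketch for (4), entering a jail: assuming a pre-built RHEL5 userland unpacked under /chroots/rhel5 (an illustrative path), the job wrapper just needs chroot() plus chdir("/") before exec. In practice this composes with (1), since the scratch directory has to be bind-mounted inside the jail first; needs root.

    #include <stdio.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        if (argc < 2) {
            fprintf(stderr, "usage: %s cmd [args]\n", argv[0]);
            return 1;
        }

        /* The jail path is illustrative; Condor would map it per-job. */
        if (chroot("/chroots/rhel5") < 0) { perror("chroot"); return 1; }
        if (chdir("/") < 0) { perror("chdir"); return 1; }  /* don't leave cwd outside the jail */

        execvp(argv[1], argv + 1);    /* the job now sees the RHEL5 userland */
        perror("execvp");
        return 1;
    }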