Re: [HTCondor-devel] Inclusion of additional arguments to docker run


Date: Mon, 05 Oct 2015 15:04:32 +0100
From: Brian Candler <b.candler@xxxxxxxxx>
Subject: Re: [HTCondor-devel] Inclusion of additional arguments to docker run
On Thu, 03 Sep 2015 10:40:29 -0500, Greg Thain wrote:
> For your use case, it seems that you have machines with large amounts of
> data pre-loaded on them, that you want your containers to be able to access
> -- is this correct? If so, that means that jobs that request docker_volume =
> foo can only run on certain hosts? When working on HTCondor, we try to
> think about what the responsibility of the job is versus the
> responsibility of the machine. If a machine has a special capability, we
> like to have it advertise that fact in the startd classad, and allow jobs
> to match against it. Perhaps for this use case, we could add a knob to
> the startd that allows the administrator to configure one or more
> filesystems that it will volume mount into docker containers that request
> them. That way, jobs can only match to machines that have the data they
> need, and admins can be more assured that containers are contained.

[Aside: I'm working with Matt on this project!]

In our case we already use explicit startd attributes to match the machines. For example, machines may have a local SSD volume mounted on /media/SSD, but different machines have different databases in their SSD. To ensure we run the job on the right machine(s) we use a custom script to announce all the available files:

#!/bin/sh
# Advertise each *.dbq file on the local SSD as a True ClassAd attribute,
# e.g. /media/SSD/test1.dbq -> has_media_SSD_test1_dbq = True
ls /media/SSD/*.dbq | sed -e 's#[./]#_#g' -e 's/^/has/' -e 's/$/ = True/'

and then get HTCondor to periodically run this script to update the machine ClassAds:

# http://stackoverflow.com/questions/9864766/how-to-tell-condor-to-dispatch-jobs-only-to-machines-on-the-cluster-that-have
LIST_SSD = /usr/local/bin/condor_list_ssd
STARTD_CRON_JOBLIST = $(STARTD_CRON_JOBLIST) LIST_SSD
STARTD_CRON_LIST_SSD_EXECUTABLE = $(LIST_SSD)
STARTD_CRON_LIST_SSD_PERIOD = 300
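To see the transformation concretely, the sed pipeline from the cron script can be run on a single sample path:

```shell
# Same transformation as the cron script above, applied to one sample path
echo /media/SSD/test1.dbq | sed -e 's#[./]#_#g' -e 's/^/has/' -e 's/$/ = True/'
# prints: has_media_SSD_test1_dbq = True
```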

Then the jobs themselves have a requirements expression which declares what file or files they need to use.

In other cases there is an NFS mount which is under /shared/, and we know this mount is available on all nodes, so we don't bother to check for it.

Therefore, one example of a simple condor job would be one that copies a certain file from /shared/XXX to /media/SSD. If we want to run that in a docker container then we need to mount both those directories inside the container. This sort of job has a target requirement saying what actual machine we want to run it on.

Other jobs perform lookups using the files under /media/SSD. For example, we can have a requirement on has_media_SSD_test1_dbq, and the job will run on any node which has file "test1.dbq" on its SSD. Hence the job doesn't need to know which nodes contain which files, only the name of the file it wants to use.
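For example, such a lookup job's submit file might contain something like this (the image name is illustrative, not from our actual setup):

```
universe     = docker
docker_image = example/lookup:latest
requirements = (has_media_SSD_test1_dbq =?= True)
queue
```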

To some degree, I think the question of advertising volumes and mounting them is orthogonal:

* Not all condor jobs need to mount all available volumes
* A volume called /media/SSD on node A is not necessarily equivalent to a volume called /media/SSD on node B

However, it is true that a condor job should only run where the volumes it needs are available.

In the spirit of "do the simplest thing which can possibly work", the first option we considered is something like

    docker_volumes = "/media/SSD:/SSD /shared:/shared"

which can be parsed to provide the argument to docker run -v. Optionally this could also update the default requirements expression saying that /media/SSD and /shared must be present on the execution node; but as explained above, this is not actually sufficient to meet our needs anyway.
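A sketch of how the starter might expand that string into docker arguments, written in shell in the same style as the wrapper script later in this mail (the function name is made up for illustration):

```shell
#!/bin/sh
# Turn a space-separated list of src:dst pairs into "docker run -v" arguments
build_volume_args() {
    args=""
    for spec in $1; do
        args="$args -v $spec"
    done
    echo "docker run$args"
}

build_volume_args "/media/SSD:/SSD /shared:/shared"
# prints: docker run -v /media/SSD:/SSD -v /shared:/shared
```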

Even more generic is to be able to pass arbitrary additional arguments to docker run, e.g.

    docker_arguments = "-v '[""/media/SSD:/SSD"", ""/shared:/shared""]'"

[Aside: people may argue about the security implications of either option, but I don't think you can run completely untrusted docker containers under condor anyway, at least not without heavy use of some MAC layer, since docker does everything as root. For example, any available NFS servers could be mounted within the container regardless, since they are likely access controlled only by source IP address. Anyone who enables the docker universe needs to be aware of this.]

There is another approach to mounting data in docker, which is to use data volume containers rather than direct mounting from the host filesystem.
https://docs.docker.com/userguide/dockervolumes/#creating-and-mounting-a-data-volume-container

To work this way, the submit file could have

    docker_volumes_from = FOO BAR

and the starter would generate `docker run --volumes-from FOO --volumes-from BAR`

Then it makes sense to make use of a classAd advertising the availability of a particular named data container, e.g.

docker_volume_FOO = True
docker_volume_BAR = True

and this could be added into the requirements expression.
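Concretely, a job needing both data containers might then specify something like (sketch):

```
docker_volumes_from = FOO BAR
requirements        = (docker_volume_FOO =?= True) && (docker_volume_BAR =?= True)
```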

It's not clear to me whether a data volume container can itself bind-mount data from the host filesystem, and whether another container can then mount that data container successfully. (I need to test this.)

For us, I think this would also cause some more configuration overhead:

- we would have to create and name the data containers on each host.

(Given that /media/SSD is different on each host, we probably ought to give them unique names, but this would then lose the benefit that our query jobs don't need to know which host they are running on)

- we would have to add the classAds saying which data containers were present on each host (*)

But it would still be perfectly usable.

Anyway, the reason we're interested in discussing and agreeing on a way forward is that we'd like to get away from having our own custom build of condor, which is somewhat painful.

The other approach we've considered is to replace 'docker' with a custom wrapper script which adds extra options to the docker command line:

=== /usr/bin/docker-media ===
#!/bin/sh
cmd="$1"
if [ "$cmd" = "run" ]; then
    shift
    exec docker "$cmd" --volume /media:/media "$@"
fi
exec docker "$@"

=== /etc/condor/condor_config.local ===
...
DOCKER = /usr/bin/docker-media

That's arguably cleaner than hacking about with condor itself, but still needs to be deployed to every node. Furthermore it hardcodes the volume(s) of interest, and only one fixed global configuration is possible for all docker jobs.

Regards,

Brian.

(*) I don't think "data volume containers" are specially marked, but we could announce all containers which match a particular name pattern for example.