On Thu, 03 Sep 2015 10:40:29 -0500, Greg Thain wrote:
> For your use case, it seems that you have machines with large amounts of
> data pre-loaded on them, that you want your containers to be able to access
> -- is this correct? If so, that means that jobs that request docker_volume =
> foo can only run on certain hosts? When working on HTCondor, we try to
> think about what the responsibility of the job is versus the
> responsibility of the machine. If a machine has a special capability, we
> like to have it advertise that fact in the startd classad, and allow jobs
> to match against it. Perhaps for this use case, we could add a knob to
> the startd that allows the administrator to configure one or more
> filesystems that it will volume mount into docker containers that request
> them. That way, jobs can only match to machines that have the data they
> need, and admins can be more assured that containers are contained.
[Aside: I'm working with Matt on this project!]
In our case we already use explicit startd attributes to match the
machines. For example, machines may have a local SSD volume mounted
on /media/SSD, but different machines have different databases in
their SSD. To ensure we run the job on the right machine(s) we use a
custom script to announce all the available files:
#!/bin/sh
ls /media/SSD/*.dbq | sed -e 's#[./]#_#g' -e 's/^/has/' -e 's/$/ = True/'
and then get HTCondor to periodically run this script to update the
machine ClassAds:
# http://stackoverflow.com/questions/9864766/how-to-tell-condor-to-dispatch-jobs-only-to-machines-on-the-cluster-that-have
LIST_SSD = /usr/local/bin/condor_list_ssd
STARTD_CRON_JOBLIST = $(STARTD_CRON_JOBLIST) LIST_SSD
STARTD_CRON_LIST_SSD_EXECUTABLE = $(LIST_SSD)
STARTD_CRON_LIST_SSD_PERIOD = 300
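For example, on a node whose SSD holds test1.dbq and test2.dbq, the
machine ad gains:

    has_media_SSD_test1_dbq = True
    has_media_SSD_test2_dbq = True

which can be checked with something like (somehost being the node in
question):

    condor_status -long somehost | grep '^has_media_SSD'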
Then the jobs themselves have a requirements expression which
declares what file or files they need to use.
In other cases there is an NFS mount which is under /shared/, and we
know this mount is available on all nodes, so we don't bother to
check for it.
Therefore, one example of a simple condor job would be one that
copies a certain file from /shared/XXX to /media/SSD. If we want to
run that in a docker container then we need to mount both those
directories inside the container. This sort of job has a target
requirement saying what actual machine we want to run it on.
Other jobs perform lookups using the files under /media/SSD. For
example, we can have a requirement on has_media_SSD_test1_dbq, and
the job will run on any node which has the file "test1.dbq" on its SSD.
Hence the job doesn't need to know which nodes contain which files,
only the name of the file it wants to use.
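As a concrete (hypothetical) example, the submit file for such a
query job could contain:

    universe     = docker
    docker_image = ourimage
    requirements = (has_media_SSD_test1_dbq =?= True)

where "ourimage" is whatever image runs the query; using =?= keeps
the requirement false, rather than undefined, on machines which don't
advertise the attribute.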
To some degree, I think the question of advertising volumes and
mounting them is orthogonal:
* Not all condor jobs need to mount all available volumes
* A volume called /media/SSD on node A is not necessarily equivalent
to a volume called /media/SSD on node B
However, it is true that a condor job should only run where the
volumes it needs are available.
In the spirit of "do the simplest thing which can possibly work",
the first option we considered is something like
    docker_volumes = "/media/SSD:/SSD /shared:/shared"
which can be parsed to provide the argument to docker run -v.
Optionally this could also update the default requirements
expression saying that /media/SSD and /shared must be present on the
execution node; but as explained above, this is not actually
sufficient to meet our needs anyway.
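To spell out the intended expansion: each space-separated SRC:DST
pair becomes one -v option, so for the example above the starter
would end up running something like

    docker run -v /media/SSD:/SSD -v /shared:/shared ... IMAGE ...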
Even more generic is to be able to pass arbitrary additional
arguments to docker run, e.g.
    docker_arguments = "-v '[""/media/SSD:/SSD"", ""/shared:/shared""]'"
[Aside: people may argue about the security implications of either
option, but I don't think you can run completely untrusted docker
containers under condor in any case, at least not without heavy use
of some MAC layer, since docker does everything as root anyway. For
example, any available NFS servers could be mounted within the
container, since they are likely only access-controlled by source IP
address. Anyone who enables the docker universe needs to be aware of
this.]
There is another approach to mounting data in docker, which is to
use data volume containers rather than direct mounting from the host
filesystem.
https://docs.docker.com/userguide/dockervolumes/#creating-and-mounting-a-data-volume-container
To work this way, the submit file could have
    docker_volumes_from = FOO BAR
and the starter would generate `docker run --volumes-from FOO
--volumes-from BAR`
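To make that concrete (the busybox image here is just a convenient
placeholder), FOO might be created once per host along the lines of:

    docker create -v /media/SSD:/SSD --name FOO busybox /bin/true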
Then it makes sense to make use of a classAd advertising the
availability of a particular named data container, e.g.
    docker_volume_FOO = True
    docker_volume_BAR = True
and this could be added into the requirements expression.
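On the startd side, publishing those attributes could be as simple as
putting them in the local config and adding (assuming static config;
a STARTD_CRON script, as for the SSD files above, would also work):

    STARTD_ATTRS = $(STARTD_ATTRS) docker_volume_FOO docker_volume_BAR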
It's not clear to me whether a data volume container can mount data
from the host filesystem, and another container can then mount that
data container successfully. (I need to test this.)
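(For example, with FOO created as above, checking whether

    docker run --rm --volumes-from FOO busybox ls /SSD

lists the expected .dbq files would settle it.)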
For us, I think this would also cause some more configuration
overhead:
- we would have to create and name the data containers on each host.
(Given that /media/SSD is different on each host, we probably ought
to give them unique names, but this would then lose the benefit that
our query jobs don't need to know which host they are running on)
- we would have to add the classAds saying which data containers
were present on each host (*)
But it would still be perfectly usable.
Anyway, the reason we're interested in discussing and agreeing the
way forward is that we'd like to get away from having our own custom
build of condor, which is somewhat painful.
The other approach we've considered is to replace 'docker' with a
custom wrapper script which adds extra options to the docker command
line:
=== /usr/bin/docker-media ===
#!/bin/sh
# Inject extra --volume options into "docker run" invocations;
# all other docker subcommands are passed through unchanged.
cmd="$1"
if [ "$cmd" = "run" ]; then
    shift
    exec docker "$cmd" --volume /media:/media "$@"
fi
exec docker "$@"
=== /etc/condor/condor_config.local ===
...
DOCKER = /usr/bin/docker-media
That's arguably cleaner than hacking about with condor itself, but
still needs to be deployed to every node. Furthermore it hardcodes
the volume(s) of interest, and only one fixed global configuration
is possible for all docker jobs.
Regards,
Brian.
(*) I don't think "data volume containers" are specially marked, but
we could, for example, announce all containers which match a
particular naming pattern.
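For example, something like

    docker ps -a | awk '{print $NF}' | grep '^datavol_'

run from a STARTD_CRON script would list all containers whose names
start with a "datavol_" prefix (the prefix being purely illustrative).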