
Re: [HTCondor-users] HTCondor and Docker



We are using a hybrid of the three scenarios (mostly #2): we have Docker installed on our HTCondor worker nodes and allow submission of jobs whose submit files include the docker pull and "run" commands. Since we construct the submit files on behalf of the user from a web app, our usage pattern is a bit different from direct submission by a user.

1. We have a sanctioned docker image that by default does the data staging (we use iRODS to transfer large files back and forth from our repository on behalf of a user).
2. Once the data is staged on the node's local disk, we mount it as a volume in the user-provided docker image (after the pull) and execute the tasks (which write back to the same mounted volume). If the image (application) is not sanctioned/vetted, or the user is not permitted, we restrict network access, i.e. no inbound or outbound traffic.
3. Finally, on completion, the docker container that sends data back is executed (same sanctioned image as step 1).
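The three steps above can be sketched as a wrapper script that the generated submit file's executable points at. This is only a sketch: the image names (irods-stager), paths, and script names are hypothetical placeholders, not our actual code.

```
#!/bin/sh
# Hypothetical wrapper run as the HTCondor job's executable.
IMAGE="$1"          # user-provided image, e.g. user/image:tag
STAGE_DIR="$2"      # local-disk staging area on the worker node

# Step 1: sanctioned staging container pulls input data via iRODS
docker run --rm -v "$STAGE_DIR":/data irods-stager stage-in

# Step 2: pull the user image and run the task with the staged data
# mounted; --net=none cuts all traffic for unvetted images/users
docker pull "$IMAGE"
docker run --rm --net=none -v "$STAGE_DIR":/data "$IMAGE" /data/run_task.sh

# Step 3: the same sanctioned image ships results back to the repository
docker run --rm -v "$STAGE_DIR":/data irods-stager stage-out
```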

We would like to make efficient use of docker images already present on a node, so we are not waiting for downloads every time a pull occurs; rather, we want an easy way to identify which worker already has the image and place the job on it (extend the machine ClassAd?).
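One sketch of the "extend the ClassAd" idea, assuming invented attribute and script names: have each startd periodically advertise its cached images via a STARTD_CRON job, then steer jobs toward those machines in the generated submit file.

```
# condor_config on the worker: a cron job whose script prints, e.g.,
#   DockerCachedImages = "user/image:tag,other/image:v2"
STARTD_CRON_JOBLIST = $(STARTD_CRON_JOBLIST) DOCKERIMAGES
STARTD_CRON_DOCKERIMAGES_EXECUTABLE = /usr/local/bin/list_docker_images.sh
STARTD_CRON_DOCKERIMAGES_PERIOD = 300

# In the generated submit file: prefer machines that already have the image
rank = stringListMember("user/image:tag", DockerCachedImages)
```

Using rank rather than requirements lets the job still match (and pull) elsewhere when no cached copy exists.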
Likewise, when we have 50+ workers requesting the same image from a docker repository/registry (private or Docker Hub), we would like to exploit locality and hit local copies instead of hitting the registry 50+ times. This is not purely an HTCondor issue but more a cache, CDN, load-balancing, and docker image lifecycle management issue, and it is a scalability concern as we get more users running large numbers of jobs with docker images that are a few GB in size.
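One way to take that load off the registry is Docker's registry-mirror support: run a pull-through cache near the workers and point each docker daemon at it (newer daemons read /etc/docker/daemon.json; older ones take a --registry-mirror flag on the daemon command line). The cache hostname below is a hypothetical placeholder.

```
{
  "registry-mirrors": ["http://registry-cache.example.org:5000"]
}
```

The first worker to pull an image warms the cache, and the remaining 49+ pulls are served from the local mirror instead of the upstream registry.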

We have this set up currently in our QA/test environment with internal users for iPlant, with the intention of a public release later this summer. I would welcome pointers and suggestions on security, scalability, and effective use of HTCondor with Docker. Our campus HPC group at the University of Arizona is also looking to allow approved users (added to the docker group) to run docker jobs (we use PBS) and would benefit from feedback on how best to configure this for multiuser environments.

Regards,
Nirav



On Tue, Apr 7, 2015 at 8:21 AM, Greg Thain <gthain@xxxxxxxxxxx> wrote:
On 04/07/2015 10:02 AM, Brian Candler wrote:
There are three different things I'm thinking of.

(1) Running an HTCondor worker node as a Docker container.

This should be straightforward. All the jobs would run within the same container and therefore have an enforced limit on total resource usage.

This would be a quick way to add HTCondor execution capability to an existing Docker-aware server, just by
"docker run -d htcondor-worker"
or somesuch.

We've looked at this, and it is a bit more work than you might think: the htcondor-worker would need to be configured to point to the central manager and be compatible with the rest of the pool. Generally, docker containers run behind NAT, and worker nodes need inbound connections, so CCB needs to be set up on the central manager as well. You might want to volume-mount the execute directory; otherwise, docker has a 10 GB limit on container growth out of the box, though that limit can be increased.
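The pieces listed above might look roughly like this; the hostname and paths are placeholders, not a tested configuration.

```
# condor_config baked into the htcondor-worker image
CONDOR_HOST = central-manager.example.org   # point at the pool's manager
CCB_ADDRESS = $(CONDOR_HOST)                # let the NATed worker accept connections via CCB
DAEMON_LIST = MASTER, STARTD

# Launch with the execute directory volume-mounted so job sandboxes
# live outside the container filesystem and its growth limit:
#   docker run -d -v /scratch/execute:/var/lib/condor/execute htcondor-worker
```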

Also, depending on your security posture, you probably don't want to run the worker node as root within the container, which may or may not be a problem for your HTCondor usage.


(2) A "docker universe" where each job instance launches within a new Docker container, from a chosen container template.

When the job starts, a container is created, and when the job terminates the container is destroyed (except perhaps on failure, in which case we can keep it around for post-mortem?)

condor_exec would need to fire off "docker run" (preferably via the docker API) and track it until the container terminated. Plumbing for stdin/stdout and file transfer would also be required. Hence maybe part of condor_exec itself should run within the container?
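For what it's worth, a submit-file shape for such a universe might look like the following, by analogy with the existing universes. The attribute names here are guesses about an interface that was still being designed, and the image name is a placeholder.

```
universe     = docker
docker_image = debian:stable      # template image the container is created from
executable   = /bin/my_analysis
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
output = out.$(Process)
error  = err.$(Process)
log    = docker.log
queue
```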


This is something we are actively working on. If you have ideas, or use cases, we'd love to hear them.

(3) Docker containers on the submit host

A docker container would be a convenient abstraction to use on the submission host. Normally when you start an HTCondor DAG you need to create an empty working directory, run a script to create the DAG and/or SUB files, run condor_submit_dag, monitor progress to wait for completion, check the exit status to see whether all DAG nodes completed successfully, fix/restart if necessary, then tidy up the work directory.

Docker on the submission host could handle this lifecycle: the container would be the work directory, it would run the scripts you want, submit the DAG, and be visible as a running container until it has completed; the container itself has an exit status, visible under "docker ps", which would show whether the DAG completed successfully or not.
https://docs.docker.com/reference/commandline/cli/#filtering_2

When you are finished with the results then you would destroy the container.

This one might be a bit tricky to implement, as I don't see any way to have condor_submit_dag or condor_submit run in the foreground. I think it would be necessary to run "condor_dagman -f" directly as the process within the container.

The container also needs to communicate with the condor schedd, and I'm not sure if it needs access to bits of the filesystem as well (e.g. condor_config). If necessary, /etc/condor/ can be bind-mounted as a volume within the container.
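Putting those two points together, the invocation might be sketched as follows; the image name and DAG file are placeholders, and the full set of condor_dagman arguments is simplified here.

```
# Run the DAG in the foreground as the container's main process, with the
# host's HTCondor config bind-mounted read-only so the schedd is reachable.
docker run --rm \
    -v /etc/condor:/etc/condor:ro \
    -v "$PWD/workdir":/work -w /work \
    dag-runner \
    condor_dagman -f -Dag my.dag   # remaining dagman arguments elided

# "docker ps -a" or "docker wait" then exposes the DAG's exit status.
```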

This is a use case we haven't considered, but dagman really works best now when it is a job managed by the schedd.

-greg

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@cs.wisc.edu with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/