We are using a hybrid of the three scenarios (closest to #2), i.e. we have Docker installed on our HTCondor worker nodes and allow submission of jobs that include the docker pull and "run" commands in the submit file. Since we construct the submit files on behalf of the user from a web app, our usage pattern is a bit different from direct submission by a user.
1. Our sanctioned Docker image does the data staging by default (we use iRODS to transfer large files back and forth from our repository on behalf of a user).
2. Once the data is staged on the local disk of the node, we mount it as a volume in the user-provided Docker image (after the pull) and execute the tasks, which write back to the same mounted volume. If the image (application) is not sanctioned/vetted, or the user is not a permitted user, we restrict network access, i.e. no inbound or outbound traffic.
3. Finally, on completion, the Docker container for sending data back is executed (same as step 1).
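The three steps above can be sketched as the sequence of docker command lines our web app would place in a job. This is only a rough illustration, not our actual generator: the staging image name `example/irods-staging`, its `stage-in`/`stage-out` arguments, the `/data` mount point and the function name are all made up for the example, and `--network none` is assumed as the way to cut off traffic for unvetted images.

```python
def build_job_commands(user_image, workdir, task_cmd, trusted=False,
                       staging_image="example/irods-staging"):
    """Return the docker command lines for one job, in execution order.

    staging_image and its "stage-in"/"stage-out" arguments are hypothetical
    placeholders for the sanctioned iRODS staging container.
    """
    mount = f"{workdir}:/data"  # host scratch dir shared by all three steps

    # Step 1: the sanctioned image stages input data onto local disk.
    stage_in = ["docker", "run", "--rm", "-v", mount, staging_image, "stage-in"]

    # Step 2: pull the user-provided image and run the task on the volume.
    pull = ["docker", "pull", user_image]
    run = ["docker", "run", "--rm", "-v", mount]
    if not trusted:
        # Unvetted image or user: no inbound or outbound network traffic.
        run += ["--network", "none"]
    run += [user_image] + list(task_cmd)

    # Step 3: the sanctioned image sends the results back.
    stage_out = ["docker", "run", "--rm", "-v", mount, staging_image, "stage-out"]

    return [stage_in, pull, run, stage_out]


if __name__ == "__main__":
    for cmd in build_job_commands("user/app:1.0", "/scratch/job42", ["run.sh"]):
        print(" ".join(cmd))
```

Returning argument lists rather than one shell string keeps user-supplied image names and commands from being interpreted by a shell, which matters when the submit file is built on the user's behalf.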
We would like to make efficient use of Docker images already available on a node, so that we are not waiting for a download every time a pull occurs, and instead have an easy way to identify which worker has the image and place the job on it (extend the ClassAd?).
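One way this might look, purely as a sketch: each worker advertises a comma-separated list of its cached images in a custom machine attribute, and the generated submit file ranks machines that already hold the image higher. The attribute name `CachedDockerImages` and both helper functions are hypothetical; HTCondor does not publish such an attribute by default, so something on the worker would have to populate it. `stringListMember` and `ifThenElse` are existing ClassAd functions.

```python
def machine_attr_line(images, attr="CachedDockerImages"):
    """Config line a worker could publish; `attr` is a hypothetical custom attribute."""
    return f'{attr} = "{",".join(sorted(images))}"'


def image_rank_expression(image, attr="CachedDockerImages"):
    """Rank expression for a submit file that prefers workers caching `image`.

    The `=?= true` comparison makes the expression evaluate to 0 (rather than
    undefined or error) on machines that do not advertise the attribute.
    """
    if '"' in image:
        raise ValueError("image name must not contain double quotes")
    return (f'ifThenElse(stringListMember("{image}", TARGET.{attr}) '
            f'=?= true, 1, 0)')


if __name__ == "__main__":
    print(machine_attr_line(["user/app:1.0", "example/irods-staging:latest"]))
    print("rank = " + image_rank_expression("user/app:1.0"))
```

Using rank instead of requirements means the job still runs on a worker without the image when no better match exists; it just pays for the pull.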
Likewise, when 50+ workers request the same image from a Docker repository/registry (private or Docker Hub), we would like to make use of locality and hit local copies instead of hitting the registry 50+ times. This is not purely an HTCondor issue but more of a caching, CDN, load-balancing and Docker image lifecycle management issue, and it is a scalability concern as we get more users running large numbers of jobs with Docker images that are a few GB in size.
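For the Docker Hub case, one option we are aware of is a local pull-through mirror that every worker's Docker daemon is pointed at via the `registry-mirrors` setting in its daemon configuration file. The snippet below is a minimal sketch that generates such a config; the mirror URL and function name are hypothetical, and as far as we understand this setting only applies to Docker Hub images, so a private registry would need a different approach.

```python
import json


def daemon_config(mirror_urls):
    """Render a Docker daemon config that routes Docker Hub pulls via mirrors."""
    return json.dumps({"registry-mirrors": list(mirror_urls)}, indent=2)


if __name__ == "__main__":
    # Hypothetical on-site mirror; each worker would get the same file.
    print(daemon_config(["https://mirror.example.org:5000"]))
```

With such a mirror, the first worker's pull populates the cache and the remaining workers fetch the layers over the local network instead of from the upstream registry.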
We currently have this set up in our QA/test environment with internal users for iPlant, with the intention of a public release later this summer. I would welcome pointers and suggestions on security, scalability and effective use of HTCondor with Docker. Our campus HPC group at the Univ of Arizona is also looking to allow approved users (added to the docker group) to run Docker jobs (we use PBS) and would benefit from feedback on how best to configure this for multi-user environments.
Regards,
Nirav