Re: [HTCondor-users] HTCondor and Docker
- Date: Fri, 10 Apr 2015 18:41:20 +0100
- From: Brian Candler <b.candler@xxxxxxxxx>
- Subject: Re: [HTCondor-users] HTCondor and Docker
On 07/04/2015 16:21, Greg Thain wrote:
> On 04/07/2015 10:02 AM, Brian Candler wrote:
>> There are three different things I'm thinking of.
>>
>> (1) Running an HTCondor worker node as a Docker container.
>>
>> This should be straightforward. All the jobs would run within the
>> same container and therefore have an enforced limit on total resource
>> usage.
>>
>> This would be a quick way to add HTCondor execution capability to an
>> existing Docker-aware server, just by
>>     "docker run -d htcondor-worker"
>> or somesuch.
> We've looked at this, and it is a bit more work than you might think,
> for the htcondor-worker would need to be configured to point to the
> central manager and be compatible with the rest of the pool.
>
> Generally, docker containers run within NATs, and worker nodes need
> inbound connections, so CCB needs to be set up on the central manager
> as well. You might want to volume-mount the execute directory;
> otherwise docker has a 10 GB limit on container growth out of the box,
> though that limit can be increased.
>
> Also, depending on your security posture, you probably don't want to
> run the worker node as root within the container, which may or may not
> be a problem for your HTCondor usage.
Well, on a normal system condor_master runs as root and drops privileges
to the submitting user when running jobs. Under docker, it would probably
make more sense to run all jobs as a dedicated condor user, which I know
condor can be configured to do.

Re configuration: I guess this could be provided at container start
time, but in practice I'd be quite happy to build my own Dockerfile
which layers on top of a base htcondor container. That is, the
Dockerfile would add a customised condor_config[.local].
Re networking: I hadn't considered that, but CCB looks like a good solution.
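
For illustration, the sort of layering I have in mind (just a sketch: the
base image name and the config values below are made up, not anything that
exists today):

    # Dockerfile: add pool-specific configuration to a base worker image
    FROM htcondor/execute-node            # hypothetical base image name
    COPY condor_config.local /etc/condor/condor_config.local

with a condor_config.local along the lines of

    # point the worker at the pool's central manager
    CONDOR_HOST = cm.example.org
    # the container sits behind NAT, so connect out via CCB on the collector
    CCB_ADDRESS = $(COLLECTOR_HOST)
    # keep job sandboxes on a mounted volume rather than in the container
    EXECUTE = /scratch/execute

and then something like

    docker run -d -v /scratch/execute:/scratch/execute htcondor-worker

to start it.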
>> (3) Docker containers on the submit host
>>
>> A docker container would be a convenient abstraction to use on the
>> submission host. Normally when you start an HTCondor DAG you need to
>> create an empty working directory, run a script to create the DAG
>> and/or SUB files, run condor_submit_dag, monitor progress to wait for
>> completion, check the exit status to see if all DAG nodes completed
>> successfully, fix/restart if necessary, then tidy up the work directory.
>>
>> Docker on the submission host could handle this lifecycle: the
>> container would be the work directory, it would run the scripts you
>> want, submit the DAG and be visible as a running container until it
>> has completed, and the container itself would have an exit status
>> showing whether the DAG completed successfully or not, under
>> "docker ps".
>> https://docs.docker.com/reference/commandline/cli/#filtering_2
>>
>> When you are finished with the results you would destroy the
>> container.
>>
>> This one might be a bit tricky to implement, as I don't see any way
>> to have condor_submit_dag or condor_submit run in the foreground. I
>> think it would be necessary to run "condor_dagman -f" directly as the
>> process within the container.
>>
>> The container also needs to communicate with the condor schedd, and
>> I'm not sure if it needs access to bits of the filesystem as well
>> (e.g. condor_config). If necessary, /etc/condor/ can be
>> bind-mounted as a volume within the container.
> This is a use case we haven't considered, but dagman really works best
> now when it is a job managed by the schedd.
I understand that's how dagman is designed to run, in the scheduler
universe. This means the user needs to poll condor_q, the dagman log, the
jobstate.log, or the node.status file to work out when the job has
finished and whether it was successful - or add a FINAL node.
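
Part of the attraction here is that the container's exit status would carry
exactly that information. Very roughly, and with made-up names (the
condor_dagman arguments are abbreviated - in practice they would be
whatever condor_submit_dag writes into the generated *.condor.sub file):

    # run dagman in the foreground as the container's main process,
    # sharing the host's condor config and the DAG's working directory
    docker run -d --name my-workflow \
        -v /etc/condor:/etc/condor:ro \
        -v /var/spool/htcondor/current/<uuid>:/work -w /work \
        htcondor-submit-tools \
        condor_dagman -f -Dag workflow.dag -Lockfile workflow.dag.lock

    # wait for it to finish; the exit code tells you whether the DAG succeeded
    docker wait my-workflow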
Anyway, I only mention this use case because I have had to start
wrapping condor for use in automated batch jobs triggered by other
systems. This includes:
1. creating a working directory (I'm using
/var/spool/htcondor/current/<uuid>)
2. running a script to create the DAG, using parameters from the request
3. submitting the DAG
4. polling the status
5. sending a response when the DAG completes successfully or fails
(right now I'm adding an empty FINAL node with a POST script for this)
6. resubmitting the DAG if a retry is required
7. removing the working directory when it is no longer needed
- and it's just starting to look very much like a Docker container
lifecycle!
1-3 = docker run
4-5 = docker ps
6 = docker start
7 = docker rm
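
i.e. something along these lines, assuming a (hypothetical) image whose
entrypoint generates the DAG and runs condor_dagman in the foreground:

    docker run -d --name dag-<uuid> dag-runner    # 1-3: workdir, DAG, submit
    docker ps -a --filter exited=0                # 4-5: which DAGs finished OK
    docker start dag-<uuid>                       # 6: retry
    docker rm dag-<uuid>                          # 7: tidy up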
Hence a docker_scheduler universe would be attractive to me.
Regards,
Brian.