Re: [HTCondor-users] condor_rm & the docker universe
- Date: Thu, 30 Jul 2015 20:35:53 +0000
- From: andrew.lahiff@xxxxxxxxxx
- Subject: Re: [HTCondor-users] condor_rm & the docker universe
Hi Todd,
It doesn't seem to me that HTCondor actually does the "docker stop" after 10 minutes. Here is an example where after 10 minutes, the job has been stopped according to HTCondor (*):
[root@vm168 ~]# condor_history 136.0
ID OWNER SUBMITTED RUN_TIME ST COMPLETED CMD
136.0 alahiff 7/30 21:07 0+00:10:46 X ??? ./wrapper.sh
but the container is still running:
[root@vm168 ~]# docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
6a37981789b2 centos:6 "./wrapper.sh" 12 minutes ago Up 12 minutes HTCJob136_0_slot1_2_PID32265
With "job_max_vacate_time = 2" the same thing happens but much quicker.
So at least for me the container is allowed to run forever if it wants, without HTCondor's knowledge.
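A check like the one I did by hand above could be scripted. This is only a rough sketch: the container name pattern is inferred from the single name in the docker ps output above (HTCJob136_0_slot1_2_PID32265), and the helper names are made up for illustration, not part of HTCondor or Docker.

```python
import re
import subprocess

# ASSUMPTION: the name layout "HTCJob<cluster>_<proc>_..." is inferred from
# the one container name shown in this thread, not from HTCondor documentation.
_NAME_RE = re.compile(r"^HTCJob(\d+)_(\d+)_")


def job_id_from_container_name(name):
    """Return 'cluster.proc' for an HTCondor-style container name, else None."""
    match = _NAME_RE.match(name)
    return f"{match.group(1)}.{match.group(2)}" if match else None


def lingering_containers(container_names, finished_job_ids):
    """Return the container names whose job ID appears in finished_job_ids."""
    finished = set(finished_job_ids)
    return [
        name for name in container_names
        if job_id_from_container_name(name) in finished
    ]


def running_container_names():
    """List the names of running containers by calling `docker ps`."""
    result = subprocess.run(
        ["docker", "ps", "--format", "{{.Names}}"],
        check=True, capture_output=True, text=True,
    )
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]
```

Calling lingering_containers(running_container_names(), ["136.0"]) on the execute node would then list any container that is still up for a job HTCondor already considers removed.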
Thanks,
Andrew.
(*)
000 (136.000.000) 07/30 21:07:37 Job submitted from host: <x.y.z.t:47771?addrs=x.y.z.t-47771>
...
001 (136.000.000) 07/30 21:07:38 Job executing on host: <x.y.z.t:60021?addrs=x.y.z.t-60021>
...
004 (136.000.000) 07/30 21:18:23 Job was evicted.
(0) Job was not checkpointed.
Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
0 - Run Bytes Sent By Job
29 - Run Bytes Received By Job
Partitionable Resources : Usage Request Allocated
Cpus : 1 1
Disk (KB) : 9 2 1890179
Memory (MB) : 1 1
...
009 (136.000.000) 07/30 21:18:23 Job was aborted by the user.
via condor_rm (by user alahiff)
...
________________________________________
From: HTCondor-users [htcondor-users-bounces@xxxxxxxxxxx] on behalf of Todd Tannenbaum [tannenba@xxxxxxxxxxx]
Sent: Thursday, July 30, 2015 8:45 PM
To: HTCondor-Users Mail List
Subject: Re: [HTCondor-users] condor_rm & the docker universe
On 7/30/2015 2:31 PM, Brian Bockelman wrote:
>
>> On Jul 30, 2015, at 11:40 AM, Dimitri Maziuk <dmaziuk@xxxxxxxxxxxxx> wrote:
>>
>> On 07/30/2015 10:01 AM, andrew.lahiff@xxxxxxxxxx wrote:
>>> Hi Greg,
>>>
>>> Ok, I didn't realize it worked like this - I had assumed HTCondor
>>> would do something like "docker stop", rather than send a signal to the
>>> actual executable running inside the container. Isn't this rather
>>> unsafe? It makes it very easy for people to run jobs which escape
>>> HTCondor's control - according to HTCondor the job has been killed but
>>> the Docker container continues running for as long as it wants.
>>
Greg can correct me if I am wrong, but I believe the signal sending is
only to give the job a chance to "gracefully" shut down (vacate). After
HTCondor sends the signals, it sets a timer to follow up with a docker
stop. Thus nothing is allowed to continue running forever. See the
manual for MachineMaxVacateTime and JobMaxVacateTime - I think the
default on these is 10 minutes. So to achieve today what you stated
above, I think you could submit your docker universe job with something like
job_max_vacate_time = 2
and then HTCondor should do a docker stop two seconds after sending the
signal if the instance is still lingering. I think Greg is thinking
about changing the default JobMaxVacateTime to be much smaller for
docker universe than the default of 10 minutes...
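
The general pattern described above (signal first, then a hard stop once a timer expires) might be sketched like this. It is an illustration only, not HTCondor's actual code: it uses an ordinary child process in place of a container, SIGKILL in place of the docker stop follow-up, and the function names are invented for the example. It assumes a POSIX system.

```python
import signal
import subprocess
import sys


def stop_with_grace(proc, grace_seconds):
    """Send SIGTERM, wait up to grace_seconds, then force-kill if needed.

    Returns "graceful" if the process exited on its own, "forced" otherwise.
    """
    proc.send_signal(signal.SIGTERM)
    try:
        proc.wait(timeout=grace_seconds)
        return "graceful"
    except subprocess.TimeoutExpired:
        # The grace period (the vacate time, in this analogy) has run out.
        proc.kill()
        proc.wait()
        return "forced"


def spawn_stubborn_child():
    """Start a child that ignores SIGTERM, standing in for a job that will not vacate."""
    code = (
        "import signal, time; "
        "signal.signal(signal.SIGTERM, signal.SIG_IGN); "
        "print('ready', flush=True); time.sleep(60)"
    )
    child = subprocess.Popen([sys.executable, "-c", code], stdout=subprocess.PIPE)
    child.stdout.readline()  # block until the child has installed its handler
    return child
```

With a child from spawn_stubborn_child(), stop_with_grace(child, 2) falls through to the forced path after two seconds, which is the behaviour one would expect from job_max_vacate_time = 2 if the follow-up stop is really issued.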
regards
Todd