
Re: [HTCondor-users] Unexpected hold on jobs



Hi Steve,

 

Thank you for this information; that does track with the behaviour we're seeing. I'm still perplexed as to why it affects only a small percentage of jobs: we regularly have jobs that exceed their requests, and we use a PeriodicRemove expression with a scaled JobMemoryLimit ClassAd attribute to give jobs an upper memory ceiling (our current config allows jobs to use 3x their requested memory before being killed), yet only a small number are being killed by Docker in this way.
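
For context, the ceiling is set per job at submit time, roughly along these lines (a sketch only; JobMemoryLimit is a custom attribute of ours and the exact expression lives in our submit wrappers):

# 3x the requested memory, converted to KiB so it compares directly with ResidentSetSize
+JobMemoryLimit = 3 * 1024 * RequestMemory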

 

May I ask whether there is any configuration in Condor or Docker to enable or disable this functionality? It's likely that with some tweaking we can use Docker's memory limits to our advantage, but we never knew they were in effect!

 

Many thanks again,

 

Tom

 

From: Steven C Timm <timm@xxxxxxxx>
Date: Wednesday, 21 June 2023 at 13:18
To: condor-users@xxxxxxxxxxx <condor-users@xxxxxxxxxxx>, HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Cc: Birkett, Thomas (STFC,RAL,SC) <thomas.birkett@xxxxxxxxxx>
Subject: Re: Unexpected hold on jobs

The hold in question is coming from Docker itself. It appears that you are running all your worker-node jobs with WANT_Docker (as we do here at Fermilab), i.e. inside a Docker container whose memory limit is by default set equal to, or slightly greater than, RequestMemory.

The hold comes because Docker detects that you've gone over the memory limit and terminates the container.
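
If you want to confirm what limit a job's container actually received, you can ask Docker on the execute node; something along these lines (the container name is a placeholder, and the value comes back in bytes, with 0 meaning unlimited):

# run on the worker node; <container_id> is whatever condor named the job's container
docker inspect --format '{{.HostConfig.Memory}}' <container_id>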

 

At Fermilab we view this as a feature, because Docker is much more prompt about clipping off jobs that run over on memory than the condor_schedd would be: that metric typically lags by several minutes, and if you have a bad memory leak a job can take over the whole machine in that time.

 

Steve Timm

 

 


From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Thomas Birkett - STFC UKRI via HTCondor-users <htcondor-users@xxxxxxxxxxx>
Sent: Wednesday, June 21, 2023 5:41 AM
To: condor-users@xxxxxxxxxxx <condor-users@xxxxxxxxxxx>
Cc: Thomas Birkett - STFC UKRI <thomas.birkett@xxxxxxxxxx>
Subject: [HTCondor-users] Unexpected hold on jobs

 

Hi Condor Community,

 

I have an odd issue with a small percentage of the jobs we run: a small subset of them go on hold due to resources being exceeded, for example:

 

LastHoldReason = "Error from slot1_38@xxxxxxxxxxxxxxxxxxxxxxx: Docker job has gone over memory limit of 4100 Mb"

 

However, we haven't configured any resource limits that would put jobs on hold. I also notice that the only ClassAd attribute that appears to match the memory limit is:

 

MemoryProvisioned = 4100

 

These jobs are then removed by a SYSTEM_PERIODIC_REMOVE expression that clears down held jobs. My question to the community is: why is the job going on hold in the first place? The only removal limit / PeriodicRemove expression we configure is at the per-job level, shown below:

 

PeriodicRemove = (JobStatus == 1 && NumJobStarts > 0) || ((ResidentSetSize =!= undefined ? ResidentSetSize : 0) > JobMemoryLimit)

 

I cannot replicate this behaviour in my testing, and I cannot find any reason why the job went on hold.
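
For completeness, this is roughly how I've been pulling the relevant attributes off the held jobs (a sketch; the attribute list is abbreviated):

# held jobs only (JobStatus == 5), with the memory-related attributes side by side
condor_q -constraint 'JobStatus == 5' -af:h ClusterId ProcId RequestMemory MemoryProvisioned MemoryUsage LastHoldReason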

 

Researching the relevant classads, I see:

 

MemoryProvisioned

The amount of memory in MiB allocated to the job. With statically-allocated slots, it is the amount of memory space allocated to the slot. With dynamically-allocated slots, it is based upon the job attribute RequestMemory, but may be larger due to the minimum given to a dynamic slot.

At our site we dynamically assign our slots, and the memory request for this job is “RequestMemory = 4096”. I find this even more perplexing as it is a very rare issue: over 90% of the jobs work well and complete, with the same job type, same VO and same config. Any assistance debugging this issue will be gratefully received.
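
If it helps, my working assumption for the 4096 → 4100 gap is some rounding on the startd side; for example, if memory requests are quantized in the startd config (which I haven't verified for our pool), a 100 MB quantum would produce exactly this value:

# purely illustrative - quantize() rounds the request up to the next multiple of the quantum,
# so a 4096 MB request would be provisioned as 4100 MB
MODIFY_REQUEST_EXPR_REQUESTMEMORY = quantize(RequestMemory, {100})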

 

Many thanks,

 

Thomas Birkett

Senior Systems Administrator

Scientific Computing Department  

Science and Technology Facilities Council (STFC)

Rutherford Appleton Laboratory, Chilton, Didcot 
OX11 0QX

 
