
Re: [HTCondor-users] Issues with cgroup memory limits, qedit memory units and expiring scitokens



On 4/2/24 10:37, Joseph Areeda wrote:
condor_ssh_to_job showed the python executables in the [ps] interruptible wait state at about 14 hr of runtime. Now, at 20 hr, trying to ssh to the job puts it on hold with the message:

Job has gone over cgroup memory limit of 4096 megabytes. Peak usage: 4097 megabytes.  Consider resubmitting with a higher request_memory.


Note that condor_ssh_to_job puts the interactive session in the same cgroup as the job proper, so if the job is already close to triggering the OOM killer, ssh'ing to the job might nudge it over the limit.  Feels like that's what is going on here.
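If you want to gauge how close the job is to the limit before poking at it much, one rough sketch (assuming cgroup v2; on cgroup v1 the equivalent files are memory.usage_in_bytes and memory.max_usage_in_bytes, and the exact path visible from inside the job can vary with how the execute node delegates cgroups) is to read the job's own memory counters from a condor_ssh_to_job shell. Keep in mind the ssh session itself adds a little to the same cgroup:

    # find the job's cgroup path (relative to /sys/fs/cgroup)
    cat /proc/self/cgroup
    # current and peak usage in bytes (cgroup v2 file names)
    cat /sys/fs/cgroup/<path-from-above>/memory.current
    cat /sys/fs/cgroup/<path-from-above>/memory.peak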

Problems:

We need to figure out why the jobs aren't making progress.  Could they be blocked on I/O that's never coming back?
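A quick way to check that (from condor_ssh_to_job, or as root on the execute node; the PID here is a placeholder) is to look at what the processes are actually waiting on:

    # STAT: S = interruptible sleep, D = uninterruptible (often stuck I/O)
    ps -o pid,stat,wchan:30,cmd -p <pid>
    # the kernel function the process is currently sleeping in
    cat /proc/<pid>/wchan
    # attach and watch for any system-call activity (may need ptrace permission)
    strace -f -p <pid>
    # if py-spy happens to be installed, it will show the Python-level stack
    py-spy dump --pid <pid>

If strace shows nothing at all and the wchan points at a filesystem or network wait, that would support the "blocked on I/O that's never coming back" theory.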



Sadly, condor_qedit only knows about the ClassAd language, not the submit language, and the unit suffixes only exist in the submit language.
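So the value has to be given as a bare ClassAd integer in megabytes rather than something like "8 GB". A sketch (the job id is made up):

    # RequestMemory is in MB at the ClassAd level, so 8 GB becomes 8192
    condor_qedit 1234.0 RequestMemory 8192
    # verify the new value
    condor_q 1234.0 -af RequestMemory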


-greg