condor_ssh_to_job showed the python executables in the
interruptible wait state (per ps) at about 14 hours of runtime.
Now, at 20 hours, trying to ssh to the job puts it on hold with
the message: "Job has gone over cgroup memory limit of 4096
megabytes. Peak usage: 4097 megabytes. Consider resubmitting
with a higher request_memory."
Note that condor_ssh_to_job puts the
interactive job in the same cgroup as the job proper, so if the
job is close to hitting the OOM killer and running out of memory,
ssh'ing to the job might nudge it over the limit. Feels like
that's what is going on here.
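
If it really is the memory limit, the simplest way out is to
resubmit with a larger request_memory. A minimal sketch of the
submit-file change (the 6 GB figure is just an example,
comfortably above the reported 4097 MB peak):

    # in the submit file, before the queue statement
    request_memory = 6 GB    # the submit language understands unit suffixes
    queue
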
Problems:
- How can we deal with being stuck in the run state? We need to
  figure out why the jobs aren't making progress. Could they be
  blocked on I/O that's never coming back? (A diagnostic sketch
  follows this list.)
- When I try to qedit the RequestMemory classad, values like 6G,
  6GB, 6000M or 6000MB are accepted, but -better-analyze says:
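
For the first problem, one way to see whether the python
processes really are blocked on I/O is to ask the kernel what
they are sleeping on. A hedged sketch using standard Linux tools,
run on the execute machine (or via condor_ssh_to_job, if you can
get in without nudging the job over its memory limit); <PID> is
whatever ps reports for the python process:

    ps -o pid,stat,wchan:32,cmd -p <PID>   # STAT "D" = uninterruptible I/O wait
    cat /proc/<PID>/wchan                  # kernel function the process sleeps in
    ls -l /proc/<PID>/fd                   # open files/sockets it may be waiting on
    strace -p <PID> -e trace=read,write,poll,select   # watch the blocking syscall
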
Sadly, condor_qedit only knows about the classad language; it
doesn't know about the submit language, and the unit suffixes
only exist in the submit language.
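
So in the classad language RequestMemory is just an integer
number of megabytes, and a plain value should go through; a
sketch, with a made-up job id:

    condor_qedit 1234.0 RequestMemory 6000     # 6000 MB, no unit suffix here
    condor_q 1234.0 -af RequestMemory          # check the new value
    condor_release 1234.0                      # let the held job try again
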
-greg