
Re: [HTCondor-users] MAX_SHADOW_EXCEPTIONS

On 12/30/24 15:29, Thomas Madureira wrote:
Hi All,
We're having a difficult time finding a way to prevent what appears to be an infinite retry loop when a condor_shadow process runs OOM.

For example, we created a simple test script that allocates more memory than the job requested.

The exception appears in the job log:
007 (3738904.000.000) 2024-12-27 17:09:28 Shadow exception!
        Error from slot1_1@xxxxxxxxxxxxxxxxxxxxxxx: Worker node is out of memory


Hi Thomas:

There have been several fixes in this area in 23.0.19, but what do you want to happen in this case? Should the job be put on hold, so that the user must intervene before it is tried again?
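If holding the job is the goal, one possible approach is a periodic_hold expression in the job's submit description. This is a sketch under assumptions, not a confirmed fix from this thread. It assumes the job ClassAd attribute NumShadowExceptions is populated in your HTCondor version (NumShadowStarts could be used similarly). The threshold of 3 and the reason text are arbitrary placeholders.

```
# Sketch only: add these lines to the existing submit description, before "queue".
# Assumes NumShadowExceptions is set in the job ClassAd; the threshold 3 is arbitrary.
# While the attribute is undefined, the expression is not true, so the job is not held.
periodic_hold        = NumShadowExceptions >= 3
periodic_hold_reason = "Held after repeated shadow exceptions (possible out-of-memory)"
```

With this in place, the job should stop being rescheduled once the count is reached and sit in the held state until someone runs condor_release or removes it.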

-greg