Hi All,
We're having a difficult time finding a way to prevent what appears to be an infinite retry loop when a condor_shadow process runs OOM.
For example, we created a simple test script that allocates more memory than the job requested.
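A script of this kind can be as small as the following Python sketch. It is an illustration only, not the script from the test above; the function name allocate_mib and the 64 MiB chunk size are hypothetical choices.

```python
def allocate_mib(total_mib, chunk_mib=64):
    """Allocate roughly total_mib MiB in chunks and keep every chunk alive.

    Each chunk is filled with non-zero bytes so the pages are actually
    written, rather than only reserved as untouched zero pages.
    """
    chunks = []
    allocated = 0
    while allocated < total_mib:
        size = min(chunk_mib, total_mib - allocated)
        # b"\x01" * n builds a bytes object of n non-zero bytes.
        chunks.append(b"\x01" * (size * 1024 * 1024))
        allocated += size
    return chunks
```

Calling allocate_mib with a value larger than the job's request_memory (for instance 2048 in a job that requests 1024 MB) and holding on to the returned list should push the job over its request.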
The exception appears in the logs:
007 (3738904.000.000) 2024-12-27 17:09:28 Shadow exception!
Error from slot1_1@xxxxxxxxxxxxxxxxxxxxxxx: Worker node is out of memory
Hi Thomas:
There have been several fixes in this area in 23.0.19, but what do you want to happen in this case? Should the job be put on hold, so that the user must intervene before it is tried again?
-greg