We're having a hard time finding a way to stop what appears to be an infinite retry loop when a condor_shadow process runs out of memory (OOM).
e.g.
Here we created a simple test script that allocates more memory than the job requested.
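For anyone who wants to reproduce this, a minimal sketch of such a script could look like the following. This is not the exact script we ran; the function name `allocate`, the 64 MB chunk size, and the command-line handling are assumptions made for illustration only.

```python
#!/usr/bin/env python3
"""Hypothetical reproducer: allocate more memory than the job requested."""
import sys


def allocate(total_mb, chunk_mb=64):
    """Allocate roughly total_mb megabytes in chunk_mb pieces and return them.

    Each chunk is filled with a non-zero byte so the pages are actually
    committed rather than only reserved.
    """
    chunks = []
    allocated = 0
    while allocated < total_mb:
        size = min(chunk_mb, total_mb - allocated)
        chunks.append(b"x" * (size * 1024 * 1024))
        allocated += size
    return chunks


# Only runs when called with a single numeric argument (megabytes to allocate),
# e.g. a value larger than the request_memory of the test job.
if __name__ == "__main__" and len(sys.argv) == 2 and sys.argv[1].isdigit():
    held = allocate(int(sys.argv[1]))
    print(f"allocated {sum(len(c) for c in held) // (1024 * 1024)} MB")
```

Running it as the job executable with an argument above the job's memory request should push the job past its limit.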
The exception is visible in the logs.
MAX_SHADOW_EXCEPTIONS appears to be set to its default of 5:
condor_config_val -name condor-head-01.bo1.dbn.to -pool condor-head-01.bo1.dbn.to -schedd MAX_SHADOW_EXCEPTIONS
5
condor_q -allusers -global -pool condor-head-01.bo1.dbn.to -long 3738908. |egrep "Starts"
NumJobStarts = 0
NumShadowStarts = 6
Additionally, I tried using retry_until, like so, with no luck:
condor_q -allusers -global -pool condor-head-01.bo1.dbn.to -long 3738913. |egrep "Start|Retr|Reason|Shadow|Retry|Until"
..
.
JobMaxRetries = 2
NumJobStarts = 0
NumShadowExceptions = 4
NumShadowStarts = 5
> JobMaxRetries || ExitCode =?= 0 || NumShadowStarts > JobMaxRetries || NumShadowExceptions > JobMaxRetries
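As a sanity check, the clauses of that expression that are visible above can be evaluated by hand against the attribute values from the condor_q output. The sketch below does this in plain Python; it does not use the HTCondor bindings, the first clause of the expression is cut off and is deliberately left out, and it assumes ExitCode is undefined in the job ad (the job never started, and the egrep pattern would not have shown it either way).

```python
# Attribute values copied from the condor_q output above.
job_ad = {
    "JobMaxRetries": 2,
    "NumJobStarts": 0,
    "NumShadowExceptions": 4,
    "NumShadowStarts": 5,
}


def visible_clauses(ad):
    """Evaluate only the clauses of the expression that are visible above."""
    # Assumption: ExitCode is not set in the ad. In ClassAd terms,
    # `undefined =?= 0` is false, which is mirrored here.
    exit_code = ad.get("ExitCode")
    return {
        "ExitCode =?= 0": exit_code is not None and exit_code == 0,
        "NumShadowStarts > JobMaxRetries":
            ad["NumShadowStarts"] > ad["JobMaxRetries"],
        "NumShadowExceptions > JobMaxRetries":
            ad["NumShadowExceptions"] > ad["JobMaxRetries"],
    }


results = visible_clauses(job_ad)
for clause, value in results.items():
    print(f"{clause}: {value}")
```

Both counter comparisons come out true (5 > 2 and 4 > 2), so as far as I can tell the expression as a whole should already be true for this job, yet the shadow keeps being restarted.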
$CondorVersion: 23.0.18 2024-11-18 BuildID: 769617 PackageID: 23.0.18-1 $
Is there anything else that we can try? Was this possibly resolved in a newer version? We scanned some threads and could not find anything relevant.
Thanks for your help,
Tom Madureira