
[HTCondor-users] MAX_SHADOW_EXCEPTIONS



Hi All,
We're having a difficult time finding a way to prevent what appears to be an infinite retry loop when a condor_shadow process hits an out-of-memory (OOM) exception.

For example, we created a simple test script that allocates more memory than the job requested.
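
The test script itself is not included in this message. As a rough illustration only, a script of this kind might look like the Python sketch below. The function name allocate_mib, the chunk size, and the 4 KiB page size are assumptions made for the example, not details of the script we actually ran.

```python
import time


def allocate_mib(total_mib, chunk_mib=16, pause_s=0.0):
    """Allocate about total_mib MiB in chunks and return them so they stay alive.

    Hypothetical helper for reproducing an over-allocation; not the original script.
    """
    chunks = []
    allocated = 0
    while allocated < total_mib:
        size = min(chunk_mib, total_mib - allocated)
        buf = bytearray(size * 1024 * 1024)
        # Write one byte per 4 KiB (assumed page size) so the memory
        # is actually resident, not just reserved.
        for offset in range(0, len(buf), 4096):
            buf[offset] = 1
        chunks.append(buf)
        allocated += size
        if pause_s:
            # Optional pause so memory usage grows gradually.
            time.sleep(pause_s)
    return chunks
```

A job wrapping this could, for instance, call allocate_mib(2048) while its submit file asks for only 1024 MB via request_memory, then sleep long enough for the usage to be noticed.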

The exception appears in the job log:
007 (3738904.000.000) 2024-12-27 17:09:28 Shadow exception!
    Error from slot1_1@xxxxxxxxxxxxxxxxxxxxxxx: Worker node is out of memory

MAX_SHADOW_EXCEPTIONS appears to be set to its default of 5, yet the job has already had 6 shadow starts:
condor_config_val -name condor-head-01.bo1.dbn.to -pool condor-head-01.bo1.dbn.to -schedd MAX_SHADOW_EXCEPTIONS
5
condor_q -allusers -global  -pool condor-head-01.bo1.dbn.to -long 3738908.  |egrep "Starts"
NumJobStarts = 0
NumShadowStarts = 6
Additionally, I tried using retry_until, with no luck:
condor_q -allusers -global  -pool condor-head-01.bo1.dbn.to -long 3738913.  |egrep "Start|Retr|Reason|Shadow|Retry|Until"
..
.
JobMaxRetries = 2
NumJobStarts = 0
NumShadowExceptions = 4
NumShadowStarts = 5
 > JobMaxRetries || ExitCode =?= 0 || NumShadowStarts > JobMaxRetries || NumShadowExceptions > JobMaxRetries

$CondorVersion: 23.0.18 2024-11-18 BuildID: 769617 PackageID: 23.0.18-1 $

Is there anything else that we can try? Was this possibly resolved in a newer version? We scanned some threads and could not find anything relevant.

Thanks for your help,
Tom Madureira