
Re: [HTCondor-users] MAX_SHADOW_EXCEPTIONS



Hi Greg,
I just saw the release post and briefly read through the release notes. It does look like we should test the behavior that was implemented, as it is likely suitable.

To directly answer your question though, I suppose we'd expect behavior similar to a job that ran and returned a nonzero exit code: it would eventually exhaust max_retries.

Thanks,
Tom

On Mon, Jan 6, 2025 at 10:25 AM Greg Thain via HTCondor-users <htcondor-users@xxxxxxxxxxx> wrote:
On 12/30/24 15:29, Thomas Madureira wrote:
Hi All,
We're having a difficult time finding a way to prevent what appears to be an infinite retry loop when a condor_shadow process runs OOM.

e.g.
Here we created a simple test script that will allocate memory > requested memory
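
The test script itself was not included in the message. A minimal sketch of such a script, assuming Python and a hypothetical helper named `allocate_mb` (neither is from the original post), might look like this:

```python
def allocate_mb(total_mb, chunk_mb=1):
    """Allocate roughly total_mb MiB and return the chunks so they stay alive.

    Hypothetical helper for illustration only; not the script from the post.
    """
    chunks = []
    for _ in range(0, total_mb, chunk_mb):
        # Build each chunk from a nonzero byte so the memory is actually written,
        # which makes it count toward the process's resident memory.
        chunks.append(b"\x01" * (chunk_mb * 1024 * 1024))
    return chunks
```

To reproduce the situation described here, a job would call this with a value larger than its memory request, for example `held = allocate_mb(2 * request_mb)` where `request_mb` is assumed to match the job's requested memory in MiB, and keep `held` referenced until the job exits.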

The exception shows up in the logs:
007 (3738904.000.000) 2024-12-27 17:09:28 Shadow exception!
    Error from slot1_1@xxxxxxxxxxxxxxxxxxxxxxx: Worker node is out of memory


Hi Thomas:

There have been several fixes in this area in 23.0.19, but what do you want to happen in this case? Should the job be put on hold, so the user must intervene before trying again?

-greg
