Re: [HTCondor-users] Capturing the signal from worker nodes when job breaches memory
- Date: Fri, 6 Oct 2023 17:17:12 -0500
- From: Greg Thain <gthain@xxxxxxxxxxx>
- Subject: Re: [HTCondor-users] Capturing the signal from worker nodes when job breaches memory
On 10/5/23 10:03, Vikrant Aggarwal wrote:
Hello Experts,
We want to capture the signal to copy some logs before the scratch
directory disappears after the job goes into hold status because of
memory breach but we are unsuccessful to do it. Do we have any way to
achieve this? We thought it was probably a job wrapper which is doing
exec to run actual condor jobs not allowing us to capture the signal
but that's not the case.
The Linux out-of-memory killer uses signal 9 (SIGKILL), which is
uncatchable. You could write a startd policy which evicts jobs when their
MemoryUsage reaches some percentage of the total, and if the job has
when_to_transfer_output = ON_EXIT_OR_EVICT
then the scratch directory would get copied back to the spool on the AP.
-greg
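
To illustrate why a handler or wrapper in the job cannot intercept the out-of-memory kill, here is a minimal Python sketch. It is not part of the original reply. The helper name can_install_handler is hypothetical, and the sketch assumes a Linux (POSIX) host and that it runs in the main thread of the interpreter. It only checks whether the operating system will accept a handler for a given signal: SIGTERM is accepted, signal 9 is refused.

```python
import signal


def can_install_handler(signum):
    """Return True if the OS lets this process register a handler for signum.

    Hypothetical helper for illustration only; must be called from the
    main thread, because Python only allows signal.signal() there.
    """
    try:
        # Try to register a do-nothing handler for the signal.
        previous = signal.signal(signum, lambda received_signum, frame: None)
    except (OSError, ValueError):
        # The kernel (or Python) refused: the signal cannot be caught.
        return False
    # Put back whatever handler was installed before the probe.
    if previous is not None:
        signal.signal(signum, previous)
    return True


if __name__ == "__main__":
    for name in ("SIGTERM", "SIGKILL"):
        signum = getattr(signal, name)
        print(name, "catchable:", can_install_handler(signum))
```

Because signal 9 can never be handled, any log-saving logic has to run before the kernel steps in, which is why an eviction policy that acts at a lower memory threshold is the suggested route.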