Re: [HTCondor-users] Capturing the signal from worker nodes when job breaches memory


Hello Greg and Gerard,

Thanks for the suggestions.

Unfortunately, the policy works on virtual memory (similar to ulimit). The cgroup setting works on the RSS memory of the process/job, which is the behavior we want.

We don't want to take action on jobs based on virtual memory. I looked around but couldn't find any way other than cgroups to put a limit on RSS. Unfortunately, the cgroup setting brings back the issue with copying stdout/stderr.

Any other suggestions?

Thanks & Regards,

Vikrant Aggarwal

On Fri, Oct 6, 2023 at 6:19 PM Greg Thain via HTCondor-users <htcondor-users@xxxxxxxxxxx> wrote:

On 10/5/23 10:03, Vikrant Aggarwal wrote:
> Hello Experts,
>
> We want to capture the signal to copy some logs before the scratch
> directory disappears after the job goes into hold status because of
> memory breach but we are unsuccessful to do it. Do we have any way to
> achieve this? We thought it was probably a job wrapper which is doing
> exec to run actual condor jobs not allowing us to capture the signal
> but that's not the case.

The Linux out-of-memory killer sends signal 9 (SIGKILL), which is uncatchable. You
could write a startd policy which evicts jobs when their MemoryUsage is
some percentage of the total, and if the job has

when_to_transfer_output = ON_EXIT_OR_EVICT

then the scratch directory would get copied back to the spool on the AP
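
A rough sketch of what such a startd policy might look like on the execute nodes is given here. It is untested and rests on assumptions: the macro name MEMORY_NEAR_LIMIT is made up for illustration, the 0.9 threshold is an arbitrary choice, and the existing PREEMPT expression is assumed to be something you want to keep. MemoryUsage (job) and Memory (slot) are both in MB.

```
# Hypothetical helper macro: true once the job has used more than 90%
# of the slot's memory. The 0.9 factor is an assumed value; tune it.
MEMORY_NEAR_LIMIT = (MemoryUsage =!= UNDEFINED && MemoryUsage > 0.9 * Memory)

# Evict the job when the condition holds, keeping any existing PREEMPT policy.
PREEMPT = ($(PREEMPT)) || ($(MEMORY_NEAR_LIMIT))
```

The idea is to evict the job before the kernel gets a chance to kill it with signal 9, so that the eviction path, and with it the ON_EXIT_OR_EVICT output transfer, can still run.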

-greg

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/
