
[HTCondor-users] JOB_EXIT_HOOK is putting jobs on Hold



Hello all,
I have been using HTCondor version 10.2 on Ubuntu Linux 22.04.2 LTS for
about 6 months. I am using the vanilla universe, with multiple submitters and
dynamic partitioning (partitionable slots) on the execute nodes.
Recently I tried to deploy a job exit hook (<Keyword>_HOOK_JOB_EXIT) for all
jobs that are evicted, complete, or otherwise exit on an execute node, in order
to clean up the disk on that node. Each job writes a logsPath file in its
scratch directory (e.g. /var/lib/condor/execute/dir_<num>/logsPath.sh), and the
exit hook reads this file and performs the cleanup (on the root disk, but not
on the Condor directories).
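For reference, the hook is wired up on the execute nodes roughly like this,
using the standard starter hook knobs (simplified; CLEANUP is just an
illustrative keyword and the script path is a placeholder):

    # condor_config on the execute nodes (simplified sketch)
    STARTER_DEFAULT_JOB_HOOK_KEYWORD = CLEANUP
    CLEANUP_HOOK_JOB_EXIT = /opt/condor/hooks/cleanup_exit.sh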
After these changes, a lot of jobs are being put on hold with the message *Job
has gone over memory limit of 1024 megabytes. Peak usage: 950 megabytes.* From
the starter logs I found that the jobs go on hold because of an OOM event
logged in the starter log, but there is no corresponding OOM event from the
kernel. On further checking, Condor triggers this OOM event only for the exit
hook, after the job has already completed within its memory limits. I have
checked the code on GitHub but was not able to find a solution.
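This is roughly how I checked both sides (condor_q from the submit side, the
kernel log on the execute node; exact commands may differ):

    condor_q -hold                                 # shows the HoldReason for held jobs
    journalctl -k | grep -iE 'oom|out of memory'   # no kernel OOM kills logged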
The exit hook script simply removes the directory used for logs; a simplified
sketch is below.
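Stripped down to its essentials, the hook looks something like this (a sketch,
not the exact script; the real one takes the target path from the job's
logsPath.sh instead of hard-coding it):

    #!/bin/bash
    # Job exit hook sketch: remove the per-job logs directory.
    # As I understand the hook interface, the starter passes the exit
    # reason (exit/hold/evict/remove) as $1 and the job ClassAd on
    # stdin, which is drained here.
    reason="$1"
    cat > /dev/null

    # LOGS_DIR stands in for the path the real script reads from the
    # job's logsPath.sh; hard-coded here only for illustration.
    LOGS_DIR="/data/job-logs/example"

    if [ -d "$LOGS_DIR" ]; then
        rm -rf -- "$LOGS_DIR"
    fi

    exit 0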
For some jobs it logs *Spurious OOM event, usage is 100 bytes, slot size is
1024 megabytes, ignoring OOM (read 8 bytes)*.
Is there anything I am doing wrong, or is there a memory leak? Can anyone
please help with this? I have spent over two weeks debugging the issue.
Thanks and Regards
Raman