
[HTCondor-users] JOB_EXIT_HOOK is putting jobs on Hold



Hello all,
I have been using HTCondor version 10.2 on Ubuntu Linux 22.04.2 LTS for
about 6 months. I am using the vanilla universe, with multiple submitters and
dynamic partitioning (partitionable slots) on the execute nodes.
Recently I tried to deploy a job exit hook (<Keyword>_HOOK_JOB_EXIT) for all
jobs that are evicted, complete, or otherwise exit on an execute node, in order
to clean up the disk on that node. Each job writes a logsPath file in its
scratch directory (e.g. /var/lib/condor/execute/dir_<num>/logsPath.sh), and the
exit hook reads this file and performs the cleanup (on the root disk, but not
on the Condor directories).
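For reference, the hook is wired up on the execute nodes roughly like this,
using the standard starter hook knobs (simplified; CLEANUP is just an
illustrative keyword and the script path is a placeholder):

    # condor_config on the execute nodes (simplified sketch)
    STARTER_DEFAULT_JOB_HOOK_KEYWORD = CLEANUP
    CLEANUP_HOOK_JOB_EXIT = /opt/condor/hooks/cleanup_exit.sh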
After these changes, a lot of jobs are being put on hold with the message *Job
has gone over memory limit of 1024 megabytes. Peak usage: 950 megabytes.* From
the starter logs I found that the jobs go on hold because of an OOM event
logged in the starter log, but there is no corresponding OOM event from the
kernel. On further checking, Condor triggers this OOM event only for the exit
hook, after the job has already completed within its memory limits. I have
checked the code on GitHub but was not able to find a solution.
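This is roughly how I checked both sides (condor_q from the submit side, the
kernel log on the execute node; exact commands may differ):

    condor_q -hold                                 # shows the HoldReason for held jobs
    journalctl -k | grep -iE 'oom|out of memory'   # no kernel OOM kills logged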
The exit hook script simply removes the directory used for logs; a simplified
sketch is below.
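Stripped down to its essentials, the hook looks something like this (a sketch,
not the exact script; the real one takes the target path from the job's
logsPath.sh instead of hard-coding it):

    #!/bin/bash
    # Job exit hook sketch: remove the per-job logs directory.
    # As I understand the hook interface, the starter passes the exit
    # reason (exit/hold/evict/remove) as $1 and the job ClassAd on
    # stdin, which is drained here.
    reason="$1"
    cat > /dev/null

    # LOGS_DIR stands in for the path the real script reads from the
    # job's logsPath.sh; hard-coded here only for illustration.
    LOGS_DIR="/data/job-logs/example"

    if [ -d "$LOGS_DIR" ]; then
        rm -rf -- "$LOGS_DIR"
    fi

    exit 0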
For some jobs it logs *Spurious OOM event, usage is 100 bytes, slot size is
1024 megabytes, ignoring OOM (read 8 bytes)*.
Is there anything I am doing wrong, or is there a memory leak? Can anyone
please help with this? I have spent over two weeks debugging the issue.
Thanks and Regards
Raman