
[HTCondor-users] cgroup directories not being deleted



Hi all,

In our job wrapper, we do

exec /usr/bin/time -f "$TIME_LOGMSG" -o >($NOTICE_LOG) "$@"

where NOTICE_LOG is

NOTICE_LOG="/bin/logger -p local1.notice -t CondorJobLogger"

This works well, except for one thing: occasionally the per-job cgroup directories don't get deleted. A leftover directory in turn seems to cause an OOM event when another job with the same PID gets assigned to the same slot. We run lots of short jobs, so PIDs get reused and this eventually happens if the nodes stay up long enough.

What seems to be happening is that the cgroup directories can't be deleted because the /bin/logger process is still running when cgroup_delete_cgroup is invoked.

Is there any config option that would give the job's processes more time to exit before the cgroup directory is deleted?
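
For reference, the only wrapper-side workaround I can think of is to drop the exec and have the wrapper wait for the logger itself. Below is a rough sketch of that idea, not something we run in production. It assumes GNU time, writes the timing line to a temporary file instead of a process substitution, and runs the logger synchronously before returning the job's exit status. The names run_job_and_log and TIME_BIN are made up for illustration, and the TIME_LOGMSG default is a placeholder rather than our real format. I have not confirmed that this actually avoids the leftover cgroup directories.

```shell
#!/bin/bash
# Sketch only: run the job under time WITHOUT exec, so the wrapper can make
# sure the logger has exited before the wrapper itself exits.

# Placeholder format string (assumption; the real TIME_LOGMSG is site-specific).
TIME_LOGMSG="${TIME_LOGMSG:-elapsed=%e maxrss_kb=%M}"
# Logger command as given in the post.
NOTICE_LOG="${NOTICE_LOG:-/bin/logger -p local1.notice -t CondorJobLogger}"
# Hypothetical override hook for the time binary; defaults to /usr/bin/time.
TIME_BIN="${TIME_BIN:-/usr/bin/time}"

run_job_and_log() {
    local timing_file status=0
    timing_file=$(mktemp) || return 1

    # Run the job; time writes its report to the temp file and exits with
    # the job's own exit status.
    "$TIME_BIN" -f "$TIME_LOGMSG" -o "$timing_file" "$@" || status=$?

    # Run the logger in the foreground. NOTICE_LOG is intentionally
    # unquoted so that it splits into the command and its arguments.
    $NOTICE_LOG < "$timing_file"

    rm -f "$timing_file"
    return "$status"
}
```

The wrapper would then end with `run_job_and_log "$@"; exit $?` in place of the exec line. The downside is that without exec the wrapper shell stays around as the job's parent, so a signal sent to the wrapper is no longer delivered directly to the job.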

Thanks,
Jon