[HTCondor-users] cgroup directories not being deleted

UW Madison Computer Sciences Department Computer Systems Lab

Date: Tue, 26 Dec 2017 14:40:04 -0500

From: Jon Bernard <jonbernard@xxxxxxxxx>

Subject: [HTCondor-users] cgroup directories not being deleted

Hi all,

In our job wrapper, we do

exec /usr/bin/time -f "$TIME_LOGMSG" -o >($NOTICE_LOG) "$@"

where NOTICE_LOG is

NOTICE_LOG="/bin/logger -p local1.notice -t CondorJobLogger"

This works well, except for one thing: occasionally the per-job cgroup directories don't get deleted. This in turn seems to cause an OOM event when another job with the same PID is later assigned to the same slot. We run lots of short jobs, so this eventually happens if the nodes stay up long enough.

What seems to be happening is that the cgroup directories can't be deleted because the /bin/logger process, which the >(...) process substitution starts asynchronously, is still running when cgroup_delete_cgroup is invoked.
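One way to keep a logger process from outliving the job, sketched here under assumptions rather than as a tested fix: have the wrapper write the time output to a temporary file and then run the logger synchronously, instead of combining exec with process substitution. The function name run_job_logged, the TIME_BIN override, and the default TIME_LOGMSG format are hypothetical; only NOTICE_LOG and the /usr/bin/time invocation come from the wrapper above. Whether this is enough to let the cgroup directory be removed has not been verified.

```shell
#!/bin/bash
# Sketch only: run the job under time, then log synchronously.
# TIME_BIN and the TIME_LOGMSG default are assumptions for illustration.
TIME_BIN="${TIME_BIN:-/usr/bin/time}"
TIME_LOGMSG="${TIME_LOGMSG:-elapsed=%e user=%U sys=%S maxrss=%MKB}"
NOTICE_LOG="${NOTICE_LOG:-/bin/logger -p local1.notice -t CondorJobLogger}"

run_job_logged() {
    local timefile status
    timefile=$(mktemp) || return 1

    # No exec and no >(...): the shell waits for the job to finish.
    "$TIME_BIN" -f "$TIME_LOGMSG" -o "$timefile" "$@"
    status=$?

    # The logger reads the saved output from stdin and has exited
    # before the wrapper itself returns.
    $NOTICE_LOG < "$timefile"
    rm -f "$timefile"

    return "$status"
}
```

The wrapper would then end with run_job_logged "$@" followed by exit $? in place of the exec line. The trade-off is that the shell stays around as the parent of the job rather than being replaced by it.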

Is there any config option that would allow the job more time to finish before this happens?

Thanks,

Jon
