
[HTCondor-users] job remnants on SL6



Hi all,

we noticed leftover processes on a number of SL6 nodes that are stuck
in state D (uninterruptible sleep) with stalled file handles.

Since all Condor daemons and processes have already shut down, and the
job's cgroup sub-slices are gone as well, the affected PIDs have moved
up to the parent condor cgroup [1.a,b].

The thing is that the logs do not mention any of the affected PIDs.
Our guess is that when the parent job of such a PID exits or gets killed
by Condor, the job's cgroup etc. gets deleted and its child process tree
gets a SIGTERM/SIGKILL. Is that right?
Does the daemon check whether all child processes have actually exited,
and maybe log a warning, before shutting down?
(Though I am not sure whether that is really reasonable, as it is more
the kernel's task to take care of such processes... ;) )

Cheers,
  Thomas

[1.a]
[root@bird839 ~]# cat /cgroup/cpu/htcondor/tasks
180938

[1.b]
[root@bird839 ~]# cat /proc/180938/cgroup
5:freezer:/htcondor
4:blkio:/htcondor
3:cpuacct:/htcondor
2:cpu:/htcondor
1:memory:/htcondor
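
To find such leftovers on other nodes, something along these lines might
help. It is only a rough sketch, not anything Condor itself provides: it
reads the tasks file from [1.a] and reports every entry whose state in
/proc/<pid>/stat is D. The function names and the proc_root parameter are
made up for illustration, and the cgroup path is simply the one from our
nodes.

```python
import os


def parse_state(stat_text):
    """Return the one-letter process state from /proc/<pid>/stat content."""
    # The command name sits in parentheses and may itself contain spaces
    # or ')' characters, so split on the LAST ')' and take the next field.
    return stat_text.rsplit(")", 1)[1].split()[0]


def stuck_pids(tasks_path, proc_root="/proc"):
    """List the IDs from a cgroup tasks file that are in state D."""
    with open(tasks_path) as fh:
        pids = [line.strip() for line in fh if line.strip()]

    stuck = []
    for pid in pids:
        try:
            with open(os.path.join(proc_root, pid, "stat")) as fh:
                state = parse_state(fh.read())
        except OSError:
            # The task vanished between reading the list and checking it.
            continue
        if state == "D":
            stuck.append(int(pid))
    return stuck


if __name__ == "__main__":
    # Path as seen on our SL6 nodes in [1.a]; adjust to the local mount.
    for pid in stuck_pids("/cgroup/cpu/htcondor/tasks"):
        print(pid)
```

Note that the tasks file lists task (thread) IDs, so a multi-threaded
process may show up with more than one entry.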
