Hi all,

we noticed leftover processes on a number of SL6 nodes that are stuck in state D (uninterruptible sleep) with stalled file handles. Since all Condor daemons and processes have already shut down, including the jobs' cgroup sub-slices, the affected PIDs have moved up to the parent condor cgroup [1.a,b].

The thing is that the logs do not mention any of the affected PIDs. Our guess is that when the parent job of such a PID exits or gets killed by Condor, the job's cgroup etc. gets deleted and its child process tree gets a SIGTERM/SIGKILL. Is that right?

Does the daemon check before shutting down whether all child processes have actually exited, and maybe log a warning if not? (Though I am not sure whether that is really reasonable, as it is more the kernel's task to take care of such processes... ;) )

Cheers,
  Thomas

[1.a]
[root@bird839 ~]# cat /cgroup/cpu/htcondor/tasks
180938

[1.b]
[root@bird839 ~]# cat /proc/180938/cgroup
5:freezer:/htcondor
4:blkio:/htcondor
3:cpuacct:/htcondor
2:cpu:/htcondor
1:memory:/htcondor
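
In case it helps others check their own nodes: below is a minimal sketch of how the leftover PIDs from [1.a] could be tested for state D. It only reads the cgroup tasks file and /proc/<pid>/stat. The function names `parse_state` and `stuck_pids` are made up for illustration, and the default path is simply the one from [1.a], so it may differ on other setups. This is not something Condor itself does, as far as we know.

```python
import os


def parse_state(stat_text):
    """Return (comm, state) from the contents of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split at the LAST closing parenthesis.
    left, right = stat_text.rsplit(")", 1)
    comm = left.split("(", 1)[1]
    state = right.split()[0]
    return comm, state


def stuck_pids(tasks_path="/cgroup/cpu/htcondor/tasks", proc_root="/proc"):
    """List (pid, comm) for every task in the cgroup that is in state D.

    tasks_path defaults to the path shown in [1.a]; this is an assumption
    and depends on where the cgroup hierarchy is mounted.
    """
    stuck = []
    try:
        with open(tasks_path) as f:
            pids = [line.strip() for line in f if line.strip()]
    except OSError:
        return stuck  # cgroup not present on this node
    for pid in pids:
        try:
            with open(os.path.join(proc_root, pid, "stat")) as f:
                comm, state = parse_state(f.read())
            if state == "D":
                stuck.append((int(pid), comm))
        except (OSError, ValueError, IndexError):
            continue  # task vanished or stat line was unparsable
    return stuck


if __name__ == "__main__":
    for pid, comm in stuck_pids():
        print(pid, comm)
```

Run as root on an affected node, this prints one line per task that is still in state D in the parent cgroup, which could be fed into a periodic health check.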