We are facing a strange issue with MPI jobs.
On the Condor scheduler, "condor_q -all" shows no jobs in the queue, and we use only a single scheduler node to submit MPI jobs. Even so, any new job we submit stays in the idle state. Running -better-analyze against that job shows the relevant output below.
The count of 172 is expected, since we have 172 vanilla jobs running under another user. But no job is running under this user, so the count of 10 is totally unexpected.
I then checked the used slots on all nodes in the cluster and found chunks of 10 cores shown as used, which I believe correspond to earlier attempts to run the MPI job. We request 10 cores and 5 nodes when running the MPI job. When I log in to any node that shows 10 CPUs in used status and run condor_who, pstree -p condor, htop, or top, no user process is running on that node. Again, the total of 172 single-core slots is expected, as another user is running those jobs.
# condor_status -compact -af:h machine cpus totalcpus childcpus 'int(totalcpus-cpus)'
machine cpus totalcpus childcpus int(totalcpus-cpus)
testnode0001.test.com 10 36.0 { 1,1,1,1,1,1,10,10 } 26
testnode0002.test.com 11 36.0 { 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,10 } 25
testnode0003.test.com 21 36.0 { 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1 } 15
testnode0004.test.com 15 36.0 { 1,1,1,1,1,10,1,1,1,1,1,1 } 21
testnode0005.test.com 13 36.0 { 10,1,1,1,1,1,1,1,1,1,1,1,1,1 } 23
testnode0006.test.com 24 36.0 { 1,1,1,1,1,1,1,1,1,1,1,1 } 12
testnode0007.test.com 8 36.0 { 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,10,1 } 28
testnode0008.test.com 12 36.0 { 1,1,1,1,1,1,1,1,1,1,1,1,1,1,10 } 24
testnode0009.test.com 12 36.0 { 1,1,1,1,1,1,1,1,10,1,1,1,1,1,1 } 24
testnode0010.test.com 1 36.0 { 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,10,10 } 35
testnode0011.test.com 15 36.0 { 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1 } 21
testnode0012.test.com 17 36.0 { 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1 } 19
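
To tally the leftover 10-core child slots from a listing like the one above, something along these lines could be used. It is only a rough sketch: the function names parse_compact_line and summarize are made up for illustration, and it assumes exactly the column layout printed above (machine, cpus, totalcpus, a brace-delimited childcpus list, and the used count).

```python
import re
from collections import Counter

# Matches one data row such as:
#   testnode0001.test.com 10 36.0 { 1,1,1,1,1,1,10,10 } 26
# Header and command lines do not match and are skipped.
ROW_RE = re.compile(
    r'^(\S+)\s+(\d+)\s+([\d.]+)\s+\{\s*([\d,\s]*)\}\s+(\d+)\s*$'
)


def parse_compact_line(line):
    """Return (machine, free_cpus, total_cpus, child_cpus, used_cpus) or None."""
    m = ROW_RE.match(line.strip())
    if m is None:
        return None
    machine, free, total, children, used = m.groups()
    # childcpus is a comma-separated list of per-slot core counts
    child_cpus = [int(c) for c in children.split(',') if c.strip()]
    return machine, int(free), float(total), child_cpus, int(used)


def summarize(text, mpi_cpus=10):
    """Count child slots by size and list machines holding mpi_cpus-sized slots.

    mpi_cpus defaults to 10 because the MPI job in this post requests 10 cores;
    that default is an assumption of this sketch, not something Condor reports.
    """
    per_machine = {}
    sizes = Counter()
    for line in text.splitlines():
        row = parse_compact_line(line)
        if row is None:
            continue  # header, command line, or blank line
        machine, _free, _total, child_cpus, _used = row
        sizes.update(child_cpus)
        count = child_cpus.count(mpi_cpus)
        if count:
            per_machine[machine] = count
    return per_machine, sizes
```

Passing the text of the listing to summarize returns a dict of machines that hold 10-core slots and a Counter of slot sizes, which makes it easy to compare the number of 10-core slots against the count reported by -better-analyze.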
Adding to the confusion, I found that some old cgroup directories have not been removed. The output below is from testnode0001, where only 6 vanilla jobs are actually running, each in a 1-core slot, yet 14 directories are present in the cgroup.