We are facing a strange issue with MPI jobs.
The Condor scheduler's "condor_q -all" is not showing any job in the queue, and we use only a single sched node to submit MPI jobs, but if we submit a new job it still remains in the idle state. Running better-analyze against that job shows the relevant output below. The count of 172 is expected, since another user has 172 vanilla jobs running, but since no job is running under this user, the count of 10 is totally unexpected.
When I checked the used slots on all nodes in the cluster, I found chunks of 10 slots in used status, which I believe correspond to earlier attempts at running the MPI job (we request 10 cores and 5 nodes when running it). But when I log in to any node showing 10 CPUs in used status and run condor_who, pstree -p condor, htop, or top, no user process is running on that node. Again, the total count of 172 one-CPU slots is expected, since the other user's jobs are running.
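To narrow down which dynamic slots still hold those 10-core claims and who HTCondor thinks owns them, something like the following can be run from the central manager (a sketch; the constraint assumes the phantom claims are dynamic slots with exactly Cpus == 10, and it is guarded so it is a no-op on hosts without HTCondor):

```shell
# Sketch: list dynamic slots that still hold a 10-cpu claim, with their
# state and claimed owner. The Cpus == 10 constraint is an assumption
# based on the 10-core MPI request; adjust to match the actual request.
if command -v condor_status >/dev/null 2>&1; then
    condor_status -const 'PartitionableSlot =!= true && Cpus == 10' \
        -af Machine Name State RemoteOwner
else
    echo "condor_status not available on this host"
fi
```

If these slots show up as Claimed but RemoteOwner is undefined (and condor_who on the node shows nothing), that points at stale claims on the startd side rather than real running work.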
# condor_status -compact -af:h machine cpus totalcpus childcpus 'int(totalcpus-cpus)'
machine                cpus totalcpus childcpus                                     int(totalcpus-cpus)
testnode0001.test.com  10   36.0      { 1,1,1,1,1,1,10,10 }                         26
testnode0002.test.com  11   36.0      { 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,10 }          25
testnode0003.test.com  21   36.0      { 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1 }             15
testnode0004.test.com  15   36.0      { 1,1,1,1,1,10,1,1,1,1,1,1 }                  21
testnode0005.test.com  13   36.0      { 10,1,1,1,1,1,1,1,1,1,1,1,1,1 }              23
testnode0006.test.com  24   36.0      { 1,1,1,1,1,1,1,1,1,1,1,1 }                   12
testnode0007.test.com  8    36.0      { 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,10,1 }    28
testnode0008.test.com  12   36.0      { 1,1,1,1,1,1,1,1,1,1,1,1,1,1,10 }            24
testnode0009.test.com  12   36.0      { 1,1,1,1,1,1,1,1,10,1,1,1,1,1,1 }            24
testnode0010.test.com  1    36.0      { 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,10,10 }       35
testnode0011.test.com  15   36.0      { 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1 } 21
testnode0012.test.com  17   36.0      { 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1 }     19
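One sanity check on the output above (a sketch, not part of the original output): for a partitionable slot, the childcpus list should sum to int(totalcpus - cpus), so the 10-entry items in that list are exactly the claims left behind by the 10-core MPI attempts. A quick awk pass over a line of the output confirms the accounting, here using testnode0001's row as sample input:

```shell
# Sum the childcpus list and compare with the reported used count.
# Sample line copied from the condor_status output above.
line='testnode0001.test.com 10 36.0 { 1,1,1,1,1,1,10,10 } 26'

echo "$line" | awk '
{
    used = $NF                       # int(totalcpus - cpus) column
    match($0, /\{[^}]*\}/)           # extract the { ... } childcpus list
    list = substr($0, RSTART + 1, RLENGTH - 2)
    n = split(list, a, ",")
    sum = 0; tens = 0
    for (i = 1; i <= n; i++) { sum += a[i]; if (a[i] + 0 == 10) tens++ }
    printf "childcpus sum=%d, reported used=%d, 10-core claims=%d\n", sum, used, tens
}'
```

For testnode0001 this prints a childcpus sum of 26 matching the reported used count of 26, with two 10-core claims, so the "missing" CPUs are fully explained by those stale dynamic-slot claims.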
Adding to the confusion, I found that some old cgroup directories have not been removed. The output below is from testnode0001, which actually has 6 vanilla jobs running (one slot each), yet I can see 14 directories present in the cgroup hierarchy.