Thanks for your reply. It's a bit of a relief to hear that we are not the only ones hitting this issue.
More information regarding our environment (a quick way to verify both points is sketched below):
- Partitionable slots
- Schedd-level slot splitting enabled.
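A minimal sketch of how the settings could be checked, assuming the partitionable slot type is SLOT_TYPE_1 and that the schedd-level splitting is driven by CLAIM_PARTITIONABLE_LEFTOVERS (both are my assumptions for illustration, not copied from our actual config):
# condor_config_val -name node1.com -startd SLOT_TYPE_1_PARTITIONABLE NUM_SLOTS_TYPE_1
# condor_config_val -schedd CLAIM_PARTITIONABLE_LEFTOVERS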
I believe this could be relevant, given the following message in the SchedLog for the one job (out of 179) that stayed idle.
# grep '154439.159' /var/log/condor/SchedLog.old
12/13/22 14:44:51 (pid:19397) job_transforms for 154439.159: 1 considered, 1 applied (SetTeam)
12/13/22 14:50:22 (pid:19397) Request was NOT accepted for claim slot1@xxxxxxxxx <xx.xx.xx.xx:9618?addrs=xx.xx.xx.xx-9618&noUDP&sock=56761_e73e_3> for test.user1 154439.159
12/13/22 14:50:22 (pid:19397) Match record (slot1@xxxxxxxxx <xx.xx.xx.xx:9618?addrs=xx.xx.xx.xx-9618&noUDP&sock=56761_e73e_3> for test.user1, 154439.159) deleted
However, the same message appears for other jobs as well, yet those did not stay idle for long; the job mentioned above was stuck in the idle state for more than 15 minutes. (A rough way to quantify the idle time per job is sketched after the excerpt below.)
# grep 'node1.com' /var/log/condor/SchedLog.old | grep '12/13/22 14:'
12/13/22 14:48:58 (pid:19397) Request was NOT accepted for claim slot1@xxxxxxxxx <xx.xx.xx.xx1:9618?addrs=xx.xx.xx.xx1-9618&noUDP&sock=56761_e73e_3> for test.user1 154439.109
12/13/22 14:48:58 (pid:19397) Match record (slot1@xxxxxxxxx <xx.xx.xx.xx1:9618?addrs=xx.xx.xx.xx1-9618&noUDP&sock=56761_e73e_3> for test.user1, 154439.109) deleted
12/13/22 14:49:15 (pid:19397) Request was NOT accepted for claim slot1@xxxxxxxxx <xx.xx.xx.xx1:9618?addrs=xx.xx.xx.xx1-9618&noUDP&sock=56761_e73e_3> for test.user1 154439.119
12/13/22 14:49:15 (pid:19397) Match record (slot1@xxxxxxxxx <xx.xx.xx.xx1:9618?addrs=xx.xx.xx.xx1-9618&noUDP&sock=56761_e73e_3> for test.user1, 154439.119) deleted
12/13/22 14:49:34 (pid:19397) Request was NOT accepted for claim slot1@xxxxxxxxx <xx.xx.xx.xx1:9618?addrs=xx.xx.xx.xx1-9618&noUDP&sock=56761_e73e_3> for test.user1 154439.130
12/13/22 14:49:34 (pid:19397) Match record (slot1@xxxxxxxxx <xx.xx.xx.xx1:9618?addrs=xx.xx.xx.xx1-9618&noUDP&sock=56761_e73e_3> for test.user1, 154439.130) deleted
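To quantify how long each job in the cluster sat idle before being matched, something like the following could be run against the SchedLog (just a sketch based on the message formats shown above; the cluster id 154439 and the 0-178 proc range are specific to this submission):
# for p in $(seq 0 178); do grep -E "(job_transforms for|switching to job) 154439\.$p\b" /var/log/condor/SchedLog.old | awk -v j="154439.$p" '{t[NR]=$2} END {print j, "transformed:", t[1], "first match:", t[2]}'; done
For each proc this prints the time the job was transformed at submission and the time a match first switched to it, so outliers like 154439.159 stand out.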
If we look at the logs of a job which was successfully matched after the first rejection, the re-match happened very quickly: node1.com didn't accept the request, but node2.com did.
# grep '154439.173' /var/log/condor/SchedLog.old
12/13/22 14:44:51 (pid:19397) job_transforms for 154439.173: 1 considered, 1 applied (SetTeam)
12/13/22 14:50:40 (pid:19397) Request was NOT accepted for claim slot1@xxxxxxxxx <xx.xx.51.71:9618?addrs=xx.xx.51.71-9618&noUDP&sock=56761_e73e_3> for test.user1 154439.173
12/13/22 14:50:40 (pid:19397) Match record (slot1@xxxxxxxxx <xx.xx.51.71:9618?addrs=xx.xx.51.71-9618&noUDP&sock=56761_e73e_3> for test.user1, 154439.173) deleted
12/13/22 14:50:42 (pid:19397) match (slot1@xxxxxxxxx <xx.xx.85.143:9618?addrs=xx.xx.85.143-9618&noUDP&sock=3132923_2bce_3> for test.user1) switching to job 154439.173
12/13/22 14:50:42 (pid:19397) Shadow pid 1628682 switching to job 154439.173.
12/13/22 14:50:42 (pid:19397) Starting add_shadow_birthdate(154439.173)
12/13/22 14:52:12 (pid:19397) Shadow pid 1628682 for job 154439.173 reports job exit reason 100.
12/13/22 14:52:12 (pid:19397) Match record (slot1@xxxxxxxxx <xx.xx.85.143:9618?addrs=xx.xx.85.143-9618&noUDP&sock=3132923_2bce_3> for test.user1, 154439.173) deleted
As I mentioned earlier, this issue is not reproducible at will, but we are hitting it often.
For the above case I don't have the logs from the /var/log/condor/StartLog file of the worker node(s), as they have already been rotated away.
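For the next occurrence, keeping more StartLog history on the worker nodes may help; my understanding (an assumption, not verified against our pool's defaults) is that MAX_STARTD_LOG and MAX_NUM_STARTD_LOG control the rotation size and the number of rotated files. The current values can be checked remotely with:
# condor_config_val -name node1.com -startd MAX_STARTD_LOG MAX_NUM_STARTD_LOG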
Community: any idea what else could be helpful to troubleshoot the long queue times of these jobs?
Thanks & Regards,
Vikrant Aggarwal