Hello Experts,
We are seeing an issue where one job in a batch remains idle even though resources are available in the cluster. This started after the update to 8.8.5; we never saw this behavior with version 8.5.8.
Hi Vikrant:
Can we see (off list, if you so desire) the StartLog on the worker node for the claim failure in question?
-greg
We are using scheduler-level splitting of slots.
# condor_config_val CLAIM_PARTITIONABLE_LEFTOVERS
true
Whenever this issue occurs, we see "Request was NOT accepted for claim" in the SchedLog, which I believe indicates that one claim attempt failed. Another attempt was made roughly 21 minutes later, and this time the job started running.
# grep '2290171.0' /var/log/condor/SchedLog
08/27/21 00:22:30 (pid:9386) job_transforms for 2290171.0: 1 considered, 1 applied (SetTestTeam)
08/27/21 00:22:44 (pid:9386) Request was NOT accepted for claim slot1@xxxxxxxxxxxxxxxxxxxxxxx<xx.xx.84.175:9618?addrs=xx.xx.84.175-9618&noUDP&sock=7226_0371_3> for testuser1 2290171.0
08/27/21 00:22:44 (pid:9386) Match record (slot1@xxxxxxxxxxxxxxxxxxxxxxx<xx.xx.84.175:9618?addrs=xx.xx.84.175-9618&noUDP&sock=7226_0371_3> for testuser1, 2290171.0) deleted
08/27/21 00:43:40 (pid:9386) Starting add_shadow_birthdate(2290171.0)
08/27/21 00:43:40 (pid:9386) Started shadow for job 2290171.0 on slot1@xxxxxxxxxxxxxxxxxxxxxxx<xx.xx.84.31:9618?addrs=xx.xx.84.31-9618&noUDP&sock=56704_ce58_3> for testuser1, (shadow pid = 1946817)
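To put a number on the gap between the rejected claim and the eventual shadow start, the SchedLog lines above can be parsed with a short script. This is only a rough sketch: the function names `timestamp` and `claim_retry_delay` are made up for illustration, the sample lines are copied from the excerpt with the slot addresses shortened to `slot1@...`, and it assumes every line begins with a `MM/DD/YY HH:MM:SS` timestamp as shown here.

```python
from datetime import datetime

# Lines copied from the SchedLog excerpt above; slot addresses are shortened.
LOG_LINES = [
    "08/27/21 00:22:30 (pid:9386) job_transforms for 2290171.0: 1 considered, 1 applied (SetTestTeam)",
    "08/27/21 00:22:44 (pid:9386) Request was NOT accepted for claim slot1@... for testuser1 2290171.0",
    "08/27/21 00:22:44 (pid:9386) Match record (slot1@... for testuser1, 2290171.0) deleted",
    "08/27/21 00:43:40 (pid:9386) Starting add_shadow_birthdate(2290171.0)",
    "08/27/21 00:43:40 (pid:9386) Started shadow for job 2290171.0 on slot1@... for testuser1, (shadow pid = 1946817)",
]


def timestamp(line):
    """Parse the leading 'MM/DD/YY HH:MM:SS' timestamp (first 17 characters)."""
    return datetime.strptime(line[:17], "%m/%d/%y %H:%M:%S")


def claim_retry_delay(lines, job_id):
    """Return the time between the first rejected claim and the shadow start.

    Returns None if either event is missing for the given job.
    """
    rejected = None
    started = None
    for line in lines:
        if job_id not in line:
            continue
        if rejected is None and "Request was NOT accepted for claim" in line:
            rejected = timestamp(line)
        elif started is None and "Started shadow for job" in line:
            started = timestamp(line)
    if rejected is None or started is None:
        return None
    return started - rejected


if __name__ == "__main__":
    delay = claim_retry_delay(LOG_LINES, "2290171.0")
    print(f"Delay between rejected claim and shadow start: {delay}")
```

To run it against the real log, read the lines from `/var/log/condor/SchedLog` in place of the hard-coded `LOG_LINES` list.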
What can we do to speed up job matchmaking after the first failed attempt?
Thanks & Regards,
Vikrant Aggarwal
_______________________________________________ HTCondor-users mailing list To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a subject: Unsubscribe You can also unsubscribe by visiting https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users The archives can be found at: https://lists.cs.wisc.edu/archive/htcondor-users/