
[HTCondor-users] Scheduling delay in cluster mix of 8.8.5 and 9.0.17 version



Hello Experts,

The schedd is running version 9.0.17.
The HTCondor masters are running version 8.8.5 (the primary pool and all pools in the flock_to list).

Special setup details: We dynamically modify the job requirements so that a job first gets an opportunity to run on the private pool (team-owned) and, failing that, on the public pool (shared by multiple teams), while ensuring we do not create too many autocluster IDs.

Despite available cores in both the primary and flock pools, jobs stay in the queue indefinitely until we restart the condor service on the scheduler.

The schedd doesn't present the jobs for matchmaking:
10/30/23 13:38:10 0 seconds so far for this submitter
10/30/23 13:38:10 0 seconds so far for this schedd
10/30/23 13:38:10     Got NO_MORE_JOBS;  schedd has no more requests

In the schedd logs, the following messages were reported, but even after these messages the schedd kept sending jobs to the negotiator for matchmaking.

10/30/23 12:07:01 (pid:43091) condor_write(): Socket closed when trying to write 354 bytes to negotiator test.example.com, fd is 25
10/30/23 12:07:01 (pid:43091) Buf::write(): condor_write() failed
10/30/23 12:07:01 (pid:43091) SECMAN: failed to end classad message
10/30/23 12:07:01 (pid:43091) Failed to send RESCHEDULE to negotiator test.example.com: SECMAN:2007:Failed to end classad message.
10/30/23 12:07:01 (pid:43091) (cid:1237904) actOnJobs: didn't do any work, aborting

Immediately before the schedd stopped advertising jobs to the negotiators, the following messages were reported, but they don't look problematic:

10/30/23 12:34:00 (pid:43091) Shadow pid 3450123 for job 37192219.2 exited with status 115
10/30/23 12:34:00 (pid:43091) Match record (slot1@xxxxxxxxxxxxxxxxxxxx <10.xx.xx.xx:9618?addrs=10.xx.xx.xx-9618&alias=testnode.example.com&noUDP&sock=startd_283196_7aaf> for test.user1, 37192219.2) deleted
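
In case it helps anyone correlate the two schedd log excerpts above, here is a minimal Python sketch that pulls the timestamps of matching lines and prints the gap between the failed RESCHEDULE and the last shadow exit. The helper name find_events and the inline sample list are illustrative assumptions, not part of HTCondor; in practice the lines would be read from the schedd log file.

```python
import re
from datetime import datetime

# Timestamp layout seen in the excerpts above, e.g. "10/30/23 12:07:01 ..."
STAMP = re.compile(r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) ")


def find_events(lines, needle):
    """Return the timestamps of all log lines that contain `needle`.

    Hypothetical helper for illustration; lines without a leading
    timestamp are skipped.
    """
    hits = []
    for line in lines:
        m = STAMP.match(line)
        if m and needle in line:
            hits.append(datetime.strptime(m.group(1), "%m/%d/%y %H:%M:%S"))
    return hits


# Sample lines copied from the excerpts above (stand-in for the real log file).
sample = [
    "10/30/23 12:07:01 (pid:43091) Failed to send RESCHEDULE to negotiator "
    "test.example.com: SECMAN:2007:Failed to end classad message.",
    "10/30/23 12:34:00 (pid:43091) Shadow pid 3450123 for job 37192219.2 "
    "exited with status 115",
]

failed = find_events(sample, "Failed to send RESCHEDULE")
exited = find_events(sample, "exited with status")

# Time between the first failed RESCHEDULE and the last shadow exit.
gap = exited[-1] - failed[0]
print(gap)
```

With the two sample lines the gap is just under 27 minutes; the same search can be pointed at any other message of interest.
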

Any thoughts on what could be the issue here?

Thanks & Regards,
Vikrant Aggarwal