Re: [HTCondor-users] Scheduling delay in cluster mix of 8.8.5 and 9.0.17 version

Hi Vikrant,

It sounds very much as if the schedd is busy doing something else. Can you get a condor_q response from it?

If you did not change anything in your configuration and it worked before, I would suspect that something is wrong with the job_queue.log file in your spool directory; that can make the schedd unresponsive pretty quickly.

- stop condor

- remove /var/lib/condor/spool/job_queue.log (or move it away for later inspection)

- start condor

And see if that cures the problem ...
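
The "move away" variant of the steps above could look roughly like this. It is only a minimal sketch in Python, assuming the default spool path mentioned above; the helper name `set_aside_queue_log` is made up for illustration and is not part of HTCondor. It renames the file with a timestamp instead of deleting it, so it can be inspected or put back later. Keep in mind that, as far as I know, the schedd rebuilds its job queue from this file, so starting without it means starting with an empty queue.

```python
import time
from pathlib import Path


def set_aside_queue_log(spool_dir="/var/lib/condor/spool", suffix=None):
    """Rename job_queue.log so the schedd starts without it.

    Hypothetical helper, not part of HTCondor. Call it only while
    condor is stopped. Returns the new path, or None if there was
    no job_queue.log to move.
    """
    log = Path(spool_dir) / "job_queue.log"
    if not log.exists():
        return None
    # A timestamp suffix makes it unlikely that an earlier copy
    # gets overwritten by a later run.
    suffix = suffix or time.strftime("%Y%m%d-%H%M%S")
    target = log.with_name(f"{log.name}.{suffix}")
    log.rename(target)
    return target
```

The call belongs between the "stop condor" and "start condor" steps and needs write permission on the spool directory; how condor is stopped and started depends on the installation.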

best

christoph

--
Christoph Beyer
DESY Hamburg
IT-Department

Notkestr. 85
Building 02b, Room 009
22607 Hamburg

phone:+49-(0)40-8998-2317
mail: christoph.beyer@xxxxxxx

From: "Vikrant Aggarwal" <ervikrant06@xxxxxxxxx>
To: "HTCondor-Users Mail List" <htcondor-users@xxxxxxxxxxx>
Sent: Monday, November 6, 2023 23:04:00
Subject: Re: [HTCondor-users] Scheduling delay in cluster mix of 8.8.5 and 9.0.17 version

Hello Joe,

Yes, it's sanitized.

We have been hitting this issue more often since updating a few nodes to 9.0.17.

I saw some jobs exit with status 115; after that, nothing was matched.

11/06/23 14:34:31 (pid:120740) Shadow pid 1921530 for job 33198002.10 exited with status 115

The following messages appeared well before the issue started.

# grep 'SECMAN:2007:Failed to end classad message.' /var/log/condor/SchedLog*
/var/log/condor/SchedLog.old:11/06/23 12:45:58 (pid:120740) Failed to send RESCHEDULE to negotiator master.example.com: SECMAN:2007:Failed to end classad message.
/var/log/condor/SchedLog.old:11/06/23 12:46:08 (pid:120740) Failed to send RESCHEDULE to negotiator master.example.com: SECMAN:2007:Failed to end classad message.
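
To see how such messages line up in time with the point where matching stops, the log can be scanned for both kinds of events at once. Below is a minimal sketch in Python, assuming the SchedLog line layout quoted in this thread (`MM/DD/YY HH:MM:SS (pid:N) message`); the function name `scan_sched_log` and the two event labels are made up for illustration.

```python
import re
from datetime import datetime

# Layout assumed from the log lines quoted in this thread:
#   11/06/23 12:45:58 (pid:120740) <message>
LINE_RE = re.compile(r"(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) \(pid:\d+\) (.*)")

# Illustrative labels mapped to substrings seen in the thread.
EVENTS = {
    "reschedule_failed": "Failed to send RESCHEDULE to negotiator",
    "shadow_exit_115": "exited with status 115",
}


def scan_sched_log(lines):
    """Return (timestamp, label) pairs for lines matching a known event."""
    found = []
    for line in lines:
        # search() rather than match() so grep-style "file:" prefixes are tolerated
        m = LINE_RE.search(line)
        if not m:
            continue  # skip lines without the expected prefix
        stamp = datetime.strptime(m.group(1), "%m/%d/%y %H:%M:%S")
        for label, needle in EVENTS.items():
            if needle in m.group(2):
                found.append((stamp, label))
    return found
```

Passing an open SchedLog file object (or the output lines of the grep above) to `scan_sched_log` yields the events in file order, which makes it easier to compare the last failed RESCHEDULE and the last status-115 exit against the time matching stopped.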

I checked the negotiator logs.

Unfortunately, the schedd is not sending any jobs to the negotiator for matchmaking. What I don't understand is why, despite having so many jobs in the queue, the schedd is not sending any of them to the negotiator.

Thanks & Regards,

Vikrant Aggarwal

On Thu, Nov 2, 2023 at 2:09 PM Beyer, Christoph <christoph.beyer@xxxxxxx> wrote:

Hi,

Maybe not very helpful, but we saw similar behaviour on schedds that were overly busy reordering their state database due to a high number of jobs combined with some typos in submit files etc.

Hence I agree with Joe: the schedd logs should shed some light on this ;)

Best
christoph

From: "JOSEPH RYAN REUSS via HTCondor-users" <htcondor-users@xxxxxxxxxxx>
To: "HTCondor-Users Mail List" <htcondor-users@xxxxxxxxxxx>
CC: "JOSEPH RYAN REUSS" <jrreuss@xxxxxxxx>
Sent: Thursday, November 2, 2023 16:22:49
Subject: Re: [HTCondor-users] Scheduling delay in cluster mix of 8.8.5 and 9.0.17 version

Vikrant,

It looks like around the time the schedd was failing, it was still communicating with the negotiator and continued to send jobs. I would look at the NegotiatorLog at that timestamp to see if there is any helpful information there. Also, is "test.example.com" the real name or did you sanitize the log?

Best,

Joe Reuss

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Vikrant Aggarwal <ervikrant06@xxxxxxxxx>
Sent: Monday, October 30, 2023 4:11 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: [HTCondor-users] Scheduling delay in cluster mix of 8.8.5 and 9.0.17 version

Hello Experts,

The schedd is running version 9.0.17.

The HTCondor masters are running version 8.8.5 (the primary and all those in the flock_to list).

Special setup details: We dynamically modify the job requirements so that a job first gets a chance to run on the private pool (team-owned) and, failing that, on the public pool (shared by multiple teams), while making sure we do not create too many autocluster IDs.

Despite available cores in both the primary and the flock pools, jobs stay in the queue indefinitely until we restart the condor service on the scheduler.

The schedd doesn't present the jobs for matchmaking:
10/30/23 13:38:10 0 seconds so far for this submitter
10/30/23 13:38:10 0 seconds so far for this schedd
10/30/23 13:38:10     Got NO_MORE_JOBS;  schedd has no more requests
In the schedd logs, the following messages were reported, but even after them the schedd kept sending jobs to the negotiator for matchmaking:
10/30/23 12:07:01 (pid:43091) condor_write(): Socket closed when trying to write 354 bytes to negotiator test.example.com, fd is 25
10/30/23 12:07:01 (pid:43091) Buf::write(): condor_write() failed
10/30/23 12:07:01 (pid:43091) SECMAN: failed to end classad message
10/30/23 12:07:01 (pid:43091) Failed to send RESCHEDULE to negotiator test.example.com: SECMAN:2007:Failed to end classad message.
10/30/23 12:07:01 (pid:43091) (cid:1237904) actOnJobs: didn't do any work, aborting
Immediately before the schedd stopped advertising jobs to the negotiators, the following messages were reported, but they don't look problematic:
10/30/23 12:34:00 (pid:43091) Shadow pid 3450123 for job 37192219.2 exited with status 115
10/30/23 12:34:00 (pid:43091) Match record (slot1@xxxxxxxxxxxxxxxxxxxx <10.xx.xx.xx:9618?addrs=10.xx.xx.xx-9618&alias=testnode.example.com&noUDP&sock=startd_283196_7aaf> for test.user1, 37192219.2) deleted
Any thoughts on what could be the issue here?
Thanks & Regards,

Vikrant Aggarwal
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/