
Re: [HTCondor-users] Job stays in queue for approx 20m before match making



Hi Vikrant,

Depending on the rate of job submission to the AP and the backlog of job pressure on the local pool, a 10-minute delay before a job starts at a flocked location doesn't seem unreasonable. This is due to the base times for negotiation cycles and to how the Schedd reschedules negotiation and flocks jobs. Setting the configuration macros MIN_FLOCK_LEVEL and/or FLOCK_INCREMENT could help decrease the time until a job starts running.
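
As a rough sketch of what that could look like on the AP, assuming the flocking list already names at least two pools: the values below are illustrative only, not a recommendation for this site. As I understand the manual, MIN_FLOCK_LEVEL defaults to 0 and FLOCK_INCREMENT defaults to 1.

```
# Local configuration on the AP (submit node); values are assumptions for illustration.

# Always consider the first 2 pools in the flocking list,
# rather than waiting for jobs to sit idle and unmatched first.
MIN_FLOCK_LEVEL = 2

# When jobs remain unmatched, widen flocking by 2 pools per step
# instead of the default of 1.
FLOCK_INCREMENT = 2
```

After editing the configuration, running condor_reconfig on the AP should make the Schedd pick up the new values.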

-Cole Bollig

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Vikrant Aggarwal <ervikrant06@xxxxxxxxx>
Sent: Friday, October 6, 2023 3:58 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Job stays in queue for approx 20m before match making
 
Hello Experts, 

Sorry, in the last email I mentioned 20m, but it's actually approx 10m.


On Fri, Oct 6, 2023 at 4:51 PM Vikrant Aggarwal <ervikrant06@xxxxxxxxx> wrote:
Hello Experts,

We are seeing issues with the 9.0.17 submitter box (all-in-one) with multiple pools in the flocking list. The flocking pools are running version 8.8.5.

A job was submitted, but it wasn't even considered for matchmaking by the negotiator.

Logs from the submit node are below. I don't see any attempt to match the job in the Negotiator logs during this time.

10/06/23 10:18:30 (pid:1811906) job_transforms for 1129266.0: 5 considered, 5 applied
===== Lot of logs =====
10/06/23 10:29:50 (pid:1811906) Starting add_shadow_birthdate(1129266.0)

I do see messages about "Rebuilt prioritized runnable job list":
# awk '/10\/06\/23 10:18:30/,/10\/06\/23 10:29:51/ {print $0}' /var/log/condor/SchedLog | grep 'Rebuilt prioritized runnable job list in' | head
10/06/23 10:18:34 (pid:1811906) Rebuilt prioritized runnable job list in 0.014s.
10/06/23 10:18:52 (pid:1811906) Rebuilt prioritized runnable job list in 0.004s.
This bug [1] is already fixed in the version we are using on the submitter, and afaiu it only affects the submitter, not master or worker nodes. Is there anything else that could cause this issue?

[1] https://opensciencegrid.atlassian.net/browse/HTCONDOR-769

Thanks & Regards,
Vikrant Aggarwal