
Re: [HTCondor-users] Some jobs of batch stays in idle state for longer time



Here is an old discussion (with no resolution) that looks similar to the issue I described:

https://www-auth.cs.wisc.edu/lists/htcondor-users/2018-October/msg00156.shtml

Thanks & Regards,
Vikrant Aggarwal



On Mon, Dec 19, 2022 at 2:25 PM Vikrant Aggarwal <ervikrant06@xxxxxxxxx> wrote:
Hello David,

Thanks for your reply. It's a bit of a relief to hear that we are not the only ones hitting this issue.

More information regarding our environment:

- Partitionable slots
- Schedd-level splitting enabled

I believe these could be relevant after seeing the following message in the SchedLog file for the job that was the only one, out of 179, to stay idle:

# grep '154439.159' /var/log/condor/SchedLog.old
12/13/22 14:44:51 (pid:19397) job_transforms for 154439.159: 1 considered, 1 applied (SetTeam)
12/13/22 14:50:22 (pid:19397) Request was NOT accepted for claim slot1@xxxxxxxxx <xx.xx.xx.xx:9618?addrs=xx.xx.xx.xx-9618&noUDP&sock=56761_e73e_3> for test.user1 154439.159
12/13/22 14:50:22 (pid:19397) Match record (slot1@xxxxxxxxx <xx.xx.xx.xx:9618?addrs=xx.xx.xx.xx-9618&noUDP&sock=56761_e73e_3> for test.user1, 154439.159) deleted
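
For context, by "schedd-level splitting" I mean the schedd claiming the leftovers of a partitionable slot itself. A hedged sketch of what such a configuration typically looks like (illustrative values, not our exact production config):

# startd: a single partitionable slot covering the whole machine
NUM_SLOTS = 1
NUM_SLOTS_TYPE_1 = 1
SLOT_TYPE_1 = 100%
SLOT_TYPE_1_PARTITIONABLE = True
# allow the schedd to split the claimed partitionable slot's leftovers
CLAIM_PARTITIONABLE_LEFTOVERS = True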

However, the same "Request was NOT accepted" message appears for other jobs as well, and those jobs didn't stay idle for long; the job above was stuck in the idle state for more than 15 minutes.
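
To see how widespread these rejections were, a hedged one-liner (the job id is the last field of these SchedLog lines):

# grep 'Request was NOT accepted' /var/log/condor/SchedLog.old | awk '{print $NF}' | sort | uniq -c | sort -rn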

# grep 'node1.com' /var/log/condor/SchedLog.old | grep '12/13/22 14:'
12/13/22 14:48:58 (pid:19397) Request was NOT accepted for claim slot1@xxxxxxxxx <xx.xx.xx.xx1:9618?addrs=xx.xx.xx.xx1-9618&noUDP&sock=56761_e73e_3> for test.user1 154439.109
12/13/22 14:48:58 (pid:19397) Match record (slot1@xxxxxxxxx <xx.xx.xx.xx1:9618?addrs=xx.xx.xx.xx1-9618&noUDP&sock=56761_e73e_3> for test.user1, 154439.109) deleted
12/13/22 14:49:15 (pid:19397) Request was NOT accepted for claim slot1@xxxxxxxxx <xx.xx.xx.xx1:9618?addrs=xx.xx.xx.xx1-9618&noUDP&sock=56761_e73e_3> for test.user1 154439.119
12/13/22 14:49:15 (pid:19397) Match record (slot1@xxxxxxxxx <xx.xx.xx.xx1:9618?addrs=xx.xx.xx.xx1-9618&noUDP&sock=56761_e73e_3> for test.user1, 154439.119) deleted
12/13/22 14:49:34 (pid:19397) Request was NOT accepted for claim slot1@xxxxxxxxx <xx.xx.xx.xx1:9618?addrs=xx.xx.xx.xx1-9618&noUDP&sock=56761_e73e_3> for test.user1 154439.130
12/13/22 14:49:34 (pid:19397) Match record (slot1@xxxxxxxxx <xx.xx.xx.xx1:9618?addrs=xx.xx.xx.xx1-9618&noUDP&sock=56761_e73e_3> for test.user1, 154439.130) deleted

If we look at the logs of a job that was successfully matched after its first rejection, the rematch happened very quickly: node1.com didn't accept the request, but node2.com did.

# grep '154439.173' /var/log/condor/SchedLog.old
12/13/22 14:44:51 (pid:19397) job_transforms for 154439.173: 1 considered, 1 applied (SetTeam)
12/13/22 14:50:40 (pid:19397) Request was NOT accepted for claim slot1@xxxxxxxxx <xx.xx.51.71:9618?addrs=xx.xx.51.71-9618&noUDP&sock=56761_e73e_3> for test.user1 154439.173
12/13/22 14:50:40 (pid:19397) Match record (slot1@xxxxxxxxx <xx.xx.51.71:9618?addrs=xx.xx.51.71-9618&noUDP&sock=56761_e73e_3> for test.user1, 154439.173) deleted
12/13/22 14:50:42 (pid:19397) match (slot1@xxxxxxxxx <xx.xx.85.143:9618?addrs=xx.xx.85.143-9618&noUDP&sock=3132923_2bce_3> for test.user1) switching to job 154439.173
12/13/22 14:50:42 (pid:19397) Shadow pid 1628682 switching to job 154439.173.
12/13/22 14:50:42 (pid:19397) Starting add_shadow_birthdate(154439.173)
12/13/22 14:52:12 (pid:19397) Shadow pid 1628682 for job 154439.173 reports job exit reason 100.
12/13/22 14:52:12 (pid:19397) Match record (slot1@xxxxxxxxx <xx.xx.85.143:9618?addrs=xx.xx.85.143-9618&noUDP&sock=3132923_2bce_3> for test.user1, 154439.173) deleted
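
One thing we could try next time is raising the schedd's log verbosity beforehand, so the reason behind "Request was NOT accepted" gets captured. A hedged sketch, assuming we can drop a snippet into the schedd's configuration:

SCHEDD_DEBUG = D_FULLDEBUG
# then: condor_reconfig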


As I mentioned earlier, this issue is not reproducible on demand, but we are hitting it often.

For the above case I don't have the logs from the /var/log/condor/StartLog file on the worker node(s), as they have already been rotated.
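
To avoid losing them next time, the workers could keep more rotated StartLog copies. A hedged sketch; MAX_NUM_STARTD_LOG is, to my knowledge, the knob controlling how many rotated files are kept:

MAX_STARTD_LOG = 100 Mb
MAX_NUM_STARTD_LOG = 10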

Community: any idea what else could be helpful to troubleshoot the long queue times of these jobs?
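
For instance, would the stuck jobs' match history tell us anything? A hedged sketch using standard job attributes (JobStatus 1 = Idle):

# condor_q -constraint 'JobStatus == 1' -af:j QDate LastMatchTime LastRejMatchTime LastRejMatchReason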


Thanks & Regards,
Vikrant Aggarwal


On Mon, Nov 7, 2022 at 10:09 PM Dudu Handelman <duduhandelman@xxxxxxxxxxx> wrote:
Hi Vikrant.
I'm glad that I'm not the only one.

If you turn on debug logging you will see that the negotiator keeps a cached job list, which is probably not accurate. The submitter refreshes the job list when you hold and release a job, or when you submit a new one.
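
A hedged sketch of what I mean by turning on debug, on the central manager (NEGOTIATOR_DEBUG is the standard knob; reconfig afterwards and watch the NegotiatorLog):

NEGOTIATOR_DEBUG = D_FULLDEBUG
# condor_reconfig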

We should help the HTCondor team find the reason.

I have not seen this recently.

Thanks
David





From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Vikrant Aggarwal <ervikrant06@xxxxxxxxx>
Sent: Monday, November 7, 2022, 18:03
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Some jobs of batch stays in idle state for longer time

Hi Thomas,

Yes, we did; it was showing available cores, though we didn't try it against a particular machine.

I suspect that sometimes the leftover jobs of a batch take a lot of time to get matched. One hacky solution works most of the time: holding and then releasing the idle jobs makes them match quickly.
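
A hedged sketch of that workaround, scoped to a single cluster so unrelated jobs are untouched (154439 is just an example cluster id; JobStatus 1 = Idle, 5 = Held):

# condor_hold -constraint 'ClusterId == 154439 && JobStatus == 1'
# condor_release -constraint 'ClusterId == 154439 && JobStatus == 5'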

Thanks & Regards,
Vikrant Aggarwal


On Mon, Nov 7, 2022 at 8:25 PM Thomas Hartmann <thomas.hartmann@xxxxxxx> wrote:
Hi Vikrant,

have you checked for more matching details with

  condor_q -better-analyze job.id

That should give a bit more detail.

Vice versa, with -reverse-analyze / -better-analyze:reverse you can
compare a machine/slot against a job's requirements.
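
For example, with ids like the ones in this thread (a hedged sketch; -mconstraint is how I'd restrict the reverse analysis to a single machine):

# condor_q -better-analyze 154439.159
# condor_q -better-analyze:reverse -mconstraint 'Machine == "node1.com"' 154439.159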

Cheers,
  Thomas

On 07/11/2022 15.08, Vikrant Aggarwal wrote:
> Hello Experts,
>
> We have seen issues where some jobs of a batch stay in the idle state. We are using schedd-level splitting.
>
> Let's say a batch of 300 jobs is submitted and cores are available to match, say, 280-290 of them. The negotiator matches 280 jobs; the remaining 20 stay idle even though cores are available in the cluster, yet if we submit a new batch it gets scheduled immediately.
>
> Is HTCondor also considering the time spent by a job in the queue when scheduling? Maybe it is considering new jobs more quickly?
>
> Time spent by jobs in the queue sometimes goes up to 45 minutes despite cores being available in the cluster.
>
> The master's logs show only "no match found" for the 20 idle jobs, yet resources can be found for new jobs.
>
>
> Thanks & Regards,
> Vikrant Aggarwal
>

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/