
Re: [HTCondor-users] MAX_JOBS_SUBMITTED exceeded, submit failed. Current total is 499999. Limit is 50000



Thanks Jaime.

Appreciate your help.

Thanks & Regards,
Vikrant Aggarwal


On Thu, Mar 30, 2023 at 11:19 PM Jaime Frey <jfrey@xxxxxxxxxxx> wrote:
I haven't been able to figure out how your schedd's count of queued jobs is getting artificially high, but HTCondor 10.0.4 will include a fix to correct the count when this happens:
https://opensciencegrid.atlassian.net/browse/HTCONDOR-1688

 - Jaime

On Mar 13, 2023, at 6:29 AM, Vikrant Aggarwal <ervikrant06@xxxxxxxxx> wrote:

Hello Jaime,

Thanks for your reply.

No, we don't use materialization or the factory option.




Thanks & Regards,
Vikrant Aggarwal


On Fri, Mar 10, 2023 at 3:28 AM Jaime Frey via HTCondor-users <htcondor-users@xxxxxxxxxxx> wrote:
From looking at the code and doing a simple test on a personal condor, MAX_JOBS_SUBMITTED should be a limit on the number of jobs currently queued in the schedd. It sounds like we have a bug in tracking the job count for this check.
Do either of you use the -factory option or the max_materialize command when submitting jobs?

 - Jaime

On Mar 8, 2023, at 10:05 AM, Beyer, Christoph <christoph.beyer@xxxxxxx> wrote:

Hmm,

MAX_JOBS_SUBMITTED This integer value limits the number of jobs permitted in a condor_schedd daemon's
queue. Submission of a new cluster of jobs fails, if the total number of jobs would exceed this limit. The default
value for this variable is the largest positive integer value.
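
A quick check of the "largest positive integer value" wording: the value 2147483647 that condor_config_val reports elsewhere in this thread equals 2**31 - 1. The sketch below only verifies that arithmetic. That HTCondor stores the setting as a signed 32-bit integer is an assumption drawn from the number itself, not something stated in the thread.

```python
# 2**31 - 1 is the largest value a signed 32-bit integer can hold.
# Assumption: the default is stored as a signed 32-bit integer;
# the thread only shows the resulting number.
INT32_MAX = 2**31 - 1

reported_default = 2147483647  # value printed by condor_config_val in this thread
print(INT32_MAX == reported_default)
```

So the default reads as a ceiling that is effectively never reached, not as a counter of jobs.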

This explains also the default in my case ;)

I think at least held jobs are counted as part of the schedd's queue (in addition to running and idle jobs), and I suspect the count might cover roughly all the jobs recorded in the job_queue.log file, which is rotated once a day by default. That file would contain many more jobs than you see with condor_q. This is pure speculation, but it would explain the behaviour you see.

We use

MAX_JOBS_PER_OWNER This integer value limits the number of jobs any given owner (user) is permitted to have
within a condor_schedd daemon's queue. A job submission fails if it would cause this limit on the number of
jobs to be exceeded. The default value is 100000.

for the same effect ...

best
christoph

--
Christoph Beyer
DESY Hamburg
IT-Department

Notkestr. 85
Building 02b, Room 009
22607 Hamburg

phone:+49-(0)40-8998-2317
mail: christoph.beyer@xxxxxxx


From: "Vikrant Aggarwal" <ervikrant06@xxxxxxxxx>
To: "HTCondor-Users Mail List" <htcondor-users@xxxxxxxxxxx>
Sent: Wednesday, 8 March 2023 16:37:26
Subject: Re: [HTCondor-users] MAX_JOBS_SUBMITTED exceeded, submit failed. Current total is 499999. Limit is 50000

Hello,
Yes, we use it as a protection mechanism so that an accidental submission of too many jobs doesn't kill the schedd. The limit applies to all users of the submit box.

But when we hit this limit error, we have no jobs in the queue, or sometimes very few, which baffles me.

Thanks & Regards,
Vikrant Aggarwal


On Wed, Mar 8, 2023 at 8:27 PM Beyer, Christoph <christoph.beyer@xxxxxxx> wrote:
Hi,

Yes, I think you are right. I was a bit confused: since I do not set the value, I thought it might be a kind of counter, but it seems the default is just set to a very high number ...

[root@bird-htc-sched11 ~]# condor_config_val -v MAX_JOBS_SUBMITTED
MAX_JOBS_SUBMITTED = 2147483647
 # at: <Default>
 # raw: MAX_JOBS_SUBMITTED = 2147483647

Is this kind of ceiling for the schedd what you need?

Best
christoph


--
Christoph Beyer
DESY Hamburg
IT-Department

Notkestr. 85
Building 02b, Room 009
22607 Hamburg

phone:+49-(0)40-8998-2317
mail: christoph.beyer@xxxxxxx


From: "Vikrant Aggarwal" <ervikrant06@xxxxxxxxx>
To: "HTCondor-Users Mail List" <htcondor-users@xxxxxxxxxxx>
Sent: Wednesday, 8 March 2023 14:54:40
Subject: Re: [HTCondor-users] MAX_JOBS_SUBMITTED exceeded, submit failed. Current total is 499999. Limit is 50000

Hello,

After changing MAX_JOBS_SUBMITTED, I restarted the condor schedd service.

# grep restarted /var/log/condor/ScheddRestartReport
The schedd el6study2.skae.tower-research.com restarted at 03/08/23 05:10:40.

$ condor_config_val MAX_JOBS_SUBMITTED
5

I submitted a batch of 5 jobs.

$ condor_submit sleep.sub
Submitting job(s).....
5 job(s) submitted to cluster 313.

Trying to submit another batch fails because I already have 5 jobs in the queue.

$ condor_submit sleep.sub
Submitting job(s)
ERROR: Failed to create cluster
Number of submitted jobs would exceed MAX_JOBS_SUBMITTED

If I wait for the existing jobs to complete, I can submit another 5 jobs without any issue. This makes me believe that the parameter relates to the jobs present in the queue, irrespective of their status (held/running/idle). I don't think it relates to the total number of jobs ever submitted to the schedd.
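
The experiment above can be mirrored with a toy model. This is a minimal sketch that follows the manual's wording ("submission of a new cluster fails if the total number of jobs would exceed this limit"); it is not the schedd's actual code, and the class and method names are made up for illustration.

```python
class QueueLimitModel:
    """Toy model of a per-schedd cap on queued jobs (hypothetical, not HTCondor code)."""

    def __init__(self, max_jobs_submitted):
        self.max_jobs_submitted = max_jobs_submitted
        self.queued = 0  # jobs currently in the queue, whatever their state

    def submit(self, n_jobs):
        """Accept the cluster only if the queue stays within the limit."""
        if self.queued + n_jobs > self.max_jobs_submitted:
            return False
        self.queued += n_jobs
        return True

    def complete(self, n_jobs):
        """Jobs that finish leave the queue and free capacity."""
        self.queued = max(0, self.queued - n_jobs)


model = QueueLimitModel(max_jobs_submitted=5)
first = model.submit(5)    # accepted: 0 + 5 does not exceed 5
second = model.submit(5)   # rejected: 5 + 5 would exceed 5
model.complete(5)          # the first batch finishes
third = model.submit(5)    # accepted again
print(first, second, third)
```

Under this model the limit tracks what is queued right now, which matches the behaviour observed with the limit set to 5.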


Thanks & Regards,
Vikrant Aggarwal


On Wed, Mar 8, 2023 at 12:48 PM Beyer, Christoph <christoph.beyer@xxxxxxx> wrote:
Hi,

it seems to me that, at least on my schedds, MAX_JOBS_SUBMITTED is indeed the number of jobs the schedd has dealt with since the last boot (I suppose).

At least this is definitely not the current number of jobs on this schedd:

[root@bird-htc-sched11 ~]# condor_config_val MAX_JOBS_SUBMITTED
2147483647

;)

Hence it looks to me as if MAX_JOBS_SUBMITTED should not be set at all unless you want to stop scheduling after a certain number of jobs?

Maybe MAX_JOBS_PER_OWNER is more likely to do what you want (limiting the number of jobs per owner on the schedd)?

Best
christoph


--
Christoph Beyer
DESY Hamburg
IT-Department

Notkestr. 85
Building 02b, Room 009
22607 Hamburg

phone:+49-(0)40-8998-2317
mail: christoph.beyer@xxxxxxx


From: "Vikrant Aggarwal" <ervikrant06@xxxxxxxxx>
To: "HTCondor-Users Mail List" <htcondor-users@xxxxxxxxxxx>
Sent: Wednesday, 8 March 2023 08:06:16
Subject: Re: [HTCondor-users] MAX_JOBS_SUBMITTED exceeded, submit failed. Current total is 499999. Limit is 50000

Thanks Jaime,
But we don't have this many jobs in the queue. The batch we are trying to submit has only a handful of jobs, yet we still hit the max job limit.

03/07/23 21:46:46 (pid:55697) NewCluster(): MAX_JOBS_SUBMITTED exceeded, submit failed. Current total is 300027. Limit is 300000

03/07/23 22:11:09 (pid:55697) NewCluster(): MAX_JOBS_SUBMITTED exceeded, submit failed. Current total is 300000. Limit is 300000
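
To see how often the schedd's idea of the queue size drifts, the "Current total" and "Limit" values can be pulled out of log lines like the two above. This is a minimal sketch: the regular expression is modelled only on these two sample lines, and the function name is hypothetical. Other HTCondor versions may word the message differently.

```python
import re

# Pattern built from the two log lines quoted in this message; the exact
# wording in other HTCondor versions is an assumption.
LIMIT_RE = re.compile(
    r"MAX_JOBS_SUBMITTED exceeded, submit failed\. "
    r"Current total is (\d+)\. Limit is (\d+)"
)


def parse_limit_errors(lines):
    """Return a (current_total, limit) pair for each matching log line."""
    hits = []
    for line in lines:
        match = LIMIT_RE.search(line)
        if match:
            hits.append((int(match.group(1)), int(match.group(2))))
    return hits


sample = [
    "03/07/23 21:46:46 (pid:55697) NewCluster(): MAX_JOBS_SUBMITTED exceeded, "
    "submit failed. Current total is 300027. Limit is 300000",
    "03/07/23 22:11:09 (pid:55697) NewCluster(): MAX_JOBS_SUBMITTED exceeded, "
    "submit failed. Current total is 300000. Limit is 300000",
]

for total, limit in parse_limit_errors(sample):
    print(f"schedd believes {total} jobs are queued (limit {limit})")
```

Comparing the extracted totals with the number of jobs condor_q actually shows at the same time would make the mismatch described here easy to document.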


It's happening randomly, but often, on a few submit nodes (not all). All submit nodes have the same configuration.

Thanks & Regards,
Vikrant Aggarwal


On Wed, Feb 8, 2023 at 9:24 PM Jaime Frey via HTCondor-users <htcondor-users@xxxxxxxxxxx> wrote:
> On Feb 7, 2023, at 11:01 AM, Todd L Miller via HTCondor-users <htcondor-users@xxxxxxxxxxx> wrote:
>
>> We hit this issue multiple times: Issue disappears if we restart the condor
>> service or change the MAX_JOBS_SUBMITTED limit.
>
>    You probably shouldn't be setting MAX_JOBS_SUBMITTED at all. It's a cap on the total number of clusters a schedd is willing to have dealt with for its entire life. What are you trying to accomplish?


This is incorrect. MAX_JOBS_SUBMITTED is a cap on the number of jobs that can be queued at any given time.

 - Jaime
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/
