Re: [HTCondor-users] Running short-lived jobs on Condor
- Date: Thu, 18 Jun 2015 20:23:55 +0000
- From: "Rowe, Thomas" <rowet@xxxxxxxxxx>
- Subject: Re: [HTCondor-users] Running short-lived jobs on Condor
I'm running 8.2.
I will bet increasing DAGMAN_MAX_SUBMITS_PER_INTERVAL will fix the whole problem, thanks. I was unaware of this setting and the others. The problem boils down to the DAG submitting too slowly.
________________________________________
From: HTCondor-users [htcondor-users-bounces@xxxxxxxxxxx] on behalf of Brian Bockelman [bbockelm@xxxxxxxxxxx]
Sent: Thursday, June 18, 2015 4:01 PM
To: HTCondor-Users Mail List
Subject: Re: [HTCondor-users] Running short-lived jobs on Condor
Hi Thomas,
This is certainly something I’d expect HTCondor to handle well (it may be a matter of tuning). What version of HTCondor are you using?
Since you say the “queue is actually often empty”, I’d guess that DAGMan is not submitting fast enough. Here are a few suggestions:
# Maximum number of jobs DAGMan will submit per submit interval (default 5).
DAGMAN_MAX_SUBMITS_PER_INTERVAL=100
# Number of seconds between checks for submitting more jobs (default 5).
DAGMAN_USER_LOG_SCAN_INTERVAL=1
# Amount of time, in seconds, a slot will do work for a user/schedd before a new negotiation cycle is needed.
# Within this time, the schedd will instantly run another job on the slot (if there are jobs in the queue!).
# Default 1200.
CLAIM_WORKLIFE=3600
# How much time, in seconds, the schedd will hold onto the slot even if it has no jobs to run.
# This gives DAGMan some time to submit more if it runs out of jobs.
# Default 20.
DAGMAN_HOLD_CLAIM_TIME=60
Basically, this all avoids slots having to go back to the negotiator between short jobs. That’s what can cause a lot of your throughput loss!
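As a rough back-of-envelope check on why the submit rate matters here, the sketch below estimates how many jobs can be running at once when DAGMan submits a fixed number of jobs per interval. It is only an illustration under simplifying assumptions: the function name and its parameters are hypothetical, it treats the two DAGMan settings above as a plain rate limit, and it ignores negotiation and job start-up overhead entirely.

```python
def max_concurrent_jobs(submits_per_interval, interval_seconds,
                        job_runtime_seconds, slots):
    """Estimate the steady-state number of busy slots (hypothetical helper).

    Assumes jobs arrive at a constant rate and each runs for a fixed time,
    so jobs in flight = arrival rate * runtime (Little's law), capped by
    the number of slots in the pool.
    """
    submit_rate = submits_per_interval / interval_seconds  # jobs per second
    in_flight = submit_rate * job_runtime_seconds
    return min(slots, in_flight)


# Defaults quoted in this thread: 5 submits every 5 seconds, 40-second jobs, 80 slots.
defaults = max_concurrent_jobs(5, 5, 40, 80)

# Suggested values: 100 submits every 1 second, same jobs and pool.
tuned = max_concurrent_jobs(100, 1, 40, 80)

print(defaults, tuned)
```

With the defaults, this estimate caps out at about 40 busy slots for 40-second jobs, which is in the same range as the roughly 30 slots reported in the original question once real overheads are added. With the suggested values, the submit rate is no longer the limiting factor and the 80 slots become the cap.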
Hope this helps,
Brian
> On Jun 18, 2015, at 1:23 PM, Rowe, Thomas <rowet@xxxxxxxxxx> wrote:
>
> The simulation replications I'm running can take anywhere between thirty seconds and three days. I have 80 slots on this network. Everything is great if runtimes are up towards a half hour: all slots are kept busy grinding away. But if I submit five hundred jobs that take 40 seconds each, I see at most about 30 slots put to use. The queue is actually often empty and all slots are idle for one-minute stretches, except for the dagman job.
>
> I played around with the NEGOTIATOR_INTERVAL setting, dropping it down to 20 seconds but that didn't seem to have too much impact.
>
> What can I do to make it so that short-running jobs don't result in a mostly idle cluster? There are many *_INTERVAL settings and it's not exactly obvious which knobs to turn. "HTCondor can't handle that case well" is a perfectly valid answer if that's the case.
>
> Same question nine years ago without clear answers: https://lists.cs.wisc.edu/archive/htcondor-users/2006-September/msg00255.shtml
> _______________________________________________
> HTCondor-users mailing list
> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
> subject: Unsubscribe
> You can also unsubscribe by visiting
> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
>
> The archives can be found at:
> https://lists.cs.wisc.edu/archive/htcondor-users/