Subject: Re: [HTCondor-users] jobs with disjoint requirements
Hello Don, and everyone,
One important thing to remember about
HTCondor is that you no longer have "queues" in the usual Grid
Engine sense. This was one of the concepts I had the most difficulty conveying
to my users as we migrated from SGE to HTCondor; old habits died hard.
One of the numerous Grid Engine workarounds
they'd written into their submission tools over the years was to only submit
one job every few seconds, to give multiple submitters a chance to interleave
their jobs in the single-file Grid Engine queue that had been set up years
earlier. This usually meant dozens or hundreds of cores sat idle for hours
on end, especially with short-running jobs, which, when you've spent as
much money as they had on the exec nodes, is a pretty grim state of affairs.
Once they got the hang of the idea that
job sequence and priority are calculated at every negotiation cycle for
every job, and after enough runs of "condor_userprio," I was
able to get them to write submissions so that they could submit ten thousand
jobs in a matter of seconds, rather than nearly fourteen hours with a "sleep
5" between each "qsub." The negotiator takes care of dividing
up the resources fairly among the multiple users so that nobody has to
wait for all 10,000 jobs to finish before their own jobs run, and they
just don't have to worry about it anymore.
If your own jobs are the only ones contending
for the resources, I think what you may be after is accounting groups,
rather than a requirements expression.
For example, my test pool has seven
slots. I create a submit description like so:
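# Sketch of the idea; the universe and executable lines are incidental.
# The accounting_group_user settings and the two queue statements are
# what matter here.
universe              = vanilla
executable            = /bin/sleep
arguments             = 120

accounting_group_user = pelletm_batchA
queue 20

accounting_group_user = pelletm_batchB
queue 20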
This creates 40 jobs, half of which are tied to the "batchA" accounting
group user (processes .0 through .19) and half to "batchB" (processes
.20 through .39).
Look what happened:
-- Submitter: condor1 : <138.127.79.182:54201> : condor1
 ID      OWNER      SUBMITTED     RUN_TIME ST PRI SIZE CMD
27.0    pelletm    10/26 11:24   0+00:00:02 R  0   0.0  sleep 120
27.1    pelletm    10/26 11:24   0+00:00:02 R  0   0.0  sleep 120
27.2    pelletm    10/26 11:24   0+00:00:02 R  0   0.0  sleep 120
27.3    pelletm    10/26 11:24   0+00:00:00 I  0   0.0  sleep 120
...
27.19   pelletm    10/26 11:24   0+00:00:00 I  0   0.0  sleep 120
27.20   pelletm    10/26 11:24   0+00:00:02 R  0   0.0  sleep 120
27.21   pelletm    10/26 11:24   0+00:00:02 R  0   0.0  sleep 120
27.22   pelletm    10/26 11:24   0+00:00:02 R  0   0.0  sleep 120
27.23   pelletm    10/26 11:24   0+00:00:00 I  0   0.0  sleep 120
...
The negotiator assigned six slots in
the pool off the bat, and half went to batchA and half to batchB. Here's
the condor_userprio output a bit later, once the seventh slot was claimed:
condor1$ condor_userprio
Last Priority Update: 10/26 11:26
                      Effective   Priority    Res   Total Usage  Time Since
User Name              Priority     Factor  In Use  (wghted-hrs) Last Usage
-------------------  ----------  ---------  ------  ------------ ----------
pelletm_batchB@doma      502.41    1000.00       3          0.10      <now>
pelletm_batchA@doma      502.89    1000.00       4          0.12      <now>
-------------------  ----------  ---------  ------  ------------ ----------
Number of users: 2                               7          0.22    0+23:59
The accounting_group_user specified
in the submit description resulted in two separate "users" for
a single submission, and the resources will be fair-share divided between
them by the negotiator. We'd expect to see the assignment of the seventh
slot oscillate back and forth between the two as the total usage figure
reflects the use of that odd slot.
If you want to divide up the resources
unevenly, you'd set up an accounting group in the pool's configuration
with a different priority factor, and direct the jobs into it using the
"accounting_group" submit value.
With respect to large numbers of queued
jobs: I have one group of users who submit about a quarter to half a
million short-running jobs on a fairly regular basis. I gave them their
own private scheduler so that people could still run condor_q on the main
scheduler* without timing out, but given that setup I don't generally find
it's a bad thing to have very large numbers of jobs queued. It makes it
easier for the users, as they no longer have to leave their submission
running for hours and hours on end while empty slots wait for work, or
write a DAG just to avoid having too many idle jobs. The highest peak I've
seen was about 800,000 jobs waiting on a Friday evening. It brightens my
weekend to think about every last scrap of available CPU power fully utilized
all weekend long, and the toasty warm air flowing from the back of the
systems.
---
*HTCondor defaults to a scheduler on
each member of the pool, but setting SCHEDD_HOST and having a single central
scheduler was, in part, a concession to those same old Grid Engine habits,
where there would have been panic in the hallways if the equivalent of
"qstat" returned different results on different machines. That,
and the scale of the pool coupled with the condor_shadow processes and
the network topology meant that a beefy schedd host on the same network
as the exec nodes works better for us.
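The knob itself is one line in the pool-wide configuration (the host name
here is a placeholder):

SCHEDD_HOST = submit01.example.com

which points condor_submit and condor_q on every member at the single
central schedd.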
Michael V. Pelletier
IT Program Execution
Principal Engineer
978.858.9681 (5-9681) NOTE NEW NUMBER
339.293.9149 cell
339.645.8614 fax
michael.v.pelletier@xxxxxxxxxxxx