Re: [condor-users] Speeding up condor_submit (was Speeding up DAGman submits)
- Date: Tue, 11 May 2004 09:35:44 -0700
- From: "David E. Konerding DSD Staff" <dekonerding@xxxxxxx>
- Subject: Re: [condor-users] Speeding up condor_submit (was Speeding up DAGman submits)
First, let me recommend reading a short post made to condor-users a
while ago by Doug Thain:
http://www.cs.wisc.edu/~lists/archive/condor-users/msg00919.html
In part, he says:
Please keep in mind that Condor is a high *throughput* system designed
to execute large workloads over long time periods. It is *not* designed
to be a low latency system that executes a single job quickly. Condor
performs a large number of expensive operations in order to maximize
scalability and reliability at the expense of latency.
Take this to heart. Condor is targeted at high throughput, not high
performance. Condor is not tuned to start up jobs in seconds. If you
need reliability and scalability, Condor is a good match.
Thanks for the response, Alain.
I'm actually quite familiar with the latency and throughput issues
associated with batch computing systems. That said, my personal
observations suggest that there is some improvement that could be made
with respect to DAGman (see below).
1) using the 'test job feature' for fast turnaround time. Can this
be applied to DAGman jobs?
What test job feature are you referring to?
Section 3.6 of the manual caught my eye (there is something wrong with
the letter f in the manual PDF that causes it to be pasted weirdly):
Test-job Policy Example

This example shows how the default macros can be used to set up a
machine for running test jobs from a specific user. Suppose we want the
machine to behave normally, except if user coltrane submits a job. In
that case, we want that job to start regardless of what is happening on
the machine. We do not want the job suspended, vacated or killed. This
is reasonable if we know coltrane is submitting very short running
programs for testing purposes. The jobs should be executed right away.
This works with any machine (or the whole pool, for that matter) by
adding the following 5 expressions to the existing configuration:
START = ($(START)) || Owner == "coltrane"
SUSPEND = ($(SUSPEND)) && Owner != "coltrane"
CONTINUE = $(CONTINUE)
PREEMPT = ($(PREEMPT)) && Owner != "coltrane"
KILL = $(KILL)
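
As a quick sanity check on my end, here is roughly how I would expect
to try this on one execute machine (a sketch only; the local config
file is whatever LOCAL_CONFIG_FILE points to on a given installation):

# 1. Append the five expressions above to the machine's local config
#    file (the one named by LOCAL_CONFIG_FILE).
# 2. Have the running daemons re-read their configuration:
condor_reconfig
# 3. Confirm the policy the daemons now see:
condor_config_val START
condor_config_val PREEMPT
# 4. As user coltrane, a short test job should then start right away:
condor_submit ls.job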
2) The matchmaking cycle runs every five minutes, except when jobs are
submitted. When you submit a job, it will start a new matchmaking
cycle as soon as it can (perhaps it's already in the middle of
matchmaking) unless it started a matchmaking cycle within the last 20
(25?) seconds. This number is tunable, but the point is that
matchmaking doesn't happen constantly.
OK, this is highly appropriate (I hadn't realized job submission started
a new matchmaking cycle, although I should have picked that up from the
manual).
Nevertheless, the observation I've made is this:
1) if I submit a DAG element (a specific .job file) with condor_submit,
it runs nearly immediately (within a second). This is almost certainly
due to the matchmaking cycle starting when I submit the job, and the
job being matched quickly. That's working great.
2) If I submit the full DAG, which then submits the same .job file, that
job sits idle for 20-25 seconds before reaching the run state.
Working from this observation and the observation in #1 above, I
suspect that when DAGman submits the .job file, it does not invoke a
new matchmaking cycle. I have never seen it take more than 20-25
seconds, so I don't think the negotiator time interval of 300 seconds
is an issue here. Given that our cluster is small, and matchmaking
probably doesn't take very long, would reducing the negotiator time
interval to 1 make it likely that jobs would go from idle to running
more quickly? (There is a configuration sketch after point 3 below.)
3) I assume that file transfer only happens once the job is running,
not while it is listed as idle. If that's not the case, then I suspect
that deferring it (along with several other aspects noted on the
mailing list as affecting job startup time) could shave a short amount
of time off the job start.
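
To make the question in point 2 concrete, this is the sort of change I
have in mind for the central manager's configuration. I am assuming
the relevant knobs are NEGOTIATOR_INTERVAL (the 300-second periodic
cycle) and NEGOTIATOR_CYCLE_DELAY (the 20-25 second minimum gap between
cycles); I have not verified those names against our Condor version,
so treat this as a sketch rather than a tested recipe:

# Negotiator tuning sketch; parameter names and defaults should be
# checked against the manual for the installed Condor version.
# Values are in seconds.
# How often a negotiation cycle starts even if nothing new is submitted.
NEGOTIATOR_INTERVAL = 60
# Minimum delay between consecutive cycles; this looks like the source
# of the 20-25 second wait when DAGman submits a node.
NEGOTIATOR_CYCLE_DELAY = 5

After changing these on the central manager, a condor_reconfig there
should be enough to pick them up.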
At this point it sounds like I need to do a bit more peering at the
log files in real time, as well as running strace on the Condor daemons
to see what time interval they are providing to select().
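
Concretely, I am thinking of something along these lines (the daemon
log locations and the use of pidof are assumptions about our setup):

# Watch the timeout the schedd hands to select(); -tt timestamps each
# call, -f follows forked children.
strace -f -tt -e trace=select -p $(pidof condor_schedd)

# Meanwhile, follow the schedd and negotiator logs while a DAG runs.
tail -f /home/condor/log/SchedLog /home/condor/log/NegotiatorLog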
One more thing that I noticed in the Condor manual is that DAGman jobs
are submitted to the scheduler universe and thus always run immediately
on the local machine. It seems that I should be able to make my .job
file submit to the scheduler universe and see no time delay between
DAGman submitting the job and it running. Yet there is still a five
second interval between submission and execution (that probably
explains the 5 second component of the 25 seconds):
5/11 09:33:32 submitting: condor_submit -a 'dag_node_name = A' -a '+DAGManJobID = 244.0' -a 'submit_event_notes = DAG Node: $(dag_node_name)' ls.job 2>&1
5/11 09:33:32 assigned Condor ID (245.0.0)
5/11 09:33:32 Just submitted 1 job this cycle...
5/11 09:33:32 Event: ULOG_SUBMIT for Condor Job A (245.0.0)
5/11 09:33:32 Of 2 nodes total:
5/11 09:33:32   Done     Pre   Queued    Post   Ready   Un-Ready   Failed
5/11 09:33:32    ===     ===      ===     ===     ===        ===      ===
5/11 09:33:32      0       0        1       0       0          1        0
5/11 09:33:37 Event: ULOG_EXECUTE for Condor Job A (245.0.0)
5/11 09:33:37 Event: ULOG_JOB_TERMINATED for Condor Job A (245.0.0)
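
For reference, this is roughly what I would try for the
scheduler-universe experiment, i.e. rewriting the node's .job file so
it runs on the submit machine itself. This is only a sketch: /bin/ls
stands in for the real short test job, and I have not checked which
submit commands the scheduler universe honors.

# ls.job, as a scheduler-universe test (sketch)
universe   = scheduler
executable = /bin/ls
output     = ls.out
error      = ls.err
log        = ls.log
queue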