Re: [Condor-users] Can a job send a trigger to let other jobs start?
- Date: Tue, 15 Dec 2009 10:44:44 -0600
- From: Greg Thain <gthain@xxxxxxxxxxx>
- Subject: Re: [Condor-users] Can a job send a trigger to let other jobs start?
> The jobs (usually 2000-4000) are started via DAGMan and read a lot of data
> initially (about 2-3 GByte per job). After that they crunch through the
> loaded data for a couple of hours. This initial start-up phase puts quite a
> lot of load on the central data server, so we would like a way to limit it.
>
> DAGMan's maxjobs feature could solve this, but it would only start new jobs
> once earlier jobs have finished entirely.
>
> So my question is: is there a way to limit the initial number of jobs and
> send a "trigger" to DAGMan to start more jobs once the running jobs are done
> loading their data sets?
What a great question! You could use a DAGMan PRE script on each node to
poll the load and, as long as it is above some threshold, sleep for a
random period and re-poll. The script could poll the data server's load
directly, perhaps, if there's a way to do that. Or it could run condor_q
and count the number of jobs that have been running for less than an hour
(if the start-up phase takes about an hour). Or perhaps the jobs
themselves could use chirp or condor_qedit to set a job attribute in the
schedd to indicate which phase they are in, and the PRE script could poll
for that.
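
As a rough illustration of the condor_q variant, here is a minimal sketch of
such a PRE script in Python. The script name, the threshold, the one-hour
window and the sleep bounds are all assumptions made up for the example, not
anything Condor prescribes. It also assumes that the JobCurrentStartDate
attribute is a good enough proxy for "this job is still loading", and that
condor_q run from the PRE script lists the jobs you care about.

```python
#!/usr/bin/env python3
"""Hypothetical DAGMan PRE script: wait until few enough jobs are still loading."""
import random
import subprocess
import sys
import time

LOAD_WINDOW = 3600  # assumed length of the start-up phase, in seconds


def count_recent_starts(condor_q_output, now, window=LOAD_WINDOW):
    """Count start times (epoch integers, whitespace separated) newer than now - window."""
    starts = [int(tok) for tok in condor_q_output.split()]
    return sum(1 for s in starts if now - s < window)


def query_loading_jobs():
    """Ask condor_q for the start time of every running job (JobStatus == 2)."""
    out = subprocess.run(
        ["condor_q", "-constraint", "JobStatus == 2",
         "-format", "%d\n", "JobCurrentStartDate"],
        check=True, capture_output=True, text=True,
    ).stdout
    return count_recent_starts(out, time.time())


def wait_for_slot(count_fn, max_loading, sleep_fn=time.sleep,
                  min_wait=30, max_wait=300):
    """Block until count_fn() drops below max_loading; return the number of polls."""
    polls = 1
    while count_fn() >= max_loading:
        # Random sleep so that many waiting PRE scripts do not re-poll in lockstep.
        sleep_fn(random.uniform(min_wait, max_wait))
        polls += 1
    return polls


# Intended use: throttle.py MAX_LOADING (does nothing without that argument).
if __name__ == "__main__" and len(sys.argv) == 2:
    wait_for_slot(query_loading_jobs, max_loading=int(sys.argv[1]))
    # Falling off the end exits with status 0, which lets DAGMan submit the job.
```

In the DAG file this would be attached to each node with a line such as
"SCRIPT PRE NodeA throttle.py 50", where NodeA and throttle.py are
placeholder names. Note that this throttle is only approximate: a job is not
counted between the moment its PRE script exits and the moment it actually
starts running, so several waiting scripts can slip through at once. Limiting
the number of concurrent PRE scripts (condor_submit_dag -maxpre) may help
there. For the job-attribute variant, only the constraint passed to condor_q
would need to change.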
-Greg