https://www.racf.bnl.gov/docs/sw/condor/dealing-with-evicted-jobs offers
a solution. These "I" jobs were basically evicted/preempted by their
condor_master before it expired (fast shutdown). For whatever reason
(this may be a bug in Condor: it might be treating these vanilla-universe
jobs as standard-universe jobs and trying to checkpoint and recover them
elsewhere), they remain in the queue and never run again. Specify
"periodic_hold" (or directly "periodic_remove") in the job submit file to
hold these "I" jobs the next time the schedd evaluates periodic job
actions (hold, remove, release; the evaluation interval is controlled by
PERIODIC_EXPR_INTERVAL, which defaults to 60 seconds).
periodic_hold = (NumJobStarts >= 1 && JobStatus == 1)
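To double-check what interval the schedd is actually using, something
like this on the submit host should print the value (60 seconds is the
default):

condor_config_val PERIODIC_EXPR_INTERVAL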
The original link spells it "PeriodicHold" (which may work as well), but
http://research.cs.wisc.edu/condor/manual/v7.7/condor_submit.html#SECTION0010474000000000000000
states that the submit command is "periodic_hold". "NumJobStarts >= 1"
means the job has run at least once, and "JobStatus == 1" means the job
is in the "I" (Idle) state.
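Before baking this into a policy, the same expression can be handed to
condor_q as a constraint to see which queued jobs would currently match
it, e.g.:

condor_q -constraint '(NumJobStarts >= 1 && JobStatus == 1)'

Any job listed here has already started at least once but is sitting
Idle again, i.e. exactly the kind of "stuck" job described above.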
In my case, "periodic_remove = (NumJobStarts >= 1 && JobStatus == 1)" is
preferred because I want this type of "stuck" job to be removed at the
next evaluation, after which dagman re-submits it automatically (retry
is set to 3 in my DAG).
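For illustration only (the file and job names below are made up), the
pieces fit together roughly like this:

# job1.condor -- hypothetical submit description
universe        = vanilla
executable      = my_job.sh
log             = job1.log
output          = job1.out
error           = job1.err
# remove the job if it has started at least once but is Idle again
periodic_remove = (NumJobStarts >= 1 && JobStatus == 1)
queue

# workflow.dag -- hypothetical DAG file
JOB   JobOne  job1.condor
RETRY JobOne  3

When the schedd removes the stuck job, dagman treats that node as failed
and, thanks to the RETRY line, submits it again, up to three times.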
Interestingly enough, there is a shortcut. Instead of adding this line
to every job submit file, you can set its counterpart in the daemon
config file (only on the central manager), "SYSTEM_PERIODIC_REMOVE =
(NumJobStarts >= 1 && JobStatus == 1)", based on
https://lists.cs.wisc.edu/archive/condor-users/2008-February/msg00275.shtml.
I tested it and it worked. I'm not sure what the "SYSTEM_" prefix means,
and I couldn't find this usage anywhere in the official documentation.
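For reference, the change amounts to something like the following in the
local config file on the central manager (the file name here is just an
assumption; use whatever your install reads for local overrides):

# condor_config.local -- local configuration on the central manager
SYSTEM_PERIODIC_REMOVE = (NumJobStarts >= 1 && JobStatus == 1)

After editing, running condor_reconfig on that machine should make the
daemons re-read the configuration without a restart.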
A few other useful links:
http://spinningmatt.wordpress.com/category/classads/
http://etutorials.org/Linux+systems/cluster+computing+with+linux/Part+III+Managing+Clusters/Chapter+15+Condor+A+Distributed+Job+Scheduler/15.2+Using+Condor/
http://spinningmatt.wordpress.com/2009/12/05/cap-job-runtime-debugging-periodic-job-policy-in-a-condor-pool/
https://lists.cs.wisc.edu/archive/condor-users/2008-February/msg00275.shtml
yu
On Tue, Apr 10, 2012 at 9:30 PM, Yu Huang <polyactis@xxxxxxxxx> wrote:
On Tue, Apr 10, 2012 at 9:12 PM, Mats Rynge <rynge@xxxxxxx> wrote:
> On 04/10/2012 08:33 PM, Yu Huang wrote:
>> Because qsub jobs have a time limit (say 24 hours), I instruct the
>> condor_master daemon to expire after 23.8 hours (= 23.8 x 60 = 1428
>> minutes). Usually the condor_master command line looks like
>> "condor_master -f -r 1428".
>>
>> One thing I'm desperate to find out is: when the condor_master on the
>> slave node (not the host) expires, what happens to the jobs that are
>> still running? Some time ago I remember seeing some doc say that all
>> the jobs will keep running. Could any of you confirm that, or the
>> opposite? My "impression" so far is that most jobs on the expired node
>> die immediately (although I did see some mangled output due to more
>> than one job writing to the same file).
> Yu,
>
> I don't remember the exact config we used in your case, but I think you
> want to try to set the -r time to 24 - max_job_walltime. For example,
> if your longest job takes 6 hours, set it to 1080 (18*60). The startd
> will then shut down gracefully, which means finishing the current job
> (it shows up as "Retiring" in condor_status).
There are times when a few jobs run far longer than I could expect, or
the node just goes down, so it's really hard to tune that expiration to
the job running time (I also usually have a bunch of heterogeneous
jobs).
> In your current setup, jobs which are running only get 12 minutes to
> finish before SGE kills the Condor daemons.
The condor daemons exit on their own before the SGE walltime is reached,
so SGE never comes in to kill them. The jobs that are running exit when
the condor daemon exits (I just tried one workflow and confirmed it).
> This means that the job will have to restart somewhere else.
The problem is that instead of those jobs disappearing (failing) from
the condor queue and restarting elsewhere, they persist in the queue in
the "I" state. I think the host (central manager) just has no idea what
has happened to the condor daemon on that slave node. What I hope to get
is a mechanism that lets the condor daemon on the slave node 1. kill all
jobs that are running, and 2. notify the central manager that the jobs
are (being) killed, so that the central manager will try to re-run them
elsewhere (rather than mark them "I").
yu
> --
> Mats Rynge
> USC/ISI - Pegasus Team <http://pegasus.isi.edu>
--
Yu Huang
Postdoc in Nelson Freimer Lab,
Center for Neurobehavioral Genetics, UCLA
Office Phone: +1.310-794-9598
Skype ID: crocea
http://www-scf.usc.edu/~yuhuang
_______________________________________________
Condor-users mailing list
To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/condor-users
The archives can be found at:
https://lists.cs.wisc.edu/archive/condor-users/