[HTCondor-users] Managing evictions and reruns
- Date: Mon, 4 Feb 2013 17:13:19 -0500
- From: Brian Pipa <brianpipa@xxxxxxxxx>
- Subject: [HTCondor-users] Managing evictions and reruns
I ran my new Condor job for the first time; it splits the work to do
into multiple worker jobs. Unfortunately, of the 178 worker jobs that
it produced, one was suspended, unsuspended, evicted, and then re-run
(see the log at the bottom). When it re-ran, it got "stuck" (i.e., it
hasn't finished yet, condor_wait still says it's running, and ps -ef
shows it running). I need to set things up so that either
#1: no jobs get evicted
or
#2: if a job does get evicted, do not rerun it
The jobs are Java code that execs a Python script and, evidently, it
doesn't like being evicted or suspended and then re-run.
I was poking around and it looks like I can set some variables in the
ClassAd like WANT_SUSPEND, SUSPEND, PREEMPT, WANT_VACATE, CONTINUE, but
I'm having a hard time figuring out which combination of these I
need to set to make it do #1 or #2 above. Can anyone shed some light
on this?
It seems like I could:
set WANT_SUSPEND to FALSE for #1 above
or
set CONTINUE to FALSE for #2 above
but I'm just not positive. And do I set this in the job ClassAd itself?
And a side note - how do I figure out why my job was evicted/suspended
in the first place?
---
job's log file
---
> more /workspace/jobs/3150/output/273.14.log
000 (273.014.000) 02/04 12:40:09 Job submitted from host: <...62:52527>
...
001 (273.014.000) 02/04 12:40:10 Job executing on host: <...64:45943>
...
006 (273.014.000) 02/04 12:40:18 Image size of job updated: 4937388
15 - MemoryUsage of job (MB)
14436 - ResidentSetSize of job (KB)
...
010 (273.014.000) 02/04 12:42:41 Job was suspended.
Number of processes actually suspended: 2
...
006 (273.014.000) 02/04 12:42:41 Image size of job updated: 10673688
37 - MemoryUsage of job (MB)
37280 - ResidentSetSize of job (KB)
...
011 (273.014.000) 02/04 12:52:43 Job was unsuspended.
...
004 (273.014.000) 02/04 12:52:43 Job was evicted.
(0) Job was not checkpointed.
Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
0 - Run Bytes Sent By Job
0 - Run Bytes Received By Job
Partitionable Resources : Usage Request
Cpus : 1
Disk (KB) : 1750 1750
Memory (MB) : 37 37
...
001 (273.014.000) 02/04 12:58:12 Job executing on host: <...64:45943>
...
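
As a rough aid for reading logs like the one above, here is a minimal Python sketch that pulls out the event lines and counts how many times the job started executing. It assumes every event line has the shape seen in this log (a three-digit event code, the job ID in parentheses, a date, a time, and a message). The names `parse_events`, `execution_starts`, and `SAMPLE_LOG` are made up for illustration and are not part of HTCondor. It only summarizes the timeline; it does not by itself explain why the job was suspended or evicted.

```python
import re
from typing import List, NamedTuple

# Assumed shape of an event line, based only on the log shown above:
#   004 (273.014.000) 02/04 12:52:43 Job was evicted.
EVENT_RE = re.compile(
    r"^(\d{3}) \((\d+)\.(\d+)\.(\d+)\) (\d{2}/\d{2}) (\d{2}:\d{2}:\d{2}) (.*)$"
)


class Event(NamedTuple):
    code: str      # three-digit event code, e.g. "004"
    job_id: str    # "cluster.proc", e.g. "273.14"
    date: str      # MM/DD as printed in the log
    time: str      # HH:MM:SS as printed in the log
    message: str   # rest of the event line


def parse_events(log_text: str) -> List[Event]:
    """Return the event header lines; detail lines and '...' are skipped."""
    events = []
    for line in log_text.splitlines():
        match = EVENT_RE.match(line.strip())
        if match:
            code, cluster, proc, _subproc, date, time, message = match.groups()
            job_id = f"{int(cluster)}.{int(proc)}"
            events.append(Event(code, job_id, date, time, message))
    return events


def execution_starts(events: List[Event]) -> int:
    """Count 'Job executing on host' events (code 001 in the log above)."""
    return sum(1 for event in events if event.code == "001")


# Excerpt copied from the log above (host addresses are elided there too).
SAMPLE_LOG = """\
000 (273.014.000) 02/04 12:40:09 Job submitted from host: <...62:52527>
...
001 (273.014.000) 02/04 12:40:10 Job executing on host: <...64:45943>
...
010 (273.014.000) 02/04 12:42:41 Job was suspended.
Number of processes actually suspended: 2
...
011 (273.014.000) 02/04 12:52:43 Job was unsuspended.
...
004 (273.014.000) 02/04 12:52:43 Job was evicted.
(0) Job was not checkpointed.
...
001 (273.014.000) 02/04 12:58:12 Job executing on host: <...64:45943>
...
"""

if __name__ == "__main__":
    parsed = parse_events(SAMPLE_LOG)
    for event in parsed:
        print(event.time, event.code, event.message)
    print("execution starts:", execution_starts(parsed))
```

More than one execution start for the same job ID, as in this log, means the job was re-run after the eviction.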