[Condor-users] Long time to reallocation of jobs
- Date: Thu, 14 Aug 2008 09:02:13 -0300
- From: "Julião" <juliao.mmx@xxxxxxxxx>
- Subject: [Condor-users] Long time to reallocation of jobs
Hi, I have a Condor manager and two nodes configured to act as a
dedicated cluster (just for testing, for now), with these options set
in the modified condor_config file:
#START = $(UWCS_START)
START = True
#SUSPEND = $(UWCS_SUSPEND)
#CONTINUE = $(UWCS_CONTINUE)
#PREEMPT = $(UWCS_PREEMPT)
SUSPEND = False
CONTINUE = True
PREEMPT = False
#KILL = $(UWCS_KILL)
KILL = $(ActivityTimer) > $(MaxVacateTime)
#PREEMPTION_REQUIREMENTS = $(UWCS_PREEMPTION_REQUIREMENTS)
PREEMPTION_REQUIREMENTS=False
I submit jobs with this description file:
Executable = job2
Universe = vanilla
Requirements = (Arch == "INTEL") || (Arch == "X86_64")
Log = job2.log
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
job_lease_duration = 180
Queue 10
To test the reallocation of jobs, I shut down one of the nodes and
saw in the ShadowLog that the jobs took about 130 minutes to be moved
to another node. Look:
8/11 11:40:49 (7.8) (26492): Request to run on <200.200.x.x:59245> was ACCEPTED
8/11 11:55:53 (7.8) (26492): ZKM: setting default map to (null)
8/11 13:52:18 (7.8) (26492): condor_read(): recv() returned -1, errno = 110, assuming failure reading 5 bytes from unknown source.
8/11 13:52:18 (7.8) (26492): IO: Failed to read packet header
8/11 13:52:18 (7.8) (26492): Can no longer talk to condor_starter <200.200.x.x:59245>
8/11 13:52:18 (7.8) (26492): Trying to reconnect to disconnected job
8/11 13:52:18 (7.8) (26492): LastJobLeaseRenewal: 1218465663 Mon Aug 11 11:41:03 2008
8/11 13:52:18 (7.8) (26492): JobLeaseDuration: 180 seconds
8/11 13:52:18 (7.8) (26492): JobLeaseDuration remaining: EXPIRED!
8/11 13:52:18 (7.8) (26492): Reconnect FAILED: Job disconnected too long: JobLeaseDuration (180 seconds) expired
8/11 13:52:18 (7.8) (26492): **** condor_shadow (condor_SHADOW) EXITING WITH STATUS 107
8/11 13:55:45 Initializing a VANILLA shadow for job 7.8
8/11 13:55:46 (7.8) (23196): Request to run on <200.100.x.x:60004> was ACCEPTED
8/11 14:10:48 (7.8) (23196): ZKM: setting default map to (null)
8/11 14:15:48 (7.8) (23196): ZKM: setting default map to (null)
8/11 14:15:48 (7.8) (23196): Job 7.8 terminated: exited with status 0
8/11 14:15:48 (7.8) (23196): **** condor_shadow (condor_SHADOW) EXITING WITH STATUS 100
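
For reference, the delay can be worked out from the timestamps in the log
above. This is only a minimal Python sketch: it assumes the log times are
local time at UTC-3 (the offset in the Date header of this message), and
the variable names are just illustrative.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the ShadowLog uses local time at UTC-3, the offset shown in
# the Date header of this message.
LOCAL = timezone(timedelta(hours=-3))

# Values copied from the log excerpt and the submit file.
last_renewal = datetime.fromtimestamp(1218465663, LOCAL)          # LastJobLeaseRenewal
disconnect_seen = datetime(2008, 8, 11, 13, 52, 18, tzinfo=LOCAL)  # recv() failure logged
restarted = datetime(2008, 8, 11, 13, 55, 46, tzinfo=LOCAL)        # accepted on the other node
job_lease_duration = timedelta(seconds=180)                        # job_lease_duration

# Time from the last lease renewal until the shadow noticed the failure.
detect_delay = disconnect_seen - last_renewal
# Time from noticing the failure until the job was running elsewhere.
restart_delay = restarted - disconnect_seen
# How long the lease had already been expired when the failure was noticed.
lease_overrun = detect_delay - job_lease_duration

print("last renewal (local):", last_renewal.strftime("%Y-%m-%d %H:%M:%S"))
print("minutes until failure noticed:", detect_delay.total_seconds() / 60)
print("seconds until job restarted:", restart_delay.total_seconds())
print("seconds past lease expiry:", lease_overrun.total_seconds())
```

So the shadow only noticed the dead node about 131 minutes after the last
lease renewal, long after the 180-second lease had run out, and once it
noticed, the job was running on the other node in under 4 minutes.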
My question is: how can I configure Condor to detect a node failure in
less time, say 15 minutes, and then move the job to another node?
I have another question: what is the default interval at which the
Condor manager does a "condor_reschedule"? Is it possible to change
this interval?
Thanks in advance,
Juliao