[Condor-users] Condor Eviction Problems
- Date: Mon, 13 Feb 2006 15:58:21 +0800
- From: <Greg.Hitchen@xxxxxxxx>
- Subject: [Condor-users] Condor Eviction Problems
Hi All
We have implemented a "time of day" policy as shown in section
3.6.9.3 Time of Day Policy
in the online manual for version 6.6.10.
It is stated here that:
WorkHours = ( (ClockMin >= 480 && ClockMin < 1020) && \
(ClockDay > 0 && ClockDay < 6) )
AfterHours = ( (ClockMin < 480 || ClockMin >= 1020) || \
(ClockDay == 0 || ClockDay == 6) )
START = $(AfterHours) && $(CPUIdle) && KeyboardIdle > $(StartIdleTime)
MachineBusy = ( $(WorkHours) || $(CPUBusy) || $(KeyboardBusy) )
By default, the MachineBusy macro is used to define the SUSPEND and
PREEMPT expressions. If you have changed these expressions at your site,
you will need to add $(WorkHours) to your SUSPEND and PREEMPT
expressions as appropriate.
Depending on your site, you might also want to avoid suspending jobs
during work hours, so that in the morning, if a job is running, it will
be immediately preempted, instead of being suspended for some length
of time:
WANT_SUSPEND = $(AfterHours)
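As a quick sanity check of the two expressions quoted above, they can be mirrored in a few lines of Python. This is only a sketch, not how Condor evaluates ClassAds. It assumes ClockMin is minutes since midnight and ClockDay counts from 0 = Sunday to 6 = Saturday; the helper names (clock_attrs, work_hours, after_hours) are made up for illustration.

```python
from datetime import datetime


def clock_attrs(ts: datetime) -> tuple:
    """Return (ClockMin, ClockDay) for a timestamp.

    Assumption: ClockMin is minutes since midnight and ClockDay is the
    day of the week with 0 = Sunday ... 6 = Saturday.
    """
    clock_min = ts.hour * 60 + ts.minute
    clock_day = (ts.weekday() + 1) % 7  # Python's weekday() uses 0 = Monday
    return clock_min, clock_day


def work_hours(clock_min: int, clock_day: int) -> bool:
    # Mirrors: (ClockMin >= 480 && ClockMin < 1020) && (ClockDay > 0 && ClockDay < 6)
    return (480 <= clock_min < 1020) and (0 < clock_day < 6)


def after_hours(clock_min: int, clock_day: int) -> bool:
    # Mirrors: (ClockMin < 480 || ClockMin >= 1020) || (ClockDay == 0 || ClockDay == 6)
    return (clock_min < 480 or clock_min >= 1020) or (clock_day in (0, 6))


# Monday afternoon (the timestamp of this email) and an arbitrary Saturday night.
for ts in (datetime(2006, 2, 13, 15, 58), datetime(2006, 2, 11, 23, 0)):
    m, d = clock_attrs(ts)
    print(ts, "ClockMin =", m, "ClockDay =", d,
          "WorkHours =", work_hours(m, d), "AfterHours =", after_hours(m, d))
```

The two expressions are logical complements, so exactly one of them is true for any combination of ClockMin and ClockDay: 08:00 to 16:59 on Monday to Friday counts as work hours, everything else as after hours.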
We seem to have MANY jobs being evicted after 30 minutes. See the log file
at the end of this email. Could our config be the problem?
Here is our current configuration:
CONDOR_CONFIG FILE
***********************************************************************
MachineBusy = ( $(WorkHours) || $(CPUBusy) || $(KeyboardBusy) )
WorkHours = ( (ClockMin >= 480 && ClockMin < 1020) && \
(ClockDay > 0 && ClockDay < 6) )
AfterHours = ( (ClockMin < 480 || ClockMin >= 1020) || \
(ClockDay == 0 || ClockDay == 6) )
## The RANK expression controls which jobs this machine prefers to
## run over others. Some examples from the manual include:
## RANK = TARGET.ImageSize
## RANK = (Owner == "coltrane") + (Owner == "tyner") \
## + ((Owner == "garrison") * 10) + (Owner == "jones")
## By default, RANK is always 0, meaning that all jobs have an equal
## ranking.
#RANK = 0
#####################################################################
## This is where you choose the configuration that you would like to
## use. It has no defaults so it must be defined. We start this
## file off with the UWCS_* policy.
######################################################################
## Also here is what is referred to as the TESTINGMODE_*, which is
## a quick hardwired way to test Condor.
## Replace UWCS_* with TESTINGMODE_* if you wish to do testing mode.
## For example:
## WANT_SUSPEND = $(UWCS_WANT_SUSPEND)
## becomes
## WANT_SUSPEND = $(TESTINGMODE_WANT_SUSPEND)
WANT_SUSPEND = $(UWCS_WANT_SUSPEND)
#WANT_SUSPEND = $(CSIRO_WANT_SUSPEND)
#WANT_VACATE = $(UWCS_WANT_VACATE)
WANT_VACATE = $(CSIRO_WANT_VACATE)
#START = $(UWCS_START)
START = $(CSIRO_START)
SUSPEND = $(UWCS_SUSPEND)
#SUSPEND = $(CSIRO_SUSPEND)
CONTINUE = $(UWCS_CONTINUE)
#CONTINUE = $(CSIRO_CONTINUE)
PREEMPT = $(UWCS_PREEMPT)
#PREEMPT = $(CSIRO_PREEMPT)
KILL = $(UWCS_KILL)
#KILL = $(CSIRO_KILL)
PERIODIC_CHECKPOINT = $(UWCS_PERIODIC_CHECKPOINT)
PREEMPTION_REQUIREMENTS = $(UWCS_PREEMPTION_REQUIREMENTS)
PREEMPTION_RANK = $(UWCS_PREEMPTION_RANK)
NEGOTIATOR_PRE_JOB_RANK = $(UWCS_NEGOTIATOR_PRE_JOB_RANK)
NEGOTIATOR_POST_JOB_RANK = $(UWCS_NEGOTIATOR_POST_JOB_RANK)
#####################################################################
## This is the default CSIRO configuration.
#####################################################################
CSIRO_WANT_SUSPEND = False
CSIRO_WANT_VACATE = False
CSIRO_START = $(AfterHours) && $(CPUIdle) && KeyboardIdle > $(StartIdleTime)
CSIRO_SUSPEND = False
CSIRO_CONTINUE = True
CSIRO_PREEMPT = False
CSIRO_KILL = False
CSIRO_NUM_CPUS = 1
CSIRO_JOB_RENICE_INCREMENT = 10
***********************************************************************
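To see which definitions the policy expressions above actually point at, the $(NAME) references can be followed with a small script. This is a rough sketch and not Condor's own config parser: the expand helper is hypothetical, the dictionary holds only macros copied from the excerpt above, and any macro not shown there (the UWCS_* definitions, CPUIdle, StartIdleTime) is deliberately left unexpanded.

```python
import re

# Macros copied from the config excerpt above. Anything not shown there
# (UWCS_*, CPUIdle, StartIdleTime) is intentionally absent.
CONFIG = {
    "AfterHours": "( (ClockMin < 480 || ClockMin >= 1020) || "
                  "(ClockDay == 0 || ClockDay == 6) )",
    "CSIRO_WANT_SUSPEND": "False",
    "CSIRO_WANT_VACATE": "False",
    "CSIRO_START": "$(AfterHours) && $(CPUIdle) && KeyboardIdle > $(StartIdleTime)",
    "CSIRO_SUSPEND": "False",
    "CSIRO_CONTINUE": "True",
    "CSIRO_PREEMPT": "False",
    "CSIRO_KILL": "False",
    "WANT_SUSPEND": "$(UWCS_WANT_SUSPEND)",
    "WANT_VACATE": "$(CSIRO_WANT_VACATE)",
    "START": "$(CSIRO_START)",
    "SUSPEND": "$(UWCS_SUSPEND)",
    "CONTINUE": "$(UWCS_CONTINUE)",
    "PREEMPT": "$(UWCS_PREEMPT)",
    "KILL": "$(UWCS_KILL)",
}

MACRO = re.compile(r"\$\((\w+)\)")


def expand(value: str, config: dict, depth: int = 0) -> str:
    """Recursively substitute $(NAME); names missing from config stay as-is."""
    if depth > 20:
        raise RecursionError("macro nesting too deep (possible cycle)")

    def substitute(match):
        name = match.group(1)
        if name not in config:
            return match.group(0)  # leave unknown references untouched
        return expand(config[name], config, depth + 1)

    return MACRO.sub(substitute, value)


for knob in ("WANT_SUSPEND", "WANT_VACATE", "START",
             "SUSPEND", "CONTINUE", "PREEMPT", "KILL"):
    print(f"{knob:12} = {expand(CONFIG[knob], CONFIG)}")
```

Run against the lines as posted, only START and WANT_VACATE resolve to the CSIRO_* values. WANT_SUSPEND, SUSPEND, CONTINUE, PREEMPT and KILL still refer to UWCS_* macros whose definitions are not part of this excerpt.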
EXCERPT FROM EXECUTING MACHINE'S SHADOW LOG
2/11 22:50:10 ******************************************************
2/11 22:50:10 ** condor_starter (CONDOR_STARTER) STARTING UP
2/11 22:50:10 ** C:\Condor\bin\condor_starter.exe
2/11 22:50:10 ** $CondorVersion: 6.6.10 Jun 22 2005 $
2/11 22:50:10 ** $CondorPlatform: INTEL-WINNT50 $
2/11 22:50:10 ** PID = 3296
2/11 22:50:10 ******************************************************
2/11 22:50:10 Using config file: C:\Condor\condor_config
2/11 22:50:10 Using local config files: C:\Condor/condor_config.local
2/11 22:50:10 DaemonCore: Command Socket at <138.194.10.128:9655>
2/11 22:50:10 Setting resource limits not implemented!
2/11 22:50:10 Starter communicating with condor_shadow <130.155.67.83:9805>
2/11 22:50:10 Submitting machine is "student3-lu.minerals.CSIRO.AU"
2/11 22:50:16 File transfer completed successfully.
2/11 22:50:16 Starting a VANILLA universe job with ID: 5.0
2/11 22:50:16 IWD: C:\Condor/execute\dir_3296
2/11 22:50:16 Output file: C:\Condor/execute\dir_3296\D7EG9AD.log
2/11 22:50:16 Renice expr "10" evaluated to 10
2/11 22:50:16 About to exec C:\Condor\execute\dir_3296\condor_exec.exe D7EG9AD.egs
2/11 22:50:16 Create_Process succeeded, pid=1772
2/11 23:19:41 Got SIGQUIT. Performing fast shutdown.
2/11 23:19:41 ShutdownFast all jobs.
2/11 23:19:41 Process exited, pid=1772, status=-1073741510
2/11 23:19:41 Last process exited, now Starter is exiting
2/11 23:19:41 **** condor_starter (condor_STARTER) EXITING WITH STATUS 0
2/12 11:50:33 ******************************************************
2/12 11:50:33 ** condor_starter (CONDOR_STARTER) STARTING UP
2/12 11:50:33 ** C:\Condor\bin\condor_starter.exe
2/12 11:50:33 ** $CondorVersion: 6.6.10 Jun 22 2005 $
2/12 11:50:33 ** $CondorPlatform: INTEL-WINNT50 $
2/12 11:50:33 ** PID = 1584
2/12 11:50:33 ******************************************************
2/12 11:50:33 Using config file: C:\Condor\condor_config
2/12 11:50:33 Using local config files: C:\Condor/condor_config.local
2/12 11:50:33 DaemonCore: Command Socket at <138.194.10.128:9230>
2/12 11:50:33 Setting resource limits not implemented!
2/12 11:50:33 Starter communicating with condor_shadow <130.155.67.83:9733>
2/12 11:50:33 Submitting machine is "student3-lu.minerals.CSIRO.AU"
2/12 11:50:40 File transfer completed successfully.
2/12 11:50:40 Starting a VANILLA universe job with ID: 6.0
2/12 11:50:40 IWD: C:\Condor/execute\dir_1584
2/12 11:50:40 Output file: C:\Condor/execute\dir_1584\D7EG9AE.log
2/12 11:50:40 Renice expr "10" evaluated to 10
2/12 11:50:40 About to exec C:\Condor\execute\dir_1584\condor_exec.exe D7EG9AE.egs
2/12 11:50:40 Create_Process succeeded, pid=2260
2/12 12:20:06 Got SIGQUIT. Performing fast shutdown.
2/12 12:20:06 ShutdownFast all jobs.
2/12 12:20:06 Process exited, pid=2260, status=-1073741510
2/12 12:20:07 Last process exited, now Starter is exiting
2/12 12:20:07 **** condor_starter (condor_STARTER) EXITING WITH STATUS 0
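The "30 minutes" figure and the exit status can be read straight off the excerpt above with a short script. This is a sketch only: the timestamps are copied by hand from the Create_Process and SIGQUIT lines, the year 2006 is an assumption taken from the date of this email (the log lines carry no year), and the helper names are hypothetical.

```python
from datetime import datetime

# (job ID, "Create_Process succeeded" time, "Got SIGQUIT" time), copied
# by hand from the log excerpt above.
RUNS = [
    ("5.0", "2/11 22:50:16", "2/11 23:19:41"),
    ("6.0", "2/12 11:50:40", "2/12 12:20:06"),
]

EXIT_STATUS = -1073741510  # "Process exited ... status=" value from the log


def parse(stamp: str, year: int = 2006) -> datetime:
    """Parse an 'M/D HH:MM:SS' log timestamp; the year is an assumption."""
    return datetime.strptime(f"{year}/{stamp}", "%Y/%m/%d %H:%M:%S")


def run_seconds(start: str, end: str) -> int:
    """Seconds between process creation and the SIGQUIT."""
    return int((parse(end) - parse(start)).total_seconds())


def as_unsigned32(status: int) -> int:
    """Reinterpret a signed 32-bit exit status as unsigned."""
    return status & 0xFFFFFFFF


for job_id, start, end in RUNS:
    minutes, seconds = divmod(run_seconds(start, end), 60)
    print(f"job {job_id}: ran {minutes} min {seconds} s before SIGQUIT")

print(f"exit status {EXIT_STATUS} = {as_unsigned32(EXIT_STATUS):#010x}")
```

Both jobs ran for just under 30 minutes (29 min 25 s and 29 min 26 s) before the starter received SIGQUIT. The exit status is 0xC000013A when read as an unsigned 32-bit value, which Windows defines as STATUS_CONTROL_C_EXIT. That fits a process terminated from outside rather than one that crashed on its own.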
-----------------------------------------------------------------------
Greg Hitchen
greg.hitchen@xxxxxxxx
CSIRO Exploration and Mining                  phone: +61 8 6436 8663
Australian Resources Research Centre (ARRC)   fax:   +61 8 6436 8555
Postal address:                               mob:   0407 952 748
PO Box 1130, Bentley WA 6102, Australia
Street Address:
26 Dick Perry Avenue, Kensington WA 6151
-----------------------------------------------------------------------