HTCondor Project List Archives




Re: [Condor-devel] help with preemption





On 1/13/12 9:00 AM, Nathan Panike wrote:
Dear colleagues:

I have the following configuration:

CountDeNovoJobs = int(( slot1_IsDeNovoJob =?= True ) + ( slot2_IsDeNovoJob =?= True ) \
    + ( slot3_IsDeNovoJob =?= True ) + ( slot4_IsDeNovoJob =?= True ))
WantsDeNovoJobs = ( CountDeNovoJobs <= 1 )
STARTD_SLOT_ATTRS = $(STARTD_SLOT_ATTRS) IsDeNovoJob
STARTD_JOB_EXPRS = $(STARTD_JOB_EXPRS) IsDeNovoJob
STARTD_ATTRS = $(STARTD_ATTRS) CountDeNovoJobs WantsDeNovoJobs
KILL = ($(KILL)) && ( MY.IsDeNovoJob =!= True )
PREEMPT = ($(PREEMPT)) && ( MY.IsDeNovoJob =!= True )
SUSPEND = ($(SUSPEND)) && ( MY.IsDeNovoJob =!= True )

What is RANK set to?


The idea is that once an "IsDeNovoJob" is running, it should never be killed,
suspended, or preempted. Also, it really would be best if only one
"IsDeNovoJob" job is running at a time on an execute node.

I submitted the following submit file:

executable = queuetest.sh
arguments = hello
universe = vanilla
output = queuetest.$(cluster).$(process).out
error = queuetest.$(cluster).$(process).err
log = queuetest.log
+IsDeNovoJob = True
Requirements = ( WantsDeNovoJobs =?= True )
Rank = 2 - Target.CountDeNovoJobs * 8 - SlotId
queue

I thought the above policy with the indicated submit file would keep my job
from being preempted.  But I get in the log:

000 (931949.000.000) 01/12 21:27:19 Job submitted from host:<192.168.0.149:40701>
...
001 (931949.000.000) 01/12 21:27:22 Job executing on host:<192.168.0.20:37016>
...
004 (931949.000.000) 01/12 23:25:02 Job was evicted.
	(0) Job was not checkpointed.
		Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
		Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
	0  -  Run Bytes Sent By Job
	0  -  Run Bytes Received By Job
...
<rest snipped>

In the StartLog on the execute machine, I have:

1/12 23:25:02 slot1: Preempting claim has correct ClaimId.
1/12 23:25:02 slot1: New claim has sufficient rank, preempting current claim.

The preemption happened because the new job had a higher machine rank than the existing job.

It's a continual source of surprise to users that when PREEMPT is false, jobs can still be preempted. Our terminology sucks.

--Dan
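
As a minimal sketch of how the two preemption paths named in the StartLog (machine rank and user priority) are commonly ruled out, assuming the goal is to disable them for every job on the pool rather than only for IsDeNovoJob jobs. The values are illustrative and not taken from Nathan's pool; PREEMPT, KILL, and SUSPEND do not govern either path:

```
# Execute node (condor_startd): with a constant machine RANK, no new claim
# can outrank the running one, so rank-based preemption never triggers.
RANK = 0

# Central manager (condor_negotiator): never preempt a running claim
# in favor of a user with better priority.
PREEMPTION_REQUIREMENTS = False
```

The daemons need a condor_reconfig after the change; whether a blanket setting like this is acceptable depends on the rest of the pool's policy.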

1/12 23:25:02 slot1: State change: preempting claim based on user priority
1/12 23:25:02 slot1: State change: claim retirement ended/expired
1/12 23:25:02 slot1: Changing state and activity: Claimed/Busy ->  Preempting/Vacating

So can anyone tell me what is wrong with my policy?  It seems like I have read
section 3 of the condor manual a couple dozen times, and I am still at a loss.

Thank you much.

Nathan Panike