
Re: [HTCondor-users] condor_status -total, Preempting



Hi Daniel -

What you describe below is really unexpected. Your settings should indeed disable preemption, assuming you ran a successful condor_reconfig after the changes and that each setting is on the right host: the PREEMPTION_REQUIREMENTS change is read by the condor_negotiator, and the other settings are read by the condor_startd on every execute host. Note that the preferred way to disable preemption on HTCondor v8.0+ is via MaxJobRetirementTime, see

http://research.cs.wisc.edu/htcondor/manual/current/3_5Policy_Configuration.html#SECTION00459500000000000000

But what you have below should work as well.

HTCondor may preempt a job in favor of another job from the same user, but only in the case of a higher startd RANK.

Very strange.

Is the below regularly reproducible, or do you only see it very rarely?

Note that starting with HTCondor v8.1.3, the machine ClassAds report some helpful attributes regarding preemption; I copied the below from the manual at
http://research.cs.wisc.edu/htcondor/manual/latest/12_Appendix_A.html
These statistics were added for just such an occurrence, i.e. so admins can confirm that preemption is disabled. So, if you are running v8.1.3 or above, are the statistics below reporting preemptions as occurring? If so, are they reporting user priority preemptions or rank preemptions? Maybe it is only happening on some specific nodes?

JobPreemptions:
The total number of times a running job has been preempted on this machine.

JobRankPreemptions:
The total number of times a running job has been preempted on this machine due to the machine's rank of jobs since the condor_startd started running.

JobUserPrioPreemptions:
The total number of times a running job has been preempted on this machine based on a fair share allocation of the pool since the condor_startd started running.

RecentJobPreemptions:
The total number of jobs which have been preempted from this machine in the last twenty minutes.

RecentJobRankPreemptions:
The total number of times a running job has been preempted on this machine due to the machine's rank of jobs in the last twenty minutes.

RecentJobUserPrioPreemptions:
The total number of times a running job has been preempted on this machine based on a fair share allocation of the pool in the last twenty minutes.
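One way to check these counters across many slots is sketched below. It assumes the machine ads have been saved as text in the "Attr = value" long form (one blank line between ads), as printed by condor_status -long. The function names and the sample ads are hypothetical and only illustrate the idea; the attribute names are the ones listed above.

```python
# Sketch: list the slots whose preemption counters are non-zero.
# Assumption: input is long-form ClassAd text ("Attr = value" lines,
# blank line between ads). The SAMPLE ads below are made up.

PREEMPTION_ATTRS = (
    "JobPreemptions",
    "JobRankPreemptions",
    "JobUserPrioPreemptions",
)


def parse_ads(text):
    """Split long-form ClassAd text into a list of {attribute: raw value} dicts."""
    ads, current = [], {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the ad collected so far.
            if current:
                ads.append(current)
                current = {}
            continue
        name, sep, value = line.partition(" = ")
        if sep:
            current[name] = value.strip().strip('"')
    if current:
        ads.append(current)
    return ads


def preempted_slots(ads):
    """Return {slot name: {attribute: count}} for slots with any non-zero counter."""
    report = {}
    for ad in ads:
        counts = {}
        for attr in PREEMPTION_ATTRS:
            try:
                counts[attr] = int(ad.get(attr, "0"))
            except ValueError:
                counts[attr] = 0  # treat non-integer values as zero
        if any(counts.values()):
            report[ad.get("Name", "<unknown>")] = counts
    return report


SAMPLE = """\
Name = "slot21@node01.example"
JobPreemptions = 0
JobRankPreemptions = 0
JobUserPrioPreemptions = 0

Name = "slot22@node01.example"
JobPreemptions = 3
JobRankPreemptions = 0
JobUserPrioPreemptions = 3
"""

if __name__ == "__main__":
    for slot, counts in preempted_slots(parse_ads(SAMPLE)).items():
        print(slot, counts)
```

A slot that shows a non-zero JobUserPrioPreemptions with a zero JobRankPreemptions would point at user priority preemption rather than startd RANK.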

regards,
Todd

On 1/27/2014 9:53 AM, Pek Daniel wrote:
Some lines from the StartLog:

01/27/14 16:45:42 slot22: Request accepted.
01/27/14 16:45:42 slot22: Remote owner is xxx
01/27/14 16:45:42 slot22: State change: claiming protocol successful
01/27/14 16:45:42 slot22: Changing state: Unclaimed -> Claimed
01/27/14 16:45:46 slot22: Got activate_claim request from shadow (xxx.xxx.xxx.xxx)
01/27/14 16:45:46 slot22: Remote job ID is 3920.25
01/27/14 16:45:46 slot22: Got universe "VANILLA" (5) from request classad
01/27/14 16:45:47 slot22: State change: claim-activation protocol successful
01/27/14 16:45:47 slot22: Changing activity: Idle -> Busy
01/27/14 16:45:55 slot22: Preempting claim has correct ClaimId.
01/27/14 16:45:55 slot22: New claim has sufficient rank, preempting current claim.
01/27/14 16:45:55 slot22: State change: preempting claim based on user priority
01/27/14 16:45:55 slot22: State change: claim retirement ended/expired
01/27/14 16:45:55 slot22: Changing state and activity: Claimed/Busy -> Preempting/Vacating
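To see how often this happens and on which slots, the StartLog can be scanned for the "State change: preempting claim based on ..." line shown above. The following is a minimal sketch, assuming the timestamp and slot prefix look like the excerpt; only the "user priority" wording is confirmed by this log, so the reason is captured as free text. The function name is hypothetical.

```python
import re
from collections import Counter

# Matches lines such as:
# 01/27/14 16:45:55 slot22: State change: preempting claim based on user priority
# Assumption: other preemption reasons follow the same "based on <reason>" shape.
LINE_RE = re.compile(
    r"^(?P<stamp>\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<slot>\S+): State change: preempting claim based on (?P<reason>.+)$"
)


def preemption_events(lines):
    """Return a list of (timestamp, slot, reason) tuples for preemption lines."""
    events = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            events.append((match["stamp"], match["slot"], match["reason"]))
    return events


# Lines taken from the StartLog excerpt above.
SAMPLE_LOG = [
    "01/27/14 16:45:47 slot22: Changing activity: Idle -> Busy",
    "01/27/14 16:45:55 slot22: Preempting claim has correct ClaimId.",
    "01/27/14 16:45:55 slot22: State change: preempting claim based on user priority",
    "01/27/14 16:45:55 slot22: State change: claim retirement ended/expired",
]

if __name__ == "__main__":
    events = preemption_events(SAMPLE_LOG)
    for stamp, slot, reason in events:
        print(stamp, slot, reason)
    # Tally by reason to see which kind of preemption dominates.
    print(Counter(reason for _, _, reason in events))
```

To run it against a real log, the list could be replaced by the lines of the StartLog file, for example via open(path).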

2014/1/27 Pek Daniel <pekdaniel@xxxxxxxxx>:
Hi,

I tried my best to turn off preemption completely:
PREEMPT = FALSE
SUSPEND = FALSE
KILL = FALSE
PREEMPTION_REQUIREMENTS = FALSE
NEGOTIATOR_CONSIDER_PREEMPTION = FALSE
RANK = 0

But sometimes during negotiation, I can still see a non-zero value in
the Preempting column of the output of condor_status -total.

According to the docs:

"Preempting": A Condor job is being preempted (possibly via
checkpointing) in order to clear the machine for either a higher
priority job or because the machine owner wants the machine back.

Given that I have only a single user and completely identical
jobs, I don't think the preemption would happen because of a higher
priority job. Any idea why this is?

Thanks,
Daniel



--
Todd Tannenbaum <tannenba@xxxxxxxxxxx> University of Wisconsin-Madison
Center for High Throughput Computing   Department of Computer Sciences
HTCondor Technical Lead                1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132                  Madison, WI 53706-1685