Re: [HTCondor-users] condor_status -total, Preempting
- Date: Mon, 27 Jan 2014 12:04:02 -0600
- From: Todd Tannenbaum <tannenba@xxxxxxxxxxx>
- Subject: Re: [HTCondor-users] condor_status -total, Preempting
Hi Daniel -
The below looks really unexpected. Your settings should indeed disable
preemption, assuming you ran a successful condor_reconfig after the
changes and the settings are on the right hosts (PREEMPTION_REQUIREMENTS
is read by the condor_negotiator; the other settings are read by the
condor_startd on every execute host). Note that the preferred
way to disable preemption on HTCondor v8.0+ is via MaxJobRetirementTime,
see
http://research.cs.wisc.edu/htcondor/manual/current/3_5Policy_Configuration.html#SECTION00459500000000000000
But what you have below should work as well.
HTCondor may preempt a job in favor of another job from the same user,
but only in the case of a higher startd RANK.
Very strange.
Is the below regularly reproducible, or do you only see it very rarely?
Note that starting with HTCondor v8.1.3, the machine ClassAds report
some helpful attributes regarding preemption; I copied the
below from the manual at
http://research.cs.wisc.edu/htcondor/manual/latest/12_Appendix_A.html
These statistics were added for just such an occurrence, i.e. so admins
can confirm that preemption is disabled. So, if you are running v8.1.3
or above, are the statistics below reporting preemptions as occurring?
If so, are they reporting user-priority preemptions or rank preemptions?
Maybe it is only happening on some specific nodes?
JobPreemptions:
The total number of times a running job has been preempted on this
machine.
JobRankPreemptions:
The total number of times a running job has been preempted on this
machine due to the machine's rank of jobs since the condor_startd
started running.
JobUserPrioPreemptions:
The total number of times a running job has been preempted on this
machine based on a fair share allocation of the pool since the
condor_startd started running.
RecentJobPreemptions:
The total number of jobs which have been preempted from this
machine in the last twenty minutes.
RecentJobRankPreemptions:
The total number of times a running job has been preempted on this
machine due to the machine's rank of jobs in the last twenty minutes.
RecentJobUserPrioPreemptions:
The total number of times a running job has been preempted on this
machine based on a fair share allocation of the pool in the last twenty
minutes.
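
One way to check these counters across the pool is sketched below. This is a minimal sketch, not an official tool: it assumes condor_status is on the PATH and supports the -af (autoformat) option, and the helper names (parse_preemption_counts, slots_with_preemptions, fetch_counts) are made up for illustration. Slots whose startd does not publish the attributes (for example, versions before v8.1.3) print "undefined", which is recorded here as None.

```python
import subprocess

# Attribute names taken from the manual excerpt above.
ATTRS = ["JobPreemptions", "JobRankPreemptions", "JobUserPrioPreemptions"]


def parse_preemption_counts(text):
    """Parse `condor_status -af Name <ATTRS...>` output into {slot: {attr: count}}."""
    result = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != len(ATTRS) + 1:
            continue  # skip blank or malformed lines
        name, values = fields[0], fields[1:]
        counts = {}
        for attr, value in zip(ATTRS, values):
            # Non-numeric values such as "undefined" are stored as None.
            counts[attr] = int(value) if value.isdigit() else None
        result[name] = counts
    return result


def slots_with_preemptions(counts):
    """Return the sorted names of slots reporting at least one preemption."""
    return sorted(name for name, c in counts.items() if any(c.values()))


def fetch_counts():
    """Query the pool (hypothetical helper; requires a working HTCondor install)."""
    cmd = ["condor_status", "-af", "Name"] + ATTRS
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return parse_preemption_counts(out)
```

Calling slots_with_preemptions(fetch_counts()) on a submit or central manager host would list the slots to look at first; comparing JobRankPreemptions with JobUserPrioPreemptions for those slots shows which kind of preemption is being counted.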
regards,
Todd
On 1/27/2014 9:53 AM, Pek Daniel wrote:
Some lines from the StartLog:
01/27/14 16:45:42 slot22: Request accepted.
01/27/14 16:45:42 slot22: Remote owner is xxx
01/27/14 16:45:42 slot22: State change: claiming protocol successful
01/27/14 16:45:42 slot22: Changing state: Unclaimed -> Claimed
01/27/14 16:45:46 slot22: Got activate_claim request from shadow (xxx.xxx.xxx.xxx)
01/27/14 16:45:46 slot22: Remote job ID is 3920.25
01/27/14 16:45:46 slot22: Got universe "VANILLA" (5) from request classad
01/27/14 16:45:47 slot22: State change: claim-activation protocol successful
01/27/14 16:45:47 slot22: Changing activity: Idle -> Busy
01/27/14 16:45:55 slot22: Preempting claim has correct ClaimId.
01/27/14 16:45:55 slot22: New claim has sufficient rank, preempting current claim.
01/27/14 16:45:55 slot22: State change: preempting claim based on user priority
01/27/14 16:45:55 slot22: State change: claim retirement ended/expired
01/27/14 16:45:55 slot22: Changing state and activity: Claimed/Busy -> Preempting/Vacating
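
To see how often this happens, and on which slots, the StartLog can be scanned for the "preempting claim" state change shown above. The following is a rough sketch that assumes the line layout of this excerpt (MM/DD/YY timestamp, slot name, message); the function name find_preemptions is made up for illustration, and other HTCondor versions or debug levels may format lines differently.

```python
import re

# Matches lines like: "01/27/14 16:45:55 slot22: State change: ..."
LINE_RE = re.compile(r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) (slot\w+): (.*)$")


def find_preemptions(lines):
    """Return (timestamp, slot, message) for each preemption state change."""
    events = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # wrapped continuation lines and unrelated text
        timestamp, slot, message = match.groups()
        if message.startswith("State change: preempting claim"):
            events.append((timestamp, slot, message))
    return events
```

For example, passing an open StartLog file object to find_preemptions returns one tuple per preemption, and the message text records the reason the startd logged (here, "based on user priority").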
2014/1/27 Pek Daniel <pekdaniel@xxxxxxxxx>:
Hi,
I tried my best to turn off preemption completely:
PREEMPT = FALSE
SUSPEND = FALSE
KILL = FALSE
PREEMPTION_REQUIREMENTS = FALSE
NEGOTIATOR_CONSIDER_PREEMPTION = FALSE
RANK = 0
But sometimes during negotiation, I can still see a non-zero value in
the Preempting column of the output of condor_status -total.
According to the docs:
"Preempting": A Condor job is being preempted (possibly via
checkpointing) in order to clear the machine for either a higher
priority job or because the machine owner wants the machine back.
Given that I have only a single user and completely identical
jobs, I don't think the preemption would happen because of a higher
priority job. Any idea why this is?
Thanks,
Daniel
--
Todd Tannenbaum <tannenba@xxxxxxxxxxx> University of Wisconsin-Madison
Center for High Throughput Computing Department of Computer Sciences
HTCondor Technical Lead 1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132 Madison, WI 53706-1685