
Re: [HTCondor-users] peaceful node drain and shutdown



So, I'll allow that I may have tested my CROND hooks improperly and that jobs are starting when they shouldn't be. But my slots are definitely not transitioning to the Owner state even though my CROND hooks are evaluating to false (by condor_status queries). A careful examination notes a few things:

1. My start expression is actually

START = $(CROND_REQUIREMENTS) || TARGET.Owner =?= UNDEFINED

I inherited the RHS of the or and have never quite understood what it's doing but whatever Catholicism I inherited has allowed me to accept the mystery. But, through epiphany or, um, reading the right part of the manual, I think I understand. This part of the manual warns about weird behavior with IS_OWNER and the use of job (TARGET) properties in START. Maybe this expression worked well enough for job matching but borks IS_OWNER enough that it never re-enters Owner state?

CROND_REQUIREMENTS is guaranteed to come out boolean by testing for isUndefined first.
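
To make the suspicion above concrete, here is a minimal sketch in Python of how that START expression would evaluate under ClassAd three-valued logic. It is a toy model, not HTCondor code: the helper names (meta_equals, logical_or, start) are made up for illustration, Python's None stands in for UNDEFINED, and it assumes the expression is evaluated with no job ad present, so TARGET.Owner is UNDEFINED.

```python
# Toy model of ClassAd three-valued logic; NOT HTCondor's implementation.
UNDEFINED = None  # stand-in for the ClassAd UNDEFINED value


def meta_equals(a, b):
    """Model of =?=: always yields a real boolean, even for UNDEFINED operands."""
    return type(a) is type(b) and a == b


def logical_or(a, b):
    """Model of ||: TRUE wins, otherwise UNDEFINED propagates, otherwise FALSE."""
    if a is True or b is True:
        return True
    if a is UNDEFINED or b is UNDEFINED:
        return UNDEFINED
    return False


def start(crond_requirements, target_owner):
    """START = $(CROND_REQUIREMENTS) || TARGET.Owner =?= UNDEFINED"""
    return logical_or(crond_requirements, meta_equals(target_owner, UNDEFINED))


# No job ad: TARGET.Owner is UNDEFINED, so the right-hand side is TRUE
# and START is TRUE even though the cron hooks say FALSE.
print(start(False, UNDEFINED))  # True

# With a job ad that defines Owner ("alice" is a made-up value), the hooks decide.
print(start(False, "alice"))    # False
print(start(True, "alice"))     # True
```

If this model is right, START can never come out FALSE while no job is being considered, which would be consistent with the slots never re-entering Owner state.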

Alternatively, maybe the dynamic slots aren't seeing CROND_REQUIREMENTS as it updates? There is a reported problem in which, for a short period at startup, the classad hooks change from FALSE to TRUE for the partitionable slot, but a *newly* created dynamic slot will fail to start the job because it thinks they're still FALSE. So it's possible IS_OWNER isn't being evaluated correctly for the dynamic slots.

2. Regardless of why my configuration worked, I think the better start expression for me at a dedicated site is:

IS_OWNER = FALSE
START = $(CROND_REQUIREMENTS)

So thanks for helping me solve a problem!

--
Tom Downes
Senior Scientist and Data Center Manager
Center for Gravitation, Cosmology and Astrophysics
University of Wisconsin-Milwaukee
414.229.2678

On Wed, Jul 13, 2016 at 3:59 PM, Todd Tannenbaum <tannenba@xxxxxxxxxxx> wrote:
Argh!

I think Brian is right. Unless you also change IS_OWNER, you will want to do

 START = UNDEFINED

instead of START = FALSE in order to avoid the Owner state and just go to UNDEFINED.
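
The difference between the two settings can be sketched with a small Python model. It assumes the stock policy IS_OWNER = (START =?= FALSE); that default is an assumption here, so check it with condor_config_val IS_OWNER on your own pool. The function names are illustrative only, and None stands in for UNDEFINED.

```python
# Toy model only; assumes the default policy IS_OWNER = (START =?= FALSE).
UNDEFINED = None  # stand-in for the ClassAd UNDEFINED value


def meta_equals(a, b):
    """Model of =?=: a strict comparison that never returns UNDEFINED."""
    return type(a) is type(b) and a == b


def is_owner(start_value):
    """IS_OWNER = (START =?= FALSE)"""
    return meta_equals(start_value, False)


for label, value in (("TRUE", True), ("FALSE", False), ("UNDEFINED", UNDEFINED)):
    print(f"START = {label:<9} -> IS_OWNER = {is_owner(value)}")
```

Under that assumption only START = FALSE makes IS_OWNER true. START = UNDEFINED still does not evaluate to TRUE, so it would not match new jobs, but it does not satisfy the Owner test either.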

Sent from my iPhone

> On Jul 13, 2016, at 3:28 PM, Brian Bockelman <bbockelm@xxxxxxxxxxx> wrote:
>
> You sure about this?
>
> I also recall the same behavior that Bob describes - if START goes to FALSE instead of UNDEFINED, then the node transitions to Owner state, which then kills off running jobs.
>
> (Again, might have changed at some point)
>
> Brian
>
>> On Jul 13, 2016, at 3:09 PM, Todd Tannenbaum <tannenba@xxxxxxxxxxx> wrote:
>>
>> On 7/13/2016 3:03 PM, Bob Ball wrote:
>>> Maybe this info is now obsolete, but I remember once setting the START
>>> to an expression that evaluated "FALSE" and caused all the running jobs
>>> to terminate....
>>>
>>> bob
>>
>> Only if $(START) is referenced in the PREEMPT expression....
>>
>> START just controls when new jobs can be launched.
>>
>> PREEMPT controls when to kick off jobs (really would be more accurate to have named it "Evict" instead of "Preempt", sigh...).
>>
>> regards
>> Todd
>>
>>
>>>> On 7/13/2016 3:56 PM, Fox, Kevin M wrote:
>>>> I'm guessing the condor_drain command will have similar issues to the
>>>> condor_off -peaceful command? That you have to have all the
>>>> permissions setup right?
>>>>
>>>> The nice thing about the START=FALSE config trick is you only need
>>>> root on the machine to do it.
>>>>
>>>> Thanks,
>>>> Kevin
>>>> ________________________________________
>>>> From: HTCondor-users [htcondor-users-bounces@xxxxxxxxxxx] on behalf of
>>>> Todd Tannenbaum [tannenba@xxxxxxxxxxx]
>>>> Sent: Wednesday, July 13, 2016 12:46 PM
>>>> To: HTCondor-Users Mail List
>>>> Subject: Re: [HTCondor-users] peaceful node drain and shutdown
>>>>
>>>>> On 7/13/2016 2:29 PM, Fox, Kevin M wrote:
>>>>> Ah. I had seen the docs for START but didn't realize it would affect new
>>>>> job startup too. It seemed to imply that it's for eviction.
>>>>>
>>>>> But, the following seems to work to drain the node gracefully, as you
>>>>> suggested:
>>>>> echo START=FALSE > /etc/condor/config.d/00shutdown
>>>>> kill -HUP <PID OF MASTER>
>>>>>
>>>>> and to reverse it
>>>>> rm -f /etc/condor/config.d/00shutdown
>>>>> kill -HUP <PID OF MASTER>
>>>>>
>>>>> Thanks for the help. :)
>>>> Hi Kevin,
>>>>
>>>> If the above satisfies your needs, great. But just wanted to point out
>>>> you can do the same thing (drain a node gracefully) with the
>>>> condor_drain tool. Do "man condor_drain", or see
>>>> http://htcondor.org/manual/v8.4/condor_drain.html
>>>>
>>>> Also in the upcoming HTCondor v8.5.6, the condor_drain functionality is
>>>> exposed via HTCondor's Python API. :)
>>>>
>>>> regards,
>>>> Todd
>>>>
>>>>
>>>> _______________________________________________
>>>> HTCondor-users mailing list
>>>> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx
>>>> with a
>>>> subject: Unsubscribe
>>>> You can also unsubscribe by visiting
>>>> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
>>>>
>>>> The archives can be found at:
>>>> https://lists.cs.wisc.edu/archive/htcondor-users/
>>>
>>
>>
>> --
>> Todd Tannenbaum <tannenba@xxxxxxxxxxx> University of Wisconsin-Madison
>> Center for High Throughput Computing   Department of Computer Sciences
>> HTCondor Technical Lead                1210 W. Dayton St. Rm #4257
>> Phone: (608) 263-7132                  Madison, WI 53706-1685
>