Re: [HTCondor-users] PERIODIC_HOLD is applied extremely infrequently
- Date: Tue, 12 May 2015 06:54:13 -0500
- From: Brian Bockelman <bbockelm@xxxxxxxxxxx>
- Subject: Re: [HTCondor-users] PERIODIC_HOLD is applied extremely infrequently
> On May 12, 2015, at 5:51 AM, Vladimir Brik <vladimir.brik@xxxxxxxxxxxxxxxx> wrote:
>
> Thanks very much, Brian. That was it.
>
>> - What exists on the shadow is the "real" copy, not rounded.
> Do you mean shadow's ResidentSetSize is equal to schedd's ResidentSetSize_RAW?
>
Yup (the _RAW stuff is present to help the schedd perform autoclustering with the matchmaking protocol).
To avoid overload, the shadow defaults to updating the schedd at a lower frequency than the starter updates the shadow. Hence, the shadow's RSS data may be more up-to-date than the schedd's.
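
For reference, the fallback suggested earlier in this thread could be folded into the original configuration roughly as follows. This is only a sketch under assumptions: rss_kb is a hypothetical helper macro introduced here for illustration, the thresholds and the RemoteHost clause are copied from the configuration quoted further down, and it has not been tested against a live pool.

```
# Sketch only: rss_kb is a hypothetical helper macro, not part of the original config.
# Use the schedd-side ResidentSetSize_RAW when the ad carries it; otherwise fall back
# to the shadow's own unrounded ResidentSetSize.
rss_kb = ifThenElse(ResidentSetSize_RAW isnt undefined, ResidentSetSize_RAW, ResidentSetSize)

# Same thresholds as the original mem_hold. The trailing "=?= True" turns an
# UNDEFINED result (for example, a missing RequestMemory) into False.
mem_hold = (($(rss_kb)/1000 > RequestMemory && $(rss_kb)/1000 > 6000) =?= True)

SYSTEM_PERIODIC_HOLD = ((JobStatus == 2 && JobUniverse == 5 && $(mem_hold) && isUndefined(RemoteHost) =?= False && regex("gzk9000c", RemoteHost) =!= True) =?= True)
```

Written this way, the expression should give the same answer whether it is evaluated against the shadow's copy of the job ad (no _RAW attribute) or the schedd's copy (with _RAW).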
>
> Vlad
>
>
>
> On 05/11/15 20:31, Brian Bockelman wrote:
>> Ah - that rings a bell.
>>
>> - The SYSTEM_PERIODIC_HOLD is being evaluated on the shadow.
>> - The *_RAW variants of attributes only exist on the schedd (they are set when the shadow pushes the update through).
>> - What exists on the shadow is the "real" copy, not rounded.
>> - In a few cases (for example, when condor_qedit is run), the schedd will consider the job ad as "dirty" and push a fresh copy to the shadow. This probably pushes the _RAW variant and causes the job to trigger the SYSTEM_PERIODIC_HOLD.
>> - A good way to test this would be to do something like "condor_qedit -const 'JobStatus=?=2' true foo 1" to touch all running jobs and see which go on hold.
>>
>> What happens if you do something like:
>>
>> SYSTEM_PERIODIC_HOLD = .... && ifThenElse(ResidentSetSize_RAW isnt undefined, ResidentSetSize_RAW>12345, ResidentSetSize>12345)
>>
>> ?
>>
>> Brian
>>
>>> On May 11, 2015, at 5:02 PM, Vladimir Brik <vladimir.brik@xxxxxxxxxxxxxxxx> wrote:
>>>
>>> I added "ALL_DEBUG = D_FULLDEBUG" and something strange is going on.
>>>
>>> It looks like shadows are evaluating SYSTEM_PERIODIC_HOLD using an out-of-date ClassAd. I have entries like this in my ShadowLog:
>>> (2498514.1) (5242): Classad debug: [0.00095ms] ResidentSetSize_RAW --> UNDEFINED
>>> (which would cause SYSTEM_PERIODIC_HOLD to be false)
>>>
>>> However, according to the schedd, ResidentSetSize_RAW for 2498514.1 *is* defined:
>>> condor_q 2498514.1 -autof ResidentSetSize_RAW
>>> 7340604
>>>
>>> Are condor_shadows of flocked jobs getting job classads from somewhere other than the local condor_schedd?
>>>
>>>
>>>
>>> Vlad
>>>
>>>
>>>
>>>
>>> On 05/11/15 15:23, Brian Bockelman wrote:
>>>>
>>>>> On May 11, 2015, at 11:18 AM, Vladimir Brik <vladimir.brik@xxxxxxxxxxxxxxxx> wrote:
>>>>>
>>>>> I added D_FULLDEBUG and "Evaluated periodic expressions" lines appear in SchedLog as expected. For example:
>>>>> Evaluated periodic expressions in 0.301s, scheduling next run in 60s
>>>>>
>>>>> My periodic hold expression is defined like this:
>>>>> rss_max = 6000
>>>>> mem_hold = ((isUndefined(ResidentSetSize_RAW) =?= False && isUndefined(RequestMemory) =?= False && ResidentSetSize_RAW/1000 > RequestMemory \
>>>>> && ResidentSetSize_RAW/1000 > 6000) =?= True)
>>>>> SYSTEM_PERIODIC_HOLD = ((JobStatus == 2 && JobUniverse == 5 && $(mem_hold) && isUndefined(RemoteHost) =?= False && regex("gzk9000c", RemoteHost) =!= True) =?= True)
>>>>>
>>>>> For testing, I tried using this:
>>>>> SYSTEM_PERIODIC_HOLD = (JobStatus == 2 && JobUniverse == 5 && Owner == "vbrik")
>>>>>
>>>>
>>>> Hi Vlad,
>>>>
>>>> Try adding:
>>>>
>>>> SYSTEM_PERIODIC_HOLD = debug( $(SYSTEM_PERIODIC_HOLD) )
>>>>
>>>> This will have HTCondor log the expression evaluation into ScheddLog, perhaps illuminating what is going on here!
>>>>
>>>> Brian
>>>>
>>>>> The interesting thing about the expression above is that it puts *some* jobs on hold immediately after they start running (as expected), but jobs that weren't put on hold immediately after starting are never put on hold.
>>>>>
>>>>> While debugging, I am also using this:
>>>>> PERIODIC_EXPR_INTERVAL = 60
>>>>> MAX_PERIODIC_EXPR_INTERVAL = 300
>>>>> PERIODIC_EXPR_TIMESLICE = .9
>>>>>
>>>>>
>>>>> Vlad
>>>>>
>>>>>
>>>>>
>>>>> On 05/08/15 15:51, Ben Cotton wrote:
>>>>>> Vlad,
>>>>>>
>>>>>> You should see lines like:
>>>>>>
>>>>>> 05/08/15 16:45:51 (pid:2968) Evaluated periodic expressions in 0.000s,
>>>>>> scheduling next run in 300s
>>>>>>
>>>>>> in your sched log (assuming SCHEDD_DEBUG includes D_FULLDEBUG). If you
>>>>>> see that at the expected interval (based on your
>>>>>> PERIODIC_EXPR_INTERVAL setting) then it's probably a problem in your
>>>>>> SYSTEM_PERIODIC_HOLD expression. Could you share that? If it doesn't
>>>>>> show up at the expected time, we'll have to try something else.
>>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>> BC
>>>>>>
>>>>> _______________________________________________
>>>>> HTCondor-users mailing list
>>>>> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
>>>>> subject: Unsubscribe
>>>>> You can also unsubscribe by visiting
>>>>> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
>>>>>
>>>>> The archives can be found at:
>>>>> https://lists.cs.wisc.edu/archive/htcondor-users/