[HTCondor-users] Unsubscribe
- Date: Thu, 16 Jul 2020 15:22:55 -0500
- From: psicurity@xxxxxxxxx
- Subject: [HTCondor-users] Unsubscribe
> On Jul 16, 2020, at 1:12 AM, Sever, Krunoslav <krunoslav.sever@xxxxxxx> wrote:
>
> Hi Mark,
>
> maybe I should have provided some log excerpts from the start...
>
> Okay, here is a more detailed timeline in terms of logs:
>
> ----
> (Startlog) - crash at 14:52, unknown reason
> ----
> 07/13/20 14:52:06 Setting up slot pairings
> ^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^
> ....
> ^@^@^@^@^@^@^@^@^@^@^@^@07/14/20 09:38:02 ******************************************************
> 07/14/20 09:38:02 ** condor_startd (CONDOR_STARTD) STARTING UP
> 07/14/20 09:38:02 ** /usr/sbin/condor_startd
> ....
>
> ----
> (CollectorLog)
> ----
> 07/13/20 15:12:44 **** Removing stale ad: "< slot2_3@xxxxxxxxxxxxxxxxx , 127.0.0.1 >"
> 07/13/20 15:12:44 Added ad to persistent store key=<slot2_3@xxxxxxxxxxxxxxxxx,127.0.0.1>
>
> According to the source code, these two lines are produced precisely when an ad gets the Absent attribute
> (offline_plugin on the collector). Hence, from this point on:
>
> ----
> (condor_status -absent)
> slot1@xxxxxxxxxxxxxxxxx LINUX X86_64 7/13 14:52 8/12 14:52
> ...
>
> So far, up to the reboot at 09:38, everything is as it should be after such a crash.
>
> Now the node with the startd reboots and gets a job:
>
> (CollectorLog)
> 07/14/20 09:38:02 MasterAd : Inserting ** "< batch1066.desy.de >"
> 07/14/20 09:38:14 StartdAd : Inserting ** "< slot1@xxxxxxxxxxxxxxxxx , 131.169.160.166 >"
> 07/14/20 09:38:14 StartdPvtAd : Inserting ** "< slot1@xxxxxxxxxxxxxxxxx , 131.169.160.166 >"
> 07/14/20 09:38:14 StartdAd : Inserting ** "< slot2@xxxxxxxxxxxxxxxxx , 131.169.160.166 >"
> 07/14/20 09:38:14 StartdPvtAd : Inserting ** "< slot2@xxxxxxxxxxxxxxxxx , 131.169.160.166 >"
> 07/14/20 09:44:01 StartdAd : Inserting ** "< slot2_1@xxxxxxxxxxxxxxxxx , 131.169.160.166 >"
> ...(rest of the slots, gradually until final slot)
> 07/14/20 09:49:32 StartdAd : Inserting ** "< slot2_47@xxxxxxxxxxxxxxxxx , 131.169.160.166 >"
> 07/14/20 09:49:32 StartdPvtAd : Inserting ** "< slot2_47@xxxxxxxxxxxxxxxxx , 131.169.160.166 >"
>
> (NegotiatorLog)
> 07/14/20 09:43:58 Request 50111935.00000: autocluster 915 (request count 87 of 100)
> 07/14/20 09:43:58 Matched 50111935.0 BIRD_cms.lite.user@xxxxxxx <131.169.223.41:9618?addrs=131.169.223.41-9618+[2001-638-700-10df--1-29]-9618&noUDP&sock=schedd_2006168_f5e8_3> preempting none <131.169.160.166:36119?addrs=131.169.160.166-36119&alias=batch1066.desy.de> slot2@xxxxxxxxxxxxxxxxx
> 07/14/20 09:43:58 Successfully matched with slot2@xxxxxxxxxxxxxxxxx
>
> (SchedLog)
> 7/14/20 09:43:59 (pid:3957219) Started shadow for job 50111938.9 on slot2@xxxxxxxxxxxxxxxxx <131.169.160.166:36119?addrs=131.169.160.166-36119&alias=batch1066.desy.de> for BIRD_cms.lite.user, (shadow pid = 3165751)
> ...
> 07/14/20 10:09:27 (pid:3957219) Shadow pid 3165751 for job 50111938.9 exited with status 115
> ...
> 07/14/20 10:09:27 (pid:3957219) Match record (slot2@xxxxxxxxxxxxxxxxx <131.169.160.166:36119?addrs=131.169.160.166-36119&alias=batch1066.desy.de> for BIRD_cms.lite.user, 50111938.9) deleted
>
> The output of condor_status -absent remains unchanged (a few hours after the reboot) and presumably all the time since reboot.
>
> Then there is another crash of the node and the node is marked absent again:
>
> (CollectorLog)
> 07/14/20 23:42:44 **** Removing stale ad: "< slot2_22@xxxxxxxxxxxxxxxxx , 131.169.160.166 >"
> 07/14/20 23:42:44 Added ad to persistent store key=<slot2_22@xxxxxxxxxxxxxxxxx,131.169.160.166>
>
> At this point, I would have expected the absent output to have changed to 23:42 - a few hours later it is still 15:12.
>
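To compare what the collector logged against what `condor_status -absent` reports, the "Removing stale ad" events can be pulled out of the CollectorLog with their timestamps. The following is only a minimal sketch: the regular expression is modelled on the two log lines quoted in this thread, the function name `stale_events` is hypothetical, and the line format in other HTCondor versions is an assumption.

```python
import re
from datetime import datetime

# Pattern modelled on the CollectorLog lines quoted in this thread, e.g.
#   07/13/20 15:12:44 **** Removing stale ad: "< slot2_3@host , 127.0.0.1 >"
# Whether other HTCondor versions use the same wording is an assumption.
STALE_RE = re.compile(
    r'^(?P<ts>\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) \*{4} Removing stale ad: '
    r'"< (?P<name>\S+) , (?P<addr>\S+) >"'
)


def stale_events(lines):
    """Return a sorted list of (timestamp, slot name, address) tuples,
    one per 'Removing stale ad' line found in `lines`."""
    events = []
    for line in lines:
        m = STALE_RE.match(line.strip())
        if m:
            when = datetime.strptime(m.group("ts"), "%m/%d/%y %H:%M:%S")
            events.append((when, m.group("name"), m.group("addr")))
    return sorted(events)


if __name__ == "__main__":
    # Hypothetical host name; the real one is redacted in the archive.
    sample = [
        '07/14/20 23:42:44 **** Removing stale ad: "< slot2_22@node.example , 131.169.160.166 >"',
        '07/13/20 15:12:44 **** Removing stale ad: "< slot2_3@node.example , 127.0.0.1 >"',
        "07/14/20 09:38:02 MasterAd : Inserting ** \"< node.example >\"",
    ]
    for when, name, addr in stale_events(sample):
        print(when.isoformat(sep=" "), name, addr)
```

Listing the events per slot makes it easier to see whether the time shown by `condor_status -absent` corresponds to the first or the second crash.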
> I did search the CollectorLog for lines matching
>
> 07/14/20 hh:mm:ss Removed ad from persistent store key=<slotX_Y@xxxxxxxxxxxxxxxxx,131.169.160.166>
>
> which are produced for other nodes on at least two occasions (explicit invalidate from node and presumably regular absent removal).
>
> I am fairly sure these would indicate removal of the ad with the Absent attribute.
>
> But there were none, so I figure that the absent ads somehow remained alongside the newly inserted ones above and are possibly the reason for this whole behaviour.
>
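The search described above (looking for "Removed ad from persistent store" lines that match earlier "Added ad to persistent store" lines) can be sketched as a small script. This is an illustration only: the patterns are copied from the log lines quoted in this thread, the function name `unremoved_keys` is hypothetical, and nothing here reflects how the collector itself tracks these ads.

```python
import re

# Patterns copied from the CollectorLog lines quoted in this thread, e.g.
#   07/13/20 15:12:44 Added ad to persistent store key=<slot2_3@host,127.0.0.1>
#   07/14/20 hh:mm:ss Removed ad from persistent store key=<slotX_Y@host,addr>
# The exact wording in other HTCondor versions is an assumption.
EVENT_RE = re.compile(
    r"^(?P<ts>\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<action>Added ad to|Removed ad from) persistent store "
    r"key=<(?P<key>[^>]+)>"
)


def unremoved_keys(lines):
    """Return {key: timestamp of the 'Added' line} for every key that was
    added to the persistent store and not removed again later in `lines`."""
    pending = {}
    for line in lines:
        m = EVENT_RE.match(line.strip())
        if not m:
            continue
        key = m.group("key")
        if m.group("action").startswith("Added"):
            # Remember only the first unmatched 'Added' for each key.
            pending.setdefault(key, m.group("ts"))
        else:
            pending.pop(key, None)
    return pending


if __name__ == "__main__":
    # Hypothetical host names; the real ones are redacted in the archive.
    sample = [
        "07/13/20 15:12:44 Added ad to persistent store key=<slot2_3@node.example,127.0.0.1>",
        "07/14/20 08:00:00 Added ad to persistent store key=<slot1@other.example,10.0.0.2>",
        "07/14/20 08:30:00 Removed ad from persistent store key=<slot1@other.example,10.0.0.2>",
    ]
    for key, ts in unremoved_keys(sample).items():
        print(ts, key)
```

The sketch compares full keys, address included. In the excerpts above, the key stored at 15:12 carries 127.0.0.1 while the ads inserted after the reboot carry 131.169.160.166, so a full-key comparison treats them as different entries; whether that difference matters to the collector is not established in this thread.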
> Hopefully this is more helpful.
>
> Best
> Kruno
>
> ----- Original Message -----
>> From: "Mark Coatsworth" <coatsworth@xxxxxxxxxxx>
>> To: "HTCondor-Users Mail List" <htcondor-users@xxxxxxxxxxx>
>> Sent: Thursday, 16 July, 2020 00:39:31
>> Subject: Re: [HTCondor-users] Absent node still active
>
>> Hi Kruno,
>>
>> I'm trying to reproduce this problem here on a 3-node testbed cluster.
>>
>> Can you explain how exactly the startd is crashing such that
>> `condor_status -absent` shows output? Can you attach what it says?
>>
>> In my tests with both controlled and forced shutdowns, the controller
>> seems to react well and I don't know how to get it to a state where
>> -absent returns anything. I'm running condor v8.8.9.
>>
>> Mark
>>
>>
>>
>>
>> On Wed, Jul 15, 2020 at 3:30 AM Sever, Krunoslav
>> <krunoslav.sever@xxxxxxx> wrote:
>>>
>>> Hi,
>>>
>>> got an absent node that should not be absent... here is the story:
>>>
>>> First it crashed and I see in the log where the collector dutifully set it to
>>> absent about 30 minutes later, when the startd ad expired.
>>>
>>> condor_status -absent (currently) shows that time.
>>>
>>> After the node started up again, I see that the collector received new startd
>>> ads, so I assume these would replace the absent ad.
>>>
>>> But condor_status -absent still shows the node, unchanged, a few hours after
>>> reboot.
>>>
>>> Moreover, a few minutes after reboot, the negotiator (surprisingly?) matched a
>>> job for the node, which was scheduled and ran.
>>>
>>> Even more interesting, the node apparently crashed again a few hours later and
>>> again I see the log entry where the collector sets the Absent attribute.
>>>
>>> But condor_status -absent *still* shows the original absent date, i.e. from the
>>> first crash.
>>>
>>> Looking through the sources I see that the offline plugin in the collector is
>>> the only place where the Absent attribute is set.
>>>
>>> A few other source files reference the attribute but only for reading purposes
>>> (e.g. condor_status).
>>>
>>> I also note that the persistent storage where the absent ads are put was never
>>> removed after reboot of the node.
>>>
>>> This removal is done when a node actively invalidates an ad, so maybe that's
>>> missing or didn't run somehow?
>>>
>>> Any ideas?
>>>
>>> Best
>>> Kruno
>>>
>>> --
>>> ------------------------------------------------------------------------
>>> Krunoslav Sever Deutsches Elektronen-Synchrotron (IT-Systems)
>>> Ein Forschungszentrum der Helmholtz-Gemeinschaft
>>> Notkestr. 85
>>> phone: +49-40-8998-1648 22607 Hamburg
>>> e-mail: krunoslav.sever@xxxxxxx Germany
>>> ------------------------------------------------------------------------
>>> _______________________________________________
>>> HTCondor-users mailing list
>>> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
>>> subject: Unsubscribe
>>> You can also unsubscribe by visiting
>>> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
>>>
>>> The archives can be found at:
>>> https://lists.cs.wisc.edu/archive/htcondor-users/
>>
>>
>>
>> --
>> Mark Coatsworth
>> Systems Programmer
>> Center for High Throughput Computing
>> Department of Computer Sciences
>> University of Wisconsin-Madison
>