[HTCondor-users] Absent node still active
- Date: Wed, 15 Jul 2020 10:30:17 +0200 (CEST)
- From: "Sever, Krunoslav" <krunoslav.sever@xxxxxxx>
- Subject: [HTCondor-users] Absent node still active
Hi,
I have a node that is listed as absent but should not be. Here is the story:
First it crashed, and the collector log shows that the collector dutifully marked it absent about 30 minutes later, when the startd ad expired.
condor_status -absent currently shows that time.
After the node started up again, I see that the collector received new startd ads, so I assume these would replace the absent ad.
But condor_status -absent still shows the node, unchanged, a few hours after reboot.
Moreover, a few minutes after the reboot, the negotiator (surprisingly?) matched a job to the node, and the job was scheduled and ran.
Even more interesting, the node apparently crashed again a few hours later and again I see the log entry where the collector sets the Absent attribute.
But condor_status -absent *still* shows the original absent date, i.e. from the first crash.
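
The cross-check I did by hand could be scripted roughly as follows. This is only a sketch: it assumes the absent ads and the regular startd ads have already been exported into plain Python dicts (for example from the output of condor_status -absent and condor_status), and that each dict carries the ad's Machine and LastHeardFrom values. The function name and the dict layout are made up for illustration.

```python
def find_stale_absent(absent_ads, live_ads):
    """Return (machine, absent_time, live_time) for every absent ad whose
    machine also has a more recent live ad.

    Each ad is a plain dict with the keys "Machine" and "LastHeardFrom"
    (Unix seconds). This layout is an assumption made for the sketch.
    """
    # Remember the most recent live update per machine (a node may
    # advertise several slots).
    newest_live = {}
    for ad in live_ads:
        machine = ad["Machine"]
        newest_live[machine] = max(newest_live.get(machine, 0), ad["LastHeardFrom"])

    stale = []
    for ad in absent_ads:
        live_time = newest_live.get(ad["Machine"])
        # An absent ad is suspicious if the same machine reported in later.
        if live_time is not None and live_time > ad["LastHeardFrom"]:
            stale.append((ad["Machine"], ad["LastHeardFrom"], live_time))
    return stale
```

A non-empty result would correspond to what I see here: a node that is still listed as absent although the collector has heard from it since.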
Looking through the sources, I see that the offline plugin in the collector is the only place where the Absent attribute is set.
A few other source files reference the attribute but only for reading purposes (e.g. condor_status).
I also note that the node's entry in the persistent storage where the absent ads are kept was never removed after the node rebooted.
That removal is done when a node actively invalidates an ad, so maybe that step is missing or did not run for some reason?
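
To make the suspicion concrete, here is a toy model of the two behaviours. It is not HTCondor code and not taken from the collector sources. The class, its methods and the clear_absent_on_update switch are all invented for illustration, and the assumption that an existing absent entry keeps its original timestamp is only my guess at what would explain the output I see.

```python
class ToyCollector:
    """Toy model of absent-ad bookkeeping; NOT HTCondor's collector code."""

    def __init__(self, clear_absent_on_update):
        # True models what I expected, False models what I seem to observe.
        self.clear_absent_on_update = clear_absent_on_update
        self.live = {}    # machine -> time of the last startd update
        self.absent = {}  # machine -> time it was marked absent (persistent)

    def update(self, machine, now):
        """A fresh startd ad arrives, e.g. after a reboot."""
        self.live[machine] = now
        if self.clear_absent_on_update:
            self.absent.pop(machine, None)

    def expire(self, machine, now):
        """The ad times out without being invalidated, e.g. after a crash."""
        self.live.pop(machine, None)
        # Assumption: an already stored absent entry is left untouched.
        self.absent.setdefault(machine, now)

    def invalidate(self, machine):
        """The node shuts down cleanly and invalidates its own ad."""
        self.live.pop(machine, None)
        self.absent.pop(machine, None)
```

With clear_absent_on_update set to False, the sequence crash, reboot, crash leaves the absent time of the first crash in place, which matches what condor_status -absent reports. With it set to True, the second crash would record a new time.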
Any ideas?
Best
Kruno
--
------------------------------------------------------------------------
Krunoslav Sever Deutsches Elektronen-Synchrotron (IT-Systems)
Ein Forschungszentrum der Helmholtz-Gemeinschaft
Notkestr. 85
phone: +49-40-8998-1648 22607 Hamburg
e-mail: krunoslav.sever@xxxxxxx Germany
------------------------------------------------------------------------