Hi,
somehow I managed to completely ignore the IPs - it seems so obvious now...
Well, an objective eye is very valuable indeed.
I see that the negotiator matched the node at exactly 15:12, so I assume that is what made the collector notice that it was absent?
And since the negotiator and the collector are on the same host, is that where the 127.0.0.1 in the first log line comes from?
The second time, the node sent its startd ad, so it would have the actual node IP.
Hm, it seems like this could be improved, or is there a reason for this behavior?
I also applied your config suggestion, Christoph was immediately on board ;)
It seems like a (sensible) cosmetic change, but maybe it will also help to improve our absent situation at least a little.
Best
Kruno
From: "Todd Tannenbaum" <tannenba@xxxxxxxxxxx>
To: "HTCondor-Users Mail List" <htcondor-users@xxxxxxxxxxx>, "Krunoslav Sever" <krunoslav.sever@xxxxxxx>
Sent: Thursday, 16 July, 2020 20:39:20
Subject: Re: [HTCondor-users] Absent node still active
Hi Kruno,
A couple of quick thoughts on this:
1. The collector is essentially a hashtable of ClassAds, and the
key is a tuple of the classad's Name attribute and the MyAddress
(ip address) attribute. In the logs below, it appears that the
first time batch1066.desy was marked as absent, it had an IP
address of 127.0.0.1, as shown from this log entry:
07/13/20 15:12:44 Added ad to persistent store key=<slot2_3@xxxxxxxxxxxxxxxxx,127.0.0.1>
But then when this node crashed a second time, the Collector appears
to have made a second absent entry because the IP address was
different as shown here:
07/14/20 23:42:44 Added ad to persistent store key=<slot2_22@xxxxxxxxxxxxxxxxx,131.169.160.166>
So from the Collector's perspective, these are two different server
instances. Thus the original absent entry, with IP address
127.0.0.1, did not get replaced when the node crashed again.
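
To illustrate the point above, here is a minimal Python sketch of a store keyed by the (Name, MyAddress) tuple. It is only a toy model, not the collector's actual implementation; the function name mark_absent, the slot name, and the "note" field are hypothetical and chosen purely for illustration.

```python
# Toy model: absent ads stored in a dict keyed by (Name, MyAddress).
# This is an assumption-based illustration, not HTCondor code.
absent_store = {}

def mark_absent(name, my_address, ad):
    """Store an absent ad; return True if it replaced an existing entry."""
    key = (name, my_address)
    replaced = key in absent_store
    absent_store[key] = ad
    return replaced

# Hypothetical slot name; the two IPs are the ones from the log lines.
first = mark_absent("slot1@node.example", "127.0.0.1", {"note": "first crash"})
second = mark_absent("slot1@node.example", "131.169.160.166", {"note": "second crash"})
third = mark_absent("slot1@node.example", "131.169.160.166", {"note": "third crash"})

# Same Name but a different address yields a different key, so the
# 127.0.0.1 entry is never overwritten by the later ones.
print(first, second, third, len(absent_store))
```

Only the third call replaces an existing entry, because it is the only one whose key matches one already in the store. The entry under 127.0.0.1 stays until something removes it.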
2. It seems bizarre to have absent ads for each slot, especially when your
startd is configured for partitionable slots. Perhaps instead of
configuring
ABSENT_REQUIREMENTS = True
you may prefer to say something like
ABSENT_REQUIREMENTS = SlotID == 1
Hope the above helps,
Todd
--
Todd Tannenbaum <tannenba@xxxxxxxxxxx> University of Wisconsin-Madison
Center for High Throughput Computing Department of Computer Sciences
HTCondor Technical Lead 1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132 Madison, WI 53706-1685
--
------------------------------------------------------------------------
Krunoslav Sever Deutsches Elektronen-Synchrotron (IT-Systems)
Ein Forschungszentrum der Helmholtz-Gemeinschaft
Notkestr. 85
phone: +49-40-8998-1648 22607 Hamburg
e-mail: krunoslav.sever@xxxxxxx Germany
------------------------------------------------------------------------