Re: [HTCondor-users] Error with condor_power
- Date: Wed, 11 Mar 2026 11:40:24 +0100 (CET)
- From: "Beyer, Christoph" <christoph.beyer@xxxxxxx>
- Subject: Re: [HTCondor-users] Error with condor_power
Hi,
the most likely problem here is that when the machine powers down it sends a last ClassAd update, overwriting the previous offline state. That is a known issue but, to my knowledge, not yet fixed (?)
Try changing the kill signal in the systemd unit for condor on the worker:
[root@batch1064 ~]# grep -i kill /etc/systemd/system/condor.service.d/01-condor-basic-overwrites.conf
# send sigkill instead of sigterm
KillSignal=SIGKILL
(SIGKILL instead of the default SIGTERM)
(this is on RH-like systems; you will need to find the equivalent file on Ubuntu-like systems ...)
Not pretty, but it will preserve the offline state ...
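For reference, a minimal sketch of writing such a drop-in from a shell. The function name write_condor_dropin and the file name override.conf are illustrative assumptions, not something HTCondor ships; the directory default is the path from the grep above, and KillSignal sits in the [Service] section of the unit.

```shell
#!/bin/sh
# Sketch only: create a systemd drop-in so that systemd stops condor with SIGKILL.
# write_condor_dropin and override.conf are hypothetical names chosen for this example.

write_condor_dropin() {
    # $1: drop-in directory; defaults to the path quoted in this thread
    dropin_dir="${1:-/etc/systemd/system/condor.service.d}"
    mkdir -p "${dropin_dir}" || return 1
    # KillSignal must appear under the [Service] section header
    printf '[Service]\nKillSignal=SIGKILL\n' > "${dropin_dir}/override.conf"
}

# Typical use as root (left commented out so sourcing this file changes nothing):
# write_condor_dropin && systemctl daemon-reload
```

systemd only picks up the new file after a daemon-reload. The override applies to every stop of the unit, so a plain "systemctl restart condor" will also kill the daemons without cleanup.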
Best
christoph
--
Christoph Beyer
DESY Hamburg
IT-Department
Notkestr. 85
Building 02b, Room 009
22607 Hamburg
phone:+49-(0)40-8998-2317
mail: christoph.beyer@xxxxxxx
----- Original Message -----
From: "Valerio Bellizzomi" <valerio@xxxxxxxxxx>
To: "htcondor-users" <htcondor-users@xxxxxxxxxxx>
Sent: Wednesday, 11 March 2026 11:18:09
Subject: Re: [HTCondor-users] Error with condor_power
On Tue, 2026-03-10 at 22:09 +0000, Zach McGrew wrote:
> Hi Valerio,
>
> With regard to Unhibernate being undefined, that's interesting! I didn't notice that when I was setting up Rooster here. Its value is visible if you print the `-long` or string form (-f '%s') of the slot:
>
> $ condor_status slot1@xxxxxxxxxxxxxxxxxxxxxxxx -f '%s\n' Unhibernate
> MY.MachineLastMatchTime =!= undefined
>
> But I don't get the variable back when I use "condor_config_val -dump -verbose Unhibernate". I do see ROOSTER_MAX_UNHIBERNATE, ROOSTER_UNHIBERNATE, and ROOSTER_UNHIBERNATE_RANK though. It is described in documentation [1] at least? Better than nothing.
>
> Rooster not working as intended could be from absent class ads being different from offline class ads? The documentation for Absent ClassAds [2] says "This renders absent ClassAds invisible to the rest of the HTCondor infrastructure." Best I can tell, this is indeed true, which is why I run a little shell script service (search the list archives for an example of the systemd service to start it) that makes the absent classads not absent anymore.
Hi Zach,
I have redefined Unhibernate:
# condor_status -absent -long|grep Unhibernate
Unhibernate = MY.MachineLastMatchTime =!= undefined
Unhibernate = MY.MachineLastMatchTime =!= undefined
But it doesn't show up in the output of the following command:
# condor_config_val -dump -verbose Unhibernate
# Configuration from machine: htcondor.sel
# Parameters with names that match Unhibernate:
ROOSTER_MAX_UNHIBERNATE = 0
# at: <Default>
# expanded: 0
# default: 0
ROOSTER_UNHIBERNATE = Absent && UNHIBERNATE
# at: /etc/condor/config.d/01-central-manager.config, line 28
# expanded: Absent && UNHIBERNATE
# default: Offline && Unhibernate
ROOSTER_UNHIBERNATE_RANK =
# at: <Default>
# expanded:
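One way to check by hand which ads an expression like the ROOSTER_UNHIBERNATE value above selects is to pass it to condor_status as a constraint. This is only a sketch: the function name count_rooster_candidates is made up for illustration, and the query merely approximates what condor_rooster itself sends to the collector.

```shell
#!/bin/sh
# Sketch only: count absent machine ads that satisfy a ClassAd expression.
# count_rooster_candidates is a hypothetical helper, not an HTCondor tool.

count_rooster_candidates() {
    # $1: constraint to test; defaults to the ROOSTER_UNHIBERNATE value shown above
    constraint="${1:-Absent && Unhibernate}"
    # -absent includes absent ads, -af Name prints one slot name per line,
    # grep -c . counts the non-empty lines (|| true keeps a zero count from failing)
    condor_status -absent -constraint "${constraint}" -af Name | grep -c . || true
}

# Example (left commented out so sourcing this file runs nothing):
# count_rooster_candidates 'Absent && Unhibernate'
```

A count of 0 here while "condor_status -absent" lists the ep would suggest the Unhibernate part of the expression is what fails to evaluate to True.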
>
> #!/bin/sh
>
> SLEEP_TIME=$(condor_config_val ROOSTER_INTERVAL)
>
> if echo "${SLEEP_TIME}" | grep -q 'Not defined' ; then
>     SLEEP_TIME=300
> fi
>
> POOL=$(condor_config_val COLLECTOR_HOST)
>
> if echo "${POOL}" | grep -q 'Not defined' ; then
>     POOL='localhost'
> fi
>
> while true
> do
>     sleep ${SLEEP_TIME}
>     for h in $(condor_status -pool "${POOL}" -absent | grep slot1@ | cut -d ' ' -f 1)
>     do
>         1>&2 date -u
>         1>&2 echo "Host: ${h}"
>         if condor_status -pool "${POOL}" -absent -long "${h}" | grep -qi 'START = ' ; then
>             # Update ad if still valid (contains start expression)
>             condor_status -pool "${POOL}" -absent -long "${h}" | \
>                 grep -v '^Absent =' | \
>                 condor_advertise -pool "${POOL}" UPDATE_STARTD_AD_WITH_ACK -
>         else
>             1>&2 echo "Invalid classad detected!"
>             1>&2 condor_status -pool "${POOL}" -absent -long "${h}"
>         fi
>     done
> done
>
>
> That script checks for absent ads, strips the Absent attribute, and re-sends it to the collector with "UPDATE_STARTD_AD_WITH_ACK" which sets the Offline attribute to True as part of the process. With this service running, Rooster is able to see the offline machines and wake them up when work comes in with matching requirements.
>
> The other part of this is that when HTCondor stops, it invalidates its own machine ad in the collector. One of the things it does is remove the start attribute, which leads to nothing ever matching with the machine, and then Rooster never wakes it up.
>
> A quick override of the systemd service file can tell it to SIGKILL the service so HTCondor can't clean up and invalidate itself (this has other issues, in particular if you just wanted to `systemctl restart condor` and not shut down the computer):
>
> # cat /etc/systemd/system/condor.service.d/override.conf
> [Service]
> KillSignal=SIGKILL
>
> There's an open Jira ticket [3] to address this so it will hopefully get addressed eventually.
Yes, I hope it will get addressed, however I see that this ticket
(HTCONDOR-1806) is from 2023 ...
> Hope that helps,
> -Zach
>
> Reference URLs:
> 1. https://htcondor.readthedocs.io/en/latest/admin-manual/configuration-macros.html#UNHIBERNATE
> 2. https://htcondor.readthedocs.io/en/latest/admin-manual/cm-configuration.html#absent-classads
> 3. https://opensciencegrid.atlassian.net/browse/HTCONDOR-1806
>
> ________________________________________
> From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Valerio Bellizzomi <valerio@xxxxxxxxxx>
> Sent: Tuesday, March 10, 2026 8:38 AM
> To: htcondor-users@xxxxxxxxxxx
> Subject: Re: [HTCondor-users] Error with condor_power
>
> On Mon, 2026-03-09 at 16:21 +0100, Valerio Bellizzomi wrote:
> > On Fri, 2026-03-06 at 21:36 +0000, Jaime Frey via HTCondor-users wrote:
> > >
> > > As Zach McGrew said, you can't feed the contents of the offline.ads file to condor_power, so the errors you're seeing there are expected. Instead, you'd provide each ad from running 'condor_status -offline -long'.
> > >
> > >
> > >
> > > Can you provide more detail on condor_rooster not working?
> >
> >
> > For now I can say that my test ep never matched "Offline &&
> > Unhibernate", but it matches "Absent"
> >
> > Unhibernate does not have a default value and is "Not defined".
> >
> > "condor_status -offline" shows nothing, but "condor_status -absent"
> > shows the offline ep.
>
> Of course with "Absent" alone the ep is woken up every 300 seconds. Now
> the problematic task is to define Unhibernate so that the ep is not
> woken up unless it receives a job to run.
>
>
>
> >
> >
> > > I suggest adding this line to your configuration, which will cause condor_rooster to write more details to its RoosterLog file:
> > >
> > > ROOSTER_DEBUG = D_FULLDEBUG
> > >
> > > With the more detailed logging, it will write these messages each time it looks for machines to wake up:
> > >
> > > Cock-a-doodle-doo! (Time to look for machines to wake up.)
> > > Got ### startd ads matching ROOSTER_UNHIBERNATE=...
> > > Sending wakeup call to XXXX.
> > >
> > > You'll see that last line only if the count of startd ads is greater than 0.
> > >
> > > - Jaime
> > >
> > >
> > >
> > > > On Mar 5, 2026, at 12:42 AM, Valerio Bellizzomi <valerio@xxxxxxxxxx> wrote:
> > > >
> > > > On Thu, 2026-02-26 at 19:25 +0100, Valerio Bellizzomi wrote:
> > > > > On Thu, 2026-02-26 at 17:05 +0000, Zach McGrew wrote:
> > > > > > The `-i` tells condor_power to read a classad from stdin and not a file. Removing the `-i` lets you specify a file to read from instead. It's a neat trick where you can build your own tiny classad to wake up a machine similar to what condor_rooster does. You can use something like this to wake up machines on demand:
> > > > > >
> > > > > > printf "MyAddress = \"<${the_ip}:9618>\"\nHardwareAddress = \"${hwaddr}\"\nSubnetMask = \"${subnet}\"\n" | condor_power -i
> > > > > >
> > > > > > Presumably your offline.ads is set by "COLLECTOR_PERSISTENT_AD_LOG" in which case it's not a classad, but a little database like file that describes the slots that the collector was aware of but stopped talking to it for some reason or another. These should be visible with "condor_status -offline" or "condor_status -absent" depending on how they got entered. The file itself is used to restore those slots into memory when the collector restarts (i.e. restarting the collector means you no longer forget about the EPs that are powered off). You're not meant to pass this file as is to condor_power.
> > > > >
> > > > > Yes, exactly. My central-manager config is as follows:
> > > > > ABSENT_REQUIREMENTS = True
> > > > > EXPIRE_INVALIDATED_ADS = True
> > > > > COLLECTOR_PERSISTENT_AD_LOG = $(SPOOL)/offline.ads
> > > > > VALID_SPOOL_FILES = $(SPOOL)/offline.ads
> > > > >
> > > > > Condor_rooster is supposed to call condor_power to wake up a machine,
> > > > > and in this case the documentation says that the default value is
> > > > > condor_power -d -i:
> > > > >
> > > > > https://htcondor.readthedocs.io/en/latest/admin-manual/configuration-macros.html#condor-rooster-configuration-file-macros
> > > > >
> > > > > ROOSTER_WAKEUP_CMD
> > > > >
> > > > > A string representing the command line invoked by condor_rooster
> > > > > that is to wake up a machine. The command and any arguments should be
> > > > > enclosed in double quote marks, the same as arguments syntax in an
> > > > > HTCondor submit description file. The default value is
> > > > > "$(BIN)/condor_power -d -i". The command is expected to read from its
> > > > > standard input a ClassAd representing the offline machine.
> > > > >
> > > > >
> > > > > But this configuration does not work for me.
> > > > >
> > > >
> > > > Follow-up:
> > > >
> > > > The following command wakes up the ep:
> > > >
> > > > condor_power -d -s 255.255.255.255 -m b8:af:6f:84:5c:67
> > > > 03/04/26 19:53:14 Can't find Name in classad for startd
> > > > 03/04/26 19:53:14 Can't find CondorVersion in classad for startd
> > > > 03/04/26 19:53:14 Can't find CondorPlatform in classad for startd
> > > > 03/04/26 19:53:14 Can't find Machine in classad for startd
> > > > Packet sent.
> > > >
> > > > the ep boots and the job that was waiting in idle state now is in
> > > > running state.
> > > > however I don't know how to automate this since rooster doesn't seem to
> > > > invoke condor_power at intervals of 300 sec like it is specified in my
> > > > configuration.
> > > >
> > > > Looking at the condor_power code, "error in class ad" is actually
> > > > E_CLASSAD = -9, while the errno = -1 indicates something else.
> > > >
> > > >
> > > > >
> > > > > > -Zach
> > > > > >
> > > > > > ________________________________________
> > > > > > From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Valerio Bellizzomi <valerio@xxxxxxxxxx>
> > > > > > Sent: Thursday, February 26, 2026 6:01 AM
> > > > > > To: htcondor-users@xxxxxxxxxxx
> > > > > > Subject: Re: [HTCondor-users] Error with condor_power
> > > > > >
> > > > > > On Thu, 2026-02-26 at 13:20 +0000, Pelletier, Michael V via HTCondor-
> > > > > > users wrote:
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > Double-check the offline.ads file for that "107 1" string; it looks like it's just carping about a syntax error.
> > > > > > >
> > > > > > > Michael V Pelletier
> > > > > > > Principal Technologist
> > > > > > >
> > > > > > > C: +1 339.293.9149
> > > > > > > michael.v.pelletier@xxxxxxx
> > > > > >
> > > > > > Thank you, the file is generated automatically by the collector I
> > > > > > think, and I have attempted to edit the file removing the initial
> > > > > > numbers, but still the same error.
> > > > > >
> > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Valerio Bellizzomi
> > > > > > > Sent: Thursday, February 26, 2026 6:19 AM
> > > > > > > To: HTCondor-Users List <htcondor-users@xxxxxxxxxxx>
> > > > > > > Subject: [HTCondor-users] Error with condor_power
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > Hello,
> > > > > > > has anyone had the same error?
> > > > > > >
> > > > > > > # condor_power -d -i /var/spool/condor/offline.ads
> > > > > > > 02/26/26 12:16:41 failed to create classad; bad expr = '107 1
> > > > > > > CreationTimestamp 1772031128'
> > > > > > > condor_power: error in class-ad (errno = -1).
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > _______________________________________________
> > > > > > > HTCondor-users mailing list
> > > > > > > To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
> > > > > > > subject: Unsubscribe
> > > > > > >
> > > > > > > > The archives can be found at: https://www-auth.cs.wisc.edu/lists/htcondor-users/