
Re: [HTCondor-users] Hibernate and Cron interference



Hi,

Following up on the EP sending an extraneous ClassAd, I tried the
following:

1. I removed the Crons, including KFlops and MIPS. No change.

2. I changed the KillSignal in the systemd unit for condor from
"KillSignal=SIGQUIT" to "KillSignal=SIGKILL"

In this configuration, "condor_power_state set 5" sends its ClassAd, the
system shuts down, and systemd kills condor outright. As a result, the
extra ClassAd is not sent when the startd shuts down (confirmed by logs
on the EP and network captures).
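
For anyone wanting to reproduce this, here is a minimal sketch of the
change as a systemd drop-in rather than an edit to the packaged unit
file. It assumes the unit is named condor.service (as in "systemctl stop
condor"); the drop-in file name override.conf is arbitrary.

```ini
# /etc/systemd/system/condor.service.d/override.conf
# Assumption: the HTCondor unit is condor.service.
[Service]
# The unit on my machines had KillSignal=SIGQUIT. SIGKILL cannot be
# caught, so the startd gets no chance to send a final ClassAd.
KillSignal=SIGKILL
```

After creating the file, "systemctl daemon-reload" makes systemd pick up
the override.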


The EP correctly shows as Offline, Negotiator matching works, and
Rooster wakes the machines up perfectly. Woohoo \o/ !


One problem remains: when I do a "systemctl stop condor" or a manual
shutdown, condor also gets killed violently, and the CM is not told that
the machine should become absent. For that case I can lower the ClassAd
lifetime to keep things tidy.


Another thing I can try once I'm done with further testing is to divert
condor_power_state so that it sends the ClassAd and then kills condor
itself. The risk is that systemd respawns condor anyway and all these
efforts are lost :).


Now I think hibernation works without this kludge when doing
suspend-to-ram or suspend-to-disk, because systemd does not try to stop
the unit in these cases. But I do poweroff, because on the servers I
have, suspend to RAM saves only 5 watts, and suspend to disk is not
faster than a cold start.


To sum up, maybe this part of the condor code needs some polishing, to
prevent the startd from sending a stray ClassAd that breaks
hibernation?


Questions for Christoph: how do you start condor? On what OS? What
hibernation level do you use? I'm trying to figure out why it worked
for you and not for me :).


I'm a bit anxious about pushing a setup into production that relies on
killing software to prevent it from doing its job :').


Any insights ?


Thanks !


(and in the end, it has nothing to do with Cron)

-- 
Charles


Todd L Miller via HTCondor-users wrote:
> > That's my intuition, but I don't understand where this rogue Ad comes
> > from, and why it works for others :).
> 
> 	Given what you've said, I would assume that the "rogue" ad is caused by
> your continuous cron producing output (and thus forcing an update) between
> when the startd sends the hibernation ad and when the subsequent system shut
> down actually occurs.  (The startd doesn't keep sending the hibernate
> information in subsequent ads, probably because it's nontrivial to determine
> if the hibernation actually happened.)