Hi Joan,
A peaceful-off is a combination of two signals: (a) a command over the DaemonCore socket (this is not an RPC, so the master does not know whether it succeeded), and (b) a traditional Unix signal.
As you saw, if HTCondor permissions are incorrect, it's possible (and locally, we've done this about a dozen times) to have (b) go through while (a) is ignored. This results in a graceful off when a peaceful one was intended.
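To make the failure mode concrete, here is a toy model of the two-step sequence. It is only a sketch, not HTCondor code: the class `ToyStartd` and the function `master_peaceful_off` are hypothetical names, and the authorization check is reduced to a simple host lookup.

```python
class ToyStartd:
    """Toy stand-in for a startd; not real HTCondor behaviour in detail."""

    def __init__(self, allowed_admin_hosts):
        self.allowed_admin_hosts = set(allowed_admin_hosts)
        self.peaceful = False

    def set_peaceful_shutdown(self, sender_host):
        # Step (a): a one-way command. The sender gets no result back,
        # so a denial here is invisible to it.
        if sender_host in self.allowed_admin_hosts:
            self.peaceful = True

    def sigterm(self):
        # Step (b): a plain Unix signal, which needs no authorization.
        return "peaceful" if self.peaceful else "graceful"


def master_peaceful_off(startd, master_host):
    """Send (a) then (b), as the master does, and report what happened."""
    startd.set_peaceful_shutdown(master_host)
    return startd.sigterm()
```

With only the management node in the allow list, `master_peaceful_off(ToyStartd({"172.16.4.103"}), "172.16.6.2")` yields a graceful shutdown, because (a) is dropped while (b) still arrives.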
After being burned a few too many times, we either: (a) SSH into each node individually and send the command locally, or (b) Find a working combination via trial & error with a few worker nodes, then drain off the cluster.
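Option (a) can be scripted roughly as follows. This is a sketch under assumptions: the node names are placeholders, the helper `build_local_peaceful_off` is hypothetical, and the remote command is simply the one from your report (I am not claiming it is the exact invocation we use). The function only builds the `ssh` argument lists; it does not run anything.

```python
def build_local_peaceful_off(nodes, remote_cmd="condor_off -peaceful -daemon master"):
    """Return one ssh argv list per node; nothing is executed here.

    remote_cmd is an assumption taken from the original report.
    """
    return [["ssh", node, remote_cmd] for node in nodes]


if __name__ == "__main__":
    # Placeholder hostnames; substitute the real worker nodes.
    for argv in build_local_peaceful_off(["worker01", "worker02"]):
        print(" ".join(argv))
```

Each list can be handed to `subprocess.run(argv, check=True)` once the command has been verified by hand on one or two nodes.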
I tend to grumble about condor_off constantly -- I'm hoping it'll get redesigned in the next year or so to be more robust.
Brian
Hi all,
I don't know if this is a bug (I think it is), but there is a
problem when you try to do a condor_off -peaceful -daemon master
node from a central management machine.
When the condor master gets the peaceful shutdown command, it gets
it from an authorized (as ADMINISTRATOR) machine. However, when it
propagates this command to its child daemons, it does so as
the local machine, which is not in the HOSTALLOW_ADMINISTRATOR list.
We can see it in the log (172.16.4.103 is our management node, and
172.16.6.2 our test node):
MasterLog (trimmed, only relevant lines):
06/13/13 13:14:08 Received TCP command 60015 (DC_OFF_PEACEFUL) from unauthenticated@unmapped <172.16.4.103:46020>, access level ADMINISTRATOR
06/13/13 13:14:08 Calling HandleReq <handle_off_peaceful()> (0) for command 60015 (DC_OFF_PEACEFUL) from unauthenticated@unmapped <172.16.4.103:46020>
06/13/13 13:14:08 Got SIGTERM. Performing graceful shutdown.
06/13/13 13:14:08 Completed DC_SET_PEACEFUL_SHUTDOWN to local startd
06/13/13 13:14:14 Sent SIGTERM to STARTD (pid 31817)
06/13/13 13:14:14 The STARTD (pid 31817) exited with status 0
06/13/13 13:14:15 All daemons are gone. Exiting.
Here, we see that the request comes from an authorized source.
However, what the startd sees is subtly different: the command is
seen as coming from the local machine, which is not authorized:
StartLog:
06/13/13 13:14:08 Calling Handler <DaemonCommandProtocol::WaitForSocketData> (2)
06/13/13 13:14:08 PERMISSION DENIED to unauthenticated@unmapped from host 172.16.6.2 for command 60016 (DC_SET_PEACEFUL_SHUTDOWN), access level ADMINISTRATOR: reason: ADMINISTRATOR authorization policy contains no matching ALLOW entry for this request; identifiers used for this host: 172.16.6.2,her06-02.hermes.cps.unizar.es,her06-02, hostname size = 2, original ip address = 172.16.6.2
It later gets the SIGTERM:
06/13/13 13:14:14 Got SIGTERM. Performing graceful shutdown.
06/13/13 13:14:14 shutdown graceful
06/13/13 13:14:14 All resources are free, exiting.
The end result is that we get a graceful shutdown instead of the
peaceful one we asked for.
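A quick way to spot this on other nodes is to scan the StartLog for the same pair of messages. The following is a minimal sketch that assumes the log lines look like the excerpts above; the function name `peaceful_downgraded` is hypothetical, and the patterns are matched only against the message formats shown here.

```python
import re

# Patterns modelled on the StartLog excerpts quoted in this message.
DENIED = re.compile(
    r"PERMISSION DENIED to \S+ from host (\S+) "
    r"for command \d+ \(DC_SET_PEACEFUL_SHUTDOWN\)"
)
GRACEFUL = re.compile(r"Got SIGTERM\. Performing graceful shutdown")


def peaceful_downgraded(startlog_text):
    """Return the denied sender host if a peaceful request was refused
    and a graceful shutdown was logged; otherwise return None."""
    # Collapse whitespace so entries wrapped across lines still match.
    text = " ".join(startlog_text.split())
    denied = DENIED.search(text)
    if denied and GRACEFUL.search(text):
        return denied.group(1)
    return None
```

Run against the StartLog above, this reports the test node's own address as the rejected sender, which is the symptom described here.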
An obvious workaround is to change:
HOSTALLOW_ADMINISTRATOR = $(CONDOR_HOST)
to:
HOSTALLOW_ADMINISTRATOR = $(CONDOR_HOST), $(FULL_HOSTNAME)
But since this is not the default policy, nor is there a clear reason
why it should be required, I think it's more of a bug. condor_master
should somehow authenticate as DAEMON, or pass on the credentials to
the startd.
When we do a condor_off -peaceful -daemon startd, however, everything
works as expected since the shutdown command comes directly from the
management machine.
Regards,
Joan
--
--------------------------------------------------------------------------
Joan Josep Piles Contreras - Analista de sistemas
I3A - Instituto de Investigación en Ingeniería de Aragón
Tel: 876 55 51 47 (ext. 845147)
http://i3a.unizar.es -- jpiles@xxxxxxxxx
--------------------------------------------------------------------------