Re: [HTCondor-users] condor_off -peaceful -daemon master permissions check fail (BUG?)
- Date: Thu, 13 Jun 2013 12:06:49 -0500
- From: Zachary Miller <zmiller@xxxxxxxxxxx>
- Subject: Re: [HTCondor-users] condor_off -peaceful -daemon master permissions check fail (BUG?)
On Thu, Jun 13, 2013 at 03:09:30PM +0200, Joan J. Piles wrote:
> Hi Brian,
>
> What I wonder is whether sending a first condor_off command to the startd and
> then another one to the master would work: since DC_OFF_PEACEFUL has already
> arrived at the startd, the second SIGTERM from the master should be mostly
> ignored... shouldn't it?
This should in fact work in your case:
condor_off -peaceful -startd REMOTEHOST
condor_off -peaceful -master REMOTEHOST
However, as Brian mentioned, there are other problems. One that I was debugging
earlier this week is that if REMOTEHOST is not an actual DNS name, then:
condor_off -peaceful -startd STARTDNAME
doesn't work. You can then use:
condor_off -peaceful -startd -addr <rem.ote.ho.st:port>
condor_off -peaceful -master -addr <rem.ote.ho.st:port>
to communicate directly with the daemons, but that is quite clunky.
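
The two-step sequence above could be scripted roughly as follows. This is only a sketch: the flags are copied verbatim from the commands in this message and have not been checked against any particular HTCondor release, and `peaceful_off_commands` and `run` are hypothetical helper names. Building the argument lists has no side effects; nothing is executed unless `run` is called.

```python
import subprocess


def peaceful_off_commands(startd_target, master_target, use_addr=False):
    """Build the two condor_off invocations, startd first, then master.

    Each target is a host name, or an address string such as
    "<rem.ote.ho.st:port>" when use_addr is True. The flags mirror the
    commands quoted in this message; they are an assumption, not a
    checked reference for any HTCondor version.
    """
    commands = []
    for daemon_flag, target in (("-startd", startd_target),
                                ("-master", master_target)):
        argv = ["condor_off", "-peaceful", daemon_flag]
        if use_addr:
            argv.append("-addr")
        argv.append(target)
        commands.append(argv)
    return commands


def run(commands):
    """Run each command in order and stop at the first failure."""
    for argv in commands:
        # Passing a list avoids the shell, so "<...>" needs no quoting.
        subprocess.run(argv, check=True)
```

For example, `peaceful_off_commands("REMOTEHOST", "REMOTEHOST")` yields the first pair of commands. The two targets are separate parameters because, in the `-addr` form, the startd and the master normally have different addresses.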
Our proposed solution is partially described in the ticket Brian linked:
http://htcondor-wiki.cs.wisc.edu/index.cgi/tktview?tn=3686
which is to essentially do what you suggested, and only ever communicate with
the condor_master on a given machine, and change the authorization levels so
that the master can inform its children about the way in which they're being
shut down.
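
To make the motivation concrete, here is a toy model of the failure Joan reported (quoted below). It is a deliberate simplification, not HTCondor's authorization code: `is_allowed` is a hypothetical helper that only checks the peer address against an allow list, the addresses are the ones from the quoted logs, and it assumes that CONDOR_HOST is the management node.

```python
MANAGEMENT_NODE = "172.16.4.103"  # management node in the quoted logs
TEST_NODE = "172.16.6.2"          # worker whose master relays the command


def is_allowed(peer_ip, allow_list):
    """Toy host-based check: a peer is authorized only if it is listed."""
    return peer_ip in allow_list


# Stand-in for HOSTALLOW_ADMINISTRATOR = $(CONDOR_HOST)
# (assumption: CONDOR_HOST resolves to the management node).
allow_admin = {MANAGEMENT_NODE}

# condor_off reaches the master from the management node: accepted.
master_accepts = is_allowed(MANAGEMENT_NODE, allow_admin)

# The master relays DC_SET_PEACEFUL_SHUTDOWN from the local address: denied,
# so only the later SIGTERM takes effect and the shutdown is graceful.
startd_accepts = is_allowed(TEST_NODE, allow_admin)

# Workaround from the quoted message: also list the local host.
startd_accepts_with_workaround = is_allowed(TEST_NODE, allow_admin | {TEST_NODE})
```

In this model the master accepts the command, the startd rejects the relayed one, and the workaround makes the startd accept it. Talking only to the master and letting it inform its children would remove the need for that extra allow entry.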
This is a big enough change that we're planning it for the development series,
and not stable. Currently it's targeted for 8.1.0. I'll also add your notes
into the ticket.
Cheers,
-zach
> Joan
>
> El 13/06/13 14:22, Brian Bockelman escribió:
>
> Hi Joan,
>
> A peaceful-off is a combination of two signals -
> a) A signal over the DaemonCore socket. (This is not an RPC, so the master
> does not know whether it succeeded.)
> b) A traditional unix signal.
>
> As you saw, if HTCondor permissions are incorrect, it's possible (and
> locally, we've done this about a dozen times) to have (b) go through while
> (a) is ignored. This results in a graceful off when a peaceful one was
> intended.
>
> There are other issues with condor_off. See:
> https://htcondor-wiki.cs.wisc.edu/index.cgi/tktview?tn=3686
>
> After being burned a few too many times, we either:
> (a) SSH into each node individually and send the command locally, or
> (b) Find a working combination via trial & error with a few worker nodes,
> then drain off the cluster.
>
> I tend to grumble about condor_off constantly -- I'm hoping it'll get
> redesigned in the next year or so to be more robust.
>
> Brian
>
> On Jun 13, 2013, at 6:25 AM, "Joan J. Piles" <jpiles@xxxxxxxxx> wrote:
>
>
> Hi all,
>
> I don't know if this is a bug (I think it is), but there is a problem
> when you try to do a condor_off -peaceful -daemon master node from a
> central management machine.
>
> When the condor_master gets the peaceful shutdown command, it gets it
> from an authorized (as ADMINISTRATOR) machine. However, when it
> propagates this command to its child daemons, it does so as the local
> machine, which is not in the HOSTALLOW_ADMINISTRATOR list. We can see
> this in the log (172.16.4.103 is our management node, and 172.16.6.2 our
> test node):
>
> MasterLog (trimmed, only relevant lines):
>
>
> 06/13/13 13:14:08 Received TCP command 60015 (DC_OFF_PEACEFUL) from
> unauthenticated@unmapped <172.16.4.103:46020>, access level
> ADMINISTRATOR
> 06/13/13 13:14:08 Calling HandleReq <handle_off_peaceful()> (0) for
> command 60015 (DC_OFF_PEACEFUL) from unauthenticated@unmapped
> <172.16.4.103:46020>
> 06/13/13 13:14:08 Got SIGTERM. Performing graceful shutdown.
> 06/13/13 13:14:08 Completed DC_SET_PEACEFUL_SHUTDOWN to local
> startd
> 06/13/13 13:14:14 Sent SIGTERM to STARTD (pid 31817)
> 06/13/13 13:14:14 The STARTD (pid 31817) exited with status 0
> 06/13/13 13:14:15 All daemons are gone. Exiting.
>
>
>
> Here, we see that the request comes from an authorized source. However,
> what the startd sees is subtly different, as the order is seen as
> coming from the local machine, which is not authorized:
>
>
> StartLog:
>
>
> 06/13/13 13:14:08 Calling Handler
> <DaemonCommandProtocol::WaitForSocketData> (2)
> 06/13/13 13:14:08 PERMISSION DENIED to unauthenticated@unmapped
> from host 172.16.6.2 for command 60016 (DC_SET_PEACEFUL_SHUTDOWN),
> access level ADMINISTRATOR: reason: ADMINISTRATOR authorization
> policy contains no matching ALLOW entry for this request;
> identifiers used for this host:
> 172.16.6.2,her06-02.hermes.cps.unizar.es,her06-02, hostname size = 2,
> original ip address = 172.16.6.2
>
>
>
> Then, when it later gets the SIGTERM:
>
>
> 06/13/13 13:14:14 Got SIGTERM. Performing graceful shutdown.
> 06/13/13 13:14:14 shutdown graceful
> 06/13/13 13:14:14 All resources are free, exiting.
>
>
> The end result is that we get a graceful shutdown instead of the
> peaceful one we asked for.
>
> An obvious workaround is to change:
>
>
> HOSTALLOW_ADMINISTRATOR = $(CONDOR_HOST)
>
>
> to:
>
>
> HOSTALLOW_ADMINISTRATOR = $(CONDOR_HOST), $(FULL_HOSTNAME)
>
>
> But since it's not the default policy, nor is there a clear reason why
> this should be so, I think it's more of a bug. condor_master should
> somehow authenticate as DAEMON, or pass on the credentials to the startd.
>
> When we do a condor_off -peaceful -daemon startd, however, everything
> works as expected since the shutdown command comes directly from the
> management machine.
>
> Regards,
>
> Joan
>
>
>
> --
> --------------------------------------------------------------------------
> Joan Josep Piles Contreras - Analista de sistemas
> I3A - Instituto de Investigación en Ingeniería de Aragón
> Tel: 876 55 51 47 (ext. 845147)
> http://i3a.unizar.es -- jpiles@xxxxxxxxx
> --------------------------------------------------------------------------
>
> _______________________________________________
> HTCondor-users mailing list
> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx
> with a
> subject: Unsubscribe
> You can also unsubscribe by visiting
> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
>
> The archives can be found at:
> https://lists.cs.wisc.edu/archive/htcondor-users/