
Re: [HTCondor-users] HTCondor high availability



Hi Christian,

my first guess would be that the scheds on both machines want to
authenticate each other. Daemons on the same node normally do this by
writing/reading a file under /tmp, but to authenticate daemons across
different nodes, you need to set something up explicitly.
Maybe the easiest(?) option would be to use the shared file system
(assuming that it is secure) and let the daemons authenticate each
other through it, i.e., with FS_REMOTE.

Maybe you can try to point both nodes at a shared path with
  FS_REMOTE_DIR = /path/foo/condor/sec

(the other authentication options such as SSL are probably more secure,
but the shared fs could be faster to set up for testing); see the
sketch below.

Cheers,
  Thomas

On 08/10/2020 12.04, Hennen, Christian wrote:
> Hi again,
> 
> searching through the log files once more, something caught my eye: when
> running condor_q on master2 while master1 is active, the following lines
> appear in the SchedLog (along with the segmentation fault message):
> 
> 10/08/20 11:50:30 (pid:47347) Number of Active Workers 0  
> 10/08/20 11:50:41 (pid:47347) AUTHENTICATE: handshake failed!    
> 10/08/20 11:50:41 (pid:47347) DC_AUTHENTICATE: authentication of
> <192.168.1.22:10977> did not result in a valid mapped user name, which is
> required for this command (519 QUERY_JOB_ADS_WITH_AUTH), so aborting. 
> 10/08/20 11:50:41 (pid:47347) DC_AUTHENTICATE: reason for authentication
> failure: AUTHENTICATE:1002:Failure performing
> handshake|AUTHENTICATE:1004:Failed to authenticate using
> KERBEROS|AUTHENTICATE:1004:Failed to authenticate using FS|FS:1004:Unable to
> lstat(/tmp/FS_XXXGNYmKn)  
> 
> Do I need to configure any other authentication methods, in addition to all
> servers using LDAP via PAM?
> 
> Kind regards
> 
> Christian
> 
> 
> 
> -----Original Message-----
> From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of
> Hennen, Christian
> Sent: Thursday, 1 October 2020 12:58
> To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
> Subject: Re: [HTCondor-users] HTCondor high availability
> 
> Hello Thomas,
> 
> the spool directory (/clients/condor/spool) is located on an NFS v3 share
> that every server has access to (/clients). All machines have a local user
> (r-admin) with uid and gid 1000, and the spool directory is owned by that
> user, since it is configured as the Condor user (see condor_config.local in
> the Serverfault thread). Every other user is mapped via LDAP on every server,
> including the storage cluster. On both master servers the user "condor" has
> the same uid and gid.
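> 
> In terms of condor_config.local, that boils down to something like the
> following (simplified; the full config is in the Serverfault post):
> 
>   # run the HTCondor daemons as the local r-admin user (uid/gid 1000)
>   CONDOR_IDS = 1000.1000
>   # job queue / spool on the NFS share visible to both masters
>   SPOOL = /clients/condor/spool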
> 
> Kind regards
> 
> Christian
> 
> 
> -----Original Message-----
> From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of
> Thomas Hartmann
> Sent: Thursday, 1 October 2020 11:46
> To: htcondor-users@xxxxxxxxxxx
> Subject: Re: [HTCondor-users] HTCondor high availability
> 
> Hi Christian,
> 
> the spool dir resides on a shared file system between both nodes, right?
> Maybe you can check whether it is writable from both clients and whether
> the users/permissions work on both? (sometimes NFS is a bit fiddly with
> the ID mapping...)
> 
> Cheers,
>   Thomas
> 
> 
> On 01/10/2020 09.58, Hennen, Christian wrote:
>> Hi,
>>
>>  
>>
>> I am currently trying to make the job queue and submission mechanism 
>> of a local, isolated HTCondor cluster highly available. The cluster 
>> consists of 2 master servers (previously 1), several compute nodes, 
>> and a central storage system. DNS, LDAP and other services are 
>> provided by the master servers.
>>
>>  
>>
>> I followed the directions under
>> https://htcondor.readthedocs.io/en/latest/admin-manual/high-availability.html
>> but it doesn't seem to work the way it should. Further information 
>> about the setup and the problems has been posted to Serverfault:
>> https://serverfault.com/questions/1035879/htcondor-high-availability
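>>
>> The pattern from that manual page, applied to our shared spool, is 
>> roughly the following (knob names as I understand them from the docs, 
>> simplified here; the exact values are in the Serverfault post):
>>
>>   # let the master manage a single active schedd across both nodes
>>   MASTER_HA_LIST = SCHEDD
>>   # job queue on the shared file system, plus the HA lock next to it
>>   SPOOL = /clients/condor/spool
>>   HA_LOCK_URL = file:/clients/condor/spool
>>   VALID_SPOOL_FILES = $(VALID_SPOOL_FILES) SCHEDD.lock
>>   # how long the lock is held / how often the backup polls for it
>>   HA_LOCK_HOLD_TIME = 300
>>   HA_POLL_PERIOD = 60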
>>
>>  
>>
>> Maybe some of you have insights on this? Any help would be 
>> appreciated!
>>
>>  
>>
>> Kind regards
>>
>> Christian Hennen, M.Sc.
>>
>> Project Manager Infrastructural Services
>>
>> Zentrum für Informations-, Medien- und Kommunikationstechnologie (ZIMK)
>>
>>
>> Universität Trier | Universitätsring 15 | 54296 Trier | Germany 
>> www.uni-trier.de <http://www.uni-trier.de/>
>>
>>  
>>
>> <https://50jahre.uni-trier.de/>
>>
>>  
>>
>>
>>
> 
> 
> 
