Hi again,

searching through the log files once more, something caught my eye: when running condor_q on master2 while master1 is active, the following lines appear in the SchedLog (along with the segmentation fault message):

10/08/20 11:50:30 (pid:47347) Number of Active Workers 0
10/08/20 11:50:41 (pid:47347) AUTHENTICATE: handshake failed!
10/08/20 11:50:41 (pid:47347) DC_AUTHENTICATE: authentication of <192.168.1.22:10977> did not result in a valid mapped user name, which is required for this command (519 QUERY_JOB_ADS_WITH_AUTH), so aborting.
10/08/20 11:50:41 (pid:47347) DC_AUTHENTICATE: reason for authentication failure: AUTHENTICATE:1002:Failure performing handshake|AUTHENTICATE:1004:Failed to authenticate using KERBEROS|AUTHENTICATE:1004:Failed to authenticate using FS|FS:1004:Unable to lstat(/tmp/FS_XXXGNYmKn)

Do I need to configure any other authentication methods, in addition to all servers using LDAP via PAM?
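From what I understand, FS authentication works by creating a file in /tmp and letting the daemon check its ownership, so it can only succeed when the client runs on the same host as the schedd; that would match the lstat() failure above and explain why a remote condor_q fails. I imagine allowing the pool password as an additional method would look roughly like this in condor_config.local (only a sketch, and the password file path is an assumption, not our current setup):

  # Sketch only: allow PASSWORD in addition to FS, since FS cannot
  # authenticate tools connecting from a different host.
  SEC_DEFAULT_AUTHENTICATION_METHODS = FS, PASSWORD
  # Shared pool password, stored on every machine beforehand with
  # something like: condor_store_cred -c add
  SEC_PASSWORD_FILE = /etc/condor/pool_password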
Kind regards
Christian

-----Original Message-----
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Hennen, Christian
Sent: Thursday, October 1, 2020 12:58
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] HTCondor high availability

Hello Thomas,

the spool directory (/clients/condor/spool) is located on an NFS v3 share that every server has access to (/clients). All machines have a local user (r-admin) with uid and gid 1000, and the spool directory is owned by that user, since it is configured as the Condor user (see condor_config.local in the Serverfault thread). Every other user is mapped via LDAP on every server, including the storage cluster. On both master servers the user "condor" has the same uid and gid.

Kind regards
Christian

-----Original Message-----
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Thomas Hartmann
Sent: Thursday, October 1, 2020 11:46
To: htcondor-users@xxxxxxxxxxx
Subject: Re: [HTCondor-users] HTCondor high availability

Hi Christian,

the spool dir resides on a shared file system between both nodes, right? Maybe you can check whether it is writable from both clients and whether the users/permissions work for both? (Sometimes NFS is a bit fiddly with the ID mapping...)

Cheers,
Thomas

On 01/10/2020 09.58, Hennen, Christian wrote:
> Hi,
>
> I am currently trying to make the job queue and submission mechanism
> of a local, isolated HTCondor cluster highly available. The cluster
> consists of 2 master servers (previously 1), several compute nodes,
> and a central storage system. DNS, LDAP and other services are
> provided by the master servers.
>
> I followed the directions under
> https://htcondor.readthedocs.io/en/latest/admin-manual/high-availability.html
> but it doesn't seem to work the way it should. Further information
> about the setup and the problems has been posted to Serverfault:
> https://serverfault.com/questions/1035879/htcondor-high-availability
>
> Maybe any of you have got any insights on this? Any help would be
> appreciated!
>
> Kind regards
>
> Christian Hennen, M.Sc.
> Project Manager Infrastructural Services
> Zentrum für Informations-, Medien-
> und Kommunikationstechnologie (ZIMK)
>
> Universität Trier | Universitätsring 15 | 54296 Trier | Germany
> www.uni-trier.de
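For reference, the job-queue high-availability recipe on the manual page linked above comes down to a handful of macros set identically on both masters. A minimal sketch, filled in with the spool path from this thread rather than a verified configuration:

  # Both condor_master daemons manage the schedd; the lock in the
  # shared spool ensures only one condor_schedd is active at a time.
  MASTER_HA_LIST = SCHEDD
  SPOOL = /clients/condor/spool
  HA_LOCK_URL = file:/clients/condor/spool
  # Keep condor_preen from deleting the HA lock file.
  VALID_SPOOL_FILES = $(VALID_SPOOL_FILES) SCHEDD.lock
  # Shared schedd name (the trailing @ prevents appending the local
  # hostname), so the job queue stays addressable across a failover.
  SCHEDD_NAME = had-schedd@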
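Thomas's suggestion about permissions is also easy to check concretely. Assuming r-admin (uid/gid 1000) is the configured Condor user as described above, running something like the following on each master should expose write or ID-mapping problems on the NFS share:

  # Numeric IDs must match on both masters (and on the storage cluster).
  id r-admin
  # Numeric listing: the spool should appear owned by uid/gid 1000 over NFS.
  ls -lnd /clients/condor/spool
  # Create and remove a test file as the Condor user from this host.
  su -s /bin/sh r-admin -c 'touch /clients/condor/spool/.writetest && rm /clients/condor/spool/.writetest'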