[HTCondor-users] Condor Problem with communication between SchedLog and SharedLog Process
- Date: Sat, 6 Jul 2024 12:09:06 +0200 (CEST)
- From: Jean-Claude CHEVALEYRE <jean-claude.chevaleyre@xxxxxxxxxxxxxxxxx>
- Subject: [HTCondor-users] Condor Problem with communication between SchedLog and SharedLog Process
Hello,
To prepare the migration from CentOS 7 to a RHEL 9-like distribution, I set up a test setup with 3 servers. It is the same layout that I already run on CentOS 7, where it works well.
My configuration on AL9 is:
- A master scheduler: clrarcce03.in2p3.fr (134.158.121.105)
Name: clrarcce03.in2p3.fr
Address: 134.158.121.105
Name: clrarcce03.in2p3.fr
Address: 2001:660:5104:134:134:158:121:105
condor 5394 1 0 16:02 ? 00:00:00 /usr/sbin/condor_master -f
root 5438 5394 0 16:02 ? 00:00:00 condor_procd -A /var/run/condor/procd_pipe -L /var/log/condor/ProcLog -R 1000000 -S 60 -C 990
condor 5439 5394 0 16:02 ? 00:00:00 condor_shared_port
condor 5440 5394 0 16:02 ? 00:00:00 condor_schedd
- A manager: clrhtcmgtb.in2p3.fr
Name: clrhtcmgtb.in2p3.fr
Address: 134.158.121.108
Name: clrhtcmgtb.in2p3.fr
Address: 2001:660:5104:134:134:158:121:108
[root@clrhtcmgtb condor]# ps -ef | grep condor
condor 3033 1 0 16:16 ? 00:00:00 /usr/sbin/condor_master -f
root 3082 3033 0 16:16 ? 00:00:00 condor_procd -A /var/run/condor/procd_pipe -L /var/log/condor/ProcLog -R 1000000 -S 60 -C 991
condor 3083 3033 0 16:16 ? 00:00:00 condor_shared_port
condor 3084 3033 0 16:16 ? 00:00:00 condor_collector
condor 3091 3033 0 16:16 ? 00:00:00 condor_negotiator
condor 3092 3033 0 16:16 ? 00:00:00 condor_schedd
- A compute node: clrwn001
Name: clrwn001.in2p3.fr
Address: 134.158.123.1
Name: clrwn001.in2p3.fr
Address: 2001:660:5104:134:134:158:123:1
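For reference, the name lookups above can be repeated from each host with a small script. This is only a sketch: the function name resolve_all is made up for illustration, and port 9618 is simply the port that appears in the logs further down. It lists every IPv4 and IPv6 address the local resolver returns for a host name.

```python
import socket


def resolve_all(hostname, port=9618):
    """Return the sorted (family, address) pairs the resolver gives for hostname.

    resolve_all is a hypothetical helper, not part of HTCondor.
    """
    families = {socket.AF_INET: "IPv4", socket.AF_INET6: "IPv6"}
    results = set()
    # Restricting to TCP avoids one duplicate entry per socket type.
    for family, _type, _proto, _canon, sockaddr in socket.getaddrinfo(
        hostname, port, proto=socket.IPPROTO_TCP
    ):
        results.add((families.get(family, str(family)), sockaddr[0]))
    return sorted(results)
```

Running resolve_all("clrwn001.in2p3.fr") on the scheduler, the manager and the compute node should show whether all three hosts see the same IPv4 and IPv6 addresses.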
I have the following message in the scheduler logs on the node clrarcce03 :
07/05/24 17:05:38 (pid:5440) Match record (slot1@xxxxxxxxxxxxxxxxx <134.158.123.1:9618?addrs=134.158.123.1-9618+[2001-660-5104-134-134-158-123-1]-9618&alias=clrwn001.in2p3.fr&noUDP&sock=startd_2920_74a2> for atlas001, 8.0) deleted
07/05/24 17:06:38 (pid:5440) Activity on stashed negotiator socket: <134.158.121.108:6099>
07/05/24 17:06:38 (pid:5440) Using negotiation protocol: NEGOTIATE
07/05/24 17:06:38 (pid:5440) Negotiating for owner: atlas001@xxxxxxxxxxxxxxxxxxxxxx
07/05/24 17:06:38 (pid:5440) Finished sending rrls to negotiator
07/05/24 17:06:38 (pid:5440) Finished sending RRL for atlas001
07/05/24 17:06:38 (pid:5440) Activity on stashed negotiator socket: <134.158.121.108:6099>
07/05/24 17:06:38 (pid:5440) Using negotiation protocol: NEGOTIATE
07/05/24 17:06:38 (pid:5440) Negotiating for owner: atlas001@xxxxxxxxxxxxxxxxxxxxxx
07/05/24 17:06:38 (pid:5440) SECMAN: removing lingering non-negotiated security session <134.158.123.1:9618?addrs=134.158.123.1-9618+[2001-660-5104-134-134-158-123-1]-9618&alias=clrwn001.in2p3.fr&noUDP&sock=startd_2920_74a2>#1720191447#1 because it conflicts with new request
07/05/24 17:06:38 (pid:5440) Negotiation ended: 1 jobs matched
07/05/24 17:06:38 (pid:5440) Finished negotiating for atlas001 in local pool: 1 matched, 0 rejected
07/05/24 17:06:38 (pid:5440) attempt to connect to <134.158.123.1:9618> failed: No route to host (connect errno = 113).
07/05/24 17:06:38 (pid:5440) Failed to send REQUEST_CLAIM to startd slot1@xxxxxxxxxxxxxxxxx <134.158.123.1:9618?addrs=134.158.123.1-9618+[2001-660-5104-134-134-158-123-1]-9618&alias=clrwn001.in2p3.fr&noUDP&sock=startd_2920_74a2> for atlas001: SECMAN:2003:TCP connection to startd slot1@xxxxxxxxxxxxxxxxx <134.158.123.1:9618?addrs=134.158.123.1-9618+[2001-660-5104-134-134-158-123-1]-9618&alias=clrwn001.in2p3.fr&noUDP&sock=startd_2920_74a2> for atlas001 failed.
07/05/24 17:06:38 (pid:5440) Match record (slot1@xxxxxxxxxxxxxxxxx <134.158.123.1:9618?addrs=134.158.123.1-9618+[2001-660-5104-134-134-158-123-1]-9618&alias=clrwn001.in2p3.fr&noUDP&sock=startd_2920_74a2> for atlas001, 8.0) deleted
It seems that my test job matches the node clrwn001, but this match is immediately removed. I don't have any firewall enabled.
Any ideas are welcome.
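The failing step in the log is the TCP connect to 134.158.123.1:9618, which returns errno 113 (EHOSTUNREACH on Linux, "No route to host"). A minimal sketch to reproduce that connect outside of HTCondor, run from the scheduler host, could look like this. The helper name check_tcp and the 5 second timeout are my own assumptions, not anything from HTCondor.

```python
import errno
import socket


def check_tcp(host, port, timeout=5.0):
    """Attempt a plain TCP connect; return (ok, errno_name, message).

    check_tcp is a hypothetical helper used only for this test.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, None, "connected"
    except socket.timeout:
        # No answer at all within the timeout.
        return False, "ETIMEDOUT", "timed out"
    except OSError as exc:
        # Map the numeric errno (e.g. 113) to its symbolic name when known.
        name = errno.errorcode.get(exc.errno, "UNKNOWN") if exc.errno else "UNKNOWN"
        return False, name, str(exc)
```

Calling check_tcp("134.158.123.1", 9618) on clrarcce03 should tell whether the same error appears for an ordinary socket, which would point at the network path rather than at the schedd itself.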
Best Regards
Jean-Claude
------------------------------------------------------------------------
Jean-Claude Chevaleyre < Jean-Claude.Chevaleyre(at)clermont.in2p3.fr >
Laboratoire de Physique Clermont
Campus Universitaire des Cézeaux
4 Avenue Blaise Pascal
TSA 60026
CS 60026
63178 Aubière Cedex
Tel : 04 73 40 73 60
-------------------------------------------------------------------------