
Re: [HTCondor-users] Condor Problem with communication between SchedLog and SharedLog Process



Hello Jean-Claude,

I think this is the relevant message:

07/05/24 17:06:38 (pid:5440) attempt to connect to <134.158.123.1:9618> failed: No route to host (connect errno = 113).

"No route to host" may mean there is a literal routing problem between the two hosts but, more likely, it means your host clrwn001.in2p3.fr does not have port 9618 open in its firewall.
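
If it helps, one way to tell these cases apart is a plain TCP connect from clrarcce03 to the startd's address. Below is a minimal Python sketch, not anything shipped with HTCondor: the helper names `probe` and `explain_connect_errno` are made up for illustration, and the mapping from errno to likely cause is only a rough heuristic for Linux.

```python
import errno
import socket


def explain_connect_errno(err_no):
    """Map a connect() errno to a likely cause (rough heuristic, not a diagnosis)."""
    if err_no == errno.EHOSTUNREACH:
        # errno 113 on Linux, the "No route to host" seen in the SchedLog
        return ("no route to host: routing problem, or a firewall "
                "rejecting with ICMP host-prohibited")
    if err_no == errno.ECONNREFUSED:
        return "connection refused: host reachable, nothing listening on that port"
    if err_no == errno.ETIMEDOUT:
        return "timed out: packets are probably being dropped silently"
    return "other error: " + errno.errorcode.get(err_no, str(err_no))


def probe(host, port, timeout=5.0):
    """Try one TCP connect; return 'open' or a short explanation of the failure."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        sock.connect((host, port))
        return "open"
    except socket.timeout:
        # A timeout raised by settimeout() carries no errno, so map it by hand.
        return explain_connect_errno(errno.ETIMEDOUT)
    except OSError as exc:
        return explain_connect_errno(exc.errno)
    finally:
        sock.close()
```

Running `print(probe("134.158.123.1", 9618))` on clrarcce03 should reproduce the failure outside of HTCondor. If it still reports "no route to host", it is worth double-checking on clrwn001 with `systemctl status firewalld` and `firewall-cmd --list-all`, since a firewall that rejects with ICMP host-prohibited produces exactly this error even when routing is fine.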

Brian

PS -- I see a "condor_schedd" process running in the ps output from clrhtcmgtb.in2p3.fr; it is not needed on the manager.

> On Jul 6, 2024, at 5:09 AM, Jean-Claude CHEVALEYRE <jean-claude.chevaleyre@xxxxxxxxxxxxxxxxx> wrote:
> 
> Hello,
> 
> In order to work on the migration from CentOS 7 to a RHEL 9-like distribution, I set up a test setup with 3 servers. It is the same setup that I already have on CentOS 7, where it runs well.
> 
> 
> My configuration on AL9 is:
> 
> A master scheduler: clrarcce03.in2p3.fr (134.158.121.105)
> Name:   clrarcce03.in2p3.fr
> Address: 134.158.121.105
> Name:   clrarcce03.in2p3.fr
> Address: 2001:660:5104:134:134:158:121:105
> 
> 
> condor      5394       1  0 16:02 ?        00:00:00 /usr/sbin/condor_master -f
> root        5438    5394  0 16:02 ?        00:00:00 condor_procd -A /var/run/condor/procd_pipe -L /var/log/condor/ProcLog -R 1000000 -S 60 -C 990
> condor      5439    5394  0 16:02 ?        00:00:00 condor_shared_port
> condor      5440    5394  0 16:02 ?        00:00:00 condor_schedd
> 
> 
> 
> A manager: clrhtcmgtb.in2p3.fr
> Name:   clrhtcmgtb.in2p3.fr
> Address: 134.158.121.108
> Name:   clrhtcmgtb.in2p3.fr
> Address: 2001:660:5104:134:134:158:121:108
> 
> 
> 
> [root@clrhtcmgtb condor]# ps -ef | grep condor
> condor      3033       1  0 16:16 ?        00:00:00 /usr/sbin/condor_master -f
> root        3082    3033  0 16:16 ?        00:00:00 condor_procd -A /var/run/condor/procd_pipe -L /var/log/condor/ProcLog -R 1000000 -S 60 -C 991
> condor      3083    3033  0 16:16 ?        00:00:00 condor_shared_port
> condor      3084    3033  0 16:16 ?        00:00:00 condor_collector
> condor      3091    3033  0 16:16 ?        00:00:00 condor_negotiator
> condor      3092    3033  0 16:16 ?        00:00:00 condor_schedd
> 
> A compute node: clrwn001
> Name:   clrwn001.in2p3.fr
> Address: 134.158.123.1
> Name:   clrwn001.in2p3.fr
> Address: 2001:660:5104:134:134:158:123:1
> 
> 
> I have the following messages in the scheduler log on the node clrarcce03:
> 
> 07/05/24 17:05:38 (pid:5440) Match record (slot1@xxxxxxxxxxxxxxxxx <134.158.123.1:9618?addrs=134.158.123.1-9618+[2001-660-5104-134-134-158-123-1]-9618&alias=clrwn001.in2p3.fr&noUDP&sock=startd_2920_74a2> for atlas001, 8.0) deleted
> 07/05/24 17:06:38 (pid:5440) Activity on stashed negotiator socket: <134.158.121.108:6099>
> 07/05/24 17:06:38 (pid:5440) Using negotiation protocol: NEGOTIATE
> 07/05/24 17:06:38 (pid:5440) Negotiating for owner: atlas001@xxxxxxxxxxxxxxxxxxxxxx
> 07/05/24 17:06:38 (pid:5440) Finished sending rrls to negotiator
> 07/05/24 17:06:38 (pid:5440) Finished sending RRL for atlas001
> 07/05/24 17:06:38 (pid:5440) Activity on stashed negotiator socket: <134.158.121.108:6099>
> 07/05/24 17:06:38 (pid:5440) Using negotiation protocol: NEGOTIATE
> 07/05/24 17:06:38 (pid:5440) Negotiating for owner: atlas001@xxxxxxxxxxxxxxxxxxxxxx
> 07/05/24 17:06:38 (pid:5440) SECMAN: removing lingering non-negotiated security session <134.158.123.1:9618?addrs=134.158.123.1-9618+[2001-660-5104-134-134-158-123-1]-9618&alias=clrwn001.in2p3.fr&noUDP&sock=startd_2920_74a2>#1720191447#1 because it conflicts with new request
> 07/05/24 17:06:38 (pid:5440) Negotiation ended: 1 jobs matched
> 07/05/24 17:06:38 (pid:5440) Finished negotiating for atlas001 in local pool: 1 matched, 0 rejected
> 07/05/24 17:06:38 (pid:5440) attempt to connect to <134.158.123.1:9618> failed: No route to host (connect errno = 113).
> 07/05/24 17:06:38 (pid:5440) Failed to send REQUEST_CLAIM to startd slot1@xxxxxxxxxxxxxxxxx <134.158.123.1:9618?addrs=134.158.123.1-9618+[2001-660-5104-134-134-158-123-1]-9618&alias=clrwn001.in2p3.fr&noUDP&sock=startd_2920_74a2> for atlas001: SECMAN:2003:TCP connection to startd slot1@xxxxxxxxxxxxxxxxx <134.158.123.1:9618?addrs=134.158.123.1-9618+[2001-660-5104-134-134-158-123-1]-9618&alias=clrwn001.in2p3.fr&noUDP&sock=startd_2920_74a2> for atlas001 failed.
> 07/05/24 17:06:38 (pid:5440) Match record (slot1@xxxxxxxxxxxxxxxxx <134.158.123.1:9618?addrs=134.158.123.1-9618+[2001-660-5104-134-134-158-123-1]-9618&alias=clrwn001.in2p3.fr&noUDP&sock=startd_2920_74a2> for atlas001, 8.0) deleted
> 
> 
> 
> 
> It seems that my test job matched the node clrwn001, but this match was immediately removed. I don't have any firewall enabled.
> 
> Any ideas are welcome.
> 
> Best Regards
> 
> Jean-Claude
> 
> 
> ------------------------------------------------------------------------
> Jean-Claude Chevaleyre < Jean-Claude.Chevaleyre(at)clermont.in2p3.fr > 
> Laboratoire de Physique Clermont
> Campus Universitaire des Cézeaux
> 4 Avenue Blaise Pascal
> TSA 60026
> CS 60026
> 63178 Aubière Cedex
> 
> Tel : 04 73 40 73 60
> 
> -------------------------------------------------------------------------
> 