Re: [HTCondor-users] Condor Problem with communication between SchedLog and SharedLog Process
- Date: Sat, 6 Jul 2024 12:44:55 +0000
- From: "Bockelman, Brian" <BBockelman@xxxxxxxxxxxxx>
- Subject: Re: [HTCondor-users] Condor Problem with communication between SchedLog and SharedLog Process
Hello Jean-Claude,
I think this is the relevant message:
07/05/24 17:06:38 (pid:5440) attempt to connect to <134.158.123.1:9618> failed: No route to host (connect errno = 113).
"No route to host" may indicate there are literal routing issues between the hosts but, more likely, it indicates that your host clrwn001.in2p3.fr does not have port 9618 open in its firewall.
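A quick way to check reachability from the schedd host is a sketch like the following, using bash's built-in /dev/tcp so no extra tools are needed; `probe_port` is a hypothetical helper (not an HTCondor tool), and the host/port are taken from the log line above:

```shell
#!/bin/bash
# Probe TCP reachability of the startd's shared_port daemon.
# probe_port is a hypothetical helper, not part of HTCondor.
probe_port() {
    host=$1; port=$2
    # bash's /dev/tcp redirection opens a TCP connection without nc/nmap
    if timeout 5 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
        echo "$host:$port open"
    else
        echo "$host:$port closed/filtered"
    fi
}

# Run from clrarcce03 (the schedd host) against the execute node:
probe_port clrwn001.in2p3.fr 9618
```

If the port shows closed/filtered, check firewalld on the execute node (EL9 systems typically enable it by default): `firewall-cmd --list-ports`, and if needed `firewall-cmd --add-port=9618/tcp --permanent && firewall-cmd --reload`.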
Brian
PS -- I see a "condor_schedd" process running in the output of clrhtcmgtb.in2p3.fr; that's not needed.
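If clrhtcmgtb.in2p3.fr is meant to be a pure central manager, its daemon list can be trimmed with a config snippet along these lines (a sketch; the file name is hypothetical, and the ROLE metaknob is the usual shorthand on recent HTCondor versions):

```
# /etc/condor/config.d/01-role.config (hypothetical file name)
use ROLE : CentralManager
# roughly equivalent to listing the daemons explicitly:
# DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR
```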
> On Jul 6, 2024, at 5:09 AM, Jean-Claude CHEVALEYRE <jean-claude.chevaleyre@xxxxxxxxxxxxxxxxx> wrote:
>
> Hello,
>
> In order to work on the migration from CentOS 7 to an RHEL9-like system, I set up a model with 3 servers. It is the same model that I already have on CentOS 7, and it runs well in that configuration on CentOS 7.
>
>
> My configuration on AL9 is :
>
> A master scheduler: clrarcce03.in2p3.fr (134.158.121.105)
> Name: clrarcce03.in2p3.fr
> Address: 134.158.121.105
> Name: clrarcce03.in2p3.fr
> Address: 2001:660:5104:134:134:158:121:105
>
>
> condor 5394 1 0 16:02 ? 00:00:00 /usr/sbin/condor_master -f
> root 5438 5394 0 16:02 ? 00:00:00 condor_procd -A /var/run/condor/procd_pipe -L /var/log/condor/ProcLog -R 1000000 -S 60 -C 990
> condor 5439 5394 0 16:02 ? 00:00:00 condor_shared_port
> condor 5440 5394 0 16:02 ? 00:00:00 condor_schedd
>
>
>
> A manager: clrhtcmgtb.in2p3.fr
> Name: clrhtcmgtb.in2p3.fr
> Address: 134.158.121.108
> Name: clrhtcmgtb.in2p3.fr
> Address: 2001:660:5104:134:134:158:121:108
>
>
>
> [root@clrhtcmgtb condor]# ps -ef | grep condor
> condor 3033 1 0 16:16 ? 00:00:00 /usr/sbin/condor_master -f
> root 3082 3033 0 16:16 ? 00:00:00 condor_procd -A /var/run/condor/procd_pipe -L /var/log/condor/ProcLog -R 1000000 -S 60 -C 991
> condor 3083 3033 0 16:16 ? 00:00:00 condor_shared_port
> condor 3084 3033 0 16:16 ? 00:00:00 condor_collector
> condor 3091 3033 0 16:16 ? 00:00:00 condor_negotiator
> condor 3092 3033 0 16:16 ? 00:00:00 condor_schedd
>
> A compute node: clrwn001
> Name: clrwn001.in2p3.fr
> Address: 134.158.123.1
> Name: clrwn001.in2p3.fr
> Address: 2001:660:5104:134:134:158:123:1
>
>
> I have the following message in the scheduler logs on the node clrarcce03 :
>
> 07/05/24 17:05:38 (pid:5440) Match record (slot1@xxxxxxxxxxxxxxxxx <134.158.123.1:9618?addrs=134.158.123.1-9618+[2001-660-5104-134-134-158-123-1]-9618&alias=clrwn001.in2p3.fr&noUDP&sock=startd_2920_74a2> for atlas001, 8.0) deleted
> 07/05/24 17:06:38 (pid:5440) Activity on stashed negotiator socket: <134.158.121.108:6099>
> 07/05/24 17:06:38 (pid:5440) Using negotiation protocol: NEGOTIATE
> 07/05/24 17:06:38 (pid:5440) Negotiating for owner: atlas001@xxxxxxxxxxxxxxxxxxxxxx
> 07/05/24 17:06:38 (pid:5440) Finished sending rrls to negotiator
> 07/05/24 17:06:38 (pid:5440) Finished sending RRL for atlas001
> 07/05/24 17:06:38 (pid:5440) Activity on stashed negotiator socket: <134.158.121.108:6099>
> 07/05/24 17:06:38 (pid:5440) Using negotiation protocol: NEGOTIATE
> 07/05/24 17:06:38 (pid:5440) Negotiating for owner: atlas001@xxxxxxxxxxxxxxxxxxxxxx
> 07/05/24 17:06:38 (pid:5440) SECMAN: removing lingering non-negotiated security session <134.158.123.1:9618?addrs=134.158.123.1-9618+[2001-660-5104-134-134-158-123-1]-9618&alias=clrwn001.in2p3.fr&noUDP&sock=startd_2920_74a2>#1720191447#1 because it conflicts with new request
> 07/05/24 17:06:38 (pid:5440) Negotiation ended: 1 jobs matched
> 07/05/24 17:06:38 (pid:5440) Finished negotiating for atlas001 in local pool: 1 matched, 0 rejected
> 07/05/24 17:06:38 (pid:5440) attempt to connect to <134.158.123.1:9618> failed: No route to host (connect errno = 113).
> 07/05/24 17:06:38 (pid:5440) Failed to send REQUEST_CLAIM to startd slot1@xxxxxxxxxxxxxxxxx <134.158.123.1:9618?addrs=134.158.123.1-9618+[2001-660-5104-134-134-158-123-1]-9618&alias=clrwn001.in2p3.fr&noUDP&sock=startd_2920_74a2> for atlas001: SECMAN:2003:TCP connection to startd slot1@xxxxxxxxxxxxxxxxx <134.158.123.1:9618?addrs=134.158.123.1-9618+[2001-660-5104-134-134-158-123-1]-9618&alias=clrwn001.in2p3.fr&noUDP&sock=startd_2920_74a2> for atlas001 failed.
> 07/05/24 17:06:38 (pid:5440) Match record (slot1@xxxxxxxxxxxxxxxxx <134.158.123.1:9618?addrs=134.158.123.1-9618+[2001-660-5104-134-134-158-123-1]-9618&alias=clrwn001.in2p3.fr&noUDP&sock=startd_2920_74a2> for atlas001, 8.0) deleted
>
>
>
>
> It seems that my test job matched with the node clrwn001, but this "match" was immediately removed. I don't have any firewall enabled.
>
> Any ideas are welcome.
>
> Best Regards
>
> Jean-Claude
>
>
> ------------------------------------------------------------------------
> Jean-Claude Chevaleyre < Jean-Claude.Chevaleyre(at)clermont.in2p3.fr >
> Laboratoire de Physique Clermont
> Campus Universitaire des Cézeaux
> 4 Avenue Blaise Pascal
> TSA 60026
> CS 60026
> 63178 Aubière Cedex
>
> Tel : 04 73 40 73 60
>
> -------------------------------------------------------------------------
>
> _______________________________________________
> HTCondor-users mailing list
> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
> subject: Unsubscribe
> You can also unsubscribe by visiting
> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
>
> The archives can be found at:
> https://lists.cs.wisc.edu/archive/htcondor-users/