
Re: [HTCondor-users] Condor Problem with communication between SchedLog and SharedLog Process



Hello Brian,

I finally found the problem: I have a Puppet template that regularly updates the iptables service. I believed there was no firewall (firewalld) on the WN (clrwn001), but iptables rules were still being applied.
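In case it helps someone else, the fix boils down to allowing HTCondor's shared port through the packet filter. A rough sketch (generic commands, not my actual Puppet code):

```shell
# Allow HTCondor's shared port, 9618/tcp -- sketch only, adapt to your site.
# With plain iptables (what the Puppet template manages in my case):
iptables -I INPUT -p tcp --dport 9618 -j ACCEPT
# Or, if firewalld is the service actually in charge:
firewall-cmd --permanent --add-port=9618/tcp
firewall-cmd --reload
```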

Now it's fine.

Thanks for your help.

Best Regards
Jean-Claude

----- Original Message -----
From: "Chevaleyre Jean-Claude" <jean-claude.chevaleyre@xxxxxxxxxxxxxxxxx>
To: "HTCondor-Users Mail List" <htcondor-users@xxxxxxxxxxx>
Cc: "Chevaleyre Jean-Claude" <Jean-Claude.Chevaleyre@xxxxxxxxxxxxxxxxx>
Sent: Saturday, 6 July 2024 16:48:22
Subject: Re: [HTCondor-users] Condor Problem with communication between SchedLog and SharedLog Process

Hello Brian,

Thanks for your answer.

Noted about the extra condor_schedd on my node clrhtcmgtb.

For the routing problem, these are the relevant processes running on my worker clrwn001:

[root@clrwn001 config.d]# netstat -tupan | grep 9618
tcp        0      0 0.0.0.0:9618            0.0.0.0:*               LISTEN      1328/condor_shared_
tcp        0      0 134.158.123.1:45523     134.158.121.108:9618    ESTABLISHED 1046/condor_master
tcp        0      0 134.158.123.1:45915     134.158.121.108:9618    ESTABLISHED 1329/condor_startd


I don't have any firewall on any node:

[root@clrwn001 config.d]# systemctl status firewalld
● firewalld.service - firewalld - dynamic firewall daemon
     Loaded: loaded (/usr/lib/systemd/system/firewalld.service; disabled; preset: enabled)
     Active: inactive (dead)
       Docs: man:firewalld(1)



I have changed some parameters in the config file on the 3 nodes. I have also disabled IPv6 routing.
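For reference, the IPv6 part can be handled with a single HTCondor knob; a sketch only, assuming the standard ENABLE_IPV6 macro (I am not showing my full config):

```
# condor_config.local -- sketch; HTCondor's ENABLE_IPV6 macro accepts TRUE/FALSE/AUTO
ENABLE_IPV6 = FALSE
```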

I now have this error in the SchedLog file on node clrarcce03:


07/06/24 16:45:21 (pid:158267) condor_write(): Socket closed when trying to write 4114 bytes to collector clrhtcmgtb.in2p3.fr, fd is 12
07/06/24 16:45:21 (pid:158267) Buf::write(): condor_write() failed
07/06/24 16:45:21 (pid:158267) SECMAN: Server rejected our session id
07/06/24 16:45:21 (pid:158267) SECMAN: Invalidating negotiated session rejected by peer
07/06/24 16:45:21 (pid:158267) ERROR: SECMAN:2004:Server rejected our session id
07/06/24 16:45:21 (pid:158267) Failed to send RESCHEDULE to unknown daemon:
07/06/24 16:45:24 (pid:158267) Using negotiation protocol: NEGOTIATE
07/06/24 16:45:24 (pid:158267) Negotiating for owner: atlas001@xxxxxxxxxxxxxxxxxxxxxx
07/06/24 16:45:24 (pid:158267) AutoCluster:config() significant attributes changed to MachineLastMatchTime Offline RemoteOwner RequestCpus RequestDisk RequestMemory TotalJobRuntime ConcurrencyLimits FlockTo Rank Requirements
07/06/24 16:45:24 (pid:158267) Rebuilt prioritized runnable job list in 0.000s.
07/06/24 16:45:24 (pid:158267) Finished sending rrls to negotiator
07/06/24 16:45:24 (pid:158267) Finished sending RRL for atlas001
07/06/24 16:45:24 (pid:158267) Activity on stashed negotiator socket: <134.158.121.108:32635>
07/06/24 16:45:24 (pid:158267) Using negotiation protocol: NEGOTIATE
07/06/24 16:45:24 (pid:158267) Negotiating for owner: atlas001@xxxxxxxxxxxxxxxxxxxxxx
07/06/24 16:45:24 (pid:158267) Negotiation ended: 1 jobs matched
07/06/24 16:45:24 (pid:158267) Finished negotiating for atlas001 in local pool: 1 matched, 0 rejected
07/06/24 16:45:24 (pid:158267) attempt to connect to <134.158.123.1:9618> failed: No route to host (connect errno = 113).
07/06/24 16:45:24 (pid:158267) Failed to send REQUEST_CLAIM to startd slot1@xxxxxxxxxxxxxxxxx <134.158.123.1:9618?addrs=134.158.123.1-9618&alias=clrwn001.in2p3.fr&noUDP&sock=startd_15677_b24f> for atlas001: SECMAN:2003:TCP connection to startd slot1@xxxxxxxxxxxxxxxxx <134.158.123.1:9618?addrs=134.158.123.1-9618&alias=clrwn001.in2p3.fr&noUDP&sock=startd_15677_b24f> for atlas001 failed.
07/06/24 16:45:24 (pid:158267) Match record (slot1@xxxxxxxxxxxxxxxxx <134.158.123.1:9618?addrs=134.158.123.1-9618&alias=clrwn001.in2p3.fr&noUDP&sock=startd_15677_b24f> for atlas001, 9.0) deleted


Jean-Claude



----- Original Message -----
From: "Bockelman, Brian" <BBockelman@xxxxxxxxxxxxx>
To: "HTCondor-Users Mail List" <htcondor-users@xxxxxxxxxxx>
Sent: Saturday, 6 July 2024 14:44:55
Subject: Re: [HTCondor-users] Condor Problem with communication between SchedLog and SharedLog Process

Hello Jean-Claude,

I think this is the relevant message:

07/05/24 17:06:38 (pid:5440) attempt to connect to <134.158.123.1:9618> failed: No route to host (connect errno = 113).

"No route to host" may indicate there's literal routing issues between the hosts but, more likely, it could indicate your host clrwn001.in2p3.fr does not have port 9618 open in the firewall.

Brian

PS -- I see a "condor_schedd" process running in the output of clrhtcmgtb.in2p3.fr; that's not needed.

> On Jul 6, 2024, at 5:09 AM, Jean-Claude CHEVALEYRE <jean-claude.chevaleyre@xxxxxxxxxxxxxxxxx> wrote:
> 
> Hello,
> 
> To prepare the migration from CentOS 7 to a RHEL9-like system, I set up a model with 3 servers. It is the same model I already have on CentOS 7, where it runs well.
> 
> 
> My configuration on AL9 is:
> 
> A master scheduler: clrarcce03.in2p3.fr (134.158.121.105)
> Name:   clrarcce03.in2p3.fr
> Address: 134.158.121.105
> Name:   clrarcce03.in2p3.fr
> Address: 2001:660:5104:134:134:158:121:105
> 
> 
> condor      5394       1  0 16:02 ?        00:00:00 /usr/sbin/condor_master -f
> root        5438    5394  0 16:02 ?        00:00:00 condor_procd -A /var/run/condor/procd_pipe -L /var/log/condor/ProcLog -R 1000000 -S 60 -C 990
> condor      5439    5394  0 16:02 ?        00:00:00 condor_shared_port
> condor      5440    5394  0 16:02 ?        00:00:00 condor_schedd
> 
> 
> 
> A manager: clrhtcmgtb.in2p3.fr
> Name:   clrhtcmgtb.in2p3.fr
> Address: 134.158.121.108
> Name:   clrhtcmgtb.in2p3.fr
> Address: 2001:660:5104:134:134:158:121:108
> 
> 
> 
> [root@clrhtcmgtb condor]# ps -ef | grep condor
> condor      3033       1  0 16:16 ?        00:00:00 /usr/sbin/condor_master -f
> root        3082    3033  0 16:16 ?        00:00:00 condor_procd -A /var/run/condor/procd_pipe -L /var/log/condor/ProcLog -R 1000000 -S 60 -C 991
> condor      3083    3033  0 16:16 ?        00:00:00 condor_shared_port
> condor      3084    3033  0 16:16 ?        00:00:00 condor_collector
> condor      3091    3033  0 16:16 ?        00:00:00 condor_negotiator
> condor      3092    3033  0 16:16 ?        00:00:00 condor_schedd
> 
> A compute node: clrwn001
> Name:   clrwn001.in2p3.fr
> Address: 134.158.123.1
> Name:   clrwn001.in2p3.fr
> Address: 2001:660:5104:134:134:158:123:1
> 
> 
> I have the following messages in the scheduler log on the node clrarcce03:
> 
> 07/05/24 17:05:38 (pid:5440) Match record (slot1@xxxxxxxxxxxxxxxxx <134.158.123.1:9618?addrs=134.158.123.1-9618+[2001-660-5104-134-134-158-123-1]-9618&alias=clrwn001.in2p3.fr&noUDP&sock=startd_2920_74a2> for atlas001, 8.0) deleted
> 07/05/24 17:06:38 (pid:5440) Activity on stashed negotiator socket: <134.158.121.108:6099>
> 07/05/24 17:06:38 (pid:5440) Using negotiation protocol: NEGOTIATE
> 07/05/24 17:06:38 (pid:5440) Negotiating for owner: atlas001@xxxxxxxxxxxxxxxxxxxxxx
> 07/05/24 17:06:38 (pid:5440) Finished sending rrls to negotiator
> 07/05/24 17:06:38 (pid:5440) Finished sending RRL for atlas001
> 07/05/24 17:06:38 (pid:5440) Activity on stashed negotiator socket: <134.158.121.108:6099>
> 07/05/24 17:06:38 (pid:5440) Using negotiation protocol: NEGOTIATE
> 07/05/24 17:06:38 (pid:5440) Negotiating for owner: atlas001@xxxxxxxxxxxxxxxxxxxxxx
> 07/05/24 17:06:38 (pid:5440) SECMAN: removing lingering non-negotiated security session <134.158.123.1:9618?addrs=134.158.123.1-9618+[2001-660-5104-134-134-158-123-1]-9618&alias=clrwn001.in2p3.fr&noUDP&sock=startd_2920_74a2>#1720191447#1 because it conflicts with new request
> 07/05/24 17:06:38 (pid:5440) Negotiation ended: 1 jobs matched
> 07/05/24 17:06:38 (pid:5440) Finished negotiating for atlas001 in local pool: 1 matched, 0 rejected
> 07/05/24 17:06:38 (pid:5440) attempt to connect to <134.158.123.1:9618> failed: No route to host (connect errno = 113).
> 07/05/24 17:06:38 (pid:5440) Failed to send REQUEST_CLAIM to startd slot1@xxxxxxxxxxxxxxxxx <134.158.123.1:9618?addrs=134.158.123.1-9618+[2001-660-5104-134-134-158-123-1]-9618&alias=clrwn001.in2p3.fr&noUDP&sock=startd_2920_74a2> for atlas001: SECMAN:2003:TCP connection to startd slot1@xxxxxxxxxxxxxxxxx <134.158.123.1:9618?addrs=134.158.123.1-9618+[2001-660-5104-134-134-158-123-1]-9618&alias=clrwn001.in2p3.fr&noUDP&sock=startd_2920_74a2> for atlas001 failed.
> 07/05/24 17:06:38 (pid:5440) Match record (slot1@xxxxxxxxxxxxxxxxx <134.158.123.1:9618?addrs=134.158.123.1-9618+[2001-660-5104-134-134-158-123-1]-9618&alias=clrwn001.in2p3.fr&noUDP&sock=startd_2920_74a2> for atlas001, 8.0) deleted
> 
> 
> 
> 
> It seems that my test job matches the node clrwn001, but the match is immediately removed. I don't have any firewall enabled.
> 
> Any ideas are welcome.
> 
> Best Regards
> 
> Jean-Claude
> 
> 
> ------------------------------------------------------------------------
> Jean-Claude Chevaleyre < Jean-Claude.Chevaleyre(at)clermont.in2p3.fr > 
> Laboratoire de Physique Clermont
> Campus Universitaire des Cézeaux
> 4 Avenue Blaise Pascal
> TSA 60026
> CS 60026
> 63178 Aubière Cedex
> 
> Tel : 04 73 40 73 60
> 
> -------------------------------------------------------------------------
> 
> _______________________________________________
> HTCondor-users mailing list
> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
> subject: Unsubscribe
> You can also unsubscribe by visiting
> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
> 
> The archives can be found at:
> https://lists.cs.wisc.edu/archive/htcondor-users/

