Hello all,
Thank you for your responses!
Dimitri, the nodes are Dell and Supermicro twins with two E5-2680 v4 processors each (56 logical cores per machine with HT, but we offer only 48 slots to HTCondor).
Greg, I will send you the log right now, thank you very much.
Brian, yes, you're right. Yesterday I discovered that several "condor_procd -A /var/run/condor/procd_pipe -L /var/log/condor/ProcLog -R 1000000 -S 60 -C 988" processes were still running on the failing machines. Unfortunately, the ps command ends up hanging.
According to the core.XXX file, the crash happened yesterday at 18:01. Around that time I can see several messages like this in ProcLog:
11/14/18 18:04:42 : Procd has a watcher pid and will die if pid 47834 dies.
11/14/18 18:04:42 : Initializing cgroup library.
11/14/18 18:04:42 : taking a snapshot...
11/14/18 18:28:13 : ***********************************
11/14/18 18:28:13 : * condor_procd STARTING UP
11/14/18 18:28:13 : * PID = 50596
11/14/18 18:28:13 : * UID = 0
11/14/18 18:28:13 : * GID = 0
11/14/18 18:28:13 : ***********************************
[...]
Moreover, the SharedPortLog shows this at exactly 18:01:
11/14/18 18:01:11 condor_read() failed: recv(fd=6) returned -1, errno = 104 Connection reset by peer, reading 5 bytes from .
11/14/18 18:01:11 IO: Failed to read packet header
11/14/18 18:01:11 SharedPortClient: failed to receive result for SHARED_PORT_PASS_FD to 45848_7e35 as requested by <192.168.100.56:38473>: Connection reset by peer
11/14/18 18:01:11 ChildAliveMsg: giving up because deadline expired for sending DC_CHILDALIVE to parent.
11/14/18 18:01:11 About to update statistics in shared_port daemon ad file at /var/lock/condor/shared_port_ad :
[...]
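In case it helps to line up the different logs around the crash time, here is a minimal sketch of a filter that keeps only the lines within a few minutes of a given moment. It assumes the "MM/DD/YY HH:MM:SS" timestamp prefix seen in the excerpts above and that the two-digit year means 2018. The function name lines_near and the 5-minute window are just my own choices for illustration, not anything provided by HTCondor.

```python
from datetime import datetime, timedelta

# Assumed timestamp layout, matching e.g. "11/14/18 18:01:11"
TS_FORMAT = "%m/%d/%y %H:%M:%S"
TS_LENGTH = 17  # 8 characters of date, one space, 8 characters of time


def lines_near(lines, center, minutes=5):
    """Return the log lines whose leading timestamp is within +/- minutes of center."""
    window = timedelta(minutes=minutes)
    selected = []
    for line in lines:
        try:
            stamp = datetime.strptime(line[:TS_LENGTH], TS_FORMAT)
        except ValueError:
            # No leading timestamp (e.g. "[...]" or a wrapped line): skip it.
            continue
        if abs(stamp - center) <= window:
            selected.append(line.rstrip("\n"))
    return selected


# Crash time taken from the core file (18:01 on 11/14/18).
crash = datetime(2018, 11, 14, 18, 1)

sample = [
    "11/14/18 18:01:11 IO: Failed to read packet header",
    "11/14/18 18:04:42 : taking a snapshot...",
    "11/14/18 18:28:13 : * condor_procd STARTING UP",
    "[...]",
]

for hit in lines_near(sample, crash):
    print(hit)
```

With real logs, the list could be replaced by an open file object, for example open("/var/log/condor/ProcLog"), since the function only iterates over lines.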
I will investigate further in this direction.
Thank you again.
Cheers,
Carles