Re: [HTCondor-users] HTCondor 8.6.13 stack dump

Hello all,

Thank you for your responses!

Dimitri, the nodes are Dell and Supermicro twins with two E5-2680 v4 processors (56 logical cores per machine with HT, but we offer only 48 slots to HTCondor).

Greg, I will send you the log right now, thank you very much.

Brian, yes, you're right. Yesterday I discovered that several "condor_procd -A /var/run/condor/procd_pipe -L /var/log/condor/ProcLog -R 1000000 -S 60 -C 988" processes were still running on the failing machines. Unfortunately, the ps command ends up hanging.

Considering that the crash, according to the core.XXX file, happened yesterday at 18:01, I can see several messages like this in ProcLog:

11/14/18 18:04:42 : Procd has a watcher pid and will die if pid 47834 dies.

11/14/18 18:04:42 : Initializing cgroup library.

11/14/18 18:04:42 : taking a snapshot...

11/14/18 18:28:13 : ***********************************

11/14/18 18:28:13 : * condor_procd STARTING UP

11/14/18 18:28:13 : * PID = 50596

11/14/18 18:28:13 : * UID = 0

11/14/18 18:28:13 : * GID = 0

11/14/18 18:28:13 : ***********************************

[...]
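To see how often the procd is being restarted, the start-up banners in ProcLog can be counted. This is only a minimal sketch: it assumes the banner and timestamp format shown in the excerpt above, and the function name procd_startups is made up for illustration.

```python
import re
from datetime import datetime

# Matches the PID line of a start-up banner, e.g.
# "11/14/18 18:28:13 : * PID = 50596" (format taken from the excerpt above).
PID_LINE = re.compile(r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) : \* PID = (\d+)$")

def procd_startups(lines):
    """Return a (timestamp, pid) tuple for every procd start-up banner."""
    startups = []
    in_banner = False
    for line in lines:
        line = line.strip()
        if "condor_procd STARTING UP" in line:
            in_banner = True  # the PID line follows inside the same banner
            continue
        match = PID_LINE.match(line)
        if in_banner and match:
            when = datetime.strptime(match.group(1), "%m/%d/%y %H:%M:%S")
            startups.append((when, int(match.group(2))))
            in_banner = False
    return startups

# Sample lines copied from the excerpt; a real run would read the ProcLog file.
sample = [
    "11/14/18 18:04:42 : taking a snapshot...",
    "11/14/18 18:28:13 : ***********************************",
    "11/14/18 18:28:13 : * condor_procd STARTING UP",
    "11/14/18 18:28:13 : * PID = 50596",
    "11/14/18 18:28:13 : * UID = 0",
]
for when, pid in procd_startups(sample):
    print(when.isoformat(), pid)
```

Comparing these start-up times with the timestamps of the core files should show whether each crash lines up with a procd restart.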

Moreover, in the SharedPortLog at 18:01 exactly:

SharedPortCommandSinfuls = "<192.168.100.56:9618>,<[::1]:9618>"

MyAddress = "<192.168.100.56:9618?addrs=192.168.100.56-9618+[--1]-9618&noUDP>"

11/14/18 18:01:11 condor_read() failed: recv(fd=6) returned -1, errno = 104 Connection reset by peer, reading 5 bytes from .

11/14/18 18:01:11 IO: Failed to read packet header

11/14/18 18:01:11 SharedPortClient: failed to receive result for SHARED_PORT_PASS_FD to 45848_7e35 as requested by <192.168.100.56:38473>: Connection reset by peer

11/14/18 18:01:11 ChildAliveMsg: failed to send DC_CHILDALIVE to parent daemon at <192.168.100.56:0> (try 1 of 3): CEDAR:6001:Failed to connect to <192.168.100.56:0?sock=45848_7e35>

11/14/18 18:01:11 ChildAliveMsg: giving up because deadline expired for sending DC_CHILDALIVE to parent.

11/14/18 18:01:11 About to update statistics in shared_port daemon ad file at /var/lock/condor/shared_port_ad :

[...]

I will investigate further in this direction.

Thank you again.

Cheers,

Carles

On Wed, 14 Nov 2018 at 21:11, Brian Bockelman <bbockelm@xxxxxxxxxxx> wrote:

On Nov 14, 2018, at 2:50 AM, Carles Acosta <cacosta@xxxxxx> wrote:

Hello all,

We have recently updated our worker nodes (WNs) to CentOS 7 and HTCondor 8.6.13. On the WNs with the most slots (48), the condor_master crashes from time to time with this error:

Caught signal 6: si_code=0, si_pid=1, si_uid=0, si_addr=0x1

Signal 6 is SIGABRT -- often something that HTCondor does when some sort of assumption in the code fails. It's not actually segfaulting, which is probably a good sign!
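The mapping from signal number to name can be checked on the node itself. A small sketch using Python's standard signal module; the helper name describe_signal is illustrative, and the numbering is platform-specific (6 is SIGABRT on Linux):

```python
import signal

def describe_signal(signum):
    """Return the symbolic name of a signal number on this platform."""
    return signal.Signals(signum).name

# On Linux, signal 6 resolves to SIGABRT and signal 11 to SIGSEGV.
print(describe_signal(6))
print(describe_signal(signal.SIGSEGV))
```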

From the stack trace, the problem is occurring when it's trying to talk to the procd; hence, there may also be some useful information in the ProcdLog.

Stack dump for process 23309 at timestamp 1542183984 (16 frames)
/usr/lib64/libcondor_utils_8_6_13.so(dprintf_dump_stack+0x24)[0x7ff576c8b2e4]
/usr/lib64/libcondor_utils_8_6_13.so(_Z17unix_sig_coredumpiP9siginfo_tPv+0x69)[0x7ff576e15db9]
/usr/lib64/libpthread.so.0(+0xf6d0)[0x7ff5753666d0]
/usr/lib64/libpthread.so.0(read+0x10)[0x7ff5753657e0]
/usr/lib64/libcondor_utils_8_6_13.so(_ZN10DaemonCore9Read_PipeEiPvi+0x6c)[0x7ff576de538c]
/usr/lib64/libcondor_utils_8_6_13.so(_ZN15ProcFamilyProxy11start_procdEv+0x728)[0x7ff576d12e78]
/usr/lib64/libcondor_utils_8_6_13.so(_ZN15ProcFamilyProxyC1EPKc+0x1f8)[0x7ff576d13748]
/usr/lib64/libcondor_utils_8_6_13.so(_ZN19ProcFamilyInterface6createEPKc+0x5e)[0x7ff576c7c4ee]
/usr/lib64/libcondor_utils_8_6_13.so(_ZN10DaemonCore14Create_ProcessEPKcRK7ArgList10priv_stateiiiPK3EnvS1_P10FamilyInfoPP6StreamPiSE_iP10__sigset_tiPmSE_S1_P8MyStringP15FilesystemRemapl+0x19e8)[0x7ff576dff188]
/usr/sbin/condor_master(_ZN6daemon9RealStartEv+0xca3)[0x410303]
/usr/sbin/condor_master(_ZN7Daemons15StartDaemonHereEP6daemon+0x26)[0x410e06]
/usr/sbin/condor_master(_ZN7Daemons15StartAllDaemonsEv+0x68)[0x410e78]
/usr/sbin/condor_master(_Z9main_initiPPc+0x6ff)[0x41583f]
/usr/lib64/libcondor_utils_8_6_13.so(_Z7dc_mainiPPc+0x138d)[0x7ff576e1942d]
/usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7ff574fac445]
/usr/sbin/condor_master[0x40a6af]
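
To line the stack dump up with the daemon logs, the header line can be converted to a readable time. A minimal sketch, assuming the "timestamp" field is Unix epoch seconds; the function name parse_dump_header is made up for illustration:

```python
import re
from datetime import datetime, timezone

# Header format as it appears in the dump above.
HEADER = re.compile(r"Stack dump for process (\d+) at timestamp (\d+) \((\d+) frames\)")

def parse_dump_header(line):
    """Return (pid, UTC datetime, frame count) from a stack dump header line."""
    match = HEADER.search(line)
    if match is None:
        raise ValueError("not a stack dump header: %r" % line)
    pid, epoch, frames = (int(group) for group in match.groups())
    # Assumption: the timestamp is seconds since the Unix epoch.
    return pid, datetime.fromtimestamp(epoch, tz=timezone.utc), frames

pid, when, frames = parse_dump_header(
    "Stack dump for process 23309 at timestamp 1542183984 (16 frames)"
)
print(pid, when.isoformat(), frames)  # 23309 2018-11-14T08:26:24+00:00 16
```

The result is in UTC, so it needs shifting to the node's local time zone before comparing it with the log timestamps.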

The /var/log/condor directory has a lot of core.XXXX dump files.

This is not happening on the rest of our machines, which have fewer slots. The WNs can run jobs for several days to a week without any issue before the crash. Any ideas? Increasing the stack size for the condor user with ulimit does not solve the issue.

Cheers,

Carles

--
Carles Acosta i Silva
PIC (Port d'Informació Científica)
Campus UAB, Edifici D
E-08193 Bellaterra, Barcelona
Tel: +34 93 581 33 08
Fax: +34 93 581 41 10
http://www.pic.es
Avís - Aviso - Legal Notice: http://www.ifae.es/legal.html
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/
