Dear all,

After updating to HTCondor 9.0.0 we are experiencing problems on our worker nodes (running SL7). The condor_starter encounters a segmentation fault after about 1 h of runtime. In the StarterLog.slot1_X nothing is logged around that time until the next condor_starter starts. In the StartLog I only see the report about the segmentation fault [1]. In addition, a core dump is created; the backtrace from that core dump is given in [2].

To me this looks related to the update of the X509 proxy for the job. I tried submitting a job without a user proxy, and so far it has not triggered the problem. Apart from the HTCondor version update, nothing has changed in our setup or in the jobs we are submitting.

Has anyone experienced similar issues? Please let me know if any additional information would be useful for debugging this.

Thanks,
Rene

[1] StartLog
04/29/21 13:01:35 (pid:4025) (D_ALWAYS|D_FAILURE) Starter pid 3337908 died on signal 11 (signal 11 (Segmentation fault))

[2]
#0 0x00007fa4d9d7f657 in kill () from /usr/lib64/libc.so.6
#1 0x00007fa4dc520e64 in unix_sig_coredump (signum=11, s_info=<optimized out>) at /usr/src/debug/condor-9.0.0/src/condor_daemon_core.V6/daemon_core_main.cpp:1355
#2 <signal handler called>
#3 0x00007fa4db0a9802 in EVP_DigestFinal_ex () from /usr/lib64/libcrypto.so.10
#4 0x00007fa4dc4aee1d in ReliSock::SndMsg::snd_packet (this=this@entry=0x55caf89bd470, peer_description=0x55caf89bd35c "<[2a00:139c:5:1dc:0:43:1:8c]:18245>", _sock=_sock@entry=19, end=end@entry=1, _timeout=_timeout@entry=10) at /usr/src/debug/condor-9.0.0/src/condor_io/reli_sock.cpp:1199
#5 0x00007fa4dc4af538 in ReliSock::end_of_message_internal (this=this@entry=0x55caf89bd170) at /usr/src/debug/condor-9.0.0/src/condor_io/reli_sock.cpp:564
#6 0x00007fa4dc4af5fc in ReliSock::end_of_message (this=0x55caf89bd170) at /usr/src/debug/condor-9.0.0/src/condor_io/reli_sock.cpp:546
#7 0x00007fa4dc4753de in relisock_gsi_put (arg=0x55caf89bd170, buf=0x55caf89f99a0, size=667) at /usr/src/debug/condor-9.0.0/src/condor_io/cedar_no_ckpt.cpp:943
#8 0x00007fa4dc3c4fc3 in x509_receive_delegation (destination_file=destination_file@entry=0x55caf89bdc40 "proxy.tmp", recv_data_func=recv_data_func@entry=0x7fa4dc4752c0 <relisock_gsi_get(void*, void**, unsigned long*)>, recv_data_ptr=recv_data_ptr@entry=0x55caf89bd170, send_data_func=send_data_func@entry=0x7fa4dc4753b0 <relisock_gsi_put(void*, void*, unsigned long)>, send_data_ptr=send_data_ptr@entry=0x55caf89bd170, state_ptr=state_ptr@entry=0x7ffc6dbd1738) at /usr/src/debug/condor-9.0.0/src/condor_utils/globus_utils.cpp:1676
#9 0x00007fa4dc47673e in ReliSock::get_x509_delegation (this=this@entry=0x55caf89bd170, destination=0x55caf89bdc40 "proxy.tmp", flush_buffers=flush_buffers@entry=false, state_ptr=state_ptr@entry=0x0) at /usr/src/debug/condor-9.0.0/src/condor_io/cedar_no_ckpt.cpp:757
#10 0x000055caf678fe84 in updateX509Proxy (path=0x55caf8993ac5 "proxy", rsock=0x55caf89bd170, cmd=500) at /usr/src/debug/condor-9.0.0/src/condor_starter.V6.1/jic_shadow.cpp:1791
#11 JICShadow::updateX509Proxy (this=0x55caf898c4f0, cmd=500, s=0x55caf89bd170) at /usr/src/debug/condor-9.0.0/src/condor_starter.V6.1/jic_shadow.cpp:1896
#12 0x000055caf676e65b in Starter::updateX509Proxy (this=<optimized out>, cmd=<optimized out>, s=<optimized out>) at /usr/src/debug/condor-9.0.0/src/condor_starter.V6.1/starter.cpp:3723
#13 0x00007fa4dc50b46a in DaemonCore::CallCommandHandler (this=0x55caf897d0e0, req=500, stream=0x55caf89bd170, delete_stream=delete_stream@entry=false, check_payload=check_payload@entry=true, time_spent_on_sec=0.000277000014, time_spent_waiting_for_payload=time_spent_waiting_for_payload@entry=0) at /usr/src/debug/condor-9.0.0/src/condor_daemon_core.V6/daemon_core.cpp:4468
#14 0x00007fa4dc4fa19a in DaemonCommandProtocol::ExecCommand (this=0x55caf89b35f0) at /usr/src/debug/condor-9.0.0/src/condor_daemon_core.V6/daemon_command.cpp:1810
#15 0x00007fa4dc4fd385 in DaemonCommandProtocol::doProtocol (this=this@entry=0x55caf89b35f0) at /usr/src/debug/condor-9.0.0/src/condor_daemon_core.V6/daemon_command.cpp:176
#16 0x00007fa4dc4fd485 in DaemonCommandProtocol::SocketCallback (this=this@entry=0x55caf89b35f0, stream=0x55caf89bd170) at /usr/src/debug/condor-9.0.0/src/condor_daemon_core.V6/daemon_command.cpp:239
#17 0x00007fa4dc50c850 in DaemonCore::CallSocketHandler_worker (this=0x55caf897d0e0, i=3, default_to_HandleCommand=<optimized out>, asock=<optimized out>) at /usr/src/debug/condor-9.0.0/src/condor_daemon_core.V6/daemon_core.cpp:4235
#18 0x00007fa4dc50c8ed in DaemonCore::CallSocketHandler_worker_demarshall (arg=0x55caf89b2d00) at /usr/src/debug/condor-9.0.0/src/condor_daemon_core.V6/daemon_core.cpp:4194
#19 0x00007fa4dc345dd5 in CondorThreads::pool_add (routine=routine@entry=0x7fa4dc50c8d0 <DaemonCore::CallSocketHandler_worker_demarshall(void*)>, arg=arg@entry=0x55caf89b2d00, tid=<optimized out>, descrip=<optimized out>) at /usr/src/debug/condor-9.0.0/src/condor_utils/condor_threads.cpp:1109
#20 0x00007fa4dc508617 in DaemonCore::CallSocketHandler (this=this@entry=0x55caf897d0e0, i=@0x7ffc6dbd1c60: 3, default_to_HandleCommand=default_to_HandleCommand@entry=true) at /usr/src/debug/condor-9.0.0/src/condor_daemon_core.V6/daemon_core.cpp:4182
#21 0x00007fa4dc51130e in DaemonCore::Driver (this=0x55caf897d0e0) at /usr/src/debug/condor-9.0.0/src/condor_daemon_core.V6/daemon_core.cpp:4019
#22 0x00007fa4dc525d12 in dc_main (argc=2, argv=0x7ffc6dbd2550) at /usr/src/debug/condor-9.0.0/src/condor_daemon_core.V6/daemon_core_main.cpp:4386
#23 0x00007fa4d9d6b555 in __libc_start_main () from /usr/lib64/libc.so.6
#24 0x000055caf676d961 in _start ()

--
Karlsruher Institut für Technologie (KIT)
Steinbuch Centre for Computing (SCC)

Dr. René Caspart

Hermann-von-Helmholtz-Platz 1
76344 Eggenstein-Leopoldshafen, Germany
Phone: +49 721 608-25631
E-mail: Rene.Caspart@xxxxxxx

Registered office: Kaiserstraße 12, 76131 Karlsruhe

KIT - The Research University in the Helmholtz Association