
Re: [HTCondor-users] HTCondor 9.0.0 condor_starter segfaulting during x509 proxy update



Hi Rene,

 

This certainly seems like a bug.  We will look into it and also try to reproduce it.  Thank you for the report.

 

In the meantime, you may be able to work around it by disabling AES, which was added in 9.0.0 and which I suspect is related to what you are seeing.  In your condor_config set:

    SEC_DEFAULT_CRYPTO_METHODS = BLOWFISH

 

I think setting it just on the submit node should be sufficient, but it won't hurt to set it everywhere.  Let me know if that helps, and I will follow up once we know more.
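
For anyone applying the workaround, here is a minimal sketch of the change as a local configuration fragment. The file name is only an assumption for illustration; any file that your HTCondor configuration already reads should work the same way.

    # Hypothetical file: /etc/condor/config.d/99-disable-aes.conf
    # Workaround sketch: offer only BLOWFISH so that AES is not negotiated.
    SEC_DEFAULT_CRYPTO_METHODS = BLOWFISH

Running condor_config_val SEC_DEFAULT_CRYPTO_METHODS on that host should then report BLOWFISH. The daemons probably need a condor_reconfig or a restart to pick up the change, and connections that already negotiated a method may keep it until their sessions expire.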

 

 

Cheers,

-zach

 

 

-----Original Message-----
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Caspart, René (SCC) <rene.caspart@xxxxxxx>
Date: Thursday, April 29, 2021 at 8:58 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: [HTCondor-users] HTCondor 9.0.0 condor_starter segfaulting during x509 proxy update

Dear all,

 

After updating to HTCondor 9.0.0 we are experiencing problems on our worker nodes (running SL7). The condor_starter encounters a segmentation fault after ~1h of runtime. Nothing is logged in StarterLog.slot1_X around that time until the next condor_starter starts. In the StartLog I only see the report of the segmentation fault [1]. In addition, a core dump is created; the corresponding backtrace is shown in [2].

 

This seems to me to be related to the update of the X509 proxy for the job, so I tried submitting a job without a user proxy, which so far has not triggered the problem. Apart from the update of the HTCondor version, nothing has changed in our setup or in the jobs we are submitting.

 

Has anyone experienced similar issues? Please let me know if any additional information would be useful for debugging this issue.

 

Thanks,

Rene

 

[1]

StartLog

04/29/21 13:01:35 (pid:4025) (D_ALWAYS|D_FAILURE) Starter pid 3337908 died on signal 11 (signal 11 (Segmentation fault))

 

[2]

#0  0x00007fa4d9d7f657 in kill () from /usr/lib64/libc.so.6
#1  0x00007fa4dc520e64 in unix_sig_coredump (signum=11, s_info=<optimized out>) at /usr/src/debug/condor-9.0.0/src/condor_daemon_core.V6/daemon_core_main.cpp:1355
#2  <signal handler called>
#3  0x00007fa4db0a9802 in EVP_DigestFinal_ex () from /usr/lib64/libcrypto.so.10
#4  0x00007fa4dc4aee1d in ReliSock::SndMsg::snd_packet (this=this@entry=0x55caf89bd470, peer_description=0x55caf89bd35c "<[2a00:139c:5:1dc:0:43:1:8c]:18245>", _sock=_sock@entry=19, end=end@entry=1, _timeout=_timeout@entry=10) at /usr/src/debug/condor-9.0.0/src/condor_io/reli_sock.cpp:1199
#5  0x00007fa4dc4af538 in ReliSock::end_of_message_internal (this=this@entry=0x55caf89bd170) at /usr/src/debug/condor-9.0.0/src/condor_io/reli_sock.cpp:564
#6  0x00007fa4dc4af5fc in ReliSock::end_of_message (this=0x55caf89bd170) at /usr/src/debug/condor-9.0.0/src/condor_io/reli_sock.cpp:546
#7  0x00007fa4dc4753de in relisock_gsi_put (arg=0x55caf89bd170, buf=0x55caf89f99a0, size=667) at /usr/src/debug/condor-9.0.0/src/condor_io/cedar_no_ckpt.cpp:943
#8  0x00007fa4dc3c4fc3 in x509_receive_delegation (destination_file=destination_file@entry=0x55caf89bdc40 "proxy.tmp", recv_data_func=recv_data_func@entry=0x7fa4dc4752c0 <relisock_gsi_get(void*, void**, unsigned long*)>, recv_data_ptr=recv_data_ptr@entry=0x55caf89bd170, send_data_func=send_data_func@entry=0x7fa4dc4753b0 <relisock_gsi_put(void*, void*, unsigned long)>, send_data_ptr=send_data_ptr@entry=0x55caf89bd170, state_ptr=state_ptr@entry=0x7ffc6dbd1738) at /usr/src/debug/condor-9.0.0/src/condor_utils/globus_utils.cpp:1676
#9  0x00007fa4dc47673e in ReliSock::get_x509_delegation (this=this@entry=0x55caf89bd170, destination=0x55caf89bdc40 "proxy.tmp", flush_buffers=flush_buffers@entry=false, state_ptr=state_ptr@entry=0x0) at /usr/src/debug/condor-9.0.0/src/condor_io/cedar_no_ckpt.cpp:757
#10 0x000055caf678fe84 in updateX509Proxy (path=0x55caf8993ac5 "proxy", rsock=0x55caf89bd170, cmd=500) at /usr/src/debug/condor-9.0.0/src/condor_starter.V6.1/jic_shadow.cpp:1791
#11 JICShadow::updateX509Proxy (this=0x55caf898c4f0, cmd=500, s=0x55caf89bd170) at /usr/src/debug/condor-9.0.0/src/condor_starter.V6.1/jic_shadow.cpp:1896
#12 0x000055caf676e65b in Starter::updateX509Proxy (this=<optimized out>, cmd=<optimized out>, s=<optimized out>) at /usr/src/debug/condor-9.0.0/src/condor_starter.V6.1/starter.cpp:3723
#13 0x00007fa4dc50b46a in DaemonCore::CallCommandHandler (this=0x55caf897d0e0, req=500, stream=0x55caf89bd170, delete_stream=delete_stream@entry=false, check_payload=check_payload@entry=true, time_spent_on_sec=0.000277000014, time_spent_waiting_for_payload=time_spent_waiting_for_payload@entry=0) at /usr/src/debug/condor-9.0.0/src/condor_daemon_core.V6/daemon_core.cpp:4468
#14 0x00007fa4dc4fa19a in DaemonCommandProtocol::ExecCommand (this=0x55caf89b35f0) at /usr/src/debug/condor-9.0.0/src/condor_daemon_core.V6/daemon_command.cpp:1810
#15 0x00007fa4dc4fd385 in DaemonCommandProtocol::doProtocol (this=this@entry=0x55caf89b35f0) at /usr/src/debug/condor-9.0.0/src/condor_daemon_core.V6/daemon_command.cpp:176
#16 0x00007fa4dc4fd485 in DaemonCommandProtocol::SocketCallback (this=this@entry=0x55caf89b35f0, stream=0x55caf89bd170) at /usr/src/debug/condor-9.0.0/src/condor_daemon_core.V6/daemon_command.cpp:239
#17 0x00007fa4dc50c850 in DaemonCore::CallSocketHandler_worker (this=0x55caf897d0e0, i=3, default_to_HandleCommand=<optimized out>, asock=<optimized out>) at /usr/src/debug/condor-9.0.0/src/condor_daemon_core.V6/daemon_core.cpp:4235
#18 0x00007fa4dc50c8ed in DaemonCore::CallSocketHandler_worker_demarshall (arg=0x55caf89b2d00) at /usr/src/debug/condor-9.0.0/src/condor_daemon_core.V6/daemon_core.cpp:4194
#19 0x00007fa4dc345dd5 in CondorThreads::pool_add (routine=routine@entry=0x7fa4dc50c8d0 <DaemonCore::CallSocketHandler_worker_demarshall(void*)>, arg=arg@entry=0x55caf89b2d00, tid=<optimized out>, descrip=<optimized out>) at /usr/src/debug/condor-9.0.0/src/condor_utils/condor_threads.cpp:1109
#20 0x00007fa4dc508617 in DaemonCore::CallSocketHandler (this=this@entry=0x55caf897d0e0, i=@0x7ffc6dbd1c60: 3, default_to_HandleCommand=default_to_HandleCommand@entry=true) at /usr/src/debug/condor-9.0.0/src/condor_daemon_core.V6/daemon_core.cpp:4182
#21 0x00007fa4dc51130e in DaemonCore::Driver (this=0x55caf897d0e0) at /usr/src/debug/condor-9.0.0/src/condor_daemon_core.V6/daemon_core.cpp:4019
#22 0x00007fa4dc525d12 in dc_main (argc=2, argv=0x7ffc6dbd2550) at /usr/src/debug/condor-9.0.0/src/condor_daemon_core.V6/daemon_core_main.cpp:4386
#23 0x00007fa4d9d6b555 in __libc_start_main () from /usr/lib64/libc.so.6
#24 0x000055caf676d961 in _start ()

 

--

Karlsruher Institut für Technologie (KIT)

Steinbuch Centre for Computing (SCC)

 

Dr. René Caspart

 

Hermann-von-Helmholtz-Platz 1

76344 Eggenstein-Leopoldshafen, Germany

Telefon: +49 721 608-25631

 

 

Sitz der Körperschaft:

Kaiserstraße 12, 76131 Karlsruhe

 

 

 

KIT – Die Forschungsuniversität in der Helmholtz-Gemeinschaft