[HTCondor-users] condor_negotiator 24.0.2 sigsegv with offline EPs



Hi All,

I recently upgraded our pools here at WWU to Rocky 9 and HTCondor 24.0.2 (from Rocky 8 and HTCondor 23.0.x). Overall things went smoothly, except when I have an offline EP that could/should take a job. In that case, I get a SIGSEGV in the condor_negotiator process and condor_master restarts it over and over again until I remove the job. I installed the debuginfo RPMs, set D_FULLDEBUG, and crashed it again with GDB attached to get the backtrace:

From the NegotiatorLog:

01/03/25 14:11:44 Registering attempt to match offline machine slot1@xxxxxxxxxxxxxxxxxxxxxxxx by mcgrewz.
Caught signal 11: si_code=1, si_pid=0, si_uid=0, si_addr=0x0
Stack dump for process 21280 at timestamp 1735942304 (16 frames)
/lib64/libcondor_utils_24_0_2.so(_Z18dprintf_dump_stackv+0x28)[0x7f705de194c8]
/lib64/libcondor_utils_24_0_2.so(_Z17unix_sig_coredumpiP9siginfo_tPv+0x6e)[0x7f705dfb520e]
/lib64/libc.so.6(+0x3e730)[0x7f705d43e730]
/lib64/libc.so.6(+0xacb2a)[0x7f705d4acb2a]
/lib64/libcondor_utils_24_0_2.so(_ZN11DCCollectorC1EPKcNS_10UpdateTypeE+0xac)[0x7f705df7575c]
condor_negotiator(_ZN10Matchmaker29RegisterAttemptedOfflineMatchEPN7classad7ClassAdES2_PKc+0x4e9)[0x55d2cbb22c59]
condor_negotiator(_ZN10Matchmaker19matchmakingProtocolERN7classad7ClassAdEPS1_RSt3mapINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESt3setISA_St4lessISA_ESaISA_EESD_SaISt4pairIKSA_SF_EEEP4SockPKcSP_+0x1f9)[0x55d2cbb12ee9]
condor_negotiator(_ZN10Matchmaker9negotiateEPKcS1_PKN7classad7ClassAdEdddiRSt6vectorIPS3_SaIS7_EERSt3mapINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESt3setISH_St4lessISH_ESaISH_EESK_SaISt4pairIKSH_SM_EEEblRiRd+0x7b2)[0x55d2cbb1ae82]
condor_negotiator(_ZN10Matchmaker18negotiateWithGroupEbiddRSt6vectorIPN7classad7ClassAdESaIS3_EERSt3mapINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESt3setISD_St4lessISD_ESaISD_EESG_SaISt4pairIKSD_SI_EEES6_dPKc+0x1141)[0x55d2cbb0d6b1]
condor_negotiator(_ZN10Matchmaker15negotiationTimeEi+0x7e0)[0x55d2cbb07fd0]
/lib64/libcondor_utils_24_0_2.so(_ZN12TimerManager7TimeoutEPiPd+0x154)[0x7f705dfca7e4]
/lib64/libcondor_utils_24_0_2.so(_ZN10DaemonCore6DriverEv+0x2ce)[0x7f705df9e08e]
/lib64/libcondor_utils_24_0_2.so(_Z7dc_mainiPPc+0x168b)[0x7f705dfc523b]
/lib64/libc.so.6(+0x295d0)[0x7f705d4295d0]
/lib64/libc.so.6(__libc_start_main+0x80)[0x7f705d429680]
condor_negotiator(_start+0x25)[0x55d2cbae83a5]


The first line of that output is the D_FULLDEBUG message, so the code is at least getting that far before it crashes.

When I look at the backtrace in GDB, I can see a bit more useful info:

Program received signal SIGSEGV, Segmentation fault.
0x00007f5d29aacb2a in __strlen_sse2 () from /lib64/libc.so.6
(gdb) bt
#0  0x00007f5d29aacb2a in __strlen_sse2 () from /lib64/libc.so.6
#1  0x00007f5d2a57575c in std::char_traits<char>::length (__s=<optimized out>, __s=<optimized out>) at /opt/rh/gcc-toolset-13/root/usr/include/c++/13/bits/char_traits.h:399
#2  std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::assign (__s=0x0, this=0x55f166f34630) at /opt/rh/gcc-toolset-13/root/usr/include/c++/13/bits/basic_string.h:1684
#3  std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::operator= (__s=<optimized out>, this=<optimized out>, this=<optimized out>, __s=<optimized out>)
    at /opt/rh/gcc-toolset-13/root/usr/include/c++/13/bits/basic_string.h:824
#4  DCCollector::DCCollector (this=<optimized out>, dcName=<optimized out>, uType=<optimized out>, this=<optimized out>, dcName=<optimized out>, uType=<optimized out>)
    at /usr/src/debug/condor-24.0.2-1.el9.x86_64/src/condor_daemon_client/dc_collector.cpp:41
#5  0x000055f16550ac59 in Matchmaker::RegisterAttemptedOfflineMatch (this=<optimized out>, job_ad=<optimized out>, startd_ad=0x55f166fb9ff0, schedd_addr=<optimized out>)
    at /usr/src/debug/condor-24.0.2-1.el9.x86_64/src/condor_negotiator.V6/matchmaker.cpp:6068
#6  0x000055f1654faee9 in Matchmaker::matchmakingProtocol (this=this@entry=0x55f16552c200 <matchMaker>, request=..., offer=offer@entry=0x55f166fb9ff0, claimIds=std::map with 21 elements = {...}, sock=0x55f166f27550, 
    submitterName=submitterName@entry=0x55f166f02920 "research_berger.mcgrewz@xxxxxxxxxxxxxxxxxx", 
    scheddAddr=0x55f166f68130 "<140.160.143.130:9618?addrs=140.160.143.130-9618&alias=csci-head.cluster.cs.wwu.edu&noUDP&sock=schedd_23847_a310>")
    at /usr/src/debug/condor-24.0.2-1.el9.x86_64/src/condor_negotiator.V6/matchmaker.cpp:5060
#7  0x000055f165502e82 in Matchmaker::negotiate (this=0x55f16552c200 <matchMaker>, groupName=<optimized out>, submitterName=<optimized out>, submitterAd=<optimized out>, priority=<optimized out>, submitterLimit=<optimized out>, 
    submitterLimitUnclaimed=<optimized out>, submitterCeiling=<optimized out>, startdAds=..., claimIds=..., ignore_schedd_limit=<optimized out>, deadline=<optimized out>, numMatched=<optimized out>, pieLeft=<optimized out>)
    at /opt/rh/gcc-toolset-13/root/usr/include/c++/13/bits/basic_string.h:222
#8  0x000055f1654f56b1 in Matchmaker::negotiateWithGroup (this=<optimized out>, isFloorRound=<optimized out>, untrimmed_num_startds=<optimized out>, untrimmedSlotWeightTotal=<optimized out>, minSlotWeight=<optimized out>, 
    startdAds=..., claimIds=..., submitterAds=..., groupQuota=<optimized out>, groupName=<optimized out>) at /usr/src/debug/condor-24.0.2-1.el9.x86_64/src/condor_negotiator.V6/matchmaker.cpp:2550
#9  0x000055f1654effd0 in Matchmaker::negotiationTime (this=0x55f16552c200 <matchMaker>) at /usr/src/debug/condor-24.0.2-1.el9.x86_64/src/condor_negotiator.V6/matchmaker.cpp:1833
#10 0x00007f5d2a5ca7e4 in TimerManager::Timeout (this=0x55f166eb4820, pNumFired=0x7ffe999aa684, pruntime=0x7ffe999aa688) at /usr/src/debug/condor-24.0.2-1.el9.x86_64/src/condor_daemon_core.V6/timer_manager.cpp:465
#11 0x00007f5d2a59e08e in DaemonCore::Driver (this=0x55f166eb0850) at /usr/src/debug/condor-24.0.2-1.el9.x86_64/src/condor_daemon_core.V6/daemon_core.cpp:3391
#12 0x00007f5d2a5c523b in dc_main (argc=1, argv=<optimized out>) at /usr/src/debug/condor-24.0.2-1.el9.x86_64/src/condor_daemon_core.V6/daemon_core_main.cpp:4423
#13 0x00007f5d29a295d0 in __libc_start_call_main () from /lib64/libc.so.6
#14 0x00007f5d29a29680 in __libc_start_main_impl () from /lib64/libc.so.6
#15 0x000055f1654d03a5 in _start ()


My C++ is rusty, but it looks like condor_negotiator.V6/matchmaker.cpp:6068 is trying to call the default constructor for DCCollector(). However, the backtrace shows it calling a constructor that takes six arguments? condor_daemon_client/dc_collector.cpp:41 is inside the one constructor that is defined, which takes three arguments, and it's failing on "this->constructorName = dcName;". Is this a case of the optimizer calling the wrong constructor, or am I missing something?
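
For anyone following the trace: frames #0 through #4 are what you would expect if a null const char* were assigned to a std::string, since the assignment calls strlen() on the pointer (note __s=0x0 in frame #2 and si_addr=0x0 in the log). The mangled symbol in the stack dump (_ZN11DCCollectorC1EPKcNS_10UpdateTypeE) lists two explicit parameters, const char* and UpdateType, which together with the implicit "this" gives the three arguments GDB shows. The doubled this/dcName/uType entries in frame #4 are most likely a debug-info artifact of the optimized build rather than six real arguments. Below is a minimal sketch of that failure pattern and one possible guard. It assumes the constructor's parameters have default values, which is not confirmed here, and the class, enum, and member names are hypothetical stand-ins, not HTCondor's actual code.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in types; this is NOT HTCondor's DCCollector.
enum class UpdateKind { Config, Udp, Tcp };

class CollectorStub {
public:
    // Assumption for illustration: every parameter has a default, so a
    // call written as `CollectorStub()` resolves to this constructor
    // with name == nullptr.
    CollectorStub(const char* name = nullptr,
                  UpdateKind kind = UpdateKind::Config)
        // Guard: building or assigning a std::string from a null
        // const char* is undefined behaviour (in practice strlen()
        // dereferences address 0), so substitute an empty string.
        : name_(name ? name : ""), kind_(kind) {}

    const std::string& name() const { return name_; }
    UpdateKind kind() const { return kind_; }

private:
    std::string name_;
    UpdateKind kind_;
};
```

With the guard in place, `CollectorStub c;` yields an object with an empty name instead of crashing. Without it, an unguarded `name_ = name;` with a null pointer would fault inside strlen(), which matches the shape of frames #0 to #3 above.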

For now I've worked around the problem by simply setting HIBERNATE="NONE" for all the EPs instead of dynamically changing it when they're idle for too long. This keeps them awake, but I wanted to see if anyone could take a look at this so they can go back to napping occasionally.

Thanks,
-Zach