[HTCondor-users] Parallel Universe (DedicatedScheduler): Segfaults, SIGABRTs, and Assertions, oh my!
- Date: Tue, 29 Apr 2025 18:45:31 +0000
- From: Zach McGrew <mcgrewz@xxxxxxx>
- Subject: [HTCondor-users] Parallel Universe (DedicatedScheduler): Segfaults, SIGABRTs, and Assertions, oh my!
Hi All,
There's been a parallel programming course using our HTCondor environment here at the university, which has led to the discovery of all sorts of fun issues, errr... schedd crashes. Some of the crashes seem to stem from incorrect student code that crashes on some EP slot, after which the scheduler doesn't always seem to understand what happened on the remote side. Other crashes happen shortly after jobs are submitted.
Below I've included the gdb back traces and local vars where I could catch them, but that's only for a few of the crashes where I could have the student submit their jobs and gdb was attached and waiting for it to crash. The rest are just the sections of the logs that got emailed to me when the schedd crashed.
I was able to reproduce the first crash listed below in my tiny VM test environment (1 AP, two 4-core EPs). Submitting a bunch of these would eventually crash the schedd, though not necessarily as immediately as the students' jobs could:
universe = parallel
TransferExecutable = False
executable = /usr/bin/nonexistent
arguments = I expect this to crash the scheduler
output = logs/out.$(NODE)
error = logs/err.$(NODE)
log = logs/log
machine_count = 2
request_cpus = 2
request_memory = 128MB
queue 2
I think I found the cause for the first crash in my notes after the GDB section, and I think I see a potential code path or two that could lead to give_up being false after the while loop.
Happy to chat with anyone and provide as much data as I can to hopefully get the parallel universe up and working correctly again in a future release.
Thanks,
-Zach
--------------------------------------------------
First crash log: Happened ~90x
Caught signal 11: si_code=1, si_pid=16, si_uid=0, si_addr=0x10
Stack dump for process 1812195 at timestamp 1745518544 (14 frames)
/lib64/libcondor_utils_24_0_7.so(_Z18dprintf_dump_stackv+0x28)[0x7f7b01224e08]
/lib64/libcondor_utils_24_0_7.so(_Z17unix_sig_coredumpiP9siginfo_tPv+0x6f)[0x7f7b013c7b4f]
/lib64/libc.so.6(+0x3e730)[0x7f7b0083e730]
/lib64/libclassad.so.24.0.7(_ZNK7classad7ClassAd13LookupInScopeERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERPNS_8ExprTreeERNS_9EvalStateE+0x55)[0x7f7b00f91405]
/lib64/libclassad.so.24.0.7(_ZNK7classad7ClassAd12EvaluateAttrERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERNS_5ValueENS9_9ValueTypeE+0x79)[0x7f7b00f931d9]
/lib64/libclassad.so.24.0.7(_ZNK7classad7ClassAd21EvaluateAttrBoolEquivERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERb+0x3d)[0x7f7b00f97bbd]
condor_schedd(_ZN18DedicatedScheduler15computeScheduleEv+0x585)[0x559ac2403fc5]
condor_schedd(_ZN18DedicatedScheduler19handleDedicatedJobsEv+0x4b)[0x559ac2408f2b]
/lib64/libcondor_utils_24_0_7.so(_ZN12TimerManager7TimeoutEPiPd+0x152)[0x7f7b013e1232]
/lib64/libcondor_utils_24_0_7.so(_ZN10DaemonCore6DriverEv+0x2aa)[0x7f7b013b323a]
/lib64/libcondor_utils_24_0_7.so(_Z7dc_mainiPPc+0x134e)[0x7f7b013d631e]
/lib64/libc.so.6(+0x295d0)[0x7f7b008295d0]
/lib64/libc.so.6(__libc_start_main+0x80)[0x7f7b00829680]
condor_schedd(_start+0x25)[0x559ac23ef955]
GDB:
Program received signal SIGSEGV, Segmentation fault.
classad::ClassAd::LookupInScope (this=this@entry=0x0, name="WantParallelSchedulingGroups", expr=@0x7fff1c697f78: 0x0, state=...) at /usr/src/debug/condor-24.0.7-1.el9.x86_64/src/classad/classad.cpp:671
671 if( ( expr = current->Lookup( name ) ) ) {
(gdb) bt
#0 classad::ClassAd::LookupInScope (this=this@entry=0x0, name="WantParallelSchedulingGroups", expr=@0x7fff1c697f78: 0x0, state=...) at /usr/src/debug/condor-24.0.7-1.el9.x86_64/src/classad/classad.cpp:671
#1 0x00007f7b00f931d9 in classad::ClassAd::EvaluateAttr (this=0x0, attr="WantParallelSchedulingGroups", val=..., mask=classad::Value::NUMBER_VALUES) at /usr/src/debug/condor-24.0.7-1.el9.x86_64/src/classad/classad.cpp:1017
#2 0x00007f7b00f97bbd in classad::ClassAd::EvaluateAttrBoolEquiv (this=this@entry=0x0, attr="WantParallelSchedulingGroups", b=@0x7fff1c698115: false) at /usr/src/debug/condor-24.0.7-1.el9.x86_64/src/classad/classad.cpp:1181
#3 0x0000559ac2403fc5 in classad::ClassAd::LookupBool (this=0x0, name="WantParallelSchedulingGroups", value=@0x7fff1c698115: false) at /usr/src/debug/condor-24.0.7-1.el9.x86_64/src/classad/classad/classad.h:539
#4 DedicatedScheduler::computeSchedule (this=this@entry=0x559ac24f5100 <dedicated_scheduler>) at /usr/src/debug/condor-24.0.7-1.el9.x86_64/src/condor_schedd.V6/dedicated_scheduler.cpp:2280
#5 0x0000559ac2408f2b in DedicatedScheduler::handleDedicatedJobs (this=0x559ac24f5100 <dedicated_scheduler>) at /usr/src/debug/condor-24.0.7-1.el9.x86_64/src/condor_schedd.V6/dedicated_scheduler.cpp:1521
#6 0x00007f7b013e1232 in TimerManager::Timeout (this=0x559ac3f14df0, pNumFired=0x7fff1c698304, pruntime=0x7fff1c698308) at /usr/src/debug/condor-24.0.7-1.el9.x86_64/src/condor_daemon_core.V6/timer_manager.cpp:465
#7 0x00007f7b013b323a in DaemonCore::Driver (this=0x559ac3efa150) at /usr/src/debug/condor-24.0.7-1.el9.x86_64/src/condor_daemon_core.V6/daemon_core.cpp:3391
#8 0x00007f7b013d631e in dc_main (argc=1, argv=<optimized out>) at /usr/src/debug/condor-24.0.7-1.el9.x86_64/src/condor_daemon_core.V6/daemon_core_main.cpp:4554
#9 0x00007f7b008295d0 in __libc_start_call_main () from /lib64/libc.so.6
#10 0x00007f7b00829680 in __libc_start_main_impl () from /lib64/libc.so.6
#11 0x0000559ac23ef955 in _start ()
(gdb) info locals
current = 0x0
superScope = <optimized out>
Notes:
from around dedicated_scheduler.cpp:2280
bool want_groups = false;
job = jobs->Head();
jobs->Rewind();
job->LookupBool(ATTR_WANT_PARALLEL_SCHEDULING_GROUPS, want_groups);
I'm guessing jobs is empty, and job winds up as null because of that.
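A minimal sketch of the kind of guard that would avoid this dereference, assuming the guess above is right and the head of the list can be null. FakeJobAd, FakeJobList, and headWantsGroups are hypothetical stand-ins for illustration only; they are not HTCondor's classes and this is not the actual code in dedicated_scheduler.cpp.

```cpp
#include <list>
#include <map>
#include <string>

// Hypothetical stand-in for a job ad; NOT HTCondor's ClassAd.
struct FakeJobAd {
    std::map<std::string, bool> bools;

    // Same general shape as a LookupBool call: returns true only when the
    // attribute exists, and writes its value into `value`.
    bool LookupBool(const std::string& name, bool& value) const {
        auto it = bools.find(name);
        if (it == bools.end()) {
            return false;
        }
        value = it->second;
        return true;
    }
};

// Hypothetical stand-in for the job list; Head() yields nullptr when empty.
struct FakeJobList {
    std::list<FakeJobAd*> items;

    FakeJobAd* Head() const {
        return items.empty() ? nullptr : items.front();
    }
};

// Returns false and leaves want_groups untouched when there is no head job,
// instead of calling a member function through a null pointer.
bool headWantsGroups(const FakeJobList& jobs, bool& want_groups) {
    FakeJobAd* job = jobs.Head();
    if (job == nullptr) {
        return false;  // empty list: nothing to look up
    }
    return job->LookupBool("WantParallelSchedulingGroups", want_groups);
}
```

With an empty FakeJobList, headWantsGroups returns false and want_groups keeps its default of false, which matches the value seen in frame #3 of the backtrace.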
Slight variation of the same crash in GDB?:
Program received signal SIGSEGV, Segmentation fault.
0x00007f211a63e96b in kill () from /lib64/libc.so.6
(gdb) bt
#0 0x00007f211a63e96b in kill () from /lib64/libc.so.6
#1 0x00007f211b1c7be5 in unix_sig_coredump (signum=11, s_info=<optimized out>) at /usr/src/debug/condor-24.0.7-1.el9.x86_64/src/condor_daemon_core.V6/daemon_core_main.cpp:1372
#2 unix_sig_coredump (signum=11, s_info=<optimized out>) at /usr/src/debug/condor-24.0.7-1.el9.x86_64/src/condor_daemon_core.V6/daemon_core_main.cpp:1292
#3 <signal handler called>
#4 0x00007f211b3c2405 in __gnu_cxx::__normal_iterator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, classad::ExprTree*> const*, std::vector<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, classad::ExprTree*>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, classad::ExprTree*> > > >::__normal_iterator (
this=<optimized out>, __i=<optimized out>) at /opt/rh/gcc-toolset-14/root/usr/include/c++/14/bits/stl_iterator.h:1068
#5 std::vector<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, classad::ExprTree*>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, classad::ExprTree*> > >::cbegin (this=<optimized out>) at /opt/rh/gcc-toolset-14/root/usr/include/c++/14/bits/stl_vector.h:955
#6 classad::ClassAdFlatMap::begin[abi:cxx11]() const (this=<optimized out>) at /usr/src/debug/condor-24.0.7-1.el9.x86_64/src/classad/classad/classad_flat_map.h:96
#7 classad::ClassAdFlatMap::find<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > (this=<optimized out>, key=...) at /usr/src/debug/condor-24.0.7-1.el9.x86_64/src/classad/classad/classad_flat_map.h:124
#8 classad::ClassAd::Lookup<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > (this=<optimized out>, name=...) at /usr/src/debug/condor-24.0.7-1.el9.x86_64/src/classad/classad/classad.h:284
#9 classad::ClassAd::LookupInScope (this=this@entry=0x0, name="WantParallelSchedulingGroups", expr=@0x7ffec98c3bf8: 0x0, state=...) at /usr/src/debug/condor-24.0.7-1.el9.x86_64/src/classad/classad.cpp:671
#10 0x00007f211b3c41d9 in classad::ClassAd::EvaluateAttr (this=0x0, attr="WantParallelSchedulingGroups", val=..., mask=classad::Value::NUMBER_VALUES) at /usr/src/debug/condor-24.0.7-1.el9.x86_64/src/classad/classad.cpp:1017
#11 0x00007f211b3c8bbd in classad::ClassAd::EvaluateAttrBoolEquiv (this=this@entry=0x0, attr="WantParallelSchedulingGroups", b=@0x7ffec98c3d95: false) at /usr/src/debug/condor-24.0.7-1.el9.x86_64/src/classad/classad.cpp:1181
#12 0x0000561484b85fc5 in classad::ClassAd::LookupBool (this=0x0, name="WantParallelSchedulingGroups", value=@0x7ffec98c3d95: false) at /usr/src/debug/condor-24.0.7-1.el9.x86_64/src/classad/classad/classad.h:539
#13 DedicatedScheduler::computeSchedule (this=this@entry=0x561484c77100 <dedicated_scheduler>) at /usr/src/debug/condor-24.0.7-1.el9.x86_64/src/condor_schedd.V6/dedicated_scheduler.cpp:2280
#14 0x0000561484b8af2b in DedicatedScheduler::handleDedicatedJobs (this=0x561484c77100 <dedicated_scheduler>) at /usr/src/debug/condor-24.0.7-1.el9.x86_64/src/condor_schedd.V6/dedicated_scheduler.cpp:1521
#15 0x00007f211b1e1232 in TimerManager::Timeout (this=0x561484d5ddf0, pNumFired=0x7ffec98c3f84, pruntime=0x7ffec98c3f88) at /usr/src/debug/condor-24.0.7-1.el9.x86_64/src/condor_daemon_core.V6/timer_manager.cpp:465
#16 0x00007f211b1b323a in DaemonCore::Driver (this=0x561484d43150) at /usr/src/debug/condor-24.0.7-1.el9.x86_64/src/condor_daemon_core.V6/daemon_core.cpp:3391
#17 0x00007f211b1d631e in dc_main (argc=1, argv=<optimized out>) at /usr/src/debug/condor-24.0.7-1.el9.x86_64/src/condor_daemon_core.V6/daemon_core_main.cpp:4554
#18 0x00007f211a6295d0 in __libc_start_call_main () from /lib64/libc.so.6
#19 0x00007f211a629680 in __libc_start_main_impl () from /lib64/libc.so.6
#20 0x0000561484b71955 in _start ()
(gdb) info locals
No symbol table info available.
----------------------------------------
Second crash log: Happened 6x
Caught signal 11: si_code=1, si_pid=0, si_uid=0, si_addr=0x0
Stack dump for process 1820432 at timestamp 1745518613 (11 frames)
/lib64/libcondor_utils_24_0_7.so(_Z18dprintf_dump_stackv+0x28)[0x7f2e1c224e08]
/lib64/libcondor_utils_24_0_7.so(_Z17unix_sig_coredumpiP9siginfo_tPv+0x6f)[0x7f2e1c3c7b4f]
/lib64/libc.so.6(+0x3e730)[0x7f2e1b83e730]
/lib64/libc.so.6(+0xac5ea)[0x7f2e1b8ac5ea]
condor_schedd(_ZN18DedicatedScheduler19checkReconnectQueueEi+0xb76)[0x558f082c2b46]
/lib64/libcondor_utils_24_0_7.so(_ZN12TimerManager7TimeoutEPiPd+0x152)[0x7f2e1c3e1232]
/lib64/libcondor_utils_24_0_7.so(_ZN10DaemonCore6DriverEv+0x2aa)[0x7f2e1c3b323a]
/lib64/libcondor_utils_24_0_7.so(_Z7dc_mainiPPc+0x134e)[0x7f2e1c3d631e]
/lib64/libc.so.6(+0x295d0)[0x7f2e1b8295d0]
/lib64/libc.so.6(__libc_start_main+0x80)[0x7f2e1b829680]
condor_schedd(_start+0x25)[0x558f0829f955]
GDB:
Program received signal SIGSEGV, Segmentation fault.
0x00007f6ab42ac5ea in __strlen_sse2 () from /lib64/libc.so.6
(gdb) bt
#0 0x00007f6ab42ac5ea in __strlen_sse2 () from /lib64/libc.so.6
#1 0x00005594e5bb3b46 in StringTokenIterator::end (this=0x7ffc960aea50) at /usr/src/debug/condor-24.0.7-1.el9.x86_64/src/condor_utils/stl_string_utils.h:185
#2 DedicatedScheduler::checkReconnectQueue (this=0x5594e5c96100 <dedicated_scheduler>) at /usr/src/debug/condor-24.0.7-1.el9.x86_64/src/condor_schedd.V6/dedicated_scheduler.cpp:4064
#3 0x00007f6ab4de1232 in TimerManager::Timeout (this=0x5594e639bdf0, pNumFired=0x7ffc960af084, pruntime=0x7ffc960af088) at /usr/src/debug/condor-24.0.7-1.el9.x86_64/src/condor_daemon_core.V6/timer_manager.cpp:465
#4 0x00007f6ab4db323a in DaemonCore::Driver (this=0x5594e6381150) at /usr/src/debug/condor-24.0.7-1.el9.x86_64/src/condor_daemon_core.V6/daemon_core.cpp:3391
#5 0x00007f6ab4dd631e in dc_main (argc=1, argv=<optimized out>) at /usr/src/debug/condor-24.0.7-1.el9.x86_64/src/condor_daemon_core.V6/daemon_core_main.cpp:4554
#6 0x00007f6ab42295d0 in __libc_start_call_main () from /lib64/libc.so.6
#7 0x00007f6ab4229680 in __libc_start_main_impl () from /lib64/libc.so.6
#8 0x00005594e5b90955 in _start ()
(gdb) info locals
No symbol table info available.
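The strlen frame together with si_addr=0x0 is consistent with a null C string reaching StringTokenIterator, though the trace alone does not confirm where that string comes from. As a sketch only, a caller could treat a null string as empty before measuring or tokenizing it. safeLength is a hypothetical helper, not part of HTCondor.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical helper: treat a null C string as empty rather than handing
// it to strlen, which has undefined behavior for a null argument.
std::size_t safeLength(const char* s) {
    return s ? std::strlen(s) : 0;
}
```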
----------------------------------------
Third crash log:
Caught signal 11: si_code=128, si_pid=0, si_uid=0, si_addr=0x0
Stack dump for process 1820618 at timestamp 1745537870 (12 frames)
/lib64/libcondor_utils_24_0_7.so(_Z18dprintf_dump_stackv+0x28)[0x7f8f61e24e08]
/lib64/libcondor_utils_24_0_7.so(_Z17unix_sig_coredumpiP9siginfo_tPv+0x6f)[0x7f8f61fc7b4f]
/lib64/libc.so.6(+0x3e730)[0x7f8f6143e730]
condor_schedd(_ZN9Scheduler19finishRecycleShadowEP10shadow_rec+0x37)[0x55e774265567]
condor_schedd(_Z26aboutToSpawnJobHandlerDoneiiPvi+0x93)[0x55e77423a0a3]
condor_schedd(_ZN9Scheduler15StartJobHandlerEi+0x192)[0x55e774250432]
/lib64/libcondor_utils_24_0_7.so(_ZN12TimerManager7TimeoutEPiPd+0x152)[0x7f8f61fe1232]
/lib64/libcondor_utils_24_0_7.so(_ZN10DaemonCore6DriverEv+0x2aa)[0x7f8f61fb323a]
/lib64/libcondor_utils_24_0_7.so(_Z7dc_mainiPPc+0x134e)[0x7f8f61fd631e]
/lib64/libc.so.6(+0x295d0)[0x7f8f614295d0]
/lib64/libc.so.6(__libc_start_main+0x80)[0x7f8f61429680]
condor_schedd(_start+0x25)[0x55e7741ca955]
----------------------------------------
Fourth crash log: Happened 7x
Caught signal 11: si_code=1, si_pid=384, si_uid=0, si_addr=0x180
Stack dump for process 1861743 at timestamp 1745537882 (11 frames)
/lib64/libcondor_utils_24_0_7.so(_Z18dprintf_dump_stackv+0x28)[0x7f2891a24e08]
/lib64/libcondor_utils_24_0_7.so(_Z17unix_sig_coredumpiP9siginfo_tPv+0x6f)[0x7f2891bc7b4f]
/lib64/libc.so.6(+0x3e730)[0x7f289103e730]
condor_schedd(_ZN9Scheduler20makeReconnectRecordsEP7PROC_IDPKN7classad7ClassAdE+0x4dc)[0x562941375edc]
condor_schedd(_ZN9Scheduler19checkReconnectQueueEi+0x59)[0x5629413765b9]
/lib64/libcondor_utils_24_0_7.so(_ZN12TimerManager7TimeoutEPiPd+0x152)[0x7f2891be1232]
/lib64/libcondor_utils_24_0_7.so(_ZN10DaemonCore6DriverEv+0x2aa)[0x7f2891bb323a]
/lib64/libcondor_utils_24_0_7.so(_Z7dc_mainiPPc+0x134e)[0x7f2891bd631e]
/lib64/libc.so.6(+0x295d0)[0x7f28910295d0]
/lib64/libc.so.6(__libc_start_main+0x80)[0x7f2891029680]
condor_schedd(_start+0x25)[0x5629412ee955]
----------------------------------------
Fifth crash log:
Caught signal 6: si_code=4294967290, si_pid=1861824, si_uid=0, si_addr=0x1C68C0
Stack dump for process 1861824 at timestamp 1745538133 (21 frames)
/lib64/libcondor_utils_24_0_7.so(_Z18dprintf_dump_stackv+0x28)[0x7f6f91c24e08]
/lib64/libcondor_utils_24_0_7.so(_Z17unix_sig_coredumpiP9siginfo_tPv+0x6f)[0x7f6f91dc7b4f]
/lib64/libc.so.6(+0x3e730)[0x7f6f9123e730]
/lib64/libc.so.6(+0x8b52c)[0x7f6f9128b52c]
/lib64/libc.so.6(raise+0x16)[0x7f6f9123e686]
/lib64/libc.so.6(abort+0xd3)[0x7f6f91228833]
/lib64/libc.so.6(+0x29170)[0x7f6f91229170]
/lib64/libc.so.6(+0x955d7)[0x7f6f912955d7]
/lib64/libc.so.6(+0x98b43)[0x7f6f91298b43]
/lib64/libc.so.6(malloc+0x1a2)[0x7f6f912994f2]
/lib64/libstdc++.so.6(_Znwm+0x1c)[0x7f6f916adcbc]
/lib64/libcondor_utils_24_0_7.so(_ZN7ProcAPI6initpiERP8procInfo+0x6a)[0x7f6f91de271a]
/lib64/libcondor_utils_24_0_7.so(_ZN7ProcAPI11getProcInfoEiRP8procInfoRi+0x21)[0x7f6f91de2741]
/lib64/libcondor_utils_24_0_7.so(_ZN15SelfMonitorData11CollectDataEv+0x52)[0x7f6f91de2942]
/lib64/libcondor_utils_24_0_7.so(+0x3e2a35)[0x7f6f91de2a35]
/lib64/libcondor_utils_24_0_7.so(_ZN12TimerManager7TimeoutEPiPd+0x5dd)[0x7f6f91de16bd]
/lib64/libcondor_utils_24_0_7.so(_ZN10DaemonCore6DriverEv+0x2aa)[0x7f6f91db323a]
/lib64/libcondor_utils_24_0_7.so(_Z7dc_mainiPPc+0x134e)[0x7f6f91dd631e]
/lib64/libc.so.6(+0x295d0)[0x7f6f912295d0]
/lib64/libc.so.6(__libc_start_main+0x80)[0x7f6f91229680]
condor_schedd(_start+0x25)[0x55fc9304e955]
----------------------------------------
Noticed weird Args set for jobs, maybe not relevant?
Args = "oddeven -t $$([ 2000 ^ 24 ])"
"oddeven -t $$([ 2000 * Item ])" // Item undefined
"oddeven -t $$([ 2^Item ])" // Item undefined
LastHoldReason = "Cannot expand $$ expression ([ 2^Item ])."
----------------------------------------
Sixth crash log: Happened 4x
04/24/25 21:06:19 (pid:1895725) Trying to match slot2_2@xxxxxxxxxxxxxxxxxxxxxxxx to slot2_2@xxxxxxxxxxxxxxxxxxxxxxxx
04/24/25 21:06:19 (pid:1895725) Dedicated Scheduler:: reconnect target address is <140.160.143.160:9618?addrs=140.160.143.160-9618&alias=c-1-0.cluster.cs.wwu.edu&noUDP&sock=startd_2035_6b51>; claim is <140.160.143.160:9618?addrs=140.160.143.160-9618&alias=c-1-0.cluster.cs.wwu.edu&noUDP&sock=startd_2035_6b51>#1741212952#15998#...
04/24/25 21:06:19 (pid:1895725) SEC_ENABLE_MATCH_PASSWORD_AUTHENTICATION: failed to create security session for <140.160.143.160:9618?addrs=140.160.143.160-9618&alias=c-1-0.cluster.cs.wwu.edu&noUDP&sock=startd_2035_6b51>#1741212952#15998#..., so will try to obtain a new security session
04/24/25 21:06:19 (pid:1895725) ERROR "Assertion ERROR on (all_matches->insert(host, mrec) == 0)" at line 4139 in file /var/lib/condor/execute/slot1/dir_2606725/userdir/build-qRBc1D/BUILD/condor-24.0.7/src/condor_schedd.V6/dedicated_scheduler.cpp
04/24/25 21:06:19 (pid:1895725) ScheddCronJobMgr: Bye
04/24/25 21:06:19 (pid:1895725) Clearing userlog file cache
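One way an assertion of this form can fail is an insert that reports an error because the key is already in the table. The log above does not confirm that this is what happened, so the following is only a sketch under that assumption. MatchTable and its 0 / -1 return convention are hypothetical and are not HTCondor's container.

```cpp
#include <map>
#include <string>

// Hypothetical table keyed by host name; NOT HTCondor's actual container.
struct MatchTable {
    std::map<std::string, int> entries;

    // Assumed convention: 0 on success, -1 if the key is already present.
    // The existing entry is left unchanged on failure.
    int insert(const std::string& host, int rec) {
        return entries.emplace(host, rec).second ? 0 : -1;
    }
};
```

Under this convention, inserting the same host twice returns -1 the second time, so a caller that asserts on `insert(...) == 0` would abort on a duplicate rather than skip or replace the existing record.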
----------------------------------------
Maybe related?
Issues with not being able to transfer files to EP, causing vacated slot and crash.
in submitted classad:
ShouldTransferFiles = "YES"
TransferInput = "scan"
in job classad later:
VacateReason = "Transfer input files failure at access point csci-head while sending files to execution point slot2_1@xxxxxxxxxxxxxxxxxxxxxxxxx Details: reading from file /cluster/home/USER/mpi/scan: (errno 2) No such file or directory"
scan wasn't set as the job's executable; it was an argument passed to the openmpiscript.
--------------------------------------------------