Re: [HTCondor-users] Parallel Universe (DedicatedScheduler): Segfaults, SIGABRTs, and Assertions, oh my!
- Date: Tue, 29 Apr 2025 15:17:52 -0500 (CDT)
- From: Todd L Miller <tlmiller@xxxxxxxxxxx>
- Subject: Re: [HTCondor-users] Parallel Universe (DedicatedScheduler): Segfaults, SIGABRTs, and Assertions, oh my!
Thanks for the excellent bug report(s).
> TransferExecutable = False
This is documented as `transfer_executable`:
https://htcondor.readthedocs.io/en/latest/man-pages/condor_submit.html#transfer_executable,
although I don't expect that to matter in practice.
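For reference, a minimal parallel-universe submit file using the documented spelling might look like the sketch below. The executable, arguments, and machine count are placeholders chosen for illustration, not values from the report:

```
# Illustrative only: placeholder job, not the one from the bug report.
universe            = parallel
executable          = /bin/sleep
arguments           = 30
machine_count       = 2
transfer_executable = false
queue
```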
> The first crash that's listed I was able to reproduce in my tiny (1 AP, 2
> 4-core EPs) VM test environment, this would eventually get it to crash
> by submitting a bunch of these (though not necessarily immediately like
> the students could):
I haven't been able to reproduce this crash yet. :( Are you
submitting the jobs with condor_submit?
> I think I found the cause for the first crash in my notes after the GDB
> section, and I think I see a potential code path or two that could lead
> to give_up being false after the while loop.
I found two such code paths: GetJobs() failing inside the while
loop, and ATTR_MAX_HOSTS being 0 or less. Neither of them should ever
actually happen, AFAICT, so clearly more investigation is necessary.
> #1 0x00005594e5bb3b46 in StringTokenIterator::end (this=0x7ffc960aea50) at /usr/src/debug/condor-24.0.7-1.el9.x86_64/src/condor_utils/stl_string_utils.h:185
> #2 DedicatedScheduler::checkReconnectQueue (this=0x5594e5c96100 <dedicated_scheduler>) at /usr/src/debug/condor-24.0.7-1.el9.x86_64/src/condor_schedd.V6/dedicated_scheduler.cpp:4064
Looks like another case where something impossible(TM) happens.
I (or another one of us) will take a look at the other reports
later, but for now, if you could provide more details about your testing
(or live) set-ups, that might be important information. Thanks.
-- ToddM