Re: [HTCondor-users] Parallel Universe (DedicatedScheduler): Segfaults, SIGABRTs, and Assertions, oh my!



	Thanks for the excellent bug report(s).


TransferExecutable = False

	This is documented as `transfer_executable`:

https://htcondor.readthedocs.io/en/latest/man-pages/condor_submit.html#transfer_executable,

although I don't expect that to matter in practice.
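
For reference, here is a minimal sketch of a parallel universe submit description that uses the documented spelling. The executable, arguments, and machine count are placeholders chosen for illustration, not values from the original report:

```
# Hypothetical example; executable, arguments, and machine_count are placeholders.
universe            = parallel
executable          = /bin/sleep
arguments           = 60
machine_count       = 2

# Documented spelling of the command quoted above. With this set to false,
# the executable must already exist at that path on the execute nodes.
transfer_executable = false

queue
```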


I was able to reproduce the first crash listed in my tiny VM test environment (1 AP, 2 4-core EPs). Submitting a bunch of these would eventually get it to crash (though not necessarily immediately, like the students could):

I haven't been able to reproduce this crash yet. :( Are you submitting the jobs with condor_submit?

I think I found the cause for the first crash in my notes after the GDB section, and I think I see a potential code path or two that could lead to give_up being false after the while loop.

I found two such code paths: GetJobs() failing in the while loop, and ATTR_MAX_HOSTS being 0 or less. Neither of them should ever actually happen, AFAICT, so clearly more investigation is necessary.

#1  0x00005594e5bb3b46 in StringTokenIterator::end (this=0x7ffc960aea50) at /usr/src/debug/condor-24.0.7-1.el9.x86_64/src/condor_utils/stl_string_utils.h:185
#2  DedicatedScheduler::checkReconnectQueue (this=0x5594e5c96100 <dedicated_scheduler>) at /usr/src/debug/condor-24.0.7-1.el9.x86_64/src/condor_schedd.V6/dedicated_scheduler.cpp:4064

	Looks like another case where something impossible(TM) happens.


I (or another one of us) will take a look at the other reports later. For now, if you could provide more details about your test (or live) setups, that might turn out to be important information. Thanks.

-- ToddM