
Re: [HTCondor-users] Parallel Universe (DedicatedScheduler): Segfaults, SIGABRTs, and Assertions, oh my!



Thanks for the help and feedback Todd (and Team),

Here are some additional details that may be helpful:

All of the systems (production and my test VMs) are running Rocky Linux 9.5 on x86_64. The AP and EPs have been upgraded to HTCondor 24.0.7.

The production EPs are dual-CPU Xeon Gold 6130s (64 hardware threads, which HTCondor treats as 64 CPUs). The production AP is an older system with a Xeon E5-2620. I'm not seeing any ECC errors or machine-check exceptions on the EPs or the AP.

My test VMs run on an old desktop box with an i7-7700, using KVM and virt-manager. I've oversubscribed that box, with 2 vCPUs going to my AP and 4 vCPUs going to each of the 2 EPs that it's controlling. The crash I reproduced definitely takes longer to trigger on the VMs than on the production hardware, and it almost (apologies for the use of the F-word) feels like a race condition.

All jobs were submitted using condor_submit; no one was using condor_submit_dag or the new htcondor command (though I did recently start looking into jobsets for something else and got excited).

All EPs are configured with partitionable slots and given all of the CPUs and memory of each host.
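In condor_config terms that's just the usual partitionable-slot recipe, i.e. something like this (a sketch of the relevant knobs, not a paste of our actual files):

    # one partitionable slot that owns all of the machine's resources
    NUM_SLOTS                 = 1
    NUM_SLOTS_TYPE_1          = 1
    SLOT_TYPE_1               = 100%
    SLOT_TYPE_1_PARTITIONABLE = TRUE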

The HTCondor packages were all installed from the research.cs.wisc.edu repos. I have the debuginfo packages installed on the node running the dedicated scheduler, to get the extra info out of GDB; the EPs do not.
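(For anyone following along: the debuginfo packages come from the same repos, so on the AP it was something along the lines of

    dnf install condor-debuginfo

though I'm recalling the exact package name from memory.)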

AP and all EPs have "ALL_DEBUG = D_FULLDEBUG" set to see as much info as possible in the logs.

The crash you mentioned trying to reproduce is the one where I noticed slots being claimed in condor_status but the job(s) not running; it looked like the negotiator had said OK, but the corresponding starter and/or shadow had died and the claim hadn't been released yet. I was able to reproduce it by submitting more copies of that job, requesting more CPUs than were available in the test environment, and eventually it triggered. In production, though, the students can crash it just by submitting jobs quickly: I was chatting with a student this morning who told me he wrote a tiny script to submit 5 parallel jobs in quick succession, and he could trigger it on almost every batch of 5 submitted to the production AP.
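In case a concrete stand-in is useful: a placeholder along these lines (not our students' actual job) plus a quick loop is enough to exercise the same path in my test environment, since the 5 jobs together ask for far more CPUs than the two 4-core EPs have:

    # repro.sub -- placeholder parallel-universe job
    universe      = parallel
    executable    = /bin/sleep
    arguments     = 300
    machine_count = 2
    request_cpus  = 4
    log           = repro.log
    queue

    # submit 5 in quick succession, like the student's script
    for i in 1 2 3 4 5; do condor_submit repro.sub; done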

-Zach

________________________________________
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Todd L Miller via HTCondor-users <htcondor-users@xxxxxxxxxxx>
Sent: Tuesday, April 29, 2025 1:17 PM
To: HTCondor-Users Mail List
Cc: Todd L Miller
Subject: Re: [HTCondor-users] Parallel Universe (DedicatedScheduler): Segfaults, SIGABRTs, and Assertions, oh my!

        Thanks for the excellent bug report(s).


> TransferExecutable = False

        This is documented as `transfer_executable`:

https://htcondor.readthedocs.io/en/latest/man-pages/condor_submit.html#transfer_executable

although I don't expect that to matter in practice.
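        For example, with the documented spelling that submit file line would
read:

    transfer_executable = False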


> The first crash that's listed I was able to reproduce in my tiny (1 AP, 2
> 4-core EPs) VM test environment, this would eventually get it to crash
> by submitting a bunch of these (though not necessarily immediately like
> the students could):

        I haven't been able to reproduce this crash yet. :(  Are you
submitting the jobs with condor_submit?

> I think I found the cause for the first crash in my notes after the GDB
> section, and I think I see a potential code path or two that could lead
> to give_up being false after the while loop.

        I found two, but neither of them should ever actually happen,
AFAICT; clearly more investigation is necessary.  (If GetJobs() in the
while loop fails; if ATTR_MAX_HOSTS is 0 or less.)

> #1  0x00005594e5bb3b46 in StringTokenIterator::end (this=0x7ffc960aea50) at /usr/src/debug/condor-24.0.7-1.el9.x86_64/src/condor_utils/stl_string_utils.h:185
> #2  DedicatedScheduler::checkReconnectQueue (this=0x5594e5c96100 <dedicated_scheduler>) at /usr/src/debug/condor-24.0.7-1.el9.x86_64/src/condor_schedd.V6/dedicated_scheduler.cpp:4064

        Looks like another case where something impossible(TM) happens.


        I (or another one of us) will take a look at the other reports
later, but for now, if you could provide more details about your testing
(or live) set-ups, that might be important information.  Thanks.

-- ToddM
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe

Join us in June at Throughput Computing 25: https://osg-htc.org/htc25

The archives can be found at: https://www-auth.cs.wisc.edu/lists/htcondor-users/