Re: [HTCondor-users] Remove job without condor_rm



Hi Stuart,

I was just going to open a JIRA ticket for this today (if you have not already; I will check). When I sent the original response I assumed that this issue was also present in the 10.0.X LTS release series, and after a quick test I can confirm that it is. I have not yet determined why no core file or stack trace was produced; I will look into that while working on the bug. As for the two RFE tickets, I will bring them up with the rest of the Dev team for consideration.
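
For anyone following the thread, the failure mode quoted below (a cron STEP value with no range or asterisk, such as 1/10) is the kind of input that an up-front check could reject before any matching logic runs. The following is only a rough Python sketch of such a check, not HTCondor's actual code (which is C++). The function name is_valid_cron_field and the exact set of accepted forms are assumptions made for illustration.

```python
import re

# Hypothetical validator for one cron-style field (e.g. the minute field).
# Assumed accepted forms per comma-separated element:
#   *      N      A-B      */S      A-B/S
# A step attached to a bare number (e.g. "1/10") is rejected.
_ELEMENT = re.compile(r"(?:\*|\d+-\d+)(?:/\d+)?|\d+")


def is_valid_cron_field(field: str) -> bool:
    """Return True if every element of the field has an accepted form."""
    for element in field.split(","):
        element = element.strip()
        if _ELEMENT.fullmatch(element) is None:
            return False
        # A step of zero would never advance, so treat it as invalid too.
        if "/" in element and int(element.split("/")[1]) == 0:
            return False
    return True


if __name__ == "__main__":
    for sample in ["*", "5", "0-59/10", "*/10", "1/10", "*/0"]:
        print(sample, is_valid_cron_field(sample))
```

This sketch only checks syntax; it does not verify that numbers fall within the legal bounds for a given field or that a range start is not greater than its end.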

Cheers,
Cole Bollig

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Anderson, Stuart B. <sba@xxxxxxxxxxx>
Sent: Friday, March 3, 2023 4:22 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Remove job without condor_rm
 

> On Mar 3, 2023, at 1:44 PM, Cole Bollig via HTCondor-users <htcondor-users@xxxxxxxxxxx> wrote:
>
> Hi Stuart,
>
> I am glad that the removal of the single line stopped the infinite Schedd segmentation faults. It looks like the condor cron doesn't know how to handle a STEP value without a range (x-y) or asterisk, e.g. [1/10]. Because of this invalid STEP value, matchFields() appears to run recursively until a segmentation fault occurs. Sorry you had to stumble across this.

Cole,
        Do you have enough information to open a ticket (and tag it LIGO) or should I do that?

Have you determined if this problem exists in some (or all) 10.x releases as well?

And do you understand why this segfault did not generate a core file, or include a stack trace in the automatic condor daemon segfault notification email (perhaps due to blowing out the Linux process stack with unbounded recursive function calls)?

Are you open to considering the following two RFE tickets?

* Add support for condor_q, condor_hold and condor_rm to work on an offline queue.

* Add a knob for condor_master to start daemons under gdb.

These are motivated by feeling lucky that I was able to manually attach gdb to a schedd instance before it crashed, and wondering if I would need to replace /usr/sbin/condor_schedd with a script to start the schedd under gdb to get information on the crash, or that I might have to drop the entire queue to get the AP running again.

Thanks.

--
Stuart Anderson
sba@xxxxxxxxxxx



