
Re: [HTCondor-users] Remove job without condor_rm



> On Mar 3, 2023, at 1:44 PM, Cole Bollig via HTCondor-users <htcondor-users@xxxxxxxxxxx> wrote:
> 
> Hi Stuart,
> 
> I am glad that the removal of the single line stopped the infinite Schedd segmentation faults. It looks like the condor cron doesn't know how to handle a STEP value without a range (x-y) or asterisk i.e. [1/10]. Because of this invalid STEP value, the matchFields() appears to recursively run until a segmentation fault occurs. Sorry you had to stumble across this.

Cole,
	Do you have enough information to open a ticket (and tag it LIGO) or should I do that?

Have you determined if this problem exists in some (or all) 10.x releases as well?

And do you understand why this segfault neither generated a core file nor included a stack trace in the automatic condor daemon segfault notification email? (Perhaps the unbounded recursive function calls blew out the Linux process stack.)
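
On the core file question, one thing that can be ruled in or out is the resource limits in effect, since the kernel writes no core file when the soft core-size limit is 0. Below is a minimal sketch in Python (not part of HTCondor; the function names are made up for illustration) that reports the core and stack limits of the process it runs in. It only inspects its own process, so for a running daemon the equivalent information would have to come from /proc/<pid>/limits instead.

```python
import resource


def describe_limit(value):
    """Render an rlimit value, mapping RLIM_INFINITY to 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def crash_related_limits():
    """Return (soft, hard) limits for core size and stack size of this process."""
    limits = {}
    for name, which in (("core", resource.RLIMIT_CORE),
                        ("stack", resource.RLIMIT_STACK)):
        soft, hard = resource.getrlimit(which)
        limits[name] = (describe_limit(soft), describe_limit(hard))
    return limits


if __name__ == "__main__":
    # Values are in bytes; a soft core limit of 0 disables core files.
    for name, (soft, hard) in crash_related_limits().items():
        print(f"{name}: soft={soft} hard={hard}")
```

This says nothing about whether the limits were actually the cause here; it is only a quick way to check one possible explanation.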

Are you open to considering the following two RFE tickets?

* Add support for condor_q, condor_hold and condor_rm to work on an offline queue.

* Add a knob for condor_master to start daemons under gdb.

These are motivated by feeling lucky that I was able to manually attach gdb to a schedd instance before it crashed. Otherwise, I wonder whether I would have needed to replace /usr/sbin/condor_schedd with a script that starts the schedd under gdb to get information on the crash, or to drop the entire queue to get the AP running again.

Thanks.

--
Stuart Anderson
sba@xxxxxxxxxxx