
Re: [HTCondor-users] Segmentation fault running condor_history



Hi Angel,

I recently stumbled upon this segmentation fault issue and made a patch, along with some other code, that is slated for release in V10.3. I did not realize that this was also in the stable series of the code, so I will make sure to get the bug fix into the stable series as well. Luckily, the segmentation fault occurs at the end of the tool's code, so you still get all the job information. The fault is caused by the freeing of the constraint expressions used to choose which job ads to output; specifying a clusterid or clusterid.procid turns the passed values into a constraint expression under the hood.
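
For illustration (this is just an equivalent way of expressing the same query, not a workaround), asking for a specific job id is roughly the same as passing the constraint yourself, e.g.:

  condor_history -constraint 'ClusterId == 61 && ProcId == 4540'

and it is the cleanup of that internally built constraint expression that triggers the crash.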

Regarding the -scanlimit flag, it only applies on the local machine. If you want to increase the number of history ads queried during a remote history query, you need to add the config knob HISTORY_HELPER_MAX_HISTORY = Number (defaults to 10000) in the remote schedd's config.
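
For example, something along these lines in the remote schedd's configuration (the value here is just illustrative):

  HISTORY_HELPER_MAX_HISTORY = 20000

followed by a condor_reconfig on that machine should let remote queries return more than the default 10000 ads.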

Cheers,
Cole Bollig

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Angel de Vicente <angel.vicente.garrido@xxxxxxxxx>
Sent: Friday, January 27, 2023 3:12 AM
To: htcondor-users@xxxxxxxxxxx <htcondor-users@xxxxxxxxxxx>
Subject: [HTCondor-users] Segmentation fault running condor_history
 
Hello,

In our HTCondor Pool (Ubuntu 20.04, CondorVersion: 10.0.1 + Ubuntu
18.04, CondorVersion: 9.12.0) I've found that using the condor_history
command to query a remote scheduler for a particular job gives me a
segmentation fault.

To be clear, both these commands run without problems:

+ condor_history  (from machine cruise)
+ condor_history -name cruise (from another machine)

But if I want the details of a given job, running from another machine
gives me a segmentation fault, i.e.:

+ condor_history 61.4540 (from machine cruise: OK)
+ condor_history -name cruise 61.4540 (from another machine: it returns the
details of the job apparently OK, but then it ends with "Segmentation
fault (core dumped)")


I'm not sure if this is related to the following, but I also found that
condor_history from the remote machine seems to have a limit of 10000
jobs:

+ condor_history | wc -l (from machine cruise: 11013 jobs)
+ condor_history -name cruise | wc -l (from another machine: 10000 jobs)

By the way, I tried the option -scanlimit, but it seems to only work when
running condor_history on the local machine. When querying a remote
scheduler, it didn't have any impact (-scanlimit 500 or -scanlimit 11000
always returned 10000 jobs of history).

[By the way, https://github.com/htcondor/htcondor doesn't accept issues,
so is the mailing list the best place nowadays to report issues like
these, or is there some other preferred channel?]

Cheers,
--
Ángel de Vicente                 -- (GPG: 0x64D9FDAE7CD5E939)
 Research Software Engineer (Supercomputing and BigData)
 Instituto de Astrofísica de Canarias (https://www.iac.es/en)
