Re: [HTCondor-users] condor_adstash behaviour when dealing with big amounts of jobs



Hi Maria,

The only backfilling method that definitely works with adstash right now is to use the --ad_file option pointed at the history files on disk. I think there are opportunities to do better, both with remote history calls (e.g. by allowing custom constraint and since expressions rather than only looking at the checkpoint file) and with reading from history files (e.g. also by allowing custom constraints), so I've opened a ticket and will be working on a design there: https://opensciencegrid.atlassian.net/browse/HTCONDOR-3793
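
As a rough illustration of the --ad_file approach, here is a minimal Python sketch that only builds (and prints) one condor_adstash command line per history file, oldest first. It makes several assumptions that are not confirmed above: that rotated files sit next to the live file and are named "history.<timestamp>" so that sorting by name gives chronological order, that one file is passed per invocation, and that whatever other options your deployment needs (destination, credentials, and so on) are supplied through the hypothetical extra_args parameter.

```python
from pathlib import Path


def build_backfill_commands(spool_dir, extra_args=()):
    """Return one condor_adstash command line per history file, oldest first.

    This is a dry run: nothing is executed, the commands are only built.
    """
    spool = Path(spool_dir)
    # Assumption: rotated files are named "history.<timestamp>" and the
    # timestamps sort lexically in chronological order.
    rotated = sorted(p for p in spool.glob("history.*") if p.is_file())
    live = spool / "history"
    files = rotated + ([live] if live.is_file() else [])
    # extra_args is a hypothetical pass-through for deployment-specific options.
    return [["condor_adstash", "--ad_file", str(p), *extra_args] for p in files]


if __name__ == "__main__":
    # Print the commands for review instead of running them.
    for cmd in build_backfill_commands("/var/lib/condor/spool"):
        print(" ".join(cmd))
```

Processing the oldest file first keeps the ingestion order roughly chronological, but that ordering is a choice made in this sketch, not something adstash requires.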

For some background, the Schedd.history method ( https://htcondor.readthedocs.io/en/latest/apis/python-bindings/api/version2/htcondor2/schedd.html#htcondor2.Schedd.history ) opens a connection to the condor_schedd, the schedd forks a condor_history child process, and the history process then reads the flat history file(s) on disk in *reverse chronological order*, returning ads one by one until:
1. an optionally set "match" number of ads are *returned*,
2. an optionally set "since" expression becomes true, or
3. HISTORY_HELPER_MAX_HISTORY (configured on the access point) number of ads are *read*.
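
To make the difference between ads *returned* and ads *read* concrete, here is a toy Python model of the three stop conditions. It is not HTCondor code: scan_history and its parameters are made up for illustration, ads are plain dicts, and it assumes that the ad which makes "since" true is not itself returned.

```python
def scan_history(ads_newest_first, constraint=None, match=None, since=None, max_read=None):
    """Toy model of a reverse-chronological history scan.

    Returns (matching_ads, number_of_ads_read).
    """
    returned = []
    read = 0
    for ad in ads_newest_first:
        if max_read is not None and read >= max_read:
            break  # condition 3: cap on the number of ads *read*
        read += 1
        if since is not None and since(ad):
            break  # condition 2: the "since" expression became true
        if constraint is None or constraint(ad):
            returned.append(ad)
            if match is not None and len(returned) >= match:
                break  # condition 1: enough ads have been *returned*
    return returned, read


if __name__ == "__main__":
    # Ten fake ads, newest (ClusterId 10) first.
    history = [{"ClusterId": c, "Owner": "alice" if c % 2 else "bob"}
               for c in range(10, 0, -1)]
    print(scan_history(history, constraint=lambda ad: ad["Owner"] == "alice", match=2))
    print(scan_history(history, since=lambda ad: ad["ClusterId"] == 7))
    print(scan_history(history, max_read=3))
```

Without any of the three limits, the loop only ends when the whole history has been read, whatever the constraint is.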

Note how there's no "4. HISTORY_TIMEOUT is reached"... because there is no condor_history/Schedd.history timeout setting. To clarify, the timeout options in adstash control how long the parent adstash process waits for history to be processed for a schedd; they do not control the timeout for the individual Schedd.history call, because there is (unfortunately) no setting for that. Does the timeout behavior you were seeing match this understanding, i.e. were you always hitting the timeout you configured no matter how long you set it for?

The "scan every ad in reverse order", I/O-intensive behavior of condor_history is what makes backfilling more than a few hours of history on busy schedds extremely slow. Even if you provide a perfect constraint expression to capture missing ads, condor_history still has to read through the entire history on disk to figure out which ads match that constraint. (This is why adstash writes out checkpoints to generate "since" expressions per schedd, so condor_history knows when it can exit early.) Nothing is (explicitly) cached in memory between condor_history calls. The HTCSS team has started to address this with the new Archive Librarian feature, which puts some per-ad metadata for each ad in the history file into a SQLite database, but for now this only works for the current history file, not rotated history files: https://htcondor.readthedocs.io/en/latest/admin-manual/ap-policy-configuration.html#archive-librarian
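
To illustrate the checkpoint idea, here is a minimal sketch of turning a saved "last ad seen" record into a "since" expression string. The checkpoint layout (a dict with ClusterId and ProcId) and the since_expression helper are assumptions made for this example; adstash's real checkpoint format and the expression it generates may differ.

```python
def since_expression(checkpoint):
    """Build a ClassAd-style "since" expression from a checkpoint dict.

    Hypothetical checkpoint layout: {"ClusterId": int, "ProcId": int}.
    Returns None when there is no checkpoint, meaning no early exit.
    """
    if not checkpoint:
        return None
    return "(ClusterId == {}) && (ProcId == {})".format(
        int(checkpoint["ClusterId"]), int(checkpoint["ProcId"])
    )


if __name__ == "__main__":
    # The resulting string would be passed as the "since" argument of a history query.
    print(since_expression({"ClusterId": 1234, "ProcId": 0}))
    print(since_expression(None))
```

With no checkpoint there is nothing that lets the scan stop early, which matches the "start from scratch" case where the whole history would have to be read.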

I know this is not a very satisfying answer, but the current design of the history files and reading methods makes tackling the backfill problem difficult.

To add one more thing to think about... do you rotate your OpenSearch indexes (e.g. with ILM policies)? If so, and if we were to add a backfilling feature, I'm still not sure how to prevent duplicate ads after an index rotation occurs.

Jason

On Wed, Jun 17, 2026 at 8:01 AM Maria Alandes Pradillo via HTCondor-users <htcondor-users@xxxxxxxxxxx> wrote:

Dear all,

I'm interested in understanding better how condor_adstash works after an issue we have experienced this month. Some facts that are relevant to understand what has happened:

  • ~55 schedds deployed at CERN
  • 4 nodes that call condor_adstash every 5 min (~15 schedds on each node)
  • Data is ingested into OpenSearch
  • From 2nd to 15th June, data stops being ingested into OpenSearch because we can't reach the schedds (this problem has been understood, details are not relevant here and the necessary alarms are now in place to detect this in the future)
  • On 15th June, data starts to get ingested again:
    • condor_adstash retrieves data without problems for those schedds dealing with a low number of jobs (history files are ~10MB-300MB in size), and it's able to retrieve old data, as shown in plot below.

    • condor_adstash times out when trying to contact schedds dealing with a high number of jobs (history files are ~700MB-1GB in size), and for those schedds it can't retrieve old data. See plot below:

I try to call condor_adstash with several options:

  • I start by increasing the timeout with --schedd_history_timeout to 10min, 20min, 2h. The big schedds still time out
  • On 16th June, I decide to change strategy and use --schedd_history_max_ads 1000 instead, hoping to recover data little by little in small chunks. This works for all schedds, but it doesn't seem to recover jobs from log-rotated history files, only from the current live history file. Is it possible that when defining the max number of jobs to be retrieved it stops scanning all the history files and uses only the current one?

I would like to understand whether there's any way I could retrieve the old data for the big schedds.

I'm also wondering what happens if I recreate the adstash nodes and start from scratch: will it also time out for the big schedds and all the data it has to retrieve? We have one month of data in /var/lib/condor/spool/, so for big schedds it represents ~30 GB of data to be processed.

Thanks very much in advance for your help!

Maria
