Dear all,

I'm interested in better understanding how condor_adstash works, following an issue we experienced this month. Some facts that are relevant to what happened:

- ~55 schedds deployed at CERN
- 4 nodes that call condor_adstash every 5 min (~15 schedds per node)
- Data is ingested into OpenSearch
- From 2nd to 15th June, data stopped being ingested into OpenSearch because we couldn't reach the schedds (this problem has been understood; the details are not relevant here, and the necessary alarms are now in place to detect it in the future)
- On 15th June, data started to be ingested again:
- condor_adstash retrieves data without problems from the schedds handling a low number of jobs (history files of ~10 MB-300 MB), and it is able to retrieve old data, as shown in the plot below.

- condor_adstash times out when contacting the schedds handling a high number of jobs (history files of ~700 MB-1 GB), and for those schedds it can't retrieve old data. See the plot below:

I tried calling condor_adstash with several options:
- I started by increasing the timeout with --schedd_history_timeout to 10 min, 20 min, and 2 h. The big schedds still timed out.
- On 16th June, I changed strategy and used --schedd_history_max_ads 1000 instead, hoping to recover the data little by little in small chunks. This works for all schedds, but it doesn't seem to recover jobs from the rotated history files, only from the current live history file. Is it possible that, when a maximum number of jobs to retrieve is set, condor_adstash stops scanning all the history files and reads only the current one?
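
To put the chunked approach in numbers, here is a rough back-of-the-envelope sketch. It assumes that every run really does advance through the backlog (which is exactly what I am unsure about for the rotated files) and that the cap applies per schedd per run. The function names, the backlog size and the arrival rate are hypothetical, chosen only for illustration.

```python
import math


def ads_per_day(max_ads_per_run, interval_minutes):
    """Upper bound on ads ingested per day for one schedd,
    given a per-run cap and the interval between runs."""
    runs_per_day = (24 * 60) // interval_minutes
    return max_ads_per_run * runs_per_day


def runs_to_clear_backlog(backlog_ads, new_ads_per_run, max_ads_per_run):
    """Number of runs needed to drain a backlog while new ads keep arriving.

    Returns None when the cap does not exceed the arrival rate,
    because the backlog then never shrinks.
    """
    spare = max_ads_per_run - new_ads_per_run
    if spare <= 0:
        return None
    return math.ceil(backlog_ads / spare)


# With --schedd_history_max_ads 1000 and a run every 5 minutes:
cap_per_day = ads_per_day(1000, 5)

# Hypothetical schedd: 2 million ads of backlog, 400 new ads per 5 minutes.
runs = runs_to_clear_backlog(2_000_000, 400, 1000)
days = None if runs is None else runs * 5 / (24 * 60)
print(cap_per_day, runs, days)
```

If a busy schedd completes close to 1000 jobs every 5 minutes, the spare capacity per run approaches zero and the backlog would effectively never drain with this cap, so the value would need to be well above the normal arrival rate.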

I would like to understand whether there's any way to retrieve the old data for the big schedds.

I'm also wondering what would happen if I recreated the adstash nodes and started from scratch: would it also time out for the big schedds, given all the data it would have to retrieve? We have one month of data in /var/lib/condor/spool/, which for the big schedds represents ~30 GB of data to process.
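
For anyone who wants to compare numbers, the per-schedd backlog can be measured with a small sketch like the one below. It assumes that the live and rotated history files all sit directly in the spool directory and have names starting with "history"; if HISTORY points elsewhere on your schedds, the path and pattern need adjusting. The function name is hypothetical.

```python
from pathlib import Path


def history_backlog(spool_dir):
    """Return ([(file_name, size_in_bytes), ...], total_bytes) for the
    history files directly under spool_dir, oldest first by mtime.

    Assumption: live and rotated history files share the 'history' prefix.
    """
    files = sorted(
        (p for p in Path(spool_dir).glob("history*") if p.is_file()),
        key=lambda p: p.stat().st_mtime,
    )
    sizes = [(p.name, p.stat().st_size) for p in files]
    total = sum(size for _, size in sizes)
    return sizes, total


# Example usage on a schedd host:
#   sizes, total = history_backlog("/var/lib/condor/spool")
#   for name, size in sizes:
#       print(f"{size / 1e6:10.1f} MB  {name}")
#   print(f"total: {total / 1e9:.1f} GB")
```

Sorting by modification time makes it easy to see which rotated files cover the 2nd to 15th June gap.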

Thanks very much in advance for your help!
Maria