Hi,

When using the Python API to submit a large number of jobs (tens of thousands), we encounter the following error:

    RuntimeError: Failed to commit and disconnect from queue.

We use the following code to submit jobs in chunks:

    max_jobs_per_sub = htcondor.param['MAX_JOBS_PER_SUBMISSION']
    for itemdata_chunk in common.utils.iter_data_chunks(itemdata, max_jobs_per_sub):
        with schedd.transaction() as txn:
            submit.queue_with_itemdata(txn, 1, iter(itemdata_chunk))

If we use a smaller chunk size (say 5,000 rather than the default 20,000), we still encounter the error once a certain number of jobs have been submitted (usually around 30-50k).

Looking at the logs, and based on this message thread, it would seem that we're hitting the schedd's 20-second transaction timeout. Is there any way of increasing or avoiding this timeout?

The pool and central manager all run on Windows.

Kind regards,

Peet Whittaker
Discipline Lead for DevOps | Principal Software Developer
JBA Consulting, 1 Broughton Park, Old Lane North, Broughton, Skipton, North Yorkshire, BD23 3FD.
Telephone: +44 1756 699 500
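P.S. For completeness, `iter_data_chunks` above is one of our internal helpers; a minimal equivalent (a sketch, assuming `itemdata` is any iterable of per-job dicts) would look something like:

```python
from itertools import islice

def iter_data_chunks(itemdata, chunk_size):
    """Yield successive lists of at most chunk_size items from itemdata."""
    it = iter(itemdata)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk

# Example: 10 items in chunks of 3 yield chunk sizes 3, 3, 3, 1.
sizes = [len(c) for c in iter_data_chunks(range(10), 3)]
```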