Hi all,

we have probably found the cause and fixed it (fingers crossed). Post mortem: https://confluence.desy.de/pages/viewpage.action?pageId=47425023

Presumably, during a 'transparent' maintenance on the ARC's underlying supervisor, the Condor shadows etc. could not access the local job files. This seems to have caused Condor to see a large number of jobs as failed and to put them on hold. Apparently, Condor was then overwhelmed by the large number of held jobs (160,000 jobs on hold, /var/lib/condor/spool.old.20170214/job_queue.log already at ~1.4 GB).

Simply removing the held jobs with condor_rm failed, presumably for the same reason, so we moved the spool dir away and gave Condor a fresh start. Since then, the node has been running fine again.

Cheers,
Thomas