Hi all,

on our two CondorCEs we observed pile-ups of idle jobs, yesterday on the first one and today on the other one. For a specific VO (here ATLAS), hardly any jobs went from idle on the CE to running on the batch Condor [1]. `-better-analyze` runs were inconclusive, and the userprios were in favour of the group.

Judging from the CE logs, it seems that the CE did not try to submit its jobs at all; at least the job IDs did not appear in the SchedLogs and routes. However, all daemons were responsive and all CLI commands responded in time.

Restarting the condor-ce and condor units 'fixed' the issue in both cases: shortly after each restart, most of the previously idle jobs were routed and started to run [2].

Unfortunately, I have not found a clue in the logs that could be used as a trigger for an alarm (to notice the issue a bit earlier and force a restart).

Maybe somebody has noticed something similar and has a suggestion for a good trigger to watch for in such a case?

Cheers,
Thomas

[1. before services restarts]
Total for query: 1747 jobs; 1 completed, 0 removed, 1577 idle, 169 running, 0 held, 0 suspended

[2. after services restarts]
Total for query: 1730 jobs; 25 completed, 0 removed, 167 idle, 1538 running, 0 held, 0 suspended

[3]
htcondor-ce-view-5.1.3-1.el7.noarch
python2-condor-9.0.11-1.el7.x86_64
htcondor-ce-5.1.3-1.el7.noarch
htcondor-ce-bdii-5.1.3-1.el7.noarch
condor-classads-9.0.11-1.el7.x86_64
condor-9.0.11-1.el7.x86_64
htcondor-ce-condor-5.1.3-1.el7.noarch
condor-procd-9.0.11-1.el7.x86_64
condor-externals-9.0.11-1.el7.x86_64
htcondor-ce-client-5.1.3-1.el7.noarch
htcondor-ce-apel-5.1.3-1.el7.noarch
python3-condor-9.0.11-1.el7.x86_64
on CentOS Linux release 7.9.2009, kernel 3.10.0-1160.62.1.el7.x86_64
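
In the absence of a usable log message, one possible stopgap would be to alarm on the queue totals themselves, since the two summary lines in [1] and [2] differ so clearly. The following is only a rough sketch of that idea: the function names and the 0.8 threshold are made up for illustration, and it assumes the summary line keeps the "Total for query: ..." format shown above. A high idle share can of course also just mean that the batch pool is full, so on its own it is probably too noisy and would need a second signal next to it.

```python
import re

# Matches the summary line in the format quoted in [1] and [2].
TOTALS_RE = re.compile(
    r"(\d+) jobs; (\d+) completed, (\d+) removed, (\d+) idle, "
    r"(\d+) running, (\d+) held, (\d+) suspended"
)
FIELDS = ("jobs", "completed", "removed", "idle", "running", "held", "suspended")


def parse_totals(line):
    """Turn a 'Total for query: ...' line into a dict of integer counts."""
    match = TOTALS_RE.search(line)
    if match is None:
        raise ValueError("not a recognised totals line: %r" % line)
    return dict(zip(FIELDS, (int(value) for value in match.groups())))


def idle_fraction(totals):
    """Share of idle jobs among idle + running jobs (0.0 for an empty queue)."""
    active = totals["idle"] + totals["running"]
    return totals["idle"] / active if active else 0.0


def looks_stuck(line, threshold=0.8):
    """Hypothetical check: True if the idle share exceeds the threshold.

    The default of 0.8 is an arbitrary illustration, not a tuned value.
    """
    return idle_fraction(parse_totals(line)) > threshold


if __name__ == "__main__":
    before = ("Total for query: 1747 jobs; 1 completed, 0 removed, "
              "1577 idle, 169 running, 0 held, 0 suspended")
    after = ("Total for query: 1730 jobs; 25 completed, 0 removed, "
             "167 idle, 1538 running, 0 held, 0 suspended")
    for label, line in (("before", before), ("after", after)):
        print(label, round(idle_fraction(parse_totals(line)), 2), looks_stuck(line))
```

With the numbers from this report, the idle share is about 0.90 before the restarts and about 0.10 afterwards, so even a crude threshold would have separated the two states here.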