Hi all,

we have noticed a number of jobs where the LRMS job finished successfully and moved out of the job queue into the LRMS schedd history, but the corresponding CE job just stayed as completed in the CE job queue.
E.g., job 3679882.0 (CE) was routed to the LRMS as 6402364.0 [1]. In the CE job queue, the job's last update is its JobFinishedHookDone just after midnight [2]. Interestingly, I found the job updates in job_queue.log.534, which got its name after I had restarted condor-ce.service, hoping that a new incarnation would pick up the job finished hook in case the previous incarnation (presumably under PID 534) had missed it somehow. Unfortunately, the newly restarted condor-ce.service did not clean the completed jobs out of the job queue either.

As the CE jobs have not concluded for good(?), the CE schedd has not yet written PER_JOB_HISTORY files (so far I have had epoch logs enabled only for the LRMS schedds, but I have to set them up for the CE as well).
Maybe somebody has observed similar behaviour, where completed jobs stayed in the CE queue (and maybe has a fix for getting the jobs out of the CE queue)?
Cheers and thanks for ideas,
  Thomas

* I have uploaded a stash of logs to [3], but these are primarily job-related information, and I would have to start a fulldebug on the schedd if there is something hidden deeper
* installed versions are
    condor-24.6.1-1.el9.x86_64
    condor-upgrade-checks-24.7.3-1.el9.x86_64
    htcondor-ce-24.0.2-1.el9.noarch
    htcondor-ce-bdii-24.0.2-1.el9.noarch
    htcondor-ce-client-24.0.2-1.el9.noarch
    htcondor-ce-condor-24.0.2-1.el9.noarch
    python3-condor-24.6.1-1.el9.x86_64
  on 5.14.0-503.26.1.el9_5.x86_64

[1]
[root@grid-htc-ce03 foo.d]# condor_ce_q 3679882.0

-- Schedd: grid-htc-ce03.desy.de : <131.169.223.135:25449?... @ 07/11/25 15:08:14
OWNER       BATCH_NAME   SUBMITTED   DONE   RUN   IDLE  TOTAL JOB_IDS
belleprd002 ID: 3679882  7/11 01:07     _     _      _      1 3679882.0

Total for query: 1 jobs; 1 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended
Total for all users: 4726 jobs; 620 completed, 0 removed, 1714 idle, 2241 running, 151 held, 0 suspended

[root@grid-htc-ce03 foo.d]# condor_ce_q 3679882.0 -af RoutedToJobId LastJobStatus JobStatus EnteredCurrentStatus
6402364.0 2 4 1752189944

[2]
> grep 3679882 /var/lib/condor-ce/spool/job_queue.log.534 | tail -n 5
103 3679882.0 ScratchDirFileCount 49
103 3679882.0 SpooledOutputFiles ""
103 3679882.0 Managed "ScheddDone"
103 3679882.0 ManagedManager ""
103 3679882.0 JobFinishedHookDone 1752190764
> date -d @1752190764
Fri Jul 11 01:39:24 CEST 2025

[3]
https://syncandshare.desy.de/index.php/s/z5dbwd4D3xgiZza
  * bot repellent password "verysecret20250711"
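
In case somebody wants to repeat the check from [2] for more than one job: a minimal Python sketch that collects the last value logged per attribute for a job from job_queue.log lines and converts an epoch value to a readable time. The helper names (last_attributes, epoch_to_utc) are made up for illustration, and reading opcode 103 as "set attribute" is only an assumption based on the lines shown in [2]. The sample lines are the ones from [2]; for a real check one would pass an open file instead.

```python
from datetime import datetime, timezone

# Opcode seen on every line in [2]; assumed here to mean "set attribute".
SET_ATTRIBUTE_OP = "103"


def last_attributes(lines, job_id):
    """Return {attribute: raw value} holding the last value logged for job_id."""
    attrs = {}
    for line in lines:
        # Expected layout: <opcode> <job id> <attribute> <value>
        parts = line.rstrip("\n").split(" ", 3)
        if len(parts) < 4:
            continue
        op, key, name, value = parts
        if op == SET_ATTRIBUTE_OP and key == job_id:
            attrs[name] = value  # later lines overwrite earlier ones
    return attrs


def epoch_to_utc(value):
    """Convert an epoch-seconds string to an ISO 8601 timestamp in UTC."""
    return datetime.fromtimestamp(int(value), tz=timezone.utc).isoformat()


# Sample lines copied from [2].
sample = [
    '103 3679882.0 ScratchDirFileCount 49',
    '103 3679882.0 SpooledOutputFiles ""',
    '103 3679882.0 Managed "ScheddDone"',
    '103 3679882.0 ManagedManager ""',
    '103 3679882.0 JobFinishedHookDone 1752190764',
]

attrs = last_attributes(sample, "3679882.0")
print(attrs["Managed"], epoch_to_utc(attrs["JobFinishedHookDone"]))
```

The printed time is in UTC, so it is two hours behind the CEST output of date in [2].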