
[HTCondor-users] CondorCE: finished jobs not propagating to history/finished for good



Hi all,

we have noticed a number of jobs where the LRMS job had finished successfully and moved out of the job queue into the LRMS schedd history, but the corresponding CE job just stayed as completed in the CE job queue.

E.g., job 3679882.0 (CE) got realized as 6402364.0 (LRMS) [1]. In the CE job queue, the job's last update is actually its JobFinishedHookDone just after midnight [2]. Interestingly, I found the job updates in job_queue.log.534, which got its name after I had restarted condor-ce.service, hoping that a new incarnation would pick up the job finished hook in case the previous incarnation (presumably under PID 534) had somehow missed it. Unfortunately, the newly restarted condor-ce.service did not clean the completed jobs out of the job queue either. As the CE jobs have not concluded for good(?), the CE schedd has not yet written PER_JOB_HISTORY files (so far I have had epoch logs enabled only for the LRMS schedds, but I will have to set them up for the CE as well).

Has anybody observed similar behaviour, where completed jobs stay in the CE queue, and perhaps knows a fix to get these jobs out of the CE queue?

Cheers and thanks for ideas,
  Thomas

* I have uploaded a stash of logs to [3], but these are primarily job-related information; I would have to start a fulldebug on the schedd if there is something hidden deeper.

* installed versions are
condor-24.6.1-1.el9.x86_64
condor-upgrade-checks-24.7.3-1.el9.x86_64
htcondor-ce-24.0.2-1.el9.noarch
htcondor-ce-bdii-24.0.2-1.el9.noarch
htcondor-ce-client-24.0.2-1.el9.noarch
htcondor-ce-condor-24.0.2-1.el9.noarch
python3-condor-24.6.1-1.el9.x86_64
on
5.14.0-503.26.1.el9_5.x86_64


[1]
[root@grid-htc-ce03 foo.d]# condor_ce_q 3679882.0


-- Schedd: grid-htc-ce03.desy.de : <131.169.223.135:25449?... @ 07/11/25 15:08:14
OWNER       BATCH_NAME     SUBMITTED   DONE   RUN    IDLE  TOTAL JOB_IDS
belleprd002 ID: 3679882   7/11 01:07      _      _      _      1 3679882.0

Total for query: 1 jobs; 1 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended
Total for all users: 4726 jobs; 620 completed, 0 removed, 1714 idle, 2241 running, 151 held, 0 suspended

[root@grid-htc-ce03 foo.d]# condor_ce_q 3679882.0 -af RoutedToJobId LastJobStatus JobStatus EnteredCurrentStatus
6402364.0 2 4 1752189944
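
For readers less familiar with the numeric codes, the line above can be decoded roughly as follows. This is only a minimal sketch: the status table uses the standard HTCondor JobStatus numbering, while the helper name decode_af_line and the choice to print the timestamp in UTC are assumptions made for illustration.

```python
from datetime import datetime, timezone

# Standard HTCondor JobStatus codes.
JOB_STATUS = {
    1: "Idle",
    2: "Running",
    3: "Removed",
    4: "Completed",
    5: "Held",
    6: "Transferring Output",
    7: "Suspended",
}


def decode_af_line(line):
    """Decode one line of
    'condor_ce_q -af RoutedToJobId LastJobStatus JobStatus EnteredCurrentStatus'.
    (Hypothetical helper, not part of HTCondor.)"""
    routed, last_status, status, entered = line.split()
    return {
        "RoutedToJobId": routed,
        "LastJobStatus": JOB_STATUS.get(int(last_status), "Unknown"),
        "JobStatus": JOB_STATUS.get(int(status), "Unknown"),
        # Epoch seconds rendered as an ISO timestamp in UTC.
        "EnteredCurrentStatus": datetime.fromtimestamp(
            int(entered), tz=timezone.utc
        ).isoformat(),
    }


if __name__ == "__main__":
    print(decode_af_line("6402364.0 2 4 1752189944"))
```

So the CE job went from Running to Completed and has been sitting in that state since then.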

[2]
> grep 3679882 /var/lib/condor-ce/spool/job_queue.log.534 | tail -n 5
103 3679882.0 ScratchDirFileCount 49
103 3679882.0 SpooledOutputFiles ""
103 3679882.0 Managed "ScheddDone"
103 3679882.0 ManagedManager ""
103 3679882.0 JobFinishedHookDone 1752190764

> date -d @1752190764
Fri Jul 11 01:39:24 CEST 2025
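
The same lookup can be scripted to collect the last value written for each attribute of a job. This is a rough sketch only: it assumes that records starting with 103 are "set attribute" entries of the form "103 <job id> <attribute> <value>", as in the excerpt above, and the function name last_attributes is made up for this example.

```python
def last_attributes(lines, job_id):
    """Return the last value seen per attribute for job_id.

    Assumes '103 <job id> <attribute> <value>' records, as in the
    grep output above; all other records are ignored.
    """
    attrs = {}
    for line in lines:
        parts = line.rstrip("\n").split(maxsplit=3)
        if len(parts) < 4:
            continue  # not a complete set-attribute record
        opcode, jid, name, value = parts
        if opcode == "103" and jid == job_id:
            attrs[name] = value  # later records overwrite earlier ones
    return attrs


if __name__ == "__main__":
    # Path taken from the grep command above.
    path = "/var/lib/condor-ce/spool/job_queue.log.534"
    with open(path, errors="replace") as fh:
        attrs = last_attributes(fh, "3679882.0")
    for key in ("Managed", "ManagedManager", "JobFinishedHookDone"):
        print(key, attrs.get(key))
```

Values are kept as the raw strings from the log, so quoted strings keep their quotes.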

[3]
https://syncandshare.desy.de/index.php/s/z5dbwd4D3xgiZza
* bot repellent password "verysecret20250711"
