Hi Brian,

I have attached a tarball of all job_queue.* files from our CE grid-htcondorce0.

One odd thing(?) might be that on each of our two prod CEs there is one additional rotated(?) file, job_queue.log.{5,12}, so with a somewhat "random" number. Both have change/modification timestamps from when I restarted the condor.service unit. I have no idea why these ended up as remnants.

Anyway, the current job_queue.log as well as the pre-restart one look good to me, with the job IDs and their ads well formatted (on a quick grep I did not find any non-ASCII character that might point to some corruption).
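
A minimal sketch of such a non-ASCII check in Python, in case grep is not at hand. The function name `find_non_ascii_lines` and the `limit` parameter are hypothetical, and the sketch only assumes that a healthy job_queue.log is plain ASCII text; it does not parse or validate the log format itself.

```python
import sys


def find_non_ascii_lines(path, limit=10):
    """Return up to `limit` (line_number, raw_line) pairs that contain
    at least one byte outside the 7-bit ASCII range."""
    hits = []
    # Read in binary mode so undecodable bytes cannot raise an error.
    with open(path, "rb") as fh:
        for lineno, raw in enumerate(fh, start=1):
            if not raw.isascii():
                hits.append((lineno, raw))
                if len(hits) >= limit:
                    break
    return hits


if __name__ == "__main__":
    # Usage: python check_ascii.py /var/lib/condor/spool/job_queue.log*
    for log in sys.argv[1:]:
        hits = find_non_ascii_lines(log)
        if not hits:
            print(f"{log}: only ASCII bytes found")
        for lineno, raw in hits:
            print(f"{log}:{lineno}: {raw!r}")
```

An empty result only says that no non-ASCII bytes were found; it does not prove that the log is consistent.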

Cheers,
  Thomas

[1]
root@grid-htcondorce1: [~] ls -all /var/lib/condor/spool/job_queue.log*
-rw------- 1 condor condor 15030781 Mar 8 16:33 /var/lib/condor/spool/job_queue.log
-rw------- 1 condor condor 33740967 Mar 8 13:57 /var/lib/condor/spool/job_queue.log.12

root@grid-htcondorce0: [/etc/condor/config.d] ls -hall /var/lib/condor/spool/job_queue.log*
-rw------- 1 condor condor 7,6M Mar 8 16:36 /var/lib/condor/spool/job_queue.log
-rw------- 1 condor condor 3,3M Mar 8 13:55 /var/lib/condor/spool/job_queue.log.5

On 05/03/2021 15.54, Brian Lin wrote:

Hi Thomas,

That's quite strange and certainly shouldn't happen! There should be a plain-text /var/lib/condor/spool/job_queue.log: does that file look corrupted at all?

As for the local SYSTEM_PERIODIC_REMOVE, even though it may not be the culprit here, you should move its clauses to your config and append them to the CE's SYSTEM_PERIODIC_REMOVE to avoid similar issues. And if you're on a new enough version of HTCondor-CE, you should be able to remove a few of the clauses:

- Since at least HTCondor-CE 3, held CE jobs are removed after 24 hrs
- HTCondor-CE 4.0.0 <https://htcondor.github.io/htcondor-ce/releases/#disabled-job-retries-by-default> disables job retries by default
- HTCondor-CE 5.0.0 <https://htcondor.github.io/htcondor-ce/releases/#500> (available as a release candidate [1]) will remove jobs that exceed the configured value of "ROUTED_JOB_MAX_TIME"

Brian

[1] https://research.cs.wisc.edu/htcondor/repo/8.9/el7/rc/

On 3/5/21 5:06 AM, Thomas Hartmann wrote:

Hi again,

maybe related(??) - I just noticed that a restart of the condor unit caused the Schedd to lose all its jobs [1]. Since the restart was more or less instantaneous, I would have expected the Schedd to pick up its jobs.

Cheers,
  Thomas

[1]
03/05/21 10:55:11 (pid:3997828) WARNING - Cluster 437906 was deleted with proc ads still attached to it. This should only happen during schedd shutdown.

[2]
Mar 05 10:55:11 grid-htcondorce0.desy.de systemd[1]: Stopping Condor Distributed High-Throughput-Computing...
-- Subject: Unit condor.service has begun shutting down
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/