
Re: [HTCondor-users] CondorCE: job submission to Condor-LRMS fails due to stdout/stderr files missing during staging(?)

Date: Mon, 08 Mar 2021 16:48:25 +0100
From: Thomas Hartmann <thomas.hartmann@xxxxxxx>
Subject: Re: [HTCondor-users] CondorCE: job submission to Condor-LRMS fails due to stdout/stderr files missing during staging(?)

Hi Brian,

I have attached a tarball of all job_queue.* on our CE grid-htcondorce0.

One odd thing(?) might be that on each of our two prod CEs there is one additional rotated(?) job_queue.log.{5,12}, so with a somewhat "random" number [1]. Both have (c)mod timestamps from when I restarted the condor.service unit. No idea why these ended up as remnants.

Anyway, the current job_queue.log as well as the pre-restart one look good to me, with the job IDs and their ads well formatted (on a quick grep I did not find any non-ASCII characters that might point to corruption).
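
For anyone wanting to repeat the quick non-ASCII check described above without grep, here is a minimal Python sketch. It is not part of HTCondor: the function name find_non_ascii and its limit parameter are made up for illustration, and a byte value of 0x80 or above is only a rough hint of corruption, not proof of it.

```python
def find_non_ascii(path, limit=10):
    """Return up to `limit` tuples (line_number, byte_offset_in_line, byte_value)
    for every byte >= 0x80 found in the file at `path`.

    Hypothetical helper for illustration; reads the file in binary mode so
    that invalid encodings cannot raise a decode error.
    """
    hits = []
    with open(path, "rb") as fh:
        # Iterating a binary file yields one bytes object per line.
        for lineno, line in enumerate(fh, start=1):
            # Iterating a bytes object yields integers in the range 0..255.
            for offset, byte in enumerate(line):
                if byte >= 0x80:
                    hits.append((lineno, offset, byte))
                    if len(hits) >= limit:
                        return hits
    return hits
```

Calling find_non_ascii("/var/lib/condor/spool/job_queue.log") as a user who can read the spool directory returns an empty list when the log is pure ASCII, which would match the result of the grep mentioned above.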


Cheers,
  Thomas


[1]
root@grid-htcondorce1: [~] ls -all /var/lib/condor/spool/job_queue.log*
-rw------- 1 condor condor 15030781 Mar 8 16:33 /var/lib/condor/spool/job_queue.log
-rw------- 1 condor condor 33740967 Mar 8 13:57 /var/lib/condor/spool/job_queue.log.12

root@grid-htcondorce0: [/etc/condor/config.d] ls -hall /var/lib/condor/spool/job_queue.log*
-rw------- 1 condor condor 7,6M Mar 8 16:36 /var/lib/condor/spool/job_queue.log
-rw------- 1 condor condor 3,3M Mar 8 13:55 /var/lib/condor/spool/job_queue.log.5





On 05/03/2021 15.54, Brian Lin wrote:

Hi Thomas,
That's quite strange and certainly shouldn't happen! There should be a plain-text /var/lib/condor/spool/job_queue.log: does that file look corrupted at all?
As for the local SYSTEM_PERIODIC_REMOVE, even though it may not be the culprit here, you should move them to your config and append them to the CE's SYSTEM_PERIODIC_REMOVE to avoid similar issues. And if you're on a new enough version of HTCondor-CE, you should be able to remove a few of the clauses:
- Since at least HTCondor-CE 3, held CE jobs are removed after 24 hrs
- HTCondor-CE 4.0.0 <https://htcondor.github.io/htcondor-ce/releases/#disabled-job-retries-by-default> disables job retries by default
- HTCondor-CE 5.0.0 <https://htcondor.github.io/htcondor-ce/releases/#500> (available as a release candidate [1]) will remove jobs that exceed the configured value of "ROUTED_JOB_MAX_TIME"
Brian

[1] https://research.cs.wisc.edu/htcondor/repo/8.9/el7/rc/

On 3/5/21 5:06 AM, Thomas Hartmann wrote:
Hi again,
maybe related(??): I just noticed that a restart of the condor unit caused the Schedd to lose all its jobs [1]. Since the restart was more or less instantaneous, I would have expected the Schedd to pick up its jobs.
Cheers,
  Thomas

[1]
03/05/21 10:55:11 (pid:3997828) WARNING - Cluster 437906 was deleted with proc ads still attached to it. This should only happen during schedd shutdown.
[2]
Mar 05 10:55:11 grid-htcondorce0.desy.de systemd[1]: Stopping Condor Distributed High-Throughput-Computing...
-- Subject: Unit condor.service has begun shutting down
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
