Re: [HTCondor-users] increasing schedd memory usage [v8.6.0?]

Date: Tue, 14 Feb 2017 15:48:59 +0100
From: Thomas Hartmann <thomas.hartmann@xxxxxxx>
Subject: Re: [HTCondor-users] increasing schedd memory usage [v8.6.0?]

Hi all,

We have probably found the cause and fixed it (fingers crossed).

post mortem ~>
https://confluence.desy.de/pages/viewpage.action?pageId=47425023

Presumably during a 'transparent' maintenance on the ARC's underlying
supervisor, the Condor shadows etc. could not access the local job files.
This caused(?) Condor to see a large number of jobs as failed and to put
them on hold.
Apparently, Condor was then overwhelmed by the large number of held jobs
(160,000 jobs on hold, /var/lib/condor/spool.old.20170214/job_queue.log
already at ~1.4GB). Simply removing the held jobs with condor_rm failed,
presumably as a consequence(?), so we moved the spool dir away and gave
Condor a fresh start.

Since then, the node has been running fine again.
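
To notice such a build-up earlier, something like the following could be
run from cron. This is only a rough sketch and not what we actually did:
the thresholds and the function names are made up for illustration, the
log path assumes the default spool location, and the condor_q options
(-allusers, -constraint, -af) should be checked against the installed
version.

```python
import os
import subprocess

# Hypothetical thresholds, chosen for illustration only.
MAX_LOG_BYTES = 1024**3      # warn once job_queue.log exceeds 1 GiB
MAX_HELD_JOBS = 50_000       # warn once this many jobs are on hold

# Assumed default location of the schedd transaction log.
JOB_QUEUE_LOG = "/var/lib/condor/spool/job_queue.log"


def log_size_bytes(path):
    """Return the size of the file at path, or 0 if it cannot be read."""
    try:
        return os.path.getsize(path)
    except OSError:
        return 0


def count_held_jobs():
    """Count held jobs (JobStatus == 5) by printing one line per job."""
    result = subprocess.run(
        ["condor_q", "-allusers", "-constraint", "JobStatus == 5",
         "-af", "ClusterId"],
        capture_output=True, text=True, check=True,
    )
    return sum(1 for line in result.stdout.splitlines() if line.strip())


def check_schedd(log_bytes, held_jobs,
                 max_log_bytes=MAX_LOG_BYTES, max_held_jobs=MAX_HELD_JOBS):
    """Return a list of warning strings; an empty list means all is fine."""
    problems = []
    if log_bytes > max_log_bytes:
        problems.append(
            "job_queue.log is %.2f GiB" % (log_bytes / 1024**3))
    if held_jobs > max_held_jobs:
        problems.append("%d jobs are on hold" % held_jobs)
    return problems


if __name__ == "__main__":
    for problem in check_schedd(log_size_bytes(JOB_QUEUE_LOG),
                                count_held_jobs()):
        print("WARNING:", problem)
```

With the numbers from our incident (~1.4GB log, 160,000 held jobs) both
checks would have triggered under these example thresholds.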

Cheers,
  Thomas
