Mailing List Archives Authenticated access	UW Madison Computer Sciences Department Computer Systems Lab


Re: [HTCondor-users] Huge pile of jobs in "C" state

Date: Mon, 19 Jan 2015 09:14:23 +0100
From: Steffen Grunewald <Steffen.Grunewald@xxxxxxxxxx>
Subject: Re: [HTCondor-users] Huge pile of jobs in "C" state

On Fri, Jan 16, 2015 at 01:59:22PM -0600, Todd Tannenbaum wrote:
> On 1/9/2015 9:08 AM, Steffen Grunewald wrote:
> > Ben, others:
> > 
> > On Fri, Jan 09, 2015 at 09:37:28AM -0500, Ben Cotton wrote:
> >> There are two likely candidates that come to mind. Either the jobs
> >> have LeaveJobInQueue set, or they were submitted via a SOAP call and
> >> they're still waiting for FilesRetrieved to be true (even if there's
> >> not actually anything to do).
> >>
> >> condor_q -const 'JobStatus==4' -af ClusterId ProcId LeaveJobInQueue FilesRetrieved
> > 
> > # condor_q -const 'JobStatus==4' -af LeaveJobInQueue FilesRetrieved | uniq -c
> >     4705 false undefined
> > 
> > I suppose your guess, though nice, was wrong.
> > 
> 
> Strange. The more info you can give us, the more we can help....
> 
> Are these jobs remotely submitted (i.e. not submitted from the same machine where the schedd runs) ?

No, schedd machine == submit machine

> Has this user ever had the problem in the past, or did it just suddenly start happening in the past day or so?

It happens every now and then, and seems most visible when the jobs are
short-running DAG nodes.
The impact is smaller when the number of DAG jobs allowed to be
submitted/idle is reduced via config.

> From your submit machine, can you provide
>   condor_version
>   condor_config_val JOB_IS_FINISHED_INTERVAL JOB_STOP_DELAY JOB_STOP_COUNT
>   condor_q -const 'JobStatus==3 || JobStatus==4' -af IwdFlushNFSCache | sort | uniq -c
>   condor_q -const 'JobStatus==3 || JobStatus==4' -af StageInStart | sort | uniq -c
>   condor_q -const 'JobStatus==3 || JobStatus==4' -af JobRequiresSandbox | sort | uniq -c
> and finally assuming your schedd is running on machine "my.submit.edu" please run
>   condor_status -schedd my.submit.edu -l | grep Recent

That'll have to wait for the issue to show up again, right?

Thanks,
 Steffen
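
Since the requested diagnostics are only meaningful while the jobs are stuck, it may help to keep them bundled in a small script that can be run the moment the problem reappears. The following is a minimal sketch, not something posted in the thread: the function name collect_condor_diag, the output file, and the schedd name passed in are placeholders, and the commands inside are simply the ones Todd listed above.

```shell
#!/bin/sh
# Sketch only: gather the diagnostics requested in this thread into one file.
# collect_condor_diag, its arguments, and the output path are hypothetical;
# the condor_* invocations are copied from the quoted request.

collect_condor_diag() {
    schedd="$1"   # schedd host, e.g. my.submit.edu (placeholder)
    out="$2"      # file that receives the combined output
    # JobStatus 3 = Removed, 4 = Completed
    removed_or_completed='JobStatus==3 || JobStatus==4'
    {
        echo "== condor_version"
        condor_version
        echo "== condor_config_val"
        condor_config_val JOB_IS_FINISHED_INTERVAL JOB_STOP_DELAY JOB_STOP_COUNT
        for attr in IwdFlushNFSCache StageInStart JobRequiresSandbox; do
            echo "== $attr"
            # sort first so that uniq -c counts all equal values together
            condor_q -const "$removed_or_completed" -af "$attr" | sort | uniq -c
        done
        echo "== schedd Recent attributes"
        condor_status -schedd "$schedd" -l | grep Recent
    } > "$out" 2>&1
}
```

It could be invoked on the submit machine as, for example, `collect_condor_diag my.submit.edu /tmp/condor_diag.txt`, and the resulting file attached to a reply.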
