Re: [HTCondor-users] Huge pile of jobs in "C" state



On Fri, Jan 16, 2015 at 01:59:22PM -0600, Todd Tannenbaum wrote:
> On 1/9/2015 9:08 AM, Steffen Grunewald wrote:
> > Ben, others:
> > 
> > On Fri, Jan 09, 2015 at 09:37:28AM -0500, Ben Cotton wrote:
> >> There are two likely candidates that come to mind. Either the jobs
> >> have LeaveJobInQueue set, or they were submitted via a SOAP call and
> >> they're still waiting for FilesRetrieved to be true (even if there's
> >> not actually anything to do).
> >>
> >> condor_q -const 'JobStatus==4' -af ClusterId ProcId LeaveJobInQueue FilesRetrieved
> > 
> > # condor_q -const 'JobStatus==4' -af LeaveJobInQueue FilesRetrieved | uniq -c
> >     4705 false undefined
> > 
> > I suppose your guess, though nice, was wrong.
> > 
> 
> Strange. The more info you can give us, the more we can help....
> 
> Are these jobs remotely submitted (i.e. not submitted from the same machine where the schedd runs) ?

No, schedd machine == submit machine

> Has this user ever had the problem in the past, or did it just suddenly start happening in the past day or so?

It happens every now and then, and seems most visible when the jobs are
short-running DAG nodes. The impact is smaller when the number of DAG jobs
allowed to be submitted/idle is reduced via config.

> From your submit machine, can you provide
>   condor_version
>   condor_config_val JOB_IS_FINISHED_INTERVAL JOB_STOP_DELAY JOB_STOP_COUNT
>   condor_q -const 'JobStatus==3 || JobStatus==4' -af IwdFlushNFSCache | sort | uniq -c
>   condor_q -const 'JobStatus==3 || JobStatus==4' -af StageInStart | sort | uniq -c
>   condor_q -const 'JobStatus==3 || JobStatus==4' -af JobRequiresSandbox | sort | uniq -c
> and finally assuming your schedd is running on machine "my.submit.edu" please run
>   condor_status -schedd my.submit.edu -l | grep Recent

That'll have to wait for the issue to show up again, right?
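
Since the output is only useful while the jobs are actually stuck, the
requested commands could be wrapped in a small script and run the moment the
problem reappears. The following is only a rough sketch: the function names
and the SCHEDD_HOST variable are made up for illustration, and my.submit.edu
is the placeholder host name from the request above, not a real machine.

```shell
#!/bin/sh
# Sketch: gather the diagnostics requested in this thread in one go.
# SCHEDD_HOST, count_attr and collect_diagnostics are hypothetical names.

SCHEDD_HOST="${SCHEDD_HOST:-my.submit.edu}"   # placeholder host name
CONSTRAINT='JobStatus==3 || JobStatus==4'     # removed or completed jobs

# Count the distinct values of one job attribute among the matching jobs.
count_attr() {
    condor_q -const "$CONSTRAINT" -af "$1" | sort | uniq -c
}

# Print every requested piece of information, with a header per section.
collect_diagnostics() {
    echo "== condor_version"
    condor_version
    echo "== config"
    condor_config_val JOB_IS_FINISHED_INTERVAL JOB_STOP_DELAY JOB_STOP_COUNT
    for attr in IwdFlushNFSCache StageInStart JobRequiresSandbox; do
        echo "== $attr"
        count_attr "$attr"
    done
    echo "== schedd Recent statistics"
    condor_status -schedd "$SCHEDD_HOST" -l | grep Recent
}
```

After sourcing the file on the submit machine, something like
`collect_diagnostics > condor-diag.txt 2>&1` would save everything to a single
file that can be attached to a reply (the file name is arbitrary).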

Thanks,
 Steffen