
Re: [HTCondor-users] Huge pile of jobs in "C" state

On 1/9/2015 9:08 AM, Steffen Grunewald wrote:
> Ben, others:
> On Fri, Jan 09, 2015 at 09:37:28AM -0500, Ben Cotton wrote:
>> There are two likely candidates that come to mind. Either the jobs
>> have LeaveJobInQueue set, or they were submitted via a SOAP call and
>> they're still waiting for FilesRetrieved to be true (even if there's
>> not actually anything to do).
>> condor_q -const 'JobStatus==4' -af ClusterId ProcId LeaveJobInQueue
>> FilesRetrieved
> # condor_q -const 'JobStatus==4' -af LeaveJobInQueue FilesRetrieved | uniq -c
>     4705 false undefined
> I suppose your guess, though nice, was wrong.

Strange. The more info you can give us, the more we can help.

Are these jobs remotely submitted (i.e. not submitted from the same machine where the schedd runs)? Has this user ever had this problem before, or did it suddenly start happening in the past day or so?

From your submit machine, can you provide the output of the following commands:

  condor_q -const 'JobStatus==3 || JobStatus==4' -af IwdFlushNFSCache | sort | uniq -c

  condor_q -const 'JobStatus==3 || JobStatus==4' -af StageInStart | sort | uniq -c

  condor_q -const 'JobStatus==3 || JobStatus==4' -af JobRequiresSandbox | sort | uniq -c

and finally, assuming your schedd is running on machine "my.submit.edu", please run

  condor_status -schedd my.submit.edu -l | grep Recent
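
If it is easier, the three condor_q tallies above can be collected in one pass. This is only a rough sketch: the function names tally_attr and tally_all are made up for illustration, and it assumes condor_q is on your PATH on the submit machine.

```shell
# Hypothetical helper (not part of HTCondor): tally the values of one job
# attribute across removed (JobStatus 3) and completed (JobStatus 4) jobs.
tally_attr() {
    condor_q -const 'JobStatus==3 || JobStatus==4' -af "$1" | sort | uniq -c
}

# Run the same tally for each of the three attributes asked about above,
# printing a label before each block so the outputs can be told apart.
tally_all() {
    for attr in IwdFlushNFSCache StageInStart JobRequiresSandbox; do
        echo "== $attr =="
        tally_attr "$attr"
    done
}
```

After pasting the two functions into a shell on the submit machine, running tally_all prints all three tallies. The sort before uniq -c matters: uniq only merges adjacent identical lines, so unsorted output can show the same value more than once.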
