Re: [HTCondor-users] Huge pile of jobs in "C" state



On 1/9/2015 9:08 AM, Steffen Grunewald wrote:
> Ben, others:
> 
> On Fri, Jan 09, 2015 at 09:37:28AM -0500, Ben Cotton wrote:
>> There are two likely candidates that come to mind. Either the jobs
>> have LeaveJobInQueue set, or they were submitted via a SOAP call and
>> they're still waiting for FilesRetrieved to be true (even if there's
>> not actually anything to do).
>>
>> condor_q -const 'JobStatus==4' -af ClusterId ProcId LeaveJobInQueue
>> FilesRetrieved
> 
> # condor_q -const 'JobStatus==4' -af LeaveJobInQueue FilesRetrieved | uniq -c
>     4705 false undefined
> 
> I suppose your guess, though nice, was wrong.
> 

Strange. The more info you can give us, the more we can help.

Are these jobs remotely submitted (i.e., not submitted from the same machine where the schedd runs)? Has this user ever had the problem before, or did it just suddenly start happening in the past day or so?

From your submit machine, can you provide the output of:

  condor_version

  condor_config_val JOB_IS_FINISHED_INTERVAL JOB_STOP_DELAY JOB_STOP_COUNT

  condor_q -const 'JobStatus==3 || JobStatus==4' -af IwdFlushNFSCache | sort | uniq -c

  condor_q -const 'JobStatus==3 || JobStatus==4' -af StageInStart | sort | uniq -c

  condor_q -const 'JobStatus==3 || JobStatus==4' -af JobRequiresSandbox | sort | uniq -c
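
For reference, JobStatus 3 is Removed and 4 is Completed, and each of the three condor_q commands above prints one attribute value per matching job and then tallies the distinct values. The sort matters because uniq only merges adjacent identical lines. A minimal sketch of that tallying step, using made-up sample values in place of real condor_q output (the function names sample_values and tally are just for illustration):

```shell
# Hypothetical sample standing in for the per-job lines that
# `condor_q -const '...' -af IwdFlushNFSCache` would print.
sample_values() {
  printf '%s\n' true undefined true true undefined
}

# Count how many times each distinct value occurs.
# sort groups identical lines together so uniq -c can count them;
# awk strips the platform-dependent padding that uniq -c adds.
tally() {
  sort | uniq -c | awk '{print $1, $2}'
}

sample_values | tally
# prints:
# 3 true
# 2 undefined
```

Without the sort, the same input would come out as four separate runs instead of two totals.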

Finally, assuming your schedd is running on machine "my.submit.edu", please run

  condor_status -schedd my.submit.edu -l | grep Recent
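
If it is easier, the requests above can be collected in one pass so that each block of output is labelled with the command that produced it. This is only a sketch: run_labelled is a hypothetical helper, "my.submit.edu" is the placeholder from above, and the HTCondor commands are skipped if the tools are not on the PATH.

```shell
# Print the command line as a header, run it, and follow it with a blank line,
# so the pasted output shows which command produced which lines.
run_labelled() {
  printf '== %s\n' "$*"
  "$@" 2>&1
  printf '\n'
}

# Only attempt the HTCondor commands when they are actually installed.
if command -v condor_version >/dev/null 2>&1; then
  run_labelled condor_version
  run_labelled condor_config_val JOB_IS_FINISHED_INTERVAL JOB_STOP_DELAY JOB_STOP_COUNT
  # Pipelines need a shell of their own; replace my.submit.edu with the real host.
  run_labelled sh -c 'condor_status -schedd my.submit.edu -l | grep Recent'
fi
```

Redirecting the whole thing to a file gives a single attachment for the reply.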


Thanks,
Todd