Re: [HTCondor-users] Huge pile of jobs in "C" state
- Date: Fri, 16 Jan 2015 13:59:22 -0600
- From: Todd Tannenbaum <tannenba@xxxxxxxxxxx>
- Subject: Re: [HTCondor-users] Huge pile of jobs in "C" state
On 1/9/2015 9:08 AM, Steffen Grunewald wrote:
> Ben, others:
>
> On Fri, Jan 09, 2015 at 09:37:28AM -0500, Ben Cotton wrote:
>> There are two likely candidates that come to mind. Either the jobs
>> have LeaveJobInQueue set, or they were submitted via a SOAP call and
>> they're still waiting for FilesRetrieved to be true (even if there's
>> not actually anything to do).
>>
>> condor_q -const 'JobStatus==4' -af ClusterId ProcId LeaveJobInQueue FilesRetrieved
>
> # condor_q -const 'JobStatus==4' -af LeaveJobInQueue FilesRetrieved | uniq -c
> 4705 false undefined
>
> I suppose your guess, though nice, was wrong.
>
Strange. The more info you can give us, the more we can help.
Are these jobs remotely submitted (i.e. not submitted from the same machine where the schedd runs)? Has this user ever had the problem in the past, or did it just suddenly start happening in the past day or so?
From your submit machine, can you provide the output of
condor_version
condor_config_val JOB_IS_FINISHED_INTERVAL JOB_STOP_DELAY JOB_STOP_COUNT
condor_q -const 'JobStatus==3 || JobStatus==4' -af IwdFlushNFSCache | sort | uniq -c
condor_q -const 'JobStatus==3 || JobStatus==4' -af StageInStart | sort | uniq -c
condor_q -const 'JobStatus==3 || JobStatus==4' -af JobRequiresSandbox | sort | uniq -c
and finally, assuming your schedd is running on machine "my.submit.edu", please run
condor_status -schedd my.submit.edu -l | grep Recent
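
If it is easier, the commands above could be bundled into one script so the output arrives as a single report. This is only a minimal sketch, assuming a POSIX shell: `SCHEDD_HOST`, `tally`, and `collect` are names made up for illustration, and the `==` section headings are there purely for readability. The constraint matches jobs in the removed (3) or completed (4) state.

```shell
#!/bin/sh
# Sketch: gather the requested diagnostics into one report.
# SCHEDD_HOST, tally and collect are illustrative names, not HTCondor features.

SCHEDD_HOST="${SCHEDD_HOST:-my.submit.edu}"   # placeholder host; set to your schedd
CONSTRAINT='JobStatus==3 || JobStatus==4'     # removed or completed jobs

# Count occurrences of each distinct input line. The sort comes first
# because uniq only merges identical lines that are adjacent.
tally() {
    sort | uniq -c
}

# Run each requested command in turn, with a heading before its output.
collect() {
    echo "== condor_version =="
    condor_version
    echo "== config =="
    condor_config_val JOB_IS_FINISHED_INTERVAL JOB_STOP_DELAY JOB_STOP_COUNT
    for attr in IwdFlushNFSCache StageInStart JobRequiresSandbox; do
        echo "== $attr =="
        condor_q -const "$CONSTRAINT" -af "$attr" | tally
    done
    echo "== schedd Recent attributes =="
    condor_status -schedd "$SCHEDD_HOST" -l | grep Recent
}

# Only attempt the collection when the HTCondor tools are on the PATH.
if command -v condor_q >/dev/null 2>&1; then
    collect
else
    echo "HTCondor tools not found in PATH; nothing collected" >&2
fi
```

Saved under a hypothetical name such as `collect_diag.sh`, it could be run as `SCHEDD_HOST=<your schedd> sh collect_diag.sh > report.txt` and the resulting file sent to the list.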
Thanks,
Todd