
Re: [HTCondor-users] Huge pile of jobs in "C" state



Ben, others:

On Fri, Jan 09, 2015 at 09:37:28AM -0500, Ben Cotton wrote:
> There are two likely candidates that come to mind. Either the jobs
> have LeaveJobInQueue set, or they were submitted via a SOAP call and
> they're still waiting for FilesRetrieved to be true (even if there's
> not actually anything to do).
> 
> condor_q -const 'JobStatus==4' -af ClusterId ProcId LeaveJobInQueue
> FilesRetrieved

# condor_q -const 'JobStatus==4' -af LeaveJobInQueue FilesRetrieved | uniq -c
   4705 false undefined

So I'm afraid your guess, nice as it was, turns out to be wrong.
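
For the next round of digging, dumping the full ClassAd of one of the
completed jobs should show whatever attribute keeps it in the queue
(1234.0 is just a placeholder cluster.proc from the "C" pile):

# condor_q -l 1234.0 | egrep -i 'LeaveJobInQueue|JobStatus|CompletionDate|PeriodicRemove'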

In the meantime, the user had given up waiting for the jobs to vanish
and condor_rm'ed them, but they are still there, and their number is
decreasing only very slowly:

# condor_q -const 'JobStatus==3' -af LeaveJobInQueue FilesRetrieved | uniq -c
  20915 false undefined

So this is not only an issue with jobs that finished regularly, but
also with jobs that were (more or less) forcefully removed.
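
If all else fails, I could probably flush the removed ones ("X" state,
JobStatus==3) with a forced removal, though I'd much rather understand
why they linger in the first place:

# condor_rm -forcex -constraint 'JobStatus == 3'

(As far as I understand, -forcex drops such jobs from the queue
immediately, skipping the usual cleanup, so it's a last resort.)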

> it. If it's a SOAP submission, you may just set SOAP_LEAVE_IN_QUEUE to
> false (assuming that's appropriate).

Most of those jobs come from DAGMan (condor_dagman and
pegasus-dagman), and both the Vanilla and Local universes are affected.
But that might turn out to be a red herring in the end, as non-DAG
users have complained as well.

It might be worth noting (since you mention config changes) that the
slots in use are dynamic ones. That shouldn't matter, though: the job
runtime is below the already short lease lifetime, and since all jobs
have the same memory and CPU requirements, no defragmentation should
be taking place.
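
In case the slot setup does matter after all, this is how I'd count
the dynamic slots (SlotType being the usual machine ad attribute):

# condor_status -constraint 'SlotType == "Dynamic"' -total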

After lowering DAGMAN_MAX_JOBS_IDLE and DAGMAN_MAX_JOBS_SUBMITTED, I
do see the first pile shrinking, but only at the same rate as the new
one builds up (the sum stays almost constant at about 25000).
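
For reference, the knobs in question look like this in the condor
configuration; the values are merely what I'm experimenting with, not
a recommendation:

DAGMAN_MAX_JOBS_IDLE = 100
DAGMAN_MAX_JOBS_SUBMITTED = 1000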

Any other suggestions?

Thanks,
 Steffen