
Re: [HTCondor-users] Huge pile of jobs in "C" state



* On 09 Jan 2015, Steffen Grunewald wrote: 
> On my pool - which is working flawlessly otherwise - I can see
> a huge (>10000) number of jobs in C state.
> From what I can observe, those jobs had a rather short runtime -
> there are only 1000+ slots available, and the number is growing
> by hundreds every few minutes.
> 
> Apparently, some part of the job aftermath takes an unexpectedly
> long time - but which? The number of shadows is rather small, and
> the fileserver is behaving nicely (as iostat and ethstatus show).
> TCP updates are enabled.

Do these jobs leave the queue after some time has elapsed -- i.e. they
are just slow -- or do they remain indefinitely?
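One way to tell is to look at how long the completed jobs have been
sitting in "C" state. A minimal sketch -- the cluster/proc IDs and
epoch timestamps in the printf are hypothetical sample data; on a real
submit node you'd feed the awk from condor_q instead:

```shell
# JobStatus == 4 means Completed in HTCondor's job ClassAds.
# On the submit node you would run something like:
#   condor_q -constraint 'JobStatus == 4' -af ClusterId ProcId CompletionDate
# and pipe it into the awk below, which turns each CompletionDate
# epoch into "seconds ago". The printf stands in for that output here.
now=$(date +%s)
printf '%s\n' \
  '1234 0 1420790000' \
  '1234 1 1420790300' |
awk -v now="$now" '{ printf "%s.%s finished %d s ago\n", $1, $2, now - $3 }'
```

If the "seconds ago" figures keep growing for the same jobs, they're
stuck rather than slow.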

I understood from a side conversation that you're using NFS -- is that
right?  I could be completely off target here, and I'm fairly new to
Condor, but a few questions come to mind about how the condor_shadow
is delivering results:

* how many output files are involved?

* are they all in the same directory, or how are they broken down (how
  many files per directory)?

* what filesystem is backing the NFS server?  Some filesystems have
  different performance properties at various numbers of files.

* on the submit node (as NFS client), what does nfsstat -vl say now vs
  after some minutes -- are there any nfsv3 or v4 calls whose deltas are
  very high?
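For the second question, a quick per-directory file count under the
output tree will show whether any one directory has grown large enough
to hurt lookup performance. OUTDIR is a placeholder for wherever your
jobs actually write:

```shell
# Count files per directory to spot oversized directories; many
# filesystems slow down noticeably on directories with huge file counts.
# OUTDIR is a stand-in path -- point it at the jobs' real output tree.
OUTDIR=${OUTDIR:-.}
find "$OUTDIR" -type f |
sed 's|/[^/]*$||' |          # strip the filename, keep the directory
sort | uniq -c | sort -rn | head
```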

For the last question you might try running nfsstat -Z60, which will
give you nfs call counts at 60-second intervals, and you can look for a
spike.
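If your nfsstat doesn't support -Z, you can diff two snapshots
yourself. A sketch with made-up per-op counters standing in for real
nfsstat output (in practice you'd save two real snapshots a minute
apart, e.g. from nfsstat -c -l):

```shell
# Compare two snapshots of per-operation NFS call counters and print
# how much each op grew; the op whose delta explodes is the suspect.
# The heredocs below are hypothetical sample snapshots.
t1=$(mktemp); t2=$(mktemp)
cat > "$t1" <<'EOF'
getattr 1000
lookup 400
write 90000
EOF
cat > "$t2" <<'EOF'
getattr 1200
lookup 450
write 250000
EOF
# First pass stores the old counters; second pass prints the deltas.
awk 'NR==FNR { first[$1] = $2; next }
     { printf "%-8s +%d\n", $1, $2 - first[$1] }' "$t1" "$t2"
rm -f "$t1" "$t2"
```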

-- 
       David Champion • dgc@xxxxxxxxxxxx • University of Chicago