Re: [HTCondor-users] Huge pile of jobs in "C" state
- Date: Sat, 17 Jan 2015 12:41:52 -0600
- From: David Champion <dgc@xxxxxxxxxxxx>
- Subject: Re: [HTCondor-users] Huge pile of jobs in "C" state
* On 09 Jan 2015, Steffen Grunewald wrote:
> On my pool - which is working flawlessly otherwise - I can see
> a huge (>10000) number of jobs in C state.
> From what I can observe, those jobs had a rather short runtime -
> there are only 1000+ slots available, and the number is growing
> by hundreds every few minutes.
>
> Apparently, some part of the job aftermath takes an unexpectedly
> long time - but which? The number of shadows is rather small, and
> the fileserver is behaving nicely (as iostat and ethstatus show).
> TCP updates are enabled.
Do these jobs leave the queue after some time has elapsed -- that is,
they are just slow -- or do they remain indefinitely?
I understood from a side conversation that you're using NFS -- is that
right? I could be completely off target here, and I'm pretty new to
condor, but a few questions come to mind wrt how the condor_shadow is
delivering results:
* how many output files are involved?
* are they all in the same directory, or how are they broken down (how
many files per directory)?
* what filesystem is backing the NFS server? Some filesystems have
different performance properties at various numbers of files.
* on the submit node (as NFS client), what does nfsstat -vl say now vs
after some minutes -- are there any nfsv3 or v4 calls whose deltas are
very high?
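For the files-per-directory question, something like the sketch below (plain
Python, run on the submit node against the job output tree; the function name
is my own) can tally regular files under each directory:

```python
import os
from collections import Counter

def files_per_directory(root):
    """Map each directory under `root` to its count of regular files."""
    counts = Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        counts[dirpath] = len(filenames)
    return counts
```

Directories holding many thousands of entries are a common pain point on
some NFS-backed filesystems, so any directory with an outsized count is
worth a closer look.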
For the last question you might try running nfsstat -Z60, which will
give you nfs call counts at 60-second intervals, and you can look for a
spike.
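To spot that spike without eyeballing raw output, one could also diff two
snapshots programmatically. A rough sketch -- the parsing assumes nfsstat's
usual layout of a line of call names followed by a line of counts and
percentages, which varies somewhat across versions, so treat it as a starting
point:

```python
def parse_counts(text):
    """Parse alternating name-line / count-line pairs into {call: count}."""
    counts = {}
    lines = [l for l in text.splitlines() if l.strip()]
    for names_line, nums_line in zip(lines[::2], lines[1::2]):
        names = names_line.split()
        # counts appear as "123 4%" pairs; keep only the bare numeric fields
        nums = [int(tok) for tok in nums_line.split() if tok.isdigit()]
        counts.update(zip(names, nums))
    return counts

def high_deltas(before, after, threshold=1000):
    """Return calls whose count grew by more than `threshold` between samples."""
    return {call: after[call] - before[call]
            for call in after
            if call in before and after[call] - before[call] > threshold}
```

Capture nfsstat output twice a few minutes apart, feed each capture to
parse_counts, and high_deltas should point at the calls (a GETATTR or COMMIT
storm, say) worth investigating.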
--
David Champion · dgc@xxxxxxxxxxxx · University of Chicago