Re: [HTCondor-users] Huge pile of jobs in "C" state
- Date: Wed, 21 Jan 2015 16:13:36 +0100
- From: Steffen Grunewald <Steffen.Grunewald@xxxxxxxxxx>
- Subject: Re: [HTCondor-users] Huge pile of jobs in "C" state
As there's a number of C jobs again:
On Tue, Jan 20, 2015 at 11:14:52AM -0600, Todd Tannenbaum wrote:
> On 1/19/2015 2:14 AM, Steffen Grunewald wrote:
> >> From your submit machine, can you provide
> >> condor_version
Still 8.2.6 (Wheezy)
> >> condor_config_val JOB_IS_FINISHED_INTERVAL JOB_STOP_DELAY JOB_STOP_COUNT
1, unknown, unknown
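For context, a hedged sketch of what that first knob does (semantics paraphrased from memory of the HTCondor manual, so treat them as an assumption): JOB_IS_FINISHED_INTERVAL throttles how quickly the schedd processes jobs that have finished, which with bursts of very short jobs can let completed ("C") jobs pile up briefly:

```
# condor_config fragment (illustrative; value matches the pool above).
# Assumed semantics: minimum number of seconds the schedd waits between
# handling successive finished jobs before they leave the queue.
JOB_IS_FINISHED_INTERVAL = 1
```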
> >> condor_q -const 'JobStatus==3 || JobStatus==4' -af IwdFlushNFSCache | sort | uniq -c
# condor_q -const 'JobStatus==3 || JobStatus==4' -af IwdFlushNFSCache | sort | uniq -c
79 undefined
> >> condor_q -const 'JobStatus==3 || JobStatus==4' -af StageInStart | sort | uniq -c
# condor_q -const 'JobStatus==3 || JobStatus==4' -af StageInStart | sort | uniq -c
93 undefined
> >> condor_q -const 'JobStatus==3 || JobStatus==4' -af JobRequiresSandbox | sort | uniq -c
# condor_q -const 'JobStatus==3 || JobStatus==4' -af JobRequiresSandbox | sort | uniq -c
153 undefined
(Obviously the pool has started to "breathe" again, and since I was running
the commands one by one, another pile started to form in between.)
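All three queries above use the same counting idiom: print one ClassAd attribute value per matching job, then tally the distinct values with `sort | uniq -c`. A self-contained sketch of just the tallying step, with made-up values standing in for the `condor_q ... -af Attr` output:

```shell
# Simulated one-value-per-job output (stand-in for `condor_q ... -af Attr`);
# sort groups identical values, uniq -c prefixes each with its count.
printf 'undefined\nundefined\n4\nundefined\n' | sort | uniq -c
```

This prints one line per distinct value, e.g. "3 undefined" and "1 4", which is exactly the shape of the counts pasted above.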
> >>and finally assuming your schedd is running on machine "my.submit.edu" please run
> >> condor_status -schedd my.submit.edu -l | grep Recent
RecentJobsShouldHold = 0
RecentJobsSubmitted = 773
RecentJobsAccumBadputTime = 0
RecentJobsWierdTimestamps = 0
RecentDaemonCoreDutyCycle = 0.1361312669533615
RecentJobsBadputSizes = "0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0"
RecentJobsAccumChurnTime = 0
RecentShadowsStarted = 767
RecentJobsCompletedRuntimes = "0, 0, 0, 8, 850, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0"
RecentAutoclusters = 0
RecentJobsExitedNormally = 858
RecentJobsShouldRemove = 0
RecentJobsStarted = 772
RecentJobsKilled = 0
RecentJobsShouldRequeue = 0
RecentJobsExited = 859
RecentJobsAccumExecuteAltTime = 838609
RecentStatsLifetime = 1200
RecentJobsCompletedSizes = "0, 0, 0, 0, 0, 0, 0, 3, 855, 0, 0, 0, 0"
RecentJobsAccumRunningTime = 840448
RecentJobsAccumPreExecuteTime = 1839
RecentJobsDebugLogError = 0
RecentJobsExitException = 0
RecentJobsCompleted = 858
RecentJobsBadputRuntimes = "0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0"
RecentJobsShadowNoMemory = 0
RecentJobsCheckpointed = 0
RecentJobsAccumPostExecuteTime = 0
RecentJobsAccumTimeToStart = 9470
RecentJobsAccumExecuteTime = 838609
RecentJobsExecFailed = 0
RecentJobsExitedAndClaimClosing = 850
RecentJobsMissedDeferralTime = 0
RecentJobsCoredumped = 0
RecentJobsNotStarted = 1
There are now 176 jobs in C state, and 937 jobs with the same name (so
likely the same type?) running (R).
Watching the job numbers, there seems to be a large proportion of
extremely short-running jobs. But as long as the total of R+C doesn't
grow any further (it's falling now), I'm not worried anymore.
There must be some unavoidable overhead, and reducing the job-count
limits seems to have helped:
-DAGMAN_MAX_JOBS_IDLE = 250
-DAGMAN_MAX_JOBS_SUBMITTED = 5000
+DAGMAN_MAX_JOBS_IDLE = 50
+DAGMAN_MAX_JOBS_SUBMITTED = 1000
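The same limits can also be applied per DAG at submit time instead of pool-wide; a sketch, assuming the usual condor_submit_dag flags (the dag filename is made up):

```
# Per-DAG throttling (sketch): -maxidle / -maxjobs mirror the
# DAGMAN_MAX_JOBS_IDLE / DAGMAN_MAX_JOBS_SUBMITTED knobs above.
# "workflow.dag" is a hypothetical filename.
condor_submit_dag -maxidle 50 -maxjobs 1000 workflow.dag
```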
I may have to raise the numbers again to better utilize the ~2000
cores, but this isn't the only user, and not only DAG jobs are involved.
(Though he seems to be the only one hitting this issue, and only with
DAG jobs.)
Thanks,
- S