Re: [HTCondor-users] Huge pile of jobs in "C" state



As there are a number of C jobs again:

On Tue, Jan 20, 2015 at 11:14:52AM -0600, Todd Tannenbaum wrote:
> On 1/19/2015 2:14 AM, Steffen Grunewald wrote:
> >> From your submit machine, can you provide
> >>   condor_version

Still 8.2.6 (Wheezy)

> >>   condor_config_val JOB_IS_FINISHED_INTERVAL JOB_STOP_DELAY JOB_STOP_COUNT

1, unknown, unknown
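
(For the record, condor_config_val's -verbose flag shows where each value
was set, or that it isn't defined anywhere, which helps tell a deliberate
setting from a built-in default:

# condor_config_val -verbose JOB_IS_FINISHED_INTERVAL JOB_STOP_DELAY JOB_STOP_COUNT
)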

> >>   condor_q -const 'JobStatus==3 || JobStatus==4' -af IwdFlushNFSCache | sort | uniq -c

# condor_q -const 'JobStatus==3 || JobStatus==4' -af IwdFlushNFSCache | sort | uniq -c
     79 undefined

> >>   condor_q -const 'JobStatus==3 || JobStatus==4' -af StageInStart | sort | uniq -c

# condor_q -const 'JobStatus==3 || JobStatus==4' -af StageInStart | sort | uniq -c
     93 undefined

> >>   condor_q -const 'JobStatus==3 || JobStatus==4' -af JobRequiresSandbox | sort | uniq -c

# condor_q -const 'JobStatus==3 || JobStatus==4' -af JobRequiresSandbox | sort | uniq -c
    153 undefined
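
One more attribute that can keep completed jobs sitting in the queue is
LeaveJobInQueue (my own guess, not from Todd's list); it can be inspected
the same way:

# condor_q -const 'JobStatus==4' -af LeaveJobInQueue | sort | uniq -c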

(Obviously the pool has started to "breathe" again, and since I was running
the commands one by one, another pile formed in between, hence the
differing counts.)

> >>and finally assuming your schedd is running on machine "my.submit.edu" please run
> >>   condor_status -schedd my.submit.edu -l | grep Recent

RecentJobsShouldHold = 0
RecentJobsSubmitted = 773
RecentJobsAccumBadputTime = 0
RecentJobsWierdTimestamps = 0
RecentDaemonCoreDutyCycle = 0.1361312669533615
RecentJobsBadputSizes = "0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0"
RecentJobsAccumChurnTime = 0
RecentShadowsStarted = 767
RecentJobsCompletedRuntimes = "0, 0, 0, 8, 850, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0"
RecentAutoclusters = 0
RecentJobsExitedNormally = 858
RecentJobsShouldRemove = 0
RecentJobsStarted = 772
RecentJobsKilled = 0
RecentJobsShouldRequeue = 0
RecentJobsExited = 859
RecentJobsAccumExecuteAltTime = 838609
RecentStatsLifetime = 1200
RecentJobsCompletedSizes = "0, 0, 0, 0, 0, 0, 0, 3, 855, 0, 0, 0, 0"
RecentJobsAccumRunningTime = 840448
RecentJobsAccumPreExecuteTime = 1839
RecentJobsDebugLogError = 0
RecentJobsExitException = 0
RecentJobsCompleted = 858
RecentJobsBadputRuntimes = "0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0"
RecentJobsShadowNoMemory = 0
RecentJobsCheckpointed = 0
RecentJobsAccumPostExecuteTime = 0
RecentJobsAccumTimeToStart = 9470
RecentJobsAccumExecuteTime = 838609
RecentJobsExecFailed = 0
RecentJobsExitedAndClaimClosing = 850
RecentJobsMissedDeferralTime = 0
RecentJobsCoredumped = 0
RecentJobsNotStarted = 1

There are now 176 jobs in the C state, and 937 jobs of the same name (likely
the same type?) in the R state.
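
(A quick way to get such per-state counts is tallying JobStatus, where
2 = running and 4 = completed:

# condor_q -af JobStatus | sort | uniq -c
)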

Watching the job counts, a large fraction seem to be extremely
short-running jobs. But as long as the total of R+C doesn't grow
any further (it's falling now), I'm not worried anymore.
There must be some unavoidable per-job overhead, and reducing the
job limits seems to have helped:

-DAGMAN_MAX_JOBS_IDLE           = 250
-DAGMAN_MAX_JOBS_SUBMITTED      = 5000
+DAGMAN_MAX_JOBS_IDLE           = 50
+DAGMAN_MAX_JOBS_SUBMITTED      = 1000
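
(If needed, the same throttles can also be applied per DAG at submit time
rather than pool-wide; the file name here is just a placeholder:

# condor_submit_dag -maxidle 50 -maxjobs 1000 pipeline.dag
)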

I may have to raise the limits again to better utilize the ~2000
cores, but this user isn't the only one, and not only DAG jobs are involved.
(Though he seems to be the only one having this issue, with his DAG jobs.)

Thanks,
- S