
Re: [HTCondor-users] Schedd RAM usage exploding after condor_hold of 10k jobs

Date: Thu, 31 Mar 2016 14:02:09 -0500
From: Todd Tannenbaum <tannenba@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Schedd RAM usage exploding after condor_hold of 10k jobs

On 3/31/2016 7:55 AM, L Kreczko wrote:

Dear experts,

I am trying to understand the schedd behaviour I witnessed today.
After sending 10k (bad) jobs to hold status, the RAM usage of the
condor_schedd process exploded (see attached png).

The job_queue log is now 9.3GB and contains all ClassAds of the held
jobs (I assume this is what is causing the RAM usage).
This was not the case when the jobs were idle. Is this behaviour expected?
Can I do something to prevent this from happening?

Cheers,
Luke


Hi Luke,

What HTCondor version / operating system are you using?

Including version information in any incident report is always a good idea. :)

Also, did you submit these 10k jobs via 10,000 invocations of condor_submit, or via one invocation with "queue 10000"?

Just to be sure we have the correct facts: you submitted the 10k jobs, and memory usage of the schedd was fine (i.e. less than 5 GB according to your graph). Then schedd memory usage exploded to 15GB+ as soon as you did the condor_hold, and most (all?) of the jobs you put on hold were previously in the idle state.


Also, could you send the output of
  condor_schedd -v
and
  condor_config_val -dump QUEUE

As to whether there is something you can do to prevent this: once we have clarification on the above, we can investigate more (i.e. reproduce it here) and hopefully give better advice. Until then I cannot say precisely what is going on, so my naive initial advice in the meantime would be to run the latest release in whatever series you are using, and perhaps to hold jobs a chunk at a time, e.g. 500 at a time could be done like

  condor_hold -constraint 'ClusterId > 5000 && ClusterId <= 5500'
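
To hold a larger range in chunks, the command above could be wrapped in a small loop. What follows is only a rough sketch, assuming one job per cluster (as with separate condor_submit invocations) and contiguous cluster ids. The names chunk_constraints, hold_in_chunks and DRY_RUN, the example range and the 5-second pause are all made up for illustration and are not part of HTCondor. By default it only prints the commands it would run.

```shell
#!/bin/sh
# Print one ClassAd constraint per chunk of cluster ids in (first, last].
# Arguments: first (exclusive lower bound), last (inclusive upper bound),
# chunk size. All three are illustrative parameters, not HTCondor options.
chunk_constraints() {
    lo=$1
    last=$2
    step=$3
    while [ "$lo" -lt "$last" ]; do
        hi=$((lo + step))
        # Clamp the final chunk to the upper bound.
        if [ "$hi" -gt "$last" ]; then
            hi=$last
        fi
        echo "ClusterId > $lo && ClusterId <= $hi"
        lo=$hi
    done
}

# Dry run by default: print the commands instead of running them.
# Set DRY_RUN=0 to actually call condor_hold.
hold_in_chunks() {
    chunk_constraints "$1" "$2" "$3" | while IFS= read -r constraint; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "condor_hold -constraint '$constraint'"
        else
            condor_hold -constraint "$constraint"
            sleep 5   # arbitrary pause (an assumption) between chunks
        fi
    done
}

# Example range only; substitute your own cluster ids.
hold_in_chunks 5000 7000 500
```

If the jobs were instead submitted with a single "queue 10000", they would all share one ClusterId, so the chunks would presumably have to be cut on ProcId rather than ClusterId.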

Certainly HTCondor should be able to handle putting 10k jobs on hold in one go.

As to what I think is going on: when you do condor_hold (or whatever) on a large group of jobs all at once, either all the jobs will go on hold, or none of the jobs will go on hold (i.e. database-style transactional processing). The schedd will store 10k changes to a transaction log in RAM... I wouldn't expect this log to take many gigs of RAM, however! But one improvement we've had in mind for a while (mainly for speed) is, instead of having 10k transaction log entries, to have one transaction log action that effectively gives a constraint like "all jobs" or whatever you gave to condor_hold... A downside of implementing this is that it would not be forwards compatible, i.e. after upgrading to a new schedd with this feature, you may not be able to downgrade anymore (because the job_queue.log file may contain entries an old schedd would not understand).

Absolute worst case, you could shut down HTCondor and remove everything in the $(SPOOL) directory, effectively flushing all your jobs to the bit bucket. Then, before restarting, you could set the config knob SCHEDD_CLUSTER_INITIAL_VALUE to a number higher than your previous job id so that you don't repeat job id numbers, if you care about that. Of course it shouldn't have to come down to this extreme option, but I thought I'd mention it just in case everything is on fire and restarting HTCondor doesn't help.
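
For the job id part of that last resort, the config line could be generated along these lines. This is a sketch only: initial_value_line is a hypothetical helper, and both the last cluster id (15000) and the safety margin (1000) are placeholder numbers to replace with your own.

```shell
#!/bin/sh
# Print a config line that starts new cluster ids above the old ones.
# Arguments: highest cluster id used so far, and an arbitrary safety margin.
# Both values below are placeholders, not taken from any real pool.
initial_value_line() {
    echo "SCHEDD_CLUSTER_INITIAL_VALUE = $(( $1 + $2 ))"
}

initial_value_line 15000 1000
```

The printed line would then go into your local HTCondor configuration before the schedd is restarted.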


Thanks
Todd

