
Re: [HTCondor-users] ResidentSetSize



Thanks, Todd, I will look at this.  Also see below.

On 5/10/2016 4:01 PM, Todd Tannenbaum wrote:
On 5/10/2016 2:38 PM, Bob Ball wrote:
It would appear that the ResidentSetSize value does not immediately
appear in the Job ClassAd as a job starts, but only some short time
later.  OK, I can live with that.

However, what is harder to understand is that we have seen instances
where it will display a value of 27GB when a job starts!  I was trying
to use this on the schedd machine
    SYSTEM_PERIODIC_REMOVE = ResidentSetSize > 5000*RequestMemory
but the 27GB reading tripped that cut and the jobs were removed.  However,
once I removed this from the schedd, similar jobs finished fine, with
ClassAds reporting only about 2.7GB of final ResidentSetSize.
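
For reference, the units in the expression above can be checked with a short sketch. It assumes the documented job ClassAd units (ResidentSetSize in KiB, RequestMemory in MiB) and reads "GB" as GiB; the 4096 MiB request is a hypothetical value, since the thread does not state the jobs' RequestMemory, and the function names are made up for this illustration.

```python
KIB_PER_MIB = 1024
KIB_PER_GIB = 1024 * 1024


def remove_threshold_kib(request_memory_mib: int) -> int:
    """Right-hand side of the cut: 5000*RequestMemory, compared against a KiB value."""
    return 5000 * request_memory_mib


def threshold_multiple_of_request() -> float:
    """How many times the request the cut allows, once units are aligned."""
    return 5000 / KIB_PER_MIB


def would_remove(resident_set_size_kib: int, request_memory_mib: int) -> bool:
    """Evaluate ResidentSetSize > 5000*RequestMemory for known values."""
    return resident_set_size_kib > remove_threshold_kib(request_memory_mib)


if __name__ == "__main__":
    request_mib = 4096  # hypothetical RequestMemory, not taken from the thread
    print("cut is", threshold_multiple_of_request(), "x the request")
    for label, gib in (("27GB", 27.0), ("2.7GB", 2.7)):
        rss_kib = int(gib * KIB_PER_GIB)
        print(label, "removed:", would_remove(rss_kib, request_mib))
```

Under those assumptions the cut sits at a little under five times the request, so a 27GB reading exceeds it while a 2.7GB reading does not.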

Has anyone seen this kind of thing before?  How could this be real, and
why does the ResidentSetSize not appear in the job ClassAd at job start?

We are running HTCondor 8.4.6 with cgroups enabled.


Hi Bob,

Couple quick thoughts:

1. We've seen Linux cgroups report large memory sizes for jobs because they include memory used by the kernel to buffer file system writes. If you have the condor_config knob ENABLE_KERNEL_TUNING set to True (the default), HTCondor should prevent this problem starting with v8.4.5. But if you disabled it for whatever reason, that could be the culprit. For more info see
 https://htcondor-wiki.cs.wisc.edu/index.cgi/tktview?tn=5500
ENABLE_KERNEL_TUNING is enabled and active here.


2. Suggest using MemoryUsage for all policy expressions instead of ResidentSetSize.
MemoryUsage = ( ( ResidentSetSize + 1023 ) / 1024 )

How will this help in this expression, other than just rounding 27GB a bit? What happens with this expression at first, when ResidentSetSize is not yet available?
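
A minimal sketch of what the MemoryUsage expression computes, and of one way a policy could treat a ResidentSetSize that has not been reported yet. It assumes ResidentSetSize is in KiB and RequestMemory in MiB. Here None stands in for an undefined attribute; returning False in that case is a choice made for this illustration, not a statement of how HTCondor evaluates an undefined policy expression. The factor of 5, the 4096 MiB request, and the function names are hypothetical.

```python
from typing import Optional


def memory_usage_mib(resident_set_size_kib: Optional[int]) -> Optional[int]:
    """Mirror MemoryUsage = ((ResidentSetSize + 1023) / 1024) using integer division.

    Adding 1023 before dividing rounds up to the next whole MiB.
    """
    if resident_set_size_kib is None:
        return None  # attribute not reported yet
    return (resident_set_size_kib + 1023) // 1024


def exceeds_limit(resident_set_size_kib: Optional[int],
                  request_memory_mib: int,
                  factor: int = 5) -> bool:
    """Compare MemoryUsage (MiB) against factor * RequestMemory (MiB)."""
    usage = memory_usage_mib(resident_set_size_kib)
    if usage is None:
        # Illustration-only choice: take no action until a value exists.
        return False
    return usage > factor * request_memory_mib


if __name__ == "__main__":
    request_mib = 4096  # hypothetical RequestMemory
    print(memory_usage_mib(27 * 1024 * 1024))             # 27 GiB expressed in MiB
    print(exceeds_limit(27 * 1024 * 1024, request_mib))   # spurious startup reading
    print(exceeds_limit(None, request_mib))                # before the first update
```

As the sketch suggests, the expression only converts KiB to MiB, rounded up, so both sides of a comparison with RequestMemory share units. A spurious 27GB reading would still trip a cut written in terms of MemoryUsage.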


3. As to why ResidentSetSize does not appear in the job ClassAd at job start: immediately at job start (i.e. right after the job is exec'ed), many jobs haven't allocated much memory yet, because they are still initializing. So now the question is how long to wait... 1 second? 1 minute? By default HTCondor waits 8 seconds after job startup before updating ResidentSetSize and other attributes. There is a condor_starter config knob for this named STARTER_INITIAL_UPDATE_INTERVAL.
As noted, I can live with a delay.


4. You may be interested in the HOWTO at
 https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToLimitMemoryUsage
I will check this out. I _think_ I've seen it before, but will have to take a second look.


Hope the above helps,
Todd
Any information is helpful.

Thanks,
bob


