Re: [HTCondor-users] ResidentSetSize
- Date: Tue, 10 May 2016 16:24:13 -0400
- From: Bob Ball <ball@xxxxxxxxx>
- Subject: Re: [HTCondor-users] ResidentSetSize
Thanks, Todd, I will look at this. Also see below.
On 5/10/2016 4:01 PM, Todd Tannenbaum wrote:
On 5/10/2016 2:38 PM, Bob Ball wrote:
It would appear that the ResidentSetSize value does not immediately
appear in the Job ClassAd as a job starts, but only some short time
later. OK, I can live with that.
However, what is harder to understand is that we have seen instances
where it will display a value of 27GB when a job starts! I was trying
to use this on the schedd machine
SYSTEM_PERIODIC_REMOVE = ResidentSetSize > 5000*RequestMemory
but that 27GB cut was killing the jobs. However, once I removed this
from the schedd, similar jobs finished fine, with ClassAds reporting
only about 2.7GB of final ResidentSetSize.
Has anyone seen this kind of thing before? How could this be real, and
why does the ResidentSetSize not appear in the job ClassAd at job start?
We are running HTCondor 8.4.6 with cgroups enabled.
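As a rough check of what the removal expression above implies, the following sketch works through the units. It assumes, as the HTCondor manual describes these job ClassAd attributes, that ResidentSetSize is reported in KiB and RequestMemory in MiB; the 4096 MiB request is a hypothetical value, not one taken from this thread.

```python
KIB_PER_MIB = 1024
KIB_PER_GIB = 1024 * 1024


def removal_threshold_kib(request_memory_mib, factor=5000):
    """Right-hand side of `ResidentSetSize > 5000*RequestMemory`.

    Assumption: the result is compared against ResidentSetSize, which is in KiB.
    """
    return factor * request_memory_mib


def would_remove(resident_set_size_kib, request_memory_mib, factor=5000):
    """True when the reported RSS (KiB) exceeds the threshold."""
    return resident_set_size_kib > removal_threshold_kib(request_memory_mib, factor)


request_mib = 4096  # hypothetical RequestMemory, in MiB
threshold_kib = removal_threshold_kib(request_mib)

# Threshold expressed in GiB, and as a multiple of the requested memory.
print(threshold_kib / KIB_PER_GIB)
print(threshold_kib / (request_mib * KIB_PER_MIB))

# A transient 27 GiB reading trips the cut; a 2.7 GiB reading does not.
print(would_remove(27 * KIB_PER_GIB, request_mib))
print(would_remove(int(2.7 * KIB_PER_GIB), request_mib))
```

Under these assumptions the factor of 5000 amounts to a cut at a little under five times the requested memory, so a single inflated early reading is enough to remove the job.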
Hi Bob,
A couple of quick thoughts:
1. We've seen Linux cgroups report large memory sizes for jobs because
they include memory used by the kernel to buffer file system
writes. If you have the condor_config knob ENABLE_KERNEL_TUNING set to
True (the default), HTCondor should prevent this problem from
happening starting with v8.4.5. But if you disabled this for whatever
reason, that could be the culprit. For more info see
https://htcondor-wiki.cs.wisc.edu/index.cgi/tktview?tn=5500
This is enabled and active.
2. Suggest using MemoryUsage for all policy expressions instead of
ResidentSetSize.
MemoryUsage = ( ( ResidentSetSize + 1023 ) / 1024 )
How will this help in this expression, other than just rounding 27GB a
bit? What happens with this expression at first, when ResidentSetSize
is not yet available?
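To make the arithmetic behind the MemoryUsage expression concrete, here is a minimal sketch. It assumes ResidentSetSize is in KiB, that the division is integer division (so the expression rounds up to whole MiB, the same unit as RequestMemory), and that a missing attribute makes the whole expression undefined, which is modeled here as None. The function names and the factor of 5 are illustrative only, and how the schedd treats an undefined policy result is not stated in this thread.

```python
def memory_usage_mib(resident_set_size_kib):
    """Mirror of `( ( ResidentSetSize + 1023 ) / 1024 )` with integer division.

    None stands in for an attribute that is not yet present in the job ad.
    """
    if resident_set_size_kib is None:
        return None
    return (resident_set_size_kib + 1023) // 1024


def exceeds(memory_usage, request_memory_mib, factor=5):
    """Hypothetical policy check `MemoryUsage > factor*RequestMemory` (both MiB)."""
    if memory_usage is None:
        return None  # undefined operand: the comparison is neither True nor False
    return memory_usage > factor * request_memory_mib


# Rounding up: any partial MiB counts as a full MiB.
print(memory_usage_mib(1), memory_usage_mib(1024), memory_usage_mib(1025))

# Before the first update there is no ResidentSetSize to evaluate.
print(exceeds(memory_usage_mib(None), 4096))

# A 27 GiB reading is still 27 GiB after conversion; only the unit changes.
print(exceeds(memory_usage_mib(27 * 1024 * 1024), 4096))
```

The conversion does not hide a spurious 27GB reading; what it changes is that the comparison against RequestMemory is made in matching units.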
3. As to why ResidentSetSize does not appear in the job ClassAd at job
start: immediately at job start (i.e. right after the job is exec'ed),
many jobs haven't allocated much memory yet, because they are still
initializing. So now the question is how long to wait... 1 second? 1
minute? By default HTCondor waits 8 seconds after job startup before
updating ResidentSetSize and other attributes. There is a
condor_starter config knob for this named
STARTER_INITIAL_UPDATE_INTERVAL.
As noted, I can live with a delay.
4. You may be interested in the HOWTO at
https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToLimitMemoryUsage
I will check this out. I _think_ I've seen it before, but will have to
take a second look.
Hope the above helps,
Todd
Any information is helpful.
Thanks,
bob