Hi Brian,

no, I am not using cgroups. It somehow made our main execute machine
crash hard (i.e. a hard reboot was needed...). Maybe some kernel
problem... In any case, nothing that I can fix, because IT insists on
keeping the standard installation. So I do this in my config files
for the execute machines:
MEMORY_EXCEEDED = ifThenElse(isUndefined(MemoryUsage), False, (MemoryUsage > (RequestMemory+2000)*1.4))
PREEMPT = ($(PREEMPT) || $(MEMORY_EXCEEDED))
WANT_SUSPEND = ($(WANT_SUSPEND)) && ($(MEMORY_EXCEEDED)) =!= TRUE
WANT_HOLD = $(MEMORY_EXCEEDED)
WANT_HOLD_REASON = ifThenElse( $(MEMORY_EXCEEDED), \
"Your job used too much virtual memory.", \
undefined )
It's basically what I found on the wiki.
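For reference, the hold condition in that MEMORY_EXCEEDED expression can be sanity-checked outside HTCondor. A minimal Python sketch of the same arithmetic (the 2000 MB headroom and 1.4 factor come from the expression above; the 2048 MB example request is made up):

```python
def memory_exceeded(memory_usage, request_memory):
    """Mirror of the MEMORY_EXCEEDED config expression:
    hold once usage passes (RequestMemory + 2000) * 1.4 MB."""
    if memory_usage is None:  # isUndefined(MemoryUsage) -> False
        return False
    return memory_usage > (request_memory + 2000) * 1.4

# A job requesting 2048 MB is only held above (2048 + 2000) * 1.4 = 5667.2 MB:
print(memory_exceeded(5000, 2048))   # False
print(memory_exceeded(6000, 2048))   # True
print(memory_exceeded(None, 2048))   # False (no usage reported yet)
```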
I am testing an implementation very close to Dimitri's/Lauren's
suggestion right now...

Best,
Thomas
On 2016-03-18 at 15:09, Brian Bockelman wrote:
Hi Thomas,
How are you killing off jobs? Are you using cgroups
for enforcement?
If so, it might be a good idea to look at the SOFT
enforcement in combination with what Dimitri suggested. This
allows jobs to go over limits until the machine is out of
memory. In that case, the over-limit jobs are killed first.
However, I do suggest keeping the recipe from
Dimitri / Lauren - you don't want to give users an unlimited
pass, because otherwise they never understand their real memory
requirements.
Brian
Thank you for the suggestion, Dimitri. But if I understand
correctly what happens there, a job that exceeds the limit
will be put on hold and then rescheduled, even if it would be
possible to simply increase request_memory on the same machine.
We cannot work with checkpoints here (at least not using
HTCondor's standard universe), so this would mean that jobs
would need to rerun from the very beginning.
If there were a way to update the requirements of a job while
it is running, after checking whether the job can remain on its
machine under the new requirements, that would be great for my
use case.
Don't get me wrong: yours is a wonderful suggestion, and if
this extra bit is not possible, I will definitely test it!
Thanks again,
Thomas
On 2016-03-15 at 18:50, Dimitri Maziuk wrote:
On 03/15/2016 08:24 AM, Thomas Hartmann wrote:
2. Handle RAM allocations more dynamically. For instance:
2.1. If a job wants to use more RAM than previously requested, check
whether the machine on which it runs still has that amount of RAM
available.
2.2. If it does, update request_memory to a safe value and continue
running the job.
2.3. If the extra RAM is not available, stop the job, update
request_memory to a safe value, and put it back into the queue.
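Steps 2.1-2.3 can be sketched as plain decision logic. A minimal Python sketch (the function, the 1.5x "safe value" margin, and the MB numbers are all my own illustration; HTCondor does not expose this as a single call):

```python
from dataclasses import dataclass

@dataclass
class Job:
    request_memory: int   # MB currently requested
    memory_usage: int     # MB actually in use

SAFETY_FACTOR = 1.5  # illustrative margin for the "safe value"

def handle_overuse(job: Job, machine_free_mb: int) -> str:
    """Sketch of steps 2.1-2.3: grow the request in place if the
    machine has room, otherwise requeue with a larger request."""
    if job.memory_usage <= job.request_memory:
        return "keep-running"
    safe_request = int(job.memory_usage * SAFETY_FACTOR)
    extra_needed = safe_request - job.request_memory
    if machine_free_mb >= extra_needed:       # 2.1 + 2.2
        job.request_memory = safe_request
        return "keep-running"
    job.request_memory = safe_request         # 2.3
    return "requeue"

# A job requesting 1000 MB but using 1200 MB wants a 1800 MB request;
# with 1000 MB free it stays, with only 500 MB free it is requeued:
print(handle_overuse(Job(1000, 1200), 1000))   # keep-running
print(handle_overuse(Job(1000, 1200), 500))    # requeue
```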
Courtesy of Lauren Michael:
2) The below lines added to the submit file will allow the jobs to
self-police MemoryUsage, and will adjust the memory request in response
(though "request_memory" would need to be replaced in the submit file, not
added).
+MemoryUsage = ( 800 ) * 2 / 3
request_memory = ( MemoryUsage ) * 3 / 2
periodic_hold = ( MemoryUsage >= ( ( RequestMemory ) * 3 / 2 ) )
periodic_release = (JobStatus == 5) && ((CurrentTime - EnteredCurrentStatus) > 180) && (HoldReasonCode != 34)
These lines essentially say:
Set the "request_memory" ("RequestMemory" in the job classad) to be a
function of MemoryUsage, and artificially set the MemoryUsage to an initial
value (800 MB * 2/3).
Put the job on hold if the (real) MemoryUsage goes 50% above the current
RequestMemory value.
Release the held job (if held for the memory reason, and held for at least
3 minutes), so that it will be matched to run again on a compute "slot"
with more memory (according to the new RequestMemory value).
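The escalation arithmetic in this recipe can be checked by hand: the job is held once usage reaches 1.5x the current request, and the new request is 1.5x the usage at hold time. A minimal Python sketch (integer MB, ignoring ClassAd rounding details):

```python
def next_request(request_mb, usage_at_hold_mb):
    """After a hold, the recipe sets request_memory = MemoryUsage * 3/2,
    where MemoryUsage is at least 1.5x the old request (the hold trigger)."""
    assert 2 * usage_at_hold_mb >= 3 * request_mb, "hold would not have fired"
    return usage_at_hold_mb * 3 // 2

# Start at 800 MB: held once usage reaches 1200 MB, released asking 1800 MB;
# a later overrun at 2700 MB comes back asking 4050 MB.
print(next_request(800, 1200))    # 1800
print(next_request(1800, 2700))   # 4050
```

So each hold/release cycle grows the request by at least 2.25x, which is why jobs that genuinely need more memory converge after a few cycles.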
We removed "HoldReasonCode != 34" and added "periodic_remove = (time() -
QDate) > 500000" and have been running those jobs for quite some time.
What I can't tell you is how many of them actually use that magic: I
won't dig into that until things break and so far they haven't. (Most of
those jobs run in under 800MB.)
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/
--
Dr. Thomas Hartmann
Centre for Cognitive Neuroscience
FB Psychologie
Universität Salzburg
Hellbrunnerstraße 34/II
5020 Salzburg
Tel: +43 662 8044 5109
Email: thomas.hartmann@xxxxxxxx
"I am a brain, Watson. The rest of me is a mere appendix. " (Arthur Conan Doyle)