
[HTCondor-users] Memory requirements of parallel universe jobs



Good afternoon,

here comes another Parallel Universe related question.

I have a couple of users who run parallel jobs with ~O(100) threads
via MPI, and they tend to slightly underestimate their memory
footprints. That's not a big deal as long as there's still some
spare memory left with the default partitioning (static or dynamic),
but I nevertheless have to preempt jobs (of all universes) which
overstep their limit by too much.
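
For context, the kind of startd policy I'm talking about is roughly
the following sketch (the 20% margin is made up for illustration,
not our actual threshold):

    # Preempt a job once its measured memory use (MemoryUsage, in MB)
    # exceeds the memory provisioned for its slot (Memory, in MB) by
    # more than 20%.
    PREEMPT = (MemoryUsage > 1.2 * Memory)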
As I said, that's not a big deal for the worker threads - they stay
well within their limits (plus a small margin). It's "rank 0" that
makes the difference: by collecting results from the worker threads,
it slowly but steadily increases its memory footprint, way beyond
that of the workers.
Condor apparently has only one request_memory declaration, which is
applied to all MPI ranks alike - so to "be honest in declaring one's
needs", one would have to request the extraordinary amount of memory
used by the single rank-0 thread, while the other 99 threads would
carry a hugely overestimated memory tag. As a result, the threads
get spread over a lot of machines - since not even two of them fit
onto one machine, whatever locality was left is completely destroyed,
and a lot of performance is wasted.
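
To illustrate the point, a stripped-down parallel universe submit
description looks something like this (the executable name and the
numbers are invented, just for the sketch):

    universe       = parallel
    executable     = /usr/local/bin/openmpiscript
    arguments      = my_mpi_program
    machine_count  = 100
    request_cpus   = 1
    # this single value is requested for every one of the 100 slots,
    # even though only rank 0 will actually need that much
    request_memory = 8 GB
    queue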

Is there an easy way out, other than some special configuration that
would allow the users to "lie" and not be punished for overcommitting
memory?

Thanks,
 Steffen

-- 
Steffen Grunewald * Cluster Admin * steffen.grunewald(*)aei.mpg.de
MPI f. Gravitationsphysik (AEI) * Am Mühlenberg 1, D-14476 Potsdam
http://www.aei.mpg.de/ * ------- * +49-331-567-{fon:7274,fax:7298}