
Re: [Condor-users] Getting closer with Parallel Universe on Dynamic slots



On Fri, Nov 25, 2011 at 01:12:01PM +0100, Lukas Slebodnik wrote:
> On Fri, Nov 25, 2011 at 12:14:19PM +0100, Steffen Grunewald wrote:
> > ... but still no cigar.
> >
> > The setup consists of 5 4-core machines and some more 2-core machines.
> > All of them have been configured as single, partitionable slots.
> > Preemption is forbidden completely.
> > The rank definitions are as follows:
> > RANK = 0
> > NEGOTIATOR_PRE_JOB_RANK = 1000000000 + 1000000000 * (TARGET.JobUniverse =?= 11) * (TotalCpus+TotalSlots) - 1000 * Memory
> >
> > I'd expect this to favour big machines over small ones (for Parallel jobs),
> > and partially occupied ones over empty ones.
> >
> > What I see with the following submit file, is quite different:
> >
> > universe   = parallel
> > initialdir = /home/steffeng/tests/mpi/
> > executable = /home/steffeng/tests/mpi/mpitest
> > arguments  =  $(Process) $(NODE)
> > output     = out.$(NODE)
> > error      = err.$(NODE)
> > log        = log
> > notification = Never
> > on_exit_remove = (ExitBySignal == False) || ((ExitBySignal == True) && (ExitSignal != 11))
> > should_transfer_files = yes
> > when_to_transfer_output = on_exit
> > Requirements = ( TotalCpus == 4 )
> > request_memory = 500
> > machine_count = 10
> >
> > (mpitest is the ubiquitous "MPI hello world" program trying to get rank and
> > size from MPI_COMM_WORLD)
> >
> > - if I leave the Requirements out, the 10 MPI nodes will end up on the 5 big
> > machines (one per machine) plus 5 small ones
> If you do not specify request_cpus, the default value (1) will be used.

Yes.
I cannot specify "request_cpus=4", as that would leave my jobs idle whenever the big
nodes are taken by someone else.
And AFAICT, there is no "request_cpus=all" or "request_cpus=TARGET.TotalCpus".

Specifying "request_cpus=4" together with "machine_count=16" results in the job
being forever idle - as there are no 16 *times* 4 cores available. There are only
five machines like that.
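
(Spelling out the arithmetic: with request_cpus=4 each of the 16 requested nodes
needs a whole 4-core machine to itself,

  16 nodes x 4 cores = 64 cores on TotalCpus==4 machines,

while only 5 such machines = 20 cores exist - so at most 5 of the 16 nodes can ever
be matched, and the job sits idle forever.)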

The code is pure MPI, no OpenMP (and the "real" code will probably never use
OpenMP) - so each MPI node is single-threaded, right?

> I suppose that there aren't any other jobs. At the beginning of the negotiation
> cycle you have only partitionable slots. According to NEGOTIATOR_PRE_JOB_RANK,
> slots with 4 cores will have higher priority. This is exactly what you want.
> But you don't specify request_cpus, therefore only ONE core will be taken
> from the partitionable slot (slot1@xxxxxxxxxxxxxxx) and a new dynamic slot
> (slot1_1@xxxxxxxxxxxxxxx) will be created. In the same negotiation cycle
> there are also 2-core partitionable slots available. The same process will occur
> with the 2-core slots.
> 
> Result: 10 new slots will be created (5 on big machines and 5 on small machines)
> 
> > - with the Requirements set as above, each of the big machines will run
> > exactly two nodes instead of 4+4+2+0+0
> Like the previous case, but in the first negotiation cycle the 2-core partitionable
> slots will not be considered because of the job's own requirement. In the next
> negotiation cycle the "4-core" partitionable slots contain only 3 Cpus, but
> TotalCpus will always be equal to 4. Therefore in the next negotiation cycle
> another 5 dynamic slots will be created from the "4-core" partitionable slots
> (with name slot1_2@xxxxxxxxxxxxxxx).
> 
> Result: Each big machine ends up with two slots (nodes).
> 
> I think that my explanation will help you.

I understand it, but I don't see a way out of this with the heterogeneous setup.

What I see more clearly now is that a maximum number of MPI nodes gets matched (and
the dynamic slots created) in the first negotiation cycle - which, by the way,
contradicts the recipe in https://condor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToPackParallelJobs
- and the remaining ones are attempted in a subsequent cycle, which fails with the
modified request_cpus. I can now see 10 slots satisfying the Requirements:

1   ( ( target.TotalCpus == 4 ) )     10                   

apparently the "parent" partitionable slot is counted as well as the split-off dynamic
(claimed) one.
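
One thing I might try - just a sketch, assuming Cpus (remaining cores in a
partitionable slot) and DynamicSlot are advertised the way I read the manual - is
to keep both the claimed children and the already-nibbled parents from matching:

Requirements = ( TotalCpus == 4 ) && ( Cpus >= 4 ) && ( DynamicSlot =!= True )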

Since request_cpus does not cure the problem: is there any way to tell Condor to
use up a slot completely before proceeding to the next machine?
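
For completeness, the negotiator-side knob I'd try next - a sketch only, along the
lines of the HowToPackParallelJobs page mentioned above; the extra Cpus term is my
own guess and not taken from that page - would prefer partitionable slots that
already have cores carved out of them:

NEGOTIATOR_PRE_JOB_RANK = 1000000000 + 1000000000 * (TARGET.JobUniverse =?= 11) * (TotalCpus+TotalSlots) - 1000000 * Cpus - 1000 * Memory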

> > - not all out.* and err.* files get written (the pattern looks semi-random)
> > - all of them identify as "rank 0" of "size 1"

With machine_count=16, this has become worse.
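
(The "rank 0" / "size 1" symptom smells like mpitest being started directly on each
node instead of under mpirun. The manual's recipe for MPI in the parallel universe
is to make one of the wrapper scripts shipped with Condor (mp1script etc., somewhere
under etc/examples IIRC) the executable and pass the real binary as an argument -
roughly:

universe      = parallel
executable    = mp1script
arguments     = /home/steffeng/tests/mpi/mpitest
machine_count = 10
should_transfer_files   = yes
when_to_transfer_output = on_exit
transfer_input_files    = /home/steffeng/tests/mpi/mpitest
queue

- but that is a separate construction site from the matching problem above.)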

> > Condor version is 7.6.0 (and should include the fixes of ticket 986 which
> > went into 7.5.6).

7.6.4 now, no change.

> > How can I debug this?

S