On Friday, January 26, 2018, 7:07:02 PM GMT+3:30, Jason Patton <jpatton@xxxxxxxxxxx> wrote:
Let's take your example with a condor pool containing:
Machine1 with 1 core
Machine2 with 2 cores
If you submit a parallel universe job with "machine_count = 3" as the
only requirement, condor will try to match the job to three slots in
the pool, each providing one core. This will work **if** your pool is
configured to
have one static slot per core (the default) or if you are using
partitionable slots. However, if your pool is configured with a single
static slot on each machine (perhaps with each slot containing all of
the cores), then your job will not match because you will only have
two slots -- one on Machine1 with one core, one on Machine2 with two
cores.
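For concreteness, a minimal submit file for that example might look
like the sketch below (the executable name is just a placeholder):

```
# Hypothetical parallel universe submit file -- a sketch, not a
# drop-in file. "my_mpi_job" is a placeholder executable name.
universe      = parallel
executable    = my_mpi_job
machine_count = 3
request_cpus  = 1
queue
```

With one static slot per core (or partitionable slots), the negotiator
can satisfy this with any three single-core slots, possibly spread
across both machines.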
It's difficult to say whether specific examples will work without
knowing exactly how your pool is configured.
Based on the condor_status output you've provided in these threads, it
seems that you have one static slot per core. This means that you can
only submit jobs that request a single core per slot (or per
node/machine in the parallel universe), but you can request as many
nodes (machine_count) as you want up to the number of slots in your
pool.
If you want to submit jobs that request more than a single core (e.g.
request_cpus = 2), then you will need to reconfigure your pool to have
more than one core per slot or consider using partitionable slots.
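If you go the partitionable-slot route, the config change on each
execute machine is small. This is a sketch, not a drop-in config --
check it against your local setup before deploying:

```
# Example: replace the per-core static slots with one partitionable
# slot that owns all of the machine's resources. Jobs then carve off
# dynamic slots sized by their request_cpus / request_memory.
NUM_SLOTS                 = 1
NUM_SLOTS_TYPE_1          = 1
SLOT_TYPE_1               = 100%
SLOT_TYPE_1_PARTITIONABLE = TRUE
```

After a condor_reconfig (or restart of the startd), jobs with
request_cpus = 2 should be able to match on Machine2.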
Here's the manual for configuring slots (Section 3.5.10):
Jason