Re: [HTCondor-users] multicore and multinode run
- Date: Fri, 26 Jan 2018 13:46:55 -0600
- From: Jason Patton <jpatton@xxxxxxxxxxx>
- Subject: Re: [HTCondor-users] multicore and multinode run
That is strange. I believe I have the same test program as your
mpihello, so I will see if I can reproduce this behavior.
In the meantime, can you run your test job with machine_count = 5, using
a single queue statement and no per-node requirements?
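
As a rough sketch, assuming the rest of the submit file stays the same as
the mpi.ht quoted below, that would look something like this. The
condor_status output quoted below lists two slots on compute-0-0 and four
on compute-0-1, so if all six slots are available and each one satisfies
request_cpus = 1, five nodes cannot fit on either machine alone and the job
should have to span both.

```
universe = parallel
executable = openmpiscript
arguments = mpihello
log = hellompi.log
output = hellompi.out.$(Node)
error = hellompi.err.$(Node)
request_cpus = 1
# a single queue statement asking for five slots, with no per-node requirements
machine_count = 5
queue
```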
Jason
On Fri, Jan 26, 2018 at 1:35 PM, Mahmood Naderan <nt_mahmood@xxxxxxxxx> wrote:
> I am sorry, but this time it put the job on compute-0-0 only. I understand
> the logic you described, but the result is really strange.
>
> [mahmood@rocks7 mpi]$ cat mpi.ht
> universe = parallel
> executable = openmpiscript
> arguments = mpihello
> log = hellompi.log
> output = hellompi.out.$(Node)
> error = hellompi.err.$(Node)
> request_cpus = 1
> # set requirements for first execute node
> requirements = Machine == "compute-0-0.local"
> machine_count = 1
> queue
> # set requirements for second execute node
> requirements = Machine == "compute-0-1.local"
> machine_count = 1
> queue
> [mahmood@rocks7 mpi]$ condor_submit mpi.ht
> Submitting job(s)..
> 1 job(s) submitted to cluster 31.
> [mahmood@rocks7 mpi]$ cat hellompi.out.0
> Hello world from processor compute-0-0.local, rank 1 out of 2 processors
> Hello world from processor compute-0-0.local, rank 0 out of 2 processors
> [mahmood@rocks7 mpi]$ cat hellompi.out.1
> [mahmood@rocks7 mpi]$ cat hellompi.err.0
> mkdir: cannot create directory '/var/opt/condor/execute/dir_28046/tmp': File exists
> [mahmood@rocks7 mpi]$ cat hellompi.err.1
> mkdir: cannot create directory '/var/opt/condor/execute/dir_28508/tmp': File exists
> [mahmood@rocks7 mpi]$ condor_status -af:h Machine DedicatedScheduler
> Machine DedicatedScheduler
> compute-0-0.local DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxx
> compute-0-0.local DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxx
> compute-0-1.local DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxx
> compute-0-1.local DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxx
> compute-0-1.local DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxx
> compute-0-1.local DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxx
> [mahmood@rocks7 mpi]$ condor_status
> Name                    OpSys Arch   State     Activity LoadAv  Mem ActvtyTime
>
> slot1@xxxxxxxxxxxxxxxxx LINUX X86_64 Claimed   Idle      0.000 1973 0+00:00:03
> slot2@xxxxxxxxxxxxxxxxx LINUX X86_64 Unclaimed Idle      0.000 1973 3+22:04:01
> slot1@xxxxxxxxxxxxxxxxx LINUX X86_64 Claimed   Idle      0.000  986 0+00:01:13
> slot2@xxxxxxxxxxxxxxxxx LINUX X86_64 Unclaimed Idle      0.000  986 0+03:29:59
> slot3@xxxxxxxxxxxxxxxxx LINUX X86_64 Unclaimed Idle      0.000  986 3+23:02:22
> slot4@xxxxxxxxxxxxxxxxx LINUX X86_64 Unclaimed Idle      0.000  986 3+23:02:22
>
>              Total Owner Claimed Unclaimed Matched Preempting Backfill Drain
>
> X86_64/LINUX     6     0       2         4       0          0        0     0
>
>        Total     6     0       2         4       0          0        0     0
> [mahmood@rocks7 mpi]$ condor_q
>
>
> -- Schedd: rocks7.vbtestcluster.com : <10.0.3.15:9618?... @ 01/26/18 14:31:41
> OWNER BATCH_NAME SUBMITTED DONE RUN IDLE HOLD TOTAL JOB_IDS
>
> 0 jobs; 0 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended
> [mahmood@rocks7 mpi]$
>
>
>
> Regards,
> Mahmood
>
>
> On Friday, January 26, 2018, 11:45:41 AM EST, Jason Patton
> <jpatton@xxxxxxxxxxx> wrote:
>
>
> When you specify "machine_count = 2", you are asking for your job to
> land on any two slots in your condor pool. These two slots could be on
> the same physical machine, which is likely what has happened.
>
> If you want to *test* that your job can land on each machine, you can
> set requirements per node in your submit file:
>
> universe = parallel
> executable = openmpiscript
> arguments = mpihello
> log = hellompi.log
> output = hellompi.out.$(Node)
> error = hellompi.err.$(Node)
> request_cpus = 1
>
> # set requirements for first execute node
> requirements = Machine == "compute-0-0.local"
> machine_count = 1
> queue
>
> # set requirements for second execute node
> requirements = Machine == "compute-0-1.local"
> machine_count = 1
> queue
>
> This will get you a two-node parallel universe job where one node is
> restricted to running on compute-0-0 and the other node is restricted
> to running on compute-0-1.
>
> Jason
>