[Condor-users] trying to get parallel-universe jobs working
- Date: Tue, 20 Apr 2010 11:11:12 -0700
- From: Lee Damon <nomad@xxxxxxxxxxxxxxxxxxxxxx>
- Subject: [Condor-users] trying to get parallel-universe jobs working
One of the users here has decided he wants to run MPI jobs. In trying
to set up the parallel universe, I can't even get the simple "sleep 30"
job from
<http://www.cs.wisc.edu/condor/manual/v7.2/2_9Parallel_Applications.html>
to launch; it just sits idle.
I've followed the instructions in
<http://www.cs.wisc.edu/condor/manual/v7.2/3_13Setting_Up.html#sec:Config-Dedicated-Jobs>
and have the following configuration values set on the 8 test hosts.
Vanilla universe jobs submitted on these hosts run just fine. Parallel
universe jobs just sit idle.
-----
: || nomad@flock03 ~ [77] ; condor_config_val STARTD_ATTRS
RESOURCE_GROUP, JOB_GROUP, [...], DedicatedScheduler
: || nomad@flock03 ~ [78] ; condor_config_val DedicatedScheduler
"DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxxx"
: || nomad@flock03 ~ [79] ; condor_config_val SUSPEND
Scheduler =!= "DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxxx" && (False)
: || nomad@flock03 ~ [80] ; condor_config_val PREEMPT
Scheduler =!= "DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxxx" && ((
((Activity == "Suspended") && ((CurrentTime - EnteredCurrentActivity) >
10 * 60)) || (SUSPEND && (WANT_SUSPEND == False)) ))
: || nomad@flock03 ~ [81] ; condor_config_val RANK_FACTOR
1000000
: || nomad@flock03 ~ [82] ; condor_config_val RANK
(Scheduler =?= "DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxxx" * 1000000)
+ ( MY.RESOURCE_GROUP == TARGET.JOB_GROUP || MY.RESOURCE_GROUP ==
TARGET.USER_GROUP )
: || nomad@flock03 ~ [83] ; condor_config_val START
(Scheduler =?= "DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxxx") || (True
&& ( MY.RESOURCE_GROUP == TARGET.JOB_GROUP || MY.RESOURCE_GROUP ==
TARGET.USER_GROUP || MY.RESOURCE_GROUP == "ssli" ) && ( State !=
"Claimed" || (CurrentTime - EnteredCurrentState) < 10 * 60 ) &&
((VirtualMachineID == 1) && ((2026 - Target.JobMaxMem -
ifThenElse(isUndefined(slot2_JobMaxMem), 0, slot2_JobMaxMem)) > 0)) ||
((VirtualMachineID == 2) && ((2026 -
ifThenElse(isUndefined(slot1_JobMaxMem), 0, slot1_JobMaxMem) -
Target.JobMaxMem) > 0)))
-----
The submit description file I'm using is:
-----
: || nomad@flock03 ~/condor [85] ; cat paralleltest
#############################################
## submit description file for a parallel program
#############################################
universe = parallel
executable = /bin/sleep
arguments = 30
machine_count = 2
log = /homes/nomad/condor/log
queue
-----
The log file shows the job being submitted but nothing further.
condor_q -better says there are 24 slots available to run this job:
-----
: || nomad@flock03 ~/condor [88] ; condor_q -better 7
-- Submitter: flock03.ee.washington.edu : <128.208.232.223:33650> :
flock03.ee.washington.edu
---
007.000: Run analysis summary. Of 30 machines,
4 are rejected by your job's requirements
2 reject your job because of their own requirements
0 match but are serving users with a better priority in the pool
0 match but reject the job for unknown reasons
0 match but will not currently preempt their existing job
24 are available to run your job
The Requirements expression for your job is:
( ( ( MY.RESOURCE_GROUP is TARGET.JOB_GROUP ) || ( TARGET.JOB_GROUP is
undefined ) ) ) &&
( target.Arch == "INTEL" ) && ( target.OpSys == "LINUX" ) &&
( target.Disk >= DiskUsage ) && ( ( target.Memory * 1024 ) >= ImageSize ) &&
( TARGET.FileSystemDomain == MY.FileSystemDomain )
    Condition                                           Machines Matched  Suggestion
    ---------                                           ----------------  ----------
1   ( target.Arch == "INTEL" )                          26
2   ( ( ( "ssli" is TARGET.JOB_GROUP ) || ( TARGET.JOB_GROUP is undefined ) ) )
                                                        30
3   ( target.OpSys == "LINUX" )                         30
4   ( target.Disk >= 17 )                               30
5   ( ( 1024 * target.Memory ) >= 17 )                  30
6   ( TARGET.FileSystemDomain == "ee.washington.edu" )  30
-----
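As a sanity check on that summary, the per-category counts can be tallied against the "Of 30 machines" total. The following is only a rough sketch in Python: the `tally` helper and the `SUMMARY` string are made up for illustration and are not part of Condor, and it assumes the summary lines look exactly like the output pasted above.

```python
import re

# Summary lines copied verbatim from the condor_q output above.
SUMMARY = """\
007.000: Run analysis summary. Of 30 machines,
4 are rejected by your job's requirements
2 reject your job because of their own requirements
0 match but are serving users with a better priority in the pool
0 match but reject the job for unknown reasons
0 match but will not currently preempt their existing job
24 are available to run your job
"""

def tally(summary):
    """Hypothetical helper: return (total_machines, {category: count})."""
    total = int(re.search(r"Of (\d+) machines", summary).group(1))
    counts = {}
    for line in summary.splitlines():
        # Category lines start with an integer followed by a space.
        m = re.match(r"\s*(\d+) (.+)", line)
        if m:
            counts[m.group(2)] = int(m.group(1))
    return total, counts

total, counts = tally(SUMMARY)
print(total, sum(counts.values()))  # prints "30 30"
```

The categories add up to the full 30 machines, so every machine is accounted for and 24 of them are reported as available.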
And condor_status -submitter shows the jobs being inserted by the
DedicatedScheduler:
-----
: || nomad@flock03 ~/condor [99] ; condor_status -submitter
Name                  Machine     Running  IdleJobs  HeldJobs
DedicatedScheduler@f  flock03.ee        0         4         0
asubram@xxxxxxxxxxxx  flock03.ee        0         0         0
nomad@xxxxxxxxxxxxxx  flock03.ee        0         0         0

                      RunningJobs  IdleJobs  HeldJobs
DedicatedScheduler@f            0         4         0
asubram@xxxxxxxxxxxx            0         0         0
nomad@xxxxxxxxxxxxxx            0         0         0
               Total            0         4         0
-----
Any hints on where I should look to see why the job isn't running?
thanks,
nomad