Re: [Condor-users] trying to get parallel-universe jobs working
- Date: Tue, 20 Apr 2010 15:39:16 -0700
- From: Lee Damon <nomad@xxxxxxxxxxxxxxxxxxxxxx>
- Subject: Re: [Condor-users] trying to get parallel-universe jobs working
It looks like this was tickling a bug in 7.2.1. I've upgraded to 7.4.2
and the problem appears to have gone away.
nomad
Lee Damon wrote:
> One of the users here has decided he wants to run MPI jobs. In trying
> to set up the parallel universe I can't even get the simple "sleep 30"
> job from
> <http://www.cs.wisc.edu/condor/manual/v7.2/2_9Parallel_Applications.html>
> to launch; it just sits idle.
>
> I've followed the instructions in
> <http://www.cs.wisc.edu/condor/manual/v7.2/3_13Setting_Up.html#sec:Config-Dedicated-Jobs>
> and have the following configuration values set on the 8 test hosts.
> Vanilla universe jobs submitted on these hosts run just fine. Parallel
> universe jobs just sit idle.
>
> -----
> : || nomad@flock03 ~ [77] ; condor_config_val STARTD_ATTRS
> RESOURCE_GROUP, JOB_GROUP, [...], DedicatedScheduler
> : || nomad@flock03 ~ [78] ; condor_config_val DedicatedScheduler
> "DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxxx"
> : || nomad@flock03 ~ [79] ; condor_config_val SUSPEND
> Scheduler =!= "DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxxx" && (False)
> : || nomad@flock03 ~ [80] ; condor_config_val PREEMPT
> Scheduler =!= "DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxxx" && ((
> ((Activity == "Suspended") && ((CurrentTime - EnteredCurrentActivity) >
> 10 * 60)) || (SUSPEND && (WANT_SUSPEND == False)) ))
> : || nomad@flock03 ~ [81] ; condor_config_val RANK_FACTOR
> 1000000
> : || nomad@flock03 ~ [82] ; condor_config_val RANK
> (Scheduler =?= "DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxxx" * 1000000)
> + ( MY.RESOURCE_GROUP == TARGET.JOB_GROUP || MY.RESOURCE_GROUP ==
> TARGET.USER_GROUP )
> : || nomad@flock03 ~ [83] ; condor_config_val START
> (Scheduler =?= "DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxxx") || (True
> && ( MY.RESOURCE_GROUP == TARGET.JOB_GROUP || MY.RESOURCE_GROUP ==
> TARGET.USER_GROUP || MY.RESOURCE_GROUP == "ssli" ) && ( State !=
> "Claimed" || (CurrentTime - EnteredCurrentState) < 10 * 60 ) &&
> ((VirtualMachineID == 1) && ((2026 - Target.JobMaxMem -
> ifThenElse(isUndefined(slot2_JobMaxMem), 0, slot2_JobMaxMem)) > 0)) ||
> ((VirtualMachineID == 2) && ((2026 -
> ifThenElse(isUndefined(slot1_JobMaxMem), 0, slot1_JobMaxMem) -
> Target.JobMaxMem) > 0)))
> -----
>
> The submit file I'm using is:
>
> -----
> : || nomad@flock03 ~/condor [85] ; cat paralleltest
> #############################################
> ## submit description file for a parallel program
> #############################################
> universe = parallel
> executable = /bin/sleep
> arguments = 30
> machine_count = 2
> log = /homes/nomad/condor/log
>
> queue
> -----
>
> The log file shows the job being submitted but nothing further.
> condor_q -better says there are 24 slots available to run this job:
>
> -----
> : || nomad@flock03 ~/condor [88] ; condor_q -better 7
>
>
> -- Submitter: flock03.ee.washington.edu : <128.208.232.223:33650> :
> flock03.ee.washington.edu
> ---
> 007.000: Run analysis summary. Of 30 machines,
> 4 are rejected by your job's requirements
> 2 reject your job because of their own requirements
> 0 match but are serving users with a better priority in the pool
> 0 match but reject the job for unknown reasons
> 0 match but will not currently preempt their existing job
> 24 are available to run your job
>
> The Requirements expression for your job is:
>
> ( ( ( MY.RESOURCE_GROUP is TARGET.JOB_GROUP ) || ( TARGET.JOB_GROUP is
> undefined ) ) ) &&
> ( target.Arch == "INTEL" ) && ( target.OpSys == "LINUX" ) &&
> ( target.Disk >= DiskUsage ) && ( ( target.Memory * 1024 ) >= ImageSize ) &&
> ( TARGET.FileSystemDomain == MY.FileSystemDomain )
>
>    Condition                                                                    Machines Matched  Suggestion
>    ---------                                                                    ----------------  ----------
> 1  ( target.Arch == "INTEL" )                                                   26
> 2  ( ( ( "ssli" is TARGET.JOB_GROUP ) || ( TARGET.JOB_GROUP is undefined ) ) )  30
> 3  ( target.OpSys == "LINUX" )                                                  30
> 4  ( target.Disk >= 17 )                                                        30
> 5  ( ( 1024 * target.Memory ) >= 17 )                                           30
> 6  ( TARGET.FileSystemDomain == "ee.washington.edu" )                           30
> -----
>
> And condor_status -submitter shows the jobs being inserted by the
> DedicatedScheduler:
>
> -----
> : || nomad@flock03 ~/condor [99] ; condor_status -submitter
>
> Name                  Machine     Running  IdleJobs  HeldJobs
>
> DedicatedScheduler@f  flock03.ee        0         4         0
> asubram@xxxxxxxxxxxx  flock03.ee        0         0         0
> nomad@xxxxxxxxxxxxxx  flock03.ee        0         0         0
>
>                       RunningJobs  IdleJobs  HeldJobs
>
> DedicatedScheduler@f            0         4         0
> asubram@xxxxxxxxxxxx            0         0         0
> nomad@xxxxxxxxxxxxxx            0         0         0
>
> Total                           0         4         0
> -----
>
>
>
> Any hints on where I should look to see why the job isn't running?
>
> thanks,
> nomad
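
For anyone reading this thread with the same symptom: the run analysis quoted above reports 24 machines available while the job stays idle, which suggests the job's Requirements expression is not what is blocking it. A minimal Python sketch of pulling those counts out of saved condor_q -better output follows. The function name parse_run_analysis and the SAMPLE text are illustrative assumptions, not part of Condor; the line patterns are taken only from the output shown in this message and may not hold for other versions.

```python
import re

# Patterns based only on the "Run analysis summary" lines quoted in this thread.
TOTAL_LINE = re.compile(r"Of (\d+) machines")
SUMMARY_LINE = re.compile(r"^(\d+)\s+((?:are|reject|match)\b.*)$")


def parse_run_analysis(text):
    """Return (total_machines, {description: count}) from the analysis text.

    Hypothetical helper: tolerates the "> " quoting used in email replies.
    """
    total = None
    counts = {}
    for raw in text.splitlines():
        line = raw.lstrip("> ").strip()
        m = TOTAL_LINE.search(line)
        if m:
            total = int(m.group(1))
            continue
        m = SUMMARY_LINE.match(line)
        if m:
            counts[m.group(2)] = int(m.group(1))
    return total, counts


# Sample copied from the output quoted earlier in this message.
SAMPLE = """\
007.000: Run analysis summary. Of 30 machines,
4 are rejected by your job's requirements
2 reject your job because of their own requirements
0 match but are serving users with a better priority in the pool
0 match but reject the job for unknown reasons
0 match but will not currently preempt their existing job
24 are available to run your job
"""

total, counts = parse_run_analysis(SAMPLE)
available = counts.get("are available to run your job", 0)
if available > 0:
    # Machines match, so the cause of the idle job is probably elsewhere.
    print(f"{available} of {total} machines are available to run the job")
```

If the available count is nonzero and the job still does not start, the match analysis alone will not explain it. In this thread the cause turned out to be a bug in 7.2.1 that went away after upgrading to 7.4.2.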