Mailing List Archives
Re: [Condor-users] MPI job problem
The problem seems to be that all of your machines are in the "Owner"
state, i.e. Condor is NOT allowed to start any job on them.
It looks like you're using a START expression (in condor_config) that
makes your resources reject Condor jobs when they are under load or
when there is keyboard activity. (The output you sent was produced on
pragma001, so you were working on it, and the other two show a load
average of 1.000.)
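
To make the keyboard-activity point concrete, here is a minimal Python sketch that re-evaluates the Start expression published in the pragma001 machine ad, using the attribute values from that same ad (quoted further down). The names `ad` and `start` are illustrative only, and plain Python comparisons stand in for ClassAd evaluation, so UNDEFINED handling is not modelled.

```python
# Attribute values copied from the pragma001 machine ad in the quoted message.
ad = {
    "KeyboardIdle": 175,    # seconds since the last keyboard activity
    "LoadAvg": 0.01,
    "CondorLoadAvg": 0.0,
    "State": "Owner",
}


def start(ad):
    """Python rendering of pragma001's published Start expression."""
    keyboard_ok = ad["KeyboardIdle"] > 15 * 60
    load_ok = (ad["LoadAvg"] - ad["CondorLoadAvg"]) <= 0.3
    already_busy = ad["State"] not in ("Unclaimed", "Owner")
    return keyboard_ok and (load_ok or already_busy)


print(start(ad))  # False: KeyboardIdle is 175 s, below the 900 s threshold
```

With these values the load term is satisfied, but the keyboard term is not, so the expression as a whole is false for pragma001.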
To TEST that MPI really works you might want to disable this by setting
START = TRUE (which would allow any job to start, regardless of the
current machine activity), or
START = ($(START)) || (Scheduler =?= $(DedicatedScheduler)).
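
The effect of OR-ing the existing policy with the dedicated-scheduler test can be sketched in Python as follows. This is only a model under stated assumptions: `DEDICATED` is a placeholder (the real scheduler name is masked in the archive), `meta_equals` imitates the ClassAd `=?=` operator with `None` standing in for UNDEFINED, and the job ad is assumed to carry a `Scheduler` attribute when it comes from the dedicated scheduler, which is what the expression tests for.

```python
# Placeholder name; the real value is masked in the archived message.
DEDICATED = "DedicatedScheduler@example"


def meta_equals(a, b):
    """Model of ClassAd =?=: true only for identical type and value.

    None stands in for UNDEFINED. The result is always a plain bool,
    so a missing attribute compares as False rather than UNDEFINED.
    """
    return type(a) is type(b) and a == b


def combined_start(original_start, job_ad):
    """Model of START = ($(START)) || (Scheduler =?= $(DedicatedScheduler))."""
    return original_start or meta_equals(job_ad.get("Scheduler"), DEDICATED)


# A dedicated (MPI) job is accepted even while the old policy says no.
print(combined_start(False, {"Scheduler": DEDICATED}))
# An ordinary job still falls back to the old policy.
print(combined_start(False, {}))
```

The design point is that ordinary jobs keep the owner-friendly policy, while jobs coming from the dedicated scheduler bypass it.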
Mark
On Fri, 2005-04-29 at 15:24 +0800, Li-Yung_Ho wrote:
> Dear all
>
> My MPI job always stays IDLE in my computing pool.
> The job is an example from the mpich package, in the
> subdirectory "example": cpi (calculate pi).
> I have set up the dedicated scheduler and dedicated resources (with NFS).
> The model is
> pragma001.grid.sinica.edu.tw - central manager and dedicated scheduler
> pragma002.grid.sinica.edu.tw - dedicated resource
> pragma004.grid.sinica.edu.tw - dedicated resource
>
> The following are some messages, the job description file, the local
> configuration files, and the SchedLog.
>
> =================================================================
> Job description file :
>
> universe = MPI
> executable = cpi
> machine_count = 1
> log = logofcpi.new
> error = errofcpi.$(NODE).new
> output = outofcpi.$(NODE).new
> queue
>
> =================================================================
>
> [lyho@pragma001 pragma001]$ condor_q
>
>
> -- Submitter: pragma001.grid.sinica.edu.tw : <140.109.98.21:33670> : pragma001.grid.sinica.edu.tw
> ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
> 136.0 lyho 4/29 14:22 0+00:00:00 I 0 0.3 cpi
>
> 1 jobs; 1 idle, 0 running, 0 held
>
>
> [lyho@pragma001 pragma001]$ condor_status
>
> Name          OpSys   Arch    State   Activity  LoadAv  Mem   ActvtyTime
>
> pragma001.gri LINUX   INTEL   Owner   Idle      0.000    469  0+00:35:04
> pragma002.gri LINUX   INTEL   Owner   Idle      1.000    469  0+03:42:04
> pragma004.gri LINUX   INTEL   Owner   Idle      1.000   1004  0+03:40:06
>
>              Machines  Owner  Claimed  Unclaimed  Matched  Preempting
>
> INTEL/LINUX         3      3        0          0        0           0
>
>       Total         3      3        0          0        0           0
>
> =================================================================
>
> [lyho@pragma001 pragma001]$ condor_q -analyze
>
>
> -- Submitter: pragma001.grid.sinica.edu.tw : <140.109.98.21:33670> : pragma001.grid.sinica.edu.tw
> ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
> ---
> 136.000: Run analysis summary. Of 3 machines,
> 0 are rejected by your job's requirements
> 3 reject your job because of their own requirements
> 0 match, but are serving users with a better priority in the pool
> 0 match, match, but reject the job for unknown reasons
> 0 match, but will not currently preempt their existing job
> 0 are available to run your job
>
> WARNING: Be advised: Request 136.0 did not match any resource's constraints
>
>
> WARNING: Analysis is meaningless for MPI universe jobs.
>
> 1 jobs; 1 idle, 0 running, 0 held
>
> ===================================================================
>
> [lyho@pragma001 pragma001]$ condor_status -l|less
>
> MyType = "Machine"
> TargetType = "Job"
> Name = "pragma001.grid.sinica.edu.tw"
> Machine = "pragma001.grid.sinica.edu.tw"
> Rank = 0.000000
> CpuBusy = ((LoadAvg - CondorLoadAvg) >= 0.500000)
> COLLECTOR_HOST_STRING = "pragma001.grid.sinica.edu.tw"
> CondorVersion = "$CondorVersion: 6.6.9 Mar 10 2005 $"
> CondorPlatform = "$CondorPlatform: I386-LINUX_RH9 $"
> VirtualMachineID = 1
> VirtualMemory = 945720
> Disk = 59017960
> CondorLoadAvg = 0.000000
> LoadAvg = 0.010000
> KeyboardIdle = 175
> ConsoleIdle = 30290412
> Memory = 469
> Cpus = 1
> StartdIpAddr = "<140.109.98.21:33669>"
> Arch = "INTEL"
> OpSys = "LINUX"
> UidDomain = "grid.sinica.edu.tw"
> FileSystemDomain = "grid.sinica.edu.tw"
> Subnet = "140.109.98"
> HasIOProxy = TRUE
> TotalVirtualMemory = 945720
> TotalDisk = 59017960
> KFlops = 868714
> Mips = 1941
> LastBenchmark = 1114753475
> TotalLoadAvg = 0.010000
> TotalCondorLoadAvg = 0.000000
> ClockMin = 894
> ClockDay = 5
> TotalVirtualMachines = 1
> HasFileTransfer = TRUE
> HasMPI = TRUE
> HasJICLocalConfig = TRUE
> HasJICLocalStdin = TRUE
> HasPVM = TRUE
> HasRemoteSyscalls = TRUE
> HasCheckpointing = TRUE
> StarterAbilityList = "HasFileTransfer,HasMPI,HasJICLocalConfig,HasJICLocalStdin,HasPVM,HasRemoteSyscalls,HasCheckpointing"
> CpuBusyTime = 0
> CpuIsBusy = FALSE
> State = "Owner"
> EnteredCurrentState = 1114755875
> Activity = "Idle"
> EnteredCurrentActivity = 1114755875
> Start = ((KeyboardIdle > 15 * 60) && (((LoadAvg - CondorLoadAvg) <= 0.300000) || (State != "Unclaimed" && State != "Owner")))
> Requirements = START
> CurrentRank = 0.000000
> DaemonStartTime = 1114695432
> UpdateSequenceNumber = 210
> MyAddress = "<140.109.98.21:33669>"
> LastHeardFrom = 1114757679
> UpdatesTotal = 211
> UpdatesSequenced = 210
> UpdatesLost = 0
> UpdatesHistory = "0x00000000000000000000000000000000"
>
> MyType = "Machine"
> TargetType = "Job"
> Name = "pragma002.grid.sinica.edu.tw"
> Machine = "pragma002.grid.sinica.edu.tw"
> Rank = Scheduler =?= "DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxxxxxx"
> CpuBusy = ((LoadAvg - CondorLoadAvg) >= 0.500000)
> COLLECTOR_HOST_STRING = "pragma001.grid.sinica.edu.tw"
> DedicatedScheduler = "DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxxxxxx"
> CondorVersion = "$CondorVersion: 6.6.9 Mar 10 2005 $"
> CondorPlatform = "$CondorPlatform: I386-LINUX_RH9 $"
> VirtualMachineID = 1
> VirtualMemory = 953140
> Disk = 59017960
> CondorLoadAvg = 0.000000
> LoadAvg = 0.870000
> KeyboardIdle = 1564555
> ConsoleIdle = 1564995
> Memory = 469
> Cpus = 1
> StartdIpAddr = "<140.109.98.22:48852>"
> Arch = "INTEL"
> OpSys = "LINUX"
> UidDomain = "grid.sinica.edu.tw"
> FileSystemDomain = "grid.sinica.edu.tw"
> Subnet = "140.109.98"
> HasIOProxy = TRUE
> TotalVirtualMemory = 953140
> TotalDisk = 59017960
> KFlops = 832323
> Mips = 2033
> LastBenchmark = 1114744656
> TotalLoadAvg = 0.870000
> TotalCondorLoadAvg = 0.000000
> ClockMin = 893
> ClockDay = 5
> TotalVirtualMachines = 1
> HasFileTransfer = TRUE
> HasMPI = TRUE
> HasJICLocalConfig = TRUE
> HasJICLocalStdin = TRUE
> HasPVM = TRUE
> HasRemoteSyscalls = TRUE
> HasCheckpointing = TRUE
> StarterAbilityList = "HasFileTransfer,HasMPI,HasJICLocalConfig,HasJICLocalStdin,HasPVM,HasRemoteSyscalls,HasCheckpointing"
> CpuBusyTime = 4
> CpuIsBusy = TRUE
> State = "Owner"
> EnteredCurrentState = 1114744651
> Activity = "Idle"
> EnteredCurrentActivity = 1114744651
> Start = Scheduler =?= "DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxxxxxx"
> Requirements = START
> CurrentRank = 0.000000
> DaemonStartTime = 1114744650
> UpdateSequenceNumber = 44
> MyAddress = "<140.109.98.22:48852>"
> LastHeardFrom = 1114757675
> UpdatesTotal = 107
> UpdatesSequenced = 105
> UpdatesLost = 0
> UpdatesHistory = "0x00000000000000000000000000000000"
>
> MyType = "Machine"
> TargetType = "Job"
> Name = "pragma004.grid.sinica.edu.tw"
> Machine = "pragma004.grid.sinica.edu.tw"
> Rank = Scheduler =?= "DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxxxxxx"
> CpuBusy = ((LoadAvg - CondorLoadAvg) >= 0.500000)
> COLLECTOR_HOST_STRING = "pragma001.grid.sinica.edu.tw"
> DedicatedScheduler = "DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxxxxxx"
> CondorVersion = "$CondorVersion: 6.6.9 Mar 10 2005 $"
> CondorPlatform = "$CondorPlatform: I386-LINUX_RH9 $"
> VirtualMachineID = 1
> VirtualMemory = 2013048
> Disk = 59017960
> CondorLoadAvg = 0.000000
> LoadAvg = 0.890000
> KeyboardIdle = 8300
> ConsoleIdle = 30290424
> Memory = 1004
> Cpus = 1
> StartdIpAddr = "<140.109.98.24:35849>"
> Arch = "INTEL"
> OpSys = "LINUX"
> UidDomain = "grid.sinica.edu.tw"
> FileSystemDomain = "grid.sinica.edu.tw"
> Subnet = "140.109.98"
> HasIOProxy = TRUE
> TotalVirtualMemory = 2013048
> TotalDisk = 59017960
> KFlops = 547145
> Mips = 1324
> LastBenchmark = 1114744778
> TotalLoadAvg = 0.890000
> TotalCondorLoadAvg = 0.000000
> ClockMin = 894
> ClockDay = 5
> TotalVirtualMachines = 1
> HasFileTransfer = TRUE
> HasMPI = TRUE
> HasJICLocalConfig = TRUE
> HasJICLocalStdin = TRUE
> HasPVM = TRUE
> HasRemoteSyscalls = TRUE
> HasCheckpointing = TRUE
> StarterAbilityList = "HasFileTransfer,HasMPI,HasJICLocalConfig,HasJICLocalStdin,HasPVM,HasRemoteSyscalls,HasCheckpointing"
> CpuBusyTime = 4
> CpuIsBusy = TRUE
> State = "Owner"
> EnteredCurrentState = 1114744769
> Activity = "Idle"
> EnteredCurrentActivity = 1114744769
> Start = Scheduler =?= "DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxxxxxx"
> Requirements = START
> CurrentRank = 0.000000
> DaemonStartTime = 1114744768
> UpdateSequenceNumber = 44
> MyAddress = "<140.109.98.24:35849>"
> LastHeardFrom = 1114757675
> UpdatesTotal = 106
> UpdatesSequenced = 104
> UpdatesLost = 0
> UpdatesHistory = "0x00000000000000000000000000000000"
>
> ================================================================
>
> pragma001 local configuration file :
>
> COLLECTOR_NAME = ASCC-Condor
> DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, STARTD, SCHEDD
> COLLECTOR = $(SBIN)/condor_collector
> NEGOTIATOR = $(SBIN)/condor_negotiator
> UNUSED_CLAIM_TIMEOUT = 0
>
> =================================================================
>
> pragma002 and pragma004 local configuration file :
>
> DedicatedScheduler = "DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxxxxxx"
>
> ##--------------------------------------------------------------------
> ## 1) Only run dedicated jobs
> ##--------------------------------------------------------------------
> START = Scheduler =?= $(DedicatedScheduler)
> SUSPEND = False
> CONTINUE = True
> PREEMPT = False
> KILL = False
> WANT_SUSPEND = False
> WANT_VACATE = False
> RANK = Scheduler =?= $(DedicatedScheduler)
> MPI_CONDOR_RSH_PATH = $(SBIN)
> STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler
>
>
>
> ========================================================================
>
> SchedLog on pragma001 :
>
> 4/29 15:12:50 Found idle MPI cluster 136
> 4/29 15:12:50 Started timer (182) to call handleDedicatedJobs() in 2 secs
> 4/29 15:12:50 JobsRunning = 0
> 4/29 15:12:50 JobsIdle = 0
> 4/29 15:12:50 JobsHeld = 0
> 4/29 15:12:50 JobsRemoved = 0
> 4/29 15:12:50 SchedUniverseJobsRunning = 0
> 4/29 15:12:50 SchedUniverseJobsIdle = 0
> 4/29 15:12:50 N_Owners = 1
> 4/29 15:12:50 MaxJobsRunning = 200
> 4/29 15:12:50 Attempting to send update via UDP to collector pragma001.grid.sinica.edu.tw <140.109.98.21:9618>
> 4/29 15:12:50 SEC_DEBUG_PRINT_KEYS is undefined, using default value of False
> 4/29 15:12:50 Sent HEART BEAT ad to central mgr: Number of submittors=1
> 4/29 15:12:50 Attempting to send update via UDP to collector marlin.bii.a-star.edu.sg <202.6.243.157:9618>
> 4/29 15:12:50 SEC_DEBUG_PRINT_KEYS is undefined, using default value of False
> 4/29 15:12:50 Changed attribute: RunningJobs = 0
> 4/29 15:12:50 Changed attribute: IdleJobs = 0
> 4/29 15:12:50 Changed attribute: HeldJobs = 0
> 4/29 15:12:50 Changed attribute: FlockedJobs = 0
> 4/29 15:12:50 Changed attribute: Name = "lyho@xxxxxxxxxxxxxxxxxx"
> 4/29 15:12:50 Attempting to send update via UDP to collector pragma001.grid.sinica.edu.tw <140.109.98.21:9618>
> 4/29 15:12:50 SEC_DEBUG_PRINT_KEYS is undefined, using default value of False
> 4/29 15:12:50 Sent ad to central manager for lyho@xxxxxxxxxxxxxxxxxx
> 4/29 15:12:50 ============ Begin clean_shadow_recs =============
> 4/29 15:12:50 ============ End clean_shadow_recs =============
> 4/29 15:12:52 Starting DedicatedScheduler::handleDedicatedJobs
> 4/29 15:12:52 Found 1 idle dedicated job(s)
> 4/29 15:12:52 DedicatedScheduler: Listing all dedicated jobs -
> 4/29 15:12:52 Dedicated job: 136.0 lyho
> 4/29 15:12:52 SCHEDD_TIMEOUT_MULTIPLIER is undefined, using default value of 0
> 4/29 15:12:52 SEC_DEBUG_PRINT_KEYS is undefined, using default value of False
> 4/29 15:12:52 Found 0 potential dedicated resources
> 4/29 15:12:52 Displaying dedicated resources:
> 4/29 15:12:52 No resources claimed
> 4/29 15:12:52 In DedicatedScheduler::publishRequestAd()
> 4/29 15:12:52 Attempting to send update via UDP to collector pragma001.grid.sinica.edu.tw <140.109.98.21:9618>
> 4/29 15:12:52 SEC_DEBUG_PRINT_KEYS is undefined, using default value of False
> 4/29 15:12:52 Finished DedicatedScheduler::handleDedicatedJobs
>
>
> ==========================================================================
>
>
>
> I found that the resources' state is always "Owner". Is that the problem?
>
>
> Can anyone give me a BIG help?
> Thanks a lot