Re: [Condor-users] MPI job problem
Dear Greg,
Of course, and thanks for your help.
This is the SchedLog of pragma001.grid.sinica.edu.tw.
There is nothing in the startd log.
------------------------------------------------------------------------
5/3 09:04:01 -------- Begin starting jobs --------
5/3 09:04:01 -------- Done starting jobs --------
5/3 09:04:02 JobsRunning = 0
5/3 09:04:02 JobsIdle = 0
5/3 09:04:02 JobsHeld = 0
5/3 09:04:02 JobsRemoved = 0
5/3 09:04:02 SchedUniverseJobsRunning = 0
5/3 09:04:02 SchedUniverseJobsIdle = 0
5/3 09:04:02 N_Owners = 0
5/3 09:04:02 MaxJobsRunning = 200
5/3 09:04:02 Attempting to send update via UDP to collector pragma001.grid.sinica.edu.tw <140.109.98.21:9618>
5/3 09:04:02 SEC_DEBUG_PRINT_KEYS is undefined, using default value of False
5/3 09:04:02 Sent HEART BEAT ad to central mgr: Number of submittors=0
5/3 09:04:02 Attempting to send update via UDP to collector marlin.bii.a-star.edu.sg <202.6.243.157:9618>
5/3 09:04:02 SEC_DEBUG_PRINT_KEYS is undefined, using default value of False
5/3 09:04:02 ============ Begin clean_shadow_recs =============
5/3 09:04:02 ============ End clean_shadow_recs =============
5/3 09:06:28 DaemonCore: Command received via TCP from host <140.109.98.21:44215>
5/3 09:06:28 DaemonCore: received command 1111 (QMGMT_CMD), calling handler (handle_q)
5/3 09:06:28 condor_read(): Socket closed when trying to read buffer
5/3 09:06:28 QMGR Connection closed
5/3 09:07:35 DaemonCore: Command received via TCP from host <140.109.98.21:44245>
5/3 09:07:35 DaemonCore: received command 1111 (QMGMT_CMD), calling handler (handle_q)
5/3 09:07:35 AUTHENTICATE_FS: used file /tmp/qmgr_6LKOTY, status: 1
5/3 09:07:35 OwnerCheck retval 1 (success), super_user
5/3 09:07:35 OwnerCheck retval 1 (success), super_user
5/3 09:07:36 wrote 300788 bytes
5/3 09:07:36 done with transfer, errno = 0
5/3 09:07:36 condor_read(): Socket closed when trying to read buffer
5/3 09:07:36 QMGR Connection closed
5/3 09:07:36 DaemonCore: Command received via TCP from host <140.109.98.21:44256>
5/3 09:07:36 DaemonCore: received command 464 (ATTEMPT_ACCESS), calling handler (attempt_access_handler)
5/3 09:07:36 ATTEMPT_ACCESS: Switching to user uid: 510 gid: 510.
5/3 09:07:36 Checking file /home/lyho/test/examples/condor_test/outofcpi.0.new for write permission.
5/3 09:07:36 Switching back to old priv state.
5/3 09:07:36 DaemonCore: Command received via TCP from host <140.109.98.21:44257>
5/3 09:07:36 DaemonCore: received command 464 (ATTEMPT_ACCESS), calling handler (attempt_access_handler)
5/3 09:07:36 ATTEMPT_ACCESS: Switching to user uid: 510 gid: 510.
5/3 09:07:36 Checking file /home/lyho/test/examples/condor_test/errofcpi.0.new for write permission.
5/3 09:07:36 Switching back to old priv state.
5/3 09:07:36 Found idle MPI cluster 143
5/3 09:07:36 Started timer (1035) to call handleDedicatedJobs() in 2 secs
5/3 09:07:36 JobsRunning = 0
5/3 09:07:36 JobsIdle = 0
5/3 09:07:36 JobsHeld = 0
5/3 09:07:36 JobsRemoved = 0
5/3 09:07:36 SchedUniverseJobsRunning = 0
5/3 09:07:36 SchedUniverseJobsIdle = 0
5/3 09:07:36 N_Owners = 1
5/3 09:07:36 MaxJobsRunning = 200
5/3 09:07:36 Attempting to send update via UDP to collector pragma001.grid.sinica.edu.tw <140.109.98.21:9618>
5/3 09:07:36 SEC_DEBUG_PRINT_KEYS is undefined, using default value of False
5/3 09:07:36 Sent HEART BEAT ad to central mgr: Number of submittors=1
5/3 09:07:36 Attempting to send update via UDP to collector marlin.bii.a-star.edu.sg <202.6.243.157:9618>
5/3 09:07:36 SEC_DEBUG_PRINT_KEYS is undefined, using default value of False
5/3 09:07:36 Changed attribute: RunningJobs = 0
5/3 09:07:36 Changed attribute: IdleJobs = 0
5/3 09:07:36 Changed attribute: HeldJobs = 0
5/3 09:07:36 Changed attribute: FlockedJobs = 0
5/3 09:07:36 Changed attribute: Name = "lyho@xxxxxxxxxxxxxxxxxx"
5/3 09:07:36 Attempting to send update via UDP to collector pragma001.grid.sinica.edu.tw <140.109.98.21:9618>
5/3 09:07:36 SEC_DEBUG_PRINT_KEYS is undefined, using default value of False
5/3 09:07:36 Sent ad to central manager for lyho@xxxxxxxxxxxxxxxxxx
5/3 09:07:36 ============ Begin clean_shadow_recs =============
5/3 09:07:36 ============ End clean_shadow_recs =============
5/3 09:07:36 Called reschedule_negotiator()
5/3 09:07:36 Sending RESCHEDULE command to negotiator(s)
5/3 09:07:36 SEC_DEBUG_PRINT_KEYS is undefined, using default value of False
5/3 09:07:36 SEC_DEBUG_PRINT_KEYS is undefined, using default value of False
5/3 09:07:38 Starting DedicatedScheduler::handleDedicatedJobs
5/3 09:07:38 Found 1 idle dedicated job(s)
5/3 09:07:38 DedicatedScheduler: Listing all dedicated jobs -
5/3 09:07:38 Dedicated job: 143.0 lyho
5/3 09:07:38 SCHEDD_TIMEOUT_MULTIPLIER is undefined, using default value of 0
5/3 09:07:38 SEC_DEBUG_PRINT_KEYS is undefined, using default value of False
5/3 09:07:38 Found 0 potential dedicated resources
5/3 09:07:38 Displaying dedicated resources:
5/3 09:07:38 No resources claimed
5/3 09:07:38 In DedicatedScheduler::publishRequestAd()
5/3 09:07:38 Attempting to send update via UDP to collector pragma001.grid.sinica.edu.tw <140.109.98.21:9618>
5/3 09:07:38 SEC_DEBUG_PRINT_KEYS is undefined, using default value of False
5/3 09:07:38 Finished DedicatedScheduler::handleDedicatedJobs
5/3 09:07:38 DaemonCore: Command received via TCP from host <140.109.98.21:44271>
5/3 09:07:38 DaemonCore: received command 1111 (QMGMT_CMD), calling handler (handle_q)
5/3 09:07:38 condor_read(): Socket closed when trying to read buffer
5/3 09:07:38 QMGR Connection closed
5/3 09:07:39 DaemonCore: Command received via TCP from host <140.109.98.21:44284>
5/3 09:07:39 DaemonCore: received command 1111 (QMGMT_CMD), calling handler (handle_q)
5/3 09:07:39 condor_read(): Socket closed when trying to read buffer
5/3 09:07:39 QMGR Connection closed
5/3 09:07:40 DaemonCore: Command received via TCP from host <140.109.98.21:44297>
5/3 09:07:40 DaemonCore: received command 1111 (QMGMT_CMD), calling handler (handle_q)
5/3 09:07:40 condor_read(): Socket closed when trying to read buffer
5/3 09:07:40 QMGR Connection closed
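
For readers skimming the log above, the dedicated scheduler's own counts ("Found 1 idle dedicated job(s)", "Found 0 potential dedicated resources") look like the most relevant lines for an MPI job. The following is a minimal sketch, not a Condor tool, for pulling those counts out of SchedLog text. The function name and the sample string are illustrative assumptions; the sample lines are copied from the log above.

```python
import re

# Sample lines copied from the SchedLog above (timestamps included).
SAMPLE = """\
5/3 09:07:38 Starting DedicatedScheduler::handleDedicatedJobs
5/3 09:07:38 Found 1 idle dedicated job(s)
5/3 09:07:38 Found 0 potential dedicated resources
5/3 09:07:38 No resources claimed
"""

# Matches the two "Found N ..." messages printed by the dedicated scheduler.
PATTERN = re.compile(
    r"Found (\d+) (idle dedicated job\(s\)|potential dedicated resources)"
)

def dedicated_counts(log_text):
    """Return {message: count} for the dedicated scheduler's 'Found N' lines."""
    counts = {}
    for line in log_text.splitlines():
        match = PATTERN.search(line)
        if match:
            counts[match.group(2)] = int(match.group(1))
    return counts

print(dedicated_counts(SAMPLE))
```

With the sample above this reports one idle dedicated job and zero potential dedicated resources, which is the mismatch this thread is about.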
---------------------------------------------------------------------------
job status :
---------------------------------------------------------------------------
[lyho@pragma001 log]$ condor_q
-- Submitter: pragma001.grid.sinica.edu.tw : <140.109.98.21:33670> : pragma001.grid.sinica.edu.tw
 ID      OWNER   SUBMITTED    RUN_TIME    ST  PRI  SIZE  CMD
 143.0   lyho    5/3 09:07    0+00:00:00  I   0    0.3   cpi
1 jobs; 1 idle, 0 running, 0 held
---------------------------------------------------------------------------
[lyho@pragma001 log]$ condor_q -l
-- Submitter: pragma001.grid.sinica.edu.tw : <140.109.98.21:33670> : pragma001.grid.sinica.edu.tw
MyType = "Job"
TargetType = "Machine"
ClusterId = 143
QDate = 1115082455
CompletionDate = 0
Owner = "lyho"
RemoteWallClockTime = 0.000000
LocalUserCpu = 0.000000
LocalSysCpu = 0.000000
RemoteUserCpu = 0.000000
RemoteSysCpu = 0.000000
ExitStatus = 0
NumCkpts = 0
NumRestarts = 0
NumSystemHolds = 0
CommittedTime = 0
TotalSuspensions = 0
LastSuspensionTime = 0
CumulativeSuspensionTime = 0
ExitBySignal = FALSE
CondorVersion = "$CondorVersion: 6.6.9 Mar 10 2005 $"
CondorPlatform = "$CondorPlatform: I386-LINUX_RH9 $"
RootDir = "/"
Iwd = "/home/lyho/test/examples/condor_test"
JobUniverse = 8
Cmd = "/home/lyho/test/examples/condor_test/cpi"
CurrentHosts = 0
WantRemoteSyscalls = FALSE
WantCheckpoint = FALSE
MinHosts = 2
MaxHosts = 2
JobStatus = 1
EnteredCurrentStatus = 1115082456
JobPrio = 0
User = "lyho@xxxxxxxxxxxxxxxxxx"
NiceUser = FALSE
Env = ""
JobNotification = 2
UserLog = "/home/lyho/test/examples/condor_test/logofcpi.new"
CoreSize = 0
KillSig = "SIGTERM"
Rank = 0.000000
In = "/dev/null"
TransferIn = FALSE
Out = "outofcpi.#MpInOdE#.new"
Err = "errofcpi.#MpInOdE#.new"
BufferSize = 524288
BufferBlockSize = 32768
ShouldTransferFiles = "NO"
TransferFiles = "NEVER"
ImageSize = 294
ExecutableSize = 294
DiskUsage = 294
Requirements = (Arch == "INTEL") && (OpSys == "LINUX") && (Disk >= DiskUsage) && ((Memory * 1024) >= ImageSize) && (HasMPI) && (TARGET.FileSystemDomain == MY.FileSystemDomain)
FileSystemDomain = "grid.sinica.edu.tw"
PeriodicHold = FALSE
PeriodicRelease = FALSE
PeriodicRemove = FALSE
OnExitHold = FALSE
OnExitRemove = TRUE
LeaveJobInQueue = FALSE
Args = ""
ProcId = 0
Scheduler = "DedicatedScheduler@lyho@pragma001.grid.sinica.edu.tw"
ServerTime = 1115083476
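
The QDate, EnteredCurrentStatus and ServerTime values in the job ad above are Unix timestamps. As a rough sketch, they can be converted to wall-clock time to line them up with the SchedLog entries. The UTC+8 offset is an assumption (it is not stated anywhere in this message), and the helper name is made up for illustration.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the cluster clocks run on UTC+8; this is not stated in the thread.
LOCAL_TZ = timezone(timedelta(hours=8))

def to_local(timestamp):
    """Format a Unix timestamp as month/day hour:minute:second in LOCAL_TZ."""
    return datetime.fromtimestamp(timestamp, LOCAL_TZ).strftime("%m/%d %H:%M:%S")

# Values copied from the job ad above.
q_date = 1115082455
entered_current_status = 1115082456
server_time = 1115083476

print(to_local(q_date))                      # 05/03 09:07:35
print(to_local(server_time))                 # 05/03 09:24:36
print(server_time - entered_current_status)  # 1020 seconds in the idle state
```

Under that assumption the submission time agrees with the 09:07:35 QMGMT_CMD entry in the SchedLog, and the job had been idle for about 17 minutes when condor_q -l was run.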
-------------------------------------------------------------------------
machine status:
-------------------------------------------------------------------------
[lyho@pragma001 log]$ condor_status
Name          OpSys  Arch   State      Activity  LoadAv  Mem   ActvtyTime
pragma001.gri LINUX  INTEL  Owner      Idle      0.000   469   0+00:15:04
pragma002.gri LINUX  INTEL  Unclaimed  Idle      0.890   469   0+03:36:01
pragma004.gri LINUX  INTEL  Unclaimed  Idle      1.000   1004  0+03:34:48
             Machines Owner Claimed Unclaimed Matched Preempting
INTEL/LINUX         3     1       0         2       0          0
      Total         3     1       0         2       0          0
-------------------------------------------------------------------------
[lyho@pragma001 log]$ condor_status -l
MyType = "Machine"
TargetType = "Job"
Name = "pragma001.grid.sinica.edu.tw"
Machine = "pragma001.grid.sinica.edu.tw"
Rank = 0.000000
CpuBusy = ((LoadAvg - CondorLoadAvg) >= 0.500000)
COLLECTOR_HOST_STRING = "pragma001.grid.sinica.edu.tw"
CondorVersion = "$CondorVersion: 6.6.9 Mar 10 2005 $"
CondorPlatform = "$CondorPlatform: I386-LINUX_RH9 $"
VirtualMachineID = 1
VirtualMemory = 940764
Disk = 58974996
CondorLoadAvg = 0.000000
LoadAvg = 0.010000
KeyboardIdle = 154
ConsoleIdle = 30616471
Memory = 469
Cpus = 1
StartdIpAddr = "<140.109.98.21:33669>"
Arch = "INTEL"
OpSys = "LINUX"
UidDomain = "grid.sinica.edu.tw"
FileSystemDomain = "grid.sinica.edu.tw"
Subnet = "140.109.98"
HasIOProxy = TRUE
TotalVirtualMemory = 940764
TotalDisk = 58974996
KFlops = 875905
Mips = 1905
LastBenchmark = 1115071434
TotalLoadAvg = 0.010000
TotalCondorLoadAvg = 0.000000
ClockMin = 568
ClockDay = 2
TotalVirtualMachines = 1
HasFileTransfer = TRUE
HasMPI = TRUE
HasJICLocalConfig = TRUE
HasJICLocalStdin = TRUE
HasPVM = TRUE
HasRemoteSyscalls = TRUE
HasCheckpointing = TRUE
StarterAbilityList = "HasFileTransfer,HasMPI,HasJICLocalConfig,HasJICLocalStdin,HasPVM,HasRemoteSyscalls,HasCheckpointing"
CpuBusyTime = 0
CpuIsBusy = FALSE
State = "Owner"
EnteredCurrentState = 1115082534
Activity = "Idle"
EnteredCurrentActivity = 1115082534
Start = ((KeyboardIdle > 15 * 60) && (((LoadAvg - CondorLoadAvg) <= 0.300000) || (State != "Unclaimed" && State != "Owner")))
Requirements = START
CurrentRank = 0.000000
DaemonStartTime = 1114695432
UpdateSequenceNumber = 1297
MyAddress = "<140.109.98.21:33669>"
LastHeardFrom = 1115083738
UpdatesTotal = 1298
UpdatesSequenced = 1297
UpdatesLost = 0
UpdatesHistory = "0x00000000000000000000000000000000"
MyType = "Machine"
TargetType = "Job"
Name = "pragma002.grid.sinica.edu.tw"
Machine = "pragma002.grid.sinica.edu.tw"
Rank = Scheduler =?= "DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxxxxxx"
CpuBusy = ((LoadAvg - CondorLoadAvg) >= 0.500000)
COLLECTOR_HOST_STRING = "pragma001.grid.sinica.edu.tw"
DedicatedScheduler = "DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxxxxxx"
CondorVersion = "$CondorVersion: 6.6.9 Mar 10 2005 $"
CondorPlatform = "$CondorPlatform: I386-LINUX_RH9 $"
VirtualMachineID = 1
VirtualMemory = 945368
Disk = 58974996
CondorLoadAvg = 0.000000
LoadAvg = 0.990000
KeyboardIdle = 44595
ConsoleIdle = 1891066
Memory = 469
Cpus = 1
StartdIpAddr = "<140.109.98.22:48852>"
Arch = "INTEL"
OpSys = "LINUX"
UidDomain = "grid.sinica.edu.tw"
FileSystemDomain = "grid.sinica.edu.tw"
Subnet = "140.109.98"
HasIOProxy = TRUE
TotalVirtualMemory = 945368
TotalDisk = 58974996
KFlops = 801365
Mips = 1880
LastBenchmark = 1115070484
TotalLoadAvg = 0.990000
TotalCondorLoadAvg = 0.000000
ClockMin = 568
ClockDay = 2
TotalVirtualMachines = 1
HasFileTransfer = TRUE
HasMPI = TRUE
HasJICLocalConfig = TRUE
HasJICLocalStdin = TRUE
HasPVM = TRUE
HasRemoteSyscalls = TRUE
HasCheckpointing = TRUE
StarterAbilityList = "HasFileTransfer,HasMPI,HasJICLocalConfig,HasJICLocalStdin,HasPVM,HasRemoteSyscalls,HasCheckpointing"
CpuBusyTime = 304
CpuIsBusy = TRUE
State = "Unclaimed"
EnteredCurrentState = 1115011084
Activity = "Idle"
EnteredCurrentActivity = 1115070484
Start = TRUE
Requirements = START
CurrentRank = 0.000000
DaemonStartTime = 1114744650
UpdateSequenceNumber = 1132
MyAddress = "<140.109.98.22:48852>"
LastHeardFrom = 1115083745
UpdatesTotal = 1195
UpdatesSequenced = 1193
UpdatesLost = 0
UpdatesHistory = "0x00000000000000000000000000000000"
MyType = "Machine"
TargetType = "Job"
Name = "pragma004.grid.sinica.edu.tw"
Machine = "pragma004.grid.sinica.edu.tw"
Rank = Scheduler =?= "DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxxxxxx"
CpuBusy = ((LoadAvg - CondorLoadAvg) >= 0.500000)
COLLECTOR_HOST_STRING = "pragma001.grid.sinica.edu.tw"
DedicatedScheduler = "DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxxxxxx"
CondorVersion = "$CondorVersion: 6.6.9 Mar 10 2005 $"
CondorPlatform = "$CondorPlatform: I386-LINUX_RH9 $"
VirtualMachineID = 1
VirtualMemory = 2009408
Disk = 58974912
CondorLoadAvg = 0.000000
LoadAvg = 1.000000
KeyboardIdle = 37227
ConsoleIdle = 30616285
Memory = 1004
Cpus = 1
StartdIpAddr = "<140.109.98.24:35849>"
Arch = "INTEL"
OpSys = "LINUX"
UidDomain = "grid.sinica.edu.tw"
FileSystemDomain = "grid.sinica.edu.tw"
Subnet = "140.109.98"
HasIOProxy = TRUE
TotalVirtualMemory = 2009408
TotalDisk = 58974912
KFlops = 575797
Mips = 1281
LastBenchmark = 1115070647
TotalLoadAvg = 1.000000
TotalCondorLoadAvg = 0.000000
ClockMin = 565
ClockDay = 2
TotalVirtualMachines = 1
HasFileTransfer = TRUE
HasMPI = TRUE
HasJICLocalConfig = TRUE
HasJICLocalStdin = TRUE
HasPVM = TRUE
HasRemoteSyscalls = TRUE
HasCheckpointing = TRUE
StarterAbilityList = "HasFileTransfer,HasMPI,HasJICLocalConfig,HasJICLocalStdin,HasPVM,HasRemoteSyscalls,HasCheckpointing"
CpuBusyTime = 9305
CpuIsBusy = TRUE
State = "Unclaimed"
EnteredCurrentState = 1114767739
Activity = "Idle"
EnteredCurrentActivity = 1115070647
Start = TRUE
Requirements = START
CurrentRank = 0.000000
DaemonStartTime = 1114744768
UpdateSequenceNumber = 1130
MyAddress = "<140.109.98.24:35849>"
LastHeardFrom = 1115083535
UpdatesTotal = 1192
UpdatesSequenced = 1190
UpdatesLost = 0
UpdatesHistory = "0x00000000000000000000000000000000"
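
The pragma001 ad above still carries the default-style Start expression, which would explain why that machine stays in the Owner state while pragma002 and pragma004 (Start = TRUE) are Unclaimed. The sketch below is a hand translation of that expression into Python using the attribute values from the pragma001 ad. It is not a ClassAd evaluator and does not model UNDEFINED values; the function name is illustrative.

```python
# Attribute values copied from the pragma001 machine ad above.
pragma001 = {
    "KeyboardIdle": 154,     # seconds since the last keyboard activity
    "LoadAvg": 0.01,
    "CondorLoadAvg": 0.0,
    "State": "Owner",
}

def start_expression(ad):
    """Hand translation of pragma001's Start expression (illustrative only)."""
    keyboard_idle_long_enough = ad["KeyboardIdle"] > 15 * 60
    low_non_condor_load = (ad["LoadAvg"] - ad["CondorLoadAvg"]) <= 0.3
    already_running_condor_work = ad["State"] not in ("Unclaimed", "Owner")
    return keyboard_idle_long_enough and (
        low_non_condor_load or already_running_condor_work
    )

print(start_expression(pragma001))  # False: 154 s idle is below the 900 s threshold
```

The load condition is satisfied, so it is the recent keyboard activity (the user is logged in on pragma001) that makes Start evaluate to false here.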
--------------------------------------------------------------------------
condor_q -analyze :
--------------------------------------------------------------------------
[lyho@pragma001 log]$ condor_q -analyze
-- Submitter: pragma001.grid.sinica.edu.tw : <140.109.98.21:33670> : pragma001.grid.sinica.edu.tw
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
---
143.000: Run analysis summary. Of 3 machines,
0 are rejected by your job's requirements
1 reject your job because of their own requirements
0 match, but are serving users with a better priority in the pool
2 match, match, but reject the job for unknown reasons
0 match, but will not currently preempt their existing job
0 are available to run your job
WARNING: Analysis is meaningless for MPI universe jobs.
1 jobs; 1 idle, 0 running, 0 held
--------------------------------------------------------------------------
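
As a cross-check on the first line of the analysis above ("0 are rejected by your job's requirements"), the job's Requirements expression can be applied by hand to the three machine ads. This is only a sketch with values copied from the condor_q -l and condor_status -l output in this message. It is not how Condor matches MPI jobs (condor_q itself warns that the analysis is meaningless for the MPI universe), and it says nothing about the dedicated scheduler.

```python
# Values copied from the job ad (condor_q -l) above.
job = {"DiskUsage": 294, "ImageSize": 294, "FileSystemDomain": "grid.sinica.edu.tw"}

# Values copied from the three machine ads (condor_status -l) above.
machines = {
    "pragma001": {"Arch": "INTEL", "OpSys": "LINUX", "Disk": 58974996,
                  "Memory": 469, "HasMPI": True,
                  "FileSystemDomain": "grid.sinica.edu.tw"},
    "pragma002": {"Arch": "INTEL", "OpSys": "LINUX", "Disk": 58974996,
                  "Memory": 469, "HasMPI": True,
                  "FileSystemDomain": "grid.sinica.edu.tw"},
    "pragma004": {"Arch": "INTEL", "OpSys": "LINUX", "Disk": 58974912,
                  "Memory": 1004, "HasMPI": True,
                  "FileSystemDomain": "grid.sinica.edu.tw"},
}

def job_requirements_met(job_ad, machine_ad):
    """Hand translation of the job's Requirements expression (illustrative)."""
    return (
        machine_ad["Arch"] == "INTEL"
        and machine_ad["OpSys"] == "LINUX"
        and machine_ad["Disk"] >= job_ad["DiskUsage"]
        and machine_ad["Memory"] * 1024 >= job_ad["ImageSize"]
        and machine_ad["HasMPI"]
        and machine_ad["FileSystemDomain"] == job_ad["FileSystemDomain"]
    )

results = {name: job_requirements_met(job, ad) for name, ad in machines.items()}
print(results)
```

All three machines satisfy the job's side of the match, so whatever keeps the job idle is not the job's Requirements expression.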
I really appreciate your help!
Leon
On Mon, 02 May 2005 07:59:06 -0500, Greg Thain wrote
> Can you send us the log from the schedd and the startd?
>
> Thanks,
>
> -greg
>
> Li-Yung_Ho wrote:
> > Hi Mark and Greg
> > Thanks for your responses
> >
> > I changed the START attribute from Scheduler =?= $(DedicatedScheduler) to
> > True in the local configuration files of pragma002 and pragma004, and
> > indeed the status became "Unclaimed":
> > ------------------------------------------------------------------------
> > [lyho@pragma001 lyho]$ condor_status
> >
> > Name          OpSys  Arch   State      Activity  LoadAv  Mem   ActvtyTime
> >
> > pragma001.gri LINUX  INTEL  Owner      Idle      0.010   469   0+00:10:04
> > pragma002.gri LINUX  INTEL  Unclaimed  Idle      0.290   469   0+03:21:02
> > pragma004.gri LINUX  INTEL  Unclaimed  Idle      0.150   1004  0+03:19:48
> >
> >              Machines Owner Claimed Unclaimed Matched Preempting
> >
> > INTEL/LINUX         3     1       0         2       0          0
> >
> >       Total         3     1       0         2       0          0
> >
> > -------------------------------------------------------------------------
> >
> > but the job is still idle:
> >
> > -------------------------------------------------------------------------
> > [lyho@pragma001 lyho]$ condor_q
> >
> >
> > -- Submitter: pragma001.grid.sinica.edu.tw : <140.109.98.21:33670> : pragma001.grid.sinica.edu.tw
> >  ID      OWNER   SUBMITTED    RUN_TIME    ST  PRI  SIZE  CMD
> >  140.0   lyho    4/29 17:44   0+00:00:00  I   0    0.3   cpi
> >
> > 1 jobs; 1 idle, 0 running, 0 held
> >
> > ------------------------------------------------------------------------
> >
> > Then I tested a vanilla universe job.
> > The job description file:
> > ============================
> > universe = vanilla
> > executable = cpi
> > log = logofcpi.new
> > error = errofcpi.$(NODE).new
> > output = outofcpi.$(NODE).new
> > queue
> > =============================
> >
> > and it ran successfully:
> >
> > ------------------------------------------------------------------------
> > [lyho@pragma001 condor_test]$ condor_q
> >
> >
> > -- Submitter: pragma001.grid.sinica.edu.tw : <140.109.98.21:33670> : pragma001.grid.sinica.edu.tw
> >  ID      OWNER   SUBMITTED    RUN_TIME    ST  PRI  SIZE  CMD
> >  142.0   lyho    5/2 13:18    0+00:00:00  R   0    0.3   cpi
> >
> > 1 jobs; 0 idle, 1 running, 0 held
> > ---------------------------------------------------------------------
> >
> > The log, error, and output files:
> >
> > ---------------------------------------------------------------------
> > [lyho@pragma001 condor_test]$ more *.new
> > ::::::::::::::
> > errofcpi..new
> > ::::::::::::::
> > Process 0 on pragma002.grid.sinica.edu.tw
> > ::::::::::::::
> > logofcpi.new
> > ::::::::::::::
> > 000 (142.000.000) 05/02 13:18:57 Job submitted from host: <140.109.98.21:33670>
> > ...
> > 001 (142.000.000) 05/02 13:19:00 Job executing on host: <140.109.98.22:48852>
> > ...
> > 005 (142.000.000) 05/02 13:19:00 Job terminated.
> > (1) Normal termination (return value 0)
> > Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
> > Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
> > Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
> > Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
> > 0 - Run Bytes Sent By Job
> > 0 - Run Bytes Received By Job
> > 0 - Total Bytes Sent By Job
> > 0 - Total Bytes Received By Job
> > ...
> > ::::::::::::::
> > outofcpi..new
> > ::::::::::::::
> > pi is approximately 3.1416009869231254, Error is 0.0000083333333323
> > wall clock time = 0.000055
> >
> > --------------------------------------------------------------------
> >
> > So something is wrong specifically with the MPI job.
> >
> > Can anyone help me?
> >
> >
> >
> > On Fri, 29 Apr 2005 12:11:53 +0300, Mark Silberstein wrote
> >
> >>The problem seems to be in the fact that all your computers are in
> >>the "Owner" state, i.e. Condor is NOT allowed to start any job on them.
> >>Obviously you're using the START expression (in the condor_config),
> >>which makes your resources reject Condor jobs when they are under
> >>load or when there's some keyboard activity. ( the output you sent was
> >>produced on pragma001, so you were working on it, and two others
> >>have a load average of 1.000). To TEST that MPI really works you
> >>might want to disable this, by putting START=TRUE (which would
> >>allow any job to be invoked, regardless of the current computer
> >>activity), or START=($(START))||(Scheduler =?= $(DedicatedScheduler)).
> >>Mark
> >>
> >
> >
> > _______________________________________________
> > Condor-users mailing list
> > Condor-users@xxxxxxxxxxx
> > https://lists.cs.wisc.edu/mailman/listinfo/condor-users