Re: [Condor-users] MPI job problem
- Date: Mon, 02 May 2005 07:59:06 -0500
- From: Greg Thain <gthain@xxxxxxxxxxx>
- Subject: Re: [Condor-users] MPI job problem
Can you send us the log from the schedd and the startd?
Thanks,
-greg
Li-Yung_Ho wrote:
> Hi Mark and Greg,
> Thanks for your responses.
>
> I changed the START attribute from Scheduler =?= $(DedicatedScheduler) to True
> in the local configuration files of pragma002 and pragma004, and indeed the
> status became "Unclaimed":
> ------------------------------------------------------------------------
> [lyho@pragma001 lyho]$ condor_status
>
> Name          OpSys  Arch   State      Activity  LoadAv  Mem   ActvtyTime
>
> pragma001.gri LINUX  INTEL  Owner      Idle      0.010   469   0+00:10:04
> pragma002.gri LINUX  INTEL  Unclaimed  Idle      0.290   469   0+03:21:02
> pragma004.gri LINUX  INTEL  Unclaimed  Idle      0.150   1004  0+03:19:48
>
>              Machines Owner Claimed Unclaimed Matched Preempting
>
>  INTEL/LINUX        3     1       0         2       0          0
>
>        Total        3     1       0         2       0          0
>
> -------------------------------------------------------------------------
>
> but the job is still idle:
>
> -------------------------------------------------------------------------
> [lyho@pragma001 lyho]$ condor_q
>
>
> -- Submitter: pragma001.grid.sinica.edu.tw : <140.109.98.21:33670> : pragma001.grid.sinica.edu.tw
>  ID      OWNER    SUBMITTED     RUN_TIME ST PRI SIZE CMD
>  140.0   lyho     4/29 17:44   0+00:00:00 I  0   0.3  cpi
>
> 1 jobs; 1 idle, 0 running, 0 held
>
> ------------------------------------------------------------------------
>
> Then I tested a vanilla job with this job description file:
> ============================
> universe = vanilla
> executable = cpi
> log = logofcpi.new
> error = errofcpi.$(NODE).new
> output = outofcpi.$(NODE).new
> queue
> =============================
>
> and it runs successfully.
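>
> For comparison, the MPI version of this job would use a description file
> along these lines. This is only a sketch: the MPI submit file is not shown
> in this message, so machine_count = 2 (one node per Unclaimed machine) and
> the file names are assumptions. $(NODE) is filled in with the node number
> for MPI jobs; in the vanilla job it expands to nothing, hence file names
> such as errofcpi..new.
> ============================
> universe = MPI
> executable = cpi
> machine_count = 2
> log = logofcpi.mpi
> error = errofcpi.$(NODE).mpi
> output = outofcpi.$(NODE).mpi
> queue
> =============================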
>
> ------------------------------------------------------------------------
> [lyho@pragma001 condor_test]$ condor_q
>
>
> -- Submitter: pragma001.grid.sinica.edu.tw : <140.109.98.21:33670> : pragma001.grid.sinica.edu.tw
>  ID      OWNER    SUBMITTED     RUN_TIME ST PRI SIZE CMD
>  142.0   lyho     5/2  13:18   0+00:00:00 R  0   0.3  cpi
>
> 1 jobs; 0 idle, 1 running, 0 held
> ---------------------------------------------------------------------
>
> The log, error, and output files:
>
> ---------------------------------------------------------------------
> [lyho@pragma001 condor_test]$ more *.new
> ::::::::::::::
> errofcpi..new
> ::::::::::::::
> Process 0 on pragma002.grid.sinica.edu.tw
> ::::::::::::::
> logofcpi.new
> ::::::::::::::
> 000 (142.000.000) 05/02 13:18:57 Job submitted from host: <140.109.98.21:33670>
> ...
> 001 (142.000.000) 05/02 13:19:00 Job executing on host: <140.109.98.22:48852>
> ...
> 005 (142.000.000) 05/02 13:19:00 Job terminated.
> (1) Normal termination (return value 0)
> Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
> Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
> Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
> Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
> 0 - Run Bytes Sent By Job
> 0 - Run Bytes Received By Job
> 0 - Total Bytes Sent By Job
> 0 - Total Bytes Received By Job
> ...
> ::::::::::::::
> outofcpi..new
> ::::::::::::::
> pi is approximately 3.1416009869231254, Error is 0.0000083333333323
> wall clock time = 0.000055
>
> --------------------------------------------------------------------
>
> So something is wrong with the MPI job.
>
> Can anyone help me?
>
>
>
> On Fri, 29 Apr 2005 12:11:53 +0300, Mark Silberstein wrote
>
>>The problem seems to be that all your computers are in
>>the "Owner" state, i.e. Condor is NOT allowed to start any job on them.
>>Evidently you're using a START expression (in condor_config)
>>that makes your resources reject Condor jobs when they are under
>>load or when there is keyboard activity. (The output you sent was
>>produced on pragma001, so you were working on it, and the two others
>>have a load average of 1.000.) To TEST that MPI really works you
>>might want to disable this by setting START = TRUE (which would
>>allow any job to start, regardless of current machine
>>activity), or START = ($(START)) || (Scheduler =?= $(DedicatedScheduler)).
>>Mark
>>
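>>A minimal sketch of what the local configuration on pragma002 and
>>pragma004 might contain, assuming pragma001.grid.sinica.edu.tw (the
>>submit host in this thread) acts as the dedicated scheduler; substitute
>>your own host name. The attribute has to be advertised by the startd so
>>that the START expression can compare against it:
>>
>>  # Assumption: the dedicated scheduler runs on pragma001
>>  DedicatedScheduler = "DedicatedScheduler@pragma001.grid.sinica.edu.tw"
>>  # Publish the attribute in the machine ClassAd
>>  STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler
>>  # Keep the existing policy, but always accept the dedicated scheduler
>>  START = ($(START)) || (Scheduler =?= $(DedicatedScheduler))
>>
>>After editing the file, condor_reconfig makes the daemons re-read it.
>>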
>
>