On 10/24/06, Becky Gietzel <bgietzel@xxxxxxxxxxx> wrote:
On Oct 20, 2006, at 5:50 PM, Diego Bello wrote:

> Hi everyone, I have a Condor pool made of workstations to support MPI,
> simple jobs and dag, all using globus. Condor version is 6.8.0
>
> What I need is that MPI jobs could be stopped if a machine is used,
> which normally is between 10 am and 9 pm. I have tried some
> configurations taken from the condor manual, but some jobs don't
> start. I think there could be a problem with the start configuration.
>
> I'm now trying a dag job, with three jobs doing nothing more than
> /bin/hostname, but it gets to the queue, the first job starts running
> but, after several hours, it doesn't finish. If I send a globus job
> directly, it works. My proxy is valid for 48 hrs.
>
> I have attached my central manager and my exec nodes' config files.
> Can someone tell me if there is something wrong with my config files?
>

You'll want to adjust your START policy for the execute nodes as follows:

Add:

IsNighttime = (ClockMin < 600 || ClockMin > 1260)

Replace the START and PREEMPT expressions with:

START = ( (Scheduler =?= $(DedicatedScheduler) && $(IsNighttime) =?= TRUE && $(KeyboardIdleTime) > $(StartIdleTime) ) || $(START) )

PREEMPT = (Scheduler =!= $(DedicatedScheduler) && $(KeyboardBusy)

This policy will allow MPI jobs to start only during the nighttime hours if nobody is actively using the machine.

Once you set up the new policy, make sure you are able to run a simple Vanilla universe /bin/hostname job. When that is working, try the dag job with /bin/hostname again. Then try an MPI job. If you are using the MPI Universe for your MPI jobs, I'd recommend switching to the Parallel Universe.

Thanks,
Becky
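As a quick sanity check on the time window, here is a minimal Python sketch of the IsNighttime test quoted above. It assumes ClockMin is the number of minutes since midnight on the execute machine; the function names is_nighttime and to_clock_min are made up for illustration and are not part of Condor.

```python
def to_clock_min(hour: int, minute: int = 0) -> int:
    """Convert a wall-clock time to minutes since midnight (hypothetical helper)."""
    return hour * 60 + minute


def is_nighttime(clock_min: int) -> bool:
    """Mirror of: IsNighttime = (ClockMin < 600 || ClockMin > 1260)."""
    # 600 is 10:00 and 1260 is 21:00, so the expression is true
    # before 10 am or after 9 pm.
    return clock_min < 600 or clock_min > 1260


if __name__ == "__main__":
    for hour, minute in [(9, 59), (10, 0), (15, 30), (21, 0), (21, 1)]:
        clock_min = to_clock_min(hour, minute)
        print(f"{hour:02d}:{minute:02d} ClockMin={clock_min} "
              f"nighttime={is_nighttime(clock_min)}")
```

Both comparisons are strict, so exactly 10:00 and exactly 21:00 count as daytime under this expression.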
Thanks for the reply!

I tried what you said, but the condor daemons can't start on the exec nodes. This is the error message I get:

*** Last 20 line(s) of file StartLog:
10/28 23:00:39 Using config source: /etc/condor/condor_config
10/28 23:00:39 Using local config sources:
10/28 23:00:39    /opt/condor-6.8.0/local.chaparro/condor_config.local
10/28 23:00:39 DaemonCore: Command Socket at <200.1.19.171:9642>
10/28 23:00:39 ERROR "Syntax error in START expression: '( (Scheduler =?= "DedicatedScheduler@xxxxxxxxxxxxxxxxxxx" && (ClockMin < 600 || ClockMin > 1260) =?= TRUE && > 15 * 60 ) || ( (KeyboardIdle > 15 * 60) && ( ((LoadAvg - CondorLoadAvg) <= 0.3) || (State != "Unclaimed" && State != "Owner")) ) )'" at line 286 in file util.C
10/28 23:00:52 passwd_cache::cache_uid(): getpwnam("condor") failed: user not found
10/28 23:00:52 passwd_cache::cache_uid(): getpwnam("condor") failed: user not found
10/28 23:00:52 ******************************************************
10/28 23:00:52 ** condor_startd (CONDOR_STARTD) STARTING UP
10/28 23:00:52 ** /opt/condor-6.8.0/sbin/condor_startd
10/28 23:00:52 ** $CondorVersion: 6.8.0 Jul 19 2006 $
10/28 23:00:52 ** $CondorPlatform: I386-LINUX_RHEL3 $
10/28 23:00:52 ** PID = 5351
10/28 23:00:52 ** Log last touched 10/28 23:00:39
10/28 23:00:52 ******************************************************
10/28 23:00:52 Using config source: /etc/condor/condor_config
10/28 23:00:52 Using local config sources:
10/28 23:00:52    /opt/condor-6.8.0/local.chaparro/condor_config.local
10/28 23:00:52 DaemonCore: Command Socket at <200.1.19.171:9683>
10/28 23:00:52 ERROR "Syntax error in START expression: '( (Scheduler =?= "DedicatedScheduler@xxxxxxxxxxxxxxxxxxx" && (ClockMin < 600 || ClockMin > 1260) =?= TRUE && > 15 * 60 ) || ( (KeyboardIdle > 15 * 60) && ( ((LoadAvg - CondorLoadAvg) <= 0.3) || (State != "Unclaimed" && State != "Owner")) ) )'" at line 286 in file util.C
*** End of file StartLog

I tried putting a START = TRUE expression before the lines you suggested, and then removing it. The only difference was that the error message said TRUE instead of ((LoadAvg - Condor.......

In the PREEMPT line, I suppose there is a missing ")" at the end, am I right?

Can you help me find out what is going wrong?

Thanks.

--
Diego Bello Carreño
Thesis student, Civil Engineering in Informatics
UTFSM, Valparaíso, Chile
User #294897 counter.li.org