[Condor-users] Parallel Universe and dedicated scheduling
- Date: Tue, 10 Nov 2009 09:57:21 -0500
- From: "Jonathan D. Proulx" <jon@xxxxxxxxxxxxx>
- Subject: [Condor-users] Parallel Universe and dedicated scheduling
Hi All,
I'm trying to get Dedicated Scheduling set up for Parallel Universe
jobs. My understanding is that all I need to do is define
DedicatedScheduler on the execute nodes and do all submissions through
the host I define there (and I should also make sure these are never
preempted or suspended). I have a set of nodes that lets all jobs run
to completion except NiceUser jobs; I chose these as my test set.
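For reference, the kind of execute-node configuration described above might look roughly like the following. This is a minimal sketch modeled on the dedicated-scheduling example in the Condor manual, not the configuration actually in use on these nodes; the hostname submit.example.org is a placeholder assumption.

```
# Sketch only: submit.example.org is a hypothetical submit host.
# Name the dedicated scheduler and advertise it in the machine ClassAd.
DedicatedScheduler = "DedicatedScheduler@submit.example.org"
STARTD_ATTRS = $(STARTD_ATTRS), DedicatedScheduler

# Never suspend, preempt, or kill jobs once they are running.
SUSPEND      = False
CONTINUE     = True
PREEMPT      = False
KILL         = False
WANT_SUSPEND = False
WANT_VACATE  = False

# Prefer jobs that arrive from the dedicated scheduler.
RANK = Scheduler =?= $(DedicatedScheduler)
```

The START expression is left out on purpose, since the nodes in question already have their own policy for which jobs may start.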
My parallel universe jobs are getting scheduled but they are crashing
the Startd which claims "WantSuspend" is undefined:
condor/latest-install/sbin/condor_startd" on "borg68.csail.mit.edu" exited with status 4.
Condor will automatically restart this process in 10 seconds.
*** Last 20 line(s) of file /opt/condor/log/StartLog:
11/9 22:02:33 Calling HandleReq <command_match_info> (0)
11/9 22:02:33 match_info called
11/9 22:02:33 Received match <128.30.112.196:38230>#1254759541#411#...
11/9 22:02:33 State change: match notification protocol successful
11/9 22:02:33 Changing state: Unclaimed -> Matched
11/9 22:02:33 Return from HandleReq <command_match_info> (handler: 0.000s, sec: 0.003s)
11/9 22:02:33 Calling Handler <DaemonCore::HandleReqSocketHandler>
11/9 22:02:33 Received TCP command 442 (REQUEST_CLAIM) from condor <128.30.112.26:34738>, access level DAEMON
11/9 22:02:33 Calling HandleReq <command_request_claim> (0)
11/9 22:02:33 Request accepted.
11/9 22:02:33 Remote owner is DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxxxx
11/9 22:02:33 State change: claiming protocol successful
11/9 22:02:33 Changing state: Matched -> Claimed
11/9 22:02:33 ERROR "Can't find WANT_SUSPEND in internal ClassAd" at line 1226 in file Resource.cpp
11/9 22:02:33 Changing state and activity: Claimed/Idle -> Preempting/Killing
11/9 22:02:34 State change: No preempting claim, returning to owner
11/9 22:02:34 Changing state and activity: Preempting/Killing -> Owner/Idle
11/9 22:02:34 State change: IS_OWNER is false
11/9 22:02:34 Changing state: Owner -> Unclaimed
11/9 22:02:34 startd exiting because of fatal exception.
*** End of file StartLog
But "Want_Suspend" *is* defined:
[jon@borg-login-1 ~]$ condor_config_val -n borg68 DedicatedScheduler Start Want_Suspend
"DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxxxx"
True
((TARGET.ImageSize < (15 * 1024)) || ((KeyboardIdle < 60) == False) || (TARGET.JobUniverse == 4) || (TARGET.JobUniverse == 5) ) && ( NiceUser == True)
I'm puzzled...
-Jon