CREDD_HOST = nes30700.lands.resnet.qg
STARTER_ALLOW_RUNAS_OWNER = True
CREDD_CACHE_LOCALLY = True
SEC_CLIENT_AUTHENTICATION_METHODS = NTSSPI, PASSWORD
HOSTALLOW_ADMINISTRATOR = *
HOSTALLOW_READ = *
HOSTALLOW_WRITE = *
The contents of etc/condor_config.local.credd were copied to
condor_config.local. Only one line was changed:
CREDD.SEC_DEFAULT_AUTHENTICATION = REQUIRED
was changed to:
CREDD.SEC_DEFAULT_AUTHENTICATION = OPTIONAL
The pool password was set with:
condor_store_cred -c add
-----------------------------------------------------------------------------------------------------------------
One execute machine (nes15300.lands.resnet.qg):
condor_config: the following changes were made to the default:
SEC_DEFAULT_AUTHENTICATION = OPTIONAL
CREDD_HOST = nes30700.lands.resnet.qg
STARTER_ALLOW_RUNAS_OWNER = True
CREDD_CACHE_LOCALLY = True
SEC_CLIENT_AUTHENTICATION_METHODS = NTSSPI, PASSWORD
HOSTALLOW_ADMINISTRATOR = *
HOSTALLOW_READ = *
HOSTALLOW_WRITE = *
The contents of etc/condor_config.local.credd were copied to
condor_config.local. One line was changed:
CREDD.SEC_DEFAULT_AUTHENTICATION = REQUIRED
was changed to:
CREDD.SEC_DEFAULT_AUTHENTICATION = OPTIONAL
The contents of etc/condor_config.local.dedicated.resource were appended to
condor_config.local, and run policy number 2 was selected.
The pool password was set with:
condor_store_cred -c add
-----------------------------------------------------------------------------------------------------------------
My submit script is:
It appears that Condor attempts to start the job: the execute machine
nes15300 changes state to "Claimed", but the job then fails what looks like
an authentication check.

The StartLog contains:
-----------------------------------------------------------------------------------------------------------------
3/19 16:26:49 DaemonCore: Command received via TCP from host <131.242.63.124:3285>
3/19 16:26:49 DaemonCore: received command 442 (REQUEST_CLAIM), calling handler (command_request_claim)
3/19 16:26:49 Request accepted.
3/19 16:26:49 Remote owner is DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxx
3/19 16:26:49 State change: claiming protocol successful
3/19 16:26:49 Changing state: Unclaimed -> Claimed
3/19 16:26:49 DaemonCore: Command received via UDP from host <131.242.63.124:3283>
3/19 16:26:49 DaemonCore: received command 440 (MATCH_INFO), calling handler (command_match_info)
3/19 16:26:49 match_info called
3/19 16:26:53 DaemonCore: Command received via TCP from host <131.242.63.124:3298>
3/19 16:26:53 DaemonCore: received command 444 (ACTIVATE_CLAIM), calling handler (command_activate_claim)
3/19 16:26:53 Got activate_claim request from shadow (<131.242.63.124:3298>)
3/19 16:26:53 Remote job ID is 9.0
3/19 16:26:53 Got universe "PARALLEL" (11) from request classad
3/19 16:26:53 State change: claim-activation protocol successful
3/19 16:26:53 Changing activity: Idle -> Busy
3/19 16:26:55 DaemonCore: Command received via TCP from host <131.242.63.124:3304>
3/19 16:26:55 DaemonCore: received command 403 (DEACTIVATE_CLAIM), calling handler (command_handler)
3/19 16:26:55 Called deactivate_claim()
3/19 16:26:55 attempt to connect to <131.242.63.162:2345> failed: connect errno = 10061 connection refused.
3/19 16:26:55 ERROR: SECMAN:2003:TCP auth connection to <131.242.63.162:2345> failed
3/19 16:26:55 Send_Signal: ERROR Connect to <131.242.63.162:2345> failed.
3/19 16:26:55 Error sending signal to starter, errno = 0 (No error)
3/19 16:26:55 attempt to connect to <131.242.63.162:2345> failed: connect errno = 10061 connection refused.
3/19 16:26:55 ERROR: SECMAN:2003:TCP auth connection to <131.242.63.162:2345> failed
3/19 16:26:55 Send_Signal: ERROR Connect to <131.242.63.162:2345> failed.
3/19 16:26:55 DaemonCore: Command received via UDP from host <131.242.63.162:2355>
3/19 16:26:55 DaemonCore: received command 60011 (DC_NOP), calling handler (handle_nop())
3/19 16:26:55 Starter pid 388 exited with status 0
3/19 16:26:55 State change: starter exited
3/19 16:26:55 Changing activity: Busy -> Idle
3/19 16:31:59 DaemonCore: Command received via TCP from host <131.242.63.124:3344>
3/19 16:31:59 DaemonCore: received command 444 (ACTIVATE_CLAIM), calling handler (command_activate_claim)
3/19 16:31:59 Got activate_claim request from shadow (<131.242.63.124:3344>)
3/19 16:31:59 Remote job ID is 9.0
3/19 16:31:59 Got universe "PARALLEL" (11) from request classad
3/19 16:31:59 State change: claim-activation protocol successful
3/19 16:31:59 Changing activity: Idle -> Busy
3/19 16:32:01 DaemonCore: Command received via TCP from host <131.242.63.124:3346>
3/19 16:32:01 DaemonCore: received command 403 (DEACTIVATE_CLAIM), calling handler (command_handler)
3/19 16:32:01 Called deactivate_claim()
3/19 16:32:01 attempt to connect to <131.242.63.162:2389> failed: connect errno = 10061 connection refused.
3/19 16:32:01 ERROR: SECMAN:2003:TCP auth connection to <131.242.63.162:2389> failed
3/19 16:32:01 Send_Signal: ERROR Connect to <131.242.63.162:2389> failed.
3/19 16:32:01 Error sending signal to starter, errno = 0 (No error)
3/19 16:32:01 attempt to connect to <131.242.63.162:2389> failed: connect errno = 10061 connection refused.
3/19 16:32:01 ERROR: SECMAN:2003:TCP auth connection to <131.242.63.162:2389> failed
3/19 16:32:01 Send_Signal: ERROR Connect to <131.242.63.162:2389> failed.
3/19 16:32:01 DaemonCore: Command received via UDP from host <131.242.63.162:2399>
3/19 16:32:01 DaemonCore: received command 60011 (DC_NOP), calling handler (handle_nop())
3/19 16:32:01 Starter pid 2980 exited with status 0
3/19 16:32:01 State change: starter exited
3/19 16:32:01 Changing activity: Busy -> Idle
-----------------------------------------------------------------------------------------------------------------
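For anyone wanting to pull just the failures out of a log like the one above, a small filter along these lines might help. It is only a sketch: it assumes one entry per line with a "M/D HH:MM:SS" prefix, as in the excerpt, and the function name `failures` is made up for illustration.

```python
import re

# One log entry per line: "M/D HH:MM:SS message" (format assumed from the excerpt).
ENTRY = re.compile(r"^(\d{1,2}/\d{1,2} \d{2}:\d{2}:\d{2}) (.*)$")


def failures(lines):
    """Return (timestamp, message) pairs for entries that mention an error or failure."""
    hits = []
    for line in lines:
        match = ENTRY.match(line.strip())
        if not match:
            continue  # skip separators and anything without a timestamp
        stamp, message = match.groups()
        lowered = message.lower()
        if "error" in lowered or "failed" in lowered:
            hits.append((stamp, message))
    return hits
```

Passing it the lines of the StartLog (for example via `open(path).read().splitlines()`, with `path` pointing at wherever the log lives on the execute machine) returns the connection-refused and SECMAN entries without the surrounding claim/activate chatter.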
The StarterLog indicates that the MPI job executable was successfully
transferred to the execute machine, but it contains another error:
-----------------------------------------------------------------------------------------------------------------
3/19 16:26:53 ******************************************************
3/19 16:26:53 ** condor_starter (CONDOR_STARTER) STARTING UP
3/19 16:26:53 ** D:\condor-6.8.4\bin\condor_starter.exe
3/19 16:26:53 ** $CondorVersion: 6.8.4 Feb 1 2007 $
3/19 16:26:53 ** $CondorPlatform: INTEL-WINNT50 $
3/19 16:26:53 ** PID = 388
3/19 16:26:53 ** Log last touched 3/19 15:50:01
3/19 16:26:53 ******************************************************
3/19 16:26:53 Using config source: D:\condor-6.8.4\condor_config
3/19 16:26:53 Using local config sources:
3/19 16:26:53    D:\condor-6.8.4/condor_config.local
3/19 16:26:53 DaemonCore: Command Socket at <131.242.63.162:2345>
3/19 16:26:53 Setting resource limits not implemented!
3/19 16:26:53 Communicating with shadow <131.242.63.124:3289>
3/19 16:26:53 Submitting machine is "nes30700.lands.resnet.qg"
3/19 16:26:53 Job has WantIOProxy=true
3/19 16:26:53 Initialized IO Proxy.
3/19 16:26:54 File transfer completed successfully.
3/19 16:26:55 Starting a PARALLEL universe job with ID: 9.0
3/19 16:26:55 IWD: D:\condor-6.8.4/execute\dir_388
3/19 16:26:55 Renice expr "10" evaluated to 10
3/19 16:26:55 About to exec D:\condor-6.8.4\execute\dir_388\condor_exec.exe cpilog_minimal.exe
3/19 16:26:55 ERROR: D:\condor-6.8.4\execute\dir_388\condor_exec.exe is not a valid Windows executable
3/19 16:26:55 ERROR "Create_Process(D:\condor-6.8.4\execute\dir_388\condor_exec.exe,cpilog_minimal.exe, ...) failed" at line 393 in file ..\src\condor_starter.V6.1\os_proc.C
3/19 16:26:55 ShutdownFast all jobs.
-----------------------------------------------------------------------------------------------------------------
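Given the "not a valid Windows executable" message, one thing that could be checked is what kind of file is actually being transferred as the job executable. The sketch below is a rough check only: it assumes that a native Windows executable begins with the two-byte "MZ" signature, and it says nothing about whether the rest of the file is well formed. The function name `looks_like_windows_exe` is made up for illustration.

```python
def looks_like_windows_exe(path):
    """Return True if the file at `path` starts with the 'MZ' signature.

    This only inspects the first two bytes; a script or text file
    renamed to .exe would return False.
    """
    with open(path, "rb") as handle:
        return handle.read(2) == b"MZ"
```

It could be run on the submit machine against whichever file the submit description names as the executable, since the copy under the execute directory is removed once the starter exits.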
Any advice would be greatly appreciated.
cheers
steve