[Condor-users] Condor_exec fails, shadow fails
- Date: Thu, 22 Mar 2007 17:17:47 +1000
- From: "Jeffrey Stephen" <Stephen.Jeffrey@xxxxxxxxxxxxxx>
- Subject: [Condor-users] Condor_exec fails, shadow fails
Hi,
I am trying to start a parallel job. The log files indicate that condor_exec is failing.
Execute machine's Starter log contains:
--------------------------------------
3/22 16:49:41 Starting a PARALLEL universe job with ID: 19.0
3/22 16:49:41 IWD: D:\condor-6.8.4/execute\dir_784
3/22 16:49:41 Output file: D:\condor-6.8.4/execute\dir_784\foo.out.0
3/22 16:49:41 Error file: D:\condor-6.8.4/execute\dir_784\foo.err.0
3/22 16:49:41 Renice expr "10" evaluated to 10
3/22 16:49:41 About to exec D:\condor-6.8.4\execute\dir_784\condor_exec.exe \\indplly1\userdirs\JeffreySJ\Condor_Jobs\cpilog_minimal.exe
3/22 16:49:41 ERROR: D:\condor-6.8.4\execute\dir_784\condor_exec.exe is not a valid Windows executable
3/22 16:49:41 ERROR "Create_Process(D:\condor-6.8.4\execute\dir_784\condor_exec.exe,\\indplly1\userdirs\JeffreySJ\Condor_Jobs\cpilog_minimal.exe, ...) failed" at line 393 in file ..\src\condor_starter.V6.1\os_proc.C
3/22 16:49:41 ShutdownFast all jobs
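The "is not a valid Windows executable" error in the Starter log can be examined by looking at the header of the file in question. The following is a minimal sketch, not part of Condor: it assumes that "valid" means the file begins with a DOS "MZ" header whose offset field at 0x3C points to a "PE\0\0" signature. The check the starter actually performs may differ, and the function name is hypothetical.

```python
import struct
import sys


def looks_like_windows_executable(path):
    """Return True if the file has an 'MZ' header that points to a 'PE' signature.

    Hypothetical helper for illustration; it does not reproduce Condor's own check.
    """
    with open(path, "rb") as f:
        header = f.read(0x40)
        # A DOS header is at least 64 bytes long and starts with the bytes 'MZ'.
        if len(header) < 0x40 or header[:2] != b"MZ":
            return False
        # The 4-byte little-endian field at offset 0x3C holds the PE header offset.
        (pe_offset,) = struct.unpack_from("<I", header, 0x3C)
        f.seek(pe_offset)
        return f.read(4) == b"PE\0\0"


if __name__ == "__main__":
    # Usage: python check_exe.py <path-to-file>
    for candidate in sys.argv[1:]:
        print(candidate, looks_like_windows_executable(candidate))
```

Running it against the file named in the Executable line of the submit file would show whether that file is a binary or a text script.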
The shadow process is apparently dying.
Central manager's sched log contains:
-------------------------------------
3/22 16:49:32 (pid:2556) Activity on stashed negotiator socket
3/22 16:49:32 (pid:2556) Negotiating for owner: DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxx
3/22 16:49:32 (pid:2556) Out of requests - 1 reqs matched, 0 reqs idle
3/22 16:49:33 (pid:2556) Activity on stashed negotiator socket
3/22 16:49:33 (pid:2556) Negotiating for owner: DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxx
3/22 16:49:33 (pid:2556) Out of requests - 0 reqs matched, 0 reqs idle
3/22 16:49:35 (pid:2556) Inserting new attribute Scheduler into non-active cluster cid=19 acid=-1
3/22 16:49:37 (pid:2556) Starting add_shadow_birthdate(19.0)
3/22 16:49:37 (pid:2556) Started shadow for job 19.0 on "<131.242.63.162:1349>", (shadow pid = 2900)
3/22 16:49:37 (pid:2556) Sent ad to central manager for jeffreysj@xxxxxxxxxxxxxxx
3/22 16:49:37 (pid:2556) Sent ad to 1 collectors for jeffreysj@xxxxxxxxxxxxxxx
3/22 16:49:38 (pid:2556) DaemonCore: Command received via TCP from host <131.242.63.124:2733>
3/22 16:49:38 (pid:2556) DaemonCore: received command 71003 (GIVE_MATCHES), calling handler (DedicatedScheduler::giveMatches)
3/22 16:49:40 (pid:2556) DaemonCore: Command received via UDP from host <131.242.63.124:2735>
3/22 16:49:40 (pid:2556) DaemonCore: received command 60011 (DC_NOP), calling handler (handle_nop())
3/22 16:49:40 (pid:2556) In DedicatedScheduler::reaper pid 2900 has status 4
3/22 16:49:40 (pid:2556) Shadow pid 2900 exited with status 4
3/22 16:49:40 (pid:2556) ERROR: Shadow exited with job exception code!
3/22 16:49:40 (pid:2556) DedicatedScheduler::deallocMatchRec
3/22 16:49:40 (pid:2556) DedicatedScheduler::deallocMatchRec
The central manager's shadow log also reports the error:
------------------------------------
3/22 16:49:37 DaemonCore: Command Socket at <131.242.63.124:2722>
3/22 16:49:37 Initializing a PARALLEL shadow for job 19.0
3/22 16:49:38 (19.0) (2900): Request to run on <131.242.63.162:1349> was ACCEPTED
3/22 16:49:40 (19.0) (2900): ERROR "Error from starter on nes15300.lands.resnet.qg: Create_Process(D:\condor-6.8.4\execute\dir_784\condor_exec.exe,\\indplly1\userdirs\JeffreySJ\Condor_Jobs\cpilog_minimal.exe, ...) failed" at line 643 in file ..\src\condor_shadow.V6.1\pseudo_ops.C
The start log contains some TCP "connection refused" errors. Error 10061 (WSAECONNREFUSED) means "No connection could be made because the target machine actively refused it."
I don't think this is the problem, because I have tested some simple TCP client/server code running between the central manager and the execute machine, and it works fine.
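A connectivity test of the kind described above can be sketched as follows. This is not the code that was actually used for the test; it is a minimal illustration that assumes a plain TCP connect attempt is enough to tell "refused" from "reachable". The function name and the default timeout are hypothetical. The host and port to probe would be taken from the log, for example the address that appears in the "connection refused" lines of the Start log.

```python
import socket
import sys


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) can be opened.

    Hypothetical helper: a refused or timed-out connection raises OSError
    (ConnectionRefusedError and socket.timeout are subclasses), which is
    reported here as False.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Usage: python tcp_check.py <host> <port>
    host, port = sys.argv[1], int(sys.argv[2])
    print(host, port, "reachable" if can_connect(host, port) else "refused or unreachable")
```

A successful connection to one port says nothing about another, so the test is only meaningful against the exact address and port that the failing daemon tried to reach, and while the process that owns that port is still running.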
Execute machine's Start log contains:
-------------------------------------
3/22 16:49:39 Got universe "PARALLEL" (11) from request classad
3/22 16:49:39 State change: claim-activation protocol successful
3/22 16:49:39 Changing activity: Idle -> Busy
3/22 16:49:41 DaemonCore: Command received via TCP from host <131.242.63.124:2736>
3/22 16:49:41 DaemonCore: received command 403 (DEACTIVATE_CLAIM), calling handler (command_handler)
3/22 16:49:41 Called deactivate_claim()
3/22 16:49:41 attempt to connect to <131.242.63.162:1373> failed: connect errno = 10061 connection refused.
3/22 16:49:41 ERROR: SECMAN:2003:TCP auth connection to <131.242.63.162:1373> failed
3/22 16:49:41 Send_Signal: ERROR Connect to <131.242.63.162:1373> failed.
3/22 16:49:41 Error sending signal to starter, errno = 0 (No error)
3/22 16:49:41 attempt to connect to <131.242.63.162:1373> failed: connect errno = 10061 connection refused.
3/22 16:49:41 ERROR: SECMAN:2003:TCP auth connection to <131.242.63.162:1373> failed
3/22 16:49:41 Send_Signal: ERROR Connect to <131.242.63.162:1373> failed.
3/22 16:49:41 DaemonCore: Command received via UDP from host <131.242.63.162:1383>
3/22 16:49:41 DaemonCore: received command 60011 (DC_NOP), calling handler (handle_nop())
3/22 16:49:41 Starter pid 784 exited with status 0
3/22 16:49:41 State change: starter exited
3/22 16:49:41 Changing activity: Busy -> Idle
My mp1script is:
----------------
universe = parallel
Executable = H:\Condor_Jobs\mp1script
machine_count = 1
Output = foo.out.$(NODE)
log = foo.log.$(CLUSTER)
error = foo.err.$(NODE)
arguments = H:\Condor_Jobs\cpilog_minimal.exe
should_transfer_files = YES
transfer_input_files = H:\Condor_Jobs\cpilog_minimal.exe
WhenToTransferOutput = ON_EXIT_OR_EVICT
queue 1
I have tried using both relative and absolute paths to the various files specified in the submit script, e.g.:
mp1script
H:\Condor_Jobs\mp1script
I can manually run the job on the execute machine:
mpirun -np 1 cpilog_minimal.exe
so I don't think there is a problem with the MPI application.
cheers
steve