[Condor-users] Condor_exec fails, shadow fails
- Date: Thu, 22 Mar 2007 17:17:47 +1000
- From: "Jeffrey Stephen" <Stephen.Jeffrey@xxxxxxxxxxxxxx>
- Subject: [Condor-users] Condor_exec fails, shadow fails
Hi,
 
I am trying to start a parallel job. The log files indicate that condor_exec is failing.
 
Execute machine's Starter log contains:
--------------------------------------
3/22 16:49:41 Starting a PARALLEL universe job with ID: 19.0
3/22 16:49:41 IWD: D:\condor-6.8.4/execute\dir_784
3/22 16:49:41 Output file: D:\condor-6.8.4/execute\dir_784\foo.out.0
3/22 16:49:41 Error file: D:\condor-6.8.4/execute\dir_784\foo.err.0
3/22 16:49:41 Renice expr "10" evaluated to 10
3/22 16:49:41 About to exec D:\condor-6.8.4\execute\dir_784\condor_exec.exe \\indplly1\userdirs\JeffreySJ\Condor_Jobs\cpilog_minimal.exe
3/22 16:49:41 ERROR: D:\condor-6.8.4\execute\dir_784\condor_exec.exe is not a valid Windows executable
3/22 16:49:41 ERROR "Create_Process(D:\condor-6.8.4\execute\dir_784\condor_exec.exe,\\indplly1\userdirs\JeffreySJ\Condor_Jobs\cpilog_minimal.exe, ...) failed" at line 393 in file ..\src\condor_starter.V6.1\os_proc.C
3/22 16:49:41 ShutdownFast all jobs
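
For reference, the "is not a valid Windows executable" message above concerns the file that was copied into the execute directory as condor_exec.exe. A minimal sketch of a sanity check one could run on the file named in the submit file's Executable line is shown here. It only tests for the two-byte "MZ" signature that native Windows executables begin with; this is a necessary condition, not a full validation, and it is an assumption that it resembles whatever check Condor itself performs. The function name is hypothetical.

```python
def looks_like_windows_executable(path):
    """Return True if the file starts with the 2-byte DOS 'MZ' signature.

    This is only a rough check: a file can start with 'MZ' and still fail
    to load, but a file without it (for example a shell script or a plain
    text file) cannot be a native Windows executable.
    """
    with open(path, "rb") as handle:
        return handle.read(2) == b"MZ"
```

Running it against both H:\Condor_Jobs\mp1script and H:\Condor_Jobs\cpilog_minimal.exe would show which of the two is a native binary.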
 
The shadow process is apparently dying.
 
Central manager's sched log contains:
-------------------------------------
3/22 16:49:32 (pid:2556) Activity on stashed negotiator socket
3/22 16:49:32 (pid:2556) Negotiating for owner: DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxx
3/22 16:49:32 (pid:2556) Out of requests - 1 reqs matched, 0 reqs idle
3/22 16:49:33 (pid:2556) Activity on stashed negotiator socket
3/22 16:49:33 (pid:2556) Negotiating for owner: DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxxxx
3/22 16:49:33 (pid:2556) Out of requests - 0 reqs matched, 0 reqs idle
3/22 16:49:35 (pid:2556) Inserting new attribute Scheduler into non-active cluster cid=19 acid=-1
3/22 16:49:37 (pid:2556) Starting add_shadow_birthdate(19.0)
3/22 16:49:37 (pid:2556) Started shadow for job 19.0 on "<131.242.63.162:1349>", (shadow pid = 2900)
3/22 16:49:37 (pid:2556) Sent ad to central manager for jeffreysj@xxxxxxxxxxxxxxx
3/22 16:49:37 (pid:2556) Sent ad to 1 collectors for jeffreysj@xxxxxxxxxxxxxxx
3/22 16:49:38 (pid:2556) DaemonCore: Command received via TCP from host <131.242.63.124:2733>
3/22 16:49:38 (pid:2556) DaemonCore: received command 71003 (GIVE_MATCHES), calling handler (DedicatedScheduler::giveMatches)
3/22 16:49:40 (pid:2556) DaemonCore: Command received via UDP from host <131.242.63.124:2735>
3/22 16:49:40 (pid:2556) DaemonCore: received command 60011 (DC_NOP), calling handler (handle_nop())
3/22 16:49:40 (pid:2556) In DedicatedScheduler::reaper pid 2900 has status 4
3/22 16:49:40 (pid:2556) Shadow pid 2900 exited with status 4
3/22 16:49:40 (pid:2556) ERROR: Shadow exited with job exception code!
3/22 16:49:40 (pid:2556) DedicatedScheduler::deallocMatchRec
3/22 16:49:40 (pid:2556) DedicatedScheduler::deallocMatchRec
 
The central manager's shadow log also reports the error:
------------------------------------
3/22 16:49:37 DaemonCore: Command Socket at <131.242.63.124:2722>
3/22 16:49:37 Initializing a PARALLEL shadow for job 19.0
3/22 16:49:38 (19.0) (2900): Request to run on <131.242.63.162:1349> was ACCEPTED
3/22 16:49:40 (19.0) (2900): ERROR "Error from starter on nes15300.lands.resnet.qg: Create_Process(D:\condor-6.8.4\execute\dir_784\condor_exec.exe,\\indplly1\userdirs\JeffreySJ\Condor_Jobs\cpilog_minimal.exe, ...) failed" at line 643 in file ..\src\condor_shadow.V6.1\pseudo_ops.C
 
The Start log also contains some TCP "connection refused" errors. Error 10061 (WSAECONNREFUSED) means "No connection could be made because the target machine actively refused it."
I don't think this is the problem, because I have tested some simple TCP client/server code running between the central manager and the execute machine, and it works fine.
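
For anyone who wants to repeat that kind of connectivity test, a minimal sketch of a TCP connect probe follows. It is not the client/server code mentioned above; the function name and the timeout value are assumptions chosen for illustration. A refused connection (error 10061 on Windows) is reported as a failed connect, the same as a timeout.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) can be opened.

    A refused connection (WSAECONNREFUSED / ECONNREFUSED), a timeout, or
    any other socket-level failure is reported as False.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Note that the ports in the Start log (for example 1373) belong to short-lived daemons, so a probe run after the job has failed would be expected to return False for them even on a healthy network.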
 
Execute machine's Start log contains:
-------------------------------------
3/22 16:49:39 Got universe "PARALLEL" (11) from request classad
3/22 16:49:39 State change: claim-activation protocol successful
3/22 16:49:39 Changing activity: Idle -> Busy
3/22 16:49:41 DaemonCore: Command received via TCP from host <131.242.63.124:2736>
3/22 16:49:41 DaemonCore: received command 403 (DEACTIVATE_CLAIM), calling handler (command_handler)
3/22 16:49:41 Called deactivate_claim()
3/22 16:49:41 attempt to connect to <131.242.63.162:1373> failed: connect errno = 10061 connection refused.
3/22 16:49:41 ERROR: SECMAN:2003:TCP auth connection to <131.242.63.162:1373> failed
3/22 16:49:41 Send_Signal: ERROR Connect to <131.242.63.162:1373> failed.
3/22 16:49:41 Error sending signal to starter, errno = 0 (No error)
3/22 16:49:41 attempt to connect to <131.242.63.162:1373> failed: connect errno = 10061 connection refused.
3/22 16:49:41 ERROR: SECMAN:2003:TCP auth connection to <131.242.63.162:1373> failed
3/22 16:49:41 Send_Signal: ERROR Connect to <131.242.63.162:1373> failed.
3/22 16:49:41 DaemonCore: Command received via UDP from host <131.242.63.162:1383>
3/22 16:49:41 DaemonCore: received command 60011 (DC_NOP), calling handler (handle_nop())
3/22 16:49:41 Starter pid 784 exited with status 0
3/22 16:49:41 State change: starter exited
3/22 16:49:41 Changing activity: Busy -> Idle
 
 
 
My mp1script is:
----------------
universe = parallel
Executable = H:\Condor_Jobs\mp1script
machine_count = 1
Output = foo.out.$(NODE)
log = foo.log.$(CLUSTER)
error = foo.err.$(NODE)
arguments = H:\Condor_Jobs\cpilog_minimal.exe
should_transfer_files = YES
transfer_input_files = H:\Condor_Jobs\cpilog_minimal.exe
WhenToTransferOutput = ON_EXIT_OR_EVICT
queue 1
 
I have tried using both relative and absolute paths to the various files specified in the submit script, e.g.:
   mp1script
   H:\Condor_Jobs\mp1script
 
I can manually run the job on the execute machine:
mpirun -np 1 cpilog_minimal.exe
so I don't think there is a problem with the MPI application.
 
cheers
steve