[Condor-users] Error with executing simple job via Condor
- Date: Wed, 28 Apr 2010 07:54:18 -0600
- From: "Michael O'Donnell" <odonnellm@xxxxxxxx>
- Subject: [Condor-users] Error with executing simple job via Condor
I wrote a simple Python executable to submit with Condor. The jobs are
submitted and their state changes to Run for a second or so, but then changes
back to Idle. If I wait, they are rescheduled and change to the Run state
again, but they never actually run. The log files for each queue never have
any content. Based on the ShadowLog, I get errno = 10054 (WSAECONNRESET),
which means the connection was reset by the remote side. All of our machines
run Windows XP, including the central manager. As you can tell from the log,
we are using NTSSPI and SSL. When I run condor_status, everything looks fine
with regard to cores/slots and claimed and unclaimed machines. I am not seeing
any errors in the MasterLog, and as far as I can tell everything looks OK.

Does anyone have any idea what might be causing this? We first set up Condor
without SSL and did not have any issues; now we are working on a more secure
system, which is likely causing the problems. This might not be related, but
we also had our CM (central manager) routed through a 100 Mbps switch, while
our network is 1 Gbps. The CM was not working there, and we still cannot see
two machines on this 100 Mbps router. However, once we moved the CM off the
100 Mbps router we were able to see all machines in our pool (currently we are
testing and working out the configuration and therefore only have about 6
machines in our pool).
Thank you,
Mike
When I run the following command I get:
condor_q -analyze 88
088.009: Run analysis summary. Of 10 machines,
      0 are rejected by your job's requirements
      0 reject your job because of their own requirements
      6 match but are serving users with a better priority in the pool
      4 match but reject the job for unknown reasons
      0 match but will not currently preempt their existing job
      0 match but are currently offline
      0 are available to run your job
   Last successful match: Wed Apr 28 07:39:24 2010
The ShadowLog on the submit machine looks like this (note that I used search
and replace to mask accounts, IP addresses, and other info, but it should
still make sense):
Command = 60008
04/28 07:19:52 (88.6) (488): SECMAN: startCommand succeeded.
04/28 07:19:52 (88.6) (488): Authorizing server '*/IP.39'.
04/28 07:19:52 (88.6) (488): SEND [1000] <IP.39:3385> <IP.39:1851>
04/28 07:19:52 (88.6) (488): SEND [164] <IP.39:3385> <IP.39:1851>
04/28 07:19:52 (88.6) (488): DaemonCore: Leaving SendAliveToParent() - success
04/28 07:19:52 (88.6) (488): Return from Timer handler 5 (DaemonCore::SendAliveToParent)
04/28 07:19:52 (88.6) (488): Calling Handler <HandleSyscalls> (2)
04/28 07:19:52 (88.6) (488): condor_read(fd=1692 startd slot4@ExecuteMachine,,size=21,timeout=300,flags=0)
04/28 07:19:52 (88.6) (488): condor_read(): fd=1692
04/28 07:19:52 (88.6) (488): condor_read(): select returned 1
04/28 07:19:52 (88.6) (488): condor_read(fd=1692 startd slot4@ExecuteMachine,,size=8,timeout=300,flags=0)
04/28 07:19:52 (88.6) (488): condor_read(): fd=1692
04/28 07:19:52 (88.6) (488): condor_read(): select returned 1
04/28 07:19:52 (88.6) (488): entering FileTransfer::Init
04/28 07:19:52 (88.6) (488): entering FileTransfer::SimpleInit
04/28 07:19:52 (88.6) (488): Entering FileTransfer::InitDownloadFilenameRemaps
04/28 07:19:52 (88.6) (488): condor_write(fd=1692 startd slot4@ExecuteMachine,,size=4096,timeout=300,flags=0)
04/28 07:19:52 (88.6) (488): condor_write(fd=1692 startd slot4@ExecuteMachine,,size=3153,timeout=300,flags=0)
04/28 07:19:52 (88.6) (488): Return from Handler <HandleSyscalls>
04/28 07:19:52 (88.6) (488): Calling Handler <HandleSyscalls> (2)
04/28 07:19:52 (88.6) (488): condor_read(fd=1692 startd slot4@ExecuteMachine,,size=21,timeout=300,flags=0)
04/28 07:19:52 (88.6) (488): condor_read(): fd=1692
04/28 07:19:52 (88.6) (488): condor_read(): select returned 1
04/28 07:19:52 (88.6) (488): condor_read(fd=1692 startd slot4@ExecuteMachine,,size=580,timeout=300,flags=0)
04/28 07:19:52 (88.6) (488): condor_read(): fd=1692
04/28 07:19:52 (88.6) (488): condor_read(): select returned 1
04/28 07:19:52 (88.6) (488): condor_write(fd=1692 startd slot4@ExecuteMachine,,size=29,timeout=300,flags=0)
04/28 07:19:52 (88.6) (488): Return from Handler <HandleSyscalls>
04/28 07:19:52 (88.6) (488): Calling Handler <HandleSyscalls> (2)
04/28 07:19:52 (88.6) (488): condor_read(fd=1692 startd slot4@ExecuteMachine,,size=21,timeout=300,flags=0)
04/28 07:19:52 (88.6) (488): condor_read(): fd=1692
04/28 07:19:52 (88.6) (488): condor_read(): select returned 1
04/28 07:19:52 (88.6) (488): condor_read() failed: recv() returned -1, errno = 10054 , reading 21 bytes from startd slot4@ExecuteMachine.
04/28 07:19:52 (88.6) (488): IO: Failed to read packet header
04/28 07:19:52 (88.6) (488): Stream::get(int) failed to read padding
04/28 07:19:52 (88.6) (488): Can no longer talk to condor_starter <IP.15:1849>
04/28 07:19:52 (88.6) (488): CLOSE <IP.39:3378> fd=1692
04/28 07:19:52 (88.6) (488): WriteUserLog: not initialized @ writeEvent()
04/28 07:19:52 (88.6) (488): Trying to reconnect job USER@xxxxxxxxxxxxxxxxx#88.6#1272460515
04/28 07:19:52 (88.6) (488): Trying to reconnect to disconnected job
04/28 07:19:52 (88.6) (488): LastJobLeaseRenewal: 1272460792 Wed Apr 28 07:19:52 2010
04/28 07:19:53 (88.6) (488): JobLeaseDuration: 1200 seconds
04/28 07:19:53 (88.6) (488): Resource slot4@ExecuteMachine changing state from STARTUP to RECONNECT
04/28 07:19:53 (88.6) (488): JobLeaseDuration remaining: 1199
04/28 07:19:53 (88.6) (488): Return from Handler <HandleSyscalls>
04/28 07:19:53 (88.6) (488): Calling Timer handler 8 (RemoteResource::attemptReconnect())
04/28 07:19:53 (88.6) (488): Attempting to locate disconnected starter
04/28 07:19:53 (88.6) (488): gjid is USER@xxxxxxxxxxxxxxxxx#88.6#1272460515 claimid is <IP.15:1849>#1272040933#1100#...
04/28 07:19:53 (88.6) (488): CONNECT src="" fd=1676 dst=<IP.15:1849>
04/28 07:19:53 (88.6) (488): SECMAN: command 1200 CA_CMD to startd slot4@ExecuteMachine from TCP port 3397 (blocking).
04/28 07:19:53 (88.6) (488): SECMAN: using session ClientExecuteMachine:5060:1272460787:460 for {<IP.15:1849>,<1200>}.
04/28 07:19:53 (88.6) (488): SECMAN: found cached session id ClientExecuteMachine:5060:1272460787:460 for {<IP.15:1849>,<1200>}.
MyType = ""
TargetType = ""
OutgoingNegotiation = "REQUIRED"
Subsystem = "SHADOW"
Command = 444
RemoteVersion = "$CondorVersion: 7.4.0 Oct 31 2009 BuildID: 193173 $"
Enact = "YES"
AuthMethodsList = "NTSSPI,SSL"
AuthMethods = "NTSSPI"
CryptoMethods = "3DES,BLOWFISH"
Authentication = "YES"
Encryption = "YES"
Integrity = "YES"
SessionDuration = "86400"
UseSession = "YES"
Sid = "ClientExecuteMachine:5060:1272460787:460"
MyRemoteUserName = "USER"
ValidCommands = "60000,60008,60017,403,404,427,435,436,441,442,443,444,446,466,503,504,505,506,60004,1200,1000,5,60007,60011,448,452,457,470"
TriedAuthentication = TRUE
04/28 07:19:53 (88.6) (488): SECMAN: Security Policy:
MyType = ""
TargetType = ""
OutgoingNegotiation = "REQUIRED"
Subsystem = "SHADOW"
Command = 444
RemoteVersion = "$CondorVersion: 7.4.0 Oct 31 2009 BuildID: 193173 $"
Enact = "YES"
AuthMethodsList = "NTSSPI,SSL"
AuthMethods = "NTSSPI"
CryptoMethods = "3DES,BLOWFISH"
Authentication = "YES"
Encryption = "YES"
Integrity = "YES"
SessionDuration = "86400"
UseSession = "YES"
Sid = "ClientExecuteMachine:5060:1272460787:460"
MyRemoteUserName = "USER"
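
For anyone scanning a long ShadowLog for the same symptom, here is a minimal
Python sketch that pulls every "errno = N" out of log text and labels a few
common Winsock codes. The function name, the lookup table, and the sample line
are illustrative assumptions, not part of Condor; the table is deliberately
incomplete and only covers codes whose meaning is well documented.

```python
import re

# A few Winsock error codes that can appear in Condor logs on Windows.
# This table is an illustrative subset, not a complete list.
WINSOCK_ERRORS = {
    10053: "WSAECONNABORTED (connection aborted by local software)",
    10054: "WSAECONNRESET (connection reset by peer)",
    10060: "WSAETIMEDOUT (connection timed out)",
    10061: "WSAECONNREFUSED (connection refused)",
}

# Matches "errno = 10054" with any amount of whitespace around "=".
ERRNO_RE = re.compile(r"errno\s*=\s*(\d+)")


def find_socket_errors(log_text):
    """Return a list of (code, description) for each 'errno = N' found."""
    results = []
    for match in ERRNO_RE.finditer(log_text):
        code = int(match.group(1))
        results.append((code, WINSOCK_ERRORS.get(code, "unknown")))
    return results


# Sample line copied from the ShadowLog excerpt in this message.
sample = (
    "04/28 07:19:52 (88.6) (488): condor_read() failed: recv() returned -1, "
    "errno = 10054 , reading 21 bytes from startd slot4@ExecuteMachine."
)

for code, description in find_socket_errors(sample):
    print(code, description)
```

To run it against a real log, read the file contents into a string and pass
that to find_socket_errors in place of the sample line.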
- - - - - - - - - - - - - - - - - - - - - - - - - -
Michael O'Donnell
ADP Software Specialist, ASRC Management Services
USGS Fort Collins Science Center
2150 Centre Ave., Bldg C
Fort Collins, CO 80526
Phone: 970.226.9407
Fax: 970.226.9230
Email: odonnellm@xxxxxxxx