HTCondor Project List Archives




Re: [Condor-devel] What is the starter trying to do at line 321 in NTsenders.C?



On Mon, Jun 11, 2007 at 05:51:29PM -0400, Ian Chesal wrote:
> Debugging a persistent problem we're having in our SJ system where our
> Windows machines end up Claimed+Idle after a short period of time.
> Windows machines are failing to run jobs and I'm seeing in the
> StarterLog file:
> 
> 6/11 14:28:31 (fd:6) (pid:3268) in VanillaProc::StartJob()
> 6/11 14:28:31 (fd:6) (pid:3268) Executable is .bat, so running
> C:\WINDOWS\system32\cmd.exe /Q /C condor_exec.bat
> /experiments/ichesal/dummy.win/a-00/01
> 6/11 14:28:31 (fd:6) (pid:3268) in OsProc::StartJob()
> 6/11 14:28:31 (fd:6) (pid:3268) IWD: d:\abc\condor/execute\dir_3268
> 6/11 14:28:31 (fd:6) (pid:3268) TokenCache contents: 
> swbatch1@ALTERA
> 6/11 14:28:31 (fd:6) (pid:3268) PRIV_CONDOR --> PRIV_USER at
> ..\src\condor_starter.V6.1\os_proc.C:227
> 6/11 14:28:31 (fd:7) (pid:3268) Input file: NUL
> 6/11 14:28:31 (fd:8) (pid:3268) Output file:
> d:\abc\condor/execute\dir_3268\wrapper.log
> 6/11 14:28:31 (fd:9) (pid:3268) Error file:
> d:\abc\condor/execute\dir_3268\wrapper.err
> 6/11 14:28:31 (fd:9) (pid:3268) Doing CONDOR_begin_execution
> 6/11 14:28:31 (fd:9) (pid:3268) condor_read(): nfds=0
> 6/11 14:33:31 (fd:9) (pid:3268) condor_read(): nfound=0
> 6/11 14:33:31 (fd:9) (pid:3268) condor_read(): timeout reading buffer.
> 6/11 14:33:31 (fd:9) (pid:3268) IO: EOF reading packet header
> 6/11 14:33:31 (fd:9) (pid:3268) Stream::get(int) failed to read padding
> 6/11 14:33:31 (fd:9) (pid:3268) ERROR "Assertion ERROR on (result)" at
> line 322 in file ..\src\condor_starter.V6.1\NTsenders.C
> 6/11 14:33:31 (fd:9) (pid:3268) Doing CONDOR_ulog
> 6/11 14:33:31 (fd:9) (pid:3268) ShutdownFast all jobs.
> 6/11 14:33:31 (fd:9) (pid:3268) Got ShutdownFast when no jobs running.
> 6/11 14:33:31 (fd:9) (pid:3268) Destroying Daemon object:
> 6/11 14:33:31 (fd:9) (pid:3268) Type: 11 (shadow), Name:
> sj-schedd1.altera.com, Addr: <137.57.202.107:48964>
> 6/11 14:33:31 (fd:9) (pid:3268) FullHost: (null), Host: (null), Pool:
> (null), Port: -1
> 6/11 14:33:31 (fd:9) (pid:3268) IsLocal: N, IdStr: (null), Error: (null)
> 6/11 14:33:31 (fd:9) (pid:3268)  --- End of Daemon object info ---
> 
> These assert errors in NTsenders.C and NTrecievers.C keep popping up.
> Now that I've got the source for 6.8.0 I'm looking at NTsenders.C. The
> REMOTE_CONDOR_begin_execution is where this assert exists. Right before
> the assert is:
> 
> 	syscall_sock->decode();
> 	result = syscall_sock->code(rval);
> 	ASSERT( result );
> 
> Is this a socket call to shadow or to something else? I'm having trouble
> determining where the socket was being opened to.
> 

It's a call to the shadow. Specifically, the starter is waiting for the
shadow to acknowledge that the starter is about to start running the job.

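code() either writes a value to the network or reads one from it, depending
on whether the socket is in encode or decode mode. In your case the socket is
in decode mode, so the starter is reading the shadow's reply, and it waited 5
minutes (14:28:31 to 14:33:31 in your log) before timing out. Two questions: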

1. Is your schedd or submit machine very busy? The shadow doesn't do much
for this call: it just updates the log file and a few attributes in the job
queue. If your submit machine is very busy, I suppose the shadow could be
blocked for 5 minutes trying to finish.

2. Are you running a mixed Condor pool (machines on different Condor
versions)? I don't see anything between 6.8.0 and more recent versions that
would have changed the protocol, but I suppose it's possible something snuck
in without us realizing it.
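
To make the encode/decode behaviour of code() concrete, here is a minimal
sketch. It is not Condor's actual stream class: ToyStream, send_reply,
wait_for_reply and demo are hypothetical names, and an in-memory queue stands
in for the network. It only illustrates why a decode-mode code() that never
receives data returns false, which is the condition the ASSERT in your
snippet trips on.

```cpp
#include <cassert>
#include <deque>

// Toy stand-in for a bidirectional stream. NOT Condor's real socket class;
// the deque below is a hypothetical in-memory "network".
class ToyStream {
public:
    void encode() { encoding_ = true; }   // subsequent code() calls write
    void decode() { encoding_ = false; }  // subsequent code() calls read

    // Encode mode: put value on the wire.
    // Decode mode: take the next value off the wire into value.
    // Returns false in decode mode when nothing has arrived, which here
    // stands in for a read timeout or EOF.
    bool code(int &value) {
        if (encoding_) {
            wire_.push_back(value);
            return true;
        }
        if (wire_.empty()) {
            return false;
        }
        value = wire_.front();
        wire_.pop_front();
        return true;
    }

private:
    bool encoding_ = true;
    std::deque<int> wire_;
};

// Hypothetical peer side: send a reply code.
bool send_reply(ToyStream &s, int rval) {
    s.encode();
    return s.code(rval);
}

// Mirrors the decode()/code(rval) pair from the quoted snippet, but returns
// the result instead of asserting on it.
bool wait_for_reply(ToyStream &s, int &rval) {
    s.decode();
    return s.code(rval);
}

// Usage sketch: the read fails while the wire is empty and succeeds once
// the peer has replied.
void demo() {
    ToyStream s;
    int rval = -1;
    assert(!wait_for_reply(s, rval));  // no reply yet: the real code would ASSERT
    assert(send_reply(s, 0));
    assert(wait_for_reply(s, rval) && rval == 0);
}
```

In this toy version the failed read returns immediately; in your log the
equivalent read blocked for the full 5 minutes before giving up.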

-Erik