HTCondor Project List Archives




[Condor-devel] What is the starter trying to do at line 321 in NTsenders.C?



I'm debugging a persistent problem in our SJ system where our Windows
machines end up Claimed+Idle after a short period of time. The machines
are failing to run jobs, and I'm seeing the following in the StarterLog
file:

6/11 14:28:31 (fd:6) (pid:3268) in VanillaProc::StartJob()
6/11 14:28:31 (fd:6) (pid:3268) Executable is .bat, so running
C:\WINDOWS\system32\cmd.exe /Q /C condor_exec.bat
/experiments/ichesal/dummy.win/a-00/01
6/11 14:28:31 (fd:6) (pid:3268) in OsProc::StartJob()
6/11 14:28:31 (fd:6) (pid:3268) IWD: d:\abc\condor/execute\dir_3268
6/11 14:28:31 (fd:6) (pid:3268) TokenCache contents: 
swbatch1@ALTERA
6/11 14:28:31 (fd:6) (pid:3268) PRIV_CONDOR --> PRIV_USER at
..\src\condor_starter.V6.1\os_proc.C:227
6/11 14:28:31 (fd:7) (pid:3268) Input file: NUL
6/11 14:28:31 (fd:8) (pid:3268) Output file:
d:\abc\condor/execute\dir_3268\wrapper.log
6/11 14:28:31 (fd:9) (pid:3268) Error file:
d:\abc\condor/execute\dir_3268\wrapper.err
6/11 14:28:31 (fd:9) (pid:3268) Doing CONDOR_begin_execution
6/11 14:28:31 (fd:9) (pid:3268) condor_read(): nfds=0
6/11 14:33:31 (fd:9) (pid:3268) condor_read(): nfound=0
6/11 14:33:31 (fd:9) (pid:3268) condor_read(): timeout reading buffer.
6/11 14:33:31 (fd:9) (pid:3268) IO: EOF reading packet header
6/11 14:33:31 (fd:9) (pid:3268) Stream::get(int) failed to read padding
6/11 14:33:31 (fd:9) (pid:3268) ERROR "Assertion ERROR on (result)" at
line 322 in file ..\src\condor_starter.V6.1\NTsenders.C
6/11 14:33:31 (fd:9) (pid:3268) Doing CONDOR_ulog
6/11 14:33:31 (fd:9) (pid:3268) ShutdownFast all jobs.
6/11 14:33:31 (fd:9) (pid:3268) Got ShutdownFast when no jobs running.
6/11 14:33:31 (fd:9) (pid:3268) Destroying Daemon object:
6/11 14:33:31 (fd:9) (pid:3268) Type: 11 (shadow), Name:
sj-schedd1.altera.com, Addr: <137.57.202.107:48964>
6/11 14:33:31 (fd:9) (pid:3268) FullHost: (null), Host: (null), Pool:
(null), Port: -1
6/11 14:33:31 (fd:9) (pid:3268) IsLocal: N, IdStr: (null), Error: (null)
6/11 14:33:31 (fd:9) (pid:3268)  --- End of Daemon object info ---

These assert errors in NTsenders.C and NTreceivers.C keep popping up.
Now that I've got the source for 6.8.0, I'm looking at NTsenders.C. The
assert is in REMOTE_CONDOR_begin_execution; it and the two lines right
before it are:

	syscall_sock->decode();
	result = syscall_sock->code(rval);
	ASSERT( result );

Is syscall_sock a socket to the shadow or to something else? I'm having
trouble determining what the other end of that socket is connected to.
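
For what it's worth, here is how I currently read the failure, written
as a small standalone sketch rather than as Condor code. In the log,
condor_read() waits from 14:28:31 to 14:33:31, reports nfound=0 and a
timeout, Stream::get(int) then fails, and the ASSERT fires. The sketch
assumes the read amounts to "wait for the peer with a timeout, then
read one int". It uses plain POSIX socketpair() and select() instead of
Condor's socket classes, and read_int_with_timeout() and
begin_execution_reply() are names I made up for illustration. The
200 ms timeout is also arbitrary.

```cpp
#include <sys/select.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>
#include <cassert>
#include <cstdint>

// Hypothetical helper (not a Condor function): wait up to timeout_ms for
// data on fd, then read one 32-bit int. Returns false on timeout, error,
// EOF, or a short read. On timeout, value is left untouched.
bool read_int_with_timeout(int fd, int timeout_ms, int32_t &value) {
    fd_set readfds;
    FD_ZERO(&readfds);
    FD_SET(fd, &readfds);

    timeval tv;
    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;

    // select() returns 0 when nothing became readable before the timeout.
    int nfound = select(fd + 1, &readfds, nullptr, nullptr, &tv);
    if (nfound <= 0) {
        return false;
    }

    ssize_t n = read(fd, &value, sizeof(value));
    return n == static_cast<ssize_t>(sizeof(value));
}

// Hypothetical stand-in for the request/reply step: the "peer" end of a
// socketpair either sends back an int or stays silent.
bool begin_execution_reply(bool peer_replies, int32_t &rval) {
    int fds[2];
    int rc = socketpair(AF_UNIX, SOCK_STREAM, 0, fds);
    assert(rc == 0);
    (void)rc;

    if (peer_replies) {
        int32_t reply = 7;  // arbitrary value for the demo
        ssize_t w = write(fds[1], &reply, sizeof(reply));
        assert(w == static_cast<ssize_t>(sizeof(reply)));
        (void)w;
    }

    // Analogous to "result = syscall_sock->code(rval);" in the snippet.
    bool result = read_int_with_timeout(fds[0], 200, rval);

    close(fds[0]);
    close(fds[1]);
    return result;
}
```

With a silent peer, begin_execution_reply() returns false, which is the
case where an ASSERT( result ) would abort. If that reading is right,
the assert is a symptom of the other end never answering within the
timeout, not a problem in the starter's own logic.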

- Ian

P.S. I can take this to the condor-user lists if you guys want. I just
figured I was quoting lines of code so it'd be better off on this list.
Let me know...

--
Ian R. Chesal <ichesal@xxxxxxxxxx>
Senior Software Engineer

Altera Corporation
Toronto Technology Center
Tel: (416) 926-8300