Subject: [Condor-users] Windows XP Condor 7.4.0 Quill Issues
We have set up Quill and PostgreSQL
with Condor. Everything appeared to work initially, but we then ran into
several problems.
First, submitted jobs went from spending
essentially no time in the Idle state to waiting over an hour before they executed.
Second, after a day or so we started
getting errors such as this one emailed to the Condor administrator:
This is an automated email from the Condor system
on machine "IGSKBACBLT106.domain". Do not reply.
"C:\Condor/bin/condor_quill.exe" on "IGSKBACBLT106.domain"
exited with status 4.
Condor will automatically restart this process in 11 seconds.
*** Last 20 line(s) of file C:\Condor/log/QuillLog:
SessionDuration = "86400"
NewSession = "YES"
RemoteVersion = "$CondorVersion: 7.4.0 Oct 31 2009 BuildID: 193173
$"
ServerCommandSock = "<IP:1662>"
Command = 60010
AuthCommand = 60008
07/02 20:37:33 condor_write(fd=1716 <IP:3596>,,size=505,timeout=20,flags=0)
07/02 20:37:33 condor_read(fd=1716 <IP:3596>,,size=5,timeout=20,flags=0)
07/02 20:37:34 condor_read(): fd=1716
07/02 20:37:54 condor_read(): select returned 0
07/02 20:37:56 condor_read(): timeout reading 5 bytes from <IP:3596>.
07/02 20:37:57 IO: Failed to read packet header
07/02 20:37:58 Stream::get(int) failed to read padding
07/02 20:37:59 Failed to read ClassAd size.
07/02 20:37:59 SECMAN: no classad from server, failing
07/02 20:38:00 CLOSE <IP:1688> fd=1716
07/02 20:38:01 SECMAN: unable to create security session to <IP:3596>
via TCP, failing.
07/02 20:38:02 ERROR: SECMAN:2004:Failed to create security session to
<IP:3596> with TCP.|SECMAN:2007:Failed to end classad message.
07/02 20:38:05 DaemonCore: startCommand() to <IP:3596> failed. SendAliveToParent()
failed.
07/02 20:38:06 ERROR "FAILED TO SEND INITIAL KEEP ALIVE TO OUR PARENT
<IP:3596>" at line 9310 in file ..\src\condor_daemon_core.V6\daemon_core.cpp
*** End of file QuillLog
Third, machines in our pool started
to drop out until the pool was no longer functioning. As a result, we disabled
Quill, and everything went back to normal.
Also, when Quill was initially enabled,
the PostgreSQL tables were populated as expected and everything looked good.
We have Quill, PostgreSQL, and the central manager (CM) on the
same server, but because our pool is small (~50 cores) we did not
think this would be the problem. Our server and nodes all run Windows XP.
We are using NTSSPI and SSL for security.
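
For anyone triaging the QuillLog excerpt above, here is a minimal Python sketch that compares the timeout passed to condor_read() with the time that actually elapsed before "select returned 0". It only parses the four log lines copied from the excerpt. The helper names (parse, read_timeout_gap) are hypothetical, and the year is an arbitrary placeholder because the log timestamps do not include one.

```python
import re
from datetime import datetime

# Four lines copied verbatim from the QuillLog excerpt in this message.
LOG_EXCERPT = """\
07/02 20:37:33 condor_read(fd=1716 <IP:3596>,,size=5,timeout=20,flags=0)
07/02 20:37:34 condor_read(): fd=1716
07/02 20:37:54 condor_read(): select returned 0
07/02 20:37:56 condor_read(): timeout reading 5 bytes from <IP:3596>.
"""

STAMP = re.compile(r"^(\d{2}/\d{2} \d{2}:\d{2}:\d{2}) (.*)$")
TIMEOUT = re.compile(r"condor_read\(fd=.*timeout=(\d+)")


def parse(text, year=2000):
    """Return (datetime, message) pairs for each timestamped line.

    The year is a placeholder assumption; the log does not record one.
    """
    events = []
    for line in text.splitlines():
        m = STAMP.match(line)
        if m:
            ts = datetime.strptime(f"{year}/{m.group(1)}", "%Y/%m/%d %H:%M:%S")
            events.append((ts, m.group(2)))
    return events


def read_timeout_gap(text):
    """Return (configured_timeout, observed_seconds) for the first read
    that ends in 'select returned 0', or None if there is no such read."""
    start = configured = None
    for ts, msg in parse(text):
        m = TIMEOUT.search(msg)
        if m and start is None:
            start, configured = ts, int(m.group(1))
        elif start is not None and "select returned 0" in msg:
            return configured, int((ts - start).total_seconds())
    return None


if __name__ == "__main__":
    print(read_timeout_gap(LOG_EXCERPT))
```

On this excerpt the configured timeout is 20 seconds and the observed gap is 21 seconds, so the read simply ran out its full timeout without receiving any bytes from the parent at <IP:3596>. The sketch does not say why the parent failed to answer.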