Mailing List Archives Authenticated access	UW Madison Computer Sciences Department Computer Systems Lab

[Condor-users] Windows XP Condor 7.4.0 Quill Issues

Date: Fri, 9 Jul 2010 08:34:09 -0600
From: "Michael O'Donnell" <odonnellm@xxxxxxxx>
Subject: [Condor-users] Windows XP Condor 7.4.0 Quill Issues

We have set up Quill and postgres with Condor. Everything appeared to work initially, but we then ran into three problems.

First, the wait before a submitted job started executing went from essentially no time spent Idle to over an hour.

Second, after a day or so we started getting errors such as this one emailed to the Condor administrator:

This is an automated email from the Condor system on machine "IGSKBACBLT106.domain". Do not reply.

"C:\Condor/bin/condor_quill.exe" on "IGSKBACBLT106.domain" exited with status 4.
Condor will automatically restart this process in 11 seconds.

*** Last 20 line(s) of file C:\Condor/log/QuillLog:
SessionDuration = "86400"
NewSession = "YES"
RemoteVersion = "$CondorVersion: 7.4.0 Oct 31 2009 BuildID: 193173 $"
ServerCommandSock = "<IP:1662>"
Command = 60010
AuthCommand = 60008
07/02 20:37:33 condor_write(fd=1716 <IP:3596>,,size=505,timeout=20,flags=0)
07/02 20:37:33 condor_read(fd=1716 <IP:3596>,,size=5,timeout=20,flags=0)
07/02 20:37:34 condor_read(): fd=1716
07/02 20:37:54 condor_read(): select returned 0
07/02 20:37:56 condor_read(): timeout reading 5 bytes from <IP:3596>.
07/02 20:37:57 IO: Failed to read packet header
07/02 20:37:58 Stream::get(int) failed to read padding
07/02 20:37:59 Failed to read ClassAd size.
07/02 20:37:59 SECMAN: no classad from server, failing
07/02 20:38:00 CLOSE <IP:1688> fd=1716
07/02 20:38:01 SECMAN: unable to create security session to <IP:3596> via TCP, failing.
07/02 20:38:02 ERROR: SECMAN:2004:Failed to create security session to <IP:3596> with TCP.|SECMAN:2007:Failed to end classad message.
07/02 20:38:05 DaemonCore: startCommand() to <IP:3596> failed. SendAliveToParent() failed.
07/02 20:38:06 ERROR "FAILED TO SEND INITIAL KEEP ALIVE TO OUR PARENT <IP:3596>" at line 9310 in file ..\src\condor_daemon_core.V6\daemon_core.cpp
*** End of file QuillLog

Third, machines in our pool started to drop off until the pool was no longer functioning. As a result, we disabled Quill, and everything went back to normal.

I came across a couple of related posts, but we have had no luck identifying the culprit:
https://lists.cs.wisc.edu/archive/condor-users/2010-March/msg00015.shtml
https://www-auth.cs.wisc.edu/lists/condor-users/2005-October/msg00402.shtml

Also, when Quill was first enabled, the postgres tables were populated as expected and everything looked good.

We have Quill, postgres, and the CM on the same server, but because our pool is small (~50 cores) we did not think this should be the problem. Our server and nodes are all Windows XP. We are using NTSSPI and SSL for security.

Does anyone have any thoughts?

Thank you for your comments,
Mike

