Mailing List Archives
Re: [Condor-users] Quill++ assistance
- Date: Wed, 25 Aug 2010 14:12:54 +0800
- From: <Greg.Hitchen@xxxxxxxx>
- Subject: Re: [Condor-users] Quill++ assistance
Hi Erik

The 1hr 25 mins is definitely not related (as far as I can tell) to virus
scans/server activity/etc. I've checked all the scheduled types of activities
that our PCs get installed with and nothing "fits".

In addition I have installed 7.4.3 onto several PCs now and they all exhibit
the 1hr 25 mins restart of condor_quill, and it always starts exactly 1 hr 25
mins after condor is started, i.e. any time I do a condor net stop, condor net
start on them, the first of the 1hr 25 mins restarts begins 1 hr 25 mins after
this.

There is a dprintf_failure.QUILL file created but it is empty and 0 bytes in
size. No core file is created and condor_quill quite happily gets restarted by
condor_master after 10 secs, until the MasterLog again says it exits with
error 44 after the next 1hr 25 mins. Nothing gets logged in the QuillLog.

Cheers
Greg

Greg: The "exit 44" issue is odd - status 44 means that Condor
couldn't log some piece of information (which is why you don't see anything in
the logs :). While I wouldn't rule anything in Condor out, 1:25:00 is not a
number that strikes me as special in any of the Condor code, so I'm not sure
what would happen on the Condor side with that periodicity. Is there any
file server/virus scan/etc. sort of activity at your site that might interfere
with writes to files?
Greg/Michael: the ACCESS_VIOLATION is happening in a strange spot. To
answer your question, the Quill daemon should run continuously - however,
if it is consistently crashing, the master will exponentially back off trying to
run it until it only tries once an hour - so it may be likely that you'll see a
core file with no Quill daemon running.
If that's the case and it is consistently crashing, I would love to see
your full QuillLog, along with your sql.log file. We should be able to play it
back and see exactly why it's crashing.
Thanks,
-Erik
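
The back-off Erik describes can be sketched in a few lines of Python. This is
only an illustration, not Condor's actual code: it assumes the delay follows
constant + factor^n, capped at a ceiling, with values of 9 seconds, 2.0 and
3600 seconds (believed to be the defaults for MASTER_BACKOFF_CONSTANT,
MASTER_BACKOFF_FACTOR and MASTER_BACKOFF_CEILING, but not confirmed anywhere
in this thread). The function names are made up for the example.

```python
# Sketch of a capped exponential back-off for daemon restarts.
# ASSUMPTION (not stated in this thread): delay = constant + factor**n,
# with constant=9 s, factor=2.0, ceiling=3600 s.

def restart_delay(n_restarts, constant=9.0, factor=2.0, ceiling=3600.0):
    """Seconds to wait before the next restart, after n_restarts earlier restarts."""
    return min(constant + factor ** n_restarts, ceiling)


def delays(max_restarts=15):
    """Delays for the first max_restarts consecutive restarts."""
    return [restart_delay(n) for n in range(max_restarts)]


if __name__ == "__main__":
    for n, d in enumerate(delays()):
        print(f"restart {n}: wait {d:.0f} s")
```

Under these assumed values the first delay is 10 seconds, which is consistent
with the 10 sec restart Greg reports, and the delay stops growing once it
reaches 3600 seconds, i.e. the "once an hour" limit mentioned above.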
On Wed, Aug 11, 2010 at 8:48 PM, <Greg.Hitchen@xxxxxxxx> wrote:
Perhaps not much help Michael, but we've had similar problems with 7.2.4 on
Windows (see first attached email). It behaved somewhat better for 7.4.1 (see
second attached email) and at least ran, even though restarting condor_quill
every 1hr 25mins, but a number of other problems/issues with the 7.4 series
have not allowed us to upgrade to that version yet.

Cheers
Greg

I have these specified already and I do not see any issues. The QuillLog file
shows SQL statements and success at populating the tables. However, I am
finding a file on all machines other than the central manager that has an
access violation error. I am not sure if the condor_quill.exe daemon is
supposed to run continuously, but I do not see it running on any machines
other than the central manager. The file that is showing up in the log
directory on each machine is called core.QUILL.WIN32. Its contents are below
(does this mean anything to anyone else?):
<...>