I have not examined the time intervals between Quill daemon deaths in
our pool, but I get hundreds of emails stating that the Quill daemon died
and has restarted on each machine. I have been trying to get Quill to
work with Windows as well, and I have been posting on this topic to this
list. I mentioned earlier that I have a Postgres database on the same
server as our CM. I was going to try installing Postgres on a different
server, but I have not gotten around to this yet. I am pretty sure this
is not the problem, but it is something for me to try. I have also
noticed that the Quill daemon on our CM does not seem to die, but the
Quill daemons on all working nodes die on a regular basis. I have not
determined why this is the case; the only difference is the OS. Our
server runs Windows Server 2008 and our working nodes run 32/64-bit
Windows XP and Windows 7.
Mike
From: <Greg.Hitchen@xxxxxxxx>
To: <condor-users@xxxxxxxxxxx>
Date: 08/25/2010 08:07 PM
Subject: Re: [Condor-users] Quill++ assistance
Sent by: condor-users-bounces@xxxxxxxxxxx
That's correct, no other daemons are restarting, just condor_quill.
Interestingly, now that I have installed this version onto another
few PCs, the 1hr 25min is not EXACT. Two PCs that I "synched" yesterday
by restarting condor at the same time are now 2-3 minutes apart on
their condor_quill restarts. Maybe the condor_master restarting
condor_quill after 10secs isn't exact and the time diff gradually builds
up? I'll keep an eye on it.
Cheers
Greg
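
One way to check the "time diff gradually builds up" idea is to measure the gaps between successive condor_quill restarts on each PC and compare them with the nominal 1hr 25min. The sketch below is only an illustration: the timestamps are hypothetical values typed in by hand (not taken from any real MasterLog), and the function names `restart_intervals` and `per_cycle_drift` are made up for this example. Pulling the restart times out of the MasterLog is left out, since the exact log line format is not shown in this thread.

```python
from datetime import datetime, timedelta


def restart_intervals(stamps):
    """Return the gaps between consecutive restart times (ISO strings)."""
    times = [datetime.fromisoformat(s) for s in stamps]
    return [later - earlier for earlier, later in zip(times, times[1:])]


def per_cycle_drift(total_drift_s, elapsed_s, period_s):
    """Average drift per restart cycle, in seconds.

    total_drift_s: how far apart two PCs have moved (seconds)
    elapsed_s:     how long since they were restarted together (seconds)
    period_s:      nominal restart period (seconds)
    """
    cycles = elapsed_s / period_s
    return total_drift_s / cycles


# Hypothetical restart times for one PC, for illustration only.
pc_a = ["2010-08-25 09:00:00", "2010-08-25 10:25:10", "2010-08-25 11:50:20"]
nominal = timedelta(hours=1, minutes=25)

intervals = restart_intervals(pc_a)
for gap in intervals:
    print("gap:", gap, "excess over 1h25m:", gap - nominal)

# Assumption: "yesterday" is treated as roughly 24 hours of elapsed time.
period_s = nominal.total_seconds()
low = per_cycle_drift(2 * 60, 24 * 3600, period_s)
high = per_cycle_drift(3 * 60, 24 * 3600, period_s)
print(f"drift per cycle: about {low:.1f} to {high:.1f} seconds")
```

If the two PCs really were restarted about a day apart from the observation, a 2-3 minute difference spread over roughly 17 cycles works out to several seconds per cycle. That is the same order of magnitude as the 10 sec restart delay, but it is the difference between the two machines per cycle that matters, not the delay itself.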
-----Original Message-----
From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-bounces@xxxxxxxxxxx]
On Behalf Of Erik Paulson
Sent: Thursday, 26 August 2010 4:16 AM
To: Condor-Users Mail List
Subject: Re: [Condor-users] Quill++ assistance
And just to confirm, it's only Quill - none of the other daemons show
the same restart every hour and twenty-five minutes?
-Erik
On Wed, Aug 25, 2010 at 1:12 AM, <Greg.Hitchen@xxxxxxxx> wrote:
> Hi Erik
>
> The 1hr 25 mins is definitely not related (as far as I can tell) to virus
> scans/server activity/etc.
> I've checked all the scheduled type of activities that our PCs get installed
> with and nothing "fits".
>
> In addition I have installed 7.4.3 onto several PCs now and they all exhibit
> the 1hr 25 restart of condor_quill and it always starts exactly 1 hr 25 mins
> after condor is started, i.e. anytime I do a condor net stop, condor net
> start on them then the first of the 1hr 25mins restarts begins 1 hr 25mins
> after this.
>
> There is a dprintf_failure.QUILL file created but it is empty and 0 bytes in
> size.
> No core file is created and condor_quill quite happily gets restarted by
> condor_master after 10 secs until the MasterLog again says it exits with
> error 44 after the next 1hr 25 mins.
> Nothing gets logged in the QuillLog.
>
> Cheers
>
> Greg
> ________________________________
> From: condor-users-bounces@xxxxxxxxxxx
> [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Erik Paulson
> Sent: Tuesday, 24 August 2010 3:46 AM
> To: Condor-Users Mail List
> Subject: Re: [Condor-users] Quill++ assistance
>
>
> Greg: The "exit 44" issue is odd - status 44 means that Condor couldn't log
> some piece of information (which is why you don't see anything in the logs
> :). While I wouldn't rule anything in Condor out, 1:25:00 is not a number
> that strikes me as special in any of the Condor code, so I'm not sure what
> would happen on the Condor side with that periodicity. Are there any file
> server/virus scans/etc sort of activity that might interfere with writes to
> files that happen at your site?
> Greg/Michael: the ACCESS_VIOLATION is happening in a strange spot. To answer
> your question, the Quill daemon should run continuously - however, if it is
> consistently crashing, the master will exponentially back off trying to run
> it until it only tries once an hour - so it may be likely that you'll see a
> core file with no Quill daemon running.
> If that's the case and it is consistently crashing, I would love to see your
> full QuillLog, along with your sql.log file. We should be able to play it
> back and see exactly why it's crashing.
> Thanks,
> -Erik
>
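
The backoff behaviour described above (retrying less and less often until the master only tries once an hour) can be pictured with a generic exponential backoff. This is a minimal sketch, not Condor's actual formula or configuration: the starting delay of 10 seconds is taken from the restart delay reported in this thread, the one-hour ceiling from "once an hour", and the doubling factor is purely an assumption. The function name `restart_delays` is hypothetical.

```python
def restart_delays(base=10, factor=2, ceiling=3600, attempts=12):
    """Return the wait (in seconds) before each successive restart attempt.

    base:     first delay; 10 s matches the restart delay reported here
    factor:   multiplier per consecutive crash (assumed, not from Condor)
    ceiling:  longest wait; 3600 s corresponds to "once an hour"
    attempts: how many consecutive crashes to simulate
    """
    delays = []
    delay = base
    for _ in range(attempts):
        delays.append(min(delay, ceiling))
        # Grow the delay, but never beyond the ceiling.
        delay = min(delay * factor, ceiling)
    return delays


delays = restart_delays()
for attempt, wait in enumerate(delays, start=1):
    print(f"crash {attempt}: wait {wait} s before restarting")
```

With these assumed numbers the wait reaches the one-hour ceiling after about ten consecutive crashes, which would explain finding a core file on a machine where no Quill daemon appears to be running at that moment.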
> On Wed, Aug 11, 2010 at 8:48 PM, <Greg.Hitchen@xxxxxxxx> wrote:
>>
>> Perhaps not much help Michael but we've had similar problems with 7.2.4 on
>> Windows (see first attached email). It behaved somewhat better for 7.4.1
>> (see second attached email) and at least ran, even though restarting
>> condor_quill every 1hr 25mins, but a number of other problems/issues with
>> the 7.4 series have not allowed us to upgrade to that version yet.
>>
>> Cheers
>>
>> Greg
>>
>> ________________________________
>> From: condor-users-bounces@xxxxxxxxxxx
>> [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Michael O'Donnell
>> Sent: Thursday, 12 August 2010 3:56 AM
>> To: Condor-Users Mail List
>> Subject: Re: [Condor-users] Quill++ assistance
>>
>>
>> I have these specified already and I do not see any issues. The QuillLog
>> file shows SQL statements and success at populating the tables.
>>
>> However, I am finding a file on all machines other than the central manager
>> that has an access violation error. I am not sure if the condor_quill.exe
>> daemon is supposed to run continuously, but I do not see it running on any
>> machines other than the central manager.
>>
>> The file that is showing up in the log directory on each machine is called
>> core.QUILL.WIN32. Its contents are (Does this mean anything to anyone else):
>>
> <...>
>
> _______________________________________________
> Condor-users mailing list
> To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
> subject: Unsubscribe
> You can also unsubscribe by visiting
> https://lists.cs.wisc.edu/mailman/listinfo/condor-users
>
> The archives can be found at:
> https://lists.cs.wisc.edu/archive/condor-users/
>
>