HTCondor Project List Archives



[Condor-devel] Fwd: [Condor-users] Quill++ assistance



Anyone ever seen anything like this?

In terms of interesting numbers: 1 hour and 25 minutes is 5100
seconds. Quill does set a 10 second timer, so in theory 510 of them go
off. At the risk of starting a wild goose chase, that's pretty close to
512, which is one of those magic numbers...
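
For anyone who wants to check the arithmetic, here is a minimal sketch. The 10-second timer period and the 1 hour 25 minute interval come from this thread; the variable names are only for illustration and are not taken from the Quill source.

```python
# Interval between condor_quill restarts reported in the thread: 1 hr 25 min.
RESTART_INTERVAL_S = 1 * 3600 + 25 * 60

# Quill's timer period as stated above (10 seconds).
TIMER_PERIOD_S = 10

# How many timer firings fit into one restart interval.
timer_firings = RESTART_INTERVAL_S // TIMER_PERIOD_S

# Distance from the "magic number" 512.
gap_to_512 = 512 - timer_firings

print(RESTART_INTERVAL_S, timer_firings, gap_to_512)
```

The gap of two firings (about 20 seconds) is small, but that alone does not show the count is involved in the restarts.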

I'm sort of inclined to blame something Windows-specific here, as I've
never heard of this on the Unix side of the house.

-Erik

---------- Forwarded message ----------
From:  <Greg.Hitchen@xxxxxxxx>
Date: Wed, Aug 25, 2010 at 1:12 AM
Subject: Re: [Condor-users] Quill++ assistance
To: condor-users@xxxxxxxxxxx


Hi Erik

The 1hr 25 mins is definitely not related (as far as I can tell) to
virus scans/server activity/etc.
I've checked all the scheduled activities that our PCs get
installed with and nothing "fits".

In addition I have installed 7.4.3 onto several PCs now and they all
exhibit the 1hr 25 restart
of condor_quill and it always starts exactly 1 hr 25 mins after condor
is started, i.e. anytime
I do a condor net stop, condor net start on them then the first of the
1hr 25mins restarts
begins 1 hr 25mins after this.

There is a dprintf_failure.QUILL file created but it is empty and 0
bytes in size.
No core file is created and condor_quill quite happily gets restarted
by condor_master after
10 secs until the MasterLog again says it exits with error 44 after
the next 1hr 25 mins.
Nothing gets logged in the QuillLog.

Cheers

Greg
________________________________
From: condor-users-bounces@xxxxxxxxxxx
[mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Erik Paulson
Sent: Tuesday, 24 August 2010 3:46 AM
To: Condor-Users Mail List
Subject: Re: [Condor-users] Quill++ assistance


Greg: The "exit 44" issue is odd - status 44 means that Condor
couldn't log some piece of information (which is why you don't see
anything in the logs :). While I wouldn't rule anything in Condor out,
1:25:00 is not a number that strikes me as special in any of the
Condor code, so I'm not sure what would happen on the Condor side with
that periodicity. Are there any file server/virus scans/etc sort of
activity that might interfere with writes to files that happen at your
site?
Greg/Michael: the ACCESS_VIOLATION is happening in a strange spot. To
answer your question, the Quill daemon should run continuously -
however, if it is consistently crashing, the master will exponentially
back off trying to run it until it only tries once an hour - so it may
be likely that you'll see a core file with no Quill daemon running.
If that's the case and it is consistently crashing, I would love to
see your full QuillLog, along with your sql.log file. We should be
able to play it back and see exactly why it's crashing.
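
To illustrate the kind of backoff described above, here is a rough sketch of a capped exponential restart delay. The formula and the values of `constant`, `factor` and `ceiling` are assumptions chosen for illustration (so that the first delay is 10 seconds and the cap is one hour); they are not read from the condor_master source or from any particular configuration.

```python
def restart_delays(constant=9.0, factor=2.0, ceiling=3600.0, max_failures=15):
    """Return the wait (in seconds) before each successive restart attempt.

    Hypothetical model: delay = constant + factor ** n, capped at ceiling,
    where n counts consecutive failed starts beginning at 0.
    """
    delays = []
    for n in range(max_failures):
        delays.append(min(constant + factor ** n, ceiling))
    return delays


delays = restart_delays()

# Index of the first attempt whose delay has reached the one-hour cap.
first_capped = next(i for i, d in enumerate(delays) if d >= 3600.0)

for n, d in enumerate(delays):
    print(f"failure {n}: wait {d:.0f} s")
```

Under these assumed values the delay reaches the one-hour cap after about a dozen consecutive crashes, which is why a crashing daemon can appear to be "not running" for long stretches.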
Thanks,
-Erik

On Wed, Aug 11, 2010 at 8:48 PM, <Greg.Hitchen@xxxxxxxx> wrote:
>
> Perhaps not much help Michael but we've had similar problems with 7.2.4 on Windows
> (see first attached email). It behaved somewhat better for 7.4.1 (see second attached email)
> and at least ran, even though restarting condor_quill every 1hr 25mins, but a number of other
> problems/issues with the 7.4 series have not allowed us to upgrade to that version yet.
>
> Cheers
>
> Greg
>
> ________________________________
> From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Michael O'Donnell
> Sent: Thursday, 12 August 2010 3:56 AM
> To: Condor-Users Mail List
> Subject: Re: [Condor-users] Quill++ assistance
>
>
> I have these specified already and I do not see any issues. The QuillLog file shows SQL statements and success at populating the tables.
>
> However, I am finding a file on every machine other than the central manager that has an access violation error. I am not sure if the condor_quill.exe daemon is supposed to run continuously, but I do not see it running on any machines other than the central manager.
>
> The file that is showing up in the log directory on each machine is called core.QUILL.WIN32. Its contents are (does this mean anything to anyone else?):
>
<...>


----- End forwarded message -----