Re: [condor-users] Condor sleeping
- Date: Wed, 16 Jun 2004 08:36:22 -0500
- From: Nick LeRoy <nleroy@xxxxxxxxxxx>
- Subject: Re: [condor-users] Condor sleeping
On Wed June 16 2004 5:16 am, Mark Silberstein wrote:
> It never happened for me with 6.4.7. It started with 6.6 series, and is
> not only annoying, but makes my users feel that the system is
> unreliable, which unfortunately is true in these circumstances. I wish I
> had more time to debug it, but maybe if someone has at least some time
> to try moving collector and negotiator to another machine (with another
> IP/Name - maybe some problem with DNS resolution), and more likely -
> Linux or at least not Windows. From all mails on the list it feels like
> Windows causes some problems here. By the way, I don't experience any
> problems with the pools working with Linux-based matchmaker.
It'd be useful to know if these problems are being caused by lost updates, or
by some other problem. Fortunately, we have some new sources of data....
Have you tried looking at the new "Collector Updates Stats" fields? They can
be used to help quantify lost updates. As of Condor 6.6.2,
"condor_updates_stats" is shipped with Condor; it's a Perl script that
parses this output into more readable text:
nleroy@chopin% condor_status -l c2-001 | grep Updates
UpdatesTotal = 12785
UpdatesSequenced = 12772
UpdatesLost = 33
UpdatesHistory = "0x00000000000000000000000000000000"
UpdatesTotal = 12678
UpdatesSequenced = 12666
UpdatesLost = 31
UpdatesHistory = "0x00100000000000000000000000000000"
nleroy@chopin% condor_status -l c2-001 | condor_updates_stats
(Reading from stdin)
*** Name/Machine = 'vm1@xxxxxxxxxxxxxxxxxx' MyType = 'Machine' ***
Type: Main
Stats: Total=12785, Seq=12772, Lost=33 (0.26%)
0: Ok
...
127: Ok
*** Name/Machine = 'vm2@xxxxxxxxxxxxxxxxxx' MyType = 'Machine' ***
Type: Main
Stats: Total=12678, Seq=12666, Lost=31 (0.24%)
0: Ok
...
11: Missed
12: Ok
...
127: Ok
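For anyone who wants to do the same decoding by hand, here is a minimal Python sketch of how the fields above appear to fit together. It is not the shipped Perl script. The bit layout (slot 0 is the most significant bit, a set bit means a missed update) is an assumption inferred from the sample output above, where "0x001..." lines up with slot 11 being reported as Missed. The function names are hypothetical.

```python
def decode_updates_history(history_hex):
    """Return one boolean per history slot; True means the update looks missed.

    Assumption (inferred from the sample output, not from Condor docs):
    slot 0 is the most significant bit of the hex string.
    """
    digits = history_hex.strip().strip('"')
    if digits.lower().startswith("0x"):
        digits = digits[2:]
    nbits = len(digits) * 4          # 32 hex digits -> 128 slots
    value = int(digits, 16)
    return [bool((value >> (nbits - 1 - i)) & 1) for i in range(nbits)]


def lost_percent(total, lost):
    """Percentage of lost updates, matching the 'Lost=31 (0.24%)' style."""
    return 100.0 * lost / total if total else 0.0


# Values copied from the vm2 output above.
missed = decode_updates_history("0x00100000000000000000000000000000")
print([i for i, m in enumerate(missed) if m])   # [11]
print(f"{lost_percent(12678, 31):.2f}%")        # 0.24%
```

The percentage is simply UpdatesLost divided by UpdatesTotal, which reproduces both the 0.26% and 0.24% figures shown above.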
If you know your update interval (default = 5 minutes), you can pass it in
seconds with --interval, and the script will estimate when each missed update
occurred:
nleroy@chopin% condor_status -l c2-001 | condor_updates_stats --interval=300
(Reading from stdin)
*** Name/Machine = 'vm1@xxxxxxxxxxxxxxxxxx' MyType = 'Machine' ***
Type: Main
Stats: Total=12786, Seq=12773, Lost=33 (0.26%)
0 @ Wed Jun 16 08:31:30 2004: Ok
...
127 @ Tue Jun 15 21:56:30 2004: Ok
*** Name/Machine = 'vm2@xxxxxxxxxxxxxxxxxx' MyType = 'Machine' ***
Type: Main
Stats: Total=12679, Seq=12667, Lost=31 (0.24%)
0 @ Wed Jun 16 08:31:31 2004: Ok
...
12 @ Wed Jun 16 07:31:31 2004: Missed
13 @ Wed Jun 16 07:26:31 2004: Ok
...
127 @ Tue Jun 15 21:56:31 2004: Ok
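The timestamps in that output are consistent with simple arithmetic: slot N is placed N intervals before the most recent update. A rough Python sketch of that estimate follows. It assumes slot 0 corresponds to the time of the latest update; the function name is hypothetical, and the real script may pick its reference time differently.

```python
from datetime import datetime, timedelta


def estimate_slot_times(latest, interval_seconds, slots=128):
    """Estimate a timestamp for each history slot.

    Assumption: slot 0 is the most recent update, and each older slot
    is exactly one update interval earlier.
    """
    return [latest - timedelta(seconds=i * interval_seconds)
            for i in range(slots)]


# Reference time taken from slot 0 of the vm2 output above.
latest = datetime(2004, 6, 16, 8, 31, 31)
times = estimate_slot_times(latest, 300)
print(times[12])    # 2004-06-16 07:31:31
print(times[127])   # 2004-06-15 21:56:31
```

These are only estimates: if the real update interval drifts, or updates arrive late rather than being lost, the guessed times will be off by a corresponding amount.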
-Nick
> On Wed, 2004-06-16 at 12:53, Ron Viloria wrote:
> > I've always seen it happen, as early as 6.4.7; again, it's more of an
> > annoyance. I've always assumed it's because of the CPU being busy doing
> > non-condor stuff in the background or something like antivirus or
> > backups.
--
<<< The matrix has you. >>>
/`-_ Nicholas R. LeRoy The Condor Project
{ }/ http://www.cs.wisc.edu/~nleroy http://www.cs.wisc.edu/condor
\ / nleroy@xxxxxxxxxxx The University of Wisconsin
|_*_| 608-265-5761 Department of Computer Sciences