[Condor-users] Quill out of sync
- Date: Wed, 22 Jul 2009 09:36:55 +0200
- From: Carsten Aulbert <carsten.aulbert@xxxxxxxxxx>
- Subject: [Condor-users] Quill out of sync
Hi all,
we may have a problem here, caused by a networking issue yesterday when our
management network was flooded with traffic.
We have four head nodes which share a negotiator in HA mode. At some point
yesterday one node could not connect to any other head node and decided it
would be the negotiator for a couple of minutes. Now we have the weird
situation that quill and the "direct" query are out of sync:
Querying against quill
atlas2# condor_q -g |grep running
2 jobs; 0 idle, 2 running, 0 held
9648 jobs; 3150 idle, 6498 running, 0 held
Direct query
atlas2# condor_q -g -direct schedd|grep running
21 jobs; 8 idle, 13 running, 0 held
2081 jobs; 1 idle, 2080 running, 0 held
1 jobs; 0 idle, 1 running, 0 held
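To put a number on the mismatch, the per-schedd summary lines from the two queries above can be summed. The following is only a rough Python sketch, not part of any Condor tooling: the helper name `total_summary` and the hard-coded strings (copied from the grep output above) are assumptions made for illustration.

```python
import re

# Matches summary lines such as "21 jobs; 8 idle, 13 running, 0 held".
SUMMARY = re.compile(r"(\d+) jobs; (\d+) idle, (\d+) running, (\d+) held")


def total_summary(text):
    """Sum the job counters over all summary lines found in text."""
    totals = {"jobs": 0, "idle": 0, "running": 0, "held": 0}
    for match in SUMMARY.finditer(text):
        # The regex groups are in the same order as the keys of totals.
        for key, value in zip(totals, match.groups()):
            totals[key] += int(value)
    return totals


# Output copied from the two condor_q invocations above (hypothetical variables).
quill_output = """\
2 jobs; 0 idle, 2 running, 0 held
9648 jobs; 3150 idle, 6498 running, 0 held
"""
direct_output = """\
21 jobs; 8 idle, 13 running, 0 held
2081 jobs; 1 idle, 2080 running, 0 held
1 jobs; 0 idle, 1 running, 0 held
"""

quill = total_summary(quill_output)
direct = total_summary(direct_output)

print("quill :", quill)
print("direct:", direct)
print("difference in running jobs:", quill["running"] - direct["running"])
```

With these inputs quill reports 9650 jobs (6500 running, 3150 idle), while the direct queries add up to 2103 jobs (2094 running, 9 idle), so quill shows 4406 more running jobs than the schedds do.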
condor_status believes this:
              Total  Owner  Claimed  Unclaimed  Matched  Preempting  Backfill
X86_64/LINUX   6602      0     2070         31        0           0      4501
       Total   6602      0     2070         31        0           0      4501
The negotiator agrees by telling me (for any user):
Got NO_MORE_JOBS; done negotiating
How do we get quill and the daemons back in sync? It has been in this state
for more than 12 hours now, so I would have assumed it had a chance to
replay the "forgotten" transactions, right?
Cheers
Carsten