HTCondor Project List Archives




Re: [Condor-devel] brokenness in the negotiator



On Tue, Sep 25, 2012 at 03:54:33PM -0500, Zachary Miller wrote:
> 
> hi all,
> 
> i just discovered something.  i messed up pretty good with commit
> a8e0d4854cc8dfd0b806f4235c3648438f4550e0 and just now discovered it.

good news: the problem is not as bad as i first thought.  there are a lot
of details here that jaime and i just sorted out.  read on!


> it's in both the stable and devel series.

clarification: there are actually two problems.  one exists only in V7_8_4, and
the other exists only in V7_9_0.

when i discovered this, i was running the end of the master branch, which at
the time contained both problems.  but as far as released versions of condor
go, there's no release that contains both.


> the problem is that the Accountantnew.log becomes corrupt due to the change,
> and so when the negotiator starts up it barfs.

problem 1)
V7_8_4 writes malformed entries into Accountantnew.log.  however, no other
released version of condor writes the bad entries.

problem 2)
V7_9_0 will abort on startup if the Accountantnew.log contains entries that
were written by V7_8_4.  no other released version aborts on bad entries.


that is good news!  V7_8_4 does *NOT* abort on startup when reading its own
malformed log.  looking at git, it seems there were plenty of changes made on
the master branch to classad_log.cpp, which is where the EXCEPT() is being
triggered.

(many many thanks to jaime for discovering that V7_8_4 does not abort, about
two minutes before i announced a lot of incorrect information on condor-users.)

what currently exists in the V7_8_5-branch will not abort, will not write
malformed entries, and in fact vacuums the bad entries out.
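
to keep the combinations straight, here is a small python sketch of the
behavior described above.  it only encodes what this thread says about each
version; the table layout, the function name, and the outcome strings are made
up for illustration and nothing here talks to condor itself.

```python
# Behavior of each version as described in this thread (not queried from condor).
BEHAVIOR = {
    "V7_8_4":        {"writes_bad": True,  "aborts_on_bad": False, "vacuums_bad": False},
    "V7_9_0":        {"writes_bad": False, "aborts_on_bad": True,  "vacuums_bad": False},
    "V7_8_5-branch": {"writes_bad": False, "aborts_on_bad": False, "vacuums_bad": True},
}


def upgrade_outcome(old, new):
    """Describe what happens when `new` starts on a log last written by `old`."""
    if not BEHAVIOR[old]["writes_bad"]:
        return "ok"
    if BEHAVIOR[new]["aborts_on_bad"]:
        return "negotiator aborts on startup"
    if BEHAVIOR[new]["vacuums_bad"]:
        return "starts, bad entries vacuumed"
    return "starts, bad entries remain"


if __name__ == "__main__":
    for new in BEHAVIOR:
        print("V7_8_4 ->", new, ":", upgrade_outcome("V7_8_4", new))
```

the only combination that aborts is a V7_9_0 negotiator starting on a log
written by V7_8_4.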

so the problem for 7.8.5 is not that it "crashes" but that we'd lose the
accounting data generated from a 7.8.4 negotiator.  however, alan has finished
the tool that cleans the log, and that is included in the V7_8_5 distribution.

i see three open questions:

0) i pulled V7_8_4 from the download page, but didn't announce that yet to
   condor-users and condor-world.  should i put it back?

1) do we want to change the negotiator in the stable series to automatically
   invoke this tool in an attempt to salvage usage data?

2) do we want to change the master branch to be more resilient and not abort
   when it encounters a malformed entry?


my default answers: 0:Y, 1:N, 2:Y, but i welcome your feedback.
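
for question 2, a rough python sketch of what "more resilient" could look like:
skip and report a malformed entry instead of aborting.  this is not the
classad_log.cpp code, and the "opcode key [rest]" line format, the opcode
numbers in the usage note, and the names parse_entry and load_log are all
assumptions made up for the example.

```python
def parse_entry(line):
    """Parse one 'opcode key [rest]' line (hypothetical format).

    Raises ValueError if the line does not have a numeric opcode and a key.
    """
    parts = line.split(None, 2)
    if len(parts) < 2 or not parts[0].isdigit():
        raise ValueError("malformed entry: %r" % line)
    rest = parts[2] if len(parts) > 2 else ""
    return int(parts[0]), parts[1], rest


def load_log(lines, strict=False):
    """Return (good_entries, skipped) for an iterable of log lines.

    With strict=True the first malformed entry raises, which mirrors the
    abort-on-startup behavior.  With strict=False malformed entries are
    collected as (line_number, text) and loading continues.
    """
    good, skipped = [], []
    for number, line in enumerate(lines, 1):
        line = line.rstrip("\n")
        if not line.strip():
            continue  # ignore blank lines
        try:
            good.append(parse_entry(line))
        except ValueError:
            if strict:
                raise
            skipped.append((number, line))
    return good, skipped
```

calling load_log(["103 Customer.zach Priority 0.5", "garbage"]) keeps the first
entry and reports the second as skipped, so the caller can log a warning and
carry on with the data that did parse.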


cheers,
-zach