
Re: [Condor-devel] Paper on scalability of write-ahead-logs



On 12/30/2010 01:43 PM, Brian Bockelman wrote:
Hi folks,

During the break, I've been reading an interesting paper from VLDB 2010:

http://infoscience.epfl.ch/record/149436/files/vldb10aether.pdf

It talks about scalability issues of write-ahead logs in DBs, but I
found the topics relevant to the I/O scalability issues in the Condor
schedd. In particular, there are two concepts which may be applicable
(though I don't know the Condor code well enough to determine whether
they are):

1) Early-lock-release (ELR) and flush pipelining. ELR releases a
transaction's locks as soon as its commit record is queued, before the
commit I/O completes, while still not returning to the calling routine
until the I/O is finished; flush pipelining takes several commits and
flushes them as one I/O operation. Together, they can offer the same
durability guarantees while greatly decreasing the number of small I/O
operations performed. Note: the kernel community calls this I/O
plugging. These techniques work when there are many independent
transactions in flight; if there are too many dependent transactions
(transactions which can't start until the previous one finishes), the
benefit disappears. (A rough sketch of the batching idea appears after
this list.)
- Obviously, these techniques were designed for heavily-threaded
environments. I don't know whether the schedd can continue with other
work while it waits for a transaction to finish.
2) Asynchronous commit - i.e., lie to the user about their changes
being safely on disk. The paper treats this as an anti-pattern and
works to show that techniques such as (1) can deliver equivalent
performance. However, there's a reason that databases (even Oracle)
allow this mode - it gives users the power to weigh the value of data
integrity and recoverability against the value of scalability. There
are certainly times when an extra factor of 2 in scalability (making
the numbers up) is worth the risk of losing 30s of status changes. I'm
not advocating that the Condor team change its opinion about the
relative merits, or that the default be changed - but each site should
be allowed to decide this for itself.
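
To make (1) concrete, here is a rough C++ sketch of the batching idea
- classic group commit, the simplest form of flush pipelining. None of
this is Condor code and every name is invented: committers queue a
record, wake the flusher, and block until one fsync() has covered
their whole batch.

#include <condition_variable>
#include <cstdint>
#include <cstdio>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

class GroupCommitLog {
public:
    // Queue one commit record and block until it is durable. While a
    // flush is in flight, later committers pile up in buf_; the next
    // flush covers all of them with a single (simulated) fsync().
    void commit(const std::string& record) {
        std::unique_lock<std::mutex> lk(mu_);
        buf_.push_back(record);
        const uint64_t my_lsn = ++last_queued_;
        cv_.notify_all();  // wake the flusher
        cv_.wait(lk, [&] { return last_flushed_ >= my_lsn; });
    }

    // Dedicated flusher: one I/O per batch instead of one per commit.
    void flusher() {
        for (;;) {
            std::vector<std::string> batch;
            uint64_t up_to;
            {
                std::unique_lock<std::mutex> lk(mu_);
                cv_.wait(lk, [&] { return !buf_.empty(); });
                batch.swap(buf_);
                up_to = last_queued_;
            }
            // Stand-in for write(2) + fsync(2) on the real log file.
            std::printf("flushed %zu commit(s) in one I/O\n", batch.size());
            {
                std::lock_guard<std::mutex> lk(mu_);
                last_flushed_ = up_to;
            }
            cv_.notify_all();  // release every committer in the batch
        }
    }

private:
    std::mutex mu_;
    std::condition_variable cv_;
    std::vector<std::string> buf_;  // records not yet flushed
    uint64_t last_queued_ = 0;      // LSN of newest queued record
    uint64_t last_flushed_ = 0;     // LSN through which we are durable
};

int main() {
    auto* log = new GroupCommitLog;                   // leaked on purpose:
    std::thread([log] { log->flusher(); }).detach();  // flusher never exits
    std::vector<std::thread> workers;
    for (int i = 0; i < 4; ++i)
        workers.emplace_back([log, i] {
            log->commit("job " + std::to_string(i));
        });
    for (std::thread& w : workers) w.join();
    return 0;
}

True flush pipelining goes one step further and doesn't block the
committing thread at all - it parks the transaction and picks up other
work - which is exactly the part that may or may not map onto the
schedd.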

At any rate, the paper is a good read; hope others enjoy it.

Brian

I've not read the paper, but I have read the xact code in Condor and your email. 8o)

The xact code is structured in such a way that there are no
partial/progressive writes to disk for a transaction. The transaction
exists in memory and is flushed to disk on commit[1]. Also,
transactions do not nest, typically exist one at a time[2], and contain
no data dependencies. The structure being maintained in the log is
quite simple compared to general DB tables.
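
In rough, hypothetical C++ (not the actual code - the names are
invented), the shape is something like:

#include <cstdio>
#include <string>
#include <utility>
#include <vector>

// One pending operation, e.g. the effect of a SetAttribute() call.
struct LogOp {
    std::string key;
    std::string value;
};

class Transaction {
public:
    void append(LogOp op) { ops_.push_back(std::move(op)); }

    // All-or-nothing: nothing reaches the log before this point, so a
    // crash mid-transaction leaves no partial state to undo.
    void commit(std::FILE* log) {
        for (const LogOp& op : ops_)
            std::fprintf(log, "SET %s %s\n",
                         op.key.c_str(), op.value.c_str());
        std::fflush(log);  // fsync(fileno(log)) would follow for durability
        ops_.clear();
    }

    void abort() { ops_.clear(); }  // just drop the in-memory list

private:
    std::vector<LogOp> ops_;  // the whole transaction lives here
};

int main() {
    Transaction xact;
    xact.append({"JobStatus", "2"});
    xact.append({"RemoteHost", "node37"});
    xact.commit(stdout);
    return 0;
}

Because nothing touches the disk before commit(), abort is just
dropping the in-memory list; there is no undo to replay.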

[1] Great work has been done in this space, actually allowing for two types of transactions - those that must be flush()d (durable) and those that can wait (nondurable). Operations such as updating JobStatus are marked as durable, while a stat update from the starter may be marked as nondurable. This cuts down on sync operations and has been a good win in the past.
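
A hypothetical sketch of that split (again, invented names, not the
real interface) - only durable commits force a sync, and each sync
retroactively covers the nondurable commits written before it:

#include <cstdio>

enum class Durability { Durable, Nondurable };

class XactLog {
public:
    void commit(Durability d) {
        // write(2) of the transaction's records elided
        ++unsynced_;
        if (d == Durability::Durable) {
            // fsync(log_fd) would go here; one sync also makes every
            // nondurable commit appended since the last sync durable.
            std::printf("fsync() covering %d commit(s)\n", unsynced_);
            unsynced_ = 0;
        }
    }

private:
    int unsynced_ = 0;  // commits written but not yet synced
};

int main() {
    XactLog log;
    log.commit(Durability::Nondurable);  // e.g. a stat update
    log.commit(Durability::Nondurable);
    log.commit(Durability::Durable);     // e.g. a JobStatus change:
    return 0;                            // one sync covers all three
}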

[2] The Schedd is single-threaded and transactions are naturally
serialized. The only place where multiple concurrent transactions exist
is SOAP calls. Theoretically an application could expose itself to
isolation issues there, but AFAIK that has not happened in practice
since SOAP support was introduced in 2005.

Thanks for the pointer. I've always wanted to see a perf analysis of a
full-blown RDBMS in place of the Schedd's xact log, but I don't hold
out much hope of finding benefits there.

Best,


matt