Monday 6 August 2012

Two phase commit (XA Transactions)

Some time ago I had the "opportunity" to lose some sleep worrying about a single transaction across two resource managers in a C/C++ application.  To the uninitiated this sounds simple, trivial even.  Since then, I've heard estimates anywhere from 20 days to 200 days to implement two phase commit.  Usually I can only hope the people estimating realise their mistake before the project is committed, and certainly before it goes live.



If someone produces a significant estimate, say 200 days, for two phase commit, it is probably because they imagine implementing their own version of the XA specification.  However, it is also likely that they haven't actually read the spec and will only implement a partial solution.

It seems questionable to me that any team could even partially implement two phase commit in 20 days of effort.  Even linking with a proper transaction manager (Tuxedo, Encina, etc.) would take more than 20 days.  There are other alternatives, but none of them look like 20 days of work either:  http://linuxfinances.info/info/tpmonitor.html

If you introduce the constraint of just 2 resource managers you have a smaller problem, but it still seems unlikely that such a system could function even half properly after just 20 days of effort.  A typical approach (shortcut) might be to use a last resource commit 'optimisation' - where one resource manager (say an application server) supports XA and the other resource manager only supports a local transaction.  The thinking goes: we can simply commit the local transaction last and thereby safely modify both resource managers in one transaction.  The WebLogic documentation provides a good description of how it accomplishes that, but it also makes clear that this is an optimisation, not a substitute for two phase commit.
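To make the ordering concrete, here is a minimal C++ sketch of the last resource commit idea.  It is a toy, not WebLogic's implementation and not an XA binding: XaResource, LocalResource and lastResourceCommit are hypothetical names invented for illustration, and the "resource managers" simply record the calls they receive.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for a resource manager that supports prepare/commit.
struct XaResource {
    bool prepareOk = true;              // set to false to simulate a "no" vote
    std::vector<std::string> log;       // records the calls received
    bool prepare()  { log.push_back("prepare"); return prepareOk; }
    void commit()   { log.push_back("commit"); }
    void rollback() { log.push_back("rollback"); }
};

// Hypothetical stand-in for a resource manager with local transactions only.
struct LocalResource {
    bool commitOk = true;               // set to false to simulate a failed commit
    std::vector<std::string> log;
    bool commit()   { log.push_back("commit"); return commitOk; }
    void rollback() { log.push_back("rollback"); }
};

// Last resource commit: prepare the XA resource, then let the outcome of the
// single phase commit decide the fate of the prepared one.
bool lastResourceCommit(XaResource& xa, LocalResource& local) {
    if (!xa.prepare()) {                // XA resource voted no: undo everything
        xa.rollback();
        local.rollback();
        return false;
    }
    // A crash between here and xa.commit() leaves the XA resource prepared
    // with no durable record of what the local resource did.
    if (!local.commit()) {              // local commit failed: undo the prepared work
        xa.rollback();
        return false;
    }
    xa.commit();                        // local commit succeeded: finish the XA side
    return true;
}
```

The happy path is only a few lines, which is probably where the 20 day estimates come from.  The comment in the middle of lastResourceCommit marks the window that the sketch does nothing about.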


One of my good friends provided the following description of what a two phase commit really needs to accomplish:



The reason for two phase commit is to simplify the programming model (no matter how many resources the programmer has ‘touched’ all she needs to do is check consistency and then commit) and to simplify recovery. Using a short cut like last committer breaks the programming model (so you should only use it in middleware like integration, not in a programming model) and does not enable recovery.

Here is the recovery scenario. The program commits, we prepare the resource manager that supports two phase and then we commit the one that doesn’t. We go down before getting a response from the single phase resource manager. We are now ‘in doubt’. When we come back up, what true two phase commit does is first exchange log identifiers between the transaction manager and the two resource managers. This establishes the context for recovery. The transaction manager discovers the in doubt transaction and asks the two resource managers if they had all prepared and if at least one had committed. If so they are all committed. If not the transaction is rolled back.

Now try that with last committer. When we come up, there is a transaction prepared and no way, in the logs, to know if the single phase resource committed or not. So either we heuristically commit (guess) or we ask the operator to decide. Or worse, the transaction manager automatically rolls back the prepared resource even though the last committer committed. Either way we end up with a delay or an inconsistent database. We lose the automatic recovery of two phase, which is its real bonus for a customer. 
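The two recovery scenarios described above can be sketched in a few lines of C++.  This is only an illustration of the decision rule as quoted (all prepared and at least one committed means commit, otherwise roll back); RmState, Outcome and recover are hypothetical names, and a real transaction manager would also consult its own log rather than rely on the resource managers' answers alone.

```cpp
#include <vector>

// What a resource manager can report about a transaction after a restart.
// Unknown models a single phase resource that keeps no recoverable record.
enum class RmState { Active, Prepared, Committed, Unknown };

enum class Outcome { Commit, Rollback, InDoubt };

// Recovery decision following the rule in the quoted description.
// Simplification: any Unknown answer makes the whole transaction in doubt.
Outcome recover(const std::vector<RmState>& states) {
    bool anyCommitted = false;
    bool allPrepared = true;            // "prepared" here includes already committed
    for (RmState s : states) {
        if (s == RmState::Unknown) {
            return Outcome::InDoubt;    // nothing in the logs to decide from
        }
        if (s == RmState::Committed) anyCommitted = true;
        if (s == RmState::Active) allPrepared = false;
    }
    return (allPrepared && anyCommitted) ? Outcome::Commit : Outcome::Rollback;
}
```

With two true XA resources, recover({RmState::Prepared, RmState::Committed}) yields Outcome::Commit and recover({RmState::Prepared, RmState::Prepared}) yields Outcome::Rollback, so recovery is automatic either way.  With last committer the best the logs can offer is recover({RmState::Prepared, RmState::Unknown}), which yields Outcome::InDoubt: someone, or something, has to guess.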

Two phase commit is not "easy" in C++; don't believe anyone who tells you it is.  One of the hidden gems (to a C++ programmer) of the Java Application Server is its programming model for distributed transactions (two phase commit).  In JEE it really is pretty trivial to update a queue and a database in one transaction - sadly the same cannot be said for unmanaged environments like the typical C++ application.