Some time ago I had the "opportunity" to lose some sleep worrying about a single transaction across two resource managers in a C/C++ application. To the uninitiated this sounds simple, trivial even. Since then, I've heard estimates of anywhere from 20 to 200 days to implement two phase commit. Usually I can only hope the people making those estimates realise their mistake before the project is committed, and certainly before it goes live.
If someone produces a significant estimate, say 200 days, for two phase commit, it is probably because they imagine implementing their own version of the XA specification. However, it is also likely that they haven't actually read the spec and will only implement a partial solution.
It seems questionable to me that any team could even partially implement two phase
commit in 20 days of effort. Even linking with a proper transaction manager
(Tuxedo, Encina, etc.) would take more than 20 days. There are other
alternatives, but none seems that easy (20 days): http://linuxfinances.info/info/tpmonitor.html
If you introduce the constraint of just two resource managers you
have a smaller problem, but it still seems unlikely that such a system could function even semi-properly for just 20 days of effort. A typical shortcut is the last resource
commit 'optimisation', where one resource manager (say an application server) supports XA and the other resource
manager supports only a local transaction. The thinking goes that we can simply commit the local transaction last and thereby safely modify both resource managers in one transaction. The WebLogic documentation provides a good description of how it accomplishes that, but it is also clear that this is an optimisation rather than a full two phase commit.
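To make the ordering concrete, here is a minimal C++ sketch of the last resource commit sequence. The types `XaResource` and `LocalResource` and the function `last_resource_commit` are hypothetical in-memory stand-ins I made up for illustration; they are not the XA API or anything from WebLogic, and there is no logging or networking here.

```cpp
#include <cassert>

// Hypothetical stand-in for a resource manager that supports two phase commit.
struct XaResource {
    bool prepared = false;
    bool committed = false;
    bool prepare() { prepared = true; return true; }
    void commit() { assert(prepared); committed = true; }
    void rollback() { prepared = false; }
};

// Hypothetical stand-in for a resource manager with only a local (one phase) commit.
struct LocalResource {
    bool committed = false;
    bool fail_on_commit = false;  // lets a caller simulate a failed local commit
    bool commit() {
        if (fail_on_commit) return false;
        committed = true;
        return true;
    }
};

enum class Outcome { Committed, RolledBack };

// Last resource commit: prepare the XA resource first, then let the outcome
// of the local commit act as the commit decision for the whole transaction.
Outcome last_resource_commit(XaResource& xa, LocalResource& local) {
    if (!xa.prepare()) {
        xa.rollback();
        return Outcome::RolledBack;
    }
    // A crash after the local commit is sent, but before its reply arrives,
    // leaves the coordinator with no record of what `local` actually did.
    if (!local.commit()) {
        xa.rollback();
        return Outcome::RolledBack;
    }
    xa.commit();
    return Outcome::Committed;
}
```

The happy path and the clean failure path both look fine, which is why the shortcut is tempting. The trouble is the window marked in the comment, where a crash leaves the outcome of the local commit unknown.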
One of my good friends provided the following description of what a two phase commit really needs to accomplish:
The reason for two phase commit is to simplify the programming
model (no matter how many resources the programmer has ‘touched’ all she needs
to do is check consistency and then commit) and to simplify recovery. Using a
short cut like last committer breaks the programming model (so you should only
use it in middleware like integration, not in a programming model) and does not
enable recovery.
Here is the recovery scenario. The program commits, we prepare
the resource manager that supports two phase and then we commit the one that doesn’t.
We go down before getting a response from the single phase resource manager. We
are now ‘in doubt’. When we come back up, what true two phase commit does is
first exchange log identifiers between the transaction manager and the two
resource managers. This establishes the context for recovery. The transaction
manager discovers the in doubt transaction and asks the two resource managers
if they had all prepared and if at least one had committed. If so they are all
committed. If not the transaction is rolled back.
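The recovery rule in that paragraph can be sketched in a few lines of C++. This is only a toy model of the decision as my friend describes it: the names `LoggedState`, `Decision` and `resolve_in_doubt` are my own invention, and a real transaction manager would also consult its own log rather than rely only on what the participants report.

```cpp
#include <cassert>
#include <vector>

// What each resource manager's log reports for the in-doubt transaction.
enum class LoggedState { NotPrepared, Prepared, Committed };

enum class Decision { Commit, Rollback };

// Commit only if every participant reached at least the prepared state
// and at least one of them is known to have committed; otherwise roll back.
Decision resolve_in_doubt(const std::vector<LoggedState>& participants) {
    assert(!participants.empty());
    bool all_prepared = true;
    bool any_committed = false;
    for (LoggedState state : participants) {
        if (state == LoggedState::NotPrepared) all_prepared = false;
        if (state == LoggedState::Committed) any_committed = true;
    }
    return (all_prepared && any_committed) ? Decision::Commit
                                           : Decision::Rollback;
}
```

For example, `resolve_in_doubt({LoggedState::Prepared, LoggedState::Committed})` yields `Decision::Commit`, while two participants that are both merely prepared yield `Decision::Rollback`. The point is that every input the function needs is available from a log, so no human has to guess.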
Now try that with last committer. When we come up, there is a
transaction prepared and no way, in the logs, to know if the single phase
resource committed or not. So either we heuristically commit (guess) or we ask
the operator to decide. Or worse, the transaction manager automatically rolls
back the prepared resource even though the last committer committed. Either way
we end up with a delay or an inconsistent database. We lose the automatic
recovery of two phase, which is its real bonus for a customer.
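The same kind of toy model shows where last committer falls down. In this hedged C++ sketch (again, `XaLog`, `LocalEvidence`, `Recovery` and `recover_last_committer` are hypothetical names of mine, not any product's API), the single phase resource writes no prepare record, so after a crash recovery frequently has nothing to go on.

```cpp
// What the two phase resource manager's log says about the transaction.
enum class XaLog { NotPrepared, Prepared, Committed };

// What recovery can learn about the single phase resource. With no prepare
// record and a lost reply, the usual answer is NoRecord.
enum class LocalEvidence { NoRecord, KnownCommitted, KnownRolledBack };

enum class Recovery { Commit, Rollback, InDoubt };

Recovery recover_last_committer(XaLog xa, LocalEvidence local) {
    // The XA resource only commits after the local commit succeeded.
    if (xa == XaLog::Committed) return Recovery::Commit;
    // Without a prepare, the local commit was never attempted.
    if (xa == XaLog::NotPrepared) return Recovery::Rollback;
    // The XA resource is prepared, so everything hinges on the local resource.
    switch (local) {
        case LocalEvidence::KnownCommitted:  return Recovery::Commit;
        case LocalEvidence::KnownRolledBack: return Recovery::Rollback;
        case LocalEvidence::NoRecord:        break;
    }
    // Nothing in the logs settles it: guess heuristically or ask the operator.
    return Recovery::InDoubt;
}
```

The `InDoubt` branch is the one that matters. In the crash described above, the inputs are `XaLog::Prepared` and `LocalEvidence::NoRecord`, and no amount of code in the transaction manager can turn that into a safe automatic answer.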
Two phase commit is not "easy" in C++; don't believe anyone who tells you it is. One of the hidden gems (to a C++ programmer) of the Java Application Server is its programming model for distributed transactions (two phase commit). In JEE it really is pretty trivial to update a queue and a database in one transaction. Sadly, the same cannot be said for unmanaged environments like the typical C++ application.