Friday, March 18, 2016

SHOCKER: XA Distributed Transactions are only Eventually Consistent!

Apologies for the tabloid-trash style headline. It could have been worse; I could have gone with my working title of "WARNING: XA will eat your first-born"!

This topic has come up in a few conversations I've had recently. It turns out that most people don't realise what I'd assumed to be widely understood: two-phase commit (2PC), and XA (its widely used implementation), are NOT ACID compliant. Specifically, XA/2PC does not provide strong consistency guarantees and is in fact only Eventually Consistent. It gets worse in practice. In places where I've seen XA/2PC used, it transpires that Atomicity and Durability are on shaky ground too (more on this later).

Why have I seen this topic rearing its head recently? Well, some organisations have cases where they want to update data in an Oracle database and in a MongoDB database as a single transaction. Nothing wrong with that of course; it really just depends on how you choose to implement the transaction and what your definition of a "transaction" really is. All too often, those stating this requirement will then go on to say that MongoDB has a problem because it does not support the XA protocol. They want their database products to magically take care of this complexity, under the covers. They want them to provide the guarantee of "strong" consistency across the multiple systems, without having to deal with this in their application code. If you're one of those people, I'm here to tell you that these are not the droids (sorry, protocols) you are looking for.

Let me have a go at explaining why XA/2PC distributed transactions are not strongly consistent. This is based on what I've seen over the past 10 years or so, especially in the mid-2000s, when working with some UK government agencies and seeing this issue at first hand.


First of all, what are some examples of distributed transactions?
  • You've written a piece of application code that needs to put the same data into two different databases (eg. Oracle's database and IBM's DB2 database), all or nothing, as a single transaction. You don't want to have a situation where the data appears in one database but not in the other.
  • You've written a piece of application code that receives a message off a message queue (eg. IBM MQ Series) and inserts the data, contained in the message, into a database. You want these two operations to be part of the same transaction. You want to avoid the situation where the dequeue operation succeeds but the DB insert operation fails, resulting in a lost message and no updated database. You also want to avoid the situation where the database insert succeeds, but the acknowledgement of the dequeue operation subsequently fails, resulting in the message being re-delivered (a duplicate database insert of the same data would then occur).
  • Another "real-world" example that people [incorrectly] quote, is moving funds between two bank systems, where one system is debited, say £50, and the other is credited by the same amount, as a single "transaction". Of course you wouldn't want the situation to occur where £50 is taken from one system, but due to a transient failure, is not placed in the other system, so money is lost *. The reason I say "incorrectly" is that in reality, banks don't manage and record money transfers this way. Eric Brewer explains why this is the case in a much more eloquent way than I ever could. On a related but more existential note, Gregor Hohpe's classic post is still well worth a read: Your Coffee Shop Doesn’t Use Two-Phase Commit.
* although if you were the receiver of the funds, you might like the other possible outcome, where you receive two lots of £50, due to the occurrence of a failure during the 1st transaction attempt



So what's the problem then?

Back in the mid-2000s, I was involved in building distributed systems with application code running on a Java EE Application Server (WebLogic). The code would update a record in a database (Oracle) and then place a message on a queue (IBM MQ Series), as part of the same distributed transaction, using XA. At a simplistic level, the transactions performed looked something like this:
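
For illustration, here is a minimal Java/JTA sketch of that kind of transaction. It is an assumption-laden simplification rather than the real code: the JNDI names, the CASE_DATA table and the submitCase method are all invented, and in the real system the transaction was container-managed by WebLogic rather than driven by hand through UserTransaction.

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import javax.jms.Queue;
    import javax.jms.QueueConnection;
    import javax.jms.QueueConnectionFactory;
    import javax.jms.QueueSender;
    import javax.jms.QueueSession;
    import javax.jms.Session;
    import javax.naming.InitialContext;
    import javax.sql.DataSource;
    import javax.transaction.UserTransaction;

    public class StageOne {

        // Insert a row into Oracle and enqueue a message onto MQ Series, both
        // enlisted in the same XA transaction via JTA. All names are invented.
        public void submitCase(long caseId, String payload) throws Exception {
            InitialContext ctx = new InitialContext();
            UserTransaction utx = (UserTransaction) ctx.lookup("java:comp/UserTransaction");
            DataSource ds = (DataSource) ctx.lookup("jdbc/OracleXADataSource");       // hypothetical name
            QueueConnectionFactory qcf =
                    (QueueConnectionFactory) ctx.lookup("jms/MQXAConnectionFactory"); // hypothetical name
            Queue queue = (Queue) ctx.lookup("jms/StageTwoQueue");                    // hypothetical name

            utx.begin();
            Connection dbConn = null;
            QueueConnection qConn = null;
            try {
                // 1. Write the business data to the database
                dbConn = ds.getConnection();
                PreparedStatement ps = dbConn.prepareStatement(
                        "INSERT INTO CASE_DATA (CASE_ID, PAYLOAD) VALUES (?, ?)");
                ps.setLong(1, caseId);
                ps.setString(2, payload);
                ps.executeUpdate();

                // 2. Enqueue a message saying the data is ready for the next step
                //    (inside a JTA transaction the XA-capable session enlists
                //    automatically; the flags below are effectively ignored)
                qConn = qcf.createQueueConnection();
                QueueSession qSession = qConn.createQueueSession(false, Session.AUTO_ACKNOWLEDGE);
                QueueSender sender = qSession.createSender(queue);
                sender.send(qSession.createTextMessage(Long.toString(caseId)));

                // 3. The transaction manager now runs 2PC: prepare both resources,
                //    then tell both of them to commit
                utx.commit();
            } catch (Exception e) {
                // All or nothing: neither the row nor the message should survive
                try { utx.rollback(); } catch (Exception ignore) { /* already resolved */ }
                throw e;
            } finally {
                if (qConn != null) qConn.close();
                if (dbConn != null) dbConn.close();
            }
        }
    }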



If the update to the DB failed, the enqueue operation to put the message onto the message queue would be rolled back, and vice versa. However, as with most real world scenarios, the business process logic was more involved than that. The business process actually had two stages, which looked like this:


Basically, in the first stage of the process, the application code would put some data in the database and then put a message on a queue, as part of a single transaction, ready to allow the next stage of the process to be kicked off. The queuing system would already have a piece of application code registered with it to listen for arriving messages (called a "message listener"). Once the message was committed to the queue, a new transaction would be initiated. In this new transaction the message would be given to the message listener. The listener's application code would receive the message and then read some of the data that had previously been inserted into the database.
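
The first stage was, in essence, the sketch shown earlier: data into Oracle, message onto MQ Series, one XA transaction. The second stage's message listener looked roughly like the following. Again this is a simplified sketch with invented names; in reality it was a message-driven bean whose new XA transaction was started by the container before the message was delivered.

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import javax.jms.Message;
    import javax.jms.MessageListener;
    import javax.jms.TextMessage;
    import javax.naming.InitialContext;
    import javax.sql.DataSource;

    // Stage two: the container delivers the message inside a NEW XA transaction,
    // and this listener reads back the row that stage one inserted.
    public class StageTwoListener implements MessageListener {

        public void onMessage(Message message) {
            try {
                long caseId = Long.parseLong(((TextMessage) message).getText());

                DataSource ds = (DataSource)
                        new InitialContext().lookup("jdbc/OracleXADataSource"); // hypothetical name
                try (Connection dbConn = ds.getConnection();
                     PreparedStatement ps = dbConn.prepareStatement(
                             "SELECT PAYLOAD FROM CASE_DATA WHERE CASE_ID = ?")) {
                    ps.setLong(1, caseId);
                    try (ResultSet rs = ps.executeQuery()) {
                        // We assumed this row would always be there, because it was
                        // committed in the same XA transaction as the message we are
                        // now processing...
                        if (!rs.next()) {
                            throw new IllegalStateException("No CASE_DATA row for case " + caseId);
                        }
                        processPayload(rs.getString("PAYLOAD"));
                    }
                }
            } catch (Exception e) {
                // Rethrowing makes the container roll back this transaction and
                // redeliver the message later
                throw new RuntimeException(e);
            }
        }

        private void processPayload(String payload) {
            // next stage of the business process
        }
    }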

However, when we load tested this, before putting the solution into production, the system didn't always behave that way. Sporadically, the listener in the second transaction would fire, query the database, and find no trace of the row that the first transaction had inserted.


How could this be? A previous transaction had put the data in the database as part of the same transaction that put the message on the queue. Only when the message was successfully committed to the queue could the second transaction be kicked off. Yet the subsequent listener code couldn't find the row of data that the previous transaction had inserted into the database!

At first we assumed there was a bug somewhere and hunted for it in Oracle, MQ Series, WebLogic and, especially, our own code. Getting nowhere, we eventually started digging around the XA/2PC specification a little more, and realised that the system was behaving correctly. It was correct behaviour to see such race conditions happen intermittently (even though it definitely wasn't desirable behaviour on our part). This is because, even though XA/2PC guarantees that both resources in a transaction will have their changes either committed or rolled back atomically, it can't enforce exactly when this will happen in each. The final commit action (the second phase of 2PC) performed by each of those resource systems is initiated in parallel and hence cannot be synchronised.

The asynchronous nature of XA/2PC, for the final commit process, is by necessity. This allows for circumstances where one of the systems may have temporarily gone down between voting "yes" to commit and then subsequently being told to actually commit. If it were never possible for any of the systems to go down, there would be little need for transactions in the first place (quid pro quo). The application server controlling the transaction keeps trying to tell the failed system to commit, until it comes back online and executes the commit action. The database or message queue system can never 100% guarantee to always commit immediately, and thus only guarantees to commit eventually. Even when there isn't a failure, the two systems are being committed in parallel and will each take a different and non-deterministic duration to fulfil the commit action (including the variable time it takes to persist to disk, for durability). There's no way of guaranteeing that they both achieve this at exactly the same instant in time - they never will. In our situation, the missing data would eventually appear in the database, but there was no guarantee that it would always be there when the code in a subsequent transaction tried to read it. Indeed, upon researching a couple of things while preparing this blog post, I discovered that even Oracle now documents this type of race condition (see the section "Avoiding the XA Race Condition").
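
To make the race concrete, here is a deliberately over-simplified toy simulation in plain Java (there is no real XA here, just two threads standing in for the two resource managers). Both are told to commit at the same instant; each then takes its own, variable, amount of time to make its change visible, and the ordering flips from run to run.

    import java.util.concurrent.CountDownLatch;
    import java.util.concurrent.ThreadLocalRandom;

    // Toy illustration of the second phase of 2PC: after both resources vote
    // "yes", the transaction manager issues the commits in parallel, and each
    // resource makes its change visible after its own non-deterministic delay.
    public class CommitRace {

        public static void main(String[] args) throws InterruptedException {
            CountDownLatch phaseTwo = new CountDownLatch(1);

            Thread oracleCommit = resource("Oracle row now visible", phaseTwo);
            Thread mqCommit = resource("MQ message now visible", phaseTwo);
            oracleCommit.start();
            mqCommit.start();

            phaseTwo.countDown(); // "commit" sent to both resources at the same moment
            oracleCommit.join();
            mqCommit.join();
        }

        private static Thread resource(String name, CountDownLatch phaseTwo) {
            return new Thread(() -> {
                try {
                    phaseTwo.await();
                    // Stand-in for variable commit latency: disk writes, network
                    // hops, or even a crash followed by recovery minutes later
                    Thread.sleep(ThreadLocalRandom.current().nextLong(1, 200));
                    System.out.println(name);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
    }

In our real system, the window between those two "now visible" moments is exactly where the second transaction's read would sometimes land.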

Back then, we'd inadvertently created the perfect reproducible test case for PROOF THAT XA/2PC IS ONLY EVENTUALLY CONSISTENT. To fix the situation, we had to put some convoluted workarounds into our application code. The workarounds weren't pretty and they're not something I have any desire to re-visit here.


There's more! When the solution went live, things got even worse...

It wasn't just the "C" in "ACID" that was being butchered. It turned out that there was a contract out to do a hatchet job on the "A" and "D" of "ACID" too.

In live production environments, temporary failures of sub-systems will inevitably occur. In our high-throughput system, some distributed transactions would always be in-flight at the time of a failure. These in-flight transactions would then stick around for a while, and some would be visible in the Oracle database (tracked in Oracle's "DBA_2PC_PENDING" system table). There's nothing wrong with this, except that the application code that created these transactions would have been holding a lock on one or more table rows. These locks were the result of the application code having performed an update operation, as part of the transaction, that had not yet been committed. In our live environment, because these transactions remained in-doubt for a while (minutes or even hours, depending on the type of failure), a cascading set of follow-on issues would occur. Subsequent client requests coming into the application would start backing up as they tried to query the same locked rows of data, and would get blocked or fail. This was because the code used the very common "SELECT ... FOR UPDATE" operation, which attempts to grab a lock on a row, ready for the row to be updated in a later step.
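
The read path that piled up looked roughly like this sketch (table and names invented as before). The crucial bit is the FOR UPDATE clause, which tries to take the very row lock that an in-doubt XA transaction may still be holding.

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import javax.sql.DataSource;

    // Simplified sketch of the read path. If an in-doubt XA transaction is still
    // holding the lock on this row, the SELECT ... FOR UPDATE blocks until that
    // transaction is resolved - which, after a failure, could be minutes or hours.
    public class LockingRead {

        private final DataSource ds; // assumed to be the same Oracle DataSource as before

        public LockingRead(DataSource ds) {
            this.ds = ds;
        }

        public String readForUpdate(long caseId) throws Exception {
            try (Connection conn = ds.getConnection();
                 PreparedStatement ps = conn.prepareStatement(
                         "SELECT PAYLOAD FROM CASE_DATA WHERE CASE_ID = ? FOR UPDATE")) {
                ps.setLong(1, caseId);
                try (ResultSet rs = ps.executeQuery()) { // blocks here while the row lock is held
                    return rs.next() ? rs.getString("PAYLOAD") : null;
                }
            }
        }
    }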

Pretty soon there would be a deluge of blocking or failing threads and the whole system would appear to lock up. No client requests could be serviced. Of course, the DBA would then receive a rush of calls from irate staff yelling that the mission critical database had ground to a halt. Under such time pressure, all the poor DBA could possibly do was to go to the source of the locks and try to release them. This meant going to Oracle's "pending transactions" system tables and unilaterally rolling back or committing each of them, to allow the system to recover and service requests again. At this point all bets were off. The DBA's decision to rollback or commit would have been completely arbitrary. Some of the in-flight transactions would have been partly rolled-back in Oracle, but would have been partly committed in MQ Series, and vice versa.

So in practice, these in-doubt transactions were applied neither Atomically nor Durably. The "theory" of XA guaranteeing Atomicity and Durability was not under attack; the practical, real-world application of it was. At some point, fallible human intervention was required to quickly rescue a failing mission critical system. Most people I know live in the real world.


My conclusions...

You can probably now guess my view on XA/2PC. It is not a panacea. Nowhere near. It gives developers false hope, lulling them into a false sense of security, where, at best, their heads can be gently warmed whilst buried in the sand.

It is impossible to perform a distributed transaction on two or more different systems, in a fully ACID manner. Accept it and deal with it by allowing for this in application code and/or in compensating business processes. This is why I hope MongoDB is never engineered to support XA, as I'd hate to see such a move encourage good developers to do bad things.


Footnote: Even if your DBA refuses to unilaterally commit or rollback transactions, when the shit is hitting the fan, your database eventually will, thus violating XA/2PC. For example, in Oracle, the database will unilaterally decide to rollback all pending transactions, older than the default value of 24 hours (see "Abandoning Transactions" section of Oracle's documentation).


Song for today: The Greatest by Cat Power