Showing posts with label JMS.

Friday, March 18, 2016

SHOCKER: XA Distributed Transactions are only Eventually Consistent!

Apologies for the tabloid-trash style headline. It could have been worse, I could have gone with my working title of "WARNING: XA will eat your first-born"!

This topic has come up in a few conversations I've had recently. It turns out that most people don't realise what I'd assumed to be widely understood: two-phase commit (2PC), and XA (its widely used implementation), are NOT ACID compliant. Specifically, XA/2PC does not provide strong consistency guarantees and is in fact only Eventually Consistent. It gets worse in practice. In places where I've seen XA/2PC used, it transpires that Atomicity and Durability are on shaky ground too (more on this later).

Why have I seen this topic rearing its head recently? Well, some organisations have cases where they want to update data in an Oracle database and in a MongoDB database as a single transaction. Nothing wrong with that of course; it really just depends on how you choose to implement the transaction and what your definition of a "transaction" really is. All too often, those stating this requirement will then go on to say that MongoDB has a problem because it does not support the XA protocol. They want their database products to magically take care of this complexity, under the covers. They want them to provide the guarantee of "strong" consistency across multiple systems, without having to deal with this in their application code. If you're one of those people, I'm here to tell you that these are not the droids protocols you are looking for.

Let me have a go at explaining why XA/2PC distributed transactions are not strongly consistent. This is based on what I've seen over the past 10 years or so, especially in the mid-2000s, when working with some UK government agencies and seeing this issue at first hand.


First of all, what are some examples of distributed transactions?
  • You've written a piece of application code that needs to put the same data into two different databases (eg. Oracle's database and IBM's DB2 database), all or nothing, as a single transaction. You don't want to have a situation where the data appears in one database but not in the other.
  • You've written a piece of application code that receives a message off a message queue (eg. IBM MQ Series) and inserts the data contained in the message into a database. You want these two operations to be part of the same transaction. You want to avoid the situation where the dequeue operation succeeds but the DB insert operation fails, resulting in a lost message and no updated database. You also want to avoid the situation where the database insert succeeds, but the acknowledgement of the dequeue operation subsequently fails, resulting in the message being re-delivered (a duplicate database insert of the same data would then occur).
  • Another "real-world" example that people [incorrectly] quote, is moving funds between two bank systems, where one system is debited, say £50, and the other is credited by the same amount, as a single "transaction". Of course you wouldn't want the situation to occur where £50 is taken from one system, but due to a transient failure, is not placed in the other system, so money is lost *. The reason I say "incorrectly" is that in reality, banks don't manage and record money transfers this way. Eric Brewer explains why this is the case in a much more eloquent way than I ever could. On a related but more existential note, Gregor Hohpe's classic post is still well worth a read: Your Coffee Shop Doesn’t Use Two-Phase Commit.
* although if you were the receiver of the funds, you might like the other possible outcome, where you receive two lots of £50, due to the occurrence of a failure during the 1st transaction attempt
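As an aside, the duplicate-delivery outcome in the second example above is usually defused by making the consumer idempotent. The following is a minimal sketch of the idea, and entirely my own illustration: a real implementation would key on a unique-constrained column in the database rather than an in-memory Set, and the class and method names are made up.

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative only: dedupe re-delivered messages by a unique message ID.
// In production the "seen" check would be a unique constraint in the DB,
// so that the check and the insert commit atomically.
class IdempotentConsumer {
    private final Set<String> processedIds = new HashSet<>();

    // Returns true if the message was processed, false if it was a
    // duplicate re-delivery and was therefore skipped.
    boolean process(String messageId, String payload) {
        if (!processedIds.add(messageId)) {
            return false; // already seen: the earlier insert committed
        }
        // ... the real database insert of 'payload' would go here ...
        return true;
    }

    public static void main(String[] args) {
        IdempotentConsumer consumer = new IdempotentConsumer();
        System.out.println(consumer.process("msg-1", "data")); // true
        System.out.println(consumer.process("msg-1", "data")); // false (re-delivery)
    }
}
```

With this in place, a re-delivered message becomes harmless, which is one way of living with "at least once" delivery instead of demanding "exactly once" from the middleware.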



So what's the problem then?

Back in the mid-2000s, I was involved in building distributed systems with application code running on a Java EE Application Server (WebLogic). The code would update a record in a database (Oracle) and then place a message on a queue (IBM MQ Series), as part of the same distributed transaction, using XA. At a simplistic level, the transactions performed looked something like this:
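Sketched in pseudocode (using just the resource names from the scenario above), the single distributed transaction was roughly:

```
BEGIN XA TRANSACTION
    UPDATE record in Oracle database
    ENQUEUE message onto MQ Series queue
COMMIT   (phase 1: both resources prepare; phase 2: both are told to commit)
```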



If the update to the DB failed, the enqueue operation to put the message onto the message queue would be rolled back, and vice versa. However, as with most real world scenarios, the business process logic was more involved than that. The business process actually had two stages, which looked like this:
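In pseudocode, the two-stage process was roughly:

```
TRANSACTION 1 (XA)
    INSERT data into Oracle database
    ENQUEUE message onto MQ Series queue
COMMIT

    ... queue delivers the committed message to the registered listener ...

TRANSACTION 2 (XA)
    DEQUEUE message (given to the message listener)
    SELECT the data inserted by transaction 1
COMMIT
```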


Basically, in the first stage of the process, the application code would put some data in the database and then put a message on a queue, as part of a single transaction, ready to allow the next stage of the process to be kicked off. The queuing system would already have a piece of application code registered with it, to listen for arriving messages (called a "message listener"). Once the message was committed to the queue, a new transaction would be initiated. In this new transaction the message would be given to the message listener. The listener's application code would receive the message and then read some of the data that was previously inserted into the database.

However, when we load tested this, before putting the solution into production, the system didn't always work that way. Sporadically, we saw this instead:
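In pseudocode, the sporadic failure looked like this:

```
TRANSACTION 2 (XA)
    DEQUEUE message (given to the message listener)
    SELECT the data inserted by transaction 1   --> NO ROWS FOUND!
```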


How could this be? A previous transaction had put the data in the database as part of the same transaction that put the message on the queue. Only when the message was successfully committed to the queue could the second transaction be kicked off. Yet the subsequent listener code couldn't find the row of data in the database that had been inserted by the previous transaction!

At first we assumed there was a bug somewhere and hunted for it in Oracle, MQ Series, WebLogic and especially our own code. Getting nowhere, we eventually started digging around the XA/2PC specification a little more, and we realised that the system was behaving correctly. It was correct behaviour to see such race conditions happen intermittently (even though it definitely wasn't desirable behaviour, on our part). This is because, even though XA/2PC guarantees that both resources in a transaction will have their changes either committed or rolled back atomically, it can't enforce exactly when this will happen in each. The final commit action (the 2nd phase of 2PC) performed by each of those resource systems is initiated in parallel and hence cannot be synchronised.

The asynchronous nature of XA/2PC, for the final commit process, is by necessity. This allows for circumstances where one of the systems may have temporarily gone down between voting "yes" to commit and subsequently being told to actually commit. If it were never possible for any of the systems to go down, there would be little need for transactions in the first place (quid pro quo). The application server controlling the transaction keeps trying to tell the failed system to commit, until it comes back online and executes the commit action. The database or message queue system can never 100% guarantee to commit immediately, and thus only guarantees to commit eventually. Even when there isn't a failure, the two systems are committed in parallel and will each take a different and non-deterministic duration to fulfil the commit action (including the variable time it takes to persist to disk, for durability). There's no way of guaranteeing that they both achieve this at exactly the same instant of time - they never will. In our situation, the missing data would eventually appear in the database, but there was no guarantee that it would always be there when the code in a subsequent transaction tried to read it. Indeed, while researching a couple of things for this blog post, I discovered that even Oracle now documents this type of race condition (see the section "Avoiding the XA Race Condition").
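The race can be demonstrated without any transaction manager at all. The toy simulation below is purely my own illustration (not our original system, and not a real XA implementation): two "phase two" commits are issued in parallel, with the queue commit deliberately finishing faster than the database commit, and a reader triggered by the queue commit then checks whether the database commit is visible yet.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicBoolean;

// Toy model of the 2PC second phase: both resources are told to commit
// at the same time, but each takes its own non-deterministic duration.
class TwoPhaseCommitRace {
    static boolean rowVisibleToListener() throws InterruptedException {
        AtomicBoolean dbCommitted = new AtomicBoolean(false);
        CountDownLatch queueCommitted = new CountDownLatch(1);
        ExecutorService pool = Executors.newFixedThreadPool(2);

        // Phase 2 of 2PC: both commits are initiated in parallel.
        pool.submit(() -> { pause(300); dbCommitted.set(true); });  // database commit (slower here)
        pool.submit(() -> { pause(1); queueCommitted.countDown(); }); // queue commit (faster here)

        queueCommitted.await();            // the message listener fires now...
        boolean visible = dbCommitted.get(); // ...and looks for the DB row "too early"
        pool.shutdownNow();
        return visible;
    }

    private static void pause(long millis) {
        try { Thread.sleep(millis); }
        catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }

    public static void main(String[] args) throws InterruptedException {
        // Almost always prints false: the queue commit "wins" the race.
        System.out.println("row visible when listener fires: " + rowVisibleToListener());
    }
}
```

In the real system the relative commit durations are not rigged like this, which is exactly why the missing row only showed up sporadically under load rather than on every run.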

Back then, we'd inadvertently created the perfect reproducible test case for PROOF THAT XA/2PC IS ONLY EVENTUALLY CONSISTENT. To fix the situation, we had to put some convoluted workarounds into our application code. The workarounds weren't pretty and they're not something I have any desire to re-visit here.


There's more! When the solution went live, things got even worse...

It wasn't just the "C" in "ACID" that was being butchered. It turned out that there was a contract out to do a hatchet job on the "A" and "D" of "ACID" too.

In live production environments, temporary failures of sub-systems will inevitably occur. In our high throughput system, some distributed transactions would always be in-flight at the moment of a failure. These in-flight transactions would then stick around for a while and some would be visible in the Oracle database (tracked in Oracle's "DBA_2PC_PENDING" system table). There's nothing wrong with this, except that the application code that created these transactions would have been holding a lock on one or more table rows. These locks are a result of the application code having performed an update operation, as part of the transaction, that had not yet been committed. In our live environment, due to these transactions being in-doubt for a while (minutes or even hours, depending on the type of failure), a cascading set of follow-on issues would occur. Subsequent client requests coming into the application would start backing up as they tried to query the same locked rows of data, and would get blocked or fail. This was due to the code having used the very common "SELECT ... FOR UPDATE" operation, which attempts to grab a lock on a row, ready for the row to be updated in a later step.

Pretty soon there would be a deluge of blocking or failing threads and the whole system would appear to lock up. No client requests could be serviced. Of course, the DBA would then receive a rush of calls from irate staff yelling that the mission critical database had ground to a halt. Under such time pressure, all the poor DBA could possibly do was to go to the source of the locks and try to release them. This meant going to Oracle's "pending transactions" system tables and unilaterally rolling back or committing each of them, to allow the system to recover and service requests again. At this point all bets were off. The DBA's decision to rollback or commit would have been completely arbitrary. Some of the in-flight transactions would have been partly rolled-back in Oracle, but would have been partly committed in MQ Series, and vice versa.

So in practice, these in-doubt transactions were applied neither Atomically nor Durably. The "theory" of XA guaranteeing Atomicity and Durability was not under attack. However, the practical real-world application of it was. At some point, fallible human intervention was required to quickly rescue a failing mission critical system. Most people I know live in the real world.


My conclusions...

You can probably now guess my view on XA/2PC. It is not a panacea. Nowhere near. It gives developers false hope, lulling them into a false sense of security, where, at best, their heads can be gently warmed whilst buried in the sand.

It is impossible to perform a distributed transaction on two or more different systems, in a fully ACID manner. Accept it and deal with it by allowing for this in application code and/or in compensating business processes. This is why I hope MongoDB is never engineered to support XA, as I'd hate to see such a move encourage good developers to do bad things.


Footnote: Even if your DBA refuses to unilaterally commit or rollback transactions, when the shit is hitting the fan, your database eventually will, thus violating XA/2PC. For example, in Oracle, the database will unilaterally decide to rollback all pending transactions, older than the default value of 24 hours (see "Abandoning Transactions" section of Oracle's documentation).


Song for today: The Greatest by Cat Power

Friday, August 10, 2007

Tips for Web Services Interoperability

[Originally posted on my old BEA Dev2Dev blog on August 10, 2007]
I try to follow some simple rules to maximise interoperability when developing Web Services using WebLogic and/or AquaLogic Service Bus (ALSB). Most of the rules are pretty obvious, but perhaps one or two are not. In case some of these rules are useful to others, I thought I'd share them, so here they are:
  1. Use SOAP over HTTP. I've blogged here about why SOAP over JMS shouldn't be used if interoperability is a concern.

  2. Conform to the WS-I Basic Profile 1.1 by using the free WS-I Test Tool. Test the WSDL and over-the-wire SOAP requests/responses for the created Web Services, for conformity using the tool available here (look for "Interoperability Testing Tools 1.1").

  3. Expose Web Services using the "Document-Literal-Wrapped" style with the 'dotNetStyle' flag, to help WS-I conformity and to be especially Microsoft-product friendly. I partly covered this in the blog here.

  4. Use the WS-* standards judiciously. WebLogic-implemented standards such as WS-Addressing, WS-Security, SAML and WS-ReliableMessaging are not necessarily implemented by other Web Services products/stacks, or the specification version they support may be different.

  5. Don't necessarily dismiss the use of WebLogic 'add-value' / 'non-standard' Web Services features at face value:

    • 'Buffered' Web Services are interoperable with other client Web Services stacks at the basic SOAP-HTTP level because the service consumer is not aware that the service implementation uses a JMS queue for buffering internally.
    • 'Callbacks' may be interoperable with non-WebLogic service consumers, as long as the non-WebLogic consumers include the WS-Addressing 'Reply-To' header in the request and provide a web service endpoint, at the specified 'Reply-To' URL, to be asynchronously called back on.
    • 'Asynchronous Requests/Responses' may be interoperable with non-WebLogic service providers as long as the non-WebLogic providers honour the received WS-Addressing 'Reply-To' header of the request, by sending the Web Service response asynchronously to the specified 'Reply-To' URL.
    • However, 'Conversational' Web Services are highly unlikely to be interoperable with non-WebLogic based service providers or consumers. The specification 'WS-Conversation' which the 'Conversational' feature would probably most clearly map to, doesn't really exist as a public specification and there is no indication that it ever will (an incomplete internal draft version has been dormant for a few years now).
  6. For SOAP/HTTP Proxies created in ALSB, activate the "WS-I compliance enforcement" option (for the development phase of the project at least). When ALSB is used to act as an intermediary between Web Services consumers and providers, this ALSB option will help any Web Service non-conformities to be detected, so that they can be quickly rectified.
Note: ALSB also transparently converts between SOAP version 1.1 and SOAP version 1.2 inbound and outbound messages and ALSB is specifically tested by BEA for interoperability against third-party vendor toolkits such as Microsoft .NET and Apache Axis.
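On point 3 above, the document/literal/wrapped style maps onto the standard JAX-WS @SOAPBinding annotation, roughly as in the fragment below. The service and operation are made-up examples for illustration only, and I'm deliberately not showing the WebLogic-specific 'dotNetStyle' mechanics here.

```java
import javax.jws.WebService;
import javax.jws.soap.SOAPBinding;

// Hypothetical service, for illustration of the binding style only.
@WebService
@SOAPBinding(style = SOAPBinding.Style.DOCUMENT,
             use = SOAPBinding.Use.LITERAL,
             parameterStyle = SOAPBinding.ParameterStyle.WRAPPED)
public class StockQuoteService {
    public double getQuote(String symbol) {
        return 0.0; // real implementation omitted
    }
}
```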


Soundtrack for today: Forensic Scene by Fugazi

Wednesday, March 28, 2007

The problem with using SOAP over JMS in SOA

[Originally posted on my old BEA Dev2Dev blog on March 28, 2007]
Sometimes I talk to people who seem to view the use of SOAP over JMS as the perfect combination to enable loosely coupled asynchronous shared services. However, when I dig deeper these people have invariably assumed that JMS is an 'over-the-wire' protocol, like HTTP. It is not.

Question: Why is this a problem?
Answer: Interoperability, plain and simple.

HTTP is a standard 'over-the-wire' protocol. HTTP belongs in the Application Layer of both the OSI model (layer 7) and the Internet Protocol Suite (layer 4 or 5). SOAP is a 'standard' (or W3C recommendation at least) transport-agnostic protocol which uses an XML payload.

Due to the standard and technology agnostic nature of both SOAP and HTTP, many platforms and toolkits out there, written in different languages and on different operating systems, can interoperate using SOAP over HTTP by simply adhering to both of these standards (or at least the WS-I version of these standards).

However, JMS is not an 'over-the-wire' protocol. It is a Java API which requires that a client application uses a JMS provider library (JAR) provided by the vendor of the JMS Server hosting the services. This is analogous to requiring a JDBC driver for a particular vendor's database before a Java application can talk to that database. The actual 'over-the-wire' protocol used under the covers within the JMS provider library is not defined (it could be IIOP for example, or it could be some high speed non-standard vendor specific protocol).
As a result, in most cases, the only types of applications which can talk to a specific vendor's JMS Server are other Java based applications. It gets worse. If, for example, the JMS server vendor is IBM WebSphere and the service consumer is running within Oracle's Application Server, there may be problems even getting IBM's JMS client provider library working from within Oracle's Application Server in the first place, due to JMS implementation clashes. Some JMS Server vendors provide one or two non-Java based JMS libraries too (for example for C++ or .NET), but these are often limited in functionality and scope and often only support specific versions of specific platforms and operating systems.
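The JDBC analogy can be made concrete with a fragment like the one below (the JNDI name is hypothetical). Note that although every call here is against the standard javax.naming and javax.jms APIs, none of it works unless the vendor's provider JAR and JNDI settings are available on the client's classpath:

```java
// Standard API, vendor-specific runtime: this fragment only runs with
// the JMS vendor's client JAR on the classpath. (JNDI name made up.)
InitialContext ctx = new InitialContext();  // configured with the vendor's JNDI provider
ConnectionFactory factory =
    (ConnectionFactory) ctx.lookup("jms/VendorConnectionFactory"); // vendor class behind the interface
Connection connection = factory.createConnection(); // vendor's own wire protocol from here on
```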

In other words, the onus of interoperability, when using SOAP over JMS, is on the support of the vendor of the JMS server for all possible service consumer environments rather than the onus being on the service client's host environment support for standards. Vendors cannot scale to provide JMS support for all of the wide mix of programming languages, application servers and operating systems (including different versions) out there, so interoperability will take a big hit. Even for consumer applications that can use the JMS provider, one has to give the service consumer the provider library first before it can invoke services - not very loosely coupled I think.

As a result, an enterprise's design choice to use SOAP over JMS, as the default mechanism for interoperability for an enterprise's mix of heterogeneous systems, is likely to be fundamentally flawed in my opinion.

It is important to state that I am not saying that Message Oriented Middleware (MOM) does not have a place in a SOA framework. In fact, quite the opposite is true. To achieve capabilities such as asynchronous messaging, guaranteed delivery, once-only delivery, and publish/subscribe mechanisms, MOMs are an essential part of the SOA fabric. That's why many vendors' ESB platforms are built on the underlying technology of Message Oriented Middleware. However, what I am saying is that JMS should not be the preferred API for exposing shared services to remote service clients. Using middleware such as an ESB, for example, a service with an asynchronous interface can be exposed via a SOAP over HTTP interface, where the ESB performs the switching between the consumer-facing synchronous invocation protocol and the underlying internal asynchronous message-passing mechanism (which may or may not use JMS internally).

With the right organisation and governance in place, I believe it can be valid to decide to expose a shared service via SOAP/JMS in addition to SOAP/HTTP or another more 'open' protocol, where there are valid exceptional circumstances (eg. high performance requirements). However, it is probably best to treat these decisions on an exception-by-exception basis, because supporting two access methods for a service carries an additional overhead in configuration, maintenance, and testing costs.
Is HTTP the perfect transport for SOAP, especially for asynchronous services? Not at all. However, if consumers can't invoke these services in the first place - that's worse.
Have I got something against JMS? Not at all. It's one of my favourite Java EE APIs. I'm not talking about Java EE here. I am talking about SOA.

Soundtrack for today: Happy Man by Sparklehorse