Wednesday, September 12, 2012

Do I Need to Worry About the Availability and Recovery of WebLogic Transaction Logs?

When planning a WebLogic deployment that places a significant emphasis on High Availability or Disaster Recovery, it may be necessary to preserve WebLogic's Transaction Logs, to enable business-critical I.T. systems to be recovered to a correct and consistent state, following a system crash.

You may ask: What are WebLogic Transaction Logs?

Every WebLogic server has a persistent store (either on a file-system or in a database) to record information about the in-flight global transactions it co-ordinates. This is the Transaction Log, or TLOG for short. In the TLOG, WebLogic records each global transaction that has been flagged to commit but may not yet have committed in all the affected back-end data-stores.

A global transaction is a special type of transaction, where the host application has encompassed a set of updates to two or more different data-stores as a single, seemingly atomic operation. These data-stores could be relational databases, message queues or enterprise information systems, for example. A global transaction should either succeed or fail as a whole, without leaving any of the incorporated data-stores in an inconsistent state. As such, global transactions have ACID properties.

When WebLogic co-ordinates a global transaction, it uses a Two-Phase-Commit (2PC) protocol to interact with the data-store managers (called resource managers). The interface between the transaction manager (e.g. WebLogic) and each resource manager (e.g. a database) is defined by the XA industry standard. To summarise, when processing global transactions, the transaction manager needs to persist its commit decision somewhere, and in WebLogic's case, this is in its TLOG.
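To make the ordering concrete, here is a minimal Python sketch of a two-phase-commit coordinator that forces its commit decision to a log file before instructing any resource to commit. It is a toy model only, not WebLogic's implementation or API: the ResourceManager and Coordinator classes, the JSON-lines log format and the resource names are all hypothetical, invented purely for illustration.

```python
import json
import os
import tempfile


class ResourceManager:
    """Hypothetical in-memory stand-in for an XA resource (database, queue...)."""

    def __init__(self, name, can_prepare=True):
        self.name = name
        self.can_prepare = can_prepare
        self.pending = {}    # txid -> prepared but not yet committed update
        self.committed = {}  # txid -> committed update

    def prepare(self, txid, update):
        # Phase 1: stage the update and vote yes, or refuse and vote no.
        if not self.can_prepare:
            return False
        self.pending[txid] = update
        return True

    def commit(self, txid):
        # Phase 2: make the staged update permanent.
        self.committed[txid] = self.pending.pop(txid)

    def rollback(self, txid):
        self.pending.pop(txid, None)


class Coordinator:
    """Toy transaction manager whose 'TLOG' is a JSON-lines file (an assumption)."""

    def __init__(self, tlog_path):
        self.tlog_path = tlog_path

    def _log(self, record):
        # Append the record and force it to disk before carrying on.
        with open(self.tlog_path, "a") as f:
            f.write(json.dumps(record) + "\n")
            f.flush()
            os.fsync(f.fileno())

    def run(self, txid, work):
        """work is a list of (resource_manager, update) pairs."""
        resources = [rm for rm, _ in work]
        votes = [rm.prepare(txid, update) for rm, update in work]
        if not all(votes):
            for rm in resources:
                rm.rollback(txid)
            return "rolled-back"
        # The commit decision is persisted BEFORE any resource is told to commit.
        self._log({"txid": txid, "decision": "commit",
                   "resources": [rm.name for rm in resources]})
        for rm in resources:
            rm.commit(txid)
        # Every resource has committed, so the decision no longer needs recovering.
        self._log({"txid": txid, "decision": "done"})
        return "committed"


tlog = os.path.join(tempfile.mkdtemp(), "tlog.jsonl")
db = ResourceManager("orders-db")
queue = ResourceManager("billing-queue")
outcome = Coordinator(tlog).run(
    "tx-1", [(db, "insert order 42"), (queue, "send invoice 42")])
```

The important design point is the position of the first log write: if the process dies at any moment after it, the decision can still be read back from the log and replayed against whichever resources had not yet committed.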

So, what if WebLogic didn't persist transaction commit decisions?

If global transaction commit decisions are not persisted and the system fails, then under heavy load it is very likely that at least some transactions will still be in-flight, and temporarily at least, be in-doubt. At this point in time, for each transaction, the updates to one back-end data store may have committed, but the updates to another data-store, in the same transaction, may not yet have been instructed to commit (i.e. the data-store's updates are still pending). The system as a whole will have data in an inconsistent state. Once the failed parts of the system have been re-started, the data-stores holding pending updates will have no way of knowing whether the updates should be committed or rolled-back. The data in the system will then be permanently in an incorrect and inconsistent state. Even with manual human intervention, an administrator will have no way of knowing whether to commit or roll-back pending updates in a data-store, and so the correctness of a complete I.T. system will be forever in-doubt.
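The following small Python sketch illustrates why a data-store holding pending updates cannot work out the answer on its own. It simulates two different crash histories using plain dictionaries; the function name, the data-store names and the transaction id are hypothetical and exist only for this illustration.

```python
def crash_scenario(commit_decided):
    """Return the state of two data-stores at the moment the coordinator crashes.

    No decision has been persisted anywhere, so the coordinator's choice
    lived only in memory and is lost with the crash.
    """
    db = {"pending": {}, "committed": {}}
    queue = {"pending": {}, "committed": {}}

    # Phase 1: both data-stores have prepared (staged) their updates.
    db["pending"]["tx-1"] = "debit account"
    queue["pending"]["tx-1"] = "send payment message"

    if commit_decided:
        # The coordinator decided to commit and reached the first data-store
        # only, then crashed before instructing the second.
        db["committed"]["tx-1"] = db["pending"].pop("tx-1")

    return db, queue


# History A: crash before any commit decision (rolling back would be correct).
_, queue_a = crash_scenario(commit_decided=False)
# History B: crash after the first data-store committed (commit is correct).
_, queue_b = crash_scenario(commit_decided=True)

# From the queue's local point of view the two histories are indistinguishable.
same_local_view = queue_a == queue_b
```

In both histories the queue holds exactly the same pending update, yet the correct resolution differs, which is why the decision has to be recorded somewhere that survives the crash.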

So, how does WebLogic recover pending transactions?

WebLogic's TLOGs are a key component in avoiding data inconsistency. Following a system crash, WebLogic's built-in Transaction Recovery Service automatically determines the global transactions that are still pending, by reading the TLOG and polling the relevant back-end data stores. WebLogic is then able to instruct the back-end data stores to either commit or roll-back each pending transaction. Once WebLogic's Transaction Recovery Service completes, the overall system will have been restored to a healthy and consistent state.
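The general shape of such a recovery pass can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not the actual logic of WebLogic's Transaction Recovery Service: the InDoubtResource class and recover_transactions function are hypothetical, the log is modelled as a list of dictionaries, and the sketch assumes a "presumed abort" policy in which any prepared transaction with no logged commit decision is rolled back. The recover() method is only loosely analogous to the recover operation that the XA interface defines for listing prepared transactions.

```python
class InDoubtResource:
    """Hypothetical data-store that survived a crash holding prepared updates."""

    def __init__(self, name, prepared):
        self.name = name
        self.prepared = dict(prepared)  # txid -> staged update
        self.committed = {}

    def recover(self):
        # Report the ids of transactions that are prepared but unresolved.
        return list(self.prepared)

    def commit(self, txid):
        self.committed[txid] = self.prepared.pop(txid)

    def rollback(self, txid):
        self.prepared.pop(txid)


def recover_transactions(tlog_records, resources):
    """Resolve every in-doubt transaction using the logged commit decisions."""
    decided = {r["txid"] for r in tlog_records if r["decision"] == "commit"}
    outcome = {}
    for rm in resources:
        for txid in rm.recover():
            if txid in decided:
                rm.commit(txid)
                outcome[(rm.name, txid)] = "committed"
            else:
                # Assumption: no logged decision means the transaction aborts.
                rm.rollback(txid)
                outcome[(rm.name, txid)] = "rolled-back"
    return outcome


# The log holds a commit decision for tx-1 only.
tlog_records = [{"txid": "tx-1", "decision": "commit"}]
# orders-db had already committed tx-1 before the crash; tx-2 never got a decision.
db = InDoubtResource("orders-db", {"tx-2": "insert order 43"})
queue = InDoubtResource(
    "billing-queue", {"tx-1": "send invoice 42", "tx-2": "send invoice 43"})

result = recover_transactions(tlog_records, [db, queue])
```

After the pass, tx-1 has been committed on the queue to match the database, tx-2 has been rolled back everywhere, and no data-store is left holding a pending update.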

So, TLOGs are valuable assets then, that need to be preserved?

If you value and strive to protect and preserve the data in the databases and other data-stores in your enterprise, and your WebLogic hosted applications use global transactions, then you need to equally value and protect your WebLogic TLOGs, as both are inter-related. You need to ensure the persistent store for your TLOG is located on highly available file-system storage or in a highly available database, and can survive such scenarios as irrevocable damage to a hard-disk platter, for example, or even the loss of a whole data-centre. You also need to plan for the ability to restore the WebLogic server referencing its highly available TLOG, during system recovery, to enable WebLogic to push the in-flight transactions through to completion and return the overall system to a consistent state.

For multi-data-centre deployments, it may be necessary to have a TLOG replicated between two data-centres. In the event of a complete data-centre failure, you can bring the WebLogic servers up in the other data centre, referencing the replicated copy of their TLOG, to allow the pending transactions to be correctly committed or rolled-back.

For enterprises that use WebLogic with global transactions, the preservation and recovery of TLOGs will need to be a critical component of the overall disaster recovery process.

So, investing in technologies and processes to preserve and recover TLOGs is absolutely necessary for all deployments?

Before you go ahead and invest in putting in place highly available storage, multi-site replication technologies and disaster recovery practices for TLOGs, it's worth considering that not all WebLogic deployments use global transactions. You need to be cognisant of this and perform an analysis of your WebLogic deployments, because such an investment may not be necessary for your particular system.

If your WebLogic deployed applications are bespoke JEE applications, developed in-house or by a partner, then the application's developers will be able to tell you if global "XA" transactions are employed or not.
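As a rough first-pass check to complement that conversation, a script can scan a JDBC data source descriptor for hints of XA usage. The Python sketch below is a heuristic only and rests on assumptions you should verify against your own WebLogic version: it assumes the descriptor contains driver-name and global-transactions-protocol elements, the xa_hints function and the sample descriptor are made up for illustration, and the "xa" substring match on the driver class name is crude. Global transactions can also involve non-JDBC resources such as message queues, so an empty result proves nothing.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    # Strip any "{namespace}" prefix that ElementTree adds to tag names.
    return tag.rsplit("}", 1)[-1]


def xa_hints(descriptor_xml):
    """Return human-readable hints that a data source may take part in XA."""
    root = ET.fromstring(descriptor_xml)
    hints = []
    for elem in root.iter():
        name = local_name(elem.tag)
        text = (elem.text or "").strip()
        if name == "driver-name" and "xa" in text.lower():
            # Crude heuristic: XA driver classes usually have "XA" in the name.
            hints.append("XA driver class: " + text)
        elif name == "global-transactions-protocol" and text not in ("", "None"):
            hints.append("global-transactions-protocol: " + text)
    return hints


# Made-up sample descriptor, used only to exercise the function.
SAMPLE = """<jdbc-data-source xmlns="http://xmlns.oracle.com/weblogic/jdbc-data-source">
  <name>OrdersDS</name>
  <jdbc-driver-params>
    <driver-name>oracle.jdbc.xa.client.OracleXADataSource</driver-name>
  </jdbc-driver-params>
  <jdbc-data-source-params>
    <global-transactions-protocol>TwoPhaseCommit</global-transactions-protocol>
  </jdbc-data-source-params>
</jdbc-data-source>"""

hints = xa_hints(SAMPLE)
```

Treat any hints it reports as prompts for questions to the developers, rather than as a definitive answer.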

If the WebLogic deployed application is built using Oracle Middleware or runs Oracle Applications, then XA global transactions may or may not be used under the covers, depending on the specific product. You may need to consult the Oracle product documentation or contact Oracle Support. For example, Oracle SOA Suite inherently uses global transactions to track activity transitions belonging to running business processes. So if you value the integrity of these business processes and the data-stores they update, you need to value and protect the TLOGs.

If the WebLogic deployed application is provided by an ISV, you will need to study the ISV's product documentation and/or consult the ISV's Support organisation, to determine if global transactions are employed.

Final Words.....

It is worth stating that such transaction persistence and recovery requirements, and the implied investment required, are not unique to WebLogic. A TLOG is just a mechanism that WebLogic uses. Any enterprise that uses global transactions, regardless of technology vendor, will need to make similar considerations and investments, concerning the provision of highly available storage, multi-site replication technologies and disaster recovery practices.

Song for today: Miles Iz Ded by The Afghan Whigs


Anonymous said...

Great post. I am doing research on disaster recovery for a paper that I am writing. Thanks for the great information, it is very helpful!

Unknown said...

Great article ... appreciate your research and insights.