Wednesday, October 24, 2012

Anatomy of WebLogic TLOGs and considerations for transaction recovery


As discussed in an earlier blog entry, the ability to preserve and restore WebLogic Transaction Logs (TLOGs) is often critical to the availability of an I.T. system and the consistency of the data it contains. With the loss of a machine hosting a WebLogic server, it may be necessary to re-create the environment on another machine, to re-start the failed WebLogic server. Sometimes it may even be necessary to start the failed WebLogic server in a different data-centre for disaster recovery purposes. Restoring a WebLogic server to a running state may be necessary to enable stuck transactions, recorded in its TLOG, to be pushed through to completion. The new host environment may have been reproduced from a previous backup, from re-running WLST scripts or just from manual re-creation. However, what if the new environment uses different hostnames, different IP addresses or different resource names? Will it be possible to reconcile the stuck transactions recorded in the old preserved TLOG? Will it be possible for the WebLogic instance, in the new environment, to successfully commit these stuck transactions?

In this blog entry, I examine the anatomy of a TLOG, to discover what environmental dependencies the data stored in a TLOG has. This will help me to formulate some recommendations outlining the configuration settings that can be changed, when moving a WebLogic server and its TLOG to a new host environment, without losing the ability to recover stuck transactions.

Anatomy of a TLOG

WebLogic's TLOG holds records of the state of in-flight transactions that are marked to be committed, in something that WebLogic calls a 'persistent store'. The persistent store can either be files on a file-system or a table in a database. The key factors to consider when planning the location of the persistent store are:
  • the latency of writing to the storage
  • the availability and resiliency of the storage
Whichever type of persistent store is used, the actual data stored is essentially the same. For the purposes of this investigation, we will look at the anatomy of a TLOG that uses a file store, but the principles gleaned will be exactly the same for database stores too.

To be able to study the contents of a TLOG, I needed to simulate a real system, processing distributed transactions, that would cause WebLogic to record data in its TLOG. I set up an appropriate environment on my Linux x86-64 laptop, where I created a WebLogic version 12.1.1 domain with a single server defined, leaving the TLOG location as the default file-store. I created a single-instance Oracle database, with appropriate settings made to allow XA to function correctly. In the WebLogic domain I defined an XA data source to point to this database. I also created a WebLogic hosted JMS queue which is XA-enabled. Finally, I deployed a test application that I wrote, called BoxBurner, to the WebLogic server to help generate XA transactions for my tests.

BoxBurner is essentially a JEE application that continuously performs a repeated flow of steps. Each flow involves reading a message from a queue, performing a database insert operation, and then placing a new message back on to the same queue, ready for the next iteration. Each flow, composed of 'message dequeue', 'database insert' and 'message enqueue' operations is contained within a single distributed XA transaction. As a result, for each transaction being processed, WebLogic writes a commit decision to its TLOG. In BoxBurner, I also provide a small HTML user interface to allow a user to initiate seeding the queue with a number of messages to kick the continuous parallel flows off.  So with BoxBurner, I can easily simulate a highly transactional system under load, to assist in exploring the anatomy of a TLOG.

By default, the location of a WebLogic server's file-store containing the TLOG is at:


WebLogic rolls these log files over, under certain circumstances, incrementing the number at the end of the file name. Each file has a fixed size (approximately 1 MB) and, when it is full (or when a checkpoint occurs), WebLogic rolls over to a new file. In my case, the server's initial TLOG data file is called:


So for my tests, running BoxBurner under load, on my laptop, I issued a kill -9 on the WebLogic process to terminate it without any warning, hoping to catch at least a few transactions in-flight. I then studied the state of the TLOG data file to see what transactions were recorded and stuck at the point when the failure occurred.

To study the contents of the TLOG data file, I ran a WebLogic dump utility that shows a summary of the important contents of the TLOG, using a small wrapper Bash script I'd created:

 java -cp $WEBLOGIC_JAR:$WEBLOGIC_TX_JAR weblogic.transaction.internal.StoreTransactionLoggerImpl $*

I ran the script with the following arguments to tell it where to find the TLOG data file (note, if you run it with no arguments, the utility displays some useful help information):

 $ ./ /u01/MyDemoSystem/stores/MyDemoAdmSvr_Stores MYDEMOADMSVR

The output essentially showed two different types of record, contained in the TLOG file:
  • One checkpoint record, with the following example output below:
  | Class Name = weblogic.transaction.internal.ResourceCheckpoint                |
  | Object = ResourceCheckpoint={{BBDatasource_MyDemoDomain, props={}}, {WLStore |
  | _MyDemoDomain_MyDemoAdmSvr_BB_JMSRscJMSStore, props={}}}                     |
  • Seven transaction records, each with output similar to the example below:
  | Class Name = weblogic.transaction.internal.ServerTransactionImpl             |
  | Object = Name=[EJB boxburner.msgprocessor.MsgProcessorMDB.onMessage(javax.jm |
  | s.Message)],Xid=BEA1-1800B10D6C88705836D9(381221099),Status=Active,numReplie |
  | sOwedMe=0,numRepliesOwedOthers=0,seconds since begin=1775,seconds left=30,XA |
  | ServerResourceInfo[WLStore_MyDemoDomain_MyDemoAdmSvr_BB_JMSRscJMSStore]=(Ser |
  | verResourceInfo[WLStore_MyDemoDomain_MyDemoAdmSvr_BB_JMSRscJMSStore]=(state= |
  | new,assigned=none),xar=null,re-Registered = false),XAServerResourceInfo[BBDa |
  | tasource_MyDemoDomain]=(ServerResourceInfo[BBDatasource_MyDemoDomain]=(state |
  | =new,assigned=none),xar=null,re-Registered = false),SCInfo[MyDemoDomain+MyDe |
  | moAdmSvr]=(state=active),properties=({weblogic.transaction.name=[EJB boxburn |
  | er.msgprocessor.MsgProcessorMDB.onMessage(javax.jms.Message)]}))             |

The checkpoint record is made by WebLogic to enable it to track the different XA resources (e.g. a database, a message queue) that have been incorporated in one or more XA transactions that the WebLogic server has participated in. In this checkpoint record, we can see that the two XA resources are 'BBDatasource_MyDemoDomain' (the WebLogic datasource pointing at the database) and 'WLStore_MyDemoDomain_MyDemoAdmSvr_BB_JMSRscJMSStore' (the persistent store for the JMS queue).

The TLOG contained 7 pending transactions - XA transactions that WebLogic has marked for commit, but that may not yet have been committed in one or more of the back-end systems. Each transaction has a global ID (Xid), and in the example transaction shown, the WebLogic generated Xid is "BEA1-1800B10D6C88705836D9".

While the WebLogic server was still shut down with pending transactions, I used SQL*Plus to connect to the database and query the list of XA transactions that the database was tracking as pending and yet to commit. The query I issued was:


The resulting list of pending transaction records in the database was:

  LOCAL_TRAN_ID GLOBAL_TRAN_ID                 STATE    MIX HOST       COMMIT#
  ------------- ------------------------------ -------- --- ---------- ----------
  3.11.827      48801.1803B10D6C88705836D9     prepared no  pdthinkpad 1310237
  5.38.95       48801.1802B10D6C88705836D9     prepared no  pdthinkpad 1310234
  7.22.28       48801.1805B10D6C88705836D9     prepared no  pdthinkpad 1310241
  11.73.12      48801.1800B10D6C88705836D9     prepared no  pdthinkpad 1310235
  12.21.37      48801.1801B10D6C88705836D9     prepared no  pdthinkpad 1310228
  2.2.534       48801.1799B10D6C88705836D9     prepared no  pdthinkpad 1310236

Note: In my experience it can take a minute or two, from the point when failure occurs, for pending transactions to appear in the database's DBA_2PC_PENDING table. Therefore if you try this out yourself, and you see no records reported, but you expected to see some, wait a few minutes before executing the query again.

As you can see, the "BEA1-1800B10D6C88705836D9" transaction recorded in the TLOG matches one of the transactions recorded in the Oracle database ("48801.1800B10D6C88705836D9"). The latter just uses a numeric representation of the text 'BEA1': 48801 is the decimal value of hexadecimal BEA1. You may also notice that the database only lists 6 pending transactions whilst the WebLogic TLOG listed 7. Why would this be? This is likely to be because the database had committed the 7th transaction, but either it didn't have time to notify WebLogic that this had occurred, or WebLogic didn't have time to clean up the transaction record in the TLOG, before the server process was killed. This is perfectly fine, because during Transaction Recovery, WebLogic will be notified by the database that the transaction has already been reconciled and no further action is required.
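
The matching of IDs described above can be sketched in a few lines of Python. This is only a minimal sketch, assuming (from the single matching pair shown here) that the part before the hyphen in the Xid printed by the TLOG dump is a hexadecimal format ID, and that the database prints the same ID in decimal followed by a dot and the same hex string. The function names are illustrative, and the TLOG list contains only the one Xid quoted in this article.

```python
def weblogic_xid_to_db_global_id(xid: str) -> str:
    """Convert 'BEA1-<hex>' (TLOG dump form) to '48801.<hex>' (database form).

    Assumption: the prefix before '-' is a hexadecimal format ID.
    """
    format_id_hex, sep, gtrid_hex = xid.partition("-")
    if not sep or not gtrid_hex:
        raise ValueError(f"unexpected Xid format: {xid!r}")
    return f"{int(format_id_hex, 16)}.{gtrid_hex}"


def split_by_db_state(tlog_xids, db_global_ids):
    """Return (still_pending, not_pending) lists of TLOG Xids."""
    pending_in_db = set(db_global_ids)
    still_pending, not_pending = [], []
    for xid in tlog_xids:
        if weblogic_xid_to_db_global_id(xid) in pending_in_db:
            still_pending.append(xid)
        else:
            not_pending.append(xid)
    return still_pending, not_pending


# Global transaction IDs copied from the database listing above.
db_pending = [
    "48801.1803B10D6C88705836D9",
    "48801.1802B10D6C88705836D9",
    "48801.1805B10D6C88705836D9",
    "48801.1800B10D6C88705836D9",
    "48801.1801B10D6C88705836D9",
    "48801.1799B10D6C88705836D9",
]

# Only one TLOG Xid is quoted in this article; a real run would list all seven.
tlog_xids = ["BEA1-1800B10D6C88705836D9"]

still_pending, not_pending = split_by_db_state(tlog_xids, db_pending)
print("pending in both TLOG and database:", still_pending)
print("in TLOG only:", not_pending)
```

A TLOG Xid that ends up in the second list would correspond to the seventh record discussed above: known to the TLOG but no longer pending in the database.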

What is also interesting is that in the TLOG, no hostnames, IP addresses or ports appear to be directly recorded by WebLogic for the database resource or the JMS queue resource. However, in the database pending transaction table, we see that the database has tracked the hostname ('pdthinkpad') of the originator of the transaction, which in this case maps to my WebLogic server's listen address hostname. Later, we'll come back to why these observations may be important.

When running the WebLogic server instance I'd elected to turn on debug-level logging for the server, with the flag 'DebugJTATLOG' enabled. Below is an example of what WebLogic logged when processing one of the transactions and marking it to be committed in the TLOG:

     ####<19-Oct-2012 13:40:31 o'clock BST> <Debug> <JTATLOG> <pdthinkpad> <MyDemoAdmSvr> <[ACTIVE] ExecuteThread: '1' for queue: 'weblogic.kernel.Default (self-tuning)'> <<anonymous>> <BEA1-1800B10D6C88705836D9> <> <1342701631032> <BEA-000000> <TLOG writing log record, class=weblogic.transaction.internal.ServerTransactionImpl, obj=Name=[EJB boxburner.msgprocessor.MsgProcessorMDB.onMessage(javax.jms.Message)],Xid=BEA1-1800B10D6C88705836D9(347028305),Status=Logging,numRepliesOwedMe=0,numRepliesOwedOthers=0,seconds since begin=0,seconds left=30,activeThread=Thread[[ACTIVE] ExecuteThread: '1' for queue: 'weblogic.kernel.Default (self-tuning)',5,Pooled Threads],XAServerResourceInfo[WLStore_MyDemoDomain_MyDemoAdmSvr_BB_JMSRscJMSStore]=(ServerResourceInfo[WLStore_MyDemoDomain_MyDemoAdmSvr_BB_JMSRscJMSStore]=(state=prepared,assigned=MyDemoAdmSvr),xar=WLStore_MyDemoDomain_MyDemoAdmSvr_BB_JMSRscJMSStore282730943,re-Registered = false),XAServerResourceInfo[BBDatasource_MyDemoDomain]=(ServerResourceInfo[BBDatasource_MyDemoDomain]=(state=prepared,assigned=MyDemoAdmSvr),xar=BBDatasource,re-Registered = false),...[TRUNCATED FOR BREVITY]...,CoordinatorURL=MyDemoAdmSvr+pdthinkpad:7001+MyDemoDomain+t3+)>
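
Unlike the dump output, the debug record above ends with a CoordinatorURL value that does contain a host and port ('pdthinkpad:7001'). As a rough illustration, the sketch below pulls that value out of a logged record and splits it into fields. The field order (server, host:port, domain, protocol) is an assumption inferred from this one example, not a documented format, and the function and class names are made up for this sketch.

```python
import re
from typing import NamedTuple, Optional


class CoordinatorUrl(NamedTuple):
    server: str
    host: str
    port: int
    domain: str
    protocol: str


def parse_coordinator_url(value: str) -> CoordinatorUrl:
    """Split 'server+host:port+domain+protocol+' into named fields.

    Assumption: the layout matches the single example in the debug log.
    """
    parts = value.split("+")
    if len(parts) < 4:
        raise ValueError(f"unexpected CoordinatorURL: {value!r}")
    server, address, domain, protocol = parts[:4]
    host, sep, port = address.rpartition(":")
    if not sep:
        raise ValueError(f"no port in CoordinatorURL address: {address!r}")
    return CoordinatorUrl(server, host, int(port), domain, protocol)


def coordinator_url_from_log(record: str) -> Optional[CoordinatorUrl]:
    """Find 'CoordinatorURL=...' in a JTATLOG debug record, if present."""
    match = re.search(r"CoordinatorURL=([^,)>\s]+)", record)
    return parse_coordinator_url(match.group(1)) if match else None


# Tail of the debug record shown above.
record_tail = "...,CoordinatorURL=MyDemoAdmSvr+pdthinkpad:7001+MyDemoDomain+t3+)>"
print(coordinator_url_from_log(record_tail))
```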

Finally, to test that WebLogic's Transaction Recovery process works correctly in normal circumstances, I simply re-started the WebLogic server on my laptop. Within half a minute, WebLogic's Transaction Recovery Service had kicked in and successfully recovered the stuck transactions that were recorded in the TLOG, pushing them through to completion in the affected JMS Queue and Database resources. To prove this, I queried the database table DBA_2PC_PENDING again and this time no rows were returned.

Testing Transaction Recovery After Changing Specific Environment Settings

To determine which key environment setting changes will adversely affect the ability of WebLogic to recover TLOG recorded transactions, I ran a series of tests. For each test, I first ran the system as normal, to process messages. Then I killed the WebLogic server process and checked in the TLOG and in the database, to ensure that some stuck transactions were present. Then I made the necessary changes in the environment configuration settings for the test. Finally I re-started WebLogic to see whether the transactions were successfully recovered or not.

TEST: Changing JDBC Datasource Hostname and Port

In this test, after killing WebLogic, I changed the JDBC URL (hostname and port) value in the domain configuration settings for the WebLogic datasource ('BBDatasource'), to reference a non-existent Oracle DB listener address. Upon re-starting WebLogic, as the WebLogic transaction recovery service kicked in, I saw lots of entries in the WebLogic log file like:

<19-Oct-2012 15:58:40 o'clock BST> <Warning> <JTA> <BEA-110486> <Transaction BEA1-0511D6A7A132705836D9 cannot complete commit processing because resource [BBDatasource_MyDemoDomain] is unavailable. The transaction will be abandoned after 431,741 seconds unless all resources acknowledge the commit decision.> 
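
When there are many of these warnings, it can help to summarise which transactions are blocked on which resource. The following is a minimal sketch that does this by pattern-matching the BEA-110486 message text. The pattern is taken from the single example line above, so treat it as an assumption: the wording may differ in other WebLogic versions, and the function names are purely illustrative.

```python
import re
from typing import NamedTuple, Optional


class UnavailableResource(NamedTuple):
    xid: str
    resource: str
    abandon_after_seconds: int


# Pattern derived from the one BEA-110486 example line shown above.
_BEA_110486 = re.compile(
    r"<BEA-110486> <Transaction (?P<xid>\S+) cannot complete commit processing "
    r"because resource \[(?P<resource>[^\]]+)\] is unavailable\. "
    r"The transaction will be abandoned after (?P<seconds>[\d,]+) seconds"
)


def parse_unavailable_resource(line: str) -> Optional[UnavailableResource]:
    """Return the Xid, resource name and abandon timeout, or None if no match."""
    match = _BEA_110486.search(line)
    if match is None:
        return None
    seconds = int(match.group("seconds").replace(",", ""))
    return UnavailableResource(match.group("xid"), match.group("resource"), seconds)


def unresolved_by_resource(lines):
    """Group the Xids of blocked transactions by the resource they wait on."""
    grouped = {}
    for line in lines:
        parsed = parse_unavailable_resource(line)
        if parsed is not None:
            grouped.setdefault(parsed.resource, set()).add(parsed.xid)
    return grouped


sample_line = (
    "<19-Oct-2012 15:58:40 o'clock BST> <Warning> <JTA> <BEA-110486> "
    "<Transaction BEA1-0511D6A7A132705836D9 cannot complete commit processing "
    "because resource [BBDatasource_MyDemoDomain] is unavailable. The transaction "
    "will be abandoned after 431,741 seconds unless all resources acknowledge "
    "the commit decision.>"
)
print(unresolved_by_resource([sample_line]))
```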

Also, when querying the DBA_2PC_PENDING database table, I could see that there were still pending transactions present. As one would expect, this shows that WebLogic can't magically find a database: it relies on the JDBC datasource configuration in the domain to work out how to reach the database, rather than on any information about a database host or port that may or may not have been stored in the TLOG at the time when the transaction occurred. Upon correcting the JDBC URL in the domain's configuration and re-starting WebLogic, the pending transactions were all happily pushed through to completion, as expected.

TEST: Changing JDBC Datasource Name

For this test run, after killing the WebLogic server, I changed the name of the 'BBDatasource' data-source in the domain configuration (keeping the JNDI name it contains the same). Again I discovered that on server re-start, the pending transactions could not be recovered. This demonstrated that WebLogic's transaction recovery service uses the domain's current configuration settings for resources like data-sources, rather than anything that may have been recorded, before failure, in the TLOG. Again, correcting the datasource name and re-starting WebLogic resulted in the pending transactions being successfully cleared.

TEST: Changing Oracle Database Listener Host and Port

For this test run, after killing the WebLogic server, rather than changing anything in WebLogic's domain configuration, I instead changed the host and port of my local database listener (to listen on my laptop's wireless card address rather than my laptop's wired Ethernet card address). On re-starting the WebLogic server, WebLogic was unable to recover the transactions, as expected, because it couldn't contact the database on the address it expected to. This time, to make recovery possible, rather than reverting the database changes I had made (Oracle listener host/port), I instead modified the JDBC datasource URL for the WebLogic domain, with the new host/port of the database. Upon re-starting WebLogic, the pending transactions were successfully recovered. What this showed is that, following a failure and before initiating system recovery, a database can be moved and a WebLogic server's configuration can be changed to reflect the new database location. With these database location and WebLogic datasource changes, WebLogic is still able to successfully recover transactions, which is potentially important for disaster recovery processes, where a database may be running in a different data-centre.

TEST: Changing WebLogic Default Listen Address

In my test environment, the WebLogic server was originally listening on the hostname "pdthinkpad", which mapped to my laptop's wired Ethernet card. For this test, after killing the server, I made some changes in WebLogic's domain configuration, changing WebLogic's default channel listen address to a hostname ('temphost') which mapped to my laptop's wireless network card. I also changed my laptop's hostname to 'temphost' and added the appropriate entry for the hostname in '/etc/hosts', removing any trace of the old 'pdthinkpad' name. Before re-starting the WebLogic server to listen on the new address, I double checked the database, querying the DBA_2PC_PENDING table to check that there were pending transactions. The example output was:

  LOCAL_TRAN_ID GLOBAL_TRAN_ID                 STATE    MIX HOST       COMMIT#
  ------------- ------------------------------ -------- --- ---------- ----------
  5.18.1100     48801.03542CFF3807705836D9     prepared no  pdthinkpad 1472769
  7.11.720      48801.03552CFF3807705836D9     prepared no  pdthinkpad 1472766
  12.5.42       48801.03512CFF3807705836D9     prepared no  pdthinkpad 1472778
  1.22.724      48801.034F2CFF3807705836D9     prepared no  pdthinkpad 1472772

Notice that the Oracle database is tracking that the pending transactions originated from a server on host 'pdthinkpad'. So the question would be, if recovery of these transactions was now initiated by a seemingly different server (i.e. one from a host called 'temphost'), would the database allow these transactions to be reconciled and completed?

With the new hostname and listen address in place, I re-started the WebLogic server and observed that WebLogic's transaction recovery service worked successfully, pushing through to completion the pending transactions. On querying the database DBA_2PC_PENDING table again, no records were listed, showing that the database had indeed committed the transactions. Therefore, the conclusion is that the hostname and listen address for a WebLogic server can be changed, without preventing the successful recovery of transactions recorded in the TLOG. Again, this is potentially important if the WebLogic server has been started in a different data-centre, with a copy of the original TLOG, as part of a disaster recovery process.

Conclusions From Tests

These tests were all based on cases where my WebLogic server, owning the TLOG, was the originator of XA transactions. In other words, the WebLogic server was the 'transaction coordinator'. In these cases, I was able to show that, after restoring or re-creating a WebLogic environment, the pending transactions that had been tracked in a preserved TLOG could be successfully completed. I was able to make the following environmental changes without inhibiting the ability to recover transactions:
  • WebLogic Server default channel listen addresses (Hostname/IP-address and Port)
  • Oracle Database listener address (Hostname/IP-address and Port) plus related settings in the WebLogic domain datasource configuration file
What became evident during the investigation is that it is important to keep the names of the following resources (stored in the domain configuration) the same between the stages of system failure and system recovery:
  • WebLogic Domain Name
  • WebLogic Server Names
  • WebLogic JDBC Datasource Names
  • WebLogic Persistent Store Names (File and/or DB)
  • WebLogic JMS Server Names
  • WebLogic JMS Destination Names
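
One way to act on this list before attempting recovery is to compare the XA resource names recorded in the TLOG checkpoint record with the names present in the re-created environment. The sketch below is a minimal illustration only. It assumes the wrapped lines of the dump output have already been joined into one string, the pattern is inferred from the single checkpoint record shown earlier, and 'RenamedDatasource_MyDemoDomain' is a hypothetical name standing in for a renamed datasource. How the list of names in the new environment is obtained is outside the scope of the sketch.

```python
import re


def checkpoint_resource_names(checkpoint: str) -> list:
    """Extract XA resource names from a ResourceCheckpoint dump string.

    Assumption: each resource is printed as '{<name>, props=...}'.
    """
    return re.findall(r"\{([^{},\s]+), props=", checkpoint)


def missing_resources(checkpoint: str, registered_names) -> list:
    """List checkpoint resource names that the new environment lacks."""
    registered = set(registered_names)
    return [name for name in checkpoint_resource_names(checkpoint)
            if name not in registered]


# Checkpoint record from the TLOG dump shown earlier, with line wrapping removed.
checkpoint = (
    "ResourceCheckpoint={{BBDatasource_MyDemoDomain, props={}}, "
    "{WLStore_MyDemoDomain_MyDemoAdmSvr_BB_JMSRscJMSStore, props={}}}"
)

# Hypothetical names in a new environment where the datasource was renamed.
new_environment = [
    "RenamedDatasource_MyDemoDomain",
    "WLStore_MyDemoDomain_MyDemoAdmSvr_BB_JMSRscJMSStore",
]

print(missing_resources(checkpoint, new_environment))
```

Any name reported here is a resource that the TLOG refers to but that the new environment would not be able to match, which is the situation seen in the datasource renaming test.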

Word of caution

In WebLogic's online documentation for Managing Transactions there is a section titled Moving a Server. The help document highlights that in a specific situation, if the hostname and port of WebLogic's listen address are changed, transaction recovery will fail. This situation occurs if the WebLogic server is a transaction sub-coordinator as part of a larger transaction that has been propagated from another separate WebLogic server.

For example, an EJB client application running on Server A may initiate an XA transaction and invoke an EJB running on Server B which in turn performs an XA database update operation. In these cases, the transaction coordinator (Server A) will have recorded the URL (host and port) of the sub-coordinator (Server B) directly in the TLOG belonging to the coordinator server (Server A). During the transaction recovery process, if the sub-coordinator (Server B) listen address has changed, the coordinator (Server A) will not be able to contact the sub-coordinator (Server B) to inform it to commit or rollback pending transactions. The address that the coordinator server knew for the sub-coordinator server was hard-wired directly in the TLOG file before the system failure occurred. The coordinator server is thus unaware that a new listen address for the sub-coordinator server is being used.

Another example of where an XA transaction will be propagated between two WebLogic servers, is the situation where one server may be hosting a queue, and the second server is hosting a Message Driven Bean (MDB) based application, listening for messages appearing on the remote queue, with the MDB's container managed transaction demarcation set to 'required'.

As a result of these potential scenarios, the help documentation referenced above states "Oracle recommends configuring server instances using DNS names rather than IP addresses to promote portability", to always allow for WebLogic instances to be restored on new machines without inhibiting transaction recovery. While in this blog entry I have shown that this is not always necessary, depending on the nature of the XA transactions that occur (ones that don't propagate between WebLogic servers), it is a good rule to stick to, when possible, to be safe.


As we have seen, it is possible to change or move a database that has pending XA transactions, and still have WebLogic resolve and push through to completion those pending transactions, when a working system is restored.

We have also seen that it is important to keep key WebLogic resource names the same, when recovering WebLogic servers on new machines or in a whole new environment.

We have proved that in some circumstances, it is possible to change the hostnames and ports of the WebLogic servers, for example, restore the domain to run on a new set of host machines and still allow successful transaction recovery. The set of circumstances where this is possible is when XA transactions have not propagated between more than one WebLogic server.

However, we know that changing a WebLogic server's listen hostname and port breaks the ability to recover XA transactions that have been propagated. In addition, there may well be other places in the host application's metadata, completely unrelated to transactions, that hard-wire knowledge of server listen addresses. For example, an installed Oracle Middleware product running on WebLogic may be tracking server listen addresses in its Meta-Data-Store (MDS). As another example, a bespoke JEE application hosted on WebLogic may be tracking server listen addresses in its own property files.

It is also worth noting that the official Oracle Fusion Middleware Disaster Recovery guide talks about the importance of using "hostname-based configuration instead of IP-based configuration" in the guide section titled Planning Host Names.

Therefore, I would echo the general advice that Oracle gives, to always use hostnames, rather than IP addresses for network address related resources. This maximises the ability to move servers, with new IP addresses, but retain the same hostnames during the move, and thus prevent breaking a functioning system. These principles are especially pertinent when planning for disaster recovery and the best way to re-establish a WebLogic domain and its TLOGs in another data-centre, so that the system can be restored to a fully working order.
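
As a small aid for following this advice, the sketch below flags any configured listen address that is an IP literal rather than a hostname, using Python's standard ipaddress module. It is only an illustration: the server names and addresses other than 'MyDemoAdmSvr' and 'pdthinkpad' are hypothetical, and the listen addresses would need to be collected from your own domain configuration by whatever means suits you.

```python
import ipaddress


def is_ip_literal(address: str) -> bool:
    """Return True if the address is an IPv4/IPv6 literal, not a hostname."""
    try:
        # Square brackets are stripped so that '[::1]' style literals parse.
        ipaddress.ip_address(address.strip("[]"))
    except ValueError:
        return False
    return True


def servers_using_ip_addresses(listen_addresses: dict) -> list:
    """Return the names of servers whose listen address is an IP literal."""
    return sorted(name for name, address in listen_addresses.items()
                  if is_ip_literal(address))


# Hypothetical listen addresses; only the first entry comes from this article.
listen_addresses = {
    "MyDemoAdmSvr": "pdthinkpad",
    "ManagedSvr1": "192.168.1.20",
    "ManagedSvr2": "[fe80::1]",
}

print(servers_using_ip_addresses(listen_addresses))
```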

Song for today: (I Don't Need You To) Set Me Free by Grinderman


king pong siu said...

Hi Paul, only pending transactions related to the Database are mentioned. Just wondering what would happen for the JMS Resource and the message that has been sent or read.
does JMS have "XA transactions pending commit" similar to Oracle database?

Paul Done said...

Good point and yes the JMS persistent stores have to be considered too, for reliably storing and then restoring after a failure. In the example tests, the WebLogic Transaction Service will also have liaised with the JMS Server (treated like an XA resource manager, like the database was) to push through their XA pending messages. In hindsight, I should have also shown the output from monitoring the JMS servers for listing their pending messages too.

king pong siu said...

Thx Paul for your reply!
For Disaster Recovery, should it be safer to force commit all the pending transactions manually for the case where things going wrong (like using IP instead of domains or anything else) preventing the commit; and then, on another XA resource (JMS for example), all related JMS messages have been committed that implies Lost Messages since the related data in DB are lost?

Paul Done said...

Good question. I definitely would not try to manually commit or roll back anything. Ideally, if the Tlogs and JMS are persisted on shared file-system, then some sort of file-system replication is in place to enable it to be restored in DR site with DR copy of domain that can then be started to push these thru to completion. Likewise if Tlogs and JMS are in DB, use something like Oracle Data Guard to ensure the data is recoverable from DR site.

Simon Haslam said...

An interesting article Paul - thanks for taking the trouble to publish your results.

king pong siu said...

Dear Paul,
I met a very slow startup of the weblogic server and by reading the logs, it seems to be related to the TLOG recovery where the persistent store scan its log and data files in order to reconstruct its state at the last time it was running.

Nov 28, 2012 3:01:05 PM HKT ..BEA-280008..Opening the persistent file store "xxx" for recovery: directory=/abc/xxx requestedWritePolicy="Direct-Write" fileLockingEnabled=true driver="wlfileio3".

Nov 28, 2012 3:16:25 PM HKT ..BEA-280009..The persistent file store "xxx" (956e4aef-e7ef-41ac-bc6b-afcf4afc1383) has been opened: blockSize=512 actualWritePolicy="Direct-Write(read-buffered)" explicitIOEnforced=false records=37.

As u can see, it took around 15min. We met this problem twice. The policy is set to Direct-Write and blocksize set to 512, wondering if it would help by changing those settings.

Wondering if you have any idea?
Thanks a lot.

Pierluigi Vernetto said...

Paul, excellent job, by far the best article read so far on this topic. Keep posting please!

Unknown said...

Hi! Thank you so much, very in-depth testing! Would you mind sharing the test application with us? Nelson

Paul Done said...

Apologies, but for various reasons I'm not able to share the code for BoxBurner, so that is left as an exercise for the reader ;)

Faruk ONDER said...

incredible useful, thanks

cal vin said...

why my weblogic never update tlog/ wls_admin.dat about transaction, i use jta transaction but tlog always empty tlog ? thnx you

cal vin said...

in my cluster when my managed server fail, i get warning :
Warning: Fail-back retry of Transaction Recovery Service for server [serverName] failed.

Jean Francois said...

Hello, an Oracle DBA told me that the database transaction is linked to the session associated with the JDBC connection. He also told me that a broken connection implies that the in-flight transaction is purged by the database (a new JDBC connection produces a new database session), so it cannot be continued. So there is no way to continue the transaction after a WLS restart (only to retry the transaction branch). Any idea on this point?

(Excelent post on that subject ...)

Paul Done said...

Jean-Francois, I sometimes see that ignorance from certain Oracle DBAs (usually the ones that arrogantly claim they know everything about the Oracle DB, which I for one, don't). Invariably, they've never come across client applications that use XA transactions against the DB before, and assume they work like 'normal' db transactions. 'Normal' db transactions do not outlive the connection/session, but XA transactions can and often do. You need to tell your DBA to become more informed by googling "Oracle XA".

ryszard.styczynski said...

Hi Paul,

do you still have files related to this test? I've got doubts related to datasource and JNDI name. I want to believe that not datasource but jndi name should be used by TXN recovery... Let me know what was the JNDI name for a data source used in test.


Gayatri said...

Excellent Article Paul...Thank you for the in depth write up..