Friday, March 14, 2014

How to speed up MongoDB Aggregation using Parallelisation


EDIT 10-Dec-2021: See my far more up to date blog post which supersedes this: Achieving At Least An Order Of Magnitude Aggregation Performance Improvement By Scaling & Parallelism

A little while ago, Antoine Girbal wrote a great blog post describing how to speed up a MongoDB MapReduce job by 20x. Now that MongoDB version 2.6 is nearly upon us, and prompted by an idea from one of my very smart colleagues, John Page, I thought I'd investigate whether MongoDB Aggregation jobs can be sped up using a similar approach. Here, I will summarise the findings from my investigation.

To re-cap, the original MapReduce tests looked for all the unique values in a collection, counting their occurrences. The following improvements were incrementally applied by Antoine against the 10 million documents in the collection:

  1. Define a 'sort' for the MapReduce job
  2. Use multiple threads, each operating on a subset of the collection 
  3. Write out result subsets to different databases
  4. Specify that the job should use 'pure JavaScript mode'
  5. Run the job using MongoDB 2.6 (an 'in-development' version) rather than 2.4

With all these optimisations in place, Antoine was able to reduce a 1200 second MapReduce job to just 60 seconds! Very impressive.

I expect that one of the reasons Antoine focussed on MapReduce in MongoDB was because MongoDB limited the output size of the faster Aggregation framework to 16 MB (the result had to fit in a single document). With MongoDB 2.6, this threshold is removed: aggregations generate multi-document results, rather than a single document, and these can be navigated by client applications using a cursor. Alternatively, an output collection can be specified for the Aggregation framework to write directly to. Thus, version 2.6 presents the opportunity to try the MapReduce test scenario with the Aggregation framework instead, to compare performance. In this blog post I do just that, focussing purely on testing a single, non-sharded database.
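
To make this concrete, here is a minimal mongo Shell sketch of the two new 2.6 output options mentioned above (this is my own illustrative snippet, reusing the grouping pipeline that features throughout this post): iterating the results via the returned cursor, and writing them straight to a named collection with the new $out stage.

  // Results returned as a cursor, which the client iterates
  var cursor = db.rawdata.aggregate([
      {$group: {_id: "$dim0", value: {$sum: 1}}}
  ]);
  cursor.forEach(printjson);

  // Results written directly to a named output collection instead
  db.rawdata.aggregate([
      {$group: {_id: "$dim0", value: {$sum: 1}}},
      {$out: "aggdata"}
  ]);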


Re-running the optimised MapReduce


Before I tried running the test with the Aggregation framework, I thought I'd quickly run and time Antoine's optimised MapReduce job on my test machine to normalise any performance difference that occurs due to hardware disparity (for reference, my host machine is a Linux x86-64 laptop with SSD, 2 cores + hyper-threading, running MongoDB 2.6.0rc1).

I first had to insert the 10 million documents, as Antoine did, using the mongo Shell, with each insert containing a single field (‘dim0’) whose value is a random number between 0 and 1 million:

> for (var i = 0; i < 10000000; ++i) {
      db.rawdata.insert({dim0: Math.floor(Math.random()*1000000)});
  }
> db.rawdata.ensureIndex({dim0: 1})

I then ran Antoine's fully optimised MapReduce job, which completed in 32 seconds, versus 63 seconds on Antoine's host machine.


Running a normal, non-optimised Aggregation


So the first thing I needed to establish was how much faster or slower the equivalent Aggregation job is, compared with the 'optimised' MapReduce job I'd just run on the same hardware. The Aggregation pipeline is actually very simple to implement for this test. Below you can see the MapReduce code and the Aggregation pipeline equivalent:

map: function() {
    emit(this.dim0, 1); 
}

reduce: function(key, values) {
    return Array.sum(values);
}

var pipeline = [
    {$group: {
        _id: "$dim0",
        value: {$sum: 1}
    }}
];

The pipeline basically just groups on the field 'dim0', counting the number of occurrences of each value. So I ran this Aggregation against my database, called 'mrtest', containing the randomly generated 'rawdata' collection:

> db = db.getSiblingDB('mrtest');
> db['rawdata'].aggregate(pipeline);

The aggregation completed in 23 seconds, versus 32 seconds for running the 'optimised' equivalent MapReduce job on the same hardware. This shows just how quick the standard out-of-the-box Aggregation capability is in MongoDB!


Running an optimised Aggregation


So my challenge was now clear: see if I could apply some of Antoine's MapReduce speed-up tricks to the Aggregation framework, to reduce this test's completion time from 23 seconds.

I considered which of the original 5 optimisations could realistically be applied to optimising the test with the Aggregation framework:

  1. YES - Define a 'sort' for the job
  2. YES - Use multiple threads, each operating on a subset of the collection 
  3. NO - Write out result subsets to different databases  (the Aggregation framework expects to write out a collection to the same database as the source collection being read)
  4. NO - Specify that the job should use 'pure JavaScript mode'  (doesn't apply here, as Aggregations run within C++ code)
  5. NO - Move to using a newer version of MongoDB, 2.4 to 2.6  (we are already at the latest version with my test runs, which is undoubtedly helping achieve the 23 second result already)

So the main thing I needed, in addition to including a 'sort', was to somehow split up the aggregation into parts to be run in parallel, by different threads, on subsets of the data. This would require me to prepend the existing pipeline with operations to match a range of documents in the collection and to sort the matched subset of documents, before allowing the original pipeline operation(s) to be run. Also, I would need to append an operation to the pipeline, to specify that the result should go to a specific named collection. So for example, for one of the subsets of the collection (documents with 'dim0' values ranging from 524915 to 774848), the following table shows how I need to convert the original pipeline:

[
    {$group: {
        _id: "$dim0",
        value: {$sum: 1}
    }}
]
   [
       {$match: {
           dim0: {
               $gte: 524915,
               $lt : 774849
           }
       }},
       {$sort: {
           dim0: 1
       }},
       {$group: {
           _id: "$dim0",
           value: {$sum: 1}
       }},
       {$out:
           “aggdata524915"
       }
   ]


A similar pipeline, but with different range values and output collection name, needs to be produced for each thread I intend to run. Each of these pipelines needs to be invoked with db.coll.aggregate() in parallel, from different threads.

To achieve this in a generic way, from the Shell, I quickly wrote a small JavaScript function to 'decorate' a given pipeline with $match and $sort prepended and $out appended. The function ends by running MongoDB’s aggregate() function with the modified pipeline, against the subset of matching data.

// Define new wrapper aggregate function, to operate on subset of docs 
// in collection after decorating the pipeline with pre and post stages
var aggregateSubset = function(dbName, srcClltName, tgtClltName,
                                  singleMapKeyName, pipeline, min, max) { 

    var newPipeline = JSON.parse(JSON.stringify(pipeline)); //deep copy
    var singleKeyMatch = {};
    singleKeyMatch[singleMapKeyName] = (max < 0) ? {$gte: min} 
                                                 : {$gte: min, $lt: max};
    var singleKeySort = {};
    singleKeySort[singleMapKeyName] = 1;

    // Prepend pipeline with range match and sort stages
    newPipeline.unshift(
        {$match: singleKeyMatch},
        {$sort: singleKeySort}  
    );

    // Append pipeline with output to a named collection
    newPipeline.push(
       {$out: tgtClltName}
    );

    // Execute the aggregation on the modified pipeline
    db = db.getSiblingDB(dbName);
    var result = db[srcClltName].aggregate(newPipeline);
    return {outCollection: tgtClltName, subsetResult: result};
}

This function takes the name of the database, source collection, and target collection, plus the original pipeline, plus the minimum and maximum values for the subset of the collection. The keen-eyed amongst you will have noticed the 'singleMapKeyName' parameter too. Basically, my example function is hard-coded to assume that the pipeline is mapping and sorting data anchored on a single key ('dim0' in my case, or this could be group-by 'customer', as another example). However, in other scenarios, a compound key may be in use (eg. group-by customer + year), and my function would need to be enhanced to support compound keys.
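
As a rough, untested sketch of how that enhancement might look (the function and parameter names below are hypothetical), the single key name could be replaced with an array of key names: the $sort stage then simply lists every key, while this simple version still applies the range $match to the leading key only (a fully general range match over a compound key would need additional boundary logic).

  // Hypothetical variation: 'mapKeyNames' is an array, eg. ['customer', 'year']
  var buildCompoundKeyStages = function(mapKeyNames, min, max) {
      // Range-match on the leading (most significant) key only, for simplicity
      var match = {};
      match[mapKeyNames[0]] = (max < 0) ? {$gte: min}
                                        : {$gte: min, $lt: max};

      // Sort ascending on each key of the compound, in turn
      var sort = {};
      for (var i = 0; i < mapKeyNames.length; ++i) {
          sort[mapKeyNames[i]] = 1;
      }

      return [{$match: match}, {$sort: sort}];
  };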

Once I'd completed and tested my new aggregateSubset() JavaScript function, I was ready to run some test code in the Shell, to call this new function from multiple threads. Below is the code I used, specifying that 4 threads should be spawned. My host machine has 4 hardware threads, so this is probably not a bad number to use, even though I actually chose this number because it matched the thread count used in Antoine's original MapReduce tests.

// Specify test parameter values
var dbName = 'mrtest';
var srcClltName = 'rawdata';
var tgtClltName = 'aggdata';
var singleMapKeyName = 'dim0';
var numThreads = 4;

// Find an even-ish distributed set of sub-ranges in the collection
var singleKeyPattern = {};
singleKeyPattern[singleMapKeyName] = 1;
var splitKeys = db.runCommand({splitVector: dbName+"."+srcClltName, 
      keyPattern: singleKeyPattern, maxChunkSizeBytes: 32000000}).splitKeys;

// There are more sub-ranges than threads to run, so work out 
// how many sub-ranges to use in each thread
var inc = Math.floor(splitKeys.length / numThreads) + 1;
var threads = [];

// Start each thread, to execute aggregation on a subset of data 
for (var i = 0; i < numThreads; ++i) { 
    var min = (i == 0) ? 0 : splitKeys[i * inc].dim0;
    var max = (i * inc + inc >= splitKeys.length) ? -1 
                                                  : splitKeys[i * inc + inc].dim0;
    var t = new ScopedThread(aggregateSubset, dbName, srcClltName,
                 tgtClltName + min, singleMapKeyName, pipeline, min, max);
    threads.push(t);
    t.start();
}

// Wait for all threads to finish and print out their summaries
for (var i in threads) { 
    var t = threads[i];
    t.join();
    printjson(t.returnData());
}

This multi-threaded aggregation completed in 13 seconds, versus 23 seconds for running the normal 'non-optimised' version of the aggregation pipeline. This optimised pipeline involved writing to 4 different collections, each named 'aggdataN' (eg. 'aggdata374876').

Okay, this isn't the stunning 20x improvement that Antoine achieved for MapReduce. However, given that the Aggregation framework is already a highly optimised MongoDB capability, I don't believe a 40% speed-up is to be sniffed at. Also, I suspect that on host machines with more cores (hence requiring the thread count parameter used in my example to be increased), and with a much larger real-world data set, even more impressive results would be achieved, such as a 2x or 3x improvement.


Some observations


As you'll notice, I use a similar pattern to Antoine's, whereby I use the undocumented splitVector command to get a good idea of what constitutes a well balanced set of ranges in the source collection. This is especially useful if the distribution of values in a collection is uneven and hard to predict. In my case, the distribution is pretty even and for 4 threads I could have got away with hard-coding ranges of 0-250000, 250000-500000, 500000-750000 & 750000-1000000. However, real-world collections will invariably have a much more uneven spread of values for a given key. With a naïve approach to partitioning up the data, one thread could end up doing much more work than the others. In real-world scenarios, you may have other ways of predicting balanced subset ranges, based on the empirical knowledge you have of your data. You would then use a best-guess effort to specify the split points for partitioning your collection. If you do choose to use the splitVector utility for a real application, I would suggest caching the result and only re-evaluating it intermittently, to avoid the latency it would otherwise add to the execution time of your aggregations.
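
As a minimal illustration of that caching idea (the variable names and the 10 minute refresh interval below are just arbitrary choices for the sketch), the split points could be held in the Shell session and only recalculated once a chosen period has elapsed:

  // Illustrative cache of splitVector output, refreshed at most every 10 minutes
  var cachedSplitKeys = null;
  var cachedSplitKeysTime = 0;
  var SPLIT_REFRESH_MILLIS = 10 * 60 * 1000;

  var getSplitKeys = function(namespace, keyPattern) {
      var now = new Date().getTime();

      // Only re-run the splitVector command if the cached result is stale
      if (cachedSplitKeys == null ||
              (now - cachedSplitKeysTime) > SPLIT_REFRESH_MILLIS) {
          cachedSplitKeys = db.runCommand({splitVector: namespace,
              keyPattern: keyPattern, maxChunkSizeBytes: 32000000}).splitKeys;
          cachedSplitKeysTime = now;
      }

      return cachedSplitKeys;
  };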

Also, you'll notice I followed Antoine's approach of using the undocumented ScopedThread Shell utility object, to spawn new threads. If you are writing a real application to run an aggregation, you would instead use the native threading API of your chosen programming language/platform, alongside the appropriate MongoDB language driver.

One optimisation I wish I could have made, but couldn't, was 'write out result subsets to different databases'. In MongoDB 2.6, one can specify an output collection name, but not an output database name, when calling aggregate(). My belief that different threads outputting to different databases would increase performance is based on some observations I made during my tests. When running 'top', 'mongostat' and 'iostat' on my host machine whilst the optimised aggregation was running, I initially noticed that 'top' was reporting all 4 'CPUs' as maxing out at around 95% CPU utilisation, coming down to a lower amount in the later stages of the run. At the point when CPU usage came down, 'mongostat' was reporting that the database 'mrtest' was locked for around 90% of the time. At the same point, disk utilisation was reported by 'iostat' as being a modest 35%. This tells me that for part of the aggregation run, the various threads are all queuing, waiting for the write lock, to be able to insert result documents into their own collections in the same database. By allowing different databases to be used, the write lock bottleneck for such parallel jobs would be diminished considerably, thus allowing the document inserts, and hence the overall aggregation, to complete even quicker. I have created a JIRA enhancement request, asking for the ability for the Aggregation framework's $out operator to specify a target database, in addition to a target collection.


Conclusion


So I've managed to show that certain MapReduce-style workloads can be executed faster if using Aggregation in MongoDB. I showed the execution time can be reduced even further, by parallelising the aggregation work over subsets of the collection, using multiple threads. I demonstrated this for a single non-sharded database. In the future, I intend to investigate how a sharded database can be used to drive even better parallelisation and hence lower execution times, which I will then blog about.


Song for today: The Rat by The Walkmen

Wednesday, April 3, 2013

Load Balancing T3 InitialContext Retrieval for WebLogic using Oracle Traffic Director

 

Introduction


The T3 protocol is WebLogic's fast native binary protocol that is used for most inter-server communication, and by default, communication to a WebLogic server from client applications using RMI, EJB or JMS, for example. If a JMS distributed queue or an RMI/EJB based application is deployed to a WebLogic cluster, for high availability, the client application needs a way of addressing the "virtual endpoint" of the remote clustered service. Subsequent client messages sent to the service then need to be load-balanced and failed-over, when necessary. For WebLogic, the client application achieves this in two phases:

  1. Addressing the virtual endpoint of the clustered service. The client Java code populates a "Provider URL" property with the address of the WebLogic cluster, to enable the "InitialContext" to be bootstrapped. As per the Weblogic JNDI documentation, "this address may be either a DNS host name that maps to multiple IP addresses or a comma separated list of single address host names or IP addresses". Essentially this virtual endpoint URL is only ever used by the client, to connect to a random server in the cluster, for creating the InitialContext. The URL is not used for any further interaction between client and clustered servers and does not influence how any of the subsequent T3 messages are routed.
  2. Load-balancing JNDI/RMI/EJB/JMS messaging to the clustered service. Once the InitialContext is obtained by the client application, all JNDI lookups and subsequent RMI/EJB/JMS remote invocations, use WebLogic generated "cluster-aware" stubs, under the covers. These stubs are populated with the list of clustered managed server direct connection URLs and the stub directly manages load-balancing and failover of T3 requests across the cluster. When a particular server is down, the stub will route new requests to another server, from the cluster list it maintains. Under certain circumstances, such as cases where no servers in the stub's current cluster list are reachable any more, the stub will be dynamically refreshed with a new version of the cluster list, to try. The stub's list of cluster membership URLs is unrelated to the fixed URL that is used for phase 1 above.

For bootstrapping the InitialContext (i.e. phase 1 above), the WebLogic Clustering documentation recommends that, for production environments, a DNS entry containing a list of clustered server addresses is used. This avoids the client application needing to "hard-code" a static set of cluster member addresses. However, sometimes it may be more convenient to use an external load balancer to virtualise this clustered T3 endpoint. In this blog entry, I will examine how the Oracle Traffic Director product can be used to virtualise such a cluster endpoint address, when bootstrapping the InitialContext.

It is not required or even supported to use an External Load Balancer for load-balancing the subsequent JNDI/RMI/EJB/JMS interaction over T3 (i.e. phase 2 above). The WebLogic generated stubs already perform this role and are far more adept at it. Additionally, many of these T3 based interactions are stateful, which the stubs are aware of and help manage. So attempting to also introduce an External Load Balancer in the message path will break this stateful interaction. These rules are explicitly stated in the WebLogic EJB/RMI documentation which concludes "when using the t3 protocol with external load balancers, you must ensure that only the initial context request is routed through the load balancers, and that subsequent requests are routed and controlled using WebLogic Server load balancing".


Oracle Traffic Director


Oracle Traffic Director (OTD) is a layer-7 software load-balancer for use on Exalogic. It includes many performance related characteristics, plus high availability features to avoid the load-balancer from being a "single point of failure". OTD has been around for approximately 2 years and has typically been used to virtualise HTTP endpoints for WebLogic deployed Web Applications and Web Services. A new version of OTD has recently been released (see download site + online documentation), that adds some interesting new features, including the ability to load-balance generic TCP based requests. An example use case for this, is to provide a single virtual endpoint to client applications, to access a set of replicated LDAP servers as a single logical LDAP instance. Another example use case, is to virtualise the T3 endpoint for WebLogic client applications to bootstrap the InitialContext, which is precisely what I will demonstrate in the rest of this blog entry.


Configuring OTD to Load Balance the T3 Bootstrap of the InitialContext 


I used my Linux x86-64 laptop to test OTD based TCP load-balancing. I generated and started up a simple WebLogic version 10.3.6 domain containing a cluster of two managed servers with default listen addresses set to "localhost:7001" and "localhost:7011" respectively. I also deployed a simple EJB application targeted to the cluster (see next section).

I then installed OTD version 11.1.1.7 and created a simple OTD admin-server and launched its browser based admin console. In the admin console, I could see a new section ready to list "TCP Proxies", in addition to the existing section to list "Virtual Servers" for HTTP(S) end-points, as shown in the screenshot below.


To create the TCP Proxy configuration I required, to represent a virtual endpoint for my WebLogic cluster, I chose the 'New Configuration' option and was presented with Step 1 of a wizard, as shown in the screenshot below.


In this Step 1, I provided a configuration name and an OS user name to run the listener instance under, and I selected the TCP radio button, rather than the usual HTTP(S) ones. In Step 2 (shown below), I defined the new listener host and port for this TCP Proxy, which in this case I nominated as port 9001 using the loopback address of my laptop.


In Step 3 (shown below), I then provided the details of the listen addresses and ports for my two running WebLogic managed servers that needed to be proxied to, which listen on localhost:7001 and localhost:7011 respectively.


In the remaining two wizard steps (not shown), I selected the local OTD node to deploy to and reviewed the summary of the proposed configuration before hitting the 'Create Configuration' button. Once created, I went to the "Instances" view in the OTD admin console and hit the "Start/Restart" button to start my configured TCP Proxy up, listening on port 9001 of localhost.


At this point, I assumed that I had correctly configured an OTD TCP Proxy to virtualize a Weblogic clustered T3 InitialContext endpoint, so I then needed to prove it worked....


Example Deployed Stateless Session Bean for the Test


I created a simple stateless session bean (including home interface) and deployed it as an EJB-JAR to Weblogic, targeted to my running cluster. The EJB interface for this is shown below and the implementation simply receives a text message from a client application and prints this message to the system-out, before sending an acknowledgement text message back to the client.

  public interface Example extends EJBObject {
    public String sendReceiveMessage(String msg) throws RemoteException;
  }


Example EJB Client Code for the Test


I then coded a simple standalone Java test client application in a single main class, as shown below, to invoke the remote EJB's sole business method (I've not listed the surrounding class & method definition and try-catch-finally exception handling code, for the sake of brevity).


  Hashtable env = new Hashtable();
  env.put(Context.INITIAL_CONTEXT_FACTORY,
        "weblogic.jndi.WLInitialContextFactory");
  env.put(Context.PROVIDER_URL, "t3://localhost:9001");
  System.out.println("1: Obtaining initial context");
  ctx = new InitialContext(env);
  System.out.println("2: Sleeping for 10 secs having got initial context");
  Thread.sleep(10 * 1000);
  System.out.println("3: Obtaining EJB home interface using JNDI lookup");
  ExampleHome home = (ExampleHome) PortableRemoteObject.narrow(
                 ctx.lookup("ejb.ExampleEJB"), ExampleHome.class);
  System.out.println("4: Sleeping for 10 secs having retrieved EJB home interface");
  Thread.sleep(10 * 1000);
  System.out.println("5: Creating EJB home");
  Example exampleEJB = home.create();
  System.out.println("6: Sleeping for 10 secs having got EJB home");
  Thread.sleep(10 * 1000);
  System.out.println("7: Start of indefinite loop");

  while (true) {
    System.out.println("7a: Calling EJB remote method");
    String msg = exampleEJB.sendReceiveMessage("Hello from client");
    System.out.println("7b: Received response from EJB call: " + msg);
    System.out.println("7c: Sleeping for 10 secs before looping again");
    Thread.sleep(10 * 1000);
  }

I included Java code to bootstrap the InitialContext, perform the JNDI lookup of the EJB home, create the EJB remote client instance and invoke the EJB's sendReceiveMessage() method. Additionally, I included various debug print statements, a few 10 second pauses and a while loop, to space out the important client-to-server interactions, for reasons that I'll allude to later.

As can be seen in the code, the Provider URL address I used for the InitialContext was "localhost:9001", which is the OTD TCP Proxy endpoint I'd configured, rather than the direct address of the clustered managed servers (which would have been hard-coded to "localhost:7001,localhost:7011", for example). I compiled this client application using a JDK 1.6 "javac" compiler, and ensured that whenever I ran the client application with a Java 1.6 JRE, I included the relevant WebLogic JAR on the classpath.


Test Runs and WebLogic Configuration Change


With my WebLogic clustered servers running (ports 7001 & 7011), and my OTD TCP Proxy listener running (port 9001), I ran the test client application and immediately hit a problem. In the system-out for one of the managed servers, the following error was shown:

  <02-Apr-2013 15:53:00 o'clock GMT> <Error> <RJVM> <BEA-000572> <The server rejected a connection attempt JVMMessage from:
 '-561770856250762161C:127.0.1.1R:-5037874633160149671S:localhost:localhost:7001,localhost:7011:MyDemoSystemDomain:MyDemoSystemServer2' to: '0B:127.0.0.1:[9001,-1,-1,-1,-1,-1,-1]' cmd: 'CMD_IDENTIFY_REQUEST', QOS: '101', responseId: '-1', invokableId: '-1', flags: 'JVMIDs Sent, TX Context Not Sent, 0x1', abbrev offset: '105' probably due to an incorrect firewall configuration or admin command.> 

In the client application's terminal, the following error was subsequently shown:

  javax.naming.CommunicationException [Root exception is java.net.ConnectException: t3://localhost:9001: Bootstrap to: localhost/127.0.0.1:9001' over: 't3' got an error or timed out]
  Caused by: java.net.ConnectException: t3://localhost:9001: Bootstrap to: localhost/127.0.0.1:9001' over: 't3' got an error or timed out

After some investigation, I found that the My Oracle Support knowledge base contained a document (ID 860340.1) which describes this expected WebLogic behaviour and provides the solution. By default, WebLogic expects any remote T3 access to have been initiated by the client using the same port number that the proxied request actually arrives on at the WebLogic server. In my case, because I was running this all on one machine, the port referenced by the client was port 9001 but the port hit on the WebLogic server was 7001. As recommended in the knowledge base document, I was able to prevent this benign server-side check from being enforced, by including a JVM-level parameter (see below) in the start-up command line for my WebLogic servers. I then re-started both managed servers in the cluster to pick up this new parameter.

  -Dweblogic.rjvm.enableprotocolswitch=true

This time, when I ran the client application, it successfully invoked the remote EJB. No errors were shown in the output of the clustered managed servers, and instead the system-out of one of the managed servers showed the following, confirming that the EJB had been called.

  class example.ExampleEJB_ivy2sm_Impl EJB received message: Hello from client
    (output repeats indefinitely)

On the client side, the expected output was also logged, as shown below.


  1: Obtaining initial context
  2: Sleeping for 10 secs having got initial context
  3: Obtaining EJB home interface using JNDI lookup
  4: Sleeping for 10 secs having retrieved EJB home interface
  5: Creating EJB home
  6: Sleeping for 10 secs having got EJB home
  7: Start of indefinite loop
  7a: Calling EJB remote method
  7b: Received response from EJB call: Request successfully received by EJB
  7c: Sleeping for 10 secs before looping again
    (output 7a to 7c repeats indefinitely)


Proving OTD is not in the Network Path for T3 Stub Load-Balancing


The reason why my test client code includes 10 second pauses and debug system-out logging, is to enable me to observe what happens when I kill various WebLogic and/or OTD servers, part way through client application test runs.

One of the main scenarios I was interested in was to allow the client to bootstrap the InitialContext via the OTD configured TCP Proxy listening on localhost:9001, then terminate the OTD instance/listener process. OTD would be stopped whilst the client application was still in the first 10 second pause, and before the client application had a chance to perform any JNDI lookup or EJB operations. By doing this I wanted to prove that once a client application bootstraps an InitialContext, the Provider URL (eg. "localhost:9001") would subsequently be ignored and the cluster-aware T3 stub, dynamically generated for the client application, would take over the role of routing messages directly to the clustered managed servers.

Upon re-running the client and stopping the OTD listener, after the client application logged the message "2: Sleeping for 10 secs having got initial context", it continued to work normally. After 10 seconds, it woke up and correctly performed the JNDI lookup of the EJB home, called the remote EJB home create() method and then called the remote EJB business method, logging each of these steps to system-out on the way. This proved that the InitialContext Provider URL is not used for the JNDI lookups or EJB invocations.

To further explore this, I also re-ran the scenario, after first starting Wireshark to sniff all local loopback network interface related traffic. I inspected the content of the T3 messages in the Wireshark captured network traffic logs. I observed that in some of the T3 messages, the responding managed server was indicating the current cluster address as being "localhost:7001,localhost:7011". Upon terminating the first of the two managed servers in the cluster (whilst the client application was still running in its continuous loop), I observed that subsequent cluster address metadata in the T3 responses from the second managed server, back to the client, showed the current cluster address as just "localhost:7011".

I ran one final test. With the WebLogic cluster of two managed servers running, but the OTD TCP Proxy Listener instance not running, I re-ran the client application again from start to end. As soon as the application started and attempted to bootstrap the InitialContext it failed, as expected, with the following logged to system-out.


  1: Obtaining initial context
  javax.naming.CommunicationException [Root exception is java.net.ConnectException: t3://localhost:9001: Destination unreachable; nested exception is: 
java.net.ConnectException: Connection refused; No available router to destination]
at weblogic.jndi.internal.ExceptionTranslator.toNamingException(ExceptionTranslator.java:40)
at weblogic.jndi.WLInitialContextFactoryDelegate.toNamingException(WLInitialContextFactoryDelegate.java:792)
at weblogic.jndi.WLInitialContextFactoryDelegate.getInitialContext(WLInitialContextFactoryDelegate.java:368)
at weblogic.jndi.Environment.getContext(Environment.java:315)
at weblogic.jndi.Environment.getContext(Environment.java:285)
at weblogic.jndi.WLInitialContextFactory.getInitialContext(WLInitialContextFactory.java:117)
at javax.naming.spi.NamingManager.getInitialContext(NamingManager.java:667)
at javax.naming.InitialContext.getDefaultInitCtx(InitialContext.java:288)
at javax.naming.InitialContext.init(InitialContext.java:223)
at javax.naming.InitialContext.&lt;init&gt;(InitialContext.java:197)
at client.RunEJBClient.main(RunEJBClient.java:19)


This showed that the client application was indeed attempting to bootstrap the InitialContext using the Provider URL of "localhost:9001". In this instance, the TCP proxy endpoint, that I'd configured in OTD, was not available, so the URL was not reachable. This proved that the client application was not magically discovering the cluster address from a cold-start and does indeed rely on a suitable Provider URL being provided, to bootstrap the InitialContext.


Summary


In this blog entry, I have shown how an external load balancer can be used to provide a virtualised endpoint for WebLogic client applications to reference, to bootstrap the InitialContext from a WebLogic cluster. In these particular tests, the load balancer product used was the latest version of OTD, with its new "TCP Proxy" capability. However, most of the findings are applicable to environments that employ other types of load balancers, including hardware load balancers.

I have also shown that, other than for InitialContext bootstrapping, an external load balancer will not be used for subsequent T3 load-balancing. Instead, the cluster-aware T3 stubs, that are dynamically loaded into client applications, automatically take on this role.

For production environments, system administrators still have the choice of mapping a DNS hostname to multiple IP addresses, to provide a single logical hostname address representing a WebLogic cluster. However, in some data-centres, it may be more convenient for a system administrator to re-use an existing load balancer technology, that is already in place, to virtualise the endpoint and provide a single logical address for a cluster. This may be the case if it is much quicker for a system administrator to make frequent and on-demand configuration changes to a load balancer, rather than continuously raising tickets with the network team, to update DNS entries.



Song for today: Pace by Sophia

Wednesday, October 24, 2012

Anatomy of WebLogic TLOGs and considerations for transaction recovery


Introduction


As discussed in an earlier blog entry, the preservation and ability to restore WebLogic Transaction Logs (TLOGs) is often critical to the availability of an I.T. system and the consistency of the data it contains. With the loss of a machine hosting a WebLogic server, it may be necessary to re-create the environment on another machine, to re-start the failed WebLogic server. Sometimes it may even be necessary to start the failed Weblogic server in a different data-centre for disaster recovery purposes. Restoring a WebLogic server to a running state may be necessary to enable stuck transactions, recorded in its TLOG, to be pushed through to completion. The new host environment may have been reproduced from a previous backup, from re-running WLST scripts or just from manual re-creation. However, what if the new environment uses different hostnames, different IP addresses or different resource names? Will it be possible to reconcile the stuck transactions recorded in the old preserved TLOG? Will it be possible for the WebLogic instance, in the new environment, to successfully commit these stuck transactions?

In this blog entry, I examine the anatomy of a TLOG, to discover what environmental dependencies the data stored in a TLOG has. This will help me to formulate some recommendations outlining the configuration settings that can be changed, when moving a WebLogic server and its TLOG to a new host environment, without losing the ability to recover stuck transactions.


Anatomy of a TLOG


WebLogic's TLOG holds records of the state of in-flight transactions, that are marked to be committed, in something that WebLogic calls a 'persistent store'. The persistent store can either be files on a file-system or a table in a database. The key factors to consider when planning the location of the persistent store are:
  • the latency of writing to the storage
  • the availability and resiliency of the storage
Whichever type of persistent store is used, the actual data stored is essentially the same. For the purposes of this investigation, we will look at the anatomy of a TLOG that uses a file store, but the principles gleaned will be exactly the same for database stores too.

To be able to study the contents of a TLOG, I needed to simulate a real system, processing distributed transactions, that would cause WebLogic to record data in its TLOG. I set-up an appropriate environment on my Linux x86-64 laptop, where I created a WebLogic version 12.1.1 domain with a single server defined, leaving the TLOG location as the default file-store. I created an Oracle version 11.2.0.1.0 single instance database with appropriate settings made to allow XA to function correctly. In the WebLogic domain I defined an XA data source to point to this database. I also created a WebLogic hosted JMS queue which is XA-enabled. Finally, I deployed a test application that I wrote, called BoxBurner, to the WebLogic server to help generate XA transactions for my tests.

BoxBurner is essentially a JEE application that continuously performs a repeated flow of steps. Each flow involves reading a message from a queue, performing a database insert operation, and then placing a new message back on to the same queue, ready for the next iteration. Each flow, composed of 'message dequeue', 'database insert' and 'message enqueue' operations, is contained within a single distributed XA transaction. As a result, for each transaction being processed, WebLogic writes a commit decision to its TLOG. In BoxBurner, I also provide a small HTML user interface to allow a user to initiate seeding the queue with a number of messages, to kick the continuous parallel flows off. So with BoxBurner, I can easily simulate a highly transactional system under load, to assist in exploring the anatomy of a TLOG.

By default, the location of a WebLogic server's file-store containing the TLOG is at:

 <domainpath>/servers/<servername>/data/store/default/_WLS_<servername>000000.DAT

WebLogic rolls these log files over, under certain circumstances, incrementing the number in the end of the file name. Each file has a fixed size (approximately 1 MB) and when full (or when a checkpoint occurs), a new file is rolled over to. In my case, the server's initial TLOG data file is called:

  _WLS_MYDEMOADMSVR000000.DAT

So for my tests, running BoxBurner under load, on my laptop, I issued a kill -9 on the WebLogic process to terminate it without any warning, hoping to catch at least a few transactions in-flight. I then studied the state of the TLOG data file to see what transactions were recorded and stuck at the point when the failure occurred.

To study the contents of the TLOG data file, I ran a WebLogic dump utility that shows a summary of the important contents of the TLOG, using a small wrapper Bash script I'd created (dumptlogtext.sh):

 #!/bin/sh
 WEBLOGIC_HOME=/opt/oracle/Middleware/wlserver_12.1
 WEBLOGIC_JAR=$WEBLOGIC_HOME/server/lib/weblogic.jar
 WEBLOGIC_TX_JAR=$WEBLOGIC_HOME/../modules/com.bea.core.transaction_3.0.0.0.jar
 java -cp $WEBLOGIC_JAR:$WEBLOGIC_TX_JAR weblogic.transaction.internal.StoreTransactionLoggerImpl $*

I ran the script with the following arguments to tell it where to find the TLOG data file (note, if you run it with no arguments, the utility displays some useful help information):

 $ ./dumptlogtext.sh /u01/MyDemoSystem/stores/MyDemoAdmSvr_Stores MYDEMOADMSVR

The output essentially showed two different types of record, contained in the TLOG file:
  • One checkpoint record, with the following example output below:
  | Class Name = weblogic.transaction.internal.ResourceCheckpoint                |
  | Object = ResourceCheckpoint={{BBDatasource_MyDemoDomain, props={}}, {WLStore |
  | _MyDemoDomain_MyDemoAdmSvr_BB_JMSRscJMSStore, props={}}}                     |
  +------------------------------------------------------------------------------+
  • Seven transaction records, each with output similar to the example below:
  | Class Name = weblogic.transaction.internal.ServerTransactionImpl             |
  | Object = Name=[EJB boxburner.msgprocessor.MsgProcessorMDB.onMessage(javax.jm |
  | s.Message)],Xid=BEA1-1800B10D6C88705836D9(381221099),Status=Active,numReplie |
  | sOwedMe=0,numRepliesOwedOthers=0,seconds since begin=1775,seconds left=30,XA |
  | ServerResourceInfo[WLStore_MyDemoDomain_MyDemoAdmSvr_BB_JMSRscJMSStore]=(Ser |
  | verResourceInfo[WLStore_MyDemoDomain_MyDemoAdmSvr_BB_JMSRscJMSStore]=(state= |
  | new,assigned=none),xar=null,re-Registered = false),XAServerResourceInfo[BBDa |
  | tasource_MyDemoDomain]=(ServerResourceInfo[BBDatasource_MyDemoDomain]=(state |
  | =new,assigned=none),xar=null,re-Registered = false),SCInfo[MyDemoDomain+MyDe |
  | moAdmSvr]=(state=active),properties=({weblogic.transaction.name=[EJB boxburn |
  | er.msgprocessor.MsgProcessorMDB.onMessage(javax.jms.Message)]}))             |
  +------------------------------------------------------------------------------+

The checkpoint record is made by WebLogic to enable it to track the different XA resources (eg. a database, a message queue) that have been incorporated in one or more XA transactions that the WebLogic server has participated in. In this checkpoint record, we can see that the two XA resources are 'BBDatasource_MyDemoDomain' (the WebLogic datasource pointing at the database) and 'WLStore_MyDemoDomain_MyDemoAdmSvr_BB_JMSRscJMSStore' (the persistent store for the JMS queue).

The TLOG contained 7 pending transactions - XA transactions that WebLogic has marked for commit, but that may not yet have been committed in one or more of the back-end systems. Each transaction has a global ID (Xid), and in the example transaction shown, the WebLogic generated Xid is "BEA1-1800B10D6C88705836D9".

While the Weblogic server was still shut-down with pending transactions, I used SQL*Plus to connect to the database, and query the list of XA transactions that the database was tracking as pending and yet to commit. The query I issued was:

  SELECT LOCAL_TRAN_ID, GLOBAL_TRAN_ID, STATE, MIXED, HOST, COMMIT# 
  FROM DBA_2PC_PENDING

The list of pending transaction records output by the database was:

  LOCAL_TRAN_ID GLOBAL_TRAN_ID                 STATE    MIX HOST       COMMIT#
  ------------- ------------------------------ -------- --- ---------- ----------
  3.11.827      48801.1803B10D6C88705836D9     prepared no  pdthinkpad 1310237
  5.38.95       48801.1802B10D6C88705836D9     prepared no  pdthinkpad 1310234
  7.22.28       48801.1805B10D6C88705836D9     prepared no  pdthinkpad 1310241
  11.73.12      48801.1800B10D6C88705836D9     prepared no  pdthinkpad 1310235
  12.21.37      48801.1801B10D6C88705836D9     prepared no  pdthinkpad 1310228
  2.2.534       48801.1799B10D6C88705836D9     prepared no  pdthinkpad 1310236

Note: In my experience it can take a minute or two, from the point when failure occurs, for pending transactions to appear in the database's DBA_2PC_PENDING table. Therefore if you try this out yourself, and you see no records reported, but you expected to see some, wait a few minutes before executing the query again.

As you can see, the "BEA1-1800B10D6C88705836D9" transaction recorded in the TLOG matches one of the recorded transactions ("48801.1800B10D6C88705836D9") in the Oracle database (the latter just uses a number representation for the text 'BEA1'). You may also notice that the database only lists 6 pending transactions whilst the WebLogic TLOG listed 7. Why would this be? Well, this is likely to be because the database had committed the 7th transaction but didn't have time to notify WebLogic that this had occurred, or WebLogic didn't have time to clean the transaction record up in the TLOG, when the server process was killed. This is perfectly fine, because during Transaction Recovery, WebLogic will be notified by the database that the transaction has already been reconciled and no further action is required.

What is also interesting is that in the TLOG, no hostnames, IP addresses or ports appear to be directly recorded by WebLogic for the database resource or the JMS queue resource. However, in the database pending transaction table, we see that the database has tracked the hostname ('pdthinkpad') of the originator of the transaction, which in this case maps to my WebLogic server's listen address hostname. Later, we'll come back to why these observations may be important.

When running the WebLogic server instance, I'd elected to turn on debug-level logging for the server, with the flag 'DebugJTATLOG' enabled. Below is an example of what WebLogic logged when processing one of the transactions and marking it to be committed in the TLOG:

     ####<19-Oct-2012 13:40:31 o'clock BST> <Debug> <JTATLOG> <pdthinkpad> <MyDemoAdmSvr> <[ACTIVE] ExecuteThread: '1' for queue: 'weblogic.kernel.Default (self-tuning)'> <<anonymous>> <BEA1-1800B10D6C88705836D9> <> <1342701631032> <BEA-000000> <TLOG writing log record, class=weblogic.transaction.internal.ServerTransactionImpl, obj=Name=[EJB boxburner.msgprocessor.MsgProcessorMDB.onMessage(javax.jms.Message)],Xid=BEA1-1800B10D6C88705836D9(347028305),Status=Logging,numRepliesOwedMe=0,numRepliesOwedOthers=0,seconds since begin=0,seconds left=30,activeThread=Thread[[ACTIVE] ExecuteThread: '1' for queue: 'weblogic.kernel.Default (self-tuning)',5,Pooled Threads],XAServerResourceInfo[WLStore_MyDemoDomain_MyDemoAdmSvr_BB_JMSRscJMSStore]=(ServerResourceInfo[WLStore_MyDemoDomain_MyDemoAdmSvr_BB_JMSRscJMSStore]=(state=prepared,assigned=MyDemoAdmSvr),xar=WLStore_MyDemoDomain_MyDemoAdmSvr_BB_JMSRscJMSStore282730943,re-Registered = false),XAServerResourceInfo[BBDatasource_MyDemoDomain]=(ServerResourceInfo[BBDatasource_MyDemoDomain]=(state=prepared,assigned=MyDemoAdmSvr),xar=BBDatasource,re-Registered = false),...[TRUNCATED FOR BREVITY]...,CoordinatorURL=MyDemoAdmSvr+pdthinkpad:7001+MyDemoDomain+t3+)>

Finally, to test that WebLogic's Transaction Recovery process works correctly in normal circumstances, I simply re-started the WebLogic server on my laptop. Within half a minute, WebLogic's Transaction Recovery Service had kicked in and successfully recovered the stuck transactions that were recorded in the TLOG and pushed them through to completion in the affected JMS Queue and Database resources. To prove this, I queried the database table DBA_2PC_PENDING again and this time no rows were returned.


Testing Transaction Recovery After Changing Specific Environment Settings


To determine which key environment setting changes will adversely affect the ability of WebLogic to recover TLOG recorded transactions, I ran a series of tests. For each test, I first ran the system as normal, to process messages. Then I killed the WebLogic server process and checked in the TLOG and in the database, to ensure that some stuck transactions were present. Then I made the necessary changes in the environment configuration settings for the test. Finally I re-started WebLogic to see whether the transactions were successfully recovered or not.

TEST: Changing JDBC Datasource Hostname and Port

In this test, after killing WebLogic, I changed the JDBC URL (hostname and port) value in the domain configuration settings for the WebLogic datasource ('BBDatasource'), to reference a non-existent Oracle DB listener address.  Upon re-starting WebLogic, as the Weblogic transaction recovery service kicked in, I saw lots of entries in the WebLogic log file like:

<19-Oct-2012 15:58:40 o'clock BST> <Warning> <JTA> <BEA-110486> <Transaction BEA1-0511D6A7A132705836D9 cannot complete commit processing because resource [BBDatasource_MyDemoDomain] is unavailable. The transaction will be abandoned after 431,741 seconds unless all resources acknowledge the commit decision.> 

Also, when querying the DBA_2PC_PENDING database table, I could see that there were still pending transactions present. As one would expect, this shows that WebLogic can't magically find a database and relies on the JDBC datasource configuration in the domain to work out how to reach the database, rather than any information about a database host or port that may or may not have been stored in the TLOG at the time when the transaction occurred. Upon correcting the JDBC URL in the domain's configuration and re-starting WebLogic, the pending transactions were all happily pushed through to completion, as expected.

TEST: Changing JDBC Datasource Name

For this test run, after killing the WebLogic server, I changed the name of the 'BBDatasource' data-source in the domain configuration (keeping the JNDI name it contains the same). Again I discovered that on server re-start, the pending transactions could not be recovered. This demonstrated that WebLogic's transaction recovery service uses the domain's current configuration settings for resources like data-sources, rather than anything that may have been recorded, before failure, in the TLOG. Again, correcting the datasource name and re-starting WebLogic resulted in the pending transactions being successfully cleared.

TEST: Changing Oracle Database Listener Host and Port

For this test run, after killing the WebLogic server, rather than changing anything in WebLogic's domain configuration, I instead changed the host and port of my local database listener (to listen on my laptop's wireless card address rather than my laptop's wired ethernet card address). On re-starting the WebLogic server, WebLogic was unable to recover the transactions, as expected, because it couldn't contact the database on the address it expected to. This time, to help recovery be possible, rather than reverting the database changes I had made (Oracle listener host/port), I instead modified the JDBC datasource URL for the WebLogic domain, with the new host/port of the database. Upon re-starting WebLogic, the pending transactions were successfully recovered. What this showed is that, following a failure and before initiating system recovery, a database can be moved and a WebLogic server's configuration can be changed to reflect the new database location. With these database location and WebLogic datasource changes, WebLogic is still able to successfully recover transactions, which is potentially important for disaster recovery processes, where a database may be running in a different data-centre.

TEST: Changing WebLogic Default Listen Address

In my test environment, the WebLogic server was originally listening to the hostname "pdthinkpad" which mapped to my laptop's wired Ethernet card. For this test, after killing the server, I made some changes in WebLogic's domain configuration, changing WebLogic's default channel listen address to a hostname ('temphost') which mapped to my laptop's wireless network card. I also changed my laptop's hostname to 'temphost' and added the appropriate entry for the hostname in '/etc/hosts', removing any trace of the old 'pdthinkpad' name. Before re-starting the WebLogic server to listen to the new address, I double checked the database, querying the DBA_2PC_PENDING table to check that there were pending transactions. The example output was:

  LOCAL_TRAN_ID GLOBAL_TRAN_ID                 STATE    MIX HOST       COMMIT#
  ------------- ------------------------------ -------- --- ---------- ----------
  5.18.1100     48801.03542CFF3807705836D9     prepared no  pdthinkpad 1472769
  7.11.720      48801.03552CFF3807705836D9     prepared no  pdthinkpad 1472766
  12.5.42       48801.03512CFF3807705836D9     prepared no  pdthinkpad 1472778
  1.22.724      48801.034F2CFF3807705836D9     prepared no  pdthinkpad 1472772

Notice that the Oracle database is tracking that the pending transactions were originated from a server on host 'pdthinkpad'. So the question would be, if recovery of these transactions was now initiated by a seemingly different server (ie. one from a host called 'temphost'), would the database allow these transactions to be reconciled and completed?

With the new hostname and listen address in place, I re-started the WebLogic server and observed that WebLogic's transaction recovery service worked successfully, pushing through to completion the pending transactions. On querying the database DBA_2PC_PENDING table again, no records were listed, showing that the database had indeed committed the transactions. Therefore, the conclusion is that the hostname and listen address for a WebLogic server can be changed, without preventing the successful recovery of transactions recorded in the TLOG. Again, this is potentially important if the WebLogic server has been started in a different data-centre, with a copy of the original TLOG, as part of a disaster recovery process.


Conclusions From Tests


These tests were all based on cases where my WebLogic server, owning the TLOG, was the originator of XA transactions. In other words, the WebLogic server was the 'transaction coordinator'. In these cases, I was able to show that, after restoring or re-creating a Weblogic environment, the pending transactions that had been tracked in a preserved TLOG, could be successfully completed. I was able to make the following environmental changes without inhibiting the ability to recover transactions:
  • WebLogic Server default channel listen addresses (Hostname/IP-address and Port)
  • Oracle Database listener address (Hostname/IP-address and Port) plus related settings in the WebLogic domain datasource configuration file
What became evident during the investigation, is that it is important to keep the names of the following resources the same (stored in the domain configuration), between the stages of system failure and system recovery:
  • Weblogic Domain Name
  • WebLogic Server Names
  • Weblogic JDBC Datasource Names
  • Weblogic Persistent Store Names (File and/or DB)
  • Weblogic JMS Server Names
  • Weblogic JMS Destination Names

Word of caution


In WebLogic's online documentation for Managing Transactions there is a section titled Moving a Server. The help document highlights that in a specific situation, if the hostname and port of WebLogic's listen address is changed, transaction recovery will fail. This situation occurs if the WebLogic server is a transaction sub-coordinator as part of a larger transaction that has been propagated from another separate Weblogic server.

For example, an EJB client application running on Server A may initiate an XA transaction and invoke an EJB running on Server B which in turn performs an XA database update operation. In these cases, the transaction coordinator (Server A) will have recorded the URL (host and port) of the sub-coordinator (Server B) directly in the TLOG belonging to the coordinator server (Server A). During the transaction recovery process, if the sub-coordinator (Server B) listen address has changed, the coordinator (Server A) will not be able to contact the sub-coordinator (Server B) to inform it to commit or rollback pending transactions. The address that the coordinator server knew for the sub-coordinator server was hard-wired directly in the TLOG file before the system failure occurred. The coordinator server is thus unaware that a new listen address for the sub-coordinator server is being used.

Another example of where an XA transaction will be propagated between two WebLogic servers, is the situation where one server may be hosting a queue, and the second server is hosting a Message Driven Bean (MDB) based application, listening for messages appearing on the remote queue, with the MDB's container managed transaction demarcation set to 'required'.

As a result of these potential scenarios, the help documentation referenced above, states "Oracle recommends configuring server instances using DNS names rather than IP addresses to promote portability", to always allow for WebLogic instances to be restored on new machines, without inhibiting transactions recovery. While in this blog entry I have proven that this is not always necessary, depending on the nature of the XA transactions that occur (ones that don't propagate between WebLogic servers), it is a good rule to stick to, when possible, to be safe.


Summary


As we have seen, it is possible to change or move a database that has pending XA transactions, and still have WebLogic resolve and push through to completion those pending transactions, when a working system is restored.

We have also seen that it is important to keep key WebLogic resource names the same, when recovering WebLogic servers on new machines or in a whole new environment.

We have proved that in some circumstances it is possible to change the hostnames and ports of the WebLogic servers (for example, to restore the domain to run on a new set of host machines) and still allow successful transaction recovery. The set of circumstances where this is possible is when XA transactions have not been propagated between WebLogic servers.

However, we know that changing a WebLogic server's listen hostname and port breaks the ability to recover XA transactions that have been propagated. In addition, there may well be other places in the host application's metadata, completely unrelated to transactions, that hard-wire knowledge of server listen addresses. For example, an installed Oracle Middleware product running on WebLogic may be tracking server listen addresses in its Meta-Data-Store (MDS). As another example, a bespoke JEE application hosted on WebLogic may be tracking server listen addresses in its own property files.

It is also worth noting that the official Oracle Fusion Middleware Disaster Recovery guide talks about the importance of using "hostname-based configuration instead of IP-based configuration" in the guide section titled Planning Host Names.

Therefore, I would echo the general advice that Oracle gives: always use hostnames, rather than IP addresses, for network-address-related resources. This maximises the ability to move servers to new IP addresses while retaining the same hostnames, and thus avoids breaking a functioning system. These principles are especially pertinent when planning for disaster recovery and deciding how best to re-establish a WebLogic domain and its TLOGs in another data centre, so that the system can be restored to full working order.


Song for today: (I Don't Need You To) Set Me Free by Grinderman

Friday, October 19, 2012

Writing your own Java application on Exalogic using SDP

I've written before about how Exalogic enables Oracle Middleware products to use Sockets Direct Protocol (SDP) under the covers, rather than TCP/IP, to achieve lower-latency communication over an InfiniBand network. Originally, the capability to leverage SDP was limited to Oracle-internal APIs in the JRockit JVM (Java 1.6) and was thus only usable by Oracle products such as WebLogic.

However, SDP support has now been added as a general capability to Java 1.7 (HotSpot JVM), thus enabling any standalone Java application to be written to take advantage of SDP, rather than TCP/IP, over InfiniBand. I found a new tutorial, Lesson: Understanding the Sockets Direct Protocol, describing how to write a Java application that can use SDP, so I gave it a go on an Exalogic X2-2 machine. Below I've recorded the steps that I took to test this, in case it's useful to others.

To leverage SDP from your application, you can still use the same Java socket APIs as normal and simply use a configuration file to indicate that SDP, rather than TCP/IP, should be employed. The tutorial I found shows how to provide the SDP configuration file, but doesn't provide Java code examples to test this. So, first of all, I quickly wrote Java main classes for a server and a client and tested that they worked correctly on my Linux x86-64 laptop using plain TCP/IP over Ethernet.

If you want to try it out yourself, you can download a copy of the test Java project I wrote from here. Below is the key part of my server class that receives a line of text from the client over a socket, prints this text to the console and then replies with an acknowledgement.

  try (ServerSocket serverSocket = new ServerSocket(port)) {
    info("Running server on port " + port);

    while (true) {
      try (Socket socket = serverSocket.accept();
           BufferedReader in = new BufferedReader(
              new InputStreamReader(socket.getInputStream()));
           PrintWriter out = new PrintWriter(
              new BufferedWriter(new OutputStreamWriter(
                socket.getOutputStream())))) {
        String msg = in.readLine();
        info("Server received message:  " + msg);
        out.println("ACKNOWLEDGED (" + msg + ")");
        out.flush();
      }
    }
  }

And below is the key part of my client class which sends a line of text over a socket to the server and prints out the response text it receives.

  info("Running client connecting to " + host + ":" + port);

  try (Socket socket = new Socket(host, port);
       PrintWriter out = new PrintWriter(new BufferedWriter(
          new OutputStreamWriter(socket.getOutputStream())));
       BufferedReader in = new BufferedReader(
          new InputStreamReader(socket.getInputStream()))) {
    info("Client sent message:  " + SEND_MESSAGE_TEXT);
    out.println(SEND_MESSAGE_TEXT);
    out.flush();
    info("Client received message:  " + in.readLine());
  }

The observant among you will notice that all of the above is just standard Java 1.7 using java.net.* & java.io.* APIs. Nothing special.
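For completeness, the only imports the two excerpts rely on are the standard ones listed below; the info(...) calls are simply a small helper in my test project that prints the given message to the console (prefixed with 'INFO - ').

  import java.io.BufferedReader;
  import java.io.BufferedWriter;
  import java.io.InputStreamReader;
  import java.io.OutputStreamWriter;
  import java.io.PrintWriter;
  import java.net.ServerSocket;
  import java.net.Socket;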

I then moved the test client and server apps over to two Exalogic compute nodes. Actually the compute nodes were virtualised in this case, rather than physical, with each Oracle Linux OS running as a guest OS (vServer) on top of the OVM hypervisor. As instructed in the tutorial, I added the following JVM arguments to my bash scripts for starting the Java server and client applications so that they can use SDP:

  -Dcom.sun.sdp.conf=/u01/myshare/sdp.conf
  -Djava.net.preferIPv4Stack=true
  -Dcom.sun.sdp.debug

I slipped the com.sun.sdp.debug argument in there too, because that makes the JVM print some information to the console, indicating if SDP is being used by the app. I created the sdp.conf file at the location /u01/myshare/sdp.conf, with the following content:

  bind * *
  connect 192.168.0.101  1234

In the first line I tell the JVM that if an application opens a server socket listening on all local network interface IP addresses, it should use SDP. The second line tells the JVM that if an application opens a new socket to the remote host:port of 192.168.0.101:1234, it should use SDP. This is the host:port of one of the network interfaces on the vServer that my Java server will listen on when it starts.
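As an aside, the sdp.conf rule format is more flexible than my two simple lines might suggest. Going by the examples in the Oracle tutorial, a host can be specified as an address prefix, a port as a range or wildcard, and lines beginning with '#' are comments. For example (illustrative values only):

  # Use SDP for a server socket bound to this specific local address, any port
  bind 192.168.0.101 *

  # Use SDP when connecting to any port from 1024 upwards
  # on any host in the 192.168.0.* subnet
  connect 192.168.0.0/24 1024-*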

Then, running my wrapper bash script to start the Java server main class (with its sdp.conf present) on one vServer, and the equivalent script to start the Java client class (with its sdp.conf) on another vServer, I saw the following output:

  [paul@paul_vserver2]$ ./runserver.sh
    BIND to 0.0.0.0:1234 (socket converted to SDP protocol)
    INFO - Running server on port 1234
    INFO - Server received message:  Hello from Java socket client program

  [paul@paul_vserver1]$ ./runclient.sh
    INFO - Running client connecting to 192.168.0.101:1234
    CONNECT to 192.168.0.101:1234 (socket converted to SDP protocol)
    INFO - Client sent message:  Hello from Java socket client program
    INFO - Client received message:  ACKNOWLEDGED (Hello from Java socket client program)

As you can see, the client successfully sends the server a message and receives a response. The '(socket converted to SDP protocol)' lines are the debug output from the JVM, showing that SDP is being used. However, to prove that SDP was indeed being used, I did some more analysis. Whilst my Java server class was still running with its server socket listening, I ran the following OS-level networking command to see whether my running application's SDP listener was present. The output displayed:

  [paul@paul_vserver2]$ sdpnetstat -al | grep sdp
    sdp  0   0 *:search-agent   *:*  LISTEN

This shows my server process listening on all local IP addresses using SDP ('search-agent' is simply the service name that /etc/services maps to port 1234). As soon as I kill my server process and run the sdpnetstat command again, no SDP listeners are shown.

To further help prove this, I started the server again listening on SDP, but on the client vServer I changed the SDP conf file to contain a 'rubbish' connect value, forcing the client application to fall back to TCP/IP. Upon running the client application again, I saw the following error, because the client was trying to use TCP to talk to a remote SDP listener. Also notice the '(no match)' JVM debug output, showing that no matching SDP rule was found in sdp.conf.

  [paul@paul_vserver1]$ ./runclient.sh
    INFO - Running client connecting to 192.168.0.101:1234
    CONNECT to 192.168.0.101:1234 (no match)
    java.net.ConnectException: Connection refused
        at java.net.PlainSocketImpl.socketConnect(Native Method)
        at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
        at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
        at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:391)
        at java.net.Socket.connect(Socket.java:579)
        at java.net.Socket.connect(Socket.java:528)
        at java.net.Socket.<init>(Socket.java:425)
        at java.net.Socket.<init>(Socket.java:208)
        at testsocketcomms.Client.sendMessage(Client.java:35)
        at testsocketcomms.Client.main(Client.java:21)

Anyway, that's pretty much it. Want to use SDP from a custom application on Exalogic? Just use standard Java socket programming, specify the right settings in a configuration file and off you go!

In the future I hope to revisit this topic and performance-test a bespoke application under load, comparing the difference between using TCP/IP over InfiniBand and SDP over InfiniBand. However, developing a 'realistic' performance test application, one that doesn't contain other, more prevalent bottlenecks, is not a simple task, hence it isn't something I could quickly demonstrate here.


Song for today: Floodlit World by Ultrasound