
Wednesday, April 3, 2013

Load Balancing T3 InitialContext Retrieval for WebLogic using Oracle Traffic Director

 

Introduction


The T3 protocol is WebLogic's fast, native binary protocol, used for most inter-server communication and, by default, for communication to a WebLogic server from client applications using RMI, EJB or JMS, for example. If a JMS distributed queue or an RMI/EJB-based application is deployed to a WebLogic cluster for high availability, the client application needs a way of addressing the "virtual endpoint" of the remote clustered service. Subsequent client messages sent to the service then need to be load-balanced and failed-over when necessary. For WebLogic, the client application achieves this in two phases:

  1. Addressing the virtual endpoint of the clustered service. The client Java code populates a "Provider URL" property with the address of the WebLogic cluster, to enable the "InitialContext" to be bootstrapped. As per the WebLogic JNDI documentation, "this address may be either a DNS host name that maps to multiple IP addresses or a comma separated list of single address host names or IP addresses". Essentially, this virtual endpoint URL is only ever used by the client to connect to a random server in the cluster for creating the InitialContext. The URL is not used for any further interaction between the client and the clustered servers, and does not influence how any of the subsequent T3 messages are routed.
  2. Load-balancing JNDI/RMI/EJB/JMS messaging to the clustered service. Once the InitialContext is obtained by the client application, all JNDI lookups and subsequent RMI/EJB/JMS remote invocations, use WebLogic generated "cluster-aware" stubs, under the covers. These stubs are populated with the list of clustered managed server direct connection URLs and the stub directly manages load-balancing and failover of T3 requests across the cluster. When a particular server is down, the stub will route new requests to another server, from the cluster list it maintains. Under certain circumstances, such as cases where no servers in the stub's current cluster list are reachable any more, the stub will be dynamically refreshed with a new version of the cluster list, to try. The stub's list of cluster membership URLs is unrelated to the fixed URL that is used for phase 1 above.

For bootstrapping the InitialContext (i.e. phase 1 above), the WebLogic Clustering documentation recommends that, for production environments, a DNS entry containing a list of clustered server addresses is used. This avoids the client application needing to "hard-code" a static set of cluster member addresses. However, sometimes it may be more convenient to use an external load balancer to virtualise this clustered T3 endpoint. In this blog entry, I will examine how the Oracle Traffic Director product can be used to virtualise such a cluster endpoint address, when bootstrapping the InitialContext.
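To make the two addressing styles concrete, here is a minimal sketch of how a client might populate the JNDI environment (my own illustration, not code from the original post; the host names are hypothetical, and actually creating the InitialContext requires the WebLogic client JAR on the classpath):

```java
import java.util.Hashtable;
import javax.naming.Context;

public class ProviderUrlStyles {

    // Builds the JNDI environment for a WebLogic T3 InitialContext bootstrap.
    static Hashtable<String, String> clusterEnv(String providerUrl) {
        Hashtable<String, String> env = new Hashtable<String, String>();
        env.put(Context.INITIAL_CONTEXT_FACTORY,
                "weblogic.jndi.WLInitialContextFactory");
        env.put(Context.PROVIDER_URL, providerUrl);
        return env;
    }

    public static void main(String[] args) {
        // Style 1: a DNS name that resolves to the IPs of all cluster members
        System.out.println(
            clusterEnv("t3://mycluster.example.com:7001").get(Context.PROVIDER_URL));
        // Style 2: a comma-separated list of cluster member addresses
        System.out.println(
            clusterEnv("t3://host1:7001,host2:7001").get(Context.PROVIDER_URL));
        // With the WebLogic client JAR present, the bootstrap itself would be:
        // Context ctx = new InitialContext(clusterEnv("t3://..."));
    }
}
```

Either form is only used for the phase 1 bootstrap; as described above, it plays no part in the subsequent stub-managed routing.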

It is neither required nor supported to use an external load balancer for load-balancing the subsequent JNDI/RMI/EJB/JMS interaction over T3 (i.e. phase 2 above). The WebLogic-generated stubs already perform this role and are far more adept at it. Additionally, many of these T3-based interactions are stateful, which the stubs are aware of and help manage, so introducing an external load balancer into the message path would break this stateful interaction. These rules are explicitly stated in the WebLogic EJB/RMI documentation, which concludes: "when using the t3 protocol with external load balancers, you must ensure that only the initial context request is routed through the load balancers, and that subsequent requests are routed and controlled using WebLogic Server load balancing".


Oracle Traffic Director


Oracle Traffic Director (OTD) is a layer-7 software load balancer for use on Exalogic. It includes many performance-related characteristics, plus high availability features to prevent the load balancer itself from becoming a "single point of failure". OTD has been around for approximately 2 years and has typically been used to virtualise HTTP endpoints for WebLogic-deployed Web Applications and Web Services. A new version of OTD has recently been released (see download site + online documentation) that adds some interesting new features, including the ability to load-balance generic TCP-based requests. One example use case is to provide a single virtual endpoint for client applications to access a set of replicated LDAP servers as a single logical LDAP instance. Another example use case is to virtualise the T3 endpoint for WebLogic client applications to bootstrap the InitialContext, which is precisely what I will demonstrate in the rest of this blog entry.


Configuring OTD to Load Balance the T3 Bootstrap of the InitialContext 


I used my Linux x86-64 laptop to test OTD-based TCP load-balancing. I generated and started up a simple WebLogic version 10.3.6 domain containing a cluster of two managed servers, with default listen addresses set to "localhost:7001" and "localhost:7011" respectively. I also deployed a simple EJB application targeted to the cluster (see next section).

I then installed OTD version 11.1.1.7 and created a simple OTD admin-server and launched its browser based admin console. In the admin console, I could see a new section ready to list "TCP Proxies", in addition to the existing section to list "Virtual Servers" for HTTP(S) end-points, as shown in the screenshot below.


To create the TCP Proxy configuration I required, to represent a virtual endpoint for my WebLogic cluster, I chose the 'New Configuration' option and was presented with Step 1 of a wizard, as shown in the screenshot below.


In Step 1, I provided a configuration name and the OS user to run the listener instance as, and selected the TCP radio button rather than the usual HTTP(S) ones. In Step 2 (shown below), I defined the new listener host and port for this TCP Proxy, which in this case I nominated as port 9001 on the loopback address of my laptop.


In Step 3 (shown below), I then provided the details of the listen addresses and ports of my two running WebLogic managed servers that needed to be proxied to, which listen on localhost:7001 and localhost:7011 respectively.


In the remaining two wizard steps (not shown), I selected the local OTD node to deploy to and reviewed the summary of the proposed configuration before hitting the 'Create Configuration' button. Once created, I went to the "Instances" view in the OTD admin console and hit the "Start/Restart" button to start my configured TCP Proxy up, listening on port 9001 of localhost.


At this point, I assumed that I had correctly configured an OTD TCP Proxy to virtualise a WebLogic clustered T3 InitialContext endpoint, so I then needed to prove it worked.


Example Deployed Stateless Session Bean for the Test


I created a simple stateless session bean (including home interface) and deployed it as an EJB-JAR to WebLogic, targeted to my running cluster. The EJB interface for this is shown below; the implementation simply receives a text message from a client application and prints this message to the system-out, before sending an acknowledgement text message back to the client.

  import java.rmi.RemoteException;
  import javax.ejb.EJBObject;

  public interface Example extends EJBObject {
    public String sendReceiveMessage(String msg) throws RemoteException;
  }


Example EJB Client Code for the Test


I then coded a simple standalone Java test client application in a single main class, as shown below, to invoke the remote EJB's sole business method (I've not listed the surrounding class & method definition and try-catch-finally exception handling code, for the sake of brevity).


  Hashtable env = new Hashtable();
  env.put(Context.INITIAL_CONTEXT_FACTORY,
        "weblogic.jndi.WLInitialContextFactory");
  env.put(Context.PROVIDER_URL, "t3://localhost:9001");
  System.out.println("1: Obtaining initial context");
  Context ctx = new InitialContext(env);
  System.out.println("2: Sleeping for 10 secs having got initial context");
  Thread.sleep(10 * 1000);
  System.out.println("3: Obtaining EJB home interface using JNDI lookup");
  ExampleHome home = (ExampleHome) PortableRemoteObject.narrow(
                 ctx.lookup("ejb.ExampleEJB"), ExampleHome.class);
  System.out.println("4: Sleeping for 10 secs having retrieved EJB home interface");
  Thread.sleep(10 * 1000);
  System.out.println("5: Creating EJB home");
  Example exampleEJB = home.create();
  System.out.println("6: Sleeping for 10 secs having got EJB home");
  Thread.sleep(10 * 1000);
  System.out.println("7: Start of indefinite loop");

  while (true) {
    System.out.println("7a: Calling EJB remote method");
    String msg = exampleEJB.sendReceiveMessage("Hello from client");
    System.out.println("7b: Received response from EJB call: " + msg);
    System.out.println("7c: Sleeping for 10 secs before looping again");
    Thread.sleep(10 * 1000);
  }

I included Java code to bootstrap the InitialContext, perform the JNDI lookup of the EJB home, create the EJB remote client instance and invoke the EJB's sendReceiveMessage() method. Additionally, I included various debug print statements, a few 10-second pauses and a while loop, to space out the important client-to-server interactions, for reasons that I'll come back to later.

As can be seen in the code, the Provider URL address I used for the InitialContext was "localhost:9001", which is the OTD TCP Proxy endpoint I'd configured, rather than the direct address of the clustered managed servers (which would have been hard-coded to "localhost:7001,localhost:7011", for example). I compiled this client application using a JDK 1.6 "javac" compiler, and ensured that whenever I ran the client application with a Java 1.6 JRE, I included the relevant WebLogic JAR on the classpath.


Test Runs and WebLogic Configuration Change


With my WebLogic clustered servers running (ports 7001 & 7011), and my OTD TCP Proxy listener running (port 9001), I ran the test client application and hit a problem immediately. In the system-out for one of the managed servers, the following error was shown:

  <02-Apr-2013 15:53:00 o'clock GMT> <Error> <RJVM> <BEA-000572> <The server rejected a connection attempt JVMMessage from:
 '-561770856250762161C:127.0.1.1R:-5037874633160149671S:localhost:localhost:7001,localhost:7011:MyDemoSystemDomain:MyDemoSystemServer2' to: '0B:127.0.0.1:[9001,-1,-1,-1,-1,-1,-1]' cmd: 'CMD_IDENTIFY_REQUEST', QOS: '101', responseId: '-1', invokableId: '-1', flags: 'JVMIDs Sent, TX Context Not Sent, 0x1', abbrev offset: '105' probably due to an incorrect firewall configuration or admin command.> 

In the client application's terminal, the following error was subsequently shown:

  javax.naming.CommunicationException [Root exception is java.net.ConnectException: t3://localhost:9001: Bootstrap to: localhost/127.0.0.1:9001' over: 't3' got an error or timed out]
  Caused by: java.net.ConnectException: t3://localhost:9001: Bootstrap to: localhost/127.0.0.1:9001' over: 't3' got an error or timed out

After some investigation, I found that the My Oracle Support knowledge base contains a document (ID 860340.1) which describes this expected WebLogic behaviour and provides the solution. By default, WebLogic expects any remote T3 access to have been initiated by the client using the same port number that the proxied request actually arrives on at the WebLogic server. In my case, because I was running this all on one machine, the port referenced by the client was 9001 but the port hit on the WebLogic server was 7001. As recommended in the knowledge base document, I was able to prevent this benign server-side check from being enforced by including a JVM-level parameter (see below) in the start-up command line for my WebLogic servers. I then re-started both managed servers in the cluster to pick up this new parameter.

  -Dweblogic.rjvm.enableprotocolswitch=true

This time, when I ran the client application, it successfully invoked the remote EJB. No errors were shown in the output of the clustered managed servers, and instead the system-out of one of the managed servers showed the following, confirming that the EJB had been called.

  class example.ExampleEJB_ivy2sm_Impl EJB received message: Hello from client
    (output repeats indefinitely)

On the client side, the expected output was also logged, as shown below.


  1: Obtaining initial context
  2: Sleeping for 10 secs having got initial context
  3: Obtaining EJB home interface using JNDI lookup
  4: Sleeping for 10 secs having retrieved EJB home interface
  5: Creating EJB home
  6: Sleeping for 10 secs having got EJB home
  7: Start of indefinite loop
  7a: Calling EJB remote method
  7b: Received response from EJB call: Request successfully received by EJB
  7c: Sleeping for 10 secs before looping again
    (output 7a to 7c repeats indefinitely)


Proving OTD is not in the Network Path for T3 Stub Load-Balancing


The reason my test client code includes 10-second pauses and debug system-out logging is to enable me to observe what happens when I kill various WebLogic and/or OTD servers part-way through client application test runs.

One of the main scenarios I was interested in was to allow the client to bootstrap the InitialContext via the OTD-configured TCP Proxy listening on localhost:9001, and then terminate the OTD instance/listener process. OTD would be stopped whilst the client application was still in the first 10-second pause, before the client application had a chance to perform any JNDI lookup or EJB operations. By doing this, I wanted to prove that once a client application bootstraps an InitialContext, the Provider URL (e.g. "localhost:9001") is subsequently ignored and the cluster-aware T3 stub, dynamically generated for the client application, takes over the role of routing messages directly to the clustered managed servers.

Upon re-running the client and then stopping the OTD listener after the client application had logged the message "2: Sleeping for 10 secs having got initial context", the client continued to work normally. After 10 seconds, it woke up and correctly performed the JNDI lookup of the EJB home, called the remote EJB home create() method and then called the remote EJB business method, logging each of these steps to system-out on the way. This proved that the InitialContext Provider URL is not used for the JNDI lookups or EJB invocations.

To further explore this, I also re-ran the scenario, after first starting Wireshark to sniff all local loopback network interface related traffic. I inspected the content of the T3 messages in the Wireshark captured network traffic logs. I observed that in some of the T3 messages, the responding managed server was indicating the current cluster address as being "localhost:7001,localhost:7011". Upon terminating the first of the two managed servers in the cluster (whilst the client application was still running in its continuous loop), I observed that subsequent cluster address metadata in the T3 responses from the second managed server, back to the client, showed the current cluster address as just "localhost:7011".

I ran one final test. With the WebLogic cluster of two managed servers running, but the OTD TCP Proxy Listener instance not running, I re-ran the client application again from start to end. As soon as the application started and attempted to bootstrap the InitialContext it failed, as expected, with the following logged to system-out.


  1: Obtaining initial context
  javax.naming.CommunicationException [Root exception is java.net.ConnectException: t3://localhost:9001: Destination unreachable; nested exception is: 
    java.net.ConnectException: Connection refused; No available router to destination]
    at weblogic.jndi.internal.ExceptionTranslator.toNamingException(ExceptionTranslator.java:40)
    at weblogic.jndi.WLInitialContextFactoryDelegate.toNamingException(WLInitialContextFactoryDelegate.java:792)
    at weblogic.jndi.WLInitialContextFactoryDelegate.getInitialContext(WLInitialContextFactoryDelegate.java:368)
    at weblogic.jndi.Environment.getContext(Environment.java:315)
    at weblogic.jndi.Environment.getContext(Environment.java:285)
    at weblogic.jndi.WLInitialContextFactory.getInitialContext(WLInitialContextFactory.java:117)
    at javax.naming.spi.NamingManager.getInitialContext(NamingManager.java:667)
    at javax.naming.InitialContext.getDefaultInitCtx(InitialContext.java:288)
    at javax.naming.InitialContext.init(InitialContext.java:223)
    at javax.naming.InitialContext.<init>(InitialContext.java:197)
    at client.RunEJBClient.main(RunEJBClient.java:19)


This showed that the client application was indeed attempting to bootstrap the InitialContext using the Provider URL of "localhost:9001". In this instance, the TCP Proxy endpoint that I'd configured in OTD was not available, so the URL was not reachable. This proved that the client application was not magically discovering the cluster address from a cold start and does indeed rely on a suitable Provider URL being provided to bootstrap the InitialContext.


Summary


In this blog entry, I have shown how an external load balancer can be used to provide a virtualised endpoint for WebLogic client applications to reference, to bootstrap the InitialContext from a WebLogic cluster. In these particular tests, the load balancer product used was the latest version of OTD, with its new "TCP Proxy" capability. However, most of the findings are applicable to environments that employ other types of load balancers, including hardware load balancers.

I have also shown that, other than for InitialContext bootstrapping, an external load balancer will not be used for subsequent T3 load-balancing. Instead, the cluster-aware T3 stubs, that are dynamically loaded into client applications, automatically take on this role.

For production environments, system administrators still have the choice of mapping a DNS hostname to multiple IP addresses, to provide a single logical hostname address representing a WebLogic cluster. However, in some data-centres, it may be more convenient for a system administrator to re-use an existing load balancer technology, that is already in place, to virtualise the endpoint and provide a single logical address for a cluster. This may be the case if it is much quicker for a system administrator to make frequent and on-demand configuration changes to a load balancer, rather than continuously raising tickets with the network team, to update DNS entries.



Song for today: Pace by Sophia

Friday, October 19, 2012

Writing your own Java application on Exalogic using SDP

I've written before about how Exalogic enables Oracle Middleware products to use Sockets Direct Protocol (SDP) under the covers, rather than TCP-IP, to achieve lower latency communication over an InfiniBand network. Originally, the capability to leverage SDP was limited to Oracle internal-only APIs in the JRockit JVM (Java 1.6) and thus was only usable by Oracle products like WebLogic.

However, SDP support has now been added as a general capability to Java 1.7 (Hotspot JVM), thus enabling any standalone Java application to be written to take advantage of SDP rather than TCP-IP, over InfiniBand. I found a new tutorial, Lesson: Understanding the Sockets Direct Protocol, describing how to write a Java application that can use SDP, so I gave it a go on an Exalogic X2-2 machine. Below I've recorded the steps that I took to test this, in case it's useful to others.

To leverage SDP from your application, you can still use the same Java socket APIs as normal and simply use a configuration file to indicate that SDP should be employed, not TCP-IP. The tutorial I found shows how to provide the SDP configuration file, but doesn't provide Java code examples to test this. So first of all I quickly wrote Java main classes for a server and a client and tested that they worked correctly on my Linux x86-64 laptop when using just TCP-IP over Ethernet.

If you want to try it out yourself, you can download a copy of the test Java project I wrote from here. Below is the key part of my server class that receives a line of text from the client over a socket, prints this text to the console and then replies with an acknowledgement.

  try (ServerSocket serverSocket = new ServerSocket(port)) {
    info("Running server on port " + port);

    while (true) {
      try (Socket socket = serverSocket.accept();
           BufferedReader in = new BufferedReader(
              new InputStreamReader(socket.getInputStream()));
           PrintWriter out = new PrintWriter(
              new BufferedWriter(new OutputStreamWriter(
                socket.getOutputStream())))) {
        String msg = in.readLine();
        info("Server received message:  " + msg);
        out.println("ACKNOWLEDGED (" + msg + ")");
        out.flush();
      }
    }
  }

And below is the key part of my client class which sends a line of text over a socket to the server and prints out the response text it receives.

  info("Running client connecting to " + host + ":" + port);

  try (Socket socket = new Socket(host, port);
       PrintWriter out = new PrintWriter(new BufferedWriter(
          new OutputStreamWriter(socket.getOutputStream())));
       BufferedReader in = new BufferedReader(
          new InputStreamReader(socket.getInputStream()))) {
    info("Client sent message:  " + SEND_MESSAGE_TEXT);
    out.println(SEND_MESSAGE_TEXT);
    out.flush();
    info("Client received message:  " + in.readLine());
  }

The observant among you will notice that all of the above is just standard Java 1.7 using java.net.* and java.io.* APIs. Nothing special.
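For anyone who wants to try the same loopback test without downloading the project, here is a self-contained condensed variant (my own sketch, not the original project's code) that runs a one-shot server on an ephemeral port in a background thread and exercises a single request/response round-trip over plain TCP:

```java
import java.io.*;
import java.net.*;

public class LoopbackEchoTest {
    static final String SEND_MESSAGE_TEXT = "Hello from Java socket client program";

    // One-shot server: accepts a single connection and replies with an acknowledgement.
    static void serveOnce(ServerSocket serverSocket) throws IOException {
        try (Socket socket = serverSocket.accept();
             BufferedReader in = new BufferedReader(
                 new InputStreamReader(socket.getInputStream()));
             PrintWriter out = new PrintWriter(new BufferedWriter(
                 new OutputStreamWriter(socket.getOutputStream())))) {
            String msg = in.readLine();
            out.println("ACKNOWLEDGED (" + msg + ")");
            out.flush();
        }
    }

    // Client: sends one line of text and returns the server's reply.
    static String sendMessage(String host, int port) throws IOException {
        try (Socket socket = new Socket(host, port);
             PrintWriter out = new PrintWriter(new BufferedWriter(
                 new OutputStreamWriter(socket.getOutputStream())));
             BufferedReader in = new BufferedReader(
                 new InputStreamReader(socket.getInputStream()))) {
            out.println(SEND_MESSAGE_TEXT);
            out.flush();
            return in.readLine();
        }
    }

    public static void main(String[] args) throws Exception {
        try (ServerSocket serverSocket = new ServerSocket(0)) { // ephemeral port
            Thread server = new Thread(new Runnable() {  // anonymous class keeps it Java 7 compatible
                public void run() {
                    try { serveOnce(serverSocket); } catch (IOException ignored) { }
                }
            });
            server.start();
            String reply = sendMessage("localhost", serverSocket.getLocalPort());
            System.out.println(reply); // prints "ACKNOWLEDGED (Hello from Java socket client program)"
            server.join();
        }
    }
}
```

Because it only uses the standard socket APIs, the same class can be pointed at SDP simply by launching the JVM with the com.sun.sdp.conf system property described below.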

I then moved the test client and server apps over to two Exalogic compute nodes. Actually the compute nodes were virtualised in this case, rather than physical, with each Oracle Linux OS running as a guest OS (vServer) on top of the OVM hypervisor. As instructed in the tutorial, I added the following JVM arguments to my bash scripts for starting the Java server and client applications so that they can use SDP:

  -Dcom.sun.sdp.conf=/u01/myshare/sdp.conf
  -Djava.net.preferIPv4Stack=true
  -Dcom.sun.sdp.debug

I slipped the com.sun.sdp.debug argument in there too, because that makes the JVM print some information to the console, indicating if SDP is being used by the app. I created the sdp.conf file at the location /u01/myshare/sdp.conf, with the following content:

  bind * *
  connect 192.168.0.101  1234

In the first line, I tell the JVM that if an application opens a server socket listening on all local network interface IP addresses, it should use SDP. The second line tells the JVM that if an application opens a new socket to the remote host:port 192.168.0.101:1234, it should also use SDP. This is the host:port of one of the network interfaces on the vServer that my Java server will listen on when it starts.
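For reference, the rule format also allows address prefixes and port ranges. The lines below are a hedged illustration based on the rule syntax described in the Java 7 SDP tutorial, not rules from my actual test:

```
# Use SDP for server sockets bound to any address in the 192.168.0.0/16 range
bind 192.168.0.0/16 *

# Use SDP for outbound connections to a port range on that host
connect 192.168.0.101 1234-1240
```

Lines starting with '#' are comments, and '*' acts as a wildcard for addresses or ports.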

Then running my wrapper bash scripts to start the Java server main class with its SDP file present, on a vServer, and the Java client class, with its SDP file, on another vServer, I saw the following output:

  [paul@paul_vserver2]$ ./runserver.sh
    BIND to 0.0.0.0:1234 (socket converted to SDP protocol)
    INFO - Running server on port 1234
    INFO - Server received message:  Hello from Java socket client program

  [paul@paul_vserver1]$ ./runclient.sh
    INFO - Running client connecting to 192.168.0.101:1234
    CONNECT to 192.168.0.101:1234 (socket converted to SDP protocol)
    INFO - Client sent message:  Hello from Java socket client program
    INFO - Client received message:  ACKNOWLEDGED (Hello from Java socket client program)

As you can see, the client successfully sends the server a message and receives a response. The "socket converted to SDP protocol" lines are the debug output from the JVM, showing that SDP is being used. However, to further prove that SDP was indeed being used, I did some more analysis. Whilst my Java server class was still running with its server socket listening, I ran the following OS-level networking command to see if my running application's SDP listener was present. The output displayed:

  [paul@paul_vserver2]$ sdpnetstat -al | grep sdp
    sdp  0   0 *:search-agent   *:*  LISTEN

This shows my server process running listening on all local IP addresses using SDP. As soon as I kill my server process and run the sdpnetstat command again, no SDP listeners are shown.

To further help prove this, I started up the server again listening on SDP, but on the client vServer I changed the SDP conf file to have a 'rubbish' connect value, to force the client application to use TCP-IP. Upon running the client application again, I saw the following error, because the client was trying to use TCP to talk to a remote SDP listener. Also notice the "(no match)" JVM debug output, showing that no SDP match was found in sdp.conf.

  [paul@paul_vserver1]$ ./runclient.sh
    INFO - Running client connecting to 192.168.0.101:1234
    CONNECT to 192.168.0.101:1234 (no match)
    java.net.ConnectException: Connection refused
        at java.net.PlainSocketImpl.socketConnect(Native Method)
        at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
        at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
        at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:391)
        at java.net.Socket.connect(Socket.java:579)
        at java.net.Socket.connect(Socket.java:528)
        at java.net.Socket.<init>(Socket.java:425)
        at java.net.Socket.<init>(Socket.java:208)
        at testsocketcomms.Client.sendMessage(Client.java:35)
        at testsocketcomms.Client.main(Client.java:21)

Anyway, that's pretty much it. Want to use SDP from a custom application on Exalogic? Just use standard Java socket programming, specify the right settings in a configuration file and off you go!

In the future I hope to revisit this topic and performance test a bespoke application under load, comparing the performance difference between using TCP-IP over InfiniBand and SDP over InfiniBand. However, developing a 'realistic' performance test application, that doesn't contain other more prevalent bottlenecks, is not a simple task, hence it's not something I could have quickly demonstrated here.


Song for today: Floodlit World by Ultrasound

Wednesday, October 5, 2011

New release of DomainHealth - 1.0

I've just released a new version of DomainHealth which, by virtue of being the next increment after 0.9, makes this the grand 1.0 release! No great fanfare or massive new features, but this should [hopefully] be a nice stable release to rely on and live up to its 1.0 billing! :D

You can download DomainHealth 1.0 from here: http://sourceforge.net/projects/domainhealth

One new feature in 1.0 that is worth highlighting though, is the new optional capability to collect and show Processor, Memory and Network statistics from the underlying host Operating System and Machine that WebLogic is running on. DomainHealth only enables this feature if you've also deployed another small open source JEE application that I've created, called WLHostMachineStats. Below is a screenshot of DomainHealth 1.0 in action, displaying graphs of some of these host machine statistics (in this case it's running on an Exalogic system).

WLHostMachineStats is a small agent (a JMX MBean deployed as a WAR file) that runs in every WebLogic Server in a WebLogic domain. It is used to retrieve OS data from the underlying machine hosting each WebLogic Server instance. For more information, including deployment instructions, and to download it, go to: http://sourceforge.net/projects/wlhostmchnstats

Here's another screenshot, just for fun:

Some things to bear in mind....

...the WLHostMachineStats project is still in its infancy and currently places restrictions on which specific environments are supported. Right now, WLHostMachineStats can only be used for WebLogic domains running on Linux Intel (x86) 64-bit based machines (including Exalogic), and only for versions 10.3.0 or greater of WebLogic. This is partly because WLHostMachineStats relies on the SIGAR open source utility, which uses native C libraries and JNI. I hope to widen the list of supported platforms for WLHostMachineStats in the future.


Song for today: Dynamite Steps by The Twilight Singers

Friday, September 2, 2011

New release of DomainHealth (v0.9.1)

I've just released a new version of DomainHealth (version 0.9.1). This is primarily a maintenance/bug-fix release.

DomainHealth is an open source "zero-config" monitoring tool for WebLogic. It collects important server metrics over time, archives these into CSV files and provides a simple web interface for viewing graphs of current and historical statistics. It also works nicely on Exalogic.

To download (and see release notes) go to the project home (select 'files' menu option) at: http://sourceforge.net/projects/domainhealth/



Song for today: Ascension Day by Talk Talk

Thursday, March 3, 2011

Exalogic DCLI - run commands on all compute nodes at once

Exalogic includes a tool called DCLI (Distributed Command Line Interface) that can be used to run the same commands on all, or a subset of, compute nodes in parallel. This saves a lot of time and helps avoid the sorts of silly errors that often occur when running a command over and over again. DCLI is a tool that originally came with Exadata (as documented in chapter 9 of the Oracle Exadata Storage Server Software User's Guide - E13861-05), and is now incorporated into the new Exalogic product too. It is worth noting that if you are ever involved in performing the initial configuration of a new Exalogic rack, using OneCommand to configure the Exalogic's networking, then under the covers OneCommand will be using DCLI to perform a lot of its work.
Introduction to Exalogic's DCLI
The Oracle Enterprise Linux 5.5 based factory image running on each Exalogic compute node has the exalogic.tools RPM package installed. This contains the DCLI tool in addition to other useful Exalogic command line utilities. Running 'rpm -qi exalogic.tools' on a compute node shows the following package information:
Name : exalogic.tools
Version : 1.0.0.0
Release : 1.0
When you run 'rpm -ql exalogic.tools' you will see that the set of command line utilities are all placed in a directory at '/opt/exalogic.tools'. Specifically, the DCLI tool is located at '/opt/exalogic.tools/tools/dcli'.

Running DCLI from the command line with the '-h' argument, will present you with a short help summary of DCLI and the parameters it can be given:

# /opt/exalogic.tools/tools/dcli -h

If you look at the contents of the '/opt/exalogic.tools/tools/dcli' file you will see that it is actually a Python script that, essentially, determines the list of compute nodes that a supplied command should be applied to and then runs the supplied command on each compute node using SSH under the covers. Conveniently, the Python script also captures the output from each compute node and prints it out in the shell that DCLI was run from. The output from each individual compute node is prefixed by that particular compute node's name so that it is easy for the administrator to see if something untoward occurred on one of the compute nodes only.

A good way of testing DCLI, is to SSH to your nominated 'master' compute node in the Exalogic rack (eg. the 1st one), as root user, and create a file (eg. called 'nodelist') which contains the hostnames of all the compute nodes in the rack (separated by newlines). For example, my nodelist file has the following entries in the first 3 lines:

el01cn01
el01cn02
el01cn03
....

Note: You can comment out one or more hostnames with a hash ('#') if you want DCLI to ignore particular hostnames.

As a reminder on Exalogic compute node naming conventions, 'el01' is the Exalogic rack's default name and 'cn01' identifies the specific compute node (number 01) in that rack.

Once you've created the list of target compute nodes for DCLI to distribute commands to, a nice test is to run a DCLI command that just prints the date-time of each compute node to the shell output of your master compute node (using the /bin/date Linux command). For example:

# /opt/exalogic.tools/tools/dcli -t -g nodeslist /bin/date
Example output:

Target nodes: ['el01cn01', 'el01cn02', 'el01cn03',....]
el01cn01: Mon Feb 21 21:11:42 UTC 2011
el01cn02: Mon Feb 21 21:11:42 UTC 2011
el01cn03: Mon Feb 21 21:11:42 UTC 2011
....

When this runs, you will be prompted for the password for each compute node that DCLI contacts using SSH. The '-t' option tells DCLI to first print out the names of all the nodes it will run the operation on, which is useful for double-checking that you are hitting the compute nodes you intended. The '-g' option provides the name of the file that contains the list of nodes to operate on (in this case, 'nodeslist' in the current directory).


SSH Trust and User Equivalence

To use DCLI without being prompted for a password for each compute node that is contacted, it is preferable to first set-up SSH Trust between the master compute node and all the other compute nodes. DCLI calls this "user equivalence"; a named user on one compute node will then be assumed to have the same identity as the same named user on all other compute nodes. On your nominated 'master' compute node (eg. 'el01cn01'), as root user, first generate an SSH public-private key for the root user. For example:

# ssh-keygen -N '' -f ~/.ssh/id_dsa -t dsa

This places the generated public and private key files in the '.ssh' sub-directory of the root user's home directory (note, '' in the command is two single quotes)

Now run the DCLI command with the '-k' option as shown below, which pushes the current user's SSH public key to each other compute node's '.ssh/authorized_keys' file to establish SSH Trust. You will again be prompted to enter the password for each compute node, but this will be the last time you need to. With the '-k' option, each compute node is contacted sequentially rather than in parallel, to give you the chance to enter the password for each node in turn.

# /opt/exalogic.tools/tools/dcli -t -g nodeslist -k -s "\-o StrictHostKeyChecking=no"

In my example above, I also pass the SSH option 'StrictHostKeyChecking=no' so you avoid being prompted with the standard SSH question "Are you sure you want to continue connecting (yes/no)", for each compute node that is contacted. The master compute node will then be added to the list of SSH known hosts on each other compute node, so that this yes/no question will never occur again.

Once the DCLI command completes you have established SSH Trust and User Equivalence. Any subsequent DCLI commands that you issue, from now on, will occur without you being prompted for passwords.

You can then run the original date-time test again, to satisfy yourself that SSH Trust and User Equivalence is indeed established between the master compute node and each other compute node and that no passwords are prompted for.

# /opt/exalogic.tools/tools/dcli -t -g nodeslist /bin/date

Useful Examples

Now let's have a look at some examples of common DCLI commands you might need to issue for your new Exalogic system.

Example 1 - Add a new OS group to each compute node called 'oracle' with group id 500:

# /opt/exalogic.tools/tools/dcli -t -g nodeslist groupadd -g 500 oracle

Example 2 - Add a new OS user to each compute node called 'oracle' with user id 500 as a member of the new 'oracle' group:

# /opt/exalogic.tools/tools/dcli -t -g nodeslist useradd -g oracle -u 500 oracle

Example 3 - Set the password to 'welcome1' for the OS 'root' user and the new 'oracle' user on each compute node. This uses another feature of DCLI: if multiple commands need to be run in one go, they can be added to a file (which I tend to suffix with '.scl' in my examples - 'scl' is the convention for 'source command line'), and the '-x' parameter then tells DCLI to run the commands from the named file:

# vi setpasswds.scl
echo welcome1 | passwd root --stdin
echo welcome1 | passwd oracle --stdin
# chmod u+x setpasswds.scl
# /opt/exalogic.tools/tools/dcli -t -g nodeslist -x setpasswds.scl

Example 4 - Create a new mount point directory and definition on each compute node for mounting the common/general NFS share which exists on Exalogic's ZFS Shared Storage appliance (the hostname of the HA shared storage on Exalogic's internal InfiniBand network in my example is 'el01sn-priv') and then from each compute node, permanently mount the NFS Share:

# /opt/exalogic.tools/tools/dcli -t -g nodeslist mkdir -p /u01/common/general
# /opt/exalogic.tools/tools/dcli -t -g nodeslist chown -R oracle:oracle /u01/common/general
# vi addmount.scl
cat >> /etc/fstab << EOF
el01sn-priv:/export/common/general /u01/common/general nfs rw,bg,hard,nointr,rsize=131072,wsize=131072,tcp,vers=3 0 0
EOF
# chmod u+x addmount.scl
# /opt/exalogic.tools/tools/dcli -t -g nodeslist -x addmount.scl
# /opt/exalogic.tools/tools/dcli -t -g nodeslist mount /u01/common/general


Running DCLI As Non-Root User

In the default Exalogic set-up, DCLI executes as root user when issuing all of its commands, regardless of which OS user's shell you enter the DCLI command from. Although root access is often necessary for creating things like OS users, groups and mount points, it is not desirable if you just want to use DCLI to execute non-privileged commands under a specific OS user on all compute nodes. For example, as a new 'coherence' OS user, you may want the ability to run a script that starts a Coherence Cache Server instance on every one of the compute nodes in the Exalogic rack, in one go, to automatically join the same Coherence cluster.

To enable DCLI to be used under any OS user and to run all its distributed commands on all compute nodes, as that OS user, we just need to make a few simple one-off changes on our master compute node where DCLI is being run from...

1. As root user, allow all OS users to access the Exalogic tools directory that contains the DCLI tool:

# chmod a+x /opt/exalogic.tools/tools

2. As root user, change the permissions of the DCLI tool to be executable by all users:

# chmod a+x /opt/exalogic.tools/tools/dcli

3. As root user, modify the DCLI Python script (/opt/exalogic.tools/tools/dcli) using 'vi' and replace the line....

USER_ID="root"

...with the line...

USER_ID=pwd.getpwuid(os.getuid())[0]

This script line uses some Python functions to set the DCLI user id to the name of the current OS user running the DCLI command, rather than the hard-coded 'root' username.

4. Whilst still editing the file using vi, add the following Python library import command near the top of the DCLI Python script to enable the 'pwd' Python library to be referenced by the code in step 3.

import pwd
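Putting steps 3 and 4 together, the relevant fragment of the modified script looks like this (a sketch of just the changed lines, not the full DCLI source):

```python
import os
import pwd

# Was the hard-coded: USER_ID = "root"
# Now resolve the username of whichever OS user invoked DCLI:
USER_ID = pwd.getpwuid(os.getuid())[0]
```

So when the 'coherence' user runs DCLI, USER_ID resolves to 'coherence', and the SSH connections to the other compute nodes are made as that user rather than as root.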

Now log-on to your master compute node as your new non-root OS user (eg. 'coherence' user) and, once you've done the one-off setup of your nodeslist file and SSH-Trust/User-Equivalence (as described earlier), you will happily be able to run DCLI commands across all compute nodes as your new OS user.

For example, for a test Coherence project I've been playing with recently, I have a Cache Server 'start in-background' script in a Coherence project located on my Exalogic's ZFS Shared Storage. When I run this script using the DCLI command below, from my 'coherence' OS user shell on my master compute node, 30 Coherence cache server instances are started immediately, almost instantly forming a cluster across the compute nodes in the rack.

# /opt/exalogic.tools/tools/dcli -t -g nodeslist /u01/common/general/my-coh-proj/start-cache-server.sh

Just for fun, I can run this again to start up 30 more Coherence servers, which join the same Coherence cluster, now containing 60 members.


Summary

As you can see DCLI is pretty powerful yet very simple in both concept and execution!


Song for today: Death Rays by Mogwai

Sunday, January 23, 2011

Exalogic Software Optimisations

[Update 19-March-2011 - this blog entry is actually a short summary of a much more detailed Oracle internal document I wrote in December 2010. A public whitepaper using the content from my internal document, has now been published on Oracle's Exalogic home page (see "White Papers" tab on right-hand side of the home page); for the public version, a revised introduction, summary and set of diagrams have been contributed by Oracle's Exalogic Product Managers.]

For version 1.0 of Exalogic there are a number of Exalogic-specific enhancements and optimisations that have been made to the Oracle Application Grid middleware products, specifically:
  • the WebLogic application server product;
  • the JRockit Java Virtual Machine (JVM) product;
  • the Coherence in-memory clustered data-grid product.
In many cases, these product enhancements address performance limitations that do not arise on general purpose hardware using Ethernet based networking; typically they only manifest when running on Exalogic's high-density compute nodes with InfiniBand's fast networking infrastructure. Most of these enhancements are designed to enable the benefits of the high-end hardware components, unique to Exalogic, to be utilised to the full. This results in a well-balanced hardware/software system.

I find it useful to categorise the optimisations in the following way:
  1. Increased server scalability, throughput and responsiveness. Improvements to the networking, request handling, memory and thread management mechanisms, within WebLogic and JRockit, enable the products to scale better on the high-multi-core compute nodes that are connected to the fast InfiniBand fabric. WebLogic will use Java NIO based non-blocking server socket handlers (muxers) for more efficient request processing, multi-core aware thread pools and shared byte buffers to reduce data copies between sub-system layers. Coherence also includes changes to ensure more optimal network bandwidth usage when using InfiniBand networking.
  2. Superior server session replication performance. WebLogic's In-Memory HTTP Session Replication mechanism is improved to utilise the large InfiniBand bandwidth available between clustered servers. A WebLogic server replicates more of the session data in parallel, over the network to a second server, using parallel socket connections (parallel "RJVMs") instead of just a single connection. WebLogic also avoids a lot of the unnecessary processing that usually takes place on the server receiving session replicas, by using "lazy de-serialisation". With the help of the underlying JRockit JVM, WebLogic skips the host node's TCP/IP stack, and uses InfiniBand's faster “native” networking protocol, called SDP, to enable the session payloads to be sent over the network with lower latency. As a result, for stateful web applications requiring high availability, end-user requests are responded to far quicker.
  3. Tighter Oracle RAC integration for faster and more reliable database interaction. For Exalogic, WebLogic includes a new component called “Active Gridlink for RAC” that provides application server connectivity to Oracle RAC clustered databases. This supersedes the existing WebLogic capability for Oracle RAC connectivity, commonly referred to as “Multi-Data-Sources”. Active Gridlink provides intelligent Runtime Connection Load-Balancing (RCLB) across RAC nodes based on the current workload of each RAC node, by subscribing to the database's Fast Application Notification (FAN) events using Oracle Notification Services (ONS). Active Gridlink uses Fast Connection Failover (FCF) to enable rapid RAC node failure detection for greater application resilience (using ONS events as an input). Active GridLink also allows more transparent RAC node location management with support for SCAN and uses RAC node affinity for handling global (XA) transactions more optimally. Consequently, enterprise Java applications involving intensive database work, achieve a higher level of availability with better throughput and more consistent response times.
  4. Reduced Exalogic to Exadata response times. When an Exalogic system is connected directly to an Exadata system (using the built-in Infiniband switches and cabling), WebLogic is able to use InfiniBand's faster “native” networking protocol, SDP, for JDBC interaction with the Oracle RAC database on Exadata. This incorporates enhancements to JRockit and the Oracle Thin JDBC driver in addition to WebLogic. With this optimisation, an enterprise Java application that interacts with Exadata, is able to respond to client requests quicker, especially where large JDBC result sets need to be passed back from Exadata to Exalogic.
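As an illustration of the Exalogic-to-Exadata optimisation in point 4, a GridLink data source can be pointed at an SDP listener on Exadata by naming the SDP protocol in the JDBC URL and enabling SDP support in the JVM. This is only a hedged sketch: the host name ('exadata-ib-host'), port and service name below are hypothetical placeholders, so check the WebLogic and Oracle Net documentation for the exact settings for your environment.

```
# JVM argument for the WebLogic managed server on Exalogic
-Doracle.net.SDP=true

# GridLink data source JDBC URL naming the SDP protocol
jdbc:oracle:thin:@(DESCRIPTION=
  (ADDRESS=(PROTOCOL=sdp)(HOST=exadata-ib-host)(PORT=1522))
  (CONNECT_DATA=(SERVICE_NAME=myservice)))
```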
To summarise, Exalogic provides a high performance, highly redundant hardware platform for any type of middleware application. If the middleware application happens to be running on Oracle's Application Grid software, further significant performance gains will be achieved.


Song for today: Come to Me by 65daysofstatic

Friday, December 10, 2010

Exalogic downloads and documentation links

Now that Exalogic has been released, the main Exalogic documentation is available at: http://download.oracle.com/docs/cd/E18476_01/index.htm

Worth particular attention is the "Machine Owner's Guide" and the "Enterprise Deployment Guide".

The Machine Owner's Guide will give you a good idea of the machine's internal specifications as well as the unit's external dimensions, power consumption needs, cooling needs, multi-rack cabling configurations, etc.

The Enterprise Deployment Guide (EDG) will point you in the right direction if you want to install and configure the Application Grid products on Exalogic in an optimal way for performance and high availability.


If you are about to take shipment of Exalogic and need copies of the software, then these can be accessed from the Oracle eDelivery website, using the following steps:
  • Browse to the eDelivery site at http://edelivery.oracle.com/
  • Press "Continue" link
  • Submit the requested user info when prompted, accepting the restrictions
  • In the resulting search page, for the Product Pack field, select "Oracle Fusion Middleware", and for Platform field select "Linux x86-64"
  • In the results page, press the link for "Oracle Exalogic Elastic Cloud Software 11g Media Pack"
The Exalogic downloads include:
  • Compute Node Base Image for Exalogic (parts 1 and 2) - this is the Oracle Enterprise Linux image including the Unbreakable Enterprise Kernel, OFED drivers for InfiniBand connectivity, and various supporting command line utilities
  • Configuration Utilities for Exalogic - this is the set of "Middleware Machine Configurator" tools, including the spreadsheet and accompanying shell scripts to help users perform the base network configuration for all the compute nodes in an Exalogic rack (a.k.a. "OneCommand")
  • Oracle WebLogic Server 11gR1 (10.3.4) - this is the combined WebLogic/JRockit/Coherence .bin installer for Exalogic (Linux x86-64)

Song for today: Cause = Time by Broken Social Scene

Thursday, December 9, 2010

Exalogic 1.0 is here!

General availability of Oracle's brand new Exalogic Elastic Cloud product has just been publicly announced.


Just in case you've somehow missed the buzz and haven't got a clue what Exalogic is, I'll describe it for you a little here...

Exalogic is an integrated hardware and software system that is engineered, tested, and tuned to run enterprise Java applications, as well as native applications, with an emphasis on high performance and high availability (HA). Exalogic incorporates redundant hardware with dense-computing resources, ZFS Shared Storage and InfiniBand networking. This hardware is sized and customised for optimum use by Oracle 'Application Grid' software products, to provide a balanced hardware/software system. Specifically, the WebLogic Application Server, JRockit Java Virtual Machine (JVM) and Coherence In-Memory Data Grid products have been enhanced to leverage some of the unique features of the underlying hardware, for maximum performance and HA.

If you are familiar with Exadata and it being the "database machine", then think of Exalogic as the "middleware machine". Physically linking the two together in a data-centre gives you the foundation for a very high-end Enterprise Java based OLTP solution.

Exalogic is a system rather than an appliance, where users are able to install, develop or run what they want as long as it is Linux/Solaris x86-64 compatible. Even though some of the elements of Exalogic, like InfiniBand, are more often found in the supercomputing world, Exalogic is intended as a general purpose system for running enterprise business applications. Exalogic will just appear to the hosted applications as a set of general purpose operating systems and processors with common standards-based networking. This means that, unlike in the supercomputing world, developers don't have to create bespoke software specifically tailored to run on a high-end proprietary platform.

For further information, see the Oracle introduction to Exalogic whitepaper.


Song for today: Whipping Song by Sister Double Happiness