Monday, January 15, 2007

Is Your Java Application FailoverProof (i.e., RAC Aware)?

Developing and Deploying FailoverProof Java/JDBC Applications in a RAC Environment Using Fast Connection Failover, TAF, and Runtime Connection Load Balancing.

Why in the world should a Java developer care about failover and what does it mean to be failoverproof?
Well, failure is inevitable. However, in mission-critical (i.e., web) deployments, all applications, including the Java ones, must survive resource manager (i.e., RDBMS) failure, connection failure, or transaction failure without disrupting the service.

How exactly?
For the sake of simplicity, let's take a JDBC program. Best practices mandate that Java/JDBC programs capture exceptions and deal with them; here is a skeleton of a failoverproof program using Oracle JDBC in a RAC environment:
...
try {
    conn = getConnection();
    // do some work
} catch (SQLException e) {
    handleSQLException(e);
}
...
void handleSQLException(SQLException e) {
    if (OracleConnectionCacheManager.isFatalConnectionError(e))
        ConnRetry = true; // Fatal connection error detected: retry getting a connection
}

Capturing SQL exceptions and retrying to get a connection are simply good JDBC programming practice, so the real burden does not sit at the Java application level (which has to remain somewhat portable) but at the driver or framework level. It is up to the driver, the O/R mapping framework, the servlet engine, or the Java EE container to furnish, under the covers, a failoverproof environment.
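
The retry that the skeleton above hints at can be sketched in a driver-neutral way. This is only a minimal sketch under stated assumptions: `FailoverRetry`, `Work`, `withRetry`, and `maxAttempts` are hypothetical names, and the `isFatal` predicate stands in for a check such as `OracleConnectionCacheManager.isFatalConnectionError(e)`. The simulated failure uses vendor code 3113 (ORA-03113, end-of-file on communication channel).

```java
import java.sql.SQLException;
import java.util.function.Predicate;

/** Driver-neutral retry sketch; every name here is illustrative. */
final class FailoverRetry {

    /** Hypothetical stand-in for "get a connection and do some work". */
    interface Work<T> {
        T run() throws SQLException;
    }

    /**
     * Runs the work, retrying only when isFatal reports a dead connection.
     * Non-fatal SQLExceptions are rethrown at once; the last fatal one is
     * rethrown when maxAttempts is exhausted.
     */
    static <T> T withRetry(Work<T> work, Predicate<SQLException> isFatal, int maxAttempts)
            throws SQLException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        SQLException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return work.run();
            } catch (SQLException e) {
                if (!isFatal.test(e)) {
                    throw e; // not a connection failure: let the caller decide
                }
                last = e;    // fatal connection error: try again
            }
        }
        throw last;
    }

    public static void main(String[] args) throws SQLException {
        int[] calls = {0};
        // Simulated work: the first call fails with ORA-03113, the second succeeds.
        String result = withRetry(
            () -> {
                if (calls[0]++ == 0) {
                    throw new SQLException("end-of-file on communication channel", "08000", 3113);
                }
                return "ok";
            },
            e -> e.getErrorCode() == 3113,
            3);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

With a failover-aware connection cache underneath, the second attempt would be handed a connection to a surviving instance; the application code itself stays portable.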

Are all drivers and Java frameworks failoverproof?
You wish! The reality is that very few JDBC drivers or Java frameworks furnish true/reliable connection or transaction failover mechanisms.

From a database access point of view, what does it take for a JDBC driver or a Java framework to be failoverproof?
First of all, a JDBC driver or a Java EE container by itself cannot furnish a complete failoverproof environment; more importantly, the resource manager (in this case the RDBMS) must be failoverproof as well. In the Oracle RDBMS case, instance/node failover as well as scalability is furnished by the RAC framework.

What is RAC?
An Oracle database is managed by a database instance, which is made of shared memory (a.k.a. the SGA) and a set of database server processes. A database is usually accessed and managed by a single instance. However, an Oracle database can also be concurrently accessed and managed by multiple instances, up to 64 nodes and beyond; this technology is known as Real Application Clusters (RAC).

How Does RAC Furnish Failover?
Starting with release 10g, RAC generates events that indicate the health or status of each RAC component, including SERVICE, SERVICE_MEMBER, DATABASE, INSTANCE, NODE, ASM, and SRV_PRECONNECT.
The possible statuses are: UP, DOWN, NOT_RESTARTING, PRECONN_UP, PRECONN_DOWN, and UNKNOWN.

Examples of events: "Instance1 UP", "Node2 Down".

RAC furnishes failover by design, in the sense that when a service/instance/node fails, a well-written application can be redirected to a surviving node/instance, provided it furnishes the same service and runs against the same database.

How Does JDBC Leverage RAC Failover?

The Oracle JDBC 10g drivers, and more specifically their connection cache (a.k.a. the Implicit Connection Cache), leverage RAC by subscribing to the following events and statuses (as described in the RAC documentation and in chapter 7 of my book):

  • Service Up: The connection pool starts establishing connections in small batches to the newly added service.
  • Instance (of Service) Up: The connection pool gradually releases idle connections associated with existing instances and reallocates these onto the new instance.
  • Instance (of Service) Down: The connections associated with the instance are aborted and cleaned up, leaving the connection pool with sound and valid connections.
  • Node Down: The connections associated with the instance are aborted and cleaned up, leaving the connection pool with good connections.
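
The reactions listed above can be illustrated with a toy pool that only tracks how many idle connections point at each instance. This is a sketch for intuition, not Oracle's implementation: `ToyEventDrivenPool` and its methods are hypothetical names, and the rebalancing happens in one step here, whereas the real cache releases and reallocates connections gradually.

```java
import java.util.Map;
import java.util.TreeMap;

/** Toy model of a connection cache reacting to RAC events; names are illustrative. */
final class ToyEventDrivenPool {
    // Idle connection count per instance name (hypothetical bookkeeping).
    private final Map<String, Integer> idle = new TreeMap<>();

    /** Records that 'count' idle connections were opened against an instance. */
    void open(String instance, int count) {
        idle.merge(instance, count, Integer::sum);
    }

    int idleOn(String instance) {
        return idle.getOrDefault(instance, 0);
    }

    /** Instance DOWN: drop every connection tied to the failed instance. */
    int onInstanceDown(String instance) {
        Integer dropped = idle.remove(instance);
        return dropped == null ? 0 : dropped;
    }

    /** Instance UP: release surplus idle connections and reallocate them to the new instance. */
    void onInstanceUp(String newInstance) {
        idle.putIfAbsent(newInstance, 0);
        int total = 0;
        for (int n : idle.values()) {
            total += n;
        }
        int share = total / idle.size(); // even share per instance, rounded down
        int moved = 0;
        for (Map.Entry<String, Integer> e : idle.entrySet()) {
            if (e.getKey().equals(newInstance)) {
                continue;
            }
            int surplus = e.getValue() - share;
            if (surplus > 0) {
                e.setValue(share);
                moved += surplus;
            }
        }
        idle.merge(newInstance, moved, Integer::sum);
    }

    public static void main(String[] args) {
        ToyEventDrivenPool pool = new ToyEventDrivenPool();
        pool.open("inst1", 6);
        pool.open("inst2", 6);
        pool.onInstanceUp("inst3");
        System.out.println("idle on inst3 after UP: " + pool.idleOn("inst3"));
        System.out.println("dropped on inst1 DOWN: " + pool.onInstanceDown("inst1"));
    }
}
```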

But to be reliable, these events must be propagated to interested parties as fast as possible, because the timeout mechanisms (tcp_keepalive, tcp_ip_interval, and so on) are unreliable and may take anywhere from a long time (tens of minutes) to an indefinite time to kick in.
Oracle furnishes ONS (Oracle Notification Service) and Advanced Queuing as predictable publish/subscribe notification mechanisms, which detect those events and propagate them quasi-instantaneously (sub-second) to the components that have subscribed to them.

Setting up JDBC for Failover

  1. Set up a multi-instance Oracle Database 10g RAC database (see RAC documentation).
  2. Virtualize the database host through a service name (see JDBC URL in chapter 7 of my book).
  3. Configure ONS on each RAC server node (see the RAC Administrator Guide or chapter 7 in my book).
  4. Configure ONS on each client node (10g Release 1) or use the simpler remote subscription (10g Release 2). Ensure the ons.jar file is in the CLASSPATH, then programmatically set the ONS configuration string for remote ONS subscription at the data source level (unfortunately this cannot yet be set through a system property):
       ods.setONSConfiguration("nodes=node1:4200,node2:4200");
     The Java virtual machine (JVM) in which the JDBC driver is running must have oracle.ons.oraclehome set to point to ORACLE_HOME:
       -Doracle.ons.oraclehome=
  5. Enable the Connection Cache and Fast Connection Failover through a system property:
       -Doracle.jdbc.FastConnectionFailover=true
     Alternatively, the Connection Cache and Fast Connection Failover can be enabled programmatically using OracleDataSource properties:
       ods.setConnectionCachingEnabled(true);
       ods.setFastConnectionFailoverEnabled(true);
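
To make steps 2 and 4 more concrete, the two strings a client needs (a service-name JDBC URL and the remote ONS configuration) can be assembled from a list of node names. This is a sketch under assumptions: `RacClientConfig` and its methods are hypothetical helpers, the listener port 1521 and the service name `myservice` are placeholders, and only the `nodes=node1:4200,node2:4200` form comes from the steps above. The Oracle calls appear only in comments so that the snippet runs without the driver on the classpath.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

/** Builds client-side RAC configuration strings; helper names are illustrative. */
final class RacClientConfig {

    /** Remote ONS configuration string, e.g. "nodes=node1:4200,node2:4200". */
    static String onsConfiguration(List<String> hosts, int onsPort) {
        return hosts.stream()
                .map(h -> h + ":" + onsPort)
                .collect(Collectors.joining(",", "nodes=", ""));
    }

    /** Thin-driver URL that addresses the database by service name, not by a single host. */
    static String serviceUrl(List<String> hosts, int listenerPort, String serviceName) {
        String addresses = hosts.stream()
                .map(h -> "(ADDRESS=(PROTOCOL=TCP)(HOST=" + h + ")(PORT=" + listenerPort + "))")
                .collect(Collectors.joining());
        return "jdbc:oracle:thin:@(DESCRIPTION=(ADDRESS_LIST=(LOAD_BALANCE=ON)" + addresses + ")"
                + "(CONNECT_DATA=(SERVICE_NAME=" + serviceName + ")))";
    }

    public static void main(String[] args) {
        List<String> nodes = Arrays.asList("node1", "node2");
        String url = serviceUrl(nodes, 1521, "myservice"); // port and service name are assumptions
        String ons = onsConfiguration(nodes, 4200);
        System.out.println(url);
        System.out.println(ons);
        // With the Oracle JDBC driver and ons.jar on the classpath, these would be applied as:
        //   OracleDataSource ods = new OracleDataSource();
        //   ods.setURL(url);
        //   ods.setConnectionCachingEnabled(true);
        //   ods.setFastConnectionFailoverEnabled(true);
        //   ods.setONSConfiguration(ons);
    }
}
```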

Oracle JDBC: Handling of DOWN Events (under the covers)
Upon notification of a Service Down event, a worker thread (one per pool instance) processes the event in two passes:

  • First pass: connections are marked as down first, to efficiently disable bad connections.
  • Second pass: connections that are marked as down are aborted and removed.

Note: active connections that may be in the middle of a transaction receive a SQLException instantly.
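
The two passes amount to a mark-and-sweep over the pool. The following is a minimal sketch of that idea only; `TwoPassDownHandler`, `PooledConn`, `markDown`, and `sweep` are hypothetical names and do not reflect the driver's internal classes.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Mark-and-sweep sketch of DOWN event handling; all names are illustrative. */
final class TwoPassDownHandler {

    /** A pooled connection tagged with the instance it was opened against. */
    static final class PooledConn {
        final String instance;
        boolean markedDown;
        PooledConn(String instance) { this.instance = instance; }
    }

    /** Pass 1: flag connections to the failed instance so they cannot be handed out again. */
    static int markDown(List<PooledConn> pool, String failedInstance) {
        int marked = 0;
        for (PooledConn c : pool) {
            if (c.instance.equals(failedInstance) && !c.markedDown) {
                c.markedDown = true;
                marked++;
            }
        }
        return marked;
    }

    /** Pass 2: remove everything flagged in pass 1 (a real pool would also abort the socket). */
    static int sweep(List<PooledConn> pool) {
        int removed = 0;
        for (Iterator<PooledConn> it = pool.iterator(); it.hasNext();) {
            if (it.next().markedDown) {
                it.remove();
                removed++;
            }
        }
        return removed;
    }

    public static void main(String[] args) {
        List<PooledConn> pool = new ArrayList<>();
        pool.add(new PooledConn("inst1"));
        pool.add(new PooledConn("inst1"));
        pool.add(new PooledConn("inst2"));
        System.out.println("marked: " + markDown(pool, "inst1"));
        System.out.println("removed: " + sweep(pool));
        System.out.println("remaining: " + pool.size());
    }
}
```

Marking first is cheap, so bad connections stop being handed out right away; the slower abort-and-remove work then happens in the second pass.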

Oracle JDBC: Handling of UP Events (under the covers)
A Service UP event causes connections to be load balanced across all active RAC instances. Connection creation depends on the Listener’s placement of connections. Starting with 10g Release 2, load balancing advisory events enable Runtime Connection Load Balancing (covered in chapter 7 of my book).

Object-relational mapping frameworks, as well as any Java EE container, may either leverage Oracle JDBC (bypassing their own connection pool) or subscribe directly to RAC events using the ONS APIs and process these themselves (i.e., handle connection retry). To my knowledge, only Oracle's Java EE container (OC4J) has integrated Fast Connection Failover and ONS at the datasource level.

How does Oracle JDBC Fast Connection Failover (FCF) compare with TAF?
Fast Connection Failover and TAF differ from each other in the following ways:

  1. Driver-type dependency: TAF is in fact an OCI failover mechanism exposed to Java through JDBC-OCI. FCF is driver-type independent (i.e., works for both JDBC-Thin and JDBC-OCI).
  2. Application-Level Connection Retries: FCF supports application-level connection retries (i.e., the application may retry the connection or rethrow the exception). TAF, on the other hand, retries the connection transparently at the OCI/Net level, out of the control of the application or Java framework.
  3. Integration with the Connection Cache: FCF is integrated with the Implicit Connection Cache and invalidates failed connections automatically in the cache. TAF on the other hand works on a per-connection basis at the network level; it does not notify the connection cache of failures.
  4. Load Balancing: unlike TAF, FCF and runtime connection load balancing (RCLB) support UP event load-balancing of connections and runtime distribution of work across active RAC instances.
  5. Transaction Management: FCF automatically rolls back in-flight transactions; TAF, on the other hand, requires the application to roll back the transaction and send an acknowledgment to TAF to proceed with the failover.
  6. Server-Side State: TAF does not protect or fail over code that has server-side state, such as Java or PL/SQL stored procedures; however, the application can register a callback function that is called upon failure to reestablish the session state.

// register TAF callback function “cbk”

((OracleConnection) conn).registerTAFCallback(cbk,msg);

Voila, you now have a Java platform with connection pool failover, on top of which you can code and deploy JDBC applications or Java EE components.

For more details, see chapter 7 of my book: http://db360.blogspot.com/2006/08/oracle-database-programming-using-java_01.html