Tuesday, June 14, 2016

The impact of JDBC Connection testing in a WLS Datasource



A few weeks ago, I performed a bit of sampling in production on several nodes of the business layer of one of my client's applications. The application went into production a little over a year ago and has been stable since then, but from time to time I still take a look at the distribution of response time across the business and technical packages of the application.

This time I used djigger's reverse tree view to take a look at the "top of the stacks" to see if I could find any more potential low-level optimizations.

Since the business layer's implementation is based on EJBs running in a WebLogic container, I used WebLogic's automatic skeleton naming ("WLSkel") to filter out the noise of the other threads. djigger's "Stacktrace filter" is really good at that. By excluding the irrelevant threads in the JVM, I could ensure that my global statistics, and hence my analysis, would be correct.

As I wanted to perform a global analysis and get a global overview of what the runtime was doing, I didn't need to further filter on specific threads or transactions.


Tip: As you start unfolding the different nodes in the tree (which is sorted by top consumer / number of occurrences), the top part of the stack tells you what a thread is currently doing, while the bottom of the stack tells you why, or for whom, it is doing it.

After sampling for a few minutes, I expanded a few nodes starting from the first method: socketRead (to be interpreted as time spent waiting for a reply on the network) and noticed several occurrences of stacks originating from the testing of JDBC Connection objects, which usually takes place prior to executing SQL code against the Oracle database. What WebLogic's datasource does for you here is test the underlying connection to the database before handing it out, to make sure you're getting a valid connection.
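To make that concrete, here is a minimal sketch of what acquiring a connection from a WebLogic datasource typically looks like on the application side (the JNDI name, table and query are made up for illustration). The point is that, with test-on-reserve enabled, the validation happens inside getConnection(), before your own SQL ever runs.

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import javax.naming.InitialContext;
    import javax.sql.DataSource;

    public class CustomerDao {

        // "jdbc/AppXADS" is a placeholder JNDI name for this example.
        public String loadCustomerName(long id) throws Exception {
            DataSource ds = (DataSource) new InitialContext().lookup("jdbc/AppXADS");

            // With "Test Connections On Reserve" enabled, WebLogic validates the
            // pooled connection here, typically by firing the configured test
            // query (e.g. SELECT 1 FROM DUAL), before handing it to the caller.
            try (Connection con = ds.getConnection();
                 PreparedStatement ps = con.prepareStatement(
                         "SELECT name FROM customer WHERE id = ?")) {
                ps.setLong(1, id);
                try (ResultSet rs = ps.executeQuery()) {
                    return rs.next() ? rs.getString(1) : null;
                }
            }
        }
    }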




So I decided to measure just how much time was spent in those testing stacks on average, regardless of which business method is being called. That can be done by using the "Node Filter", which cleans up the tree and implicitly aggregates matching orphan nodes. That's perfect for adding up the time spent in branches of the tree that come from different initial paths.

Depending on the environment and the scenario executed on it (test, integration and production), I found the overhead to stand between 10 and 25% (I believe these screenshots are based on data collected while reproducing the problem via automated testing in a test environment, hence the 15% mark; the 25% mark was reached in production).



Traditionally, a developer will manage some sort of "scope" or "session" while working with the data layer (whether explicitly or implicitly, depending on whether the transaction is attached to the container's and whether they're using an O/R mapper such as Hibernate or OpenJPA) in which the DB connection object remains relevant, reducing the frequency of connection acquisition and hence of tests.
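As a rough sketch of that scoped pattern (the names are hypothetical, and a container-managed transaction or Hibernate session would usually do this for you): the connection is reserved once per unit of work and reused for every statement inside it, so the pool test is paid only once.

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import javax.sql.DataSource;

    public class OrderService {

        private final DataSource ds;

        public OrderService(DataSource ds) { this.ds = ds; }

        // One reservation per business operation: a single pool test,
        // no matter how many statements run inside the scope.
        public void shipOrder(long orderId) throws Exception {
            try (Connection con = ds.getConnection()) {
                con.setAutoCommit(false);
                try (PreparedStatement ps = con.prepareStatement(
                        "UPDATE orders SET status = 'SHIPPED' WHERE id = ?")) {
                    ps.setLong(1, orderId);
                    ps.executeUpdate();
                }
                try (PreparedStatement ps = con.prepareStatement(
                        "INSERT INTO order_audit (order_id) VALUES (?)")) {
                    ps.setLong(1, orderId);
                    ps.executeUpdate();
                }
                con.commit();
            }
        }
    }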

People try to make sure - and rightfully so - that they don't keep a pooled object in use for too long to avoid contention on the underlying resource.

However, in the event that you're acquiring and releasing these objects too frequently, connection testing can induce significant overhead. Knowing that under "normal" circumstances (no network or DB failure) DB connections remain valid for pretty much the entire lifetime of the runtime, this becomes a bad investment.

In my case, the unorthodox application architecture causes those tests to happen extremely frequently. Not only do we have very fine granularity in the design of the service layer, which leads to hundreds of RMI/EJB calls per user action, but it seems the developers may have added an additional abstraction for managing DB sessions, which can cause multiple connection objects to be acquired within a single EJB transaction.
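Here is a hedged sketch of the kind of pattern I mean (the helper names are hypothetical, not the application's actual code): each helper reserves and releases its own connection, so one logical business call pays the connection test several times over.

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import javax.sql.DataSource;

    public class CustomerFacade {

        private final DataSource ds;

        public CustomerFacade(DataSource ds) { this.ds = ds; }

        // Anti-pattern: three helpers, three getConnection() calls, and thus
        // three pool tests within what is logically one transaction.
        public void renameCustomer(long id, String name) throws Exception {
            updateName(id, name);
            touchLastModified(id);
            writeAuditRecord(id);
        }

        private void updateName(long id, String name) throws Exception {
            try (Connection con = ds.getConnection();     // test #1
                 PreparedStatement ps = con.prepareStatement(
                         "UPDATE customer SET name = ? WHERE id = ?")) {
                ps.setString(1, name);
                ps.setLong(2, id);
                ps.executeUpdate();
            } // connection goes straight back to the pool
        }

        private void touchLastModified(long id) throws Exception {
            try (Connection con = ds.getConnection()) {   // test #2
                // UPDATE omitted for brevity
            }
        }

        private void writeAuditRecord(long id) throws Exception {
            try (Connection con = ds.getConnection()) {   // test #3
                // INSERT omitted for brevity
            }
        }
    }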

One last factor adding overhead here is the fact that we're dealing with an XA datasource. With a regular datasource, usually, a simple "SELECT 1 FROM DUAL" will be fired, which has virtually no impact on the DB and will essentially just cost you a round trip over the network. Here, with XA, each of these test queries will result in the creation and management of a distributed transaction, which will in turn cause the execution of more complex code along with multiple round trips to the database (see begin, end, commit semantics).
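To give a rough idea of what that means at the JDBC level, here is a simplified sketch using the raw XAResource API with a dummy Xid (in reality WebLogic's transaction manager drives these calls; this is purely illustrative). Even a trivial test query ends up bracketed by XA start/end/commit calls, each of which may mean extra work and round trips on the database side.

    import java.sql.Connection;
    import java.sql.Statement;
    import javax.sql.XAConnection;
    import javax.sql.XADataSource;
    import javax.transaction.xa.XAResource;
    import javax.transaction.xa.Xid;

    public class XaTestCost {

        // Dummy Xid for illustration; the transaction manager normally builds these.
        static class DemoXid implements Xid {
            public int getFormatId() { return 0x1234; }
            public byte[] getGlobalTransactionId() { return new byte[] { 1 }; }
            public byte[] getBranchQualifier() { return new byte[] { 1 }; }
        }

        static void testConnection(XADataSource xaDs) throws Exception {
            XAConnection xaCon = xaDs.getXAConnection();
            try {
                XAResource xaRes = xaCon.getXAResource();
                Connection con = xaCon.getConnection();
                Statement st = con.createStatement();
                Xid xid = new DemoXid();

                xaRes.start(xid, XAResource.TMNOFLAGS);   // enlist a transaction branch
                st.execute("SELECT 1 FROM DUAL");         // the actual test query
                xaRes.end(xid, XAResource.TMSUCCESS);     // delist the branch
                xaRes.commit(xid, true);                  // one-phase commit: yet another round trip

                st.close();
                con.close();
            } finally {
                xaCon.close();
            }
        }
    }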

So, just to make sure that this juicy piece of response time could actually be spared entirely, we tried turning off connection testing and ran the same automated test scenario twice. Here are the results.

PS: I'm back on the regular tree view in this screenshot, and I deliberately excluded all of the non-WebLogic packages, which is why you're seeing a direct call from the RMI package to the JDBC package. It's a simplification, of course.


The modification did spare the entire 13% of stacks spent in WebLogic's test method.

The two stack samples found in the "testing disabled" case actually came from an LDAP datasource for which testing was still enabled, so they can be excluded.

However, an important robustness feature we'd lose with this is automatic recovery: when the database crashes or happens to be restarted, our business layer can't recover on its own once the DB is back up. And that's somewhat problematic for our ops guys.

So now we're fiddling with WebLogic's "Seconds to Trust an Idle Pool Connection" and "Test Frequency" parameters, which should eventually help solve this problem, since they were designed specifically for that kind of scenario. See the screenshot below, and the descriptor sketch after it.


(the WebLogic console insists on showing up in French on my computer, sorry about that...)
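Independently of the console, here is a hedged sketch of how the same settings look in a jdbc-data-source module descriptor (the datasource name and the values are placeholders, not our actual configuration):

    <jdbc-data-source xmlns="http://xmlns.oracle.com/weblogic/jdbc-data-source">
      <name>AppXADS</name>
      <jdbc-connection-pool-params>
        <test-table-name>SQL SELECT 1 FROM DUAL</test-table-name>
        <test-connections-on-reserve>true</test-connections-on-reserve>
        <!-- background testing of idle connections every 120 seconds -->
        <test-frequency-seconds>120</test-frequency-seconds>
        <!-- skip the reserve test if the connection was used successfully within the last 10 seconds -->
        <seconds-to-trust-an-idle-pool-connection>10</seconds-to-trust-an-idle-pool-connection>
      </jdbc-connection-pool-params>
    </jdbc-data-source>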




I haven't figured out the proper interval to use just yet, and I need to do some more testing, so this is still a work in progress, but I thought it had interesting potential and was worth sharing.

Signing off from Le Bonhomme.

Dorian

