Saturday, February 4, 2017

Why fine-grained scripts matter

a.k.a. load-testing with a complex client stack


One of our users was working on a project where he needed to load-test heavily secured webservices. He was the first person to run into the bottleneck that I described in the first entry of my series on achieving scale via iterative analysis. The target load he needed to reach was approximately 1000 tps, and his project required that the client-side code used in the test harness be custom-written in Java. He couldn't use our SoapUI plugin or any other traditional HTTP/S testing tool. Let's see why.

The first reason was that this was a Test as a Service scenario where our user was not the developer of the service and had no access to the code base. He had only been sent a few jar files, certificates, config files and the endpoint URL. His time budget was limited, so he needed to implement his test client in the same language as the client-side libraries (i.e., Java). After fighting a bit with the different security libraries, he got his test client code to work. He then managed to package it as a library and went on to expose its functionality as step keywords via annotations.
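As an illustration of that last step, here is a minimal sketch of such a keyword wrapper. It assumes step's Java keyword API (an AbstractKeyword subclass with @Keyword-annotated methods); the SecuredServiceClient class is a placeholder for the libraries he was sent, not their actual API.

    import step.handlers.javahandler.AbstractKeyword;
    import step.handlers.javahandler.Keyword;

    public class SecuredServiceKeywords extends AbstractKeyword {

        // Exposes a call through the vendor-supplied client library as a step
        // keyword; input carries the keyword's parameters, output its results.
        @Keyword(name = "CallSecuredService")
        public void callSecuredService() {
            SecuredServiceClient client = new SecuredServiceClient(input.getString("endpoint"));
            output.add("response", client.call(input.getString("payload")));
        }
    }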

The other reason he wanted to use the client libraries shipped by the team responsible for the service was that this would make for much more relevant test results (i.e., more realistic measurements). Indeed, if you've load-tested complex distributed systems before, and more specifically their RPC interfaces / endpoints, then you know that a lot of problems can lie inside client code. Whether it's distributed memory leaks, cache issues or other performance issues resulting from the incorrect use of business objects on the client side, there are many reminders that the devil is in the details and that load tests need to be as complete and "end to end" as possible to be relevant. As part of simulating conditions as close as possible to those of production, including the actual client-side libraries can be crucial.

step's support for custom Java keywords was a nice fit for this use case; there is no realistic scenario in which he could have built this test harness with standard HTTP tools such as SoapUI, or at least not within the time budget he had.

Another interesting result of this approach is that this user was in full control of his test scenario and measurement points. step's keyword API and test plan semantics let you build and orchestrate the execution of code blocks at exactly the level of granularity your test scenario requires.

Applied to the test plan in our webservice example, this is what it looked like:


As you can see, highlighted in yellow, calling the webservice is a 2-step process (as is usually the case with most RPC mechanisms). The first step is about initialization and instantiation. It is assumed to be done only once (once per what depends on the technical context), and the resulting objects are then expected to be cached and reused many times. This initialization phase is often costly for both the client and the server, as it involves building or loading many complex objects and classes, such as the webservice "port" object, authentication tokens or other framework components. The second step is usually much cheaper: its duration depends almost entirely on the provider-side business code and suffers from far less protocol-induced overhead.
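In JAX-WS terms, for instance, the two steps might look like the following. This is only a sketch: the endpoint, service name and the OrderPort stub are hypothetical placeholders, not this project's actual client.

    import java.net.URL;
    import javax.xml.namespace.QName;
    import javax.xml.ws.Service;

    public class TwoStepCall {
        public static void main(String[] args) throws Exception {
            // Step 1: costly initialization - build the service and port once,
            // then cache the port for reuse.
            URL wsdl = new URL("https://host.example/orders?wsdl");
            QName serviceName = new QName("http://example.org/orders", "OrderService");
            Service service = Service.create(wsdl, serviceName);
            OrderPort port = service.getPort(OrderPort.class); // OrderPort: hypothetical generated stub

            // Step 2: the business call - cheap, reuses the cached port.
            String status = port.getOrderStatus("42");
            System.out.println(status);
        }
    }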

In the context of our user, the initialization of the webservice client took 1 second on average, but the client could then be cached, and each subsequent service call lasted only 20 ms. He hadn't anticipated this aspect and initially ran a few tests with a more monolithic version of his keyword (essentially tying the initialization and business calls together).

Not only did this produce unrealistic response time measurements, potentially misguiding project stakeholders in the end, but it also caused the server to become unstable and the agents to use much more CPU than they should have.

In production, applications were expected to cache the webservice port, so the only truly representative load test harness would be one in which the load is generated primarily through concurrent executions of step 2, not step 1. We solved this rather elegantly by splitting the keyword in two and, as you can see in the test plan, isolating the initialization step before entering the ForEach block in which we iterate over a datapool to simulate different business calls (with different input data).
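In keyword code, the split might look like the following sketch, which assumes step's Java keyword API (@Keyword-annotated methods on an AbstractKeyword subclass, with a session object for caching things on the executing token); the SecuredServiceClient class again stands in for the actual client libraries.

    import step.handlers.javahandler.AbstractKeyword;
    import step.handlers.javahandler.Keyword;

    public class WebserviceKeywords extends AbstractKeyword {

        // Executed once, before the ForEach block: performs the ~1 s initialization
        // and caches the resulting client in the token's session for reuse.
        @Keyword(name = "InitializeClient")
        public void initializeClient() {
            SecuredServiceClient client = new SecuredServiceClient(input.getString("endpoint"));
            session.put("client", client);
        }

        // Executed once per datapool row inside the ForEach block: only the cheap
        // (~20 ms) business call, reusing the cached client.
        @Keyword(name = "CallService")
        public void callService() {
            SecuredServiceClient client = (SecuredServiceClient) session.get("client");
            output.add("response", client.call(input.getString("payload")));
        }
    }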

To illustrate this with a web workflow: it would be equivalent to making sure we simulate the expected number of HTTP Session objects by having each user log in only once in a while (after executing several functional scenarios), rather than recreating a new session at the end of each action or iteration. The impact on memory usage is significant and simply cannot be ignored if you intend to run meaningful performance tests and produce valid results.


Now, this test plan worked well in a single-threaded "baseline" scenario, but since step distributes its load heavily across both the agents and a given agent's execution units (tokens), there is no guarantee that each unit would already have a valid webservice client at its disposal. So we eventually tweaked our test plan a bit to check whether a client object was already present in our context, and to initialize it if needed.
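He implemented that check at the test plan level; the equivalent guard expressed inside the keyword itself would look roughly like this (same assumptions as the sketch above):

    // Lazy, per-token initialization: if this token hasn't built a client yet,
    // build it now and cache it; otherwise reuse the cached instance.
    @Keyword(name = "CallService")
    public void callService() {
        SecuredServiceClient client = (SecuredServiceClient) session.get("client");
        if (client == null) {
            client = new SecuredServiceClient(input.getString("endpoint"));
            session.put("client", client);
        }
        output.add("response", client.call(input.getString("payload")));
    }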

And so we were able to ensure that:
  • each business call could be made using a correctly instantiated webservice port
  • the initialization keyword would only be executed once per context (token)
  • the whole setup would work in step's highly concurrent, distributed, yet (by default) stateless context

Another way to do this would have been to use a FunctionGroup control to enter a stateful mode in which keywords are guaranteed to land on the same agent and agent unit (token). In a single-tenant context (only one user submitting tests to the platform), and without complex dispatch rules, stateful execution isn't a big deal, but it can become a source of problems in other situations. I also just find it good practice to refrain from introducing strong dependencies whenever I can.

Finally, we made sure to wrap the interesting blocks of code with separate timers using step's Measurement API (which feeds data directly to RTM for real-time analysis). This allowed us to understand exactly which parts of the code were costing the most execution time. It also allowed us to isolate backend time in a very fine-grained way (i.e., to distinguish between that time and the rest of the client-side time), thus providing the project stakeholders with highly relevant and detailed metrics upon which they could act if needed.
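As a rough sketch of what that wrapping can look like (assuming the keyword's output object exposes an addMeasure(name, durationInMs) method; the measure names and the buildPayload helper are purely illustrative):

    // Time two distinct blocks separately so that client-side overhead and
    // backend time show up as separate series in RTM.
    @Keyword(name = "CallServiceMeasured")
    public void callServiceMeasured() {
        SecuredServiceClient client = (SecuredServiceClient) session.get("client");

        long start = System.currentTimeMillis();
        String payload = buildPayload(input); // hypothetical client-side marshalling step
        output.addMeasure("client_marshalling", System.currentTimeMillis() - start);

        start = System.currentTimeMillis();
        String response = client.call(payload); // the actual backend round trip
        output.addMeasure("backend_call", System.currentTimeMillis() - start);

        output.add("response", response);
    }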
