I realised software tools would help us to test Kafka in two key dimensions: performance and robustness. I started out knowing little about the tools or their capabilities, although I was aware of jmeter, which I’d used briefly over a decade ago.
My initial aim was to find a way we could generate and consume load. This load would then be the background for experimenting with robustness testing to see how well the systems and the data replication would work and cope in inclement conditions. By inclement I mean varying degrees of adverse conditions such as poor network conditions through to compound error conditions where the ‘wrong’ nodes were taken down during an upgrade while the system was trying to catch up from a backlog of transactions, etc. I came up with a concept of the Beaufort Scale of environment conditions which I’m writing about separately.
Robustness testing tools
My research very quickly led me to jepsen, the work of Kyle Kingsbury. He has some excellent videos available https://www.youtube.com/watch?v=NsI51Mo6r3o and https://www.youtube.com/watch?v=tpbNTEYE9NQ together with opensource software tools and various articles https://aphyr.com/posts on testing various technologies https://jepsen.io/analyses including Zookeeper and Kafka. Jay Kreps provided his perspective on the testing Kyle did http://blog.empathybox.com/post/62279088548/a-few-notes-on-kafka-and-jepsen and Kyle’s work has helped guide various improvements to Kafka.
So far, so good…
However, when I started investigating the practical aspects of using jepsen to test newer versions of Kafka I ran into a couple of significant challenges, for me at least. I couldn’t find the original scripts, and the ones I found https://github.com/gator1/jepsen/tree/master/kafka were in Clojure, an unfamiliar language, and for an older version of Kafka (0.10.2.0). More importantly, they relied on docker. While docker is an incredibly powerful and fast tool, the client wasn’t willing to trust tests run in docker environments (also our environment needed at least 20 instances to test the smallest configuration).
The next project to try was https://github.com/mbsimonovic/jepsen-python, written in Python, a language I and other team members knew sufficiently well to try. We again ran into the blocker that it used docker. However, there seemed to be some potential to test clusters if we could get one of its dependencies, Blockade, to support docker swarm. I asked what it would take to add support for docker swarm; quite a lot according to one of the project team https://github.com/worstcase/blockade/issues/67.
By this point we needed to move on and focus on testing the performance, scalability and latency of Kafka, including inter-regional data replication, so we had to pause our search for automated tools or frameworks to control the robustness aspects of the testing.
Performance testing tools
For now I’ll try to keep focused on the tools rather than trying to define performance testing vs load testing, etc. as there are many disagreements on what distinguishes these terms and other related terms. We knew we needed ways to generate traffic patterns that ranged from simple text to ones that closely resembled the expected production traffic profiles.
Over the period of the project we discovered at least 5 candidates for generating load:
- kafkameter: https://github.com/BrightTag/kafkameter
- pepper-box: https://github.com/GSLabDev/pepper-box
- kafka tools: https://github.com/apache/kafka
- sangrenel: https://github.com/jamiealquiza/sangrenel
- ducktape: https://github.com/confluentinc/ducktape
Both kafkameter and pepper-box integrated with jmeter. Of these pepper-box was newer, and inspired by kafkameter. Also kafkameter would clearly need significant work to suit our needs so we started experimenting with pepper-box. We soon forked the project so we could easily experiment and add functionality without disturbing the parent project or delaying the testing we needed to do by waiting for approvals, etc.
pepper-box and jmeter
Working with jmeter to test a non-web protocol, i.e. Kafka, ended up taking significant effort and time: we had to learn about various aspects of writing the code, configuring jmeter and running the tests.
Thankfully as there were both kafkameter and pepper-box we were able to learn lots from various articles as well as the source code. Key articles include:
The blazemeter article even included an example consumer script written in Groovy. We ended up extending this script significantly and making it available as part of our fork of pepper-box (since it didn’t seem sensible to create a separate project for this script) https://github.com/commercetest/pepper-box/blob/master/src/groovyscripts/kafka-consumer-timestamp.groovy
As ever there was lots of other reading and experimentation to be able to reliably and consistently develop the jmeter plugins. Lowlights included needing to convince both the machine and maven that we actually needed to use Java 8.
A key facet of the work was adding support to both the producer and consumer code so it could be used with clusters configured with and without security, in particular SASL_SSL. The code is relatively easy to write but debugging issues with it was very time-consuming. We had to test in a variety of environments, each with a different configuration, and none of the team had experience of configuring Kafka with SASL_SSL before the project started.
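As a rough illustration of the kind of client settings involved (this is not the code from our fork), here is a minimal sketch that builds Kafka client properties for a cluster with or without SASL_SSL. The property keys are standard Kafka client settings; the class name, method name, the choice of the SASL/PLAIN mechanism, and all the example values are assumptions made for the sketch.

```java
import java.util.Properties;

// Hypothetical helper for illustration only; names and values are assumptions.
final class KafkaSecuritySketch {

    // Builds client properties usable by either a producer or a consumer.
    static Properties clientProperties(String bootstrapServers, boolean useSaslSsl,
                                       String username, String password,
                                       String truststorePath, String truststorePassword) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", bootstrapServers);
        if (!useSaslSsl) {
            // Without an explicit security.protocol the Kafka client uses PLAINTEXT.
            return props;
        }
        props.setProperty("security.protocol", "SASL_SSL");
        // Assumption: the cluster uses the PLAIN mechanism; yours may use SCRAM or GSSAPI.
        props.setProperty("sasl.mechanism", "PLAIN");
        props.setProperty("sasl.jaas.config",
                "org.apache.kafka.common.security.plain.PlainLoginModule required "
                        + "username=\"" + username + "\" password=\"" + password + "\";");
        // The truststore holds the certificate(s) used to verify the brokers.
        props.setProperty("ssl.truststore.location", truststorePath);
        props.setProperty("ssl.truststore.password", truststorePassword);
        return props;
    }

    public static void main(String[] args) {
        // Placeholder values; print only the keys to avoid echoing secrets.
        Properties props = clientProperties("broker1:9093", true,
                "tester", "secret", "/path/to/truststore.jks", "changeit");
        props.stringPropertyNames().stream().sorted().forEach(System.out::println);
    }
}
```

Keeping the secured and unsecured cases in one place like this makes it easier to run the same test against differently configured environments by changing only the inputs.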
We ran into multiple issues related to the environments and getting the Kafka clusters to stay healthy and the replication to happen without major delays. I may be able to cover some of the details in subsequent articles. We also realised that the pepper-box java sampler (as they’re called in jmeter terminology) used lots of CPU and we needed to start running load generators and consumers in parallel.
We eventually discovered the combination of jmeter and the pepper-box sampler was maxing out and unable to generate the loads we wanted to create to test various aspects of the performance and latency. Thankfully the original creators of pepper-box had provided a standalone load generation utility which was able to generate significantly higher loads. We had to trade off the extra performance against the various capabilities of jmeter and the many plugins that have been developed for jmeter over the years. For example, we’d have to manage synchronisation of load generators on multiple machines ourselves, and so on.
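One simple way to approximate synchronised load generators without jmeter is to have each machine wait for an agreed wall-clock start time. The sketch below is an assumption about how this could be done, not a description of what we built; it relies on the machines' clocks being reasonably well synchronised (for example via NTP), and the class and method names are hypothetical.

```java
import java.time.Instant;

// Hypothetical sketch: every load generator is given the same start instant.
final class CoordinatedStartSketch {

    // Milliseconds to wait before starting; zero if the start time has passed.
    static long delayMillis(Instant agreedStart, Instant now) {
        long delta = agreedStart.toEpochMilli() - now.toEpochMilli();
        return Math.max(0L, delta);
    }

    // Sleeps until the agreed start time, then runs the load generator.
    static void runAt(Instant agreedStart, Runnable loadGenerator) throws InterruptedException {
        Thread.sleep(delayMillis(agreedStart, Instant.now()));
        loadGenerator.run();
    }

    public static void main(String[] args) throws InterruptedException {
        // Pass an ISO-8601 instant such as 2030-01-01T10:00:00Z; default is 2 seconds from now.
        Instant start = args.length > 0 ? Instant.parse(args[0]) : Instant.now().plusSeconds(2);
        runAt(start, () -> System.out.println("load generation would start now"));
    }
}
```

Each machine would be launched with the same start instant, so the accuracy of the synchronisation is limited by clock drift between the machines.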
The next challenge was to decide whether to develop an equivalent standalone consumer ourselves. In the end we did, partly as jmeter had lost credibility with the client so it wasn’t viable to continue using the current consumer.
Developing a pepper-box consumer
Things I wish we’d had more time for
What I’d explore next
Here are topics I’d like to cover in future articles:
- Managing and scaling the testing, particularly how to run many tests and keep track of the results, while keeping the environments healthy and clean.
- Designing the producers and consumers so we could measure throughput and latency using the results collected by the consumer code.
- Tradeoffs between fidelity of messages and the overhead of increased fidelity (hi-fi costs more than lo-fi at runtime).
- Some of the many headwinds we faced in establishing trustworthy, reliable test environments and simply performing the testing.
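To give a flavour of the measurement topic in the list above, here is a minimal sketch of summarising latency and throughput from what a consumer records. It assumes each message carries the producer's send time in epoch milliseconds and that the consumer notes its own receive time; the class name, method names and sample numbers are hypothetical. Latency measured this way is only as trustworthy as the clock synchronisation between the producer and consumer machines.

```java
import java.util.Arrays;

// Hypothetical sketch of summarising results collected by a consumer.
final class LatencySummarySketch {

    // Per-message latency: consumer receive time minus producer send time.
    static long[] latencies(long[] producedAtMs, long[] consumedAtMs) {
        if (producedAtMs.length != consumedAtMs.length) {
            throw new IllegalArgumentException("arrays must be the same length");
        }
        long[] result = new long[producedAtMs.length];
        for (int i = 0; i < result.length; i++) {
            result[i] = consumedAtMs[i] - producedAtMs[i];
        }
        return result;
    }

    // Nearest-rank percentile (pct in the range 0 < pct <= 100) over a non-empty array.
    static long percentile(long[] latenciesMs, double pct) {
        long[] sorted = latenciesMs.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(pct / 100.0 * sorted.length);
        return sorted[Math.max(rank, 1) - 1];
    }

    // Messages per second over a measurement window of windowMs milliseconds.
    static double throughputPerSecond(long messageCount, long windowMs) {
        if (windowMs <= 0) {
            throw new IllegalArgumentException("window must be positive");
        }
        return messageCount * 1000.0 / windowMs;
    }

    public static void main(String[] args) {
        // Made-up sample timestamps, purely to show the calls.
        long[] produced = {1_000, 1_010, 1_020, 1_030};
        long[] consumed = {1_015, 1_040, 1_032, 1_090};
        long[] lat = latencies(produced, consumed);
        System.out.println("median latency ms: " + percentile(lat, 50));
        System.out.println("p95 latency ms: " + percentile(lat, 95));
        System.out.println("msgs/sec: " + throughputPerSecond(lat.length, 90));
    }
}
```

Reporting percentiles rather than only an average helps expose the occasional very slow message, which is often what matters when replication falls behind.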