Testing Kafka: How I learned stuff

When I started my assignment to test Kafka I realised I’d got vast range of topics to comprehend. During the assignment I made time to actively learn these topics together with any additional ones that emerged during the assignment such as AWS.

This blog post introduces the various topics. It won’t go into detail on any of them (I may cover some in other blog posts) instead I’ll focus on how I learned stuff as part of this project.

  • Kafka: this is perhaps obvious as a topic, however I needed to learn particular facets of Kafka related to its reliability, resilience, scalability, and find ways to monitor its behaviour. I also ended up learning how to write Kafka clients, implement and configure SASL_SSL security and how to configure it.
  • VMWare: VMWare Enterprise technologies would be used for some runtime environments. I hadn’t worked with VMWare for many years and decided to learn how to configure, monitor and run ESXi on several small yet sufficiently representative servers. This would enable me to both work in the client’s environment and also run additional tests independently and sooner than waiting for sufficient environments and VMs to be available on demand (corporates tend to move more slowly owing to internal processes and organisational structures).
  • How to ‘performance test’ Kafka: we had several ideas and had discovered Kafka includes a utility to ‘performance test’. We needed to understand how that utility generates messages and measured performance. Also it might not be the most suitable tool for the project’s needs.
  • Ways to degrade performance of the system and expose flaws in the system that would adversely affect the value of using Kafka for the project. Disconnecting network cables, killing processes, etc. are pretty easy to do provided one has direct access to the machines and the test environment. However, we needed to be able to introduce more complex fault conditions and also be able to inject faults remotely.

These were the ones I knew about at the start of the assignment. Several more emerged, including:

  • Creating test environments in AWS. This included creating inter-regional peering between VPCs, creating Kafka and Zookeeper clusters, and multiple load generator and load consumer instances to execute the various performance and load tests. While there are various ‘quickstarts’ including one for deploying Kafka clusters https://github.com/aws-quickstart/quickstart-confluent-kafka; in the end we had to create our own clusters, bastion hosts and VPCs instead. The quickstart scripts failed frequently and the environment then needed additional cleaning up after they had failed.
  • Jepsen and other similar Chaos generation tools. Jepsen tested Kafka several major versions ago https://aphyr.com/posts/293-jepsen-kafka the tools are available and opensource, but would they suit our environment and skills set?
  • Various opensource load generators, including 2 that integrated with jmeter, before we finally settled on modifying an opensource standalone load generator and writing a reciprocal load consumer.
  • Linux utilities: which we used extensively to work with the environments and the automated tests. Similarly we wrote utility scripts to monitor, clean up and reset clusters and environments after some of the larger volume load tests.
  • The nuances and effects of enabling topic auto-creation.
  • KIP’s (Kafka Improvement Proposals):
  • Reporting and Analysis: the client had particular expectations on what would be reported and how it would be presented. Some of the tools didn’t provide the results in sufficient granularity e.g. they only provided ‘averages’ and we needed to calibrate the tools so we could trust the numbers they emitted.

Note: The following is ordered by topic or concept rather than chronologically.

How I learned stuff related to testing Kafka

I knew I had a lot to learn from the outset. Therefore I decided to invest time and money even from before the contract was signed so I would be able to contribute immediately. Thankfully all the software was freely available and generally opensource (which meant we could read the code to help understand it and even modify it to help us with the work and the learning).

Most of the learning was steeped in practice, where I and my colleagues would try things out in various test environments and learn by doing, observing and experimenting. (I’ll mention some of the test environments here and may cover aspects in more detail in other blog posts.)

I discovered and for the first time appreciated the range, depth and value of paying for online courses. While they’re not necessarily as good as participating in a commercial training course with instructors in the room and available for immediate advice, the range, price and availability was a revelation and the financial cost of all the course I bought was less than £65 ($90 USD).

Reading was also key, there are lots of blog posts and articles online, including several key articles from people directly involved with developing and testing Kafka. I also searched for academic articles that had been peer-reviewed. I only found a couple –  a pity as well written academic research is an incredible complement to commercial and/or personal informal write-ups.

We spent lots of time reading source code and then modifying code we hoped would be suitable once it’d been adapted. Thankfully the client agreed we could contribute our non-confidential work in public and make it available under permissive opensource and creative commons licenses.


I first used udemy courses several years ago to try to learn about several technologies. At that time I didn’t get or get much value, however the frequently discounted prices were low enough that I didn’t mind too much. In contrast, this time I found udemy courses to be incredibly valuable and relevant. The richness, range, and depth of courses available on udemy is incredible, and there are enough good quality courses available on relevant topics (particularly on Kafka, AWS, and to a lesser extent VMWare) to be able to make rapid, practical progress in learning aspects of these key topics.

I really appreciated being able to watch not only several introductory videos but also examples of more advanced topics from each of the the potential matches I’d found. Watching these, which is free-of-charge, doesn’t take very long per course and enabled me to get a good feel for whether the presenter’s approach and material would be worthwhile to me given what I knew and what I wanted to achieve.

I took the approach of paying for a course if I thought I’d learn at least a couple of specific items from that course. The cost is tiny compared to the potential value of the knowledge they unlock in my awareness and understanding.

Sometimes even the courses that seemed poorly done helped me to understand where concepts could be easily confused and/or poorly understood. If the presenter was confused – perhaps I would be too 🙂 That said, the most value for me came from the following courses which were particularly relevant for me:

The first three courses are led by the same presenter, Stephane Maarek. He was particularly engaging. He was also helpful and responsive when I send him questions via udemy’s platform.

Published articles and blog posts

I won’t list the articles or blog posts here. There are too many and I doubt a plethora of links would help you much. In terms of learning, some of the key challenges were in determining whether the articles were relevant to what I wanted to achieve with the versions of Kafka we were testing. For instance, many of the articles written before Kafka version 0.10 weren’t very relevant any more and reproducing tests and examples were sometimes too time-consuming to justify the time needed.

Also the way the project wanted to use Kafka was seldom covered and we discovered that some of the key configuration settings we needed to use vastly changed the behaviour of Kafka which again meant many of the articles and blog posts didn’t apply directly.

I used a software tool called Zotero to mange my notes and references (I have been using for several years as part of my PhD research) and have over 100 identified articles recorded there. I read many more articles during the assignment, perhaps as many as 1,000.

Academic research

The best article I found compares Kafka and RabbitMQ in an industrial research settings. There are several revisions available. The peer-reviewed article can be found at https://doi.org/10.1145/3093742.3093908 however you may need to pay for this edition unless you have a suitable subscription. The latest revision seems to be https://arxiv.org/abs/1709.00333v1 which is free to download and read.

Test environments

Here I’ll be brief. I plan to cover test environments in depth later on. Our test environments ranged from clusters of Raspberry Pi’s (replicating using MirrorMaker, etc), Docker containers, inexpensive physical rack-mount servers running ESXi, and several AWS environments. Both Docker and AWS are frequently referenced, for instance the udemy kafka course I mentioned earlier used AWS as their machines.