Presentation at Kafka Summit London April 2018

I presented ‘Testing the Beast’ at the Kafka Summit in London on 24th April 2018. The conference was an excellent venue to meet some of the many people who are passionate about, and experienced in, working with Kafka at scale. I learned a great deal from various speakers & hope to apply some of what I learned to any future work I do with Kafka.

The conference will post the slides and a recording of my session (together with all the other sessions), probably before the end of May 2018.

Here are my slides in PowerPoint format: Testing The Beast (Kafka) 23 Apr 2018

BTW: when I used PowerPoint to create a PDF it ballooned into a file of over 300MB so I’ve left that for the conference organisers to sort and make available.


Working with pepper-box to test Kafka


We needed to do performance testing of multi-regional Kafka Clusters and ended up using pepper-box for most of our work. During the assignment we first had to understand pepper-box, then use it, then extend and enhance its capabilities. Here is an overview of what we did with pepper-box. As we published our code and other related materials on github you can see more of the details there.

Pepper-box and jmeter

In order to use pepper-box we first needed to understand the fundamentals of jmeter. Using jmeter to test a non-web protocol, i.e. Kafka, took significant effort and time, as we had to learn about various aspects of compiling the code, configuring jmeter, and running the tests.

Thankfully, as both kafkameter and pepper-box existed, we were able to learn lots from various articles as well as from the source code.

A blazemeter article even included an example consumer script written in a programming language called Groovy. We ended up extending this script significantly and making it available as part of our fork of pepper-box (since it didn’t seem sensible to create a separate project for this script).

As ever there was lots of other reading and experimentation to be able to reliably and consistently develop the jmeter plugins. Lowlights included needing to convince both the machine and maven that we actually needed to use Java 8.

Extending pepper-box to support security protocols

Business requirements mandated that data would be secured throughout the system. Kafka supports various security mechanisms. SASL enabled nodes to authenticate themselves to Kafka instances. Connections were secured using what’s known as SSL, although the security is actually provided by its successor, TLS.

A key facet of the work was adding support to both the producer and consumer code so they could be used with clusters configured with and without security, in particular SASL_SSL. The code is relatively easy to write but debugging issues with it was very time-consuming, especially as we had to test in a variety of environments, each with a different configuration, and none of the team had prior experience of configuring Kafka with SASL_SSL.
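As a rough illustration of supporting clusters with and without security, here is a minimal sketch in Python that builds the client properties for either case. The property keys are standard Kafka client settings; the function name, its parameters, and the choice of the PLAIN SASL mechanism are assumptions made for illustration and are not taken from our pepper-box changes.

```python
def build_client_config(bootstrap_servers, secure=False, username=None,
                        password=None, truststore_path=None,
                        truststore_password=None):
    """Return Kafka client properties as a dict (hypothetical helper)."""
    config = {"bootstrap.servers": bootstrap_servers}
    if not secure:
        # Unsecured cluster: plain TCP, no authentication.
        config["security.protocol"] = "PLAINTEXT"
        return config

    # Fail early and clearly when a secured cluster is requested
    # but some of the required settings are absent.
    required = {
        "username": username,
        "password": password,
        "truststore_path": truststore_path,
        "truststore_password": truststore_password,
    }
    missing = [name for name, value in required.items() if not value]
    if missing:
        raise ValueError("missing settings for SASL_SSL: " + ", ".join(missing))

    config.update({
        "security.protocol": "SASL_SSL",
        # Assumption: PLAIN is used here only as an example mechanism.
        "sasl.mechanism": "PLAIN",
        "sasl.jaas.config": (
            "org.apache.kafka.common.security.plain.PlainLoginModule required "
            f'username="{username}" password="{password}";'
        ),
        "ssl.truststore.location": truststore_path,
        "ssl.truststore.password": truststore_password,
    })
    return config
```

Keeping both variants behind one function means the same test can be pointed at a secured or an unsecured cluster by changing a single flag, and a missing setting fails immediately rather than surfacing later as an obscure connection error.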

We ran into multiple issues related to the environments, to getting the Kafka clusters to stay healthy, and to getting the replication to happen without major delays. I may be able to cover some of the details in subsequent articles. We also realised that the pepper-box java sampler (as they’re called in jmeter terminology) used lots of CPU and that we needed to start running load generators and consumers in parallel.

Standalone pepper-box

We eventually discovered the combination of jmeter and the pepper-box sampler was maxing out and unable to generate the loads we wanted in order to test various aspects of the performance and latency. Thankfully the original creators of pepper-box had provided a standalone load generation utility which was able to generate significantly higher loads. We had to trade off the extra performance against the various capabilities of jmeter and the many plugins that have been developed for jmeter over the years. For instance, we’d have to manage synchronisation of load generators on multiple machines ourselves.

The next challenge was to decide whether to develop an equivalent standalone consumer ourselves. In the end we did, partly as jmeter had lost credibility with the client so it wasn’t viable to continue using the current consumer.

Developing a pepper-box consumer

The jmeter-groovy-consumer wasn’t maxing out, however it made little sense to run dissimilar approaches (a standalone producer written in Java combined with jmeter + Groovy), which also added to the overall complexity of the more involved tests. Therefore we decided to create a consumer modelled on the standalone producer. We didn’t end up adding rate-limiting as it didn’t really suit the testing we were doing; otherwise they’re fairly complementary tools. The producer sends a known message format which is parsed by the consumer, which calculates latency and writes the output to a csv file per topic. The consumer polls for messages using default values (e.g. a limit of 500 messages per poll request). These could easily be adapted with further tweaks and improvements to the code.

Using polling leads to a couple of key effects:

  1. It uses less CPU. Potentially several Consumers can be run on the same machine to process messages from several Producers (Generators) running across a bank of machines.
  2. The granularity of the timing calculations is constrained by the polling interval.
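To make the latency calculation and the effect of polling more concrete, here is a minimal sketch in Python. The message format (‘message_id,sent_epoch_ms’), the function names and the csv columns are all hypothetical and chosen for illustration; they are not the actual pepper-box format. Every message returned by one poll is stamped with the same receive time, which is why the timing resolution is bounded by how often the poll returns.

```python
import csv
import io


def parse_message(payload):
    """Split a hypothetical 'message_id,sent_epoch_ms' payload."""
    message_id, sent_ms = payload.split(",")
    return message_id, int(sent_ms)


def latencies_for_batch(payloads, received_ms):
    """Compute latency for every message returned by a single poll.

    All messages in the batch share one receive timestamp, so messages
    sent at different moments are measured against the same instant.
    """
    rows = []
    for payload in payloads:
        message_id, sent_ms = parse_message(payload)
        rows.append((message_id, sent_ms, received_ms, received_ms - sent_ms))
    return rows


def write_topic_csv(rows, out):
    """Write one header line followed by one line per message."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["message_id", "sent_ms", "received_ms", "latency_ms"])
    writer.writerows(rows)


# Three messages sent at different times but returned by the same poll.
batch = ["m1,1000", "m2,1040", "m3,1090"]
rows = latencies_for_batch(batch, received_ms=1100)
buffer = io.StringIO()
write_topic_csv(rows, buffer)
print(buffer.getvalue())
```

In a real consumer the output would go to one file per topic rather than an in-memory buffer, and the send and receive clocks would need to be synchronised closely enough across machines for the differences to be meaningful.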

Summary of the standalone pepper-box tools

Both the producer and consumer are more functional than elegant, and neither is very forgiving of errors or of missing or incorrect parameters. It’d be great to improve their usability at some point.

Man and Machine in Perfect Harmony?

Ah, the bliss and eager joy when we can operate technology seamlessly and productively, making effective progress rather than mistakes; where the technology helps us make better, informed decisions. Sadly this seldom happens in operations – when administering the software – or trying to address problems.

HCI for systems software and tools

Testing the operating procedures, and the tools and utilities used to configure, administer, safeguard, diagnose and recover, etc. may be some of the most important testing we do. The context, including emotional & physical aspects, is an important consideration and may make the difference between performing the desired activity and exacerbating problems, confusion, etc. For instance, is the user tired, distracted, working remotely, under stress? Each of these may increase the risk of more and larger mistakes.

Usability testing can help us consider and design aspects of the testing. For instance, how well do the systems software and tools enable people to complete tasks effectively, efficiently and satisfactorily?

Standard Operating Procedures

Standard Operating Procedures (SOPs) can help people and organisations to deliver better outcomes with fewer errors. For a recent assignment testing Kafka, we needed to test the suitability of the SOPs, for instance to determine the chances of someone making an inadvertent mistake that caused a system failure or compounded an existing challenge or problem.

Testing Recovery is also relevant. There may be many forms of recovery. In terms of SOPs we want and expect most scenarios to be included in the SOPs and to be trustworthy. Recovery may be for a particular user or organisation (people / business centred) and/or technology centred e.g. recovering at-risk machine instances in a cluster of servers.

OpsDev & DevOps

OpsDev and DevOps may help improve the understanding and communication between development and operations roles and foci. They aren’t sufficient by themselves.


Disposable test environments


  • “readily available for the owner’s use as required”
  • “intended to be thrown away after use”

For many years test environments were hard to obtain, maintain, and update. Technologies including Virtual Machines and Containers reduce the effort, cost, and potentially the resources needed to provide test environments as and when required. Picking the most appropriate one for a particular testing need is key.

For a recent project, to test Kafka, we needed a range of test environments, from lightweight ephemeral self-contained environments to those that involved tens of machines distributed at least 100 km apart. Most of our testing for Kafka used AWS to provide the computer instances and connectivity, where environments were useful for days to weeks. However we also used ESXi and Docker images. We used ESXi when we wanted to test on particular hardware and networks. Docker, conversely, enabled extremely lightweight experiments, for instance to experiment with self-contained Kafka nodes where the focus was on interactive learning rather than longer-lived evaluations.

Some, not all, of the contents of a test environment have a life beyond that of the environment. Test scripts, the ability to reproduce the test data and context, key results and lab notes tend to be worth preserving.

Key Considerations

  • what to keep and what to discard: We want ways to identify what’s important to preserve and what can be usefully and hygienically purged and freed.
  • timings: how soon do we need the environment, what’s the duration, and when do we expect it to be life-expired?
  • fidelity: how faithful and complete does the test environment need to be?
  • count: how many nodes are needed?
  • tool support: do the tools work effectively in the proposed runtime environment?


Tools for testing kafka


I realised software tools would help us to test kafka in two key dimensions: performance and robustness. I started from knowing little about the tools or their capabilities although I was aware of jmeter which I’d used briefly over a decade ago.

My initial aim was to find a way we could generate and consume load. This load would then be the background for experimenting with robustness testing to see how well the systems and the data replication would work and cope in inclement conditions. By inclement I mean varying degrees of adverse conditions such as poor network conditions through to compound error conditions where the ‘wrong’ nodes were taken down during an upgrade while the system was trying to catch up from a backlog of transactions, etc. I came up with a concept of the Beaufort Scale of environment conditions which I’m writing about separately.

Robustness testing tools

My research very quickly led me to jepsen, the work of Kyle Kingsbury. He has some excellent videos available, together with opensource software tools and various articles on testing various technologies including Zookeeper and Kafka. Jay Kreps provided his perspective on the testing Kyle did, and Kyle’s work has helped guide various improvements to Kafka.

So far, so good…

However when I started investigating the practical aspects of using jepsen to test newer versions of Kafka I ran into a couple of challenges that were significant, for me at least. I couldn’t find the original scripts, and the ones I found were written in Clojure, an unfamiliar language, and targeted an older version of Kafka. More importantly, they relied on docker. While docker is an incredibly powerful and fast tool, the client wasn’t willing to trust tests run in docker environments (also, our environment needed at least 20 instances to test the smallest configuration).

The next project we tried was written in a language I and other team members knew sufficiently well. However we again ran into the blocker that it used docker. There seemed to be some potential to test clusters if we could get one of the dependencies to support docker swarm. That project is Blockade. I asked what it would take to add support for docker swarm; quite a lot, according to one of the project team.

By this point we needed to move on and focus on testing the performance and scalability and latency of Kafka including inter-regional data replication so we had to pause our search for automated tools or frameworks to control the robustness aspects of the testing.

Performance testing tools

For now I’ll try to keep focused on the tools rather than trying to define performance testing vs load testing, etc. as there are many disagreements on what distinguishes these terms and other related terms. We knew we needed ways to generate traffic patterns that ranged from simple text to ones that closely resembled the expected production traffic profiles.

Over the period of the project we discovered at least 5 candidates for generating load:

  1. kafkameter
  2. pepper-box
  3. kafka tools
  4. sangrenel
  5. ducktape

Both kafkameter and pepper-box integrated with jmeter. Of these pepper-box was newer, and inspired by kafkameter. Also kafkameter would clearly need significant work to suit our needs so we started experimenting with pepper-box. We soon forked the project so we could easily experiment and add functionality without disturbing the parent project or delaying the testing we needed to do by waiting for approvals, etc.

I’ve moved the details of working with pepper-box to a separate blog post.

Things I wish we’d had more time for

There’s so much that could be done to improve the tooling and approach to using testing tools to test Kafka. We needed to keep an immediate focus during the assignment in order to provide feedback and results quickly. Furthermore, the testing ran into the weeds, particularly during the early stages, and configuring Kafka correctly was extremely time-consuming as we were newbies to several of the technologies. We really wanted to run many more tests earlier in the project and establish regular, reliable and trustworthy test results.

There’s plenty of scope to improve both pepper-box and the analysis tools. Some of these have been identified on the respective github repositories.

What I’d explore next

The biggest immediate improvement, at least for me, would be to focus on using trustworthy statistical analysis tools such as R so we can automate more of the processing and graphing aspects of the testing.

Further topics

Here are topics I’d like to cover in future articles:

  • Managing and scaling the testing, particularly how to run many tests and keep track of the results, while keeping the environments healthy and clean.
  • Designing the producers and consumers so we could measure throughput and latency using the results collected by the consumer code.
  • Tradeoffs between fidelity of messages and the overhead of increased fidelity (hi-fi costs more than lo-fi at runtime).
  • Some of the many headwinds we faced in establishing trustworthy, reliable test environments and simply performing the testing

Testing Kafka: How I learned stuff

When I started my assignment to test Kafka I realised I’d got a vast range of topics to comprehend. During the assignment I made time to actively learn these topics together with any additional ones that emerged during the assignment, such as AWS.

This blog post introduces the various topics. It won’t go into detail on any of them (I may cover some in other blog posts) instead I’ll focus on how I learned stuff as part of this project.

  • Kafka: this is perhaps obvious as a topic, however I needed to learn particular facets of Kafka related to its reliability, resilience and scalability, and find ways to monitor its behaviour. I also ended up learning how to write Kafka clients and how to implement and configure SASL_SSL security.
  • VMWare: VMWare Enterprise technologies would be used for some runtime environments. I hadn’t worked with VMWare for many years and decided to learn how to configure, monitor and run ESXi on several small yet sufficiently representative servers. This would enable me to both work in the client’s environment and also run additional tests independently and sooner than waiting for sufficient environments and VMs to be available on demand (corporates tend to move more slowly owing to internal processes and organisational structures).
  • How to ‘performance test’ Kafka: we had several ideas and had discovered Kafka includes a utility to ‘performance test’. We needed to understand how that utility generates messages and measures performance. Also it might not be the most suitable tool for the project’s needs.
  • Ways to degrade performance of the system and expose flaws in the system that would adversely affect the value of using Kafka for the project. Disconnecting network cables, killing processes, etc. are pretty easy to do provided one has direct access to the machines and the test environment. However, we needed to be able to introduce more complex fault conditions and also be able to inject faults remotely.

These were the ones I knew about at the start of the assignment. Several more emerged, including:

  • Creating test environments in AWS. This included creating inter-regional peering between VPCs, creating Kafka and Zookeeper clusters, and multiple load generator and load consumer instances to execute the various performance and load tests. While there are various ‘quickstarts’, including one for deploying Kafka clusters, in the end we had to create our own clusters, bastion hosts and VPCs instead. The quickstart scripts failed frequently and the environment then needed additional cleaning up after they had failed.
  • Jepsen and other similar Chaos generation tools. Jepsen tested Kafka several major versions ago. The tools are available and opensource, but would they suit our environment and skills set?
  • Various opensource load generators, including 2 that integrated with jmeter, before we finally settled on modifying an opensource standalone load generator and writing a reciprocal load consumer.
  • Linux utilities: which we used extensively to work with the environments and the automated tests. Similarly we wrote utility scripts to monitor, clean up and reset clusters and environments after some of the larger volume load tests.
  • The nuances and effects of enabling topic auto-creation.
  • KIPs (Kafka Improvement Proposals).
  • Reporting and Analysis: the client had particular expectations on what would be reported and how it would be presented. Some of the tools didn’t provide the results in sufficient granularity e.g. they only provided ‘averages’ and we needed to calibrate the tools so we could trust the numbers they emitted.
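The point about ‘averages’ in the reporting item above is easier to see with numbers. Here is a minimal sketch in Python, using made-up latency values rather than results from the project, that reports percentiles alongside the mean. The function name and the choice of percentiles are assumptions for illustration.

```python
import statistics


def summarise(latencies_ms):
    """Summarise latencies with percentiles as well as the mean."""
    ordered = sorted(latencies_ms)
    # 99 cut points; index 94 is the 95th percentile, index 98 the 99th.
    cut_points = statistics.quantiles(ordered, n=100, method="inclusive")
    return {
        "count": len(ordered),
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "p95": cut_points[94],
        "p99": cut_points[98],
        "max": ordered[-1],
    }


# Made-up data: most messages are fast, a few are very slow.
samples = [10] * 95 + [2000] * 5
summary = summarise(samples)
for name, value in summary.items():
    print(f"{name}: {value}")
```

With data shaped like this the mean sits far above what a typical message experiences and far below the slow tail, so on its own it describes neither. Reporting the median and the high percentiles together shows both.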

Note: The following is ordered by topic or concept rather than chronologically.

How I learned stuff related to testing Kafka

I knew I had a lot to learn from the outset. Therefore I decided to invest time and money even from before the contract was signed so I would be able to contribute immediately. Thankfully all the software was freely available and generally opensource (which meant we could read the code to help understand it and even modify it to help us with the work and the learning).

Most of the learning was steeped in practice, where I and my colleagues would try things out in various test environments and learn by doing, observing and experimenting. (I’ll mention some of the test environments here and may cover aspects in more detail in other blog posts.)

I discovered, and for the first time appreciated, the range, depth and value of paying for online courses. While they’re not necessarily as good as participating in a commercial training course with instructors in the room and available for immediate advice, the range, price and availability were a revelation and the financial cost of all the courses I bought was less than £65 ($90 USD).

Reading was also key. There are lots of blog posts and articles online, including several key articles from people directly involved with developing and testing Kafka. I also searched for academic articles that had been peer-reviewed. I only found a couple, which is a pity as well written academic research is an incredible complement to commercial and/or personal informal write-ups.

We spent lots of time reading source code and then modifying code we hoped would be suitable once it’d been adapted. Thankfully the client agreed we could contribute our non-confidential work in public and make it available under permissive opensource and creative commons licenses.


I first used udemy courses several years ago to try to learn about several technologies. At that time I didn’t get much value, however the frequently discounted prices were low enough that I didn’t mind too much. In contrast, this time I found udemy courses to be incredibly valuable and relevant. The richness, range, and depth of courses available on udemy is incredible, and there are enough good quality courses available on relevant topics (particularly on Kafka, AWS, and to a lesser extent VMWare) to be able to make rapid, practical progress in learning aspects of these key topics.

I really appreciated being able to watch not only several introductory videos but also examples of more advanced topics from each of the potential matches I’d found. Watching these, which is free-of-charge, doesn’t take very long per course and enabled me to get a good feel for whether the presenter’s approach and material would be worthwhile to me given what I knew and what I wanted to achieve.

I took the approach of paying for a course if I thought I’d learn at least a couple of specific items from that course. The cost is tiny compared to the potential value of the knowledge they unlock in my awareness and understanding.

Sometimes even the courses that seemed poorly done helped me to understand where concepts could be easily confused and/or poorly understood. If the presenter was confused – perhaps I would be too 🙂 That said, the most value for me came from the following courses which were particularly relevant for me:

The first three courses are led by the same presenter, Stephane Maarek. He was particularly engaging. He was also helpful and responsive when I sent him questions via udemy’s platform.

Published articles and blog posts

I won’t list the articles or blog posts here. There are too many and I doubt a plethora of links would help you much. In terms of learning, one of the key challenges was determining whether the articles were relevant to what I wanted to achieve with the versions of Kafka we were testing. For instance, many of the articles written before Kafka version 0.10 weren’t very relevant any more, and reproducing tests and examples was sometimes too time-consuming to justify.

Also the way the project wanted to use Kafka was seldom covered and we discovered that some of the key configuration settings we needed to use vastly changed the behaviour of Kafka which again meant many of the articles and blog posts didn’t apply directly.

I used a software tool called Zotero to manage my notes and references (I have been using it for several years as part of my PhD research) and have over 100 identified articles recorded there. I read many more articles during the assignment, perhaps as many as 1,000.

Academic research

The best article I found compares Kafka and RabbitMQ in an industrial research setting. There are several revisions available. You may need to pay for the peer-reviewed edition unless you have a suitable subscription; the latest revision seems to be free to download and read.

Test environments

Here I’ll be brief. I plan to cover test environments in depth later on. Our test environments ranged from clusters of Raspberry Pis (replicating using MirrorMaker, etc), Docker containers, and inexpensive physical rack-mount servers running ESXi, to several AWS environments. Both Docker and AWS are frequently referenced, for instance the udemy kafka course I mentioned earlier used AWS for their machines.

Damn you auto-create

Inspired by the entertaining web site, here are some thoughts on the benefits and challenges of having auto-create enabled for Kafka topics.


At first, auto-create seems like a convenience, a blessing, as it means developers don’t need to write code to explicitly create topics. For a particular project the developers can focus on using the system as a service to share user-specified sets of data rather than writing extra code to interact with Zookeeper, etc. (newer releases of Kafka include the AdminClient API which deals with the Zookeeper aspects).

Effects of relying on auto-create: topics are created with the default (configured) partition and replication-counts. These may not be ideal for this topic and its intended use(s).

Adverse impacts of using auto-create

Deleting topics: The project uses Confluent Replicator to replicate data from Kafka Cluster to Kafka Cluster. As part of our testing lots of topics were created. We wanted to delete some of these topics but discovered they were virtually impossible to kill as the combination of Confluent Replicator and the Kafka Clusters was resurrecting the topics before they could be fully expunged. This caused almost endless frustration and adversely affected our testing as we couldn’t get the environment sufficiently clean to run tests in controlled circumstances (Replicator was busy servicing the defunct topics, which limited its ability to focus on the topics we wanted to replicate in particular tests).

Coping with delays and problems creating topics: At a less complex level, auto-creation takes a while to complete and seems to happen in the background. When the tests (and the application software) tried to write to the topic immediately, various problems occurred from time to time. Knowing that problems can occur is useful in terms of performance, reliability, etc. however it complicates the operational aspects of the system, especially as the errors affect producing data (what the developers and users think is happening) rather than the orthogonal aspect of creating a topic so that data can be produced.

Lack of clarity or traceability on who (what) created topics: Topics could be auto-created when code tried to write (produce), which was more-or-less what we expected. However they could also be auto-created by trying to read (consume). The Replicator duly set up replication for that topic. For various reasons topics could be created on one or more clusters with the same name, and replication happened both locally (within a Kafka Cluster) and to another cluster. We ended up with a mess of topics on various clusters which was compounded by the challenges of cleaning up (deleting) the various topics. It ended up feeling like we were living through the after-effects of the Sorcerer’s Apprentice!

From a testing perspective

From a testing perspective we ended up adding code to our consumer that checked and waited for the topic to appear in Zookeeper before trying to read from it. This, at least, reduced some of the confusion and enabled us to unambiguously measure the propagation time for Confluent Replicator for topics it needed to replicate.
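The check-and-wait approach can be sketched as a small polling loop. This is a minimal sketch in Python rather than our consumer code: topic_exists is a hypothetical callable standing in for whatever lookup is used (in our case a check against Zookeeper), and the timeout and interval defaults are arbitrary assumptions.

```python
import time


def wait_for_topic(topic, topic_exists, timeout_s=30.0, interval_s=0.5,
                   clock=time.monotonic, sleep=time.sleep):
    """Poll until topic_exists(topic) is true.

    Returns the number of seconds waited, or None if the timeout
    expired before the topic appeared.
    """
    start = clock()
    while True:
        if topic_exists(topic):
            return clock() - start
        if clock() - start >= timeout_s:
            return None
        sleep(interval_s)
```

Returning the time waited, rather than just True or False, means the same helper can be used both to guard the consumer and to record how long a topic took to appear. The resolution of that measurement is limited by the polling interval.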

To determine how much effort was needed to remove the dependency on auto-create being enabled and used, we also wrote some code that explicitly created topics rather than relying on auto-create. That code amounted to less than 10 lines in the proof-of-concept. Production quality code may involve more code in order to audit the creation, and to log and report problems and any run-time failures.

Further reading

“Auto topic creation on the broker has caused pain in the past; And today it still causes unusual error handling requirements on the client side, added complexity in the broker, mixed responsibility of the TopicMetadataRequest, and limits configuration of the option to be cluster wide. In the future having it broker side will also make features such as authorization very difficult.” KAFKA-2410 Implement “Auto Topic Creation” client side and remove support from Broker side


Six months review: learning how to test new technologies

I’ve not published for over a year, although I have several draft blog posts written and waiting for completion. One of the main reasons I didn’t publish was the direction my commercial work has been going recently, into domains and fields I’d not worked in for perhaps a decade or more. One of those assignments was for six months and I’d like to review various aspects of how I approached the testing, and some of my pertinent experiences and discoveries.

The core technology was Apache Kafka, which needed to be tested as a candidate for replicated data sharing between organisations. Various criteria took the deployment off the beaten track of Apache Kafka’s popular deployment models; that is, the use was atypical and therefore it was important to understand how Kafka behaves for the intended use.

Kafka was new to me, and my experiences of some of the other technologies were sketchy, dated or both. I undertook the work at the request of the sponsor who knew of my work and research.

There was a massive amount to learn and, as we discovered, lots to do in order to establish a useful testing process, including establishing environments to test the system. I aim to cover these in a series of blog articles here.

  • How I learned stuff
  • Challenges in monitoring Kafka and underlying systems
  • Tools for testing Kafka, particularly load testing
  • Establishing and refining a testing process
  • Getting to grips with AWS and several related services
  • Reporting and Analysis (yes in that order)
  • The many unfinished threads I’ve started