Beaufort Scale of Testing Software

I have been wanting to find a way to describe various forces and challenges we can apply when testing. One of the scales that appealed to me is the Beaufort Scale which is used to assess and communicate wind speeds based on observation (rather than precision). I propose we could use a similar scale for the test conditions we want to use when testing software and computer systems.

Illustration of the beaufort scale using a house, a tree and a flag

Beaufort Scale Illustration (from


What would I like to achieve by creating the scale?

What would I like to achieve when applying various scales as part of testing software and systems?

  • Devise ways to test software and systems at various values on the scale so we learn enough about the behaviours to make useful decisions we’re unlikely to regret.

Testing Beaufort Scale

Here is my initial attempt at describing a Testing Beaufort Scale:

  1. Calm: there’s nothing to disturb or challenge the software. Often it seems that much of the scripted testing that’s performed is done when the system is calm, nothing much else is happening, simply testing at a gentle, superficial level. (for the avoidance of doubt the tester may often be stressed doing such boring testing :P)
  2. Light Air:
  3. Gentle Breeze:
  4. Gale:
  5. Hurricane: The focus changes from continuing to operate to one that protects the safety and integrity of the system and data. Performance and functionality may be reduced to protect and enable essential services to be maintained/sustained while the forces are in operation.

Some examples

  • Network outages causing sporadic data loss
  • Database indexes cause slow performance and processing delays
  • Excessive message logging
  • Long running transactions (somewhat longer, or orders of magnitude longer).
  • Heavy workloads?
  • Loss and/or corruption in data, databases, indices, caches, etc.
  • Controlled system failover: instigated by humans where the failover is planned and executed according to the plan – here the focus isn’t on the testing the failover it’s on testing the systems and applications that use the resource that’s being failed-over.
  • Denial-of-Service: Service is denied when demand significantly exceeds the available supply. Often this is of system and service resources. Denials-of-Service exist when external attackers flood systems with requests. They can also exist when a system is performing poorly (for instance while running heavy-duty backups, updates, when indexes are corrupt or not available, etc. etc. Another source is when upstream server(s) unavailable causing queues to build and fill up which puts pressure on handling requests from users or other systems and services. c.f. resource exhaustion.
  • In-flight configuration changes e.g. changing IP address, routing, database tables, system patching, …
  • n Standard Deviations from ‘default’ or ‘standard’ configurations of systems and software. The further the system is from a relatively well known ‘norm’ the more likely it may behave in unexpected and sometimes undesirable ways. e.g. if a server is set to restart every minute, or only once every 5 years instead of every week (for instance) how does it cope, how does it behave, and how does its behaviour affect the users (and peers) of the server? Example: a session timeout of 720 minutes vs the default of 30 minutes.
  • Special characters in parameters or values, as can atypical values. c.f. data as code
  • Permission mismatches, abuses, and mal-configurations
  • Resource constraints may limit the abilities of a system to work, adapt and respond.
  • Increased latencies:
  • Unexpected system failover
  • Unimagined load (note: not unimaginable load or conditions, simply what the tester didn’t envisage might be the load under various conditions)

What to consider measuring

Detecting the effects on the system… Can we define measures and boundaries (similar to equivalence partitions).

Lest we forget

From a broader perspective our work and our focus is often affected by the mood of customers, bosses, managers and stakeholders. These may raise our internal Beaufort Scale and adversely affect our ability to focus on our testing. Paradoxically being calm when the situation is challenging is a key success factor in achieving our mission and doing good work.

Cardboard systems and chocolate soldiers


Further reading

An attractive poster of the Beaufort Scale including how to set the sails on a sailing boat

A light-hearted look at scale of conflicts in bar-rooms (pubs)

M. L. Cummings:

Tools for testing kafka


I realised software tools would help us to test kafka in two key dimensions: performance and robustness. I started from knowing little about the tools or their capabilities although I was aware of jmeter which I’d used briefly over a decade ago.

My initial aim was to find a way we could generate and consume load. This load would then be the background for experimenting with robustness testing to see how well the systems and the data replication would work and cope in inclement conditions. By inclement I mean varying degrees of adverse conditions such as poor network conditions through to compound error conditions where the ‘wrong’ nodes were taken down during an upgrade while the system was trying to catch up from a backlog of transactions, etc. I came up with a concept of the Beaufort Scale of environment conditions which I’m writing about separately.

Robustness testing tools

My research very quickly led me to jepsen the work of Kyle Kingsbury. He has some excellent videos available and together with opensource software tools and various articles on testing various technologies including Zookeeper and Kafka. Jay Kreps provided his perspective on the testing Kyle did and the effects of Kyle’s work has helped guide various improvements to Kafka.

So far, so good…

However when I started investigating the practical aspects of using jepsen to test newer versions of Kafka I ran into a couple of significant challenges for me at least. I couldn’t find the original scripts, and the ones I found were in an Clojure, an unfamiliar language, and for an older version of Kafka ( More importantly it relied on docker. While docker is an incredibly powerful and fast tool the client wasn’t willing to trust tests run in docker environments (also our environment needed at least 20 instances to test the smallest configuration.

The next project to try was a language I and other team members knew sufficiently to try. However we again ran into the blocker that it used docker. However, there seemed to be some potential to test clusters if we could get one of the dependencies to support docker swarm. That project is Blockade. I asked what it would take to add support for docker swarm; quite a lot according to one of the project team

By this point we needed to move on and focus on testing the performance and scalability and latency of Kafka including inter-regional data replication so we had to pause our search for automated tools or frameworks to control the robustness aspects of the testing.

Performance testing tools

For now I’ll try to keep focused on the tools rather than trying to define performance testing vs load testing, etc. as there are many disagreements on what distinguishes these terms and other related terms. We knew we needed ways to generate traffic patterns that ranged from simple text to ones that closely resembled the expected production traffic profiles.

Over the period of the project we discovered at least 5 candidates for generating load:

  1. kafkameter
  2. pepper-box:
  3. kafka tools:
  4. sangrenel:
  5. ducktape:

Both kafkameter and pepper-box integrated with jmeter. Of these pepper-box was newer, and inspired by kafkameter. Also kafkameter would clearly need significant work to suit our needs so we started experimenting with pepper-box. We soon forked the project so we could easily experiment and add functionality without disturbing the parent project or delaying the testing we needed to do by waiting for approvals, etc.

pepper-box and jmeter

Working with jmeter to test a non-web protocol i.e. Kafka ended up taking significant effort and time where we ended up spending lots of time having to learn about various aspects of writing the code, configuring jmeter and running the tests.

Thankfully as there were both kafkameter and pepper-box we were able to learn lots from various articles as well as the source code. Key articles include:

The blazemeter article even included an example consumer script written in Groovy. We ended up extending this script significantly and making it available as part of our fork of pepper-box (since it didn’t seem sensible to create a separate project for this script) 

As ever there was lots of other reading and experimentation to be able to reliably and consistently develop the jmeter plugins. Lowlights included needing to convince both the machine and maven that we actually needed to use Java 8.

A key facet of the work was adding support to both the producer and consumer code to enable it to be used with clusters configured with and without security, in particular SASL_SSL. The code is relatively easy to write but debugging issues with it was very time-consuming especially as we had to test in a variety of environments each with different configurations where none of the team had prior experience of how to configure Kafka with SASL_SSL before the project started.

We ran into multiple issues related to the environments and getting the Kafka clusters to stay healthy and the replication to happen without major delays. I may be able to cover some of the details in subsequent articles. We also realised that using the pepper-box java sampler (as they’re called in jmeter terminology) used lots of CPU and we needed to start running load generators and consumers in parallel.

Standalone pepper-box

We eventually discovered the combination of jmeter and the pepper-box sampler was maxing out and unable to generate the loads we wanted to create to test various aspects of the performance and latency. Thankfully the original creators of pepper-box had provided a standalone load generation utility which was able to generate significantly higher loads. We had to tradeoff between the extra performance and the various capabilities of jmeter and the many plugins that have been developed for jmeter over the years. We’d have to manage synchronisation of load generators on multiple machines ourselves, and so on.

The next challenge was to decide whether to develop an equivalent standalone consumer ourselves. In the end we did, partly as jmeter had lost credibility with the client so it wasn’t viable to continue using the current consumer.

Developing a pepper-box consumer


Things I wish we’d had more time for

What I’d explore next

Further topics

Here are topics I’d like to cover in future articles:

  • Managing and scaling the testing, particularly how to run many tests and keep track of the results, while keeping the environments healthy and clean.
  • Designing the producers and consumers so we could measure throughput and latency using the results collected by the consumer code.
  • Tradeoffs between fidelity of messages and the overhead of increased fidelity (hi-fi costs more than lo-fi at runtime).
  • Some of the many headwinds we faced in establishing trustworthy, reliable test environments and simply performing the testing