Beaufort Scale of Testing Software

I have been wanting to find a way to describe the various forces and challenges we can apply when testing. One scale that appealed to me is the Beaufort Scale, which is used to assess and communicate wind speed based on observation rather than precise measurement. I propose a similar scale for the conditions we want to apply when testing software and computer systems.

Illustration of the Beaufort Scale using a house, a tree and a flag (source: gcaptain.com)

What would I like to achieve by creating the scale and applying it when testing software and systems?

  • Devise ways to test software and systems at various values on the scale so we learn enough about the behaviours to make useful decisions we’re unlikely to regret.

Testing Beaufort Scale

Here is my initial attempt at describing a Testing Beaufort Scale:

  1. Calm: There's nothing to disturb or challenge the software. Much of the scripted testing that's performed seems to be done when the system is calm: nothing much else is happening, and the testing stays at a gentle, superficial level. (For the avoidance of doubt, the tester may often be stressed doing such boring testing :P)
  2. Light Air:
  3. Gentle Breeze:
  4. Gale:
  5. Hurricane: The focus changes from continuing to operate to protecting the safety and integrity of the system and its data. Performance and functionality may be reduced so that essential services can be maintained while the forces are in operation.

Some examples

  • Network outages causing sporadic data loss
  • Missing or degraded database indexes causing slow performance and processing delays
  • Excessive message logging
  • Long-running transactions (somewhat longer, or orders of magnitude longer).
  • Heavy workloads?
  • Loss and/or corruption in data, databases, indices, caches, etc.
  • Controlled system failover: instigated by humans, where the failover is planned and executed according to the plan. Here the focus isn't on testing the failover itself; it's on testing the systems and applications that use the resource being failed over.
  • Denial-of-Service: Service is denied when demand significantly exceeds the available supply, often of system and service resources. Denials-of-service occur when external attackers flood systems with requests. They can also occur when a system is performing poorly (for instance while running heavy-duty backups or updates, or when indexes are corrupt or unavailable). Another source is upstream servers becoming unavailable, causing queues to build and fill up, which puts pressure on handling requests from users and other systems and services. c.f. resource exhaustion.
  • In-flight configuration changes e.g. changing IP address, routing, database tables, system patching, …
  • n standard deviations from 'default' or 'standard' configurations of systems and software. The further a system is from a relatively well-known 'norm', the more likely it is to behave in unexpected and sometimes undesirable ways. e.g. if a server is set to restart every minute, or only once every 5 years instead of every week, how does it cope, how does it behave, and how does its behaviour affect the users (and peers) of the server? Example: a session timeout of 720 minutes vs the default of 30 minutes.
  • Special characters in parameters or values can cause problems, as can atypical values. c.f. data as code
  • Permission mismatches, abuses, and mal-configurations
  • Resource constraints may limit the abilities of a system to work, adapt and respond.
  • Increased latencies
  • Unexpected system failover
  • Unimagined load (note: not unimaginable load or conditions, simply load the tester didn't envisage might occur under various conditions)
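Several of the items above (special characters, atypical values, data as code) lend themselves to a small, reusable catalogue of awkward inputs. Here is a minimal Python sketch; the values and the `probe` helper are invented for illustration, not taken from any real test suite:

```python
# Hypothetical catalogue of "stormy" input values: empty and
# whitespace-only strings, embedded quotes, markup treated as data,
# non-ASCII characters, and atypically long or extreme values.
SPECIAL_INPUTS = [
    "",
    " ",
    "O'Brien",
    "<script>alert(1)</script>",
    "Ω≈ç√∫",
    "A" * 10_000,
    "1e308",
]

def probe(handler):
    """Feed each special input to a handler; collect any failures."""
    failures = []
    for value in SPECIAL_INPUTS:
        try:
            handler(value)
        except Exception as exc:
            failures.append((value[:20], type(exc).__name__))
    return failures
```

For example, `probe(lambda v: v.encode("ascii"))` flags only the non-ASCII entry, showing which of these 'weather conditions' a given handler cannot cope with.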

What to consider measuring

Detecting the effects on the system… Can we define measures and boundaries (similar to equivalence partitions)?
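One concrete (and entirely invented) way to frame such measures: pick an observable such as request latency and map it onto bands of the scale, much as wind speed ranges define the Beaufort forces. The thresholds below are illustrative placeholders; real boundaries would come from the system's service-level objectives:

```python
# Illustrative latency bands (milliseconds) mapped onto the scale.
# The thresholds are invented; derive real ones from your SLOs.
BANDS = [
    (50, "calm"),
    (200, "light air"),
    (500, "gentle breeze"),
    (2000, "gale"),
]

def classify(latency_ms):
    """Return the scale band for an observed latency."""
    for upper_bound, name in BANDS:
        if latency_ms < upper_bound:
            return name
    return "hurricane"
```

Boundaries like these then play the same role as equivalence partitions: a result just inside or just outside a band tells us more than one comfortably in the middle.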

Lest we forget

From a broader perspective, our work and our focus are often affected by the mood of customers, bosses, managers and stakeholders. These can raise our internal Beaufort Scale and adversely affect our ability to focus on our testing. Paradoxically, staying calm when the situation is challenging is a key success factor in achieving our mission and doing good work.

Cardboard systems and chocolate soldiers

Further reading

An attractive poster of the Beaufort Scale, including how to set the sails on a sailing boat: http://www.web-shops.net/earth-science/Beaufort_Wind_Force_Poster.htm

A light-hearted look at scale of conflicts in bar-rooms (pubs) http://www.shulist.com/2009/01/shulist-scale-of-conflict/

M. L. Cummings:

Tools for testing Kafka

Context

I realised software tools would help us test Kafka along two key dimensions: performance and robustness. I started out knowing little about the available tools or their capabilities, although I was aware of JMeter, which I'd used briefly over a decade ago.

My initial aim was to find a way to generate and consume load. This load would then form the background for experimenting with robustness testing, to see how well the systems and the data replication would cope in inclement conditions. By inclement I mean varying degrees of adversity, from poor network conditions through to compound error conditions where the 'wrong' nodes were taken down during an upgrade while the system was trying to catch up on a backlog of transactions, etc. This led me to the concept of a Beaufort Scale of environmental conditions, which I'm writing about separately.

Robustness testing tools

My research very quickly led me to Jepsen, the work of Kyle Kingsbury. He has some excellent videos available (https://www.youtube.com/watch?v=NsI51Mo6r3o and https://www.youtube.com/watch?v=tpbNTEYE9NQ), together with open-source software tools (https://aphyr.com/posts) and analyses of various technologies (https://jepsen.io/analyses), including ZooKeeper and Kafka. Jay Kreps provided his perspective on the testing Kyle did (http://blog.empathybox.com/post/62279088548/a-few-notes-on-kafka-and-jepsen), and Kyle's work has helped guide various improvements to Kafka.

So far, so good…

However, when I started investigating the practical aspects of using Jepsen to test newer versions of Kafka, I ran into a couple of significant challenges, at least for me. I couldn't find the original scripts, and the ones I did find (https://github.com/gator1/jepsen/tree/master/kafka) were in Clojure, a language unfamiliar to us, and targeted an older version of Kafka (0.10.2.0). More importantly, they relied on Docker. While Docker is an incredibly powerful and fast tool, the client wasn't willing to trust tests run in Docker environments (and our environment needed at least 20 instances to test even the smallest configuration).

The next project to try was https://github.com/mbsimonovic/jepsen-python, written in Python, a language I and other team members knew sufficiently well. However, we again ran into the blocker that it used Docker. There seemed to be some potential to test clusters if we could get one of its dependencies, Blockade, to support Docker Swarm. I asked what it would take to add that support; quite a lot, according to one of the project team (https://github.com/worstcase/blockade/issues/67).

By this point we needed to move on and focus on testing the performance, scalability, and latency of Kafka, including inter-regional data replication, so we had to pause our search for automated tools or frameworks to control the robustness aspects of the testing.

Performance testing tools

For now I'll try to stay focused on the tools rather than trying to define performance testing vs. load testing, etc., as there are many disagreements about what distinguishes these and other related terms. We knew we needed ways to generate traffic patterns that ranged from simple text to ones closely resembling the expected production traffic profiles.

Over the period of the project we discovered at least 5 candidates for generating load:

  1. kafkameter https://github.com/BrightTag/kafkameter
  2. pepper-box: https://github.com/GSLabDev/pepper-box
  3. kafka tools: https://github.com/apache/kafka
  4. sangrenel: https://github.com/jamiealquiza/sangrenel
  5. ducktape: https://github.com/confluentinc/ducktape
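Of the five, the kafka tools bundled with Apache Kafka itself are the quickest to try, since the perf-test scripts ship with the broker distribution. A sketch of a typical invocation; the topic name and broker address below are placeholder values:

```shell
# Produce 1,000,000 records of 100 bytes each, unthrottled
# (--throughput -1). "perf-test" and localhost:9092 are placeholders.
bin/kafka-producer-perf-test.sh \
  --topic perf-test \
  --num-records 1000000 \
  --record-size 100 \
  --throughput -1 \
  --producer-props bootstrap.servers=localhost:9092

# Companion consumer-side measurement.
bin/kafka-consumer-perf-test.sh \
  --broker-list localhost:9092 \
  --topic perf-test \
  --messages 1000000
```

The producer script reports throughput and latency figures on completion, which gives a useful baseline to compare the JMeter-based tools against.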

Both kafkameter and pepper-box integrate with JMeter. Of the two, pepper-box was newer and was inspired by kafkameter. kafkameter would also clearly have needed significant work to suit our needs, so we started experimenting with pepper-box. We soon forked the project so we could easily experiment and add functionality without disturbing the parent project, and without delaying the testing we needed to do while waiting for approvals, etc.

I’ve moved the details of working with pepper-box to a separate blog post http://blog.bettersoftwaretesting.com/2018/04/working-with-pepper-box-to-test-kafka/

Things I wish we’d had more time for

There's so much that could be done to improve the tooling and the approach to using testing tools to test Kafka. We needed to keep an immediate focus during the assignment in order to provide feedback and results quickly. Furthermore, the testing ran into the weeds, particularly during the early stages, and configuring Kafka correctly was extremely time-consuming for newbies to several of the technologies. We really wanted to run many more tests earlier in the project and establish regular, reliable and trustworthy test results.

There’s plenty of scope to improve both pepper-box and the analysis tools. Some of these have been identified on the respective github repositories.

What I’d explore next

The biggest immediate improvement, at least for me, would be to focus on using trustworthy statistical analysis tools such as R so we can automate more of the processing and graphing aspects of the testing.
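As a stand-in illustration of the kind of summary we wanted to automate (plain Python here rather than R, and with invented latency figures), even simple percentile reporting goes a long way:

```python
# Nearest-rank percentile over raw latency samples (values invented).
def percentile(samples, p):
    """Return the p-th nearest-rank percentile of the samples."""
    ordered = sorted(samples)
    k = max(0, round(p / 100 * len(ordered)) - 1)
    return ordered[k]

latencies_ms = [12, 15, 14, 250, 13, 16, 14, 15, 13, 900]
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
```

R (or a library such as Python's statistics module) would replace hand-rolled helpers like this and add trustworthy graphing on top.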

Further topics

Here are topics I’d like to cover in future articles:

  • Managing and scaling the testing, particularly how to run many tests and keep track of the results, while keeping the environments healthy and clean.
  • Designing the producers and consumers so we could measure throughput and latency using the results collected by the consumer code.
  • Tradeoffs between fidelity of messages and the overhead of increased fidelity (hi-fi costs more than lo-fi at runtime).
  • Some of the many headwinds we faced in establishing trustworthy, reliable test environments and simply performing the testing.
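On the producer and consumer design point above, the approach we gravitated towards was embedding the send timestamp in each payload so the consumer can derive latency on arrival. A minimal Python sketch, with a plain list standing in for a Kafka topic:

```python
import json
import time

topic = []  # stand-in for a real broker topic

def produce(payload):
    """Stamp each message with its send time before publishing."""
    topic.append(json.dumps({"sent_at": time.time(), "body": payload}))

def consume():
    """Derive per-message latency from the embedded timestamps."""
    return [time.time() - json.loads(raw)["sent_at"] for raw in topic]
```

This assumes the producer and consumer clocks are reasonably synchronised; with inter-regional replication, clock skew can easily dominate the latencies being measured, which is one of the fidelity trade-offs mentioned above.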