Trinity Testing

[reposted from my old Blogger account]

In 2009 I needed to find ways to significantly improve my testing of a major, unfamiliar product that had previously had two long-term, full-time test engineers assigned to it. I had a few days a month, so there was no way I could reproduce their testing. I realized that every person on the project had gaps in their knowledge and understanding (e.g. the developers often didn’t understand much of the business aspects), which meant that features were slow to release, hard to develop and hard to test, and when they finally reached production they often had flaws and limitations.

Testing of features was performed by test engineers and the business teams, and typically lagged development by several weeks, so when bugs were (finally) reported the developer needed time to remember their work. The overhead of remembering slowed down the fixes and, in practice, meant that lower-priority issues might be deferred a month or two rather than being fixed in the release being tested.

Trinity Testing was my approach to addressing all the issues I’ve mentioned so far. It works by combining three people:

  • The developer of a feature or fix
  • The business, feature, or story owner i.e. the domain expert for the feature/fix
  • The test engineer

The people act as peers in the discussion: no one is ‘in charge’ or the ‘decision maker’; instead, each participant is responsible for their own commitments and actions.

We initially met for no more than an hour per developer after a release was created. We shared a computer and screen and ‘walked through’ each feature or significant change, spending a few minutes per item. Generally the developer was responsible for the walk-through; they described how their code worked and received comments and questions from the other two participants, e.g. the domain expert asked how the feature behaved for other types of account, and the test engineer asked how they could test the new feature. People noted any follow-up work or actions, e.g. the developer might need to revise their implementation based on what they’d learnt during the session.

At the end of each session, each participant follows up on their work, e.g. the test engineer may target additional testing on areas that are of concern (to any of the participants).

Within two releases, the Trinity Testing sessions had proved their value. Everyone who participated found them useful and better than the traditional development and testing process. Furthermore, I was able to test each release in about two to three days, which reduced the manual testing effort to about a tenth of the original.

Trinity Testing sessions are ideal at a couple of stages in the lifecycle of a feature or fix:

  • At the outset, when the design is being considered
  • As soon as practical after the feature is ‘code complete’, preferably before the formal release candidate is created and while the developer knows the software intimately

At design time, a Trinity Testing session should:

  • help devise the tests that will be needed to confirm the feature will work as desired
  • help the tester to know what to look for to spot problems (how would I know the software is not working?)
  • help the developer to know what the feature/fix needs to do; so they don’t need to guess as often
  • give the ‘owner’ justified confidence that their feature/fix will be more correct, and available sooner

A year on, I’m continuing to receive positive comments about how useful Trinity Testing was for the project.

Note: Janet Gregory and Lisa Crispin devised the ‘power of three’ approach to testing several years before I ‘discovered’ Trinity Testing. I wasn’t aware of their work at the time. You might be interested in reading it, as our approaches are similar but not identical. Their work is available in their Agile Testing book http://www.agiletester.ca/.

Fledgling heuristics for testing Android apps

I’ve been inspired to have a go at creating some guidelines for testing Android apps. The initial request was to help shape interviews, to identify people who understand some of the challenges of, and approaches to, testing on Android and testing Android apps. I hope these will serve actual testing of the apps too.

  • Android Releases and API Versions
  • OnRotation and other Configuration Changes (see the sketch after this list)
  • Fundamental Android concepts including: Activities, Services, Intents, Broadcast Receivers, and Content Providers  https://developer.android.com/guide/components/fundamentals
  • Accessibility settings
  • Applying App Store data and statistics
  • Crashes and ANRs (Application Not Responding errors)
  • Using in-app analytics to compare our testing with how the app is used by users
  • Logs and Screenshots
  • SDK tools, including Logcat, adb, and monitor
  • Devices, including sensors, resources, and CPUs
  • Device Farms: services that provide remote devices for rent (often in the ‘cloud’)
  • Permissions granted & denied
  • Alpha & Beta channels
  • Build Targets (Debug, Release & others)
  • Test Automation Frameworks and Monkey Testing
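
To make one of these concrete, here is a minimal sketch (in Java, using the androidx.test libraries; the support-library equivalents are similar) of how ‘OnRotation and other Configuration Changes’ might be probed in an instrumented test. MainActivity and R.id.search_box are hypothetical placeholders for whatever Activity and view the app under test provides.

import static androidx.test.espresso.Espresso.onView;
import static androidx.test.espresso.action.ViewActions.replaceText;
import static androidx.test.espresso.assertion.ViewAssertions.matches;
import static androidx.test.espresso.matcher.ViewMatchers.withId;
import static androidx.test.espresso.matcher.ViewMatchers.withText;

import androidx.test.core.app.ActivityScenario;
import org.junit.Test;

public class ConfigurationChangeTest {
    @Test
    public void enteredTextSurvivesRecreation() {
        // MainActivity and R.id.search_box are placeholders for the app under test.
        try (ActivityScenario<MainActivity> scenario = ActivityScenario.launch(MainActivity.class)) {
            onView(withId(R.id.search_box)).perform(replaceText("hello"));
            // recreate() destroys and recreates the Activity, much as a rotation does.
            scenario.recreate();
            onView(withId(R.id.search_box)).check(matches(withText("hello")));
        }
    }
}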

I’ll continue exploring ideas and topics to include. Perhaps a memorable heuristic phrase will emerge, suggestions welcome on twitter https://twitter.com/julianharty

Presentation at Kafka Summit London April 2018

I presented ‘Testing the Beast’ at the Kafka Summit in London on 24th April 2018 https://kafka-summit.org/sessions/testing-the-beast/. The conference was an excellent venue to meet some of the many people who are passionate about and experienced in working with Kafka at scale. I learned a great deal from various speakers and hope to incorporate and apply some of what I learned to any future work I do with Kafka.

The conference will post the slides and a recording of my session (together with all the other sessions), probably before the end of May 2018.

Here are my slides in PowerPoint format: Testing The Beast (Kafka) 23 Apr 2018

BTW: when I used PowerPoint to create a PDF it ballooned into a file of over 300MB so I’ve left that for the conference organisers to sort and make available.


Working with pepper-box to test Kafka

Introduction

We needed to do performance testing of multi-regional Kafka clusters, and we ended up using pepper-box for most of our work. We had to first understand, then use, then extend and enhance the capabilities of pepper-box during the assignment. Here is an overview of what we did when working with pepper-box. As we published our code and other related materials on GitHub, you can see more of the details there: https://github.com/commercetest/pepper-box

Pepper-box and jmeter

In order to use pepper-box we first needed to understand the fundamentals of jmeter. Using jmeter to test a non-web protocol, i.e. Kafka, took significant effort and time: we spent a lot of it learning about various aspects of compiling the code, configuring jmeter, and running the tests.

Thankfully, as there were both kafkameter and pepper-box, we were able to learn a lot from various articles as well as from the source code. Key articles include:

The blazemeter article even included an example consumer script written in a programming language called Groovy. We ended up extending this script significantly and making it available as part of our fork of pepper-box (since it didn’t seem sensible to create a separate project for this script) https://github.com/commercetest/pepper-box/blob/master/src/groovyscripts/kafka-consumer-timestamp.groovy 

As ever, there was lots of other reading and experimentation needed before we could reliably and consistently develop the jmeter plugins. Lowlights included needing to convince both the machine and maven that we actually needed to use Java 8.

Extending pepper-box to support security protocols

Business requirements mandated that data be secured throughout the system. Kafka supports various security mechanisms. SASL enables nodes to authenticate themselves to Kafka instances. Connections were secured using what’s commonly known as SSL (e.g. see http://info.ssl.com/article.aspx?id=10241), although the security is actually provided by its successor, TLS (see https://docs.confluent.io/current/kafka/encryption.html).

A key facet of the work was adding support to both the producer and consumer code so they could be used with clusters configured with and without security, in particular SASL_SSL. The code is relatively easy to write, but debugging issues with it was very time-consuming, especially as we had to test in a variety of environments, each with different configurations, and none of the team had experience of configuring Kafka with SASL_SSL before the project started.
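
For illustration, the client-side configuration for SASL_SSL looks roughly like the sketch below. The broker address, credentials and truststore details are placeholders rather than values from the project, and the exact JAAS entry depends on which SASL mechanism the cluster uses.

import java.util.Properties;

public class SaslSslClientConfig {
    // Illustrative only: typical client properties for a SASL_SSL-secured cluster.
    // The hostname, paths and credentials are placeholders, not values from the project.
    public static Properties saslSslProperties() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1.example.com:9093");
        props.put("security.protocol", "SASL_SSL");
        props.put("sasl.mechanism", "PLAIN");
        props.put("sasl.jaas.config",
                "org.apache.kafka.common.security.plain.PlainLoginModule required "
                        + "username=\"client\" password=\"client-secret\";");
        props.put("ssl.truststore.location", "/path/to/client.truststore.jks");
        props.put("ssl.truststore.password", "truststore-password");
        // Add the usual serializer/deserializer settings, then pass the properties
        // to a KafkaProducer or KafkaConsumer.
        return props;
    }
}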

We ran into multiple issues related to the environments, to keeping the Kafka clusters healthy, and to getting the replication to happen without major delays. I may be able to cover some of the details in subsequent articles. We also realised that the pepper-box java sampler (as they’re called in jmeter terminology) used lots of CPU, and we needed to start running load generators and consumers in parallel.

Standalone pepper-box

We eventually discovered that the combination of jmeter and the pepper-box sampler was maxing out and unable to generate the loads we wanted in order to test various aspects of performance and latency. Thankfully the original creators of pepper-box had provided a standalone load generation utility which was able to generate significantly higher loads. We had to trade off the extra performance against the various capabilities of jmeter and the many plugins that have been developed for it over the years; for instance, we’d have to manage synchronisation of load generators on multiple machines ourselves, and so on.

The next challenge was to decide whether to develop an equivalent standalone consumer ourselves. In the end we did, partly because jmeter had lost credibility with the client, so it wasn’t viable to continue using the existing consumer.

Developing a pepper-box consumer

The jmeter-groovy-consumer wasn’t maxing out; however, it made little sense to run dissimilar approaches (a standalone producer written in Java combined with jmeter + Groovy), and doing so added to the overall complexity and complications for more involved tests. Therefore we decided to create a consumer modelled on the standalone producer. We didn’t end up adding rate-limiting, as it didn’t really suit the testing we were doing; otherwise the two are fairly complementary tools. The producer sends a known message format which is parsed by the consumer, which calculates latency and writes the output to a CSV file per topic. The consumer polls for messages using default values (e.g. a limit of 500 messages per poll request). These could be easily adapted with further tweaks and improvements to the code.

Using polling leads to a couple of key effects:

  1. It uses less CPU. Potentially several Consumers can be run on the same machine to process messages from several Producers (Generators) running across a bank of machines.
  2. The granularity of the timing calculations is constrained by the polling interval.
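
A much-simplified sketch of this consumer approach is shown below (it is not the actual pepper-box code). It assumes a reasonably recent kafka-clients library and that each message body starts with the producer’s send timestamp in milliseconds, followed by a comma; the topic name and output filename are placeholders.

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.io.FileWriter;
import java.io.IOException;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class LatencyConsumerSketch {
    public static void main(String[] args) throws IOException {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "latency-check");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        // max.poll.records defaults to 500, which bounds how many messages arrive per poll.

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
             FileWriter csv = new FileWriter("test-topic-latency.csv")) {
            consumer.subscribe(Collections.singletonList("test-topic"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                long receivedAt = System.currentTimeMillis();
                for (ConsumerRecord<String, String> record : records) {
                    // Assumes the message body starts with the producer's send timestamp.
                    long sentAt = Long.parseLong(record.value().split(",", 2)[0]);
                    csv.write(record.topic() + "," + sentAt + "," + (receivedAt - sentAt) + "\n");
                }
                csv.flush();
            }
        }
    }
}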

Summary of the standalone pepper-box tools

Both the producer and consumer are more functional than elegant, and neither is very forgiving of errors or of missing or incorrect parameters. It’d be great to improve their usability at some point.

Man and Machine in Perfect Harmony?

Ah, the bliss and eager joy when we can operate technology seamlessly and productively, making effective progress rather than mistakes, and when the technology helps us make better, informed decisions. Sadly this seldom happens in operations, when administering the software or trying to address problems.

HCI for systems software and tools

Testing the operating procedures and the tools and utilities used to configure, administer, safeguard, diagnose, recover, etc. may be some of the most important testing we do. The context, including emotional and physical aspects, is an important consideration and may make the difference between performing the desired activity and exacerbating problems, confusion, etc. For instance, is the user tired, distracted, working remotely, under stress? Each of these may increase the risk of more, and larger, mistakes.

Usability testing can help us consider and design aspects of the testing. For instance, how well do the systems software and tools enable people to complete tasks effectively, efficiently and satisfactorily?

Standard Operating Procedures

Standard Operating Procedures (SOPs) can help people and organisations deliver better outcomes with fewer errors. For a recent assignment testing Kafka, the testing needed to include assessing the suitability of the SOPs, for instance to determine the chances of someone making an inadvertent mistake that caused a system failure or compounded an existing challenge or problem.

Testing recovery is also relevant. There may be many forms of recovery. In terms of SOPs, we want and expect most scenarios to be covered and for the procedures to be trustworthy. Recovery may be for a particular user or organisation (people / business centred) and/or technology centred, e.g. recovering at-risk machine instances in a cluster of servers.

OpsDev & DevOps

OpsDev and DevOps may help improve the understanding and communication between development and operations roles and foci. They aren’t sufficient by themselves.

Further reading

Disposable test environments

Disposable:

  • “readily available for the owner’s use as required”
  • “intended to be thrown away after use”

https://en.oxforddictionaries.com/definition/disposable

For many years test environments were hard to obtain, maintain, and update. Technologies including Virtual Machines and Containers reduce the effort, cost, and potentially the resources needed to provide test environments as and when required. Picking the most appropriate for a particular testing need is key.

For a recent project, to test Kafka, we needed a range of test environments, from lightweight ephemeral self-contained environments to ones involving tens of machines distributed at least 100 km apart. Most of our testing for Kafka used AWS to provide the compute instances and connectivity, where environments were useful for days to weeks. However, we also used ESXi and Docker images. We used ESXi when we wanted to test on particular hardware and networks. Docker, conversely, enabled extremely lightweight experiments, for instance with self-contained Kafka nodes where the focus was on interactive learning rather than longer-lived evaluations.

Some, but not all, of the contents of a test environment have a life beyond that of the environment. Test scripts, the ability to reproduce the test data and context, key results and lab notes tend to be worth preserving.

Key Considerations

  • what to keep and what to discard: We want ways to identify what’s important to preserve and what can be usefully and hygienically purged and freed.
  • timings: how soon do we need the environment, what’s the duration, and when do we expect it to be life-expired?
  • fidelity: how faithful and complete does the test environment need to be?
  • count: how many nodes are needed?
  • tool support: do the tools work effectively in the proposed runtime environment?

Further reading

Beaufort Scale of Testing Software

I have been wanting to find a way to describe the various forces and challenges we can apply when testing. One of the scales that appealed to me is the Beaufort Scale, which is used to assess and communicate wind speeds based on observation (rather than precise measurement). I propose we could use a similar scale for the test conditions we want to use when testing software and computer systems.

[Image: Beaufort Scale illustration using a house, a tree and a flag. Source: gcaptain.com]

What would I like to achieve by creating the scale?

What would I like to achieve when applying various scales as part of testing software and systems?

  • Devise ways to test software and systems at various values on the scale so we learn enough about the behaviours to make useful decisions we’re unlikely to regret.

Testing Beaufort Scale

Here is my initial attempt at describing a Testing Beaufort Scale:

  1. Calm: there’s nothing to disturb or challenge the software. Often it seems that much of the scripted testing that’s performed is done when the system is calm and nothing much else is happening: simply testing at a gentle, superficial level. (For the avoidance of doubt, the tester may often be stressed doing such boring testing :P)
  2. Light Air:
  3. Gentle Breeze:
  4. Gale:
  5. Hurricane: the focus changes from continuing to operate to protecting the safety and integrity of the system and data. Performance and functionality may be reduced to protect and enable essential services to be maintained/sustained while the forces are in operation.

Some examples

  • Network outages causing sporadic data loss
  • Database indexes cause slow performance and processing delays
  • Excessive message logging
  • Long running transactions (somewhat longer, or orders of magnitude longer).
  • Heavy workloads?
  • Loss and/or corruption in data, databases, indices, caches, etc.
  • Controlled system failover: instigated by humans, where the failover is planned and executed according to the plan. Here the focus isn’t on testing the failover itself; it’s on testing the systems and applications that use the resource being failed over.
  • Denial-of-Service: service is denied when demand significantly exceeds the available supply, often of system and service resources. Denials-of-service occur when external attackers flood systems with requests. They can also occur when a system is performing poorly (for instance while running heavy-duty backups or updates, or when indexes are corrupt or unavailable, etc.). Another source is when upstream server(s) are unavailable, causing queues to build up and fill, which puts pressure on handling requests from users or other systems and services. c.f. resource exhaustion.
  • In-flight configuration changes e.g. changing IP address, routing, database tables, system patching, …
  • n standard deviations from ‘default’ or ‘standard’ configurations of systems and software: the further the system is from a relatively well-known ‘norm’, the more likely it is to behave in unexpected and sometimes undesirable ways. For example, if a server is set to restart every minute, or only once every 5 years, instead of every week, how does it cope, how does it behave, and how does its behaviour affect the users (and peers) of the server? Example: a session timeout of 720 minutes vs the default of 30 minutes.
  • Special characters in parameters or values can cause problems, as can atypical values. c.f. data as code
  • Permission mismatches, abuses, and mal-configurations
  • Resource constraints may limit the abilities of a system to work, adapt and respond.
  • Increased latencies:
  • Unexpected system failover
  • Unimagined load (note: not unimaginable load or conditions, simply load the tester didn’t envisage might occur under various conditions)

What to consider measuring

Detecting the effects on the system: can we define measures and boundaries (similar to equivalence partitions)?

Lest we forget

From a broader perspective our work and our focus is often affected by the mood of customers, bosses, managers and stakeholders. These may raise our internal Beaufort Scale and adversely affect our ability to focus on our testing. Paradoxically being calm when the situation is challenging is a key success factor in achieving our mission and doing good work.

Cardboard systems and chocolate soldiers


Further reading

An attractive poster of the Beaufort Scale including how to set the sails on a sailing boat http://www.web-shops.net/earth-science/Beaufort_Wind_Force_Poster.htm

A light-hearted look at scale of conflicts in bar-rooms (pubs) http://www.shulist.com/2009/01/shulist-scale-of-conflict/

M. L. Cummings:

Tools for testing Kafka

Context

I realised software tools would help us to test Kafka in two key dimensions: performance and robustness. I started knowing little about the tools or their capabilities, although I was aware of jmeter, which I’d used briefly over a decade ago.

My initial aim was to find a way to generate and consume load. This load would then be the background for experimenting with robustness testing, to see how well the systems and the data replication would work and cope in inclement conditions. By inclement I mean varying degrees of adverse conditions, from poor network conditions through to compound error conditions where the ‘wrong’ nodes were taken down during an upgrade while the system was trying to catch up on a backlog of transactions, etc. I came up with the concept of a Beaufort Scale of environment conditions, which I’m writing about separately.

Robustness testing tools

My research very quickly led me to jepsen, the work of Kyle Kingsbury. He has some excellent videos available https://www.youtube.com/watch?v=NsI51Mo6r3o and https://www.youtube.com/watch?v=tpbNTEYE9NQ, together with opensource software tools and various articles https://aphyr.com/posts on testing various technologies https://jepsen.io/analyses, including Zookeeper and Kafka. Jay Kreps provided his perspective on the testing Kyle did http://blog.empathybox.com/post/62279088548/a-few-notes-on-kafka-and-jepsen, and Kyle’s work has helped guide various improvements to Kafka.

So far, so good…

However, when I started investigating the practical aspects of using jepsen to test newer versions of Kafka, I ran into a couple of significant challenges, for me at least. I couldn’t find the original scripts, and the ones I found https://github.com/gator1/jepsen/tree/master/kafka were in Clojure, an unfamiliar language, and targeted an older version of Kafka (0.10.2.0). More importantly, the project relied on docker. While docker is an incredibly powerful and fast tool, the client wasn’t willing to trust tests run in docker environments (also, our environment needed at least 20 instances to test the smallest configuration).

The next project to try was https://github.com/mbsimonovic/jepsen-python, written in a language I and other team members knew sufficiently well to try. However, we again ran into the blocker that it used docker. There seemed to be some potential to test clusters if we could get one of its dependencies, Blockade, to support docker swarm. I asked what it would take to add that support; quite a lot, according to one of the project team https://github.com/worstcase/blockade/issues/67.

By this point we needed to move on and focus on testing the performance, scalability and latency of Kafka, including inter-regional data replication, so we had to pause our search for automated tools or frameworks to control the robustness aspects of the testing.

Performance testing tools

For now I’ll try to keep focused on the tools rather than trying to define performance testing vs load testing, etc., as there are many disagreements about what distinguishes these and other related terms. We knew we needed ways to generate traffic patterns that ranged from simple text to ones that closely resembled the expected production traffic profiles.

Over the period of the project we discovered at least 5 candidates for generating load:

  1. kafkameter https://github.com/BrightTag/kafkameter
  2. pepper-box: https://github.com/GSLabDev/pepper-box
  3. kafka tools: https://github.com/apache/kafka
  4. sangrenel: https://github.com/jamiealquiza/sangrenel
  5. ducktape: https://github.com/confluentinc/ducktape

Both kafkameter and pepper-box integrated with jmeter. Of these, pepper-box was newer and was inspired by kafkameter. Also, kafkameter would clearly have needed significant work to suit our needs, so we started experimenting with pepper-box. We soon forked the project so we could easily experiment and add functionality without disturbing the parent project or delaying the testing we needed to do by waiting for approvals, etc.

I’ve moved the details of working with pepper-box to a separate blog post http://blog.bettersoftwaretesting.com/2018/04/working-with-pepper-box-to-test-kafka/

Things I wish we’d had more time for

There’s so much that could be done to improve the tooling and the approach to using testing tools to test Kafka. We needed to keep an immediate focus during the assignment in order to provide feedback and results quickly. Furthermore, the testing ran into the weeds, particularly during the early stages, and configuring Kafka correctly was extremely time-consuming for us as newbies to several of the technologies. We really wanted to run many more tests earlier in the project and establish regular, reliable and trustworthy test results.

There’s plenty of scope to improve both pepper-box and the analysis tools. Some of these have been identified on the respective github repositories.

What I’d explore next

The biggest immediate improvement, at least for me, would be to focus on using trustworthy statistical analysis tools such as R so we can automate more of the processing and graphing aspects of the testing.

Further topics

Here are topics I’d like to cover in future articles:

  • Managing and scaling the testing, particularly how to run many tests and keep track of the results, while keeping the environments healthy and clean.
  • Designing the producers and consumers so we could measure throughput and latency using the results collected by the consumer code.
  • Tradeoffs between fidelity of messages and the overhead of increased fidelity (hi-fi costs more than lo-fi at runtime).
  • Some of the many headwinds we faced in establishing trustworthy, reliable test environments and simply performing the testing

Testing Kafka: How I learned stuff

When I started my assignment to test Kafka I realised I had a vast range of topics to comprehend. During the assignment I made time to actively learn these topics, together with additional ones that emerged along the way, such as AWS.

This blog post introduces the various topics. It won’t go into detail on any of them (I may cover some in other blog posts); instead I’ll focus on how I learned stuff as part of this project.

  • Kafka: this is perhaps obvious as a topic; however, I needed to learn particular facets of Kafka related to its reliability, resilience and scalability, and find ways to monitor its behaviour. I also ended up learning how to write Kafka clients and how to implement and configure SASL_SSL security.
  • VMWare: VMWare Enterprise technologies would be used for some runtime environments. I hadn’t worked with VMWare for many years and decided to learn how to configure, monitor and run ESXi on several small yet sufficiently representative servers. This would enable me to both work in the client’s environment and also run additional tests independently and sooner than waiting for sufficient environments and VMs to be available on demand (corporates tend to move more slowly owing to internal processes and organisational structures).
  • How to ‘performance test’ Kafka: we had several ideas and had discovered that Kafka includes a utility for performance testing. We needed to understand how that utility generates messages and measures performance; it also might not be the most suitable tool for the project’s needs.
  • Ways to degrade performance of the system and expose flaws in the system that would adversely affect the value of using Kafka for the project. Disconnecting network cables, killing processes, etc. are pretty easy to do provided one has direct access to the machines and the test environment. However, we needed to be able to introduce more complex fault conditions and also be able to inject faults remotely.

These were the ones I knew about at the start of the assignment. Several more emerged, including:

  • Creating test environments in AWS. This included creating inter-regional peering between VPCs, creating Kafka and Zookeeper clusters, and creating multiple load generator and load consumer instances to execute the various performance and load tests. While there are various ‘quickstarts’, including one for deploying Kafka clusters https://github.com/aws-quickstart/quickstart-confluent-kafka, in the end we had to create our own clusters, bastion hosts and VPCs instead. The quickstart scripts failed frequently and the environment then needed additional cleaning up after they had failed.
  • Jepsen and other similar chaos-generation tools. Jepsen tested Kafka several major versions ago https://aphyr.com/posts/293-jepsen-kafka; the tools are available and opensource, but would they suit our environment and skill set?
  • Various opensource load generators, including 2 that integrated with jmeter, before we finally settled on modifying an opensource standalone load generator and writing a reciprocal load consumer.
  • Linux utilities, which we used extensively to work with the environments and the automated tests. Similarly, we wrote utility scripts to monitor, clean up and reset clusters and environments after some of the larger-volume load tests.
  • The nuances and effects of enabling topic auto-creation.
  • KIPs (Kafka Improvement Proposals)
  • Reporting and analysis: the client had particular expectations about what would be reported and how it would be presented. Some of the tools didn’t provide results at sufficient granularity, e.g. they only provided ‘averages’, and we needed to calibrate the tools so we could trust the numbers they emitted.

Note: The following is ordered by topic or concept rather than chronologically.

How I learned stuff related to testing Kafka

I knew from the outset I had a lot to learn. Therefore I decided to invest time and money even before the contract was signed, so I would be able to contribute immediately. Thankfully all the software was freely available and generally opensource (which meant we could read the code to help understand it, and even modify it to help us with the work and the learning).

Most of the learning was steeped in practice, where I and my colleagues would try things out in various test environments and learn by doing, observing and experimenting. (I’ll mention some of the test environments here and may cover aspects in more detail in other blog posts.)

I discovered, and for the first time appreciated, the range, depth and value of paying for online courses. While they’re not necessarily as good as participating in a commercial training course with instructors in the room and available for immediate advice, the range, price and availability were a revelation, and the financial cost of all the courses I bought was less than £65 ($90 USD).

Reading was also key: there are lots of blog posts and articles online, including several key articles from people directly involved with developing and testing Kafka. I also searched for academic articles that had been peer-reviewed. I only found a couple, which is a pity, as well-written academic research is an incredible complement to commercial and/or personal informal write-ups.

We spent lots of time reading source code and then modifying code we hoped would be suitable once it’d been adapted. Thankfully the client agreed we could contribute our non-confidential work in public and make it available under permissive opensource and creative commons licenses.

udemy

I first used udemy courses several years ago to try to learn about several technologies. At that time I didn’t get much value; however, the frequently discounted prices were low enough that I didn’t mind too much. In contrast, this time I found udemy courses to be incredibly valuable and relevant. The richness, range, and depth of courses available on udemy is incredible, and there are enough good-quality courses on relevant topics (particularly on Kafka, AWS, and to a lesser extent VMWare) to be able to make rapid, practical progress in learning aspects of these key topics.

I really appreciated being able to watch not only several introductory videos but also examples of more advanced topics from each of the potential matches I’d found. Watching these, which is free of charge, doesn’t take very long per course and enabled me to get a good feel for whether the presenter’s approach and material would be worthwhile for me, given what I knew and what I wanted to achieve.

I took the approach of paying for a course if I thought I’d learn at least a couple of specific items from that course. The cost is tiny compared to the potential value of the knowledge they unlock in my awareness and understanding.

Sometimes even the courses that seemed poorly done helped me to understand where concepts could easily be confused and/or poorly understood: if the presenter was confused, perhaps I would be too 🙂 That said, the most value came from the following courses, which were particularly relevant to me:

The first three courses are led by the same presenter, Stephane Maarek, who was particularly engaging. He was also helpful and responsive when I sent him questions via udemy’s platform.

Published articles and blog posts

I won’t list the articles or blog posts here: there are too many, and I doubt a plethora of links would help you much. In terms of learning, one of the key challenges was determining whether the articles were relevant to what I wanted to achieve with the versions of Kafka we were testing. For instance, many of the articles written before Kafka version 0.10 weren’t very relevant any more, and reproducing their tests and examples was sometimes too time-consuming to justify.

Also, the way the project wanted to use Kafka was seldom covered, and we discovered that some of the key configuration settings we needed to use vastly changed the behaviour of Kafka, which again meant many of the articles and blog posts didn’t apply directly.

I used a software tool called Zotero to manage my notes and references (I have been using it for several years as part of my PhD research) and have over 100 identified articles recorded there. I read many more articles during the assignment, perhaps as many as 1,000.

Academic research

The best article I found compares Kafka and RabbitMQ in an industrial research setting. There are several revisions available. The peer-reviewed article can be found at https://doi.org/10.1145/3093742.3093908, however you may need to pay for this edition unless you have a suitable subscription. The latest revision seems to be https://arxiv.org/abs/1709.00333v1, which is free to download and read.

Test environments

Here I’ll be brief; I plan to cover test environments in depth later on. Our test environments ranged from clusters of Raspberry Pis (replicating using MirrorMaker, etc.) and Docker containers to inexpensive physical rack-mount servers running ESXi and several AWS environments. Both Docker and AWS are frequently referenced elsewhere; for instance, the udemy Kafka course I mentioned earlier used AWS for its machines.

Seeking more robust and purposeful automated tests

I’ve recently been evaluating some of the automated tests for one of the projects I help with, the Kiwix Android app. We have a moderately loose collection of automated tests written using Android’s Espresso framework. The tests that interact with the external environment are prone to problems and failures for various reasons. We need these tests to be trustworthy in order to run them in the CI environment across a wider range of devices. For now we can’t, as these tests fail just over half the time. (Details are available in one of the issues being tracked by the project team: https://github.com/kiwix/kiwix-android/issues/283.)

The most common failure is in DownloadTest, followed by NetworkTest. From reading the code, we have a mix of silent continuations (where the test proceeds regardless of errors) and implicit expectations (of what’s on the server and the local device); these may well be major contributors to the failures of the tests. Furthermore, when a test fails, the error message tells us which line of code the test failed on but doesn’t help us understand the situation which caused the failure. At best we know an expectation (i.e. an assertion in the code) wasn’t met at run-time.

Meanwhile I’ve been exploring how Espresso is intended to be used and how much information it can provide about the state of the app via the app’s GUI. It seems the intended use is for it to keep information private: it checks, on behalf of the running test, whether an assertion holds true or not. However, perhaps we can encourage it to be more forthcoming and share information about what the GUI comprises and contains?

I’ll use these two tests (DownloadTest and NetworkTest) as worked examples where I’ll try to find ways to make these tests more robust and also more informative about the state of the server, the device, and the app.

Situations I’d like the tests to cope with:

  • One or more of the ZIM files are already on the local device: we don’t need to assume the device doesn’t have these files locally.
  • We can download any small ZIM file, not necessarily a predetermined ‘canned’ one.

Examples of information I’d like to ascertain:

  • How many files are already on the device, and details of these files
  • Details of ZIM files available from the server, including the filename and size

Possible approaches to interacting with Espresso

I’m going to assume you either know how Espresso works or are willing to learn about it, perhaps by writing some automated tests using it? 🙂 A good place to start is the Android Testing Codelab, freely available online.

Perhaps we could experiment with a question-or-query interface, where the automated test can ask questions of and elicit responses from Espresso; something akin to the Socratic Method. This isn’t intended to replace the current way of using Espresso and the Hamcrest matchers.
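
As a thought experiment, here’s one sketch of what such a query might look like: a custom ViewAction that copies a TextView’s text out of the matched view, so the test can ask what the GUI currently says rather than only asserting that it matches an expectation. CaptureText is my own hypothetical helper, not part of Espresso.

import static androidx.test.espresso.matcher.ViewMatchers.isAssignableFrom;

import android.view.View;
import android.widget.TextView;
import androidx.test.espresso.UiController;
import androidx.test.espresso.ViewAction;
import org.hamcrest.Matcher;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical helper: a ViewAction that captures a TextView's text for the test to inspect.
public final class CaptureText implements ViewAction {
    private final AtomicReference<String> captured = new AtomicReference<>();

    @Override
    public Matcher<View> getConstraints() {
        // Only run against views that actually are TextViews.
        return isAssignableFrom(TextView.class);
    }

    @Override
    public String getDescription() {
        return "capture the text of a TextView";
    }

    @Override
    public void perform(UiController uiController, View view) {
        captured.set(((TextView) view).getText().toString());
    }

    public String getText() {
        return captured.get();
    }
}

A test could then perform the action, e.g. CaptureText capture = new CaptureText(); onView(withId(R.id.some_label)).perform(capture); and afterwards read capture.getText() to decide what to do next (R.id.some_label being a placeholder for whichever view the test cares about).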

Who is the decision maker?

In popular opensource test automation frameworks, including junit and espresso (via hamcrest), the arbiter, or decision maker, is the assertion: the test passes information to the assertion and the assertion decides whether to allow the test to continue or to halt and abort the test. The author of the automated test can choose to write extra code to handle a rejection, but the test still doesn’t know the cause of the rejection. Here’s an example of part of the DownloadTest at the time of writing. The try/catch means the test will continue regardless of whether the click works.


onData(withContent("ray_charles")).inAdapterView(withId(R.id.library_list)).perform(click());

try {
    onView(withId(android.R.id.button1)).perform(click());
} catch (RuntimeException e) {
}

This code snippet exemplifies many espresso tests: a reader can determine certain details, such as the content the test is intended to click on, but there’s little clue what the second click is intended to do from the user’s perspective. What’s the button, what is the button ‘for’, and why would a click legitimately fail and yet the test be OK to continue?
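
As a sketch of one alternative (not the project’s actual fix), the second click could carry its intent in the code and leave a trace in the Android log (android.util.Log) when it is skipped; android.R.id.button1 is the positive button of a standard Android dialog, so the try/catch is really saying ‘confirm the dialog if one appeared’:

try {
    // Confirm the download in the dialog, if one is shown.
    onView(withId(android.R.id.button1)).perform(click());
} catch (RuntimeException e) {
    // Record why the click was skipped instead of silently swallowing it.
    Log.i("DownloadTest", "No confirmation dialog to click; continuing: " + e);
}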

Sometimes I’d like the test to be able to decide what to do depending on the actual state of the system. What would we like the test to do?

For a download test, perhaps another file would be as useful?

Increasing robustness of the tests

For me, a download test should focus on being able to test the download of a representative file, and be able to do so even if the expected file is already on the local device. We can decide what we’d like it to do in various circumstances, e.g. perhaps it could simply delete the local instance of a test file such as the one for Ray Charles? The ‘cost’ of re-downloading this file is tiny (at least compared to Wikipedia in English) if the user wants to have it on the device. Or, conversely, perhaps the test could leave the file on the device once it’s downloaded it, if the file was there before it started: a sort-of refresh of the content. (I’m aware there are potential side-effects if the contents have changed, or if the download fails.)

Would we like the automated test to retry a download if the download fails? If so, how often? And should the tests report failed downloads anywhere? I’ll cover logging shortly.

More purposeful tests

Tests often serve multiple purposes, such as:

  • Confidence: Having large volumes of tests ‘passing’ may provide confidence to project teams.
  • Feedback: Automated tests can provide fast, almost immediate, feedback on changes to the codebase. They can also be run on additional devices, configurations (e.g. locales), etc. to provide extra feedback about aspects of the app’s behaviours in these circumstances.
  • Information: tests can gather and present information such as response times, installation time, collecting screenshots (useful for updating them in the app store blurb), etc.
  • Early ‘warning’: for instance, of things that might go awry soon if volumes increase, conditions worsen, etc.
  • Diagnostics: tests can help us compare behaviours e.g. not only where, when, etc. does something fail? but also where, when, etc. does it work? The comparisons can help establish boundaries and equivalence partitions to help us hone in on problems, find patterns, and so on.

Test runners (e.g. junit) don’t encourage logging information, especially if the test completes ‘successfully’ (i.e. without unhandled exceptions or assertion failures). Logging is often used in the application code; it can also be used by the tests. As a good example, Espresso automatically logs all interactions to the Android log, which may help us (assuming we read the logs and pay attention to their contents) to diagnose aspects of how the tests are performing.
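
As a trivial sketch, a test can add its own entries to the same log, for example to record timing information even when everything passes (Log here is android.util.Log):

long start = System.currentTimeMillis();
onData(withContent("ray_charles")).inAdapterView(withId(R.id.library_list)).perform(click());
Log.i("DownloadTest", "Selecting the library item took " + (System.currentTimeMillis() - start) + " ms");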


Next Steps

This blog post is a snapshot of where I’ve got to. I’ll publish updates as I learn and discover more.

Further reading