Man and Machine in Perfect Harmony?

Ah, the bliss and eager joy when we can operate technology seamlessly and productively, making effective progress rather than mistakes, where the technology helps us make better-informed decisions. Sadly, this seldom happens in operations – when administering the software or when trying to address problems.

HCI for systems software and tools

Testing the operating procedures, and the tools and utilities to configure, administer, safeguard, diagnose and recover, may be some of the most important testing we do. The context, including emotional and physical aspects, is an important consideration and may make the difference between performing the desired activity and exacerbating problems and confusion. For instance, is the user tired, distracted, working remotely, under stress? Each of these may increase the risk of more, and larger, mistakes.

Usability testing can help us consider and design aspects of the testing. For instance, how well do the systems software and tools enable people to complete tasks effectively, efficiently and satisfactorily?

Standard Operating Procedures

Standard Operating Procedures (SOPs) can help people and organisations to deliver better outcomes with fewer errors. For a recent assignment testing Kafka, the testing needed to include the suitability of the SOPs, for instance to determine the chances of someone making an inadvertent mistake that caused a system failure or compounded an existing challenge or problem.

Testing recovery is also relevant, and there may be many forms of recovery. We want and expect most recovery scenarios to be included in the SOPs, and for those SOPs to be trustworthy. Recovery may be for a particular user or organisation (people / business centred) and/or technology centred, e.g. recovering at-risk machine instances in a cluster of servers.

OpsDev & DevOps

OpsDev and DevOps may help improve the understanding and communication between development and operations roles and foci. They aren’t sufficient by themselves.

Disposable test environments

Disposable:

  • “readily available for the owner’s use as required”
  • “intended to be thrown away after use”

https://en.oxforddictionaries.com/definition/disposable

For many years test environments were hard to obtain, maintain, and update. Technologies including Virtual Machines and Containers reduce the effort, cost, and potentially the resources needed to provide test environments as and when required. Picking the most appropriate for a particular testing need is key.

For a recent project, to test Kafka, we needed a range of test environments, from lightweight ephemeral self-contained environments to those that involved tens of machines distributed at least 100 km apart. Most of our testing for Kafka used AWS to provide the computer instances and connectivity, where environments were useful for days to weeks. However we also used ESXi and Docker images. We used ESXi when we wanted to test on particular hardware and networks. Docker, conversely, enabled extremely lightweight experiments, for instance with self-contained Kafka nodes where the focus was on interactive learning rather than longer-lived evaluations.
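
As an aside, for Docker-based experiments something like the Testcontainers Kafka module can start a disposable single-node broker from within a test. A minimal sketch (the image tag and versions are illustrative, not necessarily what we used):

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;
import org.testcontainers.containers.KafkaContainer;
import org.testcontainers.utility.DockerImageName;

import java.util.Properties;

public class DisposableKafkaSketch {
    public static void main(String[] args) throws Exception {
        // Start a throwaway single-node Kafka broker in Docker; it is destroyed at the end of the try block.
        try (KafkaContainer kafka = new KafkaContainer(DockerImageName.parse("confluentinc/cp-kafka:7.4.0"))) {
            kafka.start();

            Properties props = new Properties();
            props.put("bootstrap.servers", kafka.getBootstrapServers());
            props.put("key.serializer", StringSerializer.class.getName());
            props.put("value.serializer", StringSerializer.class.getName());

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // Interactive learning: send a message and observe what happens.
                producer.send(new ProducerRecord<>("experiment-topic", "key", "hello")).get();
            }
        } // The container, and everything in it, is disposed of here.
    }
}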

Some, though not all, of the contents of a test environment have a life beyond that of the environment. Test scripts, the ability to reproduce the test data and context, key results, and lab notes tend to be worth preserving.

Key Considerations

  • what to keep and what to discard: We want ways to identify what’s important to preserve and what can be usefully and hygienically purged and freed.
  • timings: how soon do we need the environment, what’s the duration, and when do we expect it to be life-expired?
  • fidelity: how faithful and complete does the test environment need to be?
  • count: how many nodes are needed?
  • tool support: do the tools work effectively in the proposed runtime environment?

Beaufort Scale of Testing Software

I have been wanting to find a way to describe the various forces and challenges we can apply when testing. One of the scales that appealed to me is the Beaufort Scale, which is used to assess and communicate wind speeds based on observation (rather than precise measurement). I propose we could use a similar scale for the test conditions we want to use when testing software and computer systems.

[Image: illustration of the Beaufort Scale using a house, a tree and a flag – source: gcaptain.com]

What would I like to achieve by creating such a scale and applying it when testing software and systems?

  • Devise ways to test software and systems at various values on the scale so we learn enough about the behaviours to make useful decisions we’re unlikely to regret.

Testing Beaufort Scale

Here is my initial attempt at describing a Testing Beaufort Scale:

  1. Calm: there’s nothing to disturb or challenge the software. Much of the scripted testing that’s performed seems to be done when the system is calm: nothing much else is happening, and the testing is gentle and superficial. (For the avoidance of doubt, the tester may often be stressed doing such boring testing :P)
  2. Light Air:
  3. Gentle Breeze:
  4. Gale:
  5. Hurricane: The focus changes from continuing to operate to one that protects the safety and integrity of the system and data. Performance and functionality may be reduced to protect and enable essential services to be maintained/sustained while the forces are in operation.

Some examples

  • Network outages causing sporadic data loss
  • Database index problems causing slow performance and processing delays
  • Excessive message logging
  • Long running transactions (somewhat longer, or orders of magnitude longer).
  • Heavy workloads?
  • Loss and/or corruption in data, databases, indices, caches, etc.
  • Controlled system failover: instigated by humans, where the failover is planned and executed according to the plan – here the focus isn’t on testing the failover itself, it’s on testing the systems and applications that use the resource being failed over.
  • Denial-of-Service: service is denied when demand significantly exceeds the available supply, often of system and service resources. Denials-of-service occur when external attackers flood systems with requests. They can also occur when a system is performing poorly (for instance while running heavy-duty backups or updates, or when indexes are corrupt or unavailable). Another source is when upstream servers are unavailable, causing queues to build up and fill, which puts pressure on handling requests from users or other systems and services. c.f. resource exhaustion.
  • In-flight configuration changes e.g. changing IP address, routing, database tables, system patching, …
  • n Standard Deviations from ‘default’ or ‘standard’ configurations of systems and software. The further the system is from a relatively well-known ‘norm’, the more likely it is to behave in unexpected and sometimes undesirable ways. e.g. if a server is set to restart every minute, or only once every 5 years, instead of every week, how does it cope, how does it behave, and how does its behaviour affect the users (and peers) of the server? Example: a session timeout of 720 minutes vs the default of 30 minutes.
  • Special characters in parameters or values can cause problems, as can atypical values. c.f. data as code
  • Permission mismatches, abuses, and mal-configurations
  • Resource constraints may limit the abilities of a system to work, adapt and respond.
  • Increased latencies: e.g. artificial delays added to network links (see the sketch after this list)
  • Unexpected system failover
  • Unimagined load (note: not unimaginable load or conditions, simply what the tester didn’t envisage might be the load under various conditions)
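
As an illustration of how some of these conditions might be applied, here’s a rough sketch (assuming a Linux host, root access, and the tc/netem utilities; the interface name and delay are placeholders) that adds and removes artificial network latency from a Java test harness:

import java.io.IOException;

/** Rough sketch: inject artificial network latency on a Linux host using tc/netem. */
public class LatencyInjector {

    private static void run(String... command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command).inheritIO().start();
        if (process.waitFor() != 0) {
            throw new IllegalStateException("Command failed: " + String.join(" ", command));
        }
    }

    /** Add a fixed delay (e.g. 200ms) to all outbound traffic on the given interface. */
    public static void addDelay(String networkInterface, String delay) throws Exception {
        run("sudo", "tc", "qdisc", "add", "dev", networkInterface, "root", "netem", "delay", delay);
    }

    /** Remove the artificial delay again so later tests start from a calm baseline. */
    public static void removeDelay(String networkInterface) throws Exception {
        run("sudo", "tc", "qdisc", "del", "dev", networkInterface, "root", "netem");
    }

    public static void main(String[] args) throws Exception {
        addDelay("eth0", "200ms");   // raise the 'wind speed' for the duration of a test
        try {
            // ... run the load or replication test here ...
        } finally {
            removeDelay("eth0");     // always restore calm conditions afterwards
        }
    }
}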

What to consider measuring

How do we detect the effects on the system? Can we define measures and boundaries (similar to equivalence partitions) that indicate when the system has moved from one level of the scale to the next?

Lest we forget

From a broader perspective, our work and our focus are often affected by the mood of customers, bosses, managers and stakeholders. These may raise our internal Beaufort Scale and adversely affect our ability to focus on our testing. Paradoxically, being calm when the situation is challenging is a key success factor in achieving our mission and doing good work.

Cardboard systems and chocolate soldiers

 

Further reading

An attractive poster of the Beaufort Scale including how to set the sails on a sailing boat http://www.web-shops.net/earth-science/Beaufort_Wind_Force_Poster.htm

A light-hearted look at scale of conflicts in bar-rooms (pubs) http://www.shulist.com/2009/01/shulist-scale-of-conflict/

M. L. Cummings:

Tools for testing Kafka

Context

I realised software tools would help us to test Kafka in two key dimensions: performance and robustness. I started from knowing little about the tools or their capabilities, although I was aware of jmeter, which I’d used briefly over a decade ago.

My initial aim was to find a way we could generate and consume load. This load would then be the background for experimenting with robustness testing to see how well the systems and the data replication would work and cope in inclement conditions. By inclement I mean varying degrees of adverse conditions such as poor network conditions through to compound error conditions where the ‘wrong’ nodes were taken down during an upgrade while the system was trying to catch up from a backlog of transactions, etc. I came up with a concept of the Beaufort Scale of environment conditions which I’m writing about separately.

Robustness testing tools

My research very quickly led me to jepsen, the work of Kyle Kingsbury. He has some excellent videos available https://www.youtube.com/watch?v=NsI51Mo6r3o and https://www.youtube.com/watch?v=tpbNTEYE9NQ together with opensource software tools and various articles https://aphyr.com/posts on testing various technologies https://jepsen.io/analyses including Zookeeper and Kafka. Jay Kreps provided his perspective on the testing Kyle did http://blog.empathybox.com/post/62279088548/a-few-notes-on-kafka-and-jepsen and Kyle’s work has helped guide various improvements to Kafka.

So far, so good…

However, when I started investigating the practical aspects of using jepsen to test newer versions of Kafka I ran into a couple of significant challenges, for me at least. I couldn’t find the original scripts, and the ones I found https://github.com/gator1/jepsen/tree/master/kafka were written in Clojure, an unfamiliar language, and targeted an older version of Kafka (0.10.2.0). More importantly, they relied on docker. While docker is an incredibly powerful and fast tool, the client wasn’t willing to trust tests run in docker environments (and our environment needed at least 20 instances to test the smallest configuration).

The next project to try was https://github.com/mbsimonovic/jepsen-python, written in Python, a language I and other team members knew sufficiently well to try. However, we again ran into the blocker that it used docker. There seemed to be some potential to test clusters if we could get one of its dependencies, Blockade, to support docker swarm. I asked what it would take to add that support; quite a lot, according to one of the project team https://github.com/worstcase/blockade/issues/67.

By this point we needed to move on and focus on testing the performance, scalability and latency of Kafka, including inter-regional data replication, so we had to pause our search for automated tools or frameworks to control the robustness aspects of the testing.

Performance testing tools

For now I’ll try to keep focused on the tools rather than trying to define performance testing vs load testing, etc. as there are many disagreements on what distinguishes these terms and other related terms. We knew we needed ways to generate traffic patterns that ranged from simple text to ones that closely resembled the expected production traffic profiles.

Over the period of the project we discovered at least 5 candidates for generating load:

  1. kafkameter https://github.com/BrightTag/kafkameter
  2. pepper-box: https://github.com/GSLabDev/pepper-box
  3. kafka tools: https://github.com/apache/kafka
  4. sangrenel: https://github.com/jamiealquiza/sangrenel
  5. ducktape: https://github.com/confluentinc/ducktape

Both kafkameter and pepper-box integrated with jmeter. Of these, pepper-box was newer and was inspired by kafkameter. Also, kafkameter would clearly have needed significant work to suit our needs, so we started experimenting with pepper-box. We soon forked the project so we could easily experiment and add functionality without disturbing the parent project or delaying the testing we needed to do by waiting for approvals, etc.

pepper-box and jmeter

Working with jmeter to test a non-web protocol (i.e. Kafka) ended up taking significant effort and time: we spent much of it learning about writing the plugin code, configuring jmeter, and running the tests.

Thankfully as there were both kafkameter and pepper-box we were able to learn lots from various articles as well as the source code. Key articles include:

The blazemeter article even included an example consumer script written in Groovy. We ended up extending this script significantly and making it available as part of our fork of pepper-box (since it didn’t seem sensible to create a separate project for this script) https://github.com/commercetest/pepper-box/blob/master/src/groovyscripts/kafka-consumer-timestamp.groovy 
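
The essence of the consumer-side measurement is to compare a timestamp carried by each record with the time it’s consumed. Here’s a minimal Java sketch of that idea (not our Groovy script; bootstrap servers and topic name are placeholders, and it uses the Kafka record’s own timestamp rather than one embedded in the payload):

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class LatencyMeasuringConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");      // placeholder
        props.put("group.id", "latency-measurement");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("perf-test-topic")); // placeholder topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                long now = System.currentTimeMillis();
                for (ConsumerRecord<String, String> record : records) {
                    // Latency from when the record was timestamped (producer or broker) to consumption.
                    long latencyMs = now - record.timestamp();
                    System.out.printf("offset=%d latencyMs=%d%n", record.offset(), latencyMs);
                }
            }
        }
    }
}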

As ever there was lots of other reading and experimentation to be able to reliably and consistently develop the jmeter plugins. Lowlights included needing to convince both the machine and maven that we actually needed to use Java 8.

A key facet of the work was adding support to both the producer and consumer code so they could be used with clusters configured with and without security, in particular SASL_SSL. The code is relatively easy to write, but debugging issues with it was very time-consuming, especially as we had to test in a variety of environments, each with a different configuration, and none of the team had configured Kafka with SASL_SSL before the project started.
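
For reference, the client-side part of SASL_SSL is mostly configuration. A hedged sketch of the kind of producer properties involved (all hostnames, credentials and file paths are placeholders; the exact settings depend on the cluster’s configuration and the SASL mechanism chosen):

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class SaslSslProducerConfigSketch {
    public static KafkaProducer<String, String> createProducer() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1.example.com:9093");       // placeholder
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        // Security settings for a cluster that requires SASL over TLS.
        props.put("security.protocol", "SASL_SSL");
        props.put("sasl.mechanism", "PLAIN");                             // or SCRAM-SHA-256, etc.
        props.put("sasl.jaas.config",
                "org.apache.kafka.common.security.plain.PlainLoginModule required "
                        + "username=\"testuser\" password=\"test-password\";");   // placeholders
        props.put("ssl.truststore.location", "/path/to/client.truststore.jks");   // placeholder
        props.put("ssl.truststore.password", "truststore-password");              // placeholder

        return new KafkaProducer<>(props);
    }
}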

We ran into multiple issues related to the environments, getting the Kafka clusters to stay healthy, and getting the replication to happen without major delays. I may be able to cover some of the details in subsequent articles. We also realised that the pepper-box java sampler (as such plugins are called in jmeter terminology) used lots of CPU, and we needed to start running load generators and consumers in parallel.

Standalone pepper-box

We eventually discovered the combination of jmeter and the pepper-box sampler was maxing out and unable to generate the loads we wanted in order to test various aspects of the performance and latency. Thankfully the original creators of pepper-box had provided a standalone load generation utility which was able to generate significantly higher loads. We had to trade off the extra performance against the various capabilities of jmeter and the many plugins that have been developed for it over the years: we’d have to manage synchronisation of load generators on multiple machines ourselves, and so on.

The next challenge was to decide whether to develop an equivalent standalone consumer ourselves. In the end we did, partly as jmeter had lost credibility with the client so it wasn’t viable to continue using the current consumer.

Developing a pepper-box consumer

 

Things I wish we’d had more time for

What I’d explore next

Further topics

Here are topics I’d like to cover in future articles:

  • Managing and scaling the testing, particularly how to run many tests and keep track of the results, while keeping the environments healthy and clean.
  • Designing the producers and consumers so we could measure throughput and latency using the results collected by the consumer code.
  • Tradeoffs between fidelity of messages and the overhead of increased fidelity (hi-fi costs more than lo-fi at runtime).
  • Some of the many headwinds we faced in establishing trustworthy, reliable test environments and simply performing the testing

Testing Kafka: How I learned stuff

When I started my assignment to test Kafka I realised I’d got a vast range of topics to comprehend. I made time to actively learn these topics, together with additional ones that emerged during the assignment, such as AWS.

This blog post introduces the various topics. It won’t go into detail on any of them (I may cover some in other blog posts); instead I’ll focus on how I learned stuff as part of this project.

  • Kafka: this is perhaps obvious as a topic, however I needed to learn particular facets of Kafka related to its reliability, resilience and scalability, and find ways to monitor its behaviour. I also ended up learning how to write Kafka clients and how to implement and configure SASL_SSL security.
  • VMWare: VMWare Enterprise technologies would be used for some runtime environments. I hadn’t worked with VMWare for many years and decided to learn how to configure, monitor and run ESXi on several small yet sufficiently representative servers. This would enable me to both work in the client’s environment and also run additional tests independently and sooner than waiting for sufficient environments and VMs to be available on demand (corporates tend to move more slowly owing to internal processes and organisational structures).
  • How to ‘performance test’ Kafka: we had several ideas and had discovered Kafka includes a utility to ‘performance test’. We needed to understand how that utility generates messages and measures performance. Also, it might not be the most suitable tool for the project’s needs.
  • Ways to degrade performance of the system and expose flaws in the system that would adversely affect the value of using Kafka for the project. Disconnecting network cables, killing processes, etc. are pretty easy to do provided one has direct access to the machines and the test environment. However, we needed to be able to introduce more complex fault conditions and also be able to inject faults remotely.

These were the ones I knew about at the start of the assignment. Several more emerged, including:

  • Creating test environments in AWS. This included creating inter-regional peering between VPCs, creating Kafka and Zookeeper clusters, and multiple load generator and load consumer instances to execute the various performance and load tests. While there are various ‘quickstarts’, including one for deploying Kafka clusters https://github.com/aws-quickstart/quickstart-confluent-kafka, in the end we had to create our own clusters, bastion hosts and VPCs. The quickstart scripts failed frequently and the environment then needed additional cleaning up after they had failed.
  • Jepsen and other similar chaos-generation tools. Jepsen tested Kafka several major versions ago https://aphyr.com/posts/293-jepsen-kafka; the tools are available and opensource, but would they suit our environment and skill set?
  • Various opensource load generators, including two that integrated with jmeter, before we finally settled on modifying an opensource standalone load generator and writing a reciprocal load consumer.
  • Linux utilities: we used these extensively to work with the environments and the automated tests. Similarly, we wrote utility scripts to monitor, clean up and reset clusters and environments after some of the larger volume load tests.
  • The nuances and effects of enabling topic auto-creation.
  • KIPs (Kafka Improvement Proposals):
  • Reporting and Analysis: the client had particular expectations of what would be reported and how it would be presented. Some of the tools didn’t provide results with sufficient granularity, e.g. they only provided ‘averages’, and we needed to calibrate the tools so we could trust the numbers they emitted.

Note: The following is ordered by topic or concept rather than chronologically.

How I learned stuff related to testing Kafka

I knew I had a lot to learn from the outset. Therefore I decided to invest time and money even before the contract was signed so I would be able to contribute immediately. Thankfully all the software was freely available and generally opensource (which meant we could read the code to help understand it, and even modify it to help us with the work and the learning).

Most of the learning was steeped in practice, where I and my colleagues would try things out in various test environments and learn by doing, observing and experimenting. (I’ll mention some of the test environments here and may cover aspects in more detail in other blog posts.)

I discovered, and for the first time appreciated, the range, depth and value of paying for online courses. While they’re not necessarily as good as participating in a commercial training course with instructors in the room and available for immediate advice, the range, price and availability were a revelation, and the financial cost of all the courses I bought was less than £65 ($90 USD).

Reading was also key: there are lots of blog posts and articles online, including several key articles from people directly involved with developing and testing Kafka. I also searched for academic articles that had been peer-reviewed. I only found a couple – a pity, as well-written academic research is an incredible complement to commercial and/or personal informal write-ups.

We spent lots of time reading source code and then modifying code we hoped would be suitable once it’d been adapted. Thankfully the client agreed we could contribute our non-confidential work in public and make it available under permissive opensource and creative commons licenses.

udemy

I first used udemy courses several years ago to try to learn about several technologies. At that time I didn’t get much value, however the frequently discounted prices were low enough that I didn’t mind too much. In contrast, this time I found udemy courses to be incredibly valuable and relevant. The richness, range, and depth of courses available on udemy is incredible, and there are enough good quality courses available on relevant topics (particularly on Kafka, AWS, and to a lesser extent VMWare) to be able to make rapid, practical progress in learning aspects of these key topics.

I really appreciated being able to watch not only several introductory videos but also examples of more advanced topics from each of the potential matches I’d found. Watching these, which is free of charge, didn’t take very long per course and enabled me to get a good feel for whether the presenter’s approach and material would be worthwhile to me given what I knew and what I wanted to achieve.

I took the approach of paying for a course if I thought I’d learn at least a couple of specific items from that course. The cost is tiny compared to the potential value of the knowledge they unlock in my awareness and understanding.

Sometimes even the courses that seemed poorly done helped me to understand where concepts could be easily confused and/or poorly understood. If the presenter was confused – perhaps I would be too 🙂 That said, the most value came from the following courses, which were particularly relevant to me:

The first three courses are led by the same presenter, Stephane Maarek. He was particularly engaging. He was also helpful and responsive when I sent him questions via udemy’s platform.

Published articles and blog posts

I won’t list the articles or blog posts here. There are too many and I doubt a plethora of links would help you much. In terms of learning, some of the key challenges were in determining whether the articles were relevant to what I wanted to achieve with the versions of Kafka we were testing. For instance, many of the articles written before Kafka version 0.10 weren’t very relevant any more, and reproducing their tests and examples was sometimes too time-consuming to justify.

Also, the way the project wanted to use Kafka was seldom covered, and we discovered that some of the key configuration settings we needed to use vastly changed the behaviour of Kafka, which again meant many of the articles and blog posts didn’t apply directly.

I used a software tool called Zotero to manage my notes and references (I have been using it for several years as part of my PhD research) and have over 100 identified articles recorded there. I read many more articles during the assignment, perhaps as many as 1,000.

Academic research

The best article I found compares Kafka and RabbitMQ in an industrial research setting. There are several revisions available. The peer-reviewed article can be found at https://doi.org/10.1145/3093742.3093908, however you may need to pay for this edition unless you have a suitable subscription. The latest revision seems to be https://arxiv.org/abs/1709.00333v1 which is free to download and read.

Test environments

Here I’ll be brief. I plan to cover test environments in depth later on. Our test environments ranged from clusters of Raspberry Pis (replicating using MirrorMaker, etc.) and Docker containers, through inexpensive physical rack-mount servers running ESXi, to several AWS environments. Both Docker and AWS are frequently referenced elsewhere; for instance, the udemy Kafka course I mentioned earlier used AWS to provide its machines.

Seeking more robust and purposeful automated tests

I’ve recently been evaluating some of the automated tests for one of the projects I help, the Kiwix Android app. We have a moderately-sized, loose collection of automated tests written using Android’s Espresso framework. The tests that interact with the external environment are prone to problems and failures for various reasons. We need these tests to be trustworthy in order to run them in the CI environment across a wider range of devices. For now we can’t, as these tests fail just over half the time. (Details are available in one of the issues being tracked by the project team: https://github.com/kiwix/kiwix-android/issues/283.)

The most common failure is in DownloadTest, followed by NetworkTest. From reading the code, we have a mix of silent continuations (where the test proceeds regardless of errors) and implicit expectations (of what’s on the server and the local device); these may well be major contributors to the failures of the tests. Furthermore, when a test fails the error message tells us which line of code the test failed on but doesn’t help us understand the situation which caused the test to fail. At best we know an expectation wasn’t met at run-time (i.e. an assertion in the code failed).

Meanwhile I’ve been exploring how Espresso is intended to be used and how much information it can provide about the state of the app via the app’s GUI. It seems that the intended use is for Espresso to keep information private: it checks, on behalf of the running test, whether an assertion holds true or not. However, perhaps we can encourage it to be more forthcoming and share information about what the GUI comprises and contains?

I’ll use these two tests (DownloadTest and NetworkTest) as worked examples where I’ll try to find ways to make these tests more robust and also more informative about the state of the server, the device, and the app.

Situations I’d like the tests to cope with:

  • One or more of the ZIM files are already on the local device: we don’t need to assume the device doesn’t have these files locally.
  • We can download any small ZIM file, not necessarily a predetermined ‘canned’ one.

Examples of information I’d like to ascertain:

  • How many files are already on the device, and details of these files
  • Details of ZIM files available from the server, including the filename and size

Possible approaches to interacting with Espresso

I’m going to assume you either know how Espresso works or are willing to learn about it – perhaps by writing some automated tests using it? 🙂 A good place to start is the Android Testing Codelab, freely available online.

Perhaps we could experiment with a question or query interface where the automated test can ask questions and elicit responses from Espresso. Something akin to the Socratic Method? This isn’t intended to replace the current way of using Espresso and the Hamcrest Matchers.
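
One possible starting point is sketched below: a ViewAction that reads information out of a view instead of only asserting against it, so the test can decide what to do with the answer. This is an experiment of mine rather than an official Espresso pattern; the class and id names are hypothetical, and androidx package names are shown (the older support-library packages differ only in the import prefix).

import android.view.View;
import android.widget.TextView;

import androidx.test.espresso.UiController;
import androidx.test.espresso.ViewAction;

import org.hamcrest.Matcher;

import java.util.concurrent.atomic.AtomicReference;

import static androidx.test.espresso.matcher.ViewMatchers.isAssignableFrom;

/** A 'question' the test can ask: what text does this view currently show? */
public class GetTextQuestion implements ViewAction {
    private final AtomicReference<String> answer = new AtomicReference<>();

    @Override
    public Matcher<View> getConstraints() {
        return isAssignableFrom(TextView.class);
    }

    @Override
    public String getDescription() {
        return "read the text from a TextView";
    }

    @Override
    public void perform(UiController uiController, View view) {
        answer.set(((TextView) view).getText().toString());
    }

    public String getAnswer() {
        return answer.get();
    }
}

// Possible use inside a test (hypothetical view id):
//   GetTextQuestion question = new GetTextQuestion();
//   onView(withId(R.id.zim_file_size)).perform(question);
//   String size = question.getAnswer();   // the test can now branch on the actual value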

Who is the decision maker?

In popular opensource test automation frameworks, including junit and espresso (via hamcrest), the arbiter, or decision maker, is the assertion: the test passes information to the assertion and it decides whether to allow the test to continue or to halt and abort the test. The author of the automated test can choose whether to write extra code to handle any rejection, but it still doesn’t know the cause of the rejection. Here’s an example of part of the DownloadTest at the time of writing. The try/catch means the test will continue regardless of whether the click works.


// Click on the 'ray_charles' entry in the library list to start the download.
onData(withContent("ray_charles")).inAdapterView(withId(R.id.library_list)).perform(click());

try {
    // android.R.id.button1 is the positive button of a standard Android dialog –
    // but which dialog, and why it is expected here, isn't clear from the test.
    onView(withId(android.R.id.button1)).perform(click());
} catch (RuntimeException e) {
    // Silent continuation: if the click fails the test carries on regardless.
}

This code snippet exemplifies many espresso tests, where a reader can determine certain details such as the content the test is intended to click on; however, there’s little clue what the second click is intended to do from the user’s perspective – what’s the button, what’s the button ‘for’, and why would a click legitimately fail and yet the test be OK to continue?

Sometimes I’d like the test to be able to decide what to do depending on the actual state of the system. What would we like the test to do?

For a download test, perhaps another file would be as useful?

Increasing robustness of the tests

For me, a download test should focus on being able to test the download of a representative file and be able to do so even if the expected file is already on the local device. We can decide what we’d like it to do in various circumstances, e.g. perhaps it could simply delete the local instance of a test file such as the one for Ray Charles? The ‘cost’ of re-downloading this file is tiny (at least compared to Wikipedia in English) if the user wants to have this file on the device. Or conversely, perhaps the test could leave the file on the device once it’s downloaded it, if the file was there before it started – a sort-of refresh of the content… (I’m aware there are potential side-effects if the contents have been changed, or if the download fails.)

Would we like the automated test to retry a download if the download fails? If so, how often? And should the tests report failed downloads anywhere? I’ll cover logging shortly.

More purposeful tests

Tests often serve multiple purposes, such as:

  • Confidence: Having large volumes of tests ‘passing’ may provide confidence to project teams.
  • Feedback: Automated tests can provide fast, almost immediate, feedback on changes to the codebase. They can also be run on additional devices, configurations (e.g. locales), etc. to provide extra feedback about aspects of the app’s behaviours in these circumstances.
  • Information: tests can gather and present information such as response times, installation time, collecting screenshots (useful for updating them in the app store blurb), etc.
  • Early ‘warning’: for instance, of things that might go awry soon if volumes increase, conditions worsen, etc.
  • Diagnostics: tests can help us compare behaviours e.g. not only where, when, etc. does something fail? but also where, when, etc. does it work? The comparisons can help establish boundaries and equivalence partitions to help us hone in on problems, find patterns, and so on.

Test runners (e.g. junit) don’t encourage logging information, especially if the test completes ‘successfully’ (i.e. without unhandled exceptions or assertion failures). Logging is often used in the application code; it can also be used by the tests. As a good example, Espresso logs all interactions automatically to the Android log, which may help us (assuming we read the logs and pay attention to their contents) to diagnose aspects of how the tests are performing.
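
Nothing stops a test from adding its own entries alongside Espresso’s, though. Even something as small as the sketch below (the tag, helper and example values are hypothetical) records context about what the test observed, whether or not an assertion later fails:

import android.util.Log;

public class TestDiagnostics {
    private static final String TAG = "KiwixTestDiagnostics";   // hypothetical tag

    /** Record context about what the test observed, pass or fail, in the Android log. */
    public static void report(String testName, String observation) {
        Log.i(TAG, testName + ": " + observation);
    }
}

// Example use in a test (localZimFiles is a hypothetical variable):
//   TestDiagnostics.report("DownloadTest", "ZIM files already on device: " + localZimFiles.size());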

 

Next Steps

This blog post is a snapshot of where I’ve got to. I’ll publish updates as I learn and discover more.

Damn you auto-create

Inspired by the entertaining http://www.damnyouautocorrect.com/ web site, here are some thoughts on the benefits and challenges of having auto-create enabled for Kafka topics https://kafka.apache.org/documentation/#brokerconfigs

`auto.create.topics.enable`

At first, auto-create seems like a convenience, a blessing, as it means developers don’t need to write code to explicitly create topics. For a particular project the developers can focus on using the system as a service to share user-specified sets of data rather than writing extra code to interact with Zookeeper, etc. (newer releases of Kafka include the AdminClient API which deals with the Zookeeper aspects).

Effects of relying on auto-create: topics are created with the default (configured) partition and replication counts. These may not be ideal for a given topic and its intended use(s).

Adverse impacts of using auto-create

Deleting topics: The project uses Confluent Replicator to replicate data from one Kafka cluster to another. As part of our testing lots of topics were created. We wanted to delete some of these topics but discovered they were virtually impossible to kill, as the combination of Confluent Replicator and the Kafka clusters resurrected the topics before they could be fully expunged. This caused almost endless frustration and adversely affected our testing, as we couldn’t get the environment sufficiently clean to run tests in controlled circumstances (Replicator was busy servicing the defunct topics, which limited its ability to focus on the topics we wanted to replicate in particular tests).

Coping with delays and problems creating topics: At a less complex level, auto-creation takes a while to complete and seems to happen in the background. When the tests (and the application software) tried to write to the topic immediately, various problems occurred from time to time. Knowing that problems can occur is useful in terms of performance, reliability, etc. However, it complicates the operational aspects of the system, especially as the errors affect producing data (what the developers and users think is happening) rather than the orthogonal aspect of creating a topic so that data can be produced.

Lack of clarity or traceability on who (or what) created topics: Topics could be auto-created when code tried to write (produce), which was more-or-less what we expected. However, they could also be auto-created by trying to read (consume). The Replicator duly set up replication for that topic. For various reasons topics could be created on one or more clusters with the same name; and replication happened both locally (within a Kafka cluster) and to another cluster. We ended up with a mess of topics on various clusters, which was compounded by the challenges of cleaning up (deleting) the various topics. It ended up feeling like we were living through the after-effects of the Sorcerer’s Apprentice!

From a testing perspective

From a testing perspective, we ended up adding code to our consumer that checked and waited for the topic to appear in Zookeeper before trying to read from it. This, at least, reduced some of the confusion and enabled us to unambiguously measure the propagation time for Confluent Replicator for topics it needed to replicate.
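
Our check polled Zookeeper; a comparable check using the Kafka AdminClient might look like the following sketch (bootstrap servers, topic name and timings are placeholders):

import org.apache.kafka.clients.admin.AdminClient;

import java.util.Properties;

public class TopicWaiter {
    /** Poll until the named topic exists, or give up after the timeout. */
    public static boolean waitForTopic(AdminClient admin, String topic, long timeoutMs) throws Exception {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (System.currentTimeMillis() < deadline) {
            if (admin.listTopics().names().get().contains(topic)) {
                return true;
            }
            Thread.sleep(500);   // placeholder polling interval
        }
        return false;
    }

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder
        try (AdminClient admin = AdminClient.create(props)) {
            boolean ready = waitForTopic(admin, "replicated-topic", 30_000);   // placeholder topic
            System.out.println("Topic ready: " + ready);
        }
    }
}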

We also wrote some code that explicitly created topics, rather than relying on auto-create, to determine how much effort would be needed to remove the dependency on auto-create being enabled. That code amounted to less than 10 lines in the proof-of-concept. Production-quality code may involve more, in order to audit the creation, and to log and report problems and any run-time failures.
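
For illustration, the heart of such a proof-of-concept might look like the sketch below (bootstrap servers, topic name, and the partition and replication counts are placeholders); the point is that partition and replication counts become explicit, reviewable choices rather than whatever the broker defaults happen to be.

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.Collections;
import java.util.Properties;

public class ExplicitTopicCreation {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder

        try (AdminClient admin = AdminClient.create(props)) {
            // Explicit, deliberate choices instead of the broker-wide auto-create defaults.
            NewTopic topic = new NewTopic("replicated-topic", 6, (short) 3);   // placeholders
            admin.createTopics(Collections.singleton(topic)).all().get();      // wait for completion
        }
    }
}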

Further reading

“Auto topic creation on the broker has caused pain in the past; And today it still causes unusual error handling requirements on the client side, added complexity in the broker, mixed responsibility of the TopicMetadataRequest, and limits configuration of the option to be cluster wide. In the future having it broker side will also make features such as authorization very difficult.” KAFKA-2410 Implement “Auto Topic Creation” client side and remove support from Broker side

 

Six months review: learning how to test new technologies

I’ve not published for over a year, although I have several draft blog posts written and waiting for completion. One of the main reasons I didn’t publish was the direction my commercial work has been going recently, into domains and fields I’d not worked in for perhaps a decade or more. One of those assignments was for six months and I’d like to review various aspects of how I approached the testing, and some of my pertinent experiences and discoveries.

The core technology uses Apache Kafka, which needed to be tested as a candidate for replicated data sharing between organisations. Various criteria took the deployment off the beaten track of Apache Kafka’s popular deployment models; that is, the use was atypical, and therefore it was important to understand how Kafka behaves for the intended use.

Kafka was new to me, and my experiences of some of the other technologies were sketchy, dated or both. I undertook the work at the request of the sponsor who knew of my work and research.

There was a massive amount to learn and, as we discovered, lots to do in order to establish a useful testing process, including establishing environments to test the system. I aim to cover these in a series of blog articles here.

  • How I learned stuff
  • Challenges in monitoring Kafka and underlying systems
  • Tools for testing Kafka, particularly load testing
  • Establishing and refining a testing process
  • Getting to grips with AWS and several related services
  • Reporting and Analysis (yes in that order)
  • The many unfinished threads I’ve started

My presentations at the Agile India 2017 Conference

I had an excellent week at the Agile India 2017 conference, ably hosted by Naresh Jain. During the week I led a 1-day workshop on software engineering for mobile apps. This included various discussions and a code walkthrough, so I’m not going to publish the slides I used out of context. Thankfully much of the content also applied to my 2 talks at the conference, for which I’m happy to share the slides. The talks were also videoed, and these should be available at some point (I’ll try to remember to update this post with the relevant links when they are).

Here are the links to my presentations

Improving Mobile Apps using an Analytics Feedback Approach (09 Mar 2017)

Julian Harty Does Software Testing need to be this way (10 Mar 2017)

Can you help me discover effective ways to test mobile apps please?

There are several strands which have led to what I’m hoping to do, somehow, which is to involve and record various people actually testing mobile apps. Some of that testing is likely to be unguided by me, i.e. you’d do whatever you think makes sense, and some of the testing would be guided, e.g. where I might ask you to try an approach such as SFDPOT. I’d like to collect data and your perspectives on the testing, and in some cases have the session recorded as a video (probably done with yet another mobile phone).

Assuming I manage to collect enough examples (and I don’t yet know how much ‘enough’ is) then I’d like to use what’s been gathered to help others learn various practical ways they can improve their testing based on the actual testing you and others did. There’s a possibility a web site called TechBeacon (owned by HP Enterprises, who collaborated with my book on using mobile analytics to help improve testing) would host and publish some of the videos and discoveries. Similarly, I may end up writing some research papers aimed at academia to improve the evidence of which testing works better than others, and particularly to compare testing that’s done in an office (as many companies do when testing their mobile apps) vs. testing in more realistic scenarios.

The idea I’m proposing is not one I’ve tried before and I expect that we’ll learn and improve what we do by practicing, and sometimes by things going wrong in interesting ways.

So as a start – if you’re willing to give this a go – you’ll probably need a smartphone (I assume you’ve got at least one) and a way to record you using the phone when testing an app on it. We’ll probably end up testing several and various apps. I’m involved in some of the apps, not others, e.g. I help support the offline Wikipedia app called Kiwix (available for Android, iOS and desktop operating systems; see kiwix.org and/or your local app store for a copy).

Your involvement would be voluntary and unpaid. If you find bugs it’d be good to try reporting them to whoever’s responsible for the app, which gives them the opportunity to address them, and we’d also learn which app developers are responsive (I expect some will seemingly ignore us entirely). You’d also be welcome to write about your experiences publicly, tweet, etc. – whatever helps encourage and demonstrate ways to help us collectively learn how to test mobile apps effectively (and identify what doesn’t work well).

I’ve already had 4 people interested in getting involved, which led me to write this post on my blog to encourage more people to join in. I also aim to encourage additional people who come to my workshops (at Agile India on 6th March and at Let’s Test in Sweden in mid-May). Perhaps we’ll end up having a forum/group so you can collectively share experiences, ideas, notes, etc. It’s hard for me to guess whether this will grow or atrophy. If you’re willing to join in the discovery, please do.

Please ping me on twitter https://twitter.com/julianharty if you’d like to know more and/or get started.