Man and Machine in Perfect Harmony?

Ah, the bliss and eager joy when we can operate technology seamlessly and productively, making effective progress rather than mistakes; where the technology helps us make better, informed decisions. Sadly this seldom happens in operations – when administering the software – or trying to address problems.

HCI for systems software and tools

Testing the operating procedures, the tools and utilities to configure, administer, safeguard, diagnose and recover, etc. may be some of the most important testing we do. The context, including emotional & physical aspects, are important considerations and may make the difference between performing the desired activity versus exacerbating problems, confusion, etc. For instance, is the user tired, distracted, working remotely, under stress? each of these may increase the risk of more and larger mistakes.

Usability testing can help us consider and design aspects of the testing. For instance, how well do the systems software and tools enable people to complete tasks effectively, efficiently and satisfactorily?

Standard Operating Procedures

Standard Operating Procedures (SOP’s) can help people and organisations to deliver better outcomes with fewer errors. For a recent assignment testing Kafka, testing needed to include testing the suitability of the SOP’s, for instance to determine the chances of someone making an inadvertent mistake that caused a system failure or compounded an existing challenge or problem.

Testing Recovery is also relevant. There may be many forms of recovery. In terms of SOPs we want and expect most scenarios to be included in the SOPs and to be trustworthy. Recovery may be for a particular user or organisation (people / business centred) and/or technology centred e.g. recovering at-risk machine instances in a cluster of servers.

OpsDev & DevOps

OpsDev and DevOps may help improve the understanding and communication between development and operations roles and foci. They aren’t sufficient by themselves.

Further reading

Disposable test environments


  • “readily available for the owner’s use as required”
  • “intended to be thrown away after use”

For many years test environments were hard to obtain, maintain, and update. Technologies including Virtual Machines and Containers reduce the effort, cost, and potentially the resources needed to provide test environments as and when required. Picking the most appropriate for a particular testing need is key.

For a recent project, to test Kafka, we needed a range of test environments, from lightweight ephemeral self-contained environments to those that involved 10’s of machines distributed at least 100 km apart. Most of our testing for Kafka used AWS to provide the computer instances and connectivity where environments were useful for days to weeks. However we also used ESXi and Docker images. We used ESXi when we wanted to test on particular hardware and networks. Docker, conversely, enabled extremely lightweight experiments, for instance to experiment with self-contained Kafka nodes where the focus was on interactive learning rather than longer-lived evaluations.

Some, not all, of the contents of a test environment has a life beyond that of the environment.  Test scripts, the ability to reproduce the test data and context, key results and lab notes tend to be worth preserving.

Key Considerations

  • what to keep and what to discard: We want ways to identify what’s important to preserve and what can be usefully and hygienically purged and freed.
  • timings: how soon do we need the environment, what’s the duration, and when do we expect it to be life-expired?
  • fidelity: how faithfully and completely does the test environment need to be?
  • count: how many nodes are needed?
  • tool support: do the tools work effectively in the proposed runtime environment?

Further reading