Exercising your operational robustness and incident response with 'chaos engineering' game days

Have you ever wondered how technology companies stress-test their operational capabilities, demonstrate consistently high standards of uptime, maintain peak readiness and validate the runbooks that allow them to respond to operational incidents within strict SLAs?

It’s one thing to talk a good game, but can you demonstrate maturity in your operations, and do you exercise hypothetical “What if?” scenarios to continuously improve your observability and response practices? Regardless of the size of your team, a robust operational capability is critical to ensuring a positive experience for the consumers of your service(s).

You may not have attended or even heard about game days, or “chaos engineering” sessions as they’re often referred to, so let’s explore the concept by answering the following questions:

  • What are game days?

  • Are game days a good fit for me?

  • What are the benefits and prerequisites?

  • How do I run a successful game day?

So what is a game day?

You’ve heard the expression “practice makes perfect”? Well, game days are precisely that: an opportunity to practise breaking your systems, or exercising them in a less-than-desirable manner, by simulating hypothetical “What if?” scenarios.

The scenarios listed below are nightmare-material for operational staff, something you hope to avoid in a production environment, and are therefore perfect examples of scenarios you could intentionally invoke using game days.

Consider how you would diagnose and respond to the following:

- A critical part of your service offering has gone offline during peak times and is disrupting the customer experience

- Your customer database has been breached and sensitive, personally identifiable information has been leaked, in violation of GDPR

- Your Domain Name System (DNS) servers have been compromised and malware is spreading across your network

- It’s holiday season, and customer demand has exceeded the infrastructure you have provisioned, leaving customers unable to access your service

- Customers in a particular region are experiencing intermittently poor service performance, but other regions appear unaffected

- It’s 3AM on a weekend, and you have a software hotfix that requires an expedited deployment into production with zero downtime

- That much-loathed, mission-critical “PC acting as a file/web server” hiding under someone’s desk is physically unplugged or suffers a hardware failure

The beauty of game days is that you can replicate and validate these scenarios in a controlled environment without invoking anger from your customers (or your boss). And besides, who doesn't enjoy donning their metaphorical armour and saving the day from a little (organised) chaos!

Are game days right for me?

As with anything that requires investment, it’s worth asking what you will get out of these sessions and if the effort is justified. Let’s start with the benefits - you can expect to gain clarity on some, if not all, of the points listed below:

  • How new or existing systems respond to faults

  • How your team behaves under duress

  • The accuracy and upkeep of your runbooks

  • Clarity of accountabilities and points of escalation

  • Guidelines on declaring, managing and reviewing major incidents

  • How issues are communicated to your customers and internal stakeholders

  • The consistency of your response to repeated incidents

  • Whether your team have timely access to the physical and digital systems and documentation they need

  • Whether your team are empowered to respond appropriately without being micromanaged or interrupted

Sounds great, right? Before jumping into your first game day though, it’s worth considering the prerequisites which are necessary if you’re to execute a successful event. As a minimum, you will need the following:

  • A production-like environment in which you can run representative, potentially destructive tests (including the cost of hardware, licences, etc.)

  • The willingness of the team to participate and interact as they would in a real-world scenario while being observed and questioned

  • Buy-in from the management team and acceptance that they will attend as observers only, taking accountability for actions but leaving participation to the operations team

  • The capacity to dedicate time to schedule and participate in sessions, including remedial efforts after each session

But what if you’re unable to fulfil one or more of these criteria? Well, fear not: there is a low-friction alternative that means you can still run a game day and continue to follow the steps outlined in this post!

A “tabletop exercise” allows teams to explore scenarios at a theoretical level by performing a “dry run” response, whereby the team collectively talk through the diagnosis and remediation steps that they would perform in a full-scale exercise. Teams less comfortable with running a production-representative game day may also find this an easier entry method, as it requires only your typical office stationery: notepads, Post-it notes and whiteboards. If this approach applies to you, then you need only substitute the environment-based steps covered below with a tabletop alternative.

 
A team performing a tabletop exercise.

How do I run a successful game day?

Game days can be tailored to meet your requirements, but in my opinion, at a minimum, you need to cover four fundamental activities: planning, logistics, execution, and action planning.

These activities are key, whether this is your 1st or 100th game day. Even teams running game days on a per-sprint basis should still go through these motions, though on a smaller scale.

1. Planning

The key to a successful game day lies in meticulous planning, and it starts with the creation and selection of the scenario(s) you want to simulate. You may already have a prioritised list of scenarios, or a particular theme, process or discipline that you want to dedicate an entire game day to. If not, you could start with an area or process identified as high-risk, or recreate an operational incident that happened recently. Either way, focus on fewer scenarios so you aren’t rushed for time on the day - you should be looking to run additional game days in future, allowing you to cover anything that didn’t make the cut this time around.

Once armed with your scenario(s), you need to determine the audience and resources required to exercise these scenarios in a representative way (e.g. environments, licences, permissions, etc.). It is important that you involve everyone who has either accountability or responsibility for the area(s) under test, such as product owners, developers, testers, release managers, operations, SREs, support, help desk, security, etc.

All relevant parties are expected to attend and remain distraction-free for the duration of the event, which typically lasts 2 to 4 hours but can span one or more days depending on the level of preparation and setup required. Because of this, you will likely want to schedule your event several weeks in advance to allow ample preparation time, possibly over a month in advance depending on the complexity of the logistics and execution covered in the next sections.

2. Logistics

Keeping things running seamlessly during your game day is no small feat, and smooth logistics are another key part of achieving that. With your scenarios selected and invitations distributed to the relevant attendees, the next task is to set yourself up for success as the date approaches.

A commonly overlooked aspect of game days is the choice and configuration of the venue for those attending in person. Who is expected to travel, what transportation and accommodation options exist, and will on-site parking be required? When the team arrive, will any attendees require additional support or alternative facilities? What are the catering arrangements, and will participants need to be separated within the venue (e.g. red team vs. blue team) with access to different equipment? Also consider the potential mixture of on-site and remote attendees, and how these teams will interact and be observed when not physically located in the same room (or possibly even the same time zone).

Beyond travel and the venue, you’ll want to set clear expectations for your attendees, in particular by assigning everyone a role. Expectations can make or break a game day, so it is critical that all parties know what is expected of them, ideally well in advance. Here are the roles you’ll want to assign to those attending:

  • Facilitator(s) - the people running the day; coordinating the scenarios, reinvigorating the sessions should things grind to a halt for whatever reason, and asking lots of “Why?” questions of the participants

  • Note Taker(s) - one or more people capturing minutes, decisions and action points from the scenarios, downloading video recordings and instant messaging logs, photographing whiteboards and supporting materials. You typically have a single note taker, though larger exercises and disparate teams often require multiple people to share the load

  • Participants - the “doers” who will be diagnosing and responding to the simulated scenarios and will be observed doing so. Additionally, you will need participants who trigger the scenarios but aren’t tasked with responding to the mess they create!

  • Observers - attendees who do not fall into one of the three roles listed above; observers are there to simply watch how scenarios play out, with strict instructions to not interfere with the participants or facilitators. This is typically the largest group comprising internal and external stakeholders, and those who can respond should anything go awry in a production environment. It shouldn’t, of course, but we’ve all been that new starter who had a ‘Little Bobby DROP TABLES’ moment in their first week…

With the roles clarified in advance, preparing the venue is next on the list. Nobody enjoys delays, and seeing the frantic movement of IT equipment and furniture 10 minutes before the first session starts will do nothing for the attendees’ confidence or nerves. Depending on the scale of the day, you may want to recruit additional support staff to help.

Next, each environment must be provisioned and configured ahead of time, with sufficient isolation from production systems and safeguards to ensure actions only take place in the game day environments (see ‘Little Bobby’ above). This includes configuring permissions for the attendees (especially participants from external organisations), along with any supporting systems required on the day, including instant messaging, video calling, virtual whiteboards, bug tracking, incident management, etc.
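
One way to picture such a safeguard is an explicit allowlist that every fault-injection script must pass through. The sketch below is illustrative only: the environment names, `run_fault` and `UnsafeTargetError` are all hypothetical, and a real setup would also rely on network isolation and permissions rather than a code check alone.

```python
# Hypothetical names for the environments set aside for the game day.
ALLOWED_ENVIRONMENTS = frozenset({"gameday-1", "gameday-2"})


class UnsafeTargetError(RuntimeError):
    """Raised when a fault is aimed at an environment outside the allowlist."""


def run_fault(environment, action, allowed=ALLOWED_ENVIRONMENTS):
    """Run action(environment) only if the environment is explicitly allowlisted."""
    if environment not in allowed:
        raise UnsafeTargetError(f"refusing to inject faults into {environment!r}")
    return action(environment)
```

Because anything not on the list is rejected, a typo or a stale configuration value fails safe instead of landing on production.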

To enable effective note-taking, it is recommended to capture minutes, recordings, decisions and actions directly into a pre-defined template. Realistically this can take any form; the key is to capture observations and actions for post-scenario evaluation and for posterity’s sake. The image below shows a Confluence page template with a recommended format, but feel free to create a format that best fits your needs.

An example Confluence page template for taking notes.
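
If you would rather keep notes in a structured form that can be pasted into any tool, a small template along these lines may be enough. This is a minimal sketch: the `ScenarioNotes` class and its field names are assumptions chosen for illustration, not a reproduction of the Confluence template.

```python
from dataclasses import dataclass, field


@dataclass
class ScenarioNotes:
    """Hypothetical note-taking template for a single scenario."""
    scenario: str
    facilitator: str
    observations: list = field(default_factory=list)
    decisions: list = field(default_factory=list)
    actions: list = field(default_factory=list)

    def render(self):
        """Return the notes as plain text, one section per heading."""
        lines = [f"Scenario: {self.scenario}", f"Facilitator: {self.facilitator}"]
        for heading, items in (
            ("Observations", self.observations),
            ("Decisions", self.decisions),
            ("Actions", self.actions),
        ):
            lines.append(f"{heading}:")
            # Make empty sections explicit so gaps are visible in the review.
            lines.extend(f"  - {item}" for item in items or ["(none recorded)"])
        return "\n".join(lines)
```

Creating one `ScenarioNotes` per scenario before the day starts means the note taker only has to append to lists as events unfold.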

3. Execution

On the day itself, the facilitators should begin by covering the plan for the day, including the number of scenarios being run, when break times will occur, and the rules of engagement. The rules of engagement are crucial for reiterating the expectations of each role. In particular, participants should expect to be asked “Why?” as a means of learning rather than in a malicious or accusatory way, and not knowing the answer to a question, or how to solve a problem, isn’t an issue. At the end of the day, these sessions are all about learning and continuously improving.

In terms of the scenarios themselves, only the facilitators, the participants triggering them and a small audience of observers should be aware of the agreed scenarios. Under no circumstances should the responding participants be made aware: the idea is to observe how they respond to a given scenario (including diagnosing it in the first instance), and being privy to this information would result in a non-representative response.

The participants tasked with triggering the scenarios kickstart proceedings, either by initiating failures manually, by running orchestration scripts, or by using open-source or commercial tools such as Gremlin and Netflix’s Chaos Monkey. Each of these methods allows you to target your on-premises infrastructure and software, but they can also be used to wreak havoc on your Cloud-based systems through APIs and CLIs. Cloud-native offerings such as AWS Fault Injection Service (formerly Fault Injection Simulator) and Azure Chaos Studio have entered the space too, allowing you to take full control over your Cloud-based chaos. You can even trigger multiple failures, including staged or delayed ones, should you want to make things extra spicy!

Unplugging physical server hardware.
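
For the orchestration-script route, staged or delayed failures can be as simple as a list of steps with start times. The following is a rough sketch rather than a reference to any particular tool: `run_staged_faults` and the injected callables are hypothetical, and each callable would wrap whatever manual command, API or CLI call you actually use.

```python
import time


def run_staged_faults(stages, sleep=time.sleep):
    """Trigger each (delay_seconds, name, inject) stage in order of its delay.

    Delays are measured from the start of the scenario; time spent inside
    inject() itself is not accounted for in this simple sketch.
    Returns the stage names in the order they were triggered.
    """
    triggered = []
    elapsed = 0.0
    for delay, name, inject in sorted(stages, key=lambda stage: stage[0]):
        sleep(max(0.0, delay - elapsed))  # wait until this stage is due
        elapsed = max(elapsed, delay)
        inject()
        triggered.append(name)
    return triggered
```

Passing `sleep` as a parameter lets you rehearse the sequence instantly before the day, then run it with real delays during the session.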

With a scenario triggered, the focus pivots to the participants responsible for executing the runbooks, placing them in control of diagnosis, communication and remediation efforts. The facilitators ensure the scenario is moving forward, periodically asking the participants to clarify why they performed a particular action or the rationale for a certain decision, as a means of understanding how to provide greater clarity when responding to this scenario.

It is recommended to set a diagnosis deadline for each scenario (e.g. 15 minutes), after which point the team will be given hints as to where to focus their efforts. If the participants remain unable to locate the problem then the facilitator should describe the scenario in full, and allow the team further time to respond to this information.
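
The escalation from observing, to hinting, to revealing can be written down in advance so facilitators apply it consistently. Here is a minimal sketch, assuming the 15-minute deadline from the example above and a hypothetical 5-minute gap between hints; `facilitator_prompt` is an illustrative name.

```python
def facilitator_prompt(minutes_elapsed, hints, deadline=15, hint_interval=5):
    """Return what the facilitator should do at this point in the scenario."""
    if minutes_elapsed < deadline:
        return "observe"
    # After the deadline, release one hint per interval, in order.
    hint_index = int((minutes_elapsed - deadline) // hint_interval)
    if hint_index < len(hints):
        return f"hint: {hints[hint_index]}"
    return "reveal the scenario in full"
```

With two prepared hints, this gives the team until minute 25 before the scenario is described in full.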

Once a scenario has concluded, the team need to evaluate the success of the scenario. Some of the key topics to discuss include:

  • Was the scenario simulated appropriately and was it a good fit for a game day?

  • Were the relevant participants present, did everyone know what was expected of them and were they given the freedom to solve the problem themselves without interruption?

  • How long did diagnosis efforts take, and was observability sufficient to detect, diagnose and validate remediation? Were these efforts hampered by false alarms or misleading indicators?

  • How long did remediation efforts take, and if applicable, did fail-over or disaster recovery work as expected?

  • Did the participants have timely access to the systems, people and documentation they needed?

  • Were the runbooks accessible and accurate?

  • Were the relevant stakeholders and customers kept informed with clear and timely communications?
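
The "how long" questions are easier to answer if the note taker records three timestamps per scenario. As a sketch, assuming both durations are measured from the moment the scenario was triggered (`scenario_timings` is a hypothetical helper):

```python
from datetime import datetime


def scenario_timings(triggered, diagnosed, remediated):
    """Return minutes to diagnose and remediate, measured from the trigger."""
    return {
        "minutes_to_diagnose": (diagnosed - triggered).total_seconds() / 60,
        "minutes_to_remediate": (remediated - triggered).total_seconds() / 60,
    }
```

Comparing these figures across repeated runs of the same scenario gives a simple view of whether your response is becoming more consistent.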

After evaluation has taken place, each environment needs to be reset into a good state ahead of the next scenario, be this infrastructure, software or configuration. This is often referred to as “winding down” or “take down” and is ideally scripted to run during a break period.
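
One way to make the reset hard to forget is to register an undo step at the moment each fault is injected. The sketch below uses Python's standard `contextlib.ExitStack` for this; the `scenario` helper and the inject/reset callables are hypothetical stand-ins for your own scripts.

```python
from contextlib import ExitStack, contextmanager


@contextmanager
def scenario(faults):
    """Inject each (inject, reset) pair, then undo them in reverse order on exit."""
    with ExitStack() as stack:
        for inject, reset in faults:
            inject()
            stack.callback(reset)  # registered only once the fault is live
        yield  # the participants respond while the faults are active
    # Leaving the block runs every registered reset, last-in first-out,
    # even if the scenario is aborted by an exception.
```

A session would then wrap the response window in `with scenario([...]):`, leaving the environment ready for the next scenario once the block exits.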

4. Action Planning

Following the conclusion of the final scenario, it is crucial to attribute owners to each of the actions captured throughout the session such that they can be addressed appropriately. Those attending as observers are the most likely candidates to take accountability for actions, rather than the participants themselves. It is recommended to define a timeline for each action and to schedule a follow-up session to review the outcomes.
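
To prepare for the follow-up session, it can help to list the actions that still lack an owner or a timeline, or that have slipped. A minimal sketch, with the `Action` fields and `needs_attention` helper as illustrative assumptions:

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Action:
    """Hypothetical record of one action captured during the game day."""
    description: str
    owner: Optional[str] = None
    due: Optional[date] = None
    done: bool = False


def needs_attention(actions, today):
    """Return open actions that lack an owner, lack a due date, or are overdue."""
    return [
        a for a in actions
        if not a.done and (a.owner is None or a.due is None or a.due < today)
    ]
```

Running this against the captured actions just before the review gives the group a short, focused agenda.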

Consider using a Correction of Error (COE) process to formalise the capturing of actions and the associated materials that supported remediation efforts. This approach is often paired with a Root Cause Analysis (RCA), where the relevant team(s) explore a particular process or system to establish a deeper understanding of what caused a failure or event to happen, and what changes are required to prevent it from recurring or to mitigate its impact in future.

Conclusion

Running an operational game day requires thoughtful planning and logistics, strong execution and prompt follow-up. Attention to detail, effective communication and a focus on providing an exceptional experience will set the stage for a game day that exceeds expectations and adds value to your business.

Requesting feedback from the attendees will help you to understand how the day ran from their perspective, especially any areas that future game days could improve and iterate upon. Ultimately, it’s about creating an environment that promotes learning and continuous improvement, but also, who doesn’t want to prevent those pesky 3AM text message alerts informing them production has gone down!

So, get ready to kick off your next game day with confidence and make it an event to remember!
