EXPEDIA GROUP TECHNOLOGY — ENGINEERING

Error Budget Policies in Practice

A practical guide to implementing error budgets in your organization


This blog post is a follow-up to Error Budget Policy Adoption at Expedia Group by Lasantha Kularatne.

Determining and setting service level objectives (SLOs) is a critical milestone in your organization’s reliability journey. You’ve made it this far — now what? Error budgets are an essential tool that helps your teams make decisions based on SLO data. In the previous post, we covered alerts to monitor error budgets. Now we’ll cover how to act on those alerts by implementing thoughtful error budget guidelines and policies, so you can keep your SLOs relevant.

Error budget policies

A system can be in one of three reliability states with respect to an SLO: happy, uncertain, and sad. The following illustration (Figure 1) shows how error budget monitoring can be used to determine which of the three states a system is in. The team should define a policy for each state and follow its operational guidelines.

Figure 1 — States of Reliability
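
To make Figure 1 concrete, here is a minimal sketch (in Python) of how error budget monitoring signals could be mapped to the three reliability states. The BudgetStatus fields, the 75% threshold, and the choice to treat high budget consumption without a critical alert as “uncertain” are illustrative assumptions drawn from the policies described below, not the output of any particular monitoring tool.

```python
from dataclasses import dataclass
from enum import Enum


class ReliabilityState(Enum):
    HAPPY = "happy"
    UNCERTAIN = "uncertain"
    SAD = "sad"


@dataclass
class BudgetStatus:
    consumed_pct: float          # share of the error budget already spent (0-100)
    warning_alert_fired: bool    # e.g., a slow-burn warning from your monitoring tool
    critical_alert_fired: bool   # e.g., a fast-burn or budget-exhaustion alert


def classify(status: BudgetStatus) -> ReliabilityState:
    """Map error budget monitoring signals to one of the three states in Figure 1."""
    if status.critical_alert_fired:
        return ReliabilityState.SAD
    # Assumption: high consumption without a critical alert is treated as uncertain.
    if status.warning_alert_fired or status.consumed_pct >= 75:
        return ReliabilityState.UNCERTAIN
    return ReliabilityState.HAPPY


if __name__ == "__main__":
    status = BudgetStatus(consumed_pct=40, warning_alert_fired=False,
                          critical_alert_fired=False)
    print(classify(status))  # ReliabilityState.HAPPY
```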

Happy state

Conditions:

No alerts fired, and error budget consumption is below 75%.

Policy:

At this stage, the system meets currently accepted reliability standards and we consider customers happy. New feature development is allowed. The team can take risks and experiment, as there is enough budget left to absorb potential failures. Furthermore, if budget consumption consistently hovers below 10%, SLOs can be tightened to stricter targets.

Uncertain state

Conditions:

Warning alerts have been fired, but not critical alerts.

Policy:

At this stage, risky deployments should wait until the investigation is complete, because new errors can mask current issues. Tickets should be created and prioritized for investigations. Delayed investigations could push the system into the “Sad state” and consume the entire budget.

If for any reason a team cannot adhere to this policy, it can temporarily relax SLO thresholds while understanding the risks and implications. This should be communicated clearly to stakeholders, and tickets should be created for future work to restore the SLO threshold to the current value.

If the system recovers and operational state conditions are met, follow the “Happy state” policy.

Sad state

Conditions:

Critical alerts have been fired.

Policy:

(This is also known as “Out of Budget Policy”)

The site reliability engineering (SRE) team or the support engineering team should be paged, and an incident ticket should be created.

Application code should be frozen until the remaining error budget reaches at least 25% and all alerts are cleared. Product feature releases will be halted completely. The only exceptions allowed (see the deploy-gate sketch after this list) are:

  1. Fixes that address the root cause of the SLO miss
  2. Fixes that have the highest priority (e.g., changes that involve legal ramifications if deadlines are not met)
  3. Security fixes
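
If your deployment pipeline can read the current error budget status, this freeze can be enforced automatically. The sketch below is one hypothetical way to do that; the exemption label names and the way the budget status reaches the pipeline are assumptions, not part of any specific CI/CD tool.

```python
# Hedged sketch of a release gate enforcing the "Sad state" code freeze.
REMAINING_BUDGET_FLOOR_PCT = 25  # releases stay frozen until this much budget is back

# Labels a change must carry to be exempt from the freeze (assumed naming convention).
FREEZE_EXEMPT_LABELS = {"slo-root-cause-fix", "highest-priority", "security-fix"}


def release_allowed(remaining_budget_pct: float,
                    alerts_clear: bool,
                    change_labels: set[str]) -> bool:
    """Return True if a change may be deployed under the out-of-budget policy."""
    freeze_lifted = remaining_budget_pct >= REMAINING_BUDGET_FLOOR_PCT and alerts_clear
    if freeze_lifted:
        return True
    # During the freeze, only the documented exceptions may go out.
    return bool(change_labels & FREEZE_EXEMPT_LABELS)


if __name__ == "__main__":
    print(release_allowed(10, False, {"new-feature"}))    # False: freeze in effect
    print(release_allowed(10, False, {"security-fix"}))   # True: allowed exception
    print(release_allowed(30, True, {"new-feature"}))     # True: freeze lifted
```

A gate like this keeps the policy enforceable without relying on everyone remembering the rules in the middle of an incident.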

Finally, if a team finds that an SLO miss happened due to reasons out of its control, it can adjust the error budget (e.g., using the status correction options provided by Datadog).

If the system recovers and the operational state conditions are met, follow the “Uncertain state” or “Happy state” policy based on the latest state.

Outages caused by dependencies

If an outage was caused by a service maintained by another service provider team, the dependent team should create a ticket with the service provider team to investigate the issue. If the service provider team has SLOs and error budgets defined, the dependent team can forgo the out-of-budget policy as long as the service provider team has enacted its own error budget policies.

For an external dependency, the vendor’s SLAs should be accounted for at the time the SLOs are defined. If the third-party vendor’s outage broke the SLA, the team should engage with the vendor through the proper channels to be compensated for the reliability miss. The team should also create an internal investigation ticket to track progress.

If a certain dependency has been identified as the root cause of recurring SLO misses, the team should consider removing that unreliable dependency from the system. Although this is a definitive way to bring the error budget under control, it can take time. Therefore, the team should also look for other measures that bring the error budget under control in the short term.

Operationalizing error budget policies

The error budget policies discussed here are not hard and fast rules. Teams can adopt a flexible version of these policies if they wish.

Policy document

Teams should document policies for services they directly own. Shared services should be documented at the organizational level responsible for those services. Policies should be shared with stakeholders and the incident response team. Application runbooks should be updated with a link to the policy document so support engineers have easy access. The policy document can live either in a source code repository or in a central documentation source belonging to the team. You can find a sample document in this public repo.

Adoption

Setup:

  • Aspirational SLO targets and error budget limits are agreed upon
  • Policy document created
  • Runbooks updated
  • Necessary training is completed (within the team)

Alpha Testing / 1st 30 Days:

  • Monitor
  • If the system enters the “Sad state”, follow all the ceremonies but don’t actually block releases

Beta testing / 2nd 30 days:

  • Adjust limits based on learnings
  • Apply “Sad state” rules
  • Optional: If necessary, the team has the flexibility to adjust SLO targets and error budget limits with approval from stakeholders

Operationalize:

  • Follow the error budget policy
  • Every 6 months SLOs and error budgets are reviewed
  • Optional: SLO targets and error budget limits are adjusted with approval from stakeholders

Reporting

Regularly reviewing error budget alerts (warning vs. critical) and the tickets created for SLO misses is recommended. Creating reports and dashboards at the team and application level helps identify the high-priority fixes needed to reduce risk and prevent potential outages. Creating reports and dashboards at the organization level provides a high-level overview for engineering leaders to build strategies for the long-term reliability of your platform.
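
Even a simple tally of warning versus critical alerts per service can highlight where fixes are most urgent. The sketch below assumes a hypothetical export of alert records; in practice the data would come from your monitoring or ticketing system, and the service names here are placeholders.

```python
from collections import Counter

# Hypothetical export of error budget alerts as (service, severity) records.
alerts = [
    ("checkout-api", "warning"),
    ("checkout-api", "critical"),
    ("search-api", "warning"),
    ("search-api", "warning"),
]

# Count alerts per service and severity to produce a simple team-level report.
report = Counter(alerts)

for (service, severity), count in sorted(report.items()):
    print(f"{service:15s} {severity:10s} {count}")
```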

Reviewing at forums

Once teams have adopted the policies, it is recommended to have a recurring review process for SLOs and error budgets. A weekly or monthly forum with the participation of multiple product teams and their leadership is ideal (e.g., at Expedia Group™️, we have weekly operational excellence forums). Defects identified by investigations of SLO misses should be reviewed at correction-of-errors or post-mortem meetings. While improving reliability is the main goal, sharing learnings across teams is similarly important.

Getting started with SLOs is a great way to start measuring the reliability of your products and services from your customers’ perspective. Once those targets have been defined and measured, putting error budgets into practice can really help your organization make principled decisions, backed by concrete data, about the investment needed for reliability. Keep in mind that this post contains guidelines; make sure to do what makes the most sense and works best for your organization.

Learn more about technology at Expedia Group
