Incident management at SumUp

Written by Aleksandar Irikov

This article provides a technical explanation of a recent production issue and the steps we took to remedy the situation. Our goal here is to give a recap of events and share our learnings, so other people avoid making the same mistakes.

The event

On 20 January 2022, we merged a code change to one of our services. The new feature needed to communicate with another SumUp system and perform some business logic based on the response. After the change was released, SumUp merchants were unable to log in, sign up or view the SumUp dashboard (the portal through which merchants can manage their SumUp account).

A note on the SumUp infrastructure

We have three deployment environments — dev (called theta internally), uat (called staging), and production. The development environment is primarily used by engineers to test and work on features and is considered “experimental” by design. Our staging environment is meant to be a mirror of production.

As a fintech company, there are certain rules and regulations that we must follow to protect our merchants’ data and remain compliant with regulators. Therefore, applications dealing with sensitive information (e.g. PCI-related apps) have their own dedicated infrastructure and cannot be accessed from outside this domain boundary.

Timeline

Recap of events: Early this year, we released a new feature that required data from another SumUp system.

The change went through our CI/CD pipeline and was deployed to the development environment. Apart from our automated test suite, we performed extensive exploratory testing to validate that all possible flows and use cases were accounted for. Our system has a lot of upstream and downstream dependencies, and such tests increase our confidence, especially when rolling out more complex code changes. Once we felt confident enough, we proceeded to roll our feature out to uat/stage.

The staging environment is meant to mimic production as closely as possible — this gives us confidence that if a feature behaves as expected on stage, it will do so on prod too. However, when we were validating our feature there, we ran into a slight divergence between staging and production: the registration flow in the two environments had minor (but, as we later realised, significant) differences in how it guided the user through the necessary steps. Because of this, we could not properly test our feature on stage, but our confidence was still high — we had handled similar scenarios in the past and had already tested the feature locally and on dev. Nothing appeared broken and there was no degradation, so we assumed it was safe to proceed.

We deployed the change to prod and quickly started to see the system misbehave. Our service was unable to communicate with the new dependency and started timing out on those calls. Because of these timeouts, incoming requests began to pile up, and eventually our pods began to exhaust their memory limits. Kubernetes then started killing the misbehaving pods, leading to service degradation.
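To make the failure mode concrete, here is a rough sketch (not our actual service code) of the difference between a downstream call that waits indefinitely and one that fails fast. The endpoint, timeouts and fallback below are hypothetical.

```python
import requests

# Hypothetical endpoint for the downstream SumUp system — not the real URL.
DOWNSTREAM_URL = "https://downstream-dependency.internal/api/check"


def call_downstream(payload: dict) -> dict:
    try:
        # Explicit (connect, read) timeouts keep a slow dependency from
        # holding a worker — and its memory — while requests back up behind it.
        response = requests.post(DOWNSTREAM_URL, json=payload, timeout=(1.0, 2.0))
        response.raise_for_status()
        return response.json()
    except requests.Timeout:
        # Fail fast with a fallback instead of letting requests pile up.
        return {"status": "dependency_unavailable"}
    except requests.RequestException as exc:
        return {"status": "error", "detail": str(exc)}
```

Without tight limits like these, every incoming request waits on the unreachable dependency, which is exactly how our pods ended up holding more and more in-flight work until they hit their memory limits.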

Once we had identified the source of the issue, we quickly rolled back to a previous version of our application using our deployment system, ArgoCD. Compared to doing a revert through git, this approach saved us a trip through the CI/CD pipeline and let us mitigate the issue faster.

Why did it happen?

It seems weird, right? A feature works as expected locally and on dev, has some issues on stage but doesn’t cause degradation, and then brings down the entire application on prod. At first glance, this might not make sense, but it will with the following caveats. To empower engineers and speed up delivery, our development environment did not have the same restrictions as stage and prod — our service was perfectly capable of communicating with the external system on theta (our dev environment). However, our downstream dependency is one of those applications that needs to be deployed in a separate Kubernetes cluster precisely to avoid such communication patterns, and it is not reachable externally by design. In that regard, staging closely resembles production and has these restrictions applied, which is why our feature was misbehaving there.
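One cheap guard against this kind of divergence is a per-environment reachability check run before (or right after) a rollout. The hostnames below are placeholders, but the idea is simply to confirm that the new dependency can actually be reached from the environment you are deploying to.

```python
import socket

# Placeholder hostnames per environment — not our real service names.
DEPENDENCY_HOSTS = {
    "theta": "dependency.theta.internal",
    "staging": "dependency.staging.internal",
    "production": "dependency.prod.internal",
}


def is_reachable(host: str, port: int = 443, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    for env, host in DEPENDENCY_HOSTS.items():
        status = "reachable" if is_reachable(host) else "NOT reachable"
        print(f"{env}: {host} is {status}")
```

A check like this makes the network boundary explicit up front, rather than something you discover through timeouts after deploying.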

So far, so good, but why did our change not break stage? The answer: traffic — there weren’t enough requests flowing through to generate the amount of system pressure needed to cause the degradation we experienced on prod. As mentioned, we try to mirror our production setup as much as possible, but we had not considered the difference in traffic between the two environments.

As part of our follow-up investigation, we noticed that we had lowered the allocated CPU and memory for our pods a couple of months earlier. However, we had failed to adjust the web server’s configuration accordingly. Once our downstream calls started timing out and requests began piling up, this misconfiguration caused the application to run out of memory and crash.
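The arithmetic behind that crash is worth spelling out. With purely illustrative numbers (not our real configuration), a worker count tuned for the old, larger pods no longer fits once requests back up against the reduced memory limit:

```python
# Illustrative numbers only — not SumUp's actual configuration.
pod_memory_limit_mb = 512        # memory limit after the pods were resized down
base_overhead_mb = 100           # runtime + framework baseline
workers = 8                      # web server workers, sized for the old, larger pods
memory_per_busy_worker_mb = 60   # memory held while a request waits on the slow downstream call

# While the downstream calls time out, every worker is busy and holding memory.
peak_usage_mb = base_overhead_mb + workers * memory_per_busy_worker_mb
print(f"peak usage: {peak_usage_mb} MiB vs limit: {pod_memory_limit_mb} MiB")  # 580 vs 512 -> OOM kill

# The worker count that would have fitted the smaller pods:
safe_workers = (pod_memory_limit_mb - base_overhead_mb) // memory_per_busy_worker_mb
print(f"workers that fit within the limit: {safe_workers}")  # 6
```

The exact numbers don’t matter; the point is that resizing pods and resizing the web server’s worker and connection settings have to happen together.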

Mitigation process

Whenever we have an outage, we follow these steps:

  • Communicate the degradation and a rough estimate of its impact to the company — this makes everyone aware of the current situation and the steps being taken towards resolution. We have a dedicated Slack channel for such situations, so people can easily keep track of what is going on.

  • Get the right people on a conference call to work on a resolution together — this usually involves folks from the multiple teams affected by the incident.

  • Keep the SumUp Status Page up to date so that our merchants and users are aware of the current situation and know that we are working on a resolution.

  • After mitigating the issue, conduct an in-depth investigation of the events — a process we call a postmortem. We create a detailed timeline of everything that happened, use methods such as the 5 Whys to get to the heart of the problem, and draw conclusions, actions and learnings that we can share with the rest of the organisation, so they avoid the same pitfalls.

Lessons and learnings

Here are the lessons that we learned from this incident:

  • Have a quick and easy rollback mechanism that everyone is aware of and comfortable with — this allows you to mitigate issues faster. There are some changes that an app rollback might not fix (such as config changes to underlying infra). Nevertheless, it is an indispensable tool in your incident resolution toolbox.

  • When “resizing” your pods/containers, make sure to load test the new setup and prepare for the worst-case scenario (a wall of traffic, for instance) — this lets you catch such issues early, without impacting your users. A minimal sketch of this kind of test follows this list.

  • Bear in mind that your other environments are not a perfect copy of production, try as they may. Most likely, they differ in some form or another. For instance, the discrepancy in traffic volume exhibited some new emergent behaviours that we had not seen previously and eventually caused this degradation.

  • Give incident simulations a try. Most likely, the people on your team have different levels of experience with operational issues. Some might be quite comfortable troubleshooting production problems, whereas for others this might be their first production incident. To get everyone up to speed, you and your teammates can do a “trial run” of dealing with a production issue — pretend there’s an outage right now affecting your customers. How do you communicate it to the organisation? How do you work the problem? Who leads the mitigation process? For instance, we used our dev environment to simulate a similar scenario where a code change brought down the system. We split our team in two, observed their actions and communication, and measured the time needed to identify and fix the issue. Afterwards, we compared the pros/cons of both approaches and discussed how we could do even better in such situations. This made people feel much more confident about handling production issues in the future.
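As promised above, here is a minimal load-test sketch for the “resizing” lesson. It is not the tooling we use internally — dedicated tools such as k6 or Locust are usually a better fit — but it shows the idea: push production-like concurrency at the resized deployment and watch error rates and latency. The URL and numbers are placeholders.

```python
import time
from concurrent.futures import ThreadPoolExecutor

import requests

# Placeholder target — point this at the resized staging deployment.
TARGET_URL = "https://staging.example.com/health"
CONCURRENCY = 50          # rough approximation of production concurrency
TOTAL_REQUESTS = 2000


def hit(_: int) -> tuple[bool, float]:
    """Send one request and report (success, latency in seconds)."""
    start = time.monotonic()
    try:
        response = requests.get(TARGET_URL, timeout=5)
        ok = response.status_code < 500
    except requests.RequestException:
        ok = False
    return ok, time.monotonic() - start


with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    results = list(pool.map(hit, range(TOTAL_REQUESTS)))

errors = sum(1 for ok, _ in results if not ok)
latencies = sorted(duration for _, duration in results)
p95 = latencies[int(len(latencies) * 0.95)]
print(f"errors: {errors}/{TOTAL_REQUESTS}, p95 latency: {p95:.2f}s")
```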

We hope this gave you a glimpse into the world of SumUp, how we deal with production issues, and how we learn from them.