Case Study: How an EKS upgrade almost broke our production

Kiran Iyer
4 min read · Feb 18, 2024

This is a rundown of a scenario that disrupted our services and eventually led to a production outage.

Background:

We host three environments in total (dev, test, prod), with an EKS cluster on v1.24 in each one. Some time ago, we had to rebuild our dev & test environments from scratch, including EKS. However, the closest available version mirroring prod was 1.25, so we went with that in dev & test, leaving only prod running on 1.24.

EKS Upgrade: There are 2 components to an EKS upgrade.

A) Control plane upgrade: Triggered from the console & completely managed by AWS.

B) Nodegroup upgrade: Manually update the nodegroups with the new AMIs. We use managed nodegroups with a 90% spot / 10% on-demand ratio in production and 100% spot in dev & test.

NOTE: A is a fully managed rolling upgrade with no service disruption. B is also a rolling upgrade, but if your application cannot handle node terminations gracefully, it is better to carry out B during a maintenance window. As most of our workloads already run on spot, we decided to go ahead with B while serving customer (live) traffic.

Control plane upgrade: Initiated from the console, took 25 minutes and completed successfully.

Nodegroup upgrade: Initiated from the console. Took ~20 minutes and failed.
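
For reference, both steps can also be driven from the AWS CLI. A minimal sketch, assuming a cluster named my-cluster and a managed nodegroup named app-nodes (placeholders, not our real names):

# Upgrade the control plane (fully managed by AWS once triggered)
aws eks update-cluster-version --name my-cluster --kubernetes-version 1.25

# Track the control plane update
aws eks list-updates --name my-cluster

# Roll the managed nodegroup onto the new AMI release for 1.25
aws eks update-nodegroup-version --cluster-name my-cluster --nodegroup-name app-nodes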

Debugging Nodegroup upgrade failure:

We triggered the upgrade again, but this time only for a single nodegroup. Upon closer investigation, we observed that the new node came online, but the kube-system pods (coredns, aws-node) failed to initialize, causing the node to go into NotReady state and eventually get terminated, which in turn caused the upgrade to fail.
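
If you want to follow along with this kind of investigation, these are the usual first stops (the node name is a placeholder):

# Watch the new node's readiness as it joins the cluster
kubectl get nodes -o wide --watch

# Check the kube-system pods scheduled on the new node (aws-node, kube-proxy, coredns)
kubectl get pods -n kube-system -o wide --field-selector spec.nodeName=<new-node-name>

# Inspect the node's conditions and events for the NotReady reason
kubectl describe node <new-node-name>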

Investigating the logs for aws-node, we found the error below:

kubectl logs aws-node-ccttc -n kube-system
Defaulted container "aws-node" out of: aws-node, aws-vpc-cni-init (init)
{"level":"info","ts":"2024-02-08T07:01:25.986Z","caller":"entrypoint.sh","msg":"Validating env variables ..."}
{"level":"info","ts":"2024-02-08T07:01:25.987Z","caller":"entrypoint.sh","msg":"Install CNI binaries.."}
{"level":"info","ts":"2024-02-08T07:01:26.033Z","caller":"entrypoint.sh","msg":"Starting IPAM daemon in the background ... "}
{"level":"info","ts":"2024-02-08T07:01:26.037Z","caller":"entrypoint.sh","msg":"Checking for IPAM connectivity ... "}
{"level":"info","ts":"2024-02-08T07:01:28.047Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2024-02-08T07:01:30.064Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2024-02-08T07:01:32.083Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2024-02-08T07:01:34.092Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2024-02-08T07:01:36.106Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2024-02-08T07:01:38.115Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}

We were racking our brains to figure this out, as it had not occurred in dev or test on 1.25. We read through GitHub issues and Stack Overflow threads but found only hacky fixes. As this was production, we didn't want to risk it, so we raised a priority case with AWS Support and waited patiently.

Architecture:

[Figure: High-level flow of the microservice]

Sudden crash:

After about 4 hours, one of our services, serviceA, suddenly started failing with timeouts (not all requests, but a significant portion of them). These timeouts were clearly visible in the frontend. We were sure we hadn't changed any EKS settings, so we were quite stumped by this behaviour. What followed was a strange discovery…

Debugging the crash:

We hopped onto a call to analyse the impact and the affected areas, and immediately put our apps into "maintenance mode". This ensured we stopped accepting new requests and minimised the impact on downstream systems. It was only a protective measure, though; we still needed to figure out the actual cause of the timeouts.

We started testing APIs manually from inside the pods and realised that serviceB was unreachable from serviceA. Ah! We had found the problem, finally! But why was this a problem in the first place? The two services communicate with each other via a ClusterIP Service. We checked the label selectors on ServiceB against DeploymentB, and everything looked good. So why on earth would a ClusterIP ever fail? nslookup on the Service's cluster DNS name failed too. We realised this was something else.
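
The checks we ran boil down to something like the following. Service names, ports, labels and the namespace are placeholders, and this assumes the pod images ship curl and nslookup:

# From inside a serviceA pod, call serviceB through its ClusterIP Service
kubectl exec deploy/serviceA -- curl -sv --max-time 5 http://serviceB:8080/healthz

# Does the Service have ready endpoints behind it?
kubectl get endpoints serviceB -o wide

# Compare the Service selector with the pod labels it should match
kubectl get svc serviceB -o jsonpath='{.spec.selector}'
kubectl get pods -l app=serviceB --show-labels

# Check cluster DNS resolution from inside a pod
kubectl exec deploy/serviceA -- nslookup serviceB.default.svc.cluster.local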

Amid all this, the dev teams were working as usual, which meant the CI/CD pipelines were fully operational: builds, merges and releases were being deployed to the prod EKS cluster for various services. And that is when we realised that serviceB had just had a release, with new code deployed to prod EKS. After this deployment (rollout), serviceB became unreachable from serviceA.

We then received a reply from the AWS support team regarding the nodegroup upgrade failure:

The Kubernetes Service-related iptables rules on the nodes are created by kube-proxy. To verify that connectivity is indeed broken, we can try to telnet to the kubernetes Service IP, i.e. 172.20.0.1, on port 443 from the worker nodes. We could confirm that this connectivity is indeed broken using that command. Furthermore, we checked the "kube-proxy" and "vpc-cni" versions and observed that their current versions are not compatible with EKS 1.25. Hence we request you to upgrade the add-ons to the suggested versions.

vpc-cni: v1.10.1-eksbuild.1 (Existing) ==>> v1.16.2-eksbuild.1 (Expected)
kube-proxy: v1.21.2-eksbuild.2 (Existing) ==>> v1.25.16-eksbuild.2 (Expected)
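
If these components are installed as EKS managed add-ons (an assumption on my part; self-managed installs would instead need the updated manifests from the respective GitHub releases), the check and the upgrade look roughly like this, with the cluster name as a placeholder:

# Verify the broken connectivity the support team described, from a worker node
nc -zv 172.20.0.1 443

# Check the currently installed add-on versions
aws eks describe-addon --cluster-name my-cluster --addon-name vpc-cni --query addon.addonVersion
aws eks describe-addon --cluster-name my-cluster --addon-name kube-proxy --query addon.addonVersion

# Upgrade to the versions suggested for 1.25
aws eks update-addon --cluster-name my-cluster --addon-name vpc-cni \
  --addon-version v1.16.2-eksbuild.1 --resolve-conflicts OVERWRITE
aws eks update-addon --cluster-name my-cluster --addon-name kube-proxy \
  --addon-version v1.25.16-eksbuild.2 --resolve-conflicts OVERWRITE

Note that --resolve-conflicts OVERWRITE replaces any locally modified add-on configuration; PRESERVE may be the safer choice if you have customised these DaemonSets.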

We upgraded the add-ons to the suggested versions and things started to settle; requests began completing with 200 OK again. In hindsight, this also explains the sudden timeouts: with an incompatible kube-proxy, the iptables rules for serviceB's new pod endpoints were most likely never programmed after its rollout, which is why the ClusterIP Service broke only once a fresh deployment went out.

Final thoughts:

Recommendation to whoever reads this article:

  • Make a note to upgrade the EKS add-ons before/after an EKS upgrade and ensure they are compatible with the target Kubernetes version (see the sketch after this list).
  • Ensure latency, 4xx & 5xx alerts on your services so that such problems are detected immediately. The SRE/DevOps/Dev teams should learn about issues from these alerts rather than from the customer support team; hearing it from customers first creates a bad experience.
  • Runbooks on how to debug such issues would definitely be handy.
  • As systems scale, failure is inevitable. Handling failures gracefully is an art. Build systems to handle failures. Being calm and composed during the chaos helps.
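
For the first point, the add-on versions compatible with a target Kubernetes version can be listed up front. A minimal sketch with the AWS CLI, cluster name again a placeholder:

# List add-on versions compatible with the target Kubernetes version
aws eks describe-addon-versions --kubernetes-version 1.25 --addon-name kube-proxy \
  --query 'addons[].addonVersions[].addonVersion'
aws eks describe-addon-versions --kubernetes-version 1.25 --addon-name vpc-cni \
  --query 'addons[].addonVersions[].addonVersion'

# Compare against what the cluster is actually running
aws eks list-addons --cluster-name my-cluster
kubectl get daemonset -n kube-system kube-proxy aws-node -o wide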
