Monitoring Production EKS cluster using Prometheus & Grafana (Helm)

Kiran Iyer · 7 min read · Mar 17, 2022


Update: 12 June 2023

  1. Added commands with the latest Helm chart version & an EBS PVC error resolution.

Well, a pretty common design pattern for setting up observability & monitoring in Kubernetes is Prometheus & Grafana. I've tried multiple guides, but nowhere could I find a production-grade solution that can be maintained with minimal drift. So I am putting together what worked for me.

In this blog post, we'll set up a robust, declarative, version-controlled monitoring stack on our EKS cluster using Prometheus and Grafana.

Prerequisites:

  1. EKS cluster (≥ v1.20)
  2. Helm 3+
  3. kubectl configured and connected to the cluster (see the quick check below)
  4. Optional: a VPN, so the Prometheus & Grafana GUIs can be reached over NodePort* using the EKS nodes' private IPs. Port forwarding also works, but only as a temporary solution.

*Since this is a pattern meant for production, we do not want to expose the Prometheus and Grafana GUIs publicly.
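A quick way to confirm the prerequisites before starting (the cluster name and region below are placeholders):

aws eks update-kubeconfig --name <cluster-name> --region <region>   # point kubectl at the cluster
kubectl get nodes                                                   # confirm connectivity
helm version --short                                                # expect v3.x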

Step 1: Understanding the solution

  • What will we deploy using Helm? The Prometheus Operator, Prometheus, Alertmanager, Grafana & ServiceMonitors.
  • Just imagine installing all these components individually, maintaining the config files, checking version compatibility between them and so on. It already sounds like a nightmare. 😶
  • This is the reason we use Helm: it manages all the dependencies for us. We should be grateful to the community which has built such amazing chart integrations. 🙌
  • Let’s proceed to the terminal.

Step 2: Updating the chart variables before deployment

  • We will be using the kube-prometheus-stack Helm chart maintained by the prometheus-community: https://github.com/prometheus-community/helm-charts
  • The chart ships with a values.yaml which we customize to our requirements by overriding the default values (you can also dump the full defaults for comparison; see the command at the end of this step). This is a critical step.
  • Here's our values.yaml file:
# Disable default rules & scrape targets for control-plane components not reachable on EKS
defaultRules:
  rules:
    etcd: false
    kubeScheduler: false

kubeControllerManager:
  enabled: false

kubeEtcd:
  enabled: false

kubeScheduler:
  enabled: false

# Label applied to all resources created by the chart
commonLabels:
  prometheus: devops

prometheus:
  prometheusSpec:
    # Only pick up ServiceMonitors carrying this label
    serviceMonitorSelector:
      matchLabels:
        prometheus: devops
    # Define persistent storage for Prometheus (PVC)
    storageSpec:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          storageClassName: gp2
          resources:
            requests:
              storage: 100Gi

# Define persistent storage for Grafana (PVC)
grafana:
  # Set password for Grafana admin user
  adminPassword: XXXXXXXXXXXX # <--- update your password here
  persistence:
    enabled: true
    storageClassName: gp2
    accessModes: ["ReadWriteOnce"]
    size: 100Gi

# Define persistent storage for Alertmanager (PVC)
alertmanager:
  alertmanagerSpec:
    storage:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          storageClassName: gp2
          resources:
            requests:
              storage: 50Gi

# Change default node-exporter port
prometheus-node-exporter:
  service:
    port: 30206
    targetPort: 30206
  • We've defined PVCs for Prometheus, Grafana & Alertmanager so that metrics and dashboards are not lost on pod restarts.
  • Now that our custom values.yaml file is ready, we can go ahead and deploy.
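  • For reference, once the prometheus-community repo is added (Step 3 below), the chart's full default values can be dumped locally and diffed against our overrides:
helm show values prometheus-community/kube-prometheus-stack --version 46.8.0 > default-values.yaml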

Step 3: Installing the chart

  • Before installing the kube-prometheus-stack chart, we need to add the prometheus-community repo. Run this from the terminal.
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
  • Update the helm repository
helm repo update
  • Now, let’s search for the kube-prometheus-stack helm chart
helm search repo kube-prometheus-stack --max-col-width 23
NAME                     CHART VERSION  APP VERSION  DESCRIPTION
prometheus-community...  46.8.0         0.65.2       kube-prometheus-stac...
  • Finally install the chart
helm install monitoring \
prometheus-community/kube-prometheus-stack \
--values values.yaml \
--version 46.8.0 \
--namespace monitoring \
--create-namespace
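  • Because everything lives in values.yaml, later changes are rolled out with the same declarative command; re-running it as helm upgrade --install keeps the release in sync with the file:
helm upgrade --install monitoring \
prometheus-community/kube-prometheus-stack \
--values values.yaml \
--version 46.8.0 \
--namespace monitoring \
--create-namespace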
  • Validate all the pods in the monitoring namespace
kubectl get pods -n monitoring
NAME                                                      READY   STATUS    RESTARTS   AGE
alertmanager-monitoring-kube-prometheus-alertmanager-0    2/2     Running   0          104s
monitoring-grafana-7956d8b954-tslz6                       3/3     Running   0          113s
monitoring-kube-prometheus-operator-6c6866d56d-mxbhc      1/1     Running   0          113s
monitoring-kube-state-metrics-56bfd4f44f-5b7fq            1/1     Running   0          113s
monitoring-prometheus-node-exporter-cc72n                 1/1     Running   0          113s
monitoring-prometheus-node-exporter-grsld                 1/1     Running   0          113s
monitoring-prometheus-node-exporter-nq4wt                 1/1     Running   0          113s
monitoring-prometheus-node-exporter-nxfdz                 1/1     Running   0          113s
monitoring-prometheus-node-exporter-sbjb7                 1/1     Running   0          113s
prometheus-monitoring-kube-prometheus-prometheus-0        2/2     Running   0          103s
  • If the Prometheus, Grafana or Alertmanager pods are stuck in Pending, describe the corresponding PVC (kubectl describe pvc -n monitoring) and check the events. An event like the one below means there is no working EBS provisioner in the cluster:
Normal  ExternalProvisioning  2m30s (x25 over 8m10s)  persistentvolume-controller  waiting for a volume to be created, either by external provisioner "ebs.csi.aws.com" or manually created by system administrator
  • On EKS 1.23 and later this usually means the aws-ebs-csi-driver add-on is missing; install it with an IAM role that has the AmazonEBSCSIDriverPolicy attached, and the pending PVCs will bind automatically.
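  • One way to add the driver (a sketch; the IAM role must already exist and trust the cluster's OIDC provider, and the names/ARN below are placeholders):
aws eks create-addon \
--cluster-name <cluster-name> \
--addon-name aws-ebs-csi-driver \
--service-account-role-arn arn:aws:iam::<account-id>:role/<ebs-csi-driver-role>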
  • If the PVCs were provisioned successfully, yay! 🍻 All our pods are up and running. That's 90% of the work done. Wait, what's pending then?
  • The services exposing these pods are of type ClusterIP, which cannot be reached from a browser. Remember, we need the Prometheus & Grafana UIs accessible from the browser.

Step 4: Exposing the services

  • Let’s check the list of services deployed first
kubectl get svc -n monitoring
NAME                                      TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                      AGE
alertmanager-operated                     ClusterIP   None             <none>        9093/TCP,9094/TCP,9094/UDP   3m20s
monitoring-grafana                        ClusterIP   172.30.205.8     <none>        80/TCP                       3m29s
monitoring-kube-prometheus-alertmanager   ClusterIP   172.30.103.69    <none>        9093/TCP                     3m30s
monitoring-kube-prometheus-operator       ClusterIP   172.30.159.195   <none>        443/TCP                      3m29s
monitoring-kube-prometheus-prometheus     ClusterIP   172.30.122.234   <none>        9090/TCP                     3m29s
monitoring-kube-state-metrics             ClusterIP   172.30.186.124   <none>        8080/TCP                     3m29s
monitoring-prometheus-node-exporter       ClusterIP   172.30.199.72    <none>        30206/TCP                    3m29s
prometheus-operated                       ClusterIP   None             <none>        9090/TCP                     3m20s
  • Out of all the services, we need to expose the monitoring-grafana & monitoring-kube-prometheus-prometheus services for GUI access.
  • There are 2 options to expose the services:
  • Using port-forward (temporary): No VPN required
kubectl port-forward \
svc/monitoring-kube-prometheus-prometheus 9090 \
-n monitoring
# Run the below command in a different terminal window
kubectl port-forward \
svc/monitoring-grafana 3000 \
-n monitoring

Now you can access the Prometheus GUI over localhost:9090 and Grafana over localhost:3000
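A quick sanity check that both tunnels work (assuming the two port-forward commands above are still running):

curl -s http://localhost:9090/-/healthy   # Prometheus health endpoint
curl -s http://localhost:3000/api/health  # Grafana health endpoint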

  • Using NodePort (preferred): VPN required
  • Edit and change the monitoring-grafana & monitoring-kube-prometheus-prometheus services to type NodePort (one way to do this is shown below), then confirm:
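  • The change can be made interactively with kubectl edit, or with kubectl patch as sketched here (the NodePort numbers are assigned automatically unless you specify them):
kubectl patch svc monitoring-grafana -n monitoring \
-p '{"spec": {"type": "NodePort"}}'
kubectl patch svc monitoring-kube-prometheus-prometheus -n monitoring \
-p '{"spec": {"type": "NodePort"}}'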
kubectl get svc -n monitoring
NAME                                      TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                      AGE
alertmanager-operated                     ClusterIP   None             <none>        9093/TCP,9094/TCP,9094/UDP   22m
monitoring-grafana                        NodePort    172.30.205.8     <none>        80:30520/TCP                 22m   <-- Updated
monitoring-kube-prometheus-alertmanager   ClusterIP   172.30.103.69    <none>        9093/TCP                     22m
monitoring-kube-prometheus-operator       ClusterIP   172.30.159.195   <none>        443/TCP                      22m
monitoring-kube-prometheus-prometheus     NodePort    172.30.122.234   <none>        9090:32083/TCP               22m   <-- Updated
monitoring-kube-state-metrics             ClusterIP   172.30.186.124   <none>        8080/TCP                     22m
monitoring-prometheus-node-exporter       ClusterIP   172.30.199.72    <none>        30206/TCP                    22m
prometheus-operated                       ClusterIP   None             <none>        9090/TCP                     22m
  • Now we see that the monitoring-grafana & monitoring-kube-prometheus-prometheus services are of type NodePort.
  • Ensure the security group attached to the EKS worker nodes allows inbound traffic on the NodePort range (30000-32767, or at least the two ports assigned above), with your VPC CIDR as the source; an example AWS CLI command follows this list.
  • Now, assuming you have a VPN set up, open the same port range there too and you should be able to reach the Prometheus & Grafana GUIs.
  • Pick the private IP of any cluster node (you can fetch it with kubectl get nodes -o wide) and append the NodePort from the service, e.g. http://<node-private-ip>:32083 for Prometheus and http://<node-private-ip>:30520 for Grafana.
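  • For reference, the security-group rule mentioned above can also be added with the AWS CLI (the group ID and CIDR below are placeholders for your node security group and VPC CIDR):
aws ec2 authorize-security-group-ingress \
--group-id sg-xxxxxxxxxxxxxxxxx \
--protocol tcp \
--port 30000-32767 \
--cidr 10.0.0.0/16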

Step 5: Fix the Prometheus targets for kube-proxy

  • In EKS, kube-proxy binds its metrics endpoint to 127.0.0.1 by default, so Prometheus cannot scrape it and the kube-proxy targets time out in the GUI.
kube-proxy timeout in Prometheus targets
  • All we need to do is update this bind address from 127.0.0.1 to 0.0.0.0. This can be achieved with the command below:
kubectl -n kube-system get cm kube-proxy-config -o yaml | sed 's/metricsBindAddress: 127.0.0.1:10249/metricsBindAddress: 0.0.0.0:10249/' | kubectl apply -f -
  • Roll out the kube-proxy DaemonSet after updating the ConfigMap:
kubectl rollout restart ds/kube-proxy -n kube-system
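  • To confirm the change took effect (the ConfigMap name matches the one patched above):
kubectl rollout status ds/kube-proxy -n kube-system
kubectl -n kube-system get cm kube-proxy-config -o yaml | grep metricsBindAddress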
  • Monitor the targets in Prometheus. All healthy!
  • Navigate to the Grafana GUI and log in as admin with the password you set in values.yaml before deployment.
Pre-defined cluster dashboards available
  • Navigate to the bottom and find the Cluster Dashboard. You can explore the rest of the dashboards too.
Cluster Compute Resources Dashboard
  • We now have a fully functional Prometheus feeding data to Grafana, with pre-defined dashboards for our EKS cluster.
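  • Since the serviceMonitorSelector in values.yaml only picks up ServiceMonitors labelled prometheus: devops, scraping your own workloads later is just a matter of creating one with that label. A minimal sketch (the app name, namespace and port name are placeholders):
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app
  namespace: monitoring
  labels:
    prometheus: devops        # must match serviceMonitorSelector in values.yaml
spec:
  selector:
    matchLabels:
      app: my-app             # labels on the Service exposing your app's metrics
  namespaceSelector:
    matchNames:
      - default
  endpoints:
    - port: http-metrics      # named port on the Service
      interval: 30s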
The End

Thank you for reading. For queries, corrections or suggestions, please connect with me on LinkedIn.
