Monitoring Production EKS cluster using Prometheus & Grafana (Helm)

Kiran Iyer · 7 min read · Mar 17, 2022


Update: 12 June 2023

  1. Added commands with the latest Helm chart version & an EBS PVC error resolution.

Well, a pretty common design pattern for setting up observability & monitoring in Kubernetes is Prometheus & Grafana. I've tried multiple guides, but nowhere could I find a production-grade solution that can be maintained with minimal drift. So I am putting together what worked for me.

In this blog post, we'll set up a robust, declarative, version-controlled monitoring stack on our EKS cluster using Prometheus and Grafana.

Prerequisites:

  1. EKS cluster (≥ v1.20)
  2. Helm 3+
  3. kubectl configured and connected to the cluster (see the quick check below)
  4. Optional: a VPN, so the Prometheus & Grafana GUIs can be reached over NodePort* using the EKS nodes' private IPs. Port forwarding also works, but only as a temporary solution.

*Since this is a pattern meant for production, we do not want to expose the Prometheus and Grafana GUIs publicly.
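A quick way to confirm the prerequisites before starting (the cluster name and region below are placeholders):

aws eks update-kubeconfig --name <cluster-name> --region <region>   # point kubectl at the cluster
kubectl get nodes                                                   # confirm connectivity
helm version --short                                                # expect v3.x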

Step 1: Understanding the solution

  • What will we deploy using Helm? The Prometheus Operator, Prometheus, Alertmanager, Grafana & ServiceMonitors.
  • Just imagine installing all these components individually, maintaining the config files, checking version compatibility between them and so on. It already sounds like a nightmare. 😶
  • This is the reason we use Helm: it manages all the dependencies for us. We should be grateful to the community which has built such amazing chart integrations. 🙌
  • Let’s proceed to the terminal.

Step 2: Updating the chart variables before deployment

  • We will be using the kube-prometheus-stack Helm chart maintained by the prometheus-community: https://github.com/prometheus-community/helm-charts
  • The chart ships with a values.yaml which we customize to our requirements by overriding the default values (you can also dump the full defaults for comparison; see the command at the end of this step). This is a critical step.
  • Here's our values.yaml file:
# Disable default rules & scrape targets for control-plane components not reachable on EKS
defaultRules:
  rules:
    etcd: false
    kubeScheduler: false

kubeControllerManager:
  enabled: false

kubeEtcd:
  enabled: false

kubeScheduler:
  enabled: false

# Label applied to all resources created by the chart
commonLabels:
  prometheus: devops

prometheus:
  prometheusSpec:
    # Only pick up ServiceMonitors carrying this label
    serviceMonitorSelector:
      matchLabels:
        prometheus: devops
    # Define persistent storage for Prometheus (PVC)
    storageSpec:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          storageClassName: gp2
          resources:
            requests:
              storage: 100Gi

# Define persistent storage for Grafana (PVC)
grafana:
  # Set password for Grafana admin user
  adminPassword: XXXXXXXXXXXX # <--- update your password here
  persistence:
    enabled: true
    storageClassName: gp2
    accessModes: ["ReadWriteOnce"]
    size: 100Gi

# Define persistent storage for Alertmanager (PVC)
alertmanager:
  alertmanagerSpec:
    storage:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          storageClassName: gp2
          resources:
            requests:
              storage: 50Gi

# Change default node-exporter port
prometheus-node-exporter:
  service:
    port: 30206
    targetPort: 30206
  • We've defined PVCs for Prometheus, Grafana & Alertmanager so that metrics and dashboards are not lost on pod restarts.
  • Now that our custom values.yaml file is ready, we can go ahead and deploy.
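  • For reference, once the prometheus-community repo is added (Step 3 below), the chart's full default values can be dumped locally and diffed against our overrides:
helm show values prometheus-community/kube-prometheus-stack --version 46.8.0 > default-values.yaml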

Step 3: Installing the chart

  • Before installing the kube-prometheus-stack chart, we need to add the prometheus-community repo. Run this from the terminal.
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
  • Update the helm repository
helm repo update
  • Now, let’s search for the kube-prometheus-stack helm chart
helm search repo kube-prometheus-stack --max-col-width 23
NAME                     CHART VERSION  APP VERSION  DESCRIPTION
prometheus-community...  46.8.0         0.65.2       kube-prometheus-stac...
  • Finally install the chart
helm install monitoring \
prometheus-community/kube-prometheus-stack \
--values values.yaml \
--version 46.8.0 \
--namespace monitoring \
--create-namespace
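  • Because everything lives in values.yaml, later changes are rolled out with the same declarative command; re-running it as helm upgrade --install keeps the release in sync with the file:
helm upgrade --install monitoring \
prometheus-community/kube-prometheus-stack \
--values values.yaml \
--version 46.8.0 \
--namespace monitoring \
--create-namespace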
  • Validate all the pods in the monitoring namespace
kubectl get pods -n monitoring
NAME                                                      READY   STATUS    RESTARTS   AGE
alertmanager-monitoring-kube-prometheus-alertmanager-0    2/2     Running   0          104s
monitoring-grafana-7956d8b954-tslz6                       3/3     Running   0          113s
monitoring-kube-prometheus-operator-6c6866d56d-mxbhc      1/1     Running   0          113s
monitoring-kube-state-metrics-56bfd4f44f-5b7fq            1/1     Running   0          113s
monitoring-prometheus-node-exporter-cc72n                 1/1     Running   0          113s
monitoring-prometheus-node-exporter-grsld                 1/1     Running   0          113s
monitoring-prometheus-node-exporter-nq4wt                 1/1     Running   0          113s
monitoring-prometheus-node-exporter-nxfdz                 1/1     Running   0          113s
monitoring-prometheus-node-exporter-sbjb7                 1/1     Running   0          113s
prometheus-monitoring-kube-prometheus-prometheus-0        2/2     Running   0          103s
  • If the Prometheus, Grafana or Alertmanager pods are stuck in Pending, describe the corresponding PVC (kubectl describe pvc -n monitoring) and check the events. An event like the one below means there is no working EBS provisioner in the cluster:
Normal  ExternalProvisioning  2m30s (x25 over 8m10s)  persistentvolume-controller  waiting for a volume to be created, either by external provisioner "ebs.csi.aws.com" or manually created by system administrator
  • On EKS 1.23 and later this usually means the aws-ebs-csi-driver add-on is missing; install it with an IAM role that has the AmazonEBSCSIDriverPolicy attached, and the pending PVCs will bind automatically.
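  • One way to add the driver (a sketch; the IAM role must already exist and trust the cluster's OIDC provider, and the names/ARN below are placeholders):
aws eks create-addon \
--cluster-name <cluster-name> \
--addon-name aws-ebs-csi-driver \
--service-account-role-arn arn:aws:iam::<account-id>:role/<ebs-csi-driver-role>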
  • If the PVCs were provisioned successfully, yay! 🍻 All our pods are up and running. That's 90% of the work done. Wait, what's pending then?
  • The services exposing these pods are of type ClusterIP, which cannot be reached from a browser. Remember, we need the Prometheus & Grafana UIs accessible from the browser.

Step 4: Exposing the services

  • Let’s check the list of services deployed first
kubectl get svc -n monitoring
NAME                                      TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                      AGE
alertmanager-operated                     ClusterIP   None             <none>        9093/TCP,9094/TCP,9094/UDP   3m20s
monitoring-grafana                        ClusterIP   172.30.205.8     <none>        80/TCP                       3m29s
monitoring-kube-prometheus-alertmanager   ClusterIP   172.30.103.69    <none>        9093/TCP                     3m30s
monitoring-kube-prometheus-operator       ClusterIP   172.30.159.195   <none>        443/TCP                      3m29s
monitoring-kube-prometheus-prometheus     ClusterIP   172.30.122.234   <none>        9090/TCP                     3m29s
monitoring-kube-state-metrics             ClusterIP   172.30.186.124   <none>        8080/TCP                     3m29s
monitoring-prometheus-node-exporter       ClusterIP   172.30.199.72    <none>        30206/TCP                    3m29s
prometheus-operated                       ClusterIP   None             <none>        9090/TCP                     3m20s
  • Out of all the services, we need to expose the monitoring-grafana & monitoring-kube-prometheus-prometheus services for GUI access.
  • There are 2 options to expose the services:
  • Using port-forward (temporary): No VPN required
kubectl port-forward \
svc/monitoring-kube-prometheus-prometheus 9090 \
-n monitoring
# Run the below command in a different terminal window
kubectl port-forward \
svc/monitoring-grafana 3000 \
-n monitoring

Now you can access the Prometheus GUI over localhost:9090 and Grafana over localhost:3000
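A quick sanity check that both tunnels work (assuming the two port-forward commands above are still running):

curl -s http://localhost:9090/-/healthy   # Prometheus health endpoint
curl -s http://localhost:3000/api/health  # Grafana health endpoint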

  • Using NodePort (preferred): VPN required
  • Edit and change the monitoring-grafana & monitoring-kube-prometheus-prometheus services to type NodePort (one way to do this is shown below), then confirm:
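  • The change can be made interactively with kubectl edit, or with kubectl patch as sketched here (the NodePort numbers are assigned automatically unless you specify them):
kubectl patch svc monitoring-grafana -n monitoring \
-p '{"spec": {"type": "NodePort"}}'
kubectl patch svc monitoring-kube-prometheus-prometheus -n monitoring \
-p '{"spec": {"type": "NodePort"}}'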
kubectl get svc -n monitoring
NAME                                      TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                      AGE
alertmanager-operated                     ClusterIP   None             <none>        9093/TCP,9094/TCP,9094/UDP   22m
monitoring-grafana                        NodePort    172.30.205.8     <none>        80:30520/TCP                 22m   <-- Updated
monitoring-kube-prometheus-alertmanager   ClusterIP   172.30.103.69    <none>        9093/TCP                     22m
monitoring-kube-prometheus-operator       ClusterIP   172.30.159.195   <none>        443/TCP                      22m
monitoring-kube-prometheus-prometheus     NodePort    172.30.122.234   <none>        9090:32083/TCP               22m   <-- Updated
monitoring-kube-state-metrics             ClusterIP   172.30.186.124   <none>        8080/TCP                     22m
monitoring-prometheus-node-exporter       ClusterIP   172.30.199.72    <none>        30206/TCP                    22m
prometheus-operated                       ClusterIP   None             <none>        9090/TCP                     22m
  • Now we see that the monitoring-grafana & monitoring-kube-prometheus-prometheus services are of type NodePort.
  • Ensure the security group attached to the EKS worker nodes allows inbound traffic on the NodePort range (30000-32767, or at least the two ports assigned above), with your VPC CIDR as the source; an example AWS CLI command follows this list.
  • Now, assuming you have a VPN set up, open the same port range there too and you should be able to reach the Prometheus & Grafana GUIs.
  • Pick the private IP of any cluster node (you can fetch it with kubectl get nodes -o wide) and append the NodePort from the service, e.g. http://<node-private-ip>:32083 for Prometheus and http://<node-private-ip>:30520 for Grafana.
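  • For reference, the security-group rule mentioned above can also be added with the AWS CLI (the group ID and CIDR below are placeholders for your node security group and VPC CIDR):
aws ec2 authorize-security-group-ingress \
--group-id sg-xxxxxxxxxxxxxxxxx \
--protocol tcp \
--port 30000-32767 \
--cidr 10.0.0.0/16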

Step 5: Fix the Prometheus targets for kube-proxy

  • In EKS, kube-proxy binds its metrics endpoint to 127.0.0.1 by default, so Prometheus cannot scrape it and the kube-proxy targets time out in the GUI.
kube-proxy timeout in Prometheus targets
  • All we need to do is update this bind address from 127.0.0.1 to 0.0.0.0. This can be achieved with the command below:
kubectl -n kube-system get cm kube-proxy-config -o yaml | sed 's/metricsBindAddress: 127.0.0.1:10249/metricsBindAddress: 0.0.0.0:10249/' | kubectl apply -f -
  • Roll out the kube-proxy DaemonSet after updating the ConfigMap:
kubectl rollout restart ds/kube-proxy -n kube-system
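  • To confirm the change took effect (the ConfigMap name matches the one patched above):
kubectl rollout status ds/kube-proxy -n kube-system
kubectl -n kube-system get cm kube-proxy-config -o yaml | grep metricsBindAddress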
  • Monitor the targets in Prometheus. All healthy!
  • Navigate to the Grafana GUI and log in as admin with the password you set in values.yaml before deployment.
Pre-defined cluster dashboards available
  • Navigate to the bottom and find the Cluster Dashboard. You can explore the rest of the dashboards too.
Cluster Compute Resources Dashboard
  • We now have a fully functional Prometheus feeding data to Grafana, with pre-defined dashboards for our EKS cluster.
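  • Since the serviceMonitorSelector in values.yaml only picks up ServiceMonitors labelled prometheus: devops, scraping your own workloads later is just a matter of creating one with that label. A minimal sketch (the app name, namespace and port name are placeholders):
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app
  namespace: monitoring
  labels:
    prometheus: devops        # must match serviceMonitorSelector in values.yaml
spec:
  selector:
    matchLabels:
      app: my-app             # labels on the Service exposing your app's metrics
  namespaceSelector:
    matchNames:
      - default
  endpoints:
    - port: http-metrics      # named port on the Service
      interval: 30s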
The End

Thank you for reading. For queries, corrections or suggestions, please connect with me on LinkedIn.
