Monitoring Production EKS cluster using Prometheus & Grafana (Helm)
Update: 12 June 2023
- Added commands with latest helm chart version & EBS PVC error resolution.
Well, a pretty common design pattern for setting up observability & monitoring in Kubernetes is Prometheus & Grafana. I’ve tried multiple guides, but nowhere could I find a production-grade solution that can be maintained with minimal drift, so I am putting together what worked for me.
In this blog post, we’ll set up a robust, declarative, version-controlled monitoring stack on our EKS cluster using Prometheus and Grafana.
Prerequisites:
- EKS cluster (≥v1.20)
- Helm 3+
- kubectl configured against the cluster
- Optional: a VPN, so you can reach the Prometheus & Grafana GUIs over NodePort* (using the private IPs of the EKS nodes). Port forwarding will also do, but only as a temporary solution.
*Being a production setup, we do not want to expose the Prometheus and Grafana GUIs publicly.
Step 1: Understanding the solution
- What will we deploy using Helm? The Prometheus Operator, Prometheus, Alertmanager, Grafana & ServiceMonitors.
- Just imagine installing all of these components individually: maintaining the config files, checking version compatibility between them, and so on. It already sounds like a nightmare. 😶
- This is why we use Helm: it manages all of these dependencies for us. We should be grateful to the community which has built such amazing chart integrations. 🙌
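- For context, everything in that list is managed declaratively: the Prometheus Operator installs a set of CRDs under the monitoring.coreos.com API group, and Prometheus, Alertmanager & ServiceMonitors then become ordinary Kubernetes objects. Once the stack is installed in Step 3, you can list those CRDs like this:
# List the CRDs installed by the Prometheus Operator (run after Step 3)
kubectl get crds | grep monitoring.coreos.com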
- Let’s proceed to the terminal.
Step 2: Updating the chart variables before deployment
- We will be using the community-maintained kube-prometheus-stack helm chart from the prometheus-community/helm-charts repository.
- In the chart repo there is a values.yaml file, which we need to customize to our requirements by overriding the default values. This is a critical step.
- Here’s our values.yaml file:
# Disable rules & scraping for control-plane components that are not
# reachable on EKS (managed control plane)
defaultRules:
  rules:
    etcd: false
    kubeScheduler: false
kubeControllerManager:
  enabled: false
kubeEtcd:
  enabled: false
kubeScheduler:
  enabled: false

# Label everything the chart creates so the serviceMonitorSelector below matches it
commonLabels:
  prometheus: devops

prometheus:
  prometheusSpec:
    serviceMonitorSelector:
      matchLabels:
        prometheus: devops
    # Define persistent storage for Prometheus (PVC)
    storageSpec:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          storageClassName: gp2
          resources:
            requests:
              storage: 100Gi

# Define persistent storage for Grafana (PVC)
grafana:
  # Set password for Grafana admin user
  adminPassword: XXXXXXXXXXXX # <--- update your password here
  persistence:
    enabled: true
    storageClassName: gp2
    accessModes: ["ReadWriteOnce"]
    size: 100Gi

# Define persistent storage for Alertmanager (PVC)
alertmanager:
  alertmanagerSpec:
    storage:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          storageClassName: gp2
          resources:
            requests:
              storage: 50Gi

# Change default node-exporter port
prometheus-node-exporter:
  service:
    port: 30206
    targetPort: 30206
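- A note on the serviceMonitorSelector above: Prometheus will only pick up ServiceMonitors carrying the prometheus: devops label. commonLabels stamps that label onto the ServiceMonitors created by the chart itself; any ServiceMonitor you later write for your own workloads needs it too. A minimal sketch, assuming a hypothetical Service named my-app in the default namespace with a port named http-metrics (both names are illustrative, not part of this stack):
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app                # hypothetical example application
  namespace: monitoring
  labels:
    prometheus: devops        # must match serviceMonitorSelector in values.yaml
spec:
  selector:
    matchLabels:
      app: my-app             # labels on the target Service
  namespaceSelector:
    matchNames:
      - default               # namespace of the target Service
  endpoints:
    - port: http-metrics      # Service port that exposes /metrics
      interval: 30s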
- We’ve defined PVCs for Prometheus, Grafana & Alertmanager so that metrics and dashboards are not lost on pod restarts.
- Now that our custom values.yaml file is ready, we can go ahead and deploy.
Step 3: Installing the chart
- Before installing the chart, we need to add the prometheus-community repo. Run this from the terminal:
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
- Update the helm repository
helm repo update
- Now, let’s search for the kube-prometheus-stack helm chart
helm search repo kube-prometheus-stack --max-col-width 23

NAME                     CHART VERSION  APP VERSION  DESCRIPTION
prometheus-community...  46.8.0         0.65.2       kube-prometheus-stac...
- Finally install the chart
helm install monitoring \
prometheus-community/kube-prometheus-stack \
--values values.yaml \
--version 46.8.0 \
--namespace monitoring \
--create-namespace
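- Since the whole point is a declarative, version-controlled setup, keep values.yaml in git and roll out any future change through Helm with the same file. A sketch of the corresponding upgrade command (same release name and pinned chart version as above):
# Re-apply after editing values.yaml; --install makes the same command
# work on a fresh cluster too, so it can live in CI as-is
helm upgrade --install monitoring \
  prometheus-community/kube-prometheus-stack \
  --values values.yaml \
  --version 46.8.0 \
  --namespace monitoring \
  --create-namespace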
- Validate all the pods in the monitoring namespace
kubectl get pods -n monitoring
NAME                                                      READY   STATUS    RESTARTS   AGE
alertmanager-monitoring-kube-prometheus-alertmanager-0    2/2     Running   0          104s
monitoring-grafana-7956d8b954-tslz6                       3/3     Running   0          113s
monitoring-kube-prometheus-operator-6c6866d56d-mxbhc      1/1     Running   0          113s
monitoring-kube-state-metrics-56bfd4f44f-5b7fq            1/1     Running   0          113s
monitoring-prometheus-node-exporter-cc72n                 1/1     Running   0          113s
monitoring-prometheus-node-exporter-grsld                 1/1     Running   0          113s
monitoring-prometheus-node-exporter-nq4wt                 1/1     Running   0          113s
monitoring-prometheus-node-exporter-nxfdz                 1/1     Running   0          113s
monitoring-prometheus-node-exporter-sbjb7                 1/1     Running   0          113s
prometheus-monitoring-kube-prometheus-prometheus-0        2/2     Running   0          103s
- If you get an error provisioning the PVCs like the one below, follow this link to install the Amazon EBS CSI driver in your cluster: https://docs.aws.amazon.com/eks/latest/userguide/ebs-csi.html
Normal ExternalProvisioning 2m30s (x25 over 8m10s) persistentvolume-controller waiting for a volume to be created, either by external provisioner "ebs.csi.aws.com" or manually created by system administrator
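- In that case it is also worth confirming that the gp2 StorageClass referenced in values.yaml exists and that the EBS CSI driver addon is installed. A sketch, where the cluster name and IAM role ARN are placeholders you must replace with your own:
# Check the StorageClass and the state of the pending PVCs
kubectl get storageclass gp2
kubectl get pvc -n monitoring

# Install the EBS CSI driver as an EKS addon
# (my-cluster and the role ARN below are placeholders)
aws eks create-addon \
  --cluster-name my-cluster \
  --addon-name aws-ebs-csi-driver \
  --service-account-role-arn arn:aws:iam::111122223333:role/AmazonEKS_EBS_CSI_DriverRole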
- If the PVC was provisioned successfully — Yay! 🍻 All our pods are up and running. This is 90% done. Wait! What’s pending then?
- The services exposing these pods are of type ClusterIP, which cannot be reached from a browser. Remember? We need the Prometheus & Grafana UIs accessible from the browser.
Step 4: Exposing the services
- Let’s check the list of services deployed first
kubectl get svc -n monitoring

NAME                                      TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                      AGE
alertmanager-operated                     ClusterIP   None             <none>        9093/TCP,9094/TCP,9094/UDP   3m20s
monitoring-grafana                        ClusterIP   172.30.205.8     <none>        80/TCP                       3m29s
monitoring-kube-prometheus-alertmanager   ClusterIP   172.30.103.69    <none>        9093/TCP                     3m30s
monitoring-kube-prometheus-operator       ClusterIP   172.30.159.195   <none>        443/TCP                      3m29s
monitoring-kube-prometheus-prometheus     ClusterIP   172.30.122.234   <none>        9090/TCP                     3m29s
monitoring-kube-state-metrics             ClusterIP   172.30.186.124   <none>        8080/TCP                     3m29s
monitoring-prometheus-node-exporter       ClusterIP   172.30.199.72    <none>        30206/TCP                    3m29s
prometheus-operated                       ClusterIP   None             <none>        9090/TCP                     3m20s
- Out of all these services, we need to expose monitoring-grafana & monitoring-kube-prometheus-prometheus for GUI access.
- There are two options to expose the services:
- Using port-forward (temporary): No VPN required
kubectl port-forward \
  svc/monitoring-kube-prometheus-prometheus 9090 \
  -n monitoring

# Run the below command in a different terminal window
# (the Grafana service listens on port 80, so map local 3000 to it)
kubectl port-forward \
  svc/monitoring-grafana 3000:80 \
  -n monitoring
Now you can access the Prometheus GUI at localhost:9090 and Grafana at localhost:3000.
- Using NodePort (preferred): VPN required
- Edit the monitoring-grafana & monitoring-kube-prometheus-prometheus services and change their type to NodePort.
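- One way to do that is with kubectl patch, sketched below; kubectl edit svc <name> -n monitoring works just as well. Kubernetes assigns random ports from the NodePort range unless you also pin spec.ports[].nodePort:
# Switch both services to type NodePort
kubectl patch svc monitoring-grafana -n monitoring \
  -p '{"spec": {"type": "NodePort"}}'
kubectl patch svc monitoring-kube-prometheus-prometheus -n monitoring \
  -p '{"spec": {"type": "NodePort"}}'
Then confirm the change: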
kubectl get svc -n monitoring
NAME                                      TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                      AGE
alertmanager-operated                     ClusterIP   None             <none>        9093/TCP,9094/TCP,9094/UDP   22m
monitoring-grafana                        NodePort    172.30.205.8     <none>        80:30520/TCP                 22m   <-- Updated
monitoring-kube-prometheus-alertmanager   ClusterIP   172.30.103.69    <none>        9093/TCP                     22m
monitoring-kube-prometheus-operator       ClusterIP   172.30.159.195   <none>        443/TCP                      22m
monitoring-kube-prometheus-prometheus     NodePort    172.30.122.234   <none>        9090:32083/TCP               22m   <-- Updated
monitoring-kube-state-metrics             ClusterIP   172.30.186.124   <none>        8080/TCP                     22m
monitoring-prometheus-node-exporter       ClusterIP   172.30.199.72    <none>        30206/TCP                    22m
prometheus-operated                       ClusterIP   None             <none>        9090/TCP                     22m
- Now we see that the monitoring-grafana & monitoring-kube-prometheus-prometheus services are of type NodePort.
- Ensure the NodePort range (30000-32767) is allowed in the security group attached to the EKS worker nodes, adding an inbound rule with your VPC CIDR as the source.
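- If you prefer the CLI over the console, a sketch of that rule (the security group ID and CIDR below are placeholders; use the security group attached to your worker nodes and your own VPC CIDR):
# Allow the Kubernetes NodePort range (30000-32767) from inside the VPC
aws ec2 authorize-security-group-ingress \
  --group-id sg-0123456789abcdef0 \
  --protocol tcp \
  --port 30000-32767 \
  --cidr 10.0.0.0/16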
- Now, assuming you have a VPN set up, allow the same port range there too and you should be able to access the Prometheus & Grafana GUIs.
- Select the private IP of any cluster node (you can fetch it using kubectl get nodes -o wide) and append the NodePort from the service output, as sketched below.
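For example (the NodePorts are the ones assigned in the svc output above; yours will differ):
# The INTERNAL-IP column lists the private IPs of the worker nodes
kubectl get nodes -o wide

# Then browse to:
#   http://<node-private-ip>:32083   -> Prometheus
#   http://<node-private-ip>:30520   -> Grafana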
Step 5: Fix the Prometheus targets for kube-proxy
- In EKS, the kube-proxy pods bind their metrics endpoint to 127.0.0.1 by default. This is the reason the kube-proxy targets time out in the Prometheus GUI.
- All we need to do is update this bind address from 127.0.0.1 to 0.0.0.0, which can be done with the command below:
kubectl -n kube-system get cm kube-proxy-config -o yaml | \
  sed 's/metricsBindAddress: 127.0.0.1:10249/metricsBindAddress: 0.0.0.0:10249/' | \
  kubectl apply -f -
- Roll out the kube-proxy DaemonSet after updating the ConfigMap:
kubectl rollout restart ds/kube-proxy -n kube-system
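- To double-check that the change took effect, a quick sketch (replace <node-private-ip> with a real node IP; 10249 is the metrics port from the ConfigMap):
# Confirm the ConfigMap now carries the new bind address
kubectl -n kube-system get cm kube-proxy-config -o yaml | grep metricsBindAddress

# Optionally hit one node's kube-proxy metrics endpoint directly
# (from a machine or pod that can reach the node's private IP)
curl -s http://<node-private-ip>:10249/metrics | head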
- Monitor the targets in Prometheus. All healthy!
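- If you prefer verifying from the command line rather than the Targets page, a sketch using the Prometheus HTTP API (assuming the GUI is reachable at localhost:9090 through the port-forward from Step 4; swap in the NodePort URL otherwise):
# Ask Prometheus for targets that are currently down;
# an empty "result" array means every target is healthy
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=up == 0'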
- Navigate to the Grafana GUI and log in as admin, using the same password that was set in the values.yaml file before deployment.
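- If you ever lose that password, it can be read back from the secret the chart creates for the release (a sketch, assuming the default secret name monitoring-grafana for this release):
# Decode the Grafana admin password from the chart-managed secret
kubectl get secret monitoring-grafana -n monitoring \
  -o jsonpath='{.data.admin-password}' | base64 --decode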
- Navigate to the bottom and find the Cluster Dashboard. You can explore the rest of the dashboards too.
- We now have a fully functional Prometheus feeding metrics to Grafana, with pre-built dashboards for our EKS cluster.
Thank you for reading. For queries, corrections or suggestions, please connect with me on LinkedIn.