Linuzb's Blog

Notes on what I learn along the way

28 Apr 2024

Deploying Kubernetes monitoring

Prerequisites

  1. A Kubernetes cluster must be deployed first; see k8s deploy.

Installing the Prometheus stack with Helm

Setting up Prometheus — NVIDIA GPU Telemetry 1.0.0 documentation

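Install Prometheus with the following command. It assumes the prometheus-community chart repository has already been added to Helm (helm repo add prometheus-community https://prometheus-community.github.io/helm-charts).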

helm upgrade --install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
   --namespace prometheus \
   --create-namespace \
   --set prometheus.service.type=NodePort \
   --set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false \
   --set prometheusOperator.admissionWebhooks.patch.image.registry=registry.cn-hangzhou.aliyuncs.com \
   --set prometheusOperator.admissionWebhooks.patch.image.repository=linuzb/kube-webhook-certgen \
   --set kube-state-metrics.image.registry=registry.cn-hangzhou.aliyuncs.com \
   --set kube-state-metrics.image.repository=linuzb/kube-state-metrics
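
The same settings can also be kept in a values file rather than a long list of --set flags. The sketch below only rewrites the flags above as YAML; the file name kube-prometheus-stack-values.yaml is my own choice, not something the chart requires.

```shell
# Write the --set flags from the command above into a values file.
# The file name is an arbitrary choice for this sketch.
cat > kube-prometheus-stack-values.yaml <<'EOF'
prometheus:
  service:
    type: NodePort
  prometheusSpec:
    serviceMonitorSelectorNilUsesHelmValues: false
prometheusOperator:
  admissionWebhooks:
    patch:
      image:
        registry: registry.cn-hangzhou.aliyuncs.com
        repository: linuzb/kube-webhook-certgen
kube-state-metrics:
  image:
    registry: registry.cn-hangzhou.aliyuncs.com
    repository: linuzb/kube-state-metrics
EOF
```

Passing the file with --values kube-prometheus-stack-values.yaml in place of the --set flags should give the same result, and the file is easier to keep under version control.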

Installing DCGM for GPU monitoring

helm upgrade --install \
   dcgm-exporter \
   gpu-helm-charts/dcgm-exporter \
   --values config.yaml

To deploy the exporter on the master (control-plane) nodes as well, use this config.yaml:

tolerations:
- key: "node-role.kubernetes.io/control-plane"
  operator: "Exists"
  effect: "NoSchedule"

Import the GPU monitoring dashboard: https://grafana.com/grafana/dashboards/12239

Using Grafana

Change the Service type to NodePort

kubectl -n prometheus edit svc kube-prometheus-stack-grafana

Add the following:

spec:
  ports:
  - name: http-web
    nodePort: 30759
  type: NodePort
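
As a non-interactive alternative to editing the Service by hand, the same change could be applied with kubectl patch. This is only a sketch: it assumes the Grafana Service exposes port 80 (worth checking with kubectl get svc first), and the patch file name is hypothetical.

```shell
# Strategic merge patch: entries in a Service's ports list are matched
# by their "port" value, so only the nodePort of that entry changes.
# Port 80 is an assumption; adjust it if your Grafana Service differs.
cat > grafana-nodeport-patch.yaml <<'EOF'
spec:
  type: NodePort
  ports:
  - port: 80
    nodePort: 30759
EOF
```

The patch can then be applied with kubectl -n prometheus patch svc kube-prometheus-stack-grafana --patch-file grafana-nodeport-patch.yaml, after which Grafana should be reachable on port 30759 of any node.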

Dashboard

The default Grafana password is prom-operator, as set in the chart's values:

# Deploy default dashboards.
#
defaultDashboardsEnabled: true

adminPassword: prom-operator
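
If the password has been overridden, it can usually be read back from the cluster, where Secret values are stored base64-encoded. The sketch below shows only the decoding step; decode_secret_value is a hypothetical helper name, and the sample input is simply the default password encoded by hand.

```shell
# Decode a base64-encoded Secret value read from standard input.
# decode_secret_value is a hypothetical helper for this sketch.
decode_secret_value() {
  base64 --decode
}

# Sample input: the default password "prom-operator", base64-encoded.
printf '%s' 'cHJvbS1vcGVyYXRvcg==' | decode_secret_value
```

On a live cluster the input would come from something like kubectl -n prometheus get secret kube-prometheus-stack-grafana -o jsonpath='{.data.admin-password}', assuming the Secret carries the release's default name.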