Prerequisites
- First, deploy a Kubernetes cluster; see k8s deploy.
Install the Prometheus stack with Helm
Reference: Setting up Prometheus, NVIDIA GPU Telemetry 1.0.0 documentation
Install Prometheus with the following command:
helm upgrade --install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
--namespace prometheus \
--create-namespace \
--set prometheus.service.type=NodePort \
--set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false \
--set prometheusOperator.admissionWebhooks.patch.image.registry=registry.cn-hangzhou.aliyuncs.com \
--set prometheusOperator.admissionWebhooks.patch.image.repository=linuzb/kube-webhook-certgen \
--set kube-state-metrics.image.registry=registry.cn-hangzhou.aliyuncs.com \
--set kube-state-metrics.image.repository=linuzb/kube-state-metrics
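The install command above refers to the chart as prometheus-community/kube-prometheus-stack, so it presumably expects that repository to be registered with Helm already. A minimal sketch of that step plus a quick post-install check follows; the function names are made up for illustration, and the repository URL is the one published by the prometheus-community project.

```shell
# Register the community chart repository that the install command refers to.
add_prometheus_repo() {
  helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
  helm repo update
}

# After installing, list the pods in the namespace used above.
check_prometheus_pods() {
  kubectl --namespace prometheus get pods
}
```

Run add_prometheus_repo once before the helm upgrade --install command, and check_prometheus_pods afterwards to see whether the pods have started.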
Install the DCGM exporter for GPU monitoring
helm upgrade --install \
dcgm-exporter \
gpu-helm-charts/dcgm-exporter \
--values config.yaml
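The chart name gpu-helm-charts/dcgm-exporter likewise assumes a Helm repository registered under the alias gpu-helm-charts. A sketch of that registration, using the repository URL from NVIDIA's dcgm-exporter project (the function name is hypothetical):

```shell
# Register NVIDIA's dcgm-exporter chart repository under the alias used above.
add_gpu_repo() {
  helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
  helm repo update
}
```

Because the install command passes no --namespace flag, the exporter lands in whatever namespace the current kubectl context points at.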
To deploy the exporter on the master (control-plane) node as well, add this toleration to config.yaml:
tolerations:
  - key: "node-role.kubernetes.io/control-plane"
    operator: "Exists"
    effect: "NoSchedule"
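One way to check that the toleration took effect is to list the exporter pods together with the node each one runs on, then look at a few GPU metrics. The sketch below rests on assumptions not stated in the post: that the chart applies the standard app.kubernetes.io/name=dcgm-exporter label, and that the exporter serves metrics such as DCGM_FI_DEV_GPU_UTIL on port 9400. Both function names are hypothetical.

```shell
# List dcgm-exporter pods and the node each one is scheduled on.
# Assumption: the chart labels its pods with app.kubernetes.io/name=dcgm-exporter.
dcgm_pods() {
  kubectl get pods --selector app.kubernetes.io/name=dcgm-exporter --output wide
}

# Read Prometheus exposition text on stdin and keep only GPU utilization samples.
gpu_util_lines() {
  grep '^DCGM_FI_DEV_GPU_UTIL'
}
```

With a port-forward to the exporter on port 9400 in place, `curl -s localhost:9400/metrics | gpu_util_lines` should print one sample per GPU.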
Import the GPU monitoring dashboard: https://grafana.com/grafana/dashboards/12239
Using Grafana
Change the Grafana Service type to NodePort
kubectl -n prometheus edit svc kube-prometheus-stack-grafana
Add the nodePort to the http-web port entry and set the type to NodePort:
spec:
  ports:
  - name: http-web
    nodePort: 30759
  type: NodePort
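The same change can probably be made without opening an editor by sending a JSON patch. This is a sketch under the assumption that http-web is the first entry in spec.ports (index 0); the function names and the node address in the usage note are placeholders.

```shell
# Switch the Grafana Service to NodePort and pin the port, without an editor.
# Assumption: http-web is the first entry in spec.ports (index 0).
expose_grafana() {
  kubectl -n prometheus patch svc kube-prometheus-stack-grafana --type=json \
    -p '[{"op":"replace","path":"/spec/type","value":"NodePort"},
         {"op":"add","path":"/spec/ports/0/nodePort","value":30759}]'
}

# Build the URL to open in a browser from a node address and an optional port.
grafana_url() {
  node_ip="$1"
  node_port="${2:-30759}"
  printf 'http://%s:%s\n' "$node_ip" "$node_port"
}
```

For example, `grafana_url 192.0.2.10` prints http://192.0.2.10:30759, where 192.0.2.10 stands in for the address of any node in the cluster.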
Dashboard
Log in as admin; the default password is prom-operator, as set in the chart's default values:
# Deploy default dashboards.
#
defaultDashboardsEnabled: true
adminPassword: prom-operator
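If the password has been overridden at install time, it can usually be read back from the Secret that the Grafana subchart creates. The sketch below assumes the Secret shares the name of the Service used earlier (kube-prometheus-stack-grafana) and stores the password under the admin-password key; the function name is hypothetical.

```shell
# Print the Grafana admin password stored in the release's Secret.
# Assumption: the Secret is named like the Grafana Service and has an admin-password key.
grafana_admin_password() {
  kubectl -n prometheus get secret kube-prometheus-stack-grafana \
    -o jsonpath='{.data.admin-password}' | base64 --decode
}
```

Calling grafana_admin_password against an unmodified install should print the default shown above.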