Preparation
This guide uses four physical machines in total: one is used to run the kubespray deployment, and the other three serve as kubernetes nodes.
| Machine | IP | OS | Username | Notes |
| --- | --- | --- | --- | --- |
| deploy host | any | Ubuntu | linuzb | machine that runs the kubespray deployment; not a k8s cluster node |
| kube-master-100 | 172.16.0.100 | Ubuntu | linuzb | initial k8s master node |
| kube-node-117 | 172.16.0.117 | Ubuntu | linuzb | initial k8s node |
| kube-node-114 | 172.16.0.114 | Ubuntu | linuzb | k8s node added later |
kubespray configuration
Custom configuration
git clone https://github.com/kubernetes-sigs/kubespray
cd kubespray
# Copy the sample cluster inventory
cp -rfp inventory/sample inventory/mycluster
inventory
inventory/mycluster/inventory.ini
# ## Configure 'ip' variable to bind kubernetes services on a
# ## different ip than the default iface
# ## We should set etcd_member_name for etcd cluster. The node that is not a etcd member do not need to set the value, or can set the empty string value.
[all]
kube-master-100 ansible_ssh_host=172.16.0.100 ansible_ssh_user=linuzb ip=172.16.0.100 mask=/24
kube-node-117 ansible_ssh_host=172.16.0.117 ansible_ssh_user=linuzb ip=172.16.0.117 mask=/24
# ## configure a bastion host if your nodes are not directly reachable
# [bastion]
# bastion ansible_host=x.x.x.x ansible_user=some_user
[kube_control_plane]
kube-master-100
[etcd]
kube-master-100
[kube_node]
kube-node-117
[calico_rr]
[k8s_cluster:children]
kube_control_plane
kube_node
calico_rr
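Before changing anything else, it is worth confirming that Ansible can reach every host in the inventory. A minimal sketch, assuming ansible is available on the deploy host (it can also be run later inside the kubespray container against /inventory/inventory.ini) and password-based SSH as in the rest of this guide:

# Ping all hosts in the inventory over SSH (prompts for the SSH password)
ansible -i inventory/mycluster/inventory.ini all -m ping --user linuzb --ask-pass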
Cluster network
vim inventory/mycluster/group_vars/k8s_cluster/k8s-cluster.yml
Use the high-performance cilium plugin.
# Choose network plugin (cilium, calico, kube-ovn, weave or flannel. Use cni for generic cni plugin)
# Can also be set to 'cloud', which lets the cloud provider setup appropriate routing
kube_network_plugin: cilium
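Once the cluster has been deployed (see the deployment steps below), a quick way to confirm the cilium agents came up; the exact pod names depend on the cilium version kubespray ships:

# cilium runs as pods in kube-system
kubectl -n kube-system get pods -o wide | grep cilium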
Runtime
Use the default containerd.
inventory/mycluster/group_vars/k8s_cluster/k8s-cluster.yml
## Container runtime
## docker for docker, crio for cri-o and containerd for containerd.
## Default: containerd
container_manager: containerd
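After the nodes are provisioned, the runtime can be checked directly on a node; crictl is normally installed by kubespray alongside containerd:

# Should report containerd as the runtime
sudo crictl version
sudo systemctl status containerd --no-pager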
Container runtime directories (not modified for now)
inventory/mycluster/group_vars/all/containerd.yml
# containerd_storage_dir: "/var/lib/containerd"
# containerd_state_dir: "/run/containerd"
Configure container registry mirrors (not configured for now; dragonfly will be used instead)
inventory/mycluster/group_vars/all/containerd.yml
containerd_registries_mirrors:
  - prefix: docker.io
    mirrors:
      - host: http://hub-mirror.c.163.com
        capabilities: ["pull", "resolve"]
        skip_verify: false
Cluster settings
Automatically renew cluster certificates; by default they are valid for one year.
inventory/mycluster/group_vars/k8s_cluster/k8s-cluster.yml
## Automatically renew K8S control plane certificates on first Monday of each month
auto_renew_certificates: true
# Can be docker_dns, host_resolvconf or none
resolvconf_mode: none
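With auto_renew_certificates enabled, the control-plane certificates are renewed periodically; the current expiry dates can be checked at any time on the master:

# Run on kube-master-100
sudo kubeadm certs check-expiration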
Show logs on errors
inventory/mycluster/group_vars/all/all.yml
## Used to control no_log attribute
unsafe_show_logs: true
China mirror sources for images and files
kubespray/docs/mirror.md at master · kubernetes-sigs/kubespray (github.com)
# Use the download mirror
cp inventory/mycluster/group_vars/all/offline.yml inventory/mycluster/group_vars/all/mirror.yml
sed -i -E '/# .*\{\{ files_repo/s/^# //g' inventory/mycluster/group_vars/all/mirror.yml
tee -a inventory/mycluster/group_vars/all/mirror.yml <<EOF
gcr_image_repo: "gcr.m.daocloud.io"
kube_image_repo: "k8s.m.daocloud.io"
docker_image_repo: "docker.m.daocloud.io"
quay_image_repo: "quay.m.daocloud.io"
github_image_repo: "ghcr.m.daocloud.io"
files_repo: "https://files.m.daocloud.io"
EOF
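A quick sanity check that the variables ended up in mirror.yml as expected (the sed above uncomments the files_repo lines, and the tee appends the image repos):

grep -E 'repo:' inventory/mycluster/group_vars/all/mirror.yml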
Deploy with docker
docker run --rm -it --mount type=bind,source="$(pwd)"/inventory/mycluster,dst=/inventory \
quay.io/kubespray/kubespray:v2.24.1 bash
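If the nodes are reachable with an SSH key instead of a password, the key can be mounted into the container as well, similar to the example in the kubespray README (paths are illustrative):

docker run --rm -it \
  --mount type=bind,source="$(pwd)"/inventory/mycluster,dst=/inventory \
  --mount type=bind,source="${HOME}"/.ssh/id_rsa,dst=/root/.ssh/id_rsa \
  quay.io/kubespray/kubespray:v2.24.1 bash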
View the cluster configuration
ansible-inventory -i /inventory/inventory.ini --list
Run the deployment
# You will be prompted for two passwords (SSH and become)
ansible-playbook -i /inventory/inventory.ini cluster.yml --user k8s --ask-pass --become --ask-become-pass
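When the playbook finishes, check from the master (or with the kubeconfig described in the "Get the kubeconfig" section below) that all nodes registered and the system pods are running:

kubectl get nodes -o wide
kubectl -n kube-system get pods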
Scaling up: adding a node
ansible-playbook -i /inventory/inventory.ini scale.yml \
--user=linuzb --ask-pass --become --ask-become-pass -b \
--limit=kube-node-114
You can use --limit=NODE_NAME to restrict Kubespray so it does not disturb the other nodes in the cluster. Without --limit, the playbook runs facts.yml to refresh the fact cache on all nodes.
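The new node has to be present in inventory.ini before running scale.yml (under [all] with its IP, and under [kube_node]). Once the playbook completes, confirm it joined:

kubectl get node kube-node-114 -o wide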
Scaling down: removing a node
If a node is no longer needed, it can be removed from the cluster. The usual steps are:
- 1. kubectl cordon NODE, then drain the node to make sure its workloads are rescheduled onto other nodes (see the docs on safely maintaining or decommissioning a node).
- 2. Stop the k8s components on the node (kubelet, kube-proxy, etc.).
- 3. kubectl delete NODE to remove the node from the cluster.
- 4. If the node is a virtual machine and is no longer needed, it can simply be destroyed.
The first three steps can also be done in one go with the remove-node.yml playbook provided by kubespray:
ansible-playbook \
-i inventory/mycluster/inventory.ini \
--user=linuzb --ask-pass --become --ask-become-pass -b \
-e "node=kube-node-114" \
remove-node.yml
List the nodes to remove in -e. If the node you want to delete is not online, you should set reset_nodes=false and add allow_ungraceful_removal=true to your extra variables.
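For example, a variant of the command above for a node that is already offline, combining those extra variables:

ansible-playbook \
  -i inventory/mycluster/inventory.ini \
  --user=linuzb --ask-pass --become --ask-become-pass -b \
  -e "node=kube-node-114 reset_nodes=false allow_ungraceful_removal=true" \
  remove-node.yml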
Follow-up
Get the kubeconfig
After the deployment completes, the kubeconfig is available at /root/.kube/config on the master node.
Once you have the kubeconfig, change https://127.0.0.1:6443 to the address:port of the kube-apiserver load balancer, or to one of the masters.
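A minimal sketch of fetching the kubeconfig to the deploy machine and pointing it at the master; the local path matches the alias below, and reading /root/.kube/config requires root on the master:

# On kube-master-100, make a user-readable copy first, e.g.:
#   sudo cp /root/.kube/config /home/linuzb/kubeconfig && sudo chown linuzb /home/linuzb/kubeconfig
scp linuzb@172.16.0.100:/home/linuzb/kubeconfig ~/Projects/kube-cert/kubeconfig
sed -i 's#https://127.0.0.1:6443#https://172.16.0.100:6443#' ~/Projects/kube-cert/kubeconfig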
alias k='kubectl --kubeconfig /home/linuzb/Projects/kube-cert/kubeconfig'
nvidia container
debug
Network troubleshooting
DNS troubleshooting
Debugging DNS Resolution | Kubernetes
apiVersion: v1
kind: Pod
metadata:
  name: dnsutils
  namespace: default
spec:
  containers:
    - name: dnsutils
      image: lunettes/lunettes:v0.1.5
      # image: registry.cn-hangzhou.aliyuncs.com/linuzb/jessie-dnsutils:1.3
      command:
        - sleep
        - "infinity"
      imagePullPolicy: IfNotPresent
  restartPolicy: Always
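Usage follows the Kubernetes DNS debugging guide linked above; assuming the manifest is saved as dnsutils.yaml:

kubectl apply -f dnsutils.yaml
kubectl exec -it dnsutils -- nslookup kubernetes.default
kubectl exec -it dnsutils -- cat /etc/resolv.conf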
apiserver connectivity
KUBE_TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
curl -vvsSk -H "Authorization: Bearer $KUBE_TOKEN" https://$KUBERNETES_SERVICE_HOST:$KUBERNETES_SERVICE_PORT/api/v1/namespaces/jhub/pods
curl -sSk -H "Authorization: Bearer $KUBE_TOKEN" https://172.16.0.100:6443/api/v1/namespaces/jhub/pods
nodelocaldns
After a kubernetes node reboots, nodelocaldns crashes.
Cause: a DNS forwarding loop; see the coredns loop plugin (coredns.io).
Solution
See NodeLocalDNS Loop detected for zone “.” · Issue #9948 · kubernetes-sigs/kubespray (github.com)
and the CSDN blog post on deleting, resetting and reinstalling coreDNS in kubernetes.
- In the kubespray k8s config file, change two settings: resolvconf_mode: none and remove_default_searchdomains: false.
Then redeploy kubernetes with kubespray.
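After redeploying, check that the nodelocaldns pods stay healthy; the label below is what kubespray's nodelocaldns DaemonSet normally uses and may differ in other setups:

kubectl -n kube-system get pods -l k8s-app=nodelocaldns
kubectl -n kube-system logs -l k8s-app=nodelocaldns --tail=20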
References
An alternative approach is easzlab/kubeasz: installing a K8S cluster with Ansible scripts.