Level 24: Affinity
Demystifying K8s deployment optimization: harnessing the power of affinity, anti-affinity, taints, tolerations, and node selectors
In Level 12, "Mastering the Ingress-Nginx Controller on K8s: a production-ready configuration guide", we deployed the ingress-nginx-controller. Its YAML configuration happens to cover everything this lesson discusses: affinity, anti-affinity, taints, tolerations, and node selectors. When you configure other production services later, you can follow the same pattern.
Configuration
---
apiVersion: apps/v1
kind: DaemonSet
#kind: Deployment
metadata:
  name: nginx-ingress-controller
  namespace: kube-system
  labels:
    app: ingress-nginx
  annotations:
    component.revision: "2"
    component.version: 1.9.3
spec:
  # Deployment need:
  # ----------------
  # replicas: 1
  # ----------------
  selector:
    matchLabels:
      app: ingress-nginx
  template:
    metadata:
      labels:
        app: ingress-nginx
      annotations:
        prometheus.io/port: "10254"
        prometheus.io/scrape: "true"
    spec:
      # DaemonSet need:
      # ----------------
      hostNetwork: true
      # ----------------
      affinity:
        podAntiAffinity: # pod anti-affinity
          preferredDuringSchedulingIgnoredDuringExecution:
          - podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - ingress-nginx
              topologyKey: kubernetes.io/hostname
            weight: 100
        nodeAffinity: # node affinity
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: type
                operator: NotIn
                values:
                - virtual-kubelet
              - key: k8s.aliyun.com
                operator: NotIn
                values:
                - "true"
      containers:
      - args:
        - /nginx-ingress-controller
        - --election-id=ingress-controller-leader-nginx
        - --ingress-class=nginx
        - --watch-ingress-without-class
        - --controller-class=k8s.io/ingress-nginx
        - --configmap=$(POD_NAMESPACE)/nginx-configuration
        - --tcp-services-configmap=$(POD_NAMESPACE)/tcp-services
        - --udp-services-configmap=$(POD_NAMESPACE)/udp-services
        - --annotations-prefix=nginx.ingress.kubernetes.io
        - --publish-service=$(POD_NAMESPACE)/nginx-ingress-lb
        - --validating-webhook=:8443
        - --validating-webhook-certificate=/usr/local/certificates/cert
        - --validating-webhook-key=/usr/local/certificates/key
        - --enable-metrics=false
        - --v=2
        env:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: POD_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        - name: LD_PRELOAD
          value: /usr/local/lib/libmimalloc.so
        image: registry-cn-hangzhou.ack.aliyuncs.com/acs/aliyun-ingress-controller:v1.9.3-aliyun.1
        imagePullPolicy: IfNotPresent
        lifecycle:
          preStop:
            exec:
              command:
              - /wait-shutdown
        livenessProbe:
          failureThreshold: 5
          httpGet:
            path: /healthz
            port: 10254
            scheme: HTTP
          initialDelaySeconds: 10
          periodSeconds: 10
          timeoutSeconds: 1
          successThreshold: 1
        name: nginx-ingress-controller
        ports:
        - name: http
          containerPort: 80
          protocol: TCP
        - name: https
          containerPort: 443
          protocol: TCP
        - name: webhook
          containerPort: 8443
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /healthz
            port: 10254
            scheme: HTTP
          initialDelaySeconds: 10
          periodSeconds: 10
          timeoutSeconds: 1
          successThreshold: 1
        # resources:
        #   limits:
        #     cpu: 1
        #     memory: 2G
        #   requests:
        #     cpu: 1
        #     memory: 2G
        securityContext:
          allowPrivilegeEscalation: true
          capabilities:
            drop:
            - ALL
            add:
            - NET_BIND_SERVICE
          runAsUser: 101
          # if get 'mount: mounting rw on /proc/sys failed: Permission denied', use:
          # privileged: true
          # procMount: Default
          # runAsUser: 0
        volumeMounts:
        - name: webhook-cert
          mountPath: /usr/local/certificates/
          readOnly: true
        - mountPath: /etc/localtime
          name: localtime
          readOnly: true
      dnsPolicy: ClusterFirst
      initContainers:
      - command:
        - /bin/sh
        - -c
        - |
          if [ "$POD_IP" != "$HOST_IP" ]; then
            mount -o remount rw /proc/sys
            sysctl -w net.core.somaxconn=65535
            sysctl -w net.ipv4.ip_local_port_range="1024 65535"
            sysctl -w kernel.core_uses_pid=0
          fi
        env:
        - name: POD_IP
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: status.podIP
        - name: HOST_IP
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: status.hostIP
        image: registry.cn-shanghai.aliyuncs.com/acs/busybox:v1.29.2
        imagePullPolicy: IfNotPresent
        name: init-sysctl
        resources:
          limits:
            cpu: 100m
            memory: 70Mi
          requests:
            cpu: 100m
            memory: 70Mi
        securityContext:
          capabilities:
            add:
            - SYS_ADMIN
            drop:
            - ALL
          # if get 'mount: mounting rw on /proc/sys failed: Permission denied', use:
          privileged: true
          procMount: Default
          runAsUser: 0
      # run only on nodes that have this label set:
      # kubectl label node xx.xx.xx.xx boge/ingress-controller-ready=true
      # kubectl get node --show-labels
      # kubectl label node xx.xx.xx.xx boge/ingress-controller-ready-
      nodeSelector: # node selector
        boge/ingress-controller-ready: "true"
      priorityClassName: system-node-critical
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: ingress-nginx
      serviceAccountName: ingress-nginx
      terminationGracePeriodSeconds: 300
      # Taints
      # kubectl taint nodes xx.xx.xx.xx boge/ingress-controller-ready="true":NoExecute
      # kubectl taint nodes xx.xx.xx.xx boge/ingress-controller-ready:NoExecute-
      # Tolerations
      tolerations:
      - operator: Exists
      # tolerations:
      # - effect: NoExecute
      #   key: boge/ingress-controller-ready
      #   operator: Equal
      #   value: "true"
      volumes:
      - name: webhook-cert
        secret:
          defaultMode: 420
          secretName: ingress-nginx-admission
      - hostPath:
          path: /etc/localtime
          type: File
        name: localtime
Analysis
Affinity
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - podAffinityTerm:
        labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - ingress-nginx
        topologyKey: kubernetes.io/hostname
      weight: 100
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: type
          operator: NotIn
          values:
          - virtual-kubelet
        - key: k8s.aliyun.com
          operator: NotIn
          values:
          - "true"
This defines a Pod anti-affinity rule: Pods labeled app=ingress-nginx should avoid being scheduled onto nodes that share the same kubernetes.io/hostname label value. In other words, if one or more nodes already run Pods labeled app=ingress-nginx, the Kubernetes scheduler will try not to place more Pods with that label on them, spreading these Pods across nodes. weight: 100 gives this rule the highest preference, so the scheduler strongly favors honoring it.
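The preferred (soft) anti-affinity behavior can be pictured with a toy scoring function. This is a deliberate simplification of the real scheduler's scoring plugin, written only to show the effect of `weight`: a node loses points for each anti-affinity term whose label selector matches a Pod already running there, so emptier nodes win.

```python
# Toy model (NOT the real scheduler): score candidate nodes under
# preferredDuringSchedulingIgnoredDuringExecution pod anti-affinity.

def matches_selector(pod_labels: dict, match_expressions: list) -> bool:
    """Evaluate matchExpressions (In/NotIn only) against a pod's labels."""
    for expr in match_expressions:
        value = pod_labels.get(expr["key"])
        if expr["operator"] == "In" and value not in expr["values"]:
            return False
        if expr["operator"] == "NotIn" and value in expr["values"]:
            return False
    return True

def anti_affinity_score(node_pods: list, terms: list, base: int = 100) -> int:
    """Start from `base`; subtract each term's weight if a pod on the node matches it."""
    score = base
    for term in terms:
        exprs = term["podAffinityTerm"]["labelSelector"]["matchExpressions"]
        if any(matches_selector(p, exprs) for p in node_pods):
            score -= term["weight"]
    return score

# The anti-affinity term from the manifest above.
terms = [{"weight": 100,
          "podAffinityTerm": {
              "labelSelector": {"matchExpressions": [
                  {"key": "app", "operator": "In", "values": ["ingress-nginx"]}]},
              "topologyKey": "kubernetes.io/hostname"}}]

# A node already running an ingress-nginx pod scores lower than an empty node,
# so the scheduler prefers to spread the Pods out.
print(anti_affinity_score([{"app": "ingress-nginx"}], terms))  # 0
print(anti_affinity_score([], terms))                          # 100
```

Because the rule is "preferred" rather than "required", a low score only disadvantages a node; if no other node fits, the Pod can still land there.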
It also defines node affinity rules, or more precisely a required node selector (requiredDuringSchedulingIgnoredDuringExecution). Only nodes that satisfy these conditions are even considered for scheduling the Pod.
- The first matchExpressions requires that the node's type label is not virtual-kubelet. The Pod should not land on nodes backed by a virtual kubelet, such as Alibaba Cloud's elastic container instances, presumably to keep it on physical machines or a more stable environment.
- The second matchExpressions requires that the node does not carry the label k8s.aliyun.com="true". This is likely specific to Alibaba Cloud Container Service for Kubernetes (ACK), or other environments that use Alibaba Cloud's labeling scheme, excluding particular node types to satisfy a deployment policy.
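The evaluation rule for required node affinity can be sketched in a few lines (an illustrative simulation, not the real scheduler code): a node qualifies if at least one nodeSelectorTerm matches, and a term matches only when ALL of its matchExpressions hold.

```python
# Sketch of requiredDuringSchedulingIgnoredDuringExecution evaluation
# against node labels; only the In/NotIn operators from the manifest are shown.

def expr_matches(node_labels: dict, expr: dict) -> bool:
    value = node_labels.get(expr["key"])
    if expr["operator"] == "In":
        return value in expr["values"]
    if expr["operator"] == "NotIn":
        return value not in expr["values"]
    raise ValueError(f"unsupported operator: {expr['operator']}")

def node_schedulable(node_labels: dict, node_selector_terms: list) -> bool:
    # Terms are ORed together; expressions within a term are ANDed.
    return any(all(expr_matches(node_labels, e) for e in term["matchExpressions"])
               for term in node_selector_terms)

# The two NotIn expressions from the manifest above.
terms = [{"matchExpressions": [
    {"key": "type", "operator": "NotIn", "values": ["virtual-kubelet"]},
    {"key": "k8s.aliyun.com", "operator": "NotIn", "values": ["true"]},
]}]

print(node_schedulable({"kubernetes.io/hostname": "10.0.1.201"}, terms))  # True
print(node_schedulable({"type": "virtual-kubelet"}, terms))               # False
```

Note that NotIn also matches nodes that lack the key entirely, which is why an ordinary node with neither label passes both expressions.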
Taints
Taints and tolerations are a harder constraint than affinity and anti-affinity.
# Add a taint (no new Pods can be scheduled onto the node)
# kubectl taint nodes xx.xx.xx.xx boge/ingress-controller-ready="true":NoExecute
# Remove the taint
# kubectl taint nodes xx.xx.xx.xx boge/ingress-controller-ready:NoExecute-
# Tolerations
# i.e., the backdoor we leave for ourselves
tolerations:
- operator: Exists # generic
# tolerations:
# - effect: NoExecute
#   key: boge/ingress-controller-ready # specific
#   operator: Equal
#   value: "true"
With operator: Exists, the Pod can be scheduled onto any tainted node, regardless of the taint's key, value, or effect: as long as a taint exists on the node, whatever its details, the Pod tolerates it and is allowed there.
The commented-out alternative is far more specific:
- The effect is NoExecute, meaning Pods that do not tolerate the taint are normally evicted from the node, and new Pods are not scheduled onto it.
- The key is boge/ingress-controller-ready, a custom taint key, likely marking a particular condition or state of the node, such as readiness to run a specific kind of Ingress controller Pod.
- The operator is Equal, and the taint's value must be "true". The Pod then only tolerates nodes whose taint key is boge/ingress-controller-ready with a value that exactly matches "true".
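The matching rule behind both variants can be sketched as follows (an illustrative simulation of the semantics, not the actual kubelet/scheduler code): a Pod fits a node only if every taint on the node is matched by some toleration.

```python
# Sketch of taint/toleration matching semantics.

def tolerates(toleration: dict, taint: dict) -> bool:
    # An empty effect in the toleration matches taints with any effect.
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    op = toleration.get("operator", "Equal")
    if op == "Exists":
        # Exists with an empty key tolerates every taint (the "generic" form above).
        return not toleration.get("key") or toleration["key"] == taint["key"]
    # Equal: key and value must both match exactly (the "specific" form above).
    return (toleration.get("key") == taint["key"]
            and toleration.get("value") == taint.get("value"))

def pod_fits(tolerations: list, taints: list) -> bool:
    return all(any(tolerates(tol, t) for tol in tolerations) for t in taints)

taint = {"key": "boge/ingress-controller-ready", "value": "true",
         "effect": "NoExecute"}

print(pod_fits([{"operator": "Exists"}], [taint]))  # True: catch-all toleration
print(pod_fits([{"effect": "NoExecute", "key": "boge/ingress-controller-ready",
                 "operator": "Equal", "value": "true"}], [taint]))  # True
print(pod_fits([], [taint]))  # False: no toleration, the Pod is repelled/evicted
```

This is why the ingress controller keeps running on a node we have deliberately tainted for it, while everything else is pushed away.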
Node selector
nodeSelector: # node selector
  boge/ingress-controller-ready: "true"
If no node carries the label boge/ingress-controller-ready: "true", no Pods will appear at all.
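nodeSelector is the simplest of the mechanisms: a plain label-subset check. A quick sketch of its semantics (illustrative only):

```python
# Sketch of nodeSelector semantics: the node's labels must contain every
# key/value pair in the Pod's nodeSelector; otherwise the Pod stays Pending.

def node_selector_matches(node_labels: dict, node_selector: dict) -> bool:
    return all(node_labels.get(k) == v for k, v in node_selector.items())

selector = {"boge/ingress-controller-ready": "true"}

labeled = {"kubernetes.io/hostname": "10.0.1.201",
           "boge/ingress-controller-ready": "true"}
unlabeled = {"kubernetes.io/hostname": "10.0.1.202"}

print(node_selector_matches(labeled, selector))    # True: Pod can run here
print(node_selector_matches(unlabeled, selector))  # False: Pod stays Pending
```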
Supplement
Viewing the Master node's taints
Taints: a Master node usually carries taints that keep ordinary Pods from being scheduled onto it. To let regular workloads run on the 10.0.1.202 node, you need to delete or modify the taints on it.
View the Master node's taints with:
# kubectl describe node 10.0.1.202
Name: 10.0.1.202
Roles: master
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/os=linux
kubernetes.io/arch=amd64
kubernetes.io/hostname=10.0.1.202
kubernetes.io/os=linux
kubernetes.io/role=master
Annotations: node.alpha.kubernetes.io/ttl: 0
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Sun, 21 Apr 2024 23:34:47 +0800
Taints: node.kubernetes.io/unschedulable:NoSchedule
Unschedulable: true
Lease:
HolderIdentity: 10.0.1.202
AcquireTime: <unset>
RenewTime: Sat, 18 May 2024 16:45:56 +0800
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
NetworkUnavailable False Sat, 18 May 2024 14:41:39 +0800 Sat, 18 May 2024 14:41:39 +0800 CalicoIsUp Calico is running on this node
MemoryPressure False Sat, 18 May 2024 16:45:18 +0800 Mon, 06 May 2024 00:51:46 +0800 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Sat, 18 May 2024 16:45:18 +0800 Mon, 06 May 2024 00:51:46 +0800 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Sat, 18 May 2024 16:45:18 +0800 Mon, 06 May 2024 00:51:46 +0800 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Sat, 18 May 2024 16:45:18 +0800 Mon, 06 May 2024 10:48:23 +0800 KubeletReady kubelet is posting ready status. AppArmor enabled
Addresses:
InternalIP: 10.0.1.202
Hostname: 10.0.1.202
Capacity:
cpu: 2
ephemeral-storage: 29751268Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 1974508Ki
pods: 120
Allocatable:
cpu: 2
ephemeral-storage: 27418768544
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 1667308Ki
pods: 120
System Info:
Machine ID: 7b6a2cc67914456f94784b3d0bd86a63
System UUID: 5bf74d56-9413-ac2b-c7c8-043d44f87b1b
Boot ID: 0bdf1b93-5760-4f48-89eb-4a22907a41b8
Kernel Version: 5.15.0-105-generic
OS Image: Ubuntu 22.04.4 LTS
Operating System: linux
Architecture: amd64
Container Runtime Version: containerd://1.6.23
Kubelet Version: v1.27.5
Kube-Proxy Version: v1.27.5
PodCIDR: 172.20.0.0/24
PodCIDRs: 172.20.0.0/24
Non-terminated Pods: (2 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits Age
--------- ---- ------------ ---------- --------------- ------------- ---
kube-system calico-node-k2kw6 250m (12%) 0 (0%) 0 (0%) 0 (0%) 26d
kube-system node-local-dns-rfwkm 25m (1%) 0 (0%) 5Mi (0%) 0 (0%) 26d
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 275m (13%) 0 (0%)
memory 5Mi (0%) 0 (0%)
ephemeral-storage 0 (0%) 0 (0%)
hugepages-1Gi 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
Events:              <none>

Note the line: Taints: node.kubernetes.io/unschedulable:NoSchedule
Deleting or modifying the taint
Find the taint on the Master node, then delete or modify it with:
kubectl taint nodes <master-node-name> key=value:taint-effect
where key is the taint's key, value is the taint's value, and taint-effect is the taint's effect: one of NoSchedule, PreferNoSchedule, or NoExecute.
Delete or modify the appropriate taint for your needs:
kubectl taint nodes 10.0.1.202 node.kubernetes.io/unschedulable:NoSchedule-
Restart the kubelet service so the change takes effect:
sudo systemctl restart kubelet
Verify the configuration: check that the 10.0.1.202 node now acts as both a Master node and a Worker node.
kubectl get node -o wide
Method 2: edit the node object directly
# kubectl edit node 10.0.1.202
node/10.0.1.202 edited
spec:
  podCIDR: 172.20.1.0/24
  podCIDRs:
  - 172.20.1.0/24
  taints:
  - effect: NoSchedule
    key: node.kubernetes.io/unschedulable
    timeAdded: "2024-04-21T15:34:53Z"
  unschedulable: true
Change true to false; the configuration is then updated automatically and only the following remains:
spec:
  podCIDR: 172.20.1.0/24
  podCIDRs:
  - 172.20.1.0/24