Level 24: Affinity
Demystifying K8s deployment optimization: harnessing the power of affinity, anti-affinity, taints, tolerations, and node selectors
In Level 12, "Mastering the Ingress-Nginx Controller on K8s: a production-ready configuration guide", we deployed the ingress-nginx-controller. Its YAML configuration happens to cover everything this lesson discusses: affinity, anti-affinity, taints, tolerations, and node selectors. When you configure other production services later, you can follow the same pattern.
Configuration
---
apiVersion: apps/v1
kind: DaemonSet
#kind: Deployment
metadata:
  name: nginx-ingress-controller
  namespace: kube-system
  labels:
    app: ingress-nginx
  annotations:
    component.revision: "2"
    component.version: 1.9.3
spec:
  # Deployment need:
  # ----------------
  # replicas: 1
  # ----------------
  selector:
    matchLabels:
      app: ingress-nginx
  template:
    metadata:
      labels:
        app: ingress-nginx
      annotations:
        prometheus.io/port: "10254"
        prometheus.io/scrape: "true"
    spec:
      # DaemonSet need:
      # ----------------
      hostNetwork: true
      # ----------------
      affinity:
        podAntiAffinity: # pod anti-affinity
          preferredDuringSchedulingIgnoredDuringExecution:
          - podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - ingress-nginx
              topologyKey: kubernetes.io/hostname
            weight: 100
        nodeAffinity: # node affinity
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: type
                operator: NotIn
                values:
                - virtual-kubelet
              - key: k8s.aliyun.com
                operator: NotIn
                values:
                - "true"
      containers:
      - args:
        - /nginx-ingress-controller
        - --election-id=ingress-controller-leader-nginx
        - --ingress-class=nginx
        - --watch-ingress-without-class
        - --controller-class=k8s.io/ingress-nginx
        - --configmap=$(POD_NAMESPACE)/nginx-configuration
        - --tcp-services-configmap=$(POD_NAMESPACE)/tcp-services
        - --udp-services-configmap=$(POD_NAMESPACE)/udp-services
        - --annotations-prefix=nginx.ingress.kubernetes.io
        - --publish-service=$(POD_NAMESPACE)/nginx-ingress-lb
        - --validating-webhook=:8443
        - --validating-webhook-certificate=/usr/local/certificates/cert
        - --validating-webhook-key=/usr/local/certificates/key
        - --enable-metrics=false
        - --v=2
        env:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: POD_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        - name: LD_PRELOAD
          value: /usr/local/lib/libmimalloc.so
        image: registry-cn-hangzhou.ack.aliyuncs.com/acs/aliyun-ingress-controller:v1.9.3-aliyun.1
        imagePullPolicy: IfNotPresent
        lifecycle:
          preStop:
            exec:
              command:
              - /wait-shutdown
        livenessProbe:
          failureThreshold: 5
          httpGet:
            path: /healthz
            port: 10254
            scheme: HTTP
          initialDelaySeconds: 10
          periodSeconds: 10
          timeoutSeconds: 1
          successThreshold: 1
        name: nginx-ingress-controller
        ports:
        - name: http
          containerPort: 80
          protocol: TCP
        - name: https
          containerPort: 443
          protocol: TCP
        - name: webhook
          containerPort: 8443
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /healthz
            port: 10254
            scheme: HTTP
          initialDelaySeconds: 10
          periodSeconds: 10
          timeoutSeconds: 1
          successThreshold: 1
        # resources:
        #   limits:
        #     cpu: 1
        #     memory: 2G
        #   requests:
        #     cpu: 1
        #     memory: 2G
        securityContext:
          allowPrivilegeEscalation: true
          capabilities:
            drop:
            - ALL
            add:
            - NET_BIND_SERVICE
          runAsUser: 101
          # if get 'mount: mounting rw on /proc/sys failed: Permission denied', use:
          # privileged: true
          # procMount: Default
          # runAsUser: 0
        volumeMounts:
        - name: webhook-cert
          mountPath: /usr/local/certificates/
          readOnly: true
        - mountPath: /etc/localtime
          name: localtime
          readOnly: true
      dnsPolicy: ClusterFirst
      initContainers:
      - command:
        - /bin/sh
        - -c
        - |
          if [ "$POD_IP" != "$HOST_IP" ]; then
            mount -o remount rw /proc/sys
            sysctl -w net.core.somaxconn=65535
            sysctl -w net.ipv4.ip_local_port_range="1024 65535"
            sysctl -w kernel.core_uses_pid=0
          fi
        env:
        - name: POD_IP
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: status.podIP
        - name: HOST_IP
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: status.hostIP
        image: registry.cn-shanghai.aliyuncs.com/acs/busybox:v1.29.2
        imagePullPolicy: IfNotPresent
        name: init-sysctl
        resources:
          limits:
            cpu: 100m
            memory: 70Mi
          requests:
            cpu: 100m
            memory: 70Mi
        securityContext:
          capabilities:
            add:
            - SYS_ADMIN
            drop:
            - ALL
          # if get 'mount: mounting rw on /proc/sys failed: Permission denied', use:
          privileged: true
          procMount: Default
          runAsUser: 0
      # run only on nodes that have this label set:
      # kubectl label node xx.xx.xx.xx boge/ingress-controller-ready=true
      # kubectl get node --show-labels
      # kubectl label node xx.xx.xx.xx boge/ingress-controller-ready-
      nodeSelector: # node selector
        boge/ingress-controller-ready: "true"
      priorityClassName: system-node-critical
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: ingress-nginx
      serviceAccountName: ingress-nginx
      terminationGracePeriodSeconds: 300
      # Taints
      # kubectl taint nodes xx.xx.xx.xx boge/ingress-controller-ready="true":NoExecute
      # kubectl taint nodes xx.xx.xx.xx boge/ingress-controller-ready:NoExecute-
      # Tolerations
      tolerations:
      - operator: Exists
      # tolerations:
      # - effect: NoExecute
      #   key: boge/ingress-controller-ready
      #   operator: Equal
      #   value: "true"
      volumes:
      - name: webhook-cert
        secret:
          defaultMode: 420
          secretName: ingress-nginx-admission
      - hostPath:
          path: /etc/localtime
          type: File
        name: localtime
Analysis
Affinity
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - podAffinityTerm:
        labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - ingress-nginx
        topologyKey: kubernetes.io/hostname
      weight: 100
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: type
          operator: NotIn
          values:
          - virtual-kubelet
        - key: k8s.aliyun.com
          operator: NotIn
          values:
          - "true"
This defines a Pod anti-affinity rule: Pods labeled app=ingress-nginx should avoid being scheduled onto nodes that share the same kubernetes.io/hostname label value. In other words, if one or more nodes already run Pods labeled app=ingress-nginx, the Kubernetes scheduler will try not to place more Pods with that label on them, spreading these Pods across nodes. weight: 100 gives this rule the highest preference, so the scheduler strongly favors honoring it.
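The preferred (soft) anti-affinity behavior can be pictured with a toy scoring function. This is a deliberate simplification of the real scheduler's scoring plugin, written only to show the effect of `weight`: a node loses points for each anti-affinity term whose label selector matches a Pod already running there, so emptier nodes win.

```python
# Toy model (NOT the real scheduler): score candidate nodes under
# preferredDuringSchedulingIgnoredDuringExecution pod anti-affinity.

def matches_selector(pod_labels: dict, match_expressions: list) -> bool:
    """Evaluate matchExpressions (In/NotIn only) against a pod's labels."""
    for expr in match_expressions:
        value = pod_labels.get(expr["key"])
        if expr["operator"] == "In" and value not in expr["values"]:
            return False
        if expr["operator"] == "NotIn" and value in expr["values"]:
            return False
    return True

def anti_affinity_score(node_pods: list, terms: list, base: int = 100) -> int:
    """Start from `base`; subtract each term's weight if a pod on the node matches it."""
    score = base
    for term in terms:
        exprs = term["podAffinityTerm"]["labelSelector"]["matchExpressions"]
        if any(matches_selector(p, exprs) for p in node_pods):
            score -= term["weight"]
    return score

# The anti-affinity term from the manifest above.
terms = [{"weight": 100,
          "podAffinityTerm": {
              "labelSelector": {"matchExpressions": [
                  {"key": "app", "operator": "In", "values": ["ingress-nginx"]}]},
              "topologyKey": "kubernetes.io/hostname"}}]

# A node already running an ingress-nginx pod scores lower than an empty node,
# so the scheduler prefers to spread the Pods out.
print(anti_affinity_score([{"app": "ingress-nginx"}], terms))  # 0
print(anti_affinity_score([], terms))                          # 100
```

Because the rule is "preferred" rather than "required", a low score only disadvantages a node; if no other node fits, the Pod can still land there.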
It also defines node affinity rules, or more precisely a required node selector (requiredDuringSchedulingIgnoredDuringExecution). Only nodes that satisfy these conditions are even considered for scheduling the Pod.
- The first matchExpressions requires that the node's type label is not virtual-kubelet. The Pod should not land on nodes backed by a virtual kubelet, such as Alibaba Cloud's elastic container instances, presumably to keep it on physical machines or a more stable environment.
- The second matchExpressions requires that the node does not carry the label k8s.aliyun.com="true". This is likely specific to Alibaba Cloud Container Service for Kubernetes (ACK), or other environments that use Alibaba Cloud's labeling scheme, excluding particular node types to satisfy a deployment policy.
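The evaluation rule for required node affinity can be sketched in a few lines (an illustrative simulation, not the real scheduler code): a node qualifies if at least one nodeSelectorTerm matches, and a term matches only when ALL of its matchExpressions hold.

```python
# Sketch of requiredDuringSchedulingIgnoredDuringExecution evaluation
# against node labels; only the In/NotIn operators from the manifest are shown.

def expr_matches(node_labels: dict, expr: dict) -> bool:
    value = node_labels.get(expr["key"])
    if expr["operator"] == "In":
        return value in expr["values"]
    if expr["operator"] == "NotIn":
        return value not in expr["values"]
    raise ValueError(f"unsupported operator: {expr['operator']}")

def node_schedulable(node_labels: dict, node_selector_terms: list) -> bool:
    # Terms are ORed together; expressions within a term are ANDed.
    return any(all(expr_matches(node_labels, e) for e in term["matchExpressions"])
               for term in node_selector_terms)

# The two NotIn expressions from the manifest above.
terms = [{"matchExpressions": [
    {"key": "type", "operator": "NotIn", "values": ["virtual-kubelet"]},
    {"key": "k8s.aliyun.com", "operator": "NotIn", "values": ["true"]},
]}]

print(node_schedulable({"kubernetes.io/hostname": "10.0.1.201"}, terms))  # True
print(node_schedulable({"type": "virtual-kubelet"}, terms))               # False
```

Note that NotIn also matches nodes that lack the key entirely, which is why an ordinary node with neither label passes both expressions.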
Taints
Taints and tolerations are a harder constraint than affinity and anti-affinity.
# Add a taint (no new Pods can be scheduled onto the node)
# kubectl taint nodes xx.xx.xx.xx boge/ingress-controller-ready="true":NoExecute
# Remove the taint
# kubectl taint nodes xx.xx.xx.xx boge/ingress-controller-ready:NoExecute-
# Tolerations
# i.e., the backdoor we leave for ourselves
tolerations:
- operator: Exists # generic
# tolerations:
# - effect: NoExecute
#   key: boge/ingress-controller-ready # specific
#   operator: Equal
#   value: "true"
With operator: Exists, the Pod can be scheduled onto any tainted node, regardless of the taint's key, value, or effect: as long as a taint exists on the node, whatever its details, the Pod tolerates it and is allowed there.
The commented-out alternative is far more specific:
- The effect is NoExecute, meaning Pods that do not tolerate the taint are normally evicted from the node, and new Pods are not scheduled onto it.
- The key is boge/ingress-controller-ready, a custom taint key, likely marking a particular condition or state of the node, such as readiness to run a specific kind of Ingress controller Pod.
- The operator is Equal, and the taint's value must be "true". The Pod then only tolerates nodes whose taint key is boge/ingress-controller-ready with a value that exactly matches "true".
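The matching rule behind both variants can be sketched as follows (an illustrative simulation of the semantics, not the actual kubelet/scheduler code): a Pod fits a node only if every taint on the node is matched by some toleration.

```python
# Sketch of taint/toleration matching semantics.

def tolerates(toleration: dict, taint: dict) -> bool:
    # An empty effect in the toleration matches taints with any effect.
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    op = toleration.get("operator", "Equal")
    if op == "Exists":
        # Exists with an empty key tolerates every taint (the "generic" form above).
        return not toleration.get("key") or toleration["key"] == taint["key"]
    # Equal: key and value must both match exactly (the "specific" form above).
    return (toleration.get("key") == taint["key"]
            and toleration.get("value") == taint.get("value"))

def pod_fits(tolerations: list, taints: list) -> bool:
    return all(any(tolerates(tol, t) for tol in tolerations) for t in taints)

taint = {"key": "boge/ingress-controller-ready", "value": "true",
         "effect": "NoExecute"}

print(pod_fits([{"operator": "Exists"}], [taint]))  # True: catch-all toleration
print(pod_fits([{"effect": "NoExecute", "key": "boge/ingress-controller-ready",
                 "operator": "Equal", "value": "true"}], [taint]))  # True
print(pod_fits([], [taint]))  # False: no toleration, the Pod is repelled/evicted
```

This is why the ingress controller keeps running on a node we have deliberately tainted for it, while everything else is pushed away.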
Node selector
nodeSelector: # node selector
  boge/ingress-controller-ready: "true"
If no node carries the label boge/ingress-controller-ready: "true", no Pods will appear at all.
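nodeSelector is the simplest of the mechanisms: a plain label-subset check. A quick sketch of its semantics (illustrative only):

```python
# Sketch of nodeSelector semantics: the node's labels must contain every
# key/value pair in the Pod's nodeSelector; otherwise the Pod stays Pending.

def node_selector_matches(node_labels: dict, node_selector: dict) -> bool:
    return all(node_labels.get(k) == v for k, v in node_selector.items())

selector = {"boge/ingress-controller-ready": "true"}

labeled = {"kubernetes.io/hostname": "10.0.1.201",
           "boge/ingress-controller-ready": "true"}
unlabeled = {"kubernetes.io/hostname": "10.0.1.202"}

print(node_selector_matches(labeled, selector))    # True: Pod can run here
print(node_selector_matches(unlabeled, selector))  # False: Pod stays Pending
```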
Supplement
Viewing the Master node's taints
Taints: a Master node usually carries taints that keep ordinary Pods from being scheduled onto it. To let regular workloads run on the 10.0.1.202 node, you need to delete or modify the taints on it.
View the Master node's taints with:
# kubectl describe node 10.0.1.202
Name: 10.0.1.202
Roles: master
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/os=linux
kubernetes.io/arch=amd64
kubernetes.io/hostname=10.0.1.202
kubernetes.io/os=linux
kubernetes.io/role=master
Annotations: node.alpha.kubernetes.io/ttl: 0
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Sun, 21 Apr 2024 23:34:47 +0800
Taints: node.kubernetes.io/unschedulable:NoSchedule
Unschedulable: true
Lease:
HolderIdentity: 10.0.1.202
AcquireTime: <unset>
RenewTime: Sat, 18 May 2024 16:45:56 +0800
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
NetworkUnavailable False Sat, 18 May 2024 14:41:39 +0800 Sat, 18 May 2024 14:41:39 +0800 CalicoIsUp Calico is running on this node
MemoryPressure False Sat, 18 May 2024 16:45:18 +0800 Mon, 06 May 2024 00:51:46 +0800 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Sat, 18 May 2024 16:45:18 +0800 Mon, 06 May 2024 00:51:46 +0800 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Sat, 18 May 2024 16:45:18 +0800 Mon, 06 May 2024 00:51:46 +0800 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Sat, 18 May 2024 16:45:18 +0800 Mon, 06 May 2024 10:48:23 +0800 KubeletReady kubelet is posting ready status. AppArmor enabled
Addresses:
InternalIP: 10.0.1.202
Hostname: 10.0.1.202
Capacity:
cpu: 2
ephemeral-storage: 29751268Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 1974508Ki
pods: 120
Allocatable:
cpu: 2
ephemeral-storage: 27418768544
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 1667308Ki
pods: 120
System Info:
Machine ID: 7b6a2cc67914456f94784b3d0bd86a63
System UUID: 5bf74d56-9413-ac2b-c7c8-043d44f87b1b
Boot ID: 0bdf1b93-5760-4f48-89eb-4a22907a41b8
Kernel Version: 5.15.0-105-generic
OS Image: Ubuntu 22.04.4 LTS
Operating System: linux
Architecture: amd64
Container Runtime Version: containerd://1.6.23
Kubelet Version: v1.27.5
Kube-Proxy Version: v1.27.5
PodCIDR: 172.20.0.0/24
PodCIDRs: 172.20.0.0/24
Non-terminated Pods: (2 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits Age
--------- ---- ------------ ---------- --------------- ------------- ---
kube-system calico-node-k2kw6 250m (12%) 0 (0%) 0 (0%) 0 (0%) 26d
kube-system node-local-dns-rfwkm 25m (1%) 0 (0%) 5Mi (0%) 0 (0%) 26d
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 275m (13%) 0 (0%)
memory 5Mi (0%) 0 (0%)
ephemeral-storage 0 (0%) 0 (0%)
hugepages-1Gi 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
Events:              <none>

Note the line: Taints: node.kubernetes.io/unschedulable:NoSchedule
Deleting or modifying the taint
Find the taint on the Master node, then delete or modify it with:
kubectl taint nodes <master-node-name> key=value:taint-effect
where key is the taint's key, value is the taint's value, and taint-effect is the taint's effect: one of NoSchedule, PreferNoSchedule, or NoExecute.
Delete or modify the appropriate taint for your needs:
kubectl taint nodes 10.0.1.202 node.kubernetes.io/unschedulable:NoSchedule-
Restart the kubelet service so the change takes effect:
sudo systemctl restart kubelet
Verify the configuration: check that the 10.0.1.202 node now acts as both a Master node and a Worker node.
kubectl get node -o wide
Method 2: edit the node object directly
# kubectl edit node 10.0.1.202
node/10.0.1.202 edited
spec:
  podCIDR: 172.20.1.0/24
  podCIDRs:
  - 172.20.1.0/24
  taints:
  - effect: NoSchedule
    key: node.kubernetes.io/unschedulable
    timeAdded: "2024-04-21T15:34:53Z"
  unschedulable: true
Change true to false; the configuration is then updated automatically and only the following remains:
spec:
  podCIDR: 172.20.1.0/24
  podCIDRs:
  - 172.20.1.0/24