Kubernetes Node Health Monitoring

Node Health Monitoring

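Node Problem Detector is a daemon that monitors and reports on a node's health. You can run Node Problem Detector as a DaemonSet or as a standalone daemon. It collects information about node problems from various daemons and reports them to the API server as NodeConditions and Events.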
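
To make the reporting model concrete, the sketch below filters the conditions of a node object, in the JSON shape returned by `kubectl get node <name> -o json`, down to those outside the usual built-in set. The sample document, the `KernelDeadlock` entry, and the helper name `extra_conditions` are illustrative assumptions, not output captured from a real cluster.

```python
import json

# Condition types that are commonly built in; anything else is assumed here
# to come from an add-on such as Node Problem Detector.
BUILT_IN_CONDITIONS = {
    "Ready",
    "MemoryPressure",
    "DiskPressure",
    "PIDPressure",
    "NetworkUnavailable",
}

# Hypothetical, trimmed-down node object for illustration only.
SAMPLE_NODE = json.loads("""
{
  "metadata": {"name": "node-1"},
  "status": {
    "conditions": [
      {"type": "Ready", "status": "True", "reason": "KubeletReady"},
      {"type": "KernelDeadlock", "status": "False", "reason": "KernelHasNoDeadlock"}
    ]
  }
}
""")


def extra_conditions(node):
    """Return (type, status, reason) for conditions outside the built-in set."""
    conditions = node.get("status", {}).get("conditions", [])
    return [
        (c["type"], c["status"], c.get("reason", ""))
        for c in conditions
        if c["type"] not in BUILT_IN_CONDITIONS
    ]


if __name__ == "__main__":
    for ctype, status, reason in extra_conditions(SAMPLE_NODE):
        print(f"{ctype}={status} ({reason})")
```

Against a real cluster, the same function could be fed the parsed output of the kubectl command mentioned above.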

To learn how to install and use Node Problem Detector, see the Node Problem Detector project documentation.

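Before you begin

You need a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with it. It is recommended to run this tutorial on a cluster with at least two nodes that are not acting as control plane hosts. If you do not have a cluster yet, you can create one with Minikube or with another Kubernetes cluster-building tool.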

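Limitations

Node Problem Detector supports only file-based kernel logs. It does not support log tools such as journald.
Node Problem Detector uses the kernel log format to report kernel problems.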

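Enabling Node Problem Detector

Some cloud providers enable Node Problem Detector as an addon. You can also enable Node Problem Detector with kubectl or by creating an addon Pod.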

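Using kubectl to enable Node Problem Detector

kubectl provides the most flexible management of Node Problem Detector. You can overwrite the default configuration to fit your environment or to detect custom node problems. For example:

Create a Node Problem Detector configuration similar to node-problem-detector.yaml: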

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-problem-detector-v0.1
  namespace: kube-system
  labels:
    k8s-app: node-problem-detector
    version: v0.1
    kubernetes.io/cluster-service: "true"
spec:
  selector:
    matchLabels:
      k8s-app: node-problem-detector
      version: v0.1
      kubernetes.io/cluster-service: "true"
  template:
    metadata:
      labels:
        k8s-app: node-problem-detector
        version: v0.1
        kubernetes.io/cluster-service: "true"
    spec:
      hostNetwork: true
      containers:
      - name: node-problem-detector
        image: k8s.gcr.io/node-problem-detector:v0.1
        securityContext:
          privileged: true
        resources:
          limits:
            cpu: "200m"
            memory: "100Mi"
          requests:
            cpu: "20m"
            memory: "20Mi"
        volumeMounts:
        - name: log
          mountPath: /log
          readOnly: true
      volumes:
      - name: log
        hostPath:
          path: /var/log/

Note: Verify that the system log directory is correct for your operating system distribution.

Start Node Problem Detector with kubectl:

kubectl apply -f https://k8s.io/examples/debug/node-problem-detector.yaml

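Using an addon Pod to enable Node Problem Detector

If you use a custom cluster bootstrap solution and do not need to overwrite the default configuration, you can use the addon Pod to further automate the deployment.

Create node-problem-detector.yaml and, on a control plane node, save the configuration in the addon Pod's directory /etc/kubernetes/addons/node-problem-detector.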

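Overwriting the configuration

The default configuration is embedded when the Node Problem Detector Docker image is built.

However, you can use a ConfigMap to overwrite it, as follows:

Change the configuration files in config/
Create the ConfigMap node-problem-detector-config: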

kubectl create configmap node-problem-detector-config --from-file=config/

Change node-problem-detector.yaml to use the ConfigMap:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-problem-detector-v0.1
  namespace: kube-system
  labels:
    k8s-app: node-problem-detector
    version: v0.1
    kubernetes.io/cluster-service: "true"
spec:
  selector:
    matchLabels:
      k8s-app: node-problem-detector
      version: v0.1
      kubernetes.io/cluster-service: "true"
  template:
    metadata:
      labels:
        k8s-app: node-problem-detector
        version: v0.1
        kubernetes.io/cluster-service: "true"
    spec:
      hostNetwork: true
      containers:
      - name: node-problem-detector
        image: k8s.gcr.io/node-problem-detector:v0.1
        securityContext:
          privileged: true
        resources:
          limits:
            cpu: "200m"
            memory: "100Mi"
          requests:
            cpu: "20m"
            memory: "20Mi"
        volumeMounts:
        - name: log
          mountPath: /log
          readOnly: true
        - name: config # Overwrite the config/ directory with ConfigMap volume
          mountPath: /config
          readOnly: true
      volumes:
      - name: log
        hostPath:
          path: /var/log/
      - name: config # Define ConfigMap volume
        configMap:
          name: node-problem-detector-config

Recreate Node Problem Detector with the new configuration file:

# If Node Problem Detector is already running, delete it before recreating it
kubectl delete -f https://k8s.io/examples/debug/node-problem-detector.yaml
kubectl apply -f https://k8s.io/examples/debug/node-problem-detector-configmap.yaml

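Note: This approach only applies to a Node Problem Detector started with kubectl.

Overwriting the configuration is not supported when Node Problem Detector runs as a cluster addon. The addon manager does not support ConfigMap.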

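Kernel Monitor

Kernel Monitor is a system log monitor daemon supported in Node Problem Detector. It watches the kernel log and detects known kernel problems according to predefined rules.

Kernel Monitor matches kernel problems against a predefined list of rules in config/kernel-monitor.json. The rule list is extensible; you can always extend it by overwriting the configuration.

Adding new NodeConditions

To support a new NodeCondition, create a condition definition in the conditions field of config/kernel-monitor.json: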

{
  "type": "NodeConditionType",
  "reason": "CamelCaseDefaultNodeConditionReason",
  "message": "arbitrary default node condition message"
}
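
As a minimal sketch of that edit, the helper below appends a condition definition to an in-memory copy of the configuration and rejects entries that lack one of the three keys shown above. The function name `add_condition`, the duplicate-type check, and the `ExampleDiskFault` condition are assumptions made for illustration; they are not part of Node Problem Detector.

```python
import json

REQUIRED_KEYS = ("type", "reason", "message")


def add_condition(config, condition):
    """Return a copy of config with condition appended to its conditions list."""
    missing = [key for key in REQUIRED_KEYS if key not in condition]
    if missing:
        raise ValueError(f"condition is missing keys: {missing}")
    existing = config.get("conditions", [])
    if any(c["type"] == condition["type"] for c in existing):
        raise ValueError(f"condition type {condition['type']!r} already defined")
    updated = dict(config)
    updated["conditions"] = existing + [condition]
    return updated


if __name__ == "__main__":
    # Stand-in for the parsed contents of config/kernel-monitor.json.
    config = {"conditions": [], "rules": []}
    config = add_condition(config, {
        "type": "ExampleDiskFault",  # hypothetical condition type
        "reason": "NoDiskFault",
        "message": "no disk fault detected",
    })
    print(json.dumps(config, indent=2))
```

The input dictionary is left unmodified, so a failed validation never leaves a half-edited configuration behind.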

Detecting new problems

To detect new problems, extend the rules field in config/kernel-monitor.json with a new rule definition:

{
  "type": "temporary/permanent",
  "condition": "NodeConditionOfPermanentIssue",
  "reason": "CamelCaseShortReason",
  "message": "regexp matching the issue in the kernel log"
}
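
The following sketch illustrates how a rule of this shape could be evaluated against a single kernel log line. It uses Python's `re` module, whose syntax differs in places from the regular expression engine that Node Problem Detector itself uses, so treat it as a way to reason about patterns rather than a faithful reimplementation. The two rules, the sample log line, and the helper `match_rules` are hypothetical.

```python
import re

# Hypothetical rules in the shape shown above; the patterns are examples only.
RULES = [
    {
        "type": "temporary",
        "reason": "OOMKilling",
        "message": r"Killed process \d+ \(.+\)",
    },
    {
        "type": "permanent",
        "condition": "KernelDeadlock",
        "reason": "TaskHung",
        "message": r"task \S+:\w+ blocked for more than \w+ seconds\.",
    },
]


def match_rules(line, rules):
    """Return the rules whose message pattern is found in the log line."""
    return [rule for rule in rules if re.search(rule["message"], line)]


if __name__ == "__main__":
    line = "task docker:1234 blocked for more than 120 seconds."
    for rule in match_rules(line, RULES):
        # Only rules that name a condition can change a NodeCondition.
        target = rule.get("condition", "Event only")
        print(f"{rule['type']} {rule['reason']} -> {target}")
```

Testing a new pattern against real log lines in this way before adding it to the configuration can catch escaping mistakes early.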

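Configuring the path for the kernel log device

Check the kernel log path location in your operating system (OS) distribution. On Linux, the kernel log device is usually presented as /dev/kmsg. However, the log path location varies by OS distribution. The log field in config/kernel-monitor.json represents the log path inside the container. Configure the log field to match the device path as seen by Node Problem Detector.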
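
Because the log field holds the path inside the container, a host path has to be translated through the volume mount. The sketch below performs that translation for the mount used in the manifests on this page (host /var/log/ mounted at /log). The file name kern.log and the function `container_log_path` are assumptions for illustration.

```python
from pathlib import PurePosixPath


def container_log_path(host_log, host_mount="/var/log", container_mount="/log"):
    """Translate a host log path into the path visible inside the container."""
    # relative_to raises ValueError when host_log is not under host_mount.
    relative = PurePosixPath(host_log).relative_to(PurePosixPath(host_mount))
    return str(PurePosixPath(container_mount) / relative)


if __name__ == "__main__":
    print(container_log_path("/var/log/kern.log"))
```

A path outside the mounted directory, such as /dev/kmsg, raises ValueError here, which signals that an additional volume mount would be needed before the container could read it.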

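Adding support for other log formats

Kernel Monitor uses a Translator plugin to translate kernel log lines into its internal data structure. You can implement a new translator for a new log format.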
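
To show what such a translation involves, here is a conceptual sketch in Python that parses one record in the /dev/kmsg layout (`prefix,sequence,timestamp_us,flags;message`) into a small structure. It does not reproduce the project's actual Translator interface; the class `KernelLogEntry`, its field names, and the function `translate_kmsg` are assumptions.

```python
from dataclasses import dataclass


@dataclass
class KernelLogEntry:
    severity: int          # syslog level, 0 (emergency) to 7 (debug)
    sequence: int          # kernel-assigned record number
    uptime_seconds: float  # time since boot
    message: str


def translate_kmsg(record):
    """Parse one /dev/kmsg-style record into a KernelLogEntry."""
    header, separator, message = record.partition(";")
    if not separator:
        raise ValueError("missing ';' separator")
    fields = header.split(",")
    if len(fields) < 3:
        raise ValueError("header needs at least prefix, sequence and timestamp")
    prefix, sequence, timestamp_us = (int(field) for field in fields[:3])
    return KernelLogEntry(
        severity=prefix & 7,  # low three bits hold the level, the rest the facility
        sequence=sequence,
        uptime_seconds=timestamp_us / 1_000_000,
        message=message.rstrip("\n"),
    )


if __name__ == "__main__":
    print(translate_kmsg("6,339,5140900,-;NET: Registered protocol family 10\n"))
```

A translator for a file-based format would follow the same pattern, with the parsing step replaced by whatever timestamp and message layout that file uses.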

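Recommendations and restrictions

It is recommended to run Node Problem Detector in your cluster to monitor node health. When running Node Problem Detector, you can expect extra resource overhead on each node. This is usually acceptable, because:

The kernel log grows relatively slowly.
A resource limit is set for Node Problem Detector.
Even under high load, the resource usage is acceptable. For more information, see the Node Problem Detector benchmark result.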
