[18] 為什麼 CloudWatch Insight 可以收集 EKS cluster node 及 pod metrics

昨日討論了以 FluentD 為例討論 log agent 部署至每個 node 上收集 /var/log/containers 目錄。接續討論 Container Insight 部署完之後，其這些 metrics 來源及收集方式。

建置環境

透過官方 Container Insights on Amazon EKS and Kubernetes 文件¹所提供的 Quick Start template 部署 CloudWatch agent 及 Fluentd。

curl https://raw.githubusercontent.com/aws-samples/amazon-cloudwatch-container-insights/latest/k8s-deployment-manifest-templates/deployment-mode/daemonset/container-insights-monitoring/quickstart/cwagent-fluentd-quickstart.yaml | sed "s/{{cluster_name}}/ironman-2022/;s/{{region_name}}/eu-west-1/" | kubectl apply -f -

其 source code template 也可於 GitHub - CloudWatch Agent for Container Insights Kubernetes Monitoring² 查看。

由於 CloudWatch agent 及 Fluentd 皆需要具有 CloudWatch Logs 或 Metrics IAM 權限，因此以下範例透過關聯 CloudWatchAgentAdminPolicy 方式設定 IRSA ³。

透過 eksctl 命令建立 IAM role 給與 ServiceAccount fluentd 及 cloudwatch-agent 使用。

$ eksctl create iamserviceaccount \
--cluster=ironman-2022 \
--namespace=amazon-cloudwatch \
--name=cloudwatch-agent \
--attach-policy-arn=arn:aws:iam::aws:policy/CloudWatchAgentAdminPolicy \
--override-existing-serviceaccounts \
--region eu-west-1 \
--approve

$ eksctl create iamserviceaccount \
--cluster=ironman-2022 \
--namespace=amazon-cloudwatch \
--name=fluentd \
--attach-policy-arn=arn:aws:iam::aws:policy/CloudWatchAgentAdminPolicy \
--override-existing-serviceaccounts \
--region eu-west-1 \
--approve

根據 CloudWatch Insight 需求⁴，CloudWatch Agent 需要額外 EC2 DescribeVolumes 權限，若關聯的 Policy 沒有此權限則需要額外設定。

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "ec2:DescribeVolumes"
            ],
            "Resource": "*",
            "Effect": "Allow"
        }
    ]
}

分別透過 kubectl rollout 重啟 DaemonSet cloudwatch-agent 及 fluentd-cloudwatch。

$ kubectl -n amazon-cloudwatch rollout restart ds cloudwatch-agent
daemonset.apps/cloudwatch-agent restarted

$ kubectl -n amazon-cloudwatch rollout restart ds fluentd-cloudwatch
daemonset.apps/fluentd-cloudwatch restarted

驗證

為避免混淆 FluentD 及 CloudWatch Agent 所負責事項，以下仍會依照 FluentD 及 CloudWatch Agent 設定檔進行解釋。

預設安裝好上述 QuickStart template ，則可以在 CloudWatch Log groups 查看到以下 Log group：

/aws/containerinsights/ironman-2022/application
/aws/containerinsights/ironman-2022/dataplane
/aws/containerinsights/ironman-2022/host
/aws/containerinsights/ironman-2022/performance

FluentD

/aws/containerinsights/#{ENV.fetch('CLUSTER_NAME')}/dataplane

      <match **>
        @type cloudwatch_logs
        @id out_cloudwatch_logs_systemd
        region "#{ENV.fetch('AWS_REGION')}"
        log_group_name "/aws/containerinsights/#{ENV.fetch('CLUSTER_NAME')}/dataplane"
        log_stream_name_key stream_name
        auto_create_stream true
        remove_log_stream_name_key true
        <buffer>
          flush_interval 5
          chunk_limit_size 2m
          queued_chunks_limit_size 32
          retry_forever true
        </buffer>
      </match>

/aws/containerinsights/#{ENV.fetch('CLUSTER_NAME')}/application

      <match **>
        @type cloudwatch_logs
        @id out_cloudwatch_logs_containers
        region "#{ENV.fetch('AWS_REGION')}"
        log_group_name "/aws/containerinsights/#{ENV.fetch('CLUSTER_NAME')}/application"
        log_stream_name_key stream_name
        remove_log_stream_name_key true
        auto_create_stream true
        <buffer>
          flush_interval 5
          chunk_limit_size 2m
          queued_chunks_limit_size 32
          retry_forever true
        </buffer>
      </match>

/aws/containerinsights/#{ENV.fetch('CLUSTER_NAME')}/host

<match host.**>
        @type cloudwatch_logs
        @id out_cloudwatch_logs_host_logs
        region "#{ENV.fetch('AWS_REGION')}"
        log_group_name "/aws/containerinsights/#{ENV.fetch('CLUSTER_NAME')}/host"
        log_stream_name_key stream_name
        remove_log_stream_name_key true
        auto_create_stream true
        <buffer>
          flush_interval 5
          chunk_limit_size 2m
          queued_chunks_limit_size 32
          retry_forever true
        </buffer>
      </match>

其中 application 部分則是 Day 17 所提及的 Pod Container log 。另外，在此 QuickStart template 中所使用的 fluentd 版本已經是較舊版本： fluent/fluentd-kubernetes-daemonset:v1.7.3-debian-cloudwatch-1.0。我們也可以透過查看 fluentd container 內的 Gemfile 查看 plugin 版本：

若有需要使用較新 fluentd 版本，則可以參考 Fluentd Daemonset for Kubernetes⁵。

$ kubectl -n amazon-cloudwatch exec -it fluentd-cloudwatch-qkf9m -- cat /fluentd/Gemfile
Defaulted container "fluentd-cloudwatch" out of: fluentd-cloudwatch, copy-fluentd-config (init), update-log-driver (init)
# AUTOMATICALLY GENERATED
# DO NOT EDIT THIS FILE DIRECTLY, USE /templates/Gemfile.erb

source "https://rubygems.org"

gem "fluentd", "1.7.3"
gem "oj", "3.8.1"
gem "fluent-plugin-multi-format-parser", "~> 1.0.0"
gem "fluent-plugin-concat", "~> 2.3.0"
gem "fluent-plugin-grok-parser", "~> 2.5.0"
gem "fluent-plugin-prometheus", "~> 1.5.0"
gem 'fluent-plugin-json-in-json-2', ">= 1.0.2"
gem "fluent-plugin-record-modifier", "~> 2.0.0"
gem "fluent-plugin-rewrite-tag-filter", "~> 2.2.0"
gem "aws-sdk-cloudwatchlogs", "~> 1.0"
gem "fluent-plugin-cloudwatch-logs", "~> 0.7.4"
gem "fluent-plugin-kubernetes_metadata_filter", "~> 2.3.0"
gem "ffi"
gem "fluent-plugin-systemd", "~> 1.0.1"

根據 Fluentd 設定檔，可以確認以下目錄：

application：所有於 /var/log/containers 目錄下的 application log。
host：Node 上的 /var/log/dmesg、/var/log/secure 及 /var/log/messages 目錄。
dataplane：Node 上的 /var/log/journal 目錄，主要收集 kubelet.service、kubeproxy.service 及 docker.service。

CloudWatch Agent

根據 ConfigMap cwagentconfig，僅設定了 metrics_collected 屬於 kubernetes 類型，及定義 cluster_name。

$ kubectl -n amazon-cloudwatch get cm cwagentconfig -o yaml
apiVersion: v1
data:
  cwagentconfig.json: |
    {
      "agent": {
        "region": "eu-west-1"
      },
      "logs": {
        "metrics_collected": {
          "kubernetes": {
            "cluster_name": "ironman-2022",
            "metrics_collection_interval": 60
          }
        },
        "force_flush_interval": 5
      }
    }
kind: ConfigMap
...
...

查看 CloudWatch Agent logs，可以查看到預設 Config 會被轉譯成 TOML 文件格式：

$ kubectl -n amazon-cloudwatch logs cloudwatch-agent-fdhmb
...
...
...
Configuration validation first phase succeeded

2022/10/03 12:46:34 I! Config has been translated into TOML /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.toml
2022/10/03 12:46:34 D! toml config [agent]
  collection_jitter = "0s"
  debug = false
  flush_interval = "1s"
  flush_jitter = "0s"
  hostname = "ip-192-168-42-19.eu-west-1.compute.internal"
  interval = "60s"
  logfile = ""
  logtarget = "lumberjack"
  metric_batch_size = 1000
  metric_buffer_limit = 10000
  omit_hostname = true
  precision = ""
  quiet = false
  round_interval = false

[inputs]

  [[inputs.cadvisor]]
    container_orchestrator = "eks"
    interval = "60s"
    mode = "detail"
    [inputs.cadvisor.tags]
      metricPath = "logs"

  [[inputs.k8sapiserver]]
    interval = "60s"
    node_name = "ip-192-168-42-19.eu-west-1.compute.internal"
    [inputs.k8sapiserver.tags]
      metricPath = "logs_k8sapiserver"

[outputs]

  [[outputs.cloudwatchlogs]]
    force_flush_interval = "5s"
    log_stream_name = "ip-192-168-42-19.eu-west-1.compute.internal"
    region = "eu-west-1"
    tagexclude = ["metricPath"]
    [outputs.cloudwatchlogs.tagpass]
      metricPath = ["logs", "logs_k8sapiserver"]

[processors]

  [[processors.ec2tagger]]
    disk_device_tag_key = "device"
    ebs_device_keys = ["*"]
    ec2_instance_tag_keys = ["aws:autoscaling:groupName"]
    ec2_metadata_tags = ["InstanceId", "InstanceType"]
    [processors.ec2tagger.tagpass]
      metricPath = ["logs"]

  [[processors.k8sdecorator]]
    cluster_name = "ironman-2022"
    host_ip = "192.168.42.19"
    node_name = "ip-192-168-42-19.eu-west-1.compute.internal"
    order = 1
    prefer_full_pod_name = false
    tag_service = true
    [processors.k8sdecorator.tagpass]
      metricPath = ["logs", "logs_k8sapiserver"]
2022-10-03T12:46:34Z I! Starting AmazonCloudWatchAgent 1.247354.0
2022-10-03T12:46:34Z I! AWS SDK log level not set
2022-10-03T12:46:34Z I! Loaded inputs: cadvisor k8sapiserver
2022-10-03T12:46:34Z I! Loaded aggregators:
2022-10-03T12:46:34Z I! Loaded processors: ec2tagger k8sdecorator
2022-10-03T12:46:34Z I! Loaded outputs: cloudwatchlogs
2022-10-03T12:46:34Z I! Tags enabled:
2022-10-03T12:46:34Z I! [agent] Config: Interval:1m0s, Quiet:false, Hostname:"ip-192-168-42-19.eu-west-1.compute.internal", Flush Interval:1s
2022-10-03T12:46:34Z I! [processors.ec2tagger] ec2tagger: Check ec2 metadata
2022-10-03T12:46:34Z I! [logagent] starting
2022-10-03T12:46:34Z I! [logagent] found plugin cloudwatchlogs is a log backend
2022-10-03T12:46:34Z I! [processors.ec2tagger] ec2tagger: EC2 tagger has started initialization.
2022-10-03T12:46:34Z I! [processors.ec2tagger] ec2tagger: Check ec2 metadata
2022-10-03T12:46:34Z I! [processors.ec2tagger] ec2tagger: EC2 tagger has started initialization.
I1003 12:46:34.492882       1 leaderelection.go:248] attempting to acquire leader lease amazon-cloudwatch/cwagent-clusterleader...
I1003 12:46:34.504069       1 leaderelection.go:258] successfully acquired lease amazon-cloudwatch/cwagent-clusterleader
2022-10-03T12:46:34Z I! k8sapiserver Switch New Leader: ip-192-168-42-19.eu-west-1.compute.internal
2022-10-03T12:46:34Z I! k8sapiserver OnStartedLeading: ip-192-168-42-19.eu-west-1.compute.internal
I1003 12:46:34.504480       1 event.go:285] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"amazon-cloudwatch", Name:"cwagent-clusterleader", UID:"63166128-5ea1-40c7-ae47-67ac27ed5830", APIVersion:"v1", ResourceVersion:"5330123", Fi
eldPath:""}): type: 'Normal' reason: 'LeaderElection' ip-192-168-42-19.eu-west-1.compute.internal became leader
W1003 12:46:34.610597       1 manager.go:291] Could not configure a source for OOM detection, disabling OOM events: open /dev/kmsg: no such file or directory
2022-10-03T12:46:34Z I! [processors.ec2tagger] ec2tagger: Initial retrieval of tags succeded
2022-10-03T12:46:34Z I! [processors.ec2tagger] ec2tagger: EC2 tagger has started, finished initial retrieval of tags and Volumes
2022-10-03T12:46:34Z I! [processors.ec2tagger] ec2tagger: Initial retrieval of tags succeded
2022-10-03T12:46:34Z I! [processors.ec2tagger] ec2tagger: EC2 tagger has started, finished initial retrieval of tags and Volumes
2022-10-03T12:46:40Z I! [outputs.cloudwatchlogs] First time sending logs to /aws/containerinsights/ironman-2022/performance/ip-192-168-42-19.eu-west-1.compute.internal since startup so sequenceToken is nil, learned new token:(0xc000fbe130): Th
e given sequenceToken is invalid. The next expected sequenceToken is: 49615097157874583574523473011765967269436169076596539730
2022-10-03T12:46:40Z W! [outputs.cloudwatchlogs] Retried 0 time, going to sleep 103.736985ms before retrying.
2022-10-03T12:47:34Z I! number of namespace to running pod num map[amazon-cloudwatch:8 default:1 demo:4 kube-system:10]

格式上大致可以分為：

inputs：cadvisor 及 k8sapiserver。
processors：ec2tagger 及 k8sdecorator
outputs：cloudwatchlogs，log stream 為 ip-192-168-42-19.eu-west-1.compute.internal。此為該 CloudWatch Agent Pod 所在 node。

同時，我們也能確認 Logs 被更新 log stream /aws/containerinsights/ironman-2022/performance/ip-192-168-42-19.eu-west-1.compute.internal。

查看 Log event：

{
    "AutoScalingGroupName": "eks-ng1-public-ssh-62c19db5-f965-bdb7-373a-147e04d9f124",
    "CloudWatchMetrics": [
        {
            "Metrics": [
                {
                    "Unit": "Percent",
                    "Name": "pod_cpu_utilization"
                },
                {
                    "Unit": "Percent",
                    "Name": "pod_memory_utilization"
                },
                {
                    "Unit": "Bytes/Second",
                    "Name": "pod_network_rx_bytes"
                },
                {
                    "Unit": "Bytes/Second",
                    "Name": "pod_network_tx_bytes"
                },
                {
                    "Unit": "Percent",
                    "Name": "pod_memory_utilization_over_pod_limit"
                }
            ],
            "Dimensions": [
                [
                    "PodName",
                    "Namespace",
                    "ClusterName"
                ],
                [
                    "Namespace",
                    "ClusterName"
                ],
                [
                    "ClusterName"
                ]
            ],
            "Namespace": "ContainerInsights"
        },
        {
            "Metrics": [
                {
                    "Unit": "Percent",
                    "Name": "pod_cpu_reserved_capacity"
                },
                {
                    "Unit": "Percent",
                    "Name": "pod_memory_reserved_capacity"
                }
            ],
            "Dimensions": [
                [
                    "PodName",
                    "Namespace",
                    "ClusterName"
                ],
                [
                    "ClusterName"
                ]
            ],
            "Namespace": "ContainerInsights"
        },
        {
            "Metrics": [
                {
                    "Unit": "Count",
                    "Name": "pod_number_of_container_restarts"
                }
            ],
            "Dimensions": [
                [
                    "PodName",
                    "Namespace",
                    "ClusterName"
                ]
            ],
            "Namespace": "ContainerInsights"
        }
    ],
    "ClusterName": "ironman-2022",
    "InstanceId": "i-03b91195eefbde8e3",
    "InstanceType": "m5.large",
    "Namespace": "amazon-cloudwatch",
    "NodeName": "ip-192-168-71-180.eu-west-1.compute.internal",
    "PodName": "fluentd-cloudwatch",
    "Sources": [
        "cadvisor",
        "pod",
        "calculated"
    ],
    "Timestamp": "1664806540622",
    "Type": "Pod",
    "Version": "0",
    "kubernetes": {
        "host": "ip-192-168-71-180.eu-west-1.compute.internal",
        "labels": {
            "controller-revision-hash": "66c8bfcf8f",
            "k8s-app": "fluentd-cloudwatch",
            "pod-template-generation": "2"
        },
        "namespace_name": "amazon-cloudwatch",
        "pod_id": "8e6ebccc-2136-4c79-80e4-1d023677a66a",
        "pod_name": "fluentd-cloudwatch-2tfpp",
        "pod_owners": [
            {
                "owner_kind": "DaemonSet",
                "owner_name": "fluentd-cloudwatch"
            }
        ]
    },
    "pod_cpu_request": 100,
    "pod_cpu_reserved_capacity": 5,
    "pod_cpu_usage_system": 0.49339748943631484,
    "pod_cpu_usage_total": 2.067604974446889,
    "pod_cpu_usage_user": 1.6446582981210494,
    "pod_cpu_utilization": 0.10338024872234447,
    "pod_memory_cache": 0,
    "pod_memory_failcnt": 0,
    "pod_memory_hierarchical_pgfault": 0,
    "pod_memory_hierarchical_pgmajfault": 0,
    "pod_memory_limit": 419430400,
    "pod_memory_mapped_file": 0,
    "pod_memory_max_usage": 137388032,
    "pod_memory_pgfault": 0,
    "pod_memory_pgmajfault": 0,
    "pod_memory_request": 209715200,
    "pod_memory_reserved_capacity": 2.5818072348088124,
    "pod_memory_rss": 130351104,
    "pod_memory_swap": 0,
    "pod_memory_usage": 137129984,
    "pod_memory_utilization": 1.6016785781100062,
    "pod_memory_utilization_over_pod_limit": 31.0185546875,
    "pod_memory_working_set": 130101248,
    "pod_network_rx_bytes": 4.580653646182247,
    "pod_network_rx_dropped": 0,
    "pod_network_rx_errors": 0,
    "pod_network_rx_packets": 0.08268327881195393,
    "pod_network_total_bytes": 7.755691552561277,
    "pod_network_tx_bytes": 3.1750379063790306,
    "pod_network_tx_dropped": 0,
    "pod_network_tx_errors": 0,
    "pod_network_tx_packets": 0.06614662304956315,
    "pod_number_of_container_restarts": 0,
    "pod_number_of_containers": 1,
    "pod_number_of_running_containers": 1,
    "pod_status": "Running"
}

{
    "AutoScalingGroupName": "eks-ng1-public-ssh-62c19db5-f965-bdb7-373a-147e04d9f124",
    "CloudWatchMetrics": [
        {
            "Metrics": [
                {
                    "Unit": "Percent",
                    "Name": "node_filesystem_utilization"
                }
            ],
            "Dimensions": [
                [
                    "NodeName",
                    "InstanceId",
                    "ClusterName"
                ],
                [
                    "ClusterName"
                ]
            ],
            "Namespace": "ContainerInsights"
        }
    ],
    "ClusterName": "ironman-2022",
    "EBSVolumeId": "aws://eu-west-1c/vol-0f2bbde78cd431365",
    "InstanceId": "i-05830139120bed5e0",
    "InstanceType": "m5.large",
    "NodeName": "ip-192-168-42-19.eu-west-1.compute.internal",
    "Sources": [
        "cadvisor",
        "calculated"
    ],
    "Timestamp": "1664810189159",
    "Type": "NodeFS",
    "Version": "0",
    "device": "/dev/nvme0n1p1",
    "fstype": "vfs",
    "kubernetes": {
        "host": "ip-192-168-42-19.eu-west-1.compute.internal"
    },
    "node_filesystem_available": 82884431872,
    "node_filesystem_capacity": 85886742528,
    "node_filesystem_inodes": 41941952,
    "node_filesystem_inodes_free": 41858329,
    "node_filesystem_usage": 3002310656,
    "node_filesystem_utilization": 3.4956625057950177
}

不論是 Pod 或 NodeFS 資訊，皆被被多更新了 CloudWatchMetrics Array 格式，其定義 namespace 為 ContainerInsights、對應 Dimensions 及 Metrics。

總結

Container Insight 所提供的 metrics，皆由 Container Agent 以 DaemonSet 方式部署透過 cadvisor 收集 node 及 pod metrics，並於 log 增加 CloudWatchMetrics 格式使 CloudWatch Metrics 讀取。

[18] 為什麼 CloudWatch Insight 可以收集 EKS cluster node 及 pod metrics

建置環境​

驗證​

FluentD​

CloudWatch Agent​

總結​

參考文件​

Footnotes​

建置環境

驗證

FluentD

CloudWatch Agent

總結

參考文件

Footnotes