跳至主要内容

[18] 為什麼 CloudWatch Insight 可以收集 EKS cluster node 及 pod metrics

昨日 討論了以 FluentD 為例討論 log agent 部署至每個 node 上收集 /var/log/containers 目錄。接續討論 Container Insight 部署完之後,其這些 metrics 來源及收集方式。

建置環境

  1. 透過官方 Container Insights on Amazon EKS and Kubernetes 文件1所提供的 Quick Start template 部署 CloudWatch agent 及 Fluentd。

    curl https://raw.githubusercontent.com/aws-samples/amazon-cloudwatch-container-insights/latest/k8s-deployment-manifest-templates/deployment-mode/daemonset/container-insights-monitoring/quickstart/cwagent-fluentd-quickstart.yaml | sed "s/{{cluster_name}}/ironman-2022/;s/{{region_name}}/eu-west-1/" | kubectl apply -f -

    其 source code template 也可於 GitHub - CloudWatch Agent for Container Insights Kubernetes Monitoring2 查看。

  2. 由於 CloudWatch agent 及 Fluentd 皆需要具有 CloudWatch Logs 或 Metrics IAM 權限,因此以下範例透過關聯 CloudWatchAgentAdminPolicy 方式設定 IRSA 3

    透過 eksctl 命令建立 IAM role 給與 ServiceAccount fluentdcloudwatch-agent 使用。

    $ eksctl create iamserviceaccount \
    --cluster=ironman-2022 \
    --namespace=amazon-cloudwatch \
    --name=cloudwatch-agent \
    --attach-policy-arn=arn:aws:iam::aws:policy/CloudWatchAgentAdminPolicy \
    --override-existing-serviceaccounts \
    --region eu-west-1 \
    --approve

    $ eksctl create iamserviceaccount \
    --cluster=ironman-2022 \
    --namespace=amazon-cloudwatch \
    --name=fluentd \
    --attach-policy-arn=arn:aws:iam::aws:policy/CloudWatchAgentAdminPolicy \
    --override-existing-serviceaccounts \
    --region eu-west-1 \
    --approve

    根據 CloudWatch Insight 需求4,CloudWatch Agent 需要額外 EC2 DescribeVolumes 權限,若關聯的 Policy 沒有此權限則需要額外設定。

    {
    "Version": "2012-10-17",
    "Statement": [
    {
    "Action": [
    "ec2:DescribeVolumes"
    ],
    "Resource": "*",
    "Effect": "Allow"
    }
    ]
    }
  3. 分別透過 kubectl rollout 重啟 DaemonSet cloudwatch-agentfluentd-cloudwatch

    $ kubectl -n amazon-cloudwatch rollout restart ds cloudwatch-agent
    daemonset.apps/cloudwatch-agent restarted

    $ kubectl -n amazon-cloudwatch rollout restart ds fluentd-cloudwatch
    daemonset.apps/fluentd-cloudwatch restarted

驗證

為避免混淆 FluentD 及 CloudWatch Agent 所負責事項,以下仍會依照 FluentD 及 CloudWatch Agent 設定檔進行解釋。

預設安裝好上述 QuickStart template ,則可以在 CloudWatch Log groups 查看到以下 Log group:

  • /aws/containerinsights/ironman-2022/application
  • /aws/containerinsights/ironman-2022/dataplane
  • /aws/containerinsights/ironman-2022/host
  • /aws/containerinsights/ironman-2022/performance

FluentD

      <match **>
@type cloudwatch_logs
@id out_cloudwatch_logs_systemd
region "#{ENV.fetch('AWS_REGION')}"
log_group_name "/aws/containerinsights/#{ENV.fetch('CLUSTER_NAME')}/dataplane"
log_stream_name_key stream_name
auto_create_stream true
remove_log_stream_name_key true
<buffer>
flush_interval 5
chunk_limit_size 2m
queued_chunks_limit_size 32
retry_forever true
</buffer>
</match>

      <match **>
@type cloudwatch_logs
@id out_cloudwatch_logs_containers
region "#{ENV.fetch('AWS_REGION')}"
log_group_name "/aws/containerinsights/#{ENV.fetch('CLUSTER_NAME')}/application"
log_stream_name_key stream_name
remove_log_stream_name_key true
auto_create_stream true
<buffer>
flush_interval 5
chunk_limit_size 2m
queued_chunks_limit_size 32
retry_forever true
</buffer>
</match>
<match host.**>
@type cloudwatch_logs
@id out_cloudwatch_logs_host_logs
region "#{ENV.fetch('AWS_REGION')}"
log_group_name "/aws/containerinsights/#{ENV.fetch('CLUSTER_NAME')}/host"
log_stream_name_key stream_name
remove_log_stream_name_key true
auto_create_stream true
<buffer>
flush_interval 5
chunk_limit_size 2m
queued_chunks_limit_size 32
retry_forever true
</buffer>
</match>

其中 application 部分則是 Day 17 所提及的 Pod Container log 。另外,在此 QuickStart template 中所使用的 fluentd 版本已經是較舊版本: fluent/fluentd-kubernetes-daemonset:v1.7.3-debian-cloudwatch-1.0。我們也可以透過查看 fluentd container 內的 Gemfile 查看 plugin 版本:

若有需要使用較新 fluentd 版本 ,則可以參考 Fluentd Daemonset for Kubernetes5

$ kubectl -n amazon-cloudwatch exec -it fluentd-cloudwatch-qkf9m -- cat /fluentd/Gemfile
Defaulted container "fluentd-cloudwatch" out of: fluentd-cloudwatch, copy-fluentd-config (init), update-log-driver (init)
# AUTOMATICALLY GENERATED
# DO NOT EDIT THIS FILE DIRECTLY, USE /templates/Gemfile.erb

source "https://rubygems.org"

gem "fluentd", "1.7.3"
gem "oj", "3.8.1"
gem "fluent-plugin-multi-format-parser", "~> 1.0.0"
gem "fluent-plugin-concat", "~> 2.3.0"
gem "fluent-plugin-grok-parser", "~> 2.5.0"
gem "fluent-plugin-prometheus", "~> 1.5.0"
gem 'fluent-plugin-json-in-json-2', ">= 1.0.2"
gem "fluent-plugin-record-modifier", "~> 2.0.0"
gem "fluent-plugin-rewrite-tag-filter", "~> 2.2.0"
gem "aws-sdk-cloudwatchlogs", "~> 1.0"
gem "fluent-plugin-cloudwatch-logs", "~> 0.7.4"
gem "fluent-plugin-kubernetes_metadata_filter", "~> 2.3.0"
gem "ffi"
gem "fluent-plugin-systemd", "~> 1.0.1"

根據 Fluentd 設定檔,可以確認以下目錄:

  • application:所有於 /var/log/containers 目錄下的 application log。
  • host:Node 上的 /var/log/dmesg/var/log/secure/var/log/messages 目錄。
  • dataplane:Node 上的 /var/log/journal 目錄,主要收集 kubelet.servicekubeproxy.servicedocker.service

CloudWatch Agent

根據 ConfigMap cwagentconfig,僅設定了 metrics_collected 屬於 kubernetes 類型,及定義 cluster_name

$ kubectl -n amazon-cloudwatch get cm cwagentconfig -o yaml
apiVersion: v1
data:
cwagentconfig.json: |
{
"agent": {
"region": "eu-west-1"
},
"logs": {
"metrics_collected": {
"kubernetes": {
"cluster_name": "ironman-2022",
"metrics_collection_interval": 60
}
},
"force_flush_interval": 5
}
}
kind: ConfigMap
...
...

查看 CloudWatch Agent logs,可以查看到預設 Config 會被轉譯成 TOML 文件格式:

$ kubectl -n amazon-cloudwatch logs cloudwatch-agent-fdhmb
...
...
...
Configuration validation first phase succeeded

2022/10/03 12:46:34 I! Config has been translated into TOML /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.toml
2022/10/03 12:46:34 D! toml config [agent]
collection_jitter = "0s"
debug = false
flush_interval = "1s"
flush_jitter = "0s"
hostname = "ip-192-168-42-19.eu-west-1.compute.internal"
interval = "60s"
logfile = ""
logtarget = "lumberjack"
metric_batch_size = 1000
metric_buffer_limit = 10000
omit_hostname = true
precision = ""
quiet = false
round_interval = false

[inputs]

[[inputs.cadvisor]]
container_orchestrator = "eks"
interval = "60s"
mode = "detail"
[inputs.cadvisor.tags]
metricPath = "logs"

[[inputs.k8sapiserver]]
interval = "60s"
node_name = "ip-192-168-42-19.eu-west-1.compute.internal"
[inputs.k8sapiserver.tags]
metricPath = "logs_k8sapiserver"

[outputs]

[[outputs.cloudwatchlogs]]
force_flush_interval = "5s"
log_stream_name = "ip-192-168-42-19.eu-west-1.compute.internal"
region = "eu-west-1"
tagexclude = ["metricPath"]
[outputs.cloudwatchlogs.tagpass]
metricPath = ["logs", "logs_k8sapiserver"]

[processors]

[[processors.ec2tagger]]
disk_device_tag_key = "device"
ebs_device_keys = ["*"]
ec2_instance_tag_keys = ["aws:autoscaling:groupName"]
ec2_metadata_tags = ["InstanceId", "InstanceType"]
[processors.ec2tagger.tagpass]
metricPath = ["logs"]

[[processors.k8sdecorator]]
cluster_name = "ironman-2022"
host_ip = "192.168.42.19"
node_name = "ip-192-168-42-19.eu-west-1.compute.internal"
order = 1
prefer_full_pod_name = false
tag_service = true
[processors.k8sdecorator.tagpass]
metricPath = ["logs", "logs_k8sapiserver"]
2022-10-03T12:46:34Z I! Starting AmazonCloudWatchAgent 1.247354.0
2022-10-03T12:46:34Z I! AWS SDK log level not set
2022-10-03T12:46:34Z I! Loaded inputs: cadvisor k8sapiserver
2022-10-03T12:46:34Z I! Loaded aggregators:
2022-10-03T12:46:34Z I! Loaded processors: ec2tagger k8sdecorator
2022-10-03T12:46:34Z I! Loaded outputs: cloudwatchlogs
2022-10-03T12:46:34Z I! Tags enabled:
2022-10-03T12:46:34Z I! [agent] Config: Interval:1m0s, Quiet:false, Hostname:"ip-192-168-42-19.eu-west-1.compute.internal", Flush Interval:1s
2022-10-03T12:46:34Z I! [processors.ec2tagger] ec2tagger: Check ec2 metadata
2022-10-03T12:46:34Z I! [logagent] starting
2022-10-03T12:46:34Z I! [logagent] found plugin cloudwatchlogs is a log backend
2022-10-03T12:46:34Z I! [processors.ec2tagger] ec2tagger: EC2 tagger has started initialization.
2022-10-03T12:46:34Z I! [processors.ec2tagger] ec2tagger: Check ec2 metadata
2022-10-03T12:46:34Z I! [processors.ec2tagger] ec2tagger: EC2 tagger has started initialization.
I1003 12:46:34.492882 1 leaderelection.go:248] attempting to acquire leader lease amazon-cloudwatch/cwagent-clusterleader...
I1003 12:46:34.504069 1 leaderelection.go:258] successfully acquired lease amazon-cloudwatch/cwagent-clusterleader
2022-10-03T12:46:34Z I! k8sapiserver Switch New Leader: ip-192-168-42-19.eu-west-1.compute.internal
2022-10-03T12:46:34Z I! k8sapiserver OnStartedLeading: ip-192-168-42-19.eu-west-1.compute.internal
I1003 12:46:34.504480 1 event.go:285] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"amazon-cloudwatch", Name:"cwagent-clusterleader", UID:"63166128-5ea1-40c7-ae47-67ac27ed5830", APIVersion:"v1", ResourceVersion:"5330123", Fi
eldPath:""}): type: 'Normal' reason: 'LeaderElection' ip-192-168-42-19.eu-west-1.compute.internal became leader
W1003 12:46:34.610597 1 manager.go:291] Could not configure a source for OOM detection, disabling OOM events: open /dev/kmsg: no such file or directory
2022-10-03T12:46:34Z I! [processors.ec2tagger] ec2tagger: Initial retrieval of tags succeded
2022-10-03T12:46:34Z I! [processors.ec2tagger] ec2tagger: EC2 tagger has started, finished initial retrieval of tags and Volumes
2022-10-03T12:46:34Z I! [processors.ec2tagger] ec2tagger: Initial retrieval of tags succeded
2022-10-03T12:46:34Z I! [processors.ec2tagger] ec2tagger: EC2 tagger has started, finished initial retrieval of tags and Volumes
2022-10-03T12:46:40Z I! [outputs.cloudwatchlogs] First time sending logs to /aws/containerinsights/ironman-2022/performance/ip-192-168-42-19.eu-west-1.compute.internal since startup so sequenceToken is nil, learned new token:(0xc000fbe130): Th
e given sequenceToken is invalid. The next expected sequenceToken is: 49615097157874583574523473011765967269436169076596539730
2022-10-03T12:46:40Z W! [outputs.cloudwatchlogs] Retried 0 time, going to sleep 103.736985ms before retrying.
2022-10-03T12:47:34Z I! number of namespace to running pod num map[amazon-cloudwatch:8 default:1 demo:4 kube-system:10]

格式上大致可以分為:

  • inputscadvisork8sapiserver
  • processorsec2taggerk8sdecorator
  • outputscloudwatchlogs,log stream 為 ip-192-168-42-19.eu-west-1.compute.internal。此為該 CloudWatch Agent Pod 所在 node。

同時,我們也能確認 Logs 被更新 log stream /aws/containerinsights/ironman-2022/performance/ip-192-168-42-19.eu-west-1.compute.internal

查看 Log event:

{
"AutoScalingGroupName": "eks-ng1-public-ssh-62c19db5-f965-bdb7-373a-147e04d9f124",
"CloudWatchMetrics": [
{
"Metrics": [
{
"Unit": "Percent",
"Name": "pod_cpu_utilization"
},
{
"Unit": "Percent",
"Name": "pod_memory_utilization"
},
{
"Unit": "Bytes/Second",
"Name": "pod_network_rx_bytes"
},
{
"Unit": "Bytes/Second",
"Name": "pod_network_tx_bytes"
},
{
"Unit": "Percent",
"Name": "pod_memory_utilization_over_pod_limit"
}
],
"Dimensions": [
[
"PodName",
"Namespace",
"ClusterName"
],
[
"Namespace",
"ClusterName"
],
[
"ClusterName"
]
],
"Namespace": "ContainerInsights"
},
{
"Metrics": [
{
"Unit": "Percent",
"Name": "pod_cpu_reserved_capacity"
},
{
"Unit": "Percent",
"Name": "pod_memory_reserved_capacity"
}
],
"Dimensions": [
[
"PodName",
"Namespace",
"ClusterName"
],
[
"ClusterName"
]
],
"Namespace": "ContainerInsights"
},
{
"Metrics": [
{
"Unit": "Count",
"Name": "pod_number_of_container_restarts"
}
],
"Dimensions": [
[
"PodName",
"Namespace",
"ClusterName"
]
],
"Namespace": "ContainerInsights"
}
],
"ClusterName": "ironman-2022",
"InstanceId": "i-03b91195eefbde8e3",
"InstanceType": "m5.large",
"Namespace": "amazon-cloudwatch",
"NodeName": "ip-192-168-71-180.eu-west-1.compute.internal",
"PodName": "fluentd-cloudwatch",
"Sources": [
"cadvisor",
"pod",
"calculated"
],
"Timestamp": "1664806540622",
"Type": "Pod",
"Version": "0",
"kubernetes": {
"host": "ip-192-168-71-180.eu-west-1.compute.internal",
"labels": {
"controller-revision-hash": "66c8bfcf8f",
"k8s-app": "fluentd-cloudwatch",
"pod-template-generation": "2"
},
"namespace_name": "amazon-cloudwatch",
"pod_id": "8e6ebccc-2136-4c79-80e4-1d023677a66a",
"pod_name": "fluentd-cloudwatch-2tfpp",
"pod_owners": [
{
"owner_kind": "DaemonSet",
"owner_name": "fluentd-cloudwatch"
}
]
},
"pod_cpu_request": 100,
"pod_cpu_reserved_capacity": 5,
"pod_cpu_usage_system": 0.49339748943631484,
"pod_cpu_usage_total": 2.067604974446889,
"pod_cpu_usage_user": 1.6446582981210494,
"pod_cpu_utilization": 0.10338024872234447,
"pod_memory_cache": 0,
"pod_memory_failcnt": 0,
"pod_memory_hierarchical_pgfault": 0,
"pod_memory_hierarchical_pgmajfault": 0,
"pod_memory_limit": 419430400,
"pod_memory_mapped_file": 0,
"pod_memory_max_usage": 137388032,
"pod_memory_pgfault": 0,
"pod_memory_pgmajfault": 0,
"pod_memory_request": 209715200,
"pod_memory_reserved_capacity": 2.5818072348088124,
"pod_memory_rss": 130351104,
"pod_memory_swap": 0,
"pod_memory_usage": 137129984,
"pod_memory_utilization": 1.6016785781100062,
"pod_memory_utilization_over_pod_limit": 31.0185546875,
"pod_memory_working_set": 130101248,
"pod_network_rx_bytes": 4.580653646182247,
"pod_network_rx_dropped": 0,
"pod_network_rx_errors": 0,
"pod_network_rx_packets": 0.08268327881195393,
"pod_network_total_bytes": 7.755691552561277,
"pod_network_tx_bytes": 3.1750379063790306,
"pod_network_tx_dropped": 0,
"pod_network_tx_errors": 0,
"pod_network_tx_packets": 0.06614662304956315,
"pod_number_of_container_restarts": 0,
"pod_number_of_containers": 1,
"pod_number_of_running_containers": 1,
"pod_status": "Running"
}
{
"AutoScalingGroupName": "eks-ng1-public-ssh-62c19db5-f965-bdb7-373a-147e04d9f124",
"CloudWatchMetrics": [
{
"Metrics": [
{
"Unit": "Percent",
"Name": "node_filesystem_utilization"
}
],
"Dimensions": [
[
"NodeName",
"InstanceId",
"ClusterName"
],
[
"ClusterName"
]
],
"Namespace": "ContainerInsights"
}
],
"ClusterName": "ironman-2022",
"EBSVolumeId": "aws://eu-west-1c/vol-0f2bbde78cd431365",
"InstanceId": "i-05830139120bed5e0",
"InstanceType": "m5.large",
"NodeName": "ip-192-168-42-19.eu-west-1.compute.internal",
"Sources": [
"cadvisor",
"calculated"
],
"Timestamp": "1664810189159",
"Type": "NodeFS",
"Version": "0",
"device": "/dev/nvme0n1p1",
"fstype": "vfs",
"kubernetes": {
"host": "ip-192-168-42-19.eu-west-1.compute.internal"
},
"node_filesystem_available": 82884431872,
"node_filesystem_capacity": 85886742528,
"node_filesystem_inodes": 41941952,
"node_filesystem_inodes_free": 41858329,
"node_filesystem_usage": 3002310656,
"node_filesystem_utilization": 3.4956625057950177
}

不論是 Pod 或 NodeFS 資訊,皆被被多更新了 CloudWatchMetrics Array 格式,其定義 namespace 為 ContainerInsights、對應 DimensionsMetrics

總結

Container Insight 所提供的 metrics,皆由 Container Agent 以 DaemonSet 方式部署透過 cadvisor 收集 node 及 pod metrics,並於 log 增加 CloudWatchMetrics 格式使 CloudWatch Metrics 讀取。

參考文件

Footnotes

  1. Quick Start setup for Container Insights on Amazon EKS and Kubernetes - Quick Start with the CloudWatch agent and Fluentd

  2. CloudWatch Agent for Container Insights Kubernetes Monitoring

  3. IAM roles for service accounts

  4. Verify prerequisites

  5. Fluentd Daemonset for Kubernetes | GitHub