
[26] Why Does EKS Return HTTP 502 Errors When Updating a Kubernetes Deployment? (Part 1)

EKS officially provides the AWS Load Balancer Controller to integrate with AWS ELB resources:

  • Deploys Application Load Balancers (ALB) for Kubernetes Ingress objects
  • Deploys Network Load Balancers (NLB) for Kubernetes Service objects

We can therefore create the corresponding load balancer resources simply by creating a Service or an Ingress. However, while a deployment is being updated, the ALB can occasionally be observed responding with HTTP 502. This article explores why the ALB returns HTTP 502 errors.

Setting up the environment

Installing the AWS Load Balancer Controller

  1. Download the IAM policy.
curl -o iam-policy.json https://raw.githubusercontent.com/kubernetes-sigs/aws-load-balancer-controller/v2.4.4/docs/install/iam_policy.json
  2. Create the IAM policy with the AWS CLI aws iam create-policy command.
$ aws iam create-policy \
--policy-name AWSLoadBalancerControllerIAMPolicy \
--policy-document file://iam-policy.json
  3. Use eksctl to create the IAM role and Service Account for IRSA (IAM roles for service accounts) [1].
$ eksctl create iamserviceaccount \
--cluster=ironman-2022 \
--namespace=kube-system \
--name=aws-load-balancer-controller \
--attach-policy-arn=arn:aws:iam::111111111111:policy/AWSLoadBalancerControllerIAMPolicy \
--override-existing-serviceaccounts \
--region eu-west-1 \
--approve
2022-10-11 14:07:43 [ℹ] 4 existing iamserviceaccount(s) (amazon-cloudwatch/cloudwatch-agent,amazon-cloudwatch/fluentd,default/my-s3-sa,kube-system/aws-node) will be excluded
2022-10-11 14:07:43 [ℹ] 1 iamserviceaccount (kube-system/aws-load-balancer-controller) was included (based on the include/exclude rules)
2022-10-11 14:07:43 [!] metadata of serviceaccounts that exist in Kubernetes will be updated, as --override-existing-serviceaccounts was set
2022-10-11 14:07:43 [ℹ] 1 task: {
    2 sequential sub-tasks: {
        create IAM role for serviceaccount "kube-system/aws-load-balancer-controller",
        create serviceaccount "kube-system/aws-load-balancer-controller",
    } }
2022-10-11 14:07:43 [ℹ] building iamserviceaccount stack "eksctl-ironman-2022-addon-iamserviceaccount-kube-system-aws-load-balancer-controller"
2022-10-11 14:07:44 [ℹ] deploying stack "eksctl-ironman-2022-addon-iamserviceaccount-kube-system-aws-load-balancer-controller"
2022-10-11 14:07:44 [ℹ] waiting for CloudFormation stack "eksctl-ironman-2022-addon-iamserviceaccount-kube-system-aws-load-balancer-controller"
2022-10-11 14:08:14 [ℹ] waiting for CloudFormation stack "eksctl-ironman-2022-addon-iamserviceaccount-kube-system-aws-load-balancer-controller"
2022-10-11 14:08:14 [ℹ] created serviceaccount "kube-system/aws-load-balancer-controller"
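As a sanity check, the Service Account should now carry the eks.amazonaws.com/role-arn annotation pointing at the newly created IAM role (the role name will differ per cluster):
$ kubectl -n kube-system get sa aws-load-balancer-controller -o yaml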
  4. The AWS Load Balancer Controller provides a Helm chart [2]. Install Helm following the Helm documentation [3]; the latest version, v3.10.0, is used here:
$ curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
$ chmod 700 get_helm.sh
$ ./get_helm.sh
Downloading https://get.helm.sh/helm-v3.10.0-linux-amd64.tar.gz
Verifying checksum... Done.
Preparing to install helm into /usr/local/bin
helm installed into /usr/local/bin/helm
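A quick check that the binary is on the PATH and at the expected version:
$ helm version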
  5. Add the Helm repo:
helm repo add eks https://aws.github.io/eks-charts
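If the repo had been added previously, refresh the index so the latest chart version is picked up (optional right after adding it):
$ helm repo update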
  6. Install the chart. The Service Account was already created in step 3, so there is no need to create it again in this step (hence serviceAccount.create=false).
$ helm install aws-load-balancer-controller eks/aws-load-balancer-controller -n kube-system --set clusterName=ironman-2022 --set serviceAccount.create=false --set serviceAccount.name=aws-load-balancer-controller
NAME: aws-load-balancer-controller
LAST DEPLOYED: Tue Oct 11 14:11:56 2022
NAMESPACE: kube-system
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
AWS Load Balancer controller installed!
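Before creating any Ingress resources, you can verify the controller is up by checking its Deployment (a quick sanity check; the READY count depends on the chart's replica setting):
$ kubectl get deployment -n kube-system aws-load-balancer-controller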

Creating the test environment

  1. Using an Nginx server as the test workload, create nginx-demo-ingress.yaml containing the Deployment, Service, and Ingress resources.
$ cat ./nginx-demo-ingress.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: "demo-nginx"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: "demo-nginx-deployment"
  namespace: "demo-nginx"
spec:
  selector:
    matchLabels:
      app: "demo-nginx"
  replicas: 10
  template:
    metadata:
      labels:
        app: "demo-nginx"
        role: "backend"
    spec:
      containers:
      - image: nginx:latest
        imagePullPolicy: Always
        name: "demo-nginx"
        ports:
        - containerPort: 80
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
---
apiVersion: v1
kind: Service
metadata:
  name: "service-demo-nginx"
  namespace: "demo-nginx"
spec:
  ports:
    - port: 80
      targetPort: 80
      protocol: TCP
  selector:
    app: "demo-nginx"
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  namespace: "demo-nginx"
  name: "ingress-nginx"
  annotations:
    kubernetes.io/ingress.class: alb
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: ip
spec:
  rules:
    - http:
        paths:
          - path: "/"
            pathType: Prefix
            backend:
              service:
                name: service-demo-nginx
                port:
                  number: 80
  2. Deploy the YAML.
$ kubectl apply -f ./nginx-demo-ingress.yaml
namespace/demo-nginx created
deployment.apps/demo-nginx-deployment created
service/service-demo-nginx created
ingress.networking.k8s.io/ingress-nginx created
  3. Confirm that all the Nginx Pods are running normally.
$ kubectl -n demo-nginx get ing,svc,po -o wide
NAME                                       CLASS    HOSTS   ADDRESS                                                                    PORTS   AGE
ingress.networking.k8s.io/ingress-nginx   <none>   *       k8s-demongin-ingressn-843da381ab-1111111111.eu-west-1.elb.amazonaws.com   80      56m

NAME                         TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)   AGE   SELECTOR
service/service-demo-nginx   ClusterIP   10.100.247.243   <none>        80/TCP    56m   app=demo-nginx

NAME                                         READY   STATUS    RESTARTS   AGE   IP               NODE                                           NOMINATED NODE   READINESS GATES
pod/demo-nginx-deployment-75966cf8c6-2ntf4   1/1     Running   0          10m   192.168.56.52    ip-192-168-61-79.eu-west-1.compute.internal    <none>           <none>
pod/demo-nginx-deployment-75966cf8c6-4nzz7   1/1     Running   0          10m   192.168.91.189   ip-192-168-76-241.eu-west-1.compute.internal   <none>           <none>
pod/demo-nginx-deployment-75966cf8c6-4qgtj   1/1     Running   0          10m   192.168.74.144   ip-192-168-71-85.eu-west-1.compute.internal    <none>           <none>
pod/demo-nginx-deployment-75966cf8c6-5b6ws   1/1     Running   0          10m   192.168.25.57    ip-192-168-18-190.eu-west-1.compute.internal   <none>           <none>
pod/demo-nginx-deployment-75966cf8c6-6khx4   1/1     Running   0          10m   192.168.69.15    ip-192-168-71-85.eu-west-1.compute.internal    <none>           <none>
pod/demo-nginx-deployment-75966cf8c6-c76pr   1/1     Running   0          10m   192.168.40.177   ip-192-168-61-79.eu-west-1.compute.internal    <none>           <none>
pod/demo-nginx-deployment-75966cf8c6-dsztw   1/1     Running   0          10m   192.168.67.240   ip-192-168-76-241.eu-west-1.compute.internal   <none>           <none>
pod/demo-nginx-deployment-75966cf8c6-qlkj2   1/1     Running   0          10m   192.168.62.119   ip-192-168-55-245.eu-west-1.compute.internal   <none>           <none>
pod/demo-nginx-deployment-75966cf8c6-stt9c   1/1     Running   0          10m   192.168.29.171   ip-192-168-18-190.eu-west-1.compute.internal   <none>           <none>
pod/demo-nginx-deployment-75966cf8c6-zfzzf   1/1     Running   0          10m   192.168.49.142   ip-192-168-55-245.eu-west-1.compute.internal   <none>           <none>
  4. Verify that the ALB created by the Ingress responds with HTTP 200. Since the nginx:latest image is used, the latest version at the time of writing is 1.23.1.
$ curl -I k8s-demongin-ingressn-843da381ab-1111111111.eu-west-1.elb.amazonaws.com
HTTP/1.1 200 OK
Date: Tue, 11 Oct 2022 15:38:44 GMT
Content-Type: text/html
Content-Length: 615
Connection: keep-alive
Server: nginx/1.23.1
Last-Modified: Mon, 23 May 2022 22:59:19 GMT
ETag: "628c1fd7-267"
Accept-Ranges: bytes
  5. Inspect the ELB target group and the health of its targets.
$ aws elbv2 describe-target-groups --load-balancer-arn arn:aws:elasticloadbalancing:eu-west-1:111111111111:loadbalancer/app/k8s-demongin-ingressn-843da381ab/9a5746ba008507c3
{
    "TargetGroups": [
        {
            "TargetGroupArn": "arn:aws:elasticloadbalancing:eu-west-1:111111111111:targetgroup/k8s-demongin-serviced-610dd2305f/ffd6c2b5292a36c2",
            "TargetGroupName": "k8s-demongin-serviced-610dd2305f",
            "Protocol": "HTTP",
            "Port": 80,
            "VpcId": "vpc-0b150640c72b722db",
            "HealthCheckProtocol": "HTTP",
            "HealthCheckPort": "traffic-port",
            "HealthCheckEnabled": true,
            "HealthCheckIntervalSeconds": 15,
            "HealthCheckTimeoutSeconds": 5,
            "HealthyThresholdCount": 2,
            "UnhealthyThresholdCount": 2,
            "HealthCheckPath": "/",
            "Matcher": {
                "HttpCode": "200"
            },
            "LoadBalancerArns": [
                "arn:aws:elasticloadbalancing:eu-west-1:111111111111:loadbalancer/app/k8s-demongin-ingressn-843da381ab/9a5746ba008507c3"
            ],
            "TargetType": "ip",
            "ProtocolVersion": "HTTP1",
            "IpAddressType": "ipv4"
        }
    ]
}
$ aws elbv2 describe-target-health --target-group-arn arn:aws:elasticloadbalancing:eu-west-1:111111111111:targetgroup/k8s-demongin-serviced-610dd2305f/ffd6c2b5292a36c2 --query 'TargetHealthDescriptions[].[Target, TargetHealth]' --output table
-----------------------------------------------------------
|                   DescribeTargetHealth                  |
+-------------------+------------------+-------+----------+
|  AvailabilityZone |        Id        | Port  |  State   |
+-------------------+------------------+-------+----------+
|  eu-west-1c       |  192.168.40.177  |  80   | healthy  |
|  eu-west-1c       |  192.168.62.119  |  80   | healthy  |
|  eu-west-1a       |  192.168.25.57   |  80   | healthy  |
|  eu-west-1b       |  192.168.74.144  |  80   | healthy  |
|  eu-west-1b       |  192.168.69.15   |  80   | healthy  |
|  eu-west-1c       |  192.168.56.52   |  80   | healthy  |
|  eu-west-1a       |  192.168.29.171  |  80   | healthy  |
|  eu-west-1b       |  192.168.91.189  |  80   | healthy  |
|  eu-west-1c       |  192.168.49.142  |  80   | healthy  |
|  eu-west-1b       |  192.168.67.240  |  80   | healthy  |
+-------------------+------------------+-------+----------+
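While the rollout in the following steps is in progress, it can also be helpful to watch targets register and drain in real time. A minimal sketch, reusing the target group ARN from above:
$ watch -n 2 "aws elbv2 describe-target-health --target-group-arn arn:aws:elasticloadbalancing:eu-west-1:111111111111:targetgroup/k8s-demongin-serviced-610dd2305f/ffd6c2b5292a36c2 --query 'TargetHealthDescriptions[].[Target.Id, TargetHealth.State]' --output table"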

  6. Use the following curl command to request the ALB endpoint every 0.5 seconds.
$ counter=1; for i in {1..200}; do echo $counter; counter=$((counter+1)); sleep 0.5; curl -I k8s-demongin-ingressn-843da381ab-1111111111.eu-west-1.elb.amazonaws.com >> test-result.txt; done
  7. At the same time as step 6, in another terminal, update the image to Nginx 1.22.0.
$ kubectl -n demo-nginx set image deploy demo-nginx-deployment demo-nginx=nginx:1.22.0 && kubectl -n demo-nginx rollout status deploy demo-nginx-deployment
deployment.apps/demo-nginx-deployment image updated
Waiting for deployment "demo-nginx-deployment" rollout to finish: 5 out of 10 new replicas have been updated...
Waiting for deployment "demo-nginx-deployment" rollout to finish: 5 out of 10 new replicas have been updated...
Waiting for deployment "demo-nginx-deployment" rollout to finish: 5 out of 10 new replicas have been updated...
Waiting for deployment "demo-nginx-deployment" rollout to finish: 5 out of 10 new replicas have been updated...
Waiting for deployment "demo-nginx-deployment" rollout to finish: 5 out of 10 new replicas have been updated...
Waiting for deployment "demo-nginx-deployment" rollout to finish: 6 out of 10 new replicas have been updated...
Waiting for deployment "demo-nginx-deployment" rollout to finish: 6 out of 10 new replicas have been updated...
Waiting for deployment "demo-nginx-deployment" rollout to finish: 6 out of 10 new replicas have been updated...
Waiting for deployment "demo-nginx-deployment" rollout to finish: 6 out of 10 new replicas have been updated...
Waiting for deployment "demo-nginx-deployment" rollout to finish: 7 out of 10 new replicas have been updated...
Waiting for deployment "demo-nginx-deployment" rollout to finish: 7 out of 10 new replicas have been updated...
Waiting for deployment "demo-nginx-deployment" rollout to finish: 7 out of 10 new replicas have been updated...
Waiting for deployment "demo-nginx-deployment" rollout to finish: 7 out of 10 new replicas have been updated...
Waiting for deployment "demo-nginx-deployment" rollout to finish: 8 out of 10 new replicas have been updated...
Waiting for deployment "demo-nginx-deployment" rollout to finish: 8 out of 10 new replicas have been updated...
Waiting for deployment "demo-nginx-deployment" rollout to finish: 8 out of 10 new replicas have been updated...
Waiting for deployment "demo-nginx-deployment" rollout to finish: 8 out of 10 new replicas have been updated...
Waiting for deployment "demo-nginx-deployment" rollout to finish: 8 out of 10 new replicas have been updated...
Waiting for deployment "demo-nginx-deployment" rollout to finish: 9 out of 10 new replicas have been updated...
Waiting for deployment "demo-nginx-deployment" rollout to finish: 9 out of 10 new replicas have been updated...
Waiting for deployment "demo-nginx-deployment" rollout to finish: 9 out of 10 new replicas have been updated...
Waiting for deployment "demo-nginx-deployment" rollout to finish: 9 out of 10 new replicas have been updated...
Waiting for deployment "demo-nginx-deployment" rollout to finish: 9 out of 10 new replicas have been updated...
Waiting for deployment "demo-nginx-deployment" rollout to finish: 3 old replicas are pending termination...
Waiting for deployment "demo-nginx-deployment" rollout to finish: 3 old replicas are pending termination...
Waiting for deployment "demo-nginx-deployment" rollout to finish: 3 old replicas are pending termination...
Waiting for deployment "demo-nginx-deployment" rollout to finish: 2 old replicas are pending termination...
Waiting for deployment "demo-nginx-deployment" rollout to finish: 2 old replicas are pending termination...
Waiting for deployment "demo-nginx-deployment" rollout to finish: 2 old replicas are pending termination...
Waiting for deployment "demo-nginx-deployment" rollout to finish: 1 old replicas are pending termination...
Waiting for deployment "demo-nginx-deployment" rollout to finish: 1 old replicas are pending termination...
Waiting for deployment "demo-nginx-deployment" rollout to finish: 1 old replicas are pending termination...
Waiting for deployment "demo-nginx-deployment" rollout to finish: 8 of 10 updated replicas are available...
Waiting for deployment "demo-nginx-deployment" rollout to finish: 9 of 10 updated replicas are available...
deployment "demo-nginx-deployment" successfully rolled out
  8. In the result, ALB HTTP 502 error responses can be observed.
$ grep -A2 "502" ./test-result.txt
HTTP/1.1 502 Bad Gateway
Server: awselb/2.0
Date: Tue, 11 Oct 2022 15:39:56 GMT
--
HTTP/1.1 502 Bad Gateway
Server: awselb/2.0
Date: Tue, 11 Oct 2022 15:39:59 GMT
--
HTTP/1.1 502 Bad Gateway
Server: awselb/2.0
Date: Tue, 11 Oct 2022 15:40:03 GMT
--
HTTP/1.1 502 Bad Gateway
Server: awselb/2.0
Date: Tue, 11 Oct 2022 15:39:06 GMT
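To quantify the failure rate, the 502 responses can be counted against all status lines captured in test-result.txt (a rough tally; the patterns match the curl -I output shown above):
$ grep -c "HTTP/1.1 502" ./test-result.txt
$ grep -c "^HTTP/1.1" ./test-result.txt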

Summary

With the setup above, we have reproduced the problem and confirmed that some requests fail with HTTP 502 during an EKS rolling update. Tomorrow we will explore why the ALB returns these HTTP 502 errors.

References

  1. IAM roles for service accounts - https://docs.aws.amazon.com/eks/latest/userguide/iam-roles-for-service-accounts.html
  2. Helm | The package manager for Kubernetes - https://helm.sh/
  3. Installing Helm - https://helm.sh/docs/intro/install/