How to Optimize EC2 Costs with Auto-Scaling in Kubernetes Using Karpenter on EKS

TL;DR — Optimizing EC2 costs in Kubernetes with Karpenter on AWS EKS means letting the cluster provision right-sized compute automatically as demand changes. The key steps are setting up Karpenter, configuring a NodeClass and a NodePool, and testing the setup with a sample deployment to confirm efficient scaling and cost management. This approach keeps applications responsive while trimming cloud infrastructure expenses.

Introduction

In today's digital landscape, where efficiency is paramount, technology teams are tasked with optimizing cloud spend without sacrificing the availability, resilience, and quality of their applications. This post explores how to reduce costs by optimizing autoscaling in Kubernetes clusters, focusing on Karpenter on AWS EKS. While this demonstration is specific to AWS, Karpenter also supports Azure, and community providers exist for other clouds. Let's walk through the key resources used in this guide.

Below are some definitions of the resources that will be used:

  1. EKS:

    Elastic Kubernetes Service is a managed Kubernetes service provided by AWS that makes it easier for you to run Kubernetes on AWS without needing to install, operate, and maintain your own Kubernetes control plane or nodes. It handles much of the complexity of managing a Kubernetes cluster by automating tasks such as patching, node provisioning, and updates.

  2. Karpenter:

    Karpenter is an open-source, flexible, high-performance Kubernetes cluster autoscaler built by AWS. It aims to optimize the provisioning and scaling of compute resources by quickly launching right-sized instances in response to application needs and resource utilization. Karpenter is designed to improve upon the limitations of the Kubernetes Cluster Autoscaler, offering more responsive scaling decisions and better integration with cloud provider capabilities.

  3. EC2:

    Elastic Compute Cloud is an AWS service that provides resizable compute capacity in the cloud, designed to make web-scale computing easier for developers. EC2 offers several types of instances optimized for different tasks, and it includes options for cost savings such as:

    • On-Demand Instances: Pay for compute capacity by the hour or second (minimum of 60 seconds) with no long-term commitments. This provides flexibility for applications with short-term, spiky, or unpredictable workloads that cannot be interrupted.

    • Reserved Instances: Provide a significant discount (up to 75%) compared to On-Demand pricing and are best for applications with steady state or predictable usage.

    • Spot Instances: Allow you to request spare Amazon EC2 computing capacity for up to 90% off the On-Demand price. Suitable for workloads with flexible start and end times, applications that are only feasible at very low compute prices, and urgent needs for large amounts of additional capacity. (A quick way to check current Spot prices follows this list.)

    • Savings Plans: Offer significant savings over On-Demand pricing, like Reserved Instances, but with more flexibility in how you use your compute capacity.

    • Dedicated Hosts: Physical servers with EC2 instance capacity fully dedicated to your use. They can help you reduce costs by allowing you to use your existing server-bound software licenses.

  4. Helm:

    Helm is a package manager for Kubernetes that allows developers to package, configure, and deploy applications and services onto Kubernetes clusters. It uses packages called charts, which are collections of files that describe a related set of Kubernetes resources. Helm helps in managing Kubernetes applications through Helm Charts which simplify the deployment and management of applications on Kubernetes.
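
For example, before pinning a specific instance family in a NodePool (as we do later in this guide), it's worth glancing at current Spot prices. A small sketch, assuming the AWS CLI is configured; r6a.large is purely an illustration:

    # List the most recent Spot price per availability zone for one instance
    # type, then compare against the On-Demand price for the same type/region.
    aws ec2 describe-spot-price-history \
      --instance-types r6a.large \
      --product-descriptions "Linux/UNIX" \
      --start-time "$(date -u +%Y-%m-%dT%H:%M:%S)" \
      --query 'SpotPriceHistory[*].[AvailabilityZone,SpotPrice]' \
      --output table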

Karpenter installation

To install Karpenter on AWS, we need an EKS cluster, an IAM role for Karpenter's serviceAccount, another IAM role for the nodes Karpenter provisions, an SQS queue for interruption handling, and Helm installed.

  1. Creating an SQS queue for Karpenter's interruption handling

     aws sqs create-queue --queue-name karpenter-interruption-queue --tags karpenter.sh/discovery=${EKS_CLUSTER_NAME}
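
     Note: Karpenter consumes interruption events (Spot interruption warnings, rebalance recommendations, and so on) from this queue via EventBridge rules, which this guide doesn't cover in full. A hedged sketch of one such rule follows; the complete setup has a few more rules plus a queue policy allowing EventBridge to send messages (see the Karpenter docs), and ${QUEUE_ARN} is a placeholder for your queue's ARN:

         # Forward Spot interruption warnings to the Karpenter queue.
         aws events put-rule --name karpenter-spot-interruption \
           --event-pattern '{"source":["aws.ec2"],"detail-type":["EC2 Spot Instance Interruption Warning"]}'
         aws events put-targets --rule karpenter-spot-interruption \
           --targets "Id"="KarpenterQueueTarget","Arn"="${QUEUE_ARN}"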
    
  2. Creating a role in AWS for the nodes Karpenter's NodePool will provision

    • We create our trust policy, which lets EC2 instances assume the role. We will save it in a karpenter-nodePool-trust-policy.json file:

              {
                  "Version": "2012-10-17",
                  "Statement": [
                      {
                          "Effect": "Allow",
                          "Principal": {
                              "Service": "ec2.amazonaws.com"
                          },
                          "Action": "sts:AssumeRole"
                      }
                  ]
              }
      
    • We create our role with the previous trust policy by replacing EKS_CLUSTER_NAME, and attach the managed policies worker nodes require:

        aws iam create-role --role-name KarpenterNodeRole-${EKS_CLUSTER_NAME} --assume-role-policy-document file://karpenter-nodePool-trust-policy.json --tags Key=karpenter.sh/discovery,Value=${EKS_CLUSTER_NAME}
      
        aws iam attach-role-policy --role-name KarpenterNodeRole-${EKS_CLUSTER_NAME} --policy-arn arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly
        aws iam attach-role-policy --role-name KarpenterNodeRole-${EKS_CLUSTER_NAME} --policy-arn arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy
        aws iam attach-role-policy --role-name KarpenterNodeRole-${EKS_CLUSTER_NAME} --policy-arn arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy
        aws iam attach-role-policy --role-name KarpenterNodeRole-${EKS_CLUSTER_NAME} --policy-arn arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
      
  3. Creating an IAM role for the serviceAccount that Karpenter will use:

    • Getting the EKS OIDC issuer URL:

        aws eks describe-cluster --name ${EKS_CLUSTER_NAME} --query "cluster.identity.oidc.issuer" --output text
      
    • Then we replace the OIDC issuer (the XXXX placeholders below, which include the AWS region) and AWS_ACCOUNT_ID in our trust policy. We will save it in a file called karpenter-trust-policy.json:

          {
              "Version": "2012-10-17",
              "Statement": [
                  {
                      "Effect": "Allow",
                      "Principal": {
                          "Federated": "arn:aws:iam::${AWS_ACCOUNT_ID}:oidc-provider/oidc.eks.XXXX.amazonaws.com/id/XXXXXX"
                      },
                      "Action": "sts:AssumeRoleWithWebIdentity",
                      "Condition": {
                          "StringEquals": {
                              "oidc.eks.XXXX.amazonaws.com/id/XXXXXX:aud": "sts.amazonaws.com",
                              "oidc.eks.XXXX.amazonaws.com/id/XXXXXX:sub": "system:serviceaccount:karpenter:karpenter"
                          }
                      }
                  }
              ]
          }
      
    • Now we create the Karpenter controller policy, replacing AWS_REGION, EKS_CLUSTER_NAME, and AWS_ACCOUNT_ID, and save it in a file called karpenter-policy.json. A simplified version is shown below; see the Karpenter documentation for the complete controller policy:
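
          {
              "Version": "2012-10-17",
              "Statement": [
                  {
                      "Sid": "KarpenterCompute",
                      "Effect": "Allow",
                      "Action": [
                          "ec2:CreateFleet",
                          "ec2:CreateLaunchTemplate",
                          "ec2:CreateTags",
                          "ec2:DeleteLaunchTemplate",
                          "ec2:RunInstances",
                          "ec2:TerminateInstances",
                          "ec2:Describe*",
                          "pricing:GetProducts",
                          "ssm:GetParameter"
                      ],
                      "Resource": "*"
                  },
                  {
                      "Sid": "KarpenterInterruptionQueue",
                      "Effect": "Allow",
                      "Action": [
                          "sqs:DeleteMessage",
                          "sqs:GetQueueAttributes",
                          "sqs:GetQueueUrl",
                          "sqs:ReceiveMessage"
                      ],
                      "Resource": "arn:aws:sqs:${AWS_REGION}:${AWS_ACCOUNT_ID}:karpenter-interruption-queue"
                  },
                  {
                      "Sid": "InstanceProfileManagement",
                      "Effect": "Allow",
                      "Action": [
                          "iam:AddRoleToInstanceProfile",
                          "iam:CreateInstanceProfile",
                          "iam:DeleteInstanceProfile",
                          "iam:GetInstanceProfile",
                          "iam:RemoveRoleFromInstanceProfile",
                          "iam:TagInstanceProfile"
                      ],
                      "Resource": "*"
                  },
                  {
                      "Sid": "PassNodeRole",
                      "Effect": "Allow",
                      "Action": "iam:PassRole",
                      "Resource": "arn:aws:iam::${AWS_ACCOUNT_ID}:role/KarpenterNodeRole-${EKS_CLUSTER_NAME}"
                  },
                  {
                      "Sid": "EKSClusterLookup",
                      "Effect": "Allow",
                      "Action": "eks:DescribeCluster",
                      "Resource": "arn:aws:eks:${AWS_REGION}:${AWS_ACCOUNT_ID}:cluster/${EKS_CLUSTER_NAME}"
                  }
              ]
          }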

    • We create the role with the trust policy, attach the controller policy, and replace EKS_CLUSTER_NAME:

         aws iam create-role --role-name karpenterSARole --assume-role-policy-document file://karpenter-trust-policy.json --tags Key=karpenter.sh/discovery,Value=${EKS_CLUSTER_NAME}
      
         aws iam put-role-policy --role-name karpenterSARole --policy-name KarpenterSAPolicy --policy-document file://karpenter-policy.json
      
    • Installing Karpenter using Helm. Note that the serviceAccount annotation must be the role's full ARN, and that in recent chart versions (including 0.36.0) the settings live under settings.* rather than settings.aws.*:

         export CLUSTER_ENDPOINT="$(aws eks describe-cluster --name ${EKS_CLUSTER_NAME} --query "cluster.endpoint" --output text)"

         helm upgrade --install --namespace karpenter --create-namespace \
           karpenter oci://public.ecr.aws/karpenter/karpenter \
           --version 0.36.0 \
           --set serviceAccount.annotations."eks\.amazonaws\.com/role-arn"=arn:aws:iam::${AWS_ACCOUNT_ID}:role/karpenterSARole \
           --set settings.clusterName=${EKS_CLUSTER_NAME} \
           --set settings.clusterEndpoint=${CLUSTER_ENDPOINT} \
           --set settings.interruptionQueue=karpenter-interruption-queue
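
    • Verifying the installation. If everything went through, the controller pods should be running in the karpenter namespace (a quick sanity check, using the chart's standard labels):

         kubectl get pods -n karpenter
         kubectl logs -n karpenter -l app.kubernetes.io/name=karpenter -c controller --tail=20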

Creating NodeClass and NodePool

  1. Tagging EKS resources for NodeClass discovery: the EKS cluster, VPC, private subnets, and EKS security groups

        aws eks tag-resource --resource-arn ${EKS_ARN} --tags karpenter.sh/discovery=${EKS_CLUSTER_NAME}
    
        aws ec2 create-tags \
            --resources ${VPC_ID} \
            ${PRIVATE_SUBNET1_ID} ${PRIVATE_SUBNET2_ID} ${PRIVATE_SUBNET3_ID} \
            ${EKS_SG1_ID} ${EKS_SG2_ID} \
            --tags Key=karpenter.sh/discovery,Value=${EKS_CLUSTER_NAME}
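
    A quick way to confirm the tags landed (checking subnets here; the same tag filter works for security groups via describe-security-groups):

        aws ec2 describe-subnets --filters "Name=tag:karpenter.sh/discovery,Values=${EKS_CLUSTER_NAME}" --query "Subnets[*].SubnetId"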
    
  2. Creating the NodeClass. This resource defines the AMI family, in our case AL2 (Amazon Linux 2), as well as the role we created earlier, which the EC2 instances Karpenter launches will assume. It also selects subnets and security groups by tag; for us that is karpenter.sh/discovery: "${EKS_CLUSTER_NAME}". The status section is populated by Karpenter after the resource is applied and lists the private subnets where instances will be deployed, along with the availability zone each belongs to. We will save it as nodeclass.yaml.

     apiVersion: karpenter.k8s.aws/v1beta1
     kind: EC2NodeClass
     metadata:
       name: my-nodeclass # Kubernetes object names must be lowercase
     spec:
       amiFamily: AL2 # Amazon Linux 2
       role: "KarpenterNodeRole-${EKS_CLUSTER_NAME}" # replace with your cluster name
       subnetSelectorTerms:
         - tags:
             karpenter.sh/discovery: "${EKS_CLUSTER_NAME}" # replace with your cluster name
       securityGroupSelectorTerms:
         - tags:
             karpenter.sh/discovery: "${EKS_CLUSTER_NAME}" # replace with your cluster name
     status: # populated by Karpenter after apply; shown here for illustration
       subnets:
       - id: subnet-XXXXXX
         zone: ${AZ1}
       - id: subnet-XXXXXX
         zone: ${AZ2}
       - id: subnet-XXXXXX
         zone: ${AZ3}
    

    Run:

     kubectl apply -f nodeclass.yaml # EC2NodeClass is cluster-scoped, so no namespace flag is needed
    
  3. Creating the NodePool. This resource configures the requirements for our EC2 instances: you can constrain the capacity type, instance category, family, size, and zone, and Karpenter will orchestrate whatever fits based on the cluster load. We will save it as nodepool.yaml (a quick verification follows the apply step below).

    In this case, we will use EC2 Spot instances, since they can save up to 90% compared to On-Demand pricing.

     apiVersion: karpenter.sh/v1beta1
     kind: NodePool
     metadata:
       name: default
     spec:
       template:
         metadata:
           labels:
             app: my-app
         spec:
           requirements:
             - key: kubernetes.io/arch
               operator: In
               values: ["amd64"]
             - key: kubernetes.io/os
               operator: In
               values: ["linux"]
             - key: karpenter.sh/capacity-type
               operator: In
               values: ["spot"] # ["spot", "on-demand"]
             - key: karpenter.k8s.aws/instance-category
               operator: In
               values: ["r"] # ["c", "m", "r"]
             - key: karpenter.k8s.aws/instance-family
               operator: In
               values: ["r6a"] # ["c7a", "m5", "r7a"]
             - key: node.kubernetes.io/instance-type
               operator: In
               values: ["r6a.large", "r6a.xlarge"] # ["c7a.large", "m5.xlarge", "r7a.large"]
             - key: "topology.kubernetes.io/zone"
               operator: In
               values: ["xx-xxxx-xx", "xx-xxxx-yy", "xx-xxxx-zz"] # ["us-east-1a", "us-east-1b", "us-east-1c"]
           nodeClassRef:
              name: my-nodeclass # must match the EC2NodeClass metadata.name
       disruption:
         consolidationPolicy: WhenUnderutilized
         # consolidationPolicy: WhenEmpty
         # consolidateAfter: 30s
         expireAfter: 720h # 30 * 24h = 720h
    

    Run:

     kubectl apply -f nodepool.yaml # NodePool is cluster-scoped, so no namespace flag is needed
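
    To confirm both resources were accepted and that Karpenter discovered your subnets and security groups (the resource names assume the manifests above):

     kubectl get ec2nodeclasses,nodepools
     kubectl describe ec2nodeclass my-nodeclass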
    

Testing Karpenter

  1. Running a deployment

     cat <<EOF > test.yaml
     apiVersion: apps/v1
     kind: Deployment
     metadata:
       name: test
     spec:
       replicas: 0
       selector:
         matchLabels:
           app: test
       template:
         metadata:
           labels:
             app: test
         spec:
           containers:
             - name: test
               image: nginx
               resources:
                 requests:
                   cpu: 1
                   memory: 1.5Gi
     EOF
     kubectl apply -f test.yaml
    
  2. Now we scale the deployment to watch Karpenter do its job: the pending pods' resource requests exceed the cluster's current capacity, so Karpenter will launch new instances to meet the demand.

     kubectl scale deploy test --replicas=8
    
  3. Watching the Karpenter logs to see it in action:

     kubectl logs -f -n karpenter -l app.kubernetes.io/name=karpenter -c controller
    
  4. After seeing how it scaled up, we can delete the deployment so Karpenter can scale the nodes back down (see the node-level check below).

     kubectl delete deployment test
    
     kubectl logs -f -n karpenter -l app.kubernetes.io/name=karpenter -c controller
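
     To see the effect at the node level, you can also list Karpenter's NodeClaims and the node labels it sets (these are standard Karpenter labels):

      kubectl get nodeclaims
      kubectl get nodes -L karpenter.sh/capacity-type,node.kubernetes.io/instance-type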
    

Conclusion

Implementing Karpenter on AWS EKS is a robust way to optimize EC2 costs through efficient auto-scaling. By leveraging EC2 purchasing options like Spot Instances and integrating tightly with Kubernetes, Karpenter makes resource allocation more responsive and more cost-effective. This setup not only reduces operational costs but also keeps applications running smoothly by adjusting dynamically to workload demands. As cloud technologies evolve, tools like Karpenter represent a significant advance in managing cloud resources, making them indispensable for teams looking to optimize their cloud infrastructure.
