Kubernetes Cost-Saving: Automated Management Tips

In a previous post, we discussed installing Karpenter to manage cluster autoscaling efficiently. Today, we'll explore how to cut node costs further by automating the shutdown and startup of workloads in Kubernetes clusters, specifically in non-production environments. Running nodes for only 12 hours a day can cut their compute cost by up to 50%, and by more if the environments also stay off over the weekend.

Introduction

It's very common to have at least one or two non-production environments (DEV, QA, Staging, UAT, etc.) to validate the version that will be deployed to production and to keep delivery (Continuous Integration/Continuous Deployment) from stalling. In most cases, however, these environments do not need to be available 24x7. As our clusters grow in nodes, our costs increase, and we need a strategy to reduce them without affecting the daily productivity of our teams.

One strategy is to use Shutdown/Startup behavior. Compute services are billed by usage, so we can reduce the cost by turning off nodes when we don't need them. This also contributes to the well-being of the team, since the environments are only available during working hours.

Creating prerequisite resources

First, we need a ServiceAccount for our CronJobs to run as. Its permissions are defined in a ClusterRole, which we attach to it with a ClusterRoleBinding:

 apiVersion: rbac.authorization.k8s.io/v1
 kind: ClusterRole
 metadata:
   name: cluster-manager-role
 rules:
 - apiGroups: ["apps"]
   resources: ["deployments", "deployments/scale"]
   verbs: ["get", "list", "update", "patch"]
 ---
 apiVersion: rbac.authorization.k8s.io/v1
 kind: ClusterRoleBinding
 metadata:
   name: cluster-manager-binding
 roleRef:
   apiGroup: rbac.authorization.k8s.io
   kind: ClusterRole
   name: cluster-manager-role
 subjects:
 - kind: ServiceAccount
   name: cluster-manager
   namespace: kube-system
 ---
 apiVersion: v1
 kind: ServiceAccount
 metadata:
   name: cluster-manager
   namespace: kube-system
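
The ClusterRoleBinding above lets the ServiceAccount scale deployments in every namespace. If that is broader than you want, one option is to keep the ClusterRole and replace the ClusterRoleBinding with one RoleBinding per managed namespace. A RoleBinding may reference a ClusterRole, and it grants the rules only within the RoleBinding's own namespace. A minimal sketch, assuming namespace1 is one of the namespaces the CronJobs will manage (the binding name is illustrative):

```yaml
# Grants the rules of cluster-manager-role inside namespace1 only.
# Repeat once per namespace that the CronJobs should be able to scale.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: cluster-manager-binding   # illustrative name
  namespace: namespace1           # assumption: a namespace listed in the CronJob loop
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-manager-role
subjects:
- kind: ServiceAccount
  name: cluster-manager
  namespace: kube-system
```

With this variant, a namespace added to the CronJob loop without a matching RoleBinding will fail to scale with a permission error, which can be a useful safety net.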

Shutdown Behavior

For this, we will use a CronJob that schedules when to scale down the deployments and, consequently, the nodes. The Job's container command scales every deployment in the namespaces we list to zero (--replicas=0). Scaling the workloads down, rather than terminating nodes directly, keeps the shutdown clean and gradual: the cluster autoscaler (Karpenter or Cluster Autoscaler) detects that there is no longer load on the cluster and removes the unused nodes.

 apiVersion: batch/v1
 kind: CronJob
 metadata:
   name: shutdown-manager
   namespace: kube-system
 spec:
   schedule: "0 01 * * *" # Every day at 8 PM America/Lima (01:00 UTC)
   successfulJobsHistoryLimit: 1
   failedJobsHistoryLimit: 1
   jobTemplate:
     spec:
       template:
         spec:
           serviceAccountName: cluster-manager
           containers:
           - name: shutdown-manager
             image: fluxcd/flux-cli:v2.1.2
             command: ["/bin/sh", "-c"]
             args:
             - |
               echo "Start: $(date) - Environment: $environment"
               for i in namespace1 namespace2 namespace3
               do
                 kubectl scale deployment -n $i --all --replicas=0
               done
             env:
             - name: TZ
               value: America/Lima
             - name: environment
               value: dev
           restartPolicy: OnFailure
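
A note on the schedule: unless told otherwise, a CronJob schedule is interpreted in the time zone of the kube-controller-manager, which is commonly UTC. That is why 8 PM in Lima is written as hour 01 above. The TZ environment variable only influences what date prints inside the container, not when the job fires. As a minimal sketch, assuming a cluster on Kubernetes 1.27 or later (where the CronJob spec.timeZone field is stable), the same job could express the schedule directly in local time:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: shutdown-manager
  namespace: kube-system
spec:
  timeZone: "America/Lima"   # assumption: cluster runs Kubernetes 1.27+
  schedule: "0 20 * * *"     # every day at 8 PM, Lima local time
  successfulJobsHistoryLimit: 1
  failedJobsHistoryLimit: 1
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: cluster-manager
          containers:
          - name: shutdown-manager
            image: fluxcd/flux-cli:v2.1.2
            command: ["/bin/sh", "-c"]
            args:
            - |
              echo "Start: $(date) - Environment: $environment"
              # Scale every deployment in each listed namespace to zero
              for i in namespace1 namespace2 namespace3
              do
                kubectl scale deployment -n "$i" --all --replicas=0
              done
            env:
            - name: TZ
              value: America/Lima
            - name: environment
              value: dev
          restartPolicy: OnFailure
```

With timeZone set, the schedule no longer has to be converted to UTC by hand, which also helps in time zones that observe daylight saving time.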

Startup Behavior

For this, we will use a CronJob that schedules when to scale the deployments back up and, with them, the nodes. The Job's container command scales every deployment in the namespaces we list to one replica (--replicas=1). Starting from the workloads keeps the startup clean and gradual: the cluster autoscaler detects the pending pods and creates new nodes for them.

If some deployments need more than one replica, it's best to attach a HorizontalPodAutoscaler to them so that the replica count can be raised when needed.

 apiVersion: batch/v1
 kind: CronJob
 metadata:
   name: startup-manager
   namespace: kube-system
 spec:
   schedule: "0 13 * * 1-5" # Monday to Friday at 8 AM America/Lima (13:00 UTC)
   successfulJobsHistoryLimit: 1
   failedJobsHistoryLimit: 1
   jobTemplate:
     spec:
       template:
         spec:
           serviceAccountName: cluster-manager
           containers:
           - name: startup-manager
             image: fluxcd/flux-cli:v2.1.2
             command: ["/bin/sh", "-c"]
             args:
             - |
               echo "Start: $(date) - Environment: $environment"
               for i in namespace1 namespace2 namespace3
               do
                 kubectl scale deployment -n $i --all --replicas=1
               done
             env:
             - name: TZ
               value: America/Lima
             - name: environment
               value: dev
           restartPolicy: OnFailure
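
For a deployment that needs more than one replica, as mentioned above, a HorizontalPodAutoscaler can restore the right count after startup. While a deployment sits at zero replicas the autoscaler leaves it alone. Once the startup job sets it to 1, the autoscaler raises it to at least minReplicas. A minimal sketch, assuming a hypothetical deployment named example-app in namespace1 and a metrics source such as metrics-server for CPU utilization (the replica bounds and the 70% target are illustrative values):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: example-app            # hypothetical name
  namespace: namespace1
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-app          # hypothetical deployment to autoscale
  minReplicas: 2               # floor restored after the startup job runs
  maxReplicas: 5
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70 # illustrative target
```

This keeps the startup CronJob generic: it only has to wake each deployment up, and per-deployment replica counts live next to the workloads that need them.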

Conclusion

Automating shutdown and startup in non-production Kubernetes environments is an effective cost-saving strategy. By using CronJobs to scale deployments down outside of working hours and back up before the workday begins, organizations can reduce their compute costs without impacting daily productivity. The approach also cuts unnecessary computational waste, which makes the operation more sustainable as well as cheaper.
