Cost-Saving Strategies for Kubernetes: Managing Automated Shutdown and Startup in Non-Production Environments

In a previous post, we discussed installing Karpenter to manage cluster auto-scaling efficiently. You can find it here. Today, we'll explore how to reduce node costs further by implementing automated shutdown and startup behavior for Kubernetes clusters, specifically in non-production environments. By adopting this strategy, you could cut costs by up to 50% by running nodes for only 12 hours a day.

  1. Introduction

    It's very common to have at least one or two non-production environments (DEV, QA, Staging, UAT, etc.) to validate the version that will be deployed to production and to keep delivery pipelines (Continuous Integration/Continuous Deployment) running without interruption. However, in most cases these environments do not need to be available 24x7. As our clusters grow in nodes, costs increase, and we need a strategy to reduce them without affecting the daily productivity of our teams.

    One strategy is a shutdown/startup schedule. The cost of compute services is based on usage, so we can reduce it by turning off the nodes when we don't need them. This also contributes to the well-being of the team, since the environments are only available during working hours.
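
    As a rough calculation: running nodes 12 hours a day instead of 24 halves the node-hours (12/24 = 50%). If the environments also stay off over the weekend, as with the schedules shown below, they run roughly 60 of the 168 hours in a week, which is about a 64% reduction for the node groups that back them.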

  2. Creating the prerequisite resources

    First, we need a ServiceAccount with the required permissions, which are defined in a ClusterRole and attached to it with a ClusterRoleBinding. This ServiceAccount will be used by our CronJobs:

     apiVersion: rbac.authorization.k8s.io/v1
     kind: ClusterRole
     metadata:
       name: cluster-manager-role
     rules:
     - apiGroups: ["apps"]
       resources: ["deployments", "deployments/scale"]
       verbs: ["get", "list", "update", "patch"]
     ---
     apiVersion: rbac.authorization.k8s.io/v1
     kind: ClusterRoleBinding
     metadata:
       name: cluster-manager-binding
     roleRef:
       apiGroup: rbac.authorization.k8s.io
       kind: ClusterRole
       name: cluster-manager-role
     subjects:
     - kind: ServiceAccount
       name: cluster-manager
       namespace: kube-system
     ---
     apiVersion: v1
     kind: ServiceAccount
     metadata:
       name: cluster-manager
       namespace: kube-system
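
    Once these manifests are applied, we can verify that the binding grants what the CronJobs will need by impersonating the ServiceAccount with kubectl auth can-i (namespace1 is just a placeholder for one of your application namespaces):

     # Should print "yes" if the ClusterRole and ClusterRoleBinding are correct
     kubectl auth can-i patch deployments/scale \
       --as=system:serviceaccount:kube-system:cluster-manager \
       -n namespace1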
    
  3. Shutdown Behavior

    For this, we will use a CronJob that lets us schedule when to scale down the deployments and, consequently, the nodes. As the entrypoint of the Job's container we define a command that scales every deployment in the namespaces we specify to zero (--replicas=0). This keeps the scale-down clean and gradual, and lets the cluster autoscaler detect that there is no longer any load on the cluster and remove the unused nodes.

     apiVersion: batch/v1
     kind: CronJob
     metadata:
       name: shutdown-manager
       namespace: kube-system
     spec:
       schedule: "0 01 * * *" # Everyday 8 PM
       successfulJobsHistoryLimit: 1
       failedJobsHistoryLimit: 1
       jobTemplate:
         spec:
           template:
             spec:
               serviceAccountName: cluster-manager
               containers:
               - name: shutdown-manager
                 image: fluxcd/flux-cli:v2.1.2
                 command: ["/bin/sh", "-c"]
                 args:
                 - |
                   echo "Start: $(date) - Environment: $environment"
                   for i in namespace1 namespace2 namespace3
                   do
                     kubectl scale deployment -n $i --all --replicas=0
                   done
                 env:
                 - name: TZ
                   value: America/Lima
                 - name: environment
                   value: dev
               restartPolicy: OnFailure
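
    Before relying on the schedule, it's worth triggering the job manually to confirm the permissions and the scaling loop work as expected (the job name suffix here is arbitrary):

     # Run the shutdown CronJob once, on demand
     kubectl create job --from=cronjob/shutdown-manager shutdown-manager-manual -n kube-system
     # Follow the output of the job's pod
     kubectl logs -n kube-system job/shutdown-manager-manual -f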
    
  4. Startup Behavior

    For this, we will use a CronJob that lets us schedule when to scale the deployments back up and, consequently, the nodes. As the entrypoint of the Job's container we define a command that scales every deployment in the namespaces we specify to one (--replicas=1). This keeps the startup clean and gradual, and lets the cluster autoscaler detect that there is load on the cluster again and provision new nodes.

    If some deployments need more than one replica, it's best to associate a HorizontalPodAutoscaler (HPA) with them so the replica count can grow again when needed (see the sketch after the manifest below).

     apiVersion: batch/v1
     kind: CronJob
     metadata:
       name: startup-manager
       namespace: kube-system
     spec:
       schedule: "0 13 * * 1-5" # Monday to Friday - 8 AM
       successfulJobsHistoryLimit: 1
       failedJobsHistoryLimit: 1
       jobTemplate:
         spec:
           template:
             spec:
               serviceAccountName: cluster-manager
               containers:
               - name: startup-manager
                 image: fluxcd/flux-cli:v2.1.2
                 command: ["/bin/sh", "-c"]
                 args:
                 - |
                   echo "Start: $(date) - Environment: $environment"
                   for i in namespace1 namespace2 namespace3
                   do
                     kubectl scale deployment -n $i --all --replicas=1
                   done
                 env:
                 - name: TZ
                   value: America/Lima
                 - name: environment
                   value: dev
               restartPolicy: OnFailure
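
    As a minimal sketch of the HPA mentioned above, the following HorizontalPodAutoscaler targets a hypothetical deployment my-app in namespace1 and scales it on CPU utilization. Once the startup CronJob brings the deployment back to one replica, the HPA takes over and raises the replica count as load requires:

     apiVersion: autoscaling/v2
     kind: HorizontalPodAutoscaler
     metadata:
       name: my-app            # hypothetical name
       namespace: namespace1   # placeholder namespace
     spec:
       scaleTargetRef:
         apiVersion: apps/v1
         kind: Deployment
         name: my-app
       minReplicas: 1
       maxReplicas: 5
       metrics:
       - type: Resource
         resource:
           name: cpu
           target:
             type: Utilization
             averageUtilization: 70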
    
  5. Conclusion

    In conclusion, implementing automated shutdown and startup procedures in non-production Kubernetes environments is a highly effective cost-saving strategy. By using CronJobs to scale deployments down outside working hours, organizations can significantly reduce operational costs without impacting daily productivity. The approach optimizes resource usage and also makes operations more sustainable by cutting unnecessary computational waste, making it an attractive option for teams looking to get the most out of their cloud infrastructure.
