Cost-Saving Strategies for Kubernetes: Managing Automated Shutdown and Startup in Non-Production Environments
In a previous post, we discussed installing Karpenter to manage cluster auto-scaling efficiently; you can find it here. Today, we'll explore how to further reduce node costs by implementing automated shutdown and startup behavior for Kubernetes clusters, specifically in non-production environments. By adopting this strategy and operating nodes for only 12 hours daily, you could potentially cut compute costs by up to 50%.
Introduction
It's very common to have at least one or two non-production environments (DEV, QA, Staging, UAT, etc.) to validate the version that will be deployed to production and to keep delivery pipelines running (Continuous Integration/Continuous Deployment). In most cases, however, these environments do not need to be available 24x7. As our clusters grow in node count, costs increase, and we need a strategy to reduce them without affecting the daily productivity of our teams.
One strategy is scheduled shutdown/startup behavior. Since the cost of compute services is based on usage, we can reduce it by turning off nodes when we don't need them. This also contributes to the well-being of the team, since the environments are only available during working hours.
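As a rough back-of-the-envelope check on the savings, here is a small shell calculation. All the numbers (5 nodes at $0.10/hour, weekday-only operation) are hypothetical; with weekends included in the shutdown, the savings exceed the 50% figure for a plain 12-hour day.

```shell
# Hypothetical example: 5 nodes at $0.10/hour each.
# Always-on: 24 h/day, 7 days/week. Scheduled: 12 h/day, 5 days/week.
nodes=5
cents_per_hour=10                     # $0.10 expressed in cents to keep integer math
always_on_hours=$((24 * 7))           # 168 h/week
scheduled_hours=$((12 * 5))           # 60 h/week
full_cost=$((nodes * cents_per_hour * always_on_hours))
reduced_cost=$((nodes * cents_per_hour * scheduled_hours))
echo "Weekly cost always-on: \$$((full_cost / 100))"
echo "Weekly cost scheduled: \$$((reduced_cost / 100))"
echo "Savings: $(( (full_cost - reduced_cost) * 100 / full_cost ))%"   # prints: Savings: 64%
```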
Creating prerequisite resources
First, we will need a `ServiceAccount` that will have the necessary permissions defined in a `ClusterRole`, which we will attach with a `ClusterRoleBinding`. This `ServiceAccount` will be used by our `CronJobs`:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: cluster-manager-role
rules:
  - apiGroups: ["apps"]
    resources: ["deployments", "deployments/scale"]
    verbs: ["get", "list", "update", "patch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: cluster-manager-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-manager-role
subjects:
  - kind: ServiceAccount
    name: cluster-manager
    namespace: kube-system
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: cluster-manager
  namespace: kube-system
```
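Before wiring up the jobs, it can help to confirm the RBAC actually grants what they need. A quick check with `kubectl auth can-i`, impersonating the `ServiceAccount` (this assumes the manifests above are applied and you have cluster access; `namespace1` stands in for one of your target namespaces):

```shell
# Verify the ServiceAccount can scale and list deployments.
kubectl auth can-i patch deployments/scale \
  --as=system:serviceaccount:kube-system:cluster-manager -n namespace1
# should print "yes"
kubectl auth can-i list deployments \
  --as=system:serviceaccount:kube-system:cluster-manager -n namespace1
# should print "yes"
```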
Shutdown Behavior
For this, we will use a `CronJob` that lets us schedule when to scale down the deployments' load and, consequently, the nodes. As the entrypoint of the `Job`, we define a command that scales all deployments in the namespaces we specify to zero (`--replicas=0`). This makes the scale-down cleaner and more gradual, and allows our Cluster Autoscaler to detect that there is no longer load on the cluster and remove the unused nodes. Note that the cron schedule is evaluated by the control plane (typically in UTC), so 01:00 UTC corresponds to 8 PM in America/Lima; the `TZ` variable only affects timestamps inside the container.

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: shutdown-manager
  namespace: kube-system
spec:
  schedule: "0 01 * * *" # Every day - 8 PM America/Lima (01:00 UTC)
  successfulJobsHistoryLimit: 1
  failedJobsHistoryLimit: 1
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: cluster-manager
          containers:
            - name: shutdown-manager
              image: fluxcd/flux-cli:v2.1.2
              command: ["/bin/sh", "-c"]
              args:
                - |
                  echo "Start: $(date) - Environment: $environment"
                  for i in namespace1 namespace2 namespace3
                  do
                    kubectl scale deployment -n $i --all --replicas=0
                  done
              env:
                - name: TZ
                  value: America/Lima
                - name: environment
                  value: dev
          restartPolicy: OnFailure
```
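Rather than waiting for the scheduled run, you can smoke-test the behavior immediately by creating a one-off `Job` from the `CronJob` (this assumes the manifest above is applied and you have cluster access):

```shell
# Create a one-off Job from the CronJob template and follow its logs.
kubectl create job manual-shutdown --from=cronjob/shutdown-manager -n kube-system
kubectl logs -n kube-system job/manual-shutdown -f
# Clean up the test job afterwards.
kubectl delete job manual-shutdown -n kube-system
```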
Startup Behavior
For this, we will use a `CronJob` that lets us schedule when to bring the deployments' load, and therefore the nodes, back up. As the entrypoint of the `Job`, we define a command that scales all deployments in the namespaces we define to one replica (`--replicas=1`). This makes the startup cleaner and more gradual, and allows our Cluster Autoscaler to detect that there is load on the cluster and create new nodes.

If some deployments need more than one replica, it's best to associate a `HorizontalPodAutoscaler` with them, so that replicas increase when needed.

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: startup-manager
  namespace: kube-system
spec:
  schedule: "0 13 * * 1-5" # Monday to Friday - 8 AM America/Lima (13:00 UTC)
  successfulJobsHistoryLimit: 1
  failedJobsHistoryLimit: 1
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: cluster-manager
          containers:
            - name: startup-manager
              image: fluxcd/flux-cli:v2.1.2
              command: ["/bin/sh", "-c"]
              args:
                - |
                  echo "Start: $(date) - Environment: $environment"
                  for i in namespace1 namespace2 namespace3
                  do
                    kubectl scale deployment -n $i --all --replicas=1
                  done
              env:
                - name: TZ
                  value: America/Lima
                - name: environment
                  value: dev
          restartPolicy: OnFailure
```
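For the `HorizontalPodAutoscaler` suggestion above, a minimal sketch might look like this (the deployment name `my-app`, the namespace, and the thresholds are all hypothetical placeholders):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app
  namespace: namespace1
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 1
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```

This combination works well with the schedules: the HPA is effectively paused while the shutdown job holds a deployment at 0 replicas, and once the startup job restores 1 replica, the HPA takes over and scales up toward `maxReplicas` as load requires.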
Conclusion
In conclusion, implementing automated shutdown and startup procedures in non-production Kubernetes environments is a highly effective cost-saving strategy. By using `CronJobs` to scale deployments down outside working hours, organizations can significantly reduce operational costs without impacting daily productivity. This approach not only optimizes resource usage but also contributes to a more sustainable operation by reducing unnecessary computational waste. Such measures can lead to substantial financial savings, making them an attractive option for businesses looking to maximize the efficiency of their cloud infrastructure.



