Cost-Saving Strategies for Kubernetes: Managing Automated Shutdown and Startup in Non-Production Environments
In a previous post, we discussed installing Karpenter to manage cluster autoscaling efficiently. Today, we'll explore how to cut node costs further by automating the shutdown and startup of Kubernetes workloads, specifically in non-production environments. By operating nodes for only 12 hours a day, you could cut compute costs by up to 50%.
Introduction
It's very common to have at least one or two non-production environments (DEV, QA, Staging, UAT, etc.) to validate the version that will be deployed to production and to keep Continuous Integration/Continuous Deployment from stalling. In most cases, however, these environments do not need to be available 24x7. As our clusters grow in nodes, our costs increase, and we need a strategy that reduces them without affecting the daily productivity of our teams.
One strategy is a scheduled shutdown/startup. Compute services are billed by usage, so we can reduce the bill by turning off nodes when we don't need them. This also contributes to the well-being of the team, since the environments are only available during working hours.
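As a rough check on the arithmetic, the sketch below estimates the share of node-hours saved by a given schedule. It assumes cost scales linearly with node-hours and ignores anything that keeps running (control plane, storage, nodes hosting workloads that are not scaled down). The function name is illustrative.

```shell
#!/bin/sh
# Rough estimate of node-hours saved by a shutdown/startup schedule.
# Assumption: cost is proportional to node-hours; fixed costs are ignored.

# savings_percent HOURS_PER_DAY DAYS_PER_WEEK
# Prints the percentage of the 168 weekly hours during which nodes are off,
# rounded down to a whole percent.
savings_percent() {
  on_hours=$(( $1 * $2 ))
  echo $(( (168 - on_hours) * 100 / 168 ))
}

savings_percent 12 7   # 12 hours a day, every day
savings_percent 12 5   # 12 hours a day, weekdays only
```

With these inputs the script prints 50 and 64: keeping nodes off over the weekend as well pushes the estimate above the 50% figure.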
Creating the prerequisite resources
First, we will need a ServiceAccount with the necessary permissions defined in a ClusterRole, which we will bind to it with a ClusterRoleBinding. This ServiceAccount will be used by our CronJobs:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: cluster-manager-role
rules:
  - apiGroups: ["apps"]
    resources: ["deployments", "deployments/scale"]
    verbs: ["get", "list", "update", "patch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: cluster-manager-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-manager-role
subjects:
  - kind: ServiceAccount
    name: cluster-manager
    namespace: kube-system
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: cluster-manager
  namespace: kube-system
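Before scheduling anything, it can help to confirm that the ServiceAccount really has the permissions the jobs rely on. The following is a minimal sketch using kubectl auth can-i with impersonation. The function name is illustrative, the KUBECTL variable is a hypothetical override added so the commands can be printed with echo instead of executed, and impersonation only works if your own user is allowed to impersonate service accounts.

```shell
#!/bin/sh
# Check the permissions that "kubectl scale --all" needs for the
# cluster-manager ServiceAccount defined in this post.
# KUBECTL is a hypothetical override (e.g. KUBECTL=echo prints the commands).

SA="system:serviceaccount:kube-system:cluster-manager"

# check_scale_permissions NAMESPACE
check_scale_permissions() {
  ns="$1"
  # Listing deployments is needed to resolve --all.
  ${KUBECTL:-kubectl} auth can-i list deployments -n "$ns" --as="$SA"
  # Scaling reads and writes the deployments/scale subresource; whether the
  # write is a patch or an update depends on the kubectl version, so check both.
  for verb in get update patch; do
    ${KUBECTL:-kubectl} auth can-i "$verb" deployments --subresource=scale \
      -n "$ns" --as="$SA"
  done
}

# Example: check_scale_permissions namespace1
```

If the RBAC objects are applied correctly, each of the four checks should answer yes for every namespace the jobs will touch.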
Shutdown Behavior
For this, we will use a CronJob that schedules when to scale down the deployments and, consequently, the nodes. As the entrypoint in the Job specification, we define a command that scales all deployments in the N namespaces we specify to 0 (--replicas=0). This makes the scale-down cleaner and more gradual, and lets the Cluster Autoscaler detect that there is no longer any load on the cluster and remove the unused nodes.

apiVersion: batch/v1
kind: CronJob
metadata:
  name: shutdown-manager
  namespace: kube-system
spec:
  schedule: "0 01 * * *" # Every day at 8 PM America/Lima (01:00 UTC)
  successfulJobsHistoryLimit: 1
  failedJobsHistoryLimit: 1
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: cluster-manager
          containers:
            - name: shutdown-manager
              image: fluxcd/flux-cli:v2.1.2
              command: ["/bin/sh", "-c"]
              args:
                - |
                  echo "Start: $(date) - Environment: $environment"
                  for i in namespace1 namespace2 namespace3
                  do
                    kubectl scale deployment -n "$i" --all --replicas=0
                  done
              env:
                - name: TZ
                  value: America/Lima
                - name: environment
                  value: dev
          restartPolicy: OnFailure
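One detail worth spelling out: the TZ variable in the container only affects what date prints in the log line, not when the CronJob fires. Unless the CronJob sets spec.timeZone (available in recent Kubernetes releases), the schedule is interpreted in the time zone of the kube-controller-manager, which is commonly UTC. The sketch below shows the conversion behind the schedules in this post, assuming a UTC control plane and a fixed UTC-5 offset for America/Lima (no daylight saving time). The function name is illustrative.

```shell
#!/bin/sh
# Convert a local hour to the UTC hour to put in a CronJob schedule.
# Assumptions: the schedule is evaluated in UTC, and the local zone has a
# fixed offset from UTC (America/Lima is UTC-5 all year).

# utc_hour LOCAL_HOUR UTC_OFFSET_HOURS
utc_hour() {
  # Add 24 before the modulo so the result is never negative.
  echo $(( ($1 - $2 + 24) % 24 ))
}

utc_hour 20 -5   # 8 PM in Lima
utc_hour 8 -5    # 8 AM in Lima
```

This prints 1 and 13, which match the hour fields of the two schedules. Keep in mind that the day can shift as well: 8 PM on Friday in Lima is 01:00 on Saturday in UTC, which matters whenever the day-of-week field is restricted.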
Startup Behavior
For this, we will use a CronJob that schedules when to scale the deployments back up and, therefore, the nodes. As the entrypoint in the Job specification, we define a command that scales all deployments in the N namespaces we specify to 1 (--replicas=1). This makes the startup cleaner and more gradual, and lets the Cluster Autoscaler detect that there is load on the cluster and create new nodes.

If some deployments need more than one replica, it's best to associate a HorizontalPodAutoscaler with them, so that they can increase their replicas when needed.

apiVersion: batch/v1
kind: CronJob
metadata:
  name: startup-manager
  namespace: kube-system
spec:
  schedule: "0 13 * * 1-5" # Monday to Friday at 8 AM America/Lima (13:00 UTC)
  successfulJobsHistoryLimit: 1
  failedJobsHistoryLimit: 1
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: cluster-manager
          containers:
            - name: startup-manager
              image: fluxcd/flux-cli:v2.1.2
              command: ["/bin/sh", "-c"]
              args:
                - |
                  echo "Start: $(date) - Environment: $environment"
                  for i in namespace1 namespace2 namespace3
                  do
                    kubectl scale deployment -n "$i" --all --replicas=1
                  done
              env:
                - name: TZ
                  value: America/Lima
                - name: environment
                  value: dev
          restartPolicy: OnFailure
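For deployments that need more than one replica, a HorizontalPodAutoscaler can be created with kubectl autoscale. The sketch below is only an illustration: the function name, the deployment name my-api, and the min/max/CPU values are hypothetical, and KUBECTL is an assumed override so the command can be printed with echo instead of executed. Flag names can vary between kubectl versions, so check kubectl autoscale --help for yours.

```shell
#!/bin/sh
# create_hpa NAMESPACE DEPLOYMENT MIN MAX CPU_PERCENT
# Creates an HPA that keeps the deployment between MIN and MAX replicas,
# targeting the given average CPU utilization.
# KUBECTL is a hypothetical override (e.g. KUBECTL=echo prints the command).
create_hpa() {
  ${KUBECTL:-kubectl} autoscale deployment "$2" -n "$1" \
    --min="$3" --max="$4" --cpu-percent="$5"
}

# Example (hypothetical names and values):
# create_hpa namespace1 my-api 2 5 80
```

The HorizontalPodAutoscaler leaves a deployment alone while it sits at 0 replicas, so it should not fight the shutdown job. Once the startup job sets the deployment to 1 replica, the autoscaler resumes and raises it to the configured minimum.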
Conclusion
In conclusion, implementing automated shutdown and startup procedures in Kubernetes non-production environments is a highly effective cost-saving strategy. By utilizing CronJobs to manage the scaling of deployments outside of work hours, organizations can significantly reduce their operational costs without impacting daily productivity. This approach not only optimizes resource usage but also contributes to a more sustainable operation by reducing unnecessary computational waste. Implementing such measures can lead to substantial financial savings, making it an attractive option for businesses looking to maximize their cloud infrastructure efficiency.