This article continues from Part 1 of advanced Kubernetes scheduling. In Part 1, we discussed taints and tolerations. In this article, we will look at other scheduling mechanisms provided by Kubernetes that help us direct workloads to particular nodes or schedule pods together.
Preference during scheduling but ignore changes later
With a preferred rule, the scheduler favors nodes that match the specified labels, but it can still place the pod on a non-matching node, typically when no matching node in the cluster is available. preferredDuringSchedulingIgnoredDuringExecution is the preferred-rule form of affinity.
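As a rough illustration of a preferred rule, the sketch below shows a Pod that favors nodes carrying a particular label. The pod name preferred-demo, the nginx image, and the disktype=ssd label are assumptions made up for this example; they are not taken from the examples repository.

```yaml
# Hypothetical Pod: the name, image, and disktype=ssd label are illustrative assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: preferred-demo
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1              # 1-100; a higher weight counts more when nodes are scored
        preference:
          matchExpressions:
          - key: disktype
            operator: In
            values:
            - ssd
  containers:
  - name: nginx
    image: nginx
```

If no node carries the label, the pod should still be scheduled somewhere, which is the practical difference from a required rule.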
Rules must match while scheduling but ignore changes later
With a required rule, the pod is not scheduled if no node matches. With requiredDuringSchedulingIgnoredDuringExecution affinity, a pod is scheduled only if the node labels specified in the pod spec match the labels on the node. Once the pod is scheduled, however, the labels are ignored: even if the node labels change, the pod continues to run on that node.
Rules must match while scheduling and also if situation changes later
With requiredDuringSchedulingRequiredDuringExecution affinity, a pod is scheduled only if the node labels specified in the pod spec match the labels on the node, and if the labels on the node change later, the pod is evicted. Note that the Kubernetes documentation has described this variant as planned rather than available, so check your cluster version before relying on it. The effect is similar to a NoExecute taint, with one significant difference: when a NoExecute taint is applied to a node, every pod without a matching toleration is evicted, whereas removing or changing a label evicts only the pods whose affinity rules no longer match the node.
Node affinity makes sense when we need to schedule a certain set of pods on a certain set of nodes but do not want those nodes to reject everything else.
This assumes that you have cloned the kubernetes-scheduling-examples repository. Let’s begin by listing the nodes.
kubectl get nodes
You should see the list of nodes available in the cluster:
NAME                          STATUS    AGE       VERSION
node1.compute.infracloud.io   Ready     25m       v1.9.4
node2.compute.infracloud.io   Ready     25m       v1.9.4
node3.compute.infracloud.io   Ready     28m       v1.9.4
NodeAffinity works on label matching. Let’s label node1 and verify it:
kubectl label nodes node1.compute.infracloud.io thisnode=TheChosenOne
kubectl get nodes --show-labels | grep TheChosenOne
Now let’s try to deploy the entire guestbook on node1. In all the deployment YAML files, a node affinity for node1 is added as:
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: "thisnode"
          operator: In
          values: ["TheChosenOne"]
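To show where this affinity block sits inside a manifest, here is a minimal sketch of a Deployment. The block goes under the pod template's spec, alongside containers. The Deployment name, labels, replica count, and image are assumptions for illustration and may differ from the actual files in the repository.

```yaml
# Hypothetical Deployment: name, labels, replicas, and image are assumptions,
# not the repository's actual guestbook manifest.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: frontend
spec:
  replicas: 3
  selector:
    matchLabels:
      app: guestbook
      tier: frontend
  template:
    metadata:
      labels:
        app: guestbook
        tier: frontend
    spec:
      affinity:                # affinity lives in the pod spec, next to containers
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: "thisnode"
                operator: In
                values: ["TheChosenOne"]
      containers:
      - name: frontend
        image: nginx           # placeholder image
```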
guestbook_create.sh deploys the guestbook.
In a couple of minutes, you should be able to see that all the pods are scheduled on node1.
NAME                            READY     STATUS    RESTARTS   AGE       IP            NODE
frontend-85b968cdc5-c785v       1/1       Running   0          49s       10.20.29.13   node1.compute.infracloud.io
frontend-85b968cdc5-pw2kl       1/1       Running   0          49s       10.20.29.14   node1.compute.infracloud.io
frontend-85b968cdc5-xxh7h       1/1       Running   0          49s       10.20.29.15   node1.compute.infracloud.io
redis-master-7bbf6b76bf-ttb6b   1/1       Running   0          1m        10.20.29.10   node1.compute.infracloud.io
redis-slave-747f8bc7c5-2tjtw    1/1       Running   0          1m        10.20.29.11   node1.compute.infracloud.io
redis-slave-747f8bc7c5-clxzh    1/1       Running   0          1m        10.20.29.12   node1.compute.infracloud.io
The output will also yield a load balancer ingress URL which can be used to load the guestbook. To finish off, let’s use guestbook_cleanup.sh to remove the guestbook.
Pod Affinity & AntiAffinity
In Kubernetes, node affinity allows you to schedule a pod on a set of nodes based on labels present on the nodes. However, in certain scenarios, we might want to schedule certain pods together, or make sure that certain pods are never scheduled together. This can be achieved with PodAffinity and PodAntiAffinity, respectively. As with node affinity, pod affinity comes in two variants: requiredDuringSchedulingIgnoredDuringExecution and preferredDuringSchedulingIgnoredDuringExecution.
- PodAffinity makes sense when we need to schedule a certain set of pods together. For example, a web server and a cache.
- PodAntiAffinity makes sense when we need to make sure that a certain set of pods are not scheduled together. For example, you may not want two applications that are both disk intensive to be on the same node.
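The two cases above can be combined in one manifest. The following is a sketch of a web server Deployment that must land next to a cache pod while keeping its own replicas apart. The names web-server, web-store, and cache, as well as the image, are hypothetical and assume a separate set of pods labeled app=cache already exists.

```yaml
# Hypothetical Deployment: names, labels, and image are assumptions for illustration.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-server
spec:
  replicas: 2
  selector:
    matchLabels:
      app: web-store
  template:
    metadata:
      labels:
        app: web-store
    spec:
      affinity:
        podAffinity:
          # Co-locate with pods labeled app=cache (assumed to be deployed separately).
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - cache
            topologyKey: "kubernetes.io/hostname"
        podAntiAffinity:
          # Keep web-server replicas off nodes that already run one.
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - web-store
            topologyKey: "kubernetes.io/hostname"
      containers:
      - name: web
        image: nginx           # placeholder image
```

Because both rules are required, the web server pods would stay unscheduled if no node runs a cache pod, or if every such node already hosts a web server replica.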
Pod Affinity walk-through
Let’s deploy deployment-Affinity.yaml, which defines pod affinity as:
affinity:
  podAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        - key: app
          operator: In
          values:
          - nginx
      topologyKey: "kubernetes.io/hostname"
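Here we are specifying that all nginx pods should be scheduled together. The topologyKey kubernetes.io/hostname makes the node the unit of "together", so matching pods land on the same node. Let’s apply and verify: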
kubectl apply -f deployment-Affinity.yaml
kubectl get pods -o wide -w
You should be able to see that all pods are scheduled on the same node.
NAME                                READY     STATUS    RESTARTS   AGE       IP            NODE
nginx-deployment-6bc5bb7f45-49dtg   1/1       Running   0          36m       10.20.29.18   node2.compute.infracloud.io
nginx-deployment-6bc5bb7f45-4ngvr   1/1       Running   0          36m       10.20.29.20   node2.compute.infracloud.io
nginx-deployment-6bc5bb7f45-lppkn   1/1       Running   0          36m       10.20.29.19   node2.compute.infracloud.io
To clean up, run:
kubectl delete -f deployment-Affinity.yaml
Pod Anti Affinity example
Let’s deploy deployment-AntiAffinity.yaml, which defines pod anti-affinity as:
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        - key: app
          operator: In
          values:
          - nginx
      topologyKey: "kubernetes.io/hostname"
Here we are specifying that no two nginx pods should be scheduled on the same node. Let’s apply and verify:
kubectl apply -f deployment-AntiAffinity.yaml
kubectl get pods -o wide -w
You should be able to see that pods are scheduled on different nodes.
NAME                                READY     STATUS    RESTARTS   AGE       IP            NODE
nginx-deployment-85d87bccff-4w7tf   1/1       Running   0          27s       10.20.29.16   node3.compute.infracloud.io
nginx-deployment-85d87bccff-7fn47   1/1       Running   0          27s       10.20.42.32   node1.compute.infracloud.io
nginx-deployment-85d87bccff-sd4lp   1/1       Running   0          27s       10.20.13.17   node2.compute.infracloud.io
Note: In the above example, if the number of replicas exceeds the number of nodes, some of the pods will remain in the Pending state.
To clean up, run:
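If pending pods are not acceptable, one option is the preferred variant of the same rule, which asks the scheduler to spread nginx pods across nodes where it can but still lets extra replicas share a node. The fragment below is a sketch of that variant and is not part of the examples repository; the weight of 100 is an arbitrary choice.

```yaml
# Sketch of a soft anti-affinity rule; not taken from deployment-AntiAffinity.yaml.
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100              # 1-100; arbitrary value chosen for this example
      podAffinityTerm:         # preferred pod rules wrap the term in podAffinityTerm
        labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - nginx
        topologyKey: "kubernetes.io/hostname"
```

The trade-off is that the spread becomes best effort: with more replicas than nodes, some nodes would end up running more than one nginx pod.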
kubectl delete -f deployment-AntiAffinity.yaml
This covers the advanced scheduling mechanisms provided by Kubernetes. Have any questions? Feel free to drop a comment below.
To sum up, Kubernetes provides simple mechanisms such as taints, tolerations, node affinity, and pod affinity to schedule workloads dynamically. The mechanisms are simple, but when used with labels and selectors they give a fair amount of control over how pods are scheduled.