Introduction
The built-in Kubernetes scheduler assigns workloads based on a multitude
of factors, such as resource needs and quality of service, which can be
provided to the Kubernetes scheduler as
flags.
In addition to these, as a user, you can use certain techniques to
affect scheduling decisions. In real-world workloads, there are needs
such as:
- Run a set of pods only on certain nodes, for example, pods
with ML workloads on nodes with a GPU attached.
- Always run a set of pods on the same nodes.
- Never run two particular pods together.
Some of the mechanisms Kubernetes scheduling provides to tackle these
cases are taints, tolerations, node affinity and pod affinity. In this
post, we will focus specifically on taints and tolerations; the next post
will cover pod and node affinity. I will also try to add a third post
on writing a simple custom scheduler. Before we dive into the details, let’s
get the definitions clear:
- To taint – Contaminate or pollute something with an undesirable
behaviour or effect. In Kubernetes terms, when we taint a node, we
don’t allow pods to be scheduled on that node unless they tolerate the taint.
- To tolerate – Overcome an undesirable behaviour or effect. In
Kubernetes scheduling terms, we use a toleration to overcome a taint.
- Affinity – A natural liking for something. In Kubernetes terms, node
affinity and pod affinity are used to schedule pods on specific
nodes.
Taints
A taint is a property of a node (maybe in the future it will also
apply to virtual kubelet, who knows?). A node can have multiple taints
at any given point in time. A taint lets the node repel any pod that
does not have a toleration for that taint. A taint has three parts:
a key, a value and an effect. For
example:
kubectl taint nodes node1.compute.infracloud.io thisnode=HatesPods:NoSchedule
The above taint has the key thisnode, the value HatesPods and the effect
NoSchedule. The key and value are configurable. Any pod that
doesn’t have a matching toleration for this taint will not be scheduled
on node1. To remove the above taint, we can run the following command:
kubectl taint nodes node1.compute.infracloud.io thisnode:NoSchedule-
The following are the built-in effects as of this writing:
- NoSchedule – Doesn’t schedule a pod on the node unless it has a matching
toleration.
- PreferNoSchedule – Prefers not to schedule a pod without a matching
toleration on the node. It is a softer version of the
NoSchedule effect.
- NoExecute – Evicts pods already running on the node that don’t have a
matching toleration, and doesn’t schedule new ones.
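To make the three-part structure concrete, here is a minimal Python sketch that splits a taint string of the form key=value:effect and checks the effect against the three built-in effects listed above. The Taint class and the parse_taint helper are hypothetical names made up for this illustration; they are not part of kubectl or of any Kubernetes client library.

```python
from dataclasses import dataclass

# The three built-in taint effects listed above.
EFFECTS = ("NoSchedule", "PreferNoSchedule", "NoExecute")


@dataclass(frozen=True)
class Taint:
    """Illustrative container for the three parts of a taint."""
    key: str
    value: str
    effect: str


def parse_taint(spec: str) -> Taint:
    """Split a 'key=value:effect' string into key, value and effect.

    Hypothetical helper for illustration only; not a kubectl API.
    """
    # The effect is whatever follows the last colon.
    key_value, sep, effect = spec.rpartition(":")
    if not sep or effect not in EFFECTS:
        raise ValueError(f"expected key=value:effect with effect in {EFFECTS}, got {spec!r}")
    # The value is optional, so 'key:effect' yields an empty value.
    key, _, value = key_value.partition("=")
    if not key:
        raise ValueError(f"taint key must not be empty: {spec!r}")
    return Taint(key=key, value=value, effect=effect)


taint = parse_taint("thisnode=HatesPods:NoSchedule")
print(taint.key, taint.value, taint.effect)  # thisnode HatesPods NoSchedule
```

A string with an unknown effect, such as thisnode=HatesPods:Sometimes, raises a ValueError in this sketch.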
Tolerations
A toleration is simply a way for a workload to overcome a taint so that
it can be scheduled on a tainted node. A toleration generally has four
parts: a key, a value, an operator and an effect. The operator, if not
specified, defaults to Equal. For example, in the section above,
we tainted node1.compute.infracloud.io. To schedule a pod on
that node, we need a matching toleration. Below is a toleration that
can be used to overcome the taint.
tolerations:
- key: "thisnode"
  operator: "Equal"
  value: "HatesPods"
  effect: "NoSchedule"
This needs to be included in the pod spec of the Kubernetes resource's
YAML (for a Deployment, under spec.template.spec) so that the Kubernetes
scheduler picks it up.
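The matching rule can be sketched in a few lines of Python. This is a simplified model for illustration, not the scheduler's actual code, and the Taint and Toleration classes and the tolerates function are hypothetical names. It also covers the Exists operator, which is not discussed above; per the Kubernetes documentation, Exists matches on the key alone and ignores the value.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Taint:
    key: str
    value: str
    effect: str


@dataclass(frozen=True)
class Toleration:
    key: str = ""
    operator: str = "Equal"  # defaults to Equal when not specified
    value: str = ""
    effect: str = ""         # an empty effect matches every effect


def tolerates(toleration: Toleration, taint: Taint) -> bool:
    """Return True if the toleration matches the taint (simplified model)."""
    # The effects must agree, unless the toleration leaves the effect empty.
    if toleration.effect and toleration.effect != taint.effect:
        return False
    # An empty key is only meaningful with Exists, where it matches every key.
    if not toleration.key:
        return toleration.operator == "Exists"
    if toleration.key != taint.key:
        return False
    # Exists ignores the value; Equal requires the values to be identical.
    if toleration.operator == "Exists":
        return True
    return toleration.operator == "Equal" and toleration.value == taint.value


taint = Taint("thisnode", "HatesPods", "NoSchedule")
print(tolerates(Toleration("thisnode", "Equal", "HatesPods", "NoSchedule"), taint))  # True
print(tolerates(Toleration("thisnode", "Equal", "LovesPods", "NoSchedule"), taint))  # False
```

The first call mirrors the toleration shown above; the second differs only in its value (LovesPods is a made-up value), so it does not match.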
Use cases
- Taints can be used to set aside a group of nodes for a certain set
of workloads, such as pods with a special resource requirement like
a GPU. The pods that need a GPU then add tolerations for that taint to
themselves.
- Taints can also be used to evict a large set of pods from a node
by applying a taint with the NoExecute effect. This is useful when taking
the node down or cleaning it up for any other maintenance
activity.
Walk-through guide
Let’s start with listing nodes and inspecting their current status:
$ kubectl get nodes
NAME                          STATUS    ROLES     AGE       VERSION
node1.compute.infracloud.io   Ready     <none>    25m       v1.9.4
node2.compute.infracloud.io   Ready     <none>    25m       v1.9.4
node3.compute.infracloud.io   Ready     <none>    28m       v1.9.4
Now, let’s taint node1 with the NoSchedule effect:
$ kubectl taint nodes node1.compute.infracloud.io thisnode=HatesPods:NoSchedule
node "node1.compute.infracloud.io" tainted
Let’s create a deployment and see which nodes the pods land on:
$ kubectl create -f https://raw.githubusercontent.com/infracloudio/kubernetes-scheduling-examples/master/taints/deployment.yaml
$ kubectl get pods -o wide
NAME                                READY     STATUS    RESTARTS   AGE       IP            NODE
nginx-deployment-6c54bd5869-g9rtf   1/1       Running   0          18s       10.20.32.2    node3.compute.infracloud.io
nginx-deployment-6c54bd5869-v74m6   1/1       Running   0          18s       10.20.32.3    node3.compute.infracloud.io
nginx-deployment-6c54bd5869-w5jxj   1/1       Running   0          18s       10.20.61.2    node2.compute.infracloud.io
Now let’s taint node3 with the NoExecute effect. This evicts both
pods from node3, and they are rescheduled on node2.
$ kubectl taint nodes node3.compute.infracloud.io thisnode=AlsoHatesPods:NoExecute
In a few seconds, you’ll see that the pods are terminated on node3 and
spawned on node2:
$ kubectl get pods -o wide
NAME                                READY     STATUS    RESTARTS   AGE       IP            NODE
nginx-deployment-6c54bd5869-8vqvc   1/1       Running   0          33s       10.20.42.21   node2.compute.infracloud.io
nginx-deployment-6c54bd5869-hsjhj   1/1       Running   0          33s       10.20.42.20   node2.compute.infracloud.io
nginx-deployment-6c54bd5869-w5jxj   1/1       Running   0          2m        10.20.42.19   node2.compute.infracloud.io
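The eviction we just saw can be modelled with a short Python sketch. It is a simplified illustration under stated assumptions, not how Kubernetes implements evictions: the tolerates and pods_to_evict functions are hypothetical, node names are shortened to node2 and node3, and only the Equal and Exists operators with a non-empty key are handled.

```python
from typing import List


def tolerates(toleration: dict, taint: dict) -> bool:
    """Simplified toleration match for the Equal and Exists operators."""
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    if toleration.get("key") != taint["key"]:
        return False
    if toleration.get("operator", "Equal") == "Exists":
        return True
    return toleration.get("value", "") == taint.get("value", "")


def pods_to_evict(node: str, taint: dict, pods: List[dict]) -> List[str]:
    """Names of pods on `node` that a newly added taint would evict."""
    if taint["effect"] != "NoExecute":
        return []  # only NoExecute affects pods that are already running
    return [
        pod["name"]
        for pod in pods
        if pod["node"] == node
        and not any(tolerates(t, taint) for t in pod.get("tolerations", []))
    ]


# Pod placement before node3 was tainted, taken from the output above.
pods = [
    {"name": "nginx-deployment-6c54bd5869-g9rtf", "node": "node3"},
    {"name": "nginx-deployment-6c54bd5869-v74m6", "node": "node3"},
    {"name": "nginx-deployment-6c54bd5869-w5jxj", "node": "node2"},
]
taint = {"key": "thisnode", "value": "AlsoHatesPods", "effect": "NoExecute"}

print(pods_to_evict("node3", taint, pods))
# ['nginx-deployment-6c54bd5869-g9rtf', 'nginx-deployment-6c54bd5869-v74m6']
```

The pod on node2 is untouched, which matches the output above: it keeps its name while the two evicted pods are replaced by new ones.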
The walk-through so far demonstrates taint-based evictions. Let’s delete the
deployment and create a new one with tolerations for the above taints.
$ kubectl delete deployment nginx-deployment
$ kubectl create -f https://raw.githubusercontent.com/infracloudio/kubernetes-scheduling-examples/master/taints/deployment-toleration.yaml
$ kubectl get pods -o wide
You should see that some of the pods are scheduled on node1
and some on node2. However, no pod is scheduled on node3. This is
because the new deployment spec tolerates only the taint with the
NoSchedule effect. node3 is tainted with the NoExecute effect,
which we have not tolerated, so no pods are scheduled there.
NAME                                READY     STATUS    RESTARTS   AGE       IP            NODE
nginx-deployment-5699885bdb-4dz8z   1/1       Running   0          1m        10.20.34.3    node1.compute.infracloud.io
nginx-deployment-5699885bdb-cr7p7   1/1       Running   0          1m        10.20.34.4    node1.compute.infracloud.io
nginx-deployment-5699885bdb-kjxwv   1/1       Running   0          1m        10.20.34.5    node1.compute.infracloud.io
nginx-deployment-5699885bdb-kvfw6   1/1       Running   0          1m        10.20.34.7    node1.compute.infracloud.io
nginx-deployment-5699885bdb-lx2zv   1/1       Running   0          1m        10.20.34.6    node1.compute.infracloud.io
nginx-deployment-5699885bdb-m686q   1/1       Running   0          1m        10.20.42.30   node2.compute.infracloud.io
nginx-deployment-5699885bdb-x7c6z   1/1       Running   0          1m        10.20.42.31   node2.compute.infracloud.io
nginx-deployment-5699885bdb-z8cwl   1/1       Running   0          1m        10.20.34.9    node1.compute.infracloud.io
nginx-deployment-5699885bdb-z9c68   1/1       Running   0          1m        10.20.34.8    node1.compute.infracloud.io
nginx-deployment-5699885bdb-zshst   1/1       Running   0          1m        10.20.34.2    node1.compute.infracloud.io
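Why only node1 and node2 are eligible can be worked out with a small Python sketch. It is a simplified model, not the scheduler's real filtering code: the tolerates and schedulable_nodes functions are hypothetical, node names are shortened, and the deployment is assumed to carry exactly the toleration shown in the Tolerations section.

```python
from typing import Dict, List

# Effects that keep a pod off a node; PreferNoSchedule is only a preference.
HARD_EFFECTS = {"NoSchedule", "NoExecute"}


def tolerates(toleration: dict, taint: dict) -> bool:
    """Simplified toleration match for the Equal and Exists operators."""
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    if toleration.get("key") != taint["key"]:
        return False
    if toleration.get("operator", "Equal") == "Exists":
        return True
    return toleration.get("value", "") == taint.get("value", "")


def schedulable_nodes(node_taints: Dict[str, List[dict]], tolerations: List[dict]) -> List[str]:
    """Nodes on which every hard taint is matched by some toleration."""
    eligible = []
    for node, taints in node_taints.items():
        blocking = [t for t in taints if t["effect"] in HARD_EFFECTS]
        if all(any(tolerates(tol, t) for tol in tolerations) for t in blocking):
            eligible.append(node)
    return eligible


# Taints applied during the walk-through.
node_taints = {
    "node1": [{"key": "thisnode", "value": "HatesPods", "effect": "NoSchedule"}],
    "node2": [],
    "node3": [{"key": "thisnode", "value": "AlsoHatesPods", "effect": "NoExecute"}],
}
# Assumed toleration of the new deployment (the one shown earlier).
toleration = {"key": "thisnode", "operator": "Equal", "value": "HatesPods", "effect": "NoSchedule"}

print(schedulable_nodes(node_taints, [toleration]))  # ['node1', 'node2']
print(schedulable_nodes(node_taints, []))            # ['node2']
```

The second call shows the situation without any toleration: only the untainted node2 remains eligible.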
To finish off, let’s remove the taints from the nodes:
$ kubectl taint nodes node3.compute.infracloud.io thisnode:NoExecute-
$ kubectl taint nodes node1.compute.infracloud.io thisnode:NoSchedule-
For more details and examples, please take a look at the examples and
sample code in this GitHub
repo.
In the next instalment of this series, we will look at node and pod affinity
with hands-on examples. Complex Kubernetes scheduling can also be done
by writing a custom scheduler of your own, which we will cover in the
third part of this series.