Image by author

Entropy, information gain, and Gini impurity(Decision tree splitting criteria)

Brian Mutea
Brian Mutea
14 min read

Table of Contents

Decision trees are supervised machine-learning models used to solve classification and regression problems. They help to make decisions by breaking down a problem with a bunch of if-else-then-like evaluations that result in a tree-like structure.

For quality and viable decisions to be made, a decision tree builds itself by splitting various features that give the best information about the target feature till a pure final decision is archived.

This article will discuss three common splitting criteria used in decision tree building:

  • Entropy
  • Information gain
  • Gini impurity

Entropy

Entropy measures data points' degree of impurity, uncertainty, or surprise. It ranges between 0 and 1.

Entropy curve: Image by author

We can see that the entropy is 0 when the probability is o or 1. We get a maximum entropy of 1 when the probability is 0.5, which means that the data is perfectly randomized.

To understand entropy, let's familiarize ourselves with the concept of surprise. Surprise, by definition, is an unexpected event.

Suppose we had three baskets with random colored balls:

Image by author

Each basket has a varying number of red and blue balls. Suppose we decided to pick a random ball from each basket. We will notice that:

  • From Basket A, the probability of picking a blue ball is higher than that of picking a red ball. We would be very surprised if we picked a red ball.
  • From Basket B, the probability of picking a red ball is higher than that of picking a blue ball. We would be relatively surprised if we picked a blue ball.
  • From Basket C, the probability of picking either a red or blue ball is equal. Regardless of what ball we picked, we would be equally surprised.

The results show that surprise is inversely related to probability(p(x)). That is, for instance, from Basket A, when the probability of picking a red ball is low, the surprise is high (1/p(x)).

However, we can not use the inverse of probability to calculate the surprise. For instance, let's say we decided to pick balls from basket A three times.

If we picked blue balls and never picked a red one each time we picked, then it would be relatively correct to say that the probability of picking a red ball would always be 0 and that of picking a blue ball 1  for each round we decided to pick. That means the surprise of picking a blue ball should be 0 for each round. But the inverse of the probability of picking a blue ball is 1 1/1 = 1.

Since we have found out that we can not use the inverse of probability to calculate the surprise, we introduce log to the inverse of probability:

The formula for Surprise: Image by author

Thus the surprise of picking a blue ball from basket A the given number of times would be 0, just as expected:

Surprise of picking a blue ball from Basket A

Now with the knowledge of expected values, we can try to calculate the surprise of picking a blue ball and a red ball from Basket A. Since we already know that there's a 0.1 chance that we would pick a red ball and a 0.9 chance of picking a blue ball, we would have the following:

Surprise for picking a blue ball:

Image by author

Surprise for picking a red ball:

Image by author

Entropy is now the expected value of surprise. So every time we pick a ball from basket A, we expect a certain amount of surprise. That is:

Surprise each time we pick a ball

Now from this, we can derive the formula for Entropy:

Derived entropy formula: Image by author

From the equation above, we can get the standard equation for Entropy:

Entropy formula: Image by author

Having understood what entropy is and its formula, let's calculate the entropy of the balls from each basket in our example above:

Entropy in basket A:

Image by author
import math

entropy_A = -(9/10) * math.log2(9/10) - (1/10) * math.log2(1/10)
entropy_A
# 0.4689955935892812

Entropy in basket B:

Image by author
entropy_B = -(19/20) * math.log2(19/20) - (1/20) * math.log2(1/20)
entropy_B
# 0.28639695711595625

Entropy in basket C:

Image by author
entropy_C = -(5/10) * math.log2(5/10) - (5/10) * math.log2(5/10)
entropy_C
# 1.0

The higher the entropy, the less pure the dataset or, the higher the level of disorder. Basket C has the highest entropy(1), which makes it very impure.

A dataset with high entropy(very impure) is good for learning as it is less likely to cause model overfitting.

Information gain

Before exploring this criterion, be sure you are comfortable with entropy. It is not possible to discuss information gain without entropy.

Information gain is the difference between before and after a split on a given attribute. It measures how much information a feature provides about a target.

Constructing a decision tree is solely about finding a feature that returns the highest information gain. The feature with the highest information gain produces the best split, classifying the training dataset better according to the target variable.

Information gain has the following formula:

Information gain formula: Image by author

Where:

  • a is the specific attribute or class label.
  • Entropy(S) is the entropy of dataset S.
  • |Sv| / |S| is the proportion of the values in S to the number of values in the dataset S.

Suppose we had the following dataset:

Image by author

From the above example dataset, we are required to construct a decision tree to help us decide whether we should play volleyball based on the weather conditions.

To construct a decision tree, we need to pick the features that will best guide us to make a viable decision on whether we should play or not play volleyball.

We can't randomly select a feature from the dataset to build the tree, so Entropy and information gain are good criteria for this problem.

To begin with, we have four features we need to consider:

  • Outlook
  • Temp
  • Humidity
  • Wind

A decision tree has various parts, the root node, internal nodes, and leaf nodes. Read more on our Decision Tree and Random Forests article.

Finding the root node feature

Since we can not just pick one of the features to start our decision tree, we need to make calculations to get the feature with the highest information gain from which we start splitting.

Calculate the entropy of the entire dataset(Entropy(S))

Image by author

We can see that we have 5 Noes(or negatives) and 9 Yeses(or positives). The total number of entries is 14.

The entropy of the whole dataset is:

Entropy of Dataset: Image by author
import math

S = -(9/14) * math.log2(9/14) - (5/14) * math.log2(5/14)
S
# 0.9402859586706311

Calculate the information gain for the Outlook feature

Outlook has 3 attributes:

  • Sunny
  • Overcast
  • Rain.

So we will calculate the entropy of each of these attributes(Sv) as follows:

We have 5 Sunny attributes for Outlook:

  • 3 negative Sunny Outlooks(When Play Volleyball is No).
  • 2 positive Sunny Outlooks(When Play Volleyball is Yes).

Let's calculate:

Image by author
import math

entropy_S_sunny = -(2/5) * math.log2(2/5) - (3/5) * math.log2(3/5)
entropy_S_sunny
# 0.9709505944546686

We have 4 Overcast attributes for Outlook:

  • 0 negative Overcast Outlooks.
  • 4 positive Overcast Outlooks.

Let's calculate:

Image by author

We have 5 Rain attributes for Outlook:

  • 2 negative Rain attributes.
  • 3 positive Rain attributes.

Let's calculate:

Image by author
import math

entropy_S_rain = -(3/5) * math.log2(3/5) - (2/5) * math.log2(2/5)
entropy_S_rain
# 0.9709505944546686

The information gain for Outlook

We have the following Entropies:

  • Entropy(S) = 0.94
  • Entropy(SSunny) = 0.97
  • Entropy(SOvercast) = 0
  • Entropy(SRain) = 0.97

We use the formula for information gain to calculate the gain.

Image by author

So:

Image by author
Image by author

Information gain for Outlook is 0.24.

Similarly, we have to calculate the information gain for the other features.

Calculate the information gain for the Temp feature

Temp has 3 attributes:

  • Hot
  • Mild
  • Cool

Since we already have the entropy for the entire dataset(Entropy(S)), we will calculate the entropy of each attribute(Entropy (Sv)) of Temp, just as we did with Outlook.

The entropy of Hot:

Image by author
import math

entropy_S_hot = -(2/4) * math.log2(2/4) - (2/4) * math.log2(2/4)
entropy_S_hot
# 1.0

The entropy of Mild:

Image by author
import math

entropy_S_mild = -(4/6) * math.log2(4/6) - (2/6) * math.log2(2/6)
entropy_S_mild
# 0.9182958340544896

The entropy of Cool:

Image by author
import math

entropy_S_cool = -(3/4) * math.log2(3/4) - (1/4) * math.log2(1/4)
entropy_S_cool

# 0.8112781244591328

Calculate the information gain for the Temp feature:

Image by author

Information gain for Temp is 0.03.

Similarly, calculate the information gain for Humidity and Wind. All information gain values will be:

  • Gain(S, Outlook) = 0.24
  • Gain(S, Temp) = 0.03
  • Gain(S, Humidity) =  0.15
  • Gain(S, Wind) = 0.04
💡
Outlook gives the highest information about our target variable from the information gain values. It will act as the root node of our tree from where the splitting will begin.
Image by author

Note, for Sunny and Rain branches, we can not easily conclude a yes or a no since we have events where Play Volleyball is yes and Play volleyball is no. That means that their entropy is more than zero and hence impure. So we need to split them.

💡
Overcast is a branch with zero entropy since it has all events as Play volleyball(Yes), so it automatically becomes a leaf node.

Finding the internal nodes

We will calculate information gain for the rest of the features when the Outlook is Sunny and when the Outlook is Rain:

Splitting on the Sunny attribute

Image by author
Calculate the information gain for Temp

Values(Temp) = Hot, Mild, Cool.

The entropy for Hot:

Image by author

The entropy for Mild:

Image by author

The entropy for Cool:

Image by author

The Information gain for Temp:

Image by author
Image by author

Calculate the information gain for Humidity

Values(Humidity) = High, Normal.

The entropy for Sunny:

Entropy(SSunny) = 0.97

The entropy for High:

Entropy(SHigh) = 0

Then entropy for Normal:

Entropy(SNormal) = 0

The Information gain for Humidity:

Image by author
Image by author
Calculate the information gain for Wind

Values(Wind) = Strong, Weak.

The entropy for Sunny:

Entropy(SSunny) = 0.97

The entropy for Strong:

Entropy(SStrong) = 1.0

The entropy for Weak:

Image by author

The Information gain for Wind:

Image by author
Image by author

Humidity gives the highest information gain value(0.97). So far, our Tree will look like this:

Image by author

Splitting on the Rain attribute

Image by author
Calculate the information gain for Temp

Values(Temp) = Mild, Cool

The entropy for Mild:

The entropy for Cool:

Entropy(SCool) = 1.0

The information gain for Temp:

Calculate the information gain for Humidity

Values(Humidity) = High, Normal.

The entropy for High:

Entropy(SHigh) = 1.0

The Entropy for Normal:

The Information gain for Humidity:

Calculate the information gain for Wind

Values(Wind) = Strong, Weak.

The entropy for Strong:

Entropy(SStrong) = 0

The entropy for Weak:

Entropy(SWeak) = 0

The information gain for Wind:

Wind gives the highest information gain value(0.97). Now we can complete our Decision Tree.

A complete decision tree with Entropy and Information gain criteria

Decision tree (splits criteria: Entropy and Information gain): Image by author

Gini impurity

Gini impurity is the probability of incorrectly classifying a random data point in a dataset. It is an impurity metric since it shows how the model differs from a pure division.

Unlike Entropy, Gini impurity has a maximum value of 0.5(very impure classification) and a minimum of 0(pure classification).

Gini impurity graph: Image by author

Gini impurity has the following formula:

Gini impurity formula: Image by author

Where:

If we have a dataset of classes C, P(i) is the probability of picking a data point of class i, and 1-P(i) is the probability of not picking a data point of class i.

When the above formula is simplified, we get the following formula:

Gini impurity formula: Image by author

Let's take the following dummy dataset, for example. We will use the Gini criteria to construct a decision tree from it.

First, we calculate the Gini for the whole dataset. We have 3 possible output variables:

  • Hire - 6 instances.
  • Reject - 2 instances.
  • Train - 2 instances.

So the Gini will be:

Image by author

Gini for the entire dataset is 0.56.

Next, we calculate the Gini for the features to decide on the root node.

Calculate Gini for Student Background

The "Student Background" feature has 3 possible values:

  • Math - 3 instances.
  • IT - 3 instances.
  • CS - 4 instances.

Gini for when the "Student Background" is in Math. There are 2 examples when the decision is Hire and 1 when the decision is Reject. So:

Image by author

We do the same for IT and CS, where all three Gini values will be:

  • Gini(Math) = 0.44.
  • Gini(CS) = 0.38.
  • Gini(IT) = 0.44.

Now we calculate the weighted average for "Student Background":

w_avrg = 0.44*(3/10) + 0.38*(4/10) + 0.44*(3/10)
w_avrg

# 0.41600000000000004

Gini for the Student background feature is 0.416.

Calculate Gini for Online Courses

The "Online Courses" feature has two possible values:

  • Yes - 5 instances.
  • No - 5 instances.

Gini for when the "Online Courses" is No. There is 1 example when the "Decision" is Hire, 2 examples when the "Decision" is Reject, and 2 examples when the "Decision" is "Train". So:

Image by author

The Gini for when the "Online Courses" is Yes is 0 since there are 5 out of 5 instances when the decision is Hire.

Now we calculate the weighted average for "Online Courses":

w_avrg = 0*(5/10) + 0.64*(5/10)
w_avrg

# 0.32

Gini for the Online Courses feature is 0.32.

Calculate Gini for Working

The "Working" feature has 2 possible values:

  • W(Working) - 7 instances.
  • NW(Not working) - 3 instances.

Gini for when the "Working" is W. There are 4 examples when the "Decision" is Hire, 2 examples when the "Decision" is Reject, and 1 example when the "Decision" is Train. So:

Image by author

We do the same for Not Working, and the Gini values will be :

  • Gini(W) = 0.57.
  • Gini(NW) = 0.44.

Calculate the weighted average for "Woking":

w_avrg = 0.57*(7/10) + 0.44*(3/10)
w_avrg

# 0.5309999999999999

Gini for the Working status feature is 0.53.

To decide on the root node, we check the Gini values achieved by each feature. The lower the Gini value, the better.

  • Gini(Student Background) = 0.416
  • Gini(Online Courses) = 0.32.
  • Gini(Working Status) = 0.53.
💡
The "Online Courses" feature has the least Gini value and thus provides the best information about our target variable. This qualifies it as the root node.
Image by author

Since the classification when "Online Course" is Yes results in one decision, we get a leaf node as Hire.

Image author

Since the right node is not pure, we will further split it.

Image by author

We will compute the Gini for when:

  • "Online Courses" is No, and "Student Background"
  • "Online Courses" is No and "Working."

Calculate the Gini for "Student Background"

When "Online Course" is No, and "Student Background" is Math, we have 2 instances with similar Decision(Reject). So the Gini value will automatically be 0.

When "Online Course" is No, and "Student Background" is IT, we have 1 instance: "Decision" is Train. So the Gini value will be 0.

When "Online Course" is No, and "Student Background" is CS, we have 2 instances: "Decision" is Train. So the Gini value will be 0.5.

Weighted average:

w_avrg = 0*(2/5) + 0*(1/5) + 0.5*(2/5)
w_avrg
# 0.2

Gini For "Student Background" is 0.2.

Calculate the Gini for "Working"

When Online Course is No, and "Working" is W. We get the Gini as:

Image by author

When "Online Course" is No, and "Working" is NW, we have 1 instance so the Gini value is 0:

Weighted average:

w_avrg = 0.63*(4/5) + 0*(1/5)
w_avrg
# 0.5

Gini For "Working" is 0.5.

Since "Student Background" has the least Gini value, it will be the next internal node. So now we can complete the decision tree.

A complete decision tree with Gini criteria

A complete Decision tree with Gini criteria: Image by author

Final thoughts

In this article, we have learned three splitting criteria used by decision trees. In general practice, we don't do all these calculations manually. When building the decision tree model, an algorithm will do all these calculations in the back end.

Check out our article on decision trees and random forests to learn how to build and optimize them.

Open On GitHub

The Complete Data Science and Machine Learning Bootcamp on Udemy is a great next step if you want to keep exploring the data science and machine learning field.

Follow us on LinkedIn, Twitter, and GitHub, and subscribe to our blog, so you don't miss a new issue.

Brian Mutea

Software Engineer | Data Scientist with an appreciable passion for building models that fix problems and sharing knowledge.

Discussion

Community guidelines