You can find working code examples (including this one) in my lab repository on GitHub.

It's almost certain that anyone of us has played the game of "Twenty Questions" or a variation of it at least once in their lifetime. In case you're not familiar with the rules, here's how it works:

The game starts with one person (the "Answerer") thinking about a subject or object without revealing it. The others (the "Questioners") are encouraged to ask questions about such subject or object. Questions should be asked such that they can only be answered in a "Yes" or "No" fashion. The current Questioner continues to ask questions if her previous question was answered with a "Yes" and has to hand over the question asking to the next Questioner if the current question was answered with a "No". The Questioner who successfully guessed the subject or object the Answerer thought about has won and will be the next Answerer.

Let's walk through a quick example to see the game in action. In this case I'm the Answerer and you're a Questioner. Since you're the only Questioner you're allowed to continue asking questions even if one of the answers is "No":

You: "Is the subject or object you're thinking about a person?"

Me: "Yes"

You: "Is the person you're thinking about a non-fictional character?"

Me: "No"

You: "Did the person you're thinking about appear in comics?"

Me: "Yes"

You: "Has the person you're thinking about superpowers?"

Me: "Yes"

You: "Is the person one of the Avengers?"

You: "Yes"

You: "Did the person appear in a movie?"

Me: "Yes"

You: "Is the person you're thinking about Iron Man?"

Me: "Yes! Bingo!"

While this is certainly a contrived example it shows one of the core techniques Decision Trees, the topic of this post, use to constuct themselves while "learning" from training data.

In the following sections we'll take a deep dive into Decision Tree-based Machine Learning models. We'll examine their properties, learn how we can ensure that the best tree is constructed given the data we're feeding it and how we can turn this knowledge into code to create our very own Decision Tree classifier which predicts if it's a good idea to practice our swings on the golf course given some data about the current wheather conditions. Let's dive in!

## Decision Trees Properties

A Decision Tree is exactly what its name implies. A tree-like structure which makes it possible to model decisions and their consequences. In fact you've already built and used a Decision Tree model while we played the game of "Twenty Questions" in the introduction above.

Let's visualize the Decision Tree you used during the game to better understand how it guided you to the prediction that the character I was thinking about was indeed Iron Man:

In the image above we can see that the Decision Tree consists of Nodes and Edges which point from parent Nodes to child Nodes. In our case every Node contained a question such as "Is it a person?" while the Edges contained the respective answers: "Yes" or "No". In our case the labels were binary which means that every question can be answered in one of two ways: "Yes" or "No".

While we were playing the game you walked down the tree starting with the root Node at the top, following the Edges and Nodes until you eventually arrived at the Node which had no outgoing Edge: Your prediction.

Generally speaking there are 2 main Decision Tree models both of which differ in the prediction they produce:

The **Classification Tree** is a tree where the prediction is categorical. The tree we've built above is a classification tree as its output will always yield a result from a category such as "Superheros" or more specifically "Iron Man".

The **Regression Tree** is a tree in which outputs can be continous. An example for continous values are the real numbers.

While this only defines the type of outputs a tree computes there's nothing preventing us from mixing and matching categorical and continous values in the tree creation phase. Therefore a tree can have categorical as well as continuous Edge labels and output types.

Given what we've learned so far about Decision Trees, it's easy to see that one of the huge advantages is the models understandability. Looking at the trees visual representation it's easy to navigate around, gauge how well it might perform and troubleshoot issues which might arise during real-world usage.

One area of improvement we could identify is the way in which our tree is splitted. We could've for example skipped the question "Did the person appear in a movie?" since it's highly likely that there's a movie about most of the Avengers. On the other hand nailing that it's Iron Man involved quite some luck. What about Hulk or Thor? It might've been better to insert yet another question to ensure that you're on the right path.

In which way should we construct our tree such that every decision we're making (deciding which Edge to follow) maximizes the information we're getting such that we can make a prediction as accurate as possible? Is it possible to capture this notion mathematically?

## Gini Impurity

As it turns out there are a few different mathematical formulas at our disposal to quantify the information we'd gain when following certain paths in our Decision Tree.

One such metric is the so-called Gini Impurity. With Gini Impurity we measure how often we'd incorrectly label a randomly chosen element from our data set if it was randomly labeled according to the class distribution of such data set.

The formula is as follows:

\[ G = \sum_{i=1}^C p_i \times (1 - p_i) \]

where \(C\) are the classes and \(p_i\) is the probability of picking an element of class \(i\) at random.

Now if this definition makes your head spin you're not alone. Let's work through an example to understand what we're calculating and how Gini Impurity is used in practice.

### Weather Forecasting

Let's pretend that we've collected some meteorology data over the last couple of years and want to use it to determine if we should bring an umbrella with us given the current weather conditions we observe.

Among other things we've collected 14 data points about the **wind** and **temperature**:

Wind | Yes | No | Total |
---|---|---|---|

Weak | 4 | 3 | 7 |

Strong | 6 | 1 | 7 |

14 | |||

Temperature | Yes | No | Total |

High | 3 | 2 | 5 |

Low | 7 | 2 | 9 |

14 |

As one can see we've categorized the wind into "weak" and "strong" and the temperature into "high" and "low". During a day where the wind was "strong" it rained \(6\) out of \(7\) times. At the same time it rained \(7\) out of \(9\) times when the temperature was "low".

Our goal is to determine which one of these observations provides us more certainity whether it'll rain and therefore if we should take an umbrella with us. We can figure that out by applying the Gini Impurity formula from above.

Let's start with the Gini Impurity calculation for **Wind**. We're dealing with 2 separate cases here as there can be either "Weak" or "Strong" wind. for every case we have \(2\) different classes (\(C\)) we can use to label our observation: "Yes" and "No". This means that we need to do \(2\) Gini Impurity calculations (one for each case). Plugging in the numbers results in the following:

\[ Gini(\textrm{Wind=Weak}) = \frac{4}{7} \times (1 - \frac{4}{7}) + \frac{3}{7} \times (1 - \frac{3}{7}) = \frac{24}{49} \approx 0.49 \]

\[ Gini(\textrm{Wind=Strong}) = \frac{6}{7} \times (1 - \frac{6}{7}) + \frac{1}{7} \times (1 - \frac{1}{7}) = \frac{12}{49} \approx 0.24 \]

Since we have 2 results (one for each case) we now need to calculate the weighted sum to get the overall result for our **Wind** observation:

\[ Gini(\textrm{Wind}) = \frac{7}{14} \times 0.49 + \frac{7}{14} \times 0.24 \approx 0.37 \]

Next up we do the Gini Impurity calculation for **Temperature**. We're following the exact same steps as we did with our **Wind** calculation above. We're also dealing with 2 different cases ("High" and "Low") as well as \(2\) different classes \(C\) ("Yes" and "No"). Here are the calculation:

\[ Gini(\textrm{Temp=Low}) = \frac{3}{5} \times (1 - \frac{3}{5}) + \frac{2}{5} \times (1 - \frac{2}{5}) = \frac{12}{25} \approx 0.48 \]

\[ Gini(\textrm{Temp=High}) = \frac{7}{9} \times (1 - \frac{7}{9}) + \frac{2}{9} \times (1 - \frac{2}{9}) = \frac{28}{81} \approx 0.35 \]

\[ Gini(\textrm{Temp}) = \frac{5}{14} \times 0.48 + \frac{9}{14} \times 0.35 \approx 0.4 \]

Putting it all together we've calculated a Gini Impurity of \(\approx 0.37\) for our **Wind** observations and a Gini Impurity of \(\approx 0.4\) for our **Temperature** observations. Overall we're trying to work with the purest data possible so we'll decide to go with our data on **Wind** to determine whether we should bring an umbrella with us.

## Building a Decision Tree

In the last section we've learned about a way to quantify purity and impurity of data via a metric called Gini Impurity. In this chapter we'll apply what we've learned so far by putting all the pieces together and build our very own Decision Tree to help us decide if we should play golf given the current weather conditions we observe.

Based on the Gini Impurity metric we'll examine our weather data and split it into different subsets which will reveal as much information as possible as to whether we should go golfing or not. These subsets will be arranged as a tree structure in a way that we can easily trace our decision through it based on our current observations. Our main goal is to come to a conclusion by taking into account as little weather data as possible.

### The data set

Let's start by looking at the `golf.csv`

data set we'll be working with:

Outlook | Temp | Humidity | Windy | Play |
---|---|---|---|---|

Rainy | Hot | High | f | no |

Rainy | Hot | High | t | no |

Overcast | Hot | High | f | yes |

Sunny | Mild | High | f | yes |

Sunny | Cool | Normal | f | yes |

Sunny | Cool | Normal | t | no |

Overcast | Cool | Normal | t | yes |

Rainy | Mild | High | f | no |

Rainy | Cool | Normal | f | yes |

Sunny | Mild | Normal | f | yes |

Rainy | Mild | Normal | t | yes |

Overcast | Mild | High | t | yes |

Overcast | Hot | Normal | f | yes |

Sunny | Mild | High | t | no |

We can see that there are a couple of different features such as **Outlook**, **Temperature, Humidity** and **Wind** we can take into account.

The following command will download the `.csv`

file to our local machine:

`wget -O "data/golf.csv" -nc -P data https://raw.githubusercontent.com/husnainfareed/Simple-Naive-Bayes-Weather-Prediction/c75b2fa747956ee9b5f9da7b2fc2865be04c618c/new_dataset.csv`

Next up we should parse the `.csv`

file and store the result in a structured format. In our case we create a `NamedTuple`

called `DataPoint`

in which we'll store one row of information:

```
golf_data_path: Path = data_dir / 'golf.csv'
class DataPoint(NamedTuple):
outlook: str
temp: str
humidity: str
windy: bool
play: bool
data_points: List[DataPoint] = []
with open(golf_data_path) as csv_file:
reader = csv.reader(csv_file, delimiter=',')
next(reader, None)
for row in reader:
outlook: str = row[0].lower()
temp: str = row[1].lower()
humidty: str = row[2].lower()
windy: bool = True if row[3].lower() == 't' else False
play: bool = True if row[4].lower() == 'yes' else False
data_point: DataPoint = DataPoint(outlook, temp, humidty, windy, play)
data_points.append(data_point)
```

*Notice that we've parsed string such as t or yes into their respective Boolean equivalents.*

Our `data_points`

list now contains a dedicated `DataPoint`

container for every `.csv`

file row:

```
data_points[:5]
# [DataPoint(outlook='rainy', temp='hot', humidity='high', windy=False, play=False),
# DataPoint(outlook='rainy', temp='hot', humidity='high', windy=True, play=False),
# DataPoint(outlook='overcast', temp='hot', humidity='high', windy=False, play=True),
# DataPoint(outlook='sunny', temp='mild', humidity='high', windy=False, play=True),
# DataPoint(outlook='sunny', temp='cool', humidity='normal', windy=False, play=True)]
```

### Some helper functions

Great! We now have the data in a structured format which makes it easier to work with. Let's exploit this structure by implementing some helper functions.

The first function takes an arbitrary `data_set`

and a list of tuples to filter such `data_set`

based on the tuple values:

```
def filter_by_feature(data_points: List[DataPoint], *args) -> List[DataPoint]:
result: List[DataPoint] = deepcopy(data_points)
for arg in args:
feature: str = arg[0]
value: Any = arg[1]
result = [data_point for data_point in result if getattr(data_point, feature) == value]
return result
assert len(filter_by_feature(data_points, ('outlook', 'sunny'))) == 5
assert len(filter_by_feature(data_points, ('outlook', 'sunny'), ('temp', 'mild'))) == 3
assert len(filter_by_feature(data_points, ('outlook', 'sunny'), ('temp', 'mild'), ('humidity', 'high'))) == 2
```

** Note**: If you're familiar with SQL you can think of it as a

`SELECT * FROM data_set WHERE tuple_1[0] = tuple_1[1] AND tuple_2[0] = tuple_2[1] ...;`

statement.The second helper function extracts all the values a feature can assume:

```
def feature_values(data_points: List[DataPoint], feature: str) -> List[Any]:
return list(set([getattr(dp, feature) for dp in data_points]))
assert feature_values(data_points, 'outlook') == ['sunny', 'overcast', 'rainy']
```

Those 2 functions will greatly help us later on when we recursively build our tree.

### Gini Impurity

As you might've already guessed it we need to implement the formula to calculate the Gini Impurity metric which will help us to determine how pure our data will be after we split it up into different subsets.

As a quick reminder here's the formula:

\[ G = \sum_{i=1}^C p_i \times (1 - p_i) \]

Which translates to the following code:

```
def gini(data: List[Any]) -> float:
counter: Counter = Counter(data)
classes: List[Any] = list(counter.keys())
num_items: int = len(data)
result: float = 0
item: Any
for item in classes:
p_i: float = counter[item] / num_items
result += p_i * (1 - p_i)
return result
assert gini(['one', 'one']) == 0
assert gini(['one', 'two']) == 0.5
assert gini(['one', 'two', 'one', 'two']) == 0.5
assert 0.8 < gini(['one', 'two', 'three', 'four', 'five']) < 0.81
```

Notice the tests at the bottom. If our data set consists of only one class the impurity is \(0\). If our data set is splitted \(50/50\) our impurity metric is \(0.5\).

We can build a helper function on top of this abstract `gini`

implementation which helps us to caluclate the weighted Gini Impurities for any given feature. This function does the same thing we've done manually above when we've explored our weather data set which consisted of the features **Wind** and **Temperature**.

```
# Calculate the weighted sum of the Gini impurities for the `feature` in question
def gini_for_feature(data_points: List[DataPoint], feature: str, label: str = 'play') -> float:
total: int = len(data_points)
# Distinct values the `feature` in question can assume
dist_values: List[Any] = feature_values(data_points, feature)
# Calculate all the Gini impurities for every possible value a `feature` can assume
ginis: Dict[str, float] = defaultdict(float)
ratios: Dict[str, float] = defaultdict(float)
for value in dist_values:
filtered: List[DataPoint] = filter_by_feature(data_points, (feature, value))
labels: List[Any] = [getattr(dp, label) for dp in filtered]
ginis[value] = gini(labels)
# We use the ratio when we compute the weighted sum later on
ratios[value] = len(labels) / total
# Calculate the weighted sum of the `feature` in question
weighted_sum: float = sum([ratios[key] * value for key, value in ginis.items()])
return weighted_sum
assert 0.34 < gini_for_feature(data_points, 'outlook') < 0.35
assert 0.44 < gini_for_feature(data_points, 'temp') < 0.45
assert 0.36 < gini_for_feature(data_points, 'humidity') < 0.37
assert 0.42 < gini_for_feature(data_points, 'windy') < 0.43
```

Here is the manual calculation for **Humidity**:

\[ Gini(\textrm{Humidity=High}) = \frac{3}{7} \times (1 - \frac{3}{7}) + \frac{4}{7} \times (1 - \frac{4}{7}) = \frac{24}{49} \approx 0.49 \]

\[ Gini(\textrm{Humidity=Normal}) = \frac{6}{7} \times (1 - \frac{6}{7}) + \frac{1}{7} \times (1 - \frac{1}{7}) = \frac{12}{49} \approx 0.24 \]

\[ Gini(\textrm{Temp}) = \frac{7}{14} \times 0.49 + \frac{7}{14} \times 0.24 \approx 0.365 \]

### The Tree Data Structure

We represent our Tree data structure via a `Node`

and `Edge`

class. Every `Node`

has a value and can point to zero or more `Edges`

(out-`Edge`

). It's one of the trees leafs if it doens't have an out-`Edge`

.

An `Edge`

has a value also and points to exactly \(1\) `Node`

.

```
# A `Node` has a `value` and optional out `Edge`s
class Node:
def __init__(self, value):
self._value = value
self._edges = []
def __repr__(self):
if len(self._edges):
return f'{self._value} --> {self._edges}'
else:
return f'{self._value}'
@property
def value(self):
return self._value
def add_edge(self, edge):
self._edges.append(edge)
def find_edge(self, value):
return next(edge for edge in self._edges if edge.value == value)
# An `Edge` has a value and points to a `Node`
class Edge:
def __init__(self, value):
self._value = value
self._node = None
def __repr__(self):
return f'{self._value} --> {self._node}'
@property
def value(self):
return self._value
@property
def node(self):
return self._node
@node.setter
def node(self, node):
self._node = node
```

### Building the Tree via CART

We finally have all the pieces in place to recursively build our Decision Tree. There are several different tree building algorithms out there such as ID3, C4.5 or CART. The Gini Impurity metric is a natural fit for the CART algorithm, so we'll implement that.

The following is the whole `build_tree`

code. It might look a bit intimidating at first since it implements the CART algorithm via recursion.

Take some time to thoroughly read through the code and its comments. Below we'll describe in prose how exactly the `build_tree`

function works.

```
def build_tree(data_points: List[DataPoint], features: List[str], label: str = 'play') -> Node:
# Ensure that the `features` list doesn't include the `label`
features.remove(label) if label in features else None
# Compute the weighted Gini impurity for each `feature` given that we'd split the tree at the `feature` in question
weighted_sums: Dict[str, float] = defaultdict(float)
for feature in features:
weighted_sums[feature] = gini_for_feature(data_points, feature)
# If all the weighted Gini impurities are 0.0 we create a final `Node` (leaf) with the given `label`
weighted_sum_vals: List[float] = list(weighted_sums.values())
if (float(0) in weighted_sum_vals and len(set(weighted_sum_vals)) == 1):
label = getattr(data_points[0], 'play')
return Node(label)
# The `Node` with the most minimal weighted Gini impurity is the one we should use for splitting
min_feature = min(weighted_sums, key=weighted_sums.get)
node: Node = Node(min_feature)
# Remove the `feature` we've processed from the list of `features` which still need to be processed
reduced_features: List[str] = deepcopy(features)
reduced_features.remove(min_feature)
# Next up we build the `Edge`s which are the values our `min_feature` can assume
for value in feature_values(data_points, min_feature):
# Create a new `Edge` which contains a potential `value` of our `min_feature`
edge: Edge = Edge(value)
# Add the `Edge` to our `Node`
node.add_edge(edge)
# Filter down the data points we'll use next since we've just processed the set which includes our `min_feature`
reduced_data_points: List[DataPoint] = filter_by_feature(data_points, (min_feature, value))
# This `Edge` points to the new `Node` (subtree) we'll create through recursion
edge.node = build_tree(reduced_data_points, reduced_features)
# Return the `Node` (our `min_feature`)
return node
```

The first thing our `build_tree`

function does is that it ensures that the list of features we have to examine doesn't include the `label`

we try to predict later on. In our case the label is called `play`

which indicates if we went playing given the weather data. If we find the `play`

label we simply remove it from such list.

Next up we'll compute the weighted gini impurities for all the features in our `features`

list. (*Note that this list will constantly shrink as we recursively build the tree. More on that in a minute*)

If all our `features`

impurities are `0.0`

we're done since we're now dealing with "pure" data. In that case we just create a `Node`

instance with the `label`

as its value and return that `Node`

. This is the case where we end the recursive call.

Otherwise (there's still some impurity in the data) we pick the `feature`

from the `weighted_sums`

dictionary with the smallest score since the smaller the Gini Impurity, the purer our data. We'll create a new `Node`

for this `min_feature`

and add it to our tree. This feature can be removed from our "to-process" list of features. We call this new list (which doesn't include the feature) `reduced_features`

.

Our `min_feature`

`Node`

isn't a leaf so it has to have some `Edge`

s. To build the `Edge`

s we iterate through all the values our feature can assume and create `Edge`

instances based on such values. Since the edges have to point somewhere we call `build_tree`

again with our `reduced_features`

list to determine the `Node`

(the next `min_feature`

) the `Edge`

s should point to.

And that's pretty much all there is. The function will recurse until it reaches the trees leaf nodes which don't have any outgoing edges and contain the `label`

values.

Let's build our tree with out `golf.csv`

`data_points`

:

```
features: List[str] = list(DataPoint._fields)
tree: Node = build_tree(data_points, features)
tree
# outlook --> [overcast --> True, sunny --> windy --> [False --> True, True --> False], rainy --> humidity --> [normal --> True, high --> False]]
```

We can already see the textual representation but it might be a better idea to add a visual representation as well:

So should we go out when the **Outlook** is "rainy" and the **Humidity** is "normal"?

Seems like we've enjoyed playing golf in those situations in the past!

## Random Forest

During your journey in the lands of Machine Learning you might come across the terms "Random Forest", "Bagging" and "Boosting".

The problem with Decision Trees is that they tend to overfit easily to the data you train them on. A way to avoid overfitting while also getting better results in general is to build multiple Decision Trees (each with a subset of the training data) and then use them in concert to do predictions.

If you're working with regression trees you'd compute the average of all the trees outputs. If your trees are of categorical nature you'd perform a majority vote to decide which prediction to return.

Combining multiple Decision Trees into one large ensemble is called "Random Forest". Related to Random Forests are the techniques "Bagging" ("Boostrap aggregation") and "Boosting". But that's for another post.

## Conclusion

Everyone of us is confronted with different decision-making processes every single day. Naturally we examine the different scenarios in order to find the best decision we can take given the data we have at hand.

Decision Trees are a class of Machine Learning algorithms which follow the same idea. In order to build an efficient Decision Tree we need to decide the best way to split our data set into different subsets such that every new subset includes as much pure data as possible. Gini Impurity is one metric which makes it possible to mathematically determine the impurity for any data split we attempt.

Using Gini Impurity we implemented the CART algorithm and built our very own Decision Tree to determine if we should play golf based on the current weather conditions.

Given their visual nature, Decision Trees are a class of Machine Learning algorithms which is easy to inspect, debug and understand. On the other hand they also tend to overfit to the data they're trained on. Combining multiple, different Decision Trees into a so-called "Random Forest" mitigates this problems, turnig single Decision Trees into a powerful ensemble.

## Additional Resources

The following is a list with resources I've used while working on this article. Other, interesting resources are linked within the article itself: