The Intersection

Decision Trees from scratch

Thu, 13 Aug 2020 07:17:57 GMT

You can find working code examples (including this one) in my lab repository on GitHub.

It's almost certain that anyone of us has played the game of "Twenty Questions" or a variation of it at least once in their lifetime. In case you're not familiar with the rules, here's how it works:

The game starts with one person (the "Answerer") thinking about a subject or object without revealing it. The others (the "Questioners") are encouraged to ask questions about such subject or object. Questions should be asked such that they can only be answered in a "Yes" or "No" fashion. The current Questioner continues to ask questions if her previous question was answered with a "Yes" and has to hand over the question asking to the next Questioner if the current question was answered with a "No". The Questioner who successfully guessed the subject or object the Answerer thought about has won and will be the next Answerer.

Let's walk through a quick example to see the game in action. In this case I'm the Answerer and you're a Questioner. Since you're the only Questioner you're allowed to continue asking questions even if one of the answers is "No":

You: "Is the subject or object you're thinking about a person?"

Me: "Yes"

You: "Is the person you're thinking about a non-fictional character?"

Me: "No"

You: "Did the person you're thinking about appear in comics?"

Me: "Yes"

You: "Has the person you're thinking about superpowers?"

Me: "Yes"

You: "Is the person one of the Avengers?"

You: "Yes"

You: "Did the person appear in a movie?"

Me: "Yes"

You: "Is the person you're thinking about Iron Man?"

Me: "Yes! Bingo!"

While this is certainly a contrived example it shows one of the core techniques Decision Trees, the topic of this post, use to construct themselves while "learning" from training data.

In the following sections we'll take a deep dive into Decision Tree-based Machine Learning models. We'll examine their properties, learn how we can ensure that the best tree is constructed given the data we're feeding it and how we can turn this knowledge into code to create our very own Decision Tree classifier which predicts if it's a good idea to practice our swings on the golf course given some data about the current weather conditions. Let's dive in!

Decision Trees Properties

A Decision Tree is exactly what its name implies. A tree-like structure which makes it possible to model decisions and their consequences. In fact you've already built and used a Decision Tree model while we played the game of "Twenty Questions" in the introduction above.

Let's visualize the Decision Tree you used during the game to better understand how it guided you to the prediction that the character I was thinking about was indeed Iron Man:

In the image above we can see that the Decision Tree consists of Nodes and Edges which point from parent Nodes to child Nodes. In our case every Node contained a question such as "Is it a person?" while the Edges contained the respective answers: "Yes" or "No". In our case the labels were binary which means that every question can be answered in one of two ways: "Yes" or "No".

While we were playing the game you walked down the tree starting with the root Node at the top, following the Edges and Nodes until you eventually arrived at the Node which had no outgoing Edge: Your prediction.

Generally speaking there are 2 main Decision Tree models both of which differ in the prediction they produce:

The Classification Tree is a tree where the prediction is categorical. The tree we've built above is a classification tree as its output will always yield a result from a category such as "Superheros" or more specifically "Iron Man".

The Regression Tree is a tree in which outputs can be continuous. An example for continuous values are the real numbers.

While this only defines the type of outputs a tree computes there's nothing preventing us from mixing and matching categorical and continuous values in the tree creation phase. Therefore a tree can have categorical as well as continuous Edge labels and output types.

Given what we've learned so far about Decision Trees, it's easy to see that one of the huge advantages is the models understandability. Looking at the trees visual representation it's easy to navigate around, gauge how well it might perform and troubleshoot issues which might arise during real-world usage.

One area of improvement we could identify is the way in which our tree is split. We could've for example skipped the question "Did the person appear in a movie?" since it's highly likely that there's a movie about most of the Avengers. On the other hand nailing that it's Iron Man involved quite some luck. What about Hulk or Thor? It might've been better to insert yet another question to ensure that you're on the right path.

In which way should we construct our tree such that every decision we're making (deciding which Edge to follow) maximizes the information we're getting such that we can make a prediction as accurate as possible? Is it possible to capture this notion mathematically?

Gini Impurity

As it turns out there are a few different mathematical formulas at our disposal to quantify the information we'd gain when following certain paths in our Decision Tree.

One such metric is the so-called Gini Impurity. With Gini Impurity we measure how often we'd incorrectly label a randomly chosen element from our data set if it was randomly labeled according to the class distribution of such data set.

The formula is as follows:

$$ G = \sum_{i=1}^C p_i \times (1 - p_i) $$

where $C$ are the classes and $p_i$ is the probability of picking an element of class $i$ at random.

Now if this definition makes your head spin you're not alone. Let's work through an example to understand what we're calculating and how Gini Impurity is used in practice.

Weather Forecasting

Let's pretend that we've collected some meteorology data over the last couple of years and want to use it to determine if we should bring an umbrella with us given the current weather conditions we observe.

Among other things we've collected 14 data points about the wind and temperature:

| Wind | Yes | No | Total | | ------ | --- | --- | ----- | | Weak | 4 | 3 | 7 | | Strong | 6 | 1 | 7 | | | | | 14 |

| Temperature | Yes | No | Total | | ----------- | --- | --- | ----- | | High | 3 | 2 | 5 | | Low | 7 | 2 | 9 | | | | | 14 |

As one can see we've categorized the wind into "weak" and "strong" and the temperature into "high" and "low". During a day where the wind was "strong" it rained $6$ out of $7$ times. At the same time it rained $7$ out of $9$ times when the temperature was "low".

Our goal is to determine which one of these observations provides us more certainty whether it'll rain and therefore if we should take an umbrella with us. We can figure that out by applying the Gini Impurity formula from above.

Let's start with the Gini Impurity calculation for Wind. We're dealing with 2 separate cases here as there can be either "Weak" or "Strong" wind. for every case we have $2$ different classes ($C$) we can use to label our observation: "Yes" and "No". This means that we need to do $2$ Gini Impurity calculations (one for each case). Plugging in the numbers results in the following:

$$ Gini(\textrm{Wind=Weak}) = \frac{4}{7} \times (1 - \frac{4}{7}) + \frac{3}{7} \times (1 - \frac{3}{7}) = \frac{24}{49} \approx 0.49 $$

$$ Gini(\textrm{Wind=Strong}) = \frac{6}{7} \times (1 - \frac{6}{7}) + \frac{1}{7} \times (1 - \frac{1}{7}) = \frac{12}{49} \approx 0.24 $$

Since we have 2 results (one for each case) we now need to calculate the weighted sum to get the overall result for our Wind observation:

$$ Gini(\textrm{Wind}) = \frac{7}{14} \times 0.49 + \frac{7}{14} \times 0.24 \approx 0.37 $$

Next up we do the Gini Impurity calculation for Temperature. We're following the exact same steps as we did with our Wind calculation above. We're also dealing with 2 different cases ("High" and "Low") as well as $2$ different classes $C$ ("Yes" and "No"). Here are the calculation:

$$ Gini(\textrm{Temp=Low}) = \frac{3}{5} \times (1 - \frac{3}{5}) + \frac{2}{5} \times (1 - \frac{2}{5}) = \frac{12}{25} \approx 0.48 $$

$$ Gini(\textrm{Temp=High}) = \frac{7}{9} \times (1 - \frac{7}{9}) + \frac{2}{9} \times (1 - \frac{2}{9}) = \frac{28}{81} \approx 0.35 $$

$$ Gini(\textrm{Temp}) = \frac{5}{14} \times 0.48 + \frac{9}{14} \times 0.35 \approx 0.4 $$

Putting it all together we've calculated a Gini Impurity of $\approx 0.37$ for our Wind observations and a Gini Impurity of $\approx 0.4$ for our Temperature observations. Overall we're trying to work with the purest data possible so we'll decide to go with our data on Wind to determine whether we should bring an umbrella with us.

Building a Decision Tree

In the last section we've learned about a way to quantify purity and impurity of data via a metric called Gini Impurity. In this chapter we'll apply what we've learned so far by putting all the pieces together and build our very own Decision Tree to help us decide if we should play golf given the current weather conditions we observe.

Based on the Gini Impurity metric we'll examine our weather data and split it into different subsets which will reveal as much information as possible as to whether we should go golfing or not. These subsets will be arranged as a tree structure in a way that we can easily trace our decision through it based on our current observations. Our main goal is to come to a conclusion by taking into account as little weather data as possible.

The data set

Let's start by looking at the golf.csv data set we'll be working with:

| Outlook | Temp | Humidity | Windy | Play | | -------- | ---- | -------- | ----- | ---- | | Rainy | Hot | High | f | no | | Rainy | Hot | High | t | no | | Overcast | Hot | High | f | yes | | Sunny | Mild | High | f | yes | | Sunny | Cool | Normal | f | yes | | Sunny | Cool | Normal | t | no | | Overcast | Cool | Normal | t | yes | | Rainy | Mild | High | f | no | | Rainy | Cool | Normal | f | yes | | Sunny | Mild | Normal | f | yes | | Rainy | Mild | Normal | t | yes | | Overcast | Mild | High | t | yes | | Overcast | Hot | Normal | f | yes | | Sunny | Mild | High | t | no |

We can see that there are a couple of different features such as Outlook, Temperature, Humidity and Wind we can take into account.

The following command will download the .csv file to our local machine:

wget -O "data/golf.csv" -nc -P data https://raw.githubusercontent.com/husnainfareed/Simple-Naive-Bayes-Weather-Prediction/c75b2fa747956ee9b5f9da7b2fc2865be04c618c/new_dataset.csv

Next up we should parse the .csv file and store the result in a structured format. In our case we create a [NamedTuple](https://docs.python.org/3/library/collections.html#collections.namedtuple) called DataPoint in which we'll store one row of information:

golf_data_path: Path = data_dir / 'golf.csv'

class DataPoint(NamedTuple):
    outlook: str
    temp: str
    humidity: str
    windy: bool
    play: bool

data_points: List[DataPoint] = []

with open(golf_data_path) as csv_file:
    reader = csv.reader(csv_file, delimiter=',')
    next(reader, None)
    for row in reader:
        outlook: str = row[0].lower()
        temp: str = row[1].lower()
        humidity: str = row[2].lower()
        windy: bool = True if row[3].lower() == 't' else False
        play: bool = True if row[4].lower() == 'yes' else False
        data_point: DataPoint = DataPoint(outlook, temp, humidity, windy, play)
        data_points.append(data_point)

Notice that we've parsed string such as t or yes into their respective Boolean equivalents.

Our data_points list now contains a dedicated DataPoint container for every .csv file row:

data_points[:5]

# [DataPoint(outlook='rainy', temp='hot', humidity='high', windy=False, play=False),
# DataPoint(outlook='rainy', temp='hot', humidity='high', windy=True, play=False),
# DataPoint(outlook='overcast', temp='hot', humidity='high', windy=False, play=True),
# DataPoint(outlook='sunny', temp='mild', humidity='high', windy=False, play=True),
# DataPoint(outlook='sunny', temp='cool', humidity='normal', windy=False, play=True)]

Some helper functions

Great! We now have the data in a structured format which makes it easier to work with. Let's exploit this structure by implementing some helper functions.

The first function takes an arbitrary data_set and a list of tuples to filter such data_set based on the tuple values:

def filter_by_feature(data_points: List[DataPoint], *args) -> List[DataPoint]:
    result: List[DataPoint] = deepcopy(data_points)
    for arg in args:
        feature: str = arg[0]
        value: Any = arg[1]
        result = [data_point for data_point in result if getattr(data_point, feature) == value]
    return result

assert len(filter_by_feature(data_points, ('outlook', 'sunny'))) == 5
assert len(filter_by_feature(data_points, ('outlook', 'sunny'), ('temp', 'mild'))) == 3
assert len(filter_by_feature(data_points, ('outlook', 'sunny'), ('temp', 'mild'), ('humidity', 'high'))) == 2

*Note: If you're familiar with SQL you can think of it as a SELECT * FROM data_set WHERE tuple_1[0] = tuple_1[1] AND tuple_2[0] = tuple_2[1] ...; statement.*

The second helper function extracts all the values a feature can assume:

def feature_values(data_points: List[DataPoint], feature: str) -> List[Any]:
    return list(set([getattr(dp, feature) for dp in data_points]))

assert feature_values(data_points, 'outlook') == ['sunny', 'overcast', 'rainy']

Those 2 functions will greatly help us later on when we recursively build our tree.

Gini Impurity

As you might've already guessed it we need to implement the formula to calculate the Gini Impurity metric which will help us to determine how pure our data will be after we split it up into different subsets.

As a quick reminder here's the formula:

$$ G = \sum_{i=1}^C p_i \times (1 - p_i) $$

Which translates to the following code:

def gini(data: List[Any]) -> float:
    counter: Counter = Counter(data)
    classes: List[Any] = list(counter.keys())
    num_items: int = len(data)
    result: float = 0
    item: Any
    for item in classes:
        p_i: float = counter[item] / num_items
        result += p_i * (1 - p_i)
    return result

assert gini(['one', 'one']) == 0
assert gini(['one', 'two']) == 0.5
assert gini(['one', 'two', 'one', 'two']) == 0.5
assert 0.8 < gini(['one', 'two', 'three', 'four', 'five']) < 0.81

Notice the tests at the bottom. If our data set consists of only one class the impurity is $0$. If our data set is splitted $50/50$ our impurity metric is $0.5$.

We can build a helper function on top of this abstract gini implementation which helps us to caluclate the weighted Gini Impurities for any given feature. This function does the same thing we've done manually above when we've explored our weather data set which consisted of the features Wind and Temperature.

# Calculate the weighted sum of the Gini impurities for the `feature` in question
def gini_for_feature(data_points: List[DataPoint], feature: str, label: str = 'play') -> float:
    total: int = len(data_points)
    # Distinct values the `feature` in question can assume
    dist_values: List[Any] = feature_values(data_points, feature)
    # Calculate all the Gini impurities for every possible value a `feature` can assume
    ginis: Dict[str, float] = defaultdict(float)
    ratios: Dict[str, float] = defaultdict(float)
    for value in dist_values:
        filtered: List[DataPoint] = filter_by_feature(data_points, (feature, value))
        labels: List[Any] = [getattr(dp, label) for dp in filtered]
        ginis[value] = gini(labels)
        # We use the ratio when we compute the weighted sum later on
        ratios[value] = len(labels) / total
    # Calculate the weighted sum of the `feature` in question
    weighted_sum: float = sum([ratios[key] * value for key, value in ginis.items()])
    return weighted_sum

assert 0.34 < gini_for_feature(data_points, 'outlook') < 0.35
assert 0.44 < gini_for_feature(data_points, 'temp') < 0.45
assert 0.36 < gini_for_feature(data_points, 'humidity') < 0.37
assert 0.42 < gini_for_feature(data_points, 'windy') < 0.43

Here is the manual calculation for Humidity:

$$ Gini(\textrm{Humidity=High}) = \frac{3}{7} \times (1 - \frac{3}{7}) + \frac{4}{7} \times (1 - \frac{4}{7}) = \frac{24}{49} \approx 0.49 $$

$$ Gini(\textrm{Humidity=Normal}) = \frac{6}{7} \times (1 - \frac{6}{7}) + \frac{1}{7} \times (1 - \frac{1}{7}) = \frac{12}{49} \approx 0.24 $$

$$ Gini(\textrm{Temp}) = \frac{7}{14} \times 0.49 + \frac{7}{14} \times 0.24 \approx 0.365 $$

The Tree Data Structure

We represent our Tree data structure via a Node and Edge class. Every Node has a value and can point to zero or more Edges (out-Edge). It's one of the trees leafs if it doesn't have an out-Edge.

An Edge has a value also and points to exactly $1$ Node.

# A `Node` has a `value` and optional out `Edge`s
class Node:
    def __init__(self, value):
        self._value = value
        self._edges = []

    def __repr__(self):
        if len(self._edges):
            return f'{self._value} --> {self._edges}'
        else:
            return f'{self._value}'

    @property
    def value(self):
        return self._value

    def add_edge(self, edge):
        self._edges.append(edge)

    def find_edge(self, value):
        return next(edge for edge in self._edges if edge.value == value)

# An `Edge` has a value and points to a `Node`
class Edge:
    def __init__(self, value):
        self._value = value
        self._node = None

    def __repr__(self):
        return f'{self._value} --> {self._node}'

    @property
    def value(self):
        return self._value

    @property
    def node(self):
        return self._node

    @node.setter
    def node(self, node):
        self._node = node

Building the Tree via CART

We finally have all the pieces in place to recursively build our Decision Tree. There are several different tree building algorithms out there such as ID3, C4.5 or CART. The Gini Impurity metric is a natural fit for the CART algorithm, so we'll implement that.

The following is the whole build_tree code. It might look a bit intimidating at first since it implements the CART algorithm via recursion.

Take some time to thoroughly read through the code and its comments. Below we'll describe in prose how exactly the build_tree function works.

def build_tree(data_points: List[DataPoint], features: List[str], label: str = 'play') -> Node:
    # Ensure that the `features` list doesn't include the `label`
    features.remove(label) if label in features else None

    # Compute the weighted Gini impurity for each `feature` given that we'd split the tree at the `feature` in question
    weighted_sums: Dict[str, float] = defaultdict(float)
    for feature in features:
        weighted_sums[feature] = gini_for_feature(data_points, feature)

    # If all the weighted Gini impurities are 0.0 we create a final `Node` (leaf) with the given `label`
    weighted_sum_vals: List[float] = list(weighted_sums.values())
    if (float(0) in weighted_sum_vals and len(set(weighted_sum_vals)) == 1):
        label = getattr(data_points[0], 'play')
        return Node(label)

    # The `Node` with the most minimal weighted Gini impurity is the one we should use for splitting
    min_feature = min(weighted_sums, key=weighted_sums.get)
    node: Node = Node(min_feature)

    # Remove the `feature` we've processed from the list of `features` which still need to be processed
    reduced_features: List[str] = deepcopy(features)
    reduced_features.remove(min_feature)

    # Next up we build the `Edge`s which are the values our `min_feature` can assume
    for value in feature_values(data_points, min_feature):
        # Create a new `Edge` which contains a potential `value` of our `min_feature`
        edge: Edge = Edge(value)
        # Add the `Edge` to our `Node`
        node.add_edge(edge)
        # Filter down the data points we'll use next since we've just processed the set which includes our `min_feature`
        reduced_data_points: List[DataPoint] = filter_by_feature(data_points, (min_feature, value))
        # This `Edge` points to the new `Node` (subtree) we'll create through recursion
        edge.node = build_tree(reduced_data_points, reduced_features)

    # Return the `Node` (our `min_feature`)
    return node

The first thing our build_tree function does is that it ensures that the list of features we have to examine doesn't include the label we try to predict later on. In our case the label is called play which indicates if we went playing given the weather data. If we find the play label we simply remove it from such list.

Next up we'll compute the weighted gini impurities for all the features in our features list. (Note that this list will constantly shrink as we recursively build the tree. More on that in a minute)

If all our features impurities are 0.0 we're done since we're now dealing with "pure" data. In that case we just create a Node instance with the label as its value and return that Node. This is the case where we end the recursive call.

Otherwise (there's still some impurity in the data) we pick the feature from the weighted_sums dictionary with the smallest score since the smaller the Gini Impurity, the purer our data. We'll create a new Node for this min_feature and add it to our tree. This feature can be removed from our "to-process" list of features. We call this new list (which doesn't include the feature) reduced_features.

Our min_feature``Node isn't a leaf so it has to have some Edges. To build the Edges we iterate through all the values our feature can assume and create Edge instances based on such values. Since the edges have to point somewhere we call build_tree again with our reduced_features list to determine the Node (the next min_feature) the Edges should point to.

And that's pretty much all there is. The function will recurse until it reaches the trees leaf nodes which don't have any outgoing edges and contain the label values.

Let's build our tree with our golf.csv data_points :

features: List[str] = list(DataPoint._fields)

tree: Node = build_tree(data_points, features)
tree

# outlook --> [overcast --> True, sunny --> windy --> [False --> True, True --> False], rainy --> humidity --> [normal --> True, high --> False]]

We can already see the textual representation but it might be a better idea to add a visual representation as well:

So should we go out when the Outlook is "rainy" and the Humidity is "normal"?

Seems like we've enjoyed playing golf in those situations in the past!

Random Forest

During your journey in the lands of Machine Learning you might come across the terms "Random Forest", "Bagging" and "Boosting".

The problem with Decision Trees is that they tend to overfit easily to the data you train them on. A way to avoid overfitting while also getting better results in general is to build multiple Decision Trees (each with a subset of the training data) and then use them in concert to do predictions.

If you're working with regression trees you'd compute the average of all the trees outputs. If your trees are of categorical nature you'd perform a majority vote to decide which prediction to return.

Combining multiple Decision Trees into one large ensemble is called "Random Forest". Related to Random Forests are the techniques "Bagging" ("Bootstrap aggregation") and "Boosting". But that's for another post.

Conclusion

Everyone of us is confronted with different decision-making processes every single day. Naturally we examine the different scenarios in order to find the best decision we can take given the data we have at hand.

Decision Trees are a class of Machine Learning algorithms which follow the same idea. In order to build an efficient Decision Tree we need to decide the best way to split our data set into different subsets such that every new subset includes as much pure data as possible. Gini Impurity is one metric which makes it possible to mathematically determine the impurity for any data split we attempt.

Using Gini Impurity we implemented the CART algorithm and built our very own Decision Tree to determine if we should play golf based on the current weather conditions.

Given their visual nature, Decision Trees are a class of Machine Learning algorithms which is easy to inspect, debug and understand. On the other hand they also tend to overfit to the data they're trained on. Combining multiple, different Decision Trees into a so-called "Random Forest" mitigates this problems, turning single Decision Trees into a powerful ensemble.

Do you have any questions, feedback or comments? Feel free to reach out via E-Mail or connect with me on Twitter.

Additional Resources

The following is a list with resources I've used while working on this article. Other, interesting resources are linked within the article itself:

Logistic Regression from scratch

Thu, 25 Jun 2020 13:12:32 GMT

You can find working code examples (including this one) in my lab repository on GitHub.

Sometimes it's necessary to split existing data into several classes in order to predict new, unseen data. This problem is called classification and one of the algorithms which can be used to learn those classes from data is called Logistic Regression.

In this article we'll take a deep dive into the Logistic Regression model to learn how it differs from other regression models such as Linear- or Multiple Linear Regression, how to think about it from an intuitive perspective and how we can translate our learnings into code while implementing it from scratch.

Linear Regression vs. Logistic Regression

If you've read the post about Linear- and Multiple Linear Regression you might remember that the main objective of our algorithm was to find a best fitting line or hyperplane respectively.

To recap real quick, a line can be represented via the slop-intercept form as follows:

$$ y = mx + b $$

Here, $m$ represents the slope and $b$ the y-intercept.

In Linear Regression we've used the existing data to find a line in slope-intercept form (a $m$ and $b$ combination) which "best-fitted through" such data.

Extending the slope-intercept form slightly to support multiple $x$ values and multiple slopes (we'll use $\beta_n$ instead of $m_n$) yields the following:

$$ y = \beta_1x_1 + ... + \beta_nx_n + b $$

This "scaled-up" slope-intercept formula was used in the Multiple Linear Regression model to find the $\beta$ and $b$ values for the hyperplane which "best-fitted" the data. Once found we were able to use it for predictions by plugging in $x$ values to get respective $y$ values.

Linear Regression models always map a set of $x$ values to a resulting $y$ value on a continuous scale. This means that the $y$ value can e.g. be $0$, $42$ or $5.023.212$. How would we use such a Regression model if our $y$ value is categorical such as a binary value which is either $0$ or $1$? Is there a way to define a threshold so that a value such as $42$ is assigned to the category $1$ while a small value such as $0.002$ gets assigned to the category $0$?

That's where Logistic Regression comes into play. With Logistic Regression we can map any resulting $y$ value, no matter its magnitude to a value between $0$ and $1$.

Let's take a closer look into the modifications we need to make to turn a Linear Regression model into a Logistic Regression model.

Sigmoid functions

At the very heart of Logistic Regression is the so-called Sigmoid function. A Sigmoid function is a class of functions which follows an S-shape when plotted.

The most prominent Sigmoid function is the so-called Logistic function which was developed by Pierre Francois Verhulst to model population grown. It's mathematically described via this formula:

$$ f(x) = \frac{1}{1+e^{-x}} $$

Don't be intimidated by the math! Right now all you need to know is that this function takes any $x$ value and maps it to a $y$ value which ranges from $0$ to $1$.

Plotting the function for a range of $x$ values proofs this claim and results in the aforementioned S-shape curve:

Note that the function gets closer and closer to the $y$ value $0$ or $1$ as the $x$ values get smaller or larger respectively. Also note that the $x$ value $0$ results in the $y$ value $0.5$.

This is exactly what we need. with this function we're able to "squish" any number, no matter its magnitude into a value ranging from $0$ to $1$. This makes the function outcome predictable which is useful when we later on define threshold values to associate function outputs with classes.

Let's turn the function into code:

def sigmoid(x: float) -> float:
    return 1 / (1 + exp(-x))

assert sigmoid(0) == 0.5

Note: Although there are many different Sigmoid functions to choose from, a lot of people use the name "Sigmoid function" when talking about the Logistic function. We'll adhere to this convention and use the term "Sigmoid function" as a synonym for Logistic function.

From Linear Regression to Logistic Regression

Now that we've learned about the "mapping" capabilities of the Sigmoid function we should be able to "wrap" a Linear Regression model such as Multiple Linear Regression inside of it to turn the regressions raw output into a value ranging from $0$ to $1$.

Let's translate this idea into Math. Recall that our Multiple Linear Regression model looks like this:

$$ y = \beta_1x_1 + ... + \beta_nx_n + b $$

"Wrapping" this in the Sigmoid function (we use $\sigma$ to represent the Sigmoid function) results in the following:

$$ y = \sigma(\beta_1x_1 + ... + \beta_nx_n + b) $$

Easy enough! Let's turn that into code.

The first thing we need to do is to implement the underlying Multiple Linear Regression model. Looking at the Math it seems to be possible to use the dot-product to calculate the $\beta$ and $x$ part to which we then add the single $b$ value.

To make everything easier to calculate and implement we'll use a small trick. Multiplying a value by the identify $1$ yields the value so we prepend $1$ to the $x$ values and $b$ to the $\beta$ values. This way we can solely use the dot-product calculation without the necessity to add $b$ separately later on. Here's the mathematical formulation of that trick:

$$ \vec{x} = \begin{pmatrix} 1 \ x_1 \ ... \ x_n \end{pmatrix} \vec{\beta} = \begin{pmatrix} b \ \beta_1 \ ... \ \beta_n \end{pmatrix} $$

$$ y = \vec{x} \cdot \vec{m} = \sum_{i=1}^n x_i \beta_i = x_1 \times \beta_1 + ... + x_n \times \beta_n $$

Once we've calculated the dot-product we need to pass it into the Sigmoid function such that its result is translated ("squished") into a value between $0$ and $1$.

Here's the implementation for the dot function which calculates the dot-product:

def dot(a: List[float], b: List[float]) -> float:
    assert len(a) == len(b)
    return sum([a_i * b_i for a_i, b_i in zip(a, b)])

assert dot([1, 2, 3, 4], [5, 6, 7, 8]) == 70

And here's the squish function which takes as parameters the $x$ and $\beta$ values (remember that we've prepended a $1$ to the $x$ values and the $b$ to the $\beta$ values), uses the dot function to calculate the dot-product of $x$ and $\beta$ and then passes this result into the Sigmoid function to map it to a value between $0$ and $1$:

def squish(beta: List[float], x: List[float]) -> float:
    assert len(beta) == len(x)
    # Calculate the dot product
    dot_result: float = dot(beta, x)
    # Use sigmoid to get a result between 0 and 1
    return sigmoid(dot_result)

assert squish([1, 2, 3, 4], [5, 6, 7, 8]) == 1.0

The intuition behind the 0-1 range

We've talked quite a lot about how the Sigmoid function is our solution to make the function outcome predictable as all values are mapped to a $0$ - $1$ range. But what does a value in that range represent? Let's take a look at an example.

The following is a data set which describes how long students have studied for an exam and whether they've passed the exam given the hours they've studied.

| Hours studied | Exam Passed | | ------------- | ----------- | | 0,5 | 0 | | 1,0 | 0 | | 1,5 | 0 | | 2,0 | 0 | | 2,5 | 1 | | 3,0 | 0 | | 3,5 | 1 | | 4,0 | 1 | | 4,5 | 0 | | 5,0 | 1 | | 5,5 | 1 | | 6,0 | 1 |

Taking a glance at the data it seems to be that the more hours the students studied, the more likely they were to pass the exam. Intuitively that makes sense.

Let's plot the data to ensure that our intuition is correct:

Looking at the plotted data we can immediately see that the values seem to "stick" to either the bottom or top of the graph. Given that it seems to be infeasible to use a Linear Regression model to find a line which best describes the data. How would this line be fitted through the data if the values we'd expect this line should produce are either $o$ or $1$?

Let's try a thought experiment. What would happen if we've somehow found some coefficients $\beta$ for the Linear Regression model which "best" describe the data and pass the result it computes through the Sigmoid function? Here's the graph from above with the Sigmoid function added to it:

Looking at the plotting above we can see that the Sigmoid function ensures that the result from the "underlying" Linear Regression model is mapped onto a scale between $0$ and $1$, which in turn makes it possible to e.g. define a threshold at $0.5$ to say that a value which is greater than $0.5$ might be a predictor for a student passing the exam while a value less than $0.5$ might mean that she'll fail the exam.

Note that the wording in the last sentence isn't a coincidence. The value the Sigmoid function produces can be interpreted as a probability where $0$ means $0%$ probability and $1$ means a $100%$ probability.

The Probability Density Function

As it turns out we can translate our findings from the previous section into a function called Probability density function or (PDF for short).

In particular we can define a conditional probability which states that given some $\beta$ and $x_i$, each corresponding $y_i$ should equal $1$ with probability $\sigma(\beta x_i)$ and $0$ with probability $1-\sigma(\beta x_i)$:

$$ P(y_i \mid \beta x_i) = \sigma(\beta x_i)^{y_i} \times (1-\sigma(\beta x_i))^{1-y_i} $$

Looking at the formula above it might be a mystery how we deduced it from our verbal description from above. Here's something I want you to try: Please apply the formula by setting $y_i$ to $0$ and after that to $1$ and see what happens. What you'll notice is that depending on what value you set $y_i$ to, only one part of the formula stays the same while the other is canceled out.

Here's what we'll end up with if we set $y_i$ to $0$ and $1$:

$$ 1-\sigma(\beta x_i) \quad \textrm{if} y_i = 0 $$

$$ \sigma(\beta x_i) \quad \textrm{if} y_i = 1 $$

And that's exactly the desired behavior we described above.

Deriving a Loss function

With Logistic Regression our main objective is to find the models $\beta$ parameters which maximize the likelihood that for a pair of $x$ values the $y$ value our model calculates is as close to the actual $y$ value as possible.

In order to find the optimal $\beta$ parameters we need to somehow calculate how "wrong" our models predictions are with the current $\beta$ setup.

In the previous section we talked about the Probability Density Function (PDF) which seems to capture exactly that. We just need to tweak this function slightly so that it's easier for us to do calculations with it later on.

The main tweak we'll apply is that we "wrap" our individual PDF calculations for $y_i = 0$ and $y_i = 1$ in the $\log$ function. The Logarithm has the nice property that it's strictly increasing which makes it easier to do calculations on its data later on. Furthermore our PDFs main property is still preserved since any set of $\beta$ values that maximizes the likelihood of predicting the correct $y$ also maximizes the $\log$ likelihood.

Here are the PDFs two major parts "wrapped" in the $\log$ function:

$$ \log(1-\sigma(\beta x_i)) \quad \textrm{if} y_i = 0 $$

$$ \log(\sigma(\beta x_i)) \quad \textrm{if} y_i = 1 $$

There's only one minor issue we need to resolve. Overall we're attempting to minimize the amount of wrong predictions our model produces, but looking at the graph of the Logarithm once again we see that the function is strictly increasing. We "mirror" the logarithm at the $x$ axis ("turning" it upside down) by multiplying it with $-1$. Hence our two functions now look like this:

$$ -\log(1-\sigma(\beta x_i)) \quad \textrm{if} y_i = 0 $$

$$ -\log(\sigma(\beta x_i)) \quad \textrm{if} y_i = 1 $$

Now the last thing we want to do here is to put both functions back together into one equation like we did with our composite PDF function above. Doing this results in the following:

$$ \log L (\beta \mid x_i y_i) = -(y_i \log(\sigma(\beta x_i)) + (1-y_i) \log(1-\sigma(\beta x_i))) $$

This function is called Logarithmic Loss (or Log Loss / Log Likelihood) and it's what we'll use later on to determine how "off" our model is with its prediction. Again, you might want to set $y_i$ to $0$ or $1$ to see that one part of the equation is canceled out.

Let's turn that into code:

def neg_log_likelihood(y: float, y_pred: float) -> float:
    return -((y * log(y_pred)) + ((1 - y) * log(1 - y_pred)))

assert 2.30 < neg_log_likelihood(1, 0.1) < 2.31
assert 2.30 < neg_log_likelihood(0, 0.9) < 2.31
assert 0.10 < neg_log_likelihood(1, 0.9) < 0.11
assert 0.10 < neg_log_likelihood(0, 0.1) < 0.11

Next up let's use our codified version of Log Loss to create plots for $y = 0$ and $y = 1$:

xs_nll: List[float] = [x / 10000 for x in range(1, 10000)]

fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.plot(xs_nll, [neg_log_likelihood(1, x) for x in xs_nll])
ax1.set_title('y = 1')
ax2.plot(xs_nll, [neg_log_likelihood(0, x) for x in xs_nll])
ax2.set_title('y = 0');

As we can see, the more wrong the prediction, the higher the calculated error. The Log Loss function therefore "punished" wrongdoing more than it rewards "rightdoing".

To calculate the overall error of our whole data set we sum up each individual Log Loss calculation and average it:

$$ \textrm{Cost} = -\frac{1}{n} \sum_{i=1}^n (y_i \log(\sigma(\beta x_i)) + (1-y_i) \log(1-\sigma(\beta x_i))) $$

Here's the codified version:

def error(ys: List[float], ys_pred: List[float]) -> float:
    assert len(ys) == len(ys_pred)
    num_items: int = len(ys)
    sum_nll: float = sum([neg_log_likelihood(y, y_pred) for y, y_pred in zip(ys, ys_pred)])
    return (1 / num_items) * sum_nll

assert 2.30 < error([1], [0.1]) < 2.31
assert 2.30 < error([0], [0.9]) < 2.31
assert 0.10 < error([1], [0.9]) < 0.11
assert 0.10 < error([0], [0.1]) < 0.11

Optimizing via Gradient Descent

Now the last missing piece we need to implement is the optimization step. What we want at the end of the day is a Logistic Regression model with the $\beta$ parameters which in combination with $x$ values produce the most accurate prediction for any $y$ value.

Given that we can now calculate the error our current model with its $\beta$ parameters produces we can iteratively change the $\beta$ parameters until we reach a point where our model cannot improve (can't reduce the error value) anymore.

The algorithm which implements exactly that is called Gradient Descent.

Note: I already wrote a dedicated post explaining the algorithm in great detail so I won't go into too much detail here and would encourage you to read the article if you're unfamiliar with Gradient Descent.

The gist of it is that given our Log Loss function we can find a set of $\beta$ parameters for which the error the Log Loss function calculates is the smallest. We've therefore found a local (or global) minimum if the error cannot be reduced anymore. To figure out "where" the minimum is located we'll use the error functions gradient which is a vector and guides us to that position.

Since we repeat such calculations over and over again we're iteratively descending down the error functions surface, hence the name Gradient Descent.

Calculating the gradient is done by computing the partial derivatives of the Log Loss function with respect to the individual $x_{ij}$ values in our $x_i$ vector.

Here's the mathematical representation:

$$ \frac{\partial \textrm{Cost}}{\partial x*{ij}} = \frac{1}{n} \sum*{i=1}^n (\sigma(\beta xi) - y_i) x{ij} $$

And here's the implementation in code:

grad: List[float] = [0 for _ in range(len(beta))]
for x, y in zip(xs, ys):
    err: float = squish(beta, x) - y
    for i, x_i in enumerate(x):
        grad[i] += (err * x_i)
grad = [1 / len(x) * g_i for g_i in grad]

Again, if you're unfamiliar with gradients and partial derivatives you might want to read my article about the Gradient Descent algorithm.

We finally got all the pieces in place! Let's grab some data and use the Logistic Regression model to classify it!

Putting it into practice

The data set we'll be using is similar to what we've already seen in our example above where we tried to predict whether a student will pass an exam based on the hours she studied.

You can download the data set here. Inspecting the data we can see that there are 2 floating point values and an integer value which represents the label and (presumably) indicates whether the student in question has passed or failed the exam.

Our task is it to use this data set to train a Logistic Regression model which will help us assign the label $0$ or $1$ (i.e. classify) new, unseen data points.

The first thing we need to do is to download the .txt file:

wget -nc -P data https://raw.githubusercontent.com/animesh-agarwal/Machine-Learning/5b2d8a71984016ae021094641a3815d6ea9ac527/LogisticRegression/data/marks.txt

Next up we need to parse the file and extract the $x$ and $y$ values:

marks_data_path: Path = data_dir / 'marks.txt'

xs: List[List[float]] = []
ys: List[float] = []

with open(marks_data_path) as file:
    for line in file:
        data_point: List[str] = line.strip().split(',')
        x1: float = float(data_point[0])
        x2: float = float(data_point[1])
        y: int = int(data_point[2])
        xs.append([x1, x2])
        ys.append(y)

It's always a good idea to plot the data to see if there are any outliers or other surprises we have to deal with:

x1s: List[float] = [x[0] for x in xs]
x2s: List[float] = [x[1] for x in xs]

plt.scatter(x1s, x2s, c=ys)
plt.axis([min(x1s), max(x1s), min(x2s), max(x2s)]);

Looks like we're (almost) good here. There's only one aspect we need to further inspect. If you're looking at the axes you see that the values are ranging from $\approx 35$ to $\approx 95$. Working with such large values can result into problems when working with $\log$ and $\exp$ later on. We'll resolve this problem in a minute.

You might recall from the beginning of the post that we applied a trick where we prepend every $x$ vector with a $1$ and prepend the $b$ to the $\beta$ vector in order to make it possible to use the dot-product (which is a simpler calculation).

The following code prepends a $1$ to every $x$ vector so that we can leverage the computation trick later on:

for x in xs:
    x.insert(0, 1)

xs[:5]

# [[1, 34.62365962451697, 78.0246928153624],
# [1, 30.28671076822607, 43.89499752400101],
# [1, 35.84740876993872, 72.90219802708364],
# [1, 60.18259938620976, 86.30855209546826],
# [1, 79.0327360507101, 75.3443764369103]]

Next up we'll tackle the scaling problem we've touched upon in the paragraph above. What we'll do to resolve this problem is to standardize (often done via the "z-score") the whole data set:

xs = z_score(xs)

xs[:5]

# [[1, -1.5942162646576388, 0.6351413941754435],
# [1, -1.8171014180340745, -1.2014885239142388],
# [1, -1.531325157335502, 0.3594832875590465],
# [1, -0.28068723821760927, 1.0809228071415948],
# [1, 0.6880619310375534, 0.4909048515228952]]

If you're curious what the z_score function does check out the whole implementation in my lab repository on GitHub.

And now we're finally in a position where we can train our Logistic Regression Model via Gradient Descent. We'll start with some random guesses for our models $\beta$ parameters and iteratively optimize our model by computing its current overall "wrongdoing" with the error function and then using the error functions gradient to update the $\beta$ value to yield a smaller error in the next iteration. Eventually we'll converge to a local minimum which results in $\beta$ values computing the smallest error.

beta: List[float] = [random() / 10 for _ in range(3)]

print(f'Starting with "beta": {beta}')

epochs: int = 5000
learning_rate: float = 0.01

for epoch in range(epochs):
    # Calculate the "predictions" (squishified dot product of `beta` and `x`) based on our current `beta` vector
    ys_pred: List[float] = [squish(beta, x) for x in xs]

    # Calculate and print the error
    if epoch % 1000 == True:
        loss: float = error(ys, ys_pred)
        print(f'Epoch {epoch} --> loss: {loss}')

    # Calculate the gradient
    grad: List[float] = [0 for _ in range(len(beta))]
    for x, y in zip(xs, ys):
        err: float = squish(beta, x) - y
        for i, x_i in enumerate(x):
            grad[i] += (err * x_i)
    grad = [1 / len(x) * g_i for g_i in grad]

    # Take a small step in the direction of greatest decrease
    beta = [b + (gb * -learning_rate) for b, gb in zip(beta, grad)]

print(f'Best estimate for "beta": {beta}')

# Starting with "beta": [0.06879018957747185, 0.060750489548129484, 0.08122488791609535]
# Epoch 1 --> loss: 0.6091560801945126
# Epoch 1001 --> loss: 0.2037432848849053
# Epoch 2001 --> loss: 0.20350230881468107
# Epoch 3001 --> loss: 0.20349779972872906
# Epoch 4001 --> loss: 0.20349770371660023
# Best estimate for "beta": [1.7184091311489376, 4.01281584290694, 3.7438191715393083]

Now that our model is trained we can calculate some statistics to see how accurate it is. For these calculations we'll set the threshold to $0.5$ which means that every value above $0.5$ our model produces is considered a $1$ and every value which is less than $0.5$ is considered to be a $0$.

total: int = len(ys)
thresh: float = 0.5
true_positives: int = 0
true_negatives: int = 0
false_positives: int = 0
false_negatives: int = 0

for i, x in enumerate(xs):
    y: int = ys[i]
    pred: float = squish(beta, x)
    y_pred: int = 1
    if pred < thresh:
        y_pred = 0
    if y == 1 and y_pred == 1:
        true_positives += 1
    elif y == 0 and y_pred == 0:
        true_negatives += 1
    elif y == 1 and y_pred == 0:
        false_negatives += 1
    elif y == 0 and y_pred == 1:
        false_positives += 1

print(f'True Positives: {true_positives}')
print(f'True Negatives: {true_negatives}')
print(f'False Positives: {false_positives}')
print(f'False Negatives: {false_negatives}')
print(f'Accuracy: {(true_positives + true_negatives) / total}')
print(f'Error rate: {(false_positives + false_negatives) / total}')

# True Positives: 55
# True Negatives: 34
# False Positives: 6
# False Negatives: 5
# Accuracy: 0.89
# Error rate: 0.11

Finally let's plot the decision boundary so that we can see where our model "draws the line":

x1s: List[float] = [x[1] for x in xs]
x2s: List[float] = [x[2] for x in xs]
plt.scatter(x1s, x2s, c=ys)
plt.axis([min(x1s), max(x1s), min(x2s), max(x2s)]);

m: float = -(beta[1] / beta[2])
b: float = -(beta[0] / beta[2])

x2s: List[float] = [m * x[1] + b for x in xs]

plt.plot(x1s, x2s, '--');

Great! Looks like our model correctly learned how to classify new, unseen data as it considers everything "above" and "below" the decision boundary as a separate class which seems to be in alignment with the data points from our data set!

Conclusion

In some situations it's a requirement to classify new, unseen data. Logistic Regression is a powerful Machine Learning model which makes it possible to learn such classifications based on existing data.

In this blog post we took a deep dive into Logistic Regression and all its moving parts. We've learned about Sigmoid functions and how they can be used in conjunction with a Linear Regression model to project values of arbitrary magnitude onto a scale between $0$ and $1$ which is exactly what we need when we want to do binary classification.

Once we understood the mathematics and implemented the formulas in code we took an example data set and applied our Logistic Regression model to a binary classification problem. After training our model we were able to draw the decision boundary it learned to visually validate that it correctly learned how to separate the data into two (binary) subsets.

Do you have any questions, feedback or comments? Feel free to reach out via E-Mail or connect with me on Twitter.

Additional Resources

The following is a list of resources I've used to write this article. Other, useful resources are linked within the article itself:

Linear- and Multiple Regression from scratch

Thu, 30 Apr 2020 13:12:10 GMT

You can find working code examples (including this one) in my lab repository on GitHub.

Linear Regression is one of the basic Machine Learning algorithms every student eventually encounters when starting to dive deeper into the field. If you heard someone trying to "fit a line through the data" that person most likely worked with a Linear Regression model.

In which scenarios should we use Linear Regression and if we do, how do we find such a best-fitting line? And what if our data is multidimensional? Let's answer all those questions by implementing Linear and Multiple Regression from scratch!

Note: Throughout this post we'll be using the "Auto Insurance in Sweden" data set which was compiled by the "Swedish Committee on Analysis of Risk Premium in Motor Insurance".

Data-driven decision making

Let's take a step back for a minute and imagine that we're working at an insurance company which sells among other things car insurances.

Over the years the insurance company has collected a wide variety of statistics for all the insurances it sells including data for its car insurances. Some of this data is statistics about the number of filed claims and the payments which were issued for them. The following table shows an excerpt from such data:

| Number of claims | Payments issued | | ---------------- | --------------- | | 108 | 392,5 | | 19 | 46,2 | | 13 | 15,7 | | 124 | 422,2 | | ... | ... |

One day we get a call from our colleague who works at the claims settlement center. She has to plan the divisions budget for the upcoming year which is usually derived based on best guesses. Since we've collected all the data throughout the years she wonders if there's a more reliable, mathematical way we could use to calculate the budget estimation.

Nodding along we confirm that we'll dive deeper into this topic and hang up the telephone in sheer excitement! Finally we're able to put some real Machine Learning into practice. But how should we tackle the problem?

Data visualization

It's hard to gain any insights into the data we're dealing with by manually examining the raw numbers. In order to get a better understanding of the data it's always a good idea to visualize it first.

Given that we're dealing with 2 dimensions (the number of claims and the issued payments) one of the potential diagrams we can create is a so called scatter plot which uses (Cartesian) coordinates to display the values of a given data set. In our case we treat the number of claims as our $x$-axis and the issued payments as our $y$-axis and plot the data we recorded at the intersections of such axes which results in the following diagram:

Solely by looking at the diagram we can already identify a trend in the data. It seems to be the case that the more claims were filed, the more payments were issued. Intuitively that makes sense.

After inspecting the plotted data in more detail we observe that we can certainly make some rough predictions for missing data points. Given the trend in the data it seems reasonable that we might issue ~80 payments when ~40 number of claims were filed. Accordingly ~125 claims might be filed when we issue ~410 payments.

Is there some way to turn this insight into a model we can use to make arbitrary predictions? It seems like the relationship in the data is linear. Is there a way to capture this notion mathematically?

Linear Functions

You might remember the concept of a Linear function from school where you've used the slope-intercept form (one of many forms) to mathematically describe a line:

$$ y = mx + b $$

The slope-intercept form has 2 parameters which determine how the line "behaves" in the Cartesian plane (The typical 2D plane with $x$ and $y$ coordinates):

$m$ is the lines slope which measures how steep the line is slanted
$b$ is the $y$-intercept and determines at which point the line intercepts the $y$-axis

Using this formula we can plug in any arbitrary $x$ value which is then multiplied by $m$ and added to $b$ to get back the corresponding $y$ value.

Let's look at a couple of examples to get a better feeling as to how the slope-intercept form works.

The slope m

Let's solely focus on $m$ for now and set $b$ to $0$. How does $m$ influence the way our line will be plotted if we set it to $1$?

Here's the mathematical representation of such a line followed by the corresponding plot:

$$ y = 1x + 0 $$

As you can see for every step of size $1$ in the $x$ direction we "go" a step of size $1$ in the $y$ direction.

The y-intercept b

Now that we understand what the parameter $m$ is responsible for, let's take a look at the $y$-intercept $b$ and set it to $1$:

$$ y = 1x + 1 $$

The steepness of the line is the same as the previous line since we haven't modified $m$. However if you take a look at $x = 0$ you'll notice that the line crosses the $y$ intercept at $1$. That's exactly what the parameter $b$ is responsible for. Through $b$ we can control where our line should start on the $y$ axis when $x = 0$.

Turning it into code

Let's translate the slope-intercept form into a function we call predict (we'll use this function for our predictions later on):

def predict(m: float, b: float, x: float) -> float:
    return m * x + b

assert predict(m=0, b=0, x=3) == 0

Applying what we've learned so far

Let's put the theory into practice and try to guesstimate a line which best describes our data.

The first thing we notice is that the individual data points follow an upwards trend, so $m$ will certainly be positive. Furthermore the data points close to $x = 0$ seem to have low $y$ values as well. Taking those observations into account we guess the following description for our line:

$$ y = 4x + 2 $$

Not too bad for our first guess! Just by looking at the plotted line we might ask ourselves if there's better fitting line? Can we quantify how good our line fits the data?

The error function

Given that the learning part in Machine Learning is usually an iterative process which starts with an initial guess and slowly computes new "best guesses" to finally converge to the optimal solution it's a necessity to be able to track the learning process.

A good way to supervise the learning process is to mathematically capture the "wrongdoing" our algorithm inevitably produces while trying to determine the function which best describes the data.

In the case of Linear Regression it seems to make sense to compare the $y$-values the line produces to the actual $y$-values from the data set. We could for example go through each individual $(x, y)$ pair in our data set and subtract its $y$ value from the $y$ value our line "predicts" for the corresponding $x$. Summing up these differences results in a number we can use to compare different lines against each other. The higher the number, the "less correct" the line.

That's great but there's one minor catch. Imagine that the line which is fitted through the data predicts large positive $y$ values near the origin $(0, 0)$ where it should predict large negative numbers. At the same time it predicts large negative numbers near the end of the $x$-axis although those values should be positive. If we calculate the errors according to our description above where we suggested to sum up the differences between the $y$ values we'd end up in a situation where values might cancel each other out. In the worst case the calculated error is $0$ which indicates that we've found the best fitting line while in reality we didn't!

A simple trick to mitigate this problem is to square each single error value before they're summed up. This way any negative value will be turned into a positive one, making it impossible to run into scenarios where error calculations cancel each other out.

The error function we've just described is called Residual sum of squares (RSS) or Sum of squared errors (SSE) and is one of many error functions we can use to quantify the algorithms "wrongdoing". The following is the mathematical formula for SSE:

$$ SSE = \sum_{i=1}^n (y_i - f(x_i))^2 $$

Turing that into code results in the following:

def sum_squared_error(ys: List[float], ys_pred: List[float]) -> float:
    assert len(ys) == len(ys_pred)
    return sum([(y - ys_pred) ** 2 for y, ys_pred in zip(ys, ys_pred)])

assert sum_squared_error([1, 2, 3], [4, 5, 6]) == 27

Finding the best fitting line...

With those two code snippets (the predict and sum_squared_error functions) we're now able to describe a line, predict $y$ values and measure how "off" our predictions are. The last missing piece we'll need to get in place is a way to update our line description such that the next sum_squared_error calculation returns an error value which is less than our current one. If there's a way to constantly reduce the error we're making by slowly updating our line description we'll eventually end up with a line which best fits our data!

With Linear Regression there are a couple of different algorithms we can use to find the best fitting line. One prominent choice is the Ordinary least squares (OLS) method. Since OLS is a common choice we'll do something different. We'll use the Gradient Descent algorithm which can be used for various different optimization problems and is at the heart of modern Machine Learning algorithms.

Note: If you haven't already I'd suggest that you take a couple of minutes to read the article "Gradient Descent from scratch" in which I explain the whole algorithm in great detail.

... using Gradient Descent

I won't provide too many explanations regarding Gradient Descent here since I already covered the topic in the aforementioned post. Don't get too intimidated by the Math below. It's ok if you just skim through this section to get a high-level overview.

In a nutshell Gradient Descent makes it possible for us to iteratively "walk down" the error functions surface to eventually find a local minimum where the error is the smallest which is exactly what we're looking for.

In order to figure out in which direction we should walk to descent down to the local minimum we need to compute the so-called gradient. The gradient is a vector consisting of partial derivatives of our error function which point in the direction of greatest increase at any given point $p$ on our error functions surface.

To find the partial derivatives of our SSE function we should expand it so that we can see all the variables we need to take into account:

$$ SSE = \sum*{i=1}^n (y_i - f(x_i))^2 = \sum*{i=1}^n (y_i - (mx + b))^2 $$

Looking at the expanded formula it seems like there's $m$ and $b$ we need to derive with respect to:

$$ \frac{\partial sse}{\partial m} = 2x ((mx + b) - y) $$

$$ \frac{\partial sse}{\partial b} = 2 ((mx + b) - y) $$

Which results in the following code:

# The partial derivative of SSE with respect to `m`
grad_m: float = sum([2 * (predict(m, b, x) - y) * x for x, y in zip(xs, ys)])
# The partial derivative of SSE with respect to `b`
grad_b: float = sum([2 * (predict(m, b, x) - y) for x, y in zip(xs, ys)])

Tip: You can use WolframAlpha to validate your partial derivatives.

Given these partial derivatives we can now calculate the gradient for any point $x$ which is a vector pointing in the direction of greatest increase. Multiplying the vector by $-1$ will let it point into the opposite direction, the direction of greatest decrease (remember that we want to find a local minimum). If we add a small fraction of this vector to our $m$ and $b$ values respectively we should end up closer to a local minimum.

The following code captures what we've just described:

# Take a small step in the direction of greatest decrease
# The `learning_rate` controls the step size when "walking" down the gradient
learning_rate: float = 0.0001

m = m + (grad_m * -1 * learning_rate)
b = b + (grad_b * -1 * learning_rate)

Repeating this process multiple times should help us find the $m$ and $b$ values for our line for which any given prediction $y$ calculated by that line results in the smallest error possible.

Let's put all the pieces together and implement the Gradient Descent algorithm to find the best fitting line:

# Find the best fitting line through the data points via Gradient Descent

# Our initial guess
m: float = 0
b: float = 200

print(f'Starting with "m": {m}')
print(f'Starting with "b": {b}')

# Doing 1000 iterations
epochs: int = 10000
learning_rate: float = 0.00001

for epoch in range(epochs):
    # Calculate predictions for `y` values given the current `m` and `b`
    ys_pred: List[float] = [predict(m, b, x) for x in xs]

    # Calculate and print the error
    if epoch % 1000 == True:
        loss: float = sum_squared_error(ys, ys_pred)
        print(f'Epoch {epoch} --> loss: {loss}')

    # Calculate the gradient
    # Taking the (partial) derivative of SSE with respect to `m` results in `2 * x ((m * x + b) - y)`
    grad_m: float = sum([2 * (predict(m, b, x) - y) * x for x, y in zip(xs, ys)])
    # Taking the (partial) derivative of SSE with respect to `b` results in `2 ((m * x + b) - y)`
    grad_b: float = sum([2 * (predict(m, b, x) - y) for x, y in zip(xs, ys)])

    # Take a small step in the direction of greatest decrease
    m = m + (grad_m * -learning_rate)
    b = b + (grad_b * -learning_rate)

print(f'Best estimate for "m": {m}')
print(f'Best estimate for "b": {b}')

Running this algorithm results in a best estimate for the $m$ and $b$ values. Let's compare our initial guess of $m$ and $b$ (the guess we started with at the top of the code snippet) with the values our Gradient Descent implementation produced:

$$ m = 0, b = 200 $$

$$ m \approx 3.40, b \approx 20.30 $$

Awesome! Seems like we've found our linear function which best describes our data! Let's call our co-worker and share the good news. From now on she can use the following formula to find a prediction for the issued payments ($y$) based on any number of claims ($x$):

$$ y = 3.40x + 20.30 $$

From Linear Regression to Multiple Regression

It's great to be able to fit a line through data points in $2$ dimensions. But how do we deal with scenarios where our data has more than $2$ dimensions?

Most data sets capture many different measurements which are called "features". It would be great if we could take the most important features into account when working with our algorithms. Given that every feature adds another dimension we need to ensure that the model we're building can deal with such high-dimensional data.

Our Linear Regression model was only able to take a single $x$ value and predict a corresponding $y$ value. What if we have multiple $x$ values? Is there a way to use a regression model to predict a $y$ value based on multiple $x$ values?

As it turns out Linear Regression is a subset of a general regression model called Multiple Linear Regression or Multiple Regression. Multiple Regression can deal with an arbitrary number of $x$ values expressed as a vector to predict a single $y$ value.

The great news is that we can easily adopt what we've learned so far to deal with high-dimensional data. Let's take a quick look at the changes we need to make.

Tweaking the Linear Function

The slope-intercept form we've used so far can easily be updated to work with multiple $x$ values. Here's the linear equation we've used so far:

$$ y = mx + b $$

Having multiple $x$ values means that we'll also have multiple $m$ values (one for each $x$). However we'll still only deal with $1$ intercept:

$$ y = m_1x_1 + ... + m_nx_n + b $$

Calculating a prediction for $y$ is as simple as solving the above equation for any given vector of $x$ values, vector of $m$ values and any given $b$ value.

There's one little trick we can apply given that we're now mostly dealing with vectors rather than scalar numbers. To make the computation more efficient we can use the dot-product which carries out almost the exact same calculation we described above. There's just one problem. The dot-product can only be used in vector calculations, however $b$ isn't a vector. As it turns out we can simply prepend the $b$ value to the $m$ vector and prepend a $1$ to the $x$ vector. Doing this little trick makes it possible to use the dot-product calculation while also taking the $b$ value into account. Here's what we'd end up with when doing just that:

$$ \vec{x} = \begin{pmatrix} 1 \ x_1 \ ... \ x_n \end{pmatrix} \vec{m} = \begin{pmatrix} b \ m_1 \ ... \ m_n \end{pmatrix} $$

$$ y = \vec{x} \cdot \vec{m} = \sum_{i=1}^n x_i m_i = x_1 \times m_1 + ... + x_n \times m_n $$

Another nice side-effect of doing this is that the partial derivative calculations for the error function will also be easier since our usage of the dot-product reduced the number of variables we have to take into account to just 2 vectors $x$ and $m$.

Fitting Hyperplanes

And that's pretty much all there is to change. The rest of the code follows exactly the same way. While we fitted a line when working with Linear Regression we're now fitting a so-called hyperplane with Multiple Regression.

To get a better intuition for the notion of a hyperplane imagine that we have measurements we can scatter plot in a $3$ dimensional space. Every measurement will be a single dot in that space, resulting in a cloud of dots. Our Multiple Regression algorithm will now try to find a plane (think of it as a wooden plank) which best fits through that dot cloud.

Conclusion

Linear Regression is one of the very first algorithms every student encounters when learning about Machine Learning models and algorithms.

The basic idea of Linear Regression is to find an equation for a line which best describes the data points in the given data set. Such a line is often described via the point-slope form $y = mx + b$. "Fitting the line" means finding the $m$ and $b$ values such that the resulting $y$ value is as accurate as possible given an arbitrary $x$ value. Using an error function (which describes how "off" our current line equation is) in combination with an optimization algorithm such as Gradient Descent makes it possible to iteratively find the "best fitting" line.

As it turns out Linear Regression is a specialized form of Multiple Linear Regression which makes it possible to deal with multidimensional data by expressing the $x$ and $m$ values as vectors. While this requires the usage of techniques such as the dot-product from the realm of Linear Algebra the basic principles still apply. In Multiple Linear Regression we're just trying to find a "best fitting" hyperplane rather than a line.

Do you have any questions, feedback or comments? Feel free to reach out via E-Mail or connect with me on Twitter.

Additional Resources

The following is a list with resources I've used while working on this blog post. Other useful resources are linked within the article itself.

Naive Bayes from scratch

Thu, 09 Apr 2020 13:49:15 GMT

You can find working code examples (including this one) in my lab repository on GitHub.

How do you implement a spam filter that adapts to the ever evolving techniques of modern spammers? Spam detection is a classification problem as you have to decide whether a message is spam or not. However there's always some uncertainty involved in figuring out if the message at hand might contain spam or not.

Looking into the probability toolbox to tame such uncertainty one might stumble upon the infamous Bayes Theorem which is exactly what we'll use in this post to build a basic spam filter. Using Bayes Theorem we'll implement an algorithm called Naive Bayes which was used in the early internet days as a spam filtering technique. Let's dive in!

Note: This post assumes a certain level of knowledge about probability theory. If you're new to this subject or just a little bit rusty you might want to check out the "Basics of Probability" playlist by JB Statistics which helps you in building the intuitions behind probability theory.

Bayes Theorem

if you've studied statistics and probability theory in the past you surely came across Bayes Theorem which lets you calculate conditional probabilities. Conditional probabilities are probabilities with dependencies which can be used to mathematically express questions such as: "What's the probability that it'll be sunny throughout the day given that we saw clouds in the morning?". Translating this sentence into math results in the following:

$$ P(Sun \mid Clouds) $$

Having some decent meteorology data available should be sufficient to calculate the exact probability via Bayes Theorem which is formulated as follows:

$$ P(A \mid B) = \frac{P(B \mid A) , P(A)}{P(B)} $$

Let's take a quick refresher and apply this formula by working through an example step by step. Assuming that we're living in a place where it's often cloudy, the chance to see clouds in the morning is roughly $60%$. At the same time most days aren't that sunny. Only 7 out of 30 days can be considered really sunny. Another data point we have is that $50%$ of the sunny days started out with clouds in the morning.

Formulating these findings as probabilities results in the following:

$$ P(Sun) \approx 25% \approx 0.25 $$

$$ P(Cloud) = 60% = 0.6 $$

$$ P(Cloud \mid Sun) = 50% = 0.5 $$

Calculating the likelihood of getting a sunny day even when we saw clouds in the morning is as simple as plugging-in the probabilities into Bayes Theorem:

$$ P(Sun \mid Cloud) = \frac{P(Cloud \mid Sun) , P(Sun)}{P(Cloud)} \ = \frac{0,5 \times 0,25}{0,6} \approx 0,21 = 21% $$

So there's a $21%$ chance that we get a sunny day even if the morning started with clouds.

From Bayes Theorem to Naive Bayes

Now that we got a quick refresher on Bayes Theorem it's time to ask ourselves how we could apply this formula to a classification problem such as spam filtering.

Generally speaking we'd like to calculate the probability that any given message is spam. If the probability exceeds a certain threshold (e.g. $50%$) we classify it as spam. Otherwise we believe that the message is legit and call it "ham".

The rather unusual word "ham" which describes "non spam" messages is used throughout literature so we'll use it here too.

Bayes Theorem and spammy words

Breaking the problem further down into it's fundamentals we can say that a message is comprised of different words, so why don't we start there and calculate the probability of a message being classified as spam given that it contains a certain "trigger" word.

Let's assume that we're working with the word "prize" and want to figure out what the probability is that a given message which contains the word "prize" is spam. Formulating this problem via Bayes Theorem results in:

$$ P(Spam \mid Prize) = \frac{P(Prize \mid Spam) , P(Spam)}{P(Prize)} $$

To make things clear, here's the formula translated into prose:

"The probability of a message being spam given that we see the word "Prize" in it is the probability of finding the word "Prize" in our already seen spam messages times the probability of the message in question being spam divided by the probability of finding the word "Prize" in any of our already seen messages."

There are two simplifications we can apply here. The first thing we might want to do is expand the denominator like so:

$$ P(Prize) = P(Prize \mid Spam) P(Spam) + P(Prize \mid Ham) P(Ham) $$

We simply rewrote the condensed version of $P(Prize$ into a more verbose statement listing all its underlying probabilities.

Our expanded Bayes Theorem application now looks like the following:

$$ P(Spam \mid Prize) = \frac{P(Prize \mid Spam) P(Spam)}{P(Prize \mid Spam) P(Spam) + P(Prize \mid Ham) P(Ham)} $$

The neat thing about this expansion is that we can now apply our second trick. Let's think about the probability of a message being spam vs. a message being ham for a minute. What is the likelihood that any random message is spam? What is the likelihood that it's ham? The fact is that we don't know for sure so we might just assume that there's a $50/50$ chance that any random, incoming message is spam or ham.

Applying this finding by plugging-in the corresponding probabilities results in:

$$ P(Spam \mid Prize) = \frac{P(Prize \mid Spam) 0.5}{P(Prize \mid Spam) 0.5 + P(Prize \mid Ham) 0.5} $$

That's great because now we can apply basic arithmetic and cancel out these probabilities, resulting in our final formulation to determine whether a message containing the word "Prize" should be considered "spammy".

$$ P(Spam \mid Prize) = \frac{P(Prize \mid Spam)}{P(Prize \mid Spam) + P(Prize \mid Ham)} $$

Scaling it up to whole sentences

Now that we know how to calculate the probability for a single word we should try to scale it up so that we can compute the probabilities for whole sentences.

A naive approach would be to insert the whole message as the conditional probability part. However that won't work as it would force our algorithm to essentially look for duplicate messages.

Thinking logically we can observe that a sentence is simply a succession of single words. What we could do is reuse what we did above for a single word and apply it to every word we come across in our message. Computing the overall probability for the whole message is then simply the multiplication of all those individual probabilities.

Let's pick one example probability calculation to describe this process mathematically:

$$ P(Message \mid Spam) = P(w_1, ..., w_n \mid Spam) \ = P(w_1 \mid Spam) \times ... \times P(w_n \mid Spam) $$

And that's where the Naive Bayes classifier acts naively (hence the name Naive Bayes). Because we're multiplying the probabilities we're making the bold assumption that the probabilities (presence or absence of words) are independent of each other. Everyone who ever received an E-Mail from a nigerian prince will agree that there's likely a dependency between words in spam messages.

Computational and mathematical challenges

Generally speaking we're now in a position where we can turn our findings into code and implement the Naive Bayes algorithm from scratch. However there are 2 more tweaks we want to implement in order to mitigate some minor hiccups we'll run into if we don't do it.

The first challenge we're faced with is a computational problem. As we stated above we're about to multiply a lot of probabilities (one probability for every word) which tend to be very small. Moden computer architectures are only capable to deal with a certain amount of precision when representing very large or very small numbers internally. If we're constantly multiplying small number we can run into the so-called "arithmetic underflow" problem which will eventually turn our number into a $0$.

A well known trick to deal with this problem is to use a combination of the exponential function and the logarithm. You might remember that:

$$ \log{a b} = \log(a) + \log(b) $$

and

$$ \exp(\log(x)) = x $$

We can use this to "wrap" our probability calculations into $\log$ to later on "unwrap" them again via $\exp$.

The second issue is rooted in the mathematics we're attempting to carry out. Let's understand the problem by looking into a case we'll very likely face.

What would $P(W \mid S)$ be if we've only ever found $W$ in no-spam messages?

Well if we've never seen the word $W$ in a spam message, then $(P \mid S) = 0$. And exactly this will turn into a problem for us given that we have such a term in our nominator and denominator. Everything multiplied by $0$ is $0$.

In order to deal with that issue we can introduce a factor $k$ which is a parameter we can tweak (most of the time it's set to $1$).

With this factor $k$ we can express $P(W \mid S)$ as follows:

$$ P(W \mid S) = \frac{k + \textrm{spam containing W}}{(2 \times k) + \textrm{total spam}} $$

Let's walk through a quick example to see how $k$ helps us. Our assumption is that we have analyzed $100$ spam examples but found the word $W$ exactly $0$ times. Calculating the probability that we might find the word $W$ in spam ($S$) without $k$ would result in the following:

$$ P(W \mid S) = \frac{0}{100} = 0 $$

Having this probability in a chain of probability multiplications would immediately turn the result into $0$ because $x \times 0 = 0$.

Let's introduce $k = 1$ to solve this issue:

$$ P(W \mid S) = \frac{k + 0}{(2 \times k) + 100} = \frac{1}{2 + 100} \approx 0.01 $$

Having $0.01$ in our multiplications won't hurt the accuracy while ensuring that we don't end up with a $0$ result just because one of the probability calculations resolved to $0$.

With that out of the way we're finally able to turn our findings into code!

Implementing Naive Bayes

Before we jump right into the implementation we should draw a mental picture as to how the Naive Bayes classifier works from a high level perspective.

You might've noticed that we talked a lot about probability calculations of finding words in a set of spam or ham messages. This indicates that we need to "train" our Naive Bayes classifier with training data so that we can do these computations on the fly. Otherwise if our Naive Bayes implementation hasn't seen any messages or words before, how do we know how to calculate the probabilities via Bayes Theorem?

The next thing we should think about are the raw messages we're feeding into our Naive Bayes algorithm, both for "training" it and doing predictions later on. First and foremost we're only focusing on messages in one language to keep the implementation simple. In our case all our messages will be in English. Other than that we need to parse the raw messages and identify and extract individual words given that we're using such words in our probability calculations. This process is called tokenization, so we need to implement a function which can handle this process.

The tokenizer

Note: The tokenizer we implement here is really simple. Usually you'd want to use robust NLP libraries such as NLTK for this task.

Let's start with the tokenize function which takes a raw message as its input and returns a list of valid words as an output.

The implementation is really straightforward. The first thing we do is to find all words via a regular expression. We then iterate over all the words we found, lowercase them and add them to our list. Once done we turn our list into a set to ensure that duplicate entries are filtered out.

def tokenize(text: str) -> Set[str]:
    words: List[str] = []
    for word in re.findall(r'[A-Za-z0-9\']+', text):
        words.append(word.lower())
    return set(words)

assert tokenize('Is this a text? If so, Tokenize this text!...') == {'is', 'this', 'a', 'text', 'if', 'so', 'tokenize'}

Naive Bayes

Now we're getting to the core of our implementation, the Naive Bayes classifier.

As we've stated earlier, our Naive Bayes classifier needs to be trained on existing messages in order to be able to do predictions on unseen messages later on. It therefore needs to remember what it saw and store this state internally. Because of that we'll be implementing Naive Bayes as a class.

Let's walk through the different methods the class needs step by step. We'll take a look at the code for the whole class at the end of this section.

First of all we need to implement a constructor which sets our internal state to default values and takes the $k$ parameter as an optional argument so that we can control how we want do deal with scenarios where we haven't seen a word in the spam or ham messages. If we're glancing over our probability calculations from above we can see that we need to count quantities such as how many spam messages we saw in total during training. These are such state information we initialize in our constructor:

def __init__(self, k=1) -> None:
    self._k: int = k
    self._num_spam_messages: int = 0
    self._num_ham_messages: int = 0
    self._num_word_in_spam: Dict[int] = defaultdict(int)
    self._num_word_in_ham: Dict[int] = defaultdict(int)
    self._spam_words: Set[str] = set()
    self._ham_words: Set[str] = set()
    self._words: Set[str] = set()

Next up we need to implement a train function which we'll use to train our Naive Bayes classifier. "Training" in our case means that we get a list of labeled messages (messages for which we know whether they're spam or ham), iterate over each individual message, tokenize it and then iterate over every word in every message to update our internal state with the necessary information such as how many spam messages we saw in the list of messages:

def train(self, messages: List[Message]) -> None:
    msg: Message
    token: str
    for msg in messages:
        tokens: Set[str] = tokenize(msg.text)
            self._words.update(tokens)
            if msg.is_spam:
                self._num_spam_messages += 1
                self._spam_words.update(tokens)
                for token in tokens:
                    self._num_word_in_spam[token] += 1
            else:
                self._num_ham_messages += 1
                self._ham_words.update(tokens)
                for token in tokens:
                    self._num_word_in_ham[token] += 1

Now before we jump into the implementation of the predict method which uses this information via Bayes Theorem calculations to do predictions for new, unseen messages it might be a good idea to implement 2 helper functions, one for every conditional probability we need to compute in Bayes Theorem.

Doing this will make the code more readable and will greatly help when implementing predict later on.

The implementations for our 2 methods are pretty straightforward. They're basically a translation of the conditional probability calculations, incorporating the $k$ factor to ensure that we'll never compute a $0$ probability:

def _p_word_spam(self, word: str) -> float:
    return (self._k + self._num_word_in_spam[word]) / ((2 * self._k) + self._num_spam_messages)

def _p_word_ham(self, word: str) -> float:
    return (self._k + self._num_word_in_ham[word]) / ((2 * self._k) + self._num_ham_messages)

And now we're finally able to implement the meat of our Naive Bayes classifier. The predict function.

The predict function gets a message (we call it text) which it tokenizes to extract its words. Next up it iterates through all the words the Naive Bayes classifier saw during training (all words in both, spam and ham messages) to check if the word in question can also be found in the new, unseen message. While doing that it calculates the probabilities with the help of our previously defined helper functions. The return value of the predict function is the application of Bayes Theorem which computes an overall probability indicating how likely it is that the new, unseen message is spam.

Note that in the following implementation we're applying the tricks we learned about in our discussion of underflow problems (mainly doing the wrapping and unwrapping via $\log$ and $exp$). Applying these techniques might make the code look a little bit confusing or intimidating, however if you look closely and ignore the log and exp usages you'll find that it's just the application of Bayes Theorem.

def predict(self, text: str) -> float:
    text_words: Set[str] = tokenize(text)
    log_p_spam: float = 0.0
    log_p_ham: float = 0.0

    for word in self._words:
        p_spam: float = self._p_word_spam(word)
        p_ham: float = self._p_word_ham(word)
        if word in text_words:
            log_p_spam += log(p_spam)
            log_p_ham += log(p_ham)
        else:
            log_p_spam += log(1 - p_spam)
            log_p_ham += log(1 - p_ham)

    p_if_spam: float = exp(log_p_spam)
    p_if_ham: float = exp(log_p_ham)
    return p_if_spam / (p_if_spam + p_if_ham)

Putting it all together, here's the full NaiveBayes class with all its methods:

class NaiveBayes:
    def __init__(self, k=1) -> None:
        self._k: int = k
        self._num_spam_messages: int = 0
        self._num_ham_messages: int = 0
        self._num_word_in_spam: Dict[int] = defaultdict(int)
        self._num_word_in_ham: Dict[int] = defaultdict(int)
        self._spam_words: Set[str] = set()
        self._ham_words: Set[str] = set()
        self._words: Set[str] = set()

    def train(self, messages: List[Message]) -> None:
        msg: Message
        token: str
        for msg in messages:
            tokens: Set[str] = tokenize(msg.text)
            self._words.update(tokens)
            if msg.is_spam:
                self._num_spam_messages += 1
                self._spam_words.update(tokens)
                for token in tokens:
                    self._num_word_in_spam[token] += 1
            else:
                self._num_ham_messages += 1
                self._ham_words.update(tokens)
                for token in tokens:
                    self._num_word_in_ham[token] += 1

    def _p_word_spam(self, word: str) -> float:
        return (self._k + self._num_word_in_spam[word]) / ((2 * self._k) + self._num_spam_messages)

    def _p_word_ham(self, word: str) -> float:
        return (self._k + self._num_word_in_ham[word]) / ((2 * self._k) + self._num_ham_messages)

    def predict(self, text: str) -> float:
        text_words: Set[str] = tokenize(text)
        log_p_spam: float = 0.0
        log_p_ham: float = 0.0

        for word in self._words:
            p_spam: float = self._p_word_spam(word)
            p_ham: float = self._p_word_ham(word)
            if word in text_words:
                log_p_spam += log(p_spam)
                log_p_ham += log(p_ham)
            else:
                log_p_spam += log(1 - p_spam)
                log_p_ham += log(1 - p_ham)

        p_if_spam: float = exp(log_p_spam)
        p_if_ham: float = exp(log_p_ham)
        return p_if_spam / (p_if_spam + p_if_ham)

Naive Bayes in action

Let's take our Naive Bayes implementation for a spin!

We'll use the "Enron Spam" data set to train and test our implementation.

Here's the code which downloads and extracts the data set:

wget -nc -P data http://nlp.cs.aueb.gr/software_and_datasets/Enron-Spam/preprocessed/enron1.tar.gz

tar -xzf data/enron1.tar.gz -C data

Reading through the readme we find that the spam and ham messages are stored in separate directories, so we need to find such directories and then iteratively open and parse every single file we find in them. We then store the data from the parsed subject line (we're only using the E-Mails subject to keep things simple) in a NamedTuple and append it to a list containing all the messages. Here's the code which does just that:

spam_data_path: Path = data_dir / 'enron1' / 'spam'
ham_data_path: Path = data_dir / 'enron1' / 'ham'

class Message(NamedTuple):
    text: str
    is_spam: bool

spam_message_paths: List[str] = glob.glob(str(spam_data_path / '*.txt'))
ham_message_paths: List[str] = glob.glob(str(ham_data_path / '*.txt'))

message_paths: List[str] = spam_message_paths + ham_message_paths

messages: List[Message] = []

for path in message_paths:
    with open(path, errors='ignore') as file:
        is_spam: bool = True if 'spam' in path else False
        text: str = file.readline().replace('Subject:', '').strip()
        messages.append(Message(text, is_spam))

And that's all there is in terms of data preparation.

Now we need to split our data into a "training" and "testing" set which we'll use to train our Naive Bayes classifier.

Our train_test_split function is a simple function which shuffles all the messages and then assigns them to 2 dedicated lists: One for training and one for testing. The default splitting rate is $80/20$, meaning $80%$ of all messages will be assigned to the training set and $20%$ of all messages will be assigned to the test set.

def train_test_split(messages: List[Message], pct=0.8) -> Tuple[List[Message], List[Message]]:
    shuffle(messages)
    num_train = int(round(len(messages) * pct, 0))
    return messages[:num_train], messages[num_train:]

train: List[Message]
test: List[Message]

train, test = train_test_split(messages)

Training our Naive Bayes classifier is as simple as creating a new instance and calling the train method with the training set:

nb: NaiveBayes = NaiveBayes()
nb.train(train)

Let's grab some spam messages from our test set (the data our classifier hasn't seen yet) and see what gets predicted:

spam_messages: List[Message] = [item for item in test if item.is_spam]

message: str = spam_messages[10].text

print(f'Predicting likelihood of "{message}" being spam.')
nb.predict(message)

# Predicting likelihood of "get your hand clock replicas today carson" being spam.
# 0.9884313222593173

Almost $99%$. Not bad!

And what about a ham message?

ham_messages: List[Message] = [item for item in test if not item.is_spam]

message: str = ham_messages[10].text

print(f'Predicting likelihood of "{text}" being spam.')
nb.predict(message)

# Predicting likelihood of "associate & analyst mid - year 2001 prc process" being spam.
# 5.3089147140900964e-05

Great! It's time to pat yourself on the back! You've successfully implemented and trained your very own Naive Bayes classifier to reliably identify spam and ham messages!

Note: $99%$ sounds too good to be true and there's certainly a smell of overfitting in the air. Remember that the data set we've used for training is rather small. Furthermore we've only trained based on the E-Mails subject line. You might want to modify the code to train on the whole message body or find a different data set altogether.

Conclusion

Naive Bayes is a powerful Machine Learning algorithm which makes it possible to classify unseen data based on probability scores.

The basis for Naive Bayes forms Bayes Theorem, one of the most fundamental algorithms in probability theory. With Bayes Theorem one can calculate conditional probabilities such as "how likely is it that this message is spam given word $X$?" which is exactly what's necessary to implementing a spam classifier.

While we've used the classic spam filtering use case to deconstruct Naive Bayes, there are more areas this algorithm can be applied to.

Given that recent developments in Deep Learning called Probabilistic Deep Learning incorporate Bayesian thinking into their underlying models it's a good investment to solidify the understanding about Bayes Theorem and its application via Naive Bayes.

Do you have any questions, feedback or comments? Feel free to reach out via E-Mail or connect with me on Twitter.

Additional Resources

One of the main resource I studied while learning about Machine Learning algorithms such as Naive Bayes is the book "Data Science from Scratch" by Joel Grus. This book is pure gold as it teaches you all the nitty gritty details about the most fundamental algorithms out there. If you haven't already, do yourself a favor and purchase a copy of this book. It's worth it.

The following is a list of resources I've used to compile this blog post. Other helpful resources are also linked within the article itself.

Joel Grus - Data Science from Scratch
[JB Statistics - Statistics for everyone!](https://www.jbstatistics.com)
Wikipedia - Bayes Theorem
Wikipedia - Naive Bayes classifier
Math is Fun - Bayes Theorem
3Blue1Brown - Bayes Theorem

K-nearest neighbors from scratch

Tue, 17 Mar 2020 14:07:10 GMT

You can find working code examples (including this one) in my lab repository on GitHub.

K-nearest neighbors (abbreviated as k-NN or KNN) is a simple, yet elegant Machine Learning algorithm to classify unseen data based on existing data. The neat property of this algorithm is that it doesn't require a "traditional" training phase. If you have a classification problem and labeled data you can predict the class of any unseen data point by leveraging your existing, already classified data.

Let's take a closer look at the intuitions behind the core ideas, the involved math and the translation of such into code.

The intuition

Imagine that we've invited 100 dog owners with their dogs over for a statistical experiment we want to run. Each dog participating in this experiment is 1 out of 4 different dog breeds we're interested in studying. While we have the dog owners and their dogs around we measure 3 different properties of each dog:

Its weight (in kilograms)
Its height (in centimeters)
Its alertness (on a scale from 0 to 1 [1=very alert, 0=almost no alertness])

Once done, we normalize the measurements so that they fall into a range between $0$ and $1$.

After collection the data on each individual dog we end up with 100 3-pair measurements, each of which is labeled with the corresponding dog breed.

Here's one example:

$$ \begin{pmatrix} 0.5 \ 0.8 \ 0.1 \end{pmatrix} = Podenco $$

In order to better understand the data it's always a good idea to plot it. Since we have collected 3 different measurements (weight, height and alertness) it's possible to project all of the 100 data points into a 3 dimensional space and color every data point according to its label (e.g. brown for the label "Podenco").

Unfortunately we run into an issue while attempting to plot this data. As it turns out we forgot to label one measurement. We do have the dogs width, its height and its alertness but for some reason we forgot to write down the dogs breed.

Is there any chance we could derive what this dogs breed might be given all the other dog measurements we already have? We can still add the unlabeled data point into our existing 3 dimensional space where all the other colored data points reside. But how should we color it?

One potential solution is to look at the, say 5 surrounding neighbors of the data point in question and see what their color is. If the majority of those data points is labeled "Podenco" then it's very likely that our measurements were also taken from a Podenco.

And that's exactly what the k-NN algorithm does. The k-nearest neighbors algorithm predicts the class of an unseen data point based on its k-nearest neighbors and the majority of their respective classes. Let's take a closer look at this from a mathematical perspective.

The 2 building blocks

There are only 2 concepts we need to implement in order to classify unseen data via k-NN.

As stated above, the algorithm works by looking at the k-nearest neighbors and the majority of their respective classes in order to classify unseen data.

Because of that we need to implement 2 functions. A distance function which calculates the distance between two points and a voting function which returns the most seen label given a list of arbitrary labels.

Distance function

Given the notion of "nearest neighbors" we need to calculate the distance between our "to be classified" data point and all the other data points to find the $k$ closest ones.

There are a couple of different distance functions out there. For our implementation we'll use the Euclidean distance as it's simple to calculate and can easily scale up to $n$ dimensions.

The mathematical notation is as follows:

$$ d(x, y) = d(y, x) = \sqrt{\sum_{i=1}^N (x_i - y_i)^2} $$

Let's unpack this formula with the help of an example. Assuming we have two vectors $a$ and $b$, the euclidean distance between the two is calculated as follows:

$$ \vec{a} = \begin{pmatrix} 1 \ 2 \ 3 \ 4 \end{pmatrix} \vec{b} = \begin{pmatrix} 5 \ 6 \ 7 \ 8 \end{pmatrix} $$

$$ d(\vec{a}, \vec{b}) = d(\vec{b}, \vec{a}) \ = \sqrt{(1 - 5)^2 + (2 - 6)^2 + (3 - 7)^2 + (4 - 8)^2} \ = \sqrt{64} = 8 $$

Translating this into code results in the following:

def distance(x: List[float], y: List[float]) -> float:
    assert len(x) == len(y)
    return sqrt(sum((x[i] - y[i]) ** 2 for i in range(len(x))))

assert distance([1, 2, 3, 4], [5, 6, 7, 8]) == 8

Great. We just implemented the first building block: An Euclidean distance function.

Voting function

Next up we need to implement a voting function. The voting function takes as an input a list of labels and returns the "most seen" label of that list. While this sounds trivial to implement we should take a step back and think about potential edge cases we might run into.

One of those edge cases is the situation in which we have 2 or more "most seen" labels:

# Do we return `a` or `b`?
labels: List[str] = ['a', 'a', 'b', 'b', 'c']

For those scenarios we need to implement a tie breaking mechanism.

There are several ways to deal with this. One solution might be to pick a candidate randomly. In our case however we shouldn't think about the vote function in isolation since we know that the distance and vote functions both work together to determine the label of the unseen data point.

We can exploit this fact and assume that the list of labels our vote function gets as an input is sorted by distance, nearest to furthest. Given this requirement it's easy to break ties. All we need to do is recursively remove the last entry in the list (which is the furthest) until we only have one clear winning label.

The following demonstrates this process based on the labels example from above:

# Do we return `a` or `b`?
labels: List[str] = ['a', 'a', 'b', 'b', 'c']

# Remove one entry. We're still unsure if we should return `a` or `b`
labels: List[str] = ['a', 'a', 'b', 'b']

# Remove another entry. Now it's clear that `a` is the "winner"
labels: List[str] = ['a', 'a', 'b']

Let's translate that algorithm into a function we call majority_vote:

def majority_vote(labels: List[str]) -> str:
    counted: Counter = Counter(labels)
    winner: List[str] = []
    max_num: int = 0
    most_common: List[Tuple[str, int]]
    for most_common in counted.most_common():
        label: str = most_common[0]
        num: int = most_common[1]
        if num < max_num:
            break
        max_num = num
        winner.append(label)
    if len(winner) > 1:
        return majority_vote(labels[:-1])
    return winner[0]

assert majority_vote(['a', 'b', 'b', 'c']) == 'b'
assert majority_vote(['a', 'b', 'b', 'a']) == 'b'
assert majority_vote(['a', 'a', 'b', 'b', 'c']) == 'a'

The tests at the bottom show that our majority_vote function reliably deals with edge cases such as the one described above.

The k-NN algorithm

Now that we've studied and codified both building blocks it's time to bring them together. The knn function we're about to implement takes as inputs the list of labeled data points, a new measurement (the data point we want to classify) and a parameter k which determines how many neighbors we want to take into account when voting for the new label via our majority_vote function.

The first thing our knn algorithm should do is to calculate the distances between the new data point and all the other, existing data points. Once done we need to order the distances from nearest to furthest and extract the data point labels. This sorted list is then truncated so that it only contains the k nearest data point labels. The last step is to pass this list into the voting function which computes the predicted label.

Turning the described steps into code results in the following knn function:

def knn(labeled_data: List[LabeledData], new_measurement, k: int = 5) -> Prediction:
    class Distance(NamedTuple):
        label: str
        distance: float
    distances: List[Distance] = [Distance(data.label, distance(new_measurement, data.measurements))
                                          for data in labeled_data]
    distances = sorted(distances, key=attrgetter('distance'))
    labels = [distance.label for distance in distances][:k]
    label: str = majority_vote(labels)
    return Prediction(label, new_measurement)

That's it. That's the k-nearest neighbors algorithm implemented from scratch!

Classifying Iris flowers

It's time to see if our homebrew k-NN implementation works as advertised. To test drive what we've coded we'll use the infamous Iris flower data set.

The data set consists of 50 samples of three different flower species called Iris:

For each sample, 4 different measurements were collected: The sepal width and length as well as its petal width and length.

The following is an example row from the data set where the first 4 numbers are the sepal length, sepal width, petal length, petal width and the last string represents the label for those measurements.

6.9,3.1,5.1,2.3,Iris-virginica

The best way to explore this data is to plot it. Unfortunately it's hard to plot and inspect 4 dimensional data. However we can pick 2 measurements (e.g. petal length and petal width) and scatter plot those in 2 dimensions.

Note: We're using the amazing Plotly library to create our scatter plots.

fig = px.scatter(x=xs, y=ys, color=text, hover_name=text, labels={'x': 'Petal Length', 'y': 'Petal Width'})
fig.show()

We can clearly see clusters of data points which all share the same color and therefore the same label.

Now let's pretend that we have a new, unlabeled data point:

new_measurement: List[float] = [7, 3, 4.8, 1.5]

Adding this data point to our existing scatter plot results in the following:

fig = px.scatter(x=xs, y=ys, color=text, hover_name=text, labels={'x': 'Petal Length', 'y': 'Petal Width'})
fig.add_annotation(
    go.layout.Annotation(
            x=new_measurement[petal_length_idx],
            y=new_measurement[petal_width_idx],
            text="The measurement we want to classify")
)
fig.update_annotations(dict(
            xref="x",
            yref="y",
            showarrow=True,
            arrowhead=7,
            ax=0,
            ay=-40,
            borderwidth=2,
            borderpad=4,
            bgcolor="#c3c3c3"
))
fig.show()

Even if we're just plotting the petal length and petal width in 2 dimensions it seems to be the case that the new measurement might be coming from an "Iris Versicolor".

Let's use our knn function to get a definite answer:

knn(labeled_data, new_measurement, 5)

And sure enough the result we get back indicates that we're dealing with an "Iris Versicolor":

Prediction(label='Iris-versicolor', measurements=[7, 3, 4.8, 1.5])

Conclusion

k-nearest neighbors is a very powerful classification algorithm which makes it possible to label unseen data based on existing data. k-NNs main idea is to use the $k$ nearest neighbors of the new, "to-be-classified" data point to "vote" on the label it should have.

Given that, we need 2 core functions to implement k-NN. The first function calculates the distance between 2 data points so that nearest neighbors can be found. The second function performs a majority vote so that a decision can be made as to what label is most present in the given neighborhood.

Using both functions together brings k-NN to life and makes it possible to reliably label unseen data points.

Do you have any questions, feedback or comments? Feel free to reach out via E-Mail or connect with me on Twitter.

Thanks to Eric Nieuwland who reached out via E-Mail and provided some code simplifications!

Additional Resources

The following is a list of resources I've used to work on this blog post. Other helpful resources are also linked within the article itself.

Gradient Descent from scratch

Thu, 05 Mar 2020 14:56:26 GMT

You can find working code examples (including this one) in my lab repository on GitHub.

Gradient Descent is one of the most fundamental optimization techniques used in Machine Learning. But what is a gradient? On what do we descent down and what do we even optimize in the first place?

Those might be some of the questions which come to mind when having the first encounters with Gradient Descent. Let's answer those questions while implementing the Gradient Descent optimization technique from scratch.

Optimizations and loss functions

Many Machine Learning problems require some form of optimization. Usually the algorithm starts with an initial guess as to what the correct answer might be and then slowly adapts its parameters to be less wrong with every consecutive trial.

This learning process through trial and error repeats until the algorithm has learned to "correctly" predict the results for the training- and test data it gets fed.

In order to figure out how wrong our algorithm is with its predictions, we need to define the notion of a loss function. The loss function compares the guesses and the actual values and turns the differences between the two into a measurement we can use to quantify the quality of our algorithm. Prominent loss function are Mean squared error (MSE), Root mean square error (RMSE) or Sum of squared errors (SSE).

Imagine that we're plotting the loss (i.e. "wrongdoing") the algorithm produces as a surface in a multi dimensional space (such as 3D). Intuitively it seems to make sense to find the "place" on this surface where the algorithm is doing the fewest mistakes. That's exactly where Gradient Descent comes into play. With Gradient Descent we take a "walk" on this surface to find such a place.

Gradient Descent by example

As we've already discovered, loss functions and optimizations are usually intertwined when working on Machine Learning problems. While it makes sense to teach them together I personally believe that it's more valuable to keep things simple and focused while exploring core ideas. Therefore for the rest of this post we'll solely focus on Gradient Descent as a mathematical optimization technique which can be applied in a variety of different domains (including Machine Learning problems).

A Paraboloid

The first thing we should do is to define the function we want to run the Gradient Descent algorithm on. Most examples explain the algorithms core concepts with a single variable function such as a function drawn from the class of parabolas (e.g. $x^2$). Single variable functions can be easily plotted in a 2 dimensional space along the $x$ and $y$ axes. Besides other nice properties, dealing with only 2 dimensions makes the involved math a whole lot easier. Real world Machine Learning problems however deal with data in an order of magnitude higher dimensions. While there's a slightly steeper learning curve to go from 2D to 3D there's pretty much nothing new to learn to go from 3D to $n$D. Because of that we'll apply Gradient Descent to a multivariable function with 2 variables.

Most of us studied the properties of quadratic functions via parabolas in school. But is there an equivalent function which can be plotted in 3 dimensions? Imagine that you have a parabola and spin it like a centrifuge along its $y$-axis. What you'd end up with is a surface called a paraboloid, a "3 dimensional parabola".

There are various ways to define paraboloids but in our case let's use the "infinite paraboloid" which is defined as follows:

$$ x^2 + y^2 $$

Translating the math into code results in:

def paraboloid(x: float, y: float) -> float:
    return x ** 2 + y ** 2

Simple enough. Next up we should generate some test data, pass it into our paraboloid function and plot the results to see if everything works as expected:

# Test data generation (only really necessary for the plotting below)
xs_start = ys_start = -10
xs_stop = ys_stop = 11
xs_step = ys_step = 1

xs: List[float] = [i for i in range(xs_start, xs_stop, xs_step)]
ys: List[float] = [i for i in range(ys_start, ys_stop, ys_step)]
zs: List[List[float]] = []

for x in xs:
    temp_res: List[float] = []
    for y in ys:
        result: float = paraboloid(x, y)
        temp_res.append(result)
    zs.append(temp_res)

print(f'xs: {xs}\n')
# xs: [-10, -9, -8, -7, -6, -5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

print(f'ys: {ys}\n')
# ys: [-10, -9, -8, -7, -6, -5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

print(f'zs: {zs[:5]} ...\n')
# zs: [[200, 181, 164, 149, 136, 125, 116, 109, 104, 101, 100, 101, 104, 109, 116, 125, 136, 149, 164, 181, 200], [181, 162, 145, 130, 117, 106, 97, 90, 85, 82, 81, 82, 85, 90, 97, 106, 117, 130, 145, 162, 181], [164, 145, 128, 113, 100, 89, 80, 73, 68, 65, 64, 65, 68, 73, 80, 89, 100, 113, 128, 145, 164], [149, 130, 113, 98, 85, 74, 65, 58, 53, 50, 49, 50, 53, 58, 65, 74, 85, 98, 113, 130, 149], [136, 117, 100, 85, 72, 61, 52, 45, 40, 37, 36, 37, 40, 45, 52, 61, 72, 85, 100, 117, 136]] ...

Plotting the graph (using the Plotly library) can be done with the following code:

fig = go.Figure(go.Surface(x=xs, y=ys, z=zs, colorscale='Viridis'))
fig.show()

Gradients and derivatives

According to Wikipedia the definition of a gradient is:

... a vector-valued function $f$..., whose value at a point $p$ is the vector whose components are the partial derivatives of $f$ at $p$.

I agree that this sounds overwhelming but the core concepts of a gradient are really simple. What this quote from above tries to convey is that for any given point $p$ on the functions plotted surface there's a vector consisting of partial derivatives which points in the direction of greatest increase.

With that explanation the only concept we need to sort out is the notion of partial derivatives.

If you've studied Differential Calculus you surely came across the topic of derivatives. To recap real quick, a derivative measures the sensitivity to change of a functions output with respect to changes in its inputs.

Derivatives can be of different orders. One of the most prominent derivatives you surely came across is the first-order derivative which is the slope of a tangent line that tells us whether the function is increasing or decreasing at any given point.

The process of differentiation obeys several rules, some of which we'll apply when walking through the example below to refresh our fond memories of Calculus I.

Let's calculate the first-, second- and third-order derivative of the following function:

$$ x^3 + 2x^2 - 5 $$

According to the differentiation rules, the first-order derivative is:

$$ \frac{d}{dx}(x^3 + 2x^2 - 5) = 3x^2 + 4x $$

The second-order derivative is:

$$ \frac{d}{dx}(3x^2 + 4x) = 6x + 4 $$

And finally the third-order derivative is:

$$ \frac{d}{dx}(6x + 4) = 6 $$

Why did we do this? As you might recall from above, a gradient is a vector of partial derivatives of function $f$ at point $p$. So in order to compute gradients we need to compute the partial derivatives of our function $f$.

We certainly do know how to compute derivatives as we've shown above, but how do we calculate partial derivatives? As it turns out you already know how to calculate partial derivatives if you know how to calculate derivatives. The twist with partial derivatives is that you're deriving with respect to a variable while treating every other variable as a constant.

Let's apply this rule and compute the partial derivatives of our Paraboloid function $x^2 + y^2$ which we call $f$:

The first partial derivative we calculate is the derivative of $f$ with respect to $x$, treating $y$ as a constant:

$$ \frac{\partial}{\partial x}(x^2 + y^2) = 2x $$

The second partial derivative calculation follows the same pattern, deriving $f$ with respect to $y$ while treating $x$ as a constant:

$$ \frac{\partial}{\partial y}(x^2 + y^2) = 2y $$

Note: Don't get confused by the notation here. While you'd use $\frac{d}{dx}$ for "normal" derivatives you simply use $\frac{\partial}{\partial x}$ for partial derivatives.

And that's it. With those partial derivatives we're now able to compute any gradient for any point $p$ sitting on the plotted surface of function $f$. Let's translate our findings into code:

def compute_gradient(vec: List[float]) -> List[float]:
    assert len(vec) == 2
    x: float = vec[0]
    y: float = vec[1]
    return [2 * x, 2 * y]

Descending down the surface

Let's recap what we've accomplished so far. First, we've codified and plotted a function called the "infinite paraboloid" which is used throughout this post to demonstrate the Gradient Descent algorithm. Next up we studied gradients, taking a quick detour into Differential Calculus to refresh our memories about differentiation. Finally we computed the partial derivatives of our paraboloid function.

Translating the prose into a mental image, we have a function $f$ which produces a surface in a 3 dimensional space. Given any point $p$ on that surface we can use the gradient (via its partial derivatives) to find a vector pointing into the direction of greatest increase.

That's great, but we're dealing with an optimization problem and for a lot of such applications it's usually desirable to find a global or local minimum. Right now, the gradient vector is pointing upwards to the direction of greatest increase. We need to turn that vector into the opposite direction so that it points to the direction of greatest decrease.

Linear Algebra taught us that doing that is as simple as multiplying the gradient vector by $-1$. Another neat Linear Algebra trick is to multiply a vector by a number other than $1$ to change its magnitude (= its length). If we're for example multiplying the gradient vector by $0.25$ we would get a vector which has length $\frac{1}{4}$ of its original length.

We finally have all the building blocks in place to put together the Gradient Descent algorithm which eventually finds the nearest local minimum for any given function.

The algorithm works as follows.

Given a function $f$ we want to run Gradient Descent on:

Get the starting position $p$ (which is represented as a vector) on $f$
Compute the gradient at point $p$
Multiply the gradient by a negative "step size" (usually a value smaller than $1$)
Compute the next position of $p$ on the surface by adding the rescaled gradient vector to the vector $p$
Repeat step 2 with the new $p$ until convergence

Let's turn that into code. First, we should define a compute_step function which takes as parameters the vector $p$ and a "step size" (we call it learning_rate), and computes the next position of $p$ according to the $f$ functions gradient:

def compute_step(curr_pos: List[float], learning_rate: float) -> List[float]:
    grad: List[float] = compute_gradient(curr_pos)
    grad[0] *= -learning_rate
    grad[1] *= -learning_rate
    next_pos: List[float] = [0, 0]
    next_pos[0] = curr_pos[0] + grad[0]
    next_pos[1] = curr_pos[1] + grad[1]
    return next_pos

Next up we should define a random starting position $p$:

start_pos: List[float]

# Ensure that we don't start at a minimum (0, 0 in our case)
while True:
    start_x: float = randint(xs_start, xs_stop)
    start_y: float = randint(ys_start, ys_stop)
    if start_x != 0 and start_y != 0:
        start_pos = [start_x, start_y]
        break

print(start_pos)
# [6, -3]

And finally we wrap our compute_step function into a loop to iteratively walk down the surface and eventually reach a local minimum:

epochs: int = 5000
learning_rate: float = 0.001

best_pos: List[float] = start_pos

for i in range(0, epochs):
    next_pos: List[float] = compute_step(best_pos, learning_rate)
    if i % 500 == 0:
        print(f'Epoch {i}: {next_pos}')
    best_pos = next_pos

print(f'Best guess for a minimum: {best_pos}')

# Epoch 0: [5.988, -2.994]
# Epoch 500: [2.2006573940846716, -1.1003286970423358]
# Epoch 1000: [0.8087663604107433, -0.4043831802053717]
# Epoch 1500: [0.2972307400008091, -0.14861537000040456]
# Epoch 2000: [0.10923564223981955, -0.054617821119909774]
# Epoch 2500: [0.04014532795468376, -0.02007266397734188]
# Epoch 3000: [0.014753859853277982, -0.007376929926638991]
# Epoch 3500: [0.005422209548664845, -0.0027111047743324226]
# Epoch 4000: [0.0019927230353282872, -0.0009963615176641436]
# Epoch 4500: [0.0007323481432962652, -0.0003661740716481326]
# Best guess for a minimum: [0.00026968555624761565, -0.00013484277812380783]

That's it. We've successfully implemented the Gradient Descent algorithm from scratch!

Conclusion

Gradient Descent is a fundamental optimization algorithm widely used in Machine Learning applications. Given that it's used to minimize the errors in the predictions the algorithm is making it's at the very core of what algorithms enable to "learn".

In this post we've dissected all the different parts the Gradient Descent algorithm consists of. We looked at the mathematical formulations and translated those into code, which in our case can be used to find the global minimum on a Paraboloid function $x^2 + y^2$. Note that our implementation was optimized for readability and educational usefulness rather than performance. You might not want to use this code in production.

Modern Machine Learning libraries such as TensorFlow or PyTorch ship with built-in differentiation capabilities to automatically compute partial derivatives, removing the need to implement the tedious Math yourself.

Do you have any questions, feedback or comments? Feel free to reach out via E-Mail or connect with me on Twitter.

Additional Resources

The following is a list with resources I've used to write this blog post. The post itself also links to some other interesting content you might want to further explore.

Swift for modern Machine Learning

Thu, 13 Feb 2020 13:31:03 GMT

Note: In this post we'll compare and contrast different programming languages. Everything discussed should be taken with a grain of salt. There's no single programming language which solves all the problems in an elegant and performant way. Every language has its up- and downsides. Swift is no exception.

Entering the Data Science and Machine Learning world there are various programming languages and tools to choose from. There's MATLAB, a commercial programming environment which is used across different industries and usually the tool of choice for practicioners with a heavy Math background. A free and Open Source alternative is the R project, a programming language created in 1993 to simplify statistical data processing. People working with R usually report enjoyment as R is "hackable" and comes bundled with different math modules and plotting libraries.

A more recent incarnation is the Julia scientific programming language which was created at MIT to resolve the issues older tools such as MATLAB and R struggled with. Julia cleverly incorporates modern engineering efforts from the fields of compiler construction and parallel computing and given its Open Source nature it has gained a lot of industry-wide adoption when it reached v1 maturity in 2018.

Python as the de-facto standard

If you're doing some more research to find the most used programming language in Data Science and Machine Learning you might be surprised to see a language which wasn't built from the ground up for scientific computing: Python.

The Python programming language was created by Guido van Rossum in 1989 to help bridge the gap between Bash scripting and C programming. Since then Python took the world by storm mainly due to its flat learning curve, its expressiveness and its powerful standard library which makes it possible to focus on the core problems rather than reinventing the wheel over and over again.

Funny tangent: Open up a shell, run the Python interpreter via python and enter import this or import antigravity

Python is a general purpose programming language which was never designed to solve problems in a niche subject matter. Are you an Instagram user? They're running Python. Do you curate content via Pinterest? They're running Python. Do you store your data via Dropbox? They've developed their MVP in Python and still use it today. Even Google (then called BackRub) started out with Python and Java. The list goes on and on.

Given such an industry-wide adoption it's easy to see why a lot of care and effort were put into the ecosystem of reusable packages as well as the language itself. No matter what use case you're working on, chances are that there are numerous Python packages helping you solve your problems.

While more famous Python projects include Web Frameworks such as Django or Flask there are also lot of mature Scientific and Machine Learning implementations written in Python. Having access to such a robust foundation it only makes sense that modern Deep Learning frameworks such as TensorFlow or PyTorch are also leveraging those libraries under the covers.

Hitting the limits

All of the things discussed so far sound great. Python, a general purpose programming language which has quite a few years of existence under its belt is used across industries in mission critical software systems. Over the course of 30 years a vibrant Open Source community emerged which develops and maintains powerful libraries used by millions of users on a daily basis.

Why bother and replace Python? If it ain't broke, don't fix it!

Technology is constantly improving. What was once unthinkable might all of the sudden be possible thanks to breakthroughs in Hard- and Software development. Python was created in a different era with a different purpose. It was never engineered to directly interface with hardware or run complex computations across a fleet of distributed machines.

A modern Deep Learning Framework such as TensorFlow uses dozens of programming languages behind the scenes. The core of such a library might be written in high-performance C++ which occasionally interfaces with different C libraries, Fortran programs or even parts of Assembly language to squeeze out every bit of performance possible. A Python interface is usually built on top of the C++ core to expose a simple public API Data Scientists and Deep Learning enthusiasts use.

Why isn't Python used throughout the whole stack?

The answer to this question is rather involved but the gist of it is that Pythons language design is more tailored towards high level programming. Furthermore it's just not fast enough to be used at the lower layers.

The following is an incomplete list of Pythons (subjective) shortcomings:

Speed

Python code usually runs an order of magnitude slower compared to other interpreted and compiled languages. Language implementations such as Cython which compile Python code to raw C try to mitigate this problem but they come with other issues (e.g. language inconsistencies, compatibility problems, ...).

Parallel Processing

It's not that straightforward to write Python code which reliably performs parallel processing tasks on multiple cores or even multiple machines. Deep Neural Networks can be expressed as graphs on which Tensor computations are carried out, making it a prime use case for parallel processing.

Hardware integration

Python is a high level language with lots of useful abstractions which unfortunately get in the way when trying to directly interface with the computers underlying hardware. Because of that heavy GPU computations are usually moved into lower-level code written in e.g. C or CUDA.

Interpreted rather than compiled

Since it's a scripting language at its core, Python comes with its own runtime that evaluates the script line-by-line as it runs it. A process called "interpretation".

The other branch of programming languages are compiled languages. Compiling code means that the human-readable program code is translated into code a machine can read and understand. Compiled programs have the downside that there's a compilation step in between writing and running the program. The upside of such step is that various checks and optimizations can be performed while translating the code, eventually emitting the most efficient machine code possible.

Dynamic typing

Python has no concept of typing. There's no problem in passing an integer into a function which expects a string. Python will run the program and raise an exception as soon as it evaluates the broken code.

Strongly typed languages have the upside that mistakes like the one described above are impossible to make. The developer has to explicitly declare which types are expected.

Python has recently added support for type hints. Type hinting merely serves as another form of documentation as it still won't prevent type misuses in programs.

Interoperability

A lot of prominent packages such as Numpy wrap other languages such as Fortran or C to offer reliable performance when working on computational expensive data processing tasks.

While it's certainly not impossible to introduce existing libraries written in other languages into Python, the process to do that is oftentimes rather involved.

Entering Swift

Without going into too much detail it makes sense to take a quick detour and study the origins of the Swift programming language in order to see why it has such a potential to replace Python as the go-to choice for Data Science and Machine Learning projects.

[Chris Lattner](https://en.wikipedia.org/wiki/Chris_Lattner), the inventor of Swift has a long history and established track record in modern compiler development. During college he worked on a project which eventually became LLVM ("Low Level Virtual Machine"), the infamous compiler infrastructure toolchain. The revolutionary idea behind LLVM is the introduction of frontends and backends which can be mixed and matched. One frontend could be written for Swift which is then coupled with a backend implementation for the x86 architecture. Making it possible to compile to another architecture is as simple as using another backend such as the one for PowerPC. Back in the early compiler days one had to write the compiler end-to-end, tightly coupling the frontend and backend, making it a heroic effort to offer the compiler for different platforms.

LLVM gained a lot of traction and Chris Lattner was eventually hired by Apple to work on its developer toolings which heavily relied on LLVM. During that time he worked on a C++ compiler and thought about ways how a better, more modern programming language might look like. He figured that it should be compiled, easy to learn, flexible enough to feel like a scripting language and at the same time "hackable" at every layer. Those ideas translated into the Swift programming language which was officially released at WWDC in 2014.

But what exactly makes Swift such a natural fit as a Python replacement? Isn't Swift only used for iOS and macOS apps? The following section shows why Swift could be Pythons successor.

It's compiled

Swift is compiled via LLVM which means that its code is translated into optimized machine code directly running on the target platform. Improvements made to the LLVM compiler toolchain automatically benefit the Swift code generation.

There's the saying that Swift is "syntactic sugar for LLVM" which rings true as one can see with the Builtin usage for its core types. The linked code snippet shows that Swifts core types directly interface with their LLVM equivalents.

Python-like syntax

Despite the compilation process Swift feels like a dynamic, Python-esque language. Swift was designed from the ground up for programs to incrementally grow in complexity as necessary. The simplest of all Swift programs is just one line of code: print("Hello World").

let greeting = "Hello World"
print(greeting)
// Hello World

let num1 = 1
let num2 = 2
print(num1 + num2)
// 3

let scores = [10, 35, 52, 92, 88]
for score in scores {
    print(score)
}
// 10
// 35
// 52
// 92
// 88

class Cat {
    var name: String
    var livesRemaining: Int = 9

    init(name: String) {
        self.name = name
    }

    func describe() -> String {
        return "👋 I'm \(self.name) and I have \(self.livesRemaining) lives 😸"
    }
}
let mitsy = Cat(name: "Mitsy")
print(mitsy.describe())
// 👋 I'm Mitsy and I have 9 lives 😸

Static typing

Given that Swift is compiled via LLVM, it's statically type checked during the compilation process. There's no way you can pass an invalid type to a function and run into an error during runtime. If your code compiles you can be pretty sure that you're passing around the expected types.

func sum(xs: [Int]) -> Int {
    var result: Int = 0
    for x: Int in xs {
        result = result + x
    }
    return result
}

// Using correct types
let intNumbers: [Int] = [1, 2, 3, 4, 5]
let resultInt = sum(xs: intNumbers)
print(resultInt)
// 15

// Using incorrect types
let stringNumbers: [String] = ["one", "two", "three"]
let resultString = sum(xs: stringNumbers)
print(resultString)
// error: cannot convert value of type '[String]' to expected argument type '[Int]'

Hackable

Swifts concepts of protocols and extensions make it dead simple to add new functionality to existing libraries or even types which ship with the language core itself. Want to add a new method to Int? No problem!

// One needs to implement `help` when using the `Debugging` Protocol
protocol Debugging {
    func help() -> String
}

// Implementing `Debugging` for MatrixMultiply
class MatrixMultiply: Debugging {
    func help() -> String {
        return "Offers methods to aid with matrix-matrix multiplications."
    }

    func multiply() {
        // ...
    }
}
var matMult = MatrixMultiply()
print(matMult.help())
// Offers methods to aid with matrix-matrix multiplications.

// Implementing `Debugging` for VectorMultiply
class VectorMultiply: Debugging {
    func help() -> String {
        return "Offers methods to aid with matrix-vector multiplications."
    }
}
var vecMult = VectorMultiply()
print(vecMult.help())
// Offers methods to aid with matrix-vector multiplications.

// Makes it possible to emojify an existing type
protocol Emojifier {
    func emojify() -> String
}

// Here we're extending Swifts core `Int` type
extension Int: Emojifier {
    func emojify() -> String {
        if self == 8 {
            return "🎱"
        } else if self == 100 {
            return "💯"
        }
        return String(self)
    }
}


print(8.emojify())
// 🎱
print(100.emojify())
// 💯
print(42.emojify())
// 42

Value semantics

I'm sure everyone ran into this problem before. An object is passed into a function and modified without bad intentions. Meanwhile the object is used in a different place and all of the sudden its internal state isn't what it's supposed to be. The culprit is the data mutation within the function.

This problem can be mitigated easily via value semantics. When using value semantics a "copy" rather than an object reference is passed around.

// As seen on: https://marcosantadev.com/copy-write-swift-value-types/

import Foundation

// Prints the memory address of the given object
func address(of object: UnsafeRawPointer) -> String {
    let addr = Int(bitPattern: object)
    return String(format: "%p", addr)
}

var list1 = [1, 2, 3, 4, 5]
print(address(of: list1))
// 0x7f2021f845d8

var list2 = list1
print(address(of: list2))
// 0x7f2021f845d8 <-- Both lists share the same address

list2.append(6) // <-- Mutating `list2`

print(list1)
// [1, 2, 3, 4, 5]

print(list2)
// [1, 2, 3, 4, 5, 6]

print(address(of: list1))
// 0x7f2021f84a38
print(address(of: list2))
// 0x128fb50 <-- `list2` has a different address

First-class C interoperability

Given that Swift compiles via LLVM it has access to existing LLVM-based implementations to interoperate with. One such project is Clang, a C language family frontend written for LLVM. Thanks to Clang it's dead simple to wrap existing C libraries and bring them into Swift projects.

The following video demonstrates how easy it is:

Swift for TensorFlow (S4TF)

Given all the upsides described above, the TensorFlow team decided to experiment with Swift as a Python replacement to interface with TensorFlow. Early prototypes were fruitful, encouraging the TensorFlow team to officially released Swift for TensorFlow (S4TF) in 2019.

S4TF extends the Swift core language with various features especially useful for Machine Learning tasks. Such enhancements include first-class autodiff support to calculate derivatives for functions or Python interoperability which makes it possible to reuse existing Python packages such as matplotlib, scikit-learn or pandas via Swift.

The following is a demonstration which shows how Swift for TensorFlow can be used to describe and train a deep neural network in TensorFlow:

Do you want to play around with Swift for TensorFlow yourself? Just run the following code in a terminal to spin up a Jupyter Notebook server with Swift Kernel support in a Docker container:

docker run -it -p 8888:8888 --rm --name jupyter-s4tf \
    -v "$PWD":/home/jovyan/work \
    --ipc=host \
    pmuens/jupyter-s4tf:latest jupyter notebook \
    --ip=0.0.0.0 \
    --no-browser \
    --allow-root \
    --NotebookApp.token=\
    --notebook-dir=/home/jovyan/work

The code for the repository can be found here and the Docker Hub entry is here.

Conclusion

Python, the de-facto standard programming language for Data Science and Machine Learning has served the community very well in the past. Nevertheless, given the trajectory of technological advancements we're slowly but surely hitting the limits with the toolings we currently have.

Performance critical code is already pushed down into lower-level implementations written in programming languages such as C or Fortran and wrapped via public Python APIs. Wouldn't it be nice to write expressive, yet performant code from the get go at every layer? And what about all the libraries out there? Wouldn't it be nice to wrap and reuse them with only a couple lines of code?

The lack of static typing in Python makes it painful to work on larger, more complex projects. It's all too easy to define a model and train it on a huge dataset just to realize that a type error interrupts the training process halfway through. An error which could've been mitigated via thorough type checks.

And what if we're hitting other roadblocks? Wouldn't it be nice to be able to peek under the covers and fix the issues ourselves in an "official" way without all the monkey-patching?

Most large-scale Machine Learning projects already faced some, if not all of the issues listed above. The TensorFlow team experienced them too and looked into ways to solve them once and for all. What they came up with is Swift for TensorFlow (S4TF), a Swift language extension tailored towards modern Machine Learning projects. The Swift programming language comes with various properties which makes it a perfect fit for a Python replacement: It shares a similar syntax, is compiled (and therefore runs fast), has a type system and seamlessly interoperates with existing C and Python libraries.

What do you think? Is Swift for TensorFlow the future or do we stick with Python for now? Will a language such as Julia dominate the Data Science and Machine Learning world in the future?

Do you have any questions, feedback or comments? Feel free to reach out via E-Mail or connect with me on Twitter.

Additional Resources

The following is a list of resources I've used to compile this blog post. There are also a couple of other sources linked within the article itself.

Generics

Tue, 01 Oct 2019 18:49:00 GMT

Generic programming makes it possible to describe an implementation in an abstract way with the intention to reuse it with different data types.

While generic programming is a really powerful tool as it prevents the programmer from repeating herself it can be hard to grasp for newcomers. This is especially true if you're not too familiar with typed programming languages.

This blog post aims to shed some light into the topic of generic programming. We'll discover why Generics are useful and which thought process can be applied to easily derive generic function signatures. At the end of post you'll be able to author and understand functions like the this:

function foo<A, B>(xs: A[], func: (x: A) => B): B[] {
  /* ... */
}

Note: Throughout this post we'll use TypeScript as our language of choice. Feel free to code along while reading through it.

Of course you can "just use JavaScript" (or another dynamically typed language) to not deal with concepts such as typing or Generics. But that's not the point. The point of this post is to introduce the concepts of Generics in a playful way. TypeScript is just a replaceable tool to express our thoughts.

Motivation

Before we jump right into the application of generic programming it might be useful to understand what problem Generics are solving. We'll re-implement one of JavaScripts built-in Array methods called filter to get first-hand experience as to why Generics were invented.

Let's start with an example to understand what filter actually does. The JavaScript documentation for filter states that:

The filter() method creates a new array with all elements that pass the test implemented by the provided function.

Let's take a look at a concrete example to see how we would use filter in our programs. First off we have to define an array. Let's call our array numbers as it contains some numbers:

const numbers = [1, 2, 3, 4, 5, 6]

Next up we ned to come up with a function our filter method applies to each element of such array. This function determines whether the element-under-test should be included in the resulting / filtered array. Based on the quote above and the description we just wrote down we can derive that our function which is used by the filter method should return a boolean value. The function should return true if the element passes the test and false otherwise.

To keep things simple we pretend that we want to filter our numbers array such that only even numbers will be included in our resulting array. Here's the isEvenfunction which implements that logic:

const isEven = (num: number): boolean => num % 2 === 0

Our isEven function takes in a num argument of type number and returns a boolean. We use the modulo operation to determine whether the number-under-test is even.

Next up we can use this function as an argument for the filter method on our array to get a resulting array which only includes even numbers:

const res = numbers.filter(isEven)

console.log(res)
// --> [2, 4, 6]

As we've stated earlier our goal is to implement the filter function on our own. Now that we've used filter with an example we should be familiar with it's API and usage.

To keep things simple we won't implement filter on arrays but rather define a standalone function which accepts an array and a function as its arguments.

What we do know is that filter loops through every element of the array and applies the custom function to it in order to see if it should be included in the resulting array. We can translate this into the following code:

function filter(xs: number[], func: (x: number) => boolean): number[] {
  cons res: number = []
  for (const x of xs) {
    if (func(x)) {
      res.push(x)
    }
  }
  return res
}

Now there's definitely a lot happening here and it might look intimidating but bear with me. It's simpler than it might look.

In the first line we define our function called filter which takes an array called xs (you can imagine pronouncing this "exes") and a function called func as its arguments. The array xs is of type number as we're dealing with numbers and the function func takes an x of type number, runs some code and returns a boolean. Once done our filter function returns an array of type number.

The function body simply defines an intermediary array of type number which is used to store the resulting numbers. Other than that we're looping over every element of our array and apply the function func to it. If the function returns true we push the element into our res array. Once done looping over all elements we return the res array which includes all the numbers for which our func function returned the value true.

Alright. Let's see if our homebrew filter function works the same way the built-in JavaScript filter function does:

const res = filter(numbers, isEven)

console.log(res)
// --> [2, 4, 6]

Great! Looks like it's working!

If we think about filtering in the abstract we can imagine that there's more than just the filtering of numbers.

Let's imagine we're building a Rolodex-like application. Here's an array with some names from our Rolodex:

const names = ['Alice', 'Bob', 'John', 'Alex', 'Pete', 'Anthony']

Now one of our application requirements is to only display names that start with a certain letter.

That sounds like a perfect fit for our filter function as we basically filter all the names based on their first character!

Let's start by writing our custom function we'll use to filter out names that start with an a:

const startsWithA = (name: string): boolean => name.toLowerCase().charAt(0) === 'a'

As we can see our function takes one argument called name of type string and it returns a boolean which our function computes by checking if the first character of the name is an a.

Now let's use our filter function to filter the names:

const res = filter(names, startsWithA)

console.log(res)
// --> Type Error

Hmm. Something seems to be off here.

Let's revisit the signature of our filter function:

function filter(xs: number[], func: (x: number) => boolean): number[] {
  /* ... */
}

Here we can see that the xs parameter is an array of type number. Furthermore the func parameter takes an x of type number and returns a boolean.

However in our new Rolodex application we're dealing with names which are strings and the startsWithA function we've defined takes a string as an argument, not a number.

One way to fix this problem would be to create a copy of filter called e.g. filter2 which arguments can handle strings rather than numbers. But we programmers know that we shouldn't repeat ourselves to keep things maintainable. In addition to that we're lazy, so using one function to deal with different data types would be ideal.

Entering Generics

And that's exactly the problem Generics tackle. As the introduction of this blog post stated, Generics can be used to describe an implementation in an abstract way in order to reuse it with different data types.

Let's use Generics to solve our problem and write a function that can deal with any data type, not just numbers or strings.

Before we jump into the implementation we should articulate what we're about to implement. Talking in the abstract we're basically attempting to filter an array of type T (T is our "placeholder" for some valid type here) with the help of our custom function. Given that our array has elements of type T our function should take each element of such type and produce a boolean as a result (like we did before).

Alright. let's translate that into code:

function filter<T>(xs: T[], func: (x: T) => boolean): T[] {
  const res: T[] = []
  for (const x of xs) {
    if (func(x)) {
      res.push(x)
    }
  }
  return res
}

At a first glance this might look confusing since we've sprinkled in our T type here and there. However overall it should look quite familiar. Let's take a closer look into how this implementation works.

In the first line we define our filter function as a function which takes an array named xs of type T and a function called func which takes a parameter x of type T and returns a boolean. Our function filter then returns a resulting array which is also of type T, since it's basically a subset of elements of our original array xs.

The code inside the function body is pretty much the same as before with the exception that our intermediary res array also needs to be of type T.

There's one little detail we haven't talked about yet. There's this <T> at the beginning of the function. What does that actually do?

Well our compiler doesn't really know what the type T might be at the end of the day. And it doesn't really care that much whether it's a string, a number or an object. It only needs to know that it's "some placeholder" type. We programmers have to tell the compiler that we're abstracting the type away via Generics here. So in TypeScript for example we use the syntax <TheTypePlaceHolder> right after the function names to signal the compiler that we want our function to be able to deal with lots of different types (to be generic). Using T is just a convention. You could use any name you want as your "placeholder type". If your functions deals with more than one generic type you'd just list them comma-separated inside the <> like this: <A, B>.

That's pretty much all we have to do to turn our limited, number-focused filterfunction into a generic function which can deal with all kinds of types. Let's see if it works with our numbers and names arrays:

let res

// using `filter` with numbers and our `isEven` function
res = filter(numbers, isEven)
console.log(res)
// --> [2, 4, 6]

// using `filter` with strings and our `startsWithA` function
res - filter(names, startsWithA)

console.log(res)
// --> ['Alice', 'Alex', 'Anthony']

Awesome! It works!

Function signatures as documentation

One of the many benefits of using a type system is that you can get a good sense of what the function will be doing based solely on its signature.

Let's take the function signature from the beginning of the post and see if we can figure out what it'll be doing:

function foo<A, B>(xs: A[], func: (x: A) => B): B[] {
  /* ... */
}

The first thing we notice is that it's a generic function as we're dealing with 2 "type placeholders" A and B here. Next up we can see that this function takes in an array called xs of type A and a function func which takes an A and turns it into a B. At the end the foo function returns an array of type B,

Take a couple of minutes to parse the function signature in order to understand what it's doing.

Do you know how this function is called? Here's a tip: It's also one of those functions from the realm of functional programming used on e.g. arrays.

Here's the solution: The function we called foo here is usually called map as it iterates over the elements of the array and uses the provided function to map every element from one type to the other (note that it can also map to the same type, i.e. from type A to type A).

I have to admit that this was a rather challenging question. Here's how map is used in the wild:

const number = [1, 2, 3, 4, 5, 6]
const numToString = (num: number): string => num.toString()

const res = map(numbers, numToString)

console.log(res)
// --> ['1', '2', '3', '4', '5', '6']

Conclusion

In this blog post we've looked into Generics as a way to write code in an abstract and reusable way.

We've implemented our own filter function to understand why generic programming is useful and how it helps us to allow the filtering of lists of numbers, strings or more broadly speaking Ts.

Once we understood how to read and write Generic functions we've discovered how typing and Generics can help us to get a sense of what a function might be doing just by looking at its signature.

Do you have any questions, feedback or comments? Feel free to reach out via E-Mail or connect with me on Twitter.

The intuition behind Word2Vec

Tue, 04 Jun 2019 18:22:00 GMT

Have you ever wondered how YouTube knows which videos to recommend, how Google Translate is able to translate whole texts into a decent version of the target language or how your Smartphone keyboard knows which words and text snippets to suggest while you type your texts?

There's a very high likelihood that so-called Embeddings were used behind the scenes. Embeddings are one of the central ideas behind modern Natural Language Processing models.

In the following writeup we'll discover the main building blocks and basic intuition behind Embeddings. We'll learn how and why they work and how Word2Vec, a method to turn words into vectors, can be used to show that:

$$ king - man + woman = queen $$

All the code we'll write here can be found in my "Lab" repository on GitHub. Feel free to code along while reading through this tutorial.

Basic Setup

Before jumping right into the code we need to make sure that all Python packages we'll be using are installed on our machine.

We install Seaborn, a visualization tool which helps us to plot nice-looking charts and diagrams. We don't really work with Seaborn directly but rather use its styles in conjunction with Matplotlib to make our plots look a little bit more "modern".

!pip install seaborn

Next up we need to import the modules we'll use throughout this tutorial (the last few lines configure Matplotlib to use Seaborn styles).

import json
from pathlib import Path

import pandas as pd
import seaborn as sns
import numpy as np
from IPython.display import HTML, display

# prettier Matplotlib plots
import matplotlib.pyplot as plt
import matplotlib.style as style
style.use('seaborn')

Since we're dealing with different datasets we should create a separate directory to store them in.

!mkdir -p data
data_dir = Path('data')

Comparing Countries

Let's start with our first data analysis task. Our goal is to compare and contrast different countries based on their surface area and population. The main idea being that we want to analyze which countries are quite similar and which are rather different based on those two metrics.

The dataset we'll use is part of the country-json project by @samayo. Make sure to take some time to browse through the different JSON files to get an idea about the structure of the data.

In our example we're only interested in the country-by-surface-area.json and country-by-population.json files. Let's go ahead and download the files to our data directory.

After that we can define 2 variables which will point to the files on our file system.

SURFACE_AREA_FILE_NAME = 'country-by-surface-area.json'
POPULATION_FILE_NAME = 'country-by-population.json'

!wget -nc https://raw.githubusercontent.com/samayo/country-json/master/src/country-by-surface-area.json -O data/country-by-surface-area.json
!wget -nc https://raw.githubusercontent.com/samayo/country-json/master/src/country-by-population.json -O data/country-by-population.json

surface_area_file_path = str(data_dir / SURFACE_AREA_FILE_NAME)
population_file_path = str(data_dir / POPULATION_FILE_NAME)

During our data analysis we'll utilize Pandas, a great Python library which makes it dead simple to inspect and manipulate data.

Since our data is in JSON format we can use Pandas read_json function to load the data into a so-called DataFrame (think of it as an Excel spreadsheet on steroids).

The dropna function makes sure that we remove all entries which are undefined and therefore useless for further inspection.

df_surface_area = pd.read_json(surface_area_file_path)
df_population = pd.read_json(population_file_path)

df_population.dropna(inplace=True)
df_surface_area.dropna(inplace=True)

You might've noticed that dealing with 2 separate files will get quite hairy if we want to compare countries based on their 2 metrics.

Since both files contain the same countries with the same names and only differ in terms of their area and population data we can use merge to create a new DataFrame containing all countries with their respective area and population numbers.

Another tweak we perform here is to set the index to the country name. This way we can easily query for country data based on the country names rather than having to deal with non-expressive integer values.

df = pd.merge(df_surface_area, df_population, on='country')
df.set_index('country', inplace=True)
df.head()

len(df)

227

As you can see we have a total of 227 countries in our DataFrame. 227 are way too many countries for our need. Especially since we're about to plot the data in the next step.

Let's reduce our result set by performing some range-queries with the area and population data.

df = df[
    (df['area'] > 100000) & (df['area'] < 600000) &
    (df['population'] > 35000000) & (df['population'] < 100000000)
]
len(df)

12

Great! 12 countries are way easier to analyze once plotted.

Speaking of which, let's do a 2D scatterplot of our 12 countries. We decide to plot the area on the X axis and the population on the Y axis.

fig, ax = plt.subplots()
df.plot(x='area', y='population', figsize=(10, 10), kind='scatter', ax=ax)

for k, v in df.iterrows():
    ax.annotate(k, v)

fig.canvas.draw()

Looking at the plotted data we can immediately see some relationships. It appears that Vietnam has a high population compared to its area. Kenya on the other hand has a large surface area but a smaller population compared to its size.

Plotting the data like this helps us to reason about it in a visual way. In addition to that we can also easily validate the integrity of our data.

While we as humans can immediately tell the relationships in our country data just by looking at our plot it's necessary to translate our visual reasoning into raw numbers so our computer can understand them too.

Looking at the plot again it seems like the distance between the data points of the countries is a good measure to determine how "similar" or "different" the countries are.

There are several algorithms to calculate the distance between two (or more) coordinates. The Euclidean distance is a very common formula to do just that. Here's the Math notation:

$$ d(x, y) = d(y, x) = \sqrt{\sum_{i=1}^N (x_i - y_i)^2} $$

While the formula might look intimidating at first it's rather simple to turn it into code.

def euclidean_distance(x, y):
    x1, x2 = x
    y1, y2 = y
    result = np.sqrt((x1 - x2) ** 2 + (y1 - y2) ** 2)
    # we'll cast the result into an int which makes it easier to compare
    return int(round(result, 0))

According to our plot it seems like Thailand and Uganda are 2 countries which are very different. Computing the Euclidean distance between both validates our hunch.

# Uganda <--> Thailand
uganda = df.loc['Uganda']
thailand = df.loc['Thailand']

x = (uganda['area'], thailand['area'])
y = (uganda['population'], thailand['population'])

euclidean_distance(x, y)

26175969

If we compare this result to the Euclidean distance between Iraq and Morocco we can see that those countries seem to be more "similar".

# Iraq <--> Morocco
iraq = df.loc['Iraq']
morocco = df.loc['Morocco']

x = (iraq['area'], morocco['area'])
y = (iraq['population'], morocco['population'])

euclidean_distance(x, y)

2535051

While this exercise was quite simple and intuitive if one is fluent in geography it also introduced us to the basic concepts of Embeddings. With Embeddings we map data (e.g. words or raw numbers) into multi-dimensional spaces and use Math to manipulate and calculate relationships between that data.

This might sound rather abstract and I agree that the relationship between our Country data analysis and Embeddings is still a little bit fuzzy.

Trust me, the upcoming example will definitely result in an "Aha Moment" and suddenly what we've learned so far will click!

Color Math

Now that we've seen some of the underlying principles of Embeddings let's take another look at a slightly more complicated example. This time we'll work with different colors and their representation as a combination of Red, Green and Blue values (also known as RGB).

Before we jump right into our analysis we'll define a helper function which lets us render the color according to its RGB representation.

The following code defines a function which takes the integer values of Red, Green and Blue (values in the range of 0 - 255) and renders a HTML document with the given color as its background.

def render_color(r, g, b):
    display(HTML('''
      <div style="background-color: rgba(%d, %d, %d, 1); height: 100px;"></div>
    ''' % (r, g, b)),
    metadata=dict(isolated=True))

The color black is represented as 0 Red, 0 Green and 0 Blue. Let's validate that our render_color function works as expected.

render_color(0, 0, 0)

Great. It works!

Next up it's time to download the dataset we'll be using for our color analysis. We've decided to use the 256 Colors dataset by @jonasjacek. It lists the 256 colors used by xterm, a widely used terminal emulator. Make sure to take a couple of minutes to familiarize yourself with the data and its structure.

Downloading the dataset follows the same instruction we've used in the beginning of this tutorial where we downloaded the Country data.

COLORS_256_FILE_NAME = 'colors-256.json'

!wget -nc https://jonasjacek.github.io/colors/data.json -O data/colors-256.json

colors_256_file_path = str(data_dir / COLORS_256_FILE_NAME)

Now that we have access to the data in our programming environment it's time to inspect the structure and think about ways to further process it.

color_data = json.loads(open(colors_256_file_path, 'r').read())
color_data[:5]

[{'colorId': 0,
  'hexString': '#000000',
  'rgb': {'r': 0, 'g': 0, 'b': 0},
  'hsl': {'h': 0, 's': 0, 'l': 0},
  'name': 'Black'},
  {'colorId': 1,
  'hexString': '#800000',
  'rgb': {'r': 128, 'g': 0, 'b': 0},
  'hsl': {'h': 0, 's': 100, 'l': 25},
  'name': 'Maroon'},
  {'colorId': 2,
  'hexString': '#008000',
  'rgb': {'r': 0, 'g': 128, 'b': 0},
  'hsl': {'h': 120, 's': 100, 'l': 25},
  'name': 'Green'},
  {'colorId': 3,
  'hexString': '#808000',
  'rgb': {'r': 128, 'g': 128, 'b': 0},
  'hsl': {'h': 60, 's': 100, 'l': 25},
  'name': 'Olive'},
  {'colorId': 4,
  'hexString': '#000080',
  'rgb': {'r': 0, 'g': 0, 'b': 128},
  'hsl': {'h': 240, 's': 100, 'l': 25},
  'name': 'Navy'}]

As we can see there are 3 different color representations available in this dataset. There's a Hexadecimal, a HSL (Hue, Saturation, Lightness) and a RGB (Red, Green, Blue) representation. Furthermore we have access to the name of the color via the name attribute.

In our analysis we're only interested in the name and the RGB value of every color. Given that we can create a simple dict which key is the lowercased color name and its value is a tuple containing the Red, Green and Blue values respectively.

colors = dict()

for color in color_data:
    name = color['name'].lower()
    r = color['rgb']['r']
    g = color['rgb']['g']
    b = color['rgb']['b']
    rgb = tuple([r, g, b])
    colors[name] = rgb

To validate that our data structure works the way we described above we can print out some sample colors with their RGB values.

print('Black: %s' % (colors['black'],))
print('White: %s' % (colors['white'],))

print()

print('Red: %s' % (colors['red'],))
print('Lime: %s' % (colors['lime'],))
print('Blue: %s' % (colors['blue'],))

Black: (0, 0, 0)
White: (255, 255, 255)

Red: (255, 0, 0)
Lime: (0, 255, 0)
Blue: (0, 0, 255)

While our dict is a good starting point it's often easier and sometimes faster to do computations on the data if it's stored in a Pandas DataFrame. The from_dict function helps us to turn a simple Python dictionary into a DataFrame.

df = pd.DataFrame.from_dict(colors, orient='index', columns=['r', 'g', 'b'])
df.head()

Seeing the data formatted in this way we can think of its representation as a mapping of the Red, Green and Blue values into a 3-Dimensional space where for example Red is the X axis, Green is the Y axis and Blue is the Z axis.

You might recall that we've used Euclidean distance in our Country example above to determine how "similar" countries are. The main idea was that similar countries have less distance between their data points compared to dissimilar countries whose data points are farther apart.

Another very useful formula to calculate the similarity of data points is the so-called Cosine similarity. The Cosine similarity measures the angle between two vectors in a multi-dimensional space. The smaller the angle, the more similar the underlying data.

Translating this to our color example we can think of every color being represented as a vector with 3 values (Red, Green and Blue) which (as stated above) can be mapped to the X, Y and Z axis in a 3D coordinate system. Using the Cosine similarity we can take one of such vectors and calculate the distance between it and the rest of the vectors to determine how similar or dissimilar they are. And that's exactly what we'll be doing here.

The Math notation for the Cosine similarity looks like this:

$$ similarity = \cos(\Theta) = \frac{A \cdot B}{\left\lVert A\right\rVert \left\lVert B\right\rVert} $$

We're taking the dot-product between the two vectors A and B and divide it by the product of their magnitudes.

The following code-snippet implements such formula. Again, it might look intimidating and rather complicated but if you take some time to read through it you'll see that it's not that hard to understand.

In fact our implementation here does more than just calculating the Cosine similarity. In addition to that we copy our DataFrame containing the colors and add another column to it which will include the distance as a value between 0 and 1. Once done we sort our copied DataFrame by such distance in descending order. We do this to see the computed values when querying for similar colors later on.

def similar(df, coord, n=10):
    # turning our RGB values (3D coordinates) into a numpy array
    v1 = np.array(coord, dtype=np.float64)

    df_copy = df.copy()

    # looping through our DataFrame to calculate the distance for every color
    for i in df_copy.index:
        item = df_copy.loc[i]
        v2 = np.array([item.r, item.g, item.b], dtype=np.float64)
        # cosine similarty calculation starts here
        theta_sum = np.dot(v1, v2)
        theta_den = np.linalg.norm(v1) * np.linalg.norm(v2)
        # check if we're trying to divide by 0
        if theta_den == 0:
            theta = None
        else:
            theta = theta_sum / theta_den
        # adding the `distance` column with the result of our computation
        df_copy.at[i, 'distance'] = theta
    # sorting the resulting DataFrame by distance
    df_copy.sort_values(by='distance', axis=0, ascending=False, inplace=True)
    return df_copy.head(n)

To validate that our similar function works we can use it to find similar colors to red.

similar(df, colors['red'])

We can also pass in colors as a list of RGB values.

similar(df, [100, 20, 120])

Since it's hard to imagine what color 100, 20 and 120 represent it's worthwhile to use our render_color function to see it.

render_color(100, 20, 120)

Looking at the list of most similar colors from above it appears that darkvioletis quite similar to 100, 20, 120. Let's see how this color looks like.

darkviolet = df.loc['darkviolet']
render_color(darkviolet.r, darkviolet.g, darkviolet.b)

And we can validate that darkviolet in fact looks quite similar to 100, 20, 120!

But it doesn't end here. Our 3 color values are numbers in the range of 0 - 255. Given that, it should be possible to do some basic Math computations such as addition or subtraction on them.

Since we only have access to 256 different colors it's highly unlikely that our resulting color values for Red, Green and Blue will exactly match one of our 256 colors. That's where our similar function comes in handy! The similarfunction should make it possible to calculate a new color and find its most similar representation in our 256 color dataset.

We can look at a Color Wheel to see that subtracintg a red color from purpleone should result in a Blueish color. Let's do the Math and check whether that's true.

blueish = df.loc['purple'] - df.loc['red']

similar(df, blueish)

And sure enough the most similar colors in our dataset are Blueish ones. We can validate that by rendering darkblue, one of the best matches.

darkblue = df.loc['darkblue']
render_color(darkblue.r, darkblue.g, darkblue.b)

Here's a simple one. If we have Black and add some White to the mix we should get something Greyish, correct?

greyish = df.loc['black'] + df.loc['white']

similar(df, greyish)

And sure enough we do. Rendering grey93 shows a light grey color.

grey93 = df.loc['grey93']
render_color(grey93.r, grey93.g, grey93.b)

Let's end our color exploration with a more complex formula. So far we've only done some very simple Math like subtracting and adding colors. But there's more we can do. We can also express our search for a color as a "solve for x" problem.

Mixing Yellow and Red will result in Orange. We can translate this behavior to other colors as well. Here we ask "Yellow is to Red as X is to Blue" and express it in Math notation to get the result for X.

# yellow is to red as X is to blue
yellow_to_red = df.loc['yellow'] - df.loc['red']
X = yellow_to_red + df.loc['blue']

similar(df, X)

Our calculation shows us that lightseargreen is to Blue as Yellow is to Red. Intuitively that makes sense if you think about it.

lightseagreen = df.loc['lightseagreen']
render_color(lightseagreen.r, lightseagreen.g, lightseagreen.b)

Word2Vec

In the beginning of this tutorial I promised that once done we should understand the intuition behind Word2Vec, a key component for modern Natural Language Processing models.

The Word2Vec model does to words what we did with our colors represented as RGB values. It maps words into a multi-dimensional space (our colors were mapped into a 3D space). Once such words are mapped into that space we can perform Math calculations on their vectors the same way we e.g. calculated the similarity between our color vectors.

Having a mapping of words into such a vector space makes it possible to do calculations resulting in:

$$ king - man + woman = queen $$

Conclusion

In this tutorial we took a deep dive into the main building blocks and intuitions behind Embeddings, a powerful concept which is heavily utilized in modern Natural Language Processing models.

The main idea is to map data into a multi-dimensional space so that Math calculations from the realm of Linear Algebra can be performed on it.

We started our journey with a simple example in which we mapped the surface area and population of different countries into a 2D vector space. We then used the Euclidean distance to verify that certain countries are similar while others are dissimilar based on their metrics.

Another, more advanced example mapped colors and their RGB representation into a 3D vector space. We then used Cosine similarity and some basic Math to add and subtract colors.

With this knowledge we're now able to understand how more advanced models such as Word2Vec or Doc2Vec make it possible to do calculations on words and texts.

Do you have any questions, feedback or comments? Feel free to reach out via E-Mail or connect with me on Twitter.

The Lab

You can find more code examples, experiments and tutorials in my GitHub Lab repository.

Additional Resources

Eager to learn more? Here's a list with all the resources I've used to write this post.

Minimax and Monte Carlo Tree Search

Tue, 02 Apr 2019 16:21:00 GMT

Do you remember your childhood days when you discovered the infamous game Tic-Tac-Toe and played it with your friends over and over again?

You might've wondered if there's a certain strategy you can exploit that lets you win all the time (or at least force a draw). Is there such an algorithm that will show you how you can defeat your opponent at any given time?

It turns out there is. To be precise there are a couple of algorithms which can be utilized to predict the best possible moves in games such as Tic-Tac-Toe, Connect Four, Chess and Go among others. One such family of algorithms leverages tree search and operates on game state trees.

In this blog post we'll discuss 2 famous tree search algorithms called Minimax and Monte Carlo Tree Search (abbreviated to MCTS). We'll start our journey into tree search algorithms by discovering the intuition behind their inner workings. After that we'll see how Minimax and MCTS can be used in modern game implementations to build sophisticated Game AIs. We'll also shed some light into the computational challenges we'll face and how to handle them via performance optimization techniques.

The Intuition behind tree search

Let's imagine that you're playing some games of Tic-Tac-Toe with your friends. While playing you're wondering what the optimal strategy might be. What's the best move you should pick in any given situation?

Generally speaking there are 2 modes you can operate in when determining the next move you want to play:

Aggressive:

Play a move which will cause an immediate win (if possible)
Play a move which sets up a future winning situation

Defensive:

Play a move which prevents your opponent from winning in the next round (if possible)
Play a move which prevents your opponent from setting up a future winning situation in the next round

These modes and their respective actions are basically the only strategies you need to follow to win the game of Tic-Tac-Toe.

The "only" thing you need to do is to look at the current game state you're in and play simulations through all the potential next moves which could be played. You do this by pretending that you've played a given move and then continue playing the game until the end, alternating between the X and O player. While doing that you're building up a game tree of all the possible moves you and your opponent would play.

The following illustration shows a simplified version of such a game tree: Note that for the rest of this post we'll only use simplified game tree examples to save screen space

Of course, the set of strategic rules we've discussed at the top is specifically tailored to the game of Tic-Tac-Toe. However we can generalize this approach to make it work with other board games such as Chess or Go. Let's take a look at Minimax, a tree search algorithm which abstracts our Tic-Tac-Toe strategy so that we can apply it to various other 2 player board games.

The Minimax Algorithm

Given that we've built up an intuition for tree search algorithms let's switch our focus from simple games such as Tic-Tac-Toe to more complex games such as Chess.

Before we dive in let's briefly recap the properties of a Chess game. Chess is a 2 player deterministic game of perfect information. Sound confusing? Let's unpack it:

In Chess, 2 players (Black and White) play against each other. Every move which is performed is ensured to be "fulfilled" with no randomness involved (the game doesn't use any random elements such as a die). During gameplay every player can observe the whole game state. There's no hidden information, hence everyone has perfect information about the whole game at any given time.

Thanks to those properties we can always compute which player is currently ahead and which one is behind. There are several different ways to do this for the game of Chess. One approach to evaluate the current game state is to add up all the remaining white pieces on the board and subtract all the remaining black ones. Doing this will produce a single value where a large value favors white and a small value favors black. This type of function is called an evaluation function.

Based on this evaluation function we can now define the overall goal during the game for each player individually. White tries to maximize this objective while black tries to minimize it.

Let's pretend that we're deep in an ongoing Chess game. We're player white and have already played a couple of clever moves, resulting in a large number computed by our evaluation function. It's our turn right now but we're stuck. Which of the possible moves is the best one we can play?

We'll solve this problem with the same approach we already encountered in our Tic-Tac-Toe gameplay example. We build up a tree of potential moves which could be performed based on the game state we're in. To keep things simple we pretend that there are only 2 possible moves we can play (in Chess there are on average ~30 different options for every given game state). We start with a (white) root node which represents the current state. Starting from there we're branching out 2 (black) child nodes which represent the game state we're in after taking one of the 2 possible moves. From these 2 child nodes we're again branching out 2 separate (white) child nodes. Each one of those represents the game state we're in after taking one of the 2 possible moves we could play from the black node. This branching out of nodes goes on and on until we've reached the end of the game or hit a predefined maximum tree depth.

The resulting tree looks something like this:

Given that we're at the end of the tree we can now compute the game outcome for each end state with our evaluation function:

With this information we now know the game outcome we can expect when we take all the outlined moves starting from the root node and ending at the last node where we calculated the game evaluation. Since we're player white it seems like the best move to pick is the one which will set us up to eventually end in the game state with the highest outcome our evaluation function calculated.

While this is true there's one problem. There's still the black player involved and we cannot directly manipulate what move she'll pick. If we cannot manipulate this why don't we estimate what the black player will likely do based on our evaluation function? As a white player we always try to maximize our outcome. The black player always tries to minimize the outcome. With this knowledge we can now traverse back through our game tree and compute the values for all our individual tree nodes step by step.

White tries to maximize the outcome:

While black wants to minimize it:

Once done we can now pick the next move based on the evaluation values we've just computed. In our case we pick the next possible move which maximizes our outcome:

What we've just learned is the general procedure of the so-called Minimax algorithm. The Minimax algorithm got its name from the fact that one player wants to Mini-mize the outcome while the other tries to Max-imize it.

Code

def minimax(state, max_depth, is_player_minimizer):
    if max_depth == 0 or state.is_end_state():
        # We're at the end. Time to evaluate the state we're in
        return evaluation_function(state)

    # Is the current player the minimizer?
    if is_player_minimizer:
        value = -math.inf
        for move in state.possible_moves():
            evaluation = minimax(move, max_depth - 1, False)
            min = min(value, evaluation)
        return value

    # Or the maximizer?
    value = math.inf
    for move in state.possible_moves():
        evaluation = minimax(move, max_depth - 1, True)
        max = max(value, evaluation)
    return value

Search space reduction with pruning

Minimax is a simple and elegant tree search algorithm. Given enough compute resources it will always find the optimal next move to play.

But there's a problem. While this algorithm works flawlessly with simplistic games such as Tic-Tac-Toe, it's computationally infeasible to implement it for strategically more involved games such as Chess. The reason for this is the so-called tree branching factor. We've already briefly touched on that concept before but let's take a second look at it.

In our example above we've artificially restricted the potential moves one can play to 2 to keep the tree representation simple and easy to reason about. However the reality is that there are usually more than 2 possible next moves. On average there are ~30 moves a Chess player can play in any given game state. This means that every single node in the tree will have approximately 30 different children. This is called the width of the tree. We denote the trees width as $w$.

But there's more. It takes roughly ~85 consecutive turns to finish a game of Chess. Translating this to our tree means that it will have an average depth of 85. We denote the trees depth as $d$.

Given $w$ and $d$ we can define the formula $w^d$ which will show us how many different positions we have to evaluate on average.

Plugging in the numbers for Chess we get $30^{85}$. Taking the Go board game as an example which has a width $w$ of ~250 and an average depth $d$ of ~150 we get $250^{150}$. I encourage you to type those numbers into your calculator and hit enter. Needless to say that current generation computers and even large scale distributed systems will take "forever" to crunch through all those computations.

Does this mean that Minimax can only be used for games such as Tic-Tac-Toe? Absolutely not. We can apply some clever tricks to optimize the structure of our search tree.

Generally speaking we can reduce the search trees width and depth by pruning individual nodes and branches from it. Let's see how this works in practice.

Alpha-Beta Pruning

Recall that Minimax is built around the premise that one player tries to maximize the outcome of the game based on the evaluation function while the other one tries to minimize it.

This gameplay behavior is directly translated into our search tree. During traversal from the bottom to the root node we always picked the respective "best" move for any given player. In our case the white player always picked the maximum value while the black player picked the minimum value: Looking at our tree above we can exploit this behavior to optimize it. Here's how:

While walking through the potential moves we can play given the current game state we're in we should build our tree in a depth-first fashion. This means that we should start at one node and expand it by playing the game all the way to the end before we back up and pick the next node we want to explore: Following this procedure allows us to identify moves which will never be played early on. After all, one player maximizes the outcome while the other minimizes it. The part of the search tree where a player would end up in a worse situation based on the evaluation function can be entirely removed from the list of nodes we want to expand and explore. We prune those nodes from our search tree and therefore reduce its width. The larger the branching factor of the tree, the higher the amount of computations we can potentially save!

Assuming we can reduce the width by an average of 10 we would end up with $w^d = (30 - 10)^{85} = 20^{85}$ computations we have to perform. That's already a huge win.

This technique of pruning parts of the search tree which will never be considered during gameplay is called Alpha-Beta pruning. Alpha-Beta pruning got its name from the parameters $\alpha$ and $\beta$ which are used to keep track of the best score either player can achieve while walking the tree.

Code

def minimax(state, max_depth, is_player_minimizer, alpha, beta):
    if max_depth == 0 or state.is_end_state():
        return evaluation_function(state)

    if is_player_minimizer:
        value = -math.inf
        for move in state.possible_moves():
            evaluation = minimax(move, max_depth - 1, False, alpha , beta)
            min = min(value, evaluation)
            # Keeping track of our current best score
            beta = min(beta, evaluation)
            if beta <= alpha:
                break
        return value

    value = math.inf
    for move in state.possible_moves():
        evaluation = minimax(move, max_depth - 1, True, alpha, beta)
        max = max(value, evaluation)
        # Keeping track of our current best score
        alpha = max(alpha, evaluation)
        if beta <= alpha:
            break
    return value

Using Alpha-Beta pruning to reduce the trees width helps us utilize the Minimax algorithm in games with large branching factors which were previously considered as computationally too expensive.

In fact Deep Blue, the Chess computer developed by IBM which defeated the Chess world champion Garry Kasparov in 1997 heavily utilized parallelized Alpha-Beta based search algorithms.

Monte Carlo Tree Search

It seems like Minimax combined with Alpha-Beta pruning is enough to build sophisticated game AIs. But there's one major problem which can render such techniques useless. It's the problem of defining a robust and reasonable evaluation function. Recall that in Chess our evaluation function added up all the white pieces on the board and subtracted all the black ones. This resulted in high values when white had an edge and in low values when the situation was favorable for black. While this function is a good baseline and is definitely worthwhile to experiment with there are usually more complexities and subtleties one needs to incorporate to come up with a sound evaluation function.

Simple evaluation metrics are easy to fool and exploit once the underlying internals are surfaced. This is especially true for more complex games such as Go. Engineering an evaluation function which is complex enough to capture the majority of the necessary game information requires a lot of thought and interdisciplinary domain expertise in Software Engineering, Math, Psychology and the game at hand.

Isn't there a universally applicable evaluation function we could leverage for all games, no matter how simple or complex they are?

Yes, there is! And it's called randomness. With randomness we let chance be our guide to figure out which next move might be the best one to pick.

In the following we'll explore the so-called Monte Carlo Tree Search (MCTS) algorithm which heavily relies on randomness (the name "Monte Carlo" stems from the gambling district in Monte Carlo) as a core component for value approximations.

As the name implies, MCTS also builds up a game tree and does computations on it to find the path of the highest potential outcome. But there's a slight difference in how this tree is constructed.

Let's once again pretend that we're playing Chess as player white. We've already played for a couple of rounds and it's on us again to pick the next move we'd like to play. Additionally let's pretend that we're not aware of any evaluation function we could leverage to compute the value of each possible move. Is there any way we could still figure out which move might put us into a position where we could win at the end?

As it turns out there's a really simple approach we can take to figure this out. Why don't we let both player play dozens of random games starting from the state we're currently in? While this might sound counterintuitive it make sense if you think about it. If both player start in the given game state, play thousands of random games and player white wins 80% of the time, then there must be something about the state which gives white an advantage. What we're doing here is basically exploiting the Law of large numbers (LLN) to find the "true" game outcome for every potential move we can play.

The following description will outline how the MCTS algorithm works in detail. For the sake of simplicity we again focus solely on 2 playable moves in any given state (as we've already discovered there are on average ~30 different moves we can play in Chess).

Before we move on we need to get some minor definitions out of the way. In MCTS we keep track of 2 different parameters for every single node in our tree. We call those parameters $t$ and $n$. $t$ stands for "total" and represents the total value of that node. $n$ is the "number of visits" which reflects the number of times we've visited this node while walking through the tree. When creating a new node we always initialize both parameters with the value 0.

In addition to the 2 new parameters we store for each node, there's the so-called "Upper Confidence Bound 1" (UCT) formula which looks like this

$$ x_i + C\sqrt{\frac{\ln(N)}{n_i}} $$

This formula basically helps us in deciding which upcoming node and therefore potential game move we should pick to start our random game series (called "rollout") from. In the formula $x_i$ represents the average value of the game state we're working with, $C$ is a constant called "temperature" we need to define manually (we just set it to 1.5 in our example here. More on that later), $N$ represents the parent node visits and $n_i$ represents the current nodes visits. When using this formula on candidate nodes to decide which one to explore further, we're always interested in the largest result.

Don't be intimidated by the Math and just note that this formula exists and will be useful for us while working with out tree. We'll get into more details about the usage of it while walking through our tree.

With this out of the way it's time apply MCTS to find the best move we can play.

We start with the same root node of the tree we're already familiar with. This root node is our start point and reflects the current game state. Based on this node we branch off our 2 child nodes: The first thing we need to do is to use the UCT formula from above and compute the results for both child nodes. As it turns out we need to plug in 0 for almost every single variable in our UCT formula since we haven't done anything with our tree and its nodes yet. This will result in $\infty$ for both calculations.

$$ S_1 = 0 + 1.5\sqrt{\frac{\ln(0)}{0.0001}} = \infty $$

We've replaced the 0 in the denominator with a very small number because division by zero is not defined

Given this we're free to choose which node we want to explore further. We go ahead with the leftmost node and perform our rollout phase which means that we play dozens of random games starting with this game state.

Once done we get a result for this specific rollout (in our case the percentage of wins for player white). The next thing we need to do is to propagate this result up the tree until we reach the root node. While doing this we update both $t$ and $n$ with the respective values for every node we encounter. Once done our tree looks like this: Next up we start at our root node again. Once again we use the UCT formula, plug in our numbers and compute its score for both nodes:

$$ S_1 = 30 + 1.5\sqrt{\frac{\ln(1)}{1}} = 30 $$

$$ S_2 = 0 + 1.5\sqrt{\frac{\ln(0)}{0.0001}} = \infty $$

Given that we always pick the node with the highest value we'll now explore the rightmost one. Once again we perform our rollout based on the move this node proposes and collect the end result after we've finished all our random games.

The last thing we need to do is to propagate this result up until we reach the root of the tree. While doing this we update the parameters of every node we encounter.

We've now successfully explored 2 child nodes in our tree. You might've guessed it already. We'll start again at our root node and calculate every child nodes UCT score to determine the node we should further explore. In doing this we get the following values:

$$ S_1 = 30 + 1.5\sqrt{\frac{\ln(2)}{1}} \approx 31.25 $$

$$ S_2 = 20 + 1.5\sqrt{\frac{\ln(2)}{1}} \approx 21.25 $$

The largest value is the one we've computed for the leftmost node so we decide to explore that node further.

Given that this node has no child nodes we add two new nodes which represent the potential moves we can play to the tree. We initialize both of their parameters ($t$ and $n$) with 0.

Now we need to decide which one of those two nodes we should explore further. And you're right. We use the UCT formula to calculate their values. Given that both have $t$ and $n$ values of zero they're both $\infty$ so we decide to pick the leftmost node. Once again we do a rollout, retrieve the value of those games and propagate this value up to the tree until we reach the trees root node, updating all the node parameters along the way.

The next iteration will once again start at the root node where we use the UCT formula to decide which child node we want to explore further. Since we can see a pattern here and I don't want to bore you I'm not going to describe the upcoming steps in great detail. What we'll be doing is following the exact same procedure we've used above which can be summarized as follows:

Start at the root node and use the UCT formula to calculate the score for every child node
Pick the child node for which you've computed the highest UCT score
Check if the child has already been visited
- If not, do a rollout
- If yes, determine the potential next states from there
- Use the UCT formula to decide which child node to pick
- Do a rollout
Propagate the result back through the tree until you reach the root node

We iterate over this algorithm until we run out of time or reached a predefined threshold value of visits, depth or iterations. Once this happens we evaluate the current state of our tree and pick the child node(s) which maximize the value $t$. Thanks to dozens of games we've played and the Law of large numbers we can be very certain this move is the best one we can possibly play.

That's all there is. We've just learned, applied and understood Monte Carlo Tree Search!

You might agree that it seems like MCTS is very compute intensive since you have to run through thousands of random games. This is definitely true and we need to be very clever as to where we should invest our resources to find the most promising path in our tree. We can control this behavior with the aforementioned "temperature" parameter $C$ in our UCT formula. With this parameter we balance the trade-off between "exploration vs. exploitation".

A large $C$ value puts us into "exploration" mode. We'll spend more time visiting least-explored nodes. A small value for $C$ puts us into "exploitation" mode where we'll revisit already explored nodes to gather more information about them.

Given the simplicity and applicability due to the exploitation of randomness, MCTS is a widely used game tree search algorithm. DeepMind extended MCTS with Deep Neural Networks to optimize its performance in finding the best Go moves to play. The resulting Game AI was so strong that it reached superhuman level performance and defeated the Go World Champion Lee Sedol 4-1.

Conclusion

In this blog post we've looked into 2 different tree search algorithms which can be used to build sophisticated Game AIs.

While Minimax combined with Alpha-Beta pruning is a solid solution to approach games where an evaluation function to estimate the game outcome can easily be defined, Monte Carlo Tree Search (MCTS) is a universally applicable solution given that no evaluation function is necessary due to its reliance on randomness.

Raw Minimax and MCTS are only the start and can easily be extended and modified to work in more complex environments. DeepMind cleverly combined MCTS with Deep Neural Networks to predict Go game moves whereas IBMextended Alpha-Beta tree search to compute the best possible Chess moves to play.

Do you have any questions, feedback or comments? Feel free to reach out via E-Mail or connect with me on Twitter.

Additional Resources

Do you want to learn more about Minimax and Monte Carlo Tree Search? The following list is a compilation of resources I found useful while studying such concepts.

If you're really into modern Game AIs I highly recommend the book "Deep Learning and the Game of Go" by Max Pumperla and Kevin Ferguson. In this book you'll implement a Go game engine and refine it step-by-step until at the end you implement the concepts DeepMind used to build AlphaGo and AlphaGo Zero based on their published research papers.

Learning Deep Learning

Tue, 05 Mar 2019 14:42:00 GMT

Deep Learning, a branch of Machine Learning gained a lot of traction and press coverage over the last couple of years. Thanks to significant scientific breakthroughs we're now able to solve a variety of hard problems with the help of machine intelligence.

Computer systems were taught to identify skin cancer with a significantly higher accuracy than human doctors do. Neural Networks can generate photorealisticimages of fake people and fake celebrities. It's even possible for an algorithm to teach itself entire games from first principles, surpassing human-level mastery after only a couple of hours training.

In summary Deep Learning is amazing, mystical and sometimes even scary and intimidating.

In order to demystify and understand this "Black Box" end-to-end I decided to take a deep dive into Deep Learning, looking at it through the practical as well as the theoretical lens.

With this post I'd like to share the Curriculum I came up with after spending months following the space, reading books and research papers, doing lectures, classes and courses to find some of the best educational resources out there.

Before we take a closer look I'd like to point out that the Curriculum as a whole is still a work in progress and might change over time since new material covering state-of-the-art Deep Learning techniques is released on an ongoing basis. Feel free to bookmark this page and revisit it from time to time to stay up-to-date with the most recent changes.

The Approach

During the research phase which resulted in the following Curriculum I triaged dozens of classes, lectures, tutorials, talks, MOOCs, papers and books. While the topics covered were usually the same the required levels of expertise in advanced Mathematics and computer programming were not.

Generally speaking one can divide most educational Deep Learning resources in two categories: "Shallow" and "Deep". Authors of "Shallow" resources tend to heavily utilize high-level Frameworks and abstractions without taking enough time to talk about the underlying theoretical pieces. "Deep" resources on the other hand usually take the bottom-up approach, starting with a lot of Mathematical fundamentals until eventually some code is written to translate the theory into practice.

I personally believe that both is important: Understanding how the technology works under the covers while using Frameworks to put this knowledge into practice. The proposed Curriculum is structured in a way to achieve exactly that. Learning and understanding Deep Learning from a theoretical as well as a practical point-of-view.

In our case we'll approach our Deep Learning journey with a slight twist. We won't follow a strict bottom-up or top-down approach but will blend both learning techniques together.

Our first touchpoint with Deep Learning will be in a practical way. We'll use high-level abstractions to build and train Deep Neural Networks which will categorize images, predict and generate text and recommend movies based on historical user data. This first encounter is 100% practice-oriented. We won't take too much time to learn about the Mathematical portions just yet.

Excited about the first successes we had we'll brush up our Mathematical understanding and take a deep dive into Deep Learning, this time following a bottom-up approach. Our prior, practical exposure will greatly benefit us here since we already know what outcomes certain methodologies produce and therefore have specific questions about how things might work under the hood.

In the last part of this Curriculum we'll learn about Deep Reinforcement Learning which is the intersection of Reinforcement Learning and Deep Learning. A thorough analysis of AlphaGo Zero, the infamous agent that learned the Go board game from scratch and later on played against itself to become basically unbeatable by humans, will help us understand and appreciate the capabilities this approach has to offer.

During our journey we'll work on two distinct Capstone projects ("Capstone I" and "Capstone II") to put our knowledge into practice. While working on this we'll solve real problems with Deep Neural Networks and build up a professional portfolio we can share online.

Once done we'll be in a good position to continue our Deep Learning journey reading through the most recent academic research papers, implementing new algorithms and coming up with our own ideas to contribute to the Deep Learning community.

The Curriculum

As already discussed above, Deep Learning is… Deep. Given the traction and momentum, Universities, Companies and individuals have published a near endless stream of resources including academic research papers, Open Source tools, reference implementations as well as educational material. During the last couple of months I spent my time triaging those to find the highest quality, yet up-to-date learning resources.

I then took a step back and structured the materials in a way which makes it possible to learn Deep Learning from scratch up to a point where enough knowledge is gained to solve complex problems, stay on top of the current research and participate in it.

1. A Practical Encounter

We begin our journey in the land of Deep Learning with a top-down approach by introducing the subject "Deep Learning" in a practical and playful way. We won't start with advanced college Math, theoretical explanations and abstract AI topics. Rather we'll dive right into the application of tools and techniques to solve well-known problems.

The main reason of doing this is that it keeps us motivated since we'll solve those problems with state-of-the-art implementations which will help us see and understand the bigger picture. It's a whole lot easier to take a look under the covers of the abstractions we'll use once we know what can be achieved with such. We'll automatically come up with questions about certain results and behaviors and develop an own intuition and excitement to understand how the results came to be.

In doing this we'll take the great "Practical Deep Learning for Coders" course by the Fast.ai team which will walk us through many real-world examples of Deep Neural Network usage. Theoretical concepts aren't completely left out but will be discussed "just-in-time".

It's important to emphasize that it's totally fine (and expected) that we won't understand everything which is taught during this course the first time we hear about it. Most of the topics will be covered multiple times throughout this Curriculum so we'll definitely get the hang of it later on. If you're having problems with one topic or the other, feel free to rewatch the respective part in the video or do some research on your own. Keep in mind though that you shouldn't get too deep into the weeds since our main focus is still on the practical portions.

You should definitely recreate each and every single Jupyter Notebook which was used in the Fast.ai course from scratch. This helps you to get a better understanding of the workflow and lets you play around with the parameters to see the effects they have on the data.

When done it's a good idea to watch the following great talk by Google and this mini-course by Leo Isikdogan to solidify the knowledge we've just acquired.

Resources

2. Mathematical Foundations

Once we have a good understanding of what Deep Learning is, how it's used in practice and how it roughly works under the hood it's time to take a step back and refresh our Math knowledge. Deep Neural Networks heavily utilize Matrix multiplications, non-linearities and optimization algorithms such as Gradient Descent. We therefore need to familiarize ourselves with Linear Algebra, Calculus and some basic Probability Theory which build the Mathematical foundations of Deep Learning.

While this is certainly advanced Mathematics it's important to highlight that High School level Math knowledge is usually enough to get by in the beginnings. For the most part we should just refresh our knowledge a little bit. It's definitely not advisable to spent weeks or even months studying every aspect of Linear Algebra, Calculus or Probability Theory (if that's even possible) to consider this part "done". Basic fluency in the aforementioned topics is enough. There's always enough time to learn the more advanced topics as soon as we come across them.

Having a good Mathematical understanding will pay dividends later on as we progress with more advanced Deep Learning topics. Don't be intimidated by this part of the Curriculum. Mathematics can and should be fun!

Stanford has some great refreshers on Linear Algebra and Probability Theory. If that's too shallow and you need a little bit more to get up to speed you might find Part 1 of the Deep Learning Book helpful.

Once you've brushed up the basics it's worthwhile to take a couple of days and thoroughly study "The Matrix Calculus You Need For Deep Learning" by Terence Parr and Jeremy Howard (one of the founders of Fast.ai) and the "Computational Linear Algebra" course by Rachel Thomas (also a co-founder of Fast.ai). Both resources are heavily tailored to teach the Math behind Deep Learning.

Resources

3. Deep Dive

Now we're armed with a good understanding of the capabilities and the underlying Math of Deep Learning.

Given this it's time to take a deep dive to broaden our knowledge of Deep Learning. The main goal of this part is to take the practical experience and blend it with our Mathematical exposure to fully understand the theoretical building blocks of Deep Neural Networks. A thorough understanding of this will be key later on once we learn more about topics such as Deep Reinforcement Learning.

The following describes 3 different ways to take the deep dive. The approaches are certainly not mutually exclusive but could (and should) be used in conjunction to complement each other.

The path you might want to take will depend on your prior exposure to Deep Learning and you favorite learning style.

If you're a person who appreciates classical MOOCs in the form of high-quality, pre-recorded videos with quizzes and exercises you'll definitely enjoy Andrew Ng's Deeplearning.ai "Specialization for Deep Learning". This course is basically split up into 5 different sub-courses which will take you from the basics of Neural Networks to advanced topics such as as Recurrent Neural Networks. While learning about all of this you'll also pick up a lot of valuable nuggets Andrew shares as he talks about his prior experience as a Deep Learning practitioner.

You can certainly get around the tuition fee for the Deeplearning.aispecialization, but it's important to emphasize that it's definitely worth every penny! You'll have access to high quality course content, can request help when you're stuck and get project reviews by classmates and experts.

Readers who enjoy books should definitely look into the "Dive into Deep Learning" book. This book was created to be a companion guide for the STAT 157 course at UC Berkeley but turned into more than that. The main focus of this book is to be at the intersection of Mathematical formulations, real world applications and the intuition behind Deep Learning complemented by interactive Jupyter Notebooks to play around with. "Dive into Deep Learning" covers all of the important concepts of a modern Deep Learning class. It requires no prior knowledge and starts with the basics of Neural Networks while moving onwards to cover advanced topics such as Convolutional Neural Networks, ending in discussions about state-of-the-art NLP implementations.

Another method to study Deep Learning in great detail is with the help of recorded university class videos. MIT released the terrific "Introduction to Deep Learning" course which is basically a recording of their 6.S191 class accessible for everyone to watch! This option is definitely one of the more advanced ways to learn the subject as some prior university-level Math and Computer Science knowledge is necessary to grok it. The huge benefit of this format is that it touches on a lot of different topics other courses simply don't cover due to missing prerequisites. If you've already been exposed to university-level Computer Science and Mathematics and like to learn with a focus on more rigor theory, then this course is definitely for you.

Whatever route you take, it's really important that you take your time to revisit concepts and recreate their implementations from scratch. It's totally fine if you're struggling at first. It's this wandering through the dark alleys where you'll actually learn the most! Don't waste your time passively consuming content. Go out and reproduce what you've just learned!

At the end of the day it doesn't really matter what format you choose. All courses will equally well prepare you for the next step in your journey to Deep Learning mastery which is your first Capstone project!

Resources

4. Capstone Project I

Focus: Supervised Deep Learning

Enough theory (for now). It's time to put our hard earned knowledge to practice.

In our first Capstone project we'll demonstrate that we fully understand the basic building blocks of modern Deep Learning. We'll pick a problem of interest and solve it with the help of a Deep Neural Network. Since we've mostly dealt with Supervised Learning so far it's worth mentioning that our solution will be based on such an implementation.

Our programmatic environment will be a separate Jupyter Notebook where we code and describe every step together with a brief justification of its necessity in great detail. Taking the time to think through the steps necessary to solve our problem helps us check ourselves as we have to think through our architecture as well as the underlying processes that take place when our code is executed.

To further deepen our knowledge and help us get out of the comfort zone we'll restrict our implementation to the usage of low-level Frameworks, meaning that we're only allowed to use Frameworks such as PyTorch, TensorFlow or MXNet. Any usage of high-level abstraction libraries such as Fastai or Keras is strictly forbidden. Those libraries, while being great for the experienced practitioner, abstract too much away, hindering us to go through the tough decisions and tradeoffs we have to make when working on our problem.

Remember that this is the part where we'll learn the most as we're really getting into the weeds here. Don't give up as enlightenment will find you once you made it. It's also more than ok to go back and reread / rewatch the course material if you're having problems and need some help.

While working on this project always keep in mind that it's one of your personal portfolio projects you should definitely share online. It's those projects where you can demonstrate that you're capable to solve complex problems with Deep Learning technologies. Make sure that you really spend a good portion of your time on it and "make it pretty".

Are you struggling to find a good project to work on? Here are some project ideas which will help you get started:

5. Deep Reinforcement Learning

Deep Reinforcement Learning is the last major topic we'll cover in this Curriculum.

One might ask the question as to what the difference between the Deep Learning we're studying and Deep Reinforcement Learning is. All the techniques we've learned and used so far were built around the concept of Supervised Learning. The gist of Supervised Learning is that we utilize large datasets to train our model by showing it data, letting it make predictions about what it thinks the data represents and then using the labeled solution to compute the difference between the prediction and the actual solution. We then use algorithms such as Gradient Descent and Backpropagation to subsequently readjust our model until the predictions it makes meet our expectations.

You might've already noticed that Supervised Learning heavily relies on huge datasets to train and test our models via examples.

What if there's a way that our AI can teach itself what it should do based on self-exploration and guidelines we define? That's where Reinforcement Learningcomes into play. With Reinforcement Learning we're able to let our model learn from first principles by exploring the environment. The researches at DeepMindwere one of the first who successfully blended Deep Learning and Reinforcement Learning to let an AI teach itself to play Atari games. The only inputs the AI agent got were the raw input pixels and the score.

In this part of our Curriculum we'll learn what Reinforcement Learning is and how we can combine Deep Learning and Reinforcement Learning to build machine intelligence which learns to master tasks in an autodidactic way.

As per usual there are different ways to learn Deep Reinforcement Learning.

Thomas Simonini has a great "Deep Reinforcement Learning Course" which focuses on the practical pieces of Deep Reinforcement Learning as you'll implement real world applications throughout his class.

OpenAIs "SpinningUp AI" course is another great resource which strikes a really good balance between practical examples and theoretical foundations.

If you're looking for a University-level class which heavily focuses on the theoretical underlying I'd highly recommend the "Advanced Deep Learning and Reinforcement Learning Class" which was taught by UCL and DeepMind.

Every resource listed here will help you understand and apply Deep Reinforcement Learning techniques. While some are more focused on the practical portions others go really deep into the trenches of theoretical rigor. It's definitely worthwhile to look into all of them to get the all-around view and best mixture between theory and practice.

Once you successfully made your way through one of the Deep Reinforcement Learning courses it's a good idea to revisit the key ideas by reading the excellent blog posts "Deep Reinforcement Learning: Pong from Pixels" by Andrej Karpathyand "A (Long) Peek into Reinforcement Learning" by Lilian Weng as they give a nice, broader overview of the different topics which were covered during class.

Aside: If you're fascinated by the possibilities of Reinforcement Learning I'd highly recommend the book "Reinforcement Learning: An Introduction" by Richard Sutton and Andrew Barto. The recently updated 2nd edition includes chapters about Neuroscience, Deep Neural Networks and more. While it's possible and desirable to buy the book at your local bookstore you can also access the book as a freely available PDF online.

Resources

6. Capstone Project II

Focus: Deep Reinforcement Learning

It's time for our second and last Capstone Project where we'll use Deep Reinforcement Learning to let our AI teach itself to solve difficult real-world problems.

The same restrictions from our first Capstone project also apply here. We'll implement the solution in a dedicated Jupyter Notebook where we write our code and the prose to describe what we're doing and why we're doing it. This helps us test our knowledge since we have to take the time to think through our current implementation and its implications to the system as a whole.

As with the Capstone I project it's forbidden to use higher level abstraction libraries such as Fastai or Keras. Our implementation here should only use APIs provided by lower-level Frameworks such as PyToch, TensorFlow or MXNet.

Keep in mind that it's totally fine to feel stuck at some point. Don't be discouraged! Take your time to revisit the material and ensure that you fill your knowledge gaps before moving on. It's those moments of struggle where you grow the most. Once you've made it, you'll feel excited and empowered.

The result of this Capstone project is another crucial piece of your personal Deep Learning portfolio. Make sure to set aside enough time to be able to put in the effort so that you can showcase your implementation online.

Do you need some inspiration for projects you might want to work on? Here's a list with some ideas:

Conclusion

Deep Learning has gained a lot of traction in last couple of years as major scientific breakthroughs made it finally possible to train and utilize Deep Neural Networks to perform tasks at human expert level ranging from cancer detection to mastery in games such as Go or Space Invaders.

In this blog post I shared the Curriculum I follow to learn Deep Learning from scratch. Right in the beginning of the journey one learns how Deep Learning techniques are used in practice to solve real-world problems. Once a baseline understanding is established it's time to take a deep dive into the Mathematical and theoretical pieces to demystify the Deep Learning "Black Box". A final exploration of the intersection of Deep Learning and Reinforcement Learning puts the reader in a great position to understand state-of-the art Deep Learning solutions. Throughout the whole Curriculum we'll practice our skills and showcase our fluency in such while working on dedicated Capstone projects.

While putting this together I had the feeling that this Curriculum can look quite intimidating at first glance since lots of topics are covered and it'll definitely take some time to get through it. While I'd advise the avid reader to follow every single step in the outlined order it's totally possible to adapt and skip some topics given that everyone has different experiences, goals and interests. Learning Deep Learning should be fun and exciting. If you ever feel exhausted or struggle to get through a certain topic you should take a step back and revisit it later on. Oftentimes complicated facts and figures turn into no-brainers if we give ourselves the permission and time to do something else for the moment.

I personally believe that it's important to follow a goal while learning a new topic or skill. Make sure that you know why you want to learn Deep Learning. Do you want to solve a problem at your company? Are you planning to switch careers? Is a high level overview enough for you since you just want to be educated about AI and its social impacts? Whatever it is, keep this goal in mind as it'll make everything reasonable and easier during the hard times when the motivation might be lacking and everything just feels too hard to pick up.

Do you have any questions, feedback or comments? Feel free to reach out via E-Mail or connect with me on Twitter.

Learning Advanced Mathematics

Tue, 19 Feb 2019 09:57:00 GMT

For me personally Math was one of those mysterious subjects I had to go through in school but never really understood, let alone appreciated. It was too abstract, involved lengthy computations, rode formula memorization with little to no explanation as to why it's useful and how it's applied in the real world. Frankly put Math was one of my weakest spots. My parents were surprised and shocked when I told them that I planned to study Computer Science, which is a branch of applied Mathematics. Throughout my life I had a love-hate relationship with Math. I still remember that feeling of relief when I passed my last Math exam in college.

During my career as a Software Engineer I was mostly Math absent. From time to time I consulted old Computer Science books to do some research on algorithms I then implemented. However those were usually the only touchpoints I had with Math.

Something changed over the last couple of years. While looking for the next personal challenges and goals to grow I figured that most of the really exciting achievements heavily utilize Math as a fundamental building block. That's actually true for a lot of scientific fields including Econometrics, Data Science and Artificial Intelligence. It's easy to follow the news and roughly understand how things might work but once you try to dig deeper and look under the hood it gets pretty hairy.

I found myself regularly lost somewhere in the dark alleys of Linear Algebra, Calculus and Statistics. Last year I finally stuck a fork in the road. I wanted to fundamentally change my understanding and decided to re-learn Math from scratch. After countless late nights, early mornings and weekends doing classes, exercises and proofs I'm finally at a pretty decent level of understanding advanced Mathematics. Right now I'm building upon this foundation to learn even more.

During this process I learned one important thing: Math is really amazing!

Math is the language of nature. Understanding it helps you understand how our world works!

With this blog post I'd like to share how I went from "What is a root of a polynomial again?" to "Generalized Autoregressive Conditional Heteroskedasticity" (at least to some extend). I'll share the Curriculum I created and followed, the mistakes I made (spoiler: I made a lot) and the most useful resources I used throughout this journey.

Before we start I want to be honest with you: Math is a really involved discipline. There's a lot out there…

And it can certainly be overwhelming. However if you're really dedicated and want to put in those hours you'll make it! If I can do it so can you!

Please keep in mind that this is the path which worked for me. This doesn't necessarily mean that it will be as efficient for you. In my case I need to study, self-explain and practice, practice, practice to really understand a topic at hand. I know of people who can just sit in class, listen and ultimately get it. That's definitely not how I operate.

Alright. Let's get started!

The Curriculum

Math is one of those subjects where you'll find a nearly endless stream of resources. Looking closer they all vary widely in terms of quality, density and understandability.

My first approach to ramp up my Math skills was to skim through an interesting research paper, write down all the Math I won't understand and look those terms up to study them in greater detail. This was fundamentally wrong on many levels. After some trial and error I took a step back and did a lot of research to figure out which topics I should study to support my goal and how those topics are related to one another.

The Curriculum I finally put together is a good foundation if you want to jump into other "Hard Sciences". My personal goal was to learn the knowledge I need to take a really deep dive into Artificial Intelligence. To be more specific I'm really excited about Deep Learning and the next steps in the direction of Machine intelligence.

Every topic which is covered in this Curriculum uses 3 basic pillars to build a solid Mathematical foundation:

Intuition

Videos, interactive visualizations and other helpful resources which outline how the Math came to be and how it works on an intuitive level.

Deep Dive

A good enough "deep dive" to get familiar with the foundational concepts while avoiding confusion due to overuse of theorems, proofs, lemmas, etc.

Practicality

Practice, practice, practice. Resources such as books with lots of exercises to solidify the knowledge.

Algebra

Algebra is the first topic which should be studied extensively.

Having a really good understanding of Algebra makes everything a whole lot easier! Calculus comes down to 90% Algebra most of the time. If you know how to solve Algebra problems you won't have a hard time in Calculus either.

Most of you might remember a phrase similar to

"Solve this equation for x"

That's what Algebra is about. In an Algebra class you'll learn about the following topics:

Solving equations
Solving inequalities
Polynomials
Factoring
Functions
Graphing
Symmetry
Fractions
Radicals
Exponents
Logarithms
Linear systems of equations
Nonlinear systems of equations

As stated above it's of uber importance that you really hone your Algebra skills. I'm repeating myself but Algebra is one of the main building blocks for advanced Mathematics.

Resources

Trigonometry

In Trigonometry you'll study the relationship of lengths and angles of triangles.

You'll learn about the unit circle and it's relation to sin and cos, cones and their relation to circles, ellipses, parabolas and hyperbolas, Pythagoras' Theorem and more. Trigonometry is interesting in itself since it can be immediately applied to real life problems.

Here's a list of topics you'll usually learn in a Trigonometry class:

Pythagoras' Theorem
Sin and cos
The unit circle
Trigonometric identities
Radians vs. Degree

Generally speaking this course is rather short. Nevertheless it's a good preparation class for Calculus.

Resources

Calculus

The study of continuous change is one of the main focus areas in Calculus.

This might sound rather abstract and the intuition behind it is really paradox if you think about it (see "Essence of Calculus" below). However you might remember that you dealt with Derivatives, Limits and area calculations for functions.

There are usually 3 different Calculus classes (namely Calculus I, II and II) one can take. Those 3 classes range from easy topics such as "Derivatives" and "Limits" to advanced topics such as "Triple Integrals in Spherical Coordinates". I'd suggest to definitely take the first class (Calculus I) and continue with the second one (Calculus II) if time permits. If you're in a hurry taking Calculus I is usually sufficient.

In Calculus I you'll learn about:

Limits
Continuity
L'Hospitals Rule
Derivatives
Power, Product, Quotient, Chain rule
Higher Order Derivatives
Min / Max Values
Concavity
Integrals
Substitution Rule

Calculus is an important topic since it's heavily used in optimization problems to find local minima. The "Gradient Descent" algorithm uses techniques from Calculus such as Derivatives and is leveraged in modern (Deep) Neural Networks to adjust the weights of Neurons during Backpropagation.

Resources

Linear Algebra

Linear Algebra is one of the most, if not the most important topic when learning Math for Data Science, Artificial Intelligence and Deep Learning.

Linear Algebra is pretty much omnipresent in modern computing since it lets you efficiently do calculations on multi-dimensional data. During childhood you probably spent quite some time in in front of your computer screen while wading through virtual worlds. Photorealistic 3D renderings are possible thanks to Math and more specifically Linear Algebra.

Linear Algebra courses usually cover:

Systems of Equations
Vectors
Matrices
Inverse Matrices
Identity Matrix
Matrix Arithmetic
Determinants
Dot & Cross Product
Vector Spaces
Basis and Dimension
Linear Transformation
Eigenvectors & Eigenvalues

As already stated above, Linear Algebra is one of the most important topics in modern computing. Lots of problems such as image recognition can be broken down into calculations on multi-dimensional data.

You might have heard about the Machine Learning framework TensorFlow which was developed and made publicly available by Google. Well, a Tensor is just fancy word for a higher-dimensional way to organize information. Hence a Scalar is a Tensor of rank 0, a Vector is a Tensor of rank 1, a N x N Matrix is a Tensor of rank 2, etc.

Another interesting fact is that Deep Neural Networks are usually trained on GPUs (Graphic Processing Unit) or TPUs (Tensor Processing Unit). The simple reason is that GPUs and TPUs are way better at processing Linear Algebra computations compared to CPUs since (at least GPUs) were invented as a dedicated hardware unit to do exactly that when rendering computer graphics.

Aside: Here's the original paper by Andrew Ng et al. where GPUs were first explored to carry out Deep Learning computations.

Resources

Statistics & Probabilities

The last topic which should be covered in this Curriculum is Statistics & Probabilities.

While both topics are sometimes taught separately it makes sense to learn them in conjunction since statistics and probabilities share a deep underlying relationship.

A typical Statistics & Probabilities class covers:

Charting and plotting
Probability
Conditional Probability
Bayes Rule
Probability Distributions
Average
Variance
Binomial Distribution
Central Limit Theorem
Normal Distribution
Confidence Intervals
Hypothesis Test
Regression
Correlation

In Data Science one usually has to deal with statistical analysis to see if the computations actually made sense. Furthermore it's helpful to compute and visualize correlations between data and certain events. Bayes Rule is another important tool which helps us update our belief about our view of "the world" when more evidence is available. The realms of Machine Learning and Deep Learning usually deal with lots of uncertainty. Having a good toolbox to deal with this makes our life a whole lot easier.

A pretty popular example of applied statistics is the Monte Carlo Tree Search algorithm. This heuristic algorithm was used in DeepMinds AI breakthrough "AlphaGo" to determine which moves it should consider while playing the Go boardgame.

Feel free to read through the official paper for more information about the underlying technologies. Trust me, it's amazing to read and understand how Math played such a huge role to build such a powerful contestant.

Resources

Mistakes

As I already stated above it's been quite a journey and I made lots of mistakes along the way.

In this section I'd like to share some of those mistakes so that you don't have to go through this yourself.

The first mistake I made was jumping straight into Math without having a clear plan / Curriculum and more importantly goal. I dived right into certain topics I picked up while reading research papers and quickly figured that some of them were too advanced since I understood only little (if anything at all). My approach was to back off and start somewhere else. "Trial an error" so to say. This was obviously very costly in terms of time and resources.

The solution here was to actually have a clear goal (learning Math to understand the underlying principles of Artificial Intelligence) and take the time to research a lot to come up with a sound Curriculum and start from there. Having that sorted out I only had to follow the path and knew that I was good.

During this aforementioned trial and error phase I made the mistake of taking way too many MOOCs. Don't get me wrong, MOOCs are great! It has never been possible before to take an MIT course from your couch. In my case exactly that was the problem. Most of the time I was passively watching the course content nodding along. After a couple of "completed courses" and the feeling of knowing the ins and outs I jumped into more sophisticated problems to figure that I developed a pretty shallow knowledge.

Doing a retrospective on the "completed courses" I saw that my learning style isn't really tailored to MOOCs. I decided to switch my focus to the good old physical textbooks. I especially focused on textbooks with good didactics, lots of examples and exercises with solutions (the Schaum's Outlines series is golden here). Switching from passively consuming to actively participating in the form of working through numerous exercises was really the breakthrough for me. It ensured that I left my comfort zone, went deep into the trenches and really battle-tested my knowledge about the topic at hand.

The other upside of using textbooks is that it made it possible to learn in a distraction free environment. No computer, no notifications, no distractions. Just me, a black coffee and my Math textbook!

Another, final tip I'd like to share is that you should really keep track of your feeling and engagement while studying. Do you feel fired up? Are you excited? Or are you just consuming and your thoughts are constantly wandering off because you don't really care that much? If that's the case then it's usually time to move on. Don't try to push through. There's nothing worse than completing a course just for the sake of completing it. If it doesn't feel right or isn't working for you it's important to let go and move on. There's enough material and maybe the next one suits your needs.

Conclusion

In this blog post I've shared my journey from being someone who acknowledged Math as something one should've heard about to someone who learned to love Math and its applications to solve complex problems. In order to really understand and learn more about Artificial Intelligence and Deep Learning I created a Curriculum which does not only cover the underlying Math concepts of such disciplines but will serve the student well when learning more about other "Hard Sciences" such as Computer Science in general, Physics, Meteorology, Biology, etc.

I'm still early in my Math journey and there's an infinite amount of exciting stuff to learn. With the given Curriculum I feel confident that I've gained a solid foundation to pick up more complex topics I'll encounter while learning about Artificial Intelligence, Deep Learning and Advanced Mathematics in general.

Do you have any questions, feedback or comments? Feel free to reach out via E-Mail or connect with me on Twitter.