Customer retention is a goal every business owner should be obsessed with. At the end of the day it's cheaper to retain an existing customer than it is to acquire a new one.

But how do you ensure that your customers keep using your service?

Are there any simple, yet effective ways to reduce or even prevent churn?

As it turns out there's one simple strategy you can use to keep your customers around even if they're about to leave your platform. Let's explore what it is and why it works.

As stated in the introduction, it's important to focus on customer retention when building a sustainable business.

Acquiring customers can be an expensive endeavour. If you're not (yet) in a position where your product grows through Word-of-Mouth, you're likely spending a good portion of your revenue on paid ads and marketing to drive traffic to your service. Only a few of your thousands of visitors will eventually try your product and convert into paying customers.

Optimizing this marketing and sales funnel is a tricky and costly activity. Think about it for a minute: who finances the learning and tweaking of such a funnel? Correct, **your existing customers**.

That's why keeping your users happy and around is one of the most important business objectives.

If you think about it, there's really only one reason why your customers are leaving your platform:

Your product isn't a crucial part of their life anymore

While this sounds harsh I'd like you to think about all the services you're currently subscribing to. Now imagine that you can only keep one. What would you cancel? Probably everything except the one you can't live without.

Of course, preferences differ from person to person and change over time. And that's exactly why people cancel their subscription to your service: their preferences have changed, and they might want to take a pause from your service or need something else entirely.

Now that we know why your customers churn, it's time to put ourselves in their shoes and think about ways to keep them around.

One of the "industry" standards is to send out a survey once they're about to leave, to gather feedback and convince them to stay. Some services offer coupon codes if, for example, the user has clicked the "it's too expensive" option in the survey.

Other tactics are more on the "dark patterns" side of things: hiding buttons, asking double-negative questions or using other techniques to make it nearly impossible to leave. Needless to say, customers of businesses practicing such tactics aren't the ones who spread the word about how awesome the product is. Quite the opposite.

But let's take a step back for a minute and ask ourselves why this "should I stay or should I go" question has to be binary in the first place. Isn't there something "right in the middle"? Something where a user can stay but somehow go at the same time?

The solution to this dilemma is dead simple and obvious, yet rarely used: Make it possible to **pause the subscription**.

Yes, it's that simple. Just offer a way to pause a subscription and come back to it once your users' current circumstances have changed.

Now you might think that it's a really bad idea to let users pause their subscription. They'll pause and never come back. So essentially it's "passive churn": they haven't left the platform yet, but they might never use it again. Stale user data sits in the database while your dashboards still show hockey-stick growth. Furthermore, it's a huge implementation effort, since pausing and resuming subscriptions isn't considered business critical and hence hasn't been implemented yet.

Those are all valid concerns, and some of them might turn out to be true even with a "pause and resume your subscription" system in place. But let's take a second to look at the other side of the equation.

The very first thing that comes to mind is the COVID-19 pandemic we're currently in. A lot of businesses scaled back and had to cancel subscriptions to their favorite SaaS tools to cut costs. A common "save the customer" tactic here is to get in touch with the business owner and offer heavily discounted, year-long subscription plans. That way businesses can reassess whether they should really quit and leave the huge discount on the table, or double down and benefit from the sweet, discounted multi-year subscription deal.

Letting businesses put their subscription on hold is another strategy that could help retain and eventually reactivate your users during this pandemic. Put yourself into your customers' shoes again for a minute. Wouldn't you want to pay it back in the future if your supplier lent you a helping hand instead of "forcing" you out the door?

Even if your customers pause their account, you still have their E-Mail address to reach out and keep them informed about your product. In fact you should use this opportunity to stay in touch, ask them how they're doing and provide something of value along the way. That way you keep the communication "warm" and your business stays on their radar. There's a higher likelihood that they'll think of your service when times have changed and they're about to scale things up again.

Pausing a subscription is an action that's usually taken with some level of consideration. Customers who want to quit will just cancel the subscription anyway. Pausing for the time being is an option your users might simply not have right now; lacking it, they're forced into a very binary decision, and so they quit.

What you should also keep in mind is that pausing a subscription doesn't necessarily mean that you'll lose revenue. There are different and very creative ways in which you can implement the pause. My gym, for example, simply extends my membership by the number of months I put it on hold. In the summer I make use of this feature since I do my workouts outside anyway. The 3-4 months I "save" are simply added to the end of my contract. I just have a little more control over how and when I spend my time on sports. You can get really creative here and invent other ways for this mechanism to work if you want to ensure that you don't lose revenue.
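This "extend instead of refund" mechanic is simple to model. Here's a minimal sketch in Python; the Membership class and its fields are hypothetical and not taken from any particular billing system:

```python
from dataclasses import dataclass
from datetime import date

def add_months(d: date, months: int) -> date:
    """Shift a date forward by whole months, clamping the day if needed."""
    month_index = d.month - 1 + months
    year = d.year + month_index // 12
    month = month_index % 12 + 1
    leap = year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)
    days_in_month = [31, 29 if leap else 28, 31, 30, 31, 30,
                     31, 31, 30, 31, 30, 31][month - 1]
    return date(year, month, min(d.day, days_in_month))

@dataclass
class Membership:
    ends_on: date
    paused_months: int = 0

    def pause(self, months: int) -> None:
        """A pause refunds nothing -- it pushes the contract end out instead."""
        self.paused_months += months
        self.ends_on = add_months(self.ends_on, months)

membership = Membership(ends_on=date(2020, 12, 31))
membership.pause(3)  # summer break
print(membership.ends_on)  # 2021-03-31
```

The customer keeps paying for the same total number of billed months; only the calendar window shifts, which is why no revenue is lost.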

A last, important point is that you can use this functionality as a competitive advantage and "marketing material". Be sure to add the fact that people can pause their subscription to your list of product benefits. Add it to the copy right next to your "Subscribe Now" button. Addressing objections and concerns right before the call-to-action is about to happen will drastically increase your conversion rates.

Now you might be excited and eager to implement this strategy in the near future but before you do so I'd like to call out a couple of things you should keep in mind when implementing it.

First of all: **Keep it simple**. There's no need to jump right into code and implement this functionality end-to-end. Do it manually in the beginning. Update the database records and the subscription plans for people who want to pause their subscription by hand. Maybe you find out that very few people want to make use of this feature. What you definitely want to put in place is your new copywriting. As discussed above you should ensure that your marketing website is updated and reflects the recent change you just introduced.
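Even the manual approach benefits from tiny, consistent bookkeeping: flip a status field and note a resume date. A sketch of that idea in Python; the schema and names are made up for illustration, and in production this would be an update against your real billing records or a provider feature such as Stripe's pause_collection:

```python
from datetime import date, timedelta

# Hypothetical in-memory stand-in for a subscriptions table.
subscriptions = {
    "cus_123": {"status": "active", "resume_on": None},
}

def pause_subscription(customer_id: str, months: int, today: date) -> None:
    """Flip the status and note when to reach out and reactivate."""
    record = subscriptions[customer_id]
    record["status"] = "paused"
    record["resume_on"] = today + timedelta(days=30 * months)

pause_subscription("cus_123", months=2, today=date(2020, 6, 1))
print(subscriptions["cus_123"]["status"])  # paused
```

The resume date is what powers everything that follows: your follow-up E-Mails, your dashboards (paused vs. churned) and the reminder to welcome the customer back.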

Next up, you want to have an **automated follow-up E-Mail sequence** / drip campaign set up for pausing customers. Keep in touch. Ask about problems they had with your software and help them succeed in whatever they're up to right now. You might want to jump on a quick call to gather feedback as to why they paused and understand what needs to be in place for them to come back. If you do this, please ensure that you're **genuinely interested** in the communication. There's nothing worse for a user than composing a reply and shooting the E-Mail into the marketing void.

A very important, yet often overlooked step is to have a tool in place which deals with "passive churn". Such a system ensures that the credit cards on file are up to date and chargeable. There could be an overlap between your users pausing their subscription and their credit cards expiring, and you don't want to make them look bad because of that. You could even think about a "concierge service" which onboards them in person once they come back. Combine this with a quick update on all the new features and changes they missed and aren't yet familiar with.
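The expiry-overlap check itself is easy to automate. A hedged sketch as a standalone helper, not tied to any payment provider's API:

```python
from datetime import date

def card_expires_during_pause(exp_month: int, exp_year: int,
                              pause_until: date) -> bool:
    """True if the card on file becomes invalid before the pause ends.

    Cards are valid through the last day of the stated month, so the
    first invalid day is the first of the following month.
    """
    if exp_month == 12:
        first_invalid = date(exp_year + 1, 1, 1)
    else:
        first_invalid = date(exp_year, exp_month + 1, 1)
    return first_invalid <= pause_until

# Card valid through 09/2020, pause runs until November 2020:
print(card_expires_during_pause(9, 2020, date(2020, 11, 1)))  # True
```

Run a check like this over paused accounts once a week and you can ask for updated card details well before the resume date, instead of greeting a returning customer with a failed charge.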

Lastly, you absolutely don't want to make it hard for your users to pause their subscription. As mentioned above, avoid dark patterns at all costs. And more importantly: **Don't penalize them for pausing**. Messages such as "We'll retain your data for the next 60 days" are inappropriate in this day and age of "Big Data" and petabytes of storage for pennies.

I'd like to challenge you to think about adding the possibility to pause a subscription. Is it suitable for your business? Would it help you retain and reactivate more customers (especially in the current situation we're in)?

If you're about to add it, keep in mind that it doesn't have to be complicated. Start with a simple E-Mail form your users can fill out to let you know for how long they want to pause. Just make sure that you follow the best practices outlined above and that you advertise that it's now possible for your customers to pause their subscriptions.

Customer retention is one of the most important metrics every business owner should focus on. It's the existing customers who finance the Customer Acquisition Costs necessary to bring new users through the door.

It's almost always cheaper to keep your existing customers happy than to lose them and acquire brand new ones.

Unfortunately a lot of SaaS services only offer a very binary option for their subscription plans. As a user you're either in or you're out. You stay or you leave. But what if a user wants to take a pause for a few months because of a change in life circumstances?

Offering a way to pause a subscription is a simple yet effective way to retain and eventually reactivate your existing customers. Remember that a pause is temporary. If you follow up with them on a continuous basis and help them succeed, they'll eventually come back. Maybe even as raving, more loyal fans of your brand.

I hope that you enjoyed this article and I'd love to invite you to subscribe to my Newsletter if you're interested in more, action-oriented posts like this.

Do you have any questions, feedback or comments? Feel free to reach out via E-Mail or connect with me on Twitter.

Your Landing Page is finally online and you got some traffic, but it isn't generating any leads?

Does your Landing Page contain all the information to communicate the key benefits your target audience is looking for?

What other strategies are out there to increase your conversion rates?

In this blog post we'll explore a simple strategy you can use to create a Landing Page which resonates with your target audience and therefore results in more leads.

Bootstrapping a business is hard. Really, really hard. Every founder has to wear many different hats throughout the journey of starting and running a profitable business. There's Marketing, Product, Sales, Development and so many more aspects you as a founder have to juggle in the day-to-day.

Luckily, there are products and SaaS services which help you streamline the work you have to do on those fronts. Examples of such products are Landing Page templates and Landing Page builders.

Your Landing Page is one of the most critical aspects of your business you have to get right. Every visitor to your website judges your product in a matter of seconds and decides whether it's a good idea to dive deeper into your solution or move on and close the tab.

There are dozens of articles describing in great detail how your Landing Page should be laid out and which copywriting you should use to make your visitors stick. Over the years companies have spent a lot of time and resources running different A/B tests to eventually derive their "magic" Landing Page formula.

As a bootstrapped founder you don't have the time, resources and most importantly traffic to run such tests to find out which Landing Page format converts the most users.

That's where Landing Page templates and builders come into play. Given that the main goal of every Landing Page is to clearly communicate the benefits your product has to offer it's possible to define a common structure every project should follow.

Such structure usually looks something like this:

- Headline
- Sub Headline
- Screenshot
- Description of the pain
- Description of how the product makes the user's life better and alleviates the pain
- Benefits with screenshots
- Testimonials / Social Proof
- Call to action (CTA)
- Footer

This structure is based on a common understanding of what a Landing Page needs to contain in order to achieve high conversion rates. Following these best practices, you should be able to download a template, fill in the blanks, maybe rearrange some blocks, and have a decent Landing Page which delights your visitors, correct?

Well, not quite... There's more to a converting Landing Page than following the de facto standard.

It's always a good idea to put yourself into the shoes of your future potential customers and think about what you'd expect your product's Landing Page to contain to get excited and give it a try.

Are there certain aspects you pay special attention to? Is there anything that's currently missing? Maybe a short video is better suited than a gallery of in-app screenshots? Is there a way to embed a live demo of your product (e.g. the way Intercom does it with their live chat widget)? Are there crucial features your product offers to differentiate itself from the competition?

All those questions and their answers are very specific to the solution you're offering and the audience you're serving. While generic Landing Pages are a good start, they can't provide guidance on which product- and audience-specific questions you should answer.

Putting yourself into your users' shoes is one way to tackle this issue, but there's a trap almost all of us fall into: **Bias**.

What's a good way to get more out of our Landing Pages without us analyzing it in a biased way?

Chances are that if you're bootstrapping your business you're not entirely reinventing the wheel. If you're not VC backed it's usually a good idea to work on a solution for a well known problem specifically tailored for a niche audience rather than coming up with something entirely new.

Defining new markets is expensive and time consuming because you have to raise awareness and educate your customers as to how your product benefits them.

Given the nature of our business, we can use our competition to get some inspiration on what their customers were looking for when visiting their Landing Pages back in the day. If our competition is successful today, we can be pretty sure that they did something right back when they pitched their product for the very first time.

Let's take this idea for a spin and see how we can reverse-engineer the parts that were crucial in the success of our competition.

For this example we pretend that we've developed a live support chat application similar to Intercom or Drift.

The first thing we want to do is to figure out which companies we're competing with.

Chances are that you already stumbled upon this question when you did your market research and product validation. If you haven't, you can find your main competitors easily via business review sites such as Capterra or G2.

Given that we're offering a live support chat service, we type "support chat" into the Capterra search box:

While we see some products at the top of the page, we're more interested in the "Software Categories". If we scroll down a little, we can see the category "Live Chat Software". That's exactly what we're looking for, so we click on that link.

The information on the "Live Chat Software" page confirms that that's the category our product belongs to. Scrolling down a bit, we finally find our competitors, such as Zendesk, HubSpot Service Hub and Intercom.

Looking at the ratings and reviews it might be a good idea to study what Intercom did back in the day.

Now that we know that one of our main competitors is Intercom it's a good idea to learn when they first launched. Getting this information can easily be done by visiting their Wikipedia page.

If that doesn't return any results we can use Crunchbase which is a huge database containing tons of relevant information about startups and businesses.

As we can see, Intercom was founded in August 2011 and their domain is intercom.com. That's all the information we need.

**A note on domain names:**

If you dig a little deeper into the history of Intercom you'll stumble upon the fact that they didn't own the intercom.com domain from day one. In fact they bought it after they launched, having started with intercomapp.com. If you want to go the extra mile you could try to figure out whether your competitor had a different domain name when they launched. However, most of the time it's enough to know the "final" domain name, because that's usually when the company started to attract public interest and gain traction.

Armed with this information, we visit the Internet Archive's "Wayback Machine", a non-profit project dedicated to recording the internet's history.

In the search bar we enter our competitors domain and select the year when they launched.

On the calendar view we select one of the very first recorded snapshots to see what their Landing Page looked like back in the day.

And there we have it: Intercom's original Landing Page.
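If you repeat this research for several competitors, the lookup is easy to script. The snippet below only builds URLs in the Wayback Machine's public formats: a partial timestamp such as a bare year redirects to the closest recorded snapshot, and the availability API returns JSON describing the nearest capture.

```python
from urllib.parse import urlencode

def wayback_snapshot_url(domain: str, year: int) -> str:
    """A partial timestamp (just the year) redirects to the closest snapshot."""
    return f"https://web.archive.org/web/{year}/{domain}"

def wayback_api_url(domain: str, year: int) -> str:
    """The availability API returns JSON describing the nearest capture."""
    query = urlencode({"url": domain, "timestamp": str(year)})
    return f"https://archive.org/wayback/available?{query}"

print(wayback_snapshot_url("intercom.com", 2011))
# https://web.archive.org/web/2011/intercom.com
```

Paste the first URL into your browser to jump straight to the earliest 2011 snapshot, or fetch the second one to get the snapshot metadata programmatically.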

We should now take some time and look at which aspects of their product Intercom paid special attention to. What you'll notice, for example, is that they put a special emphasis on "building relationships" via their software. Another interesting data point is that they started with a free plan during their private beta period, meaning they didn't launch publicly on day one.

What you want to do now is spend some time navigating around and taking notes while you browse. Reverse-engineer their website and the way they communicate their customers' pain and their solution to that pain. Look at which features and benefits they highlight and how they reflect that in their copywriting.

Now you might be thinking that you don't want to be a copycat of your competitor, and you're right about that. Being a 1:1 copy shouldn't be your goal. However, there's a very high likelihood that the audience your competitor serves is similar to the one you're about to work with. Hence they're receptive to messaging similar to what your competitor uses.

**A note on using the Wayback Machine:**

The Wayback Machine is a truly awesome tool and an important initiative, as it tries to record the internet's history as accurately as possible. However, as you browse through old, preserved websites you might notice a minor hiccup here and there. That's totally normal. Keep in mind that the project is a non-profit and has a vast amount of data to store and process. Just pick another date on the calendar if you get stuck. And always keep the year in mind when browsing old websites: what looks dated today was trendy back then.

We've now seen how we would use this process to drastically improve the Landing Page of our imaginary live support chat application. Now it's your turn. Take some time this week to figure out who your main competitors are and how they communicate their products' benefits on their Landing Pages.

Pay close attention to the features and solutions they highlight. Those aspects might be interesting for your target audience as well. Also note how they present their application. Do they use a lot of screenshots or go with video? What is their launch pricing structure? Did they charge on day one? If so, how much?

Try to learn as much as possible and incorporate some of those learnings into your Landing Page as well!

As the old saying goes:

Good Artists Copy; Great Artists Steal

While Landing Page templates and builders are a great starting point to get your product's initial marketing page off the ground, they're usually generic in order to suit as many different use cases as possible.

The main task of your Landing Page is to clearly articulate the benefits your target audience will experience when using your product over the competition. Your future customers should be inclined to dive deeper into the solution you're providing rather than leaving your website after the first few seconds.

That cannot be done with a generic template you've downloaded and filled out. You need to tailor your messaging to your future potential customers. One way to learn what your users will pay attention to is to study what your competition did back in the day when they published and promoted their very first Landing Page. In this blog post we learned a structured way to identify your main competition and reverse-engineer their Landing Pages in great detail.


Your product offers a free trial, but your users aren't converting when it expires?

Most of the time the worst part is that you don't have any data to figure out why users stopped using your product. Even if you reach out, feedback is usually sparse and often doesn't reveal any useful patterns you can look into.

In response to that, a lot of time and energy is usually spent tweaking the onboarding flow or marketing funnel.

But there's a simpler, better way to increase your conversion rates. Let's see how it works.

There's a lot of debate whether having a free trial is the best way to acquire new customers.

The clear upside of offering a free trial is that the barrier to entry is extremely low. An E-Mail address and a password are usually enough to get new users through the door. The more users you onboard, the more likely it is that they talk about your product in their peer groups and networks (assuming it solves a real pain). Even better if your product has built-in functionality which incentivizes your users to invite others. This mix of Word-of-Mouth and network effects kick-starts the growth flywheel which will attract even more users. Rinse and repeat.

Eventually a small percentage of your overall user base converts and upgrades to a paid plan.

Hundreds of large-scale companies followed this exact playbook. Given their success it should be a no-brainer to follow this strategy as well, correct?

Despite the aforementioned upsides, there are also clear downsides to this approach. Having more users usually translates into more customer support requests for your support staff to handle. In addition, it's tricky to get the math right so that the revenue you generate via your paid plans finances the resources and infrastructure necessary for your free trials.

There isn't a clear, definite answer whether you should offer a free trial or avoid it at all costs. At the end of the day it all comes down to the type of product you sell and the audience you're serving.

Regardless of the way you acquire new customers there's one metric you should monitor closely when onboarding new users: **User engagement**.

Low user engagement is at the very core of problems related to free trials.

Given that one only needs an E-Mail address and a password to sign up for your product it's an easy decision to take the leap and give it a try. Testing it is free after all and who doesn't like getting something for free?

The issue is that finding the time to sit down and "work with it in the next couple of days" usually never materializes. Life gets in the way, and sooner rather than later the trial expires and your users are locked out, never to be seen again.

Sure, this isn't always the case, but more often than not people just don't get around to testing your product enough to uncover the benefits they'd experience using it day-to-day to solve their problems.

One way to mitigate this problem is to offer a limited, "always free" plan. This way users have enough time to tinker around and assess whether your product is worth its money. But this only kicks the can down the road, as you'll likely face the same low user engagement as before. In addition, you open up a can of worms, because now you have to deal with all the other challenges a free plan entails:

- Even more customer support
- Figuring out what to offer in the free plan and what to put in the paid plans
- Ways to incentivize power users of your "always free" plan to upgrade
- Increasing spend on infrastructure and other resources
- ...

Another common strategy to solve the free trial dilemma is to "ask for the credit card upfront", meaning that users have to put in their credit card information when they sign up but won't be charged until their trial expires. While asking for credit card information is a good way to filter out users who might never convert into paying customers, it comes with the same problems we just discussed.

Offering an "always free" plan or asking for the credit card upfront still doesn't solve the issue of low **user engagement**.

Now that we know why free trials and free plans are a tough sell it's time to tackle the underlying issue: **Low user engagement**.

Basic human psychology tells us that people value items more if they've spent money to acquire them. Think about the last time you got something for free vs. the time you spent money on a similar item. Chances are that you used and valued the item you paid for more.

This insight can inform the way you design your trial to eventually convert more users into paying customers.

Rather than offering a free trial which expires in 30 days offer a paid version of that exact same trial.

Yes, you read that right. Turn your free trial into a **paid trial**.

Offering a paid trial comes with a couple of major advantages:

- Only users who are seriously considering your product will sign up for the paid trial
- It's more likely that your users will find the time to test your product because they paid for it
- Your users value your product more than your competitors' free plans / free trials
- Your users are more committed to getting the most value out of your product during the trial
- Another huge point of friction is already removed from the conversion process (asking users to put in their credit card information and subscribe to a paid plan)

In fact, I learned about the power a paid plan can have from a personal experience. Given my growing interest in SEM, SEO and marketing in general I did some research to figure out what tools the market has to offer in that space.

In particular, I was looking for a tool which would help me conduct Keyword Research, Backlink analysis and SEO monitoring. What I stumbled upon was the well-known tool Ahrefs.

What struck me was that their pricing structure was different from their competitors'. There was no "free trial". I had to pay 7 USD to get access to the tool for 7 days. After some hesitation I decided to give it a shot. In fact, I spent a significant amount of time upfront, learning and researching everything I could about their tooling to figure out if the paid trial would really be worth it.

Needless to say, I used Ahrefs every single day during that week. The felt "pressure" to get the most out of the 7 days, in combination with their excellent, educational E-Mail sequence, helped me explore all the value their tool has to offer. I'm absolutely certain that their conversion rates from paid trial to paying customer aren't too shabby.

Do you have a free plan, free trial or are thinking about introducing such a plan in the near future?

I'd challenge you to take a step back and think about offering a paid trial instead. It shouldn't be too expensive, but it should cost enough that your users are incentivized to explore your product in more depth. Combine your paid trial with a helpful, educational E-Mail onboarding sequence and check in with your users every now and then to understand their problems and help them succeed.

I'm certain that this will help you tackle any conversion-related problems you might be facing!

Converting users you've attracted via a free trial or free plan into paying customers is hard. There are a couple of issues to be aware of. It's tough to get the math right and use the profits from your paid plans to offset the costs introduced by offering a free plan / free trial. While it's highly likely that you'll get new users, you'll also experience an increase in customer support requests and an almost guaranteed low conversion rate.

One of the underlying core problems with all of this is a lack of user engagement. Paid products are valued more than free ones.

You can use this fact and offer a paid trial which incentivizes your users to take action and explore the value your product has to offer. A more engaged user base helps you tackle the conversion-related problems you might be facing.


So you're a technical founder looking for practical advice on how to market the product you just built?

You're thinking about hiring someone or outsourcing your marketing entirely because "you're not an expert"?

If you're going that route should you hire a freelancer or an agency? And how do you determine if they're doing a good job?

Sure, you could outsource this part of your business because it's time consuming, tedious and a bit of a "black art" (especially for technical founders).

But marketing doesn't have to be this fuzzy part of your business which is more often than not neglected due to confusion around "Funnels", "Growth Hacking" and "Channels".

Marketing is about connecting with your audience, and it's a big competitive advantage if you're the person in charge of those connections. With marketing you can keep an ear to the ground, gathering data on what does and doesn't resonate with your audience. You can then use this data to inform your product and marketing decisions and attract more potential customers via "free" channels such as SEO / Search, Social Media or Word-of-Mouth.

Getting the whole marketing machinery up and running can indeed be an intimidating endeavor. But it doesn't have to be! The key is to pick a marketing channel that will likely resonate with your audience and build from there.

I'd suggest that you start with Content Marketing in the form of **solution oriented blog posts**. "Why?" you might ask. The reason is simple yet powerful.

Your potential customers are already looking online for solutions to their problems. They use search engines to find answers to specific questions they're currently dealing with. If your software helps with those problems, wouldn't it be nice if users find your product and give it a try?

While this sounds compelling, it's easier said than done. The main issue is figuring out what kind of content resonates with your users and, more importantly, what content attracts users who are looking to buy a solution to their problem at hand. Just cranking out blog post after blog post in the hope of hitting the jackpot won't do the trick. Plus, it takes a whole lot of time and effort to write the content in the first place.

So how do you come up with content ideas that your future customers are interested in? I've got some great news for you: **You already have all the data that's necessary to write such content**. You just have to take a couple of minutes to dig it up!

Chances are that you have a lot of historical chat data in your customer support system, whether that's a fully-fledged CRM system like Intercom or your E-Mail inbox. You might've realized that for some topics you get the same questions over and over again. The usual reaction to this phenomenon is to create a dedicated FAQ entry which answers the question and point the next customer who asks about it to the FAQ. Another, complementary solution is the Knowledge Base, which usually goes into more detail as to what the problem and its potential solutions are.

Why don't you go a step further and turn your FAQ / Knowledge Base entries into dedicated blog posts? With very little extra effort you can use your support requests as a way to inform and kick-start your Content Marketing strategy.

What could such support request repurposing look like? Here's an example blog post structure you might want to use or adapt according to your needs to get started today.

**Headline**

The support question. E.g. "How do I <Problem> with <Your Product>?".

**Problem Description**

Write a paragraph or two to describe the problem your customer is experiencing.

**Feature Description**

Write a couple of paragraphs about your product's feature which helps in solving the previously stated problem. Take some time to go into more detail here and outline the value your product provides.

**How to solve the problem**

Provide an exact step-by-step guide which shows how a user will solve the problem with your product. Include screenshots if necessary.

Add a way to create a permalink which points to this headline. This makes it easier for you to share the exact location where you describe the fix in your blog post. This way users with the link will get straight to the solution rather than having to scroll through the whole article.

**Conclusion**

Wrap things up and plug the value your product provides a second time.

**CTA**

Provide a call to action with a "Next Step" the user might want to take. This could be a Newsletter signup box or a link to another feature your product provides (e.g. another support-request-informed blog post).

Are you still on the fence about whether it's worthwhile to go down this route? The following is a list with some of the benefits you'll experience when you implement this strategy:

- You leverage your support request queue to "generate content ideas" for you
- Once written you can easily share the blog post with customers who request support for the exact same problem
- Chances are that the user in question shares the article with co-workers / her network in the future (because it's a blog post rather than a FAQ entry)
- If you're interlinking all your blog posts, customers will educate themselves and get more value out of your product
- You're highlighting the features your product offers and show how such features help users solve their problems
- You'll automatically be indexed by search engines (SEO)
- People who will search for phrases such as "How do I ..." will discover your posts and immediately see the benefits they'll get when they use your product
- You'll save time because you write the content once rather than writing a FAQ entry, updating the Knowledge Base, adding a section to the User Manual, etc.

I'd like to challenge you to take 30 minutes today to look through your support queue. Just scroll through all the requests and try to identify patterns. What problems are most of your users struggling with? Is there an overarching theme?

Write down the main pain points and turn them into dedicated blog post headlines. Next up, use the structure above (or a variation of it) to write the outline of your blog posts.

Dedicate 30 minutes each day to work on such posts. You'll be surprised how effortless this gets with every single day you work through your list! Don't throw in the towel if you're making little progress the first few days. That's normal. Keep the momentum going and you'll have created a lot of valuable content you can share on your business blog in the next couple of weeks.

It's tough to come up with great blog post ideas your users are likely to enjoy, let alone to start the whole marketing machinery as a non-marketing focused founder. An easy way to deal with such issues is to look into your support system and identify patterns. Are there any questions which have been asked multiple times? Are there some features your customers have more questions about compared to others?

Rather than jumping into the creation of a FAQ entry and calling it a day, think about how you can turn the question at hand into a dedicated blog post. Talk about the feature your user had a question about. Describe how it works and in which way it serves your customers. Showcase how easy it is to use that feature to accomplish a certain task with your product (the actual answer to the support question).

Doing so will save you time as you'll write an article once and for all rather than juggling different mediums like FAQs, Knowledge Bases, Documentation, User Manuals, etc. and it kick-starts your marketing efforts in a bite-sized way. Your business blog will be indexed by search engines and future customers searching for solutions to their problems will discover your article and see how your product helps them solve their issues.

Earlier this year I wrote an in-depth blog post in which I explored what topics I'll be focusing on in 2020. In order to hold myself accountable I turned that post into a roadmap I followed closely. The overarching theme was Machine Learning and understanding all of the foundational algorithms by implementing them from scratch in pure Python.

It's been half a year since I started to work on that endeavor and I believe now is a good time to take stock of the status quo and plan the upcoming months and years.

Over the last 6 months I spent a good portion of my time diving deep into fundamental Machine Learning algorithms such as Gradient Descent, Naive Bayes and Logistic Regression. To get the most out of this learning experience I implemented all the algorithms from scratch and shared my learnings via code examples on GitHub and dedicated in-depth blog posts I published on this blog. If you're curious you can find the whole "Machine Learning Algorithms" blog post series here.

Overall I'm pretty happy with the outcome and can say that doing this was totally worth it. I wholeheartedly enjoyed the process of going from knowing next to nothing about the topic at hand to learning the underlying mathematics and implementing a fully working prototype solely based on reading through a bunch of books, Wikipedia articles and whitepapers.

Unfortunately the whole "Machine Learning from scratch" series isn't 100% done just yet. What's missing is a blog post about k-means clustering and Neural Networks. The code has already been written and can be found here and here. I hope that I'll find some time and energy to write and publish the remaining 2 blog posts sometime soon.

Another goal I followed was to learn more about secure computation and Homomorphic Encryption in particular. The main objective was to explore ways in which Homomorphic Encryption and Machine Learning can be used together. Using both in conjunction could resolve fundamental issues around privacy and would unlock new, important use cases where Machine Learning is currently not applicable due to regulatory restrictions. One important solution which comes to mind is using Deep Neural Networks in the healthcare sector where preserving the patients' privacy is of utmost importance.

In order to fully understand Homomorphic Encryption from a theoretical and practical point of view I took the same approach I did to master Machine Learning algorithms. I read dozens and dozens of whitepapers, watched countless conference talks and followed the community closely to implement the foundational algorithms from scratch in pure Python. It was a wild and oftentimes frustrating ride. Cryptography is hard but one of the most intellectually stimulating topics anyone interested in Mathematics and Computer Science should study!

The main outcomes of this exercise were implementations of some of the most important Homomorphic Encryption algorithms to date:

- On Lattices, Learning with Errors, Random Linear Codes, and Cryptography
- On Ideal Lattices and Learning with Errors Over Rings
- Faster Bootstrapping with Polynomial Error
- Fully Homomorphic Encryption without Modulus Switching from Classical GapSVP
- Somewhat Practical Fully Homomorphic Encryption

Don't be intimidated by the titles! I strongly believe that Homomorphic Encryption is a fascinating topic anyone can understand and master.

There are a couple of (IMHO unnecessary) roadblocks you might encounter when digging deeper into the subject. You won't believe how often research papers lack important information, which makes it hard or nearly impossible to reproduce their results. Let me know if you decide to learn more about Homomorphic Encryption and need some pointers or guidance along the way.

Learning and studying new topics and concepts in a theoretical setting can be enjoyable but putting the theory into practice is where the magic of understanding happens!

That's why I decided to scratch one of my own itches by implementing a Machine Learning powered recommender system. The main idea is to help users discover new, useful content on the internet. I called the project "Interlayer" given that it would act as a kind of connecting layer between all the different web properties on the internet.

Roughly speaking Interlayer crawls and analyzes websites while generating topics the website in question covers. Such topics are then used to find similar / complementary websites the current visitor might find useful. Under the hood it uses techniques such as Word Embeddings and K-nearest neighbors.

It took approximately 1 month to implement the first "alpha" version which was focused on online blogs. After hearing about the struggles ecommerce merchants face to get initial traffic to their stores I decided to change course and update Interlayer to work better with ecommerce stores. The result of this is the Interlayer Shopify App I launched a couple of weeks ago.

I'll spend more time to dive deeper into the project and explain what I learned from a technical as well as business point of view. As it turns out running a Machine Learning pipeline in production at scale isn't a straightforward undertaking.

Over the last 6 months I learned a ton! Starting with Machine Learning algorithms I followed-up with implementations of core Homomorphic Encryption algorithms to finally put everything I learned into practice and ship my first product called "Interlayer".

What I quickly figured out was that while I truly love the challenge of learning and implementing algorithms from scratch, it's hard for me to find the time and energy to write in-depth blog posts once I'm done with the coding part. What I truly enjoy is learning the ins and outs of new technologies and applying such learnings in practice by building real customer-facing products. Working on Interlayer was an eye-opening experience as merchants all over the world started to install the app and refer traffic to each other. All of this is powered by Machine Learning algorithms I started to study earlier this year.

Over the last couple of weeks I took some time to reflect more on that experience. Seeing a project being used "in production" sparked my interest in the Bootstrapping and Indie Hacking community.

Given what I've learned over the last 6 months I plan to make this whole blog a lot more personal. While I'll continue to publish long-form educational content and essays I'll switch the main focus to the "making" side of things.

Essentially I'll start to "document rather than create". A lot of content never saw the light of day due to the artificial "it has to be article-worthy" filter I used to determine what's worth sharing. I strongly believe that it's helpful to learn in public. In fact there's a lot I learned from others this way!

You can find working code examples (including this one) in my lab repository on GitHub.

It's almost certain that any one of us has played the game of "Twenty Questions" or a variation of it at least once in their lifetime. In case you're not familiar with the rules, here's how it works:

The game starts with one person (the "Answerer") thinking about a subject or object without revealing it. The others (the "Questioners") are encouraged to ask questions about such subject or object. Questions should be asked such that they can only be answered in a "Yes" or "No" fashion. The current Questioner continues to ask questions if her previous question was answered with a "Yes" and has to hand over the question asking to the next Questioner if the current question was answered with a "No". The Questioner who successfully guessed the subject or object the Answerer thought about has won and will be the next Answerer.

Let's walk through a quick example to see the game in action. In this case I'm the Answerer and you're a Questioner. Since you're the only Questioner you're allowed to continue asking questions even if one of the answers is "No":

You: "Is the subject or object you're thinking about a person?"

Me: "Yes"

You: "Is the person you're thinking about a non-fictional character?"

Me: "No"

You: "Did the person you're thinking about appear in comics?"

Me: "Yes"

You: "Has the person you're thinking about superpowers?"

Me: "Yes"

You: "Is the person one of the Avengers?"

You: "Yes"

You: "Did the person appear in a movie?"

Me: "Yes"

You: "Is the person you're thinking about Iron Man?"

Me: "Yes! Bingo!"

While this is certainly a contrived example it shows one of the core techniques Decision Trees, the topic of this post, use to construct themselves while "learning" from training data.

In the following sections we'll take a deep dive into Decision Tree-based Machine Learning models. We'll examine their properties, learn how we can ensure that the best tree is constructed given the data we're feeding it and how we can turn this knowledge into code to create our very own Decision Tree classifier which predicts if it's a good idea to practice our swings on the golf course given some data about the current weather conditions. Let's dive in!

A Decision Tree is exactly what its name implies: a tree-like structure which makes it possible to model decisions and their consequences. In fact you've already built and used a Decision Tree model while we played the game of "Twenty Questions" in the introduction above.

Let's visualize the Decision Tree you used during the game to better understand how it guided you to the prediction that the character I was thinking about was indeed Iron Man:

In the image above we can see that the Decision Tree consists of Nodes and Edges which point from parent Nodes to child Nodes. In our case every Node contained a question such as "Is it a person?" while the Edges contained the respective answers. The labels were binary, which means that every question can be answered in one of two ways: "Yes" or "No".

While we were playing the game you walked down the tree starting with the root Node at the top, following the Edges and Nodes until you eventually arrived at the Node which had no outgoing Edge: Your prediction.

Generally speaking there are 2 main Decision Tree models both of which differ in the prediction they produce:

The **Classification Tree** is a tree where the prediction is categorical. The tree we've built above is a classification tree as its output will always yield a result from a category such as "Superheroes" or more specifically "Iron Man".

The **Regression Tree** is a tree in which outputs can be continuous. An example of continuous values are the real numbers.

While this only defines the type of outputs a tree computes, there's nothing preventing us from mixing and matching categorical and continuous values in the tree creation phase. Therefore a tree can have categorical as well as continuous Edge labels and output types.

Given what we've learned so far about Decision Trees, it's easy to see that one of their huge advantages is the model's understandability. Looking at the tree's visual representation it's easy to navigate around, gauge how well it might perform and troubleshoot issues which might arise during real-world usage.

One area of improvement we could identify is the way in which our tree is split. For example, we could've skipped the question "Did the person appear in a movie?" since it's highly likely that there's a movie about most of the Avengers. On the other hand, nailing that it's Iron Man involved quite some luck. What about Hulk or Thor? It might've been better to insert yet another question to ensure that you're on the right path.

In which way should we construct our tree such that every decision we're making (deciding which Edge to follow) maximizes the information we're getting such that we can make a prediction as accurate as possible? Is it possible to capture this notion mathematically?

As it turns out there are a few different mathematical formulas at our disposal to quantify the information we'd gain when following certain paths in our Decision Tree.

One such metric is the so-called Gini Impurity. With Gini Impurity we measure how often we'd incorrectly label a randomly chosen element from our data set if it was randomly labeled according to the class distribution of such data set.

The formula is as follows:

\[ G = \sum_{i=1}^C p_i \times (1 - p_i) \]

where \(C\) are the classes and \(p_i\) is the probability of picking an element of class \(i\) at random.

Now if this definition makes your head spin you're not alone. Let's work through an example to understand what we're calculating and how Gini Impurity is used in practice.

Let's pretend that we've collected some meteorology data over the last couple of years and want to use it to determine if we should bring an umbrella with us given the current weather conditions we observe.

Among other things we've collected 14 data points about the **wind** and **temperature**:

Wind | Yes | No | Total |
---|---|---|---|
Weak | 4 | 3 | 7 |
Strong | 6 | 1 | 7 |
**Total** | | | 14 |

Temperature | Yes | No | Total |
---|---|---|---|
High | 3 | 2 | 5 |
Low | 7 | 2 | 9 |
**Total** | | | 14 |

As one can see we've categorized the wind into "weak" and "strong" and the temperature into "high" and "low". During a day where the wind was "strong" it rained \(6\) out of \(7\) times. At the same time it rained \(7\) out of \(9\) times when the temperature was "low".

Our goal is to determine which one of these observations provides us more certainty about whether it'll rain and therefore if we should take an umbrella with us. We can figure that out by applying the Gini Impurity formula from above.

Let's start with the Gini Impurity calculation for **Wind**. We're dealing with 2 separate cases here as there can be either "Weak" or "Strong" wind. For every case we have \(2\) different classes (\(C\)) we can use to label our observation: "Yes" and "No". This means that we need to do \(2\) Gini Impurity calculations (one for each case). Plugging in the numbers results in the following:

\[ Gini(\textrm{Wind=Weak}) = \frac{4}{7} \times (1 - \frac{4}{7}) + \frac{3}{7} \times (1 - \frac{3}{7}) = \frac{24}{49} \approx 0.49 \]

\[ Gini(\textrm{Wind=Strong}) = \frac{6}{7} \times (1 - \frac{6}{7}) + \frac{1}{7} \times (1 - \frac{1}{7}) = \frac{12}{49} \approx 0.24 \]

Since we have 2 results (one for each case) we now need to calculate the weighted sum to get the overall result for our **Wind** observation:

\[ Gini(\textrm{Wind}) = \frac{7}{14} \times 0.49 + \frac{7}{14} \times 0.24 \approx 0.37 \]

Next up we do the Gini Impurity calculation for **Temperature**. We're following the exact same steps as we did with our **Wind** calculation above. We're also dealing with 2 different cases ("High" and "Low") as well as \(2\) different classes \(C\) ("Yes" and "No"). Here are the calculations:

\[ Gini(\textrm{Temp=High}) = \frac{3}{5} \times (1 - \frac{3}{5}) + \frac{2}{5} \times (1 - \frac{2}{5}) = \frac{12}{25} \approx 0.48 \]

\[ Gini(\textrm{Temp=Low}) = \frac{7}{9} \times (1 - \frac{7}{9}) + \frac{2}{9} \times (1 - \frac{2}{9}) = \frac{28}{81} \approx 0.35 \]

\[ Gini(\textrm{Temp}) = \frac{5}{14} \times 0.48 + \frac{9}{14} \times 0.35 \approx 0.4 \]

Putting it all together we've calculated a Gini Impurity of \(\approx 0.37\) for our **Wind** observations and a Gini Impurity of \(\approx 0.4\) for our **Temperature** observations. Overall we're trying to work with the purest data possible so we'll decide to go with our data on **Wind** to determine whether we should bring an umbrella with us.
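As a sanity check, the two weighted impurities above can be recomputed in a few lines of Python using exact fractions. This is a standalone sketch based on the counts from the tables; the `gini_from_counts` helper exists only for this illustration and isn't part of the classifier we'll build below:

```python
from fractions import Fraction

def gini_from_counts(counts):
    """Gini impurity for a class distribution given as absolute counts."""
    total = sum(counts)
    return sum(Fraction(c, total) * (1 - Fraction(c, total)) for c in counts)

# Wind: Weak = 4 "Yes" / 3 "No", Strong = 6 "Yes" / 1 "No"
wind = Fraction(7, 14) * gini_from_counts([4, 3]) + Fraction(7, 14) * gini_from_counts([6, 1])
# Temperature: High = 3 "Yes" / 2 "No", Low = 7 "Yes" / 2 "No"
temp = Fraction(5, 14) * gini_from_counts([3, 2]) + Fraction(9, 14) * gini_from_counts([7, 2])

print(float(wind))  # ≈ 0.37
print(float(temp))  # ≈ 0.4
# The lower impurity wins, so we'd split on Wind
assert wind < temp
```

Working with `Fraction` instead of `float` keeps the intermediate results exact, which makes it easy to compare them against the hand calculations.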

In the last section we've learned about a way to quantify purity and impurity of data via a metric called Gini Impurity. In this chapter we'll apply what we've learned so far by putting all the pieces together and build our very own Decision Tree to help us decide if we should play golf given the current weather conditions we observe.

Based on the Gini Impurity metric we'll examine our weather data and split it into different subsets which will reveal as much information as possible as to whether we should go golfing or not. These subsets will be arranged as a tree structure in a way that we can easily trace our decision through it based on our current observations. Our main goal is to come to a conclusion by taking into account as little weather data as possible.

Let's start by looking at the `golf.csv` data set we'll be working with:

Outlook | Temp | Humidity | Windy | Play |
---|---|---|---|---|
Rainy | Hot | High | f | no |
Rainy | Hot | High | t | no |
Overcast | Hot | High | f | yes |
Sunny | Mild | High | f | yes |
Sunny | Cool | Normal | f | yes |
Sunny | Cool | Normal | t | no |
Overcast | Cool | Normal | t | yes |
Rainy | Mild | High | f | no |
Rainy | Cool | Normal | f | yes |
Sunny | Mild | Normal | f | yes |
Rainy | Mild | Normal | t | yes |
Overcast | Mild | High | t | yes |
Overcast | Hot | Normal | f | yes |
Sunny | Mild | High | t | no |

We can see that there are a couple of different features such as **Outlook**, **Temperature**, **Humidity** and **Windy** we can take into account.

The following command will download the `.csv` file to our local machine:

`wget -O "data/golf.csv" -nc -P data https://raw.githubusercontent.com/husnainfareed/Simple-Naive-Bayes-Weather-Prediction/c75b2fa747956ee9b5f9da7b2fc2865be04c618c/new_dataset.csv`

Next up we should parse the `.csv` file and store the result in a structured format. In our case we create a `NamedTuple` called `DataPoint` in which we'll store one row of information:

```
import csv
from collections import Counter, defaultdict
from copy import deepcopy
from pathlib import Path
from typing import Any, Dict, List, NamedTuple

data_dir: Path = Path('data')
golf_data_path: Path = data_dir / 'golf.csv'

class DataPoint(NamedTuple):
    outlook: str
    temp: str
    humidity: str
    windy: bool
    play: bool

data_points: List[DataPoint] = []
with open(golf_data_path) as csv_file:
    reader = csv.reader(csv_file, delimiter=',')
    next(reader, None)  # skip the header row
    for row in reader:
        outlook: str = row[0].lower()
        temp: str = row[1].lower()
        humidity: str = row[2].lower()
        windy: bool = row[3].lower() == 't'
        play: bool = row[4].lower() == 'yes'
        data_point: DataPoint = DataPoint(outlook, temp, humidity, windy, play)
        data_points.append(data_point)
```

*Notice that we've parsed strings such as `t` or `yes` into their respective Boolean equivalents.*

Our `data_points` list now contains a dedicated `DataPoint` container for every `.csv` file row:

```
data_points[:5]
# [DataPoint(outlook='rainy', temp='hot', humidity='high', windy=False, play=False),
# DataPoint(outlook='rainy', temp='hot', humidity='high', windy=True, play=False),
# DataPoint(outlook='overcast', temp='hot', humidity='high', windy=False, play=True),
# DataPoint(outlook='sunny', temp='mild', humidity='high', windy=False, play=True),
# DataPoint(outlook='sunny', temp='cool', humidity='normal', windy=False, play=True)]
```

Great! We now have the data in a structured format which makes it easier to work with. Let's exploit this structure by implementing some helper functions.

The first function takes an arbitrary data set and a list of `(feature, value)` tuples to filter the data set based on those tuple values:

```
def filter_by_feature(data_points: List[DataPoint], *args) -> List[DataPoint]:
    result: List[DataPoint] = deepcopy(data_points)
    for arg in args:
        feature: str = arg[0]
        value: Any = arg[1]
        result = [data_point for data_point in result if getattr(data_point, feature) == value]
    return result

assert len(filter_by_feature(data_points, ('outlook', 'sunny'))) == 5
assert len(filter_by_feature(data_points, ('outlook', 'sunny'), ('temp', 'mild'))) == 3
assert len(filter_by_feature(data_points, ('outlook', 'sunny'), ('temp', 'mild'), ('humidity', 'high'))) == 2
```

**Note**: If you're familiar with SQL you can think of it as a `SELECT * FROM data_set WHERE tuple_1[0] = tuple_1[1] AND tuple_2[0] = tuple_2[1] ...;` statement.

The second helper function extracts all the values a feature can assume:

```
def feature_values(data_points: List[DataPoint], feature: str) -> List[Any]:
    # `sorted` makes the return order deterministic (a plain `set` wouldn't be)
    return sorted(set(getattr(dp, feature) for dp in data_points))

assert feature_values(data_points, 'outlook') == ['overcast', 'rainy', 'sunny']
```

Those 2 functions will greatly help us later on when we recursively build our tree.

As you might've already guessed, we need to implement the formula to calculate the Gini Impurity metric which will help us determine how pure our data will be after we split it up into different subsets.

As a quick reminder here's the formula:

\[ G = \sum_{i=1}^C p_i \times (1 - p_i) \]

Which translates to the following code:

```
def gini(data: List[Any]) -> float:
    counter: Counter = Counter(data)
    classes: List[Any] = list(counter.keys())
    num_items: int = len(data)
    result: float = 0
    item: Any
    for item in classes:
        p_i: float = counter[item] / num_items
        result += p_i * (1 - p_i)
    return result

assert gini(['one', 'one']) == 0
assert gini(['one', 'two']) == 0.5
assert gini(['one', 'two', 'one', 'two']) == 0.5
assert 0.8 < gini(['one', 'two', 'three', 'four', 'five']) < 0.81
```

Notice the tests at the bottom. If our data set consists of only one class the impurity is \(0\). If our data set is split \(50/50\) our impurity metric is \(0.5\).

We can build a helper function on top of this abstract `gini` implementation which helps us calculate the weighted Gini Impurities for any given feature. This function does the same thing we did manually above when we explored our weather data set consisting of the features **Wind** and **Temperature**.

```
# Calculate the weighted sum of the Gini impurities for the `feature` in question
def gini_for_feature(data_points: List[DataPoint], feature: str, label: str = 'play') -> float:
    total: int = len(data_points)
    # Distinct values the `feature` in question can assume
    dist_values: List[Any] = feature_values(data_points, feature)
    # Calculate all the Gini impurities for every possible value a `feature` can assume
    ginis: Dict[str, float] = defaultdict(float)
    ratios: Dict[str, float] = defaultdict(float)
    for value in dist_values:
        filtered: List[DataPoint] = filter_by_feature(data_points, (feature, value))
        labels: List[Any] = [getattr(dp, label) for dp in filtered]
        ginis[value] = gini(labels)
        # We use the ratio when we compute the weighted sum later on
        ratios[value] = len(labels) / total
    # Calculate the weighted sum of the `feature` in question
    weighted_sum: float = sum(ratios[key] * value for key, value in ginis.items())
    return weighted_sum

assert 0.34 < gini_for_feature(data_points, 'outlook') < 0.35
assert 0.44 < gini_for_feature(data_points, 'temp') < 0.45
assert 0.36 < gini_for_feature(data_points, 'humidity') < 0.37
assert 0.42 < gini_for_feature(data_points, 'windy') < 0.43
```

Here is the manual calculation for **Humidity**:

\[ Gini(\textrm{Humidity=High}) = \frac{3}{7} \times (1 - \frac{3}{7}) + \frac{4}{7} \times (1 - \frac{4}{7}) = \frac{24}{49} \approx 0.49 \]

\[ Gini(\textrm{Humidity=Normal}) = \frac{6}{7} \times (1 - \frac{6}{7}) + \frac{1}{7} \times (1 - \frac{1}{7}) = \frac{12}{49} \approx 0.24 \]

\[ Gini(\textrm{Humidity}) = \frac{7}{14} \times 0.49 + \frac{7}{14} \times 0.24 \approx 0.365 \]

We represent our Tree data structure via a `Node` and an `Edge` class. Every `Node` has a value and can point to zero or more `Edge`s (out-`Edge`s). It's one of the tree's leaves if it doesn't have an out-`Edge`. An `Edge` also has a value and points to exactly one `Node`.

```
# A `Node` has a `value` and optional out-`Edge`s
class Node:
    def __init__(self, value):
        self._value = value
        self._edges = []

    def __repr__(self):
        if len(self._edges):
            return f'{self._value} --> {self._edges}'
        else:
            return f'{self._value}'

    @property
    def value(self):
        return self._value

    def add_edge(self, edge):
        self._edges.append(edge)

    def find_edge(self, value):
        return next(edge for edge in self._edges if edge.value == value)

# An `Edge` has a value and points to a `Node`
class Edge:
    def __init__(self, value):
        self._value = value
        self._node = None

    def __repr__(self):
        return f'{self._value} --> {self._node}'

    @property
    def value(self):
        return self._value

    @property
    def node(self):
        return self._node

    @node.setter
    def node(self, node):
        self._node = node
```
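To see how these two classes fit together, here's a tiny hand-assembled tree based on the umbrella example from earlier. Compact restatements of `Node` and `Edge` are included so the snippet runs standalone; the tree itself is purely illustrative:

```python
# Compact restatement of the `Node` / `Edge` classes from above,
# just enough for a standalone demo
class Node:
    def __init__(self, value):
        self._value = value
        self._edges = []

    @property
    def value(self):
        return self._value

    def add_edge(self, edge):
        self._edges.append(edge)

    def find_edge(self, value):
        return next(edge for edge in self._edges if edge.value == value)

class Edge:
    def __init__(self, value):
        self._value = value
        self.node = None  # the child `Node` this `Edge` points to

    @property
    def value(self):
        return self._value

# The root `Node` asks about the wind, the leaf `Node`s carry the decision
root = Node('windy')
for answer, decision in [(True, 'umbrella'), (False, 'no umbrella')]:
    edge = Edge(answer)
    edge.node = Node(decision)
    root.add_edge(edge)

# Walking down an `Edge` is like answering one question in the game
print(root.find_edge(True).node.value)  # umbrella
```

The `build_tree` function below produces exactly this kind of structure, just with many more levels.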

We finally have all the pieces in place to recursively build our Decision Tree. There are several different tree building algorithms out there such as ID3, C4.5 or CART. The Gini Impurity metric is a natural fit for the CART algorithm, so we'll implement that.

The following is the whole `build_tree` code. It might look a bit intimidating at first since it implements the CART algorithm via recursion. Take some time to thoroughly read through the code and its comments. Below we'll describe in prose how exactly the `build_tree` function works.

```
def build_tree(data_points: List[DataPoint], features: List[str], label: str = 'play') -> Node:
    # Ensure that the `features` list doesn't include the `label`
    if label in features:
        features.remove(label)
    # Compute the weighted Gini impurity for each `feature` given that we'd split the tree at the `feature` in question
    weighted_sums: Dict[str, float] = defaultdict(float)
    for feature in features:
        weighted_sums[feature] = gini_for_feature(data_points, feature)
    # If all the weighted Gini impurities are 0.0 we create a final `Node` (leaf) with the given `label`
    weighted_sum_vals: List[float] = list(weighted_sums.values())
    if float(0) in weighted_sum_vals and len(set(weighted_sum_vals)) == 1:
        return Node(getattr(data_points[0], label))
    # The `feature` with the minimal weighted Gini impurity is the one we should use for splitting
    min_feature = min(weighted_sums, key=weighted_sums.get)
    node: Node = Node(min_feature)
    # Remove the `feature` we've processed from the list of `features` which still need to be processed
    reduced_features: List[str] = deepcopy(features)
    reduced_features.remove(min_feature)
    # Next up we build the `Edge`s which are the values our `min_feature` can assume
    for value in feature_values(data_points, min_feature):
        # Create a new `Edge` which contains a potential `value` of our `min_feature`
        edge: Edge = Edge(value)
        # Add the `Edge` to our `Node`
        node.add_edge(edge)
        # Filter down the data points we'll use next since we've just processed the set which includes our `min_feature`
        reduced_data_points: List[DataPoint] = filter_by_feature(data_points, (min_feature, value))
        # This `Edge` points to the new `Node` (subtree) we'll create through recursion
        edge.node = build_tree(reduced_data_points, reduced_features)
    # Return the `Node` (our `min_feature`)
    return node
```

The first thing our `build_tree` function does is ensure that the list of features we have to examine doesn't include the `label` we try to predict later on. In our case the label is called `play` and indicates whether we went playing given the weather data. If we find the `play` label we simply remove it from the list.

Next up we compute the weighted Gini impurities for all the features in our `features` list. (*Note that this list will constantly shrink as we recursively build the tree. More on that in a minute.*)

If all our `features`' impurities are `0.0` we're done since we're now dealing with "pure" data. In that case we create a `Node` instance with the `label` as its value and return that `Node`. This is the case where we end the recursive call.

Otherwise (there's still some impurity in the data) we pick the `feature` from the `weighted_sums` dictionary with the smallest score, since the smaller the Gini Impurity, the purer our data. We create a new `Node` for this `min_feature` and add it to our tree. This feature can be removed from our "to-process" list of features. We call this new list (which doesn't include the feature) `reduced_features`.

Our `min_feature` `Node` isn't a leaf, so it has to have some `Edge`s. To build the `Edge`s we iterate through all the values our feature can assume and create `Edge` instances based on those values. Since the edges have to point somewhere we call `build_tree` again with our `reduced_features` list to determine the `Node` (the next `min_feature`) the `Edge`s should point to.

And that's pretty much all there is. The function will recurse until it reaches the tree's leaf nodes which don't have any outgoing edges and contain the `label` values.

Let's build our tree with our `golf.csv` `data_points`:

```
features: List[str] = list(DataPoint._fields)
tree: Node = build_tree(data_points, features)
tree
# outlook --> [overcast --> True, sunny --> windy --> [False --> True, True --> False], rainy --> humidity --> [normal --> True, high --> False]]
```

We can already see the textual representation but it might be a better idea to add a visual representation as well:

So should we go out when the **Outlook** is "rainy" and the **Humidity** is "normal"?

Seems like we've enjoyed playing golf in those situations in the past!
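To answer such questions programmatically we could walk the finished tree. Here's a minimal sketch of how that might look — the `Node`/`Edge` classes below are a pared-down re-declaration of the structures used above, and the `classify` helper is hypothetical, not part of the original code:

```
from typing import Dict, List, Optional, Union

class Edge:
    def __init__(self, value: str) -> None:
        self.value = value
        self.node: Optional['Node'] = None

class Node:
    def __init__(self, value: Union[str, bool]) -> None:
        self.value = value
        self.edges: List[Edge] = []

    def add_edge(self, edge: Edge) -> None:
        self.edges.append(edge)

def classify(node: Node, data_point: Dict[str, str]) -> Union[str, bool]:
    # Walk the tree: at each inner `Node` follow the `Edge` matching the
    # data point's value for that feature until we reach a leaf
    while node.edges:
        feature_value = data_point[node.value]
        node = next(edge.node for edge in node.edges if edge.value == feature_value)
    return node.value

# Hand-built mini tree: outlook --> [overcast --> True, sunny --> False]
root: Node = Node('outlook')
for value, label in [('overcast', True), ('sunny', False)]:
    edge: Edge = Edge(value)
    edge.node = Node(label)
    root.add_edge(edge)

assert classify(root, {'outlook': 'overcast'}) is True
assert classify(root, {'outlook': 'sunny'}) is False
```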

During your journey in the lands of Machine Learning you might come across the terms "Random Forest", "Bagging" and "Boosting".

The problem with Decision Trees is that they tend to overfit easily to the data you train them on. A way to avoid overfitting while also getting better results in general is to build multiple Decision Trees (each with a subset of the training data) and then use them in concert to do predictions.

If you're working with regression trees you'd compute the average of all the trees outputs. If your trees are of categorical nature you'd perform a majority vote to decide which prediction to return.
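To make the two aggregation strategies concrete, here's a small sketch (the function names are made up for illustration):

```
from collections import Counter
from typing import List

def ensemble_predict_regression(tree_outputs: List[float]) -> float:
    # Regression trees: average the individual tree outputs
    return sum(tree_outputs) / len(tree_outputs)

def ensemble_predict_classification(tree_outputs: List[str]) -> str:
    # Classification trees: majority vote over the individual tree outputs
    return Counter(tree_outputs).most_common(1)[0][0]

assert ensemble_predict_regression([1.0, 2.0, 3.0]) == 2.0
assert ensemble_predict_classification(['play', 'play', "don't play"]) == 'play'
```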

Combining multiple Decision Trees into one large ensemble is called a "Random Forest". Related to Random Forests are the techniques "Bagging" ("Bootstrap aggregation") and "Boosting". But that's for another post.

Every one of us is confronted with different decision-making processes every single day. Naturally we examine the different scenarios in order to find the best decision we can take given the data we have at hand.

Decision Trees are a class of Machine Learning algorithms which follow the same idea. In order to build an efficient Decision Tree we need to decide the best way to split our data set into different subsets such that every new subset includes as much pure data as possible. Gini Impurity is one metric which makes it possible to mathematically determine the impurity for any data split we attempt.

Using Gini Impurity we implemented the CART algorithm and built our very own Decision Tree to determine if we should play golf based on the current weather conditions.

Given their visual nature, Decision Trees are a class of Machine Learning algorithms which is easy to inspect, debug and understand. On the other hand they also tend to overfit to the data they're trained on. Combining multiple, different Decision Trees into a so-called "Random Forest" mitigates this problem, turning single Decision Trees into a powerful ensemble.

The following is a list of resources I've used while working on this article. Other interesting resources are linked within the article itself:

You can find working code examples (including this one) in my lab repository on GitHub.

Sometimes it's necessary to split existing data into several classes in order to predict new, unseen data. This problem is called classification and one of the algorithms which can be used to learn those classes from data is called Logistic Regression.

In this article we'll take a deep dive into the Logistic Regression model to learn how it differs from other regression models such as Linear- or Multiple Linear Regression, how to think about it from an intuitive perspective and how we can translate our learnings into code while implementing it from scratch.

If you've read the post about Linear- and Multiple Linear Regression you might remember that the main objective of our algorithm was to find a best fitting line or hyperplane respectively.

To recap real quick, a line can be represented via the slope-intercept form as follows:

\[ y = mx + b \]

Here, \(m\) represents the slope and \(b\) the y-intercept.

In Linear Regression we've used the existing data to find a line in slope-intercept form (an \(m\) and \(b\) combination) which "best fitted through" such data.

Extending the slope-intercept form slightly to support multiple \(x\) values and multiple slopes (we'll use \(\beta_n\) instead of \(m_n\)) yields the following:

\[ y = \beta_1x_1 + ... + \beta_nx_n + b \]

This "scaled-up" slope-intercept formula was used in the Multiple Linear Regression model to find the \(\beta\) and \(b\) values for the hyperplane which "best-fitted" the data. Once found we were able to use it for predictions by plugging in \(x\) values to get respective \(y\) values.

Linear Regression models always map a set of \(x\) values to a resulting \(y\) value on a continuous scale. This means that the \(y\) value can e.g. be \(0\), \(42\) or \(5{,}023{,}212\). How would we use such a Regression model if our \(y\) value is categorical such as a binary value which is either \(0\) or \(1\)? Is there a way to define a threshold so that a value such as \(42\) is assigned to the category \(1\) while a small value such as \(0.002\) gets assigned to the category \(0\)?

That's where Logistic Regression comes into play. With Logistic Regression we can map any resulting \(y\) value, no matter its magnitude to a value between \(0\) and \(1\).

Let's take a closer look into the modifications we need to make to turn a Linear Regression model into a Logistic Regression model.

At the very heart of Logistic Regression is the so-called Sigmoid function. Sigmoid functions are a class of functions which follow an S-shape when plotted.

The most prominent Sigmoid function is the so-called Logistic function which was developed by Pierre François Verhulst to model population growth. It's mathematically described via this formula:

\[ f(x) = \frac{1}{1+e^{-x}} \]

Don't be intimidated by the math! Right now all you need to know is that this function takes any \(x\) value and maps it to a \(y\) value which ranges from \(0\) to \(1\).

Plotting the function for a range of \(x\) values proves this claim and results in the aforementioned S-shaped curve:

Note that the function gets closer and closer to the \(y\) value \(0\) or \(1\) as the \(x\) values get smaller or larger respectively. Also note that the \(x\) value \(0\) results in the \(y\) value \(0.5\).

This is exactly what we need. With this function we're able to "squish" any number, no matter its magnitude, into a value ranging from \(0\) to \(1\). This makes the function outcome predictable, which is useful when we later on define threshold values to associate function outputs with classes.

Let's turn the function into code:

```
def sigmoid(x: float) -> float:
    return 1 / (1 + exp(-x))

assert sigmoid(0) == 0.5
```

**Note**: Although there are many different Sigmoid functions to choose from, a lot of people use the name "Sigmoid function" when talking about the Logistic function. We'll adhere to this convention and use the term "Sigmoid function" as a synonym for Logistic function.
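One practical caveat worth mentioning (an addition, not covered by the original implementation): `exp(-x)` overflows for large negative inputs. A sketch of a numerically stable variant:

```
from math import exp

def stable_sigmoid(x: float) -> float:
    # For negative x rewrite 1 / (1 + e^-x) as e^x / (1 + e^x)
    # so we only ever exponentiate non-positive numbers
    if x >= 0:
        return 1 / (1 + exp(-x))
    return exp(x) / (1 + exp(x))

assert stable_sigmoid(0) == 0.5
assert stable_sigmoid(-1000) == 0.0  # the naive version would raise an OverflowError here
```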

Now that we've learned about the "mapping" capabilities of the Sigmoid function we should be able to "wrap" a Linear Regression model such as Multiple Linear Regression inside of it to turn the regression's raw output into a value ranging from \(0\) to \(1\).

Let's translate this idea into Math. Recall that our Multiple Linear Regression model looks like this:

\[ y = \beta_1x_1 + ... + \beta_nx_n + b \]

"Wrapping" this in the Sigmoid function (we use \(\sigma\) to represent the Sigmoid function) results in the following:

\[ y = \sigma(\beta_1x_1 + ... + \beta_nx_n + b) \]

Easy enough! Let's turn that into code.

The first thing we need to do is to implement the underlying Multiple Linear Regression model. Looking at the Math it seems to be possible to use the dot-product to calculate the \(\beta\) and \(x\) part to which we then add the single \(b\) value.

To make everything easier to calculate and implement we'll use a small trick. Multiplying a value by the identity \(1\) yields the value, so we prepend \(1\) to the \(x\) values and \(b\) to the \(\beta\) values. This way we can solely use the dot-product calculation without the necessity to add \(b\) separately later on. Here's the mathematical formulation of that trick:

\[ \vec{x} = \begin{pmatrix} 1 \\ x_1 \\ ... \\ x_n \end{pmatrix} \vec{\beta} = \begin{pmatrix} b \\ \beta_1 \\ ... \\ \beta_n \end{pmatrix} \]

\[ y = \vec{x} \cdot \vec{\beta} = \sum_{i=1}^n x_i \beta_i = x_1 \times \beta_1 + ... + x_n \times \beta_n \]

Once we've calculated the dot-product we need to pass it into the Sigmoid function such that its result is translated ("squished") into a value between \(0\) and \(1\).

Here's the implementation for the `dot` function which calculates the dot-product:

```
def dot(a: List[float], b: List[float]) -> float:
    assert len(a) == len(b)
    return sum([a_i * b_i for a_i, b_i in zip(a, b)])

assert dot([1, 2, 3, 4], [5, 6, 7, 8]) == 70
```

And here's the `squish` function which takes as parameters the \(x\) and \(\beta\) values (remember that we've prepended a \(1\) to the \(x\) values and the \(b\) to the \(\beta\) values), uses the `dot` function to calculate the dot-product of \(x\) and \(\beta\) and then passes this result into the Sigmoid function to map it to a value between \(0\) and \(1\):

```
def squish(beta: List[float], x: List[float]) -> float:
    assert len(beta) == len(x)
    # Calculate the dot product
    dot_result: float = dot(beta, x)
    # Use sigmoid to get a result between 0 and 1
    return sigmoid(dot_result)

assert squish([1, 2, 3, 4], [5, 6, 7, 8]) == 1.0
```

We've talked quite a lot about how the Sigmoid function is our solution to make the function outcome predictable as all values are mapped to a \(0\) - \(1\) range. But what does a value in that range represent? Let's take a look at an example.

The following is a data set which describes how long students have studied for an exam and whether they've passed the exam given the hours they've studied.

Hours studied | Exam Passed |
---|---|
0.5 | 0 |
1.0 | 0 |
1.5 | 0 |
2.0 | 0 |
2.5 | 1 |
3.0 | 0 |
3.5 | 1 |
4.0 | 1 |
4.5 | 0 |
5.0 | 1 |
5.5 | 1 |
6.0 | 1 |

Taking a glance at the data it seems that the more hours the students studied, the more likely they were to pass the exam. Intuitively that makes sense.

Let's plot the data to ensure that our intuition is correct:
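The plot itself isn't reproduced here, but a minimal sketch of how it could be generated with matplotlib (using the values from the table above) might look like this:

```
import matplotlib.pyplot as plt
from typing import List

hours: List[float] = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 5.5, 6.0]
passed: List[int] = [0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1]

plt.scatter(hours, passed)
plt.xlabel('Hours studied')
plt.ylabel('Exam passed');
```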

Looking at the plotted data we can immediately see that the values seem to "stick" to either the bottom or the top of the graph. Given that, it seems infeasible to use a Linear Regression model to find a line which best describes the data. How would this line be fitted through the data if the values we'd expect it to produce are either \(0\) or \(1\)?

Let's try a thought experiment. What would happen if we somehow found coefficients \(\beta\) for the Linear Regression model which "best" describe the data and passed the result it computes through the Sigmoid function? Here's the graph from above with the Sigmoid function added to it:

Looking at the plot above we can see that the Sigmoid function ensures that the result from the "underlying" Linear Regression model is mapped onto a scale between \(0\) and \(1\), which in turn makes it possible to e.g. define a threshold at \(0.5\): a value greater than \(0.5\) might be a predictor for a student passing the exam while a value less than \(0.5\) might mean that she'll fail the exam.

Note that the wording in the last sentence isn't a coincidence. The value the Sigmoid function produces can be interpreted as a probability where \(0\) means \(0\%\) probability and \(1\) means \(100\%\) probability.
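Such a thresholding step could be sketched as follows (the \(0.5\) cutoff is a common default, not a requirement, and the helper name is made up for illustration):

```
def to_class(probability: float, threshold: float = 0.5) -> int:
    # Map a probability (between 0 and 1) to a binary class label
    return 1 if probability >= threshold else 0

assert to_class(0.8) == 1
assert to_class(0.2) == 0
```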

As it turns out we can translate our findings from the previous section into a function called the Probability Density Function (PDF for short).

In particular we can define a conditional probability which states that given some \(\beta\) and \(x_i\), each corresponding \(y_i\) should equal \(1\) with probability \(\sigma(\beta x_i)\) and \(0\) with probability \(1-\sigma(\beta x_i)\):

\[ P(y_i \mid \beta x_i) = \sigma(\beta x_i)^{y_i} \times (1-\sigma(\beta x_i))^{1-y_i} \]

Looking at the formula above it might be a mystery how we deduced it from our verbal description from above. Here's something I want you to try: Please apply the formula by setting \(y_i\) to \(0\) and after that to \(1\) and see what happens. What you'll notice is that depending on what value you set \(y_i\) to, only one part of the formula stays the same while the other is canceled out.

Here's what we'll end up with if we set \(y_i\) to \(0\) and \(1\):

\[ 1-\sigma(\beta x_i) \quad \textrm{if } y_i = 0 \]

\[ \sigma(\beta x_i) \quad \textrm{if } y_i = 1 \]

And that's exactly the desired behavior we described above.
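We can double-check this cancellation numerically with a tiny sketch (using an arbitrary value to stand in for \(\sigma(\beta x_i)\)):

```
def pdf(sigma_bx: float, y: int) -> float:
    # Bernoulli PDF: sigma^y * (1 - sigma)^(1 - y)
    return sigma_bx ** y * (1 - sigma_bx) ** (1 - y)

sigma_bx: float = 0.8
assert pdf(sigma_bx, 1) == sigma_bx      # y = 1 --> sigma(beta x)
assert pdf(sigma_bx, 0) == 1 - sigma_bx  # y = 0 --> 1 - sigma(beta x)
```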

With Logistic Regression our main objective is to find the model's \(\beta\) parameters which maximize the likelihood that for a pair of \(x\) values the \(y\) value our model calculates is as close to the actual \(y\) value as possible.

In order to find the optimal \(\beta\) parameters we need to somehow calculate how "wrong" our model's predictions are with the current \(\beta\) setup.

In the previous section we talked about the Probability Density Function (PDF) which seems to capture exactly that. We just need to tweak this function slightly so that it's easier for us to do calculations with it later on.

The main tweak we'll apply is to "wrap" our individual PDF calculations for \(y_i = 0\) and \(y_i = 1\) in the \(\log\) function. The Logarithm has the nice property that it's strictly increasing, which makes it easier to do calculations on its data later on. Furthermore our PDF's main property is still preserved, since any set of \(\beta\) values that maximizes the likelihood of predicting the correct \(y\) also maximizes the \(\log\) likelihood.

Here are the PDF's two major parts "wrapped" in the \(\log\) function:

\[ \log(1-\sigma(\beta x_i)) \quad \textrm{if } y_i = 0 \]

\[ \log(\sigma(\beta x_i)) \quad \textrm{if } y_i = 1 \]

There's only one minor issue we need to resolve. Overall we're attempting to minimize the amount of wrong predictions our model produces, but looking at the graph of the Logarithm once again we see that the function is strictly increasing. We "mirror" the Logarithm at the \(x\) axis ("turning" it upside down) by multiplying it with \(-1\). Hence our two functions now look like this:

\[ -\log(1-\sigma(\beta x_i)) \quad \textrm{if } y_i = 0 \]

\[ -\log(\sigma(\beta x_i)) \quad \textrm{if } y_i = 1 \]

Now the last thing we want to do here is to put both functions back together into one equation like we did with our composite PDF function above. Doing this results in the following:

\[ \log L (\beta \mid x_i y_i) = -(y_i \log(\sigma(\beta x_i)) + (1-y_i) \log(1-\sigma(\beta x_i))) \]

This function is called Logarithmic Loss (or Log Loss / Log Likelihood) and it's what we'll use later on to determine how "off" our model is with its prediction. Again, you might want to set \(y_i\) to \(0\) or \(1\) to see that one part of the equation is canceled out.

Let's turn that into code:

```
def neg_log_likelihood(y: float, y_pred: float) -> float:
    return -((y * log(y_pred)) + ((1 - y) * log(1 - y_pred)))

assert 2.30 < neg_log_likelihood(1, 0.1) < 2.31
assert 2.30 < neg_log_likelihood(0, 0.9) < 2.31
assert 0.10 < neg_log_likelihood(1, 0.9) < 0.11
assert 0.10 < neg_log_likelihood(0, 0.1) < 0.11
```

Next up let's use our codified version of Log Loss to create plots for \(y = 0\) and \(y = 1\):

```
xs_nll: List[float] = [x / 10000 for x in range(1, 10000)]
fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.plot(xs_nll, [neg_log_likelihood(1, x) for x in xs_nll])
ax1.set_title('y = 1')
ax2.plot(xs_nll, [neg_log_likelihood(0, x) for x in xs_nll])
ax2.set_title('y = 0');
```

As we can see, the more wrong the prediction, the higher the calculated error. The Log Loss function therefore "punishes" wrongdoing more than it rewards rightdoing.

To calculate the overall error of our whole data set we sum up each individual Log Loss calculation and average it:

\[ \textrm{Cost} = -\frac{1}{n} \sum_{i=1}^n (y_i \log(\sigma(\beta x_i)) + (1-y_i) \log(1-\sigma(\beta x_i))) \]

Here's the codified version:

```
def error(ys: List[float], ys_pred: List[float]) -> float:
    assert len(ys) == len(ys_pred)
    num_items: int = len(ys)
    sum_nll: float = sum([neg_log_likelihood(y, y_pred) for y, y_pred in zip(ys, ys_pred)])
    return (1 / num_items) * sum_nll

assert 2.30 < error([1], [0.1]) < 2.31
assert 2.30 < error([0], [0.9]) < 2.31
assert 0.10 < error([1], [0.9]) < 0.11
assert 0.10 < error([0], [0.1]) < 0.11
```

Now the last missing piece we need to implement is the optimization step. What we want at the end of the day is a Logistic Regression model with the \(\beta\) parameters which in combination with \(x\) values produce the most accurate prediction for any \(y\) value.

Given that we can now calculate the error our current model (with its \(\beta\) parameters) produces, we can iteratively change the \(\beta\) parameters until we reach a point where our model can't improve (can't reduce the error value) anymore.

The algorithm which implements exactly that is called Gradient Descent.

**Note**: I already wrote a dedicated post explaining the algorithm in great detail so I won't go into too much detail here and would encourage you to read the article if you're unfamiliar with Gradient Descent.

The gist of it is that given our Log Loss function we can find a set of \(\beta\) parameters for which the error the Log Loss function calculates is the smallest. We've therefore found a local (or global) minimum if the error cannot be reduced anymore. To figure out "where" the minimum is located we'll use the error function's gradient, which is a vector that guides us to that position.

Since we repeat such calculations over and over again we're iteratively descending down the error function's surface, hence the name Gradient Descent.

Calculating the gradient is done by computing the partial derivatives of the Log Loss function with respect to the individual \(\beta_j\) values in our \(\beta\) vector.

Here's the mathematical representation:

\[ \frac{\partial \textrm{Cost}}{\partial \beta_j} = \frac{1}{n} \sum_{i=1}^n (\sigma(\beta x_i) - y_i) x_{ij} \]

And here's the implementation in code:

```
grad: List[float] = [0 for _ in range(len(beta))]
for x, y in zip(xs, ys):
    err: float = squish(beta, x) - y
    for i, x_i in enumerate(x):
        grad[i] += (err * x_i)
# Note: `len(x)` is the number of entries per data point rather than the
# number of data points; the constant difference is absorbed by the learning rate
grad = [1 / len(x) * g_i for g_i in grad]
```

*Again, if you're unfamiliar with gradients and partial derivatives you might want to read my article about the Gradient Descent algorithm.*

We finally got all the pieces in place! Let's grab some data and use the Logistic Regression model to classify it!

The data set we'll be using is similar to what we've already seen in our example above where we tried to predict whether a student will pass an exam based on the hours she studied.

You can download the data set here. Inspecting the data we can see that there are 2 floating point values and an integer value which represents the label and (presumably) indicates whether the student in question has passed or failed the exam.

Our task is to use this data set to train a Logistic Regression model which will help us assign the label \(0\) or \(1\) to (i.e. classify) new, unseen data points.

The first thing we need to do is to download the `.txt` file:

`wget -nc -P data https://raw.githubusercontent.com/animesh-agarwal/Machine-Learning/5b2d8a71984016ae021094641a3815d6ea9ac527/LogisticRegression/data/marks.txt`

Next up we need to parse the file and extract the \(x\) and \(y\) values:

```
marks_data_path: Path = data_dir / 'marks.txt'
xs: List[List[float]] = []
ys: List[float] = []
with open(marks_data_path) as file:
    for line in file:
        data_point: List[str] = line.strip().split(',')
        x1: float = float(data_point[0])
        x2: float = float(data_point[1])
        y: int = int(data_point[2])
        xs.append([x1, x2])
        ys.append(y)
```

It's always a good idea to plot the data to see if there are any outliers or other surprises we have to deal with:

```
x1s: List[float] = [x[0] for x in xs]
x2s: List[float] = [x[1] for x in xs]
plt.scatter(x1s, x2s, c=ys)
plt.axis([min(x1s), max(x1s), min(x2s), max(x2s)]);
```

Looks like we're (almost) good here. There's only one aspect we need to inspect further. If you look at the axes you'll see that the values range from \(\approx 35\) to \(\approx 95\). Working with such large values can result in problems when working with \(\log\) and \(\exp\) later on. We'll resolve this problem in a minute.

You might recall from the beginning of the post that we applied a trick where we prepend every \(x\) vector with a \(1\) and prepend the \(b\) to the \(\beta\) vector in order to make it possible to use the dot-product (which is a simpler calculation).

The following code prepends a \(1\) to every \(x\) vector so that we can leverage the computation trick later on:

```
for x in xs:
    x.insert(0, 1)

xs[:5]
# [[1, 34.62365962451697, 78.0246928153624],
#  [1, 30.28671076822607, 43.89499752400101],
#  [1, 35.84740876993872, 72.90219802708364],
#  [1, 60.18259938620976, 86.30855209546826],
#  [1, 79.0327360507101, 75.3443764369103]]
```

Next up we'll tackle the scaling problem we touched upon in the paragraph above. What we'll do to resolve it is standardize (often done via the "z-score") the whole data set:

```
xs = z_score(xs)
xs[:5]
# [[1, -1.5942162646576388, 0.6351413941754435],
# [1, -1.8171014180340745, -1.2014885239142388],
# [1, -1.531325157335502, 0.3594832875590465],
# [1, -0.28068723821760927, 1.0809228071415948],
# [1, 0.6880619310375534, 0.4909048515228952]]
```

*If you're curious what the z_score function does check out the whole implementation in my lab repository on GitHub.*
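For completeness, here's a minimal sketch of what such a `z_score` function might look like. This is an assumption about its behavior (standardizing every column except the prepended \(1\)s), not the repository's actual implementation:

```
from statistics import mean, stdev
from typing import List

def z_score(xs: List[List[float]]) -> List[List[float]]:
    # Standardize every column except the first (the prepended 1s)
    # via (value - mean) / standard deviation
    result: List[List[float]] = [list(x) for x in xs]
    for col in range(1, len(xs[0])):
        column: List[float] = [x[col] for x in xs]
        col_mean: float = mean(column)
        col_std: float = stdev(column)
        for row in result:
            row[col] = (row[col] - col_mean) / col_std
    return result

standardized = z_score([[1, 10.0], [1, 20.0], [1, 30.0]])
assert standardized[0][0] == 1    # the 1s column stays untouched
assert standardized[1][1] == 0.0  # the mean value maps to 0
```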

And now we're finally in a position where we can train our Logistic Regression model via Gradient Descent. We'll start with some random guesses for our model's \(\beta\) parameters and iteratively optimize the model by computing its current overall "wrongdoing" with the error function and then using the error function's gradient to update the \(\beta\) values to yield a smaller error in the next iteration. Eventually we'll converge to a local minimum which results in the \(\beta\) values computing the smallest error.

```
beta: List[float] = [random() / 10 for _ in range(3)]
print(f'Starting with "beta": {beta}')

epochs: int = 5000
learning_rate: float = 0.01

for epoch in range(epochs):
    # Calculate the "predictions" (squishified dot product of `beta` and `x`) based on our current `beta` vector
    ys_pred: List[float] = [squish(beta, x) for x in xs]
    # Calculate and print the error
    if epoch % 1000 == 1:
        loss: float = error(ys, ys_pred)
        print(f'Epoch {epoch} --> loss: {loss}')
    # Calculate the gradient
    grad: List[float] = [0 for _ in range(len(beta))]
    for x, y in zip(xs, ys):
        err: float = squish(beta, x) - y
        for i, x_i in enumerate(x):
            grad[i] += (err * x_i)
    grad = [1 / len(x) * g_i for g_i in grad]
    # Take a small step in the direction of greatest decrease
    beta = [b + (gb * -learning_rate) for b, gb in zip(beta, grad)]

print(f'Best estimate for "beta": {beta}')
# Starting with "beta": [0.06879018957747185, 0.060750489548129484, 0.08122488791609535]
# Epoch 1 --> loss: 0.6091560801945126
# Epoch 1001 --> loss: 0.2037432848849053
# Epoch 2001 --> loss: 0.20350230881468107
# Epoch 3001 --> loss: 0.20349779972872906
# Epoch 4001 --> loss: 0.20349770371660023
# Best estimate for "beta": [1.7184091311489376, 4.01281584290694, 3.7438191715393083]
```

Now that our model is trained we can calculate some statistics to see how accurate it is. For these calculations we'll set the threshold to \(0.5\) which means that every value above \(0.5\) our model produces is considered a \(1\) and every value which is less than \(0.5\) is considered to be a \(0\).

```
total: int = len(ys)
thresh: float = 0.5
true_positives: int = 0
true_negatives: int = 0
false_positives: int = 0
false_negatives: int = 0

for i, x in enumerate(xs):
    y: int = ys[i]
    pred: float = squish(beta, x)
    y_pred: int = 1
    if pred < thresh:
        y_pred = 0
    if y == 1 and y_pred == 1:
        true_positives += 1
    elif y == 0 and y_pred == 0:
        true_negatives += 1
    elif y == 1 and y_pred == 0:
        false_negatives += 1
    elif y == 0 and y_pred == 1:
        false_positives += 1

print(f'True Positives: {true_positives}')
print(f'True Negatives: {true_negatives}')
print(f'False Positives: {false_positives}')
print(f'False Negatives: {false_negatives}')
print(f'Accuracy: {(true_positives + true_negatives) / total}')
print(f'Error rate: {(false_positives + false_negatives) / total}')
# True Positives: 55
# True Negatives: 34
# False Positives: 6
# False Negatives: 5
# Accuracy: 0.89
# Error rate: 0.11
```

Finally let's plot the decision boundary so that we can see where our model "draws the line":

```
x1s: List[float] = [x[1] for x in xs]
x2s: List[float] = [x[2] for x in xs]
plt.scatter(x1s, x2s, c=ys)
plt.axis([min(x1s), max(x1s), min(x2s), max(x2s)]);
m: float = -(beta[1] / beta[2])
b: float = -(beta[0] / beta[2])
x2s: List[float] = [m * x[1] + b for x in xs]
plt.plot(x1s, x2s, '--');
```
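The slope \(m\) and intercept \(b\) used above follow from the fact that the decision boundary sits where the Sigmoid's argument is \(0\) (i.e. where \(\sigma\) outputs exactly \(0.5\)):

\[ \beta_0 + \beta_1 x_1 + \beta_2 x_2 = 0 \implies x_2 = -\frac{\beta_1}{\beta_2} x_1 - \frac{\beta_0}{\beta_2} \]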

Great! Looks like our model correctly learned how to classify new, unseen data as it considers everything "above" and "below" the decision boundary as a separate class, which seems to be in alignment with the data points from our data set!

In some situations it's a requirement to classify new, unseen data. Logistic Regression is a powerful Machine Learning model which makes it possible to learn such classifications based on existing data.

In this blog post we took a deep dive into Logistic Regression and all its moving parts. We've learned about Sigmoid functions and how they can be used in conjunction with a Linear Regression model to project values of arbitrary magnitude onto a scale between \(0\) and \(1\), which is exactly what we need when we want to do binary classification.

Once we understood the mathematics and implemented the formulas in code we took an example data set and applied our Logistic Regression model to a binary classification problem. After training our model we were able to draw the decision boundary it learned to visually validate that it correctly learned how to separate the data into two (binary) subsets.

The following is a list of resources I've used to write this article. Other useful resources are linked within the article itself:

You can find working code examples (including this one) in my lab repository on GitHub.

Linear Regression is one of the basic Machine Learning algorithms every student eventually encounters when starting to dive deeper into the field. If you heard someone trying to "fit a line through the data" that person most likely worked with a Linear Regression model.

In which scenarios should we use Linear Regression, and if we do, how do we find such a best-fitting line? And what if our data is multidimensional? Let's answer all those questions by implementing Linear and Multiple Linear Regression from scratch!

**Note**: Throughout this post we'll be using the "Auto Insurance in Sweden" data set which was compiled by the "Swedish Committee on Analysis of Risk Premium in Motor Insurance".

Let's take a step back for a minute and imagine that we're working at an insurance company which sells among other things car insurances.

Over the years the insurance company has collected a wide variety of statistics for all the insurances it sells including data for its car insurances. Some of this data is statistics about the number of filed claims and the payments which were issued for them. The following table shows an excerpt from such data:

Number of claims | Payments issued |
---|---|
108 | 392.5 |
19 | 46.2 |
13 | 15.7 |
124 | 422.2 |
... | ... |

One day we get a call from our colleague who works at the claims settlement center. She has to plan the division's budget for the upcoming year, which is usually derived from best guesses. Since we've collected all the data throughout the years she wonders if there's a more reliable, mathematical way we could use to calculate the budget estimate.

Nodding along we confirm that we'll dive deeper into this topic and hang up the telephone in sheer excitement! Finally we're able to put some real Machine Learning into practice. But how should we tackle the problem?

It's hard to gain any insights into the data we're dealing with by manually examining the raw numbers. In order to get a better understanding of the data it's always a good idea to visualize it first.

Given that we're dealing with 2 dimensions (the **number of claims** and the **issued payments**) one of the potential diagrams we can create is a so-called scatter plot which uses (Cartesian) coordinates to display the values of a given data set. In our case we treat the **number of claims** as our \(x\)-axis and the **issued payments** as our \(y\)-axis and plot the data we recorded at the intersections of those axes, which results in the following diagram:

Solely by looking at the diagram we can already identify a trend in the data. It seems to be the case that the more claims were filed, the more payments were issued. Intuitively that makes sense.

After inspecting the plotted data in more detail we observe that we can certainly make some rough predictions for missing data points. Given the trend in the data it seems reasonable that we might issue ~80 payments when ~40 number of claims were filed. Accordingly ~125 claims might be filed when we issue ~410 payments.

Is there some way to turn this insight into a model we can use to make arbitrary predictions? It seems like the relationship in the data is linear. Is there a way to capture this notion mathematically?

You might remember the concept of a Linear function from school where you've used the **slope-intercept** form (one of many forms) to mathematically describe a line:

\[ y = mx + b \]

The slope-intercept form has 2 parameters which determine how the line "behaves" in the Cartesian plane (The typical 2D plane with \(x\) and \(y\) coordinates):

- \(m\) is the line's slope, which measures how steeply the line is slanted
- \(b\) is the \(y\)-intercept and determines at which point the line intercepts the \(y\)-axis

Using this formula we can plug in any arbitrary \(x\) value which is then multiplied by \(m\) and added to \(b\) to get back the corresponding \(y\) value.

Let's look at a couple of examples to get a better feeling as to how the slope-intercept form works.

Let's solely focus on \(m\) for now and set \(b\) to \(0\). How does \(m\) influence the way our line will be plotted if we set it to \(1\)?

Here's the mathematical representation of such a line followed by the corresponding plot:

\[ y = 1x + 0 \]

As you can see for every step of size \(1\) in the \(x\) direction we "go" a step of size \(1\) in the \(y\) direction.

Now that we understand what the parameter \(m\) is responsible for, let's take a look at the \(y\)-intercept \(b\) and set it to \(1\):

\[ y = 1x + 1 \]

The steepness of the line is the same as the previous line since we haven't modified \(m\). However if you take a look at \(x = 0\) you'll notice that the line crosses the \(y\)-axis at \(1\). That's exactly what the parameter \(b\) is responsible for. Through \(b\) we can control where our line should start on the \(y\)-axis when \(x = 0\).
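Both examples can be checked with a couple of lines of Python (a throwaway sketch; `line` is just an illustrative helper):

```
def line(m: float, b: float, x: float) -> float:
    # Slope-intercept form: y = mx + b
    return m * x + b

# y = 1x + 0: every step of 1 in x moves us 1 in y
print([line(1, 0, x) for x in range(4)])  # [0, 1, 2, 3]
# y = 1x + 1: same slope, but the line crosses the y-axis at 1
print([line(1, 1, x) for x in range(4)])  # [1, 2, 3, 4]
```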

Let's translate the slope-intercept form into a function we call `predict` (we'll use this function for our predictions later on):

```
def predict(m: float, b: float, x: float) -> float:
    return m * x + b

assert predict(m=0, b=0, x=3) == 0
```

Let's put the theory into practice and try to guesstimate a line which best describes our data.

The first thing we notice is that the individual data points follow an upwards trend, so \(m\) will certainly be positive. Furthermore the data points close to \(x = 0\) seem to have low \(y\) values as well. Taking those observations into account we guess the following description for our line:

\[ y = 4x + 2 \]

Not too bad for our first guess! Just by looking at the plotted line we might ask ourselves if there's a better fitting line. Can we quantify how well our line fits the data?

Given that the learning part in Machine Learning is usually an iterative process which starts with an initial guess and slowly computes new "best guesses" to finally converge to the optimal solution, it's a necessity to be able to track the learning process.

A good way to supervise the learning process is to mathematically capture the "wrongdoing" our algorithm inevitably produces while trying to determine the function which best describes the data.

In the case of Linear Regression it seems to make sense to compare the \(y\)-values the line produces to the actual \(y\)-values from the data set. We could for example go through each individual \((x, y)\) pair in our data set and subtract its \(y\) value from the \(y\) value our line "predicts" for the corresponding \(x\). Summing up these differences results in a number we can use to compare different lines against each other. The higher the number, the "less correct" the line.

That's great but there's one minor catch. Imagine that the line which is fitted through the data predicts large positive \(y\) values near the origin \((0, 0)\) where it should predict large negative numbers. At the same time it predicts large negative numbers near the end of the \(x\)-axis although those values should be positive. If we calculate the errors according to our description above where we suggested to sum up the differences between the \(y\) values we'd end up in a situation where values might cancel each other out. In the worst case the calculated error is \(0\) which indicates that we've found the best fitting line while in reality we didn't!

A simple trick to mitigate this problem is to square each single error value before they're summed up. This way any negative value will be turned into a positive one, making it impossible to run into scenarios where error calculations cancel each other out.
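A two-line example makes the cancellation problem obvious (the error values here are made up for illustration):

```
# Two hypothetical prediction errors that are equally wrong in opposite directions
errors = [10.0, -10.0]

print(sum(errors))                  # 0.0 -- looks like a perfect fit, but isn't
print(sum(e ** 2 for e in errors))  # 200.0 -- squaring exposes the real error
```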

The error function we've just described is called Residual sum of squares (RSS) or Sum of squared errors (SSE) and is one of many error functions we can use to quantify the algorithm's "wrongdoing". The following is the mathematical formula for SSE:

\[ SSE = \sum_{i=1}^n (y_i - f(x_i))^2 \]

Turning that into code results in the following:

```
def sum_squared_error(ys: List[float], ys_pred: List[float]) -> float:
    assert len(ys) == len(ys_pred)
    return sum((y - y_pred) ** 2 for y, y_pred in zip(ys, ys_pred))

assert sum_squared_error([1, 2, 3], [4, 5, 6]) == 27
```

With those two code snippets (the `predict` and `sum_squared_error` functions) we're now able to describe a line, predict \(y\) values and measure how "off" our predictions are. The last missing piece we'll need to put in place is a way to update our line description such that the next `sum_squared_error` calculation returns an error value which is less than our current one. If there's a way to constantly reduce the error we're making by slowly updating our line description, we'll eventually end up with a line which best fits our data!

With Linear Regression there are a couple of different algorithms we can use to find the best fitting line. One prominent choice is the Ordinary least squares (OLS) method. Since OLS is a common choice we'll do something different. We'll use the Gradient Descent algorithm which can be used for various different optimization problems and is at the heart of modern Machine Learning algorithms.

**Note**: If you haven't already I'd suggest that you take a couple of minutes to read the article "Gradient Descent from scratch" in which I explain the whole algorithm in great detail.

*I won't provide too many explanations regarding Gradient Descent here since I already covered the topic in the aforementioned post. Don't get too intimidated by the Math below. It's ok if you just skim through this section to get a high-level overview.*

In a nutshell Gradient Descent makes it possible for us to iteratively "walk down" the error function's surface to eventually find a local minimum where the error is smallest, which is exactly what we're looking for.

In order to figure out in which direction we should walk to descend to the local minimum we need to compute the so-called gradient. The gradient is a vector consisting of the partial derivatives of our error function; it points in the direction of greatest increase at any given point \(p\) on our error function's surface.

To find the partial derivatives of our SSE function we should expand it so that we can see all the variables we need to take into account:

\[ SSE = \sum_{i=1}^n (y_i - f(x_i))^2 = \sum_{i=1}^n (y_i - (mx_i + b))^2 \]

Looking at the expanded formula we can see that \(m\) and \(b\) are the variables we need to derive with respect to:

\[ \frac{\partial SSE}{\partial m} = \sum_{i=1}^n 2x_i ((mx_i + b) - y_i) \]

\[ \frac{\partial SSE}{\partial b} = \sum_{i=1}^n 2 ((mx_i + b) - y_i) \]

Which results in the following code:

```
# The partial derivative of SSE with respect to `m`
grad_m: float = sum([2 * (predict(m, b, x) - y) * x for x, y in zip(xs, ys)])
# The partial derivative of SSE with respect to `b`
grad_b: float = sum([2 * (predict(m, b, x) - y) for x, y in zip(xs, ys)])
```

**Tip**: You can use WolframAlpha to validate your partial derivatives.

Given these partial derivatives we can now calculate the gradient at any point \((m, b)\), which is a vector pointing in the direction of greatest increase. Multiplying the vector by \(-1\) lets it point in the opposite direction, the direction of greatest decrease (remember that we want to find a local minimum). If we add a small fraction of this vector to our \(m\) and \(b\) values respectively, we should end up closer to a local minimum.

The following code captures what we've just described:

```
# Take a small step in the direction of greatest decrease
# The `learning_rate` controls the step size when "walking" down the gradient
learning_rate: float = 0.0001
m = m + (grad_m * -1 * learning_rate)
b = b + (grad_b * -1 * learning_rate)
```

Repeating this process multiple times should help us find the \(m\) and \(b\) values for our line for which any given prediction \(y\) calculated by that line results in the smallest error possible.

Let's put all the pieces together and implement the Gradient Descent algorithm to find the best fitting line:

```
# Find the best fitting line through the data points via Gradient Descent

# Our initial guess
m: float = 0
b: float = 200

print(f'Starting with "m": {m}')
print(f'Starting with "b": {b}')

epochs: int = 10000
learning_rate: float = 0.00001

for epoch in range(epochs):
    # Calculate predictions for `y` values given the current `m` and `b`
    ys_pred: List[float] = [predict(m, b, x) for x in xs]
    # Calculate and print the error every 1000 epochs
    if epoch % 1000 == 0:
        loss: float = sum_squared_error(ys, ys_pred)
        print(f'Epoch {epoch} --> loss: {loss}')
    # Calculate the gradient
    # Taking the (partial) derivative of SSE with respect to `m` results in `2 * x * ((m * x + b) - y)`
    grad_m: float = sum([2 * (predict(m, b, x) - y) * x for x, y in zip(xs, ys)])
    # Taking the (partial) derivative of SSE with respect to `b` results in `2 * ((m * x + b) - y)`
    grad_b: float = sum([2 * (predict(m, b, x) - y) for x, y in zip(xs, ys)])
    # Take a small step in the direction of greatest decrease
    m = m + (grad_m * -learning_rate)
    b = b + (grad_b * -learning_rate)

print(f'Best estimate for "m": {m}')
print(f'Best estimate for "b": {b}')
```

Running this algorithm results in a best estimate for the \(m\) and \(b\) values. Let's compare our initial guess of \(m\) and \(b\) (the guess we started with at the top of the code snippet) with the values our Gradient Descent implementation produced:

\[ m = 0, b = 200 \]

\[ m \approx 3.40, b \approx 20.30 \]

Awesome! Seems like we've found our linear function which best describes our data! Let's call our co-worker and share the good news. From now on she can use the following formula to find a prediction for the **issued payments** (\(y\)) based on any **number of claims** (\(x\)):

\[ y = 3.40x + 20.30 \]
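As a quick sanity check, here's how she could use the formula in code (reusing the `predict` function from earlier; the claim count of \(100\) is just an arbitrary example value):

```
def predict(m: float, b: float, x: float) -> float:
    return m * x + b

# Fitted parameters from the Gradient Descent run above
m, b = 3.40, 20.30

# Estimated issued payments for an example of 100 filed claims
print(predict(m, b, 100))  # ≈ 360.3
```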

It's great to be able to fit a line through data points in \(2\) dimensions. But how do we deal with scenarios where our data has more than \(2\) dimensions?

Most data sets capture many different measurements which are called "features". It would be great if we could take the most important features into account when working with our algorithms. Given that every feature adds another dimension we need to ensure that the model we're building can deal with such high-dimensional data.

Our Linear Regression model was only able to take a single \(x\) value and predict a corresponding \(y\) value. What if we have multiple \(x\) values? Is there a way to use a regression model to predict a \(y\) value based on multiple \(x\) values?

As it turns out Linear Regression is a special case of a more general regression model called Multiple Linear Regression or Multiple Regression. Multiple Regression can deal with an arbitrary number of \(x\) values expressed as a vector to predict a single \(y\) value.

The great news is that we can easily adapt what we've learned so far to deal with high-dimensional data. Let's take a quick look at the changes we need to make.

The slope-intercept form we've used so far can easily be updated to work with multiple \(x\) values. Here's the linear equation we've used so far:

\[ y = mx + b \]

Having multiple \(x\) values means that we'll also have multiple \(m\) values (one for each \(x\)). However we'll still only deal with \(1\) intercept:

\[ y = m_1x_1 + ... + m_nx_n + b \]

Calculating a prediction for \(y\) is as simple as solving the above equation for any given vector of \(x\) values, vector of \(m\) values and any given \(b\) value.

There's one little trick we can apply given that we're now mostly dealing with vectors rather than scalar numbers. To make the computation more efficient we can use the dot-product which carries out almost the exact same calculation we described above. There's just one problem. The dot-product can only be used in vector calculations, however \(b\) isn't a vector. As it turns out we can simply prepend the \(b\) value to the \(m\) vector and prepend a \(1\) to the \(x\) vector. Doing this little trick makes it possible to use the dot-product calculation while also taking the \(b\) value into account. Here's what we'd end up with when doing just that:

\[ \vec{x} = \begin{pmatrix} 1 \\ x_1 \\ ... \\ x_n \end{pmatrix} \vec{m} = \begin{pmatrix} b \\ m_1 \\ ... \\ m_n \end{pmatrix} \]

\[ y = \vec{x} \cdot \vec{m} = \sum_{i=0}^n x_i m_i = 1 \times b + x_1 \times m_1 + ... + x_n \times m_n \]

Another nice side-effect of doing this is that the partial derivative calculations for the error function will also be easier since our usage of the dot-product reduced the number of variables we have to take into account to just 2 vectors \(x\) and \(m\).
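To make the dot-product trick concrete, here's a minimal sketch in plain Python (the parameter values are made up for illustration):

```
from typing import List

def dot(a: List[float], b: List[float]) -> float:
    # Element-wise multiplication followed by a sum
    assert len(a) == len(b)
    return sum(x * y for x, y in zip(a, b))

def predict_multi(ms: List[float], xs: List[float]) -> float:
    # `ms` starts with `b`, so we prepend the matching 1 to `xs`
    return dot(ms, [1.0] + xs)

# Hypothetical model: b = 2, m1 = 3, m2 = 0.5
print(predict_multi([2.0, 3.0, 0.5], [4.0, 10.0]))  # 2 + 3*4 + 0.5*10 = 19.0
```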

And that's pretty much all there is to change. The rest of the code follows exactly the same way. While we fitted a line when working with Linear Regression we're now fitting a so-called hyperplane with Multiple Regression.

To get a better intuition for the notion of a hyperplane imagine that we have measurements we can scatter plot in a \(3\) dimensional space. Every measurement will be a single dot in that space, resulting in a cloud of dots. Our Multiple Regression algorithm will now try to find a plane (think of it as a wooden plank) which best fits through that dot cloud.

Linear Regression is one of the very first algorithms every student encounters when learning about Machine Learning models and algorithms.

The basic idea of Linear Regression is to find an equation for a line which best describes the data points in the given data set. Such a line is often described via the slope-intercept form \(y = mx + b\). "Fitting the line" means finding the \(m\) and \(b\) values such that the resulting \(y\) value is as accurate as possible given an arbitrary \(x\) value. Using an error function (which describes how "off" our current line equation is) in combination with an optimization algorithm such as Gradient Descent makes it possible to iteratively find the "best fitting" line.

As it turns out Linear Regression is a specialized form of Multiple Linear Regression which makes it possible to deal with multidimensional data by expressing the \(x\) and \(m\) values as vectors. While this requires the usage of techniques such as the dot-product from the realm of Linear Algebra the basic principles still apply. In Multiple Linear Regression we're just trying to find a "best fitting" hyperplane rather than a line.

The following is a list with resources I've used while working on this blog post. Other useful resources are linked within the article itself.

You can find working code examples (including this one) in my lab repository on GitHub.

How do you implement a spam filter that adapts to the ever evolving techniques of modern spammers? Spam detection is a classification problem as you have to decide whether a message is spam or not. However there's always some uncertainty involved in making that call.

Looking into the probability toolbox to tame such uncertainty one might stumble upon the infamous Bayes Theorem which is exactly what we'll use in this post to build a basic spam filter. Using Bayes Theorem we'll implement an algorithm called Naive Bayes which was used in the early internet days as a spam filtering technique. Let's dive in!

__Note__: This post assumes a certain level of knowledge about probability theory. If you're new to this subject or just a little bit rusty you might want to check out the "Basics of Probability" playlist by JB Statistics which helps you in building the intuitions behind probability theory.

If you've studied statistics and probability theory in the past you surely came across Bayes Theorem, which lets you calculate conditional probabilities. Conditional probabilities are probabilities with dependencies which can be used to mathematically express questions such as: "What's the probability that it'll be sunny throughout the day given that we saw clouds in the morning?". Translating this sentence into math results in the following:

\[ P(Sun \mid Clouds) \]

Having some decent meteorology data available should be sufficient to calculate the exact probability via Bayes Theorem, which is formulated as follows:

\[ P(A \mid B) = \frac{P(B \mid A) \, P(A)}{P(B)} \]

Let's take a quick refresher and apply this formula by working through an example step by step. Assuming that we're living in a place where it's often cloudy, the chance to see clouds in the morning is roughly \(60\%\). At the same time most days aren't that sunny. Only about 1 in 4 days can be considered really sunny. Another data point we have is that \(50\%\) of the sunny days started out with clouds in the morning.

Formulating these findings as probabilities results in the following:

\[ P(Sun) \approx 25\% \approx 0.25 \]

\[ P(Cloud) = 60\% = 0.6 \]

\[ P(Cloud \mid Sun) = 50\% = 0.5 \]

Calculating the likelihood of getting a sunny day even when we saw clouds in the morning is as simple as plugging-in the probabilities into Bayes Theorem:

\[ P(Sun \mid Cloud) = \frac{P(Cloud \mid Sun) \, P(Sun)}{P(Cloud)} \\ = \frac{0.5 \times 0.25}{0.6} \approx 0.21 = 21\% \]

So there's a \(21\%\) chance that we get a sunny day even if the morning started with clouds.
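The same calculation is easy to verify in code (a direct transcription of the formula):

```
def bayes(p_b_given_a: float, p_a: float, p_b: float) -> float:
    # Bayes Theorem: P(A | B) = P(B | A) * P(A) / P(B)
    return p_b_given_a * p_a / p_b

p_sun_given_cloud = bayes(p_b_given_a=0.5, p_a=0.25, p_b=0.6)
print(round(p_sun_given_cloud, 2))  # 0.21
```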

Now that we got a quick refresher on Bayes Theorem it's time to ask ourselves how we could apply this formula to a classification problem such as spam filtering.

Generally speaking we'd like to calculate the probability that any given message is spam. If the probability exceeds a certain threshold (e.g. \(50\%\)) we classify it as spam. Otherwise we believe that the message is legit and call it "ham".

*The rather unusual word "ham" which describes "non spam" messages is used throughout literature so we'll use it here too.*

Breaking the problem further down into its fundamentals we can say that a message is comprised of different words, so why don't we start there and calculate the probability of a message being classified as spam given that it contains a certain "trigger" word.

Let's assume that we're working with the word "prize" and want to figure out what the probability is that a given message which contains the word "prize" is spam. Formulating this problem via Bayes Theorem results in:

\[ P(Spam \mid Prize) = \frac{P(Prize \mid Spam) \, P(Spam)}{P(Prize)} \]

To make things clear, here's the formula translated into prose:

"The probability of a message being spam given that we see the word "Prize" in it is the probability of finding the word "Prize" in our already seen spam messages times the probability of the message in question being spam divided by the probability of finding the word "Prize" in any of our already seen messages."

There are two simplifications we can apply here. The first thing we might want to do is expand the denominator like so:

\[ P(Prize) = P(Prize \mid Spam) P(Spam) + P(Prize \mid Ham) P(Ham) \]

We simply rewrote the condensed version of \(P(Prize)\) into a more verbose statement listing all its underlying probabilities.

Our expanded Bayes Theorem application now looks like the following:

\[ P(Spam \mid Prize) = \frac{P(Prize \mid Spam) P(Spam)}{P(Prize \mid Spam) P(Spam) + P(Prize \mid Ham) P(Ham)} \]

The neat thing about this expansion is that we can now apply our second trick. Let's think about the probability of a message being spam vs. a message being ham for a minute. What is the likelihood that any random message is spam? What is the likelihood that it's ham? The fact is that we don't know for sure so we might just assume that there's a \(50/50\) chance that any random, incoming message is spam or ham.

Applying this finding by plugging-in the corresponding probabilities results in:

\[ P(Spam \mid Prize) = \frac{P(Prize \mid Spam) 0.5}{P(Prize \mid Spam) 0.5 + P(Prize \mid Ham) 0.5} \]

That's great because now we can apply basic arithmetic and cancel out these probabilities, resulting in our final formulation to determine whether a message containing the word "Prize" should be considered "spammy".

\[ P(Spam \mid Prize) = \frac{P(Prize \mid Spam)}{P(Prize \mid Spam) + P(Prize \mid Ham)} \]
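Here's a small sketch of this final formula, using made-up word counts (the numbers are purely illustrative):

```
def p_spam_given_word(word_in_spam: int, total_spam: int,
                      word_in_ham: int, total_ham: int) -> float:
    # Assumes P(Spam) = P(Ham) = 0.5, so the priors cancel out
    p_word_spam = word_in_spam / total_spam  # P(Word | Spam)
    p_word_ham = word_in_ham / total_ham     # P(Word | Ham)
    return p_word_spam / (p_word_spam + p_word_ham)

# Hypothetical counts: "prize" shows up in 42 of 100 spam and 1 of 100 ham messages
print(round(p_spam_given_word(42, 100, 1, 100), 2))  # 0.98
```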

Now that we know how to calculate the probability for a single word we should try to scale it up so that we can compute the probabilities for whole sentences.

A naive approach would be to insert the whole message as the conditional probability part. However that won't work as it would force our algorithm to essentially look for duplicate messages.

Thinking logically we can observe that a sentence is simply a succession of single words. What we could do is reuse what we did above for a single word and apply it to every word we come across in our message. Computing the overall probability for the whole message is then simply the multiplication of all those individual probabilities.

Let's pick one example probability calculation to describe this process mathematically:

\[ P(Message \mid Spam) = P(w_1, ..., w_n \mid Spam) \\ = P(w_1 \mid Spam) \times ... \times P(w_n \mid Spam) \]

And that's where the Naive Bayes classifier acts naively (hence the name Naive Bayes). Because we're multiplying the probabilities we're making the bold assumption that the probabilities (presence or absence of words) are independent of each other. Everyone who has ever received an E-Mail from a Nigerian prince will agree that there's likely a dependency between words in spam messages.

Generally speaking we're now in a position where we can turn our findings into code and implement the Naive Bayes algorithm from scratch. However there are 2 more tweaks we want to implement in order to mitigate some minor hiccups we'd otherwise run into.

The first challenge we're faced with is a computational problem. As we stated above we're about to multiply a lot of probabilities (one probability for every word) which tend to be very small. Modern computer architectures are only capable of dealing with a certain amount of precision when representing very large or very small numbers internally. If we're constantly multiplying small numbers we can run into the so-called "arithmetic underflow" problem which will eventually turn our number into a \(0\).

A well known trick to deal with this problem is to use a combination of the exponential function and the logarithm. You might remember that:

\[ \log(ab) = \log(a) + \log(b) \]

and

\[ \exp(\log(x)) = x \]

We can use this to "wrap" our probability calculations into \(\log\) to later on "unwrap" them again via \(\exp\).
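A quick demonstration of the underflow problem and the \(\log\) workaround (the probability values are made up):

```
from math import log

# 1000 hypothetical word probabilities of 0.01 each
probs = [0.01] * 1000

# Multiplying them directly underflows to exactly 0.0
product = 1.0
for p in probs:
    product *= p
print(product)  # 0.0

# Summing the logarithms keeps the magnitude representable
log_sum = sum(log(p) for p in probs)
print(log_sum)  # ≈ -4605.17
```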

The second issue is rooted in the mathematics we're attempting to carry out. Let's understand the problem by looking into a case we'll very likely face.

What would \(P(W \mid S)\) be if we've only ever found \(W\) in no-spam messages?

Well if we've never seen the word \(W\) in a spam message, then \(P(W \mid S) = 0\). And exactly this will turn into a problem for us given that we have such a term in our numerator and denominator. Everything multiplied by \(0\) is \(0\).

In order to deal with that issue we can introduce a factor \(k\) which is a parameter we can tweak (most of the time it's set to \(1\)).

With this factor \(k\) we can express \(P(W \mid S)\) as follows:

\[ P(W \mid S) = \frac{k + \textrm{spam containing W}}{(2 \times k) + \textrm{total spam}} \]

Let's walk through a quick example to see how \(k\) helps us. Our assumption is that we have analyzed \(100\) spam examples but found the word \(W\) exactly \(0\) times. Calculating the probability that we might find the word \(W\) in spam (\(S\)) without \(k\) would result in the following:

\[ P(W \mid S) = \frac{0}{100} = 0\]

Having this probability in a chain of probability multiplications would immediately turn the result into \(0\) because \(x \times 0 = 0\).

Let's introduce \(k = 1\) to solve this issue:

\[ P(W \mid S) = \frac{k + 0}{(2 \times k) + 100} = \frac{1}{2 + 100} \approx 0.01 \]

Having \(0.01\) in our multiplications won't hurt the accuracy while ensuring that we don't end up with a \(0\) result just because one of the probability calculations resolved to \(0\).
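The smoothing formula translates directly into code (a sketch; `word_count` and `total_spam` stand for the counts described above):

```
def p_word_given_spam(word_count: int, total_spam: int, k: float = 1.0) -> float:
    # Adding the pseudocount `k` guarantees a non-zero probability
    return (k + word_count) / (2 * k + total_spam)

print(p_word_given_spam(0, 100, k=0))  # 0.0 -- the problematic case
print(round(p_word_given_spam(0, 100, k=1), 4))  # 0.0098
```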

With that out of the way we're finally able to turn our findings into code!

Before we jump right into the implementation we should draw a mental picture as to how the Naive Bayes classifier works from a high level perspective.

You might've noticed that we talked a lot about probability calculations of finding words in a set of spam or ham messages. This indicates that we need to "train" our Naive Bayes classifier with training data so that we can do these computations on the fly. Otherwise if our Naive Bayes implementation hasn't seen any messages or words before, how do we know how to calculate the probabilities via Bayes Theorem?

The next thing we should think about are the raw messages we're feeding into our Naive Bayes algorithm, both for "training" it and doing predictions later on. First and foremost we're only focusing on messages in one language to keep the implementation simple. In our case all our messages will be in English. Other than that we need to parse the raw messages and identify and extract individual words given that we're using such words in our probability calculations. This process is called tokenization, so we need to implement a function which can handle this process.

**Note**: The tokenizer we implement here is really simple. Usually you'd want to use robust NLP libraries such as NLTK for this task.

Let's start with the `tokenize` function which takes a raw message as its input and returns a set of valid words as an output.

The implementation is really straightforward. The first thing we do is find all words via a regular expression. We then iterate over all the words we found, lowercase them and add them to our list. Once done we turn our list into a set to ensure that duplicate entries are filtered out.

```
def tokenize(text: str) -> Set[str]:
    words: List[str] = []
    for word in re.findall(r'[A-Za-z0-9\']+', text):
        words.append(word.lower())
    return set(words)

assert tokenize('Is this a text? If so, Tokenize this text!...') == {'is', 'this', 'a', 'text', 'if', 'so', 'tokenize'}
```

Now we're getting to the core of our implementation, the Naive Bayes classifier.

As we've stated earlier, our Naive Bayes classifier needs to be trained on existing messages in order to be able to do predictions on unseen messages later on. It therefore needs to remember what it saw and store this state internally. Because of that we'll be implementing Naive Bayes as a class.

Let's walk through the different methods the class needs step by step. We'll take a look at the code for the whole class at the end of this section.

First of all we need to implement a constructor which sets our internal state to default values and takes the \(k\) parameter as an optional argument so that we can control how we want to deal with scenarios where we haven't seen a word in the spam or ham messages. If we glance over our probability calculations from above we can see that we need to count quantities such as how many spam messages we saw in total during training. This is the state we initialize in our constructor:

```
def __init__(self, k=1) -> None:
    self._k: int = k
    self._num_spam_messages: int = 0
    self._num_ham_messages: int = 0
    self._num_word_in_spam: Dict[str, int] = defaultdict(int)
    self._num_word_in_ham: Dict[str, int] = defaultdict(int)
    self._spam_words: Set[str] = set()
    self._ham_words: Set[str] = set()
    self._words: Set[str] = set()
```

Next up we need to implement a `train` function which we'll use to train our Naive Bayes classifier. "Training" in our case means that we get a list of labeled messages (messages for which we know whether they're spam or ham), iterate over each individual message, tokenize it and then iterate over every word in every message to update our internal state with the necessary information such as how many spam messages we saw in the list of messages:

```
def train(self, messages: List[Message]) -> None:
    msg: Message
    token: str
    for msg in messages:
        tokens: Set[str] = tokenize(msg.text)
        self._words.update(tokens)
        if msg.is_spam:
            self._num_spam_messages += 1
            self._spam_words.update(tokens)
            for token in tokens:
                self._num_word_in_spam[token] += 1
        else:
            self._num_ham_messages += 1
            self._ham_words.update(tokens)
            for token in tokens:
                self._num_word_in_ham[token] += 1
```

Now before we jump into the implementation of the `predict` method which uses this information via Bayes Theorem calculations to do predictions for new, unseen messages, it might be a good idea to implement 2 helper functions, one for every conditional probability we need to compute in Bayes Theorem.

Doing this will make the code more readable and will greatly help when implementing `predict` later on.

The implementations for our 2 methods are pretty straightforward. They're basically a translation of the conditional probability calculations, incorporating the \(k\) factor to ensure that we'll never compute a \(0\) probability:

```
def _p_word_spam(self, word: str) -> float:
    return (self._k + self._num_word_in_spam[word]) / ((2 * self._k) + self._num_spam_messages)

def _p_word_ham(self, word: str) -> float:
    return (self._k + self._num_word_in_ham[word]) / ((2 * self._k) + self._num_ham_messages)
```

And now we're finally able to implement the meat of our Naive Bayes classifier: the `predict` function.

The `predict` function gets a message (we call it `text`) which it tokenizes to extract its words. Next up it iterates through all the words the Naive Bayes classifier saw during training (all words in both spam and ham messages) to check if the word in question can also be found in the new, unseen message. While doing that it calculates the probabilities with the help of our previously defined helper functions. The return value of the `predict` function is the application of Bayes Theorem which computes an overall probability indicating how likely it is that the new, unseen message is spam.

Note that in the following implementation we're applying the tricks we learned about in our discussion of underflow problems (mainly doing the wrapping and unwrapping via \(\log\) and \(\exp\)). Applying these techniques might make the code look a little bit confusing or intimidating, however if you look closely and ignore the `log` and `exp` usages you'll find that it's just the application of Bayes Theorem.

```
def predict(self, text: str) -> float:
    text_words: Set[str] = tokenize(text)
    log_p_spam: float = 0.0
    log_p_ham: float = 0.0
    for word in self._words:
        p_spam: float = self._p_word_spam(word)
        p_ham: float = self._p_word_ham(word)
        if word in text_words:
            log_p_spam += log(p_spam)
            log_p_ham += log(p_ham)
        else:
            log_p_spam += log(1 - p_spam)
            log_p_ham += log(1 - p_ham)
    p_if_spam: float = exp(log_p_spam)
    p_if_ham: float = exp(log_p_ham)
    return p_if_spam / (p_if_spam + p_if_ham)
```

Putting it all together, here's the full `NaiveBayes` class with all its methods:

```
class NaiveBayes:
    def __init__(self, k: int = 1) -> None:
        self._k: int = k
        self._num_spam_messages: int = 0
        self._num_ham_messages: int = 0
        self._num_word_in_spam: Dict[str, int] = defaultdict(int)
        self._num_word_in_ham: Dict[str, int] = defaultdict(int)
        self._spam_words: Set[str] = set()
        self._ham_words: Set[str] = set()
        self._words: Set[str] = set()

    def train(self, messages: List[Message]) -> None:
        msg: Message
        token: str
        for msg in messages:
            tokens: Set[str] = tokenize(msg.text)
            self._words.update(tokens)
            if msg.is_spam:
                self._num_spam_messages += 1
                self._spam_words.update(tokens)
                for token in tokens:
                    self._num_word_in_spam[token] += 1
            else:
                self._num_ham_messages += 1
                self._ham_words.update(tokens)
                for token in tokens:
                    self._num_word_in_ham[token] += 1

    def _p_word_spam(self, word: str) -> float:
        return (self._k + self._num_word_in_spam[word]) / ((2 * self._k) + self._num_spam_messages)

    def _p_word_ham(self, word: str) -> float:
        return (self._k + self._num_word_in_ham[word]) / ((2 * self._k) + self._num_ham_messages)

    def predict(self, text: str) -> float:
        text_words: Set[str] = tokenize(text)
        log_p_spam: float = 0.0
        log_p_ham: float = 0.0
        for word in self._words:
            p_spam: float = self._p_word_spam(word)
            p_ham: float = self._p_word_ham(word)
            if word in text_words:
                log_p_spam += log(p_spam)
                log_p_ham += log(p_ham)
            else:
                log_p_spam += log(1 - p_spam)
                log_p_ham += log(1 - p_ham)
        p_if_spam: float = exp(log_p_spam)
        p_if_ham: float = exp(log_p_ham)
        return p_if_spam / (p_if_spam + p_if_ham)
```
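One detail worth spelling out: `_p_word_spam` and `_p_word_ham` don't use raw frequencies but smooth them with the pseudocount \(k\), so that a word never seen in one class doesn't produce a zero probability (which would make \(\log\) blow up). Written out for the spam side, the code computes:

\[ P(w \mid spam) = \frac{k + count(w, spam)}{2k + N_{spam}} \]

where \(count(w, spam)\) is the number of spam messages containing the word \(w\) and \(N_{spam}\) is the total number of spam messages.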

Let's take our Naive Bayes implementation for a spin!

We'll use the "Enron Spam" data set to train and test our implementation.

Here's the code which downloads and extracts the data set:

```
wget -nc -P data http://nlp.cs.aueb.gr/software_and_datasets/Enron-Spam/preprocessed/enron1.tar.gz
tar -xzf data/enron1.tar.gz -C data
```

Reading through the readme we find that the spam and ham messages are stored in separate directories, so we need to find those directories and then iteratively open and parse every single file we find in them. We then store the data from the parsed subject line (we're only using the E-Mail's subject to keep things simple) in a `NamedTuple` and append it to a list containing all the messages. Here's the code which does just that:

```
spam_data_path: Path = data_dir / 'enron1' / 'spam'
ham_data_path: Path = data_dir / 'enron1' / 'ham'

class Message(NamedTuple):
    text: str
    is_spam: bool

spam_message_paths: List[str] = glob.glob(str(spam_data_path / '*.txt'))
ham_message_paths: List[str] = glob.glob(str(ham_data_path / '*.txt'))
message_paths: List[str] = spam_message_paths + ham_message_paths

messages: List[Message] = []
for path in message_paths:
    with open(path, errors='ignore') as file:
        is_spam: bool = 'spam' in path
        text: str = file.readline().replace('Subject:', '').strip()
        messages.append(Message(text, is_spam))
```

And that's all there is in terms of data preparation.

Now we need to split our data into a "training" and "testing" set which we'll use to train our Naive Bayes classifier.

Our `train_test_split` function shuffles all the messages and then assigns them to 2 dedicated lists: one for training and one for testing. The default splitting rate is \(80/20\), meaning \(80\%\) of all messages will be assigned to the training set and \(20\%\) to the test set.

```
def train_test_split(messages: List[Message], pct=0.8) -> Tuple[List[Message], List[Message]]:
    shuffle(messages)
    num_train = int(round(len(messages) * pct, 0))
    return messages[:num_train], messages[num_train:]

train: List[Message]
test: List[Message]
train, test = train_test_split(messages)
```

Training our Naive Bayes classifier is as simple as creating a new instance and calling the `train` method with the training set:

```
nb: NaiveBayes = NaiveBayes()
nb.train(train)
```

Let's grab some spam messages from our `test` set (the data our classifier hasn't seen yet) and see what gets predicted:

```
spam_messages: List[Message] = [item for item in test if item.is_spam]
message: str = spam_messages[10].text
print(f'Predicting likelihood of "{message}" being spam.')
nb.predict(message)
# Predicting likelihood of "get your hand clock repliacs todday carson" being spam.
# 0.9884313222593173
```

Almost \(99\%\). Not bad!

And what about a ham message?

```
ham_messages: List[Message] = [item for item in test if not item.is_spam]
message: str = ham_messages[10].text
print(f'Predicting likelihood of "{message}" being spam.')
nb.predict(message)
# Predicting likelihood of "associate & analyst mid - year 2001 prc process" being spam.
# 5.3089147140900964e-05
```

Great! It's time to pat yourself on the back! You've successfully implemented and trained your very own Naive Bayes classifier to reliably identify spam and ham messages!

**Note**: \(99\%\) sounds too good to be true and there's certainly a smell of overfitting in the air. Remember that the data set we've used for training is rather small. Furthermore we've only trained on the E-Mail subject lines. You might want to modify the code to train on the whole message body or find a different data set altogether.
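Rather than spot-checking individual messages, one might compute the accuracy over the whole test set. The sketch below is my own addition (not from the article); it assumes a `predict`-style function returning a spam probability and uses a hypothetical decision threshold of `0.5`, demonstrated with a stub classifier so it runs standalone:

```python
from typing import Callable, List, NamedTuple

class Message(NamedTuple):
    text: str
    is_spam: bool

def accuracy(predict: Callable[[str], float],
             messages: List[Message],
             threshold: float = 0.5) -> float:
    """Fraction of messages whose thresholded prediction matches the label."""
    correct = sum((predict(msg.text) >= threshold) == msg.is_spam
                  for msg in messages)
    return correct / len(messages)

# Stub standing in for a trained NaiveBayes instance's `predict` method.
def stub_predict(text: str) -> float:
    return 0.9 if 'viagra' in text else 0.1

test_messages = [
    Message('cheap viagra now', True),
    Message('meeting at noon', False),
    Message('quarterly report attached', False),
]
print(accuracy(stub_predict, test_messages))  # 1.0 on this toy data
```

With the real classifier you'd pass `nb.predict` and the actual `test` list instead of the stubs.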

Naive Bayes is a powerful Machine Learning algorithm which makes it possible to classify unseen data based on probability scores.

Bayes Theorem, one of the most fundamental theorems in probability theory, forms the basis of Naive Bayes. With Bayes Theorem one can calculate conditional probabilities such as "how likely is it that this message is spam given word \(X\)?", which is exactly what's necessary to implement a spam classifier.
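For reference, Bayes Theorem states that for two events \(A\) and \(B\):

\[ P(A \mid B) = \frac{P(B \mid A) \, P(A)}{P(B)} \]

In the spam setting, \(A\) is "the message is spam" and \(B\) is the evidence, i.e. the words observed in the message.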

While we've used the classic spam filtering use case to deconstruct Naive Bayes, there are more areas this algorithm can be applied to.

Given that recent developments in Deep Learning called Probabilistic Deep Learning incorporate Bayesian thinking into their underlying models, it's a good investment to solidify one's understanding of Bayes Theorem and its application via Naive Bayes.

One of the main resources I studied while learning about Machine Learning algorithms such as Naive Bayes is the book "Data Science from Scratch" by Joel Grus. This book is pure gold as it teaches you all the nitty-gritty details about the most fundamental algorithms out there. If you haven't already, do yourself a favor and purchase a copy of this book. It's worth it.

The following is a list of resources I've used to compile this blog post. Other helpful resources are also linked within the article itself.

You can find working code examples (including this one) in my lab repository on GitHub.

K-nearest neighbors (abbreviated as k-NN or KNN) is a simple, yet elegant Machine Learning algorithm to classify unseen data based on existing data. The neat property of this algorithm is that it doesn't require a "traditional" training phase. If you have a classification problem and labeled data you can predict the class of any unseen data point by leveraging your existing, already classified data.

Let's take a closer look at the intuitions behind the core ideas, the involved math and the translation of such into code.

Imagine that we've invited 100 dog owners with their dogs over for a statistical experiment we want to run. Each dog participating in this experiment belongs to 1 of 4 different dog breeds we're interested in studying. While we have the dog owners and their dogs around we measure 3 different properties of each dog:

- Its weight (in kilograms)
- Its height (in centimeters)
- Its alertness (on a scale from 0 to 1 [1=very alert, 0=almost no alertness])

Once done, we normalize the measurements so that they fall into a range between \(0\) and \(1\).
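The text doesn't show how the normalization is done; a common choice is min-max scaling, sketched here with an illustrative `min_max_normalize` helper of my own (it assumes the values aren't all identical):

```python
from typing import List

def min_max_normalize(values: List[float]) -> List[float]:
    """Linearly rescale measurements into the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Example: four dog weights in kilograms.
weights_kg = [4.0, 12.0, 28.0, 36.0]
print(min_max_normalize(weights_kg))  # [0.0, 0.25, 0.75, 1.0]
```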

After collecting the data on each individual dog we end up with 100 measurements of 3 values each, each of which is labeled with the corresponding dog breed.

Here's one example:

\[ \begin{pmatrix} 0.5 \\ 0.8 \\ 0.1 \end{pmatrix} = Podenco \]

In order to better understand the data it's always a good idea to plot it. Since we have collected 3 different measurements (weight, height and alertness) it's possible to project all of the 100 data points into a 3 dimensional space and color every data point according to its label (e.g. brown for the label "Podenco").

Unfortunately we run into an issue while attempting to plot this data. As it turns out we forgot to label one measurement. We do have the dog's weight, its height and its alertness but for some reason we forgot to write down the dog's breed.

Is there any chance we could derive what this dog's breed might be given all the other dog measurements we already have? We can still add the unlabeled data point into our existing 3 dimensional space where all the other colored data points reside. But how should we color it?

One potential solution is to look at, say, the 5 surrounding neighbors of the data point in question and see what their color is. If the majority of those data points are labeled "Podenco" then it's very likely that our measurements were also taken from a Podenco.

And that's exactly what the k-NN algorithm does. The k-nearest neighbors algorithm predicts the class of an unseen data point based on its k-nearest neighbors and the majority of their respective classes. Let's take a closer look at this from a mathematical perspective.

There are only 2 concepts we need to implement in order to classify unseen data via k-NN.

As stated above, the algorithm works by looking at the k-**nearest neighbors** and the **majority of their respective classes** in order to classify unseen data.

Because of that we need to implement 2 functions. A distance function which calculates the distance between two points and a voting function which returns the most seen label given a list of arbitrary labels.

Given the notion of "nearest neighbors" we need to calculate the distance between our "to be classified" data point and all the other data points to find the \(k\) closest ones.

There are a couple of different distance functions out there. For our implementation we'll use the Euclidean distance as it's simple to calculate and can easily scale up to \(n\) dimensions.

The mathematical notation is as follows:

\[ d(x, y) = d(y, x) = \sqrt{\sum_{i=1}^N (x_i - y_i)^2} \]

Let's unpack this formula with the help of an example. Assuming we have two vectors \(a\) and \(b\), the Euclidean distance between the two is calculated as follows:

\[ \vec{a} = \begin{pmatrix} 1 \\ 2 \\ 3 \\ 4 \end{pmatrix} \vec{b} = \begin{pmatrix} 5 \\ 6 \\ 7 \\ 8 \end{pmatrix} \]

\[ d(\vec{a}, \vec{b}) = d(\vec{b}, \vec{a}) \\ = \sqrt{(1 - 5)^2 + (2 - 6)^2 + (3 - 7)^2 + (4 - 8)^2} \\ = \sqrt{64} = 8 \]

Translating this into code results in the following:

```
def distance(x: List[float], y: List[float]) -> float:
    assert len(x) == len(y)
    return sqrt(sum((x[i] - y[i]) ** 2 for i in range(len(x))))

assert distance([1, 2, 3, 4], [5, 6, 7, 8]) == 8
```

Great. We just implemented the first building block: a Euclidean `distance` function.

Next up we need to implement a voting function. The voting function takes as an input a list of labels and returns the "most seen" label of that list. While this sounds trivial to implement we should take a step back and think about potential edge cases we might run into.

One of those edge cases is the situation in which we have 2 or more "most seen" labels:

```
# Do we return `a` or `b`?
labels: List[str] = ['a', 'a', 'b', 'b', 'c']
```

For those scenarios we need to implement a tie breaking mechanism.

There are several ways to deal with this. One solution might be to pick a candidate randomly. In our case however we shouldn't think about the vote function in isolation since we know that the distance and vote functions both work together to determine the label of the unseen data point.

We can exploit this fact and assume that the list of labels our vote function gets as an input is sorted by distance, nearest to furthest. Given this requirement it's easy to break ties. All we need to do is recursively remove the last entry in the list (which is the furthest) until we only have one clear winning label.

The following demonstrates this process based on the `labels` example from above:

```
# Do we return `a` or `b`?
labels: List[str] = ['a', 'a', 'b', 'b', 'c']
# Remove one entry. We're still unsure if we should return `a` or `b`
labels: List[str] = ['a', 'a', 'b', 'b']
# Remove another entry. Now it's clear that `a` is the "winner"
labels: List[str] = ['a', 'a', 'b']
```

Let's translate that algorithm into a function we call `majority_vote`:

```
def majority_vote(labels: List[str]) -> str:
    counted: Counter = Counter(labels)
    winner: List[str] = []
    max_num: int = 0
    most_common: Tuple[str, int]
    for most_common in counted.most_common():
        label: str = most_common[0]
        num: int = most_common[1]
        if num < max_num:
            break
        max_num = num
        winner.append(label)
    if len(winner) > 1:
        return majority_vote(labels[:-1])
    return winner[0]

assert majority_vote(['a', 'b', 'b', 'c']) == 'b'
assert majority_vote(['a', 'b', 'b', 'a']) == 'b'
assert majority_vote(['a', 'a', 'b', 'b', 'c']) == 'a'
```

The tests at the bottom show that our `majority_vote` function reliably deals with edge cases such as the one described above.

Now that we've studied and codified both building blocks it's time to bring them together. The `knn` function we're about to implement takes as inputs the list of labeled data points, a new measurement (the data point we want to classify) and a parameter `k` which determines how many neighbors we want to take into account when voting for the new label via our `majority_vote` function.

The first thing our `knn` algorithm should do is calculate the distances between the new data point and all the other, existing data points. Once done we need to order the distances from nearest to furthest and extract the data point labels. This sorted list is then truncated so that it only contains the `k` nearest data point labels. The last step is to pass this list into the voting function which computes the predicted label.

Turning the described steps into code results in the following `knn` function:

```
def knn(labeled_data: List[LabeledData], new_measurement, k: int = 5) -> Prediction:
    class Distance(NamedTuple):
        label: str
        distance: float

    distances: List[Distance] = [Distance(data.label, distance(new_measurement, data.measurements))
                                 for data in labeled_data]
    distances = sorted(distances, key=attrgetter('distance'))
    labels = [distance.label for distance in distances][:k]
    label: str = majority_vote(labels)
    return Prediction(label, new_measurement)
```

That's it. That's the k-nearest neighbors algorithm implemented from scratch!

It's time to see if our homebrew k-NN implementation works as advertised. To test drive what we've coded we'll use the famous Iris flower data set.

The data set consists of 50 samples each of three different species of the Iris flower:

For each sample, 4 different measurements were collected: The sepal width and length as well as its petal width and length.

The following is an example row from the data set, where the first 4 numbers are the sepal length, sepal width, petal length and petal width, and the last string represents the label for those measurements.

`6.9,3.1,5.1,2.3,Iris-virginica`
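The `LabeledData` and `Prediction` types our `knn` function relies on aren't shown in the snippets here; a minimal sketch (the names and fields are assumptions on my part, chosen to match how `knn` uses them) that also parses such a row could look like this:

```python
from typing import List, NamedTuple

class LabeledData(NamedTuple):
    label: str
    measurements: List[float]

class Prediction(NamedTuple):
    label: str
    measurements: List[float]

def parse_row(row: str) -> LabeledData:
    """Split a raw CSV row into measurements and a label."""
    *numbers, label = row.strip().split(',')
    return LabeledData(label, [float(n) for n in numbers])

print(parse_row('6.9,3.1,5.1,2.3,Iris-virginica'))
# LabeledData(label='Iris-virginica', measurements=[6.9, 3.1, 5.1, 2.3])
```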

The best way to explore this data is to plot it. Unfortunately it's hard to plot and inspect 4 dimensional data. However we can pick 2 measurements (e.g. petal length and petal width) and scatter plot those in 2 dimensions.

**Note**: We're using the amazing Plotly library to create our scatter plots.

```
fig = px.scatter(x=xs, y=ys, color=text, hover_name=text, labels={'x': 'Petal Length', 'y': 'Petal Width'})
fig.show()
```

We can clearly see clusters of data points which all share the same color and therefore the same label.

Now let's pretend that we have a new, unlabeled data point:

`new_measurement: List[float] = [7, 3, 4.8, 1.5]`

Adding this data point to our existing scatter plot results in the following:

```
fig = px.scatter(x=xs, y=ys, color=text, hover_name=text, labels={'x': 'Petal Length', 'y': 'Petal Width'})
fig.add_annotation(
    go.layout.Annotation(
        x=new_measurement[petal_length_idx],
        y=new_measurement[petal_width_idx],
        text="The measurement we want to classify")
)
fig.update_annotations(dict(
    xref="x",
    yref="y",
    showarrow=True,
    arrowhead=7,
    ax=0,
    ay=-40,
    borderwidth=2,
    borderpad=4,
    bgcolor="#c3c3c3"
))
fig.show()
```

Even if we're just plotting the petal length and petal width in 2 dimensions it seems to be the case that the new measurement might be coming from an "Iris Versicolor".

Let's use our `knn` function to get a definite answer:

`knn(labeled_data, new_measurement, 5)`

And sure enough the result we get back indicates that we're dealing with an "Iris Versicolor":

`Prediction(label='Iris-versicolor', measurements=[7, 3, 4.8, 1.5])`

k-nearest neighbors is a very powerful classification algorithm which makes it possible to label unseen data based on existing data. k-NN's main idea is to use the \(k\) nearest neighbors of the new, "to-be-classified" data point to "vote" on the label it should have.

Given that, we need 2 core functions to implement k-NN. The first function calculates the distance between 2 data points so that nearest neighbors can be found. The second function performs a majority vote so that a decision can be made as to what label is most present in the given neighborhood.

Using both functions together brings k-NN to life and makes it possible to reliably label unseen data points.

I hope that this post was helpful and demystified the inner workings of the k-nearest neighbors algorithm.

*Thanks to Eric Nieuwland who reached out via E-Mail and provided some code simplifications!*

The following is a list of resources I've used to work on this blog post. Other helpful resources are also linked within the article itself.

You can find working code examples (including this one) in my lab repository on GitHub.

Gradient Descent is one of the most fundamental optimization techniques used in Machine Learning. But what is a gradient? What do we descend down, and what do we even optimize in the first place?

Those might be some of the questions which come to mind when having the first encounters with Gradient Descent. Let's answer those questions while implementing the Gradient Descent optimization technique from scratch.

Many Machine Learning problems require some form of optimization. Usually the algorithm starts with an initial guess as to what the correct answer might be and then slowly adapts its parameters to be less wrong with every consecutive trial.

This learning process through trial and error repeats until the algorithm has learned to "correctly" predict the results for the training- and test data it gets fed.

In order to figure out how wrong our algorithm is with its predictions, we need to define the notion of a loss function. The loss function compares the guesses and the actual values and turns the differences between the two into a measurement we can use to quantify the quality of our algorithm. Prominent loss functions are the Mean squared error (MSE), the Root mean square error (RMSE) and the Sum of squared errors (SSE).

Imagine that we're plotting the loss (i.e. "wrongness") the algorithm produces as a surface in a multi dimensional space (such as 3D). Intuitively it makes sense to find the "place" on this surface where the algorithm makes the fewest mistakes. That's exactly where Gradient Descent comes into play. With Gradient Descent we take a "walk" on this surface to find such a place.

As we've already discovered, loss functions and optimizations are usually intertwined when working on Machine Learning problems. While it makes sense to teach them together I personally believe that it's more valuable to keep things simple and focused while exploring core ideas. Therefore for the rest of this post we'll solely focus on Gradient Descent as a mathematical optimization technique which can be applied in a variety of different domains (including Machine Learning problems).

The first thing we should do is define the function we want to run the Gradient Descent algorithm on. Most examples explain the algorithm's core concepts with a single variable function such as one drawn from the class of parabolas (e.g. \(x^2\)). Single variable functions can be easily plotted in a 2 dimensional space along the \(x\) and \(y\) axes. Besides other nice properties, dealing with only 2 dimensions makes the involved math a whole lot easier. Real world Machine Learning problems however deal with data of orders of magnitude higher dimensionality. While there's a slightly steeper learning curve to go from 2D to 3D, there's pretty much nothing new to learn to go from 3D to \(n\)D. Because of that we'll apply Gradient Descent to a multivariable function with 2 variables.

Most of us studied the properties of quadratic functions via parabolas in school. But is there an equivalent function which can be plotted in 3 dimensions? Imagine that you have a parabola and spin it like a centrifuge along its \(y\)-axis. What you'd end up with is a surface called a paraboloid, a "3 dimensional parabola".

There are various ways to define paraboloids but in our case let's use the "infinite paraboloid" which is defined as follows:

\[ x^2 + y^2 \]

Translating the math into code results in:

```
def paraboloid(x: float, y: float) -> float:
    return x ** 2 + y ** 2
```

Simple enough. Next up we should generate some test data, pass it into our `paraboloid` function and plot the results to see if everything works as expected:

```
# Test data generation (only really necessary for the plotting below)
xs_start = ys_start = -10
xs_stop = ys_stop = 11
xs_step = ys_step = 1

xs: List[float] = [i for i in range(xs_start, xs_stop, xs_step)]
ys: List[float] = [i for i in range(ys_start, ys_stop, ys_step)]
zs: List[List[float]] = []

for x in xs:
    temp_res: List[float] = []
    for y in ys:
        result: float = paraboloid(x, y)
        temp_res.append(result)
    zs.append(temp_res)

print(f'xs: {xs}\n')
# xs: [-10, -9, -8, -7, -6, -5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

print(f'ys: {ys}\n')
# ys: [-10, -9, -8, -7, -6, -5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

print(f'zs: {zs[:5]} ...\n')
# zs: [[200, 181, 164, 149, 136, 125, 116, 109, 104, 101, 100, 101, 104, 109, 116, 125, 136, 149, 164, 181, 200], [181, 162, 145, 130, 117, 106, 97, 90, 85, 82, 81, 82, 85, 90, 97, 106, 117, 130, 145, 162, 181], [164, 145, 128, 113, 100, 89, 80, 73, 68, 65, 64, 65, 68, 73, 80, 89, 100, 113, 128, 145, 164], [149, 130, 113, 98, 85, 74, 65, 58, 53, 50, 49, 50, 53, 58, 65, 74, 85, 98, 113, 130, 149], [136, 117, 100, 85, 72, 61, 52, 45, 40, 37, 36, 37, 40, 45, 52, 61, 72, 85, 100, 117, 136]] ...
```

Plotting the graph (using the Plotly library) can be done with the following code:

```
fig = go.Figure(go.Surface(x=xs, y=ys, z=zs, colorscale='Viridis'))
fig.show()
```

According to Wikipedia the definition of a gradient is:

... a vector-valued function \(f\)..., whose value at a point \(p\) is the vector whose components are the partial derivatives of \(f\) at \(p\).

I agree that this sounds overwhelming but the core concepts of a gradient are really simple. What this quote tries to convey is that for any given point \(p\) on the function's plotted surface there's a vector, consisting of partial derivatives, which points in the direction of greatest increase.

With that explanation the only concept we need to sort out is the notion of partial derivatives.

If you've studied Differential Calculus you surely came across the topic of derivatives. To recap real quick, a derivative measures the sensitivity to change of a function's output with respect to changes in its inputs.

Derivatives can be of different orders. One of the most prominent derivatives you surely came across is the first-order derivative which is the slope of a tangent line that tells us whether the function is increasing or decreasing at any given point.

The process of differentiation obeys several rules, some of which we'll apply when walking through the example below to refresh our fond memories of Calculus I.

Let's calculate the first-, second- and third-order derivative of the following function:

\[ x^3 + 2x^2 - 5 \]

According to the differentiation rules, the first-order derivative is:

\[ \frac{d}{dx}(x^3 + 2x^2 - 5) = 3x^2 + 4x \]

The second-order derivative is:

\[ \frac{d}{dx}(3x^2 + 4x) = 6x + 4 \]

And finally the third-order derivative is:

\[ \frac{d}{dx}(6x + 4) = 6 \]

Why did we do this? As you might recall from above, a gradient is a vector of partial derivatives of function \(f\) at point \(p\). So in order to compute gradients we need to compute the partial derivatives of our function \(f\).

We certainly do know how to compute derivatives as we've shown above, but how do we calculate partial derivatives? As it turns out you already know how to calculate partial derivatives if you know how to calculate derivatives. The twist with partial derivatives is that you're deriving with respect to a variable while treating every other variable as a constant.

Let's apply this rule and compute the partial derivatives of our Paraboloid function \(x^2 + y^2\) which we call \(f\):

The first partial derivative we calculate is the derivative of \(f\) with respect to \(x\), treating \(y\) as a constant:

\[ \frac{\partial}{\partial x}(x^2 + y^2) = 2x \]

The second partial derivative calculation follows the same pattern, deriving \(f\) with respect to \(y\) while treating \(x\) as a constant:

\[ \frac{\partial}{\partial y}(x^2 + y^2) = 2y \]

**Note**: Don't get confused by the notation here. While you'd use \(\frac{d}{dx}\) for "normal" derivatives you simply use \(\frac{\partial}{\partial x}\) for partial derivatives.

And that's it. With those partial derivatives we're now able to compute any gradient for any point \(p\) sitting on the plotted surface of function \(f\). Let's translate our findings into code:

```
def compute_gradient(vec: List[float]) -> List[float]:
    assert len(vec) == 2
    x: float = vec[0]
    y: float = vec[1]
    return [2 * x, 2 * y]
```

Let's recap what we've accomplished so far. First, we've codified and plotted a function called the "infinite paraboloid" which is used throughout this post to demonstrate the Gradient Descent algorithm. Next up we studied gradients, taking a quick detour into Differential Calculus to refresh our memories about differentiation. Finally we computed the partial derivatives of our paraboloid function.

Translating the prose into a mental image, we have a function \(f\) which produces a surface in a 3 dimensional space. Given any point \(p\) on that surface we can use the gradient (via its partial derivatives) to find a vector pointing into the direction of greatest increase.

That's great, but we're dealing with an optimization problem and for a lot of such applications it's usually desirable to find a global or local minimum. Right now, the gradient vector is pointing upwards to the direction of greatest increase. We need to turn that vector into the opposite direction so that it points to the direction of greatest decrease.

Linear Algebra taught us that doing that is as simple as multiplying the gradient vector by \(-1\). Another neat Linear Algebra trick is to multiply a vector by a number other than \(1\) to change its magnitude (= its length). If we're for example multiplying the gradient vector by \(0.25\) we would get a vector which has length \(\frac{1}{4}\) of its original length.
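As a tiny illustration of my own (the article applies the same idea inside its `compute_step` function), negating and rescaling a gradient vector is just element-wise multiplication by a scalar:

```python
from typing import List

def scale(vec: List[float], factor: float) -> List[float]:
    """Multiply every component of a vector by the same scalar."""
    return [v * factor for v in vec]

gradient = [4.0, -2.0]
print(scale(gradient, -1))     # [-4.0, 2.0] -> points toward greatest decrease
print(scale(gradient, -0.25))  # [-1.0, 0.5] -> same direction, a quarter the length
```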

We finally have all the building blocks in place to put together the Gradient Descent algorithm which eventually finds the nearest local minimum for any given function.

The algorithm works as follows.

Given a function \(f\) we want to run Gradient Descent on:

- Get the starting position \(p\) (which is represented as a vector) on \(f\)
- Compute the gradient at point \(p\)
- Multiply the gradient by a negative "step size" (usually a value smaller than \(1\))
- Compute the next position of \(p\) on the surface by adding the rescaled gradient vector to the vector \(p\)
- Repeat step 2 with the new \(p\) until convergence

Let's turn that into code. First, we should define a `compute_step` function which takes as parameters the vector \(p\) and a "step size" (we call it `learning_rate`), and computes the next position of \(p\) according to the gradient of \(f\):

```
def compute_step(curr_pos: List[float], learning_rate: float) -> List[float]:
    grad: List[float] = compute_gradient(curr_pos)
    grad[0] *= -learning_rate
    grad[1] *= -learning_rate
    next_pos: List[float] = [0, 0]
    next_pos[0] = curr_pos[0] + grad[0]
    next_pos[1] = curr_pos[1] + grad[1]
    return next_pos
```

Next up we should define a random starting position \(p\):

```
start_pos: List[float]

# Ensure that we don't start at a minimum (0, 0 in our case)
while True:
    start_x: float = randint(xs_start, xs_stop)
    start_y: float = randint(ys_start, ys_stop)
    if start_x != 0 and start_y != 0:
        start_pos = [start_x, start_y]
        break

print(start_pos)
# [6, -3]
```

And finally we wrap our `compute_step` function into a loop to iteratively walk down the surface and eventually reach a local minimum:

```
epochs: int = 5000
learning_rate: float = 0.001
best_pos: List[float] = start_pos

for i in range(0, epochs):
    next_pos: List[float] = compute_step(best_pos, learning_rate)
    if i % 500 == 0:
        print(f'Epoch {i}: {next_pos}')
    best_pos = next_pos

print(f'Best guess for a minimum: {best_pos}')
# Epoch 0: [5.988, -2.994]
# Epoch 500: [2.2006573940846716, -1.1003286970423358]
# Epoch 1000: [0.8087663604107433, -0.4043831802053717]
# Epoch 1500: [0.2972307400008091, -0.14861537000040456]
# Epoch 2000: [0.10923564223981955, -0.054617821119909774]
# Epoch 2500: [0.04014532795468376, -0.02007266397734188]
# Epoch 3000: [0.014753859853277982, -0.007376929926638991]
# Epoch 3500: [0.005422209548664845, -0.0027111047743324226]
# Epoch 4000: [0.0019927230353282872, -0.0009963615176641436]
# Epoch 4500: [0.0007323481432962652, -0.0003661740716481326]
# Best guess for a minimum: [0.00026968555624761565, -0.00013484277812380783]
```

That's it. We've successfully implemented the Gradient Descent algorithm from scratch!

Gradient Descent is a fundamental optimization algorithm widely used in Machine Learning applications. Given that it's used to minimize the error in the predictions an algorithm makes, it sits at the very core of what enables algorithms to "learn".

In this post we've dissected all the different parts the Gradient Descent algorithm consists of. We looked at the mathematical formulations and translated those into code, which in our case can be used to find the global minimum on a Paraboloid function \(x^2 + y^2\). Note that our implementation was optimized for readability and educational usefulness rather than performance. You might not want to use this code in production.

Modern Machine Learning libraries such as TensorFlow or PyTorch ship with built-in differentiation capabilities to automatically compute partial derivatives, removing the need to implement the tedious Math yourself.
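A cheap way to sanity check an analytic gradient like ours, without any such library, is a central-difference approximation. The sketch below is my own addition, not code from the article:

```python
from typing import Callable, List

def numerical_gradient(f: Callable[[float, float], float],
                       x: float, y: float, h: float = 1e-6) -> List[float]:
    """Approximate the gradient of f at (x, y) via central differences."""
    df_dx = (f(x + h, y) - f(x - h, y)) / (2 * h)
    df_dy = (f(x, y + h) - f(x, y - h)) / (2 * h)
    return [df_dx, df_dy]

def paraboloid(x: float, y: float) -> float:
    return x ** 2 + y ** 2

# The analytic gradient at (3, -2) is [2 * 3, 2 * -2] = [6, -4].
print(numerical_gradient(paraboloid, 3.0, -2.0))  # approximately [6.0, -4.0]
```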

I hope that this post was helpful and demystified some of the inner workings of the Machine Learning libraries you use in your projects on a day to day basis.

The following is a list with resources I've used to write this blog post. The post itself also links to some other interesting content you might want to further explore.

The end of the year is usually a good time to reflect on the past year, think about the present and plan for the future. I personally do this every year as it helps me reevaluate my goals and get everything in alignment with what's ahead.

The end of last year felt slightly different, however as we left the "old" decade behind and entered the new one: **2020**.

A decade sounds shorter than it actually is. In order to get some perspective as to what can happen in 10 years of time I did some research, tracking down where we started technology-wise when entering 2010:

Did you know that Uber, the infamous ride hailing startup, was just getting started and was barely a thing in 2010? That Snapchat was merely an idea until the first release hit the Apple AppStore in 2011? Or that our text-based communication was mostly done via SMS since only early adopters owned a Smartphone? Just a few people used WhatsApp, the then paid communication platform which was started in 2009 and took the world by storm, eventually replacing texting via SMS.

All of that happened between 2010 and 2020. 10 years is a long time. A lot can change in a decade.

We're living in extremely exciting times as mankind has never seen technological advancements at such a rapid pace. One area I'm particularly excited about is Machine Learning and its impact on computing and society. While I certainly had some theoretical encounters during college, my interest in AI was rekindled when I read about the breathtaking breakthroughs we've achieved in areas such as self-driving cars, game AIs or the life sciences.

Such breakthroughs mostly happened in the last decade. One can only imagine what else will be unlocked if we follow the current trajectory of research efforts and hardware / software innovation.

While reflecting on the last 10 years I decided to pick a grand theme I want to focus my efforts on in the upcoming years. As my background is in Open Source software and Cloud computing and my passion lies in Computer Science and Math, I decided to phrase that theme as:

The intersection of Computer Science, Mathematics and Distributed Computing

Over the next years I'll dedicate my time to the field of Machine Learning. To further narrow this down, my plan is to focus on the "bleeding edge", which will include research and development efforts on various fronts. The core idea is to study brand new, theoretical research and put it into practice, hence the name of this publication: **The Intersection**.

There's a lot to discover during this process and together we'll deconstruct our findings in order to better understand their underlying guiding principles and mathematical foundations. Furthermore we'll apply what we've learned while working on different projects, ranging from toy projects and PoCs to custom-tailored Machine Learning tooling.

Since "sharing is caring" I'll document the whole journey via blog posts and Open Source the code I'll produce on my GitHub account.

Every adventure should have a sound plan, so here's what's on my mind right now. While I've already described the guiding "North Star" in the section above, I decided to further split up the endeavor into different categories. Every blog post I write will be tagged with the respective categorical tags.

The focus of this category is on the exploration and explanation of different Machine Learning algorithms and techniques. We'll peek under the covers and deconstruct how modern Machine Learning really works. Open Sourced code implementations will demonstrate the usefulness of models such as Linear Regression, Latent Dirichlet Allocation (LDA) or Capsule Neural Networks.

Pretty much all Machine Learning algorithms require a deep understanding of advanced Mathematical concepts. Given this requirement, most people shy away since "math is too complicated" or "just not for me".

I thought exactly the same but decided to put a fork in the road. Earlier last year I created a curriculum which I followed to relearn advanced mathematics from scratch. I can tell you that it's definitely challenging but at the same time intellectually stimulating and really rewarding. Trust me when I say that Math is a beautiful language and absolutely worthwhile to study.

The Math category will be used to translate complicated looking concepts into easily digestible prose and code.

It's of utmost importance to stay on top of current research. During our journey through the lands of Machine Learning and Artificial Intelligence I plan to allocate time to read about recent developments and share my findings via dedicated research-oriented posts.

Since research goes both ways I'll also take some time to do some research on my own, extending and discussing existing ideas while also coming up with new ones.

Staying too long in the theoretical realm is boring. Applying the theory in practice is where the fun begins and the real learning happens!

The "Projects" and "Code" categories will be used to talk about the projects we'll work on while bridging the gap between science and engineering. Projects can range from specifications, PoCs and implementation proposals to fully-fledged applications.

I already have a long list of ideas I plan to work on and I'll present some of those in upcoming posts.

As per usual I plan to Open Source the code while working on the projects and I'd be more than happy to welcome you joining the fun on GitHub.

Over the last couple of weeks I've already started to work according to the plan I just described above.

A solid foundational knowledge of the core concepts and principles is key to succeeding in any field of study. That's why I started the challenge of implementing core Machine Learning algorithms from scratch in pure Python with literally 0 dependencies (except for plotting graphs and diagrams). These algorithms include Gradient Descent, k-NN, Linear and Multiple Linear Regression, Logistic Regression, Naive Bayes, Decision Trees, Neural Networks and more. Most of them are already translated into blog posts which explain the algorithm in an intuitive and fun way. I plan to release the first blog posts in the upcoming weeks.

Other than that I'm fascinated by the idea and technological challenges of privacy-preserving Machine Learning. There's the saying that "Data is the new oil" since it's used, among other things, to train the Machine Learning algorithms that drive our daily lives. But data is more than "just modern oil". Data is precious, sensitive and shouldn't be shared or used without a user's consent. Techniques such as Homomorphic Encryption, Secure Multi-Party Computation, Differential Privacy and Federated Learning can be used to solve the privacy issue and e.g. train encrypted Machine Learning models on encrypted data in untrusted environments. I've already started to look into current research efforts and implemented different Homomorphic Encryption schemes to get an idea of what can work and what's still practically infeasible. While doing so I was blown away by the capabilities we might unlock in the next couple of years. It was really hard to pull myself out of that rabbit hole. To share my enthusiasm and get you excited as well I'm currently preparing a deep-dive blog post series around those topics!

10 years ago self-driving cars and personal assistants were merely far-fetched dreams and only really appreciated by SciFi fans and futurology nerds. Nowadays those technologies are omnipresent thanks to rapid innovations in areas such as Artificial Intelligence, Neuroscience, Distributed Computing and Hardware / Software development.

There's no better time than now to get involved in Machine Learning. That's why I decided to fully immerse myself into the field. I'll use this blog to document the journey while we learn about bleeding edge research, turn algorithms into code and work on various projects while decoding the underpinnings of the intelligent technologies which drive our daily lives.

__Note__: In this post we'll compare and contrast different programming languages. Everything discussed should be taken with a grain of salt. There's no single programming language which solves all the problems in an elegant and performant way. Every language has its up- and downsides. Swift is no exception.

Entering the Data Science and Machine Learning world there are various programming languages and tools to choose from. There's MATLAB, a commercial programming environment which is used across different industries and usually the tool of choice for practitioners with a heavy Math background. A free and Open Source alternative is the R project, a programming language created in 1993 to simplify statistical data processing. People working with R usually report enjoyment as R is "hackable" and comes bundled with different math modules and plotting libraries.

A more recent incarnation is the Julia scientific programming language which was created at MIT to resolve the issues older tools such as MATLAB and R struggled with. Julia cleverly incorporates modern engineering efforts from the fields of compiler construction and parallel computing, and given its Open Source nature it gained a lot of industry-wide adoption when it reached `v1` maturity in 2018.

If you're doing some more research to find the most used programming language in Data Science and Machine Learning you might be surprised to see a language which wasn't built from the ground up for scientific computing: **Python**.

The Python programming language was created by Guido van Rossum in 1989 to help bridge the gap between Bash scripting and C programming. Since then Python has taken the world by storm, mainly due to its flat learning curve, its expressiveness and its powerful standard library which makes it possible to focus on the core problems rather than reinventing the wheel over and over again.

__Funny tangent__: Open up a shell, run the Python interpreter via `python` and enter `import this` or `import antigravity`

Python is a general purpose programming language which was never designed to solve problems in a niche subject matter. Are you an Instagram user? They're running Python. Do you curate content via Pinterest? They're running Python. Do you store your data via Dropbox? They've developed their MVP in Python and still use it today. Even Google (then called BackRub) started out with Python and Java. The list goes on and on.

Given such an industry-wide adoption it's easy to see why a lot of care and effort were put into the ecosystem of reusable packages as well as the language itself. No matter what use case you're working on, chances are that there are numerous Python packages helping you solve your problems.

While the more famous Python projects include Web Frameworks such as Django or Flask, there are also a lot of mature Scientific Computing and Machine Learning implementations written in Python. Having access to such a robust foundation, it only makes sense that modern Deep Learning frameworks such as TensorFlow or PyTorch also leverage those libraries under the covers.

All of the things discussed so far sound great. Python, a general purpose programming language which has quite a few years of existence under its belt is used across industries in mission critical software systems. Over the course of 30 years a vibrant Open Source community emerged which develops and maintains powerful libraries used by millions of users on a daily basis.

Why bother and replace Python? If it ain't broke, don't fix it!

Technology is constantly improving. What was once unthinkable might all of a sudden be possible thanks to breakthroughs in Hard- and Software development. Python was created in a different era with a different purpose. It was never engineered to directly interface with hardware or run complex computations across a fleet of distributed machines.

A modern Deep Learning Framework such as TensorFlow uses dozens of programming languages behind the scenes. The core of such a library might be written in high-performance C++ which occasionally interfaces with different C libraries, Fortran programs or even parts of Assembly language to squeeze out every bit of performance possible. A Python interface is usually built on top of the C++ core to expose a simple public API Data Scientists and Deep Learning enthusiasts use.

Why isn't Python used throughout the whole stack?

The answer to this question is rather involved, but the gist of it is that Python's language design is more tailored towards high-level programming. Furthermore, it's just not fast enough to be used at the lower layers.

The following is an incomplete list of Python's (subjective) shortcomings:

Python code usually runs an order of magnitude slower compared to other interpreted and compiled languages. Projects such as Cython, which compiles Python code to raw C, try to mitigate this problem but they come with other issues (e.g. language inconsistencies, compatibility problems, ...).

It's not that straightforward to write Python code which reliably performs parallel processing tasks on multiple cores or even multiple machines. Deep Neural Networks can be expressed as graphs on which Tensor computations are carried out, making it a prime use case for parallel processing.

Python is a high-level language with lots of useful abstractions which unfortunately get in the way when trying to directly interface with the computer's underlying hardware. Because of that, heavy GPU computations are usually moved into lower-level code written in e.g. C or CUDA.

Since it's a scripting language at its core, Python comes with its own runtime that evaluates the script line-by-line as it runs, a process called "interpretation".

The other branch of programming languages are compiled languages. Compiling code means that the human-readable program code is translated into code a machine can read and understand. Compiled programs have the downside that there's a compilation step between writing and running the program. The upside of such a step is that various checks and optimizations can be performed while translating the code, eventually emitting the most efficient machine code possible.

Python has no concept of static typing. There's no problem in passing an integer into a function which expects a string. Python will run the program and raise an exception as soon as it evaluates the broken code.

Statically typed languages have the upside that mistakes like the one described above are impossible to make. The developer has to explicitly declare which types are expected.

Python has recently added support for type hints. Type hinting merely serves as another form of documentation as it still won't prevent type misuses at runtime.
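For contrast, here's a minimal sketch in TypeScript (used purely for illustration, with a made-up `shout` function) of how a statically checked language rejects such a misuse before the program ever runs:

```typescript
// The parameter and return types are declared explicitly
function shout(message: string): string {
  return message.toUpperCase() + "!"
}

console.log(shout("hello"))
// --> HELLO!

// The following line would not even compile:
// shout(42)
// error TS2345: Argument of type 'number' is not assignable to parameter of type 'string'.
```

The broken call is caught at compile time, whereas Python's type hints would let the equivalent program start and only fail once the faulty line is executed.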

A lot of prominent packages such as NumPy wrap other languages such as Fortran or C to offer reliable performance when working on computationally expensive data processing tasks.

While it's certainly not impossible to introduce existing libraries written in other languages into Python, the process to do that is oftentimes rather involved.

Without going into too much detail it makes sense to take a quick detour and study the origins of the Swift programming language in order to see why it has such potential to replace Python as the go-to choice for Data Science and Machine Learning projects.

Chris Lattner, the inventor of Swift, has a long history and established track record in modern compiler development. During college he worked on a project which eventually became LLVM ("Low Level Virtual Machine"), the renowned compiler infrastructure toolchain. The revolutionary idea behind LLVM is the introduction of frontends and backends which can be mixed and matched. One frontend could be written for Swift which is then coupled with a backend implementation for the x86 architecture. Compiling to another architecture is as simple as using another backend such as the one for PowerPC. Back in the early compiler days one had to write the compiler end-to-end, tightly coupling the frontend and backend, making it a heroic effort to offer the compiler for different platforms.

LLVM gained a lot of traction and Chris Lattner was eventually hired by Apple to work on its developer tooling which heavily relied on LLVM. During that time he worked on a C++ compiler and thought about what a better, more modern programming language might look like. He figured that it should be compiled, easy to learn, flexible enough to feel like a scripting language and at the same time "hackable" at every layer. Those ideas translated into the Swift programming language which was officially released at WWDC in 2014.

But what exactly makes Swift such a natural fit as a Python replacement? Isn't Swift only used for iOS and macOS apps? The following section shows why Swift could be Python's successor.

Swift is compiled via LLVM which means that its code is translated into optimized machine code directly running on the target platform. Improvements made to the LLVM compiler toolchain automatically benefit the Swift code generation.

There's the saying that Swift is "syntactic sugar for LLVM", which rings true as one can see with the `Builtin` usage for its core types. The linked code snippet shows that Swift's core types directly interface with their LLVM equivalents.

Despite the compilation process Swift feels like a dynamic, Python-esque language. Swift was designed from the ground up for programs to incrementally grow in complexity as necessary. The simplest of all Swift programs is just one line of code: `print("Hello World")`.

```
let greeting = "Hello World"
print(greeting)
// Hello World

let num1 = 1
let num2 = 2
print(num1 + num2)
// 3

let scores = [10, 35, 52, 92, 88]
for score in scores {
    print(score)
}
// 10
// 35
// 52
// 92
// 88

class Cat {
    var name: String
    var livesRemaining: Int = 9

    init(name: String) {
        self.name = name
    }

    func describe() -> String {
        return "👋 I'm \(self.name) and I have \(self.livesRemaining) lives 😸"
    }
}

let mitsy = Cat(name: "Mitsy")
print(mitsy.describe())
// 👋 I'm Mitsy and I have 9 lives 😸
```

Given that Swift is compiled via LLVM, it's statically type-checked during the compilation process. There's no way you can pass an invalid type to a function and only run into an error at runtime. If your code compiles you can be pretty sure that you're passing around the expected types.

```
func sum(xs: [Int]) -> Int {
    var result: Int = 0
    for x: Int in xs {
        result = result + x
    }
    return result
}

// Using correct types
let intNumbers: [Int] = [1, 2, 3, 4, 5]
let resultInt = sum(xs: intNumbers)
print(resultInt)
// 15

// Using incorrect types
let stringNumbers: [String] = ["one", "two", "three"]
let resultString = sum(xs: stringNumbers)
print(resultString)
// error: cannot convert value of type '[String]' to expected argument type '[Int]'
```

Swift's concepts of protocols and extensions make it dead simple to add new functionality to existing libraries or even types which ship with the language core itself. Want to add a new method to `Int`? No problem!

```
// One needs to implement `help` when using the `Debugging` Protocol
protocol Debugging {
    func help() -> String
}

// Implementing `Debugging` for MatrixMultiply
class MatrixMultiply: Debugging {
    func help() -> String {
        return "Offers methods to aid with matrix-matrix multiplications."
    }

    func multiply() {
        // ...
    }
}

var matMult = MatrixMultiply()
print(matMult.help())
// Offers methods to aid with matrix-matrix multiplications.

// Implementing `Debugging` for VectorMultiply
class VectorMultiply: Debugging {
    func help() -> String {
        return "Offers methods to aid with matrix-vector multiplications."
    }
}

var vecMult = VectorMultiply()
print(vecMult.help())
// Offers methods to aid with matrix-vector multiplications.
```

```
// Makes it possible to emojify an existing type
protocol Emojifier {
    func emojify() -> String
}

// Here we're extending Swift's core `Int` type
extension Int: Emojifier {
    func emojify() -> String {
        if self == 8 {
            return "🎱"
        } else if self == 100 {
            return "💯"
        }
        return String(self)
    }
}

print(8.emojify())
// 🎱
print(100.emojify())
// 💯
print(42.emojify())
// 42
```

I'm sure everyone has run into this problem before. An object is passed into a function and modified without bad intentions. Meanwhile the object is used in a different place and all of a sudden its internal state isn't what it's supposed to be. The culprit is the data mutation within the function.

This problem can be mitigated easily via value semantics. When using value semantics a "copy" rather than an object reference is passed around.
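For contrast, here's a small TypeScript sketch (TypeScript is used purely for illustration here, with a made-up `appendScore` function) of the reference-semantics pitfall described above: arrays are passed by reference, so a mutation inside the function leaks back to the caller.

```typescript
// Reference semantics (as in JavaScript/TypeScript): the parameter and the
// caller's variable point to the SAME underlying array
function appendScore(scores: number[]): void {
  scores.push(100) // mutates the caller's array, too
}

const scores = [10, 35, 52]
appendScore(scores)
console.log(scores)
// --> [10, 35, 52, 100]  <-- the original array was changed
```

With value semantics the function would receive its own copy instead, leaving the caller's data untouched.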

```
// As seen on: https://marcosantadev.com/copy-write-swift-value-types/
import Foundation

// Prints the memory address of the given object
func address(of object: UnsafeRawPointer) -> String {
    let addr = Int(bitPattern: object)
    return String(format: "%p", addr)
}

var list1 = [1, 2, 3, 4, 5]
print(address(of: list1))
// 0x7f2021f845d8

var list2 = list1
print(address(of: list2))
// 0x7f2021f845d8 <-- Both lists share the same address

list2.append(6) // <-- Mutating `list2`

print(list1)
// [1, 2, 3, 4, 5]
print(list2)
// [1, 2, 3, 4, 5, 6]

print(address(of: list1))
// 0x7f2021f845d8 <-- `list1` still has the same address
print(address(of: list2))
// 0x128fb50 <-- `list2` has a different address
```

Given that Swift compiles via LLVM it has access to existing LLVM-based implementations to interoperate with. One such project is Clang, a C language family frontend written for LLVM. Thanks to Clang it's dead simple to wrap existing C libraries and bring them into Swift projects.

The following video demonstrates how easy it is:

Given all the upsides described above, the TensorFlow team decided to experiment with Swift as a Python replacement to interface with TensorFlow. Early prototypes were fruitful, encouraging the TensorFlow team to officially release Swift for TensorFlow (S4TF) in 2019.

S4TF extends the Swift core language with various features especially useful for Machine Learning tasks. Such enhancements include first-class autodiff support to calculate derivatives for functions or Python interoperability which makes it possible to reuse existing Python packages such as matplotlib, scikit-learn or pandas via Swift.

The following is a demonstration which shows how Swift for TensorFlow can be used to describe and train a deep neural network in TensorFlow:

Do you want to play around with Swift for TensorFlow yourself? Just run the following code in a terminal to spin up a Jupyter Notebook server with Swift Kernel support in a Docker container:

```
docker run -it -p 8888:8888 --rm --name jupyter-s4tf \
  -v "$PWD":/home/jovyan/work \
  --ipc=host \
  pmuens/jupyter-s4tf:latest jupyter notebook \
  --ip=0.0.0.0 \
  --no-browser \
  --allow-root \
  --NotebookApp.token= \
  --notebook-dir=/home/jovyan/work
```

The code for the repository can be found here and the Docker Hub entry is here.

Python, the de-facto standard programming language for Data Science and Machine Learning, has served the community very well in the past. Nevertheless, given the trajectory of technological advancements we're slowly but surely hitting the limits of the tooling we currently have.

Performance critical code is already pushed down into lower-level implementations written in programming languages such as C or Fortran and wrapped via public Python APIs. Wouldn't it be nice to write expressive, yet performant code from the get go at every layer? And what about all the libraries out there? Wouldn't it be nice to wrap and reuse them with only a couple lines of code?

The lack of static typing in Python makes it painful to work on larger, more complex projects. It's all too easy to define a model and train it on a huge dataset just to realize that a type error interrupts the training process halfway through, an error which could've been prevented by thorough type checks.

And what if we're hitting other roadblocks? Wouldn't it be nice to be able to peek under the covers and fix the issues ourselves in an "official" way without all the monkey-patching?

Most large-scale Machine Learning projects have already faced some, if not all, of the issues listed above. The TensorFlow team experienced them too and looked into ways to solve them once and for all. What they came up with is Swift for TensorFlow (S4TF), a Swift language extension tailored towards modern Machine Learning projects. The Swift programming language comes with various properties which make it a perfect fit as a Python replacement: It shares a similar syntax, is compiled (and therefore runs fast), has a type system and seamlessly interoperates with existing C and Python libraries.

What do you think? Is Swift for TensorFlow the future or do we stick with Python for now? Will a language such as Julia dominate the Data Science and Machine Learning world in the future?

The following is a list of resources I've used to compile this blog post. There are also a couple of other sources linked within the article itself.

- Swift.org - A Swift Tour
- Fast.ai + Swift for TensorFlow Presentation
- Fast.ai - Lesson 13 (2019) - Basics of Swift for Deep Learning
- Fast.ai - Lesson 14 (2019) - Swift: C interop; Protocols; Putting it all together
- Fast.ai - Swift Numerics
- YouTube - Chris Lattner: Compilers, LLVM, Swift, TPU, and ML Accelerators | Artificial Intelligence Podcast
- YouTube - The Power of Swift for Machine Learning

Generic programming makes it possible to describe an implementation in an abstract way with the intention to reuse it with different data types.

While generic programming is a really powerful tool as it prevents the programmer from repeating herself it can be hard to grasp for newcomers. This is especially true if you’re not too familiar with typed programming languages.

This blog post aims to shed some light on the topic of generic programming. We’ll discover why Generics are useful and which thought process can be applied to easily derive generic function signatures. At the end of this post you’ll be able to author and understand functions like this one:

```
function foo<A, B>(xs: A[], func: (x: A) => B): B[] {
  /* ... */
}
```

**Note:** Throughout this post we’ll use TypeScript as our language of choice. Feel free to code along while reading through it.

Of course you can “just use JavaScript” (or another dynamically typed language) to not deal with concepts such as typing or Generics. But that’s not the point. The point of this post is to introduce the concepts of Generics in a playful way. TypeScript is just a replaceable tool to express our thoughts.

Before we jump right into the application of generic programming it might be useful to understand what problem Generics are solving. We’ll re-implement one of JavaScript’s built-in Array methods called `filter` to get first-hand experience as to why Generics were invented.

Let’s start with an example to understand what `filter` actually does. The JavaScript documentation for `filter` states that:

The `filter()` method creates a new array with all elements that pass the test implemented by the provided function.

Let’s take a look at a concrete example to see how we would use `filter` in our programs. First off we have to define an array. Let’s call our array `numbers` as it contains some numbers:

`const numbers = [1, 2, 3, 4, 5, 6]`

Next up we need to come up with a function our `filter` method applies to each element of the array. This function determines whether the element-under-test should be included in the resulting / filtered array. Based on the quote above and the description we just wrote down we can derive that the function which is used by the `filter` method should return a boolean value. The function should return `true` if the element passes the test and `false` otherwise.

To keep things simple we pretend that we want to filter our `numbers` array such that only even numbers will be included in our resulting array. Here’s the `isEven` function which implements that logic:

`const isEven = (num: number): boolean => num % 2 === 0`

Our `isEven` function takes in a `num` argument of type `number` and returns a `boolean`. We use the modulo operation to determine whether the number-under-test is even.

Next up we can use this function as an argument for the `filter` method on our array to get a resulting array which only includes even numbers:

```
const res = numbers.filter(isEven)
console.log(res)
// --> [2, 4, 6]
```

As we’ve stated earlier our goal is to implement the `filter` function on our own. Now that we’ve used `filter` in an example we should be familiar with its API and usage.

To keep things simple we won’t implement `filter` on arrays but rather define a standalone function which accepts an `array` and a `function` as its arguments.

What we do know is that `filter` loops through every element of the array and applies the custom function to it in order to see if it should be included in the resulting array. We can translate this into the following code:

```
function filter(xs: number[], func: (x: number) => boolean): number[] {
  const res: number[] = []
  for (const x of xs) {
    if (func(x)) {
      res.push(x)
    }
  }
  return res
}
```

Now there’s definitely a lot happening here and it might look intimidating but bear with me. It’s simpler than it might look.

In the first line we define our function called `filter` which takes an array called `xs` (you can imagine pronouncing this “exes”) and a function called `func` as its arguments. The array `xs` is of type `number[]` as we’re dealing with numbers and the function `func` takes an `x` of type `number`, runs some code and returns a `boolean`. Once done our `filter` function returns an array of type `number[]`.

The function body simply defines an intermediary array of type `number[]` which is used to store the resulting numbers. Other than that we’re looping over every element of our array and applying the function `func` to it. If the function returns `true` we push the element into our `res` array. Once done looping over all elements we return the `res` array which includes all the numbers for which our `func` function returned the value `true`.

Alright. Let’s see if our homebrew `filter` function works the same way the built-in JavaScript `filter` function does:

```
const res = filter(numbers, isEven)
console.log(res)
// --> [2, 4, 6]
```

Great! Looks like it’s working!

If we think about filtering in the abstract we can imagine that there’s more than just the filtering of numbers.

Let’s imagine we’re building a Rolodex-like application. Here’s an array with some names from our Rolodex:

`const names = ['Alice', 'Bob', 'John', 'Alex', 'Pete', 'Anthony']`

Now one of our application requirements is to only display names that start with a certain letter.

That sounds like a perfect fit for our `filter` function as we basically filter all the names based on their first character!

Let’s start by writing the custom function we’ll use to keep only the names that start with an `a`:

```
const startsWithA = (name: string): boolean =>
  name.toLowerCase().charAt(0) === 'a'
```

As we can see our function takes one argument called `name` of type `string` and it returns a `boolean` which our function computes by checking if the first character of the name is an `a`.

Now let’s use our `filter` function to filter the names:

```
const res = filter(names, startsWithA)
console.log(res)
// --> Type Error
```

Hmm. Something seems to be off here.

Let’s revisit the signature of our `filter` function:

```
function filter(xs: number[], func: (x: number) => boolean): number[] {
  /* ... */
}
```

Here we can see that the `xs` parameter is an array of type `number`. Furthermore the `func` parameter takes an `x` of type `number` and returns a `boolean`.

However in our new Rolodex application we’re dealing with names which are `strings` and the `startsWithA` function we’ve defined takes a `string` as an argument, not a `number`.

One way to fix this problem would be to create a copy of `filter` called e.g. `filter2` whose arguments can handle `strings` rather than `numbers`. But we programmers know that we shouldn’t repeat ourselves to keep things maintainable. In addition to that we’re lazy, so using one function to deal with different data types would be ideal.

And that’s exactly the problem Generics tackle. As the introduction of this blog post stated, Generics can be used to describe an implementation in an abstract way in order to reuse it with different data types.

Let’s use Generics to solve our problem and write a function that can deal with any data type, not just `numbers` or `strings`.

Before we jump into the implementation we should articulate what we’re about to implement. Talking in the abstract we’re basically attempting to filter an array of type `T` (`T` is our “placeholder” for some valid type here) with the help of our custom function. Given that our array has elements of type `T` our function should take each element of such type and produce a `boolean` as a result (like we did before).

Alright. let’s translate that into code:

```
function filter<T>(xs: T[], func: (x: T) => boolean): T[] {
  const res: T[] = []
  for (const x of xs) {
    if (func(x)) {
      res.push(x)
    }
  }
  return res
}
```

At a first glance this might look confusing since we’ve sprinkled in our `T`

type here and there. However overall it should look quite familiar. Let’s take a closer look into how this implementation works.

In the first line we define our `filter` function as a function which takes an array named `xs` of type `T` and a function called `func` which takes a parameter `x` of type `T` and returns a `boolean`. Our `filter` function then returns a resulting array which is also of type `T`, since it’s basically a subset of elements of our original array `xs`.

The code inside the function body is pretty much the same as before with the exception that our intermediary `res` array also needs to be of type `T`.

There’s one little detail we haven’t talked about yet. There’s this `<T>` at the beginning of the function. What does that actually do?

Well, our compiler doesn’t really know what the type `T` might be at the end of the day. And it doesn’t really care that much whether it’s a `string`, a `number` or an `object`. It only needs to know that it’s “some placeholder” type. We programmers have to tell the compiler that we’re abstracting the type away via Generics here. In TypeScript, for example, we use the syntax `<TheTypePlaceHolder>` right after the function name to signal to the compiler that we want our function to be able to deal with lots of different types (to be generic). Using `T` is just a convention. You could use any name you want as your “placeholder type”. If your function deals with more than one generic type you’d just list them comma-separated inside the `<>` like this: `<A, B>`.
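One detail worth noting (not covered above): the compiler usually infers the concrete type for `T` from the arguments, but you can also spell the type argument out explicitly. A small self-contained sketch:

```typescript
function filter<T>(xs: T[], func: (x: T) => boolean): T[] {
  const res: T[] = []
  for (const x of xs) {
    if (func(x)) {
      res.push(x)
    }
  }
  return res
}

// the compiler infers T = string from the arguments...
const inferred = filter(['Alice', 'Bob'], (name) => name.startsWith('A'))
// ...but we can also state the type argument explicitly
const explicit = filter<string>(['Alice', 'Bob'], (name) => name.startsWith('A'))
console.log(inferred, explicit)
// --> [ 'Alice' ] [ 'Alice' ]
```

Both calls behave identically; the explicit form is mainly useful when inference alone can’t pin the type down.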

That’s pretty much all we have to do to turn our limited, `number`-focused `filter` function into a generic function which can deal with all kinds of types. Let’s see if it works with our `numbers` and `names` arrays:

```
let res
// using `filter` with numbers and our `isEven` function
res = filter(numbers, isEven)
console.log(res)
// --> [2, 4, 6]
// using `filter` with strings and our `startsWithA` function
res = filter(names, startsWithA)
console.log(res)
// --> ['Alice', 'Alex', 'Anthony']
```

Awesome! It works!

One of the many benefits of using a type system is that you can get a good sense of what the function will be doing based solely on its signature.

Let’s take the function signature from the beginning of the post and see if we can figure out what it’ll be doing:

```
function foo<A, B>(xs: A[], func: (x: A) => B): B[] {
  /* ... */
}
```

The first thing we notice is that it’s a generic function as we’re dealing with 2 “type placeholders” `A` and `B` here. Next up we can see that this function takes in an array called `xs` of type `A` and a function `func` which takes an `A` and turns it into a `B`. At the end the `foo` function returns an array of type `B`.

Take a couple of minutes to parse the function signature in order to understand what it’s doing.

Do you know how this function is called? Here’s a tip: It’s also one of those functions from the realm of functional programming used on e.g. arrays.

Here’s the solution: The function we called `foo` here is usually called `map` as it iterates over the elements of the array and uses the provided function to map every element from one type to the other (note that it can also map to the same type, i.e. from type `A` to type `A`).
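The post doesn’t show `map`’s body, but based on the signature a minimal implementation might look like this (a sketch mirroring the structure of our generic `filter`):

```typescript
function map<A, B>(xs: A[], func: (x: A) => B): B[] {
  const res: B[] = []
  for (const x of xs) {
    // apply `func` to every element, collecting the transformed values
    res.push(func(x))
  }
  return res
}

// mapping strings (A) to their lengths (B = number)
const lengths = map(['Alice', 'Bob'], (name) => name.length)
console.log(lengths)
// --> [ 5, 3 ]
```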

I have to admit that this was a rather challenging question. Here’s how `map` is used in the wild:

```
const numbers = [1, 2, 3, 4, 5, 6]
const numToString = (num: number): string => num.toString()
const res = map(numbers, numToString)
console.log(res)
// --> ['1', '2', '3', '4', '5', '6']
```

In this blog post we’ve looked into Generics as a way to write code in an abstract and reusable way.

We’ve implemented our own `filter` function to understand why generic programming is useful and how it enables the filtering of lists of `numbers`, `strings` or, more broadly speaking, `T`s.

Once we understood how to read and write generic functions, we also discovered how typing and Generics can help us get a sense of what a function might be doing just by looking at its signature.

I hope that you’ve enjoyed this journey and feel equipped to read and write highly generic code.

Have you ever wondered how YouTube knows which videos to recommend, how Google Translate is able to translate whole texts into a decent version of the target language or how your Smartphone keyboard knows which words and text snippets to suggest while you type your texts?

There’s a very high likelihood that so-called Embeddings were used behind the scenes. Embeddings are one of the central ideas behind modern Natural Language Processing models.

In the following writeup we’ll discover the main building blocks and basic intuition behind Embeddings. We’ll learn how and why they work and how Word2Vec, a method to turn words into vectors, can be used to show that:

\[ king - man + woman = queen \]

All the code we’ll write here can be found in my “Lab” repository on GitHub. Feel free to code along while reading through this tutorial.

Before jumping right into the code we need to make sure that all Python packages we’ll be using are installed on our machine.

We install Seaborn, a visualization tool which helps us to plot nice-looking charts and diagrams. We don’t really work with Seaborn directly but rather use its styles in conjunction with Matplotlib to make our plots look a little bit more “modern”.

`!pip install seaborn`

Next up we need to import the modules we’ll use throughout this tutorial (the last few lines configure Matplotlib to use Seaborn styles).

```
import json
from pathlib import Path
import pandas as pd
import seaborn as sns
import numpy as np
from IPython.display import HTML, display
# prettier Matplotlib plots
import matplotlib.pyplot as plt
import matplotlib.style as style
style.use('seaborn')
```

Since we’re dealing with different datasets we should create a separate directory to store them in.

```
!mkdir -p data
data_dir = Path('data')
```

Let’s start with our first data analysis task. Our goal is to compare and contrast different countries based on their surface area and population. The main idea being that we want to analyze which countries are quite similar and which are rather different based on those two metrics.

The dataset we’ll use is part of the `country-json` project by @samayo. Make sure to take some time to browse through the different JSON files to get an idea about the structure of the data.

In our example we’re only interested in the `country-by-surface-area.json` and `country-by-population.json` files. Let’s go ahead and download the files to our `data` directory.

After that we can define 2 variables which will point to the files on our file system.

```
SURFACE_AREA_FILE_NAME = 'country-by-surface-area.json'
POPULATION_FILE_NAME = 'country-by-population.json'
```

```
!wget -nc https://raw.githubusercontent.com/samayo/country-json/master/src/country-by-surface-area.json -O data/country-by-surface-area.json
!wget -nc https://raw.githubusercontent.com/samayo/country-json/master/src/country-by-population.json -O data/country-by-population.json
```

```
surface_area_file_path = str(data_dir / SURFACE_AREA_FILE_NAME)
population_file_path = str(data_dir / POPULATION_FILE_NAME)
```

During our data analysis we’ll utilize Pandas, a great Python library which makes it dead simple to inspect and manipulate data.

Since our data is in JSON format we can use Pandas’ `read_json` function to load the data into a so-called DataFrame (think of it as an Excel spreadsheet on steroids).

The `dropna` function makes sure that we remove all entries which are undefined and therefore useless for further inspection.

```
df_surface_area = pd.read_json(surface_area_file_path)
df_population = pd.read_json(population_file_path)
df_population.dropna(inplace=True)
df_surface_area.dropna(inplace=True)
```

You might’ve noticed that dealing with 2 separate files will get quite hairy if we want to compare countries based on their 2 metrics.

Since both files contain the same countries with the same names and only differ in terms of their `area` and `population` data, we can use `merge` to create a new DataFrame containing all countries with their respective `area` and `population` numbers.

Another tweak we perform here is to set the `index` to the country name. This way we can easily query for country data based on the country names rather than having to deal with non-expressive integer values.

```
df = pd.merge(df_surface_area, df_population, on='country')
df.set_index('country', inplace=True)
df.head()
```

`len(df)`

`227`

As you can see we have a total of 227 countries in our DataFrame. That’s way too many countries for our needs, especially since we’re about to plot the data in the next step.

Let’s reduce our result set by performing some range-queries on the `area` and `population` data.

```
df = df[
    (df['area'] > 100000) & (df['area'] < 600000) &
    (df['population'] > 35000000) & (df['population'] < 100000000)
]
len(df)
```

`12`

Great! 12 countries are way easier to analyze once plotted.

Speaking of which, let’s do a 2D scatterplot of our 12 countries. We decide to plot the `area` on the X axis and the `population` on the Y axis.

```
fig, ax = plt.subplots()
df.plot(x='area', y='population', figsize=(10, 10), kind='scatter', ax=ax)
for k, v in df.iterrows():
    ax.annotate(k, v)
fig.canvas.draw()
```

Looking at the plotted data we can immediately see some relationships. It appears that Vietnam has a high population compared to its area. Kenya on the other hand has a large surface area but a smaller population compared to its size.

Plotting the data like this helps us to reason about it in a visual way. In addition to that we can also easily validate the integrity of our data.

While we as humans can immediately tell the relationships in our country data just by looking at our plot it’s necessary to translate our visual reasoning into raw numbers so our computer can understand them too.

Looking at the plot again it seems like the distance between the data points of the countries is a good measure to determine how “similar” or “different” the countries are.

There are several algorithms to calculate the distance between two (or more) coordinates. The Euclidean distance is a very common formula to do just that. Here’s the Math notation:

\[ d(x, y) = d(y, x) = \sqrt{\sum_{i=1}^N (x_i - y_i)^2} \]

While the formula might look intimidating at first it’s rather simple to turn it into code.

```
def euclidean_distance(x, y):
    x1, x2 = x
    y1, y2 = y
    result = np.sqrt((x1 - x2) ** 2 + (y1 - y2) ** 2)
    # we'll cast the result into an int which makes it easier to compare
    return int(round(result, 0))
```
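As a quick sanity check (my addition, not part of the original notebook), we can compare our hand-written formula against NumPy’s `np.linalg.norm` applied to the difference vector, using a 3-4-5 triangle where the answer is known to be exactly 5:

```python
import numpy as np

def euclidean_distance(x, y):
    # x holds the two x-coordinates, y the two y-coordinates
    x1, x2 = x
    y1, y2 = y
    result = np.sqrt((x1 - x2) ** 2 + (y1 - y2) ** 2)
    return int(round(result, 0))

# a 3-4-5 triangle: the points (0, 0) and (3, 4) are exactly 5 apart
print(euclidean_distance((0, 3), (0, 4)))
# --> 5

# the same result via the norm of the difference vector
print(int(np.linalg.norm(np.array([0, 0]) - np.array([3, 4]))))
# --> 5
```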

According to our plot it seems like Thailand and Uganda are 2 countries which are very different. Computing the Euclidean distance between both validates our hunch.

```
# Uganda <--> Thailand
uganda = df.loc['Uganda']
thailand = df.loc['Thailand']
x = (uganda['area'], thailand['area'])
y = (uganda['population'], thailand['population'])
euclidean_distance(x, y)
```

`26175969`

If we compare this result to the Euclidean distance between Iraq and Morocco we can see that those countries seem to be more “similar”.

```
# Iraq <--> Morocco
iraq = df.loc['Iraq']
morocco = df.loc['Morocco']
x = (iraq['area'], morocco['area'])
y = (iraq['population'], morocco['population'])
euclidean_distance(x, y)
```

`2535051`

While this exercise was quite simple and intuitive if one is fluent in geography it also introduced us to the basic concepts of Embeddings. With Embeddings we map data (e.g. words or raw numbers) into multi-dimensional spaces and use Math to manipulate and calculate relationships between that data.

This might sound rather abstract and I agree that the relationship between our Country data analysis and Embeddings is still a little bit fuzzy.

Trust me, the upcoming example will definitely result in an “Aha Moment” and suddenly what we’ve learned so far will click!

Now that we’ve seen some of the underlying principles of Embeddings let’s take another look at a slightly more complicated example. This time we’ll work with different colors and their representation as a combination of Red, Green and Blue values (also known as RGB).

Before we jump right into our analysis we’ll define a helper function which lets us render the color according to its RGB representation.

The following code defines a function which takes the integer values for Red, Green and Blue (values in the range of 0 - 255) and renders an HTML document with the given color as its background.

```
def render_color(r, g, b):
    display(HTML('''
        <div style="background-color: rgba(%d, %d, %d, 1); height: 100px;"></div>
    ''' % (r, g, b)), metadata=dict(isolated=True))
```

The color black is represented as 0 Red, 0 Green and 0 Blue. Let’s validate that our `render_color` function works as expected.

`render_color(0, 0, 0)`

Great. It works!

Next up it’s time to download the dataset we’ll be using for our color analysis. We’ve decided to use the 256 Colors dataset by @jonasjacek. It lists the 256 colors used by xterm, a widely used terminal emulator. Make sure to take a couple of minutes to familiarize yourself with the data and its structure.

Downloading the dataset follows the same procedure we used at the beginning of this tutorial when we downloaded the Country data.

`COLORS_256_FILE_NAME = 'colors-256.json'`

`!wget -nc https://jonasjacek.github.io/colors/data.json -O data/colors-256.json`

`colors_256_file_path = str(data_dir / COLORS_256_FILE_NAME)`

Now that we have access to the data in our programming environment it’s time to inspect the structure and think about ways to further process it.

```
color_data = json.loads(open(colors_256_file_path, 'r').read())
color_data[:5]
```

```
[{'colorId': 0,
  'hexString': '#000000',
  'rgb': {'r': 0, 'g': 0, 'b': 0},
  'hsl': {'h': 0, 's': 0, 'l': 0},
  'name': 'Black'},
 {'colorId': 1,
  'hexString': '#800000',
  'rgb': {'r': 128, 'g': 0, 'b': 0},
  'hsl': {'h': 0, 's': 100, 'l': 25},
  'name': 'Maroon'},
 {'colorId': 2,
  'hexString': '#008000',
  'rgb': {'r': 0, 'g': 128, 'b': 0},
  'hsl': {'h': 120, 's': 100, 'l': 25},
  'name': 'Green'},
 {'colorId': 3,
  'hexString': '#808000',
  'rgb': {'r': 128, 'g': 128, 'b': 0},
  'hsl': {'h': 60, 's': 100, 'l': 25},
  'name': 'Olive'},
 {'colorId': 4,
  'hexString': '#000080',
  'rgb': {'r': 0, 'g': 0, 'b': 128},
  'hsl': {'h': 240, 's': 100, 'l': 25},
  'name': 'Navy'}]
```

As we can see there are 3 different color representations available in this dataset. There’s a Hexadecimal, an HSL (Hue, Saturation, Lightness) and an RGB (Red, Green, Blue) representation. Furthermore we have access to the name of the color via the `name` attribute.

In our analysis we’re only interested in the name and the RGB value of every color. Given that, we can create a simple dict whose key is the lowercased color name and whose value is a tuple containing the Red, Green and Blue values respectively.

```
colors = dict()
for color in color_data:
    name = color['name'].lower()
    r = color['rgb']['r']
    g = color['rgb']['g']
    b = color['rgb']['b']
    rgb = tuple([r, g, b])
    colors[name] = rgb
```

To validate that our data structure works the way we described above we can print out some sample colors with their RGB values.

```
print('Black: %s' % (colors['black'],))
print('White: %s' % (colors['white'],))
print()
print('Red: %s' % (colors['red'],))
print('Lime: %s' % (colors['lime'],))
print('Blue: %s' % (colors['blue'],))
```

```
Black: (0, 0, 0)
White: (255, 255, 255)
Red: (255, 0, 0)
Lime: (0, 255, 0)
Blue: (0, 0, 255)
```

While our dict is a good starting point it’s often easier and sometimes faster to do computations on the data if it’s stored in a Pandas DataFrame. The `from_dict` function helps us to turn a simple Python dictionary into a DataFrame.

```
df = pd.DataFrame.from_dict(colors, orient='index', columns=['r', 'g', 'b'])
df.head()
```

Seeing the data formatted in this way we can think of its representation as a mapping of the Red, Green and Blue values into a 3-Dimensional space where for example Red is the X axis, Green is the Y axis and Blue is the Z axis.

You might recall that we’ve used Euclidean distance in our Country example above to determine how “similar” countries are. The main idea was that similar countries have less distance between their data points compared to dissimilar countries whose data points are farther apart.

Another very useful formula to calculate the similarity of data points is the so-called Cosine similarity. The Cosine similarity measures the angle between two vectors in a multi-dimensional space. The smaller the angle, the more similar the underlying data.

Translating this to our color example we can think of every color being represented as a vector with 3 values (Red, Green and Blue) which (as stated above) can be mapped to the X, Y and Z axis in a 3D coordinate system. Using the Cosine similarity we can take one of such vectors and calculate the distance between it and the rest of the vectors to determine how similar or dissimilar they are. And that’s exactly what we’ll be doing here.

The Math notation for the Cosine similarity looks like this:

\[ similarity = \cos(\Theta) = \frac{A \cdot B}{\left\lVert A\right\rVert \left\lVert B\right\rVert} \]

We’re taking the dot-product between the two vectors A and B and divide it by the product of their magnitudes.
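Before wiring the formula into our DataFrame, it can help to try it on two tiny vectors first. A standalone sketch (the helper name `cosine_similarity` is mine, not from the original code):

```python
import numpy as np

def cosine_similarity(a, b):
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    # dot product divided by the product of the magnitudes
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# parallel vectors point in the same direction -> similarity of (roughly) 1.0
print(cosine_similarity([1, 2, 3], [2, 4, 6]))

# orthogonal vectors share no direction at all -> similarity of 0.0
print(cosine_similarity([1, 0, 0], [0, 1, 0]))
# --> 0.0
```

Note that the angle, not the length, drives the result: `[2, 4, 6]` is twice as long as `[1, 2, 3]` yet their similarity is still 1.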

The following code snippet implements this formula. Again, it might look intimidating and rather complicated, but if you take some time to read through it you’ll see that it’s not that hard to understand.

In fact our implementation here does more than just calculate the Cosine similarity. In addition to that we copy our DataFrame containing the colors and add a `distance` column to it which will include the similarity as a value between 0 and 1. Once done we sort our copied DataFrame by that value in descending order. We do this to see the computed values when querying for similar colors later on.

```
def similar(df, coord, n=10):
    # turning our RGB values (3D coordinates) into a numpy array
    v1 = np.array(coord, dtype=np.float64)
    df_copy = df.copy()
    # looping through our DataFrame to calculate the distance for every color
    for i in df_copy.index:
        item = df_copy.loc[i]
        v2 = np.array([item.r, item.g, item.b], dtype=np.float64)
        # cosine similarity calculation starts here
        theta_sum = np.dot(v1, v2)
        theta_den = np.linalg.norm(v1) * np.linalg.norm(v2)
        # check if we're trying to divide by 0
        if theta_den == 0:
            theta = None
        else:
            theta = theta_sum / theta_den
        # adding the `distance` column with the result of our computation
        df_copy.at[i, 'distance'] = theta
    # sorting the resulting DataFrame by distance
    df_copy.sort_values(by='distance', axis=0, ascending=False, inplace=True)
    return df_copy.head(n)
```

To validate that our `similar` function works we can use it to find colors similar to `red`.

`similar(df, colors['red'])`

We can also pass in colors as a list of RGB values.

`similar(df, [100, 20, 120])`

Since it’s hard to imagine what color `100`, `20` and `120` represent, it’s worthwhile to use our `render_color` function to see it.

`render_color(100, 20, 120)`

Looking at the list of most similar colors from above it appears that `darkviolet` is quite similar to `100`, `20`, `120`. Let’s see how this color looks.

```
darkviolet = df.loc['darkviolet']
render_color(darkviolet.r, darkviolet.g, darkviolet.b)
```

And we can validate that `darkviolet` in fact looks quite similar to `100`, `20`, `120`!

But it doesn’t end here. Our 3 color values are numbers in the range of 0 - 255. Given that, it should be possible to do some basic Math computations such as addition or subtraction on them.
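For instance, treating two colors as plain NumPy vectors, addition is just element-wise math (a quick illustration, independent of our DataFrame):

```python
import numpy as np

red = np.array([255, 0, 0])
blue = np.array([0, 0, 255])

# adding red and blue light yields magenta
magenta = red + blue
print(magenta.tolist())
# --> [255, 0, 255]
```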

Since we only have access to 256 different colors it’s highly unlikely that our resulting color values for Red, Green and Blue will exactly match one of our 256 colors. That’s where our `similar` function comes in handy! The `similar` function should make it possible to calculate a new color and find its most similar representation in our 256 color dataset.

We can look at a Color Wheel to see that subtracting the `red` color from the `purple` one should result in a Blueish color. Let’s do the Math and check whether that’s true.

```
blueish = df.loc['purple'] - df.loc['red']
similar(df, blueish)
```

And sure enough the most similar colors in our dataset are Blueish ones. We can validate that by rendering `darkblue`, one of the best matches.

```
darkblue = df.loc['darkblue']
render_color(darkblue.r, darkblue.g, darkblue.b)
```

Here’s a simple one. If we have Black and add some White to the mix we should get something Greyish, correct?

```
greyish = df.loc['black'] + df.loc['white']
similar(df, greyish)
```

And sure enough we do. Rendering `grey93` shows a light grey color.

```
grey93 = df.loc['grey93']
render_color(grey93.r, grey93.g, grey93.b)
```

Let’s end our color exploration with a more complex formula. So far we’ve only done some very simple Math like subtracting and adding colors. But there’s more we can do. We can also express our search for a color as a “solve for x” problem.

Mixing Yellow and Red will result in Orange. We can translate this behavior to other colors as well. Here we ask “Yellow is to Red as X is to Blue” and express it in Math notation to get the result for X.

```
# yellow is to red as X is to blue
yellow_to_red = df.loc['yellow'] - df.loc['red']
X = yellow_to_red + df.loc['blue']
similar(df, X)
```

Our calculation shows us that `lightseagreen` is to Blue as Yellow is to Red. Intuitively that makes sense if you think about it.

```
lightseagreen = df.loc['lightseagreen']
render_color(lightseagreen.r, lightseagreen.g, lightseagreen.b)
```

In the beginning of this tutorial I promised that once we’re done we’d understand the intuition behind Word2Vec, a key component of modern Natural Language Processing models.

The `Word2Vec` model does to words what we did with our colors represented as RGB values. It maps words into a multi-dimensional space (our colors were mapped into a 3D space). Once such words are mapped into that space we can perform Math calculations on their vectors the same way we e.g. calculated the similarity between our color vectors.

Having a mapping of words into such a vector space makes it possible to do calculations resulting in:

\[ king - man + woman = queen \]
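To make the analogy tangible, here’s a toy version of that calculation with hand-picked 2D “word vectors” (the numbers are invented purely for illustration; real Word2Vec vectors are learned from text and have hundreds of dimensions):

```python
import numpy as np

# Hand-picked toy vectors, NOT learned embeddings: one axis loosely
# encodes "royalty", the other "gender".
vectors = {
    'king':  np.array([0.9, 0.9]),
    'queen': np.array([0.9, 0.1]),
    'man':   np.array([0.1, 0.9]),
    'woman': np.array([0.1, 0.1]),
    'apple': np.array([0.3, 0.5]),
}

def nearest(target, vectors, exclude=()):
    # return the word whose vector is closest (Euclidean) to `target`
    candidates = {w: v for w, v in vectors.items() if w not in exclude}
    return min(candidates, key=lambda w: np.linalg.norm(candidates[w] - target))

result = nearest(vectors['king'] - vectors['man'] + vectors['woman'],
                 vectors, exclude=('king', 'man', 'woman'))
print(result)
# --> queen
```

The arithmetic removes the “male” component from king and adds the “female” one, landing right on queen’s vector, which is the same trick Word2Vec performs in its much higher-dimensional space.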

In this tutorial we took a deep dive into the main building blocks and intuitions behind Embeddings, a powerful concept which is heavily utilized in modern Natural Language Processing models.

The main idea is to map data into a multi-dimensional space so that Math calculations from the realm of Linear Algebra can be performed on it.

We started our journey with a simple example in which we mapped the surface area and population of different countries into a 2D vector space. We then used the Euclidean distance to verify that certain countries are similar while others are dissimilar based on their metrics.

Another, more advanced example mapped colors and their RGB representation into a 3D vector space. We then used Cosine similarity and some basic Math to add and subtract colors.

With this knowledge we’re now able to understand how more advanced models such as Word2Vec or Doc2Vec make it possible to do calculations on words and texts.

You can find more code examples, experiments and tutorials in my GitHub Lab repository.

Eager to learn more? Here’s a list with all the resources I’ve used to write this post.
