Your Landing Page is finally live and the metrics are promising, but your conversion rate is still lagging behind? There seems to be interest in your product, but for some reason people aren't convinced enough to click that call-to-action.
Maybe you're running an ecommerce store and sales are great. But aren't there any tweaks you can apply to increase your sales even further?
Sometimes all that's necessary is a little nudge. In this blog post we'll examine what that nudge might be and how you can steer visitors toward taking action rather than closing the browser tab and moving on. Let's dive in.
Disclaimer: This is by no means a scientific write-up. If you're interested in the fascinating topic of Psychology I'd encourage you to pick up some legit literature such as "Thinking, Fast and Slow" or to connect with an expert who majored in Psychology.
Before jumping right into the triggers, let's take a step back and understand why knowing the basics of human psychology is such an important and fascinating skill to possess.
Pretty much everything we do on a day-to-day basis is somehow influenced by our unconscious, the part of our being we cannot directly control. One of the many sources of information that has formed and still informs our unconscious is the basics of human behavior and desires. At the core, humans are very social beings. Forming tribes helped us survive and thrive in uncertain times. It's important for us to show that we're worthy of being part of the tribe. Being an outsider by taking unapproved actions put one at huge risk and disadvantage. Risks that usually translated into pain and death. As it turned out, it made a whole lot of sense to play by the rules and adhere to the consensus.
Times have changed and nowadays there's a lot less risk involved when making decisions. However, tribal thinking is still deeply rooted in us and therefore influences the actions we take.
Do you study reviews before buying products online? Do you browse through the comments on a YouTube video? Both of those actions are driven by the fact that we want to see what our peers are thinking and how they might think about us when we buy the item in question or like the video's content.
Now that we've learned more about basic human needs and what drives us, it's time to take a deep dive into the different Psychological Triggers you can use to tap into those needs and nudge your visitors toward taking certain actions.
To some extent, everyone is an egoist. While that might sound harsh, it's a reality that people are obsessed with themselves. Knowing this puts you at a huge advantage because you can articulate your copywriting in a way that speaks directly to your user's "self".
Don't talk about your company or the achievements and prizes you won. Don't write a wall of text outlining every single feature your product offers. Your company has been in business since the late '90s? Your app has a "dark mode"? That's all great, but nobody really cares!
What matters are the benefits your users will experience when using your product. What are they dreaming about? What pain does your product address? How does your product help them live a better life? That's all that matters.
Update your copywriting to talk directly to your users rather than about you. Paint a picture of their dream and how your product helps them turn that dream into a reality.
There are only 9 pages left to complete the chapter in your favorite book? You only have to collect one more item to reach the next level in your favorite video game?
People love to make progress. It just feels good to take a project all the way from 0% to 100%. Even better if the progress is visualized and you can see the progress bar increasing with every little step of the way.
Use this love for progress in your products and services. Do you have a time-consuming onboarding process? Break it up into individual pieces and show a progress bar at the top of the page which updates as your users fill out the forms. Are you looking for more Newsletter sign ups? Display a "50% done" progress bar at the top of your Newsletter Form and call out that your visitor is "almost there".
Whenever you're looking for more commitment from your users, break the task down into individual pieces and add some information which makes it clear as to what the current status is and what is necessary to make progress.
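The progress nudge described above boils down to simple arithmetic: compute a completion percentage from the number of finished steps and surface it after each one. Here's a minimal sketch — the step names are made up, and the optional `head_start` parameter is one possible way to model the "50% done" newsletter trick:

```python
# Minimal sketch of a step-based progress indicator, e.g. for a
# multi-page onboarding form. Step names are made up for illustration.

def progress_percent(completed_steps, total_steps, head_start=0):
    """Percentage to display; 'head_start' lets the bar begin at e.g. 50%
    so the visitor feels "almost there" from the start."""
    remaining = 100 - head_start
    return head_start + round(remaining * completed_steps / total_steps)

steps = ["profile", "preferences", "team", "billing"]
for done in range(len(steps) + 1):
    print(f"{progress_percent(done, len(steps))}% done")  # 0, 25, 50, 75, 100
```

For the newsletter form, `progress_percent(0, 2, head_start=50)` would already show "50% done" before the visitor types anything.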
Time is one of the resources which cannot be created. Time will move on, no matter what you do. Given this, everyone values their time more than anything else.
There are a couple of different ways you can work with the "Time trigger".
The first is to call out how much time your product will help your users save. Do you automate a time-consuming process? Do the math and clearly show your users how much time they'll save when they use your app. Time they can spend with their significant other or kids (you want to call that out as well).
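"Doing the math" for such a claim is a quick back-of-the-envelope calculation. A sketch — every number below is a made-up assumption you'd replace with figures from your own product:

```python
# Back-of-the-envelope "time saved" calculation for marketing copy.
# All numbers are illustrative assumptions, not real product data.

MINUTES_SAVED_PER_TASK = 15   # e.g. manual task takes 20 min, automated takes 5
TASKS_PER_WEEK = 10
WEEKS_PER_YEAR = 48           # assume ~4 weeks of vacation

def hours_saved_per_year(minutes_per_task, tasks_per_week, weeks=WEEKS_PER_YEAR):
    """Translate a per-task saving into an annual, relatable number of hours."""
    return minutes_per_task * tasks_per_week * weeks / 60

hours = hours_saved_per_year(MINUTES_SAVED_PER_TASK, TASKS_PER_WEEK)
print(f"Save roughly {hours:.0f} hours per year")  # 15 * 10 * 48 / 60 = 120
```

The resulting figure ("save 120 hours a year") is far more tangible in copy than "saves 15 minutes per task".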
Another usage of the "Time trigger" can often be found in conjunction with scarcity. "25% off for the next 24 hours", "Registrations only open today". Those are the phrases you'll usually find on ecommerce stores or online course websites. You can do the same and offer time-bound discounts. If you do so, please make sure that your offer is legit. Use time and scarcity in a dishonest way and you'll immediately lose trust.
The "Basic Plan" is 99 USD while the "Pro Plan" which has significantly more features is only 129 USD? Most people would pick the "Pro Plan" given that it has more features to offer. Even if those features will never be used. Looking at it through the lens of logic it doesn't really make sense to go with the "Pro Plan" if the "Basic Plan" would be more than enough in terms of value delivered.
As it turns out, it's not about what the product is, it's about what it stands next to. You can use this fact to steer the decision-making process in the direction of your liking. Make it easy for your users to compare and contrast different offerings. Even if you don't plan to sell a cheaper option, offer it anyway and put it next to the one which is slightly more expensive but more feature-rich. You'd be amazed to see that most of your users will go with the more expensive option even if they'll never use all the features it offers.
A major driving force ingrained in every one of us is fear. Fear comes in many different forms. We're fearful of failure, of bad things happening to our loved ones and of the inevitable death.
Another form of fear is the "fear of missing out", a.k.a. FOMO. Always being in the know gives us a huge advantage and an edge over our competition. Missing out on a major new trend or business opportunity could have detrimental consequences.
You can trigger the "fear of missing out" by subtly talking about the possibility of missing out on a huge opportunity when passing on your offering.
A prime example which shows this in action are travel booking websites. Remember that "12 people have visited this listing in the last 6 hours" notification on the hotel website? You don't want to be the person who misses out on this great deal!
As we already learned, we humans are social animals. Being social doesn't only mean that we like to socialize with each other, it also means that we tend to mimic and "copy" one another to be more likeable and therefore accepted in our peer group.
Do you want to buy something but you're not sure which option to go with? It might be a good idea to study the testimonials, reviews and ratings to make a more informed decision. The new Todo-app you just discovered looks quite compelling but why bother given that you're already using the stock Todo-app on your phone? As it turns out, at least a handful of Fortune 500 companies and Silicon Valley juggernauts are using it in their day-to-day so it can't be that bad.
You want to be a part of your tribe so naturally you're looking at what your peers are doing. This is what "Social Proof" is all about. Showing that you're about to make the right decision because peers in your tribe did the same.
Do you already have happy customers using your app? Take some time to conduct some interviews and translate those into testimonials you can put on your Landing Page. Maybe you have some large enterprises using your product. Write some Case Studies to show what value your product offers. There are entire software applications you can use to display social proof widgets on your homepage or ecommerce store ("Someone in New York bought this item 2 hours ago").
Everyone wants to believe that all the decisions she makes are entirely driven by logic. As it turns out this is not true most of the time. You and I buy products based on emotion rather than logic. Just pay close attention to the advertisements you see every day. Almost every single one speaks in the language of emotions rather than logic.
But once the decision has been made and the purchase is done we tend to justify our actions based on logic. "I bought the car because there was a discount and a friend of mine knows the car dealership owner. Furthermore I get the first service for free and it's also an older model so it's cheaper than buying a brand new one. ...". Observe yourself and others and you'll notice that all of us try to retroactively fit our decision making into logical bits and pieces. Even if it sometimes results in contradictions and lies.
What you want to do is supply additional information about your product which helps your users justify the purchase in a logical manner. Have a dedicated "Features" page which lists out all the features your product offers. While you'll never be selling by talking about your product's features, it will help your users justify their decision after buying your product. You can see this "Emotional Sales Pitch + Detailed Features Page" pattern being used on almost every popular SaaS product marketing page.
I'd like to invite you to take a look at your current Product website / app to identify places where you could use one of the triggers we just discussed.
Is a time-consuming and tedious onboarding process required (e.g. collecting some user information to create a profile)? Maybe you want to split the form up into several pages and show a progress bar at the top of every page.
Does your current Landing Page focus on the features rather than your users' pain and the way in which your product alleviates it? Update your Landing Page copy to talk about the user rather than yourself and your product.
Use as many different triggers as you need but keep in mind that you should strike a healthy balance and use them in ethical ways.
Human beings are social animals. A lot of our decision making is driven by our emotions and unconscious rather than logic and self-control.
In this post we learned about several Psychological Triggers you can use to nudge your users into taking specific actions. The more you're familiarizing yourself with these triggers, the more you'll see them being used in the wild.
Leveraging such triggers is a powerful way to make your product more successful and put it ahead of your competition. However keep in mind to solely use them in ethical ways. Play the long-term game by being genuine and helpful.
I hope that you enjoyed this article and I'd like to invite you to subscribe to my Newsletter if you're interested in more posts like this.
Do you have any questions, feedback or comments? Feel free to reach out via E-Mail or connect with me on Twitter.
Is there a way to tell whether a new user is more likely to buy your product later down the road?
What techniques can be used to get your user more involved and invested in the value you offer?
In this blog post we'll discover a simple, yet oftentimes overlooked method to build and foster a closer relationship with your new users.
Let's start by taking a quick look at the typical journey every new Landing Page visitor walks through.
The very first thing you have to nail to get users on your Landing Page is your promise. It needs to be persuasive and catch the individual's attention such that she clicks on your link and visits your page. Most of the time the promise is the copy of your advertisement ("30% off on all electronics during Cyber Monday") or the title of the Landing Page itself ("Buy electronics with free same-day delivery").
Once the user visits the Landing Page you need to ensure you deliver on your promise. The Landing Page copywriting must continue the story which was introduced prior to the click. Do you offer a discount? Make it immediately visible and explain how this discount can be redeemed (coupon code, automatic subtraction during checkout, ...). Not doing so will break the continuity, forcing your visitors to leave in disillusionment.
Generally speaking you only have a few seconds to convince your visitor to stay on your Landing Page and further explore your offering. Your copywriting should therefore be focused on your user's problem, its solution and why she can trust in your ability to deliver on your promise.
Following the main "Landing Page Blueprint", you start by describing the current pain your user is experiencing while painting the picture of the "dream situation" she can find herself in after using your product. The solution's key benefits (i.e. features) are used to further support the existence of this "ideal world". Testimonials are then used as social proof to show that this "dream situation" isn't just a dream but something others are already experiencing.
By now you should have your visitor's attention and can ask for the call-to-action, which can be anything ranging from scheduling a demo call to a free trial sign-up.
Finalizing the necessary steps to perform the call-to-action (e.g. entering the E-Mail address) is followed by a "Thank You" page and a subsequent "Welcome Onboard" E-Mail which hits your visitor's inbox.
It's time to pat yourself on the back, as you were able to get permission from a complete stranger to contact her with updates on your company and its services. Providing more value via useful blog posts further nurtures this relationship and can turn your follower into a potential customer later down the road.
But we're getting ahead of ourselves. In fact it isn't the time to celebrate just yet as we've missed out on a huge opportunity to further solidify our relationship with our new prospect.
By following the user journey we just outlined above it should now be clear that in order for a user to follow through and perform the desired call-to-action you have to persuade her that your company is trustworthy and your product solves her problem in a reasonable way. Turning a complete stranger into an interested subscriber means that the visitor in question is aligned with your vision and very likely excited about the future you're striving for. A future in which the current pain points she is experiencing are alleviated.
Given such excitement, why not ask for a second action she can perform? If the user signed up for your blog newsletter, why not ask if she wants to follow you on Twitter and Facebook to get some behind-the-scenes footage? If the user signed up for a webinar, why not politely ask if subscribing to your blog newsletter might be of interest as well?
You might be wondering how and where you should ask for this favor. Simply add a small paragraph and a button or link to your second call-to-action on your "Thank You" page (the one which is displayed after the first call-to-action was performed) or your "Welcome Onboard" E-Mail (the one you send out after the user was added to your E-Mail list).
There are only two very important things you should keep in mind when asking for a second action.
The very first thing you want to ensure is to genuinely thank your visitor for the trust you just earned. After that you can pitch the second action. Don't be too pushy. Make it 100% optional. Don't force your users to perform the second action as well to finalize the sign up process!
And the second, very important thing to ensure is to keep your funnel in mind while asking for the second action. Never ask for something "too far away" from the user's current position in the funnel. Was the first action a blog newsletter subscription? Asking for a trial sign-up might be too much! "Mind the gap" and only ask for a second action which is near to the user's current position in your funnel!
Users who perform multiple actions are generally more engaged and invested in your vision. Offering multiple actions to perform will help you in generating more qualified leads.
I'd like to invite you to take a look at your current Landing Page workflow. Do you ask for a second action? If so, how much further down the funnel is this action? If not, what second action could you ask your users to perform? And where do you want to ask for it? On the "Thank You" page or in the "Welcome Onboard" E-Mail?
Once implemented, ensure that you monitor how many visitors actually perform those actions. Also ensure to take these metrics into account when analyzing user engagement and purchasing behavior.
Having a solid Landing Page is the single best tool to attract a visitor's attention and quickly articulate the core benefits your product has to offer. The Landing Page's main goal is to build trust and convince the user to perform a call-to-action in order to get more information and stay updated on the developments of your product and company.
Performing the call-to-action means that the user is aligned with your vision and excited about your offering. Why not offer another way to get even more involved by presenting a second action the user can take? If someone signs up for your webinar it's likely that they're also interested in your blog.
Politely asking your user to perform a second action can result in a closer relationship, more engagement and more qualified leads as only those users who are fully invested in your mission are likely to perform that second action.
I hope that you enjoyed this article and I'd like to invite you to subscribe to my Newsletter if you're interested in more posts like this.
Do you have any questions, feedback or comments? Feel free to reach out via E-Mail or connect with me on Twitter.
You could move forward and build an MVP to validate your idea, but should you really invest the time and money to do it?
Validating that there's enough demand for your solution later down the road is one of the most important steps you need to take in order to build a successful business. As it turns out there's a simple, yet elegant way to figure out if you're onto something.
In this blog post we'll walk through this process step-by-step so that you can make an informed decision whether you should double down or go back to the drawing board.
Most of us see competition as a sign of market saturation. If there are too many competitors it's tough to cut through the noise with our very own solution. Established competitors have deep pockets and will certainly outspend you in their marketing efforts.
Furthermore categories are usually associated with their market leaders. Looking for a solid CRM (Customer Relationship Management) solution? That's what Salesforce is for. Real-time business chat software? Slack is the big player here. E-Mail Marketing? Go with Mailchimp.
While competing with Salesforce might be a bad idea, having competitors shouldn't turn you off as it comes with huge upsides you can exploit.
Having competitors means that the market is already validated. Companies are already spending money on a solution. If you try to enter, let alone define, a brand new market you'll spend most of your time and effort educating your future potential customers about your product and why it's important for their business to succeed.
It's likely that they're not aware of the severity of the problem you're trying to solve for them. Even if you succeed in educating your future users about your product you still have to raise awareness for the necessity to buy your product in higher management, since they usually allocate budgets and therefore have to get on board as well.
Raising such awareness is a grind which will cost you a lot of money spent on marketing and sales. Money you don't really have if you're starting from scratch with no VC funding. It's way easier to enter a well defined and understood category. A category where a lot of people are already spending money on solutions to solve their problems at hand.
Now that we know that we should embrace rather than avoid competition we can use this fact to do some market research and estimate if there's enough demand for our product. In addition to that we should also check if entering the market is sustainable long-term.
The following 3-step process is a simple, data-driven framework which helps you in answering those questions.
As an example let's say that we want to build a video conferencing tool specifically designed for job interviews done to hire remote employees. Imagine a tool which has built-in support for whiteboarding, code interviews, video playback for mocked-up "stress test" scenarios, facial emotional recognition, recordings, etc.
Should we move forward with that idea? Is there enough demand? Let's run through the process and figure it out!
The very first step we need to take is to identify our key competitors. Let's think about how remote job interviews are primarily done these days? What usually happens is that after an initial screening the applicant walks through a series of video calls with various interviewers who vet the candidate based on different criteria.
The core technology used throughout this process is a reliable video conferencing software so let's see who the strongest players are in that market.
To identify the most widely used video conferencing software we start our research on a product review site called Capterra.
Let's enter "Video Conferencing" in the search bar as we're looking for video conferencing software.
As we can see there's an even more specific category called "Web Conferencing Software" which, based on its description, looks like exactly what we're after. Clicking on the link directs us to the page where we can explore all the web conferencing solutions in use by businesses these days.
The next thing we want to do is to sort the products by "Most Reviews" because the default ranking includes advertisements at the top.
Next up we take some time to scroll through the list to figure out who our main competitors would be. Skim through the reviews to see which product is often used in the context of the idea you're pursuing (remote job interviews in our example).
After digging deeper into the specifics we can identify the following tools as our competitors:
The more time you spend on this step the more competitors you'll uncover. For the rest of this post we'll move forward with these three tools to keep things simple.
A note on finding the main competitors:
When following the aforementioned steps you'll usually find the big names at the very top of the list. In our case that's Zoom, GoToMeeting or Google Meet. That's totally fine and expected. What you want to do is to look for software which is more specialized and tailored towards the audience and niche you're about to serve.
Based on my own experience I can say that a huge advantage for remote job interviews would be a no-login, no-download and no-installation software package (it should be accessible from within the user's browser with 1 click). In addition to that, built-in screen sharing and encryption is a must. Those are some of the problem-specific characteristics you should take into account to narrow down your competition.
Now that we know who our main competitors are it's time to do more research on the companies behind the tools.
We do this via a website called Crunchbase. Crunchbase is a platform which collects, processes and serves business information about private and public companies with a main focus on venture-backed startups.
The next step in our research is pretty simple. We enter our competitor's name in the search bar and pull up the organization's profile.
What we want to focus on specifically is whether they've raised money in the past and, if so, what amount. In the "Livestorm" example above we can see that they've raised a total of 4.6 Million EUR in 2 rounds (Seed Round and Series A).
We want to do this research for all the competitors we've identified. Simply search for their organization's profile and see if they've raised VC money.
"Wire" for example has raised 14.5 Million EUR and "Whereby" was even acquired by "Videonor".
The main goal with this step is to see if there's money invested in the market. The more money is already invested, the better. Ideally the majority of your competitors should've raised money in the past.
The very next thing we should do is to figure out how much web traffic our competitors are getting.
The more visitors they get, the more demand there is for their product. The more demand, the easier it is for us to grab some of that traffic via paid ads or good content marketing and redirect it to our, more specialized solution.
Website traffic numbers are usually considered highly sensitive and kept behind closed doors. Luckily enough Crunchbase has us covered as they're partnering with SEMrush, a tool for professional SEOs and SEMs.
To get a rough estimate as to how much traffic our competitors are getting we can simply click on the "Technology" tab in their Crunchbase profile and look at the "Monthly Visits" number.
Doing so reveals that "Livestorm" gets roughly ~305,255 visits per month. "Wire" sits at ~153,904 monthly visits while "Whereby" attracts ~1,365,524 visits every month.
Again, do this for each and every identified competitor. You should be looking at at least ~150,000 visits per month for each of your competitors. The more, the better.
That wraps up your main market research process. The main objective in doing this research is to validate that there's enough demand such that your new, specialized solution can find its potential customers more easily.
The two signals you want to look for is invested interest in the market via VC funding and enough organic search traffic you can tap into later down the road.
You want to move forward with your idea if you can validate that at least ~4 of your competitors have received funding and are getting more than ~150,000 visits every month.
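The decision rule above can be sketched as a small script. The competitor figures are the examples from this post, and the thresholds are the ballpark numbers just mentioned — treat both as rough heuristics, not hard rules:

```python
# Rough market-validation check based on the two signals discussed above:
# VC funding and monthly web traffic. Thresholds are ballpark figures.
FUNDED_COMPETITORS_MIN = 4
MONTHLY_VISITS_MIN = 150_000

def market_looks_validated(competitors):
    """competitors: list of dicts with 'funded' (bool) and 'monthly_visits' (int)."""
    funded = sum(1 for c in competitors if c["funded"])
    enough_traffic = all(c["monthly_visits"] > MONTHLY_VISITS_MIN for c in competitors)
    return funded >= FUNDED_COMPETITORS_MIN and enough_traffic

# Example data loosely based on the numbers from this post
competitors = [
    {"name": "Livestorm", "funded": True, "monthly_visits": 305_255},
    {"name": "Wire", "funded": True, "monthly_visits": 153_904},
    {"name": "Whereby", "funded": True, "monthly_visits": 1_365_524},
]
print(market_looks_validated(competitors))  # False: only 3 funded competitors so far
```

With only three identified competitors the check fails, which is exactly the signal to keep digging in step 1 until you've covered the market more thoroughly.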
I'd like to invite you to take some time this week to run through this process with the product you have in mind.
Take the numbers you should be looking for outlined above as ballpark numbers and take other, niche-specific metrics into account (e.g. Facebook followers if you're working on a Social Media related solution).
This process might take quite a while and can feel laborious but trust me that investing this time upfront can save you months, if not years of wasted time and effort.
Starting a new business adventure is a risky endeavor. It takes years and a lot of learning until success materializes. One of the most common reasons why startups fail is that they solve a problem nobody is willing to pay for.
In this blog post we've discovered a simple 3-step process you can apply to mitigate the risk of building something nobody wants. Via market research you can validate that there's enough demand to enter the market and carve out your very own niche.
I hope that you enjoyed this article and I'd like to invite you to subscribe to my Newsletter if you're interested in more posts like this.
Do you have any questions, feedback or comments? Feel free to reach out via E-Mail or connect with me on Twitter.
Let's take a closer look at the 10 most common mistakes you can make when building your Landing Page and how you can avoid those.
It's important to define and understand what a Landing Page is and what it isn't. Generally speaking a Landing Page is a dedicated, self-contained website whose sole purpose is to attract the visitor's attention and convince her to perform a pre-defined goal or action (often called CTA or "Call to action"). Such a goal can be a sign-up for a free trial, a newsletter subscription or a booking for a demo call.
It's tempting to put more information on a Landing Page. All too often Landing Pages include a team section, an in-depth walkthrough of every feature the product has to offer, a list with all the different integration partners, links to pages with case studies and different use cases and sometimes even an "About" page which describes in great detail where the founders came from and what the company is all about.
Your Landing Page isn't a Landing Page anymore if it includes more information than necessary to explain your unique value proposition to your target audience. If that's the case, think about what information your future potential customers need to evaluate whether your product suits their needs. Only keep that information on your Landing Page and move everything else onto your dedicated Marketing Website.
This mistake goes both ways. Don't use your Marketing Website as your Landing Page. Users will get overwhelmed by the sheer amount of information they'd have to work through. As described above, extract the essentials into a dedicated Landing Page and use that as your first point of contact with new visitors.
This is especially important when you use paid traffic channels to acquire new customers. Never drive paid traffic to your generic Marketing Website. Always use a Landing Page as a destination for your paid ad traffic.
Do you have a navigation at the top which links to other pages? Remove it! Do you have a link to a YouTube video which shows how your product works? Remove the link! Do you link to your social media accounts? Remove them! Do you have a list with links in your footer? Remove them!
I think you see a pattern here. Remove everything which makes it easy for your visitor to leave your site. Your main goal is to keep them on the site and engaged with your solution to their problems. Everything else is a distraction which might interrupt your visitor's thought process.
Of course it's not all doom and gloom. Here are a couple of techniques and workarounds you can use if your Landing Page has to link to external resources (e.g. legal documents):
Use an anchored navigation which helps your visitors scroll to a section on your page. If you're doing that you want to ensure that clicking on the link doesn't result in an "abrupt jump" but rather a smooth scroll, so that the user isn't lost and sees that there's more content to explore in-between the navigation and the destination.
Embed your YouTube videos and make downloadable content (such as .pdf files) truly downloadable (avoid using the in-browser preview mode).
Use a Popup window / Lightbox for content you'd otherwise link to externally.
Your product uses the latest and greatest AI research and stores its model weights on the Blockchain while leveraging Smart Contracts to facilitate transactional exchanges between your users? Given such technological advancements your competition isn't even close. Your steadily growing user base of loyal early adopters makes it even harder to compete with you.
All of that sounds like a compelling reason to choose you over your competition. But how do you back up your claims? Where's the evidence that you can live up to the hype?
There's the mantra of "fake it 'til you make it" and painting the picture as to how your future customer will benefit from the product isn't wrong.
What's harmful is over-hyping your solution. Even non-technical visitors can smell BS from miles away. If you're just dropping buzzword after buzzword without backing any of it up you're fostering distrust. And building trust is the very first step when a complete stranger sees your product for the very first time.
Again, outlining your vision isn't wrong. Overselling what you're actually doing without backing it up is.
The topic of fake testimonials is related to the "Show, don't tell" principle we just covered. It's totally fine if you don't have any testimonials just yet. In that case, skip them for now and add them later down the road.
Fake testimonials do more harm than good. Most of the time it's pretty easy to spot them, as they're generic and don't include any specifics as to how your product helped the customer in question get from A to B.
A related bad practice is including company logos without explaining how your product is used by that company and what impact it had. Having someone from Apple like your tweet about your business isn't enough to put their logo on your Landing Page.
Getting genuine and convincing testimonials is a whole "science" in and of itself, and I plan to cover it in a dedicated blog post. If you do have customers you can get testimonials from, jump on a call with them and ask how your product changed their business for the better. Try to figure out what kind of emotional impact your product had (e.g. saving enough time to pick up the kids from school) and what the business ROI is. Outline what the day-to-day looked like before and after using your product. Make it specific and tangible.
Try to avoid using stock photography as much as possible. The reason for doing so is simple, yet important to understand.
Put yourself into the shoes of your visitor for a minute and imagine they once visited a website that used the same stock photography you're showing on your Landing Page right now. They tried that product but had a bad customer experience with it. Unconsciously, they still connect that bad experience with the visuals the company used: its logo, its color scheme and its stock photography. This negative, emotional feeling is still present while they explore your Landing Page, even if they're not consciously aware of it.
Another reason not to use stock photography is that it makes a very unprofessional impression when users recognize the same image on different websites.
If you still want to use stock photography, make sure you buy a license that restricts usage so that you're one of only a few companies allowed to use that photo. Another option is to use Google Images' "reverse image search" to figure out how often the photo has been used. Pay special attention to the businesses that have used it: the more different they are from what you're doing, the better.
As an early stage business you want to move fast and save as much time as possible. You're not a developer / designer so you go to a theme marketplace to buy your very own Landing Page theme. Those themes were designed and built by professionals.
They look beautiful and are easy to use because you just have to fill in the blanks. What could go wrong?
"Filling in the blanks" is exactly the problem. Templates were built to sell. They demonstrate all the features they have to offer to show you that you can use them in various scenarios. It doesn't matter if you're working on an iPhone app, an ebook or a SaaS business. The template got you covered.
Adopting the template's structure and just filling in the blanks will leave you with bad copywriting. Your Landing Page won't tell a convincing story. It'll be hard to figure out what the product is about, what problems it solves and how the customer benefits from using it.
What you want to do instead is start with a blank page. Write down the copy for your headline and subhead. Outline the problem and paint a picture of how your product makes your customer's life easier. Think about the compelling story your Landing Page should tell. Only once the copywriting is done should you work on the Landing Page and its design. Following this advice will help you communicate your value proposition more clearly. Your Landing Page will feel like a thoughtfully crafted unit rather than a loose collection of texts and images.
Every couple of years certain design trends stand out. Back in ~2010 the trend was Skeuomorphism, which was superseded by Flat Design around 2015.
While it's important to adopt new user interface and user experience design trends in your SaaS application to make usage as simple as possible, doing so can be detrimental to your Landing Page and its conversion rates.
In fact, Landing Pages you and I would call "boring" or "ugly" usually outperform beautifully designed ones by a large margin. The reason is simple: beautiful design creates additional noise and distractions that drive the visitor's attention away from your Landing Page's main goal of getting the user excited to perform the call to action.
You don't believe it? Go to the "Templates" section of Leadpages, sort the templates by "Highest Converting" and compare the designs.
When it comes to your Landing Page, keep it as simple and distraction-free as possible. This also means shying away from trends such as "Conversational Forms", "Scroll Hijacking", etc.
One of the most common reasons why Landing Pages are underperforming is the lack of good copywriting.
Copywriting is a very underappreciated skill everyone can benefit from mastering. In simplest terms, copywriting is the art of writing compelling and convincing text which resonates with your target audience.
Your Landing Page has to convince a visitor in less than 5 seconds that it's worth spending more time exploring your product. Do your headline and subhead explain what the product is about and what pain it addresses? Show your Landing Page to friends and family for 5 seconds and ask them what they think your product is about. Can they articulate what it does? A good trick is sometimes to swap the headline and subhead, as the subhead is often the more explanatory of the two.
But copywriting is about more than getting the headline right. Do you want to collect E-Mail addresses from business leads? Change your form label to "Business E-Mail Address". Change the text on your call-to-action button by thinking about what you'd expect to happen when clicking it (e.g. "Join the Community"). Do you always get asked whether your webinar will be recorded? Add the text "We'll send out the recordings of the webinar, so register even if you're not sure you can make it".
Copywriting is a whole profession in and of itself and definitely worth learning more about. The main takeaway is to be conscious with the words you choose. Put yourself into the shoes of your visitor. What would convince you to read more about the product? What questions or concerns are in the back of your mind you could address and answer right away?
The Landing Page's main objective is to convince the visitor to take the call to action, whether that's signing up for a wait list, booking a demo call or registering for a free trial. But what should you do once your visitor has performed that call to action? Most Landing Pages display a "Thank You" page and call it a day.
It's in this exact moment that your visitors are excited and open to learning even more about you and your brand. You should definitely use it to ask for a second action. Did your user join a wait list for your beta? Why not ask them to subscribe to your Newsletter? Did your visitor book a seat for a webinar? Why not invite them to follow your blog?
There are plenty of actions you can ask your visitor to perform. The one thing to keep in mind is to keep "the ask" in perspective. Don't go too deep down your funnel: don't ask for a free trial sign-up if the main call to action was a newsletter sign-up. Keep the second action proportionate, but always ask for one.
Chances are that you're not getting a lot of traffic and therefore only very few conversions from your Landing Page in the early days. That's totally normal and expected. Your traffic numbers will grow automatically as you experiment and explore what marketing channels work for your product. Doing that takes time.
But maybe the Landing Page isn't communicating the core message well enough. Maybe there's a way to tweak the layout and call to action to get more conversions.
The next step would be to create 2 different versions of your Landing Page and set up an A/B test to see which version performs best. I want to stop you right there.
There's no chance that you have enough traffic in the early days to conduct a statistically sound A/B test. The math just doesn't add up.
Setting up A/B tests is time-consuming, and tweaking the Landing Page every hour or so isn't enough to make an informed decision as to which version performs best.
Start with your very first version of your Landing Page and ask strangers for honest feedback to improve it over time. Do they understand what your product is about? Do they understand the pain points and how your solutions alleviate them?
Build up from there one step at a time. Use an A/B testing calculator to figure out whether you have enough traffic to run an A/B test that produces valid results. And if you ever do, make sure you only tweak one thing at a time.
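To get a feel for why the math doesn't add up, here's a rough sketch of the standard sample-size formula an A/B testing calculator uses for a two-proportion test. The 3% baseline and hoped-for 4% conversion rates are made-up illustration values, and the hard-coded z-scores assume roughly 95% confidence and 80% power:

```python
def required_sample_size(baseline_rate: float, expected_rate: float,
                         z_alpha: float = 1.96, z_beta: float = 0.84) -> int:
    """Approximate visitors needed *per variant* to detect the lift
    with a two-proportion z-test (~95% confidence, ~80% power)."""
    p1, p2 = baseline_rate, expected_rate
    combined_variance = p1 * (1 - p1) + p2 * (1 - p2)
    effect = (p2 - p1) ** 2
    return int((z_alpha + z_beta) ** 2 * combined_variance / effect) + 1

# Detecting a lift from a 3% to a 4% conversion rate:
print(required_sample_size(0.03, 0.04))  # over 5,000 visitors per variant
```

With a few dozen visitors a day, collecting 5,000+ visitors for each variant takes months, which is exactly why early-stage A/B tests rarely produce valid results.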
Now that we've seen quite a few common Landing Page mistakes I'd like you to open up your current Landing Page and see if you can improve it based on the tips we've discussed above.
Do you ask for a second action after the visitor went through the call to action? Do you make it easy for your visitors to leave? Are there ways you can improve your copywriting?
Go through all of the points mentioned above and try to make your Landing Page better. Once you've done that, I'm certain your Landing Page will communicate your product's core strengths and be more convincing than ever before!
Creating a Landing Page is the very first action item on the list when building a successful product. A lot of the time, Landing Pages are created in a rush due to the innate time and resource constraints every new business has to deal with.
Spending more time, thought and effort on your Landing Page can significantly impact the direction your business takes. All too often, avoidable mistakes are made when designing the first Landing Page.
In this post we went through the 10 most common mistakes and how to avoid them. Being aware of these mistakes and the ways they can be resolved empowers you to build compelling Landing Pages that resonate with your target audience, convincing them to try your product and eventually convert into paying customers.
I hope that you enjoyed this article and I'd like to invite you to subscribe to my Newsletter if you're interested in more posts like this.
Do you have any questions, feedback or comments? Feel free to reach out via E-Mail or connect with me on Twitter.
But how do you ensure that your customers keep using your service?
Are there any simple, yet effective ways to reduce or even prevent churn?
As it turns out there's one simple strategy you can use to keep your customers around even if they're about to leave your platform. Let's explore what it is and why it works.
As already stated in the introduction it's important to focus on customer retention when building a sustainable business.
Acquiring customers can be an expensive endeavour. If you're not (yet) in a position where your product grows through Word-of-Mouth you're likely spending a good portion of your revenue on paid ads and marketing to drive traffic to your service. Only a few of your thousands of visitors will eventually try your product and convert to become a paying customer.
Optimizing this marketing and sales funnel is a tricky and costly activity. Think about it for a minute: who finances all the experiments and tweaks on that funnel? Correct, your existing customers.
That's why keeping your users happy and around is one of the most important business objectives.
If you think about it, there's really only one reason why your customers are leaving your platform:
Your product isn't a crucial part of their life anymore
While this sounds harsh, I'd like you to think about all the services you're currently subscribed to. Now imagine you can only keep one. What would you cancel? Probably everything except the one you can't live without.
Of course, preferences differ from person to person and change over time. And that's the exact reason why people cancel their subscription with your service: their preferences have changed, and they might want to take a pause from your service or need something else entirely.
Now that we know why your customers churn, it's time to get into their shoes and think about ways to keep them around.
One of the "industry" standards is to send out a survey once they're about to leave to gather feedback and convince them to stay. Some services offer coupon codes if for example the user has clicked on the "it's too expensive" option in the survey.
Other tactics are more on the "dark patterns" side of things: hiding buttons, asking double-negative questions or using other techniques to make it nearly impossible to leave. Needless to say, customers of businesses practicing such tactics aren't the ones who spread the word on how awesome the product is. Quite the opposite.
But let's take a step back for a minute and ask ourselves why this "should I stay or should I go" question has to be binary in the first place. Isn't there something "right in the middle"? Something where a user can stay but somehow go at the same time?
The solution to this dilemma is dead simple and obvious, yet rarely used: Make it possible to pause the subscription.
Yes, it's that simple. Just offer a way to pause a subscription and come back to it once your user's current circumstances have changed.
Now you might think it's a really bad idea to let users pause their subscription. They'll pause and never come back. Essentially it's "passive churn": they haven't left the platform yet but might never use it again. Stale user data sits in the database while your dashboards still show hockey-stick growth. Furthermore, it's a huge implementation effort, as pausing and resuming subscriptions isn't considered business-critical and hence hasn't been implemented yet.
Those are all valid concerns, and some of them might turn out to be true even with a "pause and resume your subscription" system in place. But let's take a second to look at the other side of the equation.
The very first thing that comes to mind is the COVID-19 pandemic we're currently in. A lot of businesses scaled back and hence had to cancel subscriptions to their favorite SaaS tools to cut costs. A common "save the customer" tactic used here was to get in touch with the business owner and offer heavily discounted, year-long subscription plans. That way businesses could reassess whether they should really quit and leave the huge discount on the table, or double down and benefit from the sweet, discounted multi-year subscription deal.
Letting businesses put their subscription on hold is another strategy that could help retain and eventually reactivate your users during this pandemic. Put yourself into your customer's shoes again for a minute. Wouldn't you want to pay it back in the future if your supplier lent you a helping hand instead of "forcing" you out the door?
Even if your customers pause their account, you still have their E-Mail address to reach out and keep them informed about your product. In fact, you should use this opportunity to stay in touch, ask them how they're doing and provide something of value along the way. That way you keep the communication "warm" and your business stays on their radar. There's a higher likelihood that they'll think of your service when times have changed and they're about to scale things up again.
Pausing a subscription is an action that's usually taken with some level of consideration. If your customers really want to quit, they'll just cancel their subscription anyway. Pausing for the time being is an option your users might simply not have right now, so they're forced into a very binary decision and just quit.
What you should also keep in mind is that pausing a subscription doesn't necessarily mean you'll lose revenue. There are different and very creative ways to implement the pause. My gym, for example, simply extends my membership by the number of months I put it on hold. In the summer I make use of this feature since I do my workouts outside anyway, and the 3-4 months I "save" are simply "added" to my contract. I just have a little more control over how and where I spend my time exercising. You can get really creative here and invent other ways for this mechanism to work if you want to ensure you don't lose revenue.
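A minimal sketch of that gym-style mechanic (hypothetical helper functions, not any real billing API): pausing refunds nothing and instead pushes the contract's end date out by the paused duration.

```python
import calendar
from datetime import date

def add_months(d: date, months: int) -> date:
    """Shift a date forward by whole months, clamping the day for
    shorter months (e.g. Jan 31 + 1 month -> Feb 28)."""
    total = d.month - 1 + months
    year, month = d.year + total // 12, total % 12 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)

def pause_subscription(contract_end: date, paused_months: int) -> date:
    """A pause doesn't refund time -- it extends the contract instead."""
    return add_months(contract_end, paused_months)

# Pausing for 3 summer months pushes a Dec 31st contract end out to March 31st:
print(pause_subscription(date(2024, 12, 31), 3))  # 2025-03-31
```

The customer gets flexibility, while the total revenue over the lifetime of the contract stays the same.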
A last, important point is that you can use this functionality as a competitive advantage and "marketing material". Be sure to add the fact that people can pause their subscription to your list of product benefits. Add it to the copy right next to your "Subscribe Now" button. Addressing objections and concerns right before the call-to-action is about to happen will drastically increase your conversion rates.
Now you might be excited and eager to implement this strategy in the near future but before you do so I'd like to call out a couple of things you should keep in mind when implementing it.
First of all: keep it simple. There's no need to jump right into code and implement this functionality end-to-end. Do it manually in the beginning: update the database records and subscription plans by hand for people who want to pause. Maybe you'll find out that very few people want to use this feature. What you definitely want to put in place is your new copywriting. As discussed above, make sure your marketing website is updated and reflects the change you just introduced.
Next up, you want to have an automated follow-up E-Mail sequence / drip campaign set up for pausing customers. Keep in touch. Ask about problems they had with your software and help them succeed in whatever they're up to right now. You might want to jump on a quick call to gather feedback as to why they paused and understand what needs to be in place for them to come back. If you do this, please make sure you're genuinely interested in the conversation. There's nothing worse for a user than composing a reply and shooting the E-Mail into the marketing void.
A very important, yet often overlooked step is to have a tool in place that deals with "passive churn". Such a system ensures that the credit cards on file are up to date and chargeable. There could be an overlap between users pausing their subscription and their credit cards expiring; you don't want to make them look bad because of that. You could even think about a "concierge service" that onboards them in person once they come back, combined with a quick update on all the new features and changes they missed.
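The overlap check itself is simple. Here's a sketch with a hypothetical helper; real payment providers expose the stored card's expiry month and year, though field names differ:

```python
import calendar
from datetime import date

def card_dies_during_pause(exp_month: int, exp_year: int,
                           pause_end: date) -> bool:
    """True if the stored card stops being chargeable before the
    subscription resumes. Cards are valid through the last day
    of their expiry month."""
    last_valid_day = date(exp_year, exp_month,
                          calendar.monthrange(exp_year, exp_month)[1])
    return last_valid_day < pause_end

# A card expiring 06/2025 won't survive a pause that ends in September:
print(card_dies_during_pause(6, 2025, date(2025, 9, 1)))  # True
```

Running a check like this over all paused accounts once a week gives you a list of customers to nudge before their resume date, instead of surprising them with a failed charge.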
Lastly, you absolutely don't want to make it hard for your users to pause their subscription. As mentioned above, avoid dark patterns at all costs. And more importantly: don't penalize them for pausing. Messages such as "We'll retain your data for the next 60 days" are inappropriate in the day and age of "Big Data" and petabytes of storage available for a nickel and dime.
I'd like to challenge you to think about adding the possibility to pause a subscription. Is it suitable for your business? Would it help you retain and reactivate more customers (especially in the current situation we're in)?
If you're about to add it, keep in mind that it doesn't have to be complicated. Start with a simple E-Mail form your users can fill out to let you know for how long they want to pause. Just make sure that you follow the best practices outlined above and that you advertise that it's now possible for your customers to pause their subscriptions.
Customer retention is one of the most important metrics every business owner should focus on. It's the existing customers who finance the Customer Acquisition Costs necessary to bring new users in the door.
It's almost always cheaper to keep your existing customers happy than to lose them and acquire brand new ones.
Unfortunately, a lot of SaaS services only offer a very binary option for their subscription plans. As a user, you're either in or you're out. You stay or you leave. But what if a user wants to take a pause for a few months because of changes in their life circumstances?
Offering a way to pause a subscription is a simple, yet effective way to retain and eventually reactivate your existing customers. Remember that a pause is temporary. If you follow up with them on a continuous basis and help them succeed, they'll eventually come back. Maybe even as raving, more loyal fans of your brand.
I hope that you enjoyed this article and I'd like to invite you to subscribe to my Newsletter if you're interested in more posts like this.
Do you have any questions, feedback or comments? Feel free to reach out via E-Mail or connect with me on Twitter.
Does your Landing Page contain all the information to communicate the key benefits your target audience is looking for?
What other strategies are out there to increase your conversion rates?
In this blog post we'll explore a simple strategy you can use to create a Landing Page which resonates with your target audience and therefore results in more leads.
Bootstrapping a business is hard. Really, really hard. Every founder has to wear many different hats throughout the journey of starting and running a profitable business. There's Marketing, Product, Sales, Development and so many more aspects you as a founder have to juggle in the day-to-day.
Luckily, there are products and SaaS services that help you streamline the work you have to do on those fronts. Among them are Landing Page templates and Landing Page builders.
Your Landing Page is one of the most critical aspects of your business to get right. Every visitor judges your product in a matter of seconds and decides whether it's a good idea to dive deeper into your solution or move on and close the tab.
There are dozens of articles describing in great detail how your Landing Page should be laid out and which copywriting you should use to make your visitors stick. Over the years companies have spent a lot of time and resources running different A/B tests to eventually derive their "magic" Landing Page formula.
As a bootstrapped founder you don't have the time, resources and, most importantly, traffic to run such tests to find out which Landing Page format converts best.
That's where Landing Page templates and builders come into play. Given that the main goal of every Landing Page is to clearly communicate the benefits your product has to offer it's possible to define a common structure every project should follow.
Such a structure usually looks something like this:
This structure is based on a common understanding of what a Landing Page needs to contain in order to achieve high conversion rates. Following these best practices, you should be able to download the template, fill in the blanks, maybe rearrange some blocks and have a decent Landing Page that delights your visitors, correct?
Well, not quite... There's more to a converting Landing Page than following the de facto standard.
It's always a good idea to put yourself into the shoes of your future potential customers and think about what you'd expect your product's Landing Page to contain to get excited and give it a try.
Are there certain aspects you pay special attention to? Is there anything that's currently missing? Maybe a short video is better suited than a gallery of in-app screenshots? Is there a way to embed a live demo of your product (e.g. the way Intercom does it with their live chat widget)? Are there crucial features your product offers to differentiate itself from the competition?
All those questions and their answers are very specific to the solution you're offering and the audience you're serving. While generic Landing Pages are a good start, they don't and can't provide guidance as to which product- and audience-specific questions you should answer.
Putting yourself into your users' shoes is one way to tackle this issue, but there's a trap almost all of us fall into: bias.
What's a good way to get more out of our Landing Pages without analyzing them in a biased way?
Chances are that if you're bootstrapping your business, you're not entirely reinventing the wheel. If you're not VC-backed, it's usually a good idea to work on a solution to a well-known problem, specifically tailored to a niche audience, rather than coming up with something entirely new.
Defining new markets is expensive and time consuming because you have to raise awareness and educate your customers as to how your product benefits them.
Given this nature of our business, we can use our competition to get some inspiration as to what their customers were looking for when visiting their Landing Pages back in the day. If our competition is successful today, we can be pretty sure they did something right back when they pitched their product for the very first time.
Let's take this idea for a spin and see how we can reverse-engineer the parts that were crucial in the success of our competition.
For this example, let's pretend we've developed a live support chat application similar to Intercom or Drift.
The first thing we want to do is to figure out which companies we're competing with.
Chances are you already stumbled upon this question when you did your market research and product validation. If you haven't, you can find your main competitors easily via business review sites such as Capterra or G2.
Given that we're offering a live support chat service, we type "support chat" into the Capterra search box:
While we see some products at the top of the page, we're more interested in the "Software Categories". If we scroll down a little, we can see the category "Live Chat Software". That's exactly what we're looking for, so we click on that link.
The information on the "Live Chat Software" page confirms that this is the category our product belongs to. Scrolling down a little further, we finally find our competitors, such as Zendesk, HubSpot Service Hub and Intercom.
Looking at the ratings and reviews it might be a good idea to study what Intercom did back in the day.
Now that we know one of our main competitors is Intercom, it's a good idea to find out when they first launched. This information can easily be found on their Wikipedia page.
If that doesn't return any results, we can use Crunchbase, a huge database containing tons of relevant information about startups and businesses.
As we can see, Intercom was founded in August 2011 and their domain is intercom.com. That's all the information we need.
A note on domain names:
If you dig a little deeper into Intercom's history, you'll stumble upon the fact that they didn't own the intercom.com domain from day one. In fact, they bought it after they launched, having started with intercomapp.com. If you want to go the extra mile, you could try to figure out whether your competitor had a different domain name at launch. Most of the time, though, it's enough to know the "final" domain name, because that's usually when the company started to attract public interest and gain traction.
Armed with this information, we visit the Internet Archive's "Wayback Machine", a non-profit project dedicated to recording the internet's history.
In the search bar we enter our competitors domain and select the year when they launched.
On the calendar view we select one of the very first recorded snapshots to see what their Landing Page looked like back in the day.
And there we have it: Intercom's original Landing Page.
We should now take some time and look at which aspects of their product Intercom paid special attention to. You'll notice, for example, that they put a special emphasis on "building relationships" via their software. Another interesting data point is that they started with a free plan during their private beta period, meaning they didn't launch publicly on day one.
What you want to do now is spend some time navigating around and taking notes as you browse. Reverse-engineer their website and the way they communicate their customers' pain and their solution to that pain. Look at which features and benefits they highlight and how they reflect that in their copywriting.
Now you might be thinking that you don't want to be a copycat of your competitor, and you're right about that. Being a 1:1 copy shouldn't be your goal. However, there's a very high likelihood that the audience your competitor serves is similar to the one you're about to work with, and hence receptive to similar messaging.
A note on using the Wayback Machine:
The Wayback Machine is a truly awesome tool and an important initiative, as it tries to record the internet's history as accurately as possible. As you browse through old, preserved websites, you might notice a minor hiccup here and there. That's totally normal: keep in mind that the project is a non-profit dealing with a vast amount of data to store and process. Just pick another date on the calendar if you get stuck. And always keep the year in mind when browsing old websites: what looks dated today was trendy back then.
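If you repeat this research for several competitors, the calendar clicking can even be scripted: the Wayback Machine exposes a public "available" API that returns the snapshot closest to a given timestamp. Here's a small sketch that builds the query URL; fetching it returns JSON containing the closest snapshot's address:

```python
from urllib.parse import urlencode

WAYBACK_API = "https://archive.org/wayback/available"

def snapshot_query(domain: str, year: int) -> str:
    """Build an 'available' API query for the snapshot closest to
    January 1st of the given year."""
    params = urlencode({"url": domain, "timestamp": f"{year}0101"})
    return f"{WAYBACK_API}?{params}"

# Intercom launched in 2011 under intercomapp.com:
print(snapshot_query("intercomapp.com", 2011))
# https://archive.org/wayback/available?url=intercomapp.com&timestamp=20110101
```

Opening that URL in a browser (or fetching it with any HTTP client) hands you a direct link to the archived Landing Page without any calendar navigation.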
We've now seen how to use this process to drastically improve the Landing Page of our imaginary live support chat application. Now it's your turn. Take some time this week to figure out who your main competitors are and how they communicate their products' benefits on their Landing Pages.
Pay close attention to the features and solutions they highlight. Those aspects might be interesting for your target audience as well. Also note how they present their application. Do they use a lot of screenshots or go with video? What is their launch pricing structure? Did they charge on day one? If so, how much?
Try to learn as much as possible and incorporate some of those learnings into your Landing Page as well!
As the old saying goes:
Good Artists Copy; Great Artists Steal
While Landing Page templates and builders are a great starting point to get your product's initial marketing page off the ground, they're usually generic in order to suit as many different use cases as possible.
The main task of your Landing Page is to clearly articulate the benefits your target audience will experience when using your product over the competition. Your future customers should be inclined to dive deeper into the solution you're providing rather than leaving your website after the first few seconds.
This cannot be done with a generic template you've downloaded and filled out. You need to tailor your messaging to your future potential customers. One way to learn what your users will pay attention to is to study what your competition did back in the day when they published and promoted their very first Landing Page. In this blog post we learned a structured way to identify your main competition and reverse-engineer their Landing Pages in great detail.
I hope that you enjoyed this article and I'd like to invite you to subscribe to my Newsletter if you're interested in more posts like this.
Do you have any questions, feedback or comments? Feel free to reach out via E-Mail or connect with me on Twitter.
Most of the time the worst part is that you don't have any data to figure out why users stopped using your product. Even if you reach out, feedback is usually sparse and often doesn't reveal any useful patterns you can look into.
In response to that, a lot of time and energy is usually spent tweaking the onboarding flow or marketing funnel.
But there's a simpler, better way to increase your conversion rates. Let's see how it works.
There's a lot of debate whether having a free trial is the best way to acquire new customers.
The clear upside of offering a free trial is that the barrier to entry is extremely low. An E-Mail address and a password are usually enough to get new users in the door. The more users you have onboarded, the more likely it is that they talk about your product in their peer groups and networks (assuming it solves a real pain). Even better if your product has built-in functionality which incentivizes your users to invite others. This mix of Word-of-Mouth and network effects kick-starts the growth flywheel which will attract even more users. Rinse and repeat.
Eventually a small percentage of your overall user base converts and upgrades to a paid plan.
Hundreds of large-scale companies followed this exact playbook. Given their success it should be a no-brainer to follow this strategy as well, correct?
Despite the aforementioned upsides there are also clear downsides to this approach. Having more users usually translates into more customer support requests your support staff needs to handle. In addition to that it's tricky to get the math right such that you can use the revenue you generate via your paid plans to finance the resources and infrastructure necessary for your free trials.
There isn't a clear, definite answer whether you should offer a free trial or avoid it at all costs. At the end of the day it all comes down to the type of product you sell and the audience you're serving.
Regardless of the way you acquire new customers there's one metric you should monitor closely when onboarding new users: User engagement.
Low user engagement is at the very core of problems related to free trials.
Given that one only needs an E-Mail address and a password to sign up for your product it's an easy decision to take the leap and give it a try. Testing it is free after all and who doesn't like getting something for free?
The issue is that finding the time to sit down and "work with it in the next couple of days" usually never materializes. Life gets in the way and sooner rather than later the trial expires and your users are locked-out, never to be seen again.
Sure, this isn't always the case but more often than not people just don't get around to testing your product enough to uncover the benefits they'll experience when using it in their day-to-day to solve their problems.
One solution to mitigate this problem is to offer a limited, "always free" plan. This way users will have enough time to tinker around and assess whether your product is worth its money. But this only kicks the can down the road as you'll likely face the same issue of low user engagement as before. In addition to that you open up a can of worms because now you have to deal with all the other challenges a free plan entails:
Another common strategy to solve the free trial dilemma is to "ask for the credit card upfront", meaning that users have to put in their credit card information when they sign up but they won't be charged until their trial expires. While asking for credit card information is a good way to filter out users who might never convert into paid customers it comes with the same problems we just discussed.
Offering an "always free" plan or asking for the credit card upfront still doesn't solve the issue of low user engagement.
Now that we know why free trials and free plans are a tough sell it's time to tackle the underlying issue: Low user engagement.
Basic human psychology tells us that human beings value items more if they've spent money to acquire them. Think about the last time you got something for free vs. the time you spent money on a similar item. Chances are that you've used and valued the item you paid money for more.
This discovery can inform the way you design your trial version to eventually convert more users into paying customers.
Rather than offering a free trial which expires in 30 days, offer a paid version of that exact same trial.
Yes, you read that right. Turn your free trial into a paid trial.
Offering a paid trial comes with a couple of major advantages:
In fact, I learned about the power a paid plan can have from a personal experience. Given my growing interest in SEM, SEO and marketing in general I did some research to figure out what tools the market has to offer in that space.
In particular I was looking for a tool which would help me conduct Keyword Research, Backlink analysis and SEO monitoring. What I stumbled upon was the well-known tool Ahrefs.
However, what struck me was that their pricing structure was different compared to their competitors'. There was no "Free trial". I had to pay 7 USD to get access to the tool for 7 days. After some hesitation I decided to give it a shot. In fact, I spent a significant amount of time upfront, learning and researching everything I could about their tooling to figure out if the paid trial would really be worth it.
Needless to say, I used Ahrefs every single day during that week. The felt "pressure" to get the most out of the 7 days, in combination with their excellent, educational E-Mail sequence, helped me explore all the value their tool has to offer. I'm absolutely certain that their conversion rates from paid trial to paying customer aren't too shabby.
Do you have a free plan, free trial or are thinking about introducing such a plan in the near future?
I'd challenge you to take a step back and think about offering a paid trial instead. Your paid trial shouldn't be too expensive but it should cost enough that your users are incentivized to explore your product in more depth. Combine your paid trial with a helpful, educational E-Mail onboarding sequence and check in with your users every now and then to understand what their problems are and to help them succeed.
I'm certain that this will help you tackle any conversion-related problems you might be facing!
Converting users you've attracted via a free trial or a free plan to become paying customers is hard. There are a couple of issues one has to be aware of. It's tough to get the math right and take the profits you made from your paid plans to offset the costs you'll introduce by offering a free plan / free trial. While it's highly likely that you'll get new users using your product you'll also experience an increase in customer support requests and an almost guaranteed low conversion rate.
The underlying core problem with all of this is a lack of user engagement. Paid products are valued more than free ones.
You can use this fact and offer a paid trial which incentivizes your users to take action and explore the values your product has to offer. Having a more engaged user base helps you tackle the conversion-related problems you might be facing.
I hope that you enjoyed this article and I'd like to invite you to subscribe to my Newsletter if you're interested in more posts like this.
Do you have any questions, feedback or comments? Feel free to reach out via E-Mail or connect with me on Twitter.
You're thinking about hiring someone or outsourcing your marketing entirely because "you're not an expert"?
If you're going that route should you hire a freelancer or an agency? And how do you determine if they're doing a good job?
Sure, you could outsource this part of your business because it's time-consuming, tedious and a bit of a "black art" (especially for technical founders).
But marketing doesn't have to be this fuzzy part of your business which is more often than not neglected due to confusion around "Funnels", "Growth Hacking" and "Channels".
Marketing is about connecting with your audience and it's a big competitive advantage if you're the person in charge of such connections. With marketing you can keep an ear to the ground, gathering data as to what does and what doesn't resonate with your audience. You can then use this data to inform your product and marketing decisions and attract more potential customers via "free" channels such as SEO / Search, Social Media or Word-of-Mouth.
Getting the whole marketing machinery up and running can indeed be an intimidating endeavor. But it doesn't have to be! The key is to pick a marketing channel that will likely resonate with your audience and build from there.
I'd suggest that you start with Content Marketing in the form of solution-oriented blog posts. "Why?" you might ask. The reason is simple yet powerful.
Your potential customers are already looking online for solutions to their problems. They use search engines to find answers to specific questions they're currently dealing with. If your software helps with those problems, wouldn't it be nice if users find your product and give it a try?
While this sounds compelling it's easier said than done. The main issue is figuring out what kind of content resonates with your users and, more importantly, what content attracts users who are on the lookout to buy a solution to the problem at hand. Just churning out blog post after blog post in the hope of hitting the jackpot won't do the trick. Plus it takes a whole lot of time and effort to write the content in the first place.
So how do you come up with content ideas that your future customers are interested in? I got some great news for you: You already have all the data that's necessary to write such content. You just have to take a couple of minutes to dig it up!
Chances are that you have a lot of historical chat data in your customer support system, whether that's a fully-fledged CRM system like Intercom or your E-Mail inbox. You might've noticed that for some topics you get the same questions over and over again. The usual reaction to this phenomenon is to create a dedicated FAQ entry which answers the question and to point the next customer who asks about it in the direction of the FAQ. Another, complementary solution is the Knowledge Base, which usually goes into more detail as to what the problem and its potential solutions are.
Why don't you go a step further and turn your FAQ / Knowledge Base entries into dedicated blog posts? With very little extra effort you can use your support requests as a way to inform and kick-start your Content Marketing strategy.
What could such support request repurposing look like? Here's an example blog post structure you might want to use, or adapt according to your needs, to get started today.
Headline The support question, e.g. "How do I [accomplish X] with [your product]?".
Problem Description Write a paragraph or two to describe the problem your customer is experiencing.
Feature Description Write a couple of paragraphs about your product's feature which helps in solving the previously stated problem. Take some time to go into more detail here and outline the value your product provides.
How to solve the problem Provide an exact step-by-step guide which shows how a user will solve the problem with your product. Include screenshots if necessary.
Add a way to create a permalink which points to this headline. This makes it easier for you to share the exact location where you describe the fix in your blog post. This way users with the link will get straight to the solution rather than having to scroll through the whole article.
Conclusion Wrap things up and plug the values your product provides a second time.
CTA Provide a call to action with a "Next Step" the user might want to take. This could be a Newsletter signup box or a link to another feature your product provides (another support request informed blog post).
Are you still on the fence as to whether it's worthwhile to go down this route? The following is a list of some of the benefits you'll experience when you implement this strategy:
I'd like to challenge you to take 30 minutes today to look through your support queue. Just scroll through all the requests and try to identify patterns. What problems are most of your users struggling with? Is there an overarching theme?
Write down the main pain points and turn them into dedicated blog post headlines. Next up, use the structure above (or a variation of it) to write the outline of your blog posts.
Dedicate 30 minutes each day to work on such posts. You'll be surprised how effortless this gets with every single day you work through your list! Don't throw in the towel if you're making little progress the first few days. That's normal. Keep the momentum going and you'll have created a lot of valuable content you can share on your business blog in the next couple of weeks.
It's tough to come up with great blog post ideas your users are likely to enjoy, let alone to start the whole marketing machinery as a non-marketing focused founder. An easy way to deal with such issues is to look into your support system and identify patterns. Are there any questions which have been asked multiple times? Are there some features your customers have more questions about compared to others?
Rather than jumping into the creation of a FAQ entry and calling it a day, think about how you can turn the question at hand into a dedicated blog post. Talk about the feature your user had a question about. Describe how it works and in which way it serves your customers. Showcase how easy it is to use that feature to accomplish a certain task with your product (the actual answer to the support question).
Doing so will save you time as you'll write an article once and for all rather than juggling different mediums like FAQs, Knowledge Bases, Documentation, User Manuals, etc. and it kick-starts your marketing efforts in a bite-sized way. Your business blog will be indexed by search engines and future customers searching for solutions to their problems will discover your article and see how your product helps them solve their issues.
I hope that you enjoyed this article and I'd like to invite you to subscribe to my Newsletter if you're interested in more posts like this.
Do you have any questions, feedback or comments? Feel free to reach out via E-Mail or connect with me on Twitter.
It's been half a year since I started to work on that endeavor and I believe now is a good time to take stock of the status quo and plan the upcoming months and years.
Over the last 6 months I spent a good portion of my time diving deep into fundamental Machine Learning algorithms such as Gradient Descent, Naive Bayes and Logistic Regression. To get the most out of this learning experience I implemented all the algorithms from scratch and shared my learnings via code examples on GitHub and dedicated in-depth blog posts I published on this blog.
Overall I'm pretty happy with the outcome and can say that doing this was totally worth it. I wholeheartedly enjoyed the process from knowing next to nothing about the topic at hand to learning the underlying mathematics and implementing a fully working prototype solely based on reading through a bunch of books, Wikipedia articles and whitepapers.
Unfortunately the whole "Machine Learning from scratch" series isn't 100% done just yet. What's missing are the blog posts about k-means clustering and Neural Networks. The code has already been written and can be found here and here. I hope that I'll find some time and energy to write and publish the remaining 2 blog posts sometime soon.
Another goal I pursued was to learn more about secure computation and Homomorphic Encryption in particular. The main objective was to explore ways in which Homomorphic Encryption and Machine Learning can be used together. Using both in conjunction could resolve fundamental issues around privacy and would unlock new, important use cases where Machine Learning is currently not applicable due to regulatory restrictions. One important application which comes to mind is using Deep Neural Networks in the healthcare sector, where preserving the patients' privacy is of utmost importance.
In order to fully understand Homomorphic Encryption from a theoretical and practical point of view I took the same approach I did to master Machine Learning algorithms. I read dozens and dozens of whitepapers, watched countless conference talks and followed the community closely to implement the foundational algorithms from scratch in pure Python. It was a wild and oftentimes frustrating ride. Cryptography is hard but one of the most intellectually stimulating topics anyone interested in Mathematics and Computer Science should study!
The main outcomes of this exercise were implementations of some of the most important Homomorphic Encryption algorithms to date:
Don't be intimidated by the titles! I strongly believe that Homomorphic Encryption is a fascinating topic anyone can understand and master.
There are a couple of IMHO unnecessary roadblocks you might encounter when digging deeper into the subject. You won't believe how often research papers lack important information, which makes it hard or nearly impossible to reproduce their results. Let me know if you decide to learn more about Homomorphic Encryption and need some pointers or guidance along the way.
Learning and studying new topics and concepts in a theoretical setting can be enjoyable but putting the theory into practice is where the magic of understanding happens!
That's why I decided to scratch one of my own itches by implementing a Machine Learning powered recommender system. The main idea is to help users discover new, useful content on the internet. I called the project "Interlayer" given that it would act as a kind of connecting layer between all the different web properties on the internet.
Roughly speaking Interlayer crawls and analyzes websites while generating topics the website in question covers. Such topics are then used to find similar / complementary websites the current visitor might find useful. Under the hood it uses techniques such as Word Embeddings and K-nearest neighbors.
It took approximately 1 month to implement the first "alpha" version which was focused on online blogs. After hearing about the struggles ecommerce merchants face to get initial traffic to their stores I decided to change course and update Interlayer to work better with ecommerce stores. The result of this is the Interlayer Shopify App I launched a couple of weeks ago.
I'll spend more time to dive deeper into the project and explain what I learned from a technical as well as business point of view. As it turns out running a Machine Learning pipeline in production at scale isn't a straightforward undertaking.
Over the last 6 months I learned a ton! Starting with Machine Learning algorithms I followed-up with implementations of core Homomorphic Encryption algorithms to finally put everything I learned into practice and ship my first product called "Interlayer".
What I quickly figured out was that while I truly love the challenge of learning and implementing algorithms from scratch, it's hard for me to find the time and energy to write in-depth blog posts once I'm done with the coding part. What I truly enjoy is learning the ins and outs of new technologies and applying such learnings in practice by building real customer-facing products. Working on Interlayer was an eye-opening experience as merchants all over the world started to install the app and refer traffic to each other. All of this is powered by the Machine Learning algorithms I started to study earlier this year.
Over the last couple of weeks I took some time to reflect more on that experience. Seeing a project being used "in production" sparked an interest in the Bootstrapping and Indie Hacking community in me.
Given what I've learned over the last 6 months I plan to make this whole blog a lot more personal. While I'll continue to publish long-form educational content and essays I'll switch the main focus to the "making" side of things.
Essentially I'll start to "document rather than create". A lot of content never saw the light of day due to the artificial "it has to be article-worthy" filter I used to determine what's worth sharing. I strongly believe that it's helpful to learn in public. In fact there's a lot I learned from others this way!
I hope that you enjoyed this article and I'd like to invite you to subscribe to my Newsletter if you're interested in more posts like this.
Do you have any questions, feedback or comments? Feel free to reach out via E-Mail or connect with me on Twitter.
It's almost certain that every one of us has played the game of "Twenty Questions", or a variation of it, at least once. In case you're not familiar with the rules, here's how it works:
The game starts with one person (the "Answerer") thinking about a subject or object without revealing it. The others (the "Questioners") are encouraged to ask questions about such subject or object. Questions should be asked such that they can only be answered in a "Yes" or "No" fashion. The current Questioner continues to ask questions if her previous question was answered with a "Yes" and has to hand over the question asking to the next Questioner if the current question was answered with a "No". The Questioner who successfully guessed the subject or object the Answerer thought about has won and will be the next Answerer.
Let's walk through a quick example to see the game in action. In this case I'm the Answerer and you're a Questioner. Since you're the only Questioner you're allowed to continue asking questions even if one of the answers is "No":
You: "Is the subject or object you're thinking about a person?"
Me: "Yes"
You: "Is the person you're thinking about a non-fictional character?"
Me: "No"
You: "Did the person you're thinking about appear in comics?"
Me: "Yes"
You: "Has the person you're thinking about superpowers?"
Me: "Yes"
You: "Is the person one of the Avengers?"
You: "Yes"
You: "Did the person appear in a movie?"
Me: "Yes"
You: "Is the person you're thinking about Iron Man?"
Me: "Yes! Bingo!"
While this is certainly a contrived example it shows one of the core techniques Decision Trees, the topic of this post, use to construct themselves while "learning" from training data.
In the following sections we'll take a deep dive into Decision Tree-based Machine Learning models. We'll examine their properties, learn how we can ensure that the best tree is constructed given the data we're feeding it and how we can turn this knowledge into code to create our very own Decision Tree classifier which predicts if it's a good idea to practice our swings on the golf course given some data about the current weather conditions. Let's dive in!
A Decision Tree is exactly what its name implies. A tree-like structure which makes it possible to model decisions and their consequences. In fact you've already built and used a Decision Tree model while we played the game of "Twenty Questions" in the introduction above.
Let's visualize the Decision Tree you used during the game to better understand how it guided you to the prediction that the character I was thinking about was indeed Iron Man:
In the image above we can see that the Decision Tree consists of Nodes and Edges which point from parent Nodes to child Nodes. In our case every Node contains a question such as "Is it a person?" while the Edges carry the respective answers. The labels are binary, which means that every question can be answered in one of two ways: "Yes" or "No".
While we were playing the game you walked down the tree starting with the root Node at the top, following the Edges and Nodes until you eventually arrived at the Node which had no outgoing Edge: Your prediction.
Generally speaking there are 2 main types of Decision Tree models, which differ in the predictions they produce:
The Classification Tree is a tree where the prediction is categorical. The tree we've built above is a classification tree as its output will always yield a result from a category such as "Superheros" or more specifically "Iron Man".
The Regression Tree is a tree in which outputs can be continuous. An example for continuous values are the real numbers.
While this only defines the type of outputs a tree computes there's nothing preventing us from mixing and matching categorical and continuous values in the tree creation phase. Therefore a tree can have categorical as well as continuous Edge labels and output types.
Given what we've learned so far about Decision Trees, it's easy to see that one of their huge advantages is the model's understandability. Looking at the tree's visual representation it's easy to navigate around, gauge how well it might perform and troubleshoot issues which might arise during real-world usage.
One area of improvement we could identify is the way in which our tree is split. We could've for example skipped the question "Did the person appear in a movie?" since it's highly likely that there's a movie about most of the Avengers. On the other hand nailing that it's Iron Man involved quite some luck. What about Hulk or Thor? It might've been better to insert yet another question to ensure that you're on the right path.
In which way should we construct our tree such that every decision we're making (deciding which Edge to follow) maximizes the information we're getting such that we can make a prediction as accurate as possible? Is it possible to capture this notion mathematically?
As it turns out there are a few different mathematical formulas at our disposal to quantify the information we'd gain when following certain paths in our Decision Tree.
One such metric is the so-called Gini Impurity. With Gini Impurity we measure how often we'd incorrectly label a randomly chosen element from our data set if it was randomly labeled according to the class distribution of such data set.
The formula is as follows:
$$ G = \sum_{i=1}^C p_i \times (1 - p_i) $$
where $C$ are the classes and $p_i$ is the probability of picking an element of class $i$ at random.
Now if this definition makes your head spin you're not alone. Let's work through an example to understand what we're calculating and how Gini Impurity is used in practice.
Let's pretend that we've collected some meteorology data over the last couple of years and want to use it to determine if we should bring an umbrella with us given the current weather conditions we observe.
Among other things we've collected 14 data points about the wind and temperature:
| Wind   | Yes | No  | Total |
| ------ | --- | --- | ----- |
| Weak   | 4   | 3   | 7     |
| Strong | 6   | 1   | 7     |
|        |     |     | 14    |
| Temperature | Yes | No  | Total |
| ----------- | --- | --- | ----- |
| High        | 3   | 2   | 5     |
| Low         | 7   | 2   | 9     |
|             |     |     | 14    |
As one can see we've categorized the wind into "weak" and "strong" and the temperature into "high" and "low". During a day where the wind was "strong" it rained $6$ out of $7$ times. At the same time it rained $7$ out of $9$ times when the temperature was "low".
Our goal is to determine which one of these observations provides us more certainty whether it'll rain and therefore if we should take an umbrella with us. We can figure that out by applying the Gini Impurity formula from above.
Let's start with the Gini Impurity calculation for Wind. We're dealing with 2 separate cases here as there can be either "Weak" or "Strong" wind. For every case we have $2$ different classes ($C$) we can use to label our observation: "Yes" and "No". This means that we need to do $2$ Gini Impurity calculations (one for each case). Plugging in the numbers results in the following:
$$ Gini(\textrm{Wind=Weak}) = \frac{4}{7} \times (1 - \frac{4}{7}) + \frac{3}{7} \times (1 - \frac{3}{7}) = \frac{24}{49} \approx 0.49 $$
$$ Gini(\textrm{Wind=Strong}) = \frac{6}{7} \times (1 - \frac{6}{7}) + \frac{1}{7} \times (1 - \frac{1}{7}) = \frac{12}{49} \approx 0.24 $$
Since we have 2 results (one for each case) we now need to calculate the weighted sum to get the overall result for our Wind observation:
$$ Gini(\textrm{Wind}) = \frac{7}{14} \times 0.49 + \frac{7}{14} \times 0.24 \approx 0.37 $$
Next up we do the Gini Impurity calculation for Temperature. We're following the exact same steps as we did with our Wind calculation above. We're also dealing with 2 different cases ("High" and "Low") as well as $2$ different classes $C$ ("Yes" and "No"). Here are the calculations:
$$ Gini(\textrm{Temp=High}) = \frac{3}{5} \times (1 - \frac{3}{5}) + \frac{2}{5} \times (1 - \frac{2}{5}) = \frac{12}{25} \approx 0.48 $$
$$ Gini(\textrm{Temp=Low}) = \frac{7}{9} \times (1 - \frac{7}{9}) + \frac{2}{9} \times (1 - \frac{2}{9}) = \frac{28}{81} \approx 0.35 $$
$$ Gini(\textrm{Temp}) = \frac{5}{14} \times 0.48 + \frac{9}{14} \times 0.35 \approx 0.4 $$
Putting it all together we've calculated a Gini Impurity of $\approx 0.37$ for our Wind observations and a Gini Impurity of $\approx 0.4$ for our Temperature observations. Since a lower Gini Impurity means a purer split, we'll go with our data on Wind to determine whether we should bring an umbrella with us.
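To double-check the arithmetic, the calculations above can be translated into a few lines of Python. This is a minimal sketch; `gini` and `weighted_gini` are my own helper names, not part of the implementation we'll build below:

```python
from typing import List

def gini(counts: List[int]) -> float:
    """Gini Impurity for a single case, given its class counts (e.g. [# "Yes", # "No"])."""
    total: int = sum(counts)
    return sum((c / total) * (1 - c / total) for c in counts)

def weighted_gini(splits: List[List[int]]) -> float:
    """Weighted sum of the per-case Gini Impurities for one observed feature."""
    total: int = sum(sum(split) for split in splits)
    return sum(sum(split) / total * gini(split) for split in splits)

# Wind: Weak -> 4 "Yes" / 3 "No", Strong -> 6 "Yes" / 1 "No"
print(round(weighted_gini([[4, 3], [6, 1]]), 2))  # 0.37
# Temperature: High -> 3 "Yes" / 2 "No", Low -> 7 "Yes" / 2 "No"
print(round(weighted_gini([[3, 2], [7, 2]]), 2))  # 0.39 (~0.4)
```

The smaller the result, the purer the split, so the Wind observations win out, just as we derived by hand.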
In the last section we've learned about a way to quantify purity and impurity of data via a metric called Gini Impurity. In this chapter we'll apply what we've learned so far by putting all the pieces together and build our very own Decision Tree to help us decide if we should play golf given the current weather conditions we observe.
Based on the Gini Impurity metric we'll examine our weather data and split it into different subsets which will reveal as much information as possible as to whether we should go golfing or not. These subsets will be arranged as a tree structure in a way that we can easily trace our decision through it based on our current observations. Our main goal is to come to a conclusion by taking into account as little weather data as possible.
Let's start by looking at the `golf.csv` data set we'll be working with:
| Outlook  | Temp | Humidity | Windy | Play |
| -------- | ---- | -------- | ----- | ---- |
| Rainy    | Hot  | High     | f     | no   |
| Rainy    | Hot  | High     | t     | no   |
| Overcast | Hot  | High     | f     | yes  |
| Sunny    | Mild | High     | f     | yes  |
| Sunny    | Cool | Normal   | f     | yes  |
| Sunny    | Cool | Normal   | t     | no   |
| Overcast | Cool | Normal   | t     | yes  |
| Rainy    | Mild | High     | f     | no   |
| Rainy    | Cool | Normal   | f     | yes  |
| Sunny    | Mild | Normal   | f     | yes  |
| Rainy    | Mild | Normal   | t     | yes  |
| Overcast | Mild | High     | t     | yes  |
| Overcast | Hot  | Normal   | f     | yes  |
| Sunny    | Mild | High     | t     | no   |
We can see that there are a couple of different features such as Outlook, Temperature, Humidity and Wind we can take into account.
The following command will download the `.csv` file to our local machine:

```sh
wget -O "data/golf.csv" -nc -P data https://raw.githubusercontent.com/husnainfareed/Simple-Naive-Bayes-Weather-Prediction/c75b2fa747956ee9b5f9da7b2fc2865be04c618c/new_dataset.csv
```
Next up we should parse the `.csv` file and store the result in a structured format. In our case we create a [NamedTuple](https://docs.python.org/3/library/collections.html#collections.namedtuple) called `DataPoint` in which we'll store one row of information:
```python
import csv
from pathlib import Path
from typing import List, NamedTuple

data_dir: Path = Path('data')
golf_data_path: Path = data_dir / 'golf.csv'

class DataPoint(NamedTuple):
    outlook: str
    temp: str
    humidity: str
    windy: bool
    play: bool

data_points: List[DataPoint] = []

with open(golf_data_path) as csv_file:
    reader = csv.reader(csv_file, delimiter=',')
    next(reader, None)  # skip the header row
    for row in reader:
        outlook: str = row[0].lower()
        temp: str = row[1].lower()
        humidity: str = row[2].lower()
        windy: bool = row[3].lower() == 't'
        play: bool = row[4].lower() == 'yes'
        data_point: DataPoint = DataPoint(outlook, temp, humidity, windy, play)
        data_points.append(data_point)
```
Notice that we've parsed strings such as t or yes into their respective Boolean equivalents.
Our data_points
list now contains a dedicated DataPoint
container for every .csv
file row:
data_points[:5]
# [DataPoint(outlook='rainy', temp='hot', humidity='high', windy=False, play=False),
# DataPoint(outlook='rainy', temp='hot', humidity='high', windy=True, play=False),
# DataPoint(outlook='overcast', temp='hot', humidity='high', windy=False, play=True),
# DataPoint(outlook='sunny', temp='mild', humidity='high', windy=False, play=True),
# DataPoint(outlook='sunny', temp='cool', humidity='normal', windy=False, play=True)]
Great! We now have the data in a structured format which makes it easier to work with. Let's exploit this structure by implementing some helper functions.
The first function takes an arbitrary data_set
and a list of tuples to filter such data_set
based on the tuple values:
def filter_by_feature(data_points: List[DataPoint], *args) -> List[DataPoint]:
    result: List[DataPoint] = deepcopy(data_points)
    for arg in args:
        feature: str = arg[0]
        value: Any = arg[1]
        result = [data_point for data_point in result if getattr(data_point, feature) == value]
    return result
assert len(filter_by_feature(data_points, ('outlook', 'sunny'))) == 5
assert len(filter_by_feature(data_points, ('outlook', 'sunny'), ('temp', 'mild'))) == 3
assert len(filter_by_feature(data_points, ('outlook', 'sunny'), ('temp', 'mild'), ('humidity', 'high'))) == 2
*Note: If you're familiar with SQL you can think of it as a SELECT * FROM data_set WHERE tuple_1[0] = tuple_1[1] AND tuple_2[0] = tuple_2[1] ...;
statement.*
The second helper function extracts all the values a feature can assume:
def feature_values(data_points: List[DataPoint], feature: str) -> List[Any]:
    return list(set([getattr(dp, feature) for dp in data_points]))

# `set` iteration order isn't guaranteed, so we compare sets instead of lists
assert set(feature_values(data_points, 'outlook')) == {'sunny', 'overcast', 'rainy'}
Those 2 functions will greatly help us later on when we recursively build our tree.
As you might've already guessed, we need to implement the formula to calculate the Gini Impurity metric, which will help us determine how pure our data will be after we split it up into different subsets.
As a quick reminder here's the formula:
$$ G = \sum_{i=1}^C p_i \times (1 - p_i) $$
Which translates to the following code:
def gini(data: List[Any]) -> float:
    counter: Counter = Counter(data)
    classes: List[Any] = list(counter.keys())
    num_items: int = len(data)
    result: float = 0
    item: Any
    for item in classes:
        p_i: float = counter[item] / num_items
        result += p_i * (1 - p_i)
    return result
assert gini(['one', 'one']) == 0
assert gini(['one', 'two']) == 0.5
assert gini(['one', 'two', 'one', 'two']) == 0.5
assert 0.8 < gini(['one', 'two', 'three', 'four', 'five']) < 0.81
Notice the tests at the bottom. If our data set consists of only one class the impurity is $0$. If our data set is split $50/50$ our impurity metric is $0.5$.
We can build a helper function on top of this abstract gini implementation which helps us calculate the weighted Gini Impurities for any given feature. This function does the same thing we've done manually above when we explored our weather data set, which consisted of the features Wind and Temperature.
# Calculate the weighted sum of the Gini impurities for the `feature` in question
def gini_for_feature(data_points: List[DataPoint], feature: str, label: str = 'play') -> float:
    total: int = len(data_points)
    # Distinct values the `feature` in question can assume
    dist_values: List[Any] = feature_values(data_points, feature)
    # Calculate all the Gini impurities for every possible value a `feature` can assume
    ginis: Dict[str, float] = defaultdict(float)
    ratios: Dict[str, float] = defaultdict(float)
    for value in dist_values:
        filtered: List[DataPoint] = filter_by_feature(data_points, (feature, value))
        labels: List[Any] = [getattr(dp, label) for dp in filtered]
        ginis[value] = gini(labels)
        # We use the ratio when we compute the weighted sum later on
        ratios[value] = len(labels) / total
    # Calculate the weighted sum of the `feature` in question
    weighted_sum: float = sum([ratios[key] * value for key, value in ginis.items()])
    return weighted_sum
assert 0.34 < gini_for_feature(data_points, 'outlook') < 0.35
assert 0.44 < gini_for_feature(data_points, 'temp') < 0.45
assert 0.36 < gini_for_feature(data_points, 'humidity') < 0.37
assert 0.42 < gini_for_feature(data_points, 'windy') < 0.43
Here is the manual calculation for Humidity:
$$ Gini(\textrm{Humidity=High}) = \frac{3}{7} \times (1 - \frac{3}{7}) + \frac{4}{7} \times (1 - \frac{4}{7}) = \frac{24}{49} \approx 0.49 $$
$$ Gini(\textrm{Humidity=Normal}) = \frac{6}{7} \times (1 - \frac{6}{7}) + \frac{1}{7} \times (1 - \frac{1}{7}) = \frac{12}{49} \approx 0.24 $$
$$ Gini(\textrm{Humidity}) = \frac{7}{14} \times \frac{24}{49} + \frac{7}{14} \times \frac{12}{49} = \frac{18}{49} \approx 0.37 $$
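To double-check these numbers in code, here's a self-contained sketch that recomputes the weighted Gini Impurity for Humidity (gini_impurity mirrors the gini function above):

```python
from collections import Counter
from typing import Any, List

def gini_impurity(data: List[Any]) -> float:
    # Standalone copy of the `gini` function above so this sketch runs on its own
    counter = Counter(data)
    n = len(data)
    return sum((c / n) * (1 - c / n) for c in counter.values())

# Humidity=High covers 7 of the 14 rows: 3 "yes" and 4 "no"
gini_high = gini_impurity(['yes'] * 3 + ['no'] * 4)
# Humidity=Normal covers the other 7 rows: 6 "yes" and 1 "no"
gini_normal = gini_impurity(['yes'] * 6 + ['no'] * 1)
# Weight each subset's impurity by its share of the data set (7/14 each)
weighted = (7 / 14) * gini_high + (7 / 14) * gini_normal
print(round(weighted, 3))  # → 0.367
```

Which agrees with the assertion for Humidity above.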
We represent our tree data structure via a Node and an Edge class. Every Node has a value and can point to zero or more Edges (out-Edges). A Node without any out-Edges is one of the tree's leaves. An Edge also has a value and points to exactly one Node.
# A `Node` has a `value` and optional out `Edge`s
class Node:
    def __init__(self, value):
        self._value = value
        self._edges = []

    def __repr__(self):
        if len(self._edges):
            return f'{self._value} --> {self._edges}'
        else:
            return f'{self._value}'

    @property
    def value(self):
        return self._value

    def add_edge(self, edge):
        self._edges.append(edge)

    def find_edge(self, value):
        return next(edge for edge in self._edges if edge.value == value)

# An `Edge` has a value and points to a `Node`
class Edge:
    def __init__(self, value):
        self._value = value
        self._node = None

    def __repr__(self):
        return f'{self._value} --> {self._node}'

    @property
    def value(self):
        return self._value

    @property
    def node(self):
        return self._node

    @node.setter
    def node(self, node):
        self._node = node
We finally have all the pieces in place to recursively build our Decision Tree. There are several different tree building algorithms out there such as ID3, C4.5 or CART. The Gini Impurity metric is a natural fit for the CART algorithm, so we'll implement that.
The following is the whole build_tree
code. It might look a bit intimidating at first since it implements the CART algorithm via recursion.
Take some time to thoroughly read through the code and its comments. Below we'll describe in prose how exactly the build_tree
function works.
def build_tree(data_points: List[DataPoint], features: List[str], label: str = 'play') -> Node:
    # Ensure that the `features` list doesn't include the `label`
    if label in features:
        features.remove(label)
    # Compute the weighted Gini impurity for each `feature` given that we'd split the tree at the `feature` in question
    weighted_sums: Dict[str, float] = defaultdict(float)
    for feature in features:
        weighted_sums[feature] = gini_for_feature(data_points, feature)
    # If all the weighted Gini impurities are 0.0 we create a final `Node` (leaf) with the `label` value
    weighted_sum_vals: List[float] = list(weighted_sums.values())
    if all(val == 0.0 for val in weighted_sum_vals):
        leaf_value = getattr(data_points[0], label)
        return Node(leaf_value)
    # The feature with the smallest weighted Gini impurity is the one we should use for splitting
    min_feature = min(weighted_sums, key=weighted_sums.get)
    node: Node = Node(min_feature)
    # Remove the `feature` we've processed from the list of `features` which still need to be processed
    reduced_features: List[str] = deepcopy(features)
    reduced_features.remove(min_feature)
    # Next up we build the `Edge`s which are the values our `min_feature` can assume
    for value in feature_values(data_points, min_feature):
        # Create a new `Edge` which contains a potential `value` of our `min_feature`
        edge: Edge = Edge(value)
        # Add the `Edge` to our `Node`
        node.add_edge(edge)
        # Filter down the data points we'll use next since we've just processed the set which includes our `min_feature`
        reduced_data_points: List[DataPoint] = filter_by_feature(data_points, (min_feature, value))
        # This `Edge` points to the new `Node` (subtree) we'll create through recursion
        edge.node = build_tree(reduced_data_points, reduced_features, label)
    # Return the `Node` (our `min_feature`)
    return node
The first thing our build_tree function does is ensure that the list of features we have to examine doesn't include the label we try to predict later on. In our case the label is called play, which indicates whether we went playing given the weather data. If we find the play label we simply remove it from the list.
Next up we'll compute the weighted gini impurities for all the features in our features
list. (Note that this list will constantly shrink as we recursively build the tree. More on that in a minute)
If all our features
impurities are 0.0
we're done since we're now dealing with "pure" data. In that case we just create a Node
instance with the label
as its value and return that Node
. This is the case where we end the recursive call.
Otherwise (there's still some impurity in the data) we pick the feature
from the weighted_sums
dictionary with the smallest score since the smaller the Gini Impurity, the purer our data. We'll create a new Node
for this min_feature
and add it to our tree. This feature can be removed from our "to-process" list of features. We call this new list (which doesn't include the feature) reduced_features
.
Our min_feature Node isn't a leaf, so it has to have some Edges. To build the Edges we iterate through all the values our feature can assume and create Edge instances based on those values. Since the edges have to point somewhere, we call build_tree again with our reduced_features list to determine the Node (the next min_feature) the Edges should point to.
And that's pretty much all there is. The function will recurse until it reaches the tree's leaf nodes, which don't have any outgoing edges and contain the label values.
Let's build our tree with our golf.csv
data_points
:
features: List[str] = list(DataPoint._fields)
tree: Node = build_tree(data_points, features)
tree
# outlook --> [overcast --> True, sunny --> windy --> [False --> True, True --> False], rainy --> humidity --> [normal --> True, high --> False]]
We can already see the textual representation but it might be a better idea to add a visual representation as well:
So should we go out when the Outlook is "rainy" and the Humidity is "normal"?
Seems like we've enjoyed playing golf in those situations in the past!
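The post doesn't show how to query the finished tree programmatically, so here's a minimal, self-contained sketch of how one could walk it to classify a new observation. The classify helper and the hand-built "rainy" branch are my own illustration, not part of the original code; Node and Edge are simplified stand-ins for the classes above:

```python
class Node:
    def __init__(self, value):
        self.value, self.edges = value, []

    def add_edge(self, edge):
        self.edges.append(edge)

    def find_edge(self, value):
        return next(edge for edge in self.edges if edge.value == value)

class Edge:
    def __init__(self, value, node=None):
        self.value, self.node = value, node

def classify(node, observation):
    # A leaf carries the label itself and has no outgoing edges
    if not node.edges:
        return node.value
    # Otherwise follow the edge matching the observation's value for this feature
    return classify(node.find_edge(observation[node.value]).node, observation)

# Hand-built "rainy" branch of the tree printed above
humidity = Node('humidity')
humidity.add_edge(Edge('normal', Node(True)))
humidity.add_edge(Edge('high', Node(False)))
root = Node('outlook')
root.add_edge(Edge('rainy', humidity))

print(classify(root, {'outlook': 'rainy', 'humidity': 'normal'}))  # → True
```

Tracing a decision is just a recursive walk from the root to a leaf.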
During your journey in the lands of Machine Learning you might come across the terms "Random Forest", "Bagging" and "Boosting".
The problem with Decision Trees is that they tend to overfit easily to the data you train them on. A way to avoid overfitting while also getting better results in general is to build multiple Decision Trees (each with a subset of the training data) and then use them in concert to do predictions.
If you're working with regression trees you'd compute the average of all the trees' outputs. If your trees are of a categorical nature you'd perform a majority vote to decide which prediction to return.
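As a toy illustration (with made-up tree outputs, not code from this post), the two combination rules look like this:

```python
from collections import Counter

# Hypothetical predictions from three categorical trees for one observation
tree_predictions = [True, False, True]
# Majority vote decides the ensemble's class
majority = Counter(tree_predictions).most_common(1)[0][0]
print(majority)  # → True

# Hypothetical outputs from three regression trees: average them instead
regression_outputs = [3.1, 2.9, 3.3]
print(sum(regression_outputs) / len(regression_outputs))
```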
Combining multiple Decision Trees into one large ensemble is called "Random Forest". Related to Random Forests are the techniques "Bagging" ("Bootstrap aggregation") and "Boosting". But that's for another post.
Every one of us is confronted with different decision-making processes every single day. Naturally we examine the different scenarios in order to find the best decision we can take given the data we have at hand.
Decision Trees are a class of Machine Learning algorithms which follow the same idea. In order to build an efficient Decision Tree we need to decide the best way to split our data set into different subsets such that every new subset includes as much pure data as possible. Gini Impurity is one metric which makes it possible to mathematically determine the impurity for any data split we attempt.
Using Gini Impurity we implemented the CART algorithm and built our very own Decision Tree to determine if we should play golf based on the current weather conditions.
Given their visual nature, Decision Trees are a class of Machine Learning algorithms which are easy to inspect, debug and understand. On the other hand they also tend to overfit to the data they're trained on. Combining multiple, different Decision Trees into a so-called "Random Forest" mitigates these problems, turning single Decision Trees into a powerful ensemble.
I hope that you enjoyed this article and I'd like to invite you to subscribe to my Newsletter if you're interested in more posts like this.
Do you have any questions, feedback or comments? Feel free to reach out via E-Mail or connect with me on Twitter.
The following is a list with resources I've used while working on this article. Other, interesting resources are linked within the article itself:
Sometimes it's necessary to split existing data into several classes in order to predict new, unseen data. This problem is called classification and one of the algorithms which can be used to learn those classes from data is called Logistic Regression.
In this article we'll take a deep dive into the Logistic Regression model to learn how it differs from other regression models such as Linear- or Multiple Linear Regression, how to think about it from an intuitive perspective and how we can translate our learnings into code while implementing it from scratch.
If you've read the post about Linear- and Multiple Linear Regression you might remember that the main objective of our algorithm was to find a best fitting line or hyperplane respectively.
To recap real quick, a line can be represented via the slope-intercept form as follows:
$$ y = mx + b $$
Here, $m$ represents the slope and $b$ the y-intercept.
In Linear Regression we've used the existing data to find a line in slope-intercept form (a $m$ and $b$ combination) which "best-fitted through" such data.
Extending the slope-intercept form slightly to support multiple $x$ values and multiple slopes (we'll use $\beta_n$ instead of $m_n$) yields the following:
$$ y = \beta_1x_1 + ... + \beta_nx_n + b $$
This "scaled-up" slope-intercept formula was used in the Multiple Linear Regression model to find the $\beta$ and $b$ values for the hyperplane which "best-fitted" the data. Once found we were able to use it for predictions by plugging in $x$ values to get respective $y$ values.
Linear Regression models always map a set of $x$ values to a resulting $y$ value on a continuous scale. This means that the $y$ value can e.g. be $0$, $42$ or $5{,}023{,}212$. How would we use such a Regression model if our $y$ value is categorical, such as a binary value which is either $0$ or $1$? Is there a way to define a threshold so that a large value such as $42$ is assigned to the category $1$ while a small value such as $0.002$ gets assigned to the category $0$?
That's where Logistic Regression comes into play. With Logistic Regression we can map any resulting $y$ value, no matter its magnitude to a value between $0$ and $1$.
Let's take a closer look into the modifications we need to make to turn a Linear Regression model into a Logistic Regression model.
At the very heart of Logistic Regression is the so-called Sigmoid function. A Sigmoid function is a class of functions which follows an S-shape when plotted.
The most prominent Sigmoid function is the so-called Logistic function which was developed by Pierre François Verhulst to model population growth. It's mathematically described via this formula:
$$ f(x) = \frac{1}{1+e^{-x}} $$
Don't be intimidated by the math! Right now all you need to know is that this function takes any $x$ value and maps it to a $y$ value which ranges from $0$ to $1$.
Plotting the function for a range of $x$ values proves this claim and results in the aforementioned S-shaped curve:
Note that the function gets closer and closer to the $y$ value $0$ or $1$ as the $x$ values get smaller or larger respectively. Also note that the $x$ value $0$ results in the $y$ value $0.5$.
This is exactly what we need: with this function we're able to "squish" any number, no matter its magnitude, into a value ranging from $0$ to $1$. This makes the function outcome predictable, which is useful when we later on define threshold values to associate function outputs with classes.
Let's turn the function into code:
from math import exp

def sigmoid(x: float) -> float:
    return 1 / (1 + exp(-x))
assert sigmoid(0) == 0.5
Note: Although there are many different Sigmoid functions to choose from, a lot of people use the name "Sigmoid function" when talking about the Logistic function. We'll adhere to this convention and use the term "Sigmoid function" as a synonym for Logistic function.
Now that we've learned about the "mapping" capabilities of the Sigmoid function we should be able to "wrap" a Linear Regression model such as Multiple Linear Regression inside of it to turn the regression's raw output into a value ranging from $0$ to $1$.
Let's translate this idea into Math. Recall that our Multiple Linear Regression model looks like this:
$$ y = \beta_1x_1 + ... + \beta_nx_n + b $$
"Wrapping" this in the Sigmoid function (we use $\sigma$ to represent the Sigmoid function) results in the following:
$$ y = \sigma(\beta_1x_1 + ... + \beta_nx_n + b) $$
Easy enough! Let's turn that into code.
The first thing we need to do is to implement the underlying Multiple Linear Regression model. Looking at the Math it seems to be possible to use the dot-product to calculate the $\beta$ and $x$ part to which we then add the single $b$ value.
To make everything easier to calculate and implement we'll use a small trick. Multiplying a value by the identity $1$ yields the value, so we prepend $1$ to the $x$ values and $b$ to the $\beta$ values. This way we can solely use the dot-product calculation without the necessity to add $b$ separately later on. Here's the mathematical formulation of that trick:
$$ \vec{x} = \begin{pmatrix} 1 \\ x_1 \\ \vdots \\ x_n \end{pmatrix} \quad \vec{\beta} = \begin{pmatrix} b \\ \beta_1 \\ \vdots \\ \beta_n \end{pmatrix} $$

$$ y = \vec{x} \cdot \vec{\beta} = \sum_{i=0}^n x_i \beta_i = 1 \times b + x_1 \times \beta_1 + ... + x_n \times \beta_n $$
Once we've calculated the dot-product we need to pass it into the Sigmoid function such that its result is translated ("squished") into a value between $0$ and $1$.
Here's the implementation for the dot
function which calculates the dot-product:
def dot(a: List[float], b: List[float]) -> float:
    assert len(a) == len(b)
    return sum([a_i * b_i for a_i, b_i in zip(a, b)])
assert dot([1, 2, 3, 4], [5, 6, 7, 8]) == 70
And here's the squish
function which takes as parameters the $x$ and $\beta$ values (remember that we've prepended a $1$ to the $x$ values and the $b$ to the $\beta$ values), uses the dot
function to calculate the dot-product of $x$ and $\beta$ and then passes this result into the Sigmoid function to map it to a value between $0$ and $1$:
def squish(beta: List[float], x: List[float]) -> float:
    assert len(beta) == len(x)
    # Calculate the dot product
    dot_result: float = dot(beta, x)
    # Use sigmoid to get a result between 0 and 1
    return sigmoid(dot_result)
assert squish([1, 2, 3, 4], [5, 6, 7, 8]) == 1.0
We've talked quite a lot about how the Sigmoid function is our solution to make the function outcome predictable as all values are mapped to a $0$ - $1$ range. But what does a value in that range represent? Let's take a look at an example.
The following is a data set which describes how long students have studied for an exam and whether they've passed the exam given the hours they've studied.
| Hours studied | Exam Passed |
| ------------- | ----------- |
| 0.5 | 0 |
| 1.0 | 0 |
| 1.5 | 0 |
| 2.0 | 0 |
| 2.5 | 1 |
| 3.0 | 0 |
| 3.5 | 1 |
| 4.0 | 1 |
| 4.5 | 0 |
| 5.0 | 1 |
| 5.5 | 1 |
| 6.0 | 1 |
Taking a glance at the data it seems to be that the more hours the students studied, the more likely they were to pass the exam. Intuitively that makes sense.
Let's plot the data to ensure that our intuition is correct:
Looking at the plotted data we can immediately see that the values seem to "stick" to either the bottom or top of the graph. Given that, it seems infeasible to use a Linear Regression model to find a line which best describes the data. How would this line be fitted through the data if the values we'd expect it to produce are either $0$ or $1$?
Let's try a thought experiment. What would happen if we've somehow found some coefficients $\beta$ for the Linear Regression model which "best" describe the data and pass the result it computes through the Sigmoid function? Here's the graph from above with the Sigmoid function added to it:
Looking at the plotting above we can see that the Sigmoid function ensures that the result from the "underlying" Linear Regression model is mapped onto a scale between $0$ and $1$, which in turn makes it possible to e.g. define a threshold at $0.5$ to say that a value which is greater than $0.5$ might be a predictor for a student passing the exam while a value less than $0.5$ might mean that she'll fail the exam.
Note that the wording in the last sentence isn't a coincidence. The value the Sigmoid function produces can be interpreted as a probability, where $0$ means a $0\%$ probability and $1$ means a $100\%$ probability.
As it turns out we can translate our findings from the previous section into a function called the Probability Density Function (PDF for short).
In particular we can define a conditional probability which states that given some $\beta$ and $x_i$, each corresponding $y_i$ should equal $1$ with probability $\sigma(\beta x_i)$ and $0$ with probability $1-\sigma(\beta x_i)$:
$$ P(y_i \mid \beta x_i) = \sigma(\beta x_i)^{y_i} \times (1-\sigma(\beta x_i))^{1-y_i} $$
Looking at the formula above it might be a mystery how we deduced it from our verbal description from above. Here's something I want you to try: Please apply the formula by setting $y_i$ to $0$ and after that to $1$ and see what happens. What you'll notice is that depending on what value you set $y_i$ to, only one part of the formula stays the same while the other is canceled out.
Here's what we'll end up with if we set $y_i$ to $0$ and $1$:
$$ 1-\sigma(\beta x_i) \quad \textrm{if } y_i = 0 $$

$$ \sigma(\beta x_i) \quad \textrm{if } y_i = 1 $$
And that's exactly the desired behavior we described above.
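We can sanity-check this cancellation in a few lines of Python. The sigmoid and pdf functions below are standalone copies for illustration, with an arbitrary value standing in for $\beta x_i$:

```python
from math import exp

def sigmoid(z: float) -> float:
    return 1 / (1 + exp(-z))

def pdf(y: int, z: float) -> float:
    # P(y | beta x) = sigma(z)^y * (1 - sigma(z))^(1 - y), with z = beta x
    s = sigmoid(z)
    return (s ** y) * ((1 - s) ** (1 - y))

z = 0.8  # an arbitrary value for the linear part beta x
assert pdf(1, z) == sigmoid(z)      # y = 1: only sigma(beta x) remains
assert pdf(0, z) == 1 - sigmoid(z)  # y = 0: only 1 - sigma(beta x) remains
```

Setting $y_i$ really does cancel one factor out, exactly as derived above.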
With Logistic Regression our main objective is to find the models $\beta$ parameters which maximize the likelihood that for a pair of $x$ values the $y$ value our model calculates is as close to the actual $y$ value as possible.
In order to find the optimal $\beta$ parameters we need to somehow calculate how "wrong" our models predictions are with the current $\beta$ setup.
In the previous section we talked about the Probability Density Function (PDF) which seems to capture exactly that. We just need to tweak this function slightly so that it's easier for us to do calculations with it later on.
The main tweak we'll apply is that we "wrap" our individual PDF calculations for $y_i = 0$ and $y_i = 1$ in the $\log$ function. The Logarithm has the nice property that it's strictly increasing, which makes it easier to do calculations on its data later on. Furthermore our PDF's main property is still preserved, since any set of $\beta$ values that maximizes the likelihood of predicting the correct $y$ also maximizes the $\log$ likelihood.
Here are the PDFs two major parts "wrapped" in the $\log$ function:
$$ \log(1-\sigma(\beta x_i)) \quad \textrm{if } y_i = 0 $$

$$ \log(\sigma(\beta x_i)) \quad \textrm{if } y_i = 1 $$
There's only one minor issue we need to resolve. We want a loss function we can minimize, but the $\log$ likelihood above is something we'd maximize. We therefore "mirror" the logarithm at the $x$ axis ("turning" it upside down) by multiplying it with $-1$. Hence our two functions now look like this:
$$ -\log(1-\sigma(\beta x_i)) \quad \textrm{if } y_i = 0 $$

$$ -\log(\sigma(\beta x_i)) \quad \textrm{if } y_i = 1 $$
Now the last thing we want to do here is to put both functions back together into one equation like we did with our composite PDF function above. Doing this results in the following:
$$ \log L (\beta \mid x_i y_i) = -(y_i \log(\sigma(\beta x_i)) + (1-y_i) \log(1-\sigma(\beta x_i))) $$
This function is called Logarithmic Loss (or Log Loss / Log Likelihood) and it's what we'll use later on to determine how "off" our model is with its prediction. Again, you might want to set $y_i$ to $0$ or $1$ to see that one part of the equation is canceled out.
Let's turn that into code:
from math import log

def neg_log_likelihood(y: float, y_pred: float) -> float:
    return -((y * log(y_pred)) + ((1 - y) * log(1 - y_pred)))
assert 2.30 < neg_log_likelihood(1, 0.1) < 2.31
assert 2.30 < neg_log_likelihood(0, 0.9) < 2.31
assert 0.10 < neg_log_likelihood(1, 0.9) < 0.11
assert 0.10 < neg_log_likelihood(0, 0.1) < 0.11
Next up let's use our codified version of Log Loss to create plots for $y = 0$ and $y = 1$:
xs_nll: List[float] = [x / 10000 for x in range(1, 10000)]
fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.plot(xs_nll, [neg_log_likelihood(1, x) for x in xs_nll])
ax1.set_title('y = 1')
ax2.plot(xs_nll, [neg_log_likelihood(0, x) for x in xs_nll])
ax2.set_title('y = 0');
As we can see, the more wrong the prediction, the higher the calculated error. The Log Loss function therefore "punishes" wrongdoing more than it rewards "rightdoing".
To calculate the overall error of our whole data set we sum up each individual Log Loss calculation and average it:
$$ \textrm{Cost} = -\frac{1}{n} \sum_{i=1}^n (y_i \log(\sigma(\beta x_i)) + (1-y_i) \log(1-\sigma(\beta x_i))) $$
Here's the codified version:
def error(ys: List[float], ys_pred: List[float]) -> float:
    assert len(ys) == len(ys_pred)
    num_items: int = len(ys)
    sum_nll: float = sum([neg_log_likelihood(y, y_pred) for y, y_pred in zip(ys, ys_pred)])
    return (1 / num_items) * sum_nll
assert 2.30 < error([1], [0.1]) < 2.31
assert 2.30 < error([0], [0.9]) < 2.31
assert 0.10 < error([1], [0.9]) < 0.11
assert 0.10 < error([0], [0.1]) < 0.11
Now the last missing piece we need to implement is the optimization step. What we want at the end of the day is a Logistic Regression model with the $\beta$ parameters which in combination with $x$ values produce the most accurate prediction for any $y$ value.
Given that we can now calculate the error our current model with its $\beta$ parameters produces we can iteratively change the $\beta$ parameters until we reach a point where our model cannot improve (can't reduce the error value) anymore.
The algorithm which implements exactly that is called Gradient Descent.
Note: I already wrote a dedicated post explaining the algorithm in great detail so I won't go into too much detail here and would encourage you to read the article if you're unfamiliar with Gradient Descent.
The gist of it is that given our Log Loss function we can find a set of $\beta$ parameters for which the error the Log Loss function calculates is the smallest. We've therefore found a local (or global) minimum if the error cannot be reduced anymore. To figure out "where" the minimum is located we'll use the error function's gradient, which is a vector that guides us to that position.
Since we repeat such calculations over and over again we're iteratively descending down the error function's surface, hence the name Gradient Descent.
Calculating the gradient is done by computing the partial derivatives of the Log Loss function with respect to the individual $\beta_j$ values of our $\beta$ vector.
Here's the mathematical representation:
$$ \frac{\partial \textrm{Cost}}{\partial \beta_j} = \frac{1}{n} \sum_{i=1}^n (\sigma(\beta x_i) - y_i) x_{ij} $$
And here's the implementation in code:
grad: List[float] = [0 for _ in range(len(beta))]
for x, y in zip(xs, ys):
    err: float = squish(beta, x) - y
    for i, x_i in enumerate(x):
        grad[i] += (err * x_i)
# Average over the number of data points `n`, not the feature count
grad = [1 / len(xs) * g_i for g_i in grad]
Again, if you're unfamiliar with gradients and partial derivatives you might want to read my article about the Gradient Descent algorithm.
We finally got all the pieces in place! Let's grab some data and use the Logistic Regression model to classify it!
The data set we'll be using is similar to what we've already seen in our example above where we tried to predict whether a student will pass an exam based on the hours she studied.
You can download the data set here. Inspecting the data we can see that there are 2 floating point values and an integer value which represents the label and (presumably) indicates whether the student in question has passed or failed the exam.
Our task is to use this data set to train a Logistic Regression model which will help us assign the label $0$ or $1$ (i.e. classify) new, unseen data points.
The first thing we need to do is to download the .txt
file:
wget -nc -P data https://raw.githubusercontent.com/animesh-agarwal/Machine-Learning/5b2d8a71984016ae021094641a3815d6ea9ac527/LogisticRegression/data/marks.txt
Next up we need to parse the file and extract the $x$ and $y$ values:
marks_data_path: Path = data_dir / 'marks.txt'
xs: List[List[float]] = []
ys: List[float] = []
with open(marks_data_path) as file:
    for line in file:
        data_point: List[str] = line.strip().split(',')
        x1: float = float(data_point[0])
        x2: float = float(data_point[1])
        y: int = int(data_point[2])
        xs.append([x1, x2])
        ys.append(y)
It's always a good idea to plot the data to see if there are any outliers or other surprises we have to deal with:
x1s: List[float] = [x[0] for x in xs]
x2s: List[float] = [x[1] for x in xs]
plt.scatter(x1s, x2s, c=ys)
plt.axis([min(x1s), max(x1s), min(x2s), max(x2s)]);
Looks like we're (almost) good here. There's only one aspect we need to further inspect. If you look at the axes you'll see that the values range from $\approx 35$ to $\approx 95$. Working with such large values can result in problems when working with $\log$ and $\exp$ later on. We'll resolve this problem in a minute.
You might recall from the beginning of the post that we applied a trick where we prepend every $x$ vector with a $1$ and prepend the $b$ to the $\beta$ vector in order to make it possible to use the dot-product (which is a simpler calculation).
The following code prepends a $1$ to every $x$ vector so that we can leverage the computation trick later on:
for x in xs:
    x.insert(0, 1)
xs[:5]
# [[1, 34.62365962451697, 78.0246928153624],
# [1, 30.28671076822607, 43.89499752400101],
# [1, 35.84740876993872, 72.90219802708364],
# [1, 60.18259938620976, 86.30855209546826],
# [1, 79.0327360507101, 75.3443764369103]]
Next up we'll tackle the scaling problem we've touched upon in the paragraph above. What we'll do to resolve this problem is to standardize (often done via the "z-score") the whole data set:
xs = z_score(xs)
xs[:5]
# [[1, -1.5942162646576388, 0.6351413941754435],
# [1, -1.8171014180340745, -1.2014885239142388],
# [1, -1.531325157335502, 0.3594832875590465],
# [1, -0.28068723821760927, 1.0809228071415948],
# [1, 0.6880619310375534, 0.4909048515228952]]
If you're curious what the z_score
function does check out the whole implementation in my lab repository on GitHub.
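If you'd rather not jump to the repository, here's one possible sketch of such a z_score function. It's my own approximation, not necessarily the repository's exact implementation; it standardizes every column except the first one, which holds the prepended bias $1$:

```python
from statistics import mean, pstdev
from typing import List

def z_score(xs: List[List[float]]) -> List[List[float]]:
    # Work column-wise; leave the bias column (all 1s) untouched
    cols = list(zip(*xs))
    standardized = [list(cols[0])]
    for col in cols[1:]:
        mu, sigma = mean(col), pstdev(col)
        standardized.append([(value - mu) / sigma for value in col])
    # Transpose back to row vectors
    return [list(row) for row in zip(*standardized)]

data = [[1, 2.0, 10.0], [1, 4.0, 20.0], [1, 6.0, 30.0]]
print(z_score(data)[1])  # the middle row standardizes to [1, 0.0, 0.0]
```

After standardization each feature column has mean $0$ and (population) standard deviation $1$.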
And now we're finally in a position where we can train our Logistic Regression model via Gradient Descent. We'll start with some random guesses for our model's $\beta$ parameters and iteratively optimize our model by computing its current overall "wrongdoing" with the error function and then using the error function's gradient to update the $\beta$ values to yield a smaller error in the next iteration. Eventually we'll converge to a local minimum which results in $\beta$ values computing the smallest error.
beta: List[float] = [random() / 10 for _ in range(3)]
print(f'Starting with "beta": {beta}')
epochs: int = 5000
learning_rate: float = 0.01
for epoch in range(epochs):
    # Calculate the "predictions" (squishified dot product of `beta` and `x`) based on our current `beta` vector
    ys_pred: List[float] = [squish(beta, x) for x in xs]
    # Calculate and print the error
    if epoch % 1000 == 1:
        loss: float = error(ys, ys_pred)
        print(f'Epoch {epoch} --> loss: {loss}')
    # Calculate the gradient
    grad: List[float] = [0 for _ in range(len(beta))]
    for x, y in zip(xs, ys):
        err: float = squish(beta, x) - y
        for i, x_i in enumerate(x):
            grad[i] += (err * x_i)
    # Average over the number of data points `n`, not the feature count
    grad = [1 / len(xs) * g_i for g_i in grad]
    # Take a small step in the direction of greatest decrease
    beta = [b + (gb * -learning_rate) for b, gb in zip(beta, grad)]
print(f'Best estimate for "beta": {beta}')
# Starting with "beta": [0.06879018957747185, 0.060750489548129484, 0.08122488791609535]
# Epoch 1 --> loss: 0.6091560801945126
# Epoch 1001 --> loss: 0.2037432848849053
# Epoch 2001 --> loss: 0.20350230881468107
# Epoch 3001 --> loss: 0.20349779972872906
# Epoch 4001 --> loss: 0.20349770371660023
# Best estimate for "beta": [1.7184091311489376, 4.01281584290694, 3.7438191715393083]
Now that our model is trained we can calculate some statistics to see how accurate it is. For these calculations we'll set the threshold to $0.5$, which means that every value our model produces that is $0.5$ or above is considered a $1$ and every value below $0.5$ is considered a $0$.
total: int = len(ys)
thresh: float = 0.5
true_positives: int = 0
true_negatives: int = 0
false_positives: int = 0
false_negatives: int = 0
for i, x in enumerate(xs):
    y: int = ys[i]
    pred: float = squish(beta, x)
    y_pred: int = 1
    if pred < thresh:
        y_pred = 0
    if y == 1 and y_pred == 1:
        true_positives += 1
    elif y == 0 and y_pred == 0:
        true_negatives += 1
    elif y == 1 and y_pred == 0:
        false_negatives += 1
    elif y == 0 and y_pred == 1:
        false_positives += 1
print(f'True Positives: {true_positives}')
print(f'True Negatives: {true_negatives}')
print(f'False Positives: {false_positives}')
print(f'False Negatives: {false_negatives}')
print(f'Accuracy: {(true_positives + true_negatives) / total}')
print(f'Error rate: {(false_positives + false_negatives) / total}')
# True Positives: 55
# True Negatives: 34
# False Positives: 6
# False Negatives: 5
# Accuracy: 0.89
# Error rate: 0.11
Finally let's plot the decision boundary so that we can see where our model "draws the line":
x1s: List[float] = [x[1] for x in xs]
x2s: List[float] = [x[2] for x in xs]
plt.scatter(x1s, x2s, c=ys)
plt.axis([min(x1s), max(x1s), min(x2s), max(x2s)]);
# The boundary is where `beta . x = 0`, solved for the second feature:
# x2 = -(beta[1] / beta[2]) * x1 - (beta[0] / beta[2])
m: float = -(beta[1] / beta[2])
b: float = -(beta[0] / beta[2])
boundary_x2s: List[float] = [m * x[1] + b for x in xs]
plt.plot(x1s, boundary_x2s, '--');
Great! Looks like our model learned a useful decision boundary: it considers everything "above" the line as one class and everything "below" it as the other, which is in alignment with the data points from our data set!
In some situations it's a requirement to classify new, unseen data. Logistic Regression is a powerful Machine Learning model which makes it possible to learn such classifications based on existing data.
In this blog post we took a deep dive into Logistic Regression and all its moving parts. We've learned about Sigmoid functions and how they can be used in conjunction with a Linear Regression model to project values of arbitrary magnitude onto a scale between $0$ and $1$ which is exactly what we need when we want to do binary classification.
Once we understood the mathematics and implemented the formulas in code we took an example data set and applied our Logistic Regression model to a binary classification problem. After training our model we were able to draw the decision boundary it learned to visually validate that it correctly learned how to separate the data into two (binary) subsets.
I hope that you enjoyed this article and I'd like to invite you to subscribe to my Newsletter if you're interested in more posts like this.
Do you have any questions, feedback or comments? Feel free to reach out via E-Mail or connect with me on Twitter.
The following is a list of resources I've used to write this article. Other, useful resources are linked within the article itself:
]]>Linear Regression is one of the basic Machine Learning algorithms every student eventually encounters when starting to dive deeper into the field. If you've ever heard someone trying to "fit a line through the data", that person most likely worked with a Linear Regression model.
In which scenarios should we use Linear Regression and if we do, how do we find such a best-fitting line? And what if our data is multidimensional? Let's answer all those questions by implementing Linear and Multiple Regression from scratch!
Note: Throughout this post we'll be using the "Auto Insurance in Sweden" data set which was compiled by the "Swedish Committee on Analysis of Risk Premium in Motor Insurance".
Let's take a step back for a minute and imagine that we're working at an insurance company which sells among other things car insurances.
Over the years the insurance company has collected a wide variety of statistics for all the insurances it sells including data for its car insurances. Some of this data is statistics about the number of filed claims and the payments which were issued for them. The following table shows an excerpt from such data:
| Number of claims | Payments issued |
| ---------------- | --------------- |
| 108 | 392,5 |
| 19 | 46,2 |
| 13 | 15,7 |
| 124 | 422,2 |
| ... | ... |
One day we get a call from our colleague who works at the claims settlement center. She has to plan the division's budget for the upcoming year, which is usually derived from best guesses. Since we've collected all the data throughout the years she wonders if there's a more reliable, mathematical way to calculate the budget estimate.
Nodding along we confirm that we'll dive deeper into this topic and hang up the telephone in sheer excitement! Finally we're able to put some real Machine Learning into practice. But how should we tackle the problem?
It's hard to gain any insights into the data we're dealing with by manually examining the raw numbers. In order to get a better understanding of the data it's always a good idea to visualize it first.
Given that we're dealing with 2 dimensions (the number of claims and the issued payments) one of the potential diagrams we can create is a so called scatter plot which uses (Cartesian) coordinates to display the values of a given data set. In our case we treat the number of claims as our $x$-axis and the issued payments as our $y$-axis and plot the data we recorded at the intersections of such axes which results in the following diagram:
Solely by looking at the diagram we can already identify a trend in the data. It seems to be the case that the more claims were filed, the more payments were issued. Intuitively that makes sense.
After inspecting the plotted data in more detail we observe that we can certainly make some rough predictions for missing data points. Given the trend in the data it seems reasonable that ~80 payments might be issued when ~40 claims were filed. Accordingly, ~125 filed claims might correspond to ~410 issued payments.
Is there some way to turn this insight into a model we can use to make arbitrary predictions? It seems like the relationship in the data is linear. Is there a way to capture this notion mathematically?
You might remember the concept of a Linear function from school where you've used the slope-intercept form (one of many forms) to mathematically describe a line:
$$ y = mx + b $$
The slope-intercept form has 2 parameters which determine how the line "behaves" in the Cartesian plane (the typical 2D plane with $x$ and $y$ coordinates): the slope $m$, which describes the line's steepness, and the $y$-intercept $b$, which describes where the line crosses the $y$-axis when $x = 0$.
Using this formula we can plug in any arbitrary $x$ value which is then multiplied by $m$ and added to $b$ to get back the corresponding $y$ value.
Let's look at a couple of examples to get a better feeling as to how the slope-intercept form works.
Let's solely focus on $m$ for now and set $b$ to $0$. How does $m$ influence the way our line will be plotted if we set it to $1$?
Here's the mathematical representation of such a line followed by the corresponding plot:
$$ y = 1x + 0 $$
As you can see for every step of size $1$ in the $x$ direction we "go" a step of size $1$ in the $y$ direction.
Now that we understand what the parameter $m$ is responsible for, let's take a look at the $y$-intercept $b$ and set it to $1$:
$$ y = 1x + 1 $$
The steepness of the line is the same as the previous line since we haven't modified $m$. However if you take a look at $x = 0$ you'll notice that the line crosses the $y$ intercept at $1$. That's exactly what the parameter $b$ is responsible for. Through $b$ we can control where our line should start on the $y$ axis when $x = 0$.
Let's translate the slope-intercept form into a function we call predict (we'll use this function for our predictions later on):
def predict(m: float, b: float, x: float) -> float:
    return m * x + b
assert predict(m=0, b=0, x=3) == 0
Let's put the theory into practice and try to guesstimate a line which best describes our data.
The first thing we notice is that the individual data points follow an upwards trend, so $m$ will certainly be positive. Furthermore the data points close to $x = 0$ seem to have low $y$ values as well. Taking those observations into account we guess the following description for our line:
$$ y = 4x + 2 $$
Not too bad for our first guess! Just by looking at the plotted line we might ask ourselves whether there's a better-fitting line. And can we quantify how well our line fits the data?
Given that the learning part in Machine Learning is usually an iterative process which starts with an initial guess and slowly computes new "best guesses" to finally converge to the optimal solution it's a necessity to be able to track the learning process.
A good way to supervise the learning process is to mathematically capture the "wrongdoing" our algorithm inevitably produces while trying to determine the function which best describes the data.
In the case of Linear Regression it seems to make sense to compare the $y$-values the line produces to the actual $y$-values from the data set. We could for example go through each individual $(x, y)$ pair in our data set and subtract its $y$ value from the $y$ value our line "predicts" for the corresponding $x$. Summing up these differences results in a number we can use to compare different lines against each other. The higher the number, the "less correct" the line.
That's great but there's one minor catch. Imagine that the line which is fitted through the data predicts large positive $y$ values near the origin $(0, 0)$ where it should predict large negative numbers. At the same time it predicts large negative numbers near the end of the $x$-axis although those values should be positive. If we calculate the errors according to our description above where we suggested to sum up the differences between the $y$ values we'd end up in a situation where values might cancel each other out. In the worst case the calculated error is $0$ which indicates that we've found the best fitting line while in reality we didn't!
A simple trick to mitigate this problem is to square each single error value before they're summed up. This way any negative value will be turned into a positive one, making it impossible to run into scenarios where error calculations cancel each other out.
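A tiny example makes the cancellation problem concrete:

```python
# Two predictions that are off by +5 and -5 respectively
errors = [5.0, -5.0]

# Summing the signed errors suggests a perfect fit, which is misleading
assert sum(errors) == 0.0

# Squaring each error before summing exposes how far off we really are
assert sum(e ** 2 for e in errors) == 50.0
```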
The error function we've just described is called Residual sum of squares (RSS) or Sum of squared errors (SSE) and is one of many error functions we can use to quantify the algorithm's "wrongdoing". The following is the mathematical formula for SSE:
$$ SSE = \sum_{i=1}^n (y_i - f(x_i))^2 $$
Turning that into code results in the following:
def sum_squared_error(ys: List[float], ys_pred: List[float]) -> float:
    assert len(ys) == len(ys_pred)
    return sum((y - y_pred) ** 2 for y, y_pred in zip(ys, ys_pred))
assert sum_squared_error([1, 2, 3], [4, 5, 6]) == 27
With those two code snippets (the predict and sum_squared_error functions) we're now able to describe a line, predict $y$ values and measure how "off" our predictions are. The last missing piece is a way to update our line description such that the next sum_squared_error calculation returns a smaller error than the current one. If there's a way to constantly reduce the error we're making by slowly updating our line description we'll eventually end up with a line which best fits our data!
With Linear Regression there are a couple of different algorithms we can use to find the best fitting line. One prominent choice is the Ordinary least squares (OLS) method. Since OLS is a common choice we'll do something different. We'll use the Gradient Descent algorithm which can be used for various different optimization problems and is at the heart of modern Machine Learning algorithms.
Note: If you haven't already I'd suggest that you take a couple of minutes to read the article "Gradient Descent from scratch" in which I explain the whole algorithm in great detail.
I won't provide too many explanations regarding Gradient Descent here since I already covered the topic in the aforementioned post. Don't get too intimidated by the Math below. It's ok if you just skim through this section to get a high-level overview.
In a nutshell Gradient Descent makes it possible for us to iteratively "walk down" the error function's surface to eventually find a local minimum where the error is the smallest, which is exactly what we're looking for.
In order to figure out in which direction we should walk to descend down to the local minimum we need to compute the so-called gradient. The gradient is a vector consisting of the partial derivatives of our error function, and it points in the direction of greatest increase at any given point $p$ on our error function's surface.
To find the partial derivatives of our SSE function we should expand it so that we can see all the variables we need to take into account:
$$ SSE = \sum_{i=1}^n (y_i - f(x_i))^2 = \sum_{i=1}^n (y_i - (mx_i + b))^2 $$
Looking at the expanded formula we can see that $m$ and $b$ are the two parameters we need to take the partial derivatives with respect to:
$$ \frac{\partial SSE}{\partial m} = \sum_{i=1}^n 2x_i ((mx_i + b) - y_i) $$
$$ \frac{\partial SSE}{\partial b} = \sum_{i=1}^n 2 ((mx_i + b) - y_i) $$
Which results in the following code:
# The partial derivative of SSE with respect to `m`
grad_m: float = sum([2 * (predict(m, b, x) - y) * x for x, y in zip(xs, ys)])
# The partial derivative of SSE with respect to `b`
grad_b: float = sum([2 * (predict(m, b, x) - y) for x, y in zip(xs, ys)])
Tip: You can use WolframAlpha to validate your partial derivatives.
Given these partial derivatives we can now calculate the gradient for any point $x$ which is a vector pointing in the direction of greatest increase. Multiplying the vector by $-1$ will let it point into the opposite direction, the direction of greatest decrease (remember that we want to find a local minimum). If we add a small fraction of this vector to our $m$ and $b$ values respectively we should end up closer to a local minimum.
The following code captures what we've just described:
# Take a small step in the direction of greatest decrease
# The `learning_rate` controls the step size when "walking" down the gradient
learning_rate: float = 0.0001
m = m + (grad_m * -1 * learning_rate)
b = b + (grad_b * -1 * learning_rate)
Repeating this process multiple times should help us find the $m$ and $b$ values for our line for which any given prediction $y$ calculated by that line results in the smallest error possible.
Let's put all the pieces together and implement the Gradient Descent algorithm to find the best fitting line:
# Find the best fitting line through the data points via Gradient Descent
# Our initial guess
m: float = 0
b: float = 200
print(f'Starting with "m": {m}')
print(f'Starting with "b": {b}')
# Doing 10000 iterations
epochs: int = 10000
learning_rate: float = 0.00001
for epoch in range(epochs):
    # Calculate predictions for `y` values given the current `m` and `b`
    ys_pred: List[float] = [predict(m, b, x) for x in xs]
    # Calculate and print the error every 1000 epochs
    if epoch % 1000 == 0:
        loss: float = sum_squared_error(ys, ys_pred)
        print(f'Epoch {epoch} --> loss: {loss}')
    # Calculate the gradient
    # The (partial) derivative of SSE with respect to `m` is `2 * x * ((m * x + b) - y)`
    grad_m: float = sum([2 * (predict(m, b, x) - y) * x for x, y in zip(xs, ys)])
    # The (partial) derivative of SSE with respect to `b` is `2 * ((m * x + b) - y)`
    grad_b: float = sum([2 * (predict(m, b, x) - y) for x, y in zip(xs, ys)])
    # Take a small step in the direction of greatest decrease
    m = m + (grad_m * -learning_rate)
    b = b + (grad_b * -learning_rate)
print(f'Best estimate for "m": {m}')
print(f'Best estimate for "b": {b}')
Running this algorithm results in a best estimate for the $m$ and $b$ values. Let's compare our initial guess of $m$ and $b$ (the guess we started with at the top of the code snippet) with the values our Gradient Descent implementation produced:
$$ m = 0, b = 200 $$
$$ m \approx 3.40, b \approx 20.30 $$
Awesome! Seems like we've found our linear function which best describes our data! Let's call our co-worker and share the good news. From now on she can use the following formula to find a prediction for the issued payments ($y$) based on any number of claims ($x$):
$$ y = 3.40x + 20.30 $$
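Her estimation then boils down to a single call to the predict function from earlier (using the rounded parameter values from the Gradient Descent run):

```python
def predict(m: float, b: float, x: float) -> float:
    return m * x + b

# Estimated payments issued for 40 filed claims
estimate: float = predict(m=3.40, b=20.30, x=40)
```

For $40$ filed claims this estimates roughly $156$ issued payments.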
It's great to be able to fit a line through data points in $2$ dimensions. But how do we deal with scenarios where our data has more than $2$ dimensions?
Most data sets capture many different measurements which are called "features". It would be great if we could take the most important features into account when working with our algorithms. Given that every feature adds another dimension we need to ensure that the model we're building can deal with such high-dimensional data.
Our Linear Regression model was only able to take a single $x$ value and predict a corresponding $y$ value. What if we have multiple $x$ values? Is there a way to use a regression model to predict a $y$ value based on multiple $x$ values?
As it turns out Linear Regression is a subset of a general regression model called Multiple Linear Regression or Multiple Regression. Multiple Regression can deal with an arbitrary number of $x$ values expressed as a vector to predict a single $y$ value.
The great news is that we can easily adopt what we've learned so far to deal with high-dimensional data. Let's take a quick look at the changes we need to make.
The slope-intercept form we've used so far can easily be updated to work with multiple $x$ values. Here's the linear equation we've used so far:
$$ y = mx + b $$
Having multiple $x$ values means that we'll also have multiple $m$ values (one for each $x$). However we'll still only deal with $1$ intercept:
$$ y = m_1x_1 + ... + m_nx_n + b $$
Calculating a prediction for $y$ is as simple as solving the above equation for any given vector of $x$ values, vector of $m$ values and any given $b$ value.
There's one little trick we can apply given that we're now mostly dealing with vectors rather than scalar numbers. To make the computation more efficient we can use the dot-product which carries out almost the exact same calculation we described above. There's just one problem. The dot-product can only be used in vector calculations, however $b$ isn't a vector. As it turns out we can simply prepend the $b$ value to the $m$ vector and prepend a $1$ to the $x$ vector. Doing this little trick makes it possible to use the dot-product calculation while also taking the $b$ value into account. Here's what we'd end up with when doing just that:
$$ \vec{x} = \begin{pmatrix} 1 \\ x_1 \\ \vdots \\ x_n \end{pmatrix} \quad \vec{m} = \begin{pmatrix} b \\ m_1 \\ \vdots \\ m_n \end{pmatrix} $$
$$ y = \vec{x} \cdot \vec{m} = \sum_{i=0}^{n} x_i m_i = 1 \times b + x_1 \times m_1 + ... + x_n \times m_n $$
Another nice side-effect of doing this is that the partial derivative calculations for the error function will also be easier since our usage of the dot-product reduced the number of variables we have to take into account to just 2 vectors $x$ and $m$.
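A tiny example (with made-up numbers for $b$, $m$ and $x$) shows the trick in action:

```python
from typing import List

def dot(v: List[float], w: List[float]) -> float:
    return sum(v_i * w_i for v_i, w_i in zip(v, w))

b: float = 2.0
ms: List[float] = [3.0, 4.0]  # one slope per feature
x: List[float] = [1.0, 0.5]   # one value per feature

# Prepending `b` to the slopes and `1` to `x` folds the intercept
# into a single dot-product
y: float = dot([b] + ms, [1.0] + x)
assert y == b + ms[0] * x[0] + ms[1] * x[1]
```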
And that's pretty much all there is to change. The rest of the code follows exactly the same way. While we fitted a line when working with Linear Regression we're now fitting a so-called hyperplane with Multiple Regression.
To get a better intuition for the notion of a hyperplane imagine that we have measurements we can scatter plot in a $3$ dimensional space. Every measurement will be a single dot in that space, resulting in a cloud of dots. Our Multiple Regression algorithm will now try to find a plane (think of it as a wooden plank) which best fits through that dot cloud.
Linear Regression is one of the very first algorithms every student encounters when learning about Machine Learning models and algorithms.
The basic idea of Linear Regression is to find an equation for a line which best describes the data points in the given data set. Such a line is often described via the slope-intercept form $y = mx + b$. "Fitting the line" means finding the $m$ and $b$ values such that the resulting $y$ value is as accurate as possible given an arbitrary $x$ value. Using an error function (which describes how "off" our current line equation is) in combination with an optimization algorithm such as Gradient Descent makes it possible to iteratively find the "best fitting" line.
As it turns out Linear Regression is a specialized form of Multiple Linear Regression which makes it possible to deal with multidimensional data by expressing the $x$ and $m$ values as vectors. While this requires the usage of techniques such as the dot-product from the realm of Linear Algebra the basic principles still apply. In Multiple Linear Regression we're just trying to find a "best fitting" hyperplane rather than a line.
I hope that you enjoyed this article and I'd like to invite you to subscribe to my Newsletter if you're interested in more posts like this.
Do you have any questions, feedback or comments? Feel free to reach out via E-Mail or connect with me on Twitter.
The following is a list with resources I've used while working on this blog post. Other useful resources are linked within the article itself.
]]>How do you implement a spam filter that adapts to the ever-evolving techniques of modern spammers? Spam detection is a classification problem, as you have to decide whether a message is spam or not. However there's always some uncertainty involved in that decision.
Looking into the probability toolbox to tame such uncertainty one might stumble upon the infamous Bayes Theorem which is exactly what we'll use in this post to build a basic spam filter. Using Bayes Theorem we'll implement an algorithm called Naive Bayes which was used in the early internet days as a spam filtering technique. Let's dive in!
Note: This post assumes a certain level of knowledge about probability theory. If you're new to this subject or just a little bit rusty you might want to check out the "Basics of Probability" playlist by JB Statistics which helps you in building the intuitions behind probability theory.
If you've studied statistics and probability theory in the past you surely came across Bayes Theorem, which lets you calculate conditional probabilities. Conditional probabilities are probabilities with dependencies which can be used to mathematically express questions such as: "What's the probability that it'll be sunny throughout the day given that we saw clouds in the morning?". Translating this sentence into math results in the following:
$$ P(Sun \mid Clouds) $$
Having some decent meteorology data available should be sufficient to calculate the exact probability via Bayes Theorem which is formulated as follows:
$$ P(A \mid B) = \frac{P(B \mid A) \, P(A)}{P(B)} $$
Let's take a quick refresher and apply this formula by working through an example step by step. Assuming that we're living in a place where it's often cloudy, the chance of seeing clouds in the morning is roughly $60\%$. At the same time most days aren't that sunny: only about 7 out of 30 days can be considered really sunny. Another data point we have is that $50\%$ of the sunny days started out with clouds in the morning.
Formulating these findings as probabilities results in the following:
$$ P(Sun) \approx 25\% \approx 0.25 $$
$$ P(Cloud) = 60\% = 0.6 $$
$$ P(Cloud \mid Sun) = 50\% = 0.5 $$
Calculating the likelihood of getting a sunny day even when we saw clouds in the morning is as simple as plugging-in the probabilities into Bayes Theorem:
$$ P(Sun \mid Cloud) = \frac{P(Cloud \mid Sun) \, P(Sun)}{P(Cloud)} = \frac{0.5 \times 0.25}{0.6} \approx 0.21 = 21\% $$
So there's a $21\%$ chance that we get a sunny day even if the morning started with clouds.
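The same calculation expressed in code, using the probabilities from above:

```python
p_sun = 0.25
p_cloud = 0.6
p_cloud_given_sun = 0.5

# Bayes' Theorem: P(Sun | Cloud) = P(Cloud | Sun) * P(Sun) / P(Cloud)
p_sun_given_cloud = p_cloud_given_sun * p_sun / p_cloud
print(round(p_sun_given_cloud, 2))  # 0.21
```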
Now that we got a quick refresher on Bayes Theorem it's time to ask ourselves how we could apply this formula to a classification problem such as spam filtering.
Generally speaking we'd like to calculate the probability that any given message is spam. If the probability exceeds a certain threshold (e.g. $50\%$) we classify it as spam. Otherwise we believe that the message is legit and call it "ham".
The rather unusual word "ham" which describes "non spam" messages is used throughout literature so we'll use it here too.
Breaking the problem further down into its fundamentals we can say that a message is comprised of different words, so why don't we start there and calculate the probability of a message being classified as spam given that it contains a certain "trigger" word?
Let's assume that we're working with the word "prize" and want to figure out what the probability is that a given message which contains the word "prize" is spam. Formulating this problem via Bayes Theorem results in:
$$ P(Spam \mid Prize) = \frac{P(Prize \mid Spam) , P(Spam)}{P(Prize)} $$
To make things clear, here's the formula translated into prose:
"The probability of a message being spam given that we see the word "Prize" in it is the probability of finding the word "Prize" in our already seen spam messages times the probability of the message in question being spam divided by the probability of finding the word "Prize" in any of our already seen messages."
There are two simplifications we can apply here. The first thing we might want to do is expand the denominator like so:
$$ P(Prize) = P(Prize \mid Spam) P(Spam) + P(Prize \mid Ham) P(Ham) $$
We simply rewrote the condensed version of $P(Prize)$ into a more verbose statement listing all its underlying probabilities.
Our expanded Bayes Theorem application now looks like the following:
$$ P(Spam \mid Prize) = \frac{P(Prize \mid Spam) P(Spam)}{P(Prize \mid Spam) P(Spam) + P(Prize \mid Ham) P(Ham)} $$
The neat thing about this expansion is that we can now apply our second trick. Let's think about the probability of a message being spam vs. a message being ham for a minute. What is the likelihood that any random message is spam? What is the likelihood that it's ham? The fact is that we don't know for sure so we might just assume that there's a $50/50$ chance that any random, incoming message is spam or ham.
Applying this finding by plugging-in the corresponding probabilities results in:
$$ P(Spam \mid Prize) = \frac{P(Prize \mid Spam) 0.5}{P(Prize \mid Spam) 0.5 + P(Prize \mid Ham) 0.5} $$
That's great because now we can apply basic arithmetic and cancel out these probabilities, resulting in our final formulation to determine whether a message containing the word "Prize" should be considered "spammy".
$$ P(Spam \mid Prize) = \frac{P(Prize \mid Spam)}{P(Prize \mid Spam) + P(Prize \mid Ham)} $$
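To make the formula concrete, here's the calculation with some made-up counts (the numbers are purely illustrative): suppose we saw the word "Prize" in 42 out of 80 spam messages and in 3 out of 120 ham messages.

```python
# Hypothetical counts, purely for illustration
p_prize_given_spam = 42 / 80   # P(Prize | Spam)
p_prize_given_ham = 3 / 120    # P(Prize | Ham)

# The simplified formula from above
p_spam_given_prize = p_prize_given_spam / (p_prize_given_spam + p_prize_given_ham)
```

With these counts a message containing "Prize" would be considered spam with a probability of roughly $95\%$.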
Now that we know how to calculate the probability for a single word we should try to scale it up so that we can compute the probabilities for whole sentences.
A naive approach would be to insert the whole message as the conditional probability part. However that won't work as it would force our algorithm to essentially look for duplicate messages.
Thinking logically we can observe that a sentence is simply a succession of single words. What we could do is reuse what we did above for a single word and apply it to every word we come across in our message. Computing the overall probability for the whole message is then simply the multiplication of all those individual probabilities.
Let's pick one example probability calculation to describe this process mathematically:
$$ P(Message \mid Spam) = P(w_1, ..., w_n \mid Spam) = P(w_1 \mid Spam) \times ... \times P(w_n \mid Spam) $$
And that's where the Naive Bayes classifier acts naively (hence the name Naive Bayes). Because we're multiplying the probabilities we're making the bold assumption that the individual probabilities (the presence or absence of words) are independent of each other. Everyone who has ever received an E-Mail from a Nigerian prince will agree that there's likely a dependency between words in spam messages.
Generally speaking we're now in a position where we can turn our findings into code and implement the Naive Bayes algorithm from scratch. However there are 2 more tweaks we want to implement in order to mitigate some minor hiccups we'll run into if we don't do it.
The first challenge we're faced with is a computational problem. As stated above we're about to multiply a lot of probabilities (one for every word), each of which tends to be very small. Modern computer architectures can only represent very large or very small numbers with limited precision. If we constantly multiply small numbers we can run into the so-called "arithmetic underflow" problem which will eventually turn our number into a $0$.
A well known trick to deal with this problem is to use a combination of the exponential function and the logarithm. You might remember that:
$$ \log(ab) = \log(a) + \log(b) $$
and
$$ \exp(\log(x)) = x $$
We can use this to "wrap" our probability calculations into $\log$ to later on "unwrap" them again via $\exp$.
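Here's a small demonstration of both the underflow problem and the $\log$-based workaround:

```python
from math import log

# Multiplying 100 tiny probabilities underflows to 0.0 ...
probs = [1e-5] * 100
product = 1.0
for p in probs:
    product *= p
assert product == 0.0

# ... while summing their logarithms stays perfectly representable
log_sum = sum(log(p) for p in probs)
# log_sum is about -1151.3, a value floats handle without any trouble
```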
The second issue is rooted in the mathematics we're attempting to carry out. Let's understand the problem by looking into a case we'll very likely face.
What would $P(W \mid S)$ be if we've only ever found $W$ in non-spam messages?
Well if we've never seen the word $W$ in a spam message, then $P(W \mid S) = 0$. And exactly this turns into a problem for us given that such a term appears in both our numerator and denominator. Everything multiplied by $0$ is $0$.
In order to deal with that issue we can introduce a factor $k$ which is a parameter we can tweak (most of the time it's set to $1$).
With this factor $k$ we can express $P(W \mid S)$ as follows:
$$ P(W \mid S) = \frac{k + \textrm{spam containing W}}{(2 \times k) + \textrm{total spam}} $$
Let's walk through a quick example to see how $k$ helps us. Our assumption is that we have analyzed $100$ spam examples but found the word $W$ exactly $0$ times. Calculating the probability that we might find the word $W$ in spam ($S$) without $k$ would result in the following:
$$ P(W \mid S) = \frac{0}{100} = 0 $$
Having this probability in a chain of probability multiplications would immediately turn the result into $0$ because $x \times 0 = 0$.
Let's introduce $k = 1$ to solve this issue:
$$ P(W \mid S) = \frac{k + 0}{(2 \times k) + 100} = \frac{1}{2 + 100} \approx 0.01 $$
Having $0.01$ in our multiplications won't hurt the accuracy while ensuring that we don't end up with a $0$ result just because one of the probability calculations resolved to $0$.
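Expressed as a standalone function (a small sketch; the name is illustrative), the smoothed probability calculation looks like this:

```python
def p_word_given_spam(spam_containing_w: int, total_spam: int, k: float = 1.0) -> float:
    # The pseudocount `k` guarantees a non-zero probability even for
    # words we've never seen in any spam message
    return (k + spam_containing_w) / (2 * k + total_spam)

# The example from above: a word never seen in 100 spam messages
assert p_word_given_spam(0, 100) == 1 / 102
```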
With that out of the way we're finally able to turn our findings into code!
Before we jump right into the implementation we should draw a mental picture as to how the Naive Bayes classifier works from a high level perspective.
You might've noticed that we talked a lot about probability calculations of finding words in a set of spam or ham messages. This indicates that we need to "train" our Naive Bayes classifier with training data so that we can do these computations on the fly. Otherwise if our Naive Bayes implementation hasn't seen any messages or words before, how do we know how to calculate the probabilities via Bayes Theorem?
The next thing we should think about are the raw messages we're feeding into our Naive Bayes algorithm, both for "training" it and doing predictions later on. First and foremost we're only focusing on messages in one language to keep the implementation simple. In our case all our messages will be in English. Other than that we need to parse the raw messages and identify and extract individual words given that we're using such words in our probability calculations. This process is called tokenization, so we need to implement a function which can handle this process.
Note: The tokenizer we implement here is really simple. Usually you'd want to use robust NLP libraries such as NLTK for this task.
Let's start with the tokenize function which takes a raw message as its input and returns a set of valid words as an output.
The implementation is really straightforward. The first thing we do is to find all words via a regular expression. We then iterate over all the words we found, lowercase them and add them to our list. Once done we turn our list into a set to ensure that duplicate entries are filtered out.
import re
from typing import List, Set

def tokenize(text: str) -> Set[str]:
    words: List[str] = []
    # Find all words via a regular expression and lowercase them
    for word in re.findall(r'[A-Za-z0-9\']+', text):
        words.append(word.lower())
    # Turning the list into a set filters out duplicate entries
    return set(words)
assert tokenize('Is this a text? If so, Tokenize this text!...') == {'is', 'this', 'a', 'text', 'if', 'so', 'tokenize'}
Now we're getting to the core of our implementation, the Naive Bayes classifier.
As we've stated earlier, our Naive Bayes classifier needs to be trained on existing messages in order to be able to do predictions on unseen messages later on. It therefore needs to remember what it saw and store this state internally. Because of that we'll be implementing Naive Bayes as a class.
Let's walk through the different methods the class needs step by step. We'll take a look at the code for the whole class at the end of this section.
First of all we need to implement a constructor which sets our internal state to default values and takes the $k$ parameter as an optional argument so that we can control how we want to deal with scenarios where we haven't seen a word in the spam or ham messages. Glancing over our probability calculations from above, we can see that we need to count quantities such as how many spam messages we saw in total during training. This is the state we initialize in our constructor:
def __init__(self, k: int = 1) -> None:
    self._k: int = k
    self._num_spam_messages: int = 0
    self._num_ham_messages: int = 0
    self._num_word_in_spam: Dict[str, int] = defaultdict(int)
    self._num_word_in_ham: Dict[str, int] = defaultdict(int)
    self._spam_words: Set[str] = set()
    self._ham_words: Set[str] = set()
    self._words: Set[str] = set()
Next up we need to implement a train function which we'll use to train our Naive Bayes classifier. "Training" in our case means that we get a list of labeled messages (messages for which we know whether they're spam or ham), iterate over each individual message, tokenize it and then iterate over every word in every message to update our internal state with the necessary information, such as how many spam messages we saw in the list of messages:
def train(self, messages: List[Message]) -> None:
    msg: Message
    token: str
    for msg in messages:
        tokens: Set[str] = tokenize(msg.text)
        self._words.update(tokens)
        if msg.is_spam:
            self._num_spam_messages += 1
            self._spam_words.update(tokens)
            for token in tokens:
                self._num_word_in_spam[token] += 1
        else:
            self._num_ham_messages += 1
            self._ham_words.update(tokens)
            for token in tokens:
                self._num_word_in_ham[token] += 1
Now before we jump into the implementation of the predict method, which uses this information via Bayes Theorem calculations to do predictions for new, unseen messages, it's a good idea to implement 2 helper functions, one for each conditional probability we need to compute in Bayes Theorem. Doing this will make the code more readable and will greatly help when implementing predict later on.
The implementations for our 2 methods are pretty straightforward. They're basically a translation of the conditional probability calculations, incorporating the $k$ factor to ensure that we'll never compute a $0$ probability:
def _p_word_spam(self, word: str) -> float:
    return (self._k + self._num_word_in_spam[word]) / ((2 * self._k) + self._num_spam_messages)

def _p_word_ham(self, word: str) -> float:
    return (self._k + self._num_word_in_ham[word]) / ((2 * self._k) + self._num_ham_messages)
And now we're finally able to implement the meat of our Naive Bayes classifier: the predict function.
The predict function gets a message (we call it text) which it tokenizes to extract its words. Next up it iterates through all the words the Naive Bayes classifier saw during training (all words in both spam and ham messages) to check whether each word can also be found in the new, unseen message. While doing that it calculates the probabilities with the help of our previously defined helper functions. The return value of the predict function is the application of Bayes Theorem: an overall probability indicating how likely it is that the new, unseen message is spam.
Note that in the following implementation we're applying the tricks we learned about in our discussion of underflow problems (mainly doing the wrapping and unwrapping via $\log$ and $\exp$). Applying these techniques might make the code look a little confusing or intimidating, however if you look closely and ignore the log and exp usages you'll find that it's just the application of Bayes Theorem.
def predict(self, text: str) -> float:
    text_words: Set[str] = tokenize(text)
    log_p_spam: float = 0.0
    log_p_ham: float = 0.0
    for word in self._words:
        p_spam: float = self._p_word_spam(word)
        p_ham: float = self._p_word_ham(word)
        if word in text_words:
            log_p_spam += log(p_spam)
            log_p_ham += log(p_ham)
        else:
            log_p_spam += log(1 - p_spam)
            log_p_ham += log(1 - p_ham)
    p_if_spam: float = exp(log_p_spam)
    p_if_ham: float = exp(log_p_ham)
    return p_if_spam / (p_if_spam + p_if_ham)
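The final line turns the two likelihoods into a posterior probability. Written out, and under the implicit assumption of equal priors $P(S) = P(H)$ (which is what this implementation uses), Bayes Theorem reduces to exactly this ratio since the priors cancel:

$$ P(S \mid \textrm{text}) = \frac{P(\textrm{text} \mid S) \, P(S)}{P(\textrm{text} \mid S) \, P(S) + P(\textrm{text} \mid H) \, P(H)} = \frac{P(\textrm{text} \mid S)}{P(\textrm{text} \mid S) + P(\textrm{text} \mid H)} $$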
Putting it all together, here's the full NaiveBayes class with all its methods:
from collections import defaultdict
from math import exp, log
from typing import Dict, List, Set

class NaiveBayes:
    def __init__(self, k: int = 1) -> None:
        self._k: int = k
        self._num_spam_messages: int = 0
        self._num_ham_messages: int = 0
        self._num_word_in_spam: Dict[str, int] = defaultdict(int)
        self._num_word_in_ham: Dict[str, int] = defaultdict(int)
        self._spam_words: Set[str] = set()
        self._ham_words: Set[str] = set()
        self._words: Set[str] = set()

    def train(self, messages: List[Message]) -> None:
        msg: Message
        token: str
        for msg in messages:
            tokens: Set[str] = tokenize(msg.text)
            self._words.update(tokens)
            if msg.is_spam:
                self._num_spam_messages += 1
                self._spam_words.update(tokens)
                for token in tokens:
                    self._num_word_in_spam[token] += 1
            else:
                self._num_ham_messages += 1
                self._ham_words.update(tokens)
                for token in tokens:
                    self._num_word_in_ham[token] += 1

    def _p_word_spam(self, word: str) -> float:
        return (self._k + self._num_word_in_spam[word]) / ((2 * self._k) + self._num_spam_messages)

    def _p_word_ham(self, word: str) -> float:
        return (self._k + self._num_word_in_ham[word]) / ((2 * self._k) + self._num_ham_messages)

    def predict(self, text: str) -> float:
        text_words: Set[str] = tokenize(text)
        log_p_spam: float = 0.0
        log_p_ham: float = 0.0
        for word in self._words:
            p_spam: float = self._p_word_spam(word)
            p_ham: float = self._p_word_ham(word)
            if word in text_words:
                log_p_spam += log(p_spam)
                log_p_ham += log(p_ham)
            else:
                log_p_spam += log(1 - p_spam)
                log_p_ham += log(1 - p_ham)
        p_if_spam: float = exp(log_p_spam)
        p_if_ham: float = exp(log_p_ham)
        return p_if_spam / (p_if_spam + p_if_ham)
Let's take our Naive Bayes implementation for a spin!
We'll use the "Enron Spam" data set to train and test our implementation.
Here's the code which downloads and extracts the data set:
wget -nc -P data http://nlp.cs.aueb.gr/software_and_datasets/Enron-Spam/preprocessed/enron1.tar.gz
tar -xzf data/enron1.tar.gz -C data
Reading through the readme we find that the spam and ham messages are stored in separate directories, so we need to locate those directories and then iteratively open and parse every single file we find in them. We then store the parsed subject line (we're only using the E-Mail's subject to keep things simple) in a NamedTuple and append it to a list containing all the messages. Here's the code which does just that:
import glob
from pathlib import Path
from typing import List, NamedTuple

data_dir: Path = Path('data')  # the directory we downloaded the data set into

spam_data_path: Path = data_dir / 'enron1' / 'spam'
ham_data_path: Path = data_dir / 'enron1' / 'ham'

class Message(NamedTuple):
    text: str
    is_spam: bool

spam_message_paths: List[str] = glob.glob(str(spam_data_path / '*.txt'))
ham_message_paths: List[str] = glob.glob(str(ham_data_path / '*.txt'))
message_paths: List[str] = spam_message_paths + ham_message_paths

messages: List[Message] = []
for path in message_paths:
    with open(path, errors='ignore') as file:
        is_spam: bool = 'spam' in path
        text: str = file.readline().replace('Subject:', '').strip()
        messages.append(Message(text, is_spam))
And that's all there is in terms of data preparation.
Now we need to split our data into a "training" and "testing" set which we'll use to train our Naive Bayes classifier.
Our train_test_split function is a simple function which shuffles all the messages and then assigns them to 2 dedicated lists: one for training and one for testing. The default splitting rate is $80/20$, meaning $80\%$ of all messages will be assigned to the training set and $20\%$ of all messages will be assigned to the test set.
from random import shuffle
from typing import Tuple

def train_test_split(messages: List[Message], pct: float = 0.8) -> Tuple[List[Message], List[Message]]:
    shuffle(messages)
    num_train = int(round(len(messages) * pct, 0))
    return messages[:num_train], messages[num_train:]
train: List[Message]
test: List[Message]
train, test = train_test_split(messages)
Training our Naive Bayes classifier is as simple as creating a new instance and calling the train method with the training set:
nb: NaiveBayes = NaiveBayes()
nb.train(train)
Let's grab some spam messages from our test set (the data our classifier hasn't seen yet) and see what gets predicted:
spam_messages: List[Message] = [item for item in test if item.is_spam]
message: str = spam_messages[10].text
print(f'Predicting likelihood of "{message}" being spam.')
nb.predict(message)
# Predicting likelihood of "get your hand clock replicas today carson" being spam.
# 0.9884313222593173
Almost $99\%$. Not bad!
And what about a ham message?
ham_messages: List[Message] = [item for item in test if not item.is_spam]
message: str = ham_messages[10].text
print(f'Predicting likelihood of "{message}" being spam.')
nb.predict(message)
# Predicting likelihood of "associate & analyst mid - year 2001 prc process" being spam.
# 5.3089147140900964e-05
Great! It's time to pat yourself on the back! You've successfully implemented and trained your very own Naive Bayes classifier to reliably identify spam and ham messages!
Note: $99\%$ sounds too good to be true and there's certainly a smell of overfitting in the air. Remember that the data set we've used for training is rather small. Furthermore we've only trained on the E-Mail's subject line. You might want to modify the code to train on the whole message body or find a different data set altogether.
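One way to put a number on that smell is to compute the accuracy over the whole test set rather than looking at individual messages. The following is a minimal sketch, assuming the `nb` and `test` objects from above and a 0.5 decision threshold (an arbitrary choice):

```python
def accuracy(classifier, messages, threshold: float = 0.5) -> float:
    # Fraction of messages whose thresholded prediction matches the true label
    correct: int = sum((classifier.predict(msg.text) > threshold) == msg.is_spam
                       for msg in messages)
    return correct / len(messages)

# print(f'Test accuracy: {accuracy(nb, test):.2%}')
```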
Naive Bayes is a powerful Machine Learning algorithm which makes it possible to classify unseen data based on probability scores.
Bayes Theorem, one of the most fundamental theorems in probability theory, forms the basis for Naive Bayes. With Bayes Theorem one can calculate conditional probabilities such as "how likely is it that this message is spam given that it contains word $X$?", which is exactly what's necessary to implement a spam classifier.
While we've used the classic spam filtering use case to deconstruct Naive Bayes, there are more areas this algorithm can be applied to.
Given that recent developments in Deep Learning, under the name Probabilistic Deep Learning, incorporate Bayesian thinking into their underlying models, it's a good investment to solidify one's understanding of Bayes Theorem and its application via Naive Bayes.
I hope that you enjoyed this article and I'd like to invite you to subscribe to my Newsletter if you're interested in more posts like this.
Do you have any questions, feedback or comments? Feel free to reach out via E-Mail or connect with me on Twitter.
One of the main resources I studied while learning about Machine Learning algorithms such as Naive Bayes is the book "Data Science from Scratch" by Joel Grus. This book is pure gold as it teaches you all the nitty-gritty details about the most fundamental algorithms out there. If you haven't already, do yourself a favor and purchase a copy of this book. It's worth it.
The following is a list of resources I've used to compile this blog post. Other helpful resources are also linked within the article itself.
K-nearest neighbors (abbreviated as k-NN or KNN) is a simple, yet elegant Machine Learning algorithm to classify unseen data based on existing data. The neat property of this algorithm is that it doesn't require a "traditional" training phase. If you have a classification problem and labeled data you can predict the class of any unseen data point by leveraging your existing, already classified data.
Let's take a closer look at the intuitions behind the core ideas, the involved math and the translation of such into code.
Imagine that we've invited 100 dog owners with their dogs over for a statistical experiment we want to run. Each dog participating in this experiment belongs to 1 of 4 different dog breeds we're interested in studying. While we have the dog owners and their dogs around we measure 3 different properties of each dog:
Once done, we normalize the measurements so that they fall into a range between $0$ and $1$.
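The normalization step mentioned above is typically done via min-max scaling, which maps every value into the range $[0, 1]$. A minimal sketch (the function name and sample weights are ours, not from the experiment):

```python
from typing import List

def min_max_scale(values: List[float]) -> List[float]:
    # Map each value into [0, 1] relative to the smallest / largest observation
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical dog weights in kg
weights_kg: List[float] = [8.0, 12.0, 20.0, 32.0]
scaled: List[float] = min_max_scale(weights_kg)  # smallest -> 0.0, largest -> 1.0
```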
After collecting the data on each individual dog we end up with 100 3-dimensional measurements, each of which is labeled with the corresponding dog breed.
Here's one example:
$$ \begin{pmatrix} 0.5 \\ 0.8 \\ 0.1 \end{pmatrix} = \textrm{Podenco} $$
In order to better understand the data it's always a good idea to plot it. Since we have collected 3 different measurements (weight, height and alertness) it's possible to project all of the 100 data points into a 3 dimensional space and color every data point according to its label (e.g. brown for the label "Podenco").
Unfortunately we run into an issue while attempting to plot this data. As it turns out we forgot to label one measurement. We do have the dog's weight, its height and its alertness, but for some reason we forgot to write down the dog's breed.
Is there any chance we could derive what this dog's breed might be, given all the other dog measurements we already have? We can still add the unlabeled data point to our existing 3 dimensional space where all the other colored data points reside. But how should we color it?
One potential solution is to look at the, say, 5 surrounding neighbors of the data point in question and see what their colors are. If the majority of those data points is labeled "Podenco" then it's very likely that our measurements were also taken from a Podenco.
And that's exactly what the k-NN algorithm does. The k-nearest neighbors algorithm predicts the class of an unseen data point based on its k-nearest neighbors and the majority of their respective classes. Let's take a closer look at this from a mathematical perspective.
There are only 2 concepts we need to implement in order to classify unseen data via k-NN.
As stated above, the algorithm works by looking at the k-nearest neighbors and the majority of their respective classes in order to classify unseen data.
Because of that we need to implement 2 functions. A distance function which calculates the distance between two points and a voting function which returns the most seen label given a list of arbitrary labels.
Given the notion of "nearest neighbors" we need to calculate the distance between our "to be classified" data point and all the other data points to find the $k$ closest ones.
There are a couple of different distance functions out there. For our implementation we'll use the Euclidean distance as it's simple to calculate and can easily scale up to $n$ dimensions.
The mathematical notation is as follows:
$$ d(x, y) = d(y, x) = \sqrt{\sum_{i=1}^N (x_i - y_i)^2} $$
Let's unpack this formula with the help of an example. Assuming we have two vectors $a$ and $b$, the euclidean distance between the two is calculated as follows:
$$ \vec{a} = \begin{pmatrix} 1 \\ 2 \\ 3 \\ 4 \end{pmatrix} \qquad \vec{b} = \begin{pmatrix} 5 \\ 6 \\ 7 \\ 8 \end{pmatrix} $$
$$ d(\vec{a}, \vec{b}) = d(\vec{b}, \vec{a}) = \sqrt{(1 - 5)^2 + (2 - 6)^2 + (3 - 7)^2 + (4 - 8)^2} = \sqrt{64} = 8 $$
Translating this into code results in the following:
from math import sqrt
from typing import List

def distance(x: List[float], y: List[float]) -> float:
    assert len(x) == len(y)
    return sqrt(sum((x[i] - y[i]) ** 2 for i in range(len(x))))
assert distance([1, 2, 3, 4], [5, 6, 7, 8]) == 8
Great. We just implemented the first building block: a Euclidean distance function.
Next up we need to implement a voting function. The voting function takes as an input a list of labels and returns the "most seen" label of that list. While this sounds trivial to implement we should take a step back and think about potential edge cases we might run into.
One of those edge cases is the situation in which we have 2 or more "most seen" labels:
# Do we return `a` or `b`?
labels: List[str] = ['a', 'a', 'b', 'b', 'c']
For those scenarios we need to implement a tie breaking mechanism.
There are several ways to deal with this. One solution might be to pick a candidate randomly. In our case however we shouldn't think about the vote function in isolation since we know that the distance and vote functions both work together to determine the label of the unseen data point.
We can exploit this fact and assume that the list of labels our vote function gets as an input is sorted by distance, nearest to furthest. Given this requirement it's easy to break ties. All we need to do is recursively remove the last entry in the list (which is the furthest) until we only have one clear winning label.
The following demonstrates this process based on the labels example from above:
# Do we return `a` or `b`?
labels: List[str] = ['a', 'a', 'b', 'b', 'c']
# Remove one entry. We're still unsure if we should return `a` or `b`
labels: List[str] = ['a', 'a', 'b', 'b']
# Remove another entry. Now it's clear that `a` is the "winner"
labels: List[str] = ['a', 'a', 'b']
Let's translate that algorithm into a function we call majority_vote:
from collections import Counter
from typing import List, Tuple

def majority_vote(labels: List[str]) -> str:
    counted: Counter = Counter(labels)
    winner: List[str] = []
    max_num: int = 0
    most_common: Tuple[str, int]
    for most_common in counted.most_common():
        label: str = most_common[0]
        num: int = most_common[1]
        if num < max_num:
            break
        max_num = num
        winner.append(label)
    # More than one winner means a tie: drop the furthest label and vote again
    if len(winner) > 1:
        return majority_vote(labels[:-1])
    return winner[0]
assert majority_vote(['a', 'b', 'b', 'c']) == 'b'
assert majority_vote(['a', 'b', 'b', 'a']) == 'b'
assert majority_vote(['a', 'a', 'b', 'b', 'c']) == 'a'
The tests at the bottom show that our majority_vote function reliably deals with edge cases such as the one described above.
Now that we've studied and codified both building blocks it's time to bring them together. The knn function we're about to implement takes as inputs the list of labeled data points, a new measurement (the data point we want to classify) and a parameter k which determines how many neighbors we want to take into account when voting for the new label via our majority_vote function.
The first thing our knn algorithm should do is calculate the distances between the new data point and all the other, existing data points. Once done we need to order the distances from nearest to furthest and extract the data point labels. This sorted list is then truncated so that it only contains the k nearest data point labels. The last step is to pass this list into the voting function which computes the predicted label.
Turning the described steps into code results in the following knn function:
from operator import attrgetter
from typing import List, NamedTuple

class Prediction(NamedTuple):
    label: str
    measurements: List[float]

def knn(labeled_data: List[LabeledData], new_measurement: List[float], k: int = 5) -> Prediction:
    class Distance(NamedTuple):
        label: str
        distance: float

    distances: List[Distance] = [Distance(data.label, distance(new_measurement, data.measurements))
                                 for data in labeled_data]
    distances = sorted(distances, key=attrgetter('distance'))
    # Keep only the labels of the k nearest neighbors (renamed loop variable
    # so we don't shadow the distance function)
    labels: List[str] = [dist.label for dist in distances][:k]
    label: str = majority_vote(labels)
    return Prediction(label, new_measurement)
That's it. That's the k-nearest neighbors algorithm implemented from scratch!
It's time to see if our homebrew k-NN implementation works as advertised. To test drive what we've coded we'll use the infamous Iris flower data set.
The data set consists of 50 samples each of three different species of the Iris flower:
For each sample, 4 different measurements were collected: The sepal width and length as well as its petal width and length.
The following is an example row from the data set, where the first 4 numbers are the sepal length, sepal width, petal length and petal width, and the last string represents the label for those measurements.
6.9,3.1,5.1,2.3,Iris-virginica
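The post doesn't show the loading code at this point, but a minimal parsing sketch might look as follows. The `LabeledData` shape is an assumption based on how the `knn` function accesses its inputs (`data.label` and `data.measurements`), and the file path is hypothetical:

```python
from typing import List, NamedTuple

class LabeledData(NamedTuple):
    label: str
    measurements: List[float]

def parse_iris_line(line: str) -> LabeledData:
    # '6.9,3.1,5.1,2.3,Iris-virginica' -> LabeledData('Iris-virginica', [6.9, 3.1, 5.1, 2.3])
    *numbers, label = line.strip().split(',')
    return LabeledData(label, [float(n) for n in numbers])

# Assuming the raw CSV was downloaded to e.g. 'data/iris.data':
# labeled_data: List[LabeledData] = [parse_iris_line(l) for l in open('data/iris.data') if l.strip()]
```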
The best way to explore this data is to plot it. Unfortunately it's hard to plot and inspect 4 dimensional data. However we can pick 2 measurements (e.g. petal length and petal width) and scatter plot those in 2 dimensions.
Note: We're using the amazing Plotly library to create our scatter plots.
fig = px.scatter(x=xs, y=ys, color=text, hover_name=text, labels={'x': 'Petal Length', 'y': 'Petal Width'})
fig.show()
We can clearly see clusters of data points which all share the same color and therefore the same label.
Now let's pretend that we have a new, unlabeled data point:
new_measurement: List[float] = [7, 3, 4.8, 1.5]
Adding this data point to our existing scatter plot results in the following:
fig = px.scatter(x=xs, y=ys, color=text, hover_name=text, labels={'x': 'Petal Length', 'y': 'Petal Width'})
fig.add_annotation(
    go.layout.Annotation(
        x=new_measurement[petal_length_idx],
        y=new_measurement[petal_width_idx],
        text="The measurement we want to classify")
)
fig.update_annotations(dict(
    xref="x",
    yref="y",
    showarrow=True,
    arrowhead=7,
    ax=0,
    ay=-40,
    borderwidth=2,
    borderpad=4,
    bgcolor="#c3c3c3"
))
fig.show()
Even though we're just plotting the petal length and petal width in 2 dimensions, it seems that the new measurement might've been taken from an "Iris Versicolor".
Let's use our knn function to get a definite answer:
knn(labeled_data, new_measurement, 5)
And sure enough the result we get back indicates that we're dealing with an "Iris Versicolor":
Prediction(label='Iris-versicolor', measurements=[7, 3, 4.8, 1.5])
k-nearest neighbors is a very powerful classification algorithm which makes it possible to label unseen data based on existing data. k-NNs main idea is to use the $k$ nearest neighbors of the new, "to-be-classified" data point to "vote" on the label it should have.
Given that, we need 2 core functions to implement k-NN. The first function calculates the distance between 2 data points so that nearest neighbors can be found. The second function performs a majority vote so that a decision can be made as to what label is most present in the given neighborhood.
Using both functions together brings k-NN to life and makes it possible to reliably label unseen data points.
I hope that you enjoyed this article and I'd like to invite you to subscribe to my Newsletter if you're interested in more posts like this.
Do you have any questions, feedback or comments? Feel free to reach out via E-Mail or connect with me on Twitter.
Thanks to Eric Nieuwland who reached out via E-Mail and provided some code simplifications!
The following is a list of resources I've used to work on this blog post. Other helpful resources are also linked within the article itself.
Gradient Descent is one of the most fundamental optimization techniques used in Machine Learning. But what is a gradient? What do we descend down, and what do we even optimize in the first place?
Those might be some of the questions which come to mind when having the first encounters with Gradient Descent. Let's answer those questions while implementing the Gradient Descent optimization technique from scratch.
Many Machine Learning problems require some form of optimization. Usually the algorithm starts with an initial guess as to what the correct answer might be and then slowly adapts its parameters to be less wrong with every consecutive trial.
This learning process through trial and error repeats until the algorithm has learned to "correctly" predict the results for the training- and test data it gets fed.
In order to figure out how wrong our algorithm is with its predictions, we need to define the notion of a loss function. The loss function compares the guesses and the actual values and turns the differences between the two into a measurement we can use to quantify the quality of our algorithm. Prominent loss functions are the Mean squared error (MSE), the Root mean square error (RMSE) and the Sum of squared errors (SSE).
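As a concrete example, the Mean squared error averages the squared differences between predictions and actual values. A minimal sketch (names and sample values are ours):

```python
from typing import List

def mean_squared_error(predictions: List[float], targets: List[float]) -> float:
    # Average of squared differences -- large errors are penalized quadratically
    assert len(predictions) == len(targets)
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(predictions)

# Each guess is off by 1, so the MSE is ((1)^2 + (1)^2) / 2 = 1.0
assert mean_squared_error([2.0, 4.0], [1.0, 3.0]) == 1.0
```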
Imagine that we're plotting the loss (i.e. the "wrongness") the algorithm produces as a surface in a multi dimensional space (such as 3D). Intuitively it seems to make sense to find the place on this surface where the algorithm makes the fewest mistakes. That's exactly where Gradient Descent comes into play. With Gradient Descent we take a "walk" on this surface to find such a place.
As we've already discovered, loss functions and optimizations are usually intertwined when working on Machine Learning problems. While it makes sense to teach them together I personally believe that it's more valuable to keep things simple and focused while exploring core ideas. Therefore for the rest of this post we'll solely focus on Gradient Descent as a mathematical optimization technique which can be applied in a variety of different domains (including Machine Learning problems).
The first thing we should do is to define the function we want to run the Gradient Descent algorithm on. Most examples explain the algorithm's core concepts with a single variable function such as one drawn from the class of parabolas (e.g. $x^2$). Single variable functions can be easily plotted in a 2 dimensional space along the $x$ and $y$ axes. Besides other nice properties, dealing with only 2 dimensions makes the involved math a whole lot easier. Real world Machine Learning problems however deal with data in orders of magnitude higher dimensions. While there's a slightly steeper learning curve to go from 2D to 3D, there's pretty much nothing new to learn to go from 3D to $n$D. Because of that we'll apply Gradient Descent to a multivariable function with 2 variables.
Most of us studied the properties of quadratic functions via parabolas in school. But is there an equivalent function which can be plotted in 3 dimensions? Imagine that you have a parabola and spin it like a centrifuge along its $y$-axis. What you'd end up with is a surface called a paraboloid, a "3 dimensional parabola".
There are various ways to define paraboloids but in our case let's use the "infinite paraboloid" which is defined as follows:
$$ f(x, y) = x^2 + y^2 $$
Translating the math into code results in:
def paraboloid(x: float, y: float) -> float:
    return x ** 2 + y ** 2
Simple enough. Next up we should generate some test data, pass it into our paraboloid function and plot the results to see if everything works as expected:
from typing import List

# Test data generation (only really necessary for the plotting below)
xs_start = ys_start = -10
xs_stop = ys_stop = 11
xs_step = ys_step = 1

xs: List[int] = [i for i in range(xs_start, xs_stop, xs_step)]
ys: List[int] = [i for i in range(ys_start, ys_stop, ys_step)]
zs: List[List[float]] = []

for x in xs:
    temp_res: List[float] = []
    for y in ys:
        result: float = paraboloid(x, y)
        temp_res.append(result)
    zs.append(temp_res)
print(f'xs: {xs}\n')
# xs: [-10, -9, -8, -7, -6, -5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(f'ys: {ys}\n')
# ys: [-10, -9, -8, -7, -6, -5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(f'zs: {zs[:5]} ...\n')
# zs: [[200, 181, 164, 149, 136, 125, 116, 109, 104, 101, 100, 101, 104, 109, 116, 125, 136, 149, 164, 181, 200], [181, 162, 145, 130, 117, 106, 97, 90, 85, 82, 81, 82, 85, 90, 97, 106, 117, 130, 145, 162, 181], [164, 145, 128, 113, 100, 89, 80, 73, 68, 65, 64, 65, 68, 73, 80, 89, 100, 113, 128, 145, 164], [149, 130, 113, 98, 85, 74, 65, 58, 53, 50, 49, 50, 53, 58, 65, 74, 85, 98, 113, 130, 149], [136, 117, 100, 85, 72, 61, 52, 45, 40, 37, 36, 37, 40, 45, 52, 61, 72, 85, 100, 117, 136]] ...
Plotting the graph (using the Plotly library) can be done with the following code:
fig = go.Figure(go.Surface(x=xs, y=ys, z=zs, colorscale='Viridis'))
fig.show()
According to Wikipedia the definition of a gradient is:
... a vector-valued function $f$..., whose value at a point $p$ is the vector whose components are the partial derivatives of $f$ at $p$.
I agree that this sounds overwhelming, but the core concepts of a gradient are really simple. What the quote above tries to convey is that for any given point $p$ on the function's plotted surface there's a vector, consisting of partial derivatives, which points in the direction of greatest increase.
With that explanation the only concept we need to sort out is the notion of partial derivatives.
If you've studied Differential Calculus you surely came across the topic of derivatives. To recap real quick, a derivative measures the sensitivity to change of a function's output with respect to changes in its inputs.
Derivatives can be of different orders. One of the most prominent derivatives you surely came across is the first-order derivative which is the slope of a tangent line that tells us whether the function is increasing or decreasing at any given point.
The process of differentiation obeys several rules, some of which we'll apply when walking through the example below to refresh our fond memories of Calculus I.
Let's calculate the first-, second- and third-order derivative of the following function:
$$ x^3 + 2x^2 - 5 $$
According to the differentiation rules, the first-order derivative is:
$$ \frac{d}{dx}(x^3 + 2x^2 - 5) = 3x^2 + 4x $$
The second-order derivative is:
$$ \frac{d}{dx}(3x^2 + 4x) = 6x + 4 $$
And finally the third-order derivative is:
$$ \frac{d}{dx}(6x + 4) = 6 $$
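We can sanity-check the first-order derivative numerically with a central finite difference. This snippet is a sketch of ours, not part of the original walkthrough:

```python
def f(x: float) -> float:
    # The example function x^3 + 2x^2 - 5 from above
    return x ** 3 + 2 * x ** 2 - 5

def f_prime(x: float) -> float:
    # Its first-order derivative 3x^2 + 4x
    return 3 * x ** 2 + 4 * x

def numerical_derivative(func, x: float, h: float = 1e-6) -> float:
    # Central difference approximation of the first-order derivative
    return (func(x + h) - func(x - h)) / (2 * h)

# The approximation should closely match the analytical derivative
assert abs(numerical_derivative(f, 2.0) - f_prime(2.0)) < 1e-4
```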
Why did we do this? As you might recall from above, a gradient is a vector of partial derivatives of function $f$ at point $p$. So in order to compute gradients we need to compute the partial derivatives of our function $f$.
We certainly do know how to compute derivatives as we've shown above, but how do we calculate partial derivatives? As it turns out you already know how to calculate partial derivatives if you know how to calculate derivatives. The twist with partial derivatives is that you're deriving with respect to a variable while treating every other variable as a constant.
Let's apply this rule and compute the partial derivatives of our Paraboloid function $x^2 + y^2$ which we call $f$:
The first partial derivative we calculate is the derivative of $f$ with respect to $x$, treating $y$ as a constant:
$$ \frac{\partial}{\partial x}(x^2 + y^2) = 2x $$
The second partial derivative calculation follows the same pattern, deriving $f$ with respect to $y$ while treating $x$ as a constant:
$$ \frac{\partial}{\partial y}(x^2 + y^2) = 2y $$
Note: Don't get confused by the notation here. While you'd use $\frac{d}{dx}$ for "normal" derivatives you simply use $\frac{\partial}{\partial x}$ for partial derivatives.
And that's it. With those partial derivatives we're now able to compute any gradient for any point $p$ sitting on the plotted surface of function $f$. Let's translate our findings into code:
from typing import List

def compute_gradient(vec: List[float]) -> List[float]:
    assert len(vec) == 2
    x: float = vec[0]
    y: float = vec[1]
    return [2 * x, 2 * y]
Let's recap what we've accomplished so far. First, we've codified and plotted a function called the "infinite paraboloid" which is used throughout this post to demonstrate the Gradient Descent algorithm. Next up we studied gradients, taking a quick detour into Differential Calculus to refresh our memories about differentiation. Finally we computed the partial derivatives of our paraboloid function.
Translating the prose into a mental image, we have a function $f$ which produces a surface in a three-dimensional space. Given any point $p$ on that surface we can use the gradient (via its partial derivatives) to find a vector pointing in the direction of greatest increase.
That's great, but we're dealing with an optimization problem, and for many such applications it's desirable to find a global or local minimum. Right now the gradient vector points in the direction of greatest increase. We need to flip that vector so that it points in the direction of greatest decrease.
Linear Algebra taught us that this is as simple as multiplying the gradient vector by $-1$. Another neat Linear Algebra trick is to multiply a vector by a number other than $1$ to change its magnitude (its length). If, for example, we multiply the gradient vector by $0.25$, we get a vector with $\frac{1}{4}$ of its original length.
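In code, flipping and shrinking a gradient vector is just an element-wise multiplication. This toy illustration (my addition, assuming the list-of-floats representation used throughout the post) combines both tricks in a single factor:

```python
grad = [6.0, -4.0]  # an example gradient vector

# Multiply by -1 to point towards the greatest decrease,
# and by 0.25 to shrink it to a quarter of its length.
step = [-0.25 * g for g in grad]
print(step)  # [-1.5, 1.0]
```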
We finally have all the building blocks in place to put together the Gradient Descent algorithm which eventually finds the nearest local minimum for any given function.
The algorithm works as follows. Given a function $f$ we want to run Gradient Descent on:

1. Pick a random starting position $p$ on the surface of $f$.
2. Compute the gradient of $f$ at $p$ via its partial derivatives.
3. Multiply the gradient by $-1$ and scale it down by a small step size.
4. Add the resulting vector to $p$ to obtain the next position.
5. Repeat from step 2 until the position converges towards a local minimum.
Let's turn that into code. First, we should define a `compute_step` function which takes as parameters the vector $p$ and a "step size" (we call it `learning_rate`), and computes the next position of $p$ according to the gradient of $f$:
def compute_step(curr_pos: List[float], learning_rate: float) -> List[float]:
    grad: List[float] = compute_gradient(curr_pos)
    grad[0] *= -learning_rate
    grad[1] *= -learning_rate
    next_pos: List[float] = [0, 0]
    next_pos[0] = curr_pos[0] + grad[0]
    next_pos[1] = curr_pos[1] + grad[1]
    return next_pos
Next up we should define a random starting position $p$:
start_pos: List[float]
# Ensure that we don't start at a minimum (0, 0 in our case)
while True:
    start_x: float = randint(xs_start, xs_stop)
    start_y: float = randint(ys_start, ys_stop)
    if start_x != 0 and start_y != 0:
        start_pos = [start_x, start_y]
        break
print(start_pos)
# [6, -3]
And finally we wrap our `compute_step` function into a loop to iteratively walk down the surface and eventually reach a local minimum:
epochs: int = 5000
learning_rate: float = 0.001
best_pos: List[float] = start_pos
for i in range(0, epochs):
    next_pos: List[float] = compute_step(best_pos, learning_rate)
    if i % 500 == 0:
        print(f'Epoch {i}: {next_pos}')
    best_pos = next_pos
print(f'Best guess for a minimum: {best_pos}')
# Epoch 0: [5.988, -2.994]
# Epoch 500: [2.2006573940846716, -1.1003286970423358]
# Epoch 1000: [0.8087663604107433, -0.4043831802053717]
# Epoch 1500: [0.2972307400008091, -0.14861537000040456]
# Epoch 2000: [0.10923564223981955, -0.054617821119909774]
# Epoch 2500: [0.04014532795468376, -0.02007266397734188]
# Epoch 3000: [0.014753859853277982, -0.007376929926638991]
# Epoch 3500: [0.005422209548664845, -0.0027111047743324226]
# Epoch 4000: [0.0019927230353282872, -0.0009963615176641436]
# Epoch 4500: [0.0007323481432962652, -0.0003661740716481326]
# Best guess for a minimum: [0.00026968555624761565, -0.00013484277812380783]
That's it. We've successfully implemented the Gradient Descent algorithm from scratch!
Gradient Descent is a fundamental optimization algorithm widely used in Machine Learning applications. Since it's used to minimize the error of a model's predictions, it sits at the very core of what enables algorithms to "learn".
In this post we've dissected all the different parts the Gradient Descent algorithm consists of. We looked at the mathematical formulations and translated those into code, which in our case can be used to find the global minimum on a Paraboloid function $x^2 + y^2$. Note that our implementation was optimized for readability and educational usefulness rather than performance. You might not want to use this code in production.
Modern Machine Learning libraries such as TensorFlow or PyTorch ship with built-in differentiation capabilities to automatically compute partial derivatives, removing the need to implement the tedious Math yourself.
I hope that you enjoyed this article and I'd like to invite you to subscribe to my Newsletter if you're interested in more posts like this.
Do you have any questions, feedback or comments? Feel free to reach out via E-Mail or connect with me on Twitter.
The following is a list with resources I've used to write this blog post. The post itself also links to some other interesting content you might want to further explore.
The end of last year felt slightly different, however, as we left the "old" decade behind and entered the new one: 2020.
A decade sounds shorter than it actually is. To get some perspective on what can happen in 10 years, I did some research, tracking down where we started technology-wise when entering 2010:
Did you know that Uber, the now-famous ride-hailing startup, was just getting started and was barely a thing in 2010? That Snapchat was merely an idea until the first release hit the Apple AppStore in 2011? Or that our text-based communication was mostly done via SMS, as only early adopters owned a smartphone? Just a few people used WhatsApp, the then-paid communication platform which was started in 2009 and took the world by storm, eventually replacing texting via SMS.
All of that happened in between 2010 and 2020. 10 years is a long time. A lot can change in a decade.
We're living in extremely exciting times as mankind has never seen technological advancements at such a rapid pace. One area I'm extremely excited about is Machine Learning and its impact on computing and society. While I certainly had some theoretical encounters during college, my interest in AI rekindled when I read about the breathtaking breakthroughs we've achieved in areas such as self-driving cars, game AIs or the life sciences.
Such breakthroughs mostly happened in the last decade. One can only imagine what else will be unlocked if we follow the current trajectory of research efforts and hardware / software innovation.
While reflecting on the last 10 years I decided to pick a grand theme I want to focus my efforts on in the upcoming years. As my background is in Open Source software and Cloud computing and my passion lies in Computer Science and Math, I decided to phrase that theme as:
The intersection of Computer Science, Mathematics and Distributed Computing
In the coming years I'll dedicate my time to the field of Machine Learning. To further narrow this down, my plan is to focus on the "bleeding edge", which will include research and development efforts on various fronts. The core idea is to study brand new, theoretical research and put it into practice, hence the name of this publication: The Intersection.
There's a lot to discover during this process and together we'll deconstruct our findings in order to better understand their underlying guiding principles and mathematical foundations. Furthermore we'll apply what we've learned while working on different projects ranging from toy projects to PoCs to custom-tailored Machine Learning toolings.
Since "sharing is caring" I'll document the whole journey via blog posts and Open Source the code I'll produce on my GitHub account.
Every adventure should have a sound plan, so here's what's on my mind right now. While I've already described the guiding "North Star" in the section above, I decided to further split up the endeavor into different categories. Every blog post I'll write will be tagged with respective categorical tags.
The focus of this category is on the exploration and explanation of different Machine Learning algorithms and techniques. We'll peek under the covers and deconstruct how modern Machine Learning really works. Open Sourced code implementations will demonstrate the usefulness of models such as Linear Regression, Latent Dirichlet Allocation (LDA) or Capsule Neural Networks.
Pretty much all Machine Learning algorithms require a deep understanding of advanced mathematical concepts. Given that requirement, most people shy away since "math is too complicated" or "just not for me".
I thought exactly the same but decided to put a fork in the road. Earlier last year I created a curriculum which I followed to relearn advanced mathematics from scratch. I can tell you that it's definitely challenging, but at the same time intellectually stimulating and really rewarding. Trust me when I say that Math is a beautiful language and absolutely worthwhile to study.
The Math category will be used to translate complicated looking concepts into easily digestible prose and code.
It's of utmost importance to stay on top of current research. During our journey through the lands of Machine Learning and Artificial Intelligence I plan to allocate time to read about recent developments and share my findings via dedicated research-oriented posts.
Since research goes both ways I'll also take some time to do some research on my own, extending and discussing existing ideas while also coming up with new ones.
Staying too long in the theoretical realm is boring. Applying the theory in practice is where the fun begins and the real learning happens!
The "Projects" and "Code" categories will be used to talk about the projects we'll work on while bridging the gap between science and engineering. Projects can range from specifications, PoCs and implementation proposals to fully-fledged applications.
I already have a long list of ideas I plan to work on and I'll present some of those in upcoming posts.
As per usual I plan to Open Source the code while working on the projects and I'd be more than happy to welcome you joining the fun on GitHub.
Over the last couple of weeks I've already started to work according to the plan I just described above.
A solid foundational knowledge of the core concepts and principles is key to succeeding in any field of study. That's why I started the challenge of implementing core Machine Learning algorithms from scratch in pure Python with literally zero dependencies (except for plotting graphs and diagrams). These algorithms include Gradient Descent, k-NN, Linear and Multiple Linear Regression, Logistic Regression, Naive Bayes, Decision Trees, Neural Networks and more. Most of them are already translated into blog posts which explain the algorithm in an intuitive and fun way. I plan to release the first blog posts in the upcoming weeks.
Other than that I'm fascinated by the idea and technological challenges of privacy-preserving Machine Learning. There's the saying that "data is the new oil" since it's used, among other things, to train the Machine Learning algorithms that drive our daily lives. But data is more than "just modern oil". Data is precious, sensitive and shouldn't be shared or used without a user's consent. Techniques such as Homomorphic Encryption, Secure Multi-Party Computation, Differential Privacy and Federated Learning can be used to solve the privacy issue and e.g. train encrypted Machine Learning models on encrypted data in untrusted environments. I've already started to look into current research efforts and implemented different Homomorphic Encryption schemes to get an idea of what can work and what's still practically infeasible. While doing so I was blown away by the capabilities we might unlock in the next couple of years. It was really hard to pull myself out of that rabbit hole. To share my enthusiasm and get you excited as well, I'm currently preparing some deep-dive blog post series around those topics!
10 years ago self-driving cars and personal assistants were merely far-fetched dreams, only really appreciated by SciFi fans and futurology nerds. Nowadays those technologies are omnipresent thanks to rapid innovations in areas such as Artificial Intelligence, Neuroscience, Distributed Computing and Hardware / Software development.
There's no better time than now to get involved in Machine Learning. That's why I decided to fully immerse myself into the field. I'll use this blog to document the journey while we learn about bleeding edge research, turn algorithms into code and work on various projects while decoding the underpinnings of the intelligent technologies which drive our daily lives.
I hope that you enjoyed this article and I'd like to invite you to subscribe to my Newsletter if you're interested in more posts like this.
Do you have any questions, feedback or comments? Feel free to reach out via E-Mail or connect with me on Twitter.
Entering the Data Science and Machine Learning world there are various programming languages and tools to choose from. There's MATLAB, a commercial programming environment which is used across different industries and is usually the tool of choice for practitioners with a heavy Math background. A free and Open Source alternative is the R project, a programming language created in 1993 to simplify statistical data processing. People working with R usually report enjoyment as R is "hackable" and comes bundled with different math modules and plotting libraries.
A more recent incarnation is the Julia scientific programming language, created at MIT to resolve the issues older tools such as MATLAB and R struggled with. Julia cleverly incorporates modern engineering efforts from the fields of compiler construction and parallel computing, and given its Open Source nature it has gained a lot of industry-wide adoption since it reached v1 maturity in 2018.
If you're doing some more research to find the most used programming language in Data Science and Machine Learning you might be surprised to see a language which wasn't built from the ground up for scientific computing: Python.
The Python programming language was created by Guido van Rossum in 1989 to help bridge the gap between Bash scripting and C programming. Since then Python took the world by storm mainly due to its flat learning curve, its expressiveness and its powerful standard library which makes it possible to focus on the core problems rather than reinventing the wheel over and over again.
Funny tangent: Open up a shell, run the Python interpreter via `python` and enter `import this` or `import antigravity`.
Python is a general purpose programming language which was never designed to solve problems in a niche subject matter. Are you an Instagram user? They're running Python. Do you curate content via Pinterest? They're running Python. Do you store your data via Dropbox? They've developed their MVP in Python and still use it today. Even Google (then called BackRub) started out with Python and Java. The list goes on and on.
Given such an industry-wide adoption it's easy to see why a lot of care and effort were put into the ecosystem of reusable packages as well as the language itself. No matter what use case you're working on, chances are that there are numerous Python packages helping you solve your problems.
While more famous Python projects include Web Frameworks such as Django or Flask, there are also a lot of mature Scientific and Machine Learning implementations written in Python. Having access to such a robust foundation, it only makes sense that modern Deep Learning frameworks such as TensorFlow or PyTorch also leverage those libraries under the covers.
All of the things discussed so far sound great. Python, a general purpose programming language which has quite a few years of existence under its belt is used across industries in mission critical software systems. Over the course of 30 years a vibrant Open Source community emerged which develops and maintains powerful libraries used by millions of users on a daily basis.
Why bother and replace Python? If it ain't broke, don't fix it!
Technology is constantly improving. What was once unthinkable might all of a sudden be possible thanks to breakthroughs in Hard- and Software development. Python was created in a different era with a different purpose. It was never engineered to directly interface with hardware or run complex computations across a fleet of distributed machines.
A modern Deep Learning Framework such as TensorFlow uses dozens of programming languages behind the scenes. The core of such a library might be written in high-performance C++ which occasionally interfaces with different C libraries, Fortran programs or even parts of Assembly language to squeeze out every bit of performance possible. A Python interface is usually built on top of the C++ core to expose a simple public API Data Scientists and Deep Learning enthusiasts use.
Why isn't Python used throughout the whole stack?
The answer to this question is rather involved, but the gist of it is that Python's language design is more tailored towards high-level programming. Furthermore, it's just not fast enough to be used at the lower layers.
The following is an incomplete list of Python's (subjective) shortcomings:
Python code usually runs an order of magnitude slower compared to other interpreted and compiled languages. Language implementations such as Cython which compile Python code to raw C try to mitigate this problem but they come with other issues (e.g. language inconsistencies, compatibility problems, ...).
It's not that straightforward to write Python code which reliably performs parallel processing tasks on multiple cores or even multiple machines. Deep Neural Networks can be expressed as graphs on which Tensor computations are carried out, making it a prime use case for parallel processing.
Python is a high-level language with lots of useful abstractions which unfortunately get in the way when trying to directly interface with the computer's underlying hardware. Because of that, heavy GPU computations are usually moved into lower-level code written in e.g. C or CUDA.
Since it's a scripting language at its core, Python comes with its own runtime that evaluates the script line-by-line as it runs, a process called "interpretation".
The other branch of programming languages are compiled languages. Compiling code means that the human-readable program code is translated into code a machine can read and understand. Compiled programs have the downside that there's a compilation step in between writing and running the program. The upside of such step is that various checks and optimizations can be performed while translating the code, eventually emitting the most efficient machine code possible.
Python has no static type checking. There's no problem in passing an integer into a function which expects a string: Python will run the program and only raise an exception once it evaluates the broken code.
Statically typed languages have the upside that mistakes like the one described above are impossible to make: the developer has to explicitly declare which types are expected.
Python has recently added support for type hints. Type hinting mostly serves as another form of documentation, as the interpreter itself still won't prevent type misuse in programs.
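To see this in practice, here's a small illustrative snippet I've added (not from the original post): the hint claims `greet` expects a `str`, yet the interpreter happily accepts an `int` and only fails when the broken operation is actually evaluated:

```python
def greet(name: str) -> str:
    # The hint says `str`, but nothing enforces it at runtime
    return "Hello " + name

print(greet("World"))  # Hello World

try:
    greet(42)  # a static type checker flags this; the interpreter does not
except TypeError as err:
    print("Failed at runtime:", err)
```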
A lot of prominent packages such as Numpy wrap other languages such as Fortran or C to offer reliable performance when working on computationally expensive data processing tasks.
While it's certainly not impossible to introduce existing libraries written in other languages into Python, the process to do that is oftentimes rather involved.
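For a taste of what that involves, Python's built-in `ctypes` module can load the C standard math library. Even this minimal sketch (my own example; the library name resolution differs across platforms, and the Linux fallback name is an assumption) requires manually declaring the C argument and return types:

```python
import ctypes
import ctypes.util

# Locate and load the C math library (platform-dependent;
# fall back to the common Linux shared object name)
libm = ctypes.CDLL(ctypes.util.find_library("m") or "libm.so.6")

# ctypes assumes `int` by default; we must declare the C signature by hand
libm.sqrt.argtypes = [ctypes.c_double]
libm.sqrt.restype = ctypes.c_double

print(libm.sqrt(16.0))  # 4.0
```

And this is the easy case: wrapping a library with structs, callbacks or manual memory management quickly becomes much more involved.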
Without going into too much detail it makes sense to take a quick detour and study the origins of the Swift programming language in order to see why it has such a potential to replace Python as the go-to choice for Data Science and Machine Learning projects.
[Chris Lattner](https://en.wikipedia.org/wiki/Chris_Lattner), the inventor of Swift, has a long history and established track record in modern compiler development. During college he worked on a project which eventually became LLVM ("Low Level Virtual Machine"), the famous compiler infrastructure toolchain. The revolutionary idea behind LLVM is the introduction of frontends and backends which can be mixed and matched. One frontend could be written for Swift and coupled with a backend implementation for the x86 architecture; compiling to another architecture is then as simple as swapping in another backend such as the one for PowerPC. Back in the early compiler days one had to write the compiler end-to-end, tightly coupling the frontend and backend, which made it a heroic effort to offer the compiler for different platforms.
LLVM gained a lot of traction and Chris Lattner was eventually hired by Apple to work on its developer toolings, which heavily relied on LLVM. During that time he worked on a C++ compiler and thought about what a better, more modern programming language might look like. He figured that it should be compiled, easy to learn, flexible enough to feel like a scripting language and at the same time "hackable" at every layer. Those ideas translated into the Swift programming language, which was officially released at WWDC in 2014.
But what exactly makes Swift such a natural fit as a Python replacement? Isn't Swift only used for iOS and macOS apps? The following section shows why Swift could be Python's successor.
Swift is compiled via LLVM which means that its code is translated into optimized machine code directly running on the target platform. Improvements made to the LLVM compiler toolchain automatically benefit the Swift code generation.
There's the saying that Swift is "syntactic sugar for LLVM", which rings true as one can see with the `Builtin` usage for its core types. The linked code snippet shows that Swift's core types directly interface with their LLVM equivalents.
Despite the compilation process Swift feels like a dynamic, Python-esque language. Swift was designed from the ground up for programs to incrementally grow in complexity as necessary. The simplest of all Swift programs is just one line of code: `print("Hello World")`.
let greeting = "Hello World"
print(greeting)
// Hello World
let num1 = 1
let num2 = 2
print(num1 + num2)
// 3
let scores = [10, 35, 52, 92, 88]
for score in scores {
    print(score)
}
// 10
// 35
// 52
// 92
// 88
class Cat {
    var name: String
    var livesRemaining: Int = 9

    init(name: String) {
        self.name = name
    }

    func describe() -> String {
        return "👋 I'm \(self.name) and I have \(self.livesRemaining) lives 😸"
    }
}
let mitsy = Cat(name: "Mitsy")
print(mitsy.describe())
// 👋 I'm Mitsy and I have 9 lives 😸
Given that Swift is compiled via LLVM, it's statically type checked during the compilation process. There's no way you can pass an invalid type to a function and run into an error during runtime. If your code compiles you can be pretty sure that you're passing around the expected types.
func sum(xs: [Int]) -> Int {
    var result: Int = 0
    for x: Int in xs {
        result = result + x
    }
    return result
}
// Using correct types
let intNumbers: [Int] = [1, 2, 3, 4, 5]
let resultInt = sum(xs: intNumbers)
print(resultInt)
// 15
// Using incorrect types
let stringNumbers: [String] = ["one", "two", "three"]
let resultString = sum(xs: stringNumbers)
print(resultString)
// error: cannot convert value of type '[String]' to expected argument type '[Int]'
Swift's concepts of protocols and extensions make it dead simple to add new functionality to existing libraries or even to types which ship with the language core itself. Want to add a new method to `Int`? No problem!
// One needs to implement `help` when using the `Debugging` Protocol
protocol Debugging {
    func help() -> String
}
// Implementing `Debugging` for MatrixMultiply
class MatrixMultiply: Debugging {
    func help() -> String {
        return "Offers methods to aid with matrix-matrix multiplications."
    }

    func multiply() {
        // ...
    }
}
var matMult = MatrixMultiply()
print(matMult.help())
// Offers methods to aid with matrix-matrix multiplications.
// Implementing `Debugging` for VectorMultiply
class VectorMultiply: Debugging {
    func help() -> String {
        return "Offers methods to aid with matrix-vector multiplications."
    }
}
var vecMult = VectorMultiply()
print(vecMult.help())
// Offers methods to aid with matrix-vector multiplications.
// Makes it possible to emojify an existing type
protocol Emojifier {
    func emojify() -> String
}
// Here we're extending Swifts core `Int` type
extension Int: Emojifier {
    func emojify() -> String {
        if self == 8 {
            return "🎱"
        } else if self == 100 {
            return "💯"
        }
        return String(self)
    }
}
print(8.emojify())
// 🎱
print(100.emojify())
// 💯
print(42.emojify())
// 42
I'm sure everyone has run into this problem before. An object is passed into a function and modified without bad intentions. Meanwhile the object is used in a different place, and all of a sudden its internal state isn't what it's supposed to be. The culprit is the data mutation within the function.
This problem can be mitigated easily via value semantics. When using value semantics a "copy" rather than an object reference is passed around.
// As seen on: https://marcosantadev.com/copy-write-swift-value-types/
import Foundation
// Prints the memory address of the given object
func address(of object: UnsafeRawPointer) -> String {
    let addr = Int(bitPattern: object)
    return String(format: "%p", addr)
}
var list1 = [1, 2, 3, 4, 5]
print(address(of: list1))
// 0x7f2021f845d8
var list2 = list1
print(address(of: list2))
// 0x7f2021f845d8 <-- Both lists share the same address
list2.append(6) // <-- Mutating `list2`
print(list1)
// [1, 2, 3, 4, 5]
print(list2)
// [1, 2, 3, 4, 5, 6]
print(address(of: list1))
// 0x7f2021f84a38
print(address(of: list2))
// 0x128fb50 <-- `list2` has a different address
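For contrast, the equivalent Python code exhibits reference semantics: assigning a list to a second variable merely aliases the same object, so a mutation through one name is visible through the other. (This is a small illustration I've added, not from the original post.)

```python
list1 = [1, 2, 3, 4, 5]
list2 = list1          # no copy: both names reference the same list object

list2.append(6)        # mutating through `list2`...

print(list1)           # [1, 2, 3, 4, 5, 6]  <-- ...also changes `list1`
print(list1 is list2)  # True: it's the same object in memory
```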
Given that Swift compiles via LLVM it has access to existing LLVM-based implementations to interoperate with. One such project is Clang, a C language family frontend written for LLVM. Thanks to Clang it's dead simple to wrap existing C libraries and bring them into Swift projects.
The following video demonstrates how easy it is:
Given all the upsides described above, the TensorFlow team decided to experiment with Swift as a Python replacement to interface with TensorFlow. Early prototypes were fruitful, encouraging the TensorFlow team to officially release Swift for TensorFlow (S4TF) in 2019.
S4TF extends the Swift core language with various features especially useful for Machine Learning tasks. Such enhancements include first-class autodiff support to calculate derivatives of functions, as well as Python interoperability which makes it possible to reuse existing Python packages such as matplotlib, scikit-learn or pandas from Swift.
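To build some intuition for what "first-class autodiff support" means, here is a tiny forward-mode automatic differentiation sketch in Python using dual numbers (my own illustration; S4TF's compiler-integrated implementation is far more sophisticated):

```python
class Dual:
    """A dual number a + b*eps with eps^2 = 0; `d` carries the derivative."""
    def __init__(self, value: float, d: float = 0.0):
        self.value = value
        self.d = d

    def __add__(self, other: "Dual") -> "Dual":
        return Dual(self.value + other.value, self.d + other.d)

    def __mul__(self, other: "Dual") -> "Dual":
        # The product rule falls out of (a + b*eps) * (c + d*eps)
        return Dual(self.value * other.value,
                    self.value * other.d + self.d * other.value)

def f(x: Dual) -> Dual:
    # f(x) = x * x + x
    return x * x + x

x = Dual(3.0, 1.0)  # seed the derivative of x with 1
y = f(x)
print(y.value)  # 12.0  (3^2 + 3)
print(y.d)      # 7.0   (f'(x) = 2x + 1 = 7 at x = 3)
```

Every arithmetic operation transparently propagates the derivative alongside the value, so no finite-difference approximation is needed.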
The following is a demonstration which shows how Swift for TensorFlow can be used to describe and train a deep neural network in TensorFlow:
Do you want to play around with Swift for TensorFlow yourself? Just run the following code in a terminal to spin up a Jupyter Notebook server with Swift Kernel support in a Docker container:
docker run -it -p 8888:8888 --rm --name jupyter-s4tf \
    -v "$PWD":/home/jovyan/work \
    --ipc=host \
    pmuens/jupyter-s4tf:latest jupyter notebook \
    --ip=0.0.0.0 \
    --no-browser \
    --allow-root \
    --NotebookApp.token='' \
    --notebook-dir=/home/jovyan/work
The code for the repository can be found here and the Docker Hub entry is here.
Python, the de-facto standard programming language for Data Science and Machine Learning has served the community very well in the past. Nevertheless, given the trajectory of technological advancements we're slowly but surely hitting the limits with the toolings we currently have.
Performance critical code is already pushed down into lower-level implementations written in programming languages such as C or Fortran and wrapped via public Python APIs. Wouldn't it be nice to write expressive, yet performant code from the get-go at every layer? And what about all the libraries out there? Wouldn't it be nice to wrap and reuse them with only a couple lines of code?
The lack of static typing in Python makes it painful to work on larger, more complex projects. It's all too easy to define a model and train it on a huge dataset just to realize that a type error interrupts the training process halfway through. An error which could've been mitigated via thorough type checks.
And what if we're hitting other roadblocks? Wouldn't it be nice to be able to peek under the covers and fix the issues ourselves in an "official" way without all the monkey-patching?
Most large-scale Machine Learning projects have already faced some, if not all, of the issues listed above. The TensorFlow team experienced them too and looked into ways to solve them once and for all. What they came up with is Swift for TensorFlow (S4TF), a Swift language extension tailored towards modern Machine Learning projects. The Swift programming language comes with various properties which make it a perfect fit as a Python replacement: it shares a similar syntax, is compiled (and therefore runs fast), has a type system and seamlessly interoperates with existing C and Python libraries.
What do you think? Is Swift for TensorFlow the future or do we stick with Python for now? Will a language such as Julia dominate the Data Science and Machine Learning world in the future?
I hope that you enjoyed this article and I'd like to invite you to subscribe to my Newsletter if you're interested in more posts like this.
Do you have any questions, feedback or comments? Feel free to reach out via E-Mail or connect with me on Twitter.
The following is a list of resources I've used to compile this blog post. There are also a couple of other sources linked within the article itself.
While generic programming is a really powerful tool, as it prevents the programmer from repeating herself, it can be hard to grasp for newcomers. This is especially true if you're not too familiar with typed programming languages.
This blog post aims to shed some light on the topic of generic programming. We'll discover why Generics are useful and which thought process can be applied to easily derive generic function signatures. At the end of the post you'll be able to author and understand functions like this one:
function foo<A, B>(xs: A[], func: (x: A) => B): B[] {
  /* ... */
}
Note: Throughout this post we'll use TypeScript as our language of choice. Feel free to code along while reading through it.
Of course you can "just use JavaScript" (or another dynamically typed language) to not deal with concepts such as typing or Generics. But that's not the point. The point of this post is to introduce the concepts of Generics in a playful way. TypeScript is just a replaceable tool to express our thoughts.
Before we jump right into the application of generic programming it might be useful to understand what problem Generics are solving. We'll re-implement one of JavaScript's built-in Array methods called `filter` to get first-hand experience as to why Generics were invented.
Let's start with an example to understand what `filter` actually does. The JavaScript documentation for `filter` states that:
The `filter()` method creates a new array with all elements that pass the test implemented by the provided function.
Let's take a look at a concrete example to see how we would use `filter` in our programs. First off we have to define an array. Let's call our array `numbers` as it contains some numbers:
const numbers = [1, 2, 3, 4, 5, 6]
Next up we need to come up with a function our `filter` method applies to each element of the array. This function determines whether the element-under-test should be included in the resulting / filtered array. Based on the quote above and the description we just wrote down, we can derive that the function used by the `filter` method should return a boolean value: `true` if the element passes the test and `false` otherwise.
To keep things simple we pretend that we want to filter our `numbers` array such that only even numbers will be included in the resulting array. Here's the `isEven` function which implements that logic:
const isEven = (num: number): boolean => num % 2 === 0
Our `isEven` function takes in a `num` argument of type `number` and returns a `boolean`. We use the modulo operation to determine whether the number-under-test is even.
Next up we can use this function as an argument for the `filter` method on our array to get a resulting array which only includes even numbers:
const res = numbers.filter(isEven)
console.log(res)
// --> [2, 4, 6]
As we've stated earlier our goal is to implement the `filter` function on our own. Now that we've used `filter` in an example we should be familiar with its API and usage.
To keep things simple we won't implement `filter` on arrays but rather define a standalone function which accepts an array and a function as its arguments.
What we do know is that `filter` loops through every element of the array and applies the custom function to it in order to see if it should be included in the resulting array. We can translate this into the following code:
function filter(xs: number[], func: (x: number) => boolean): number[] {
  const res: number[] = []
  for (const x of xs) {
    if (func(x)) {
      res.push(x)
    }
  }
  return res
}
Now there's definitely a lot happening here and it might look intimidating but bear with me. It's simpler than it might look.
In the first line we define our function called `filter` which takes an array called `xs` (you can imagine pronouncing this "exes") and a function called `func` as its arguments. The array `xs` is of type `number[]` as we're dealing with numbers, and the function `func` takes an `x` of type `number`, runs some code and returns a `boolean`. Once done our `filter` function returns an array of type `number[]`.
The function body simply defines an intermediary array of type `number[]` which is used to store the resulting numbers. Other than that we're looping over every element of our array and applying the function `func` to it. If the function returns `true` we push the element into our `res` array. Once done looping over all elements we return the `res` array which includes all the numbers for which our `func` function returned `true`.
Alright. Let's see if our homebrew `filter` function works the same way the built-in JavaScript `filter` method does:
const res = filter(numbers, isEven)
console.log(res)
// --> [2, 4, 6]
Great! Looks like it's working!
If we think about filtering in the abstract we can imagine that there's more than just the filtering of numbers.
Let's imagine we're building a Rolodex-like application. Here's an array with some names from our Rolodex:
const names = ['Alice', 'Bob', 'John', 'Alex', 'Pete', 'Anthony']
Now one of our application requirements is to only display names that start with a certain letter.
That sounds like a perfect fit for our `filter` function as we basically filter all the names based on their first character!
Let's start by writing the custom function we'll use to keep only names that start with an `a`:
const startsWithA = (name: string): boolean => name.toLowerCase().charAt(0) === 'a'
As we can see our function takes one argument called `name` of type `string` and returns a `boolean` which it computes by checking if the first character of the name is an `a`.
Now let's use our `filter` function to filter the names:
const res = filter(names, startsWithA)
console.log(res)
// --> Type Error
Hmm. Something seems to be off here.
Let's revisit the signature of our `filter` function:
function filter(xs: number[], func: (x: number) => boolean): number[] {
  /* ... */
}
Here we can see that the `xs` parameter is of type `number[]`. Furthermore the `func` parameter takes an `x` of type `number` and returns a `boolean`.
However in our new Rolodex application we're dealing with names which are `string`s, and the `startsWithA` function we've defined takes a `string` as an argument, not a `number`.
One way to fix this problem would be to create a copy of `filter` called e.g. `filter2` whose arguments can handle `string`s rather than `number`s. But we programmers know that we shouldn't repeat ourselves if we want to keep things maintainable. In addition to that we're lazy, so using one function to deal with different data types would be ideal.
And that's exactly the problem Generics tackle. As the introduction of this blog post stated, Generics can be used to describe an implementation in an abstract way in order to reuse it with different data types.
Let's use Generics to solve our problem and write a function that can deal with any data type, not just `number`s or `string`s.
Before we jump into the implementation we should articulate what we're about to implement. Talking in the abstract we're basically attempting to filter an array of type `T` (`T` is our "placeholder" for some valid type here) with the help of our custom function. Given that our array has elements of type `T`, our function should take each element of such type and produce a `boolean` as a result (like we did before).
Alright, let's translate that into code:
function filter<T>(xs: T[], func: (x: T) => boolean): T[] {
  const res: T[] = []
  for (const x of xs) {
    if (func(x)) {
      res.push(x)
    }
  }
  return res
}
At first glance this might look confusing since we've sprinkled in our `T` type here and there. However overall it should look quite familiar. Let's take a closer look at how this implementation works.
In the first line we define our `filter` function as a function which takes an array named `xs` of type `T` and a function called `func` which takes a parameter `x` of type `T` and returns a `boolean`. Our `filter` function then returns a resulting array which is also of type `T`, since it's basically a subset of elements of our original array `xs`.
The code inside the function body is pretty much the same as before with the exception that our intermediary `res` array also needs to be of type `T`.
There's one little detail we haven't talked about yet. There's this `<T>` at the beginning of the function. What does it actually do?
Well, our compiler doesn't really know what the type `T` might be at the end of the day. And it doesn't really care that much whether it's a `string`, a `number` or an `object`. It only needs to know that it's "some placeholder" type. We programmers have to tell the compiler that we're abstracting the type away via Generics here. In TypeScript for example we use the syntax `<TheTypePlaceHolder>` right after the function name to signal to the compiler that we want our function to be able to deal with lots of different types (to be generic). Using `T` is just a convention. You could use any name you want for your "placeholder type". If your function deals with more than one generic type you'd just list them comma-separated inside the `<>` like this: `<A, B>`.
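As a quick illustration of multiple type parameters, here's a small `pair` function (a hypothetical example, not part of the original post) that combines two values of potentially different types into a tuple:

```typescript
// `pair` takes a value of type A and a value of type B
// and returns them together as a tuple of type [A, B]
function pair<A, B>(a: A, b: B): [A, B] {
  return [a, b]
}

const entry = pair('age', 42) // inferred as [string, number]
```

The compiler infers `A = string` and `B = number` from the arguments, so we never have to spell the types out at the call site.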
That's pretty much all we have to do to turn our limited, `number`-focused `filter` function into a generic function which can deal with all kinds of types. Let's see if it works with our `numbers` and `names` arrays:
let res
// using `filter` with numbers and our `isEven` function
res = filter(numbers, isEven)
console.log(res)
// --> [2, 4, 6]
// using `filter` with strings and our `startsWithA` function
res = filter(names, startsWithA)
console.log(res)
// --> ['Alice', 'Alex', 'Anthony']
Awesome! It works!
One of the many benefits of using a type system is that you can get a good sense of what the function will be doing based solely on its signature.
Let's take the function signature from the beginning of the post and see if we can figure out what it'll be doing:
function foo<A, B>(xs: A[], func: (x: A) => B): B[] {
/* ... */
}
The first thing we notice is that it's a generic function as we're dealing with 2 "type placeholders" `A` and `B` here. Next up we can see that this function takes in an array called `xs` of type `A` and a function `func` which takes an `A` and turns it into a `B`. At the end the `foo` function returns an array of type `B`.
Take a couple of minutes to parse the function signature in order to understand what it's doing.
Do you know what this function is called? Here's a tip: it's also one of those functions from the realm of functional programming that's used on e.g. arrays.
Here's the solution: The function we called `foo` here is usually called `map` as it iterates over the elements of the array and uses the provided function to map every element from one type to the other (note that it can also map to the same type, i.e. from type `A` to type `A`).
I have to admit that this was a rather challenging question. Here's how `map` is used in the wild:
const numbers = [1, 2, 3, 4, 5, 6]
const numToString = (num: number): string => num.toString()
const res = map(numbers, numToString)
console.log(res)
// --> ['1', '2', '3', '4', '5', '6']
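Note that `map` isn't defined as a standalone function anywhere in this post. If you want to run the snippet above, here's one possible implementation, sketched in the same style as our generic `filter`:

```typescript
// a standalone `map`: applies `func` to every element of `xs`,
// turning an array of A values into an array of B values
function map<A, B>(xs: A[], func: (x: A) => B): B[] {
  const res: B[] = []
  for (const x of xs) {
    res.push(func(x))
  }
  return res
}
```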
In this blog post we've looked into Generics as a way to write code in an abstract and reusable way.
We've implemented our own `filter` function to understand why generic programming is useful and how it allows us to filter lists of `number`s, `string`s or, more broadly speaking, `T`s.
Once we understood how to read and write generic functions we discovered how typing and Generics can help us get a sense of what a function might be doing just by looking at its signature.
I hope that you enjoyed this article and I'd like to invite you to subscribe to my Newsletter if you're interested in more posts like this.
Do you have any questions, feedback or comments? Feel free to reach out via E-Mail or connect with me on Twitter.
There's a very high likelihood that so-called Embeddings were used behind the scenes. Embeddings are one of the central ideas behind modern Natural Language Processing models.
In the following writeup we'll discover the main building blocks and basic intuition behind Embeddings. We'll learn how and why they work and how Word2Vec, a method to turn words into vectors, can be used to show that:
$$ king - man + woman = queen $$
All the code we'll write here can be found in my "Lab" repository on GitHub. Feel free to code along while reading through this tutorial.
Before jumping right into the code we need to make sure that all Python packages we'll be using are installed on our machine.
We install Seaborn, a visualization tool which helps us to plot nice-looking charts and diagrams. We don't really work with Seaborn directly but rather use its styles in conjunction with Matplotlib to make our plots look a little bit more "modern".
!pip install seaborn
Next up we need to import the modules we'll use throughout this tutorial (the last few lines configure Matplotlib to use Seaborn styles).
import json
from pathlib import Path
import pandas as pd
import seaborn as sns
import numpy as np
from IPython.display import HTML, display
# prettier Matplotlib plots
import matplotlib.pyplot as plt
import matplotlib.style as style
style.use('seaborn')
Since we're dealing with different datasets we should create a separate directory to store them in.
!mkdir -p data
data_dir = Path('data')
Let's start with our first data analysis task. Our goal is to compare and contrast different countries based on their surface area and population. The main idea being that we want to analyze which countries are quite similar and which are rather different based on those two metrics.
The dataset we'll use is part of the `country-json` project by @samayo. Make sure to take some time to browse through the different JSON files to get an idea about the structure of the data.
In our example we're only interested in the `country-by-surface-area.json` and `country-by-population.json` files. Let's go ahead and download the files to our `data` directory.
After that we can define 2 variables which will point to the files on our file system.
SURFACE_AREA_FILE_NAME = 'country-by-surface-area.json'
POPULATION_FILE_NAME = 'country-by-population.json'
!wget -nc https://raw.githubusercontent.com/samayo/country-json/master/src/country-by-surface-area.json -O data/country-by-surface-area.json
!wget -nc https://raw.githubusercontent.com/samayo/country-json/master/src/country-by-population.json -O data/country-by-population.json
surface_area_file_path = str(data_dir / SURFACE_AREA_FILE_NAME)
population_file_path = str(data_dir / POPULATION_FILE_NAME)
During our data analysis we'll utilize Pandas, a great Python library which makes it dead simple to inspect and manipulate data.
Since our data is in JSON format we can use Pandas' `read_json` function to load the data into a so-called DataFrame (think of it as an Excel spreadsheet on steroids).
The `dropna` function makes sure that we remove all entries which are undefined and therefore useless for further inspection.
df_surface_area = pd.read_json(surface_area_file_path)
df_population = pd.read_json(population_file_path)
df_population.dropna(inplace=True)
df_surface_area.dropna(inplace=True)
You might've noticed that dealing with 2 separate files will get quite hairy if we want to compare countries based on their 2 metrics.
Since both files contain the same countries with the same names and only differ in terms of their `area` and `population` data we can use `merge` to create a new DataFrame containing all countries with their respective `area` and `population` numbers.
Another tweak we perform here is to set the `index` to the country name. This way we can easily query for country data based on the country names rather than having to deal with non-expressive integer values.
df = pd.merge(df_surface_area, df_population, on='country')
df.set_index('country', inplace=True)
df.head()
len(df)
227
As you can see we have a total of 227 countries in our DataFrame. That's way too many for our needs, especially since we're about to plot the data in the next step.
Let's reduce our result set by performing some range-queries on the `area` and `population` data.
df = df[
    (df['area'] > 100000) & (df['area'] < 600000) &
    (df['population'] > 35000000) & (df['population'] < 100000000)
]
len(df)
12
Great! 12 countries are way easier to analyze once plotted.
Speaking of which, let's do a 2D scatterplot of our 12 countries. We decide to plot the `area` on the X axis and the `population` on the Y axis.
fig, ax = plt.subplots()
df.plot(x='area', y='population', figsize=(10, 10), kind='scatter', ax=ax)
for k, v in df.iterrows():
    ax.annotate(k, v)
fig.canvas.draw()
Looking at the plotted data we can immediately see some relationships. It appears that Vietnam has a high population compared to its area. Kenya on the other hand has a large surface area but a comparatively small population.
Plotting the data like this helps us to reason about it in a visual way. In addition to that we can also easily validate the integrity of our data.
While we as humans can immediately tell the relationships in our country data just by looking at our plot it's necessary to translate our visual reasoning into raw numbers so our computer can understand them too.
Looking at the plot again it seems like the distance between the data points of the countries is a good measure to determine how "similar" or "different" the countries are.
There are several algorithms to calculate the distance between two (or more) coordinates. The Euclidean distance is a very common formula to do just that. Here's the Math notation:
$$ d(x, y) = d(y, x) = \sqrt{\sum_{i=1}^N (x_i - y_i)^2} $$
While the formula might look intimidating at first it's rather simple to turn it into code.
def euclidean_distance(x, y):
    x1, x2 = x
    y1, y2 = y
    result = np.sqrt((x1 - x2) ** 2 + (y1 - y2) ** 2)
    # we'll cast the result into an int which makes it easier to compare
    return int(round(result, 0))
According to our plot it seems like Thailand and Uganda are 2 countries which are very different. Computing the Euclidean distance between both validates our hunch.
# Uganda <--> Thailand
uganda = df.loc['Uganda']
thailand = df.loc['Thailand']
x = (uganda['area'], thailand['area'])
y = (uganda['population'], thailand['population'])
euclidean_distance(x, y)
26175969
If we compare this result to the Euclidean distance between Iraq and Morocco we can see that those countries seem to be more "similar".
# Iraq <--> Morocco
iraq = df.loc['Iraq']
morocco = df.loc['Morocco']
x = (iraq['area'], morocco['area'])
y = (iraq['population'], morocco['population'])
euclidean_distance(x, y)
2535051
While this exercise was quite simple and intuitive if one is fluent in geography it also introduced us to the basic concepts of Embeddings. With Embeddings we map data (e.g. words or raw numbers) into multi-dimensional spaces and use Math to manipulate and calculate relationships between that data.
This might sound rather abstract and I agree that the relationship between our Country data analysis and Embeddings is still a little bit fuzzy.
Trust me, the upcoming example will definitely result in an "Aha Moment" and suddenly what we've learned so far will click!
Now that we've seen some of the underlying principles of Embeddings let's take another look at a slightly more complicated example. This time we'll work with different colors and their representation as a combination of Red, Green and Blue values (also known as RGB).
Before we jump right into our analysis we'll define a helper function which lets us render a color according to its RGB representation.
The following code defines a function which takes the integer values of Red, Green and Blue (values in the range of 0 - 255) and renders an HTML document with the given color as its background.
def render_color(r, g, b):
    display(HTML('''
        <div style="background-color: rgba(%d, %d, %d, 1); height: 100px;"></div>
    ''' % (r, g, b)),
    metadata=dict(isolated=True))
The color black is represented as 0 Red, 0 Green and 0 Blue. Let's validate that our `render_color` function works as expected.
render_color(0, 0, 0)
Great. It works!
Next up it's time to download the dataset we'll be using for our color analysis. We've decided to use the 256 Colors dataset by @jonasjacek. It lists the 256 colors used by xterm, a widely used terminal emulator. Make sure to take a couple of minutes to familiarize yourself with the data and its structure.
Downloading the dataset follows the same steps we used at the beginning of this tutorial when we downloaded the Country data.
COLORS_256_FILE_NAME = 'colors-256.json'
!wget -nc https://jonasjacek.github.io/colors/data.json -O data/colors-256.json
colors_256_file_path = str(data_dir / COLORS_256_FILE_NAME)
Now that we have access to the data in our programming environment it's time to inspect the structure and think about ways to further process it.
color_data = json.loads(open(colors_256_file_path, 'r').read())
color_data[:5]
[{'colorId': 0,
'hexString': '#000000',
'rgb': {'r': 0, 'g': 0, 'b': 0},
'hsl': {'h': 0, 's': 0, 'l': 0},
'name': 'Black'},
{'colorId': 1,
'hexString': '#800000',
'rgb': {'r': 128, 'g': 0, 'b': 0},
'hsl': {'h': 0, 's': 100, 'l': 25},
'name': 'Maroon'},
{'colorId': 2,
'hexString': '#008000',
'rgb': {'r': 0, 'g': 128, 'b': 0},
'hsl': {'h': 120, 's': 100, 'l': 25},
'name': 'Green'},
{'colorId': 3,
'hexString': '#808000',
'rgb': {'r': 128, 'g': 128, 'b': 0},
'hsl': {'h': 60, 's': 100, 'l': 25},
'name': 'Olive'},
{'colorId': 4,
'hexString': '#000080',
'rgb': {'r': 0, 'g': 0, 'b': 128},
'hsl': {'h': 240, 's': 100, 'l': 25},
'name': 'Navy'}]
As we can see there are 3 different color representations available in this dataset. There's a Hexadecimal, a HSL (Hue, Saturation, Lightness) and a RGB (Red, Green, Blue) representation. Furthermore we have access to the name of the color via the `name` attribute.
In our analysis we're only interested in the name and the RGB value of every color. Given that, we can create a simple dict whose keys are the lowercased color names and whose values are tuples containing the Red, Green and Blue values respectively.
colors = dict()
for color in color_data:
    name = color['name'].lower()
    r = color['rgb']['r']
    g = color['rgb']['g']
    b = color['rgb']['b']
    rgb = tuple([r, g, b])
    colors[name] = rgb
To validate that our data structure works the way we described above we can print out some sample colors with their RGB values.
print('Black: %s' % (colors['black'],))
print('White: %s' % (colors['white'],))
print()
print('Red: %s' % (colors['red'],))
print('Lime: %s' % (colors['lime'],))
print('Blue: %s' % (colors['blue'],))
Black: (0, 0, 0)
White: (255, 255, 255)
Red: (255, 0, 0)
Lime: (0, 255, 0)
Blue: (0, 0, 255)
While our dict is a good starting point it's often easier and sometimes faster to do computations on the data if it's stored in a Pandas DataFrame. The `from_dict` function helps us turn a simple Python dictionary into a DataFrame.
df = pd.DataFrame.from_dict(colors, orient='index', columns=['r', 'g', 'b'])
df.head()
Seeing the data formatted in this way we can think of its representation as a mapping of the Red, Green and Blue values into a 3-Dimensional space where for example Red is the X axis, Green is the Y axis and Blue is the Z axis.
You might recall that we've used Euclidean distance in our Country example above to determine how "similar" countries are. The main idea was that similar countries have less distance between their data points compared to dissimilar countries whose data points are farther apart.
Another very useful formula to calculate the similarity of data points is the so-called Cosine similarity. The Cosine similarity measures the angle between two vectors in a multi-dimensional space. The smaller the angle, the more similar the underlying data.
Translating this to our color example we can think of every color being represented as a vector with 3 values (Red, Green and Blue) which (as stated above) can be mapped to the X, Y and Z axis in a 3D coordinate system. Using the Cosine similarity we can take one of such vectors and calculate the distance between it and the rest of the vectors to determine how similar or dissimilar they are. And that's exactly what we'll be doing here.
The Math notation for the Cosine similarity looks like this:
$$ similarity = \cos(\Theta) = \frac{A \cdot B}{\left\lVert A\right\rVert \left\lVert B\right\rVert} $$
We're taking the dot-product between the two vectors A and B and divide it by the product of their magnitudes.
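Taken on its own, the formula translates into a NumPy one-liner. Here's a minimal sketch (the function name is mine, not from the original text):

```python
import numpy as np

def cosine_similarity(a, b):
    # dot product of the two vectors divided by the product of their magnitudes
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# vectors pointing in the same direction -> 1.0
print(cosine_similarity([1, 0, 0], [2, 0, 0]))  # -> 1.0
# orthogonal vectors -> 0.0
print(cosine_similarity([1, 0, 0], [0, 1, 0]))  # -> 0.0
```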
The following code snippet implements this formula. Again, it might look intimidating and rather complicated, but if you take some time to read through it you'll see that it's not that hard to understand.
In fact our implementation here does more than just calculate the Cosine similarity. In addition to that we copy our DataFrame containing the colors and add another column to it which holds the computed similarity (named `distance` in the code) as a value between 0 and 1. Once done we sort our copied DataFrame by that value in descending order so we can see the computed values when querying for similar colors later on.
def similar(df, coord, n=10):
    # turning our RGB values (3D coordinates) into a numpy array
    v1 = np.array(coord, dtype=np.float64)
    df_copy = df.copy()
    # looping through our DataFrame to calculate the distance for every color
    for i in df_copy.index:
        item = df_copy.loc[i]
        v2 = np.array([item.r, item.g, item.b], dtype=np.float64)
        # cosine similarity calculation starts here
        theta_sum = np.dot(v1, v2)
        theta_den = np.linalg.norm(v1) * np.linalg.norm(v2)
        # check if we're trying to divide by 0
        if theta_den == 0:
            theta = None
        else:
            theta = theta_sum / theta_den
        # adding the `distance` column with the result of our computation
        df_copy.at[i, 'distance'] = theta
    # sorting the resulting DataFrame by distance
    df_copy.sort_values(by='distance', axis=0, ascending=False, inplace=True)
    return df_copy.head(n)
To validate that our `similar` function works we can use it to find colors similar to `red`.
similar(df, colors['red'])
We can also pass in colors as a list of RGB values.
similar(df, [100, 20, 120])
Since it's hard to imagine what color `(100, 20, 120)` represents it's worthwhile to use our `render_color` function to see it.
render_color(100, 20, 120)
Looking at the list of most similar colors from above it appears that `darkviolet` is quite similar to `(100, 20, 120)`. Let's see how this color looks.
darkviolet = df.loc['darkviolet']
render_color(darkviolet.r, darkviolet.g, darkviolet.b)
And we can validate that `darkviolet` in fact looks quite similar to `(100, 20, 120)`!
But it doesn't end here. Our 3 color values are numbers in the range of 0 - 255. Given that, it should be possible to do some basic Math computations such as addition or subtraction on them.
Since we only have access to 256 different colors it's highly unlikely that our resulting values for Red, Green and Blue will exactly match one of our 256 colors. That's where our `similar` function comes in handy! The `similar` function makes it possible to calculate a new color and find its most similar representation in our 256 color dataset.
We can look at a Color Wheel to see that subtracting a `red` color from a `purple` one should result in a Blueish color. Let's do the Math and check whether that's true.
blueish = df.loc['purple'] - df.loc['red']
similar(df, blueish)
And sure enough the most similar colors in our dataset are Blueish ones. We can validate that by rendering `darkblue`, one of the best matches.
darkblue = df.loc['darkblue']
render_color(darkblue.r, darkblue.g, darkblue.b)
Here's a simple one. If we have Black and add some White to the mix we should get something Greyish, correct?
greyish = df.loc['black'] + df.loc['white']
similar(df, greyish)
And sure enough we do. Rendering `grey93` shows a light grey color.
grey93 = df.loc['grey93']
render_color(grey93.r, grey93.g, grey93.b)
Let's end our color exploration with a more complex formula. So far we've only done some very simple Math like subtracting and adding colors. But there's more we can do. We can also express our search for a color as a "solve for x" problem.
Mixing Yellow and Red will result in Orange. We can translate this behavior to other colors as well. Here we ask "Yellow is to Red as X is to Blue" and express it in Math notation to get the result for X.
# yellow is to red as X is to blue
yellow_to_red = df.loc['yellow'] - df.loc['red']
X = yellow_to_red + df.loc['blue']
similar(df, X)
Our calculation shows us that `lightseagreen` is to Blue as Yellow is to Red. Intuitively that makes sense if you think about it.
lightseagreen = df.loc['lightseagreen']
render_color(lightseagreen.r, lightseagreen.g, lightseagreen.b)
At the beginning of this tutorial I promised that once we were done we'd understand the intuition behind Word2Vec, a key component of modern Natural Language Processing models.
The Word2Vec model does to words what we did with our colors represented as RGB values: it maps words into a multi-dimensional space (our colors were mapped into a 3D space). Once words are mapped into that space we can perform Math calculations on their vectors the same way we e.g. calculated the similarity between our color vectors.
Having a mapping of words into such a vector space makes it possible to do calculations resulting in:
$$ king - man + woman = queen $$
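We can mimic this with made-up toy vectors. The 3 dimensions and their values below are purely illustrative assumptions (real Word2Vec embeddings have hundreds of dimensions learned from text), but they show the mechanics of the famous analogy:

```python
import numpy as np

# hypothetical 3D word vectors: [royalty, maleness, femaleness]
vectors = {
    'king':  np.array([1.0, 1.0, 0.0]),
    'queen': np.array([1.0, 0.0, 1.0]),
    'man':   np.array([0.0, 1.0, 0.0]),
    'woman': np.array([0.0, 0.0, 1.0]),
}

def most_similar(target, vectors):
    # pick the word whose vector has the highest cosine similarity to `target`
    def cos(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return max(vectors, key=lambda word: cos(vectors[word], target))

result = vectors['king'] - vectors['man'] + vectors['woman']
print(most_similar(result, vectors))  # -> queen
```

Subtracting `man` removes the "maleness" component from `king` and adding `woman` puts "femaleness" in its place, landing right on `queen`.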
In this tutorial we took a deep dive into the main building blocks and intuitions behind Embeddings, a powerful concept which is heavily utilized in modern Natural Language Processing models.
The main idea is to map data into a multi-dimensional space so that Math calculations from the realm of Linear Algebra can be performed on it.
We started our journey with a simple example in which we mapped the surface area and population of different countries into a 2D vector space. We then used the Euclidean distance to verify that certain countries are similar while others are dissimilar based on their metrics.
Another, more advanced example mapped colors and their RGB representation into a 3D vector space. We then used Cosine similarity and some basic Math to add and subtract colors.
With this knowledge we're now able to understand how more advanced models such as Word2Vec or Doc2Vec make it possible to do calculations on words and texts.
I hope that you enjoyed this article and I'd like to invite you to subscribe to my Newsletter if you're interested in more posts like this.
Do you have any questions, feedback or comments? Feel free to reach out via E-Mail or connect with me on Twitter.
You can find more code examples, experiments and tutorials in my GitHub Lab repository.
Eager to learn more? Here's a list with all the resources I've used to write this post.
You might've wondered if there's a certain strategy you can exploit that lets you win all the time (or at least force a draw). Is there such an algorithm that will show you how you can defeat your opponent at any given time?
It turns out there is. To be precise there are a couple of algorithms which can be utilized to predict the best possible moves in games such as Tic-Tac-Toe, Connect Four, Chess and Go among others. One such family of algorithms leverages tree search and operates on game state trees.
In this blog post we'll discuss 2 famous tree search algorithms called Minimax and Monte Carlo Tree Search (abbreviated to MCTS). We'll start our journey into tree search algorithms by discovering the intuition behind their inner workings. After that we'll see how Minimax and MCTS can be used in modern game implementations to build sophisticated Game AIs. We'll also shed some light into the computational challenges we'll face and how to handle them via performance optimization techniques.
Let's imagine that you're playing some games of Tic-Tac-Toe with your friends. While playing you're wondering what the optimal strategy might be. What's the best move you should pick in any given situation?
Generally speaking there are 2 modes you can operate in when determining the next move you want to play:
Aggressive: If you can complete a row, column or diagonal of your own marks, play that move and win the game.
Defensive: If your opponent could complete a row, column or diagonal with her next move, play the move that blocks it.
These modes and their respective actions are basically the only strategies you need to follow to win the game of Tic-Tac-Toe.
The "only" thing you need to do is to look at the current game state you're in and play simulations through all the potential next moves which could be played. You do this by pretending that you've played a given move and then continue playing the game until the end, alternating between the X and O player. While doing that you're building up a game tree of all the possible moves you and your opponent would play.
The following illustration shows a simplified version of such a game tree. Note that for the rest of this post we'll only use simplified game tree examples to save screen space.
Of course, the set of strategic rules we've discussed at the top is specifically tailored to the game of Tic-Tac-Toe. However we can generalize this approach to make it work with other board games such as Chess or Go. Let's take a look at Minimax, a tree search algorithm which abstracts our Tic-Tac-Toe strategy so that we can apply it to various other 2 player board games.
Given that we've built up an intuition for tree search algorithms let's switch our focus from simple games such as Tic-Tac-Toe to more complex games such as Chess.
Before we dive in let's briefly recap the properties of a Chess game. Chess is a 2 player deterministic game of perfect information. Sound confusing? Let's unpack it:
In Chess, 2 players (Black and White) play against each other. Every move which is performed is ensured to be "fulfilled" with no randomness involved (the game doesn't use any random elements such as a die). During gameplay every player can observe the whole game state. There's no hidden information, hence everyone has perfect information about the whole game at any given time.
Thanks to those properties we can always compute which player is currently ahead and which one is behind. There are several different ways to do this for the game of Chess. One approach to evaluate the current game state is to add up all the remaining white pieces on the board and subtract all the remaining black ones. Doing this will produce a single value where a large value favors white and a small value favors black. This type of function is called an evaluation function.
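A material-counting evaluation function of this kind can be sketched in a few lines of Python. The piece values below follow a common chess-engine convention but are my assumption, not something the original text specifies:

```python
# conventional material values; the king is typically excluded from the count
PIECE_VALUES = {'P': 1, 'N': 3, 'B': 3, 'R': 5, 'Q': 9}

def evaluate(pieces):
    """pieces: iterable of piece letters, uppercase = white, lowercase = black."""
    score = 0
    for piece in pieces:
        value = PIECE_VALUES.get(piece.upper(), 0)
        # white pieces add to the score, black pieces subtract from it
        score += value if piece.isupper() else -value
    return score

# white has a queen and a pawn, black has a rook: 9 + 1 - 5 = 5
print(evaluate(['Q', 'P', 'r']))  # -> 5
```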
Based on this evaluation function we can now define the overall goal during the game for each player individually. White tries to maximize this objective while black tries to minimize it.
Let's pretend that we're deep in an ongoing Chess game. We're player white and have already played a couple of clever moves, resulting in a large number computed by our evaluation function. It's our turn right now but we're stuck. Which of the possible moves is the best one we can play?
We'll solve this problem with the same approach we used in our Tic-Tac-Toe example: we build up a tree of the potential moves which could be played from the game state we're in. To keep things simple we pretend that there are only 2 possible moves per turn (in Chess there are on average ~30 different options in any given game state). We start with a (white) root node which represents the current state. From there we branch out 2 (black) child nodes, each representing the game state we're in after taking one of the 2 possible moves. From each of those we again branch out 2 separate (white) child nodes, each representing the state after one of the 2 possible replies to the black node. This branching out of nodes goes on and on until we've reached the end of the game or hit a predefined maximum tree depth.
The resulting tree looks something like this:
Given that we're at the end of the tree we can now compute the game outcome for each end state with our evaluation function:
With this information we now know which game outcome to expect for every path from the root node down to one of the evaluated end states. Since we're player white, the best move seems to be the one which eventually leads us to the game state with the highest evaluation.
While this is true, there's one problem: the black player is still involved and we cannot directly control which move she'll pick. But if we cannot control it, why don't we estimate what she'll likely do based on our evaluation function? As player white we always try to maximize the outcome; the black player always tries to minimize it. With this knowledge we can traverse back through our game tree and compute the values for all the individual tree nodes step by step.
White tries to maximize the outcome:
While black wants to minimize it:
Once done we can now pick the next move based on the evaluation values we've just computed. In our case we pick the next possible move which maximizes our outcome:
What we've just learned is the general procedure of the so-called Minimax algorithm. The Minimax algorithm got its name from the fact that one player wants to Mini-mize the outcome while the other tries to Max-imize it.
import math

def minimax(state, max_depth, is_player_minimizer):
    if max_depth == 0 or state.is_end_state():
        # We're at the end. Time to evaluate the state we're in
        return evaluation_function(state)
    # Is the current player the minimizer?
    if is_player_minimizer:
        value = math.inf
        for move in state.possible_moves():
            evaluation = minimax(move, max_depth - 1, False)
            value = min(value, evaluation)
        return value
    # Or the maximizer?
    value = -math.inf
    for move in state.possible_moves():
        evaluation = minimax(move, max_depth - 1, True)
        value = max(value, evaluation)
    return value
Minimax is a simple and elegant tree search algorithm. Given enough compute resources it will always find the optimal next move to play.
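To see the algorithm produce an answer end to end, here's a minimal, self-contained sketch that runs minimax over a hand-built tree of evaluation values. The nested-list tree is a deliberately simplified stand-in for real game states:

```python
# A minimal, self-contained minimax over a hand-built game tree. Leaves are
# evaluation values; inner nodes are lists of child subtrees.
def minimax(node, is_minimizer):
    if not isinstance(node, list):   # leaf node: return its evaluation
        return node
    if is_minimizer:
        return min(minimax(child, False) for child in node)
    return max(minimax(child, True) for child in node)

# The maximizer picks between two moves; the minimizer replies to each.
# The left branch guarantees at least 3, the right branch only 2.
print(minimax([[3, 5], [2, 9]], is_minimizer=False))  # → 3
```

Note how the maximizer ignores the tempting 9: minimax assumes the opponent plays optimally, and the minimizer would answer the right-hand move with 2.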
But there's a problem. While this algorithm works flawlessly with simplistic games such as Tic-Tac-Toe, it's computationally infeasible to implement it for strategically more involved games such as Chess. The reason for this is the so-called tree branching factor. We've already briefly touched on that concept before but let's take a second look at it.
In our example above we've artificially restricted the potential moves one can play to 2 to keep the tree representation simple and easy to reason about. However the reality is that there are usually more than 2 possible next moves. On average there are ~30 moves a Chess player can play in any given game state. This means that every single node in the tree will have approximately 30 different children. This is called the width of the tree. We denote the tree's width as $w$.
But there's more. It takes roughly 85 consecutive turns to finish a game of Chess. Translating this to our tree means that it will have an average depth of 85. We denote the tree's depth as $d$.
Given $w$ and $d$ we can define the formula $w^d$ which will show us how many different positions we have to evaluate on average.
Plugging in the numbers for Chess we get $30^{85}$. Taking the Go board game as an example, which has a width $w$ of ~250 and an average depth $d$ of ~150, we get $250^{150}$. I encourage you to type those numbers into your calculator and hit enter. Needless to say, current-generation computers and even large-scale distributed systems would take "forever" to crunch through all those computations.
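You can get a feeling for the scale with a couple of lines of Python, whose arbitrary-precision integers let us inspect these numbers directly:

```python
# How many positions would a full search have to evaluate on average?
chess_positions = 30 ** 85    # w^d for Chess
go_positions = 250 ** 150     # w^d for Go

print(len(str(chess_positions)))  # → 126 (a 126-digit number)
print(len(str(go_positions)))     # → 360 (a 360-digit number)
```

For comparison, the number of atoms in the observable universe is commonly estimated at around $10^{80}$.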
Does this mean that Minimax can only be used for games such as Tic-Tac-Toe? Absolutely not. We can apply some clever tricks to optimize the structure of our search tree.
Generally speaking we can reduce the search tree's width and depth by pruning individual nodes and branches from it. Let's see how this works in practice.
Recall that Minimax is built around the premise that one player tries to maximize the outcome of the game based on the evaluation function while the other one tries to minimize it.
This gameplay behavior is directly translated into our search tree. During traversal from the bottom to the root node we always picked the respective "best" move for each player: the white player always picked the maximum value while the black player picked the minimum value. Looking at our tree above, we can exploit this behavior to optimize it. Here's how:
While walking through the potential moves we can play from the current game state, we should build our tree in a depth-first fashion: we start at one node and expand it by playing the game all the way to the end before we back up and pick the next node to explore. Following this procedure allows us to identify moves which will never be played early on. After all, one player maximizes the outcome while the other minimizes it, so any part of the search tree where a player would end up in a provably worse situation can be entirely removed from the list of nodes we want to expand and explore. We prune those nodes from our search tree and thereby reduce its width. The larger the branching factor of the tree, the more computations we can potentially save!
Assuming we can reduce the width by an average of 10 we would end up with $w^d = (30 - 10)^{85} = 20^{85}$ computations we have to perform. That's already a huge win.
This technique of pruning parts of the search tree which will never be considered during gameplay is called Alpha-Beta pruning. Alpha-Beta pruning got its name from the parameters $\alpha$ and $\beta$ which are used to keep track of the best score either player can achieve while walking the tree.
def minimax(state, max_depth, is_player_minimizer, alpha, beta):
    if max_depth == 0 or state.is_end_state():
        return evaluation_function(state)
    if is_player_minimizer:
        value = math.inf
        for move in state.possible_moves():
            evaluation = minimax(move, max_depth - 1, False, alpha, beta)
            value = min(value, evaluation)
            # Keeping track of the best score the minimizer can achieve
            beta = min(beta, evaluation)
            if beta <= alpha:
                break
        return value
    value = -math.inf
    for move in state.possible_moves():
        evaluation = minimax(move, max_depth - 1, True, alpha, beta)
        value = max(value, evaluation)
        # Keeping track of the best score the maximizer can achieve
        alpha = max(alpha, evaluation)
        if beta <= alpha:
            break
    return value
Using Alpha-Beta pruning to reduce the tree's width helps us utilize the Minimax algorithm in games with large branching factors which were previously considered computationally infeasible.
In fact Deep Blue, the Chess computer developed by IBM which defeated the Chess world champion Garry Kasparov in 1997, heavily utilized parallelized Alpha-Beta based search algorithms.
It seems like Minimax combined with Alpha-Beta pruning is enough to build sophisticated game AIs. But there's one major problem which can render such techniques useless. It's the problem of defining a robust and reasonable evaluation function. Recall that in Chess our evaluation function added up all the white pieces on the board and subtracted all the black ones. This resulted in high values when white had an edge and in low values when the situation was favorable for black. While this function is a good baseline and is definitely worthwhile to experiment with there are usually more complexities and subtleties one needs to incorporate to come up with a sound evaluation function.
Simple evaluation metrics are easy to fool and exploit once the underlying internals are surfaced. This is especially true for more complex games such as Go. Engineering an evaluation function which is complex enough to capture the majority of the necessary game information requires a lot of thought and interdisciplinary domain expertise in Software Engineering, Math, Psychology and the game at hand.
Isn't there a universally applicable evaluation function we could leverage for all games, no matter how simple or complex they are?
Yes, there is! And it's called randomness. With randomness we let chance be our guide to figure out which next move might be the best one to pick.
In the following we'll explore the so-called Monte Carlo Tree Search (MCTS) algorithm, which heavily relies on randomness as a core component for value approximations (the name "Monte Carlo" stems from the famous casino district in Monaco).
As the name implies, MCTS also builds up a game tree and does computations on it to find the path of the highest potential outcome. But there's a slight difference in how this tree is constructed.
Let's once again pretend that we're playing Chess as player white. We've already played for a couple of rounds and it's on us again to pick the next move we'd like to play. Additionally let's pretend that we're not aware of any evaluation function we could leverage to compute the value of each possible move. Is there any way we could still figure out which move might put us into a position where we could win at the end?
As it turns out there's a really simple approach we can take: why don't we let both players play dozens of random games starting from the state we're currently in? While this might sound counterintuitive, it makes sense if you think about it. If both players start in the given game state, play thousands of random games and player white wins 80% of the time, then there must be something about that state which gives white an advantage. What we're doing here is basically exploiting the Law of Large Numbers (LLN) to approximate the "true" value of every potential move we can play.
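Here's a toy sketch of that idea. We replace the game with a coin flip whose hidden win probability stands in for the "true" value of a game state; the observed win rate over many random playouts converges to that value, with no evaluation function required:

```python
import random

# A toy demonstration of value estimation via random playouts. The hidden
# win_probability stands in for the "true" value of a game state; the Law of
# Large Numbers pulls the observed win rate toward it.
def random_playout(win_probability):
    return 1 if random.random() < win_probability else 0

def estimate_value(win_probability, num_games=10_000):
    wins = sum(random_playout(win_probability) for _ in range(num_games))
    return wins / num_games

random.seed(42)
print(estimate_value(0.8))  # close to the true value of 0.8
```

The more playouts we run, the tighter the estimate gets; the trade-off is pure compute time.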
The following description will outline how the MCTS algorithm works in detail. For the sake of simplicity we again focus solely on 2 playable moves in any given state (as we've already discovered there are on average ~30 different moves we can play in Chess).
Before we move on we need to get some minor definitions out of the way. In MCTS we keep track of 2 different parameters for every single node in our tree. We call those parameters $t$ and $n$. $t$ stands for "total" and represents the total value of that node. $n$ is the "number of visits" which reflects the number of times we've visited this node while walking through the tree. When creating a new node we always initialize both parameters with the value 0.
In addition to the 2 new parameters we store for each node, there's the so-called "Upper Confidence Bound 1" (UCB1) formula. Its application to tree search is known as UCT ("Upper Confidence bounds applied to Trees"), and it looks like this
$$ x_i + C\sqrt{\frac{\ln(N)}{n_i}} $$
This formula basically helps us decide which upcoming node, and therefore which potential game move, we should pick to start our random game series (called a "rollout") from. In the formula, $x_i$ represents the average value of the game state we're working with, $C$ is a constant called the "temperature" which we need to define manually (we just set it to 1.5 in our example here, more on that later), $N$ represents the parent node's visits and $n_i$ represents the current node's visits. When using this formula on candidate nodes to decide which one to explore further, we're always interested in the largest result.
Don't be intimidated by the Math. Just note that this formula exists and will be useful for us while working with our tree. We'll get into more details about its usage while walking through our tree.
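For reference, the formula translates directly into a small helper function. The default $C = 1.5$ matches the examples in this post, and an unvisited node is given an infinite score so it's always explored at least once:

```python
import math

# The UCT formula as a small helper. `c` defaults to 1.5 to match the
# walkthrough in this post; unvisited nodes score infinity so they always
# get explored at least once.
def uct(total_value, visits, parent_visits, c=1.5):
    if visits == 0:
        return math.inf
    average = total_value / visits   # x_i in the formula
    return average + c * math.sqrt(math.log(parent_visits) / visits)

# The numbers from the walkthrough below: t = 30, n = 1, parent visited twice.
print(round(uct(30, 1, 2), 2))  # → 31.25
```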
With this out of the way it's time to apply MCTS to find the best move we can play.
We start with the same root node of the tree we're already familiar with. This root node is our starting point and reflects the current game state. Based on this node we branch off our 2 child nodes. The first thing we need to do is use the UCT formula from above and compute the results for both child nodes. As it turns out we need to plug in 0 for almost every single variable since we haven't done anything with our tree and its nodes yet. This results in $\infty$ for both calculations.
$$ S_1 = 0 + 1.5\sqrt{\frac{\ln(0)}{0.0001}} = \infty $$
$$ S_2 = 0 + 1.5\sqrt{\frac{\ln(0)}{0.0001}} = \infty $$
We've replaced the 0 in the denominator with a very small number because division by zero is not defined
Given this we're free to choose which node we want to explore further. We go ahead with the leftmost node and perform our rollout phase which means that we play dozens of random games starting with this game state.
Once done we get a result for this specific rollout (in our case the percentage of wins for player white). The next thing we need to do is propagate this result up the tree until we reach the root node, updating both $t$ and $n$ with the respective values for every node we encounter along the way. Once done, our tree looks like this:
Next up we start at our root node again, use the UCT formula once more, plug in our numbers and compute the score for both nodes:
$$ S_1 = 30 + 1.5\sqrt{\frac{\ln(1)}{1}} = 30 $$
$$ S_2 = 0 + 1.5\sqrt{\frac{\ln(0)}{0.0001}} = \infty $$
Given that we always pick the node with the highest value we'll now explore the rightmost one. Once again we perform our rollout based on the move this node proposes and collect the end result after we've finished all our random games.
The last thing we need to do is to propagate this result up until we reach the root of the tree. While doing this we update the parameters of every node we encounter.
We've now successfully explored 2 child nodes in our tree. You might've guessed it already: we'll start again at our root node and calculate every child node's UCT score to determine which node we should explore further. In doing this we get the following values:
$$ S_1 = 30 + 1.5\sqrt{\frac{\ln(2)}{1}} \approx 31.25 $$
$$ S_2 = 20 + 1.5\sqrt{\frac{\ln(2)}{1}} \approx 21.25 $$
The largest value is the one we've computed for the leftmost node so we decide to explore that node further.
Given that this node has no child nodes yet, we add two new nodes to the tree which represent the potential moves we can play from there. We initialize both of their parameters ($t$ and $n$) with 0.
Now we need to decide which one of those two nodes we should explore further. You guessed it: we use the UCT formula to calculate their values. Given that both have $t$ and $n$ values of zero, they both evaluate to $\infty$, so we decide to pick the leftmost node. Once again we do a rollout, retrieve the value of those games and propagate this value up the tree until we reach the root node, updating all the node parameters along the way.
The next iteration will once again start at the root node where we use the UCT formula to decide which child node we want to explore further. Since we can see a pattern here and I don't want to bore you I'm not going to describe the upcoming steps in great detail. What we'll be doing is following the exact same procedure we've used above which can be summarized as follows:
1. Start at the root node and use the UCT formula to calculate the score for every child node
2. Pick the child node with the highest UCT score
3. Check if the child has already been visited. If it hasn't, do a rollout. If it has, expand it by adding its child nodes and do a rollout from one of them
4. Propagate the result back through the tree until you reach the root node
We iterate over this algorithm until we run out of time or reach a predefined threshold of visits, depth or iterations. Once this happens we evaluate the current state of our tree and pick the child node which maximizes the value $t$. Thanks to the thousands of random games we've played and the Law of Large Numbers we can be fairly certain that this move is the best one we can possibly play.
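To tie the four steps together, here's a compact, illustrative MCTS implementation for a toy Nim variant: players alternately take 1 or 2 stones from a pile, and whoever takes the last stone wins. The Node class, the game rules and the iteration count are all assumptions made purely for the sake of the example:

```python
import math
import random

# A compact, illustrative MCTS for a toy Nim variant: players alternately
# remove 1 or 2 stones from a pile; whoever takes the last stone wins.
class Node:
    def __init__(self, pile, player, parent=None, move=None):
        self.pile = pile          # stones left
        self.player = player      # player about to move (1 or 2)
        self.parent = parent
        self.move = move          # move that led to this node
        self.children = []
        self.t = 0                # total value (wins for player 1)
        self.n = 0                # number of visits

def uct(node, c=1.5):
    if node.n == 0:
        return math.inf           # unvisited nodes are always explored first
    average = node.t / node.n
    if node.parent.player == 2:   # player 2 prefers losses for player 1
        average = 1 - average
    return average + c * math.sqrt(math.log(node.parent.n) / node.n)

def rollout(pile, player):
    # Play random moves until the pile is empty; 1 if player 1 wins, else 0.
    while True:
        pile -= random.choice([1, 2]) if pile >= 2 else 1
        if pile == 0:
            return 1 if player == 1 else 0
        player = 3 - player

def mcts(pile, iterations=2000):
    root = Node(pile, player=1)
    for _ in range(iterations):
        # 1. Selection: walk down, always picking the best UCT score
        node = root
        while node.children:
            node = max(node.children, key=uct)
        # 2. Expansion: a previously visited leaf gets its children added
        if node.n > 0 and node.pile > 0:
            for move in (1, 2):
                if move <= node.pile:
                    node.children.append(
                        Node(node.pile - move, 3 - node.player, node, move))
            node = node.children[0]
        # 3. Rollout: play random games from this node's state
        if node.pile > 0:
            result = rollout(node.pile, node.player)
        else:
            result = 1 if node.player == 2 else 0   # previous player won
        # 4. Backpropagation: update t and n all the way up to the root
        while node is not None:
            node.n += 1
            node.t += result
            node = node.parent
    # Pick the most-visited move
    return max(root.children, key=lambda child: child.n).move

random.seed(0)
print(mcts(4))  # taking 1 stone leaves the opponent a losing pile of 3
```

Note that the only game-specific knowledge in here is the move generator and the terminal check; there's no evaluation function anywhere, which is exactly the point of MCTS.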
That's all there is. We've just learned, applied and understood Monte Carlo Tree Search!
As you might suspect, MCTS is very compute intensive since we have to play through thousands of random games. We therefore need to be very clever about where we invest our resources to find the most promising path in our tree. We can control this behavior with the aforementioned "temperature" parameter $C$ in our UCT formula. With this parameter we balance the trade-off between "exploration vs. exploitation".
A large $C$ value puts us into "exploration" mode. We'll spend more time visiting least-explored nodes. A small value for $C$ puts us into "exploitation" mode where we'll revisit already explored nodes to gather more information about them.
Thanks to its simplicity and its reliance on randomness instead of a handcrafted evaluation function, MCTS is a widely used game tree search algorithm. DeepMind extended MCTS with Deep Neural Networks to optimize its performance in finding the best Go moves to play. The resulting Game AI was so strong that it reached superhuman level performance and defeated the Go World Champion Lee Sedol 4-1.
In this blog post we've looked into 2 different tree search algorithms which can be used to build sophisticated Game AIs.
While Minimax combined with Alpha-Beta pruning is a solid solution to approach games where an evaluation function to estimate the game outcome can easily be defined, Monte Carlo Tree Search (MCTS) is a universally applicable solution given that no evaluation function is necessary due to its reliance on randomness.
Raw Minimax and MCTS are only the start and can easily be extended and modified to work in more complex environments. DeepMind cleverly combined MCTS with Deep Neural Networks to predict Go game moves whereas IBM extended Alpha-Beta tree search to compute the best possible Chess moves to play.
I hope that you enjoyed this article and I'd like to invite you to subscribe to my Newsletter if you're interested in more posts like this.
Do you have any questions, feedback or comments? Feel free to reach out via E-Mail or connect with me on Twitter.
Do you want to learn more about Minimax and Monte Carlo Tree Search? The following list is a compilation of resources I found useful while studying such concepts.
If you're really into modern Game AIs I highly recommend the book "Deep Learning and the Game of Go" by Max Pumperla and Kevin Ferguson. In this book you'll implement a Go game engine and refine it step-by-step until at the end you implement the concepts DeepMind used to build AlphaGo and AlphaGo Zero based on their published research papers.
Computer systems were taught to identify skin cancer with a significantly higher accuracy than human doctors do. Neural Networks can generate photorealistic images of fake people and fake celebrities. It's even possible for an algorithm to teach itself entire games from first principles, surpassing human-level mastery after only a couple of hours of training.
In summary Deep Learning is amazing, mystical and sometimes even scary and intimidating.
In order to demystify and understand this "Black Box" end-to-end I decided to take a deep dive into Deep Learning, looking at it through the practical as well as the theoretical lens.
With this post I'd like to share the Curriculum I came up with after spending months following the space, reading books and research papers, doing lectures, classes and courses to find some of the best educational resources out there.
Before we take a closer look I'd like to point out that the Curriculum as a whole is still a work in progress and might change over time since new material covering state-of-the-art Deep Learning techniques is released on an ongoing basis. Feel free to bookmark this page and revisit it from time to time to stay up-to-date with the most recent changes.
During the research phase which resulted in the following Curriculum I triaged dozens of classes, lectures, tutorials, talks, MOOCs, papers and books. While the topics covered were usually the same the required levels of expertise in advanced Mathematics and computer programming were not.
Generally speaking one can divide most educational Deep Learning resources in two categories: "Shallow" and "Deep". Authors of "Shallow" resources tend to heavily utilize high-level Frameworks and abstractions without taking enough time to talk about the underlying theoretical pieces. "Deep" resources on the other hand usually take the bottom-up approach, starting with a lot of Mathematical fundamentals until eventually some code is written to translate the theory into practice.
I personally believe that both are important: understanding how the technology works under the covers while using Frameworks to put this knowledge into practice. The proposed Curriculum is structured in a way to achieve exactly that: learning and understanding Deep Learning from a theoretical as well as a practical point-of-view.
In our case we'll approach our Deep Learning journey with a slight twist. We won't follow a strict bottom-up or top-down approach but will blend both learning techniques together.
Our first touchpoint with Deep Learning will be in a practical way. We'll use high-level abstractions to build and train Deep Neural Networks which will categorize images, predict and generate text and recommend movies based on historical user data. This first encounter is 100% practice-oriented. We won't take too much time to learn about the Mathematical portions just yet.
Excited about the first successes we had we'll brush up our Mathematical understanding and take a deep dive into Deep Learning, this time following a bottom-up approach. Our prior, practical exposure will greatly benefit us here since we already know what outcomes certain methodologies produce and therefore have specific questions about how things might work under the hood.
In the last part of this Curriculum we'll learn about Deep Reinforcement Learning which is the intersection of Reinforcement Learning and Deep Learning. A thorough analysis of AlphaGo Zero, the famous agent that learned the Go board game from scratch and later on played against itself to become basically unbeatable by humans, will help us understand and appreciate the capabilities this approach has to offer.
During our journey we'll work on two distinct Capstone projects ("Capstone I" and "Capstone II") to put our knowledge into practice. While working on this we'll solve real problems with Deep Neural Networks and build up a professional portfolio we can share online.
Once done we'll be in a good position to continue our Deep Learning journey reading through the most recent academic research papers, implementing new algorithms and coming up with our own ideas to contribute to the Deep Learning community.
As already discussed above, Deep Learning is… Deep. Given the traction and momentum, Universities, Companies and individuals have published a near endless stream of resources including academic research papers, Open Source tools, reference implementations as well as educational material. During the last couple of months I spent my time triaging those to find the highest quality, yet up-to-date learning resources.
I then took a step back and structured the materials in a way which makes it possible to learn Deep Learning from scratch up to a point where enough knowledge is gained to solve complex problems, stay on top of the current research and participate in it.
We begin our journey in the land of Deep Learning with a top-down approach by introducing the subject "Deep Learning" in a practical and playful way. We won't start with advanced college Math, theoretical explanations and abstract AI topics. Rather we'll dive right into the application of tools and techniques to solve well-known problems.
The main reason for doing this is that it keeps us motivated: we'll solve those problems with state-of-the-art implementations which will help us see and understand the bigger picture. It's a whole lot easier to take a look under the covers of the abstractions we'll use once we know what can be achieved with them. We'll automatically come up with questions about certain results and behaviors and develop our own intuition and excitement to understand how the results came to be.
In doing this we'll take the great "Practical Deep Learning for Coders" course by the Fast.ai team which will walk us through many real-world examples of Deep Neural Network usage. Theoretical concepts aren't completely left out but will be discussed "just-in-time".
It's important to emphasize that it's totally fine (and expected) that we won't understand everything which is taught during this course the first time we hear about it. Most of the topics will be covered multiple times throughout this Curriculum so we'll definitely get the hang of it later on. If you're having problems with one topic or the other, feel free to rewatch the respective part in the video or do some research on your own. Keep in mind though that you shouldn't get too deep into the weeds since our main focus is still on the practical portions.
You should definitely recreate each and every single Jupyter Notebook which was used in the Fast.ai course from scratch. This helps you to get a better understanding of the workflow and lets you play around with the parameters to see the effects they have on the data.
When done it's a good idea to watch the following great talk by Google and this mini-course by Leo Isikdogan to solidify the knowledge we've just acquired.
Once we have a good understanding of what Deep Learning is, how it's used in practice and how it roughly works under the hood it's time to take a step back and refresh our Math knowledge. Deep Neural Networks heavily utilize Matrix multiplications, non-linearities and optimization algorithms such as Gradient Descent. We therefore need to familiarize ourselves with Linear Algebra, Calculus and some basic Probability Theory which build the Mathematical foundations of Deep Learning.
While this is certainly advanced Mathematics it's important to highlight that High School level Math knowledge is usually enough to get by in the beginning. For the most part we should just refresh our knowledge a little bit. It's definitely not advisable to spend weeks or even months studying every aspect of Linear Algebra, Calculus or Probability Theory (if that's even possible) to consider this part "done". Basic fluency in the aforementioned topics is enough. There's always enough time to learn the more advanced topics as soon as we come across them.
Having a good Mathematical understanding will pay dividends later on as we progress with more advanced Deep Learning topics. Don't be intimidated by this part of the Curriculum. Mathematics can and should be fun!
Stanford has some great refreshers on Linear Algebra and Probability Theory. If that's too shallow and you need a little bit more to get up to speed you might find Part 1 of the Deep Learning Book helpful.
Once you've brushed up the basics it's worthwhile to take a couple of days and thoroughly study "The Matrix Calculus You Need For Deep Learning" by Terence Parr and Jeremy Howard (one of the founders of Fast.ai) and the "Computational Linear Algebra" course by Rachel Thomas (also a co-founder of Fast.ai). Both resources are heavily tailored to teach the Math behind Deep Learning.
Now we're armed with a good understanding of the capabilities and the underlying Math of Deep Learning.
Given this it's time to take a deep dive to broaden our knowledge of Deep Learning. The main goal of this part is to take the practical experience and blend it with our Mathematical exposure to fully understand the theoretical building blocks of Deep Neural Networks. A thorough understanding of this will be key later on once we learn more about topics such as Deep Reinforcement Learning.
The following describes 3 different ways to take the deep dive. The approaches are certainly not mutually exclusive but could (and should) be used in conjunction to complement each other.
The path you might want to take will depend on your prior exposure to Deep Learning and you favorite learning style.
If you're a person who appreciates classical MOOCs in the form of high-quality, pre-recorded videos with quizzes and exercises you'll definitely enjoy Andrew Ng's Deeplearning.ai "Specialization for Deep Learning". This course is basically split up into 5 different sub-courses which will take you from the basics of Neural Networks to advanced topics such as Recurrent Neural Networks. While learning about all of this you'll also pick up a lot of valuable nuggets Andrew shares as he talks about his prior experience as a Deep Learning practitioner.
You can certainly get around the tuition fee for the Deeplearning.ai specialization, but it's important to emphasize that it's definitely worth every penny! You'll have access to high quality course content, can request help when you're stuck and get project reviews by classmates and experts.
Readers who enjoy books should definitely look into the "Dive into Deep Learning" book. This book was created to be a companion guide for the STAT 157 course at UC Berkeley but turned into more than that. The main focus of this book is to be at the intersection of Mathematical formulations, real world applications and the intuition behind Deep Learning complemented by interactive Jupyter Notebooks to play around with. "Dive into Deep Learning" covers all of the important concepts of a modern Deep Learning class. It requires no prior knowledge and starts with the basics of Neural Networks while moving onwards to cover advanced topics such as Convolutional Neural Networks, ending in discussions about state-of-the-art NLP implementations.
Another method to study Deep Learning in great detail is with the help of recorded university class videos. MIT released the terrific "Introduction to Deep Learning" course which is basically a recording of their 6.S191 class accessible for everyone to watch! This option is definitely one of the more advanced ways to learn the subject as some prior university-level Math and Computer Science knowledge is necessary to grok it. The huge benefit of this format is that it touches on a lot of different topics other courses simply don't cover due to missing prerequisites. If you've already been exposed to university-level Computer Science and Mathematics and like to learn with a focus on more rigorous theory, then this course is definitely for you.
Whatever route you take, it's really important that you take your time to revisit concepts and recreate their implementations from scratch. It's totally fine if you're struggling at first. It's this wandering through the dark alleys where you'll actually learn the most! Don't waste your time passively consuming content. Go out and reproduce what you've just learned!
At the end of the day it doesn't really matter what format you choose. All of these courses will prepare you equally well for the next step in your journey to Deep Learning mastery: your first Capstone project!
Focus: Supervised Deep Learning
Enough theory (for now). It's time to put our hard-earned knowledge into practice.
In our first Capstone project we'll demonstrate that we fully understand the basic building blocks of modern Deep Learning. We'll pick a problem of interest and solve it with the help of a Deep Neural Network. Since we've mostly dealt with Supervised Learning so far, our solution will be based on a Supervised Learning implementation.
Our programmatic environment will be a separate Jupyter Notebook where we code and describe every step, together with a brief justification of its necessity, in great detail. Taking the time to reason through each step helps us check ourselves, as we have to think through our architecture as well as the underlying processes that take place when our code is executed.
To further deepen our knowledge and help us get out of our comfort zone we'll restrict our implementation to low-level Frameworks, meaning that we're only allowed to use Frameworks such as PyTorch, TensorFlow or MXNet. Any usage of high-level abstraction libraries such as Fastai or Keras is strictly forbidden. Those libraries, while being great for the experienced practitioner, abstract too much away, preventing us from working through the tough decisions and tradeoffs we have to make when working on our problem.
Remember that this is the part where we'll learn the most as we're really getting into the weeds here. Don't give up, as enlightenment will find you once you've made it. It's also more than ok to go back and reread / rewatch the course material if you're having problems and need some help.
While working on this project always keep in mind that it's one of your personal portfolio projects you should definitely share online. It's these projects where you can demonstrate that you're capable of solving complex problems with Deep Learning technologies. Make sure that you really spend a good portion of your time on it and "make it pretty".
Are you struggling to find a good project to work on? Here are some project ideas which will help you get started:
Deep Reinforcement Learning is the last major topic we'll cover in this Curriculum.
One might ask what the difference between the Deep Learning we're studying and Deep Reinforcement Learning is. All the techniques we've learned and used so far were built around the concept of Supervised Learning. The gist of Supervised Learning is that we utilize large datasets to train our model by showing it data, letting it make predictions about what it thinks the data represents and then using the labeled solution to compute the difference between the prediction and the actual solution. We then use algorithms such as Gradient Descent and Backpropagation to subsequently readjust our model until the predictions it makes meet our expectations.
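The loop just described — predict, measure the error against the labels, readjust via Gradient Descent — can be sketched in a few lines of NumPy. Here's a toy linear model trained from scratch on synthetic data (all numbers and names below are illustrative, not taken from any particular course):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic labeled dataset: the "true" relationship is y = 3x + 2 plus noise.
X = rng.uniform(-1, 1, size=(100, 1))
y = 3 * X + 2 + rng.normal(0, 0.05, size=(100, 1))

w, b = 0.0, 0.0   # model parameters, starting from scratch
lr = 0.1          # learning rate

for _ in range(500):
    y_pred = w * X + b              # 1. let the model make predictions
    error = y_pred - y              # 2. compare against the labeled solutions
    loss = float(np.mean(error ** 2))
    # 3. Gradient Descent: adjust each parameter against its gradient
    w -= lr * float(np.mean(2 * error * X))
    b -= lr * float(np.mean(2 * error))

print(round(w, 1), round(b, 1))  # the parameters approach the true 3 and 2
```

Frameworks like PyTorch automate the gradient computation for deep networks, but the loop structure stays exactly the same.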
You might've already noticed that Supervised Learning heavily relies on huge datasets to train and test our models via examples.
What if there's a way that our AI can teach itself what it should do based on self-exploration and guidelines we define? That's where Reinforcement Learning comes into play. With Reinforcement Learning we're able to let our model learn from first principles by exploring the environment. The researchers at DeepMind were among the first to successfully blend Deep Learning and Reinforcement Learning to let an AI teach itself to play Atari games. The only inputs the AI agent got were the raw input pixels and the score.
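The core Reinforcement Learning loop — explore, observe the score, update — can be illustrated without any Deep Learning at all. Below is a minimal tabular Q-learning sketch on a made-up 5-state corridor (the environment and hyperparameters are arbitrary choices for illustration, not DeepMind's Atari setup):

```python
import random

random.seed(42)

N_STATES = 5               # tiny corridor: states 0..4, reward waits at state 4
ACTIONS = [-1, +1]         # move left or move right
Q = [[0.0, 0.0] for _ in range(N_STATES)]   # the agent's value table
alpha, gamma, eps = 0.5, 0.9, 0.3           # learning rate, discount, exploration

for _ in range(500):                         # episodes of self-exploration
    s = 0
    while s != N_STATES - 1:
        # epsilon-greedy: mostly exploit what we know, sometimes explore
        if random.random() < eps:
            a = random.randrange(2)
        else:
            a = max((0, 1), key=lambda i: Q[s][i])
        s2 = min(max(s + ACTIONS[a], 0), N_STATES - 1)
        r = 1.0 if s2 == N_STATES - 1 else 0.0   # the "score" is the only feedback
        # Q-learning update: move the estimate toward reward + discounted future value
        Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
        s = s2

policy = [max((0, 1), key=lambda i: Q[s][i]) for s in range(N_STATES)]
print(policy[:4])   # the greedy policy: always move right
```

Deep Reinforcement Learning replaces the table `Q` with a Deep Neural Network that estimates these values from raw inputs such as pixels.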
In this part of our Curriculum we'll learn what Reinforcement Learning is and how we can combine Deep Learning and Reinforcement Learning to build machine intelligence which learns to master tasks in an autodidactic way.
As per usual there are different ways to learn Deep Reinforcement Learning.
Thomas Simonini has a great "Deep Reinforcement Learning Course" which focuses on the practical pieces of Deep Reinforcement Learning as you'll implement real world applications throughout his class.
OpenAIs "SpinningUp AI" course is another great resource which strikes a really good balance between practical examples and theoretical foundations.
If you're looking for a university-level class which heavily focuses on the theoretical underpinnings I'd highly recommend the "Advanced Deep Learning and Reinforcement Learning Class" taught jointly by UCL and DeepMind.
Every resource listed here will help you understand and apply Deep Reinforcement Learning techniques. While some are more focused on the practical portions others go really deep into the trenches of theoretical rigor. It's definitely worthwhile to look into all of them to get the all-around view and best mixture between theory and practice.
Once you've successfully made your way through one of the Deep Reinforcement Learning courses it's a good idea to revisit the key ideas by reading the excellent blog posts "Deep Reinforcement Learning: Pong from Pixels" by Andrej Karpathy and "A (Long) Peek into Reinforcement Learning" by Lilian Weng, as they give a nice, broader overview of the different topics covered during class.
Aside: If you're fascinated by the possibilities of Reinforcement Learning I'd highly recommend the book "Reinforcement Learning: An Introduction" by Richard Sutton and Andrew Barto. The recently updated 2nd edition includes chapters about Neuroscience, Deep Neural Networks and more. While it's possible and desirable to buy the book at your local bookstore you can also access the book as a freely available PDF online.
Focus: Deep Reinforcement Learning
It's time for our second and last Capstone Project where we'll use Deep Reinforcement Learning to let our AI teach itself to solve difficult real-world problems.
The same restrictions from our first Capstone project also apply here. We'll implement the solution in a dedicated Jupyter Notebook where we write our code and the prose to describe what we're doing and why we're doing it. This helps us test our knowledge since we have to take the time to think through our current implementation and its implications for the system as a whole.
As with the Capstone I project it's forbidden to use higher-level abstraction libraries such as Fastai or Keras. Our implementation here should only use APIs provided by lower-level Frameworks such as PyTorch, TensorFlow or MXNet.
Keep in mind that it's totally fine to feel stuck at some point. Don't be discouraged! Take your time to revisit the material and ensure that you fill your knowledge gaps before moving on. It's those moments of struggle where you grow the most. Once you've made it, you'll feel excited and empowered.
The result of this Capstone project is another crucial piece of your personal Deep Learning portfolio. Make sure to set aside enough time to be able to put in the effort so that you can showcase your implementation online.
Do you need some inspiration for projects you might want to work on? Here's a list with some ideas:
Deep Learning has gained a lot of traction in the last couple of years as major scientific breakthroughs finally made it possible to train and utilize Deep Neural Networks to perform tasks at a human expert level, ranging from cancer detection to mastery in games such as Go or Space Invaders.
In this blog post I shared the Curriculum I follow to learn Deep Learning from scratch. Right at the beginning of the journey one learns how Deep Learning techniques are used in practice to solve real-world problems. Once a baseline understanding is established it's time to take a deep dive into the Mathematical and theoretical pieces to demystify the Deep Learning "Black Box". A final exploration of the intersection of Deep Learning and Reinforcement Learning puts the reader in a great position to understand state-of-the-art Deep Learning solutions. Throughout the whole Curriculum we'll practice our skills and showcase our fluency while working on dedicated Capstone projects.
While putting this together I had the feeling that this Curriculum can look quite intimidating at first glance since lots of topics are covered and it'll definitely take some time to get through it. While I'd advise the avid reader to follow every single step in the outlined order it's totally possible to adapt and skip some topics given that everyone has different experiences, goals and interests. Learning Deep Learning should be fun and exciting. If you ever feel exhausted or struggle to get through a certain topic you should take a step back and revisit it later on. Oftentimes complicated facts and figures turn into no-brainers if we give ourselves the permission and time to do something else for the moment.
I personally believe that it's important to follow a goal while learning a new topic or skill. Make sure that you know why you want to learn Deep Learning. Do you want to solve a problem at your company? Are you planning to switch careers? Is a high-level overview enough for you since you just want to be educated about AI and its social impacts? Whatever it is, keep this goal in mind as it'll make everything more manageable during the hard times when motivation might be lacking and everything just feels too hard to pick up.
I hope that you enjoyed this article and I'd like to invite you to subscribe to my Newsletter if you're interested in more posts like this.
Do you have any questions, feedback or comments? Feel free to reach out via E-Mail or connect with me on Twitter.
During my career as a Software Engineer, Math was mostly absent. From time to time I consulted old Computer Science books to do some research on algorithms I then implemented. However those were usually the only touchpoints I had with Math.
Something changed over the last couple of years. While looking for the next personal challenges and goals to grow I figured that most of the really exciting achievements heavily utilize Math as a fundamental building block. That's actually true for a lot of scientific fields including Econometrics, Data Science and Artificial Intelligence. It's easy to follow the news and roughly understand how things might work but once you try to dig deeper and look under the hood it gets pretty hairy.
I found myself regularly lost somewhere in the dark alleys of Linear Algebra, Calculus and Statistics. Last year I finally reached a fork in the road. I wanted to fundamentally change my understanding and decided to re-learn Math from scratch. After countless late nights, early mornings and weekends doing classes, exercises and proofs I'm finally at a pretty decent level of understanding advanced Mathematics. Right now I'm building upon this foundation to learn even more.
During this process I learned one important thing: Math is really amazing!
Math is the language of nature. Understanding it helps you understand how our world works!
With this blog post I'd like to share how I went from "What is a root of a polynomial again?" to "Generalized Autoregressive Conditional Heteroskedasticity" (at least to some extent). I'll share the Curriculum I created and followed, the mistakes I made (spoiler: I made a lot) and the most useful resources I used throughout this journey.
Before we start I want to be honest with you: Math is a really involved discipline. There's a lot out there…
And it can certainly be overwhelming. However if you're really dedicated and want to put in those hours you'll make it! If I can do it so can you!
Please keep in mind that this is the path which worked for me. This doesn't necessarily mean that it will be as efficient for you. In my case I need to study, self-explain and practice, practice, practice to really understand a topic at hand. I know of people who can just sit in class, listen and ultimately get it. That's definitely not how I operate.
Alright. Let's get started!
Math is one of those subjects where you'll find a nearly endless stream of resources. Looking closer they all vary widely in terms of quality, density and understandability.
My first approach to ramping up my Math skills was to skim through an interesting research paper, write down all the Math I didn't understand and look those terms up to study them in greater detail. This was fundamentally wrong on many levels. After some trial and error I took a step back and did a lot of research to figure out which topics I should study to support my goal and how those topics are related to one another.
The Curriculum I finally put together is a good foundation if you want to jump into other "Hard Sciences". My personal goal was to learn the knowledge I need to take a really deep dive into Artificial Intelligence. To be more specific I'm really excited about Deep Learning and the next steps in the direction of Machine intelligence.
Every topic covered in this Curriculum rests on 3 basic pillars to build a solid Mathematical foundation:
Intuition
Videos, interactive visualizations and other helpful resources which outline how the Math came to be and how it works on an intuitive level.
Deep Dive
A good enough "deep dive" to get familiar with the foundational concepts while avoiding confusion due to overuse of theorems, proofs, lemmas, etc.
Practicality
Practice, practice, practice. Resources such as books with lots of exercises to solidify the knowledge.
Algebra is the first topic which should be studied extensively.
Having a really good understanding of Algebra makes everything a whole lot easier! Calculus comes down to 90% Algebra most of the time. If you know how to solve Algebra problems you won't have a hard time in Calculus either.
Most of you might remember a phrase similar to
"Solve this equation for x"
That's what Algebra is about. In an Algebra class you'll learn about the following topics:
As stated above it's of uber importance that you really hone your Algebra skills. I'm repeating myself but Algebra is one of the main building blocks for advanced Mathematics.
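As a tiny refresher on "solving for x", the quadratic formula fits in a few lines of Python (the example equation below is made up):

```python
import math

def solve_quadratic(a, b, c):
    """Return the real roots of a*x**2 + b*x + c = 0, smallest first."""
    discriminant = b ** 2 - 4 * a * c
    if discriminant < 0:
        return []                        # no real roots
    root = math.sqrt(discriminant)
    return sorted([(-b - root) / (2 * a), (-b + root) / (2 * a)])

print(solve_quadratic(1, -5, 6))  # x**2 - 5x + 6 = 0 → [2.0, 3.0]
```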
In Trigonometry you'll study the relationships between the side lengths and angles of triangles.
You'll learn about the unit circle and its relation to sin and cos, cones and their relation to circles, ellipses, parabolas and hyperbolas, Pythagoras' Theorem and more. Trigonometry is interesting in itself since it can be immediately applied to real-life problems.
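The unit-circle relationship between sin, cos and Pythagoras' Theorem mentioned above is easy to verify numerically (the angles below are arbitrary):

```python
import math

for degrees in (0, 30, 45, 90, 123):
    theta = math.radians(degrees)             # trig functions expect radians
    x, y = math.cos(theta), math.sin(theta)   # a point on the unit circle
    # Pythagoras on the unit circle: the radius (hypotenuse) is always 1
    assert math.isclose(x ** 2 + y ** 2, 1.0)

print("sin^2 + cos^2 = 1 holds for every angle checked")
```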
Here's a list of topics you'll usually learn in a Trigonometry class:
Generally speaking this course is rather short. Nevertheless it's a good preparation class for Calculus.
The study of continuous change is one of the main focus areas in Calculus.
This might sound rather abstract and the intuition behind it is really paradoxical if you think about it (see "Essence of Calculus" below). However you might remember that you dealt with Derivatives, Limits and area calculations for functions.
There are usually 3 different Calculus classes (namely Calculus I, II and III) one can take. Those 3 classes range from easy topics such as "Derivatives" and "Limits" to advanced topics such as "Triple Integrals in Spherical Coordinates". I'd suggest to definitely take the first class (Calculus I) and continue with the second one (Calculus II) if time permits. If you're in a hurry, taking Calculus I is usually sufficient.
In Calculus I you'll learn about:
Calculus is an important topic since it's heavily used in optimization problems to find local minima. The "Gradient Descent" algorithm uses techniques from Calculus such as Derivatives and is leveraged in modern (Deep) Neural Networks to adjust the weights of Neurons during Backpropagation.
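To make the link between Derivatives and Gradient Descent concrete, here's a minimal sketch that minimizes f(x) = (x - 3)² by repeatedly stepping against its derivative (the function, starting point and step size are arbitrary choices for illustration):

```python
def f(x):
    return (x - 3) ** 2     # a simple function with its minimum at x = 3

def f_prime(x):
    return 2 * (x - 3)      # its derivative, straight from Calculus I

x = 0.0                     # arbitrary starting point
lr = 0.1                    # learning rate (step size)
for _ in range(100):
    x -= lr * f_prime(x)    # step downhill along the slope

print(round(x, 4))          # → 3.0, the minimum of f
```

This is the same principle Backpropagation applies to every weight in a Neural Network, just with millions of variables at once.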
Linear Algebra is one of the most, if not the most important topic when learning Math for Data Science, Artificial Intelligence and Deep Learning.
Linear Algebra is pretty much omnipresent in modern computing since it lets you efficiently do calculations on multi-dimensional data. During childhood you probably spent quite some time in front of your computer screen while wading through virtual worlds. Photorealistic 3D renderings are possible thanks to Math and more specifically Linear Algebra.
Linear Algebra courses usually cover:
As already stated above, Linear Algebra is one of the most important topics in modern computing. Lots of problems such as image recognition can be broken down into calculations on multi-dimensional data.
You might have heard about the Machine Learning framework TensorFlow which was developed and made publicly available by Google. Well, a Tensor is just a fancy word for a higher-dimensional way to organize information. Hence a Scalar is a Tensor of rank 0, a Vector is a Tensor of rank 1, an N x N Matrix is a Tensor of rank 2, etc.
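This rank hierarchy is easy to see with NumPy, where the `ndim` attribute reports a Tensor's rank:

```python
import numpy as np

scalar = np.array(3.14)                  # rank 0: a single number
vector = np.array([1.0, 2.0, 3.0])       # rank 1: a list of numbers
matrix = np.array([[1.0, 2.0],
                   [3.0, 4.0]])          # rank 2: a grid of numbers
cube = np.zeros((2, 2, 2))               # rank 3: e.g. a stack of matrices

print(scalar.ndim, vector.ndim, matrix.ndim, cube.ndim)  # → 0 1 2 3
```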
Another interesting fact is that Deep Neural Networks are usually trained on GPUs (Graphics Processing Units) or TPUs (Tensor Processing Units). The simple reason is that GPUs and TPUs are way better at processing Linear Algebra computations compared to CPUs, since GPUs, at least, were invented as dedicated hardware units to do exactly that when rendering computer graphics.
Aside: Here's the original paper by Andrew Ng et al. where GPUs were first explored to carry out Deep Learning computations.
The last topic which should be covered in this Curriculum is Statistics & Probabilities.
While both topics are sometimes taught separately it makes sense to learn them in conjunction since statistics and probabilities share a deep underlying relationship.
A typical Statistics & Probabilities class covers:
In Data Science one usually has to deal with statistical analysis to see if the computations actually made sense. Furthermore it's helpful to compute and visualize correlations between data and certain events. Bayes' Rule is another important tool which helps us update our beliefs about "the world" when more evidence is available. The realms of Machine Learning and Deep Learning usually deal with lots of uncertainty. Having a good toolbox to deal with this makes our life a whole lot easier.
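Bayes' Rule in action — updating a belief once new evidence arrives — fits in a few lines. The numbers below are a made-up diagnostic-test scenario, purely for illustration:

```python
def bayes(prior, likelihood, false_positive_rate):
    """P(H | E) = P(E | H) * P(H) / P(E)."""
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence

# A condition affects 1% of people; the test catches 95% of cases
# but also fires on 5% of healthy people.
posterior = bayes(prior=0.01, likelihood=0.95, false_positive_rate=0.05)
print(round(posterior, 3))  # → 0.161: a positive test is weaker evidence than intuition suggests
```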
A pretty popular example of applied statistics is the Monte Carlo Tree Search algorithm. This heuristic algorithm was used in DeepMind's AI breakthrough "AlphaGo" to determine which moves it should consider while playing the board game Go.
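Monte Carlo methods in general boil down to "sample randomly, then aggregate". The classic toy example — estimating π by throwing random points at a square — shows the idea (this is of course not AlphaGo's tree search, just the statistical principle underneath it):

```python
import random

random.seed(1)

n = 100_000
# Throw random points into the unit square; the fraction landing inside
# the quarter circle of radius 1 approaches pi / 4.
hits = sum(random.random() ** 2 + random.random() ** 2 <= 1.0 for _ in range(n))
pi_estimate = 4 * hits / n

print(round(pi_estimate, 2))   # hovers around 3.14
```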
Feel free to read through the official paper for more information about the underlying technologies. Trust me, it's amazing to read and understand how Math played such a huge role in building such a powerful contender.
As I already stated above it's been quite a journey and I made lots of mistakes along the way.
In this section I'd like to share some of those mistakes so that you don't have to go through this yourself.
The first mistake I made was jumping straight into Math without having a clear plan / Curriculum and, more importantly, a goal. I dived right into certain topics I picked up while reading research papers and quickly figured out that some of them were too advanced since I understood very little (if anything at all). My approach was to back off and start somewhere else. "Trial and error" so to say. This was obviously very costly in terms of time and resources.
The solution here was to actually have a clear goal (learning Math to understand the underlying principles of Artificial Intelligence) and take the time to research a lot to come up with a sound Curriculum and start from there. Having that sorted out I only had to follow the path and knew that I was good.
During this aforementioned trial and error phase I made the mistake of taking way too many MOOCs. Don't get me wrong, MOOCs are great! It has never before been possible to take an MIT course from your couch. In my case exactly that was the problem. Most of the time I was passively watching the course content, nodding along. After a couple of "completed courses" and the feeling of knowing the ins and outs I jumped into more sophisticated problems only to figure out that I had developed pretty shallow knowledge.
Doing a retrospective on the "completed courses" I saw that my learning style isn't really tailored to MOOCs. I decided to switch my focus to the good old physical textbooks. I especially focused on textbooks with good didactics, lots of examples and exercises with solutions (the Schaum's Outlines series is golden here). Switching from passively consuming to actively participating in the form of working through numerous exercises was really the breakthrough for me. It ensured that I left my comfort zone, went deep into the trenches and really battle-tested my knowledge about the topic at hand.
The other upside of using textbooks is that it made it possible to learn in a distraction free environment. No computer, no notifications, no distractions. Just me, a black coffee and my Math textbook!
Another, final tip I'd like to share is that you should really keep track of your feelings and engagement while studying. Do you feel fired up? Are you excited? Or are you just consuming while your thoughts constantly wander off because you don't really care that much? If that's the case then it's usually time to move on. Don't try to push through. There's nothing worse than completing a course just for the sake of completing it. If it doesn't feel right or isn't working for you it's important to let go and move on. There's enough material out there and maybe the next resource suits your needs.
In this blog post I've shared my journey from being someone who treated Math as something one should merely have heard about to someone who learned to love Math and its applications to solve complex problems. In order to really understand and learn more about Artificial Intelligence and Deep Learning I created a Curriculum which does not only cover the underlying Math concepts of such disciplines but will also serve the student well when learning about other "Hard Sciences" such as Computer Science in general, Physics, Meteorology, Biology, etc.
I'm still early in my Math journey and there's an infinite amount of exciting stuff to learn. With the given Curriculum I feel confident that I've gained a solid foundation to pick up more complex topics I'll encounter while learning about Artificial Intelligence, Deep Learning and Advanced Mathematics in general.
I hope that you enjoyed this article and I'd like to invite you to subscribe to my Newsletter if you're interested in more posts like this.
Do you have any questions, feedback or comments? Feel free to reach out via E-Mail or connect with me on Twitter.