By Sachin Tah

Machine Learning Simplified - Part II (Supervised Learning)

Updated: Dec 5, 2020



In Part I, we discussed that ML algorithms are broadly divided into Supervised, Unsupervised, and Reinforcement Learning.


For part one of this series, please refer to the following link: https://www.sachintah.com/post/machine-learning-simplified-part-i


In this article, we will start our discussion with Supervised Learning.


Now, before we discuss further, let us understand what a Machine Learning model is.

A machine learning model is a file that has been trained on some training data. Once trained, the model is capable of making predictions for newly supplied data. ML algorithms are all about creating the desired models.


An ML algorithm uses a function to come up with the model. This function is called the hypothesis (h) function. Let us understand some terms by considering an example.

Let us consider the typical example of a house price calculator, where we have house price data based upon size. I have taken the example from Andrew Ng's course on Coursera, since I will use the same dataset provided by him.


Some terms which we will be using throughout the series are mentioned below:


x - Input variable, also called a feature; in this case, the feature is the size in sq. ft.

y - Output variable; in our case, the house price

m - Total number of training examples; in the above case, m = 5

In the above figure we have


x = {260, 280, 340, 450, 550} and

y = {1000, 1200, 1500, 2000, 2800}

So, from the above dataset, we need to devise a learning algorithm that will help us create a model capable of predicting the housing price when the area is supplied.










This means y is some function of x, y = function(x)

i.e. estimated price = function(area)


In ML, plotting a graph of your data is extremely important in order to analyze it. Let us plot our housing price data by placing size on the X-axis and price on the Y-axis. For plotting data and writing ML programs, we will be using Octave, which has a rich set of ready-to-use ML functions. You can download and install Octave from https://www.gnu.org/software/octave/index. We will discuss basic Octave operations in the next tutorial; for now, I am plotting the data using Octave.

We need to come up with a function that will help us predict the housing price for a given area. For this kind of data, which has one input variable or feature, we can use a linear regression function. Our objective is to fit a line to the above graph that fits the data best, like the one shown below.


This function is called the hypothesis function h. For a hypothesis using linear regression with only one input variable, the formula is

h(x) = θ0 + θ1 * x

where θ0 (the intercept) and θ1 (the slope) are constants to be learned.

So, by definition, "A hypothesis function is a function that uses available data to learn how to best map inputs to outputs". This is also termed function approximation.
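A tiny sketch of such a hypothesis in Python (assuming the standard single-variable form h(x) = θ0 + θ1 * x; the parameter values below are made up purely for illustration):

```python
# Hypothesis for single-variable linear regression: h(x) = theta0 + theta1 * x
def hypothesis(theta0, theta1, x):
    return theta0 + theta1 * x

# House sizes (sq. ft.) from the dataset in this article
sizes = [260, 280, 340, 450, 550]

# Illustrative (made-up) parameter values
theta0, theta1 = 0, 5
predictions = [hypothesis(theta0, theta1, x) for x in sizes]
print(predictions)  # [1300, 1400, 1700, 2250, 2750]
```

Different choices of θ0 and θ1 give different lines; the rest of the article is about choosing them well.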


Let us try to manually calculate the hypothesis function by assuming some random values for these constants and see how our line fits.


Assumption#1

Let us try to plot a graph by providing different values of x and finding the value of our hypothesis function. Instead of taking some lower value of theta, I am taking 150 in order to plot the graph within the same scale (also remember we have a scale of 1000 for the price, so actually it is 150*1000).

Plotting different values of hypothesis function for a given value of x gives us the below graph


If we draw this hypothesis over our data points, we get a straight line. This line signifies our hypothesis function, which means that if you try to find the price of a house using this hypothesis, you will always get the value 150. Obviously, this is incorrect; however, this random, higher value of theta is just for the sake of explaining properly and drawing the graph on a scale similar to that of the areas and prices. In practice, we usually start by taking very small values for both constants. We will see feature scaling in subsequent articles.


Assumption#2

With the new values of theta, the hypothesis will be calculated as below

Plotting different values of hypothesis function for a given value of x gives us the below graph





This one is a bit closer to the original data.


How to Supervise?


If you recollect from my previous post, this is a supervised learning algorithm, and we have the advantage of cross-checking our results against the actual dataset, where each area has a housing price associated with it.


As we have assumed two different values for the constant theta, let us plot both graphs side by side to see which one is better.


By comparing each result, we will be able to identify the difference between our predicted price (the hypothesis) and the actual price (y). This will help us correct the values of both θ parameters (intercept θ0 and slope θ1) accordingly, and train our algorithm to come up with the values of θ that minimize the difference between predicted and actual prices.

Writing long programs to find the θ values by trial and error is possible, but it is a tedious exercise. Let us try to solve this problem using mathematical equations, and later see how we can implement these equations in our code to achieve the same results.


Cost Function


In the above graph, D signifies the difference between the predicted value and the actual value. Our intention is to minimize D for most of the cases, which means that for any given value of x, our algorithm should try to minimize h(x) - y.


Let us name the differences D for the different results d1, d2, d3, d4, d5. Since we have a total of 5 training examples, the total squared error is

D = d1² + d2² + d3² + d4² + d5², where each di = h(xi) - yi
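As a quick sketch of these differences in Python (the θ values here are assumptions for illustration only):

```python
xs = [260, 280, 340, 450, 550]        # sizes in sq. ft.
ys = [1000, 1200, 1500, 2000, 2800]   # actual prices

theta0, theta1 = 0, 5  # made-up values for illustration
# d1..d5: difference between predicted price h(x) and actual price y
ds = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
print(ds)                       # [300, 200, 200, 250, -50]
print(sum(d ** 2 for d in ds))  # 235000 -> total squared error before scaling
```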

The above cost function looks good for the 5 examples in our dataset; however, we typically have millions of examples to come up with the right model, and for such a huge number we need to scale down our cost function by the number of examples.


A scaled-down version of our cost function will look something like this:

J(θ0, θ1) = (1/2m) * Σ (h(xi) - yi)²
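A rough sketch of this scaled cost (the mean squared cost J = (1/2m) * Σ (h(x) - y)²) in Python, using the dataset above; the two θ choices compared are assumptions for illustration:

```python
# Mean squared error cost: J(theta0, theta1) = 1/(2m) * sum((h(x_i) - y_i)^2)
def cost(theta0, theta1, xs, ys):
    m = len(xs)
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

xs = [260, 280, 340, 450, 550]        # sizes in sq. ft.
ys = [1000, 1200, 1500, 2000, 2800]   # prices

# Two illustrative (made-up) parameter choices
print(cost(0, 4, xs, ys))  # 42760.0
print(cost(0, 5, xs, ys))  # 23500.0 -> lower cost, better fit
```

The lower the cost, the better the hypothesis fits the training data.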

Let us try to calculate the cost for both the θ assumptions we made above


# Cost For Assumption1


# Cost For Assumption 2

(Don't worry about the actual cost figures, since I have taken these examples only to show how to calculate the cost. Obviously, assumption 2 is better than assumption 1 due to its lower value.)


The cost function is also called the mean squared error function; I am not going to go into more detail about this function here.


Plotting Cost Function


We need to know how our cost function varies for various values of θ. Let us plot a graph of the cost function against the value of θ and see how it behaves.


For this, let us simplify the hypothesis function so that it is a little easier to plot as a two-dimensional graph for better understanding. Assume θ0 = 0, so that h(x) = θ1 * x.

With this change, our cost function will look like

J(θ1) = (1/2m) * Σ (θ1 * xi - yi)²

Let us try to see how the value of J(θ) changes when we assume different values of θ.



If you have noticed, the value of J(θ) decreases to its lowest value and then starts to increase again. We are interested in this lowest value; the corresponding θ is the value we will use for our model. Let me quickly summarize all of the above contents and terms we used.
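This sweep can be sketched with the simplified hypothesis h(x) = θ * x (the θ values tried below are arbitrary assumptions):

```python
# Cost with the simplified hypothesis h(x) = theta * x (theta0 fixed at 0)
def cost(theta, xs, ys):
    m = len(xs)
    return sum((theta * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

xs = [260, 280, 340, 450, 550]
ys = [1000, 1200, 1500, 2000, 2800]

# Try a few values of theta and watch J(theta) fall, bottom out, then rise
for theta in [3, 4, 4.5, 5, 6]:
    print(theta, cost(theta, xs, ys))
# Among the values tried, the minimum cost here is near theta = 4.5
```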

Till now, we know that getting a minimal cost will give us the best predictions for our algorithm. However, we still do not know how to come up with the best values for θ0 and θ1. There is an algorithm that helps us achieve this by repeatedly taking steps in the right direction.


Gradient Descent (GD)


GD is an algorithm that is used to calculate the minimum of a cost function. As per the definition, "Gradient Descent is a first-order iterative optimization algorithm for finding a local minimum of a differentiable function".


So it is an iterative algorithm that helps us find the minimum value of a cost function; we need to run GD repeatedly. Let us first see how to calculate GD mathematically:

θj := θj - α * ∂/∂θj J(θ0, θ1)   (for j = 0 and j = 1)

where α is the learning rate.

The above update takes the partial derivative of the cost function with respect to both θ0 and θ1 iteratively. This makes sure that θ0 and θ1 eventually converge to a minimum. If you are interested in how GD reaches local and global minima, you can refer to articles available on the web.


Applying gradient descent to linear regression, the update rules look like the following, where both theta values need to be updated simultaneously:

θ0 := θ0 - α * (1/m) * Σ (h(xi) - yi)

θ1 := θ1 - α * (1/m) * Σ (h(xi) - yi) * xi

GD is a common algorithm that is used in many ML algorithms. The equations signify that we need to start our hypothesis with an initial guess and then repeatedly apply the gradient descent updates to make our hypothesis more and more accurate.
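A minimal sketch of these simultaneous updates in Python (the learning rate and iteration count below are assumptions; no feature scaling is applied, so α must be very small for this unscaled data):

```python
# Batch gradient descent for single-variable linear regression
def gradient_descent(xs, ys, alpha=1e-6, iters=5000):
    m = len(xs)
    theta0, theta1 = 0.0, 0.0  # initial guess
    for _ in range(iters):
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        # Compute both gradients first, then update simultaneously
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        theta0 -= alpha * grad0
        theta1 -= alpha * grad1
    return theta0, theta1

xs = [260, 280, 340, 450, 550]
ys = [1000, 1200, 1500, 2000, 2800]
t0, t1 = gradient_descent(xs, ys)
print(t0, t1)  # theta1 settles near 4.6 on this data
```

With a larger α the updates would overshoot and diverge on this unscaled data, which is why feature scaling (mentioned earlier) matters in practice.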


I am sure all of this sounds very interesting to you. These are the building blocks of ML algorithms, so it is important to understand them very well conceptually. There are ready-made ML libraries available that can calculate a linear regression in one line of code.


I know there are still some unanswered questions about how to write code for the above equations and create a model.


In my subsequent articles, we will actually write some code to implement a Supervised learning algorithm using Octave or Python.


Happy Reading....
