
ML Simplified - Part 8 (Logistic Regression)

  • Writer: Sachin Tah
  • Dec 28, 2020
  • 5 min read



Here is the link for the first part



In the first part of my ML series, I talked about two types of supervised machine learning algorithms, linear & logistic regression.


With a supervised learning algorithm, we supply an input dataset to the ML program; this dataset consists of a set of inputs and the correct answer associated with each input. The algorithm then predicts the correct output based upon the training it received from this dataset.


A supervised algorithm is termed Classification if the output we are trying to predict takes one of a small set of discrete values, which means the output is a category such as "Red" or "Green", "Yes" or "No", "True" or "False".


Some examples where classification is used in our day-to-day life are predicting whether an email is spam or not, and predicting whether an online transaction is fraudulent or not.


Logistic regression is not much different from linear regression as far as the implementation approach is concerned; the steps to come up with a solution remain the same, only the formulas are different.


Following are the steps:

  • Hypothesis - First, we start with the hypothesis function

  • Gradient Descent - Second, we perform gradient descent to find the optimum value of theta

  • Prediction - Finally, we predict the output for new inputs using the theta values and the hypothesis function


Visualizing Data


The first task in any ML problem is to have a look at what our input data looks like. Let us first revisit what a typical linear regression dataset looks like.

With Linear Regression, you have a continuous output value associated with each input example.

For example, for a house which is 550 sq ft in size, the price is $2,800 (in 1000s).





Let us now have a look at the logistic regression dataset.


With Logistic Regression, the output associated with the input examples takes one of a small set of discrete values; in the example shown, the values are 1 and 0, which stand for Yes and No respectively.


So when you get an input example and need to predict the result, your hypothesis provides the probability of the classification. In this case, a prediction would be a value like 0.51, which means there is a 51% chance that the output is 1. A classification problem where the output takes only two discrete values is called binary classification.


Let us try to visualize the data from the above dataset and see what it looks like.



In the plot, * represents individuals who got admitted to the college and the dots represent those who were not admitted. If you remember how we plotted our hypothesis line for linear regression, then using a similar approach our ideal hypothesis line looks like the graph below.



Any new input example falling on the upper right side, above the line, will be considered admitted, and one falling on the lower left side, below the line, will be considered not admitted. This line is also referred to as the decision boundary.
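To make this concrete, here is a minimal matplotlib sketch of such a scatter plot. The file name admissions.csv and the column names exam1, exam2, and admitted are assumptions made for illustration, not details from the original dataset.

```python
# Minimal sketch of the scatter plot described above.
# Assumed: a CSV file "admissions.csv" with columns exam1, exam2, admitted (1 or 0).
import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv("admissions.csv")            # hypothetical file name
admitted = data[data["admitted"] == 1]
rejected = data[data["admitted"] == 0]

plt.scatter(admitted["exam1"], admitted["exam2"], marker="*", label="Admitted")
plt.scatter(rejected["exam1"], rejected["exam2"], marker=".", label="Not admitted")
plt.xlabel("Exam 1 score")
plt.ylabel("Exam 2 score")
plt.legend()
plt.show()
```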


Hypothesis


Without going into the in-depth mathematical details of how to arrive at a hypothesis function for logistic regression, I will directly provide the well-known formula for calculating the hypothesis for logistic regression.


We have seen that the hypothesis for linear regression is defined as

h(x) = θᵀx

and for logistic regression, this formula is

h(x) = g(θᵀx)

where g is defined as

g(z) = 1 / (1 + e^(-z))


The above function is also called the sigmoid function or logistic function, and hence the name logistic regression. Replacing z with θᵀx makes our hypothesis

h(x) = 1 / (1 + e^(-θᵀx))

The sigmoid function will make sure that our output value is always between 0 and 1. In simple terms, this hypothesis function will give you the probability of your prediction being 1.


So, for example, if your hypothesis is 0.75, then there is a 75% chance that your prediction is 1, or Yes, or whatever value you attached 1 to. Don't worry about the implementation, as most ML libraries let you implement the above equation in a one-liner.
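As a rough illustration, here is what that one-liner looks like in Python with NumPy; the function and variable names are mine, not from any particular library.

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function: maps any real value into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(theta, X):
    """h(x) = g(theta^T x) for each row of X; returns the probability that y = 1."""
    return sigmoid(X @ theta)

# A hypothesis value of about 0.75 means roughly a 75% chance that the output is 1.
```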


Cost Function


If you remember my explanation of simple linear regression, we calculated the cost of the hypothesis using the difference between the predicted value and the actual value.


Here are the highlights of what we discussed for linear regression. Considering d as the difference between the actual and predicted value, we can calculate the cost in linear regression for one input parameter and 4 examples as

J = (d₁² + d₂² + d₃² + d₄²) / (2 × 4)

And when we generalized this equation for any number of features and m examples, our cost function was

J(θ) = (1 / 2m) Σ (h(xⁱ) − yⁱ)²

The above cost function works very well for linear regression, where y is a continuous value for every input example. In the case of logistic regression, we have only two values for y, i.e. y is either 1 or 0.

The hypothesis function of logistic regression is the sigmoid function, which assures that the output values are between 0 and 1. If we use the above cost function with this hypothesis, we end up with a non-convex graph for J(θ), which means it has multiple local minima, making it difficult to find the minimum cost for a given set of θ. In short, the above cost function won't work for logistic regression.


We therefore need a cost function that tells us the probability of y being 1 or 0; the value of θ that yields the highest probability is the optimum.


The following two functions are suggested for calculating the cost:

Cost(h(x), y) = −log(h(x))        if y = 1
Cost(h(x), y) = −log(1 − h(x))    if y = 0

Combining both equations, the vectorized formula for the cost looks like

J(θ) = −(1/m) [yᵀ log(h) + (1 − y)ᵀ log(1 − h)]

The cost for a specific value of θ is calculated using the above equation, where h is the vector of hypothesis values we calculated using the sigmoid function above.
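A minimal NumPy sketch of this vectorized cost calculation, assuming X already contains a column of ones for the intercept term and y holds the 0/1 labels:

```python
import numpy as np

def cost(theta, X, y):
    """J(theta) = -(1/m) * [y^T log(h) + (1 - y)^T log(1 - h)]"""
    m = len(y)
    h = 1.0 / (1.0 + np.exp(-(X @ theta)))      # sigmoid hypothesis
    h = np.clip(h, 1e-12, 1 - 1e-12)            # guard against log(0)
    return -(y @ np.log(h) + (1 - y) @ np.log(1 - h)) / m
```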


Gradient Descent


Now that we have a cost function for a particular value of θ, we need to run gradient descent in a similar manner as we did for linear regression.


The gradient descent update remains the same as for linear regression,

θⱼ := θⱼ − α (1/m) Σ (h(xⁱ) − yⁱ) xⱼⁱ

and we need to run it in iterations in order to come up with an optimum value of θ.

Once we obtain the optimum value of θ, we can perform predictions on a new set of data. These predictions are made using our hypothesis function, which is again the sigmoid function. As our hypothesis gives us a probability, we can decide upon a benchmark for it. For example, if the hypothesis is greater than 0.5 (a 50% chance), we consider the output to be 1.
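Putting the pieces together, a sketch of the gradient descent loop and the 0.5-threshold prediction might look like the following; alpha and the iteration count are arbitrary example values, not recommendations from the article.

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, iterations=1000):
    """Repeatedly apply theta := theta - alpha * (1/m) * X^T (h - y)."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iterations):
        h = 1.0 / (1.0 + np.exp(-(X @ theta)))   # sigmoid hypothesis
        theta -= alpha * (X.T @ (h - y)) / m     # gradient step
    return theta

def predict(theta, X, threshold=0.5):
    """Classify as 1 when the predicted probability exceeds the chosen benchmark."""
    probabilities = 1.0 / (1.0 + np.exp(-(X @ theta)))
    return (probabilities >= threshold).astype(int)
```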


Advanced Optimization


When we run gradient descent, we need to choose an initial value of alpha (the learning rate) and a number of iterations, and then use trial and error to find the combination of alpha and iterations that gives the best results. We also need to consider the execution time of gradient descent, especially when the dataset has millions of records.
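One rough way to do this trial and error, reusing the gradient_descent and cost sketches above; the dataset and the candidate alpha values here are made up purely to keep the example self-contained.

```python
import numpy as np

# Tiny synthetic dataset: an intercept column of ones plus two features,
# with arbitrary 0/1 labels (for illustration only).
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])
y = (X[:, 1] + X[:, 2] > 0).astype(float)

# Try a few learning rates and compare the final cost each one reaches.
for alpha in (0.001, 0.01, 0.1, 1.0):
    theta = gradient_descent(X, y, alpha=alpha, iterations=1000)
    print(f"alpha={alpha}: final cost = {cost(theta, X, y):.4f}")
```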


In my next post in this series, I will first show how to implement logistic regression using the traditional gradient descent looping method, and later implement the same using advanced optimization algorithms that are directly available in environments such as Octave and Python libraries. We will compare the results and discuss the pros and cons of each approach.


Happy Reading.....