Sachin Tah

Machine Learning Simplified - Part IV (Simple Linear Regression in Python)





Here are the links to my previous articles in this ML series:


Part I


Part II


Part III


In this article, we will implement a simple linear regression example in Python. I assume you are comfortable with Python; if not, this is a good chance to learn it while implementing ML algorithms. I will not explain the steps in detail, since I already covered them while implementing linear regression in Octave.

Simple Linear Regression Example (One Variable)


Here is the link to our dataset containing population and profit in different cities.



Here is a snapshot of our data

6.1101,17.592
5.5277,9.1302
8.5186,13.662
7.0032,11.854
5.8598,6.8233
8.3829,11.886
7.4764,4.3483

So we have a comma-separated dataset with Population (in 10,000s) in the first column and Profit (in $10,000s) in the second. Our first step is to load this data into our program and visualize its contents. Here are the steps we will perform to build our first ML algorithm using linear regression with one variable.


Algorithm Steps

  1. Load data in variables

  2. Visualize the data

  3. Write a Cost function

  4. Run Gradient descent for some iterations to come up with values of theta0 and theta1

  5. Plot your hypothesis function to see how well it fits the data

  6. Run predictions and see results


Python Implementation


1. Load data in variables


import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

def LinearRegression():
    # Load CSV File in Variables
    data = pd.read_csv('franchisedata.txt',header = None)
  

The data will be loaded into an m x 2 structure (one row per training example). Now we need to separate it into x and y variables, x being the input, y the output, and m the total number of examples in the dataset. You will need numpy, matplotlib, and pandas installed, for example via pip3 install numpy matplotlib pandas.


import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

def LinearRegression():
    # Load CSV File in Variables
    data = pd.read_csv('franchisedata.txt',header = None)
    
    # Transfer data in x & y variable
    x = np.array(data.iloc[:, :-1].values)
    y = np.array(data.iloc[:, -1].values)
    m = len(x)
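
As a quick optional check (a minimal sketch, not part of the original listing), you can print the shapes inside LinearRegression to confirm the data loaded as expected:

    # Optional: verify shapes after loading (one row per city)
    print(data.shape)        # (m, 2)
    print(x.shape, y.shape)  # (m, 1) and (m,)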

2. Visualize the data


Let us visualize this data by plotting a graph.

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

def LinearRegression():
    # Load CSV File in Variables
    data = pd.read_csv('franchisedata.txt',header = None)
    
    # Transfer data in x & y variable
    x = np.array(data.iloc[:, :-1].values)
    y = np.array(data.iloc[:, -1].values)
    m = len(x)

    # Plot Data
    plt.scatter(x, y,marker='s',facecolors='none',edgecolors='b')
    plt.xlabel('Population of City in 10,000s')
    plt.ylabel('Profit in $10,000s')
    plt.show()

After running this code, the following graph will be displayed. You can run the program with the command python3 LinearRegression.py



Now we need to find a hypothesis function that draws the straight line that best fits our data.


3. Write a Cost function


Our next task is to write the cost function. We will use it while performing gradient descent and store its history of values for later use. Before we can calculate the cost, we first need to compute h(θ).


h(θ) = theta0 + theta1 * x1


We need to compute the hypothesis for every value of x, take the squared difference from the corresponding output y, and sum over all examples to arrive at the cost for a given theta0 and theta1.


Cost Function

J(theta) = (1 / (2m)) * Σ (h(x(i)) - y(i))^2, summed over i = 1..m

This can be done using a plain loop (the normal method) or using a vectorized method that operates directly on matrices. To understand it better, let us first implement the cost function using the normal method. Also remember that in Python, array indices start from 0.


 
# Calculate cost using the normal (loop) method
def CalculateCost(input, output, theta):
    m = len(output)
    cost = 0
    for i in range(m):
        # Hypothesis for the i-th training example
        hypothesis = theta[0] + theta[1] * input[i, 0]
        cost = cost + (hypothesis - output[i]) ** 2
    # Divide by 2m once, after summing all squared errors
    return cost / (2 * m)
 

Our main function will now look like this:



def LinearRegression():
    # Load CSV File in Variables
    data = pd.read_csv('franchisedata.txt',header = None)
   
    # Transfer data in x & y variable
    x = np.array(data.iloc[:, :-1].values)
    y = np.array(data.iloc[:, -1].values)
    m = len(x)
    theta = np.array([0.00,0.00]) 
    
    # Plot Data
    plt.scatter(x, y,marker='s',facecolors='none',edgecolors='b')
    plt.xlabel('Population of City in 10,000s')
    plt.ylabel('Profit in $10,000s')
    plt.show()
    # Calculate Cost Using Theta as 0 defined above
    cost = CalculateCost(x,y,theta)

In order to calculate the cost using the vectorized method, we need to change our hypothesis function slightly:

h(θ) = theta0 * x0 + theta1 * x1


With x0 = 1, this gives exactly the same value as the original function. x0 is a new feature we add to our dataset, with the value 1 for every example. Let us modify our dataset:


X = np.concatenate((np.ones((m,1), dtype=int), x), axis=1)

This will add one more column to our input dataset x. Our new input matrix X will look like this:

X = [ 1  6.1101
      1  5.5277
      1  8.5186
      1  7.0032
      1  5.8598
      1  8.3829
      1  7.4764 ]

Now calculating the hypothesis is simple: just multiply X by the transpose of theta.


h(θ) = X * θ'

You can work this matrix product out by hand to see how it computes the entire hypothesis vector in one go.
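
For example, here is a minimal sketch (assuming X is built as above; the theta values are illustrative and happen to be the ones gradient descent will produce later):

    theta = np.array([-3.6303, 1.1664])  # illustrative values
    hypothesis = X.dot(theta.T)  # one hypothesis value per example, in one go
    # The first row of X is [1, 6.1101], so the first entry is
    # 1 * (-3.6303) + 6.1101 * 1.1664, which is about 3.4965
    print(hypothesis[0])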

Our vectorized cost function will look like this:



 
# Vectorized implementation of the cost function
def CalculateCostVectorized(input, output, theta):
    m = len(output)
    # Hypothesis for all examples in a single matrix product
    hypothesis = input.dot(theta.T)
    cost = np.sum(np.subtract(hypothesis, output) ** 2) / (2 * m)
    return cost
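
As a quick sanity check (a minimal sketch, assuming x, y, and m are loaded inside LinearRegression as shown earlier), both implementations should print the same cost for the same inputs:

    # Optional: the loop and vectorized versions should agree
    theta = np.array([0.00, 0.00])
    X = np.concatenate((np.ones((m, 1), dtype=int), x), axis=1)
    print(CalculateCost(x, y, theta))            # loop version
    print(CalculateCostVectorized(X, y, theta))  # vectorized version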

Let us modify the main function to call this cost function



def LinearRegression():
    # Load CSV File in Variables
    data = pd.read_csv('franchisedata.txt',header = None)
    
    # Transfer data in x & y variable
    x = np.array(data.iloc[:, :-1].values)
    y = np.array(data.iloc[:, -1].values)
    m = len(x)
    theta = np.array([0.00,0.00])
    
    # Plot Data
    plt.scatter(x, y,marker='s',facecolors='none',edgecolors='b')
    plt.xlabel('Population of City in 10,000s')
    plt.ylabel('Profit in $10,000s')
    plt.show()
    
    # Calculate Cost Using Theta as 0 defined above
    X = np.concatenate((np.ones((m,1), dtype=int), x), axis=1)
    cost = CalculateCostVectorized(X,y,theta)
    


4. Run Gradient Descent (GD)

To come up with the values of theta0 and theta1 that give the minimum cost, we need to run the gradient descent algorithm for a number of iterations. On each iteration, both theta values are updated simultaneously using the following rule (alpha is the learning rate):

theta0 := theta0 - alpha * (1/m) * Σ (h(x(i)) - y(i))
theta1 := theta1 - alpha * (1/m) * Σ (h(x(i)) - y(i)) * x(i)

We need to start with some initial values of theta, alpha, and the iteration count. Let us use these:


    theta = np.array([0.00,0.00])
    alpha = 0.01
    iterations = 1500

Our GD function will look like this:



 
# Gradient descent function
def runGradientDescent(input, output, theta, alpha, iterations):
    m = len(output)
    costHistory = []  # cost per iteration, useful for checking convergence

    for i in range(iterations):
        # Error of the current hypothesis on all examples
        error = np.subtract(input.dot(theta.T), output)
        # Compute both updates from the same error so that theta0
        # and theta1 are updated simultaneously
        t0 = theta[0] - alpha * (1 / m) * error.sum()
        t1 = theta[1] - alpha * (1 / m) * (error * input[:, 1]).sum()
        theta[0] = t0
        theta[1] = t1
        costHistory.append(CalculateCostVectorized(input, output, theta))
    return theta, costHistory

Running the above GD function for 1500 iterations, theta comes out to be



theta = [-3.6303, 1.1664]

5. Plot graph of Hypothesis function


Finally, we have our theta values. Let us calculate the hypothesis with these values and plot it on our existing graph to see how well it fits the data.


def LinearRegression():
    # Load CSV File in Variables
    data = pd.read_csv('franchisedata.txt',header = None)
    
    # Transfer data in x & y variable
    x = np.array(data.iloc[:, :-1].values)
    y = np.array(data.iloc[:, -1].values)
    m = len(x)
    theta = np.array([0.00,0.00])
    
    # Plot Data
    plt.scatter(x, y,marker='s',facecolors='none',edgecolors='b')
    plt.xlabel('Population of City in 10,000s')
    plt.ylabel('Profit in $10,000s')

    # Calculate Cost Using Theta as 0 defined above
    X = np.concatenate((np.ones((m,1), dtype=int), x), axis=1)
    cost = CalculateCostVectorized(X,y,theta)

    # Run GD
    alpha = 0.01
    iterations = 1500
    theta, costHistory = runGradientDescent(X, y, theta, alpha, iterations)

    # Calculate Hypothesis & Plot Graph
    hypothesis = X.dot(theta)
    plt.plot(x,hypothesis,'r')    
    plt.show()

The following graph will be displayed


It looks like our function converged within 1500 iterations. In practice you may need to run GD several times, adjusting alpha and the iteration count, to see whether it converges.
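
Since our runGradientDescent also returns the cost history, a short optional sketch like this can visualize convergence:

    # Optional: plot cost per iteration to confirm GD is converging
    plt.plot(range(iterations), costHistory, 'b')
    plt.xlabel('Iteration')
    plt.ylabel('Cost J(theta)')
    plt.show()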


6. Run predictions & see results


Let us add one simple function to predict profit for a given population. The prediction is just our hypothesis function evaluated with the learned values of theta.

 
# Prediction based on the learned theta values
def predictProfit(input, theta):
    prediction = theta[0] + theta[1] * input
    return prediction

Our final main function will look like this:



def LinearRegression():
    # Load CSV File in Variables
    data = pd.read_csv('franchisedata.txt',header = None)
    
    # Transfer data in x & y variable
    x = np.array(data.iloc[:, :-1].values)
    y = np.array(data.iloc[:, -1].values)
    m = len(x)
    theta = np.array([0.00,0.00])
    
    # Plot Data
    plt.scatter(x, y,marker='s',facecolors='none',edgecolors='b')
    plt.xlabel('Population of City in 10,000s')
    plt.ylabel('Profit in $10,000s')

    # Calculate Cost Using Theta as 0 defined above
    X = np.concatenate((np.ones((m,1), dtype=int), x), axis=1)
    cost = CalculateCostVectorized(X,y,theta)

    # Run GD
    alpha = 0.01
    iterations = 1500
    theta, costHistory = runGradientDescent(X, y, theta, alpha, iterations)

    # Calculate hypothesis & Plot Graph
    hypothesis = X.dot(theta)
    plt.plot(x,hypothesis,'r')    

    # Predictions (population in 10,000s, profit in $10,000s)
    population_one = 3.5
    profit_one = predictProfit(population_one, theta)
    print(profit_one * 10000)  # predicted profit in dollars
    plt.plot(population_one, profit_one, 'r', marker='s')

    population_two = 12
    profit_two = predictProfit(population_two, theta)
    plt.plot(population_two, profit_two, 'r', marker='s')
    print(profit_two * 10000)  # predicted profit in dollars
    plt.show()
    

The functions above, together with this final main function, make up the entire source code of the program.



In my next article, I will show how to implement linear regression with multiple input variables. This algorithm is also called multivariate linear regression.


Happy Reading...



