
ML Simplified - Part 7 (Multivariate Python Example)

Sachin Tah

Here are the links to my previous articles on ML series


Part I

Part II

Part III

Part IV

Part V

Part VI


Let us now implement Multivariate LR using Python.



Multivariate Linear Regression Using Python

We will start by having a look at our dataset



Here is a snapshot of our data



2104,3,399900
1600,3,329900
2400,3,369000
1416,2,232000
3000,4,539900
1985,4,299900
1534,3,314900
1427,3,198999

We have a comma-separated dataset with sizes in Sq. Ft. in the first column and the number of bedrooms in the second. The third column contains the price of the house in $.
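As a quick reminder, the model we are fitting here is the standard multivariate linear regression hypothesis:

h(x) = theta0 + theta1 * x1 + theta2 * x2

where x1 is the size in Sq. Ft., x2 is the number of bedrooms, and the theta values are the parameters we will learn from the data.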


Here are the steps our algorithm will be using

Algorithm Steps

  1. Load data

  2. Visualize data

  3. Feature Normalization - New

  4. Write Cost Function

  5. Run Gradient Descent

  6. Plot Graphs

  7. Run Predictions


1. Load Data



import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Main Linear Regression Function 
def LinearRegression():
    # Load CSV File in Variables
    data = pd.read_csv('../data/housingprice.txt',header = None)
 
    # Transfer data in x & y variable
    x = np.array(data.iloc[:, :-1].values)
    y = np.array(data.iloc[:, -1].values)
    m = x.shape[0]
    y = y.reshape(m,1)


The code to load data remains largely the same, with a slight modification to make it more generic. This generic version loads all the columns except the last one into the feature matrix x and the last column into the output vector y, so it works for any number of input features.
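As a quick optional check (not part of the original code), you can print the shapes of the loaded arrays; with the dataset shown above, x should have two columns and y should be a column vector:

    # Optional sanity check after loading the data
    print(x.shape)   # (m, 2) -> m rows, 2 input features
    print(y.shape)   # (m, 1) -> m rows, 1 output value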


2. Visualize the data

Let us try to visualize the data. Since we now have two feature variables and one output variable, we can plot our data on a 3D scatter graph. If you have more input features, you may need to find different ways to visualize your data; however, the code below works only for two input features with one output variable.



def plotDataGraph(input, output):
    ax = plt.axes(projection='3d')
    ax.scatter3D(input[:,0], input[:,1], output, c=output, cmap='Greens')
    ax.set_xlabel('Size (Sq. Ft.)')
    ax.set_ylabel('Bedrooms')
    ax.set_zlabel('Price')
    plt.show()


# Main Linear Regression Function
def LinearRegression():
    # Load CSV File in Variables
    data = pd.read_csv('../data/housingprice.txt',header = None)

    # Transfer data in x & y variable
    x = np.array(data.iloc[:, :-1].values)
    y = np.array(data.iloc[:, -1].values)
    m = x.shape[0]
    y = y.reshape(m,1)

    plotDataGraph(x,y)

After running the above code, the following graph will be displayed, with X1 (size) on one axis, X2 (bedrooms) on the other, and the third axis representing the housing price.
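As mentioned above, with more than two input features a single 3D scatter plot is no longer enough. One simple alternative (a sketch, not part of the original code) is to plot each input feature against the price separately:

# Sketch: one 2D scatter plot per input feature vs. price
def plotEachFeature(input, output):
    for col in range(input.shape[1]):
        plt.scatter(input[:, col], output)
        plt.xlabel('Feature ' + str(col))
        plt.ylabel('Price')
        plt.show()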


3. Feature Normalization

The input features available to us are on very different scales; for example, the area ranges from roughly 600 to 3000 Sq. Ft., while the number of bedrooms ranges from 1 to 10.

As discussed in one of my earlier articles, we need to make sure that our features are on the same scale in order to run gradient descent efficiently. We therefore need to implement feature normalization to bring area and bedrooms onto a similar scale.

We will be using the mean normalization method to normalize our data, i.e. each feature value is replaced by (x - mu) / sigma, where mu is the feature's mean and sigma is its standard deviation. Here is the Python code to do the same:



# Normalize Input Features
def NormalizeInput(input, mu, sigma):
    normalized = np.divide((input - mu), sigma)
    return normalized
	
def LinearRegression():
    # Load CSV File in Variables
    data = pd.read_csv('../data/housingprice.txt',header = None)

    # Transfer data in x & y variable
    x = np.array(data.iloc[:, :-1].values)
    y = np.array(data.iloc[:, -1].values)
    m = x.shape[0]
    y = y.reshape(m,1)

    plotDataGraph(x,y)

    X, mu, sigma = NormaliseFeatures(x)
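Note that the main function above calls a NormaliseFeatures helper which is defined in the full source. Here is a minimal sketch of what it might look like, assuming it computes the per-column mean and standard deviation with NumPy and then reuses NormalizeInput (the exact numbers can differ slightly depending on whether the sample or population standard deviation is used):

# Sketch of the NormaliseFeatures helper (the full source may differ slightly)
def NormaliseFeatures(input):
    mu = np.mean(input, axis=0)
    sigma = np.std(input, axis=0)
    normalized = NormalizeInput(input, mu, sigma)
    return normalized, mu, sigma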

We use NumPy's built-in mean and standard deviation functions to perform feature normalization. The function takes any number of features and returns a normalized version of all of them. It also returns mu and sigma so that we can reuse them later; we will discuss this in the latter part of this article.


Below is a snapshot of what the normalized features look like



[[ 1.30009873e-01 -2.23675178e-01]
 [-5.04189852e-01 -2.23675178e-01]
 [ 5.02476378e-01 -2.23675178e-01]
 [-7.35723085e-01 -1.53776685e+00]
 [ 1.25747605e+00  1.09041649e+00]
 [-1.97317290e-02  1.09041649e+00]
 [-5.87239816e-01 -2.23675178e-01]
 [-7.21881424e-01 -2.23675178e-01]
 [-7.81023065e-01 -2.23675178e-01]

Now both of our input features are on a similar scale. We can go ahead with implementing the cost function.

4. Cost Function


The vectorized version of the cost function implemented in our last example remains unchanged. Here is the code



# Calculate Cost Using Vectorized Method
def CalculateCost(input, output, theta):
    m = len(output)
    hypothesis = input.dot(theta.T)
    step1 = np.square(np.subtract(hypothesis,output))
    step1 = np.sum(step1,0)
    cost = step1/(2*m)
    return cost

The cost is the sum of the squared differences between our hypothesis and the actual output values, divided by 2*m, i.e. cost = sum((hypothesis - y)^2) / (2*m).

Let us call this cost function from our main function


def LinearRegression():
    # Load CSV File in Variables
    data = pd.read_csv('../data/housingprice.txt',header = None)
 
    # Transfer data in x & y variable
    x = np.array(data.iloc[:, :-1].values)
    y = np.array(data.iloc[:, -1].values)
    m = x.shape[0]
    y = y.reshape(m,1)

    plotDataGraph(x,y)

    X,mu,sigma = NormaliseFeatures(x)

    # Add Column 1 For Vectorized Multiplication
    X = np.concatenate((np.ones((m,1), dtype=int), X), axis=1)
    totalFeatures = int(X.shape[1])

    # Theta Array = 1 Row X Number Of Features
    theta = np.zeros((1 , totalFeatures))    
    cost = CalculateCost(X,y,theta)

You should expect an output of 6.5592e+10.

5. Run Gradient Descent (GD)

To come up with the optimum values of theta, we need to run gradient descent using some initial values of theta, a learning rate alpha, and a number of iterations. Below is the code for our gradient descent, which is again the same as the vectorized version used for simple linear regression.
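For reference, each iteration of the loop below performs the standard vectorized gradient descent update: compute the error (hypothesis - y), multiply its transpose with the input matrix X, scale by alpha/m, and subtract the result from theta. In formula form, theta = theta - (alpha/m) * (error' . X), which is exactly what the code computes step by step with a couple of dot products.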



def runGradientDescent(input, output, theta, alpha, iterations):
    m = len(output)
    tempTheta = theta
    cost_history = np.zeros(iterations).reshape(iterations,1)

    for i in range(iterations):
        hypothesis = input.dot(tempTheta.T)
        hypothesis = np.subtract(hypothesis,output)
        newX = np.dot(hypothesis.T,input)
        newX = np.dot(newX,(alpha/m))
        tempTheta = np.subtract(tempTheta,newX)
        cost_history[i] = CalculateCost(input,output,tempTheta)
    return tempTheta,cost_history

Let us call this gradient descent function from our main function



def LinearRegression():
    # Load CSV File in Variables
    data = pd.read_csv('../data/housingprice.txt',header = None)
 
    # Transfer data in x & y variable
    x = np.array(data.iloc[:, :-1].values)
    y = np.array(data.iloc[:, -1].values)
    m = x.shape[0]
    y = y.reshape(m,1)

    plotDataGraph(x,y)

    X,mu,sigma = NormaliseFeatures(x)

    # Add Column 1 For Vectorized Multiplication
    X = np.concatenate((np.ones((m,1), dtype=int), X), axis=1)
    totalFeatures = int(X.shape[1])

    # Theta Array = 1 Row X Number Of Features
    theta = np.zeros((1 , totalFeatures))    
    cost = CalculateCost(X,y,theta)
 
    # Run Gradient Descent
    alpha = 0.01
    iterations = 400 
    theta,cost_history = runGradientDescent(X,y,theta,alpha,iterations)
    

The gradient descent function will return optimum theta values and cost history for 400 iterations.


We can use this cost history variable to plot a cost graph against the number of iterations to see if our cost is converging to a minimum.

6. Plot Cost Graph

Let us now plot the cost graph, with the number of iterations on the x-axis and the cost on the y-axis.



def plotCostGraph(iterations, cost_history):
    plt.plot(range(iterations), cost_history, 'r')
    plt.xlabel('Iterations')
    plt.ylabel('Cost')
    plt.show()
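Note that plotCostGraph is not called in the main function shown earlier; you would add a call right after gradient descent finishes, for example:

    # Inside LinearRegression(), after runGradientDescent
    plotCostGraph(iterations, cost_history)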

The code above will produce a graph like the one below.


As we can see, the cost goes down with the number of iterations and converges to a minimum value, with very little change towards the end. This means that our number of iterations is good enough and we have obtained the optimum theta values.



7. Run Predictions


def predictPrice(input, theta):
    inputCount = input.shape[0]
    prediction_input = np.concatenate((np.ones((inputCount,1), dtype=int), input), axis=1)
    predictions = prediction_input.dot(theta.T)
    return predictions
    

The prediction code is almost the same as our hypothesis code, with just a slight change to add an additional column of ones to the input data for Theta0. Now it's time to run predictions on our model.


If you remember our feature normalization function, we returned mu and sigma values, which correspond to the mean and standard deviation of the input data set respectively.


Now, before running predictions on a new set of data, we need to normalize this data as well. We therefore use the same values of mu and sigma so that the new data comes down to the same scale as the training data.


   
    # Predict Now
    prediction_inputs = np.array([[1268, 3], [1000, 2], [3000, 5], [2000, 8]])
    prediction_inputs_normalized = NormalizeInput(prediction_inputs, mu, sigma)
    prediction_housing_price = predictPrice(prediction_inputs_normalized, theta)

    print(prediction_housing_price)


You should expect the following housing prices



[[241204.43324609]
 [202624.3576335 ]
 [468992.28966449]
 [357531.52973923]]

Here is the entire source for the above implementation



In my next article, we will go through Logistic Regression.


Happy Reading...
