
ML Simplified - Part 9 (Logistic Regression Octave Example)

Sachin Tah


In my previous article, I discussed Logistic Regression and how to use classification to carry out predictions on a student dataset.


Logistic Regression Example Using Octave


I have picked up the dataset available in the course by Andrew Ng, so I would like to thank him again.


We have a dataset that contains student admission records. By using this dataset, we would like to predict whether a new student will get admission to the university or not.




Here is a snapshot of what our raw data looks like:

34.62365962451697,78.0246928153624,0
30.28671076822607,43.89499752400101,0
35.84740876993872,72.90219802708364,0
60.18259938620976,86.30855209546826,1
79.0327360507101,75.3443764369103,1
45.08327747668339,56.3163717815305,0
61.10666453684766,96.51142588489624,1
75.02474556738889,46.55401354116538,1

We have a comma-separated dataset with Grade#1 in the first column and Grade#2 in the second column. The third column contains only 0s and 1s depending upon whether the student got admission to the university or not. 1 stands for Yes and 0 stands for No.

Given a new student record with Grade#1 and Grade#2, we need to come up with a prediction for that student's admission to the university. We will be using Octave to implement this algorithm.


Algorithm Steps

  1. Load student data in variables

  2. Visualize the data

  3. Write Sigmoid & Hypothesis Function

  4. Write a Cost function

  5. Approach #1 - Run gradient descent for a number of iterations to come up with the values of theta

  6. Approach #2 - Run an advanced algorithm to find theta & cost

  7. Run predictions and see results

1. Load Student Data


We will use the same function from our last example


function [] = logistic()
   # Load Data
   data = load('../data/studentmarks.txt');
   totalColumns = size(data,2);
 
   X = data(:,1: totalColumns-1); # Load Student Marks in X    
   y = data(:, totalColumns); # 1 = Yes and 0 = No
   m = size(X,1); % Total Rows 
end

2. Visualize Data


The data visualization part is a bit different from the linear regression example. Let me first show you the code, and I will explain it after that.



function plotDataGraph(X,y)
   figure; hold on;
   # Find all indices where y==1 and y==0
   positiveIndices = find(y==1); # Array of indices having 1
   negativeIndices = find(y==0); # Array of indices having 0

   # Mark#1 and Mark#2 give the coordinates of each point on the graph
   plot(X(positiveIndices,1), X(positiveIndices,2), 'k*', 'MarkerSize', 5);
   plot(X(negativeIndices,1), X(negativeIndices,2), 'ko', 'MarkerFaceColor', 'b', 'MarkerSize', 5);
   xlabel('Score#1')
   ylabel('Score#2')
end
   
 function [] = logistic()
   # Load Data
   data = load('../data/studentmarks.txt');
   totalColumns = size(data,2);
 
   X = data(:,1: totalColumns-1); # Load Student Marks in X    
   y = data(:, totalColumns); # 1 = Yes and 0 = No
   m = size(X,1); % Total Rows 
   
   #Plot Graph
   plotDataGraph(X,y);
end

To plot the graph, we first find the positive and negative indices from the output variable y, and then use these indices to plot the data points representing 1 and 0.


After running this code, the following graph will be displayed



Now we need to find a hypothesis function that will draw a decision boundary separating the 1s from the 0s.


3. Sigmoid & Hypothesis


The next task is to write the hypothesis function, which is built on the sigmoid function. Our hypothesis looks like

h(x) = g(theta' * x), where the sigmoid is g(z) = 1 / (1 + e^(-z))

and the cost function looks like

J(theta) = (1/m) * sum( -y * log(h(x)) - (1 - y) * log(1 - h(x)) )

Let us try to implement both of these functions using Octave



function g = sigmoid(z)
  g = zeros(size(z));
  g = 1./(1 + e.^-z);
end
  
# Our Hypothesis will be sigmoid(X * Theta')
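
As a quick sanity check (my own addition, not from the original post), the sigmoid should return 0.5 at z = 0, approach 0 and 1 at the extremes, and work element-wise on vectors. Assuming the sigmoid function above is saved where Octave can find it:

# Quick sanity check for the sigmoid function (illustrative only)
sigmoid(0)            # expected: 0.5
sigmoid([-10 0 10])   # expected: approximately [0.0000 0.5000 1.0000]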

4. Cost Function


Now our cost function will look like



function [J] = calculateCost(theta, input, output)
    m = length(output);
    J = 0;
    # Calculate z for sigmoid
    z = input * theta';
    J = (-output' * log(sigmoid(z)) - (1-output') * log(1-sigmoid(z)))/m;
end
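
Before running any optimization, here is a quick check (my own addition, not from the original post). With theta initialized to all zeros, the hypothesis is 0.5 for every example, so the cost should come out to about -log(0.5) ≈ 0.693 regardless of the data. This assumes X already has the leading column of ones added, as done in the main function below:

# Sanity check: cost at theta = zeros should be roughly 0.693 (i.e. -log(0.5))
theta = zeros(1, size(X,2));
J = calculateCost(theta, X, y)   # expected: approximately 0.693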

5. Gradient Descent#1


Once we are done with writing the cost function, it's time to write the gradient descent function which will use the above two functions to take iterative steps in calculating the optimum value of theta.




function [theta, J_history] = runGradientDescent(input, output, theta, alpha, iterations)
    m = length(output);
    J_history = zeros(iterations, 1);
    for i = 1:iterations
         z = input * theta';
         theta = theta - (sigmoid(z) - output)' * input * (alpha/m);
         J_history(i) = calculateCost(theta, input, output);
    end
end

Our main function will look like


function [] = logistic()
    # Load Data
    data = load('../data/studentmarks.txt');
    totalColumns = size(data,2);

    X = data(:,1: totalColumns-1); # Load Student Marks in X
    y = data(:, totalColumns); # 1 or 0 for admission Yes or No respectively
    m = size(X,1); % Total Rows

    # Plot Graph
    plotDataGraph(X,y);

    % Add Column of 1s For Vectorized Multiplication
    X = [ones(m, 1) X];
    % Theta Array = 1 x Number Of Features
    totalFeatures = size(X,2);
    theta = zeros(1,totalFeatures);

    J = calculateCost(theta, X, y)
    alpha = 0.003;
    iterations = 200000;
    # Run Gradient Descent
    tic();
    [theta, J_history] = runGradientDescent(X, y, theta, alpha, iterations);
    elapsed_time = toc()
end

I arrived at these values for alpha and iterations after running a lot of trials and plotting the iterations-versus-cost graph. Somehow I managed to get the optimum value for θ. We are running gradient descent 200,000 times in order to arrive at θ. I have used the tic() and toc() functions to record the elapsed time for executing gradient descent, and it seems to take almost 30 seconds to converge. Imagine huge datasets with an even higher number of iterations!
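
Here is a minimal sketch (my own addition) of the iterations-versus-cost plot mentioned above, using the J_history returned by runGradientDescent. If alpha is chosen well, the curve should decrease steadily and flatten out:

# Plot cost against iteration number to check convergence (illustrative sketch)
figure;
plot(1:iterations, J_history);
xlabel('Iteration');
ylabel('Cost J');
title('Gradient descent convergence');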


6. Gradient Descent#2


Instead of breaking your head trying out multiple values for alpha and the number of iterations, it is better to use a ready-made function that does the work for you. Octave has one such function called fminunc, which takes the cost function as input and returns the optimized theta and cost values.


In order to use fminunc, we need to modify our cost function a little bit. Here is the modified version of the cost function:



function [J,G] = calculateCost(theta, input, output)
    m = length(output);
    J = 0;
    # Calculate z for sigmoid
    z = input * theta';
    J = (-output' * log(sigmoid(z)) - (1-output') * log(1-sigmoid(z)))/m;
    # Calculate gradient
    G = ((sigmoid(z) - output)' * input)/m;
end

Notice that we have not created a separate function for gradient descent; in fact, the gradient is calculated without any iterations over θ and without alpha. That part will be taken care of by the fminunc function. Let us call this cost function from our main function using the following two lines of code:



	 options = optimset('GradObj', 'on', 'MaxIter', 400);
	 [theta, cost] = fminunc(@(t)(calculateCost(t, X, y)), theta, options);

The first line sets the options for running the fminunc function: GradObj tells it that our cost function also returns the gradient, and MaxIter is set to 400. The second line passes the calculateCost function to fminunc, which in turn returns the optimized values of theta and the cost calculated using those values. These two lines of code take milliseconds to come up with optimized values of theta. Internally, the function uses more advanced optimization algorithms such as conjugate gradient, BFGS, and L-BFGS. You can do more research on the web to learn how these algorithms work internally.
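
Because GradObj is on, fminunc trusts the gradient G that calculateCost returns, so it is worth verifying it. Here is a small numerical gradient check using finite differences (my own addition, not part of the original post), assuming X, y, and theta are set up as in the main function:

# Numerical gradient check (illustrative sketch): compare the analytic gradient G
# from calculateCost with a finite-difference approximation.
epsilon = 1e-4;
numGrad = zeros(size(theta));
for j = 1:numel(theta)
    thetaPlus = theta;   thetaPlus(j)  = thetaPlus(j)  + epsilon;
    thetaMinus = theta;  thetaMinus(j) = thetaMinus(j) - epsilon;
    numGrad(j) = (calculateCost(thetaPlus, X, y) - calculateCost(thetaMinus, X, y)) / (2*epsilon);
end
[J, G] = calculateCost(theta, X, y);
maxDifference = max(abs(G - numGrad))   # should be very close to zero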


7. Predictions


Now that we have our optimized theta values, it's time to make some predictions on sample input data. Remember that our hypothesis function is nothing but the probability of the output being 1.




        inputExample = [1 45 85];
        probability = sigmoid(inputExample * theta');
        

Running the above code will give you an output of about 0.70, which means there is roughly a 70% chance that this student will be admitted to the university.


Let us try to run our predictions on the entire set of input examples and see how close we actually get to the original labels.



function prediction = predict(theta, X)
    prediction = sigmoid(X * theta') >=0.5;
end

For all the input examples in X where the predicted probability is at least 50%, this function will return 1, and 0 for the others.



prediction = predict(theta, X);

Now let us compare these predictions with the actual y values and find the accuracy (in %) we have achieved. This can be done using the following code:



 Accuracy = mean(double(prediction == y)) * 100;
 fprintf('Accuracy: %f\n', Accuracy);

Now, if you notice, we have used two different methods for calculating the θ values: one using conventional gradient descent and the other using the advanced fminunc function in Octave. The two result in quite different values of θ.


However, when you calculate the accuracy of the algorithm for both approaches, it comes out to be the same, around 89%.


One last thing is to plot the decision boundary; perhaps surprisingly, the decision boundary in both cases comes out to be the same.



function plotDecisionBoundary(theta,X, y)
   plot_x = [min(X(:,2))-2,  max(X(:,2))+2];
   plot_y = (-1./theta(3)).*(theta(2).*plot_x + theta(1));
   plot(plot_x, plot_y)
   legend('Admitted', 'Not admitted', 'Decision Boundary')
   axis([30, 100, 30, 100])
end
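
A quick note on where the formula for plot_y comes from (my own addition): the decision boundary is the set of points where the hypothesis equals 0.5, i.e. where theta0 + theta1*x1 + theta2*x2 = 0. Solving for x2 gives x2 = -(theta0 + theta1*x1)/theta2, which is exactly what plot_y computes, with theta(1), theta(2), and theta(3) playing the roles of theta0, theta1, and theta2.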

Here is the source code for both the approaches



In my next article, I will try to cover some advanced topics like overfitting and how to use regularization to solve that problem.


Happy Reading...




