Classification and Representation

Classification

  • Use Cases

    • Email: Spam / Not Spam

    • Online Transactions: Fraudulent (Yes / No)?

    • Tumor: Malignant / Benign?

  • Binary Classification

    • 0 - Negative class

      • conveys something is absent

    • 1 - Positive class

      • conveys something is present

Classification.png

  • Example:

    • Applying linear regression to a classification problem is not a good idea

      • To attempt classification, one method is to use linear regression and map all predictions greater than 0.5 as a 1 and all less than 0.5 as a 0.

      • This method doesn’t work well because classification is not actually a linear function.

      Example.png

    • Classification

      • Linear regression can produce values larger than 1 or smaller than 0

      • In a classification problem, labels are either 1 or 0.

        • Logistic Regression: 0 ≤ h(x) ≤ 1

      Intro - Logistic Regression.png

Hypothesis Representation

  • The sigmoid function approaches 0 as its input goes to negative infinity and 1 as it goes to positive infinity

    • Sigmoid function == Logistic Function

    Representation.png

  • Interpretation Of Hypothesis

    • h(x) = estimated probability that y = 1 on input x, i.e. h(x) = P(y = 1 | x; theta)

    Interpretation of Hypothesis.png
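  • A minimal Octave sketch of the sigmoid hypothesis (the function name and example values here are illustrative, not from the lecture):

% Sigmoid (logistic) function: g(z) = 1 / (1 + e^(-z))
function g = sigmoid(z)
  g = 1 ./ (1 + exp(-z));   % element-wise, so z may be a scalar, vector, or matrix
end

% Hypothesis: h(x) = g(theta' * x), read as P(y = 1 | x; theta)
% theta = [-3; 1; 1]; x = [1; 2; 2];
% sigmoid(theta' * x)       % ans approx. 0.73, i.e. a 73% chance that y = 1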

Decision Boundary

  • Predict y = 1

    • h(x) ≥ 0.5 : predict y = 1

    • equivalently, theta^T x ≥ 0

  • Predict y = 0

    • h(x) < 0.5 : predict y = 0, equivalently theta^T x < 0 (a small Octave sketch of this rule follows the figure below)

Logistic Regression.png
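  • As a rough Octave sketch, the decision rule reduces to checking whether theta^T x is non-negative (predict() is an illustrative name; X is assumed to be an m-by-(n+1) design matrix):

% Predict y = 1 when h(x) >= 0.5, i.e. when X * theta >= 0
% X is assumed to carry a leading column of ones for the intercept term
function p = predict(theta, X)
  p = sigmoid(X * theta) >= 0.5;   % logical vector of 0/1 predictions
end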

  • Decision boundary

    • The decision boundary is the line that separates the area where y = 0 and where y = 1.

    • It is created by our hypothesis function.

    • This is the property of the hypothesis and not of the data

    Decision Boundary.png

    • Non Linear Decision Boundary

      • Decision Boundary doesn’t need to be linear

      • Higher-order polynomial features can also result in a complex decision boundary

      Non-Linear DB.png

Logistic Regression Model

Cost Function

  • Concept

    Concept.png

  • We cannot use the same cost function that we use for linear regression because the Logistic Function will cause the output to be wavy, causing many local optima.

  • In other words, it will not be a convex function.

Cost Function.png

  • Case 1

    • When y = 1, we get the following plot for J(theta) vs h(x)

      • If our correct answer ‘y’ is 1, then the cost function will be 0 if our hypothesis function outputs 1.

      • If our hypothesis approaches 0, then the cost function will approach infinity.

      Case 1.png

  • Case 2

    • When y = 0, we get the following plot for J(theta) vs h(x)

      • If our correct answer ‘y’ is 0, then the cost function will be 0 if our hypothesis function also outputs 0.

      • If our hypothesis approaches 1, then the cost function will approach infinity.

      Case 2.png

  • Note that writing the cost function in this way guarantees that J(θ) is convex for logistic regression.
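  • For reference, the two cases above define the per-example cost (the exact curves are shown in the figures):

    • Cost(h(x), y) = -log(h(x))       if y = 1

    • Cost(h(x), y) = -log(1 - h(x))   if y = 0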

Simplified Cost Function and Gradient Descent

  • Modified Cost Function ( Simple )

    • Compress our cost function’s two conditional cases into one case:

    Simple CF.png

    • We can fully write out our entire cost function as follows:

    Overview.png
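    • A hedged Octave sketch of the vectorised cost (logisticCost is an illustrative name; sigmoid as sketched earlier):

% J(theta) = (1/m) * (-y' * log(h) - (1 - y)' * log(1 - h)), with h = sigmoid(X * theta)
function J = logisticCost(theta, X, y)
  m = length(y);
  h = sigmoid(X * theta);
  J = (1 / m) * (-y' * log(h) - (1 - y)' * log(1 - h));
end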

  • Gradient Descent

    • Notice that this algorithm is identical to the one we used in linear regression.

    • We still have to simultaneously update all values in theta.

    GD.png GD-1.png
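    • A minimal sketch of the update in Octave, assuming alpha, num_iters, X, and y are already defined (the vectorised form updates every theta_j simultaneously):

% theta_j := theta_j - (alpha/m) * sum((h(x_i) - y_i) * x_ij), for all j at once
for iter = 1:num_iters
  theta = theta - (alpha / m) * (X' * (sigmoid(X * theta) - y));
end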

Advanced Optimisation

  • Optimisation Algorithm

    • Minimising the cost function as efficiently as possible

    Optimisation Algorithm.png

  • Advanced Optimisation Algorithms

    • Algorithms

      • Gradient Descent

      • Conjugate Gradient

      • BFGS

      • L-BFGS

    • Advantages

      • No need to manually pick alpha

      • Often faster than gradient descent

    • Disadvantage

      • More complex

    Advanced OA.png

  • Example 1

    • Implementing unconstrained function minimisation

      • fminunc() in Octave

    AOA Example.png

  • Example 2

    AOA Example 1.png

  • Octave / Matlab Snippets

    • We can write a single function that returns both the cost and the gradient:
function [jVal, gradient] = costFunction(theta)
  jVal = [...code to compute J(theta)...];
  gradient = [...code to compute derivative of J(theta)...];
end
    • Then we can use Octave's "fminunc()" optimisation algorithm along with the "optimset()" function that creates an object containing the options we want to send to "fminunc()".

    • We give the function "fminunc()" our cost function, our initial vector of theta values, and the "options" object that we created beforehand.
options = optimset('GradObj', 'on', 'MaxIter', 100);
initialTheta = zeros(2,1);
[optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options);
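    • As one way to fill in the placeholders above (a sketch under assumed names, not the course's official solution; sigmoid as before), the cost function for logistic regression could take the data explicitly:

function [jVal, gradient] = costFunction(theta, X, y)
  m = length(y);
  h = sigmoid(X * theta);
  jVal = (1 / m) * (-y' * log(h) - (1 - y)' * log(1 - h));
  gradient = (1 / m) * (X' * (h - y));
end

% Wrap it so fminunc() sees a function of theta alone:
% [optTheta, functionVal, exitFlag] = fminunc(@(t) costFunction(t, X, y), initialTheta, options);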

Multi-class Classification

Multi-class Classification: One-vs-all

  • Instead of y = {0,1} we will expand our definition so that y = {0,1…n}.

  • Since y = {0,1…n}, we divide our problem into n+1 (+1 because the index starts at 0) binary classification problems; in each one, we predict the probability that ‘y’ is a member of one of our classes

  • Use Cases

    • Email sorting / tagging : Work, Family, Friends, Hobby

    • Medical diagnosis: Not ill, Cold, Flu

    • Weather: Sunny, Cloudy, Rain, Snow

  • Difference in data visualisation in binary classification and multi-class classification

    DIfference of Data Visualisation.png

  • One vs All Algorithm

    • It picks one class as the positive class, groups all remaining classes into a single negative class, and runs binary logistic regression on that split.

    • It then switches which class is the positive one and repeats until every class has had a turn as the positive class.

    OvsA Algorithm.png

  • Using the algorithm

    • Select, as the prediction, the class i for which the hypothesis h^(i)(x) is largest (see the sketch after the figure below)

    Testing.png
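    • A sketch of that selection step in Octave (all_theta is an assumed K-by-(n+1) matrix holding one trained theta per class, one row each):

probs = sigmoid(X * all_theta');            % m-by-K matrix: column k holds P(y = k | x)
[max_p, predictions] = max(probs, [], 2);   % index of the largest probability per row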

Solving the Problem of Overfitting

The Problem of Overfitting

  • If we have too many features, the learned hypothesis may fit the training set very well (the cost function is close to 0) but fail to generalise to new examples (e.g., fail to predict prices on new examples)

  • Underfit or High bias

    • Underfitting, or high bias, is when the form of our hypothesis function h maps poorly to the trend of the data.

    • It is usually caused by a function that is too simple or uses too few features.

  • Overfit or High Variance

    • Overfitting, or high variance, is caused by a hypothesis function that fits the available data but does not generalize well to predict new data.

    • It is usually caused by a complicated function that creates a lot of unnecessary curves and angles unrelated to the data.

  • Overfitting in Linear Regression

    LIR Example.png

  • Overfitting in Logistic Regression

    LOR Example.png

  • Causes of Overfitting

    • Too many features & Small dataset

    Problem.png

  • Solution to Overfitting

    • Reduce number of features

      • Manually select which features to keep

      • Model selection algorithm

    • Regularisation

      • Keep all the features, but reduce magnitude / values of parameters

      • Works well when we have a lot of features, each of which contributes a bit to predicting y.

    Solution.png

Cost Function

  • Intuition

    • If we have overfitting from our hypothesis function, we can reduce the weight that some of the terms in our function carry by increasing their cost.

    • Without actually getting rid of these features or changing the form of our hypothesis, we can instead modify our cost function:

    • If we shrink the parameters of those terms to near zero, the hypothesis changes only slightly

    • The new hypothesis fits the data better because the extra terms are now small

    Intuition.png

  • Regularisation

    • The λ, or lambda, is the regularization parameter.

    • It determines how much the costs of our theta parameters are inflated.

    Regularisation.png
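    • A hedged Octave sketch of the regularised linear-regression cost (regCost is an illustrative name; theta_0, stored as theta(1), is conventionally not penalised):

% J(theta) = (1/(2m)) * ( sum((X*theta - y).^2) + lambda * sum(theta(2:end).^2) )
function J = regCost(theta, X, y, lambda)
  m = length(y);
  err = X * theta - y;
  J = (1 / (2 * m)) * (err' * err + lambda * sum(theta(2:end) .^ 2));
end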

  • Example

    Regularisation Example.png

  • Regularisation Parameter

    • If lambda is chosen to be too large, it may smooth out the function too much and cause underfitting.

    Regularisation Parameter.png

Regularised Linear Regression

  • Regularised Linear Regression

    Regularised LIR.png

  • Regularised Gradient Descent

    • We will modify our gradient descent function to separate out theta_0 from the rest of the parameters because we do not want to penalise theta_0

    • Intuitively you can see it as reducing the value of theta_j by some amount on every update.

    • Notice that the second term is now exactly the same as it was before.

    • theta_0 is not regularised

    Regularised GD.png
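    • One regularised update step as an Octave sketch (assumes alpha, lambda, X, y, and theta are in scope):

m = length(y);
grad = (1 / m) * (X' * (X * theta - y));   % unregularised gradient
reg = (lambda / m) * theta;
reg(1) = 0;                                % theta_0 (stored in theta(1)) is not penalised
theta = theta - alpha * (grad + reg);
% Equivalent per-parameter form for j >= 1:
% theta_j := theta_j * (1 - alpha * lambda / m) - alpha * (1/m) * sum((h(x_i) - y_i) * x_ij)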

  • Regularised Normal Equation

    • To add in regularisation, the equation is the same as our original, except that we add another term inside the parentheses:

    Regularised NE.png
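    • The closed form as an Octave sketch (n is the number of features; X is assumed to include the bias column):

% theta = (X'X + lambda * L)^(-1) * X'y, where L is the (n+1)x(n+1) identity
% matrix with its top-left entry zeroed so that theta_0 is not penalised
L = eye(n + 1);
L(1, 1) = 0;
theta = pinv(X' * X + lambda * L) * (X' * y);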

  • Regularisation in Non-Invertibility

    • Recall that if m < n, then X^T X is non-invertible. However, when we add the term λ⋅L, then X^T X + λ⋅L becomes invertible.

    Regularised-Non Invertibility.png

Regularised Logistic Regression

  • Regularised Logistic Regression

    • The image shows how the regularised function, displayed by the pink line, is less likely to overfit than the non-regularised function represented by the blue line:

    • We can regularise this equation by adding a term to the end

    Regularised LOR.png

  • Regularised Gradient Descent

    • The equation looks identical to regularised gradient descent for linear regression

      • but the hypothesis h(x) is different in the two regressions

    • theta_0 is not regularised

    • When computing, we repeatedly update the two following equations:

    Regularised LOR GD.png
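    • A combined sketch in Octave of the regularised logistic cost and gradient (regLogisticCost is an illustrative helper; the same [J, grad] shape also suits fminunc() in the advanced-optimisation setting below):

function [J, grad] = regLogisticCost(theta, X, y, lambda)
  m = length(y);
  h = sigmoid(X * theta);
  J = (1 / m) * (-y' * log(h) - (1 - y)' * log(1 - h)) ...
      + (lambda / (2 * m)) * sum(theta(2:end) .^ 2);
  grad = (1 / m) * (X' * (h - y)) + (lambda / m) * [0; theta(2:end)];
end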

  • Regularised Advanced Optimisation

    Regularised AO.png

Lecture Presentation