Machine Learning By Andrew Ng - Week 3
Classification and Representation
Classification
- Use Cases
    - Email: Spam / Not Spam
    - Online Transactions: Fraudulent (Yes / No)?
    - Tumor: Malignant / Benign?
- Binary Classification
    - 0 - Negative class - conveys that something is absent
    - 1 - Positive class - conveys that something is present
- Example: applying linear regression to a classification problem is not a good idea
    - To attempt classification, one method is to use linear regression and map all predictions greater than 0.5 to 1 and all predictions less than 0.5 to 0.
    - This method does not work well because classification is not actually a linear function.
- Classification
    - Linear regression can produce values larger than 1 or smaller than 0.
    - In a classification problem, the labels are either 1 or 0.
    - Logistic Regression keeps the hypothesis in that range: 0 ≤ h(x) ≤ 1.
Hypothesis Representation
- Sigmoid function (also called the logistic function)
    - It approaches 0 as its input goes to negative infinity and approaches 1 as its input goes to positive infinity (see the Octave sketch below).
- Interpretation of the hypothesis
    - h(x) = estimated probability that y = 1 for input x
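A minimal Octave sketch of the sigmoid function (the file name is illustrative):

% sigmoid.m - the sigmoid (logistic) function maps any real input into (0, 1)
function g = sigmoid(z)
  g = 1 ./ (1 + exp(-z));   % element-wise, so z can be a scalar, vector or matrix
end

With a design matrix X whose first column is all ones and a parameter vector theta, the hypothesis for every training example is then sigmoid(X * theta), read as the estimated probability that y = 1.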
Decision Boundary
- Predict y = 1
    - h(x) ≥ 0.5, which happens exactly when theta^T x ≥ 0
- Predict y = 0
    - h(x) < 0.5, which happens exactly when theta^T x < 0

- Decision boundary
    - The decision boundary is the line that separates the region where y = 0 from the region where y = 1 (a small worked example follows this list).
    - It is created by our hypothesis function.
    - It is a property of the hypothesis (its parameters), not of the data set.
    - Non-linear decision boundaries
        - The decision boundary does not need to be linear.
        - Higher-order polynomial features can produce more complex decision boundaries.
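A small worked example (the parameter values here are illustrative, not from the lectures): with theta = [-3; 1; 1], the rule theta^T x ≥ 0 becomes x1 + x2 ≥ 3, so the decision boundary is the straight line x1 + x2 = 3.

theta = [-3; 1; 1];          % illustrative parameters
x = [1; 2; 2];               % one example: x0 = 1 (bias), x1 = 2, x2 = 2
p = sigmoid(theta' * x);     % theta' * x = 1, so p = sigmoid(1), roughly 0.73
prediction = (p >= 0.5);     % predict y = 1, since x1 + x2 = 4 lies on the y = 1 side of the line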
Logistic Regression Model
Cost Function
- Concept
    - We cannot use the same cost function that we use for linear regression, because the logistic function will cause the output to be wavy, causing many local optima.
    - In other words, it will not be a convex function.
- Case 1
    - When y = 1, we get the following plot for J(theta) vs h(x):
        - If our correct answer y is 1, then the cost function will be 0 if our hypothesis function outputs 1.
        - If our hypothesis approaches 0, then the cost function will approach infinity.
- Case 2
    - When y = 0, we get the following plot for J(theta) vs h(x):
        - If our correct answer y is 0, then the cost function will be 0 if our hypothesis function also outputs 0.
        - If our hypothesis approaches 1, then the cost function will approach infinity.
- Note that writing the cost function in this way guarantees that J(θ) is convex for logistic regression (the two cases are sketched in the snippet below).
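A sketch of the per-example cost in Octave, matching the two cases above (h is the scalar hypothesis output for a single example; the helper name is illustrative):

% Cost for one training example:
%   -log(h)      if y = 1  (0 when h = 1, grows without bound as h -> 0)
%   -log(1 - h)  if y = 0  (0 when h = 0, grows without bound as h -> 1)
function c = exampleCost(h, y)
  if y == 1
    c = -log(h);
  else
    c = -log(1 - h);
  end
end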
Simplified Cost Function and Gradient Descent
- Modified cost function (simplified)
    - We can compress the cost function's two conditional cases into one case, and then write out the entire cost function over the training set in a single expression (see the Octave sketch after this list).
- Gradient Descent
    - Notice that this algorithm is identical to the one we used in linear regression.
    - We still have to simultaneously update all values in theta.
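A minimal Octave sketch of the combined (unregularised) cost and one gradient descent step, assuming X is the m x (n+1) design matrix with a leading column of ones, y is an m x 1 vector of 0/1 labels, and alpha is the learning rate:

m = length(y);
h = sigmoid(X * theta);                                   % predictions for all examples
J = (1 / m) * (-y' * log(h) - (1 - y)' * log(1 - h));     % J(theta), both cases folded into one expression
grad = (1 / m) * X' * (h - y);                            % vector of partial derivatives dJ/dtheta_j

theta = theta - alpha * grad;                             % one simultaneous update of every theta_j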
Advanced Optimisation
- Optimisation algorithms - minimise the cost function as efficiently as possible
- Advanced optimisation algorithms
    - Algorithms
        - Gradient Descent
        - Conjugate Gradient
        - BFGS
        - L-BFGS
    - Advantages
        - No need to manually pick the learning rate alpha
        - Often faster than gradient descent
    - Disadvantage
        - More complex
- Example 1
    - Octave implements unconstrained function minimisation as fminunc().
- Example 2: Octave / MATLAB snippets
    - We can write a single function that returns both the cost J(theta) and its gradient:
function [jVal, gradient] = costFunction(theta)
  jVal = [...code to compute J(theta)...];
  gradient = [...code to compute derivative of J(theta)...];
end
- Then we can use Octave's fminunc() optimisation algorithm along with the optimset() function, which creates an object containing the options we want to send to fminunc().
- We give fminunc() our cost function, our initial vector of theta values, and the "options" object that we created beforehand:
options = optimset('GradObj', 'on', 'MaxIter', 100);
initialTheta = zeros(2,1);
[optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options);
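One way the placeholder body of costFunction might be filled in for unregularised logistic regression (a sketch, assuming the training data X and y are passed in explicitly rather than being available globally):

function [jVal, gradient] = costFunction(theta, X, y)
  m = length(y);
  h = sigmoid(X * theta);                                  % hypothesis for every example
  jVal = (1 / m) * (-y' * log(h) - (1 - y)' * log(1 - h)); % logistic regression cost J(theta)
  gradient = (1 / m) * X' * (h - y);                       % gradient of J(theta)
end

With this signature, the extra arguments are bound with an anonymous function when calling the optimiser, e.g. fminunc(@(t) costFunction(t, X, y), initialTheta, options).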
Multi-class Classification
Multi-class Classification: One-vs-all
- Instead of y = {0, 1} we expand our definition so that y = {0, 1, ..., n}.
- Since y = {0, 1, ..., n}, we divide our problem into n + 1 binary classification problems (+1 because the index starts at 0); in each one, we predict the probability that y is a member of that class.
- Use Cases
    - Email sorting / tagging: Work, Family, Friends, Hobby
    - Medical diagnosis: Not ill, Cold, Flu
    - Weather: Sunny, Cloudy, Rain, Snow
- Data visualisation differs between binary classification and multi-class classification.
- One-vs-all algorithm
    - Pick one class as the positive class, treat all the other classes together as the negative class, and fit a binary logistic regression classifier to that split.
    - Change the active (positive) class and repeat until every class has been the active class once.
- Using the trained classifiers: for a new input, pick the class i whose classifier outputs the highest hypothesis value (see the sketch after this list).
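A minimal sketch of the one-vs-all prediction step in Octave, assuming all_theta is a K x (n+1) matrix whose i-th row holds the parameters learned for class i (the variable names are illustrative):

probs = sigmoid(X * all_theta');       % m x K matrix: column i holds the i-th classifier's estimate of P(y = i | x)
[~, predictions] = max(probs, [], 2);  % for each example, pick the class whose classifier is most confident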
Solving the Problem of Overfitting
The Problem of Overfitting
- If we have too many features, the learned hypothesis may fit the training set very well (cost function close to 0) but fail to generalise to new examples (e.g. fail to predict prices on new examples).
- Underfitting, or high bias
    - Underfitting, or high bias, is when the form of our hypothesis function h maps poorly to the trend of the data.
    - It is usually caused by a function that is too simple or uses too few features.
- Overfitting, or high variance
    - Overfitting, or high variance, is caused by a hypothesis function that fits the available data but does not generalise well to predict new data.
    - It is usually caused by a complicated function that creates a lot of unnecessary curves and angles unrelated to the data.
- Overfitting can occur in both linear regression and logistic regression.
- Causes of overfitting: too many features and a small dataset.
- Solutions to overfitting
    - Reduce the number of features
        - Manually select which features to keep
        - Use a model selection algorithm
    - Regularisation
        - Keep all the features, but reduce the magnitude / values of the parameters
        - Works well when we have a lot of features, each of which contributes a bit to predicting y
Cost Function
- Intuition
    - If we have overfitting from our hypothesis function, we can reduce the weight that some of the terms in our function carry by increasing their cost.
    - Without actually getting rid of these features or changing the form of our hypothesis, we can instead modify our cost function to penalise the corresponding parameters.
    - If those parameters are pushed to near zero, the hypothesis changes only slightly.
    - The new hypothesis fits the data better because the penalised terms become very small.
- Regularisation
    - λ (lambda) is the regularisation parameter.
    - It determines how much the costs of our theta parameters are inflated (see the cost sketch after this list).
- Regularisation parameter
    - If lambda is chosen to be too large, it may smooth out the function too much and cause underfitting.
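A sketch of the regularised cost for linear regression in Octave, to make the role of lambda concrete (theta(1) corresponds to theta_0 and is excluded from the penalty):

m = length(y);
h = X * theta;                                          % linear regression hypothesis
penalty = (lambda / (2 * m)) * sum(theta(2:end) .^ 2);  % lambda scales the cost of large parameters
J = (1 / (2 * m)) * sum((h - y) .^ 2) + penalty;        % regularised cost function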
Regularised Linear Regression
- Regularised linear regression adds the regularisation term to the cost function and correspondingly modifies gradient descent and the normal equation.
- Regularised gradient descent
    - We modify our gradient descent function to separate out theta_0 from the rest of the parameters, because we do not want to penalise theta_0.
    - Intuitively, the regularisation term reduces the value of theta_j by some amount on every update.
    - The second term is exactly the same as it was before.
    - theta_0 is not regularised (see the update sketch after this list).
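A sketch of one regularised gradient descent update for linear regression in Octave (theta(1) is theta_0 and is left out of the shrinkage term):

m = length(y);
h = X * theta;
grad = (1 / m) * X' * (h - y);                              % unregularised gradient
grad(2:end) = grad(2:end) + (lambda / m) * theta(2:end);    % add the regularisation term, skipping theta_0
theta = theta - alpha * grad;                               % equivalent to shrinking each theta_j (j >= 1) by (1 - alpha*lambda/m) before the usual step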
- Regularised normal equation
    - To add regularisation, the equation is the same as our original normal equation, except that we add the term λ⋅L inside the parentheses, where L is the identity matrix with its top-left entry set to 0 (sketched below).
- Regularisation and non-invertibility
    - Recall that if m < n, then X^T X is non-invertible. However, when we add the term λ⋅L, X^T X + λ⋅L becomes invertible.
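A sketch of the regularised normal equation in Octave, assuming n is the number of features (so theta has n + 1 entries including theta_0):

L = eye(n + 1);
L(1, 1) = 0;                                  % do not regularise theta_0
theta = (X' * X + lambda * L) \ (X' * y);     % regularised normal equation; invertible even when m < n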
Regularised Logistic Regression
- Regularised logistic regression
    - In the lecture plot, the regularised hypothesis (pink line) is less likely to overfit than the non-regularised hypothesis (blue line).
    - We can regularise the logistic regression cost function by adding a regularisation term to the end.
- Regularised gradient descent
    - The update equations look identical to regularised gradient descent for linear regression, but the hypothesis h(x) is different in the two models.
    - theta_0 is not regularised.
    - We repeatedly and simultaneously apply the two update equations: one for theta_0 and one for theta_j with j ≥ 1.
- Regularised advanced optimisation
    - The same advanced optimisation routines (e.g. fminunc) can be used, provided the cost function and gradient include the regularisation term (see the sketch below).
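A minimal sketch of a regularised logistic regression cost function in the form the advanced optimisers expect (the function name is illustrative; theta(1) is theta_0 and is excluded from the penalty):

function [jVal, gradient] = costFunctionReg(theta, X, y, lambda)
  m = length(y);
  h = sigmoid(X * theta);
  penalty = (lambda / (2 * m)) * sum(theta(2:end) .^ 2);
  jVal = (1 / m) * (-y' * log(h) - (1 - y)' * log(1 - h)) + penalty;
  gradient = (1 / m) * X' * (h - y);
  gradient(2:end) = gradient(2:end) + (lambda / m) * theta(2:end);
end

As with the unregularised version, it can be handed to fminunc via an anonymous function such as @(t) costFunctionReg(t, X, y, lambda).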