Machine Learning by Andrew Ng - Week 3
Classification and Representation
Classification
-
Use Cases
-
Email: Spam / Not Spam
-
Online Transactions: Fraudulent (Yes / No )?
-
Tumor: Malignant / Benign ?
-
-
Binary Classification
-
0 - Negative class
- conveys something is absent
-
1 - Positive class
- conveys something is present
-
-
Example:
-
Applying linear regression to a classification problem is not a good idea
-
To attempt classification, one method is to use linear regression and map all predictions greater than 0.5 as a 1 and all less than 0.5 as a 0.
-
This method doesn’t work well because classification is not actually a linear function.
-
-
Classification
-
Linear regression can produce values larger than 1 or smaller than 0
-
In a classification problem, the labels are either 1 or 0.
- Logistic Regression: 0 ≤ h(x) ≤ 1
-
-
Hypothesis Representation
-
The sigmoid function approaches 0 as its input goes to negative infinity and 1 as its input goes to positive infinity
- Sigmoid function == Logistic Function
-
Interpretation Of Hypothesis
- h ( x ) = estimated probability that y = 1 on input x
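- A minimal Octave sketch of this hypothesis (the values of theta and x below are purely illustrative):
% Sigmoid / logistic function: squashes any real value into (0, 1)
sigmoid = @(z) 1 ./ (1 + exp(-z));

theta = [-1; 0.8];            % illustrative parameter vector
x = [1; 2.5];                 % one example, with a leading 1 for the intercept term
h = sigmoid(theta' * x)       % ~0.73, read as "estimated 73% probability that y = 1"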
Decision Boundary
-
Predict y = 1
-
h ( x ) ≥ 0.5 : y = 1
-
theta^T x ≥ 0
-
-
Predict y = 0
- h ( x ) < 0.5 : y = 0
-
theta^T x < 0
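- A small Octave sketch of this prediction rule (the values of theta and X are only illustrative):
sigmoid = @(z) 1 ./ (1 + exp(-z));
theta = [-3; 1; 1];                  % illustrative parameters
X = [1 1 1; 1 4 2];                  % two examples, each with a leading intercept 1
% Predict y = 1 exactly when h(x) >= 0.5, i.e. when theta' * x >= 0
pred = sigmoid(X * theta) >= 0.5     % -> [0; 1]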
-
Decision boundary
-
The decision boundary is the line that separates the area where y = 0 and where y = 1.
-
It is created by our hypothesis function.
-
This is a property of the hypothesis, not of the data
-
Non Linear Decision Boundary
-
Decision Boundary doesn’t need to be linear
-
Higher-order polynomial features can also result in a complex, non-linear decision boundary
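- For example, with features [1, x1, x2, x1^2, x2^2] and theta = [-1; 0; 0; 1; 1] (the circle example from the lecture; the numbers are illustrative), the model predicts y = 1 exactly when x1^2 + x2^2 ≥ 1, i.e. outside the unit circle:
sigmoid = @(z) 1 ./ (1 + exp(-z));
theta = [-1; 0; 0; 1; 1];                  % gives the decision boundary x1^2 + x2^2 = 1
x1 = 0.3;  x2 = 0.2;                       % a point inside the unit circle
features = [1; x1; x2; x1^2; x2^2];        % polynomial feature mapping
pred = sigmoid(theta' * features) >= 0.5   % -> 0, since x1^2 + x2^2 < 1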
-
-
Logistic Regression Model
Cost Function
-
Concept
-
We cannot use the same cost function that we use for linear regression because the Logistic Function will cause the output to be wavy, causing many local optima.
-
In other words, it will not be a convex function.
-
Case 1
-
When y = 1, we get the following plot for J (theta) vs h ( x )
-
If our correct answer ‘y’ is 1, then the cost function will be 0 if our hypothesis function outputs 1.
-
If our hypothesis approaches 0, then the cost function will approach infinity.
-
-
-
Case 2
-
When y = 0, we get the following plot for J (theta) vs h ( x )
-
If our correct answer ‘y’ is 0, then the cost function will be 0 if our hypothesis function also outputs 0.
-
If our hypothesis approaches 1, then the cost function will approach infinity.
-
-
-
Note that writing the cost function in this way guarantees that J(θ) is convex for logistic regression.
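- A small Octave sketch of the per-example cost described above (a helper saved e.g. as example_cost.m; the name is illustrative, h is the hypothesis output and y the label):
function c = example_cost(h, y)
  % Cost of a single prediction h in (0, 1) against a label y in {0, 1}
  if y == 1
    c = -log(h);       % -> 0 as h -> 1, -> infinity as h -> 0
  else
    c = -log(1 - h);   % -> 0 as h -> 0, -> infinity as h -> 1
  end
end
% example_cost(0.99, 1)  ~ 0.01   (confident and correct)
% example_cost(0.01, 1)  ~ 4.6    (confident and wrong: large cost)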
Simplified Cost Function and Gradient Descent
-
Modified Cost Function ( Simple )
- Compress our cost function’s two conditional cases into one case:
- We can fully write out our entire cost function as follows:
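- A vectorised Octave sketch of this cost (the tiny X, y, theta below are only illustrative; X carries a leading column of ones):
X = [1 1; 1 2; 1 3];   y = [0; 0; 1];   theta = [-4; 1.5];   % illustrative data
sigmoid = @(z) 1 ./ (1 + exp(-z));
m = length(y);                           % number of training examples
h = sigmoid(X * theta);                  % hypothesis for every example at once
J = (1 / m) * (-y' * log(h) - (1 - y)' * log(1 - h))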
-
Gradient Descent
-
Notice that this algorithm is identical to the one we used in linear regression.
-
We still have to simultaneously update all values in theta.
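- A vectorised Octave sketch of one update step (reusing X, y, theta, sigmoid and m from the cost sketch above; alpha is the learning rate):
alpha = 0.1;                                      % learning rate (illustrative)
grad  = (1 / m) * X' * (sigmoid(X * theta) - y);  % gradient of J(theta)
theta = theta - alpha * grad;                     % updates every theta_j simultaneously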
-
Advanced Optimisation
-
Optimisation Algorithm
- Minimising the cost function as efficiently as possible
-
Advance Optimisation Algorithms
-
Algorithms
-
Gradient Descent
-
Conjugate Gradient
-
BFGS
-
L-BFGS
-
-
Advantages
-
No need to manually pick alpha
-
Often faster than gradient descent
-
-
Disadvantage
- More complex
-
-
Example 1
-
Implementing unconstrained function minimisation
- fminunc() in Octave
-
-
Example 2
-
Octave / Matlab Snippets
- We can write a single function that returns both the cost J(theta) and its gradient
function [jVal, gradient] = costFunction(theta)
jVal = [...code to compute J(theta)...];
gradient = [...code to compute derivative of J(theta)...];
end
- Then we can use Octave's "fminunc()" optimization algorithm along with the "optimset()" function that creates an object containing the options we want to send to "fminunc()".
- We give to the function "fminunc()" our cost function, our initial vector of theta values, and the "options" object that we created beforehand.
options = optimset('GradObj', 'on', 'MaxIter', 100);
initialTheta = zeros(2,1);
[optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options);
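- As a sketch, costFunction could be filled in for logistic regression like this; here X and y are also passed in as arguments (an assumption, not the course's exact signature), so the call is wrapped in an anonymous function:
function [jVal, gradient] = costFunction(theta, X, y)
  % Vectorised logistic-regression cost and gradient
  sigmoid = @(z) 1 ./ (1 + exp(-z));
  m = length(y);
  h = sigmoid(X * theta);
  jVal = (1 / m) * (-y' * log(h) - (1 - y)' * log(1 - h));
  gradient = (1 / m) * X' * (h - y);
end

% Usage: wrap it so fminunc sees a function of theta only
% [optTheta, functionVal, exitFlag] = fminunc(@(t) costFunction(t, X, y), initialTheta, options);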
Multi-class Classification
Multi-class Classification: One-vs-all
-
Instead of y = {0,1} we will expand our definition so that y = {0,1…n}.
-
Since y = {0,1…n}, we divide our problem into n+1 (+1 because the index starts at 0) binary classification problems; in each one, we predict the probability that ‘y’ is a member of one of our classes
-
Use Cases
-
Email sorting / tagging : Work, Family, Friends, Hobby
-
Medical diagnosis: Not ill, Cold, Flu
-
Weather: Sunny, Cloudy, Rain, Snow
-
-
Difference in data visualisation in binary classification and multi-class classification
-
One vs All Algorithm
-
It treats one class as the positive class and lumps all the other classes together as the negative class, then runs binary logistic regression between them.
-
It then changes which class is the active (positive) class and repeats until every class has been the active class once.
-
-
Using the algorithm
- On a new input, select the class i whose hypothesis h_i(x) is the largest as the prediction
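- A sketch of that prediction step in Octave (all_theta holds one trained parameter row per class; all of the numbers here are illustrative):
sigmoid = @(z) 1 ./ (1 + exp(-z));
all_theta = [ 1 -2  0;                  % classifier for class 1
             -1  2 -1;                  % classifier for class 2
             -2  0  2];                 % classifier for class 3
X = [1 0.5 2.0; 1 3.0 0.2];             % two examples, with intercept column
probs = sigmoid(X * all_theta');        % h_i(x) for every example and class
[~, predictions] = max(probs, [], 2)    % -> [3; 2]: the class with the highest h_i(x)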
Solving the Problem of Overfitting
The Problem of Overfitting
-
If we have too many features, the learned hypothesis may fit the training set very well ( the cost function is approximately equal to 0 ) but fail to generalise to new examples ( e.g. fail to predict prices on new examples )
-
Underfit or High bias
-
Underfitting, or high bias, is when the form of our hypothesis function h maps poorly to the trend of the data.
-
It is usually caused by a function that is too simple or uses too few features.
-
-
Overfit or High Variance
-
Overfitting, or high variance, is caused by a hypothesis function that fits the available data but does not generalize well to predict new data.
-
It is usually caused by a complicated function that creates a lot of unnecessary curves and angles unrelated to the data.
-
-
Overfitting in Linear Regression
-
Overfitting in Logistic Regression
-
Causes of Overfitting
- Too many features & Small dataset
-
Solution to Overfitting
-
Reduce number of features
-
Manually select which features to keep
-
Model selection algorithm
-
-
Regularisation
-
Keep all the features, but reduce magnitude / values of parameters
-
Works well when we have a lot of features, each of which contributes a bit to predicting y.
-
-
Cost Function
-
Intuition
-
If we have overfitting from our hypothesis function, we can reduce the weight that some of the terms in our function carry by increasing their cost.
-
Without actually getting rid of these features or changing the form of our hypothesis, we can instead modify our cost function:
-
If we shrink the parameters of those extra features to near zero, the hypothesis changes only slightly
-
The new hypothesis fits the data better because the extra terms stay small
-
-
Regularisation
-
The λ, or lambda, is the regularization parameter.
-
It determines how much the costs of our theta parameters are inflated.
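- As an Octave sketch, for regularised linear regression the penalty is added to the usual squared-error cost; theta(1) (i.e. theta_0) is excluded, and the tiny X, y, theta and lambda values are only illustrative:
X = [1 1; 1 2; 1 3];   y = [1.2; 1.9; 3.1];   theta = [0.5; 1.0];   % illustrative data
lambda = 1;                                   % regularisation parameter
m = length(y);
J = (1 / (2 * m)) * sum((X * theta - y) .^ 2) ...
    + (lambda / (2 * m)) * sum(theta(2:end) .^ 2)   % penalise theta_1..theta_n only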
-
-
Example
-
Regularisation Parameter
- If lambda is chosen to be too large, it may smooth out the function too much and cause underfitting.
Regularised Linear Regression
-
Regularised Linear Regression
-
Regularised Gradient Descent
-
We will modify our gradient descent function to separate out theta_0 from the rest of the parameters because we do not want to penalise theta_0
-
Intuitively you can see it as reducing the value of theta_j by some amount on every update.
-
Notice that the second term is now exactly the same as it was before.
-
theta_0 is not regularised
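- A sketch of one such update in Octave (reusing X, y, theta, lambda and m from the sketch above; alpha is the learning rate):
alpha = 0.01;
grad = (1 / m) * X' * (X * theta - y);                      % unregularised gradient
grad(2:end) = grad(2:end) + (lambda / m) * theta(2:end);    % add the penalty, skipping theta_0
theta = theta - alpha * grad;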
-
-
Regularised Normal Equation
- To add in regularisation, the equation is the same as our original, except that we add another term inside the parentheses:
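- A sketch in Octave, where L is the (n+1)x(n+1) identity matrix with its top-left entry set to 0 so that theta_0 is not penalised (X, y and lambda as in the earlier sketches):
n = size(X, 2) - 1;                            % number of features (X includes the intercept column)
L = eye(n + 1);   L(1, 1) = 0;                 % identity with a 0 in the top-left corner
theta = pinv(X' * X + lambda * L) * (X' * y)   % regularised normal equation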
-
Regularisation in Non-Invertibility
- Recall that if m < n, then X^T X is non-invertible. However, when we add the term λ⋅L, then X^T X + λ⋅L becomes invertible.
Regularised Logistic Regression
-
Regularised Logistic Regression
-
The image shows how the regularised function, displayed by the pink line, is less likely to overfit than the non-regularised function represented by the blue line:
-
We can regularise this equation by adding a term to the end
-
-
Regularised Gradient Descent
-
The equation appears identical to the regularised gradient descent update for linear regression
- However, the hypothesis is different in the two regressions (sigmoid vs. linear)
-
theta_0 is not regularised
-
When running gradient descent, we repeatedly update the following two equations (theta_0 and theta_j for j ≥ 1 are updated separately):
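- A sketch of the update in Octave; the only change from the linear-regression version is that h uses the sigmoid (X, y, theta, alpha, lambda and m are assumed to be defined as in the earlier logistic-regression sketches):
sigmoid = @(z) 1 ./ (1 + exp(-z));
h = sigmoid(X * theta);                                     % logistic hypothesis
grad = (1 / m) * X' * (h - y);
grad(2:end) = grad(2:end) + (lambda / m) * theta(2:end);    % leave theta_0 unpenalised
theta = theta - alpha * grad;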
-
-
Regularised Advanced Optimisation
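- To use fminunc with regularisation, one option is a cost function that adds the lambda terms to both the cost and the gradient; a sketch (the name costFunctionReg is illustrative, and X, y, lambda are assumed available):
function [jVal, gradient] = costFunctionReg(theta, X, y, lambda)
  % Regularised logistic-regression cost and gradient (theta(1), i.e. theta_0, is not penalised)
  sigmoid = @(z) 1 ./ (1 + exp(-z));
  m = length(y);
  h = sigmoid(X * theta);
  jVal = (1 / m) * (-y' * log(h) - (1 - y)' * log(1 - h)) ...
         + (lambda / (2 * m)) * sum(theta(2:end) .^ 2);
  gradient = (1 / m) * X' * (h - y);
  gradient(2:end) = gradient(2:end) + (lambda / m) * theta(2:end);
end

% Usage: wrap it so fminunc sees a function of theta only
% [optTheta] = fminunc(@(t) costFunctionReg(t, X, y, lambda), initialTheta, options);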