Machine Learning By Andrew Ng - Week 2
Multivariate Linear Regression
Multiple Features
- Linear regression with multiple variables is also known as “multivariate linear regression”.
- There can be ‘n’ features.
    - For the example of predicting the price of a house, the features can be the number of bedrooms, number of floors, age of the home, and size.
- Hypothesis - one parameter per feature (plus the intercept), each describing that feature's effect on the price of the house.
    - Taking x_0 as 1 for the convenience of notation.
    - This is called multivariate linear regression.
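Written out, the hypothesis for n features (the standard form from the lectures, using x_0 = 1):

    h_θ(x) = θ_0 x_0 + θ_1 x_1 + θ_2 x_2 + ... + θ_n x_n = θ^T x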
Gradient Descent for Multiple Variables
- Concept
- Comparison between gradient descent with one variable and gradient descent with multiple variables
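The multivariate update rule (repeat until convergence, updating all θ_j simultaneously):

    θ_j := θ_j - α * (1/m) * Σ_{i=1..m} ( h_θ(x^(i)) - y^(i) ) * x_j^(i)    for j = 0, 1, ..., n

With x_0^(i) = 1 this is exactly the single-variable rule from Week 1, just applied to every feature at once.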
Gradient Descent In Practice I - Feature Scaling
- Feature Scaling
    - Idea: make sure features are on a similar scale.
    - This is because θ will descend quickly on small ranges and slowly on large ranges, and so will oscillate inefficiently down to the optimum when the variables are very uneven.
    - If the features are not on a similar scale, the contours of the cost function will be very elliptical and gradient descent will take more iterations to converge, i.e. more time and compute.
    - If the features are on a similar scale, the contours will be closer to circles and gradient descent will take fewer iterations to converge.
    - Ideally each feature should roughly satisfy -1 ≤ x ≤ 1.
        - Ranges up to about -3 to +3 are still acceptable (rough upper limit).
        - Ranges down to about -1/3 to +1/3 are still acceptable (rough lower limit).
 
- Mean Normalisation
    - Replace x_i with x_i minus the average value of that feature in the training set.
    - Do not apply this to x_0 = 1.
    - This results in features with approximately zero mean.
    - x_i := (x_i - μ_i) / s_i, where μ_i is the average value of feature i in the training set and s_i is the range (max - min) of that feature (the standard deviation can also be used).
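A minimal Octave sketch of the same idea, assuming X is an m x n matrix of the raw features without the x_0 column of ones (variable names are only illustrative):

    mu = mean(X);               % 1 x n row vector of per-feature averages
    s  = max(X) - min(X);       % 1 x n row vector of per-feature ranges (std(X) also works)
    X_norm = (X - mu) ./ s;     % broadcasting subtracts mu and divides by s column-wise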
 
Gradient Descent In Practice II - Learning Rate
- “Debugging”: how to make sure gradient descent is working correctly
    - Plot a graph with the iteration number on the x-axis and the value of the cost function J(θ) on the y-axis; J(θ) should decrease after every iteration.
    - An automatic convergence test can be implemented: declare convergence if J(θ) decreases by less than ε in one iteration, where ε is some small value such as 10^-3.
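A minimal Octave sketch of this check, assuming X (with a leading column of ones), y, theta, alpha, num_iters and m = length(y) are already set up, and a computeCost(X, y, theta) helper as in the course programming exercises:

    J_history = zeros(num_iters, 1);
    for iter = 1:num_iters
        theta = theta - (alpha / m) * X' * (X * theta - y);   % vectorised update of all theta_j
        J_history(iter) = computeCost(X, y, theta);           % record J(theta) at this iteration
    end
    plot(1:num_iters, J_history);
    xlabel('Number of iterations'); ylabel('Cost J');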
 
 
- How to choose the learning rate α - it has been proven that if α is sufficiently small, then J(θ) will decrease on every iteration.
    - Summary
        - Try a range of values roughly 3x apart, e.g. 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, and pick the largest α that still makes J(θ) decrease on every iteration.
        - α too small: slow convergence.
        - α too large: J(θ) may not decrease on every iteration and may not even converge.
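Building on the debugging sketch above, one way to compare candidate learning rates (same assumed setup; a sketch only, not from the lectures):

    alphas = [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1];
    for k = 1:length(alphas)
        theta = zeros(size(X, 2), 1);
        J_history = zeros(num_iters, 1);
        for iter = 1:num_iters
            theta = theta - (alphas(k) / m) * X' * (X * theta - y);
            J_history(iter) = computeCost(X, y, theta);
        end
        plot(1:num_iters, J_history); hold on;    % overlay one J(theta) curve per alpha
    end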
  
 
Features and Polynomial Regression
- We can improve our features and the form of our hypothesis function in a couple of different ways.
- We can combine multiple features into one. For example, we can combine x_1 and x_2 into a new feature x_3 by taking x_1 * x_2.
- Polynomial Regression
    - Our hypothesis function need not be linear (a straight line) if that does not fit the data well.
    - We can change the behaviour or curve of our hypothesis function by making it a quadratic, cubic or square root function (or any other form).
    - A quadratic function eventually comes back down, which does not fit this example: prices should not fall as the size of the house increases, so a cubic or square-root shape is more appropriate here.
  
- Choice of Features - one important thing to keep in mind is that if you choose your features this way (e.g. using x, x^2 and x^3 as separate features), then feature scaling becomes very important, because those features end up on very different ranges.
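A minimal Octave sketch of building and scaling polynomial features, assuming x is an m x 1 column vector of house sizes (names are only illustrative):

    X_poly = [x, x.^2, x.^3];                 % size, size^2, size^3 as three features
    mu = mean(X_poly);
    s  = max(X_poly) - min(X_poly);
    X_poly = (X_poly - mu) ./ s;              % without scaling these ranges differ enormously
    X_poly = [ones(length(x), 1), X_poly];    % finally add the x_0 = 1 column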
  
Computing Parameters Analytically
Normal Equation
- In the “Normal Equation” method, we minimise J by explicitly taking its derivatives with respect to the θ_j's and setting them to zero.
- This allows us to find the optimum θ without iteration: θ = (X^T * X)^-1 * X^T * y.
- Intuition
- Example
    - X - matrix of features, one row per training example
    - y - vector of outputs
    - m examples, n features
    - X is also called the design matrix.
    - Feature scaling is not needed when using the normal equation method.
    - Normal Equation, Octave representation:
 
pinv( X' * X ) * X' * y
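A slightly fuller Octave sketch of the same computation, assuming data is an m x 2 matrix whose first column is the feature and whose second column is the price (names are only illustrative):

    x = data(:, 1);
    y = data(:, 2);
    m = length(y);
    X = [ones(m, 1), x];                % design matrix with the x_0 = 1 column
    theta = pinv(X' * X) * X' * y;      % closed-form optimum: no iterations, no alpha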

- Comparison Between Gradient Descent & Normal Equation
    - Gradient Descent: need to choose α; needs many iterations; works well even when the number of features n is large.
    - Normal Equation: no need to choose α; no iterations; needs to compute (X^T * X)^-1, which is roughly O(n^3), so it becomes slow when n is very large (on the order of 10,000 features or more).
    - For some algorithms, the normal equation method doesn't work and gradient descent has to be used instead.
  
Normal Equation Noninvertibility
- When implementing the normal equation in Octave we want to use the ‘pinv’ function rather than ‘inv’. The ‘pinv’ function will give you a value of θ even if X^T * X is not invertible.
- Reasons for Noninvertibility
    - Redundant features, where two features are very closely related (i.e. they are linearly dependent).
    - Too many features (e.g. m ≤ n). In this case, delete some features or use “regularisation”.
 
- Solutions: delete a feature that is linearly dependent with another, or delete one or more features when there are too many features.
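A tiny Octave sketch illustrating the redundant-feature case and why ‘pinv’ is preferred; the numbers are made up purely for illustration:

    x1 = [1; 2; 3; 4];
    X  = [ones(4, 1), x1, 2 * x1];      % third column is exactly 2 * x1, so X' * X is singular
    y  = [2; 4; 6; 8];
    theta = pinv(X' * X) * X' * y       % pinv still returns a usable theta; inv(X' * X) would warn of singularity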
 
