Machine Learning by Andrew Ng - Week 5
Cost Function and Backpropagation
Cost Function
- Let's first define a few variables that we will need to use:
    - L = total number of layers in the network
    - s_l = number of units (not counting the bias unit) in layer l
    - K = number of output units/classes
- The cost function for a neural network generalises the regularised logistic regression cost to K output units:

    J(\Theta) = -\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} \left[ y_k^{(i)} \log\big(h_\Theta(x^{(i)})_k\big) + (1 - y_k^{(i)}) \log\big(1 - h_\Theta(x^{(i)})_k\big) \right] + \frac{\lambda}{2m} \sum_{l=1}^{L-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} \big(\Theta_{j,i}^{(l)}\big)^2
- We have added a few nested summations to account for our multiple output nodes.
- In the first part of the equation, before the square brackets, we have an additional nested summation that loops through the number of output nodes.
- In the regularization part, after the square brackets, we must account for multiple theta matrices.
- The number of columns in our current theta matrix is equal to the number of nodes in our current layer (including the bias unit).
- The number of rows in our current theta matrix is equal to the number of nodes in the next layer (excluding the bias unit).
- As before with logistic regression, we square every term.
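As a sketch, the double and triple sums can be written in NumPy. `nn_cost` and its arguments are illustrative names (not code from the course), and `H` stands in for the output activations h_Θ(x) that forward propagation would produce:

```python
import numpy as np

def nn_cost(H, Y, thetas, lam):
    """Regularised neural-network cost J(Theta) - a hedged sketch.

    H      : (m, K) output activations h_Theta(x^(i))_k
    Y      : (m, K) one-hot label matrix
    thetas : list of Theta matrices (one per layer transition)
    lam    : regularisation parameter lambda
    """
    m = Y.shape[0]
    # Double sum: the logistic-regression cost of every output unit k
    # for every training example i.
    cost = -np.sum(Y * np.log(H) + (1 - Y) * np.log(1 - H)) / m
    # Triple sum: squares of every non-bias weight in the network
    # (column 0 of each Theta multiplies the bias unit, so it is skipped).
    reg = sum(np.sum(t[:, 1:] ** 2) for t in thetas)
    return cost + lam * reg / (2 * m)
```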

- Note:
    - the double sum simply adds up the logistic regression costs calculated for each cell in the output layer
    - the triple sum simply adds up the squares of all the individual Θs in the entire network
    - the i in the triple sum does not refer to training example i
Backpropagation Algorithm
- "Backpropagation" is neural-network terminology for minimizing our cost function, just like what we were doing with gradient descent in logistic and linear regression. Our goal is to compute: min_Θ J(Θ)
- That is, we want to minimize our cost function J using an optimal set of parameters in theta.
- To compute the partial derivatives ∂J(Θ)/∂Θ^(l)_{i,j}, the backpropagation algorithm is used.
 
- One training example: perform forward propagation to compute the activations a^(l) for every layer, then compute the output error δ^(L) = a^(L) - y and propagate it backwards through the network.
- Multiple training examples: repeat this for every example, accumulating the gradient contributions in the Δ^(l) terms.
- Process:
    - Given the training set {(x^(1), y^(1)), ..., (x^(m), y^(m))}, set Δ^(l)_{i,j} := 0 for all l, i, j.
    - For each training example t = 1, ..., m: set a^(1) := x^(t) and perform forward propagation to compute a^(l) for l = 2, ..., L.
    - Compute δ^(L) = a^(L) - y^(t), then work backwards: δ^(l) = ((Θ^(l))^T δ^(l+1)) .* a^(l) .* (1 - a^(l)) for l = L-1, ..., 2.
    - Accumulate Δ^(l) := Δ^(l) + δ^(l+1) (a^(l))^T.
    - Finally, D^(l)_{i,j} := (1/m)(Δ^(l)_{i,j} + λΘ^(l)_{i,j}) if j ≠ 0, and D^(l)_{i,j} := (1/m)Δ^(l)_{i,j} if j = 0; these D terms are the partial derivatives ∂J(Θ)/∂Θ^(l)_{i,j}.
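The one-example forward/backward computation can be sketched in NumPy. The layer sizes, sigmoid activations, and the helper name `backprop_one_example` are illustrative assumptions, not code from the course:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_one_example(x, y, Theta1, Theta2):
    """Delta accumulations for ONE example in a 3-layer sigmoid network."""
    # Forward propagation, keeping the activations of every layer.
    a1 = np.concatenate(([1.0], x))           # add bias unit
    z2 = Theta1 @ a1
    a2 = np.concatenate(([1.0], sigmoid(z2)))
    a3 = sigmoid(Theta2 @ a2)                 # = h_Theta(x)

    # Output-layer error: delta^(L) = a^(L) - y.
    delta3 = a3 - y
    # Hidden-layer error, propagated back through Theta2; the sigmoid
    # derivative is g'(z2) = g(z2) .* (1 - g(z2)). The bias column of
    # Theta2 is dropped because there is no delta for the bias unit.
    delta2 = (Theta2[:, 1:].T @ delta3) * sigmoid(z2) * (1 - sigmoid(z2))

    # Gradient contributions: Delta^(l) = delta^(l+1) * (a^(l))^T.
    Delta1 = np.outer(delta2, a1)
    Delta2 = np.outer(delta3, a2)
    return Delta1, Delta2
```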
Backpropagation Intuition
- Forward Propagation: compute the activations a^(l) layer by layer, from the input layer through to the output layer.
- Backward Propagation:
    - The delta values are actually the derivative of the cost function: δ_j^(l) = ∂ cost(t) / ∂ z_j^(l).
    - Recall that our derivative is the slope of a line tangent to the cost function, so the steeper the slope the more incorrect we are.
Backpropagation in Practice
Implementation Note: Unrolling Parameters
- Advanced Optimisation - the advanced optimisation functions (e.g. fminunc) need theta to be a single vector.
- Example:
    - For efficient forward and back propagation, the parameters are expected as matrices; for the advanced cost-function optimisers, they are expected as vectors.
    - Unrolling matrices into vectors in Octave.
- Learning Algorithm - the process of unrolling: unroll the Theta matrices into one long vector for the optimiser, and reshape that vector back into matrices inside the cost function.
- Octave Snippets - Matrices → Vectors
 
thetaVector = [ Theta1(:); Theta2(:); Theta3(:) ]
deltaVector = [ D1(:); D2(:); D3(:) ]
- Vectors → Matrices
Theta1 = reshape(thetaVector(1:110),10,11)
Theta2 = reshape(thetaVector(111:220),10,11)
Theta3 = reshape(thetaVector(221:231),1,11)
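The same unroll/reshape round trip can be sketched in NumPy (a hedged translation of the Octave snippet above; note that Octave's `(:)` stacks column-by-column while NumPy's `ravel()` is row-major by default, but the round trip works either way as long as both directions use the same order):

```python
import numpy as np

# Illustrative shapes matching the Octave example: 10x11, 10x11, 1x11.
Theta1 = np.random.rand(10, 11)
Theta2 = np.random.rand(10, 11)
Theta3 = np.random.rand(1, 11)

# Matrices -> one long vector (what the optimiser sees).
theta_vector = np.concatenate([Theta1.ravel(), Theta2.ravel(), Theta3.ravel()])

# Vector -> matrices (what forward/back propagation needs).
T1 = theta_vector[:110].reshape(10, 11)
T2 = theta_vector[110:220].reshape(10, 11)
T3 = theta_vector[220:231].reshape(1, 11)
```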
Gradient Checking
- Gradient checking will assure that our backpropagation works as intended.
- We can approximate the derivative of our cost function with:

    \frac{\partial}{\partial \Theta_j} J(\Theta) \approx \frac{J(\Theta_1, \dots, \Theta_j + \epsilon, \dots, \Theta_n) - J(\Theta_1, \dots, \Theta_j - \epsilon, \dots, \Theta_n)}{2\epsilon}

    - Epsilon ε = 10^-4 guarantees that the math works out properly.
    - If the value for ε is too small, we can end up with numerical problems.
- Parameter Vector - apply the approximation to each component θ_j of the unrolled parameter vector in turn.
- Process - once we compute our gradApprox vector, we can check that gradApprox ≈ deltaVector.
- Notes:
    - Implementation Note:
        - Implement backpropagation to compute DVec.
        - Implement the numerical gradient check to compute gradApprox.
        - Make sure they give similar values.
        - Turn off gradient checking, and use the backpropagation code for learning.
    - Important:
        - Be sure to disable your gradient checking code before training your classifier.
        - If you run the numerical gradient computation on every iteration of gradient descent (or in the inner loop of costFunction()), your code will be very slow.
    - Octave Snippet
epsilon = 1e-4;
for i = 1:n,
  thetaPlus = theta;
  thetaPlus(i) += epsilon;
  thetaMinus = theta;
  thetaMinus(i) -= epsilon;
  gradApprox(i) = (J(thetaPlus) - J(thetaMinus))/(2*epsilon);
end;
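The same check can be sketched in NumPy; the toy quadratic cost and the name `numerical_gradient` are illustrative assumptions, standing in for the neural-network cost:

```python
import numpy as np

def numerical_gradient(J, theta, epsilon=1e-4):
    """Two-sided difference approximation of the gradient of J at theta."""
    grad_approx = np.zeros_like(theta)
    for i in range(theta.size):
        theta_plus = theta.copy()
        theta_plus[i] += epsilon
        theta_minus = theta.copy()
        theta_minus[i] -= epsilon
        grad_approx[i] = (J(theta_plus) - J(theta_minus)) / (2 * epsilon)
    return grad_approx

# Toy cost J(theta) = sum(theta^2); its analytic gradient is 2*theta,
# so gradApprox should closely match the "deltaVector" role here.
theta = np.array([1.0, -2.0, 0.5])
approx = numerical_gradient(lambda t: np.sum(t ** 2), theta)
exact = 2 * theta
```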
Random Initialisation
- Initial Value of Theta - we need to initialise theta.
- Zero Initialisation:
    - When initialised with zeros,
    - all the units in the hidden layer will perform the same activation, and
    - the neural network will not be able to learn new features.
- Random Initialisation:
    - rand(x,y) is just a function in Octave that will initialise a matrix of random real numbers between 0 and 1.
    - (Note: the INIT_EPSILON used in the snippet below is unrelated to the epsilon from Gradient Checking.)
- 
    Octave Snippets 
If the dimensions of Theta1, Theta2 and Theta3 are 10x11, 10x11 and 1x11 respectively:
Theta1 = rand(10,11) * (2 * INIT_EPSILON) - INIT_EPSILON;
Theta2 = rand(10,11) * (2 * INIT_EPSILON) - INIT_EPSILON;
Theta3 = rand(1,11) * (2 * INIT_EPSILON) - INIT_EPSILON;
Putting It Together
- First, pick a network architecture; choose the layout of your neural network, including how many hidden units in each layer and how many layers in total you want to have.
    - Number of input units = dimension of features x^(i)
    - Number of output units = number of classes
    - Number of hidden units per layer = usually the more the better (but this must be balanced against the cost of computation, which increases with more hidden units)
    - Defaults: 1 hidden layer. If you have more than 1 hidden layer, then it is recommended that you have the same number of units in every hidden layer.
- Training a Neural Network:
    1. Randomly initialise the weights.
    2. Implement forward propagation to get h_Θ(x^(i)) for any x^(i).
    3. Implement the cost function.
    4. Implement backpropagation to compute the partial derivatives.
    5. Use gradient checking to confirm that your backpropagation works. Then disable gradient checking.
    6. Use gradient descent or a built-in optimization function to minimize the cost function with the weights in theta.
- However, keep in mind that J(Θ) is not convex and thus we can end up in a local minimum instead.

- Octave Snippets - when we perform forward and back propagation, we loop over every training example:

for i = 1:m,
   % Perform forward propagation and backpropagation using example (x(i), y(i))
   % (Get activations a(l) and delta terms d(l) for l = 2, ..., L)
end;
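The per-example loop can be sketched in NumPy as follows, reusing an illustrative 3-layer sigmoid network (layer sizes and the helper name `gradients` are assumptions; dividing the accumulated Δ terms by m gives the unregularised gradient D):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradients(X, Y, Theta1, Theta2):
    """Accumulate Delta over all m examples and return D = Delta/m."""
    m = X.shape[0]
    Delta1 = np.zeros_like(Theta1)
    Delta2 = np.zeros_like(Theta2)
    for i in range(m):  # loop on every training example
        # Forward propagation for example (X[i], Y[i]).
        a1 = np.concatenate(([1.0], X[i]))
        z2 = Theta1 @ a1
        a2 = np.concatenate(([1.0], sigmoid(z2)))
        a3 = sigmoid(Theta2 @ a2)
        # Backpropagation: delta terms for l = 3 and l = 2.
        delta3 = a3 - Y[i]
        delta2 = (Theta2[:, 1:].T @ delta3) * sigmoid(z2) * (1 - sigmoid(z2))
        # Accumulate the gradient contributions.
        Delta1 += np.outer(delta2, a1)
        Delta2 += np.outer(delta3, a2)
    return Delta1 / m, Delta2 / m
```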