Gradient Descent with Large Datasets

Learning With Large Datasets

  • Machine Learning and Data

    It’s not who has the best algorithm that wins. It’s who has the most data.

    Machine Learning and Data.png

  • Learning With Large Datasets

    • First train the algorithm on a smaller subset, say m = 1000 examples

    • Plot the learning curves ( see the sketch below ); if the model has high variance, then feeding it more data will be helpful

    • If the learning curves show high bias, then feeding it more data will not be helpful

    Learning with Large Datasets.png
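    A minimal sketch of this diagnostic, assuming synthetic data and a plain least-squares fit standing in for the learning algorithm ( all names and values here are illustrative ):

```python
# Hypothetical learning-curve check: train on growing subsets and compare
# training error vs cross-validation error. A persistent gap (low train
# error, high CV error) suggests high variance, so more data should help;
# two high, nearly-touching curves suggest high bias.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 10))                      # synthetic features
y = X @ rng.normal(size=10) + rng.normal(scale=0.5, size=5000)
X_tr, y_tr, X_cv, y_cv = X[:4000], y[:4000], X[4000:], y[4000:]

def mse(theta, X, y):
    return np.mean((X @ theta - y) ** 2) / 2

sizes, train_err, cv_err = [100, 300, 1000, 3000], [], []
for m in sizes:
    theta, *_ = np.linalg.lstsq(X_tr[:m], y_tr[:m], rcond=None)
    train_err.append(mse(theta, X_tr[:m], y_tr[:m]))
    cv_err.append(mse(theta, X_cv, y_cv))

plt.plot(sizes, train_err, label="train error")
plt.plot(sizes, cv_err, label="cross-validation error")
plt.xlabel("m (training set size)")
plt.legend()
plt.show()
```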

Stochastic Gradient Descent

  • Linear Regression with Gradient Descent

    • Recap

    LR with GD.png

    • The previous form of gradient descent iterates over all the training examples and sums their contributions to take one step of descent ( see the sketch below )

    • This becomes a problem when the training set is very large, in the hundreds of millions of examples, because every single step is then computationally expensive

    • This variant is also called “ Batch Gradient Descent ”, because each step uses all of the training data

    LR with GD 1.png
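    A minimal sketch of the batch update for linear regression, assuming NumPy and illustrative hyperparameters:

```python
# Batch gradient descent for linear regression: every parameter update
# sums the gradient over ALL m examples, which is exactly what becomes
# expensive when m is in the hundreds of millions.
import numpy as np

def batch_gradient_descent(X, y, alpha=0.01, n_iters=100):
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_iters):
        # gradient of J(theta) = (1 / 2m) * sum((X @ theta - y) ** 2)
        grad = X.T @ (X @ theta - y) / m   # one full pass over the data
        theta -= alpha * grad
    return theta
```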

  • Batch Gradient Descent vs Stochastic Gradient Descent

    Batch GD vs Stochastic GD.png

  • Stochastic Gradient Descent

    • Randomly shuffle the training data

    • Repeatedly update the parameters using a single training example at a time ( sketched below )

    • Unlike batch gradient descent, the descent will not converge exactly; it reaches the region around the global minimum, which is good enough for the hypothesis

    • Individual steps vary in direction, but in the whole picture the parameters move towards the global minimum

    Stochastic Gradient Descent.png
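    A minimal sketch of the procedure, assuming linear regression and illustrative hyperparameters:

```python
# Stochastic gradient descent: shuffle the data, then update theta from
# one example at a time. Each update is cheap, but noisy.
import numpy as np

def stochastic_gradient_descent(X, y, alpha=0.01, n_epochs=5):
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_epochs):                # a few passes over the data
        for i in np.random.permutation(m):   # 1. randomly shuffle
            err = X[i] @ theta - y[i]        # 2. error on one example
            theta -= alpha * err * X[i]      # 3. update from that example
    return theta
```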

Mini-Batch Gradient Descent

  • Comparison between different gradient descent

    • Batch Gradient Descent ⇒ Use all m examples in each iteration

    • Stochastic Gradient Descent ⇒ Use 1 example in each iteration

    • Mini-batch Gradient Descent ⇒ Use b examples in each iteration

    Comparison.png

  • Mini-Batch Gradient Descent

    • Choose a batch size b, e.g. b = 10 examples per batch

    • Repeat the descent updates, using the next batch of b examples for each update ( see the sketch below )

    • It takes steps much faster than batch gradient descent, since each update touches only b examples instead of all m

    • With an efficiently vectorised implementation it can even be faster than stochastic gradient descent, because the sum over the b examples can be computed in parallel

    Mini-Batch Gradient Descent.png
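    A minimal sketch, assuming linear regression and an illustrative batch size of b = 10:

```python
# Mini-batch gradient descent: each update averages the gradient over b
# consecutive examples; the sum over the batch vectorises, which is
# where the speed-up over one-example updates comes from.
import numpy as np

def mini_batch_gradient_descent(X, y, alpha=0.01, b=10, n_epochs=5):
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_epochs):
        perm = np.random.permutation(m)      # shuffle once per pass
        X, y = X[perm], y[perm]
        for start in range(0, m, b):
            Xb, yb = X[start:start + b], y[start:start + b]
            grad = Xb.T @ (Xb @ theta - yb) / len(yb)  # vectorised over b
            theta -= alpha * grad
    return theta
```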

Stochastic Gradient Descent Convergence

  • Checking for Convergence

    • During learning, compute the cost on each example before updating the parameters with it

    • Every 1000 iterations ( say ), plot the cost averaged over the last 1000 examples processed by the algorithm ( sketched below )

    Checking for Convergence.png

    • Examples

      • Averaging over a bigger number of examples before plotting will give a smoother curve

      • If the cost seems to be increasing, it means the algorithm has diverged

        • Using a smaller learning rate will solve the problem

      Examples.png
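      A minimal sketch of this check for SGD on linear regression ( window size 1000 as in the notes; everything else is an illustrative assumption ):

```python
# Convergence check for SGD: record the cost on each example BEFORE the
# update that uses it, then average every consecutive window of 1000
# costs. Plot the averages; a flattening curve suggests convergence, a
# rising one suggests divergence (try a smaller alpha).
import numpy as np

def sgd_with_cost_trace(X, y, alpha=0.01, window=1000):
    m, n = X.shape
    theta = np.zeros(n)
    costs, averages = [], []
    for i in np.random.permutation(m):
        err = X[i] @ theta - y[i]
        costs.append(0.5 * err ** 2)         # cost before the update
        theta -= alpha * err * X[i]
        if len(costs) == window:
            averages.append(np.mean(costs))  # mean of the last 1000 costs
            costs = []
    return theta, averages
```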

  • Tuning Learning Rate in Stochastic Gradient Descent

    • The learning rate alpha is typically held constant; it can be slowly decreased over time if we want theta to converge

      • alpha = const1 / ( iterationNumber + const2 )

    • Decreasing the learning rate dynamically in this way can make the algorithm actually converge ( see the snippet below )

    • A smaller learning rate means smaller oscillations around the global minimum, letting the parameters settle closer to it

    SGD Learning Rate.png
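    A minimal sketch of the schedule; const1 and const2 are extra hyperparameters to tune, and the values here are illustrative assumptions:

```python
# Decaying learning-rate schedule: alpha shrinks as the iteration number
# grows, so the SGD steps (and hence the oscillations) get smaller.
def learning_rate(iteration, const1=1.0, const2=50.0):
    return const1 / (iteration + const2)
```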

Advanced Topics

Online Learning

  • Online Learning

    • Shipping service website: a user comes, specifies origin and destination, you offer to ship their package for some asking price, and users sometimes choose to use your shipping service ( y = 1 ), sometimes not ( y = 0 )

    • Features x capture properties of the user, of the origin / destination, and the asking price

    • We want to learn p ( y = 1 | x ; theta ) to optimise the price

    • In online learning, there is no fixed training set

    • A continuous stream of data flows in; each example is used for a single training update and then discarded ( sketched below )

    • Online learning can adapt to changing user preferences

    Online Learning.png
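    A minimal sketch of the idea for the shipping example, assuming logistic regression updated once per arriving example ( class and parameter names are hypothetical ):

```python
# Online logistic regression: one gradient step per arriving (x, y)
# pair, after which the example is discarded rather than stored.
import numpy as np

class OnlineLogisticRegression:
    def __init__(self, n_features, alpha=0.1):
        self.theta = np.zeros(n_features)
        self.alpha = alpha

    def predict_proba(self, x):
        # p(y = 1 | x; theta): probability the user accepts the offer
        return 1.0 / (1.0 + np.exp(-x @ self.theta))

    def update(self, x, y):
        # one gradient step on this single example, then forget it
        self.theta -= self.alpha * (self.predict_proba(x) - y) * x
```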

  • Examples

    • Product search ( learning to search )

      • User searches for “ Android phone 1080p camera ”

      • Have 100 phones in store. Will return 10 results

      • x = features of the phone: how many words in the user query match the name of the phone, how many words match its description, etc.

      • y = 1 if user clicks on link. y = 0 otherwise

      • Learn p ( y = 1 | x ; theta ), the predicted click-through rate ( CTR )

      • Use it to show the user the 10 phones they’re most likely to click on ( see the snippet below )

    • Choosing special offers to show user

    • Customised selection of news articles

    • Product recommendation

    Online Learning Example.png
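    A minimal sketch of the ranking step, reusing the hypothetical OnlineLogisticRegression model from the sketch above; phone_features is an assumed list of ( phone, feature-vector ) pairs:

```python
# Rank the store's phones by predicted click probability p(y = 1 | x; theta)
# and return the 10 the user is most likely to click on.
def top_ten(model, phone_features):
    scored = [(model.predict_proba(x), phone) for phone, x in phone_features]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [phone for _, phone in scored[:10]]
```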

Map Reduce and Data Parallelism

  • Map Reduce

    • Divide the training set into parts, compute each part on a different machine, and then combine the results from all the machines

    • Network latency between the machines has to be taken into account

    Map Reduce.png

    • Concept

      • Workflow of Map Reduce

      Concept.png

    • Map Reduce and summation over the training set

      • Many learning algorithms can be expressed as computing sums of functions over the training set, and such sums split naturally across machines ( see the sketch below )

      Map Reduce 1.png
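      A minimal sketch of this split for the linear-regression gradient sum, with the four “ machines ” simulated sequentially ( all names are illustrative ):

```python
# Map-reduce over a gradient sum: split the m examples into 4 partitions,
# let each "machine" compute its partial sum (map), then combine the
# partial sums on a central server (reduce).
import numpy as np

def partial_gradient(X_part, y_part, theta):
    # map step: sum of (h(x) - y) * x over this partition only
    return X_part.T @ (X_part @ theta - y_part)

def map_reduce_gradient(X, y, theta, n_machines=4):
    parts = np.array_split(np.arange(len(y)), n_machines)
    partials = [partial_gradient(X[idx], y[idx], theta) for idx in parts]
    return sum(partials) / len(y)  # reduce step: combine and average
```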

    • Multi Core Machines

      • The same idea can use the multiple cores of a single machine to parallelise the operation ( sketched below )

      • The factor of network latency disappears, because all the operations are performed on the same machine

      Multi Core Machines.png
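      A minimal sketch of the same split on one multi-core machine, using Python's multiprocessing pool for the map step ( core count and data are illustrative assumptions ):

```python
# Same split-combine idea, but the map step runs on separate cores of
# one machine via a process pool, so no network latency is involved.
import numpy as np
from multiprocessing import Pool

def partial_sum(args):
    X_part, y_part, theta = args
    return X_part.T @ (X_part @ theta - y_part)

def multicore_gradient(X, y, theta, n_cores=4):
    idx = np.array_split(np.arange(len(y)), n_cores)
    jobs = [(X[i], y[i], theta) for i in idx]
    with Pool(n_cores) as pool:          # one worker process per core
        partials = pool.map(partial_sum, jobs)
    return sum(partials) / len(y)

if __name__ == "__main__":
    X = np.random.randn(10_000, 5)
    y = np.random.randn(10_000)
    print(multicore_gradient(X, y, np.zeros(5)))
```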

Lecture Presentations