Machine Learning by Andrew Ng - Week 10
Gradient Descent with Large Datasets
Learning With Large Datasets
- Machine Learning and Data: "It's not who has the best algorithm that wins. It's who has the most data."
- Learning With Large Datasets
    - First choose a small subset, say m = 1000, and train the algorithm on it
    - Plot the learning curve; if it shows high variance, then feeding in more data will help (see the sketch below)
    - If the learning curve shows high bias, then feeding in more data will not help
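As a rough illustration of this diagnosis, here is a minimal sketch (synthetic data and plain NumPy, not course code) that trains on growing subsets of an m = 1000 sample and prints training vs. validation error; a persistent gap suggests high variance, while two similar but high errors suggest high bias:

```python
import numpy as np

# Hypothetical synthetic data standing in for the first m = 1000 examples.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.3, size=1000)

X_val, y_val = X[800:], y[800:]   # hold-out slice used for the validation error

def fit_linear(X_sub, y_sub):
    """Least-squares fit with an added intercept column."""
    A = np.c_[np.ones(len(X_sub)), X_sub]
    theta, *_ = np.linalg.lstsq(A, y_sub, rcond=None)
    return theta

def cost(theta, X_sub, y_sub):
    A = np.c_[np.ones(len(X_sub)), X_sub]
    return np.mean((A @ theta - y_sub) ** 2) / 2

# Learning curve: training and validation error as a function of m.
for m in (50, 100, 200, 400, 800):
    theta = fit_linear(X[:m], y[:m])
    print(m, cost(theta, X[:m], y[:m]), cost(theta, X_val, y_val))
```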
  
        
Stochastic Gradient Descent
- Linear Regression with Gradient Descent - Recap
    - The previous form of gradient descent iterates over all the training examples and sums over them to take one step of descent
    - This becomes a problem when the training set is very large, say hundreds of millions of examples; each step then gets computationally expensive
    - This variant is also called "Batch Gradient Descent", because it uses all the training data in every step (see the sketch below)
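For reference, a minimal NumPy sketch (my own illustration, not course code) of a single batch gradient descent step; note the sum over all m examples, which is what becomes expensive at large m:

```python
import numpy as np

def batch_gradient_step(theta, X, y, alpha=0.01):
    """One step of batch gradient descent for linear regression.

    X is (m, n) with a leading column of ones, y is (m,).
    The gradient is a sum over all m training examples.
    """
    m = len(y)
    gradient = X.T @ (X @ theta - y) / m   # (1/m) * sum_i (h(x_i) - y_i) * x_i
    return theta - alpha * gradient
```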
  
- Batch Gradient Descent vs Stochastic Gradient Descent
- Stochastic Gradient Descent
    - Randomly shuffle the training data
    - Repeat the descent update using one single example at a time (see the sketch below)
    - The descent will not converge the way batch gradient descent does; it wanders into the region of the global minimum, which is usually good enough for the hypothesis
    - It does not converge directly to the global minimum
    - Individual steps vary from one another, but taken as a whole the parameters move towards the global minimum
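A minimal sketch of that loop (assuming the linear regression setting above, illustrative NumPy only):

```python
import numpy as np

def stochastic_gradient_descent(X, y, alpha=0.01, epochs=1):
    """SGD for linear regression: shuffle, then update on one example at a time."""
    m, n = X.shape
    theta = np.zeros(n)
    rng = np.random.default_rng(0)
    for _ in range(epochs):
        for i in rng.permutation(m):       # randomly shuffle the training data
            error = X[i] @ theta - y[i]    # gradient contribution of a single example
            theta = theta - alpha * error * X[i]
    return theta
```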
  
        
Mini-Batch Gradient Descent
- Comparison between the different gradient descent variants
    - Batch Gradient Descent ⇒ use all m examples in each iteration
    - Stochastic Gradient Descent ⇒ use 1 example in each iteration
    - Mini-batch Gradient Descent ⇒ use b examples in each iteration
  
        
- Mini-Batch Gradient Descent
    - Select a batch size b, i.e. the number of examples per batch
    - Repeat the descent update using one batch at a time (see the sketch below)
    - It is faster than batch gradient descent
    - With an efficient vectorised implementation it can also be faster than stochastic gradient descent, because the operations over a batch can be parallelised
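A corresponding mini-batch sketch (hypothetical batch size b = 10); each update is a vectorised operation over one batch of b examples:

```python
import numpy as np

def minibatch_gradient_descent(X, y, b=10, alpha=0.01, epochs=1):
    """Mini-batch gradient descent for linear regression with batch size b."""
    m, n = X.shape
    theta = np.zeros(n)
    rng = np.random.default_rng(0)
    for _ in range(epochs):
        order = rng.permutation(m)
        for start in range(0, m, b):
            idx = order[start:start + b]                      # the next b examples
            Xb, yb = X[idx], y[idx]
            gradient = Xb.T @ (Xb @ theta - yb) / len(idx)    # vectorised over the batch
            theta = theta - alpha * gradient
    return theta
```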
  
        
Stochastic Gradient Descent Convergence
- Checking for Convergence
    - During learning, compute the cost on each example just before updating the parameters with it
    - Every 1000 iterations (say), plot the cost averaged over the last 1000 examples processed by the algorithm (see the sketch below)
    - Examples
        - Averaging over a bigger number of training examples before plotting gives a smoother curve
        - If the cost seems to be increasing, the algorithm is diverging; using a smaller learning rate will solve the problem
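A sketch of this monitoring idea (the helper names are my own; the per-example cost is the squared error of linear regression):

```python
import numpy as np

def sgd_with_monitoring(X, y, alpha=0.01, window=1000):
    """SGD that records the cost averaged over every window of 1000 examples."""
    m, n = X.shape
    theta = np.zeros(n)
    recent, averaged = [], []
    for i in np.random.default_rng(0).permutation(m):
        error = X[i] @ theta - y[i]
        recent.append(0.5 * error ** 2)        # cost computed *before* the update
        theta = theta - alpha * error * X[i]
        if len(recent) == window:
            averaged.append(np.mean(recent))   # one point on the convergence plot
            recent = []
    return theta, averaged   # plot `averaged`; a rising curve means divergence
```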
 
  
        
- Tuning the Learning Rate in Stochastic Gradient Descent
    - The learning rate alpha is typically held constant; it can be slowly decreased over time if we want theta to converge, e.g. alpha = const1 / (iterationNumber + const2) (see the sketch below)
    - Dynamically decreasing the learning rate in this way can lead the algorithm to converge
    - A smaller learning rate reduces the oscillation around the global minimum and helps the parameters converge
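A tiny sketch of that decay schedule (const1 and const2 are tuning constants; the values below are made up):

```python
def decayed_alpha(iteration_number, const1=5.0, const2=50.0):
    """alpha = const1 / (iterationNumber + const2): shrinks as iterations grow."""
    return const1 / (iteration_number + const2)

# The step size slowly shrinks, so theta can settle down at the minimum
# instead of oscillating around it.
for t in (1, 10, 100, 1000):
    print(t, decayed_alpha(t))
```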
  
        
Advanced Topics
Online Learning
- Online Learning
    - Example: a shipping service website where a user comes, specifies origin and destination, you offer to ship their package for some asking price, and users sometimes choose to use your shipping service (y = 1), sometimes not (y = 0)
    - Features x capture properties of the user, of the origin / destination, and the asking price
    - We want to learn p(y = 1 | x; theta) to optimise the price
    - In online learning there is no fixed training set
    - There is a continuous stream of data; each example is used to train once and is then discarded (see the sketch below)
    - Online learning can adapt to changing user preferences
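A minimal sketch of one online update (logistic regression, since we are modelling p(y = 1 | x; theta); the feature values are hypothetical):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def online_update(theta, x, y, alpha=0.1):
    """Use the incoming example once for a gradient step, then discard it."""
    prediction = sigmoid(x @ theta)              # p(y = 1 | x; theta)
    return theta - alpha * (prediction - y) * x  # logistic regression gradient step

# Continuous stream of (features, outcome): user/origin/destination/price -> shipped or not.
theta = np.zeros(4)
stream = [(np.array([1.0, 0.2, 0.5, 0.3]), 1),
          (np.array([1.0, 0.8, 0.1, 0.9]), 0)]
for x, y in stream:
    theta = online_update(theta, x, y)           # each example is seen exactly once
```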
  
        
- Examples
    - Product search (learning to search)
        - User searches for "Android phone 1080p camera"
        - Have 100 phones in store, will return 10 results
        - x = features of the phone: how many words in the user query match the name of the phone, how many words match the description of the phone, etc.
        - y = 1 if the user clicks on the link, y = 0 otherwise
        - Learn p(y = 1 | x; theta)
        - Use it to show the user the 10 phones they're most likely to click on (see the sketch below)
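A sketch of the ranking step (hypothetical features and parameters; reuses a logistic model for p(y = 1 | x; theta)):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def top_10_phones(theta, phone_features):
    """Score all 100 phones by p(y = 1 | x; theta) and return the 10 best indices."""
    scores = sigmoid(phone_features @ theta)   # predicted click probability per phone
    return np.argsort(scores)[::-1][:10]       # indices of the 10 highest scores

# 100 phones, each described by 4 hypothetical query-match features.
rng = np.random.default_rng(0)
phones = rng.random((100, 4))
theta = np.array([0.5, 1.0, -0.3, 0.2])
print(top_10_phones(theta, phones))
```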
 
            
    - Choosing special offers to show user
    - Customised selection of news articles
    - Product recommendation
  
        
Map Reduce and Data Parallelism
- Map Reduce
    - Divide the training set into parts, compute each part on a different machine, and then combine the results from all the machines
    - Network latency has to be considered
    - Concept - Workflow of Map Reduce
    - Map Reduce and summation over the training set: many learning algorithms can be expressed as computing sums of functions over the training set (see the sketch below)
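A minimal sketch of that idea for a gradient that is a sum over the training set (sequential simulation; in a real Map Reduce setup each partial sum would be computed on a separate machine and combined by a master node):

```python
import numpy as np

def partial_gradient(theta, X_part, y_part):
    """'Map' step: the gradient sum over one chunk of the training set."""
    return X_part.T @ (X_part @ theta - y_part)

def mapreduce_gradient_step(theta, X, y, n_machines=4, alpha=0.01):
    """'Reduce' step: add up the partial sums, then take one descent step."""
    m = len(y)
    chunks = np.array_split(np.arange(m), n_machines)   # split the training set
    total = sum(partial_gradient(theta, X[idx], y[idx]) for idx in chunks)
    return theta - alpha * total / m
```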
  
    - Multi Core Machines
        - Uses multiple cores in the same machine to parallelise the operation
        - The factor of network latency is diminished, because all the operations are performed on the same machine (see the sketch below)
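A sketch of the same partial-sum split across the cores of one machine, using Python's multiprocessing as a stand-in (in practice a well-vectorised numerical library may already use multiple cores on its own):

```python
import numpy as np
from multiprocessing import Pool

def partial_sum(args):
    """Gradient sum over one chunk; each chunk is handled by one worker/core."""
    theta, X_part, y_part = args
    return X_part.T @ (X_part @ theta - y_part)

def multicore_gradient_step(theta, X, y, n_cores=4, alpha=0.01):
    m = len(y)
    chunks = np.array_split(np.arange(m), n_cores)   # one chunk per core
    jobs = [(theta, X[idx], y[idx]) for idx in chunks]
    with Pool(n_cores) as pool:                      # no network latency involved
        partials = pool.map(partial_sum, jobs)
    return theta - alpha * sum(partials) / m

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(10_000, 3))
    y = X @ np.array([1.0, -1.0, 0.5])
    print(multicore_gradient_step(np.zeros(3), X, y))
```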
  