Data Science has become the dream job of many people in the 21st century, so let’s have a look at a simple concept at its core: how gradient descent has evolved over the years. TensorFlow gives us quite a few options for picking a gradient descent based optimization strategy, which is what causes our neural network to actually learn from the data. But it is not immediately clear how we should pick one. Once we understand the underlying details of the most popular ones, it will become much clearer which one we should use. When you study machine learning for a while, you start making connections between these mathematical techniques and the way we learn.
At this point, I view pretty much everything through the lens of data and computation, and it is a beautiful feeling. For example, if our dataset is too homogeneous, then after training, our model will fit this data too closely. It will be overfit, meaning it will not be able to generalize well. So if we give it a different data point, it won’t be able to make an accurate prediction. We have to keep our training data diverse. And in the same way, if we keep our brain’s training data diverse, by travelling and seeking out novel experiences, we will be able to generalize better. And generalization is the hallmark of intelligence. So that’s why I like anchovies. If you were to ask what the most important machine learning technique is, the answer is, without a doubt, gradient descent. It is the foundation of how we train intelligent systems. And it’s based on a very simple idea: instead of immediately guessing the best solution to a given objective, we guess an initial solution and iteratively step in a direction closer to a better solution.
The algorithm just repeats that process until it arrives at a solution that’s good enough. Since there is no way we can know the best solution from the start, this educated guess-and-check method is supremely useful. “Oh gradient descent, find the ideal minima!” “Control our variance!” “Update our parameters!” “And lead us to convergence!” We can follow the trail of optimization, both visually and mathematically, as it leads us to convergence. Another way to think about the idea of gradient descent is to examine how a professional athlete improves at a game. As long as there is an objective function that can be measured, like the number of wins in a season, and some data that contributes to or detracts from that function, say passes, number of 3-pointers, or steals, that player can iteratively take baby steps in their routine, after analyzing the data, to improve.
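To make that guess-and-check loop concrete, here is a minimal sketch in plain Python. It minimizes a toy one-dimensional loss f(x) = (x - 3)², with the derivative written out by hand; the starting point and learning rate are arbitrary choices for illustration, not values the text prescribes.

```python
# Toy loss: f(x) = (x - 3)**2, whose gradient is f'(x) = 2 * (x - 3).
# The true minimum is at x = 3; we start from a deliberately bad guess.

def grad(x):
    return 2 * (x - 3)

x = -10.0            # initial guess
learning_rate = 0.1  # size of each baby step

for step in range(100):
    x -= learning_rate * grad(x)  # step in the direction that lowers the loss

print(round(x, 4))  # ends up very close to 3, the ideal minimum
```

Each pass of the loop is one "educated guess and check": measure the slope at the current guess, then nudge the guess downhill.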
So, to decide which of the gradient descent optimization techniques we should use in our model, let us learn about the various discoveries around gradient descent over the years. Traditional (batch) gradient descent computes the gradients of the loss function with respect to the parameters over the entire training dataset, for a given number of epochs. Since we need to calculate the gradients for the whole dataset for just a single update, this is relatively slow, and even intractable for datasets that do not fit in memory.

To get around this intractability, we can use stochastic gradient descent. This is where we perform a parameter update for each training example and label. So we just add a loop over our training data points and calculate the gradient with respect to each and every one. These more frequent, high-variance updates cause the objective function to fluctuate more intensely. This is a good thing in that it helps the algorithm jump to new and possibly better local minima, whereas standard gradient descent will only converge to the minimum of the basin the parameters are initialized in. But it also complicates convergence to the exact minimum, since it could keep overshooting.

An improvement is mini-batch gradient descent, which takes the best of both worlds by performing an update for every subset of training examples, whose size we can decide. Training in mini-batches is usually the method of choice for training neural networks, and we usually use the term stochastic gradient descent even when mini-batches are used. The oscillations in plain old SGD make it hard to reach convergence, though. So a technique called momentum was invented that lets it accelerate along the relevant directions and softens the oscillations in the irrelevant directions.
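Here is a rough NumPy sketch of mini-batch SGD with momentum, fitting a tiny linear model to synthetic data. Everything here is an assumption for illustration: the made-up data (y ≈ 4x + 1), the batch size, learning rate, and momentum coefficient are arbitrary, and the momentum form shown (velocity = β·velocity + gradient) is just one common convention.

```python
import numpy as np

# Synthetic data, invented for this sketch: y = 4x + 1 plus a little noise.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=200)
y = 4 * X + 1 + 0.1 * rng.standard_normal(200)

w, b = 0.0, 0.0            # model parameters
vw, vb = 0.0, 0.0          # momentum "velocity" for each parameter
lr, beta, batch_size = 0.1, 0.9, 32

for epoch in range(50):
    idx = rng.permutation(len(X))                # reshuffle every epoch
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]    # one mini-batch of examples
        xb, yb = X[batch], y[batch]
        err = (w * xb + b) - yb                  # prediction error on the batch
        gw = 2 * np.mean(err * xb)               # MSE gradient w.r.t. w
        gb = 2 * np.mean(err)                    # MSE gradient w.r.t. b
        vw = beta * vw + gw                      # momentum accumulates past gradients,
        vb = beta * vb + gb                      # smoothing the oscillations
        w -= lr * vw
        b -= lr * vb

print(round(w, 2), round(b, 2))  # lands near the true values 4 and 1
```

Setting `batch_size = len(X)` recovers batch gradient descent, and `batch_size = 1` recovers per-example SGD, so this one loop covers all three variants discussed above; `beta = 0` switches momentum off.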