Anonymous
Due to the randomness of the method (we randomly select a subset of samples, a batch), it will in general not find a point where the gradient is exactly 0, i.e. a minimum. However, the model is likely to land in an area near a minimum. The advantages of this approach are (see the sketch below): (1) for the massive datasets used in deep learning, it is infeasible to compute the gradient over all samples, since a single update step would require iterating over the entire dataset; (2) the randomness helps the model escape local minima, because the gradient estimated from a small batch does not coincide with the full gradient, so the model does not simply get stuck in the nearest local minimum; (3) it makes the model less sensitive to initialization, since in the non-stochastic case the gradient is deterministic and will push the model toward the closest local minimum; and (4) it helps prevent overfitting, as it is harder for the model to overfit when each update is based on only a subset of the samples.
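To make the contrast concrete, here is a minimal NumPy sketch (not from the original post; function names, the linear-regression loss, and the batch_size value are illustrative assumptions) showing why a full-batch gradient step touches the whole dataset while a mini-batch step uses only a noisy, cheap estimate:

```python
import numpy as np

def full_batch_gradient(w, X, y):
    """Gradient of the mean squared error over the *entire* dataset:
    one update step requires a full pass over X."""
    residual = X @ w - y
    return 2 * X.T @ residual / len(y)

def sgd_epoch(w, X, y, batch_size=32, lr=0.01, rng=None):
    """One epoch of mini-batch SGD: shuffle, split into batches,
    and take one noisy gradient step per batch."""
    rng = rng or np.random.default_rng()
    indices = rng.permutation(len(y))
    for start in range(0, len(y), batch_size):
        batch = indices[start:start + batch_size]
        # Gradient estimated from the batch only: noisy but cheap,
        # and the noise is what helps escape shallow local minima.
        residual = X[batch] @ w - y[batch]
        grad = 2 * X[batch].T @ residual / len(batch)
        w = w - lr * grad
    return w

# Toy usage: fit y = 3x + noise with mini-batch SGD.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 1))
y = 3 * X[:, 0] + 0.1 * rng.normal(size=1000)
w = np.zeros(1)
for _ in range(20):
    w = sgd_epoch(w, X, y, rng=rng)
print(w)  # close to [3.]
```

Note that each call to `sgd_epoch` performs many cheap parameter updates, whereas a single full-batch step (via `full_batch_gradient`) would cost an entire pass over the data for just one update, which is the point of advantage (1) above.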