ML Knowledge
How does stochastic gradient descent converge, and what makes it advantageous compared to traditional gradient descent?
Machine Learning Engineer
Apple
Elastic
Red Hat
Adyen
IBM
Faire
Answers
Anonymous
2 months ago
Because each update uses a randomly selected subset of samples (a mini-batch), SGD generally does not converge to a point where the gradient is exactly zero, i.e. an exact minimum. Instead, the iterates oscillate in a neighborhood of a minimum, and with a decaying learning rate they settle ever closer to it. Its advantages over traditional (full-batch) gradient descent are: (1) for the massive datasets of deep learning, computing the gradient over all samples is infeasible, since a single update step would require iterating over the entire dataset; (2) the randomness helps the model escape local minima, because the mini-batch gradient only approximates the full gradient and the resulting noise can push the parameters out of such regions; (3) it makes optimization less sensitive to initialization, whereas in the non-stochastic case the gradient is deterministic and drives the model to the nearest local minimum; and (4) it acts as a mild regularizer that helps prevent overfitting, since each update is based on only a subset of the samples. A sketch contrasting the two update rules is given below.
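A minimal sketch of the difference, assuming a simple least-squares regression problem and NumPy (the dataset, learning rates, and batch size here are illustrative and not from the original answer): full-batch gradient descent touches every sample per step, while SGD makes many cheap, noisy updates per epoch.

```python
# Illustrative comparison of full-batch gradient descent and mini-batch SGD
# on least-squares linear regression. All numbers are assumptions for the sketch.
import numpy as np

rng = np.random.default_rng(0)
n, d = 10_000, 5
X = rng.normal(size=(n, d))
true_w = rng.normal(size=d)
y = X @ true_w + 0.1 * rng.normal(size=n)

def grad(w, Xb, yb):
    # Gradient of the mean squared error on the (mini-)batch (Xb, yb).
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(yb)

# Full-batch gradient descent: every step iterates over all n samples.
w_gd = np.zeros(d)
for _ in range(100):
    w_gd -= 0.1 * grad(w_gd, X, y)

# Mini-batch SGD: each step uses a small random batch, so one epoch of
# cheap, noisy updates replaces one expensive exact step.
w_sgd = np.zeros(d)
batch_size, lr = 32, 0.05
for epoch in range(10):
    perm = rng.permutation(n)
    for start in range(0, n, batch_size):
        idx = perm[start:start + batch_size]
        w_sgd -= lr * grad(w_sgd, X[idx], y[idx])

print("GD  error:", np.linalg.norm(w_gd - true_w))
print("SGD error:", np.linalg.norm(w_sgd - true_w))
```

In practice the SGD iterates keep fluctuating around the optimum because of the mini-batch noise; decaying the learning rate over epochs shrinks that fluctuation and lets the parameters settle near the minimum.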