ML Knowledge
How does stochastic gradient descent converge, and what makes it advantageous compared to traditional gradient descent?
Machine Learning Engineer
Apple
Elastic
Red Hat
Adyen
IBM
Faire
Answers
Anonymous
2 months ago
Because each update uses a randomly selected subset of samples (a mini-batch), SGD generally does not converge to a point where the gradient is exactly zero, i.e. an exact minimum. Instead, the iterates oscillate in a neighborhood of a minimum, and with a decaying learning rate they settle ever closer to it. Its advantages over traditional (full-batch) gradient descent are: (1) for the massive datasets of deep learning, computing the gradient over all samples is infeasible, since a single update step would require iterating over the entire dataset; (2) the randomness helps the model escape local minima, because the mini-batch gradient only approximates the full gradient and the resulting noise can push the parameters out of such regions; (3) it makes optimization less sensitive to initialization, whereas in the non-stochastic case the gradient is deterministic and drives the model to the nearest local minimum; and (4) it acts as a mild regularizer that helps prevent overfitting, since each update is based on only a subset of the samples. A sketch contrasting the two update rules is given below.
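A minimal sketch of the difference, assuming a simple least-squares regression problem and NumPy (the dataset, learning rates, and batch size here are illustrative and not from the original answer): full-batch gradient descent touches every sample per step, while SGD makes many cheap, noisy updates per epoch.

```python
# Illustrative comparison of full-batch gradient descent and mini-batch SGD
# on least-squares linear regression. All numbers are assumptions for the sketch.
import numpy as np

rng = np.random.default_rng(0)
n, d = 10_000, 5
X = rng.normal(size=(n, d))
true_w = rng.normal(size=d)
y = X @ true_w + 0.1 * rng.normal(size=n)

def grad(w, Xb, yb):
    # Gradient of the mean squared error on the (mini-)batch (Xb, yb).
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(yb)

# Full-batch gradient descent: every step iterates over all n samples.
w_gd = np.zeros(d)
for _ in range(100):
    w_gd -= 0.1 * grad(w_gd, X, y)

# Mini-batch SGD: each step uses a small random batch, so one epoch of
# cheap, noisy updates replaces one expensive exact step.
w_sgd = np.zeros(d)
batch_size, lr = 32, 0.05
for epoch in range(10):
    perm = rng.permutation(n)
    for start in range(0, n, batch_size):
        idx = perm[start:start + batch_size]
        w_sgd -= lr * grad(w_sgd, X[idx], y[idx])

print("GD  error:", np.linalg.norm(w_gd - true_w))
print("SGD error:", np.linalg.norm(w_sgd - true_w))
```

In practice the SGD iterates keep fluctuating around the optimum because of the mini-batch noise; decaying the learning rate over epochs shrinks that fluctuation and lets the parameters settle near the minimum.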