Anonymous
Due to the randomness of the method (we randomly select a subset of samples, a batch), it will in general not find a point where the gradient is exactly 0, i.e. a minimum. However, the model is likely to land in an area near a minimum. The advantages of this approach are (see the sketch below): (1) for the massive datasets used in deep learning, it is infeasible to compute the gradient over all samples, since a single update step would require iterating over the entire dataset; (2) the randomness helps the model escape local minima, because the gradient estimated from a small batch does not coincide with the full gradient, so the model does not simply get stuck in the nearest local minimum; (3) it makes the model less sensitive to initialization, since in the non-stochastic case the gradient is deterministic and will push the model toward the closest local minimum; and (4) it helps prevent overfitting, as it is harder for the model to overfit when each update is based on only a subset of the samples.
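To make the contrast concrete, here is a minimal NumPy sketch (not from the original post; function names, the linear-regression loss, and the batch_size value are illustrative assumptions) showing why a full-batch gradient step touches the whole dataset while a mini-batch step uses only a noisy, cheap estimate:

```python
import numpy as np

def full_batch_gradient(w, X, y):
    """Gradient of the mean squared error over the *entire* dataset:
    one update step requires a full pass over X."""
    residual = X @ w - y
    return 2 * X.T @ residual / len(y)

def sgd_epoch(w, X, y, batch_size=32, lr=0.01, rng=None):
    """One epoch of mini-batch SGD: shuffle, split into batches,
    and take one noisy gradient step per batch."""
    rng = rng or np.random.default_rng()
    indices = rng.permutation(len(y))
    for start in range(0, len(y), batch_size):
        batch = indices[start:start + batch_size]
        # Gradient estimated from the batch only: noisy but cheap,
        # and the noise is what helps escape shallow local minima.
        residual = X[batch] @ w - y[batch]
        grad = 2 * X[batch].T @ residual / len(batch)
        w = w - lr * grad
    return w

# Toy usage: fit y = 3x + noise with mini-batch SGD.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 1))
y = 3 * X[:, 0] + 0.1 * rng.normal(size=1000)
w = np.zeros(1)
for _ in range(20):
    w = sgd_epoch(w, X, y, rng=rng)
print(w)  # close to [3.]
```

Note that each call to `sgd_epoch` performs many cheap parameter updates, whereas a single full-batch step (via `full_batch_gradient`) would cost an entire pass over the data for just one update, which is the point of advantage (1) above.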