Expert Answer
Anonymous
Gradient descent and its variants (Adam, momentum-based methods, ...) are the main techniques used to optimize deep learning models.
The key idea is to compute the gradient of the loss with respect to the parameters and then update the parameters in the opposite direction of the gradient, i.e., the direction that decreases the loss. Schematically: theta <- theta - lr * grad(L)(theta), where lr is the learning rate.
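To make the update rule concrete, here is a minimal sketch of plain gradient descent in NumPy on a toy least-squares problem (the data, loss, and learning rate are hypothetical, chosen only for illustration):

```python
import numpy as np

# Toy least-squares loss: L(w) = ||X @ w - y||^2 / n
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=100)

w = np.zeros(3)   # parameters to optimize
lr = 0.1          # learning rate (step size)

for step in range(200):
    residual = X @ w - y
    grad = 2 * X.T @ residual / len(y)   # gradient of the loss w.r.t. w
    w -= lr * grad                       # step opposite to the gradient

print(w)  # should end up close to true_w
```

Variants like momentum or Adam change how the step is formed from the gradients, but the loop structure stays the same.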
DL models have millions (or billions) of parameters and highly non-convex losses, so other methods are not practical: second-order methods would require forming or inverting a huge Hessian, and closed-form analytical solutions simply do not exist for such models. Since computing the gradient via backpropagation is cheap and feasible, gradient-based methods are the method of choice.
In practice, the gradient is estimated on a small random subset of samples (a mini-batch), because the full dataset is usually far too large to process at once. The noise introduced by mini-batching is also often credited with acting as a form of regularization that helps the model avoid overfitting.
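A minimal mini-batch SGD sketch on the same kind of hypothetical least-squares problem as above (batch size, learning rate, and data are again just illustrative assumptions):

```python
import numpy as np

# Mini-batch SGD: estimate the gradient from a random batch, not the full dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=10_000)

w = np.zeros(3)
lr = 0.05
batch_size = 32

for step in range(2_000):
    idx = rng.integers(0, len(y), size=batch_size)   # sample a mini-batch
    Xb, yb = X[idx], y[idx]
    grad = 2 * Xb.T @ (Xb @ w - yb) / batch_size     # noisy gradient estimate
    w -= lr * grad

print(w)  # close to true_w even though the full gradient is never computed
```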