ML Knowledge
Between ReLU and sigmoid functions, which one mitigates the vanishing gradient issue more efficiently?
Data Scientist · Machine Learning Engineer
Asana
Microsoft
Apple
Stripe
Snap
Qualcomm
Answers
Anonymous
8 months ago
The sigmoid function saturates: for large positive inputs its output approaches 1 (and for large negative inputs it approaches 0), so the gradient at those points is nearly zero. This becomes a problem when backpropagating through many layers, because the small derivatives (each at most 0.25) get multiplied together and the gradient shrinks geometrically as it flows backward. With such tiny gradients, the parameters of the early layers update very slowly and training drags on.
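A quick way to see this numerically (a minimal NumPy sketch I'm adding for illustration, not part of the original answer): the sigmoid derivative σ'(x) = σ(x)(1 − σ(x)) never exceeds 0.25, so stacking many sigmoid layers multiplies the backpropagated gradient by a factor of at most 0.25 per layer.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # peaks at 0.25 when x = 0

# Per-layer gradient factors for 20 stacked sigmoid layers,
# with pre-activations near zero (the *best* case for sigmoid).
layers = 20
pre_activations = np.random.randn(layers) * 0.5
factors = sigmoid_grad(pre_activations)

print("per-layer factors (all <= 0.25):", factors.round(3))
print("product after 20 layers:", np.prod(factors))  # at most 0.25**20 ≈ 1e-12
```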
ReLU, on the other hand, largely avoids the vanishing gradient problem. Its derivative is piecewise constant: 0 for inputs below zero and exactly 1 for inputs above zero. For active units the gradient passes through each layer unchanged rather than being scaled down, even when the inputs are large. (The zero slope for negative inputs means an inactive unit passes no gradient at all, the "dying ReLU" issue, but this does not compound across layers the way sigmoid's small derivatives do.) So ReLU mitigates vanishing gradients more effectively than sigmoid.
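For contrast, here is the same calculation with ReLU (again an illustrative sketch under the assumption that the units stay active, i.e. have positive pre-activations): the per-layer gradient factor is exactly 1, so the product does not shrink no matter how deep the stack.

```python
import numpy as np

def relu_grad(x):
    return (x > 0).astype(float)  # 1 for positive inputs, 0 otherwise

layers = 20
pre_activations = np.abs(np.random.randn(layers))  # assume active (positive) units

factors = relu_grad(pre_activations)
print("per-layer factors:", factors)                  # all 1.0 for active units
print("product after 20 layers:", np.prod(factors))   # stays 1.0, no vanishing
```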