ML Knowledge
Between ReLU and sigmoid functions, which one mitigates the vanishing gradient issue more efficiently?
Data Scientist · Machine Learning Engineer
Snap
Stripe
Apple
Microsoft
Asana
Hewlett Packard
Answers
Anonymous
9 months ago
With the sigmoid function, large-magnitude inputs push the output toward 0 or 1, where the curve flattens out, so the gradient at those points is very small and close to zero (the sigmoid derivative never exceeds 0.25). This becomes a problem when backpropagating through many layers: more and more terms between 0 and 1 are multiplied together, so the gradient keeps shrinking as it flows back toward the earlier layers. Because the gradients are so small, the parameters change very slowly and training is drawn out.
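As a rough illustration (not part of the original answer), the sketch below evaluates the sigmoid derivative at a couple of points and shows how a chain of such factors shrinks with depth; the depth of 20 is an arbitrary example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # Derivative of sigmoid: s * (1 - s); its maximum is 0.25, reached at x = 0.
    s = sigmoid(x)
    return s * (1.0 - s)

print(sigmoid_grad(0.0))   # 0.25   -> best case
print(sigmoid_grad(6.0))   # ~0.0025 -> saturated region, gradient nearly gone

# Backprop through a deep chain multiplies one such local derivative per layer,
# so even in the best case the gradient scales roughly like 0.25 ** depth.
depth = 20
print(0.25 ** depth)       # ~9.1e-13, effectively zero
```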
On the other hand, if we use a ReLU function, the vanishing gradient problem is far less severe: the slope is 0 for inputs less than 0 and exactly 1 for inputs greater than 0. For active units the gradient passes through unchanged rather than being repeatedly scaled down, so large inputs do not saturate the activation and the gradient can flow through many layers without shrinking toward zero. ReLU therefore mitigates the vanishing gradient issue more efficiently than sigmoid.
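To make the comparison concrete, here is a minimal NumPy sketch under assumed settings (30 random dense layers of width 64, 1/sqrt(width)-scaled Gaussian weights, no biases): it backpropagates an all-ones upstream gradient through the stack with each activation and compares the gradient norm that reaches the first layer.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):        return np.maximum(0.0, x)
def relu_grad(x):   return (x > 0).astype(float)
def sigmoid(x):     return 1.0 / (1.0 + np.exp(-x))
def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def first_layer_grad_norm(act, act_grad, depth=30, width=64):
    """Backprop an all-ones upstream gradient through `depth` dense layers
    and return the gradient norm that reaches the first layer."""
    x = rng.normal(size=width)
    Ws = [rng.normal(scale=1.0 / np.sqrt(width), size=(width, width))
          for _ in range(depth)]

    # Forward pass: cache pre-activations for the backward pass.
    zs, a = [], x
    for W in Ws:
        z = W @ a
        zs.append(z)
        a = act(z)

    # Backward pass: multiply by the local activation derivative, then by W^T.
    g = np.ones(width)
    for W, z in zip(reversed(Ws), reversed(zs)):
        g = W.T @ (g * act_grad(z))
    return np.linalg.norm(g)

print("sigmoid:", first_layer_grad_norm(sigmoid, sigmoid_grad))
print("relu:   ", first_layer_grad_norm(relu, relu_grad))
```

With this setup the sigmoid gradient collapses to essentially zero, while the ReLU gradient, though smaller than at the output, remains many orders of magnitude larger; the exact numbers depend on the chosen depth and initialization.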