Anonymous
In the sigmoid function, when we pass in large inputs the outputs saturate toward 1 (and toward 0 for large negative inputs). In those saturated regions the curve is nearly flat, so the gradient is very close to zero. This becomes a problem when we backpropagate through many layers: more and more factors between 0 and 1 get multiplied together, so the gradient keeps shrinking. That in turn hurts parameter updating. Because the gradients are so small, the parameters change very slowly and the training process is lengthened.
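A quick numeric sketch of this (using NumPy; the helper names here are just for illustration, not from any particular library):

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def sigmoid_grad(x):
        s = sigmoid(x)
        return s * (1.0 - s)   # derivative of sigmoid, never larger than 0.25

    x = np.array([0.0, 2.0, 5.0, 10.0])
    print(sigmoid(x))       # [0.5  0.881  0.993  0.99995]  -> output saturates toward 1
    print(sigmoid_grad(x))  # [0.25 0.105  0.0066 0.000045] -> gradient flattens to ~0

    # Backprop multiplies one such factor per layer, so the gradient shrinks fast:
    print(sigmoid_grad(2.0) ** 10)   # ~1.6e-10 after 10 layers

Even at a moderate input like 2, ten layers of these per-layer factors already push the gradient down to around 1e-10, which is why the early layers barely get updated.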
On the other hand, if we use a ReLU function, the vanishing gradient problem does not arise in the same way, because its slope is piecewise constant: for inputs less than 0 the slope is 0, and for inputs greater than 0 it is exactly 1. So even very large inputs to the activation do not shrink the gradient, since for positive inputs the slope stays at 1 regardless of how large the input gets.
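Continuing the same sketch for ReLU (again, illustrative helper names, not a library API):

    import numpy as np

    def relu_grad(x):
        return np.where(x > 0, 1.0, 0.0)   # derivative of ReLU: 0 for x < 0, 1 for x > 0

    x = np.array([-3.0, 0.5, 5.0, 100.0])
    print(relu_grad(x))          # [0. 1. 1. 1.] -> slope stays 1 for any positive input
    print(relu_grad(5.0) ** 10)  # 1.0 -> repeated multiplication does not shrink the gradient

Because the per-layer factor is 1 on the active side, stacking many layers does not drive the gradient toward zero the way the sigmoid's small derivatives do.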