Anonymous
In the sigmoid function, when we pass in large inputs the outputs saturate toward 1 (and toward 0 for large negative inputs). In those saturated regions the curve is nearly flat, so the gradient is very close to zero. This becomes a problem when we backpropagate through many layers: more and more factors between 0 and 1 get multiplied together, so the gradient keeps shrinking. That in turn hurts parameter updating. Because the gradients are so small, the parameters change very slowly and the training process is lengthened.
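A quick numeric sketch of this (using NumPy; the helper names here are just for illustration, not from any particular library):

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def sigmoid_grad(x):
        s = sigmoid(x)
        return s * (1.0 - s)   # derivative of sigmoid, never larger than 0.25

    x = np.array([0.0, 2.0, 5.0, 10.0])
    print(sigmoid(x))       # [0.5  0.881  0.993  0.99995]  -> output saturates toward 1
    print(sigmoid_grad(x))  # [0.25 0.105  0.0066 0.000045] -> gradient flattens to ~0

    # Backprop multiplies one such factor per layer, so the gradient shrinks fast:
    print(sigmoid_grad(2.0) ** 10)   # ~1.6e-10 after 10 layers

Even at a moderate input like 2, ten layers of these per-layer factors already push the gradient down to around 1e-10, which is why the early layers barely get updated.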
On the other hand, if we use a ReLU function, the vanishing gradient problem does not arise in the same way, because its slope is piecewise constant: for inputs less than 0 the slope is 0, and for inputs greater than 0 it is exactly 1. So even very large inputs to the activation do not shrink the gradient, since for positive inputs the slope stays at 1 regardless of how large the input gets.
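Continuing the same sketch for ReLU (again, illustrative helper names, not a library API):

    import numpy as np

    def relu_grad(x):
        return np.where(x > 0, 1.0, 0.0)   # derivative of ReLU: 0 for x < 0, 1 for x > 0

    x = np.array([-3.0, 0.5, 5.0, 100.0])
    print(relu_grad(x))          # [0. 1. 1. 1.] -> slope stays 1 for any positive input
    print(relu_grad(5.0) ** 10)  # 1.0 -> repeated multiplication does not shrink the gradient

Because the per-layer factor is 1 on the active side, stacking many layers does not drive the gradient toward zero the way the sigmoid's small derivatives do.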