Anonymous
Vanishing and exploding gradients usually occur in sequential models where the weights are shared across time steps. Because the same weight matrix enters the computation at every step, backpropagation through the sequence keeps multiplying the gradient by similar factors: if those factors are greater than 1, the gradient grows until it explodes (producing inf or nan values), and if they are less than 1, the gradient shrinks toward 0 and vanishes.

We can use a few strategies to combat this. First, gated architectures such as GRUs and LSTMs use gates to control how much of the past state is carried forward, so the network can effectively reset its memory when earlier words are not relevant, which helps gradients flow across long sequences. Second, gradient clipping sets a maximum value (or norm) for the gradients and, whenever a gradient exceeds that bound, clips it back, which directly prevents exploding gradients. Third, we can normalize the gradient values to keep them within a fixed range, say between 0 and 1. Finally, we can apply layer, batch, or instance normalization so that the activations themselves stay in a stable range.
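To make the gradient clipping idea concrete, here is a minimal PyTorch sketch. It assumes a toy LSTM with made-up sizes and a dummy regression loss (all of those names and numbers are illustrative, not from the original answer); the relevant part is the one clipping call placed between `backward()` and `optimizer.step()`.

```python
import torch
import torch.nn as nn

# Hypothetical setup: a small LSTM and SGD optimizer, sizes chosen arbitrarily.
model = nn.LSTM(input_size=32, hidden_size=64, batch_first=True)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.randn(8, 100, 32)        # batch of 8 sequences, 100 time steps each
targets = torch.randn(8, 100, 64)  # dummy targets for a regression-style loss

output, _ = model(x)
loss = nn.functional.mse_loss(output, targets)

optimizer.zero_grad()
loss.backward()

# Option 1: clip each gradient element into [-1.0, 1.0]
# torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=1.0)

# Option 2: rescale all gradients so their combined L2 norm is at most 1.0,
# which keeps their direction but bounds their overall magnitude
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

optimizer.step()
```

Clipping by norm (option 2) is often preferred because it preserves the direction of the update; clipping by value (option 1) is simpler but can distort it.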