Exploding and Vanishing Gradients and How to Fix Them

Okay, this is another post about Artificial Neural Networks. I’m going to talk about two very closely related problems. They are the Vanishing Gradient Problem, and the Exploding Gradient Problem.  These problems are caused by poor initialization and poor choice of activation functions.

Poor choice of activation functions is the MAIN problem.  Poor initialization really only matters if you’re using sigmoid functions.  Rectified Linear Units (ReLUs) solve both of the gradient problems, so most of this article is moot unless you’re using sigmoid activation functions.  And if you are, unless there is a specific reason why you must, then you should stop.

The problems arise when the fitness landscape is too flat (Vanishing Gradients), or too steep (Exploding Gradients). The process of seeking the lowest error relies on being able to tell what direction is downhill, and that means being in part of the landscape where most training cases have a similar, fairly strong gradient in the same direction.

With a Vanishing Gradient, it’s so flat that no “downhill” can be found in most directions. The system trying to seek a valley floor just can’t find any reason to go anywhere in particular. It takes a lazy step or two in a direction that might be downhill, or might just be an artifact of the current training case that doesn’t even correlate with general errors. With an Exploding Gradient on the other hand, “downhill” is so steep that the exploration amounts to taking a look at a few  training cases and leaping off a cliff in the indicated direction, usually smacking into another cliff with the next training case and bouncing higher than the cliff you jumped off of. Either way, it’s hard to reach the valley floor.
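
To make the two failure modes concrete, here is a minimal sketch of plain gradient descent on a one-dimensional error curve. The curve error(w) = curvature * w**2, the function name descend, and the particular curvature and learning rate values are all assumptions chosen for illustration; a real network's landscape is nothing like this tidy.

```python
def descend(curvature, learning_rate, start=1.0, steps=5):
    """Gradient descent on error(w) = curvature * w**2; returns the visited weights."""
    w = start
    path = [w]
    for _ in range(steps):
        gradient = 2.0 * curvature * w   # derivative of curvature * w**2
        w -= learning_rate * gradient    # ordinary gradient step
        path.append(w)
    return path

# Nearly flat landscape: the gradient is so small the weight barely moves.
flat = descend(curvature=0.0005, learning_rate=0.1)

# Very steep landscape: each step overshoots the valley floor at w = 0
# and lands higher up the opposite wall than where it started.
steep = descend(curvature=50.0, learning_rate=0.1)

print("flat: ", flat)
print("steep:", steep)
```

With these made-up numbers the flat run is still sitting almost exactly where it started after five steps, while the steep run flips sign every step and grows by roughly a factor of nine each time. Neither one gets anywhere near the valley floor.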

This happens in large part because of unfortunate properties of the activation functions in use for neural networks.  The most popular activation functions are arctangent, hyperbolic tangent, and the logistic sigmoid function.  In any network of more than two hidden layers, all of these are crap.

Yes, I’ll say it again.  If you’re trying to train something deeper than two hidden layers, these are all crap.  Here’s why.

The result of multiplying their outputs through more than a couple of hidden layers vanishes or explodes depending on where the inputs are balanced along an exponential curve, and because those inputs keep changing, the exponential curves are very difficult to work with. The exponentials are less pronounced and less prevalent closer to the input nodes, and the shifting fitness landscape caused by movement of the lower layers gives the upper layers inconsistent feedback. As a result, somewhere near the beginning of training these networks always find their largest gradients in shifting the weights of the layers closest to the input. This causes the fitness landscape seen by weights closer to the output to shift wildly. And it goes right on shifting wildly, because the weights at lower levels keep moving until they have taken extreme values and gotten themselves into vanishing-gradient territory, and all that time the feedback at upper layers stays inconsistent. In the meantime, their movement will have shifted the gradients in the fitness landscape right out from under most of the weights in upper layers, leaving those weights in vanished-gradient territory as well. By the time the upper layers find a landscape stable enough to try to settle down and solve the problem, most of the weights in both the upper layers and the lower layers are in vanished gradients, leaving your network with only a very tiny fraction of its potential ability to solve problems.

Because the derivatives of the usual sigmoids at their tails are exponentially distributed (they approach zero in the same way as exponential functions), the slope of the fitness landscape resulting from the multiplication together of many such layers exponentially approaches zero: the classic Vanishing Gradient. If you make the learning rate large enough to get progress on the resulting very shallow gradient, then at completely unpredictable moments, when the training reaches the edge of a very flat zone, you will transition into exploding gradients, where training causes changes in the fitness landscape too rapid and extreme to follow. The logistic function, arctangent, and hyperbolic tangent all have exponential tail distributions making the gradient problems worse, and are effectively impossible to use for the initial training of a network deeper than two hidden layers.
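
A rough way to see the compounding is to multiply the logistic function's derivative by itself once per layer. The sketch below is a toy under loud assumptions: every weight is taken as 1 and every layer sees the same pre-activation, which no real network does. The names logistic_derivative and chained_slope are made up for this example.

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def logistic_derivative(x):
    s = logistic(x)
    return s * (1.0 - s)   # peaks at 0.25 when x == 0

def chained_slope(pre_activation, depth):
    """Product of `depth` identical logistic derivatives (all weights assumed to be 1)."""
    return logistic_derivative(pre_activation) ** depth

for depth in (1, 2, 4, 8):
    at_center = chained_slope(0.0, depth)   # best case: steepest part of the curve
    in_tail = chained_slope(4.0, depth)     # a unit that has drifted into the tail
    print(depth, at_center, in_tail)
```

Even in the best case the slope shrinks by a factor of four per layer, so eight layers leave about 1.5e-5 of it. Once the unit is out in the tail, the product collapses far faster than that.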

Fortunately there are better choices.

Chief among these:  Use ReLUs.  Rectified Linear Units are efficient and avoid the gradient problems.

If you want scaled outputs you can use a sigmoid on the output layer only without invoking the gradient problems.
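
As a sketch of that arrangement, here is a forward pass with ReLU hidden layers and a single logistic unit on the output only. The function names, the nested-list layout for weights, and the tiny weight values are all hypothetical, picked just so the arithmetic is easy to follow.

```python
import math

def relu_layer(inputs, weights, biases):
    """One fully connected layer followed by the ramp (ReLU) activation."""
    return [
        max(0.0, sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(inputs, hidden_layers, output_weights, output_bias):
    """ReLU in every hidden layer; the sigmoid appears only at the output."""
    activations = inputs
    for weights, biases in hidden_layers:
        activations = relu_layer(activations, weights, biases)
    z = sum(w * a for w, a in zip(output_weights, activations)) + output_bias
    return logistic(z)   # squashes the final value into (0, 1)

# Hypothetical one-hidden-layer network with two hidden units.
hidden_layers = [([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0])]
result = forward([2.0, 1.0], hidden_layers, output_weights=[1.0, -1.0], output_bias=0.5)
print(result)
```

The gradient only passes through one sigmoid on its way back, so there is no chain of sigmoid slopes to multiply together.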

If you insist on using sigmoids, read on.

The only sigmoid in widespread use that has subexponential tails is the softsign function. Softsign is very sensitive to poor initialization in deep networks, but at least it doesn’t make the problem worse. A sigmoid that has subgeometric tails and is even more stable than softsign is logsig, the logarithmic sigmoid. However, it has a particularly ugly derivative that slows down a lot of GPUs.

These are just about the ONLY sigmoid activation functions you can use without invoking exploding/vanishing gradients – and even so the depth of anything you can train with softsign is limited to about four hidden layers. Logsig remains useful up through eight or more hidden layers, but even so there’s no real reason to use them; training with them is slow and you have to be very careful initializing your weights.

  • softsign: x/(|x|+1)
  • softsign derivative: 1/(|x|+1)^2
  • logsig: x/|x| * ln(|x|+1)/(ln(|x|+1)+1)
  • logsig derivative: 1/((|x|+1) * (ln(|x|+1)+1)^2)
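
For anyone who wants to try these, here is a minimal sketch of both functions in plain Python, with a finite-difference helper for sanity-checking the derivatives. The squares in the two derivatives apply to the whole parenthesized term, as you get by differentiating the functions directly. The helper numeric_derivative is something I made up for the check, and treating logsig(0) as 0 is an assumption, since x/|x| is undefined there (it is the limit from both sides, though).

```python
import math

def softsign(x):
    return x / (abs(x) + 1.0)

def softsign_derivative(x):
    return 1.0 / (abs(x) + 1.0) ** 2

def logsig(x):
    log_term = math.log(abs(x) + 1.0)
    # copysign plays the role of x/|x| and yields 0.0 at x == 0 (assumed convention).
    return math.copysign(log_term / (log_term + 1.0), x)

def logsig_derivative(x):
    log_term = math.log(abs(x) + 1.0)
    return 1.0 / ((abs(x) + 1.0) * (log_term + 1.0) ** 2)

def numeric_derivative(f, x, h=1e-6):
    """Central finite difference, used only to sanity-check the formulas."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

for x in (-3.0, -0.5, 0.7, 5.0):
    print(x, softsign_derivative(x), numeric_derivative(softsign, x),
          logsig_derivative(x), numeric_derivative(logsig, x))
```

The analytic and numeric columns should agree to several decimal places at each test point.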

There are other, nonsigmoid activation functions which mitigate these problems. The solutions in most widespread use are the ramp function (the aforementioned ReLU activation function) and softplus (which is similar to ReLU activation but smooths out the transition zone around zero).

Ramp and softplus have or approach constant (different) derivatives on opposite sides of zero, which helps mitigate the exploding gradient problem because the “directions” in the error landscape have a constant as opposed to varying slope.
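
Here is a small sketch of what that constant slope buys you. It multiplies the ramp derivative along a single path through the network, with every weight taken as 1 (an assumption made purely to isolate the activation function). The name path_slope and the pre-activation values are invented for the example.

```python
def ramp_derivative(x):
    return 0.0 if x < 0 else 1.0

def path_slope(pre_activations):
    """Product of ramp derivatives along one path (all weights assumed to be 1)."""
    slope = 1.0
    for z in pre_activations:
        slope *= ramp_derivative(z)
    return slope

active_path = [0.3, 2.0, 5.0, 0.1, 7.5, 1.2, 3.3, 0.9]   # every unit is "on"
blocked_path = [0.3, 2.0, -1.0, 0.1]                      # one unit is "off"

print(path_slope(active_path))    # 1.0
print(path_slope(blocked_path))   # 0.0
```

Eight layers deep, the slope along an active path is still exactly 1. The price is that a single unit with a negative input shuts its path off completely, but nothing shrinks or grows a little more with each layer.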

The Logarithmic Semisigmoid is a simple transformation of the logsig, and has a graph similar to a sigmoid function, but it is not a true sigmoid because it is not contained between horizontal asymptotes. It counters the exponential compounding of gradients by using logarithmic curves.

Logmin is just the upper half of the logarithmic semisigmoid. It is also well-behaved, probably for the same reason. It is biologically inspired; its ratio of output to input closely resembles the ratio of spiking rate to excitation of biological neurons.

Anyway, stop using sigmoids completely and use any of these instead, and the Exploding/Vanishing Gradient Problems won’t bug you any more.

  • ramp function: max (x, 0)
  • ramp derivative: 0 if x < 0, 1 otherwise
  • softplus function: ln(1 + exp(x))
  • softplus derivative: exp(x) / (exp(x) + 1)
  • (but use the ramp derivative for efficiency if |x| > ~4)
  • logmin function: ln(1 + x) if x > 0, 0 otherwise.
  • logmin derivative: 1/(x + 1) if x > 0, 0 otherwise.
  • logarithmic semisigmoid: ln(x+1) when x > 0, -ln(|x|+1) otherwise
  • semisigmoid derivative: 1/(|x|+1) for all x (the slope is positive on both sides of zero).
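
Here is a minimal sketch of these four functions in plain Python. The derivatives are the ones you get by differentiating each function directly: the softplus slope is the logistic function exp(x)/(exp(x)+1), and the semisigmoid slope is 1/(|x|+1) on both sides of zero. The rearranged softplus and the branching in its derivative are my own choices to avoid overflow for large |x|, and numeric_derivative is a made-up helper for checking the formulas.

```python
import math

def ramp(x):
    return max(x, 0.0)

def ramp_derivative(x):
    return 0.0 if x < 0 else 1.0

def softplus(x):
    # Same value as ln(1 + exp(x)), rearranged so exp() never overflows.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def softplus_derivative(x):
    # exp(x) / (exp(x) + 1), i.e. the logistic function, in overflow-safe form.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def logmin(x):
    return math.log1p(x) if x > 0 else 0.0

def logmin_derivative(x):
    return 1.0 / (x + 1.0) if x > 0 else 0.0

def semisigmoid(x):
    # ln(x+1) for x > 0, -ln(|x|+1) otherwise.
    return math.copysign(math.log1p(abs(x)), x)

def semisigmoid_derivative(x):
    return 1.0 / (abs(x) + 1.0)

def numeric_derivative(f, x, h=1e-6):
    """Central finite difference, used only to sanity-check the formulas."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

pairs = [(ramp, ramp_derivative), (softplus, softplus_derivative),
         (logmin, logmin_derivative), (semisigmoid, semisigmoid_derivative)]
for f, df in pairs:
    for x in (-2.0, -0.5, 0.5, 2.0):
        print(f.__name__, x, df(x), numeric_derivative(f, x))
```

The test points stay away from zero on purpose, since ramp and logmin have a kink there and a finite difference straddling it would not match either one-sided slope.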