What happens when your content generates a sudden burst of demand? How can that affect the network, and what can you do to ensure your end users still get a great experience despite surging demand?

First, let’s clear up a few concepts.

What is a traffic spike?

If you graph demand for your objects over time, a “traffic spike” is a sudden surge in demand from your users, typically at least doubling your traffic levels in a very short period of time.

The Akamai network is large and designed to handle high volumes of traffic. That said, any of the following increases in traffic level within the stated period is considered a traffic spike:

  • 50Gbps increase over 10 seconds
  • 80Gbps increase over 1 minute
  • 300Gbps increase over 5 minutes

When traffic increases that much that quickly, the sudden shift may cause “hot spots” on our network. This may degrade end-user performance until the network can self-recover.
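To check your own traffic graphs against the thresholds above, a rough sketch like the following may help. It assumes you have bandwidth samples as (timestamp in seconds, Gbps) pairs sorted by time, and it treats "increase over N seconds" as the largest rise between any two samples at most N seconds apart. Both the input format and that interpretation are assumptions made for illustration, as are the function names.

```python
from collections import deque

# Thresholds quoted in the article: (window in seconds, increase in Gbps).
SPIKE_THRESHOLDS = [(10, 50.0), (60, 80.0), (300, 300.0)]


def max_increase_within(samples, window_s):
    """Largest rise (later sample minus earlier sample) between two
    samples that are at most window_s seconds apart.

    samples: list of (timestamp_seconds, gbps), sorted by timestamp.
    """
    best = 0.0
    # Earlier samples still inside the window, kept with increasing
    # values so the front is always the minimum seen in the window.
    candidates = deque()
    for t, gbps in samples:
        # Expire samples that are now older than the window.
        while candidates and t - candidates[0][0] > window_s:
            candidates.popleft()
        if candidates:
            best = max(best, gbps - candidates[0][1])
        # Older samples with a higher value can never be the minimum again.
        while candidates and candidates[-1][1] >= gbps:
            candidates.pop()
        candidates.append((t, gbps))
    return best


def crossed_thresholds(samples, thresholds=SPIKE_THRESHOLDS):
    """Return the (window, increase) thresholds that the samples meet or exceed."""
    return [
        (window_s, increase)
        for window_s, increase in thresholds
        if max_increase_within(samples, window_s) >= increase
    ]


if __name__ == "__main__":
    # Hypothetical samples: a 55 Gbps rise within 10 seconds.
    burst = [(0, 100.0), (5, 120.0), (10, 155.0), (60, 170.0)]
    print(crossed_thresholds(burst))
```

An empty result means none of the three thresholds was reached in the sampled period. Coarse sampling (for example, one sample per minute) cannot reveal a 10-second spike, so the sample interval should be shorter than the smallest window you care about.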

What happens on the network with these levels?

You can think of Akamai load balancing as a large airplane. It’s very stable through a wide variety of conditions, and given a bit of warning, has no trouble avoiding turbulence and other problems. However, the plane can’t turn quickly: banking takes time, and sudden shifts can rock the plane.

That said, while the plane can’t avoid all adverse conditions, it will adapt better as the pilot gets more information and warning, perhaps turning into problems head-on or avoiding most of them.

The Akamai network works in much the same way. As it notices traffic ramping up for a given customer, it starts to take advantage of additional resources and “spread” load across more machines. This spreading is critical to keeping load evenly distributed across machines, but our system forbids spreading too quickly for stability reasons, much as a plane can’t turn quickly without entering a tailspin.

More specifically, there are a number of “hierarchies of spreading,” including machines in a deployment, deployments in general, and algorithms designed to make tradeoffs between performance and load spreading. These learning algorithms need time to adjust to shifting conditions.
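To build intuition for why the rate of increase matters more than the peak, here is a deliberately simplified toy model. It is not Akamai's load-balancing algorithm: the single "capacity" number, the fixed growth limit per time step, and the 20% headroom are all assumptions invented for illustration. The model only captures the idea described above, that resources assigned to a customer can grow, but only so fast.

```python
def unserved_load(demand, max_growth, headroom=1.2):
    """Toy model: total demand that arrives before capacity has caught up.

    demand: list of demand values, one per time step (arbitrary units).
    max_growth: most that assigned capacity may grow in one step
                (hypothetical stand-in for "spreading takes time").
    headroom: capacity aims for demand * headroom (hypothetical).
    """
    capacity = demand[0] * headroom
    unserved = 0.0
    for d in demand:
        # Demand arrives and is served up to the capacity already in place.
        unserved += max(0.0, d - capacity)
        # Capacity then moves toward its target, growing by a bounded amount.
        capacity = min(d * headroom, capacity + max_growth)
    return unserved


if __name__ == "__main__":
    # Same peak (400 units), reached instantly versus over 20 steps.
    spike = [100.0] * 3 + [400.0] * 20
    ramp = [100.0 + 15.0 * i for i in range(21)] + [400.0] * 5
    print("spike:", unserved_load(spike, max_growth=20.0))
    print("ramp: ", unserved_load(ramp, max_growth=20.0))
```

In this model the gradual ramp never outruns capacity, while the instantaneous jump leaves a backlog for every step until capacity catches up. The real system has several layers of spreading and adaptive algorithms, so treat this only as a picture of the trade-off, not as a prediction.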

What can I do to reduce the risk?

If you expect load spikes of this size at a known time (imagine you’re releasing software and millions of clients will auto-update when it goes live), you can ensure the best possible performance by finding a way to stagger the load. For example, if your software updates everywhere at once whenever you release an update, you may see improvement from logic that staggers the update for your end users over a period of minutes. This should cause load to increase more slowly, giving the network time to adjust.
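One way to stagger an update is to have each client wait a different amount of time after the release goes live. The sketch below is one possible approach, not a recommended or official one: it derives a stable delay from a hash of a client identifier, so clients spread roughly evenly across the window without any coordination. The names `update_delay_seconds`, `client_id`, and `release_id`, and the 10-minute default window, are hypothetical choices for this example.

```python
import hashlib


def update_delay_seconds(client_id: str, release_id: str, window_seconds: int = 600) -> int:
    """Return how long this client should wait before downloading a release.

    The delay is deterministic for a given client and release, and falls in
    the range [0, window_seconds). Including release_id means a client does
    not land in the same slot for every release.
    """
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    digest = hashlib.sha256(f"{release_id}:{client_id}".encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer and fold it into the window.
    return int.from_bytes(digest[:8], "big") % window_seconds


if __name__ == "__main__":
    # Hypothetical fleet of 10,000 clients, grouped by the minute they start.
    per_minute = [0] * 10
    for n in range(10_000):
        delay = update_delay_seconds(f"client-{n}", "release-2.0")
        per_minute[delay // 60] += 1
    print(per_minute)
```

A client would call this once when it learns about the release and sleep for the returned number of seconds before fetching. A random delay would spread load just as well; the hash-based version is used here only because it gives the same answer if the client restarts partway through the wait.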

We recognize this isn’t always possible. For example, with a large live streaming video event, users may show up very quickly with no practical way to stagger their arrival times. For an event like that, where you expect demand to surge past the thresholds listed above, you should notify your Akamai representative so we can take steps on the back end to try to improve the situation. However, nothing is as good as staggering delivery.