This article provides a quick overview of Akamai’s SureRoute technology. It starts with a description of the problem to be solved, moves on to how SureRoute addresses the problem, and concludes with some best practices to ensure your SureRoute configuration is performing optimally. The focus here is on performance, so this article does not directly cover SureRoute for failover.
What is dynamic content, and why does it need to be accelerated?
Non-cacheable content poses a serious performance bottleneck. Because it is cost-prohibitive for an e-commerce site to maintain a globally distributed origin infrastructure, such sites typically have a single live origin server farm and a single backup farm. Hence a user visiting the site will incur a long Internet round trip to the origin for every non-cacheable object on the page. Here is some recently sampled data between a few end points on the Akamai network. (Note that these are examples; the latency between other end points in these cities can vary depending on many factors.)
|Route|Latency|
|Boston to San Francisco|80 ms|
|Tokyo to London|275 ms|
|Rio de Janeiro to New York|135 ms|
|Beijing to Chicago|205 ms|
|Cape Town to Singapore|470 ms|
Because studies have shown that user abandonment rates increase after just two or three seconds of page wait time (see, for example, this document), adding a few 300-millisecond network round trips to a user’s experience can push a lot of customers off your site.
How can this be sped up?
First, let’s ask why these latencies are so high. Singapore is 9,600 km from Cape Town, so the raw speed-of-light delay should be only 32 milliseconds. Even if you double or triple the distance to account for the shape of the land masses in the way, 470 milliseconds seems excessive.
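The 32 ms figure can be reproduced with a quick calculation. The sketch below is illustrative only: the function name is made up, and the two-thirds velocity factor for light in optical fiber is a common rule of thumb rather than a number taken from the measurements above. Both results are one-way delays, so a round trip would be twice as long.

```python
SPEED_OF_LIGHT_KM_PER_S = 299_792.458  # speed of light in a vacuum


def propagation_delay_ms(distance_km: float, velocity_factor: float = 1.0) -> float:
    """One-way propagation delay in milliseconds over distance_km.

    velocity_factor scales the vacuum speed of light (1.0 = vacuum).
    """
    return distance_km / (SPEED_OF_LIGHT_KM_PER_S * velocity_factor) * 1000.0


cape_town_to_singapore_km = 9_600  # distance quoted in the text

vacuum_ms = propagation_delay_ms(cape_town_to_singapore_km)
# Assumption: light in fiber travels at roughly 2/3 of its vacuum speed.
fiber_ms = propagation_delay_ms(cape_town_to_singapore_km, velocity_factor=2 / 3)

print(f"vacuum: {vacuum_ms:.0f} ms, fiber (assumed 2/3 c): {fiber_ms:.0f} ms")
# prints "vacuum: 32 ms, fiber (assumed 2/3 c): 48 ms"
```

Even with the fiber assumption, the physical floor is an order of magnitude below the 470 ms that was measured.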
Traceroute shows that this particular request between Cape Town and Singapore went first to Johannesburg at 35 ms, then dropped into a long-haul pipe and emerged in Zurich at 200 ms, then wandered around a little in the United Kingdom taking the latency up to 215 ms, then dropped again into a long-haul hop emerging in Singapore at 470 ms. It is likely that the long-haul links involved intermediate nodes in non-IP networks, invisible to traceroute.
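To see where the time goes, the cumulative latencies quoted above can be turned into per-segment figures. This is a rough sketch: traceroute reports a cumulative latency to each hop, so subtracting neighbors only approximates the cost of each segment, and the helper name is hypothetical.

```python
from typing import List, Tuple

# Cumulative latency (ms) at each point named in the traceroute description.
hops: List[Tuple[str, int]] = [
    ("Johannesburg", 35),
    ("Zurich", 200),
    ("United Kingdom", 215),
    ("Singapore", 470),
]


def segment_deltas(cumulative: List[Tuple[str, int]]) -> List[Tuple[str, int]]:
    """Return the approximate latency added by each segment of the route."""
    deltas = []
    previous = 0
    for name, total in cumulative:
        deltas.append((name, total - previous))
        previous = total
    return deltas


for name, added in segment_deltas(hops):
    print(f"to {name}: +{added} ms")
```

By this estimate the two long-haul segments (into Zurich and into Singapore) account for about 420 of the 470 milliseconds.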
The root of the problem is that the Internet consists of a large number of distinct, autonomous Internet Service Providers (ISPs) that exchange data at fixed geographic locations called peering points. These networks manage their traffic flows, both internal and across peering points, so as to minimize cost first and maximize performance second. Hence a long route on the Internet is going to pass through multiple networks and be routed through multiple peering points, and possibly be back-hauled across significant distances within a network, so as to balance load and minimize the costs to the ISP. As these policies are fundamental to the cost model of the ISPs, there is no getting around them when relying on the default route through the Internet.
This is the problem that Akamai’s SureRoute technology works to address. As of September 2013, Akamai has a hardware presence in well over 1,000 distinct networks, distributed through over 100 countries around the world. Some of these networks peer well with each other, and some poorly. Some Akamai PoPs have good internal routing to the peering points that lead to other parts of the Internet, and some take longer internal routes. This diversity of deployment creates an environment in which there are many possible paths from edge to origin, and some of these will be faster than the default path through the Internet. To locate these paths, Akamai software actively probes the Internet paths between PoPs in the Akamai network, and between them and the customers’ origin servers. In other words, SureRoute looks for any path between Akamai’s own servers that has lower latency, a lower packet-loss rate, or both, compared with the default route on the Internet. When such a route can be found, Akamai will fetch dynamic content over that overlay route, and thereby reduce the network latency that the fetch of a dynamic object experiences.
SureRoute in more detail
In the diagram below, a user via their browser is trying to fetch a dynamic base page from an origin at “the other end of the Internet”. There are several networks and peering points in between, and perhaps the performance of this particular path is poor due to a lossy link. The browser makes a TCP connection to an Akamai server close to the user, the edge server. (Locating the best edge server for this user is the job of a separate Akamai system, not covered here.) Based on the host header sent by the browser, the edge server extracts the configuration data for the Akamai customer whose page is being fetched. That configuration data identifies the origin server as an IP address (or addresses) or a hostname. This arms the edge server with the first part of the data it needs to optimize this dynamic transfer: which origin it is trying to reach.
Since the goal of SureRoute is to find the fastest path to the origin, the edge server also needs to know how to route the traffic within the Akamai overlay, which means knowing to which other Akamai servers the edge server should send traffic in order to speed up the page fetch. Briefly described, Akamai servers throughout the world continually run probes against other Akamai servers, and, at a lower rate, against our customers’ origins. A centralized software process computes and distributes candidate fast paths from the raw latency and loss data the probes provide. A candidate path might consist of just one Akamai server between the edge server and the origin, or the path might have multiple “hops”. Each server between the edge and the origin is referred to as a parent server.
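The idea of deriving candidate paths from probe data can be illustrated with a small sketch. This is not Akamai's algorithm, which the article does not describe in detail; it is a simplified model that assumes probe latencies are simply additive, ignores packet loss, and considers only the direct path and paths with a single parent. All names and numbers are made up.

```python
from typing import Dict, List, Tuple

# Hypothetical probe results: latency in ms measured between pairs of nodes.
Probes = Dict[Tuple[str, str], float]


def candidate_paths(
    probes: Probes, edge: str, origin: str, parents: List[str], keep: int = 3
) -> List[Tuple[float, List[str]]]:
    """Rank the direct path and every one-parent path by summed latency."""
    ranked: List[Tuple[float, List[str]]] = []
    if (edge, origin) in probes:
        ranked.append((probes[(edge, origin)], [edge, origin]))
    for parent in parents:
        # Only consider parents for which both legs have been probed.
        if (edge, parent) in probes and (parent, origin) in probes:
            total = probes[(edge, parent)] + probes[(parent, origin)]
            ranked.append((total, [edge, parent, origin]))
    ranked.sort(key=lambda item: item[0])
    return ranked[:keep]


# Made-up probe data for one edge server, two parents, and one origin.
probes: Probes = {
    ("edge", "origin"): 470.0,
    ("edge", "parent-a"): 40.0,
    ("parent-a", "origin"): 250.0,
    ("edge", "parent-b"): 60.0,
    ("parent-b", "origin"): 380.0,
}

for total, path in candidate_paths(probes, "edge", "origin", ["parent-a", "parent-b"]):
    print(f"{total:.0f} ms via {' -> '.join(path)}")
```

With these sample numbers the path through parent-a ranks first at 290 ms, ahead of parent-b at 440 ms and the direct path at 470 ms.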
But latency data has noise, and conditions on the Internet change quickly. To work around this fact, the analysis servers actually send several candidate routes for each edge-server/origin pair. The edge then needs a way to pick the best one.
The best path to origin must be known at the time a user’s request arrives at an edge server, since any in-line analysis or probing would defeat the purpose of speeding things up. To accomplish this, customers are asked to place a SureRoute test object on their origin. Edge servers periodically fetch the test object from the origin using each of the candidate paths, including the direct path (the default path through the Internet from edge to origin). These fetches of the test object are called the races. When a real request comes in, the edge consults the most recent race data to send that request over the fastest path to the origin.
In the diagram, the direct path and the path through one of the parents pass through the lossy link, but the other parent does not. The edge server learns via races that the path through one parent is much faster than either of the other paths, and routes traffic via that parent.
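The selection step at the edge can be sketched in a few lines. This is a minimal illustration under stated assumptions: the path labels and timings are invented to mirror the diagram, a failed race is represented as `None`, and falling back to the direct path when no race succeeded is an assumption of this sketch rather than documented behavior.

```python
from typing import Dict, Optional


def pick_route(race_ms: Dict[str, Optional[float]], default: str = "direct") -> str:
    """Return the path with the fastest recent race.

    race_ms maps a path label to the time (ms) its last test-object fetch
    took, or None if that fetch failed.
    """
    valid = {path: t for path, t in race_ms.items() if t is not None}
    if not valid:
        # Assumption: with no usable race data, use the default Internet path.
        return default
    return min(valid, key=valid.get)


# Invented race results: the direct path and parent A cross the lossy link.
races = {"direct": 480.0, "via-parent-a": 455.0, "via-parent-b": 210.0}
print(pick_route(races))  # prints "via-parent-b"
```

Because the lookup uses only stored race results, no probing happens while a user's request is waiting.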
Advantages of SureRoute
The route-optimization aspect of SureRoute turns out to be only one of several advantages that SureRoute offers. Here are a few more:
- Akamai servers are able to hide the bulk of any connection-setup latency by maintaining persistent TCP connections. If a user were to connect directly to an origin, the TCP handshake (connection-establishment protocol) would burn up an extra long-haul network round trip in the page fetch sequence. But with SureRoute, a user connects only to a nearby Akamai server, and that server already has a persistent connection to the parent server. Hence nearly all of the long-haul connection-setup latency is avoided.
- Each Akamai server is connected to the Internet via a commercial-grade (backbone) connection. Backbone connections are far from perfect, but they are more tightly controlled and monitored than are consumer-grade connections such as cable and DSL. Therefore, the long-haul hop in any dynamic object fetch is made over a backbone connection, and so any packet-loss issues induced by the consumer-grade connections affect only a small part of the end-to-end path.
- Persistent connections between Akamai servers are re-used across customers, which means that they often see a steady flow of traffic. This means that the TCP stack has a steady source of latency and loss data via which it can open up the congestion window to the maximum that is supportable on the specific Internet path in question. In TCP terms, this means Akamai can generally avoid “slow start” delays because the Akamai-to-Akamai connections remain in “congestion avoidance.”
- SureRoute attempts to guarantee that the proposed candidate paths “differ” from each other in terms of how they are routed. More specifically, SureRoute attempts to select at least one path that passes through a different set of Autonomous Systems than the others. This makes SureRoute traffic resilient against problems that affect a single link on the Internet, such as congested peering points, router loops or black-holing due to misconfiguration or hardware fault, de-peering events, sub-sea cable cuts, etc. Because the candidate paths use different AS numbers, at least one of those candidates will generally route around the problem link, and traffic will continue to flow from edges to origin.
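The path-diversity point in the last bullet can be illustrated with a greedy selection. This sketch is an assumption about how one might favor AS-diverse candidates, not a description of how SureRoute does it: it starts with the fastest path, then adds whichever remaining path shares the fewest Autonomous Systems with those already chosen, breaking ties by latency. The `Candidate` type, path names, and AS numbers are all hypothetical.

```python
from typing import FrozenSet, List, NamedTuple


class Candidate(NamedTuple):
    name: str
    latency_ms: float
    as_path: FrozenSet[int]  # AS numbers the path traverses


def pick_diverse(candidates: List[Candidate], keep: int = 2) -> List[Candidate]:
    """Greedily choose `keep` paths, preferring low AS overlap, then low latency."""
    if not candidates:
        return []
    remaining = sorted(candidates, key=lambda c: c.latency_ms)
    chosen = [remaining.pop(0)]  # always keep the fastest path
    while remaining and len(chosen) < keep:
        used = frozenset().union(*(c.as_path for c in chosen))
        best = min(remaining, key=lambda c: (len(c.as_path & used), c.latency_ms))
        remaining.remove(best)
        chosen.append(best)
    return chosen


# Made-up candidates: A and B share ASes, C avoids them entirely.
paths = [
    Candidate("via-parent-a", 200.0, frozenset({64500, 64501})),
    Candidate("via-parent-b", 205.0, frozenset({64500, 64501, 64502})),
    Candidate("via-parent-c", 260.0, frozenset({64510, 64511})),
]

print([c.name for c in pick_diverse(paths)])
# prints "['via-parent-a', 'via-parent-c']"
```

Here the slower path through parent C is kept in preference to parent B, because a fault inside AS 64500 or 64501 would take out both A and B at once.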
Some SureRoute best practices
Make sure your SureRoute races are working. Confirm that the test object is present and accessible at your origin, and that your Akamai metadata names it correctly. SureRoute performance can be significantly degraded in the absence of valid race data.
Make sure your SureRoute test object is of a size similar to your dynamic content. Small objects often have different performance characteristics from large objects, due to TCP effects and to the impact of packet loss. The best path to an origin can actually differ depending on the size of the object fetched.
When confirming that your test object is of a suitable size, be sure to take into account any gzip compression that applies to your live dynamic traffic or to the test object. When gzip is applied, the uncompressed size of the objects is of little relevance to SureRoute; what matters is that the number of bytes transferred on the wire is approximately the same.
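One way to sanity-check this is to compare on-the-wire sizes offline. The sketch below uses Python's standard `gzip` module as a stand-in for whatever compression your origin applies; real servers may use different compression levels, so treat the numbers as estimates. The helper names and the 25% tolerance are assumptions chosen for illustration, not Akamai guidance.

```python
import gzip


def wire_size(body: bytes, gzip_applied: bool) -> int:
    """Approximate bytes sent on the wire for a response body."""
    return len(gzip.compress(body)) if gzip_applied else len(body)


def sizes_comparable(test_bytes: int, live_bytes: int, tolerance: float = 0.25) -> bool:
    """True if the test object is within `tolerance` of the live object's size."""
    return abs(test_bytes - live_bytes) <= tolerance * live_bytes


# Stand-in payloads; in practice, load a saved dynamic page and the test object.
live_page = b"<div class='item'>example product row</div>\n" * 2000
test_object = b"<div class='item'>example product row</div>\n" * 2000

live = wire_size(live_page, gzip_applied=True)
# Suppose the test object is served without gzip while live traffic is compressed.
test = wire_size(test_object, gzip_applied=False)

print(f"live: {live} bytes on the wire, test object: {test} bytes")
print("comparable" if sizes_comparable(test, live) else "mismatch: adjust the test object")
```

In this example the two bodies are identical before compression, yet the check reports a mismatch because only the live page is gzipped.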
Make sure that the test object is stored as a static object at your origin, and that the origin does not incur any significant processing delays when serving it. The function of the test object is to help measure the latency of Internet paths, not the origin. Any processing delays at the origin will introduce noise into the network measurements.
If you are using SureRoute with Site Shield, make sure that your Akamai metadata reflects that. Typically in a Site Shield configuration, the origin firewall will block traffic coming from non-Site Shield servers. If the metadata does not reflect Site Shield, then edge regions will attempt to race directly to the origin, and be blocked.