The original idea for this blog post comes from the presentation given by James Hamilton (VP & Distinguished Engineer at AWS) during re:Invent 2016. I also gathered some relevant information from the session “Another Day, Another Billion Packets” presented by Eric Brandwine (AWS Security) at re:Invent 2015.
While implementing your own network within AWS is well documented, the AWS network itself is barely documented at all, and we have to merge several sources of information to get a good overview of its design. That’s what this blog post is about.
The AWS partitions
Before deep diving into the AWS network topology, I would like to bring a piece of information to your attention that will give us a good starting point for describing the AWS network: the notion of ARN. In the AWS world, an ARN is an ID used to uniquely identify a specific resource.
An ARN has the following format: arn:partition:service:region:account-id:resource. More specifically, let’s have a look at the partition and the region. While the “region” is a common concept, the notion of “partition” is less well known. Let’s execute the following Python script:
```python
import boto3

my_session = boto3.session.Session(profile_name='myprofilename')
response = my_session.get_available_partitions()
print(response)
```
This little script gives us the names of the 3 available “partitions”, which are:
- ‘aws‘ : the most used partition, because it is the public, worldwide-available one. To register for this partition, use this link. A well-known customer of this partition is Netflix.
- ‘aws-cn‘ : this is the AWS partition dedicated to China. To register for this partition, use this link. I could not find any customer names for this partition.
- ‘aws-us-gov’ : this is the partition dedicated to the US government. This partition is not publicly available and requires a special authorization to access. One of the best-known customers of this partition is NASA, which has announced that it uses AWS as a mission-critical environment.
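Putting this together with the ARN format described above: the partition is simply the second colon-separated field of an ARN, so it can be extracted with a plain string split. The ARN below is a made-up example for illustration:

```python
def parse_arn(arn):
    """Split an ARN into its six colon-separated fields.

    maxsplit=5 keeps the resource part intact even when it
    contains colons itself.
    """
    fields = arn.split(':', 5)
    keys = ('prefix', 'partition', 'service', 'region', 'account_id', 'resource')
    return dict(zip(keys, fields))

# Made-up example ARN (fictitious account id)
arn = 'arn:aws:ec2:us-east-1:123456789012:instance/i-0123456789abcdef0'
parsed = parse_arn(arn)
print(parsed['partition'])  # aws
print(parsed['region'])     # us-east-1
```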
The AWS regions, PoPs and availability zones
The AWS network is split into 16 regions (as of today, December 15th, 2016), and 2 new regions have already been announced for 2017.
Out of the 16 available regions, 14 are dedicated to the ‘aws‘ partition, 1 to the ‘aws-cn‘ partition, and 1 to the ‘aws-us-gov‘ partition.
To reduce latency, AWS has deployed 68 PoPs (Points of Presence). These PoPs are locations where local Internet providers, or even individual companies, can exchange traffic with AWS.
How does AWS provide high availability for their services?
The answer lies in the Availability Zone (AZ) concept. An Availability Zone is one or more data centers with a specific risk profile, and two AZs within a single region never share the same risk profile. If you use AWS and follow the best-practice guides, the risk of facing an application failure is close to zero. The following map shows the number of AZs per region:
What does a region look like?
Let’s take the biggest AWS region as an example. The North Virginia region (us-east-1) has 5 AZs, which globally looks like the following figure. In this region, an AZ can host more than 300K servers and can be built from 2 to 8 different data centers.
Within an AZ, each data center is connected to the other data centers of the same AZ with multiple links to provide redundancy. An intra-AZ connection schema looks like this:
Of course, each AZ within a region is connected to the other AZs through a mesh of connections. An inter-AZ connection schema looks like this:
Finally, an AWS region also has two transit points, which are dedicated to the traffic between the region and the rest of the world (AWS or non-AWS). With the transit points, the final network design of an AWS region looks like this:
We can see that there are many different routes between any two points within an AWS region. AZs also come in different sizes: some big ones and some mid-size ones.
The biggest AZs always have double connections with the other big AZs. Mid-size AZs are “only” connected to all the big AZs but don’t have any direct connection with other mid-size AZs.
As an example, the us-east-1 region has 126 unique spans, more than 242,000 fiber strands, and uses 3,456-fiber-count cables. The goal behind this huge number of fibers is to decrease cost: it is effectively less costly to have many fibers each carrying a single wavelength than fewer fibers carrying multiple wavelengths. This cost optimization is only valid over short distances.
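As a quick sanity check on these numbers (rounded, for illustration only), dividing the total strand count by the per-cable fiber count gives the approximate number of cables deployed across the region:

```python
fiber_strands = 242_000   # total fiber strands in us-east-1 (figure from the talk)
fibers_per_cable = 3_456  # fiber count of a single cable

# Approximate number of 3,456-fiber cables needed to carry that many strands
cables = fiber_strands / fibers_per_cable
print(round(cables))  # ~70
```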
We don’t have more information regarding these transit points. Is a transit point located within an AZ? Does it only provide routing services? Can a transit point also be a PoP? All these questions will remain unanswered. If AWS is silent about its network, it is also to keep it secure, which is good for everyone.
What kind of hardware is used by AWS?
AWS has made the choice to develop its own hardware. We know of hosting companies that made a similar choice, but most of them are “only” building their own servers based on existing components. AWS is developing its own routers, network interfaces, servers and storage appliances, and any piece of hardware it thinks is more valuable to build than to buy.
AWS also has its own protocol development team to go with its own network hardware. This enables AWS to provide a unique, robust SDN (Software Defined Network) solution that meets its requirements and is flexible enough to host millions of customers.
AWS acquired Annapurna Labs in 2015. This company still exists as “an Amazon company” and specializes in developing chips. All new servers installed in any of the AWS data centers across the globe include (or will include) at least one Annapurna Labs Amazon chip dedicated to the network.
We can see in this picture that AWS uses 25GbE (Gigabit Ethernet) for its NICs (Network Interface Cards), where the market is more oriented toward 10GbE or 40GbE. This choice is quite simple to understand: the price of a 25GbE solution is roughly the same as that of a 10GbE one, while a 40GbE solution costs four times the price of a 10GbE one. This means that a 50GbE link (two 25GbE lanes) costs just a little more than half the price of a 40GbE link.
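The pricing argument can be made concrete with the relative figures given above, taking a 10GbE link as the unit price. These ratios come from the talk, not from any current price list:

```python
# Relative link prices, with a 10GbE link as the unit (ratios from the talk)
price_10g = 1.0
price_25g = 1.0            # "roughly the same" as 10GbE
price_40g = 4.0            # "four times the price" of 10GbE
price_50g = 2 * price_25g  # a 50GbE link built from two 25GbE lanes

# Half the price of 40GbE, for 25% more bandwidth
print(price_50g / price_40g)  # 0.5
```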
During his keynote, James Hamilton also gave some information about the top-of-rack router built and used by AWS. This router is powered by a Broadcom Tomahawk ASIC and can manage up to 128 ports at 25GbE at full speed, which gives the router a 3.2 Tbps data plane. The reason why AWS builds its own routers is that commercial routers (which are always much more expensive than developing your own when used at large scale) don’t have a management board able to handle the number of configuration changes required in an AWS environment.
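The 3.2 Tbps figure follows directly from the port count and per-port speed:

```python
ports = 128           # ports on the top-of-rack router
port_speed_gbps = 25  # 25GbE per port, at full speed

# Aggregate data plane capacity, in Tbps
data_plane_tbps = ports * port_speed_gbps / 1000
print(data_plane_tbps)  # 3.2
```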
By building their own routers and network interfaces, and with a developers team dedicated to the network protocols, AWS can really manage the SDN the way they want.
As an example, let’s see how an ARP request works on the AWS network
A small reminder: an ARP request on an IP network is used by a host to discover which MAC address is linked to a specific IP address.
In a traditional network, an ARP request is a broadcast sent by a host onto the network (this is the ARP who-has message). The only host that will answer this broadcast (with an ARP reply is-at message) is the one that effectively owns the IP address. The figure below describes this process, which is valid for any device connected to a network.
On the AWS side, the situation is quite different because we are in an SDN environment, which means that the IP blocks used by different customers can overlap.
The AWS answer to this situation is called the “Mapping service”.
This service (which is invisible to the customer) is the cornerstone of the AWS SDN (Software Defined Network) solution. The mapping service registers every EC2 instance started by a customer and knows on which hypervisor (physical host) and in which VPC each instance is running.
With this information, when an EC2 instance sends an ARP request to reach another instance, instead of allowing this broadcast onto the network, the hypervisor catches the request and sends it to the mapping service database, which replies with the requested MAC address and also with the IP of the hypervisor hosting the target instance. This information is cached in the hypervisor’s mapping service cache, and the ARP reply is sent back to the initiator of the ARP request.
On the operating system side, an ARP request is sent using a broadcast message and an ARP reply is received, so nothing changes compared to a normal ARP process. In the background, however, AWS has implemented its own solution to match its requirements. The image below explains this process:
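The behavior described above can be sketched as follows. This is a toy model, not actual AWS code: the mapping service is reduced to a dictionary keyed on (VPC, IP), which also shows why overlapping customer IP ranges are not a problem, and the hypervisor keeps a local cache so repeated lookups don’t hit the service again:

```python
# Toy model of the AWS mapping service (illustration only, not AWS code).
# Central registry: (vpc_id, private_ip) -> (mac, hypervisor_ip).
# The same private IP can exist in two VPCs because the VPC id is part of the key.
MAPPING_SERVICE = {
    ('vpc-a', '10.0.0.5'): ('02:00:00:aa:bb:01', '172.16.1.10'),
    ('vpc-b', '10.0.0.5'): ('02:00:00:cc:dd:02', '172.16.2.20'),  # same IP, other VPC
}

class Hypervisor:
    def __init__(self):
        self.cache = {}  # local mapping-service cache

    def handle_arp_request(self, vpc_id, target_ip):
        """Intercept a guest's ARP broadcast and answer it locally."""
        key = (vpc_id, target_ip)
        if key not in self.cache:
            # No broadcast ever hits the wire: ask the mapping service instead
            self.cache[key] = MAPPING_SERVICE[key]
        mac, hypervisor_ip = self.cache[key]
        # The guest only sees a normal ARP reply; the hypervisor also
        # learned which physical host to tunnel the packets to.
        return mac

hv = Hypervisor()
print(hv.handle_arp_request('vpc-a', '10.0.0.5'))  # 02:00:00:aa:bb:01
print(hv.handle_arp_request('vpc-b', '10.0.0.5'))  # 02:00:00:cc:dd:02
```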
Does AWS manage a global worldwide network outside the regions?
The last question regarding the AWS network is the way inter-region communications are managed.
Obviously, traffic is not carried from one region to another using a third-party Tier 1 provider. Here again, AWS has implemented its own solution, which takes the shape of a global network. The figure below shows what it looks like:
Every link displayed on this map is a 100GbE network (yes, an incredible 100GbE of bandwidth!). So, what is the purpose of AWS implementing this very expensive network? There are multiple reasons:
- The first one is technical. Implementing your own network allows you to achieve better latency and a good packet loss rate, and thus to improve the overall quality.
- The second reason is to avoid network interconnection capacity conflicts.
- The last reason is to keep greater control over operations.
Unlike within regions, where duplicating fibers is less expensive than using multiple wavelengths in a single fiber, transoceanic and intercontinental cables are cheaper when the wavelengths are multiplied in a single fiber and the number of fiber strands in the cable is decreased. For its newest intercontinental cable project, AWS put in place a cable between Australia, New Zealand, Hawaii and Oregon. This 14,000 km cable goes down to 6,000 m under the ocean, has 3 fibers with 100 waves at 100G each, and has a repeater every 60 to 80 km. The cable and the repeaters are shown in the pictures below:
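The headline numbers of that cable multiply out as follows (design capacity only, before any protection or overhead; the 70 km repeater spacing is simply the midpoint of the 60 to 80 km range given above):

```python
fibers = 3                # fibers in the cable
waves_per_fiber = 100     # wavelengths per fiber
gbps_per_wave = 100       # 100G per wavelength
cable_length_km = 14_000
repeater_spacing_km = 70  # midpoint of the 60-80 km range

# Raw design capacity of the cable, in Tbps
capacity_tbps = fibers * waves_per_fiber * gbps_per_wave / 1000
# Approximate number of repeaters along the route
repeaters = cable_length_km / repeater_spacing_km

print(capacity_tbps)     # 30.0
print(round(repeaters))  # 200
```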
Note that this global AWS network reaches all AWS regions of the “aws” and “aws-us-gov” partitions but is not connected to the “aws-cn” China partition, which is isolated from the rest of the network.
I hope this post has helped you better understand the topology and specificities of the AWS network. The AWS network is built on a robust, custom-tailored solution, with all the redundancy needed to meet the highest availability standards.
I will close this post with a well-known sentence from the AWS stickers: “FRIENDS do not let FRIENDS build DATA CENTERS” 🙂