Cisco Nexus - Part 2 - Design Basics

My last post went over the Nexus line and the basics of each device. This post dives into the design aspect of a DC network and how it differs from a traditional hierarchical 3-tier enterprise network. I'm not going into the deep details of designing a Nexus infrastructure, but I will touch on the main concepts and provide plenty of links to further your journey down the rabbit hole.

Now, let me be honest with everyone. I'm by no means a DC architect or engineer (yet), and my exposure to these concepts in the real world is limited. The purpose of this post is to shed some light on changes to how DCs are designed and also to deepen my understanding of “the new way”. This post is based solely on the research I have done and is an attempt to put the puzzle pieces together and present my results.

If you think I missed the point on something, or am just flat out wrong, call me on it. Hell, pick my post apart piece by piece if you see fit. I'm looking for interaction and feedback and welcome anything anyone can bring to the table.

With that said, let's dig in!!

Traditional Hierarchical Model 

Everyone is familiar with this 3-tier design. We have all had it beaten into our heads for our whole networking careers. The hierarchical model has been the industry standard for network design since the late '90s. This model works great for an enterprise network, but it has many drawbacks that are not acceptable in the DC. Let's go over a few of them:

Spanning Tree (STP) has been a thorn in the side of many network engineers. Even though STP's loop avoidance is a must, desperately needed links sit unused because STP places them in the blocking state. When you need to increase your bandwidth, you have several options:
  • Upgrade hardware for faster interface speed 
    • Up until recently, 10Gbps was the highest affordable option available. 40G/100G is becoming cheaper and is being introduced on more products.
  •  Link-Aggregation (LAG) by bundling links
    •  Each vendor has limitations on the number of links in a single LAG, and you can only cram so many interfaces into a chassis.
  •  Move the L2/L3 boundary to the access layer and route all links
    • Drastically limits the footprint of a VLAN. This option is great for limiting STP, but bad for newer technologies such as VM mobility, which require VLANs spread out across the DC.
Another drawback of the hierarchical design shows up when east-west traffic flows increase, as they do in the DC. The 3-tier model is best suited for north-south traffic flows with a limited amount of east-west traffic. For a good breakdown of what the hell east-west and north-south traffic flows are and why we care, check out Greg Ferro’s post on the subject.

In a nutshell, a standard traffic flow goes internet-core-distribution-access-server and then back out. The majority of traffic stays within this flow, and little traffic ever needs to cross the distribution or core switches to reach resources on another access switch (an east-west flow). In the past, design recommendations suggested 20:1 oversubscription on access-to-distribution uplinks and 4:1 oversubscription on distribution-to-core uplinks, since resources rarely utilized their full link speed. Even though these ratios might work in a standard enterprise network, DCs cannot operate with these levels of oversubscription. DC resources such as virtualization and network storage demand a large number of high-speed (10G) full line-rate access ports in every rack.
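An oversubscription ratio is just total downstream (host-facing) bandwidth divided by total uplink bandwidth. Here's a minimal sketch; the 48-port 1GE access switch is a made-up example, not a specific model:

```python
def oversubscription(edge_gbps: float, uplink_gbps: float) -> float:
    """Ratio of host-facing bandwidth to uplink bandwidth (e.g. 4.0 means 4:1)."""
    return edge_gbps / uplink_gbps

# Hypothetical enterprise access switch: 48 x 1GE host ports, 2 x 10GE uplinks.
print(oversubscription(48 * 1, 2 * 10))   # 2.4, i.e. 2.4:1

# The classic guidelines mentioned above, expressed the same way:
# 20:1 access-to-distribution, 4:1 distribution-to-core.
```

Note that even this modest example is far tighter than the old 20:1 guideline, which is exactly why racks full of line-rate 10G ports break the traditional model.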

Even worse, as VM mobility becomes a must, workloads must be able to shift from one area of the DC to another at a moment’s notice without crippling the infrastructure. And you guessed it, this traffic flows east-west.

Clos Fabric (aka Leaf/Spine)

OK so we now know the pitfalls of the hierarchical design. What do we do about it?

Well, as the title of this section says, the majority of vendors have modeled their product lines around the concept of a Clos fabric, or what many term the Leaf/Spine model. This model looks to have become the clear winner for DC design moving forward.

Here are two great videos by Brad Hedlund and Ivan Pepelnjak that break down the Leaf/Spine model.

As you can see there are two types of switches in the Leaf/Spine model.

Leaf Switches

Leaf switches provide servers and resources access to the network (aka the fabric). Leaf switches have two port types: access ports and fabric ports. Access ports are... well, access ports, and fabric ports are the uplinks used to connect the leaf switches to the spine switches.
In the Leaf/Spine model each leaf switch connects to every spine switch, making all points in the fabric an equal distance apart. Unlike the east-west problem described above, this provides the same number of hops and the same performance between any two endpoints in the fabric, without the bottleneck introduced in a hierarchical design.

Leaf switches are usually placed in a top-of-rack (ToR) position within each row or pod. This setup limits the inter-rack cabling to only fabric links back to the spine switches.

Spine Switches

Spine switches are the brains of the fabric. They uplink all leaf switches to the fabric and make the traffic forwarding decisions.

Spine switches are usually placed in an End-of-Row (EoR) or Middle-of-Row (MoR) position. The distance between spine and leaf switches is limited both by the supported distance of the optics used for fabric links and by the budget allotted for optics; the longer the fiber run, the more expensive the optics become. MoR designs can be used to cut down the maximum distance between the furthest leaf switch and the spine switch. Chris Marget has some great posts (as always) breaking down the details of switch placement and how it affects a design: Part 1, Part 2, Part 3, Part 4.

Almost all DC designs will fall within the 100m distance supported by the SFP+ multi-mode fiber that is predominantly used for fabric connections. 40GE QSFP+ also supports these distances, unless you are looking to deploy breakout cables.

Build out and Port Requirements

There are two important requirements you need to consider when designing a Leaf/Spine architecture: port density per rack and the oversubscription ratio per access port. With the Leaf/Spine model, port density and oversubscription are tied directly to the number of fabric connections and the number of spine switches.

For example, let’s look at the Cisco Nexus 2232TM, which has 32 1/10GE host ports (SFP+) and 8 10GE fabric interfaces. With 8 10GE fabric interfaces, the 2232 FEX can support up to 80Gbps of line-rate, non-blocking fabric access. With 32 access ports at 10GE each, that equals 320Gbps of edge capacity. So the 2232TM comes out with an oversubscription ratio of (320/80) 4:1.

Keep in mind that with only 8 fabric interfaces, a Leaf/Spine fabric built on the 2232TM can only expand to 8 spine switches. When designing your fabric, the number of fabric ports on each leaf determines both the oversubscription ratio and the maximum number of spine switches, while the total number of ports on the spine switch determines the total scale-out of the fabric.
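The 2232TM arithmetic generalizes into a quick back-of-the-napkin calculator. This is a sketch under simple assumptions (one fabric link from every leaf to every spine, identical port speeds per tier); the 96-port spine is a hypothetical figure for illustration, not a specific Nexus model:

```python
def fabric_size(leaf_access_ports: int, leaf_fabric_ports: int,
                access_gbps: float, fabric_gbps: float,
                spine_ports: int) -> dict:
    """Scale-out limits of a simple leaf/spine fabric.

    Assumes each leaf runs one fabric link to every spine, so spine count
    is capped by the leaf's fabric ports, and leaf count by the spine's
    port count.
    """
    max_leaves = spine_ports
    return {
        "oversubscription": (leaf_access_ports * access_gbps)
                            / (leaf_fabric_ports * fabric_gbps),
        "max_spines": leaf_fabric_ports,
        "max_leaves": max_leaves,
        "max_access_ports": max_leaves * leaf_access_ports,
    }

# Nexus 2232TM numbers from above: 32 x 10GE access, 8 x 10GE fabric,
# paired with a hypothetical 96-port spine.
print(fabric_size(32, 8, 10, 10, 96))
```

With those inputs the calculator confirms the 4:1 ratio, caps the fabric at 8 spines, and shows that 96-port spines would allow 96 leaves for 3,072 total access ports.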

Brad Hedlund has another good video here which breaks down how to calculate the number of access ports and oversubscription rates. Even though he is using Dell/Force10 in his example, it’s the same for all gear.

Alright, now that we have the theory down, let's get our hands dirty with some configuration!


  1. Great overview on leaf/spine design! You mentioned that this design commonly deploys one leaf switch per cabinet. In a scenario where the DC is a colocation facility with numerous clients, how do we provide redundant switch links from the servers to 2 or more leaf switches? If we only provide access for servers to the leaf switch in the cabinet, it seems like the leaf switch becomes a single point of failure for the cabinet (or customers in that cabinet).

  2. Good point, and it will be discussed in the next few posts. In more real-world solutions you will see multiple leaf switches plus a management switch in each cabinet/rack, each providing ports for a specific function (data, storage, mgmt, etc...).

    With Cisco you also have vPC, which allows you to uplink your server bonds to multiple leaf switches. vPC can also be used to uplink leaf switches to a pair of spine switches (usually Nexus 5000) without STP sacrificing a link.

    There are several different design considerations with vPC and the Nexus line that I'll hopefully go into in some depth in the next few posts.

    Thanks for the comments!!

  3. I was eager to know about Cisco 3 layer hierarchical network design. Thanks for sharing this post.

  4. Great overview - deservedly the top of a Google search for "what the hell is leaf spine" :o)

  5. Great series of articles. Happy to read.
