Cisco Nexus - Part 4.3 - vPC Failure Scenarios

Since the two peers in a Nexus vPC domain keep separate control planes but synchronize state with each other, failures within the domain can result in some unusual outcomes. Anyone designing or administering a Nexus vPC environment needs to be very familiar with these failure scenarios and understand the impact of each.

Let’s get started!

vPC Member Port Failure

This scenario is the simplest and breaks down just like a standard port failure in an LACP bond. In this example, the vPC 101 member link between FEX 101 and N5k-01 fails. Once N5k-01 recognizes the failure, it shifts all traffic destined for FEX 101 across the peer-link to N5k-02. Any traffic flows coming into N5k-02 towards FEX 101 traverse the same path as before without interruption.

vPC Peer-Link Failure

A peer-link failure can have some ill effects on your network if you’re not careful with your design strategy. When the peer-link fails but the keepalive link is still passing heartbeats, the current vPC secondary switch shuts down all of its vPC member ports. N7ks and N6ks will also shut down all vPC member VLANs and their configured SVIs.
Essentially, the secondary recognizes the loss of the peer-link and closes up shop so the primary can handle all functionality. All traffic flows shift to the vPC primary until the peer-link is restored and a consistency check can be run.

At first glance a peer-link failure doesn’t seem like a big issue but there are several gotchas you have to understand.

Orphaned ports on the secondary are now black-holed. Since all vPC member VLANs and SVIs are down/down on the secondary peer, the uplinks to the core or distribution are also down. Since the peer-link is down as well, traffic from orphaned ports cannot leave the secondary peer.

One option to work around this limitation is to dual-home your orphaned ports to both peer switches and run a separate non-vPC trunk between the peers. This option is useful in situations where hosts are unable to participate in vPC because they don’t support LACP.
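As a rough sketch of that separate trunk, the configuration on each peer might look something like the following. The port-channel number, member interfaces and VLAN range are all placeholders chosen for illustration, and the sketch assumes the non-vPC VLANs are carried only on this trunk and not on the peer-link.

```
! Hypothetical values: port-channel 20, Ethernet1/31-32, VLANs 200-210
feature lacp

! Physical links between the two peers, separate from the peer-link
interface Ethernet1/31-32
  switchport mode trunk
  switchport trunk allowed vlan 200-210
  channel-group 20 mode active

! Plain trunk port-channel: no "vpc peer-link" and no "vpc <number>" here
interface port-channel20
  switchport mode trunk
  switchport trunk allowed vlan 200-210
```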

Keep in mind with this setup that orphaned ports are not shut down during a peer-link failure, so if you are using server-side load-balancing, the traffic sent to the secondary will still be black-holed. In version 5.0(3)N2(1) Cisco added the command vpc orphan-port suspend, which suspends the secondary’s orphaned ports during a peer-link failure. Configure this on all orphaned ports on both peers.
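A minimal sketch of that command in place, assuming Ethernet1/10 is one of your orphaned host ports (the interface number is only a placeholder):

```
! Ethernet1/10 is a placeholder for a single-attached (orphan) host port
interface Ethernet1/10
  vpc orphan-port suspend
```

Running show vpc orphan-ports on each peer lists the ports the switch currently treats as orphans, which is a handy way to find every interface that needs the command.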

The vPC consistency check is unable to run during a peer-link failure. Without the consistency check, new vPCs cannot be brought up and, even worse, vPC ports that go down and then come back online will stay down/down until the peer-link is restored.
The workaround for this situation is to use the auto-recovery or reload-restore command.

To configure auto-recovery, use the following commands on both the N7k and N5k:
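A minimal sketch, assuming a vPC domain ID of 1 (a placeholder; substitute your own domain ID) and applied on both peers:

```
! Domain ID 1 is a placeholder
vpc domain 1
  auto-recovery
```

The command also accepts an optional reload-delay value if you want to tune how long a rebooted peer waits for its neighbor before bringing up its vPCs.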

On 7Ks and 5Ks running an NX-OS version older than 5.0(2), use:
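A minimal sketch for those older releases, assuming the command meant here is the reload-restore option named earlier and again using a placeholder domain ID of 1. Check the exact syntax against the command reference for your release before relying on it.

```
! Assumption: reload restore is the older counterpart to auto-recovery
! Domain ID 1 is a placeholder
vpc domain 1
  reload restore
```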

You might even see some older documentation referring to the old way (NX-OS 4.x and older) using the peer-config-check-bypass command to accomplish auto-recovery.

vPC Peer-Keepalive Failure

Peer-keepalive failures are not service impacting on their own, since the link’s only purpose in life is carrying heartbeats between vPC peers. But just because the peer-keepalive is down doesn’t mean you can forget about it.

If your peer-keepalive is down and then your peer-link goes out, your troubles get a lot worse. With both links down you face the dreaded split-brain scenario, where both peers take on the active role and continue to process data for all ports. Havoc soon follows across your infrastructure.
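Because a dead keepalive is silent until the peer-link also fails, it is worth checking its state regularly. The following show commands are one way to do that from either peer; output is omitted here since it varies by platform and release.

```
! Keepalive status, destination and timers
show vpc peer-keepalive

! Overall vPC domain, peer-link and per-vPC status
show vpc brief

! Which peer is currently primary and which is secondary
show vpc role
```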

vPC Peer Switch Failure

This situation starts out simple but has some complex variations that can cause issues. If a vPC peer switch fails, the remaining peer switch keeps chugging along and traffic flows shift to it. In this scenario, you lose half the vPC bandwidth since there are no longer two peer switches to handle the load. If you’re tight on over-subscription or bandwidth usage, this scenario can cause some traffic issues. Keep this in mind when designing your infrastructure.

Also consider the situations where you lose both peers at once, for example due to a power outage. Suppose that once power is restored only one peer switch comes back online. This peer will take over as primary, but the consistency check will fail because the peer-link is down. Without auto-recovery configured, the peer switch will be unable to bring up any vPC member port until the peer-link is restored.
