Since the two peers in a Nexus vPC domain keep independent control planes and
only synchronize state with each other, failures within the domain can result
in some unusual outcomes. Anyone designing or administering a Nexus vPC
environment needs to be very familiar with these failure scenarios and
understand the impact of each.
Let’s get started!
vPC Member Link Failure
This scenario is the simplest and breaks down just like a standard port
failure in an LACP bond. In this example, the vPC 101 member link between
FEX 101 and N5k-01 fails. Once N5k-01 recognizes the failure, it shifts all
traffic directed towards FEX 101 across the peer-link to N5k-02. Any traffic
flows coming into N5k-02 towards FEX 101 traverse the same path as before
without interruption.
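If you want to confirm what the switches see after a member link drops, a couple of standard show commands are a reasonable place to start. This is only a sketch of where to look, assuming each peer has a single member link in vPC 101; output varies by platform and release, so none is shown here.

```text
! Run on both peers. Under the assumption above, vPC 101 should show
! as down on the peer that lost its member link (N5k-01) and up on the other.
show vpc brief
show port-channel summary
```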
vPC Peer-Link Failure
A peer-link failure can have some ill effects on your network if you’re
not careful with your design strategy. When the peer-link fails but the keepalive
link is still passing heartbeats, the current vPC secondary switch shuts down
all of its vPC member ports. N7ks and N6ks will
also shut down all vPC member VLANs and the SVIs configured for them.
Essentially what happens is the secondary recognizes there is a loss of
the peer-link and closes down shop so the primary can handle all functionality.
All traffic flows shift to the vPC primary until the peer link is restored and
a consistency check can be run.
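Because the outcome depends on which peer is secondary, it helps to know the roles before anything breaks. A minimal sketch of the checks, run on either peer:

```text
! Shows whether this switch is currently vPC primary or secondary
show vpc role
! Shows peer-link status, keepalive status, and per-vPC state
show vpc
! Shows the global parameters compared by the consistency check
show vpc consistency-parameters global
```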
At first glance a peer-link failure doesn’t seem like a big issue but there
are several gotchas you have to understand.
Orphaned ports on the secondary are now black-holed. Since all vPC member
VLANs and SVIs are down/down on the secondary peer, the uplinks to the core or
distribution are also down. With the peer-link down as well, traffic from
orphaned ports has no way to leave the secondary peer.
One option to work around this limitation is to dual-home your orphaned
ports to both peer switches and run a separate non-vPC trunk between the peers.
This option is useful in situations where hosts are unable to participate in
vPC because they don’t support LACP.
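A separate non-vPC trunk might look something like the sketch below. The VLAN number, port-channel number, and member interfaces are all hypothetical, and the sketch assumes VLAN 200 is used only by the dual-homed non-vPC hosts and is not carried on the peer-link or on any vPC.

```text
! Hypothetical values throughout: VLAN 200, port-channel 20, Ethernet1/31-32
vlan 200
  name NON-VPC-HOSTS

interface Ethernet1/31-32
  description Member of non-vPC trunk to the other peer
  channel-group 20 mode active

interface port-channel 20
  description Non-vPC trunk between peers (not the vPC peer-link)
  switchport mode trunk
  switchport trunk allowed vlan 200
```

Apply the mirror-image configuration on the other peer.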
Keep in mind with this setup that orphaned ports are not shut down during a
peer-link failure, so if you are using server-side load-balancing, the traffic
sent to the secondary will still be black-holed. In version 5.0(3)N2(1), Cisco
added the command vpc orphan-port suspend, which suspends the secondary’s
orphaned ports during a peer-link failure. Configure this on all orphaned ports
on both peers.
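As a rough sketch, the command is applied per interface. The interface number and description here are hypothetical, and the exact keyword can differ between platforms, so check the command reference for your release.

```text
! Hypothetical orphan port: a single-attached host on Ethernet1/10
interface Ethernet1/10
  description Single-attached host (orphan port)
  vpc orphan-port suspend

! List the ports the switch currently treats as orphan ports
show vpc orphan-ports
```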
The vPC consistency check is unable to run during a peer-link failure.
Without the consistency check, new vPCs cannot be turned up or, even worse, vPC
ports that go down and then come back online will stay down/down until the
peer-link is restored. The workaround for this situation is to use the
auto-recovery or reload restore command.
To configure auto-recovery, enter the auto-recovery command under the vPC
domain on both the N7k and N5k.
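A minimal sketch, assuming a hypothetical vPC domain ID of 10 (use your own domain ID):

```text
! Domain ID 10 is a placeholder
vpc domain 10
  auto-recovery
```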
On 7Ks and 5Ks running NX-OS versions older than 5.0(2), use the reload
restore command under the vPC domain instead.
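Again as a sketch only, with the same hypothetical domain ID of 10. Whether this command is available depends on your platform and release, so verify it against the command reference before relying on it.

```text
! Domain ID 10 is a placeholder
vpc domain 10
  reload restore
```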
You might even see
some older documentation referring to the old way (NX-OS 4.x and older) using
the peer-config-check-bypass command
to accomplish auto-recovery.
vPC Peer-Keepalive Failure
A peer-keepalive failure on its own is not service impacting, since the
keepalive link’s only purpose in life is carrying heartbeats between vPC peers.
But just because the peer-keepalive is down doesn’t mean you can forget about it.
If your peer-keepalive is down and then your peer-link goes out, your troubles get a lot worse. With both links down you face the dreaded split-brain scenario, where both peers take on the active role and continue to process data for all ports. Havoc soon follows across your infrastructure.
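Since a dead keepalive link stays quiet until the peer-link also fails, it is worth checking it regularly. The sketch below shows a typical keepalive definition and the matching status command; the domain ID and the 192.0.2.x addresses are placeholders, and using the management VRF is an assumption about your design.

```text
! Placeholder domain ID and documentation-range addresses
vpc domain 10
  peer-keepalive destination 192.0.2.2 source 192.0.2.1 vrf management

! Reports whether keepalive messages are being sent and received
show vpc peer-keepalive
```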
vPC Peer Switch Failure
This situation starts out simple but has some more complex variations
that can cause issues. If a vPC peer switch fails, the remaining peer switch
keeps chugging along. Traffic flows then shift to the remaining peer switch.
In this scenario, you lose half the vPC bandwidth since there are no longer two
peer switches to handle the load. If you’re tight on over-subscription or
bandwidth usage, this scenario can cause some traffic issues. Keep this in
mind when designing your infrastructure.
Also consider the situations where you lose both peers, for example,
due to a power outage. Suppose that once the power is restored, only one peer
switch comes back online. This peer will take over as primary, but the
consistency check will fail because the peer-link is down. Without auto-recovery
configured, the peer switch will be unable to turn up any vPC member until the
peer-link is restored.
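For this power-outage case, auto-recovery can be given a reload delay, which is how long the surviving peer waits for its partner before bringing up vPCs on its own. A hedged sketch, assuming a hypothetical domain ID of 10 and a 240-second delay (the supported range and default depend on your release):

```text
! Placeholder domain ID; the delay value is an example, not a recommendation
vpc domain 10
  auto-recovery reload-delay 240

! Confirm the vPC configuration that is actually applied
show running-config vpc
```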