How a Legacy STP Domain Sabotaged an Entire Cisco ACI Multi-Pod Fabric

DC Consultant | DCACI | CCNP Security | PCNSA | CDE PAM
This week, I investigated a network incident for a customer that initially appeared to have an isolated, predictable impact. However, upon deeper analysis, it revealed a hidden, far more destructive interaction between legacy Layer 2 protocols and a modern SDN fabric.
The Scenario: Cisco ACI Multi-Pod Migration
The infrastructure was a Cisco ACI Multi-Pod deployment spanning two Pods. The customer was still populating Pod 2 with active workloads, and because of the ongoing migration, only a single physical link in the Inter-Pod Network (IPN) was temporarily active between the Pods.
The Symptoms: Intermittent Outages in a Stretched EPG
The customer reported intermittent connectivity drops (classic flapping behavior) within an EPG dedicated to database management. This EPG was stretched across both Pod 1 and Pod 2.
A preliminary check of the underlying infrastructure quickly revealed that the single IPN link was flapping intermittently. Under normal circumstances, an IPN link flap should only disrupt inter-pod traffic: as the BGP EVPN routes for remote endpoints (/32 host routes) are withdrawn between the Spines, communication across Pods drops. Intra-pod communication (endpoints within the same Pod talking to each other) should remain completely unaffected.
Yet, the reality on the ground was different: services were failing globally across the EPG, even locally within a single Pod.
The Deep Dive: Endpoint Flushes and Mysterious TCNs
On the APIC, we noticed a massive endpoint flush event inside that specific EPG at the exact timestamps when the IPN link flapped.
After correlating the log timestamps, we suspected an unintended interaction with the legacy Layer 2 domain. We checked for Topology Change Notification (TCN) flush events within the encapsulation VLAN mapped to that EPG. The suspicion was confirmed: we found continuous TCN-driven flushes originating from specific static leaf ports in both Pods.
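The timestamp correlation described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the tooling we used: the `correlate` function, the 5-second window, and all timestamps are made up for the example, and in practice the two lists would come from the IPN device logs and the APIC event records.

```python
from datetime import datetime, timedelta


def correlate(link_flaps, flush_events, window=timedelta(seconds=5)):
    """Pair each endpoint-flush timestamp with the link flaps within `window` of it."""
    matches = []
    for flush in flush_events:
        nearby = [flap for flap in link_flaps if abs(flush - flap) <= window]
        matches.append((flush, nearby))
    return matches


# Made-up timestamps, purely for illustration.
link_flaps = [
    datetime(2024, 1, 1, 10, 0, 0),
    datetime(2024, 1, 1, 10, 7, 30),
]
flush_events = [
    datetime(2024, 1, 1, 10, 0, 2),
    datetime(2024, 1, 1, 10, 7, 31),
    datetime(2024, 1, 1, 10, 30, 0),
]

for flush, nearby in correlate(link_flaps, flush_events):
    status = "matches a link flap" if nearby else "no link flap nearby"
    print(flush.isoformat(), "->", status)
```

If nearly every flush lines up with a link flap, as it did in our case, the two events are very unlikely to be independent.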
The Root Cause: A Hidden Cross-Pod Spanning Tree Domain
Within that EPG, the customer had connected dedicated legacy L2 management switches to ACI Leaf switches in both Pod 1 and Pod 2. Although the ACI fabric does not run Spanning Tree itself, by default it transparently floods BPDU frames within the Flood Domain (FD) scope, so BPDUs received on one Leaf port are carried to the other ports in the same flood domain, including those in the other Pod.
Consequently, the customer had inadvertently built a single, massive Spanning Tree domain stretched across the Multi-Pod fabric between the two legacy management switches.
When the single IPN link flapped, the legacy switches in one Pod lost connectivity to the STP Root Bridge located in the other Pod. This immediately triggered a Root Bridge re-election, topology recalculations, and a barrage of TCNs sent toward the ACI Leaf ports.
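To make the re-election concrete, here is a minimal Python sketch of the standard STP rule that the bridge with the lowest bridge ID (priority first, then MAC address) becomes the root. The switch names, priorities, and MAC addresses are hypothetical, and the model ignores timers, port roles, and everything else a real STP implementation does.

```python
def elect_root(bridges):
    """Return the bridge with the lowest (priority, MAC) bridge ID.

    MACs are compared as strings, which works here because they share
    the same lowercase, zero-padded format.
    """
    return min(bridges, key=lambda b: (b["priority"], b["mac"]))


# Hypothetical legacy management switches, one per Pod.
pod1 = [{"name": "mgmt-sw-pod1", "priority": 4096, "mac": "00:00:00:00:00:01"}]
pod2 = [{"name": "mgmt-sw-pod2", "priority": 32768, "mac": "00:00:00:00:00:02"}]

# IPN link up: BPDUs are flooded across Pods, so both switches share one STP domain.
root_joined = elect_root(pod1 + pod2)

# IPN link down: each Pod only sees its own BPDUs and elects a root on its own.
root_pod1 = elect_root(pod1)
root_pod2 = elect_root(pod2)

print("IPN up   -> root:", root_joined["name"])
print("IPN down -> Pod 1 root:", root_pod1["name"], "| Pod 2 root:", root_pod2["name"])
```

In this toy model, every IPN flap changes the root seen from Pod 2 twice (once when the link drops, once when it returns), and each change is a topology change that generates TCNs.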
By default, when a Cisco ACI Leaf receives an STP TCN on an access port, it flushes the locally learned endpoint entries (MAC/IP) for that EPG so that traffic is not black-holed toward stale locations. With TCNs arriving continuously, this flushing mechanism completely crippled traffic, breaking communication even between servers attached to the very same Leaf switch.
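The effect of that flush can be illustrated with a toy model. The sketch below is not how a Leaf is implemented; the `LeafEndpointTable` class, the EPG names, MAC addresses, and port names are all invented for the example. It only shows why a TCN received for one EPG wipes the entries for servers that never moved.

```python
class LeafEndpointTable:
    """Toy model of a leaf's learned endpoints, grouped per EPG."""

    def __init__(self):
        self.endpoints = {}  # EPG name -> {MAC address: local port}

    def learn(self, epg, mac, port):
        self.endpoints.setdefault(epg, {})[mac] = port

    def knows(self, epg, mac):
        return mac in self.endpoints.get(epg, {})

    def on_tcn(self, epg):
        """Flush every learned endpoint in the EPG and return how many were removed."""
        flushed = len(self.endpoints.get(epg, {}))
        self.endpoints[epg] = {}
        return flushed


table = LeafEndpointTable()
table.learn("db-mgmt", "00:00:00:00:00:0a", "eth1/10")
table.learn("db-mgmt", "00:00:00:00:00:0b", "eth1/11")
table.learn("other-epg", "00:00:00:00:00:0c", "eth1/12")

flushed = table.on_tcn("db-mgmt")
print(f"TCN flushed {flushed} endpoints in db-mgmt")
print("server A still known:", table.knows("db-mgmt", "00:00:00:00:00:0a"))
print("other EPG untouched:", table.knows("other-epg", "00:00:00:00:00:0c"))
```

Both servers in the affected EPG have to be relearned after every TCN, even though they sit on the same Leaf and have nothing to do with the IPN link.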
The Resolution & Key Takeaways
To remediate the issue and stabilize the fabric, we completely isolated the STP domains. We applied a BPDU Filter at the Interface Policy level in ACI for all downlink ports connected to the legacy infrastructure. This immediately stopped TCN propagation into the fabric and localized the Spanning Tree topologies to their respective pods.
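For readers who prefer to push this through the APIC REST API rather than the GUI, the sketch below builds the JSON body for a Spanning Tree interface policy with BPDU Filter enabled. It rests on assumptions that should be checked against your APIC version: that the policy is represented by the `stpIfPol` class with a comma-separated `ctrl` attribute accepting `bpdu-filter` and `bpdu-guard`, and that it is created under `uni/infra`. The policy name `LEGACY_BPDU_FILTER` is hypothetical, and the code only prints the payload; it does not contact an APIC.

```python
import json


def build_stp_interface_policy(name, bpdu_filter=True, bpdu_guard=False):
    """Build the JSON body for a Spanning Tree interface policy.

    Assumes the APIC object model exposes this policy as class `stpIfPol`
    with a comma-separated `ctrl` attribute; verify before use.
    """
    flags = []
    if bpdu_filter:
        flags.append("bpdu-filter")
    if bpdu_guard:
        flags.append("bpdu-guard")
    return {"stpIfPol": {"attributes": {"name": name, "ctrl": ",".join(flags)}}}


# Hypothetical policy name; the body would be POSTed under uni/infra.
payload = build_stp_interface_policy("LEGACY_BPDU_FILTER")
print(json.dumps(payload, indent=2))
```

The policy has no effect on its own: it still has to be referenced by the interface policy group applied to the ports facing the legacy switches.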
Had the IPN infrastructure been fully redundant with both links active, this architectural flaw would likely have remained hidden indefinitely, as the probability of both IPN links flapping simultaneously is low.
This troubleshooting exercise serves as an excellent reminder: while modern SDN solutions like Cisco ACI promise to eliminate the constraints of Spanning Tree, the interface to legacy non-fabric networks remains a critical boundary. A design that fails to properly isolate the legacy control plane can easily introduce classic Layer 2 failure modes into an otherwise modern fabric.


