DC Interconnect – Part 2: VXLAN

In the second part, I will focus specifically on VXLAN as a Data Center Interconnect. I intend this iteration to be more general, and then have a third part which is focused on VXLAN options with Dell EMC Networking, including both OS9 as well as the SDN Eco-system solutions – or maybe, i will cover the later in a separate post, depending on the length.

What:

VXLAN is an overlay Virtualisation technology. More precisely, it is an encapsulation method that extends Layer 2 over a Layer 3/IP-based network.

Common overlay protocols today, use either

MAC-in-MAC encapsulation, such as IETF TRILL, Cisco FabricPath.
MAC-in-IP can be achieved using NVGRE,VXLAN etc.

Why:

DCI, Layer 2 adjacency across Layer 3 boundaries.

As alluded to in Part 1, DC architectures today are typically Layer 3, which enables scale, interoperability, as well as smaller blast radius/failure domain. Nevertheless, considerations for legacy apps, storage, workload mobility etc. continue to demand Layer 2 adjacency.

VXLAN could be implemented within a DC, or as an interconnect between DCs.

Type/Variety:

VXLAN solutions may be differentiated based on Control-plane/Learning (there are other ways to distinguish them too). The classification along these lines results in:

Multicast forward/flood based learning
Protocol based learning

The learning implies Remote VTEP discovery and End stations reachability.

Multicast forward/flood based learning:

The initial IETF VXLAN standard (RFC 7348) defines a

Multicast-based flood-and-learn VXLAN without a control plane.

The original Layer 2 frame is encapsulated with a VXLAN header and then placed in a UDP-IP packet and transported across an IP based network. Thus, it relies on

Data-driven flood-and-learn behaviour for remote VTEP peer discovery + end-host learning. There is no control protocol to distribute end-host reachability information among the VTEPs. The control plane and the data plane are integrated together.
Multicast in the network core. The overlay broadcast, unknown unicast, and multicast traffic are encapsulated into multicast VXLAN packets and transported to remote VTEP switches through the underlay multicast forwarding.

Challenges/Issues

Flooding impacts scalability, as it is a flood-and-learn based L2 technology.
The requirement to enable multicast in the underlay, may not be attractive/feasible

Underlay

Each VXLAN segment has a VXLAN network identifier (VNID), which is mapped to an IP multicast group in the transport network.

In a traditional L2/VLAN environment, Intra VLAN communication between two hosts would start with an ARP broadcast within the VLAN. Since VXLAN is Layer 2 over Layer 3, this broadcast packet is sent to a multicast group address which is mapped to a VXLAN VNI. Each VTEP that has participating members in that VNI is configured with this multicast group, and participates in PIM routing

The underlay topology could be L3 Leaf & spine. Any unicast routing protocol could be deployed. VXLAN termination is done on Leaf ToRs, and not on the Spine. External Network Access should be via a border leaf block.

BUM Traffic

Underlay IP multicast is used to contain the flooding scope of the participants within the VXLAN segment. This obviously means dependency on multicast in the underlay.

For comparison, the alternate to this mechanism could be unicast with headend/ingress replication, which allows the network to be multicast free. The local VTEP will use IP addresses of other VTEPs to send BUM traffic. These IP addresses are exchanged between VTEPs through static ingress replication configuration.

Overlay

The Layer 2 overlay is created on top of L3/IP underlay, by using VTEP tunnelling mechanism to transport Layer 2 packets. As mentioned,

It uses flood-and-learn mechanism.
It does not have a control plane.

Host Detection and Reachability

The VXLAN flood-and-learn spine-and-leaf network relies on initial data-plane traffic flooding to enable VTEPs to

discover each other
learn remote host MAC addresses
MAC-to-VTEP mappings for each VXLAN segment.

After MAC-to-VTEP mapping is complete, the VTEPs forward VXLAN traffic in a unicast stream. The layer 3 routing function is centralized on specific switches (spine switches or border leaf switches).

Protocol based learning:

We will have a look at MP-BGP EVPN as a protocol based learning mechanism for VXLAN.

VXLAN with an MP-BGP EVPN Control Plane, is an alternate to the Flood-and-Learn VXLAN. VTEP peer discovery and endhost reachability information distribution is protocol based. This in turn addresses both the challenges faced by the flood-and-learn VXLAN – i.e.

Scalability
Containing flooding

Advantages

There are additional benefits – the following features are vendor dependent and may vary between different VXLAN solutions. I am including these below as some of these will become relevant when I discuss the VXLAN solutions from Dell EMC & its SDN partners like Pluribus & Cumulus.

Multi-vendor Interopetrability: As it provides a control plane based on industry standards, MP-BGP EVPN allows multivendor interoperability.
Active-Active multi-homing at Layer 2: VLT/MLAG/VLAG pairs for dual homing of network endpoints, at the leaf. Both VLT/MC-LAG peer switches would act as a single logical VTEP, and participate in the BGP-EVPN operations as Active Forwarders.
Distributed Anycast Gateway: Each leaf is assigned the same Gateway IP + MAC addresses for connected subnets/VLANs i.e. they share VIPs/vMACs. The leaf switch/pair therefore is the L3 boundary as a distributed (instead of centralized) gateway, and routes all traffic – with both peers being Active Forwarders.
VXLAN routing: The ability of the leaf switches to do VXLAN routing is dependent on the chip. Single pass (Native) VXLAN routing using RIOT profiles, would become more accessible with the next gen chipsets.
Dynamic Tunnel Discovery: Remote VTEPs are auto discovered, and VXLAN tunnels are auto provisioned.
Data-plane IP and MAC learning: EVPN address family carries both L2 & L3 reachability information, which means it facilitates integrated bridging and routing in VXLAN overlays.
BUM Traffic Handling: In situations requiring the flooding of frames, e.g. MAC aging, L2 broadcast etc., the VTEP replciates the frame in hardware and unicasts the frame to all other VTEPs which have members in the same segment.
ARP/ND suppression: Leaf ToRs upon discovering remote IP and MAC addresses, populate their respective ARP tables. They respond directly to any local ARP queries, which eliminates the flooding of ARP requests. Thus, Control-plane distribution of MAC/IP addresses enables ARP suppression in the fabric for Layer 2 extensions between racks.
Conversational ARP/ND learning: Only the active flows are programmed into the forwarding plane. This optimises utilization of hardware resources, as the number of cached ARP entries is significantly reduced. Thus, the control plane could end up holding more host entries than what the hardware table can support. When there is sufficient space in hardware, all host entries are programmed. When there is no space, conversational learning executes and starts aging out the inactive entries.
VM mobility support: Upon a VM move, the leaf switch discovers & learns its addressing information via data plane learning. This reachability info is then advertised to its peers, which update their forwarding tables accordingly. this enabled optimized convergence in the network.
Multi-tenancy: Traffic isolation across DCs between L2/L3 domains.

Underlay

Underlay is traditional L3, and could be built using any unicast routing protocol. VXLAN termination is done on Leaf ToRs, and not on the Spine. External Network Access should be via a border leaf block.

BUM Traffic

In a layer-2 Network, MAC address learning is achieved by flooding frames with unknown unicast addresses on all ports which are members of that VLAN. For VXLAN, one or more IP multicast groups are set up to carry BUM traffic to VTEPs associated with a given VNI. Each VTEP needs to be a member only of those VNIs in which it has participating hosts. It does not maintain MAC address tables for VNIs in which it has no local members.

This approach means that the underlay must support IP multicast. Where this is not desired or possible, the alternate is Ingress/Head End Replication, which takes incoming BUM traffic and sends a single unicast copy to each of the VTEPs receiving traffic for a given VNI.

Ergo, the options are

PIM/Multicast
Head-end/Ingress Replication: (Multicast Free Underlay) BUM traffic sent as unicast. VTEP IP addresses are exchanged between VTEPs through BGP EVPN or static maps.

Overlay – Data Plane

VXLAN

Overlay – Control Plane

BGP EVPN is an extension to MP-BGP. It is controllerless, and utilises an MP-BGP address family (L2VPN) and NLRI for advertising L2 MAC addresses for endpoints, and mapping them to IP addresses.

Host Detection and Reachability

With MP-BGP EVPN,

Each VTEP performs local learning to obtain MAC address (though traditional MAC address learning) and IP address information (based on ARP snooping) from its locally attached hosts.
The VTEP then distributes this information through the MP-BGP EVPN control plane. Hosts attached to remote VTEPs are learnt through the control plane.

This approach addresses both flood based end-host learning, as well as end-host reachability information distribution.

————-

In the next part, I will look at VXLAN deployment with Dell EMC Networking. I might however, post on Pluribus Cloud Fabric before we get to part 3.

2 comments

Mark says:

November 16, 2017 at 11:59 am

Thank you for this series on DCI. I am very curious about part 3. Can we expect it soon?

1. Hasan Mansur says:
  
  November 16, 2017 at 2:26 pm
  
  Hi Mark, thanks for the comment. I was holding back as it occurred to me I could try and incorporate IPInfusion, another of our SDN partners into the post. part 3 focus was initially on Dell EMC, Pluribus, Cumulus & Big Switch. However, after your comment, I shall go ahead with the post as is, and do a followup part 4+ for other solutions. Expect it on Tuesday next week.

	Anonymous on Switch Port auto shuts/ port s…
	Shannon on Force10 S4810 VLT – Quic…
	Dua F on Layer 2 Discards Troubleshooti…
	Hasan Mansur on MX7000 Networking – Part…
	ajityadav on MX7000 Networking – Part…

What:

Why:

Type/Variety:

Multicast forward/flood based learning:

Challenges/Issues

Underlay

BUM Traffic

Overlay

Host Detection and Reachability

Protocol based learning:

Advantages

Underlay

BUM Traffic

Overlay – Data Plane

Overlay – Control Plane

Host Detection and Reachability

Share this:

Related

Leave a Reply Cancel reply