I’ve been working on an Arista datacenter design lately, and have been going back and forth on how I want to handle the VXLAN DCI portion (two active/active datacenters within 5ms of each other, basically a mirrored design).
Basic Diagram:
With two sets of dark fiber between the datacenters, we had a means of providing reachability with redundancy, but I really didn’t want to extend pure layer 2 across to the other side. In addition, the rest of the datacenter was designed as Layer 3 Leaf/Spine (with VRF-lite spanning across the DC to provide a non-overlay layer3 fabric per VRF), so I wanted the ability to extend that between the two datacenters.
With that in mind, I started mapping out my requirements:
1. I wanted a VXLAN dataplane between the datacenters – No layer2 or spanning-tree on these links.
2. I needed a manageable control-plane for VXLAN – Head-end replication works well, but the problem is scaling it. Every VTEP within the DC that participates in VXLAN would need to be touched per VNI.
3. I didn’t want to sacrifice two switches per DC to a pure DCI role (according to the Arista DCI Design Guide, this is still their recommended best practice – they recommend your border leaves be configured to trunk down to your DCI switches all VLANs you want extended over the DCI, effectively drawing a demarcation point between the CVX and HER control-planes).
When we were building out the DC, we started with Head-end Replication. This allowed us to flood VXLAN VNIs wherever we chose, but with a bit of work behind it.
Example:
{dc1-border1}
interface VXLAN1
   vxlan vlan 100 vni 10.10.100
   vxlan vlan 100 flood vtep 192.0.2.110

{dc2-border1}
interface VXLAN1
   vxlan vlan 100 vni 10.10.100
   vxlan vlan 100 flood vtep 192.0.2.210
This works, but requires a lot of redundant configuration – the config has to be mirrored on every VTEP that participates in the VNI.
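To give a feel for how that redundancy grows, here is a rough Python sketch that renders the head-end replication stanza for each VTEP. The VTEP names and addresses in the inventory are hypothetical placeholders, not the real design, and the script only prints text; it doesn't talk to any switch.

```python
# Hypothetical inventory (not from the real design): VTEP name -> VTEP IP.
VTEPS = {
    "vtep-a": "198.51.100.1",
    "vtep-b": "198.51.100.2",
    "vtep-c": "198.51.100.3",
}

# VLAN -> VNI, reusing the dotted VNI from the example above.
VLAN_TO_VNI = {100: "10.10.100"}


def her_config(local, vteps, vlan_to_vni):
    """Render the VXLAN interface stanza for one VTEP.

    Each VLAN gets a flood list naming every *other* VTEP.
    """
    remotes = sorted(ip for name, ip in vteps.items() if name != local)
    lines = ["interface VXLAN1"]
    for vlan, vni in sorted(vlan_to_vni.items()):
        lines.append(f"   vxlan vlan {vlan} vni {vni}")
        lines.append(f"   vxlan vlan {vlan} flood vtep {' '.join(remotes)}")
    return "\n".join(lines)


def flood_lines(vteps, vlan_to_vni):
    """Count the flood-list lines that exist across the whole fabric."""
    return len(vteps) * len(vlan_to_vni)


if __name__ == "__main__":
    print(her_config("vtep-a", VTEPS, VLAN_TO_VNI))
    print(flood_lines(VTEPS, VLAN_TO_VNI), "flood-list lines to maintain")
```

Every one of those flood-list lines has to be revisited when a VTEP joins or leaves a VNI, which is the scaling problem from requirement 2.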
Once we got a bit further into the design, we started testing CVX’s VCS (the VXLAN Control Service in CloudVision eXchange), and it works very well within the datacenter.
The main benefits of using CVX were:
- Automatic MAC address learning – MAC address <-> VTEP bindings are distributed through the protocol rather than learned from BUM traffic, which limits the scope of flood-and-learn a bit within VXLAN.
- Automatic VTEP <-> VNI mappings – If a VTEP has a VNI installed, it will automatically advertise it to the rest of the VTEPs with the same VNI.
There were two dealbreakers for us with using CVX. One was that the cluster had to have a quorum at all times. If a majority of cluster members were lost, all VXLAN traffic would cease (Author’s note: Found out this isn’t exactly true. The flood list will remain intact, but no new VTEPs/MACs will be learned until CVX comes back online). The other was that MAC address learning from outside the protocol wasn’t possible (as of today, there isn’t a way to peer two CVX clusters together).
One recommendation was to create a single CVX cluster spanning both datacenters. This would give us all the benefits of CVX across both DCs, but we wanted two separate failure domains. Clusters had to be provisioned in odd-numbered groupings, so one side would always hold the majority. If the DCI was lost, the side with fewer CVX nodes would be completely offline. If the primary DC (the side with the majority of nodes) goes to hell and all the nodes on that side go offline, both DCs are down. You can see how this may be an issue…
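The quorum arithmetic behind that concern can be sketched in a few lines of Python. This is a toy model that assumes a simple strict-majority rule and a three-node cluster split 2/1 between the DCs; the function names are made up for illustration and the model says nothing about how CVX implements clustering internally.

```python
def has_quorum(reachable, cluster_size):
    """True when the reachable members form a strict majority of the cluster."""
    return reachable > cluster_size // 2


def split_outcome(nodes_dc1, nodes_dc2):
    """Which side still has quorum when the DCI link between the DCs is lost."""
    total = nodes_dc1 + nodes_dc2
    return {
        "dc1": has_quorum(nodes_dc1, total),
        "dc2": has_quorum(nodes_dc2, total),
    }


if __name__ == "__main__":
    # DCI failure with a 2/1 split: only the majority side keeps quorum.
    print(split_outcome(2, 1))
    # DC1 (the majority side) fails entirely: the single DC2 node is left
    # alone in a three-node cluster and cannot form a majority.
    print(has_quorum(1, 3))
```

Whichever way the odd node count is divided, one DC depends on the other for its control plane, which is why the stretched cluster didn't give us two failure domains.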
From here, it looked like going back to HER across the DCs was the best bet (side note: EVPN fixes pretty much all of these issues now). I decided to dig into the CVX optional commands to see if there was something else I could leverage. It didn’t take long before I found a line that I could modify that looked interesting.
cvx demolab
   no shutdown
   heartbeat-interval 30
   heartbeat-timeout 90
   peer host 192.0.2.32
   peer host 192.0.2.33
   source-interface Loopback0
   !
   service vxlan
      no shutdown
      vni 65111 flood vtep 192.0.2.128
CVX supports manual flood lists! This command installs the VNI <-> VTEP mappings on all the managed nodes through VCS. That’s perfect: using this, I could configure separate CVX clusters to handle intra-DC MAC learning, and then install a manual flood list for any VNIs that I wanted to extend over the DCI.
When set up this way, devices in one DC could reach devices in the other purely by using the static VNI flood lists within CVX. Initial testing looked good…until I checked the vxlan address-table. The MAC addresses weren’t learned; I had just turned my network into a giant hub. Shit. I found out the hard way that within CVX, if you’re using the VXLAN control plane, addresses can only be learned via the control plane. Back at the drawing board, I found another command that brought it all together.
service vxlan
   vtep mac-learning data-plane
As it turns out, this was originally put in place to allow communication with third-party VTEPs, but it could be leveraged for my uses. Once I configured data-plane learning, I lost the benefit of control-plane MAC learning (more BUM traffic, yay!), but gained the ability to learn MAC addresses through VXLAN tunnels. That basically enables HER-like functionality, but with the management simplicity of CVX.
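The difference between the "giant hub" behaviour and data-plane learning can be illustrated with a toy Python model. It is only a sketch of flood-and-learn logic under simplified assumptions; the `Vtep` class, the VTEP names, and the MAC address are all hypothetical and this is not how EOS implements anything.

```python
class Vtep:
    """Toy VTEP with a static flood list and optional data-plane learning."""

    def __init__(self, name, flood_list, data_plane_learning):
        self.name = name
        self.flood_list = list(flood_list)
        self.data_plane_learning = data_plane_learning
        self.mac_table = {}  # remote MAC -> name of the VTEP it sits behind

    def receive(self, src_mac, from_vtep):
        """Handle a frame arriving over a VXLAN tunnel from another VTEP."""
        if self.data_plane_learning:
            self.mac_table[src_mac] = from_vtep

    def destinations(self, dst_mac):
        """Return the VTEPs a frame for dst_mac would be sent to."""
        if dst_mac in self.mac_table:
            return [self.mac_table[dst_mac]]  # known MAC: unicast
        return list(self.flood_list)          # unknown MAC: flood


if __name__ == "__main__":
    remote_mac = "00:00:5e:00:53:01"  # hypothetical host behind dc2-vtep1
    for learning in (False, True):
        vtep = Vtep("dc1-vtep1", ["dc2-vtep1", "dc2-vtep2"], learning)
        vtep.receive(remote_mac, "dc2-vtep1")
        print(learning, vtep.destinations(remote_mac))
```

With learning off, the reply is still flooded to every VTEP in the list even after traffic has arrived from the remote host; with it on, the reply goes only to the VTEP the host was seen behind.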