Working on getting things back in order.
Pro-tip: If you’re going to use the AWS free tier to host your personal blog, plan for how much it’s going to cost once the free tier expires.
-T
Suppose you have a datacenter with a need for policy-based separation between specific devices. There are a few ways to do this – start by placing them in different VLANs, run an ACL on the SVI, and bam! – they can’t talk anymore. But what do you do when you have so many devices that managing an ACL becomes a massive pain in the ass? You could make a firewall the gateway on those VLANs so that your firewall policy controls access to and between them. What if you have 100Gbit uplinks in a Layer 3 Leaf/Spine topology (L3LS from here on out)? Getting a firewall with that level of throughput isn’t cheap. If you can group devices with similar roles (e.g. DMZ, Internal, Application), you could use VRF-Lite to separate them at Layer 3.
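As a rough sketch of the VRF-Lite option – the VRF names, VLANs, and addresses here are made up for illustration, and newer EOS releases use “vrf instance” and “vrf” in place of “vrf definition” and “vrf forwarding”:

vrf definition DMZ
!
vrf definition INTERNAL
!
ip routing vrf DMZ
ip routing vrf INTERNAL
!
interface Vlan100
   description DMZ hosts
   vrf forwarding DMZ
   ip address 10.0.100.1/24
!
interface Vlan200
   description Internal hosts
   vrf forwarding INTERNAL
   ip address 10.0.200.1/24

Each VRF gets its own routing table, so the two SVIs can no longer reach each other unless something outside the VRFs routes (or leaks) between them.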
The problem with VRF-Lite is that it’s locally significant. It needs to be built on each switch that participates in these VRFs, and each side of every inter-switch link needs an interface in the VRF in order to extend it between switches. I ran into an issue recently where I wanted to extend 10+ VRFs across my entire environment. Normally I’d run MPLS to do this, but Arista gear at the time didn’t support any kind of L3VPN functionality (note: now that EVPN support has been released, this config can be simplified greatly). One of my requirements when building this out was VXLAN support – specifically, VXLAN routing. I wanted each leaf switch to be able to host an anycast gateway for the VXLAN-participating VLANs in order to get packets out into the fabric, and to their final destination, as quickly as possible.
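For reference, the anycast gateway piece on a leaf looks roughly like the following – the same hypothetical VLAN 100 and DMZ VRF from the sketch above, with made-up addresses; “ip address virtual” is the Arista form of a shared anycast gateway on a VXLAN-enabled SVI:

ip virtual-router mac-address 00:1c:73:aa:bb:01
!
interface Vlan100
   vrf forwarding DMZ
   ip address virtual 10.0.100.1/24
!
interface Vxlan1
   vxlan source-interface Loopback1
   vxlan udp-port 4789
   vxlan vlan 100 vni 10100

Every leaf carries the same virtual IP and MAC for the SVI, so hosts always have a local first hop no matter which leaf they land on.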
VXLAN routing can use overlay VRFs, meaning it can effectively extend VRFs between switches without having to carry VRF-Lite across physical interfaces – the VXLAN packets are simply dumped into the VLAN that the VNI belongs to, and routed with standard Layer 3 from there. There were a couple of trade-offs with this.
One: Every switch that needed to host a VRF also needed to participate in VXLAN routing in some way. This wasn’t ideal, as (at the time) Arista border leaves could either hold an internet-scale routing table or run VXLAN routing, but not both.
Two: The Jericho chipsets that we were using didn’t officially support VXLAN routing yet. This was the main no-go.
As a workaround, I had to come up with a way to extend these VRFs across the entire datacenter. Thus, the subinterface fiesta was born.
With an L3LS topology, every leaf switch has a Layer 3 connection to every spine switch, and subinterfaces can be leveraged as a lamprey-like means of attaching VRF-Lite segments so they can hitch a ride over to the other side. While this can work, it traditionally doesn’t scale well. Giving each spine a /24 and allocating /31s for the leaf-spine links can quickly get out of hand once multiple VRFs ride over them. With each spine connected to 8 leaf switches and just 4 VRFs extended over those links, you’re looking at 64 IP addresses consumed per spine – scale the VRFs up much higher, and you’ll see some real pain.
One advantage of L3LS is that the transit links between leaf and spine really don’t matter, as long as you have a routing protocol (eBGP in this case) to signal the actual leaf routes to the spines. You don’t even need to advertise the transit links into the BGP table if you don’t have a reason to (with a couple of caveats, which I’ll go into later). This meant that instead of needing 64 IP addresses, I could use 2 IP addresses for the routed link (a /31, one address per end) and simply re-use that same pair within every VRF riding over it. Establish a BGP peering between leaf and spine within each VRF, and you’re good to go.
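Here’s a minimal sketch of one leaf’s uplink to a spine, assuming the same two hypothetical VRFs as before and made-up ASNs – note the same /31 reused inside every VRF. Exact VRF/BGP/subinterface syntax varies a bit by EOS version, and some releases also want an RD defined per VRF:

{leaf1}
interface Ethernet1
   description spine1 underlay
   no switchport
   ip address 192.0.2.1/31
!
interface Ethernet1.110
   description spine1 - VRF DMZ
   encapsulation dot1q vlan 110
   vrf forwarding DMZ
   ip address 192.0.2.1/31
!
interface Ethernet1.120
   description spine1 - VRF INTERNAL
   encapsulation dot1q vlan 120
   vrf forwarding INTERNAL
   ip address 192.0.2.1/31
!
router bgp 65101
   neighbor 192.0.2.0 remote-as 65001
   !
   vrf DMZ
      neighbor 192.0.2.0 remote-as 65001
   !
   vrf INTERNAL
      neighbor 192.0.2.0 remote-as 65001

The spine mirrors this with 192.0.2.0 on its side of the parent interface and each subinterface, so every VRF rides the same physical link and the same pair of transit addresses.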
This satisfies the need to extend VRF-Lite between switches, but how do you route outside the VRF without any kind of route leaking? You need a central point of routing. With multiple 100Gbit uplinks, a single firewall wasn’t an option to host the L3 head-ends, but with a lot of traffic being soaked up by full east/west connectivity within each VRF, there’s far more leniency in the throughput required. This device doesn’t have to be a firewall – technically any Layer 3 device can do it. In our case we wanted a simplified firewall policy to control access, since we had a need to separate these VRFs to begin with.
One of the caveats I mentioned earlier is that any connection sourced from the routed link address will fail, since there’s no return route to it. To work around this, you can build a loopback on each switch, advertise it into BGP, and source any traffic you need from it – RADIUS, HTTP, SSH, etc.
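Something along these lines, with a placeholder loopback per VRF (the per-service source-interface commands for RADIUS, TACACS+, and friends vary, so they’re left out here):

interface Loopback100
   vrf forwarding DMZ
   ip address 10.255.1.1/32
!
router bgp 65101
   vrf DMZ
      network 10.255.1.1/32

Once the loopback is reachable through BGP, traffic sourced from it has a valid return path, even though the /31 transit addresses never appear in the routing table.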
The other issue is that traceroute may look weird to those who aren’t expecting it – sometimes you can be hashed out a leaf<->spine uplink whose subinterfaces share the same IP, which can look like a loop to the untrained eye.
I’ve been working on an Arista datacenter design lately, and have been going back and forth on how I want to handle the VXLAN DCI portion (two active/active datacenters within 5ms of each other, basically a mirrored design).
Basic Diagram:
With two sets of dark fiber between the datacenters, we had a means of providing reachability with redundancy, but I really didn’t want to extend pure layer 2 across to the other side. In addition, the rest of the datacenter was designed as Layer 3 Leaf/Spine (with VRF-lite spanning across the DC to provide a non-overlay layer3 fabric per VRF), so I wanted the ability to extend that between the two datacenters.
With that in mind, I started mapping out my requirements:
1. I wanted a VXLAN dataplane between the datacenters – No layer2 or spanning-tree on these links.
2. I needed a manageable control-plane for VXLAN – Head-end replication works well, but the problem is scaling it. Every VTEP within the DC that participates in VXLAN would need to be touched per VNI.
3. I didn’t want to sacrifice two switches per DC to a pure DCI role (according to the Arista DCI Design Guide, this is still their recommended best practice – they recommend your border leaves trunk all VLANs you want extended over the DCI down to dedicated DCI switches, effectively drawing a demarcation point between the CVX and HER control-planes).
When we were building out the DC, we started with Head-end Replication. This allowed us to flood VXLAN VNIs wherever we chose, but with a bit of work behind it.
Example:
{dc1-border1}
interface Vxlan1
   vxlan vlan 100 vni 10.10.100
   vxlan vlan 100 flood vtep 192.0.2.110
{dc2-border1}
interface Vxlan1
   vxlan vlan 100 vni 10.10.100
   vxlan vlan 100 flood vtep 192.0.2.210
This works, but requires a lot of redundant configuration – the config has to be mirrored on every VTEP that participates in the VNI.
Once we got a bit further into the design, we started testing CVX’s VCS (the VXLAN Control Service in CloudVision eXchange), and it works very well within the datacenter.
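For context, pointing a VTEP at the CVX cluster is fairly simple – the addresses below are placeholders, and the exact knobs vary a bit between EOS releases:

{dc1-leaf1}
management cvx
   server host 192.0.2.32
   server host 192.0.2.33
   no shutdown
!
interface Vxlan1
   vxlan source-interface Loopback1
   vxlan controller-client

With the controller-client enabled, CVX learns the VTEP and its MAC addresses through VCS and distributes the flood lists for you.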
The main benefits of using CVX were centralized management of the VNI-to-VTEP flood lists (no per-VTEP HER configuration to keep in sync) and control-plane MAC learning, which cuts down on BUM traffic.
There were two dealbreakers for us with CVX. One was that the cluster had to maintain quorum at all times – if a majority of cluster members were lost, all VXLAN traffic would cease (author’s note: I found out this isn’t exactly true. The flood list will remain intact, but no new VTEPs/MACs will be learned until CVX comes back online). The other was that MAC address learning from outside the protocol wasn’t possible (as of today, there isn’t a way to peer two CVX clusters together). One recommendation was to create a single CVX cluster spanning both datacenters. This would allow for all the benefits of CVX between both DCs, but we wanted two separate failure domains. If the DCI was lost, the side provisioned with fewer CVX nodes (they have to be deployed in odd-numbered groups) would be completely offline. And if the primary DC (the side with the majority of nodes) goes to hell and all the nodes on that side go offline, both DCs are down. You can see how this may be an issue…
From here, it looked like going back to HER across the DCs was the best bet (side note: EVPN fixes pretty much all of these issues now). I decided to dig into the optional CVX commands to see if there was something else I could leverage. It didn’t take long before I found a command that looked interesting.
{demolab}
cvx
   no shutdown
   heartbeat-interval 30
   heartbeat-timeout 90
   peer host 192.0.2.32
   peer host 192.0.2.33
   source-interface Loopback0
   !
   service vxlan
      no shutdown
      vni 65111 flood vtep 192.0.2.128
CVX supports manual flood lists! This command installs the VNI <-> VTEP mappings on all of the managed nodes through VCS. That’s perfect: using this, I could configure separate CVX clusters to handle intra-DC MAC learning, and then install a manual flood list for any VNIs that I wanted to extend over the DCI.
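Conceptually, each DC’s CVX cluster then carries a static flood entry for the DCI VNIs pointing at the far side’s border VTEPs – something like the following, reusing the flood-target addresses from the HER example above (the cluster names and VNI are just for illustration):

{dc1-cvx}
cvx
   service vxlan
      vni 65111 flood vtep 192.0.2.110

{dc2-cvx}
cvx
   service vxlan
      vni 65111 flood vtep 192.0.2.210

Intra-DC VTEP and MAC learning stays with each local CVX cluster; only the VNIs you want stretched over the DCI need these static entries.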
When set up this way, devices in one DC could reach devices in the other purely by using the static VNI flood lists within CVX. Initial testing looked good…until I checked the VXLAN address table. The MAC addresses weren’t being learned – I had just turned my network into a giant hub. Shit. I found out the hard way that within CVX, if you’re using the VXLAN control plane, addresses can only be learned via the control plane. Back at the drawing board, I found another command that brought it all together.
service vxlan
   vtep mac-learning data-plane
As it turns out, this command was originally put in place to allow communication with third-party VTEPs, but it could be leveraged for my purposes. Once I configured dataplane learning, I lost the benefit of control-plane MAC learning (more BUM traffic, yay!), but gained the ability to learn MAC addresses through VXLAN tunnels – basically HER-like functionality, but with the management simplicity of CVX.