Remote workers – rapid and cost-effective VPN scale with ZeroTier, OPNSense and FRRouting.

Overview

This would probably be a relevant topic on any given day in the world of IT, but given the current global pandemic due to COVID-19 (aka coronavirus), it’s become especially important.

IT departments are scrambling to find the capacity to connect entire companies remotely for extended periods of time.

With a traditional vendor solution that centers around a router or firewall that’s racked in a data center somewhere, this can be difficult to solve for a few reasons.

Challenges:

  • Hardware capacity – most firewalls or routers have a fixed capacity for VPN sessions and must be deployed into a cluster to scale beyond it.
  • Software licensing – taking a company of thousands and suddenly extending licensing to account for the entire company is a financial hurdle for most companies.
  • Time to deploy – assuming both hardware and software licensing challenges can be dealt with in a timely manner, it may take weeks or months to deploy the additional capacity.

Luckily, IT is much more focused on software and cloud solutions these days than on putting out boxes for everything.

Open source and cloud solutions when used together can provide an incredible amount of scale and performance without a long ramp up period.

We’ll look at the solution design in the next few sections to explore solving the problem of remote worker VPN scale in a cost-effective way.

Solution design

This is an overview of a design we’ve put into production to facilitate enterprise level VPN connectivity without traditional drawbacks like scale limit, hairpinning traffic and expensive hardware and software licensing.

1000+ users – All of the solutions used in this design are open source. The ZeroTier web controller is free for deployments up to 100 endpoints and requires very minimal investment to scale to thousands of endpoints.

10,000+ users – Even tens of thousands of endpoints would still come at a very moderate cost, mainly centered around cloud compute fees or physical DC/Campus hypervisor capacity.

It would likely still be less than 10% of the cost of a comparable VPN + hardware box solution.

All of these components can be assembled and tested inside of a day for a handful of FW endpoints and once security policies have been reviewed and applied, full production can easily be achieved not long after that.


Using ZeroTier for Mesh VPN connectivity.


If you’re unfamiliar with Mesh VPNs, you’re probably not alone.

Mesh VPNs are a relatively recent concept in overlay networking that allow for connectivity directly between any two points in the mesh without hairpinning through central points like a traditional VPN concentrator.

If you want some background on different Mesh VPN solutions, check out this podcast I was recently a part of over at networkcollective.com

https://networkcollective.com/2020/02/cr-end-of-wan/

What Mesh VPN solutions are available?

There are a few, and this isn’t an exhaustive list, but the three I hear the most chatter about are ZeroTier, Nebula and TailScale.

Any of these could be used to deploy this design, but I focused on ZeroTier because it’s the most flexible to combine with traditional networking in a data center or campus using dynamic routing protocols.

ZeroTier overview

ZeroTier first got on my radar when Ethan Banks over at the Packet Pushers did a priority queue show with the founder Adam Ierymenko on the Mesh VPN solution he developed.

ZeroTier’s mission is “Global Area Networking” using a unique mix of a centralized controller and certificate based authentication of endpoints.

It’s super easy to deploy as well and can be functional between two computers, phones, etc. in a matter of minutes.

Managing L2 scale

Normally, a large /16 subnet stretched across the globe would have networkers like me cringing as it’s a solution that’s often frowned upon and with good reason.

However, ZeroTier solves this in a very interesting way by managing all of the components that normally blow up L2 overlays like broadcast and multicast.

ARP is a good example – it is converted into unicast and sent to the appropriate destination, as described in ZT’s manual.

Broadcast is also carried as multicast to reduce overhead. Hosts can choose to participate or block multicast based on what the host is used for.

This is how ZT scales to very large numbers in the same subnet.

And if that isn’t good enough, they also run a public network called ‘earth’ that’s one large /7, just to make sure that tens of thousands of people on one subnet won’t blow things up.
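
To put those subnet sizes in perspective, here is a quick sketch using Python's standard `ipaddress` module. The two prefixes are placeholders I picked only for their length (a /16 and a /7); they are not taken from any real deployment.

```python
import ipaddress

# Placeholder prefixes chosen only for their length.
flat_l2 = ipaddress.ip_network("10.0.0.0/16")
earth_sized = ipaddress.ip_network("28.0.0.0/7")


def address_count(network):
    """Total number of addresses in the prefix (2 ** host bits)."""
    return network.num_addresses


print(address_count(flat_l2))      # addresses in a /16
print(address_count(earth_sized))  # addresses in a /7
```

A /16 holds 65,536 addresses and a /7 holds over 33 million, which is why broadcast handling matters so much at this scale.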


Operating system support

One of the great benefits of using ZeroTier is that it runs on practically everything including Windows, Linux, iPhone, iPad and Mac.

Just download and install the appropriate client then go to my.zerotier.com to sign up for an account and create an overlay network.

ZT Controller – managing endpoints

Once you’ve created an overlay network, ZeroTier makes it incredibly easy to manage and authorize endpoints.

Below is a screenshot of endpoints in production. Notice the last entry has IPv6 available for internet access, so ZeroTier will use that to transport the IPv4 overlay network – which is a huge benefit…the underlay IP version doesn’t matter.

This is not true in almost every other enterprise VPN solution I’ve come across – IPv4 must encapsulate in IPv4 and IPv6 must encapsulate in IPv6.

Injecting routes

Routes are delivered to endpoints by the controller, which dynamically injects static routes into each host.

These routes show up on the host routing table to provide connectivity to endpoints within the network.

And this is what it looks like for the 10.255.x.x routes on an endpoint host to a 100.125.0.x gateway (which represents a cloud or physical DC in this network)
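
As a rough illustration of what the host does with those injected routes, the sketch below models a routing table and a longest-prefix-match lookup. The 10.255.x.x destinations and the 100.125.0.x gateway follow the text; the exact prefix lengths, the gateway address and the LAN default route are assumptions made up for the example.

```python
import ipaddress

# Hypothetical host routing table after the controller pushes its routes.
ROUTES = [
    ("0.0.0.0/0", "192.168.1.1"),      # assumed local LAN default route
    ("10.255.0.0/16", "100.125.0.1"),  # injected route: DC subnets via the ZT gateway
    ("100.125.0.0/24", None),          # directly connected ZT overlay (no next hop)
]


def lookup(destination):
    """Return (prefix, next_hop) for the longest prefix matching destination."""
    addr = ipaddress.ip_address(destination)
    matches = [
        (ipaddress.ip_network(prefix), hop)
        for prefix, hop in ROUTES
        if addr in ipaddress.ip_network(prefix)
    ]
    best = max(matches, key=lambda match: match[0].prefixlen)
    return str(best[0]), best[1]


print(lookup("10.255.12.7"))  # DC traffic follows the injected route
print(lookup("8.8.8.8"))      # Internet traffic stays on the local default
```

Only traffic for the DC prefixes is pulled into the overlay; everything else keeps using the host's normal default route unless a default is injected too.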

Security policy

ZeroTier has the ability to push flow rules to endpoints that can permit / deny or change the flow of traffic.

There are a number of ways to leverage this along with rules at the OPNSense firewall to create a security policy that is modular, effective and functional.

Example: one way to leverage flow policies is to allow RDP only on this particular ZT network by permitting TCP/3389 traffic. Combined with host authentication and firewall level permissions, access to RDP can be tightly controlled by using 3 layers of security policy.
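
The three layers in that example can be modeled in a few lines of Python. This is only a conceptual sketch: it is not ZeroTier's rules language or an OPNSense rule export, and every member name and address in it is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flow:
    member: str    # overlay endpoint identifier
    dst_ip: str
    protocol: str
    dst_port: int


# Hypothetical policy data for each layer.
AUTHORIZED_MEMBERS = {"laptop-01", "laptop-02"}          # layer 1: host authorization
FLOW_RULE_ALLOWED = {("tcp", 3389)}                      # layer 2: overlay flow rule
FIREWALL_RDP_SERVERS = {"10.255.10.11", "10.255.10.12"}  # layer 3: firewall rule


def permitted(flow):
    """A flow passes only if all three layers allow it."""
    checks = (
        flow.member in AUTHORIZED_MEMBERS,
        (flow.protocol, flow.dst_port) in FLOW_RULE_ALLOWED,
        flow.dst_ip in FIREWALL_RDP_SERVERS,
    )
    return all(checks)


print(permitted(Flow("laptop-01", "10.255.10.11", "tcp", 3389)))  # RDP to an allowed server
print(permitted(Flow("laptop-01", "10.255.10.11", "tcp", 22)))    # SSH is dropped by the flow rule
```

The point of layering is that a mistake in any single layer does not open access by itself.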


Using OPNSense as a firewall


If you’ve never used it, OPNSense is a fantastic open source firewall package.

It can be deployed from an ISO into a VM or as a bootstrap install in the cloud (I’ve successfully used AWS and Digital Ocean)

Mesh VPN and Routing

One of the initial challenges when using Mesh VPNs was to interconnect with routers and provide security policy beyond just the controller.

OPNSense has a number of packages and plugins – what initially drew me to it was the support for ZeroTier out of the box.


Installation was painless and the ZeroTier adapter was running and reachable within minutes.

Security rules

Here is a brief example of a security rule in OPNSense defining access coming from a ZeroTier remote worker subnet to a group of RDP Servers

That’s pretty much all you need to get started with connecting remote workers into the firewall.

Should you decide to force Internet access through the firewall, a NAT policy can be set up and ZeroTier can inject (or remove) a default route to the host if so desired.

The last piece to the puzzle – dynamic routing – is covered in the next section.

FRRouting for dynamic routing


The final piece to glue together the Mesh VPN connectivity through the firewall is a way to dynamically advertise the VPN subnets into a router or another firewall.

FRRouting or Free Range Routing is an open source routing stack that was forked from Quagga a few years ago.

It’s a very solid and capable way to turn any Linux box into a feature rich routing platform.

A quick look at the protocols supported shows a wealth of options

The plugin for OPNSense is installed in the same way as ZeroTier and is equally painless. Once installed, a tab for routing shows up in the left hand menu.

OSPF

Configuring OSPF and creating neighbor adjacencies is very straightforward. In this network, OSPF is used to advertise loopbacks for iBGP to the DC core switch.

Here is an example from the OPNSense UI


BGP

BGP is also very easy to configure. In this example from the OPNSense UI, several ZT networks are advertised into the DC core route reflector via iBGP.


Active routes

All routes for FRRouting, including kernel routes, can be viewed from the routing diagnostics tab in OPNSense.

Here is a view of the ZeroTier 100.125.x.x routes coming from the OPNSense FW and FRRouting into the DC Core route reflector.

Running configuration

FRRouting has a running config that’s consistent with an industry standard CLI configuration.

The UI plugin will update the running config based on the web configuration, but features that aren’t supported in the UI can still be added by editing the running config.
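
To give a feel for what that CLI-style config looks like, here is a small Python sketch that renders a minimal iBGP stanza as text. The ASN, neighbor address and prefixes are made-up values, and the output is illustrative only; check it against the FRRouting documentation and your own running config before using anything like it.

```python
def render_bgp(asn, neighbor, networks):
    """Render a minimal FRR-style iBGP stanza as text (illustrative only)."""
    lines = [
        f"router bgp {asn}",
        f" neighbor {neighbor} remote-as {asn}",  # same ASN on both sides means iBGP
        " address-family ipv4 unicast",
    ]
    lines += [f"  network {prefix}" for prefix in networks]
    lines.append(" exit-address-family")
    return "\n".join(lines)


# Hypothetical values: a private ASN, a route reflector loopback, two ZT networks.
print(render_bgp(65000, "10.0.0.1", ["100.125.0.0/24", "100.125.1.0/24"]))
```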

Final thoughts

Ultimately, this is merely a functional starting point for a corporate VPN solution.

There are so many security and networking pieces that could be added depending on an organization’s compliance and regulatory requirements.

However, the advantage of using open source components is that anything can be added to a prod build with a little time and some testing.

The key is to get a lab built and tested.

Try building out a solution with Mesh VPNs, open source firewall and routing tools and see where it takes you!

Starting a WISP: guide to selecting a routing architecture

Understanding the choices – why is routing design so important?

Routing is the foundation of every IP network. Even a router as small as the one in your home has a routing table and makes routing decisions.

Selecting a routing architecture is a critical but often overlooked step to ensure that a startup WISP can provide the necessary performance, scalability and resiliency to its subscribers.

This post will go through each of the major design types and highlight pros/cons and when it is appropriate to use a particular routing architecture.

A note on IPv6

Dual stack is assumed in all of the designs presented. The cost of public IPv4 space will continue to climb.

It’s no longer a scalable option in 2020 to build an ISP network without at least a plan for IPv6 and ideally a production implementation.

1. Flat network (aka bridged network)

“Behind the L3 boundary, there be L2 dragons”

-ancient network proverb

Unfortunately, this is often the worst choice for all but the smallest WISPs that don’t have any plans to scale beyond 1 to 100 subscribers.

Bridged networks with one or more subnets in the same L2 broadcast domain are the most commonly deployed routing design that we see in our day to day consulting work for WISPs.

Bridged networks are attractive because they require minimal networking knowledge to get up and running.

These networks have a number of limitations in scale and performance and are susceptible to loops. They also can cause RF problems with the number of broadcasts sent across all towers.

This drawing is from the blog post ‘WISP Design – Migrating from Bridged to Routed’, which has more information on issues with bridged networks and how to migrate a current bridged network to a routed one.

This is not an ideal choice for a startup, because it almost guarantees you’ll need a disruptive, time consuming and expensive migration once the subscriber count starts to grow.



CAUTION: “for-profit” WISPs – it is **NOT** recommended to deploy this design.


When should I deploy this network type?

Now that the “Most of you probably shouldn’t do this” warnings are out of the way… there are a few corner cases of WISPs that are for government use, non-profits, research, etc that this design can be a good fit for.

Use this design when:

  • simplicity is the ultimate goal (way beyond all others)
  • the WISP will *never* go beyond 100 subscribers
  • the WISP will be managed by someone without a networking and/or technical background


2. Static routing

Static routing is a *slight* step up from a bridged network.

With layer 3 separation between the towers, the risks of major performance issues with growth go way down.

However, the administrative burden of growth is still an issue with static routes.

This design can be used for a very simple network with only a few routers until a dynamic routing protocol can be configured.

When **NOT** to deploy this design

  • Startup WISPs that plan on having more than one geographic IP transit location should consider one of the two BGP based designs as there are better policy options to influence routing.
  • Startup WISPs that expect rapid growth will not want to use this design. It doesn’t scale well and becomes difficult to manage beyond a few routers (it is best kept to fewer than 5).
  • WISPs that want to dynamically failover between backhaul links can quickly get into issues when trying to manage failover for more than one or two routers with static routes.
  • If traffic engineering is desired, this is not the right design

Note: Static routing in this context means static routes for all subnet reachability – it is not meant to include a static route (when needed) to 0.0.0.0/0 as a default on an isolated management network or for a DIA circuit

When should you deploy this design?

  • A small network that will never exceed 1 to 5 routers
  • An extremely lossy and/or high-latency RF network that will cause issues with dynamic protocols.
  • If knowledge of OSPF/BGP becomes a roadblock in getting a routed network up and running
  • If routing is a knowledge gap, use this in the first 30 to 60 days to test radios and Internet access while working on dynamic routing. Don’t leave a network that is intended to scale on static routes.

One of the questions we are often asked is:

Do I really need dynamic routing for a WISP that’s very small?


The answer lies in the drawing above…


It’s easy to see when looking at this drawing how complex and cumbersome static routing can become even for just 2 to 3 routers.
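
A back-of-the-envelope count makes the same point. The sketch below assumes the worst case: no summarization, every router needs an explicit static route to every subnet behind every other router, and dual stack doubles the entries. Real networks can trim this with summaries and defaults, so treat the numbers as an upper bound.

```python
def static_routes_needed(routers, subnets_per_router=1, address_families=2):
    """Total static routes across the network, assuming no summarization."""
    per_router = (routers - 1) * subnets_per_router * address_families
    return per_router * routers


for count in (2, 3, 5, 10):
    print(count, "routers:", static_routes_needed(count, subnets_per_router=3))
```

The total grows with the square of the router count, and every one of those entries has to be touched by hand when a subnet or backhaul changes.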

Dynamic routing using OSPF

Open Shortest Path First or OSPF is an interior gateway protocol defined by RFC2328 for version 2 (IPv4) and RFC5340 for version 3 (IPv6).

Without going into an enormous amount of detail about the background of OSPF and IGPs in general (RIP, EIGRP, IS-IS), which is out of the scope of this post, here are a few key points about the protocol:

OSPF overview for WISPs

  • Uses Dijkstra’s shortest path first algorithm (developed in 1956 – long before IP networks) to compute paths in the network.
  • Relies on ‘cost‘ which is an arbitrary value that can be set to mirror the speed of a given RF link (typically under ideal conditions) if desired to better reflect the “best” path through a series of backhauls.
  • A link-state protocol – this means that OSPF is concerned with the speed, topology and current state of every link on every router within an area. Due to this behavior, flapping RF links can sometimes cause a “ripple” effect of bouncing routes across an area. This is one good reason to put non-core routes at towers into separate areas. (Alternatively, if BGP is used, only transit/loopback routes remain in OSPF, so the net effect is similar.)
  • OSPF does a great job at mapping out paths, speeds and reachability for subnets, which is why it’s often the first dynamic routing protocol many WISPs learn. It also has significant limitations when used as a protocol for policy, which BGP is better suited for – the next design will show them paired together to get the best mix of reachability/paths + policy.
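
To make the cost and shortest-path idea concrete, here is a minimal Dijkstra sketch over a hypothetical four-node tower topology. The node names and link costs are invented for the example, and a real OSPF implementation does far more (areas, LSA flooding, equal-cost paths); this only shows how the lowest total cost wins.

```python
import heapq

# Hypothetical topology: lower cost represents a faster backhaul.
LINKS = {
    "core":   {"towerA": 10, "towerB": 40},
    "towerA": {"core": 10, "towerB": 10, "towerC": 50},
    "towerB": {"core": 40, "towerA": 10, "towerC": 10},
    "towerC": {"towerA": 50, "towerB": 10},
}


def spf(graph, source):
    """Return the lowest total cost from source to every reachable node."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        cost, node = heapq.heappop(heap)
        if cost > dist.get(node, float("inf")):
            continue  # stale heap entry, a cheaper path was already found
        for neighbor, link_cost in graph[node].items():
            new_cost = cost + link_cost
            if new_cost < dist.get(neighbor, float("inf")):
                dist[neighbor] = new_cost
                heapq.heappush(heap, (new_cost, neighbor))
    return dist


print(spf(LINKS, "core"))
```

Note that towerB is reached through towerA (10 + 10) rather than over the direct but slower cost-40 link, which is exactly the behavior you tune when you set costs to mirror RF link speeds.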


When **NOT** to deploy this design

  • Startup WISPs that plan on having more than one geographic IP transit location should consider one of the two BGP based designs as there are better policy options to influence both default and tower routing.
  • A WISP that plans to offer private L2VPN/L3VPN services should use the iBGP/OSPF/MPLS design in the next section.
  • A WISP that has or will have a complex tower topology with many redundant paths and needs a heavy focus on traffic engineering should consider the eBGP design in the last section.

When should you deploy this design?

  • A WISP that has no plans to offer private L2VPN/L3VPN services can successfully use this design
  • A WISP that will have mostly a non-mesh or significant partial-mesh physical topology of towers and backhauls – essentially this means that policy will likely not be required for traffic engineering and redundant RF PTP links can exist in standby and not be used towards aggregate capacity.
  • If the amount of IPv4 space used by a startup WISP will be less than a /24, then MPLS/VPLS is not required to improve IPv4 subnetting efficiency. While it is possible to deploy MPLS with only OSPF, the next section on iBGP will recommend the deployment of MPLS/LDP with iBGP/OSPF.



Dynamic routing using iBGP/OSPFv2 & v3/MPLS

Border Gateway Protocol or BGP is an exterior gateway protocol defined by RFC4271 for IPv4 and RFC2545 for IPv6. Although BGP started out as a protocol that was intended only for use on the Internet between public ASNs, it quickly became used in a variety of network types due to the policy options it offers. Policy describes BGP in a nutshell: it isn’t concerned with link speeds, physical topology or link state. BGP is purely focused on policy and the best path algorithm.

Multiprotocol Label Switching or MPLS is a forwarding protocol that assigns labels to routes and allows for the abstraction of different services carried in an overlay on top of the routed/label-switched core. MPLS is defined in RFC3031.

Similar to OSPF, I won’t go into an enormous amount of detail about BGP and MPLS, which is out of the scope of this post, but here are a few key points about each protocol:

BGP overview for WISPs

  • BGP uses TCP port 179 to exchange routing information with another BGP speaker.
  • Internal BGP vs. External BGP has little to do with whether it’s used on the Internet. It simply means either a peering within the same ASN (internal) or a peering between different ASNs (external). There are a number of differences between the two, but the most fundamental is the next hop behavior – eBGP rewrites the next hop to be the local router before advertising the route whereas iBGP does not change the next hop by default.
  • Internal BGP uses recursive routing (the next hop is not directly connected and requires a route lookup) and does not change its next hop. OSPF is used to advertise the next hop so the next hop (typically a loopback) is always reachable. If you reference the drawing for this design section, you’ll notice the routes are either green (BGP) or blue (OSPF) to help clarify how they work together.
  • Internal BGP relies on route reflectors (RR) to manage all routes in an ASN. This helps to avoid a full mesh of peerings. Normally a router in the core will act as the route reflector. One of the advantages for a WISP of using RRs is simplified tower router configs – they will always peer to the same pair of RRs regardless of location.
  • Using BGP in a WISP allows for a number of policy options to make selection of a default route or influencing a tower path much easier.
  • One of the greatest benefits of using BGP in a WISP is the simplification of routing protocols for host and subscriber subnets. Once routing is BGP end to end from the peering to the upstream router all the way to the last mile at the tower, it becomes easier to manage and apply policy.
  • BGP also offers better scaling options to grow than OSPF on a WISP network. I’ve consulted for a number of WISPs over 10,000 subscribers (which is a fairly sizeable WISP) and the vast majority of them run BGP due to policy and scale limitations of OSPF.
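
The route reflector point is easy to quantify. The sketch below compares the number of iBGP sessions in a full mesh against a design where every router peers only with a pair of route reflectors; the router counts are arbitrary examples.

```python
def full_mesh_sessions(routers):
    """Every router peers with every other router: n * (n - 1) / 2."""
    return routers * (routers - 1) // 2


def route_reflector_sessions(routers, reflectors=2):
    """Clients peer with each RR; the RRs also peer with each other."""
    clients = routers - reflectors
    return clients * reflectors + full_mesh_sessions(reflectors)


for count in (10, 50, 200):
    print(count, full_mesh_sessions(count), route_reflector_sessions(count))
```

With 50 routers a full mesh needs 1,225 sessions, while a pair of RRs needs 97, and each tower router only ever carries two peerings.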

MPLS Overview for WISPs

  • MPLS requires a signaling protocol to exchange labels. Label Distribution Protocol or LDP is most commonly used. LDP uses TCP port 646 to build sessions and UDP port 646 for discovery. LDP is what takes the route information from other protocols like OSPF and BGP and assigns labels to it for forwarding.
  • MPLS is an incredibly useful tool for WISPs because it allows for the network to be sliced into virtual segments at layer 2 (L2VPN) or layer 3 (L3VPN) to deliver services that subscribers may ask for, or for internal needs like isolated management routing tables or VRFs (Virtual Routing and Forwarding).
  • One of the most helpful features of MPLS for a WISP lately has been the use of VPLS (L2VPN) to make subnetting a public IPv4 block more efficient. Traditionally, WISPs would try and break up a public block into small subnets to have public IPv4 available at every tower. As IPv4 became more scarce and costly, this became much harder to do. VPLS solves this problem by hosting the subnet at a data center or central point and then extending Layer 2 to wherever it is needed. This service is highlighted in purple in the drawing
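
Here is a rough sketch of the subnetting efficiency argument. It assumes a single public /24 (a documentation prefix stands in for it), 32 towers each getting a /29 in the routed case, and one gateway address per subnet; those numbers are mine, picked to keep the arithmetic simple.

```python
import ipaddress

block = ipaddress.ip_network("203.0.113.0/24")  # documentation prefix as a stand-in


def usable_for_subscribers(network):
    """Addresses left after the network, broadcast and gateway addresses."""
    return network.num_addresses - 3


# Routed design: carve the block into one small subnet per tower.
per_tower = list(block.subnets(new_prefix=29))
routed_total = sum(usable_for_subscribers(net) for net in per_tower)

# VPLS design: host the whole block centrally and extend L2 to the towers.
vpls_total = usable_for_subscribers(block)

print(len(per_tower), routed_total, vpls_total)
```

Under these assumptions the carved-up block yields 160 subscriber addresses while the single extended subnet yields 253, and addresses are no longer stranded at towers that don't need them.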

BGP – More information

Network Collective
EPISODE 17 – BGP: PEERING AND REACHABILITY

https://networkcollective.com/2017/12/ep17-bgp-peering-reachability/

MPLS – More information
MikroTik US MUM 2016 – Dallas, Texas

MPLS Overview, Design and Implementation for WISPs



When **NOT** to deploy this design

  • If a WISP feels this is more complex than they can handle, the previous OSPF design can be used with loopbacks to prepare for the addition of BGP and MPLS at a future date. While it will require some migration time, it still provides a path forward to enable more advanced services.
  • If your focus is mainly residential, your subscriber count will never exceed 1000 subscribers and your tower topology is relatively simple, then you can likely use the previous OSPF design successfully.
  • If Traffic Engineering without using segment routing or MPLS TE is the single most important design requirement, then the eBGP design in the next section is the best choice.

When should you deploy this design?

  • All the time! (almost…) This is probably the most common type of design we build and deploy as WISP consultants because it:
    • Scales well (we’ve deployed at the scale of thousands of towers)
    • Is easier to manage operationally – using route reflectors centralizes routing policy decisions.
    • Has the most options available to deliver overlay services which align with the ability of the business to rapidly bring a product to market.
    • Isn’t new – this marriage of BGP/OSPF/MPLS has been the foundation of most Telco and Fiber operators for more than a decade. The reason is simple – it works.
  • Deploying L3VPN services for management VRF or for business customers
  • Planning a hybrid build with fiber and will be moving to equipment that supports more advanced features like MPLS TE with Fast Reroute or Segment Routing (even if you can’t leverage all traffic engineering features day one – the design will be ready to grow into more capable gear)


One of the questions we are often asked about this design is:


Should I start with all of these protocols from the very beginning? Why not just pick BGP or OSPF? Why do I need both?

The answer is twofold

  • It’s far easier to learn more advanced routing concepts when the network is smaller or a complete greenfield. At some point you’ll need more advanced tools and migrating 25, 50 or 100+ towers is expensive, disruptive and incredibly time consuming.
  • The reason you wouldn’t pick one or the other in this design is that each protocol focuses on its strengths – OSPF is excellent for path calculation and topology and BGP is excellent at policy. With BGP working on top of OSPF, you get the best of both worlds.
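
The way the two protocols stack can be sketched as a two-step lookup: BGP supplies the subscriber prefix with a loopback as next hop, and OSPF supplies the path to that loopback. All prefixes, addresses and the interface name below are hypothetical.

```python
import ipaddress

# Hypothetical tables on a core router.
BGP_ROUTES = {"198.51.100.0/24": "10.0.0.3"}           # subscriber subnet -> tower loopback
OSPF_ROUTES = {"10.0.0.3/32": ("10.1.1.2", "ether1")}  # loopback -> (gateway, interface)


def resolve(destination):
    """Recursively resolve a destination: BGP route first, then OSPF for its next hop."""
    addr = ipaddress.ip_address(destination)
    for prefix, bgp_next_hop in BGP_ROUTES.items():
        if addr in ipaddress.ip_network(prefix):
            hop = ipaddress.ip_address(bgp_next_hop)
            for igp_prefix, (gateway, interface) in OSPF_ROUTES.items():
                if hop in ipaddress.ip_network(igp_prefix):
                    return bgp_next_hop, gateway, interface
            return None  # next hop unreachable, so the BGP route is unusable
    return None


print(resolve("198.51.100.25"))
```

If a backhaul fails, only the OSPF entry for the loopback changes; the BGP route and its policy stay exactly as they were.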

Dynamic routing using eBGP

This is definitely one of the newest designs we’ve worked with and deployed into production. It’s growing in popularity because using BGP for all routing actually simplifies the design quite a bit while still being able to deliver VPLS services. This design also allows for incredibly diverse traffic engineering options and full IPv4/IPv6 dual stack with BGP which is limited in some vendors.

Vendor note:
It’s worth noting that some of the limitations in the current version of MikroTik RouterOS 6.xx are what prompted this specific design:

  • MPLS traffic engineering has significant limitations and cannot enforce more than one policy between two points. It also cannot enforce policy across OSPF areas.
  • Dual stack with iBGP in IPv6 will not be functional until RouterOS version 7 is in prod and out of beta.


Rather than list bullet points on this topic to illustrate why and how eBGP is useful for traffic engineering in WISPs, I’ll share the following video from:

MikroTik US MUM 2017 – Denver, Colorado


When **NOT** to deploy this design

  • If a WISP requires L3VPN and/or VRFs for management, then the previous iBGP design would be better suited for that
  • If BGP knowledge becomes a roadblock that cannot be settled or worked around, then moving back to an OSPF only design is an option (However, I always recommend using labs like GNS3 or EVE-NG to train engineers and techs to solve this problem)

When should you deploy this design?

  • When traffic engineering is the primary driver for the business, but utilizing equipment that supports segment routing/MPLS TE is out of the budget, this is a workable solution to have total control of traffic paths.
    • BGP Communities can be used to steer a subnet along any path desired
    • Traffic paths can be modified using communities for both traffic to and from the towers.
  • When IPv6 dual stack with BGP on the tower network is required (Again a limitation specifically attributed to MikroTik RouterOS v6 and fixed in v7 beta)
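
As a simplified sketch of how communities can steer traffic, the code below maps a community to a local preference on import and then picks the best path using only the first two tie-breakers (highest local preference, then shortest AS path). The community values, ASNs and path names are hypothetical, and real BGP best-path selection has many more steps.

```python
# Hypothetical import policy: community -> local preference.
COMMUNITY_LOCAL_PREF = {"65000:100": 200, "65000:200": 50}
DEFAULT_LOCAL_PREF = 100


def local_pref(route):
    """Local preference assigned on import, based on the route's communities."""
    prefs = [
        COMMUNITY_LOCAL_PREF[community]
        for community in route["communities"]
        if community in COMMUNITY_LOCAL_PREF
    ]
    return max(prefs) if prefs else DEFAULT_LOCAL_PREF


def best_path(routes):
    """Prefer the highest local preference, then the shortest AS path."""
    return max(routes, key=lambda r: (local_pref(r), -len(r["as_path"])))


via_a = {"name": "via-towerA", "as_path": [65001, 65003], "communities": set()}
via_b = {"name": "via-towerB", "as_path": [65002, 65004, 65003], "communities": {"65000:100"}}

print(best_path([via_a, via_b])["name"])  # the tagged, longer path wins
```

Remove the community from the second route and the shorter path through towerA is selected again, which is the kind of per-subnet control the bullets above describe.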






How to choose?

Even with the shallow depth I’ve given the routing protocols, this should still highlight the pros and cons of each design and also illustrate the role of the network vendor in supporting the necessary protocols.

Take some time (even if you have to read this article in several passes) to really understand the business case of your WISP, what you sell and what’s important for you to develop in the future.

Then evaluate the protocols needed vs. the budget required for more advanced equipment and calculate the ROI of features and agility vs. equipment/licensing cost.

Good luck!