WISP/FISP Design – Building your future MPLS network with whitebox switching.

The role of whitebox in a WISP/FISP MPLS core

Whitebox, if you aren’t familiar with it, is the idea of separating the network operating system and switching hardware into commodity elements that can be purchased separately. There was a good overview on whitebox in this StubArea51.net article a while back if you’re looking for some background.

Lately, I’ve had a number of clients interested in deploying IP Infusion with Dell, Agema or Edge Core switches to build an MPLS core architecture in lieu of an L2 ring deployment via ERPS. Add to that a production deployment of Cumulus Linux and Edge Core that I’ve been working on building out, and it’s been a great year for whitebox.

There are a number of articles that extol the virtues of whitebox for web-scale companies, large service providers and big enterprises. However, not much has been written on how whitebox can help smaller Tier 2 and Tier 3 ISPs – especially Wireless ISPs (WISPs) and Fiber ISPs (FISPs).

And the line between those types of ISPs gets blurrier by the day, as WISPs get heavily into fiber and FISPs get into last-mile RF. Some of the most successful ISPs I consult for are a hybrid of the two.

The goal of any ISP stakeholder, whether large or small, should be to achieve the lowest cost per port for any network platform – while maintaining the same or better level of service – so that service offerings can be improved or expanded without passing the financial burden down to the end subscriber.

Whitebox is well positioned to aid ISPs in attaining that goal.

Whitebox vs. Traditional Vendor

Whitebox is rapidly gaining traction and is on track to become the new status quo in networking. The days of proprietary hardware as the dominant force are numbered. Correspondingly, the extremely high R&D and manufacturing costs passed along to customers also seem to be in jeopardy for mainstream vendors like Cisco and Juniper.

Here are a few of the advantages whitebox has for Tier 2 and Tier 3 ISPs:

  • Cost – it is not uncommon to find 48 ports of 10 gig and 4 ports of 40 gig on a new whitebox switch, with licensing, for under $10k. Comparable deployments from Cisco, Juniper, Brocade, etc. typically exceed that number by a factor of 3 or more.
  • SDN and NFV – Open standards and open development are at the heart of the SDN and NFV movement, so it’s no surprise that whitebox vendors are knee deep in SDN and NFV solutions. Because whitebox operating systems are modular, less cluttered and have built-in hardware abstraction, SDN and NFV become much easier to implement.
  • No graymarket penalty – Because the operating system and hardware are separate, there isn’t an issue with obtaining hardware from the graymarket and then going to get a license with support. While the cost of the hardware brand new is still incredibly affordable, some ISPs leverage the graymarket to expand when faced with limited financial resources.
  • Stability – whitebox operating systems tend to implement open standards protocols and stick to mainstream use cases. The lack of proprietary corner case features allows the development teams for a whitebox NOS to be more thorough about testing for stability, interop and fixing bugs.
  • Focus on software – One of the benefits of separating hardware and software for network equipment is a singular focus on software development, instead of having to jump through hoops to support hundreds of platforms that sometimes have very short product lifecycles. This is probably the single greatest challenge traditional vendors face in producing high-quality software.
  • ISSU – Often touted as a competitive advantage by the likes of Cisco and others, In-Service Software Upgrade (ISSU) is now supported by some whitebox NOS vendors.

IP Infusion

IP Infusion (IPI) first got on my radar about two years ago, when I was working through a POC for Cumulus Linux and just getting my feet wet in the world of whitebox. What struck me as unique is that IP Infusion has been writing code for protocol stacks and modular network operating systems (ZebOS) for the last 20 years – essentially making them a seasoned veteran at turning out stable NOS code. As the commodity hardware movement gathered steam, IP Infusion took the knowledge and experience from ZebOS and created OcNOS, an ONIE-compatible platform.

Earlier this year, I attended Networking Field Day 14 (NFD14) as a delegate and was pleasantly surprised to learn that IP Infusion presented at Networking Field Day 15 (NFD15) back in April. I highly recommend watching all of the NFD15 videos on IP Infusion, as you’d be hard pressed to find a better technical deep dive on IPI anywhere else. Some of the technical and background content here is taken from the video sessions at NFD15.

Background

  • Has its roots in GNU Zebra routing engine
  • Strong adherence to standards-based protocol implementations
  • Original white label NOS ZebOS has been around for 20+ years and is used by companies like F5, Fortinet and Citrix

Advantages

  • Very service provider focused with advanced feature sets for BGP/MPLS
  • OcNOS benefits from 20 years of white label NOS development and, according to IP Infusion’s marketing material, has shown “six 9s” of stability as observed by their larger ISP customers.
  • Perpetual licensing – once the license is purchased, the only recurring cost is the annual maintenance which is a much smaller fee (typically around 15% of the license)
  • Extensive API support – IPI provides APIs for protocols like BGP to facilitate integration with automation and orchestration.
  • Easier hardware abstractions than proprietary NOS – look for chassis based whitebox and form factors beyond 1U in the future
  • Increased focus on the 1 Gbps switch market with Broadcom’s feature-rich Qumran chipset, so that start-ups and very small ISPs can still leverage the benefits of whitebox. Larger Tier 2 and 3 ISPs will also gain a switching solution for edge, aggregation and customer CPE needs.

Integrating OcNOS with MikroTik/Ubiquiti

I’ve specifically singled out IP Infusion, instead of doing a more in-depth comparison of all the various whitebox operating systems, because IP Infusion is positioned to be the best choice for Tier 2 and 3 ISPs due to its available feature set and modular approach to building protocol support. Going a step further, it’s a natural fit for ISPs running MikroTik or Ubiquiti, as OcNOS fills in many of the gaps in protocol support (MPLS TE and FRR especially) that appear when building an MPLS core for a rapidly expanding ISP.

While I’ve successfully built MPLS into many ISPs with MikroTik and Ubiquiti and continue to do so, there is a scaling limit that most ISPs eventually hit, at which point they need ASIC-based hardware and the ability to design comprehensive traffic engineering policies.

The good news is that MikroTik and Ubiquiti still have a role to play when building a whitebox core. Both work very well as MPLS PE routers that can be attached to the IP Infusion MPLS core. Last mile services can then be delivered in a very cost effective way leveraging technologies like VPLS or L3VPN.
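As a rough sketch of that PE role, a MikroTik router at an aggregation site might run LDP toward the whitebox core and bridge a customer-facing port into a VPLS instance. The loopback addresses, interface names and VPLS ID below are hypothetical placeholders, and the syntax is RouterOS v6:

```
# Enable LDP toward the core (LSR ID is the router's loopback - placeholder)
/mpls ldp set enabled=yes lsr-id=10.0.0.5 transport-address=10.0.0.5
/mpls ldp interface add interface=ether1

# VPLS pseudowire to the remote PE (10.0.0.9) carrying a customer service
/interface vpls add name=vpls-cust1 remote-peer=10.0.0.9 vpls-id=100:1 disabled=no

# Bridge the customer port into the VPLS tunnel
/interface bridge add name=br-cust1
/interface bridge port add bridge=br-cust1 interface=vpls-cust1
/interface bridge port add bridge=br-cust1 interface=ether2
```

The core never sees the customer MACs directly; it just label-switches the pseudowire between PEs, which is what lets inexpensive PE routers ride on top of ASIC-based whitebox transport.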

Other Whitebox NOS offerings

There are a number of other whitebox network operating systems to choose from. Although the focus here has been on IPI due to its feature set, Cumulus Linux and Big Switch are both great options if you need to deploy data center services.

Cumulus Linux is also rapidly adding MPLS and more advanced routing protocol support to the operating system, and it wouldn’t surprise me if they become more of a contender in the ISP arena over the next few years.

This touches on one of the other great benefits of whitebox: you can stock a common switch and load the operating system that best fits the use case.

For example, the Dell S4048-ON switch (48 x 10 gig, 4 x 40 gig) can run IPI, Cumulus Linux or Big Switch depending on the feature set required.

Some ISPs are getting into – or already run – cloud and colocation services in their data centers. If a compatible whitebox switch is used, stocking replacement hardware and operational maintenance of the ISP and data center portions of the network become far simpler.

Design elements of a WISP/FISP based on a whitebox MPLS core

Here are some examples of the most common elements we are trending towards as we build WISPs and FISPs on a whitebox foundation, coupled with other common low-cost vendors like MikroTik and Ubiquiti.

Whitebox MPLS Core

As ISPs grow, the core tends to move from pure routers to Layer 3 switches to better support higher speeds and to take advantage of technologies like dark fiber and DWDM/CWDM. Many smaller ISPs are starting to compete using the “Google Fiber” model of selling 1 Gbps symmetrical service to residential customers, and need the extra capacity to handle that traffic.

MPLS support on ASICs has traditionally been extremely expensive with costs soaring as the port speeds increase from 1 gig to 10 gig and 40 gig. And yet MPLS is a fundamental requirement for the multi-tenancy needs of an ISP.

Leveraging whitebox hardware allows for MPLS switching in hardware at 10, 40 and 100 gig speeds for a fraction of the cost of vendors like Cisco and Juniper.

This allows ISPs to utilize dark fiber, wave and 10 gig+ Layer 2 services in a more cost-effective way to increase the overall capacity of the core.

MPLS PE for Aggregation

MikroTik and Ubiquiti both have hardware with economical MPLS feature sets that work well as an MPLS PE. Having said that, I give MikroTik the edge here as Ubiquiti has only recently implemented MPLS and is still working on expanding the feature set.

MikroTik in contrast has had MPLS in play for a long time and is a very solid choice when aggregation and PE services are needed. The CCR series in particular has been very popular and stable as a PE router.

Virtual BGP Edge

MikroTik has made great strides in the high performance virtual market with the introduction of the Cloud Hosted Router (CHR) a little over a year ago.

Because the MikroTik kernel currently uses only one processor core for BGP, there has been a trend toward x86 hardware with much higher per-core clock speeds than the CCR series to handle the requirement of a full BGP table.

The CHR is able to process changes in the BGP table much faster as a result and doesn’t suffer from the slow convergence speeds that can happen on CCRs with a large number of full tables.

Couple that with license costs that max out at $200 USD for unlimited speeds and the CHR becomes incredibly attractive as the choice for an edge BGP router.
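A minimal sketch of the CHR edge role in RouterOS v6 syntax is shown below; the ASNs and addresses are documentation-range placeholders, and the filter names are assumed to exist elsewhere in the config:

```
# Local BGP instance (ASN and router-id are placeholders)
/routing bgp instance set default as=64512 router-id=192.0.2.1

# Full-table eBGP session to an upstream transit provider
/routing bgp peer add name=transit1 remote-address=203.0.113.1 remote-as=64496 \
    in-filter=transit-in out-filter=transit-out

# iBGP session back toward the MPLS core
/routing bgp peer add name=core-ibgp remote-address=10.0.0.1 remote-as=64512
```

The single-threaded BGP constraint applies to convergence, not forwarding, which is why a high-clock x86 CHR at the edge pairs well with ASIC-based whitebox switching in the core.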

NFV platform

Network Function Virtualization (NFV) has been getting a lot of press lately, as more and more ISPs turn to hypervisors to spin up resources that would traditionally be handled by purpose-built hardware. NFV allows for more generic hardware deployments of hypervisors and switches, so that specific network functions can be handled virtually.

Some examples are:

  • BGP edge routers (similar to the previous BGP CHR use case)
  • BRAS for PPPoE
  • QoE engines
  • EPC for LTE deployments
  • Security devices like IPS/IDS and WAF
  • MPLS PE routers

There are many ways to leverage x86 horsepower to bring NFV into a WISP or FISP. One platform in particular that is gaining attention is Baltic Networks’ Vengeance router, which runs VMware ESXi and can be used in a number of different NFV deployments.

We have been testing a Vengeance router in the StubArea51.net lab for several months and have seen very positive results. We will be doing a more in depth hardware review on that platform as a separate article in the future.

Closing thoughts

Whitebox is poised for rapid growth in the network world, as the climate is finally becoming favorable – even in larger companies – to use commodity hardware and not be entirely dependent on incumbent network vendors. This is already opening up a number of options for economical growth of ISPs in a platform that appears to be surpassing the larger vendors in reliability due to a more concentrated focus on software.

Commodity networking is here to stay and I look forward to the vast array of problems that it can solve as we build out the next generation of WISP and FISP networks.

The importance of situational awareness for network engineers

 

In another life, not too long ago, I spent a number of years in civilian and military law enforcement. In just about any kind of tactical training, one of the recurring themes they hammer into you is “situational awareness,” or SA.

Wikipedia defines SA as:

Situational awareness or situation awareness (SA) is the perception of environmental elements with respect to time or space, the comprehension of their meaning, and the projection of their status after some variable has changed, such as time, or some other variable, such as a predetermined event. It is also a field of study concerned with understanding of the environment critical to decision-makers in complex, dynamic areas from aviation, air traffic control, ship navigation, power plant operations, military command and control, and emergency services such as fire fighting and policing; to more ordinary but nevertheless complex tasks such as driving an automobile or riding a bicycle.

Defining the need for SA in network engineering

It’s interesting that critical infrastructure such as power plants and air traffic control are listed as disciplines that train in SA, yet I’ve never seen it taught in network engineering. In today’s world, the Internet has become as important as any other critical infrastructure – and in some ways more so. Network engineering shares many high-pressure components with other critical infrastructure: engineers are often forced to make decisions quickly, under stress, that can have a significant financial impact and may have regional, national or even global ramifications. Engineers supporting the military, healthcare, public safety and telecom sectors have the additional responsibility of avoiding downtime that can risk the safety of life and property.

Even when not under the immediate pressure of a high-risk change or outage, network engineers are frequently given projects with timelines too short to plan properly, and when the time to implement comes, unforeseen issues pop up much more frequently. Understanding SA is key to ensuring that engineers focus on the right tasks at the right time and know when to ask for help, when to coordinate, and even when to step back and stop working.

Why don’t we teach SA in networking?

This is largely, I suspect, because SA is one of the things network engineers are expected to acquire with experience, so it isn’t formally taught – or is passed down informally from engineer to engineer. SA is best taught as a combination of practical experience and both formal and informal training. The Cisco CCIE Route/Switch lab exam is one of the few examples of formal training that forces a network engineer to be aware of task priority while assessing the changing state of the network and responding appropriately. That said, after a brief Google search, SA does appear to be taught in cyber security as an adapted form of the training meant for military personnel. Much of that training focuses on how to avoid losing situational awareness by following certain methodologies, and on what to do if you realize you have lost SA.

An example of the loss of SA in networking

This is a scenario just about every engineer can relate to: you’re working on a network cut, and one of the tasks you have to perform (which should be relatively quick) turns out to be quite difficult – because of a misconfiguration, bug, missing info, etc. A common example is network management config like SNMP, NetFlow or TACACS. Let’s say you’re bringing a new L3 switch core online, which involves physical link migration as well as new VLANs and OSPF/BGP. You have a 4-hour maintenance window, and you hit an issue where your centralized TACACS server isn’t responding to the new switch – maybe it’s a security policy on the server, or maybe you have the key wrong. You start to troubleshoot with the expectation that you’ll have it solved in 10 or 15 minutes.

45 minutes later, you’re ALMOST there, so you put your head down and charge on through – you’re only 10 minutes away from solving the issue…right? Another 45 minutes go by and you’re at a complete loss as to why TACACS won’t work properly on this switch. Let’s pause and assess the status of the cut: we are now 1 hour and 30 minutes into the window and have used almost half of the time on a single task. We still have to move physical links, ensure smooth L2 STP convergence when we add a new trunk and VLANs, and bring L3 routing online and test it with BGP and OSPF.

At this point, you’ve probably lost situational awareness, because almost half of the time was spent on one task that is not critical to bringing a new L3 switch core online. This isn’t to say proper AAA isn’t important, but it’s a task that could be done in a follow-up window the next night – or you might extend the current maintenance window after the critical physical and route/switch pieces are done. The point is that in most environments, network management tasks aren’t as critical to configure as L2/L3 forwarding if you have to choose between the two based on time available.
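One way to defuse that exact scenario ahead of time is to configure local authentication as a fallback, so a broken TACACS+ session never locks you out of the device mid-window. A minimal sketch in Cisco-style syntax (server name, address and secrets are placeholders):

```
aaa new-model
!
tacacs server MGMT-TAC
 address ipv4 192.0.2.10
 key ExampleKey123
!
! Try TACACS+ first; fall back to the local user database if it is unreachable
aaa authentication login default group tacacs+ local
!
! Local rescue account used only when TACACS+ is down
username rescue privilege 15 secret ExampleRescue123
```

With a fallback like this in place, the TACACS troubleshooting genuinely becomes deferrable, because losing AAA no longer risks losing access to the new core.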

There are probably thousands of scenarios like this one, but the critical takeaway is to understand exactly which items have to function in order to accomplish the overall goal. A secondary component is understanding where you can make compromises and still come away with a win – in the example above, static routes might be an acceptable temporary solution if you run into issues with dynamic routing.


SA concepts for network engineers

I’ve adapted some of the concepts from the US Coast Guard’s manual on SA to illustrate the important points and their relevance for network engineers. The full text can be found here.

Clues to the loss of SA 

In reviewing a network outage or failed cut, you will often find one or more of these clues in hindsight as you try to figure out what went wrong.

  • Confusion – If something isn’t quite right with the network as you’re working on it, but you can’t put your finger on it, trust your gut and share it with the engineering team or any other relevant parties. Likewise if you sense the team is generally confused about a critical task, don’t be afraid to take a step back and try to get clarification from the team. At worst you may have a short delay, but at best, you may discover some major gaps or other issues that need to be dealt with before moving forward. Confusion is the sworn enemy of communication.
  • Deviation from the plan – we don’t always have the time to plan our network changes and it isn’t realistic to think that 100% of changes will be planned, but if you’re doing it right, most of the config that is put into the network should be planned and tested.  When executing a network change, if you have to deviate from the planned config more than just a few lines, it might be worth putting the change on hold to assess whether or not your on-the-fly changes will upset things in other areas. Trying to think through all the dependencies of adding BGP route reflection in an entire data center at 3AM may not be the best choice to make.
  • Unresolved discrepancies – If the syslog is churning through OSPF neighbor-flapping entries, then maybe it isn’t the best time to introduce more OSPF changes. Understanding the state of the network before you make changes is key to avoiding breaking things even more. This seems like common sense, but all too often we as engineers tend to ignore the danger signs of an unstable network and try to fight our way through to “just get things working.” It’s far better to identify and resolve known issues before introducing new config that could be destabilizing – even if that means missing a target date in a project.
  • Ambiguity – If there isn’t a great deal of clarity around the changes you plan to introduce into the network, you’re setting yourself up for failure. If an engineer comes to me and says he wants to “throw VLANs 100 through 110 onto all the switches that need it,” I would challenge him to come back with the configuration required to do exactly that – and in the process, we will probably gain a much better understanding of the Layer 2 impact versus (con)figuring it out as we go.
  • Fixation/Preoccupation – If we think back to the example earlier about spending an hour and a half on troubleshooting TACACS, that is a great example of fixation on one task. Unfortunately just about all of us have been there because it is so easy to lose sight of the bigger picture when you’re trying to get something to work. This is where a team of engineers can be very helpful. You might suggest to a member of the team who is head down and wrestling with a non-critical problem to refocus on the critical pieces and then the team can come back and work on the tasks together.  If you’re working solo, then you have to be especially vigilant and self-policing when it comes to avoiding fixation/preoccupation.
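To make the ambiguity point above concrete: the exact configuration behind “VLANs 100 through 110” is only a few lines in Cisco-style syntax, and writing it out forces agreement on which switches and trunks are actually involved. The interface name below is a placeholder:

```
! Create the VLAN range on each switch that needs it
vlan 100-110
!
! Allow the new VLANs on the relevant trunk (placeholder interface)
interface GigabitEthernet1/0/48
 switchport mode trunk
 switchport trunk allowed vlan add 100-110
```

Note the use of "add" on the trunk allow-list – omitting it would replace the existing allowed VLANs instead of appending, which is exactly the kind of Layer 2 impact that only surfaces when the config is written out in advance.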

Maintaining Awareness

  • Communication – clear and effective communication, both verbal and written, is key to avoiding major issues.  You should always strive to provide the right information in a timely manner to avoid ambiguity and confusion. It’s important to communicate things like a planned course of action as well as tasks that you are specifically charged with – even if it’s just to verify the information you have is accurate.

  • Identify Problems – If you see a potential issue, don’t wait to bring it up. Sometimes, we as engineers are afraid to raise concerns for fear of looking foolish or uninformed, but harboring a concern and then voicing it once you’re in an outage situation doesn’t help anybody and certainly will make you feel foolish for not bringing it up when you had the chance.
  • Continually Assess the Situation – Continually checking on the state of the network and the impact changes are having will keep you alert for any issues that may pop up. Many of the problems we have to go back and fix as network engineers are a direct result of incomplete validation after a change has been made. Another thought to consider here is that if you are a member of a team, you can delegate this task or ask for assistance so the burden isn’t entirely on you.
  • Clarify expectations – If expectations and goals are not clearly defined, they can be hard to achieve. If it’s not understood that a network change is expected to cause an outage, you may have some unhappy campers on your hands because you didn’t clearly set the expectation.

The lone gunman effect on SA

There is no faster way to lose SA than to try and go it alone in the midst of a team environment…

One of the hardest things to do as a member of a team is to delegate tasks and responsibility, especially when not everyone on the team is at the same technical level. That said, it’s typically not the best idea to be the lone gunman if you have resources available. Simple tasks like checking monitoring systems and logs can be assigned to less experienced team members, and peer review of config can be helpful with more experienced engineers. To take this concept further: if you have support from the network vendor available for peer review, or even a dedicated engineer to sit in on a change window, always take advantage of that resource. Too often, we as engineers become consumed with being the go-to guy or gal and coming up with all the answers. What we should be doing is putting the success of the project first and using any resources available to ensure a successful network change or implementation.

Don’t be afraid to pull the emergency brake

If you find yourself in the middle of a bad situation or sense that you’re about to be in the middle of a major network outage, then STOP! One of the most important skills an engineer can possess is the ability to understand when to call it quits – at least for a brief period – until you can come back with a better plan. We’ve all had to make that gut wrenching call where we decide whether to plow forward or pull back and sometimes there isn’t much of a choice. Whenever you have the option, however, don’t hesitate to take it. The stakeholders will usually overlook schedule delays in the interest of preserving up-time. Also, demonstrating an understanding and respect for the business side of the house that depends on a functional network will probably buy you some goodwill with the business unit and increase the level of trust in your technical judgement.

Closing thoughts

Situational awareness for network engineers is a mindset we have to practice daily, but the overall learning curve is a long-term process. We won’t always get it 100% right, because we are human and subject to a number of external influences that compromise objectivity and work performance. The most important takeaway is to recognize, as other disciplines have, that there is value in understanding SA and the potential issues that come with losing it. The next time you have a network cut, implementation or outage, take a minute to think through the SA concepts and how they relate to your specific corner of IP networking.