Whitebox networking – coming soon to an edge near you?


What is whitebox networking and why is it important?

whitebox-switch_500px-wide

A brief history of the origins of whitebox

One of the many interesting conversations to come out of my recent trip to Network Field Day 14 (NFD14) hosted by Gestalt IT was a discussion on the future of whitebox. As someone who co-founded a firm that consults on whitebox and open networking, it was a topic that really captivated me and generated a flurry of ideas on the subject. This will be the first in a series of posts about my experiences and thoughts on NFD14.

Whitebox is a critical movement in the network industry that is reshaping the landscape of what equipment and software we use to build networks. At the dawn of the age of IT in the late 80s and early 90’s, we used computing hardware and software that was proprietary – a great example would be an IBM mainframe.

Then we evolved into the world of x86 and along came a number of operating systems that we could choose from to customize the delivery of applications and services. Hardware became a commodity and software became independent of the hardware manufacturer.

ONIE – The beginning of independent network OS

A few years ago, an initiative called the Open Network Install Envrionment or ONIE for short was started by Cumulus Networks and shortly thereafter received support from NFD14 presenter, Big Switch Networks, among others. It was the first open project that developed a common multi-vendor framework for separating a network operating system from the hardware – just like we did with x86 and computing in the late 80s and early 90s.

The importance of whitebox in 2017 and beyond

Whitebox is critically important to the future of networking because it is forcing all incumbent network hardware/software vendors to compete in an entirely different way. The idea that a tightly integrated network operating system on proprietary hardware is essential to maintaining uptime and availability is rapidly fading.

Tech giants like Google, Facebook and Linkedin have proven that commodity and open network hardware/software can scale and support the most mission critical environments. This has given early adopters of whitebox networking the confidence to deploy it in the enterprise data center.

Whitebox is only a data center technology…right?

White box has been so disruptive in the data center, why wouldn’t we want it everywhere?

As we visited the presenters of NFD14 who had been developing whitebox software and hardware (Big Switch Networks and Barefoot Networks), one idea in particular stood out in my mind – White box has been so disruptive in the data center, why wouldn’t we want it everywhere?

The use case for whitebox in the data center has grown so much in the last few years, that it’s now a major part of the conversation when selecting a vendor for new data centers that I’m involved with designing. This is a huge leap forward as most companies would have struggled with considering whitebox technology as recently as two or three years ago.

A clear advantage to come out of the hard work that whitebox vendors have done in marketing and selling data centers on the proposition of commodity hardware and independent operating systems is that it’s paved the way for whitebox to make it to the branch and edge of the network.

Vendor position on whitebox as an edge technology

The idea that whitebox can be used outside of the data center seems to be prevalent among the vendors we met with during NFD14. This is a question I specifically posed to both Barefoot and Big Switch and the answer was the same – both companies have developed technology that is reshaping the network engineering landscape and they would like to see it go beyond the boundaries of the data center.

I would expect in the next 12 to 18 months, we are going to see the possibility of use cases targeted to the edge and maybe even campus distribution/core. A small leaf spine architecture would be well suited to run most enterprise campuses and the CAPEX/OPEX benefits would be enormous. Couple that with smaller edge switches and you’ve got a really strong argument to make for ditching the incumbent network vendor as well as breaking out of vendor lock in.

And it makes a lot of sense for whitebox companies to expand into this market, there are already edge platforms in the world of commodity switching that support 48 ports of POE copper with 4 x 10 Gpbs uplinks. Aside from the obvious cost savings, imagine taking some of the orchestration/automation systems that are being used right now for the data center and applying them to problems like edge port provisioning or Network Access Control. Also, being able to support large wireless rollouts and controllers in a more automated fashion would be a huge win.

Role in Network Function Virtualization and the virtual enterprise branch

Network Function Virtualization (NFV) has gotten a lot of press lately as more and more organizations are virtualizing routers, firewalls, wan optimizers and now SDWAN appliances.

One of the endgames that I see coming out of the whitebox revolution is the marrying of NFV and whitebox edge switching to create a virtual branch in a box. The value to an enterprise is enormous here as the underlying hardware can be used in a longer design cycle while the software running on top can be refreshed to solve new business requirements as needed.

The other major benefit of hardware abstraction is that it simplifies orchestration and automation when interfaces are VLAN based and not tied to a specific port/number.

What are the challenges to getting whitebox into the hands of the average enterprise for edge or campus use?

While not an exhaustive list, these are a few of the challenges to getting whitebox into the average enterprise.

  • Perception is one of the key challenges to moving towards whitebox in the edge. Most enterprises tend to be hesitant to be early adopters unless there is a clear business advantage to doing so.
  • Vendor lock in is another major barrier as most enterprises tend to stay within one vendor for routing and switching.
  • Confidence is a key part of the hardware selection process and this is an area that I feel whitebox is gathering serious momentum. And that will help make the edge case an easier sell when it becomes a more common use case.
  • Training and Support is always one of the first questions asked by a network team to see how easy it will be to support and get the techies up to speed on care and feeding of the new deployment. This is another area of whitebox that has seen a huge amount of growth in the last few years. Big Switch has a fantastic lab for learning data center deployments and hopefully as the use cases expand, we’ll see the same high quality training for those areas as well.

Closing thoughts

Whitebox at the edge is closer than ever before and there are small pockets of actual deployments which are mostly in the service provider world.

However, we’ll likely have to wait until the use cases are specifically developed and marketed towards the enterprise before there is significant momentum in adoption. As a designer of whitebox solutions, it’s something that i’ll continue to push for and evaluate as it has the potential to make an enormous impact in enterprise networking.

That said, the future of whitebox is incredibly bright and 2017 should bring a significant amount of growth and more movement towards commodity switching as a mainstream technology in all areas of networking.

The importance of situational awareness for network engineers

 

frustrated engineer

 

In another life, not too long ago, I spent a number of years in civilian and military law enforcement. When going through just about any kind of tactical training, one of the recurring themes they hammer into you is “situational awareness or SA.”

Wikipedia defines SA as:

Situational awareness or situation awareness (SA) is the perception of environmental elements with respect to time or space, the comprehension of their meaning, and the projection of their status after some variable has changed, such as time, or some other variable, such as a predetermined event. It is also a field of study concerned with understanding of the environment critical to decision-makers in complex, dynamic areas from aviation, air traffic control, ship navigation, power plant operations, military command and control, and emergency services such as fire fighting and policing; to more ordinary but nevertheless complex tasks such as driving an automobile or riding a bicycle.

Defining the need for SA in network engineering

It’s interesting to notice that critical infrastructure such as power plants and air traffic control are listed as disciplines that train in SA, however, I’ve never seen it taught in network engineering. In today’s world, the Internet has become as important as any other critical infrastructure and probably more in some ways.  Network engineering shares many of the high pressure components as other critical infrastructure. Engineers are often forced to make decisions quickly under stress that can have a significant financial impact and may have regional, national or even global ramifications. Engineers that work supporting military, healthcare, public safety and telecom sectors have the additional responsibility of avoiding downtime that can risk the safety of life and property. When not under the immediate pressure of a high risk change or outage, network engineers are frequently given projects with timelines that are too short to plan properly and when the time to implement comes, unforeseen issues tend to pop up much more frequently. Understanding SA is key to ensuring that engineers focus on the right tasks at the right time and know when to ask for help, when to coordinate and even when to step back and stop working.

Why don’t we teach SA in networking?

This is largely, I suspect, one of the things network engineers are expected to acquire with experience and so it isn’t formally taught or is passed down from engineer to engineer.  SA is something that is best taught as a combination of practical experience and both formal and informal training.  The Cisco CCIE Route/Switch lab exam is probably one of the few examples of formal training that forces a network engineer to be aware of task priority as well as assessing the changing state of the network and responding appropriately.  That said, after a brief google search, it does appear to be taught in cyber security as an adapted form of SA meant for military personnel. Much of the training tends to focus on how to avoid losing situational awareness by following certain methodologies and what to do if you realize you have lost SA.

An example of the loss of SA in networking.

15593400_s  tacacs

 

This is a scenario that just about every engineer can relate to…You’re working on a network cut and one of the tasks that you have to perform (which should be relatively quick) turns out to be quite difficult – because of a mis-configuration, bug, missing info, etc.  A common example is network management config like SNMP, netflow, TACACS, etc.  Let’s say you’re bringing a new L3 switch core online that involves physical link migration as well as new VLANs and OSPF/BGP. You have a 4 hour maintenance window and you start to work on an issue where your centralized TACACS server isn’t responding to the new switch – maybe it’s a security policy in the server or maybe you have the key wrong. You start to troubleshoot the issue with the expectation that you’ll have it solved in 10 or 15 minutes.

45 minutes later, you’re ALMOST there and so you put your head down and charge on through since you’re only 10 minutes away from solving the issue…right? Another 45 minutes go by and you’re at a complete loss as to why TACACS won’t work properly on this switch.  Let’s pause for a moment and assess the status of the cut – we are now 1 hour and 30 minutes into the window and we have used almost half of the time on a single task. We still have to move physical links, ensure smooth L2 STP convergence when we add a new trunk and VLANs and L3 routing has to be brought online and tested with BGP and OSPF.

At this point, you’ve probably lost situational awareness because almost half of the time was spent on one task that is probably not critical to bringing a new L3 switch core online. This isn’t to say that ensuring the proper AAA isn’t important, but it’s a task that could be done in a follow up window the next night – or you might even extend your current maintenance window after the critical physical and route/switch pieces have been done. The point is, that in most environments, network management tasks aren’t as critical to configure as L2/L3 forwarding if you have to make a choice between the two based on time available.

There are probably thousands of scenarios like this one we could go through but the critical take away here is to understand exactly which items have to function in order to accomplish the overall goal. A secondary component to that is to understand where you can make compromises and still come away with a win – maybe in the example above, static routes are an acceptable temporary solution if you run into issues with dynamic routing.


SA concepts for network engineers

I’ve adapted some of the concepts from the US Coast Guard’s manual on SA to illustrate the important points and relevance for network engineers. The full text can be found here

Clues to the loss of SA 

In reviewing a network outage or failed cut, you will often find one or more of these clues in hindsight as you try to figure out what went wrong.

  • Confusion – If something isn’t quite right with the network as you’re working on it, but you can’t put your finger on it, trust your gut and share it with the engineering team or any other relevant parties. Likewise if you sense the team is generally confused about a critical task, don’t be afraid to take a step back and try to get clarification from the team. At worst you may have a short delay, but at best, you may discover some major gaps or other issues that need to be dealt with before moving forward. Confusion is the sworn enemy of communication.
  • Deviation from the plan – we don’t always have the time to plan our network changes and it isn’t realistic to think that 100% of changes will be planned, but if you’re doing it right, most of the config that is put into the network should be planned and tested.  When executing a network change, if you have to deviate from the planned config more than just a few lines, it might be worth putting the change on hold to assess whether or not your on-the-fly changes will upset things in other areas. Trying to think through all the dependencies of adding BGP route reflection in an entire data center at 3AM may not be the best choice to make.
  • Unresolved Discrepancies – If the SYSLOG is churning through OSPF neighbor flapping entries, then maybe it isn’t the best time to introduce more OSPF changes. Understanding the state of the network before you make changes is key to avoiding breaking things even more.  This seems like a very common sense approach, but all too often, we as engineers tend ignore the danger signs of an unstable network and try to fight our way through to “just get things working.” It’s far better to identify and resolve known issues before introducing new config that could be destabilizing – even if that means missing a target date in a project.
  • Ambiguity – If there isn’t a great deal of clarity around the changes you plan to introduce into the network, then you’re setting yourself up for failure.  If an engineer comes to me and says he wants “throw VLANs 100 through 110″ onto all the switches that need it”, I would probably challenge him to come back with the configuration required to do exactly that and in the process, we will probably have a much better understanding of the Layer 2 impact versus (con)figuring it out as we go.
  • Fixation/Preoccupation – If we think back to the example earlier about spending an hour and a half on troubleshooting TACACS, that is a great example of fixation on one task. Unfortunately just about all of us have been there because it is so easy to lose sight of the bigger picture when you’re trying to get something to work. This is where a team of engineers can be very helpful. You might suggest to a member of the team who is head down and wrestling with a non-critical problem to refocus on the critical pieces and then the team can come back and work on the tasks together.  If you’re working solo, then you have to be especially vigilant and self-policing when it comes to avoiding fixation/preoccupation.

Maintaining Awareness

20108949_m

 

  • Communication – clear and effective communication, both verbal and written, is key to avoiding major issues.  You should always strive to provide the right information in a timely manner to avoid ambiguity and confusion. It’s important to communicate things like a planned course of action as well as tasks that you are specifically charged with – even if it’s just to verify the information you have is accurate.

  • Identify Problems – If you see a potential issue, don’t wait to bring it up. Sometimes, we as engineers are afraid to raise concerns for fear of looking foolish or uninformed, but harboring a concern and then voicing it once you’re in an outage situation doesn’t help anybody and certainly will make you feel foolish for not bringing it up when you had the chance.
  • Continually Assess the Situation – Continually checking on the state of the network and the impact changes are having will keep you alert for any issues that may pop up. Many of the problems we have to go back and fix as network engineers are a direct result of incomplete validation after a change has been made. Another thought to consider here is that if you are a member of a team, you can delegate this task or ask for assistance so the burden isn’t entirely on you.
  • Clarify expectations – If expectations and goals are not clearly defined, they can be hard to achieve. If it’s not understood that a network change is expected to cause an outage, you may have some unhappy campers on your hands because you didn’t clearly set the expectation.

The lone gunman effect on SA

11562899_m

 

There is no faster way to lose SA than to try and go it alone in the midst of a team environment…

One of the hardest things to do sometimes when functioning as a member of a team is to delegate tasks and responsibility. Especially if everyone on the team is not at the same technical level. That said, it’s typically not the best idea to be the lone gunman if you have resources available to utilize. Simple tasks like checking monitoring systems and logs can be assigned to team members with less experience and doing peer review of config can be helpful with more experienced engineers.  To take this concept even further, if you have support from the network vendor available for peer review or even a dedicated engineer to sit in on a change window, always take advantage of that resource. Too often, we as engineers become consumed with being the go-to guy or gal and coming up with all the answers. What we should be doing is putting the success of the project first and using any resources available to ensure a successful network change or implementation.

Don’t be afraid to pull the emergency brake

If you find yourself in the middle of a bad situation or sense that you’re about to be in the middle of a major network outage, then STOP! One of the most important skills an engineer can possess is the ability to understand when to call it quits – at least for a brief period – until you can come back with a better plan. We’ve all had to make that gut wrenching call where we decide whether to plow forward or pull back and sometimes there isn’t much of a choice. Whenever you have the option, however, don’t hesitate to take it. The stakeholders will usually overlook schedule delays in the interest of preserving up-time. Also, demonstrating an understanding and respect for the business side of the house that depends on a functional network will probably buy you some goodwill with the business unit and increase the level of trust in your technical judgement.

Closing thoughts

Situational awareness for network engineers is mindset that we have to practice daily, but the overall learning curve is a long term process. We don’t always get it 100% perfect because we are human and subject to a number of external influences that compromise objectivity and work performance. The most important take away is to recognize, as other disciplines have, that there is value in understanding SA and the potential issues that come with losing SA. The next time you have a network cut, implementation or outage, take a minute to think through the SA concepts and how they relate to your specific corner of IP networking.