As xOps workflows become more prevalent in IT, overlap and intersection between them are inevitable. In networking, NetDevOps has been a hot topic for the past three years. NetDevOps focuses on bringing automation, simulation and validation workflows to network operations. The goal is to reduce errors in production configuration deployment, shorten time to production and create a collaborative workflow between network engineers and other cross-functional teams.
MLOps followed a similar trajectory, but with a focus on machine learning workflows. As networking evolves and machine learning and artificial intelligence (ML/AI) become more prevalent in all of IT, these different operational workflow methodologies will start to intersect. NetDevOps evolved out of DevOps, applying lessons learned from software development workflows to network development. As the conversation shifts to ML/AI, the question currently facing many cutting-edge networking professionals in the data center space is, “Where do NetDevOps and MLOps intersect?”
The Intersection of MLOps and NetDevOps
Machine learning and artificial intelligence are most commonly applied to networking in three basic ways:
- Autonomic networks
  - Intent-based configurations
  - Network devices that self-configure based on contextual information
- Self-healing networks
  - React to network events to mitigate downtime
- Telemetry analysis and correlation
  - Take streaming telemetry data and correlate it in a meaningful way
All the above applications can be implemented in fairly simple ways, and many networks today have at least some variation of these technologies already applied. The challenge is that the majority of these current implementations don’t use true ML/AI. They tend to use hand-coded decision trees or switch-case conditionals to reach conclusions, mapping each anticipated input to a pre-coded output. True ML, by contrast, requires large labeled datasets, which already exist for fields like image recognition. For instance, it’s fairly easy to confirm that an image shows a banana, but labeling a data set of, say, 100,000 images by hand takes a very long time. ML engineers tend to use existing datasets for this exact reason, but the problem is that few such datasets exist today in the networking realm.
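To make the distinction concrete, the hard-coded logic described above might look something like the following sketch. The event types, field names and remediation actions are hypothetical and chosen only for illustration; the point is that every conclusion is written in advance by a person, and nothing is learned from data.

```python
def remediate(event: dict) -> str:
    """Map a network event to a pre-coded action using fixed rules.

    All event types and action names here are hypothetical examples.
    """
    kind = event.get("type")
    if kind == "link_down":
        # Fixed rule: fail over only when a redundant path is known to exist.
        return "reroute_traffic" if event.get("redundant_path") else "page_operator"
    if kind == "bgp_session_flap":
        # Fixed threshold chosen by a human, not derived from data.
        return "dampen_peer" if event.get("flap_count", 0) >= 3 else "monitor"
    if kind == "high_cpu":
        return "rate_limit_control_plane"
    # Anything the rule author did not anticipate falls through to a human.
    return "page_operator"
```

A rule set like this behaves predictably, but it cannot handle an event type nobody wrote a branch for, which is the gap a trained model is meant to close.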
Creating a solid, structured ML workflow requires a model that can be trained. This model can iterate multiple times and derive conclusions that do not need to be preprogrammed. In essence, we are trying to create these big datasets ourselves (using automation, of course) instead of relying on existing ones. But to develop such a model, certain foundations are required.
The first foundation required for ML workflows in NetDevOps is a structured configuration model. This is mostly implemented using automation and standardized data structures. While YANG-based models such as OpenConfig exist, they are not comprehensive across all features commonly implemented in data centers, so most data models will have to be customized for the use case. With this data model implemented through automation, the ML workflow has an easy way to deploy configuration across an entire network fabric without having to configure each node by hand.
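As a rough illustration of a structured configuration model, the sketch below describes nodes as plain data and renders per-node configuration from it. The class names, fields and the generic CLI-style output are assumptions made for this example; they do not follow OpenConfig or any particular vendor syntax.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interface:
    name: str
    ipv4: str
    mtu: int = 9216  # hypothetical fabric-wide default


@dataclass(frozen=True)
class Node:
    hostname: str
    role: str  # e.g. "leaf" or "spine"
    asn: int
    interfaces: tuple = ()


def render(node: Node) -> list[str]:
    """Render one node from the data model into generic config lines."""
    lines = [f"hostname {node.hostname}", f"router bgp {node.asn}"]
    for intf in node.interfaces:
        lines += [
            f"interface {intf.name}",
            f" ip address {intf.ipv4}",
            f" mtu {intf.mtu}",
        ]
    return lines


def render_fabric(nodes) -> dict[str, list[str]]:
    """Render every node in the fabric, keyed by hostname."""
    return {node.hostname: render(node) for node in nodes}
```

Because the fabric is described as data rather than as hand-typed device configuration, an automated workflow can change one field and regenerate the configuration for every affected node.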
The second foundation required is an environment that allows idempotent configuration iteration so the model can train itself. Idempotence is the property of an operation that can be repeated any number of times without changing the result beyond the first application. This is where network simulation is critical. A comprehensive network simulation platform allows the ML system to apply the standardized configuration model and iterate as the configuration evolves. The application can then make small configuration tweaks to generate a nearly unlimited dataset.
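The idea of idempotent iteration can be sketched in a few lines. Here a device's state is reduced to a flat dictionary, which is a simplifying assumption, and the function and key names are hypothetical. Applying the same desired state a second time produces no further changes, and a small generator yields one-setting variations of a base configuration for a simulation to try.

```python
def apply_config(running: dict, desired: dict) -> tuple[dict, dict]:
    """Return (new_running, changes) after converging on the desired state.

    Only keys whose values differ are changed, so re-applying the same
    desired state is a no-op.
    """
    changes = {k: v for k, v in desired.items() if running.get(k) != v}
    return {**running, **changes}, changes


def tweak_variants(base: dict, key: str, values):
    """Yield copies of a base config that differ in a single setting."""
    for value in values:
        yield {**base, key: value}
```

For example, applying `{"mtu": 9216}` to a device already at that MTU returns an empty change set, while `tweak_variants(base, "mtu", [1500, 9000, 9216])` produces three candidate configurations for the simulation to evaluate.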
Finally, a good foundation in network MLOps requires telemetry. Historically, telemetry was used as a method to correlate logs and identify problems. Most telemetry solutions have fallen short, though, when it comes to root cause identification and action plan recommendation. These problems are much harder to solve.
One approach could use a simulation platform to iterate over configurations and architectures until the simulation logs match those seen in production. This methodology would train the AI model to identify which triggers affect which nodes. It has two independent tracks: the first correlates logs to identify patterns and trends in the data, and the second iterates on a network architecture until the logs generated match those from production.
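The matching step can be illustrated with a deliberately simple stand-in. The sketch below is not a trained model: it summarizes each log by counting message types and uses cosine similarity to pick the simulated run that most resembles production. Treating the first token of each line as the message type, and the function names themselves, are assumptions made for this example.

```python
import math
from collections import Counter


def log_signature(lines) -> Counter:
    """Count log lines by first token, treated here as the message type."""
    return Counter(line.split()[0] for line in lines if line.strip())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two message-type count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def best_match(production_logs, simulated_runs: dict):
    """Return the name of the closest simulated run and all scores."""
    target = log_signature(production_logs)
    scores = {
        name: cosine_similarity(target, log_signature(logs))
        for name, logs in simulated_runs.items()
    }
    return max(scores, key=scores.get), scores
```

In a real workflow the score would feed back into the next round of configuration changes, and a learned model would likely replace the fixed similarity metric.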
This solution has many interesting parallels to existing AI/ML solutions. For instance, iterating on a network would be comparable to AlphaGo learning to play Go. The model trains by playing games against itself, in simulation, over and over. With every iteration, it learns something new about how each decision influences an outcome. Given enough iterations, the machine becomes extremely difficult to beat. Matching logs from simulation to production is similar to AI/ML in facial and voice recognition, too. In recognition software, pattern matching is used to train a model that becomes increasingly accurate at identifying its targets the more data it is fed.