As organizations add machine learning (ML) to their workflows, it’s tempting to squeeze model creation and deployment into the existing software development lifecycle (SDLC). However, ML is fundamentally different from traditional applications, and it’s important to account for that in a new, distinct process called the machine learning development lifecycle.
We have identified five challenges every organization should keep in mind as they begin to support ML development.
Making Heterogeneity Work
Machine learning is successful when the right tool is selected for a given job. Depending on the use case, a data scientist might build one model in Python, R, Scala or another language, and a second model in a different language entirely. Within a given programming language, there are numerous frameworks and toolkits available, adding complexity to versioning and consistency.
For example, TensorFlow, PyTorch and Scikit-learn all work with Python, but each is tuned for specific types of operations, and each outputs a slightly different type of model.
ML-enabled applications typically call on a pipeline of interconnected models, often written by different teams using different languages and frameworks.

For example, a public relations firm looking to identify news and reports critical of one of its customers might use the following pipeline:
- Extract all text from hundreds of scanned documents with an Optical Character Recognition (OCR) model.
- Identify the languages of that text with a language-identification model.
- Translate the text to English.
- Prepare the text for sentiment analysis with a vectorization model that turns the English text into numbers.
- Use a sentiment analysis model to score the text.
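To make the shape of such a pipeline concrete, here is a minimal Python sketch. Every function is a hypothetical stand-in for a real model; in practice each stage might be owned by a different team and built with a different language or framework, which is exactly what makes heterogeneity hard to manage.

```python
# A minimal sketch of the pipeline above. Each function is a placeholder for a
# real model; names and return values are illustrative assumptions.

def ocr_extract(document_bytes: bytes) -> str:
    """Stand-in for an OCR model that pulls text out of scanned documents."""
    return document_bytes.decode("utf-8", errors="ignore")

def detect_language(text: str) -> str:
    """Stand-in for a language-identification model."""
    return "fr"  # pretend the document is French

def translate_to_english(text: str, source_lang: str) -> str:
    """Stand-in for a machine-translation model."""
    return text  # identity placeholder

def vectorize(text: str) -> list[float]:
    """Stand-in for a vectorization model that turns English text into numbers."""
    return [float(len(token)) for token in text.split()]

def sentiment_score(features: list[float]) -> float:
    """Stand-in for a sentiment model returning a score."""
    return 0.0

def score_document(document_bytes: bytes) -> float:
    """Chain the five models into one pipeline, as described above."""
    text = ocr_extract(document_bytes)
    lang = detect_language(text)
    english = translate_to_english(text, source_lang=lang)
    features = vectorize(english)
    return sentiment_score(features)

if __name__ == "__main__":
    print(score_document(b"Le rapport critique fortement l'entreprise."))
```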
Planning for ML Iteration
In machine learning, your code is only part of a larger ecosystem: models interacting with live, often unstructured or unpredictable, data. Interactions with new data can introduce model drift and degrade accuracy, requiring constant model evaluation, tuning and retraining.

As a result, ML iterations are faster and more frequent than traditional app development workflows. This places greater demands on versioning and other operational processes and requires a greater degree of agility from DevOps processes, tools and staff.
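As an illustration, a minimal drift check might compare the distribution of a single input feature at serving time against the distribution seen during training. The sketch below uses a two-sample Kolmogorov-Smirnov test from SciPy; the threshold and the choice of test are assumptions for the example, not a prescription.

```python
# A minimal drift check: does a live feature still look like the training data?
import numpy as np
from scipy.stats import ks_2samp

def feature_drifted(train_values, live_values, p_threshold=0.01):
    """Flag drift when a two-sample Kolmogorov-Smirnov test rejects the
    hypothesis that training and live values share the same distribution."""
    result = ks_2samp(train_values, live_values)
    return result.pvalue < p_threshold

# Synthetic example: training data centered at 0.0, live traffic drifted to 0.5.
rng = np.random.default_rng(seed=7)
train = rng.normal(loc=0.0, scale=1.0, size=5_000)
live = rng.normal(loc=0.5, scale=1.0, size=1_000)
print(feature_drifted(train, live))  # True -> time to evaluate and retrain
```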
Organizations interested in adopting ML-enabled applications need to be prepared to handle the burden of iteration without breaking the budget or introducing needless infrastructure tasks.
Developing the Right Infrastructure
The model training process typically involves multiple instances of the following:
- A long and intensive compute cycle.
- A fixed, inelastic load.
- A single user.
- Concurrent experiments on a single model.
After deployment, models from multiple teams enter a shared production phase characterized by:
- Short, unpredictable compute bursts.
- Elastic scaling.
- Many users consuming many models.
Operations teams must be able to support both the training and deployment environments on an ongoing basis, because it’s a continuous lifecycle. Selecting the right infrastructure for that job is complicated by a rapidly evolving tech stack, from the increasing variety of processors available for ML workloads to advances in cloud-specific scaling and management, to a never-ending stream of new data science tooling.
It’s important to future-proof your ML program to ensure a tool or framework you use today does not lead to lock-in or worse—obsolescence—down the road.
How Will You Scale?
To address the unpredictability of ML workloads and the premium on low latency, organizations must build compute capacity to support substantial bursts.
There are three typical architectural approaches, each with benefits and drawbacks:
Traditional Capacity Planning: In a traditional architecture, the operations team reserves compute resources sized for the maximum anticipated demand.
Pros:
- Reserved capacity is always available
- Easy to administer
Cons:
- Extreme waste and cost
- Unanticipated demand can exceed the capacity
Elastic Scaling: Standard elastic scaling designs for a local maximum, adding and removing machines in steps as demand rises and falls.
Pros:
- Cost improvements over traditional architecture
Cons:
- Inefficient hardware utilization
- Difficult to manage heterogeneous workloads on the same hardware
- Slightly more management overhead than traditional architecture
Serverless: A serverless architecture spins up models on demand as requests come in.
Pros:
- Significantly improved hardware utilization
- Dramatically lower costs
- Easier to run and scale different types of workloads on the same machines
Cons:
- Substantial administrative overhead for custom-built, on-prem solutions
- Third-party services that lack extensive APIs and reporting offer limited explainability
- Performance tuning is required to meet latency targets
Architects can implement any of the three approaches in the cloud or in a physical datacenter, though elastic scaling and serverless approaches require far more effort to manage in-house.
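As a rough illustration of the serverless idea applied to model serving, the Python sketch below loads models lazily when a request arrives and keeps only the most recently used ones in memory. The class, the loader and the cache policy are hypothetical, meant only to show why hardware utilization improves when models are spun up on demand.

```python
# A minimal sketch of on-demand model serving: models are loaded lazily per
# request and evicted when capacity is needed for other workloads.
# Names and the cache policy are illustrative assumptions, not a product API.
from collections import OrderedDict
from typing import Any, Callable

class ModelCache:
    """Load models on demand and keep only the most recently used ones warm."""

    def __init__(self, loader: Callable[[str], Any], max_models: int = 4):
        self._loader = loader          # e.g., pulls a model artifact from storage
        self._max_models = max_models  # bound on concurrently resident models
        self._models: OrderedDict[str, Any] = OrderedDict()

    def get(self, model_id: str) -> Any:
        if model_id in self._models:
            self._models.move_to_end(model_id)   # mark as recently used
            return self._models[model_id]
        if len(self._models) >= self._max_models:
            self._models.popitem(last=False)     # evict least recently used
        model = self._loader(model_id)           # "cold start" for this model
        self._models[model_id] = model
        return model

# Usage with a dummy loader standing in for real artifact storage.
cache = ModelCache(loader=lambda model_id: f"<model {model_id}>", max_models=2)
for request in ["sentiment-v3", "ocr-v1", "sentiment-v3", "translate-v2"]:
    print(request, "->", cache.get(request))
```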
Trusting the Results with Auditability and Governance
Explainability—understanding why models make given predictions or classifications—is an essential part of ML infrastructure and the lifecycle. Equally important and often overlooked, however, are the related topics of model auditability and governance—understanding and managing access to models, data and related assets.

To make sense of complex pipelines, a multitude of users and rapid model and data iteration, deployment systems should attempt to identify:
- Who called which version of which model.
- At what time a model was called.
- Which data the model used.
- What result was produced.
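A deployment system can capture those four facts with a thin wrapper around every model call. The sketch below is a hypothetical illustration: the record format and the in-memory log are assumptions, and a production system would write to a durable, queryable audit store.

```python
# A minimal audit trail around model calls: who called which version, when,
# on what data, and with what result. The record format is an assumption.
import datetime
import json
from typing import Any, Callable

AUDIT_LOG: list[str] = []   # stand-in for a durable audit store

def audited_call(model_name: str,
                 model_version: str,
                 model_fn: Callable[[Any], Any],
                 inputs: Any,
                 caller: str) -> Any:
    """Run a model and record the call for later auditing."""
    result = model_fn(inputs)
    AUDIT_LOG.append(json.dumps({
        "caller": caller,
        "model": model_name,
        "version": model_version,
        "called_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "input_fingerprint": hash(str(inputs)),  # reference to the data used
        "result": result,
    }))
    return result

# Example: a trivial sentiment model standing in for the real thing.
score = audited_call("sentiment", "2.1.0", lambda text: 0.8,
                     inputs="Great quarterly results", caller="pr-dashboard")
print(score, AUDIT_LOG[-1])
```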
Thinking about and planning for these five challenges before beginning a machine learning initiative is a good way to ensure a valuable return on your investment.
To learn more about containerized infrastructure and cloud native technologies, consider coming to KubeCon + CloudNativeCon NA, November 18-21 in San Diego.