Speedrunning ML Ops
Let's quickly learn how to do ML Ops

Today I am doing a quick end-to-end study of ML Ops in preparation for a new role. In this article I will:

  1. define what ML ops is, what it deals with
  2. talk about the state of the art, how it relates to LLMs
  3. walk through an ML ops workflow at a high level

My background is in building custom DevOps workflows for developing and testing AI accelerators, which I guess is a specific kind of ML Ops. But I've never sat down to make a study of it, so here we are.

1. What is ML Ops

ML Ops is basically just DevOps applied to machine learning workflows.

In DevOps you want to manage the entire developer workflow from coding to deployment in production. Doing this means tracking the following automatically:

  • code changes, with git etc
  • dependencies, with Nix, PyPI, Bazel, whatever
  • test cases with automated CI
  • compiled and deployed artifacts
  • machines and environments running the deployed artifacts

ML Ops is the same, but requires also managing the machine learning workflow. Whereas DevOps just manages code and code artifacts, ML Ops manages:

  • data for training the models
  • model weights, features, parameters
  • code for training, testing, and using the models

These three things are the elements of any ML Ops system. And they compound with each other, meaning you'll have around 3^3 combinations of changes to track.
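One way to picture the "three elements" is as a single record per training run. Here is a minimal sketch, assuming you already have a content hash for the data, a git commit for the code, and a hash for the resulting weights. The `RunRecord` name and its fields are hypothetical, made up for illustration and not taken from any particular tool.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class RunRecord:
    """Hypothetical record tying together the three things ML Ops tracks."""
    data_hash: str     # content hash of the training data
    code_commit: str   # git commit of the training code
    weights_hash: str  # content hash of the produced model weights

    def run_id(self) -> str:
        """Short identifier that changes whenever any of the three changes."""
        joined = "\n".join([self.data_hash, self.code_commit, self.weights_hash])
        return hashlib.sha256(joined.encode()).hexdigest()[:12]
```

Because the identifier is derived from all three fields, a change to the data, the code, or the weights shows up as a new run, which is the property you want from whatever tracking system you end up using.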

By the way, Wikipedia distinguishes between MLOps, ModelOps, and AIOps. This is silly, it's all the same thing.

2. ML Ops workflow

The basic workflow or lifecycle that an ML model goes through is:

train -> package -> validate -> deploy -> monitor

Let's look at what each step consists of.

2.1. Train

Training relies on GPUs, data, and code. We need to lock down each of these resources.

Let's assume we have access to GPUs somewhere, like in a data center or cloud cluster. And assume that the code is tracked with git as is common in developer workflows already. There are still two major sources of variance in the workflow that need to be locked down:

  1. code experiments
  2. training datasets

In tutorials and docs you often see datasets being pulled from pytorch or sklearn or huggingface directly. This works for tutorials, but for production systems you need to be sure that the data you train with doesn't change out from under you: the content of the data must be hashed and tracked in some stable storage.

Commercially-supported implementations of this include LakeFS and DVC, or you could just use git-annex with your existing storage solution (S3, NFS, etc), but you'd need to do a bit of config tweaking to make sure your data is cryptographically secure. In either case the workflow is the same: all data that goes into training must be committed to the data repository. Training runs must be started with clean worktrees (no uncommitted or changed data) so that we can see exactly what data led to a particular trained output. This is not difficult per se, but it requires a bit of discipline, and it is made easier with some basic automation that any developer is familiar with.
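The "basic automation" could look something like the sketch below. It assumes the data repository is a plain git worktree on local disk, which is my simplification: LakeFS, DVC, and git-annex each have their own commands for this. The function names (`hash_dataset`, `worktree_is_clean`, `require_clean_data`) are hypothetical.

```python
import hashlib
import subprocess
from pathlib import Path


def hash_file(path: Path) -> str:
    """SHA-256 of one file, read in chunks so large files fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hash_dataset(root: Path) -> str:
    """One digest covering the relative path and content of every file."""
    digest = hashlib.sha256()
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest.update(path.relative_to(root).as_posix().encode())
        digest.update(b"\0")
        digest.update(hash_file(path).encode())
    return digest.hexdigest()


def worktree_is_clean(repo: Path) -> bool:
    """True when `git status --porcelain` reports no changes in repo."""
    result = subprocess.run(
        ["git", "status", "--porcelain"],
        cwd=repo, capture_output=True, text=True, check=True,
    )
    return result.stdout.strip() == ""


def require_clean_data(repo: Path) -> str:
    """Refuse to start training on uncommitted data; return the data hash."""
    if not worktree_is_clean(repo):
        raise RuntimeError(f"uncommitted data changes in {repo}")
    return hash_dataset(repo)
```

A training script would call `require_clean_data` first and log the returned hash next to the run, so every trained output can be traced back to the exact data it saw. Note that this naive walk also hashes the `.git` directory, so in practice you would point it at the data subdirectory.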

Locking down code experiments is trickier. Often these are one-off Jupyter notebooks that an ML engineer creates while choosing model features, or playing with different datasets, or evaluating performance, or whatever. It is definitely possible to lock down Jupyter notebooks (at my last job we actually created a custom build rule for notebooks that converted them into Python files and optionally ran them in tests), but notebooks aren't usable beyond the notebook form. Nobody is importing notebooks like other Python libraries.

So while you can put notebooks in your git repo along with your other code (and you definitely should for that matter), they are best treated as basically ephemeral. Any work done in a notebook, once it is validated as useful beyond the experiment, should be migrated into the rest of the Python codebase and put through the usual tracking mechanisms.
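For a feel of what "converting a notebook into a Python file" involves, here is a rough sketch using only the standard library. This is not the build rule from my last job; it just relies on the fact that an .ipynb file is JSON with a list of cells. In real life `jupyter nbconvert --to script` does this job properly. The `notebook_to_python` name is hypothetical.

```python
import json


def notebook_to_python(notebook_json: str) -> str:
    """Concatenate the code cells of a notebook into one Python source string."""
    notebook = json.loads(notebook_json)
    chunks = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue  # skip markdown and raw cells
        source = cell.get("source", "")
        if isinstance(source, list):
            # The notebook format allows source as a list of lines.
            source = "".join(source)
        # Drop IPython magics and shell escapes, which are not valid Python.
        lines = [
            line for line in source.splitlines()
            if not line.lstrip().startswith(("%", "!"))
        ]
        if lines:
            chunks.append("\n".join(lines))
    return "\n\n".join(chunks) + "\n"
```

The output is a flat script, which is fine for running the notebook in a test but is still not a library. That's why the real fix is migrating validated work into the regular codebase.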

2.2. Package

A package usually consists of the model weights and any config files necessary to set runtime hyperparameters. Weights are stored via pickle, GGUF, safetensors, H5/SavedModel, … there are a bunch of these but they are all basically binary representations of tensors. The final package bundles the model and configs into a Docker container. This way they ship together with the wrapper code necessary for serving the model, and the whole thing fits easily into the usual deployment process for web services.

Packaging can borrow a lot from the traditional DevOps world, such as reproducible builds and artifact registries. You could just repurpose your existing packaging infrastructure for ML artifacts; after all, they are both just compiled binary data. However, watch out for large models, which could bloat registries beyond the sizes they are designed to handle. I had this problem with Nix, and our solution was to create a secondary binary store (alternative to /nix/store) that used a shared NFS as the filesystem rather than the local disk or remote S3 bucket, and it worked really well! We used it for both dependencies and final compiled outputs. NFS is great for this kind of thing, as long as you lock down the file permissions and organize your file tree clearly from the beginning.

2.3. Validate and Monitor

Validation and monitoring are two sides of the same coin. Monitoring just consists of running the validation suite in real time on the production system, plus whatever alerts are needed for your use-case.

Validation must happen after packaging: you can only validate the final product that you expect to deploy. This is a critical business function. If you are a model supplier, prediction quality is directly tied to your bottom line. From the business perspective you must ask: What is the cost of a wrong prediction? The answer will help you decide if your validation suite should be in-house or purchased.

On the engineering side, you essentially want a staging environment, but instead of running integration tests like in app dev, you test for some ML-specific metrics. Here are a few common ones:

  • accuracy: the fraction of all predictions that your model gets right
  • precision: the fraction of positive predictions that are actually correct
  • recall: the fraction of actual positives that the model correctly identifies
  • F1: the harmonic mean of precision and recall

…and others that I'm sure your ML engineers can enlighten you on. The point is that you want to track these over time, so putting them in something like Grafana is necessary.
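To make the four metrics concrete, here is a small sketch that computes them for a binary classifier from lists of true and predicted labels (1 is positive, 0 is negative). The function name is hypothetical, and returning 0.0 when a denominator is zero is my own convention. In practice you'd probably reach for a library like scikit-learn instead.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (1 = positive)."""
    tp = fp = tn = fn = 0
    for truth, guess in zip(y_true, y_pred):
        if guess == 1 and truth == 1:
            tp += 1  # true positive
        elif guess == 1 and truth == 0:
            fp += 1  # false positive
        elif guess == 0 and truth == 0:
            tn += 1  # true negative
        else:
            fn += 1  # false negative

    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall) else 0.0
    )
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```

For example, with four actual positives of which the model finds two, plus one false alarm among six negatives, accuracy is 0.7, precision is 2/3, recall is 0.5, and F1 is 4/7.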

This validation suite will evolve as well, and that poses a challenge because you don't want a change in the validation suite to reduce the consistency of these metrics. The solution here is to make sure your validation suite is versioned appropriately and changes or migrations are done carefully. Or, you could outsource it to a vendor.
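One way to keep metrics consistent across suite changes is to stamp every result with the version of the validation suite that produced it, and only compare like with like. The sketch below assumes a single F1 score per run and a fixed tolerance. The `MetricRecord` and `is_regression` names and the 0.01 default are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class MetricRecord:
    model_version: str
    suite_version: str  # version of the validation suite that produced f1
    f1: float


def latest_comparable(history: List[MetricRecord],
                      suite_version: str) -> Optional[MetricRecord]:
    """Most recent record produced by the same validation suite version."""
    for record in reversed(history):
        if record.suite_version == suite_version:
            return record
    return None


def is_regression(history: List[MetricRecord], new: MetricRecord,
                  tolerance: float = 0.01) -> bool:
    """Flag a drop in F1, but only against a same-suite baseline."""
    baseline = latest_comparable(history, new.suite_version)
    if baseline is None:
        return False  # nothing comparable yet, so no verdict
    return new.f1 < baseline.f1 - tolerance
```

The same check can back a production alert: run the suite on live traffic, build a `MetricRecord`, and page someone when `is_regression` returns true.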

2.4. Integrated Train-Validate

There are some platforms that provide an integrated solution for training, packaging, and validating. With these platforms you can start a training run on your own hardware, watch the run in real time, look at metrics, stop the run and fork it with different parameters, set up alerts for metrics in production, and so on.

I haven't evaluated these to recommend one over the others. They are all gonna have some degree of vendor lock-in but that's probably okay if they fit your use-case and help you ship models faster.

2.5. Deploy

As mentioned above, models are typically packaged as a container and served over HTTP through some kind of REST or gRPC endpoint. The API layer is going to be a small but necessary part of the service, and it mostly falls under the traditional DevOps side of infrastructure. Depending on how your model is built, you might not even need to write this service layer; you might just be able to use these off-the-shelf tools:

BentoML is the most full-featured of the three, as it supports more than just one kind of model and has lots of integrations. All three are open source, and BentoML also has some commercial cloud offerings. Otherwise they are all basically a wrapper around the model that provides HTTP access. This wrapper will be the entrypoint to the Docker container.
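To show how thin this wrapper really is, here is a sketch of one using only Python's standard library. It is not the API of BentoML or any other serving tool. The `predict` function is a hypothetical stand-in for a real model (it just averages its inputs), and the `/predict` path and JSON shape are assumptions of mine.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def predict(features):
    """Hypothetical stand-in for a real model: the mean of the inputs."""
    return sum(features) / len(features)


def handle_predict(body: bytes):
    """Turn a JSON request body into (status_code, JSON response bytes)."""
    try:
        features = json.loads(body)["features"]
        result = {"prediction": predict([float(x) for x in features])}
        return 200, json.dumps(result).encode()
    except (ValueError, KeyError, TypeError, ZeroDivisionError) as exc:
        return 400, json.dumps({"error": str(exc)}).encode()


class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/predict":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        status, payload = handle_predict(self.rfile.read(length))
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port: int = 8000):
    """Blocking entrypoint; this is what the container would run."""
    HTTPServer(("0.0.0.0", port), PredictHandler).serve_forever()
```

Calling `serve()` and then POSTing `{"features": [1, 2, 3]}` to `/predict` returns the prediction as JSON. A production wrapper adds batching, health checks, and metrics on top, which is most of what the off-the-shelf tools give you.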

3. Links and resources