From notebook to production: a guided tour of impending complexity

Every machine learning model begins its life in a notebook, glowing with potential and the comfortable fiction that it will one day work in production. The gap between that notebook and a deployed, monitored, maintainable system is where careers are made, or occasionally ended, and it is considerably wider than anyone budgets for.

The Iris repository walks through this gap stage by stage, using the classic Iris dataset as a vehicle. The Iris dataset is small, clean, well-understood, and utterly unlike anything you will encounter in production. This is precisely why it makes a useful teaching example. The complexity in an MLOps pipeline comes from the infrastructure, not the data, and a simple dataset lets you see the infrastructure clearly without also fighting an uncooperative feature matrix.

The dataset contains 150 samples of iris flowers across three species, with four features: sepal length, sepal width, petal length, and petal width. The problem is multi-class classification. The model is built with scikit-learn. None of this is the point. The point is everything that happens to it afterward.
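To make the starting point concrete, here is a minimal sketch of the modelling step: load Iris, split, fit, score. The repository does not pin down a specific estimator beyond "scikit-learn", so the choice of LogisticRegression below is illustrative, not the project's actual model.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# 150 samples, 4 features (sepal/petal length and width), 3 species.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
```

On a dataset this clean, almost any classifier scores well, which is exactly why the modelling is not the interesting part.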

The research and development phase

The pipeline begins in notebooks, where data scientists do science and data in roughly equal measure. The Iris pipeline opens with exploratory data analysis, followed by feature engineering, feature selection, model training, and obtaining predictions. These steps proceed with the cheerful informality of a process that assumes someone else will deal with reproducibility later.

Feature engineering uses feature-engine alongside scikit-learn pipelines; feature-engine provides transformers for encoding, imputation, and variable transformation in a form that can later be packaged and reused. This is the point at which the notebook begins to resemble something that might survive contact with a production environment, which is a significant upgrade from a collection of cells executed in an order no one has documented.

The research phase produces a model. The next several stages are concerned with what to do with it.

Packaging the model

A model that lives in a notebook is a model that can only be run by whoever has the notebook, on the machine where it was last run, in the environment that was current at the time, assuming the data is still where it was. This is not a deployment strategy. It is a hostage situation.

The production packaging stage wraps the trained model in a proper Python package, using Tox to automate the full pipeline: virtual environment creation, linting with flake8, type checking with mypy, import sorting with isort, formatting with black, testing with pytest, and model training in a single reproducible command. Configuration is handled through strictyaml rather than Python files, which keeps runtime configuration separate from code and readable by people who are not Python engineers.

The packaged model can be trained, tested, and versioned consistently across machines. This is the minimum viable foundation for everything that follows, and the stage at which many real-world projects still have not arrived.

Serving predictions

A packaged model produces predictions when you run it. A served model produces predictions when something else asks it to. The API stage exposes the model through a FastAPI application, turning the classifier into a service that accepts requests and returns predictions over HTTP.

FastAPI handles request validation through pydantic, which means malformed inputs are rejected before they reach the model rather than producing a cryptic error from scikit-learn. Uvicorn runs the application as an ASGI server. Loguru handles structured logging so that when something goes wrong in production, there is at least a record of what was asked and what was returned, which is more than many deployed models can say.

The API is the boundary between the model and everything that wants to use it. Designing it carefully at this stage costs very little. Redesigning it after several services have integrated against it costs considerably more.

Continuous integration and publishing

A model package that is only tested on the developer’s laptop is not tested. The CI and publishing stage connects the pipeline to a continuous integration system that runs the full Tox suite on every change, and publishes the package when a version passes. This makes the test suite the arbiter of what constitutes a working model, rather than whoever last ran it manually and decided it looked acceptable.

Publishing the package to a registry means downstream services can pin a specific version and upgrade deliberately rather than discovering that the model changed when their predictions changed. Versioning model packages like software packages is one of those practices that feels unnecessary until the moment it becomes desperately necessary.
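In outline, a CI workflow for this shape of project runs the tox suite on every push and publishes only on a tagged release. The following is a hedged sketch of such a workflow in GitHub Actions syntax, not the repository's actual CI configuration; step names, Python version, and the tag-based publish condition are all illustrative.

```yaml
# Hypothetical workflow for illustration only.
name: ci
on: [push]
jobs:
  test-and-publish:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install tox
      - run: tox                       # lint, type-check, test, train
      - if: startsWith(github.ref, 'refs/tags/')
        run: |
          pip install build twine
          python -m build
          twine upload dist/*          # credentials via repository secrets
```

The important property is that publishing is gated on the same tox command a developer runs locally, so "it passed CI" and "it works on my machine" describe the same suite.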

Containerisation

A FastAPI application running on a developer’s machine has a Python version, a set of installed libraries, and an operating system that may not match the environment where it will run in production. Containers eliminate this uncertainty by packaging the application together with everything it needs to run, including the Python version, the dependencies, and the model package itself.

The deploying-with-containers stage builds a Docker image of the ML API and runs it as a container, producing a deployment artefact that behaves identically regardless of the host environment. The same image that passes testing is the image that reaches production. This is a considerably stronger guarantee than “it worked on my machine,” which is the guarantee that most ML systems operated under until the outage.
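The image for such a service tends to be short. The following Dockerfile is a hedged sketch, not the repository's actual one; the module path `app.main:app`, the port, and the requirements layout are illustrative.

```dockerfile
# Hypothetical Dockerfile for illustration only.
FROM python:3.11-slim

WORKDIR /app

# Install dependencies first so this layer is cached across code changes.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

# Serve the FastAPI application with uvicorn.
EXPOSE 8001
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8001"]
```

Building this image at CI time and promoting the same tagged image through staging to production is what turns "it passed testing" into a statement about the exact bytes that will run.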

Container images are versioned alongside model packages, so rolling back a bad deployment means rolling back to a specific image rather than reconstructing an environment from memory.

Differential testing

Differential testing, sometimes called shadow testing, runs two versions of a model in parallel against the same production traffic, comparing outputs without exposing users to the new model’s decisions. When a new model is being evaluated, it receives the same requests as the current model and its responses are logged and compared. Discrepancies are investigated before the new model goes live.

This addresses the fundamental problem that a model which performs well on held-out test data may still behave unexpectedly on real production traffic, because real production traffic has a habit of containing things that did not appear in the test set. Differential testing surfaces these discrepancies at the scale and distribution of actual usage rather than in a test suite designed by the people who built the model.
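The comparison logic itself can be outlined in a few lines. This is a hedged sketch of the idea, not the repository's implementation (the stage is a placeholder there); the model interfaces are illustrative stand-ins for anything with a scikit-learn-style `predict`.

```python
from dataclasses import dataclass, field

@dataclass
class ShadowComparator:
    current_model: object            # serves live traffic
    candidate_model: object          # evaluated silently in shadow
    discrepancies: list = field(default_factory=list)

    def predict(self, features):
        live = self.current_model.predict([features])[0]
        shadow = self.candidate_model.predict([features])[0]
        if live != shadow:
            # Disagreements are recorded for offline investigation,
            # never exposed to the caller.
            self.discrepancies.append((features, live, shadow))
        return live                  # the user only ever sees the current model
```

In a real deployment the shadow call would run asynchronously and the log would go somewhere durable, but the invariant is the same: the candidate's opinions are collected, never served.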

The stage is a placeholder in the Iris repository, which is honest. Differential testing infrastructure is non-trivial to build and is often skipped until a model update causes enough of an incident to justify the investment.

Deploying to infrastructure

The final stage moves the containerised application onto cloud infrastructure. The IaaS stage covers deployment to a cloud provider, at which point the model is running on servers that are not anyone’s laptop, accessible over the network, and subject to the operational concerns that the previous six stages were preparing for: monitoring, scaling, cost management, security, and the particular dread of receiving an alert at an inconvenient hour.

A model in production is no longer a research project. It is a service with users, dependencies, and an implicit promise that it will continue to behave as expected even as the world around it changes. The pipeline is what makes that promise maintainable. Without it, each update is a small act of hope. With it, updates are boring, which is the correct state for infrastructure to be in.