Machine Learning Operations

Scaling machine learning models

Alberto Cruz
10 min read · May 15, 2021

Introduction to MLOps

Machine Learning Operations (MLOps) is a term used to describe best practices for developing and deploying machine learning systems, increasing model accuracy while shortening the delivery lifecycle. MLOps sets the standard for delivering high-performance AI models to production in real-life deployments. As such, it addresses ML model deployment challenges at the production level through well-defined technical collaboration. As organizations grow increasingly interested in integrating machine learning systems, understanding MLOps is more valuable than ever. This article covers the essential ideas behind MLOps, how to implement them, and their impact.

Machine learning deployments

There are several challenges when it comes to introducing machine learning deployments in real-world applications: many ML models that are developed never make it to production, and those that do often require extended delivery time. MLOps is the key to addressing these challenges, including:

  • Making machine learning collaborative: Making machine learning code and models truly collaborative is one of the common machine learning challenges. MLOps is the key to making a machine learning model visible from data extraction through deployment and monitoring.
  • Reproducibility of machine learning models: Reproducing or auditing machine learning models is another challenge, one that traditional software version control cannot solve on its own. Machine learning requires versioning training-related artifacts such as data, parameters, and metadata.
  • Machine learning as a continuous process: Machine learning has to be a continuous process, driven by the end application and dynamic data. When the process is continuous, retraining a machine learning model becomes a simple task that yields higher returns.
  • Testing, deploying, and maintaining ML systems: Some fundamental disconnects need to be addressed to put ML models into production. MLOps combines DevOps and data engineering with machine learning to better implement and sustain an ML system in production.

Machine learning lifecycle & need for MLOps

Understanding the machine learning lifecycle clarifies why MLOps is needed for successful deployments. To start, machine learning requires a handful of operations in the lifecycle, including business objectives, data engineering (data acquisition), data science (architecting & developing models), and DevOps (deployment).

A research paper on the technical framework of machine learning demonstrates that developing the code and model is merely the start of a sophisticated development lifecycle. Other components are built into the system, including data collection, data verification, analysis tools, process management systems, feature extraction, machine resource management, and other serving infrastructure.

As mentioned above, all these components are crucial for successful deployment, and the correct orchestration of these pieces is called the machine learning pipeline. Overlooking the need for these modules is one of the significant reasons ML deployments haven't been as effective as expected. MLOps is the key to mapping out this critical architecture.

The “machine learning pipeline” is the process that takes data and code as input, and produces a trained ML model as the output. This process usually involves data cleaning and pre-processing, feature engineering, model and algorithm selection, model optimization and evaluation.
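The definition above can be sketched as a chain of plain functions: data and code go in, a trained model comes out. Everything here (function names, the toy threshold "model", the sample data) is illustrative, not a real framework:

```python
# Minimal illustration of an ML pipeline: data + code in, trained model out.
# All names and the toy "model" are made up for this sketch.

def clean(rows):
    """Data cleaning / pre-processing: drop records with missing values."""
    return [r for r in rows if None not in r]

def featurize(rows):
    """Feature engineering: here, just the ratio of the two raw columns."""
    return [(x / y, label) for x, y, label in rows]

def train(examples):
    """'Model selection + optimization': learn a single threshold."""
    pos = [f for f, label in examples if label == 1]
    neg = [f for f, label in examples if label == 0]
    threshold = (min(pos) + max(neg)) / 2  # midpoint between the classes
    return lambda f: 1 if f > threshold else 0

# Running the pipeline end to end:
raw = [(4.0, 2.0, 1), (1.0, 2.0, 0), (None, 3.0, 1), (6.0, 2.0, 1), (1.0, 4.0, 0)]
model = train(featurize(clean(raw)))
```

Each stage consumes the previous stage's output, which is exactly what makes the whole chain versionable and orchestratable by the tools discussed next.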

ML Pipeline & Stack Technologies

MLOps is not dependent on a single technology or platform. Below is the list of the most important steps and technologies that play a significant role in good ML implementations.

Note: Most of the tools mentioned in the following list could be used for one or more processes or stages of the ML lifecycle.

1. Data and pipeline versioning

Data and pipeline versioning is the key to tracking pipelines, ML models, and data sets under version control. Versioning makes it possible to test, audit, and roll back to any previous version.
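The core idea behind data versioning can be sketched as content-addressing: derive a version id from the data itself, so identical data always maps to the same version. Real tools do this at file or object level; the registry and function names below are made up for illustration:

```python
import hashlib
import json

def dataset_version(records):
    """Derive a reproducible version id from the dataset contents.
    Real versioning tools apply the same idea to files and artifacts;
    this toy keeps everything in memory."""
    payload = json.dumps(records, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]

registry = {}  # version id -> dataset snapshot, enabling audit and rollback

v1_data = [{"x": 1, "y": 0}, {"x": 2, "y": 1}]
registry[dataset_version(v1_data)] = v1_data

v2_data = v1_data + [{"x": 3, "y": 1}]       # new data arrives
registry[dataset_version(v2_data)] = v2_data
```

Because unchanged data yields the same id, any trained model can record exactly which dataset version produced it, which is what makes audits and rollbacks tractable.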

Kubeflow is a free and open-source machine learning platform designed to enable using machine learning pipelines to orchestrate complicated workflows running on Kubernetes.

Kubeflow includes a central dashboard for navigating between its components, such as Pipelines, Jupyter notebook servers, Katib for hyperparameter tuning, and artifact stores. Across all of them, users can manage contributors to share access across namespaces.

2. Orchestration

Orchestration is the key to fitting ML deployments into existing workflows and processes. The right tool for efficiently orchestrating ML systems can create a significant advantage.

Kubeflow is also an option for orchestration; another recommendation worth naming is Airflow.

Airflow is another open-source platform for managing workflow orchestration, built around directed acyclic graphs (DAGs). Apache Airflow is not a tool designed exclusively for machine learning processes, but its flexibility fits this use case very well.
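The scheduling idea behind a DAG orchestrator can be sketched in a few lines of plain Python: each task runs only after all of its upstream dependencies have finished. This is a conceptual toy, not the Airflow API (cycle detection and error handling are omitted):

```python
def run_dag(tasks, deps):
    """Execute tasks in dependency order (a simple topological traversal).
    `tasks` maps name -> callable; `deps` maps name -> upstream task names."""
    done, order = set(), []
    def visit(name):
        if name in done:
            return
        for upstream in deps.get(name, []):
            visit(upstream)   # a task runs only after its dependencies
        tasks[name]()
        done.add(name)
        order.append(name)
    for name in tasks:
        visit(name)
    return order

log = []
tasks = {
    "extract":  lambda: log.append("extract"),
    "train":    lambda: log.append("train"),
    "validate": lambda: log.append("validate"),
    "deploy":   lambda: log.append("deploy"),
}
deps = {"train": ["extract"], "validate": ["train"], "deploy": ["validate"]}
order = run_dag(tasks, deps)
```

In Airflow the same dependency structure is declared on task objects, and the scheduler adds retries, backfills, and parallelism on top of this ordering guarantee.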

3. Experiment tracking

MLflow is an MLOps tool created by Databricks and it is organized into four components: Tracking, Projects, Models, and Model Registry. It allows users to use each module separately, although they are meant to work well together.

MLflow Tracking provides an API and UI for logging parameters, code versions, metrics, and artifacts when running machine learning code. The results can later be visualized, compared, and shared with different teams. Other advantages include the ability to work with any ML library, algorithm, deployment tool, or language.

MLflow experiment example
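What an experiment tracker records per run can be sketched with an in-memory stand-in. The real library persists runs to a tracking server or local files; the class and function names below are made up for this sketch and are not MLflow's API:

```python
import time

class Run:
    """Toy stand-in for one tracked experiment run."""
    def __init__(self, experiment):
        self.experiment = experiment
        self.start = time.time()
        self.params, self.metrics = {}, {}

    def log_param(self, key, value):
        self.params[key] = value                         # e.g. learning rate

    def log_metric(self, key, value):
        self.metrics.setdefault(key, []).append(value)   # history per metric

runs = []

def track(experiment, param_grid, train_fn):
    """Run `train_fn` once per configuration and record params + metrics."""
    for params in param_grid:
        run = Run(experiment)
        for k, v in params.items():
            run.log_param(k, v)
        run.log_metric("accuracy", train_fn(params))
        runs.append(run)
    # later: compare runs in a UI, or pick the best programmatically
    return max(runs, key=lambda r: r.metrics["accuracy"][-1])

best = track(
    "demo",
    [{"lr": 0.1}, {"lr": 0.01}],
    train_fn=lambda p: 0.9 if p["lr"] == 0.01 else 0.7,  # fake training
)
```

The point is the data model: every run binds its parameters to its metrics, which is what makes later comparison and reproduction possible.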

4. Model serving and monitoring

Kubeflow and MLflow cover this process as well, but another good option is Seldon. Seldon is an open-source platform for easily deploying and monitoring machine learning models on Kubernetes.

Seldon converts your ML models (TensorFlow, PyTorch, H2O, etc.) or language code (Python, Java, R, etc.) into production REST/gRPC microservices, making it very easy to invoke your model.
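The shape such a served model takes is roughly a class exposing a `predict` method over batches of feature rows, which the serving layer then wraps in REST/gRPC endpoints. The class below is a made-up sketch of that pattern; the exact wrapper contract (method name, argument names) should be checked against Seldon's documentation:

```python
class IrisClassifier:
    """Sketch of a model class that a serving wrapper could expose
    as a REST/gRPC microservice. The name and logic are illustrative."""

    def __init__(self):
        # In practice, load a trained artifact here (pickle, SavedModel, ...).
        self.threshold = 2.5

    def predict(self, X, feature_names=None):
        # X is a batch of feature rows; return one prediction per row.
        return [1 if sum(row) > self.threshold else 0 for row in X]

model = IrisClassifier()
batch = [[0.5, 0.5], [2.0, 1.0]]
predictions = model.predict(batch, feature_names=["a", "b"])
```

Keeping the model behind a plain `predict` interface is what lets the platform swap frameworks (TensorFlow, PyTorch, plain Python) without changing the service contract.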

Data Science Steps Of Delivering ML Model

Once the use case is defined, here are the next essential steps to deliver ML models into production either manually or using an automated pipeline:

  1. Select and extract the relevant data from the data sources needed for the ML task.
  2. Analyze the data through Exploratory Data Analysis (EDA) to inform the model-building steps that follow:
  • Understanding the data schema and characteristics required by the ML model.
  • Understanding the data preparation and feature engineering the ML application requires.

3. Prepare the data for the ML task, including data cleaning and splitting into training, validation, and test sets, applying data transformations and feature engineering as needed. This step delivers the data splits in the desired format as its output.

4. Now data scientists implement various algorithms on the prepared data to train ML models. This step includes subjecting the implemented algorithms to hyperparameter tuning to obtain the highest-performing model.

5. Evaluate the ML model on a holdout test set to verify model performance and quality.

6. Final model validation to ensure that the model is adequate for deployment with performance beyond the defined baseline.

7. Deploying the validated model in the environment to serve predictions. These deployments include:

  • A microservice with a REST API to serve predictions.
  • An embedded model for edge or mobile devices.
  • Part of a batch prediction system.

8. Last but not least, monitor model performance to evaluate the need for a new iteration of the ML process.
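Steps 3 through 7 can be compressed into a minimal sketch: split the data, train, and only "deploy" if the holdout score clears the defined baseline. All names and the majority-class toy model are illustrative (a real split would also shuffle and stratify):

```python
def split(rows):
    """Step 3: train / validation / holdout test splits (60/20/20).
    A real pipeline would shuffle and stratify before slicing."""
    n = len(rows)
    return rows[: int(n * 0.6)], rows[int(n * 0.6): int(n * 0.8)], rows[int(n * 0.8):]

def train(rows):
    """Step 4: 'train' a majority-class model (stand-in for real tuning)."""
    labels = [label for _, label in rows]
    majority = max(set(labels), key=labels.count)
    return lambda x: majority

def accuracy(model, rows):
    """Step 5: evaluate on held-out data."""
    return sum(model(x) == label for x, label in rows) / len(rows)

data = ([(i, 1) for i in range(6)]      # six positive examples
        + [(6, 0), (7, 0)]              # two negatives
        + [(8, 1), (9, 1)])             # two more positives
train_set, val_set, test_set = split(data)
model = train(train_set)

BASELINE = 0.5
deployable = accuracy(model, test_set) > BASELINE   # step 6: validation gate
```

The gate at the end is the point: deployment (step 7) happens only when the holdout score beats the baseline, and monitoring (step 8) decides when to loop back.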

Benefits of MLOps

MLOps introduces a handful of benefits that resolve the challenges mentioned above while lending more value and efficiency to sustainable deployments. Here are several reasons to adopt MLOps:

  • MLOps provides a shared infrastructure for every stakeholder in the lifecycle.
  • MLOps allows easy tracking and retrieval of ML code, data, and experiments.
  • A machine learning pipeline makes model retraining easier and addresses data drift.
  • A closed-loop system monitors and retrains an ML model in real-world deployments for better performance.
  • MLOps allows a better balance between the various parts of the lifecycle, from business KPIs to data science.
  • One can instill better regulatory practices on machine learning models using MLOps.
  • MLOps allows defined roles for team members. As such, data scientists don't have to worry about the business or operational components of the system, allowing them to focus on what they do best.
  • Just as DevOps shortens the production lifecycle, MLOps can drive better insights and implement them effectively.

Additionally, MLOps allows a smooth and efficient transition to ML systems, bringing with it the many advantages of successful ML deployments.

Challenges in Machine Learning Models

Data drift is one of the crucial challenges that MLOps can solve. Data drift (or dataset drift) is a change between the baseline data a model was trained on and the real-time production data. Detecting data drift is the key to identifying shifts in the data distribution and data integrity issues, and can signal when a model needs to be retrained.

MLOps can be the key to identifying defects in ML deployments, and therefore to making the improvements or retraining the ML model requires. MLOps keeps models useful as their datasets evolve and drift. There are various types of drift, such as concept drift (a change in the relationship between input and output), prediction drift (a shift in the model's predictions), label drift (a change in the output label distribution), and feature drift (a change in the input data distribution).

There are two major ways to detect data drift:

  1. Using a black-box shift detector: A black-box detector balances efficiency and accuracy. In this method, a shift detector is deployed on top of the primary model, and the primary model is treated as a black-box predictor. Any change in the predictions can be perceived as potential data drift by ML testers. This method is mainly useful for prior (label) shifts, since it can miss drift that does not affect predictions, such as some covariate shifts.
  2. Using a domain classifier: A domain classifier can detect covariate shift by explicitly training a classifier to discriminate between the source data and the target data. This method requires retraining whenever a new data set becomes available, but it is otherwise very effective: it can be applied locally at the sample level and can detect partial drift in observations and features.
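A simple, model-agnostic way to flag drift in a single feature (or, for a black-box detector, in the model's prediction scores) is a two-sample Kolmogorov-Smirnov statistic. The sketch below, including the 0.3 alert threshold, is illustrative; a production system would use a proper statistical test:

```python
import bisect

def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum gap between
    the two empirical CDFs. 0 means identical distributions; values near
    1 mean the distributions barely overlap."""
    a, b = sorted(sample_a), sorted(sample_b)
    def ecdf(sorted_sample, x):
        return bisect.bisect_right(sorted_sample, x) / len(sorted_sample)
    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in sorted(set(a) | set(b)))

def drift_alert(reference, production, threshold=0.3):
    """Flag a potential retraining need when a production window has moved
    too far from the reference. The 0.3 cutoff is an arbitrary choice for
    this sketch, not a recommended default."""
    return ks_statistic(reference, production) > threshold

reference = [0.1 * i for i in range(100)]          # feature values at training time
production = [0.1 * i + 5.0 for i in range(100)]   # the distribution has shifted
alert = drift_alert(reference, production)
```

Run per feature, this covers feature drift; run on prediction scores, it becomes a crude black-box shift detector in the sense of method 1 above.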

How to implement MLOps

Deploying MLOps into your application definitely comes with its own advantages, but it is necessary to understand how to implement it. There are many moving parts when adopting MLOps, but with a little planning it can deliver great results. Here is a brief overview of how to get started:

  • Defining clear and measurable benchmarks: First and foremost, one should have clear, measurable KPIs in mind to ensure that every aspect of the system is well defined towards the targeted goal of MLOps implementations & expected outcomes.
  • Identifying the stakeholders who chip in: Real-world ML systems are complex and require many different team members to play their part. Deploying an ML model involves far more than data gathering and training; it spans operational, business, and IT considerations. It generally includes data scientists, data engineers, machine learning engineers, DevOps engineers, and business teams.
  • Investing in the right infrastructure and software as needed: The concept of MLOps revolves heavily around an efficient operational infrastructure. It is necessary to be equipped with the right tools and products to support the ML lifecycle, and each company must adapt its MLOps stack to its own needs and budget.

  • Ensuring governance and compliance: Compliance is another critical aspect of deploying MLOps. One has to evaluate and consider all the applicable compliances with their ML deployment. Compliances like GDPR are an essential need in this era of data science. MLOps should be deployed with comprehensive plans to make the system auditable.
  • Tight monitoring for regulations & quality: Machine learning systems need consistent monitoring to ensure they operate within standards and regulations. ML systems may need timely model retraining, and monitoring is what reveals when critical attention is required. Monitoring against a defined protocol is the key to retrieving quality information.

Summary

Leaving ML deployments to talented data scientists alone is not the right approach. Machine learning is more than applying mathematical algorithms and running numbers. Production-level implementations of machine learning require thorough planning and collaboration between every part of the process. MLOps is the key to ensuring that machine learning adoption is efficient and resourceful. Of course, there is no single way: each company must adapt the process to its internal needs and budget.

Bonus Track

Many commercial platforms offer everything required for a successful MLOps lifecycle. The most important, though not the only ones, are:

  • Dataiku: Collaborative data science platform powering both self-service analytics and the operationalization of machine learning models in production.
  • DataRobot: Automated machine learning platform which enables users to build and deploy machine learning models.
  • H2O Driverless AI: Automates key machine learning tasks, delivering automatic feature engineering, model validation, model tuning, model selection and deployment, machine learning interpretability, bring-your-own-recipe support, time series, and automatic pipeline generation for model scoring.
  • IBM Watson Machine Learning: Create, train, and deploy self-learning models using an automated, collaborative workflow.
  • Algorithmia: Cloud platform to build, deploy and serve machine learning models.
  • Amazon SageMaker: End-to-end machine learning development and deployment interface where you are able to build notebooks that use EC2 instances as backend, and then can host models exposed on an API.
  • Microsoft Azure Machine Learning service: Build, train, and deploy models from the cloud to the edge.


Alberto Cruz

Data architect and tech enthusiast. “Without data you’re just another person with an opinion.”