June 13, 2024


What is Model Serving?

Creating a model is one thing, but using that model in production is quite another. The next step after a data scientist completes a model is to deploy it so that it can serve the application.

Model serving falls into two main categories: batch and online. Batch serving feeds a large amount of data into a model and writes the results to a table, usually as a scheduled job. Online serving deploys the model behind an endpoint so that applications can send a request and receive a response with low latency.
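The difference between the two categories can be sketched in a few lines of Python. This is only an illustration: `predict` is a hypothetical stand-in for a trained model, and `batch_score` and `handle_request` are made-up names, not part of any serving tool.

```python
import json

# Hypothetical stand-in for a trained model.
def predict(record):
    return 0.5 * record["amount"] + 1.0

def batch_score(records):
    """Batch serving: score every record in one pass and return rows for a results table."""
    return [{"id": r["id"], "score": predict(r)} for r in records]

def handle_request(body):
    """Online serving: score one JSON request and return a JSON response right away."""
    record = json.loads(body)
    return json.dumps({"id": record["id"], "score": predict(record)})

# A scheduled job would call batch_score on the full dataset...
table = batch_score([{"id": 1, "amount": 10.0}, {"id": 2, "amount": 4.0}])
# ...while an endpoint would call handle_request once per incoming request.
reply = handle_request('{"id": 3, "amount": 2.0}')
```

In a real deployment the batch function would run under a scheduler and write to a database table, and the request handler would sit behind an HTTP or gRPC server.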

For applications to integrate AI into their systems, model serving essentially means hosting machine-learning models (in the cloud or on-premises) and making their functions accessible via an API. Model serving matters because a company cannot offer AI products to a broad user population without making its models accessible. Deploying a machine-learning model to production also requires managing resources and monitoring the model for operational statistics and model drift.
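As a rough illustration of drift monitoring, the sketch below compares the mean of a feature in live traffic with its mean in the training data. The function names and the threshold of 2.0 are assumptions chosen for the example; production systems typically use richer statistical tests.

```python
from statistics import mean, stdev

def drift_score(training_values, live_values):
    """Shift of the live mean from the training mean, in training standard deviations."""
    baseline_mean = mean(training_values)
    baseline_std = stdev(training_values)
    return abs(mean(live_values) - baseline_mean) / baseline_std

def has_drifted(training_values, live_values, threshold=2.0):
    # The threshold is an illustrative choice, not a recommended value.
    return drift_score(training_values, live_values) > threshold

training = [10.0, 12.0, 11.0, 9.0, 13.0]
stable_live = [11.0, 10.0, 12.0]    # same distribution as training
shifted_live = [20.0, 22.0, 21.0]   # inputs have moved well away from training
```

A monitoring job could run a check like this on each input feature and raise an alert when `has_drifted` returns `True`.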

A deployed model is the culmination of any machine-learning application. Tools from companies like Amazon, Microsoft, Google, and IBM make it easier to deploy machine-learning models as web services. Some models call for straightforward deployments, while others need more complicated pipelines. Specialized tools can also automate time-consuming operations when you build your machine-learning offerings.

Model Serving Tools

Managing model serving for non-trivial AI products is difficult, and doing it poorly can hurt the business financially. Machine-learning models can be scaled up and deployed in secure environments using a variety of ML serving tools, such as:


BentoML

BentoML standardizes model packaging and gives users an easy way to set up prediction services in a variety of deployment environments. Its open-source platform lets teams deliver prediction services in a quick, repeatable, and scalable manner by bridging the gap between data science and DevOps.

Its companion tool, bentoctl, can deploy to any cloud environment. BentoML offers online API serving over REST/gRPC as well as offline batch serving, and it automatically generates and configures Docker images for deployment. The API model server is high-performance and supports adaptive micro-batching. Native Python support scales inference workers independently of business logic, and a web interface and APIs provide a central hub for managing models and deployment processes.
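Micro-batching groups requests that arrive close together so that the model can process them in one call. The sketch below is a simplified simulation with fixed limits, not BentoML's implementation: `form_batches`, `max_batch_size`, and `max_wait` are hypothetical names, and an adaptive batcher would tune these limits at runtime.

```python
def form_batches(arrivals, max_batch_size, max_wait):
    """Group (arrival_time, payload) pairs into batches.

    A batch closes when it is full, or when the next request arrives
    more than max_wait seconds after the first request in the batch.
    """
    batches = []
    current = []
    window_start = None
    for arrival_time, payload in arrivals:
        full = len(current) >= max_batch_size
        expired = bool(current) and arrival_time - window_start > max_wait
        if full or expired:
            batches.append(current)
            current = []
        if not current:
            window_start = arrival_time  # first request opens a new window
        current.append(payload)
    if current:
        batches.append(current)
    return batches

# Five requests: three in a quick burst, then two more after a pause.
requests = [(0.000, "a"), (0.002, "b"), (0.004, "c"), (0.050, "d"), (0.051, "e")]
batches = form_batches(requests, max_batch_size=2, max_wait=0.010)
```

With these inputs the burst is split because of the size limit, and the pause closes the third request's batch early.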


Cortex

Cortex is an open-source platform for deploying, managing, and scaling machine learning models. It is a multi-framework tool that supports deploying many types of models.

To support massive machine learning workloads, Cortex is built on top of Kubernetes. It scales APIs automatically to handle production workloads.

With Cortex you can deploy several models in a single API, run inference on any type of AWS instance, and update deployed APIs without interrupting users' access. You can also track API performance and prediction results.
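Request-based autoscaling of this kind can be pictured with a small calculation. This is a generic sketch, not Cortex's actual algorithm; the function and parameter names are assumptions made for the example.

```python
import math

def desired_replicas(in_flight_requests, target_per_replica, min_replicas=1, max_replicas=10):
    """Replicas needed so that each one handles at most target_per_replica requests."""
    needed = math.ceil(in_flight_requests / target_per_replica)
    # Clamp to the configured bounds so the API never scales to zero or without limit.
    return max(min_replicas, min(max_replicas, needed))

idle = desired_replicas(0, target_per_replica=4)
busy = desired_replicas(9, target_per_replica=4)
overloaded = desired_replicas(100, target_per_replica=4)
```

The lower bound keeps at least one replica warm, and the upper bound caps cost during traffic spikes.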

TensorFlow Serving

TensorFlow Serving is a flexible serving system for machine learning models, designed for production environments. It handles the inference side of machine learning: it takes models after training and manages their lifetimes, giving clients versioned access through a high-performance, reference-counted lookup table.

It exposes both gRPC and HTTP inference endpoints and can simultaneously serve several models or versions of the same model. Additionally, it enables the deployment of new model versions without requiring you to modify your code and permits flexible experimental model testing.
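As an example of the HTTP endpoint, the sketch below builds a prediction request for TensorFlow Serving's REST API. The host, the model name `my_model`, and the input values are placeholders, and port 8501 is assumed to be the REST port of a locally running server.

```python
import json
import urllib.request

def predict_url(host, model_name, version=None, port=8501):
    """URL for the latest version, or for a pinned version when one is given."""
    path = f"/v1/models/{model_name}"
    if version is not None:
        path += f"/versions/{version}"
    return f"http://{host}:{port}{path}:predict"

def build_predict_request(url, instances):
    """Wrap the inputs in the row-oriented 'instances' JSON body."""
    body = json.dumps({"instances": instances}).encode("utf-8")
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}, method="POST"
    )

latest_url = predict_url("localhost", "my_model")
pinned_url = predict_url("localhost", "my_model", version=2)
request = build_predict_request(pinned_url, [[1.0, 2.0, 5.0]])
# Against a running server, urllib.request.urlopen(request) would send it.
```

Because the version is part of the URL, a client can keep calling the latest version or pin a specific one without any other change to its code.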

It supports many kinds of servables, including TensorFlow models, embeddings, vocabularies, feature transformations, and non-TensorFlow-based machine learning models. Its efficient, low-overhead implementation adds little latency to inference time.


TorchServe

TorchServe is a flexible and easy-to-use tool for serving PyTorch models. It is an open-source framework that lets you deploy trained PyTorch models quickly and efficiently at scale without writing custom code. TorchServe offers lightweight serving with low latency, so you can deploy your models for high-performance inference.

TorchServe is in beta and may still change, but it already offers several useful features, such as:

  • Serving several models
  • Model versioning for A/B testing
  • Metrics for monitoring
  • RESTful endpoints for application integration
  • Support for any machine learning environment, including Amazon SageMaker, Kubernetes, Amazon EKS, and Amazon EC2
  • Suitability for many types of inference in production settings
  • A user-friendly command-line interface
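To make the A/B testing idea concrete, here is a minimal sketch of how traffic might be split between two model versions. It is not TorchServe's API: `pick_version` is a hypothetical helper that a client or gateway could use before calling a versioned endpoint.

```python
import hashlib

def pick_version(user_id, versions, weights):
    """Deterministically map a user to a model version according to traffic weights."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    # Turn the first 8 bytes of the hash into a roughly uniform value in [0, 1].
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    cumulative = 0.0
    for version, weight in zip(versions, weights):
        cumulative += weight
        if bucket < cumulative:
            return version
    return versions[-1]  # guard against rounding at the upper edge

versions = ["1.0", "2.0"]
first = pick_version("alice", versions, [0.9, 0.1])
second = pick_version("alice", versions, [0.9, 0.1])
```

Hashing the user ID, rather than drawing a random number per request, keeps each user on the same version for the whole experiment.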

KFServing

KFServing provides a Kubernetes Custom Resource Definition (CRD) for serving machine learning models on arbitrary frameworks. To address production model serving use cases, it offers performant, high-abstraction interfaces for popular ML frameworks such as TensorFlow, XGBoost, scikit-learn, PyTorch, and ONNX.

The tool offers a serverless machine-learning inference solution that enables you to deploy your models using a standardized and user-friendly interface.
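A custom resource of this kind is usually written as a manifest. The sketch below builds one as a Python dictionary; the field layout is assumed to follow the `v1beta1` InferenceService schema, and the service name and storage URI are placeholders.

```python
import json

def inference_service(name, framework, storage_uri):
    """Build an InferenceService manifest (field layout assumed from the v1beta1 schema)."""
    return {
        "apiVersion": "serving.kubeflow.org/v1beta1",
        "kind": "InferenceService",
        "metadata": {"name": name},
        # The framework key (for example "sklearn") selects the model server.
        "spec": {"predictor": {framework: {"storageUri": storage_uri}}},
    }

# Placeholder name and bucket; substitute your own model location.
manifest = inference_service("sklearn-iris", "sklearn", "gs://example-bucket/models/iris")
print(json.dumps(manifest, indent=2))
```

The same structure is meant to apply across frameworks: in this sketch, only the framework key and the storage location change.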

Multi Model Server

The Multi Model Server (MMS) is a flexible and easy-to-use tool for serving deep learning models trained with any ML/DL framework. It handles prediction requests through REST-based APIs and offers a simple command-line interface. In production settings, it can be used for many types of inference.

You can start a service that creates HTTP endpoints to handle model inference requests using the MMS Server CLI or the pre-configured Docker images.

Triton Inference Server

Triton Inference Server provides an inferencing solution for both cloud and edge. It is optimized for both CPUs and GPUs. Triton supports the HTTP/REST and gRPC protocols, which let remote clients request inference for any model the server manages. For edge deployments, Triton is also available as a shared library with a C API, so all of its features can be integrated directly into an application.

It supports several deep learning model formats, including TensorRT, TensorFlow GraphDef, TensorFlow SavedModel, ONNX, and PyTorch TorchScript. It can run multiple deep-learning models concurrently on the same GPU, and it supports dynamic batching, model ensembles, and extensible backends. It also reports metrics for server throughput, latency, and GPU utilization in Prometheus data format.
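A remote client talks to the HTTP/REST endpoint with a JSON body that describes each input tensor. The sketch below builds such a body, assuming the v2 inference protocol; the model name `my_model`, the input name `INPUT0`, and the local port are placeholders that depend on the deployed model and its configuration.

```python
import json

def infer_payload(input_name, rows):
    """Build a JSON inference request for a 2-D FP32 input tensor."""
    shape = [len(rows), len(rows[0])]
    flat = [value for row in rows for value in row]  # row-major flattening
    return {
        "inputs": [
            {"name": input_name, "shape": shape, "datatype": "FP32", "data": flat}
        ]
    }

# Placeholder endpoint for a locally running server.
url = "http://localhost:8000/v2/models/my_model/infer"
payload = infer_payload("INPUT0", [[1.0, 2.0], [3.0, 4.0]])
body = json.dumps(payload)
```

Posting `body` to `url` would return a JSON response that lists the output tensors in the same name, shape, datatype, and data layout.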


ForestFlow

ForestFlow is an LF AI Foundation incubation project licensed under Apache 2.0. It is a scalable, policy-based, cloud-native machine learning model server that makes deploying and managing ML models simple. It gives data scientists a straightforward way to deploy models to a production system quickly and with little friction, shortening the time it takes to deliver production value.

It can run as a single instance (on a laptop or server) or as a cluster of nodes that manages and distributes work automatically. To stay efficient, it automatically scales down (dehydrates) models and resources when they are not in use and rehydrates models into memory when they are needed again. It also allows model deployment in Shadow Mode and offers native Kubernetes integration for deploying on Kubernetes clusters with minimal effort.
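The dehydrate and rehydrate cycle can be pictured as a cache that drops idle models and reloads them on demand. This is a conceptual sketch, not ForestFlow code: `ModelCache`, `fake_loader`, and the 60-second idle limit are all made up for the example.

```python
class ModelCache:
    """Keeps models in memory and drops the ones that have been idle too long."""

    def __init__(self, loader, idle_seconds):
        self.loader = loader
        self.idle_seconds = idle_seconds
        self.loaded = {}      # model name -> in-memory model
        self.last_used = {}   # model name -> time of last request

    def get(self, name, now):
        if name not in self.loaded:
            self.loaded[name] = self.loader(name)  # rehydrate on demand
        self.last_used[name] = now
        return self.loaded[name]

    def evict_idle(self, now):
        idle = [n for n in self.loaded if now - self.last_used[n] > self.idle_seconds]
        for name in idle:
            del self.loaded[name]  # dehydrate: free the memory, keep the bookkeeping
        return idle

load_calls = []

def fake_loader(name):
    load_calls.append(name)
    return f"model:{name}"

cache = ModelCache(fake_loader, idle_seconds=60)
cache.get("churn", now=0)
cache.get("churn", now=30)            # still in memory, no reload
evicted = cache.evict_idle(now=120)   # idle for 90 seconds, so it is dropped
model = cache.get("churn", now=130)   # loaded again on the next request
```

Timestamps are passed in explicitly so the behavior is easy to follow; a real server would read the clock itself and run eviction in the background.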

Seldon Core

Seldon Core is an open-source platform and framework for deploying your machine learning models and experiments at scale on Kubernetes. It is a mature, secure, and reliable system that is kept up to date and is cloud-agnostic.

It supports powerful, complex inference graphs built from predictors, transformers, routers, combiners, and more. It offers a simple way to containerize ML models using pre-packaged inference servers, custom servers, or language wrappers. Provenance metadata links each model to its corresponding training system, data, and metrics.
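The roles of these graph components can be illustrated with plain Python functions. This is a conceptual sketch rather than Seldon Core's API: every function here is a hypothetical stand-in for a containerized graph node.

```python
def transformer(features):
    """Preprocess the input before it reaches any model."""
    return [x / 10.0 for x in features]

def model_a(features):
    return sum(features)

def model_b(features):
    return max(features)

def combiner(outputs):
    """Merge the outputs of several models into one prediction."""
    return sum(outputs) / len(outputs)

def ensemble(features):
    return combiner([model_a(features), model_b(features)])

def router(features):
    """Send the request to exactly one child of the graph."""
    return ensemble if len(features) > 2 else model_a

def run_graph(features):
    scaled = transformer(features)
    child = router(scaled)
    return child(scaled)

short_result = run_graph([10.0, 20.0])        # routed to model_a only
long_result = run_graph([10.0, 20.0, 30.0])   # routed to the two-model ensemble
```

In a deployed graph each of these steps would typically be its own container, with the platform passing requests between them.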


BudgetML

BudgetML is aimed at practitioners who want to deploy their models to an endpoint quickly without spending a lot of time, money, or effort figuring out how to do it end to end. It was created because it is hard to find a simple way to get a model into production quickly and cheaply.

It is meant to be quick, simple, and developer-friendly. It is not intended for a full-fledged, production-ready setting; it is simply a way to get a server up and running as quickly and cheaply as possible.

BudgetML lets you deploy your model on a Google Cloud Platform preemptible instance, which is about 80% cheaper than a standard instance, with a secured HTTPS API endpoint. The tool configures the instance so that there is only a short downtime when it shuts down (at least once every 24 hours). BudgetML aims for the cheapest possible API endpoint with the least possible downtime.


Gradio

Gradio is an open-source Python library for building web applications and demos for machine learning and data science.

Gradio makes it simple to quickly design a stunning user interface for your machine learning models or data science workflow. You can invite users to “try it out” by dragging and dropping their own images, pasting text, recording their voice, and interacting with your demo through the browser.

Gradio can be used for:

  • Demonstrating your machine learning models to users and students
  • Quickly deploying your models with built-in sharing links and collecting feedback on model performance
  • Interactively debugging your model during development with the built-in manipulation and interpretation tools

GraphPipe

GraphPipe is a protocol and a set of tools created to simplify machine learning model deployment and decouple it from framework-specific model implementations.

Existing model serving solutions are inconsistent and/or inefficient. Because there is no standard protocol for communicating with the various model servers, it is often necessary to build custom clients for each workload. GraphPipe addresses these issues by standardizing on an efficient communication protocol and providing simple model servers for the major ML frameworks.

It includes a simple machine learning transport specification based on FlatBuffers, efficient client implementations in Go, Python, and Java, and simple, efficient reference model servers for TensorFlow, Caffe2, and ONNX.


Hydrosphere Serving

Hydrosphere Serving is a cluster for deploying and versioning your machine-learning models in production. It supports machine learning models built in any language or framework. Hydrosphere Serving packages them in a Docker image and deploys them on your production cluster, exposing HTTP, gRPC, and Kafka interfaces. It can also shadow traffic between different model versions so that you can see how each version behaves on the same traffic.
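Traffic shadowing means that a candidate version sees the same requests as the live version, while only the live version's answer goes back to the caller. The sketch below shows the idea in plain Python; it is not Hydrosphere's API, and the function and model names are hypothetical.

```python
def serve_with_shadow(request, primary, shadows, shadow_log):
    """Answer with the primary model and record what each shadow model would have said."""
    response = primary(request)
    for name, shadow in shadows.items():
        try:
            shadow_log.append((name, request, shadow(request)))
        except Exception as error:
            # A failing shadow must never affect the caller.
            shadow_log.append((name, request, repr(error)))
    return response

def model_v1(x):
    return x * 2

def model_v2(x):
    return x * 2 + 1

log = []
answer = serve_with_shadow(5, model_v1, {"v2": model_v2}, log)
```

Comparing the logged shadow outputs with the primary responses shows how a new version would behave before it receives any real traffic.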


MLEM

MLEM helps you package and deploy machine learning models. It saves machine learning models in a standard format that can be used in a variety of production settings, such as real-time REST serving and batch processing. It can also switch platforms transparently with a single command. You can run your machine learning models anywhere and deploy them to Heroku, SageMaker, or Kubernetes (more platforms coming soon).

It uses the same metafile format for any ML framework. It can automatically include Python requirements and input data requirements in a form that is ready for deployment. MLEM also does not require you to change the model training code. You only add two lines around your Python code: one to import the library and one to save the model.


Opyrator

Opyrator instantly turns your Python functions into production-ready microservices. You can deploy and access your services through an HTTP API or an interactive UI, and seamlessly export them as portable, executable files or Docker images. Opyrator is built on open standards (OpenAPI, JSON Schema, and Python type hints) and is powered by FastAPI, Streamlit, and Pydantic. It takes the pain out of productizing and sharing your Python code, or anything else you can wrap in a single Python function.

Apache PredictionIO

Apache PredictionIO is an open-source machine learning framework for developers, data scientists, and end users. It supports event collection, algorithm deployment, evaluation, and querying of predictive results through REST APIs. It is built on scalable open-source services such as Hadoop, HBase (and other databases), Elasticsearch, and Spark, and it implements what is known as a Lambda Architecture.


Prathamesh Ingle is a Consulting Content Writer at MarktechPost. He is a Mechanical Engineer and works as a Data Analyst. He is also an AI practitioner and certified Data Scientist with an interest in applications of AI. He is enthusiastic about exploring new technologies and advancements and their real-life applications.



