Anatomy Of A Model Inference Service

Siddharth Sharma
8 min read · Jan 14


Context :

This document discusses the high-level architecture and components required to create a model prediction service. We won’t discuss or compare particular frameworks in detail. The underlying intention is to provide a holistic view of model serving pipelines, the APIs needed and the complexities involved. This document should help the reader understand the key components of model serving infrastructure, the process flow, and the areas where out-of-the-box frameworks (like TorchServe) can help.

(All the diagrams in this document are from the pre-print edition of the Deep Learning System Design book.)

What is a model?

A deep learning model can be viewed as an executable program that contains a prediction algorithm and the data required to make a prediction.

Model Artifact: Artifacts include the code that trains models and produces inferences, and any generated data such as trained models, inferences, and metrics.

  • Model Graph
  • Model Weights
  • Vocabulary
  • Embeddings

Model Metadata : Metadata is any piece of data that describes an artifact, or describes relationships between artifacts.

  • Owner
  • Timestamp
  • Model id
  • Version
  • Training Data Version
  • Model Artifact Storage URL
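As a rough illustration, the metadata fields listed above could be captured in a small record type. This is only a sketch: the class name `ModelMetadata`, the field names and every sample value are assumptions made for this example, not a schema from any particular metadata store.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone


@dataclass(frozen=True)
class ModelMetadata:
    """Describes one model artifact; the fields mirror the list above."""
    model_id: str
    version: str
    owner: str
    created_at: datetime           # the "Timestamp" field
    training_data_version: str
    artifact_url: str              # the "Model Artifact Storage URL" field


# All values below are made up for illustration.
metadata = ModelMetadata(
    model_id="intent-classifier",
    version="3",
    owner="ml-platform-team",
    created_at=datetime(2023, 1, 14, tzinfo=timezone.utc),
    training_data_version="2023-01-01",
    artifact_url="s3://example-bucket/models/intent-classifier/3/",
)

# A plain dict is convenient to persist in a metadata store.
record = asdict(metadata)
```

Keeping the artifact URL in the metadata, rather than the artifact itself, is what lets a serving system look up a model by id and fetch the files only when needed.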

Model Prediction Service

Model Prediction Workflow

Common model serving strategies

  1. Monolith design: Load the model and run model prediction inside the user application’s process. For example, a flower identification mobile app can load an image classification model directly in its local process and predict plant identity from the given photos. The entire model loading and serving happens within the mobile app locally (on the phone), without talking to other processes or remote servers.
  2. Model service: Model service means running model serving on the server side. For each individual model, each version of a model or each type of model, we build a dedicated web service. This web service exposes the model prediction API over HTTP or gRPC interfaces. The model service manages the full life cycle of model serving, including fetching the model file from the model artifact store, loading the model, executing the model prediction algorithm for customer requests and unloading the model to reclaim server resources.
  3. Model server: The model server approach is designed to handle multiple types of models in a black-box manner. Regardless of the model algorithm and model version, the model server can serve these models with a unified web prediction API. Many open source model serving libraries and services, such as TensorFlow Serving, TorchServe and NVIDIA Triton, offer model server solutions. What we need to do is build customized logic on top of these tools to solve our own business needs, which includes managing model request routing, the model access API, model files, model metadata and input/output data processing.
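To make the model server idea concrete, here is a minimal sketch of a server that treats every model as a black box behind one prediction call. The `ModelServer` class and its method names are hypothetical and are not taken from TensorFlow Serving, TorchServe or Triton; the two registered "models" are toy functions.

```python
from typing import Any, Callable, Dict, Tuple

# For this sketch, a "model" is any callable that maps a payload to a result.
Model = Callable[[Any], Any]


class ModelServer:
    """Serves many models and versions through one unified predict() call."""

    def __init__(self) -> None:
        self._models: Dict[Tuple[str, str], Model] = {}

    def register(self, name: str, version: str, model: Model) -> None:
        self._models[(name, version)] = model

    def unregister(self, name: str, version: str) -> None:
        # Dropping the reference lets the runtime reclaim the model's memory.
        self._models.pop((name, version), None)

    def predict(self, name: str, version: str, payload: Any) -> Any:
        try:
            model = self._models[(name, version)]
        except KeyError:
            raise LookupError(f"model {name}:{version} is not registered") from None
        return model(payload)


server = ModelServer()
server.register("word-count", "1", lambda text: len(text.split()))
server.register("shout", "2", lambda text: text.upper())

count = server.predict("word-count", "1", "serve many models")
shout = server.predict("shout", "2", "hello")
```

The caller only needs a model name, a version and a payload. What the model does internally is invisible to the server, which is the black-box property described above.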

Single Model Service

Objective: Build only the necessities to get models into production as quickly as possible, and keep the operating cost low.

When the backend predictor service starts, the predictor first loads the model and gets it ready to execute behind its web API. When a user request arrives, the backend application calls the predictor’s web API (for face swapping, in this example). The predictor then pre-processes the user request data (images), executes the model prediction algorithm, and post-processes the model output before returning it to the application backend. The key component in this design is the predictor: it is a web service and often runs as a Docker container.
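The predictor’s request path (load once at startup, then pre-process, predict and post-process per request) can be sketched as follows. This is an assumption-laden toy: the `Predictor` class is hypothetical, the "model" is a small lookup table, and text stands in for the images of the face-swapping example so that the snippet stays self-contained.

```python
import json
from typing import Dict, List


class Predictor:
    """Single-model predictor: loads one model at startup, then serves it."""

    def __init__(self, model_path: str) -> None:
        # Load once at startup so individual requests do not pay this cost.
        self.model = self._load_model(model_path)

    def _load_model(self, model_path: str) -> Dict[str, str]:
        # Placeholder: a real predictor would deserialize weights from
        # model_path. Here the "model" is a hard-coded lookup table.
        return {"hello": "greeting", "bye": "farewell"}

    def preprocess(self, raw_request: bytes) -> List[str]:
        text = json.loads(raw_request)["text"]
        return text.lower().split()

    def predict(self, tokens: List[str]) -> str:
        labels = [self.model[t] for t in tokens if t in self.model]
        return labels[0] if labels else "unknown"

    def postprocess(self, label: str) -> bytes:
        return json.dumps({"label": label}).encode("utf-8")

    def handle(self, raw_request: bytes) -> bytes:
        # The full per-request pipeline described above.
        return self.postprocess(self.predict(self.preprocess(raw_request)))


predictor = Predictor("unused-in-this-sketch")
response = predictor.handle(b'{"text": "Hello there"}')
```

In a real deployment, `handle` would sit behind an HTTP or gRPC endpoint inside the predictor container.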


Pros:

  • quick to bootstrap and easy to develop
  • minimalistic design


Cons:

  • can serve only one type of model
  • can serve only one version of a model

Multi-Tenant application

To build an omnipotent model serving system that can serve many models for many tenants, we need to consider a lot of things, such as the model file format, model libraries, model training frameworks, model caching, model versioning, one unified prediction API that suits all model types, model flow execution, model data processing and model management.


Model Serving Deep Dive

A. High Level Model Serving End-To-End Workflow

This sample service consists of a frontend interface component and a backend predictor. The frontend component does three things:

  1. Hosting the public prediction api
  2. Downloading model files from cloud storage to a shared disk volume, using the model URL kept in the metadata store
  3. Forwarding the prediction request to the backend predictor.

The backend predictor is a self-built predictor container, which is responsible for loading intent classification models and executing them to serve prediction requests.

This prediction service has two external dependencies: the metadata store service and a shared disk volume. The metadata store keeps all the information about a model, such as the model algorithm name, the model version and the model URL that points to the cloud storage location of the real model files. The shared volume enables model file sharing between the frontend service and the backend predictor.

B. The frontend service

Process Flow

  1. The user sends a prediction request with model ID “A” to the web interface;
  2. The web interface calls the predictor connection manager to serve this request;
  3. The predictor connection manager queries the metadata store for the model metadata whose model ID equals “A”; the returned model metadata contains the model algorithm type and the model file URL;
  4. Based on the model algorithm type, the predictor connection manager picks the right predictor backend client to handle the request. In this case, it chooses CustomGrpcPredictorBackend, since we are demoing a self-built model serving container for intent classification;
  5. The CustomGrpcPredictorBackend client first checks whether the model file for model “A” exists on the shared model file disk. If the model hasn’t been downloaded before, it uses the model URL (from the model metadata) to download the model files from cloud storage to the shared file disk;
  6. The CustomGrpcPredictorBackend client then calls the model predictor that is pre-registered with this backend client in the service configuration file. In this example, the CustomGrpcPredictorBackend calls our self-built predictor.
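Steps 3 to 6 of the process flow above can be sketched in a few lines. This is an illustration under stated assumptions: the metadata store is a plain dict, the download and predictor calls are fakes, and `PredictorBackend` is a hypothetical stand-in for a client such as CustomGrpcPredictorBackend, whose real implementation is not shown in this document.

```python
import tempfile
from pathlib import Path
from typing import Callable, Dict, List


class PredictorBackend:
    """Hypothetical stand-in for a predictor backend client."""

    def __init__(self, shared_dir: Path,
                 download: Callable[[str, Path], None],
                 call_predictor: Callable[[str, str], str]) -> None:
        self.shared_dir = shared_dir
        self.download = download
        self.call_predictor = call_predictor

    def predict(self, model_id: str, model_url: str, document: str) -> str:
        model_path = self.shared_dir / model_id
        if not model_path.exists():               # step 5: download only once
            self.download(model_url, model_path)
        return self.call_predictor(model_id, document)        # step 6


def handle_request(model_id: str, document: str,
                   metadata_store: Dict[str, Dict[str, str]],
                   backends: Dict[str, PredictorBackend]) -> str:
    metadata = metadata_store[model_id]                       # step 3
    backend = backends[metadata["algorithm_type"]]            # step 4
    return backend.predict(model_id, metadata["model_url"], document)


# Fake collaborators so the sketch runs without cloud storage or gRPC.
downloads: List[str] = []


def fake_download(url: str, target: Path) -> None:
    downloads.append(url)
    target.write_text("fake model bytes")


def fake_predictor(model_id: str, document: str) -> str:
    return f"{model_id}:intent-for:{document}"


shared_dir = Path(tempfile.mkdtemp())  # plays the role of the shared volume
metadata_store = {"A": {"algorithm_type": "intent-classification",
                        "model_url": "s3://example-bucket/A"}}
backends = {"intent-classification":
            PredictorBackend(shared_dir, fake_download, fake_predictor)}

first = handle_request("A", "book a flight", metadata_store, backends)
second = handle_request("A", "cancel my order", metadata_store, backends)
```

The second request finds the model file already on the shared volume, so no second download is triggered.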

Predictor Connection Manager
One important role of the frontend service is routing prediction requests. Given a prediction request, the frontend service needs to find the right backend predictor to handle it, based on the model algorithm type required by the request. This routing is done in the PredictorConnectionManager. In our design, the mapping between model algorithms and predictors is pre-defined in environment properties. When the service starts, PredictorConnectionManager reads the mapping so the service knows which backend predictor to use for which model algorithm type.
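A minimal sketch of this routing, assuming the mapping is supplied as environment variables with a `PREDICTOR_` prefix. The prefix, the variable names and the addresses are all invented for this example; the actual property format used by the sample service is not given in this document.

```python
import os
from typing import Dict, Mapping


class PredictorConnectionManager:
    """Maps a model algorithm type to the address of a backend predictor."""

    ENV_PREFIX = "PREDICTOR_"  # hypothetical naming convention

    def __init__(self, env: Mapping[str, str] = os.environ) -> None:
        # Read the mapping once at startup, e.g.
        # PREDICTOR_INTENT_CLASSIFICATION=localhost:51001
        self._routes: Dict[str, str] = {
            key[len(self.ENV_PREFIX):].lower().replace("_", "-"): address
            for key, address in env.items()
            if key.startswith(self.ENV_PREFIX)
        }

    def route(self, algorithm_type: str) -> str:
        try:
            return self._routes[algorithm_type]
        except KeyError:
            raise LookupError(
                f"no predictor configured for {algorithm_type!r}") from None


# A plain dict stands in for the process environment here.
manager = PredictorConnectionManager({
    "PREDICTOR_INTENT_CLASSIFICATION": "localhost:51001",
    "PREDICTOR_TORCH": "localhost:7070",
    "PATH": "/usr/bin",  # unrelated variables are ignored
})
address = manager.route("intent-classification")
```

Because the mapping lives in configuration, adding a new predictor backend means adding one entry rather than changing the routing code.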

C. Backend Predictor Service

We can view this self-built classification predictor as an independent microservice that can serve multiple models simultaneously. It has a gRPC web interface and a model manager. The model manager is the heart of the predictor. It does several things, including loading model files, initializing the model, caching the model in memory and executing the model with user input.
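The model manager’s responsibilities (load, initialize, cache in memory, execute) can be sketched as below. The `ModelManager` class here is a hypothetical simplification, and `fake_loader` replaces real model deserialization so the example runs on its own.

```python
from pathlib import Path
from typing import Callable, Dict, List

# For this sketch, a loaded model is a callable from a document to a label.
LoadedModel = Callable[[str], str]


class ModelManager:
    """Loads models on first use and keeps them cached in memory."""

    def __init__(self, model_dir: Path,
                 loader: Callable[[Path], LoadedModel]) -> None:
        self._model_dir = model_dir
        self._loader = loader
        self._cache: Dict[str, LoadedModel] = {}

    def get(self, model_id: str) -> LoadedModel:
        if model_id not in self._cache:
            # Load and initialize only once per model id.
            self._cache[model_id] = self._loader(self._model_dir / model_id)
        return self._cache[model_id]

    def predict(self, model_id: str, document: str) -> str:
        return self.get(model_id)(document)

    def unload(self, model_id: str) -> None:
        self._cache.pop(model_id, None)


loads: List[str] = []


def fake_loader(path: Path) -> LoadedModel:
    loads.append(path.name)  # record each (expensive) load
    return lambda document: f"{path.name} says: {document}"


manager = ModelManager(Path("/models"), fake_loader)
first = manager.predict("A", "hello")
second = manager.predict("A", "again")   # served from the in-memory cache
third = manager.predict("B", "hello")    # a second model, loaded on demand
```

An unbounded cache like this one would eventually exhaust memory in a real service, so a production manager would also need an eviction policy.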

TorchServe

TorchServe is a performant, flexible and easy-to-use tool for serving PyTorch eager mode and TorchScript models. Similar to TensorFlow Serving, TorchServe takes a model server approach to serve all kinds of PyTorch models with a unified API. The difference is that TorchServe provides a set of management APIs that make model management convenient and flexible. For example, we can programmatically register and unregister models or different versions of a model. We can also scale the serving workers for models, and for versions of a model, up and down.

AWS Support:

In the image above, two gray boxes have been added: the TorchServe gRPC predictor backend client (TorchGrpcPredictorBackend) and the TorchServe server. The TorchGrpcPredictorBackend is responsible for downloading model files and sending prediction requests to the TorchServe container.

TorchServe is a tool built by the PyTorch team for serving PyTorch models (eager mode or torch scripted). TorchServe runs as a blackbox, and it provides HTTP and gRPC interfaces for model prediction and internal resource management. The above figure visualizes the workflow for how we use TorchServe in this sample.

Minimalistic Design Without Need For Frontend And Backend Services

A TorchServe server is composed of three components: frontend, backend and model store. The frontend handles the requests and responses of TorchServe. It also manages the life cycles of the models. The backend is a list of model workers, which are responsible for running the actual inference on the models. The model store is a directory that holds all the loadable models; it can be a cloud storage folder or a local host folder.

The figure above draws two workflows: model inference and model management.

For model inference, the user first sends a prediction request to the inference endpoint for a model, such as /predictions/{model_name}/{version}. Next, the inference request gets routed to one of the worker processes that has already loaded the model. Then, the worker process reads the model files from the model store and lets the model handler load the model, pre-process the input data and run the model to get a prediction result.
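As a hedged sketch, a client call to the inference endpoint described above could look like this, using only the Python standard library. It assumes a TorchServe instance listening on its default inference address, http://localhost:8080; the helper names and the model name "intent" are made up for this example.

```python
import urllib.parse
import urllib.request
from typing import Optional

INFERENCE_URL = "http://localhost:8080"  # assumed default inference address


def build_inference_request(model_name: str, payload: bytes,
                            version: Optional[str] = None
                            ) -> urllib.request.Request:
    """Builds a POST request for /predictions/{model_name}[/{version}]."""
    path = "/predictions/" + urllib.parse.quote(model_name, safe="")
    if version is not None:
        path += "/" + urllib.parse.quote(version, safe="")
    return urllib.request.Request(INFERENCE_URL + path,
                                  data=payload, method="POST")


def predict(model_name: str, payload: bytes,
            version: Optional[str] = None, timeout: float = 10.0) -> bytes:
    """Sends the request; needs a running server with the model registered."""
    request = build_inference_request(model_name, payload, version)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


# Building the request does not touch the network.
request = build_inference_request("intent", b"book a flight", version="1.0")
```

Omitting the version targets the model's default version. The response format depends on the model handler, so `predict` returns the raw bytes.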

For model management, a model needs to be registered before users can access it. This is done through the management API. We can also scale the worker process count for a model up and down. We will see an example in the sample usage section below.
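The management operations mentioned above (register, scale workers, unregister) map onto HTTP calls to the TorchServe management API. The sketch below only builds the requests, assuming the default management address http://localhost:8081; the helper function names and the archive name `intent.mar` are assumptions for illustration.

```python
import urllib.parse
import urllib.request

MANAGEMENT_URL = "http://localhost:8081"  # assumed default management address


def register_model_request(mar_url: str,
                           initial_workers: int = 1) -> urllib.request.Request:
    """POST /models?url=...&initial_workers=... registers a model archive."""
    query = urllib.parse.urlencode(
        {"url": mar_url, "initial_workers": initial_workers})
    return urllib.request.Request(f"{MANAGEMENT_URL}/models?{query}",
                                  method="POST")


def scale_workers_request(model_name: str,
                          min_worker: int) -> urllib.request.Request:
    """PUT /models/{model_name}?min_worker=... changes the worker count."""
    query = urllib.parse.urlencode({"min_worker": min_worker})
    return urllib.request.Request(
        f"{MANAGEMENT_URL}/models/{model_name}?{query}", method="PUT")


def unregister_model_request(model_name: str,
                             version: str) -> urllib.request.Request:
    """DELETE /models/{model_name}/{version} removes one model version."""
    return urllib.request.Request(
        f"{MANAGEMENT_URL}/models/{model_name}/{version}", method="DELETE")


# Sending any of these needs a running server, for example:
#   urllib.request.urlopen(register_model_request("intent.mar"))
register = register_model_request("intent.mar", initial_workers=2)
scale = scale_workers_request("intent", min_worker=3)
unregister = unregister_model_request("intent", "1.0")
```

Keeping management on a separate port from inference makes it easier to expose prediction endpoints publicly while restricting who can register or remove models.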


  • Can serve multiple models, or multiple versions of the same model.
  • Offers unified gRPC and HTTP endpoints for model inference.
  • Supports batching of prediction requests and performance tuning.
  • Supports workflows that compose PyTorch models and Python functions in sequential and parallel pipelines.
  • Provides a management API to register/unregister models and scale workers up/down.
  • Supports model versioning for A/B testing and experimentation.