In the previous articles, we covered some PyTorch basics. First, we explored tensors and gradients, and how we can use these concepts to write machine learning algorithms with this framework. Then we used that knowledge and applied PyTorch to its main purpose: deep learning. We had a chance to see how we can implement feedforward and convolutional neural networks for image classification. In this article, we cover TorchServe, a new way to deploy PyTorch models. This is still a new technology. Its current version is 0.1 and it is highly experimental, but it is very promising. In essence, TorchServe removes the need to write your own serving layer around a model. It is similar to TensorFlow Serving, and it makes creating modern deep learning applications much easier.



TorchServe Architecture

The main goal of TorchServe and similar applications is to provide an API through which other parts of the system can communicate with the model. It exposes three types of API:

  • API description – Retrieves a description of the API using OpenAPI 3.0 specification.
    Call example: curl -X OPTIONS http://localhost:8080
  • Health check API – Retrieves the status of the TorchServe.
    Call example: curl http://localhost:8080/ping
  • Predictions API – Make predictions API calls to the models that are served with TorchServe.
    Call example: curl -X POST http://localhost:8080/predictions/{model_name} -T {input_data}
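The three calls above can also be issued from Python. The following is a minimal sketch that only builds the requests with the standard library; the helper names (describe_api_request, ping_request, prediction_request) are hypothetical, and the address is the one used in the curl examples.

```python
import urllib.request

# Inference address taken from the curl examples in this article.
INFERENCE_URL = "http://localhost:8080"


def describe_api_request(base_url=INFERENCE_URL):
    """Build the OPTIONS request that asks for the OpenAPI description."""
    return urllib.request.Request(base_url, method="OPTIONS")


def ping_request(base_url=INFERENCE_URL):
    """Build the GET request for the health check endpoint."""
    return urllib.request.Request(f"{base_url}/ping", method="GET")


def prediction_request(model_name, payload, base_url=INFERENCE_URL):
    """Build the POST request that sends raw input bytes to a served model."""
    return urllib.request.Request(
        f"{base_url}/predictions/{model_name}", data=payload, method="POST"
    )


# Sending a request only works while TorchServe is running, for example:
#   with urllib.request.urlopen(ping_request(), timeout=5) as response:
#       print(response.read().decode("utf-8"))
```

Keeping request construction separate from sending makes the calls easy to inspect or test without a running server.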

To provide these APIs, TorchServe is composed of several parts. From a high-level perspective, it works like this:

All API requests are routed through the so-called Frontend. Apart from handling all requests and responses coming from the client, this component of TorchServe is in charge of the model’s lifecycle. The instances of the models are hosted by Model Workers. These components run the actual inference for each model. All loadable models are stored within a directory (cloud or local), the Model Store. In a nutshell, once you run TorchServe (we will see how that is done in a bit), you load the models that are available in the Model Store. Then you can start each of these models, which creates a new Model Worker instance that exposes the interface to that specific model. After that, you can send API calls that are handled by the Frontend and routed to the correct Model Worker.
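To make the roles of these components more concrete, here is a toy sketch of the routing idea in plain Python. It is not how TorchServe is implemented; the class names simply mirror the component names above, and the "models" are hypothetical stand-in functions.

```python
class ModelWorker:
    """Hosts one model instance and runs inference on it."""

    def __init__(self, model_name, predict_fn):
        self.model_name = model_name
        self.predict_fn = predict_fn

    def handle(self, payload):
        return self.predict_fn(payload)


class Frontend:
    """Receives requests, manages model lifecycle and routes to workers."""

    def __init__(self, model_store):
        self.model_store = model_store  # name -> loadable model (a callable here)
        self.workers = {}               # name -> running ModelWorker

    def start_model(self, model_name):
        # Load the model from the store and create a worker for it.
        predict_fn = self.model_store[model_name]
        self.workers[model_name] = ModelWorker(model_name, predict_fn)

    def predict(self, model_name, payload):
        # Route the request to the worker that hosts the requested model.
        if model_name not in self.workers:
            raise KeyError(f"model '{model_name}' is not started")
        return self.workers[model_name].handle(payload)


# Hypothetical stand-in models: plain functions instead of neural networks.
model_store = {"doubler": lambda x: x * 2, "negator": lambda x: -x}
frontend = Frontend(model_store)
frontend.start_model("doubler")
result = frontend.predict("doubler", 21)
print(result)
```

Note that "negator" is available in the store but cannot serve requests until it has been started, which matches the load-then-start flow described above.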


Installation

At the moment, TorchServe supports only Linux and macOS. If you are a Windows user, you can still use TorchServe with Docker. TorchServe also requires the Java 11 JDK, so make sure that you install that first. On Linux, run this command:

sudo apt-get install openjdk-11-jdk

For MacOS run this:

brew tap AdoptOpenJDK/openjdk
brew cask install adoptopenjdk11

Then you can install TorchServe with either pip:

pip install torch torchtext torchvision sentencepiece psutil future
pip install torchserve torch-model-archiver

Or with Conda:

conda create --name torchserve torchserve torch-model-archiver psutil \
future pytorch torchtext torchvision -c pytorch -c powerai

If you want to install the GPU version, use this:

conda create --name torchserve torchserve torch-model-archiver psutil \
future pytorch torchtext torchvision cudatoolkit=10.1 -c pytorch -c powerai

Note that these commands will install TorchServe and Torch Model Archiver.
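After installation, a quick sanity check is to confirm that the required executables are on your PATH. The sketch below assumes the command names java, torchserve and torch-model-archiver; the helper find_missing_tools is a hypothetical name introduced here.

```python
import shutil

# Executables this article relies on (assumed command names).
REQUIRED_TOOLS = ("java", "torchserve", "torch-model-archiver")


def find_missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


missing = find_missing_tools()
if missing:
    print("Missing from PATH:", ", ".join(missing))
else:
    print("All required tools were found.")
```

If you installed with Conda, remember to activate the torchserve environment first, otherwise the commands will be reported as missing.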

Saving Trained Model

Before you can serve a model with TorchServe, you first need to save it. Let’s do that with the feedforward neural network model we created in the previous article. Here is what that model looks like:

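import torch
import torch.nn as nn
import torch.nn.functional as F


class FFNN(nn.Module):
    """Simple Feed Forward Neural Network with n hidden layers"""
    def __init__(self, input_size, num_hidden_layers, hidden_size, out_size, accuracy_function):
        # Initialize nn.Module before registering any layers
        super().__init__()
        self.accuracy_function = accuracy_function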
        # Create first hidden layer
        self.input_layer = nn.Linear(input_size, hidden_size)
        # Create remaining hidden layers
        self.hidden_layers = nn.ModuleList()
        for i in range(0, num_hidden_layers):
            self.hidden_layers.append(nn.Linear(hidden_size, hidden_size))
        # Create output layer
        self.output_layer = nn.Linear(hidden_size, out_size)
    def forward(self, input_image):
        # Flatten image
        input_image = input_image.view(input_image.size(0), -1)
        # Utilize hidden layers and apply activation function
        output = self.input_layer(input_image)
        output = F.relu(output)
        for layer in self.hidden_layers:
            output = layer(output)
            output = F.relu(output)
        # Get predictions
        output = self.output_layer(output)
        return output
    def training_step(self, batch):
        # Load batch
        images, labels = batch
        # Generate predictions
        output = self(images) 
        # Calculate loss
        loss = F.cross_entropy(output, labels)
        return loss
    def validation_step(self, batch):
        # Load batch
        images, labels = batch 

        # Generate predictions
        output = self(images) 
        # Calculate loss
        loss = F.cross_entropy(output, labels)

        # Calculate accuracy
        acc = self.accuracy_function(output, labels)
        return {'val_loss': loss, 'val_acc': acc}
    def validation_epoch_end(self, outputs):
        batch_losses = [x['val_loss'] for x in outputs]
        # Combine losses and return mean value
        epoch_loss = torch.stack(batch_losses).mean()
        # Combine accuracies and return mean value
        batch_accs = [x['val_acc'] for x in outputs]
        epoch_acc = torch.stack(batch_accs).mean()
        return {'val_loss': epoch_loss.item(), 'val_acc': epoch_acc.item()}
    def epoch_end(self, epoch, result):
        print("Epoch: {} - Validation Loss: {:.4f}, Validation Accuracy: {:.4f}".format(
            epoch, result['val_loss'], result['val_acc']))

After the training process (see the previous article for details), we can save the model using the torch.save() function and the model’s state dictionary. When you train a model using PyTorch, all its weights and biases are stored as parameters of torch.nn.Module, which you can access with model.parameters(). The state dictionary is a Python dictionary that maps each layer of the model to its parameter tensor. You can access it by calling the model’s state_dict() method. Note that optimizer objects from torch.optim have a state_dict as well, but it contains information about the optimizer state and the hyperparameters used. Here is how you can use state_dict to save the model:

torch.save(model.state_dict(), os.path.join('./models', 'ffnn.pth'))
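To see the full round trip of saving and reloading a state dictionary, here is a minimal sketch. It uses a small stand-in nn.Sequential model instead of the FFNN class, and a temporary folder instead of ./models; both are assumptions made only to keep the example self-contained.

```python
import os
import tempfile

import torch
import torch.nn as nn


def make_model():
    # Stand-in architecture (hypothetical); the FFNN class is handled the same way.
    return nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))


model = make_model()
model.eval()

# The state dictionary maps parameter names to tensors.
for name, tensor in model.state_dict().items():
    print(name, tuple(tensor.shape))

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "ffnn.pth")
    torch.save(model.state_dict(), path)

    # Loading requires an instance with the same architecture.
    restored = make_model()
    restored.load_state_dict(torch.load(path))
    restored.eval()

# Both models should now produce identical predictions.
sample = torch.randn(1, 4)
with torch.no_grad():
    same_output = torch.equal(model(sample), restored(sample))
print(same_output)
```

The important detail is that the file holds only the parameters, so whoever loads it needs the code that defines the model's architecture as well.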

Quite easy, isn’t it? The model is written to the path you passed as the second parameter, so you will find the ffnn.pth file in the ./models folder.

Serving Model

Now we have all the necessary pieces for serving models with TorchServe. Let’s run the Model Archiver and add the model to the Model Store:

torch-model-archiver --model-name ffnn --version 1.0 --serialized-file ./models/ffnn.pth \
--export-path ./model_store --handler image_classifier 

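With this command, we packaged the ffnn model we created in the previous section into an archive and gave it a name and a version. Note that for a model saved as a state dictionary, the archiver typically also needs the Python file that defines the model class, which is passed through its --model-file option. Make sure that you have created a folder for the Model Store before calling this command (in our example, that is the ./model_store location). After this call, you will find the ffnn.mar file in ./model_store. Finally, we can start TorchServe: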

torchserve --start --ncs --model-store model_store --models ffnn.mar 
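If you prefer to script these two steps, the sketch below wraps them in Python. It uses only the flags that appear in the commands above and creates the Model Store folder first; the function names (build_archiver_command, build_start_command, archive_and_start) are hypothetical helpers, not part of TorchServe.

```python
import os
import subprocess


def build_archiver_command(model_name, version, serialized_file, export_path, handler):
    """Build the torch-model-archiver call with the flags used in this article."""
    return [
        "torch-model-archiver",
        "--model-name", model_name,
        "--version", version,
        "--serialized-file", serialized_file,
        "--export-path", export_path,
        "--handler", handler,
    ]


def build_start_command(model_store, mar_file):
    """Build the torchserve start call with the flags used in this article."""
    return ["torchserve", "--start", "--ncs",
            "--model-store", model_store, "--models", mar_file]


def archive_and_start(model_store="./model_store", run=subprocess.run):
    """Create the Model Store folder, archive the model, then start TorchServe."""
    os.makedirs(model_store, exist_ok=True)
    run(build_archiver_command("ffnn", "1.0", "./models/ffnn.pth",
                               model_store, "image_classifier"), check=True)
    run(build_start_command(model_store, "ffnn.mar"), check=True)


# Call archive_and_start() only on a machine where TorchServe is installed.
```

Passing the commands as lists avoids shell quoting issues, and check=True makes the script stop if the archiver fails instead of starting TorchServe without a model.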

Our model is now available on localhost port 8080, and we can use the APIs we covered previously. For example:

curl -X POST http://localhost:8080/predictions/ffnn -T data/sample_image.png

You can also retrieve the list of available models through the management API on port 8081:

curl http://localhost:8081/models
{
  "models": [
    {
      "modelName": "ffnn",
      "modelUrl": "ffnn.mar"
    }
  ]
}

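A client that consumes the model list usually only needs the model names. Here is a small sketch that parses a response of the shape shown in this article; the sample text is hard-coded for illustration and list_model_names is a hypothetical helper.

```python
import json

# Response body in the shape TorchServe returned for our single model.
SAMPLE_RESPONSE = """
{
  "models": [
    {
      "modelName": "ffnn",
      "modelUrl": "ffnn.mar"
    }
  ]
}
"""


def list_model_names(response_text):
    """Extract model names from a model list response."""
    body = json.loads(response_text)
    return [entry["modelName"] for entry in body.get("models", [])]


names = list_model_names(SAMPLE_RESPONSE)
print(names)
```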
If you want to stop TorchServe, all you need to do is call the command:

torchserve --stop

Another useful feature is that TorchServe exposes configuration options, which let you set, for example, the number of worker threads on CPU and GPU. This can be very useful if your server is under a heavy workload. For example, you might want to use number_of_gpu, which limits the number of GPUs that TorchServe can use for inference.
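TorchServe reads such options from a properties file with one key=value pair per line. The sketch below renders that format from a dictionary. Only number_of_gpu comes from this article; default_workers_per_model is an option name assumed from the TorchServe documentation, so verify it against the version you run.

```python
def render_config(settings):
    """Render a dict as key=value lines, one option per line."""
    return "\n".join(f"{key}={value}" for key, value in settings.items()) + "\n"


settings = {
    "number_of_gpu": 1,              # option mentioned in this article
    "default_workers_per_model": 2,  # assumed option name, check the docs
}

config_text = render_config(settings)
print(config_text)
```

You would then write config_text to a file such as config.properties and point TorchServe at it when starting the server. To the best of my knowledge this is done with the --ts-config option, but treat that as an assumption and confirm it with torchserve --help.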

TorchServe and Docker

Another option for serving PyTorch models with TorchServe is to use it in combination with Docker. As with many other tools, a TorchServe image is available on Docker Hub, so you can pull it from there. To start the CPU-based image, run this command:

docker run --rm -it -p 8080:8080 -p 8081:8081 pytorch/torchserve:latest-cpu

Similarly for the GPU based image run:

docker run --rm -it --gpus '"device=1,2"' -p 8080:8080 -p 8081:8081 pytorch/torchserve:latest-gpu

However, if you want to create a .mar file, you need to take some additional steps. After the Docker container is started, get the name of the container:

docker ps

Connect to its bash prompt:

docker exec -it <container_name> /bin/bash

Finally, run Model Archiver:

torch-model-archiver --model-name ffnn --version 1.0 --serialized-file /home/models/ffnn.pth \
--export-path /home/model-server/model-store --handler image_classifier

Don’t forget to take care of production parameters when you deploy TorchServe in production with Docker. For example, you might want to use something like this:

docker run --rm --shm-size=2g \
        --ulimit memlock=-1 \
        --ulimit stack=67108864 \
        -p8080:8080 \
        -p8081:8081 \
        --mount type=bind,source=path_to_model_store,target=/tmp/models <container> \
          torchserve --model-store=/tmp/models 

This way you can set up shared memory size, user limits for system resources and expose ports, and avoid potential problems on your server.
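Once a container is started, the server may need a few seconds before it accepts requests, so deployment scripts often poll the health check API first. The following is a minimal sketch of such a wait loop. It assumes the /ping endpoint answers with a JSON body whose status field is "Healthy", and the function names are hypothetical.

```python
import json
import time


def is_healthy(response_text):
    """Return True when a /ping response body reports a healthy server."""
    try:
        body = json.loads(response_text)
    except ValueError:
        return False
    # Assumed response shape: {"status": "Healthy"}
    return isinstance(body, dict) and body.get("status") == "Healthy"


def wait_until_healthy(fetch_ping, attempts=10, delay_seconds=1.0, sleep=time.sleep):
    """Call fetch_ping until it reports healthy or attempts run out."""
    for attempt in range(attempts):
        try:
            if is_healthy(fetch_ping()):
                return True
        except OSError:
            pass  # server is not accepting connections yet
        if attempt < attempts - 1:
            sleep(delay_seconds)
    return False
```

Here fetch_ping is any function that returns the response body as text, for example one that calls http://localhost:8080/ping with urllib.request.urlopen. Connection errors raised by urllib are subclasses of OSError, so they are treated as "not ready yet".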


In this article, we covered the basics of deployment with PyTorch and TorchServe. We explored the architecture and main components of TorchServe, and we saw how to prepare models for serving with this tool. Finally, we combined TorchServe with Docker.

Thanks for reading!

Nikola M. Zivkovic


CAIO at Rubik's Code

Nikola M. Zivkovic is the CAIO at Rubik’s Code and the author of the book “Deep Learning for Programmers”. He loves knowledge sharing, and he is an experienced speaker. You can find him speaking at meetups and conferences, and as a guest lecturer at the University of Novi Sad.

Rubik’s Code is a boutique data science and software service company with more than 10 years of experience in Machine Learning, Artificial Intelligence & Software development. Check out the services we provide.