Fleet Command Deploy: NVIDIA Triton Inference Server

A helm chart for installing a single cluster of NVIDIA Triton Inference Server on Fleet Command is provided. By default the cluster contains a single instance of Triton, but the replicaCount configuration parameter can be set to create a cluster of any size, as described below.

This guide assumes you already have a functional Fleet Command location deployed. Please refer to the Fleet Command documentation.

The steps below describe how to set up a model repository, use helm to launch Triton, and then send inference requests to the running Triton Inference Server. You can optionally scrape metrics with Prometheus and access a Grafana endpoint to see real-time metrics reported by Triton.

Model Repository

If you already have a model repository you may use that with this helm chart. If you do not have a model repository, you can check out a local copy of the Triton Inference Server source repository to create an example model repository:

$ git clone https://github.com/triton-inference-server/server.git

Triton needs a repository of models that it will make available for inferencing. For this example you will place the model repository in an S3 storage bucket (either in AWS or another S3 API-compatible on-premises object store).

$ aws s3 mb s3://triton-inference-server-repository

Following the QuickStart, download the example model repository to your system (a sketch is shown below) and copy it into the AWS S3 bucket.
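A minimal sketch of the download step, assuming the fetch_models.sh helper script in docs/examples of the cloned repository that the QuickStart uses to populate the example model repository:

$ cd server/docs/examples
$ ./fetch_models.sh
$ cd ../..

The final cd returns to the root of the cloned repository so that the copy command below can reference docs/examples/model_repository.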

$ aws s3 cp --recursive docs/examples/model_repository s3://triton-inference-server-repository/model_repository
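To confirm the upload, you can list the bucket contents (same example bucket name as above):

$ aws s3 ls s3://triton-inference-server-repository/model_repository/ --recursive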

AWS Model Repository

To load models from AWS S3, you need to encode the following AWS credentials in base64 format and add them to the Application Configuration section when creating the Fleet Command Deployment.

echo -n 'REGION' | base64
echo -n 'ACCESS_KEY_ID' | base64
echo -n 'SECRET_ACCESS_KEY' | base64
# Optional for using session token
echo -n 'AWS_SESSION_TOKEN' | base64
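For example, encoding a placeholder region value (us-west-2 here is purely illustrative):

echo -n 'us-west-2' | base64   # prints dXMtd2VzdC0y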

Deploy the Triton Inference Server

Deploy the Triton Inference Server to your Location in Fleet Command by creating a Deployment. You can specify configuration parameters to override the default values.yaml in the Application Configuration section.

Note: You must provide a --model-repository parameter with the path to your prepared model repository in your S3 bucket. Otherwise, Triton will not start.

An example Application Configuration for Triton on Fleet Command:

image:
  serverArgs:
    - --model-repository=s3://triton-inference-server-repository

secret:
  region: <region in base64>
  id: <access key id in base64>
  key: <secret access key in base64>
  token: <session token in base64 (optional)>

See the Fleet Command documentation for more information.

Prometheus ServiceMonitor Support

If you have prometheus-operator deployed, you can enable the ServiceMonitor for the Triton Inference Server by setting serviceMonitor.enabled: true in Application Configuration. This will also deploy a Grafana dashboard for Triton as a ConfigMap.

Otherwise, metrics can be scraped by pointing an external Prometheus instance at the metricsNodePort in the values.
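For example, once Triton is running you can fetch the raw Prometheus metrics with curl from any system in the Location (assuming the example IP used below and the default metrics port 30345):

$ curl 34.83.9.133:30345/metrics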

Using Triton Inference Server

Now that the Triton Inference Server is running you can send HTTP or GRPC requests to it to perform inferencing. By default, the service is exposed with a NodePort service type, where the same port is opened on all systems in a Location.

Triton exposes an HTTP endpoint on port 30343, a GRPC endpoint on port 30344, and a Prometheus metrics endpoint on port 30345. These ports can be overridden in the Application Configuration when deploying. You can use curl to get the metadata of Triton from the HTTP endpoint. For example, if a system in your Location has the IP 34.83.9.133:

$ curl 34.83.9.133:30343/v2
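You can also confirm that the server and its models are ready before sending inference requests; Triton's HTTP API provides a readiness endpoint that returns a 200 status when the server is ready:

$ curl -v 34.83.9.133:30343/v2/health/ready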

Follow the QuickStart to get the example image classification client, which can be used to perform inferencing using image classification models served by Triton. For example:

$ image_client -u 34.83.9.133:30343 -m densenet_onnx -s INCEPTION -c 3 mug.jpg
Request 0, batch size 1
Image '/workspace/images/mug.jpg':
    15.349568 (504) = COFFEE MUG
    13.227468 (968) = CUP
    10.424893 (505) = COFFEEPOT
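The same client can also be pointed at the GRPC endpoint. A sketch assuming the image_client's -i protocol flag from the QuickStart, using the default GRPC port described above:

$ image_client -i grpc -u 34.83.9.133:30344 -m densenet_onnx -s INCEPTION -c 3 mug.jpg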