A Helm chart is provided for installing a single cluster of NVIDIA Triton Inference Server on Fleet Command. By default the cluster contains a single Triton instance, but you can set the replicaCount configuration parameter to create a cluster of any size, as described below.
This guide assumes you already have a functional Fleet Command location deployed. If you do not, refer to the Fleet Command documentation.
The steps below describe how to set up a model repository, use Helm to launch Triton, and then send inference requests to the running Triton Inference Server. You can optionally scrape metrics with Prometheus and access a Grafana endpoint to see real-time metrics reported by Triton.
If you already have a model repository, you can use it with this Helm chart. If you do not, you can check out a local copy of the Triton Inference Server source repository to create an example model repository:
$ git clone https://github.com/triton-inference-server/server.git
Triton needs a repository of models that it will make available for inferencing. For this example, you will place the model repository in an S3 storage bucket (either in AWS or in another S3 API-compatible on-premises object store).
$ aws s3 mb s3://triton-inference-server-repository
Following the QuickStart, download the example model repository to your system and copy it into the AWS S3 bucket.
$ aws s3 cp --recursive docs/examples/model_repository s3://triton-inference-server-repository/model_repository
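Before uploading, it can help to confirm that the local repository has the layout Triton expects. The sketch below assumes the usual convention of one directory per model, each containing at least one numerically named version subdirectory. The function name `check_model_repo` is a hypothetical helper, not part of Triton or the AWS CLI.

```shell
# check_model_repo DIR
# Hypothetical helper: prints every model directory under DIR that has no
# numerically named version subdirectory, and returns non-zero if any is found.
check_model_repo() {
  repo="$1"
  status=0
  for model in "$repo"/*/; do
    [ -d "$model" ] || continue
    found=0
    for version in "$model"*/; do
      [ -d "$version" ] || continue
      name=$(basename "$version")
      case "$name" in
        ''|*[!0-9]*) ;;   # not purely numeric, so not a version directory
        *) found=1 ;;
      esac
    done
    if [ "$found" -eq 0 ]; then
      echo "missing version directory: $(basename "$model")"
      status=1
    fi
  done
  return "$status"
}
```

For example, `check_model_repo docs/examples/model_repository` prints nothing and returns 0 when every model directory contains a version subdirectory.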
To load the model from AWS S3, convert the following AWS credentials to base64 format and add them to the Application Configuration section when creating the Fleet Command Deployment.
echo -n 'REGION' | base64
echo -n 'SECRET_KEY_ID' | base64
echo -n 'SECRET_ACCESS_KEY' | base64
# Optional for using session token
echo -n 'AWS_SESSION_TOKEN' | base64
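The encoding step can also be wrapped in a small script. This is a minimal sketch: `b64` and `print_secret_block` are hypothetical helper names, and the `region`, `id`, and `key` fields mirror the `secret` section of the example Application Configuration in this guide. Piping through `tr -d '\n'` removes the line wrapping that some `base64` implementations add to long values such as session tokens.

```shell
# b64 VALUE
# Base64-encode VALUE with no trailing newline and no line wrapping.
b64() {
  printf '%s' "$1" | base64 | tr -d '\n'
}

# print_secret_block REGION KEY_ID ACCESS_KEY
# Hypothetical helper: emit a `secret:` block with the encoded credentials.
print_secret_block() {
  printf 'secret:\n'
  printf '  region: %s\n' "$(b64 "$1")"
  printf '  id: %s\n' "$(b64 "$2")"
  printf '  key: %s\n' "$(b64 "$3")"
}
```

Calling `print_secret_block 'REGION' 'SECRET_KEY_ID' 'SECRET_ACCESS_KEY'` with your real values prints a block that can be pasted into the Application Configuration.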
Deploy the Triton Inference Server to your Location in Fleet Command by creating a Deployment. You can specify configuration parameters in the Application Configuration section to override the defaults in values.yaml.
Note: You must provide a --model-repository parameter with a path to your prepared model repository in your S3 bucket. Otherwise, Triton will not start.
An example Application Configuration for Triton on Fleet Command:
image:
  serverArgs:
    - --model-repository=s3://triton-inference-server-repository
secret:
  region: <region in base64>
  id: <access id in base64>
  key: <access key in base64>
  token: <session token in base64 (optional)>
See the Fleet Command documentation for more information.
If you have prometheus-operator deployed, you can enable the ServiceMonitor for the Triton Inference Server by setting serviceMonitor.enabled: true in Application Configuration. This will also deploy a Grafana dashboard for Triton as a ConfigMap.
Otherwise, metrics can be scraped by pointing an external Prometheus instance at the metricsNodePort in the values.
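To spot-check the metrics endpoint by hand, you can fetch it with curl and filter the Prometheus text output. The sketch below assumes the default metrics port 30345 and the example node IP used in this guide. `metric_samples` is a hypothetical helper, and `nv_inference_request_success` is used as an example metric name.

```shell
# metric_samples PREFIX
# Hypothetical helper: read Prometheus text format on stdin and print only the
# sample lines whose metric name starts with PREFIX, skipping # HELP / # TYPE.
metric_samples() {
  grep -v '^#' | grep "^$1"
}

# Example usage (requires a running Triton, so it is left commented out):
#   curl -s 34.83.9.133:30345/metrics | metric_samples nv_inference_request_success
```

The function returns a non-zero status when no sample matches, which makes it usable in a simple shell condition.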
Now that the Triton Inference Server is running, you can send HTTP or gRPC requests to it to perform inferencing. By default, the service is exposed with a NodePort service type, where the same port is opened on all systems in a Location.
Triton exposes an HTTP endpoint on port 30343, a gRPC endpoint on port 30344, and a Prometheus metrics endpoint on port 30345. These ports can be overridden in the application configuration when deploying. You can use curl to get the metadata of Triton from the HTTP endpoint. For example, if a system in your location has the IP 34.83.9.133:
$ curl 34.83.9.133:30343/v2
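Because the server can take a while to load models after the Deployment is created, a readiness check is often useful before sending requests. The sketch below polls the `/v2/health/ready` path of the HTTP endpoint. `triton_url` and `wait_for_ready` are hypothetical helper names, and the retry count and two-second delay are arbitrary assumptions.

```shell
# triton_url HOST PORT [PATH]
# Build a URL under the /v2 prefix of the HTTP endpoint.
triton_url() {
  printf 'http://%s:%s/v2%s\n' "$1" "$2" "${3:-}"
}

# wait_for_ready HOST PORT [TRIES]
# Hypothetical helper: poll the readiness path until it answers with success,
# giving up after TRIES attempts (default 30, two seconds apart).
wait_for_ready() {
  url=$(triton_url "$1" "$2" /health/ready)
  tries="${3:-30}"
  while [ "$tries" -gt 0 ]; do
    curl -sf -o /dev/null "$url" && return 0
    tries=$((tries - 1))
    sleep 2
  done
  return 1
}
```

For example, `wait_for_ready 34.83.9.133 30343 && curl "$(triton_url 34.83.9.133 30343 /models/densenet_onnx)"` waits for the server and then requests the metadata of the example model.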
Follow the QuickStart to get the example image classification client, which can be used to perform inferencing with the image classification models served by Triton. For example:
$ image_client -u 34.83.9.133:30343 -m densenet_onnx -s INCEPTION -c 3 mug.jpg
Request 0, batch size 1
Image '/workspace/images/mug.jpg':
15.349568 (504) = COFFEE MUG
13.227468 (968) = CUP
10.424893 (505) = COFFEEPOT