
Detect.py supports running against a Triton container #9228

Merged
44 commits merged into ultralytics:master on Sep 23, 2022

Conversation

@gaziqbal (Contributor) commented Aug 30, 2022

This PR enables detect.py to use a Triton server for inference. The Triton Inference Server (https://github.com/triton-inference-server/server) is open-source inference serving software that streamlines AI inferencing.

The user can now provide a "--triton-url" argument to detect.py to use a local or remote Triton server for inference.
For example, http:https://localhost:8000 will use HTTP over port 8000 and grpc:https://localhost:8001 will use gRPC over port 8001.
Note that it is not necessary to specify a weights file to detect.py when using Triton for inference.

A Triton container can be created by first exporting the YOLOv5 model to a Triton-supported runtime. ONNX, TorchScript, and TensorRT are supported by both Triton and the export.py script.

The exported model can then be containerized via the OctoML CLI.
See https://github.com/octoml/octo-cli#getting-started for a guide.

python export.py --include onnx # exports the default yolov5s model as yolov5s.onnx
mkdir octoml && cd octoml && mv ../yolov5s.onnx . # create an octoml folder and move the onnx model into it
octoml init && octoml package && octoml deploy
python ../detect.py --triton-url http:https://localhost:8000
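
For reference, below is a minimal sketch (not the exact code merged in this PR) of how the scheme in --triton-url can select between Triton's HTTP and gRPC clients. It assumes tritonclient is installed (pip install tritonclient[all]) and that a server from the steps above is reachable; the helper name make_triton_client is hypothetical.

# Minimal sketch: pick an HTTP or gRPC Triton client based on the URL scheme.
# Hypothetical helper for illustration only; assumes `pip install tritonclient[all]`.
from urllib.parse import urlparse

def make_triton_client(url: str):
    parsed = urlparse(url)  # e.g. http:https://localhost:8000 or grpc:https://localhost:8001
    if parsed.scheme == "http":
        import tritonclient.http as client_lib
    elif parsed.scheme == "grpc":
        import tritonclient.grpc as client_lib
    else:
        raise ValueError(f"Unsupported Triton scheme: {parsed.scheme!r}")
    return client_lib.InferenceServerClient(url=parsed.netloc)  # host:port, scheme stripped

client = make_triton_client("http:https://localhost:8000")
print(client.is_server_live())  # True once the container deployed above is up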

πŸ› οΈ PR Summary

Made with ❀️ by Ultralytics Actions

🌟 Summary

Enhancements to PyTorch model handling, tighter integration with the NVIDIA Triton Inference Server, and improvements to tensor device handling.

πŸ“Š Key Changes

  • Adjusted tensor device assignment to use the model's device attribute directly (sketched after this summary).
  • Added support for NVIDIA Triton Inference Server URLs as model paths.
  • Implemented loading and inference logic for models served by Triton.
  • Ensured correct tensor format conversions for different model types.
  • Updated warmup function to include Triton models.

🎯 Purpose & Impact

  • πŸŽ›οΈ Provides more intuitive device handling for tensor operations, reducing potential for device mismatch errors.
  • πŸš€ Extends the model serving capabilities by integrating support for NVIDIA Triton Inference Server, enabling efficient deployment and scaling of AI models.
  • ☁️ Facilitates remote model inference, allowing users to leverage models hosted on servers without needing to run them locally.
  • πŸ”§ Enhances model compatibility across different formats, promoting a smoother user experience in applying various models.
  • 🌑️ Improves warmup process by including remote Triton models, ensuring they are ready for efficient inference.

The PR potentially impacts developers looking for streamlined deployment and broadened inference capabilities, as well as users who want to access advanced model-serving features easily.
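
To make the first key change concrete, here is a hedged, self-contained sketch of device handling driven by the model itself; the Conv2d layer is only a stand-in for the YOLOv5 backend, whose device attribute plays the same role in detect.py.

# Illustration only: send the input tensor to whatever device the model lives on,
# instead of tracking a separate device variable.
import torch

model = torch.nn.Conv2d(3, 8, 3)             # stand-in for the YOLOv5 backend object
device = next(model.parameters()).device     # in detect.py this is simply model.device
im = torch.zeros(1, 3, 640, 640).to(device)  # the input tensor follows the model's device
print(model(im).shape)                       # torch.Size([1, 8, 638, 638])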

glenn-jocher and others added 6 commits August 24, 2022 23:29
@gaziqbal (Contributor, Author)

@glenn-jocher, @AyushExel - here is a PR against the yolov5 repo.

@github-actions bot left a comment


πŸ‘‹ Hello @gaziqbal, thank you for submitting a YOLOv5 πŸš€ PR! To allow your work to be integrated as seamlessly as possible, we advise you to:

  • βœ… Verify your PR is up-to-date with ultralytics/yolov5 master branch. If your PR is behind you can update your code by clicking the 'Update branch' button or by running git pull and git merge master locally.
  • βœ… Verify all YOLOv5 Continuous Integration (CI) checks are passing.
  • βœ… Reduce changes to the absolute minimum required for your bug fix or feature addition. "It is not daily increase but daily decrease, hack away the unessential. The closer to the source, the less wastage there is." β€” Bruce Lee

@gaziqbal (Contributor, Author) commented Sep 7, 2022

@glenn-jocher, @AyushExel - here is a PR against the yolov5 repo.

Please let me know if you need anything more here.

@glenn-jocher (Member)

@gaziqbal thanks, we should be reviewing this soon, no changes required ATM

@glenn-jocher self-assigned this Sep 18, 2022
@glenn-jocher added the enhancement (New feature or request) and TODO labels Sep 18, 2022
@glenn-jocher (Member) commented Sep 21, 2022

@gaziqbal thanks for your patience.

I think I'm going to try to refactor this so that Triton backends are not treated differently. New users tend to introduce more code than a feature requires because they treat it as special compared to existing features, but with 12 different inference types all using a single --weights argument I'd rather not introduce additional command-line and function arguments for one more.

Just like --source and --weights are multi-purpose, I think we can extend them to Triton inference as well. I'll see what I can do here today.

@glenn-jocher marked this pull request as ready for review September 21, 2022 17:10
@glenn-jocher (Member)

@gaziqbal pinging you to see if you could re-test after my updates (I hope I didn't break anything)!

@gaziqbal (Contributor, Author)

@glenn-jocher - the Triton server detection broke because it was using the Path.name property for matching, which strips out any http:https:// or grpc:https:// prefix. I also needed to change the Triton server class to query the model name, because the weights parameter is now being used for the URL. Can you please take a look again? I have verified HTTP and gRPC on my end.
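
For context, a small sketch of the pitfall described above and a scheme-based check that avoids it; the helper name looks_like_triton_url is hypothetical, not the code in this PR.

# Path.name discards the URL scheme, so name-based matching cannot tell a Triton URL
# from a local file; checking the scheme of the raw string works.
from pathlib import Path
from urllib.parse import urlsplit

print(Path("grpc:https://localhost:8001").name)  # "localhost:8001" -> the grpc:https:// prefix is gone

def looks_like_triton_url(weights: str) -> bool:  # hypothetical helper for illustration
    return urlsplit(str(weights)).scheme in ("http", "grpc")

print(looks_like_triton_url("grpc:https://localhost:8001"))  # True
print(looks_like_triton_url("yolov5s.pt"))               # False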

@glenn-jocher (Member)

@gaziqbal understood. Is there a public server URL I could temporarily use for debugging? I see an error from Vanessa that I'm working on now.

@glenn-jocher (Member) commented Sep 23, 2022

@gaziqbal I took a look, everything looks good to merge over here. Do your updates fix Vanessa's issue?

@glenn-jocher removed the TODO label Sep 23, 2022
@glenn-jocher merged commit d669a74 into ultralytics:master Sep 23, 2022
@glenn-jocher (Member)

@gaziqbal PR is merged. Thank you for your contributions to YOLOv5 πŸš€ and Vision AI ⭐

@kingkong135
@gaziqbal @glenn-jocher I tried this, but when Triton is serving a series of models, the code defaults to the first model rather than the one named "yolov5". I think a model_name parameter should be added to TritonRemoteModel.

@gaziqbal (Contributor, Author) commented Oct 4, 2022

Good point. That's fairly straightforward to do for TritonRemoteModel. Are you invoking it via detect.py? If so, we'll need a way to relay that.

@kingkong135

I'm thinking there are two ways: one is to add a new model_name parameter, but that's a bit redundant; the other is to append the model name to "weights", like "grpc:https://localhost:8001/yolov5", and have TritonRemoteModel handle it.

@gaziqbal (Contributor, Author) commented Oct 4, 2022

My concern with the latter is that it would be a contrived URI scheme that does not match canonical Triton URIs, which may be confusing. That said, the approach is worth exploring further.

@glenn-jocher (Member)

Stupid question here. Could we use the URL query-string structure for passing variables, i.e. something like this to pass more arguments to the Triton server?

grpc:https://localhost:8001/?model=yolov5s.pt&conf=0.25&imgsz=640
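
A hedged sketch of that idea, showing only the parsing side; the parameter names come from the example URL above and none of this is implemented in the PR.

# Parse a Triton URL with query-string arguments into a server address plus settings.
from urllib.parse import urlparse, parse_qs

url = "grpc:https://localhost:8001/?model=yolov5s.pt&conf=0.25&imgsz=640"
parsed = urlparse(url)
params = {k: v[0] for k, v in parse_qs(parsed.query).items()}

server = f"{parsed.scheme}:https://{parsed.netloc}"  # grpc:https://localhost:8001, handed to the Triton client
model_name = params.get("model", "yolov5")    # fall back to the current single-model behaviour
conf = float(params.get("conf", 0.25))
imgsz = int(params.get("imgsz", 640))
print(server, model_name, conf, imgsz)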

@ArgoHA commented Jan 8, 2023

Hi! Where can I find any info on how exactly Triton should be configured to work with this solution? I have used Triton with a custom client before, but when I tried to use my Triton backend with detect.py I got this error:
tritonclient.utils.InferenceServerException: got unexpected numpy array shape [1, 3, 640, 640], expected [-1, 3, 640, 640]

Here is my config:

name: "yolov5"
platform: "tensorrt_plan"
max_batch_size: 1
input [
  {
    name: "images"
    data_type: TYPE_FP32
    dims: [ 3, 640, 640 ]
  }
]
output [
  {
    name: "output0"
    data_type: TYPE_FP32
    dims: [ 25200, 85 ]
  }
]

@fabito (Contributor) commented Jan 18, 2023

@ArgoHA, I am having the same problem here. Were you able to solve it?

Traceback (most recent call last):
  File "detect.py", line 259, in <module>
    main(opt)
  File "detect.py", line 254, in main
    run(**vars(opt))
  File "/opt/conda/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "detect.py", line 113, in run
    model.warmup(imgsz=(1 if pt or model.triton else bs, 3, *imgsz))  # warmup
  File "/usr/src/app/models/common.py", line 597, in warmup
    self.forward(im)  # warmup
  File "/usr/src/app/models/common.py", line 558, in forward
    y = self.model(im)
  File "/usr/src/app/utils/triton.py", line 60, in __call__
    inputs = self._create_inputs(*args, **kwargs)
  File "/usr/src/app/utils/triton.py", line 80, in _create_inputs
    input.set_data_from_numpy(value.cpu().numpy())
  File "/opt/conda/lib/python3.8/site-packages/tritonclient/grpc/__init__.py", line 1831, in set_data_from_numpy
    raise_error(
  File "/opt/conda/lib/python3.8/site-packages/tritonclient/utils/__init__.py", line 35, in raise_error
    raise InferenceServerException(msg=msg) from None
tritonclient.utils.InferenceServerException: got unexpected numpy array shape [1, 3, 640, 640], expected [-1, 3, 640, 640]

@fabito (Contributor) commented Jan 18, 2023

@ArgoHA, I solved it using this configuration:

name: "yolov5"
platform: "tensorrt_plan"
max_batch_size: 0
input [
  {
    name: "images"
    data_type: TYPE_FP32
    dims: [1, 3, 640, 640 ]
  }
]
output [
  {
    name: "output0"
    data_type: TYPE_FP32
    dims: [1, 25200, 85 ]
  }
]
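
A likely explanation, hedged since it depends on how Triton reports model metadata: with max_batch_size > 0 Triton prepends a variable batch dimension, so the server advertises the input shape as [-1, 3, 640, 640]; the YOLOv5 Triton wrapper builds its InferInput from that metadata shape, and the gRPC client's set_data_from_numpy rejects the concrete [1, 3, 640, 640] array because the shapes must match exactly. With max_batch_size: 0 and explicit dims, as above, the advertised and actual shapes are identical. An alternative, purely client-side sketch (not part of this PR) is to build the InferInput from the actual array shape:

# Client-side sketch: register the InferInput with the real array shape so the
# [-1, ...] metadata shape never enters the comparison. Tensor names follow the config above.
import numpy as np
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="localhost:8001")
im = np.zeros((1, 3, 640, 640), dtype=np.float32)           # dummy preprocessed image batch

infer_input = grpcclient.InferInput("images", list(im.shape), "FP32")
infer_input.set_data_from_numpy(im)                         # shapes now match exactly
result = client.infer(model_name="yolov5", inputs=[infer_input])
print(result.as_numpy("output0").shape)                     # e.g. (1, 25200, 85)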

Labels: enhancement (New feature or request)
5 participants