Inference with Triton

Steps:

  1. Refer here to get model.onnx

  2. Follow the instructions below to use Triton

# build the server image from Dockerfile/Dockerfile.server
docker build . -f Dockerfile/Dockerfile.server -t triton-paraformer:23.01
docker run -it --rm --name "paraformer_triton_server" --gpus all -v <path_host/funasr/runtime/>:/workspace --shm-size 1g --net host triton-paraformer:23.01
# inside the container, move the previously exported model.onnx into the encoder model directory
mv <path_model.onnx> /workspace/triton_gpu/model_repo_paraformer_large_offline/encoder/1/
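
The model repository should then have the following layout: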

model_repo_paraformer_large_offline/
|-- encoder
|   |-- 1
|   |   `-- model.onnx
|   `-- config.pbtxt
|-- feature_extractor
|   |-- 1
|   |   `-- model.py
|   |-- config.pbtxt
|   `-- config.yaml
|-- infer_pipeline
|   |-- 1
|   `-- config.pbtxt
`-- scoring
    |-- 1
    |   `-- model.py
    |-- config.pbtxt
    `-- token_list.pkl

8 directories, 9 files
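
As the layout suggests, feature_extractor is a Python-backend model (model.py, configured by config.yaml) that turns raw audio into fbank features, encoder serves the exported Paraformer model.onnx, and scoring is a Python-backend model that converts encoder outputs into text using token_list.pkl. infer_pipeline is the Triton ensemble that chains these three stages, so clients only need to send audio to infer_pipeline.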

# launch the service (memory pool sizes are given in bytes)
tritonserver --model-repository ./model_repo_paraformer_large_offline \
             --pinned-memory-pool-byte-size=512000000 \
             --cuda-memory-pool-byte-size=0:1024000000
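
Once the server reports all models READY, you can query the infer_pipeline ensemble over gRPC (port 8001 by default; with --net host this is the host port). Below is a minimal Python sketch using the tritonclient package (pip install tritonclient[grpc] soundfile). The tensor names WAV, WAV_LENS, and TRANSCRIPTS are assumptions borrowed from similar Triton ASR recipes; check the client directory for the exact names used by this pipeline.

# minimal_client.py -- a sketch, not the shipped client; tensor names are assumptions
import numpy as np
import soundfile as sf
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient("localhost:8001")

# load a 16 kHz mono wav (the sample rate Paraformer expects) as float32 samples
wav, sr = sf.read("test.wav", dtype="float32")
samples = wav[np.newaxis, :].astype(np.float32)       # shape (1, num_samples)
lengths = np.array([[wav.shape[0]]], dtype=np.int32)  # shape (1, 1)

inputs = [
    grpcclient.InferInput("WAV", list(samples.shape), "FP32"),
    grpcclient.InferInput("WAV_LENS", list(lengths.shape), "INT32"),
]
inputs[0].set_data_from_numpy(samples)
inputs[1].set_data_from_numpy(lengths)

# call the ensemble; Triton runs feature_extractor -> encoder -> scoring internally
result = client.infer(
    "infer_pipeline",
    inputs,
    outputs=[grpcclient.InferRequestedOutput("TRANSCRIPTS")],
)
print(result.as_numpy("TRANSCRIPTS")[0].decode("utf-8"))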

Performance benchmark

We benchmarked speech_paraformer on the AISHELL-1 test set with a single V100 GPU; the total audio duration is 36108.919 seconds.

(Note: the service was fully warmed up before measurement.)

concurrent tasks   processing time (s)   RTF
60 (onnx fp32)     116.0                 0.0032
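
For reference, RTF here is total processing time divided by total audio duration: 116.0 s / 36108.919 s ≈ 0.0032, i.e. the service decodes roughly 310x faster than real time at 60 concurrent tasks.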

Acknowledgement

This part originates from the NVIDIA CISI project. We also have TTS and NLP solutions deployed on Triton Inference Server. If you are interested, please contact us.