Mirror of https://github.com/modelscope/FunASR, synced 2025-09-15 14:48:36 +08:00
commit 554a99600c (parent e0a6daebf2): add README
funasr/runtime/triton_gpu/README.md (new file, 52 lines added)
## Inference with Triton

### Steps

1. Refer to [get model.onnx](https://github.com/alibaba-damo-academy/FunASR/tree/main/funasr/runtime/python/onnxruntime#steps) to export the ONNX model.

2. Follow the instructions below to deploy with Triton:
```sh
# build the docker image from Dockerfile/Dockerfile.server
docker build . -f Dockerfile/Dockerfile.server -t triton-paraformer:23.01

# start the container, mounting the host funasr/runtime directory as /workspace
docker run -it --rm --name "paraformer_triton_server" --gpus all -v <path_host/funasr/runtime/>:/workspace --shm-size 1g --net host triton-paraformer:23.01

# inside the docker container, move the previously exported model.onnx into the encoder model directory
mv <path_model.onnx> /workspace/triton_gpu/model_repo_paraformer_large_offline/encoder/1/

# the model repository should then look like this:
model_repo_paraformer_large_offline/
|-- encoder
|   |-- 1
|   |   `-- model.onnx
|   `-- config.pbtxt
|-- feature_extractor
|   |-- 1
|   |   `-- model.py
|   |-- config.pbtxt
|   `-- config.yaml
|-- infer_pipeline
|   |-- 1
|   `-- config.pbtxt
`-- scoring
    |-- 1
    |   `-- model.py
    |-- config.pbtxt
    `-- token_list.pkl

8 directories, 9 files

# launch the service (HTTP on port 8000, gRPC on port 8001 by default)
tritonserver --model-repository ./model_repo_paraformer_large_offline \
             --pinned-memory-pool-byte-size=512000000 \
             --cuda-memory-pool-byte-size=0:1024000000
```
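
Once the server is running, you can send audio to it with the Triton Python client (`pip install tritonclient[grpc] soundfile numpy`). The snippet below is a minimal sketch, not a script from this repo: the model name `infer_pipeline` matches the repository layout above, but the tensor names (`WAV`, `WAV_LENS`, `TRANSCRIPTS`) and the file `test.wav` are assumptions -- check `infer_pipeline/config.pbtxt` for the actual input/output definitions.

```python
# Minimal sketch of a gRPC client for the deployed pipeline.
# Tensor names are assumptions; verify them against infer_pipeline/config.pbtxt.
import numpy as np
import soundfile as sf
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="localhost:8001")

# load a 16 kHz mono wav file as float32 samples
samples, sample_rate = sf.read("test.wav", dtype="float32")
samples = samples[np.newaxis, :]                          # shape [1, num_samples]
lengths = np.array([[samples.shape[1]]], dtype=np.int32)  # shape [1, 1]

inputs = [
    grpcclient.InferInput("WAV", list(samples.shape), "FP32"),
    grpcclient.InferInput("WAV_LENS", list(lengths.shape), "INT32"),
]
inputs[0].set_data_from_numpy(samples)
inputs[1].set_data_from_numpy(lengths)
outputs = [grpcclient.InferRequestedOutput("TRANSCRIPTS")]

response = client.infer("infer_pipeline", inputs, outputs=outputs)
transcript = response.as_numpy("TRANSCRIPTS")[0]
print(transcript.decode("utf-8"))
```
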
### Performance benchmark
Benchmark of [speech_paraformer](https://www.modelscope.cn/models/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary) on the Aishell1 test set with a single V100 GPU; the total audio duration is 36108.919 seconds.

(Note: the service has been fully warmed up.)

| concurrent tasks | processing time (s) | RTF    |
|------------------|---------------------|--------|
| 60 (onnx fp32)   | 116.0               | 0.0032 |
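
RTF here is the total processing time divided by the total audio duration; a quick sanity check of the figure above:

```python
# RTF = processing time / total audio duration
total_audio_seconds = 36108.919  # Aishell1 test set duration, from the benchmark above
processing_seconds = 116.0       # measured with 60 concurrent tasks (onnx fp32)
print(f"RTF = {processing_seconds / total_audio_seconds:.4f}")  # RTF = 0.0032
```
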
## Acknowledgement

This part originates from the NVIDIA CISI project. We also have TTS and NLP solutions deployed on Triton Inference Server; if you are interested, please contact us.