游雁 2023-10-19 14:23:07 +08:00
parent e4bda477b6
commit 1a1d12416f

@@ -42,6 +42,11 @@ Where,
- `batch_size_token`: The dynamic batch size. The total number of tokens in a batch is at most `batch_size_token`, where 1 token = 60 ms.
- `batch_size_token_threshold_s`: When the duration of an audio segment exceeds this threshold, the batch size is set to 1. Specified in `s`.
- `max_single_segment_time`: The maximum length for audio segmentation in VAD, specified in `ms`.
Suggestion: GPU memory usage grows with the square of the audio duration, so long audio inputs can cause OOM (Out of Memory) errors. There are three possible scenarios:
a) At the start of inference, GPU memory usage depends primarily on `batch_size_token`. Reducing this value reduces memory usage.
b) In the middle of inference, a long audio segment produced by VAD may cause OOM even though the total number of tokens is still smaller than `batch_size_token`. In this case, reduce `batch_size_token_threshold_s`; once a segment's duration exceeds this threshold, the batch size is forced to 1.
c) Towards the end of inference, a long audio segment produced by VAD may still cause OOM even when the total number of tokens is smaller than `batch_size_token`, the segment exceeds `batch_size_token_threshold_s`, and the batch size has therefore already been forced to 1. In this case, reduce `max_single_segment_time` to shorten the audio segments that VAD generates.
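
The interaction of the three parameters can be illustrated with a minimal sketch. This is not the library's actual batching code: the helper names `cap_segments` and `plan_batches` are hypothetical, the even splitting of long segments is a simplification of what VAD does, and the greedy grouping is only an assumption about how a token budget might be applied. The only fact taken from the text is that 1 token = 60 ms.

```python
from typing import List

MS_PER_TOKEN = 60  # from the text above: 1 token = 60 ms


def cap_segments(segment_ms: List[int], max_single_segment_time: int) -> List[int]:
    """Hypothetical helper: cut segments longer than max_single_segment_time (ms).

    Real VAD picks cut points from the audio; here we simply chop at the limit.
    """
    capped: List[int] = []
    for dur in segment_ms:
        while dur > max_single_segment_time:
            capped.append(max_single_segment_time)
            dur -= max_single_segment_time
        if dur > 0:
            capped.append(dur)
    return capped


def plan_batches(
    segment_ms: List[int],
    batch_size_token: int,
    batch_size_token_threshold_s: float,
) -> List[List[int]]:
    """Hypothetical helper: group segment indices into batches.

    - A segment longer than batch_size_token_threshold_s runs alone (batch size 1).
    - Other segments are packed greedily until the token budget would be exceeded.
    """
    batches: List[List[int]] = []
    current: List[int] = []
    current_tokens = 0
    for idx, dur_ms in enumerate(segment_ms):
        tokens = -(-dur_ms // MS_PER_TOKEN)  # ceiling division
        if dur_ms / 1000 > batch_size_token_threshold_s:
            # Long segment: flush the pending batch, then run this one alone.
            if current:
                batches.append(current)
                current, current_tokens = [], 0
            batches.append([idx])
            continue
        if current and current_tokens + tokens > batch_size_token:
            # Adding this segment would exceed the budget: start a new batch.
            batches.append(current)
            current, current_tokens = [], 0
        current.append(idx)
        current_tokens += tokens
    if current:
        batches.append(current)
    return batches
```

Under these assumptions, `plan_batches([3000, 3000, 50000, 3000], batch_size_token=100, batch_size_token_threshold_s=40)` groups the two 3 s segments together, runs the 50 s segment alone, and places the last segment in its own batch. Lowering `batch_size_token` shrinks the packed batches (scenario a), lowering `batch_size_token_threshold_s` isolates more segments (scenario b), and `cap_segments` shows how a smaller `max_single_segment_time` bounds the longest single segment (scenario c).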
#### [Paraformer-online Model](https://www.modelscope.cn/models/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-online/summary)
##### Streaming Decoding