Mirror of https://github.com/modelscope/FunASR (synced 2025-09-15 14:48:36 +08:00)

Commit 4dc3a1b011: resolve conflict

README.md: 30 changes
@@ -15,36 +15,10 @@
| [**Model Zoo**](https://www.modelscope.cn/models/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary)
| [**Contact**](#contact)


## What's new:

### 2023.2.17, funasr-0.2.0, modelscope-1.3.0

- We support a new feature: exporting Paraformer models to [ONNX and TorchScript](https://github.com/alibaba-damo-academy/FunASR/tree/main/funasr/export) from ModelScope. Local finetuned models are also supported.
- We support a new feature, [onnxruntime](https://github.com/alibaba-damo-academy/FunASR/tree/main/funasr/runtime/python): you can deploy the runtime without ModelScope or FunASR. For the [paraformer-large](https://www.modelscope.cn/models/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary) model, onnxruntime gives a 3x RTF speedup (0.110 -> 0.038) on CPU, [details](https://github.com/alibaba-damo-academy/FunASR/tree/main/funasr/runtime/python/onnxruntime/paraformer/rapid_paraformer#speed).
- We support a new feature, [grpc](https://github.com/alibaba-damo-academy/FunASR/tree/main/funasr/runtime/python/grpc): you can build an ASR service with gRPC by deploying the ModelScope pipeline or onnxruntime.
- We release a new model, [paraformer-large-contextual](https://www.modelscope.cn/models/damo/speech_paraformer-large-contextual_asr_nat-zh-cn-16k-common-vocab8404/summary), which supports hotword customization based on incentive enhancement and improves the recall and precision of hotwords.
- We optimize the timestamp alignment of [Paraformer-large-long](https://modelscope.cn/models/damo/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary); the timestamp prediction accuracy is much improved, achieving an accumulated average shift (AAS) of 74.7 ms, [details](https://arxiv.org/abs/2301.12343).
- We release a new model, the [8k VAD model](https://modelscope.cn/models/damo/speech_fsmn_vad_zh-cn-16k-common-pytorch/summary), which predicts the duration of non-silence speech. It can be freely integrated with any ASR model in [ModelScope](https://github.com/alibaba-damo-academy/FunASR/discussions/134).
- We release a new model, [MFCCA](https://www.modelscope.cn/models/NPU-ASLP/speech_mfcca_asr-zh-cn-16k-alimeeting-vocab4950/summary), a multi-channel multi-speaker model which is independent of the number and geometry of microphones and supports Mandarin meeting transcription.
- We release several new UniASR models:
  [Southern Fujian Dialect model](https://modelscope.cn/models/damo/speech_UniASR_asr_2pass-minnan-16k-common-vocab3825/summary),
  [French model](https://modelscope.cn/models/damo/speech_UniASR_asr_2pass-fr-16k-common-vocab3472-tensorflow1-online/summary),
  [German model](https://modelscope.cn/models/damo/speech_UniASR_asr_2pass-de-16k-common-vocab3690-tensorflow1-online/summary),
  [Vietnamese model](https://modelscope.cn/models/damo/speech_UniASR_asr_2pass-vi-16k-common-vocab1001-pytorch-online/summary),
  [Persian model](https://modelscope.cn/models/damo/speech_UniASR_asr_2pass-fa-16k-common-vocab1257-pytorch-online/summary).
- We release a new model, the [paraformer-data2vec model](https://www.modelscope.cn/models/damo/speech_data2vec_pretrain-paraformer-zh-cn-aishell2-16k/summary), an unsupervised pretraining model trained on AISHELL-2, which is used to initialize a Paraformer model that is then finetuned on AISHELL-1.
- We release a new feature: the `VAD`, `ASR` and `PUNC` models can be integrated freely; they can be models from [ModelScope](https://www.modelscope.cn/models/damo/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary) or local finetuned models (see the [demo](https://github.com/alibaba-damo-academy/FunASR/discussions/134) and the pipeline sketch at the end of this section).
- We optimize the [punctuation common model](https://www.modelscope.cn/models/damo/punc_ct-transformer_zh-cn-common-vocab272727-pytorch/summary), enhancing recall and precision and fixing bad cases of missing punctuation marks.
- Various new audio input types are now supported by the ModelScope inference pipeline, including mp3, flac, ogg, opus, ...

### 2023.1.16, funasr-0.1.6, modelscope-1.2.0

- We release a new model, [Paraformer-large-long](https://modelscope.cn/models/damo/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary), which integrates the [VAD](https://modelscope.cn/models/damo/speech_fsmn_vad_zh-cn-16k-common-pytorch/summary) model, the [ASR](https://www.modelscope.cn/models/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary) model, the [Punctuation](https://www.modelscope.cn/models/damo/punc_ct-transformer_zh-cn-common-vocab272727-pytorch/summary) model and timestamp prediction. The model can take inputs that are several hours long.
- We release a new model, the [16k VAD model](https://modelscope.cn/models/damo/speech_fsmn_vad_zh-cn-16k-common-pytorch/summary), which predicts the duration of non-silence speech. It can be freely integrated with any ASR model in [ModelScope](https://www.modelscope.cn/models/damo/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary).
- We release a new model, [Punctuation](https://www.modelscope.cn/models/damo/punc_ct-transformer_zh-cn-common-vocab272727-pytorch/summary), which predicts punctuation for ASR results. It can be freely integrated with any ASR model in the [Model Zoo](docs/modelscope_models.md).
- We release a new model, [Data2vec](https://www.modelscope.cn/models/damo/speech_data2vec_pretrain-zh-cn-aishell2-16k-pytorch/summary), an unsupervised pretraining model which can be finetuned on ASR and other downstream tasks.
- We release a new model, [Paraformer-Tiny](https://www.modelscope.cn/models/damo/speech_paraformer-tiny-commandword_asr_nat-zh-cn-16k-vocab544-pytorch/summary), a lightweight Paraformer model which supports Mandarin command-word recognition.
- We release a new model, [SV](https://www.modelscope.cn/models/damo/speech_xvector_sv-zh-cn-cnceleb-16k-spk3465-pytorch/summary), which extracts speaker embeddings and performs speaker verification on paired utterances. Speaker diarization will be supported in a future version.
- We improve the ModelScope pipeline to speed up inference by integrating model building into pipeline building.
- Various new audio input types are now supported by the ModelScope inference pipeline, including wav.scp, wav format, audio bytes, wave samples, ...

For the release notes, please refer to [news](https://github.com/alibaba-damo-academy/FunASR/releases)

## Highlights

- Many types of typical models are supported, e.g., [Transformer](https://arxiv.org/abs/1706.03762), [Conformer](https://arxiv.org/abs/2005.08100), [Paraformer](https://arxiv.org/abs/2206.08317).

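A minimal sketch of the integrated VAD + ASR + punctuation usage mentioned in the release notes above. The `pipeline` call mirrors the inference scripts elsewhere in this commit; the audio file name is a placeholder.

```python
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

# Paraformer-large-long bundles VAD, ASR, punctuation and timestamps in one model.
inference_pipeline = pipeline(
    task=Tasks.auto_speech_recognition,
    model="damo/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch",
)

# "long_audio.wav" is a placeholder for any local recording, even several hours long.
rec_result = inference_pipeline(audio_in="long_audio.wav")
print(rec_result)
```

The same `audio_in` argument also accepts a `wav.scp` list or a URL, as the other scripts in this commit show.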
@@ -52,7 +52,7 @@ asr_config=conf/train_asr_conformer.yaml
model_dir="baseline_$(basename "${asr_config}" .yaml)_${feats_type}_${lang}_${token_type}_${tag}"

inference_config=conf/decode_asr_transformer.yaml
-inference_asr_model=valid.acc.ave_10best.pth
+inference_asr_model=valid.acc.ave_10best.pb

# you can set gpu num for decoding here
gpuid_list=$CUDA_VISIBLE_DEVICES # set gpus for decoding, the same as training stage by default

@@ -55,7 +55,7 @@ asr_config=conf/train_asr_paraformer_transformer_12e_6d_3072_768.yaml
model_dir="baseline_$(basename "${asr_config}" .yaml)_${feats_type}_${lang}_${token_type}_${tag}"

inference_config=conf/decode_asr_transformer_noctc_1best.yaml
-inference_asr_model=valid.acc.ave_10best.pth
+inference_asr_model=valid.acc.ave_10best.pb

# you can set gpu num for decoding here
gpuid_list=$CUDA_VISIBLE_DEVICES # set gpus for decoding, the same as training stage by default

@@ -55,7 +55,7 @@ asr_config=conf/train_asr_transformer_12e_6d_3072_768.yaml
model_dir="baseline_$(basename "${asr_config}" .yaml)_${feats_type}_${lang}_${token_type}_${tag}"

inference_config=conf/decode_asr_transformer.yaml
-inference_asr_model=valid.cer_ctc.ave_10best.pth
+inference_asr_model=valid.cer_ctc.ave_10best.pb

# you can set gpu num for decoding here
gpuid_list=$CUDA_VISIBLE_DEVICES # set gpus for decoding, the same as training stage by default

@@ -52,7 +52,7 @@ asr_config=conf/train_asr_paraformer_conformer_12e_6d_2048_256.yaml
model_dir="baseline_$(basename "${asr_config}" .yaml)_${feats_type}_${lang}_${token_type}_${tag}"

inference_config=conf/decode_asr_transformer_noctc_1best.yaml
-inference_asr_model=valid.acc.ave_10best.pth
+inference_asr_model=valid.acc.ave_10best.pb

# you can set gpu num for decoding here
gpuid_list=$CUDA_VISIBLE_DEVICES # set gpus for decoding, the same as training stage by default

@@ -56,7 +56,7 @@ asr_config=conf/train_asr_paraformerbert_conformer_12e_6d_2048_256.yaml
model_dir="baseline_$(basename "${asr_config}" .yaml)_${feats_type}_${lang}_${token_type}_${tag}"

inference_config=conf/decode_asr_transformer_noctc_1best.yaml
-inference_asr_model=valid.acc.ave_10best.pth
+inference_asr_model=valid.acc.ave_10best.pb

# you can set gpu num for decoding here
gpuid_list=$CUDA_VISIBLE_DEVICES # set gpus for decoding, the same as training stage by default

@@ -52,7 +52,7 @@ asr_config=conf/train_asr_conformer.yaml
model_dir="baseline_$(basename "${asr_config}" .yaml)_${feats_type}_${lang}_${token_type}_${tag}"

inference_config=conf/decode_asr_transformer.yaml
-inference_asr_model=valid.acc.ave_10best.pth
+inference_asr_model=valid.acc.ave_10best.pb

# you can set gpu num for decoding here
gpuid_list=$CUDA_VISIBLE_DEVICES # set gpus for decoding, the same as training stage by default

@@ -54,7 +54,7 @@ asr_config=conf/train_asr_conformer.yaml
model_dir="baseline_$(basename "${asr_config}" .yaml)_${feats_type}_${lang}_${token_type}_${tag}"

inference_config=conf/decode_asr_transformer.yaml
-inference_asr_model=valid.acc.ave_10best.pth
+inference_asr_model=valid.acc.ave_10best.pb

# you can set gpu num for decoding here
gpuid_list=$CUDA_VISIBLE_DEVICES # set gpus for decoding, e.g., gpuid_list=2,3, the same as training stage by default

@@ -54,7 +54,7 @@ asr_config=conf/train_asr_paraformer_conformer_20e_1280_320_6d_1280_320.yaml
model_dir="baseline_$(basename "${asr_config}" .yaml)_${feats_type}_${lang}_${token_type}_${tag}"

inference_config=conf/decode_asr_transformer_noctc_1best.yaml
-inference_asr_model=valid.acc.ave_10best.pth
+inference_asr_model=valid.acc.ave_10best.pb

# you can set gpu num for decoding here
gpuid_list=$CUDA_VISIBLE_DEVICES # set gpus for decoding, e.g., gpuid_list=2,3, the same as training stage by default

@@ -58,7 +58,7 @@ asr_config=conf/train_asr_paraformerbert_conformer_20e_6d_1280_320.yaml
model_dir="baseline_$(basename "${asr_config}" .yaml)_${feats_type}_${lang}_${token_type}_${tag}"

inference_config=conf/decode_asr_transformer_noctc_1best.yaml
-inference_asr_model=valid.acc.ave_10best.pth
+inference_asr_model=valid.acc.ave_10best.pb

# you can set gpu num for decoding here
gpuid_list=$CUDA_VISIBLE_DEVICES # set gpus for decoding, e.g., gpuid_list=2,3, the same as training stage by default

@@ -54,7 +54,7 @@ asr_config=conf/train_asr_transformer.yaml
model_dir="baseline_$(basename "${asr_config}" .yaml)_${feats_type}_${lang}_${token_type}_${tag}"

inference_config=conf/decode_asr_transformer.yaml
-inference_asr_model=valid.acc.ave_10best.pth
+inference_asr_model=valid.acc.ave_10best.pb

# you can set gpu num for decoding here
gpuid_list=$CUDA_VISIBLE_DEVICES # set gpus for decoding, e.g., gpuid_list=2,3, the same as training stage by default

@@ -34,7 +34,7 @@ exp_dir=./data
tag=exp1
model_dir="baseline_$(basename "${lm_config}" .yaml)_${lang}_${token_type}_${tag}"
lm_exp=${exp_dir}/exp/${model_dir}
-inference_lm=valid.loss.ave.pth # Language model path for decoding.
+inference_lm=valid.loss.ave.pb # Language model path for decoding.

stage=0
stop_stage=3

@@ -4,7 +4,7 @@ import sys

def main():
    diar_config_path = sys.argv[1] if len(sys.argv) > 1 else "sond_fbank.yaml"
-    diar_model_path = sys.argv[2] if len(sys.argv) > 2 else "sond.pth"
+    diar_model_path = sys.argv[2] if len(sys.argv) > 2 else "sond.pb"
    output_dir = sys.argv[3] if len(sys.argv) > 3 else "./outputs"
    data_path_and_name_and_type = [
        ("data/test_rmsil/feats.scp", "speech", "kaldi_ark"),

@@ -17,9 +17,9 @@ if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
    echo "Downloading Pre-trained model..."
    git clone https://www.modelscope.cn/damo/speech_xvector_sv-zh-cn-cnceleb-16k-spk3465-pytorch.git
    git clone https://www.modelscope.cn/damo/speech_diarization_sond-zh-cn-alimeeting-16k-n16k4-pytorch.git
-    ln -s speech_xvector_sv-zh-cn-cnceleb-16k-spk3465-pytorch/sv.pth ./sv.pth
+    ln -s speech_xvector_sv-zh-cn-cnceleb-16k-spk3465-pytorch/sv.pb ./sv.pb
    cp speech_xvector_sv-zh-cn-cnceleb-16k-spk3465-pytorch/sv.yaml ./sv.yaml
-    ln -s speech_diarization_sond-zh-cn-alimeeting-16k-n16k4-pytorch/sond.pth ./sond.pth
+    ln -s speech_diarization_sond-zh-cn-alimeeting-16k-n16k4-pytorch/sond.pb ./sond.pb
    cp speech_diarization_sond-zh-cn-alimeeting-16k-n16k4-pytorch/sond_fbank.yaml ./sond_fbank.yaml
    cp speech_diarization_sond-zh-cn-alimeeting-16k-n16k4-pytorch/sond.yaml ./sond.yaml
    echo "Done."

@@ -30,7 +30,7 @@ fi

if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
    echo "Calculating diarization results..."
-    python infer_alimeeting_test.py sond_fbank.yaml sond.pth outputs
+    python infer_alimeeting_test.py sond_fbank.yaml sond.pb outputs
    python local/convert_label_to_rttm.py \
        outputs/labels.txt \
        data/test_rmsil/raw_rmsil_map.scp \

@@ -4,7 +4,7 @@ import os

def test_fbank_cpu_infer():
    diar_config_path = "config_fbank.yaml"
-    diar_model_path = "sond.pth"
+    diar_model_path = "sond.pb"
    output_dir = "./outputs"
    data_path_and_name_and_type = [
        ("data/unit_test/test_feats.scp", "speech", "kaldi_ark"),

@@ -24,7 +24,7 @@ def test_fbank_cpu_infer():

def test_fbank_gpu_infer():
    diar_config_path = "config_fbank.yaml"
-    diar_model_path = "sond.pth"
+    diar_model_path = "sond.pb"
    output_dir = "./outputs"
    data_path_and_name_and_type = [
        ("data/unit_test/test_feats.scp", "speech", "kaldi_ark"),

@@ -45,7 +45,7 @@ def test_fbank_gpu_infer():

def test_wav_gpu_infer():
    diar_config_path = "config.yaml"
-    diar_model_path = "sond.pth"
+    diar_model_path = "sond.pb"
    output_dir = "./outputs"
    data_path_and_name_and_type = [
        ("data/unit_test/test_wav.scp", "speech", "sound"),

@@ -66,7 +66,7 @@ def test_wav_gpu_infer():

def test_without_profile_gpu_infer():
    diar_config_path = "config.yaml"
-    diar_model_path = "sond.pth"
+    diar_model_path = "sond.pb"
    output_dir = "./outputs"
    raw_inputs = [[
        "data/unit_test/raw_inputs/record.wav",

@@ -4,7 +4,7 @@ import os

def test_fbank_cpu_infer():
    diar_config_path = "sond_fbank.yaml"
-    diar_model_path = "sond.pth"
+    diar_model_path = "sond.pb"
    output_dir = "./outputs"
    data_path_and_name_and_type = [
        ("data/unit_test/test_feats.scp", "speech", "kaldi_ark"),

@@ -24,7 +24,7 @@ def test_fbank_cpu_infer():

def test_fbank_gpu_infer():
    diar_config_path = "sond_fbank.yaml"
-    diar_model_path = "sond.pth"
+    diar_model_path = "sond.pb"
    output_dir = "./outputs"
    data_path_and_name_and_type = [
        ("data/unit_test/test_feats.scp", "speech", "kaldi_ark"),

@@ -45,7 +45,7 @@ def test_fbank_gpu_infer():

def test_wav_gpu_infer():
    diar_config_path = "config.yaml"
-    diar_model_path = "sond.pth"
+    diar_model_path = "sond.pb"
    output_dir = "./outputs"
    data_path_and_name_and_type = [
        ("data/unit_test/test_wav.scp", "speech", "sound"),

@@ -66,7 +66,7 @@ def test_wav_gpu_infer():

def test_without_profile_gpu_infer():
    diar_config_path = "config.yaml"
-    diar_model_path = "sond.pth"
+    diar_model_path = "sond.pb"
    output_dir = "./outputs"
    raw_inputs = [[
        "data/unit_test/raw_inputs/record.wav",

@@ -49,7 +49,7 @@ asr_config=conf/train_asr_conformer.yaml
model_dir="baseline_$(basename "${asr_config}" .yaml)_${feats_type}_${lang}_${token_type}_${tag}"

inference_config=conf/decode_asr_transformer.yaml
-inference_asr_model=valid.acc.ave_10best.pth
+inference_asr_model=valid.acc.ave_10best.pb

# you can set gpu num for decoding here
gpuid_list=$CUDA_VISIBLE_DEVICES # set gpus for decoding, the same as training stage by default

@@ -41,7 +41,7 @@ The decoding results can be found in `$output_dir/1best_recog/text.cer`, which i
- Modify inference-related parameters in `infer_after_finetune.py`
    - <strong>output_dir:</strong> # result dir
    - <strong>data_dir:</strong> # the dataset dir needs to include `test/wav.scp`. If `test/text` also exists, CER will be computed
-    - <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pth`
+    - <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pb`

- Then you can run the pipeline to infer with:
```python

@@ -48,5 +48,5 @@ if __name__ == '__main__':
    params["required_files"] = ["am.mvn", "decoding.yaml", "configuration.json"]
    params["output_dir"] = "./checkpoint"
    params["data_dir"] = "./data/test"
-    params["decoding_model_name"] = "valid.cer_ctc.ave.pth"
+    params["decoding_model_name"] = "valid.cer_ctc.ave.pb"
    modelscope_infer_after_finetune(params)

@@ -41,7 +41,7 @@ The decoding results can be found in `$output_dir/1best_recog/text.cer`, which i
- Modify inference-related parameters in `infer_after_finetune.py`
    - <strong>output_dir:</strong> # result dir
    - <strong>data_dir:</strong> # the dataset dir needs to include `test/wav.scp`. If `test/text` also exists, CER will be computed
-    - <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pth`
+    - <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pb`

- Then you can run the pipeline to infer with:
```python

@@ -48,5 +48,5 @@ if __name__ == '__main__':
    params["required_files"] = ["am.mvn", "decoding.yaml", "configuration.json"]
    params["output_dir"] = "./checkpoint"
    params["data_dir"] = "./data/test"
-    params["decoding_model_name"] = "valid.cer_ctc.ave.pth"
+    params["decoding_model_name"] = "valid.cer_ctc.ave.pb"
    modelscope_infer_after_finetune(params)

@@ -41,7 +41,7 @@ The decoding results can be found in `$output_dir/1best_recog/text.sp.cer` and `
- Modify inference-related parameters in `infer_after_finetune.py`
    - <strong>output_dir:</strong> # result dir
    - <strong>data_dir:</strong> # the dataset dir needs to include `test/wav.scp`. If `test/text` also exists, CER will be computed
-    - <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pth`
+    - <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pb`

- Then you can run the pipeline to infer with:
```python

@@ -63,5 +63,5 @@ if __name__ == '__main__':
    params["required_files"] = ["feats_stats.npz", "decoding.yaml", "configuration.json"]
    params["output_dir"] = "./checkpoint"
    params["data_dir"] = "./example_data/validation"
-    params["decoding_model_name"] = "valid.acc.ave.pth"
+    params["decoding_model_name"] = "valid.acc.ave.pb"
    modelscope_infer_after_finetune(params)

@@ -8,9 +8,14 @@ from modelscope.utils.constant import Tasks
from funasr.utils.compute_wer import compute_wer


-def modelscope_infer_core(output_dir, split_dir, njob, idx):
+def modelscope_infer_core(output_dir, split_dir, njob, idx, batch_size, ngpu, model):
    output_dir_job = os.path.join(output_dir, "output.{}".format(idx))
-    gpu_id = (int(idx) - 1) // njob
+    if ngpu > 0:
+        use_gpu = 1
+        gpu_id = int(idx) - 1
+    else:
+        use_gpu = 0
+        gpu_id = -1
    if "CUDA_VISIBLE_DEVICES" in os.environ.keys():
        gpu_list = os.environ['CUDA_VISIBLE_DEVICES'].split(",")
        os.environ['CUDA_VISIBLE_DEVICES'] = str(gpu_list[gpu_id])

@@ -18,9 +23,10 @@ def modelscope_infer_core(output_dir, split_dir, njob, idx):
        os.environ['CUDA_VISIBLE_DEVICES'] = str(gpu_id)
    inference_pipline = pipeline(
        task=Tasks.auto_speech_recognition,
-        model="damo/speech_paraformer-large_asr_nat-zh-cn-16k-aishell1-vocab8404-pytorch",
+        model=model,
        output_dir=output_dir_job,
-        batch_size=64
+        batch_size=batch_size,
+        ngpu=use_gpu,
    )
    audio_in = os.path.join(split_dir, "wav.{}.scp".format(idx))
    inference_pipline(audio_in=audio_in)

@@ -30,13 +36,18 @@ def modelscope_infer(params):
    # prepare for multi-GPU decoding
    ngpu = params["ngpu"]
    njob = params["njob"]
+    batch_size = params["batch_size"]
    output_dir = params["output_dir"]
+    model = params["model"]
    if os.path.exists(output_dir):
        shutil.rmtree(output_dir)
    os.mkdir(output_dir)
    split_dir = os.path.join(output_dir, "split")
    os.mkdir(split_dir)
-    nj = ngpu * njob
+    if ngpu > 0:
+        nj = ngpu
+    elif ngpu == 0:
+        nj = njob
    wav_scp_file = os.path.join(params["data_dir"], "wav.scp")
    with open(wav_scp_file) as f:
        lines = f.readlines()

@@ -56,7 +67,7 @@ def modelscope_infer(params):
    p = Pool(nj)
    for i in range(nj):
        p.apply_async(modelscope_infer_core,
-                      args=(output_dir, split_dir, njob, str(i + 1)))
+                      args=(output_dir, split_dir, njob, str(i + 1), batch_size, ngpu, model))
    p.close()
    p.join()

@@ -81,8 +92,10 @@ def modelscope_infer(params):

if __name__ == "__main__":
    params = {}
+    params["model"] = "damo/speech_paraformer-large_asr_nat-zh-cn-16k-aishell1-vocab8404-pytorch"
    params["data_dir"] = "./data/test"
    params["output_dir"] = "./results"
-    params["ngpu"] = 1
-    params["njob"] = 1
-    modelscope_infer(params)
+    params["ngpu"] = 1 # if ngpu > 0, will use gpu decoding
+    params["njob"] = 1 # if ngpu = 0, will use cpu decoding
+    params["batch_size"] = 64
+    modelscope_infer(params)

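To make the new job-scheduling behaviour above concrete, here is a small illustrative helper (not part of the commit) that mirrors how `modelscope_infer` derives the number of parallel jobs and how `modelscope_infer_core` assigns devices:

```python
# Hypothetical helper mirroring the logic of modelscope_infer / modelscope_infer_core above.
def plan_jobs(ngpu, njob):
    # GPU decoding: one job per GPU; CPU decoding: njob parallel jobs.
    nj = ngpu if ngpu > 0 else njob
    jobs = []
    for idx in range(1, nj + 1):
        use_gpu = 1 if ngpu > 0 else 0
        gpu_id = idx - 1 if ngpu > 0 else -1
        jobs.append({"idx": idx, "use_gpu": use_gpu, "gpu_id": gpu_id})
    return jobs

print(plan_jobs(ngpu=2, njob=4))  # two GPU jobs on gpu_id 0 and 1
print(plan_jobs(ngpu=0, njob=4))  # four CPU jobs, gpu_id -1
```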
@@ -4,23 +4,18 @@ import shutil

from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks
+from modelscope.hub.snapshot_download import snapshot_download

from funasr.utils.compute_wer import compute_wer


def modelscope_infer_after_finetune(params):
    # prepare for decoding
-    pretrained_model_path = os.path.join(os.environ["HOME"], ".cache/modelscope/hub", params["modelscope_model_name"])
-    for file_name in params["required_files"]:
-        if file_name == "configuration.json":
-            with open(os.path.join(pretrained_model_path, file_name)) as f:
-                config_dict = json.load(f)
-            config_dict["model"]["am_model_name"] = params["decoding_model_name"]
-            with open(os.path.join(params["output_dir"], "configuration.json"), "w") as f:
-                json.dump(config_dict, f, indent=4, separators=(',', ': '))
-        else:
-            shutil.copy(os.path.join(pretrained_model_path, file_name),
-                        os.path.join(params["output_dir"], file_name))

+    try:
+        pretrained_model_path = snapshot_download(params["modelscope_model_name"], cache_dir=params["output_dir"])
+    except BaseException:
+        raise BaseException(f"Please download pretrain model from ModelScope firstly.")
+    shutil.copy(os.path.join(params["output_dir"], params["decoding_model_name"]), os.path.join(pretrained_model_path, "model.pb"))
    decoding_path = os.path.join(params["output_dir"], "decode_results")
    if os.path.exists(decoding_path):
        shutil.rmtree(decoding_path)

@@ -29,9 +24,9 @@ def modelscope_infer_after_finetune(params):
    # decoding
    inference_pipeline = pipeline(
        task=Tasks.auto_speech_recognition,
-        model=params["output_dir"],
+        model=pretrained_model_path,
        output_dir=decoding_path,
-        batch_size=64
+        batch_size=params["batch_size"]
    )
    audio_in = os.path.join(params["data_dir"], "wav.scp")
    inference_pipeline(audio_in=audio_in)

@@ -46,8 +41,8 @@ def modelscope_infer_after_finetune(params):
if __name__ == '__main__':
    params = {}
    params["modelscope_model_name"] = "damo/speech_paraformer-large_asr_nat-zh-cn-16k-aishell1-vocab8404-pytorch"
-    params["required_files"] = ["am.mvn", "decoding.yaml", "configuration.json"]
    params["output_dir"] = "./checkpoint"
    params["data_dir"] = "./data/test"
-    params["decoding_model_name"] = "valid.acc.ave_10best.pth"
-    modelscope_infer_after_finetune(params)
+    params["decoding_model_name"] = "valid.acc.ave_10best.pb"
+    params["batch_size"] = 64
+    modelscope_infer_after_finetune(params)

@ -8,9 +8,14 @@ from modelscope.utils.constant import Tasks
|
||||
from funasr.utils.compute_wer import compute_wer
|
||||
|
||||
|
||||
def modelscope_infer_core(output_dir, split_dir, njob, idx):
|
||||
def modelscope_infer_core(output_dir, split_dir, njob, idx, batch_size, ngpu, model):
|
||||
output_dir_job = os.path.join(output_dir, "output.{}".format(idx))
|
||||
gpu_id = (int(idx) - 1) // njob
|
||||
if ngpu > 0:
|
||||
use_gpu = 1
|
||||
gpu_id = int(idx) - 1
|
||||
else:
|
||||
use_gpu = 0
|
||||
gpu_id = -1
|
||||
if "CUDA_VISIBLE_DEVICES" in os.environ.keys():
|
||||
gpu_list = os.environ['CUDA_VISIBLE_DEVICES'].split(",")
|
||||
os.environ['CUDA_VISIBLE_DEVICES'] = str(gpu_list[gpu_id])
|
||||
@ -18,9 +23,10 @@ def modelscope_infer_core(output_dir, split_dir, njob, idx):
|
||||
os.environ['CUDA_VISIBLE_DEVICES'] = str(gpu_id)
|
||||
inference_pipline = pipeline(
|
||||
task=Tasks.auto_speech_recognition,
|
||||
model="damo/speech_paraformer-large_asr_nat-zh-cn-16k-aishell2-vocab8404-pytorch",
|
||||
model=model,
|
||||
output_dir=output_dir_job,
|
||||
batch_size=64
|
||||
batch_size=batch_size,
|
||||
ngpu=use_gpu,
|
||||
)
|
||||
audio_in = os.path.join(split_dir, "wav.{}.scp".format(idx))
|
||||
inference_pipline(audio_in=audio_in)
|
||||
@ -30,13 +36,18 @@ def modelscope_infer(params):
|
||||
# prepare for multi-GPU decoding
|
||||
ngpu = params["ngpu"]
|
||||
njob = params["njob"]
|
||||
batch_size = params["batch_size"]
|
||||
output_dir = params["output_dir"]
|
||||
model = params["model"]
|
||||
if os.path.exists(output_dir):
|
||||
shutil.rmtree(output_dir)
|
||||
os.mkdir(output_dir)
|
||||
split_dir = os.path.join(output_dir, "split")
|
||||
os.mkdir(split_dir)
|
||||
nj = ngpu * njob
|
||||
if ngpu > 0:
|
||||
nj = ngpu
|
||||
elif ngpu == 0:
|
||||
nj = njob
|
||||
wav_scp_file = os.path.join(params["data_dir"], "wav.scp")
|
||||
with open(wav_scp_file) as f:
|
||||
lines = f.readlines()
|
||||
@ -56,7 +67,7 @@ def modelscope_infer(params):
|
||||
p = Pool(nj)
|
||||
for i in range(nj):
|
||||
p.apply_async(modelscope_infer_core,
|
||||
args=(output_dir, split_dir, njob, str(i + 1)))
|
||||
args=(output_dir, split_dir, njob, str(i + 1), batch_size, ngpu, model))
|
||||
p.close()
|
||||
p.join()
|
||||
|
||||
@ -81,8 +92,10 @@ def modelscope_infer(params):
|
||||
|
||||
if __name__ == "__main__":
|
||||
params = {}
|
||||
params["model"] = "damo/speech_paraformer-large_asr_nat-zh-cn-16k-aishell2-vocab8404-pytorch"
|
||||
params["data_dir"] = "./data/test"
|
||||
params["output_dir"] = "./results"
|
||||
params["ngpu"] = 1
|
||||
params["njob"] = 1
|
||||
modelscope_infer(params)
|
||||
params["ngpu"] = 1 # if ngpu > 0, will use gpu decoding
|
||||
params["njob"] = 1 # if ngpu = 0, will use cpu decoding
|
||||
params["batch_size"] = 64
|
||||
modelscope_infer(params)
|
||||
@ -4,23 +4,18 @@ import shutil
|
||||
|
||||
from modelscope.pipelines import pipeline
|
||||
from modelscope.utils.constant import Tasks
|
||||
from modelscope.hub.snapshot_download import snapshot_download
|
||||
|
||||
from funasr.utils.compute_wer import compute_wer
|
||||
|
||||
|
||||
def modelscope_infer_after_finetune(params):
|
||||
# prepare for decoding
|
||||
pretrained_model_path = os.path.join(os.environ["HOME"], ".cache/modelscope/hub", params["modelscope_model_name"])
|
||||
for file_name in params["required_files"]:
|
||||
if file_name == "configuration.json":
|
||||
with open(os.path.join(pretrained_model_path, file_name)) as f:
|
||||
config_dict = json.load(f)
|
||||
config_dict["model"]["am_model_name"] = params["decoding_model_name"]
|
||||
with open(os.path.join(params["output_dir"], "configuration.json"), "w") as f:
|
||||
json.dump(config_dict, f, indent=4, separators=(',', ': '))
|
||||
else:
|
||||
shutil.copy(os.path.join(pretrained_model_path, file_name),
|
||||
os.path.join(params["output_dir"], file_name))
|
||||
|
||||
try:
|
||||
pretrained_model_path = snapshot_download(params["modelscope_model_name"], cache_dir=params["output_dir"])
|
||||
except BaseException:
|
||||
raise BaseException(f"Please download pretrain model from ModelScope firstly.")
|
||||
shutil.copy(os.path.join(params["output_dir"], params["decoding_model_name"]), os.path.join(pretrained_model_path, "model.pb"))
|
||||
decoding_path = os.path.join(params["output_dir"], "decode_results")
|
||||
if os.path.exists(decoding_path):
|
||||
shutil.rmtree(decoding_path)
|
||||
@ -29,9 +24,9 @@ def modelscope_infer_after_finetune(params):
|
||||
# decoding
|
||||
inference_pipeline = pipeline(
|
||||
task=Tasks.auto_speech_recognition,
|
||||
model=params["output_dir"],
|
||||
model=pretrained_model_path,
|
||||
output_dir=decoding_path,
|
||||
batch_size=64
|
||||
batch_size=params["batch_size"]
|
||||
)
|
||||
audio_in = os.path.join(params["data_dir"], "wav.scp")
|
||||
inference_pipeline(audio_in=audio_in)
|
||||
@ -46,8 +41,8 @@ def modelscope_infer_after_finetune(params):
|
||||
if __name__ == '__main__':
|
||||
params = {}
|
||||
params["modelscope_model_name"] = "damo/speech_paraformer-large_asr_nat-zh-cn-16k-aishell2-vocab8404-pytorch"
|
||||
params["required_files"] = ["am.mvn", "decoding.yaml", "configuration.json"]
|
||||
params["output_dir"] = "./checkpoint"
|
||||
params["data_dir"] = "./data/test"
|
||||
params["decoding_model_name"] = "valid.acc.ave_10best.pth"
|
||||
modelscope_infer_after_finetune(params)
|
||||
params["decoding_model_name"] = "valid.acc.ave_10best.pb"
|
||||
params["batch_size"] = 64
|
||||
modelscope_infer_after_finetune(params)
|
||||
@@ -22,10 +22,12 @@
Or you can use the finetuned model for inference directly.

- Setting parameters in `infer.py` (an illustrative `params` sketch follows this section)
+    - <strong>model:</strong> # model name on ModelScope
    - <strong>data_dir:</strong> # the dataset dir needs to include `test/wav.scp`. If `test/text` also exists, CER will be computed
    - <strong>output_dir:</strong> # result dir
-    - <strong>ngpu:</strong> # the number of GPUs for decoding
-    - <strong>njob:</strong> # the number of jobs for each GPU
+    - <strong>ngpu:</strong> # the number of GPUs for decoding; if `ngpu` > 0, GPU decoding is used
+    - <strong>njob:</strong> # the number of jobs for CPU decoding; if `ngpu` = 0, CPU decoding is used, please set `njob`
+    - <strong>batch_size:</strong> # batch size for inference

- Then you can run the pipeline to infer with:
```python

@@ -39,9 +41,11 @@ The decoding results can be found in `$output_dir/1best_recog/text.cer`, which i
### Inference using local finetuned model

- Modify inference-related parameters in `infer_after_finetune.py`
+    - <strong>modelscope_model_name:</strong> # model name on ModelScope
    - <strong>output_dir:</strong> # result dir
    - <strong>data_dir:</strong> # the dataset dir needs to include `test/wav.scp`. If `test/text` also exists, CER will be computed
-    - <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pth`
+    - <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pb`
+    - <strong>batch_size:</strong> # batch size for inference

- Then you can run the pipeline to infer with:
```python

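As a reference for the parameter list above, the `__main__` block of `infer.py` would be configured along these lines; a minimal sketch with illustrative values, where the model name is the common Paraformer-large used by the scripts in this commit:

```python
# Illustrative settings for infer.py; modelscope_infer is defined earlier in that script.
params = {}
params["model"] = "damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch"
params["data_dir"] = "./data/test"   # directory containing the wav.scp (and optionally text) to decode
params["output_dir"] = "./results"
params["ngpu"] = 1                   # > 0 selects GPU decoding, one job per GPU
params["njob"] = 1                   # number of parallel CPU jobs when ngpu = 0
params["batch_size"] = 64
modelscope_infer(params)
```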
@ -8,9 +8,14 @@ from modelscope.utils.constant import Tasks
|
||||
from funasr.utils.compute_wer import compute_wer
|
||||
|
||||
|
||||
def modelscope_infer_core(output_dir, split_dir, njob, idx):
|
||||
def modelscope_infer_core(output_dir, split_dir, njob, idx, batch_size, ngpu, model):
|
||||
output_dir_job = os.path.join(output_dir, "output.{}".format(idx))
|
||||
gpu_id = (int(idx) - 1) // njob
|
||||
if ngpu > 0:
|
||||
use_gpu = 1
|
||||
gpu_id = int(idx) - 1
|
||||
else:
|
||||
use_gpu = 0
|
||||
gpu_id = -1
|
||||
if "CUDA_VISIBLE_DEVICES" in os.environ.keys():
|
||||
gpu_list = os.environ['CUDA_VISIBLE_DEVICES'].split(",")
|
||||
os.environ['CUDA_VISIBLE_DEVICES'] = str(gpu_list[gpu_id])
|
||||
@ -18,9 +23,10 @@ def modelscope_infer_core(output_dir, split_dir, njob, idx):
|
||||
os.environ['CUDA_VISIBLE_DEVICES'] = str(gpu_id)
|
||||
inference_pipline = pipeline(
|
||||
task=Tasks.auto_speech_recognition,
|
||||
model="damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch",
|
||||
model=model,
|
||||
output_dir=output_dir_job,
|
||||
batch_size=64
|
||||
batch_size=batch_size,
|
||||
ngpu=use_gpu,
|
||||
)
|
||||
audio_in = os.path.join(split_dir, "wav.{}.scp".format(idx))
|
||||
inference_pipline(audio_in=audio_in)
|
||||
@ -30,13 +36,18 @@ def modelscope_infer(params):
|
||||
# prepare for multi-GPU decoding
|
||||
ngpu = params["ngpu"]
|
||||
njob = params["njob"]
|
||||
batch_size = params["batch_size"]
|
||||
output_dir = params["output_dir"]
|
||||
model = params["model"]
|
||||
if os.path.exists(output_dir):
|
||||
shutil.rmtree(output_dir)
|
||||
os.mkdir(output_dir)
|
||||
split_dir = os.path.join(output_dir, "split")
|
||||
os.mkdir(split_dir)
|
||||
nj = ngpu * njob
|
||||
if ngpu > 0:
|
||||
nj = ngpu
|
||||
elif ngpu == 0:
|
||||
nj = njob
|
||||
wav_scp_file = os.path.join(params["data_dir"], "wav.scp")
|
||||
with open(wav_scp_file) as f:
|
||||
lines = f.readlines()
|
||||
@ -56,7 +67,7 @@ def modelscope_infer(params):
|
||||
p = Pool(nj)
|
||||
for i in range(nj):
|
||||
p.apply_async(modelscope_infer_core,
|
||||
args=(output_dir, split_dir, njob, str(i + 1)))
|
||||
args=(output_dir, split_dir, njob, str(i + 1), batch_size, ngpu, model))
|
||||
p.close()
|
||||
p.join()
|
||||
|
||||
@ -81,8 +92,10 @@ def modelscope_infer(params):
|
||||
|
||||
if __name__ == "__main__":
|
||||
params = {}
|
||||
params["model"] = "damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch"
|
||||
params["data_dir"] = "./data/test"
|
||||
params["output_dir"] = "./results"
|
||||
params["ngpu"] = 1
|
||||
params["njob"] = 1
|
||||
modelscope_infer(params)
|
||||
params["ngpu"] = 1 # if ngpu > 0, will use gpu decoding
|
||||
params["njob"] = 1 # if ngpu = 0, will use cpu decoding
|
||||
params["batch_size"] = 64
|
||||
modelscope_infer(params)
|
||||
@ -4,23 +4,18 @@ import shutil
|
||||
|
||||
from modelscope.pipelines import pipeline
|
||||
from modelscope.utils.constant import Tasks
|
||||
from modelscope.hub.snapshot_download import snapshot_download
|
||||
|
||||
from funasr.utils.compute_wer import compute_wer
|
||||
|
||||
|
||||
def modelscope_infer_after_finetune(params):
|
||||
# prepare for decoding
|
||||
pretrained_model_path = os.path.join(os.environ["HOME"], ".cache/modelscope/hub", params["modelscope_model_name"])
|
||||
for file_name in params["required_files"]:
|
||||
if file_name == "configuration.json":
|
||||
with open(os.path.join(pretrained_model_path, file_name)) as f:
|
||||
config_dict = json.load(f)
|
||||
config_dict["model"]["am_model_name"] = params["decoding_model_name"]
|
||||
with open(os.path.join(params["output_dir"], "configuration.json"), "w") as f:
|
||||
json.dump(config_dict, f, indent=4, separators=(',', ': '))
|
||||
else:
|
||||
shutil.copy(os.path.join(pretrained_model_path, file_name),
|
||||
os.path.join(params["output_dir"], file_name))
|
||||
|
||||
try:
|
||||
pretrained_model_path = snapshot_download(params["modelscope_model_name"], cache_dir=params["output_dir"])
|
||||
except BaseException:
|
||||
raise BaseException(f"Please download pretrain model from ModelScope firstly.")
|
||||
shutil.copy(os.path.join(params["output_dir"], params["decoding_model_name"]), os.path.join(pretrained_model_path, "model.pb"))
|
||||
decoding_path = os.path.join(params["output_dir"], "decode_results")
|
||||
if os.path.exists(decoding_path):
|
||||
shutil.rmtree(decoding_path)
|
||||
@ -29,9 +24,9 @@ def modelscope_infer_after_finetune(params):
|
||||
# decoding
|
||||
inference_pipeline = pipeline(
|
||||
task=Tasks.auto_speech_recognition,
|
||||
model=params["output_dir"],
|
||||
model=pretrained_model_path,
|
||||
output_dir=decoding_path,
|
||||
batch_size=64
|
||||
batch_size=params["batch_size"]
|
||||
)
|
||||
audio_in = os.path.join(params["data_dir"], "wav.scp")
|
||||
inference_pipeline(audio_in=audio_in)
|
||||
@ -46,8 +41,8 @@ def modelscope_infer_after_finetune(params):
|
||||
if __name__ == '__main__':
|
||||
params = {}
|
||||
params["modelscope_model_name"] = "damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch"
|
||||
params["required_files"] = ["am.mvn", "decoding.yaml", "configuration.json"]
|
||||
params["output_dir"] = "./checkpoint"
|
||||
params["data_dir"] = "./data/test"
|
||||
params["decoding_model_name"] = "valid.acc.ave_10best.pth"
|
||||
modelscope_infer_after_finetune(params)
|
||||
params["decoding_model_name"] = "valid.acc.ave_10best.pb"
|
||||
params["batch_size"] = 64
|
||||
modelscope_infer_after_finetune(params)
|
||||
@@ -0,0 +1,57 @@
import torch
import torchaudio
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

from modelscope.utils.logger import get_logger
import logging
logger = get_logger(log_level=logging.CRITICAL)
logger.setLevel(logging.CRITICAL)

inference_pipeline = pipeline(
    task=Tasks.auto_speech_recognition,
    model='damo/speech_paraformer_asr_nat-zh-cn-16k-common-vocab8404-online',
    model_revision='v1.0.2')

waveform, sample_rate = torchaudio.load("waihu.wav")
speech_length = waveform.shape[1]
speech = waveform[0]

cache_en = {"start_idx": 0, "pad_left": 0, "stride": 10, "pad_right": 5, "cif_hidden": None, "cif_alphas": None}
cache_de = {"decode_fsmn": None}
cache = {"encoder": cache_en, "decoder": cache_de}
param_dict = {}
param_dict["cache"] = cache

first_chunk = True
speech_buffer = speech
speech_cache = []
final_result = ""

while len(speech_buffer) >= 960:
    if first_chunk:
        if len(speech_buffer) >= 14400:
            rec_result = inference_pipeline(audio_in=speech_buffer[0:14400], param_dict=param_dict)
            speech_buffer = speech_buffer[4800:]
        else:
            cache_en["stride"] = len(speech_buffer) // 960
            cache_en["pad_right"] = 0
            rec_result = inference_pipeline(audio_in=speech_buffer, param_dict=param_dict)
            speech_buffer = []
        cache_en["start_idx"] = -5
        first_chunk = False
    else:
        cache_en["start_idx"] += 10
        if len(speech_buffer) >= 4800:
            cache_en["pad_left"] = 5
            rec_result = inference_pipeline(audio_in=speech_buffer[:19200], param_dict=param_dict)
            speech_buffer = speech_buffer[9600:]
        else:
            cache_en["stride"] = len(speech_buffer) // 960
            cache_en["pad_right"] = 0
            rec_result = inference_pipeline(audio_in=speech_buffer, param_dict=param_dict)
            speech_buffer = []
    if len(rec_result) != 0 and rec_result['text'] != "sil":
        final_result += rec_result['text']
        print(rec_result)
print(final_result)

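A quick sanity check on the hard-coded sample counts in the streaming demo above, assuming 16 kHz input as the "16k" model name suggests:

```python
# Assuming 16 kHz audio, the buffer sizes used by the streaming loop above map to:
sample_rate = 16000
for samples in (960, 4800, 9600, 14400, 19200):
    print(f"{samples} samples = {samples / sample_rate * 1000:.0f} ms")
# 960 = 60 ms (minimum buffer the loop processes), 14400 = 900 ms first chunk,
# 9600 = 600 ms advance for later chunks, with 4800 / 19200 bounding the padded window.
```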
@ -8,9 +8,14 @@ from modelscope.utils.constant import Tasks
|
||||
from funasr.utils.compute_wer import compute_wer
|
||||
|
||||
|
||||
def modelscope_infer_core(output_dir, split_dir, njob, idx):
|
||||
def modelscope_infer_core(output_dir, split_dir, njob, idx, batch_size, ngpu, model):
|
||||
output_dir_job = os.path.join(output_dir, "output.{}".format(idx))
|
||||
gpu_id = (int(idx) - 1) // njob
|
||||
if ngpu > 0:
|
||||
use_gpu = 1
|
||||
gpu_id = int(idx) - 1
|
||||
else:
|
||||
use_gpu = 0
|
||||
gpu_id = -1
|
||||
if "CUDA_VISIBLE_DEVICES" in os.environ.keys():
|
||||
gpu_list = os.environ['CUDA_VISIBLE_DEVICES'].split(",")
|
||||
os.environ['CUDA_VISIBLE_DEVICES'] = str(gpu_list[gpu_id])
|
||||
@ -18,9 +23,10 @@ def modelscope_infer_core(output_dir, split_dir, njob, idx):
|
||||
os.environ['CUDA_VISIBLE_DEVICES'] = str(gpu_id)
|
||||
inference_pipline = pipeline(
|
||||
task=Tasks.auto_speech_recognition,
|
||||
model="damo/speech_paraformer_asr_nat-zh-cn-8k-common-vocab8358-tensorflow1",
|
||||
model=model,
|
||||
output_dir=output_dir_job,
|
||||
batch_size=64
|
||||
batch_size=batch_size,
|
||||
ngpu=use_gpu,
|
||||
)
|
||||
audio_in = os.path.join(split_dir, "wav.{}.scp".format(idx))
|
||||
inference_pipline(audio_in=audio_in)
|
||||
@ -30,13 +36,18 @@ def modelscope_infer(params):
|
||||
# prepare for multi-GPU decoding
|
||||
ngpu = params["ngpu"]
|
||||
njob = params["njob"]
|
||||
batch_size = params["batch_size"]
|
||||
output_dir = params["output_dir"]
|
||||
model = params["model"]
|
||||
if os.path.exists(output_dir):
|
||||
shutil.rmtree(output_dir)
|
||||
os.mkdir(output_dir)
|
||||
split_dir = os.path.join(output_dir, "split")
|
||||
os.mkdir(split_dir)
|
||||
nj = ngpu * njob
|
||||
if ngpu > 0:
|
||||
nj = ngpu
|
||||
elif ngpu == 0:
|
||||
nj = njob
|
||||
wav_scp_file = os.path.join(params["data_dir"], "wav.scp")
|
||||
with open(wav_scp_file) as f:
|
||||
lines = f.readlines()
|
||||
@ -56,7 +67,7 @@ def modelscope_infer(params):
|
||||
p = Pool(nj)
|
||||
for i in range(nj):
|
||||
p.apply_async(modelscope_infer_core,
|
||||
args=(output_dir, split_dir, njob, str(i + 1)))
|
||||
args=(output_dir, split_dir, njob, str(i + 1), batch_size, ngpu, model))
|
||||
p.close()
|
||||
p.join()
|
||||
|
||||
@ -81,8 +92,10 @@ def modelscope_infer(params):
|
||||
|
||||
if __name__ == "__main__":
|
||||
params = {}
|
||||
params["model"] = "damo/speech_paraformer_asr_nat-zh-cn-8k-common-vocab8358-tensorflow1"
|
||||
params["data_dir"] = "./data/test"
|
||||
params["output_dir"] = "./results"
|
||||
params["ngpu"] = 1
|
||||
params["njob"] = 1
|
||||
modelscope_infer(params)
|
||||
params["ngpu"] = 1 # if ngpu > 0, will use gpu decoding
|
||||
params["njob"] = 1 # if ngpu = 0, will use cpu decoding
|
||||
params["batch_size"] = 64
|
||||
modelscope_infer(params)
|
||||
@ -4,23 +4,18 @@ import shutil
|
||||
|
||||
from modelscope.pipelines import pipeline
|
||||
from modelscope.utils.constant import Tasks
|
||||
from modelscope.hub.snapshot_download import snapshot_download
|
||||
|
||||
from funasr.utils.compute_wer import compute_wer
|
||||
|
||||
|
||||
def modelscope_infer_after_finetune(params):
|
||||
# prepare for decoding
|
||||
pretrained_model_path = os.path.join(os.environ["HOME"], ".cache/modelscope/hub", params["modelscope_model_name"])
|
||||
for file_name in params["required_files"]:
|
||||
if file_name == "configuration.json":
|
||||
with open(os.path.join(pretrained_model_path, file_name)) as f:
|
||||
config_dict = json.load(f)
|
||||
config_dict["model"]["am_model_name"] = params["decoding_model_name"]
|
||||
with open(os.path.join(params["output_dir"], "configuration.json"), "w") as f:
|
||||
json.dump(config_dict, f, indent=4, separators=(',', ': '))
|
||||
else:
|
||||
shutil.copy(os.path.join(pretrained_model_path, file_name),
|
||||
os.path.join(params["output_dir"], file_name))
|
||||
|
||||
try:
|
||||
pretrained_model_path = snapshot_download(params["modelscope_model_name"], cache_dir=params["output_dir"])
|
||||
except BaseException:
|
||||
raise BaseException(f"Please download pretrain model from ModelScope firstly.")
|
||||
shutil.copy(os.path.join(params["output_dir"], params["decoding_model_name"]), os.path.join(pretrained_model_path, "model.pb"))
|
||||
decoding_path = os.path.join(params["output_dir"], "decode_results")
|
||||
if os.path.exists(decoding_path):
|
||||
shutil.rmtree(decoding_path)
|
||||
@ -29,9 +24,9 @@ def modelscope_infer_after_finetune(params):
|
||||
# decoding
|
||||
inference_pipeline = pipeline(
|
||||
task=Tasks.auto_speech_recognition,
|
||||
model=params["output_dir"],
|
||||
model=pretrained_model_path,
|
||||
output_dir=decoding_path,
|
||||
batch_size=64
|
||||
batch_size=params["batch_size"]
|
||||
)
|
||||
audio_in = os.path.join(params["data_dir"], "wav.scp")
|
||||
inference_pipeline(audio_in=audio_in)
|
||||
@ -46,8 +41,8 @@ def modelscope_infer_after_finetune(params):
|
||||
if __name__ == '__main__':
|
||||
params = {}
|
||||
params["modelscope_model_name"] = "damo/speech_paraformer_asr_nat-zh-cn-8k-common-vocab8358-tensorflow1"
|
||||
params["required_files"] = ["am.mvn", "decoding.yaml", "configuration.json"]
|
||||
params["output_dir"] = "./checkpoint"
|
||||
params["data_dir"] = "./data/test"
|
||||
params["decoding_model_name"] = "valid.acc.ave_10best.pth"
|
||||
modelscope_infer_after_finetune(params)
|
||||
params["decoding_model_name"] = "valid.acc.ave_10best.pb"
|
||||
params["batch_size"] = 64
|
||||
modelscope_infer_after_finetune(params)
|
||||
@@ -41,7 +41,7 @@ The decoding results can be found in `$output_dir/1best_recog/text.cer`, which i
- Modify inference-related parameters in `infer_after_finetune.py`
    - <strong>output_dir:</strong> # result dir
    - <strong>data_dir:</strong> # the dataset dir needs to include `test/wav.scp`. If `test/text` also exists, CER will be computed
-    - <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pth`
+    - <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pb`

- Then you can run the pipeline to infer with:
```python

@@ -50,5 +50,5 @@ if __name__ == '__main__':
    params["required_files"] = ["am.mvn", "decoding.yaml", "configuration.json"]
    params["output_dir"] = "./checkpoint"
    params["data_dir"] = "./data/test"
-    params["decoding_model_name"] = "20epoch.pth"
+    params["decoding_model_name"] = "20epoch.pb"
    modelscope_infer_after_finetune(params)

@@ -41,7 +41,7 @@ The decoding results can be found in `$output_dir/1best_recog/text.cer`, which i
- Modify inference-related parameters in `infer_after_finetune.py`
    - <strong>output_dir:</strong> # result dir
    - <strong>data_dir:</strong> # the dataset dir needs to include `test/wav.scp`. If `test/text` also exists, CER will be computed
-    - <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pth`
+    - <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pb`

- Then you can run the pipeline to infer with:
```python

@@ -50,5 +50,5 @@ if __name__ == '__main__':
    params["required_files"] = ["am.mvn", "decoding.yaml", "configuration.json"]
    params["output_dir"] = "./checkpoint"
    params["data_dir"] = "./data/test"
-    params["decoding_model_name"] = "20epoch.pth"
+    params["decoding_model_name"] = "20epoch.pb"
    modelscope_infer_after_finetune(params)

@@ -0,0 +1,35 @@
import os
from modelscope.metainfo import Trainers
from modelscope.trainers import build_trainer
from funasr.datasets.ms_dataset import MsDataset


def modelscope_finetune(params):
    if not os.path.exists(params["output_dir"]):
        os.makedirs(params["output_dir"], exist_ok=True)
    # dataset split ["train", "validation"]
    ds_dict = MsDataset.load(params["data_dir"])
    kwargs = dict(
        model=params["model"],
        model_revision=params["model_revision"],
        data_dir=ds_dict,
        dataset_type=params["dataset_type"],
        work_dir=params["output_dir"],
        batch_bins=params["batch_bins"],
        max_epoch=params["max_epoch"],
        lr=params["lr"])
    trainer = build_trainer(Trainers.speech_asr_trainer, default_args=kwargs)
    trainer.train()


if __name__ == '__main__':
    params = {}
    params["output_dir"] = "./checkpoint"
    params["data_dir"] = "./data"
    params["batch_bins"] = 2000
    params["dataset_type"] = "small"
    params["max_epoch"] = 50
    params["lr"] = 0.00005
    params["model"] = "damo/speech_UniASR_asr_2pass-he-16k-common-vocab1085-pytorch"
    params["model_revision"] = None
    modelscope_finetune(params)

@@ -0,0 +1,13 @@
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

if __name__ == "__main__":
    audio_in = "https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/asr_example_he.wav"
    output_dir = "./results"
    inference_pipline = pipeline(
        task=Tasks.auto_speech_recognition,
        model="damo/speech_UniASR_asr_2pass-he-16k-common-vocab1085-pytorch",
        output_dir=output_dir,
    )
    rec_result = inference_pipline(audio_in=audio_in, param_dict={"decoding_model": "offline"})
    print(rec_result)

@@ -41,7 +41,7 @@ The decoding results can be found in `$output_dir/1best_recog/text.cer`, which i
- Modify inference-related parameters in `infer_after_finetune.py`
    - <strong>output_dir:</strong> # result dir
    - <strong>data_dir:</strong> # the dataset dir needs to include `test/wav.scp`. If `test/text` also exists, CER will be computed
-    - <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pth`
+    - <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pb`

- Then you can run the pipeline to infer with:
```python

@@ -50,5 +50,5 @@ if __name__ == '__main__':
    params["required_files"] = ["am.mvn", "decoding.yaml", "configuration.json"]
    params["output_dir"] = "./checkpoint"
    params["data_dir"] = "./data/test"
-    params["decoding_model_name"] = "20epoch.pth"
+    params["decoding_model_name"] = "20epoch.pb"
    modelscope_infer_after_finetune(params)

@ -0,0 +1,35 @@
|
||||
import os
|
||||
from modelscope.metainfo import Trainers
|
||||
from modelscope.trainers import build_trainer
|
||||
from funasr.datasets.ms_dataset import MsDataset
|
||||
|
||||
|
||||
def modelscope_finetune(params):
|
||||
if not os.path.exists(params["output_dir"]):
|
||||
os.makedirs(params["output_dir"], exist_ok=True)
|
||||
# dataset split ["train", "validation"]
|
||||
ds_dict = MsDataset.load(params["data_dir"])
|
||||
kwargs = dict(
|
||||
model=params["model"],
|
||||
model_revision=params["model_revision"],
|
||||
data_dir=ds_dict,
|
||||
dataset_type=params["dataset_type"],
|
||||
work_dir=params["output_dir"],
|
||||
batch_bins=params["batch_bins"],
|
||||
max_epoch=params["max_epoch"],
|
||||
lr=params["lr"])
|
||||
trainer = build_trainer(Trainers.speech_asr_trainer, default_args=kwargs)
|
||||
trainer.train()
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
params = {}
|
||||
params["output_dir"] = "./checkpoint"
|
||||
params["data_dir"] = "./data"
|
||||
params["batch_bins"] = 2000
|
||||
params["dataset_type"] = "small"
|
||||
params["max_epoch"] = 50
|
||||
params["lr"] = 0.00005
|
||||
params["model"] = "damo/speech_UniASR_asr_2pass-my-16k-common-vocab696-pytorch"
|
||||
params["model_revision"] = None
|
||||
modelscope_finetune(params)
|
||||
@ -0,0 +1,13 @@
|
||||
from modelscope.pipelines import pipeline
|
||||
from modelscope.utils.constant import Tasks
|
||||
|
||||
if __name__ == "__main__":
|
||||
audio_in = "https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/asr_example_my.wav"
|
||||
output_dir = "./results"
|
||||
inference_pipline = pipeline(
|
||||
task=Tasks.auto_speech_recognition,
|
||||
model="damo/speech_UniASR_asr_2pass-my-16k-common-vocab696-pytorch",
|
||||
output_dir=output_dir,
|
||||
)
|
||||
rec_result = inference_pipline(audio_in=audio_in, param_dict={"decoding_model":"offline"})
|
||||
print(rec_result)
|
||||
@ -0,0 +1,35 @@
|
||||
import os
|
||||
from modelscope.metainfo import Trainers
|
||||
from modelscope.trainers import build_trainer
|
||||
from funasr.datasets.ms_dataset import MsDataset
|
||||
|
||||
|
||||
def modelscope_finetune(params):
|
||||
if not os.path.exists(params["output_dir"]):
|
||||
os.makedirs(params["output_dir"], exist_ok=True)
|
||||
# dataset split ["train", "validation"]
|
||||
ds_dict = MsDataset.load(params["data_dir"])
|
||||
kwargs = dict(
|
||||
model=params["model"],
|
||||
model_revision=params["model_revision"],
|
||||
data_dir=ds_dict,
|
||||
dataset_type=params["dataset_type"],
|
||||
work_dir=params["output_dir"],
|
||||
batch_bins=params["batch_bins"],
|
||||
max_epoch=params["max_epoch"],
|
||||
lr=params["lr"])
|
||||
trainer = build_trainer(Trainers.speech_asr_trainer, default_args=kwargs)
|
||||
trainer.train()
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
params = {}
|
||||
params["output_dir"] = "./checkpoint"
|
||||
params["data_dir"] = "./data"
|
||||
params["batch_bins"] = 2000
|
||||
params["dataset_type"] = "small"
|
||||
params["max_epoch"] = 50
|
||||
params["lr"] = 0.00005
|
||||
params["model"] = "damo/speech_UniASR_asr_2pass-ur-16k-common-vocab877-pytorch"
|
||||
params["model_revision"] = None
|
||||
modelscope_finetune(params)
|
||||
@ -0,0 +1,13 @@
|
||||
from modelscope.pipelines import pipeline
|
||||
from modelscope.utils.constant import Tasks
|
||||
|
||||
if __name__ == "__main__":
|
||||
audio_in = "https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/asr_example_ur.wav"
|
||||
output_dir = "./results"
|
||||
inference_pipline = pipeline(
|
||||
task=Tasks.auto_speech_recognition,
|
||||
model="damo/speech_UniASR_asr_2pass-ur-16k-common-vocab877-pytorch",
|
||||
output_dir=output_dir,
|
||||
)
|
||||
rec_result = inference_pipline(audio_in=audio_in, param_dict={"decoding_model":"offline"})
|
||||
print(rec_result)
|
||||
@@ -41,7 +41,7 @@ The decoding results can be found in `$output_dir/1best_recog/text.cer`, which i
- Modify inference related parameters in `infer_after_finetune.py`
    - <strong>output_dir:</strong> # result dir
    - <strong>data_dir:</strong> # the dataset dir needs to include `test/wav.scp`. If `test/text` also exists, the CER will be computed
    - <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pth`
    - <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pb`

- Then you can run the pipeline to infer with:
```python
@@ -49,5 +49,5 @@ if __name__ == '__main__':
    params["required_files"] = ["am.mvn", "decoding.yaml", "configuration.json"]
    params["output_dir"] = "./checkpoint"
    params["data_dir"] = "./data/test"
    params["decoding_model_name"] = "20epoch.pth"
    params["decoding_model_name"] = "20epoch.pb"
    modelscope_infer_after_finetune(params)
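For context on `data_dir` above: the inference code further down reads `test/wav.scp` (via `audio_in`) and the optional `test/text` (via `compute_wer`), both in the usual Kaldi-style "utterance-id value" layout. A minimal sketch with placeholder paths, assuming that layout:

```python
# Hedged sketch: prepare data_dir="./data/test" with Kaldi-style scp files.
# The "<utt_id> <wav_path>" and "<utt_id> <transcript>" layout is an assumption
# based on how wav.scp and text are consumed in the hunks below.
import os

os.makedirs("./data/test", exist_ok=True)
with open("./data/test/wav.scp", "w", encoding="utf-8") as f:
    f.write("utt_001 /path/to/utt_001.wav\n")
    f.write("utt_002 /path/to/utt_002.wav\n")
with open("./data/test/text", "w", encoding="utf-8") as f:  # optional, enables CER computation
    f.write("utt_001 reference transcript one\n")
    f.write("utt_002 reference transcript two\n")
```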
@@ -41,7 +41,8 @@ The decoding results can be found in `$output_dir/1best_recog/text.cer`, which i
- Modify inference related parameters in `infer_after_finetune.py`
    - <strong>output_dir:</strong> # result dir
    - <strong>data_dir:</strong> # the dataset dir needs to include `test/wav.scp`. If `test/text` also exists, the CER will be computed
    - <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pth`
    - <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pb`

- Then you can run the pipeline to infer with:
```python
@@ -49,5 +49,5 @@ if __name__ == '__main__':
    params["required_files"] = ["am.mvn", "decoding.yaml", "configuration.json"]
    params["output_dir"] = "./checkpoint"
    params["data_dir"] = "./data/test"
    params["decoding_model_name"] = "20epoch.pth"
    params["decoding_model_name"] = "20epoch.pb"
    modelscope_infer_after_finetune(params)
@@ -34,7 +34,7 @@ Or you can use the finetuned model for inference directly.
- Modify inference related parameters in `infer_after_finetune.py`
    - <strong>output_dir:</strong> # result dir
    - <strong>data_dir:</strong> # the dataset dir needs to include `test/wav.scp`. If `test/text` also exists, the CER will be computed
    - <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pth`
    - <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pb`

- Then you can run the pipeline to infer with:
```python
@@ -4,27 +4,17 @@ import shutil

from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks
from modelscope.hub.snapshot_download import snapshot_download

from funasr.utils.compute_wer import compute_wer


def modelscope_infer_after_finetune(params):
    # prepare for decoding
    if not os.path.exists(os.path.join(params["output_dir"], "punc")):
        os.makedirs(os.path.join(params["output_dir"], "punc"))
    if not os.path.exists(os.path.join(params["output_dir"], "vad")):
        os.makedirs(os.path.join(params["output_dir"], "vad"))
    pretrained_model_path = os.path.join(os.environ["HOME"], ".cache/modelscope/hub", params["modelscope_model_name"])
    for file_name in params["required_files"]:
        if file_name == "configuration.json":
            with open(os.path.join(pretrained_model_path, file_name)) as f:
                config_dict = json.load(f)
            config_dict["model"]["am_model_name"] = params["decoding_model_name"]
            with open(os.path.join(params["output_dir"], "configuration.json"), "w") as f:
                json.dump(config_dict, f, indent=4, separators=(',', ': '))
        else:
            shutil.copy(os.path.join(pretrained_model_path, file_name),
                        os.path.join(params["output_dir"], file_name))

    try:
        pretrained_model_path = snapshot_download(params["modelscope_model_name"], cache_dir=params["output_dir"])
    except BaseException:
        raise BaseException(f"Please download pretrain model from ModelScope firstly.")
    shutil.copy(os.path.join(params["output_dir"], params["decoding_model_name"]), os.path.join(pretrained_model_path, "model.pb"))
    decoding_path = os.path.join(params["output_dir"], "decode_results")
    if os.path.exists(decoding_path):
        shutil.rmtree(decoding_path)
@@ -33,16 +23,16 @@ def modelscope_infer_after_finetune(params):
    # decoding
    inference_pipeline = pipeline(
        task=Tasks.auto_speech_recognition,
        model=params["output_dir"],
        model=pretrained_model_path,
        output_dir=decoding_path,
        batch_size=64
        batch_size=params["batch_size"]
    )
    audio_in = os.path.join(params["data_dir"], "wav.scp")
    inference_pipeline(audio_in=audio_in)

    # compute CER if GT text is set
    text_in = os.path.join(params["data_dir"], "text")
    if text_in is not None:
    if os.path.exists(text_in):
        text_proc_file = os.path.join(decoding_path, "1best_recog/token")
        compute_wer(text_in, text_proc_file, os.path.join(decoding_path, "text.cer"))
@@ -50,8 +40,8 @@ def modelscope_infer_after_finetune(params):
if __name__ == '__main__':
    params = {}
    params["modelscope_model_name"] = "damo/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch"
    params["required_files"] = ["am.mvn", "decoding.yaml", "configuration.json", "punc/punc.pb", "punc/punc.yaml", "vad/vad.mvn", "vad/vad.pb", "vad/vad.yaml"]
    params["output_dir"] = "./checkpoint"
    params["data_dir"] = "./data/test"
    params["decoding_model_name"] = "valid.acc.ave_10best.pth"
    modelscope_infer_after_finetune(params)
    params["decoding_model_name"] = "valid.acc.ave_10best.pb"
    params["batch_size"] = 64
    modelscope_infer_after_finetune(params)
@@ -4,6 +4,11 @@ inputs = "跨境河流是养育沿岸|人民的生命之源长期以来为帮助

from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks
from modelscope.utils.logger import get_logger
import logging
logger = get_logger(log_level=logging.CRITICAL)
logger.setLevel(logging.CRITICAL)


inference_pipeline = pipeline(
    task=Tasks.punctuation,
@@ -0,0 +1,10 @@
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

inference_diar_pipline = pipeline(
    task=Tasks.speaker_diarization,
    model='damo/speech_diarization_eend-ola-en-us-callhome-8k',
    model_revision="v1.0.0",
)
results = inference_diar_pipline(audio_in=["https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_data/record2.wav"])
print(results)
@@ -14,13 +14,12 @@ inference_diar_pipline = pipeline(
)

# Take audio_list as input: the first audio is the speech to be analyzed, and the following audios are the voiceprint enrollment utterances of the different speakers
audio_list = [[
audio_list = [
    "https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_data/record.wav",
    "https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_data/spk_A.wav",
    "https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_data/spk_B.wav",
    "https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_data/spk_B1.wav"
]]
]

results = inference_diar_pipline(audio_in=audio_list)
for rst in results:
    print(rst["value"])
print(results)
@@ -1,8 +1,11 @@
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks
from modelscope.utils.logger import get_logger
import logging
logger = get_logger(log_level=logging.CRITICAL)
logger.setLevel(logging.CRITICAL)
import soundfile


if __name__ == '__main__':
    output_dir = None
    inference_pipline = pipeline(

@@ -1,8 +1,11 @@
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks
from modelscope.utils.logger import get_logger
import logging
logger = get_logger(log_level=logging.CRITICAL)
logger.setLevel(logging.CRITICAL)
import soundfile


if __name__ == '__main__':
    output_dir = None
    inference_pipline = pipeline(
@@ -52,7 +52,7 @@ class Speech2Text:

    Examples:
        >>> import soundfile
        >>> speech2text = Speech2Text("asr_config.yml", "asr.pth")
        >>> speech2text = Speech2Text("asr_config.yml", "asr.pb")
        >>> audio, rate = soundfile.read("speech.wav")
        >>> speech2text(audio)
        [(text, token, token_int, hypothesis object), ...]
@@ -256,6 +256,9 @@ def inference_launch(**kwargs):
    elif mode == "paraformer":
        from funasr.bin.asr_inference_paraformer import inference_modelscope
        return inference_modelscope(**kwargs)
    elif mode == "paraformer_streaming":
        from funasr.bin.asr_inference_paraformer_streaming import inference_modelscope
        return inference_modelscope(**kwargs)
    elif mode == "paraformer_vad":
        from funasr.bin.asr_inference_paraformer_vad import inference_modelscope
        return inference_modelscope(**kwargs)
@@ -41,8 +41,6 @@ from funasr.utils.types import str_or_none
from funasr.utils import asr_utils, wav_utils, postprocess_utils
import pdb

header_colors = '\033[95m'
end_colors = '\033[0m'

global_asr_language: str = 'zh-cn'
global_sample_rate: Union[int, Dict[Any, int]] = {
@@ -55,7 +53,7 @@ class Speech2Text:

    Examples:
        >>> import soundfile
        >>> speech2text = Speech2Text("asr_config.yml", "asr.pth")
        >>> speech2text = Speech2Text("asr_config.yml", "asr.pb")
        >>> audio, rate = soundfile.read("speech.wav")
        >>> speech2text(audio)
        [(text, token, token_int, hypothesis object), ...]
@@ -50,7 +50,7 @@ class Speech2Text:

    Examples:
        >>> import soundfile
        >>> speech2text = Speech2Text("asr_config.yml", "asr.pth")
        >>> speech2text = Speech2Text("asr_config.yml", "asr.pb")
        >>> audio, rate = soundfile.read("speech.wav")
        >>> speech2text(audio)
        [(text, token, token_int, hypothesis object), ...]
funasr/bin/asr_inference_paraformer_streaming.py (new file, 907 lines)
@@ -0,0 +1,907 @@
#!/usr/bin/env python3
|
||||
import argparse
|
||||
import logging
|
||||
import sys
|
||||
import time
|
||||
import copy
|
||||
import os
|
||||
import codecs
|
||||
import tempfile
|
||||
import requests
|
||||
from pathlib import Path
|
||||
from typing import Optional
|
||||
from typing import Sequence
|
||||
from typing import Tuple
|
||||
from typing import Union
|
||||
from typing import Dict
|
||||
from typing import Any
|
||||
from typing import List
|
||||
|
||||
import numpy as np
|
||||
import torch
|
||||
from typeguard import check_argument_types
|
||||
|
||||
from funasr.fileio.datadir_writer import DatadirWriter
|
||||
from funasr.modules.beam_search.beam_search import BeamSearchPara as BeamSearch
|
||||
from funasr.modules.beam_search.beam_search import Hypothesis
|
||||
from funasr.modules.scorers.ctc import CTCPrefixScorer
|
||||
from funasr.modules.scorers.length_bonus import LengthBonus
|
||||
from funasr.modules.subsampling import TooShortUttError
|
||||
from funasr.tasks.asr import ASRTaskParaformer as ASRTask
|
||||
from funasr.tasks.lm import LMTask
|
||||
from funasr.text.build_tokenizer import build_tokenizer
|
||||
from funasr.text.token_id_converter import TokenIDConverter
|
||||
from funasr.torch_utils.device_funcs import to_device
|
||||
from funasr.torch_utils.set_all_random_seed import set_all_random_seed
|
||||
from funasr.utils import config_argparse
|
||||
from funasr.utils.cli_utils import get_commandline_args
|
||||
from funasr.utils.types import str2bool
|
||||
from funasr.utils.types import str2triple_str
|
||||
from funasr.utils.types import str_or_none
|
||||
from funasr.utils import asr_utils, wav_utils, postprocess_utils
|
||||
from funasr.models.frontend.wav_frontend import WavFrontend
|
||||
from funasr.models.e2e_asr_paraformer import BiCifParaformer, ContextualParaformer
|
||||
from funasr.export.models.e2e_asr_paraformer import Paraformer as Paraformer_export
|
||||
|
||||
class Speech2Text:
|
||||
"""Speech2Text class
|
||||
|
||||
Examples:
|
||||
>>> import soundfile
|
||||
>>> speech2text = Speech2Text("asr_config.yml", "asr.pth")
|
||||
>>> audio, rate = soundfile.read("speech.wav")
|
||||
>>> speech2text(audio)
|
||||
[(text, token, token_int, hypothesis object), ...]
|
||||
|
||||
"""
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
asr_train_config: Union[Path, str] = None,
|
||||
asr_model_file: Union[Path, str] = None,
|
||||
cmvn_file: Union[Path, str] = None,
|
||||
lm_train_config: Union[Path, str] = None,
|
||||
lm_file: Union[Path, str] = None,
|
||||
token_type: str = None,
|
||||
bpemodel: str = None,
|
||||
device: str = "cpu",
|
||||
maxlenratio: float = 0.0,
|
||||
minlenratio: float = 0.0,
|
||||
dtype: str = "float32",
|
||||
beam_size: int = 20,
|
||||
ctc_weight: float = 0.5,
|
||||
lm_weight: float = 1.0,
|
||||
ngram_weight: float = 0.9,
|
||||
penalty: float = 0.0,
|
||||
nbest: int = 1,
|
||||
frontend_conf: dict = None,
|
||||
hotword_list_or_file: str = None,
|
||||
**kwargs,
|
||||
):
|
||||
assert check_argument_types()
|
||||
|
||||
# 1. Build ASR model
|
||||
scorers = {}
|
||||
asr_model, asr_train_args = ASRTask.build_model_from_file(
|
||||
asr_train_config, asr_model_file, cmvn_file, device
|
||||
)
|
||||
frontend = None
|
||||
if asr_train_args.frontend is not None and asr_train_args.frontend_conf is not None:
|
||||
frontend = WavFrontend(cmvn_file=cmvn_file, **asr_train_args.frontend_conf)
|
||||
|
||||
logging.info("asr_model: {}".format(asr_model))
|
||||
logging.info("asr_train_args: {}".format(asr_train_args))
|
||||
asr_model.to(dtype=getattr(torch, dtype)).eval()
|
||||
|
||||
if asr_model.ctc != None:
|
||||
ctc = CTCPrefixScorer(ctc=asr_model.ctc, eos=asr_model.eos)
|
||||
scorers.update(
|
||||
ctc=ctc
|
||||
)
|
||||
token_list = asr_model.token_list
|
||||
scorers.update(
|
||||
length_bonus=LengthBonus(len(token_list)),
|
||||
)
|
||||
|
||||
# 2. Build Language model
|
||||
if lm_train_config is not None:
|
||||
lm, lm_train_args = LMTask.build_model_from_file(
|
||||
lm_train_config, lm_file, device
|
||||
)
|
||||
scorers["lm"] = lm.lm
|
||||
|
||||
# 3. Build ngram model
|
||||
# ngram is not supported now
|
||||
ngram = None
|
||||
scorers["ngram"] = ngram
|
||||
|
||||
# 4. Build BeamSearch object
|
||||
# transducer is not supported now
|
||||
beam_search_transducer = None
|
||||
|
||||
weights = dict(
|
||||
decoder=1.0 - ctc_weight,
|
||||
ctc=ctc_weight,
|
||||
lm=lm_weight,
|
||||
ngram=ngram_weight,
|
||||
length_bonus=penalty,
|
||||
)
|
||||
beam_search = BeamSearch(
|
||||
beam_size=beam_size,
|
||||
weights=weights,
|
||||
scorers=scorers,
|
||||
sos=asr_model.sos,
|
||||
eos=asr_model.eos,
|
||||
vocab_size=len(token_list),
|
||||
token_list=token_list,
|
||||
pre_beam_score_key=None if ctc_weight == 1.0 else "full",
|
||||
)
|
||||
|
||||
beam_search.to(device=device, dtype=getattr(torch, dtype)).eval()
|
||||
for scorer in scorers.values():
|
||||
if isinstance(scorer, torch.nn.Module):
|
||||
scorer.to(device=device, dtype=getattr(torch, dtype)).eval()
|
||||
|
||||
logging.info(f"Decoding device={device}, dtype={dtype}")
|
||||
|
||||
# 5. [Optional] Build Text converter: e.g. bpe-sym -> Text
|
||||
if token_type is None:
|
||||
token_type = asr_train_args.token_type
|
||||
if bpemodel is None:
|
||||
bpemodel = asr_train_args.bpemodel
|
||||
|
||||
if token_type is None:
|
||||
tokenizer = None
|
||||
elif token_type == "bpe":
|
||||
if bpemodel is not None:
|
||||
tokenizer = build_tokenizer(token_type=token_type, bpemodel=bpemodel)
|
||||
else:
|
||||
tokenizer = None
|
||||
else:
|
||||
tokenizer = build_tokenizer(token_type=token_type)
|
||||
converter = TokenIDConverter(token_list=token_list)
|
||||
logging.info(f"Text tokenizer: {tokenizer}")
|
||||
|
||||
self.asr_model = asr_model
|
||||
self.asr_train_args = asr_train_args
|
||||
self.converter = converter
|
||||
self.tokenizer = tokenizer
|
||||
|
||||
# 6. [Optional] Build hotword list from str, local file or url
|
||||
|
||||
is_use_lm = lm_weight != 0.0 and lm_file is not None
|
||||
if (ctc_weight == 0.0 or asr_model.ctc == None) and not is_use_lm:
|
||||
beam_search = None
|
||||
self.beam_search = beam_search
|
||||
logging.info(f"Beam_search: {self.beam_search}")
|
||||
self.beam_search_transducer = beam_search_transducer
|
||||
self.maxlenratio = maxlenratio
|
||||
self.minlenratio = minlenratio
|
||||
self.device = device
|
||||
self.dtype = dtype
|
||||
self.nbest = nbest
|
||||
self.frontend = frontend
|
||||
self.encoder_downsampling_factor = 1
|
||||
if asr_train_args.encoder == "data2vec_encoder" or asr_train_args.encoder_conf["input_layer"] == "conv2d":
|
||||
self.encoder_downsampling_factor = 4
|
||||
|
||||
@torch.no_grad()
|
||||
def __call__(
|
||||
self, cache: dict, speech: Union[torch.Tensor, np.ndarray], speech_lengths: Union[torch.Tensor, np.ndarray] = None,
|
||||
begin_time: int = 0, end_time: int = None,
|
||||
):
|
||||
"""Inference
|
||||
|
||||
Args:
|
||||
speech: Input speech data
|
||||
Returns:
|
||||
text, token, token_int, hyp
|
||||
|
||||
"""
|
||||
assert check_argument_types()
|
||||
|
||||
# Input as audio signal
|
||||
if isinstance(speech, np.ndarray):
|
||||
speech = torch.tensor(speech)
|
||||
|
||||
if self.frontend is not None:
|
||||
feats, feats_len = self.frontend.forward(speech, speech_lengths)
|
||||
feats = to_device(feats, device=self.device)
|
||||
feats_len = feats_len.int()
|
||||
self.asr_model.frontend = None
|
||||
else:
|
||||
feats = speech
|
||||
feats_len = speech_lengths
|
||||
lfr_factor = max(1, (feats.size()[-1] // 80) - 1)
|
||||
batch = {"speech": feats, "speech_lengths": feats_len, "cache": cache}
|
||||
|
||||
# a. To device
|
||||
batch = to_device(batch, device=self.device)
|
||||
|
||||
# b. Forward Encoder
|
||||
enc, enc_len = self.asr_model.encode_chunk(**batch)
|
||||
if isinstance(enc, tuple):
|
||||
enc = enc[0]
|
||||
# assert len(enc) == 1, len(enc)
|
||||
enc_len_batch_total = torch.sum(enc_len).item() * self.encoder_downsampling_factor
|
||||
|
||||
predictor_outs = self.asr_model.calc_predictor_chunk(enc, cache)
|
||||
pre_acoustic_embeds, pre_token_length, alphas, pre_peak_index = predictor_outs[0], predictor_outs[1], \
|
||||
predictor_outs[2], predictor_outs[3]
|
||||
pre_token_length = pre_token_length.floor().long()
|
||||
if torch.max(pre_token_length) < 1:
|
||||
return []
|
||||
decoder_outs = self.asr_model.cal_decoder_with_predictor_chunk(enc, pre_acoustic_embeds, cache)
|
||||
decoder_out = decoder_outs
|
||||
|
||||
results = []
|
||||
b, n, d = decoder_out.size()
|
||||
for i in range(b):
|
||||
x = enc[i, :enc_len[i], :]
|
||||
am_scores = decoder_out[i, :pre_token_length[i], :]
|
||||
if self.beam_search is not None:
|
||||
nbest_hyps = self.beam_search(
|
||||
x=x, am_scores=am_scores, maxlenratio=self.maxlenratio, minlenratio=self.minlenratio
|
||||
)
|
||||
|
||||
nbest_hyps = nbest_hyps[: self.nbest]
|
||||
else:
|
||||
yseq = am_scores.argmax(dim=-1)
|
||||
score = am_scores.max(dim=-1)[0]
|
||||
score = torch.sum(score, dim=-1)
|
||||
# pad with mask tokens to ensure compatibility with sos/eos tokens
|
||||
yseq = torch.tensor(
|
||||
[self.asr_model.sos] + yseq.tolist() + [self.asr_model.eos], device=yseq.device
|
||||
)
|
||||
nbest_hyps = [Hypothesis(yseq=yseq, score=score)]
|
||||
|
||||
for hyp in nbest_hyps:
|
||||
assert isinstance(hyp, (Hypothesis)), type(hyp)
|
||||
|
||||
# remove sos/eos and get results
|
||||
last_pos = -1
|
||||
if isinstance(hyp.yseq, list):
|
||||
token_int = hyp.yseq[1:last_pos]
|
||||
else:
|
||||
token_int = hyp.yseq[1:last_pos].tolist()
|
||||
|
||||
# remove blank symbol id, which is assumed to be 0
|
||||
token_int = list(filter(lambda x: x != 0 and x != 2, token_int))
|
||||
|
||||
# Change integer-ids to tokens
|
||||
token = self.converter.ids2tokens(token_int)
|
||||
|
||||
if self.tokenizer is not None:
|
||||
text = self.tokenizer.tokens2text(token)
|
||||
else:
|
||||
text = None
|
||||
|
||||
results.append((text, token, token_int, hyp, enc_len_batch_total, lfr_factor))
|
||||
|
||||
# assert check_return_type(results)
|
||||
return results
|
||||
|
||||
|
||||
class Speech2TextExport:
|
||||
"""Speech2TextExport class
|
||||
|
||||
"""
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
asr_train_config: Union[Path, str] = None,
|
||||
asr_model_file: Union[Path, str] = None,
|
||||
cmvn_file: Union[Path, str] = None,
|
||||
lm_train_config: Union[Path, str] = None,
|
||||
lm_file: Union[Path, str] = None,
|
||||
token_type: str = None,
|
||||
bpemodel: str = None,
|
||||
device: str = "cpu",
|
||||
maxlenratio: float = 0.0,
|
||||
minlenratio: float = 0.0,
|
||||
dtype: str = "float32",
|
||||
beam_size: int = 20,
|
||||
ctc_weight: float = 0.5,
|
||||
lm_weight: float = 1.0,
|
||||
ngram_weight: float = 0.9,
|
||||
penalty: float = 0.0,
|
||||
nbest: int = 1,
|
||||
frontend_conf: dict = None,
|
||||
hotword_list_or_file: str = None,
|
||||
**kwargs,
|
||||
):
|
||||
|
||||
# 1. Build ASR model
|
||||
asr_model, asr_train_args = ASRTask.build_model_from_file(
|
||||
asr_train_config, asr_model_file, cmvn_file, device
|
||||
)
|
||||
frontend = None
|
||||
if asr_train_args.frontend is not None and asr_train_args.frontend_conf is not None:
|
||||
frontend = WavFrontend(cmvn_file=cmvn_file, **asr_train_args.frontend_conf)
|
||||
|
||||
logging.info("asr_model: {}".format(asr_model))
|
||||
logging.info("asr_train_args: {}".format(asr_train_args))
|
||||
asr_model.to(dtype=getattr(torch, dtype)).eval()
|
||||
|
||||
token_list = asr_model.token_list
|
||||
|
||||
logging.info(f"Decoding device={device}, dtype={dtype}")
|
||||
|
||||
# 5. [Optional] Build Text converter: e.g. bpe-sym -> Text
|
||||
if token_type is None:
|
||||
token_type = asr_train_args.token_type
|
||||
if bpemodel is None:
|
||||
bpemodel = asr_train_args.bpemodel
|
||||
|
||||
if token_type is None:
|
||||
tokenizer = None
|
||||
elif token_type == "bpe":
|
||||
if bpemodel is not None:
|
||||
tokenizer = build_tokenizer(token_type=token_type, bpemodel=bpemodel)
|
||||
else:
|
||||
tokenizer = None
|
||||
else:
|
||||
tokenizer = build_tokenizer(token_type=token_type)
|
||||
converter = TokenIDConverter(token_list=token_list)
|
||||
logging.info(f"Text tokenizer: {tokenizer}")
|
||||
|
||||
# self.asr_model = asr_model
|
||||
self.asr_train_args = asr_train_args
|
||||
self.converter = converter
|
||||
self.tokenizer = tokenizer
|
||||
|
||||
self.device = device
|
||||
self.dtype = dtype
|
||||
self.nbest = nbest
|
||||
self.frontend = frontend
|
||||
|
||||
model = Paraformer_export(asr_model, onnx=False)
|
||||
self.asr_model = model
|
||||
|
||||
@torch.no_grad()
|
||||
def __call__(
|
||||
self, speech: Union[torch.Tensor, np.ndarray], speech_lengths: Union[torch.Tensor, np.ndarray] = None
|
||||
):
|
||||
"""Inference
|
||||
|
||||
Args:
|
||||
speech: Input speech data
|
||||
Returns:
|
||||
text, token, token_int, hyp
|
||||
|
||||
"""
|
||||
assert check_argument_types()
|
||||
|
||||
# Input as audio signal
|
||||
if isinstance(speech, np.ndarray):
|
||||
speech = torch.tensor(speech)
|
||||
|
||||
if self.frontend is not None:
|
||||
feats, feats_len = self.frontend.forward(speech, speech_lengths)
|
||||
feats = to_device(feats, device=self.device)
|
||||
feats_len = feats_len.int()
|
||||
self.asr_model.frontend = None
|
||||
else:
|
||||
feats = speech
|
||||
feats_len = speech_lengths
|
||||
|
||||
enc_len_batch_total = feats_len.sum()
|
||||
lfr_factor = max(1, (feats.size()[-1] // 80) - 1)
|
||||
batch = {"speech": feats, "speech_lengths": feats_len}
|
||||
|
||||
# a. To device
|
||||
batch = to_device(batch, device=self.device)
|
||||
|
||||
decoder_outs = self.asr_model(**batch)
|
||||
decoder_out, ys_pad_lens = decoder_outs[0], decoder_outs[1]
|
||||
|
||||
results = []
|
||||
b, n, d = decoder_out.size()
|
||||
for i in range(b):
|
||||
am_scores = decoder_out[i, :ys_pad_lens[i], :]
|
||||
|
||||
yseq = am_scores.argmax(dim=-1)
|
||||
score = am_scores.max(dim=-1)[0]
|
||||
score = torch.sum(score, dim=-1)
|
||||
# pad with mask tokens to ensure compatibility with sos/eos tokens
|
||||
yseq = torch.tensor(
|
||||
yseq.tolist(), device=yseq.device
|
||||
)
|
||||
nbest_hyps = [Hypothesis(yseq=yseq, score=score)]
|
||||
|
||||
for hyp in nbest_hyps:
|
||||
assert isinstance(hyp, (Hypothesis)), type(hyp)
|
||||
|
||||
# remove sos/eos and get results
|
||||
last_pos = -1
|
||||
if isinstance(hyp.yseq, list):
|
||||
token_int = hyp.yseq[1:last_pos]
|
||||
else:
|
||||
token_int = hyp.yseq[1:last_pos].tolist()
|
||||
|
||||
# remove blank symbol id, which is assumed to be 0
|
||||
token_int = list(filter(lambda x: x != 0 and x != 2, token_int))
|
||||
|
||||
# Change integer-ids to tokens
|
||||
token = self.converter.ids2tokens(token_int)
|
||||
|
||||
if self.tokenizer is not None:
|
||||
text = self.tokenizer.tokens2text(token)
|
||||
else:
|
||||
text = None
|
||||
|
||||
results.append((text, token, token_int, hyp, enc_len_batch_total, lfr_factor))
|
||||
|
||||
return results
|
||||
|
||||
|
||||
def inference(
|
||||
maxlenratio: float,
|
||||
minlenratio: float,
|
||||
batch_size: int,
|
||||
beam_size: int,
|
||||
ngpu: int,
|
||||
ctc_weight: float,
|
||||
lm_weight: float,
|
||||
penalty: float,
|
||||
log_level: Union[int, str],
|
||||
data_path_and_name_and_type,
|
||||
asr_train_config: Optional[str],
|
||||
asr_model_file: Optional[str],
|
||||
cmvn_file: Optional[str] = None,
|
||||
raw_inputs: Union[np.ndarray, torch.Tensor] = None,
|
||||
lm_train_config: Optional[str] = None,
|
||||
lm_file: Optional[str] = None,
|
||||
token_type: Optional[str] = None,
|
||||
key_file: Optional[str] = None,
|
||||
word_lm_train_config: Optional[str] = None,
|
||||
bpemodel: Optional[str] = None,
|
||||
allow_variable_data_keys: bool = False,
|
||||
streaming: bool = False,
|
||||
output_dir: Optional[str] = None,
|
||||
dtype: str = "float32",
|
||||
seed: int = 0,
|
||||
ngram_weight: float = 0.9,
|
||||
nbest: int = 1,
|
||||
num_workers: int = 1,
|
||||
|
||||
**kwargs,
|
||||
):
|
||||
inference_pipeline = inference_modelscope(
|
||||
maxlenratio=maxlenratio,
|
||||
minlenratio=minlenratio,
|
||||
batch_size=batch_size,
|
||||
beam_size=beam_size,
|
||||
ngpu=ngpu,
|
||||
ctc_weight=ctc_weight,
|
||||
lm_weight=lm_weight,
|
||||
penalty=penalty,
|
||||
log_level=log_level,
|
||||
asr_train_config=asr_train_config,
|
||||
asr_model_file=asr_model_file,
|
||||
cmvn_file=cmvn_file,
|
||||
raw_inputs=raw_inputs,
|
||||
lm_train_config=lm_train_config,
|
||||
lm_file=lm_file,
|
||||
token_type=token_type,
|
||||
key_file=key_file,
|
||||
word_lm_train_config=word_lm_train_config,
|
||||
bpemodel=bpemodel,
|
||||
allow_variable_data_keys=allow_variable_data_keys,
|
||||
streaming=streaming,
|
||||
output_dir=output_dir,
|
||||
dtype=dtype,
|
||||
seed=seed,
|
||||
ngram_weight=ngram_weight,
|
||||
nbest=nbest,
|
||||
num_workers=num_workers,
|
||||
|
||||
**kwargs,
|
||||
)
|
||||
return inference_pipeline(data_path_and_name_and_type, raw_inputs)
|
||||
|
||||
|
||||
def inference_modelscope(
|
||||
maxlenratio: float,
|
||||
minlenratio: float,
|
||||
batch_size: int,
|
||||
beam_size: int,
|
||||
ngpu: int,
|
||||
ctc_weight: float,
|
||||
lm_weight: float,
|
||||
penalty: float,
|
||||
log_level: Union[int, str],
|
||||
# data_path_and_name_and_type,
|
||||
asr_train_config: Optional[str],
|
||||
asr_model_file: Optional[str],
|
||||
cmvn_file: Optional[str] = None,
|
||||
lm_train_config: Optional[str] = None,
|
||||
lm_file: Optional[str] = None,
|
||||
token_type: Optional[str] = None,
|
||||
key_file: Optional[str] = None,
|
||||
word_lm_train_config: Optional[str] = None,
|
||||
bpemodel: Optional[str] = None,
|
||||
allow_variable_data_keys: bool = False,
|
||||
dtype: str = "float32",
|
||||
seed: int = 0,
|
||||
ngram_weight: float = 0.9,
|
||||
nbest: int = 1,
|
||||
num_workers: int = 1,
|
||||
output_dir: Optional[str] = None,
|
||||
param_dict: dict = None,
|
||||
**kwargs,
|
||||
):
|
||||
assert check_argument_types()
|
||||
|
||||
if word_lm_train_config is not None:
|
||||
raise NotImplementedError("Word LM is not implemented")
|
||||
if ngpu > 1:
|
||||
raise NotImplementedError("only single GPU decoding is supported")
|
||||
|
||||
logging.basicConfig(
|
||||
level=log_level,
|
||||
format="%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s",
|
||||
)
|
||||
|
||||
export_mode = False
|
||||
if param_dict is not None:
|
||||
hotword_list_or_file = param_dict.get('hotword')
|
||||
export_mode = param_dict.get("export_mode", False)
|
||||
else:
|
||||
hotword_list_or_file = None
|
||||
|
||||
if ngpu >= 1 and torch.cuda.is_available():
|
||||
device = "cuda"
|
||||
else:
|
||||
device = "cpu"
|
||||
batch_size = 1
|
||||
|
||||
# 1. Set random-seed
|
||||
set_all_random_seed(seed)
|
||||
|
||||
# 2. Build speech2text
|
||||
speech2text_kwargs = dict(
|
||||
asr_train_config=asr_train_config,
|
||||
asr_model_file=asr_model_file,
|
||||
cmvn_file=cmvn_file,
|
||||
lm_train_config=lm_train_config,
|
||||
lm_file=lm_file,
|
||||
token_type=token_type,
|
||||
bpemodel=bpemodel,
|
||||
device=device,
|
||||
maxlenratio=maxlenratio,
|
||||
minlenratio=minlenratio,
|
||||
dtype=dtype,
|
||||
beam_size=beam_size,
|
||||
ctc_weight=ctc_weight,
|
||||
lm_weight=lm_weight,
|
||||
ngram_weight=ngram_weight,
|
||||
penalty=penalty,
|
||||
nbest=nbest,
|
||||
hotword_list_or_file=hotword_list_or_file,
|
||||
)
|
||||
if export_mode:
|
||||
speech2text = Speech2TextExport(**speech2text_kwargs)
|
||||
else:
|
||||
speech2text = Speech2Text(**speech2text_kwargs)
|
||||
|
||||
def _forward(
|
||||
data_path_and_name_and_type,
|
||||
raw_inputs: Union[np.ndarray, torch.Tensor] = None,
|
||||
output_dir_v2: Optional[str] = None,
|
||||
fs: dict = None,
|
||||
param_dict: dict = None,
|
||||
**kwargs,
|
||||
):
|
||||
|
||||
hotword_list_or_file = None
|
||||
if param_dict is not None:
|
||||
hotword_list_or_file = param_dict.get('hotword')
|
||||
if 'hotword' in kwargs:
|
||||
hotword_list_or_file = kwargs['hotword']
|
||||
if hotword_list_or_file is not None or 'hotword' in kwargs:
|
||||
speech2text.hotword_list = speech2text.generate_hotwords_list(hotword_list_or_file)
|
||||
|
||||
# 3. Build data-iterator
|
||||
if data_path_and_name_and_type is None and raw_inputs is not None:
|
||||
if isinstance(raw_inputs, torch.Tensor):
|
||||
raw_inputs = raw_inputs.numpy()
|
||||
data_path_and_name_and_type = [raw_inputs, "speech", "waveform"]
|
||||
loader = ASRTask.build_streaming_iterator(
|
||||
data_path_and_name_and_type,
|
||||
dtype=dtype,
|
||||
fs=fs,
|
||||
batch_size=batch_size,
|
||||
key_file=key_file,
|
||||
num_workers=num_workers,
|
||||
preprocess_fn=ASRTask.build_preprocess_fn(speech2text.asr_train_args, False),
|
||||
collate_fn=ASRTask.build_collate_fn(speech2text.asr_train_args, False),
|
||||
allow_variable_data_keys=allow_variable_data_keys,
|
||||
inference=True,
|
||||
)
|
||||
|
||||
if param_dict is not None:
|
||||
use_timestamp = param_dict.get('use_timestamp', True)
|
||||
else:
|
||||
use_timestamp = True
|
||||
|
||||
forward_time_total = 0.0
|
||||
length_total = 0.0
|
||||
finish_count = 0
|
||||
file_count = 1
|
||||
cache = None
|
||||
# 7 .Start for-loop
|
||||
# FIXME(kamo): The output format should be discussed about
|
||||
asr_result_list = []
|
||||
output_path = output_dir_v2 if output_dir_v2 is not None else output_dir
|
||||
if output_path is not None:
|
||||
writer = DatadirWriter(output_path)
|
||||
else:
|
||||
writer = None
|
||||
if param_dict is not None and "cache" in param_dict:
|
||||
cache = param_dict["cache"]
|
||||
for keys, batch in loader:
|
||||
assert isinstance(batch, dict), type(batch)
|
||||
assert all(isinstance(s, str) for s in keys), keys
|
||||
_bs = len(next(iter(batch.values())))
|
||||
assert len(keys) == _bs, f"{len(keys)} != {_bs}"
|
||||
# batch = {k: v for k, v in batch.items() if not k.endswith("_lengths")}
|
||||
logging.info("decoding, utt_id: {}".format(keys))
|
||||
# N-best list of (text, token, token_int, hyp_object)
|
||||
|
||||
time_beg = time.time()
|
||||
results = speech2text(cache=cache, **batch)
|
||||
if len(results) < 1:
|
||||
hyp = Hypothesis(score=0.0, scores={}, states={}, yseq=[])
|
||||
results = [[" ", ["sil"], [2], hyp, 10, 6]] * nbest
|
||||
time_end = time.time()
|
||||
forward_time = time_end - time_beg
|
||||
lfr_factor = results[0][-1]
|
||||
length = results[0][-2]
|
||||
forward_time_total += forward_time
|
||||
length_total += length
|
||||
rtf_cur = "decoding, feature length: {}, forward_time: {:.4f}, rtf: {:.4f}".format(length, forward_time,
|
||||
100 * forward_time / (
|
||||
length * lfr_factor))
|
||||
logging.info(rtf_cur)
|
||||
|
||||
for batch_id in range(_bs):
|
||||
result = [results[batch_id][:-2]]
|
||||
|
||||
key = keys[batch_id]
|
||||
for n, result in zip(range(1, nbest + 1), result):
|
||||
text, token, token_int, hyp = result[0], result[1], result[2], result[3]
|
||||
time_stamp = None if len(result) < 5 else result[4]
|
||||
# Create a directory: outdir/{n}best_recog
|
||||
if writer is not None:
|
||||
ibest_writer = writer[f"{n}best_recog"]
|
||||
|
||||
# Write the result to each file
|
||||
ibest_writer["token"][key] = " ".join(token)
|
||||
# ibest_writer["token_int"][key] = " ".join(map(str, token_int))
|
||||
ibest_writer["score"][key] = str(hyp.score)
|
||||
ibest_writer["rtf"][key] = rtf_cur
|
||||
|
||||
if text is not None:
|
||||
if use_timestamp and time_stamp is not None:
|
||||
postprocessed_result = postprocess_utils.sentence_postprocess(token, time_stamp)
|
||||
else:
|
||||
postprocessed_result = postprocess_utils.sentence_postprocess(token)
|
||||
time_stamp_postprocessed = ""
|
||||
if len(postprocessed_result) == 3:
|
||||
text_postprocessed, time_stamp_postprocessed, word_lists = postprocessed_result[0], \
|
||||
postprocessed_result[1], \
|
||||
postprocessed_result[2]
|
||||
else:
|
||||
text_postprocessed, word_lists = postprocessed_result[0], postprocessed_result[1]
|
||||
item = {'key': key, 'value': text_postprocessed}
|
||||
if time_stamp_postprocessed != "":
|
||||
item['time_stamp'] = time_stamp_postprocessed
|
||||
asr_result_list.append(item)
|
||||
finish_count += 1
|
||||
# asr_utils.print_progress(finish_count / file_count)
|
||||
if writer is not None:
|
||||
ibest_writer["text"][key] = text_postprocessed
|
||||
|
||||
logging.info("decoding, utt: {}, predictions: {}".format(key, text))
|
||||
rtf_avg = "decoding, feature length total: {}, forward_time total: {:.4f}, rtf avg: {:.4f}".format(length_total,
|
||||
forward_time_total,
|
||||
100 * forward_time_total / (
|
||||
length_total * lfr_factor))
|
||||
logging.info(rtf_avg)
|
||||
if writer is not None:
|
||||
ibest_writer["rtf"]["rtf_avf"] = rtf_avg
|
||||
return asr_result_list
|
||||
|
||||
return _forward
|
||||
|
||||
|
||||
def get_parser():
|
||||
parser = config_argparse.ArgumentParser(
|
||||
description="ASR Decoding",
|
||||
formatter_class=argparse.ArgumentDefaultsHelpFormatter,
|
||||
)
|
||||
|
||||
# Note(kamo): Use '_' instead of '-' as separator.
|
||||
# '-' is confusing if written in yaml.
|
||||
parser.add_argument(
|
||||
"--log_level",
|
||||
type=lambda x: x.upper(),
|
||||
default="INFO",
|
||||
choices=("CRITICAL", "ERROR", "WARNING", "INFO", "DEBUG", "NOTSET"),
|
||||
help="The verbose level of logging",
|
||||
)
|
||||
|
||||
parser.add_argument("--output_dir", type=str, required=True)
|
||||
parser.add_argument(
|
||||
"--ngpu",
|
||||
type=int,
|
||||
default=0,
|
||||
help="The number of gpus. 0 indicates CPU mode",
|
||||
)
|
||||
parser.add_argument("--seed", type=int, default=0, help="Random seed")
|
||||
parser.add_argument(
|
||||
"--dtype",
|
||||
default="float32",
|
||||
choices=["float16", "float32", "float64"],
|
||||
help="Data type",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--num_workers",
|
||||
type=int,
|
||||
default=1,
|
||||
help="The number of workers used for DataLoader",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--hotword",
|
||||
type=str_or_none,
|
||||
default=None,
|
||||
help="hotword file path or hotwords seperated by space"
|
||||
)
|
||||
group = parser.add_argument_group("Input data related")
|
||||
group.add_argument(
|
||||
"--data_path_and_name_and_type",
|
||||
type=str2triple_str,
|
||||
required=False,
|
||||
action="append",
|
||||
)
|
||||
group.add_argument("--key_file", type=str_or_none)
|
||||
group.add_argument("--allow_variable_data_keys", type=str2bool, default=False)
|
||||
|
||||
group = parser.add_argument_group("The model configuration related")
|
||||
group.add_argument(
|
||||
"--asr_train_config",
|
||||
type=str,
|
||||
help="ASR training configuration",
|
||||
)
|
||||
group.add_argument(
|
||||
"--asr_model_file",
|
||||
type=str,
|
||||
help="ASR model parameter file",
|
||||
)
|
||||
group.add_argument(
|
||||
"--cmvn_file",
|
||||
type=str,
|
||||
help="Global cmvn file",
|
||||
)
|
||||
group.add_argument(
|
||||
"--lm_train_config",
|
||||
type=str,
|
||||
help="LM training configuration",
|
||||
)
|
||||
group.add_argument(
|
||||
"--lm_file",
|
||||
type=str,
|
||||
help="LM parameter file",
|
||||
)
|
||||
group.add_argument(
|
||||
"--word_lm_train_config",
|
||||
type=str,
|
||||
help="Word LM training configuration",
|
||||
)
|
||||
group.add_argument(
|
||||
"--word_lm_file",
|
||||
type=str,
|
||||
help="Word LM parameter file",
|
||||
)
|
||||
group.add_argument(
|
||||
"--ngram_file",
|
||||
type=str,
|
||||
help="N-gram parameter file",
|
||||
)
|
||||
group.add_argument(
|
||||
"--model_tag",
|
||||
type=str,
|
||||
help="Pretrained model tag. If specify this option, *_train_config and "
|
||||
"*_file will be overwritten",
|
||||
)
|
||||
|
||||
group = parser.add_argument_group("Beam-search related")
|
||||
group.add_argument(
|
||||
"--batch_size",
|
||||
type=int,
|
||||
default=1,
|
||||
help="The batch size for inference",
|
||||
)
|
||||
group.add_argument("--nbest", type=int, default=1, help="Output N-best hypotheses")
|
||||
group.add_argument("--beam_size", type=int, default=20, help="Beam size")
|
||||
group.add_argument("--penalty", type=float, default=0.0, help="Insertion penalty")
|
||||
group.add_argument(
|
||||
"--maxlenratio",
|
||||
type=float,
|
||||
default=0.0,
|
||||
help="Input length ratio to obtain max output length. "
|
||||
"If maxlenratio=0.0 (default), it uses a end-detect "
|
||||
"function "
|
||||
"to automatically find maximum hypothesis lengths."
|
||||
"If maxlenratio<0.0, its absolute value is interpreted"
|
||||
"as a constant max output length",
|
||||
)
|
||||
group.add_argument(
|
||||
"--minlenratio",
|
||||
type=float,
|
||||
default=0.0,
|
||||
help="Input length ratio to obtain min output length",
|
||||
)
|
||||
group.add_argument(
|
||||
"--ctc_weight",
|
||||
type=float,
|
||||
default=0.5,
|
||||
help="CTC weight in joint decoding",
|
||||
)
|
||||
group.add_argument("--lm_weight", type=float, default=1.0, help="RNNLM weight")
|
||||
group.add_argument("--ngram_weight", type=float, default=0.9, help="ngram weight")
|
||||
group.add_argument("--streaming", type=str2bool, default=False)
|
||||
|
||||
group.add_argument(
|
||||
"--frontend_conf",
|
||||
default=None,
|
||||
help="",
|
||||
)
|
||||
group.add_argument("--raw_inputs", type=list, default=None)
|
||||
# example=[{'key':'EdevDEWdIYQ_0021','file':'/mnt/data/jiangyu.xzy/test_data/speech_io/SPEECHIO_ASR_ZH00007_zhibodaihuo/wav/EdevDEWdIYQ_0021.wav'}])
|
||||
|
||||
group = parser.add_argument_group("Text converter related")
|
||||
group.add_argument(
|
||||
"--token_type",
|
||||
type=str_or_none,
|
||||
default=None,
|
||||
choices=["char", "bpe", None],
|
||||
help="The token type for ASR model. "
|
||||
"If not given, refers from the training args",
|
||||
)
|
||||
group.add_argument(
|
||||
"--bpemodel",
|
||||
type=str_or_none,
|
||||
default=None,
|
||||
help="The model path of sentencepiece. "
|
||||
"If not given, refers from the training args",
|
||||
)
|
||||
|
||||
return parser
|
||||
|
||||
|
||||
def main(cmd=None):
|
||||
print(get_commandline_args(), file=sys.stderr)
|
||||
parser = get_parser()
|
||||
args = parser.parse_args(cmd)
|
||||
param_dict = {'hotword': args.hotword}
|
||||
kwargs = vars(args)
|
||||
kwargs.pop("config", None)
|
||||
kwargs['param_dict'] = param_dict
|
||||
inference(**kwargs)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
|
||||
# from modelscope.pipelines import pipeline
|
||||
# from modelscope.utils.constant import Tasks
|
||||
#
|
||||
# inference_16k_pipline = pipeline(
|
||||
# task=Tasks.auto_speech_recognition,
|
||||
# model='damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch')
|
||||
#
|
||||
# rec_result = inference_16k_pipline(audio_in='https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/asr_example_zh.wav')
|
||||
# print(rec_result)
|
||||
|
||||
@ -58,7 +58,7 @@ class Speech2Text:
|
||||
|
||||
Examples:
|
||||
>>> import soundfile
|
||||
>>> speech2text = Speech2Text("asr_config.yml", "asr.pth")
|
||||
>>> speech2text = Speech2Text("asr_config.yml", "asr.pb")
|
||||
>>> audio, rate = soundfile.read("speech.wav")
|
||||
>>> speech2text(audio)
|
||||
[(text, token, token_int, hypothesis object), ...]
|
||||
|
||||
@ -46,7 +46,7 @@ class Speech2Text:
|
||||
|
||||
Examples:
|
||||
>>> import soundfile
|
||||
>>> speech2text = Speech2Text("asr_config.yml", "asr.pth")
|
||||
>>> speech2text = Speech2Text("asr_config.yml", "asr.pb")
|
||||
>>> audio, rate = soundfile.read("speech.wav")
|
||||
>>> speech2text(audio)
|
||||
[(text, token, token_int, hypothesis object), ...]
|
||||
|
||||
@ -46,7 +46,7 @@ class Speech2Text:
|
||||
|
||||
Examples:
|
||||
>>> import soundfile
|
||||
>>> speech2text = Speech2Text("asr_config.yml", "asr.pth")
|
||||
>>> speech2text = Speech2Text("asr_config.yml", "asr.pb")
|
||||
>>> audio, rate = soundfile.read("speech.wav")
|
||||
>>> speech2text(audio)
|
||||
[(text, token, token_int, hypothesis object), ...]
|
||||
|
||||
@ -133,7 +133,7 @@ def inference_launch(mode, **kwargs):
|
||||
param_dict = {
|
||||
"extract_profile": True,
|
||||
"sv_train_config": "sv.yaml",
|
||||
"sv_model_file": "sv.pth",
|
||||
"sv_model_file": "sv.pb",
|
||||
}
|
||||
if "param_dict" in kwargs and kwargs["param_dict"] is not None:
|
||||
for key in param_dict:
|
||||
@ -142,6 +142,9 @@ def inference_launch(mode, **kwargs):
|
||||
else:
|
||||
kwargs["param_dict"] = param_dict
|
||||
return inference_modelscope(mode=mode, **kwargs)
|
||||
elif mode == "eend-ola":
|
||||
from funasr.bin.eend_ola_inference import inference_modelscope
|
||||
return inference_modelscope(mode=mode, **kwargs)
|
||||
else:
|
||||
logging.info("Unknown decoding mode: {}".format(mode))
|
||||
return None
|
||||
|
||||
@ -16,6 +16,7 @@ from typing import Union
|
||||
|
||||
import numpy as np
|
||||
import torch
|
||||
from scipy.signal import medfilt
|
||||
from typeguard import check_argument_types
|
||||
|
||||
from funasr.models.frontend.wav_frontend import WavFrontendMel23
|
||||
@ -34,7 +35,7 @@ class Speech2Diarization:
|
||||
Examples:
|
||||
>>> import soundfile
|
||||
>>> import numpy as np
|
||||
>>> speech2diar = Speech2Diarization("diar_sond_config.yml", "diar_sond.pth")
|
||||
>>> speech2diar = Speech2Diarization("diar_sond_config.yml", "diar_sond.pb")
|
||||
>>> profile = np.load("profiles.npy")
|
||||
>>> audio, rate = soundfile.read("speech.wav")
|
||||
>>> speech2diar(audio, profile)
|
||||
@ -146,7 +147,7 @@ def inference_modelscope(
|
||||
output_dir: Optional[str] = None,
|
||||
batch_size: int = 1,
|
||||
dtype: str = "float32",
|
||||
ngpu: int = 0,
|
||||
ngpu: int = 1,
|
||||
num_workers: int = 0,
|
||||
log_level: Union[int, str] = "INFO",
|
||||
key_file: Optional[str] = None,
|
||||
@ -179,7 +180,6 @@ def inference_modelscope(
|
||||
diar_model_file=diar_model_file,
|
||||
device=device,
|
||||
dtype=dtype,
|
||||
streaming=streaming,
|
||||
)
|
||||
logging.info("speech2diarization_kwargs: {}".format(speech2diar_kwargs))
|
||||
speech2diar = Speech2Diarization.from_pretrained(
|
||||
@ -209,7 +209,7 @@ def inference_modelscope(
|
||||
if data_path_and_name_and_type is None and raw_inputs is not None:
|
||||
if isinstance(raw_inputs, torch.Tensor):
|
||||
raw_inputs = raw_inputs.numpy()
|
||||
data_path_and_name_and_type = [raw_inputs, "speech", "waveform"]
|
||||
data_path_and_name_and_type = [raw_inputs[0], "speech", "sound"]
|
||||
loader = EENDOLADiarTask.build_streaming_iterator(
|
||||
data_path_and_name_and_type,
|
||||
dtype=dtype,
|
||||
@ -236,9 +236,23 @@ def inference_modelscope(
|
||||
# batch = {k: v[0] for k, v in batch.items() if not k.endswith("_lengths")}
|
||||
|
||||
results = speech2diar(**batch)
|
||||
|
||||
# post process
|
||||
a = results[0][0].cpu().numpy()
|
||||
a = medfilt(a, (11, 1))
|
||||
rst = []
|
||||
for spkid, frames in enumerate(a.T):
|
||||
frames = np.pad(frames, (1, 1), 'constant')
|
||||
changes, = np.where(np.diff(frames, axis=0) != 0)
|
||||
fmt = "SPEAKER {:s} 1 {:7.2f} {:7.2f} <NA> <NA> {:s} <NA>"
|
||||
for s, e in zip(changes[::2], changes[1::2]):
|
||||
st = s / 10.
|
||||
dur = (e - s) / 10.
|
||||
rst.append(fmt.format(keys[0], st, dur, "{}_{}".format(keys[0], str(spkid))))
|
||||
|
||||
# Only supporting batch_size==1
|
||||
key, value = keys[0], output_results_str(results, keys[0])
|
||||
item = {"key": key, "value": value}
|
||||
value = "\n".join(rst)
|
||||
item = {"key": keys[0], "value": value}
|
||||
result_list.append(item)
|
||||
if output_path is not None:
|
||||
output_writer.write(value)
|
||||
|
||||
@ -42,7 +42,7 @@ class Speech2Diarization:
|
||||
Examples:
|
||||
>>> import soundfile
|
||||
>>> import numpy as np
|
||||
>>> speech2diar = Speech2Diarization("diar_sond_config.yml", "diar_sond.pth")
|
||||
>>> speech2diar = Speech2Diarization("diar_sond_config.yml", "diar_sond.pb")
|
||||
>>> profile = np.load("profiles.npy")
|
||||
>>> audio, rate = soundfile.read("speech.wav")
|
||||
>>> speech2diar(audio, profile)
|
||||
@ -54,7 +54,7 @@ class Speech2Diarization:
|
||||
self,
|
||||
diar_train_config: Union[Path, str] = None,
|
||||
diar_model_file: Union[Path, str] = None,
|
||||
device: str = "cpu",
|
||||
device: Union[str, torch.device] = "cpu",
|
||||
batch_size: int = 1,
|
||||
dtype: str = "float32",
|
||||
streaming: bool = False,
|
||||
@ -114,9 +114,19 @@ class Speech2Diarization:
|
||||
# little-endian order: lower bit first
|
||||
return (np.array(list(b)[::-1]) == '1').astype(dtype)
|
||||
|
||||
return np.row_stack([int2vec(int(x), vec_dim) for x in seq])
|
||||
# process oov
|
||||
seq = np.array([int(x) for x in seq])
|
||||
new_seq = []
|
||||
for i, x in enumerate(seq):
|
||||
if x < 2 ** vec_dim:
|
||||
new_seq.append(x)
|
||||
else:
|
||||
idx_list = np.where(seq < 2 ** vec_dim)[0]
|
||||
idx = np.abs(idx_list - i).argmin()
|
||||
new_seq.append(seq[idx_list[idx]])
|
||||
return np.row_stack([int2vec(x, vec_dim) for x in new_seq])
|
||||
|
||||
def post_processing(self, raw_logits: torch.Tensor, spk_num: int):
|
||||
def post_processing(self, raw_logits: torch.Tensor, spk_num: int, output_format: str = "speaker_turn"):
|
||||
logits_idx = raw_logits.argmax(-1) # B, T, vocab_size -> B, T
|
||||
# upsampling outputs to match inputs
|
||||
ut = logits_idx.shape[1] * self.diar_model.encoder.time_ds_ratio
|
||||
@ -127,8 +137,14 @@ class Speech2Diarization:
|
||||
).squeeze(1).long()
|
||||
logits_idx = logits_idx[0].tolist()
|
||||
pse_labels = [self.token_list[x] for x in logits_idx]
|
||||
if output_format == "pse_labels":
|
||||
return pse_labels, None
|
||||
|
||||
multi_labels = self.seq2arr(pse_labels, spk_num)[:, :spk_num] # remove padding speakers
|
||||
multi_labels = self.smooth_multi_labels(multi_labels)
|
||||
if output_format == "binary_labels":
|
||||
return multi_labels, None
|
||||
|
||||
spk_list = ["spk{}".format(i + 1) for i in range(spk_num)]
|
||||
spk_turns = self.calc_spk_turns(multi_labels, spk_list)
|
||||
results = OrderedDict()
|
||||
@ -149,6 +165,7 @@ class Speech2Diarization:
|
||||
self,
|
||||
speech: Union[torch.Tensor, np.ndarray],
|
||||
profile: Union[torch.Tensor, np.ndarray],
|
||||
output_format: str = "speaker_turn"
|
||||
):
|
||||
"""Inference
|
||||
|
||||
@ -178,7 +195,7 @@ class Speech2Diarization:
|
||||
batch = to_device(batch, device=self.device)
|
||||
|
||||
logits = self.diar_model.prediction_forward(**batch)
|
||||
results, pse_labels = self.post_processing(logits, profile.shape[1])
|
||||
results, pse_labels = self.post_processing(logits, profile.shape[1], output_format)
|
||||
|
||||
return results, pse_labels
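A brief usage sketch of the `output_format` switch added above, mirroring the class docstring; the config, model, and profile file names are the placeholders used there:

```python
# Hedged usage sketch of the output_format option handled in post_processing above.
# Speech2Diarization is the class defined in this file; run this from the same module/context.
import numpy as np
import soundfile

speech2diar = Speech2Diarization("diar_sond_config.yml", "diar_sond.pb")
profile = np.load("profiles.npy")          # speaker profile matrix, as in the docstring
audio, rate = soundfile.read("speech.wav")

# "speaker_turn" (default), "pse_labels", or "binary_labels"
results, _ = speech2diar(audio, profile, output_format="binary_labels")
print(results)
```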
|
||||
|
||||
@ -367,7 +384,7 @@ def inference_modelscope(
|
||||
pse_label_writer = open("{}/labels.txt".format(output_path), "w")
|
||||
logging.info("Start to diarize...")
|
||||
result_list = []
|
||||
for keys, batch in loader:
|
||||
for idx, (keys, batch) in enumerate(loader):
|
||||
assert isinstance(batch, dict), type(batch)
|
||||
assert all(isinstance(s, str) for s in keys), keys
|
||||
_bs = len(next(iter(batch.values())))
|
||||
@ -385,6 +402,9 @@ def inference_modelscope(
|
||||
pse_label_writer.write("{} {}\n".format(key, " ".join(pse_labels)))
|
||||
pse_label_writer.flush()
|
||||
|
||||
if idx % 100 == 0:
|
||||
logging.info("Processing {:5d}: {}".format(idx, key))
|
||||
|
||||
if output_path is not None:
|
||||
output_writer.close()
|
||||
pse_label_writer.close()
|
||||
|
||||
@ -36,7 +36,7 @@ class Speech2Xvector:
|
||||
|
||||
Examples:
|
||||
>>> import soundfile
|
||||
>>> speech2xvector = Speech2Xvector("sv_config.yml", "sv.pth")
|
||||
>>> speech2xvector = Speech2Xvector("sv_config.yml", "sv.pb")
|
||||
>>> audio, rate = soundfile.read("speech.wav")
|
||||
>>> speech2xvector(audio)
|
||||
[(text, token, token_int, hypothesis object), ...]
|
||||
@ -169,7 +169,7 @@ def inference_modelscope(
|
||||
log_level: Union[int, str] = "INFO",
|
||||
key_file: Optional[str] = None,
|
||||
sv_train_config: Optional[str] = "sv.yaml",
|
||||
sv_model_file: Optional[str] = "sv.pth",
|
||||
sv_model_file: Optional[str] = "sv.pb",
|
||||
model_tag: Optional[str] = None,
|
||||
allow_variable_data_keys: bool = True,
|
||||
streaming: bool = False,
|
||||
|
||||
@ -8,6 +8,7 @@ from typing import Dict
|
||||
from typing import Iterator
|
||||
from typing import Tuple
|
||||
from typing import Union
|
||||
from typing import List
|
||||
|
||||
import kaldiio
|
||||
import numpy as np
|
||||
@ -129,7 +130,7 @@ class IterableESPnetDataset(IterableDataset):
|
||||
non_iterable_list = []
|
||||
self.path_name_type_list = []
|
||||
|
||||
if not isinstance(path_name_type_list[0], Tuple):
|
||||
if not isinstance(path_name_type_list[0], (Tuple, List)):
|
||||
path = path_name_type_list[0]
|
||||
name = path_name_type_list[1]
|
||||
_type = path_name_type_list[2]
|
||||
@ -227,13 +228,9 @@ class IterableESPnetDataset(IterableDataset):
|
||||
name = self.path_name_type_list[i][1]
|
||||
_type = self.path_name_type_list[i][2]
|
||||
if _type == "sound":
|
||||
audio_type = os.path.basename(value).split(".")[-1].lower()
|
||||
if audio_type not in SUPPORT_AUDIO_TYPE_SETS:
|
||||
raise NotImplementedError(
|
||||
f'Not supported audio type: {audio_type}')
|
||||
if audio_type == "pcm":
|
||||
_type = "pcm"
|
||||
|
||||
audio_type = os.path.basename(value).lower()
|
||||
if audio_type.rfind(".pcm") >= 0:
|
||||
_type = "pcm"
|
||||
func = DATA_TYPES[_type]
|
||||
array = func(value)
|
||||
if self.fs is not None and (name == "speech" or name == "ref_speech"):
|
||||
@ -335,11 +332,8 @@ class IterableESPnetDataset(IterableDataset):
|
||||
# 2.a. Load data streamingly
|
||||
for value, (path, name, _type) in zip(values, self.path_name_type_list):
|
||||
if _type == "sound":
|
||||
audio_type = os.path.basename(value).split(".")[-1].lower()
|
||||
if audio_type not in SUPPORT_AUDIO_TYPE_SETS:
|
||||
raise NotImplementedError(
|
||||
f'Not supported audio type: {audio_type}')
|
||||
if audio_type == "pcm":
|
||||
audio_type = os.path.basename(value).lower()
|
||||
if audio_type.rfind(".pcm") >= 0:
|
||||
_type = "pcm"
|
||||
func = DATA_TYPES[_type]
|
||||
# Load entry
|
||||
@ -391,3 +385,4 @@ class IterableESPnetDataset(IterableDataset):
|
||||
|
||||
if count == 0:
|
||||
raise RuntimeError("No iteration")
|
||||
|
||||
|
||||
@ -18,15 +18,11 @@ def forward_segment(text, seg_dict):
|
||||
|
||||
def seg_tokenize(txt, seg_dict):
|
||||
out_txt = ""
|
||||
pattern = re.compile(r"([\u4E00-\u9FA5A-Za-z0-9])")
|
||||
for word in txt:
|
||||
if pattern.match(word):
|
||||
if word in seg_dict:
|
||||
out_txt += seg_dict[word] + " "
|
||||
else:
|
||||
out_txt += "<unk>" + " "
|
||||
if word in seg_dict:
|
||||
out_txt += seg_dict[word] + " "
|
||||
else:
|
||||
continue
|
||||
out_txt += "<unk>" + " "
|
||||
return out_txt.strip().split()
|
||||
|
||||
def tokenize(data,
|
||||
|
||||
@ -47,15 +47,11 @@ def forward_segment(text, dic):
|
||||
|
||||
def seg_tokenize(txt, seg_dict):
|
||||
out_txt = ""
|
||||
pattern = re.compile(r"([\u4E00-\u9FA5A-Za-z0-9])")
|
||||
for word in txt:
|
||||
if pattern.match(word):
|
||||
if word in seg_dict:
|
||||
out_txt += seg_dict[word] + " "
|
||||
else:
|
||||
out_txt += "<unk>" + " "
|
||||
if word in seg_dict:
|
||||
out_txt += seg_dict[word] + " "
|
||||
else:
|
||||
continue
|
||||
out_txt += "<unk>" + " "
|
||||
return out_txt.strip().split()
|
||||
|
||||
def seg_tokenize_wo_pattern(txt, seg_dict):
|
||||
|
||||
@@ -2,6 +2,8 @@
## Environments
torch >= 1.11.0
modelscope >= 1.2.0
torch-quant >= 0.4.0 (required for exporting quantized torchscript format model)
# pip install torch-quant -i https://pypi.org/simple

## Install modelscope and funasr

@@ -11,31 +13,46 @@ The installation is the same as [funasr](../../README.md)
`Tips`: torch>=1.11.0

```shell
python -m funasr.export.export_model [model_name] [export_dir] [onnx]
python -m funasr.export.export_model \
    --model-name [model_name] \
    --export-dir [export_dir] \
    --type [onnx, torch] \
    --quantize [true, false] \
    --fallback-num [fallback_num]
```
`model_name`: the model to export. It could be a model from ModelScope, or a local finetuned model (named `model.pb`).
`export_dir`: the dir where the onnx model is exported.
`onnx`: `true`, export an onnx format model; `false`, export a torchscript format model.
`model-name`: the model to export. It could be a model from ModelScope, or a local finetuned model (named `model.pb`).

`export-dir`: the dir where the model is exported.

`type`: `onnx` or `torch`, export an onnx format model or a torchscript format model.

`quantize`: `true`, export a quantized model at the same time; `false`, export the fp32 model only.

`fallback-num`: the number of fallback layers to use when performing automatic mixed-precision quantization.


## For example
### Export onnx format model
Export a model from ModelScope
```shell
python -m funasr.export.export_model 'damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch' "./export" true
python -m funasr.export.export_model --model-name damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch --export-dir ./export --type onnx
```
Export a model from a local path; the model file must be named `model.pb`.
```shell
python -m funasr.export.export_model '/mnt/workspace/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch' "./export" true
python -m funasr.export.export_model --model-name /mnt/workspace/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch --export-dir ./export --type onnx
```

### Export torchscript format model
Export a model from ModelScope
```shell
python -m funasr.export.export_model 'damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch' "./export" false
python -m funasr.export.export_model --model-name damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch --export-dir ./export --type torch
```

Export a model from a local path; the model file must be named `model.pb`.
```shell
python -m funasr.export.export_model '/mnt/workspace/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch' "./export" false
python -m funasr.export.export_model --model-name /mnt/workspace/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch --export-dir ./export --type torch
```
|
||||
|
||||
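The exported torchscript file can be checked the same way by loading it back with `torch.jit.load`. The file name below follows the `{model_name}.torchscripts` / `{model_name}_quant.torchscripts` pattern used by the exporter; the dummy shapes are assumptions.

```python
# Minimal sketch: load and run the exported (optionally quantized) torchscript model.
import torch

model = torch.jit.load("./export/model.torchscripts", map_location="cpu")  # or model_quant.torchscripts
model.eval()

feats = torch.randn(1, 100, 560)                     # dummy LFR fbank features
feats_len = torch.tensor([100], dtype=torch.int32)   # may need int64

with torch.no_grad():
    outputs = model(feats, feats_len)
print(type(outputs))
```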
## Acknowledge
Torch model quantization is supported by [BladeDISC](https://github.com/alibaba/BladeDISC), an end-to-end DynamIc Shape Compiler project for machine learning workloads. BladeDISC provides general, transparent, and easy-to-use performance optimization for TensorFlow/PyTorch workloads on GPGPU and CPU backends. If you are interested, please contact us.

@ -10,12 +10,20 @@ import torch
|
||||
from funasr.export.models import get_model
|
||||
import numpy as np
|
||||
import random
|
||||
|
||||
from funasr.utils.types import str2bool
|
||||
# torch_version = float(".".join(torch.__version__.split(".")[:2]))
|
||||
# assert torch_version > 1.9
|
||||
|
||||
class ASRModelExportParaformer:
|
||||
def __init__(self, cache_dir: Union[Path, str] = None, onnx: bool = True):
|
||||
def __init__(
|
||||
self,
|
||||
cache_dir: Union[Path, str] = None,
|
||||
onnx: bool = True,
|
||||
quant: bool = True,
|
||||
fallback_num: int = 0,
|
||||
audio_in: str = None,
|
||||
calib_num: int = 200,
|
||||
):
|
||||
assert check_argument_types()
|
||||
self.set_all_random_seed(0)
|
||||
if cache_dir is None:
|
||||
@ -28,6 +36,11 @@ class ASRModelExportParaformer:
|
||||
)
|
||||
print("output dir: {}".format(self.cache_dir))
|
||||
self.onnx = onnx
|
||||
self.quant = quant
|
||||
self.fallback_num = fallback_num
|
||||
self.frontend = None
|
||||
self.audio_in = audio_in
|
||||
self.calib_num = calib_num
|
||||
|
||||
|
||||
def _export(
|
||||
@ -56,6 +69,43 @@ class ASRModelExportParaformer:
|
||||
print("output dir: {}".format(export_dir))
|
||||
|
||||
|
||||
def _torch_quantize(self, model):
|
||||
def _run_calibration_data(m):
|
||||
# use dummy inputs as an example
|
||||
if self.audio_in is not None:
|
||||
feats, feats_len = self.load_feats(self.audio_in)
|
||||
for i, (feat, len) in enumerate(zip(feats, feats_len)):
|
||||
with torch.no_grad():
|
||||
m(feat, len)
|
||||
else:
|
||||
dummy_input = model.get_dummy_inputs()
|
||||
m(*dummy_input)
|
||||
|
||||
|
||||
from torch_quant.module import ModuleFilter
|
||||
from torch_quant.quantizer import Backend, Quantizer
|
||||
from funasr.export.models.modules.decoder_layer import DecoderLayerSANM
|
||||
from funasr.export.models.modules.encoder_layer import EncoderLayerSANM
|
||||
module_filter = ModuleFilter(include_classes=[EncoderLayerSANM, DecoderLayerSANM])
|
||||
module_filter.exclude_op_types = [torch.nn.Conv1d]
|
||||
quantizer = Quantizer(
|
||||
module_filter=module_filter,
|
||||
backend=Backend.FBGEMM,
|
||||
)
|
||||
model.eval()
|
||||
calib_model = quantizer.calib(model)
|
||||
_run_calibration_data(calib_model)
|
||||
if self.fallback_num > 0:
|
||||
# perform automatic mixed precision quantization
|
||||
amp_model = quantizer.amp(model)
|
||||
_run_calibration_data(amp_model)
|
||||
quantizer.fallback(amp_model, num=self.fallback_num)
|
||||
print('Fallback layers:')
|
||||
print('\n'.join(quantizer.module_filter.exclude_names))
|
||||
quant_model = quantizer.quantize(model)
|
||||
return quant_model
|
||||
|
||||
|
||||
def _export_torchscripts(self, model, verbose, path, enc_size=None):
|
||||
if enc_size:
|
||||
dummy_input = model.get_dummy_inputs(enc_size)
|
||||
@ -66,10 +116,49 @@ class ASRModelExportParaformer:
|
||||
model_script = torch.jit.trace(model, dummy_input)
|
||||
model_script.save(os.path.join(path, f'{model.model_name}.torchscripts'))
|
||||
|
||||
if self.quant:
|
||||
quant_model = self._torch_quantize(model)
|
||||
model_script = torch.jit.trace(quant_model, dummy_input)
|
||||
model_script.save(os.path.join(path, f'{model.model_name}_quant.torchscripts'))
|
||||
|
||||
|
||||
def set_all_random_seed(self, seed: int):
|
||||
random.seed(seed)
|
||||
np.random.seed(seed)
|
||||
torch.random.manual_seed(seed)
|
||||
|
||||
def parse_audio_in(self, audio_in):
|
||||
|
||||
wav_list, name_list = [], []
|
||||
if audio_in.endswith(".scp"):
|
||||
f = open(audio_in, 'r')
|
||||
lines = f.readlines()[:self.calib_num]
|
||||
for line in lines:
|
||||
name, path = line.strip().split()
|
||||
name_list.append(name)
|
||||
wav_list.append(path)
|
||||
else:
|
||||
wav_list = [audio_in,]
|
||||
name_list = ["test",]
|
||||
return wav_list, name_list
|
||||
|
||||
def load_feats(self, audio_in: str = None):
|
||||
import torchaudio
|
||||
|
||||
wav_list, name_list = self.parse_audio_in(audio_in)
|
||||
feats = []
|
||||
feats_len = []
|
||||
for line in wav_list:
|
||||
path = line.strip()
|
||||
waveform, sampling_rate = torchaudio.load(path)
|
||||
if sampling_rate != self.frontend.fs:
|
||||
waveform = torchaudio.transforms.Resample(orig_freq=sampling_rate,
|
||||
new_freq=self.frontend.fs)(waveform)
|
||||
fbank, fbank_len = self.frontend(waveform, [waveform.size(1)])
|
||||
feats.append(fbank)
|
||||
feats_len.append(fbank_len)
|
||||
return feats, feats_len
|
||||
|
||||
def export(self,
|
||||
tag_name: str = 'damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch',
|
||||
mode: str = 'paraformer',
|
||||
@ -96,6 +185,7 @@ class ASRModelExportParaformer:
|
||||
model, asr_train_args = ASRTask.build_model_from_file(
|
||||
asr_train_config, asr_model_file, cmvn_file, 'cpu'
|
||||
)
|
||||
self.frontend = model.frontend
|
||||
self._export(model, tag_name)
|
||||
|
||||
|
||||
@ -107,11 +197,12 @@ class ASRModelExportParaformer:
|
||||
|
||||
# model_script = torch.jit.script(model)
|
||||
model_script = model #torch.jit.trace(model)
|
||||
model_path = os.path.join(path, f'{model.model_name}.onnx')
|
||||
|
||||
torch.onnx.export(
|
||||
model_script,
|
||||
dummy_input,
|
||||
os.path.join(path, f'{model.model_name}.onnx'),
|
||||
model_path,
|
||||
verbose=verbose,
|
||||
opset_version=14,
|
||||
input_names=model.get_input_names(),
|
||||
@ -119,17 +210,42 @@ class ASRModelExportParaformer:
|
||||
dynamic_axes=model.get_dynamic_axes()
|
||||
)
|
||||
|
||||
if self.quant:
|
||||
from onnxruntime.quantization import QuantType, quantize_dynamic
|
||||
import onnx
|
||||
quant_model_path = os.path.join(path, f'{model.model_name}_quant.onnx')
|
||||
onnx_model = onnx.load(model_path)
|
||||
nodes = [n.name for n in onnx_model.graph.node]
|
||||
nodes_to_exclude = [m for m in nodes if 'output' in m]
|
||||
quantize_dynamic(
|
||||
model_input=model_path,
|
||||
model_output=quant_model_path,
|
||||
op_types_to_quantize=['MatMul'],
|
||||
per_channel=True,
|
||||
reduce_range=False,
|
||||
weight_type=QuantType.QUInt8,
|
||||
nodes_to_exclude=nodes_to_exclude,
|
||||
)
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
import sys
|
||||
|
||||
model_path = sys.argv[1]
|
||||
output_dir = sys.argv[2]
|
||||
onnx = sys.argv[3]
|
||||
onnx = onnx.lower()
|
||||
onnx = onnx == 'true'
|
||||
# model_path = 'damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch'
|
||||
# output_dir = "../export"
|
||||
export_model = ASRModelExportParaformer(cache_dir=output_dir, onnx=onnx)
|
||||
export_model.export(model_path)
|
||||
# export_model.export('/root/cache/export/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch')
|
||||
import argparse
|
||||
parser = argparse.ArgumentParser()
|
||||
parser.add_argument('--model-name', type=str, required=True)
|
||||
parser.add_argument('--export-dir', type=str, required=True)
|
||||
parser.add_argument('--type', type=str, default='onnx', help='["onnx", "torch"]')
|
||||
parser.add_argument('--quantize', type=str2bool, default=False, help='export quantized model')
|
||||
parser.add_argument('--fallback-num', type=int, default=0, help='amp fallback number')
|
||||
parser.add_argument('--audio_in', type=str, default=None, help='["wav", "wav.scp"]')
|
||||
parser.add_argument('--calib_num', type=int, default=200, help='calib max num')
|
||||
args = parser.parse_args()
|
||||
|
||||
export_model = ASRModelExportParaformer(
|
||||
cache_dir=args.export_dir,
|
||||
onnx=args.type == 'onnx',
|
||||
quant=args.quantize,
|
||||
fallback_num=args.fallback_num,
|
||||
audio_in=args.audio_in,
|
||||
calib_num=args.calib_num,
|
||||
)
|
||||
export_model.export(args.model_name)
|
||||
|
||||
@ -16,6 +16,7 @@ class EncoderLayerSANM(nn.Module):
|
||||
self.feed_forward = model.feed_forward
|
||||
self.norm1 = model.norm1
|
||||
self.norm2 = model.norm2
|
||||
self.in_size = model.in_size
|
||||
self.size = model.size
|
||||
|
||||
def forward(self, x, mask):
|
||||
@ -23,13 +24,12 @@ class EncoderLayerSANM(nn.Module):
|
||||
residual = x
|
||||
x = self.norm1(x)
|
||||
x = self.self_attn(x, mask)
|
||||
if x.size(2) == residual.size(2):
|
||||
if self.in_size == self.size:
|
||||
x = x + residual
|
||||
residual = x
|
||||
x = self.norm2(x)
|
||||
x = self.feed_forward(x)
|
||||
if x.size(2) == residual.size(2):
|
||||
x = x + residual
|
||||
x = x + residual
|
||||
|
||||
return x, mask
|
||||
|
||||
|
||||
@ -64,6 +64,23 @@ class MultiHeadedAttentionSANM(nn.Module):
|
||||
return self.linear_out(context_layer) # (batch, time1, d_model)
|
||||
|
||||
|
||||
def preprocess_for_attn(x, mask, cache, pad_fn):
|
||||
x = x * mask
|
||||
x = x.transpose(1, 2)
|
||||
if cache is None:
|
||||
x = pad_fn(x)
|
||||
else:
|
||||
x = torch.cat((cache[:, :, 1:], x), dim=2)
|
||||
cache = x
|
||||
return x, cache
|
||||
|
||||
|
||||
torch_version = float(".".join(torch.__version__.split(".")[:2]))
|
||||
if torch_version >= 1.8:
|
||||
import torch.fx
|
||||
torch.fx.wrap('preprocess_for_attn')
|
||||
|
||||
|
||||
class MultiHeadedAttentionSANMDecoder(nn.Module):
|
||||
def __init__(self, model):
|
||||
super().__init__()
|
||||
@ -73,16 +90,7 @@ class MultiHeadedAttentionSANMDecoder(nn.Module):
|
||||
self.attn = None
|
||||
|
||||
def forward(self, inputs, mask, cache=None):
|
||||
# b, t, d = inputs.size()
|
||||
# mask = torch.reshape(mask, (b, -1, 1))
|
||||
inputs = inputs * mask
|
||||
|
||||
x = inputs.transpose(1, 2)
|
||||
if cache is None:
|
||||
x = self.pad_fn(x)
|
||||
else:
|
||||
x = torch.cat((cache[:, :, 1:], x), dim=2)
|
||||
cache = x
|
||||
x, cache = preprocess_for_attn(inputs, mask, cache, self.pad_fn)
|
||||
x = self.fsmn_block(x)
|
||||
x = x.transpose(1, 2)
|
||||
|
||||
@ -232,4 +240,4 @@ class OnnxRelPosMultiHeadedAttention(OnnxMultiHeadedAttention):
|
||||
new_context_layer_shape = context_layer.size()[:-2] + (self.all_head_size,)
|
||||
context_layer = context_layer.view(new_context_layer_shape)
|
||||
return self.linear_out(context_layer) # (batch, time1, d_model)
|
||||
|
||||
|
||||
|
||||
@ -66,13 +66,13 @@ def average_nbest_models(
|
||||
elif n == 1:
|
||||
# The averaged model is the same as the best model
|
||||
e, _ = epoch_and_values[0]
|
||||
op = output_dir / f"{e}epoch.pth"
|
||||
sym_op = output_dir / f"{ph}.{cr}.ave_1best.{suffix}pth"
|
||||
op = output_dir / f"{e}epoch.pb"
|
||||
sym_op = output_dir / f"{ph}.{cr}.ave_1best.{suffix}pb"
|
||||
if sym_op.is_symlink() or sym_op.exists():
|
||||
sym_op.unlink()
|
||||
sym_op.symlink_to(op.name)
|
||||
else:
|
||||
op = output_dir / f"{ph}.{cr}.ave_{n}best.{suffix}pth"
|
||||
op = output_dir / f"{ph}.{cr}.ave_{n}best.{suffix}pb"
|
||||
logging.info(
|
||||
f"Averaging {n}best models: " f'criterion="{ph}.{cr}": {op}'
|
||||
)
|
||||
@ -83,12 +83,12 @@ def average_nbest_models(
|
||||
if e not in _loaded:
|
||||
if oss_bucket is None:
|
||||
_loaded[e] = torch.load(
|
||||
output_dir / f"{e}epoch.pth",
|
||||
output_dir / f"{e}epoch.pb",
|
||||
map_location="cpu",
|
||||
)
|
||||
else:
|
||||
buffer = BytesIO(
|
||||
oss_bucket.get_object(os.path.join(pai_output_dir, f"{e}epoch.pth")).read())
|
||||
oss_bucket.get_object(os.path.join(pai_output_dir, f"{e}epoch.pb")).read())
|
||||
_loaded[e] = torch.load(buffer)
|
||||
states = _loaded[e]
|
||||
|
||||
@ -115,13 +115,13 @@ def average_nbest_models(
|
||||
else:
|
||||
buffer = BytesIO()
|
||||
torch.save(avg, buffer)
|
||||
oss_bucket.put_object(os.path.join(pai_output_dir, f"{ph}.{cr}.ave_{n}best.{suffix}pth"),
|
||||
oss_bucket.put_object(os.path.join(pai_output_dir, f"{ph}.{cr}.ave_{n}best.{suffix}pb"),
|
||||
buffer.getvalue())
|
||||
|
||||
# 3. *.*.ave.pth is a symlink to the max ave model
|
||||
# 3. *.*.ave.pb is a symlink to the max ave model
|
||||
if oss_bucket is None:
|
||||
op = output_dir / f"{ph}.{cr}.ave_{max(_nbests)}best.{suffix}pth"
|
||||
sym_op = output_dir / f"{ph}.{cr}.ave.{suffix}pth"
|
||||
op = output_dir / f"{ph}.{cr}.ave_{max(_nbests)}best.{suffix}pb"
|
||||
sym_op = output_dir / f"{ph}.{cr}.ave.{suffix}pb"
|
||||
if sym_op.is_symlink() or sym_op.exists():
|
||||
sym_op.unlink()
|
||||
sym_op.symlink_to(op.name)
|
||||
|
||||
@ -191,12 +191,12 @@ def unpack(
|
||||
|
||||
Examples:
|
||||
tarfile:
|
||||
model.pth
|
||||
model.pb
|
||||
some1.file
|
||||
some2.file
|
||||
|
||||
>>> unpack("tarfile", "out")
|
||||
{'asr_model_file': 'out/model.pth'}
|
||||
{'asr_model_file': 'out/model.pb'}
|
||||
"""
|
||||
input_archive = Path(input_archive)
|
||||
outpath = Path(outpath)
|
||||
|
||||
@ -90,6 +90,47 @@ class DecoderLayerSANM(nn.Module):
|
||||
tgt = self.norm1(tgt)
|
||||
tgt = self.feed_forward(tgt)
|
||||
|
||||
x = tgt
|
||||
if self.self_attn:
|
||||
if self.normalize_before:
|
||||
tgt = self.norm2(tgt)
|
||||
x, _ = self.self_attn(tgt, tgt_mask)
|
||||
x = residual + self.dropout(x)
|
||||
|
||||
if self.src_attn is not None:
|
||||
residual = x
|
||||
if self.normalize_before:
|
||||
x = self.norm3(x)
|
||||
|
||||
x = residual + self.dropout(self.src_attn(x, memory, memory_mask))
|
||||
|
||||
|
||||
return x, tgt_mask, memory, memory_mask, cache
|
||||
|
||||
def forward_chunk(self, tgt, tgt_mask, memory, memory_mask=None, cache=None):
|
||||
"""Compute decoded features.
|
||||
|
||||
Args:
|
||||
tgt (torch.Tensor): Input tensor (#batch, maxlen_out, size).
|
||||
tgt_mask (torch.Tensor): Mask for input tensor (#batch, maxlen_out).
|
||||
memory (torch.Tensor): Encoded memory, float32 (#batch, maxlen_in, size).
|
||||
memory_mask (torch.Tensor): Encoded memory mask (#batch, maxlen_in).
|
||||
cache (List[torch.Tensor]): List of cached tensors.
|
||||
Each tensor shape should be (#batch, maxlen_out - 1, size).
|
||||
|
||||
Returns:
|
||||
torch.Tensor: Output tensor(#batch, maxlen_out, size).
|
||||
torch.Tensor: Mask for output tensor (#batch, maxlen_out).
|
||||
torch.Tensor: Encoded memory (#batch, maxlen_in, size).
|
||||
torch.Tensor: Encoded memory mask (#batch, maxlen_in).
|
||||
|
||||
"""
|
||||
# tgt = self.dropout(tgt)
|
||||
residual = tgt
|
||||
if self.normalize_before:
|
||||
tgt = self.norm1(tgt)
|
||||
tgt = self.feed_forward(tgt)
|
||||
|
||||
x = tgt
|
||||
if self.self_attn:
|
||||
if self.normalize_before:
|
||||
@ -109,7 +150,6 @@ class DecoderLayerSANM(nn.Module):
|
||||
|
||||
return x, tgt_mask, memory, memory_mask, cache
|
||||
|
||||
|
||||
class FsmnDecoderSCAMAOpt(BaseTransformerDecoder):
|
||||
"""
|
||||
author: Speech Lab, Alibaba Group, China
|
||||
@ -947,6 +987,65 @@ class ParaformerSANMDecoder(BaseTransformerDecoder):
|
||||
)
|
||||
return logp.squeeze(0), state
|
||||
|
||||
def forward_chunk(
|
||||
self,
|
||||
memory: torch.Tensor,
|
||||
tgt: torch.Tensor,
|
||||
cache: dict = None,
|
||||
) -> Tuple[torch.Tensor, torch.Tensor]:
|
||||
"""Forward decoder.
|
||||
|
||||
Args:
|
||||
hs_pad: encoded memory, float32 (batch, maxlen_in, feat)
|
||||
hlens: (batch)
|
||||
ys_in_pad:
|
||||
input token ids, int64 (batch, maxlen_out)
|
||||
if input_layer == "embed"
|
||||
input tensor (batch, maxlen_out, #mels) in the other cases
|
||||
ys_in_lens: (batch)
|
||||
Returns:
|
||||
(tuple): tuple containing:
|
||||
|
||||
x: decoded token score before softmax (batch, maxlen_out, token)
|
||||
if use_output_layer is True,
|
||||
olens: (batch, )
|
||||
"""
|
||||
x = tgt
|
||||
if cache["decode_fsmn"] is None:
|
||||
cache_layer_num = len(self.decoders)
|
||||
if self.decoders2 is not None:
|
||||
cache_layer_num += len(self.decoders2)
|
||||
new_cache = [None] * cache_layer_num
|
||||
else:
|
||||
new_cache = cache["decode_fsmn"]
|
||||
for i in range(self.att_layer_num):
|
||||
decoder = self.decoders[i]
|
||||
x, tgt_mask, memory, memory_mask, c_ret = decoder.forward_chunk(
|
||||
x, None, memory, None, cache=new_cache[i]
|
||||
)
|
||||
new_cache[i] = c_ret
|
||||
|
||||
if self.num_blocks - self.att_layer_num > 1:
|
||||
for i in range(self.num_blocks - self.att_layer_num):
|
||||
j = i + self.att_layer_num
|
||||
decoder = self.decoders2[i]
|
||||
x, tgt_mask, memory, memory_mask, c_ret = decoder.forward_chunk(
|
||||
x, None, memory, None, cache=new_cache[j]
|
||||
)
|
||||
new_cache[j] = c_ret
|
||||
|
||||
for decoder in self.decoders3:
|
||||
|
||||
x, tgt_mask, memory, memory_mask, _ = decoder.forward_chunk(
|
||||
x, None, memory, None, cache=None
|
||||
)
|
||||
if self.normalize_before:
|
||||
x = self.after_norm(x)
|
||||
if self.output_layer is not None:
|
||||
x = self.output_layer(x)
|
||||
cache["decode_fsmn"] = new_cache
|
||||
return x
|
||||
|
||||
def forward_one_step(
|
||||
self,
|
||||
tgt: torch.Tensor,
|
||||
|
||||
@ -325,6 +325,65 @@ class Paraformer(AbsESPnetModel):
|
||||
|
||||
return encoder_out, encoder_out_lens
|
||||
|
||||
def encode_chunk(
|
||||
self, speech: torch.Tensor, speech_lengths: torch.Tensor, cache: dict = None
|
||||
) -> Tuple[torch.Tensor, torch.Tensor]:
|
||||
"""Frontend + Encoder. Note that this method is used by asr_inference.py
|
||||
|
||||
Args:
|
||||
speech: (Batch, Length, ...)
|
||||
speech_lengths: (Batch, )
|
||||
"""
|
||||
with autocast(False):
|
||||
# 1. Extract feats
|
||||
feats, feats_lengths = self._extract_feats(speech, speech_lengths)
|
||||
|
||||
# 2. Data augmentation
|
||||
if self.specaug is not None and self.training:
|
||||
feats, feats_lengths = self.specaug(feats, feats_lengths)
|
||||
|
||||
# 3. Normalization for feature: e.g. Global-CMVN, Utterance-CMVN
|
||||
if self.normalize is not None:
|
||||
feats, feats_lengths = self.normalize(feats, feats_lengths)
|
||||
|
||||
# Pre-encoder, e.g. used for raw input data
|
||||
if self.preencoder is not None:
|
||||
feats, feats_lengths = self.preencoder(feats, feats_lengths)
|
||||
|
||||
# 4. Forward encoder
|
||||
# feats: (Batch, Length, Dim)
|
||||
# -> encoder_out: (Batch, Length2, Dim2)
|
||||
if self.encoder.interctc_use_conditioning:
|
||||
encoder_out, encoder_out_lens, _ = self.encoder.forward_chunk(
|
||||
feats, feats_lengths, cache=cache["encoder"], ctc=self.ctc
|
||||
)
|
||||
else:
|
||||
encoder_out, encoder_out_lens, _ = self.encoder.forward_chunk(feats, feats_lengths, cache=cache["encoder"])
|
||||
intermediate_outs = None
|
||||
if isinstance(encoder_out, tuple):
|
||||
intermediate_outs = encoder_out[1]
|
||||
encoder_out = encoder_out[0]
|
||||
|
||||
# Post-encoder, e.g. NLU
|
||||
if self.postencoder is not None:
|
||||
encoder_out, encoder_out_lens = self.postencoder(
|
||||
encoder_out, encoder_out_lens
|
||||
)
|
||||
|
||||
assert encoder_out.size(0) == speech.size(0), (
|
||||
encoder_out.size(),
|
||||
speech.size(0),
|
||||
)
|
||||
assert encoder_out.size(1) <= encoder_out_lens.max(), (
|
||||
encoder_out.size(),
|
||||
encoder_out_lens.max(),
|
||||
)
|
||||
|
||||
if intermediate_outs is not None:
|
||||
return (encoder_out, intermediate_outs), encoder_out_lens
|
||||
|
||||
return encoder_out, encoder_out_lens
|
||||
|
||||
def calc_predictor(self, encoder_out, encoder_out_lens):
|
||||
|
||||
encoder_out_mask = (~make_pad_mask(encoder_out_lens, maxlen=encoder_out.size(1))[:, None, :]).to(
|
||||
@ -333,6 +392,11 @@ class Paraformer(AbsESPnetModel):
|
||||
ignore_id=self.ignore_id)
|
||||
return pre_acoustic_embeds, pre_token_length, alphas, pre_peak_index
|
||||
|
||||
def calc_predictor_chunk(self, encoder_out, cache=None):
|
||||
|
||||
pre_acoustic_embeds, pre_token_length, alphas, pre_peak_index = self.predictor.forward_chunk(encoder_out, cache["encoder"])
|
||||
return pre_acoustic_embeds, pre_token_length, alphas, pre_peak_index
|
||||
|
||||
def cal_decoder_with_predictor(self, encoder_out, encoder_out_lens, sematic_embeds, ys_pad_lens):
|
||||
|
||||
decoder_outs = self.decoder(
|
||||
@ -342,6 +406,14 @@ class Paraformer(AbsESPnetModel):
|
||||
decoder_out = torch.log_softmax(decoder_out, dim=-1)
|
||||
return decoder_out, ys_pad_lens
|
||||
|
||||
def cal_decoder_with_predictor_chunk(self, encoder_out, sematic_embeds, cache=None):
|
||||
decoder_outs = self.decoder.forward_chunk(
|
||||
encoder_out, sematic_embeds, cache["decoder"]
|
||||
)
|
||||
decoder_out = decoder_outs
|
||||
decoder_out = torch.log_softmax(decoder_out, dim=-1)
|
||||
return decoder_out
|
||||
|
||||
def _extract_feats(
|
||||
self, speech: torch.Tensor, speech_lengths: torch.Tensor
|
||||
) -> Tuple[torch.Tensor, torch.Tensor]:
|
||||
@ -1459,4 +1531,4 @@ class ContextualParaformer(Paraformer):
|
||||
"torch tensor: {}, {}, loading from tf tensor: {}, {}".format(name, data_tf.size(), name_tf,
|
||||
var_dict_tf[name_tf].shape))
|
||||
|
||||
return var_dict_torch_update
|
||||
return var_dict_torch_update
|
||||
|
||||
@ -52,15 +52,15 @@ class DiarEENDOLAModel(AbsESPnetModel):
|
||||
|
||||
super().__init__()
|
||||
self.frontend = frontend
|
||||
self.encoder = encoder
|
||||
self.encoder_decoder_attractor = encoder_decoder_attractor
|
||||
self.enc = encoder
|
||||
self.eda = encoder_decoder_attractor
|
||||
self.attractor_loss_weight = attractor_loss_weight
|
||||
self.max_n_speaker = max_n_speaker
|
||||
if mapping_dict is None:
|
||||
mapping_dict = generate_mapping_dict(max_speaker_num=self.max_n_speaker)
|
||||
self.mapping_dict = mapping_dict
|
||||
# PostNet
|
||||
self.PostNet = nn.LSTM(self.max_n_speaker, n_units, 1, batch_first=True)
|
||||
self.postnet = nn.LSTM(self.max_n_speaker, n_units, 1, batch_first=True)
|
||||
self.output_layer = nn.Linear(n_units, mapping_dict['oov'] + 1)
|
||||
|
||||
def forward_encoder(self, xs, ilens):
|
||||
@ -68,7 +68,7 @@ class DiarEENDOLAModel(AbsESPnetModel):
|
||||
pad_shape = xs.shape
|
||||
xs_mask = [torch.ones(ilen).to(xs.device) for ilen in ilens]
|
||||
xs_mask = torch.nn.utils.rnn.pad_sequence(xs_mask, batch_first=True, padding_value=0).unsqueeze(-2)
|
||||
emb = self.encoder(xs, xs_mask)
|
||||
emb = self.enc(xs, xs_mask)
|
||||
emb = torch.split(emb.view(pad_shape[0], pad_shape[1], -1), 1, dim=0)
|
||||
emb = [e[0][:ilen] for e, ilen in zip(emb, ilens)]
|
||||
return emb
|
||||
@ -76,8 +76,8 @@ class DiarEENDOLAModel(AbsESPnetModel):
|
||||
def forward_post_net(self, logits, ilens):
|
||||
maxlen = torch.max(ilens).to(torch.int).item()
|
||||
logits = nn.utils.rnn.pad_sequence(logits, batch_first=True, padding_value=-1)
|
||||
logits = nn.utils.rnn.pack_padded_sequence(logits, ilens, batch_first=True, enforce_sorted=False)
|
||||
outputs, (_, _) = self.PostNet(logits)
|
||||
logits = nn.utils.rnn.pack_padded_sequence(logits, ilens.cpu().to(torch.int64), batch_first=True, enforce_sorted=False)
|
||||
outputs, (_, _) = self.postnet(logits)
|
||||
outputs = nn.utils.rnn.pad_packed_sequence(outputs, batch_first=True, padding_value=-1, total_length=maxlen)[0]
|
||||
outputs = [output[:ilens[i].to(torch.int).item()] for i, output in enumerate(outputs)]
|
||||
outputs = [self.output_layer(output) for output in outputs]
|
||||
@ -112,7 +112,7 @@ class DiarEENDOLAModel(AbsESPnetModel):
|
||||
text = text[:, : text_lengths.max()]
|
||||
|
||||
# 1. Encoder
|
||||
encoder_out, encoder_out_lens = self.encode(speech, speech_lengths)
|
||||
encoder_out, encoder_out_lens = self.enc(speech, speech_lengths)
|
||||
intermediate_outs = None
|
||||
if isinstance(encoder_out, tuple):
|
||||
intermediate_outs = encoder_out[1]
|
||||
@ -190,18 +190,16 @@ class DiarEENDOLAModel(AbsESPnetModel):
|
||||
shuffle: bool = True,
|
||||
threshold: float = 0.5,
|
||||
**kwargs):
|
||||
if self.frontend is not None:
|
||||
speech = self.frontend(speech)
|
||||
speech = [s[:s_len] for s, s_len in zip(speech, speech_lengths)]
|
||||
emb = self.forward_encoder(speech, speech_lengths)
|
||||
if shuffle:
|
||||
orders = [np.arange(e.shape[0]) for e in emb]
|
||||
for order in orders:
|
||||
np.random.shuffle(order)
|
||||
attractors, probs = self.encoder_decoder_attractor.estimate(
|
||||
attractors, probs = self.eda.estimate(
|
||||
[e[torch.from_numpy(order).to(torch.long).to(speech[0].device)] for e, order in zip(emb, orders)])
|
||||
else:
|
||||
attractors, probs = self.encoder_decoder_attractor.estimate(emb)
|
||||
attractors, probs = self.eda.estimate(emb)
|
||||
attractors_active = []
|
||||
for p, att, e in zip(probs, attractors, emb):
|
||||
if n_speakers and n_speakers >= 0:
|
||||
@ -233,10 +231,23 @@ class DiarEENDOLAModel(AbsESPnetModel):
|
||||
pred[i] = pred[i - 1]
|
||||
else:
|
||||
pred[i] = 0
|
||||
pred = [self.reporter.inv_mapping_func(i, self.mapping_dict) for i in pred]
|
||||
pred = [self.inv_mapping_func(i) for i in pred]
|
||||
decisions = [bin(num)[2:].zfill(self.max_n_speaker)[::-1] for num in pred]
|
||||
decisions = torch.from_numpy(
|
||||
np.stack([np.array([int(i) for i in dec]) for dec in decisions], axis=0)).to(logit.device).to(
|
||||
torch.float32)
|
||||
decisions = decisions[:, :n_speaker]
|
||||
return decisions
|
||||
|
||||
def inv_mapping_func(self, label):
|
||||
|
||||
if not isinstance(label, int):
|
||||
label = int(label)
|
||||
if label in self.mapping_dict['label2dec'].keys():
|
||||
num = self.mapping_dict['label2dec'][label]
|
||||
else:
|
||||
num = -1
|
||||
return num
|
||||
|
||||
def collect_feats(self, **batch: torch.Tensor) -> Dict[str, torch.Tensor]:
|
||||
pass
|
||||
@ -59,7 +59,8 @@ class DiarSondModel(AbsESPnetModel):
|
||||
normalize_speech_speaker: bool = False,
|
||||
ignore_id: int = -1,
|
||||
speaker_discrimination_loss_weight: float = 1.0,
|
||||
inter_score_loss_weight: float = 0.0
|
||||
inter_score_loss_weight: float = 0.0,
|
||||
inputs_type: str = "raw",
|
||||
):
|
||||
assert check_argument_types()
|
||||
|
||||
@ -86,14 +87,12 @@ class DiarSondModel(AbsESPnetModel):
|
||||
)
|
||||
self.criterion_bce = SequenceBinaryCrossEntropy(normalize_length=length_normalized_loss)
|
||||
self.pse_embedding = self.generate_pse_embedding()
|
||||
# self.register_buffer("pse_embedding", pse_embedding)
|
||||
self.power_weight = torch.from_numpy(2 ** np.arange(max_spk_num)[np.newaxis, np.newaxis, :]).float()
|
||||
# self.register_buffer("power_weight", power_weight)
|
||||
self.int_token_arr = torch.from_numpy(np.array(self.token_list).astype(int)[np.newaxis, np.newaxis, :]).int()
|
||||
# self.register_buffer("int_token_arr", int_token_arr)
|
||||
self.speaker_discrimination_loss_weight = speaker_discrimination_loss_weight
|
||||
self.inter_score_loss_weight = inter_score_loss_weight
|
||||
self.forward_steps = 0
|
||||
self.inputs_type = inputs_type
|
||||
|
||||
def generate_pse_embedding(self):
|
||||
embedding = np.zeros((len(self.token_list), self.max_spk_num), dtype=np.float)
|
||||
@ -125,9 +124,14 @@ class DiarSondModel(AbsESPnetModel):
|
||||
binary_labels: (Batch, frames, max_spk_num)
|
||||
binary_labels_lengths: (Batch,)
|
||||
"""
|
||||
assert speech.shape[0] == binary_labels.shape[0], (speech.shape, binary_labels.shape)
|
||||
assert speech.shape[0] <= binary_labels.shape[0], (speech.shape, binary_labels.shape)
|
||||
batch_size = speech.shape[0]
|
||||
self.forward_steps = self.forward_steps + 1
|
||||
if self.pse_embedding.device != speech.device:
|
||||
self.pse_embedding = self.pse_embedding.to(speech.device)
|
||||
self.power_weight = self.power_weight.to(speech.device)
|
||||
self.int_token_arr = self.int_token_arr.to(speech.device)
|
||||
|
||||
# 1. Network forward
|
||||
pred, inter_outputs = self.prediction_forward(
|
||||
speech, speech_lengths,
|
||||
@ -149,9 +153,13 @@ class DiarSondModel(AbsESPnetModel):
|
||||
# the sequence length of 'pred' might be slightly less than the
|
||||
# length of 'spk_labels'. Here we force them to be equal.
|
||||
length_diff_tolerance = 2
|
||||
length_diff = pse_labels.shape[1] - pred.shape[1]
|
||||
if 0 < length_diff <= length_diff_tolerance:
|
||||
pse_labels = pse_labels[:, 0: pred.shape[1]]
|
||||
length_diff = abs(pse_labels.shape[1] - pred.shape[1])
|
||||
if length_diff <= length_diff_tolerance:
|
||||
min_len = min(pred.shape[1], pse_labels.shape[1])
|
||||
pse_labels = pse_labels[:, :min_len]
|
||||
pred = pred[:, :min_len]
|
||||
cd_score = cd_score[:, :min_len]
|
||||
ci_score = ci_score[:, :min_len]
|
||||
|
||||
loss_diar = self.classification_loss(pred, pse_labels, binary_labels_lengths)
|
||||
loss_spk_dis = self.speaker_discrimination_loss(profile, profile_lengths)
|
||||
@ -299,7 +307,7 @@ class DiarSondModel(AbsESPnetModel):
|
||||
speech: torch.Tensor,
|
||||
speech_lengths: torch.Tensor,
|
||||
) -> Tuple[torch.Tensor, torch.Tensor]:
|
||||
if self.encoder is not None:
|
||||
if self.encoder is not None and self.inputs_type == "raw":
|
||||
speech, speech_lengths = self.encode(speech, speech_lengths)
|
||||
speech_mask = ~make_pad_mask(speech_lengths, maxlen=speech.shape[1])
|
||||
speech_mask = speech_mask.to(speech.device).unsqueeze(-1).float()
|
||||
|
||||
@ -347,6 +347,48 @@ class SANMEncoder(AbsEncoder):
|
||||
return (xs_pad, intermediate_outs), olens, None
|
||||
return xs_pad, olens, None
|
||||
|
||||
def forward_chunk(self,
|
||||
xs_pad: torch.Tensor,
|
||||
ilens: torch.Tensor,
|
||||
cache: dict = None,
|
||||
ctc: CTC = None,
|
||||
):
|
||||
xs_pad *= self.output_size() ** 0.5
|
||||
if self.embed is None:
|
||||
xs_pad = xs_pad
|
||||
else:
|
||||
xs_pad = self.embed.forward_chunk(xs_pad, cache)
|
||||
|
||||
encoder_outs = self.encoders0(xs_pad, None, None, None, None)
|
||||
xs_pad, masks = encoder_outs[0], encoder_outs[1]
|
||||
intermediate_outs = []
|
||||
if len(self.interctc_layer_idx) == 0:
|
||||
encoder_outs = self.encoders(xs_pad, None, None, None, None)
|
||||
xs_pad, masks = encoder_outs[0], encoder_outs[1]
|
||||
else:
|
||||
for layer_idx, encoder_layer in enumerate(self.encoders):
|
||||
encoder_outs = encoder_layer(xs_pad, None, None, None, None)
|
||||
xs_pad, masks = encoder_outs[0], encoder_outs[1]
|
||||
if layer_idx + 1 in self.interctc_layer_idx:
|
||||
encoder_out = xs_pad
|
||||
|
||||
# intermediate outputs are also normalized
|
||||
if self.normalize_before:
|
||||
encoder_out = self.after_norm(encoder_out)
|
||||
|
||||
intermediate_outs.append((layer_idx + 1, encoder_out))
|
||||
|
||||
if self.interctc_use_conditioning:
|
||||
ctc_out = ctc.softmax(encoder_out)
|
||||
xs_pad = xs_pad + self.conditioning_layer(ctc_out)
|
||||
|
||||
if self.normalize_before:
|
||||
xs_pad = self.after_norm(xs_pad)
|
||||
|
||||
if len(intermediate_outs) > 0:
|
||||
return (xs_pad, intermediate_outs), None, None
|
||||
return xs_pad, ilens, None
|
||||
|
||||
def gen_tf2torch_map_dict(self):
|
||||
tensor_name_prefix_torch = self.tf2torch_tensor_name_prefix_torch
|
||||
tensor_name_prefix_tf = self.tf2torch_tensor_name_prefix_tf
|
||||
|
||||
@ -1,14 +1,15 @@
|
||||
# Copyright (c) Alibaba, Inc. and its affiliates.
|
||||
# Part of the implementation is borrowed from espnet/espnet.
|
||||
from abc import ABC
|
||||
from typing import Tuple
|
||||
|
||||
import numpy as np
|
||||
import torch
|
||||
import torchaudio.compliance.kaldi as kaldi
|
||||
from funasr.models.frontend.abs_frontend import AbsFrontend
|
||||
from typeguard import check_argument_types
|
||||
from torch.nn.utils.rnn import pad_sequence
|
||||
from typeguard import check_argument_types
|
||||
|
||||
import funasr.models.frontend.eend_ola_feature as eend_ola_feature
|
||||
from funasr.models.frontend.abs_frontend import AbsFrontend
|
||||
|
||||
|
||||
def load_cmvn(cmvn_file):
|
||||
@ -275,7 +276,8 @@ class WavFrontendOnline(AbsFrontend):
|
||||
# inputs tensor has catted the cache tensor
|
||||
# def apply_lfr(inputs: torch.Tensor, lfr_m: int, lfr_n: int, inputs_lfr_cache: torch.Tensor = None,
|
||||
# is_final: bool = False) -> Tuple[torch.Tensor, torch.Tensor, int]:
|
||||
def apply_lfr(inputs: torch.Tensor, lfr_m: int, lfr_n: int, is_final: bool = False) -> Tuple[torch.Tensor, torch.Tensor, int]:
|
||||
def apply_lfr(inputs: torch.Tensor, lfr_m: int, lfr_n: int, is_final: bool = False) -> Tuple[
|
||||
torch.Tensor, torch.Tensor, int]:
|
||||
"""
|
||||
Apply lfr with data
|
||||
"""
|
||||
@ -376,7 +378,8 @@ class WavFrontendOnline(AbsFrontend):
|
||||
if self.lfr_m != 1 or self.lfr_n != 1:
|
||||
# update self.lfr_splice_cache in self.apply_lfr
|
||||
# mat, self.lfr_splice_cache[i], lfr_splice_frame_idx = self.apply_lfr(mat, self.lfr_m, self.lfr_n, self.lfr_splice_cache[i],
|
||||
mat, self.lfr_splice_cache[i], lfr_splice_frame_idx = self.apply_lfr(mat, self.lfr_m, self.lfr_n, is_final)
|
||||
mat, self.lfr_splice_cache[i], lfr_splice_frame_idx = self.apply_lfr(mat, self.lfr_m, self.lfr_n,
|
||||
is_final)
|
||||
if self.cmvn_file is not None:
|
||||
mat = self.apply_cmvn(mat, self.cmvn)
|
||||
feat_length = mat.size(0)
|
||||
@ -398,9 +401,10 @@ class WavFrontendOnline(AbsFrontend):
|
||||
assert batch_size == 1, 'we support to extract feature online only when the batch size is equal to 1 now'
|
||||
waveforms, feats, feats_lengths = self.forward_fbank(input, input_lengths) # input shape: B T D
|
||||
if feats.shape[0]:
|
||||
#if self.reserve_waveforms is None and self.lfr_m > 1:
|
||||
# if self.reserve_waveforms is None and self.lfr_m > 1:
|
||||
# self.reserve_waveforms = waveforms[:, :(self.lfr_m - 1) // 2 * self.frame_shift_sample_length]
|
||||
self.waveforms = waveforms if self.reserve_waveforms is None else torch.cat((self.reserve_waveforms, waveforms), dim=1)
|
||||
self.waveforms = waveforms if self.reserve_waveforms is None else torch.cat(
|
||||
(self.reserve_waveforms, waveforms), dim=1)
|
||||
if not self.lfr_splice_cache:  # initialize the splice cache
|
||||
for i in range(batch_size):
|
||||
self.lfr_splice_cache.append(feats[i][0, :].unsqueeze(dim=0).repeat((self.lfr_m - 1) // 2, 1))
|
||||
@ -409,7 +413,8 @@ class WavFrontendOnline(AbsFrontend):
|
||||
lfr_splice_cache_tensor = torch.stack(self.lfr_splice_cache) # B T D
|
||||
feats = torch.cat((lfr_splice_cache_tensor, feats), dim=1)
|
||||
feats_lengths += lfr_splice_cache_tensor[0].shape[0]
|
||||
frame_from_waveforms = int((self.waveforms.shape[1] - self.frame_sample_length) / self.frame_shift_sample_length + 1)
|
||||
frame_from_waveforms = int(
|
||||
(self.waveforms.shape[1] - self.frame_sample_length) / self.frame_shift_sample_length + 1)
|
||||
minus_frame = (self.lfr_m - 1) // 2 if self.reserve_waveforms is None else 0
|
||||
feats, feats_lengths, lfr_splice_frame_idxs = self.forward_lfr_cmvn(feats, feats_lengths, is_final)
|
||||
if self.lfr_m == 1:
|
||||
@ -423,14 +428,15 @@ class WavFrontendOnline(AbsFrontend):
|
||||
self.waveforms = self.waveforms[:, :sample_length]
|
||||
else:
|
||||
# update self.reserve_waveforms and self.lfr_splice_cache
|
||||
self.reserve_waveforms = self.waveforms[:, :-(self.frame_sample_length - self.frame_shift_sample_length)]
|
||||
self.reserve_waveforms = self.waveforms[:,
|
||||
:-(self.frame_sample_length - self.frame_shift_sample_length)]
|
||||
for i in range(batch_size):
|
||||
self.lfr_splice_cache[i] = torch.cat((self.lfr_splice_cache[i], feats[i]), dim=0)
|
||||
return torch.empty(0), feats_lengths
|
||||
else:
|
||||
if is_final:
|
||||
self.waveforms = waveforms if self.reserve_waveforms is None else self.reserve_waveforms
|
||||
feats = torch.stack(self.lfr_splice_cache)
|
||||
feats = torch.stack(self.lfr_splice_cache)
|
||||
feats_lengths = torch.zeros(batch_size, dtype=torch.int) + feats.shape[1]
|
||||
feats, feats_lengths, _ = self.forward_lfr_cmvn(feats, feats_lengths, is_final)
|
||||
if is_final:
|
||||
@ -444,3 +450,54 @@ class WavFrontendOnline(AbsFrontend):
|
||||
self.reserve_waveforms = None
|
||||
self.input_cache = None
|
||||
self.lfr_splice_cache = []
|
||||
|
||||
|
||||
class WavFrontendMel23(AbsFrontend):
|
||||
"""Conventional frontend structure for ASR.
|
||||
"""
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
fs: int = 16000,
|
||||
frame_length: int = 25,
|
||||
frame_shift: int = 10,
|
||||
lfr_m: int = 1,
|
||||
lfr_n: int = 1,
|
||||
):
|
||||
assert check_argument_types()
|
||||
super().__init__()
|
||||
self.fs = fs
|
||||
self.frame_length = frame_length
|
||||
self.frame_shift = frame_shift
|
||||
self.lfr_m = lfr_m
|
||||
self.lfr_n = lfr_n
|
||||
self.n_mels = 23
|
||||
|
||||
def output_size(self) -> int:
|
||||
return self.n_mels * (2 * self.lfr_m + 1)
|
||||
|
||||
def forward(
|
||||
self,
|
||||
input: torch.Tensor,
|
||||
input_lengths: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
|
||||
batch_size = input.size(0)
|
||||
feats = []
|
||||
feats_lens = []
|
||||
for i in range(batch_size):
|
||||
waveform_length = input_lengths[i]
|
||||
waveform = input[i][:waveform_length]
|
||||
waveform = waveform.numpy()
|
||||
mat = eend_ola_feature.stft(waveform, self.frame_length, self.frame_shift)
|
||||
mat = eend_ola_feature.transform(mat)
|
||||
mat = eend_ola_feature.splice(mat, context_size=self.lfr_m)
|
||||
mat = mat[::self.lfr_n]
|
||||
mat = torch.from_numpy(mat)
|
||||
feat_length = mat.size(0)
|
||||
feats.append(mat)
|
||||
feats_lens.append(feat_length)
|
||||
|
||||
feats_lens = torch.as_tensor(feats_lens)
|
||||
feats_pad = pad_sequence(feats,
|
||||
batch_first=True,
|
||||
padding_value=0.0)
|
||||
return feats_pad, feats_lens
|
||||
|
||||
@ -199,6 +199,63 @@ class CifPredictorV2(nn.Module):
|
||||
|
||||
return acoustic_embeds, token_num, alphas, cif_peak
|
||||
|
||||
def forward_chunk(self, hidden, cache=None):
|
||||
h = hidden
|
||||
context = h.transpose(1, 2)
|
||||
queries = self.pad(context)
|
||||
output = torch.relu(self.cif_conv1d(queries))
|
||||
output = output.transpose(1, 2)
|
||||
output = self.cif_output(output)
|
||||
alphas = torch.sigmoid(output)
|
||||
alphas = torch.nn.functional.relu(alphas * self.smooth_factor - self.noise_threshold)
|
||||
|
||||
alphas = alphas.squeeze(-1)
|
||||
mask_chunk_predictor = None
|
||||
if cache is not None:
|
||||
mask_chunk_predictor = None
|
||||
mask_chunk_predictor = torch.zeros_like(alphas)
|
||||
mask_chunk_predictor[:, cache["pad_left"]:cache["stride"] + cache["pad_left"]] = 1.0
|
||||
|
||||
if mask_chunk_predictor is not None:
|
||||
alphas = alphas * mask_chunk_predictor
|
||||
|
||||
if cache is not None:
|
||||
if cache["cif_hidden"] is not None:
|
||||
hidden = torch.cat((cache["cif_hidden"], hidden), 1)
|
||||
if cache["cif_alphas"] is not None:
|
||||
alphas = torch.cat((cache["cif_alphas"], alphas), -1)
|
||||
|
||||
token_num = alphas.sum(-1)
|
||||
acoustic_embeds, cif_peak = cif(hidden, alphas, self.threshold)
|
||||
len_time = alphas.size(-1)
|
||||
last_fire_place = len_time - 1
|
||||
last_fire_remainds = 0.0
|
||||
pre_alphas_length = 0
|
||||
|
||||
mask_chunk_peak_predictor = None
|
||||
if cache is not None:
|
||||
mask_chunk_peak_predictor = None
|
||||
mask_chunk_peak_predictor = torch.zeros_like(cif_peak)
|
||||
if cache["cif_alphas"] is not None:
|
||||
pre_alphas_length = cache["cif_alphas"].size(-1)
|
||||
mask_chunk_peak_predictor[:, :pre_alphas_length] = 1.0
|
||||
mask_chunk_peak_predictor[:, pre_alphas_length + cache["pad_left"]:pre_alphas_length + cache["stride"] + cache["pad_left"]] = 1.0
|
||||
|
||||
|
||||
if mask_chunk_peak_predictor is not None:
|
||||
cif_peak = cif_peak * mask_chunk_peak_predictor.squeeze(-1)
|
||||
|
||||
for i in range(len_time):
|
||||
if cif_peak[0][len_time - 1 - i] > self.threshold or cif_peak[0][len_time - 1 - i] == self.threshold:
|
||||
last_fire_place = len_time - 1 - i
|
||||
last_fire_remainds = cif_peak[0][len_time - 1 - i] - self.threshold
|
||||
break
|
||||
last_fire_remainds = torch.tensor([last_fire_remainds], dtype=alphas.dtype).to(alphas.device)
|
||||
cache["cif_hidden"] = hidden[:, last_fire_place:, :]
|
||||
cache["cif_alphas"] = torch.cat((last_fire_remainds.unsqueeze(0), alphas[:, last_fire_place+1:]), -1)
|
||||
token_num_int = token_num.floor().type(torch.int32).item()
|
||||
return acoustic_embeds[:, 0:token_num_int, :], token_num, alphas, cif_peak
|
||||
|
||||
def tail_process_fn(self, hidden, alphas, token_num=None, mask=None):
|
||||
b, t, d = hidden.size()
|
||||
tail_threshold = self.tail_threshold
|
||||
|
||||
@ -347,15 +347,17 @@ class MultiHeadedAttentionSANM(nn.Module):
|
||||
mask = torch.reshape(mask, (b, -1, 1))
|
||||
if mask_shfit_chunk is not None:
|
||||
mask = mask * mask_shfit_chunk
|
||||
inputs = inputs * mask
|
||||
|
||||
inputs = inputs * mask
|
||||
x = inputs.transpose(1, 2)
|
||||
x = self.pad_fn(x)
|
||||
x = self.fsmn_block(x)
|
||||
x = x.transpose(1, 2)
|
||||
x += inputs
|
||||
x = self.dropout(x)
|
||||
return x * mask
|
||||
if mask is not None:
|
||||
x = x * mask
|
||||
return x
|
||||
|
||||
def forward_qkv(self, x):
|
||||
"""Transform query, key and value.
|
||||
@ -505,7 +507,7 @@ class MultiHeadedAttentionSANMDecoder(nn.Module):
|
||||
# print("in fsmn, cache is None, x", x.size())
|
||||
|
||||
x = self.pad_fn(x)
|
||||
if not self.training and t <= 1:
|
||||
if not self.training:
|
||||
cache = x
|
||||
else:
|
||||
# print("in fsmn, cache is not None, x", x.size())
|
||||
@ -513,7 +515,7 @@ class MultiHeadedAttentionSANMDecoder(nn.Module):
|
||||
# if t < self.kernel_size:
|
||||
# x = self.pad_fn(x)
|
||||
x = torch.cat((cache[:, :, 1:], x), dim=2)
|
||||
x = x[:, :, -self.kernel_size:]
|
||||
x = x[:, :, -(self.kernel_size+t-1):]
|
||||
# print("in fsmn, cache is not None, x_cat", x.size())
|
||||
cache = x
|
||||
x = self.fsmn_block(x)
|
||||
|
||||
@ -87,7 +87,7 @@ class EENDOLATransformerEncoder(nn.Module):
|
||||
n_layers: int,
|
||||
n_units: int,
|
||||
e_units: int = 2048,
|
||||
h: int = 8,
|
||||
h: int = 4,
|
||||
dropout_rate: float = 0.1,
|
||||
use_pos_emb: bool = False):
|
||||
super(EENDOLATransformerEncoder, self).__init__()
|
||||
|
||||
@ -16,12 +16,12 @@ class EncoderDecoderAttractor(nn.Module):
|
||||
self.n_units = n_units
|
||||
|
||||
def forward_core(self, xs, zeros):
|
||||
ilens = torch.from_numpy(np.array([x.shape[0] for x in xs])).to(torch.float32).to(xs[0].device)
|
||||
ilens = torch.from_numpy(np.array([x.shape[0] for x in xs])).to(torch.int64)
|
||||
xs = [self.enc0_dropout(x) for x in xs]
|
||||
xs = nn.utils.rnn.pad_sequence(xs, batch_first=True, padding_value=-1)
|
||||
xs = nn.utils.rnn.pack_padded_sequence(xs, ilens, batch_first=True, enforce_sorted=False)
|
||||
_, (hx, cx) = self.encoder(xs)
|
||||
zlens = torch.from_numpy(np.array([z.shape[0] for z in zeros])).to(torch.float32).to(zeros[0].device)
|
||||
zlens = torch.from_numpy(np.array([z.shape[0] for z in zeros])).to(torch.int64)
|
||||
max_zlen = torch.max(zlens).to(torch.int).item()
|
||||
zeros = [self.enc0_dropout(z) for z in zeros]
|
||||
zeros = nn.utils.rnn.pad_sequence(zeros, batch_first=True, padding_value=-1)
|
||||
@ -47,4 +47,4 @@ class EncoderDecoderAttractor(nn.Module):
|
||||
zeros = [torch.zeros(max_n_speakers, self.n_units).to(torch.float32).to(xs[0].device) for _ in xs]
|
||||
attractors = self.forward_core(xs, zeros)
|
||||
probs = [torch.sigmoid(torch.flatten(self.counter(att))) for att in attractors]
|
||||
return attractors, probs
|
||||
return attractors, probs
|
||||
|
||||
@ -405,4 +405,13 @@ class SinusoidalPositionEncoder(torch.nn.Module):
|
||||
positions = torch.arange(1, timesteps+1)[None, :]
|
||||
position_encoding = self.encode(positions, input_dim, x.dtype).to(x.device)
|
||||
|
||||
return x + position_encoding
|
||||
return x + position_encoding
|
||||
|
||||
def forward_chunk(self, x, cache=None):
|
||||
start_idx = 0
|
||||
batch_size, timesteps, input_dim = x.size()
|
||||
if cache is not None:
|
||||
start_idx = cache["start_idx"]
|
||||
positions = torch.arange(1, timesteps+start_idx+1)[None, :]
|
||||
position_encoding = self.encode(positions, input_dim, x.dtype).to(x.device)
|
||||
return x + position_encoding[:, start_idx: start_idx + timesteps]
|
||||
|
||||
funasr/runtime/grpc/CMakeLists.txt (new file, 83 lines)
@ -0,0 +1,83 @@
|
||||
# Copyright 2018 gRPC authors.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
#
|
||||
# cmake build file for C++ paraformer example.
|
||||
# Assumes protobuf and gRPC have been installed using cmake.
|
||||
# See cmake_externalproject/CMakeLists.txt for all-in-one cmake build
|
||||
# that automatically builds all the dependencies before building paraformer.
|
||||
|
||||
cmake_minimum_required(VERSION 3.10)
|
||||
|
||||
project(ASR C CXX)
|
||||
|
||||
include(common.cmake)
|
||||
|
||||
# Proto file
|
||||
get_filename_component(rg_proto "../python/grpc/proto/paraformer.proto" ABSOLUTE)
|
||||
get_filename_component(rg_proto_path "${rg_proto}" PATH)
|
||||
|
||||
# Generated sources
|
||||
set(rg_proto_srcs "${CMAKE_CURRENT_BINARY_DIR}/paraformer.pb.cc")
|
||||
set(rg_proto_hdrs "${CMAKE_CURRENT_BINARY_DIR}/paraformer.pb.h")
|
||||
set(rg_grpc_srcs "${CMAKE_CURRENT_BINARY_DIR}/paraformer.grpc.pb.cc")
|
||||
set(rg_grpc_hdrs "${CMAKE_CURRENT_BINARY_DIR}/paraformer.grpc.pb.h")
|
||||
add_custom_command(
|
||||
OUTPUT "${rg_proto_srcs}" "${rg_proto_hdrs}" "${rg_grpc_srcs}" "${rg_grpc_hdrs}"
|
||||
COMMAND ${_PROTOBUF_PROTOC}
|
||||
ARGS --grpc_out "${CMAKE_CURRENT_BINARY_DIR}"
|
||||
--cpp_out "${CMAKE_CURRENT_BINARY_DIR}"
|
||||
-I "${rg_proto_path}"
|
||||
--plugin=protoc-gen-grpc="${_GRPC_CPP_PLUGIN_EXECUTABLE}"
|
||||
"${rg_proto}"
|
||||
DEPENDS "${rg_proto}")
|
||||
|
||||
|
||||
# Include generated *.pb.h files
|
||||
include_directories("${CMAKE_CURRENT_BINARY_DIR}")
|
||||
|
||||
include_directories(../onnxruntime/include/)
|
||||
link_directories(../onnxruntime/build/src/)
|
||||
link_directories(../onnxruntime/build/third_party/webrtc/)
|
||||
|
||||
link_directories(${ONNXRUNTIME_DIR}/lib)
|
||||
add_subdirectory("../onnxruntime/src" onnx_src)
|
||||
|
||||
# rg_grpc_proto
|
||||
add_library(rg_grpc_proto
|
||||
${rg_grpc_srcs}
|
||||
${rg_grpc_hdrs}
|
||||
${rg_proto_srcs}
|
||||
${rg_proto_hdrs})
|
||||
|
||||
|
||||
|
||||
target_link_libraries(rg_grpc_proto
|
||||
${_REFLECTION}
|
||||
${_GRPC_GRPCPP}
|
||||
${_PROTOBUF_LIBPROTOBUF})
|
||||
|
||||
# Targets paraformer_(server)
|
||||
foreach(_target
|
||||
paraformer_server)
|
||||
add_executable(${_target}
|
||||
"${_target}.cc")
|
||||
target_link_libraries(${_target}
|
||||
rg_grpc_proto
|
||||
rapidasr
|
||||
webrtcvad
|
||||
${EXTRA_LIBS}
|
||||
${_REFLECTION}
|
||||
${_GRPC_GRPCPP}
|
||||
${_PROTOBUF_LIBPROTOBUF})
|
||||
endforeach()
|
||||
funasr/runtime/grpc/Readme.md (new file, 57 lines)
@ -0,0 +1,57 @@
|
||||
## paraformer grpc onnx server in c++


#### Step 1. Build ../onnxruntime following its documentation
|
||||
```
|
||||
# put the onnx lib, the onnx asr model and vocab.txt into /path/to/asrmodel (e.g. /data/asrmodel)
|
||||
ls /data/asrmodel/
|
||||
onnxruntime-linux-x64-1.14.0 speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch
|
||||
|
||||
file /data/asrmodel/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch/vocab.txt
|
||||
UTF-8 Unicode text
|
||||
```
|
||||
|
||||
#### Step 2. Compile and install grpc v1.52.0 to avoid known grpc bugs
|
||||
```
|
||||
export GRPC_INSTALL_DIR=/data/soft/grpc
|
||||
export PKG_CONFIG_PATH=$GRPC_INSTALL_DIR/lib/pkgconfig
|
||||
|
||||
git clone -b v1.52.0 --depth=1 https://github.com/grpc/grpc.git
|
||||
cd grpc
|
||||
git submodule update --init --recursive
|
||||
|
||||
mkdir -p cmake/build
|
||||
pushd cmake/build
|
||||
cmake -DgRPC_INSTALL=ON \
|
||||
-DgRPC_BUILD_TESTS=OFF \
|
||||
-DCMAKE_INSTALL_PREFIX=$GRPC_INSTALL_DIR \
|
||||
../..
|
||||
make
|
||||
make install
|
||||
popd
|
||||
|
||||
echo "export GRPC_INSTALL_DIR=/data/soft/grpc" >> ~/.bashrc
|
||||
echo "export PKG_CONFIG_PATH=\$GRPC_INSTALL_DIR/lib/pkgconfig" >> ~/.bashrc
|
||||
echo "export PATH=\$GRPC_INSTALL_DIR/bin/:\$PKG_CONFIG_PATH:\$PATH" >> ~/.bashrc
|
||||
source ~/.bashrc
|
||||
```
|
||||
|
||||
#### Step 3. Compile the grpc onnx paraformer server
|
||||
```
|
||||
# set -DONNXRUNTIME_DIR=/path/to/asrmodel/onnxruntime-linux-x64-1.14.0
|
||||
./rebuild.sh
|
||||
```
|
||||
|
||||
#### Step 4. Start grpc paraformer server
|
||||
```
|
||||
Usage: ./cmake/build/paraformer_server port thread_num /path/to/model_file
|
||||
./cmake/build/paraformer_server 10108 4 /data/asrmodel/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch
|
||||
```
|
||||
|
||||
|
||||
|
||||
#### Step 5. Start grpc python paraformer client on PC with MIC
|
||||
```
|
||||
cd ../python/grpc
|
||||
python grpc_main_client_mic.py --host $server_ip --port 10108
|
||||
```
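
Besides the microphone client, a minimal file-based streaming client can be sketched in Python. This assumes stubs generated from `../python/grpc/proto/paraformer.proto` (`paraformer_pb2`, `paraformer_pb2_grpc`) and uses the field names implied by the C++ server above; the exact field spelling (e.g. `isEnd`) and the PCM file are assumptions.

```python
# Sketch of a streaming client for paraformer_server (file input instead of a MIC).
import grpc
import paraformer_pb2
import paraformer_pb2_grpc


def request_stream(pcm_bytes, user="demo", language="zh-CN", chunk=9600):
    # stream audio while "speaking", then request decoding, then terminate
    for i in range(0, len(pcm_bytes), chunk):
        yield paraformer_pb2.Request(user=user, language=language, speaking=True,
                                     isEnd=False, audio_data=pcm_bytes[i:i + chunk])
    yield paraformer_pb2.Request(user=user, language=language, speaking=False, isEnd=False)
    yield paraformer_pb2.Request(user=user, language=language, speaking=False, isEnd=True)


with open("test.pcm", "rb") as f:  # hypothetical 16 kHz, 16-bit mono PCM file
    audio = f.read()

with grpc.insecure_channel("127.0.0.1:10108") as channel:
    stub = paraformer_pb2_grpc.ASRStub(channel)
    for response in stub.Recognize(request_stream(audio)):
        print(response.action, response.sentence)
```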
|
||||
funasr/runtime/grpc/common.cmake (new file, 125 lines)
@ -0,0 +1,125 @@
|
||||
# Copyright 2018 gRPC authors.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
#
|
||||
# cmake build file for C++ route_guide example.
|
||||
# Assumes protobuf and gRPC have been installed using cmake.
|
||||
# See cmake_externalproject/CMakeLists.txt for all-in-one cmake build
|
||||
# that automatically builds all the dependencies before building route_guide.
|
||||
|
||||
cmake_minimum_required(VERSION 3.5.1)
|
||||
|
||||
if (NOT DEFINED CMAKE_CXX_STANDARD)
|
||||
set (CMAKE_CXX_STANDARD 14)
|
||||
endif()
|
||||
|
||||
if(MSVC)
|
||||
add_definitions(-D_WIN32_WINNT=0x600)
|
||||
endif()
|
||||
|
||||
find_package(Threads REQUIRED)
|
||||
|
||||
if(GRPC_AS_SUBMODULE)
|
||||
# One way to build a projects that uses gRPC is to just include the
|
||||
# entire gRPC project tree via "add_subdirectory".
|
||||
# This approach is very simple to use, but the are some potential
|
||||
# disadvantages:
|
||||
# * it includes gRPC's CMakeLists.txt directly into your build script
|
||||
# without and that can make gRPC's internal setting interfere with your
|
||||
# own build.
|
||||
# * depending on what's installed on your system, the contents of submodules
|
||||
# in gRPC's third_party/* might need to be available (and there might be
|
||||
# additional prerequisites required to build them). Consider using
|
||||
# the gRPC_*_PROVIDER options to fine-tune the expected behavior.
|
||||
#
|
||||
# A more robust approach to add dependency on gRPC is using
|
||||
# cmake's ExternalProject_Add (see cmake_externalproject/CMakeLists.txt).
|
||||
|
||||
# Include the gRPC's cmake build (normally grpc source code would live
|
||||
# in a git submodule called "third_party/grpc", but this example lives in
|
||||
# the same repository as gRPC sources, so we just look a few directories up)
|
||||
add_subdirectory(../../.. ${CMAKE_CURRENT_BINARY_DIR}/grpc EXCLUDE_FROM_ALL)
|
||||
message(STATUS "Using gRPC via add_subdirectory.")
|
||||
|
||||
# After using add_subdirectory, we can now use the grpc targets directly from
|
||||
# this build.
|
||||
set(_PROTOBUF_LIBPROTOBUF libprotobuf)
|
||||
set(_REFLECTION grpc++_reflection)
|
||||
if(CMAKE_CROSSCOMPILING)
|
||||
find_program(_PROTOBUF_PROTOC protoc)
|
||||
else()
|
||||
set(_PROTOBUF_PROTOC $<TARGET_FILE:protobuf::protoc>)
|
||||
endif()
|
||||
set(_GRPC_GRPCPP grpc++)
|
||||
if(CMAKE_CROSSCOMPILING)
|
||||
find_program(_GRPC_CPP_PLUGIN_EXECUTABLE grpc_cpp_plugin)
|
||||
else()
|
||||
set(_GRPC_CPP_PLUGIN_EXECUTABLE $<TARGET_FILE:grpc_cpp_plugin>)
|
||||
endif()
|
||||
elseif(GRPC_FETCHCONTENT)
|
||||
# Another way is to use CMake's FetchContent module to clone gRPC at
|
||||
# configure time. This makes gRPC's source code available to your project,
|
||||
# similar to a git submodule.
|
||||
message(STATUS "Using gRPC via add_subdirectory (FetchContent).")
|
||||
include(FetchContent)
|
||||
FetchContent_Declare(
|
||||
grpc
|
||||
GIT_REPOSITORY https://github.com/grpc/grpc.git
|
||||
# when using gRPC, you will actually set this to an existing tag, such as
|
||||
# v1.25.0, v1.26.0 etc..
|
||||
# For the purpose of testing, we override the tag used to the commit
|
||||
# that's currently under test.
|
||||
GIT_TAG vGRPC_TAG_VERSION_OF_YOUR_CHOICE)
|
||||
FetchContent_MakeAvailable(grpc)
|
||||
|
||||
# Since FetchContent uses add_subdirectory under the hood, we can use
|
||||
# the grpc targets directly from this build.
|
||||
set(_PROTOBUF_LIBPROTOBUF libprotobuf)
|
||||
set(_REFLECTION grpc++_reflection)
|
||||
set(_PROTOBUF_PROTOC $<TARGET_FILE:protoc>)
|
||||
set(_GRPC_GRPCPP grpc++)
|
||||
if(CMAKE_CROSSCOMPILING)
|
||||
find_program(_GRPC_CPP_PLUGIN_EXECUTABLE grpc_cpp_plugin)
|
||||
else()
|
||||
set(_GRPC_CPP_PLUGIN_EXECUTABLE $<TARGET_FILE:grpc_cpp_plugin>)
|
||||
endif()
|
||||
else()
|
||||
# This branch assumes that gRPC and all its dependencies are already installed
|
||||
# on this system, so they can be located by find_package().
|
||||
|
||||
# Find Protobuf installation
|
||||
# Looks for protobuf-config.cmake file installed by Protobuf's cmake installation.
|
||||
set(protobuf_MODULE_COMPATIBLE TRUE)
|
||||
find_package(Protobuf CONFIG REQUIRED)
|
||||
message(STATUS "Using protobuf ${Protobuf_VERSION}")
|
||||
|
||||
set(_PROTOBUF_LIBPROTOBUF protobuf::libprotobuf)
|
||||
set(_REFLECTION gRPC::grpc++_reflection)
|
||||
if(CMAKE_CROSSCOMPILING)
|
||||
find_program(_PROTOBUF_PROTOC protoc)
|
||||
else()
|
||||
set(_PROTOBUF_PROTOC $<TARGET_FILE:protobuf::protoc>)
|
||||
endif()
|
||||
|
||||
# Find gRPC installation
|
||||
# Looks for gRPCConfig.cmake file installed by gRPC's cmake installation.
|
||||
find_package(gRPC CONFIG REQUIRED)
|
||||
message(STATUS "Using gRPC ${gRPC_VERSION}")
|
||||
|
||||
set(_GRPC_GRPCPP gRPC::grpc++)
|
||||
if(CMAKE_CROSSCOMPILING)
|
||||
find_program(_GRPC_CPP_PLUGIN_EXECUTABLE grpc_cpp_plugin)
|
||||
else()
|
||||
set(_GRPC_CPP_PLUGIN_EXECUTABLE $<TARGET_FILE:gRPC::grpc_cpp_plugin>)
|
||||
endif()
|
||||
endif()
|
||||
funasr/runtime/grpc/paraformer_server.cc (new file, 195 lines)
@ -0,0 +1,195 @@
#include <algorithm>
#include <chrono>
#include <cmath>
#include <iostream>
#include <sstream>
#include <memory>
#include <string>

#include <grpc/grpc.h>
#include <grpcpp/server.h>
#include <grpcpp/server_builder.h>
#include <grpcpp/server_context.h>
#include <grpcpp/security/server_credentials.h>

#include "paraformer.grpc.pb.h"
#include "paraformer_server.h"

using grpc::Server;
using grpc::ServerBuilder;
using grpc::ServerContext;
using grpc::ServerReader;
using grpc::ServerReaderWriter;
using grpc::ServerWriter;
using grpc::Status;

using paraformer::Request;
using paraformer::Response;
using paraformer::ASR;

ASRServicer::ASRServicer(const char* model_path, int thread_num) {
    AsrHanlde = RapidAsrInit(model_path, thread_num);
    std::cout << "ASRServicer init" << std::endl;
    init_flag = 0;
}

void ASRServicer::clear_states(const std::string& user) {
    clear_buffers(user);
    clear_transcriptions(user);
}

void ASRServicer::clear_buffers(const std::string& user) {
    if (client_buffers.count(user)) {
        client_buffers.erase(user);
    }
}

void ASRServicer::clear_transcriptions(const std::string& user) {
    if (client_transcription.count(user)) {
        client_transcription.erase(user);
    }
}

void ASRServicer::disconnect(const std::string& user) {
    clear_states(user);
    std::cout << "Disconnecting user: " << user << std::endl;
}

grpc::Status ASRServicer::Recognize(
        grpc::ServerContext* context,
        grpc::ServerReaderWriter<Response, Request>* stream) {
    Request req;
    while (stream->Read(&req)) {
        if (req.isend()) {
            // Client signalled the end of the session: drop its state and confirm.
            std::cout << "asr end" << std::endl;
            disconnect(req.user());
            Response res;
            res.set_sentence(R"({"success": true, "detail": "asr end"})");
            res.set_user(req.user());
            res.set_action("terminate");
            res.set_language(req.language());
            stream->Write(res);
        } else if (req.speaking()) {
            // Client is still speaking: append the audio chunk to its buffer.
            if (req.audio_data().size() > 0) {
                auto& buf = client_buffers[req.user()];
                buf.insert(buf.end(), req.audio_data().begin(), req.audio_data().end());
            }
            Response res;
            res.set_sentence(R"({"success": true, "detail": "speaking"})");
            res.set_user(req.user());
            res.set_action("speaking");
            res.set_language(req.language());
            stream->Write(res);
        } else if (!req.speaking()) {
            if (client_buffers.count(req.user()) == 0) {
                // Nothing buffered yet for this user.
                Response res;
                res.set_sentence(R"({"success": true, "detail": "waiting_for_voice"})");
                res.set_user(req.user());
                res.set_action("waiting");
                res.set_language(req.language());
                stream->Write(res);
            } else {
                // End of speech: decode the buffered audio.
                auto begin_time = std::chrono::duration_cast<std::chrono::milliseconds>(
                    std::chrono::system_clock::now().time_since_epoch()).count();
                std::string tmp_data = this->client_buffers[req.user()];
                this->clear_states(req.user());

                Response res;
                int data_len_int = tmp_data.length();
                std::string data_len = std::to_string(data_len_int);
                std::stringstream ss;
                ss << R"({"success": true, "detail": "decoding data: )" << data_len << R"( bytes"})";
                std::string result = ss.str();
                res.set_sentence(result);
                res.set_user(req.user());
                res.set_action("decoding");
                res.set_language(req.language());
                stream->Write(res);

                if (tmp_data.length() < 800) {  // min input length for the asr model
                    auto end_time = std::chrono::duration_cast<std::chrono::milliseconds>(
                        std::chrono::system_clock::now().time_since_epoch()).count();
                    std::string delay_str = std::to_string(end_time - begin_time);
                    std::cout << "user: " << req.user() << " , delay(ms): " << delay_str
                              << ", error: data_is_not_long_enough" << std::endl;
                    Response res;
                    std::stringstream ss;
                    std::string asr_result = "";
                    ss << R"({"success": true, "detail": "finish_sentence","server_delay_ms":)" << delay_str
                       << R"(,"text":")" << asr_result << R"("})";
                    std::string result = ss.str();
                    res.set_sentence(result);
                    res.set_user(req.user());
                    res.set_action("finish");
                    res.set_language(req.language());
                    stream->Write(res);
                } else {
                    RPASR_RESULT Result = RapidAsrRecogPCMBuffer(AsrHanlde, tmp_data.c_str(), data_len_int, RASR_NONE, NULL);
                    std::string asr_result = ((RPASR_RECOG_RESULT*)Result)->msg;

                    auto end_time = std::chrono::duration_cast<std::chrono::milliseconds>(
                        std::chrono::system_clock::now().time_since_epoch()).count();
                    std::string delay_str = std::to_string(end_time - begin_time);

                    std::cout << "user: " << req.user() << " , delay(ms): " << delay_str
                              << ", text: " << asr_result << std::endl;
                    Response res;
                    std::stringstream ss;
                    ss << R"({"success": true, "detail": "finish_sentence","server_delay_ms":)" << delay_str
                       << R"(,"text":")" << asr_result << R"("})";
                    std::string result = ss.str();
                    res.set_sentence(result);
                    res.set_user(req.user());
                    res.set_action("finish");
                    res.set_language(req.language());
                    stream->Write(res);
                }
            }
        } else {
            Response res;
            res.set_sentence(R"({"success": false, "detail": "error, no condition matched! Unknown reason."})");
            res.set_user(req.user());
            res.set_action("terminate");
            res.set_language(req.language());
            stream->Write(res);
        }
    }
    return Status::OK;
}

void RunServer(const std::string& port, int thread_num, const char* model_path) {
    std::string server_address = "0.0.0.0:" + port;
    ASRServicer service(model_path, thread_num);

    ServerBuilder builder;
    builder.AddListeningPort(server_address, grpc::InsecureServerCredentials());
    builder.RegisterService(&service);
    std::unique_ptr<Server> server(builder.BuildAndStart());
    std::cout << "Server listening on " << server_address << std::endl;
    server->Wait();
}

int main(int argc, char* argv[]) {
    if (argc < 4) {
        printf("Usage: %s port thread_num /path/to/model_file\n", argv[0]);
        exit(-1);
    }

    RunServer(argv[1], atoi(argv[2]), argv[3]);
    return 0;
}
56  funasr/runtime/grpc/paraformer_server.h  Normal file
@ -0,0 +1,56 @@
#pragma once

#include <algorithm>
#include <chrono>
#include <cmath>
#include <iostream>
#include <memory>
#include <string>
#include <unordered_map>

#include <grpc/grpc.h>
#include <grpcpp/server.h>
#include <grpcpp/server_builder.h>
#include <grpcpp/server_context.h>
#include <grpcpp/security/server_credentials.h>

#include "paraformer.grpc.pb.h"
#include "librapidasrapi.h"

using grpc::Server;
using grpc::ServerBuilder;
using grpc::ServerContext;
using grpc::ServerReader;
using grpc::ServerReaderWriter;
using grpc::ServerWriter;
using grpc::Status;

using paraformer::Request;
using paraformer::Response;
using paraformer::ASR;

typedef struct {
    std::string msg;
    float snippet_time;
} RPASR_RECOG_RESULT;

class ASRServicer final : public ASR::Service {
  private:
    int init_flag;
    // Per-user audio buffers and (reserved) transcription cache, keyed by user id.
    std::unordered_map<std::string, std::string> client_buffers;
    std::unordered_map<std::string, std::string> client_transcription;

  public:
    ASRServicer(const char* model_path, int thread_num);
    void clear_states(const std::string& user);
    void clear_buffers(const std::string& user);
    void clear_transcriptions(const std::string& user);
    void disconnect(const std::string& user);
    grpc::Status Recognize(grpc::ServerContext* context, grpc::ServerReaderWriter<Response, Request>* stream);
    RPASR_HANDLE AsrHanlde;
};
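For reference, here is a minimal client sketch for the `Recognize` stream above (not part of this commit). It assumes the `paraformer.proto` behind `paraformer.grpc.pb.h` defines a bidirectional-streaming `Recognize` RPC on service `ASR`, with the `user`, `language`, `speaking`, `isend` and `audio_data` request fields and the `sentence`/`action` response fields used by the server; the user id, language tag and file path are placeholders.

```cpp
// Hypothetical client for the Recognize stream; field names follow the server code above.
#include <fstream>
#include <iostream>
#include <iterator>
#include <memory>
#include <string>

#include <grpcpp/grpcpp.h>

#include "paraformer.grpc.pb.h"

using grpc::ClientContext;
using grpc::ClientReaderWriter;
using paraformer::ASR;
using paraformer::Request;
using paraformer::Response;

int main(int argc, char* argv[]) {
    if (argc < 3) {
        std::cerr << "Usage: " << argv[0] << " host:port /path/to/16k_16bit.pcm" << std::endl;
        return -1;
    }

    // Read raw PCM bytes (the server buffers them as an opaque byte string).
    std::ifstream ifs(argv[2], std::ios::binary);
    std::string pcm((std::istreambuf_iterator<char>(ifs)), std::istreambuf_iterator<char>());

    auto channel = grpc::CreateChannel(argv[1], grpc::InsecureChannelCredentials());
    std::unique_ptr<ASR::Stub> stub = ASR::NewStub(channel);

    ClientContext context;
    std::unique_ptr<ClientReaderWriter<Request, Response>> stream(stub->Recognize(&context));

    Request req;
    req.set_user("demo_user");   // placeholder user id
    req.set_language("zh-CN");   // placeholder language tag

    // 1) Send audio while speaking.
    req.set_speaking(true);
    req.set_isend(false);
    req.set_audio_data(pcm);
    stream->Write(req);

    // 2) Stop speaking so the server decodes the buffered audio.
    req.set_speaking(false);
    req.clear_audio_data();
    stream->Write(req);

    // 3) End the session.
    req.set_isend(true);
    stream->Write(req);
    stream->WritesDone();

    // Print every JSON status/result the server streams back.
    Response res;
    while (stream->Read(&res)) {
        std::cout << res.action() << ": " << res.sentence() << std::endl;
    }
    return stream->Finish().ok() ? 0 : 1;
}
```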
12  funasr/runtime/grpc/rebuild.sh  Normal file
@ -0,0 +1,12 @@
#!/bin/bash

# Rebuild from scratch.
rm -rf cmake
mkdir -p cmake/build

cd cmake/build

cmake -DCMAKE_BUILD_TYPE=release -DONNXRUNTIME_DIR=/data/asrmodel/onnxruntime-linux-x64-1.14.0 ../..
make

echo "Build cmake/build/paraformer_server successfully!"
@ -41,8 +41,8 @@ pip install --editable ./
```
Export the onnx model ([details](https://github.com/alibaba-damo-academy/FunASR/tree/main/funasr/export)); as an example, export a model from ModelScope:

```
python -m funasr.export.export_model 'damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch' "./export" true
```shell
python -m funasr.export.export_model --model-name damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch --export-dir ./export --type onnx --quantize False
```

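As a sanity check after export (not part of the diff above), the exported graph can be opened with the onnxruntime C++ API. A minimal sketch, assuming the export step leaves a `model.onnx` somewhere under `./export`; the exact path and file name depend on the exporter and are placeholders here:

```cpp
// Minimal sketch: open an exported Paraformer ONNX graph and list its inputs.
// The model path is a placeholder; point it at wherever the export step wrote model.onnx.
#include <iostream>
#include <onnxruntime_cxx_api.h>

int main() {
    Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "paraformer-export-check");
    Ort::SessionOptions opts;
    opts.SetIntraOpNumThreads(1);

    Ort::Session session(env, "./export/model.onnx", opts);  // placeholder path

    Ort::AllocatorWithDefaultOptions alloc;
    for (size_t i = 0; i < session.GetInputCount(); ++i) {
        auto name = session.GetInputNameAllocated(i, alloc);  // requires onnxruntime >= 1.13
        std::cout << "input[" << i << "]: " << name.get() << std::endl;
    }
    return 0;
}
```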
## Building Guidance for Linux/Unix
@ -237,7 +237,7 @@ bool Audio::loadpcmwav(const char* buf, int nBufLen)

    size_t nOffset = 0;

#define WAV_HEADER_SIZE 44

    speech_len = nBufLen / 2;
    speech_align_len = (int)(ceil((float)speech_len / align_size) * align_size);
@ -263,7 +263,8 @@ bool Audio::loadpcmwav(const char* buf, int nBufLen)
        speech_data[i] = (float)speech_buff[i] / scale;
    }

    AudioFrame* frame = new AudioFrame(speech_len);
    frame_queue.push(frame);
    return true;
}
@ -26,8 +26,9 @@ extern "C" {
        return nullptr;

    Audio audio(1);
    audio.loadwav(szBuf, nLen);
    audio.split();
    if (!audio.loadwav(szBuf, nLen))
        return nullptr;
    //audio.split();

    float* buff;
    int len;
@ -58,8 +59,9 @@ extern "C" {
        return nullptr;

    Audio audio(1);
    audio.loadpcmwav(szBuf, nLen);
    audio.split();
    if (!audio.loadpcmwav(szBuf, nLen))
        return nullptr;
    //audio.split();

    float* buff;
    int len;
@ -91,8 +93,9 @@ extern "C" {
        return nullptr;

    Audio audio(1);
    audio.loadpcmwav(szFileName);
    audio.split();
    if (!audio.loadpcmwav(szFileName))
        return nullptr;
    //audio.split();

    float* buff;
    int len;
@ -125,7 +128,7 @@ extern "C" {
    Audio audio(1);
    if (!audio.loadwav(szWavfile))
        return nullptr;
    audio.split();
    //audio.split();

    float* buff;
    int len;
@ -8,7 +8,7 @@
#include "librapidasrapi.h"

#include <iostream>

#include <fstream>
using namespace std;

int main(int argc, char *argv[])
@ -40,10 +40,13 @@ int main(int argc, char *argv[])

    gettimeofday(&start, NULL);

    RPASR_RESULT Result=RapidAsrRecogPCMFile(AsrHanlde, argv[2], RASR_NONE, NULL);
    gettimeofday(&end, NULL);
    float snippet_time = 0.0f;

    RPASR_RESULT Result=RapidAsrRecogFile(AsrHanlde, argv[2], RASR_NONE, NULL);

    gettimeofday(&end, NULL);

    if (Result)
    {
        string msg = RapidAsrGetResult(Result, 0);
@ -56,11 +59,51 @@ int main(int argc, char *argv[])
    }
    else
    {
        cout <<("no return data!");
        cout <<"no return data!";
    }

    printf("Audio length %lfs.\n", (double)snippet_time);

    //char* buff = nullptr;
    //int len = 0;
    //ifstream ifs(argv[2], std::ios::binary | std::ios::in);
    //if (ifs.is_open())
    //{
    //    ifs.seekg(0, std::ios::end);
    //    len = ifs.tellg();
    //    ifs.seekg(0, std::ios::beg);

    //    buff = new char[len];

    //    ifs.read(buff, len);

    //    //RPASR_RESULT Result = RapidAsrRecogPCMFile(AsrHanlde, argv[2], RASR_NONE, NULL);

    //    RPASR_RESULT Result=RapidAsrRecogPCMBuffer(AsrHanlde, buff, len, RASR_NONE, NULL);
    //    //RPASR_RESULT Result = RapidAsrRecogPCMFile(AsrHanlde, argv[2], RASR_NONE, NULL);
    //    gettimeofday(&end, NULL);
    //
    //    if (Result)
    //    {
    //        string msg = RapidAsrGetResult(Result, 0);
    //        setbuf(stdout, NULL);
    //        cout << "Result: \"";
    //        cout << msg << endl;
    //        cout << "\"." << endl;
    //        snippet_time = RapidAsrGetRetSnippetTime(Result);
    //        RapidAsrFreeResult(Result);
    //    }
    //    else
    //    {
    //        cout <<"no return data!";
    //    }
    //
    //    delete[] buff;
    //}

    printf("Audio length %lfs.\n", (double)snippet_time);
    seconds = (end.tv_sec - start.tv_sec);
    long taking_micros = ((seconds * 1000000) + end.tv_usec) - (start.tv_usec);
    printf("Model inference takes %lfs.\n", (double)taking_micros / 1000000);
45  funasr/runtime/python/benchmark_libtorch.md  Normal file
@ -0,0 +1,45 @@
# Benchmark

### Data set
Aishell1 [test set](https://www.openslr.org/33/); the total audio duration is 36108.919 seconds.

### Tools
- Install ModelScope and FunASR

```shell
pip install "modelscope[audio_asr]" --upgrade -f https://modelscope.oss-cn-beijing.aliyuncs.com/releases/repo.html
git clone https://github.com/alibaba-damo-academy/FunASR.git && cd FunASR
pip install --editable ./
cd funasr/runtime/python/utils
pip install -r requirements.txt
```

- Recipe

Set the model, data path and output_dir, then run:

```shell
nohup bash test_rtf.sh &> log.txt &
```


## [Paraformer-large](https://www.modelscope.cn/models/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary)

### Intel(R) Xeon(R) Platinum 8269CY CPU @ 2.50GHz, 16 cores / 32 processors, with avx512_vnni

| concurrent-tasks | processing time(s) | RTF | Speedup Rate |
|:----------------:|:------------------:|:------:|:------------:|
| 1 (torch fp32) | 3522 | 0.0976 | 10.3 |
| 1 (torch int8) | 1746 | 0.0484 | 20.7 |
| 32 (torch fp32) | 236 | 0.0066 | 152.7 |
| 32 (torch int8) | 114 | 0.0032 | 317.4 |
| 64 (torch fp32) | 235 | 0.0065 | 153.7 |
| 64 (torch int8) | 113 | 0.0031 | 319.2 |
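The RTF and Speedup columns follow directly from the processing time and the 36108.919 s test-set duration (RTF = processing time / audio duration, Speedup = audio duration / processing time). A minimal standalone sketch, not part of the benchmark scripts; small differences from the table come from rounding of the reported processing times:

```cpp
// Minimal sketch: derive RTF and Speedup for one table row, assuming
//   RTF     = processing_time / total_audio_duration
//   Speedup = total_audio_duration / processing_time (i.e. 1 / RTF)
#include <cstdio>

int main() {
    const double total_audio_s = 36108.919;  // Aishell1 test set duration (s)
    const double processing_s  = 3522.0;     // e.g. 1 concurrent task, torch fp32
    const double rtf     = processing_s / total_audio_s;
    const double speedup = total_audio_s / processing_s;
    std::printf("RTF = %.4f, Speedup = %.1f\n", rtf, speedup);  // ~0.0975, ~10.3
    return 0;
}
```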
[//]: # (### Intel(R) Xeon(R) Platinum 8163 CPU @ 2.50GHz, 32 cores / 64 processors, without avx512_vnni)

## [Paraformer](https://modelscope.cn/models/damo/speech_paraformer_asr_nat-zh-cn-16k-common-vocab8358-tensorflow1/summary)
89  funasr/runtime/python/benchmark_onnx.md  Normal file
@ -0,0 +1,89 @@
# Benchmark

### Data set
Aishell1 [test set](https://www.openslr.org/33/); the total audio duration is 36108.919 seconds.

### Tools
- Install ModelScope and FunASR

```shell
pip install "modelscope[audio_asr]" --upgrade -f https://modelscope.oss-cn-beijing.aliyuncs.com/releases/repo.html
git clone https://github.com/alibaba-damo-academy/FunASR.git && cd FunASR
pip install --editable ./
cd funasr/runtime/python/utils
pip install -r requirements.txt
```

- Recipe

Set the model, data path and output_dir, then run:

```shell
nohup bash test_rtf.sh &> log.txt &
```


## [Paraformer-large](https://www.modelscope.cn/models/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary)

### Intel(R) Xeon(R) Platinum 8369B CPU @ 2.90GHz, 16 cores / 32 processors, with avx512_vnni

| concurrent-tasks | processing time(s) | RTF | Speedup Rate |
|:----------------:|:------------------:|:-------:|:------------:|
| 1 (onnx fp32) | 2806 | 0.0777 | 12.9 |
| 1 (onnx int8) | 1611 | 0.0446 | 22.4 |
| 8 (onnx fp32) | 538 | 0.0149 | 67.1 |
| 8 (onnx int8) | 210 | 0.0058 | 172.4 |
| 16 (onnx fp32) | 288 | 0.0080 | 125.2 |
| 16 (onnx int8) | 117 | 0.0032 | 309.9 |
| 32 (onnx fp32) | 167 | 0.0046 | 216.5 |
| 32 (onnx int8) | 86 | 0.0024 | 420.0 |
| 64 (onnx fp32) | 158 | 0.0044 | 228.1 |
| 64 (onnx int8) | 82 | 0.0023 | 442.8 |
| 96 (onnx fp32) | 151 | 0.0042 | 238.0 |
| 96 (onnx int8) | 80 | 0.0022 | 452.0 |


### Intel(R) Xeon(R) Platinum 8269CY CPU @ 2.50GHz, 16 cores / 32 processors, with avx512_vnni

| concurrent-tasks | processing time(s) | RTF | Speedup Rate |
|:----------------:|:------------------:|:------:|:------------:|
| 1 (onnx fp32) | 2613 | 0.0724 | 13.8 |
| 1 (onnx int8) | 1321 | 0.0366 | 22.4 |
| 32 (onnx fp32) | 170 | 0.0047 | 212.7 |
| 32 (onnx int8) | 89 | 0.0025 | 407.0 |
| 64 (onnx fp32) | 166 | 0.0046 | 217.1 |
| 64 (onnx int8) | 87 | 0.0024 | 414.7 |


### Intel(R) Xeon(R) Platinum 8163 CPU @ 2.50GHz, 32 cores / 64 processors, without avx512_vnni

| concurrent-tasks | processing time(s) | RTF | Speedup Rate |
|:----------------:|:------------------:|:------:|:------------:|
| 1 (onnx fp32) | 2959 | 0.0820 | 12.2 |
| 1 (onnx int8) | 2814 | 0.0778 | 12.8 |
| 16 (onnx fp32) | 373 | 0.0103 | 96.9 |
| 16 (onnx int8) | 331 | 0.0091 | 109.0 |
| 32 (onnx fp32) | 211 | 0.0058 | 171.4 |
| 32 (onnx int8) | 181 | 0.0050 | 200.0 |
| 64 (onnx fp32) | 153 | 0.0042 | 235.9 |
| 64 (onnx int8) | 103 | 0.0029 | 349.9 |
| 96 (onnx fp32) | 146 | 0.0041 | 247.0 |
| 96 (onnx int8) | 108 | 0.0030 | 334.1 |

## [Paraformer](https://modelscope.cn/models/damo/speech_paraformer_asr_nat-zh-cn-16k-common-vocab8358-tensorflow1/summary)

### Intel(R) Xeon(R) Platinum 8369B CPU @ 2.90GHz, 16 cores / 32 processors, with avx512_vnni

| concurrent-tasks | processing time(s) | RTF | Speedup Rate |
|:----------------:|:------------------:|:------:|:------------:|
| 1 (onnx fp32) | 1173 | 0.0325 | 30.8 |
| 1 (onnx int8) | 976 | 0.0270 | 37.0 |
| 16 (onnx fp32) | 91 | 0.0025 | 395.2 |
| 16 (onnx int8) | 78 | 0.0022 | 463.0 |
| 32 (onnx fp32) | 60 | 0.0017 | 598.8 |
| 32 (onnx int8) | 40 | 0.0011 | 892.9 |
| 64 (onnx fp32) | 55 | 0.0015 | 653.6 |
| 64 (onnx int8) | 31 | 0.0009 | 1162.8 |
| 96 (onnx fp32) | 57 | 0.0016 | 632.9 |
| 96 (onnx int8) | 33 | 0.0009 | 1098.9 |
Some files were not shown because too many files have changed in this diff.