mirror of https://github.com/modelscope/FunASR, synced 2025-09-15 14:48:36 +08:00
Merge branch 'main' into dev_xw
commit 03d4ce8298

37 README.md
@@ -15,36 +15,10 @@
| [**Model Zoo**](https://www.modelscope.cn/models/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary) | [**Contact**](#contact) |

## What's new:

### 2023.2.17, funasr-0.2.0, modelscope-1.3.0

- We support a new feature: exporting Paraformer models to [ONNX and TorchScript](https://github.com/alibaba-damo-academy/FunASR/tree/main/funasr/export) from ModelScope. Local finetuned models are also supported.
- We support a new feature, [onnxruntime](https://github.com/alibaba-damo-academy/FunASR/tree/main/funasr/runtime/python/onnxruntime/paraformer/rapid_paraformer): you can deploy the runtime without ModelScope or FunASR. For the [paraformer-large](https://www.modelscope.cn/models/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary) model, onnxruntime gives a 3x RTF speedup (0.110 -> 0.038) on CPU, [details](https://github.com/alibaba-damo-academy/FunASR/tree/main/funasr/runtime/python/onnxruntime/paraformer/rapid_paraformer#speed).
- We support a new feature, [grpc](https://github.com/alibaba-damo-academy/FunASR/tree/main/funasr/runtime/python/grpc): you can build an ASR service with gRPC by deploying the ModelScope pipeline or onnxruntime.
- We release a new model, [paraformer-large-contextual](https://www.modelscope.cn/models/damo/speech_paraformer-large-contextual_asr_nat-zh-cn-16k-common-vocab8404/summary), which supports hotword customization based on incentive enhancement and improves the recall and precision of hotwords.
- We optimize the timestamp alignment of [Paraformer-large-long](https://modelscope.cn/models/damo/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary); the timestamp prediction accuracy is much improved, achieving an accumulated average shift (AAS) of 74.7 ms, [details](https://arxiv.org/abs/2301.12343).
- We release a new model, the [8k VAD model](https://modelscope.cn/models/damo/speech_fsmn_vad_zh-cn-16k-common-pytorch/summary), which predicts the duration of non-silence speech. It can be freely combined with any ASR model in [modelscope](https://github.com/alibaba-damo-academy/FunASR/discussions/134).
- We release a new model, [MFCCA](https://www.modelscope.cn/models/NPU-ASLP/speech_mfcca_asr-zh-cn-16k-alimeeting-vocab4950/summary), a multi-channel multi-speaker model that is independent of the number and geometry of microphones and supports Mandarin meeting transcription.
- We release several new UniASR models:
  [Southern Fujian Dialect model](https://modelscope.cn/models/damo/speech_UniASR_asr_2pass-minnan-16k-common-vocab3825/summary),
  [French model](https://modelscope.cn/models/damo/speech_UniASR_asr_2pass-fr-16k-common-vocab3472-tensorflow1-online/summary),
  [German model](https://modelscope.cn/models/damo/speech_UniASR_asr_2pass-de-16k-common-vocab3690-tensorflow1-online/summary),
  [Vietnamese model](https://modelscope.cn/models/damo/speech_UniASR_asr_2pass-vi-16k-common-vocab1001-pytorch-online/summary),
  [Persian model](https://modelscope.cn/models/damo/speech_UniASR_asr_2pass-fa-16k-common-vocab1257-pytorch-online/summary).
- We release a new model, the [paraformer-data2vec model](https://www.modelscope.cn/models/damo/speech_data2vec_pretrain-paraformer-zh-cn-aishell2-16k/summary), an unsupervised pretraining model trained on AISHELL-2, which initializes a Paraformer model that is then finetuned on AISHELL-1.
- We release a new feature: the `VAD`, `ASR` and `PUNC` models can be combined freely, using either models from [modelscope](https://www.modelscope.cn/models/damo/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary) or local finetuned models; see the [demo](https://github.com/alibaba-damo-academy/FunASR/discussions/134) and the pipeline sketch after this news section.
- We optimized the [punctuation common model](https://www.modelscope.cn/models/damo/punc_ct-transformer_zh-cn-common-vocab272727-pytorch/summary), improving recall and precision and fixing bad cases of missing punctuation marks.
- Various new audio input types are now supported by the ModelScope inference pipeline, including mp3, flac, ogg, opus...

### 2023.1.16, funasr-0.1.6, modelscope-1.2.0

- We release a new version of [Paraformer-large-long](https://modelscope.cn/models/damo/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary), which integrates the [VAD](https://modelscope.cn/models/damo/speech_fsmn_vad_zh-cn-16k-common-pytorch/summary) model, the [ASR](https://www.modelscope.cn/models/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary) model, the [Punctuation](https://www.modelscope.cn/models/damo/punc_ct-transformer_zh-cn-common-vocab272727-pytorch/summary) model and timestamp prediction. The model can take inputs several hours long.
- We release a new model, the [16k VAD model](https://modelscope.cn/models/damo/speech_fsmn_vad_zh-cn-16k-common-pytorch/summary), which predicts the duration of non-silence speech. It can be freely combined with any ASR model in [modelscope](https://www.modelscope.cn/models/damo/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary).
- We release a new model, [Punctuation](https://www.modelscope.cn/models/damo/punc_ct-transformer_zh-cn-common-vocab272727-pytorch/summary), which predicts punctuation for ASR results. It can be freely combined with any ASR model in the [Model Zoo](docs/modelscope_models.md).
- We release a new model, [Data2vec](https://www.modelscope.cn/models/damo/speech_data2vec_pretrain-zh-cn-aishell2-16k-pytorch/summary), an unsupervised pretraining model which can be finetuned on ASR and other downstream tasks.
- We release a new model, [Paraformer-Tiny](https://www.modelscope.cn/models/damo/speech_paraformer-tiny-commandword_asr_nat-zh-cn-16k-vocab544-pytorch/summary), a lightweight Paraformer model which supports Mandarin command-word recognition.
- We release a new model, [SV](https://www.modelscope.cn/models/damo/speech_xvector_sv-zh-cn-cnceleb-16k-spk3465-pytorch/summary), which extracts speaker embeddings and performs speaker verification on paired utterances. It will be used for speaker diarization in a future version.
- We improve the ModelScope pipeline to speed up inference by integrating model building into pipeline building.
- Various new audio input types are now supported by the ModelScope inference pipeline, including wav.scp, wav format, audio bytes, wave samples...

For the release notes, please refer to [news](https://github.com/alibaba-damo-academy/FunASR/releases).
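As a rough illustration of the free `VAD` + `ASR` + `PUNC` combination and the long-audio support described above, here is a minimal sketch built on the ModelScope pipeline used throughout this repo; the audio path is a placeholder and keyword arguments may differ slightly between modelscope/funasr versions.

```python
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

# Paraformer-large-long bundles VAD, ASR, punctuation and timestamps,
# so a single pipeline call can transcribe a long recording.
inference_pipeline = pipeline(
    task=Tasks.auto_speech_recognition,
    model="damo/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch",
    output_dir="./results",
)
rec_result = inference_pipeline(audio_in="long_audio.wav")  # placeholder path
print(rec_result)
```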
## Highlights

- Many types of typical models are supported, e.g., [Transformer](https://arxiv.org/abs/1706.03762), [Conformer](https://arxiv.org/abs/2005.08100), [Paraformer](https://arxiv.org/abs/2206.08317).

@@ -56,6 +30,7 @@

## Installation

``` sh
pip install "modelscope[audio_asr]" --upgrade -f https://modelscope.oss-cn-beijing.aliyuncs.com/releases/repo.html
git clone https://github.com/alibaba/FunASR.git && cd FunASR
pip install --editable ./
```
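A minimal sketch of a first inference run after the install above, using the ModelScope pipeline API that the recipes in this repo use; `asr_example.wav` is a placeholder path.

```python
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

inference_pipeline = pipeline(
    task=Tasks.auto_speech_recognition,
    model="damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch",
)
rec_result = inference_pipeline(audio_in="asr_example.wav")  # placeholder path
print(rec_result)
```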
@@ -76,8 +51,8 @@ If you have any questions about FunASR, please contact us by

## Contributors

-| <div align="left"><img src="docs/images/DeepScience.png" width="250"/> |
-|:---:|
+| <div align="left"><img src="docs/images/damo.png" width="180"/> | <div align="left"><img src="docs/images/nwpu.png" width="260"/> | <img src="docs/images/DeepScience.png" width="200"/> </div> |
+|:---------------------------------------------------------------:|:---------------------------------------------------------------:|:-----------------------------------------------------------:|

## Acknowledge

@@ -111,4 +86,4 @@ This project is licensed under the [The MIT License](https://opensource.org/lice
    booktitle={arXiv preprint arXiv:2301.12343}
    year={2023}
}
```
```
BIN  docs/images/damo.png  Normal file  (binary file not shown; after: 53 KiB)
BIN  docs/images/nwpu.png  Normal file  (binary file not shown; after: 41 KiB)
@@ -52,7 +52,7 @@ asr_config=conf/train_asr_conformer.yaml
model_dir="baseline_$(basename "${asr_config}" .yaml)_${feats_type}_${lang}_${token_type}_${tag}"

inference_config=conf/decode_asr_transformer.yaml
-inference_asr_model=valid.acc.ave_10best.pth
+inference_asr_model=valid.acc.ave_10best.pb

# you can set gpu num for decoding here
gpuid_list=$CUDA_VISIBLE_DEVICES # set gpus for decoding, the same as training stage by default

@@ -55,7 +55,7 @@ asr_config=conf/train_asr_paraformer_transformer_12e_6d_3072_768.yaml
model_dir="baseline_$(basename "${asr_config}" .yaml)_${feats_type}_${lang}_${token_type}_${tag}"

inference_config=conf/decode_asr_transformer_noctc_1best.yaml
-inference_asr_model=valid.acc.ave_10best.pth
+inference_asr_model=valid.acc.ave_10best.pb

# you can set gpu num for decoding here
gpuid_list=$CUDA_VISIBLE_DEVICES # set gpus for decoding, the same as training stage by default

@@ -55,7 +55,7 @@ asr_config=conf/train_asr_transformer_12e_6d_3072_768.yaml
model_dir="baseline_$(basename "${asr_config}" .yaml)_${feats_type}_${lang}_${token_type}_${tag}"

inference_config=conf/decode_asr_transformer.yaml
-inference_asr_model=valid.cer_ctc.ave_10best.pth
+inference_asr_model=valid.cer_ctc.ave_10best.pb

# you can set gpu num for decoding here
gpuid_list=$CUDA_VISIBLE_DEVICES # set gpus for decoding, the same as training stage by default

@@ -52,7 +52,7 @@ asr_config=conf/train_asr_paraformer_conformer_12e_6d_2048_256.yaml
model_dir="baseline_$(basename "${asr_config}" .yaml)_${feats_type}_${lang}_${token_type}_${tag}"

inference_config=conf/decode_asr_transformer_noctc_1best.yaml
-inference_asr_model=valid.acc.ave_10best.pth
+inference_asr_model=valid.acc.ave_10best.pb

# you can set gpu num for decoding here
gpuid_list=$CUDA_VISIBLE_DEVICES # set gpus for decoding, the same as training stage by default

@@ -56,7 +56,7 @@ asr_config=conf/train_asr_paraformerbert_conformer_12e_6d_2048_256.yaml
model_dir="baseline_$(basename "${asr_config}" .yaml)_${feats_type}_${lang}_${token_type}_${tag}"

inference_config=conf/decode_asr_transformer_noctc_1best.yaml
-inference_asr_model=valid.acc.ave_10best.pth
+inference_asr_model=valid.acc.ave_10best.pb

# you can set gpu num for decoding here
gpuid_list=$CUDA_VISIBLE_DEVICES # set gpus for decoding, the same as training stage by default

@@ -52,7 +52,7 @@ asr_config=conf/train_asr_conformer.yaml
model_dir="baseline_$(basename "${asr_config}" .yaml)_${feats_type}_${lang}_${token_type}_${tag}"

inference_config=conf/decode_asr_transformer.yaml
-inference_asr_model=valid.acc.ave_10best.pth
+inference_asr_model=valid.acc.ave_10best.pb

# you can set gpu num for decoding here
gpuid_list=$CUDA_VISIBLE_DEVICES # set gpus for decoding, the same as training stage by default

@@ -54,7 +54,7 @@ asr_config=conf/train_asr_conformer.yaml
model_dir="baseline_$(basename "${asr_config}" .yaml)_${feats_type}_${lang}_${token_type}_${tag}"

inference_config=conf/decode_asr_transformer.yaml
-inference_asr_model=valid.acc.ave_10best.pth
+inference_asr_model=valid.acc.ave_10best.pb

# you can set gpu num for decoding here
gpuid_list=$CUDA_VISIBLE_DEVICES # set gpus for decoding, e.g., gpuid_list=2,3, the same as training stage by default

@@ -54,7 +54,7 @@ asr_config=conf/train_asr_paraformer_conformer_20e_1280_320_6d_1280_320.yaml
model_dir="baseline_$(basename "${asr_config}" .yaml)_${feats_type}_${lang}_${token_type}_${tag}"

inference_config=conf/decode_asr_transformer_noctc_1best.yaml
-inference_asr_model=valid.acc.ave_10best.pth
+inference_asr_model=valid.acc.ave_10best.pb

# you can set gpu num for decoding here
gpuid_list=$CUDA_VISIBLE_DEVICES # set gpus for decoding, e.g., gpuid_list=2,3, the same as training stage by default

@@ -58,7 +58,7 @@ asr_config=conf/train_asr_paraformerbert_conformer_20e_6d_1280_320.yaml
model_dir="baseline_$(basename "${asr_config}" .yaml)_${feats_type}_${lang}_${token_type}_${tag}"

inference_config=conf/decode_asr_transformer_noctc_1best.yaml
-inference_asr_model=valid.acc.ave_10best.pth
+inference_asr_model=valid.acc.ave_10best.pb

# you can set gpu num for decoding here
gpuid_list=$CUDA_VISIBLE_DEVICES # set gpus for decoding, e.g., gpuid_list=2,3, the same as training stage by default

@@ -54,7 +54,7 @@ asr_config=conf/train_asr_transformer.yaml
model_dir="baseline_$(basename "${asr_config}" .yaml)_${feats_type}_${lang}_${token_type}_${tag}"

inference_config=conf/decode_asr_transformer.yaml
-inference_asr_model=valid.acc.ave_10best.pth
+inference_asr_model=valid.acc.ave_10best.pb

# you can set gpu num for decoding here
gpuid_list=$CUDA_VISIBLE_DEVICES # set gpus for decoding, e.g., gpuid_list=2,3, the same as training stage by default
@@ -34,7 +34,7 @@ exp_dir=./data
tag=exp1
model_dir="baseline_$(basename "${lm_config}" .yaml)_${lang}_${token_type}_${tag}"
lm_exp=${exp_dir}/exp/${model_dir}
-inference_lm=valid.loss.ave.pth # Language model path for decoding.
+inference_lm=valid.loss.ave.pb # Language model path for decoding.

stage=0
stop_stage=3
@@ -4,7 +4,7 @@ import sys

def main():
    diar_config_path = sys.argv[1] if len(sys.argv) > 1 else "sond_fbank.yaml"
-    diar_model_path = sys.argv[2] if len(sys.argv) > 2 else "sond.pth"
+    diar_model_path = sys.argv[2] if len(sys.argv) > 2 else "sond.pb"
    output_dir = sys.argv[3] if len(sys.argv) > 3 else "./outputs"
    data_path_and_name_and_type = [
        ("data/test_rmsil/feats.scp", "speech", "kaldi_ark"),
@@ -17,9 +17,9 @@ if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
    echo "Downloading Pre-trained model..."
    git clone https://www.modelscope.cn/damo/speech_xvector_sv-zh-cn-cnceleb-16k-spk3465-pytorch.git
    git clone https://www.modelscope.cn/damo/speech_diarization_sond-zh-cn-alimeeting-16k-n16k4-pytorch.git
-    ln -s speech_xvector_sv-zh-cn-cnceleb-16k-spk3465-pytorch/sv.pth ./sv.pth
+    ln -s speech_xvector_sv-zh-cn-cnceleb-16k-spk3465-pytorch/sv.pb ./sv.pb
    cp speech_xvector_sv-zh-cn-cnceleb-16k-spk3465-pytorch/sv.yaml ./sv.yaml
-    ln -s speech_diarization_sond-zh-cn-alimeeting-16k-n16k4-pytorch/sond.pth ./sond.pth
+    ln -s speech_diarization_sond-zh-cn-alimeeting-16k-n16k4-pytorch/sond.pb ./sond.pb
    cp speech_diarization_sond-zh-cn-alimeeting-16k-n16k4-pytorch/sond_fbank.yaml ./sond_fbank.yaml
    cp speech_diarization_sond-zh-cn-alimeeting-16k-n16k4-pytorch/sond.yaml ./sond.yaml
    echo "Done."

@@ -30,7 +30,7 @@ fi

if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
    echo "Calculating diarization results..."
-    python infer_alimeeting_test.py sond_fbank.yaml sond.pth outputs
+    python infer_alimeeting_test.py sond_fbank.yaml sond.pb outputs
    python local/convert_label_to_rttm.py \
        outputs/labels.txt \
        data/test_rmsil/raw_rmsil_map.scp \
@@ -4,7 +4,7 @@ import os

def test_fbank_cpu_infer():
    diar_config_path = "config_fbank.yaml"
-    diar_model_path = "sond.pth"
+    diar_model_path = "sond.pb"
    output_dir = "./outputs"
    data_path_and_name_and_type = [
        ("data/unit_test/test_feats.scp", "speech", "kaldi_ark"),

@@ -24,7 +24,7 @@ def test_fbank_cpu_infer():

def test_fbank_gpu_infer():
    diar_config_path = "config_fbank.yaml"
-    diar_model_path = "sond.pth"
+    diar_model_path = "sond.pb"
    output_dir = "./outputs"
    data_path_and_name_and_type = [
        ("data/unit_test/test_feats.scp", "speech", "kaldi_ark"),

@@ -45,7 +45,7 @@ def test_fbank_gpu_infer():

def test_wav_gpu_infer():
    diar_config_path = "config.yaml"
-    diar_model_path = "sond.pth"
+    diar_model_path = "sond.pb"
    output_dir = "./outputs"
    data_path_and_name_and_type = [
        ("data/unit_test/test_wav.scp", "speech", "sound"),

@@ -66,7 +66,7 @@ def test_wav_gpu_infer():

def test_without_profile_gpu_infer():
    diar_config_path = "config.yaml"
-    diar_model_path = "sond.pth"
+    diar_model_path = "sond.pb"
    output_dir = "./outputs"
    raw_inputs = [[
        "data/unit_test/raw_inputs/record.wav",
2739  egs/callhome/diarization/sond/sond.yaml  Normal file  (file diff suppressed because it is too large)
2739  egs/callhome/diarization/sond/sond_fbank.yaml  Normal file  (file diff suppressed because it is too large)
97  egs/callhome/diarization/sond/unit_test.py  Normal file
@@ -0,0 +1,97 @@
from funasr.bin.diar_inference_launch import inference_launch
import os


def test_fbank_cpu_infer():
    diar_config_path = "sond_fbank.yaml"
    diar_model_path = "sond.pb"
    output_dir = "./outputs"
    data_path_and_name_and_type = [
        ("data/unit_test/test_feats.scp", "speech", "kaldi_ark"),
        ("data/unit_test/test_profile.scp", "profile", "kaldi_ark"),
    ]
    pipeline = inference_launch(
        mode="sond",
        diar_train_config=diar_config_path,
        diar_model_file=diar_model_path,
        output_dir=output_dir,
        num_workers=0,
        log_level="INFO",
    )
    results = pipeline(data_path_and_name_and_type)
    print(results)


def test_fbank_gpu_infer():
    diar_config_path = "sond_fbank.yaml"
    diar_model_path = "sond.pb"
    output_dir = "./outputs"
    data_path_and_name_and_type = [
        ("data/unit_test/test_feats.scp", "speech", "kaldi_ark"),
        ("data/unit_test/test_profile.scp", "profile", "kaldi_ark"),
    ]
    pipeline = inference_launch(
        mode="sond",
        diar_train_config=diar_config_path,
        diar_model_file=diar_model_path,
        output_dir=output_dir,
        ngpu=1,
        num_workers=1,
        log_level="INFO",
    )
    results = pipeline(data_path_and_name_and_type)
    print(results)


def test_wav_gpu_infer():
    diar_config_path = "config.yaml"
    diar_model_path = "sond.pb"
    output_dir = "./outputs"
    data_path_and_name_and_type = [
        ("data/unit_test/test_wav.scp", "speech", "sound"),
        ("data/unit_test/test_profile.scp", "profile", "kaldi_ark"),
    ]
    pipeline = inference_launch(
        mode="sond",
        diar_train_config=diar_config_path,
        diar_model_file=diar_model_path,
        output_dir=output_dir,
        ngpu=1,
        num_workers=1,
        log_level="WARNING",
    )
    results = pipeline(data_path_and_name_and_type)
    print(results)


def test_without_profile_gpu_infer():
    diar_config_path = "config.yaml"
    diar_model_path = "sond.pb"
    output_dir = "./outputs"
    raw_inputs = [[
        "data/unit_test/raw_inputs/record.wav",
        "data/unit_test/raw_inputs/spk1.wav",
        "data/unit_test/raw_inputs/spk2.wav",
        "data/unit_test/raw_inputs/spk3.wav",
        "data/unit_test/raw_inputs/spk4.wav"
    ]]
    pipeline = inference_launch(
        mode="sond_demo",
        diar_train_config=diar_config_path,
        diar_model_file=diar_model_path,
        output_dir=output_dir,
        ngpu=1,
        num_workers=1,
        log_level="WARNING",
        param_dict={},
    )
    results = pipeline(raw_inputs=raw_inputs)
    print(results)


if __name__ == '__main__':
    os.environ["CUDA_VISIBLE_DEVICES"] = "7"
    test_fbank_cpu_infer()
    # test_fbank_gpu_infer()
    # test_wav_gpu_infer()
    # test_without_profile_gpu_infer()
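A possible way to run one of the tests above directly, assuming it is executed from this recipe directory after the models and configs (sond.pb, sond_fbank.yaml, config.yaml) have been downloaded as in the shell stages earlier; the GPU index is only an example.

```python
# Assumed usage sketch, not part of the committed file.
import os

os.environ["CUDA_VISIBLE_DEVICES"] = "0"    # any visible GPU index; example only

from unit_test import test_fbank_gpu_infer  # the file shown above

test_fbank_gpu_infer()
```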
@@ -49,7 +49,7 @@ asr_config=conf/train_asr_conformer.yaml
model_dir="baseline_$(basename "${asr_config}" .yaml)_${feats_type}_${lang}_${token_type}_${tag}"

inference_config=conf/decode_asr_transformer.yaml
-inference_asr_model=valid.acc.ave_10best.pth
+inference_asr_model=valid.acc.ave_10best.pb

# you can set gpu num for decoding here
gpuid_list=$CUDA_VISIBLE_DEVICES # set gpus for decoding, the same as training stage by default
@@ -41,7 +41,7 @@ The decoding results can be found in `$output_dir/1best_recog/text.cer`, which i
- Modify inference related parameters in `infer_after_finetune.py`
    - <strong>output_dir:</strong> # result dir
    - <strong>data_dir:</strong> # the dataset dir needs to include `test/wav.scp`. If `test/text` also exists, the CER will be computed
-    - <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pth`
+    - <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pb`

- Then you can run the pipeline to finetune with:
```python

@@ -48,5 +48,5 @@ if __name__ == '__main__':
    params["required_files"] = ["am.mvn", "decoding.yaml", "configuration.json"]
    params["output_dir"] = "./checkpoint"
    params["data_dir"] = "./data/test"
-    params["decoding_model_name"] = "valid.cer_ctc.ave.pth"
+    params["decoding_model_name"] = "valid.cer_ctc.ave.pb"
    modelscope_infer_after_finetune(params)
@@ -41,7 +41,7 @@ The decoding results can be found in `$output_dir/1best_recog/text.cer`, which i
- Modify inference related parameters in `infer_after_finetune.py`
    - <strong>output_dir:</strong> # result dir
    - <strong>data_dir:</strong> # the dataset dir needs to include `test/wav.scp`. If `test/text` also exists, the CER will be computed
-    - <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pth`
+    - <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pb`

- Then you can run the pipeline to finetune with:
```python

@@ -48,5 +48,5 @@ if __name__ == '__main__':
    params["required_files"] = ["am.mvn", "decoding.yaml", "configuration.json"]
    params["output_dir"] = "./checkpoint"
    params["data_dir"] = "./data/test"
-    params["decoding_model_name"] = "valid.cer_ctc.ave.pth"
+    params["decoding_model_name"] = "valid.cer_ctc.ave.pb"
    modelscope_infer_after_finetune(params)
@@ -41,7 +41,7 @@ The decoding results can be found in `$output_dir/1best_recog/text.sp.cer` and `
- Modify inference related parameters in `infer_after_finetune.py`
    - <strong>output_dir:</strong> # result dir
    - <strong>data_dir:</strong> # the dataset dir needs to include `test/wav.scp`. If `test/text` also exists, the CER will be computed
-    - <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pth`
+    - <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pb`

- Then you can run the pipeline to finetune with:
```python

@@ -31,5 +31,5 @@ if __name__ == '__main__':
    params.batch_bins = 1000  # batch size: if dataset_type="small", batch_bins is counted in fbank feature frames; if dataset_type="large", batch_bins is counted in milliseconds
    params.max_epoch = 10  # maximum number of training epochs
    params.lr = 0.0001  # learning rate
-    params.model_revision = 'v1.0.0'
+    params.model_revision = 'v3.0.0'
    modelscope_finetune(params)
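A hedged illustration of the `batch_bins` semantics described in the translated comment above; the values are arbitrary examples, not recommended settings.

```python
# Illustrative only: how batch_bins is interpreted for the two dataset types.
from types import SimpleNamespace

params = SimpleNamespace()

params.dataset_type = "small"
params.batch_bins = 2000      # counted in fbank feature frames (~20 s at a 10 ms frame shift)

# params.dataset_type = "large"
# params.batch_bins = 60000   # counted in milliseconds (= 60 s of audio per batch)
```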
@@ -19,7 +19,7 @@ def modelscope_infer_core(output_dir, split_dir, njob, idx):
    inference_pipline = pipeline(
        task=Tasks.auto_speech_recognition,
        model='NPU-ASLP/speech_mfcca_asr-zh-cn-16k-alimeeting-vocab4950',
-        model_revision='v1.0.0',
+        model_revision='v3.0.0',
        output_dir=output_dir_job,
        batch_size=1,
    )

@@ -63,5 +63,5 @@ if __name__ == '__main__':
    params["required_files"] = ["feats_stats.npz", "decoding.yaml", "configuration.json"]
    params["output_dir"] = "./checkpoint"
    params["data_dir"] = "./example_data/validation"
-    params["decoding_model_name"] = "valid.acc.ave.pth"
+    params["decoding_model_name"] = "valid.acc.ave.pb"
    modelscope_infer_after_finetune(params)
@@ -8,9 +8,14 @@ from modelscope.utils.constant import Tasks
from funasr.utils.compute_wer import compute_wer


-def modelscope_infer_core(output_dir, split_dir, njob, idx):
+def modelscope_infer_core(output_dir, split_dir, njob, idx, batch_size, ngpu, model):
    output_dir_job = os.path.join(output_dir, "output.{}".format(idx))
-    gpu_id = (int(idx) - 1) // njob
+    if ngpu > 0:
+        use_gpu = 1
+        gpu_id = int(idx) - 1
+    else:
+        use_gpu = 0
+        gpu_id = -1
    if "CUDA_VISIBLE_DEVICES" in os.environ.keys():
        gpu_list = os.environ['CUDA_VISIBLE_DEVICES'].split(",")
        os.environ['CUDA_VISIBLE_DEVICES'] = str(gpu_list[gpu_id])
@@ -18,9 +23,10 @@ def modelscope_infer_core(output_dir, split_dir, njob, idx):
        os.environ['CUDA_VISIBLE_DEVICES'] = str(gpu_id)
    inference_pipline = pipeline(
        task=Tasks.auto_speech_recognition,
-        model="damo/speech_paraformer-large_asr_nat-zh-cn-16k-aishell1-vocab8404-pytorch",
+        model=model,
        output_dir=output_dir_job,
-        batch_size=64
+        batch_size=batch_size,
+        ngpu=use_gpu,
    )
    audio_in = os.path.join(split_dir, "wav.{}.scp".format(idx))
    inference_pipline(audio_in=audio_in)
@@ -30,13 +36,18 @@ def modelscope_infer(params):
    # prepare for multi-GPU decoding
    ngpu = params["ngpu"]
    njob = params["njob"]
+    batch_size = params["batch_size"]
    output_dir = params["output_dir"]
+    model = params["model"]
    if os.path.exists(output_dir):
        shutil.rmtree(output_dir)
    os.mkdir(output_dir)
    split_dir = os.path.join(output_dir, "split")
    os.mkdir(split_dir)
-    nj = ngpu * njob
+    if ngpu > 0:
+        nj = ngpu
+    elif ngpu == 0:
+        nj = njob
    wav_scp_file = os.path.join(params["data_dir"], "wav.scp")
    with open(wav_scp_file) as f:
        lines = f.readlines()
@@ -56,7 +67,7 @@ def modelscope_infer(params):
    p = Pool(nj)
    for i in range(nj):
        p.apply_async(modelscope_infer_core,
-                      args=(output_dir, split_dir, njob, str(i + 1)))
+                      args=(output_dir, split_dir, njob, str(i + 1), batch_size, ngpu, model))
    p.close()
    p.join()

@@ -81,8 +92,10 @@ def modelscope_infer(params):

if __name__ == "__main__":
    params = {}
+    params["model"] = "damo/speech_paraformer-large_asr_nat-zh-cn-16k-aishell1-vocab8404-pytorch"
    params["data_dir"] = "./data/test"
    params["output_dir"] = "./results"
-    params["ngpu"] = 1
-    params["njob"] = 1
-    modelscope_infer(params)
+    params["ngpu"] = 1  # if ngpu > 0, will use gpu decoding
+    params["njob"] = 1  # if ngpu = 0, will use cpu decoding
+    params["batch_size"] = 64
+    modelscope_infer(params)
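A hedged sketch of driving the reworked `modelscope_infer` above in its two decoding modes; the environment variable and parameter values are illustrative only, not part of the committed script.

```python
# Illustrative only: how the GPU/CPU decoding modes above might be driven.
import os

params = {
    "model": "damo/speech_paraformer-large_asr_nat-zh-cn-16k-aishell1-vocab8404-pytorch",
    "data_dir": "./data/test",
    "output_dir": "./results",
    "batch_size": 64,
}

# GPU decoding: nj = ngpu workers, each mapped onto one visible GPU.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"
params["ngpu"], params["njob"] = 2, 1
# modelscope_infer(params)

# CPU decoding: ngpu = 0, njob parallel workers (nj = njob).
params["ngpu"], params["njob"] = 0, 4
# modelscope_infer(params)
```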
@@ -4,23 +4,18 @@ import shutil

from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks
+from modelscope.hub.snapshot_download import snapshot_download

from funasr.utils.compute_wer import compute_wer


def modelscope_infer_after_finetune(params):
    # prepare for decoding
-    pretrained_model_path = os.path.join(os.environ["HOME"], ".cache/modelscope/hub", params["modelscope_model_name"])
-    for file_name in params["required_files"]:
-        if file_name == "configuration.json":
-            with open(os.path.join(pretrained_model_path, file_name)) as f:
-                config_dict = json.load(f)
-            config_dict["model"]["am_model_name"] = params["decoding_model_name"]
-            with open(os.path.join(params["output_dir"], "configuration.json"), "w") as f:
-                json.dump(config_dict, f, indent=4, separators=(',', ': '))
-        else:
-            shutil.copy(os.path.join(pretrained_model_path, file_name),
-                        os.path.join(params["output_dir"], file_name))
+    try:
+        pretrained_model_path = snapshot_download(params["modelscope_model_name"], cache_dir=params["output_dir"])
+    except BaseException:
+        raise BaseException(f"Please download pretrain model from ModelScope firstly.")
+    shutil.copy(os.path.join(params["output_dir"], params["decoding_model_name"]), os.path.join(pretrained_model_path, "model.pb"))
    decoding_path = os.path.join(params["output_dir"], "decode_results")
    if os.path.exists(decoding_path):
        shutil.rmtree(decoding_path)
@@ -29,9 +24,9 @@ def modelscope_infer_after_finetune(params):
    # decoding
    inference_pipeline = pipeline(
        task=Tasks.auto_speech_recognition,
-        model=params["output_dir"],
+        model=pretrained_model_path,
        output_dir=decoding_path,
-        batch_size=64
+        batch_size=params["batch_size"]
    )
    audio_in = os.path.join(params["data_dir"], "wav.scp")
    inference_pipeline(audio_in=audio_in)
@@ -46,8 +41,8 @@ def modelscope_infer_after_finetune(params):
if __name__ == '__main__':
    params = {}
    params["modelscope_model_name"] = "damo/speech_paraformer-large_asr_nat-zh-cn-16k-aishell1-vocab8404-pytorch"
-    params["required_files"] = ["am.mvn", "decoding.yaml", "configuration.json"]
    params["output_dir"] = "./checkpoint"
    params["data_dir"] = "./data/test"
-    params["decoding_model_name"] = "valid.acc.ave_10best.pth"
-    modelscope_infer_after_finetune(params)
+    params["decoding_model_name"] = "valid.acc.ave_10best.pb"
+    params["batch_size"] = 64
+    modelscope_infer_after_finetune(params)
@@ -8,9 +8,14 @@ from modelscope.utils.constant import Tasks
from funasr.utils.compute_wer import compute_wer


-def modelscope_infer_core(output_dir, split_dir, njob, idx):
+def modelscope_infer_core(output_dir, split_dir, njob, idx, batch_size, ngpu, model):
    output_dir_job = os.path.join(output_dir, "output.{}".format(idx))
-    gpu_id = (int(idx) - 1) // njob
+    if ngpu > 0:
+        use_gpu = 1
+        gpu_id = int(idx) - 1
+    else:
+        use_gpu = 0
+        gpu_id = -1
    if "CUDA_VISIBLE_DEVICES" in os.environ.keys():
        gpu_list = os.environ['CUDA_VISIBLE_DEVICES'].split(",")
        os.environ['CUDA_VISIBLE_DEVICES'] = str(gpu_list[gpu_id])
@@ -18,9 +23,10 @@ def modelscope_infer_core(output_dir, split_dir, njob, idx):
        os.environ['CUDA_VISIBLE_DEVICES'] = str(gpu_id)
    inference_pipline = pipeline(
        task=Tasks.auto_speech_recognition,
-        model="damo/speech_paraformer-large_asr_nat-zh-cn-16k-aishell2-vocab8404-pytorch",
+        model=model,
        output_dir=output_dir_job,
-        batch_size=64
+        batch_size=batch_size,
+        ngpu=use_gpu,
    )
    audio_in = os.path.join(split_dir, "wav.{}.scp".format(idx))
    inference_pipline(audio_in=audio_in)
@@ -30,13 +36,18 @@ def modelscope_infer(params):
    # prepare for multi-GPU decoding
    ngpu = params["ngpu"]
    njob = params["njob"]
+    batch_size = params["batch_size"]
    output_dir = params["output_dir"]
+    model = params["model"]
    if os.path.exists(output_dir):
        shutil.rmtree(output_dir)
    os.mkdir(output_dir)
    split_dir = os.path.join(output_dir, "split")
    os.mkdir(split_dir)
-    nj = ngpu * njob
+    if ngpu > 0:
+        nj = ngpu
+    elif ngpu == 0:
+        nj = njob
    wav_scp_file = os.path.join(params["data_dir"], "wav.scp")
    with open(wav_scp_file) as f:
        lines = f.readlines()
@@ -56,7 +67,7 @@ def modelscope_infer(params):
    p = Pool(nj)
    for i in range(nj):
        p.apply_async(modelscope_infer_core,
-                      args=(output_dir, split_dir, njob, str(i + 1)))
+                      args=(output_dir, split_dir, njob, str(i + 1), batch_size, ngpu, model))
    p.close()
    p.join()

@@ -81,8 +92,10 @@ def modelscope_infer(params):

if __name__ == "__main__":
    params = {}
+    params["model"] = "damo/speech_paraformer-large_asr_nat-zh-cn-16k-aishell2-vocab8404-pytorch"
    params["data_dir"] = "./data/test"
    params["output_dir"] = "./results"
-    params["ngpu"] = 1
-    params["njob"] = 1
-    modelscope_infer(params)
+    params["ngpu"] = 1  # if ngpu > 0, will use gpu decoding
+    params["njob"] = 1  # if ngpu = 0, will use cpu decoding
+    params["batch_size"] = 64
+    modelscope_infer(params)
@@ -4,23 +4,18 @@ import shutil

from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks
+from modelscope.hub.snapshot_download import snapshot_download

from funasr.utils.compute_wer import compute_wer


def modelscope_infer_after_finetune(params):
    # prepare for decoding
-    pretrained_model_path = os.path.join(os.environ["HOME"], ".cache/modelscope/hub", params["modelscope_model_name"])
-    for file_name in params["required_files"]:
-        if file_name == "configuration.json":
-            with open(os.path.join(pretrained_model_path, file_name)) as f:
-                config_dict = json.load(f)
-            config_dict["model"]["am_model_name"] = params["decoding_model_name"]
-            with open(os.path.join(params["output_dir"], "configuration.json"), "w") as f:
-                json.dump(config_dict, f, indent=4, separators=(',', ': '))
-        else:
-            shutil.copy(os.path.join(pretrained_model_path, file_name),
-                        os.path.join(params["output_dir"], file_name))
+    try:
+        pretrained_model_path = snapshot_download(params["modelscope_model_name"], cache_dir=params["output_dir"])
+    except BaseException:
+        raise BaseException(f"Please download pretrain model from ModelScope firstly.")
+    shutil.copy(os.path.join(params["output_dir"], params["decoding_model_name"]), os.path.join(pretrained_model_path, "model.pb"))
    decoding_path = os.path.join(params["output_dir"], "decode_results")
    if os.path.exists(decoding_path):
        shutil.rmtree(decoding_path)
@@ -29,9 +24,9 @@ def modelscope_infer_after_finetune(params):
    # decoding
    inference_pipeline = pipeline(
        task=Tasks.auto_speech_recognition,
-        model=params["output_dir"],
+        model=pretrained_model_path,
        output_dir=decoding_path,
-        batch_size=64
+        batch_size=params["batch_size"]
    )
    audio_in = os.path.join(params["data_dir"], "wav.scp")
    inference_pipeline(audio_in=audio_in)
@@ -46,8 +41,8 @@ def modelscope_infer_after_finetune(params):
if __name__ == '__main__':
    params = {}
    params["modelscope_model_name"] = "damo/speech_paraformer-large_asr_nat-zh-cn-16k-aishell2-vocab8404-pytorch"
-    params["required_files"] = ["am.mvn", "decoding.yaml", "configuration.json"]
    params["output_dir"] = "./checkpoint"
    params["data_dir"] = "./data/test"
-    params["decoding_model_name"] = "valid.acc.ave_10best.pth"
-    modelscope_infer_after_finetune(params)
+    params["decoding_model_name"] = "valid.acc.ave_10best.pb"
+    params["batch_size"] = 64
+    modelscope_infer_after_finetune(params)
@@ -22,10 +22,12 @@
Or you can use the finetuned model for inference directly.

- Setting parameters in `infer.py`
+    - <strong>model:</strong> # model name on ModelScope
    - <strong>data_dir:</strong> # the dataset dir needs to include `test/wav.scp`. If `test/text` also exists, the CER will be computed
    - <strong>output_dir:</strong> # result dir
-    - <strong>ngpu:</strong> # the number of GPUs for decoding
-    - <strong>njob:</strong> # the number of jobs for each GPU
+    - <strong>ngpu:</strong> # the number of GPUs for decoding, if `ngpu` > 0, use GPU decoding
+    - <strong>njob:</strong> # the number of jobs for CPU decoding, if `ngpu` = 0, use CPU decoding, please set `njob`
+    - <strong>batch_size:</strong> # batch size of inference

- Then you can run the pipeline to infer with (see the parameter sketch after this section):
```python

@@ -39,9 +41,11 @@ The decoding results can be found in `$output_dir/1best_recog/text.cer`, which i

### Inference using local finetuned model

- Modify inference related parameters in `infer_after_finetune.py`
+    - <strong>modelscope_model_name:</strong> # model name on ModelScope
    - <strong>output_dir:</strong> # result dir
    - <strong>data_dir:</strong> # the dataset dir needs to include `test/wav.scp`. If `test/text` also exists, the CER will be computed
-    - <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pth`
+    - <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pb`
+    - <strong>batch_size:</strong> # batch size of inference

- Then you can run the pipeline to finetune with:
```python
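To make the `ngpu`/`njob`/`batch_size` semantics above concrete, a hedged example of the parameter dict consumed by `infer.py`; the values are illustrative only.

```python
# Illustrative parameter dict for infer.py; values are examples only.
params = {
    "model": "damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch",
    "data_dir": "./data/test",   # must contain wav.scp (and optionally text, for CER)
    "output_dir": "./results",
    "batch_size": 64,
    "ngpu": 1,                   # ngpu > 0: GPU decoding, one job per GPU
    "njob": 1,                   # used when ngpu == 0: number of parallel CPU jobs
}
# modelscope_infer(params)  # defined in infer.py
```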
@@ -8,9 +8,14 @@ from modelscope.utils.constant import Tasks
from funasr.utils.compute_wer import compute_wer


-def modelscope_infer_core(output_dir, split_dir, njob, idx):
+def modelscope_infer_core(output_dir, split_dir, njob, idx, batch_size, ngpu, model):
    output_dir_job = os.path.join(output_dir, "output.{}".format(idx))
-    gpu_id = (int(idx) - 1) // njob
+    if ngpu > 0:
+        use_gpu = 1
+        gpu_id = int(idx) - 1
+    else:
+        use_gpu = 0
+        gpu_id = -1
    if "CUDA_VISIBLE_DEVICES" in os.environ.keys():
        gpu_list = os.environ['CUDA_VISIBLE_DEVICES'].split(",")
        os.environ['CUDA_VISIBLE_DEVICES'] = str(gpu_list[gpu_id])
@@ -18,9 +23,10 @@ def modelscope_infer_core(output_dir, split_dir, njob, idx):
        os.environ['CUDA_VISIBLE_DEVICES'] = str(gpu_id)
    inference_pipline = pipeline(
        task=Tasks.auto_speech_recognition,
-        model="damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch",
+        model=model,
        output_dir=output_dir_job,
-        batch_size=64
+        batch_size=batch_size,
+        ngpu=use_gpu,
    )
    audio_in = os.path.join(split_dir, "wav.{}.scp".format(idx))
    inference_pipline(audio_in=audio_in)
@@ -30,13 +36,18 @@ def modelscope_infer(params):
    # prepare for multi-GPU decoding
    ngpu = params["ngpu"]
    njob = params["njob"]
+    batch_size = params["batch_size"]
    output_dir = params["output_dir"]
+    model = params["model"]
    if os.path.exists(output_dir):
        shutil.rmtree(output_dir)
    os.mkdir(output_dir)
    split_dir = os.path.join(output_dir, "split")
    os.mkdir(split_dir)
-    nj = ngpu * njob
+    if ngpu > 0:
+        nj = ngpu
+    elif ngpu == 0:
+        nj = njob
    wav_scp_file = os.path.join(params["data_dir"], "wav.scp")
    with open(wav_scp_file) as f:
        lines = f.readlines()
@@ -56,7 +67,7 @@ def modelscope_infer(params):
    p = Pool(nj)
    for i in range(nj):
        p.apply_async(modelscope_infer_core,
-                      args=(output_dir, split_dir, njob, str(i + 1)))
+                      args=(output_dir, split_dir, njob, str(i + 1), batch_size, ngpu, model))
    p.close()
    p.join()

@@ -81,8 +92,10 @@ def modelscope_infer(params):

if __name__ == "__main__":
    params = {}
+    params["model"] = "damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch"
    params["data_dir"] = "./data/test"
    params["output_dir"] = "./results"
-    params["ngpu"] = 1
-    params["njob"] = 1
-    modelscope_infer(params)
+    params["ngpu"] = 1  # if ngpu > 0, will use gpu decoding
+    params["njob"] = 1  # if ngpu = 0, will use cpu decoding
+    params["batch_size"] = 64
+    modelscope_infer(params)
@@ -4,23 +4,18 @@ import shutil

from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks
+from modelscope.hub.snapshot_download import snapshot_download

from funasr.utils.compute_wer import compute_wer


def modelscope_infer_after_finetune(params):
    # prepare for decoding
-    pretrained_model_path = os.path.join(os.environ["HOME"], ".cache/modelscope/hub", params["modelscope_model_name"])
-    for file_name in params["required_files"]:
-        if file_name == "configuration.json":
-            with open(os.path.join(pretrained_model_path, file_name)) as f:
-                config_dict = json.load(f)
-            config_dict["model"]["am_model_name"] = params["decoding_model_name"]
-            with open(os.path.join(params["output_dir"], "configuration.json"), "w") as f:
-                json.dump(config_dict, f, indent=4, separators=(',', ': '))
-        else:
-            shutil.copy(os.path.join(pretrained_model_path, file_name),
-                        os.path.join(params["output_dir"], file_name))
+    try:
+        pretrained_model_path = snapshot_download(params["modelscope_model_name"], cache_dir=params["output_dir"])
+    except BaseException:
+        raise BaseException(f"Please download pretrain model from ModelScope firstly.")
+    shutil.copy(os.path.join(params["output_dir"], params["decoding_model_name"]), os.path.join(pretrained_model_path, "model.pb"))
    decoding_path = os.path.join(params["output_dir"], "decode_results")
    if os.path.exists(decoding_path):
        shutil.rmtree(decoding_path)
@@ -29,9 +24,9 @@ def modelscope_infer_after_finetune(params):
    # decoding
    inference_pipeline = pipeline(
        task=Tasks.auto_speech_recognition,
-        model=params["output_dir"],
+        model=pretrained_model_path,
        output_dir=decoding_path,
-        batch_size=64
+        batch_size=params["batch_size"]
    )
    audio_in = os.path.join(params["data_dir"], "wav.scp")
    inference_pipeline(audio_in=audio_in)
@@ -46,8 +41,8 @@ def modelscope_infer_after_finetune(params):
if __name__ == '__main__':
    params = {}
    params["modelscope_model_name"] = "damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch"
-    params["required_files"] = ["am.mvn", "decoding.yaml", "configuration.json"]
    params["output_dir"] = "./checkpoint"
    params["data_dir"] = "./data/test"
-    params["decoding_model_name"] = "valid.acc.ave_10best.pth"
-    modelscope_infer_after_finetune(params)
+    params["decoding_model_name"] = "valid.acc.ave_10best.pb"
+    params["batch_size"] = 64
+    modelscope_infer_after_finetune(params)
@@ -8,9 +8,14 @@ from modelscope.utils.constant import Tasks
from funasr.utils.compute_wer import compute_wer


-def modelscope_infer_core(output_dir, split_dir, njob, idx):
+def modelscope_infer_core(output_dir, split_dir, njob, idx, batch_size, ngpu, model):
    output_dir_job = os.path.join(output_dir, "output.{}".format(idx))
-    gpu_id = (int(idx) - 1) // njob
+    if ngpu > 0:
+        use_gpu = 1
+        gpu_id = int(idx) - 1
+    else:
+        use_gpu = 0
+        gpu_id = -1
    if "CUDA_VISIBLE_DEVICES" in os.environ.keys():
        gpu_list = os.environ['CUDA_VISIBLE_DEVICES'].split(",")
        os.environ['CUDA_VISIBLE_DEVICES'] = str(gpu_list[gpu_id])
@@ -18,9 +23,10 @@ def modelscope_infer_core(output_dir, split_dir, njob, idx):
        os.environ['CUDA_VISIBLE_DEVICES'] = str(gpu_id)
    inference_pipline = pipeline(
        task=Tasks.auto_speech_recognition,
-        model="damo/speech_paraformer_asr_nat-zh-cn-8k-common-vocab8358-tensorflow1",
+        model=model,
        output_dir=output_dir_job,
-        batch_size=64
+        batch_size=batch_size,
+        ngpu=use_gpu,
    )
    audio_in = os.path.join(split_dir, "wav.{}.scp".format(idx))
    inference_pipline(audio_in=audio_in)
@@ -30,13 +36,18 @@ def modelscope_infer(params):
    # prepare for multi-GPU decoding
    ngpu = params["ngpu"]
    njob = params["njob"]
+    batch_size = params["batch_size"]
    output_dir = params["output_dir"]
+    model = params["model"]
    if os.path.exists(output_dir):
        shutil.rmtree(output_dir)
    os.mkdir(output_dir)
    split_dir = os.path.join(output_dir, "split")
    os.mkdir(split_dir)
-    nj = ngpu * njob
+    if ngpu > 0:
+        nj = ngpu
+    elif ngpu == 0:
+        nj = njob
    wav_scp_file = os.path.join(params["data_dir"], "wav.scp")
    with open(wav_scp_file) as f:
        lines = f.readlines()
@@ -56,7 +67,7 @@ def modelscope_infer(params):
    p = Pool(nj)
    for i in range(nj):
        p.apply_async(modelscope_infer_core,
-                      args=(output_dir, split_dir, njob, str(i + 1)))
+                      args=(output_dir, split_dir, njob, str(i + 1), batch_size, ngpu, model))
    p.close()
    p.join()

@@ -81,8 +92,10 @@ def modelscope_infer(params):

if __name__ == "__main__":
    params = {}
+    params["model"] = "damo/speech_paraformer_asr_nat-zh-cn-8k-common-vocab8358-tensorflow1"
    params["data_dir"] = "./data/test"
    params["output_dir"] = "./results"
-    params["ngpu"] = 1
-    params["njob"] = 1
-    modelscope_infer(params)
+    params["ngpu"] = 1  # if ngpu > 0, will use gpu decoding
+    params["njob"] = 1  # if ngpu = 0, will use cpu decoding
+    params["batch_size"] = 64
+    modelscope_infer(params)
@@ -4,23 +4,18 @@ import shutil

from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks
+from modelscope.hub.snapshot_download import snapshot_download

from funasr.utils.compute_wer import compute_wer


def modelscope_infer_after_finetune(params):
    # prepare for decoding
-    pretrained_model_path = os.path.join(os.environ["HOME"], ".cache/modelscope/hub", params["modelscope_model_name"])
-    for file_name in params["required_files"]:
-        if file_name == "configuration.json":
-            with open(os.path.join(pretrained_model_path, file_name)) as f:
-                config_dict = json.load(f)
-            config_dict["model"]["am_model_name"] = params["decoding_model_name"]
-            with open(os.path.join(params["output_dir"], "configuration.json"), "w") as f:
-                json.dump(config_dict, f, indent=4, separators=(',', ': '))
-        else:
-            shutil.copy(os.path.join(pretrained_model_path, file_name),
-                        os.path.join(params["output_dir"], file_name))
+    try:
+        pretrained_model_path = snapshot_download(params["modelscope_model_name"], cache_dir=params["output_dir"])
+    except BaseException:
+        raise BaseException(f"Please download pretrain model from ModelScope firstly.")
+    shutil.copy(os.path.join(params["output_dir"], params["decoding_model_name"]), os.path.join(pretrained_model_path, "model.pb"))
    decoding_path = os.path.join(params["output_dir"], "decode_results")
    if os.path.exists(decoding_path):
        shutil.rmtree(decoding_path)
@@ -29,9 +24,9 @@ def modelscope_infer_after_finetune(params):
    # decoding
    inference_pipeline = pipeline(
        task=Tasks.auto_speech_recognition,
-        model=params["output_dir"],
+        model=pretrained_model_path,
        output_dir=decoding_path,
-        batch_size=64
+        batch_size=params["batch_size"]
    )
    audio_in = os.path.join(params["data_dir"], "wav.scp")
    inference_pipeline(audio_in=audio_in)
@@ -46,8 +41,8 @@ def modelscope_infer_after_finetune(params):
if __name__ == '__main__':
    params = {}
    params["modelscope_model_name"] = "damo/speech_paraformer_asr_nat-zh-cn-8k-common-vocab8358-tensorflow1"
-    params["required_files"] = ["am.mvn", "decoding.yaml", "configuration.json"]
    params["output_dir"] = "./checkpoint"
    params["data_dir"] = "./data/test"
-    params["decoding_model_name"] = "valid.acc.ave_10best.pth"
-    modelscope_infer_after_finetune(params)
+    params["decoding_model_name"] = "valid.acc.ave_10best.pb"
+    params["batch_size"] = 64
+    modelscope_infer_after_finetune(params)
@@ -41,7 +41,7 @@ The decoding results can be found in `$output_dir/1best_recog/text.cer`, which i
- Modify inference related parameters in `infer_after_finetune.py`
    - <strong>output_dir:</strong> # result dir
    - <strong>data_dir:</strong> # the dataset dir needs to include `test/wav.scp`. If `test/text` also exists, the CER will be computed
-    - <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pth`
+    - <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pb`

- Then you can run the pipeline to finetune with:
```python

@@ -50,5 +50,5 @@ if __name__ == '__main__':
    params["required_files"] = ["am.mvn", "decoding.yaml", "configuration.json"]
    params["output_dir"] = "./checkpoint"
    params["data_dir"] = "./data/test"
-    params["decoding_model_name"] = "20epoch.pth"
+    params["decoding_model_name"] = "20epoch.pb"
    modelscope_infer_after_finetune(params)

@@ -41,7 +41,7 @@ The decoding results can be found in `$output_dir/1best_recog/text.cer`, which i
- Modify inference related parameters in `infer_after_finetune.py`
    - <strong>output_dir:</strong> # result dir
    - <strong>data_dir:</strong> # the dataset dir needs to include `test/wav.scp`. If `test/text` also exists, the CER will be computed
-    - <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pth`
+    - <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pb`

- Then you can run the pipeline to finetune with:
```python

@@ -50,5 +50,5 @@ if __name__ == '__main__':
    params["required_files"] = ["am.mvn", "decoding.yaml", "configuration.json"]
    params["output_dir"] = "./checkpoint"
    params["data_dir"] = "./data/test"
-    params["decoding_model_name"] = "20epoch.pth"
+    params["decoding_model_name"] = "20epoch.pb"
    modelscope_infer_after_finetune(params)
@@ -0,0 +1,35 @@
import os
from modelscope.metainfo import Trainers
from modelscope.trainers import build_trainer
from funasr.datasets.ms_dataset import MsDataset


def modelscope_finetune(params):
    if not os.path.exists(params["output_dir"]):
        os.makedirs(params["output_dir"], exist_ok=True)
    # dataset split ["train", "validation"]
    ds_dict = MsDataset.load(params["data_dir"])
    kwargs = dict(
        model=params["model"],
        model_revision=params["model_revision"],
        data_dir=ds_dict,
        dataset_type=params["dataset_type"],
        work_dir=params["output_dir"],
        batch_bins=params["batch_bins"],
        max_epoch=params["max_epoch"],
        lr=params["lr"])
    trainer = build_trainer(Trainers.speech_asr_trainer, default_args=kwargs)
    trainer.train()


if __name__ == '__main__':
    params = {}
    params["output_dir"] = "./checkpoint"
    params["data_dir"] = "./data"
    params["batch_bins"] = 2000
    params["dataset_type"] = "small"
    params["max_epoch"] = 50
    params["lr"] = 0.00005
    params["model"] = "damo/speech_UniASR_asr_2pass-he-16k-common-vocab1085-pytorch"
    params["model_revision"] = None
    modelscope_finetune(params)
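A hedged note on what `params["data_dir"]` is expected to contain for `MsDataset.load`; the "train"/"validation" split names come from the comment in `modelscope_finetune()` above, while the per-split file names are an assumption, not confirmed by this diff.

```python
# Assumed local dataset layout for MsDataset.load(params["data_dir"]):
#
#   data/
#     train/
#       wav.scp    # <utt_id> <path-to-wav>
#       text       # <utt_id> <transcript>
#     validation/
#       wav.scp
#       text
#
data_dir = "./data"  # passed as params["data_dir"] above
```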
@@ -0,0 +1,13 @@
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

if __name__ == "__main__":
    audio_in = "https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/asr_example_he.wav"
    output_dir = "./results"
    inference_pipline = pipeline(
        task=Tasks.auto_speech_recognition,
        model="damo/speech_UniASR_asr_2pass-he-16k-common-vocab1085-pytorch",
        output_dir=output_dir,
    )
    rec_result = inference_pipline(audio_in=audio_in, param_dict={"decoding_model":"offline"})
    print(rec_result)
@@ -30,6 +30,6 @@ if __name__ == '__main__':
    params["dataset_type"] = "small"
    params["max_epoch"] = 50
    params["lr"] = 0.00005
-    params["model"] = "damo/speech_UniASR_asr_2pass-ja-16k-common-vocab93-tensorflow1-online"
+    params["model"] = "damo/speech_UniASR_asr_2pass-ja-16k-common-vocab93-tensorflow1-offline"
    params["model_revision"] = None
    modelscope_finetune(params)
@@ -6,7 +6,7 @@ if __name__ == "__main__":
    output_dir = "./results"
    inference_pipline = pipeline(
        task=Tasks.auto_speech_recognition,
-        model="damo/speech_UniASR_asr_2pass-ja-16k-common-vocab93-tensorflow1-online",
+        model="damo/speech_UniASR_asr_2pass-ja-16k-common-vocab93-tensorflow1-offline",
        output_dir=output_dir,
    )
    rec_result = inference_pipline(audio_in=audio_in, param_dict={"decoding_model":"offline"})
@@ -41,7 +41,7 @@ The decoding results can be found in `$output_dir/1best_recog/text.cer`, which i
- Modify inference related parameters in `infer_after_finetune.py`
    - <strong>output_dir:</strong> # result dir
    - <strong>data_dir:</strong> # the dataset dir needs to include `test/wav.scp`. If `test/text` also exists, the CER will be computed
-    - <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pth`
+    - <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pb`

- Then you can run the pipeline to finetune with:
```python

@@ -50,5 +50,5 @@ if __name__ == '__main__':
    params["required_files"] = ["am.mvn", "decoding.yaml", "configuration.json"]
    params["output_dir"] = "./checkpoint"
    params["data_dir"] = "./data/test"
-    params["decoding_model_name"] = "20epoch.pth"
+    params["decoding_model_name"] = "20epoch.pb"
    modelscope_infer_after_finetune(params)
@@ -0,0 +1,35 @@
import os
from modelscope.metainfo import Trainers
from modelscope.trainers import build_trainer
from funasr.datasets.ms_dataset import MsDataset


def modelscope_finetune(params):
    if not os.path.exists(params["output_dir"]):
        os.makedirs(params["output_dir"], exist_ok=True)
    # dataset split ["train", "validation"]
    ds_dict = MsDataset.load(params["data_dir"])
    kwargs = dict(
        model=params["model"],
        model_revision=params["model_revision"],
        data_dir=ds_dict,
        dataset_type=params["dataset_type"],
        work_dir=params["output_dir"],
        batch_bins=params["batch_bins"],
        max_epoch=params["max_epoch"],
        lr=params["lr"])
    trainer = build_trainer(Trainers.speech_asr_trainer, default_args=kwargs)
    trainer.train()


if __name__ == '__main__':
    params = {}
    params["output_dir"] = "./checkpoint"
    params["data_dir"] = "./data"
    params["batch_bins"] = 2000
    params["dataset_type"] = "small"
    params["max_epoch"] = 50
    params["lr"] = 0.00005
    params["model"] = "damo/speech_UniASR_asr_2pass-my-16k-common-vocab696-pytorch"
    params["model_revision"] = None
    modelscope_finetune(params)
@ -0,0 +1,13 @@
|
||||
from modelscope.pipelines import pipeline
|
||||
from modelscope.utils.constant import Tasks
|
||||
|
||||
if __name__ == "__main__":
|
||||
audio_in = "https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/asr_example_my.wav"
|
||||
output_dir = "./results"
|
||||
inference_pipline = pipeline(
|
||||
task=Tasks.auto_speech_recognition,
|
||||
model="damo/speech_UniASR_asr_2pass-my-16k-common-vocab696-pytorch",
|
||||
output_dir=output_dir,
|
||||
)
|
||||
rec_result = inference_pipline(audio_in=audio_in, param_dict={"decoding_model":"offline"})
|
||||
print(rec_result)
|
||||
@ -30,6 +30,6 @@ if __name__ == '__main__':
|
||||
params["dataset_type"] = "small"
|
||||
params["max_epoch"] = 50
|
||||
params["lr"] = 0.00005
|
||||
params["model"] = "damo/speech_UniASR_asr_2pass-pt-16k-common-vocab1617-tensorflow1-online"
|
||||
params["model"] = "damo/speech_UniASR_asr_2pass-pt-16k-common-vocab1617-tensorflow1-offline"
|
||||
params["model_revision"] = None
|
||||
modelscope_finetune(params)
|
||||
|
||||
@ -6,7 +6,7 @@ if __name__ == "__main__":
|
||||
output_dir = "./results"
|
||||
inference_pipline = pipeline(
|
||||
task=Tasks.auto_speech_recognition,
|
||||
model="damo/speech_UniASR_asr_2pass-pt-16k-common-vocab1617-tensorflow1-online",
|
||||
model="damo/speech_UniASR_asr_2pass-pt-16k-common-vocab1617-tensorflow1-offline",
|
||||
output_dir=output_dir,
|
||||
)
|
||||
rec_result = inference_pipline(audio_in=audio_in, param_dict={"decoding_model":"offline"})
|
||||
|
||||
@ -0,0 +1,35 @@
|
||||
import os
|
||||
from modelscope.metainfo import Trainers
|
||||
from modelscope.trainers import build_trainer
|
||||
from funasr.datasets.ms_dataset import MsDataset
|
||||
|
||||
|
||||
def modelscope_finetune(params):
|
||||
if not os.path.exists(params["output_dir"]):
|
||||
os.makedirs(params["output_dir"], exist_ok=True)
|
||||
# dataset split ["train", "validation"]
|
||||
ds_dict = MsDataset.load(params["data_dir"])
|
||||
kwargs = dict(
|
||||
model=params["model"],
|
||||
model_revision=params["model_revision"],
|
||||
data_dir=ds_dict,
|
||||
dataset_type=params["dataset_type"],
|
||||
work_dir=params["output_dir"],
|
||||
batch_bins=params["batch_bins"],
|
||||
max_epoch=params["max_epoch"],
|
||||
lr=params["lr"])
|
||||
trainer = build_trainer(Trainers.speech_asr_trainer, default_args=kwargs)
|
||||
trainer.train()
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
params = {}
|
||||
params["output_dir"] = "./checkpoint"
|
||||
params["data_dir"] = "./data"
|
||||
params["batch_bins"] = 2000
|
||||
params["dataset_type"] = "small"
|
||||
params["max_epoch"] = 50
|
||||
params["lr"] = 0.00005
|
||||
params["model"] = "damo/speech_UniASR_asr_2pass-ur-16k-common-vocab877-pytorch"
|
||||
params["model_revision"] = None
|
||||
modelscope_finetune(params)
|
||||
@ -0,0 +1,13 @@
|
||||
from modelscope.pipelines import pipeline
|
||||
from modelscope.utils.constant import Tasks
|
||||
|
||||
if __name__ == "__main__":
|
||||
audio_in = "https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/asr_example_ur.wav"
|
||||
output_dir = "./results"
|
||||
inference_pipline = pipeline(
|
||||
task=Tasks.auto_speech_recognition,
|
||||
model="damo/speech_UniASR_asr_2pass-ur-16k-common-vocab877-pytorch",
|
||||
output_dir=output_dir,
|
||||
)
|
||||
rec_result = inference_pipline(audio_in=audio_in, param_dict={"decoding_model":"offline"})
|
||||
print(rec_result)
|
||||
@ -41,7 +41,7 @@ The decoding results can be found in `$output_dir/1best_recog/text.cer`, which i
- Modify inference-related parameters in `infer_after_finetune.py`
    - <strong>output_dir:</strong> # result dir
    - <strong>data_dir:</strong> # the dataset dir needs to include `test/wav.scp`. If `test/text` also exists, the CER will be computed
    - <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pth`
    - <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pb`

- Then you can run the pipeline to finetune with:
```python

@ -49,5 +49,5 @@ if __name__ == '__main__':
params["required_files"] = ["am.mvn", "decoding.yaml", "configuration.json"]
params["output_dir"] = "./checkpoint"
params["data_dir"] = "./data/test"
params["decoding_model_name"] = "20epoch.pth"
params["decoding_model_name"] = "20epoch.pb"
modelscope_infer_after_finetune(params)
@ -41,7 +41,8 @@ The decoding results can be found in `$output_dir/1best_recog/text.cer`, which i
- Modify inference-related parameters in `infer_after_finetune.py`
    - <strong>output_dir:</strong> # result dir
    - <strong>data_dir:</strong> # the dataset dir needs to include `test/wav.scp`. If `test/text` also exists, the CER will be computed
    - <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pth`
    - <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pb`

- Then you can run the pipeline to finetune with:
```python

@ -49,5 +49,5 @@ if __name__ == '__main__':
params["required_files"] = ["am.mvn", "decoding.yaml", "configuration.json"]
params["output_dir"] = "./checkpoint"
params["data_dir"] = "./data/test"
params["decoding_model_name"] = "20epoch.pth"
params["decoding_model_name"] = "20epoch.pb"
modelscope_infer_after_finetune(params)
@ -34,7 +34,7 @@ Or you can use the finetuned model for inference directly.
- Modify inference-related parameters in `infer_after_finetune.py`
    - <strong>output_dir:</strong> # result dir
    - <strong>data_dir:</strong> # the dataset dir needs to include `test/wav.scp`. If `test/text` also exists, the CER will be computed
    - <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pth`
    - <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pb`

- Then you can run the pipeline to finetune with:
```python

@ -4,27 +4,17 @@ import shutil
|
||||
|
||||
from modelscope.pipelines import pipeline
|
||||
from modelscope.utils.constant import Tasks
|
||||
from modelscope.hub.snapshot_download import snapshot_download
|
||||
|
||||
from funasr.utils.compute_wer import compute_wer
|
||||
|
||||
|
||||
def modelscope_infer_after_finetune(params):
|
||||
# prepare for decoding
|
||||
if not os.path.exists(os.path.join(params["output_dir"], "punc")):
|
||||
os.makedirs(os.path.join(params["output_dir"], "punc"))
|
||||
if not os.path.exists(os.path.join(params["output_dir"], "vad")):
|
||||
os.makedirs(os.path.join(params["output_dir"], "vad"))
|
||||
pretrained_model_path = os.path.join(os.environ["HOME"], ".cache/modelscope/hub", params["modelscope_model_name"])
|
||||
for file_name in params["required_files"]:
|
||||
if file_name == "configuration.json":
|
||||
with open(os.path.join(pretrained_model_path, file_name)) as f:
|
||||
config_dict = json.load(f)
|
||||
config_dict["model"]["am_model_name"] = params["decoding_model_name"]
|
||||
with open(os.path.join(params["output_dir"], "configuration.json"), "w") as f:
|
||||
json.dump(config_dict, f, indent=4, separators=(',', ': '))
|
||||
else:
|
||||
shutil.copy(os.path.join(pretrained_model_path, file_name),
|
||||
os.path.join(params["output_dir"], file_name))
|
||||
|
||||
try:
|
||||
pretrained_model_path = snapshot_download(params["modelscope_model_name"], cache_dir=params["output_dir"])
|
||||
except BaseException:
|
||||
raise BaseException("Please download the pretrained model from ModelScope first.")
shutil.copy(os.path.join(params["output_dir"], params["decoding_model_name"]), os.path.join(pretrained_model_path, "model.pb"))
|
||||
decoding_path = os.path.join(params["output_dir"], "decode_results")
|
||||
if os.path.exists(decoding_path):
|
||||
shutil.rmtree(decoding_path)
|
||||
@ -33,16 +23,16 @@ def modelscope_infer_after_finetune(params):
|
||||
# decoding
|
||||
inference_pipeline = pipeline(
|
||||
task=Tasks.auto_speech_recognition,
|
||||
model=params["output_dir"],
|
||||
model=pretrained_model_path,
|
||||
output_dir=decoding_path,
|
||||
batch_size=64
|
||||
batch_size=params["batch_size"]
|
||||
)
|
||||
audio_in = os.path.join(params["data_dir"], "wav.scp")
|
||||
inference_pipeline(audio_in=audio_in)
|
||||
|
||||
# compute CER if the ground-truth text is provided
|
||||
text_in = os.path.join(params["data_dir"], "text")
|
||||
if text_in is not None:
|
||||
if os.path.exists(text_in):
|
||||
text_proc_file = os.path.join(decoding_path, "1best_recog/token")
|
||||
compute_wer(text_in, text_proc_file, os.path.join(decoding_path, "text.cer"))
|
||||
|
||||
@ -50,8 +40,8 @@ def modelscope_infer_after_finetune(params):
|
||||
if __name__ == '__main__':
|
||||
params = {}
|
||||
params["modelscope_model_name"] = "damo/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch"
|
||||
params["required_files"] = ["am.mvn", "decoding.yaml", "configuration.json", "punc/punc.pb", "punc/punc.yaml", "vad/vad.mvn", "vad/vad.pb", "vad/vad.yaml"]
|
||||
params["output_dir"] = "./checkpoint"
|
||||
params["data_dir"] = "./data/test"
|
||||
params["decoding_model_name"] = "valid.acc.ave_10best.pth"
|
||||
modelscope_infer_after_finetune(params)
|
||||
params["decoding_model_name"] = "valid.acc.ave_10best.pb"
|
||||
params["batch_size"] = 64
|
||||
modelscope_infer_after_finetune(params)
|
||||
@ -0,0 +1,26 @@
|
||||
|
||||
################## text binary data #####################
|
||||
inputs = "跨境河流是养育沿岸|人民的生命之源长期以来为帮助下游地区防灾减灾中方技术人员|在上游地区极为恶劣的自然条件下克服巨大困难甚至冒着生命危险|向印方提供汛期水文资料处理紧急事件中方重视印方在跨境河流问题上的关切|愿意进一步完善双方联合工作机制|凡是|中方能做的我们|都会去做而且会做得更好我请印度朋友们放心中国在上游的|任何开发利用都会经过科学|规划和论证兼顾上下游的利益"
|
||||
|
||||
from modelscope.pipelines import pipeline
|
||||
from modelscope.utils.constant import Tasks
|
||||
|
||||
inference_pipeline = pipeline(
|
||||
task=Tasks.punctuation,
|
||||
model='damo/punc_ct-transformer_zh-cn-common-vad_realtime-vocab272727',
|
||||
model_revision="v1.0.0",
|
||||
output_dir="./tmp/"
|
||||
)
|
||||
|
||||
vads = inputs.split("|")
|
||||
|
||||
cache_out = []
|
||||
rec_result_all="outputs:"
|
||||
for vad in vads:
|
||||
rec_result = inference_pipeline(text_in=vad, cache=cache_out)
|
||||
#print(rec_result)
|
||||
cache_out = rec_result['cache']
|
||||
rec_result_all += rec_result['text']
|
||||
|
||||
print(rec_result_all)
|
||||
|
||||
@ -15,7 +15,7 @@ from modelscope.utils.constant import Tasks
|
||||
inference_pipline = pipeline(
|
||||
task=Tasks.punctuation,
|
||||
model='damo/punc_ct-transformer_zh-cn-common-vocab272727-pytorch',
|
||||
model_revision="v1.1.6",
|
||||
model_revision="v1.1.7",
|
||||
output_dir="./tmp/"
|
||||
)
|
||||
|
||||
|
||||
@ -0,0 +1,10 @@
|
||||
from modelscope.pipelines import pipeline
|
||||
from modelscope.utils.constant import Tasks
|
||||
|
||||
inference_diar_pipline = pipeline(
|
||||
task=Tasks.speaker_diarization,
|
||||
model='damo/speech_diarization_eend-ola-en-us-callhome-8k',
|
||||
model_revision="v1.0.0",
|
||||
)
|
||||
results = inference_diar_pipline(audio_in=["https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_data/record2.wav"])
|
||||
print(results)
|
||||
@ -0,0 +1,25 @@
|
||||
from modelscope.pipelines import pipeline
|
||||
from modelscope.utils.constant import Tasks
|
||||
|
||||
# initialize the inference pipeline
# when raw audio is used as input, use the config file sond.yaml and set mode to sond_demo
|
||||
inference_diar_pipline = pipeline(
|
||||
mode="sond_demo",
|
||||
num_workers=0,
|
||||
task=Tasks.speaker_diarization,
|
||||
diar_model_config="sond.yaml",
|
||||
model='damo/speech_diarization_sond-en-us-callhome-8k-n16k4-pytorch',
|
||||
sv_model="damo/speech_xvector_sv-en-us-callhome-8k-spk6135-pytorch",
|
||||
sv_model_revision="master",
|
||||
)
|
||||
|
||||
# use audio_list as input: the first audio is the speech to be detected, and the following audios are voiceprint enrollment utterances from different speakers
|
||||
audio_list = [
|
||||
"https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_data/record.wav",
|
||||
"https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_data/spk_A.wav",
|
||||
"https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_data/spk_B.wav",
|
||||
"https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_data/spk_B1.wav"
|
||||
]
|
||||
|
||||
results = inference_diar_pipline(audio_in=audio_list)
|
||||
print(results)
|
||||
@ -0,0 +1,39 @@
|
||||
from modelscope.pipelines import pipeline
|
||||
from modelscope.utils.constant import Tasks
|
||||
import numpy as np
|
||||
|
||||
if __name__ == '__main__':
|
||||
inference_sv_pipline = pipeline(
|
||||
task=Tasks.speaker_verification,
|
||||
model='damo/speech_xvector_sv-en-us-callhome-8k-spk6135-pytorch'
|
||||
)
|
||||
|
||||
# extract speaker embedding
|
||||
# for url use "spk_embedding" as key
|
||||
rec_result = inference_sv_pipline(
|
||||
audio_in='https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_data/sv_example_enroll.wav')
|
||||
enroll = rec_result["spk_embedding"]
|
||||
|
||||
# for local file use "spk_embedding" as key
|
||||
rec_result = inference_sv_pipline(audio_in='example/sv_example_same.wav')
|
||||
same = rec_result["spk_embedding"]
|
||||
|
||||
import soundfile
|
||||
wav = soundfile.read('example/sv_example_enroll.wav')[0]
|
||||
# for raw inputs use "spk_embedding" as key
|
||||
spk_embedding = inference_sv_pipline(audio_in=wav)["spk_embedding"]
|
||||
|
||||
rec_result = inference_sv_pipline(
|
||||
audio_in='https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_data/sv_example_different.wav')
|
||||
different = rec_result["spk_embedding"]
|
||||
|
||||
# calculate cosine similarity for same speaker
|
||||
sv_threshold = 0.80
|
||||
same_cos = np.sum(enroll * same) / (np.linalg.norm(enroll) * np.linalg.norm(same))
|
||||
same_cos = max(same_cos - sv_threshold, 0.0) / (1.0 - sv_threshold) * 100.0
|
||||
print("Similarity:", same_cos)
|
||||
|
||||
# calculate cosine similarity for different speaker
|
||||
diff_cos = np.sum(enroll * different) / (np.linalg.norm(enroll) * np.linalg.norm(different))
|
||||
diff_cos = max(diff_cos - sv_threshold, 0.0) / (1.0 - sv_threshold) * 100.0
|
||||
print("Similarity:", diff_cos)
|
||||
@ -0,0 +1,21 @@
|
||||
from modelscope.pipelines import pipeline
|
||||
from modelscope.utils.constant import Tasks
|
||||
|
||||
if __name__ == '__main__':
|
||||
inference_sv_pipline = pipeline(
|
||||
task=Tasks.speaker_verification,
|
||||
model='damo/speech_xvector_sv-en-us-callhome-8k-spk6135-pytorch'
|
||||
)
|
||||
|
||||
# the same speaker
|
||||
rec_result = inference_sv_pipline(audio_in=(
|
||||
'https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_data/sv_example_enroll.wav',
|
||||
'https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_data/sv_example_same.wav'))
|
||||
print("Similarity", rec_result["scores"])
|
||||
|
||||
# different speakers
|
||||
rec_result = inference_sv_pipline(audio_in=(
|
||||
'https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_data/sv_example_enroll.wav',
|
||||
'https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_data/sv_example_different.wav'))
|
||||
|
||||
print("Similarity", rec_result["scores"])
|
||||
@ -9,14 +9,20 @@ if __name__ == '__main__':
|
||||
)
|
||||
|
||||
# extract speaker embeddings for different utterances
|
||||
# for url use "utt_id" as key
|
||||
rec_result = inference_sv_pipline(
|
||||
audio_in='https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/sv_example_enroll.wav')
|
||||
enroll = rec_result["spk_embedding"]
|
||||
|
||||
rec_result = inference_sv_pipline(
|
||||
audio_in='https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/sv_example_same.wav')
|
||||
# for local file use "utt_id" as key
|
||||
rec_result = inference_sv_pipline(audio_in='sv_example_same.wav')["test1"]
|
||||
same = rec_result["spk_embedding"]
|
||||
|
||||
import soundfile
|
||||
wav = soundfile.read('sv_example_enroll.wav')[0]
|
||||
# for raw inputs use "utt_id" as key
|
||||
spk_embedding = inference_sv_pipline(audio_in=wav)["spk_embedding"]
|
||||
|
||||
rec_result = inference_sv_pipline(
|
||||
audio_in='https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/sv_example_different.wav')
|
||||
different = rec_result["spk_embedding"]
|
||||
|
||||
@ -0,0 +1,25 @@
|
||||
# ModelScope Model
|
||||
|
||||
## How to finetune and infer using a pretrained ModelScope Model
|
||||
|
||||
### Inference
|
||||
|
||||
Or you can use the finetuned model for inference directly.
|
||||
|
||||
- Setting parameters in `infer.py`
    - <strong>audio_in:</strong> # supports a wav file, url, bytes, or parsed audio input (see the sketch after the snippet below)
    - <strong>text_in:</strong> # supports plain text or a text url
    - <strong>output_dir:</strong> # must be set if the input format is wav.scp

- Then you can run the pipeline to infer with:
```python
python infer.py
```
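
The following sketch is not part of `infer.py` itself; the model id is reused from elsewhere in this repo and the local paths are placeholders. It only illustrates the three `audio_in` forms listed above:

```python
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks
import soundfile

inference_pipline = pipeline(
    task=Tasks.auto_speech_recognition,
    model="damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch",
    output_dir="./results",  # needed when the input is a wav.scp list
)

# 1) url input
print(inference_pipline(audio_in="https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/asr_example_zh.wav"))

# 2) kaldi-style wav.scp list (hypothetical local path); results are written under output_dir
print(inference_pipline(audio_in="./data/test/wav.scp"))

# 3) parsed audio: a waveform array read with soundfile (hypothetical local file);
#    whether a sample-rate hint is needed depends on the model configuration
wav, fs = soundfile.read("./data/test/asr_example.wav")
print(inference_pipline(audio_in=wav))
```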
Modify inference-related parameters in `vad.yaml` (a minimal sketch of editing these fields follows the list):

- max_end_silence_time: the trailing-silence duration used to decide that an utterance has ended; valid range 500 ms~6000 ms, default 800 ms
- speech_noise_thres: the balance between speech and noise scores; valid range (-1, 1)
    - values closer to -1 make noise more likely to be judged as speech
    - values closer to 1 make speech more likely to be judged as noise
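A short sketch, not from the original docs, of adjusting the two fields above with PyYAML; it assumes they are top-level keys in `vad.yaml`, so adjust the key path and the file location to the model you actually downloaded:

```python
# Sketch only: max_end_silence_time / speech_noise_thres are assumed to be
# top-level keys of vad.yaml; the path below is a placeholder.
import yaml

vad_config_path = "./checkpoint/vad/vad.yaml"

with open(vad_config_path) as f:
    cfg = yaml.safe_load(f)

cfg["max_end_silence_time"] = 500   # 500 ms ~ 6000 ms, default 800 ms
cfg["speech_noise_thres"] = 0.8     # in (-1, 1); larger values treat more audio as noise

with open(vad_config_path, "w") as f:
    yaml.safe_dump(cfg, f)
```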
@ -0,0 +1,12 @@
|
||||
from modelscope.pipelines import pipeline
|
||||
from modelscope.utils.constant import Tasks
|
||||
|
||||
inference_pipline = pipeline(
|
||||
task=Tasks.speech_timestamp,
|
||||
model='damo/speech_timestamp_prediction-v1-16k-offline',
|
||||
output_dir='./tmp')
|
||||
|
||||
rec_result = inference_pipline(
|
||||
audio_in='https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/asr_example_timestamps.wav',
|
||||
text_in='一 个 东 太 平 洋 国 家 为 什 么 跑 到 西 太 平 洋 来 了 呢')
|
||||
print(rec_result)
|
||||
@ -7,7 +7,7 @@ if __name__ == '__main__':
|
||||
inference_pipline = pipeline(
|
||||
task=Tasks.voice_activity_detection,
|
||||
model="damo/speech_fsmn_vad_zh-cn-16k-common-pytorch",
|
||||
model_revision=None,
|
||||
model_revision='v1.2.0',
|
||||
output_dir=output_dir,
|
||||
batch_size=1,
|
||||
)
|
||||
|
||||
@ -0,0 +1,33 @@
|
||||
from modelscope.pipelines import pipeline
|
||||
from modelscope.utils.constant import Tasks
|
||||
import soundfile
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
output_dir = None
|
||||
inference_pipline = pipeline(
|
||||
task=Tasks.voice_activity_detection,
|
||||
model="damo/speech_fsmn_vad_zh-cn-16k-common-pytorch",
|
||||
model_revision='v1.2.0',
|
||||
output_dir=output_dir,
|
||||
batch_size=1,
|
||||
mode='online',
|
||||
)
|
||||
speech, sample_rate = soundfile.read("./vad_example_16k.wav")
|
||||
speech_length = speech.shape[0]
|
||||
|
||||
sample_offset = 0
|
||||
|
||||
step = 160 * 10
|
||||
param_dict = {'in_cache': dict()}
|
||||
for sample_offset in range(0, speech_length, min(step, speech_length - sample_offset)):
|
||||
if sample_offset + step >= speech_length - 1:
|
||||
step = speech_length - sample_offset
|
||||
is_final = True
|
||||
else:
|
||||
is_final = False
|
||||
param_dict['is_final'] = is_final
|
||||
segments_result = inference_pipline(audio_in=speech[sample_offset: sample_offset + step],
|
||||
param_dict=param_dict)
|
||||
print(segments_result)
|
||||
|
||||
@ -7,8 +7,8 @@ if __name__ == '__main__':
|
||||
inference_pipline = pipeline(
|
||||
task=Tasks.voice_activity_detection,
|
||||
model="damo/speech_fsmn_vad_zh-cn-8k-common",
|
||||
model_revision=None,
|
||||
output_dir='./output_dir',
|
||||
model_revision='v1.2.0',
|
||||
output_dir=output_dir,
|
||||
batch_size=1,
|
||||
)
|
||||
segments_result = inference_pipline(audio_in=audio_in)
|
||||
|
||||
@ -0,0 +1,33 @@
|
||||
from modelscope.pipelines import pipeline
|
||||
from modelscope.utils.constant import Tasks
|
||||
import soundfile
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
output_dir = None
|
||||
inference_pipline = pipeline(
|
||||
task=Tasks.voice_activity_detection,
|
||||
model="damo/speech_fsmn_vad_zh-cn-8k-common",
|
||||
model_revision='v1.2.0',
|
||||
output_dir=output_dir,
|
||||
batch_size=1,
|
||||
mode='online',
|
||||
)
|
||||
speech, sample_rate = soundfile.read("./vad_example_8k.wav")
|
||||
speech_length = speech.shape[0]
|
||||
|
||||
sample_offset = 0
|
||||
|
||||
step = 80 * 10
|
||||
param_dict = {'in_cache': dict()}
|
||||
for sample_offset in range(0, speech_length, min(step, speech_length - sample_offset)):
|
||||
if sample_offset + step >= speech_length - 1:
|
||||
step = speech_length - sample_offset
|
||||
is_final = True
|
||||
else:
|
||||
is_final = False
|
||||
param_dict['is_final'] = is_final
|
||||
segments_result = inference_pipline(audio_in=speech[sample_offset: sample_offset + step],
|
||||
param_dict=param_dict)
|
||||
print(segments_result)
|
||||
|
||||
@ -52,7 +52,7 @@ class Speech2Text:
|
||||
|
||||
Examples:
|
||||
>>> import soundfile
|
||||
>>> speech2text = Speech2Text("asr_config.yml", "asr.pth")
|
||||
>>> speech2text = Speech2Text("asr_config.yml", "asr.pb")
|
||||
>>> audio, rate = soundfile.read("speech.wav")
|
||||
>>> speech2text(audio)
|
||||
[(text, token, token_int, hypothesis object), ...]
|
||||
|
||||
@ -216,6 +216,9 @@ def inference_launch(**kwargs):
|
||||
elif mode == "paraformer":
|
||||
from funasr.bin.asr_inference_paraformer import inference_modelscope
|
||||
return inference_modelscope(**kwargs)
|
||||
elif mode == "paraformer_streaming":
|
||||
from funasr.bin.asr_inference_paraformer_streaming import inference_modelscope
|
||||
return inference_modelscope(**kwargs)
|
||||
elif mode == "paraformer_vad":
|
||||
from funasr.bin.asr_inference_paraformer_vad import inference_modelscope
|
||||
return inference_modelscope(**kwargs)
|
||||
|
||||
@ -55,7 +55,7 @@ class Speech2Text:
|
||||
|
||||
Examples:
|
||||
>>> import soundfile
|
||||
>>> speech2text = Speech2Text("asr_config.yml", "asr.pth")
|
||||
>>> speech2text = Speech2Text("asr_config.yml", "asr.pb")
|
||||
>>> audio, rate = soundfile.read("speech.wav")
|
||||
>>> speech2text(audio)
|
||||
[(text, token, token_int, hypothesis object), ...]
|
||||
@ -194,8 +194,8 @@ class Speech2Text:
|
||||
# Input as audio signal
|
||||
if isinstance(speech, np.ndarray):
|
||||
speech = torch.tensor(speech)
|
||||
|
||||
|
||||
if(speech.dim()==3):
|
||||
speech = torch.squeeze(speech, 2)
|
||||
#speech = speech.unsqueeze(0).to(getattr(torch, self.dtype))
|
||||
speech = speech.to(getattr(torch, self.dtype))
|
||||
# lengths: (1,)
|
||||
@ -534,6 +534,8 @@ def inference_modelscope(
|
||||
data_path_and_name_and_type,
|
||||
dtype=dtype,
|
||||
batch_size=batch_size,
|
||||
fs=fs,
|
||||
mc=True,
|
||||
key_file=key_file,
|
||||
num_workers=num_workers,
|
||||
preprocess_fn=ASRTask.build_preprocess_fn(speech2text.asr_train_args, False),
|
||||
|
||||
@ -42,6 +42,7 @@ from funasr.utils import asr_utils, wav_utils, postprocess_utils
|
||||
from funasr.models.frontend.wav_frontend import WavFrontend
|
||||
from funasr.models.e2e_asr_paraformer import BiCifParaformer, ContextualParaformer
|
||||
from funasr.export.models.e2e_asr_paraformer import Paraformer as Paraformer_export
|
||||
from funasr.utils.timestamp_tools import ts_prediction_lfr6_standard
|
||||
|
||||
|
||||
class Speech2Text:
|
||||
@ -49,7 +50,7 @@ class Speech2Text:
|
||||
|
||||
Examples:
|
||||
>>> import soundfile
|
||||
>>> speech2text = Speech2Text("asr_config.yml", "asr.pth")
|
||||
>>> speech2text = Speech2Text("asr_config.yml", "asr.pb")
|
||||
>>> audio, rate = soundfile.read("speech.wav")
|
||||
>>> speech2text(audio)
|
||||
[(text, token, token_int, hypothesis object), ...]
|
||||
@ -190,7 +191,8 @@ class Speech2Text:
|
||||
|
||||
@torch.no_grad()
|
||||
def __call__(
|
||||
self, speech: Union[torch.Tensor, np.ndarray], speech_lengths: Union[torch.Tensor, np.ndarray] = None
|
||||
self, speech: Union[torch.Tensor, np.ndarray], speech_lengths: Union[torch.Tensor, np.ndarray] = None,
|
||||
begin_time: int = 0, end_time: int = None,
|
||||
):
|
||||
"""Inference
|
||||
|
||||
@ -242,6 +244,10 @@ class Speech2Text:
|
||||
decoder_outs = self.asr_model.cal_decoder_with_predictor(enc, enc_len, pre_acoustic_embeds, pre_token_length, hw_list=self.hotword_list)
|
||||
decoder_out, ys_pad_lens = decoder_outs[0], decoder_outs[1]
|
||||
|
||||
if isinstance(self.asr_model, BiCifParaformer):
|
||||
_, _, us_alphas, us_peaks = self.asr_model.calc_predictor_timestamp(enc, enc_len,
|
||||
pre_token_length) # test no bias cif2
|
||||
|
||||
results = []
|
||||
b, n, d = decoder_out.size()
|
||||
for i in range(b):
|
||||
@ -284,7 +290,14 @@ class Speech2Text:
|
||||
else:
|
||||
text = None
|
||||
|
||||
results.append((text, token, token_int, hyp, enc_len_batch_total, lfr_factor))
|
||||
if isinstance(self.asr_model, BiCifParaformer):
|
||||
_, timestamp = ts_prediction_lfr6_standard(us_alphas[i],
|
||||
us_peaks[i],
|
||||
copy.copy(token),
|
||||
vad_offset=begin_time)
|
||||
results.append((text, token, token_int, hyp, timestamp, enc_len_batch_total, lfr_factor))
|
||||
else:
|
||||
results.append((text, token, token_int, hyp, enc_len_batch_total, lfr_factor))
|
||||
|
||||
# assert check_return_type(results)
|
||||
return results
|
||||
@ -683,6 +696,11 @@ def inference_modelscope(
|
||||
inference=True,
|
||||
)
|
||||
|
||||
if param_dict is not None:
|
||||
use_timestamp = param_dict.get('use_timestamp', True)
|
||||
else:
|
||||
use_timestamp = True
|
||||
|
||||
forward_time_total = 0.0
|
||||
length_total = 0.0
|
||||
finish_count = 0
|
||||
@ -724,7 +742,9 @@ def inference_modelscope(
|
||||
result = [results[batch_id][:-2]]
|
||||
|
||||
key = keys[batch_id]
|
||||
for n, (text, token, token_int, hyp) in zip(range(1, nbest + 1), result):
|
||||
for n, result in zip(range(1, nbest + 1), result):
|
||||
text, token, token_int, hyp = result[0], result[1], result[2], result[3]
|
||||
time_stamp = None if len(result) < 5 else result[4]
|
||||
# Create a directory: outdir/{n}best_recog
|
||||
if writer is not None:
|
||||
ibest_writer = writer[f"{n}best_recog"]
|
||||
@ -736,8 +756,20 @@ def inference_modelscope(
|
||||
ibest_writer["rtf"][key] = rtf_cur
|
||||
|
||||
if text is not None:
|
||||
text_postprocessed, _ = postprocess_utils.sentence_postprocess(token)
|
||||
if use_timestamp and time_stamp is not None:
|
||||
postprocessed_result = postprocess_utils.sentence_postprocess(token, time_stamp)
|
||||
else:
|
||||
postprocessed_result = postprocess_utils.sentence_postprocess(token)
|
||||
time_stamp_postprocessed = ""
|
||||
if len(postprocessed_result) == 3:
|
||||
text_postprocessed, time_stamp_postprocessed, word_lists = postprocessed_result[0], \
|
||||
postprocessed_result[1], \
|
||||
postprocessed_result[2]
|
||||
else:
|
||||
text_postprocessed, word_lists = postprocessed_result[0], postprocessed_result[1]
|
||||
item = {'key': key, 'value': text_postprocessed}
|
||||
if time_stamp_postprocessed != "":
|
||||
item['time_stamp'] = time_stamp_postprocessed
|
||||
asr_result_list.append(item)
|
||||
finish_count += 1
|
||||
# asr_utils.print_progress(finish_count / file_count)
|
||||
|
||||
907
funasr/bin/asr_inference_paraformer_streaming.py
Normal file
@ -0,0 +1,907 @@
|
||||
#!/usr/bin/env python3
|
||||
import argparse
|
||||
import logging
|
||||
import sys
|
||||
import time
|
||||
import copy
|
||||
import os
|
||||
import codecs
|
||||
import tempfile
|
||||
import requests
|
||||
from pathlib import Path
|
||||
from typing import Optional
|
||||
from typing import Sequence
|
||||
from typing import Tuple
|
||||
from typing import Union
|
||||
from typing import Dict
|
||||
from typing import Any
|
||||
from typing import List
|
||||
|
||||
import numpy as np
|
||||
import torch
|
||||
from typeguard import check_argument_types
|
||||
|
||||
from funasr.fileio.datadir_writer import DatadirWriter
|
||||
from funasr.modules.beam_search.beam_search import BeamSearchPara as BeamSearch
|
||||
from funasr.modules.beam_search.beam_search import Hypothesis
|
||||
from funasr.modules.scorers.ctc import CTCPrefixScorer
|
||||
from funasr.modules.scorers.length_bonus import LengthBonus
|
||||
from funasr.modules.subsampling import TooShortUttError
|
||||
from funasr.tasks.asr import ASRTaskParaformer as ASRTask
|
||||
from funasr.tasks.lm import LMTask
|
||||
from funasr.text.build_tokenizer import build_tokenizer
|
||||
from funasr.text.token_id_converter import TokenIDConverter
|
||||
from funasr.torch_utils.device_funcs import to_device
|
||||
from funasr.torch_utils.set_all_random_seed import set_all_random_seed
|
||||
from funasr.utils import config_argparse
|
||||
from funasr.utils.cli_utils import get_commandline_args
|
||||
from funasr.utils.types import str2bool
|
||||
from funasr.utils.types import str2triple_str
|
||||
from funasr.utils.types import str_or_none
|
||||
from funasr.utils import asr_utils, wav_utils, postprocess_utils
|
||||
from funasr.models.frontend.wav_frontend import WavFrontend
|
||||
from funasr.models.e2e_asr_paraformer import BiCifParaformer, ContextualParaformer
|
||||
from funasr.export.models.e2e_asr_paraformer import Paraformer as Paraformer_export
|
||||
|
||||
class Speech2Text:
|
||||
"""Speech2Text class
|
||||
|
||||
Examples:
|
||||
>>> import soundfile
|
||||
>>> speech2text = Speech2Text("asr_config.yml", "asr.pth")
|
||||
>>> audio, rate = soundfile.read("speech.wav")
|
||||
>>> speech2text(audio)
|
||||
[(text, token, token_int, hypothesis object), ...]
|
||||
|
||||
"""
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
asr_train_config: Union[Path, str] = None,
|
||||
asr_model_file: Union[Path, str] = None,
|
||||
cmvn_file: Union[Path, str] = None,
|
||||
lm_train_config: Union[Path, str] = None,
|
||||
lm_file: Union[Path, str] = None,
|
||||
token_type: str = None,
|
||||
bpemodel: str = None,
|
||||
device: str = "cpu",
|
||||
maxlenratio: float = 0.0,
|
||||
minlenratio: float = 0.0,
|
||||
dtype: str = "float32",
|
||||
beam_size: int = 20,
|
||||
ctc_weight: float = 0.5,
|
||||
lm_weight: float = 1.0,
|
||||
ngram_weight: float = 0.9,
|
||||
penalty: float = 0.0,
|
||||
nbest: int = 1,
|
||||
frontend_conf: dict = None,
|
||||
hotword_list_or_file: str = None,
|
||||
**kwargs,
|
||||
):
|
||||
assert check_argument_types()
|
||||
|
||||
# 1. Build ASR model
|
||||
scorers = {}
|
||||
asr_model, asr_train_args = ASRTask.build_model_from_file(
|
||||
asr_train_config, asr_model_file, cmvn_file, device
|
||||
)
|
||||
frontend = None
|
||||
if asr_train_args.frontend is not None and asr_train_args.frontend_conf is not None:
|
||||
frontend = WavFrontend(cmvn_file=cmvn_file, **asr_train_args.frontend_conf)
|
||||
|
||||
logging.info("asr_model: {}".format(asr_model))
|
||||
logging.info("asr_train_args: {}".format(asr_train_args))
|
||||
asr_model.to(dtype=getattr(torch, dtype)).eval()
|
||||
|
||||
if asr_model.ctc != None:
|
||||
ctc = CTCPrefixScorer(ctc=asr_model.ctc, eos=asr_model.eos)
|
||||
scorers.update(
|
||||
ctc=ctc
|
||||
)
|
||||
token_list = asr_model.token_list
|
||||
scorers.update(
|
||||
length_bonus=LengthBonus(len(token_list)),
|
||||
)
|
||||
|
||||
# 2. Build Language model
|
||||
if lm_train_config is not None:
|
||||
lm, lm_train_args = LMTask.build_model_from_file(
|
||||
lm_train_config, lm_file, device
|
||||
)
|
||||
scorers["lm"] = lm.lm
|
||||
|
||||
# 3. Build ngram model
|
||||
# ngram is not supported now
|
||||
ngram = None
|
||||
scorers["ngram"] = ngram
|
||||
|
||||
# 4. Build BeamSearch object
|
||||
# transducer is not supported now
|
||||
beam_search_transducer = None
|
||||
|
||||
weights = dict(
|
||||
decoder=1.0 - ctc_weight,
|
||||
ctc=ctc_weight,
|
||||
lm=lm_weight,
|
||||
ngram=ngram_weight,
|
||||
length_bonus=penalty,
|
||||
)
|
||||
beam_search = BeamSearch(
|
||||
beam_size=beam_size,
|
||||
weights=weights,
|
||||
scorers=scorers,
|
||||
sos=asr_model.sos,
|
||||
eos=asr_model.eos,
|
||||
vocab_size=len(token_list),
|
||||
token_list=token_list,
|
||||
pre_beam_score_key=None if ctc_weight == 1.0 else "full",
|
||||
)
|
||||
|
||||
beam_search.to(device=device, dtype=getattr(torch, dtype)).eval()
|
||||
for scorer in scorers.values():
|
||||
if isinstance(scorer, torch.nn.Module):
|
||||
scorer.to(device=device, dtype=getattr(torch, dtype)).eval()
|
||||
|
||||
logging.info(f"Decoding device={device}, dtype={dtype}")
|
||||
|
||||
# 5. [Optional] Build Text converter: e.g. bpe-sym -> Text
|
||||
if token_type is None:
|
||||
token_type = asr_train_args.token_type
|
||||
if bpemodel is None:
|
||||
bpemodel = asr_train_args.bpemodel
|
||||
|
||||
if token_type is None:
|
||||
tokenizer = None
|
||||
elif token_type == "bpe":
|
||||
if bpemodel is not None:
|
||||
tokenizer = build_tokenizer(token_type=token_type, bpemodel=bpemodel)
|
||||
else:
|
||||
tokenizer = None
|
||||
else:
|
||||
tokenizer = build_tokenizer(token_type=token_type)
|
||||
converter = TokenIDConverter(token_list=token_list)
|
||||
logging.info(f"Text tokenizer: {tokenizer}")
|
||||
|
||||
self.asr_model = asr_model
|
||||
self.asr_train_args = asr_train_args
|
||||
self.converter = converter
|
||||
self.tokenizer = tokenizer
|
||||
|
||||
# 6. [Optional] Build hotword list from str, local file or url
|
||||
|
||||
is_use_lm = lm_weight != 0.0 and lm_file is not None
|
||||
if (ctc_weight == 0.0 or asr_model.ctc == None) and not is_use_lm:
|
||||
beam_search = None
|
||||
self.beam_search = beam_search
|
||||
logging.info(f"Beam_search: {self.beam_search}")
|
||||
self.beam_search_transducer = beam_search_transducer
|
||||
self.maxlenratio = maxlenratio
|
||||
self.minlenratio = minlenratio
|
||||
self.device = device
|
||||
self.dtype = dtype
|
||||
self.nbest = nbest
|
||||
self.frontend = frontend
|
||||
self.encoder_downsampling_factor = 1
|
||||
if asr_train_args.encoder == "data2vec_encoder" or asr_train_args.encoder_conf["input_layer"] == "conv2d":
|
||||
self.encoder_downsampling_factor = 4
|
||||
|
||||
@torch.no_grad()
|
||||
def __call__(
|
||||
self, cache: dict, speech: Union[torch.Tensor, np.ndarray], speech_lengths: Union[torch.Tensor, np.ndarray] = None,
|
||||
begin_time: int = 0, end_time: int = None,
|
||||
):
|
||||
"""Inference
|
||||
|
||||
Args:
|
||||
speech: Input speech data
|
||||
Returns:
|
||||
text, token, token_int, hyp
|
||||
|
||||
"""
|
||||
assert check_argument_types()
|
||||
|
||||
# Input as audio signal
|
||||
if isinstance(speech, np.ndarray):
|
||||
speech = torch.tensor(speech)
|
||||
|
||||
if self.frontend is not None:
|
||||
feats, feats_len = self.frontend.forward(speech, speech_lengths)
|
||||
feats = to_device(feats, device=self.device)
|
||||
feats_len = feats_len.int()
|
||||
self.asr_model.frontend = None
|
||||
else:
|
||||
feats = speech
|
||||
feats_len = speech_lengths
|
||||
lfr_factor = max(1, (feats.size()[-1] // 80) - 1)
|
||||
batch = {"speech": feats, "speech_lengths": feats_len, "cache": cache}
|
||||
|
||||
# a. To device
|
||||
batch = to_device(batch, device=self.device)
|
||||
|
||||
# b. Forward Encoder
|
||||
enc, enc_len = self.asr_model.encode_chunk(**batch)
|
||||
if isinstance(enc, tuple):
|
||||
enc = enc[0]
|
||||
# assert len(enc) == 1, len(enc)
|
||||
enc_len_batch_total = torch.sum(enc_len).item() * self.encoder_downsampling_factor
|
||||
|
||||
predictor_outs = self.asr_model.calc_predictor_chunk(enc, cache)
|
||||
pre_acoustic_embeds, pre_token_length, alphas, pre_peak_index = predictor_outs[0], predictor_outs[1], \
|
||||
predictor_outs[2], predictor_outs[3]
|
||||
pre_token_length = pre_token_length.floor().long()
|
||||
if torch.max(pre_token_length) < 1:
|
||||
return []
|
||||
decoder_outs = self.asr_model.cal_decoder_with_predictor_chunk(enc, pre_acoustic_embeds, cache)
|
||||
decoder_out = decoder_outs
|
||||
|
||||
results = []
|
||||
b, n, d = decoder_out.size()
|
||||
for i in range(b):
|
||||
x = enc[i, :enc_len[i], :]
|
||||
am_scores = decoder_out[i, :pre_token_length[i], :]
|
||||
if self.beam_search is not None:
|
||||
nbest_hyps = self.beam_search(
|
||||
x=x, am_scores=am_scores, maxlenratio=self.maxlenratio, minlenratio=self.minlenratio
|
||||
)
|
||||
|
||||
nbest_hyps = nbest_hyps[: self.nbest]
|
||||
else:
|
||||
yseq = am_scores.argmax(dim=-1)
|
||||
score = am_scores.max(dim=-1)[0]
|
||||
score = torch.sum(score, dim=-1)
|
||||
# pad with mask tokens to ensure compatibility with sos/eos tokens
|
||||
yseq = torch.tensor(
|
||||
[self.asr_model.sos] + yseq.tolist() + [self.asr_model.eos], device=yseq.device
|
||||
)
|
||||
nbest_hyps = [Hypothesis(yseq=yseq, score=score)]
|
||||
|
||||
for hyp in nbest_hyps:
|
||||
assert isinstance(hyp, (Hypothesis)), type(hyp)
|
||||
|
||||
# remove sos/eos and get results
|
||||
last_pos = -1
|
||||
if isinstance(hyp.yseq, list):
|
||||
token_int = hyp.yseq[1:last_pos]
|
||||
else:
|
||||
token_int = hyp.yseq[1:last_pos].tolist()
|
||||
|
||||
# remove blank symbol id, which is assumed to be 0
|
||||
token_int = list(filter(lambda x: x != 0 and x != 2, token_int))
|
||||
|
||||
# Change integer-ids to tokens
|
||||
token = self.converter.ids2tokens(token_int)
|
||||
|
||||
if self.tokenizer is not None:
|
||||
text = self.tokenizer.tokens2text(token)
|
||||
else:
|
||||
text = None
|
||||
|
||||
results.append((text, token, token_int, hyp, enc_len_batch_total, lfr_factor))
|
||||
|
||||
# assert check_return_type(results)
|
||||
return results
|
||||
|
||||
|
||||
class Speech2TextExport:
|
||||
"""Speech2TextExport class
|
||||
|
||||
"""
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
asr_train_config: Union[Path, str] = None,
|
||||
asr_model_file: Union[Path, str] = None,
|
||||
cmvn_file: Union[Path, str] = None,
|
||||
lm_train_config: Union[Path, str] = None,
|
||||
lm_file: Union[Path, str] = None,
|
||||
token_type: str = None,
|
||||
bpemodel: str = None,
|
||||
device: str = "cpu",
|
||||
maxlenratio: float = 0.0,
|
||||
minlenratio: float = 0.0,
|
||||
dtype: str = "float32",
|
||||
beam_size: int = 20,
|
||||
ctc_weight: float = 0.5,
|
||||
lm_weight: float = 1.0,
|
||||
ngram_weight: float = 0.9,
|
||||
penalty: float = 0.0,
|
||||
nbest: int = 1,
|
||||
frontend_conf: dict = None,
|
||||
hotword_list_or_file: str = None,
|
||||
**kwargs,
|
||||
):
|
||||
|
||||
# 1. Build ASR model
|
||||
asr_model, asr_train_args = ASRTask.build_model_from_file(
|
||||
asr_train_config, asr_model_file, cmvn_file, device
|
||||
)
|
||||
frontend = None
|
||||
if asr_train_args.frontend is not None and asr_train_args.frontend_conf is not None:
|
||||
frontend = WavFrontend(cmvn_file=cmvn_file, **asr_train_args.frontend_conf)
|
||||
|
||||
logging.info("asr_model: {}".format(asr_model))
|
||||
logging.info("asr_train_args: {}".format(asr_train_args))
|
||||
asr_model.to(dtype=getattr(torch, dtype)).eval()
|
||||
|
||||
token_list = asr_model.token_list
|
||||
|
||||
logging.info(f"Decoding device={device}, dtype={dtype}")
|
||||
|
||||
# 5. [Optional] Build Text converter: e.g. bpe-sym -> Text
|
||||
if token_type is None:
|
||||
token_type = asr_train_args.token_type
|
||||
if bpemodel is None:
|
||||
bpemodel = asr_train_args.bpemodel
|
||||
|
||||
if token_type is None:
|
||||
tokenizer = None
|
||||
elif token_type == "bpe":
|
||||
if bpemodel is not None:
|
||||
tokenizer = build_tokenizer(token_type=token_type, bpemodel=bpemodel)
|
||||
else:
|
||||
tokenizer = None
|
||||
else:
|
||||
tokenizer = build_tokenizer(token_type=token_type)
|
||||
converter = TokenIDConverter(token_list=token_list)
|
||||
logging.info(f"Text tokenizer: {tokenizer}")
|
||||
|
||||
# self.asr_model = asr_model
|
||||
self.asr_train_args = asr_train_args
|
||||
self.converter = converter
|
||||
self.tokenizer = tokenizer
|
||||
|
||||
self.device = device
|
||||
self.dtype = dtype
|
||||
self.nbest = nbest
|
||||
self.frontend = frontend
|
||||
|
||||
model = Paraformer_export(asr_model, onnx=False)
|
||||
self.asr_model = model
|
||||
|
||||
@torch.no_grad()
|
||||
def __call__(
|
||||
self, speech: Union[torch.Tensor, np.ndarray], speech_lengths: Union[torch.Tensor, np.ndarray] = None
|
||||
):
|
||||
"""Inference
|
||||
|
||||
Args:
|
||||
speech: Input speech data
|
||||
Returns:
|
||||
text, token, token_int, hyp
|
||||
|
||||
"""
|
||||
assert check_argument_types()
|
||||
|
||||
# Input as audio signal
|
||||
if isinstance(speech, np.ndarray):
|
||||
speech = torch.tensor(speech)
|
||||
|
||||
if self.frontend is not None:
|
||||
feats, feats_len = self.frontend.forward(speech, speech_lengths)
|
||||
feats = to_device(feats, device=self.device)
|
||||
feats_len = feats_len.int()
|
||||
self.asr_model.frontend = None
|
||||
else:
|
||||
feats = speech
|
||||
feats_len = speech_lengths
|
||||
|
||||
enc_len_batch_total = feats_len.sum()
|
||||
lfr_factor = max(1, (feats.size()[-1] // 80) - 1)
|
||||
batch = {"speech": feats, "speech_lengths": feats_len}
|
||||
|
||||
# a. To device
|
||||
batch = to_device(batch, device=self.device)
|
||||
|
||||
decoder_outs = self.asr_model(**batch)
|
||||
decoder_out, ys_pad_lens = decoder_outs[0], decoder_outs[1]
|
||||
|
||||
results = []
|
||||
b, n, d = decoder_out.size()
|
||||
for i in range(b):
|
||||
am_scores = decoder_out[i, :ys_pad_lens[i], :]
|
||||
|
||||
yseq = am_scores.argmax(dim=-1)
|
||||
score = am_scores.max(dim=-1)[0]
|
||||
score = torch.sum(score, dim=-1)
|
||||
# pad with mask tokens to ensure compatibility with sos/eos tokens
|
||||
yseq = torch.tensor(
|
||||
yseq.tolist(), device=yseq.device
|
||||
)
|
||||
nbest_hyps = [Hypothesis(yseq=yseq, score=score)]
|
||||
|
||||
for hyp in nbest_hyps:
|
||||
assert isinstance(hyp, (Hypothesis)), type(hyp)
|
||||
|
||||
# remove sos/eos and get results
|
||||
last_pos = -1
|
||||
if isinstance(hyp.yseq, list):
|
||||
token_int = hyp.yseq[1:last_pos]
|
||||
else:
|
||||
token_int = hyp.yseq[1:last_pos].tolist()
|
||||
|
||||
# remove blank symbol id, which is assumed to be 0
|
||||
token_int = list(filter(lambda x: x != 0 and x != 2, token_int))
|
||||
|
||||
# Change integer-ids to tokens
|
||||
token = self.converter.ids2tokens(token_int)
|
||||
|
||||
if self.tokenizer is not None:
|
||||
text = self.tokenizer.tokens2text(token)
|
||||
else:
|
||||
text = None
|
||||
|
||||
results.append((text, token, token_int, hyp, enc_len_batch_total, lfr_factor))
|
||||
|
||||
return results
|
||||
|
||||
|
||||
def inference(
|
||||
maxlenratio: float,
|
||||
minlenratio: float,
|
||||
batch_size: int,
|
||||
beam_size: int,
|
||||
ngpu: int,
|
||||
ctc_weight: float,
|
||||
lm_weight: float,
|
||||
penalty: float,
|
||||
log_level: Union[int, str],
|
||||
data_path_and_name_and_type,
|
||||
asr_train_config: Optional[str],
|
||||
asr_model_file: Optional[str],
|
||||
cmvn_file: Optional[str] = None,
|
||||
raw_inputs: Union[np.ndarray, torch.Tensor] = None,
|
||||
lm_train_config: Optional[str] = None,
|
||||
lm_file: Optional[str] = None,
|
||||
token_type: Optional[str] = None,
|
||||
key_file: Optional[str] = None,
|
||||
word_lm_train_config: Optional[str] = None,
|
||||
bpemodel: Optional[str] = None,
|
||||
allow_variable_data_keys: bool = False,
|
||||
streaming: bool = False,
|
||||
output_dir: Optional[str] = None,
|
||||
dtype: str = "float32",
|
||||
seed: int = 0,
|
||||
ngram_weight: float = 0.9,
|
||||
nbest: int = 1,
|
||||
num_workers: int = 1,
|
||||
|
||||
**kwargs,
|
||||
):
|
||||
inference_pipeline = inference_modelscope(
|
||||
maxlenratio=maxlenratio,
|
||||
minlenratio=minlenratio,
|
||||
batch_size=batch_size,
|
||||
beam_size=beam_size,
|
||||
ngpu=ngpu,
|
||||
ctc_weight=ctc_weight,
|
||||
lm_weight=lm_weight,
|
||||
penalty=penalty,
|
||||
log_level=log_level,
|
||||
asr_train_config=asr_train_config,
|
||||
asr_model_file=asr_model_file,
|
||||
cmvn_file=cmvn_file,
|
||||
raw_inputs=raw_inputs,
|
||||
lm_train_config=lm_train_config,
|
||||
lm_file=lm_file,
|
||||
token_type=token_type,
|
||||
key_file=key_file,
|
||||
word_lm_train_config=word_lm_train_config,
|
||||
bpemodel=bpemodel,
|
||||
allow_variable_data_keys=allow_variable_data_keys,
|
||||
streaming=streaming,
|
||||
output_dir=output_dir,
|
||||
dtype=dtype,
|
||||
seed=seed,
|
||||
ngram_weight=ngram_weight,
|
||||
nbest=nbest,
|
||||
num_workers=num_workers,
|
||||
|
||||
**kwargs,
|
||||
)
|
||||
return inference_pipeline(data_path_and_name_and_type, raw_inputs)
|
||||
|
||||
|
||||
def inference_modelscope(
|
||||
maxlenratio: float,
|
||||
minlenratio: float,
|
||||
batch_size: int,
|
||||
beam_size: int,
|
||||
ngpu: int,
|
||||
ctc_weight: float,
|
||||
lm_weight: float,
|
||||
penalty: float,
|
||||
log_level: Union[int, str],
|
||||
# data_path_and_name_and_type,
|
||||
asr_train_config: Optional[str],
|
||||
asr_model_file: Optional[str],
|
||||
cmvn_file: Optional[str] = None,
|
||||
lm_train_config: Optional[str] = None,
|
||||
lm_file: Optional[str] = None,
|
||||
token_type: Optional[str] = None,
|
||||
key_file: Optional[str] = None,
|
||||
word_lm_train_config: Optional[str] = None,
|
||||
bpemodel: Optional[str] = None,
|
||||
allow_variable_data_keys: bool = False,
|
||||
dtype: str = "float32",
|
||||
seed: int = 0,
|
||||
ngram_weight: float = 0.9,
|
||||
nbest: int = 1,
|
||||
num_workers: int = 1,
|
||||
output_dir: Optional[str] = None,
|
||||
param_dict: dict = None,
|
||||
**kwargs,
|
||||
):
|
||||
assert check_argument_types()
|
||||
|
||||
if word_lm_train_config is not None:
|
||||
raise NotImplementedError("Word LM is not implemented")
|
||||
if ngpu > 1:
|
||||
raise NotImplementedError("only single GPU decoding is supported")
|
||||
|
||||
logging.basicConfig(
|
||||
level=log_level,
|
||||
format="%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s",
|
||||
)
|
||||
|
||||
export_mode = False
|
||||
if param_dict is not None:
|
||||
hotword_list_or_file = param_dict.get('hotword')
|
||||
export_mode = param_dict.get("export_mode", False)
|
||||
else:
|
||||
hotword_list_or_file = None
|
||||
|
||||
if ngpu >= 1 and torch.cuda.is_available():
|
||||
device = "cuda"
|
||||
else:
|
||||
device = "cpu"
|
||||
batch_size = 1
|
||||
|
||||
# 1. Set random-seed
|
||||
set_all_random_seed(seed)
|
||||
|
||||
# 2. Build speech2text
|
||||
speech2text_kwargs = dict(
|
||||
asr_train_config=asr_train_config,
|
||||
asr_model_file=asr_model_file,
|
||||
cmvn_file=cmvn_file,
|
||||
lm_train_config=lm_train_config,
|
||||
lm_file=lm_file,
|
||||
token_type=token_type,
|
||||
bpemodel=bpemodel,
|
||||
device=device,
|
||||
maxlenratio=maxlenratio,
|
||||
minlenratio=minlenratio,
|
||||
dtype=dtype,
|
||||
beam_size=beam_size,
|
||||
ctc_weight=ctc_weight,
|
||||
lm_weight=lm_weight,
|
||||
ngram_weight=ngram_weight,
|
||||
penalty=penalty,
|
||||
nbest=nbest,
|
||||
hotword_list_or_file=hotword_list_or_file,
|
||||
)
|
||||
if export_mode:
|
||||
speech2text = Speech2TextExport(**speech2text_kwargs)
|
||||
else:
|
||||
speech2text = Speech2Text(**speech2text_kwargs)
|
||||
|
||||
def _forward(
|
||||
data_path_and_name_and_type,
|
||||
raw_inputs: Union[np.ndarray, torch.Tensor] = None,
|
||||
output_dir_v2: Optional[str] = None,
|
||||
fs: dict = None,
|
||||
param_dict: dict = None,
|
||||
**kwargs,
|
||||
):
|
||||
|
||||
hotword_list_or_file = None
|
||||
if param_dict is not None:
|
||||
hotword_list_or_file = param_dict.get('hotword')
|
||||
if 'hotword' in kwargs:
|
||||
hotword_list_or_file = kwargs['hotword']
|
||||
if hotword_list_or_file is not None or 'hotword' in kwargs:
|
||||
speech2text.hotword_list = speech2text.generate_hotwords_list(hotword_list_or_file)
|
||||
|
||||
# 3. Build data-iterator
|
||||
if data_path_and_name_and_type is None and raw_inputs is not None:
|
||||
if isinstance(raw_inputs, torch.Tensor):
|
||||
raw_inputs = raw_inputs.numpy()
|
||||
data_path_and_name_and_type = [raw_inputs, "speech", "waveform"]
|
||||
loader = ASRTask.build_streaming_iterator(
|
||||
data_path_and_name_and_type,
|
||||
dtype=dtype,
|
||||
fs=fs,
|
||||
batch_size=batch_size,
|
||||
key_file=key_file,
|
||||
num_workers=num_workers,
|
||||
preprocess_fn=ASRTask.build_preprocess_fn(speech2text.asr_train_args, False),
|
||||
collate_fn=ASRTask.build_collate_fn(speech2text.asr_train_args, False),
|
||||
allow_variable_data_keys=allow_variable_data_keys,
|
||||
inference=True,
|
||||
)
|
||||
|
||||
if param_dict is not None:
|
||||
use_timestamp = param_dict.get('use_timestamp', True)
|
||||
else:
|
||||
use_timestamp = True
|
||||
|
||||
forward_time_total = 0.0
|
||||
length_total = 0.0
|
||||
finish_count = 0
|
||||
file_count = 1
|
||||
cache = None
|
||||
# 7 .Start for-loop
|
||||
# FIXME(kamo): The output format should be discussed about
|
||||
asr_result_list = []
|
||||
output_path = output_dir_v2 if output_dir_v2 is not None else output_dir
|
||||
if output_path is not None:
|
||||
writer = DatadirWriter(output_path)
|
||||
else:
|
||||
writer = None
|
||||
if param_dict is not None and "cache" in param_dict:
|
||||
cache = param_dict["cache"]
|
||||
for keys, batch in loader:
|
||||
assert isinstance(batch, dict), type(batch)
|
||||
assert all(isinstance(s, str) for s in keys), keys
|
||||
_bs = len(next(iter(batch.values())))
|
||||
assert len(keys) == _bs, f"{len(keys)} != {_bs}"
|
||||
# batch = {k: v for k, v in batch.items() if not k.endswith("_lengths")}
|
||||
logging.info("decoding, utt_id: {}".format(keys))
|
||||
# N-best list of (text, token, token_int, hyp_object)
|
||||
|
||||
time_beg = time.time()
|
||||
results = speech2text(cache=cache, **batch)
|
||||
if len(results) < 1:
|
||||
hyp = Hypothesis(score=0.0, scores={}, states={}, yseq=[])
|
||||
results = [[" ", ["sil"], [2], hyp, 10, 6]] * nbest
|
||||
time_end = time.time()
|
||||
forward_time = time_end - time_beg
|
||||
lfr_factor = results[0][-1]
|
||||
length = results[0][-2]
|
||||
forward_time_total += forward_time
|
||||
length_total += length
|
||||
rtf_cur = "decoding, feature length: {}, forward_time: {:.4f}, rtf: {:.4f}".format(length, forward_time,
|
||||
100 * forward_time / (
|
||||
length * lfr_factor))
|
||||
logging.info(rtf_cur)
|
||||
|
||||
for batch_id in range(_bs):
|
||||
result = [results[batch_id][:-2]]
|
||||
|
||||
key = keys[batch_id]
|
||||
for n, result in zip(range(1, nbest + 1), result):
|
||||
text, token, token_int, hyp = result[0], result[1], result[2], result[3]
|
||||
time_stamp = None if len(result) < 5 else result[4]
|
||||
# Create a directory: outdir/{n}best_recog
|
||||
if writer is not None:
|
||||
ibest_writer = writer[f"{n}best_recog"]
|
||||
|
||||
# Write the result to each file
|
||||
ibest_writer["token"][key] = " ".join(token)
|
||||
# ibest_writer["token_int"][key] = " ".join(map(str, token_int))
|
||||
ibest_writer["score"][key] = str(hyp.score)
|
||||
ibest_writer["rtf"][key] = rtf_cur
|
||||
|
||||
if text is not None:
|
||||
if use_timestamp and time_stamp is not None:
|
||||
postprocessed_result = postprocess_utils.sentence_postprocess(token, time_stamp)
|
||||
else:
|
||||
postprocessed_result = postprocess_utils.sentence_postprocess(token)
|
||||
time_stamp_postprocessed = ""
|
||||
if len(postprocessed_result) == 3:
|
||||
text_postprocessed, time_stamp_postprocessed, word_lists = postprocessed_result[0], \
|
||||
postprocessed_result[1], \
|
||||
postprocessed_result[2]
|
||||
else:
|
||||
text_postprocessed, word_lists = postprocessed_result[0], postprocessed_result[1]
|
||||
item = {'key': key, 'value': text_postprocessed}
|
||||
if time_stamp_postprocessed != "":
|
||||
item['time_stamp'] = time_stamp_postprocessed
|
||||
asr_result_list.append(item)
|
||||
finish_count += 1
|
||||
# asr_utils.print_progress(finish_count / file_count)
|
||||
if writer is not None:
|
||||
ibest_writer["text"][key] = text_postprocessed
|
||||
|
||||
logging.info("decoding, utt: {}, predictions: {}".format(key, text))
|
||||
rtf_avg = "decoding, feature length total: {}, forward_time total: {:.4f}, rtf avg: {:.4f}".format(length_total,
|
||||
forward_time_total,
|
||||
100 * forward_time_total / (
|
||||
length_total * lfr_factor))
|
||||
logging.info(rtf_avg)
|
||||
if writer is not None:
|
||||
ibest_writer["rtf"]["rtf_avf"] = rtf_avg
|
||||
return asr_result_list
|
||||
|
||||
return _forward
|
||||
|
||||
|
||||
def get_parser():
|
||||
parser = config_argparse.ArgumentParser(
|
||||
description="ASR Decoding",
|
||||
formatter_class=argparse.ArgumentDefaultsHelpFormatter,
|
||||
)
|
||||
|
||||
# Note(kamo): Use '_' instead of '-' as separator.
|
||||
# '-' is confusing if written in yaml.
|
||||
parser.add_argument(
|
||||
"--log_level",
|
||||
type=lambda x: x.upper(),
|
||||
default="INFO",
|
||||
choices=("CRITICAL", "ERROR", "WARNING", "INFO", "DEBUG", "NOTSET"),
|
||||
help="The verbose level of logging",
|
||||
)
|
||||
|
||||
parser.add_argument("--output_dir", type=str, required=True)
|
||||
parser.add_argument(
|
||||
"--ngpu",
|
||||
type=int,
|
||||
default=0,
|
||||
help="The number of gpus. 0 indicates CPU mode",
|
||||
)
|
||||
parser.add_argument("--seed", type=int, default=0, help="Random seed")
|
||||
parser.add_argument(
|
||||
"--dtype",
|
||||
default="float32",
|
||||
choices=["float16", "float32", "float64"],
|
||||
help="Data type",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--num_workers",
|
||||
type=int,
|
||||
default=1,
|
||||
help="The number of workers used for DataLoader",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--hotword",
|
||||
type=str_or_none,
|
||||
default=None,
|
||||
help="hotword file path or hotwords seperated by space"
|
||||
)
|
||||
group = parser.add_argument_group("Input data related")
|
||||
group.add_argument(
|
||||
"--data_path_and_name_and_type",
|
||||
type=str2triple_str,
|
||||
required=False,
|
||||
action="append",
|
||||
)
|
||||
group.add_argument("--key_file", type=str_or_none)
|
||||
group.add_argument("--allow_variable_data_keys", type=str2bool, default=False)
|
||||
|
||||
group = parser.add_argument_group("The model configuration related")
|
||||
group.add_argument(
|
||||
"--asr_train_config",
|
||||
type=str,
|
||||
help="ASR training configuration",
|
||||
)
|
||||
group.add_argument(
|
||||
"--asr_model_file",
|
||||
type=str,
|
||||
help="ASR model parameter file",
|
||||
)
|
||||
group.add_argument(
|
||||
"--cmvn_file",
|
||||
type=str,
|
||||
help="Global cmvn file",
|
||||
)
|
||||
group.add_argument(
|
||||
"--lm_train_config",
|
||||
type=str,
|
||||
help="LM training configuration",
|
||||
)
|
||||
group.add_argument(
|
||||
"--lm_file",
|
||||
type=str,
|
||||
help="LM parameter file",
|
||||
)
|
||||
group.add_argument(
|
||||
"--word_lm_train_config",
|
||||
type=str,
|
||||
help="Word LM training configuration",
|
||||
)
|
||||
group.add_argument(
|
||||
"--word_lm_file",
|
||||
type=str,
|
||||
help="Word LM parameter file",
|
||||
)
|
||||
group.add_argument(
|
||||
"--ngram_file",
|
||||
type=str,
|
||||
help="N-gram parameter file",
|
||||
)
|
||||
group.add_argument(
|
||||
"--model_tag",
|
||||
type=str,
|
||||
help="Pretrained model tag. If specify this option, *_train_config and "
|
||||
"*_file will be overwritten",
|
||||
)
|
||||
|
||||
group = parser.add_argument_group("Beam-search related")
|
||||
group.add_argument(
|
||||
"--batch_size",
|
||||
type=int,
|
||||
default=1,
|
||||
help="The batch size for inference",
|
||||
)
|
||||
group.add_argument("--nbest", type=int, default=1, help="Output N-best hypotheses")
|
||||
group.add_argument("--beam_size", type=int, default=20, help="Beam size")
|
||||
group.add_argument("--penalty", type=float, default=0.0, help="Insertion penalty")
|
||||
group.add_argument(
|
||||
"--maxlenratio",
|
||||
type=float,
|
||||
default=0.0,
|
||||
help="Input length ratio to obtain max output length. "
|
||||
"If maxlenratio=0.0 (default), it uses a end-detect "
|
||||
"function "
|
||||
"to automatically find maximum hypothesis lengths."
|
||||
"If maxlenratio<0.0, its absolute value is interpreted"
|
||||
"as a constant max output length",
|
||||
)
|
||||
group.add_argument(
|
||||
"--minlenratio",
|
||||
type=float,
|
||||
default=0.0,
|
||||
help="Input length ratio to obtain min output length",
|
||||
)
|
||||
group.add_argument(
|
||||
"--ctc_weight",
|
||||
type=float,
|
||||
default=0.5,
|
||||
help="CTC weight in joint decoding",
|
||||
)
|
||||
group.add_argument("--lm_weight", type=float, default=1.0, help="RNNLM weight")
|
||||
group.add_argument("--ngram_weight", type=float, default=0.9, help="ngram weight")
|
||||
group.add_argument("--streaming", type=str2bool, default=False)
|
||||
|
||||
group.add_argument(
|
||||
"--frontend_conf",
|
||||
default=None,
|
||||
help="",
|
||||
)
|
||||
group.add_argument("--raw_inputs", type=list, default=None)
|
||||
# example=[{'key':'EdevDEWdIYQ_0021','file':'/mnt/data/jiangyu.xzy/test_data/speech_io/SPEECHIO_ASR_ZH00007_zhibodaihuo/wav/EdevDEWdIYQ_0021.wav'}])
|
||||
|
||||
group = parser.add_argument_group("Text converter related")
|
||||
group.add_argument(
|
||||
"--token_type",
|
||||
type=str_or_none,
|
||||
default=None,
|
||||
choices=["char", "bpe", None],
|
||||
help="The token type for ASR model. "
|
||||
"If not given, refers from the training args",
|
||||
)
|
||||
group.add_argument(
|
||||
"--bpemodel",
|
||||
type=str_or_none,
|
||||
default=None,
|
||||
help="The model path of sentencepiece. "
|
||||
"If not given, refers from the training args",
|
||||
)
|
||||
|
||||
return parser
|
||||
|
||||
|
||||
def main(cmd=None):
|
||||
print(get_commandline_args(), file=sys.stderr)
|
||||
parser = get_parser()
|
||||
args = parser.parse_args(cmd)
|
||||
param_dict = {'hotword': args.hotword}
|
||||
kwargs = vars(args)
|
||||
kwargs.pop("config", None)
|
||||
kwargs['param_dict'] = param_dict
|
||||
inference(**kwargs)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
|
||||
# from modelscope.pipelines import pipeline
|
||||
# from modelscope.utils.constant import Tasks
|
||||
#
|
||||
# inference_16k_pipline = pipeline(
|
||||
# task=Tasks.auto_speech_recognition,
|
||||
# model='damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch')
|
||||
#
|
||||
# rec_result = inference_16k_pipline(audio_in='https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/asr_example_zh.wav')
|
||||
# print(rec_result)
|
||||
|
||||
@ -44,11 +44,10 @@ from funasr.utils import asr_utils, wav_utils, postprocess_utils
|
||||
from funasr.models.frontend.wav_frontend import WavFrontend
|
||||
from funasr.tasks.vad import VADTask
|
||||
from funasr.bin.vad_inference import Speech2VadSegment
|
||||
from funasr.utils.timestamp_tools import time_stamp_lfr6_pl
|
||||
from funasr.utils.timestamp_tools import time_stamp_sentence, ts_prediction_lfr6_standard
|
||||
from funasr.bin.punctuation_infer import Text2Punc
|
||||
from funasr.models.e2e_asr_paraformer import BiCifParaformer, ContextualParaformer
|
||||
|
||||
from funasr.utils.timestamp_tools import time_stamp_sentence
|
||||
|
||||
header_colors = '\033[95m'
|
||||
end_colors = '\033[0m'
|
||||
@ -59,7 +58,7 @@ class Speech2Text:
|
||||
|
||||
Examples:
|
||||
>>> import soundfile
|
||||
>>> speech2text = Speech2Text("asr_config.yml", "asr.pth")
|
||||
>>> speech2text = Speech2Text("asr_config.yml", "asr.pb")
|
||||
>>> audio, rate = soundfile.read("speech.wav")
|
||||
>>> speech2text(audio)
|
||||
[(text, token, token_int, hypothesis object), ...]
|
||||
@ -257,7 +256,7 @@ class Speech2Text:
|
||||
decoder_out, ys_pad_lens = decoder_outs[0], decoder_outs[1]
|
||||
|
||||
if isinstance(self.asr_model, BiCifParaformer):
|
||||
_, _, us_alphas, us_cif_peak = self.asr_model.calc_predictor_timestamp(enc, enc_len,
|
||||
_, _, us_alphas, us_peaks = self.asr_model.calc_predictor_timestamp(enc, enc_len,
|
||||
pre_token_length) # test no bias cif2
|
||||
|
||||
results = []
|
||||
@ -303,7 +302,10 @@ class Speech2Text:
|
||||
text = None
|
||||
|
||||
if isinstance(self.asr_model, BiCifParaformer):
|
||||
timestamp = time_stamp_lfr6_pl(us_alphas[i], us_cif_peak[i], copy.copy(token), begin_time, end_time)
|
||||
_, timestamp = ts_prediction_lfr6_standard(us_alphas[i],
|
||||
us_peaks[i],
|
||||
copy.copy(token),
|
||||
vad_offset=begin_time)
|
||||
results.append((text, token, token_int, timestamp, enc_len_batch_total, lfr_factor))
|
||||
else:
|
||||
results.append((text, token, token_int, enc_len_batch_total, lfr_factor))
|
||||
|
||||
@ -49,7 +49,7 @@ class Speech2Text:
|
||||
|
||||
Examples:
|
||||
>>> import soundfile
|
||||
>>> speech2text = Speech2Text("asr_config.yml", "asr.pth")
|
||||
>>> speech2text = Speech2Text("asr_config.yml", "asr.pb")
|
||||
>>> audio, rate = soundfile.read("speech.wav")
|
||||
>>> speech2text(audio)
|
||||
[(text, token, token_int, hypothesis object), ...]
|
||||
|
||||
@ -46,7 +46,7 @@ class Speech2Text:
|
||||
|
||||
Examples:
|
||||
>>> import soundfile
|
||||
>>> speech2text = Speech2Text("asr_config.yml", "asr.pth")
|
||||
>>> speech2text = Speech2Text("asr_config.yml", "asr.pb")
|
||||
>>> audio, rate = soundfile.read("speech.wav")
|
||||
>>> speech2text(audio)
|
||||
[(text, token, token_int, hypothesis object), ...]
|
||||
|
||||
@ -46,7 +46,7 @@ class Speech2Text:
|
||||
|
||||
Examples:
|
||||
>>> import soundfile
|
||||
>>> speech2text = Speech2Text("asr_config.yml", "asr.pth")
|
||||
>>> speech2text = Speech2Text("asr_config.yml", "asr.pb")
|
||||
>>> audio, rate = soundfile.read("speech.wav")
|
||||
>>> speech2text(audio)
|
||||
[(text, token, token_int, hypothesis object), ...]
|
||||
|
||||
@ -28,7 +28,9 @@ def parse_args(mode):
|
||||
elif mode == "uniasr":
|
||||
from funasr.tasks.asr import ASRTaskUniASR as ASRTask
|
||||
elif mode == "mfcca":
|
||||
from funasr.tasks.asr import ASRTaskMFCCA as ASRTask
|
||||
from funasr.tasks.asr import ASRTaskMFCCA as ASRTask
|
||||
elif mode == "tp":
|
||||
from funasr.tasks.asr import ASRTaskAligner as ASRTask
|
||||
else:
|
||||
raise ValueError("Unknown mode: {}".format(mode))
|
||||
parser = ASRTask.get_parser()
|
||||
|
||||
@ -133,7 +133,7 @@ def inference_launch(mode, **kwargs):
|
||||
param_dict = {
|
||||
"extract_profile": True,
|
||||
"sv_train_config": "sv.yaml",
|
||||
"sv_model_file": "sv.pth",
|
||||
"sv_model_file": "sv.pb",
|
||||
}
|
||||
if "param_dict" in kwargs and kwargs["param_dict"] is not None:
|
||||
for key in param_dict:
|
||||
@ -142,6 +142,9 @@ def inference_launch(mode, **kwargs):
|
||||
else:
|
||||
kwargs["param_dict"] = param_dict
|
||||
return inference_modelscope(mode=mode, **kwargs)
|
||||
elif mode == "eend-ola":
|
||||
from funasr.bin.eend_ola_inference import inference_modelscope
|
||||
return inference_modelscope(mode=mode, **kwargs)
|
||||
else:
|
||||
logging.info("Unknown decoding mode: {}".format(mode))
|
||||
return None
|
||||
|
||||
427
funasr/bin/eend_ola_inference.py
Executable file
@ -0,0 +1,427 @@
|
||||
#!/usr/bin/env python3
|
||||
# Copyright FunASR (https://github.com/alibaba-damo-academy/FunASR). All Rights Reserved.
|
||||
# MIT License (https://opensource.org/licenses/MIT)
|
||||
|
||||
import argparse
|
||||
import logging
|
||||
import os
|
||||
import sys
|
||||
from pathlib import Path
|
||||
from typing import Any
|
||||
from typing import List
|
||||
from typing import Optional
|
||||
from typing import Sequence
|
||||
from typing import Tuple
|
||||
from typing import Union
|
||||
|
||||
import numpy as np
|
||||
import torch
|
||||
from scipy.signal import medfilt
|
||||
from typeguard import check_argument_types
|
||||
|
||||
from funasr.models.frontend.wav_frontend import WavFrontendMel23
|
||||
from funasr.tasks.diar import EENDOLADiarTask
|
||||
from funasr.torch_utils.device_funcs import to_device
|
||||
from funasr.utils import config_argparse
|
||||
from funasr.utils.cli_utils import get_commandline_args
|
||||
from funasr.utils.types import str2bool
|
||||
from funasr.utils.types import str2triple_str
|
||||
from funasr.utils.types import str_or_none
|
||||
|
||||
|
||||
class Speech2Diarization:
|
||||
"""Speech2Diarlization class
|
||||
|
||||
Examples:
|
||||
>>> import soundfile
|
||||
>>> import numpy as np
|
||||
>>> speech2diar = Speech2Diarization("diar_sond_config.yml", "diar_sond.pb")
|
||||
>>> profile = np.load("profiles.npy")
|
||||
>>> audio, rate = soundfile.read("speech.wav")
|
||||
>>> speech2diar(audio, profile)
|
||||
{"spk1": [(int, int), ...], ...}
|
||||
|
||||
"""
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
diar_train_config: Union[Path, str] = None,
|
||||
diar_model_file: Union[Path, str] = None,
|
||||
device: str = "cpu",
|
||||
dtype: str = "float32",
|
||||
):
|
||||
assert check_argument_types()
|
||||
|
||||
# 1. Build Diarization model
|
||||
diar_model, diar_train_args = EENDOLADiarTask.build_model_from_file(
|
||||
config_file=diar_train_config,
|
||||
model_file=diar_model_file,
|
||||
device=device
|
||||
)
|
||||
frontend = None
|
||||
if diar_train_args.frontend is not None and diar_train_args.frontend_conf is not None:
|
||||
frontend = WavFrontendMel23(**diar_train_args.frontend_conf)
|
||||
|
||||
# set up seed for eda
|
||||
np.random.seed(diar_train_args.seed)
|
||||
torch.manual_seed(diar_train_args.seed)
|
||||
torch.cuda.manual_seed(diar_train_args.seed)
|
||||
os.environ['PYTORCH_SEED'] = str(diar_train_args.seed)
|
||||
logging.info("diar_model: {}".format(diar_model))
|
||||
logging.info("diar_train_args: {}".format(diar_train_args))
|
||||
diar_model.to(dtype=getattr(torch, dtype)).eval()
|
||||
|
||||
self.diar_model = diar_model
|
||||
self.diar_train_args = diar_train_args
|
||||
self.device = device
|
||||
self.dtype = dtype
|
||||
self.frontend = frontend
|
||||
|
||||
@torch.no_grad()
|
||||
def __call__(
|
||||
self,
|
||||
speech: Union[torch.Tensor, np.ndarray],
|
||||
speech_lengths: Union[torch.Tensor, np.ndarray] = None
|
||||
):
|
||||
"""Inference
|
||||
|
||||
Args:
|
||||
speech: Input speech data
|
||||
Returns:
|
||||
diarization results
|
||||
|
||||
"""
|
||||
assert check_argument_types()
|
||||
# Input as audio signal
|
||||
if isinstance(speech, np.ndarray):
|
||||
speech = torch.tensor(speech)
|
||||
|
||||
if self.frontend is not None:
|
||||
feats, feats_len = self.frontend.forward(speech, speech_lengths)
|
||||
feats = to_device(feats, device=self.device)
|
||||
feats_len = feats_len.int()
|
||||
self.diar_model.frontend = None
|
||||
else:
|
||||
feats = speech
|
||||
feats_len = speech_lengths
|
||||
batch = {"speech": feats, "speech_lengths": feats_len}
|
||||
batch = to_device(batch, device=self.device)
|
||||
results = self.diar_model.estimate_sequential(**batch)
|
||||
|
||||
return results
|
||||
|
||||
@staticmethod
|
||||
def from_pretrained(
|
||||
model_tag: Optional[str] = None,
|
||||
**kwargs: Optional[Any],
|
||||
):
|
||||
"""Build Speech2Diarization instance from the pretrained model.
|
||||
|
||||
Args:
|
||||
model_tag (Optional[str]): Model tag of the pretrained models.
|
||||
Currently, the tags of espnet_model_zoo are supported.
|
||||
|
||||
Returns:
|
||||
Speech2Diarization: Speech2Diarization instance.
|
||||
|
||||
"""
|
||||
if model_tag is not None:
|
||||
try:
|
||||
from espnet_model_zoo.downloader import ModelDownloader
|
||||
|
||||
except ImportError:
|
||||
logging.error(
|
||||
"`espnet_model_zoo` is not installed. "
|
||||
"Please install via `pip install -U espnet_model_zoo`."
|
||||
)
|
||||
raise
|
||||
d = ModelDownloader()
|
||||
kwargs.update(**d.download_and_unpack(model_tag))
|
||||
|
||||
return Speech2Diarization(**kwargs)
|
||||
|
||||
|
||||
def inference_modelscope(
|
||||
diar_train_config: str,
|
||||
diar_model_file: str,
|
||||
output_dir: Optional[str] = None,
|
||||
batch_size: int = 1,
|
||||
dtype: str = "float32",
|
||||
ngpu: int = 1,
|
||||
num_workers: int = 0,
|
||||
log_level: Union[int, str] = "INFO",
|
||||
key_file: Optional[str] = None,
|
||||
model_tag: Optional[str] = None,
|
||||
allow_variable_data_keys: bool = True,
|
||||
streaming: bool = False,
|
||||
param_dict: Optional[dict] = None,
|
||||
**kwargs,
|
||||
):
|
||||
assert check_argument_types()
|
||||
if batch_size > 1:
|
||||
raise NotImplementedError("batch decoding is not implemented")
|
||||
if ngpu > 1:
|
||||
raise NotImplementedError("only single GPU decoding is supported")
|
||||
|
||||
logging.basicConfig(
|
||||
level=log_level,
|
||||
format="%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s",
|
||||
)
|
||||
logging.info("param_dict: {}".format(param_dict))
|
||||
|
||||
if ngpu >= 1 and torch.cuda.is_available():
|
||||
device = "cuda"
|
||||
else:
|
||||
device = "cpu"
|
||||
|
||||
# 1. Build speech2diar
|
||||
speech2diar_kwargs = dict(
|
||||
diar_train_config=diar_train_config,
|
||||
diar_model_file=diar_model_file,
|
||||
device=device,
|
||||
dtype=dtype,
|
||||
)
|
||||
logging.info("speech2diarization_kwargs: {}".format(speech2diar_kwargs))
|
||||
speech2diar = Speech2Diarization.from_pretrained(
|
||||
model_tag=model_tag,
|
||||
**speech2diar_kwargs,
|
||||
)
|
||||
speech2diar.diar_model.eval()
|
||||
|
||||
def output_results_str(results: dict, uttid: str):
|
||||
rst = []
|
||||
mid = uttid.rsplit("-", 1)[0]
|
||||
for key in results:
|
||||
results[key] = [(x[0] / 100, x[1] / 100) for x in results[key]]
|
||||
template = "SPEAKER {} 0 {:.2f} {:.2f} <NA> <NA> {} <NA> <NA>"
|
||||
for spk, segs in results.items():
|
||||
rst.extend([template.format(mid, st, ed, spk) for st, ed in segs])
|
||||
|
||||
return "\n".join(rst)
|
||||
|
||||
def _forward(
|
||||
data_path_and_name_and_type: Sequence[Tuple[str, str, str]] = None,
|
||||
raw_inputs: List[List[Union[np.ndarray, torch.Tensor, str, bytes]]] = None,
|
||||
output_dir_v2: Optional[str] = None,
|
||||
param_dict: Optional[dict] = None,
|
||||
):
|
||||
# 2. Build data-iterator
|
||||
if data_path_and_name_and_type is None and raw_inputs is not None:
|
||||
if isinstance(raw_inputs, torch.Tensor):
|
||||
raw_inputs = raw_inputs.numpy()
|
||||
data_path_and_name_and_type = [raw_inputs[0], "speech", "sound"]
|
||||
loader = EENDOLADiarTask.build_streaming_iterator(
|
||||
data_path_and_name_and_type,
|
||||
dtype=dtype,
|
||||
batch_size=batch_size,
|
||||
key_file=key_file,
|
||||
num_workers=num_workers,
|
||||
preprocess_fn=EENDOLADiarTask.build_preprocess_fn(speech2diar.diar_train_args, False),
|
||||
collate_fn=EENDOLADiarTask.build_collate_fn(speech2diar.diar_train_args, False),
|
||||
allow_variable_data_keys=allow_variable_data_keys,
|
||||
inference=True,
|
||||
)
|
||||
|
||||
# 3. Start for-loop
|
||||
output_path = output_dir_v2 if output_dir_v2 is not None else output_dir
|
||||
if output_path is not None:
|
||||
os.makedirs(output_path, exist_ok=True)
|
||||
output_writer = open("{}/result.txt".format(output_path), "w")
|
||||
result_list = []
|
||||
for keys, batch in loader:
|
||||
assert isinstance(batch, dict), type(batch)
|
||||
assert all(isinstance(s, str) for s in keys), keys
|
||||
_bs = len(next(iter(batch.values())))
|
||||
assert len(keys) == _bs, f"{len(keys)} != {_bs}"
|
||||
# batch = {k: v[0] for k, v in batch.items() if not k.endswith("_lengths")}
|
||||
|
||||
results = speech2diar(**batch)
|
||||
|
||||
# post process
|
||||
a = results[0][0].cpu().numpy()
|
||||
a = medfilt(a, (11, 1))
|
||||
rst = []
|
||||
for spkid, frames in enumerate(a.T):
|
||||
frames = np.pad(frames, (1, 1), 'constant')
|
||||
changes, = np.where(np.diff(frames, axis=0) != 0)
|
||||
fmt = "SPEAKER {:s} 1 {:7.2f} {:7.2f} <NA> <NA> {:s} <NA>"
|
||||
for s, e in zip(changes[::2], changes[1::2]):
|
||||
st = s / 10.
|
||||
dur = (e - s) / 10.
|
||||
rst.append(fmt.format(keys[0], st, dur, "{}_{}".format(keys[0], str(spkid))))
|
||||
|
||||
# Only supporting batch_size==1
|
||||
value = "\n".join(rst)
|
||||
item = {"key": keys[0], "value": value}
|
||||
result_list.append(item)
|
||||
if output_path is not None:
|
||||
output_writer.write(value)
|
||||
output_writer.flush()
|
||||
|
||||
if output_path is not None:
|
||||
output_writer.close()
|
||||
|
||||
return result_list
|
||||
|
||||
return _forward
|
||||
|
||||
|
||||
def inference(
|
||||
data_path_and_name_and_type: Sequence[Tuple[str, str, str]],
|
||||
diar_train_config: Optional[str],
|
||||
diar_model_file: Optional[str],
|
||||
output_dir: Optional[str] = None,
|
||||
batch_size: int = 1,
|
||||
dtype: str = "float32",
|
||||
ngpu: int = 0,
|
||||
seed: int = 0,
|
||||
num_workers: int = 1,
|
||||
log_level: Union[int, str] = "INFO",
|
||||
key_file: Optional[str] = None,
|
||||
model_tag: Optional[str] = None,
|
||||
allow_variable_data_keys: bool = True,
|
||||
streaming: bool = False,
|
||||
smooth_size: int = 83,
|
||||
dur_threshold: int = 10,
|
||||
out_format: str = "vad",
|
||||
**kwargs,
|
||||
):
|
||||
inference_pipeline = inference_modelscope(
|
||||
diar_train_config=diar_train_config,
|
||||
diar_model_file=diar_model_file,
|
||||
output_dir=output_dir,
|
||||
batch_size=batch_size,
|
||||
dtype=dtype,
|
||||
ngpu=ngpu,
|
||||
seed=seed,
|
||||
num_workers=num_workers,
|
||||
log_level=log_level,
|
||||
key_file=key_file,
|
||||
model_tag=model_tag,
|
||||
allow_variable_data_keys=allow_variable_data_keys,
|
||||
streaming=streaming,
|
||||
smooth_size=smooth_size,
|
||||
dur_threshold=dur_threshold,
|
||||
out_format=out_format,
|
||||
**kwargs,
|
||||
)
|
||||
|
||||
return inference_pipeline(data_path_and_name_and_type, raw_inputs=None)
|
||||
|
||||
|
||||
def get_parser():
|
||||
parser = config_argparse.ArgumentParser(
|
||||
description="Speaker verification/x-vector extraction",
|
||||
formatter_class=argparse.ArgumentDefaultsHelpFormatter,
|
||||
)
|
||||
|
||||
# Note(kamo): Use '_' instead of '-' as separator.
|
||||
# '-' is confusing if written in yaml.
|
||||
parser.add_argument(
|
||||
"--log_level",
|
||||
type=lambda x: x.upper(),
|
||||
default="INFO",
|
||||
choices=("CRITICAL", "ERROR", "WARNING", "INFO", "DEBUG", "NOTSET"),
|
||||
help="The verbose level of logging",
|
||||
)
|
||||
|
||||
parser.add_argument("--output_dir", type=str, required=False)
|
||||
parser.add_argument(
|
||||
"--ngpu",
|
||||
type=int,
|
||||
default=0,
|
||||
help="The number of gpus. 0 indicates CPU mode",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--gpuid_list",
|
||||
type=str,
|
||||
default="",
|
||||
help="The visible gpus",
|
||||
)
|
||||
parser.add_argument("--seed", type=int, default=0, help="Random seed")
|
||||
parser.add_argument(
|
||||
"--dtype",
|
||||
default="float32",
|
||||
choices=["float16", "float32", "float64"],
|
||||
help="Data type",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--num_workers",
|
||||
type=int,
|
||||
default=1,
|
||||
help="The number of workers used for DataLoader",
|
||||
)
|
||||
|
||||
group = parser.add_argument_group("Input data related")
|
||||
group.add_argument(
|
||||
"--data_path_and_name_and_type",
|
||||
type=str2triple_str,
|
||||
required=False,
|
||||
action="append",
|
||||
)
|
||||
group.add_argument("--key_file", type=str_or_none)
|
||||
group.add_argument("--allow_variable_data_keys", type=str2bool, default=False)
|
||||
|
||||
group = parser.add_argument_group("The model configuration related")
|
||||
group.add_argument(
|
||||
"--diar_train_config",
|
||||
type=str,
|
||||
help="diarization training configuration",
|
||||
)
|
||||
group.add_argument(
|
||||
"--diar_model_file",
|
||||
type=str,
|
||||
help="diarization model parameter file",
|
||||
)
|
||||
group.add_argument(
|
||||
"--dur_threshold",
|
||||
type=int,
|
||||
default=10,
|
||||
help="The threshold for short segments in number frames"
|
||||
)
|
||||
parser.add_argument(
|
||||
"--smooth_size",
|
||||
type=int,
|
||||
default=83,
|
||||
help="The smoothing window length in number frames"
|
||||
)
|
||||
group.add_argument(
|
||||
"--model_tag",
|
||||
type=str,
|
||||
help="Pretrained model tag. If specify this option, *_train_config and "
|
||||
"*_file will be overwritten",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--batch_size",
|
||||
type=int,
|
||||
default=1,
|
||||
help="The batch size for inference",
|
||||
)
|
||||
parser.add_argument("--streaming", type=str2bool, default=False)
|
||||
|
||||
return parser
|
||||
|
||||
|
||||
def main(cmd=None):
|
||||
print(get_commandline_args(), file=sys.stderr)
|
||||
parser = get_parser()
|
||||
args = parser.parse_args(cmd)
|
||||
kwargs = vars(args)
|
||||
kwargs.pop("config", None)
|
||||
logging.info("args: {}".format(kwargs))
|
||||
if args.output_dir is None:
|
||||
jobid, n_gpu = 1, 1
|
||||
gpuid = args.gpuid_list.split(",")[jobid - 1]
|
||||
else:
|
||||
jobid = int(args.output_dir.split(".")[-1])
|
||||
n_gpu = len(args.gpuid_list.split(","))
|
||||
gpuid = args.gpuid_list.split(",")[(jobid - 1) % n_gpu]
|
||||
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
|
||||
os.environ["CUDA_VISIBLE_DEVICES"] = gpuid
|
||||
results_list = inference(**kwargs)
|
||||
for results in results_list:
|
||||
print("{} {}".format(results["key"], results["value"]))
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
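# A rough command-line sketch (paths are placeholders, not from the original source):
# python -m funasr.bin.eend_ola_inference \
#     --diar_train_config exp/diar/config.yaml \
#     --diar_model_file exp/diar/model.pb \
#     --data_path_and_name_and_type data/wav.scp,speech,sound \
#     --output_dir exp/diar/decode.1
# RTTM-style lines are written to <output_dir>/result.txt and echoed to stdout.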
|
||||
@ -75,6 +75,9 @@ def inference_launch(mode, **kwargs):
|
||||
if mode == "punc":
|
||||
from funasr.bin.punctuation_infer import inference_modelscope
|
||||
return inference_modelscope(**kwargs)
|
||||
if mode == "punc_VadRealtime":
|
||||
from funasr.bin.punctuation_infer_vadrealtime import inference_modelscope
|
||||
return inference_modelscope(**kwargs)
|
||||
else:
|
||||
logging.info("Unknown decoding mode: {}".format(mode))
|
||||
return None
|
||||
|
||||
335
funasr/bin/punctuation_infer_vadrealtime.py
Normal file
@ -0,0 +1,335 @@
|
||||
#!/usr/bin/env python3
|
||||
import argparse
|
||||
import logging
|
||||
from pathlib import Path
|
||||
import sys
|
||||
from typing import Optional
|
||||
from typing import Sequence
|
||||
from typing import Tuple
|
||||
from typing import Union
|
||||
from typing import Any
|
||||
from typing import List
|
||||
|
||||
import numpy as np
|
||||
import torch
|
||||
from typeguard import check_argument_types
|
||||
|
||||
from funasr.datasets.preprocessor import CodeMixTokenizerCommonPreprocessor
|
||||
from funasr.utils.cli_utils import get_commandline_args
|
||||
from funasr.tasks.punctuation import PunctuationTask
|
||||
from funasr.torch_utils.device_funcs import to_device
|
||||
from funasr.torch_utils.forward_adaptor import ForwardAdaptor
|
||||
from funasr.torch_utils.set_all_random_seed import set_all_random_seed
|
||||
from funasr.utils import config_argparse
|
||||
from funasr.utils.types import str2triple_str
|
||||
from funasr.utils.types import str_or_none
|
||||
from funasr.punctuation.text_preprocessor import split_to_mini_sentence
|
||||
|
||||
|
||||
class Text2Punc:
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
train_config: Optional[str],
|
||||
model_file: Optional[str],
|
||||
device: str = "cpu",
|
||||
dtype: str = "float32",
|
||||
):
|
||||
# Build Model
|
||||
model, train_args = PunctuationTask.build_model_from_file(train_config, model_file, device)
|
||||
self.device = device
|
||||
# Wrap the model so that model.inference() can be called through ForwardAdaptor
|
||||
self.wrapped_model = ForwardAdaptor(model, "inference")
|
||||
self.wrapped_model.to(dtype=getattr(torch, dtype)).to(device=device).eval()
|
||||
# logging.info(f"Model:\n{model}")
|
||||
self.punc_list = train_args.punc_list
|
||||
self.period = 0
|
||||
for i in range(len(self.punc_list)):
|
||||
if self.punc_list[i] == ",":
|
||||
self.punc_list[i] = ","
|
||||
elif self.punc_list[i] == "?":
|
||||
self.punc_list[i] = "?"
|
||||
elif self.punc_list[i] == "。":
|
||||
self.period = i
|
||||
self.preprocessor = CodeMixTokenizerCommonPreprocessor(
|
||||
train=False,
|
||||
token_type=train_args.token_type,
|
||||
token_list=train_args.token_list,
|
||||
bpemodel=train_args.bpemodel,
|
||||
text_cleaner=train_args.cleaner,
|
||||
g2p_type=train_args.g2p,
|
||||
text_name="text",
|
||||
non_linguistic_symbols=train_args.non_linguistic_symbols,
|
||||
)
|
||||
print("start decoding!!!")
|
||||
|
||||
@torch.no_grad()
|
||||
def __call__(self, text: Union[list, str], cache: list, split_size=20):
|
||||
if cache is not None and len(cache) > 0:
|
||||
precache = "".join(cache)
|
||||
else:
|
||||
precache = ""
|
||||
data = {"text": precache + text}
|
||||
result = self.preprocessor(data=data, uid="12938712838719")
|
||||
split_text = self.preprocessor.pop_split_text_data(result)
|
||||
mini_sentences = split_to_mini_sentence(split_text, split_size)
|
||||
mini_sentences_id = split_to_mini_sentence(data["text"], split_size)
|
||||
assert len(mini_sentences) == len(mini_sentences_id)
|
||||
cache_sent = []
|
||||
cache_sent_id = torch.from_numpy(np.array([], dtype='int32'))
|
||||
sentence_punc_list = []
|
||||
sentence_words_list = []
|
||||
cache_pop_trigger_limit = 200
|
||||
skip_num = 0
|
||||
for mini_sentence_i in range(len(mini_sentences)):
|
||||
mini_sentence = mini_sentences[mini_sentence_i]
|
||||
mini_sentence_id = mini_sentences_id[mini_sentence_i]
|
||||
mini_sentence = cache_sent + mini_sentence
|
||||
mini_sentence_id = np.concatenate((cache_sent_id, mini_sentence_id), axis=0)
|
||||
data = {
|
||||
"text": torch.unsqueeze(torch.from_numpy(mini_sentence_id), 0),
|
||||
"text_lengths": torch.from_numpy(np.array([len(mini_sentence_id)], dtype='int32')),
|
||||
"vad_indexes": torch.from_numpy(np.array([len(cache)-1], dtype='int32')),
|
||||
}
|
||||
data = to_device(data, self.device)
|
||||
y, _ = self.wrapped_model(**data)
|
||||
_, indices = y.view(-1, y.shape[-1]).topk(1, dim=1)
|
||||
punctuations = indices
|
||||
if indices.size()[0] != 1:
|
||||
punctuations = torch.squeeze(indices)
|
||||
assert punctuations.size()[0] == len(mini_sentence)
|
||||
|
||||
# Search for the last Period/QuestionMark as cache
|
||||
if mini_sentence_i < len(mini_sentences) - 1:
|
||||
sentenceEnd = -1
|
||||
last_comma_index = -1
|
||||
for i in range(len(punctuations) - 2, 1, -1):
|
||||
if self.punc_list[punctuations[i]] == "。" or self.punc_list[punctuations[i]] == "?":
|
||||
sentenceEnd = i
|
||||
break
|
||||
if last_comma_index < 0 and self.punc_list[punctuations[i]] == ",":
|
||||
last_comma_index = i
|
||||
|
||||
if sentenceEnd < 0 and len(mini_sentence) > cache_pop_trigger_limit and last_comma_index >= 0:
|
||||
# The sentence is too long; cut it off at a comma.
|
||||
sentenceEnd = last_comma_index
|
||||
punctuations[sentenceEnd] = self.period
|
||||
cache_sent = mini_sentence[sentenceEnd + 1:]
|
||||
cache_sent_id = mini_sentence_id[sentenceEnd + 1:]
|
||||
mini_sentence = mini_sentence[0:sentenceEnd + 1]
|
||||
punctuations = punctuations[0:sentenceEnd + 1]
|
||||
|
||||
punctuations_np = punctuations.cpu().numpy()
|
||||
sentence_punc_list += [self.punc_list[int(x)] for x in punctuations_np]
|
||||
sentence_words_list += mini_sentence
|
||||
|
||||
assert len(sentence_punc_list) == len(sentence_words_list)
|
||||
words_with_punc = []
|
||||
sentence_punc_list_out = []
|
||||
for i in range(0, len(sentence_words_list)):
|
||||
if i > 0:
|
||||
if len(sentence_words_list[i][0].encode()) == 1 and len(sentence_words_list[i - 1][-1].encode()) == 1:
|
||||
sentence_words_list[i] = " " + sentence_words_list[i]
|
||||
if skip_num < len(cache):
|
||||
skip_num += 1
|
||||
else:
|
||||
words_with_punc.append(sentence_words_list[i])
|
||||
if skip_num >= len(cache):
|
||||
sentence_punc_list_out.append(sentence_punc_list[i])
|
||||
if sentence_punc_list[i] != "_":
|
||||
words_with_punc.append(sentence_punc_list[i])
|
||||
sentence_out = "".join(words_with_punc)
|
||||
|
||||
sentenceEnd = -1
|
||||
for i in range(len(sentence_punc_list) - 2, 1, -1):
|
||||
if sentence_punc_list[i] == "。" or sentence_punc_list[i] == "?":
|
||||
sentenceEnd = i
|
||||
break
|
||||
cache_out = sentence_words_list[sentenceEnd + 1 :]
|
||||
if sentence_out[-1] in self.punc_list:
|
||||
sentence_out = sentence_out[:-1]
|
||||
sentence_punc_list_out[-1] = "_"
|
||||
return sentence_out, sentence_punc_list_out, cache_out
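# Illustrative streaming usage (file names are assumptions, not from the original source):
# text2punc = Text2Punc("punc_config.yaml", "punc.pb", device="cpu")
# cache = []
# for vad_segment in ["first segment text", "second segment text"]:
#     punctuated, _, cache = text2punc(vad_segment, cache)
# Each call punctuates the cached tail of the previous call together with the
# new segment, and returns the words after the last sentence end as the new cache.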
|
||||
|
||||
|
||||
def inference(
|
||||
batch_size: int,
|
||||
dtype: str,
|
||||
ngpu: int,
|
||||
seed: int,
|
||||
num_workers: int,
|
||||
output_dir: str,
|
||||
log_level: Union[int, str],
|
||||
train_config: Optional[str],
|
||||
model_file: Optional[str],
|
||||
key_file: Optional[str] = None,
|
||||
data_path_and_name_and_type: Sequence[Tuple[str, str, str]] = None,
|
||||
raw_inputs: Union[List[Any], bytes, str] = None,
|
||||
cache: List[Any] = None,
|
||||
param_dict: dict = None,
|
||||
**kwargs,
|
||||
):
|
||||
inference_pipeline = inference_modelscope(
|
||||
output_dir=output_dir,
|
||||
batch_size=batch_size,
|
||||
dtype=dtype,
|
||||
ngpu=ngpu,
|
||||
seed=seed,
|
||||
num_workers=num_workers,
|
||||
log_level=log_level,
|
||||
key_file=key_file,
|
||||
train_config=train_config,
|
||||
model_file=model_file,
|
||||
param_dict=param_dict,
|
||||
**kwargs,
|
||||
)
|
||||
return inference_pipeline(data_path_and_name_and_type, raw_inputs, cache)
|
||||
|
||||
|
||||
def inference_modelscope(
|
||||
batch_size: int,
|
||||
dtype: str,
|
||||
ngpu: int,
|
||||
seed: int,
|
||||
num_workers: int,
|
||||
log_level: Union[int, str],
|
||||
#cache: list,
|
||||
key_file: Optional[str],
|
||||
train_config: Optional[str],
|
||||
model_file: Optional[str],
|
||||
output_dir: Optional[str] = None,
|
||||
param_dict: dict = None,
|
||||
**kwargs,
|
||||
):
|
||||
assert check_argument_types()
|
||||
logging.basicConfig(
|
||||
level=log_level,
|
||||
format="%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s",
|
||||
)
|
||||
|
||||
if ngpu >= 1 and torch.cuda.is_available():
|
||||
device = "cuda"
|
||||
else:
|
||||
device = "cpu"
|
||||
|
||||
# 1. Set random-seed
|
||||
set_all_random_seed(seed)
|
||||
text2punc = Text2Punc(train_config, model_file, device)
|
||||
|
||||
def _forward(
|
||||
data_path_and_name_and_type,
|
||||
raw_inputs: Union[List[Any], bytes, str] = None,
|
||||
output_dir_v2: Optional[str] = None,
|
||||
cache: List[Any] = None,
|
||||
param_dict: dict = None,
|
||||
):
|
||||
results = []
|
||||
split_size = 10
|
||||
|
||||
if raw_inputs is not None:
|
||||
line = raw_inputs.strip()
|
||||
key = "demo"
|
||||
if line == "":
|
||||
item = {'key': key, 'value': ""}
|
||||
results.append(item)
|
||||
return results
|
||||
#import pdb;pdb.set_trace()
|
||||
result, _, cache = text2punc(line, cache)
|
||||
item = {'key': key, 'value': result, 'cache': cache}
|
||||
results.append(item)
|
||||
return results
|
||||
|
||||
for inference_text, _, _ in data_path_and_name_and_type:
|
||||
with open(inference_text, "r", encoding="utf-8") as fin:
|
||||
for line in fin:
|
||||
line = line.strip()
|
||||
segs = line.split("\t")
|
||||
if len(segs) != 2:
|
||||
continue
|
||||
key = segs[0]
|
||||
if len(segs[1]) == 0:
|
||||
continue
|
||||
result, _ = text2punc(segs[1])
|
||||
item = {'key': key, 'value': result}
|
||||
results.append(item)
|
||||
output_path = output_dir_v2 if output_dir_v2 is not None else output_dir
|
||||
if output_path is not None:
|
||||
output_file_name = "infer.out"
|
||||
Path(output_path).mkdir(parents=True, exist_ok=True)
|
||||
output_file_path = (Path(output_path) / output_file_name).absolute()
|
||||
with open(output_file_path, "w", encoding="utf-8") as fout:
|
||||
for item_i in results:
|
||||
key_out = item_i["key"]
|
||||
value_out = item_i["value"]
|
||||
fout.write(f"{key_out}\t{value_out}\n")
|
||||
return results
|
||||
|
||||
return _forward
|
||||
|
||||
|
||||
def get_parser():
|
||||
parser = config_argparse.ArgumentParser(
|
||||
description="Punctuation inference",
|
||||
formatter_class=argparse.ArgumentDefaultsHelpFormatter,
|
||||
)
|
||||
|
||||
parser.add_argument(
|
||||
"--log_level",
|
||||
type=lambda x: x.upper(),
|
||||
default="INFO",
|
||||
choices=("CRITICAL", "ERROR", "WARNING", "INFO", "DEBUG", "NOTSET"),
|
||||
help="The verbose level of logging",
|
||||
)
|
||||
|
||||
parser.add_argument("--output_dir", type=str, required=False)
|
||||
parser.add_argument(
|
||||
"--ngpu",
|
||||
type=int,
|
||||
default=0,
|
||||
help="The number of gpus. 0 indicates CPU mode",
|
||||
)
|
||||
parser.add_argument("--seed", type=int, default=0, help="Random seed")
|
||||
parser.add_argument(
|
||||
"--dtype",
|
||||
default="float32",
|
||||
choices=["float16", "float32", "float64"],
|
||||
help="Data type",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--num_workers",
|
||||
type=int,
|
||||
default=1,
|
||||
help="The number of workers used for DataLoader",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--batch_size",
|
||||
type=int,
|
||||
default=1,
|
||||
help="The batch size for inference",
|
||||
)
|
||||
|
||||
group = parser.add_argument_group("Input data related")
|
||||
group.add_argument("--data_path_and_name_and_type", type=str2triple_str, action="append", required=False)
|
||||
group.add_argument("--raw_inputs", type=str, required=False)
|
||||
group.add_argument("--cache", type=list, required=False)
|
||||
group.add_argument("--param_dict", type=dict, required=False)
|
||||
group.add_argument("--key_file", type=str_or_none)
|
||||
|
||||
group = parser.add_argument_group("The model configuration related")
|
||||
group.add_argument("--train_config", type=str)
|
||||
group.add_argument("--model_file", type=str)
|
||||
|
||||
return parser
|
||||
|
||||
|
||||
def main(cmd=None):
|
||||
print(get_commandline_args(), file=sys.stderr)
|
||||
parser = get_parser()
|
||||
args = parser.parse_args(cmd)
|
||||
kwargs = vars(args)
|
||||
# kwargs.pop("config", None)
|
||||
inference(**kwargs)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
@ -42,7 +42,7 @@ class Speech2Diarization:
|
||||
Examples:
|
||||
>>> import soundfile
|
||||
>>> import numpy as np
|
||||
>>> speech2diar = Speech2Diarization("diar_sond_config.yml", "diar_sond.pth")
|
||||
>>> speech2diar = Speech2Diarization("diar_sond_config.yml", "diar_sond.pb")
|
||||
>>> profile = np.load("profiles.npy")
|
||||
>>> audio, rate = soundfile.read("speech.wav")
|
||||
>>> speech2diar(audio, profile)
|
||||
@ -54,7 +54,7 @@ class Speech2Diarization:
|
||||
self,
|
||||
diar_train_config: Union[Path, str] = None,
|
||||
diar_model_file: Union[Path, str] = None,
|
||||
device: str = "cpu",
|
||||
device: Union[str, torch.device] = "cpu",
|
||||
batch_size: int = 1,
|
||||
dtype: str = "float32",
|
||||
streaming: bool = False,
|
||||
@ -114,9 +114,19 @@ class Speech2Diarization:
|
||||
# little-endian order: lower bit first
|
||||
return (np.array(list(b)[::-1]) == '1').astype(dtype)
|
||||
|
||||
return np.row_stack([int2vec(int(x), vec_dim) for x in seq])
|
||||
# process oov
|
||||
seq = np.array([int(x) for x in seq])
|
||||
new_seq = []
|
||||
for i, x in enumerate(seq):
|
||||
if x < 2 ** vec_dim:
|
||||
new_seq.append(x)
|
||||
else:
|
||||
idx_list = np.where(seq < 2 ** vec_dim)[0]
|
||||
idx = np.abs(idx_list - i).argmin()
|
||||
new_seq.append(seq[idx_list[idx]])
|
||||
return np.row_stack([int2vec(x, vec_dim) for x in new_seq])
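# Illustrative example (not from the original source): with vec_dim=4, the pse
# label "6" (binary 0110) decodes lower-bit-first to [0, 1, 1, 0], i.e.
# speakers 2 and 3 are active in that frame; labels >= 2**vec_dim are treated
# as out-of-vocabulary and replaced by the nearest in-vocabulary label first.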
|
||||
|
||||
def post_processing(self, raw_logits: torch.Tensor, spk_num: int):
|
||||
def post_processing(self, raw_logits: torch.Tensor, spk_num: int, output_format: str = "speaker_turn"):
|
||||
logits_idx = raw_logits.argmax(-1) # B, T, vocab_size -> B, T
|
||||
# upsampling outputs to match inputs
|
||||
ut = logits_idx.shape[1] * self.diar_model.encoder.time_ds_ratio
|
||||
@ -127,8 +137,14 @@ class Speech2Diarization:
|
||||
).squeeze(1).long()
|
||||
logits_idx = logits_idx[0].tolist()
|
||||
pse_labels = [self.token_list[x] for x in logits_idx]
|
||||
if output_format == "pse_labels":
|
||||
return pse_labels, None
|
||||
|
||||
multi_labels = self.seq2arr(pse_labels, spk_num)[:, :spk_num] # remove padding speakers
|
||||
multi_labels = self.smooth_multi_labels(multi_labels)
|
||||
if output_format == "binary_labels":
|
||||
return multi_labels, None
|
||||
|
||||
spk_list = ["spk{}".format(i + 1) for i in range(spk_num)]
|
||||
spk_turns = self.calc_spk_turns(multi_labels, spk_list)
|
||||
results = OrderedDict()
|
||||
@ -149,6 +165,7 @@ class Speech2Diarization:
|
||||
self,
|
||||
speech: Union[torch.Tensor, np.ndarray],
|
||||
profile: Union[torch.Tensor, np.ndarray],
|
||||
output_format: str = "speaker_turn"
|
||||
):
|
||||
"""Inference
|
||||
|
||||
@ -178,7 +195,7 @@ class Speech2Diarization:
|
||||
batch = to_device(batch, device=self.device)
|
||||
|
||||
logits = self.diar_model.prediction_forward(**batch)
|
||||
results, pse_labels = self.post_processing(logits, profile.shape[1])
|
||||
results, pse_labels = self.post_processing(logits, profile.shape[1], output_format)
|
||||
|
||||
return results, pse_labels
|
||||
|
||||
@ -367,7 +384,7 @@ def inference_modelscope(
|
||||
pse_label_writer = open("{}/labels.txt".format(output_path), "w")
|
||||
logging.info("Start to diarize...")
|
||||
result_list = []
|
||||
for keys, batch in loader:
|
||||
for idx, (keys, batch) in enumerate(loader):
|
||||
assert isinstance(batch, dict), type(batch)
|
||||
assert all(isinstance(s, str) for s in keys), keys
|
||||
_bs = len(next(iter(batch.values())))
|
||||
@ -385,6 +402,9 @@ def inference_modelscope(
|
||||
pse_label_writer.write("{} {}\n".format(key, " ".join(pse_labels)))
|
||||
pse_label_writer.flush()
|
||||
|
||||
if idx % 100 == 0:
|
||||
logging.info("Processing {:5d}: {}".format(idx, key))
|
||||
|
||||
if output_path is not None:
|
||||
output_writer.close()
|
||||
pse_label_writer.close()
|
||||
|
||||
@ -36,7 +36,7 @@ class Speech2Xvector:
|
||||
|
||||
Examples:
|
||||
>>> import soundfile
|
||||
>>> speech2xvector = Speech2Xvector("sv_config.yml", "sv.pth")
|
||||
>>> speech2xvector = Speech2Xvector("sv_config.yml", "sv.pb")
|
||||
>>> audio, rate = soundfile.read("speech.wav")
|
||||
>>> speech2xvector(audio)
|
||||
[(text, token, token_int, hypothesis object), ...]
|
||||
@ -169,7 +169,7 @@ def inference_modelscope(
|
||||
log_level: Union[int, str] = "INFO",
|
||||
key_file: Optional[str] = None,
|
||||
sv_train_config: Optional[str] = "sv.yaml",
|
||||
sv_model_file: Optional[str] = "sv.pth",
|
||||
sv_model_file: Optional[str] = "sv.pb",
|
||||
model_tag: Optional[str] = None,
|
||||
allow_variable_data_keys: bool = True,
|
||||
streaming: bool = False,
|
||||
|
||||
379
funasr/bin/tp_inference.py
Normal file
@ -0,0 +1,379 @@
|
||||
import argparse
|
||||
import logging
|
||||
|
||||
import sys
|
||||
import json
|
||||
from pathlib import Path
|
||||
from typing import Any
|
||||
from typing import List
|
||||
from typing import Optional
|
||||
from typing import Sequence
|
||||
from typing import Tuple
|
||||
from typing import Union
|
||||
from typing import Dict
|
||||
|
||||
import numpy as np
|
||||
import torch
|
||||
from typeguard import check_argument_types
|
||||
|
||||
from funasr.fileio.datadir_writer import DatadirWriter
|
||||
from funasr.datasets.preprocessor import LMPreprocessor
|
||||
from funasr.tasks.asr import ASRTaskAligner as ASRTask
|
||||
from funasr.torch_utils.device_funcs import to_device
|
||||
from funasr.torch_utils.set_all_random_seed import set_all_random_seed
|
||||
from funasr.utils import config_argparse
|
||||
from funasr.utils.cli_utils import get_commandline_args
|
||||
from funasr.utils.types import str2bool
|
||||
from funasr.utils.types import str2triple_str
|
||||
from funasr.utils.types import str_or_none
|
||||
from funasr.models.frontend.wav_frontend import WavFrontend
|
||||
from funasr.text.token_id_converter import TokenIDConverter
|
||||
from funasr.utils.timestamp_tools import ts_prediction_lfr6_standard
|
||||
|
||||
|
||||
header_colors = '\033[95m'
|
||||
end_colors = '\033[0m'
|
||||
|
||||
global_asr_language: str = 'zh-cn'
|
||||
global_sample_rate: Union[int, Dict[Any, int]] = {
|
||||
'audio_fs': 16000,
|
||||
'model_fs': 16000
|
||||
}
|
||||
|
||||
|
||||
class SpeechText2Timestamp:
|
||||
def __init__(
|
||||
self,
|
||||
timestamp_infer_config: Union[Path, str] = None,
|
||||
timestamp_model_file: Union[Path, str] = None,
|
||||
timestamp_cmvn_file: Union[Path, str] = None,
|
||||
device: str = "cpu",
|
||||
dtype: str = "float32",
|
||||
**kwargs,
|
||||
):
|
||||
assert check_argument_types()
|
||||
# 1. Build ASR model
|
||||
tp_model, tp_train_args = ASRTask.build_model_from_file(
|
||||
timestamp_infer_config, timestamp_model_file, device
|
||||
)
|
||||
if 'cuda' in device:
|
||||
tp_model = tp_model.cuda() # force model to cuda
|
||||
|
||||
frontend = None
|
||||
if tp_train_args.frontend is not None:
|
||||
frontend = WavFrontend(cmvn_file=timestamp_cmvn_file, **tp_train_args.frontend_conf)
|
||||
|
||||
logging.info("tp_model: {}".format(tp_model))
|
||||
logging.info("tp_train_args: {}".format(tp_train_args))
|
||||
tp_model.to(dtype=getattr(torch, dtype)).eval()
|
||||
|
||||
logging.info(f"Decoding device={device}, dtype={dtype}")
|
||||
|
||||
|
||||
self.tp_model = tp_model
|
||||
self.tp_train_args = tp_train_args
|
||||
|
||||
token_list = self.tp_model.token_list
|
||||
self.converter = TokenIDConverter(token_list=token_list)
|
||||
|
||||
self.device = device
|
||||
self.dtype = dtype
|
||||
self.frontend = frontend
|
||||
self.encoder_downsampling_factor = 1
|
||||
if tp_train_args.encoder_conf["input_layer"] == "conv2d":
|
||||
self.encoder_downsampling_factor = 4
|
||||
|
||||
@torch.no_grad()
|
||||
def __call__(
|
||||
self,
|
||||
speech: Union[torch.Tensor, np.ndarray],
|
||||
speech_lengths: Union[torch.Tensor, np.ndarray] = None,
|
||||
text_lengths: Union[torch.Tensor, np.ndarray] = None
|
||||
):
|
||||
assert check_argument_types()
|
||||
|
||||
# Input as audio signal
|
||||
if isinstance(speech, np.ndarray):
|
||||
speech = torch.tensor(speech)
|
||||
if self.frontend is not None:
|
||||
feats, feats_len = self.frontend.forward(speech, speech_lengths)
|
||||
feats = to_device(feats, device=self.device)
|
||||
feats_len = feats_len.int()
|
||||
self.tp_model.frontend = None
|
||||
else:
|
||||
feats = speech
|
||||
feats_len = speech_lengths
|
||||
|
||||
# lfr_factor = max(1, (feats.size()[-1]//80)-1)
|
||||
batch = {"speech": feats, "speech_lengths": feats_len}
|
||||
|
||||
# a. To device
|
||||
batch = to_device(batch, device=self.device)
|
||||
|
||||
# b. Forward Encoder
|
||||
enc, enc_len = self.tp_model.encode(**batch)
|
||||
if isinstance(enc, tuple):
|
||||
enc = enc[0]
|
||||
|
||||
# c. Forward Predictor
|
||||
_, _, us_alphas, us_cif_peak = self.tp_model.calc_predictor_timestamp(enc, enc_len, text_lengths.to(self.device)+1)
|
||||
return us_alphas, us_cif_peak
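# Note (illustrative, mirroring the _forward loop below): the returned CIF
# alphas and peaks are turned into per-token timestamps with
# ts_prediction_lfr6_standard, e.g.
# ts_str, ts_list = ts_prediction_lfr6_standard(
#     us_alphas[0], us_cif_peak[0], token, force_time_shift=-3.0)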
|
||||
|
||||
|
||||
def inference(
|
||||
batch_size: int,
|
||||
ngpu: int,
|
||||
log_level: Union[int, str],
|
||||
data_path_and_name_and_type,
|
||||
timestamp_infer_config: Optional[str],
|
||||
timestamp_model_file: Optional[str],
|
||||
timestamp_cmvn_file: Optional[str] = None,
|
||||
raw_inputs: Union[np.ndarray, torch.Tensor] = None,
|
||||
key_file: Optional[str] = None,
|
||||
allow_variable_data_keys: bool = False,
|
||||
output_dir: Optional[str] = None,
|
||||
dtype: str = "float32",
|
||||
seed: int = 0,
|
||||
num_workers: int = 1,
|
||||
split_with_space: bool = True,
|
||||
seg_dict_file: Optional[str] = None,
|
||||
**kwargs,
|
||||
):
|
||||
inference_pipeline = inference_modelscope(
|
||||
batch_size=batch_size,
|
||||
ngpu=ngpu,
|
||||
log_level=log_level,
|
||||
timestamp_infer_config=timestamp_infer_config,
|
||||
timestamp_model_file=timestamp_model_file,
|
||||
timestamp_cmvn_file=timestamp_cmvn_file,
|
||||
key_file=key_file,
|
||||
allow_variable_data_keys=allow_variable_data_keys,
|
||||
output_dir=output_dir,
|
||||
dtype=dtype,
|
||||
seed=seed,
|
||||
num_workers=num_workers,
|
||||
split_with_space=split_with_space,
|
||||
seg_dict_file=seg_dict_file,
|
||||
**kwargs,
|
||||
)
|
||||
return inference_pipeline(data_path_and_name_and_type, raw_inputs)
|
||||
|
||||
|
||||
def inference_modelscope(
|
||||
batch_size: int,
|
||||
ngpu: int,
|
||||
log_level: Union[int, str],
|
||||
# data_path_and_name_and_type,
|
||||
timestamp_infer_config: Optional[str],
|
||||
timestamp_model_file: Optional[str],
|
||||
timestamp_cmvn_file: Optional[str] = None,
|
||||
# raw_inputs: Union[np.ndarray, torch.Tensor] = None,
|
||||
key_file: Optional[str] = None,
|
||||
allow_variable_data_keys: bool = False,
|
||||
output_dir: Optional[str] = None,
|
||||
dtype: str = "float32",
|
||||
seed: int = 0,
|
||||
num_workers: int = 1,
|
||||
split_with_space: bool = True,
|
||||
seg_dict_file: Optional[str] = None,
|
||||
**kwargs,
|
||||
):
|
||||
assert check_argument_types()
|
||||
if batch_size > 1:
|
||||
raise NotImplementedError("batch decoding is not implemented")
|
||||
if ngpu > 1:
|
||||
raise NotImplementedError("only single GPU decoding is supported")
|
||||
|
||||
logging.basicConfig(
|
||||
level=log_level,
|
||||
format="%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s",
|
||||
)
|
||||
|
||||
if ngpu >= 1 and torch.cuda.is_available():
|
||||
device = "cuda"
|
||||
else:
|
||||
device = "cpu"
|
||||
# 1. Set random-seed
|
||||
set_all_random_seed(seed)
|
||||
|
||||
# 2. Build speech2vadsegment
|
||||
speechtext2timestamp_kwargs = dict(
|
||||
timestamp_infer_config=timestamp_infer_config,
|
||||
timestamp_model_file=timestamp_model_file,
|
||||
timestamp_cmvn_file=timestamp_cmvn_file,
|
||||
device=device,
|
||||
dtype=dtype,
|
||||
)
|
||||
logging.info("speechtext2timestamp_kwargs: {}".format(speechtext2timestamp_kwargs))
|
||||
speechtext2timestamp = SpeechText2Timestamp(**speechtext2timestamp_kwargs)
|
||||
|
||||
preprocessor = LMPreprocessor(
|
||||
train=False,
|
||||
token_type=speechtext2timestamp.tp_train_args.token_type,
|
||||
token_list=speechtext2timestamp.tp_train_args.token_list,
|
||||
bpemodel=None,
|
||||
text_cleaner=None,
|
||||
g2p_type=None,
|
||||
text_name="text",
|
||||
non_linguistic_symbols=speechtext2timestamp.tp_train_args.non_linguistic_symbols,
|
||||
split_with_space=split_with_space,
|
||||
seg_dict_file=seg_dict_file,
|
||||
)
|
||||
|
||||
def _forward(
|
||||
data_path_and_name_and_type,
|
||||
raw_inputs: Union[np.ndarray, torch.Tensor] = None,
|
||||
output_dir_v2: Optional[str] = None,
|
||||
fs: dict = None,
|
||||
param_dict: dict = None,
|
||||
**kwargs
|
||||
):
|
||||
# 3. Build data-iterator
|
||||
if data_path_and_name_and_type is None and raw_inputs is not None:
|
||||
if isinstance(raw_inputs, torch.Tensor):
|
||||
raw_inputs = raw_inputs.numpy()
|
||||
data_path_and_name_and_type = [raw_inputs, "speech", "waveform"]
|
||||
|
||||
loader = ASRTask.build_streaming_iterator(
|
||||
data_path_and_name_and_type,
|
||||
dtype=dtype,
|
||||
batch_size=batch_size,
|
||||
key_file=key_file,
|
||||
num_workers=num_workers,
|
||||
preprocess_fn=preprocessor,
|
||||
collate_fn=ASRTask.build_collate_fn(speechtext2timestamp.tp_train_args, False),
|
||||
allow_variable_data_keys=allow_variable_data_keys,
|
||||
inference=True,
|
||||
)
|
||||
|
||||
tp_result_list = []
|
||||
for keys, batch in loader:
|
||||
assert isinstance(batch, dict), type(batch)
|
||||
assert all(isinstance(s, str) for s in keys), keys
|
||||
_bs = len(next(iter(batch.values())))
|
||||
assert len(keys) == _bs, f"{len(keys)} != {_bs}"
|
||||
|
||||
logging.info("timestamp predicting, utt_id: {}".format(keys))
|
||||
_batch = {'speech':batch['speech'],
|
||||
'speech_lengths':batch['speech_lengths'],
|
||||
'text_lengths':batch['text_lengths']}
|
||||
us_alphas, us_cif_peak = speechtext2timestamp(**_batch)
|
||||
|
||||
for batch_id in range(_bs):
|
||||
key = keys[batch_id]
|
||||
token = speechtext2timestamp.converter.ids2tokens(batch['text'][batch_id])
|
||||
ts_str, ts_list = ts_prediction_lfr6_standard(us_alphas[batch_id], us_cif_peak[batch_id], token, force_time_shift=-3.0)
|
||||
logging.warning(ts_str)
|
||||
item = {'key': key, 'value': ts_str, 'timestamp':ts_list}
|
||||
tp_result_list.append(item)
|
||||
return tp_result_list
|
||||
|
||||
return _forward
|
||||
|
||||
|
||||
def get_parser():
|
||||
parser = config_argparse.ArgumentParser(
|
||||
description="Timestamp Prediction Inference",
|
||||
formatter_class=argparse.ArgumentDefaultsHelpFormatter,
|
||||
)
|
||||
|
||||
# Note(kamo): Use '_' instead of '-' as separator.
|
||||
# '-' is confusing if written in yaml.
|
||||
parser.add_argument(
|
||||
"--log_level",
|
||||
type=lambda x: x.upper(),
|
||||
default="INFO",
|
||||
choices=("CRITICAL", "ERROR", "WARNING", "INFO", "DEBUG", "NOTSET"),
|
||||
help="The verbose level of logging",
|
||||
)
|
||||
|
||||
parser.add_argument("--output_dir", type=str, required=False)
|
||||
parser.add_argument(
|
||||
"--ngpu",
|
||||
type=int,
|
||||
default=0,
|
||||
help="The number of gpus. 0 indicates CPU mode",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--gpuid_list",
|
||||
type=str,
|
||||
default="",
|
||||
help="The visible gpus",
|
||||
)
|
||||
parser.add_argument("--seed", type=int, default=0, help="Random seed")
|
||||
parser.add_argument(
|
||||
"--dtype",
|
||||
default="float32",
|
||||
choices=["float16", "float32", "float64"],
|
||||
help="Data type",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--num_workers",
|
||||
type=int,
|
||||
default=0,
|
||||
help="The number of workers used for DataLoader",
|
||||
)
|
||||
|
||||
group = parser.add_argument_group("Input data related")
|
||||
group.add_argument(
|
||||
"--data_path_and_name_and_type",
|
||||
type=str2triple_str,
|
||||
required=False,
|
||||
action="append",
|
||||
)
|
||||
group.add_argument("--raw_inputs", type=list, default=None)
|
||||
# example=[{'key':'EdevDEWdIYQ_0021','file':'/mnt/data/jiangyu.xzy/test_data/speech_io/SPEECHIO_ASR_ZH00007_zhibodaihuo/wav/EdevDEWdIYQ_0021.wav'}])
|
||||
group.add_argument("--key_file", type=str_or_none)
|
||||
group.add_argument("--allow_variable_data_keys", type=str2bool, default=False)
|
||||
|
||||
group = parser.add_argument_group("The model configuration related")
|
||||
group.add_argument(
|
||||
"--timestamp_infer_config",
|
||||
type=str,
|
||||
help="VAD infer configuration",
|
||||
)
|
||||
group.add_argument(
|
||||
"--timestamp_model_file",
|
||||
type=str,
|
||||
help="VAD model parameter file",
|
||||
)
|
||||
group.add_argument(
|
||||
"--timestamp_cmvn_file",
|
||||
type=str,
|
||||
help="Global cmvn file",
|
||||
)
|
||||
|
||||
group = parser.add_argument_group("infer related")
|
||||
group.add_argument(
|
||||
"--batch_size",
|
||||
type=int,
|
||||
default=1,
|
||||
help="The batch size for inference",
|
||||
)
|
||||
group.add_argument(
|
||||
"--seg_dict_file",
|
||||
type=str,
|
||||
default=None,
|
||||
help="The batch size for inference",
|
||||
)
|
||||
group.add_argument(
|
||||
"--split_with_space",
|
||||
type=str2bool,
|
||||
default=False,
|
||||
help="The batch size for inference",
|
||||
)
|
||||
|
||||
return parser
|
||||
|
||||
|
||||
def main(cmd=None):
|
||||
print(get_commandline_args(), file=sys.stderr)
|
||||
parser = get_parser()
|
||||
args = parser.parse_args(cmd)
|
||||
kwargs = vars(args)
|
||||
kwargs.pop("config", None)
|
||||
inference(**kwargs)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
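# A rough command-line sketch (paths are placeholders, not from the original source):
# python -m funasr.bin.tp_inference \
#     --timestamp_infer_config exp/tp/config.yaml \
#     --timestamp_model_file exp/tp/model.pb \
#     --timestamp_cmvn_file exp/tp/am.mvn \
#     --data_path_and_name_and_type data/wav.scp,speech,sound \
#     --data_path_and_name_and_type data/text,text,text \
#     --output_dir exp/tp/decode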
|
||||
143
funasr/bin/tp_inference_launch.py
Normal file
@ -0,0 +1,143 @@
|
||||
#!/usr/bin/env python3
|
||||
# Copyright ESPnet (https://github.com/espnet/espnet). All Rights Reserved.
|
||||
# Apache 2.0 (http://www.apache.org/licenses/LICENSE-2.0)
|
||||
|
||||
import argparse
|
||||
import logging
|
||||
import os
|
||||
import sys
|
||||
from typing import Union, Dict, Any
|
||||
|
||||
from funasr.utils import config_argparse
|
||||
from funasr.utils.cli_utils import get_commandline_args
|
||||
from funasr.utils.types import str2bool
|
||||
from funasr.utils.types import str2triple_str
|
||||
from funasr.utils.types import str_or_none
|
||||
|
||||
|
||||
def get_parser():
|
||||
parser = config_argparse.ArgumentParser(
|
||||
description="Timestamp Prediction Inference",
|
||||
formatter_class=argparse.ArgumentDefaultsHelpFormatter,
|
||||
)
|
||||
|
||||
# Note(kamo): Use '_' instead of '-' as separator.
|
||||
# '-' is confusing if written in yaml.
|
||||
parser.add_argument(
|
||||
"--log_level",
|
||||
type=lambda x: x.upper(),
|
||||
default="INFO",
|
||||
choices=("CRITICAL", "ERROR", "WARNING", "INFO", "DEBUG", "NOTSET"),
|
||||
help="The verbose level of logging",
|
||||
)
|
||||
|
||||
parser.add_argument("--output_dir", type=str, required=False)
|
||||
parser.add_argument(
|
||||
"--ngpu",
|
||||
type=int,
|
||||
default=0,
|
||||
help="The number of gpus. 0 indicates CPU mode",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--njob",
|
||||
type=int,
|
||||
default=1,
|
||||
help="The number of jobs for each gpu",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--gpuid_list",
|
||||
type=str,
|
||||
default="",
|
||||
help="The visible gpus",
|
||||
)
|
||||
parser.add_argument("--seed", type=int, default=0, help="Random seed")
|
||||
parser.add_argument(
|
||||
"--dtype",
|
||||
default="float32",
|
||||
choices=["float16", "float32", "float64"],
|
||||
help="Data type",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--num_workers",
|
||||
type=int,
|
||||
default=1,
|
||||
help="The number of workers used for DataLoader",
|
||||
)
|
||||
|
||||
group = parser.add_argument_group("Input data related")
|
||||
group.add_argument(
|
||||
"--data_path_and_name_and_type",
|
||||
type=str2triple_str,
|
||||
required=True,
|
||||
action="append",
|
||||
)
|
||||
group.add_argument("--key_file", type=str_or_none)
|
||||
group.add_argument("--allow_variable_data_keys", type=str2bool, default=False)
|
||||
|
||||
group = parser.add_argument_group("The model configuration related")
|
||||
group.add_argument(
|
||||
"--timestamp_infer_config",
|
||||
type=str,
|
||||
help="VAD infer configuration",
|
||||
)
|
||||
group.add_argument(
|
||||
"--timestamp_model_file",
|
||||
type=str,
|
||||
help="VAD model parameter file",
|
||||
)
|
||||
group.add_argument(
|
||||
"--timestamp_cmvn_file",
|
||||
type=str,
|
||||
help="Global CMVN file",
|
||||
)
|
||||
|
||||
group = parser.add_argument_group("The inference configuration related")
|
||||
group.add_argument(
|
||||
"--batch_size",
|
||||
type=int,
|
||||
default=1,
|
||||
help="The batch size for inference",
|
||||
)
|
||||
return parser
|
||||
|
||||
|
||||
def inference_launch(mode, **kwargs):
|
||||
if mode == "tp_norm":
|
||||
from funasr.bin.tp_inference import inference_modelscope
|
||||
return inference_modelscope(**kwargs)
|
||||
else:
|
||||
logging.info("Unknown decoding mode: {}".format(mode))
|
||||
return None
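# Illustrative usage sketch (argument values are assumptions, not from the original source):
# pipeline = inference_launch(
#     mode="tp_norm",
#     batch_size=1, ngpu=0, log_level="INFO",
#     timestamp_infer_config="exp/tp/config.yaml",
#     timestamp_model_file="exp/tp/model.pb",
#     timestamp_cmvn_file="exp/tp/am.mvn",
# )
# results = pipeline(data_path_and_name_and_type, raw_inputs=None)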
|
||||
|
||||
def main(cmd=None):
|
||||
print(get_commandline_args(), file=sys.stderr)
|
||||
parser = get_parser()
|
||||
parser.add_argument(
|
||||
"--mode",
|
||||
type=str,
|
||||
default="tp_norm",
|
||||
help="The decoding mode",
|
||||
)
|
||||
args = parser.parse_args(cmd)
|
||||
kwargs = vars(args)
|
||||
kwargs.pop("config", None)
|
||||
|
||||
# set logging messages
|
||||
logging.basicConfig(
|
||||
level=args.log_level,
|
||||
format="%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s",
|
||||
)
|
||||
logging.info("Decoding args: {}".format(kwargs))
|
||||
|
||||
# gpu setting
|
||||
if args.ngpu > 0:
|
||||
jobid = int(args.output_dir.split(".")[-1])
|
||||
gpuid = args.gpuid_list.split(",")[(jobid - 1) // args.njob]
|
||||
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
|
||||
os.environ["CUDA_VISIBLE_DEVICES"] = gpuid
|
||||
|
||||
inference_launch(**kwargs)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
@ -110,8 +110,7 @@ def inference_launch(mode, **kwargs):
|
||||
if mode == "offline":
|
||||
from funasr.bin.vad_inference import inference_modelscope
|
||||
return inference_modelscope(**kwargs)
|
||||
# elif mode == "online":
|
||||
if "param_dict" in kwargs and kwargs["param_dict"]["online"]:
|
||||
elif mode == "online":
|
||||
from funasr.bin.vad_inference_online import inference_modelscope
|
||||
return inference_modelscope(**kwargs)
|
||||
else:
|
||||
|
||||
345
funasr/bin/vad_inference_online.py
Normal file
@ -0,0 +1,345 @@
|
||||
import argparse
import logging
import sys
import json
from pathlib import Path
from typing import Any
from typing import List
from typing import Optional
from typing import Sequence
from typing import Tuple
from typing import Union
from typing import Dict

import numpy as np
import torch
from typeguard import check_argument_types
from typeguard import check_return_type

from funasr.fileio.datadir_writer import DatadirWriter
from funasr.tasks.vad import VADTask
from funasr.torch_utils.device_funcs import to_device
from funasr.torch_utils.set_all_random_seed import set_all_random_seed
from funasr.utils import config_argparse
from funasr.utils.cli_utils import get_commandline_args
from funasr.utils.types import str2bool
from funasr.utils.types import str2triple_str
from funasr.utils.types import str_or_none
from funasr.models.frontend.wav_frontend import WavFrontendOnline
from funasr.models.frontend.wav_frontend import WavFrontend
from funasr.bin.vad_inference import Speech2VadSegment

header_colors = '\033[95m'
end_colors = '\033[0m'

global_asr_language: str = 'zh-cn'
global_sample_rate: Union[int, Dict[Any, int]] = {
    'audio_fs': 16000,
    'model_fs': 16000
}


class Speech2VadSegmentOnline(Speech2VadSegment):
    """Speech2VadSegmentOnline class

    Examples:
        >>> import soundfile
        >>> speech2segment = Speech2VadSegmentOnline(vad_infer_config="vad_config.yml", vad_model_file="vad.pt")
        >>> audio, rate = soundfile.read("speech.wav")
        >>> speech2segment(audio)
        [[10, 230], [245, 450], ...]

    """
    def __init__(self, **kwargs):
        super(Speech2VadSegmentOnline, self).__init__(**kwargs)
        vad_cmvn_file = kwargs.get('vad_cmvn_file', None)
        self.frontend = None
        if self.vad_infer_args.frontend is not None:
            self.frontend = WavFrontendOnline(cmvn_file=vad_cmvn_file, **self.vad_infer_args.frontend_conf)

    @torch.no_grad()
    def __call__(
        self, speech: Union[torch.Tensor, np.ndarray], speech_lengths: Union[torch.Tensor, np.ndarray] = None,
        in_cache: Dict[str, torch.Tensor] = dict(), is_final: bool = False
    ) -> Tuple[torch.Tensor, List[List[int]], torch.Tensor]:
        """Inference

        Args:
            speech: Input speech data (batch, time)
            speech_lengths: Valid length of each utterance in the batch
            in_cache: Streaming cache carried over from the previous call
            is_final: Whether this chunk is the last one of the stream
        Returns:
            fbanks, segments, in_cache

        """
        assert check_argument_types()

        # Input as audio signal
        if isinstance(speech, np.ndarray):
            speech = torch.tensor(speech)
        batch_size = speech.shape[0]
        segments = [[]] * batch_size
        if self.frontend is not None:
            feats, feats_len = self.frontend.forward(speech, speech_lengths, is_final)
            fbanks, _ = self.frontend.get_fbank()
        else:
            raise Exception("Need to extract feats first, please configure frontend configuration")
        if feats.shape[0]:
            feats = to_device(feats, device=self.device)
            feats_len = feats_len.int()
            waveforms = self.frontend.get_waveforms()

            batch = {
                "feats": feats,
                "waveform": waveforms,
                "in_cache": in_cache,
                "is_final": is_final
            }
            # a. To device
            batch = to_device(batch, device=self.device)
            segments, in_cache = self.vad_model.forward_online(**batch)
            # in_cache.update(batch['in_cache'])
            # in_cache = {key: value for key, value in batch['in_cache'].items()}
        return fbanks, segments, in_cache
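# ---------------------------------------------------------------------------
# Usage sketch (illustration only, not part of this module): streaming a long
# waveform through Speech2VadSegmentOnline chunk by chunk while carrying the
# cache between calls. File names and the chunk size are assumptions.
#
#   import soundfile
#   model = Speech2VadSegmentOnline(
#       vad_infer_config="vad_config.yml",   # hypothetical local files
#       vad_model_file="vad.pb",
#       vad_cmvn_file="vad.mvn",
#   )
#   audio, rate = soundfile.read("speech.wav")           # 16 kHz mono assumed
#   cache, chunk = {}, 9600                               # 600 ms per step
#   for start in range(0, len(audio), chunk):
#       piece = torch.tensor(audio[start:start + chunk], dtype=torch.float32)[None, :]
#       lengths = torch.tensor([piece.shape[1]], dtype=torch.int32)
#       is_final = start + chunk >= len(audio)
#       fbanks, segments, cache = model(piece, lengths, cache, is_final)
# ---------------------------------------------------------------------------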


def inference(
    batch_size: int,
    ngpu: int,
    log_level: Union[int, str],
    data_path_and_name_and_type,
    vad_infer_config: Optional[str],
    vad_model_file: Optional[str],
    vad_cmvn_file: Optional[str] = None,
    raw_inputs: Union[np.ndarray, torch.Tensor] = None,
    key_file: Optional[str] = None,
    allow_variable_data_keys: bool = False,
    output_dir: Optional[str] = None,
    dtype: str = "float32",
    seed: int = 0,
    num_workers: int = 1,
    **kwargs,
):
    inference_pipeline = inference_modelscope(
        batch_size=batch_size,
        ngpu=ngpu,
        log_level=log_level,
        vad_infer_config=vad_infer_config,
        vad_model_file=vad_model_file,
        vad_cmvn_file=vad_cmvn_file,
        key_file=key_file,
        allow_variable_data_keys=allow_variable_data_keys,
        output_dir=output_dir,
        dtype=dtype,
        seed=seed,
        num_workers=num_workers,
        **kwargs,
    )
    return inference_pipeline(data_path_and_name_and_type, raw_inputs)


def inference_modelscope(
    batch_size: int,
    ngpu: int,
    log_level: Union[int, str],
    # data_path_and_name_and_type,
    vad_infer_config: Optional[str],
    vad_model_file: Optional[str],
    vad_cmvn_file: Optional[str] = None,
    # raw_inputs: Union[np.ndarray, torch.Tensor] = None,
    key_file: Optional[str] = None,
    allow_variable_data_keys: bool = False,
    output_dir: Optional[str] = None,
    dtype: str = "float32",
    seed: int = 0,
    num_workers: int = 1,
    **kwargs,
):
    assert check_argument_types()
    if batch_size > 1:
        raise NotImplementedError("batch decoding is not implemented")
    if ngpu > 1:
        raise NotImplementedError("only single GPU decoding is supported")

    logging.basicConfig(
        level=log_level,
        format="%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s",
    )

    if ngpu >= 1 and torch.cuda.is_available():
        device = "cuda"
    else:
        device = "cpu"

    # 1. Set random-seed
    set_all_random_seed(seed)

    # 2. Build speech2vadsegment
    speech2vadsegment_kwargs = dict(
        vad_infer_config=vad_infer_config,
        vad_model_file=vad_model_file,
        vad_cmvn_file=vad_cmvn_file,
        device=device,
        dtype=dtype,
    )
    logging.info("speech2vadsegment_kwargs: {}".format(speech2vadsegment_kwargs))
    speech2vadsegment = Speech2VadSegmentOnline(**speech2vadsegment_kwargs)

    def _forward(
        data_path_and_name_and_type,
        raw_inputs: Union[np.ndarray, torch.Tensor] = None,
        output_dir_v2: Optional[str] = None,
        fs: dict = None,
        param_dict: dict = None,
    ):
        # 3. Build data-iterator
        if data_path_and_name_and_type is None and raw_inputs is not None:
            if isinstance(raw_inputs, torch.Tensor):
                raw_inputs = raw_inputs.numpy()
            data_path_and_name_and_type = [raw_inputs, "speech", "waveform"]
        loader = VADTask.build_streaming_iterator(
            data_path_and_name_and_type,
            dtype=dtype,
            batch_size=batch_size,
            key_file=key_file,
            num_workers=num_workers,
            preprocess_fn=VADTask.build_preprocess_fn(speech2vadsegment.vad_infer_args, False),
            collate_fn=VADTask.build_collate_fn(speech2vadsegment.vad_infer_args, False),
            allow_variable_data_keys=allow_variable_data_keys,
            inference=True,
        )

        finish_count = 0
        file_count = 1
        # 7. Start for-loop
        # FIXME(kamo): The output format should be discussed
        output_path = output_dir_v2 if output_dir_v2 is not None else output_dir
        if output_path is not None:
            writer = DatadirWriter(output_path)
            ibest_writer = writer[f"1best_recog"]
        else:
            writer = None
            ibest_writer = None

        vad_results = []
        batch_in_cache = param_dict['in_cache'] if param_dict is not None else dict()
        is_final = param_dict['is_final'] if param_dict is not None else False
        for keys, batch in loader:
            assert isinstance(batch, dict), type(batch)
            assert all(isinstance(s, str) for s in keys), keys
            _bs = len(next(iter(batch.values())))
            assert len(keys) == _bs, f"{len(keys)} != {_bs}"
            batch['in_cache'] = batch_in_cache
            batch['is_final'] = is_final

            # do vad segment
            _, results, param_dict['in_cache'] = speech2vadsegment(**batch)
            # param_dict['in_cache'] = batch['in_cache']
            if results:
                for i, _ in enumerate(keys):
                    if results[i]:
                        results[i] = json.dumps(results[i])
                        item = {'key': keys[i], 'value': results[i]}
                        vad_results.append(item)
                        if writer is not None:
                            results[i] = json.loads(results[i])
                            ibest_writer["text"][keys[i]] = "{}".format(results[i])

        return vad_results

    return _forward


def get_parser():
    parser = config_argparse.ArgumentParser(
        description="VAD Decoding",
        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
    )

    # Note(kamo): Use '_' instead of '-' as separator.
    # '-' is confusing if written in yaml.
    parser.add_argument(
        "--log_level",
        type=lambda x: x.upper(),
        default="INFO",
        choices=("CRITICAL", "ERROR", "WARNING", "INFO", "DEBUG", "NOTSET"),
        help="The verbose level of logging",
    )

    parser.add_argument("--output_dir", type=str, required=False)
    parser.add_argument(
        "--ngpu",
        type=int,
        default=0,
        help="The number of gpus. 0 indicates CPU mode",
    )
    parser.add_argument(
        "--gpuid_list",
        type=str,
        default="",
        help="The visible gpus",
    )
    parser.add_argument("--seed", type=int, default=0, help="Random seed")
    parser.add_argument(
        "--dtype",
        default="float32",
        choices=["float16", "float32", "float64"],
        help="Data type",
    )
    parser.add_argument(
        "--num_workers",
        type=int,
        default=1,
        help="The number of workers used for DataLoader",
    )

    group = parser.add_argument_group("Input data related")
    group.add_argument(
        "--data_path_and_name_and_type",
        type=str2triple_str,
        required=False,
        action="append",
    )
    group.add_argument("--raw_inputs", type=list, default=None)
    # example=[{'key':'EdevDEWdIYQ_0021','file':'/mnt/data/jiangyu.xzy/test_data/speech_io/SPEECHIO_ASR_ZH00007_zhibodaihuo/wav/EdevDEWdIYQ_0021.wav'}])
    group.add_argument("--key_file", type=str_or_none)
    group.add_argument("--allow_variable_data_keys", type=str2bool, default=False)

    group = parser.add_argument_group("The model configuration related")
    group.add_argument(
        "--vad_infer_config",
        type=str,
        help="VAD infer configuration",
    )
    group.add_argument(
        "--vad_model_file",
        type=str,
        help="VAD model parameter file",
    )
    group.add_argument(
        "--vad_cmvn_file",
        type=str,
        help="Global cmvn file",
    )

    group = parser.add_argument_group("infer related")
    group.add_argument(
        "--batch_size",
        type=int,
        default=1,
        help="The batch size for inference",
    )

    return parser


def main(cmd=None):
    print(get_commandline_args(), file=sys.stderr)
    parser = get_parser()
    args = parser.parse_args(cmd)
    kwargs = vars(args)
    kwargs.pop("config", None)
    inference(**kwargs)


if __name__ == "__main__":
    main()
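The pipeline returned by inference_modelscope is the _forward closure above: param_dict carries the streaming state, with 'in_cache' holding the VAD cache that is written back on every call and 'is_final' flagging the last chunk. A minimal sketch of driving it with an in-memory waveform follows; the model paths are placeholder local assets and the 600 ms chunking policy is only an example.

import numpy as np
from funasr.bin.vad_inference_online import inference_modelscope

# Build the online VAD pipeline; the three paths below are hypothetical local assets.
vad_pipeline = inference_modelscope(
    batch_size=1,
    ngpu=0,
    log_level="INFO",
    vad_infer_config="vad_config.yml",
    vad_model_file="vad.pb",
    vad_cmvn_file="vad.mvn",
)

audio = np.random.randn(16000 * 3).astype(np.float32)   # 3 s of synthetic 16 kHz audio
param_dict = {"in_cache": dict(), "is_final": False}
chunk = 9600                                             # 600 ms per call at 16 kHz
for start in range(0, len(audio), chunk):
    param_dict["is_final"] = start + chunk >= len(audio)
    segments = vad_pipeline(
        None,                                  # no scp input, use raw_inputs instead
        raw_inputs=audio[start:start + chunk],
        param_dict=param_dict,                 # param_dict['in_cache'] is updated in place
    )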
@ -107,7 +107,7 @@ class H5FileWrapper:
        return value[()]


def sound_loader(path, float_dtype=None):
def sound_loader(path, dest_sample_rate=16000, float_dtype=None):
    # The file is as follows:
    #   utterance_id_A /some/where/a.wav
    #   utterance_id_B /some/where/a.flac
@ -115,7 +115,7 @@ def sound_loader(path, float_dtype=None):
    # NOTE(kamo): SoundScpReader doesn't support pipe-fashion
    #   like Kaldi e.g. "cat a.wav |".
    # NOTE(kamo): The audio signal is normalized to [-1,1] range.
    loader = SoundScpReader(path, normalize=True, always_2d=False)
    loader = SoundScpReader(path, dest_sample_rate, normalize=True, always_2d=False)

    # SoundScpReader.__getitem__() returns Tuple[int, ndarray],
    # but ndarray is desired, so Adapter class is inserted here
@ -139,7 +139,7 @@ def rand_int_loader(filepath, loader_type):
DATA_TYPES = {
    "sound": dict(
        func=sound_loader,
        kwargs=["float_dtype"],
        kwargs=["dest_sample_rate", "float_dtype"],
        help="Audio format types which supported by sndfile wav, flac, etc."
        "\n\n"
        " utterance_id_a a.wav\n"
@ -282,6 +282,7 @@ class ESPnetDataset(AbsDataset):
        int_dtype: str = "long",
        max_cache_size: Union[float, int, str] = 0.0,
        max_cache_fd: int = 0,
        dest_sample_rate: int = 16000,
    ):
        assert check_argument_types()
        if len(path_name_type_list) == 0:
@ -295,6 +296,7 @@ class ESPnetDataset(AbsDataset):
        self.float_dtype = float_dtype
        self.int_dtype = int_dtype
        self.max_cache_fd = max_cache_fd
        self.dest_sample_rate = dest_sample_rate

        self.loader_dict = {}
        self.debug_info = {}
@ -335,6 +337,8 @@ class ESPnetDataset(AbsDataset):
        for key2 in dic["kwargs"]:
            if key2 == "loader_type":
                kwargs["loader_type"] = loader_type
            elif key2 == "dest_sample_rate" and loader_type == "sound":
                kwargs["dest_sample_rate"] = self.dest_sample_rate
            elif key2 == "float_dtype":
                kwargs["float_dtype"] = self.float_dtype
            elif key2 == "int_dtype":
@ -8,6 +8,7 @@ from typing import Dict
from typing import Iterator
from typing import Tuple
from typing import Union
from typing import List

import kaldiio
import numpy as np
@ -66,7 +67,7 @@ def load_pcm(input):
    return load_bytes(bytes)

DATA_TYPES = {
    "sound": lambda x: torchaudio.load(x)[0][0].numpy(),
    "sound": lambda x: torchaudio.load(x)[0].numpy(),
    "pcm": load_pcm,
    "kaldi_ark": load_kaldi,
    "bytes": load_bytes,
@ -106,6 +107,7 @@ class IterableESPnetDataset(IterableDataset):
        ] = None,
        float_dtype: str = "float32",
        fs: dict = None,
        mc: bool = False,
        int_dtype: str = "long",
        key_file: str = None,
    ):
@ -122,12 +124,13 @@ class IterableESPnetDataset(IterableDataset):
        self.int_dtype = int_dtype
        self.key_file = key_file
        self.fs = fs
        self.mc = mc

        self.debug_info = {}
        non_iterable_list = []
        self.path_name_type_list = []

        if not isinstance(path_name_type_list[0], Tuple):
        if not isinstance(path_name_type_list[0], (Tuple, List)):
            path = path_name_type_list[0]
            name = path_name_type_list[1]
            _type = path_name_type_list[2]
@ -192,6 +195,7 @@ class IterableESPnetDataset(IterableDataset):
                array = torchaudio.transforms.Resample(orig_freq=audio_fs,
                                                       new_freq=model_fs)(array)
                array = array.squeeze(0).numpy()

            data[name] = array

        if self.preprocess is not None:
@ -238,11 +242,17 @@ class IterableESPnetDataset(IterableDataset):
                model_fs = self.fs["model_fs"]
                if audio_fs is not None and model_fs is not None:
                    array = torch.from_numpy(array)
                    array = array.unsqueeze(0)
                    array = torchaudio.transforms.Resample(orig_freq=audio_fs,
                                                           new_freq=model_fs)(array)
                    array = array.squeeze(0).numpy()
                    data[name] = array
                    array = array.numpy()

            if _type == "sound":
                if self.mc:
                    data[name] = array.transpose((1, 0))
                else:
                    data[name] = array[0]
            else:
                data[name] = array

            if self.preprocess is not None:
                data = self.preprocess(uid, data)
@ -340,11 +350,16 @@ class IterableESPnetDataset(IterableDataset):
                model_fs = self.fs["model_fs"]
                if audio_fs is not None and model_fs is not None:
                    array = torch.from_numpy(array)
                    array = array.unsqueeze(0)
                    array = torchaudio.transforms.Resample(orig_freq=audio_fs,
                                                           new_freq=model_fs)(array)
                    array = array.squeeze(0).numpy()
                    data[name] = array
                    array = array.numpy()
            if _type == "sound":
                if self.mc:
                    data[name] = array.transpose((1, 0))
                else:
                    data[name] = array[0]
            else:
                data[name] = array
        if self.non_iterable_dataset is not None:
            # 2.b. Load data from non-iterable dataset
            _, from_non_iterable = self.non_iterable_dataset[uid]
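These hunks resample each loaded waveform from the source rate audio_fs to the model rate model_fs with torchaudio before it reaches the preprocess function. A self-contained sketch of that conversion on synthetic data; the sample rates are only an example:

import numpy as np
import torch
import torchaudio

audio_fs, model_fs = 8000, 16000
array = np.random.randn(audio_fs).astype(np.float32)       # one second of synthetic audio

tensor = torch.from_numpy(array).unsqueeze(0)               # (1, time)
tensor = torchaudio.transforms.Resample(orig_freq=audio_fs,
                                        new_freq=model_fs)(tensor)
array = tensor.squeeze(0).numpy()                           # back to numpy, now at 16 kHz
print(array.shape)                                          # (16000,)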
@ -55,3 +55,4 @@ python -m funasr.export.export_model --model-name /mnt/workspace/damo/speech_par

## Acknowledge
Torch model quantization is supported by [BladeDISC](https://github.com/alibaba/BladeDISC), an end-to-end DynamIc Shape Compiler project for machine learning workloads. BladeDISC provides general, transparent, and easy-to-use performance optimization for TensorFlow/PyTorch workloads on GPGPU and CPU backends. If you are interested, please contact us.
@ -4,6 +4,7 @@ from typing import Union

import numpy as np
import soundfile
import librosa
from typeguard import check_argument_types

from funasr.fileio.read_text import read_2column_text
@ -30,6 +31,7 @@ class SoundScpReader(collections.abc.Mapping):
        dtype=np.int16,
        always_2d: bool = False,
        normalize: bool = False,
        dest_sample_rate: int = 16000,
    ):
        assert check_argument_types()
        self.fname = fname
@ -37,15 +39,18 @@ class SoundScpReader(collections.abc.Mapping):
        self.always_2d = always_2d
        self.normalize = normalize
        self.data = read_2column_text(fname)
        self.dest_sample_rate = dest_sample_rate

    def __getitem__(self, key):
        wav = self.data[key]
        if self.normalize:
            # soundfile.read normalizes data to [-1,1] if dtype is not given
            array, rate = soundfile.read(wav, always_2d=self.always_2d)
            array, rate = librosa.load(
                wav, sr=self.dest_sample_rate, mono=not self.always_2d
            )
        else:
            array, rate = soundfile.read(
                wav, dtype=self.dtype, always_2d=self.always_2d
            array, rate = librosa.load(
                wav, sr=self.dest_sample_rate, mono=not self.always_2d, dtype=self.dtype
            )

        return rate, array
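With this change SoundScpReader delegates to librosa.load, which resamples to dest_sample_rate while reading instead of returning the file's native rate. A short sketch of the equivalent direct call; the wav path is a placeholder:

import librosa

# Read an audio file (placeholder path), resampled to 16 kHz and normalized to
# [-1, 1], matching SoundScpReader.__getitem__ with normalize=True.
array, rate = librosa.load("speech.wav", sr=16000, mono=True)
print(rate, array.shape)    # 16000 and a 1-D (num_samples,) waveform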
@ -66,13 +66,13 @@ def average_nbest_models(
        elif n == 1:
            # The averaged model is same as the best model
            e, _ = epoch_and_values[0]
            op = output_dir / f"{e}epoch.pth"
            sym_op = output_dir / f"{ph}.{cr}.ave_1best.{suffix}pth"
            op = output_dir / f"{e}epoch.pb"
            sym_op = output_dir / f"{ph}.{cr}.ave_1best.{suffix}pb"
            if sym_op.is_symlink() or sym_op.exists():
                sym_op.unlink()
            sym_op.symlink_to(op.name)
        else:
            op = output_dir / f"{ph}.{cr}.ave_{n}best.{suffix}pth"
            op = output_dir / f"{ph}.{cr}.ave_{n}best.{suffix}pb"
            logging.info(
                f"Averaging {n}best models: " f'criterion="{ph}.{cr}": {op}'
            )
@ -83,12 +83,12 @@ def average_nbest_models(
                if e not in _loaded:
                    if oss_bucket is None:
                        _loaded[e] = torch.load(
                            output_dir / f"{e}epoch.pth",
                            output_dir / f"{e}epoch.pb",
                            map_location="cpu",
                        )
                    else:
                        buffer = BytesIO(
                            oss_bucket.get_object(os.path.join(pai_output_dir, f"{e}epoch.pth")).read())
                            oss_bucket.get_object(os.path.join(pai_output_dir, f"{e}epoch.pb")).read())
                        _loaded[e] = torch.load(buffer)
                states = _loaded[e]

@ -115,13 +115,13 @@ def average_nbest_models(
            else:
                buffer = BytesIO()
                torch.save(avg, buffer)
                oss_bucket.put_object(os.path.join(pai_output_dir, f"{ph}.{cr}.ave_{n}best.{suffix}pth"),
                oss_bucket.put_object(os.path.join(pai_output_dir, f"{ph}.{cr}.ave_{n}best.{suffix}pb"),
                                      buffer.getvalue())

    # 3. *.*.ave.pth is a symlink to the max ave model
    # 3. *.*.ave.pb is a symlink to the max ave model
    if oss_bucket is None:
        op = output_dir / f"{ph}.{cr}.ave_{max(_nbests)}best.{suffix}pth"
        sym_op = output_dir / f"{ph}.{cr}.ave.{suffix}pth"
        op = output_dir / f"{ph}.{cr}.ave_{max(_nbests)}best.{suffix}pb"
        sym_op = output_dir / f"{ph}.{cr}.ave.{suffix}pb"
        if sym_op.is_symlink() or sym_op.exists():
            sym_op.unlink()
        sym_op.symlink_to(op.name)

@ -191,12 +191,12 @@ def unpack(

    Examples:
        tarfile:
           model.pth
           model.pb
           some1.file
           some2.file

        >>> unpack("tarfile", "out")
        {'asr_model_file': 'out/model.pth'}
        {'asr_model_file': 'out/model.pb'}
    """
    input_archive = Path(input_archive)
    outpath = Path(outpath)
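The renamed checkpoints keep the existing aliasing scheme: the n-best average is written once and *.ave.pb is only a relative symlink to it. A small self-contained sketch of that pattern with pathlib; the directory and file names are illustrative:

from pathlib import Path

output_dir = Path("exp/demo")                       # illustrative directory
output_dir.mkdir(parents=True, exist_ok=True)
op = output_dir / "valid.acc.ave_5best.pb"          # the averaged checkpoint
op.touch()

sym_op = output_dir / "valid.acc.ave.pb"            # alias consumed by downstream tools
if sym_op.is_symlink() or sym_op.exists():
    sym_op.unlink()
sym_op.symlink_to(op.name)                          # relative link, as in the code above
print(sym_op.resolve())                             # .../exp/demo/valid.acc.ave_5best.pb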
@ -90,6 +90,47 @@ class DecoderLayerSANM(nn.Module):
            tgt = self.norm1(tgt)
        tgt = self.feed_forward(tgt)

        x = tgt
        if self.self_attn:
            if self.normalize_before:
                tgt = self.norm2(tgt)
            x, _ = self.self_attn(tgt, tgt_mask)
            x = residual + self.dropout(x)

        if self.src_attn is not None:
            residual = x
            if self.normalize_before:
                x = self.norm3(x)

            x = residual + self.dropout(self.src_attn(x, memory, memory_mask))

        return x, tgt_mask, memory, memory_mask, cache

    def forward_chunk(self, tgt, tgt_mask, memory, memory_mask=None, cache=None):
        """Compute decoded features.

        Args:
            tgt (torch.Tensor): Input tensor (#batch, maxlen_out, size).
            tgt_mask (torch.Tensor): Mask for input tensor (#batch, maxlen_out).
            memory (torch.Tensor): Encoded memory, float32 (#batch, maxlen_in, size).
            memory_mask (torch.Tensor): Encoded memory mask (#batch, maxlen_in).
            cache (List[torch.Tensor]): List of cached tensors.
                Each tensor shape should be (#batch, maxlen_out - 1, size).

        Returns:
            torch.Tensor: Output tensor (#batch, maxlen_out, size).
            torch.Tensor: Mask for output tensor (#batch, maxlen_out).
            torch.Tensor: Encoded memory (#batch, maxlen_in, size).
            torch.Tensor: Encoded memory mask (#batch, maxlen_in).

        """
        # tgt = self.dropout(tgt)
        residual = tgt
        if self.normalize_before:
            tgt = self.norm1(tgt)
        tgt = self.feed_forward(tgt)

        x = tgt
        if self.self_attn:
            if self.normalize_before:
@ -109,7 +150,6 @@ class DecoderLayerSANM(nn.Module):

        return x, tgt_mask, memory, memory_mask, cache


class FsmnDecoderSCAMAOpt(BaseTransformerDecoder):
    """
    author: Speech Lab, Alibaba Group, China
@ -947,6 +987,65 @@ class ParaformerSANMDecoder(BaseTransformerDecoder):
        )
        return logp.squeeze(0), state

    def forward_chunk(
        self,
        memory: torch.Tensor,
        tgt: torch.Tensor,
        cache: dict = None,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        """Forward decoder for one chunk.

        Args:
            memory: encoded memory for the current chunk, float32 (batch, maxlen_in, size)
            tgt: decoder input for the current chunk (batch, maxlen_out, size)
            cache: dict holding the per-layer FSMN cache under the key "decode_fsmn";
                it is updated in place and should be passed back in on the next chunk
        Returns:
            x: decoded token score before softmax (batch, maxlen_out, token)
                if use_output_layer is True
        """
        x = tgt
        if cache["decode_fsmn"] is None:
            cache_layer_num = len(self.decoders)
            if self.decoders2 is not None:
                cache_layer_num += len(self.decoders2)
            new_cache = [None] * cache_layer_num
        else:
            new_cache = cache["decode_fsmn"]
        for i in range(self.att_layer_num):
            decoder = self.decoders[i]
            x, tgt_mask, memory, memory_mask, c_ret = decoder.forward_chunk(
                x, None, memory, None, cache=new_cache[i]
            )
            new_cache[i] = c_ret

        if self.num_blocks - self.att_layer_num > 1:
            for i in range(self.num_blocks - self.att_layer_num):
                j = i + self.att_layer_num
                decoder = self.decoders2[i]
                x, tgt_mask, memory, memory_mask, c_ret = decoder.forward_chunk(
                    x, None, memory, None, cache=new_cache[j]
                )
                new_cache[j] = c_ret

        for decoder in self.decoders3:
            x, tgt_mask, memory, memory_mask, _ = decoder.forward_chunk(
                x, None, memory, None, cache=None
            )
        if self.normalize_before:
            x = self.after_norm(x)
        if self.output_layer is not None:
            x = self.output_layer(x)
        cache["decode_fsmn"] = new_cache
        return x

    def forward_one_step(
        self,
        tgt: torch.Tensor,
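forward_chunk keeps one FSMN cache entry per decoder layer under cache['decode_fsmn'] and refreshes it on every call, so a caller allocates the dict once and simply passes it back in for each chunk. A sketch of that driving loop; the decoder instance and the per-chunk tensors are stand-ins that are not constructed in this diff:

# `decoder` is assumed to be an already-built ParaformerSANMDecoder, and
# `encoder_chunks` / `embed_chunks` stand in for per-chunk encoder outputs and
# decoder input embeddings of shape (batch, chunk_len, size).
cache = {"decode_fsmn": None}          # filled by the first forward_chunk call
for memory, tgt in zip(encoder_chunks, embed_chunks):
    logits = decoder.forward_chunk(memory, tgt, cache=cache)
    # cache["decode_fsmn"] now holds one cached tensor per FSMN decoder layer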
Some files were not shown because too many files have changed in this diff.