huangmingming 2023-03-16 15:38:33 +08:00
commit 0f3b189d35
78 changed files with 684 additions and 217 deletions
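
Most of the hunks below make the same change: default checkpoint file names switch from a `.pth` suffix to `.pb`. The following is a minimal sketch, assuming checkpoints saved before this commit still carry the old suffix and only need to be renamed on disk (the diff itself does not say whether the file contents changed). `rename_checkpoints` is a hypothetical helper, not part of FunASR.

```python
from pathlib import Path


def rename_checkpoints(exp_dir, old_suffix=".pth", new_suffix=".pb", dry_run=True):
    """Hypothetical helper: plan (and optionally perform) checkpoint renames.

    Returns a list of (old_path, new_path) pairs. Files are only renamed
    when dry_run is False, and existing targets are never overwritten.
    """
    planned = []
    # sorted() materializes the listing before any file is renamed.
    for old_path in sorted(Path(exp_dir).rglob(f"*{old_suffix}")):
        new_path = old_path.with_suffix(new_suffix)
        if new_path.exists():
            continue  # keep whatever already uses the new name
        planned.append((old_path, new_path))
        if not dry_run:
            old_path.rename(new_path)
    return planned
```

Running it with the default `dry_run=True` first lists what would change, which may be worth reviewing before renaming anything in an experiment directory.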

View File

@ -15,36 +15,10 @@
| [**Model Zoo**](https://www.modelscope.cn/models/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary)
| [**Contact**](#contact)
## What's new:
### 2023.2.17, funasr-0.2.0, modelscope-1.3.0
- We support a new feature: exporting paraformer models from modelscope to [onnx and torchscripts](https://github.com/alibaba-damo-academy/FunASR/tree/main/funasr/export). Locally finetuned models are also supported.
- We support a new feature, [onnxruntime](https://github.com/alibaba-damo-academy/FunASR/tree/main/funasr/runtime/python): you can deploy the runtime without modelscope or funasr. For the [paraformer-large](https://www.modelscope.cn/models/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary) model, onnxruntime lowers the RTF on CPU by about 3x (0.110 -> 0.038); see [details](https://github.com/alibaba-damo-academy/FunASR/tree/main/funasr/runtime/python/onnxruntime/paraformer/rapid_paraformer#speed).
- We support a new feature, [grpc](https://github.com/alibaba-damo-academy/FunASR/tree/main/funasr/runtime/python/grpc): you can build an ASR service with grpc by deploying the modelscope pipeline or onnxruntime.
- We release a new model, [paraformer-large-contextual](https://www.modelscope.cn/models/damo/speech_paraformer-large-contextual_asr_nat-zh-cn-16k-common-vocab8404/summary), which supports hotword customization based on incentive enhancement and improves the recall and precision of hotwords.
- We optimize the timestamp alignment of [Paraformer-large-long](https://modelscope.cn/models/damo/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary): timestamp prediction accuracy is much improved, reaching an accumulated average shift (AAS) of 74.7 ms; see [details](https://arxiv.org/abs/2301.12343).
- We release a new model, [8k VAD model](https://modelscope.cn/models/damo/speech_fsmn_vad_zh-cn-16k-common-pytorch/summary), which can predict the duration of non-silence speech. It can be freely integrated with any ASR model in [modelscope](https://github.com/alibaba-damo-academy/FunASR/discussions/134).
- We release a new model, [MFCCA](https://www.modelscope.cn/models/NPU-ASLP/speech_mfcca_asr-zh-cn-16k-alimeeting-vocab4950/summary), a multi-channel multi-speaker model which is independent of the number and geometry of microphones and supports Mandarin meeting transcription.
- We release several new UniASR models:
[Southern Fujian Dialect model](https://modelscope.cn/models/damo/speech_UniASR_asr_2pass-minnan-16k-common-vocab3825/summary),
[French model](https://modelscope.cn/models/damo/speech_UniASR_asr_2pass-fr-16k-common-vocab3472-tensorflow1-online/summary),
[German model](https://modelscope.cn/models/damo/speech_UniASR_asr_2pass-de-16k-common-vocab3690-tensorflow1-online/summary),
[Vietnamese model](https://modelscope.cn/models/damo/speech_UniASR_asr_2pass-vi-16k-common-vocab1001-pytorch-online/summary),
[Persian model](https://modelscope.cn/models/damo/speech_UniASR_asr_2pass-fa-16k-common-vocab1257-pytorch-online/summary).
- We release a new model, [paraformer-data2vec model](https://www.modelscope.cn/models/damo/speech_data2vec_pretrain-paraformer-zh-cn-aishell2-16k/summary), an unsupervised pretraining model trained on AISHELL-2, which is used to initialize the paraformer model that is then finetuned on AISHELL-1.
- We release a new feature: the `VAD`, `ASR` and `PUNC` models can be freely combined, whether they come from [modelscope](https://www.modelscope.cn/models/damo/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary) or are locally finetuned models. See the [demo](https://github.com/alibaba-damo-academy/FunASR/discussions/134).
- We optimized the [punctuation common model](https://www.modelscope.cn/models/damo/punc_ct-transformer_zh-cn-common-vocab272727-pytorch/summary), improving recall and precision and fixing bad cases of missing punctuation marks.
- Several new audio input types are now supported by the modelscope inference pipeline, including mp3, flac, ogg, opus...
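
The onnxruntime speedup quoted in the list above follows directly from the two real-time factors. Here is a small worked check, assuming RTF means processing time divided by audio duration (the usual definition; the changelog does not spell it out).

```python
def speedup(rtf_before, rtf_after):
    """Ratio of two real-time factors; a larger value means faster inference."""
    return rtf_before / rtf_after


# RTF values quoted in the changelog entry for paraformer-large on CPU.
ratio = speedup(0.110, 0.038)
print(f"{ratio:.2f}x")  # about 2.89x, which the changelog rounds to 3x
```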
### 2023.1.16, funasr-0.1.6 modelscope-1.2.0
- We release a new version of the model [Paraformer-large-long](https://modelscope.cn/models/damo/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary), which integrates the [VAD](https://modelscope.cn/models/damo/speech_fsmn_vad_zh-cn-16k-common-pytorch/summary) model, [ASR](https://www.modelscope.cn/models/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary),
[Punctuation](https://www.modelscope.cn/models/damo/punc_ct-transformer_zh-cn-common-vocab272727-pytorch/summary) model and timestamps. The model can take inputs that are several hours long.
- We release a new model, [16k VAD model](https://modelscope.cn/models/damo/speech_fsmn_vad_zh-cn-16k-common-pytorch/summary), which can predict the duration of non-silence speech. It can be freely integrated with any ASR model in [modelscope](https://www.modelscope.cn/models/damo/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary).
- We release a new model, [Punctuation](https://www.modelscope.cn/models/damo/punc_ct-transformer_zh-cn-common-vocab272727-pytorch/summary), which can predict punctuation for the output of ASR models. It can be freely integrated with any ASR model in [Model Zoo](docs/modelscope_models.md).
- We release a new model, [Data2vec](https://www.modelscope.cn/models/damo/speech_data2vec_pretrain-zh-cn-aishell2-16k-pytorch/summary), an unsupervised pretraining model which could be finetuned on ASR and other downstream tasks.
- We release a new model, [Paraformer-Tiny](https://www.modelscope.cn/models/damo/speech_paraformer-tiny-commandword_asr_nat-zh-cn-16k-vocab544-pytorch/summary), a lightweight Paraformer model which supports Mandarin command words recognition.
- We release a new model, [SV](https://www.modelscope.cn/models/damo/speech_xvector_sv-zh-cn-cnceleb-16k-spk3465-pytorch/summary), which could extract speaker embeddings and further perform speaker verification on paired utterances. It will be supported for speaker diarization in the future version.
- We improve the modelscope pipeline to speed up inference by integrating the model-building step into pipeline construction.
- Several new audio input types are now supported by the modelscope inference pipeline, including wav.scp, wav format, audio bytes, wave samples...
For the release notes, please refer to [news](https://github.com/alibaba-damo-academy/FunASR/releases)
## Highlights
- Many types of typical models are supported, e.g., [Transformer](https://arxiv.org/abs/1706.03762), [Conformer](https://arxiv.org/abs/2005.08100), [Paraformer](https://arxiv.org/abs/2206.08317).

View File

@ -52,7 +52,7 @@ asr_config=conf/train_asr_conformer.yaml
model_dir="baseline_$(basename "${asr_config}" .yaml)_${feats_type}_${lang}_${token_type}_${tag}"
inference_config=conf/decode_asr_transformer.yaml
inference_asr_model=valid.acc.ave_10best.pth
inference_asr_model=valid.acc.ave_10best.pb
# you can set gpu num for decoding here
gpuid_list=$CUDA_VISIBLE_DEVICES # set gpus for decoding, the same as training stage by default
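
The `model_dir` line in this hunk builds the experiment directory name from the config file name plus a few recipe variables. Below is a Python sketch of the same string construction. The values passed for `feats_type`, `lang`, `token_type`, and `tag` are placeholders chosen for illustration; the real values are set elsewhere in the script and are not visible in this hunk.

```python
import os


def build_model_dir(asr_config, feats_type, lang, token_type, tag):
    """Mirror `baseline_$(basename "${asr_config}" .yaml)_...` from the shell script."""
    name = os.path.basename(asr_config)
    # `basename FILE .yaml` drops the suffix when present (and not the whole name).
    if name.endswith(".yaml") and name != ".yaml":
        name = name[: -len(".yaml")]
    return f"baseline_{name}_{feats_type}_{lang}_{token_type}_{tag}"


# Placeholder values, for illustration only.
print(build_model_dir("conf/train_asr_conformer.yaml", "fbank", "zh", "char", "exp1"))
```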

View File

@ -55,7 +55,7 @@ asr_config=conf/train_asr_paraformer_transformer_12e_6d_3072_768.yaml
model_dir="baseline_$(basename "${asr_config}" .yaml)_${feats_type}_${lang}_${token_type}_${tag}"
inference_config=conf/decode_asr_transformer_noctc_1best.yaml
inference_asr_model=valid.acc.ave_10best.pth
inference_asr_model=valid.acc.ave_10best.pb
# you can set gpu num for decoding here
gpuid_list=$CUDA_VISIBLE_DEVICES # set gpus for decoding, the same as training stage by default

View File

@ -55,7 +55,7 @@ asr_config=conf/train_asr_transformer_12e_6d_3072_768.yaml
model_dir="baseline_$(basename "${asr_config}" .yaml)_${feats_type}_${lang}_${token_type}_${tag}"
inference_config=conf/decode_asr_transformer.yaml
inference_asr_model=valid.cer_ctc.ave_10best.pth
inference_asr_model=valid.cer_ctc.ave_10best.pb
# you can set gpu num for decoding here
gpuid_list=$CUDA_VISIBLE_DEVICES # set gpus for decoding, the same as training stage by default

View File

@ -52,7 +52,7 @@ asr_config=conf/train_asr_paraformer_conformer_12e_6d_2048_256.yaml
model_dir="baseline_$(basename "${asr_config}" .yaml)_${feats_type}_${lang}_${token_type}_${tag}"
inference_config=conf/decode_asr_transformer_noctc_1best.yaml
inference_asr_model=valid.acc.ave_10best.pth
inference_asr_model=valid.acc.ave_10best.pb
# you can set gpu num for decoding here
gpuid_list=$CUDA_VISIBLE_DEVICES # set gpus for decoding, the same as training stage by default

View File

@ -56,7 +56,7 @@ asr_config=conf/train_asr_paraformerbert_conformer_12e_6d_2048_256.yaml
model_dir="baseline_$(basename "${asr_config}" .yaml)_${feats_type}_${lang}_${token_type}_${tag}"
inference_config=conf/decode_asr_transformer_noctc_1best.yaml
inference_asr_model=valid.acc.ave_10best.pth
inference_asr_model=valid.acc.ave_10best.pb
# you can set gpu num for decoding here
gpuid_list=$CUDA_VISIBLE_DEVICES # set gpus for decoding, the same as training stage by default

View File

@ -52,7 +52,7 @@ asr_config=conf/train_asr_conformer.yaml
model_dir="baseline_$(basename "${asr_config}" .yaml)_${feats_type}_${lang}_${token_type}_${tag}"
inference_config=conf/decode_asr_transformer.yaml
inference_asr_model=valid.acc.ave_10best.pth
inference_asr_model=valid.acc.ave_10best.pb
# you can set gpu num for decoding here
gpuid_list=$CUDA_VISIBLE_DEVICES # set gpus for decoding, the same as training stage by default

View File

@ -54,7 +54,7 @@ asr_config=conf/train_asr_conformer.yaml
model_dir="baseline_$(basename "${asr_config}" .yaml)_${feats_type}_${lang}_${token_type}_${tag}"
inference_config=conf/decode_asr_transformer.yaml
inference_asr_model=valid.acc.ave_10best.pth
inference_asr_model=valid.acc.ave_10best.pb
# you can set gpu num for decoding here
gpuid_list=$CUDA_VISIBLE_DEVICES # set gpus for decoding, e.g., gpuid_list=2,3, the same as training stage by default

View File

@ -54,7 +54,7 @@ asr_config=conf/train_asr_paraformer_conformer_20e_1280_320_6d_1280_320.yaml
model_dir="baseline_$(basename "${asr_config}" .yaml)_${feats_type}_${lang}_${token_type}_${tag}"
inference_config=conf/decode_asr_transformer_noctc_1best.yaml
inference_asr_model=valid.acc.ave_10best.pth
inference_asr_model=valid.acc.ave_10best.pb
# you can set gpu num for decoding here
gpuid_list=$CUDA_VISIBLE_DEVICES # set gpus for decoding, e.g., gpuid_list=2,3, the same as training stage by default

View File

@ -58,7 +58,7 @@ asr_config=conf/train_asr_paraformerbert_conformer_20e_6d_1280_320.yaml
model_dir="baseline_$(basename "${asr_config}" .yaml)_${feats_type}_${lang}_${token_type}_${tag}"
inference_config=conf/decode_asr_transformer_noctc_1best.yaml
inference_asr_model=valid.acc.ave_10best.pth
inference_asr_model=valid.acc.ave_10best.pb
# you can set gpu num for decoding here
gpuid_list=$CUDA_VISIBLE_DEVICES # set gpus for decoding, e.g., gpuid_list=2,3, the same as training stage by default

View File

@ -54,7 +54,7 @@ asr_config=conf/train_asr_transformer.yaml
model_dir="baseline_$(basename "${asr_config}" .yaml)_${feats_type}_${lang}_${token_type}_${tag}"
inference_config=conf/decode_asr_transformer.yaml
inference_asr_model=valid.acc.ave_10best.pth
inference_asr_model=valid.acc.ave_10best.pb
# you can set gpu num for decoding here
gpuid_list=$CUDA_VISIBLE_DEVICES # set gpus for decoding, e.g., gpuid_list=2,3, the same as training stage by default

View File

@ -34,7 +34,7 @@ exp_dir=./data
tag=exp1
model_dir="baseline_$(basename "${lm_config}" .yaml)_${lang}_${token_type}_${tag}"
lm_exp=${exp_dir}/exp/${model_dir}
inference_lm=valid.loss.ave.pth # Language model path for decoding.
inference_lm=valid.loss.ave.pb # Language model path for decoding.
stage=0
stop_stage=3

View File

@ -4,7 +4,7 @@ import sys
def main():
diar_config_path = sys.argv[1] if len(sys.argv) > 1 else "sond_fbank.yaml"
diar_model_path = sys.argv[2] if len(sys.argv) > 2 else "sond.pth"
diar_model_path = sys.argv[2] if len(sys.argv) > 2 else "sond.pb"
output_dir = sys.argv[3] if len(sys.argv) > 3 else "./outputs"
data_path_and_name_and_type = [
("data/test_rmsil/feats.scp", "speech", "kaldi_ark"),

View File

@ -17,9 +17,9 @@ if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
echo "Downloading Pre-trained model..."
git clone https://www.modelscope.cn/damo/speech_xvector_sv-zh-cn-cnceleb-16k-spk3465-pytorch.git
git clone https://www.modelscope.cn/damo/speech_diarization_sond-zh-cn-alimeeting-16k-n16k4-pytorch.git
ln -s speech_xvector_sv-zh-cn-cnceleb-16k-spk3465-pytorch/sv.pth ./sv.pth
ln -s speech_xvector_sv-zh-cn-cnceleb-16k-spk3465-pytorch/sv.pb ./sv.pb
cp speech_xvector_sv-zh-cn-cnceleb-16k-spk3465-pytorch/sv.yaml ./sv.yaml
ln -s speech_diarization_sond-zh-cn-alimeeting-16k-n16k4-pytorch/sond.pth ./sond.pth
ln -s speech_diarization_sond-zh-cn-alimeeting-16k-n16k4-pytorch/sond.pb ./sond.pb
cp speech_diarization_sond-zh-cn-alimeeting-16k-n16k4-pytorch/sond_fbank.yaml ./sond_fbank.yaml
cp speech_diarization_sond-zh-cn-alimeeting-16k-n16k4-pytorch/sond.yaml ./sond.yaml
echo "Done."
@ -30,7 +30,7 @@ fi
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
echo "Calculating diarization results..."
python infer_alimeeting_test.py sond_fbank.yaml sond.pth outputs
python infer_alimeeting_test.py sond_fbank.yaml sond.pb outputs
python local/convert_label_to_rttm.py \
outputs/labels.txt \
data/test_rmsil/raw_rmsil_map.scp \

View File

@ -4,7 +4,7 @@ import os
def test_fbank_cpu_infer():
diar_config_path = "config_fbank.yaml"
diar_model_path = "sond.pth"
diar_model_path = "sond.pb"
output_dir = "./outputs"
data_path_and_name_and_type = [
("data/unit_test/test_feats.scp", "speech", "kaldi_ark"),
@ -24,7 +24,7 @@ def test_fbank_cpu_infer():
def test_fbank_gpu_infer():
diar_config_path = "config_fbank.yaml"
diar_model_path = "sond.pth"
diar_model_path = "sond.pb"
output_dir = "./outputs"
data_path_and_name_and_type = [
("data/unit_test/test_feats.scp", "speech", "kaldi_ark"),
@ -45,7 +45,7 @@ def test_fbank_gpu_infer():
def test_wav_gpu_infer():
diar_config_path = "config.yaml"
diar_model_path = "sond.pth"
diar_model_path = "sond.pb"
output_dir = "./outputs"
data_path_and_name_and_type = [
("data/unit_test/test_wav.scp", "speech", "sound"),
@ -66,7 +66,7 @@ def test_wav_gpu_infer():
def test_without_profile_gpu_infer():
diar_config_path = "config.yaml"
diar_model_path = "sond.pth"
diar_model_path = "sond.pb"
output_dir = "./outputs"
raw_inputs = [[
"data/unit_test/raw_inputs/record.wav",

View File

@ -4,7 +4,7 @@ import os
def test_fbank_cpu_infer():
diar_config_path = "sond_fbank.yaml"
diar_model_path = "sond.pth"
diar_model_path = "sond.pb"
output_dir = "./outputs"
data_path_and_name_and_type = [
("data/unit_test/test_feats.scp", "speech", "kaldi_ark"),
@ -24,7 +24,7 @@ def test_fbank_cpu_infer():
def test_fbank_gpu_infer():
diar_config_path = "sond_fbank.yaml"
diar_model_path = "sond.pth"
diar_model_path = "sond.pb"
output_dir = "./outputs"
data_path_and_name_and_type = [
("data/unit_test/test_feats.scp", "speech", "kaldi_ark"),
@ -45,7 +45,7 @@ def test_fbank_gpu_infer():
def test_wav_gpu_infer():
diar_config_path = "config.yaml"
diar_model_path = "sond.pth"
diar_model_path = "sond.pb"
output_dir = "./outputs"
data_path_and_name_and_type = [
("data/unit_test/test_wav.scp", "speech", "sound"),
@ -66,7 +66,7 @@ def test_wav_gpu_infer():
def test_without_profile_gpu_infer():
diar_config_path = "config.yaml"
diar_model_path = "sond.pth"
diar_model_path = "sond.pb"
output_dir = "./outputs"
raw_inputs = [[
"data/unit_test/raw_inputs/record.wav",

View File

@ -49,7 +49,7 @@ asr_config=conf/train_asr_conformer.yaml
model_dir="baseline_$(basename "${asr_config}" .yaml)_${feats_type}_${lang}_${token_type}_${tag}"
inference_config=conf/decode_asr_transformer.yaml
inference_asr_model=valid.acc.ave_10best.pth
inference_asr_model=valid.acc.ave_10best.pb
# you can set gpu num for decoding here
gpuid_list=$CUDA_VISIBLE_DEVICES # set gpus for decoding, the same as training stage by default

View File

@ -41,7 +41,7 @@ The decoding results can be found in `$output_dir/1best_recog/text.cer`, which i
- Modify inference related parameters in `infer_after_finetune.py`
- <strong>output_dir:</strong> # result dir
- <strong>data_dir:</strong> # the dataset dir needs to include `test/wav.scp`. If `test/text` also exists, CER will be computed
- <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pth`
- <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pb`
- Then you can run the pipeline to finetune with:
```python

View File

@ -48,5 +48,5 @@ if __name__ == '__main__':
params["required_files"] = ["am.mvn", "decoding.yaml", "configuration.json"]
params["output_dir"] = "./checkpoint"
params["data_dir"] = "./data/test"
params["decoding_model_name"] = "valid.cer_ctc.ave.pth"
params["decoding_model_name"] = "valid.cer_ctc.ave.pb"
modelscope_infer_after_finetune(params)
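
The README hunk above notes that CER is computed when `test/text` exists. For readers unfamiliar with the metric, here is a minimal sketch of character error rate as edit distance divided by reference length. It is a generic illustration, not FunASR's own scoring code, and `cer` is a hypothetical function name.

```python
def cer(reference, hypothesis):
    """Character error rate: Levenshtein distance divided by reference length."""
    ref, hyp = list(reference), list(hypothesis)
    # prev[j] holds the edit distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / max(len(ref), 1)


print(cer("abcd", "abxd"))  # one substitution over four characters
```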

View File

@ -41,7 +41,7 @@ The decoding results can be found in `$output_dir/1best_recog/text.cer`, which i
- Modify inference related parameters in `infer_after_finetune.py`
- <strong>output_dir:</strong> # result dir
- <strong>data_dir:</strong> # the dataset dir needs to include `test/wav.scp`. If `test/text` also exists, CER will be computed
- <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pth`
- <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pb`
- Then you can run the pipeline to finetune with:
```python

View File

@ -48,5 +48,5 @@ if __name__ == '__main__':
params["required_files"] = ["am.mvn", "decoding.yaml", "configuration.json"]
params["output_dir"] = "./checkpoint"
params["data_dir"] = "./data/test"
params["decoding_model_name"] = "valid.cer_ctc.ave.pth"
params["decoding_model_name"] = "valid.cer_ctc.ave.pb"
modelscope_infer_after_finetune(params)

View File

@ -41,7 +41,7 @@ The decoding results can be found in `$output_dir/1best_recog/text.sp.cer` and `
- Modify inference related parameters in `infer_after_finetune.py`
- <strong>output_dir:</strong> # result dir
- <strong>data_dir:</strong> # the dataset dir needs to include `test/wav.scp`. If `test/text` also exists, CER will be computed
- <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pth`
- <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pb`
- Then you can run the pipeline to finetune with:
```python

View File

@ -63,5 +63,5 @@ if __name__ == '__main__':
params["required_files"] = ["feats_stats.npz", "decoding.yaml", "configuration.json"]
params["output_dir"] = "./checkpoint"
params["data_dir"] = "./example_data/validation"
params["decoding_model_name"] = "valid.acc.ave.pth"
params["decoding_model_name"] = "valid.acc.ave.pb"
modelscope_infer_after_finetune(params)

View File

@ -49,5 +49,5 @@ if __name__ == '__main__':
params["required_files"] = ["am.mvn", "decoding.yaml", "configuration.json"]
params["output_dir"] = "./checkpoint"
params["data_dir"] = "./data/test"
params["decoding_model_name"] = "valid.acc.ave_10best.pth"
params["decoding_model_name"] = "valid.acc.ave_10best.pb"
modelscope_infer_after_finetune(params)

View File

@ -49,5 +49,5 @@ if __name__ == '__main__':
params["required_files"] = ["am.mvn", "decoding.yaml", "configuration.json"]
params["output_dir"] = "./checkpoint"
params["data_dir"] = "./data/test"
params["decoding_model_name"] = "valid.acc.ave_10best.pth"
params["decoding_model_name"] = "valid.acc.ave_10best.pb"
modelscope_infer_after_finetune(params)

View File

@ -41,7 +41,7 @@ The decoding results can be found in `$output_dir/1best_recog/text.cer`, which i
- Modify inference related parameters in `infer_after_finetune.py`
- <strong>output_dir:</strong> # result dir
- <strong>data_dir:</strong> # the dataset dir needs to include `test/wav.scp`. If `test/text` also exists, CER will be computed
- <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pth`
- <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pb`
- Then you can run the pipeline to finetune with:
```python

View File

@ -49,5 +49,5 @@ if __name__ == '__main__':
params["required_files"] = ["am.mvn", "decoding.yaml", "configuration.json"]
params["output_dir"] = "./checkpoint"
params["data_dir"] = "./data/test"
params["decoding_model_name"] = "valid.acc.ave_10best.pth"
params["decoding_model_name"] = "valid.acc.ave_10best.pb"
modelscope_infer_after_finetune(params)

View File

@ -49,5 +49,5 @@ if __name__ == '__main__':
params["required_files"] = ["am.mvn", "decoding.yaml", "configuration.json"]
params["output_dir"] = "./checkpoint"
params["data_dir"] = "./data/test"
params["decoding_model_name"] = "valid.acc.ave_10best.pth"
params["decoding_model_name"] = "valid.acc.ave_10best.pb"
modelscope_infer_after_finetune(params)

View File

@ -41,7 +41,7 @@ The decoding results can be found in `$output_dir/1best_recog/text.cer`, which i
- Modify inference related parameters in `infer_after_finetune.py`
- <strong>output_dir:</strong> # result dir
- <strong>data_dir:</strong> # the dataset dir needs to include `test/wav.scp`. If `test/text` also exists, CER will be computed
- <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pth`
- <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pb`
- Then you can run the pipeline to finetune with:
```python

View File

@ -50,5 +50,5 @@ if __name__ == '__main__':
params["required_files"] = ["am.mvn", "decoding.yaml", "configuration.json"]
params["output_dir"] = "./checkpoint"
params["data_dir"] = "./data/test"
params["decoding_model_name"] = "20epoch.pth"
params["decoding_model_name"] = "20epoch.pb"
modelscope_infer_after_finetune(params)

View File

@ -41,7 +41,7 @@ The decoding results can be found in `$output_dir/1best_recog/text.cer`, which i
- Modify inference related parameters in `infer_after_finetune.py`
- <strong>output_dir:</strong> # result dir
- <strong>data_dir:</strong> # the dataset dir needs to include `test/wav.scp`. If `test/text` also exists, CER will be computed
- <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pth`
- <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pb`
- Then you can run the pipeline to finetune with:
```python

View File

@ -50,5 +50,5 @@ if __name__ == '__main__':
params["required_files"] = ["am.mvn", "decoding.yaml", "configuration.json"]
params["output_dir"] = "./checkpoint"
params["data_dir"] = "./data/test"
params["decoding_model_name"] = "20epoch.pth"
params["decoding_model_name"] = "20epoch.pb"
modelscope_infer_after_finetune(params)

View File

@ -0,0 +1,35 @@
import os
from modelscope.metainfo import Trainers
from modelscope.trainers import build_trainer
from funasr.datasets.ms_dataset import MsDataset
def modelscope_finetune(params):
if not os.path.exists(params["output_dir"]):
os.makedirs(params["output_dir"], exist_ok=True)
# dataset split ["train", "validation"]
ds_dict = MsDataset.load(params["data_dir"])
kwargs = dict(
model=params["model"],
model_revision=params["model_revision"],
data_dir=ds_dict,
dataset_type=params["dataset_type"],
work_dir=params["output_dir"],
batch_bins=params["batch_bins"],
max_epoch=params["max_epoch"],
lr=params["lr"])
trainer = build_trainer(Trainers.speech_asr_trainer, default_args=kwargs)
trainer.train()
if __name__ == '__main__':
params = {}
params["output_dir"] = "./checkpoint"
params["data_dir"] = "./data"
params["batch_bins"] = 2000
params["dataset_type"] = "small"
params["max_epoch"] = 50
params["lr"] = 0.00005
params["model"] = "damo/speech_UniASR_asr_2pass-he-16k-common-vocab1085-pytorch"
params["model_revision"] = None
modelscope_finetune(params)

View File

@ -0,0 +1,13 @@
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks
if __name__ == "__main__":
audio_in = "https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/asr_example_he.wav"
output_dir = "./results"
inference_pipline = pipeline(
task=Tasks.auto_speech_recognition,
model="damo/speech_UniASR_asr_2pass-he-16k-common-vocab1085-pytorch",
output_dir=output_dir,
)
rec_result = inference_pipline(audio_in=audio_in, param_dict={"decoding_model":"offline"})
print(rec_result)

View File

@ -41,7 +41,7 @@ The decoding results can be found in `$output_dir/1best_recog/text.cer`, which i
- Modify inference related parameters in `infer_after_finetune.py`
- <strong>output_dir:</strong> # result dir
- <strong>data_dir:</strong> # the dataset dir needs to include `test/wav.scp`. If `test/text` also exists, CER will be computed
- <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pth`
- <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pb`
- Then you can run the pipeline to finetune with:
```python

View File

@ -50,5 +50,5 @@ if __name__ == '__main__':
params["required_files"] = ["am.mvn", "decoding.yaml", "configuration.json"]
params["output_dir"] = "./checkpoint"
params["data_dir"] = "./data/test"
params["decoding_model_name"] = "20epoch.pth"
params["decoding_model_name"] = "20epoch.pb"
modelscope_infer_after_finetune(params)

View File

@ -0,0 +1,35 @@
import os
from modelscope.metainfo import Trainers
from modelscope.trainers import build_trainer
from funasr.datasets.ms_dataset import MsDataset
def modelscope_finetune(params):
if not os.path.exists(params["output_dir"]):
os.makedirs(params["output_dir"], exist_ok=True)
# dataset split ["train", "validation"]
ds_dict = MsDataset.load(params["data_dir"])
kwargs = dict(
model=params["model"],
model_revision=params["model_revision"],
data_dir=ds_dict,
dataset_type=params["dataset_type"],
work_dir=params["output_dir"],
batch_bins=params["batch_bins"],
max_epoch=params["max_epoch"],
lr=params["lr"])
trainer = build_trainer(Trainers.speech_asr_trainer, default_args=kwargs)
trainer.train()
if __name__ == '__main__':
params = {}
params["output_dir"] = "./checkpoint"
params["data_dir"] = "./data"
params["batch_bins"] = 2000
params["dataset_type"] = "small"
params["max_epoch"] = 50
params["lr"] = 0.00005
params["model"] = "damo/speech_UniASR_asr_2pass-my-16k-common-vocab696-pytorch"
params["model_revision"] = None
modelscope_finetune(params)

View File

@ -0,0 +1,13 @@
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks
if __name__ == "__main__":
audio_in = "https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/asr_example_my.wav"
output_dir = "./results"
inference_pipline = pipeline(
task=Tasks.auto_speech_recognition,
model="damo/speech_UniASR_asr_2pass-my-16k-common-vocab696-pytorch",
output_dir=output_dir,
)
rec_result = inference_pipline(audio_in=audio_in, param_dict={"decoding_model":"offline"})
print(rec_result)

View File

@ -0,0 +1,35 @@
import os
from modelscope.metainfo import Trainers
from modelscope.trainers import build_trainer
from funasr.datasets.ms_dataset import MsDataset
def modelscope_finetune(params):
if not os.path.exists(params["output_dir"]):
os.makedirs(params["output_dir"], exist_ok=True)
# dataset split ["train", "validation"]
ds_dict = MsDataset.load(params["data_dir"])
kwargs = dict(
model=params["model"],
model_revision=params["model_revision"],
data_dir=ds_dict,
dataset_type=params["dataset_type"],
work_dir=params["output_dir"],
batch_bins=params["batch_bins"],
max_epoch=params["max_epoch"],
lr=params["lr"])
trainer = build_trainer(Trainers.speech_asr_trainer, default_args=kwargs)
trainer.train()
if __name__ == '__main__':
params = {}
params["output_dir"] = "./checkpoint"
params["data_dir"] = "./data"
params["batch_bins"] = 2000
params["dataset_type"] = "small"
params["max_epoch"] = 50
params["lr"] = 0.00005
params["model"] = "damo/speech_UniASR_asr_2pass-ur-16k-common-vocab877-pytorch"
params["model_revision"] = None
modelscope_finetune(params)

View File

@ -0,0 +1,13 @@
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks
if __name__ == "__main__":
audio_in = "https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/asr_example_ur.wav"
output_dir = "./results"
inference_pipline = pipeline(
task=Tasks.auto_speech_recognition,
model="damo/speech_UniASR_asr_2pass-ur-16k-common-vocab877-pytorch",
output_dir=output_dir,
)
rec_result = inference_pipline(audio_in=audio_in, param_dict={"decoding_model":"offline"})
print(rec_result)

View File

@ -41,7 +41,7 @@ The decoding results can be found in `$output_dir/1best_recog/text.cer`, which i
- Modify inference related parameters in `infer_after_finetune.py`
- <strong>output_dir:</strong> # result dir
- <strong>data_dir:</strong> # the dataset dir needs to include `test/wav.scp`. If `test/text` also exists, CER will be computed
- <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pth`
- <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pb`
- Then you can run the pipeline to finetune with:
```python

View File

@ -49,5 +49,5 @@ if __name__ == '__main__':
params["required_files"] = ["am.mvn", "decoding.yaml", "configuration.json"]
params["output_dir"] = "./checkpoint"
params["data_dir"] = "./data/test"
params["decoding_model_name"] = "20epoch.pth"
params["decoding_model_name"] = "20epoch.pb"
modelscope_infer_after_finetune(params)

View File

@ -41,7 +41,8 @@ The decoding results can be found in `$output_dir/1best_recog/text.cer`, which i
- Modify inference related parameters in `infer_after_finetune.py`
- <strong>output_dir:</strong> # result dir
- <strong>data_dir:</strong> # the dataset dir needs to include `test/wav.scp`. If `test/text` also exists, CER will be computed
- <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pth`
- <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave
.pb`
- Then you can run the pipeline to finetune with:
```python

View File

@ -49,5 +49,5 @@ if __name__ == '__main__':
params["required_files"] = ["am.mvn", "decoding.yaml", "configuration.json"]
params["output_dir"] = "./checkpoint"
params["data_dir"] = "./data/test"
params["decoding_model_name"] = "20epoch.pth"
params["decoding_model_name"] = "20epoch.pb"
modelscope_infer_after_finetune(params)

View File

@ -34,7 +34,7 @@ Or you can use the finetuned model for inference directly.
- Modify inference related parameters in `infer_after_finetune.py`
- <strong>output_dir:</strong> # result dir
- <strong>data_dir:</strong> # the dataset dir needs to include `test/wav.scp`. If `test/text` also exists, CER will be computed
- <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pth`
- <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pb`
- Then you can run the pipeline to finetune with:
```python

View File

@ -53,5 +53,5 @@ if __name__ == '__main__':
params["required_files"] = ["am.mvn", "decoding.yaml", "configuration.json", "punc/punc.pb", "punc/punc.yaml", "vad/vad.mvn", "vad/vad.pb", "vad/vad.yaml"]
params["output_dir"] = "./checkpoint"
params["data_dir"] = "./data/test"
params["decoding_model_name"] = "valid.acc.ave_10best.pth"
params["decoding_model_name"] = "valid.acc.ave_10best.pb"
modelscope_infer_after_finetune(params)

View File

@ -0,0 +1,10 @@
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks
inference_diar_pipline = pipeline(
task=Tasks.speaker_diarization,
model='damo/speech_diarization_eend-ola-en-us-callhome-8k',
model_revision="v1.0.0",
)
results = inference_diar_pipline(audio_in=["https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_data/record2.wav"])
print(results)

View File

@ -14,13 +14,12 @@ inference_diar_pipline = pipeline(
)
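# Use audio_list as input: the first audio is the speech to be diarized, and the following audios are the voiceprint enrollment utterances of different speakers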
audio_list = [[
audio_list = [
"https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_data/record.wav",
"https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_data/spk_A.wav",
"https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_data/spk_B.wav",
"https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_data/spk_B1.wav"
]]
]
results = inference_diar_pipline(audio_in=audio_list)
for rst in results:
print(rst["value"])
print(results)

View File

@ -52,7 +52,7 @@ class Speech2Text:
Examples:
>>> import soundfile
>>> speech2text = Speech2Text("asr_config.yml", "asr.pth")
>>> speech2text = Speech2Text("asr_config.yml", "asr.pb")
>>> audio, rate = soundfile.read("speech.wav")
>>> speech2text(audio)
[(text, token, token_int, hypothesis object), ...]

View File

@ -55,7 +55,7 @@ class Speech2Text:
Examples:
>>> import soundfile
>>> speech2text = Speech2Text("asr_config.yml", "asr.pth")
>>> speech2text = Speech2Text("asr_config.yml", "asr.pb")
>>> audio, rate = soundfile.read("speech.wav")
>>> speech2text(audio)
[(text, token, token_int, hypothesis object), ...]

View File

@ -50,7 +50,7 @@ class Speech2Text:
Examples:
>>> import soundfile
>>> speech2text = Speech2Text("asr_config.yml", "asr.pth")
>>> speech2text = Speech2Text("asr_config.yml", "asr.pb")
>>> audio, rate = soundfile.read("speech.wav")
>>> speech2text(audio)
[(text, token, token_int, hypothesis object), ...]

View File

@ -58,7 +58,7 @@ class Speech2Text:
Examples:
>>> import soundfile
>>> speech2text = Speech2Text("asr_config.yml", "asr.pth")
>>> speech2text = Speech2Text("asr_config.yml", "asr.pb")
>>> audio, rate = soundfile.read("speech.wav")
>>> speech2text(audio)
[(text, token, token_int, hypothesis object), ...]

View File

@ -49,7 +49,7 @@ class Speech2Text:
Examples:
>>> import soundfile
>>> speech2text = Speech2Text("asr_config.yml", "asr.pth")
>>> speech2text = Speech2Text("asr_config.yml", "asr.pb")
>>> audio, rate = soundfile.read("speech.wav")
>>> speech2text(audio)
[(text, token, token_int, hypothesis object), ...]

View File

@ -46,7 +46,7 @@ class Speech2Text:
Examples:
>>> import soundfile
>>> speech2text = Speech2Text("asr_config.yml", "asr.pth")
>>> speech2text = Speech2Text("asr_config.yml", "asr.pb")
>>> audio, rate = soundfile.read("speech.wav")
>>> speech2text(audio)
[(text, token, token_int, hypothesis object), ...]

View File

@ -46,7 +46,7 @@ class Speech2Text:
Examples:
>>> import soundfile
>>> speech2text = Speech2Text("asr_config.yml", "asr.pth")
>>> speech2text = Speech2Text("asr_config.yml", "asr.pb")
>>> audio, rate = soundfile.read("speech.wav")
>>> speech2text(audio)
[(text, token, token_int, hypothesis object), ...]

View File

@ -133,7 +133,7 @@ def inference_launch(mode, **kwargs):
param_dict = {
"extract_profile": True,
"sv_train_config": "sv.yaml",
"sv_model_file": "sv.pth",
"sv_model_file": "sv.pb",
}
if "param_dict" in kwargs and kwargs["param_dict"] is not None:
for key in param_dict:
@ -142,6 +142,9 @@ def inference_launch(mode, **kwargs):
else:
kwargs["param_dict"] = param_dict
return inference_modelscope(mode=mode, **kwargs)
elif mode == "eend-ola":
from funasr.bin.eend_ola_inference import inference_modelscope
return inference_modelscope(mode=mode, **kwargs)
else:
logging.info("Unknown decoding mode: {}".format(mode))
return None

View File

@ -16,6 +16,7 @@ from typing import Union
import numpy as np
import torch
from scipy.signal import medfilt
from typeguard import check_argument_types
from funasr.models.frontend.wav_frontend import WavFrontendMel23
@ -34,7 +35,7 @@ class Speech2Diarization:
Examples:
>>> import soundfile
>>> import numpy as np
>>> speech2diar = Speech2Diarization("diar_sond_config.yml", "diar_sond.pth")
>>> speech2diar = Speech2Diarization("diar_sond_config.yml", "diar_sond.pb")
>>> profile = np.load("profiles.npy")
>>> audio, rate = soundfile.read("speech.wav")
>>> speech2diar(audio, profile)
@ -146,7 +147,7 @@ def inference_modelscope(
output_dir: Optional[str] = None,
batch_size: int = 1,
dtype: str = "float32",
ngpu: int = 0,
ngpu: int = 1,
num_workers: int = 0,
log_level: Union[int, str] = "INFO",
key_file: Optional[str] = None,
@ -179,7 +180,6 @@ def inference_modelscope(
diar_model_file=diar_model_file,
device=device,
dtype=dtype,
streaming=streaming,
)
logging.info("speech2diarization_kwargs: {}".format(speech2diar_kwargs))
speech2diar = Speech2Diarization.from_pretrained(
@ -209,7 +209,7 @@ def inference_modelscope(
if data_path_and_name_and_type is None and raw_inputs is not None:
if isinstance(raw_inputs, torch.Tensor):
raw_inputs = raw_inputs.numpy()
data_path_and_name_and_type = [raw_inputs, "speech", "waveform"]
data_path_and_name_and_type = [raw_inputs[0], "speech", "sound"]
loader = EENDOLADiarTask.build_streaming_iterator(
data_path_and_name_and_type,
dtype=dtype,
@ -236,9 +236,23 @@ def inference_modelscope(
# batch = {k: v[0] for k, v in batch.items() if not k.endswith("_lengths")}
results = speech2diar(**batch)
# post process
a = results[0][0].cpu().numpy()
a = medfilt(a, (11, 1))
rst = []
for spkid, frames in enumerate(a.T):
frames = np.pad(frames, (1, 1), 'constant')
changes, = np.where(np.diff(frames, axis=0) != 0)
fmt = "SPEAKER {:s} 1 {:7.2f} {:7.2f} <NA> <NA> {:s} <NA>"
for s, e in zip(changes[::2], changes[1::2]):
st = s / 10.
dur = (e - s) / 10.
rst.append(fmt.format(keys[0], st, dur, "{}_{}".format(keys[0], str(spkid))))
# Only supporting batch_size==1
key, value = keys[0], output_results_str(results, keys[0])
item = {"key": key, "value": value}
value = "\n".join(rst)
item = {"key": keys[0], "value": value}
result_list.append(item)
if output_path is not None:
output_writer.write(value)
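The post-processing added in this hunk turns frame-level speaker decisions into RTTM-style lines. The following is a minimal pure-Python sketch of that idea, not the code from the commit: the function name `frames_to_rttm` and the `frames_per_second` parameter are hypothetical, the median filter (`medfilt`) is left out, and the rate of 10 frames per second is only inferred from the `/ 10.` divisions above.

```python
def frames_to_rttm(key, activity, frames_per_second=10.0):
    """Convert per-speaker 0/1 frame decisions into RTTM-style lines.

    `activity` is a list of frames; each frame holds one 0/1 entry per speaker.
    """
    fmt = "SPEAKER {:s} 1 {:7.2f} {:7.2f} <NA> <NA> {:s} <NA>"
    lines = []
    n_speakers = len(activity[0]) if activity else 0
    for spkid in range(n_speakers):
        # Pad with silence on both sides so every segment has a start and an end.
        frames = [0] + [int(frame[spkid]) for frame in activity] + [0]
        # Indices where the activity flips (0 -> 1 or 1 -> 0).
        changes = [i for i in range(len(frames) - 1) if frames[i] != frames[i + 1]]
        # Flips alternate between segment starts and segment ends.
        for s, e in zip(changes[::2], changes[1::2]):
            start = s / frames_per_second
            dur = (e - s) / frames_per_second
            lines.append(fmt.format(key, start, dur, "{}_{}".format(key, spkid)))
    return lines


# Two speakers, six frames; speaker 0 talks in frames 1-3, speaker 1 in frames 3-4.
activity = [[0, 0], [1, 0], [1, 0], [1, 1], [0, 1], [0, 0]]
rttm = frames_to_rttm("rec1", activity)
```

Padding each speaker track with a zero on both ends guarantees an even number of flips, so pairing them as (start, end) is safe even when a speaker is active in the first or last frame.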

View File

@ -42,7 +42,7 @@ class Speech2Diarization:
Examples:
>>> import soundfile
>>> import numpy as np
>>> speech2diar = Speech2Diarization("diar_sond_config.yml", "diar_sond.pth")
>>> speech2diar = Speech2Diarization("diar_sond_config.yml", "diar_sond.pb")
>>> profile = np.load("profiles.npy")
>>> audio, rate = soundfile.read("speech.wav")
>>> speech2diar(audio, profile)

View File

@ -36,7 +36,7 @@ class Speech2Xvector:
Examples:
>>> import soundfile
>>> speech2xvector = Speech2Xvector("sv_config.yml", "sv.pth")
>>> speech2xvector = Speech2Xvector("sv_config.yml", "sv.pb")
>>> audio, rate = soundfile.read("speech.wav")
>>> speech2xvector(audio)
[(text, token, token_int, hypothesis object), ...]
@ -169,7 +169,7 @@ def inference_modelscope(
log_level: Union[int, str] = "INFO",
key_file: Optional[str] = None,
sv_train_config: Optional[str] = "sv.yaml",
sv_model_file: Optional[str] = "sv.pth",
sv_model_file: Optional[str] = "sv.pb",
model_tag: Optional[str] = None,
allow_variable_data_keys: bool = True,
streaming: bool = False,

View File

@ -66,13 +66,13 @@ def average_nbest_models(
elif n == 1:
# The averaged model is same as the best model
e, _ = epoch_and_values[0]
op = output_dir / f"{e}epoch.pth"
sym_op = output_dir / f"{ph}.{cr}.ave_1best.{suffix}pth"
op = output_dir / f"{e}epoch.pb"
sym_op = output_dir / f"{ph}.{cr}.ave_1best.{suffix}pb"
if sym_op.is_symlink() or sym_op.exists():
sym_op.unlink()
sym_op.symlink_to(op.name)
else:
op = output_dir / f"{ph}.{cr}.ave_{n}best.{suffix}pth"
op = output_dir / f"{ph}.{cr}.ave_{n}best.{suffix}pb"
logging.info(
f"Averaging {n}best models: " f'criterion="{ph}.{cr}": {op}'
)
@ -83,12 +83,12 @@ def average_nbest_models(
if e not in _loaded:
if oss_bucket is None:
_loaded[e] = torch.load(
output_dir / f"{e}epoch.pth",
output_dir / f"{e}epoch.pb",
map_location="cpu",
)
else:
buffer = BytesIO(
oss_bucket.get_object(os.path.join(pai_output_dir, f"{e}epoch.pth")).read())
oss_bucket.get_object(os.path.join(pai_output_dir, f"{e}epoch.pb")).read())
_loaded[e] = torch.load(buffer)
states = _loaded[e]
@ -115,13 +115,13 @@ def average_nbest_models(
else:
buffer = BytesIO()
torch.save(avg, buffer)
oss_bucket.put_object(os.path.join(pai_output_dir, f"{ph}.{cr}.ave_{n}best.{suffix}pth"),
oss_bucket.put_object(os.path.join(pai_output_dir, f"{ph}.{cr}.ave_{n}best.{suffix}pb"),
buffer.getvalue())
# 3. *.*.ave.pth is a symlink to the max ave model
# 3. *.*.ave.pb is a symlink to the max ave model
if oss_bucket is None:
op = output_dir / f"{ph}.{cr}.ave_{max(_nbests)}best.{suffix}pth"
sym_op = output_dir / f"{ph}.{cr}.ave.{suffix}pth"
op = output_dir / f"{ph}.{cr}.ave_{max(_nbests)}best.{suffix}pb"
sym_op = output_dir / f"{ph}.{cr}.ave.{suffix}pb"
if sym_op.is_symlink() or sym_op.exists():
sym_op.unlink()
sym_op.symlink_to(op.name)

View File

@ -191,12 +191,12 @@ def unpack(
Examples:
tarfile:
model.pth
model.pb
some1.file
some2.file
>>> unpack("tarfile", "out")
{'asr_model_file': 'out/model.pth'}
{'asr_model_file': 'out/model.pb'}
"""
input_archive = Path(input_archive)
outpath = Path(outpath)

View File

@ -52,15 +52,15 @@ class DiarEENDOLAModel(AbsESPnetModel):
super().__init__()
self.frontend = frontend
self.encoder = encoder
self.encoder_decoder_attractor = encoder_decoder_attractor
self.enc = encoder
self.eda = encoder_decoder_attractor
self.attractor_loss_weight = attractor_loss_weight
self.max_n_speaker = max_n_speaker
if mapping_dict is None:
mapping_dict = generate_mapping_dict(max_speaker_num=self.max_n_speaker)
self.mapping_dict = mapping_dict
# PostNet
self.PostNet = nn.LSTM(self.max_n_speaker, n_units, 1, batch_first=True)
self.postnet = nn.LSTM(self.max_n_speaker, n_units, 1, batch_first=True)
self.output_layer = nn.Linear(n_units, mapping_dict['oov'] + 1)
def forward_encoder(self, xs, ilens):
@ -68,7 +68,7 @@ class DiarEENDOLAModel(AbsESPnetModel):
pad_shape = xs.shape
xs_mask = [torch.ones(ilen).to(xs.device) for ilen in ilens]
xs_mask = torch.nn.utils.rnn.pad_sequence(xs_mask, batch_first=True, padding_value=0).unsqueeze(-2)
emb = self.encoder(xs, xs_mask)
emb = self.enc(xs, xs_mask)
emb = torch.split(emb.view(pad_shape[0], pad_shape[1], -1), 1, dim=0)
emb = [e[0][:ilen] for e, ilen in zip(emb, ilens)]
return emb
@ -76,8 +76,8 @@ class DiarEENDOLAModel(AbsESPnetModel):
def forward_post_net(self, logits, ilens):
maxlen = torch.max(ilens).to(torch.int).item()
logits = nn.utils.rnn.pad_sequence(logits, batch_first=True, padding_value=-1)
logits = nn.utils.rnn.pack_padded_sequence(logits, ilens, batch_first=True, enforce_sorted=False)
outputs, (_, _) = self.PostNet(logits)
logits = nn.utils.rnn.pack_padded_sequence(logits, ilens.cpu().to(torch.int64), batch_first=True, enforce_sorted=False)
outputs, (_, _) = self.postnet(logits)
outputs = nn.utils.rnn.pad_packed_sequence(outputs, batch_first=True, padding_value=-1, total_length=maxlen)[0]
outputs = [output[:ilens[i].to(torch.int).item()] for i, output in enumerate(outputs)]
outputs = [self.output_layer(output) for output in outputs]
@ -112,7 +112,7 @@ class DiarEENDOLAModel(AbsESPnetModel):
text = text[:, : text_lengths.max()]
# 1. Encoder
encoder_out, encoder_out_lens = self.encode(speech, speech_lengths)
encoder_out, encoder_out_lens = self.enc(speech, speech_lengths)
intermediate_outs = None
if isinstance(encoder_out, tuple):
intermediate_outs = encoder_out[1]
@ -190,18 +190,16 @@ class DiarEENDOLAModel(AbsESPnetModel):
shuffle: bool = True,
threshold: float = 0.5,
**kwargs):
if self.frontend is not None:
speech = self.frontend(speech)
speech = [s[:s_len] for s, s_len in zip(speech, speech_lengths)]
emb = self.forward_encoder(speech, speech_lengths)
if shuffle:
orders = [np.arange(e.shape[0]) for e in emb]
for order in orders:
np.random.shuffle(order)
attractors, probs = self.encoder_decoder_attractor.estimate(
attractors, probs = self.eda.estimate(
[e[torch.from_numpy(order).to(torch.long).to(speech[0].device)] for e, order in zip(emb, orders)])
else:
attractors, probs = self.encoder_decoder_attractor.estimate(emb)
attractors, probs = self.eda.estimate(emb)
attractors_active = []
for p, att, e in zip(probs, attractors, emb):
if n_speakers and n_speakers >= 0:
@ -233,10 +231,23 @@ class DiarEENDOLAModel(AbsESPnetModel):
pred[i] = pred[i - 1]
else:
pred[i] = 0
pred = [self.reporter.inv_mapping_func(i, self.mapping_dict) for i in pred]
pred = [self.inv_mapping_func(i) for i in pred]
decisions = [bin(num)[2:].zfill(self.max_n_speaker)[::-1] for num in pred]
decisions = torch.from_numpy(
np.stack([np.array([int(i) for i in dec]) for dec in decisions], axis=0)).to(logit.device).to(
torch.float32)
decisions = decisions[:, :n_speaker]
return decisions
def inv_mapping_func(self, label):
if not isinstance(label, int):
label = int(label)
if label in self.mapping_dict['label2dec'].keys():
num = self.mapping_dict['label2dec'][label]
else:
num = -1
return num
def collect_feats(self, **batch: torch.Tensor) -> Dict[str, torch.Tensor]:
pass
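The `decisions` expression in this hunk expands an integer from the power-set encoding into one 0/1 flag per speaker. A small standalone sketch of that bit manipulation is shown here; `label_to_decisions` is a hypothetical helper name, and the guard against negative inputs is an addition for illustration (the hunk above replaces out-of-vocabulary predictions before this step).

```python
def label_to_decisions(num, max_n_speaker):
    """Expand a power-set integer into per-speaker 0/1 flags (speaker 0 first)."""
    if num < 0:
        raise ValueError("expected a non-negative power-set value")
    # bin(5) == '0b101'; drop the prefix, left-pad to max_n_speaker bits,
    # then reverse so that the least significant bit maps to speaker 0.
    bits = bin(num)[2:].zfill(max_n_speaker)[::-1]
    return [int(b) for b in bits]


# 5 == 0b0101, so speakers 0 and 2 are active.
decisions = label_to_decisions(5, 4)
```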

View File

@ -1,14 +1,15 @@
# Copyright (c) Alibaba, Inc. and its affiliates.
# Part of the implementation is borrowed from espnet/espnet.
from abc import ABC
from typing import Tuple
import numpy as np
import torch
import torchaudio.compliance.kaldi as kaldi
from funasr.models.frontend.abs_frontend import AbsFrontend
from typeguard import check_argument_types
from torch.nn.utils.rnn import pad_sequence
from typeguard import check_argument_types
import funasr.models.frontend.eend_ola_feature as eend_ola_feature
from funasr.models.frontend.abs_frontend import AbsFrontend
def load_cmvn(cmvn_file):
@ -275,7 +276,8 @@ class WavFrontendOnline(AbsFrontend):
# inputs tensor has catted the cache tensor
# def apply_lfr(inputs: torch.Tensor, lfr_m: int, lfr_n: int, inputs_lfr_cache: torch.Tensor = None,
# is_final: bool = False) -> Tuple[torch.Tensor, torch.Tensor, int]:
def apply_lfr(inputs: torch.Tensor, lfr_m: int, lfr_n: int, is_final: bool = False) -> Tuple[torch.Tensor, torch.Tensor, int]:
def apply_lfr(inputs: torch.Tensor, lfr_m: int, lfr_n: int, is_final: bool = False) -> Tuple[
torch.Tensor, torch.Tensor, int]:
"""
Apply lfr with data
"""
@ -376,7 +378,8 @@ class WavFrontendOnline(AbsFrontend):
if self.lfr_m != 1 or self.lfr_n != 1:
# update self.lfr_splice_cache in self.apply_lfr
# mat, self.lfr_splice_cache[i], lfr_splice_frame_idx = self.apply_lfr(mat, self.lfr_m, self.lfr_n, self.lfr_splice_cache[i],
mat, self.lfr_splice_cache[i], lfr_splice_frame_idx = self.apply_lfr(mat, self.lfr_m, self.lfr_n, is_final)
mat, self.lfr_splice_cache[i], lfr_splice_frame_idx = self.apply_lfr(mat, self.lfr_m, self.lfr_n,
is_final)
if self.cmvn_file is not None:
mat = self.apply_cmvn(mat, self.cmvn)
feat_length = mat.size(0)
@ -398,9 +401,10 @@ class WavFrontendOnline(AbsFrontend):
assert batch_size == 1, 'we support to extract feature online only when the batch size is equal to 1 now'
waveforms, feats, feats_lengths = self.forward_fbank(input, input_lengths) # input shape: B T D
if feats.shape[0]:
#if self.reserve_waveforms is None and self.lfr_m > 1:
# if self.reserve_waveforms is None and self.lfr_m > 1:
# self.reserve_waveforms = waveforms[:, :(self.lfr_m - 1) // 2 * self.frame_shift_sample_length]
self.waveforms = waveforms if self.reserve_waveforms is None else torch.cat((self.reserve_waveforms, waveforms), dim=1)
self.waveforms = waveforms if self.reserve_waveforms is None else torch.cat(
(self.reserve_waveforms, waveforms), dim=1)
if not self.lfr_splice_cache: # initialize splice_cache
for i in range(batch_size):
self.lfr_splice_cache.append(feats[i][0, :].unsqueeze(dim=0).repeat((self.lfr_m - 1) // 2, 1))
@ -409,7 +413,8 @@ class WavFrontendOnline(AbsFrontend):
lfr_splice_cache_tensor = torch.stack(self.lfr_splice_cache) # B T D
feats = torch.cat((lfr_splice_cache_tensor, feats), dim=1)
feats_lengths += lfr_splice_cache_tensor[0].shape[0]
frame_from_waveforms = int((self.waveforms.shape[1] - self.frame_sample_length) / self.frame_shift_sample_length + 1)
frame_from_waveforms = int(
(self.waveforms.shape[1] - self.frame_sample_length) / self.frame_shift_sample_length + 1)
minus_frame = (self.lfr_m - 1) // 2 if self.reserve_waveforms is None else 0
feats, feats_lengths, lfr_splice_frame_idxs = self.forward_lfr_cmvn(feats, feats_lengths, is_final)
if self.lfr_m == 1:
@ -423,14 +428,15 @@ class WavFrontendOnline(AbsFrontend):
self.waveforms = self.waveforms[:, :sample_length]
else:
# update self.reserve_waveforms and self.lfr_splice_cache
self.reserve_waveforms = self.waveforms[:, :-(self.frame_sample_length - self.frame_shift_sample_length)]
self.reserve_waveforms = self.waveforms[:,
:-(self.frame_sample_length - self.frame_shift_sample_length)]
for i in range(batch_size):
self.lfr_splice_cache[i] = torch.cat((self.lfr_splice_cache[i], feats[i]), dim=0)
return torch.empty(0), feats_lengths
else:
if is_final:
self.waveforms = waveforms if self.reserve_waveforms is None else self.reserve_waveforms
feats = torch.stack(self.lfr_splice_cache)
feats = torch.stack(self.lfr_splice_cache)
feats_lengths = torch.zeros(batch_size, dtype=torch.int) + feats.shape[1]
feats, feats_lengths, _ = self.forward_lfr_cmvn(feats, feats_lengths, is_final)
if is_final:
@ -444,3 +450,54 @@ class WavFrontendOnline(AbsFrontend):
self.reserve_waveforms = None
self.input_cache = None
self.lfr_splice_cache = []
class WavFrontendMel23(AbsFrontend):
"""Conventional frontend structure for ASR.
"""
def __init__(
self,
fs: int = 16000,
frame_length: int = 25,
frame_shift: int = 10,
lfr_m: int = 1,
lfr_n: int = 1,
):
assert check_argument_types()
super().__init__()
self.fs = fs
self.frame_length = frame_length
self.frame_shift = frame_shift
self.lfr_m = lfr_m
self.lfr_n = lfr_n
self.n_mels = 23
def output_size(self) -> int:
return self.n_mels * (2 * self.lfr_m + 1)
def forward(
self,
input: torch.Tensor,
input_lengths: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
batch_size = input.size(0)
feats = []
feats_lens = []
for i in range(batch_size):
waveform_length = input_lengths[i]
waveform = input[i][:waveform_length]
waveform = waveform.numpy()
mat = eend_ola_feature.stft(waveform, self.frame_length, self.frame_shift)
mat = eend_ola_feature.transform(mat)
mat = eend_ola_feature.splice(mat, context_size=self.lfr_m)
mat = mat[::self.lfr_n]
mat = torch.from_numpy(mat)
feat_length = mat.size(0)
feats.append(mat)
feats_lens.append(feat_length)
feats_lens = torch.as_tensor(feats_lens)
feats_pad = pad_sequence(feats,
batch_first=True,
padding_value=0.0)
return feats_pad, feats_lens
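`WavFrontendMel23.output_size` returns `n_mels * (2 * lfr_m + 1)` because each frame is concatenated with `lfr_m` neighbours on each side before subsampling by `lfr_n`. The sketch below illustrates that splice-then-subsample step in pure Python. It assumes zero padding at the sequence edges, which is not confirmed by this diff, and the `splice` function here is a hypothetical stand-in for `eend_ola_feature.splice`, not its actual implementation.

```python
def splice(frames, context_size):
    """Concatenate each frame with `context_size` neighbours on both sides.

    Assumes a non-empty list of equal-length frames; edges are zero-padded.
    """
    dim = len(frames[0])
    zero = [0.0] * dim
    padded = [zero] * context_size + frames + [zero] * context_size
    width = 2 * context_size + 1
    return [
        [x for frame in padded[t:t + width] for x in frame]
        for t in range(len(frames))
    ]


frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # 3 frames, 2 features each
spliced = splice(frames, context_size=1)       # each row now has 2 * (2 * 1 + 1) values
subsampled = spliced[::2]                      # keep every 2nd frame, as mat[::lfr_n] does
```

With 23 mel bins and `lfr_m = 1`, the same arithmetic gives 23 * 3 = 69 features per frame.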

View File

@ -87,7 +87,7 @@ class EENDOLATransformerEncoder(nn.Module):
n_layers: int,
n_units: int,
e_units: int = 2048,
h: int = 8,
h: int = 4,
dropout_rate: float = 0.1,
use_pos_emb: bool = False):
super(EENDOLATransformerEncoder, self).__init__()

View File

@ -16,12 +16,12 @@ class EncoderDecoderAttractor(nn.Module):
self.n_units = n_units
def forward_core(self, xs, zeros):
ilens = torch.from_numpy(np.array([x.shape[0] for x in xs])).to(torch.float32).to(xs[0].device)
ilens = torch.from_numpy(np.array([x.shape[0] for x in xs])).to(torch.int64)
xs = [self.enc0_dropout(x) for x in xs]
xs = nn.utils.rnn.pad_sequence(xs, batch_first=True, padding_value=-1)
xs = nn.utils.rnn.pack_padded_sequence(xs, ilens, batch_first=True, enforce_sorted=False)
_, (hx, cx) = self.encoder(xs)
zlens = torch.from_numpy(np.array([z.shape[0] for z in zeros])).to(torch.float32).to(zeros[0].device)
zlens = torch.from_numpy(np.array([z.shape[0] for z in zeros])).to(torch.int64)
max_zlen = torch.max(zlens).to(torch.int).item()
zeros = [self.enc0_dropout(z) for z in zeros]
zeros = nn.utils.rnn.pad_sequence(zeros, batch_first=True, padding_value=-1)
@ -47,4 +47,4 @@ class EncoderDecoderAttractor(nn.Module):
zeros = [torch.zeros(max_n_speakers, self.n_units).to(torch.float32).to(xs[0].device) for _ in xs]
attractors = self.forward_core(xs, zeros)
probs = [torch.sigmoid(torch.flatten(self.counter(att))) for att in attractors]
return attractors, probs
return attractors, probs

View File

@ -237,7 +237,7 @@ bool Audio::loadpcmwav(const char* buf, int nBufLen)
size_t nOffset = 0;
#define WAV_HEADER_SIZE 44
speech_len = nBufLen / 2;
speech_align_len = (int)(ceil((float)speech_len / align_size) * align_size);
@ -263,7 +263,8 @@ bool Audio::loadpcmwav(const char* buf, int nBufLen)
speech_data[i] = (float)speech_buff[i] / scale;
}
AudioFrame* frame = new AudioFrame(speech_len);
frame_queue.push(frame);
return true;
}

View File

@ -26,8 +26,9 @@ extern "C" {
return nullptr;
Audio audio(1);
audio.loadwav(szBuf,nLen);
audio.split();
if (!audio.loadwav(szBuf, nLen))
return nullptr;
//audio.split();
float* buff;
int len;
@ -58,8 +59,9 @@ extern "C" {
return nullptr;
Audio audio(1);
audio.loadpcmwav(szBuf, nLen);
audio.split();
if (!audio.loadpcmwav(szBuf, nLen))
return nullptr;
//audio.split();
float* buff;
int len;
@ -91,8 +93,9 @@ extern "C" {
return nullptr;
Audio audio(1);
audio.loadpcmwav(szFileName);
audio.split();
if (!audio.loadpcmwav(szFileName))
return nullptr;
//audio.split();
float* buff;
int len;
@ -125,7 +128,7 @@ extern "C" {
Audio audio(1);
if(!audio.loadwav(szWavfile))
return nullptr;
audio.split();
//audio.split();
float* buff;
int len;

View File

@ -8,7 +8,7 @@
#include "librapidasrapi.h"
#include <iostream>
#include <fstream>
using namespace std;
int main(int argc, char *argv[])
@ -40,10 +40,13 @@ int main(int argc, char *argv[])
gettimeofday(&start, NULL);
RPASR_RESULT Result=RapidAsrRecogPCMFile(AsrHanlde, argv[2], RASR_NONE, NULL);
gettimeofday(&end, NULL);
float snippet_time = 0.0f;
RPASR_RESULT Result=RapidAsrRecogFile(AsrHanlde, argv[2], RASR_NONE, NULL);
gettimeofday(&end, NULL);
if (Result)
{
string msg = RapidAsrGetResult(Result, 0);
@ -56,11 +59,51 @@ int main(int argc, char *argv[])
}
else
{
cout <<("no return data!");
cout <<"no return data!";
}
printf("Audio length %lfs.\n", (double)snippet_time);
//char* buff = nullptr;
//int len = 0;
//ifstream ifs(argv[2], std::ios::binary | std::ios::in);
//if (ifs.is_open())
//{
// ifs.seekg(0, std::ios::end);
// len = ifs.tellg();
// ifs.seekg(0, std::ios::beg);
// buff = new char[len];
// ifs.read(buff, len);
// //RPASR_RESULT Result = RapidAsrRecogPCMFile(AsrHanlde, argv[2], RASR_NONE, NULL);
// RPASR_RESULT Result=RapidAsrRecogPCMBuffer(AsrHanlde, buff,len, RASR_NONE, NULL);
// //RPASR_RESULT Result = RapidAsrRecogPCMFile(AsrHanlde, argv[2], RASR_NONE, NULL);
// gettimeofday(&end, NULL);
//
// if (Result)
// {
// string msg = RapidAsrGetResult(Result, 0);
// setbuf(stdout, NULL);
// cout << "Result: \"";
// cout << msg << endl;
// cout << "\"." << endl;
// snippet_time = RapidAsrGetRetSnippetTime(Result);
// RapidAsrFreeResult(Result);
// }
// else
// {
// cout <<"no return data!";
// }
//
//delete[]buff;
//}
printf("Audio length %lfs.\n", (double)snippet_time);
seconds = (end.tv_sec - start.tv_sec);
long taking_micros = ((seconds * 1000000) + end.tv_usec) - (start.tv_usec);
printf("Model inference takes %lfs.\n", (double)taking_micros / 1000000);

View File

@ -0,0 +1,21 @@
Benchmark of [Paraformer-large](https://www.modelscope.cn/models/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary) on the Aishell1 test set; the total audio duration is 36108.919 seconds.
(Note: The service has been fully warmed up.)
Intel(R) Xeon(R) Platinum 8369B CPU @ 2.90GHz 16core-32processor with avx512_vnni
| concurrent-tasks | processing time(s) | RTF | Speedup Rate |
|:----------------:|:------------------:|:------:|:------------:|
| 1 (onnx fp32) | 2806 | 0.0777 | 12.9 |
| 1 (onnx int8) | 1611 | 0.0446 | 22.4 |
| 8 (onnx fp32) | 538 | 0.0149 | 67.1 |
| 8 (onnx int8) | 210 | 0.0058 | 172.4 |
| 16 (onnx fp32) | 288 | 0.0080 | 125.2 |
| 16 (onnx int8) | 117 | 0.0032 | 309.9 |
| 32 (onnx fp32) | 167 | 0.0046 | 216.5 |
| 32 (onnx int8) | 107 | 0.0030 | 338.0 |
| 64 (onnx fp32) | 158 | 0.0044 | 228.1 |
| 64 (onnx int8) | 82 | 0.0023 | 442.8 |
| 96 (onnx fp32) | 151 | 0.0042 | 238.0 |
| 96 (onnx int8) | 80 | 0.0022 | 452.0 |
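The table does not spell out how RTF and Speedup Rate are derived. A minimal sketch, assuming RTF is processing time divided by total audio duration and Speedup Rate is its reciprocal, reproduces the first row; the function names are hypothetical. Rows with short processing times may differ slightly, presumably because the listed times are rounded.

```python
TOTAL_AUDIO_SECONDS = 36108.919  # total duration of the Aishell1 test set, from the text above


def rtf(processing_seconds, audio_seconds=TOTAL_AUDIO_SECONDS):
    """Real-time factor: seconds of compute per second of audio (lower is better)."""
    return processing_seconds / audio_seconds


def speedup(processing_seconds, audio_seconds=TOTAL_AUDIO_SECONDS):
    """Seconds of audio processed per second of compute (higher is better)."""
    return audio_seconds / processing_seconds


# First row of the table: 1 concurrent task, onnx fp32, 2806 s of processing.
row_rtf = round(rtf(2806), 4)
row_speedup = round(speedup(2806), 1)
```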

View File

@ -639,12 +639,12 @@ class AbsTask(ABC):
"and exclude_keys excludes keys of model states for the initialization."
"e.g.\n"
" # Load all parameters"
" --init_param some/where/model.pth\n"
" --init_param some/where/model.pb\n"
" # Load only decoder parameters"
" --init_param some/where/model.pth:decoder:decoder\n"
" --init_param some/where/model.pb:decoder:decoder\n"
" # Load only decoder parameters excluding decoder.embed"
" --init_param some/where/model.pth:decoder:decoder:decoder.embed\n"
" --init_param some/where/model.pth:decoder:decoder:decoder.embed\n",
" --init_param some/where/model.pb:decoder:decoder:decoder.embed\n"
" --init_param some/where/model.pb:decoder:decoder:decoder.embed\n",
)
group.add_argument(
"--ignore_init_mismatch",

View File

@ -826,7 +826,7 @@ class ASRTaskUniASR(ASRTask):
if "model.ckpt-" in model_name or ".bin" in model_name:
model_name_pth = os.path.join(model_dir, model_name.replace('.bin',
'.pb')) if ".bin" in model_name else os.path.join(
model_dir, "{}.pth".format(model_name))
model_dir, "{}.pb".format(model_name))
if os.path.exists(model_name_pth):
logging.info("model_file is load from pth: {}".format(model_name_pth))
model_dict = torch.load(model_name_pth, map_location=device)
@ -1073,7 +1073,7 @@ class ASRTaskParaformer(ASRTask):
if "model.ckpt-" in model_name or ".bin" in model_name:
model_name_pth = os.path.join(model_dir, model_name.replace('.bin',
'.pb')) if ".bin" in model_name else os.path.join(
model_dir, "{}.pth".format(model_name))
model_dir, "{}.pb".format(model_name))
if os.path.exists(model_name_pth):
logging.info("model_file is load from pth: {}".format(model_name_pth))
model_dict = torch.load(model_name_pth, map_location=device)

View File

@ -553,7 +553,7 @@ class DiarTask(AbsTask):
if ".bin" in model_name:
model_name_pth = os.path.join(model_dir, model_name.replace('.bin', '.pb'))
else:
model_name_pth = os.path.join(model_dir, "{}.pth".format(model_name))
model_name_pth = os.path.join(model_dir, "{}.pb".format(model_name))
if os.path.exists(model_name_pth):
logging.info("model_file is load from pth: {}".format(model_name_pth))
model_dict = torch.load(model_name_pth, map_location=device)
@ -750,47 +750,47 @@ class EENDOLADiarTask(AbsTask):
cls, args: argparse.Namespace, train: bool
) -> Optional[Callable[[str, Dict[str, np.array]], Dict[str, np.ndarray]]]:
assert check_argument_types()
if args.use_preprocessor:
retval = CommonPreprocessor(
train=train,
token_type=args.token_type,
token_list=args.token_list,
bpemodel=None,
non_linguistic_symbols=None,
text_cleaner=None,
g2p_type=None,
split_with_space=args.split_with_space if hasattr(args, "split_with_space") else False,
seg_dict_file=args.seg_dict_file if hasattr(args, "seg_dict_file") else None,
# NOTE(kamo): Check attribute existence for backward compatibility
rir_scp=args.rir_scp if hasattr(args, "rir_scp") else None,
rir_apply_prob=args.rir_apply_prob
if hasattr(args, "rir_apply_prob")
else 1.0,
noise_scp=args.noise_scp if hasattr(args, "noise_scp") else None,
noise_apply_prob=args.noise_apply_prob
if hasattr(args, "noise_apply_prob")
else 1.0,
noise_db_range=args.noise_db_range
if hasattr(args, "noise_db_range")
else "13_15",
speech_volume_normalize=args.speech_volume_normalize
if hasattr(args, "rir_scp")
else None,
)
else:
retval = None
assert check_return_type(retval)
return retval
# if args.use_preprocessor:
# retval = CommonPreprocessor(
# train=train,
# token_type=args.token_type,
# token_list=args.token_list,
# bpemodel=None,
# non_linguistic_symbols=None,
# text_cleaner=None,
# g2p_type=None,
# split_with_space=args.split_with_space if hasattr(args, "split_with_space") else False,
# seg_dict_file=args.seg_dict_file if hasattr(args, "seg_dict_file") else None,
# # NOTE(kamo): Check attribute existence for backward compatibility
# rir_scp=args.rir_scp if hasattr(args, "rir_scp") else None,
# rir_apply_prob=args.rir_apply_prob
# if hasattr(args, "rir_apply_prob")
# else 1.0,
# noise_scp=args.noise_scp if hasattr(args, "noise_scp") else None,
# noise_apply_prob=args.noise_apply_prob
# if hasattr(args, "noise_apply_prob")
# else 1.0,
# noise_db_range=args.noise_db_range
# if hasattr(args, "noise_db_range")
# else "13_15",
# speech_volume_normalize=args.speech_volume_normalize
# if hasattr(args, "rir_scp")
# else None,
# )
# else:
# retval = None
# assert check_return_type(retval)
return None
@classmethod
def required_data_names(
cls, train: bool = True, inference: bool = False
) -> Tuple[str, ...]:
if not inference:
retval = ("speech", "profile", "binary_labels")
retval = ("speech", )
else:
# Recognition mode
retval = ("speech")
retval = ("speech", )
return retval
@classmethod
@ -823,7 +823,7 @@ class EENDOLADiarTask(AbsTask):
# 2. Encoder
encoder_class = encoder_choices.get_class(args.encoder)
encoder = encoder_class(input_size=input_size, **args.encoder_conf)
encoder = encoder_class(**args.encoder_conf)
# 3. EncoderDecoderAttractor
encoder_decoder_attractor_class = encoder_decoder_attractor_choices.get_class(args.encoder_decoder_attractor)

View File

@ -501,7 +501,7 @@ class SVTask(AbsTask):
if ".bin" in model_name:
model_name_pth = os.path.join(model_dir, model_name.replace('.bin', '.pb'))
else:
model_name_pth = os.path.join(model_dir, "{}.pth".format(model_name))
model_name_pth = os.path.join(model_dir, "{}.pb".format(model_name))
if os.path.exists(model_name_pth):
logging.info("model_file is load from pth: {}".format(model_name_pth))
model_dict = torch.load(model_name_pth, map_location=device)

View File

@ -52,13 +52,13 @@ def load_pretrained_model(
init_param: <file_path>:<src_key>:<dst_key>:<exclude_Keys>
Examples:
>>> load_pretrained_model("somewhere/model.pth", model)
>>> load_pretrained_model("somewhere/model.pth:decoder:decoder", model)
>>> load_pretrained_model("somewhere/model.pth:decoder:decoder:", model)
>>> load_pretrained_model("somewhere/model.pb", model)
>>> load_pretrained_model("somewhere/model.pb:decoder:decoder", model)
>>> load_pretrained_model("somewhere/model.pb:decoder:decoder:", model)
>>> load_pretrained_model(
... "somewhere/model.pth:decoder:decoder:decoder.embed", model
... "somewhere/model.pb:decoder:decoder:decoder.embed", model
... )
>>> load_pretrained_model("somewhere/decoder.pth::decoder", model)
>>> load_pretrained_model("somewhere/decoder.pb::decoder", model)
"""
sps = init_param.split(":", 4)
if len(sps) == 4:
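The docstring above describes the `init_param` format as `<file_path>:<src_key>:<dst_key>:<exclude_Keys>`, where trailing fields may be omitted or left empty. The following sketch shows one way such a string could be parsed. It is not the body of `load_pretrained_model`: `parse_init_param` is a hypothetical helper, and treating the exclude field as a comma-separated list is an assumption.

```python
def parse_init_param(init_param):
    """Split '<file_path>:<src_key>:<dst_key>:<exclude_keys>' into its four fields."""
    parts = init_param.split(":", 3)
    # Pad missing trailing fields so the unpacking below always sees four values.
    parts += [None] * (4 - len(parts))
    # Empty fields (as in 'decoder.pb::decoder') are normalised to None.
    path, src_key, dst_key, excludes = (p if p else None for p in parts)
    exclude_list = excludes.split(",") if excludes else []
    return path, src_key, dst_key, exclude_list


full = parse_init_param("somewhere/model.pb:decoder:decoder:decoder.embed")
dst_only = parse_init_param("somewhere/decoder.pb::decoder")
path_only = parse_init_param("somewhere/model.pb")
```

Note that a simple split on `:` would misparse paths that themselves contain a colon, such as Windows drive letters.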

View File

@ -205,9 +205,9 @@ class Trainer:
else:
scaler = None
if trainer_options.resume and (output_dir / "checkpoint.pth").exists():
if trainer_options.resume and (output_dir / "checkpoint.pb").exists():
cls.resume(
checkpoint=output_dir / "checkpoint.pth",
checkpoint=output_dir / "checkpoint.pb",
model=model,
optimizers=optimizers,
schedulers=schedulers,
@ -361,7 +361,7 @@ class Trainer:
},
buffer,
)
trainer_options.oss_bucket.put_object(os.path.join(trainer_options.output_dir, "checkpoint.pth"), buffer.getvalue())
trainer_options.oss_bucket.put_object(os.path.join(trainer_options.output_dir, "checkpoint.pb"), buffer.getvalue())
else:
torch.save(
{
@ -374,7 +374,7 @@ class Trainer:
],
"scaler": scaler.state_dict() if scaler is not None else None,
},
output_dir / "checkpoint.pth",
output_dir / "checkpoint.pb",
)
# 5. Save and log the model and update the link to the best model
@ -382,22 +382,22 @@ class Trainer:
buffer = BytesIO()
torch.save(model.state_dict(), buffer)
trainer_options.oss_bucket.put_object(os.path.join(trainer_options.output_dir,
f"{iepoch}epoch.pth"),buffer.getvalue())
f"{iepoch}epoch.pb"),buffer.getvalue())
else:
torch.save(model.state_dict(), output_dir / f"{iepoch}epoch.pth")
torch.save(model.state_dict(), output_dir / f"{iepoch}epoch.pb")
# Creates a sym link latest.pth -> {iepoch}epoch.pth
# Creates a sym link latest.pb -> {iepoch}epoch.pb
if trainer_options.use_pai:
p = os.path.join(trainer_options.output_dir, "latest.pth")
p = os.path.join(trainer_options.output_dir, "latest.pb")
if trainer_options.oss_bucket.object_exists(p):
trainer_options.oss_bucket.delete_object(p)
trainer_options.oss_bucket.copy_object(trainer_options.oss_bucket.bucket_name,
os.path.join(trainer_options.output_dir, f"{iepoch}epoch.pth"), p)
os.path.join(trainer_options.output_dir, f"{iepoch}epoch.pb"), p)
else:
p = output_dir / "latest.pth"
p = output_dir / "latest.pb"
if p.is_symlink() or p.exists():
p.unlink()
p.symlink_to(f"{iepoch}epoch.pth")
p.symlink_to(f"{iepoch}epoch.pb")
_improved = []
for _phase, k, _mode in trainer_options.best_model_criterion:
@ -407,16 +407,16 @@ class Trainer:
# Creates sym links if it's the best result
if best_epoch == iepoch:
if trainer_options.use_pai:
p = os.path.join(trainer_options.output_dir, f"{_phase}.{k}.best.pth")
p = os.path.join(trainer_options.output_dir, f"{_phase}.{k}.best.pb")
if trainer_options.oss_bucket.object_exists(p):
trainer_options.oss_bucket.delete_object(p)
trainer_options.oss_bucket.copy_object(trainer_options.oss_bucket.bucket_name,
os.path.join(trainer_options.output_dir, f"{iepoch}epoch.pth"),p)
os.path.join(trainer_options.output_dir, f"{iepoch}epoch.pb"),p)
else:
p = output_dir / f"{_phase}.{k}.best.pth"
p = output_dir / f"{_phase}.{k}.best.pb"
if p.is_symlink() or p.exists():
p.unlink()
p.symlink_to(f"{iepoch}epoch.pth")
p.symlink_to(f"{iepoch}epoch.pb")
_improved.append(f"{_phase}.{k}")
if len(_improved) == 0:
logging.info("There are no improvements in this epoch")
@ -438,7 +438,7 @@ class Trainer:
type="model",
metadata={"improved": _improved},
)
artifact.add_file(str(output_dir / f"{iepoch}epoch.pth"))
artifact.add_file(str(output_dir / f"{iepoch}epoch.pb"))
aliases = [
f"epoch-{iepoch}",
"best" if best_epoch == iepoch else "",
@ -473,12 +473,12 @@ class Trainer:
for e in range(1, iepoch):
if trainer_options.use_pai:
p = os.path.join(trainer_options.output_dir, f"{e}epoch.pth")
p = os.path.join(trainer_options.output_dir, f"{e}epoch.pb")
if trainer_options.oss_bucket.object_exists(p) and e not in nbests:
trainer_options.oss_bucket.delete_object(p)
_removed.append(str(p))
else:
p = output_dir / f"{e}epoch.pth"
p = output_dir / f"{e}epoch.pb"
if p.exists() and e not in nbests:
p.unlink()
_removed.append(str(p))
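The local (non-PAI) branches above repeat two file operations: re-pointing a `latest.pb` symlink at the newest `{iepoch}epoch.pb`, and deleting older epoch files that are not among the n-best. A minimal sketch of those two steps with `pathlib` follows. `update_latest_link` and `prune_checkpoints` are hypothetical helper names, and the `suffix` parameter is an assumption added so the same code works for either extension.

```python
from pathlib import Path
from typing import Iterable, List


def update_latest_link(output_dir: Path, iepoch: int, suffix: str = ".pb") -> Path:
    """Point latest<suffix> at <iepoch>epoch<suffix> via a relative symlink."""
    link = output_dir / f"latest{suffix}"
    # A dangling symlink reports exists() == False, so check is_symlink() too.
    if link.is_symlink() or link.exists():
        link.unlink()
    link.symlink_to(f"{iepoch}epoch{suffix}")
    return link


def prune_checkpoints(
    output_dir: Path, iepoch: int, keep: Iterable[int], suffix: str = ".pb"
) -> List[str]:
    """Delete epoch files 1..iepoch-1 whose epoch number is not in `keep`."""
    keep = set(keep)
    removed = []
    for e in range(1, iepoch):
        p = output_dir / f"{e}epoch{suffix}"
        if p.exists() and e not in keep:
            p.unlink()
            removed.append(str(p))
    return removed
```

Because the link target is a bare file name, the symlink stays valid if the whole output directory is moved. Checkpoints written under the old `.pth` names are not picked up by either helper unless `suffix=".pth"` is passed.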


@ -1,6 +1,10 @@
import torch
import copy
import codecs
import logging
import edit_distance
import argparse
import pdb
import numpy as np
from typing import Any, List, Tuple, Union
@ -9,7 +13,8 @@ def ts_prediction_lfr6_standard(us_alphas,
us_peaks,
char_list,
vad_offset=0.0,
force_time_shift=-1.5
force_time_shift=-1.5,
sil_in_str=True
):
if not len(char_list):
return []
@ -62,6 +67,8 @@ def ts_prediction_lfr6_standard(us_alphas,
timestamp_list[i][1] = timestamp_list[i][1] + vad_offset / 1000.0
res_txt = ""
for char, timestamp in zip(new_char_list, timestamp_list):
#if char != '<sil>':
if not sil_in_str and char == '<sil>': continue
res_txt += "{} {} {};".format(char, str(timestamp[0]+0.0005)[:5], str(timestamp[1]+0.0005)[:5])
res = []
for char, timestamp in zip(new_char_list, timestamp_list):
@ -121,4 +128,181 @@ def time_stamp_sentence(punc_id_list, time_stamp_postprocessed, text_postprocess
return res
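The `res_txt` string built in `ts_prediction_lfr6_standard` uses a flat `char start end;` layout, and `read_timestamps` below reads the same layout back with an utterance id in front. The sketch here reproduces that round trip in isolation. `format_timestamps` and `parse_timestamp_line` are hypothetical names, and the inputs are made-up values rather than model output.

```python
from typing import List, Sequence, Tuple

Span = Tuple[float, float]


def format_timestamps(
    chars: Sequence[str], timestamps: Sequence[Span], sil_in_str: bool = True
) -> str:
    """Build "char start end;" items, optionally skipping <sil> tokens."""
    out = ""
    for char, (start, end) in zip(chars, timestamps):
        if not sil_in_str and char == "<sil>":
            continue
        # Adding 0.0005 and keeping 5 characters behaves like rounding to
        # three decimals for times below 10 s; longer times keep fewer decimals.
        out += "{} {} {};".format(char, str(start + 0.0005)[:5], str(end + 0.0005)[:5])
    return out


def parse_timestamp_line(line: str) -> Tuple[str, List[str], List[Span]]:
    """Parse "uttid char start end;char start end;" into its parts."""
    fields = line.rstrip().split()
    uttid, body = fields[0], " ".join(fields[1:])
    chars, spans = [], []
    for item in body.split(";"):
        if not item:
            continue
        char, start, end = item.strip().split(" ")
        chars.append(char)
        spans.append((float(start), float(end)))
    return uttid, chars, spans


line = "utt1 " + format_timestamps(
    ["a", "<sil>", "b"], [(0.0, 0.12), (0.12, 0.3), (0.3, 0.5)], sil_in_str=False
)
print(line)
print(parse_timestamp_line(line))
```

Passing `sil_in_str=False` drops silence tokens from the string, which keeps them out of the token sequence that is later aligned between two timestamp files.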
class AverageShiftCalculator():
    def __init__(self):
        logging.warning("Calculating average shift.")
    def __call__(self, file1, file2):
        uttid_list1, ts_dict1 = self.read_timestamps(file1)
        uttid_list2, ts_dict2 = self.read_timestamps(file2)
        uttid_intersection = self._intersection(uttid_list1, uttid_list2)
        res = self.as_cal(uttid_intersection, ts_dict1, ts_dict2)
        logging.warning("Average shift of {} and {}: {}.".format(file1, file2, str(res)[:8]))
        logging.warning("Following timestamp pair differs most: {}, detail:{}".format(self.max_shift, self.max_shift_uttid))
    def _intersection(self, list1, list2):
        set1 = set(list1)
        set2 = set(list2)
        if set1 == set2:
            logging.warning("Uttid same checked.")
            return set1
        itsc = list(set1 & set2)
        logging.warning("Uttid differs: file1 {}, file2 {}, lines same {}.".format(len(list1), len(list2), len(itsc)))
        return itsc
    def read_timestamps(self, file):
        # read timestamps file in standard format
        uttid_list = []
        ts_dict = {}
        with codecs.open(file, 'r') as fin:
            for line in fin.readlines():
                text = ''
                ts_list = []
                line = line.rstrip()
                uttid = line.split()[0]
                uttid_list.append(uttid)
                body = " ".join(line.split()[1:])
                for pd in body.split(';'):
                    if not len(pd): continue
                    # pdb.set_trace()
                    char, start, end = pd.lstrip(" ").split(' ')
                    text += char + ','
                    ts_list.append((float(start), float(end)))
                # ts_lists.append(ts_list)
                ts_dict[uttid] = (text[:-1], ts_list)
        logging.warning("File {} read done.".format(file))
        return uttid_list, ts_dict
    def _shift(self, filtered_timestamp_list1, filtered_timestamp_list2):
        shift_time = 0
        for fts1, fts2 in zip(filtered_timestamp_list1, filtered_timestamp_list2):
            shift_time += abs(fts1[0] - fts2[0]) + abs(fts1[1] - fts2[1])
        num_tokens = len(filtered_timestamp_list1)
        return shift_time, num_tokens
    def as_cal(self, uttid_list, ts_dict1, ts_dict2):
        # calculate average shift between timestamp1 and timestamp2
        # when characters differ, use edit distance alignment
        # and calculate the error between the same characters
        self._accumlated_shift = 0
        self._accumlated_tokens = 0
        self.max_shift = 0
        self.max_shift_uttid = None
        for uttid in uttid_list:
            (t1, ts1) = ts_dict1[uttid]
            (t2, ts2) = ts_dict2[uttid]
            _align, _align2, _align3 = [], [], []
            fts1, fts2 = [], []
            _t1, _t2 = [], []
            sm = edit_distance.SequenceMatcher(t1.split(','), t2.split(','))
            s = sm.get_opcodes()
            for j in range(len(s)):
                if s[j][0] == "replace" or s[j][0] == "insert":
                    _align.append(0)
                if s[j][0] == "replace" or s[j][0] == "delete":
                    _align3.append(0)
                elif s[j][0] == "equal":
                    _align.append(1)
                    _align3.append(1)
                else:
                    continue
            # use s to index t2
            for a, ts , t in zip(_align, ts2, t2.split(',')):
                if a:
                    fts2.append(ts)
                    _t2.append(t)
            sm2 = edit_distance.SequenceMatcher(t2.split(','), t1.split(','))
            s = sm2.get_opcodes()
            for j in range(len(s)):
                if s[j][0] == "replace" or s[j][0] == "insert":
                    _align2.append(0)
                elif s[j][0] == "equal":
                    _align2.append(1)
                else:
                    continue
            # use s2 to index t1
            for a, ts, t in zip(_align3, ts1, t1.split(',')):
                if a:
                    fts1.append(ts)
                    _t1.append(t)
            if len(fts1) == len(fts2):
                shift_time, num_tokens = self._shift(fts1, fts2)
                self._accumlated_shift += shift_time
                self._accumlated_tokens += num_tokens
                if shift_time/num_tokens > self.max_shift:
                    self.max_shift = shift_time/num_tokens
                    self.max_shift_uttid = uttid
            else:
                logging.warning("length mismatch")
        return self._accumlated_shift / self._accumlated_tokens
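The metric computed by `as_cal` is, per matched token, the absolute difference of the start times plus the absolute difference of the end times, summed and divided by the number of matched tokens. The sketch below works through that arithmetic on a small example. It uses the standard library's `difflib.SequenceMatcher` in place of the `edit_distance` package used in the commit, so its alignments may differ in edge cases; `average_shift` is a hypothetical function name.

```python
from difflib import SequenceMatcher
from typing import List, Sequence, Tuple

Span = Tuple[float, float]


def average_shift(
    tokens1: Sequence[str], ts1: Sequence[Span],
    tokens2: Sequence[str], ts2: Sequence[Span],
) -> float:
    """Mean of |start1 - start2| + |end1 - end2| over tokens both sides share."""
    matcher = SequenceMatcher(a=list(tokens1), b=list(tokens2), autojunk=False)
    total, count = 0.0, 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            continue  # replaced, inserted and deleted tokens are ignored
        for (s1, e1), (s2, e2) in zip(ts1[i1:i2], ts2[j1:j2]):
            total += abs(s1 - s2) + abs(e1 - e2)
            count += 1
    # Guard against the case where no token matches at all.
    return total / count if count else 0.0


ref_tokens = ["a", "b", "c"]
ref_ts = [(0.00, 0.10), (0.10, 0.20), (0.20, 0.30)]
hyp_tokens = ["a", "x", "c"]
hyp_ts = [(0.00, 0.12), (0.12, 0.20), (0.25, 0.30)]

# "a" contributes 0.00 + 0.02 and "c" contributes 0.05 + 0.00,
# so the result is 0.07 / 2 = 0.035 seconds (up to float rounding).
print(average_shift(ref_tokens, ref_ts, hyp_tokens, hyp_ts))
```

Since start and end differences are added rather than averaged, a uniform offset of `d` seconds between two files yields a value of `2 * d` under this definition.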
def convert_external_alphas(alphas_file, text_file, output_file):
    from funasr.models.predictor.cif import cif_wo_hidden
    with open(alphas_file, 'r') as f1, open(text_file, 'r') as f2, open(output_file, 'w') as f3:
        for line1, line2 in zip(f1.readlines(), f2.readlines()):
            line1 = line1.rstrip()
            line2 = line2.rstrip()
            assert line1.split()[0] == line2.split()[0]
            uttid = line1.split()[0]
            alphas = [float(i) for i in line1.split()[1:]]
            new_alphas = np.array(remove_chunk_padding(alphas))
            new_alphas[-1] += 1e-4
            text = line2.split()[1:]
            if len(text) + 1 != int(new_alphas.sum()):
                # force resize
                new_alphas *= (len(text) + 1) / int(new_alphas.sum())
            peaks = cif_wo_hidden(torch.Tensor(new_alphas).unsqueeze(0), 1.0-1e-4)
            if " " in text:
                text = text.split()
            else:
                text = [i for i in text]
            res_str, _ = ts_prediction_lfr6_standard(new_alphas, peaks[0], text,
                                                     force_time_shift=-7.0,
                                                     sil_in_str=False)
            f3.write("{} {}\n".format(uttid, res_str))
def remove_chunk_padding(alphas):
    # remove the padding part in alphas if using chunk paraformer for GPU
    START_ZERO = 45
    MID_ZERO = 75
    REAL_FRAMES = 360 # for chunk based encoder 10-120-10 and fsmn padding 5
    alphas = alphas[START_ZERO:] # remove the padding at beginning
    new_alphas = []
    while True:
        new_alphas = new_alphas + alphas[:REAL_FRAMES]
        alphas = alphas[REAL_FRAMES+MID_ZERO:]
        if len(alphas) < REAL_FRAMES: break
    return new_alphas
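`remove_chunk_padding` assumes a fixed layout: a block of leading padding frames, then repeating runs of real frames separated by padding. The sketch below mirrors the same loop with the three constants exposed as parameters so the slicing can be traced on a short list. `strip_chunk_padding` and its parameter names are hypothetical, and the toy sizes are chosen only for readability.

```python
from typing import List


def strip_chunk_padding(
    alphas: List[float], start_zero: int, mid_zero: int, real_frames: int
) -> List[float]:
    """Keep runs of `real_frames` values, skipping padding before and between them."""
    alphas = alphas[start_zero:]  # drop the leading padding
    kept: List[float] = []
    while True:
        kept = kept + alphas[:real_frames]          # take one run of real frames
        alphas = alphas[real_frames + mid_zero:]    # skip the padding after it
        if len(alphas) < real_frames:
            break  # a trailing partial run is discarded, as in the loop above
    return kept


# Layout: 2 padding | 1 2 3 | 1 padding | 4 5 6 | 1 padding | 7 (partial run)
toy = [0, 0, 1, 2, 3, 0, 4, 5, 6, 0, 7]
print(strip_chunk_padding(toy, start_zero=2, mid_zero=1, real_frames=3))
```

The trace shows one property worth keeping in mind: any remainder shorter than a full run is dropped, so the final frames of an utterance are lost whenever the input length does not line up with the chunk layout.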
SUPPORTED_MODES = ['cal_aas', 'read_ext_alphas']
def main(args):
    if args.mode == 'cal_aas':
        asc = AverageShiftCalculator()
        asc(args.input, args.input2)
    elif args.mode == 'read_ext_alphas':
        convert_external_alphas(args.input, args.input2, args.output)
    else:
        logging.error("Mode {} not in SUPPORTED_MODES: {}.".format(args.mode, SUPPORTED_MODES))
if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='timestamp tools')
    parser.add_argument('--mode',
                        default=None,
                        type=str,
                        choices=SUPPORTED_MODES,
                        help='timestamp related toolbox')
    parser.add_argument('--input', default=None, type=str, help='input file path')
    parser.add_argument('--output', default=None, type=str, help='output file name')
    parser.add_argument('--input2', default=None, type=str, help='input2 file path')
    parser.add_argument('--kaldi-ts-type',
                        default='v2',
                        type=str,
                        choices=['v0', 'v1', 'v2'],
                        help='kaldi timestamp to write')
    args = parser.parse_args()
    main(args)


@ -17,7 +17,7 @@ requirements = {
"humanfriendly",
"scipy>=1.4.1",
# "filelock",
"librosa>=0.8.0",
"librosa==0.8.1",
"jamo==0.4.1", # For kss
"PyYAML>=5.1.2",
"soundfile>=0.10.2",
@ -41,6 +41,8 @@ requirements = {
# PAI
"oss2",
"kaldi-native-fbank",
# timestamp
"edit-distance"
],
# train: The modules invoked when training only.
"train": [


@ -451,7 +451,7 @@ class TestUniasrInferencePipelines(unittest.TestCase):
def test_uniasr_2pass_zhcn_16k_common_vocab8358_offline(self):
inference_pipeline = pipeline(
task=Tasks.,
task=Tasks.auto_speech_recognition,
model='damo/speech_UniASR_asr_2pass-zh-cn-16k-common-vocab8358-tensorflow1-offline')
rec_result = inference_pipeline(
audio_in='https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/asr_example_zh.wav',