resolve conflict

aky15 2023-03-21 15:12:21 +08:00
commit 4dc3a1b011
128 changed files with 3628 additions and 508 deletions

View File

@ -15,36 +15,10 @@
| [**Model Zoo**](https://www.modelscope.cn/models/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary)
| [**Contact**](#contact)
## What's new:
### 2023.2.17, funasr-0.2.0, modelscope-1.3.0
- We support a new feature: exporting Paraformer models to [ONNX and TorchScript](https://github.com/alibaba-damo-academy/FunASR/tree/main/funasr/export) from ModelScope. Local finetuned models are also supported.
- We support a new feature, [onnxruntime](https://github.com/alibaba-damo-academy/FunASR/tree/main/funasr/runtime/python): you can deploy the runtime without ModelScope or FunASR. For the [paraformer-large](https://www.modelscope.cn/models/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary) model, onnxruntime gives a 3x RTF speedup (0.110 -> 0.038) on CPU, see [details](https://github.com/alibaba-damo-academy/FunASR/tree/main/funasr/runtime/python/onnxruntime/paraformer/rapid_paraformer#speed).
- We support a new feature, [grpc](https://github.com/alibaba-damo-academy/FunASR/tree/main/funasr/runtime/python/grpc): you can build an ASR service with gRPC by deploying the ModelScope pipeline or onnxruntime.
- We release a new model, [paraformer-large-contextual](https://www.modelscope.cn/models/damo/speech_paraformer-large-contextual_asr_nat-zh-cn-16k-common-vocab8404/summary), which supports hotword customization based on incentive enhancement and improves the recall and precision of hotwords.
- We optimize the timestamp alignment of [Paraformer-large-long](https://modelscope.cn/models/damo/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary): the timestamp prediction accuracy is much improved, achieving an accumulated average shift (AAS) of 74.7 ms, see [details](https://arxiv.org/abs/2301.12343).
- We release a new model, the [8k VAD model](https://modelscope.cn/models/damo/speech_fsmn_vad_zh-cn-16k-common-pytorch/summary), which can predict the duration of non-silence speech. It can be freely integrated with any ASR model in [ModelScope](https://github.com/alibaba-damo-academy/FunASR/discussions/134).
- We release a new model, [MFCCA](https://www.modelscope.cn/models/NPU-ASLP/speech_mfcca_asr-zh-cn-16k-alimeeting-vocab4950/summary), a multi-channel multi-speaker model that is independent of the number and geometry of microphones and supports Mandarin meeting transcription.
- We release several new UniASR models:
[Southern Fujian Dialect model](https://modelscope.cn/models/damo/speech_UniASR_asr_2pass-minnan-16k-common-vocab3825/summary),
[French model](https://modelscope.cn/models/damo/speech_UniASR_asr_2pass-fr-16k-common-vocab3472-tensorflow1-online/summary),
[German model](https://modelscope.cn/models/damo/speech_UniASR_asr_2pass-de-16k-common-vocab3690-tensorflow1-online/summary),
[Vietnamese model](https://modelscope.cn/models/damo/speech_UniASR_asr_2pass-vi-16k-common-vocab1001-pytorch-online/summary),
[Persian model](https://modelscope.cn/models/damo/speech_UniASR_asr_2pass-fa-16k-common-vocab1257-pytorch-online/summary).
- We release a new model, [paraformer-data2vec](https://www.modelscope.cn/models/damo/speech_data2vec_pretrain-paraformer-zh-cn-aishell2-16k/summary), an unsupervised pretrained model on AISHELL-2, which is used to initialize a Paraformer model that is then finetuned on AISHELL-1.
- We release a new feature: the `VAD`, `ASR` and `PUNC` models can be integrated freely, using either models from [ModelScope](https://www.modelscope.cn/models/damo/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary) or local finetuned models. See the [demo](https://github.com/alibaba-damo-academy/FunASR/discussions/134) and the sketch after this list.
- We optimize the [punctuation common model](https://www.modelscope.cn/models/damo/punc_ct-transformer_zh-cn-common-vocab272727-pytorch/summary), enhancing recall and precision and fixing bad cases of missing punctuation marks.
- Various new audio input types are now supported by the ModelScope inference pipeline, including mp3, flac, ogg, opus, etc.
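
Below is a minimal sketch of running the integrated recognition pipeline through ModelScope. It assumes the `modelscope` and `funasr` packages are installed; the model ID comes from the links above, and the local wav path is illustrative only.
```python
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

# Integrated VAD + ASR + PUNC model referenced above; a local finetuned
# model directory can be passed instead of the model ID.
inference_pipeline = pipeline(
    task=Tasks.auto_speech_recognition,
    model="damo/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch",
    output_dir="./results",  # recognition results are written here
)

# audio_in may be a wav file, a wav.scp list, a URL, or audio bytes.
rec_result = inference_pipeline(audio_in="example.wav")  # illustrative path
print(rec_result)
```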
### 2023.1.16, funasr-0.1.6, modelscope-1.2.0
- We release a new model, [Paraformer-large-long](https://modelscope.cn/models/damo/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary), which integrates the [VAD](https://modelscope.cn/models/damo/speech_fsmn_vad_zh-cn-16k-common-pytorch/summary) model, the [ASR](https://www.modelscope.cn/models/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary) model,
the [Punctuation](https://www.modelscope.cn/models/damo/punc_ct-transformer_zh-cn-common-vocab272727-pytorch/summary) model and timestamp prediction. The model can take inputs several hours long.
- We release a new model, the [16k VAD model](https://modelscope.cn/models/damo/speech_fsmn_vad_zh-cn-16k-common-pytorch/summary), which can predict the duration of non-silence speech. It can be freely integrated with any ASR model in [ModelScope](https://www.modelscope.cn/models/damo/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary).
- We release a new model, [Punctuation](https://www.modelscope.cn/models/damo/punc_ct-transformer_zh-cn-common-vocab272727-pytorch/summary), which can predict the punctuation of ASR results. It can be freely integrated with any ASR model in the [Model Zoo](docs/modelscope_models.md); see the sketch after this list.
- We release a new model, [Data2vec](https://www.modelscope.cn/models/damo/speech_data2vec_pretrain-zh-cn-aishell2-16k-pytorch/summary), an unsupervised pretrained model that can be finetuned on ASR and other downstream tasks.
- We release a new model, [Paraformer-Tiny](https://www.modelscope.cn/models/damo/speech_paraformer-tiny-commandword_asr_nat-zh-cn-16k-vocab544-pytorch/summary), a lightweight Paraformer model that supports Mandarin command word recognition.
- We release a new model, [SV](https://www.modelscope.cn/models/damo/speech_xvector_sv-zh-cn-cnceleb-16k-spk3465-pytorch/summary), which can extract speaker embeddings and perform speaker verification on paired utterances. Speaker diarization will be supported in a future version.
- We improve the ModelScope pipeline to speed up inference by integrating model building into pipeline building.
- Various new audio input types are now supported by the ModelScope inference pipeline, including wav.scp, wav files, audio bytes, waveform samples, etc.
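
Below is a minimal sketch of restoring punctuation on a raw ASR transcript with the punctuation model above, using the ModelScope pipeline API shown elsewhere in this repository; the `text_in` keyword and the toy input sentence are assumptions for illustration.
```python
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

# CT-Transformer punctuation model linked above.
inference_pipeline = pipeline(
    task=Tasks.punctuation,
    model="damo/punc_ct-transformer_zh-cn-common-vocab272727-pytorch",
)

# text_in (assumed keyword) takes the unpunctuated ASR output; toy sentence below.
rec_result = inference_pipeline(text_in="我们都是木头人不会讲话不会动")
print(rec_result)
```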
For the release notes, please refer to [news](https://github.com/alibaba-damo-academy/FunASR/releases).
## Highlights
- Many types of typical models are supported, e.g., [Transformer](https://arxiv.org/abs/1706.03762), [Conformer](https://arxiv.org/abs/2005.08100), [Paraformer](https://arxiv.org/abs/2206.08317).

View File

@ -52,7 +52,7 @@ asr_config=conf/train_asr_conformer.yaml
model_dir="baseline_$(basename "${asr_config}" .yaml)_${feats_type}_${lang}_${token_type}_${tag}"
inference_config=conf/decode_asr_transformer.yaml
inference_asr_model=valid.acc.ave_10best.pth
inference_asr_model=valid.acc.ave_10best.pb
# you can set gpu num for decoding here
gpuid_list=$CUDA_VISIBLE_DEVICES # set gpus for decoding, the same as training stage by default

View File

@ -55,7 +55,7 @@ asr_config=conf/train_asr_paraformer_transformer_12e_6d_3072_768.yaml
model_dir="baseline_$(basename "${asr_config}" .yaml)_${feats_type}_${lang}_${token_type}_${tag}"
inference_config=conf/decode_asr_transformer_noctc_1best.yaml
inference_asr_model=valid.acc.ave_10best.pth
inference_asr_model=valid.acc.ave_10best.pb
# you can set gpu num for decoding here
gpuid_list=$CUDA_VISIBLE_DEVICES # set gpus for decoding, the same as training stage by default

View File

@ -55,7 +55,7 @@ asr_config=conf/train_asr_transformer_12e_6d_3072_768.yaml
model_dir="baseline_$(basename "${asr_config}" .yaml)_${feats_type}_${lang}_${token_type}_${tag}"
inference_config=conf/decode_asr_transformer.yaml
inference_asr_model=valid.cer_ctc.ave_10best.pth
inference_asr_model=valid.cer_ctc.ave_10best.pb
# you can set gpu num for decoding here
gpuid_list=$CUDA_VISIBLE_DEVICES # set gpus for decoding, the same as training stage by default

View File

@ -52,7 +52,7 @@ asr_config=conf/train_asr_paraformer_conformer_12e_6d_2048_256.yaml
model_dir="baseline_$(basename "${asr_config}" .yaml)_${feats_type}_${lang}_${token_type}_${tag}"
inference_config=conf/decode_asr_transformer_noctc_1best.yaml
inference_asr_model=valid.acc.ave_10best.pth
inference_asr_model=valid.acc.ave_10best.pb
# you can set gpu num for decoding here
gpuid_list=$CUDA_VISIBLE_DEVICES # set gpus for decoding, the same as training stage by default

View File

@ -56,7 +56,7 @@ asr_config=conf/train_asr_paraformerbert_conformer_12e_6d_2048_256.yaml
model_dir="baseline_$(basename "${asr_config}" .yaml)_${feats_type}_${lang}_${token_type}_${tag}"
inference_config=conf/decode_asr_transformer_noctc_1best.yaml
inference_asr_model=valid.acc.ave_10best.pth
inference_asr_model=valid.acc.ave_10best.pb
# you can set gpu num for decoding here
gpuid_list=$CUDA_VISIBLE_DEVICES # set gpus for decoding, the same as training stage by default

View File

@ -52,7 +52,7 @@ asr_config=conf/train_asr_conformer.yaml
model_dir="baseline_$(basename "${asr_config}" .yaml)_${feats_type}_${lang}_${token_type}_${tag}"
inference_config=conf/decode_asr_transformer.yaml
inference_asr_model=valid.acc.ave_10best.pth
inference_asr_model=valid.acc.ave_10best.pb
# you can set gpu num for decoding here
gpuid_list=$CUDA_VISIBLE_DEVICES # set gpus for decoding, the same as training stage by default

View File

@ -54,7 +54,7 @@ asr_config=conf/train_asr_conformer.yaml
model_dir="baseline_$(basename "${asr_config}" .yaml)_${feats_type}_${lang}_${token_type}_${tag}"
inference_config=conf/decode_asr_transformer.yaml
inference_asr_model=valid.acc.ave_10best.pth
inference_asr_model=valid.acc.ave_10best.pb
# you can set gpu num for decoding here
gpuid_list=$CUDA_VISIBLE_DEVICES # set gpus for decoding, e.g., gpuid_list=2,3, the same as training stage by default

View File

@ -54,7 +54,7 @@ asr_config=conf/train_asr_paraformer_conformer_20e_1280_320_6d_1280_320.yaml
model_dir="baseline_$(basename "${asr_config}" .yaml)_${feats_type}_${lang}_${token_type}_${tag}"
inference_config=conf/decode_asr_transformer_noctc_1best.yaml
inference_asr_model=valid.acc.ave_10best.pth
inference_asr_model=valid.acc.ave_10best.pb
# you can set gpu num for decoding here
gpuid_list=$CUDA_VISIBLE_DEVICES # set gpus for decoding, e.g., gpuid_list=2,3, the same as training stage by default

View File

@ -58,7 +58,7 @@ asr_config=conf/train_asr_paraformerbert_conformer_20e_6d_1280_320.yaml
model_dir="baseline_$(basename "${asr_config}" .yaml)_${feats_type}_${lang}_${token_type}_${tag}"
inference_config=conf/decode_asr_transformer_noctc_1best.yaml
inference_asr_model=valid.acc.ave_10best.pth
inference_asr_model=valid.acc.ave_10best.pb
# you can set gpu num for decoding here
gpuid_list=$CUDA_VISIBLE_DEVICES # set gpus for decoding, e.g., gpuid_list=2,3, the same as training stage by default

View File

@ -54,7 +54,7 @@ asr_config=conf/train_asr_transformer.yaml
model_dir="baseline_$(basename "${asr_config}" .yaml)_${feats_type}_${lang}_${token_type}_${tag}"
inference_config=conf/decode_asr_transformer.yaml
inference_asr_model=valid.acc.ave_10best.pth
inference_asr_model=valid.acc.ave_10best.pb
# you can set gpu num for decoding here
gpuid_list=$CUDA_VISIBLE_DEVICES # set gpus for decoding, e.g., gpuid_list=2,3, the same as training stage by default

View File

@ -34,7 +34,7 @@ exp_dir=./data
tag=exp1
model_dir="baseline_$(basename "${lm_config}" .yaml)_${lang}_${token_type}_${tag}"
lm_exp=${exp_dir}/exp/${model_dir}
inference_lm=valid.loss.ave.pth # Language model path for decoding.
inference_lm=valid.loss.ave.pb # Language model path for decoding.
stage=0
stop_stage=3

View File

@ -4,7 +4,7 @@ import sys
def main():
diar_config_path = sys.argv[1] if len(sys.argv) > 1 else "sond_fbank.yaml"
diar_model_path = sys.argv[2] if len(sys.argv) > 2 else "sond.pth"
diar_model_path = sys.argv[2] if len(sys.argv) > 2 else "sond.pb"
output_dir = sys.argv[3] if len(sys.argv) > 3 else "./outputs"
data_path_and_name_and_type = [
("data/test_rmsil/feats.scp", "speech", "kaldi_ark"),

View File

@ -17,9 +17,9 @@ if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
echo "Downloading Pre-trained model..."
git clone https://www.modelscope.cn/damo/speech_xvector_sv-zh-cn-cnceleb-16k-spk3465-pytorch.git
git clone https://www.modelscope.cn/damo/speech_diarization_sond-zh-cn-alimeeting-16k-n16k4-pytorch.git
ln -s speech_xvector_sv-zh-cn-cnceleb-16k-spk3465-pytorch/sv.pth ./sv.pth
ln -s speech_xvector_sv-zh-cn-cnceleb-16k-spk3465-pytorch/sv.pb ./sv.pb
cp speech_xvector_sv-zh-cn-cnceleb-16k-spk3465-pytorch/sv.yaml ./sv.yaml
ln -s speech_diarization_sond-zh-cn-alimeeting-16k-n16k4-pytorch/sond.pth ./sond.pth
ln -s speech_diarization_sond-zh-cn-alimeeting-16k-n16k4-pytorch/sond.pb ./sond.pb
cp speech_diarization_sond-zh-cn-alimeeting-16k-n16k4-pytorch/sond_fbank.yaml ./sond_fbank.yaml
cp speech_diarization_sond-zh-cn-alimeeting-16k-n16k4-pytorch/sond.yaml ./sond.yaml
echo "Done."
@ -30,7 +30,7 @@ fi
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
echo "Calculating diarization results..."
python infer_alimeeting_test.py sond_fbank.yaml sond.pth outputs
python infer_alimeeting_test.py sond_fbank.yaml sond.pb outputs
python local/convert_label_to_rttm.py \
outputs/labels.txt \
data/test_rmsil/raw_rmsil_map.scp \

View File

@ -4,7 +4,7 @@ import os
def test_fbank_cpu_infer():
diar_config_path = "config_fbank.yaml"
diar_model_path = "sond.pth"
diar_model_path = "sond.pb"
output_dir = "./outputs"
data_path_and_name_and_type = [
("data/unit_test/test_feats.scp", "speech", "kaldi_ark"),
@ -24,7 +24,7 @@ def test_fbank_cpu_infer():
def test_fbank_gpu_infer():
diar_config_path = "config_fbank.yaml"
diar_model_path = "sond.pth"
diar_model_path = "sond.pb"
output_dir = "./outputs"
data_path_and_name_and_type = [
("data/unit_test/test_feats.scp", "speech", "kaldi_ark"),
@ -45,7 +45,7 @@ def test_fbank_gpu_infer():
def test_wav_gpu_infer():
diar_config_path = "config.yaml"
diar_model_path = "sond.pth"
diar_model_path = "sond.pb"
output_dir = "./outputs"
data_path_and_name_and_type = [
("data/unit_test/test_wav.scp", "speech", "sound"),
@ -66,7 +66,7 @@ def test_wav_gpu_infer():
def test_without_profile_gpu_infer():
diar_config_path = "config.yaml"
diar_model_path = "sond.pth"
diar_model_path = "sond.pb"
output_dir = "./outputs"
raw_inputs = [[
"data/unit_test/raw_inputs/record.wav",

View File

@ -4,7 +4,7 @@ import os
def test_fbank_cpu_infer():
diar_config_path = "sond_fbank.yaml"
diar_model_path = "sond.pth"
diar_model_path = "sond.pb"
output_dir = "./outputs"
data_path_and_name_and_type = [
("data/unit_test/test_feats.scp", "speech", "kaldi_ark"),
@ -24,7 +24,7 @@ def test_fbank_cpu_infer():
def test_fbank_gpu_infer():
diar_config_path = "sond_fbank.yaml"
diar_model_path = "sond.pth"
diar_model_path = "sond.pb"
output_dir = "./outputs"
data_path_and_name_and_type = [
("data/unit_test/test_feats.scp", "speech", "kaldi_ark"),
@ -45,7 +45,7 @@ def test_fbank_gpu_infer():
def test_wav_gpu_infer():
diar_config_path = "config.yaml"
diar_model_path = "sond.pth"
diar_model_path = "sond.pb"
output_dir = "./outputs"
data_path_and_name_and_type = [
("data/unit_test/test_wav.scp", "speech", "sound"),
@ -66,7 +66,7 @@ def test_wav_gpu_infer():
def test_without_profile_gpu_infer():
diar_config_path = "config.yaml"
diar_model_path = "sond.pth"
diar_model_path = "sond.pb"
output_dir = "./outputs"
raw_inputs = [[
"data/unit_test/raw_inputs/record.wav",

View File

@ -49,7 +49,7 @@ asr_config=conf/train_asr_conformer.yaml
model_dir="baseline_$(basename "${asr_config}" .yaml)_${feats_type}_${lang}_${token_type}_${tag}"
inference_config=conf/decode_asr_transformer.yaml
inference_asr_model=valid.acc.ave_10best.pth
inference_asr_model=valid.acc.ave_10best.pb
# you can set gpu num for decoding here
gpuid_list=$CUDA_VISIBLE_DEVICES # set gpus for decoding, the same as training stage by default

View File

@ -41,7 +41,7 @@ The decoding results can be found in `$output_dir/1best_recog/text.cer`, which i
- Modify inference related parameters in `infer_after_finetune.py`
- <strong>output_dir:</strong> # result dir
- <strong>data_dir:</strong> # the dataset dir needs to include `test/wav.scp`. If `test/text` also exists, CER will be computed
- <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pth`
- <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pb`
- Then you can run the pipeline to finetune with:
```python

View File

@ -48,5 +48,5 @@ if __name__ == '__main__':
params["required_files"] = ["am.mvn", "decoding.yaml", "configuration.json"]
params["output_dir"] = "./checkpoint"
params["data_dir"] = "./data/test"
params["decoding_model_name"] = "valid.cer_ctc.ave.pth"
params["decoding_model_name"] = "valid.cer_ctc.ave.pb"
modelscope_infer_after_finetune(params)

View File

@ -41,7 +41,7 @@ The decoding results can be found in `$output_dir/1best_recog/text.cer`, which i
- Modify inference related parameters in `infer_after_finetune.py`
- <strong>output_dir:</strong> # result dir
- <strong>data_dir:</strong> # the dataset dir needs to include `test/wav.scp`. If `test/text` also exists, CER will be computed
- <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pth`
- <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pb`
- Then you can run the pipeline to finetune with:
```python

View File

@ -48,5 +48,5 @@ if __name__ == '__main__':
params["required_files"] = ["am.mvn", "decoding.yaml", "configuration.json"]
params["output_dir"] = "./checkpoint"
params["data_dir"] = "./data/test"
params["decoding_model_name"] = "valid.cer_ctc.ave.pth"
params["decoding_model_name"] = "valid.cer_ctc.ave.pb"
modelscope_infer_after_finetune(params)

View File

@ -41,7 +41,7 @@ The decoding results can be found in `$output_dir/1best_recog/text.sp.cer` and `
- Modify inference related parameters in `infer_after_finetune.py`
- <strong>output_dir:</strong> # result dir
- <strong>data_dir:</strong> # the dataset dir needs to include `test/wav.scp`. If `test/text` also exists, CER will be computed
- <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pth`
- <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pb`
- Then you can run the pipeline to finetune with:
```python

View File

@ -63,5 +63,5 @@ if __name__ == '__main__':
params["required_files"] = ["feats_stats.npz", "decoding.yaml", "configuration.json"]
params["output_dir"] = "./checkpoint"
params["data_dir"] = "./example_data/validation"
params["decoding_model_name"] = "valid.acc.ave.pth"
params["decoding_model_name"] = "valid.acc.ave.pb"
modelscope_infer_after_finetune(params)

View File

@ -8,9 +8,14 @@ from modelscope.utils.constant import Tasks
from funasr.utils.compute_wer import compute_wer
def modelscope_infer_core(output_dir, split_dir, njob, idx):
def modelscope_infer_core(output_dir, split_dir, njob, idx, batch_size, ngpu, model):
output_dir_job = os.path.join(output_dir, "output.{}".format(idx))
gpu_id = (int(idx) - 1) // njob
if ngpu > 0:
use_gpu = 1
gpu_id = int(idx) - 1
else:
use_gpu = 0
gpu_id = -1
if "CUDA_VISIBLE_DEVICES" in os.environ.keys():
gpu_list = os.environ['CUDA_VISIBLE_DEVICES'].split(",")
os.environ['CUDA_VISIBLE_DEVICES'] = str(gpu_list[gpu_id])
@ -18,9 +23,10 @@ def modelscope_infer_core(output_dir, split_dir, njob, idx):
os.environ['CUDA_VISIBLE_DEVICES'] = str(gpu_id)
inference_pipline = pipeline(
task=Tasks.auto_speech_recognition,
model="damo/speech_paraformer-large_asr_nat-zh-cn-16k-aishell1-vocab8404-pytorch",
model=model,
output_dir=output_dir_job,
batch_size=64
batch_size=batch_size,
ngpu=use_gpu,
)
audio_in = os.path.join(split_dir, "wav.{}.scp".format(idx))
inference_pipline(audio_in=audio_in)
@ -30,13 +36,18 @@ def modelscope_infer(params):
# prepare for multi-GPU decoding
ngpu = params["ngpu"]
njob = params["njob"]
batch_size = params["batch_size"]
output_dir = params["output_dir"]
model = params["model"]
if os.path.exists(output_dir):
shutil.rmtree(output_dir)
os.mkdir(output_dir)
split_dir = os.path.join(output_dir, "split")
os.mkdir(split_dir)
nj = ngpu * njob
if ngpu > 0:
nj = ngpu
elif ngpu == 0:
nj = njob
wav_scp_file = os.path.join(params["data_dir"], "wav.scp")
with open(wav_scp_file) as f:
lines = f.readlines()
@ -56,7 +67,7 @@ def modelscope_infer(params):
p = Pool(nj)
for i in range(nj):
p.apply_async(modelscope_infer_core,
args=(output_dir, split_dir, njob, str(i + 1)))
args=(output_dir, split_dir, njob, str(i + 1), batch_size, ngpu, model))
p.close()
p.join()
@ -81,8 +92,10 @@ def modelscope_infer(params):
if __name__ == "__main__":
params = {}
params["model"] = "damo/speech_paraformer-large_asr_nat-zh-cn-16k-aishell1-vocab8404-pytorch"
params["data_dir"] = "./data/test"
params["output_dir"] = "./results"
params["ngpu"] = 1
params["njob"] = 1
modelscope_infer(params)
params["ngpu"] = 1 # if ngpu > 0, will use gpu decoding
params["njob"] = 1 # if ngpu = 0, will use cpu decoding
params["batch_size"] = 64
modelscope_infer(params)

View File

@ -4,23 +4,18 @@ import shutil
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks
from modelscope.hub.snapshot_download import snapshot_download
from funasr.utils.compute_wer import compute_wer
def modelscope_infer_after_finetune(params):
# prepare for decoding
pretrained_model_path = os.path.join(os.environ["HOME"], ".cache/modelscope/hub", params["modelscope_model_name"])
for file_name in params["required_files"]:
if file_name == "configuration.json":
with open(os.path.join(pretrained_model_path, file_name)) as f:
config_dict = json.load(f)
config_dict["model"]["am_model_name"] = params["decoding_model_name"]
with open(os.path.join(params["output_dir"], "configuration.json"), "w") as f:
json.dump(config_dict, f, indent=4, separators=(',', ': '))
else:
shutil.copy(os.path.join(pretrained_model_path, file_name),
os.path.join(params["output_dir"], file_name))
try:
pretrained_model_path = snapshot_download(params["modelscope_model_name"], cache_dir=params["output_dir"])
except BaseException:
raise BaseException(f"Please download pretrain model from ModelScope firstly.")
shutil.copy(os.path.join(params["output_dir"], params["decoding_model_name"]), os.path.join(pretrained_model_path, "model.pb"))
decoding_path = os.path.join(params["output_dir"], "decode_results")
if os.path.exists(decoding_path):
shutil.rmtree(decoding_path)
@ -29,9 +24,9 @@ def modelscope_infer_after_finetune(params):
# decoding
inference_pipeline = pipeline(
task=Tasks.auto_speech_recognition,
model=params["output_dir"],
model=pretrained_model_path,
output_dir=decoding_path,
batch_size=64
batch_size=params["batch_size"]
)
audio_in = os.path.join(params["data_dir"], "wav.scp")
inference_pipeline(audio_in=audio_in)
@ -46,8 +41,8 @@ def modelscope_infer_after_finetune(params):
if __name__ == '__main__':
params = {}
params["modelscope_model_name"] = "damo/speech_paraformer-large_asr_nat-zh-cn-16k-aishell1-vocab8404-pytorch"
params["required_files"] = ["am.mvn", "decoding.yaml", "configuration.json"]
params["output_dir"] = "./checkpoint"
params["data_dir"] = "./data/test"
params["decoding_model_name"] = "valid.acc.ave_10best.pth"
modelscope_infer_after_finetune(params)
params["decoding_model_name"] = "valid.acc.ave_10best.pb"
params["batch_size"] = 64
modelscope_infer_after_finetune(params)

View File

@ -8,9 +8,14 @@ from modelscope.utils.constant import Tasks
from funasr.utils.compute_wer import compute_wer
def modelscope_infer_core(output_dir, split_dir, njob, idx):
def modelscope_infer_core(output_dir, split_dir, njob, idx, batch_size, ngpu, model):
output_dir_job = os.path.join(output_dir, "output.{}".format(idx))
gpu_id = (int(idx) - 1) // njob
if ngpu > 0:
use_gpu = 1
gpu_id = int(idx) - 1
else:
use_gpu = 0
gpu_id = -1
if "CUDA_VISIBLE_DEVICES" in os.environ.keys():
gpu_list = os.environ['CUDA_VISIBLE_DEVICES'].split(",")
os.environ['CUDA_VISIBLE_DEVICES'] = str(gpu_list[gpu_id])
@ -18,9 +23,10 @@ def modelscope_infer_core(output_dir, split_dir, njob, idx):
os.environ['CUDA_VISIBLE_DEVICES'] = str(gpu_id)
inference_pipline = pipeline(
task=Tasks.auto_speech_recognition,
model="damo/speech_paraformer-large_asr_nat-zh-cn-16k-aishell2-vocab8404-pytorch",
model=model,
output_dir=output_dir_job,
batch_size=64
batch_size=batch_size,
ngpu=use_gpu,
)
audio_in = os.path.join(split_dir, "wav.{}.scp".format(idx))
inference_pipline(audio_in=audio_in)
@ -30,13 +36,18 @@ def modelscope_infer(params):
# prepare for multi-GPU decoding
ngpu = params["ngpu"]
njob = params["njob"]
batch_size = params["batch_size"]
output_dir = params["output_dir"]
model = params["model"]
if os.path.exists(output_dir):
shutil.rmtree(output_dir)
os.mkdir(output_dir)
split_dir = os.path.join(output_dir, "split")
os.mkdir(split_dir)
nj = ngpu * njob
if ngpu > 0:
nj = ngpu
elif ngpu == 0:
nj = njob
wav_scp_file = os.path.join(params["data_dir"], "wav.scp")
with open(wav_scp_file) as f:
lines = f.readlines()
@ -56,7 +67,7 @@ def modelscope_infer(params):
p = Pool(nj)
for i in range(nj):
p.apply_async(modelscope_infer_core,
args=(output_dir, split_dir, njob, str(i + 1)))
args=(output_dir, split_dir, njob, str(i + 1), batch_size, ngpu, model))
p.close()
p.join()
@ -81,8 +92,10 @@ def modelscope_infer(params):
if __name__ == "__main__":
params = {}
params["model"] = "damo/speech_paraformer-large_asr_nat-zh-cn-16k-aishell2-vocab8404-pytorch"
params["data_dir"] = "./data/test"
params["output_dir"] = "./results"
params["ngpu"] = 1
params["njob"] = 1
modelscope_infer(params)
params["ngpu"] = 1 # if ngpu > 0, will use gpu decoding
params["njob"] = 1 # if ngpu = 0, will use cpu decoding
params["batch_size"] = 64
modelscope_infer(params)

View File

@ -4,23 +4,18 @@ import shutil
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks
from modelscope.hub.snapshot_download import snapshot_download
from funasr.utils.compute_wer import compute_wer
def modelscope_infer_after_finetune(params):
# prepare for decoding
pretrained_model_path = os.path.join(os.environ["HOME"], ".cache/modelscope/hub", params["modelscope_model_name"])
for file_name in params["required_files"]:
if file_name == "configuration.json":
with open(os.path.join(pretrained_model_path, file_name)) as f:
config_dict = json.load(f)
config_dict["model"]["am_model_name"] = params["decoding_model_name"]
with open(os.path.join(params["output_dir"], "configuration.json"), "w") as f:
json.dump(config_dict, f, indent=4, separators=(',', ': '))
else:
shutil.copy(os.path.join(pretrained_model_path, file_name),
os.path.join(params["output_dir"], file_name))
try:
pretrained_model_path = snapshot_download(params["modelscope_model_name"], cache_dir=params["output_dir"])
except BaseException:
raise BaseException(f"Please download pretrain model from ModelScope firstly.")
shutil.copy(os.path.join(params["output_dir"], params["decoding_model_name"]), os.path.join(pretrained_model_path, "model.pb"))
decoding_path = os.path.join(params["output_dir"], "decode_results")
if os.path.exists(decoding_path):
shutil.rmtree(decoding_path)
@ -29,9 +24,9 @@ def modelscope_infer_after_finetune(params):
# decoding
inference_pipeline = pipeline(
task=Tasks.auto_speech_recognition,
model=params["output_dir"],
model=pretrained_model_path,
output_dir=decoding_path,
batch_size=64
batch_size=params["batch_size"]
)
audio_in = os.path.join(params["data_dir"], "wav.scp")
inference_pipeline(audio_in=audio_in)
@ -46,8 +41,8 @@ def modelscope_infer_after_finetune(params):
if __name__ == '__main__':
params = {}
params["modelscope_model_name"] = "damo/speech_paraformer-large_asr_nat-zh-cn-16k-aishell2-vocab8404-pytorch"
params["required_files"] = ["am.mvn", "decoding.yaml", "configuration.json"]
params["output_dir"] = "./checkpoint"
params["data_dir"] = "./data/test"
params["decoding_model_name"] = "valid.acc.ave_10best.pth"
modelscope_infer_after_finetune(params)
params["decoding_model_name"] = "valid.acc.ave_10best.pb"
params["batch_size"] = 64
modelscope_infer_after_finetune(params)

View File

@ -22,10 +22,12 @@
Or you can use the finetuned model for inference directly.
- Setting parameters in `infer.py`
- <strong>model:</strong> # model name on ModelScope
- <strong>data_dir:</strong> # the dataset dir needs to include `test/wav.scp`. If `test/text` also exists, CER will be computed
- <strong>output_dir:</strong> # result dir
- <strong>ngpu:</strong> # the number of GPUs for decoding
- <strong>njob:</strong> # the number of jobs for each GPU
- <strong>ngpu:</strong> # the number of GPUs for decoding; if `ngpu` > 0, GPU decoding is used
- <strong>njob:</strong> # the number of CPU decoding jobs; if `ngpu` = 0, CPU decoding is used and `njob` parallel jobs are launched
- <strong>batch_size:</strong> # the batch size for inference
- Then you can run the pipeline to infer with:
```python
@ -39,9 +41,11 @@ The decoding results can be found in `$output_dir/1best_recog/text.cer`, which i
### Inference using local finetuned model
- Modify inference related parameters in `infer_after_finetune.py`
- <strong>modelscope_model_name: </strong> # model name on ModelScope
- <strong>output_dir:</strong> # result dir
- <strong>data_dir:</strong> # the dataset dir needs to include `test/wav.scp`. If `test/text` also exists, CER will be computed
- <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pth`
- <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pb`
- <strong>batch_size:</strong> # the batch size for inference
- Then you can run the pipeline to finetune with:
```python

View File

@ -8,9 +8,14 @@ from modelscope.utils.constant import Tasks
from funasr.utils.compute_wer import compute_wer
def modelscope_infer_core(output_dir, split_dir, njob, idx):
def modelscope_infer_core(output_dir, split_dir, njob, idx, batch_size, ngpu, model):
output_dir_job = os.path.join(output_dir, "output.{}".format(idx))
gpu_id = (int(idx) - 1) // njob
if ngpu > 0:
use_gpu = 1
gpu_id = int(idx) - 1
else:
use_gpu = 0
gpu_id = -1
if "CUDA_VISIBLE_DEVICES" in os.environ.keys():
gpu_list = os.environ['CUDA_VISIBLE_DEVICES'].split(",")
os.environ['CUDA_VISIBLE_DEVICES'] = str(gpu_list[gpu_id])
@ -18,9 +23,10 @@ def modelscope_infer_core(output_dir, split_dir, njob, idx):
os.environ['CUDA_VISIBLE_DEVICES'] = str(gpu_id)
inference_pipline = pipeline(
task=Tasks.auto_speech_recognition,
model="damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch",
model=model,
output_dir=output_dir_job,
batch_size=64
batch_size=batch_size,
ngpu=use_gpu,
)
audio_in = os.path.join(split_dir, "wav.{}.scp".format(idx))
inference_pipline(audio_in=audio_in)
@ -30,13 +36,18 @@ def modelscope_infer(params):
# prepare for multi-GPU decoding
ngpu = params["ngpu"]
njob = params["njob"]
batch_size = params["batch_size"]
output_dir = params["output_dir"]
model = params["model"]
if os.path.exists(output_dir):
shutil.rmtree(output_dir)
os.mkdir(output_dir)
split_dir = os.path.join(output_dir, "split")
os.mkdir(split_dir)
nj = ngpu * njob
if ngpu > 0:
nj = ngpu
elif ngpu == 0:
nj = njob
wav_scp_file = os.path.join(params["data_dir"], "wav.scp")
with open(wav_scp_file) as f:
lines = f.readlines()
@ -56,7 +67,7 @@ def modelscope_infer(params):
p = Pool(nj)
for i in range(nj):
p.apply_async(modelscope_infer_core,
args=(output_dir, split_dir, njob, str(i + 1)))
args=(output_dir, split_dir, njob, str(i + 1), batch_size, ngpu, model))
p.close()
p.join()
@ -81,8 +92,10 @@ def modelscope_infer(params):
if __name__ == "__main__":
params = {}
params["model"] = "damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch"
params["data_dir"] = "./data/test"
params["output_dir"] = "./results"
params["ngpu"] = 1
params["njob"] = 1
modelscope_infer(params)
params["ngpu"] = 1 # if ngpu > 0, will use gpu decoding
params["njob"] = 1 # if ngpu = 0, will use cpu decoding
params["batch_size"] = 64
modelscope_infer(params)

View File

@ -4,23 +4,18 @@ import shutil
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks
from modelscope.hub.snapshot_download import snapshot_download
from funasr.utils.compute_wer import compute_wer
def modelscope_infer_after_finetune(params):
# prepare for decoding
pretrained_model_path = os.path.join(os.environ["HOME"], ".cache/modelscope/hub", params["modelscope_model_name"])
for file_name in params["required_files"]:
if file_name == "configuration.json":
with open(os.path.join(pretrained_model_path, file_name)) as f:
config_dict = json.load(f)
config_dict["model"]["am_model_name"] = params["decoding_model_name"]
with open(os.path.join(params["output_dir"], "configuration.json"), "w") as f:
json.dump(config_dict, f, indent=4, separators=(',', ': '))
else:
shutil.copy(os.path.join(pretrained_model_path, file_name),
os.path.join(params["output_dir"], file_name))
try:
pretrained_model_path = snapshot_download(params["modelscope_model_name"], cache_dir=params["output_dir"])
except BaseException:
raise BaseException(f"Please download pretrain model from ModelScope firstly.")
shutil.copy(os.path.join(params["output_dir"], params["decoding_model_name"]), os.path.join(pretrained_model_path, "model.pb"))
decoding_path = os.path.join(params["output_dir"], "decode_results")
if os.path.exists(decoding_path):
shutil.rmtree(decoding_path)
@ -29,9 +24,9 @@ def modelscope_infer_after_finetune(params):
# decoding
inference_pipeline = pipeline(
task=Tasks.auto_speech_recognition,
model=params["output_dir"],
model=pretrained_model_path,
output_dir=decoding_path,
batch_size=64
batch_size=params["batch_size"]
)
audio_in = os.path.join(params["data_dir"], "wav.scp")
inference_pipeline(audio_in=audio_in)
@ -46,8 +41,8 @@ def modelscope_infer_after_finetune(params):
if __name__ == '__main__':
params = {}
params["modelscope_model_name"] = "damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch"
params["required_files"] = ["am.mvn", "decoding.yaml", "configuration.json"]
params["output_dir"] = "./checkpoint"
params["data_dir"] = "./data/test"
params["decoding_model_name"] = "valid.acc.ave_10best.pth"
modelscope_infer_after_finetune(params)
params["decoding_model_name"] = "valid.acc.ave_10best.pb"
params["batch_size"] = 64
modelscope_infer_after_finetune(params)

View File

@ -0,0 +1,57 @@
import torch
import torchaudio
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks
from modelscope.utils.logger import get_logger
import logging
logger = get_logger(log_level=logging.CRITICAL)
logger.setLevel(logging.CRITICAL)
inference_pipeline = pipeline(
task=Tasks.auto_speech_recognition,
model='damo/speech_paraformer_asr_nat-zh-cn-16k-common-vocab8404-online',
model_revision='v1.0.2')
waveform, sample_rate = torchaudio.load("waihu.wav")
speech_length = waveform.shape[1]
speech = waveform[0]
# streaming caches: the encoder cache tracks chunk position and left/right padding
# (in 60 ms steps of 960 samples at 16 kHz) plus CIF states; the decoder cache keeps FSMN states
cache_en = {"start_idx": 0, "pad_left": 0, "stride": 10, "pad_right": 5, "cif_hidden": None, "cif_alphas": None}
cache_de = {"decode_fsmn": None}
cache = {"encoder": cache_en, "decoder": cache_de}
param_dict = {}
param_dict["cache"] = cache
first_chunk = True
speech_buffer = speech
speech_cache = []
final_result = ""
# consume the buffer chunk by chunk; 960 samples is one 60 ms step at 16 kHz
while len(speech_buffer) >= 960:
if first_chunk:
if len(speech_buffer) >= 14400:
rec_result = inference_pipeline(audio_in=speech_buffer[0:14400], param_dict=param_dict)
speech_buffer = speech_buffer[4800:]
else:
cache_en["stride"] = len(speech_buffer) // 960
cache_en["pad_right"] = 0
rec_result = inference_pipeline(audio_in=speech_buffer, param_dict=param_dict)
speech_buffer = []
cache_en["start_idx"] = -5
first_chunk = False
else:
cache_en["start_idx"] += 10
if len(speech_buffer) >= 4800:
cache_en["pad_left"] = 5
rec_result = inference_pipeline(audio_in=speech_buffer[:19200], param_dict=param_dict)
speech_buffer = speech_buffer[9600:]
else:
cache_en["stride"] = len(speech_buffer) // 960
cache_en["pad_right"] = 0
rec_result = inference_pipeline(audio_in=speech_buffer, param_dict=param_dict)
speech_buffer = []
if len(rec_result) !=0 and rec_result['text'] != "sil":
final_result += rec_result['text']
print(rec_result)
print(final_result)

View File

@ -8,9 +8,14 @@ from modelscope.utils.constant import Tasks
from funasr.utils.compute_wer import compute_wer
def modelscope_infer_core(output_dir, split_dir, njob, idx):
def modelscope_infer_core(output_dir, split_dir, njob, idx, batch_size, ngpu, model):
output_dir_job = os.path.join(output_dir, "output.{}".format(idx))
gpu_id = (int(idx) - 1) // njob
if ngpu > 0:
use_gpu = 1
gpu_id = int(idx) - 1
else:
use_gpu = 0
gpu_id = -1
if "CUDA_VISIBLE_DEVICES" in os.environ.keys():
gpu_list = os.environ['CUDA_VISIBLE_DEVICES'].split(",")
os.environ['CUDA_VISIBLE_DEVICES'] = str(gpu_list[gpu_id])
@ -18,9 +23,10 @@ def modelscope_infer_core(output_dir, split_dir, njob, idx):
os.environ['CUDA_VISIBLE_DEVICES'] = str(gpu_id)
inference_pipline = pipeline(
task=Tasks.auto_speech_recognition,
model="damo/speech_paraformer_asr_nat-zh-cn-8k-common-vocab8358-tensorflow1",
model=model,
output_dir=output_dir_job,
batch_size=64
batch_size=batch_size,
ngpu=use_gpu,
)
audio_in = os.path.join(split_dir, "wav.{}.scp".format(idx))
inference_pipline(audio_in=audio_in)
@ -30,13 +36,18 @@ def modelscope_infer(params):
# prepare for multi-GPU decoding
ngpu = params["ngpu"]
njob = params["njob"]
batch_size = params["batch_size"]
output_dir = params["output_dir"]
model = params["model"]
if os.path.exists(output_dir):
shutil.rmtree(output_dir)
os.mkdir(output_dir)
split_dir = os.path.join(output_dir, "split")
os.mkdir(split_dir)
nj = ngpu * njob
if ngpu > 0:
nj = ngpu
elif ngpu == 0:
nj = njob
wav_scp_file = os.path.join(params["data_dir"], "wav.scp")
with open(wav_scp_file) as f:
lines = f.readlines()
@ -56,7 +67,7 @@ def modelscope_infer(params):
p = Pool(nj)
for i in range(nj):
p.apply_async(modelscope_infer_core,
args=(output_dir, split_dir, njob, str(i + 1)))
args=(output_dir, split_dir, njob, str(i + 1), batch_size, ngpu, model))
p.close()
p.join()
@ -81,8 +92,10 @@ def modelscope_infer(params):
if __name__ == "__main__":
params = {}
params["model"] = "damo/speech_paraformer_asr_nat-zh-cn-8k-common-vocab8358-tensorflow1"
params["data_dir"] = "./data/test"
params["output_dir"] = "./results"
params["ngpu"] = 1
params["njob"] = 1
modelscope_infer(params)
params["ngpu"] = 1 # if ngpu > 0, will use gpu decoding
params["njob"] = 1 # if ngpu = 0, will use cpu decoding
params["batch_size"] = 64
modelscope_infer(params)

View File

@ -4,23 +4,18 @@ import shutil
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks
from modelscope.hub.snapshot_download import snapshot_download
from funasr.utils.compute_wer import compute_wer
def modelscope_infer_after_finetune(params):
# prepare for decoding
pretrained_model_path = os.path.join(os.environ["HOME"], ".cache/modelscope/hub", params["modelscope_model_name"])
for file_name in params["required_files"]:
if file_name == "configuration.json":
with open(os.path.join(pretrained_model_path, file_name)) as f:
config_dict = json.load(f)
config_dict["model"]["am_model_name"] = params["decoding_model_name"]
with open(os.path.join(params["output_dir"], "configuration.json"), "w") as f:
json.dump(config_dict, f, indent=4, separators=(',', ': '))
else:
shutil.copy(os.path.join(pretrained_model_path, file_name),
os.path.join(params["output_dir"], file_name))
try:
pretrained_model_path = snapshot_download(params["modelscope_model_name"], cache_dir=params["output_dir"])
except BaseException:
raise BaseException(f"Please download pretrain model from ModelScope firstly.")
shutil.copy(os.path.join(params["output_dir"], params["decoding_model_name"]), os.path.join(pretrained_model_path, "model.pb"))
decoding_path = os.path.join(params["output_dir"], "decode_results")
if os.path.exists(decoding_path):
shutil.rmtree(decoding_path)
@ -29,9 +24,9 @@ def modelscope_infer_after_finetune(params):
# decoding
inference_pipeline = pipeline(
task=Tasks.auto_speech_recognition,
model=params["output_dir"],
model=pretrained_model_path,
output_dir=decoding_path,
batch_size=64
batch_size=params["batch_size"]
)
audio_in = os.path.join(params["data_dir"], "wav.scp")
inference_pipeline(audio_in=audio_in)
@ -46,8 +41,8 @@ def modelscope_infer_after_finetune(params):
if __name__ == '__main__':
params = {}
params["modelscope_model_name"] = "damo/speech_paraformer_asr_nat-zh-cn-8k-common-vocab8358-tensorflow1"
params["required_files"] = ["am.mvn", "decoding.yaml", "configuration.json"]
params["output_dir"] = "./checkpoint"
params["data_dir"] = "./data/test"
params["decoding_model_name"] = "valid.acc.ave_10best.pth"
modelscope_infer_after_finetune(params)
params["decoding_model_name"] = "valid.acc.ave_10best.pb"
params["batch_size"] = 64
modelscope_infer_after_finetune(params)

View File

@ -41,7 +41,7 @@ The decoding results can be found in `$output_dir/1best_recog/text.cer`, which i
- Modify inference related parameters in `infer_after_finetune.py`
- <strong>output_dir:</strong> # result dir
- <strong>data_dir:</strong> # the dataset dir needs to include `test/wav.scp`. If `test/text` also exists, CER will be computed
- <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pth`
- <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pb`
- Then you can run the pipeline to finetune with:
```python

View File

@ -50,5 +50,5 @@ if __name__ == '__main__':
params["required_files"] = ["am.mvn", "decoding.yaml", "configuration.json"]
params["output_dir"] = "./checkpoint"
params["data_dir"] = "./data/test"
params["decoding_model_name"] = "20epoch.pth"
params["decoding_model_name"] = "20epoch.pb"
modelscope_infer_after_finetune(params)

View File

@ -41,7 +41,7 @@ The decoding results can be found in `$output_dir/1best_recog/text.cer`, which i
- Modify inference related parameters in `infer_after_finetune.py`
- <strong>output_dir:</strong> # result dir
- <strong>data_dir:</strong> # the dataset dir needs to include `test/wav.scp`. If `test/text` also exists, CER will be computed
- <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pth`
- <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pb`
- Then you can run the pipeline to finetune with:
```python

View File

@ -50,5 +50,5 @@ if __name__ == '__main__':
params["required_files"] = ["am.mvn", "decoding.yaml", "configuration.json"]
params["output_dir"] = "./checkpoint"
params["data_dir"] = "./data/test"
params["decoding_model_name"] = "20epoch.pth"
params["decoding_model_name"] = "20epoch.pb"
modelscope_infer_after_finetune(params)

View File

@ -0,0 +1,35 @@
import os
from modelscope.metainfo import Trainers
from modelscope.trainers import build_trainer
from funasr.datasets.ms_dataset import MsDataset
def modelscope_finetune(params):
if not os.path.exists(params["output_dir"]):
os.makedirs(params["output_dir"], exist_ok=True)
# dataset split ["train", "validation"]
ds_dict = MsDataset.load(params["data_dir"])
kwargs = dict(
model=params["model"],
model_revision=params["model_revision"],
data_dir=ds_dict,
dataset_type=params["dataset_type"],
work_dir=params["output_dir"],
batch_bins=params["batch_bins"],
max_epoch=params["max_epoch"],
lr=params["lr"])
trainer = build_trainer(Trainers.speech_asr_trainer, default_args=kwargs)
trainer.train()
if __name__ == '__main__':
params = {}
params["output_dir"] = "./checkpoint"
params["data_dir"] = "./data"
params["batch_bins"] = 2000
params["dataset_type"] = "small"
params["max_epoch"] = 50
params["lr"] = 0.00005
params["model"] = "damo/speech_UniASR_asr_2pass-he-16k-common-vocab1085-pytorch"
params["model_revision"] = None
modelscope_finetune(params)

View File

@ -0,0 +1,13 @@
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks
if __name__ == "__main__":
audio_in = "https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/asr_example_he.wav"
output_dir = "./results"
inference_pipline = pipeline(
task=Tasks.auto_speech_recognition,
model="damo/speech_UniASR_asr_2pass-he-16k-common-vocab1085-pytorch",
output_dir=output_dir,
)
rec_result = inference_pipline(audio_in=audio_in, param_dict={"decoding_model":"offline"})
print(rec_result)

View File

@ -41,7 +41,7 @@ The decoding results can be found in `$output_dir/1best_recog/text.cer`, which i
- Modify inference related parameters in `infer_after_finetune.py`
- <strong>output_dir:</strong> # result dir
- <strong>data_dir:</strong> # the dataset dir needs to include `test/wav.scp`. If `test/text` also exists, CER will be computed
- <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pth`
- <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pb`
- Then you can run the pipeline to finetune with:
```python

View File

@ -50,5 +50,5 @@ if __name__ == '__main__':
params["required_files"] = ["am.mvn", "decoding.yaml", "configuration.json"]
params["output_dir"] = "./checkpoint"
params["data_dir"] = "./data/test"
params["decoding_model_name"] = "20epoch.pth"
params["decoding_model_name"] = "20epoch.pb"
modelscope_infer_after_finetune(params)

View File

@ -0,0 +1,35 @@
import os
from modelscope.metainfo import Trainers
from modelscope.trainers import build_trainer
from funasr.datasets.ms_dataset import MsDataset
def modelscope_finetune(params):
if not os.path.exists(params["output_dir"]):
os.makedirs(params["output_dir"], exist_ok=True)
# dataset split ["train", "validation"]
ds_dict = MsDataset.load(params["data_dir"])
kwargs = dict(
model=params["model"],
model_revision=params["model_revision"],
data_dir=ds_dict,
dataset_type=params["dataset_type"],
work_dir=params["output_dir"],
batch_bins=params["batch_bins"],
max_epoch=params["max_epoch"],
lr=params["lr"])
trainer = build_trainer(Trainers.speech_asr_trainer, default_args=kwargs)
trainer.train()
if __name__ == '__main__':
params = {}
params["output_dir"] = "./checkpoint"
params["data_dir"] = "./data"
params["batch_bins"] = 2000
params["dataset_type"] = "small"
params["max_epoch"] = 50
params["lr"] = 0.00005
params["model"] = "damo/speech_UniASR_asr_2pass-my-16k-common-vocab696-pytorch"
params["model_revision"] = None
modelscope_finetune(params)

View File

@ -0,0 +1,13 @@
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks
if __name__ == "__main__":
audio_in = "https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/asr_example_my.wav"
output_dir = "./results"
inference_pipline = pipeline(
task=Tasks.auto_speech_recognition,
model="damo/speech_UniASR_asr_2pass-my-16k-common-vocab696-pytorch",
output_dir=output_dir,
)
rec_result = inference_pipline(audio_in=audio_in, param_dict={"decoding_model":"offline"})
print(rec_result)

View File

@ -0,0 +1,35 @@
import os
from modelscope.metainfo import Trainers
from modelscope.trainers import build_trainer
from funasr.datasets.ms_dataset import MsDataset
def modelscope_finetune(params):
if not os.path.exists(params["output_dir"]):
os.makedirs(params["output_dir"], exist_ok=True)
# dataset split ["train", "validation"]
ds_dict = MsDataset.load(params["data_dir"])
kwargs = dict(
model=params["model"],
model_revision=params["model_revision"],
data_dir=ds_dict,
dataset_type=params["dataset_type"],
work_dir=params["output_dir"],
batch_bins=params["batch_bins"],
max_epoch=params["max_epoch"],
lr=params["lr"])
trainer = build_trainer(Trainers.speech_asr_trainer, default_args=kwargs)
trainer.train()
if __name__ == '__main__':
params = {}
params["output_dir"] = "./checkpoint"
params["data_dir"] = "./data"
params["batch_bins"] = 2000
params["dataset_type"] = "small"
params["max_epoch"] = 50
params["lr"] = 0.00005
params["model"] = "damo/speech_UniASR_asr_2pass-ur-16k-common-vocab877-pytorch"
params["model_revision"] = None
modelscope_finetune(params)

View File

@ -0,0 +1,13 @@
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks
if __name__ == "__main__":
audio_in = "https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/asr_example_ur.wav"
output_dir = "./results"
inference_pipline = pipeline(
task=Tasks.auto_speech_recognition,
model="damo/speech_UniASR_asr_2pass-ur-16k-common-vocab877-pytorch",
output_dir=output_dir,
)
rec_result = inference_pipline(audio_in=audio_in, param_dict={"decoding_model":"offline"})
print(rec_result)

View File

@ -41,7 +41,7 @@ The decoding results can be found in `$output_dir/1best_recog/text.cer`, which i
- Modify inference related parameters in `infer_after_finetune.py`
- <strong>output_dir:</strong> # result dir
- <strong>data_dir:</strong> # the dataset dir needs to include `test/wav.scp`. If `test/text` also exists, CER will be computed
- <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pth`
- <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pb`
- Then you can run the pipeline to finetune with:
```python

View File

@ -49,5 +49,5 @@ if __name__ == '__main__':
params["required_files"] = ["am.mvn", "decoding.yaml", "configuration.json"]
params["output_dir"] = "./checkpoint"
params["data_dir"] = "./data/test"
params["decoding_model_name"] = "20epoch.pth"
params["decoding_model_name"] = "20epoch.pb"
modelscope_infer_after_finetune(params)

View File

@ -41,7 +41,8 @@ The decoding results can be found in `$output_dir/1best_recog/text.cer`, which i
- Modify inference related parameters in `infer_after_finetune.py`
- <strong>output_dir:</strong> # result dir
- <strong>data_dir:</strong> # the dataset dir needs to include `test/wav.scp`. If `test/text` also exists, CER will be computed
- <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pth`
- <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pb`
- Then you can run the pipeline to finetune with:
```python

View File

@ -49,5 +49,5 @@ if __name__ == '__main__':
params["required_files"] = ["am.mvn", "decoding.yaml", "configuration.json"]
params["output_dir"] = "./checkpoint"
params["data_dir"] = "./data/test"
params["decoding_model_name"] = "20epoch.pth"
params["decoding_model_name"] = "20epoch.pb"
modelscope_infer_after_finetune(params)

View File

@ -34,7 +34,7 @@ Or you can use the finetuned model for inference directly.
- Modify inference related parameters in `infer_after_finetune.py`
- <strong>output_dir:</strong> # result dir
- <strong>data_dir:</strong> # the dataset dir needs to include `test/wav.scp`. If `test/text` also exists, CER will be computed
- <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pth`
- <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pb`
- Then you can run the pipeline to finetune with:
```python

View File

@ -4,27 +4,17 @@ import shutil
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks
from modelscope.hub.snapshot_download import snapshot_download
from funasr.utils.compute_wer import compute_wer
def modelscope_infer_after_finetune(params):
# prepare for decoding
if not os.path.exists(os.path.join(params["output_dir"], "punc")):
os.makedirs(os.path.join(params["output_dir"], "punc"))
if not os.path.exists(os.path.join(params["output_dir"], "vad")):
os.makedirs(os.path.join(params["output_dir"], "vad"))
pretrained_model_path = os.path.join(os.environ["HOME"], ".cache/modelscope/hub", params["modelscope_model_name"])
for file_name in params["required_files"]:
if file_name == "configuration.json":
with open(os.path.join(pretrained_model_path, file_name)) as f:
config_dict = json.load(f)
config_dict["model"]["am_model_name"] = params["decoding_model_name"]
with open(os.path.join(params["output_dir"], "configuration.json"), "w") as f:
json.dump(config_dict, f, indent=4, separators=(',', ': '))
else:
shutil.copy(os.path.join(pretrained_model_path, file_name),
os.path.join(params["output_dir"], file_name))
try:
pretrained_model_path = snapshot_download(params["modelscope_model_name"], cache_dir=params["output_dir"])
except BaseException:
raise BaseException(f"Please download pretrain model from ModelScope firstly.")shutil.copy(os.path.join(params["output_dir"], params["decoding_model_name"]), os.path.join(pretrained_model_path, "model.pb"))
decoding_path = os.path.join(params["output_dir"], "decode_results")
if os.path.exists(decoding_path):
shutil.rmtree(decoding_path)
@ -33,16 +23,16 @@ def modelscope_infer_after_finetune(params):
# decoding
inference_pipeline = pipeline(
task=Tasks.auto_speech_recognition,
model=params["output_dir"],
model=pretrained_model_path,
output_dir=decoding_path,
batch_size=64
batch_size=params["batch_size"]
)
audio_in = os.path.join(params["data_dir"], "wav.scp")
inference_pipeline(audio_in=audio_in)
# compute CER if GT text is set
text_in = os.path.join(params["data_dir"], "text")
if text_in is not None:
if os.path.exists(text_in):
text_proc_file = os.path.join(decoding_path, "1best_recog/token")
compute_wer(text_in, text_proc_file, os.path.join(decoding_path, "text.cer"))
@ -50,8 +40,8 @@ def modelscope_infer_after_finetune(params):
if __name__ == '__main__':
params = {}
params["modelscope_model_name"] = "damo/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch"
params["required_files"] = ["am.mvn", "decoding.yaml", "configuration.json", "punc/punc.pb", "punc/punc.yaml", "vad/vad.mvn", "vad/vad.pb", "vad/vad.yaml"]
params["output_dir"] = "./checkpoint"
params["data_dir"] = "./data/test"
params["decoding_model_name"] = "valid.acc.ave_10best.pth"
modelscope_infer_after_finetune(params)
params["decoding_model_name"] = "valid.acc.ave_10best.pb"
params["batch_size"] = 64
modelscope_infer_after_finetune(params)

View File

@ -4,6 +4,11 @@ inputs = "跨境河流是养育沿岸|人民的生命之源长期以来为帮助
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks
from modelscope.utils.logger import get_logger
import logging
logger = get_logger(log_level=logging.CRITICAL)
logger.setLevel(logging.CRITICAL)
inference_pipeline = pipeline(
task=Tasks.punctuation,

View File

@ -0,0 +1,10 @@
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks
inference_diar_pipline = pipeline(
task=Tasks.speaker_diarization,
model='damo/speech_diarization_eend-ola-en-us-callhome-8k',
model_revision="v1.0.0",
)
results = inference_diar_pipline(audio_in=["https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_data/record2.wav"])
print(results)

View File

@ -14,13 +14,12 @@ inference_diar_pipline = pipeline(
)
# Take audio_list as input: the first audio is the speech to be detected, and the following audios are voiceprint enrollment utterances from different speakers
audio_list = [[
audio_list = [
"https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_data/record.wav",
"https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_data/spk_A.wav",
"https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_data/spk_B.wav",
"https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_data/spk_B1.wav"
]]
]
results = inference_diar_pipline(audio_in=audio_list)
for rst in results:
print(rst["value"])
print(results)

View File

@ -1,8 +1,11 @@
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks
from modelscope.utils.logger import get_logger
import logging
logger = get_logger(log_level=logging.CRITICAL)
logger.setLevel(logging.CRITICAL)
import soundfile
if __name__ == '__main__':
output_dir = None
inference_pipline = pipeline(

View File

@ -1,8 +1,11 @@
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks
from modelscope.utils.logger import get_logger
import logging
logger = get_logger(log_level=logging.CRITICAL)
logger.setLevel(logging.CRITICAL)
import soundfile
if __name__ == '__main__':
output_dir = None
inference_pipline = pipeline(

View File

@ -52,7 +52,7 @@ class Speech2Text:
Examples:
>>> import soundfile
>>> speech2text = Speech2Text("asr_config.yml", "asr.pth")
>>> speech2text = Speech2Text("asr_config.yml", "asr.pb")
>>> audio, rate = soundfile.read("speech.wav")
>>> speech2text(audio)
[(text, token, token_int, hypothesis object), ...]

View File

@@ -256,6 +256,9 @@ def inference_launch(**kwargs):
elif mode == "paraformer":
from funasr.bin.asr_inference_paraformer import inference_modelscope
return inference_modelscope(**kwargs)
elif mode == "paraformer_streaming":
from funasr.bin.asr_inference_paraformer_streaming import inference_modelscope
return inference_modelscope(**kwargs)
elif mode == "paraformer_vad":
from funasr.bin.asr_inference_paraformer_vad import inference_modelscope
return inference_modelscope(**kwargs)
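A hedged sketch of building the new streaming pipeline through this dispatcher follows; the paths are placeholders, and the keyword arguments mirror the signature of `inference_modelscope` in `asr_inference_paraformer_streaming` shown later in this commit.

```python
# Sketch only: build (but do not yet run) the streaming inference callable.
# The chunk cache is expected to be supplied per call via param_dict.
pipeline_fn = inference_launch(
    mode="paraformer_streaming",
    maxlenratio=0.0,
    minlenratio=0.0,
    batch_size=1,
    beam_size=1,
    ngpu=0,
    ctc_weight=0.0,
    lm_weight=0.0,
    penalty=0.0,
    log_level="INFO",
    asr_train_config="exp/config.yaml",  # placeholder path
    asr_model_file="exp/model.pb",       # placeholder path
)
```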

View File

@@ -41,8 +41,6 @@ from funasr.utils.types import str_or_none
from funasr.utils import asr_utils, wav_utils, postprocess_utils
import pdb
header_colors = '\033[95m'
end_colors = '\033[0m'
global_asr_language: str = 'zh-cn'
global_sample_rate: Union[int, Dict[Any, int]] = {
@@ -55,7 +53,7 @@ class Speech2Text:
Examples:
>>> import soundfile
>>> speech2text = Speech2Text("asr_config.yml", "asr.pth")
>>> speech2text = Speech2Text("asr_config.yml", "asr.pb")
>>> audio, rate = soundfile.read("speech.wav")
>>> speech2text(audio)
[(text, token, token_int, hypothesis object), ...]

View File

@@ -50,7 +50,7 @@ class Speech2Text:
Examples:
>>> import soundfile
>>> speech2text = Speech2Text("asr_config.yml", "asr.pth")
>>> speech2text = Speech2Text("asr_config.yml", "asr.pb")
>>> audio, rate = soundfile.read("speech.wav")
>>> speech2text(audio)
[(text, token, token_int, hypothesis object), ...]

View File

@@ -0,0 +1,907 @@
#!/usr/bin/env python3
import argparse
import logging
import sys
import time
import copy
import os
import codecs
import tempfile
import requests
from pathlib import Path
from typing import Optional
from typing import Sequence
from typing import Tuple
from typing import Union
from typing import Dict
from typing import Any
from typing import List
import numpy as np
import torch
from typeguard import check_argument_types
from funasr.fileio.datadir_writer import DatadirWriter
from funasr.modules.beam_search.beam_search import BeamSearchPara as BeamSearch
from funasr.modules.beam_search.beam_search import Hypothesis
from funasr.modules.scorers.ctc import CTCPrefixScorer
from funasr.modules.scorers.length_bonus import LengthBonus
from funasr.modules.subsampling import TooShortUttError
from funasr.tasks.asr import ASRTaskParaformer as ASRTask
from funasr.tasks.lm import LMTask
from funasr.text.build_tokenizer import build_tokenizer
from funasr.text.token_id_converter import TokenIDConverter
from funasr.torch_utils.device_funcs import to_device
from funasr.torch_utils.set_all_random_seed import set_all_random_seed
from funasr.utils import config_argparse
from funasr.utils.cli_utils import get_commandline_args
from funasr.utils.types import str2bool
from funasr.utils.types import str2triple_str
from funasr.utils.types import str_or_none
from funasr.utils import asr_utils, wav_utils, postprocess_utils
from funasr.models.frontend.wav_frontend import WavFrontend
from funasr.models.e2e_asr_paraformer import BiCifParaformer, ContextualParaformer
from funasr.export.models.e2e_asr_paraformer import Paraformer as Paraformer_export
class Speech2Text:
"""Speech2Text class
Examples:
>>> import soundfile
>>> speech2text = Speech2Text("asr_config.yml", "asr.pth")
>>> audio, rate = soundfile.read("speech.wav")
>>> speech2text(audio)
[(text, token, token_int, hypothesis object), ...]
"""
def __init__(
self,
asr_train_config: Union[Path, str] = None,
asr_model_file: Union[Path, str] = None,
cmvn_file: Union[Path, str] = None,
lm_train_config: Union[Path, str] = None,
lm_file: Union[Path, str] = None,
token_type: str = None,
bpemodel: str = None,
device: str = "cpu",
maxlenratio: float = 0.0,
minlenratio: float = 0.0,
dtype: str = "float32",
beam_size: int = 20,
ctc_weight: float = 0.5,
lm_weight: float = 1.0,
ngram_weight: float = 0.9,
penalty: float = 0.0,
nbest: int = 1,
frontend_conf: dict = None,
hotword_list_or_file: str = None,
**kwargs,
):
assert check_argument_types()
# 1. Build ASR model
scorers = {}
asr_model, asr_train_args = ASRTask.build_model_from_file(
asr_train_config, asr_model_file, cmvn_file, device
)
frontend = None
if asr_train_args.frontend is not None and asr_train_args.frontend_conf is not None:
frontend = WavFrontend(cmvn_file=cmvn_file, **asr_train_args.frontend_conf)
logging.info("asr_model: {}".format(asr_model))
logging.info("asr_train_args: {}".format(asr_train_args))
asr_model.to(dtype=getattr(torch, dtype)).eval()
if asr_model.ctc != None:
ctc = CTCPrefixScorer(ctc=asr_model.ctc, eos=asr_model.eos)
scorers.update(
ctc=ctc
)
token_list = asr_model.token_list
scorers.update(
length_bonus=LengthBonus(len(token_list)),
)
# 2. Build Language model
if lm_train_config is not None:
lm, lm_train_args = LMTask.build_model_from_file(
lm_train_config, lm_file, device
)
scorers["lm"] = lm.lm
# 3. Build ngram model
# ngram is not supported now
ngram = None
scorers["ngram"] = ngram
# 4. Build BeamSearch object
# transducer is not supported now
beam_search_transducer = None
weights = dict(
decoder=1.0 - ctc_weight,
ctc=ctc_weight,
lm=lm_weight,
ngram=ngram_weight,
length_bonus=penalty,
)
beam_search = BeamSearch(
beam_size=beam_size,
weights=weights,
scorers=scorers,
sos=asr_model.sos,
eos=asr_model.eos,
vocab_size=len(token_list),
token_list=token_list,
pre_beam_score_key=None if ctc_weight == 1.0 else "full",
)
beam_search.to(device=device, dtype=getattr(torch, dtype)).eval()
for scorer in scorers.values():
if isinstance(scorer, torch.nn.Module):
scorer.to(device=device, dtype=getattr(torch, dtype)).eval()
logging.info(f"Decoding device={device}, dtype={dtype}")
# 5. [Optional] Build Text converter: e.g. bpe-sym -> Text
if token_type is None:
token_type = asr_train_args.token_type
if bpemodel is None:
bpemodel = asr_train_args.bpemodel
if token_type is None:
tokenizer = None
elif token_type == "bpe":
if bpemodel is not None:
tokenizer = build_tokenizer(token_type=token_type, bpemodel=bpemodel)
else:
tokenizer = None
else:
tokenizer = build_tokenizer(token_type=token_type)
converter = TokenIDConverter(token_list=token_list)
logging.info(f"Text tokenizer: {tokenizer}")
self.asr_model = asr_model
self.asr_train_args = asr_train_args
self.converter = converter
self.tokenizer = tokenizer
# 6. [Optional] Build hotword list from str, local file or url
is_use_lm = lm_weight != 0.0 and lm_file is not None
if (ctc_weight == 0.0 or asr_model.ctc == None) and not is_use_lm:
beam_search = None
self.beam_search = beam_search
logging.info(f"Beam_search: {self.beam_search}")
self.beam_search_transducer = beam_search_transducer
self.maxlenratio = maxlenratio
self.minlenratio = minlenratio
self.device = device
self.dtype = dtype
self.nbest = nbest
self.frontend = frontend
self.encoder_downsampling_factor = 1
if asr_train_args.encoder == "data2vec_encoder" or asr_train_args.encoder_conf["input_layer"] == "conv2d":
self.encoder_downsampling_factor = 4
@torch.no_grad()
def __call__(
self, cache: dict, speech: Union[torch.Tensor, np.ndarray], speech_lengths: Union[torch.Tensor, np.ndarray] = None,
begin_time: int = 0, end_time: int = None,
):
"""Inference
Args:
speech: Input speech data
Returns:
text, token, token_int, hyp
"""
assert check_argument_types()
# Input as audio signal
if isinstance(speech, np.ndarray):
speech = torch.tensor(speech)
if self.frontend is not None:
feats, feats_len = self.frontend.forward(speech, speech_lengths)
feats = to_device(feats, device=self.device)
feats_len = feats_len.int()
self.asr_model.frontend = None
else:
feats = speech
feats_len = speech_lengths
lfr_factor = max(1, (feats.size()[-1] // 80) - 1)
batch = {"speech": feats, "speech_lengths": feats_len, "cache": cache}
# a. To device
batch = to_device(batch, device=self.device)
# b. Forward Encoder
enc, enc_len = self.asr_model.encode_chunk(**batch)
if isinstance(enc, tuple):
enc = enc[0]
# assert len(enc) == 1, len(enc)
enc_len_batch_total = torch.sum(enc_len).item() * self.encoder_downsampling_factor
predictor_outs = self.asr_model.calc_predictor_chunk(enc, cache)
pre_acoustic_embeds, pre_token_length, alphas, pre_peak_index = predictor_outs[0], predictor_outs[1], \
predictor_outs[2], predictor_outs[3]
pre_token_length = pre_token_length.floor().long()
if torch.max(pre_token_length) < 1:
return []
decoder_outs = self.asr_model.cal_decoder_with_predictor_chunk(enc, pre_acoustic_embeds, cache)
decoder_out = decoder_outs
results = []
b, n, d = decoder_out.size()
for i in range(b):
x = enc[i, :enc_len[i], :]
am_scores = decoder_out[i, :pre_token_length[i], :]
if self.beam_search is not None:
nbest_hyps = self.beam_search(
x=x, am_scores=am_scores, maxlenratio=self.maxlenratio, minlenratio=self.minlenratio
)
nbest_hyps = nbest_hyps[: self.nbest]
else:
yseq = am_scores.argmax(dim=-1)
score = am_scores.max(dim=-1)[0]
score = torch.sum(score, dim=-1)
# pad with mask tokens to ensure compatibility with sos/eos tokens
yseq = torch.tensor(
[self.asr_model.sos] + yseq.tolist() + [self.asr_model.eos], device=yseq.device
)
nbest_hyps = [Hypothesis(yseq=yseq, score=score)]
for hyp in nbest_hyps:
assert isinstance(hyp, (Hypothesis)), type(hyp)
# remove sos/eos and get results
last_pos = -1
if isinstance(hyp.yseq, list):
token_int = hyp.yseq[1:last_pos]
else:
token_int = hyp.yseq[1:last_pos].tolist()
# remove blank symbol id, which is assumed to be 0
token_int = list(filter(lambda x: x != 0 and x != 2, token_int))
# Change integer-ids to tokens
token = self.converter.ids2tokens(token_int)
if self.tokenizer is not None:
text = self.tokenizer.tokens2text(token)
else:
text = None
results.append((text, token, token_int, hyp, enc_len_batch_total, lfr_factor))
# assert check_return_type(results)
return results
class Speech2TextExport:
"""Speech2TextExport class
"""
def __init__(
self,
asr_train_config: Union[Path, str] = None,
asr_model_file: Union[Path, str] = None,
cmvn_file: Union[Path, str] = None,
lm_train_config: Union[Path, str] = None,
lm_file: Union[Path, str] = None,
token_type: str = None,
bpemodel: str = None,
device: str = "cpu",
maxlenratio: float = 0.0,
minlenratio: float = 0.0,
dtype: str = "float32",
beam_size: int = 20,
ctc_weight: float = 0.5,
lm_weight: float = 1.0,
ngram_weight: float = 0.9,
penalty: float = 0.0,
nbest: int = 1,
frontend_conf: dict = None,
hotword_list_or_file: str = None,
**kwargs,
):
# 1. Build ASR model
asr_model, asr_train_args = ASRTask.build_model_from_file(
asr_train_config, asr_model_file, cmvn_file, device
)
frontend = None
if asr_train_args.frontend is not None and asr_train_args.frontend_conf is not None:
frontend = WavFrontend(cmvn_file=cmvn_file, **asr_train_args.frontend_conf)
logging.info("asr_model: {}".format(asr_model))
logging.info("asr_train_args: {}".format(asr_train_args))
asr_model.to(dtype=getattr(torch, dtype)).eval()
token_list = asr_model.token_list
logging.info(f"Decoding device={device}, dtype={dtype}")
# 5. [Optional] Build Text converter: e.g. bpe-sym -> Text
if token_type is None:
token_type = asr_train_args.token_type
if bpemodel is None:
bpemodel = asr_train_args.bpemodel
if token_type is None:
tokenizer = None
elif token_type == "bpe":
if bpemodel is not None:
tokenizer = build_tokenizer(token_type=token_type, bpemodel=bpemodel)
else:
tokenizer = None
else:
tokenizer = build_tokenizer(token_type=token_type)
converter = TokenIDConverter(token_list=token_list)
logging.info(f"Text tokenizer: {tokenizer}")
# self.asr_model = asr_model
self.asr_train_args = asr_train_args
self.converter = converter
self.tokenizer = tokenizer
self.device = device
self.dtype = dtype
self.nbest = nbest
self.frontend = frontend
model = Paraformer_export(asr_model, onnx=False)
self.asr_model = model
@torch.no_grad()
def __call__(
self, speech: Union[torch.Tensor, np.ndarray], speech_lengths: Union[torch.Tensor, np.ndarray] = None
):
"""Inference
Args:
speech: Input speech data
Returns:
text, token, token_int, hyp
"""
assert check_argument_types()
# Input as audio signal
if isinstance(speech, np.ndarray):
speech = torch.tensor(speech)
if self.frontend is not None:
feats, feats_len = self.frontend.forward(speech, speech_lengths)
feats = to_device(feats, device=self.device)
feats_len = feats_len.int()
self.asr_model.frontend = None
else:
feats = speech
feats_len = speech_lengths
enc_len_batch_total = feats_len.sum()
lfr_factor = max(1, (feats.size()[-1] // 80) - 1)
batch = {"speech": feats, "speech_lengths": feats_len}
# a. To device
batch = to_device(batch, device=self.device)
decoder_outs = self.asr_model(**batch)
decoder_out, ys_pad_lens = decoder_outs[0], decoder_outs[1]
results = []
b, n, d = decoder_out.size()
for i in range(b):
am_scores = decoder_out[i, :ys_pad_lens[i], :]
yseq = am_scores.argmax(dim=-1)
score = am_scores.max(dim=-1)[0]
score = torch.sum(score, dim=-1)
# pad with mask tokens to ensure compatibility with sos/eos tokens
yseq = torch.tensor(
yseq.tolist(), device=yseq.device
)
nbest_hyps = [Hypothesis(yseq=yseq, score=score)]
for hyp in nbest_hyps:
assert isinstance(hyp, (Hypothesis)), type(hyp)
# remove sos/eos and get results
last_pos = -1
if isinstance(hyp.yseq, list):
token_int = hyp.yseq[1:last_pos]
else:
token_int = hyp.yseq[1:last_pos].tolist()
# remove blank symbol id, which is assumed to be 0
token_int = list(filter(lambda x: x != 0 and x != 2, token_int))
# Change integer-ids to tokens
token = self.converter.ids2tokens(token_int)
if self.tokenizer is not None:
text = self.tokenizer.tokens2text(token)
else:
text = None
results.append((text, token, token_int, hyp, enc_len_batch_total, lfr_factor))
return results
def inference(
maxlenratio: float,
minlenratio: float,
batch_size: int,
beam_size: int,
ngpu: int,
ctc_weight: float,
lm_weight: float,
penalty: float,
log_level: Union[int, str],
data_path_and_name_and_type,
asr_train_config: Optional[str],
asr_model_file: Optional[str],
cmvn_file: Optional[str] = None,
raw_inputs: Union[np.ndarray, torch.Tensor] = None,
lm_train_config: Optional[str] = None,
lm_file: Optional[str] = None,
token_type: Optional[str] = None,
key_file: Optional[str] = None,
word_lm_train_config: Optional[str] = None,
bpemodel: Optional[str] = None,
allow_variable_data_keys: bool = False,
streaming: bool = False,
output_dir: Optional[str] = None,
dtype: str = "float32",
seed: int = 0,
ngram_weight: float = 0.9,
nbest: int = 1,
num_workers: int = 1,
**kwargs,
):
inference_pipeline = inference_modelscope(
maxlenratio=maxlenratio,
minlenratio=minlenratio,
batch_size=batch_size,
beam_size=beam_size,
ngpu=ngpu,
ctc_weight=ctc_weight,
lm_weight=lm_weight,
penalty=penalty,
log_level=log_level,
asr_train_config=asr_train_config,
asr_model_file=asr_model_file,
cmvn_file=cmvn_file,
raw_inputs=raw_inputs,
lm_train_config=lm_train_config,
lm_file=lm_file,
token_type=token_type,
key_file=key_file,
word_lm_train_config=word_lm_train_config,
bpemodel=bpemodel,
allow_variable_data_keys=allow_variable_data_keys,
streaming=streaming,
output_dir=output_dir,
dtype=dtype,
seed=seed,
ngram_weight=ngram_weight,
nbest=nbest,
num_workers=num_workers,
**kwargs,
)
return inference_pipeline(data_path_and_name_and_type, raw_inputs)
def inference_modelscope(
maxlenratio: float,
minlenratio: float,
batch_size: int,
beam_size: int,
ngpu: int,
ctc_weight: float,
lm_weight: float,
penalty: float,
log_level: Union[int, str],
# data_path_and_name_and_type,
asr_train_config: Optional[str],
asr_model_file: Optional[str],
cmvn_file: Optional[str] = None,
lm_train_config: Optional[str] = None,
lm_file: Optional[str] = None,
token_type: Optional[str] = None,
key_file: Optional[str] = None,
word_lm_train_config: Optional[str] = None,
bpemodel: Optional[str] = None,
allow_variable_data_keys: bool = False,
dtype: str = "float32",
seed: int = 0,
ngram_weight: float = 0.9,
nbest: int = 1,
num_workers: int = 1,
output_dir: Optional[str] = None,
param_dict: dict = None,
**kwargs,
):
assert check_argument_types()
if word_lm_train_config is not None:
raise NotImplementedError("Word LM is not implemented")
if ngpu > 1:
raise NotImplementedError("only single GPU decoding is supported")
logging.basicConfig(
level=log_level,
format="%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s",
)
export_mode = False
if param_dict is not None:
hotword_list_or_file = param_dict.get('hotword')
export_mode = param_dict.get("export_mode", False)
else:
hotword_list_or_file = None
if ngpu >= 1 and torch.cuda.is_available():
device = "cuda"
else:
device = "cpu"
batch_size = 1
# 1. Set random-seed
set_all_random_seed(seed)
# 2. Build speech2text
speech2text_kwargs = dict(
asr_train_config=asr_train_config,
asr_model_file=asr_model_file,
cmvn_file=cmvn_file,
lm_train_config=lm_train_config,
lm_file=lm_file,
token_type=token_type,
bpemodel=bpemodel,
device=device,
maxlenratio=maxlenratio,
minlenratio=minlenratio,
dtype=dtype,
beam_size=beam_size,
ctc_weight=ctc_weight,
lm_weight=lm_weight,
ngram_weight=ngram_weight,
penalty=penalty,
nbest=nbest,
hotword_list_or_file=hotword_list_or_file,
)
if export_mode:
speech2text = Speech2TextExport(**speech2text_kwargs)
else:
speech2text = Speech2Text(**speech2text_kwargs)
def _forward(
data_path_and_name_and_type,
raw_inputs: Union[np.ndarray, torch.Tensor] = None,
output_dir_v2: Optional[str] = None,
fs: dict = None,
param_dict: dict = None,
**kwargs,
):
hotword_list_or_file = None
if param_dict is not None:
hotword_list_or_file = param_dict.get('hotword')
if 'hotword' in kwargs:
hotword_list_or_file = kwargs['hotword']
if hotword_list_or_file is not None or 'hotword' in kwargs:
speech2text.hotword_list = speech2text.generate_hotwords_list(hotword_list_or_file)
# 3. Build data-iterator
if data_path_and_name_and_type is None and raw_inputs is not None:
if isinstance(raw_inputs, torch.Tensor):
raw_inputs = raw_inputs.numpy()
data_path_and_name_and_type = [raw_inputs, "speech", "waveform"]
loader = ASRTask.build_streaming_iterator(
data_path_and_name_and_type,
dtype=dtype,
fs=fs,
batch_size=batch_size,
key_file=key_file,
num_workers=num_workers,
preprocess_fn=ASRTask.build_preprocess_fn(speech2text.asr_train_args, False),
collate_fn=ASRTask.build_collate_fn(speech2text.asr_train_args, False),
allow_variable_data_keys=allow_variable_data_keys,
inference=True,
)
if param_dict is not None:
use_timestamp = param_dict.get('use_timestamp', True)
else:
use_timestamp = True
forward_time_total = 0.0
length_total = 0.0
finish_count = 0
file_count = 1
cache = None
# 7. Start for-loop
# FIXME(kamo): The output format should be discussed
asr_result_list = []
output_path = output_dir_v2 if output_dir_v2 is not None else output_dir
if output_path is not None:
writer = DatadirWriter(output_path)
else:
writer = None
if param_dict is not None and "cache" in param_dict:
cache = param_dict["cache"]
for keys, batch in loader:
assert isinstance(batch, dict), type(batch)
assert all(isinstance(s, str) for s in keys), keys
_bs = len(next(iter(batch.values())))
assert len(keys) == _bs, f"{len(keys)} != {_bs}"
# batch = {k: v for k, v in batch.items() if not k.endswith("_lengths")}
logging.info("decoding, utt_id: {}".format(keys))
# N-best list of (text, token, token_int, hyp_object)
time_beg = time.time()
results = speech2text(cache=cache, **batch)
if len(results) < 1:
hyp = Hypothesis(score=0.0, scores={}, states={}, yseq=[])
results = [[" ", ["sil"], [2], hyp, 10, 6]] * nbest
time_end = time.time()
forward_time = time_end - time_beg
lfr_factor = results[0][-1]
length = results[0][-2]
forward_time_total += forward_time
length_total += length
rtf_cur = "decoding, feature length: {}, forward_time: {:.4f}, rtf: {:.4f}".format(length, forward_time,
100 * forward_time / (
length * lfr_factor))
logging.info(rtf_cur)
for batch_id in range(_bs):
result = [results[batch_id][:-2]]
key = keys[batch_id]
for n, result in zip(range(1, nbest + 1), result):
text, token, token_int, hyp = result[0], result[1], result[2], result[3]
time_stamp = None if len(result) < 5 else result[4]
# Create a directory: outdir/{n}best_recog
if writer is not None:
ibest_writer = writer[f"{n}best_recog"]
# Write the result to each file
ibest_writer["token"][key] = " ".join(token)
# ibest_writer["token_int"][key] = " ".join(map(str, token_int))
ibest_writer["score"][key] = str(hyp.score)
ibest_writer["rtf"][key] = rtf_cur
if text is not None:
if use_timestamp and time_stamp is not None:
postprocessed_result = postprocess_utils.sentence_postprocess(token, time_stamp)
else:
postprocessed_result = postprocess_utils.sentence_postprocess(token)
time_stamp_postprocessed = ""
if len(postprocessed_result) == 3:
text_postprocessed, time_stamp_postprocessed, word_lists = postprocessed_result[0], \
postprocessed_result[1], \
postprocessed_result[2]
else:
text_postprocessed, word_lists = postprocessed_result[0], postprocessed_result[1]
item = {'key': key, 'value': text_postprocessed}
if time_stamp_postprocessed != "":
item['time_stamp'] = time_stamp_postprocessed
asr_result_list.append(item)
finish_count += 1
# asr_utils.print_progress(finish_count / file_count)
if writer is not None:
ibest_writer["text"][key] = text_postprocessed
logging.info("decoding, utt: {}, predictions: {}".format(key, text))
rtf_avg = "decoding, feature length total: {}, forward_time total: {:.4f}, rtf avg: {:.4f}".format(length_total,
forward_time_total,
100 * forward_time_total / (
length_total * lfr_factor))
logging.info(rtf_avg)
if writer is not None:
ibest_writer["rtf"]["rtf_avf"] = rtf_avg
return asr_result_list
return _forward
def get_parser():
parser = config_argparse.ArgumentParser(
description="ASR Decoding",
formatter_class=argparse.ArgumentDefaultsHelpFormatter,
)
# Note(kamo): Use '_' instead of '-' as separator.
# '-' is confusing if written in yaml.
parser.add_argument(
"--log_level",
type=lambda x: x.upper(),
default="INFO",
choices=("CRITICAL", "ERROR", "WARNING", "INFO", "DEBUG", "NOTSET"),
help="The verbose level of logging",
)
parser.add_argument("--output_dir", type=str, required=True)
parser.add_argument(
"--ngpu",
type=int,
default=0,
help="The number of gpus. 0 indicates CPU mode",
)
parser.add_argument("--seed", type=int, default=0, help="Random seed")
parser.add_argument(
"--dtype",
default="float32",
choices=["float16", "float32", "float64"],
help="Data type",
)
parser.add_argument(
"--num_workers",
type=int,
default=1,
help="The number of workers used for DataLoader",
)
parser.add_argument(
"--hotword",
type=str_or_none,
default=None,
help="hotword file path or hotwords seperated by space"
)
group = parser.add_argument_group("Input data related")
group.add_argument(
"--data_path_and_name_and_type",
type=str2triple_str,
required=False,
action="append",
)
group.add_argument("--key_file", type=str_or_none)
group.add_argument("--allow_variable_data_keys", type=str2bool, default=False)
group = parser.add_argument_group("The model configuration related")
group.add_argument(
"--asr_train_config",
type=str,
help="ASR training configuration",
)
group.add_argument(
"--asr_model_file",
type=str,
help="ASR model parameter file",
)
group.add_argument(
"--cmvn_file",
type=str,
help="Global cmvn file",
)
group.add_argument(
"--lm_train_config",
type=str,
help="LM training configuration",
)
group.add_argument(
"--lm_file",
type=str,
help="LM parameter file",
)
group.add_argument(
"--word_lm_train_config",
type=str,
help="Word LM training configuration",
)
group.add_argument(
"--word_lm_file",
type=str,
help="Word LM parameter file",
)
group.add_argument(
"--ngram_file",
type=str,
help="N-gram parameter file",
)
group.add_argument(
"--model_tag",
type=str,
help="Pretrained model tag. If specify this option, *_train_config and "
"*_file will be overwritten",
)
group = parser.add_argument_group("Beam-search related")
group.add_argument(
"--batch_size",
type=int,
default=1,
help="The batch size for inference",
)
group.add_argument("--nbest", type=int, default=1, help="Output N-best hypotheses")
group.add_argument("--beam_size", type=int, default=20, help="Beam size")
group.add_argument("--penalty", type=float, default=0.0, help="Insertion penalty")
group.add_argument(
"--maxlenratio",
type=float,
default=0.0,
help="Input length ratio to obtain max output length. "
"If maxlenratio=0.0 (default), it uses a end-detect "
"function "
"to automatically find maximum hypothesis lengths."
"If maxlenratio<0.0, its absolute value is interpreted"
"as a constant max output length",
)
group.add_argument(
"--minlenratio",
type=float,
default=0.0,
help="Input length ratio to obtain min output length",
)
group.add_argument(
"--ctc_weight",
type=float,
default=0.5,
help="CTC weight in joint decoding",
)
group.add_argument("--lm_weight", type=float, default=1.0, help="RNNLM weight")
group.add_argument("--ngram_weight", type=float, default=0.9, help="ngram weight")
group.add_argument("--streaming", type=str2bool, default=False)
group.add_argument(
"--frontend_conf",
default=None,
help="",
)
group.add_argument("--raw_inputs", type=list, default=None)
# example=[{'key':'EdevDEWdIYQ_0021','file':'/mnt/data/jiangyu.xzy/test_data/speech_io/SPEECHIO_ASR_ZH00007_zhibodaihuo/wav/EdevDEWdIYQ_0021.wav'}])
group = parser.add_argument_group("Text converter related")
group.add_argument(
"--token_type",
type=str_or_none,
default=None,
choices=["char", "bpe", None],
help="The token type for ASR model. "
"If not given, refers from the training args",
)
group.add_argument(
"--bpemodel",
type=str_or_none,
default=None,
help="The model path of sentencepiece. "
"If not given, refers from the training args",
)
return parser
def main(cmd=None):
print(get_commandline_args(), file=sys.stderr)
parser = get_parser()
args = parser.parse_args(cmd)
param_dict = {'hotword': args.hotword}
kwargs = vars(args)
kwargs.pop("config", None)
kwargs['param_dict'] = param_dict
inference(**kwargs)
if __name__ == "__main__":
main()
# from modelscope.pipelines import pipeline
# from modelscope.utils.constant import Tasks
#
# inference_16k_pipline = pipeline(
# task=Tasks.auto_speech_recognition,
# model='damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch')
#
# rec_result = inference_16k_pipline(audio_in='https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/asr_example_zh.wav')
# print(rec_result)

View File

@@ -58,7 +58,7 @@ class Speech2Text:
Examples:
>>> import soundfile
>>> speech2text = Speech2Text("asr_config.yml", "asr.pth")
>>> speech2text = Speech2Text("asr_config.yml", "asr.pb")
>>> audio, rate = soundfile.read("speech.wav")
>>> speech2text(audio)
[(text, token, token_int, hypothesis object), ...]

View File

@@ -46,7 +46,7 @@ class Speech2Text:
Examples:
>>> import soundfile
>>> speech2text = Speech2Text("asr_config.yml", "asr.pth")
>>> speech2text = Speech2Text("asr_config.yml", "asr.pb")
>>> audio, rate = soundfile.read("speech.wav")
>>> speech2text(audio)
[(text, token, token_int, hypothesis object), ...]

View File

@@ -46,7 +46,7 @@ class Speech2Text:
Examples:
>>> import soundfile
>>> speech2text = Speech2Text("asr_config.yml", "asr.pth")
>>> speech2text = Speech2Text("asr_config.yml", "asr.pb")
>>> audio, rate = soundfile.read("speech.wav")
>>> speech2text(audio)
[(text, token, token_int, hypothesis object), ...]

View File

@@ -133,7 +133,7 @@ def inference_launch(mode, **kwargs):
param_dict = {
"extract_profile": True,
"sv_train_config": "sv.yaml",
"sv_model_file": "sv.pth",
"sv_model_file": "sv.pb",
}
if "param_dict" in kwargs and kwargs["param_dict"] is not None:
for key in param_dict:
@@ -142,6 +142,9 @@ def inference_launch(mode, **kwargs):
else:
kwargs["param_dict"] = param_dict
return inference_modelscope(mode=mode, **kwargs)
elif mode == "eend-ola":
from funasr.bin.eend_ola_inference import inference_modelscope
return inference_modelscope(mode=mode, **kwargs)
else:
logging.info("Unknown decoding mode: {}".format(mode))
return None

View File

@@ -16,6 +16,7 @@ from typing import Union
import numpy as np
import torch
from scipy.signal import medfilt
from typeguard import check_argument_types
from funasr.models.frontend.wav_frontend import WavFrontendMel23
@@ -34,7 +35,7 @@ class Speech2Diarization:
Examples:
>>> import soundfile
>>> import numpy as np
>>> speech2diar = Speech2Diarization("diar_sond_config.yml", "diar_sond.pth")
>>> speech2diar = Speech2Diarization("diar_sond_config.yml", "diar_sond.pb")
>>> profile = np.load("profiles.npy")
>>> audio, rate = soundfile.read("speech.wav")
>>> speech2diar(audio, profile)
@@ -146,7 +147,7 @@ def inference_modelscope(
output_dir: Optional[str] = None,
batch_size: int = 1,
dtype: str = "float32",
ngpu: int = 0,
ngpu: int = 1,
num_workers: int = 0,
log_level: Union[int, str] = "INFO",
key_file: Optional[str] = None,
@@ -179,7 +180,6 @@ def inference_modelscope(
diar_model_file=diar_model_file,
device=device,
dtype=dtype,
streaming=streaming,
)
logging.info("speech2diarization_kwargs: {}".format(speech2diar_kwargs))
speech2diar = Speech2Diarization.from_pretrained(
@@ -209,7 +209,7 @@ def inference_modelscope(
if data_path_and_name_and_type is None and raw_inputs is not None:
if isinstance(raw_inputs, torch.Tensor):
raw_inputs = raw_inputs.numpy()
data_path_and_name_and_type = [raw_inputs, "speech", "waveform"]
data_path_and_name_and_type = [raw_inputs[0], "speech", "sound"]
loader = EENDOLADiarTask.build_streaming_iterator(
data_path_and_name_and_type,
dtype=dtype,
@@ -236,9 +236,23 @@ def inference_modelscope(
# batch = {k: v[0] for k, v in batch.items() if not k.endswith("_lengths")}
results = speech2diar(**batch)
# post process
a = results[0][0].cpu().numpy()
a = medfilt(a, (11, 1))
rst = []
for spkid, frames in enumerate(a.T):
frames = np.pad(frames, (1, 1), 'constant')
changes, = np.where(np.diff(frames, axis=0) != 0)
fmt = "SPEAKER {:s} 1 {:7.2f} {:7.2f} <NA> <NA> {:s} <NA>"
for s, e in zip(changes[::2], changes[1::2]):
st = s / 10.
dur = (e - s) / 10.
rst.append(fmt.format(keys[0], st, dur, "{}_{}".format(keys[0], str(spkid))))
# Only supporting batch_size==1
key, value = keys[0], output_results_str(results, keys[0])
item = {"key": key, "value": value}
value = "\n".join(rst)
item = {"key": keys[0], "value": value}
result_list.append(item)
if output_path is not None:
output_writer.write(value)
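As a reading aid, here is a tiny self-contained illustration (made-up values) of the frame-to-segment conversion used in the post-processing above.

```python
# Illustrative only: a zero-padded per-speaker activity track; np.diff marks the
# on/off boundaries, and consecutive boundary pairs become (start, length) in
# frames, which the code above converts to seconds at 10 frames per second.
import numpy as np

frames = np.array([0, 1, 1, 1, 0, 0, 1, 1, 0])  # already padded with zeros at both ends
changes, = np.where(np.diff(frames) != 0)
segments = [(int(s), int(e)) for s, e in zip(changes[::2], changes[1::2])]
print(segments)  # [(0, 3), (5, 7)] -> start 0.0 s dur 0.3 s, start 0.5 s dur 0.2 s
```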

View File

@@ -42,7 +42,7 @@ class Speech2Diarization:
Examples:
>>> import soundfile
>>> import numpy as np
>>> speech2diar = Speech2Diarization("diar_sond_config.yml", "diar_sond.pth")
>>> speech2diar = Speech2Diarization("diar_sond_config.yml", "diar_sond.pb")
>>> profile = np.load("profiles.npy")
>>> audio, rate = soundfile.read("speech.wav")
>>> speech2diar(audio, profile)
@@ -54,7 +54,7 @@ class Speech2Diarization:
self,
diar_train_config: Union[Path, str] = None,
diar_model_file: Union[Path, str] = None,
device: str = "cpu",
device: Union[str, torch.device] = "cpu",
batch_size: int = 1,
dtype: str = "float32",
streaming: bool = False,
@@ -114,9 +114,19 @@ class Speech2Diarization:
# little-endian order: lower bit first
return (np.array(list(b)[::-1]) == '1').astype(dtype)
return np.row_stack([int2vec(int(x), vec_dim) for x in seq])
# process oov
seq = np.array([int(x) for x in seq])
new_seq = []
for i, x in enumerate(seq):
if x < 2 ** vec_dim:
new_seq.append(x)
else:
idx_list = np.where(seq < 2 ** vec_dim)[0]
idx = np.abs(idx_list - i).argmin()
new_seq.append(seq[idx_list[idx]])
return np.row_stack([int2vec(x, vec_dim) for x in new_seq])
def post_processing(self, raw_logits: torch.Tensor, spk_num: int):
def post_processing(self, raw_logits: torch.Tensor, spk_num: int, output_format: str = "speaker_turn"):
logits_idx = raw_logits.argmax(-1) # B, T, vocab_size -> B, T
# upsampling outputs to match inputs
ut = logits_idx.shape[1] * self.diar_model.encoder.time_ds_ratio
@@ -127,8 +137,14 @@
).squeeze(1).long()
logits_idx = logits_idx[0].tolist()
pse_labels = [self.token_list[x] for x in logits_idx]
if output_format == "pse_labels":
return pse_labels, None
multi_labels = self.seq2arr(pse_labels, spk_num)[:, :spk_num] # remove padding speakers
multi_labels = self.smooth_multi_labels(multi_labels)
if output_format == "binary_labels":
return multi_labels, None
spk_list = ["spk{}".format(i + 1) for i in range(spk_num)]
spk_turns = self.calc_spk_turns(multi_labels, spk_list)
results = OrderedDict()
@@ -149,6 +165,7 @@
self,
speech: Union[torch.Tensor, np.ndarray],
profile: Union[torch.Tensor, np.ndarray],
output_format: str = "speaker_turn"
):
"""Inference
@@ -178,7 +195,7 @@
batch = to_device(batch, device=self.device)
logits = self.diar_model.prediction_forward(**batch)
results, pse_labels = self.post_processing(logits, profile.shape[1])
results, pse_labels = self.post_processing(logits, profile.shape[1], output_format)
return results, pse_labels
@@ -367,7 +384,7 @@ def inference_modelscope(
pse_label_writer = open("{}/labels.txt".format(output_path), "w")
logging.info("Start to diarize...")
result_list = []
for keys, batch in loader:
for idx, (keys, batch) in enumerate(loader):
assert isinstance(batch, dict), type(batch)
assert all(isinstance(s, str) for s in keys), keys
_bs = len(next(iter(batch.values())))
@@ -385,6 +402,9 @@
pse_label_writer.write("{} {}\n".format(key, " ".join(pse_labels)))
pse_label_writer.flush()
if idx % 100 == 0:
logging.info("Processing {:5d}: {}".format(idx, key))
if output_path is not None:
output_writer.close()
pse_label_writer.close()

View File

@@ -36,7 +36,7 @@ class Speech2Xvector:
Examples:
>>> import soundfile
>>> speech2xvector = Speech2Xvector("sv_config.yml", "sv.pth")
>>> speech2xvector = Speech2Xvector("sv_config.yml", "sv.pb")
>>> audio, rate = soundfile.read("speech.wav")
>>> speech2xvector(audio)
[(text, token, token_int, hypothesis object), ...]
@@ -169,7 +169,7 @@ def inference_modelscope(
log_level: Union[int, str] = "INFO",
key_file: Optional[str] = None,
sv_train_config: Optional[str] = "sv.yaml",
sv_model_file: Optional[str] = "sv.pth",
sv_model_file: Optional[str] = "sv.pb",
model_tag: Optional[str] = None,
allow_variable_data_keys: bool = True,
streaming: bool = False,

View File

@@ -8,6 +8,7 @@ from typing import Dict
from typing import Iterator
from typing import Tuple
from typing import Union
from typing import List
import kaldiio
import numpy as np
@@ -129,7 +130,7 @@ class IterableESPnetDataset(IterableDataset):
non_iterable_list = []
self.path_name_type_list = []
if not isinstance(path_name_type_list[0], Tuple):
if not isinstance(path_name_type_list[0], (Tuple, List)):
path = path_name_type_list[0]
name = path_name_type_list[1]
_type = path_name_type_list[2]
@@ -227,13 +228,9 @@ class IterableESPnetDataset(IterableDataset):
name = self.path_name_type_list[i][1]
_type = self.path_name_type_list[i][2]
if _type == "sound":
audio_type = os.path.basename(value).split(".")[-1].lower()
if audio_type not in SUPPORT_AUDIO_TYPE_SETS:
raise NotImplementedError(
f'Not supported audio type: {audio_type}')
if audio_type == "pcm":
_type = "pcm"
audio_type = os.path.basename(value).lower()
if audio_type.rfind(".pcm") >= 0:
_type = "pcm"
func = DATA_TYPES[_type]
array = func(value)
if self.fs is not None and (name == "speech" or name == "ref_speech"):
@@ -335,11 +332,8 @@ class IterableESPnetDataset(IterableDataset):
# 2.a. Load data streamingly
for value, (path, name, _type) in zip(values, self.path_name_type_list):
if _type == "sound":
audio_type = os.path.basename(value).split(".")[-1].lower()
if audio_type not in SUPPORT_AUDIO_TYPE_SETS:
raise NotImplementedError(
f'Not supported audio type: {audio_type}')
if audio_type == "pcm":
audio_type = os.path.basename(value).lower()
if audio_type.rfind(".pcm") >= 0:
_type = "pcm"
func = DATA_TYPES[_type]
# Load entry
@@ -391,3 +385,4 @@ class IterableESPnetDataset(IterableDataset):
if count == 0:
raise RuntimeError("No iteration")

View File

@@ -18,15 +18,11 @@ def forward_segment(text, seg_dict):
def seg_tokenize(txt, seg_dict):
out_txt = ""
pattern = re.compile(r"([\u4E00-\u9FA5A-Za-z0-9])")
for word in txt:
if pattern.match(word):
if word in seg_dict:
out_txt += seg_dict[word] + " "
else:
out_txt += "<unk>" + " "
if word in seg_dict:
out_txt += seg_dict[word] + " "
else:
continue
out_txt += "<unk>" + " "
return out_txt.strip().split()
def tokenize(data,
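The change above makes every token missing from `seg_dict` map to `<unk>` instead of being silently dropped when it fails the old character-class check. A tiny illustration (assuming the `seg_tokenize` defined above is in scope; the `seg_dict` values are invented for the example):

```python
# Illustrative only: seg_dict entries are made up.
seg_dict = {"你": "你", "好": "好", "hello": "he@@ llo"}
print(seg_tokenize(["你", "好", "hello", "☺"], seg_dict))
# old behaviour skipped "☺"; now -> ['你', '好', 'he@@', 'llo', '<unk>']
```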

View File

@@ -47,15 +47,11 @@ def forward_segment(text, dic):
def seg_tokenize(txt, seg_dict):
out_txt = ""
pattern = re.compile(r"([\u4E00-\u9FA5A-Za-z0-9])")
for word in txt:
if pattern.match(word):
if word in seg_dict:
out_txt += seg_dict[word] + " "
else:
out_txt += "<unk>" + " "
if word in seg_dict:
out_txt += seg_dict[word] + " "
else:
continue
out_txt += "<unk>" + " "
return out_txt.strip().split()
def seg_tokenize_wo_pattern(txt, seg_dict):

View File

@@ -2,6 +2,8 @@
## Environments
torch >= 1.11.0
modelscope >= 1.2.0
torch-quant >= 0.4.0 (required for exporting quantized torchscript format model)
# pip install torch-quant -i https://pypi.org/simple
## Install modelscope and funasr
@@ -11,31 +13,46 @@ The installation is the same as [funasr](../../README.md)
`Tips`: torch>=1.11.0
```shell
python -m funasr.export.export_model [model_name] [export_dir] [onnx]
python -m funasr.export.export_model \
--model-name [model_name] \
--export-dir [export_dir] \
--type [onnx, torch] \
--quantize [true, false] \
--fallback-num [fallback_num]
```
`model_name`: the model to export. It could be a model from modelscope, or a local finetuned model (named model.pb).
`export_dir`: the directory the exported model is written to.
`onnx`: `true`, export an onnx format model; `false`, export a torchscript format model.
`model-name`: the model to export. It could be a model from modelscope, or a local finetuned model (named model.pb).
`export-dir`: the directory the exported model is written to.
`type`: `onnx` or `torch`, export an onnx format model or a torchscript format model.
`quantize`: `true`, additionally export a quantized model; `false`, export the fp32 model only.
`fallback-num`: the number of fallback layers for automatic mixed-precision quantization.
## For example
### Export onnx format model
Export model from modelscope
```shell
python -m funasr.export.export_model 'damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch' "./export" true
python -m funasr.export.export_model --model-name damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch --export-dir ./export --type onnx
```
Export a model from a local path; the model file must be named `model.pb`.
```shell
python -m funasr.export.export_model '/mnt/workspace/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch' "./export" true
python -m funasr.export.export_model --model-name /mnt/workspace/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch --export-dir ./export --type onnx
```
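A hedged sketch of loading the exported (or quantized) ONNX file with onnxruntime follows; the path is a placeholder, and the input/output names are queried rather than assumed.

```python
import onnxruntime as ort

# Placeholder path: the exporter writes f"{model_name}.onnx" and
# f"{model_name}_quant.onnx" under the printed output directory.
sess = ort.InferenceSession("path/to/export/model_quant.onnx")
print([(i.name, i.shape) for i in sess.get_inputs()])   # inspect the expected inputs
print([(o.name, o.shape) for o in sess.get_outputs()])  # and the produced outputs
```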
### Export torchscripts format model
Export model from modelscope
```shell
python -m funasr.export.export_model 'damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch' "./export" false
python -m funasr.export.export_model --model-name damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch --export-dir ./export --type torch
```
Export a model from a local path; the model file must be named `model.pb`.
```shell
python -m funasr.export.export_model '/mnt/workspace/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch' "./export" false
python -m funasr.export.export_model --model-name /mnt/workspace/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch --export-dir ./export --type torch
```
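Similarly, a hedged sketch of loading the TorchScript export; the path, the `(features, feature_lengths)` calling convention, and the feature dimension are assumptions based on the exporter's dummy inputs.

```python
import torch

model = torch.jit.load("path/to/export/model.torchscripts")  # placeholder path
model.eval()
feats = torch.randn(1, 100, 560)                    # assumed LFR fbank features
feats_len = torch.tensor([100], dtype=torch.int32)  # matching feature lengths
with torch.no_grad():
    out = model(feats, feats_len)
```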
## Acknowledge
Torch model quantization is supported by [BladeDISC](https://github.com/alibaba/BladeDISC), an end-to-end DynamIc Shape Compiler project for machine learning workloads. BladeDISC provides general, transparent, and easy-to-use performance optimization for TensorFlow/PyTorch workloads on GPGPU and CPU backends. If you are interested, please contact us.

View File

@@ -10,12 +10,20 @@ import torch
from funasr.export.models import get_model
import numpy as np
import random
from funasr.utils.types import str2bool
# torch_version = float(".".join(torch.__version__.split(".")[:2]))
# assert torch_version > 1.9
class ASRModelExportParaformer:
def __init__(self, cache_dir: Union[Path, str] = None, onnx: bool = True):
def __init__(
self,
cache_dir: Union[Path, str] = None,
onnx: bool = True,
quant: bool = True,
fallback_num: int = 0,
audio_in: str = None,
calib_num: int = 200,
):
assert check_argument_types()
self.set_all_random_seed(0)
if cache_dir is None:
@@ -28,6 +36,11 @@
)
print("output dir: {}".format(self.cache_dir))
self.onnx = onnx
self.quant = quant
self.fallback_num = fallback_num
self.frontend = None
self.audio_in = audio_in
self.calib_num = calib_num
def _export(
@@ -56,6 +69,43 @@
print("output dir: {}".format(export_dir))
def _torch_quantize(self, model):
def _run_calibration_data(m):
# use dummy inputs for calibration when no calibration audio is provided
if self.audio_in is not None:
feats, feats_len = self.load_feats(self.audio_in)
for i, (feat, len) in enumerate(zip(feats, feats_len)):
with torch.no_grad():
m(feat, len)
else:
dummy_input = model.get_dummy_inputs()
m(*dummy_input)
from torch_quant.module import ModuleFilter
from torch_quant.quantizer import Backend, Quantizer
from funasr.export.models.modules.decoder_layer import DecoderLayerSANM
from funasr.export.models.modules.encoder_layer import EncoderLayerSANM
module_filter = ModuleFilter(include_classes=[EncoderLayerSANM, DecoderLayerSANM])
module_filter.exclude_op_types = [torch.nn.Conv1d]
quantizer = Quantizer(
module_filter=module_filter,
backend=Backend.FBGEMM,
)
model.eval()
calib_model = quantizer.calib(model)
_run_calibration_data(calib_model)
if self.fallback_num > 0:
# perform automatic mixed precision quantization
amp_model = quantizer.amp(model)
_run_calibration_data(amp_model)
quantizer.fallback(amp_model, num=self.fallback_num)
print('Fallback layers:')
print('\n'.join(quantizer.module_filter.exclude_names))
quant_model = quantizer.quantize(model)
return quant_model
def _export_torchscripts(self, model, verbose, path, enc_size=None):
if enc_size:
dummy_input = model.get_dummy_inputs(enc_size)
@@ -66,10 +116,49 @@
model_script = torch.jit.trace(model, dummy_input)
model_script.save(os.path.join(path, f'{model.model_name}.torchscripts'))
if self.quant:
quant_model = self._torch_quantize(model)
model_script = torch.jit.trace(quant_model, dummy_input)
model_script.save(os.path.join(path, f'{model.model_name}_quant.torchscripts'))
def set_all_random_seed(self, seed: int):
random.seed(seed)
np.random.seed(seed)
torch.random.manual_seed(seed)
def parse_audio_in(self, audio_in):
wav_list, name_list = [], []
if audio_in.endswith(".scp"):
f = open(audio_in, 'r')
lines = f.readlines()[:self.calib_num]
for line in lines:
name, path = line.strip().split()
name_list.append(name)
wav_list.append(path)
else:
wav_list = [audio_in,]
name_list = ["test",]
return wav_list, name_list
def load_feats(self, audio_in: str = None):
import torchaudio
wav_list, name_list = self.parse_audio_in(audio_in)
feats = []
feats_len = []
for line in wav_list:
path = line.strip()
waveform, sampling_rate = torchaudio.load(path)
if sampling_rate != self.frontend.fs:
waveform = torchaudio.transforms.Resample(orig_freq=sampling_rate,
new_freq=self.frontend.fs)(waveform)
fbank, fbank_len = self.frontend(waveform, [waveform.size(1)])
feats.append(fbank)
feats_len.append(fbank_len)
return feats, feats_len
def export(self,
tag_name: str = 'damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch',
mode: str = 'paraformer',
@@ -96,6 +185,7 @@
model, asr_train_args = ASRTask.build_model_from_file(
asr_train_config, asr_model_file, cmvn_file, 'cpu'
)
self.frontend = model.frontend
self._export(model, tag_name)
@@ -107,11 +197,12 @@
# model_script = torch.jit.script(model)
model_script = model #torch.jit.trace(model)
model_path = os.path.join(path, f'{model.model_name}.onnx')
torch.onnx.export(
model_script,
dummy_input,
os.path.join(path, f'{model.model_name}.onnx'),
model_path,
verbose=verbose,
opset_version=14,
input_names=model.get_input_names(),
@@ -119,17 +210,42 @@
dynamic_axes=model.get_dynamic_axes()
)
if self.quant:
from onnxruntime.quantization import QuantType, quantize_dynamic
import onnx
quant_model_path = os.path.join(path, f'{model.model_name}_quant.onnx')
onnx_model = onnx.load(model_path)
nodes = [n.name for n in onnx_model.graph.node]
nodes_to_exclude = [m for m in nodes if 'output' in m]
quantize_dynamic(
model_input=model_path,
model_output=quant_model_path,
op_types_to_quantize=['MatMul'],
per_channel=True,
reduce_range=False,
weight_type=QuantType.QUInt8,
nodes_to_exclude=nodes_to_exclude,
)
if __name__ == '__main__':
import sys
model_path = sys.argv[1]
output_dir = sys.argv[2]
onnx = sys.argv[3]
onnx = onnx.lower()
onnx = onnx == 'true'
# model_path = 'damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch'
# output_dir = "../export"
export_model = ASRModelExportParaformer(cache_dir=output_dir, onnx=onnx)
export_model.export(model_path)
# export_model.export('/root/cache/export/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch')
import argparse
parser = argparse.ArgumentParser()
parser.add_argument('--model-name', type=str, required=True)
parser.add_argument('--export-dir', type=str, required=True)
parser.add_argument('--type', type=str, default='onnx', help='["onnx", "torch"]')
parser.add_argument('--quantize', type=str2bool, default=False, help='export quantized model')
parser.add_argument('--fallback-num', type=int, default=0, help='amp fallback number')
parser.add_argument('--audio_in', type=str, default=None, help='["wav", "wav.scp"]')
parser.add_argument('--calib_num', type=int, default=200, help='calib max num')
args = parser.parse_args()
export_model = ASRModelExportParaformer(
cache_dir=args.export_dir,
onnx=args.type == 'onnx',
quant=args.quantize,
fallback_num=args.fallback_num,
audio_in=args.audio_in,
calib_num=args.calib_num,
)
export_model.export(args.model_name)
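For reference, an equivalent programmatic use of the exporter defined above (values are illustrative; the model id is the public paraformer-large model used elsewhere in this repo):

```python
from funasr.export.export_model import ASRModelExportParaformer

export_model = ASRModelExportParaformer(
    cache_dir="./export",
    onnx=True,        # False would produce a TorchScript export instead
    quant=True,       # also emit a quantized model
    fallback_num=0,
    audio_in=None,    # or a wav / wav.scp path to calibrate on real audio
    calib_num=200,
)
export_model.export("damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch")
```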

View File

@@ -16,6 +16,7 @@ class EncoderLayerSANM(nn.Module):
self.feed_forward = model.feed_forward
self.norm1 = model.norm1
self.norm2 = model.norm2
self.in_size = model.in_size
self.size = model.size
def forward(self, x, mask):
@@ -23,13 +24,12 @@
residual = x
x = self.norm1(x)
x = self.self_attn(x, mask)
if x.size(2) == residual.size(2):
if self.in_size == self.size:
x = x + residual
residual = x
x = self.norm2(x)
x = self.feed_forward(x)
if x.size(2) == residual.size(2):
x = x + residual
x = x + residual
return x, mask

View File

@@ -64,6 +64,23 @@ class MultiHeadedAttentionSANM(nn.Module):
return self.linear_out(context_layer) # (batch, time1, d_model)
def preprocess_for_attn(x, mask, cache, pad_fn):
x = x * mask
x = x.transpose(1, 2)
if cache is None:
x = pad_fn(x)
else:
x = torch.cat((cache[:, :, 1:], x), dim=2)
cache = x
return x, cache
torch_version = float(".".join(torch.__version__.split(".")[:2]))
if torch_version >= 1.8:
import torch.fx
torch.fx.wrap('preprocess_for_attn')
class MultiHeadedAttentionSANMDecoder(nn.Module):
def __init__(self, model):
super().__init__()
@@ -73,16 +90,7 @@ class MultiHeadedAttentionSANMDecoder(nn.Module):
self.attn = None
def forward(self, inputs, mask, cache=None):
# b, t, d = inputs.size()
# mask = torch.reshape(mask, (b, -1, 1))
inputs = inputs * mask
x = inputs.transpose(1, 2)
if cache is None:
x = self.pad_fn(x)
else:
x = torch.cat((cache[:, :, 1:], x), dim=2)
cache = x
x, cache = preprocess_for_attn(inputs, mask, cache, self.pad_fn)
x = self.fsmn_block(x)
x = x.transpose(1, 2)
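The refactor above pulls the cache/padding logic into a module-level helper and wraps it with `torch.fx.wrap`, presumably so that FX-based tracing (as used by the quantization path) treats it as one opaque call instead of baking in the `cache is None` branch. A small stand-alone illustration of that mechanism (not code from this repo):

```python
import torch
import torch.fx

def add_or_sub(x, flag):
    return x + 1 if flag is None else x - 1

torch.fx.wrap("add_or_sub")  # keep this function as a single node during symbolic tracing

class M(torch.nn.Module):
    def forward(self, x):
        return add_or_sub(x, None)

print(torch.fx.symbolic_trace(M()).code)  # the graph calls add_or_sub rather than inlining x + 1
```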
@@ -232,4 +240,4 @@ class OnnxRelPosMultiHeadedAttention(OnnxMultiHeadedAttention):
new_context_layer_shape = context_layer.size()[:-2] + (self.all_head_size,)
context_layer = context_layer.view(new_context_layer_shape)
return self.linear_out(context_layer) # (batch, time1, d_model)

View File

@@ -66,13 +66,13 @@ def average_nbest_models(
elif n == 1:
# The averaged model is same as the best model
e, _ = epoch_and_values[0]
op = output_dir / f"{e}epoch.pth"
sym_op = output_dir / f"{ph}.{cr}.ave_1best.{suffix}pth"
op = output_dir / f"{e}epoch.pb"
sym_op = output_dir / f"{ph}.{cr}.ave_1best.{suffix}pb"
if sym_op.is_symlink() or sym_op.exists():
sym_op.unlink()
sym_op.symlink_to(op.name)
else:
op = output_dir / f"{ph}.{cr}.ave_{n}best.{suffix}pth"
op = output_dir / f"{ph}.{cr}.ave_{n}best.{suffix}pb"
logging.info(
f"Averaging {n}best models: " f'criterion="{ph}.{cr}": {op}'
)
@@ -83,12 +83,12 @@ def average_nbest_models(
if e not in _loaded:
if oss_bucket is None:
_loaded[e] = torch.load(
output_dir / f"{e}epoch.pth",
output_dir / f"{e}epoch.pb",
map_location="cpu",
)
else:
buffer = BytesIO(
oss_bucket.get_object(os.path.join(pai_output_dir, f"{e}epoch.pth")).read())
oss_bucket.get_object(os.path.join(pai_output_dir, f"{e}epoch.pb")).read())
_loaded[e] = torch.load(buffer)
states = _loaded[e]
@@ -115,13 +115,13 @@ def average_nbest_models(
else:
buffer = BytesIO()
torch.save(avg, buffer)
oss_bucket.put_object(os.path.join(pai_output_dir, f"{ph}.{cr}.ave_{n}best.{suffix}pth"),
oss_bucket.put_object(os.path.join(pai_output_dir, f"{ph}.{cr}.ave_{n}best.{suffix}pb"),
buffer.getvalue())
# 3. *.*.ave.pth is a symlink to the max ave model
# 3. *.*.ave.pb is a symlink to the max ave model
if oss_bucket is None:
op = output_dir / f"{ph}.{cr}.ave_{max(_nbests)}best.{suffix}pth"
sym_op = output_dir / f"{ph}.{cr}.ave.{suffix}pth"
op = output_dir / f"{ph}.{cr}.ave_{max(_nbests)}best.{suffix}pb"
sym_op = output_dir / f"{ph}.{cr}.ave.{suffix}pb"
if sym_op.is_symlink() or sym_op.exists():
sym_op.unlink()
sym_op.symlink_to(op.name)

View File

@@ -191,12 +191,12 @@ def unpack(
Examples:
tarfile:
model.pth
model.pb
some1.file
some2.file
>>> unpack("tarfile", "out")
{'asr_model_file': 'out/model.pth'}
{'asr_model_file': 'out/model.pb'}
"""
input_archive = Path(input_archive)
outpath = Path(outpath)

View File

@@ -90,6 +90,47 @@ class DecoderLayerSANM(nn.Module):
tgt = self.norm1(tgt)
tgt = self.feed_forward(tgt)
x = tgt
if self.self_attn:
if self.normalize_before:
tgt = self.norm2(tgt)
x, _ = self.self_attn(tgt, tgt_mask)
x = residual + self.dropout(x)
if self.src_attn is not None:
residual = x
if self.normalize_before:
x = self.norm3(x)
x = residual + self.dropout(self.src_attn(x, memory, memory_mask))
return x, tgt_mask, memory, memory_mask, cache
def forward_chunk(self, tgt, tgt_mask, memory, memory_mask=None, cache=None):
"""Compute decoded features.
Args:
tgt (torch.Tensor): Input tensor (#batch, maxlen_out, size).
tgt_mask (torch.Tensor): Mask for input tensor (#batch, maxlen_out).
memory (torch.Tensor): Encoded memory, float32 (#batch, maxlen_in, size).
memory_mask (torch.Tensor): Encoded memory mask (#batch, maxlen_in).
cache (List[torch.Tensor]): List of cached tensors.
Each tensor shape should be (#batch, maxlen_out - 1, size).
Returns:
torch.Tensor: Output tensor(#batch, maxlen_out, size).
torch.Tensor: Mask for output tensor (#batch, maxlen_out).
torch.Tensor: Encoded memory (#batch, maxlen_in, size).
torch.Tensor: Encoded memory mask (#batch, maxlen_in).
"""
# tgt = self.dropout(tgt)
residual = tgt
if self.normalize_before:
tgt = self.norm1(tgt)
tgt = self.feed_forward(tgt)
x = tgt
if self.self_attn:
if self.normalize_before:
@@ -109,7 +150,6 @@ class DecoderLayerSANM(nn.Module):
return x, tgt_mask, memory, memory_mask, cache
class FsmnDecoderSCAMAOpt(BaseTransformerDecoder):
"""
author: Speech Lab, Alibaba Group, China
@@ -947,6 +987,65 @@ class ParaformerSANMDecoder(BaseTransformerDecoder):
)
return logp.squeeze(0), state
def forward_chunk(
self,
memory: torch.Tensor,
tgt: torch.Tensor,
cache: dict = None,
) -> Tuple[torch.Tensor, torch.Tensor]:
"""Forward decoder.
Args:
hs_pad: encoded memory, float32 (batch, maxlen_in, feat)
hlens: (batch)
ys_in_pad:
input token ids, int64 (batch, maxlen_out)
if input_layer == "embed"
input tensor (batch, maxlen_out, #mels) in the other cases
ys_in_lens: (batch)
Returns:
(tuple): tuple containing:
x: decoded token score before softmax (batch, maxlen_out, token)
if use_output_layer is True,
olens: (batch, )
"""
x = tgt
if cache["decode_fsmn"] is None:
cache_layer_num = len(self.decoders)
if self.decoders2 is not None:
cache_layer_num += len(self.decoders2)
new_cache = [None] * cache_layer_num
else:
new_cache = cache["decode_fsmn"]
for i in range(self.att_layer_num):
decoder = self.decoders[i]
x, tgt_mask, memory, memory_mask, c_ret = decoder.forward_chunk(
x, None, memory, None, cache=new_cache[i]
)
new_cache[i] = c_ret
if self.num_blocks - self.att_layer_num > 1:
for i in range(self.num_blocks - self.att_layer_num):
j = i + self.att_layer_num
decoder = self.decoders2[i]
x, tgt_mask, memory, memory_mask, c_ret = decoder.forward_chunk(
x, None, memory, None, cache=new_cache[j]
)
new_cache[j] = c_ret
for decoder in self.decoders3:
x, tgt_mask, memory, memory_mask, _ = decoder.forward_chunk(
x, None, memory, None, cache=None
)
if self.normalize_before:
x = self.after_norm(x)
if self.output_layer is not None:
x = self.output_layer(x)
cache["decode_fsmn"] = new_cache
return x
def forward_one_step(
self,
tgt: torch.Tensor,

View File

@@ -325,6 +325,65 @@ class Paraformer(AbsESPnetModel):
return encoder_out, encoder_out_lens
def encode_chunk(
self, speech: torch.Tensor, speech_lengths: torch.Tensor, cache: dict = None
) -> Tuple[torch.Tensor, torch.Tensor]:
"""Frontend + Encoder. Note that this method is used by asr_inference.py
Args:
speech: (Batch, Length, ...)
speech_lengths: (Batch, )
"""
with autocast(False):
# 1. Extract feats
feats, feats_lengths = self._extract_feats(speech, speech_lengths)
# 2. Data augmentation
if self.specaug is not None and self.training:
feats, feats_lengths = self.specaug(feats, feats_lengths)
# 3. Normalization for feature: e.g. Global-CMVN, Utterance-CMVN
if self.normalize is not None:
feats, feats_lengths = self.normalize(feats, feats_lengths)
# Pre-encoder, e.g. used for raw input data
if self.preencoder is not None:
feats, feats_lengths = self.preencoder(feats, feats_lengths)
# 4. Forward encoder
# feats: (Batch, Length, Dim)
# -> encoder_out: (Batch, Length2, Dim2)
if self.encoder.interctc_use_conditioning:
encoder_out, encoder_out_lens, _ = self.encoder.forward_chunk(
feats, feats_lengths, cache=cache["encoder"], ctc=self.ctc
)
else:
encoder_out, encoder_out_lens, _ = self.encoder.forward_chunk(feats, feats_lengths, cache=cache["encoder"])
intermediate_outs = None
if isinstance(encoder_out, tuple):
intermediate_outs = encoder_out[1]
encoder_out = encoder_out[0]
# Post-encoder, e.g. NLU
if self.postencoder is not None:
encoder_out, encoder_out_lens = self.postencoder(
encoder_out, encoder_out_lens
)
assert encoder_out.size(0) == speech.size(0), (
encoder_out.size(),
speech.size(0),
)
assert encoder_out.size(1) <= encoder_out_lens.max(), (
encoder_out.size(),
encoder_out_lens.max(),
)
if intermediate_outs is not None:
return (encoder_out, intermediate_outs), encoder_out_lens
return encoder_out, encoder_out_lens
def calc_predictor(self, encoder_out, encoder_out_lens):
encoder_out_mask = (~make_pad_mask(encoder_out_lens, maxlen=encoder_out.size(1))[:, None, :]).to(
@@ -333,6 +392,11 @@ class Paraformer(AbsESPnetModel):
ignore_id=self.ignore_id)
return pre_acoustic_embeds, pre_token_length, alphas, pre_peak_index
def calc_predictor_chunk(self, encoder_out, cache=None):
pre_acoustic_embeds, pre_token_length, alphas, pre_peak_index = self.predictor.forward_chunk(encoder_out, cache["encoder"])
return pre_acoustic_embeds, pre_token_length, alphas, pre_peak_index
def cal_decoder_with_predictor(self, encoder_out, encoder_out_lens, sematic_embeds, ys_pad_lens):
decoder_outs = self.decoder(
@@ -342,6 +406,14 @@
decoder_out = torch.log_softmax(decoder_out, dim=-1)
return decoder_out, ys_pad_lens
def cal_decoder_with_predictor_chunk(self, encoder_out, sematic_embeds, cache=None):
decoder_outs = self.decoder.forward_chunk(
encoder_out, sematic_embeds, cache["decoder"]
)
decoder_out = decoder_outs
decoder_out = torch.log_softmax(decoder_out, dim=-1)
return decoder_out
def _extract_feats(
self, speech: torch.Tensor, speech_lengths: torch.Tensor
) -> Tuple[torch.Tensor, torch.Tensor]:
@@ -1459,4 +1531,4 @@ class ContextualParaformer(Paraformer):
"torch tensor: {}, {}, loading from tf tensor: {}, {}".format(name, data_tf.size(), name_tf,
var_dict_tf[name_tf].shape))
return var_dict_torch_update
return var_dict_torch_update

View File

@@ -52,15 +52,15 @@ class DiarEENDOLAModel(AbsESPnetModel):
super().__init__()
self.frontend = frontend
self.encoder = encoder
self.encoder_decoder_attractor = encoder_decoder_attractor
self.enc = encoder
self.eda = encoder_decoder_attractor
self.attractor_loss_weight = attractor_loss_weight
self.max_n_speaker = max_n_speaker
if mapping_dict is None:
mapping_dict = generate_mapping_dict(max_speaker_num=self.max_n_speaker)
self.mapping_dict = mapping_dict
# PostNet
self.PostNet = nn.LSTM(self.max_n_speaker, n_units, 1, batch_first=True)
self.postnet = nn.LSTM(self.max_n_speaker, n_units, 1, batch_first=True)
self.output_layer = nn.Linear(n_units, mapping_dict['oov'] + 1)
def forward_encoder(self, xs, ilens):
@ -68,7 +68,7 @@ class DiarEENDOLAModel(AbsESPnetModel):
pad_shape = xs.shape
xs_mask = [torch.ones(ilen).to(xs.device) for ilen in ilens]
xs_mask = torch.nn.utils.rnn.pad_sequence(xs_mask, batch_first=True, padding_value=0).unsqueeze(-2)
emb = self.encoder(xs, xs_mask)
emb = self.enc(xs, xs_mask)
emb = torch.split(emb.view(pad_shape[0], pad_shape[1], -1), 1, dim=0)
emb = [e[0][:ilen] for e, ilen in zip(emb, ilens)]
return emb
@ -76,8 +76,8 @@ class DiarEENDOLAModel(AbsESPnetModel):
def forward_post_net(self, logits, ilens):
maxlen = torch.max(ilens).to(torch.int).item()
logits = nn.utils.rnn.pad_sequence(logits, batch_first=True, padding_value=-1)
logits = nn.utils.rnn.pack_padded_sequence(logits, ilens, batch_first=True, enforce_sorted=False)
outputs, (_, _) = self.PostNet(logits)
logits = nn.utils.rnn.pack_padded_sequence(logits, ilens.cpu().to(torch.int64), batch_first=True, enforce_sorted=False)
outputs, (_, _) = self.postnet(logits)
outputs = nn.utils.rnn.pad_packed_sequence(outputs, batch_first=True, padding_value=-1, total_length=maxlen)[0]
outputs = [output[:ilens[i].to(torch.int).item()] for i, output in enumerate(outputs)]
outputs = [self.output_layer(output) for output in outputs]
@ -112,7 +112,7 @@ class DiarEENDOLAModel(AbsESPnetModel):
text = text[:, : text_lengths.max()]
# 1. Encoder
encoder_out, encoder_out_lens = self.encode(speech, speech_lengths)
encoder_out, encoder_out_lens = self.enc(speech, speech_lengths)
intermediate_outs = None
if isinstance(encoder_out, tuple):
intermediate_outs = encoder_out[1]
@ -190,18 +190,16 @@ class DiarEENDOLAModel(AbsESPnetModel):
shuffle: bool = True,
threshold: float = 0.5,
**kwargs):
if self.frontend is not None:
speech = self.frontend(speech)
speech = [s[:s_len] for s, s_len in zip(speech, speech_lengths)]
emb = self.forward_encoder(speech, speech_lengths)
if shuffle:
orders = [np.arange(e.shape[0]) for e in emb]
for order in orders:
np.random.shuffle(order)
attractors, probs = self.encoder_decoder_attractor.estimate(
attractors, probs = self.eda.estimate(
[e[torch.from_numpy(order).to(torch.long).to(speech[0].device)] for e, order in zip(emb, orders)])
else:
attractors, probs = self.encoder_decoder_attractor.estimate(emb)
attractors, probs = self.eda.estimate(emb)
attractors_active = []
for p, att, e in zip(probs, attractors, emb):
if n_speakers and n_speakers >= 0:
@ -233,10 +231,23 @@ class DiarEENDOLAModel(AbsESPnetModel):
pred[i] = pred[i - 1]
else:
pred[i] = 0
pred = [self.reporter.inv_mapping_func(i, self.mapping_dict) for i in pred]
pred = [self.inv_mapping_func(i) for i in pred]
decisions = [bin(num)[2:].zfill(self.max_n_speaker)[::-1] for num in pred]
decisions = torch.from_numpy(
np.stack([np.array([int(i) for i in dec]) for dec in decisions], axis=0)).to(logit.device).to(
torch.float32)
decisions = decisions[:, :n_speaker]
return decisions
def inv_mapping_func(self, label):
if not isinstance(label, int):
label = int(label)
if label in self.mapping_dict['label2dec'].keys():
num = self.mapping_dict['label2dec'][label]
else:
num = -1
return num
def collect_feats(self, **batch: torch.Tensor) -> Dict[str, torch.Tensor]:
pass
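For reference, the `decisions` construction above treats each decoded power-set label as a bit mask over speakers: bit i (least significant first, hence the `[::-1]`) marks speaker i as active in that frame. A standalone sketch of that decoding (the helper name is ours):
```python
import numpy as np

def decode_powerset_label(num: int, max_n_speaker: int = 8) -> np.ndarray:
    """Turn a power-set label integer into a multi-hot speaker-activity vector."""
    bits = bin(num)[2:].zfill(max_n_speaker)[::-1]
    return np.array([int(b) for b in bits], dtype=np.float32)

print(decode_powerset_label(5))  # speakers 0 and 2 active -> [1. 0. 1. 0. 0. 0. 0. 0.]
print(decode_powerset_label(3))  # speakers 0 and 1 active -> [1. 1. 0. 0. 0. 0. 0. 0.]
```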

View File

@ -59,7 +59,8 @@ class DiarSondModel(AbsESPnetModel):
normalize_speech_speaker: bool = False,
ignore_id: int = -1,
speaker_discrimination_loss_weight: float = 1.0,
inter_score_loss_weight: float = 0.0
inter_score_loss_weight: float = 0.0,
inputs_type: str = "raw",
):
assert check_argument_types()
@ -86,14 +87,12 @@ class DiarSondModel(AbsESPnetModel):
)
self.criterion_bce = SequenceBinaryCrossEntropy(normalize_length=length_normalized_loss)
self.pse_embedding = self.generate_pse_embedding()
# self.register_buffer("pse_embedding", pse_embedding)
self.power_weight = torch.from_numpy(2 ** np.arange(max_spk_num)[np.newaxis, np.newaxis, :]).float()
# self.register_buffer("power_weight", power_weight)
self.int_token_arr = torch.from_numpy(np.array(self.token_list).astype(int)[np.newaxis, np.newaxis, :]).int()
# self.register_buffer("int_token_arr", int_token_arr)
self.speaker_discrimination_loss_weight = speaker_discrimination_loss_weight
self.inter_score_loss_weight = inter_score_loss_weight
self.forward_steps = 0
self.inputs_type = inputs_type
def generate_pse_embedding(self):
embedding = np.zeros((len(self.token_list), self.max_spk_num), dtype=np.float)
@ -125,9 +124,14 @@ class DiarSondModel(AbsESPnetModel):
binary_labels: (Batch, frames, max_spk_num)
binary_labels_lengths: (Batch,)
"""
assert speech.shape[0] == binary_labels.shape[0], (speech.shape, binary_labels.shape)
assert speech.shape[0] <= binary_labels.shape[0], (speech.shape, binary_labels.shape)
batch_size = speech.shape[0]
self.forward_steps = self.forward_steps + 1
if self.pse_embedding.device != speech.device:
self.pse_embedding = self.pse_embedding.to(speech.device)
self.power_weight = self.power_weight.to(speech.device)
self.int_token_arr = self.int_token_arr.to(speech.device)
# 1. Network forward
pred, inter_outputs = self.prediction_forward(
speech, speech_lengths,
@ -149,9 +153,13 @@ class DiarSondModel(AbsESPnetModel):
# the sequence length of 'pred' might be slightly less than the
# length of 'spk_labels'. Here we force them to be equal.
length_diff_tolerance = 2
length_diff = pse_labels.shape[1] - pred.shape[1]
if 0 < length_diff <= length_diff_tolerance:
pse_labels = pse_labels[:, 0: pred.shape[1]]
length_diff = abs(pse_labels.shape[1] - pred.shape[1])
if length_diff <= length_diff_tolerance:
min_len = min(pred.shape[1], pse_labels.shape[1])
pse_labels = pse_labels[:, :min_len]
pred = pred[:, :min_len]
cd_score = cd_score[:, :min_len]
ci_score = ci_score[:, :min_len]
loss_diar = self.classification_loss(pred, pse_labels, binary_labels_lengths)
loss_spk_dis = self.speaker_discrimination_loss(profile, profile_lengths)
@ -299,7 +307,7 @@ class DiarSondModel(AbsESPnetModel):
speech: torch.Tensor,
speech_lengths: torch.Tensor,
) -> Tuple[torch.Tensor, torch.Tensor]:
if self.encoder is not None:
if self.encoder is not None and self.inputs_type == "raw":
speech, speech_lengths = self.encode(speech, speech_lengths)
speech_mask = ~make_pad_mask(speech_lengths, maxlen=speech.shape[1])
speech_mask = speech_mask.to(speech.device).unsqueeze(-1).float()

View File

@ -347,6 +347,48 @@ class SANMEncoder(AbsEncoder):
return (xs_pad, intermediate_outs), olens, None
return xs_pad, olens, None
def forward_chunk(self,
xs_pad: torch.Tensor,
ilens: torch.Tensor,
cache: dict = None,
ctc: CTC = None,
):
xs_pad *= self.output_size() ** 0.5
if self.embed is None:
xs_pad = xs_pad
else:
xs_pad = self.embed.forward_chunk(xs_pad, cache)
encoder_outs = self.encoders0(xs_pad, None, None, None, None)
xs_pad, masks = encoder_outs[0], encoder_outs[1]
intermediate_outs = []
if len(self.interctc_layer_idx) == 0:
encoder_outs = self.encoders(xs_pad, None, None, None, None)
xs_pad, masks = encoder_outs[0], encoder_outs[1]
else:
for layer_idx, encoder_layer in enumerate(self.encoders):
encoder_outs = encoder_layer(xs_pad, None, None, None, None)
xs_pad, masks = encoder_outs[0], encoder_outs[1]
if layer_idx + 1 in self.interctc_layer_idx:
encoder_out = xs_pad
# intermediate outputs are also normalized
if self.normalize_before:
encoder_out = self.after_norm(encoder_out)
intermediate_outs.append((layer_idx + 1, encoder_out))
if self.interctc_use_conditioning:
ctc_out = ctc.softmax(encoder_out)
xs_pad = xs_pad + self.conditioning_layer(ctc_out)
if self.normalize_before:
xs_pad = self.after_norm(xs_pad)
if len(intermediate_outs) > 0:
return (xs_pad, intermediate_outs), None, None
return xs_pad, ilens, None
def gen_tf2torch_map_dict(self):
tensor_name_prefix_torch = self.tf2torch_tensor_name_prefix_torch
tensor_name_prefix_tf = self.tf2torch_tensor_name_prefix_tf

View File

@ -1,14 +1,15 @@
# Copyright (c) Alibaba, Inc. and its affiliates.
# Part of the implementation is borrowed from espnet/espnet.
from abc import ABC
from typing import Tuple
import numpy as np
import torch
import torchaudio.compliance.kaldi as kaldi
from funasr.models.frontend.abs_frontend import AbsFrontend
from typeguard import check_argument_types
from torch.nn.utils.rnn import pad_sequence
from typeguard import check_argument_types
import funasr.models.frontend.eend_ola_feature as eend_ola_feature
from funasr.models.frontend.abs_frontend import AbsFrontend
def load_cmvn(cmvn_file):
@ -275,7 +276,8 @@ class WavFrontendOnline(AbsFrontend):
# inputs tensor has catted the cache tensor
# def apply_lfr(inputs: torch.Tensor, lfr_m: int, lfr_n: int, inputs_lfr_cache: torch.Tensor = None,
# is_final: bool = False) -> Tuple[torch.Tensor, torch.Tensor, int]:
def apply_lfr(inputs: torch.Tensor, lfr_m: int, lfr_n: int, is_final: bool = False) -> Tuple[torch.Tensor, torch.Tensor, int]:
def apply_lfr(inputs: torch.Tensor, lfr_m: int, lfr_n: int, is_final: bool = False) -> Tuple[
torch.Tensor, torch.Tensor, int]:
"""
Apply lfr with data
"""
@ -376,7 +378,8 @@ class WavFrontendOnline(AbsFrontend):
if self.lfr_m != 1 or self.lfr_n != 1:
# update self.lfr_splice_cache in self.apply_lfr
# mat, self.lfr_splice_cache[i], lfr_splice_frame_idx = self.apply_lfr(mat, self.lfr_m, self.lfr_n, self.lfr_splice_cache[i],
mat, self.lfr_splice_cache[i], lfr_splice_frame_idx = self.apply_lfr(mat, self.lfr_m, self.lfr_n, is_final)
mat, self.lfr_splice_cache[i], lfr_splice_frame_idx = self.apply_lfr(mat, self.lfr_m, self.lfr_n,
is_final)
if self.cmvn_file is not None:
mat = self.apply_cmvn(mat, self.cmvn)
feat_length = mat.size(0)
@ -398,9 +401,10 @@ class WavFrontendOnline(AbsFrontend):
assert batch_size == 1, 'we support to extract feature online only when the batch size is equal to 1 now'
waveforms, feats, feats_lengths = self.forward_fbank(input, input_lengths) # input shape: B T D
if feats.shape[0]:
#if self.reserve_waveforms is None and self.lfr_m > 1:
# if self.reserve_waveforms is None and self.lfr_m > 1:
# self.reserve_waveforms = waveforms[:, :(self.lfr_m - 1) // 2 * self.frame_shift_sample_length]
self.waveforms = waveforms if self.reserve_waveforms is None else torch.cat((self.reserve_waveforms, waveforms), dim=1)
self.waveforms = waveforms if self.reserve_waveforms is None else torch.cat(
(self.reserve_waveforms, waveforms), dim=1)
if not self.lfr_splice_cache:  # initialize the splice cache
for i in range(batch_size):
self.lfr_splice_cache.append(feats[i][0, :].unsqueeze(dim=0).repeat((self.lfr_m - 1) // 2, 1))
@ -409,7 +413,8 @@ class WavFrontendOnline(AbsFrontend):
lfr_splice_cache_tensor = torch.stack(self.lfr_splice_cache) # B T D
feats = torch.cat((lfr_splice_cache_tensor, feats), dim=1)
feats_lengths += lfr_splice_cache_tensor[0].shape[0]
frame_from_waveforms = int((self.waveforms.shape[1] - self.frame_sample_length) / self.frame_shift_sample_length + 1)
frame_from_waveforms = int(
(self.waveforms.shape[1] - self.frame_sample_length) / self.frame_shift_sample_length + 1)
minus_frame = (self.lfr_m - 1) // 2 if self.reserve_waveforms is None else 0
feats, feats_lengths, lfr_splice_frame_idxs = self.forward_lfr_cmvn(feats, feats_lengths, is_final)
if self.lfr_m == 1:
@ -423,14 +428,15 @@ class WavFrontendOnline(AbsFrontend):
self.waveforms = self.waveforms[:, :sample_length]
else:
# update self.reserve_waveforms and self.lfr_splice_cache
self.reserve_waveforms = self.waveforms[:, :-(self.frame_sample_length - self.frame_shift_sample_length)]
self.reserve_waveforms = self.waveforms[:,
:-(self.frame_sample_length - self.frame_shift_sample_length)]
for i in range(batch_size):
self.lfr_splice_cache[i] = torch.cat((self.lfr_splice_cache[i], feats[i]), dim=0)
return torch.empty(0), feats_lengths
else:
if is_final:
self.waveforms = waveforms if self.reserve_waveforms is None else self.reserve_waveforms
feats = torch.stack(self.lfr_splice_cache)
feats = torch.stack(self.lfr_splice_cache)
feats_lengths = torch.zeros(batch_size, dtype=torch.int) + feats.shape[1]
feats, feats_lengths, _ = self.forward_lfr_cmvn(feats, feats_lengths, is_final)
if is_final:
@ -444,3 +450,54 @@ class WavFrontendOnline(AbsFrontend):
self.reserve_waveforms = None
self.input_cache = None
self.lfr_splice_cache = []
class WavFrontendMel23(AbsFrontend):
"""Conventional frontend structure for ASR.
"""
def __init__(
self,
fs: int = 16000,
frame_length: int = 25,
frame_shift: int = 10,
lfr_m: int = 1,
lfr_n: int = 1,
):
assert check_argument_types()
super().__init__()
self.fs = fs
self.frame_length = frame_length
self.frame_shift = frame_shift
self.lfr_m = lfr_m
self.lfr_n = lfr_n
self.n_mels = 23
def output_size(self) -> int:
return self.n_mels * (2 * self.lfr_m + 1)
def forward(
self,
input: torch.Tensor,
input_lengths: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
batch_size = input.size(0)
feats = []
feats_lens = []
for i in range(batch_size):
waveform_length = input_lengths[i]
waveform = input[i][:waveform_length]
waveform = waveform.numpy()
mat = eend_ola_feature.stft(waveform, self.frame_length, self.frame_shift)
mat = eend_ola_feature.transform(mat)
mat = eend_ola_feature.splice(mat, context_size=self.lfr_m)
mat = mat[::self.lfr_n]
mat = torch.from_numpy(mat)
feat_length = mat.size(0)
feats.append(mat)
feats_lens.append(feat_length)
feats_lens = torch.as_tensor(feats_lens)
feats_pad = pad_sequence(feats,
batch_first=True,
padding_value=0.0)
return feats_pad, feats_lens
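As context for `apply_lfr` and the `lfr_m`/`lfr_n` parameters used throughout this frontend, LFR (low frame rate) stacking concatenates `lfr_m` consecutive frames and keeps one stacked frame every `lfr_n` input frames, after replicating the first frame on the left. The following is a minimal non-streaming sketch under those assumptions; the cache handling and exact tail padding of the real implementation are omitted.
```python
import torch

def lfr_stack(feats: torch.Tensor, lfr_m: int = 7, lfr_n: int = 6) -> torch.Tensor:
    """(T, D) -> (ceil(T / lfr_n), lfr_m * D): stack lfr_m frames, hop lfr_n frames."""
    T, _ = feats.shape
    left_pad = feats[:1].repeat((lfr_m - 1) // 2, 1)   # replicate the first frame
    feats = torch.cat([left_pad, feats], dim=0)
    out = []
    for t in range(0, T, lfr_n):
        window = feats[t: t + lfr_m]
        if window.shape[0] < lfr_m:                    # pad the tail with the last frame
            pad = window[-1:].repeat(lfr_m - window.shape[0], 1)
            window = torch.cat([window, pad], dim=0)
        out.append(window.reshape(-1))
    return torch.stack(out)

print(lfr_stack(torch.randn(100, 80)).shape)  # torch.Size([17, 560])
```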

View File

@ -199,6 +199,63 @@ class CifPredictorV2(nn.Module):
return acoustic_embeds, token_num, alphas, cif_peak
def forward_chunk(self, hidden, cache=None):
h = hidden
context = h.transpose(1, 2)
queries = self.pad(context)
output = torch.relu(self.cif_conv1d(queries))
output = output.transpose(1, 2)
output = self.cif_output(output)
alphas = torch.sigmoid(output)
alphas = torch.nn.functional.relu(alphas * self.smooth_factor - self.noise_threshold)
alphas = alphas.squeeze(-1)
mask_chunk_predictor = None
if cache is not None:
mask_chunk_predictor = None
mask_chunk_predictor = torch.zeros_like(alphas)
mask_chunk_predictor[:, cache["pad_left"]:cache["stride"] + cache["pad_left"]] = 1.0
if mask_chunk_predictor is not None:
alphas = alphas * mask_chunk_predictor
if cache is not None:
if cache["cif_hidden"] is not None:
hidden = torch.cat((cache["cif_hidden"], hidden), 1)
if cache["cif_alphas"] is not None:
alphas = torch.cat((cache["cif_alphas"], alphas), -1)
token_num = alphas.sum(-1)
acoustic_embeds, cif_peak = cif(hidden, alphas, self.threshold)
len_time = alphas.size(-1)
last_fire_place = len_time - 1
last_fire_remainds = 0.0
pre_alphas_length = 0
mask_chunk_peak_predictor = None
if cache is not None:
mask_chunk_peak_predictor = None
mask_chunk_peak_predictor = torch.zeros_like(cif_peak)
if cache["cif_alphas"] is not None:
pre_alphas_length = cache["cif_alphas"].size(-1)
mask_chunk_peak_predictor[:, :pre_alphas_length] = 1.0
mask_chunk_peak_predictor[:, pre_alphas_length + cache["pad_left"]:pre_alphas_length + cache["stride"] + cache["pad_left"]] = 1.0
if mask_chunk_peak_predictor is not None:
cif_peak = cif_peak * mask_chunk_peak_predictor.squeeze(-1)
for i in range(len_time):
if cif_peak[0][len_time - 1 - i] > self.threshold or cif_peak[0][len_time - 1 - i] == self.threshold:
last_fire_place = len_time - 1 - i
last_fire_remainds = cif_peak[0][len_time - 1 - i] - self.threshold
break
last_fire_remainds = torch.tensor([last_fire_remainds], dtype=alphas.dtype).to(alphas.device)
cache["cif_hidden"] = hidden[:, last_fire_place:, :]
cache["cif_alphas"] = torch.cat((last_fire_remainds.unsqueeze(0), alphas[:, last_fire_place+1:]), -1)
token_num_int = token_num.floor().type(torch.int32).item()
return acoustic_embeds[:, 0:token_num_int, :], token_num, alphas, cif_peak
def tail_process_fn(self, hidden, alphas, token_num=None, mask=None):
b, t, d = hidden.size()
tail_threshold = self.tail_threshold

View File

@ -347,15 +347,17 @@ class MultiHeadedAttentionSANM(nn.Module):
mask = torch.reshape(mask, (b, -1, 1))
if mask_shfit_chunk is not None:
mask = mask * mask_shfit_chunk
inputs = inputs * mask
inputs = inputs * mask
x = inputs.transpose(1, 2)
x = self.pad_fn(x)
x = self.fsmn_block(x)
x = x.transpose(1, 2)
x += inputs
x = self.dropout(x)
return x * mask
if mask is not None:
x = x * mask
return x
def forward_qkv(self, x):
"""Transform query, key and value.
@ -505,7 +507,7 @@ class MultiHeadedAttentionSANMDecoder(nn.Module):
# print("in fsmn, cache is None, x", x.size())
x = self.pad_fn(x)
if not self.training and t <= 1:
if not self.training:
cache = x
else:
# print("in fsmn, cache is not None, x", x.size())
@ -513,7 +515,7 @@ class MultiHeadedAttentionSANMDecoder(nn.Module):
# if t < self.kernel_size:
# x = self.pad_fn(x)
x = torch.cat((cache[:, :, 1:], x), dim=2)
x = x[:, :, -self.kernel_size:]
x = x[:, :, -(self.kernel_size+t-1):]
# print("in fsmn, cache is not None, x_cat", x.size())
cache = x
x = self.fsmn_block(x)

View File

@ -87,7 +87,7 @@ class EENDOLATransformerEncoder(nn.Module):
n_layers: int,
n_units: int,
e_units: int = 2048,
h: int = 8,
h: int = 4,
dropout_rate: float = 0.1,
use_pos_emb: bool = False):
super(EENDOLATransformerEncoder, self).__init__()

View File

@ -16,12 +16,12 @@ class EncoderDecoderAttractor(nn.Module):
self.n_units = n_units
def forward_core(self, xs, zeros):
ilens = torch.from_numpy(np.array([x.shape[0] for x in xs])).to(torch.float32).to(xs[0].device)
ilens = torch.from_numpy(np.array([x.shape[0] for x in xs])).to(torch.int64)
xs = [self.enc0_dropout(x) for x in xs]
xs = nn.utils.rnn.pad_sequence(xs, batch_first=True, padding_value=-1)
xs = nn.utils.rnn.pack_padded_sequence(xs, ilens, batch_first=True, enforce_sorted=False)
_, (hx, cx) = self.encoder(xs)
zlens = torch.from_numpy(np.array([z.shape[0] for z in zeros])).to(torch.float32).to(zeros[0].device)
zlens = torch.from_numpy(np.array([z.shape[0] for z in zeros])).to(torch.int64)
max_zlen = torch.max(zlens).to(torch.int).item()
zeros = [self.enc0_dropout(z) for z in zeros]
zeros = nn.utils.rnn.pad_sequence(zeros, batch_first=True, padding_value=-1)
@ -47,4 +47,4 @@ class EncoderDecoderAttractor(nn.Module):
zeros = [torch.zeros(max_n_speakers, self.n_units).to(torch.float32).to(xs[0].device) for _ in xs]
attractors = self.forward_core(xs, zeros)
probs = [torch.sigmoid(torch.flatten(self.counter(att))) for att in attractors]
return attractors, probs
return attractors, probs
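The dtype fixes here (and the matching `ilens.cpu().to(torch.int64)` change in `DiarEENDOLAModel.forward_post_net` earlier) address the same constraint: in recent PyTorch versions `pack_padded_sequence` expects its lengths argument as a 1-D CPU integer tensor (or a Python list). A minimal sketch of the intended usage:
```python
import torch
import torch.nn as nn

xs = [torch.randn(5, 16), torch.randn(3, 16)]                        # variable-length sequences
ilens = torch.tensor([x.shape[0] for x in xs], dtype=torch.int64)    # 1-D, int64, kept on CPU

padded = nn.utils.rnn.pad_sequence(xs, batch_first=True, padding_value=-1)
packed = nn.utils.rnn.pack_padded_sequence(padded, ilens, batch_first=True, enforce_sorted=False)
lstm = nn.LSTM(16, 8, batch_first=True)
outputs, _ = lstm(packed)
unpacked, _ = nn.utils.rnn.pad_packed_sequence(outputs, batch_first=True)
print(unpacked.shape)  # torch.Size([2, 5, 8])
```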

View File

@ -405,4 +405,13 @@ class SinusoidalPositionEncoder(torch.nn.Module):
positions = torch.arange(1, timesteps+1)[None, :]
position_encoding = self.encode(positions, input_dim, x.dtype).to(x.device)
return x + position_encoding
return x + position_encoding
def forward_chunk(self, x, cache=None):
start_idx = 0
batch_size, timesteps, input_dim = x.size()
if cache is not None:
start_idx = cache["start_idx"]
positions = torch.arange(1, timesteps+start_idx+1)[None, :]
position_encoding = self.encode(positions, input_dim, x.dtype).to(x.device)
return x + position_encoding[:, start_idx: start_idx + timesteps]
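The chunked variant above offsets the positions by `cache["start_idx"]` so that concatenating chunk outputs matches a single full-sequence pass. The sketch below illustrates that slicing with a running offset preserves this equivalence; the `sinusoidal_pe` helper is only an assumed stand-in for `self.encode`, whose exact timescale formula may differ.
```python
import math
import torch

def sinusoidal_pe(positions: torch.Tensor, dim: int) -> torch.Tensor:
    """Assumed sinusoidal encoding for 1-based positions: (1, T) -> (1, T, dim)."""
    half = dim // 2
    inv_timescales = torch.exp(torch.arange(half, dtype=torch.float32)
                               * -(math.log(10000.0) / (half - 1)))
    scaled = positions.float().unsqueeze(-1) * inv_timescales           # (1, T, half)
    return torch.cat([torch.sin(scaled), torch.cos(scaled)], dim=-1)    # (1, T, dim)

x = torch.zeros(1, 10, 8)
full = x + sinusoidal_pe(torch.arange(1, 11)[None, :], 8)

chunks, start_idx = [], 0
for chunk in x.split(4, dim=1):
    t = chunk.size(1)
    pe = sinusoidal_pe(torch.arange(1, start_idx + t + 1)[None, :], 8)
    chunks.append(chunk + pe[:, start_idx: start_idx + t])
    start_idx += t

print(torch.allclose(torch.cat(chunks, dim=1), full))  # True
```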

View File

@ -0,0 +1,83 @@
# Copyright 2018 gRPC authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# cmake build file for C++ paraformer example.
# Assumes protobuf and gRPC have been installed using cmake.
# See cmake_externalproject/CMakeLists.txt for all-in-one cmake build
# that automatically builds all the dependencies before building paraformer.
cmake_minimum_required(VERSION 3.10)
project(ASR C CXX)
include(common.cmake)
# Proto file
get_filename_component(rg_proto "../python/grpc/proto/paraformer.proto" ABSOLUTE)
get_filename_component(rg_proto_path "${rg_proto}" PATH)
# Generated sources
set(rg_proto_srcs "${CMAKE_CURRENT_BINARY_DIR}/paraformer.pb.cc")
set(rg_proto_hdrs "${CMAKE_CURRENT_BINARY_DIR}/paraformer.pb.h")
set(rg_grpc_srcs "${CMAKE_CURRENT_BINARY_DIR}/paraformer.grpc.pb.cc")
set(rg_grpc_hdrs "${CMAKE_CURRENT_BINARY_DIR}/paraformer.grpc.pb.h")
add_custom_command(
OUTPUT "${rg_proto_srcs}" "${rg_proto_hdrs}" "${rg_grpc_srcs}" "${rg_grpc_hdrs}"
COMMAND ${_PROTOBUF_PROTOC}
ARGS --grpc_out "${CMAKE_CURRENT_BINARY_DIR}"
--cpp_out "${CMAKE_CURRENT_BINARY_DIR}"
-I "${rg_proto_path}"
--plugin=protoc-gen-grpc="${_GRPC_CPP_PLUGIN_EXECUTABLE}"
"${rg_proto}"
DEPENDS "${rg_proto}")
# Include generated *.pb.h files
include_directories("${CMAKE_CURRENT_BINARY_DIR}")
include_directories(../onnxruntime/include/)
link_directories(../onnxruntime/build/src/)
link_directories(../onnxruntime/build/third_party/webrtc/)
link_directories(${ONNXRUNTIME_DIR}/lib)
add_subdirectory("../onnxruntime/src" onnx_src)
# rg_grpc_proto
add_library(rg_grpc_proto
${rg_grpc_srcs}
${rg_grpc_hdrs}
${rg_proto_srcs}
${rg_proto_hdrs})
target_link_libraries(rg_grpc_proto
${_REFLECTION}
${_GRPC_GRPCPP}
${_PROTOBUF_LIBPROTOBUF})
# Targets paraformer_(server)
foreach(_target
paraformer_server)
add_executable(${_target}
"${_target}.cc")
target_link_libraries(${_target}
rg_grpc_proto
rapidasr
webrtcvad
${EXTRA_LIBS}
${_REFLECTION}
${_GRPC_GRPCPP}
${_PROTOBUF_LIBPROTOBUF})
endforeach()

View File

@ -0,0 +1,57 @@
## Paraformer grpc onnx server in C++
#### Step 1. Build ../onnxruntime as described in its documentation
```
# put the onnx library, the onnx ASR model and vocab.txt into /path/to/asrmodel (e.g. /data/asrmodel)
ls /data/asrmodel/
onnxruntime-linux-x64-1.14.0 speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch
file /data/asrmodel/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch/vocab.txt
UTF-8 Unicode text
```
#### Step 2. Compile and install gRPC v1.52.0 to avoid known gRPC bugs
```
export GRPC_INSTALL_DIR=/data/soft/grpc
export PKG_CONFIG_PATH=$GRPC_INSTALL_DIR/lib/pkgconfig
git clone -b v1.52.0 --depth=1 https://github.com/grpc/grpc.git
cd grpc
git submodule update --init --recursive
mkdir -p cmake/build
pushd cmake/build
cmake -DgRPC_INSTALL=ON \
-DgRPC_BUILD_TESTS=OFF \
-DCMAKE_INSTALL_PREFIX=$GRPC_INSTALL_DIR \
../..
make
make install
popd
echo "export GRPC_INSTALL_DIR=/data/soft/grpc" >> ~/.bashrc
echo "export PKG_CONFIG_PATH=\$GRPC_INSTALL_DIR/lib/pkgconfig" >> ~/.bashrc
echo "export PATH=\$GRPC_INSTALL_DIR/bin/:\$PKG_CONFIG_PATH:\$PATH" >> ~/.bashrc
source ~/.bashrc
```
#### Step 3. Compile the grpc onnx paraformer server
```
# set -DONNXRUNTIME_DIR=/path/to/asrmodel/onnxruntime-linux-x64-1.14.0
./rebuild.sh
```
#### Step 4. Start grpc paraformer server
```
Usage: ./cmake/build/paraformer_server port thread_num /path/to/model_file
./cmake/build/paraformer_server 10108 4 /data/asrmodel/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch
```
#### Step 5. Start the grpc python paraformer client on a PC with a microphone
```
cd ../python/grpc
python grpc_main_client_mic.py --host $server_ip --port 10108
```
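`grpc_main_client_mic.py` is the supported client. For reference only, a minimal file-based client could look like the hypothetical sketch below, assuming the stubs generated from `../python/grpc/proto/paraformer.proto` (`paraformer_pb2` / `paraformer_pb2_grpc`, with an `ASRStub` class) and 16 kHz, 16-bit mono PCM input:
```python
# Hypothetical sketch; field names follow paraformer.proto as used by paraformer_server.cc.
import grpc
import paraformer_pb2
import paraformer_pb2_grpc

def recognize_pcm(host: str, port: int, pcm_bytes: bytes, user: str = "demo"):
    def requests():
        # stream the audio while "speaking", then trigger decoding and terminate
        yield paraformer_pb2.Request(user=user, language="zh-CN",
                                     speaking=True, isend=False, audio_data=pcm_bytes)
        yield paraformer_pb2.Request(user=user, language="zh-CN", speaking=False, isend=False)
        yield paraformer_pb2.Request(user=user, language="zh-CN", speaking=False, isend=True)

    with grpc.insecure_channel(f"{host}:{port}") as channel:
        stub = paraformer_pb2_grpc.ASRStub(channel)
        for response in stub.Recognize(requests()):
            print(response.action, response.sentence)

with open("test.pcm", "rb") as f:
    recognize_pcm("127.0.0.1", 10108, f.read())
```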

View File

@ -0,0 +1,125 @@
# Copyright 2018 gRPC authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# cmake build file for C++ route_guide example.
# Assumes protobuf and gRPC have been installed using cmake.
# See cmake_externalproject/CMakeLists.txt for all-in-one cmake build
# that automatically builds all the dependencies before building route_guide.
cmake_minimum_required(VERSION 3.5.1)
if (NOT DEFINED CMAKE_CXX_STANDARD)
set (CMAKE_CXX_STANDARD 14)
endif()
if(MSVC)
add_definitions(-D_WIN32_WINNT=0x600)
endif()
find_package(Threads REQUIRED)
if(GRPC_AS_SUBMODULE)
# One way to build a projects that uses gRPC is to just include the
# entire gRPC project tree via "add_subdirectory".
# This approach is very simple to use, but the are some potential
# disadvantages:
# * it includes gRPC's CMakeLists.txt directly into your build script
# without and that can make gRPC's internal setting interfere with your
# own build.
# * depending on what's installed on your system, the contents of submodules
# in gRPC's third_party/* might need to be available (and there might be
# additional prerequisites required to build them). Consider using
# the gRPC_*_PROVIDER options to fine-tune the expected behavior.
#
# A more robust approach to add dependency on gRPC is using
# cmake's ExternalProject_Add (see cmake_externalproject/CMakeLists.txt).
# Include the gRPC's cmake build (normally grpc source code would live
# in a git submodule called "third_party/grpc", but this example lives in
# the same repository as gRPC sources, so we just look a few directories up)
add_subdirectory(../../.. ${CMAKE_CURRENT_BINARY_DIR}/grpc EXCLUDE_FROM_ALL)
message(STATUS "Using gRPC via add_subdirectory.")
# After using add_subdirectory, we can now use the grpc targets directly from
# this build.
set(_PROTOBUF_LIBPROTOBUF libprotobuf)
set(_REFLECTION grpc++_reflection)
if(CMAKE_CROSSCOMPILING)
find_program(_PROTOBUF_PROTOC protoc)
else()
set(_PROTOBUF_PROTOC $<TARGET_FILE:protobuf::protoc>)
endif()
set(_GRPC_GRPCPP grpc++)
if(CMAKE_CROSSCOMPILING)
find_program(_GRPC_CPP_PLUGIN_EXECUTABLE grpc_cpp_plugin)
else()
set(_GRPC_CPP_PLUGIN_EXECUTABLE $<TARGET_FILE:grpc_cpp_plugin>)
endif()
elseif(GRPC_FETCHCONTENT)
# Another way is to use CMake's FetchContent module to clone gRPC at
# configure time. This makes gRPC's source code available to your project,
# similar to a git submodule.
message(STATUS "Using gRPC via add_subdirectory (FetchContent).")
include(FetchContent)
FetchContent_Declare(
grpc
GIT_REPOSITORY https://github.com/grpc/grpc.git
# when using gRPC, you will actually set this to an existing tag, such as
# v1.25.0, v1.26.0 etc..
# For the purpose of testing, we override the tag used to the commit
# that's currently under test.
GIT_TAG vGRPC_TAG_VERSION_OF_YOUR_CHOICE)
FetchContent_MakeAvailable(grpc)
# Since FetchContent uses add_subdirectory under the hood, we can use
# the grpc targets directly from this build.
set(_PROTOBUF_LIBPROTOBUF libprotobuf)
set(_REFLECTION grpc++_reflection)
set(_PROTOBUF_PROTOC $<TARGET_FILE:protoc>)
set(_GRPC_GRPCPP grpc++)
if(CMAKE_CROSSCOMPILING)
find_program(_GRPC_CPP_PLUGIN_EXECUTABLE grpc_cpp_plugin)
else()
set(_GRPC_CPP_PLUGIN_EXECUTABLE $<TARGET_FILE:grpc_cpp_plugin>)
endif()
else()
# This branch assumes that gRPC and all its dependencies are already installed
# on this system, so they can be located by find_package().
# Find Protobuf installation
# Looks for protobuf-config.cmake file installed by Protobuf's cmake installation.
set(protobuf_MODULE_COMPATIBLE TRUE)
find_package(Protobuf CONFIG REQUIRED)
message(STATUS "Using protobuf ${Protobuf_VERSION}")
set(_PROTOBUF_LIBPROTOBUF protobuf::libprotobuf)
set(_REFLECTION gRPC::grpc++_reflection)
if(CMAKE_CROSSCOMPILING)
find_program(_PROTOBUF_PROTOC protoc)
else()
set(_PROTOBUF_PROTOC $<TARGET_FILE:protobuf::protoc>)
endif()
# Find gRPC installation
# Looks for gRPCConfig.cmake file installed by gRPC's cmake installation.
find_package(gRPC CONFIG REQUIRED)
message(STATUS "Using gRPC ${gRPC_VERSION}")
set(_GRPC_GRPCPP gRPC::grpc++)
if(CMAKE_CROSSCOMPILING)
find_program(_GRPC_CPP_PLUGIN_EXECUTABLE grpc_cpp_plugin)
else()
set(_GRPC_CPP_PLUGIN_EXECUTABLE $<TARGET_FILE:gRPC::grpc_cpp_plugin>)
endif()
endif()

View File

@ -0,0 +1,195 @@
#include <algorithm>
#include <chrono>
#include <cmath>
#include <iostream>
#include <sstream>
#include <memory>
#include <string>
#include <grpc/grpc.h>
#include <grpcpp/server.h>
#include <grpcpp/server_builder.h>
#include <grpcpp/server_context.h>
#include <grpcpp/security/server_credentials.h>
#include "paraformer.grpc.pb.h"
#include "paraformer_server.h"
using grpc::Server;
using grpc::ServerBuilder;
using grpc::ServerContext;
using grpc::ServerReader;
using grpc::ServerReaderWriter;
using grpc::ServerWriter;
using grpc::Status;
using paraformer::Request;
using paraformer::Response;
using paraformer::ASR;
ASRServicer::ASRServicer(const char* model_path, int thread_num) {
AsrHanlde=RapidAsrInit(model_path, thread_num);
std::cout << "ASRServicer init" << std::endl;
init_flag = 0;
}
void ASRServicer::clear_states(const std::string& user) {
clear_buffers(user);
clear_transcriptions(user);
}
void ASRServicer::clear_buffers(const std::string& user) {
if (client_buffers.count(user)) {
client_buffers.erase(user);
}
}
void ASRServicer::clear_transcriptions(const std::string& user) {
if (client_transcription.count(user)) {
client_transcription.erase(user);
}
}
void ASRServicer::disconnect(const std::string& user) {
clear_states(user);
std::cout << "Disconnecting user: " << user << std::endl;
}
grpc::Status ASRServicer::Recognize(
grpc::ServerContext* context,
grpc::ServerReaderWriter<Response, Request>* stream) {
Request req;
while (stream->Read(&req)) {
if (req.isend()) {
std::cout << "asr end" << std::endl;
disconnect(req.user());
Response res;
res.set_sentence(
R"({"success": true, "detail": "asr end"})"
);
res.set_user(req.user());
res.set_action("terminate");
res.set_language(req.language());
stream->Write(res);
} else if (req.speaking()) {
if (req.audio_data().size() > 0) {
auto& buf = client_buffers[req.user()];
buf.insert(buf.end(), req.audio_data().begin(), req.audio_data().end());
}
Response res;
res.set_sentence(
R"({"success": true, "detail": "speaking"})"
);
res.set_user(req.user());
res.set_action("speaking");
res.set_language(req.language());
stream->Write(res);
} else if (!req.speaking()) {
if (client_buffers.count(req.user()) == 0) {
Response res;
res.set_sentence(
R"({"success": true, "detail": "waiting_for_voice"})"
);
res.set_user(req.user());
res.set_action("waiting");
res.set_language(req.language());
stream->Write(res);
}else {
auto begin_time = std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::system_clock::now().time_since_epoch()).count();
std::string tmp_data = this->client_buffers[req.user()];
this->clear_states(req.user());
Response res;
res.set_sentence(
R"({"success": true, "detail": "decoding data: " + std::to_string(tmp_data.length()) + " bytes"})"
);
int data_len_int = tmp_data.length();
std::string data_len = std::to_string(data_len_int);
std::stringstream ss;
ss << R"({"success": true, "detail": "decoding data: )" << data_len << R"( bytes")" << R"("})";
std::string result = ss.str();
res.set_sentence(result);
res.set_user(req.user());
res.set_action("decoding");
res.set_language(req.language());
stream->Write(res);
if (tmp_data.length() < 800) { //min input_len for asr model
auto end_time = std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::system_clock::now().time_since_epoch()).count();
std::string delay_str = std::to_string(end_time - begin_time);
std::cout << "user: " << req.user() << " , delay(ms): " << delay_str << ", error: data_is_not_long_enough" << std::endl;
Response res;
std::stringstream ss;
std::string asr_result = "";
ss << R"({"success": true, "detail": "finish_sentence","server_delay_ms":)" << delay_str << R"(,"text":")" << asr_result << R"("})";
std::string result = ss.str();
res.set_sentence(result);
res.set_user(req.user());
res.set_action("finish");
res.set_language(req.language());
stream->Write(res);
}
else {
RPASR_RESULT Result= RapidAsrRecogPCMBuffer(AsrHanlde, tmp_data.c_str(), data_len_int, RASR_NONE, NULL);
std::string asr_result = ((RPASR_RECOG_RESULT*)Result)->msg;
auto end_time = std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::system_clock::now().time_since_epoch()).count();
std::string delay_str = std::to_string(end_time - begin_time);
std::cout << "user: " << req.user() << " , delay(ms): " << delay_str << ", text: " << asr_result << std::endl;
Response res;
std::stringstream ss;
ss << R"({"success": true, "detail": "finish_sentence","server_delay_ms":)" << delay_str << R"(,"text":")" << asr_result << R"("})";
std::string result = ss.str();
res.set_sentence(result);
res.set_user(req.user());
res.set_action("finish");
res.set_language(req.language());
stream->Write(res);
}
}
}else {
Response res;
res.set_sentence(
R"({"success": false, "detail": "error, no condition matched! Unknown reason."})"
);
res.set_user(req.user());
res.set_action("terminate");
res.set_language(req.language());
stream->Write(res);
}
}
return Status::OK;
}
void RunServer(const std::string& port, int thread_num, const char* model_path) {
std::string server_address;
server_address = "0.0.0.0:" + port;
ASRServicer service(model_path, thread_num);
ServerBuilder builder;
builder.AddListeningPort(server_address, grpc::InsecureServerCredentials());
builder.RegisterService(&service);
std::unique_ptr<Server> server(builder.BuildAndStart());
std::cout << "Server listening on " << server_address << std::endl;
server->Wait();
}
int main(int argc, char* argv[]) {
if (argc < 3)
{
printf("Usage: %s port thread_num /path/to/model_file\n", argv[0]);
exit(-1);
}
RunServer(argv[1], atoi(argv[2]), argv[3]);
return 0;
}

View File

@ -0,0 +1,56 @@
#include <algorithm>
#include <chrono>
#include <cmath>
#include <iostream>
#include <memory>
#include <string>
#include <grpc/grpc.h>
#include <grpcpp/server.h>
#include <grpcpp/server_builder.h>
#include <grpcpp/server_context.h>
#include <grpcpp/security/server_credentials.h>
#include <unordered_map>
#include <chrono>
#include "paraformer.grpc.pb.h"
#include "librapidasrapi.h"
using grpc::Server;
using grpc::ServerBuilder;
using grpc::ServerContext;
using grpc::ServerReader;
using grpc::ServerReaderWriter;
using grpc::ServerWriter;
using grpc::Status;
using paraformer::Request;
using paraformer::Response;
using paraformer::ASR;
typedef struct
{
std::string msg;
float snippet_time;
}RPASR_RECOG_RESULT;
class ASRServicer final : public ASR::Service {
private:
int init_flag;
std::unordered_map<std::string, std::string> client_buffers;
std::unordered_map<std::string, std::string> client_transcription;
public:
ASRServicer(const char* model_path, int thread_num);
void clear_states(const std::string& user);
void clear_buffers(const std::string& user);
void clear_transcriptions(const std::string& user);
void disconnect(const std::string& user);
grpc::Status Recognize(grpc::ServerContext* context, grpc::ServerReaderWriter<Response, Request>* stream);
RPASR_HANDLE AsrHanlde;
};

View File

@ -0,0 +1,12 @@
#!/bin/bash
rm cmake -rf
mkdir -p cmake/build
cd cmake/build
cmake -DCMAKE_BUILD_TYPE=release ../.. -DONNXRUNTIME_DIR=/data/asrmodel/onnxruntime-linux-x64-1.14.0
make
echo "Build cmake/build/paraformer_server successfully!"

View File

@ -41,8 +41,8 @@ pip install --editable ./
```
Export the onnx model ([details](https://github.com/alibaba-damo-academy/FunASR/tree/main/funasr/export)); refer to the example to export a model from ModelScope.
```
python -m funasr.export.export_model 'damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch' "./export" true
```shell
python -m funasr.export.export_model --model-name damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch --export-dir ./export --type onnx --quantize False
```
## Building Guidance for Linux/Unix

View File

@ -237,7 +237,7 @@ bool Audio::loadpcmwav(const char* buf, int nBufLen)
size_t nOffset = 0;
#define WAV_HEADER_SIZE 44
speech_len = nBufLen / 2;
speech_align_len = (int)(ceil((float)speech_len / align_size) * align_size);
@ -263,7 +263,8 @@ bool Audio::loadpcmwav(const char* buf, int nBufLen)
speech_data[i] = (float)speech_buff[i] / scale;
}
AudioFrame* frame = new AudioFrame(speech_len);
frame_queue.push(frame);
return true;
}

View File

@ -26,8 +26,9 @@ extern "C" {
return nullptr;
Audio audio(1);
audio.loadwav(szBuf,nLen);
audio.split();
if (!audio.loadwav(szBuf, nLen))
return nullptr;
//audio.split();
float* buff;
int len;
@ -58,8 +59,9 @@ extern "C" {
return nullptr;
Audio audio(1);
audio.loadpcmwav(szBuf, nLen);
audio.split();
if (!audio.loadpcmwav(szBuf, nLen))
return nullptr;
//audio.split();
float* buff;
int len;
@ -91,8 +93,9 @@ extern "C" {
return nullptr;
Audio audio(1);
audio.loadpcmwav(szFileName);
audio.split();
if (!audio.loadpcmwav(szFileName))
return nullptr;
//audio.split();
float* buff;
int len;
@ -125,7 +128,7 @@ extern "C" {
Audio audio(1);
if(!audio.loadwav(szWavfile))
return nullptr;
audio.split();
//audio.split();
float* buff;
int len;

View File

@ -8,7 +8,7 @@
#include "librapidasrapi.h"
#include <iostream>
#include <fstream>
using namespace std;
int main(int argc, char *argv[])
@ -40,10 +40,13 @@ int main(int argc, char *argv[])
gettimeofday(&start, NULL);
RPASR_RESULT Result=RapidAsrRecogPCMFile(AsrHanlde, argv[2], RASR_NONE, NULL);
gettimeofday(&end, NULL);
float snippet_time = 0.0f;
RPASR_RESULT Result=RapidAsrRecogFile(AsrHanlde, argv[2], RASR_NONE, NULL);
gettimeofday(&end, NULL);
if (Result)
{
string msg = RapidAsrGetResult(Result, 0);
@ -56,11 +59,51 @@ int main(int argc, char *argv[])
}
else
{
cout <<("no return data!");
cout <<"no return data!";
}
printf("Audio length %lfs.\n", (double)snippet_time);
//char* buff = nullptr;
//int len = 0;
//ifstream ifs(argv[2], std::ios::binary | std::ios::in);
//if (ifs.is_open())
//{
// ifs.seekg(0, std::ios::end);
// len = ifs.tellg();
// ifs.seekg(0, std::ios::beg);
// buff = new char[len];
// ifs.read(buff, len);
// //RPASR_RESULT Result = RapidAsrRecogPCMFile(AsrHanlde, argv[2], RASR_NONE, NULL);
// RPASR_RESULT Result=RapidAsrRecogPCMBuffer(AsrHanlde, buff,len, RASR_NONE, NULL);
// //RPASR_RESULT Result = RapidAsrRecogPCMFile(AsrHanlde, argv[2], RASR_NONE, NULL);
// gettimeofday(&end, NULL);
//
// if (Result)
// {
// string msg = RapidAsrGetResult(Result, 0);
// setbuf(stdout, NULL);
// cout << "Result: \"";
// cout << msg << endl;
// cout << "\"." << endl;
// snippet_time = RapidAsrGetRetSnippetTime(Result);
// RapidAsrFreeResult(Result);
// }
// else
// {
// cout <<"no return data!";
// }
//
//delete[]buff;
//}
printf("Audio length %lfs.\n", (double)snippet_time);
seconds = (end.tv_sec - start.tv_sec);
long taking_micros = ((seconds * 1000000) + end.tv_usec) - (start.tv_usec);
printf("Model inference takes %lfs.\n", (double)taking_micros / 1000000);

View File

@ -0,0 +1,45 @@
# Benchmark
### Data set:
Aishell1 [test set](https://www.openslr.org/33/); the total audio duration is 36108.919 seconds.
### Tools
- Install ModelScope and FunASR
```shell
pip install "modelscope[audio_asr]" --upgrade -f https://modelscope.oss-cn-beijing.aliyuncs.com/releases/repo.html
git clone https://github.com/alibaba-damo-academy/FunASR.git && cd FunASR
pip install --editable ./
cd funasr/runtime/python/utils
pip install -r requirements.txt
```
- Recipe: set the model, the data path and the output_dir, then run
```shell
nohup bash test_rtf.sh &> log.txt &
```
## [Paraformer-large](https://www.modelscope.cn/models/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary)
### Intel(R) Xeon(R) Platinum 8269CY CPU @ 2.50GHz, 16 cores / 32 logical processors, with avx512_vnni
| concurrent-tasks | processing time(s) | RTF | Speedup Rate |
|:----------------:|:------------------:|:------:|:------------:|
| 1 (torch fp32) | 3522 | 0.0976 | 10.3 |
| 1 (torch int8) | 1746 | 0.0484 | 20.7 |
| 32 (torch fp32) | 236 | 0.0066 | 152.7 |
| 32 (torch int8) | 114 | 0.0032 | 317.4 |
| 64 (torch fp32) | 235 | 0.0065 | 153.7 |
| 64 (torch int8) | 113 | 0.0031 | 319.2 |
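Here RTF is the processing time divided by the total audio duration, and Speedup Rate is its reciprocal; e.g. for the `1 (torch int8)` row:
```python
audio_seconds = 36108.919        # total duration of the Aishell1 test set
processing_seconds = 1746        # "1 (torch int8)" row above

rtf = processing_seconds / audio_seconds
speedup = audio_seconds / processing_seconds   # = 1 / RTF
print(f"RTF = {rtf:.4f}, Speedup = {speedup:.1f}")  # RTF = 0.0484, Speedup = 20.7
```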
[//]: # (### Intel&#40;R&#41; Xeon&#40;R&#41; Platinum 8163 CPU @ 2.50GHz 32core-64processor without avx512_vnni)
## [Paraformer](https://modelscope.cn/models/damo/speech_paraformer_asr_nat-zh-cn-16k-common-vocab8358-tensorflow1/summary)

View File

@ -0,0 +1,89 @@
# Benchmark
### Data set:
Aishell1 [test set](https://www.openslr.org/33/); the total audio duration is 36108.919 seconds.
### Tools
- Install ModelScope and FunASR
```shell
pip install "modelscope[audio_asr]" --upgrade -f https://modelscope.oss-cn-beijing.aliyuncs.com/releases/repo.html
git clone https://github.com/alibaba-damo-academy/FunASR.git && cd FunASR
pip install --editable ./
cd funasr/runtime/python/utils
pip install -r requirements.txt
```
- Recipe: set the model, the data path and the output_dir, then run
```shell
nohup bash test_rtf.sh &> log.txt &
```
## [Paraformer-large](https://www.modelscope.cn/models/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary)
### Intel(R) Xeon(R) Platinum 8369B CPU @ 2.90GHz, 16 cores / 32 logical processors, with avx512_vnni
| concurrent-tasks | processing time(s) | RTF | Speedup Rate |
|:----------------:|:------------------:|:-------:|:------------:|
| 1 (onnx fp32) | 2806 | 0.0777 | 12.9 |
| 1 (onnx int8) | 1611 | 0.0446 | 22.4 |
| 8 (onnx fp32) | 538 | 0.0149 | 67.1 |
| 8 (onnx int8) | 210 | 0.0058 | 172.4 |
| 16 (onnx fp32) | 288 | 0.0080 | 125.2 |
| 16 (onnx int8) | 117 | 0.0032 | 309.9 |
| 32 (onnx fp32) | 167 | 0.0046 | 216.5 |
| 32 (onnx int8) | 86 | 0.0024 | 420.0 |
| 64 (onnx fp32) | 158 | 0.0044 | 228.1 |
| 64 (onnx int8) | 82 | 0.0023 | 442.8 |
| 96 (onnx fp32) | 151 | 0.0042 | 238.0 |
| 96 (onnx int8) | 80 | 0.0022 | 452.0 |
### Intel(R) Xeon(R) Platinum 8269CY CPU @ 2.50GHz, 16 cores / 32 logical processors, with avx512_vnni
| concurrent-tasks | processing time(s) | RTF | Speedup Rate |
|:----------------:|:------------------:|:------:|:------------:|
| 1 (onnx fp32) | 2613 | 0.0724 | 13.8 |
| 1 (onnx int8) | 1321 | 0.0366 | 22.4 |
| 32 (onnx fp32) | 170 | 0.0047 | 212.7 |
| 32 (onnx int8) | 89 | 0.0025 | 407.0 |
| 64 (onnx fp32) | 166 | 0.0046 | 217.1 |
| 64 (onnx int8) | 87 | 0.0024 | 414.7 |
### Intel(R) Xeon(R) Platinum 8163 CPU @ 2.50GHz, 32 cores / 64 logical processors, without avx512_vnni
| concurrent-tasks | processing time(s) | RTF | Speedup Rate |
|:----------------:|:------------------:|:------:|:------------:|
| 1 (onnx fp32) | 2959 | 0.0820 | 12.2 |
| 1 (onnx int8) | 2814 | 0.0778 | 12.8 |
| 16 (onnx fp32) | 373 | 0.0103 | 96.9 |
| 16 (onnx int8) | 331 | 0.0091 | 109.0 |
| 32 (onnx fp32) | 211 | 0.0058 | 171.4 |
| 32 (onnx int8) | 181 | 0.0050 | 200.0 |
| 64 (onnx fp32) | 153 | 0.0042 | 235.9 |
| 64 (onnx int8) | 103 | 0.0029 | 349.9 |
| 96 (onnx fp32) | 146 | 0.0041 | 247.0 |
| 96 (onnx int8) | 108 | 0.0030 | 334.1 |
## [Paraformer](https://modelscope.cn/models/damo/speech_paraformer_asr_nat-zh-cn-16k-common-vocab8358-tensorflow1/summary)
### Intel(R) Xeon(R) Platinum 8369B CPU @ 2.90GHz, 16 cores / 32 logical processors, with avx512_vnni
| concurrent-tasks | processing time(s) | RTF | Speedup Rate |
|:----------------:|:------------------:|:------:|:------------:|
| 1 (onnx fp32) | 1173 | 0.0325 | 30.8 |
| 1 (onnx int8) | 976 | 0.0270 | 37.0 |
| 16 (onnx fp32) | 91 | 0.0025 | 395.2 |
| 16 (onnx int8) | 78 | 0.0022 | 463.0 |
| 32 (onnx fp32) | 60 | 0.0017 | 598.8 |
| 32 (onnx int8) | 40 | 0.0011 | 892.9 |
| 64 (onnx fp32) | 55 | 0.0015 | 653.6 |
| 64 (onnx int8) | 31 | 0.0009 | 1162.8 |
| 96 (onnx fp32) | 57 | 0.0016 | 632.9 |
| 96 (onnx int8) | 33 | 0.0009 | 1098.9 |

Some files were not shown because too many files have changed in this diff.