mirror of https://github.com/modelscope/FunASR
synced 2025-09-15 14:48:36 +08:00

Merge branch 'main' of github.com:alibaba-damo-academy/FunASR into dev_dzh

commit 4137f5cf26

README.md (30 changes)
@@ -15,36 +15,10 @@

| [**Model Zoo**](https://www.modelscope.cn/models/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary) | [**Contact**](#contact) |

## What's new:

### 2023.2.17, funasr-0.2.0, modelscope-1.3.0

- We support a new feature: Paraformer models can be exported to [ONNX and TorchScript](https://github.com/alibaba-damo-academy/FunASR/tree/main/funasr/export) from ModelScope. Locally finetuned models are also supported.
- We support a new feature, [onnxruntime](https://github.com/alibaba-damo-academy/FunASR/tree/main/funasr/runtime/python): you can deploy the runtime without ModelScope or FunASR. For the [paraformer-large](https://www.modelscope.cn/models/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary) model, onnxruntime achieves a 3x RTF speedup (0.110 -> 0.038) on CPU, [details](https://github.com/alibaba-damo-academy/FunASR/tree/main/funasr/runtime/python/onnxruntime/paraformer/rapid_paraformer#speed).
- We support a new feature, [grpc](https://github.com/alibaba-damo-academy/FunASR/tree/main/funasr/runtime/python/grpc): you can build an ASR service with gRPC by deploying either the ModelScope pipeline or the onnxruntime runtime.
- We release a new model, [paraformer-large-contextual](https://www.modelscope.cn/models/damo/speech_paraformer-large-contextual_asr_nat-zh-cn-16k-common-vocab8404/summary), which supports hotword customization based on incentive enhancement and improves the recall and precision of hotwords.
- We optimize the timestamp alignment of [Paraformer-large-long](https://modelscope.cn/models/damo/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary): timestamp prediction accuracy is much improved, achieving an accumulated average shift (AAS) of 74.7 ms, [details](https://arxiv.org/abs/2301.12343).
- We release a new model, the [8k VAD model](https://modelscope.cn/models/damo/speech_fsmn_vad_zh-cn-16k-common-pytorch/summary), which predicts the duration of non-silence speech. It can be freely combined with any ASR model in [ModelScope](https://github.com/alibaba-damo-academy/FunASR/discussions/134).
- We release a new model, [MFCCA](https://www.modelscope.cn/models/NPU-ASLP/speech_mfcca_asr-zh-cn-16k-alimeeting-vocab4950/summary), a multi-channel multi-speaker model that is independent of the number and geometry of microphones and supports Mandarin meeting transcription.
- We release several new UniASR models: [Southern Fujian Dialect model](https://modelscope.cn/models/damo/speech_UniASR_asr_2pass-minnan-16k-common-vocab3825/summary), [French model](https://modelscope.cn/models/damo/speech_UniASR_asr_2pass-fr-16k-common-vocab3472-tensorflow1-online/summary), [German model](https://modelscope.cn/models/damo/speech_UniASR_asr_2pass-de-16k-common-vocab3690-tensorflow1-online/summary), [Vietnamese model](https://modelscope.cn/models/damo/speech_UniASR_asr_2pass-vi-16k-common-vocab1001-pytorch-online/summary), [Persian model](https://modelscope.cn/models/damo/speech_UniASR_asr_2pass-fa-16k-common-vocab1257-pytorch-online/summary).
- We release a new model, the [paraformer-data2vec model](https://www.modelscope.cn/models/damo/speech_data2vec_pretrain-paraformer-zh-cn-aishell2-16k/summary): a model pretrained without supervision on AISHELL-2, used to initialize a Paraformer model which is then finetuned on AISHELL-1.
- We release a new feature: the `VAD`, `ASR` and `PUNC` models can be combined freely, using either models from [ModelScope](https://www.modelscope.cn/models/damo/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary) or locally finetuned models; see the [demo](https://github.com/alibaba-damo-academy/FunASR/discussions/134) and the sketch after this section.
- We optimized the [punctuation common model](https://www.modelscope.cn/models/damo/punc_ct-transformer_zh-cn-common-vocab272727-pytorch/summary): recall and precision are enhanced, and bad cases with missing punctuation marks are fixed.
- The ModelScope inference pipeline now supports various new audio input types, including mp3, flac, ogg, opus, ...
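A minimal sketch of the combined VAD + ASR + punctuation usage via the ModelScope pipeline (the model ID is the Paraformer-large-long pipeline linked above; `example.wav` is a placeholder):

```python
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

# One pipeline call runs VAD, ASR and punctuation restoration together
inference_pipeline = pipeline(
    task=Tasks.auto_speech_recognition,
    model="damo/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch",
)
rec_result = inference_pipeline(audio_in="example.wav")  # also accepts mp3/flac/ogg/opus, URLs, bytes
print(rec_result)
```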
### 2023.1.16, funasr-0.1.6, modelscope-1.2.0

- We release a new version of [Paraformer-large-long](https://modelscope.cn/models/damo/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary), which integrates the [VAD](https://modelscope.cn/models/damo/speech_fsmn_vad_zh-cn-16k-common-pytorch/summary) model, the [ASR](https://www.modelscope.cn/models/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary) model, the [Punctuation](https://www.modelscope.cn/models/damo/punc_ct-transformer_zh-cn-common-vocab272727-pytorch/summary) model and timestamps. The model can take inputs several hours long.
- We release a new model, the [16k VAD model](https://modelscope.cn/models/damo/speech_fsmn_vad_zh-cn-16k-common-pytorch/summary), which predicts the duration of non-silence speech. It can be freely combined with any ASR model in [ModelScope](https://www.modelscope.cn/models/damo/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary).
- We release a new model, [Punctuation](https://www.modelscope.cn/models/damo/punc_ct-transformer_zh-cn-common-vocab272727-pytorch/summary), which predicts the punctuation of ASR results. It can be freely combined with any ASR model in the [Model Zoo](docs/modelscope_models.md).
- We release a new model, [Data2vec](https://www.modelscope.cn/models/damo/speech_data2vec_pretrain-zh-cn-aishell2-16k-pytorch/summary), an unsupervised pretraining model which can be finetuned on ASR and other downstream tasks.
- We release a new model, [Paraformer-Tiny](https://www.modelscope.cn/models/damo/speech_paraformer-tiny-commandword_asr_nat-zh-cn-16k-vocab544-pytorch/summary), a lightweight Paraformer model which supports Mandarin command-word recognition.
- We release a new model, [SV](https://www.modelscope.cn/models/damo/speech_xvector_sv-zh-cn-cnceleb-16k-spk3465-pytorch/summary), which extracts speaker embeddings and performs speaker verification on paired utterances. Speaker diarization will be supported in a future version.
- We improve the ModelScope pipeline to speed up inference by integrating model building into pipeline building.
- The ModelScope inference pipeline now supports various new audio input types, including wav.scp, wav format, audio bytes, wave samples, ...
For the release notes, please refer to [news](https://github.com/alibaba-damo-academy/FunASR/releases).

## Highlights

- Many types of typical models are supported, e.g., [Transformer](https://arxiv.org/abs/1706.03762), [Conformer](https://arxiv.org/abs/2005.08100), [Paraformer](https://arxiv.org/abs/2206.08317).
@@ -52,7 +52,7 @@ asr_config=conf/train_asr_conformer.yaml
 model_dir="baseline_$(basename "${asr_config}" .yaml)_${feats_type}_${lang}_${token_type}_${tag}"

 inference_config=conf/decode_asr_transformer.yaml
-inference_asr_model=valid.acc.ave_10best.pth
+inference_asr_model=valid.acc.ave_10best.pb

 # you can set gpu num for decoding here
 gpuid_list=$CUDA_VISIBLE_DEVICES # set gpus for decoding, the same as training stage by default

@@ -217,7 +217,7 @@ if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
     if [ -n "${inference_config}" ]; then
         _opts+="--config ${inference_config} "
     fi
-    ${infer_cmd} --gpu "${_ngpu}" --max-jobs-run "${_nj}" JOB=1: "${_nj}" "${_logdir}"/asr_inference.JOB.log \
+    ${infer_cmd} --gpu "${_ngpu}" --max-jobs-run "${_nj}" JOB=1:"${_nj}" "${_logdir}"/asr_inference.JOB.log \
         python -m funasr.bin.asr_inference_launch \
             --batch_size 1 \
             --ngpu "${_ngpu}" \

@@ -55,7 +55,7 @@ asr_config=conf/train_asr_paraformer_transformer_12e_6d_3072_768.yaml
 model_dir="baseline_$(basename "${asr_config}" .yaml)_${feats_type}_${lang}_${token_type}_${tag}"

 inference_config=conf/decode_asr_transformer_noctc_1best.yaml
-inference_asr_model=valid.acc.ave_10best.pth
+inference_asr_model=valid.acc.ave_10best.pb

 # you can set gpu num for decoding here
 gpuid_list=$CUDA_VISIBLE_DEVICES # set gpus for decoding, the same as training stage by default

@@ -55,7 +55,7 @@ asr_config=conf/train_asr_transformer_12e_6d_3072_768.yaml
 model_dir="baseline_$(basename "${asr_config}" .yaml)_${feats_type}_${lang}_${token_type}_${tag}"

 inference_config=conf/decode_asr_transformer.yaml
-inference_asr_model=valid.cer_ctc.ave_10best.pth
+inference_asr_model=valid.cer_ctc.ave_10best.pb

 # you can set gpu num for decoding here
 gpuid_list=$CUDA_VISIBLE_DEVICES # set gpus for decoding, the same as training stage by default

@@ -52,7 +52,7 @@ asr_config=conf/train_asr_paraformer_conformer_12e_6d_2048_256.yaml
 model_dir="baseline_$(basename "${asr_config}" .yaml)_${feats_type}_${lang}_${token_type}_${tag}"

 inference_config=conf/decode_asr_transformer_noctc_1best.yaml
-inference_asr_model=valid.acc.ave_10best.pth
+inference_asr_model=valid.acc.ave_10best.pb

 # you can set gpu num for decoding here
 gpuid_list=$CUDA_VISIBLE_DEVICES # set gpus for decoding, the same as training stage by default

@@ -56,7 +56,7 @@ asr_config=conf/train_asr_paraformerbert_conformer_12e_6d_2048_256.yaml
 model_dir="baseline_$(basename "${asr_config}" .yaml)_${feats_type}_${lang}_${token_type}_${tag}"

 inference_config=conf/decode_asr_transformer_noctc_1best.yaml
-inference_asr_model=valid.acc.ave_10best.pth
+inference_asr_model=valid.acc.ave_10best.pb

 # you can set gpu num for decoding here
 gpuid_list=$CUDA_VISIBLE_DEVICES # set gpus for decoding, the same as training stage by default

@@ -52,7 +52,7 @@ asr_config=conf/train_asr_conformer.yaml
 model_dir="baseline_$(basename "${asr_config}" .yaml)_${feats_type}_${lang}_${token_type}_${tag}"

 inference_config=conf/decode_asr_transformer.yaml
-inference_asr_model=valid.acc.ave_10best.pth
+inference_asr_model=valid.acc.ave_10best.pb

 # you can set gpu num for decoding here
 gpuid_list=$CUDA_VISIBLE_DEVICES # set gpus for decoding, the same as training stage by default
53  egs/aishell/transformer/utils/cmvn_converter.py  (new file)

@@ -0,0 +1,53 @@
import argparse
import json

import numpy as np


def get_parser():
    parser = argparse.ArgumentParser(
        description="cmvn converter",
        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
    )
    parser.add_argument(
        "--cmvn-json",
        "-c",
        required=True,
        type=str,
        help="cmvn json file",
    )
    parser.add_argument(
        "--am-mvn",
        "-a",
        required=True,
        type=str,
        help="am mvn file",
    )
    return parser


def main():
    parser = get_parser()
    args = parser.parse_args()

    with open(args.cmvn_json, "r") as fin:
        cmvn_dict = json.load(fin)

    mean_stats = np.array(cmvn_dict["mean_stats"])
    var_stats = np.array(cmvn_dict["var_stats"])
    total_frame = np.array(cmvn_dict["total_frames"])

    # shift = -mean, scale = 1 / stddev, per feature dimension
    mean = -1.0 * mean_stats / total_frame
    var = 1.0 / np.sqrt(var_stats / total_frame - mean * mean)
    dims = mean.shape[0]
    with open(args.am_mvn, 'w') as fout:
        # Kaldi nnet1-style text format: <AddShift> applies -mean, <Rescale> applies 1/stddev
        fout.write("<Nnet>" + "\n" + "<Splice> " + str(dims) + " " + str(dims) + '\n' + "[ 0 ]" + "\n" + "<AddShift> " + str(dims) + " " + str(dims) + "\n")
        mean_str = str(list(mean)).replace(',', '').replace('[', '[ ').replace(']', ' ]')
        fout.write("<LearnRateCoef> 0 " + mean_str + '\n')
        fout.write("<Rescale> " + str(dims) + " " + str(dims) + '\n')
        var_str = str(list(var)).replace(',', '').replace('[', '[ ').replace(']', ' ]')
        fout.write("<LearnRateCoef> 0 " + var_str + '\n')
        fout.write("</Nnet>" + '\n')


if __name__ == '__main__':
    main()
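A minimal usage sketch (hypothetical file names): the converter turns accumulated CMVN statistics into a Kaldi-style `am.mvn`, applying shift = -mean and scale = 1/stddev per dimension, mirroring the arithmetic above.

```python
import json
import numpy as np

# Toy statistics for a 2-dim feature space (hypothetical values)
stats = {"mean_stats": [10.0, 20.0], "var_stats": [60.0, 250.0], "total_frames": 10}
with open("cmvn.json", "w") as f:
    json.dump(stats, f)

# Same arithmetic as the converter above:
mean = -np.array(stats["mean_stats"]) / stats["total_frames"]   # [-1, -2]
scale = 1.0 / np.sqrt(np.array(stats["var_stats"]) / stats["total_frames"] - mean * mean)

# then: python cmvn_converter.py --cmvn-json cmvn.json --am-mvn am.mvn
```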
@@ -45,8 +45,8 @@ def compute_wer(ref_file,
         if out_item['wrong'] > 0:
             rst['wrong_sentences'] += 1
             cer_detail_writer.write(hyp_key + print_cer_detail(out_item) + '\n')
-            cer_detail_writer.write("ref:" + '\t' + "".join(ref_dict[hyp_key]) + '\n')
-            cer_detail_writer.write("hyp:" + '\t' + "".join(hyp_dict[hyp_key]) + '\n')
+            cer_detail_writer.write("ref:" + '\t' + " ".join(list(map(lambda x: x.lower(), ref_dict[hyp_key]))) + '\n')
+            cer_detail_writer.write("hyp:" + '\t' + " ".join(list(map(lambda x: x.lower(), hyp_dict[hyp_key]))) + '\n')

     if rst['Wrd'] > 0:
         rst['Err'] = round(rst['wrong_words'] * 100 / rst['Wrd'], 2)
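The change switches the detail log from concatenated, case-preserving tokens to space-separated, lowercased tokens; a quick illustration:

```python
ref = ["Hello", "WORLD"]
"".join(ref)                      # old format: 'HelloWORLD'
" ".join(x.lower() for x in ref)  # new format: 'hello world'
```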
@@ -54,7 +54,7 @@ asr_config=conf/train_asr_conformer.yaml
 model_dir="baseline_$(basename "${asr_config}" .yaml)_${feats_type}_${lang}_${token_type}_${tag}"

 inference_config=conf/decode_asr_transformer.yaml
-inference_asr_model=valid.acc.ave_10best.pth
+inference_asr_model=valid.acc.ave_10best.pb

 # you can set gpu num for decoding here
 gpuid_list=$CUDA_VISIBLE_DEVICES # set gpus for decoding, e.g., gpuid_list=2,3, the same as training stage by default

@@ -54,7 +54,7 @@ asr_config=conf/train_asr_paraformer_conformer_20e_1280_320_6d_1280_320.yaml
 model_dir="baseline_$(basename "${asr_config}" .yaml)_${feats_type}_${lang}_${token_type}_${tag}"

 inference_config=conf/decode_asr_transformer_noctc_1best.yaml
-inference_asr_model=valid.acc.ave_10best.pth
+inference_asr_model=valid.acc.ave_10best.pb

 # you can set gpu num for decoding here
 gpuid_list=$CUDA_VISIBLE_DEVICES # set gpus for decoding, e.g., gpuid_list=2,3, the same as training stage by default

@@ -58,7 +58,7 @@ asr_config=conf/train_asr_paraformerbert_conformer_20e_6d_1280_320.yaml
 model_dir="baseline_$(basename "${asr_config}" .yaml)_${feats_type}_${lang}_${token_type}_${tag}"

 inference_config=conf/decode_asr_transformer_noctc_1best.yaml
-inference_asr_model=valid.acc.ave_10best.pth
+inference_asr_model=valid.acc.ave_10best.pb

 # you can set gpu num for decoding here
 gpuid_list=$CUDA_VISIBLE_DEVICES # set gpus for decoding, e.g., gpuid_list=2,3, the same as training stage by default

@@ -54,7 +54,7 @@ asr_config=conf/train_asr_transformer.yaml
 model_dir="baseline_$(basename "${asr_config}" .yaml)_${feats_type}_${lang}_${token_type}_${tag}"

 inference_config=conf/decode_asr_transformer.yaml
-inference_asr_model=valid.acc.ave_10best.pth
+inference_asr_model=valid.acc.ave_10best.pb

 # you can set gpu num for decoding here
 gpuid_list=$CUDA_VISIBLE_DEVICES # set gpus for decoding, e.g., gpuid_list=2,3, the same as training stage by default

@@ -34,7 +34,7 @@ exp_dir=./data
 tag=exp1
 model_dir="baseline_$(basename "${lm_config}" .yaml)_${lang}_${token_type}_${tag}"
 lm_exp=${exp_dir}/exp/${model_dir}
-inference_lm=valid.loss.ave.pth # Language model path for decoding.
+inference_lm=valid.loss.ave.pb # Language model path for decoding.

 stage=0
 stop_stage=3
@@ -4,7 +4,7 @@ import sys

 def main():
     diar_config_path = sys.argv[1] if len(sys.argv) > 1 else "sond_fbank.yaml"
-    diar_model_path = sys.argv[2] if len(sys.argv) > 2 else "sond.pth"
+    diar_model_path = sys.argv[2] if len(sys.argv) > 2 else "sond.pb"
     output_dir = sys.argv[3] if len(sys.argv) > 3 else "./outputs"
     data_path_and_name_and_type = [
         ("data/test_rmsil/feats.scp", "speech", "kaldi_ark"),

@@ -17,9 +17,9 @@ if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
     echo "Downloading Pre-trained model..."
     git clone https://www.modelscope.cn/damo/speech_xvector_sv-zh-cn-cnceleb-16k-spk3465-pytorch.git
     git clone https://www.modelscope.cn/damo/speech_diarization_sond-zh-cn-alimeeting-16k-n16k4-pytorch.git
-    ln -s speech_xvector_sv-zh-cn-cnceleb-16k-spk3465-pytorch/sv.pth ./sv.pth
+    ln -s speech_xvector_sv-zh-cn-cnceleb-16k-spk3465-pytorch/sv.pb ./sv.pb
     cp speech_xvector_sv-zh-cn-cnceleb-16k-spk3465-pytorch/sv.yaml ./sv.yaml
-    ln -s speech_diarization_sond-zh-cn-alimeeting-16k-n16k4-pytorch/sond.pth ./sond.pth
+    ln -s speech_diarization_sond-zh-cn-alimeeting-16k-n16k4-pytorch/sond.pb ./sond.pb
     cp speech_diarization_sond-zh-cn-alimeeting-16k-n16k4-pytorch/sond_fbank.yaml ./sond_fbank.yaml
     cp speech_diarization_sond-zh-cn-alimeeting-16k-n16k4-pytorch/sond.yaml ./sond.yaml
     echo "Done."

@@ -30,7 +30,7 @@ fi

 if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
     echo "Calculating diarization results..."
-    python infer_alimeeting_test.py sond_fbank.yaml sond.pth outputs
+    python infer_alimeeting_test.py sond_fbank.yaml sond.pb outputs
     python local/convert_label_to_rttm.py \
         outputs/labels.txt \
         data/test_rmsil/raw_rmsil_map.scp \
@@ -4,7 +4,7 @@ import os

 def test_fbank_cpu_infer():
     diar_config_path = "config_fbank.yaml"
-    diar_model_path = "sond.pth"
+    diar_model_path = "sond.pb"
     output_dir = "./outputs"
     data_path_and_name_and_type = [
         ("data/unit_test/test_feats.scp", "speech", "kaldi_ark"),

@@ -24,7 +24,7 @@ def test_fbank_cpu_infer():

 def test_fbank_gpu_infer():
     diar_config_path = "config_fbank.yaml"
-    diar_model_path = "sond.pth"
+    diar_model_path = "sond.pb"
     output_dir = "./outputs"
     data_path_and_name_and_type = [
         ("data/unit_test/test_feats.scp", "speech", "kaldi_ark"),

@@ -45,7 +45,7 @@ def test_fbank_gpu_infer():

 def test_wav_gpu_infer():
     diar_config_path = "config.yaml"
-    diar_model_path = "sond.pth"
+    diar_model_path = "sond.pb"
     output_dir = "./outputs"
     data_path_and_name_and_type = [
         ("data/unit_test/test_wav.scp", "speech", "sound"),

@@ -66,7 +66,7 @@ def test_wav_gpu_infer():

 def test_without_profile_gpu_infer():
     diar_config_path = "config.yaml"
-    diar_model_path = "sond.pth"
+    diar_model_path = "sond.pb"
     output_dir = "./outputs"
     raw_inputs = [[
         "data/unit_test/raw_inputs/record.wav",

@@ -4,7 +4,7 @@ import os

 def test_fbank_cpu_infer():
     diar_config_path = "sond_fbank.yaml"
-    diar_model_path = "sond.pth"
+    diar_model_path = "sond.pb"
     output_dir = "./outputs"
     data_path_and_name_and_type = [
         ("data/unit_test/test_feats.scp", "speech", "kaldi_ark"),

@@ -24,7 +24,7 @@ def test_fbank_cpu_infer():

 def test_fbank_gpu_infer():
     diar_config_path = "sond_fbank.yaml"
-    diar_model_path = "sond.pth"
+    diar_model_path = "sond.pb"
     output_dir = "./outputs"
     data_path_and_name_and_type = [
         ("data/unit_test/test_feats.scp", "speech", "kaldi_ark"),

@@ -45,7 +45,7 @@ def test_fbank_gpu_infer():

 def test_wav_gpu_infer():
     diar_config_path = "config.yaml"
-    diar_model_path = "sond.pth"
+    diar_model_path = "sond.pb"
     output_dir = "./outputs"
     data_path_and_name_and_type = [
         ("data/unit_test/test_wav.scp", "speech", "sound"),

@@ -66,7 +66,7 @@ def test_wav_gpu_infer():

 def test_without_profile_gpu_infer():
     diar_config_path = "config.yaml"
-    diar_model_path = "sond.pth"
+    diar_model_path = "sond.pb"
     output_dir = "./outputs"
     raw_inputs = [[
         "data/unit_test/raw_inputs/record.wav",
@@ -0,0 +1,6 @@
beam_size: 10
penalty: 0.0
maxlenratio: 0.0
minlenratio: 0.0
ctc_weight: 0.5
lm_weight: 0.7
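These weights follow the standard joint CTC/attention decoding with LM shallow fusion; under that general formulation (not FunASR-specific notation), each beam-search hypothesis y for input x is scored as

score(y | x) = (1 − λ)·log p_att(y | x) + λ·log p_ctc(y | x) + β·log p_lm(y),

with λ = ctc_weight = 0.5 and β = lm_weight = 0.7.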
80  egs/librispeech/conformer/conf/train_asr_conformer.yaml  (new file)

@@ -0,0 +1,80 @@
encoder: conformer
encoder_conf:
    output_size: 512
    attention_heads: 8
    linear_units: 2048
    num_blocks: 12
    dropout_rate: 0.1
    positional_dropout_rate: 0.1
    attention_dropout_rate: 0.1
    input_layer: conv2d
    normalize_before: true
    macaron_style: true
    rel_pos_type: latest
    pos_enc_layer_type: rel_pos
    selfattention_layer_type: rel_selfattn
    activation_type: swish
    use_cnn_module: true
    cnn_module_kernel: 31

decoder: transformer
decoder_conf:
    attention_heads: 8
    linear_units: 2048
    num_blocks: 6
    dropout_rate: 0.1
    positional_dropout_rate: 0.1
    self_attention_dropout_rate: 0.1
    src_attention_dropout_rate: 0.1

model_conf:
    ctc_weight: 0.3
    lsm_weight: 0.1
    length_normalized_loss: false

accum_grad: 2
max_epoch: 50
patience: none
init: none
best_model_criterion:
-   - valid
    - acc
    - max
keep_nbest_models: 10

optim: adam
optim_conf:
    lr: 0.0025
    weight_decay: 0.000001
scheduler: warmuplr
scheduler_conf:
    warmup_steps: 40000

specaug: specaug
specaug_conf:
    apply_time_warp: true
    time_warp_window: 5
    time_warp_mode: bicubic
    apply_freq_mask: true
    freq_mask_width_range:
    - 0
    - 27
    num_freq_mask: 2
    apply_time_mask: true
    time_mask_width_ratio_range:
    - 0.
    - 0.05
    num_time_mask: 10

dataset_conf:
    shuffle: True
    shuffle_conf:
        shuffle_size: 1024
        sort_size: 500
    batch_conf:
        batch_type: token
        batch_size: 10000
    num_workers: 8

log_interval: 50
normalize: None
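The `specaug_conf` above applies time warping plus frequency and time masking; a minimal frequency-masking sketch with the config's values (illustrative only, not the FunASR implementation):

```python
import numpy as np

def freq_mask(feats: np.ndarray, width_range=(0, 27), num_masks=2) -> np.ndarray:
    """Zero out `num_masks` random frequency bands of a (frames x dims) fbank matrix."""
    feats = feats.copy()
    dims = feats.shape[1]
    for _ in range(num_masks):
        w = np.random.randint(width_range[0], width_range[1] + 1)  # mask width
        f0 = np.random.randint(0, max(dims - w, 1))                # mask start bin
        feats[:, f0:f0 + w] = 0.0
    return feats
```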
@@ -0,0 +1,80 @@
encoder: conformer
encoder_conf:
    output_size: 512
    attention_heads: 8
    linear_units: 2048
    num_blocks: 12
    dropout_rate: 0.1
    positional_dropout_rate: 0.1
    attention_dropout_rate: 0.1
    input_layer: conv2d
    normalize_before: true
    macaron_style: true
    rel_pos_type: latest
    pos_enc_layer_type: rel_pos
    selfattention_layer_type: rel_selfattn
    activation_type: swish
    use_cnn_module: true
    cnn_module_kernel: 31

decoder: transformer
decoder_conf:
    attention_heads: 8
    linear_units: 2048
    num_blocks: 6
    dropout_rate: 0.1
    positional_dropout_rate: 0.1
    self_attention_dropout_rate: 0.1
    src_attention_dropout_rate: 0.1

model_conf:
    ctc_weight: 0.3
    lsm_weight: 0.1
    length_normalized_loss: false

accum_grad: 2
max_epoch: 50
patience: none
init: none
best_model_criterion:
-   - valid
    - acc
    - max
keep_nbest_models: 10

optim: adam
optim_conf:
    lr: 0.0025
    weight_decay: 0.000001
scheduler: warmuplr
scheduler_conf:
    warmup_steps: 40000

specaug: specaug
specaug_conf:
    apply_time_warp: true
    time_warp_window: 5
    time_warp_mode: bicubic
    apply_freq_mask: true
    freq_mask_width_range:
    - 0
    - 27
    num_freq_mask: 2
    apply_time_mask: true
    time_mask_width_ratio_range:
    - 0.
    - 0.05
    num_time_mask: 10

dataset_conf:
    shuffle: True
    shuffle_conf:
        shuffle_size: 1024
        sort_size: 500
    batch_conf:
        batch_type: token
        batch_size: 10000
    num_workers: 8

log_interval: 50
normalize: utterance_mvn
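The only difference from the previous config is the final `normalize: utterance_mvn`, i.e., per-utterance mean-variance normalization instead of none; a minimal sketch of the operation (not the FunASR implementation):

```python
import numpy as np

def utterance_mvn(feats: np.ndarray, eps: float = 1e-10) -> np.ndarray:
    """Normalize one utterance's features (frames x dims) to zero mean, unit variance."""
    mean = feats.mean(axis=0, keepdims=True)
    std = feats.std(axis=0, keepdims=True)
    return (feats - mean) / np.maximum(std, eps)
```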
58  egs/librispeech/conformer/local/data_prep_librispeech.sh  (new executable file)
@@ -0,0 +1,58 @@
#!/usr/bin/env bash

# Copyright 2014 Vassil Panayotov
#           2014 Johns Hopkins University (author: Daniel Povey)
# Apache 2.0

if [ "$#" -ne 2 ]; then
  echo "Usage: $0 <src-dir> <dst-dir>"
  echo "e.g.: $0 /export/a15/vpanayotov/data/LibriSpeech/dev-clean data/dev-clean"
  exit 1
fi

src=$1
dst=$2

# all utterances are FLAC compressed
if ! which flac >&/dev/null; then
  echo "Please install 'flac' on ALL worker nodes!"
  exit 1
fi

spk_file=$src/../SPEAKERS.TXT

mkdir -p $dst || exit 1

[ ! -d $src ] && echo "$0: no such directory $src" && exit 1
[ ! -f $spk_file ] && echo "$0: expected file $spk_file to exist" && exit 1

wav_scp=$dst/wav.scp; [[ -f "$wav_scp" ]] && rm $wav_scp
trans=$dst/text; [[ -f "$trans" ]] && rm $trans

for reader_dir in $(find -L $src -mindepth 1 -maxdepth 1 -type d | sort); do
  reader=$(basename $reader_dir)
  if ! [ $reader -eq $reader ]; then  # not integer.
    echo "$0: unexpected subdirectory name $reader"
    exit 1
  fi

  for chapter_dir in $(find -L $reader_dir/ -mindepth 1 -maxdepth 1 -type d | sort); do
    chapter=$(basename $chapter_dir)
    if ! [ "$chapter" -eq "$chapter" ]; then
      echo "$0: unexpected chapter-subdirectory name $chapter"
      exit 1
    fi

    find -L $chapter_dir/ -iname "*.flac" | sort | xargs -I% basename % .flac | \
      awk -v "dir=$chapter_dir" '{printf "%s %s/%s.flac \n", $0, dir, $0}' >>$wav_scp || exit 1

    chapter_trans=$chapter_dir/${reader}-${chapter}.trans.txt
    [ ! -f $chapter_trans ] && echo "$0: expected file $chapter_trans to exist" && exit 1
    cat $chapter_trans >>$trans
  done
done

echo "$0: successfully prepared data in $dst"

exit 0
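The script emits Kaldi-style `wav.scp` (`<utt-id> <flac-path>`) and `text` (`<utt-id> <transcript>`) files; a minimal sketch of consuming them (hypothetical path):

```python
# Read the generated wav.scp into a dict: utterance id -> flac path
wav_scp = {}
with open("data/dev_clean/wav.scp") as f:
    for line in f:
        utt_id, path = line.strip().split(maxsplit=1)
        wav_scp[utt_id] = path
```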
5  egs/librispeech/conformer/path.sh  (new executable file)
@@ -0,0 +1,5 @@
export FUNASR_DIR=$PWD/../../..

# NOTE(kan-bayashi): Use UTF-8 in Python to avoid UnicodeDecodeError when LC_ALL=C
export PYTHONIOENCODING=UTF-8
export PATH=$FUNASR_DIR/funasr/bin:$PATH
262  egs/librispeech/conformer/run.sh  (new executable file)
@@ -0,0 +1,262 @@
#!/usr/bin/env bash

. ./path.sh || exit 1;

# machines configuration
CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7"
gpu_num=8
count=1
gpu_inference=true  # Whether to perform gpu decoding, set false for cpu decoding
# for gpu decoding, inference_nj=ngpu*njob; for cpu decoding, inference_nj=njob
njob=5
train_cmd=utils/run.pl
infer_cmd=utils/run.pl

# general configuration
feats_dir="../DATA"  # feature output directory
exp_dir="."
lang=en
dumpdir=dump/fbank
feats_type=fbank
token_type=bpe
dataset_type=large
scp=feats.scp
type=kaldi_ark
stage=3
stop_stage=4

# feature configuration
feats_dim=80
sample_frequency=16000
nj=100
speed_perturb="0.9,1.0,1.1"

# data
data_librispeech=

# bpe model
nbpe=5000
bpemode=unigram

# exp tag
tag=""

. utils/parse_options.sh || exit 1;

# Set bash to 'debug' mode, it will exit on:
# -e 'error', -u 'undefined variable', -o ... 'error in pipeline', -x 'print commands',
set -e
set -u
set -o pipefail

train_set=train_960
valid_set=dev
test_sets="test_clean test_other dev_clean dev_other"

asr_config=conf/train_asr_conformer.yaml
#asr_config=conf/train_asr_conformer_uttnorm.yaml
model_dir="baseline_$(basename "${asr_config}" .yaml)_${feats_type}_${lang}_${token_type}_${tag}"

inference_config=conf/decode_asr_transformer.yaml
#inference_config=conf/decode_asr_transformer_beam60_ctc0.3.yaml
inference_asr_model=valid.acc.ave_10best.pth

# you can set gpu num for decoding here
gpuid_list=$CUDA_VISIBLE_DEVICES  # set gpus for decoding, the same as training stage by default
ngpu=$(echo $gpuid_list | awk -F "," '{print NF}')

if ${gpu_inference}; then
    inference_nj=$[${ngpu}*${njob}]
    _ngpu=1
else
    inference_nj=$njob
    _ngpu=0
fi

if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
    echo "stage 0: Data preparation"
    # Data preparation
    for x in train-clean-100 train-clean-360 train-other-500 dev-clean dev-other test-clean test-other; do
        local/data_prep_librispeech.sh ${data_librispeech}/LibriSpeech/${x} ${feats_dir}/data/${x//-/_}
    done
fi

feat_train_dir=${feats_dir}/${dumpdir}/$train_set; mkdir -p ${feat_train_dir}
feat_dev_clean_dir=${feats_dir}/${dumpdir}/dev_clean; mkdir -p ${feat_dev_clean_dir}
feat_dev_other_dir=${feats_dir}/${dumpdir}/dev_other; mkdir -p ${feat_dev_other_dir}
feat_test_clean_dir=${feats_dir}/${dumpdir}/test_clean; mkdir -p ${feat_test_clean_dir}
feat_test_other_dir=${feats_dir}/${dumpdir}/test_other; mkdir -p ${feat_test_other_dir}
feat_dev_dir=${feats_dir}/${dumpdir}/$valid_set; mkdir -p ${feat_dev_dir}
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
    echo "stage 1: Feature Generation"
    # compute fbank features
    fbankdir=${feats_dir}/fbank
    for x in dev_clean dev_other test_clean test_other; do
        utils/compute_fbank.sh --cmd "$train_cmd" --nj 1 --max_lengths 3000 --feats_dim ${feats_dim} --sample_frequency ${sample_frequency} \
            ${feats_dir}/data/${x} ${exp_dir}/exp/make_fbank/${x} ${fbankdir}/${x}
        utils/fix_data_feat.sh ${fbankdir}/${x}
    done

    mkdir ${feats_dir}/data/$train_set
    train_sets="train_clean_100 train_clean_360 train_other_500"
    for file in wav.scp text; do
        ( for f in $train_sets; do cat $feats_dir/data/$f/$file; done ) | sort -k1 > $feats_dir/data/$train_set/$file || exit 1;
    done
    utils/compute_fbank.sh --cmd "$train_cmd" --nj $nj --max_lengths 3000 --feats_dim ${feats_dim} --sample_frequency ${sample_frequency} --speed_perturb ${speed_perturb} \
        ${feats_dir}/data/$train_set ${exp_dir}/exp/make_fbank/$train_set ${fbankdir}/$train_set
    utils/fix_data_feat.sh ${fbankdir}/$train_set

    # compute global cmvn
    utils/compute_cmvn.sh --cmd "$train_cmd" --nj $nj --feats_dim ${feats_dim} \
        ${fbankdir}/$train_set ${exp_dir}/exp/make_fbank/$train_set

    # apply cmvn
    utils/apply_cmvn.sh --cmd "$train_cmd" --nj $nj \
        ${fbankdir}/$train_set ${fbankdir}/$train_set/cmvn.json ${exp_dir}/exp/make_fbank/$train_set ${feat_train_dir}
    utils/apply_cmvn.sh --cmd "$train_cmd" --nj 1 \
        ${fbankdir}/dev_clean ${fbankdir}/$train_set/cmvn.json ${exp_dir}/exp/make_fbank/dev_clean ${feat_dev_clean_dir}
    utils/apply_cmvn.sh --cmd "$train_cmd" --nj 1 \
        ${fbankdir}/dev_other ${fbankdir}/$train_set/cmvn.json ${exp_dir}/exp/make_fbank/dev_other ${feat_dev_other_dir}
    utils/apply_cmvn.sh --cmd "$train_cmd" --nj 1 \
        ${fbankdir}/test_clean ${fbankdir}/$train_set/cmvn.json ${exp_dir}/exp/make_fbank/test_clean ${feat_test_clean_dir}
    utils/apply_cmvn.sh --cmd "$train_cmd" --nj 1 \
        ${fbankdir}/test_other ${fbankdir}/$train_set/cmvn.json ${exp_dir}/exp/make_fbank/test_other ${feat_test_other_dir}

    cp ${fbankdir}/$train_set/text ${fbankdir}/$train_set/speech_shape ${fbankdir}/$train_set/text_shape ${feat_train_dir}
    cp ${fbankdir}/dev_clean/text ${fbankdir}/dev_clean/speech_shape ${fbankdir}/dev_clean/text_shape ${feat_dev_clean_dir}
    cp ${fbankdir}/dev_other/text ${fbankdir}/dev_other/speech_shape ${fbankdir}/dev_other/text_shape ${feat_dev_other_dir}
    cp ${fbankdir}/test_clean/text ${fbankdir}/test_clean/speech_shape ${fbankdir}/test_clean/text_shape ${feat_test_clean_dir}
    cp ${fbankdir}/test_other/text ${fbankdir}/test_other/speech_shape ${fbankdir}/test_other/text_shape ${feat_test_other_dir}

    dev_sets="dev_clean dev_other"
    for file in feats.scp text speech_shape text_shape; do
        ( for f in $dev_sets; do cat $feats_dir/${dumpdir}/$f/$file; done ) | sort -k1 > $feat_dev_dir/$file || exit 1;
    done

    # generate ark list
    utils/gen_ark_list.sh --cmd "$train_cmd" --nj $nj ${feat_train_dir} ${fbankdir}/${train_set} ${feat_train_dir}
    utils/gen_ark_list.sh --cmd "$train_cmd" --nj $nj ${feat_dev_dir} ${fbankdir}/${valid_set} ${feat_dev_dir}
fi

dict=${feats_dir}/data/lang_char/${train_set}_${bpemode}${nbpe}_units.txt
bpemodel=${feats_dir}/data/lang_char/${train_set}_${bpemode}${nbpe}
echo "dictionary: ${dict}"
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
    ### Task dependent. You have to check non-linguistic symbols used in the corpus.
    echo "stage 2: Dictionary and Json Data Preparation"
    mkdir -p ${feats_dir}/data/lang_char/
    echo "<blank>" > ${dict}
    echo "<s>" >> ${dict}
    echo "</s>" >> ${dict}
    cut -f 2- -d" " ${feats_dir}/data/${train_set}/text > ${feats_dir}/data/lang_char/input.txt
    spm_train --input=${feats_dir}/data/lang_char/input.txt --vocab_size=${nbpe} --model_type=${bpemode} --model_prefix=${bpemodel} --input_sentence_size=100000000
    spm_encode --model=${bpemodel}.model --output_format=piece < ${feats_dir}/data/lang_char/input.txt | tr ' ' '\n' | sort | uniq | awk '{print $0}' >> ${dict}
    echo "<unk>" >> ${dict}
    wc -l ${dict}

    vocab_size=$(cat ${dict} | wc -l)
    awk -v v=,${vocab_size} '{print $0v}' ${feat_train_dir}/text_shape > ${feat_train_dir}/text_shape.char
    awk -v v=,${vocab_size} '{print $0v}' ${feat_dev_dir}/text_shape > ${feat_dev_dir}/text_shape.char
    mkdir -p ${feats_dir}/asr_stats_fbank_zh_char/$train_set
    mkdir -p ${feats_dir}/asr_stats_fbank_zh_char/$valid_set
    cp ${feat_train_dir}/speech_shape ${feat_train_dir}/text_shape ${feat_train_dir}/text_shape.char ${feats_dir}/asr_stats_fbank_zh_char/$train_set
    cp ${feat_dev_dir}/speech_shape ${feat_dev_dir}/text_shape ${feat_dev_dir}/text_shape.char ${feats_dir}/asr_stats_fbank_zh_char/$valid_set
fi

# Training Stage
world_size=$gpu_num  # run on one machine
if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
    echo "stage 3: Training"
    mkdir -p ${exp_dir}/exp/${model_dir}
    mkdir -p ${exp_dir}/exp/${model_dir}/log
    INIT_FILE=${exp_dir}/exp/${model_dir}/ddp_init
    if [ -f $INIT_FILE ]; then
        rm -f $INIT_FILE
    fi
    init_method=file://$(readlink -f $INIT_FILE)
    echo "$0: init method is $init_method"
    for ((i = 0; i < $gpu_num; ++i)); do
        {
            rank=$i
            local_rank=$i
            gpu_id=$(echo $CUDA_VISIBLE_DEVICES | cut -d',' -f$[$i+1])
            asr_train.py \
                --gpu_id $gpu_id \
                --use_preprocessor true \
                --split_with_space false \
                --bpemodel ${bpemodel}.model \
                --token_type $token_type \
                --dataset_type $dataset_type \
                --token_list $dict \
                --train_data_file $feats_dir/$dumpdir/${train_set}/ark_txt.scp \
                --valid_data_file $feats_dir/$dumpdir/${valid_set}/ark_txt.scp \
                --resume true \
                --output_dir ${exp_dir}/exp/${model_dir} \
                --config $asr_config \
                --input_size $feats_dim \
                --ngpu $gpu_num \
                --num_worker_count $count \
                --multiprocessing_distributed true \
                --dist_init_method $init_method \
                --dist_world_size $world_size \
                --dist_rank $rank \
                --local_rank $local_rank 1> ${exp_dir}/exp/${model_dir}/log/train.log.$i 2>&1
        } &
    done
    wait
fi

# Testing Stage
if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
    echo "stage 4: Inference"
    for dset in ${test_sets}; do
        asr_exp=${exp_dir}/exp/${model_dir}
        inference_tag="$(basename "${inference_config}" .yaml)"
        _dir="${asr_exp}/${inference_tag}/${inference_asr_model}/${dset}"
        _logdir="${_dir}/logdir"
        if [ -d ${_dir} ]; then
            echo "${_dir} already exists. If you want to decode again, please delete this dir first."
            exit 0
        fi
        mkdir -p "${_logdir}"
        _data="${feats_dir}/${dumpdir}/${dset}"
        key_file=${_data}/${scp}
        num_scp_file="$(<${key_file} wc -l)"
        _nj=$([ $inference_nj -le $num_scp_file ] && echo "$inference_nj" || echo "$num_scp_file")
        split_scps=
        for n in $(seq "${_nj}"); do
            split_scps+=" ${_logdir}/keys.${n}.scp"
        done
        # shellcheck disable=SC2086
        utils/split_scp.pl "${key_file}" ${split_scps}
        _opts=
        if [ -n "${inference_config}" ]; then
            _opts+="--config ${inference_config} "
        fi
        ${infer_cmd} --gpu "${_ngpu}" --max-jobs-run "${_nj}" JOB=1:"${_nj}" "${_logdir}"/asr_inference.JOB.log \
            python -m funasr.bin.asr_inference_launch \
                --batch_size 1 \
                --ngpu "${_ngpu}" \
                --njob ${njob} \
                --gpuid_list ${gpuid_list} \
                --data_path_and_name_and_type "${_data}/${scp},speech,${type}" \
                --key_file "${_logdir}"/keys.JOB.scp \
                --asr_train_config "${asr_exp}"/config.yaml \
                --asr_model_file "${asr_exp}"/"${inference_asr_model}" \
                --output_dir "${_logdir}"/output.JOB \
                --mode asr \
                ${_opts}

        for f in token token_int score text; do
            if [ -f "${_logdir}/output.1/1best_recog/${f}" ]; then
                for i in $(seq "${_nj}"); do
                    cat "${_logdir}/output.${i}/1best_recog/${f}"
                done | sort -k1 > "${_dir}/${f}"
            fi
        done
        python utils/compute_wer.py ${_data}/text ${_dir}/text ${_dir}/text.cer
        tail -n 3 ${_dir}/text.cer > ${_dir}/text.cer.txt
        cat ${_dir}/text.cer.txt
    done
fi
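One detail worth noting in the recipe above: for GPU decoding the job count is inference_nj = ngpu x njob, with each job pinned to one device (_ngpu=1 per job). A quick sketch of the arithmetic, using the defaults above:

```python
ngpu, njob = 8, 5  # GPUs visible, jobs per GPU
gpu_inference = True
inference_nj = ngpu * njob if gpu_inference else njob
print(inference_nj)  # 40 parallel decoding jobs, one GPU each
```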
1  egs/librispeech/conformer/utils  (symbolic link)

@@ -0,0 +1 @@
../../aishell/transformer/utils
@@ -49,7 +49,7 @@ asr_config=conf/train_asr_conformer.yaml
 model_dir="baseline_$(basename "${asr_config}" .yaml)_${feats_type}_${lang}_${token_type}_${tag}"

 inference_config=conf/decode_asr_transformer.yaml
-inference_asr_model=valid.acc.ave_10best.pth
+inference_asr_model=valid.acc.ave_10best.pb

 # you can set gpu num for decoding here
 gpuid_list=$CUDA_VISIBLE_DEVICES # set gpus for decoding, the same as training stage by default
@@ -41,7 +41,7 @@ The decoding results can be found in `$output_dir/1best_recog/text.cer`, which i
 - Modify inference related parameters in `infer_after_finetune.py`
     - <strong>output_dir:</strong> # result dir
     - <strong>data_dir:</strong> # the dataset dir needs to include `test/wav.scp`. If `test/text` also exists, the CER will be computed
-    - <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pth`
+    - <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pb`

 - Then you can run the pipeline to finetune with:
 ```python

@@ -74,7 +74,7 @@ def modelscope_infer(params):
     # If text exists, compute CER
     text_in = os.path.join(params["data_dir"], "text")
     if os.path.exists(text_in):
-        text_proc_file = os.path.join(best_recog_path, "token")
+        text_proc_file = os.path.join(best_recog_path, "text")
         compute_wer(text_in, text_proc_file, os.path.join(best_recog_path, "text.cer"))

@@ -38,7 +38,7 @@ def modelscope_infer_after_finetune(params):
     # compute CER if GT text is set
     text_in = os.path.join(params["data_dir"], "text")
     if os.path.exists(text_in):
-        text_proc_file = os.path.join(decoding_path, "1best_recog/token")
+        text_proc_file = os.path.join(decoding_path, "1best_recog/text")
         compute_wer(text_in, text_proc_file, os.path.join(decoding_path, "text.cer"))

@@ -48,5 +48,5 @@ if __name__ == '__main__':
     params["required_files"] = ["am.mvn", "decoding.yaml", "configuration.json"]
     params["output_dir"] = "./checkpoint"
     params["data_dir"] = "./data/test"
-    params["decoding_model_name"] = "valid.cer_ctc.ave.pth"
+    params["decoding_model_name"] = "valid.cer_ctc.ave.pb"
     modelscope_infer_after_finetune(params)

@@ -41,7 +41,7 @@ The decoding results can be found in `$output_dir/1best_recog/text.cer`, which i
 - Modify inference related parameters in `infer_after_finetune.py`
     - <strong>output_dir:</strong> # result dir
     - <strong>data_dir:</strong> # the dataset dir needs to include `test/wav.scp`. If `test/text` also exists, the CER will be computed
-    - <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pth`
+    - <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pb`

 - Then you can run the pipeline to finetune with:
 ```python

@@ -74,7 +74,7 @@ def modelscope_infer(params):
     # If text exists, compute CER
     text_in = os.path.join(params["data_dir"], "text")
     if os.path.exists(text_in):
-        text_proc_file = os.path.join(best_recog_path, "token")
+        text_proc_file = os.path.join(best_recog_path, "text")
         compute_wer(text_in, text_proc_file, os.path.join(best_recog_path, "text.cer"))

@@ -38,7 +38,7 @@ def modelscope_infer_after_finetune(params):
     # compute CER if GT text is set
     text_in = os.path.join(params["data_dir"], "text")
     if os.path.exists(text_in):
-        text_proc_file = os.path.join(decoding_path, "1best_recog/token")
+        text_proc_file = os.path.join(decoding_path, "1best_recog/text")
         compute_wer(text_in, text_proc_file, os.path.join(decoding_path, "text.cer"))

@@ -48,5 +48,5 @@ if __name__ == '__main__':
     params["required_files"] = ["am.mvn", "decoding.yaml", "configuration.json"]
     params["output_dir"] = "./checkpoint"
     params["data_dir"] = "./data/test"
-    params["decoding_model_name"] = "valid.cer_ctc.ave.pth"
+    params["decoding_model_name"] = "valid.cer_ctc.ave.pb"
     modelscope_infer_after_finetune(params)

@@ -41,7 +41,7 @@ The decoding results can be found in `$output_dir/1best_recog/text.sp.cer` and `
 - Modify inference related parameters in `infer_after_finetune.py`
     - <strong>output_dir:</strong> # result dir
     - <strong>data_dir:</strong> # the dataset dir needs to include `test/wav.scp`. If `test/text` also exists, the CER will be computed
-    - <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pth`
+    - <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pb`

 - Then you can run the pipeline to finetune with:
 ```python

@@ -63,5 +63,5 @@ if __name__ == '__main__':
     params["required_files"] = ["feats_stats.npz", "decoding.yaml", "configuration.json"]
     params["output_dir"] = "./checkpoint"
     params["data_dir"] = "./example_data/validation"
-    params["decoding_model_name"] = "valid.acc.ave.pth"
+    params["decoding_model_name"] = "valid.acc.ave.pb"
     modelscope_infer_after_finetune(params)
@@ -1,30 +0,0 @@
# ModelScope Model

## How to finetune and infer using a pretrained Paraformer-large Model

### Finetune

- Modify finetune training related parameters in `finetune.py`
    - <strong>output_dir:</strong> # result dir
    - <strong>data_dir:</strong> # the dataset dir needs to include files: train/wav.scp, train/text; validation/wav.scp, validation/text
    - <strong>batch_bins:</strong> # batch size
    - <strong>max_epoch:</strong> # number of training epochs
    - <strong>lr:</strong> # learning rate

- Then you can run the pipeline to finetune with:
```python
python finetune.py
```

### Inference

Or you can use the finetuned model for inference directly.

- Setting parameters in `infer.py`
    - <strong>data_dir:</strong> # the dataset dir
    - <strong>output_dir:</strong> # result dir

- Then you can run the pipeline to infer with:
```python
python infer.py
```
@@ -1,23 +0,0 @@
# Paraformer-Large
- Model link: <https://www.modelscope.cn/models/damo/speech_paraformer-large_asr_nat-zh-cn-16k-aishell1-vocab8404-pytorch/summary>
- Model size: 220M

# Environments
- date: `Fri Feb 10 13:34:24 CST 2023`
- python version: `3.7.12`
- FunASR version: `0.1.6`
- pytorch version: `pytorch 1.7.0`
- Git hash: ``
- Commit date: ``

# Benchmark Results

## AISHELL-1
- Decode config:
  - Decode without CTC
  - Decode without LM

| testset | base model CER(%) | finetuned model CER(%) |
|:-------:|:-----------------:|:----------------------:|
| dev     | 1.75              | 1.62                   |
| test    | 1.95              | 1.78                   |
@@ -1,36 +0,0 @@
import os

from modelscope.metainfo import Trainers
from modelscope.trainers import build_trainer

from funasr.datasets.ms_dataset import MsDataset
from funasr.utils.modelscope_param import modelscope_args


def modelscope_finetune(params):
    if not os.path.exists(params.output_dir):
        os.makedirs(params.output_dir, exist_ok=True)
    # dataset split ["train", "validation"]
    ds_dict = MsDataset.load(params.data_path)
    kwargs = dict(
        model=params.model,
        data_dir=ds_dict,
        dataset_type=params.dataset_type,
        work_dir=params.output_dir,
        batch_bins=params.batch_bins,
        max_epoch=params.max_epoch,
        lr=params.lr)
    trainer = build_trainer(Trainers.speech_asr_trainer, default_args=kwargs)
    trainer.train()


if __name__ == '__main__':
    params = modelscope_args(model="damo/speech_paraformer-large_asr_nat-zh-cn-16k-aishell1-vocab8404-pytorch", data_path="./data")
    params.output_dir = "./checkpoint"  # model save path
    params.data_path = "./example_data/"  # data path
    params.dataset_type = "small"  # use "small" for small datasets; if the data exceeds 1000 hours, use "large"
    params.batch_bins = 2000  # batch size: in fbank feature frames if dataset_type="small", in milliseconds if dataset_type="large"
    params.max_epoch = 50  # maximum number of training epochs
    params.lr = 0.00005  # learning rate

    modelscope_finetune(params)
@@ -1,88 +0,0 @@
import os
import shutil
from multiprocessing import Pool

from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

from funasr.utils.compute_wer import compute_wer


def modelscope_infer_core(output_dir, split_dir, njob, idx):
    output_dir_job = os.path.join(output_dir, "output.{}".format(idx))
    gpu_id = (int(idx) - 1) // njob
    if "CUDA_VISIBLE_DEVICES" in os.environ.keys():
        gpu_list = os.environ['CUDA_VISIBLE_DEVICES'].split(",")
        os.environ['CUDA_VISIBLE_DEVICES'] = str(gpu_list[gpu_id])
    else:
        os.environ['CUDA_VISIBLE_DEVICES'] = str(gpu_id)
    inference_pipline = pipeline(
        task=Tasks.auto_speech_recognition,
        model="damo/speech_paraformer-large_asr_nat-zh-cn-16k-aishell1-vocab8404-pytorch",
        output_dir=output_dir_job,
        batch_size=64
    )
    audio_in = os.path.join(split_dir, "wav.{}.scp".format(idx))
    inference_pipline(audio_in=audio_in)


def modelscope_infer(params):
    # prepare for multi-GPU decoding
    ngpu = params["ngpu"]
    njob = params["njob"]
    output_dir = params["output_dir"]
    if os.path.exists(output_dir):
        shutil.rmtree(output_dir)
    os.mkdir(output_dir)
    split_dir = os.path.join(output_dir, "split")
    os.mkdir(split_dir)
    nj = ngpu * njob
    wav_scp_file = os.path.join(params["data_dir"], "wav.scp")
    with open(wav_scp_file) as f:
        lines = f.readlines()
        num_lines = len(lines)
        num_job_lines = num_lines // nj
    start = 0
    for i in range(nj):
        end = start + num_job_lines
        file = os.path.join(split_dir, "wav.{}.scp".format(str(i + 1)))
        with open(file, "w") as f:
            if i == nj - 1:
                f.writelines(lines[start:])
            else:
                f.writelines(lines[start:end])
        start = end

    p = Pool(nj)
    for i in range(nj):
        p.apply_async(modelscope_infer_core,
                      args=(output_dir, split_dir, njob, str(i + 1)))
    p.close()
    p.join()

    # combine decoding results
    best_recog_path = os.path.join(output_dir, "1best_recog")
    os.mkdir(best_recog_path)
    files = ["text", "token", "score"]
    for file in files:
        with open(os.path.join(best_recog_path, file), "w") as f:
            for i in range(nj):
                job_file = os.path.join(output_dir, "output.{}/1best_recog".format(str(i + 1)), file)
                with open(job_file) as f_job:
                    lines = f_job.readlines()
                f.writelines(lines)

    # If text exists, compute CER
    text_in = os.path.join(params["data_dir"], "text")
    if os.path.exists(text_in):
        text_proc_file = os.path.join(best_recog_path, "token")
        compute_wer(text_in, text_proc_file, os.path.join(best_recog_path, "text.cer"))


if __name__ == "__main__":
    params = {}
    params["data_dir"] = "./data/test"
    params["output_dir"] = "./results"
    params["ngpu"] = 1
    params["njob"] = 1
    modelscope_infer(params)
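In the deleted script above, `gpu_id = (int(idx) - 1) // njob` spreads the decoding jobs evenly across GPUs; a quick illustration:

```python
njob = 2  # jobs per GPU
for idx in range(1, 7):
    print(idx, (idx - 1) // njob)  # jobs 1-2 -> GPU 0, 3-4 -> GPU 1, 5-6 -> GPU 2
```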
@@ -1,53 +0,0 @@
import json
import os
import shutil

from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

from funasr.utils.compute_wer import compute_wer


def modelscope_infer_after_finetune(params):
    # prepare for decoding
    pretrained_model_path = os.path.join(os.environ["HOME"], ".cache/modelscope/hub", params["modelscope_model_name"])
    for file_name in params["required_files"]:
        if file_name == "configuration.json":
            with open(os.path.join(pretrained_model_path, file_name)) as f:
                config_dict = json.load(f)
                config_dict["model"]["am_model_name"] = params["decoding_model_name"]
            with open(os.path.join(params["output_dir"], "configuration.json"), "w") as f:
                json.dump(config_dict, f, indent=4, separators=(',', ': '))
        else:
            shutil.copy(os.path.join(pretrained_model_path, file_name),
                        os.path.join(params["output_dir"], file_name))
    decoding_path = os.path.join(params["output_dir"], "decode_results")
    if os.path.exists(decoding_path):
        shutil.rmtree(decoding_path)
    os.mkdir(decoding_path)

    # decoding
    inference_pipeline = pipeline(
        task=Tasks.auto_speech_recognition,
        model=params["output_dir"],
        output_dir=decoding_path,
        batch_size=64
    )
    audio_in = os.path.join(params["data_dir"], "wav.scp")
    inference_pipeline(audio_in=audio_in)

    # compute CER if GT text is set
    text_in = os.path.join(params["data_dir"], "text")
    if os.path.exists(text_in):
        text_proc_file = os.path.join(decoding_path, "1best_recog/token")
        compute_wer(text_in, text_proc_file, os.path.join(decoding_path, "text.cer"))


if __name__ == '__main__':
    params = {}
    params["modelscope_model_name"] = "damo/speech_paraformer-large_asr_nat-zh-cn-16k-aishell1-vocab8404-pytorch"
    params["required_files"] = ["am.mvn", "decoding.yaml", "configuration.json"]
    params["output_dir"] = "./checkpoint"
    params["data_dir"] = "./data/test"
    params["decoding_model_name"] = "valid.acc.ave_10best.pth"
    modelscope_infer_after_finetune(params)
@@ -1,30 +0,0 @@
# ModelScope Model

## How to finetune and infer using a pretrained Paraformer-large Model

### Finetune

- Modify finetune training related parameters in `finetune.py`
    - <strong>output_dir:</strong> # result dir
    - <strong>data_dir:</strong> # the dataset dir needs to include files: train/wav.scp, train/text; validation/wav.scp, validation/text
    - <strong>batch_bins:</strong> # batch size
    - <strong>max_epoch:</strong> # number of training epochs
    - <strong>lr:</strong> # learning rate

- Then you can run the pipeline to finetune with:
```python
python finetune.py
```

### Inference

Or you can use the finetuned model for inference directly.

- Setting parameters in `infer.py`
    - <strong>data_dir:</strong> # the dataset dir
    - <strong>output_dir:</strong> # result dir

- Then you can run the pipeline to infer with:
```python
python infer.py
```
@@ -1,25 +0,0 @@
# Paraformer-Large
- Model link: <https://www.modelscope.cn/models/damo/speech_paraformer-large_asr_nat-zh-cn-16k-aishell2-vocab8404-pytorch/summary>
- Model size: 220M

# Environments
- date: `Fri Feb 10 13:34:24 CST 2023`
- python version: `3.7.12`
- FunASR version: `0.1.6`
- pytorch version: `pytorch 1.7.0`
- Git hash: ``
- Commit date: ``

# Benchmark Results

## AISHELL-2
- Decode config:
  - Decode without CTC
  - Decode without LM

| testset      | base model CER(%) | finetuned model CER(%) |
|:------------:|:-----------------:|:----------------------:|
| dev_ios      | 2.80              | 2.60                   |
| test_android | 3.13              | 2.84                   |
| test_ios     | 2.85              | 2.82                   |
| test_mic     | 3.06              | 2.88                   |
@@ -1,36 +0,0 @@
import os

from modelscope.metainfo import Trainers
from modelscope.trainers import build_trainer

from funasr.datasets.ms_dataset import MsDataset
from funasr.utils.modelscope_param import modelscope_args


def modelscope_finetune(params):
    if not os.path.exists(params.output_dir):
        os.makedirs(params.output_dir, exist_ok=True)
    # dataset split ["train", "validation"]
    ds_dict = MsDataset.load(params.data_path)
    kwargs = dict(
        model=params.model,
        data_dir=ds_dict,
        dataset_type=params.dataset_type,
        work_dir=params.output_dir,
        batch_bins=params.batch_bins,
        max_epoch=params.max_epoch,
        lr=params.lr)
    trainer = build_trainer(Trainers.speech_asr_trainer, default_args=kwargs)
    trainer.train()


if __name__ == '__main__':
    params = modelscope_args(model="damo/speech_paraformer-large_asr_nat-zh-cn-16k-aishell2-vocab8404-pytorch", data_path="./data")
    params.output_dir = "./checkpoint"  # model save path
    params.data_path = "./example_data/"  # data path
    params.dataset_type = "small"  # use "small" for small datasets; if the data exceeds 1000 hours, use "large"
    params.batch_bins = 2000  # batch size: in fbank feature frames if dataset_type="small", in milliseconds if dataset_type="large"
    params.max_epoch = 50  # maximum number of training epochs
    params.lr = 0.00005  # learning rate

    modelscope_finetune(params)
@ -1,88 +0,0 @@
import os
import shutil
from multiprocessing import Pool

from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

from funasr.utils.compute_wer import compute_wer


def modelscope_infer_core(output_dir, split_dir, njob, idx):
    output_dir_job = os.path.join(output_dir, "output.{}".format(idx))
    gpu_id = (int(idx) - 1) // njob
    if "CUDA_VISIBLE_DEVICES" in os.environ.keys():
        gpu_list = os.environ['CUDA_VISIBLE_DEVICES'].split(",")
        os.environ['CUDA_VISIBLE_DEVICES'] = str(gpu_list[gpu_id])
    else:
        os.environ['CUDA_VISIBLE_DEVICES'] = str(gpu_id)
    inference_pipline = pipeline(
        task=Tasks.auto_speech_recognition,
        model="damo/speech_paraformer-large_asr_nat-zh-cn-16k-aishell2-vocab8404-pytorch",
        output_dir=output_dir_job,
        batch_size=64
    )
    audio_in = os.path.join(split_dir, "wav.{}.scp".format(idx))
    inference_pipline(audio_in=audio_in)


def modelscope_infer(params):
    # prepare for multi-GPU decoding
    ngpu = params["ngpu"]
    njob = params["njob"]
    output_dir = params["output_dir"]
    if os.path.exists(output_dir):
        shutil.rmtree(output_dir)
    os.mkdir(output_dir)
    split_dir = os.path.join(output_dir, "split")
    os.mkdir(split_dir)
    nj = ngpu * njob
    wav_scp_file = os.path.join(params["data_dir"], "wav.scp")
    with open(wav_scp_file) as f:
        lines = f.readlines()
        num_lines = len(lines)
        num_job_lines = num_lines // nj
    start = 0
    for i in range(nj):
        end = start + num_job_lines
        file = os.path.join(split_dir, "wav.{}.scp".format(str(i + 1)))
        with open(file, "w") as f:
            if i == nj - 1:
                f.writelines(lines[start:])
            else:
                f.writelines(lines[start:end])
        start = end

    p = Pool(nj)
    for i in range(nj):
        p.apply_async(modelscope_infer_core,
                      args=(output_dir, split_dir, njob, str(i + 1)))
    p.close()
    p.join()

    # combine decoding results
    best_recog_path = os.path.join(output_dir, "1best_recog")
    os.mkdir(best_recog_path)
    files = ["text", "token", "score"]
    for file in files:
        with open(os.path.join(best_recog_path, file), "w") as f:
            for i in range(nj):
                job_file = os.path.join(output_dir, "output.{}/1best_recog".format(str(i + 1)), file)
                with open(job_file) as f_job:
                    lines = f_job.readlines()
                f.writelines(lines)

    # If text exists, compute CER
    text_in = os.path.join(params["data_dir"], "text")
    if os.path.exists(text_in):
        text_proc_file = os.path.join(best_recog_path, "token")
        compute_wer(text_in, text_proc_file, os.path.join(best_recog_path, "text.cer"))


if __name__ == "__main__":
    params = {}
    params["data_dir"] = "./data/test"
    params["output_dir"] = "./results"
    params["ngpu"] = 1
    params["njob"] = 1
    modelscope_infer(params)
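The job-to-GPU assignment above is easy to misread, so here is a quick sanity check of the `gpu_id = (int(idx) - 1) // njob` rule from `modelscope_infer_core`; it is pure illustration and needs no FunASR imports.

```python
# With ngpu GPUs and njob jobs per GPU, 1-based job idx runs on GPU (idx - 1) // njob.
ngpu, njob = 2, 2
for idx in range(1, ngpu * njob + 1):
    print(f"job {idx} -> GPU {(idx - 1) // njob}")
# job 1 -> GPU 0
# job 2 -> GPU 0
# job 3 -> GPU 1
# job 4 -> GPU 1
```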
@ -1,53 +0,0 @@
import json
import os
import shutil

from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

from funasr.utils.compute_wer import compute_wer


def modelscope_infer_after_finetune(params):
    # prepare for decoding
    pretrained_model_path = os.path.join(os.environ["HOME"], ".cache/modelscope/hub", params["modelscope_model_name"])
    for file_name in params["required_files"]:
        if file_name == "configuration.json":
            with open(os.path.join(pretrained_model_path, file_name)) as f:
                config_dict = json.load(f)
                config_dict["model"]["am_model_name"] = params["decoding_model_name"]
            with open(os.path.join(params["output_dir"], "configuration.json"), "w") as f:
                json.dump(config_dict, f, indent=4, separators=(',', ': '))
        else:
            shutil.copy(os.path.join(pretrained_model_path, file_name),
                        os.path.join(params["output_dir"], file_name))
    decoding_path = os.path.join(params["output_dir"], "decode_results")
    if os.path.exists(decoding_path):
        shutil.rmtree(decoding_path)
    os.mkdir(decoding_path)

    # decoding
    inference_pipeline = pipeline(
        task=Tasks.auto_speech_recognition,
        model=params["output_dir"],
        output_dir=decoding_path,
        batch_size=64
    )
    audio_in = os.path.join(params["data_dir"], "wav.scp")
    inference_pipeline(audio_in=audio_in)

    # compute CER if GT text is set
    text_in = os.path.join(params["data_dir"], "text")
    if os.path.exists(text_in):
        text_proc_file = os.path.join(decoding_path, "1best_recog/token")
        compute_wer(text_in, text_proc_file, os.path.join(decoding_path, "text.cer"))


if __name__ == '__main__':
    params = {}
    params["modelscope_model_name"] = "damo/speech_paraformer-large_asr_nat-zh-cn-16k-aishell2-vocab8404-pytorch"
    params["required_files"] = ["am.mvn", "decoding.yaml", "configuration.json"]
    params["output_dir"] = "./checkpoint"
    params["data_dir"] = "./data/test"
    params["decoding_model_name"] = "valid.acc.ave_10best.pth"
    modelscope_infer_after_finetune(params)
@ -21,27 +21,34 @@

Or you can use the finetuned model for inference directly.

- Setting parameters in `infer.py`
- Setting parameters in `infer.sh`
    - <strong>model:</strong> # model name on ModelScope
    - <strong>data_dir:</strong> # the dataset dir needs to include `test/wav.scp`. If `test/text` also exists, CER will be computed
    - <strong>output_dir:</strong> # result dir
    - <strong>ngpu:</strong> # the number of GPUs for decoding
    - <strong>njob:</strong> # the number of jobs for each GPU
    - <strong>batch_size:</strong> # batch size of inference
    - <strong>gpu_inference:</strong> # whether to perform gpu decoding; set false for cpu decoding
    - <strong>gpuid_list:</strong> # set gpus, e.g., gpuid_list="0,1"
    - <strong>njob:</strong> # the number of jobs for CPU decoding; if `gpu_inference`=false, CPU decoding is used, so set `njob`

- Then you can run the pipeline to infer with:
```python
python infer.py
sh infer.sh
```

- Results

The decoding results can be found in `$output_dir/1best_recog/text.cer`, which includes the recognition result of each sample and the CER metric of the whole test set.

If you decode the SpeechIO test sets, you can apply text normalization with `stage`=3; `DETAILS.txt` and `RESULTS.txt` then record the results and the CER after text normalization.
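To pull just the aggregate score out of that file, a small sketch follows. It assumes, as the `tail -n 3` step in the `infer.sh` shown later suggests, that per-sample results come first and the overall CER summary sits in the last few lines of `text.cer`; the path is illustrative.

```python
# Print the CER summary block at the end of text.cer (assumed file layout:
# per-sample results first, overall statistics in the last lines).
from pathlib import Path

cer_file = Path("./results/1best_recog/text.cer")
print("".join(cer_file.read_text(encoding="utf-8").splitlines(keepends=True)[-3:]))
```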

### Inference using local finetuned model

- Modify the inference-related parameters in `infer_after_finetune.py`
    - <strong>modelscope_model_name:</strong> # model name on ModelScope
    - <strong>output_dir:</strong> # result dir
    - <strong>data_dir:</strong> # the dataset dir needs to include `test/wav.scp`. If `test/text` also exists, CER will be computed
    - <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pth`
    - <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pb`
    - <strong>batch_size:</strong> # batch size of inference

- Then you can run the pipeline to infer with:
```python

@ -17,22 +17,22 @@
    - Decode without CTC
    - Decode without LM

| testset | CER(%) |
|:-------:|:------:|
| dev     | 1.75   |
| test    | 1.95   |
| CER(%) | Pretrain model | [Finetune model](https://www.modelscope.cn/models/damo/speech_paraformer-large_asr_nat-zh-cn-16k-aishell1-vocab8404-pytorch/summary) |
|:------:|:--------------:|:-------------:|
| dev    | 1.75           | 1.62          |
| test   | 1.95           | 1.78          |

## AISHELL-2
- Decode config:
    - Decode without CTC
    - Decode without LM

| testset      | CER(%) |
|:------------:|:------:|
| dev_ios      | 2.80   |
| test_android | 3.13   |
| test_ios     | 2.85   |
| test_mic     | 3.06   |
| CER(%)       | Pretrain model | [Finetune model](https://www.modelscope.cn/models/damo/speech_paraformer-large_asr_nat-zh-cn-16k-aishell2-vocab8404-pytorch/summary) |
|:------------:|:--------------:|:------------:|
| dev_ios      | 2.80           | 2.60         |
| test_android | 3.13           | 2.84         |
| test_ios     | 2.85           | 2.82         |
| test_mic     | 3.06           | 2.88         |

## Wenetspeech
- Decode config:

@ -1,88 +1,25 @@
import os
import shutil
from multiprocessing import Pool

import argparse
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

from funasr.utils.compute_wer import compute_wer


def modelscope_infer_core(output_dir, split_dir, njob, idx):
output_dir_job = os.path.join(output_dir, "output.{}".format(idx))
gpu_id = (int(idx) - 1) // njob
if "CUDA_VISIBLE_DEVICES" in os.environ.keys():
gpu_list = os.environ['CUDA_VISIBLE_DEVICES'].split(",")
os.environ['CUDA_VISIBLE_DEVICES'] = str(gpu_list[gpu_id])
else:
os.environ['CUDA_VISIBLE_DEVICES'] = str(gpu_id)
inference_pipline = pipeline(
def modelscope_infer(args):
os.environ['CUDA_VISIBLE_DEVICES'] = str(args.gpuid)
inference_pipeline = pipeline(
task=Tasks.auto_speech_recognition,
model="damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch",
output_dir=output_dir_job,
batch_size=64
model=args.model,
output_dir=args.output_dir,
batch_size=args.batch_size,
)
audio_in = os.path.join(split_dir, "wav.{}.scp".format(idx))
inference_pipline(audio_in=audio_in)


def modelscope_infer(params):
# prepare for multi-GPU decoding
ngpu = params["ngpu"]
njob = params["njob"]
output_dir = params["output_dir"]
if os.path.exists(output_dir):
shutil.rmtree(output_dir)
os.mkdir(output_dir)
split_dir = os.path.join(output_dir, "split")
os.mkdir(split_dir)
nj = ngpu * njob
wav_scp_file = os.path.join(params["data_dir"], "wav.scp")
with open(wav_scp_file) as f:
lines = f.readlines()
num_lines = len(lines)
num_job_lines = num_lines // nj
start = 0
for i in range(nj):
end = start + num_job_lines
file = os.path.join(split_dir, "wav.{}.scp".format(str(i + 1)))
with open(file, "w") as f:
if i == nj - 1:
f.writelines(lines[start:])
else:
f.writelines(lines[start:end])
start = end

p = Pool(nj)
for i in range(nj):
p.apply_async(modelscope_infer_core,
args=(output_dir, split_dir, njob, str(i + 1)))
p.close()
p.join()

# combine decoding results
best_recog_path = os.path.join(output_dir, "1best_recog")
os.mkdir(best_recog_path)
files = ["text", "token", "score"]
for file in files:
with open(os.path.join(best_recog_path, file), "w") as f:
for i in range(nj):
job_file = os.path.join(output_dir, "output.{}/1best_recog".format(str(i + 1)), file)
with open(job_file) as f_job:
lines = f_job.readlines()
f.writelines(lines)

# If text exists, compute CER
text_in = os.path.join(params["data_dir"], "text")
if os.path.exists(text_in):
text_proc_file = os.path.join(best_recog_path, "token")
compute_wer(text_in, text_proc_file, os.path.join(best_recog_path, "text.cer"))

inference_pipeline(audio_in=args.audio_in)

if __name__ == "__main__":
params = {}
params["data_dir"] = "./data/test"
params["output_dir"] = "./results"
params["ngpu"] = 1
params["njob"] = 1
modelscope_infer(params)
parser = argparse.ArgumentParser()
parser.add_argument('--model', type=str, default="damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch")
parser.add_argument('--audio_in', type=str, default="./data/test/wav.scp")
parser.add_argument('--output_dir', type=str, default="./results/")
parser.add_argument('--batch_size', type=int, default=64)
parser.add_argument('--gpuid', type=str, default="0")
args = parser.parse_args()
modelscope_infer(args)
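The rewritten `infer.py` can also be launched on its own for a single job from the shell; the flags below simply mirror its argparse definitions above, and the `output.1` directory name is illustrative (it matches the per-job layout that `infer.sh` creates).

```python
python infer.py \
    --model "damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch" \
    --audio_in ./data/test/wav.scp \
    --output_dir ./results/output.1 \
    --batch_size 64 \
    --gpuid 0
```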
@ -0,0 +1,95 @@
#!/usr/bin/env bash

set -e
set -u
set -o pipefail

stage=1
stop_stage=2
model="damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch"
data_dir="./data/test"
output_dir="./results"
batch_size=64
gpu_inference=true    # whether to perform gpu decoding
gpuid_list="0,1"      # set gpus, e.g., gpuid_list="0,1"
njob=4                # the number of jobs for CPU decoding, if gpu_inference=false, use CPU decoding, please set njob


if ${gpu_inference}; then
    nj=$(echo $gpuid_list | awk -F "," '{print NF}')
else
    nj=$njob
    batch_size=1
    gpuid_list=""
    for JOB in $(seq ${nj}); do
        gpuid_list=$gpuid_list"-1,"
    done
fi

mkdir -p $output_dir/split
split_scps=""
for JOB in $(seq ${nj}); do
    split_scps="$split_scps $output_dir/split/wav.$JOB.scp"
done
perl utils/split_scp.pl ${data_dir}/wav.scp ${split_scps}

if [ $stage -le 1 ] && [ $stop_stage -ge 1 ];then
    echo "Decoding ..."
    gpuid_list_array=(${gpuid_list//,/ })
    for JOB in $(seq ${nj}); do
        {
            id=$((JOB-1))
            gpuid=${gpuid_list_array[$id]}
            mkdir -p ${output_dir}/output.$JOB
            python infer.py \
                --model ${model} \
                --audio_in ${output_dir}/split/wav.$JOB.scp \
                --output_dir ${output_dir}/output.$JOB \
                --batch_size ${batch_size} \
                --gpuid ${gpuid}
        }&
    done
    wait

    mkdir -p ${output_dir}/1best_recog
    for f in token score text; do
        if [ -f "${output_dir}/output.1/1best_recog/${f}" ]; then
            for i in $(seq "${nj}"); do
                cat "${output_dir}/output.${i}/1best_recog/${f}"
            done | sort -k1 >"${output_dir}/1best_recog/${f}"
        fi
    done
fi

if [ $stage -le 2 ] && [ $stop_stage -ge 2 ];then
    echo "Computing WER ..."
    cp ${output_dir}/1best_recog/text ${output_dir}/1best_recog/text.proc
    cp ${data_dir}/text ${output_dir}/1best_recog/text.ref
    python utils/compute_wer.py ${output_dir}/1best_recog/text.ref ${output_dir}/1best_recog/text.proc ${output_dir}/1best_recog/text.cer
    tail -n 3 ${output_dir}/1best_recog/text.cer
fi

if [ $stage -le 3 ] && [ $stop_stage -ge 3 ];then
    echo "SpeechIO TIOBE textnorm"
    echo "$0 --> Normalizing REF text ..."
    ./utils/textnorm_zh.py \
        --has_key --to_upper \
        ${data_dir}/text \
        ${output_dir}/1best_recog/ref.txt

    echo "$0 --> Normalizing HYP text ..."
    ./utils/textnorm_zh.py \
        --has_key --to_upper \
        ${output_dir}/1best_recog/text.proc \
        ${output_dir}/1best_recog/rec.txt
    grep -v $'\t$' ${output_dir}/1best_recog/rec.txt > ${output_dir}/1best_recog/rec_non_empty.txt

    echo "$0 --> computing WER/CER and alignment ..."
    ./utils/error_rate_zh \
        --tokenizer char \
        --ref ${output_dir}/1best_recog/ref.txt \
        --hyp ${output_dir}/1best_recog/rec_non_empty.txt \
        ${output_dir}/1best_recog/DETAILS.txt | tee ${output_dir}/1best_recog/RESULTS.txt
    rm -rf ${output_dir}/1best_recog/rec.txt ${output_dir}/1best_recog/rec_non_empty.txt
fi

@ -4,23 +4,18 @@ import shutil

from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks
from modelscope.hub.snapshot_download import snapshot_download

from funasr.utils.compute_wer import compute_wer


def modelscope_infer_after_finetune(params):
# prepare for decoding
pretrained_model_path = os.path.join(os.environ["HOME"], ".cache/modelscope/hub", params["modelscope_model_name"])
for file_name in params["required_files"]:
if file_name == "configuration.json":
with open(os.path.join(pretrained_model_path, file_name)) as f:
config_dict = json.load(f)
config_dict["model"]["am_model_name"] = params["decoding_model_name"]
with open(os.path.join(params["output_dir"], "configuration.json"), "w") as f:
json.dump(config_dict, f, indent=4, separators=(',', ': '))
else:
shutil.copy(os.path.join(pretrained_model_path, file_name),
os.path.join(params["output_dir"], file_name))

try:
pretrained_model_path = snapshot_download(params["modelscope_model_name"], cache_dir=params["output_dir"])
except BaseException:
raise BaseException(f"Please download pretrain model from ModelScope firstly.")
shutil.copy(os.path.join(params["output_dir"], params["decoding_model_name"]), os.path.join(pretrained_model_path, "model.pb"))
decoding_path = os.path.join(params["output_dir"], "decode_results")
if os.path.exists(decoding_path):
shutil.rmtree(decoding_path)
@ -29,9 +24,9 @@ def modelscope_infer_after_finetune(params):
# decoding
inference_pipeline = pipeline(
task=Tasks.auto_speech_recognition,
model=params["output_dir"],
model=pretrained_model_path,
output_dir=decoding_path,
batch_size=64
batch_size=params["batch_size"]
)
audio_in = os.path.join(params["data_dir"], "wav.scp")
inference_pipeline(audio_in=audio_in)
@ -39,15 +34,15 @@ def modelscope_infer_after_finetune(params):
# compute CER if GT text is set
text_in = os.path.join(params["data_dir"], "text")
if os.path.exists(text_in):
text_proc_file = os.path.join(decoding_path, "1best_recog/token")
text_proc_file = os.path.join(decoding_path, "1best_recog/text")
compute_wer(text_in, text_proc_file, os.path.join(decoding_path, "text.cer"))


if __name__ == '__main__':
params = {}
params["modelscope_model_name"] = "damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch"
params["required_files"] = ["am.mvn", "decoding.yaml", "configuration.json"]
params["output_dir"] = "./checkpoint"
params["data_dir"] = "./data/test"
params["decoding_model_name"] = "valid.acc.ave_10best.pth"
modelscope_infer_after_finetune(params)
params["decoding_model_name"] = "valid.acc.ave_10best.pb"
params["batch_size"] = 64
modelscope_infer_after_finetune(params)
@ -0,0 +1 @@
../../../../egs/aishell/transformer/utils
@ -0,0 +1,37 @@
import os
import logging
import torch
import soundfile

from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks
from modelscope.utils.logger import get_logger

logger = get_logger(log_level=logging.CRITICAL)
logger.setLevel(logging.CRITICAL)

os.environ["MODELSCOPE_CACHE"] = "./"
inference_pipeline = pipeline(
    task=Tasks.auto_speech_recognition,
    model='damo/speech_paraformer_asr_nat-zh-cn-16k-common-vocab8404-online',
    model_revision='v1.0.2')

model_dir = os.path.join(os.environ["MODELSCOPE_CACHE"], "damo/speech_paraformer_asr_nat-zh-cn-16k-common-vocab8404-online")
speech, sample_rate = soundfile.read(os.path.join(model_dir, "example/asr_example.wav"))
speech_length = speech.shape[0]

sample_offset = 0
step = 4800  # 300 ms per chunk at 16 kHz
param_dict = {"cache": dict(), "is_final": False}
final_result = ""

for sample_offset in range(0, speech_length, min(step, speech_length - sample_offset)):
    if sample_offset + step >= speech_length - 1:
        step = speech_length - sample_offset
        param_dict["is_final"] = True
    rec_result = inference_pipeline(audio_in=speech[sample_offset: sample_offset + step],
                                    param_dict=param_dict)
    if len(rec_result) != 0 and rec_result['text'] != "sil" and rec_result['text'] != "waiting_for_more_voice":
        final_result += rec_result['text']
        print(rec_result)
print(final_result)
@ -6,8 +6,9 @@

- Modify the finetune-related training parameters in `finetune.py`
    - <strong>output_dir:</strong> # result dir
    - <strong>data_dir:</strong> # the dataset dir needs to include files: train/wav.scp, train/text; validation/wav.scp, validation/text.
    - <strong>batch_bins:</strong> # batch size
    - <strong>data_dir:</strong> # the dataset dir needs to include the files `train/wav.scp`, `train/text`, `validation/wav.scp`, and `validation/text`
    - <strong>dataset_type:</strong> # for a dataset larger than 1000 hours, set `large`; otherwise set `small`
    - <strong>batch_bins:</strong> # batch size. When dataset_type is `small`, `batch_bins` counts fbank feature frames; when dataset_type is `large`, it is the total duration in ms (see the sketch after this list)
    - <strong>max_epoch:</strong> # number of training epochs
    - <strong>lr:</strong> # learning rate

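A back-of-the-envelope reading of those `batch_bins` units; this sketch assumes the standard 10 ms fbank frame shift, so treat the numbers as illustrative rather than exact.

```python
# Rough audio seconds per batch implied by batch_bins under each dataset_type.
def approx_batch_seconds(batch_bins: int, dataset_type: str) -> float:
    if dataset_type == "small":    # batch_bins = total fbank frames per batch
        return batch_bins * 0.01   # 10 ms per frame (assumed frame shift)
    if dataset_type == "large":    # batch_bins = total duration in ms per batch
        return batch_bins / 1000.0
    raise ValueError(f"unknown dataset_type: {dataset_type}")

print(approx_batch_seconds(2000, "small"))  # 20.0 -> roughly 20 s of audio per batch
```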
@ -20,11 +21,38 @@

Or you can use the finetuned model for inference directly.

- Setting parameters in `infer.py`
    - <strong>data_dir:</strong> # the dataset dir
- Setting parameters in `infer.sh`
    - <strong>model:</strong> # model name on ModelScope
    - <strong>data_dir:</strong> # the dataset dir needs to include `test/wav.scp`. If `test/text` also exists, CER will be computed
    - <strong>output_dir:</strong> # result dir
    - <strong>batch_size:</strong> # batch size of inference
    - <strong>gpu_inference:</strong> # whether to perform gpu decoding; set false for cpu decoding
    - <strong>gpuid_list:</strong> # set gpus, e.g., gpuid_list="0,1"
    - <strong>njob:</strong> # the number of jobs for CPU decoding; if `gpu_inference`=false, CPU decoding is used, so set `njob`

- Then you can run the pipeline to infer with:
```python
python infer.py
sh infer.sh
```

- Results

The decoding results can be found in `$output_dir/1best_recog/text.cer`, which includes the recognition result of each sample and the CER metric of the whole test set.

### Inference using local finetuned model

- Modify the inference-related parameters in `infer_after_finetune.py`
    - <strong>modelscope_model_name:</strong> # model name on ModelScope
    - <strong>output_dir:</strong> # result dir
    - <strong>data_dir:</strong> # the dataset dir needs to include `test/wav.scp`. If `test/text` also exists, CER will be computed
    - <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pb`
    - <strong>batch_size:</strong> # batch size of inference

- Then you can run the pipeline to infer with:
```python
python infer_after_finetune.py
```

- Results

The decoding results can be found in `$output_dir/decoding_results/text.cer`, which includes the recognition result of each sample and the CER metric of the whole test set.

@ -1,88 +1,25 @@
import os
import shutil
from multiprocessing import Pool

import argparse
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

from funasr.utils.compute_wer import compute_wer


def modelscope_infer_core(output_dir, split_dir, njob, idx):
output_dir_job = os.path.join(output_dir, "output.{}".format(idx))
gpu_id = (int(idx) - 1) // njob
if "CUDA_VISIBLE_DEVICES" in os.environ.keys():
gpu_list = os.environ['CUDA_VISIBLE_DEVICES'].split(",")
os.environ['CUDA_VISIBLE_DEVICES'] = str(gpu_list[gpu_id])
else:
os.environ['CUDA_VISIBLE_DEVICES'] = str(gpu_id)
inference_pipline = pipeline(
def modelscope_infer(args):
os.environ['CUDA_VISIBLE_DEVICES'] = str(args.gpuid)
inference_pipeline = pipeline(
task=Tasks.auto_speech_recognition,
model="damo/speech_paraformer_asr_nat-zh-cn-8k-common-vocab8358-tensorflow1",
output_dir=output_dir_job,
batch_size=64
model=args.model,
output_dir=args.output_dir,
batch_size=args.batch_size,
)
audio_in = os.path.join(split_dir, "wav.{}.scp".format(idx))
inference_pipline(audio_in=audio_in)


def modelscope_infer(params):
# prepare for multi-GPU decoding
ngpu = params["ngpu"]
njob = params["njob"]
output_dir = params["output_dir"]
if os.path.exists(output_dir):
shutil.rmtree(output_dir)
os.mkdir(output_dir)
split_dir = os.path.join(output_dir, "split")
os.mkdir(split_dir)
nj = ngpu * njob
wav_scp_file = os.path.join(params["data_dir"], "wav.scp")
with open(wav_scp_file) as f:
lines = f.readlines()
num_lines = len(lines)
num_job_lines = num_lines // nj
start = 0
for i in range(nj):
end = start + num_job_lines
file = os.path.join(split_dir, "wav.{}.scp".format(str(i + 1)))
with open(file, "w") as f:
if i == nj - 1:
f.writelines(lines[start:])
else:
f.writelines(lines[start:end])
start = end

p = Pool(nj)
for i in range(nj):
p.apply_async(modelscope_infer_core,
args=(output_dir, split_dir, njob, str(i + 1)))
p.close()
p.join()

# combine decoding results
best_recog_path = os.path.join(output_dir, "1best_recog")
os.mkdir(best_recog_path)
files = ["text", "token", "score"]
for file in files:
with open(os.path.join(best_recog_path, file), "w") as f:
for i in range(nj):
job_file = os.path.join(output_dir, "output.{}/1best_recog".format(str(i + 1)), file)
with open(job_file) as f_job:
lines = f_job.readlines()
f.writelines(lines)

# If text exists, compute CER
text_in = os.path.join(params["data_dir"], "text")
if os.path.exists(text_in):
text_proc_file = os.path.join(best_recog_path, "token")
compute_wer(text_in, text_proc_file, os.path.join(best_recog_path, "text.cer"))

inference_pipeline(audio_in=args.audio_in)

if __name__ == "__main__":
params = {}
params["data_dir"] = "./data/test"
params["output_dir"] = "./results"
params["ngpu"] = 1
params["njob"] = 1
modelscope_infer(params)
parser = argparse.ArgumentParser()
parser.add_argument('--model', type=str, default="damo/speech_paraformer_asr_nat-zh-cn-8k-common-vocab8358-tensorflow1")
parser.add_argument('--audio_in', type=str, default="./data/test/wav.scp")
parser.add_argument('--output_dir', type=str, default="./results/")
parser.add_argument('--batch_size', type=int, default=64)
parser.add_argument('--gpuid', type=str, default="0")
args = parser.parse_args()
modelscope_infer(args)
@ -0,0 +1,70 @@
#!/usr/bin/env bash

set -e
set -u
set -o pipefail

stage=1
stop_stage=2
model="damo/speech_paraformer_asr_nat-zh-cn-8k-common-vocab8358-tensorflow1"
data_dir="./data/test"
output_dir="./results"
batch_size=64
gpu_inference=true    # whether to perform gpu decoding
gpuid_list="0,1"      # set gpus, e.g., gpuid_list="0,1"
njob=4                # the number of jobs for CPU decoding, if gpu_inference=false, use CPU decoding, please set njob


if ${gpu_inference}; then
    nj=$(echo $gpuid_list | awk -F "," '{print NF}')
else
    nj=$njob
    batch_size=1
    gpuid_list=""
    for JOB in $(seq ${nj}); do
        gpuid_list=$gpuid_list"-1,"
    done
fi

mkdir -p $output_dir/split
split_scps=""
for JOB in $(seq ${nj}); do
    split_scps="$split_scps $output_dir/split/wav.$JOB.scp"
done
perl utils/split_scp.pl ${data_dir}/wav.scp ${split_scps}

if [ $stage -le 1 ] && [ $stop_stage -ge 1 ];then
    echo "Decoding ..."
    gpuid_list_array=(${gpuid_list//,/ })
    for JOB in $(seq ${nj}); do
        {
            id=$((JOB-1))
            gpuid=${gpuid_list_array[$id]}
            mkdir -p ${output_dir}/output.$JOB
            python infer.py \
                --model ${model} \
                --audio_in ${output_dir}/split/wav.$JOB.scp \
                --output_dir ${output_dir}/output.$JOB \
                --batch_size ${batch_size} \
                --gpuid ${gpuid}
        }&
    done
    wait

    mkdir -p ${output_dir}/1best_recog
    for f in token score text; do
        if [ -f "${output_dir}/output.1/1best_recog/${f}" ]; then
            for i in $(seq "${nj}"); do
                cat "${output_dir}/output.${i}/1best_recog/${f}"
            done | sort -k1 >"${output_dir}/1best_recog/${f}"
        fi
    done
fi

if [ $stage -le 2 ] && [ $stop_stage -ge 2 ];then
    echo "Computing WER ..."
    cp ${output_dir}/1best_recog/text ${output_dir}/1best_recog/text.proc
    cp ${data_dir}/text ${output_dir}/1best_recog/text.ref
    python utils/compute_wer.py ${output_dir}/1best_recog/text.ref ${output_dir}/1best_recog/text.proc ${output_dir}/1best_recog/text.cer
    tail -n 3 ${output_dir}/1best_recog/text.cer
fi
@ -4,23 +4,18 @@ import shutil

from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks
from modelscope.hub.snapshot_download import snapshot_download

from funasr.utils.compute_wer import compute_wer


def modelscope_infer_after_finetune(params):
# prepare for decoding
pretrained_model_path = os.path.join(os.environ["HOME"], ".cache/modelscope/hub", params["modelscope_model_name"])
for file_name in params["required_files"]:
if file_name == "configuration.json":
with open(os.path.join(pretrained_model_path, file_name)) as f:
config_dict = json.load(f)
config_dict["model"]["am_model_name"] = params["decoding_model_name"]
with open(os.path.join(params["output_dir"], "configuration.json"), "w") as f:
json.dump(config_dict, f, indent=4, separators=(',', ': '))
else:
shutil.copy(os.path.join(pretrained_model_path, file_name),
os.path.join(params["output_dir"], file_name))

try:
pretrained_model_path = snapshot_download(params["modelscope_model_name"], cache_dir=params["output_dir"])
except BaseException:
raise BaseException(f"Please download pretrain model from ModelScope firstly.")
shutil.copy(os.path.join(params["output_dir"], params["decoding_model_name"]), os.path.join(pretrained_model_path, "model.pb"))
decoding_path = os.path.join(params["output_dir"], "decode_results")
if os.path.exists(decoding_path):
shutil.rmtree(decoding_path)
@ -29,9 +24,9 @@ def modelscope_infer_after_finetune(params):
# decoding
inference_pipeline = pipeline(
task=Tasks.auto_speech_recognition,
model=params["output_dir"],
model=pretrained_model_path,
output_dir=decoding_path,
batch_size=64
batch_size=params["batch_size"]
)
audio_in = os.path.join(params["data_dir"], "wav.scp")
inference_pipeline(audio_in=audio_in)
@ -39,15 +34,15 @@ def modelscope_infer_after_finetune(params):
# compute CER if GT text is set
text_in = os.path.join(params["data_dir"], "text")
if os.path.exists(text_in):
text_proc_file = os.path.join(decoding_path, "1best_recog/token")
text_proc_file = os.path.join(decoding_path, "1best_recog/text")
compute_wer(text_in, text_proc_file, os.path.join(decoding_path, "text.cer"))


if __name__ == '__main__':
params = {}
params["modelscope_model_name"] = "damo/speech_paraformer_asr_nat-zh-cn-8k-common-vocab8358-tensorflow1"
params["required_files"] = ["am.mvn", "decoding.yaml", "configuration.json"]
params["output_dir"] = "./checkpoint"
params["data_dir"] = "./data/test"
params["decoding_model_name"] = "valid.acc.ave_10best.pth"
modelscope_infer_after_finetune(params)
params["decoding_model_name"] = "valid.acc.ave_10best.pb"
params["batch_size"] = 64
modelscope_infer_after_finetune(params)
@ -0,0 +1 @@
../../../../egs/aishell/transformer/utils
@ -41,7 +41,7 @@ The decoding results can be found in `$output_dir/1best_recog/text.cer`, which i
- Modify the inference-related parameters in `infer_after_finetune.py`
    - <strong>output_dir:</strong> # result dir
    - <strong>data_dir:</strong> # the dataset dir needs to include `test/wav.scp`. If `test/text` also exists, CER will be computed
    - <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pth`
    - <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pb`

- Then you can run the pipeline to infer with:
```python

@ -50,5 +50,5 @@ if __name__ == '__main__':
params["required_files"] = ["am.mvn", "decoding.yaml", "configuration.json"]
params["output_dir"] = "./checkpoint"
params["data_dir"] = "./data/test"
params["decoding_model_name"] = "20epoch.pth"
params["decoding_model_name"] = "20epoch.pb"
modelscope_infer_after_finetune(params)

@ -41,7 +41,7 @@ The decoding results can be found in `$output_dir/1best_recog/text.cer`, which i
- Modify the inference-related parameters in `infer_after_finetune.py`
    - <strong>output_dir:</strong> # result dir
    - <strong>data_dir:</strong> # the dataset dir needs to include `test/wav.scp`. If `test/text` also exists, CER will be computed
    - <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pth`
    - <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pb`

- Then you can run the pipeline to infer with:
```python

@ -50,5 +50,5 @@ if __name__ == '__main__':
params["required_files"] = ["am.mvn", "decoding.yaml", "configuration.json"]
params["output_dir"] = "./checkpoint"
params["data_dir"] = "./data/test"
params["decoding_model_name"] = "20epoch.pth"
params["decoding_model_name"] = "20epoch.pb"
modelscope_infer_after_finetune(params)

@ -0,0 +1,35 @@
import os
from modelscope.metainfo import Trainers
from modelscope.trainers import build_trainer
from funasr.datasets.ms_dataset import MsDataset


def modelscope_finetune(params):
    if not os.path.exists(params["output_dir"]):
        os.makedirs(params["output_dir"], exist_ok=True)
    # dataset split ["train", "validation"]
    ds_dict = MsDataset.load(params["data_dir"])
    kwargs = dict(
        model=params["model"],
        model_revision=params["model_revision"],
        data_dir=ds_dict,
        dataset_type=params["dataset_type"],
        work_dir=params["output_dir"],
        batch_bins=params["batch_bins"],
        max_epoch=params["max_epoch"],
        lr=params["lr"])
    trainer = build_trainer(Trainers.speech_asr_trainer, default_args=kwargs)
    trainer.train()


if __name__ == '__main__':
    params = {}
    params["output_dir"] = "./checkpoint"
    params["data_dir"] = "./data"
    params["batch_bins"] = 2000
    params["dataset_type"] = "small"
    params["max_epoch"] = 50
    params["lr"] = 0.00005
    params["model"] = "damo/speech_UniASR_asr_2pass-he-16k-common-vocab1085-pytorch"
    params["model_revision"] = None
    modelscope_finetune(params)
@ -0,0 +1,13 @@
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

if __name__ == "__main__":
    audio_in = "https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/asr_example_he.wav"
    output_dir = "./results"
    inference_pipline = pipeline(
        task=Tasks.auto_speech_recognition,
        model="damo/speech_UniASR_asr_2pass-he-16k-common-vocab1085-pytorch",
        output_dir=output_dir,
    )
    rec_result = inference_pipline(audio_in=audio_in, param_dict={"decoding_model": "offline"})
    print(rec_result)
@ -41,7 +41,7 @@ The decoding results can be found in `$output_dir/1best_recog/text.cer`, which i
- Modify the inference-related parameters in `infer_after_finetune.py`
    - <strong>output_dir:</strong> # result dir
    - <strong>data_dir:</strong> # the dataset dir needs to include `test/wav.scp`. If `test/text` also exists, CER will be computed
    - <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pth`
    - <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pb`

- Then you can run the pipeline to infer with:
```python

@ -50,5 +50,5 @@ if __name__ == '__main__':
params["required_files"] = ["am.mvn", "decoding.yaml", "configuration.json"]
params["output_dir"] = "./checkpoint"
params["data_dir"] = "./data/test"
params["decoding_model_name"] = "20epoch.pth"
params["decoding_model_name"] = "20epoch.pb"
modelscope_infer_after_finetune(params)

@ -0,0 +1,35 @@
import os
from modelscope.metainfo import Trainers
from modelscope.trainers import build_trainer
from funasr.datasets.ms_dataset import MsDataset


def modelscope_finetune(params):
    if not os.path.exists(params["output_dir"]):
        os.makedirs(params["output_dir"], exist_ok=True)
    # dataset split ["train", "validation"]
    ds_dict = MsDataset.load(params["data_dir"])
    kwargs = dict(
        model=params["model"],
        model_revision=params["model_revision"],
        data_dir=ds_dict,
        dataset_type=params["dataset_type"],
        work_dir=params["output_dir"],
        batch_bins=params["batch_bins"],
        max_epoch=params["max_epoch"],
        lr=params["lr"])
    trainer = build_trainer(Trainers.speech_asr_trainer, default_args=kwargs)
    trainer.train()


if __name__ == '__main__':
    params = {}
    params["output_dir"] = "./checkpoint"
    params["data_dir"] = "./data"
    params["batch_bins"] = 2000
    params["dataset_type"] = "small"
    params["max_epoch"] = 50
    params["lr"] = 0.00005
    params["model"] = "damo/speech_UniASR_asr_2pass-my-16k-common-vocab696-pytorch"
    params["model_revision"] = None
    modelscope_finetune(params)
@ -0,0 +1,13 @@
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

if __name__ == "__main__":
    audio_in = "https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/asr_example_my.wav"
    output_dir = "./results"
    inference_pipline = pipeline(
        task=Tasks.auto_speech_recognition,
        model="damo/speech_UniASR_asr_2pass-my-16k-common-vocab696-pytorch",
        output_dir=output_dir,
    )
    rec_result = inference_pipline(audio_in=audio_in, param_dict={"decoding_model": "offline"})
    print(rec_result)
@ -0,0 +1,35 @@
import os
from modelscope.metainfo import Trainers
from modelscope.trainers import build_trainer
from funasr.datasets.ms_dataset import MsDataset


def modelscope_finetune(params):
    if not os.path.exists(params["output_dir"]):
        os.makedirs(params["output_dir"], exist_ok=True)
    # dataset split ["train", "validation"]
    ds_dict = MsDataset.load(params["data_dir"])
    kwargs = dict(
        model=params["model"],
        model_revision=params["model_revision"],
        data_dir=ds_dict,
        dataset_type=params["dataset_type"],
        work_dir=params["output_dir"],
        batch_bins=params["batch_bins"],
        max_epoch=params["max_epoch"],
        lr=params["lr"])
    trainer = build_trainer(Trainers.speech_asr_trainer, default_args=kwargs)
    trainer.train()


if __name__ == '__main__':
    params = {}
    params["output_dir"] = "./checkpoint"
    params["data_dir"] = "./data"
    params["batch_bins"] = 2000
    params["dataset_type"] = "small"
    params["max_epoch"] = 50
    params["lr"] = 0.00005
    params["model"] = "damo/speech_UniASR_asr_2pass-ur-16k-common-vocab877-pytorch"
    params["model_revision"] = None
    modelscope_finetune(params)
@ -0,0 +1,13 @@
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

if __name__ == "__main__":
    audio_in = "https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/asr_example_ur.wav"
    output_dir = "./results"
    inference_pipline = pipeline(
        task=Tasks.auto_speech_recognition,
        model="damo/speech_UniASR_asr_2pass-ur-16k-common-vocab877-pytorch",
        output_dir=output_dir,
    )
    rec_result = inference_pipline(audio_in=audio_in, param_dict={"decoding_model": "offline"})
    print(rec_result)
@ -41,7 +41,7 @@ The decoding results can be found in `$output_dir/1best_recog/text.cer`, which i
- Modify the inference-related parameters in `infer_after_finetune.py`
    - <strong>output_dir:</strong> # result dir
    - <strong>data_dir:</strong> # the dataset dir needs to include `test/wav.scp`. If `test/text` also exists, CER will be computed
    - <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pth`
    - <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pb`

- Then you can run the pipeline to infer with:
```python

@ -75,7 +75,7 @@ def modelscope_infer(params):
# If text exists, compute CER
text_in = os.path.join(params["data_dir"], "text")
if os.path.exists(text_in):
text_proc_file = os.path.join(best_recog_path, "token")
text_proc_file = os.path.join(best_recog_path, "text")
compute_wer(text_in, text_proc_file, os.path.join(best_recog_path, "text.cer"))



@ -39,7 +39,7 @@ def modelscope_infer_after_finetune(params):
# compute CER if GT text is set
text_in = os.path.join(params["data_dir"], "text")
if os.path.exists(text_in):
text_proc_file = os.path.join(decoding_path, "1best_recog/token")
text_proc_file = os.path.join(decoding_path, "1best_recog/text")
compute_wer(text_in, text_proc_file, os.path.join(decoding_path, "text.cer"))


@ -49,5 +49,5 @@ if __name__ == '__main__':
params["required_files"] = ["am.mvn", "decoding.yaml", "configuration.json"]
params["output_dir"] = "./checkpoint"
params["data_dir"] = "./data/test"
params["decoding_model_name"] = "20epoch.pth"
params["decoding_model_name"] = "20epoch.pb"
modelscope_infer_after_finetune(params)

@ -41,7 +41,8 @@ The decoding results can be found in `$output_dir/1best_recog/text.cer`, which i
- Modify the inference-related parameters in `infer_after_finetune.py`
    - <strong>output_dir:</strong> # result dir
    - <strong>data_dir:</strong> # the dataset dir needs to include `test/wav.scp`. If `test/text` also exists, CER will be computed
    - <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pth`
    - <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pb`

- Then you can run the pipeline to infer with:
```python

@ -75,7 +75,7 @@ def modelscope_infer(params):
# If text exists, compute CER
text_in = os.path.join(params["data_dir"], "text")
if os.path.exists(text_in):
text_proc_file = os.path.join(best_recog_path, "token")
text_proc_file = os.path.join(best_recog_path, "text")
compute_wer(text_in, text_proc_file, os.path.join(best_recog_path, "text.cer"))


@ -39,7 +39,7 @@ def modelscope_infer_after_finetune(params):
# compute CER if GT text is set
text_in = os.path.join(params["data_dir"], "text")
if os.path.exists(text_in):
text_proc_file = os.path.join(decoding_path, "1best_recog/token")
text_proc_file = os.path.join(decoding_path, "1best_recog/text")
compute_wer(text_in, text_proc_file, os.path.join(decoding_path, "text.cer"))


@ -49,5 +49,5 @@ if __name__ == '__main__':
params["required_files"] = ["am.mvn", "decoding.yaml", "configuration.json"]
params["output_dir"] = "./checkpoint"
params["data_dir"] = "./data/test"
params["decoding_model_name"] = "20epoch.pth"
params["decoding_model_name"] = "20epoch.pb"
modelscope_infer_after_finetune(params)

@ -34,7 +34,7 @@ Or you can use the finetuned model for inference directly.
- Modify the inference-related parameters in `infer_after_finetune.py`
    - <strong>output_dir:</strong> # result dir
    - <strong>data_dir:</strong> # the dataset dir needs to include `test/wav.scp`. If `test/text` also exists, CER will be computed
    - <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pth`
    - <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pb`

- Then you can run the pipeline to infer with:
```python

@ -4,27 +4,17 @@ import shutil

from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks
from modelscope.hub.snapshot_download import snapshot_download

from funasr.utils.compute_wer import compute_wer


def modelscope_infer_after_finetune(params):
# prepare for decoding
if not os.path.exists(os.path.join(params["output_dir"], "punc")):
os.makedirs(os.path.join(params["output_dir"], "punc"))
if not os.path.exists(os.path.join(params["output_dir"], "vad")):
os.makedirs(os.path.join(params["output_dir"], "vad"))
pretrained_model_path = os.path.join(os.environ["HOME"], ".cache/modelscope/hub", params["modelscope_model_name"])
for file_name in params["required_files"]:
if file_name == "configuration.json":
with open(os.path.join(pretrained_model_path, file_name)) as f:
config_dict = json.load(f)
config_dict["model"]["am_model_name"] = params["decoding_model_name"]
with open(os.path.join(params["output_dir"], "configuration.json"), "w") as f:
json.dump(config_dict, f, indent=4, separators=(',', ': '))
else:
shutil.copy(os.path.join(pretrained_model_path, file_name),
os.path.join(params["output_dir"], file_name))

try:
pretrained_model_path = snapshot_download(params["modelscope_model_name"], cache_dir=params["output_dir"])
except BaseException:
raise BaseException(f"Please download pretrain model from ModelScope firstly.")
shutil.copy(os.path.join(params["output_dir"], params["decoding_model_name"]), os.path.join(pretrained_model_path, "model.pb"))
decoding_path = os.path.join(params["output_dir"], "decode_results")
if os.path.exists(decoding_path):
shutil.rmtree(decoding_path)
@ -33,16 +23,16 @@ def modelscope_infer_after_finetune(params):
# decoding
inference_pipeline = pipeline(
task=Tasks.auto_speech_recognition,
model=params["output_dir"],
model=pretrained_model_path,
output_dir=decoding_path,
batch_size=64
batch_size=params["batch_size"]
)
audio_in = os.path.join(params["data_dir"], "wav.scp")
inference_pipeline(audio_in=audio_in)

# compute CER if GT text is set
text_in = os.path.join(params["data_dir"], "text")
if text_in is not None:
if os.path.exists(text_in):
text_proc_file = os.path.join(decoding_path, "1best_recog/token")
compute_wer(text_in, text_proc_file, os.path.join(decoding_path, "text.cer"))

@ -50,8 +40,8 @@ def modelscope_infer_after_finetune(params):
if __name__ == '__main__':
params = {}
params["modelscope_model_name"] = "damo/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch"
params["required_files"] = ["am.mvn", "decoding.yaml", "configuration.json", "punc/punc.pb", "punc/punc.yaml", "vad/vad.mvn", "vad/vad.pb", "vad/vad.yaml"]
params["output_dir"] = "./checkpoint"
params["data_dir"] = "./data/test"
params["decoding_model_name"] = "valid.acc.ave_10best.pth"
modelscope_infer_after_finetune(params)
params["decoding_model_name"] = "valid.acc.ave_10best.pb"
params["batch_size"] = 64
modelscope_infer_after_finetune(params)
@ -4,22 +4,23 @@ inputs = "跨境河流是养育沿岸|人民的生命之源长期以来为帮助

from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks
from modelscope.utils.logger import get_logger
import logging
logger = get_logger(log_level=logging.CRITICAL)
logger.setLevel(logging.CRITICAL)


inference_pipeline = pipeline(
task=Tasks.punctuation,
model='damo/punc_ct-transformer_zh-cn-common-vad_realtime-vocab272727',
model_revision="v1.0.0",
output_dir="./tmp/"
)

vads = inputs.split("|")

cache_out = []
rec_result_all="outputs:"
param_dict = {"cache": []}
for vad in vads:
rec_result = inference_pipeline(text_in=vad, cache=cache_out)
#print(rec_result)
cache_out = rec_result['cache']
rec_result = inference_pipeline(text_in=vad, param_dict=param_dict)
rec_result_all += rec_result['text']

print(rec_result_all)

@ -0,0 +1,10 @@
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

inference_diar_pipline = pipeline(
    task=Tasks.speaker_diarization,
    model='damo/speech_diarization_eend-ola-en-us-callhome-8k',
    model_revision="v1.0.0",
)
results = inference_diar_pipline(audio_in=["https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_data/record2.wav"])
print(results)
@ -7,7 +7,7 @@ if __name__ == '__main__':
inference_pipline = pipeline(
task=Tasks.voice_activity_detection,
model="damo/speech_fsmn_vad_zh-cn-16k-common-pytorch",
model_revision=None,
model_revision='v1.2.0',
output_dir=output_dir,
batch_size=1,
)

@ -1,16 +1,20 @@
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks
from modelscope.utils.logger import get_logger
import logging
logger = get_logger(log_level=logging.CRITICAL)
logger.setLevel(logging.CRITICAL)
import soundfile


if __name__ == '__main__':
output_dir = None
inference_pipline = pipeline(
task=Tasks.voice_activity_detection,
model="damo/speech_fsmn_vad_zh-cn-16k-common-pytorch",
model_revision='v1.1.9',
output_dir=None,
model_revision='v1.2.0',
output_dir=output_dir,
batch_size=1,
mode='online',
)
speech, sample_rate = soundfile.read("./vad_example_16k.wav")
speech_length = speech.shape[0]
@ -18,7 +22,7 @@ if __name__ == '__main__':
sample_offset = 0

step = 160 * 10
param_dict = {'in_cache': dict()}
param_dict = {'in_cache': dict(), 'max_end_sil': 800}
for sample_offset in range(0, speech_length, min(step, speech_length - sample_offset)):
if sample_offset + step >= speech_length - 1:
step = speech_length - sample_offset

@ -7,8 +7,8 @@ if __name__ == '__main__':
inference_pipline = pipeline(
task=Tasks.voice_activity_detection,
model="damo/speech_fsmn_vad_zh-cn-8k-common",
model_revision=None,
output_dir='./output_dir',
model_revision='v1.2.0',
output_dir=output_dir,
batch_size=1,
)
segments_result = inference_pipline(audio_in=audio_in)

@ -1,16 +1,20 @@
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks
from modelscope.utils.logger import get_logger
import logging
logger = get_logger(log_level=logging.CRITICAL)
logger.setLevel(logging.CRITICAL)
import soundfile


if __name__ == '__main__':
output_dir = None
inference_pipline = pipeline(
task=Tasks.voice_activity_detection,
model="damo/speech_fsmn_vad_zh-cn-8k-common",
model_revision='v1.1.9',
output_dir='./output_dir',
model_revision='v1.2.0',
output_dir=output_dir,
batch_size=1,
mode='online',
)
speech, sample_rate = soundfile.read("./vad_example_8k.wav")
speech_length = speech.shape[0]
@ -18,7 +22,7 @@ if __name__ == '__main__':
sample_offset = 0

step = 80 * 10
param_dict = {'in_cache': dict()}
param_dict = {'in_cache': dict(), 'max_end_sil': 800}
for sample_offset in range(0, speech_length, min(step, speech_length - sample_offset)):
if sample_offset + step >= speech_length - 1:
step = speech_length - sample_offset

@ -52,7 +52,7 @@ class Speech2Text:
|
||||
|
||||
Examples:
|
||||
>>> import soundfile
|
||||
>>> speech2text = Speech2Text("asr_config.yml", "asr.pth")
|
||||
>>> speech2text = Speech2Text("asr_config.yml", "asr.pb")
|
||||
>>> audio, rate = soundfile.read("speech.wav")
|
||||
>>> speech2text(audio)
|
||||
[(text, token, token_int, hypothesis object), ...]
|
||||
|
||||
@ -216,6 +216,9 @@ def inference_launch(**kwargs):
|
||||
elif mode == "paraformer":
|
||||
from funasr.bin.asr_inference_paraformer import inference_modelscope
|
||||
return inference_modelscope(**kwargs)
|
||||
elif mode == "paraformer_streaming":
|
||||
from funasr.bin.asr_inference_paraformer_streaming import inference_modelscope
|
||||
return inference_modelscope(**kwargs)
|
||||
elif mode == "paraformer_vad":
|
||||
from funasr.bin.asr_inference_paraformer_vad import inference_modelscope
|
||||
return inference_modelscope(**kwargs)
|
||||
|
||||
@ -41,8 +41,6 @@ from funasr.utils.types import str_or_none
|
||||
from funasr.utils import asr_utils, wav_utils, postprocess_utils
|
||||
import pdb
|
||||
|
||||
header_colors = '\033[95m'
|
||||
end_colors = '\033[0m'
|
||||
|
||||
global_asr_language: str = 'zh-cn'
|
||||
global_sample_rate: Union[int, Dict[Any, int]] = {
|
||||
@@ -55,7 +53,7 @@ class Speech2Text:

     Examples:
         >>> import soundfile
-        >>> speech2text = Speech2Text("asr_config.yml", "asr.pth")
+        >>> speech2text = Speech2Text("asr_config.yml", "asr.pb")
         >>> audio, rate = soundfile.read("speech.wav")
         >>> speech2text(audio)
         [(text, token, token_int, hypothesis object), ...]

@@ -43,6 +43,7 @@ from funasr.models.frontend.wav_frontend import WavFrontend
 from funasr.models.e2e_asr_paraformer import BiCifParaformer, ContextualParaformer
 from funasr.export.models.e2e_asr_paraformer import Paraformer as Paraformer_export
 from funasr.utils.timestamp_tools import ts_prediction_lfr6_standard
+from funasr.bin.tp_inference import SpeechText2Timestamp


 class Speech2Text:
@@ -50,7 +51,7 @@ class Speech2Text:

     Examples:
         >>> import soundfile
-        >>> speech2text = Speech2Text("asr_config.yml", "asr.pth")
+        >>> speech2text = Speech2Text("asr_config.yml", "asr.pb")
         >>> audio, rate = soundfile.read("speech.wav")
         >>> speech2text(audio)
         [(text, token, token_int, hypothesis object), ...]

@@ -540,7 +541,8 @@ def inference(
     ngram_weight: float = 0.9,
     nbest: int = 1,
     num_workers: int = 1,
-
+    timestamp_infer_config: Union[Path, str] = None,
+    timestamp_model_file: Union[Path, str] = None,
     **kwargs,
 ):
     inference_pipeline = inference_modelscope(

@@ -604,6 +606,8 @@ def inference_modelscope(
     nbest: int = 1,
     num_workers: int = 1,
     output_dir: Optional[str] = None,
+    timestamp_infer_config: Union[Path, str] = None,
+    timestamp_model_file: Union[Path, str] = None,
     param_dict: dict = None,
     **kwargs,
 ):

@@ -661,6 +665,15 @@ def inference_modelscope(
     else:
         speech2text = Speech2Text(**speech2text_kwargs)

+    if timestamp_model_file is not None:
+        speechtext2timestamp = SpeechText2Timestamp(
+            timestamp_cmvn_file=cmvn_file,
+            timestamp_model_file=timestamp_model_file,
+            timestamp_infer_config=timestamp_infer_config,
+        )
+    else:
+        speechtext2timestamp = None
+
     def _forward(
         data_path_and_name_and_type,
         raw_inputs: Union[np.ndarray, torch.Tensor] = None,

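With the two new arguments threaded through, callers can opt into timestamp prediction. A minimal sketch, assuming placeholder file paths (none of these artifacts are named in the diff):

```python
# Sketch: turn on the optional timestamp predictor added above.
forward = inference_modelscope(
    maxlenratio=0.0, minlenratio=0.0, batch_size=1, beam_size=1,
    ngpu=0, ctc_weight=0.0, lm_weight=0.0, penalty=0.0, log_level="INFO",
    asr_train_config="config.yaml",            # placeholder paths
    asr_model_file="model.pb",
    cmvn_file="am.mvn",
    timestamp_infer_config="tp_config.yaml",   # consumed by SpeechText2Timestamp
    timestamp_model_file="tp_model.pb",        # leave None to skip timestamps
)
```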
@@ -744,7 +757,17 @@ def inference_modelscope(
             key = keys[batch_id]
             for n, result in zip(range(1, nbest + 1), result):
                 text, token, token_int, hyp = result[0], result[1], result[2], result[3]
-                time_stamp = None if len(result) < 5 else result[4]
+                timestamp = None if len(result) < 5 else result[4]
+                # conduct timestamp prediction here
+                # timestamp inference requires token length
+                # thus following inference cannot be conducted in batch
+                if timestamp is None and speechtext2timestamp:
+                    ts_batch = {}
+                    ts_batch['speech'] = batch['speech'][batch_id].unsqueeze(0)
+                    ts_batch['speech_lengths'] = torch.tensor([batch['speech_lengths'][batch_id]])
+                    ts_batch['text_lengths'] = torch.tensor([len(token)])
+                    us_alphas, us_peaks = speechtext2timestamp(**ts_batch)
+                    ts_str, timestamp = ts_prediction_lfr6_standard(us_alphas[0], us_peaks[0], token, force_time_shift=-3.0)
                 # Create a directory: outdir/{n}best_recog
                 if writer is not None:
                     ibest_writer = writer[f"{n}best_recog"]

@@ -756,25 +779,25 @@ def inference_modelscope(
                     ibest_writer["rtf"][key] = rtf_cur

                 if text is not None:
-                    if use_timestamp and time_stamp is not None:
-                        postprocessed_result = postprocess_utils.sentence_postprocess(token, time_stamp)
+                    if use_timestamp and timestamp is not None:
+                        postprocessed_result = postprocess_utils.sentence_postprocess(token, timestamp)
                     else:
                         postprocessed_result = postprocess_utils.sentence_postprocess(token)
-                    time_stamp_postprocessed = ""
+                    timestamp_postprocessed = ""
                     if len(postprocessed_result) == 3:
-                        text_postprocessed, time_stamp_postprocessed, word_lists = postprocessed_result[0], \
+                        text_postprocessed, timestamp_postprocessed, word_lists = postprocessed_result[0], \
                             postprocessed_result[1], \
                             postprocessed_result[2]
                     else:
                         text_postprocessed, word_lists = postprocessed_result[0], postprocessed_result[1]
                     item = {'key': key, 'value': text_postprocessed}
-                    if time_stamp_postprocessed != "":
-                        item['time_stamp'] = time_stamp_postprocessed
+                    if timestamp_postprocessed != "":
+                        item['timestamp'] = timestamp_postprocessed
                     asr_result_list.append(item)
                     finish_count += 1
                     # asr_utils.print_progress(finish_count / file_count)
                     if writer is not None:
-                        ibest_writer["text"][key] = text_postprocessed
+                        ibest_writer["text"][key] = " ".join(word_lists)

         logging.info("decoding, utt: {}, predictions: {}".format(key, text))
         rtf_avg = "decoding, feature length total: {}, forward_time total: {:.4f}, rtf avg: {:.4f}".format(length_total, forward_time_total, 100 * forward_time_total / (length_total * lfr_factor))

916  funasr/bin/asr_inference_paraformer_streaming.py  (new file)
@@ -0,0 +1,916 @@
#!/usr/bin/env python3

import argparse
import logging
import sys
import time
import copy
import os
import codecs
import tempfile
import requests
from pathlib import Path
from typing import Optional
from typing import Sequence
from typing import Tuple
from typing import Union
from typing import Dict
from typing import Any
from typing import List

import numpy as np
import torch
from typeguard import check_argument_types

from funasr.fileio.datadir_writer import DatadirWriter
from funasr.modules.beam_search.beam_search import BeamSearchPara as BeamSearch
from funasr.modules.beam_search.beam_search import Hypothesis
from funasr.modules.scorers.ctc import CTCPrefixScorer
from funasr.modules.scorers.length_bonus import LengthBonus
from funasr.modules.subsampling import TooShortUttError
from funasr.tasks.asr import ASRTaskParaformer as ASRTask
from funasr.tasks.lm import LMTask
from funasr.text.build_tokenizer import build_tokenizer
from funasr.text.token_id_converter import TokenIDConverter
from funasr.torch_utils.device_funcs import to_device
from funasr.torch_utils.set_all_random_seed import set_all_random_seed
from funasr.utils import config_argparse
from funasr.utils.cli_utils import get_commandline_args
from funasr.utils.types import str2bool
from funasr.utils.types import str2triple_str
from funasr.utils.types import str_or_none
from funasr.utils import asr_utils, wav_utils, postprocess_utils
from funasr.models.frontend.wav_frontend import WavFrontend
from funasr.models.e2e_asr_paraformer import BiCifParaformer, ContextualParaformer
from funasr.export.models.e2e_asr_paraformer import Paraformer as Paraformer_export

np.set_printoptions(threshold=np.inf)


class Speech2Text:
    """Speech2Text class

    Examples:
        >>> import soundfile
        >>> speech2text = Speech2Text("asr_config.yml", "asr.pth")
        >>> audio, rate = soundfile.read("speech.wav")
        >>> speech2text(audio)
        [(text, token, token_int, hypothesis object), ...]

    """

    def __init__(
        self,
        asr_train_config: Union[Path, str] = None,
        asr_model_file: Union[Path, str] = None,
        cmvn_file: Union[Path, str] = None,
        lm_train_config: Union[Path, str] = None,
        lm_file: Union[Path, str] = None,
        token_type: str = None,
        bpemodel: str = None,
        device: str = "cpu",
        maxlenratio: float = 0.0,
        minlenratio: float = 0.0,
        dtype: str = "float32",
        beam_size: int = 20,
        ctc_weight: float = 0.5,
        lm_weight: float = 1.0,
        ngram_weight: float = 0.9,
        penalty: float = 0.0,
        nbest: int = 1,
        frontend_conf: dict = None,
        hotword_list_or_file: str = None,
        **kwargs,
    ):
        assert check_argument_types()

        # 1. Build ASR model
        scorers = {}
        asr_model, asr_train_args = ASRTask.build_model_from_file(
            asr_train_config, asr_model_file, cmvn_file, device
        )
        frontend = None
        if asr_train_args.frontend is not None and asr_train_args.frontend_conf is not None:
            frontend = WavFrontend(cmvn_file=cmvn_file, **asr_train_args.frontend_conf)

        logging.info("asr_model: {}".format(asr_model))
        logging.info("asr_train_args: {}".format(asr_train_args))
        asr_model.to(dtype=getattr(torch, dtype)).eval()

        if asr_model.ctc != None:
            ctc = CTCPrefixScorer(ctc=asr_model.ctc, eos=asr_model.eos)
            scorers.update(
                ctc=ctc
            )
        token_list = asr_model.token_list
        scorers.update(
            length_bonus=LengthBonus(len(token_list)),
        )

        # 2. Build Language model
        if lm_train_config is not None:
            lm, lm_train_args = LMTask.build_model_from_file(
                lm_train_config, lm_file, device
            )
            scorers["lm"] = lm.lm

        # 3. Build ngram model
        # ngram is not supported now
        ngram = None
        scorers["ngram"] = ngram

        # 4. Build BeamSearch object
        # transducer is not supported now
        beam_search_transducer = None

        weights = dict(
            decoder=1.0 - ctc_weight,
            ctc=ctc_weight,
            lm=lm_weight,
            ngram=ngram_weight,
            length_bonus=penalty,
        )
        beam_search = BeamSearch(
            beam_size=beam_size,
            weights=weights,
            scorers=scorers,
            sos=asr_model.sos,
            eos=asr_model.eos,
            vocab_size=len(token_list),
            token_list=token_list,
            pre_beam_score_key=None if ctc_weight == 1.0 else "full",
        )

        beam_search.to(device=device, dtype=getattr(torch, dtype)).eval()
        for scorer in scorers.values():
            if isinstance(scorer, torch.nn.Module):
                scorer.to(device=device, dtype=getattr(torch, dtype)).eval()

        logging.info(f"Decoding device={device}, dtype={dtype}")

        # 5. [Optional] Build Text converter: e.g. bpe-sym -> Text
        if token_type is None:
            token_type = asr_train_args.token_type
        if bpemodel is None:
            bpemodel = asr_train_args.bpemodel

        if token_type is None:
            tokenizer = None
        elif token_type == "bpe":
            if bpemodel is not None:
                tokenizer = build_tokenizer(token_type=token_type, bpemodel=bpemodel)
            else:
                tokenizer = None
        else:
            tokenizer = build_tokenizer(token_type=token_type)
        converter = TokenIDConverter(token_list=token_list)
        logging.info(f"Text tokenizer: {tokenizer}")

        self.asr_model = asr_model
        self.asr_train_args = asr_train_args
        self.converter = converter
        self.tokenizer = tokenizer

        # 6. [Optional] Build hotword list from str, local file or url

        is_use_lm = lm_weight != 0.0 and lm_file is not None
        if (ctc_weight == 0.0 or asr_model.ctc == None) and not is_use_lm:
            beam_search = None
        self.beam_search = beam_search
        logging.info(f"Beam_search: {self.beam_search}")
        self.beam_search_transducer = beam_search_transducer
        self.maxlenratio = maxlenratio
        self.minlenratio = minlenratio
        self.device = device
        self.dtype = dtype
        self.nbest = nbest
        self.frontend = frontend
        self.encoder_downsampling_factor = 1
        if asr_train_args.encoder == "data2vec_encoder" or asr_train_args.encoder_conf["input_layer"] == "conv2d":
            self.encoder_downsampling_factor = 4

    @torch.no_grad()
    def __call__(
        self, cache: dict, speech: Union[torch.Tensor, np.ndarray], speech_lengths: Union[torch.Tensor, np.ndarray] = None,
        begin_time: int = 0, end_time: int = None,
    ):
        """Inference

        Args:
            speech: Input speech data
        Returns:
            text, token, token_int, hyp

        """
        assert check_argument_types()

        # Input as audio signal
        if isinstance(speech, np.ndarray):
            speech = torch.tensor(speech)
        if self.frontend is not None:
            feats, feats_len = self.frontend.forward(speech, speech_lengths)
            feats = to_device(feats, device=self.device)
            feats_len = feats_len.int()
            self.asr_model.frontend = None
        else:
            feats = speech
            feats_len = speech_lengths
        lfr_factor = max(1, (feats.size()[-1] // 80) - 1)
        feats_len = cache["encoder"]["stride"] + cache["encoder"]["pad_left"] + cache["encoder"]["pad_right"]
        feats = feats[:, cache["encoder"]["start_idx"]:cache["encoder"]["start_idx"] + feats_len, :]
        feats_len = torch.tensor([feats_len])
        batch = {"speech": feats, "speech_lengths": feats_len, "cache": cache}

        # a. To device
        batch = to_device(batch, device=self.device)

        # b. Forward Encoder
        enc, enc_len = self.asr_model.encode_chunk(feats, feats_len, cache)
        if isinstance(enc, tuple):
            enc = enc[0]
        # assert len(enc) == 1, len(enc)
        enc_len_batch_total = torch.sum(enc_len).item() * self.encoder_downsampling_factor

        predictor_outs = self.asr_model.calc_predictor_chunk(enc, cache)
        pre_acoustic_embeds, pre_token_length, alphas, pre_peak_index = predictor_outs[0], predictor_outs[1], \
            predictor_outs[2], predictor_outs[3]
        pre_token_length = pre_token_length.floor().long()
        if torch.max(pre_token_length) < 1:
            return []
        decoder_outs = self.asr_model.cal_decoder_with_predictor_chunk(enc, pre_acoustic_embeds, cache)
        decoder_out = decoder_outs

        results = []
        b, n, d = decoder_out.size()
        for i in range(b):
            x = enc[i, :enc_len[i], :]
            am_scores = decoder_out[i, :pre_token_length[i], :]
            if self.beam_search is not None:
                nbest_hyps = self.beam_search(
                    x=x, am_scores=am_scores, maxlenratio=self.maxlenratio, minlenratio=self.minlenratio
                )

                nbest_hyps = nbest_hyps[: self.nbest]
            else:
                yseq = am_scores.argmax(dim=-1)
                score = am_scores.max(dim=-1)[0]
                score = torch.sum(score, dim=-1)
                # pad with mask tokens to ensure compatibility with sos/eos tokens
                yseq = torch.tensor(
                    [self.asr_model.sos] + yseq.tolist() + [self.asr_model.eos], device=yseq.device
                )
                nbest_hyps = [Hypothesis(yseq=yseq, score=score)]

            for hyp in nbest_hyps:
                assert isinstance(hyp, (Hypothesis)), type(hyp)

                # remove sos/eos and get results
                last_pos = -1
                if isinstance(hyp.yseq, list):
                    token_int = hyp.yseq[1:last_pos]
                else:
                    token_int = hyp.yseq[1:last_pos].tolist()

                # remove blank symbol id, which is assumed to be 0
                token_int = list(filter(lambda x: x != 0 and x != 2, token_int))

                # Change integer-ids to tokens
                token = self.converter.ids2tokens(token_int)

                if self.tokenizer is not None:
                    text = self.tokenizer.tokens2text(token)
                else:
                    text = None

                results.append((text, token, token_int, hyp, enc_len_batch_total, lfr_factor))

        # assert check_return_type(results)
        return results

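The sample arithmetic in `__call__` is easiest to read at a 16 kHz sample rate, where one encoder step of 960 samples is 60 ms: `stride: 10` is then a 600 ms chunk and `pad_left`/`pad_right` of 5 are 300 ms of context on each side (our reading of the constants, not documented in the diff). A sketch of the cache layout that `_forward` below seeds on the first call:

```python
# Sketch: the streaming cache as initialized by _forward below; the frame
# comments assume 16 kHz audio with 960 samples (60 ms) per encoder step.
def new_streaming_cache(is_final: bool = False) -> dict:
    return {
        "encoder": {
            "start_idx": 0,      # first frame of the current chunk
            "pad_left": 0,       # look-back context frames
            "stride": 10,        # frames actually decoded (~600 ms)
            "pad_right": 5,      # look-ahead context frames
            "cif_hidden": None,  # CIF predictor state carried across chunks
            "cif_alphas": None,
            "is_final": is_final,
            "left": 0,
            "right": 0,
        },
        "decoder": {"decode_fsmn": None},  # FSMN decoder memory
        "first_chunk": True,
        "speech": [],                      # accumulated raw samples
        "accum_speech": 0,                 # samples not yet decoded
    }
```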
class Speech2TextExport:
    """Speech2TextExport class

    """

    def __init__(
        self,
        asr_train_config: Union[Path, str] = None,
        asr_model_file: Union[Path, str] = None,
        cmvn_file: Union[Path, str] = None,
        lm_train_config: Union[Path, str] = None,
        lm_file: Union[Path, str] = None,
        token_type: str = None,
        bpemodel: str = None,
        device: str = "cpu",
        maxlenratio: float = 0.0,
        minlenratio: float = 0.0,
        dtype: str = "float32",
        beam_size: int = 20,
        ctc_weight: float = 0.5,
        lm_weight: float = 1.0,
        ngram_weight: float = 0.9,
        penalty: float = 0.0,
        nbest: int = 1,
        frontend_conf: dict = None,
        hotword_list_or_file: str = None,
        **kwargs,
    ):

        # 1. Build ASR model
        asr_model, asr_train_args = ASRTask.build_model_from_file(
            asr_train_config, asr_model_file, cmvn_file, device
        )
        frontend = None
        if asr_train_args.frontend is not None and asr_train_args.frontend_conf is not None:
            frontend = WavFrontend(cmvn_file=cmvn_file, **asr_train_args.frontend_conf)

        logging.info("asr_model: {}".format(asr_model))
        logging.info("asr_train_args: {}".format(asr_train_args))
        asr_model.to(dtype=getattr(torch, dtype)).eval()

        token_list = asr_model.token_list

        logging.info(f"Decoding device={device}, dtype={dtype}")

        # 5. [Optional] Build Text converter: e.g. bpe-sym -> Text
        if token_type is None:
            token_type = asr_train_args.token_type
        if bpemodel is None:
            bpemodel = asr_train_args.bpemodel

        if token_type is None:
            tokenizer = None
        elif token_type == "bpe":
            if bpemodel is not None:
                tokenizer = build_tokenizer(token_type=token_type, bpemodel=bpemodel)
            else:
                tokenizer = None
        else:
            tokenizer = build_tokenizer(token_type=token_type)
        converter = TokenIDConverter(token_list=token_list)
        logging.info(f"Text tokenizer: {tokenizer}")

        # self.asr_model = asr_model
        self.asr_train_args = asr_train_args
        self.converter = converter
        self.tokenizer = tokenizer

        self.device = device
        self.dtype = dtype
        self.nbest = nbest
        self.frontend = frontend

        model = Paraformer_export(asr_model, onnx=False)
        self.asr_model = model

    @torch.no_grad()
    def __call__(
        self, speech: Union[torch.Tensor, np.ndarray], speech_lengths: Union[torch.Tensor, np.ndarray] = None
    ):
        """Inference

        Args:
            speech: Input speech data
        Returns:
            text, token, token_int, hyp

        """
        assert check_argument_types()

        # Input as audio signal
        if isinstance(speech, np.ndarray):
            speech = torch.tensor(speech)

        if self.frontend is not None:
            feats, feats_len = self.frontend.forward(speech, speech_lengths)
            feats = to_device(feats, device=self.device)
            feats_len = feats_len.int()
            self.asr_model.frontend = None
        else:
            feats = speech
            feats_len = speech_lengths

        enc_len_batch_total = feats_len.sum()
        lfr_factor = max(1, (feats.size()[-1] // 80) - 1)
        batch = {"speech": feats, "speech_lengths": feats_len}

        # a. To device
        batch = to_device(batch, device=self.device)

        decoder_outs = self.asr_model(**batch)
        decoder_out, ys_pad_lens = decoder_outs[0], decoder_outs[1]

        results = []
        b, n, d = decoder_out.size()
        for i in range(b):
            am_scores = decoder_out[i, :ys_pad_lens[i], :]

            yseq = am_scores.argmax(dim=-1)
            score = am_scores.max(dim=-1)[0]
            score = torch.sum(score, dim=-1)
            # pad with mask tokens to ensure compatibility with sos/eos tokens
            yseq = torch.tensor(
                yseq.tolist(), device=yseq.device
            )
            nbest_hyps = [Hypothesis(yseq=yseq, score=score)]

            for hyp in nbest_hyps:
                assert isinstance(hyp, (Hypothesis)), type(hyp)

                # remove sos/eos and get results
                last_pos = -1
                if isinstance(hyp.yseq, list):
                    token_int = hyp.yseq[1:last_pos]
                else:
                    token_int = hyp.yseq[1:last_pos].tolist()

                # remove blank symbol id, which is assumed to be 0
                token_int = list(filter(lambda x: x != 0 and x != 2, token_int))

                # Change integer-ids to tokens
                token = self.converter.ids2tokens(token_int)

                if self.tokenizer is not None:
                    text = self.tokenizer.tokens2text(token)
                else:
                    text = None

                results.append((text, token, token_int, hyp, enc_len_batch_total, lfr_factor))

        return results


def inference(
    maxlenratio: float,
    minlenratio: float,
    batch_size: int,
    beam_size: int,
    ngpu: int,
    ctc_weight: float,
    lm_weight: float,
    penalty: float,
    log_level: Union[int, str],
    data_path_and_name_and_type,
    asr_train_config: Optional[str],
    asr_model_file: Optional[str],
    cmvn_file: Optional[str] = None,
    raw_inputs: Union[np.ndarray, torch.Tensor] = None,
    lm_train_config: Optional[str] = None,
    lm_file: Optional[str] = None,
    token_type: Optional[str] = None,
    key_file: Optional[str] = None,
    word_lm_train_config: Optional[str] = None,
    bpemodel: Optional[str] = None,
    allow_variable_data_keys: bool = False,
    streaming: bool = False,
    output_dir: Optional[str] = None,
    dtype: str = "float32",
    seed: int = 0,
    ngram_weight: float = 0.9,
    nbest: int = 1,
    num_workers: int = 1,
    **kwargs,
):
    inference_pipeline = inference_modelscope(
        maxlenratio=maxlenratio,
        minlenratio=minlenratio,
        batch_size=batch_size,
        beam_size=beam_size,
        ngpu=ngpu,
        ctc_weight=ctc_weight,
        lm_weight=lm_weight,
        penalty=penalty,
        log_level=log_level,
        asr_train_config=asr_train_config,
        asr_model_file=asr_model_file,
        cmvn_file=cmvn_file,
        raw_inputs=raw_inputs,
        lm_train_config=lm_train_config,
        lm_file=lm_file,
        token_type=token_type,
        key_file=key_file,
        word_lm_train_config=word_lm_train_config,
        bpemodel=bpemodel,
        allow_variable_data_keys=allow_variable_data_keys,
        streaming=streaming,
        output_dir=output_dir,
        dtype=dtype,
        seed=seed,
        ngram_weight=ngram_weight,
        nbest=nbest,
        num_workers=num_workers,
        **kwargs,
    )
    return inference_pipeline(data_path_and_name_and_type, raw_inputs)

def inference_modelscope(
    maxlenratio: float,
    minlenratio: float,
    batch_size: int,
    beam_size: int,
    ngpu: int,
    ctc_weight: float,
    lm_weight: float,
    penalty: float,
    log_level: Union[int, str],
    # data_path_and_name_and_type,
    asr_train_config: Optional[str],
    asr_model_file: Optional[str],
    cmvn_file: Optional[str] = None,
    lm_train_config: Optional[str] = None,
    lm_file: Optional[str] = None,
    token_type: Optional[str] = None,
    key_file: Optional[str] = None,
    word_lm_train_config: Optional[str] = None,
    bpemodel: Optional[str] = None,
    allow_variable_data_keys: bool = False,
    dtype: str = "float32",
    seed: int = 0,
    ngram_weight: float = 0.9,
    nbest: int = 1,
    num_workers: int = 1,
    output_dir: Optional[str] = None,
    param_dict: dict = None,
    **kwargs,
):
    assert check_argument_types()

    if word_lm_train_config is not None:
        raise NotImplementedError("Word LM is not implemented")
    if ngpu > 1:
        raise NotImplementedError("only single GPU decoding is supported")

    logging.basicConfig(
        level=log_level,
        format="%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s",
    )

    export_mode = False

    if ngpu >= 1 and torch.cuda.is_available():
        device = "cuda"
    else:
        device = "cpu"
        batch_size = 1

    # 1. Set random-seed
    set_all_random_seed(seed)

    # 2. Build speech2text
    speech2text_kwargs = dict(
        asr_train_config=asr_train_config,
        asr_model_file=asr_model_file,
        cmvn_file=cmvn_file,
        lm_train_config=lm_train_config,
        lm_file=lm_file,
        token_type=token_type,
        bpemodel=bpemodel,
        device=device,
        maxlenratio=maxlenratio,
        minlenratio=minlenratio,
        dtype=dtype,
        beam_size=beam_size,
        ctc_weight=ctc_weight,
        lm_weight=lm_weight,
        ngram_weight=ngram_weight,
        penalty=penalty,
        nbest=nbest,
    )
    if export_mode:
        speech2text = Speech2TextExport(**speech2text_kwargs)
    else:
        speech2text = Speech2Text(**speech2text_kwargs)

    def _load_bytes(input):
        middle_data = np.frombuffer(input, dtype=np.int16)
        middle_data = np.asarray(middle_data)
        if middle_data.dtype.kind not in 'iu':
            raise TypeError("'middle_data' must be an array of integers")
        dtype = np.dtype('float32')
        if dtype.kind != 'f':
            raise TypeError("'dtype' must be a floating point type")

        i = np.iinfo(middle_data.dtype)
        abs_max = 2 ** (i.bits - 1)
        offset = i.min + abs_max
        array = np.frombuffer((middle_data.astype(dtype) - offset) / abs_max, dtype=np.float32)
        return array

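`_load_bytes` is a standard PCM16-to-float conversion: for int16 input the offset works out to zero, so the mapping is simply a divide by 2^15 into [-1, 1). A tiny illustration:

```python
# Illustration of the scaling in _load_bytes: int16 PCM maps into [-1, 1).
import struct
import numpy as np

pcm = struct.pack("<4h", 0, 16384, -16384, 32767)  # four int16 samples
ints = np.frombuffer(pcm, dtype=np.int16)
print(ints.astype(np.float32) / 2 ** 15)           # [0. 0.5 -0.5 0.99996948]
```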
    def _forward(
        data_path_and_name_and_type,
        raw_inputs: Union[np.ndarray, torch.Tensor] = None,
        output_dir_v2: Optional[str] = None,
        fs: dict = None,
        param_dict: dict = None,
        **kwargs,
    ):

        # 3. Build data-iterator
        if data_path_and_name_and_type is not None and data_path_and_name_and_type[2] == "bytes":
            raw_inputs = _load_bytes(data_path_and_name_and_type[0])
            raw_inputs = torch.tensor(raw_inputs)
        if data_path_and_name_and_type is None and raw_inputs is not None:
            if isinstance(raw_inputs, np.ndarray):
                raw_inputs = torch.tensor(raw_inputs)
        is_final = False
        if param_dict is not None and "cache" in param_dict:
            cache = param_dict["cache"]
        if param_dict is not None and "is_final" in param_dict:
            is_final = param_dict["is_final"]
        # 7. Start for-loop
        # FIXME(kamo): The output format should be discussed about
        asr_result_list = []
        results = []
        asr_result = ""
        wait = True
        if len(cache) == 0:
            cache["encoder"] = {"start_idx": 0, "pad_left": 0, "stride": 10, "pad_right": 5, "cif_hidden": None, "cif_alphas": None, "is_final": is_final, "left": 0, "right": 0}
            cache_de = {"decode_fsmn": None}
            cache["decoder"] = cache_de
            cache["first_chunk"] = True
            cache["speech"] = []
            cache["accum_speech"] = 0

        if raw_inputs is not None:
            if len(cache["speech"]) == 0:
                cache["speech"] = raw_inputs
            else:
                cache["speech"] = torch.cat([cache["speech"], raw_inputs], dim=0)
            cache["accum_speech"] += len(raw_inputs)
            while cache["accum_speech"] >= 960:
                if cache["first_chunk"]:
                    if cache["accum_speech"] >= 14400:
                        speech = torch.unsqueeze(cache["speech"], axis=0)
                        speech_length = torch.tensor([len(cache["speech"])])
                        cache["encoder"]["pad_left"] = 5
                        cache["encoder"]["pad_right"] = 5
                        cache["encoder"]["stride"] = 10
                        cache["encoder"]["left"] = 5
                        cache["encoder"]["right"] = 0
                        results = speech2text(cache, speech, speech_length)
                        cache["accum_speech"] -= 4800
                        cache["first_chunk"] = False
                        cache["encoder"]["start_idx"] = -5
                        cache["encoder"]["is_final"] = False
                        wait = False
                    else:
                        if is_final:
                            cache["encoder"]["stride"] = len(cache["speech"]) // 960
                            cache["encoder"]["pad_left"] = 0
                            cache["encoder"]["pad_right"] = 0
                            speech = torch.unsqueeze(cache["speech"], axis=0)
                            speech_length = torch.tensor([len(cache["speech"])])
                            results = speech2text(cache, speech, speech_length)
                            cache["accum_speech"] = 0
                            wait = False
                        else:
                            break
                else:
                    if cache["accum_speech"] >= 19200:
                        cache["encoder"]["start_idx"] += 10
                        cache["encoder"]["stride"] = 10
                        cache["encoder"]["pad_left"] = 5
                        cache["encoder"]["pad_right"] = 5
                        cache["encoder"]["left"] = 0
                        cache["encoder"]["right"] = 0
                        speech = torch.unsqueeze(cache["speech"], axis=0)
                        speech_length = torch.tensor([len(cache["speech"])])
                        results = speech2text(cache, speech, speech_length)
                        cache["accum_speech"] -= 9600
                        wait = False
                    else:
                        if is_final:
                            cache["encoder"]["is_final"] = True
                            if cache["accum_speech"] >= 14400:
                                cache["encoder"]["start_idx"] += 10
                                cache["encoder"]["stride"] = 10
                                cache["encoder"]["pad_left"] = 5
                                cache["encoder"]["pad_right"] = 5
                                cache["encoder"]["left"] = 0
                                cache["encoder"]["right"] = cache["accum_speech"] // 960 - 15
                                speech = torch.unsqueeze(cache["speech"], axis=0)
                                speech_length = torch.tensor([len(cache["speech"])])
                                results = speech2text(cache, speech, speech_length)
                                cache["accum_speech"] -= 9600
                                wait = False
                            else:
                                cache["encoder"]["start_idx"] += 10
                                cache["encoder"]["stride"] = cache["accum_speech"] // 960 - 5
                                cache["encoder"]["pad_left"] = 5
                                cache["encoder"]["pad_right"] = 0
                                cache["encoder"]["left"] = 0
                                cache["encoder"]["right"] = 0
                                speech = torch.unsqueeze(cache["speech"], axis=0)
                                speech_length = torch.tensor([len(cache["speech"])])
                                results = speech2text(cache, speech, speech_length)
                                cache["accum_speech"] = 0
                                wait = False
                        else:
                            break

                if len(results) >= 1:
                    asr_result += results[0][0]
            if asr_result == "":
                asr_result = "sil"
            if wait:
                asr_result = "waiting_for_more_voice"
            item = {'key': "utt", 'value': asr_result}
            asr_result_list.append(item)
        else:
            return []
        return asr_result_list

    return _forward

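For orientation, a minimal driver for the closure returned above might look as follows; the 9600-sample (600 ms at 16 kHz) chunk size is an illustrative choice, and `forward` stands for the result of `inference_modelscope(...)`:

```python
# Sketch: feed audio into the streaming recognizer chunk by chunk.
import numpy as np
import soundfile

speech, fs = soundfile.read("speech.wav")         # 16 kHz mono assumed
param_dict = {"cache": dict(), "is_final": False}
chunk = 9600                                      # 600 ms at 16 kHz (illustrative)
for offset in range(0, len(speech), chunk):
    param_dict["is_final"] = offset + chunk >= len(speech)
    piece = np.asarray(speech[offset:offset + chunk], dtype=np.float32)
    for item in forward(None, raw_inputs=piece, param_dict=param_dict):
        if item["value"] not in ("sil", "waiting_for_more_voice"):
            print(item["value"])
```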
def get_parser():
    parser = config_argparse.ArgumentParser(
        description="ASR Decoding",
        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
    )

    # Note(kamo): Use '_' instead of '-' as separator.
    # '-' is confusing if written in yaml.
    parser.add_argument(
        "--log_level",
        type=lambda x: x.upper(),
        default="INFO",
        choices=("CRITICAL", "ERROR", "WARNING", "INFO", "DEBUG", "NOTSET"),
        help="The verbose level of logging",
    )

    parser.add_argument("--output_dir", type=str, required=True)
    parser.add_argument(
        "--ngpu",
        type=int,
        default=0,
        help="The number of gpus. 0 indicates CPU mode",
    )
    parser.add_argument("--seed", type=int, default=0, help="Random seed")
    parser.add_argument(
        "--dtype",
        default="float32",
        choices=["float16", "float32", "float64"],
        help="Data type",
    )
    parser.add_argument(
        "--num_workers",
        type=int,
        default=1,
        help="The number of workers used for DataLoader",
    )
    parser.add_argument(
        "--hotword",
        type=str_or_none,
        default=None,
        help="hotword file path or hotwords separated by space"
    )
    group = parser.add_argument_group("Input data related")
    group.add_argument(
        "--data_path_and_name_and_type",
        type=str2triple_str,
        required=False,
        action="append",
    )
    group.add_argument("--key_file", type=str_or_none)
    group.add_argument("--allow_variable_data_keys", type=str2bool, default=False)

    group = parser.add_argument_group("The model configuration related")
    group.add_argument(
        "--asr_train_config",
        type=str,
        help="ASR training configuration",
    )
    group.add_argument(
        "--asr_model_file",
        type=str,
        help="ASR model parameter file",
    )
    group.add_argument(
        "--cmvn_file",
        type=str,
        help="Global cmvn file",
    )
    group.add_argument(
        "--lm_train_config",
        type=str,
        help="LM training configuration",
    )
    group.add_argument(
        "--lm_file",
        type=str,
        help="LM parameter file",
    )
    group.add_argument(
        "--word_lm_train_config",
        type=str,
        help="Word LM training configuration",
    )
    group.add_argument(
        "--word_lm_file",
        type=str,
        help="Word LM parameter file",
    )
    group.add_argument(
        "--ngram_file",
        type=str,
        help="N-gram parameter file",
    )
    group.add_argument(
        "--model_tag",
        type=str,
        help="Pretrained model tag. If specify this option, *_train_config and "
        "*_file will be overwritten",
    )

    group = parser.add_argument_group("Beam-search related")
    group.add_argument(
        "--batch_size",
        type=int,
        default=1,
        help="The batch size for inference",
    )
    group.add_argument("--nbest", type=int, default=1, help="Output N-best hypotheses")
    group.add_argument("--beam_size", type=int, default=20, help="Beam size")
    group.add_argument("--penalty", type=float, default=0.0, help="Insertion penalty")
    group.add_argument(
        "--maxlenratio",
        type=float,
        default=0.0,
        help="Input length ratio to obtain max output length. "
        "If maxlenratio=0.0 (default), it uses an end-detect "
        "function "
        "to automatically find maximum hypothesis lengths."
        "If maxlenratio<0.0, its absolute value is interpreted"
        "as a constant max output length",
    )
    group.add_argument(
        "--minlenratio",
        type=float,
        default=0.0,
        help="Input length ratio to obtain min output length",
    )
    group.add_argument(
        "--ctc_weight",
        type=float,
        default=0.5,
        help="CTC weight in joint decoding",
    )
    group.add_argument("--lm_weight", type=float, default=1.0, help="RNNLM weight")
    group.add_argument("--ngram_weight", type=float, default=0.9, help="ngram weight")
    group.add_argument("--streaming", type=str2bool, default=False)

    group.add_argument(
        "--frontend_conf",
        default=None,
        help="",
    )
    group.add_argument("--raw_inputs", type=list, default=None)
    # example=[{'key':'EdevDEWdIYQ_0021','file':'/mnt/data/jiangyu.xzy/test_data/speech_io/SPEECHIO_ASR_ZH00007_zhibodaihuo/wav/EdevDEWdIYQ_0021.wav'}])

    group = parser.add_argument_group("Text converter related")
    group.add_argument(
        "--token_type",
        type=str_or_none,
        default=None,
        choices=["char", "bpe", None],
        help="The token type for ASR model. "
        "If not given, refers from the training args",
    )
    group.add_argument(
        "--bpemodel",
        type=str_or_none,
        default=None,
        help="The model path of sentencepiece. "
        "If not given, refers from the training args",
    )

    return parser


def main(cmd=None):
    print(get_commandline_args(), file=sys.stderr)
    parser = get_parser()
    args = parser.parse_args(cmd)
    param_dict = {'hotword': args.hotword}
    kwargs = vars(args)
    kwargs.pop("config", None)
    kwargs['param_dict'] = param_dict
    inference(**kwargs)


if __name__ == "__main__":
    main()

# from modelscope.pipelines import pipeline
# from modelscope.utils.constant import Tasks
#
# inference_16k_pipline = pipeline(
#     task=Tasks.auto_speech_recognition,
#     model='damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch')
#
# rec_result = inference_16k_pipline(audio_in='https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/asr_example_zh.wav')
# print(rec_result)

@@ -338,7 +338,7 @@ def inference_modelscope(
     ibest_writer["token"][key] = " ".join(token)
     ibest_writer["token_int"][key] = " ".join(map(str, token_int))
     ibest_writer["vad"][key] = "{}".format(vadsegments)
-    ibest_writer["text"][key] = text_postprocessed
+    ibest_writer["text"][key] = " ".join(word_lists)
     ibest_writer["text_with_punc"][key] = text_postprocessed_punc
     if time_stamp_postprocessed is not None:
         ibest_writer["time_stamp"][key] = "{}".format(time_stamp_postprocessed)

@@ -58,7 +58,7 @@ class Speech2Text:

     Examples:
         >>> import soundfile
-        >>> speech2text = Speech2Text("asr_config.yml", "asr.pth")
+        >>> speech2text = Speech2Text("asr_config.yml", "asr.pb")
         >>> audio, rate = soundfile.read("speech.wav")
        >>> speech2text(audio)
         [(text, token, token_int, hypothesis object), ...]
@@ -292,6 +292,8 @@ class Speech2Text:

     # remove blank symbol id, which is assumed to be 0
     token_int = list(filter(lambda x: x != 0 and x != 2, token_int))
+    if len(token_int) == 0:
+        continue

     # Change integer-ids to tokens
     token = self.converter.ids2tokens(token_int)

@@ -668,7 +670,7 @@ def inference_modelscope(
     ibest_writer["token"][key] = " ".join(token)
     ibest_writer["token_int"][key] = " ".join(map(str, token_int))
     ibest_writer["vad"][key] = "{}".format(vadsegments)
-    ibest_writer["text"][key] = text_postprocessed
+    ibest_writer["text"][key] = " ".join(word_lists)
     ibest_writer["text_with_punc"][key] = text_postprocessed_punc
     if time_stamp_postprocessed is not None:
         ibest_writer["time_stamp"][key] = "{}".format(time_stamp_postprocessed)

@@ -49,7 +49,7 @@ class Speech2Text:

     Examples:
         >>> import soundfile
-        >>> speech2text = Speech2Text("asr_config.yml", "asr.pth")
+        >>> speech2text = Speech2Text("asr_config.yml", "asr.pb")
         >>> audio, rate = soundfile.read("speech.wav")
         >>> speech2text(audio)
         [(text, token, token_int, hypothesis object), ...]
@@ -738,13 +738,13 @@ def inference_modelscope(
             ibest_writer["rtf"][key] = rtf_cur

         if text is not None:
-            text_postprocessed, _ = postprocess_utils.sentence_postprocess(token)
+            text_postprocessed, word_lists = postprocess_utils.sentence_postprocess(token)
             item = {'key': key, 'value': text_postprocessed}
             asr_result_list.append(item)
             finish_count += 1
             # asr_utils.print_progress(finish_count / file_count)
             if writer is not None:
-                ibest_writer["text"][key] = text_postprocessed
+                ibest_writer["text"][key] = " ".join(word_lists)

     logging.info("decoding, utt: {}, predictions: {}".format(key, text))
     rtf_avg = "decoding, feature length total: {}, forward_time total: {:.4f}, rtf avg: {:.4f}".format(length_total, forward_time_total, 100 * forward_time_total / (length_total * lfr_factor))

@@ -37,16 +37,13 @@ from funasr.utils import asr_utils, wav_utils, postprocess_utils
 from funasr.models.frontend.wav_frontend import WavFrontend


 header_colors = '\033[95m'
 end_colors = '\033[0m'


 class Speech2Text:
     """Speech2Text class

     Examples:
         >>> import soundfile
-        >>> speech2text = Speech2Text("asr_config.yml", "asr.pth")
+        >>> speech2text = Speech2Text("asr_config.yml", "asr.pb")
         >>> audio, rate = soundfile.read("speech.wav")
         >>> speech2text(audio)
         [(text, token, token_int, hypothesis object), ...]
@@ -261,6 +258,7 @@ class Speech2Text:

     # Change integer-ids to tokens
     token = self.converter.ids2tokens(token_int)
+    token = list(filter(lambda x: x != "<gbg>", token))

     if self.tokenizer is not None:
         text = self.tokenizer.tokens2text(token)
@@ -506,13 +504,13 @@ def inference_modelscope(
             ibest_writer["score"][key] = str(hyp.score)

         if text is not None:
-            text_postprocessed, _ = postprocess_utils.sentence_postprocess(token)
+            text_postprocessed, word_lists = postprocess_utils.sentence_postprocess(token)
             item = {'key': key, 'value': text_postprocessed}
             asr_result_list.append(item)
             finish_count += 1
             asr_utils.print_progress(finish_count / file_count)
             if writer is not None:
-                ibest_writer["text"][key] = text
+                ibest_writer["text"][key] = " ".join(word_lists)
     return asr_result_list

 return _forward

@@ -46,7 +46,7 @@ class Speech2Text:

     Examples:
         >>> import soundfile
-        >>> speech2text = Speech2Text("asr_config.yml", "asr.pth")
+        >>> speech2text = Speech2Text("asr_config.yml", "asr.pb")
         >>> audio, rate = soundfile.read("speech.wav")
         >>> speech2text(audio)
         [(text, token, token_int, hypothesis object), ...]
@@ -261,6 +261,7 @@ class Speech2Text:

     # Change integer-ids to tokens
     token = self.converter.ids2tokens(token_int)
+    token = list(filter(lambda x: x != "<gbg>", token))

     if self.tokenizer is not None:
         text = self.tokenizer.tokens2text(token)
@@ -506,13 +507,13 @@ def inference_modelscope(
             ibest_writer["score"][key] = str(hyp.score)

         if text is not None:
-            text_postprocessed, _ = postprocess_utils.sentence_postprocess(token)
+            text_postprocessed, word_lists = postprocess_utils.sentence_postprocess(token)
             item = {'key': key, 'value': text_postprocessed}
             asr_result_list.append(item)
             finish_count += 1
             asr_utils.print_progress(finish_count / file_count)
             if writer is not None:
-                ibest_writer["text"][key] = text
+                ibest_writer["text"][key] = " ".join(word_lists)
     return asr_result_list

 return _forward

@@ -133,7 +133,7 @@ def inference_launch(mode, **kwargs):
         param_dict = {
             "extract_profile": True,
             "sv_train_config": "sv.yaml",
-            "sv_model_file": "sv.pth",
+            "sv_model_file": "sv.pb",
         }
     if "param_dict" in kwargs and kwargs["param_dict"] is not None:
         for key in param_dict:
@@ -142,6 +142,9 @@ def inference_launch(mode, **kwargs):
         else:
             kwargs["param_dict"] = param_dict
         return inference_modelscope(mode=mode, **kwargs)
+    elif mode == "eend-ola":
+        from funasr.bin.eend_ola_inference import inference_modelscope
+        return inference_modelscope(mode=mode, **kwargs)
     else:
         logging.info("Unknown decoding mode: {}".format(mode))
         return None

427  funasr/bin/eend_ola_inference.py  (new executable file)
@@ -0,0 +1,427 @@
#!/usr/bin/env python3
# Copyright FunASR (https://github.com/alibaba-damo-academy/FunASR). All Rights Reserved.
# MIT License (https://opensource.org/licenses/MIT)

import argparse
import logging
import os
import sys
from pathlib import Path
from typing import Any
from typing import List
from typing import Optional
from typing import Sequence
from typing import Tuple
from typing import Union

import numpy as np
import torch
from scipy.signal import medfilt
from typeguard import check_argument_types

from funasr.models.frontend.wav_frontend import WavFrontendMel23
from funasr.tasks.diar import EENDOLADiarTask
from funasr.torch_utils.device_funcs import to_device
from funasr.utils import config_argparse
from funasr.utils.cli_utils import get_commandline_args
from funasr.utils.types import str2bool
from funasr.utils.types import str2triple_str
from funasr.utils.types import str_or_none


class Speech2Diarization:
    """Speech2Diarization class

    Examples:
        >>> import soundfile
        >>> import numpy as np
        >>> speech2diar = Speech2Diarization("diar_sond_config.yml", "diar_sond.pb")
        >>> profile = np.load("profiles.npy")
        >>> audio, rate = soundfile.read("speech.wav")
        >>> speech2diar(audio, profile)
        {"spk1": [(int, int), ...], ...}

    """

    def __init__(
        self,
        diar_train_config: Union[Path, str] = None,
        diar_model_file: Union[Path, str] = None,
        device: str = "cpu",
        dtype: str = "float32",
    ):
        assert check_argument_types()

        # 1. Build Diarization model
        diar_model, diar_train_args = EENDOLADiarTask.build_model_from_file(
            config_file=diar_train_config,
            model_file=diar_model_file,
            device=device
        )
        frontend = None
        if diar_train_args.frontend is not None and diar_train_args.frontend_conf is not None:
            frontend = WavFrontendMel23(**diar_train_args.frontend_conf)

        # set up seed for eda
        np.random.seed(diar_train_args.seed)
        torch.manual_seed(diar_train_args.seed)
        torch.cuda.manual_seed(diar_train_args.seed)
        os.environ['PYTORCH_SEED'] = str(diar_train_args.seed)
        logging.info("diar_model: {}".format(diar_model))
        logging.info("diar_train_args: {}".format(diar_train_args))
        diar_model.to(dtype=getattr(torch, dtype)).eval()

        self.diar_model = diar_model
        self.diar_train_args = diar_train_args
        self.device = device
        self.dtype = dtype
        self.frontend = frontend

    @torch.no_grad()
    def __call__(
        self,
        speech: Union[torch.Tensor, np.ndarray],
        speech_lengths: Union[torch.Tensor, np.ndarray] = None
    ):
        """Inference

        Args:
            speech: Input speech data
        Returns:
            diarization results

        """
        assert check_argument_types()
        # Input as audio signal
        if isinstance(speech, np.ndarray):
            speech = torch.tensor(speech)

        if self.frontend is not None:
            feats, feats_len = self.frontend.forward(speech, speech_lengths)
            feats = to_device(feats, device=self.device)
            feats_len = feats_len.int()
            self.diar_model.frontend = None
        else:
            feats = speech
            feats_len = speech_lengths
        batch = {"speech": feats, "speech_lengths": feats_len}
        batch = to_device(batch, device=self.device)
        results = self.diar_model.estimate_sequential(**batch)

        return results

    @staticmethod
    def from_pretrained(
        model_tag: Optional[str] = None,
        **kwargs: Optional[Any],
    ):
        """Build Speech2Diarization instance from the pretrained model.

        Args:
            model_tag (Optional[str]): Model tag of the pretrained models.
                Currently, the tags of espnet_model_zoo are supported.

        Returns:
            Speech2Diarization: Speech2Diarization instance.

        """
        if model_tag is not None:
            try:
                from espnet_model_zoo.downloader import ModelDownloader

            except ImportError:
                logging.error(
                    "`espnet_model_zoo` is not installed. "
                    "Please install via `pip install -U espnet_model_zoo`."
                )
                raise
            d = ModelDownloader()
            kwargs.update(**d.download_and_unpack(model_tag))

        return Speech2Diarization(**kwargs)


def inference_modelscope(
    diar_train_config: str,
    diar_model_file: str,
    output_dir: Optional[str] = None,
    batch_size: int = 1,
    dtype: str = "float32",
    ngpu: int = 1,
    num_workers: int = 0,
    log_level: Union[int, str] = "INFO",
    key_file: Optional[str] = None,
    model_tag: Optional[str] = None,
    allow_variable_data_keys: bool = True,
    streaming: bool = False,
    param_dict: Optional[dict] = None,
    **kwargs,
):
    assert check_argument_types()
    if batch_size > 1:
        raise NotImplementedError("batch decoding is not implemented")
    if ngpu > 1:
        raise NotImplementedError("only single GPU decoding is supported")

    logging.basicConfig(
        level=log_level,
        format="%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s",
    )
    logging.info("param_dict: {}".format(param_dict))

    if ngpu >= 1 and torch.cuda.is_available():
        device = "cuda"
    else:
        device = "cpu"

    # 1. Build speech2diar
    speech2diar_kwargs = dict(
        diar_train_config=diar_train_config,
        diar_model_file=diar_model_file,
        device=device,
        dtype=dtype,
    )
    logging.info("speech2diarization_kwargs: {}".format(speech2diar_kwargs))
    speech2diar = Speech2Diarization.from_pretrained(
        model_tag=model_tag,
        **speech2diar_kwargs,
    )
    speech2diar.diar_model.eval()

    def output_results_str(results: dict, uttid: str):
        rst = []
        mid = uttid.rsplit("-", 1)[0]
        for key in results:
            results[key] = [(x[0] / 100, x[1] / 100) for x in results[key]]
        template = "SPEAKER {} 0 {:.2f} {:.2f} <NA> <NA> {} <NA> <NA>"
        for spk, segs in results.items():
            rst.extend([template.format(mid, st, ed, spk) for st, ed in segs])

        return "\n".join(rst)

    def _forward(
        data_path_and_name_and_type: Sequence[Tuple[str, str, str]] = None,
        raw_inputs: List[List[Union[np.ndarray, torch.Tensor, str, bytes]]] = None,
        output_dir_v2: Optional[str] = None,
        param_dict: Optional[dict] = None,
    ):
        # 2. Build data-iterator
        if data_path_and_name_and_type is None and raw_inputs is not None:
            if isinstance(raw_inputs, torch.Tensor):
                raw_inputs = raw_inputs.numpy()
            data_path_and_name_and_type = [raw_inputs[0], "speech", "sound"]
        loader = EENDOLADiarTask.build_streaming_iterator(
            data_path_and_name_and_type,
            dtype=dtype,
            batch_size=batch_size,
            key_file=key_file,
            num_workers=num_workers,
            preprocess_fn=EENDOLADiarTask.build_preprocess_fn(speech2diar.diar_train_args, False),
            collate_fn=EENDOLADiarTask.build_collate_fn(speech2diar.diar_train_args, False),
            allow_variable_data_keys=allow_variable_data_keys,
            inference=True,
        )

        # 3. Start for-loop
        output_path = output_dir_v2 if output_dir_v2 is not None else output_dir
        if output_path is not None:
            os.makedirs(output_path, exist_ok=True)
            output_writer = open("{}/result.txt".format(output_path), "w")
        result_list = []
        for keys, batch in loader:
            assert isinstance(batch, dict), type(batch)
            assert all(isinstance(s, str) for s in keys), keys
            _bs = len(next(iter(batch.values())))
            assert len(keys) == _bs, f"{len(keys)} != {_bs}"
            # batch = {k: v[0] for k, v in batch.items() if not k.endswith("_lengths")}

            results = speech2diar(**batch)

            # post process
            a = results[0][0].cpu().numpy()
            a = medfilt(a, (11, 1))
            rst = []
            for spkid, frames in enumerate(a.T):
                frames = np.pad(frames, (1, 1), 'constant')
                changes, = np.where(np.diff(frames, axis=0) != 0)
                fmt = "SPEAKER {:s} 1 {:7.2f} {:7.2f} <NA> <NA> {:s} <NA>"
                for s, e in zip(changes[::2], changes[1::2]):
                    st = s / 10.
                    dur = (e - s) / 10.
                    rst.append(fmt.format(keys[0], st, dur, "{}_{}".format(keys[0], str(spkid))))

            # Only supporting batch_size==1
            value = "\n".join(rst)
            item = {"key": keys[0], "value": value}
            result_list.append(item)
            if output_path is not None:
                output_writer.write(value)
                output_writer.flush()

        if output_path is not None:
            output_writer.close()

        return result_list

    return _forward

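The strings built in the loop above are RTTM records, `SPEAKER <file> <chan> <onset> <dur> <NA> <NA> <name> <NA>`, with onsets and durations in seconds (frame indices are divided by 10, i.e. 100 ms frames, after the 11-frame median filter smooths the activity matrix). A small sketch of reading them back:

```python
# Sketch: parse the RTTM lines emitted by _forward above.
def parse_rttm(lines):
    segments = []
    for line in lines:
        fields = line.split()
        if fields and fields[0] == "SPEAKER":
            # (speaker name, onset seconds, duration seconds)
            segments.append((fields[7], float(fields[3]), float(fields[4])))
    return segments
```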
|
||||
|
||||
def inference(
|
||||
data_path_and_name_and_type: Sequence[Tuple[str, str, str]],
|
||||
diar_train_config: Optional[str],
|
||||
diar_model_file: Optional[str],
|
||||
output_dir: Optional[str] = None,
|
||||
batch_size: int = 1,
|
||||
dtype: str = "float32",
|
||||
ngpu: int = 0,
|
||||
seed: int = 0,
|
||||
num_workers: int = 1,
|
||||
log_level: Union[int, str] = "INFO",
|
||||
key_file: Optional[str] = None,
|
||||
model_tag: Optional[str] = None,
|
||||
allow_variable_data_keys: bool = True,
|
||||
streaming: bool = False,
|
||||
smooth_size: int = 83,
|
||||
dur_threshold: int = 10,
|
||||
out_format: str = "vad",
|
||||
**kwargs,
|
||||
):
|
||||
inference_pipeline = inference_modelscope(
|
||||
diar_train_config=diar_train_config,
|
||||
diar_model_file=diar_model_file,
|
||||
output_dir=output_dir,
|
||||
batch_size=batch_size,
|
||||
dtype=dtype,
|
||||
ngpu=ngpu,
|
||||
seed=seed,
|
||||
num_workers=num_workers,
|
||||
log_level=log_level,
|
||||
key_file=key_file,
|
||||
model_tag=model_tag,
|
||||
allow_variable_data_keys=allow_variable_data_keys,
|
||||
streaming=streaming,
|
||||
smooth_size=smooth_size,
|
||||
dur_threshold=dur_threshold,
|
||||
out_format=out_format,
|
||||
**kwargs,
|
||||
)
|
||||
|
||||
return inference_pipeline(data_path_and_name_and_type, raw_inputs=None)
|
||||
|
||||
|
||||
def get_parser():
    parser = config_argparse.ArgumentParser(
        description="Speaker diarization inference",
        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
    )

    # Note(kamo): Use '_' instead of '-' as separator.
    # '-' is confusing if written in yaml.
    parser.add_argument(
        "--log_level",
        type=lambda x: x.upper(),
        default="INFO",
        choices=("CRITICAL", "ERROR", "WARNING", "INFO", "DEBUG", "NOTSET"),
        help="The verbose level of logging",
    )

    parser.add_argument("--output_dir", type=str, required=False)
    parser.add_argument(
        "--ngpu",
        type=int,
        default=0,
        help="The number of gpus. 0 indicates CPU mode",
    )
    parser.add_argument(
        "--gpuid_list",
        type=str,
        default="",
        help="The visible gpus",
    )
    parser.add_argument("--seed", type=int, default=0, help="Random seed")
    parser.add_argument(
        "--dtype",
        default="float32",
        choices=["float16", "float32", "float64"],
        help="Data type",
    )
    parser.add_argument(
        "--num_workers",
        type=int,
        default=1,
        help="The number of workers used for DataLoader",
    )

    group = parser.add_argument_group("Input data related")
    group.add_argument(
        "--data_path_and_name_and_type",
        type=str2triple_str,
        required=False,
        action="append",
    )
    group.add_argument("--key_file", type=str_or_none)
    group.add_argument("--allow_variable_data_keys", type=str2bool, default=False)

    group = parser.add_argument_group("The model configuration related")
    group.add_argument(
        "--diar_train_config",
        type=str,
        help="Diarization training configuration",
    )
    group.add_argument(
        "--diar_model_file",
        type=str,
        help="Diarization model parameter file",
    )
    group.add_argument(
        "--dur_threshold",
        type=int,
        default=10,
        help="The threshold for short segments, in number of frames",
    )
    group.add_argument(
        "--smooth_size",
        type=int,
        default=83,
        help="The smoothing window length, in number of frames",
    )
    group.add_argument(
        "--model_tag",
        type=str,
        help="Pretrained model tag. If this option is specified, *_train_config and "
        "*_file will be overwritten",
    )
    parser.add_argument(
        "--batch_size",
        type=int,
        default=1,
        help="The batch size for inference",
    )
    parser.add_argument("--streaming", type=str2bool, default=False)

    return parser


def main(cmd=None):
    print(get_commandline_args(), file=sys.stderr)
    parser = get_parser()
    args = parser.parse_args(cmd)
    kwargs = vars(args)
    kwargs.pop("config", None)
    logging.info("args: {}".format(kwargs))
    if args.output_dir is None:
        jobid, n_gpu = 1, 1
        gpuid = args.gpuid_list.split(",")[jobid - 1]
    else:
        # The job id is parsed from the output_dir suffix ("exp/.../output.1" -> job 1)
        # and mapped round-robin onto the GPUs in --gpuid_list.
        jobid = int(args.output_dir.split(".")[-1])
        n_gpu = len(args.gpuid_list.split(","))
        gpuid = args.gpuid_list.split(",")[(jobid - 1) % n_gpu]
    os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    os.environ["CUDA_VISIBLE_DEVICES"] = gpuid
    results_list = inference(**kwargs)
    for results in results_list:
        print("{} {}".format(results["key"], results["value"]))


if __name__ == "__main__":
    main()
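A minimal sketch of driving this CLI in-process, assuming the script lives at `funasr/bin/diar_inference.py` (the config/model paths and the scp triple below are hypothetical placeholders, not shipped files):

```python
# Hedged sketch: invoke the diarization inference entry point programmatically.
# "diar.yaml" / "diar.pb" / "wav.scp" are placeholder paths.
from funasr.bin.diar_inference import main  # assumed module path

main(cmd=[
    "--diar_train_config", "diar.yaml",
    "--diar_model_file", "diar.pb",
    "--data_path_and_name_and_type", "wav.scp,speech,sound",
    "--ngpu", "0",
    "--output_dir", "exp/diar_results/output.1",  # suffix ".1" selects job 1
])
```

Note that `parse_args(cmd)` accepts the list directly, so the same flags work from a shell or from a Python driver.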
@@ -23,7 +23,7 @@ from funasr.torch_utils.set_all_random_seed import set_all_random_seed
 from funasr.utils import config_argparse
 from funasr.utils.types import str2triple_str
 from funasr.utils.types import str_or_none
-from funasr.punctuation.text_preprocessor import split_to_mini_sentence
+from funasr.datasets.preprocessor import split_to_mini_sentence


 class Text2Punc:
@@ -69,6 +69,7 @@ class Text2Punc:
             precache = "".join(cache)
         else:
             precache = ""
+            cache = []
         data = {"text": precache + text}
         result = self.preprocessor(data=data, uid="12938712838719")
         split_text = self.preprocessor.pop_split_text_data(result)
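The `cache = []` initialization added above matters for streaming punctuation: `Text2Punc` carries the unpunctuated tail of the previous chunk in `cache`, and an uninitialized cache would leak stale text into the `data` dict. A hedged usage sketch (the constructor arguments are placeholders; the three-value return shape is inferred from the later hunks in this diff):

```python
# Hedged sketch: incremental punctuation restoration with a rolling cache.
# "punc.yaml" / "punc.pb" are placeholder paths.
text2punc = Text2Punc("punc.yaml", "punc.pb")

cache = []
for chunk in ["跨境河流是养育沿岸", "人民的生命之源"]:
    result, _, cache = text2punc(chunk, cache)  # cache carries the unfinished tail
    print(result)
```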
@@ -225,7 +226,7 @@ def inference_modelscope(
     ):
         results = []
         split_size = 10
-
+        cache_in = param_dict["cache"]
         if raw_inputs != None:
             line = raw_inputs.strip()
             key = "demo"
@@ -233,35 +234,12 @@ def inference_modelscope(
                 item = {'key': key, 'value': ""}
                 results.append(item)
                 return results
-            #import pdb;pdb.set_trace()
-            result, _, cache = text2punc(line, cache)
-            item = {'key': key, 'value': result, 'cache': cache}
+            result, _, cache = text2punc(line, cache_in)
+            param_dict["cache"] = cache
+            item = {'key': key, 'value': result}
             results.append(item)
             return results

-        for inference_text, _, _ in data_path_and_name_and_type:
-            with open(inference_text, "r", encoding="utf-8") as fin:
-                for line in fin:
-                    line = line.strip()
-                    segs = line.split("\t")
-                    if len(segs) != 2:
-                        continue
-                    key = segs[0]
-                    if len(segs[1]) == 0:
-                        continue
-                    result, _ = text2punc(segs[1])
-                    item = {'key': key, 'value': result}
-                    results.append(item)
-        output_path = output_dir_v2 if output_dir_v2 is not None else output_dir
-        if output_path != None:
-            output_file_name = "infer.out"
-            Path(output_path).mkdir(parents=True, exist_ok=True)
-            output_file_path = (Path(output_path) / output_file_name).absolute()
-            with open(output_file_path, "w", encoding="utf-8") as fout:
-                for item_i in results:
-                    key_out = item_i["key"]
-                    value_out = item_i["value"]
-                    fout.write(f"{key_out}\t{value_out}\n")
         return results

     return _forward
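Net effect of the two hunks above: the punctuation cache now travels in `param_dict` rather than being attached to each result item, so repeated calls to the returned closure behave statefully. A hedged caller-side sketch (the closure is named `_forward` in this diff; the keyword names come from identifiers visible above, everything else is assumed):

```python
# Hedged sketch: streaming punctuation through the modelscope-style closure.
# `infer` stands for the `_forward` returned by inference_modelscope(...);
# constructing it is elided here.
param_dict = {"cache": []}
for chunk in ["那今天的会就到这里吧", "好的谢谢大家"]:
    results = infer(None, raw_inputs=chunk, output_dir_v2=None, param_dict=param_dict)
    print(results[0]["value"])  # punctuated text; the cache persists in param_dict
```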
@@ -42,7 +42,7 @@ class Speech2Diarization:
     Examples:
         >>> import soundfile
        >>> import numpy as np
-        >>> speech2diar = Speech2Diarization("diar_sond_config.yml", "diar_sond.pth")
+        >>> speech2diar = Speech2Diarization("diar_sond_config.yml", "diar_sond.pb")
         >>> profile = np.load("profiles.npy")
         >>> audio, rate = soundfile.read("speech.wav")
         >>> speech2diar(audio, profile)
@@ -36,7 +36,7 @@ class Speech2Xvector:

     Examples:
         >>> import soundfile
-        >>> speech2xvector = Speech2Xvector("sv_config.yml", "sv.pth")
+        >>> speech2xvector = Speech2Xvector("sv_config.yml", "sv.pb")
         >>> audio, rate = soundfile.read("speech.wav")
         >>> speech2xvector(audio)
         [(text, token, token_int, hypothesis object), ...]
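Since `Speech2Xvector` yields an embedding, verification in practice means comparing two embeddings. A minimal sketch, assuming the call returns a 1-D vector (the return type is not shown in this diff, and the 0.5 threshold is illustrative):

```python
# Hedged sketch: score two utterances by cosine similarity of their x-vectors.
import numpy as np
import soundfile

emb_a = np.asarray(speech2xvector(soundfile.read("spk_a.wav")[0]))
emb_b = np.asarray(speech2xvector(soundfile.read("spk_b.wav")[0]))
score = float(emb_a @ emb_b / (np.linalg.norm(emb_a) * np.linalg.norm(emb_b) + 1e-8))
same_speaker = score > 0.5  # a real threshold would be tuned on held-out trials
```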
@@ -169,7 +169,7 @@ def inference_modelscope(
     log_level: Union[int, str] = "INFO",
     key_file: Optional[str] = None,
     sv_train_config: Optional[str] = "sv.yaml",
-    sv_model_file: Optional[str] = "sv.pth",
+    sv_model_file: Optional[str] = "sv.pb",
     model_tag: Optional[str] = None,
     allow_variable_data_keys: bool = True,
     streaming: bool = False,
@@ -116,8 +116,8 @@ class SpeechText2Timestamp:
             enc = enc[0]

         # c. Forward Predictor
-        _, _, us_alphas, us_cif_peak = self.tp_model.calc_predictor_timestamp(enc, enc_len, text_lengths.to(self.device)+1)
-        return us_alphas, us_cif_peak
+        _, _, us_alphas, us_peaks = self.tp_model.calc_predictor_timestamp(enc, enc_len, text_lengths.to(self.device)+1)
+        return us_alphas, us_peaks


 def inference(
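The rename to `us_peaks` reflects that these are upsampled CIF peak values; token boundaries are typically recovered by thresholding the peak sequence. A rough sketch under that assumption (the frame shift and the 0.5 firing threshold are illustrative, not taken from this diff):

```python
# Hedged sketch: turn upsampled CIF peaks into per-token (start, end) times.
import torch

def peaks_to_timestamps(us_peaks: torch.Tensor, frame_shift_s: float = 0.01):
    # Positions where the peak value fires, i.e. exceeds the threshold.
    fire_idx = torch.nonzero(us_peaks > 0.5).squeeze(-1).tolist()
    # Treat consecutive firing positions as bracketing one token.
    return [
        (round(s * frame_shift_s, 3), round(e * frame_shift_s, 3))
        for s, e in zip(fire_idx[:-1], fire_idx[1:])
    ]
```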
@@ -1,5 +1,6 @@
 import argparse
 import logging
 import os
 import sys
+import json
 from pathlib import Path
@@ -266,7 +267,8 @@ def inference_modelscope(
             # do vad segment
             _, results = speech2vadsegment(**batch)
             for i, _ in enumerate(keys):
-                results[i] = json.dumps(results[i])
+                if "MODELSCOPE_ENVIRONMENT" in os.environ and os.environ["MODELSCOPE_ENVIRONMENT"] == "eas":
+                    results[i] = json.dumps(results[i])
                 item = {'key': keys[i], 'value': results[i]}
                 vad_results.append(item)
             if writer is not None:
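The hunk above stops unconditional serialization: VAD segments stay Python lists for local use and become JSON strings only under the EAS serving environment. A hedged consumer-side sketch (variable names are illustrative; the `[[start_ms, end_ms], ...]` segment layout follows the FunASR VAD convention):

```python
# Hedged sketch: read one VAD result on either side of the EAS switch.
import json
import os

value = vad_item["value"]  # vad_item stands for one entry of vad_results above
if os.environ.get("MODELSCOPE_ENVIRONMENT") == "eas":
    segments = json.loads(value)  # serialized for the serving environment
else:
    segments = value  # already a list of [start_ms, end_ms] pairs
print(segments)
```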
Some files were not shown because too many files have changed in this diff.