Merge branch 'main' of github.com:alibaba-damo-academy/FunASR into dev_dzh

志浩 2023-04-07 21:03:34 +08:00
commit 4137f5cf26
767 changed files with 201784 additions and 18635 deletions

View File

@ -15,36 +15,10 @@
| [**Model Zoo**](https://www.modelscope.cn/models/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary)
| [**Contact**](#contact)
## What's new:
### 2023.2.17, funasr-0.2.0, modelscope-1.3.0
- We support a new feature: exporting Paraformer models to [onnx and torchscript](https://github.com/alibaba-damo-academy/FunASR/tree/main/funasr/export) from modelscope. Local finetuned models are also supported.
- We support a new feature, [onnxruntime](https://github.com/alibaba-damo-academy/FunASR/tree/main/funasr/runtime/python): you can deploy the runtime without modelscope or funasr. For the [paraformer-large](https://www.modelscope.cn/models/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary) model, onnxruntime gives a 3x RTF speedup (0.110->0.038) on CPU, [details](https://github.com/alibaba-damo-academy/FunASR/tree/main/funasr/runtime/python/onnxruntime/paraformer/rapid_paraformer#speed).
- We support a new feature, [grpc](https://github.com/alibaba-damo-academy/FunASR/tree/main/funasr/runtime/python/grpc): you can build an ASR service with grpc by deploying either the modelscope pipeline or onnxruntime.
- We release a new model, [paraformer-large-contextual](https://www.modelscope.cn/models/damo/speech_paraformer-large-contextual_asr_nat-zh-cn-16k-common-vocab8404/summary), which supports hotword customization based on incentive enhancement and improves the recall and precision of hotwords.
- We optimize the timestamp alignment of [Paraformer-large-long](https://modelscope.cn/models/damo/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary): timestamp prediction accuracy is much improved, achieving an accumulated average shift (AAS) of 74.7ms, [details](https://arxiv.org/abs/2301.12343).
- We release a new model, the [8k VAD model](https://modelscope.cn/models/damo/speech_fsmn_vad_zh-cn-16k-common-pytorch/summary), which predicts the duration of non-silence speech. It can be freely integrated with any ASR model in [modelscope](https://github.com/alibaba-damo-academy/FunASR/discussions/134).
- We release a new model, [MFCCA](https://www.modelscope.cn/models/NPU-ASLP/speech_mfcca_asr-zh-cn-16k-alimeeting-vocab4950/summary), a multi-channel multi-speaker model which is independent of the number and geometry of microphones and supports Mandarin meeting transcription.
- We release several new UniASR models:
[Southern Fujian Dialect model](https://modelscope.cn/models/damo/speech_UniASR_asr_2pass-minnan-16k-common-vocab3825/summary),
[French model](https://modelscope.cn/models/damo/speech_UniASR_asr_2pass-fr-16k-common-vocab3472-tensorflow1-online/summary),
[German model](https://modelscope.cn/models/damo/speech_UniASR_asr_2pass-de-16k-common-vocab3690-tensorflow1-online/summary),
[Vietnamese model](https://modelscope.cn/models/damo/speech_UniASR_asr_2pass-vi-16k-common-vocab1001-pytorch-online/summary),
[Persian model](https://modelscope.cn/models/damo/speech_UniASR_asr_2pass-fa-16k-common-vocab1257-pytorch-online/summary).
- We release a new model, the [paraformer-data2vec model](https://www.modelscope.cn/models/damo/speech_data2vec_pretrain-paraformer-zh-cn-aishell2-16k/summary), unsupervised-pretrained on AISHELL-2, which is used to initialize a Paraformer model that is then finetuned on AISHELL-1.
- We release a new feature: the `VAD`, `ASR` and `PUNC` models can be integrated freely, using either models from [modelscope](https://www.modelscope.cn/models/damo/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary) or local finetuned models. See the [demo](https://github.com/alibaba-damo-academy/FunASR/discussions/134).
- We optimized the [punctuation common model](https://www.modelscope.cn/models/damo/punc_ct-transformer_zh-cn-common-vocab272727-pytorch/summary), enhancing recall and precision and fixing bad cases of missing punctuation marks.
- Various new audio input types are now supported by the modelscope inference pipeline, including mp3, flac, ogg, opus...
### 2023.1.16, funasr-0.1.6, modelscope-1.2.0
- We release a new model, [Paraformer-large-long](https://modelscope.cn/models/damo/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary), which integrates the [VAD](https://modelscope.cn/models/damo/speech_fsmn_vad_zh-cn-16k-common-pytorch/summary) model, the [ASR](https://www.modelscope.cn/models/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary) model, the
[Punctuation](https://www.modelscope.cn/models/damo/punc_ct-transformer_zh-cn-common-vocab272727-pytorch/summary) model and timestamp prediction together. The model can take inputs several hours long (a minimal usage sketch is given after this news section).
- We release a new model, the [16k VAD model](https://modelscope.cn/models/damo/speech_fsmn_vad_zh-cn-16k-common-pytorch/summary), which predicts the duration of non-silence speech. It can be freely integrated with any ASR model in [modelscope](https://www.modelscope.cn/models/damo/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary).
- We release a new model, [Punctuation](https://www.modelscope.cn/models/damo/punc_ct-transformer_zh-cn-common-vocab272727-pytorch/summary), which predicts the punctuation of ASR models' results. It can be freely integrated with any ASR model in the [Model Zoo](docs/modelscope_models.md).
- We release a new model, [Data2vec](https://www.modelscope.cn/models/damo/speech_data2vec_pretrain-zh-cn-aishell2-16k-pytorch/summary), an unsupervised pretraining model which can be finetuned on ASR and other downstream tasks.
- We release a new model, [Paraformer-Tiny](https://www.modelscope.cn/models/damo/speech_paraformer-tiny-commandword_asr_nat-zh-cn-16k-vocab544-pytorch/summary), a lightweight Paraformer model which supports Mandarin command word recognition.
- We release a new model, [SV](https://www.modelscope.cn/models/damo/speech_xvector_sv-zh-cn-cnceleb-16k-spk3465-pytorch/summary), which extracts speaker embeddings and performs speaker verification on paired utterances. Speaker diarization will be supported in a future version.
- We improve the modelscope pipeline to speed up inference by integrating model building into pipeline building.
- Various new audio input types are now supported by the modelscope inference pipeline, including wav.scp, wav format, audio bytes, wave samples...
For the release notes, please refer to [news](https://github.com/alibaba-damo-academy/FunASR/releases).
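A minimal sketch of the modelscope inference pipeline referenced in the news above, using the Paraformer-large-long model that bundles VAD, ASR, punctuation and timestamps (the audio path below is a placeholder):
```python
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

# Build the ASR inference pipeline; this long-audio model already integrates
# VAD, ASR, punctuation restoration and timestamp prediction.
inference_pipeline = pipeline(
    task=Tasks.auto_speech_recognition,
    model="damo/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch",
)

# "example.wav" is a placeholder input; wav.scp files, audio bytes and wave
# samples are also accepted, as noted in the input-type news item above.
rec_result = inference_pipeline(audio_in="example.wav")
print(rec_result)
```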
## Highlights
- Many typical models are supported, e.g., [Transformer](https://arxiv.org/abs/1706.03762), [Conformer](https://arxiv.org/abs/2005.08100), [Paraformer](https://arxiv.org/abs/2206.08317).

View File

@ -52,7 +52,7 @@ asr_config=conf/train_asr_conformer.yaml
model_dir="baseline_$(basename "${asr_config}" .yaml)_${feats_type}_${lang}_${token_type}_${tag}"
inference_config=conf/decode_asr_transformer.yaml
inference_asr_model=valid.acc.ave_10best.pth
inference_asr_model=valid.acc.ave_10best.pb
# you can set gpu num for decoding here
gpuid_list=$CUDA_VISIBLE_DEVICES # set gpus for decoding, the same as training stage by default
@ -217,7 +217,7 @@ if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
if [ -n "${inference_config}" ]; then
_opts+="--config ${inference_config} "
fi
${infer_cmd} --gpu "${_ngpu}" --max-jobs-run "${_nj}" JOB=1: "${_nj}" "${_logdir}"/asr_inference.JOB.log \
${infer_cmd} --gpu "${_ngpu}" --max-jobs-run "${_nj}" JOB=1:"${_nj}" "${_logdir}"/asr_inference.JOB.log \
python -m funasr.bin.asr_inference_launch \
--batch_size 1 \
--ngpu "${_ngpu}" \

View File

@ -55,7 +55,7 @@ asr_config=conf/train_asr_paraformer_transformer_12e_6d_3072_768.yaml
model_dir="baseline_$(basename "${asr_config}" .yaml)_${feats_type}_${lang}_${token_type}_${tag}"
inference_config=conf/decode_asr_transformer_noctc_1best.yaml
inference_asr_model=valid.acc.ave_10best.pth
inference_asr_model=valid.acc.ave_10best.pb
# you can set gpu num for decoding here
gpuid_list=$CUDA_VISIBLE_DEVICES # set gpus for decoding, the same as training stage by default

View File

@ -55,7 +55,7 @@ asr_config=conf/train_asr_transformer_12e_6d_3072_768.yaml
model_dir="baseline_$(basename "${asr_config}" .yaml)_${feats_type}_${lang}_${token_type}_${tag}"
inference_config=conf/decode_asr_transformer.yaml
inference_asr_model=valid.cer_ctc.ave_10best.pth
inference_asr_model=valid.cer_ctc.ave_10best.pb
# you can set gpu num for decoding here
gpuid_list=$CUDA_VISIBLE_DEVICES # set gpus for decoding, the same as training stage by default

View File

@ -52,7 +52,7 @@ asr_config=conf/train_asr_paraformer_conformer_12e_6d_2048_256.yaml
model_dir="baseline_$(basename "${asr_config}" .yaml)_${feats_type}_${lang}_${token_type}_${tag}"
inference_config=conf/decode_asr_transformer_noctc_1best.yaml
inference_asr_model=valid.acc.ave_10best.pth
inference_asr_model=valid.acc.ave_10best.pb
# you can set gpu num for decoding here
gpuid_list=$CUDA_VISIBLE_DEVICES # set gpus for decoding, the same as training stage by default

View File

@ -56,7 +56,7 @@ asr_config=conf/train_asr_paraformerbert_conformer_12e_6d_2048_256.yaml
model_dir="baseline_$(basename "${asr_config}" .yaml)_${feats_type}_${lang}_${token_type}_${tag}"
inference_config=conf/decode_asr_transformer_noctc_1best.yaml
inference_asr_model=valid.acc.ave_10best.pth
inference_asr_model=valid.acc.ave_10best.pb
# you can set gpu num for decoding here
gpuid_list=$CUDA_VISIBLE_DEVICES # set gpus for decoding, the same as training stage by default

View File

@ -52,7 +52,7 @@ asr_config=conf/train_asr_conformer.yaml
model_dir="baseline_$(basename "${asr_config}" .yaml)_${feats_type}_${lang}_${token_type}_${tag}"
inference_config=conf/decode_asr_transformer.yaml
inference_asr_model=valid.acc.ave_10best.pth
inference_asr_model=valid.acc.ave_10best.pb
# you can set gpu num for decoding here
gpuid_list=$CUDA_VISIBLE_DEVICES # set gpus for decoding, the same as training stage by default

View File

@ -0,0 +1,53 @@
import argparse
import json

import numpy as np


def get_parser():
    parser = argparse.ArgumentParser(
        description="cmvn converter",
        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
    )
    parser.add_argument(
        "--cmvn-json",
        "-c",
        default=False,
        required=True,
        type=str,
        help="cmvn json file",
    )
    parser.add_argument(
        "--am-mvn",
        "-a",
        default=False,
        required=True,
        type=str,
        help="am mvn file",
    )
    return parser


def main():
    parser = get_parser()
    args = parser.parse_args()

    # Load the accumulated CMVN statistics.
    with open(args.cmvn_json, "r") as fin:
        cmvn_dict = json.load(fin)
    mean_stats = np.array(cmvn_dict["mean_stats"])
    var_stats = np.array(cmvn_dict["var_stats"])
    total_frame = np.array(cmvn_dict["total_frames"])

    # The am.mvn file applies an <AddShift> followed by a <Rescale>, so store
    # the negative mean (shift) and the inverse standard deviation (scale).
    mean = -1.0 * mean_stats / total_frame
    var = 1.0 / np.sqrt(var_stats / total_frame - mean * mean)
    dims = mean.shape[0]

    # Write the Kaldi-nnet-style am.mvn file.
    with open(args.am_mvn, 'w') as fout:
        fout.write("<Nnet>" + "\n" + "<Splice> " + str(dims) + " " + str(dims) + '\n' + "[ 0 ]" + "\n" + "<AddShift> " + str(dims) + " " + str(dims) + "\n")
        mean_str = str(list(mean)).replace(',', '').replace('[', '[ ').replace(']', ' ]')
        fout.write("<LearnRateCoef> 0 " + mean_str + '\n')
        fout.write("<Rescale> " + str(dims) + " " + str(dims) + '\n')
        var_str = str(list(var)).replace(',', '').replace('[', '[ ').replace(']', ' ]')
        fout.write("<LearnRateCoef> 0 " + var_str + '\n')
        fout.write("</Nnet>" + '\n')


if __name__ == '__main__':
    main()
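For reference, the `<AddShift>`/`<Rescale>` pair written above encodes ordinary mean/variance normalization: the shift row stores the negative mean and the rescale row the inverse standard deviation. A minimal numpy sketch with toy statistics (not taken from any real cmvn.json) illustrating the equivalence:
```python
import numpy as np

# Toy accumulated statistics, standing in for the cmvn.json fields used above.
mean_stats = np.array([10.0, 20.0])   # sum of features per dimension
var_stats = np.array([60.0, 250.0])   # sum of squared features per dimension
total_frame = 5.0

shift = -1.0 * mean_stats / total_frame                          # negative mean, as in the script
scale = 1.0 / np.sqrt(var_stats / total_frame - shift * shift)   # inverse std, as in the script

feats = np.array([[1.0, 3.0], [2.0, 5.0]])   # fake fbank frames
normalized = (feats + shift) * scale         # what AddShift + Rescale apply

mean = mean_stats / total_frame
std = np.sqrt(var_stats / total_frame - mean ** 2)
assert np.allclose(normalized, (feats - mean) / std)
```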

View File

@ -45,8 +45,8 @@ def compute_wer(ref_file,
if out_item['wrong'] > 0:
rst['wrong_sentences'] += 1
cer_detail_writer.write(hyp_key + print_cer_detail(out_item) + '\n')
cer_detail_writer.write("ref:" + '\t' + "".join(ref_dict[hyp_key]) + '\n')
cer_detail_writer.write("hyp:" + '\t' + "".join(hyp_dict[hyp_key]) + '\n')
cer_detail_writer.write("ref:" + '\t' + " ".join(list(map(lambda x: x.lower(), ref_dict[hyp_key]))) + '\n')
cer_detail_writer.write("hyp:" + '\t' + " ".join(list(map(lambda x: x.lower(), hyp_dict[hyp_key]))) + '\n')
if rst['Wrd'] > 0:
rst['Err'] = round(rst['wrong_words'] * 100 / rst['Wrd'], 2)

View File

@ -54,7 +54,7 @@ asr_config=conf/train_asr_conformer.yaml
model_dir="baseline_$(basename "${asr_config}" .yaml)_${feats_type}_${lang}_${token_type}_${tag}"
inference_config=conf/decode_asr_transformer.yaml
inference_asr_model=valid.acc.ave_10best.pth
inference_asr_model=valid.acc.ave_10best.pb
# you can set gpu num for decoding here
gpuid_list=$CUDA_VISIBLE_DEVICES # set gpus for decoding, e.g., gpuid_list=2,3, the same as training stage by default

View File

@ -54,7 +54,7 @@ asr_config=conf/train_asr_paraformer_conformer_20e_1280_320_6d_1280_320.yaml
model_dir="baseline_$(basename "${asr_config}" .yaml)_${feats_type}_${lang}_${token_type}_${tag}"
inference_config=conf/decode_asr_transformer_noctc_1best.yaml
inference_asr_model=valid.acc.ave_10best.pth
inference_asr_model=valid.acc.ave_10best.pb
# you can set gpu num for decoding here
gpuid_list=$CUDA_VISIBLE_DEVICES # set gpus for decoding, e.g., gpuid_list=2,3, the same as training stage by default

View File

@ -58,7 +58,7 @@ asr_config=conf/train_asr_paraformerbert_conformer_20e_6d_1280_320.yaml
model_dir="baseline_$(basename "${asr_config}" .yaml)_${feats_type}_${lang}_${token_type}_${tag}"
inference_config=conf/decode_asr_transformer_noctc_1best.yaml
inference_asr_model=valid.acc.ave_10best.pth
inference_asr_model=valid.acc.ave_10best.pb
# you can set gpu num for decoding here
gpuid_list=$CUDA_VISIBLE_DEVICES # set gpus for decoding, e.g., gpuid_list=2,3, the same as training stage by default

View File

@ -54,7 +54,7 @@ asr_config=conf/train_asr_transformer.yaml
model_dir="baseline_$(basename "${asr_config}" .yaml)_${feats_type}_${lang}_${token_type}_${tag}"
inference_config=conf/decode_asr_transformer.yaml
inference_asr_model=valid.acc.ave_10best.pth
inference_asr_model=valid.acc.ave_10best.pb
# you can set gpu num for decoding here
gpuid_list=$CUDA_VISIBLE_DEVICES # set gpus for decoding, e.g., gpuid_list=2,3, the same as training stage by default

View File

@ -34,7 +34,7 @@ exp_dir=./data
tag=exp1
model_dir="baseline_$(basename "${lm_config}" .yaml)_${lang}_${token_type}_${tag}"
lm_exp=${exp_dir}/exp/${model_dir}
inference_lm=valid.loss.ave.pth # Language model path for decoding.
inference_lm=valid.loss.ave.pb # Language model path for decoding.
stage=0
stop_stage=3

View File

@ -4,7 +4,7 @@ import sys
def main():
diar_config_path = sys.argv[1] if len(sys.argv) > 1 else "sond_fbank.yaml"
diar_model_path = sys.argv[2] if len(sys.argv) > 2 else "sond.pth"
diar_model_path = sys.argv[2] if len(sys.argv) > 2 else "sond.pb"
output_dir = sys.argv[3] if len(sys.argv) > 3 else "./outputs"
data_path_and_name_and_type = [
("data/test_rmsil/feats.scp", "speech", "kaldi_ark"),

View File

@ -17,9 +17,9 @@ if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
echo "Downloading Pre-trained model..."
git clone https://www.modelscope.cn/damo/speech_xvector_sv-zh-cn-cnceleb-16k-spk3465-pytorch.git
git clone https://www.modelscope.cn/damo/speech_diarization_sond-zh-cn-alimeeting-16k-n16k4-pytorch.git
ln -s speech_xvector_sv-zh-cn-cnceleb-16k-spk3465-pytorch/sv.pth ./sv.pth
ln -s speech_xvector_sv-zh-cn-cnceleb-16k-spk3465-pytorch/sv.pb ./sv.pb
cp speech_xvector_sv-zh-cn-cnceleb-16k-spk3465-pytorch/sv.yaml ./sv.yaml
ln -s speech_diarization_sond-zh-cn-alimeeting-16k-n16k4-pytorch/sond.pth ./sond.pth
ln -s speech_diarization_sond-zh-cn-alimeeting-16k-n16k4-pytorch/sond.pb ./sond.pb
cp speech_diarization_sond-zh-cn-alimeeting-16k-n16k4-pytorch/sond_fbank.yaml ./sond_fbank.yaml
cp speech_diarization_sond-zh-cn-alimeeting-16k-n16k4-pytorch/sond.yaml ./sond.yaml
echo "Done."
@ -30,7 +30,7 @@ fi
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
echo "Calculating diarization results..."
python infer_alimeeting_test.py sond_fbank.yaml sond.pth outputs
python infer_alimeeting_test.py sond_fbank.yaml sond.pb outputs
python local/convert_label_to_rttm.py \
outputs/labels.txt \
data/test_rmsil/raw_rmsil_map.scp \

View File

@ -4,7 +4,7 @@ import os
def test_fbank_cpu_infer():
diar_config_path = "config_fbank.yaml"
diar_model_path = "sond.pth"
diar_model_path = "sond.pb"
output_dir = "./outputs"
data_path_and_name_and_type = [
("data/unit_test/test_feats.scp", "speech", "kaldi_ark"),
@ -24,7 +24,7 @@ def test_fbank_cpu_infer():
def test_fbank_gpu_infer():
diar_config_path = "config_fbank.yaml"
diar_model_path = "sond.pth"
diar_model_path = "sond.pb"
output_dir = "./outputs"
data_path_and_name_and_type = [
("data/unit_test/test_feats.scp", "speech", "kaldi_ark"),
@ -45,7 +45,7 @@ def test_fbank_gpu_infer():
def test_wav_gpu_infer():
diar_config_path = "config.yaml"
diar_model_path = "sond.pth"
diar_model_path = "sond.pb"
output_dir = "./outputs"
data_path_and_name_and_type = [
("data/unit_test/test_wav.scp", "speech", "sound"),
@ -66,7 +66,7 @@ def test_wav_gpu_infer():
def test_without_profile_gpu_infer():
diar_config_path = "config.yaml"
diar_model_path = "sond.pth"
diar_model_path = "sond.pb"
output_dir = "./outputs"
raw_inputs = [[
"data/unit_test/raw_inputs/record.wav",

View File

@ -4,7 +4,7 @@ import os
def test_fbank_cpu_infer():
diar_config_path = "sond_fbank.yaml"
diar_model_path = "sond.pth"
diar_model_path = "sond.pb"
output_dir = "./outputs"
data_path_and_name_and_type = [
("data/unit_test/test_feats.scp", "speech", "kaldi_ark"),
@ -24,7 +24,7 @@ def test_fbank_cpu_infer():
def test_fbank_gpu_infer():
diar_config_path = "sond_fbank.yaml"
diar_model_path = "sond.pth"
diar_model_path = "sond.pb"
output_dir = "./outputs"
data_path_and_name_and_type = [
("data/unit_test/test_feats.scp", "speech", "kaldi_ark"),
@ -45,7 +45,7 @@ def test_fbank_gpu_infer():
def test_wav_gpu_infer():
diar_config_path = "config.yaml"
diar_model_path = "sond.pth"
diar_model_path = "sond.pb"
output_dir = "./outputs"
data_path_and_name_and_type = [
("data/unit_test/test_wav.scp", "speech", "sound"),
@ -66,7 +66,7 @@ def test_wav_gpu_infer():
def test_without_profile_gpu_infer():
diar_config_path = "config.yaml"
diar_model_path = "sond.pth"
diar_model_path = "sond.pb"
output_dir = "./outputs"
raw_inputs = [[
"data/unit_test/raw_inputs/record.wav",

View File

@ -0,0 +1,6 @@
beam_size: 10
penalty: 0.0
maxlenratio: 0.0
minlenratio: 0.0
ctc_weight: 0.5
lm_weight: 0.7

View File

@ -0,0 +1,80 @@
encoder: conformer
encoder_conf:
output_size: 512
attention_heads: 8
linear_units: 2048
num_blocks: 12
dropout_rate: 0.1
positional_dropout_rate: 0.1
attention_dropout_rate: 0.1
input_layer: conv2d
normalize_before: true
macaron_style: true
rel_pos_type: latest
pos_enc_layer_type: rel_pos
selfattention_layer_type: rel_selfattn
activation_type: swish
use_cnn_module: true
cnn_module_kernel: 31
decoder: transformer
decoder_conf:
attention_heads: 8
linear_units: 2048
num_blocks: 6
dropout_rate: 0.1
positional_dropout_rate: 0.1
self_attention_dropout_rate: 0.1
src_attention_dropout_rate: 0.1
model_conf:
ctc_weight: 0.3
lsm_weight: 0.1
length_normalized_loss: false
accum_grad: 2
max_epoch: 50
patience: none
init: none
best_model_criterion:
- - valid
- acc
- max
keep_nbest_models: 10
optim: adam
optim_conf:
lr: 0.0025
weight_decay: 0.000001
scheduler: warmuplr
scheduler_conf:
warmup_steps: 40000
specaug: specaug
specaug_conf:
apply_time_warp: true
time_warp_window: 5
time_warp_mode: bicubic
apply_freq_mask: true
freq_mask_width_range:
- 0
- 27
num_freq_mask: 2
apply_time_mask: true
time_mask_width_ratio_range:
- 0.
- 0.05
num_time_mask: 10
dataset_conf:
shuffle: True
shuffle_conf:
shuffle_size: 1024
sort_size: 500
batch_conf:
batch_type: token
batch_size: 10000
num_workers: 8
log_interval: 50
normalize: None

View File

@ -0,0 +1,80 @@
encoder: conformer
encoder_conf:
output_size: 512
attention_heads: 8
linear_units: 2048
num_blocks: 12
dropout_rate: 0.1
positional_dropout_rate: 0.1
attention_dropout_rate: 0.1
input_layer: conv2d
normalize_before: true
macaron_style: true
rel_pos_type: latest
pos_enc_layer_type: rel_pos
selfattention_layer_type: rel_selfattn
activation_type: swish
use_cnn_module: true
cnn_module_kernel: 31
decoder: transformer
decoder_conf:
attention_heads: 8
linear_units: 2048
num_blocks: 6
dropout_rate: 0.1
positional_dropout_rate: 0.1
self_attention_dropout_rate: 0.1
src_attention_dropout_rate: 0.1
model_conf:
ctc_weight: 0.3
lsm_weight: 0.1
length_normalized_loss: false
accum_grad: 2
max_epoch: 50
patience: none
init: none
best_model_criterion:
- - valid
- acc
- max
keep_nbest_models: 10
optim: adam
optim_conf:
lr: 0.0025
weight_decay: 0.000001
scheduler: warmuplr
scheduler_conf:
warmup_steps: 40000
specaug: specaug
specaug_conf:
apply_time_warp: true
time_warp_window: 5
time_warp_mode: bicubic
apply_freq_mask: true
freq_mask_width_range:
- 0
- 27
num_freq_mask: 2
apply_time_mask: true
time_mask_width_ratio_range:
- 0.
- 0.05
num_time_mask: 10
dataset_conf:
shuffle: True
shuffle_conf:
shuffle_size: 1024
sort_size: 500
batch_conf:
batch_type: token
batch_size: 10000
num_workers: 8
log_interval: 50
normalize: utterance_mvn

View File

@ -0,0 +1,58 @@
#!/usr/bin/env bash
# Copyright 2014 Vassil Panayotov
# 2014 Johns Hopkins University (author: Daniel Povey)
# Apache 2.0
if [ "$#" -ne 2 ]; then
echo "Usage: $0 <src-dir> <dst-dir>"
echo "e.g.: $0 /export/a15/vpanayotov/data/LibriSpeech/dev-clean data/dev-clean"
exit 1
fi
src=$1
dst=$2
# all utterances are FLAC compressed
if ! which flac >&/dev/null; then
echo "Please install 'flac' on ALL worker nodes!"
exit 1
fi
spk_file=$src/../SPEAKERS.TXT
mkdir -p $dst || exit 1
[ ! -d $src ] && echo "$0: no such directory $src" && exit 1
[ ! -f $spk_file ] && echo "$0: expected file $spk_file to exist" && exit 1
wav_scp=$dst/wav.scp; [[ -f "$wav_scp" ]] && rm $wav_scp
trans=$dst/text; [[ -f "$trans" ]] && rm $trans
for reader_dir in $(find -L $src -mindepth 1 -maxdepth 1 -type d | sort); do
reader=$(basename $reader_dir)
if ! [ $reader -eq $reader ]; then # not integer.
echo "$0: unexpected subdirectory name $reader"
exit 1
fi
for chapter_dir in $(find -L $reader_dir/ -mindepth 1 -maxdepth 1 -type d | sort); do
chapter=$(basename $chapter_dir)
if ! [ "$chapter" -eq "$chapter" ]; then
echo "$0: unexpected chapter-subdirectory name $chapter"
exit 1
fi
find -L $chapter_dir/ -iname "*.flac" | sort | xargs -I% basename % .flac | \
awk -v "dir=$chapter_dir" '{printf "%s %s/%s.flac \n", $0, dir, $0}' >>$wav_scp|| exit 1
chapter_trans=$chapter_dir/${reader}-${chapter}.trans.txt
[ ! -f $chapter_trans ] && echo "$0: expected file $chapter_trans to exist" && exit 1
cat $chapter_trans >>$trans
done
done
echo "$0: successfully prepared data in $dst"
exit 0

View File

@ -0,0 +1,5 @@
export FUNASR_DIR=$PWD/../../..
# NOTE(kan-bayashi): Use UTF-8 in Python to avoid UnicodeDecodeError when LC_ALL=C
export PYTHONIOENCODING=UTF-8
export PATH=$FUNASR_DIR/funasr/bin:$PATH

egs/librispeech/conformer/run.sh (executable file, 262 lines)
View File

@ -0,0 +1,262 @@
#!/usr/bin/env bash
. ./path.sh || exit 1;
# machines configuration
CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7"
gpu_num=8
count=1
gpu_inference=true # Whether to perform gpu decoding, set false for cpu decoding
# for gpu decoding, inference_nj=ngpu*njob; for cpu decoding, inference_nj=njob
njob=5
train_cmd=utils/run.pl
infer_cmd=utils/run.pl
# general configuration
feats_dir="../DATA" #feature output dictionary
exp_dir="."
lang=en
dumpdir=dump/fbank
feats_type=fbank
token_type=bpe
dataset_type=large
scp=feats.scp
type=kaldi_ark
stage=3
stop_stage=4
# feature configuration
feats_dim=80
sample_frequency=16000
nj=100
speed_perturb="0.9,1.0,1.1"
# data
data_librispeech=
# bpe model
nbpe=5000
bpemode=unigram
# exp tag
tag=""
. utils/parse_options.sh || exit 1;
# Set bash to 'debug' mode, it will exit on :
# -e 'error', -u 'undefined variable', -o ... 'error in pipeline', -x 'print commands',
set -e
set -u
set -o pipefail
train_set=train_960
valid_set=dev
test_sets="test_clean test_other dev_clean dev_other"
asr_config=conf/train_asr_conformer.yaml
#asr_config=conf/train_asr_conformer_uttnorm.yaml
model_dir="baseline_$(basename "${asr_config}" .yaml)_${feats_type}_${lang}_${token_type}_${tag}"
inference_config=conf/decode_asr_transformer.yaml
#inference_config=conf/decode_asr_transformer_beam60_ctc0.3.yaml
inference_asr_model=valid.acc.ave_10best.pth
# you can set gpu num for decoding here
gpuid_list=$CUDA_VISIBLE_DEVICES # set gpus for decoding, the same as training stage by default
ngpu=$(echo $gpuid_list | awk -F "," '{print NF}')
if ${gpu_inference}; then
inference_nj=$[${ngpu}*${njob}]
_ngpu=1
else
inference_nj=$njob
_ngpu=0
fi
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
echo "stage 0: Data preparation"
# Data preparation
for x in train-clean-100 train-clean-360 train-other-500 dev-clean dev-other test-clean test-other; do
local/data_prep_librispeech.sh ${data_librispeech}/LibriSpeech/${x} ${feats_dir}/data/${x//-/_}
done
fi
feat_train_dir=${feats_dir}/${dumpdir}/$train_set; mkdir -p ${feat_train_dir}
feat_dev_clean_dir=${feats_dir}/${dumpdir}/dev_clean; mkdir -p ${feat_dev_clean_dir}
feat_dev_other_dir=${feats_dir}/${dumpdir}/dev_other; mkdir -p ${feat_dev_other_dir}
feat_test_clean_dir=${feats_dir}/${dumpdir}/test_clean; mkdir -p ${feat_test_clean_dir}
feat_test_other_dir=${feats_dir}/${dumpdir}/test_other; mkdir -p ${feat_test_other_dir}
feat_dev_dir=${feats_dir}/${dumpdir}/$valid_set; mkdir -p ${feat_dev_dir}
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
echo "stage 1: Feature Generation"
# compute fbank features
fbankdir=${feats_dir}/fbank
for x in dev_clean dev_other test_clean test_other; do
utils/compute_fbank.sh --cmd "$train_cmd" --nj 1 --max_lengths 3000 --feats_dim ${feats_dim} --sample_frequency ${sample_frequency} \
${feats_dir}/data/${x} ${exp_dir}/exp/make_fbank/${x} ${fbankdir}/${x}
utils/fix_data_feat.sh ${fbankdir}/${x}
done
mkdir ${feats_dir}/data/$train_set
train_sets="train_clean_100 train_clean_360 train_other_500"
for file in wav.scp text; do
( for f in $train_sets; do cat $feats_dir/data/$f/$file; done ) | sort -k1 > $feats_dir/data/$train_set/$file || exit 1;
done
utils/compute_fbank.sh --cmd "$train_cmd" --nj $nj --max_lengths 3000 --feats_dim ${feats_dim} --sample_frequency ${sample_frequency} --speed_perturb ${speed_perturb} \
${feats_dir}/data/$train_set ${exp_dir}/exp/make_fbank/$train_set ${fbankdir}/$train_set
utils/fix_data_feat.sh ${fbankdir}/$train_set
# compute global cmvn
utils/compute_cmvn.sh --cmd "$train_cmd" --nj $nj --feats_dim ${feats_dim} \
${fbankdir}/$train_set ${exp_dir}/exp/make_fbank/$train_set
# apply cmvn
utils/apply_cmvn.sh --cmd "$train_cmd" --nj $nj \
${fbankdir}/$train_set ${fbankdir}/$train_set/cmvn.json ${exp_dir}/exp/make_fbank/$train_set ${feat_train_dir}
utils/apply_cmvn.sh --cmd "$train_cmd" --nj 1 \
${fbankdir}/dev_clean ${fbankdir}/$train_set/cmvn.json ${exp_dir}/exp/make_fbank/dev_clean ${feat_dev_clean_dir}
utils/apply_cmvn.sh --cmd "$train_cmd" --nj 1\
${fbankdir}/dev_other ${fbankdir}/$train_set/cmvn.json ${exp_dir}/exp/make_fbank/dev_other ${feat_dev_other_dir}
utils/apply_cmvn.sh --cmd "$train_cmd" --nj 1 \
${fbankdir}/test_clean ${fbankdir}/$train_set/cmvn.json ${exp_dir}/exp/make_fbank/test_clean ${feat_test_clean_dir}
utils/apply_cmvn.sh --cmd "$train_cmd" --nj 1 \
${fbankdir}/test_other ${fbankdir}/$train_set/cmvn.json ${exp_dir}/exp/make_fbank/test_other ${feat_test_other_dir}
cp ${fbankdir}/$train_set/text ${fbankdir}/$train_set/speech_shape ${fbankdir}/$train_set/text_shape ${feat_train_dir}
cp ${fbankdir}/dev_clean/text ${fbankdir}/dev_clean/speech_shape ${fbankdir}/dev_clean/text_shape ${feat_dev_clean_dir}
cp ${fbankdir}/dev_other/text ${fbankdir}/dev_other/speech_shape ${fbankdir}/dev_other/text_shape ${feat_dev_other_dir}
cp ${fbankdir}/test_clean/text ${fbankdir}/test_clean/speech_shape ${fbankdir}/test_clean/text_shape ${feat_test_clean_dir}
cp ${fbankdir}/test_other/text ${fbankdir}/test_other/speech_shape ${fbankdir}/test_other/text_shape ${feat_test_other_dir}
dev_sets="dev_clean dev_other"
for file in feats.scp text speech_shape text_shape; do
( for f in $dev_sets; do cat $feats_dir/${dumpdir}/$f/$file; done ) | sort -k1 > $feat_dev_dir/$file || exit 1;
done
#generate ark list
utils/gen_ark_list.sh --cmd "$train_cmd" --nj $nj ${feat_train_dir} ${fbankdir}/${train_set} ${feat_train_dir}
utils/gen_ark_list.sh --cmd "$train_cmd" --nj $nj ${feat_dev_dir} ${fbankdir}/${valid_set} ${feat_dev_dir}
fi
dict=${feats_dir}/data/lang_char/${train_set}_${bpemode}${nbpe}_units.txt
bpemodel=${feats_dir}/data/lang_char/${train_set}_${bpemode}${nbpe}
echo "dictionary: ${dict}"
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
### Task dependent. You have to check non-linguistic symbols used in the corpus.
echo "stage 2: Dictionary and Json Data Preparation"
mkdir -p ${feats_dir}/data/lang_char/
echo "<blank>" > ${dict}
echo "<s>" >> ${dict}
echo "</s>" >> ${dict}
cut -f 2- -d" " ${feats_dir}/data/${train_set}/text > ${feats_dir}/data/lang_char/input.txt
spm_train --input=${feats_dir}/data/lang_char/input.txt --vocab_size=${nbpe} --model_type=${bpemode} --model_prefix=${bpemodel} --input_sentence_size=100000000
spm_encode --model=${bpemodel}.model --output_format=piece < ${feats_dir}/data/lang_char/input.txt | tr ' ' '\n' | sort | uniq | awk '{print $0}' >> ${dict}
echo "<unk>" >> ${dict}
wc -l ${dict}
vocab_size=$(cat ${dict} | wc -l)
awk -v v=,${vocab_size} '{print $0v}' ${feat_train_dir}/text_shape > ${feat_train_dir}/text_shape.char
awk -v v=,${vocab_size} '{print $0v}' ${feat_dev_dir}/text_shape > ${feat_dev_dir}/text_shape.char
mkdir -p ${feats_dir}/asr_stats_fbank_zh_char/$train_set
mkdir -p ${feats_dir}/asr_stats_fbank_zh_char/$valid_set
cp ${feat_train_dir}/speech_shape ${feat_train_dir}/text_shape ${feat_train_dir}/text_shape.char ${feats_dir}/asr_stats_fbank_zh_char/$train_set
cp ${feat_dev_dir}/speech_shape ${feat_dev_dir}/text_shape ${feat_dev_dir}/text_shape.char ${feats_dir}/asr_stats_fbank_zh_char/$valid_set
fi
# Training Stage
world_size=$gpu_num # run on one machine
if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
echo "stage 3: Training"
mkdir -p ${exp_dir}/exp/${model_dir}
mkdir -p ${exp_dir}/exp/${model_dir}/log
INIT_FILE=${exp_dir}/exp/${model_dir}/ddp_init
if [ -f $INIT_FILE ];then
rm -f $INIT_FILE
fi
init_method=file://$(readlink -f $INIT_FILE)
echo "$0: init method is $init_method"
for ((i = 0; i < $gpu_num; ++i)); do
{
rank=$i
local_rank=$i
gpu_id=$(echo $CUDA_VISIBLE_DEVICES | cut -d',' -f$[$i+1])
asr_train.py \
--gpu_id $gpu_id \
--use_preprocessor true \
--split_with_space false \
--bpemodel ${bpemodel}.model \
--token_type $token_type \
--dataset_type $dataset_type \
--token_list $dict \
--train_data_file $feats_dir/$dumpdir/${train_set}/ark_txt.scp \
--valid_data_file $feats_dir/$dumpdir/${valid_set}/ark_txt.scp \
--resume true \
--output_dir ${exp_dir}/exp/${model_dir} \
--config $asr_config \
--input_size $feats_dim \
--ngpu $gpu_num \
--num_worker_count $count \
--multiprocessing_distributed true \
--dist_init_method $init_method \
--dist_world_size $world_size \
--dist_rank $rank \
--local_rank $local_rank 1> ${exp_dir}/exp/${model_dir}/log/train.log.$i 2>&1
} &
done
wait
fi
# Testing Stage
if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
echo "stage 4: Inference"
for dset in ${test_sets}; do
asr_exp=${exp_dir}/exp/${model_dir}
inference_tag="$(basename "${inference_config}" .yaml)"
_dir="${asr_exp}/${inference_tag}/${inference_asr_model}/${dset}"
_logdir="${_dir}/logdir"
if [ -d ${_dir} ]; then
echo "${_dir} is already exists. if you want to decode again, please delete this dir first."
exit 0
fi
mkdir -p "${_logdir}"
_data="${feats_dir}/${dumpdir}/${dset}"
key_file=${_data}/${scp}
num_scp_file="$(<${key_file} wc -l)"
_nj=$([ $inference_nj -le $num_scp_file ] && echo "$inference_nj" || echo "$num_scp_file")
split_scps=
for n in $(seq "${_nj}"); do
split_scps+=" ${_logdir}/keys.${n}.scp"
done
# shellcheck disable=SC2086
utils/split_scp.pl "${key_file}" ${split_scps}
_opts=
if [ -n "${inference_config}" ]; then
_opts+="--config ${inference_config} "
fi
${infer_cmd} --gpu "${_ngpu}" --max-jobs-run "${_nj}" JOB=1:"${_nj}" "${_logdir}"/asr_inference.JOB.log \
python -m funasr.bin.asr_inference_launch \
--batch_size 1 \
--ngpu "${_ngpu}" \
--njob ${njob} \
--gpuid_list ${gpuid_list} \
--data_path_and_name_and_type "${_data}/${scp},speech,${type}" \
--key_file "${_logdir}"/keys.JOB.scp \
--asr_train_config "${asr_exp}"/config.yaml \
--asr_model_file "${asr_exp}"/"${inference_asr_model}" \
--output_dir "${_logdir}"/output.JOB \
--mode asr \
${_opts}
for f in token token_int score text; do
if [ -f "${_logdir}/output.1/1best_recog/${f}" ]; then
for i in $(seq "${_nj}"); do
cat "${_logdir}/output.${i}/1best_recog/${f}"
done | sort -k1 >"${_dir}/${f}"
fi
done
python utils/compute_wer.py ${_data}/text ${_dir}/text ${_dir}/text.cer
tail -n 3 ${_dir}/text.cer > ${_dir}/text.cer.txt
cat ${_dir}/text.cer.txt
done
fi

View File

@ -0,0 +1 @@
../../aishell/transformer/utils

View File

@ -49,7 +49,7 @@ asr_config=conf/train_asr_conformer.yaml
model_dir="baseline_$(basename "${asr_config}" .yaml)_${feats_type}_${lang}_${token_type}_${tag}"
inference_config=conf/decode_asr_transformer.yaml
inference_asr_model=valid.acc.ave_10best.pth
inference_asr_model=valid.acc.ave_10best.pb
# you can set gpu num for decoding here
gpuid_list=$CUDA_VISIBLE_DEVICES # set gpus for decoding, the same as training stage by default

View File

@ -41,7 +41,7 @@ The decoding results can be found in `$output_dir/1best_recog/text.cer`, which i
- Modify inference related parameters in `infer_after_finetune.py`
- <strong>output_dir:</strong> # result dir
- <strong>data_dir:</strong> # the dataset dir needs to include `test/wav.scp`. If `test/text` also exists, the CER will be computed
- <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pth`
- <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pb`
- Then you can run the pipeline to finetune with:
```python

View File

@ -74,7 +74,7 @@ def modelscope_infer(params):
# If text exists, compute CER
text_in = os.path.join(params["data_dir"], "text")
if os.path.exists(text_in):
text_proc_file = os.path.join(best_recog_path, "token")
text_proc_file = os.path.join(best_recog_path, "text")
compute_wer(text_in, text_proc_file, os.path.join(best_recog_path, "text.cer"))

View File

@ -38,7 +38,7 @@ def modelscope_infer_after_finetune(params):
# compute CER if GT text is set
text_in = os.path.join(params["data_dir"], "text")
if os.path.exists(text_in):
text_proc_file = os.path.join(decoding_path, "1best_recog/token")
text_proc_file = os.path.join(decoding_path, "1best_recog/text")
compute_wer(text_in, text_proc_file, os.path.join(decoding_path, "text.cer"))
@ -48,5 +48,5 @@ if __name__ == '__main__':
params["required_files"] = ["am.mvn", "decoding.yaml", "configuration.json"]
params["output_dir"] = "./checkpoint"
params["data_dir"] = "./data/test"
params["decoding_model_name"] = "valid.cer_ctc.ave.pth"
params["decoding_model_name"] = "valid.cer_ctc.ave.pb"
modelscope_infer_after_finetune(params)

View File

@ -41,7 +41,7 @@ The decoding results can be found in `$output_dir/1best_recog/text.cer`, which i
- Modify inference related parameters in `infer_after_finetune.py`
- <strong>output_dir:</strong> # result dir
- <strong>data_dir:</strong> # the dataset dir needs to include `test/wav.scp`. If `test/text` also exists, the CER will be computed
- <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pth`
- <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pb`
- Then you can run the pipeline to finetune with:
```python

View File

@ -74,7 +74,7 @@ def modelscope_infer(params):
# If text exists, compute CER
text_in = os.path.join(params["data_dir"], "text")
if os.path.exists(text_in):
text_proc_file = os.path.join(best_recog_path, "token")
text_proc_file = os.path.join(best_recog_path, "text")
compute_wer(text_in, text_proc_file, os.path.join(best_recog_path, "text.cer"))

View File

@ -38,7 +38,7 @@ def modelscope_infer_after_finetune(params):
# compute CER if GT text is set
text_in = os.path.join(params["data_dir"], "text")
if os.path.exists(text_in):
text_proc_file = os.path.join(decoding_path, "1best_recog/token")
text_proc_file = os.path.join(decoding_path, "1best_recog/text")
compute_wer(text_in, text_proc_file, os.path.join(decoding_path, "text.cer"))
@ -48,5 +48,5 @@ if __name__ == '__main__':
params["required_files"] = ["am.mvn", "decoding.yaml", "configuration.json"]
params["output_dir"] = "./checkpoint"
params["data_dir"] = "./data/test"
params["decoding_model_name"] = "valid.cer_ctc.ave.pth"
params["decoding_model_name"] = "valid.cer_ctc.ave.pb"
modelscope_infer_after_finetune(params)

View File

@ -41,7 +41,7 @@ The decoding results can be found in `$output_dir/1best_recog/text.sp.cer` and `
- Modify inference related parameters in `infer_after_finetune.py`
- <strong>output_dir:</strong> # result dir
- <strong>data_dir:</strong> # the dataset dir needs to include `test/wav.scp`. If `test/text` also exists, the CER will be computed
- <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pth`
- <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pb`
- Then you can run the pipeline to finetune with:
```python

View File

@ -63,5 +63,5 @@ if __name__ == '__main__':
params["required_files"] = ["feats_stats.npz", "decoding.yaml", "configuration.json"]
params["output_dir"] = "./checkpoint"
params["data_dir"] = "./example_data/validation"
params["decoding_model_name"] = "valid.acc.ave.pth"
params["decoding_model_name"] = "valid.acc.ave.pb"
modelscope_infer_after_finetune(params)

View File

@ -1,30 +0,0 @@
# ModelScope Model
## How to finetune and infer using a pretrained Paraformer-large Model
### Finetune
- Modify finetune training related parameters in `finetune.py`
- <strong>output_dir:</strong> # result dir
- <strong>data_dir:</strong> # the dataset dir needs to include files: train/wav.scp, train/text; validation/wav.scp, validation/text.
- <strong>batch_bins:</strong> # batch size
- <strong>max_epoch:</strong> # number of training epoch
- <strong>lr:</strong> # learning rate
- Then you can run the pipeline to finetune with:
```python
python finetune.py
```
### Inference
Or you can use the finetuned model for inference directly.
- Setting parameters in `infer.py`
- <strong>data_dir:</strong> # the dataset dir
- <strong>output_dir:</strong> # result dir
- Then you can run the pipeline to infer with:
```python
python infer.py
```

View File

@ -1,23 +0,0 @@
# Paraformer-Large
- Model link: <https://www.modelscope.cn/models/damo/speech_paraformer-large_asr_nat-zh-cn-16k-aishell1-vocab8404-pytorch/summary>
- Model size: 220M
# Environments
- date: `Fri Feb 10 13:34:24 CST 2023`
- python version: `3.7.12`
- FunASR version: `0.1.6`
- pytorch version: `pytorch 1.7.0`
- Git hash: ``
- Commit date: ``
# Benchmark Results
## AISHELL-1
- Decode config:
- Decode without CTC
- Decode without LM
| testset CER(%) | base model|finetune model |
|:--------------:|:---------:|:-------------:|
| dev | 1.75 |1.62 |
| test | 1.95 |1.78 |

View File

@ -1,36 +0,0 @@
import os
from modelscope.metainfo import Trainers
from modelscope.trainers import build_trainer
from funasr.datasets.ms_dataset import MsDataset
from funasr.utils.modelscope_param import modelscope_args
def modelscope_finetune(params):
if not os.path.exists(params.output_dir):
os.makedirs(params.output_dir, exist_ok=True)
# dataset split ["train", "validation"]
ds_dict = MsDataset.load(params.data_path)
kwargs = dict(
model=params.model,
data_dir=ds_dict,
dataset_type=params.dataset_type,
work_dir=params.output_dir,
batch_bins=params.batch_bins,
max_epoch=params.max_epoch,
lr=params.lr)
trainer = build_trainer(Trainers.speech_asr_trainer, default_args=kwargs)
trainer.train()
if __name__ == '__main__':
params = modelscope_args(model="damo/speech_paraformer-large_asr_nat-zh-cn-16k-aishell1-vocab8404-pytorch", data_path="./data")
params.output_dir = "./checkpoint"  # path to save the model
params.data_path = "./example_data/"  # data path
params.dataset_type = "small"  # use "small" for small datasets; if the data exceeds 1000 hours, use "large"
params.batch_bins = 2000  # batch size; with dataset_type="small" the unit is fbank feature frames, with dataset_type="large" it is milliseconds
params.max_epoch = 50  # maximum number of training epochs
params.lr = 0.00005  # learning rate
modelscope_finetune(params)

View File

@ -1,88 +0,0 @@
import os
import shutil
from multiprocessing import Pool
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks
from funasr.utils.compute_wer import compute_wer
def modelscope_infer_core(output_dir, split_dir, njob, idx):
output_dir_job = os.path.join(output_dir, "output.{}".format(idx))
gpu_id = (int(idx) - 1) // njob
if "CUDA_VISIBLE_DEVICES" in os.environ.keys():
gpu_list = os.environ['CUDA_VISIBLE_DEVICES'].split(",")
os.environ['CUDA_VISIBLE_DEVICES'] = str(gpu_list[gpu_id])
else:
os.environ['CUDA_VISIBLE_DEVICES'] = str(gpu_id)
inference_pipline = pipeline(
task=Tasks.auto_speech_recognition,
model="damo/speech_paraformer-large_asr_nat-zh-cn-16k-aishell1-vocab8404-pytorch",
output_dir=output_dir_job,
batch_size=64
)
audio_in = os.path.join(split_dir, "wav.{}.scp".format(idx))
inference_pipline(audio_in=audio_in)
def modelscope_infer(params):
# prepare for multi-GPU decoding
ngpu = params["ngpu"]
njob = params["njob"]
output_dir = params["output_dir"]
if os.path.exists(output_dir):
shutil.rmtree(output_dir)
os.mkdir(output_dir)
split_dir = os.path.join(output_dir, "split")
os.mkdir(split_dir)
nj = ngpu * njob
wav_scp_file = os.path.join(params["data_dir"], "wav.scp")
with open(wav_scp_file) as f:
lines = f.readlines()
num_lines = len(lines)
num_job_lines = num_lines // nj
start = 0
for i in range(nj):
end = start + num_job_lines
file = os.path.join(split_dir, "wav.{}.scp".format(str(i + 1)))
with open(file, "w") as f:
if i == nj - 1:
f.writelines(lines[start:])
else:
f.writelines(lines[start:end])
start = end
p = Pool(nj)
for i in range(nj):
p.apply_async(modelscope_infer_core,
args=(output_dir, split_dir, njob, str(i + 1)))
p.close()
p.join()
# combine decoding results
best_recog_path = os.path.join(output_dir, "1best_recog")
os.mkdir(best_recog_path)
files = ["text", "token", "score"]
for file in files:
with open(os.path.join(best_recog_path, file), "w") as f:
for i in range(nj):
job_file = os.path.join(output_dir, "output.{}/1best_recog".format(str(i + 1)), file)
with open(job_file) as f_job:
lines = f_job.readlines()
f.writelines(lines)
# If text exists, compute CER
text_in = os.path.join(params["data_dir"], "text")
if os.path.exists(text_in):
text_proc_file = os.path.join(best_recog_path, "token")
compute_wer(text_in, text_proc_file, os.path.join(best_recog_path, "text.cer"))
if __name__ == "__main__":
params = {}
params["data_dir"] = "./data/test"
params["output_dir"] = "./results"
params["ngpu"] = 1
params["njob"] = 1
modelscope_infer(params)

View File

@ -1,53 +0,0 @@
import json
import os
import shutil
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks
from funasr.utils.compute_wer import compute_wer
def modelscope_infer_after_finetune(params):
# prepare for decoding
pretrained_model_path = os.path.join(os.environ["HOME"], ".cache/modelscope/hub", params["modelscope_model_name"])
for file_name in params["required_files"]:
if file_name == "configuration.json":
with open(os.path.join(pretrained_model_path, file_name)) as f:
config_dict = json.load(f)
config_dict["model"]["am_model_name"] = params["decoding_model_name"]
with open(os.path.join(params["output_dir"], "configuration.json"), "w") as f:
json.dump(config_dict, f, indent=4, separators=(',', ': '))
else:
shutil.copy(os.path.join(pretrained_model_path, file_name),
os.path.join(params["output_dir"], file_name))
decoding_path = os.path.join(params["output_dir"], "decode_results")
if os.path.exists(decoding_path):
shutil.rmtree(decoding_path)
os.mkdir(decoding_path)
# decoding
inference_pipeline = pipeline(
task=Tasks.auto_speech_recognition,
model=params["output_dir"],
output_dir=decoding_path,
batch_size=64
)
audio_in = os.path.join(params["data_dir"], "wav.scp")
inference_pipeline(audio_in=audio_in)
# compute CER if GT text is set
text_in = os.path.join(params["data_dir"], "text")
if os.path.exists(text_in):
text_proc_file = os.path.join(decoding_path, "1best_recog/token")
compute_wer(text_in, text_proc_file, os.path.join(decoding_path, "text.cer"))
if __name__ == '__main__':
params = {}
params["modelscope_model_name"] = "damo/speech_paraformer-large_asr_nat-zh-cn-16k-aishell1-vocab8404-pytorch"
params["required_files"] = ["am.mvn", "decoding.yaml", "configuration.json"]
params["output_dir"] = "./checkpoint"
params["data_dir"] = "./data/test"
params["decoding_model_name"] = "valid.acc.ave_10best.pth"
modelscope_infer_after_finetune(params)

View File

@ -1,30 +0,0 @@
# ModelScope Model
## How to finetune and infer using a pretrained Paraformer-large Model
### Finetune
- Modify finetune training related parameters in `finetune.py`
- <strong>output_dir:</strong> # result dir
- <strong>data_dir:</strong> # the dataset dir needs to include files: train/wav.scp, train/text; validation/wav.scp, validation/text.
- <strong>batch_bins:</strong> # batch size
- <strong>max_epoch:</strong> # number of training epoch
- <strong>lr:</strong> # learning rate
- Then you can run the pipeline to finetune with:
```python
python finetune.py
```
### Inference
Or you can use the finetuned model for inference directly.
- Setting parameters in `infer.py`
- <strong>data_dir:</strong> # the dataset dir
- <strong>output_dir:</strong> # result dir
- Then you can run the pipeline to infer with:
```python
python infer.py
```

View File

@ -1,25 +0,0 @@
# Paraformer-Large
- Model link: <https://www.modelscope.cn/models/damo/speech_paraformer-large_asr_nat-zh-cn-16k-aishell2-vocab8404-pytorch/summary>
- Model size: 220M
# Environments
- date: `Fri Feb 10 13:34:24 CST 2023`
- python version: `3.7.12`
- FunASR version: `0.1.6`
- pytorch version: `pytorch 1.7.0`
- Git hash: ``
- Commit date: ``
# Benchmark Results
## AISHELL-2
- Decode config:
- Decode without CTC
- Decode without LM
| testset | base model|finetune model|
|:------------:|:---------:|:------------:|
| dev_ios | 2.80 |2.60 |
| test_android | 3.13 |2.84 |
| test_ios | 2.85 |2.82 |
| test_mic | 3.06 |2.88 |

View File

@ -1,36 +0,0 @@
import os
from modelscope.metainfo import Trainers
from modelscope.trainers import build_trainer
from funasr.datasets.ms_dataset import MsDataset
from funasr.utils.modelscope_param import modelscope_args
def modelscope_finetune(params):
if not os.path.exists(params.output_dir):
os.makedirs(params.output_dir, exist_ok=True)
# dataset split ["train", "validation"]
ds_dict = MsDataset.load(params.data_path)
kwargs = dict(
model=params.model,
data_dir=ds_dict,
dataset_type=params.dataset_type,
work_dir=params.output_dir,
batch_bins=params.batch_bins,
max_epoch=params.max_epoch,
lr=params.lr)
trainer = build_trainer(Trainers.speech_asr_trainer, default_args=kwargs)
trainer.train()
if __name__ == '__main__':
params = modelscope_args(model="damo/speech_paraformer-large_asr_nat-zh-cn-16k-aishell2-vocab8404-pytorch", data_path="./data")
params.output_dir = "./checkpoint"  # path to save the model
params.data_path = "./example_data/"  # data path
params.dataset_type = "small"  # use "small" for small datasets; if the data exceeds 1000 hours, use "large"
params.batch_bins = 2000  # batch size; with dataset_type="small" the unit is fbank feature frames, with dataset_type="large" it is milliseconds
params.max_epoch = 50  # maximum number of training epochs
params.lr = 0.00005  # learning rate
modelscope_finetune(params)

View File

@ -1,88 +0,0 @@
import os
import shutil
from multiprocessing import Pool
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks
from funasr.utils.compute_wer import compute_wer
def modelscope_infer_core(output_dir, split_dir, njob, idx):
output_dir_job = os.path.join(output_dir, "output.{}".format(idx))
gpu_id = (int(idx) - 1) // njob
if "CUDA_VISIBLE_DEVICES" in os.environ.keys():
gpu_list = os.environ['CUDA_VISIBLE_DEVICES'].split(",")
os.environ['CUDA_VISIBLE_DEVICES'] = str(gpu_list[gpu_id])
else:
os.environ['CUDA_VISIBLE_DEVICES'] = str(gpu_id)
inference_pipline = pipeline(
task=Tasks.auto_speech_recognition,
model="damo/speech_paraformer-large_asr_nat-zh-cn-16k-aishell2-vocab8404-pytorch",
output_dir=output_dir_job,
batch_size=64
)
audio_in = os.path.join(split_dir, "wav.{}.scp".format(idx))
inference_pipline(audio_in=audio_in)
def modelscope_infer(params):
# prepare for multi-GPU decoding
ngpu = params["ngpu"]
njob = params["njob"]
output_dir = params["output_dir"]
if os.path.exists(output_dir):
shutil.rmtree(output_dir)
os.mkdir(output_dir)
split_dir = os.path.join(output_dir, "split")
os.mkdir(split_dir)
nj = ngpu * njob
wav_scp_file = os.path.join(params["data_dir"], "wav.scp")
with open(wav_scp_file) as f:
lines = f.readlines()
num_lines = len(lines)
num_job_lines = num_lines // nj
start = 0
for i in range(nj):
end = start + num_job_lines
file = os.path.join(split_dir, "wav.{}.scp".format(str(i + 1)))
with open(file, "w") as f:
if i == nj - 1:
f.writelines(lines[start:])
else:
f.writelines(lines[start:end])
start = end
p = Pool(nj)
for i in range(nj):
p.apply_async(modelscope_infer_core,
args=(output_dir, split_dir, njob, str(i + 1)))
p.close()
p.join()
# combine decoding results
best_recog_path = os.path.join(output_dir, "1best_recog")
os.mkdir(best_recog_path)
files = ["text", "token", "score"]
for file in files:
with open(os.path.join(best_recog_path, file), "w") as f:
for i in range(nj):
job_file = os.path.join(output_dir, "output.{}/1best_recog".format(str(i + 1)), file)
with open(job_file) as f_job:
lines = f_job.readlines()
f.writelines(lines)
# If text exists, compute CER
text_in = os.path.join(params["data_dir"], "text")
if os.path.exists(text_in):
text_proc_file = os.path.join(best_recog_path, "token")
compute_wer(text_in, text_proc_file, os.path.join(best_recog_path, "text.cer"))
if __name__ == "__main__":
params = {}
params["data_dir"] = "./data/test"
params["output_dir"] = "./results"
params["ngpu"] = 1
params["njob"] = 1
modelscope_infer(params)

View File

@ -1,53 +0,0 @@
import json
import os
import shutil
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks
from funasr.utils.compute_wer import compute_wer
def modelscope_infer_after_finetune(params):
# prepare for decoding
pretrained_model_path = os.path.join(os.environ["HOME"], ".cache/modelscope/hub", params["modelscope_model_name"])
for file_name in params["required_files"]:
if file_name == "configuration.json":
with open(os.path.join(pretrained_model_path, file_name)) as f:
config_dict = json.load(f)
config_dict["model"]["am_model_name"] = params["decoding_model_name"]
with open(os.path.join(params["output_dir"], "configuration.json"), "w") as f:
json.dump(config_dict, f, indent=4, separators=(',', ': '))
else:
shutil.copy(os.path.join(pretrained_model_path, file_name),
os.path.join(params["output_dir"], file_name))
decoding_path = os.path.join(params["output_dir"], "decode_results")
if os.path.exists(decoding_path):
shutil.rmtree(decoding_path)
os.mkdir(decoding_path)
# decoding
inference_pipeline = pipeline(
task=Tasks.auto_speech_recognition,
model=params["output_dir"],
output_dir=decoding_path,
batch_size=64
)
audio_in = os.path.join(params["data_dir"], "wav.scp")
inference_pipeline(audio_in=audio_in)
# compute CER if GT text is set
text_in = os.path.join(params["data_dir"], "text")
if os.path.exists(text_in):
text_proc_file = os.path.join(decoding_path, "1best_recog/token")
compute_wer(text_in, text_proc_file, os.path.join(decoding_path, "text.cer"))
if __name__ == '__main__':
params = {}
params["modelscope_model_name"] = "damo/speech_paraformer-large_asr_nat-zh-cn-16k-aishell2-vocab8404-pytorch"
params["required_files"] = ["am.mvn", "decoding.yaml", "configuration.json"]
params["output_dir"] = "./checkpoint"
params["data_dir"] = "./data/test"
params["decoding_model_name"] = "valid.acc.ave_10best.pth"
modelscope_infer_after_finetune(params)

View File

@ -21,27 +21,34 @@
Or you can use the finetuned model for inference directly.
- Setting parameters in `infer.py`
- Setting parameters in `infer.sh`
- <strong>model:</strong> # model name on ModelScope
- <strong>data_dir:</strong> # the dataset dir needs to include `test/wav.scp`. If `test/text` also exists, the CER will be computed
- <strong>output_dir:</strong> # result dir
- <strong>ngpu:</strong> # the number of GPUs for decoding
- <strong>njob:</strong> # the number of jobs for each GPU
- <strong>batch_size:</strong> # batchsize of inference
- <strong>gpu_inference:</strong> # whether to perform gpu decoding, set false for cpu decoding
- <strong>gpuid_list:</strong> # set gpus, e.g., gpuid_list="0,1"
- <strong>njob:</strong> # the number of jobs for CPU decoding, if `gpu_inference`=false, use CPU decoding, please set `njob`
- Then you can run the pipeline to infer with:
```python
python infer.py
sh infer.sh
```
- Results
The decoding results can be found in `$output_dir/1best_recog/text.cer`, which includes the recognition result of each sample and the CER metric of the whole test set.
If you decode the SpeechIO test sets, you can apply text normalization with `stage`=3; `DETAILS.txt` and `RESULTS.txt` then record the results and the CER after text normalization.
### Inference using local finetuned model
- Modify inference related parameters in `infer_after_finetune.py`
- <strong>modelscope_model_name: </strong> # model name on ModelScope
- <strong>output_dir:</strong> # result dir
- <strong>data_dir:</strong> # the dataset dir needs to include `test/wav.scp`. If `test/text` also exists, the CER will be computed
- <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pth`
- <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pb`
- <strong>batch_size:</strong> # batchsize of inference
- Then you can run the pipeline to finetune with:
```python

View File

@ -17,22 +17,22 @@
- Decode without CTC
- Decode without LM
| testset | CER(%)|
|:---------:|:-----:|
| dev | 1.75 |
| test | 1.95 |
| CER(%) | Pretrain model|[Finetune model](https://www.modelscope.cn/models/damo/speech_paraformer-large_asr_nat-zh-cn-16k-aishell1-vocab8404-pytorch/summary) |
|:---------:|:-------------:|:-------------:|
| dev | 1.75 |1.62 |
| test | 1.95 |1.78 |
## AISHELL-2
- Decode config:
- Decode without CTC
- Decode without LM
| testset | CER(%)|
|:------------:|:-----:|
| dev_ios | 2.80 |
| test_android | 3.13 |
| test_ios | 2.85 |
| test_mic | 3.06 |
| CER(%) | Pretrain model|[Finetune model](https://www.modelscope.cn/models/damo/speech_paraformer-large_asr_nat-zh-cn-16k-aishell2-vocab8404-pytorch/summary)|
|:------------:|:-------------:|:------------:|
| dev_ios | 2.80 |2.60 |
| test_android | 3.13 |2.84 |
| test_ios | 2.85 |2.82 |
| test_mic | 3.06 |2.88 |
## Wenetspeech
- Decode config:

View File

@ -1,88 +1,25 @@
import os
import shutil
from multiprocessing import Pool
import argparse
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks
from funasr.utils.compute_wer import compute_wer
def modelscope_infer_core(output_dir, split_dir, njob, idx):
output_dir_job = os.path.join(output_dir, "output.{}".format(idx))
gpu_id = (int(idx) - 1) // njob
if "CUDA_VISIBLE_DEVICES" in os.environ.keys():
gpu_list = os.environ['CUDA_VISIBLE_DEVICES'].split(",")
os.environ['CUDA_VISIBLE_DEVICES'] = str(gpu_list[gpu_id])
else:
os.environ['CUDA_VISIBLE_DEVICES'] = str(gpu_id)
inference_pipline = pipeline(
def modelscope_infer(args):
os.environ['CUDA_VISIBLE_DEVICES'] = str(args.gpuid)
inference_pipeline = pipeline(
task=Tasks.auto_speech_recognition,
model="damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch",
output_dir=output_dir_job,
batch_size=64
model=args.model,
output_dir=args.output_dir,
batch_size=args.batch_size,
)
audio_in = os.path.join(split_dir, "wav.{}.scp".format(idx))
inference_pipline(audio_in=audio_in)
def modelscope_infer(params):
# prepare for multi-GPU decoding
ngpu = params["ngpu"]
njob = params["njob"]
output_dir = params["output_dir"]
if os.path.exists(output_dir):
shutil.rmtree(output_dir)
os.mkdir(output_dir)
split_dir = os.path.join(output_dir, "split")
os.mkdir(split_dir)
nj = ngpu * njob
wav_scp_file = os.path.join(params["data_dir"], "wav.scp")
with open(wav_scp_file) as f:
lines = f.readlines()
num_lines = len(lines)
num_job_lines = num_lines // nj
start = 0
for i in range(nj):
end = start + num_job_lines
file = os.path.join(split_dir, "wav.{}.scp".format(str(i + 1)))
with open(file, "w") as f:
if i == nj - 1:
f.writelines(lines[start:])
else:
f.writelines(lines[start:end])
start = end
p = Pool(nj)
for i in range(nj):
p.apply_async(modelscope_infer_core,
args=(output_dir, split_dir, njob, str(i + 1)))
p.close()
p.join()
# combine decoding results
best_recog_path = os.path.join(output_dir, "1best_recog")
os.mkdir(best_recog_path)
files = ["text", "token", "score"]
for file in files:
with open(os.path.join(best_recog_path, file), "w") as f:
for i in range(nj):
job_file = os.path.join(output_dir, "output.{}/1best_recog".format(str(i + 1)), file)
with open(job_file) as f_job:
lines = f_job.readlines()
f.writelines(lines)
# If text exists, compute CER
text_in = os.path.join(params["data_dir"], "text")
if os.path.exists(text_in):
text_proc_file = os.path.join(best_recog_path, "token")
compute_wer(text_in, text_proc_file, os.path.join(best_recog_path, "text.cer"))
inference_pipeline(audio_in=args.audio_in)
if __name__ == "__main__":
params = {}
params["data_dir"] = "./data/test"
params["output_dir"] = "./results"
params["ngpu"] = 1
params["njob"] = 1
modelscope_infer(params)
parser = argparse.ArgumentParser()
parser.add_argument('--model', type=str, default="damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch")
parser.add_argument('--audio_in', type=str, default="./data/test/wav.scp")
parser.add_argument('--output_dir', type=str, default="./results/")
parser.add_argument('--batch_size', type=int, default=64)
parser.add_argument('--gpuid', type=str, default="0")
args = parser.parse_args()
modelscope_infer(args)
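A minimal sketch of calling the refactored `infer.py` above programmatically instead of via the command line (it assumes `infer.py` is importable from the working directory; the values mirror the script defaults):
```python
from argparse import Namespace

from infer import modelscope_infer  # the refactored script shown above

args = Namespace(
    model="damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch",
    audio_in="./data/test/wav.scp",  # Kaldi-style scp file, or a single wav path/url
    output_dir="./results/",
    batch_size=64,
    gpuid="0",                       # value written to CUDA_VISIBLE_DEVICES
)
modelscope_infer(args)
```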

View File

@ -0,0 +1,95 @@
#!/usr/bin/env bash
set -e
set -u
set -o pipefail
stage=1
stop_stage=2
model="damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch"
data_dir="./data/test"
output_dir="./results"
batch_size=64
gpu_inference=true    # whether to perform GPU decoding
gpuid_list="0,1"    # the GPUs to use, e.g., gpuid_list="0,1"
njob=4    # the number of parallel jobs for CPU decoding (used when gpu_inference=false)
if ${gpu_inference}; then
nj=$(echo $gpuid_list | awk -F "," '{print NF}')
else
nj=$njob
batch_size=1
gpuid_list=""
for JOB in $(seq ${nj}); do
gpuid_list=$gpuid_list"-1,"
done
fi
mkdir -p $output_dir/split
split_scps=""
for JOB in $(seq ${nj}); do
split_scps="$split_scps $output_dir/split/wav.$JOB.scp"
done
perl utils/split_scp.pl ${data_dir}/wav.scp ${split_scps}
if [ $stage -le 1 ] && [ $stop_stage -ge 1 ];then
echo "Decoding ..."
gpuid_list_array=(${gpuid_list//,/ })
for JOB in $(seq ${nj}); do
{
id=$((JOB-1))
gpuid=${gpuid_list_array[$id]}
mkdir -p ${output_dir}/output.$JOB
python infer.py \
--model ${model} \
--audio_in ${output_dir}/split/wav.$JOB.scp \
--output_dir ${output_dir}/output.$JOB \
--batch_size ${batch_size} \
--gpuid ${gpuid}
}&
done
wait
mkdir -p ${output_dir}/1best_recog
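# merge the per-job outputs and sort them by utterance id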
for f in token score text; do
if [ -f "${output_dir}/output.1/1best_recog/${f}" ]; then
for i in $(seq "${nj}"); do
cat "${output_dir}/output.${i}/1best_recog/${f}"
done | sort -k1 >"${output_dir}/1best_recog/${f}"
fi
done
fi
if [ $stage -le 2 ] && [ $stop_stage -ge 2 ];then
echo "Computing WER ..."
cp ${output_dir}/1best_recog/text ${output_dir}/1best_recog/text.proc
cp ${data_dir}/text ${output_dir}/1best_recog/text.ref
python utils/compute_wer.py ${output_dir}/1best_recog/text.ref ${output_dir}/1best_recog/text.proc ${output_dir}/1best_recog/text.cer
tail -n 3 ${output_dir}/1best_recog/text.cer
fi
if [ $stage -le 3 ] && [ $stop_stage -ge 3 ];then
echo "SpeechIO TIOBE textnorm"
echo "$0 --> Normalizing REF text ..."
./utils/textnorm_zh.py \
--has_key --to_upper \
${data_dir}/text \
${output_dir}/1best_recog/ref.txt
echo "$0 --> Normalizing HYP text ..."
./utils/textnorm_zh.py \
--has_key --to_upper \
${output_dir}/1best_recog/text.proc \
${output_dir}/1best_recog/rec.txt
grep -v $'\t$' ${output_dir}/1best_recog/rec.txt > ${output_dir}/1best_recog/rec_non_empty.txt
echo "$0 --> computing WER/CER and alignment ..."
./utils/error_rate_zh \
--tokenizer char \
--ref ${output_dir}/1best_recog/ref.txt \
--hyp ${output_dir}/1best_recog/rec_non_empty.txt \
${output_dir}/1best_recog/DETAILS.txt | tee ${output_dir}/1best_recog/RESULTS.txt
rm -rf ${output_dir}/1best_recog/rec.txt ${output_dir}/1best_recog/rec_non_empty.txt
fi
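For reference, a rough Python equivalent of the `split_scp.pl` step near the top of this script (an illustrative sketch only, assuming a simple contiguous split into `nj` chunks, which is also how the earlier multi-process `infer.py` divided the work):
```python
# Split wav.scp into nj roughly equal, contiguous pieces:
# $output_dir/split/wav.1.scp ... wav.{nj}.scp
def split_wav_scp(wav_scp, split_dir, nj):
    with open(wav_scp, encoding="utf-8") as f:
        lines = f.readlines()
    chunk = len(lines) // nj
    for i in range(nj):
        start = i * chunk
        end = None if i == nj - 1 else start + chunk  # last job takes the remainder
        with open(f"{split_dir}/wav.{i + 1}.scp", "w", encoding="utf-8") as out:
            out.writelines(lines[start:end])

split_wav_scp("./data/test/wav.scp", "./results/split", nj=2)
```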

View File

@ -4,23 +4,18 @@ import shutil
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks
from modelscope.hub.snapshot_download import snapshot_download
from funasr.utils.compute_wer import compute_wer
def modelscope_infer_after_finetune(params):
# prepare for decoding
pretrained_model_path = os.path.join(os.environ["HOME"], ".cache/modelscope/hub", params["modelscope_model_name"])
for file_name in params["required_files"]:
if file_name == "configuration.json":
with open(os.path.join(pretrained_model_path, file_name)) as f:
config_dict = json.load(f)
config_dict["model"]["am_model_name"] = params["decoding_model_name"]
with open(os.path.join(params["output_dir"], "configuration.json"), "w") as f:
json.dump(config_dict, f, indent=4, separators=(',', ': '))
else:
shutil.copy(os.path.join(pretrained_model_path, file_name),
os.path.join(params["output_dir"], file_name))
try:
pretrained_model_path = snapshot_download(params["modelscope_model_name"], cache_dir=params["output_dir"])
except BaseException:
raise BaseException(f"Please download pretrain model from ModelScope firstly.")
shutil.copy(os.path.join(params["output_dir"], params["decoding_model_name"]), os.path.join(pretrained_model_path, "model.pb"))
decoding_path = os.path.join(params["output_dir"], "decode_results")
if os.path.exists(decoding_path):
shutil.rmtree(decoding_path)
@ -29,9 +24,9 @@ def modelscope_infer_after_finetune(params):
# decoding
inference_pipeline = pipeline(
task=Tasks.auto_speech_recognition,
model=params["output_dir"],
model=pretrained_model_path,
output_dir=decoding_path,
batch_size=64
batch_size=params["batch_size"]
)
audio_in = os.path.join(params["data_dir"], "wav.scp")
inference_pipeline(audio_in=audio_in)
@ -39,15 +34,15 @@ def modelscope_infer_after_finetune(params):
# compute CER if GT text is set
text_in = os.path.join(params["data_dir"], "text")
if os.path.exists(text_in):
text_proc_file = os.path.join(decoding_path, "1best_recog/token")
text_proc_file = os.path.join(decoding_path, "1best_recog/text")
compute_wer(text_in, text_proc_file, os.path.join(decoding_path, "text.cer"))
if __name__ == '__main__':
params = {}
params["modelscope_model_name"] = "damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch"
params["required_files"] = ["am.mvn", "decoding.yaml", "configuration.json"]
params["output_dir"] = "./checkpoint"
params["data_dir"] = "./data/test"
params["decoding_model_name"] = "valid.acc.ave_10best.pth"
modelscope_infer_after_finetune(params)
params["decoding_model_name"] = "valid.acc.ave_10best.pb"
params["batch_size"] = 64
modelscope_infer_after_finetune(params)

View File

@ -0,0 +1 @@
../../../../egs/aishell/transformer/utils

View File

@ -0,0 +1,37 @@
import os
import logging
import torch
import soundfile
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks
from modelscope.utils.logger import get_logger
logger = get_logger(log_level=logging.CRITICAL)
logger.setLevel(logging.CRITICAL)
os.environ["MODELSCOPE_CACHE"] = "./"
inference_pipeline = pipeline(
task=Tasks.auto_speech_recognition,
model='damo/speech_paraformer_asr_nat-zh-cn-16k-common-vocab8404-online',
model_revision='v1.0.2')
model_dir = os.path.join(os.environ["MODELSCOPE_CACHE"], "damo/speech_paraformer_asr_nat-zh-cn-16k-common-vocab8404-online")
speech, sample_rate = soundfile.read(os.path.join(model_dir, "example/asr_example.wav"))
speech_length = speech.shape[0]
sample_offset = 0
step = 4800 #300ms
param_dict = {"cache": dict(), "is_final": False}
final_result = ""
for sample_offset in range(0, speech_length, min(step, speech_length - sample_offset)):
if sample_offset + step >= speech_length - 1:
step = speech_length - sample_offset
param_dict["is_final"] = True
rec_result = inference_pipeline(audio_in=speech[sample_offset: sample_offset + step],
param_dict=param_dict)
if len(rec_result) != 0 and rec_result['text'] != "sil" and rec_result['text'] != "waiting_for_more_voice":
final_result += rec_result['text']
print(rec_result)
print(final_result)

View File

@ -6,8 +6,9 @@
- Modify the finetuning-related training parameters in `finetune.py` (see the sketch after this list)
- <strong>output_dir:</strong> # result dir
- <strong>data_dir:</strong> # the dataset dir needs to include files: train/wav.scp, train/text; validation/wav.scp, validation/text.
- <strong>batch_bins:</strong> # batch size
- <strong>data_dir:</strong> # the dataset dir needs to include files: `train/wav.scp`, `train/text`; `validation/wav.scp`, `validation/text`
- <strong>dataset_type:</strong> # for dataset larger than 1000 hours, set as `large`, otherwise set as `small`
- <strong>batch_bins:</strong> # batch size. If `dataset_type` is `small`, `batch_bins` is the number of feature frames; if `dataset_type` is `large`, `batch_bins` is the audio duration in ms
- <strong>max_epoch:</strong> # number of training epoch
- <strong>lr:</strong> # learning rate
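As a sketch of how these parameters are typically passed (hypothetical values; it mirrors the `modelscope_finetune(params)` pattern used by the `finetune.py` scripts elsewhere in this commit):
```python
# Hypothetical finetuning configuration mirroring the parameter list above.
params = {}
params["output_dir"] = "./checkpoint"   # result dir
params["data_dir"] = "./data"           # contains train/ and validation/ wav.scp + text
params["dataset_type"] = "small"        # use "large" for datasets over 1000 hours
params["batch_bins"] = 2000             # feature frames (small) or duration in ms (large)
params["max_epoch"] = 50
params["lr"] = 0.00005
# depending on the recipe, params["model"] / params["model_revision"] may also be required
modelscope_finetune(params)             # defined in finetune.py
```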
@ -20,11 +21,38 @@
Or you can use the finetuned model for inference directly.
- Setting parameters in `infer.py`
- <strong>data_dir:</strong> # the dataset dir
- Setting parameters in `infer.sh`
- <strong>model:</strong> # model name on ModelScope
- <strong>data_dir:</strong> # the dataset dir, which needs to include `test/wav.scp`. If `test/text` also exists, the CER will be computed (see the data layout sketch at the end of this section)
- <strong>output_dir:</strong> # result dir
- <strong>batch_size:</strong> # batch size for inference
- <strong>gpu_inference:</strong> # whether to perform GPU decoding; set false for CPU decoding
- <strong>gpuid_list:</strong> # the GPUs to use, e.g., gpuid_list="0,1"
- <strong>njob:</strong> # the number of jobs for CPU decoding; if `gpu_inference` is false, CPU decoding is used and `njob` controls the number of parallel jobs
- Then you can run the pipeline to infer with:
```shell
python infer.py
sh infer.sh
```
- Results
The decoding results can be found in `$output_dir/1best_recog/text.cer`, which includes the recognition result of each sample and the CER of the whole test set.
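A minimal sketch of the expected test-set layout (assumption: Kaldi-style two-column files, an utterance id followed by the wav path or the reference transcript, which is the format the `wav.scp`/`text` files in these recipes use):
```python
# Create a toy test set under data/test (hypothetical utterance ids and paths).
import os

os.makedirs("data/test", exist_ok=True)
with open("data/test/wav.scp", "w", encoding="utf-8") as f:
    f.write("utt_001 /path/to/utt_001.wav\n")
    f.write("utt_002 /path/to/utt_002.wav\n")
with open("data/test/text", "w", encoding="utf-8") as f:  # optional, enables CER
    f.write("utt_001 reference transcript of utt 001\n")
    f.write("utt_002 reference transcript of utt 002\n")
```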
### Inference using local finetuned model
- Modify inference related parameters in `infer_after_finetune.py`
- <strong>modelscope_model_name: </strong> # model name on ModelScope
- <strong>output_dir:</strong> # result dir
- <strong>data_dir:</strong> # the dataset dir, which needs to include `test/wav.scp`. If `test/text` also exists, the CER will be computed
- <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pb`
- <strong>batch_size:</strong> # batch size for inference
- Then you can run the pipeline to infer with:
```python
python infer_after_finetune.py
```
- Results
The decoding results can be found in `$output_dir/decode_results/text.cer`, which includes the recognition result of each sample and the CER of the whole test set.
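To recompute the CER afterwards, a minimal sketch using the same helper the scripts in this commit rely on (the paths are hypothetical and follow the defaults in `infer_after_finetune.py`):
```python
from funasr.utils.compute_wer import compute_wer

# reference transcripts, recognized text, and the report file to write
compute_wer("./data/test/text",
            "./checkpoint/decode_results/1best_recog/text",
            "./checkpoint/decode_results/text.cer")
```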

View File

@ -1,88 +1,25 @@
import os
import shutil
from multiprocessing import Pool
import argparse
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks
from funasr.utils.compute_wer import compute_wer
def modelscope_infer_core(output_dir, split_dir, njob, idx):
output_dir_job = os.path.join(output_dir, "output.{}".format(idx))
gpu_id = (int(idx) - 1) // njob
if "CUDA_VISIBLE_DEVICES" in os.environ.keys():
gpu_list = os.environ['CUDA_VISIBLE_DEVICES'].split(",")
os.environ['CUDA_VISIBLE_DEVICES'] = str(gpu_list[gpu_id])
else:
os.environ['CUDA_VISIBLE_DEVICES'] = str(gpu_id)
inference_pipline = pipeline(
def modelscope_infer(args):
os.environ['CUDA_VISIBLE_DEVICES'] = str(args.gpuid)
inference_pipeline = pipeline(
task=Tasks.auto_speech_recognition,
model="damo/speech_paraformer_asr_nat-zh-cn-8k-common-vocab8358-tensorflow1",
output_dir=output_dir_job,
batch_size=64
model=args.model,
output_dir=args.output_dir,
batch_size=args.batch_size,
)
audio_in = os.path.join(split_dir, "wav.{}.scp".format(idx))
inference_pipline(audio_in=audio_in)
def modelscope_infer(params):
# prepare for multi-GPU decoding
ngpu = params["ngpu"]
njob = params["njob"]
output_dir = params["output_dir"]
if os.path.exists(output_dir):
shutil.rmtree(output_dir)
os.mkdir(output_dir)
split_dir = os.path.join(output_dir, "split")
os.mkdir(split_dir)
nj = ngpu * njob
wav_scp_file = os.path.join(params["data_dir"], "wav.scp")
with open(wav_scp_file) as f:
lines = f.readlines()
num_lines = len(lines)
num_job_lines = num_lines // nj
start = 0
for i in range(nj):
end = start + num_job_lines
file = os.path.join(split_dir, "wav.{}.scp".format(str(i + 1)))
with open(file, "w") as f:
if i == nj - 1:
f.writelines(lines[start:])
else:
f.writelines(lines[start:end])
start = end
p = Pool(nj)
for i in range(nj):
p.apply_async(modelscope_infer_core,
args=(output_dir, split_dir, njob, str(i + 1)))
p.close()
p.join()
# combine decoding results
best_recog_path = os.path.join(output_dir, "1best_recog")
os.mkdir(best_recog_path)
files = ["text", "token", "score"]
for file in files:
with open(os.path.join(best_recog_path, file), "w") as f:
for i in range(nj):
job_file = os.path.join(output_dir, "output.{}/1best_recog".format(str(i + 1)), file)
with open(job_file) as f_job:
lines = f_job.readlines()
f.writelines(lines)
# If text exists, compute CER
text_in = os.path.join(params["data_dir"], "text")
if os.path.exists(text_in):
text_proc_file = os.path.join(best_recog_path, "token")
compute_wer(text_in, text_proc_file, os.path.join(best_recog_path, "text.cer"))
inference_pipeline(audio_in=args.audio_in)
if __name__ == "__main__":
params = {}
params["data_dir"] = "./data/test"
params["output_dir"] = "./results"
params["ngpu"] = 1
params["njob"] = 1
modelscope_infer(params)
parser = argparse.ArgumentParser()
parser.add_argument('--model', type=str, default="damo/speech_paraformer_asr_nat-zh-cn-8k-common-vocab8358-tensorflow1")
parser.add_argument('--audio_in', type=str, default="./data/test/wav.scp")
parser.add_argument('--output_dir', type=str, default="./results/")
parser.add_argument('--batch_size', type=int, default=64)
parser.add_argument('--gpuid', type=str, default="0")
args = parser.parse_args()
modelscope_infer(args)

View File

@ -0,0 +1,70 @@
#!/usr/bin/env bash
set -e
set -u
set -o pipefail
stage=1
stop_stage=2
model="damo/speech_paraformer_asr_nat-zh-cn-8k-common-vocab8358-tensorflow1"
data_dir="./data/test"
output_dir="./results"
batch_size=64
gpu_inference=true    # whether to perform GPU decoding
gpuid_list="0,1"    # the GPUs to use, e.g., gpuid_list="0,1"
njob=4    # the number of parallel jobs for CPU decoding (used when gpu_inference=false)
if ${gpu_inference}; then
nj=$(echo $gpuid_list | awk -F "," '{print NF}')
else
nj=$njob
batch_size=1
gpuid_list=""
for JOB in $(seq ${nj}); do
gpuid_list=$gpuid_list"-1,"
done
fi
mkdir -p $output_dir/split
split_scps=""
for JOB in $(seq ${nj}); do
split_scps="$split_scps $output_dir/split/wav.$JOB.scp"
done
perl utils/split_scp.pl ${data_dir}/wav.scp ${split_scps}
if [ $stage -le 1 ] && [ $stop_stage -ge 1 ];then
echo "Decoding ..."
gpuid_list_array=(${gpuid_list//,/ })
for JOB in $(seq ${nj}); do
{
id=$((JOB-1))
gpuid=${gpuid_list_array[$id]}
mkdir -p ${output_dir}/output.$JOB
python infer.py \
--model ${model} \
--audio_in ${output_dir}/split/wav.$JOB.scp \
--output_dir ${output_dir}/output.$JOB \
--batch_size ${batch_size} \
--gpuid ${gpuid}
}&
done
wait
mkdir -p ${output_dir}/1best_recog
for f in token score text; do
if [ -f "${output_dir}/output.1/1best_recog/${f}" ]; then
for i in $(seq "${nj}"); do
cat "${output_dir}/output.${i}/1best_recog/${f}"
done | sort -k1 >"${output_dir}/1best_recog/${f}"
fi
done
fi
if [ $stage -le 2 ] && [ $stop_stage -ge 2 ];then
echo "Computing WER ..."
cp ${output_dir}/1best_recog/text ${output_dir}/1best_recog/text.proc
cp ${data_dir}/text ${output_dir}/1best_recog/text.ref
python utils/compute_wer.py ${output_dir}/1best_recog/text.ref ${output_dir}/1best_recog/text.proc ${output_dir}/1best_recog/text.cer
tail -n 3 ${output_dir}/1best_recog/text.cer
fi

View File

@ -4,23 +4,18 @@ import shutil
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks
from modelscope.hub.snapshot_download import snapshot_download
from funasr.utils.compute_wer import compute_wer
def modelscope_infer_after_finetune(params):
# prepare for decoding
pretrained_model_path = os.path.join(os.environ["HOME"], ".cache/modelscope/hub", params["modelscope_model_name"])
for file_name in params["required_files"]:
if file_name == "configuration.json":
with open(os.path.join(pretrained_model_path, file_name)) as f:
config_dict = json.load(f)
config_dict["model"]["am_model_name"] = params["decoding_model_name"]
with open(os.path.join(params["output_dir"], "configuration.json"), "w") as f:
json.dump(config_dict, f, indent=4, separators=(',', ': '))
else:
shutil.copy(os.path.join(pretrained_model_path, file_name),
os.path.join(params["output_dir"], file_name))
try:
pretrained_model_path = snapshot_download(params["modelscope_model_name"], cache_dir=params["output_dir"])
except BaseException:
raise BaseException(f"Please download pretrain model from ModelScope firstly.")
shutil.copy(os.path.join(params["output_dir"], params["decoding_model_name"]), os.path.join(pretrained_model_path, "model.pb"))
decoding_path = os.path.join(params["output_dir"], "decode_results")
if os.path.exists(decoding_path):
shutil.rmtree(decoding_path)
@ -29,9 +24,9 @@ def modelscope_infer_after_finetune(params):
# decoding
inference_pipeline = pipeline(
task=Tasks.auto_speech_recognition,
model=params["output_dir"],
model=pretrained_model_path,
output_dir=decoding_path,
batch_size=64
batch_size=params["batch_size"]
)
audio_in = os.path.join(params["data_dir"], "wav.scp")
inference_pipeline(audio_in=audio_in)
@ -39,15 +34,15 @@ def modelscope_infer_after_finetune(params):
# compute CER if GT text is set
text_in = os.path.join(params["data_dir"], "text")
if os.path.exists(text_in):
text_proc_file = os.path.join(decoding_path, "1best_recog/token")
text_proc_file = os.path.join(decoding_path, "1best_recog/text")
compute_wer(text_in, text_proc_file, os.path.join(decoding_path, "text.cer"))
if __name__ == '__main__':
params = {}
params["modelscope_model_name"] = "damo/speech_paraformer_asr_nat-zh-cn-8k-common-vocab8358-tensorflow1"
params["required_files"] = ["am.mvn", "decoding.yaml", "configuration.json"]
params["output_dir"] = "./checkpoint"
params["data_dir"] = "./data/test"
params["decoding_model_name"] = "valid.acc.ave_10best.pth"
modelscope_infer_after_finetune(params)
params["decoding_model_name"] = "valid.acc.ave_10best.pb"
params["batch_size"] = 64
modelscope_infer_after_finetune(params)

View File

@ -0,0 +1 @@
../../../../egs/aishell/transformer/utils

View File

@ -41,7 +41,7 @@ The decoding results can be found in `$output_dir/1best_recog/text.cer`, which i
- Modify inference related parameters in `infer_after_finetune.py`
- <strong>output_dir:</strong> # result dir
- <strong>data_dir:</strong> # the dataset dir, which needs to include `test/wav.scp`. If `test/text` also exists, the CER will be computed
- <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pth`
- <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pb`
- Then you can run the pipeline to infer with:
```python

View File

@ -50,5 +50,5 @@ if __name__ == '__main__':
params["required_files"] = ["am.mvn", "decoding.yaml", "configuration.json"]
params["output_dir"] = "./checkpoint"
params["data_dir"] = "./data/test"
params["decoding_model_name"] = "20epoch.pth"
params["decoding_model_name"] = "20epoch.pb"
modelscope_infer_after_finetune(params)

View File

@ -41,7 +41,7 @@ The decoding results can be found in `$output_dir/1best_recog/text.cer`, which i
- Modify inference related parameters in `infer_after_finetune.py`
- <strong>output_dir:</strong> # result dir
- <strong>data_dir:</strong> # the dataset dir, which needs to include `test/wav.scp`. If `test/text` also exists, the CER will be computed
- <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pth`
- <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pb`
- Then you can run the pipeline to infer with:
```python

View File

@ -50,5 +50,5 @@ if __name__ == '__main__':
params["required_files"] = ["am.mvn", "decoding.yaml", "configuration.json"]
params["output_dir"] = "./checkpoint"
params["data_dir"] = "./data/test"
params["decoding_model_name"] = "20epoch.pth"
params["decoding_model_name"] = "20epoch.pb"
modelscope_infer_after_finetune(params)

View File

@ -0,0 +1,35 @@
import os
from modelscope.metainfo import Trainers
from modelscope.trainers import build_trainer
from funasr.datasets.ms_dataset import MsDataset
def modelscope_finetune(params):
if not os.path.exists(params["output_dir"]):
os.makedirs(params["output_dir"], exist_ok=True)
# dataset split ["train", "validation"]
ds_dict = MsDataset.load(params["data_dir"])
kwargs = dict(
model=params["model"],
model_revision=params["model_revision"],
data_dir=ds_dict,
dataset_type=params["dataset_type"],
work_dir=params["output_dir"],
batch_bins=params["batch_bins"],
max_epoch=params["max_epoch"],
lr=params["lr"])
trainer = build_trainer(Trainers.speech_asr_trainer, default_args=kwargs)
trainer.train()
if __name__ == '__main__':
params = {}
params["output_dir"] = "./checkpoint"
params["data_dir"] = "./data"
params["batch_bins"] = 2000
params["dataset_type"] = "small"
params["max_epoch"] = 50
params["lr"] = 0.00005
params["model"] = "damo/speech_UniASR_asr_2pass-he-16k-common-vocab1085-pytorch"
params["model_revision"] = None
modelscope_finetune(params)

View File

@ -0,0 +1,13 @@
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks
if __name__ == "__main__":
audio_in = "https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/asr_example_he.wav"
output_dir = "./results"
inference_pipline = pipeline(
task=Tasks.auto_speech_recognition,
model="damo/speech_UniASR_asr_2pass-he-16k-common-vocab1085-pytorch",
output_dir=output_dir,
)
rec_result = inference_pipline(audio_in=audio_in, param_dict={"decoding_model":"offline"})
print(rec_result)

View File

@ -41,7 +41,7 @@ The decoding results can be found in `$output_dir/1best_recog/text.cer`, which i
- Modify inference related parameters in `infer_after_finetune.py`
- <strong>output_dir:</strong> # result dir
- <strong>data_dir:</strong> # the dataset dir, which needs to include `test/wav.scp`. If `test/text` also exists, the CER will be computed
- <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pth`
- <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pb`
- Then you can run the pipeline to infer with:
```python

View File

@ -50,5 +50,5 @@ if __name__ == '__main__':
params["required_files"] = ["am.mvn", "decoding.yaml", "configuration.json"]
params["output_dir"] = "./checkpoint"
params["data_dir"] = "./data/test"
params["decoding_model_name"] = "20epoch.pth"
params["decoding_model_name"] = "20epoch.pb"
modelscope_infer_after_finetune(params)

View File

@ -0,0 +1,35 @@
import os
from modelscope.metainfo import Trainers
from modelscope.trainers import build_trainer
from funasr.datasets.ms_dataset import MsDataset
def modelscope_finetune(params):
if not os.path.exists(params["output_dir"]):
os.makedirs(params["output_dir"], exist_ok=True)
# dataset split ["train", "validation"]
ds_dict = MsDataset.load(params["data_dir"])
kwargs = dict(
model=params["model"],
model_revision=params["model_revision"],
data_dir=ds_dict,
dataset_type=params["dataset_type"],
work_dir=params["output_dir"],
batch_bins=params["batch_bins"],
max_epoch=params["max_epoch"],
lr=params["lr"])
trainer = build_trainer(Trainers.speech_asr_trainer, default_args=kwargs)
trainer.train()
if __name__ == '__main__':
params = {}
params["output_dir"] = "./checkpoint"
params["data_dir"] = "./data"
params["batch_bins"] = 2000
params["dataset_type"] = "small"
params["max_epoch"] = 50
params["lr"] = 0.00005
params["model"] = "damo/speech_UniASR_asr_2pass-my-16k-common-vocab696-pytorch"
params["model_revision"] = None
modelscope_finetune(params)

View File

@ -0,0 +1,13 @@
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks
if __name__ == "__main__":
audio_in = "https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/asr_example_my.wav"
output_dir = "./results"
inference_pipline = pipeline(
task=Tasks.auto_speech_recognition,
model="damo/speech_UniASR_asr_2pass-my-16k-common-vocab696-pytorch",
output_dir=output_dir,
)
rec_result = inference_pipline(audio_in=audio_in, param_dict={"decoding_model":"offline"})
print(rec_result)

View File

@ -0,0 +1,35 @@
import os
from modelscope.metainfo import Trainers
from modelscope.trainers import build_trainer
from funasr.datasets.ms_dataset import MsDataset
def modelscope_finetune(params):
if not os.path.exists(params["output_dir"]):
os.makedirs(params["output_dir"], exist_ok=True)
# dataset split ["train", "validation"]
ds_dict = MsDataset.load(params["data_dir"])
kwargs = dict(
model=params["model"],
model_revision=params["model_revision"],
data_dir=ds_dict,
dataset_type=params["dataset_type"],
work_dir=params["output_dir"],
batch_bins=params["batch_bins"],
max_epoch=params["max_epoch"],
lr=params["lr"])
trainer = build_trainer(Trainers.speech_asr_trainer, default_args=kwargs)
trainer.train()
if __name__ == '__main__':
params = {}
params["output_dir"] = "./checkpoint"
params["data_dir"] = "./data"
params["batch_bins"] = 2000
params["dataset_type"] = "small"
params["max_epoch"] = 50
params["lr"] = 0.00005
params["model"] = "damo/speech_UniASR_asr_2pass-ur-16k-common-vocab877-pytorch"
params["model_revision"] = None
modelscope_finetune(params)

View File

@ -0,0 +1,13 @@
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks
if __name__ == "__main__":
audio_in = "https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/asr_example_ur.wav"
output_dir = "./results"
inference_pipline = pipeline(
task=Tasks.auto_speech_recognition,
model="damo/speech_UniASR_asr_2pass-ur-16k-common-vocab877-pytorch",
output_dir=output_dir,
)
rec_result = inference_pipline(audio_in=audio_in, param_dict={"decoding_model":"offline"})
print(rec_result)

View File

@ -41,7 +41,7 @@ The decoding results can be found in `$output_dir/1best_recog/text.cer`, which i
- Modify inference related parameters in `infer_after_finetune.py`
- <strong>output_dir:</strong> # result dir
- <strong>data_dir:</strong> # the dataset dir, which needs to include `test/wav.scp`. If `test/text` also exists, the CER will be computed
- <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pth`
- <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pb`
- Then you can run the pipeline to infer with:
```python

View File

@ -75,7 +75,7 @@ def modelscope_infer(params):
# If text exists, compute CER
text_in = os.path.join(params["data_dir"], "text")
if os.path.exists(text_in):
text_proc_file = os.path.join(best_recog_path, "token")
text_proc_file = os.path.join(best_recog_path, "text")
compute_wer(text_in, text_proc_file, os.path.join(best_recog_path, "text.cer"))

View File

@ -39,7 +39,7 @@ def modelscope_infer_after_finetune(params):
# compute CER if GT text is set
text_in = os.path.join(params["data_dir"], "text")
if os.path.exists(text_in):
text_proc_file = os.path.join(decoding_path, "1best_recog/token")
text_proc_file = os.path.join(decoding_path, "1best_recog/text")
compute_wer(text_in, text_proc_file, os.path.join(decoding_path, "text.cer"))
@ -49,5 +49,5 @@ if __name__ == '__main__':
params["required_files"] = ["am.mvn", "decoding.yaml", "configuration.json"]
params["output_dir"] = "./checkpoint"
params["data_dir"] = "./data/test"
params["decoding_model_name"] = "20epoch.pth"
params["decoding_model_name"] = "20epoch.pb"
modelscope_infer_after_finetune(params)

View File

@ -41,7 +41,8 @@ The decoding results can be found in `$output_dir/1best_recog/text.cer`, which i
- Modify inference related parameters in `infer_after_finetune.py`
- <strong>output_dir:</strong> # result dir
- <strong>data_dir:</strong> # the dataset dir, which needs to include `test/wav.scp`. If `test/text` also exists, the CER will be computed
- <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pth`
- <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pb`
- Then you can run the pipeline to infer with:
```python

View File

@ -75,7 +75,7 @@ def modelscope_infer(params):
# If text exists, compute CER
text_in = os.path.join(params["data_dir"], "text")
if os.path.exists(text_in):
text_proc_file = os.path.join(best_recog_path, "token")
text_proc_file = os.path.join(best_recog_path, "text")
compute_wer(text_in, text_proc_file, os.path.join(best_recog_path, "text.cer"))

View File

@ -39,7 +39,7 @@ def modelscope_infer_after_finetune(params):
# compute CER if GT text is set
text_in = os.path.join(params["data_dir"], "text")
if os.path.exists(text_in):
text_proc_file = os.path.join(decoding_path, "1best_recog/token")
text_proc_file = os.path.join(decoding_path, "1best_recog/text")
compute_wer(text_in, text_proc_file, os.path.join(decoding_path, "text.cer"))
@ -49,5 +49,5 @@ if __name__ == '__main__':
params["required_files"] = ["am.mvn", "decoding.yaml", "configuration.json"]
params["output_dir"] = "./checkpoint"
params["data_dir"] = "./data/test"
params["decoding_model_name"] = "20epoch.pth"
params["decoding_model_name"] = "20epoch.pb"
modelscope_infer_after_finetune(params)

View File

@ -34,7 +34,7 @@ Or you can use the finetuned model for inference directly.
- Modify inference related parameters in `infer_after_finetune.py`
- <strong>output_dir:</strong> # result dir
- <strong>data_dir:</strong> # the dataset dir, which needs to include `test/wav.scp`. If `test/text` also exists, the CER will be computed
- <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pth`
- <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pb`
- Then you can run the pipeline to infer with:
```python

View File

@ -4,27 +4,17 @@ import shutil
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks
from modelscope.hub.snapshot_download import snapshot_download
from funasr.utils.compute_wer import compute_wer
def modelscope_infer_after_finetune(params):
# prepare for decoding
if not os.path.exists(os.path.join(params["output_dir"], "punc")):
os.makedirs(os.path.join(params["output_dir"], "punc"))
if not os.path.exists(os.path.join(params["output_dir"], "vad")):
os.makedirs(os.path.join(params["output_dir"], "vad"))
pretrained_model_path = os.path.join(os.environ["HOME"], ".cache/modelscope/hub", params["modelscope_model_name"])
for file_name in params["required_files"]:
if file_name == "configuration.json":
with open(os.path.join(pretrained_model_path, file_name)) as f:
config_dict = json.load(f)
config_dict["model"]["am_model_name"] = params["decoding_model_name"]
with open(os.path.join(params["output_dir"], "configuration.json"), "w") as f:
json.dump(config_dict, f, indent=4, separators=(',', ': '))
else:
shutil.copy(os.path.join(pretrained_model_path, file_name),
os.path.join(params["output_dir"], file_name))
try:
pretrained_model_path = snapshot_download(params["modelscope_model_name"], cache_dir=params["output_dir"])
except BaseException:
raise BaseException(f"Please download pretrain model from ModelScope firstly.")shutil.copy(os.path.join(params["output_dir"], params["decoding_model_name"]), os.path.join(pretrained_model_path, "model.pb"))
decoding_path = os.path.join(params["output_dir"], "decode_results")
if os.path.exists(decoding_path):
shutil.rmtree(decoding_path)
@ -33,16 +23,16 @@ def modelscope_infer_after_finetune(params):
# decoding
inference_pipeline = pipeline(
task=Tasks.auto_speech_recognition,
model=params["output_dir"],
model=pretrained_model_path,
output_dir=decoding_path,
batch_size=64
batch_size=params["batch_size"]
)
audio_in = os.path.join(params["data_dir"], "wav.scp")
inference_pipeline(audio_in=audio_in)
# compute CER if GT text is set
text_in = os.path.join(params["data_dir"], "text")
if text_in is not None:
if os.path.exists(text_in):
text_proc_file = os.path.join(decoding_path, "1best_recog/token")
compute_wer(text_in, text_proc_file, os.path.join(decoding_path, "text.cer"))
@ -50,8 +40,8 @@ def modelscope_infer_after_finetune(params):
if __name__ == '__main__':
params = {}
params["modelscope_model_name"] = "damo/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch"
params["required_files"] = ["am.mvn", "decoding.yaml", "configuration.json", "punc/punc.pb", "punc/punc.yaml", "vad/vad.mvn", "vad/vad.pb", "vad/vad.yaml"]
params["output_dir"] = "./checkpoint"
params["data_dir"] = "./data/test"
params["decoding_model_name"] = "valid.acc.ave_10best.pth"
modelscope_infer_after_finetune(params)
params["decoding_model_name"] = "valid.acc.ave_10best.pb"
params["batch_size"] = 64
modelscope_infer_after_finetune(params)

View File

@ -4,22 +4,23 @@ inputs = "跨境河流是养育沿岸|人民的生命之源长期以来为帮助
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks
from modelscope.utils.logger import get_logger
import logging
logger = get_logger(log_level=logging.CRITICAL)
logger.setLevel(logging.CRITICAL)
inference_pipeline = pipeline(
task=Tasks.punctuation,
model='damo/punc_ct-transformer_zh-cn-common-vad_realtime-vocab272727',
model_revision="v1.0.0",
output_dir="./tmp/"
)
vads = inputs.split("|")
cache_out = []
rec_result_all="outputs:"
param_dict = {"cache": []}
for vad in vads:
rec_result = inference_pipeline(text_in=vad, cache=cache_out)
#print(rec_result)
cache_out = rec_result['cache']
rec_result = inference_pipeline(text_in=vad, param_dict=param_dict)
rec_result_all += rec_result['text']
print(rec_result_all)

View File

@ -0,0 +1,10 @@
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks
inference_diar_pipline = pipeline(
task=Tasks.speaker_diarization,
model='damo/speech_diarization_eend-ola-en-us-callhome-8k',
model_revision="v1.0.0",
)
results = inference_diar_pipline(audio_in=["https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_data/record2.wav"])
print(results)

View File

@ -7,7 +7,7 @@ if __name__ == '__main__':
inference_pipline = pipeline(
task=Tasks.voice_activity_detection,
model="damo/speech_fsmn_vad_zh-cn-16k-common-pytorch",
model_revision=None,
model_revision='v1.2.0',
output_dir=output_dir,
batch_size=1,
)

View File

@ -1,16 +1,20 @@
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks
from modelscope.utils.logger import get_logger
import logging
logger = get_logger(log_level=logging.CRITICAL)
logger.setLevel(logging.CRITICAL)
import soundfile
if __name__ == '__main__':
output_dir = None
inference_pipline = pipeline(
task=Tasks.voice_activity_detection,
model="damo/speech_fsmn_vad_zh-cn-16k-common-pytorch",
model_revision='v1.1.9',
output_dir=None,
model_revision='v1.2.0',
output_dir=output_dir,
batch_size=1,
mode='online',
)
speech, sample_rate = soundfile.read("./vad_example_16k.wav")
speech_length = speech.shape[0]
@ -18,7 +22,7 @@ if __name__ == '__main__':
sample_offset = 0
step = 160 * 10
param_dict = {'in_cache': dict()}
param_dict = {'in_cache': dict(), 'max_end_sil': 800}
for sample_offset in range(0, speech_length, min(step, speech_length - sample_offset)):
if sample_offset + step >= speech_length - 1:
step = speech_length - sample_offset

View File

@ -7,8 +7,8 @@ if __name__ == '__main__':
inference_pipline = pipeline(
task=Tasks.voice_activity_detection,
model="damo/speech_fsmn_vad_zh-cn-8k-common",
model_revision=None,
output_dir='./output_dir',
model_revision='v1.2.0',
output_dir=output_dir,
batch_size=1,
)
segments_result = inference_pipline(audio_in=audio_in)

View File

@ -1,16 +1,20 @@
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks
from modelscope.utils.logger import get_logger
import logging
logger = get_logger(log_level=logging.CRITICAL)
logger.setLevel(logging.CRITICAL)
import soundfile
if __name__ == '__main__':
output_dir = None
inference_pipline = pipeline(
task=Tasks.voice_activity_detection,
model="damo/speech_fsmn_vad_zh-cn-8k-common",
model_revision='v1.1.9',
output_dir='./output_dir',
model_revision='v1.2.0',
output_dir=output_dir,
batch_size=1,
mode='online',
)
speech, sample_rate = soundfile.read("./vad_example_8k.wav")
speech_length = speech.shape[0]
@ -18,7 +22,7 @@ if __name__ == '__main__':
sample_offset = 0
step = 80 * 10
param_dict = {'in_cache': dict()}
param_dict = {'in_cache': dict(), 'max_end_sil': 800}
for sample_offset in range(0, speech_length, min(step, speech_length - sample_offset)):
if sample_offset + step >= speech_length - 1:
step = speech_length - sample_offset

View File

@ -52,7 +52,7 @@ class Speech2Text:
Examples:
>>> import soundfile
>>> speech2text = Speech2Text("asr_config.yml", "asr.pth")
>>> speech2text = Speech2Text("asr_config.yml", "asr.pb")
>>> audio, rate = soundfile.read("speech.wav")
>>> speech2text(audio)
[(text, token, token_int, hypothesis object), ...]

View File

@ -216,6 +216,9 @@ def inference_launch(**kwargs):
elif mode == "paraformer":
from funasr.bin.asr_inference_paraformer import inference_modelscope
return inference_modelscope(**kwargs)
elif mode == "paraformer_streaming":
from funasr.bin.asr_inference_paraformer_streaming import inference_modelscope
return inference_modelscope(**kwargs)
elif mode == "paraformer_vad":
from funasr.bin.asr_inference_paraformer_vad import inference_modelscope
return inference_modelscope(**kwargs)

View File

@ -41,8 +41,6 @@ from funasr.utils.types import str_or_none
from funasr.utils import asr_utils, wav_utils, postprocess_utils
import pdb
header_colors = '\033[95m'
end_colors = '\033[0m'
global_asr_language: str = 'zh-cn'
global_sample_rate: Union[int, Dict[Any, int]] = {
@ -55,7 +53,7 @@ class Speech2Text:
Examples:
>>> import soundfile
>>> speech2text = Speech2Text("asr_config.yml", "asr.pth")
>>> speech2text = Speech2Text("asr_config.yml", "asr.pb")
>>> audio, rate = soundfile.read("speech.wav")
>>> speech2text(audio)
[(text, token, token_int, hypothesis object), ...]

View File

@ -43,6 +43,7 @@ from funasr.models.frontend.wav_frontend import WavFrontend
from funasr.models.e2e_asr_paraformer import BiCifParaformer, ContextualParaformer
from funasr.export.models.e2e_asr_paraformer import Paraformer as Paraformer_export
from funasr.utils.timestamp_tools import ts_prediction_lfr6_standard
from funasr.bin.tp_inference import SpeechText2Timestamp
class Speech2Text:
@ -50,7 +51,7 @@ class Speech2Text:
Examples:
>>> import soundfile
>>> speech2text = Speech2Text("asr_config.yml", "asr.pth")
>>> speech2text = Speech2Text("asr_config.yml", "asr.pb")
>>> audio, rate = soundfile.read("speech.wav")
>>> speech2text(audio)
[(text, token, token_int, hypothesis object), ...]
@ -540,7 +541,8 @@ def inference(
ngram_weight: float = 0.9,
nbest: int = 1,
num_workers: int = 1,
timestamp_infer_config: Union[Path, str] = None,
timestamp_model_file: Union[Path, str] = None,
**kwargs,
):
inference_pipeline = inference_modelscope(
@ -604,6 +606,8 @@ def inference_modelscope(
nbest: int = 1,
num_workers: int = 1,
output_dir: Optional[str] = None,
timestamp_infer_config: Union[Path, str] = None,
timestamp_model_file: Union[Path, str] = None,
param_dict: dict = None,
**kwargs,
):
@ -661,6 +665,15 @@ def inference_modelscope(
else:
speech2text = Speech2Text(**speech2text_kwargs)
if timestamp_model_file is not None:
speechtext2timestamp = SpeechText2Timestamp(
timestamp_cmvn_file=cmvn_file,
timestamp_model_file=timestamp_model_file,
timestamp_infer_config=timestamp_infer_config,
)
else:
speechtext2timestamp = None
def _forward(
data_path_and_name_and_type,
raw_inputs: Union[np.ndarray, torch.Tensor] = None,
@ -744,7 +757,17 @@ def inference_modelscope(
key = keys[batch_id]
for n, result in zip(range(1, nbest + 1), result):
text, token, token_int, hyp = result[0], result[1], result[2], result[3]
time_stamp = None if len(result) < 5 else result[4]
timestamp = None if len(result) < 5 else result[4]
# conduct timestamp prediction here
# timestamp inference requires token length
# thus following inference cannot be conducted in batch
if timestamp is None and speechtext2timestamp:
ts_batch = {}
ts_batch['speech'] = batch['speech'][batch_id].unsqueeze(0)
ts_batch['speech_lengths'] = torch.tensor([batch['speech_lengths'][batch_id]])
ts_batch['text_lengths'] = torch.tensor([len(token)])
us_alphas, us_peaks = speechtext2timestamp(**ts_batch)
ts_str, timestamp = ts_prediction_lfr6_standard(us_alphas[0], us_peaks[0], token, force_time_shift=-3.0)
# Create a directory: outdir/{n}best_recog
if writer is not None:
ibest_writer = writer[f"{n}best_recog"]
@ -756,25 +779,25 @@ def inference_modelscope(
ibest_writer["rtf"][key] = rtf_cur
if text is not None:
if use_timestamp and time_stamp is not None:
postprocessed_result = postprocess_utils.sentence_postprocess(token, time_stamp)
if use_timestamp and timestamp is not None:
postprocessed_result = postprocess_utils.sentence_postprocess(token, timestamp)
else:
postprocessed_result = postprocess_utils.sentence_postprocess(token)
time_stamp_postprocessed = ""
timestamp_postprocessed = ""
if len(postprocessed_result) == 3:
text_postprocessed, time_stamp_postprocessed, word_lists = postprocessed_result[0], \
text_postprocessed, timestamp_postprocessed, word_lists = postprocessed_result[0], \
postprocessed_result[1], \
postprocessed_result[2]
else:
text_postprocessed, word_lists = postprocessed_result[0], postprocessed_result[1]
item = {'key': key, 'value': text_postprocessed}
if time_stamp_postprocessed != "":
item['time_stamp'] = time_stamp_postprocessed
if timestamp_postprocessed != "":
item['timestamp'] = timestamp_postprocessed
asr_result_list.append(item)
finish_count += 1
# asr_utils.print_progress(finish_count / file_count)
if writer is not None:
ibest_writer["text"][key] = text_postprocessed
ibest_writer["text"][key] = " ".join(word_lists)
logging.info("decoding, utt: {}, predictions: {}".format(key, text))
rtf_avg = "decoding, feature length total: {}, forward_time total: {:.4f}, rtf avg: {:.4f}".format(length_total, forward_time_total, 100 * forward_time_total / (length_total * lfr_factor))

View File

@ -0,0 +1,916 @@
#!/usr/bin/env python3
import argparse
import logging
import sys
import time
import copy
import os
import codecs
import tempfile
import requests
from pathlib import Path
from typing import Optional
from typing import Sequence
from typing import Tuple
from typing import Union
from typing import Dict
from typing import Any
from typing import List
import numpy as np
import torch
from typeguard import check_argument_types
from funasr.fileio.datadir_writer import DatadirWriter
from funasr.modules.beam_search.beam_search import BeamSearchPara as BeamSearch
from funasr.modules.beam_search.beam_search import Hypothesis
from funasr.modules.scorers.ctc import CTCPrefixScorer
from funasr.modules.scorers.length_bonus import LengthBonus
from funasr.modules.subsampling import TooShortUttError
from funasr.tasks.asr import ASRTaskParaformer as ASRTask
from funasr.tasks.lm import LMTask
from funasr.text.build_tokenizer import build_tokenizer
from funasr.text.token_id_converter import TokenIDConverter
from funasr.torch_utils.device_funcs import to_device
from funasr.torch_utils.set_all_random_seed import set_all_random_seed
from funasr.utils import config_argparse
from funasr.utils.cli_utils import get_commandline_args
from funasr.utils.types import str2bool
from funasr.utils.types import str2triple_str
from funasr.utils.types import str_or_none
from funasr.utils import asr_utils, wav_utils, postprocess_utils
from funasr.models.frontend.wav_frontend import WavFrontend
from funasr.models.e2e_asr_paraformer import BiCifParaformer, ContextualParaformer
from funasr.export.models.e2e_asr_paraformer import Paraformer as Paraformer_export
np.set_printoptions(threshold=np.inf)
class Speech2Text:
"""Speech2Text class
Examples:
>>> import soundfile
>>> speech2text = Speech2Text("asr_config.yml", "asr.pth")
>>> audio, rate = soundfile.read("speech.wav")
>>> speech2text(audio)
[(text, token, token_int, hypothesis object), ...]
"""
def __init__(
self,
asr_train_config: Union[Path, str] = None,
asr_model_file: Union[Path, str] = None,
cmvn_file: Union[Path, str] = None,
lm_train_config: Union[Path, str] = None,
lm_file: Union[Path, str] = None,
token_type: str = None,
bpemodel: str = None,
device: str = "cpu",
maxlenratio: float = 0.0,
minlenratio: float = 0.0,
dtype: str = "float32",
beam_size: int = 20,
ctc_weight: float = 0.5,
lm_weight: float = 1.0,
ngram_weight: float = 0.9,
penalty: float = 0.0,
nbest: int = 1,
frontend_conf: dict = None,
hotword_list_or_file: str = None,
**kwargs,
):
assert check_argument_types()
# 1. Build ASR model
scorers = {}
asr_model, asr_train_args = ASRTask.build_model_from_file(
asr_train_config, asr_model_file, cmvn_file, device
)
frontend = None
if asr_train_args.frontend is not None and asr_train_args.frontend_conf is not None:
frontend = WavFrontend(cmvn_file=cmvn_file, **asr_train_args.frontend_conf)
logging.info("asr_model: {}".format(asr_model))
logging.info("asr_train_args: {}".format(asr_train_args))
asr_model.to(dtype=getattr(torch, dtype)).eval()
if asr_model.ctc != None:
ctc = CTCPrefixScorer(ctc=asr_model.ctc, eos=asr_model.eos)
scorers.update(
ctc=ctc
)
token_list = asr_model.token_list
scorers.update(
length_bonus=LengthBonus(len(token_list)),
)
# 2. Build Language model
if lm_train_config is not None:
lm, lm_train_args = LMTask.build_model_from_file(
lm_train_config, lm_file, device
)
scorers["lm"] = lm.lm
# 3. Build ngram model
# ngram is not supported now
ngram = None
scorers["ngram"] = ngram
# 4. Build BeamSearch object
# transducer is not supported now
beam_search_transducer = None
weights = dict(
decoder=1.0 - ctc_weight,
ctc=ctc_weight,
lm=lm_weight,
ngram=ngram_weight,
length_bonus=penalty,
)
beam_search = BeamSearch(
beam_size=beam_size,
weights=weights,
scorers=scorers,
sos=asr_model.sos,
eos=asr_model.eos,
vocab_size=len(token_list),
token_list=token_list,
pre_beam_score_key=None if ctc_weight == 1.0 else "full",
)
beam_search.to(device=device, dtype=getattr(torch, dtype)).eval()
for scorer in scorers.values():
if isinstance(scorer, torch.nn.Module):
scorer.to(device=device, dtype=getattr(torch, dtype)).eval()
logging.info(f"Decoding device={device}, dtype={dtype}")
# 5. [Optional] Build Text converter: e.g. bpe-sym -> Text
if token_type is None:
token_type = asr_train_args.token_type
if bpemodel is None:
bpemodel = asr_train_args.bpemodel
if token_type is None:
tokenizer = None
elif token_type == "bpe":
if bpemodel is not None:
tokenizer = build_tokenizer(token_type=token_type, bpemodel=bpemodel)
else:
tokenizer = None
else:
tokenizer = build_tokenizer(token_type=token_type)
converter = TokenIDConverter(token_list=token_list)
logging.info(f"Text tokenizer: {tokenizer}")
self.asr_model = asr_model
self.asr_train_args = asr_train_args
self.converter = converter
self.tokenizer = tokenizer
# 6. [Optional] Build hotword list from str, local file or url
is_use_lm = lm_weight != 0.0 and lm_file is not None
if (ctc_weight == 0.0 or asr_model.ctc == None) and not is_use_lm:
beam_search = None
self.beam_search = beam_search
logging.info(f"Beam_search: {self.beam_search}")
self.beam_search_transducer = beam_search_transducer
self.maxlenratio = maxlenratio
self.minlenratio = minlenratio
self.device = device
self.dtype = dtype
self.nbest = nbest
self.frontend = frontend
self.encoder_downsampling_factor = 1
if asr_train_args.encoder == "data2vec_encoder" or asr_train_args.encoder_conf["input_layer"] == "conv2d":
self.encoder_downsampling_factor = 4
@torch.no_grad()
def __call__(
self, cache: dict, speech: Union[torch.Tensor, np.ndarray], speech_lengths: Union[torch.Tensor, np.ndarray] = None,
begin_time: int = 0, end_time: int = None,
):
"""Inference
Args:
speech: Input speech data
Returns:
text, token, token_int, hyp
"""
assert check_argument_types()
# Input as audio signal
if isinstance(speech, np.ndarray):
speech = torch.tensor(speech)
if self.frontend is not None:
feats, feats_len = self.frontend.forward(speech, speech_lengths)
feats = to_device(feats, device=self.device)
feats_len = feats_len.int()
self.asr_model.frontend = None
else:
feats = speech
feats_len = speech_lengths
lfr_factor = max(1, (feats.size()[-1] // 80) - 1)
feats_len = cache["encoder"]["stride"] + cache["encoder"]["pad_left"] + cache["encoder"]["pad_right"]
feats = feats[:,cache["encoder"]["start_idx"]:cache["encoder"]["start_idx"]+feats_len,:]
feats_len = torch.tensor([feats_len])
batch = {"speech": feats, "speech_lengths": feats_len, "cache": cache}
# a. To device
batch = to_device(batch, device=self.device)
# b. Forward Encoder
enc, enc_len = self.asr_model.encode_chunk(feats, feats_len, cache)
if isinstance(enc, tuple):
enc = enc[0]
# assert len(enc) == 1, len(enc)
enc_len_batch_total = torch.sum(enc_len).item() * self.encoder_downsampling_factor
predictor_outs = self.asr_model.calc_predictor_chunk(enc, cache)
pre_acoustic_embeds, pre_token_length, alphas, pre_peak_index = predictor_outs[0], predictor_outs[1], \
predictor_outs[2], predictor_outs[3]
pre_token_length = pre_token_length.floor().long()
if torch.max(pre_token_length) < 1:
return []
decoder_outs = self.asr_model.cal_decoder_with_predictor_chunk(enc, pre_acoustic_embeds, cache)
decoder_out = decoder_outs
results = []
b, n, d = decoder_out.size()
for i in range(b):
x = enc[i, :enc_len[i], :]
am_scores = decoder_out[i, :pre_token_length[i], :]
if self.beam_search is not None:
nbest_hyps = self.beam_search(
x=x, am_scores=am_scores, maxlenratio=self.maxlenratio, minlenratio=self.minlenratio
)
nbest_hyps = nbest_hyps[: self.nbest]
else:
yseq = am_scores.argmax(dim=-1)
score = am_scores.max(dim=-1)[0]
score = torch.sum(score, dim=-1)
# pad with mask tokens to ensure compatibility with sos/eos tokens
yseq = torch.tensor(
[self.asr_model.sos] + yseq.tolist() + [self.asr_model.eos], device=yseq.device
)
nbest_hyps = [Hypothesis(yseq=yseq, score=score)]
for hyp in nbest_hyps:
assert isinstance(hyp, (Hypothesis)), type(hyp)
# remove sos/eos and get results
last_pos = -1
if isinstance(hyp.yseq, list):
token_int = hyp.yseq[1:last_pos]
else:
token_int = hyp.yseq[1:last_pos].tolist()
# remove blank symbol id, which is assumed to be 0
token_int = list(filter(lambda x: x != 0 and x != 2, token_int))
# Change integer-ids to tokens
token = self.converter.ids2tokens(token_int)
if self.tokenizer is not None:
text = self.tokenizer.tokens2text(token)
else:
text = None
results.append((text, token, token_int, hyp, enc_len_batch_total, lfr_factor))
# assert check_return_type(results)
return results
class Speech2TextExport:
"""Speech2TextExport class
"""
def __init__(
self,
asr_train_config: Union[Path, str] = None,
asr_model_file: Union[Path, str] = None,
cmvn_file: Union[Path, str] = None,
lm_train_config: Union[Path, str] = None,
lm_file: Union[Path, str] = None,
token_type: str = None,
bpemodel: str = None,
device: str = "cpu",
maxlenratio: float = 0.0,
minlenratio: float = 0.0,
dtype: str = "float32",
beam_size: int = 20,
ctc_weight: float = 0.5,
lm_weight: float = 1.0,
ngram_weight: float = 0.9,
penalty: float = 0.0,
nbest: int = 1,
frontend_conf: dict = None,
hotword_list_or_file: str = None,
**kwargs,
):
# 1. Build ASR model
asr_model, asr_train_args = ASRTask.build_model_from_file(
asr_train_config, asr_model_file, cmvn_file, device
)
frontend = None
if asr_train_args.frontend is not None and asr_train_args.frontend_conf is not None:
frontend = WavFrontend(cmvn_file=cmvn_file, **asr_train_args.frontend_conf)
logging.info("asr_model: {}".format(asr_model))
logging.info("asr_train_args: {}".format(asr_train_args))
asr_model.to(dtype=getattr(torch, dtype)).eval()
token_list = asr_model.token_list
logging.info(f"Decoding device={device}, dtype={dtype}")
# 5. [Optional] Build Text converter: e.g. bpe-sym -> Text
if token_type is None:
token_type = asr_train_args.token_type
if bpemodel is None:
bpemodel = asr_train_args.bpemodel
if token_type is None:
tokenizer = None
elif token_type == "bpe":
if bpemodel is not None:
tokenizer = build_tokenizer(token_type=token_type, bpemodel=bpemodel)
else:
tokenizer = None
else:
tokenizer = build_tokenizer(token_type=token_type)
converter = TokenIDConverter(token_list=token_list)
logging.info(f"Text tokenizer: {tokenizer}")
# self.asr_model = asr_model
self.asr_train_args = asr_train_args
self.converter = converter
self.tokenizer = tokenizer
self.device = device
self.dtype = dtype
self.nbest = nbest
self.frontend = frontend
model = Paraformer_export(asr_model, onnx=False)
self.asr_model = model
@torch.no_grad()
def __call__(
self, speech: Union[torch.Tensor, np.ndarray], speech_lengths: Union[torch.Tensor, np.ndarray] = None
):
"""Inference
Args:
speech: Input speech data
Returns:
text, token, token_int, hyp
"""
assert check_argument_types()
# Input as audio signal
if isinstance(speech, np.ndarray):
speech = torch.tensor(speech)
if self.frontend is not None:
feats, feats_len = self.frontend.forward(speech, speech_lengths)
feats = to_device(feats, device=self.device)
feats_len = feats_len.int()
self.asr_model.frontend = None
else:
feats = speech
feats_len = speech_lengths
enc_len_batch_total = feats_len.sum()
lfr_factor = max(1, (feats.size()[-1] // 80) - 1)
batch = {"speech": feats, "speech_lengths": feats_len}
# a. To device
batch = to_device(batch, device=self.device)
decoder_outs = self.asr_model(**batch)
decoder_out, ys_pad_lens = decoder_outs[0], decoder_outs[1]
results = []
b, n, d = decoder_out.size()
for i in range(b):
am_scores = decoder_out[i, :ys_pad_lens[i], :]
yseq = am_scores.argmax(dim=-1)
score = am_scores.max(dim=-1)[0]
score = torch.sum(score, dim=-1)
# pad with mask tokens to ensure compatibility with sos/eos tokens
yseq = torch.tensor(
yseq.tolist(), device=yseq.device
)
nbest_hyps = [Hypothesis(yseq=yseq, score=score)]
for hyp in nbest_hyps:
assert isinstance(hyp, (Hypothesis)), type(hyp)
# remove sos/eos and get results
last_pos = -1
if isinstance(hyp.yseq, list):
token_int = hyp.yseq[1:last_pos]
else:
token_int = hyp.yseq[1:last_pos].tolist()
# remove blank symbol id, which is assumed to be 0
token_int = list(filter(lambda x: x != 0 and x != 2, token_int))
# Change integer-ids to tokens
token = self.converter.ids2tokens(token_int)
if self.tokenizer is not None:
text = self.tokenizer.tokens2text(token)
else:
text = None
results.append((text, token, token_int, hyp, enc_len_batch_total, lfr_factor))
return results
def inference(
maxlenratio: float,
minlenratio: float,
batch_size: int,
beam_size: int,
ngpu: int,
ctc_weight: float,
lm_weight: float,
penalty: float,
log_level: Union[int, str],
data_path_and_name_and_type,
asr_train_config: Optional[str],
asr_model_file: Optional[str],
cmvn_file: Optional[str] = None,
raw_inputs: Union[np.ndarray, torch.Tensor] = None,
lm_train_config: Optional[str] = None,
lm_file: Optional[str] = None,
token_type: Optional[str] = None,
key_file: Optional[str] = None,
word_lm_train_config: Optional[str] = None,
bpemodel: Optional[str] = None,
allow_variable_data_keys: bool = False,
streaming: bool = False,
output_dir: Optional[str] = None,
dtype: str = "float32",
seed: int = 0,
ngram_weight: float = 0.9,
nbest: int = 1,
num_workers: int = 1,
**kwargs,
):
inference_pipeline = inference_modelscope(
maxlenratio=maxlenratio,
minlenratio=minlenratio,
batch_size=batch_size,
beam_size=beam_size,
ngpu=ngpu,
ctc_weight=ctc_weight,
lm_weight=lm_weight,
penalty=penalty,
log_level=log_level,
asr_train_config=asr_train_config,
asr_model_file=asr_model_file,
cmvn_file=cmvn_file,
raw_inputs=raw_inputs,
lm_train_config=lm_train_config,
lm_file=lm_file,
token_type=token_type,
key_file=key_file,
word_lm_train_config=word_lm_train_config,
bpemodel=bpemodel,
allow_variable_data_keys=allow_variable_data_keys,
streaming=streaming,
output_dir=output_dir,
dtype=dtype,
seed=seed,
ngram_weight=ngram_weight,
nbest=nbest,
num_workers=num_workers,
**kwargs,
)
return inference_pipeline(data_path_and_name_and_type, raw_inputs)
def inference_modelscope(
maxlenratio: float,
minlenratio: float,
batch_size: int,
beam_size: int,
ngpu: int,
ctc_weight: float,
lm_weight: float,
penalty: float,
log_level: Union[int, str],
# data_path_and_name_and_type,
asr_train_config: Optional[str],
asr_model_file: Optional[str],
cmvn_file: Optional[str] = None,
lm_train_config: Optional[str] = None,
lm_file: Optional[str] = None,
token_type: Optional[str] = None,
key_file: Optional[str] = None,
word_lm_train_config: Optional[str] = None,
bpemodel: Optional[str] = None,
allow_variable_data_keys: bool = False,
dtype: str = "float32",
seed: int = 0,
ngram_weight: float = 0.9,
nbest: int = 1,
num_workers: int = 1,
output_dir: Optional[str] = None,
param_dict: dict = None,
**kwargs,
):
assert check_argument_types()
if word_lm_train_config is not None:
raise NotImplementedError("Word LM is not implemented")
if ngpu > 1:
raise NotImplementedError("only single GPU decoding is supported")
logging.basicConfig(
level=log_level,
format="%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s",
)
export_mode = False
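    # export_mode is hard-coded off here, so the standard Speech2Text class is used below;
    # Speech2TextExport is only exercised when this flag is flipped for model export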
if ngpu >= 1 and torch.cuda.is_available():
device = "cuda"
else:
device = "cpu"
batch_size = 1
# 1. Set random-seed
set_all_random_seed(seed)
# 2. Build speech2text
speech2text_kwargs = dict(
asr_train_config=asr_train_config,
asr_model_file=asr_model_file,
cmvn_file=cmvn_file,
lm_train_config=lm_train_config,
lm_file=lm_file,
token_type=token_type,
bpemodel=bpemodel,
device=device,
maxlenratio=maxlenratio,
minlenratio=minlenratio,
dtype=dtype,
beam_size=beam_size,
ctc_weight=ctc_weight,
lm_weight=lm_weight,
ngram_weight=ngram_weight,
penalty=penalty,
nbest=nbest,
)
if export_mode:
speech2text = Speech2TextExport(**speech2text_kwargs)
else:
speech2text = Speech2Text(**speech2text_kwargs)
def _load_bytes(input):
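        # decode raw 16-bit PCM bytes into a float32 waveform scaled to roughly [-1, 1)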
middle_data = np.frombuffer(input, dtype=np.int16)
middle_data = np.asarray(middle_data)
if middle_data.dtype.kind not in 'iu':
raise TypeError("'middle_data' must be an array of integers")
dtype = np.dtype('float32')
if dtype.kind != 'f':
raise TypeError("'dtype' must be a floating point type")
i = np.iinfo(middle_data.dtype)
abs_max = 2 ** (i.bits - 1)
offset = i.min + abs_max
        array = (middle_data.astype(dtype) - offset) / abs_max
return array
def _forward(
data_path_and_name_and_type,
raw_inputs: Union[np.ndarray, torch.Tensor] = None,
output_dir_v2: Optional[str] = None,
fs: dict = None,
param_dict: dict = None,
**kwargs,
):
# 3. Build data-iterator
if data_path_and_name_and_type is not None and data_path_and_name_and_type[2] == "bytes":
raw_inputs = _load_bytes(data_path_and_name_and_type[0])
raw_inputs = torch.tensor(raw_inputs)
if data_path_and_name_and_type is None and raw_inputs is not None:
if isinstance(raw_inputs, np.ndarray):
raw_inputs = torch.tensor(raw_inputs)
is_final = False
        # the streaming cache persists across calls through param_dict; default to an empty
        # dict so a missing "cache" entry does not raise a NameError below
        cache = {}
        if param_dict is not None and "cache" in param_dict:
            cache = param_dict["cache"]
        if param_dict is not None and "is_final" in param_dict:
            is_final = param_dict["is_final"]
        # 7. Start for-loop
        # FIXME(kamo): The output format should be discussed further
asr_result_list = []
results = []
asr_result = ""
wait = True
if len(cache) == 0:
cache["encoder"] = {"start_idx": 0, "pad_left": 0, "stride": 10, "pad_right": 5, "cif_hidden": None, "cif_alphas": None, "is_final": is_final, "left": 0, "right": 0}
cache_de = {"decode_fsmn": None}
cache["decoder"] = cache_de
cache["first_chunk"] = True
cache["speech"] = []
cache["accum_speech"] = 0
if raw_inputs is not None:
if len(cache["speech"]) == 0:
cache["speech"] = raw_inputs
else:
cache["speech"] = torch.cat([cache["speech"], raw_inputs], dim=0)
cache["accum_speech"] += len(raw_inputs)
while cache["accum_speech"] >= 960:
if cache["first_chunk"]:
if cache["accum_speech"] >= 14400:
speech = torch.unsqueeze(cache["speech"], axis=0)
speech_length = torch.tensor([len(cache["speech"])])
cache["encoder"]["pad_left"] = 5
cache["encoder"]["pad_right"] = 5
cache["encoder"]["stride"] = 10
cache["encoder"]["left"] = 5
cache["encoder"]["right"] = 0
results = speech2text(cache, speech, speech_length)
cache["accum_speech"] -= 4800
cache["first_chunk"] = False
cache["encoder"]["start_idx"] = -5
cache["encoder"]["is_final"] = False
wait = False
else:
if is_final:
cache["encoder"]["stride"] = len(cache["speech"]) // 960
cache["encoder"]["pad_left"] = 0
cache["encoder"]["pad_right"] = 0
speech = torch.unsqueeze(cache["speech"], axis=0)
speech_length = torch.tensor([len(cache["speech"])])
results = speech2text(cache, speech, speech_length)
cache["accum_speech"] = 0
wait = False
else:
break
else:
if cache["accum_speech"] >= 19200:
cache["encoder"]["start_idx"] += 10
cache["encoder"]["stride"] = 10
cache["encoder"]["pad_left"] = 5
cache["encoder"]["pad_right"] = 5
cache["encoder"]["left"] = 0
cache["encoder"]["right"] = 0
speech = torch.unsqueeze(cache["speech"], axis=0)
speech_length = torch.tensor([len(cache["speech"])])
results = speech2text(cache, speech, speech_length)
cache["accum_speech"] -= 9600
wait = False
else:
if is_final:
cache["encoder"]["is_final"] = True
if cache["accum_speech"] >= 14400:
cache["encoder"]["start_idx"] += 10
cache["encoder"]["stride"] = 10
cache["encoder"]["pad_left"] = 5
cache["encoder"]["pad_right"] = 5
cache["encoder"]["left"] = 0
cache["encoder"]["right"] = cache["accum_speech"] // 960 - 15
speech = torch.unsqueeze(cache["speech"], axis=0)
speech_length = torch.tensor([len(cache["speech"])])
results = speech2text(cache, speech, speech_length)
cache["accum_speech"] -= 9600
wait = False
else:
cache["encoder"]["start_idx"] += 10
cache["encoder"]["stride"] = cache["accum_speech"] // 960 - 5
cache["encoder"]["pad_left"] = 5
cache["encoder"]["pad_right"] = 0
cache["encoder"]["left"] = 0
cache["encoder"]["right"] = 0
speech = torch.unsqueeze(cache["speech"], axis=0)
speech_length = torch.tensor([len(cache["speech"])])
results = speech2text(cache, speech, speech_length)
cache["accum_speech"] = 0
wait = False
else:
break
if len(results) >= 1:
asr_result += results[0][0]
if asr_result == "":
asr_result = "sil"
if wait:
asr_result = "waiting_for_more_voice"
item = {'key': "utt", 'value': asr_result}
asr_result_list.append(item)
else:
return []
return asr_result_list
return _forward
def get_parser():
parser = config_argparse.ArgumentParser(
description="ASR Decoding",
formatter_class=argparse.ArgumentDefaultsHelpFormatter,
)
# Note(kamo): Use '_' instead of '-' as separator.
# '-' is confusing if written in yaml.
parser.add_argument(
"--log_level",
type=lambda x: x.upper(),
default="INFO",
choices=("CRITICAL", "ERROR", "WARNING", "INFO", "DEBUG", "NOTSET"),
help="The verbose level of logging",
)
parser.add_argument("--output_dir", type=str, required=True)
parser.add_argument(
"--ngpu",
type=int,
default=0,
help="The number of gpus. 0 indicates CPU mode",
)
parser.add_argument("--seed", type=int, default=0, help="Random seed")
parser.add_argument(
"--dtype",
default="float32",
choices=["float16", "float32", "float64"],
help="Data type",
)
parser.add_argument(
"--num_workers",
type=int,
default=1,
help="The number of workers used for DataLoader",
)
parser.add_argument(
"--hotword",
type=str_or_none,
default=None,
help="hotword file path or hotwords seperated by space"
)
group = parser.add_argument_group("Input data related")
group.add_argument(
"--data_path_and_name_and_type",
type=str2triple_str,
required=False,
action="append",
)
group.add_argument("--key_file", type=str_or_none)
group.add_argument("--allow_variable_data_keys", type=str2bool, default=False)
group = parser.add_argument_group("The model configuration related")
group.add_argument(
"--asr_train_config",
type=str,
help="ASR training configuration",
)
group.add_argument(
"--asr_model_file",
type=str,
help="ASR model parameter file",
)
group.add_argument(
"--cmvn_file",
type=str,
help="Global cmvn file",
)
group.add_argument(
"--lm_train_config",
type=str,
help="LM training configuration",
)
group.add_argument(
"--lm_file",
type=str,
help="LM parameter file",
)
group.add_argument(
"--word_lm_train_config",
type=str,
help="Word LM training configuration",
)
group.add_argument(
"--word_lm_file",
type=str,
help="Word LM parameter file",
)
group.add_argument(
"--ngram_file",
type=str,
help="N-gram parameter file",
)
group.add_argument(
"--model_tag",
type=str,
help="Pretrained model tag. If specify this option, *_train_config and "
"*_file will be overwritten",
)
group = parser.add_argument_group("Beam-search related")
group.add_argument(
"--batch_size",
type=int,
default=1,
help="The batch size for inference",
)
group.add_argument("--nbest", type=int, default=1, help="Output N-best hypotheses")
group.add_argument("--beam_size", type=int, default=20, help="Beam size")
group.add_argument("--penalty", type=float, default=0.0, help="Insertion penalty")
group.add_argument(
"--maxlenratio",
type=float,
default=0.0,
help="Input length ratio to obtain max output length. "
"If maxlenratio=0.0 (default), it uses a end-detect "
"function "
"to automatically find maximum hypothesis lengths."
"If maxlenratio<0.0, its absolute value is interpreted"
"as a constant max output length",
)
group.add_argument(
"--minlenratio",
type=float,
default=0.0,
help="Input length ratio to obtain min output length",
)
group.add_argument(
"--ctc_weight",
type=float,
default=0.5,
help="CTC weight in joint decoding",
)
group.add_argument("--lm_weight", type=float, default=1.0, help="RNNLM weight")
group.add_argument("--ngram_weight", type=float, default=0.9, help="ngram weight")
group.add_argument("--streaming", type=str2bool, default=False)
group.add_argument(
"--frontend_conf",
default=None,
help="",
)
group.add_argument("--raw_inputs", type=list, default=None)
# example=[{'key':'EdevDEWdIYQ_0021','file':'/mnt/data/jiangyu.xzy/test_data/speech_io/SPEECHIO_ASR_ZH00007_zhibodaihuo/wav/EdevDEWdIYQ_0021.wav'}])
group = parser.add_argument_group("Text converter related")
group.add_argument(
"--token_type",
type=str_or_none,
default=None,
choices=["char", "bpe", None],
help="The token type for ASR model. "
"If not given, refers from the training args",
)
group.add_argument(
"--bpemodel",
type=str_or_none,
default=None,
help="The model path of sentencepiece. "
"If not given, refers from the training args",
)
return parser
def main(cmd=None):
print(get_commandline_args(), file=sys.stderr)
parser = get_parser()
args = parser.parse_args(cmd)
param_dict = {'hotword': args.hotword}
kwargs = vars(args)
kwargs.pop("config", None)
kwargs['param_dict'] = param_dict
inference(**kwargs)
if __name__ == "__main__":
main()
# from modelscope.pipelines import pipeline
# from modelscope.utils.constant import Tasks
#
# inference_16k_pipline = pipeline(
# task=Tasks.auto_speech_recognition,
# model='damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch')
#
# rec_result = inference_16k_pipline(audio_in='https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/asr_example_zh.wav')
# print(rec_result)
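# A minimal sketch of driving the streaming pipeline directly (hypothetical config/model/cmvn
# paths; `audio_chunks` stands for any iterable of consecutive 1-D float32 waveform chunks at
# 16 kHz). The same `param_dict` must be reused across calls so the encoder/decoder cache is
# carried over, and `is_final=True` flushes the remaining audio:
#
#   pipeline_fn = inference_modelscope(
#       maxlenratio=0.0, minlenratio=0.0, batch_size=1, beam_size=1, ngpu=0,
#       ctc_weight=0.0, lm_weight=0.0, penalty=0.0, log_level="INFO",
#       asr_train_config="exp/config.yaml", asr_model_file="exp/model.pb",
#       cmvn_file="exp/am.mvn")
#   param_dict = {"cache": {}, "is_final": False}
#   for chunk in audio_chunks:
#       results = pipeline_fn(None, raw_inputs=chunk, param_dict=param_dict)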


@ -338,7 +338,7 @@ def inference_modelscope(
ibest_writer["token"][key] = " ".join(token)
ibest_writer["token_int"][key] = " ".join(map(str, token_int))
ibest_writer["vad"][key] = "{}".format(vadsegments)
ibest_writer["text"][key] = text_postprocessed
ibest_writer["text"][key] = " ".join(word_lists)
ibest_writer["text_with_punc"][key] = text_postprocessed_punc
if time_stamp_postprocessed is not None:
ibest_writer["time_stamp"][key] = "{}".format(time_stamp_postprocessed)


@ -58,7 +58,7 @@ class Speech2Text:
Examples:
>>> import soundfile
>>> speech2text = Speech2Text("asr_config.yml", "asr.pth")
>>> speech2text = Speech2Text("asr_config.yml", "asr.pb")
>>> audio, rate = soundfile.read("speech.wav")
>>> speech2text(audio)
[(text, token, token_int, hypothesis object), ...]
@ -292,6 +292,8 @@ class Speech2Text:
# remove blank symbol id, which is assumed to be 0
token_int = list(filter(lambda x: x != 0 and x != 2, token_int))
if len(token_int) == 0:
continue
# Change integer-ids to tokens
token = self.converter.ids2tokens(token_int)
@ -668,7 +670,7 @@ def inference_modelscope(
ibest_writer["token"][key] = " ".join(token)
ibest_writer["token_int"][key] = " ".join(map(str, token_int))
ibest_writer["vad"][key] = "{}".format(vadsegments)
ibest_writer["text"][key] = text_postprocessed
ibest_writer["text"][key] = " ".join(word_lists)
ibest_writer["text_with_punc"][key] = text_postprocessed_punc
if time_stamp_postprocessed is not None:
ibest_writer["time_stamp"][key] = "{}".format(time_stamp_postprocessed)


@ -49,7 +49,7 @@ class Speech2Text:
Examples:
>>> import soundfile
>>> speech2text = Speech2Text("asr_config.yml", "asr.pth")
>>> speech2text = Speech2Text("asr_config.yml", "asr.pb")
>>> audio, rate = soundfile.read("speech.wav")
>>> speech2text(audio)
[(text, token, token_int, hypothesis object), ...]
@ -738,13 +738,13 @@ def inference_modelscope(
ibest_writer["rtf"][key] = rtf_cur
if text is not None:
text_postprocessed, _ = postprocess_utils.sentence_postprocess(token)
text_postprocessed, word_lists = postprocess_utils.sentence_postprocess(token)
item = {'key': key, 'value': text_postprocessed}
asr_result_list.append(item)
finish_count += 1
# asr_utils.print_progress(finish_count / file_count)
if writer is not None:
ibest_writer["text"][key] = text_postprocessed
ibest_writer["text"][key] = " ".join(word_lists)
logging.info("decoding, utt: {}, predictions: {}".format(key, text))
rtf_avg = "decoding, feature length total: {}, forward_time total: {:.4f}, rtf avg: {:.4f}".format(length_total, forward_time_total, 100 * forward_time_total / (length_total * lfr_factor))


@ -37,16 +37,13 @@ from funasr.utils import asr_utils, wav_utils, postprocess_utils
from funasr.models.frontend.wav_frontend import WavFrontend
header_colors = '\033[95m'
end_colors = '\033[0m'
class Speech2Text:
"""Speech2Text class
Examples:
>>> import soundfile
>>> speech2text = Speech2Text("asr_config.yml", "asr.pth")
>>> speech2text = Speech2Text("asr_config.yml", "asr.pb")
>>> audio, rate = soundfile.read("speech.wav")
>>> speech2text(audio)
[(text, token, token_int, hypothesis object), ...]
@ -261,6 +258,7 @@ class Speech2Text:
# Change integer-ids to tokens
token = self.converter.ids2tokens(token_int)
token = list(filter(lambda x: x != "<gbg>", token))
if self.tokenizer is not None:
text = self.tokenizer.tokens2text(token)
@ -506,13 +504,13 @@ def inference_modelscope(
ibest_writer["score"][key] = str(hyp.score)
if text is not None:
text_postprocessed, _ = postprocess_utils.sentence_postprocess(token)
text_postprocessed, word_lists = postprocess_utils.sentence_postprocess(token)
item = {'key': key, 'value': text_postprocessed}
asr_result_list.append(item)
finish_count += 1
asr_utils.print_progress(finish_count / file_count)
if writer is not None:
ibest_writer["text"][key] = text
ibest_writer["text"][key] = " ".join(word_lists)
return asr_result_list
return _forward


@ -46,7 +46,7 @@ class Speech2Text:
Examples:
>>> import soundfile
>>> speech2text = Speech2Text("asr_config.yml", "asr.pth")
>>> speech2text = Speech2Text("asr_config.yml", "asr.pb")
>>> audio, rate = soundfile.read("speech.wav")
>>> speech2text(audio)
[(text, token, token_int, hypothesis object), ...]
@ -261,6 +261,7 @@ class Speech2Text:
# Change integer-ids to tokens
token = self.converter.ids2tokens(token_int)
token = list(filter(lambda x: x != "<gbg>", token))
if self.tokenizer is not None:
text = self.tokenizer.tokens2text(token)
@ -506,13 +507,13 @@ def inference_modelscope(
ibest_writer["score"][key] = str(hyp.score)
if text is not None:
text_postprocessed, _ = postprocess_utils.sentence_postprocess(token)
text_postprocessed, word_lists = postprocess_utils.sentence_postprocess(token)
item = {'key': key, 'value': text_postprocessed}
asr_result_list.append(item)
finish_count += 1
asr_utils.print_progress(finish_count / file_count)
if writer is not None:
ibest_writer["text"][key] = text
ibest_writer["text"][key] = " ".join(word_lists)
return asr_result_list
return _forward


@ -133,7 +133,7 @@ def inference_launch(mode, **kwargs):
param_dict = {
"extract_profile": True,
"sv_train_config": "sv.yaml",
"sv_model_file": "sv.pth",
"sv_model_file": "sv.pb",
}
if "param_dict" in kwargs and kwargs["param_dict"] is not None:
for key in param_dict:
@ -142,6 +142,9 @@ def inference_launch(mode, **kwargs):
else:
kwargs["param_dict"] = param_dict
return inference_modelscope(mode=mode, **kwargs)
elif mode == "eend-ola":
from funasr.bin.eend_ola_inference import inference_modelscope
return inference_modelscope(mode=mode, **kwargs)
else:
logging.info("Unknown decoding mode: {}".format(mode))
return None

funasr/bin/eend_ola_inference.py (new executable file, 427 lines)

@ -0,0 +1,427 @@
#!/usr/bin/env python3
# Copyright FunASR (https://github.com/alibaba-damo-academy/FunASR). All Rights Reserved.
# MIT License (https://opensource.org/licenses/MIT)
import argparse
import logging
import os
import sys
from pathlib import Path
from typing import Any
from typing import List
from typing import Optional
from typing import Sequence
from typing import Tuple
from typing import Union
import numpy as np
import torch
from scipy.signal import medfilt
from typeguard import check_argument_types
from funasr.models.frontend.wav_frontend import WavFrontendMel23
from funasr.tasks.diar import EENDOLADiarTask
from funasr.torch_utils.device_funcs import to_device
from funasr.utils import config_argparse
from funasr.utils.cli_utils import get_commandline_args
from funasr.utils.types import str2bool
from funasr.utils.types import str2triple_str
from funasr.utils.types import str_or_none
class Speech2Diarization:
"""Speech2Diarlization class
Examples:
>>> import soundfile
>>> import numpy as np
>>> speech2diar = Speech2Diarization("diar_sond_config.yml", "diar_sond.pb")
>>> profile = np.load("profiles.npy")
>>> audio, rate = soundfile.read("speech.wav")
>>> speech2diar(audio, profile)
{"spk1": [(int, int), ...], ...}
"""
def __init__(
self,
diar_train_config: Union[Path, str] = None,
diar_model_file: Union[Path, str] = None,
device: str = "cpu",
dtype: str = "float32",
):
assert check_argument_types()
# 1. Build Diarization model
diar_model, diar_train_args = EENDOLADiarTask.build_model_from_file(
config_file=diar_train_config,
model_file=diar_model_file,
device=device
)
frontend = None
if diar_train_args.frontend is not None and diar_train_args.frontend_conf is not None:
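            # the EEND-OLA frontend (presumably 23-dim log-mel features, per WavFrontendMel23)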
frontend = WavFrontendMel23(**diar_train_args.frontend_conf)
# set up seed for eda
np.random.seed(diar_train_args.seed)
torch.manual_seed(diar_train_args.seed)
torch.cuda.manual_seed(diar_train_args.seed)
os.environ['PYTORCH_SEED'] = str(diar_train_args.seed)
logging.info("diar_model: {}".format(diar_model))
logging.info("diar_train_args: {}".format(diar_train_args))
diar_model.to(dtype=getattr(torch, dtype)).eval()
self.diar_model = diar_model
self.diar_train_args = diar_train_args
self.device = device
self.dtype = dtype
self.frontend = frontend
@torch.no_grad()
def __call__(
self,
speech: Union[torch.Tensor, np.ndarray],
speech_lengths: Union[torch.Tensor, np.ndarray] = None
):
"""Inference
Args:
speech: Input speech data
Returns:
diarization results
"""
assert check_argument_types()
# Input as audio signal
if isinstance(speech, np.ndarray):
speech = torch.tensor(speech)
if self.frontend is not None:
feats, feats_len = self.frontend.forward(speech, speech_lengths)
feats = to_device(feats, device=self.device)
feats_len = feats_len.int()
self.diar_model.frontend = None
else:
feats = speech
feats_len = speech_lengths
batch = {"speech": feats, "speech_lengths": feats_len}
batch = to_device(batch, device=self.device)
results = self.diar_model.estimate_sequential(**batch)
return results
@staticmethod
def from_pretrained(
model_tag: Optional[str] = None,
**kwargs: Optional[Any],
):
"""Build Speech2Diarization instance from the pretrained model.
Args:
model_tag (Optional[str]): Model tag of the pretrained models.
Currently, the tags of espnet_model_zoo are supported.
Returns:
Speech2Diarization: Speech2Diarization instance.
"""
if model_tag is not None:
try:
from espnet_model_zoo.downloader import ModelDownloader
except ImportError:
logging.error(
"`espnet_model_zoo` is not installed. "
"Please install via `pip install -U espnet_model_zoo`."
)
raise
d = ModelDownloader()
kwargs.update(**d.download_and_unpack(model_tag))
return Speech2Diarization(**kwargs)
def inference_modelscope(
diar_train_config: str,
diar_model_file: str,
output_dir: Optional[str] = None,
batch_size: int = 1,
dtype: str = "float32",
ngpu: int = 1,
num_workers: int = 0,
log_level: Union[int, str] = "INFO",
key_file: Optional[str] = None,
model_tag: Optional[str] = None,
allow_variable_data_keys: bool = True,
streaming: bool = False,
param_dict: Optional[dict] = None,
**kwargs,
):
assert check_argument_types()
if batch_size > 1:
raise NotImplementedError("batch decoding is not implemented")
if ngpu > 1:
raise NotImplementedError("only single GPU decoding is supported")
logging.basicConfig(
level=log_level,
format="%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s",
)
logging.info("param_dict: {}".format(param_dict))
if ngpu >= 1 and torch.cuda.is_available():
device = "cuda"
else:
device = "cpu"
# 1. Build speech2diar
speech2diar_kwargs = dict(
diar_train_config=diar_train_config,
diar_model_file=diar_model_file,
device=device,
dtype=dtype,
)
logging.info("speech2diarization_kwargs: {}".format(speech2diar_kwargs))
speech2diar = Speech2Diarization.from_pretrained(
model_tag=model_tag,
**speech2diar_kwargs,
)
speech2diar.diar_model.eval()
def output_results_str(results: dict, uttid: str):
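        # format per-speaker segments as RTTM-style lines (frame indices are divided by 100 to
        # get seconds); this helper appears unused by _forward below, which formats results inline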
rst = []
mid = uttid.rsplit("-", 1)[0]
for key in results:
results[key] = [(x[0] / 100, x[1] / 100) for x in results[key]]
template = "SPEAKER {} 0 {:.2f} {:.2f} <NA> <NA> {} <NA> <NA>"
for spk, segs in results.items():
rst.extend([template.format(mid, st, ed, spk) for st, ed in segs])
return "\n".join(rst)
def _forward(
data_path_and_name_and_type: Sequence[Tuple[str, str, str]] = None,
raw_inputs: List[List[Union[np.ndarray, torch.Tensor, str, bytes]]] = None,
output_dir_v2: Optional[str] = None,
param_dict: Optional[dict] = None,
):
# 2. Build data-iterator
if data_path_and_name_and_type is None and raw_inputs is not None:
if isinstance(raw_inputs, torch.Tensor):
raw_inputs = raw_inputs.numpy()
data_path_and_name_and_type = [raw_inputs[0], "speech", "sound"]
loader = EENDOLADiarTask.build_streaming_iterator(
data_path_and_name_and_type,
dtype=dtype,
batch_size=batch_size,
key_file=key_file,
num_workers=num_workers,
preprocess_fn=EENDOLADiarTask.build_preprocess_fn(speech2diar.diar_train_args, False),
collate_fn=EENDOLADiarTask.build_collate_fn(speech2diar.diar_train_args, False),
allow_variable_data_keys=allow_variable_data_keys,
inference=True,
)
# 3. Start for-loop
output_path = output_dir_v2 if output_dir_v2 is not None else output_dir
if output_path is not None:
os.makedirs(output_path, exist_ok=True)
output_writer = open("{}/result.txt".format(output_path), "w")
result_list = []
for keys, batch in loader:
assert isinstance(batch, dict), type(batch)
assert all(isinstance(s, str) for s in keys), keys
_bs = len(next(iter(batch.values())))
assert len(keys) == _bs, f"{len(keys)} != {_bs}"
# batch = {k: v[0] for k, v in batch.items() if not k.endswith("_lengths")}
results = speech2diar(**batch)
# post process
a = results[0][0].cpu().numpy()
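            # per-frame speaker activities, shape (num_frames, num_speakers); median-filter each
            # speaker track over 11 frames to suppress spurious flips before segment extraction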
a = medfilt(a, (11, 1))
rst = []
for spkid, frames in enumerate(a.T):
frames = np.pad(frames, (1, 1), 'constant')
changes, = np.where(np.diff(frames, axis=0) != 0)
fmt = "SPEAKER {:s} 1 {:7.2f} {:7.2f} <NA> <NA> {:s} <NA>"
for s, e in zip(changes[::2], changes[1::2]):
st = s / 10.
dur = (e - s) / 10.
rst.append(fmt.format(keys[0], st, dur, "{}_{}".format(keys[0], str(spkid))))
# Only supporting batch_size==1
value = "\n".join(rst)
item = {"key": keys[0], "value": value}
result_list.append(item)
if output_path is not None:
output_writer.write(value)
output_writer.flush()
if output_path is not None:
output_writer.close()
return result_list
return _forward
def inference(
data_path_and_name_and_type: Sequence[Tuple[str, str, str]],
diar_train_config: Optional[str],
diar_model_file: Optional[str],
output_dir: Optional[str] = None,
batch_size: int = 1,
dtype: str = "float32",
ngpu: int = 0,
seed: int = 0,
num_workers: int = 1,
log_level: Union[int, str] = "INFO",
key_file: Optional[str] = None,
model_tag: Optional[str] = None,
allow_variable_data_keys: bool = True,
streaming: bool = False,
smooth_size: int = 83,
dur_threshold: int = 10,
out_format: str = "vad",
**kwargs,
):
inference_pipeline = inference_modelscope(
diar_train_config=diar_train_config,
diar_model_file=diar_model_file,
output_dir=output_dir,
batch_size=batch_size,
dtype=dtype,
ngpu=ngpu,
seed=seed,
num_workers=num_workers,
log_level=log_level,
key_file=key_file,
model_tag=model_tag,
allow_variable_data_keys=allow_variable_data_keys,
streaming=streaming,
smooth_size=smooth_size,
dur_threshold=dur_threshold,
out_format=out_format,
**kwargs,
)
return inference_pipeline(data_path_and_name_and_type, raw_inputs=None)
def get_parser():
parser = config_argparse.ArgumentParser(
description="Speaker verification/x-vector extraction",
formatter_class=argparse.ArgumentDefaultsHelpFormatter,
)
# Note(kamo): Use '_' instead of '-' as separator.
# '-' is confusing if written in yaml.
parser.add_argument(
"--log_level",
type=lambda x: x.upper(),
default="INFO",
choices=("CRITICAL", "ERROR", "WARNING", "INFO", "DEBUG", "NOTSET"),
help="The verbose level of logging",
)
parser.add_argument("--output_dir", type=str, required=False)
parser.add_argument(
"--ngpu",
type=int,
default=0,
help="The number of gpus. 0 indicates CPU mode",
)
parser.add_argument(
"--gpuid_list",
type=str,
default="",
help="The visible gpus",
)
parser.add_argument("--seed", type=int, default=0, help="Random seed")
parser.add_argument(
"--dtype",
default="float32",
choices=["float16", "float32", "float64"],
help="Data type",
)
parser.add_argument(
"--num_workers",
type=int,
default=1,
help="The number of workers used for DataLoader",
)
group = parser.add_argument_group("Input data related")
group.add_argument(
"--data_path_and_name_and_type",
type=str2triple_str,
required=False,
action="append",
)
group.add_argument("--key_file", type=str_or_none)
group.add_argument("--allow_variable_data_keys", type=str2bool, default=False)
group = parser.add_argument_group("The model configuration related")
group.add_argument(
"--diar_train_config",
type=str,
help="diarization training configuration",
)
group.add_argument(
"--diar_model_file",
type=str,
help="diarization model parameter file",
)
group.add_argument(
"--dur_threshold",
type=int,
default=10,
help="The threshold for short segments in number frames"
)
parser.add_argument(
"--smooth_size",
type=int,
default=83,
help="The smoothing window length in number frames"
)
group.add_argument(
"--model_tag",
type=str,
help="Pretrained model tag. If specify this option, *_train_config and "
"*_file will be overwritten",
)
parser.add_argument(
"--batch_size",
type=int,
default=1,
help="The batch size for inference",
)
parser.add_argument("--streaming", type=str2bool, default=False)
return parser
def main(cmd=None):
print(get_commandline_args(), file=sys.stderr)
parser = get_parser()
args = parser.parse_args(cmd)
kwargs = vars(args)
kwargs.pop("config", None)
logging.info("args: {}".format(kwargs))
if args.output_dir is None:
jobid, n_gpu = 1, 1
gpuid = args.gpuid_list.split(",")[jobid - 1]
else:
jobid = int(args.output_dir.split(".")[-1])
n_gpu = len(args.gpuid_list.split(","))
gpuid = args.gpuid_list.split(",")[(jobid - 1) % n_gpu]
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = gpuid
results_list = inference(**kwargs)
for results in results_list:
print("{} {}".format(results["key"], results["value"]))
if __name__ == "__main__":
main()
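# A sketch of a command-line invocation with hypothetical paths. Note that when --output_dir is
# given, its suffix after the last "." is parsed as the job id for GPU selection, hence "results.1":
#
#   python funasr/bin/eend_ola_inference.py \
#       --diar_train_config exp/diar.yaml --diar_model_file exp/diar.pb \
#       --data_path_and_name_and_type wav.scp,speech,sound \
#       --output_dir ./results.1 --gpuid_list 0 --ngpu 1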


@ -23,7 +23,7 @@ from funasr.torch_utils.set_all_random_seed import set_all_random_seed
from funasr.utils import config_argparse
from funasr.utils.types import str2triple_str
from funasr.utils.types import str_or_none
from funasr.punctuation.text_preprocessor import split_to_mini_sentence
from funasr.datasets.preprocessor import split_to_mini_sentence
class Text2Punc:


@ -23,7 +23,7 @@ from funasr.torch_utils.set_all_random_seed import set_all_random_seed
from funasr.utils import config_argparse
from funasr.utils.types import str2triple_str
from funasr.utils.types import str_or_none
from funasr.punctuation.text_preprocessor import split_to_mini_sentence
from funasr.datasets.preprocessor import split_to_mini_sentence
class Text2Punc:
@ -69,6 +69,7 @@ class Text2Punc:
precache = "".join(cache)
else:
precache = ""
cache = []
data = {"text": precache + text}
result = self.preprocessor(data=data, uid="12938712838719")
split_text = self.preprocessor.pop_split_text_data(result)
@ -225,7 +226,7 @@ def inference_modelscope(
):
results = []
split_size = 10
cache_in = param_dict["cache"]
if raw_inputs != None:
line = raw_inputs.strip()
key = "demo"
@ -233,35 +234,12 @@ def inference_modelscope(
item = {'key': key, 'value': ""}
results.append(item)
return results
#import pdb;pdb.set_trace()
result, _, cache = text2punc(line, cache)
item = {'key': key, 'value': result, 'cache': cache}
result, _, cache = text2punc(line, cache_in)
param_dict["cache"] = cache
item = {'key': key, 'value': result}
results.append(item)
return results
for inference_text, _, _ in data_path_and_name_and_type:
with open(inference_text, "r", encoding="utf-8") as fin:
for line in fin:
line = line.strip()
segs = line.split("\t")
if len(segs) != 2:
continue
key = segs[0]
if len(segs[1]) == 0:
continue
result, _ = text2punc(segs[1])
item = {'key': key, 'value': result}
results.append(item)
output_path = output_dir_v2 if output_dir_v2 is not None else output_dir
if output_path != None:
output_file_name = "infer.out"
Path(output_path).mkdir(parents=True, exist_ok=True)
output_file_path = (Path(output_path) / output_file_name).absolute()
with open(output_file_path, "w", encoding="utf-8") as fout:
for item_i in results:
key_out = item_i["key"]
value_out = item_i["value"]
fout.write(f"{key_out}\t{value_out}\n")
return results
return _forward


@ -42,7 +42,7 @@ class Speech2Diarization:
Examples:
>>> import soundfile
>>> import numpy as np
>>> speech2diar = Speech2Diarization("diar_sond_config.yml", "diar_sond.pth")
>>> speech2diar = Speech2Diarization("diar_sond_config.yml", "diar_sond.pb")
>>> profile = np.load("profiles.npy")
>>> audio, rate = soundfile.read("speech.wav")
>>> speech2diar(audio, profile)


@ -36,7 +36,7 @@ class Speech2Xvector:
Examples:
>>> import soundfile
>>> speech2xvector = Speech2Xvector("sv_config.yml", "sv.pth")
>>> speech2xvector = Speech2Xvector("sv_config.yml", "sv.pb")
>>> audio, rate = soundfile.read("speech.wav")
>>> speech2xvector(audio)
[(text, token, token_int, hypothesis object), ...]
@ -169,7 +169,7 @@ def inference_modelscope(
log_level: Union[int, str] = "INFO",
key_file: Optional[str] = None,
sv_train_config: Optional[str] = "sv.yaml",
sv_model_file: Optional[str] = "sv.pth",
sv_model_file: Optional[str] = "sv.pb",
model_tag: Optional[str] = None,
allow_variable_data_keys: bool = True,
streaming: bool = False,


@ -116,8 +116,8 @@ class SpeechText2Timestamp:
enc = enc[0]
# c. Forward Predictor
_, _, us_alphas, us_cif_peak = self.tp_model.calc_predictor_timestamp(enc, enc_len, text_lengths.to(self.device)+1)
return us_alphas, us_cif_peak
_, _, us_alphas, us_peaks = self.tp_model.calc_predictor_timestamp(enc, enc_len, text_lengths.to(self.device)+1)
return us_alphas, us_peaks
def inference(


@ -1,5 +1,6 @@
import argparse
import logging
import os
import sys
import json
from pathlib import Path
@ -266,7 +267,8 @@ def inference_modelscope(
# do vad segment
_, results = speech2vadsegment(**batch)
for i, _ in enumerate(keys):
results[i] = json.dumps(results[i])
if "MODELSCOPE_ENVIRONMENT" in os.environ and os.environ["MODELSCOPE_ENVIRONMENT"] == "eas":
results[i] = json.dumps(results[i])
item = {'key': keys[i], 'value': results[i]}
vad_results.append(item)
if writer is not None:

Some files were not shown because too many files have changed in this diff.