Mirror of https://github.com/modelscope/FunASR (synced 2025-09-15 14:48:36 +08:00)

Commit 4dc3a1b011: resolve conflict

README.md: 30 changes
@@ -15,36 +15,10 @@
| [**Model Zoo**](https://www.modelscope.cn/models/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary)
| [**Contact**](#contact)


## What's new:

### 2023.2.17, funasr-0.2.0, modelscope-1.3.0

- We support a new feature: exporting Paraformer models to [ONNX and TorchScript](https://github.com/alibaba-damo-academy/FunASR/tree/main/funasr/export) from ModelScope. Local finetuned models are also supported.
- We support a new feature, [onnxruntime](https://github.com/alibaba-damo-academy/FunASR/tree/main/funasr/runtime/python): you can deploy the runtime without ModelScope or FunASR. For the [paraformer-large](https://www.modelscope.cn/models/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary) model, onnxruntime gives a 3x RTF speedup (0.110 -> 0.038) on CPU, [details](https://github.com/alibaba-damo-academy/FunASR/tree/main/funasr/runtime/python/onnxruntime/paraformer/rapid_paraformer#speed).
- We support a new feature, [grpc](https://github.com/alibaba-damo-academy/FunASR/tree/main/funasr/runtime/python/grpc): you can build an ASR service with gRPC by deploying the ModelScope pipeline or onnxruntime.
- We release a new model, [paraformer-large-contextual](https://www.modelscope.cn/models/damo/speech_paraformer-large-contextual_asr_nat-zh-cn-16k-common-vocab8404/summary), which supports hotword customization based on incentive enhancement and improves the recall and precision of hotwords.
- We optimize the timestamp alignment of [Paraformer-large-long](https://modelscope.cn/models/damo/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary); the timestamp prediction accuracy is much improved, achieving an accumulated average shift (AAS) of 74.7 ms, [details](https://arxiv.org/abs/2301.12343).
- We release a new model, the [8k VAD model](https://modelscope.cn/models/damo/speech_fsmn_vad_zh-cn-16k-common-pytorch/summary), which predicts the duration of non-silence speech. It can be freely integrated with any ASR model in [ModelScope](https://github.com/alibaba-damo-academy/FunASR/discussions/134).
- We release a new model, [MFCCA](https://www.modelscope.cn/models/NPU-ASLP/speech_mfcca_asr-zh-cn-16k-alimeeting-vocab4950/summary), a multi-channel multi-speaker model which is independent of the number and geometry of microphones and supports Mandarin meeting transcription.
- We release several new UniASR models:
  [Southern Fujian Dialect model](https://modelscope.cn/models/damo/speech_UniASR_asr_2pass-minnan-16k-common-vocab3825/summary),
  [French model](https://modelscope.cn/models/damo/speech_UniASR_asr_2pass-fr-16k-common-vocab3472-tensorflow1-online/summary),
  [German model](https://modelscope.cn/models/damo/speech_UniASR_asr_2pass-de-16k-common-vocab3690-tensorflow1-online/summary),
  [Vietnamese model](https://modelscope.cn/models/damo/speech_UniASR_asr_2pass-vi-16k-common-vocab1001-pytorch-online/summary),
  [Persian model](https://modelscope.cn/models/damo/speech_UniASR_asr_2pass-fa-16k-common-vocab1257-pytorch-online/summary).
- We release a new model, the [paraformer-data2vec model](https://www.modelscope.cn/models/damo/speech_data2vec_pretrain-paraformer-zh-cn-aishell2-16k/summary), an unsupervised pretraining model trained on AISHELL-2, which is used to initialize a Paraformer model that is then finetuned on AISHELL-1.
- We release a new feature: the `VAD`, `ASR` and `PUNC` models can be integrated freely; they can be models from [ModelScope](https://www.modelscope.cn/models/damo/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary) or local finetuned models (see the [demo](https://github.com/alibaba-damo-academy/FunASR/discussions/134) and the pipeline sketch at the end of this section).
- We optimize the [punctuation common model](https://www.modelscope.cn/models/damo/punc_ct-transformer_zh-cn-common-vocab272727-pytorch/summary), enhancing recall and precision and fixing bad cases of missing punctuation marks.
- Various new audio input types are now supported by the ModelScope inference pipeline, including mp3, flac, ogg, opus, ...

### 2023.1.16, funasr-0.1.6, modelscope-1.2.0

- We release a new model, [Paraformer-large-long](https://modelscope.cn/models/damo/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary), which integrates the [VAD](https://modelscope.cn/models/damo/speech_fsmn_vad_zh-cn-16k-common-pytorch/summary) model, the [ASR](https://www.modelscope.cn/models/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary) model, the [Punctuation](https://www.modelscope.cn/models/damo/punc_ct-transformer_zh-cn-common-vocab272727-pytorch/summary) model and timestamp prediction. The model can take inputs that are several hours long.
- We release a new model, the [16k VAD model](https://modelscope.cn/models/damo/speech_fsmn_vad_zh-cn-16k-common-pytorch/summary), which predicts the duration of non-silence speech. It can be freely integrated with any ASR model in [ModelScope](https://www.modelscope.cn/models/damo/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary).
- We release a new model, [Punctuation](https://www.modelscope.cn/models/damo/punc_ct-transformer_zh-cn-common-vocab272727-pytorch/summary), which predicts punctuation for ASR results. It can be freely integrated with any ASR model in the [Model Zoo](docs/modelscope_models.md).
- We release a new model, [Data2vec](https://www.modelscope.cn/models/damo/speech_data2vec_pretrain-zh-cn-aishell2-16k-pytorch/summary), an unsupervised pretraining model which can be finetuned on ASR and other downstream tasks.
- We release a new model, [Paraformer-Tiny](https://www.modelscope.cn/models/damo/speech_paraformer-tiny-commandword_asr_nat-zh-cn-16k-vocab544-pytorch/summary), a lightweight Paraformer model which supports Mandarin command-word recognition.
- We release a new model, [SV](https://www.modelscope.cn/models/damo/speech_xvector_sv-zh-cn-cnceleb-16k-spk3465-pytorch/summary), which extracts speaker embeddings and performs speaker verification on paired utterances. Speaker diarization will be supported in a future version.
- We improve the ModelScope pipeline to speed up inference by integrating model building into pipeline building.
- Various new audio input types are now supported by the ModelScope inference pipeline, including wav.scp, wav format, audio bytes, wave samples, ...

For the release notes, please refer to [news](https://github.com/alibaba-damo-academy/FunASR/releases)

## Highlights

- Many types of typical models are supported, e.g., [Transformer](https://arxiv.org/abs/1706.03762), [Conformer](https://arxiv.org/abs/2005.08100), [Paraformer](https://arxiv.org/abs/2206.08317).

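A minimal sketch of the integrated VAD + ASR + punctuation usage mentioned in the release notes above. The `pipeline` call mirrors the inference scripts elsewhere in this commit; the audio file name is a placeholder.

```python
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

# Paraformer-large-long bundles VAD, ASR, punctuation and timestamps in one model.
inference_pipeline = pipeline(
    task=Tasks.auto_speech_recognition,
    model="damo/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch",
)

# "long_audio.wav" is a placeholder for any local recording, even several hours long.
rec_result = inference_pipeline(audio_in="long_audio.wav")
print(rec_result)
```

The same `audio_in` argument also accepts a `wav.scp` list or a URL, as the other scripts in this commit show.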
@@ -52,7 +52,7 @@ asr_config=conf/train_asr_conformer.yaml
model_dir="baseline_$(basename "${asr_config}" .yaml)_${feats_type}_${lang}_${token_type}_${tag}"

inference_config=conf/decode_asr_transformer.yaml
-inference_asr_model=valid.acc.ave_10best.pth
+inference_asr_model=valid.acc.ave_10best.pb

# you can set gpu num for decoding here
gpuid_list=$CUDA_VISIBLE_DEVICES # set gpus for decoding, the same as training stage by default

@@ -55,7 +55,7 @@ asr_config=conf/train_asr_paraformer_transformer_12e_6d_3072_768.yaml
model_dir="baseline_$(basename "${asr_config}" .yaml)_${feats_type}_${lang}_${token_type}_${tag}"

inference_config=conf/decode_asr_transformer_noctc_1best.yaml
-inference_asr_model=valid.acc.ave_10best.pth
+inference_asr_model=valid.acc.ave_10best.pb

# you can set gpu num for decoding here
gpuid_list=$CUDA_VISIBLE_DEVICES # set gpus for decoding, the same as training stage by default

@@ -55,7 +55,7 @@ asr_config=conf/train_asr_transformer_12e_6d_3072_768.yaml
model_dir="baseline_$(basename "${asr_config}" .yaml)_${feats_type}_${lang}_${token_type}_${tag}"

inference_config=conf/decode_asr_transformer.yaml
-inference_asr_model=valid.cer_ctc.ave_10best.pth
+inference_asr_model=valid.cer_ctc.ave_10best.pb

# you can set gpu num for decoding here
gpuid_list=$CUDA_VISIBLE_DEVICES # set gpus for decoding, the same as training stage by default

@@ -52,7 +52,7 @@ asr_config=conf/train_asr_paraformer_conformer_12e_6d_2048_256.yaml
model_dir="baseline_$(basename "${asr_config}" .yaml)_${feats_type}_${lang}_${token_type}_${tag}"

inference_config=conf/decode_asr_transformer_noctc_1best.yaml
-inference_asr_model=valid.acc.ave_10best.pth
+inference_asr_model=valid.acc.ave_10best.pb

# you can set gpu num for decoding here
gpuid_list=$CUDA_VISIBLE_DEVICES # set gpus for decoding, the same as training stage by default

@@ -56,7 +56,7 @@ asr_config=conf/train_asr_paraformerbert_conformer_12e_6d_2048_256.yaml
model_dir="baseline_$(basename "${asr_config}" .yaml)_${feats_type}_${lang}_${token_type}_${tag}"

inference_config=conf/decode_asr_transformer_noctc_1best.yaml
-inference_asr_model=valid.acc.ave_10best.pth
+inference_asr_model=valid.acc.ave_10best.pb

# you can set gpu num for decoding here
gpuid_list=$CUDA_VISIBLE_DEVICES # set gpus for decoding, the same as training stage by default

@@ -52,7 +52,7 @@ asr_config=conf/train_asr_conformer.yaml
model_dir="baseline_$(basename "${asr_config}" .yaml)_${feats_type}_${lang}_${token_type}_${tag}"

inference_config=conf/decode_asr_transformer.yaml
-inference_asr_model=valid.acc.ave_10best.pth
+inference_asr_model=valid.acc.ave_10best.pb

# you can set gpu num for decoding here
gpuid_list=$CUDA_VISIBLE_DEVICES # set gpus for decoding, the same as training stage by default

@@ -54,7 +54,7 @@ asr_config=conf/train_asr_conformer.yaml
model_dir="baseline_$(basename "${asr_config}" .yaml)_${feats_type}_${lang}_${token_type}_${tag}"

inference_config=conf/decode_asr_transformer.yaml
-inference_asr_model=valid.acc.ave_10best.pth
+inference_asr_model=valid.acc.ave_10best.pb

# you can set gpu num for decoding here
gpuid_list=$CUDA_VISIBLE_DEVICES # set gpus for decoding, e.g., gpuid_list=2,3, the same as training stage by default

@@ -54,7 +54,7 @@ asr_config=conf/train_asr_paraformer_conformer_20e_1280_320_6d_1280_320.yaml
model_dir="baseline_$(basename "${asr_config}" .yaml)_${feats_type}_${lang}_${token_type}_${tag}"

inference_config=conf/decode_asr_transformer_noctc_1best.yaml
-inference_asr_model=valid.acc.ave_10best.pth
+inference_asr_model=valid.acc.ave_10best.pb

# you can set gpu num for decoding here
gpuid_list=$CUDA_VISIBLE_DEVICES # set gpus for decoding, e.g., gpuid_list=2,3, the same as training stage by default

@@ -58,7 +58,7 @@ asr_config=conf/train_asr_paraformerbert_conformer_20e_6d_1280_320.yaml
model_dir="baseline_$(basename "${asr_config}" .yaml)_${feats_type}_${lang}_${token_type}_${tag}"

inference_config=conf/decode_asr_transformer_noctc_1best.yaml
-inference_asr_model=valid.acc.ave_10best.pth
+inference_asr_model=valid.acc.ave_10best.pb

# you can set gpu num for decoding here
gpuid_list=$CUDA_VISIBLE_DEVICES # set gpus for decoding, e.g., gpuid_list=2,3, the same as training stage by default

@@ -54,7 +54,7 @@ asr_config=conf/train_asr_transformer.yaml
model_dir="baseline_$(basename "${asr_config}" .yaml)_${feats_type}_${lang}_${token_type}_${tag}"

inference_config=conf/decode_asr_transformer.yaml
-inference_asr_model=valid.acc.ave_10best.pth
+inference_asr_model=valid.acc.ave_10best.pb

# you can set gpu num for decoding here
gpuid_list=$CUDA_VISIBLE_DEVICES # set gpus for decoding, e.g., gpuid_list=2,3, the same as training stage by default

@@ -34,7 +34,7 @@ exp_dir=./data
tag=exp1
model_dir="baseline_$(basename "${lm_config}" .yaml)_${lang}_${token_type}_${tag}"
lm_exp=${exp_dir}/exp/${model_dir}
-inference_lm=valid.loss.ave.pth # Language model path for decoding.
+inference_lm=valid.loss.ave.pb # Language model path for decoding.

stage=0
stop_stage=3

@@ -4,7 +4,7 @@ import sys

def main():
    diar_config_path = sys.argv[1] if len(sys.argv) > 1 else "sond_fbank.yaml"
-    diar_model_path = sys.argv[2] if len(sys.argv) > 2 else "sond.pth"
+    diar_model_path = sys.argv[2] if len(sys.argv) > 2 else "sond.pb"
    output_dir = sys.argv[3] if len(sys.argv) > 3 else "./outputs"
    data_path_and_name_and_type = [
        ("data/test_rmsil/feats.scp", "speech", "kaldi_ark"),

@@ -17,9 +17,9 @@ if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
    echo "Downloading Pre-trained model..."
    git clone https://www.modelscope.cn/damo/speech_xvector_sv-zh-cn-cnceleb-16k-spk3465-pytorch.git
    git clone https://www.modelscope.cn/damo/speech_diarization_sond-zh-cn-alimeeting-16k-n16k4-pytorch.git
-    ln -s speech_xvector_sv-zh-cn-cnceleb-16k-spk3465-pytorch/sv.pth ./sv.pth
+    ln -s speech_xvector_sv-zh-cn-cnceleb-16k-spk3465-pytorch/sv.pb ./sv.pb
    cp speech_xvector_sv-zh-cn-cnceleb-16k-spk3465-pytorch/sv.yaml ./sv.yaml
-    ln -s speech_diarization_sond-zh-cn-alimeeting-16k-n16k4-pytorch/sond.pth ./sond.pth
+    ln -s speech_diarization_sond-zh-cn-alimeeting-16k-n16k4-pytorch/sond.pb ./sond.pb
    cp speech_diarization_sond-zh-cn-alimeeting-16k-n16k4-pytorch/sond_fbank.yaml ./sond_fbank.yaml
    cp speech_diarization_sond-zh-cn-alimeeting-16k-n16k4-pytorch/sond.yaml ./sond.yaml
    echo "Done."

@@ -30,7 +30,7 @@ fi

if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
    echo "Calculating diarization results..."
-    python infer_alimeeting_test.py sond_fbank.yaml sond.pth outputs
+    python infer_alimeeting_test.py sond_fbank.yaml sond.pb outputs
    python local/convert_label_to_rttm.py \
        outputs/labels.txt \
        data/test_rmsil/raw_rmsil_map.scp \

@@ -4,7 +4,7 @@ import os

def test_fbank_cpu_infer():
    diar_config_path = "config_fbank.yaml"
-    diar_model_path = "sond.pth"
+    diar_model_path = "sond.pb"
    output_dir = "./outputs"
    data_path_and_name_and_type = [
        ("data/unit_test/test_feats.scp", "speech", "kaldi_ark"),

@@ -24,7 +24,7 @@ def test_fbank_cpu_infer():

def test_fbank_gpu_infer():
    diar_config_path = "config_fbank.yaml"
-    diar_model_path = "sond.pth"
+    diar_model_path = "sond.pb"
    output_dir = "./outputs"
    data_path_and_name_and_type = [
        ("data/unit_test/test_feats.scp", "speech", "kaldi_ark"),

@@ -45,7 +45,7 @@ def test_fbank_gpu_infer():

def test_wav_gpu_infer():
    diar_config_path = "config.yaml"
-    diar_model_path = "sond.pth"
+    diar_model_path = "sond.pb"
    output_dir = "./outputs"
    data_path_and_name_and_type = [
        ("data/unit_test/test_wav.scp", "speech", "sound"),

@@ -66,7 +66,7 @@ def test_wav_gpu_infer():

def test_without_profile_gpu_infer():
    diar_config_path = "config.yaml"
-    diar_model_path = "sond.pth"
+    diar_model_path = "sond.pb"
    output_dir = "./outputs"
    raw_inputs = [[
        "data/unit_test/raw_inputs/record.wav",

@@ -4,7 +4,7 @@ import os

def test_fbank_cpu_infer():
    diar_config_path = "sond_fbank.yaml"
-    diar_model_path = "sond.pth"
+    diar_model_path = "sond.pb"
    output_dir = "./outputs"
    data_path_and_name_and_type = [
        ("data/unit_test/test_feats.scp", "speech", "kaldi_ark"),

@@ -24,7 +24,7 @@ def test_fbank_cpu_infer():

def test_fbank_gpu_infer():
    diar_config_path = "sond_fbank.yaml"
-    diar_model_path = "sond.pth"
+    diar_model_path = "sond.pb"
    output_dir = "./outputs"
    data_path_and_name_and_type = [
        ("data/unit_test/test_feats.scp", "speech", "kaldi_ark"),

@@ -45,7 +45,7 @@ def test_fbank_gpu_infer():

def test_wav_gpu_infer():
    diar_config_path = "config.yaml"
-    diar_model_path = "sond.pth"
+    diar_model_path = "sond.pb"
    output_dir = "./outputs"
    data_path_and_name_and_type = [
        ("data/unit_test/test_wav.scp", "speech", "sound"),

@@ -66,7 +66,7 @@ def test_wav_gpu_infer():

def test_without_profile_gpu_infer():
    diar_config_path = "config.yaml"
-    diar_model_path = "sond.pth"
+    diar_model_path = "sond.pb"
    output_dir = "./outputs"
    raw_inputs = [[
        "data/unit_test/raw_inputs/record.wav",

@@ -49,7 +49,7 @@ asr_config=conf/train_asr_conformer.yaml
model_dir="baseline_$(basename "${asr_config}" .yaml)_${feats_type}_${lang}_${token_type}_${tag}"

inference_config=conf/decode_asr_transformer.yaml
-inference_asr_model=valid.acc.ave_10best.pth
+inference_asr_model=valid.acc.ave_10best.pb

# you can set gpu num for decoding here
gpuid_list=$CUDA_VISIBLE_DEVICES # set gpus for decoding, the same as training stage by default

@@ -41,7 +41,7 @@ The decoding results can be found in `$output_dir/1best_recog/text.cer`, which i
- Modify inference-related parameters in `infer_after_finetune.py`
    - <strong>output_dir:</strong> # result dir
    - <strong>data_dir:</strong> # the dataset dir needs to include `test/wav.scp`. If `test/text` also exists, CER will be computed
-    - <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pth`
+    - <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pb`

- Then you can run the pipeline to infer with:
```python

@@ -48,5 +48,5 @@ if __name__ == '__main__':
    params["required_files"] = ["am.mvn", "decoding.yaml", "configuration.json"]
    params["output_dir"] = "./checkpoint"
    params["data_dir"] = "./data/test"
-    params["decoding_model_name"] = "valid.cer_ctc.ave.pth"
+    params["decoding_model_name"] = "valid.cer_ctc.ave.pb"
    modelscope_infer_after_finetune(params)

@@ -41,7 +41,7 @@ The decoding results can be found in `$output_dir/1best_recog/text.cer`, which i
- Modify inference-related parameters in `infer_after_finetune.py`
    - <strong>output_dir:</strong> # result dir
    - <strong>data_dir:</strong> # the dataset dir needs to include `test/wav.scp`. If `test/text` also exists, CER will be computed
-    - <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pth`
+    - <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pb`

- Then you can run the pipeline to infer with:
```python

@@ -48,5 +48,5 @@ if __name__ == '__main__':
    params["required_files"] = ["am.mvn", "decoding.yaml", "configuration.json"]
    params["output_dir"] = "./checkpoint"
    params["data_dir"] = "./data/test"
-    params["decoding_model_name"] = "valid.cer_ctc.ave.pth"
+    params["decoding_model_name"] = "valid.cer_ctc.ave.pb"
    modelscope_infer_after_finetune(params)

@@ -41,7 +41,7 @@ The decoding results can be found in `$output_dir/1best_recog/text.sp.cer` and `
- Modify inference-related parameters in `infer_after_finetune.py`
    - <strong>output_dir:</strong> # result dir
    - <strong>data_dir:</strong> # the dataset dir needs to include `test/wav.scp`. If `test/text` also exists, CER will be computed
-    - <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pth`
+    - <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pb`

- Then you can run the pipeline to infer with:
```python

@@ -63,5 +63,5 @@ if __name__ == '__main__':
    params["required_files"] = ["feats_stats.npz", "decoding.yaml", "configuration.json"]
    params["output_dir"] = "./checkpoint"
    params["data_dir"] = "./example_data/validation"
-    params["decoding_model_name"] = "valid.acc.ave.pth"
+    params["decoding_model_name"] = "valid.acc.ave.pb"
    modelscope_infer_after_finetune(params)

@@ -8,9 +8,14 @@ from modelscope.utils.constant import Tasks
from funasr.utils.compute_wer import compute_wer


-def modelscope_infer_core(output_dir, split_dir, njob, idx):
+def modelscope_infer_core(output_dir, split_dir, njob, idx, batch_size, ngpu, model):
    output_dir_job = os.path.join(output_dir, "output.{}".format(idx))
-    gpu_id = (int(idx) - 1) // njob
+    if ngpu > 0:
+        use_gpu = 1
+        gpu_id = int(idx) - 1
+    else:
+        use_gpu = 0
+        gpu_id = -1
    if "CUDA_VISIBLE_DEVICES" in os.environ.keys():
        gpu_list = os.environ['CUDA_VISIBLE_DEVICES'].split(",")
        os.environ['CUDA_VISIBLE_DEVICES'] = str(gpu_list[gpu_id])

@@ -18,9 +23,10 @@ def modelscope_infer_core(output_dir, split_dir, njob, idx):
        os.environ['CUDA_VISIBLE_DEVICES'] = str(gpu_id)
    inference_pipline = pipeline(
        task=Tasks.auto_speech_recognition,
-        model="damo/speech_paraformer-large_asr_nat-zh-cn-16k-aishell1-vocab8404-pytorch",
+        model=model,
        output_dir=output_dir_job,
-        batch_size=64
+        batch_size=batch_size,
+        ngpu=use_gpu,
    )
    audio_in = os.path.join(split_dir, "wav.{}.scp".format(idx))
    inference_pipline(audio_in=audio_in)

@@ -30,13 +36,18 @@ def modelscope_infer(params):
    # prepare for multi-GPU decoding
    ngpu = params["ngpu"]
    njob = params["njob"]
+    batch_size = params["batch_size"]
    output_dir = params["output_dir"]
+    model = params["model"]
    if os.path.exists(output_dir):
        shutil.rmtree(output_dir)
    os.mkdir(output_dir)
    split_dir = os.path.join(output_dir, "split")
    os.mkdir(split_dir)
-    nj = ngpu * njob
+    if ngpu > 0:
+        nj = ngpu
+    elif ngpu == 0:
+        nj = njob
    wav_scp_file = os.path.join(params["data_dir"], "wav.scp")
    with open(wav_scp_file) as f:
        lines = f.readlines()

@@ -56,7 +67,7 @@ def modelscope_infer(params):
    p = Pool(nj)
    for i in range(nj):
        p.apply_async(modelscope_infer_core,
-                      args=(output_dir, split_dir, njob, str(i + 1)))
+                      args=(output_dir, split_dir, njob, str(i + 1), batch_size, ngpu, model))
    p.close()
    p.join()

@@ -81,8 +92,10 @@ def modelscope_infer(params):

if __name__ == "__main__":
    params = {}
+    params["model"] = "damo/speech_paraformer-large_asr_nat-zh-cn-16k-aishell1-vocab8404-pytorch"
    params["data_dir"] = "./data/test"
    params["output_dir"] = "./results"
-    params["ngpu"] = 1
-    params["njob"] = 1
-    modelscope_infer(params)
+    params["ngpu"] = 1 # if ngpu > 0, will use gpu decoding
+    params["njob"] = 1 # if ngpu = 0, will use cpu decoding
+    params["batch_size"] = 64
+    modelscope_infer(params)

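To make the new job-scheduling behaviour above concrete, here is a small illustrative helper (not part of the commit) that mirrors how `modelscope_infer` derives the number of parallel jobs and how `modelscope_infer_core` assigns devices:

```python
# Hypothetical helper mirroring the logic of modelscope_infer / modelscope_infer_core above.
def plan_jobs(ngpu, njob):
    # GPU decoding: one job per GPU; CPU decoding: njob parallel jobs.
    nj = ngpu if ngpu > 0 else njob
    jobs = []
    for idx in range(1, nj + 1):
        use_gpu = 1 if ngpu > 0 else 0
        gpu_id = idx - 1 if ngpu > 0 else -1
        jobs.append({"idx": idx, "use_gpu": use_gpu, "gpu_id": gpu_id})
    return jobs

print(plan_jobs(ngpu=2, njob=4))  # two GPU jobs on gpu_id 0 and 1
print(plan_jobs(ngpu=0, njob=4))  # four CPU jobs, gpu_id -1
```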
@@ -4,23 +4,18 @@ import shutil

from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks
+from modelscope.hub.snapshot_download import snapshot_download

from funasr.utils.compute_wer import compute_wer


def modelscope_infer_after_finetune(params):
    # prepare for decoding
-    pretrained_model_path = os.path.join(os.environ["HOME"], ".cache/modelscope/hub", params["modelscope_model_name"])
-    for file_name in params["required_files"]:
-        if file_name == "configuration.json":
-            with open(os.path.join(pretrained_model_path, file_name)) as f:
-                config_dict = json.load(f)
-            config_dict["model"]["am_model_name"] = params["decoding_model_name"]
-            with open(os.path.join(params["output_dir"], "configuration.json"), "w") as f:
-                json.dump(config_dict, f, indent=4, separators=(',', ': '))
-        else:
-            shutil.copy(os.path.join(pretrained_model_path, file_name),
-                        os.path.join(params["output_dir"], file_name))

+    try:
+        pretrained_model_path = snapshot_download(params["modelscope_model_name"], cache_dir=params["output_dir"])
+    except BaseException:
+        raise BaseException(f"Please download pretrain model from ModelScope firstly.")
+    shutil.copy(os.path.join(params["output_dir"], params["decoding_model_name"]), os.path.join(pretrained_model_path, "model.pb"))
    decoding_path = os.path.join(params["output_dir"], "decode_results")
    if os.path.exists(decoding_path):
        shutil.rmtree(decoding_path)

@@ -29,9 +24,9 @@ def modelscope_infer_after_finetune(params):
    # decoding
    inference_pipeline = pipeline(
        task=Tasks.auto_speech_recognition,
-        model=params["output_dir"],
+        model=pretrained_model_path,
        output_dir=decoding_path,
-        batch_size=64
+        batch_size=params["batch_size"]
    )
    audio_in = os.path.join(params["data_dir"], "wav.scp")
    inference_pipeline(audio_in=audio_in)

@@ -46,8 +41,8 @@ def modelscope_infer_after_finetune(params):
if __name__ == '__main__':
    params = {}
    params["modelscope_model_name"] = "damo/speech_paraformer-large_asr_nat-zh-cn-16k-aishell1-vocab8404-pytorch"
-    params["required_files"] = ["am.mvn", "decoding.yaml", "configuration.json"]
    params["output_dir"] = "./checkpoint"
    params["data_dir"] = "./data/test"
-    params["decoding_model_name"] = "valid.acc.ave_10best.pth"
-    modelscope_infer_after_finetune(params)
+    params["decoding_model_name"] = "valid.acc.ave_10best.pb"
+    params["batch_size"] = 64
+    modelscope_infer_after_finetune(params)

@ -8,9 +8,14 @@ from modelscope.utils.constant import Tasks
|
||||
from funasr.utils.compute_wer import compute_wer
|
||||
|
||||
|
||||
def modelscope_infer_core(output_dir, split_dir, njob, idx):
|
||||
def modelscope_infer_core(output_dir, split_dir, njob, idx, batch_size, ngpu, model):
|
||||
output_dir_job = os.path.join(output_dir, "output.{}".format(idx))
|
||||
gpu_id = (int(idx) - 1) // njob
|
||||
if ngpu > 0:
|
||||
use_gpu = 1
|
||||
gpu_id = int(idx) - 1
|
||||
else:
|
||||
use_gpu = 0
|
||||
gpu_id = -1
|
||||
if "CUDA_VISIBLE_DEVICES" in os.environ.keys():
|
||||
gpu_list = os.environ['CUDA_VISIBLE_DEVICES'].split(",")
|
||||
os.environ['CUDA_VISIBLE_DEVICES'] = str(gpu_list[gpu_id])
|
||||
@ -18,9 +23,10 @@ def modelscope_infer_core(output_dir, split_dir, njob, idx):
|
||||
os.environ['CUDA_VISIBLE_DEVICES'] = str(gpu_id)
|
||||
inference_pipline = pipeline(
|
||||
task=Tasks.auto_speech_recognition,
|
||||
model="damo/speech_paraformer-large_asr_nat-zh-cn-16k-aishell2-vocab8404-pytorch",
|
||||
model=model,
|
||||
output_dir=output_dir_job,
|
||||
batch_size=64
|
||||
batch_size=batch_size,
|
||||
ngpu=use_gpu,
|
||||
)
|
||||
audio_in = os.path.join(split_dir, "wav.{}.scp".format(idx))
|
||||
inference_pipline(audio_in=audio_in)
|
||||
@ -30,13 +36,18 @@ def modelscope_infer(params):
|
||||
# prepare for multi-GPU decoding
|
||||
ngpu = params["ngpu"]
|
||||
njob = params["njob"]
|
||||
batch_size = params["batch_size"]
|
||||
output_dir = params["output_dir"]
|
||||
model = params["model"]
|
||||
if os.path.exists(output_dir):
|
||||
shutil.rmtree(output_dir)
|
||||
os.mkdir(output_dir)
|
||||
split_dir = os.path.join(output_dir, "split")
|
||||
os.mkdir(split_dir)
|
||||
nj = ngpu * njob
|
||||
if ngpu > 0:
|
||||
nj = ngpu
|
||||
elif ngpu == 0:
|
||||
nj = njob
|
||||
wav_scp_file = os.path.join(params["data_dir"], "wav.scp")
|
||||
with open(wav_scp_file) as f:
|
||||
lines = f.readlines()
|
||||
@ -56,7 +67,7 @@ def modelscope_infer(params):
|
||||
p = Pool(nj)
|
||||
for i in range(nj):
|
||||
p.apply_async(modelscope_infer_core,
|
||||
args=(output_dir, split_dir, njob, str(i + 1)))
|
||||
args=(output_dir, split_dir, njob, str(i + 1), batch_size, ngpu, model))
|
||||
p.close()
|
||||
p.join()
|
||||
|
||||
@ -81,8 +92,10 @@ def modelscope_infer(params):
|
||||
|
||||
if __name__ == "__main__":
|
||||
params = {}
|
||||
params["model"] = "damo/speech_paraformer-large_asr_nat-zh-cn-16k-aishell2-vocab8404-pytorch"
|
||||
params["data_dir"] = "./data/test"
|
||||
params["output_dir"] = "./results"
|
||||
params["ngpu"] = 1
|
||||
params["njob"] = 1
|
||||
modelscope_infer(params)
|
||||
params["ngpu"] = 1 # if ngpu > 0, will use gpu decoding
|
||||
params["njob"] = 1 # if ngpu = 0, will use cpu decoding
|
||||
params["batch_size"] = 64
|
||||
modelscope_infer(params)
|
||||
@ -4,23 +4,18 @@ import shutil
|
||||
|
||||
from modelscope.pipelines import pipeline
|
||||
from modelscope.utils.constant import Tasks
|
||||
from modelscope.hub.snapshot_download import snapshot_download
|
||||
|
||||
from funasr.utils.compute_wer import compute_wer
|
||||
|
||||
|
||||
def modelscope_infer_after_finetune(params):
|
||||
# prepare for decoding
|
||||
pretrained_model_path = os.path.join(os.environ["HOME"], ".cache/modelscope/hub", params["modelscope_model_name"])
|
||||
for file_name in params["required_files"]:
|
||||
if file_name == "configuration.json":
|
||||
with open(os.path.join(pretrained_model_path, file_name)) as f:
|
||||
config_dict = json.load(f)
|
||||
config_dict["model"]["am_model_name"] = params["decoding_model_name"]
|
||||
with open(os.path.join(params["output_dir"], "configuration.json"), "w") as f:
|
||||
json.dump(config_dict, f, indent=4, separators=(',', ': '))
|
||||
else:
|
||||
shutil.copy(os.path.join(pretrained_model_path, file_name),
|
||||
os.path.join(params["output_dir"], file_name))
|
||||
|
||||
try:
|
||||
pretrained_model_path = snapshot_download(params["modelscope_model_name"], cache_dir=params["output_dir"])
|
||||
except BaseException:
|
||||
raise BaseException(f"Please download pretrain model from ModelScope firstly.")
|
||||
shutil.copy(os.path.join(params["output_dir"], params["decoding_model_name"]), os.path.join(pretrained_model_path, "model.pb"))
|
||||
decoding_path = os.path.join(params["output_dir"], "decode_results")
|
||||
if os.path.exists(decoding_path):
|
||||
shutil.rmtree(decoding_path)
|
||||
@ -29,9 +24,9 @@ def modelscope_infer_after_finetune(params):
|
||||
# decoding
|
||||
inference_pipeline = pipeline(
|
||||
task=Tasks.auto_speech_recognition,
|
||||
model=params["output_dir"],
|
||||
model=pretrained_model_path,
|
||||
output_dir=decoding_path,
|
||||
batch_size=64
|
||||
batch_size=params["batch_size"]
|
||||
)
|
||||
audio_in = os.path.join(params["data_dir"], "wav.scp")
|
||||
inference_pipeline(audio_in=audio_in)
|
||||
@ -46,8 +41,8 @@ def modelscope_infer_after_finetune(params):
|
||||
if __name__ == '__main__':
|
||||
params = {}
|
||||
params["modelscope_model_name"] = "damo/speech_paraformer-large_asr_nat-zh-cn-16k-aishell2-vocab8404-pytorch"
|
||||
params["required_files"] = ["am.mvn", "decoding.yaml", "configuration.json"]
|
||||
params["output_dir"] = "./checkpoint"
|
||||
params["data_dir"] = "./data/test"
|
||||
params["decoding_model_name"] = "valid.acc.ave_10best.pth"
|
||||
modelscope_infer_after_finetune(params)
|
||||
params["decoding_model_name"] = "valid.acc.ave_10best.pb"
|
||||
params["batch_size"] = 64
|
||||
modelscope_infer_after_finetune(params)
|
||||
@@ -22,10 +22,12 @@
Or you can use the finetuned model for inference directly.

- Setting parameters in `infer.py` (an illustrative `params` sketch follows this section)
+    - <strong>model:</strong> # model name on ModelScope
    - <strong>data_dir:</strong> # the dataset dir needs to include `test/wav.scp`. If `test/text` also exists, CER will be computed
    - <strong>output_dir:</strong> # result dir
-    - <strong>ngpu:</strong> # the number of GPUs for decoding
-    - <strong>njob:</strong> # the number of jobs for each GPU
+    - <strong>ngpu:</strong> # the number of GPUs for decoding; if `ngpu` > 0, GPU decoding is used
+    - <strong>njob:</strong> # the number of jobs for CPU decoding; if `ngpu` = 0, CPU decoding is used, please set `njob`
+    - <strong>batch_size:</strong> # batch size for inference

- Then you can run the pipeline to infer with:
```python

@@ -39,9 +41,11 @@ The decoding results can be found in `$output_dir/1best_recog/text.cer`, which i
### Inference using local finetuned model

- Modify inference-related parameters in `infer_after_finetune.py`
+    - <strong>modelscope_model_name:</strong> # model name on ModelScope
    - <strong>output_dir:</strong> # result dir
    - <strong>data_dir:</strong> # the dataset dir needs to include `test/wav.scp`. If `test/text` also exists, CER will be computed
-    - <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pth`
+    - <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pb`
+    - <strong>batch_size:</strong> # batch size for inference

- Then you can run the pipeline to infer with:
```python

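As a reference for the parameter list above, the `__main__` block of `infer.py` would be configured along these lines; a minimal sketch with illustrative values, where the model name is the common Paraformer-large used by the scripts in this commit:

```python
# Illustrative settings for infer.py; modelscope_infer is defined earlier in that script.
params = {}
params["model"] = "damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch"
params["data_dir"] = "./data/test"   # directory containing the wav.scp (and optionally text) to decode
params["output_dir"] = "./results"
params["ngpu"] = 1                   # > 0 selects GPU decoding, one job per GPU
params["njob"] = 1                   # number of parallel CPU jobs when ngpu = 0
params["batch_size"] = 64
modelscope_infer(params)
```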
@ -8,9 +8,14 @@ from modelscope.utils.constant import Tasks
|
||||
from funasr.utils.compute_wer import compute_wer
|
||||
|
||||
|
||||
def modelscope_infer_core(output_dir, split_dir, njob, idx):
|
||||
def modelscope_infer_core(output_dir, split_dir, njob, idx, batch_size, ngpu, model):
|
||||
output_dir_job = os.path.join(output_dir, "output.{}".format(idx))
|
||||
gpu_id = (int(idx) - 1) // njob
|
||||
if ngpu > 0:
|
||||
use_gpu = 1
|
||||
gpu_id = int(idx) - 1
|
||||
else:
|
||||
use_gpu = 0
|
||||
gpu_id = -1
|
||||
if "CUDA_VISIBLE_DEVICES" in os.environ.keys():
|
||||
gpu_list = os.environ['CUDA_VISIBLE_DEVICES'].split(",")
|
||||
os.environ['CUDA_VISIBLE_DEVICES'] = str(gpu_list[gpu_id])
|
||||
@ -18,9 +23,10 @@ def modelscope_infer_core(output_dir, split_dir, njob, idx):
|
||||
os.environ['CUDA_VISIBLE_DEVICES'] = str(gpu_id)
|
||||
inference_pipline = pipeline(
|
||||
task=Tasks.auto_speech_recognition,
|
||||
model="damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch",
|
||||
model=model,
|
||||
output_dir=output_dir_job,
|
||||
batch_size=64
|
||||
batch_size=batch_size,
|
||||
ngpu=use_gpu,
|
||||
)
|
||||
audio_in = os.path.join(split_dir, "wav.{}.scp".format(idx))
|
||||
inference_pipline(audio_in=audio_in)
|
||||
@ -30,13 +36,18 @@ def modelscope_infer(params):
|
||||
# prepare for multi-GPU decoding
|
||||
ngpu = params["ngpu"]
|
||||
njob = params["njob"]
|
||||
batch_size = params["batch_size"]
|
||||
output_dir = params["output_dir"]
|
||||
model = params["model"]
|
||||
if os.path.exists(output_dir):
|
||||
shutil.rmtree(output_dir)
|
||||
os.mkdir(output_dir)
|
||||
split_dir = os.path.join(output_dir, "split")
|
||||
os.mkdir(split_dir)
|
||||
nj = ngpu * njob
|
||||
if ngpu > 0:
|
||||
nj = ngpu
|
||||
elif ngpu == 0:
|
||||
nj = njob
|
||||
wav_scp_file = os.path.join(params["data_dir"], "wav.scp")
|
||||
with open(wav_scp_file) as f:
|
||||
lines = f.readlines()
|
||||
@ -56,7 +67,7 @@ def modelscope_infer(params):
|
||||
p = Pool(nj)
|
||||
for i in range(nj):
|
||||
p.apply_async(modelscope_infer_core,
|
||||
args=(output_dir, split_dir, njob, str(i + 1)))
|
||||
args=(output_dir, split_dir, njob, str(i + 1), batch_size, ngpu, model))
|
||||
p.close()
|
||||
p.join()
|
||||
|
||||
@ -81,8 +92,10 @@ def modelscope_infer(params):
|
||||
|
||||
if __name__ == "__main__":
|
||||
params = {}
|
||||
params["model"] = "damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch"
|
||||
params["data_dir"] = "./data/test"
|
||||
params["output_dir"] = "./results"
|
||||
params["ngpu"] = 1
|
||||
params["njob"] = 1
|
||||
modelscope_infer(params)
|
||||
params["ngpu"] = 1 # if ngpu > 0, will use gpu decoding
|
||||
params["njob"] = 1 # if ngpu = 0, will use cpu decoding
|
||||
params["batch_size"] = 64
|
||||
modelscope_infer(params)
|
||||
@ -4,23 +4,18 @@ import shutil
|
||||
|
||||
from modelscope.pipelines import pipeline
|
||||
from modelscope.utils.constant import Tasks
|
||||
from modelscope.hub.snapshot_download import snapshot_download
|
||||
|
||||
from funasr.utils.compute_wer import compute_wer
|
||||
|
||||
|
||||
def modelscope_infer_after_finetune(params):
|
||||
# prepare for decoding
|
||||
pretrained_model_path = os.path.join(os.environ["HOME"], ".cache/modelscope/hub", params["modelscope_model_name"])
|
||||
for file_name in params["required_files"]:
|
||||
if file_name == "configuration.json":
|
||||
with open(os.path.join(pretrained_model_path, file_name)) as f:
|
||||
config_dict = json.load(f)
|
||||
config_dict["model"]["am_model_name"] = params["decoding_model_name"]
|
||||
with open(os.path.join(params["output_dir"], "configuration.json"), "w") as f:
|
||||
json.dump(config_dict, f, indent=4, separators=(',', ': '))
|
||||
else:
|
||||
shutil.copy(os.path.join(pretrained_model_path, file_name),
|
||||
os.path.join(params["output_dir"], file_name))
|
||||
|
||||
try:
|
||||
pretrained_model_path = snapshot_download(params["modelscope_model_name"], cache_dir=params["output_dir"])
|
||||
except BaseException:
|
||||
raise BaseException(f"Please download pretrain model from ModelScope firstly.")
|
||||
shutil.copy(os.path.join(params["output_dir"], params["decoding_model_name"]), os.path.join(pretrained_model_path, "model.pb"))
|
||||
decoding_path = os.path.join(params["output_dir"], "decode_results")
|
||||
if os.path.exists(decoding_path):
|
||||
shutil.rmtree(decoding_path)
|
||||
@ -29,9 +24,9 @@ def modelscope_infer_after_finetune(params):
|
||||
# decoding
|
||||
inference_pipeline = pipeline(
|
||||
task=Tasks.auto_speech_recognition,
|
||||
model=params["output_dir"],
|
||||
model=pretrained_model_path,
|
||||
output_dir=decoding_path,
|
||||
batch_size=64
|
||||
batch_size=params["batch_size"]
|
||||
)
|
||||
audio_in = os.path.join(params["data_dir"], "wav.scp")
|
||||
inference_pipeline(audio_in=audio_in)
|
||||
@ -46,8 +41,8 @@ def modelscope_infer_after_finetune(params):
|
||||
if __name__ == '__main__':
|
||||
params = {}
|
||||
params["modelscope_model_name"] = "damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch"
|
||||
params["required_files"] = ["am.mvn", "decoding.yaml", "configuration.json"]
|
||||
params["output_dir"] = "./checkpoint"
|
||||
params["data_dir"] = "./data/test"
|
||||
params["decoding_model_name"] = "valid.acc.ave_10best.pth"
|
||||
modelscope_infer_after_finetune(params)
|
||||
params["decoding_model_name"] = "valid.acc.ave_10best.pb"
|
||||
params["batch_size"] = 64
|
||||
modelscope_infer_after_finetune(params)
|
||||
@@ -0,0 +1,57 @@
import torch
import torchaudio
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

from modelscope.utils.logger import get_logger
import logging
logger = get_logger(log_level=logging.CRITICAL)
logger.setLevel(logging.CRITICAL)

inference_pipeline = pipeline(
    task=Tasks.auto_speech_recognition,
    model='damo/speech_paraformer_asr_nat-zh-cn-16k-common-vocab8404-online',
    model_revision='v1.0.2')

waveform, sample_rate = torchaudio.load("waihu.wav")
speech_length = waveform.shape[1]
speech = waveform[0]

cache_en = {"start_idx": 0, "pad_left": 0, "stride": 10, "pad_right": 5, "cif_hidden": None, "cif_alphas": None}
cache_de = {"decode_fsmn": None}
cache = {"encoder": cache_en, "decoder": cache_de}
param_dict = {}
param_dict["cache"] = cache

first_chunk = True
speech_buffer = speech
speech_cache = []
final_result = ""

while len(speech_buffer) >= 960:
    if first_chunk:
        if len(speech_buffer) >= 14400:
            rec_result = inference_pipeline(audio_in=speech_buffer[0:14400], param_dict=param_dict)
            speech_buffer = speech_buffer[4800:]
        else:
            cache_en["stride"] = len(speech_buffer) // 960
            cache_en["pad_right"] = 0
            rec_result = inference_pipeline(audio_in=speech_buffer, param_dict=param_dict)
            speech_buffer = []
        cache_en["start_idx"] = -5
        first_chunk = False
    else:
        cache_en["start_idx"] += 10
        if len(speech_buffer) >= 4800:
            cache_en["pad_left"] = 5
            rec_result = inference_pipeline(audio_in=speech_buffer[:19200], param_dict=param_dict)
            speech_buffer = speech_buffer[9600:]
        else:
            cache_en["stride"] = len(speech_buffer) // 960
            cache_en["pad_right"] = 0
            rec_result = inference_pipeline(audio_in=speech_buffer, param_dict=param_dict)
            speech_buffer = []
    if len(rec_result) != 0 and rec_result['text'] != "sil":
        final_result += rec_result['text']
        print(rec_result)
print(final_result)

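A quick sanity check on the hard-coded sample counts in the streaming demo above, assuming 16 kHz input as the "16k" model name suggests:

```python
# Assuming 16 kHz audio, the buffer sizes used by the streaming loop above map to:
sample_rate = 16000
for samples in (960, 4800, 9600, 14400, 19200):
    print(f"{samples} samples = {samples / sample_rate * 1000:.0f} ms")
# 960 = 60 ms (minimum buffer the loop processes), 14400 = 900 ms first chunk,
# 9600 = 600 ms advance for later chunks, with 4800 / 19200 bounding the padded window.
```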
@ -8,9 +8,14 @@ from modelscope.utils.constant import Tasks
|
||||
from funasr.utils.compute_wer import compute_wer
|
||||
|
||||
|
||||
def modelscope_infer_core(output_dir, split_dir, njob, idx):
|
||||
def modelscope_infer_core(output_dir, split_dir, njob, idx, batch_size, ngpu, model):
|
||||
output_dir_job = os.path.join(output_dir, "output.{}".format(idx))
|
||||
gpu_id = (int(idx) - 1) // njob
|
||||
if ngpu > 0:
|
||||
use_gpu = 1
|
||||
gpu_id = int(idx) - 1
|
||||
else:
|
||||
use_gpu = 0
|
||||
gpu_id = -1
|
||||
if "CUDA_VISIBLE_DEVICES" in os.environ.keys():
|
||||
gpu_list = os.environ['CUDA_VISIBLE_DEVICES'].split(",")
|
||||
os.environ['CUDA_VISIBLE_DEVICES'] = str(gpu_list[gpu_id])
|
||||
@ -18,9 +23,10 @@ def modelscope_infer_core(output_dir, split_dir, njob, idx):
|
||||
os.environ['CUDA_VISIBLE_DEVICES'] = str(gpu_id)
|
||||
inference_pipline = pipeline(
|
||||
task=Tasks.auto_speech_recognition,
|
||||
model="damo/speech_paraformer_asr_nat-zh-cn-8k-common-vocab8358-tensorflow1",
|
||||
model=model,
|
||||
output_dir=output_dir_job,
|
||||
batch_size=64
|
||||
batch_size=batch_size,
|
||||
ngpu=use_gpu,
|
||||
)
|
||||
audio_in = os.path.join(split_dir, "wav.{}.scp".format(idx))
|
||||
inference_pipline(audio_in=audio_in)
|
||||
@ -30,13 +36,18 @@ def modelscope_infer(params):
|
||||
# prepare for multi-GPU decoding
|
||||
ngpu = params["ngpu"]
|
||||
njob = params["njob"]
|
||||
batch_size = params["batch_size"]
|
||||
output_dir = params["output_dir"]
|
||||
model = params["model"]
|
||||
if os.path.exists(output_dir):
|
||||
shutil.rmtree(output_dir)
|
||||
os.mkdir(output_dir)
|
||||
split_dir = os.path.join(output_dir, "split")
|
||||
os.mkdir(split_dir)
|
||||
nj = ngpu * njob
|
||||
if ngpu > 0:
|
||||
nj = ngpu
|
||||
elif ngpu == 0:
|
||||
nj = njob
|
||||
wav_scp_file = os.path.join(params["data_dir"], "wav.scp")
|
||||
with open(wav_scp_file) as f:
|
||||
lines = f.readlines()
|
||||
@ -56,7 +67,7 @@ def modelscope_infer(params):
|
||||
p = Pool(nj)
|
||||
for i in range(nj):
|
||||
p.apply_async(modelscope_infer_core,
|
||||
args=(output_dir, split_dir, njob, str(i + 1)))
|
||||
args=(output_dir, split_dir, njob, str(i + 1), batch_size, ngpu, model))
|
||||
p.close()
|
||||
p.join()
|
||||
|
||||
@ -81,8 +92,10 @@ def modelscope_infer(params):
|
||||
|
||||
if __name__ == "__main__":
|
||||
params = {}
|
||||
params["model"] = "damo/speech_paraformer_asr_nat-zh-cn-8k-common-vocab8358-tensorflow1"
|
||||
params["data_dir"] = "./data/test"
|
||||
params["output_dir"] = "./results"
|
||||
params["ngpu"] = 1
|
||||
params["njob"] = 1
|
||||
modelscope_infer(params)
|
||||
params["ngpu"] = 1 # if ngpu > 0, will use gpu decoding
|
||||
params["njob"] = 1 # if ngpu = 0, will use cpu decoding
|
||||
params["batch_size"] = 64
|
||||
modelscope_infer(params)
|
||||
@ -4,23 +4,18 @@ import shutil
|
||||
|
||||
from modelscope.pipelines import pipeline
|
||||
from modelscope.utils.constant import Tasks
|
||||
from modelscope.hub.snapshot_download import snapshot_download
|
||||
|
||||
from funasr.utils.compute_wer import compute_wer
|
||||
|
||||
|
||||
def modelscope_infer_after_finetune(params):
|
||||
# prepare for decoding
|
||||
pretrained_model_path = os.path.join(os.environ["HOME"], ".cache/modelscope/hub", params["modelscope_model_name"])
|
||||
for file_name in params["required_files"]:
|
||||
if file_name == "configuration.json":
|
||||
with open(os.path.join(pretrained_model_path, file_name)) as f:
|
||||
config_dict = json.load(f)
|
||||
config_dict["model"]["am_model_name"] = params["decoding_model_name"]
|
||||
with open(os.path.join(params["output_dir"], "configuration.json"), "w") as f:
|
||||
json.dump(config_dict, f, indent=4, separators=(',', ': '))
|
||||
else:
|
||||
shutil.copy(os.path.join(pretrained_model_path, file_name),
|
||||
os.path.join(params["output_dir"], file_name))
|
||||
|
||||
try:
|
||||
pretrained_model_path = snapshot_download(params["modelscope_model_name"], cache_dir=params["output_dir"])
|
||||
except BaseException:
|
||||
raise BaseException(f"Please download pretrain model from ModelScope firstly.")
|
||||
shutil.copy(os.path.join(params["output_dir"], params["decoding_model_name"]), os.path.join(pretrained_model_path, "model.pb"))
|
||||
decoding_path = os.path.join(params["output_dir"], "decode_results")
|
||||
if os.path.exists(decoding_path):
|
||||
shutil.rmtree(decoding_path)
|
||||
@ -29,9 +24,9 @@ def modelscope_infer_after_finetune(params):
|
||||
# decoding
|
||||
inference_pipeline = pipeline(
|
||||
task=Tasks.auto_speech_recognition,
|
||||
model=params["output_dir"],
|
||||
model=pretrained_model_path,
|
||||
output_dir=decoding_path,
|
||||
batch_size=64
|
||||
batch_size=params["batch_size"]
|
||||
)
|
||||
audio_in = os.path.join(params["data_dir"], "wav.scp")
|
||||
inference_pipeline(audio_in=audio_in)
|
||||
@ -46,8 +41,8 @@ def modelscope_infer_after_finetune(params):
|
||||
if __name__ == '__main__':
|
||||
params = {}
|
||||
params["modelscope_model_name"] = "damo/speech_paraformer_asr_nat-zh-cn-8k-common-vocab8358-tensorflow1"
|
||||
params["required_files"] = ["am.mvn", "decoding.yaml", "configuration.json"]
|
||||
params["output_dir"] = "./checkpoint"
|
||||
params["data_dir"] = "./data/test"
|
||||
params["decoding_model_name"] = "valid.acc.ave_10best.pth"
|
||||
modelscope_infer_after_finetune(params)
|
||||
params["decoding_model_name"] = "valid.acc.ave_10best.pb"
|
||||
params["batch_size"] = 64
|
||||
modelscope_infer_after_finetune(params)
|
||||
@@ -41,7 +41,7 @@ The decoding results can be found in `$output_dir/1best_recog/text.cer`, which i
- Modify inference-related parameters in `infer_after_finetune.py`
    - <strong>output_dir:</strong> # result dir
    - <strong>data_dir:</strong> # the dataset dir needs to include `test/wav.scp`. If `test/text` also exists, CER will be computed
-    - <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pth`
+    - <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pb`

- Then you can run the pipeline to infer with:
```python

@@ -50,5 +50,5 @@ if __name__ == '__main__':
    params["required_files"] = ["am.mvn", "decoding.yaml", "configuration.json"]
    params["output_dir"] = "./checkpoint"
    params["data_dir"] = "./data/test"
-    params["decoding_model_name"] = "20epoch.pth"
+    params["decoding_model_name"] = "20epoch.pb"
    modelscope_infer_after_finetune(params)

@@ -41,7 +41,7 @@ The decoding results can be found in `$output_dir/1best_recog/text.cer`, which i
- Modify inference-related parameters in `infer_after_finetune.py`
    - <strong>output_dir:</strong> # result dir
    - <strong>data_dir:</strong> # the dataset dir needs to include `test/wav.scp`. If `test/text` also exists, CER will be computed
-    - <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pth`
+    - <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pb`

- Then you can run the pipeline to infer with:
```python

@@ -50,5 +50,5 @@ if __name__ == '__main__':
    params["required_files"] = ["am.mvn", "decoding.yaml", "configuration.json"]
    params["output_dir"] = "./checkpoint"
    params["data_dir"] = "./data/test"
-    params["decoding_model_name"] = "20epoch.pth"
+    params["decoding_model_name"] = "20epoch.pb"
    modelscope_infer_after_finetune(params)

@@ -0,0 +1,35 @@
import os
from modelscope.metainfo import Trainers
from modelscope.trainers import build_trainer
from funasr.datasets.ms_dataset import MsDataset


def modelscope_finetune(params):
    if not os.path.exists(params["output_dir"]):
        os.makedirs(params["output_dir"], exist_ok=True)
    # dataset split ["train", "validation"]
    ds_dict = MsDataset.load(params["data_dir"])
    kwargs = dict(
        model=params["model"],
        model_revision=params["model_revision"],
        data_dir=ds_dict,
        dataset_type=params["dataset_type"],
        work_dir=params["output_dir"],
        batch_bins=params["batch_bins"],
        max_epoch=params["max_epoch"],
        lr=params["lr"])
    trainer = build_trainer(Trainers.speech_asr_trainer, default_args=kwargs)
    trainer.train()


if __name__ == '__main__':
    params = {}
    params["output_dir"] = "./checkpoint"
    params["data_dir"] = "./data"
    params["batch_bins"] = 2000
    params["dataset_type"] = "small"
    params["max_epoch"] = 50
    params["lr"] = 0.00005
    params["model"] = "damo/speech_UniASR_asr_2pass-he-16k-common-vocab1085-pytorch"
    params["model_revision"] = None
    modelscope_finetune(params)

@@ -0,0 +1,13 @@
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

if __name__ == "__main__":
    audio_in = "https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/asr_example_he.wav"
    output_dir = "./results"
    inference_pipline = pipeline(
        task=Tasks.auto_speech_recognition,
        model="damo/speech_UniASR_asr_2pass-he-16k-common-vocab1085-pytorch",
        output_dir=output_dir,
    )
    rec_result = inference_pipline(audio_in=audio_in, param_dict={"decoding_model": "offline"})
    print(rec_result)

@@ -41,7 +41,7 @@ The decoding results can be found in `$output_dir/1best_recog/text.cer`, which i
- Modify inference-related parameters in `infer_after_finetune.py`
    - <strong>output_dir:</strong> # result dir
    - <strong>data_dir:</strong> # the dataset dir needs to include `test/wav.scp`. If `test/text` also exists, CER will be computed
-    - <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pth`
+    - <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pb`

- Then you can run the pipeline to infer with:
```python

@@ -50,5 +50,5 @@ if __name__ == '__main__':
    params["required_files"] = ["am.mvn", "decoding.yaml", "configuration.json"]
    params["output_dir"] = "./checkpoint"
    params["data_dir"] = "./data/test"
-    params["decoding_model_name"] = "20epoch.pth"
+    params["decoding_model_name"] = "20epoch.pb"
    modelscope_infer_after_finetune(params)

@ -0,0 +1,35 @@
|
||||
import os
|
||||
from modelscope.metainfo import Trainers
|
||||
from modelscope.trainers import build_trainer
|
||||
from funasr.datasets.ms_dataset import MsDataset
|
||||
|
||||
|
||||
def modelscope_finetune(params):
|
||||
if not os.path.exists(params["output_dir"]):
|
||||
os.makedirs(params["output_dir"], exist_ok=True)
|
||||
# dataset split ["train", "validation"]
|
||||
ds_dict = MsDataset.load(params["data_dir"])
|
||||
kwargs = dict(
|
||||
model=params["model"],
|
||||
model_revision=params["model_revision"],
|
||||
data_dir=ds_dict,
|
||||
dataset_type=params["dataset_type"],
|
||||
work_dir=params["output_dir"],
|
||||
batch_bins=params["batch_bins"],
|
||||
max_epoch=params["max_epoch"],
|
||||
lr=params["lr"])
|
||||
trainer = build_trainer(Trainers.speech_asr_trainer, default_args=kwargs)
|
||||
trainer.train()
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
params = {}
|
||||
params["output_dir"] = "./checkpoint"
|
||||
params["data_dir"] = "./data"
|
||||
params["batch_bins"] = 2000
|
||||
params["dataset_type"] = "small"
|
||||
params["max_epoch"] = 50
|
||||
params["lr"] = 0.00005
|
||||
params["model"] = "damo/speech_UniASR_asr_2pass-my-16k-common-vocab696-pytorch"
|
||||
params["model_revision"] = None
|
||||
modelscope_finetune(params)
|
||||
@ -0,0 +1,13 @@
|
||||
from modelscope.pipelines import pipeline
|
||||
from modelscope.utils.constant import Tasks
|
||||
|
||||
if __name__ == "__main__":
|
||||
audio_in = "https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/asr_example_my.wav"
|
||||
output_dir = "./results"
|
||||
inference_pipline = pipeline(
|
||||
task=Tasks.auto_speech_recognition,
|
||||
model="damo/speech_UniASR_asr_2pass-my-16k-common-vocab696-pytorch",
|
||||
output_dir=output_dir,
|
||||
)
|
||||
rec_result = inference_pipline(audio_in=audio_in, param_dict={"decoding_model":"offline"})
|
||||
print(rec_result)
|
||||
@ -0,0 +1,35 @@
|
||||
import os
|
||||
from modelscope.metainfo import Trainers
|
||||
from modelscope.trainers import build_trainer
|
||||
from funasr.datasets.ms_dataset import MsDataset
|
||||
|
||||
|
||||
def modelscope_finetune(params):
|
||||
if not os.path.exists(params["output_dir"]):
|
||||
os.makedirs(params["output_dir"], exist_ok=True)
|
||||
# dataset split ["train", "validation"]
|
||||
ds_dict = MsDataset.load(params["data_dir"])
|
||||
kwargs = dict(
|
||||
model=params["model"],
|
||||
model_revision=params["model_revision"],
|
||||
data_dir=ds_dict,
|
||||
dataset_type=params["dataset_type"],
|
||||
work_dir=params["output_dir"],
|
||||
batch_bins=params["batch_bins"],
|
||||
max_epoch=params["max_epoch"],
|
||||
lr=params["lr"])
|
||||
trainer = build_trainer(Trainers.speech_asr_trainer, default_args=kwargs)
|
||||
trainer.train()
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
params = {}
|
||||
params["output_dir"] = "./checkpoint"
|
||||
params["data_dir"] = "./data"
|
||||
params["batch_bins"] = 2000
|
||||
params["dataset_type"] = "small"
|
||||
params["max_epoch"] = 50
|
||||
params["lr"] = 0.00005
|
||||
params["model"] = "damo/speech_UniASR_asr_2pass-ur-16k-common-vocab877-pytorch"
|
||||
params["model_revision"] = None
|
||||
modelscope_finetune(params)
|
||||
@ -0,0 +1,13 @@
|
||||
from modelscope.pipelines import pipeline
|
||||
from modelscope.utils.constant import Tasks
|
||||
|
||||
if __name__ == "__main__":
|
||||
audio_in = "https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/asr_example_ur.wav"
|
||||
output_dir = "./results"
|
||||
inference_pipline = pipeline(
|
||||
task=Tasks.auto_speech_recognition,
|
||||
model="damo/speech_UniASR_asr_2pass-ur-16k-common-vocab877-pytorch",
|
||||
output_dir=output_dir,
|
||||
)
|
||||
rec_result = inference_pipline(audio_in=audio_in, param_dict={"decoding_model":"offline"})
|
||||
print(rec_result)
|
||||
@@ -41,7 +41,7 @@ The decoding results can be found in `$output_dir/1best_recog/text.cer`, which i
- Modify inference related parameters in `infer_after_finetune.py`
    - <strong>output_dir:</strong> # result dir
    - <strong>data_dir:</strong> # the dataset dir needs to include `test/wav.scp`. If `test/text` also exists, the CER will be computed
    - <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pth`
    - <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pb`

- Then you can run the pipeline to infer with:
```python
@@ -49,5 +49,5 @@ if __name__ == '__main__':
    params["required_files"] = ["am.mvn", "decoding.yaml", "configuration.json"]
    params["output_dir"] = "./checkpoint"
    params["data_dir"] = "./data/test"
    params["decoding_model_name"] = "20epoch.pth"
    params["decoding_model_name"] = "20epoch.pb"
    modelscope_infer_after_finetune(params)
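For context on `data_dir` above: the inference code further down reads `test/wav.scp` (via `audio_in`) and the optional `test/text` (via `compute_wer`), both in the usual Kaldi-style "utterance-id value" layout. A minimal sketch with placeholder paths, assuming that layout:

```python
# Hedged sketch: prepare data_dir="./data/test" with Kaldi-style scp files.
# The "<utt_id> <wav_path>" and "<utt_id> <transcript>" layout is an assumption
# based on how wav.scp and text are consumed in the hunks below.
import os

os.makedirs("./data/test", exist_ok=True)
with open("./data/test/wav.scp", "w", encoding="utf-8") as f:
    f.write("utt_001 /path/to/utt_001.wav\n")
    f.write("utt_002 /path/to/utt_002.wav\n")
with open("./data/test/text", "w", encoding="utf-8") as f:  # optional, enables CER computation
    f.write("utt_001 reference transcript one\n")
    f.write("utt_002 reference transcript two\n")
```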
@@ -41,7 +41,8 @@ The decoding results can be found in `$output_dir/1best_recog/text.cer`, which i
- Modify inference related parameters in `infer_after_finetune.py`
    - <strong>output_dir:</strong> # result dir
    - <strong>data_dir:</strong> # the dataset dir needs to include `test/wav.scp`. If `test/text` also exists, the CER will be computed
    - <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pth`
    - <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pb`

- Then you can run the pipeline to infer with:
```python
@@ -49,5 +49,5 @@ if __name__ == '__main__':
    params["required_files"] = ["am.mvn", "decoding.yaml", "configuration.json"]
    params["output_dir"] = "./checkpoint"
    params["data_dir"] = "./data/test"
    params["decoding_model_name"] = "20epoch.pth"
    params["decoding_model_name"] = "20epoch.pb"
    modelscope_infer_after_finetune(params)
@@ -34,7 +34,7 @@ Or you can use the finetuned model for inference directly.
- Modify inference related parameters in `infer_after_finetune.py`
    - <strong>output_dir:</strong> # result dir
    - <strong>data_dir:</strong> # the dataset dir needs to include `test/wav.scp`. If `test/text` also exists, the CER will be computed
    - <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pth`
    - <strong>decoding_model_name:</strong> # set the checkpoint name for decoding, e.g., `valid.cer_ctc.ave.pb`

- Then you can run the pipeline to infer with:
```python
@@ -4,27 +4,17 @@ import shutil

from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks
from modelscope.hub.snapshot_download import snapshot_download

from funasr.utils.compute_wer import compute_wer


def modelscope_infer_after_finetune(params):
    # prepare for decoding
    if not os.path.exists(os.path.join(params["output_dir"], "punc")):
        os.makedirs(os.path.join(params["output_dir"], "punc"))
    if not os.path.exists(os.path.join(params["output_dir"], "vad")):
        os.makedirs(os.path.join(params["output_dir"], "vad"))
    pretrained_model_path = os.path.join(os.environ["HOME"], ".cache/modelscope/hub", params["modelscope_model_name"])
    for file_name in params["required_files"]:
        if file_name == "configuration.json":
            with open(os.path.join(pretrained_model_path, file_name)) as f:
                config_dict = json.load(f)
            config_dict["model"]["am_model_name"] = params["decoding_model_name"]
            with open(os.path.join(params["output_dir"], "configuration.json"), "w") as f:
                json.dump(config_dict, f, indent=4, separators=(',', ': '))
        else:
            shutil.copy(os.path.join(pretrained_model_path, file_name),
                        os.path.join(params["output_dir"], file_name))

    try:
        pretrained_model_path = snapshot_download(params["modelscope_model_name"], cache_dir=params["output_dir"])
    except BaseException:
        raise BaseException(f"Please download pretrain model from ModelScope firstly.")
    shutil.copy(os.path.join(params["output_dir"], params["decoding_model_name"]), os.path.join(pretrained_model_path, "model.pb"))
    decoding_path = os.path.join(params["output_dir"], "decode_results")
    if os.path.exists(decoding_path):
        shutil.rmtree(decoding_path)
@@ -33,16 +23,16 @@ def modelscope_infer_after_finetune(params):
    # decoding
    inference_pipeline = pipeline(
        task=Tasks.auto_speech_recognition,
        model=params["output_dir"],
        model=pretrained_model_path,
        output_dir=decoding_path,
        batch_size=64
        batch_size=params["batch_size"]
    )
    audio_in = os.path.join(params["data_dir"], "wav.scp")
    inference_pipeline(audio_in=audio_in)

    # compute CER if GT text is set
    text_in = os.path.join(params["data_dir"], "text")
    if text_in is not None:
    if os.path.exists(text_in):
        text_proc_file = os.path.join(decoding_path, "1best_recog/token")
        compute_wer(text_in, text_proc_file, os.path.join(decoding_path, "text.cer"))
@@ -50,8 +40,8 @@ def modelscope_infer_after_finetune(params):
if __name__ == '__main__':
    params = {}
    params["modelscope_model_name"] = "damo/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch"
    params["required_files"] = ["am.mvn", "decoding.yaml", "configuration.json", "punc/punc.pb", "punc/punc.yaml", "vad/vad.mvn", "vad/vad.pb", "vad/vad.yaml"]
    params["output_dir"] = "./checkpoint"
    params["data_dir"] = "./data/test"
    params["decoding_model_name"] = "valid.acc.ave_10best.pth"
    modelscope_infer_after_finetune(params)
    params["decoding_model_name"] = "valid.acc.ave_10best.pb"
    params["batch_size"] = 64
    modelscope_infer_after_finetune(params)
@@ -4,6 +4,11 @@ inputs = "跨境河流是养育沿岸|人民的生命之源长期以来为帮助

from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks
from modelscope.utils.logger import get_logger
import logging
logger = get_logger(log_level=logging.CRITICAL)
logger.setLevel(logging.CRITICAL)


inference_pipeline = pipeline(
    task=Tasks.punctuation,
@@ -0,0 +1,10 @@
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

inference_diar_pipline = pipeline(
    task=Tasks.speaker_diarization,
    model='damo/speech_diarization_eend-ola-en-us-callhome-8k',
    model_revision="v1.0.0",
)
results = inference_diar_pipline(audio_in=["https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_data/record2.wav"])
print(results)
@@ -14,13 +14,12 @@ inference_diar_pipline = pipeline(
)

# Take audio_list as input: the first audio is the speech to be analyzed, and the following audios are the voiceprint enrollment utterances of the different speakers
audio_list = [[
audio_list = [
    "https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_data/record.wav",
    "https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_data/spk_A.wav",
    "https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_data/spk_B.wav",
    "https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_data/spk_B1.wav"
]]
]

results = inference_diar_pipline(audio_in=audio_list)
for rst in results:
    print(rst["value"])
print(results)
@@ -1,8 +1,11 @@
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks
from modelscope.utils.logger import get_logger
import logging
logger = get_logger(log_level=logging.CRITICAL)
logger.setLevel(logging.CRITICAL)
import soundfile


if __name__ == '__main__':
    output_dir = None
    inference_pipline = pipeline(

@@ -1,8 +1,11 @@
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks
from modelscope.utils.logger import get_logger
import logging
logger = get_logger(log_level=logging.CRITICAL)
logger.setLevel(logging.CRITICAL)
import soundfile


if __name__ == '__main__':
    output_dir = None
    inference_pipline = pipeline(
@@ -52,7 +52,7 @@ class Speech2Text:

    Examples:
        >>> import soundfile
        >>> speech2text = Speech2Text("asr_config.yml", "asr.pth")
        >>> speech2text = Speech2Text("asr_config.yml", "asr.pb")
        >>> audio, rate = soundfile.read("speech.wav")
        >>> speech2text(audio)
        [(text, token, token_int, hypothesis object), ...]
@@ -256,6 +256,9 @@ def inference_launch(**kwargs):
    elif mode == "paraformer":
        from funasr.bin.asr_inference_paraformer import inference_modelscope
        return inference_modelscope(**kwargs)
    elif mode == "paraformer_streaming":
        from funasr.bin.asr_inference_paraformer_streaming import inference_modelscope
        return inference_modelscope(**kwargs)
    elif mode == "paraformer_vad":
        from funasr.bin.asr_inference_paraformer_vad import inference_modelscope
        return inference_modelscope(**kwargs)
@@ -41,8 +41,6 @@ from funasr.utils.types import str_or_none
from funasr.utils import asr_utils, wav_utils, postprocess_utils
import pdb

header_colors = '\033[95m'
end_colors = '\033[0m'

global_asr_language: str = 'zh-cn'
global_sample_rate: Union[int, Dict[Any, int]] = {
@@ -55,7 +53,7 @@ class Speech2Text:

    Examples:
        >>> import soundfile
        >>> speech2text = Speech2Text("asr_config.yml", "asr.pth")
        >>> speech2text = Speech2Text("asr_config.yml", "asr.pb")
        >>> audio, rate = soundfile.read("speech.wav")
        >>> speech2text(audio)
        [(text, token, token_int, hypothesis object), ...]
@@ -50,7 +50,7 @@ class Speech2Text:

    Examples:
        >>> import soundfile
        >>> speech2text = Speech2Text("asr_config.yml", "asr.pth")
        >>> speech2text = Speech2Text("asr_config.yml", "asr.pb")
        >>> audio, rate = soundfile.read("speech.wav")
        >>> speech2text(audio)
        [(text, token, token_int, hypothesis object), ...]
funasr/bin/asr_inference_paraformer_streaming.py (new file, 907 lines)
@@ -0,0 +1,907 @@
#!/usr/bin/env python3
|
||||
import argparse
|
||||
import logging
|
||||
import sys
|
||||
import time
|
||||
import copy
|
||||
import os
|
||||
import codecs
|
||||
import tempfile
|
||||
import requests
|
||||
from pathlib import Path
|
||||
from typing import Optional
|
||||
from typing import Sequence
|
||||
from typing import Tuple
|
||||
from typing import Union
|
||||
from typing import Dict
|
||||
from typing import Any
|
||||
from typing import List
|
||||
|
||||
import numpy as np
|
||||
import torch
|
||||
from typeguard import check_argument_types
|
||||
|
||||
from funasr.fileio.datadir_writer import DatadirWriter
|
||||
from funasr.modules.beam_search.beam_search import BeamSearchPara as BeamSearch
|
||||
from funasr.modules.beam_search.beam_search import Hypothesis
|
||||
from funasr.modules.scorers.ctc import CTCPrefixScorer
|
||||
from funasr.modules.scorers.length_bonus import LengthBonus
|
||||
from funasr.modules.subsampling import TooShortUttError
|
||||
from funasr.tasks.asr import ASRTaskParaformer as ASRTask
|
||||
from funasr.tasks.lm import LMTask
|
||||
from funasr.text.build_tokenizer import build_tokenizer
|
||||
from funasr.text.token_id_converter import TokenIDConverter
|
||||
from funasr.torch_utils.device_funcs import to_device
|
||||
from funasr.torch_utils.set_all_random_seed import set_all_random_seed
|
||||
from funasr.utils import config_argparse
|
||||
from funasr.utils.cli_utils import get_commandline_args
|
||||
from funasr.utils.types import str2bool
|
||||
from funasr.utils.types import str2triple_str
|
||||
from funasr.utils.types import str_or_none
|
||||
from funasr.utils import asr_utils, wav_utils, postprocess_utils
|
||||
from funasr.models.frontend.wav_frontend import WavFrontend
|
||||
from funasr.models.e2e_asr_paraformer import BiCifParaformer, ContextualParaformer
|
||||
from funasr.export.models.e2e_asr_paraformer import Paraformer as Paraformer_export
|
||||
|
||||
class Speech2Text:
|
||||
"""Speech2Text class
|
||||
|
||||
Examples:
|
||||
>>> import soundfile
|
||||
>>> speech2text = Speech2Text("asr_config.yml", "asr.pth")
|
||||
>>> audio, rate = soundfile.read("speech.wav")
|
||||
>>> speech2text(audio)
|
||||
[(text, token, token_int, hypothesis object), ...]
|
||||
|
||||
"""
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
asr_train_config: Union[Path, str] = None,
|
||||
asr_model_file: Union[Path, str] = None,
|
||||
cmvn_file: Union[Path, str] = None,
|
||||
lm_train_config: Union[Path, str] = None,
|
||||
lm_file: Union[Path, str] = None,
|
||||
token_type: str = None,
|
||||
bpemodel: str = None,
|
||||
device: str = "cpu",
|
||||
maxlenratio: float = 0.0,
|
||||
minlenratio: float = 0.0,
|
||||
dtype: str = "float32",
|
||||
beam_size: int = 20,
|
||||
ctc_weight: float = 0.5,
|
||||
lm_weight: float = 1.0,
|
||||
ngram_weight: float = 0.9,
|
||||
penalty: float = 0.0,
|
||||
nbest: int = 1,
|
||||
frontend_conf: dict = None,
|
||||
hotword_list_or_file: str = None,
|
||||
**kwargs,
|
||||
):
|
||||
assert check_argument_types()
|
||||
|
||||
# 1. Build ASR model
|
||||
scorers = {}
|
||||
asr_model, asr_train_args = ASRTask.build_model_from_file(
|
||||
asr_train_config, asr_model_file, cmvn_file, device
|
||||
)
|
||||
frontend = None
|
||||
if asr_train_args.frontend is not None and asr_train_args.frontend_conf is not None:
|
||||
frontend = WavFrontend(cmvn_file=cmvn_file, **asr_train_args.frontend_conf)
|
||||
|
||||
logging.info("asr_model: {}".format(asr_model))
|
||||
logging.info("asr_train_args: {}".format(asr_train_args))
|
||||
asr_model.to(dtype=getattr(torch, dtype)).eval()
|
||||
|
||||
if asr_model.ctc != None:
|
||||
ctc = CTCPrefixScorer(ctc=asr_model.ctc, eos=asr_model.eos)
|
||||
scorers.update(
|
||||
ctc=ctc
|
||||
)
|
||||
token_list = asr_model.token_list
|
||||
scorers.update(
|
||||
length_bonus=LengthBonus(len(token_list)),
|
||||
)
|
||||
|
||||
# 2. Build Language model
|
||||
if lm_train_config is not None:
|
||||
lm, lm_train_args = LMTask.build_model_from_file(
|
||||
lm_train_config, lm_file, device
|
||||
)
|
||||
scorers["lm"] = lm.lm
|
||||
|
||||
# 3. Build ngram model
|
||||
# ngram is not supported now
|
||||
ngram = None
|
||||
scorers["ngram"] = ngram
|
||||
|
||||
# 4. Build BeamSearch object
|
||||
# transducer is not supported now
|
||||
beam_search_transducer = None
|
||||
|
||||
weights = dict(
|
||||
decoder=1.0 - ctc_weight,
|
||||
ctc=ctc_weight,
|
||||
lm=lm_weight,
|
||||
ngram=ngram_weight,
|
||||
length_bonus=penalty,
|
||||
)
|
||||
beam_search = BeamSearch(
|
||||
beam_size=beam_size,
|
||||
weights=weights,
|
||||
scorers=scorers,
|
||||
sos=asr_model.sos,
|
||||
eos=asr_model.eos,
|
||||
vocab_size=len(token_list),
|
||||
token_list=token_list,
|
||||
pre_beam_score_key=None if ctc_weight == 1.0 else "full",
|
||||
)
|
||||
|
||||
beam_search.to(device=device, dtype=getattr(torch, dtype)).eval()
|
||||
for scorer in scorers.values():
|
||||
if isinstance(scorer, torch.nn.Module):
|
||||
scorer.to(device=device, dtype=getattr(torch, dtype)).eval()
|
||||
|
||||
logging.info(f"Decoding device={device}, dtype={dtype}")
|
||||
|
||||
# 5. [Optional] Build Text converter: e.g. bpe-sym -> Text
|
||||
if token_type is None:
|
||||
token_type = asr_train_args.token_type
|
||||
if bpemodel is None:
|
||||
bpemodel = asr_train_args.bpemodel
|
||||
|
||||
if token_type is None:
|
||||
tokenizer = None
|
||||
elif token_type == "bpe":
|
||||
if bpemodel is not None:
|
||||
tokenizer = build_tokenizer(token_type=token_type, bpemodel=bpemodel)
|
||||
else:
|
||||
tokenizer = None
|
||||
else:
|
||||
tokenizer = build_tokenizer(token_type=token_type)
|
||||
converter = TokenIDConverter(token_list=token_list)
|
||||
logging.info(f"Text tokenizer: {tokenizer}")
|
||||
|
||||
self.asr_model = asr_model
|
||||
self.asr_train_args = asr_train_args
|
||||
self.converter = converter
|
||||
self.tokenizer = tokenizer
|
||||
|
||||
# 6. [Optional] Build hotword list from str, local file or url
|
||||
|
||||
is_use_lm = lm_weight != 0.0 and lm_file is not None
|
||||
if (ctc_weight == 0.0 or asr_model.ctc == None) and not is_use_lm:
|
||||
beam_search = None
|
||||
self.beam_search = beam_search
|
||||
logging.info(f"Beam_search: {self.beam_search}")
|
||||
self.beam_search_transducer = beam_search_transducer
|
||||
self.maxlenratio = maxlenratio
|
||||
self.minlenratio = minlenratio
|
||||
self.device = device
|
||||
self.dtype = dtype
|
||||
self.nbest = nbest
|
||||
self.frontend = frontend
|
||||
self.encoder_downsampling_factor = 1
|
||||
if asr_train_args.encoder == "data2vec_encoder" or asr_train_args.encoder_conf["input_layer"] == "conv2d":
|
||||
self.encoder_downsampling_factor = 4
|
||||
|
||||
@torch.no_grad()
|
||||
def __call__(
|
||||
self, cache: dict, speech: Union[torch.Tensor, np.ndarray], speech_lengths: Union[torch.Tensor, np.ndarray] = None,
|
||||
begin_time: int = 0, end_time: int = None,
|
||||
):
|
||||
"""Inference
|
||||
|
||||
Args:
|
||||
speech: Input speech data
|
||||
Returns:
|
||||
text, token, token_int, hyp
|
||||
|
||||
"""
|
||||
assert check_argument_types()
|
||||
|
||||
# Input as audio signal
|
||||
if isinstance(speech, np.ndarray):
|
||||
speech = torch.tensor(speech)
|
||||
|
||||
if self.frontend is not None:
|
||||
feats, feats_len = self.frontend.forward(speech, speech_lengths)
|
||||
feats = to_device(feats, device=self.device)
|
||||
feats_len = feats_len.int()
|
||||
self.asr_model.frontend = None
|
||||
else:
|
||||
feats = speech
|
||||
feats_len = speech_lengths
|
||||
lfr_factor = max(1, (feats.size()[-1] // 80) - 1)
|
||||
batch = {"speech": feats, "speech_lengths": feats_len, "cache": cache}
|
||||
|
||||
# a. To device
|
||||
batch = to_device(batch, device=self.device)
|
||||
|
||||
# b. Forward Encoder
|
||||
enc, enc_len = self.asr_model.encode_chunk(**batch)
|
||||
if isinstance(enc, tuple):
|
||||
enc = enc[0]
|
||||
# assert len(enc) == 1, len(enc)
|
||||
enc_len_batch_total = torch.sum(enc_len).item() * self.encoder_downsampling_factor
|
||||
|
||||
predictor_outs = self.asr_model.calc_predictor_chunk(enc, cache)
|
||||
pre_acoustic_embeds, pre_token_length, alphas, pre_peak_index = predictor_outs[0], predictor_outs[1], \
|
||||
predictor_outs[2], predictor_outs[3]
|
||||
pre_token_length = pre_token_length.floor().long()
|
||||
if torch.max(pre_token_length) < 1:
|
||||
return []
|
||||
decoder_outs = self.asr_model.cal_decoder_with_predictor_chunk(enc, pre_acoustic_embeds, cache)
|
||||
decoder_out = decoder_outs
|
||||
|
||||
results = []
|
||||
b, n, d = decoder_out.size()
|
||||
for i in range(b):
|
||||
x = enc[i, :enc_len[i], :]
|
||||
am_scores = decoder_out[i, :pre_token_length[i], :]
|
||||
if self.beam_search is not None:
|
||||
nbest_hyps = self.beam_search(
|
||||
x=x, am_scores=am_scores, maxlenratio=self.maxlenratio, minlenratio=self.minlenratio
|
||||
)
|
||||
|
||||
nbest_hyps = nbest_hyps[: self.nbest]
|
||||
else:
|
||||
yseq = am_scores.argmax(dim=-1)
|
||||
score = am_scores.max(dim=-1)[0]
|
||||
score = torch.sum(score, dim=-1)
|
||||
# pad with mask tokens to ensure compatibility with sos/eos tokens
|
||||
yseq = torch.tensor(
|
||||
[self.asr_model.sos] + yseq.tolist() + [self.asr_model.eos], device=yseq.device
|
||||
)
|
||||
nbest_hyps = [Hypothesis(yseq=yseq, score=score)]
|
||||
|
||||
for hyp in nbest_hyps:
|
||||
assert isinstance(hyp, (Hypothesis)), type(hyp)
|
||||
|
||||
# remove sos/eos and get results
|
||||
last_pos = -1
|
||||
if isinstance(hyp.yseq, list):
|
||||
token_int = hyp.yseq[1:last_pos]
|
||||
else:
|
||||
token_int = hyp.yseq[1:last_pos].tolist()
|
||||
|
||||
# remove blank symbol id, which is assumed to be 0
|
||||
token_int = list(filter(lambda x: x != 0 and x != 2, token_int))
|
||||
|
||||
# Change integer-ids to tokens
|
||||
token = self.converter.ids2tokens(token_int)
|
||||
|
||||
if self.tokenizer is not None:
|
||||
text = self.tokenizer.tokens2text(token)
|
||||
else:
|
||||
text = None
|
||||
|
||||
results.append((text, token, token_int, hyp, enc_len_batch_total, lfr_factor))
|
||||
|
||||
# assert check_return_type(results)
|
||||
return results
|
||||
|
||||
|
||||
class Speech2TextExport:
|
||||
"""Speech2TextExport class
|
||||
|
||||
"""
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
asr_train_config: Union[Path, str] = None,
|
||||
asr_model_file: Union[Path, str] = None,
|
||||
cmvn_file: Union[Path, str] = None,
|
||||
lm_train_config: Union[Path, str] = None,
|
||||
lm_file: Union[Path, str] = None,
|
||||
token_type: str = None,
|
||||
bpemodel: str = None,
|
||||
device: str = "cpu",
|
||||
maxlenratio: float = 0.0,
|
||||
minlenratio: float = 0.0,
|
||||
dtype: str = "float32",
|
||||
beam_size: int = 20,
|
||||
ctc_weight: float = 0.5,
|
||||
lm_weight: float = 1.0,
|
||||
ngram_weight: float = 0.9,
|
||||
penalty: float = 0.0,
|
||||
nbest: int = 1,
|
||||
frontend_conf: dict = None,
|
||||
hotword_list_or_file: str = None,
|
||||
**kwargs,
|
||||
):
|
||||
|
||||
# 1. Build ASR model
|
||||
asr_model, asr_train_args = ASRTask.build_model_from_file(
|
||||
asr_train_config, asr_model_file, cmvn_file, device
|
||||
)
|
||||
frontend = None
|
||||
if asr_train_args.frontend is not None and asr_train_args.frontend_conf is not None:
|
||||
frontend = WavFrontend(cmvn_file=cmvn_file, **asr_train_args.frontend_conf)
|
||||
|
||||
logging.info("asr_model: {}".format(asr_model))
|
||||
logging.info("asr_train_args: {}".format(asr_train_args))
|
||||
asr_model.to(dtype=getattr(torch, dtype)).eval()
|
||||
|
||||
token_list = asr_model.token_list
|
||||
|
||||
logging.info(f"Decoding device={device}, dtype={dtype}")
|
||||
|
||||
# 5. [Optional] Build Text converter: e.g. bpe-sym -> Text
|
||||
if token_type is None:
|
||||
token_type = asr_train_args.token_type
|
||||
if bpemodel is None:
|
||||
bpemodel = asr_train_args.bpemodel
|
||||
|
||||
if token_type is None:
|
||||
tokenizer = None
|
||||
elif token_type == "bpe":
|
||||
if bpemodel is not None:
|
||||
tokenizer = build_tokenizer(token_type=token_type, bpemodel=bpemodel)
|
||||
else:
|
||||
tokenizer = None
|
||||
else:
|
||||
tokenizer = build_tokenizer(token_type=token_type)
|
||||
converter = TokenIDConverter(token_list=token_list)
|
||||
logging.info(f"Text tokenizer: {tokenizer}")
|
||||
|
||||
# self.asr_model = asr_model
|
||||
self.asr_train_args = asr_train_args
|
||||
self.converter = converter
|
||||
self.tokenizer = tokenizer
|
||||
|
||||
self.device = device
|
||||
self.dtype = dtype
|
||||
self.nbest = nbest
|
||||
self.frontend = frontend
|
||||
|
||||
model = Paraformer_export(asr_model, onnx=False)
|
||||
self.asr_model = model
|
||||
|
||||
@torch.no_grad()
|
||||
def __call__(
|
||||
self, speech: Union[torch.Tensor, np.ndarray], speech_lengths: Union[torch.Tensor, np.ndarray] = None
|
||||
):
|
||||
"""Inference
|
||||
|
||||
Args:
|
||||
speech: Input speech data
|
||||
Returns:
|
||||
text, token, token_int, hyp
|
||||
|
||||
"""
|
||||
assert check_argument_types()
|
||||
|
||||
# Input as audio signal
|
||||
if isinstance(speech, np.ndarray):
|
||||
speech = torch.tensor(speech)
|
||||
|
||||
if self.frontend is not None:
|
||||
feats, feats_len = self.frontend.forward(speech, speech_lengths)
|
||||
feats = to_device(feats, device=self.device)
|
||||
feats_len = feats_len.int()
|
||||
self.asr_model.frontend = None
|
||||
else:
|
||||
feats = speech
|
||||
feats_len = speech_lengths
|
||||
|
||||
enc_len_batch_total = feats_len.sum()
|
||||
lfr_factor = max(1, (feats.size()[-1] // 80) - 1)
|
||||
batch = {"speech": feats, "speech_lengths": feats_len}
|
||||
|
||||
# a. To device
|
||||
batch = to_device(batch, device=self.device)
|
||||
|
||||
decoder_outs = self.asr_model(**batch)
|
||||
decoder_out, ys_pad_lens = decoder_outs[0], decoder_outs[1]
|
||||
|
||||
results = []
|
||||
b, n, d = decoder_out.size()
|
||||
for i in range(b):
|
||||
am_scores = decoder_out[i, :ys_pad_lens[i], :]
|
||||
|
||||
yseq = am_scores.argmax(dim=-1)
|
||||
score = am_scores.max(dim=-1)[0]
|
||||
score = torch.sum(score, dim=-1)
|
||||
# pad with mask tokens to ensure compatibility with sos/eos tokens
|
||||
yseq = torch.tensor(
|
||||
yseq.tolist(), device=yseq.device
|
||||
)
|
||||
nbest_hyps = [Hypothesis(yseq=yseq, score=score)]
|
||||
|
||||
for hyp in nbest_hyps:
|
||||
assert isinstance(hyp, (Hypothesis)), type(hyp)
|
||||
|
||||
# remove sos/eos and get results
|
||||
last_pos = -1
|
||||
if isinstance(hyp.yseq, list):
|
||||
token_int = hyp.yseq[1:last_pos]
|
||||
else:
|
||||
token_int = hyp.yseq[1:last_pos].tolist()
|
||||
|
||||
# remove blank symbol id, which is assumed to be 0
|
||||
token_int = list(filter(lambda x: x != 0 and x != 2, token_int))
|
||||
|
||||
# Change integer-ids to tokens
|
||||
token = self.converter.ids2tokens(token_int)
|
||||
|
||||
if self.tokenizer is not None:
|
||||
text = self.tokenizer.tokens2text(token)
|
||||
else:
|
||||
text = None
|
||||
|
||||
results.append((text, token, token_int, hyp, enc_len_batch_total, lfr_factor))
|
||||
|
||||
return results
|
||||
|
||||
|
||||
def inference(
|
||||
maxlenratio: float,
|
||||
minlenratio: float,
|
||||
batch_size: int,
|
||||
beam_size: int,
|
||||
ngpu: int,
|
||||
ctc_weight: float,
|
||||
lm_weight: float,
|
||||
penalty: float,
|
||||
log_level: Union[int, str],
|
||||
data_path_and_name_and_type,
|
||||
asr_train_config: Optional[str],
|
||||
asr_model_file: Optional[str],
|
||||
cmvn_file: Optional[str] = None,
|
||||
raw_inputs: Union[np.ndarray, torch.Tensor] = None,
|
||||
lm_train_config: Optional[str] = None,
|
||||
lm_file: Optional[str] = None,
|
||||
token_type: Optional[str] = None,
|
||||
key_file: Optional[str] = None,
|
||||
word_lm_train_config: Optional[str] = None,
|
||||
bpemodel: Optional[str] = None,
|
||||
allow_variable_data_keys: bool = False,
|
||||
streaming: bool = False,
|
||||
output_dir: Optional[str] = None,
|
||||
dtype: str = "float32",
|
||||
seed: int = 0,
|
||||
ngram_weight: float = 0.9,
|
||||
nbest: int = 1,
|
||||
num_workers: int = 1,
|
||||
|
||||
**kwargs,
|
||||
):
|
||||
inference_pipeline = inference_modelscope(
|
||||
maxlenratio=maxlenratio,
|
||||
minlenratio=minlenratio,
|
||||
batch_size=batch_size,
|
||||
beam_size=beam_size,
|
||||
ngpu=ngpu,
|
||||
ctc_weight=ctc_weight,
|
||||
lm_weight=lm_weight,
|
||||
penalty=penalty,
|
||||
log_level=log_level,
|
||||
asr_train_config=asr_train_config,
|
||||
asr_model_file=asr_model_file,
|
||||
cmvn_file=cmvn_file,
|
||||
raw_inputs=raw_inputs,
|
||||
lm_train_config=lm_train_config,
|
||||
lm_file=lm_file,
|
||||
token_type=token_type,
|
||||
key_file=key_file,
|
||||
word_lm_train_config=word_lm_train_config,
|
||||
bpemodel=bpemodel,
|
||||
allow_variable_data_keys=allow_variable_data_keys,
|
||||
streaming=streaming,
|
||||
output_dir=output_dir,
|
||||
dtype=dtype,
|
||||
seed=seed,
|
||||
ngram_weight=ngram_weight,
|
||||
nbest=nbest,
|
||||
num_workers=num_workers,
|
||||
|
||||
**kwargs,
|
||||
)
|
||||
return inference_pipeline(data_path_and_name_and_type, raw_inputs)
|
||||
|
||||
|
||||
def inference_modelscope(
|
||||
maxlenratio: float,
|
||||
minlenratio: float,
|
||||
batch_size: int,
|
||||
beam_size: int,
|
||||
ngpu: int,
|
||||
ctc_weight: float,
|
||||
lm_weight: float,
|
||||
penalty: float,
|
||||
log_level: Union[int, str],
|
||||
# data_path_and_name_and_type,
|
||||
asr_train_config: Optional[str],
|
||||
asr_model_file: Optional[str],
|
||||
cmvn_file: Optional[str] = None,
|
||||
lm_train_config: Optional[str] = None,
|
||||
lm_file: Optional[str] = None,
|
||||
token_type: Optional[str] = None,
|
||||
key_file: Optional[str] = None,
|
||||
word_lm_train_config: Optional[str] = None,
|
||||
bpemodel: Optional[str] = None,
|
||||
allow_variable_data_keys: bool = False,
|
||||
dtype: str = "float32",
|
||||
seed: int = 0,
|
||||
ngram_weight: float = 0.9,
|
||||
nbest: int = 1,
|
||||
num_workers: int = 1,
|
||||
output_dir: Optional[str] = None,
|
||||
param_dict: dict = None,
|
||||
**kwargs,
|
||||
):
|
||||
assert check_argument_types()
|
||||
|
||||
if word_lm_train_config is not None:
|
||||
raise NotImplementedError("Word LM is not implemented")
|
||||
if ngpu > 1:
|
||||
raise NotImplementedError("only single GPU decoding is supported")
|
||||
|
||||
logging.basicConfig(
|
||||
level=log_level,
|
||||
format="%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s",
|
||||
)
|
||||
|
||||
export_mode = False
|
||||
if param_dict is not None:
|
||||
hotword_list_or_file = param_dict.get('hotword')
|
||||
export_mode = param_dict.get("export_mode", False)
|
||||
else:
|
||||
hotword_list_or_file = None
|
||||
|
||||
if ngpu >= 1 and torch.cuda.is_available():
|
||||
device = "cuda"
|
||||
else:
|
||||
device = "cpu"
|
||||
batch_size = 1
|
||||
|
||||
# 1. Set random-seed
|
||||
set_all_random_seed(seed)
|
||||
|
||||
# 2. Build speech2text
|
||||
speech2text_kwargs = dict(
|
||||
asr_train_config=asr_train_config,
|
||||
asr_model_file=asr_model_file,
|
||||
cmvn_file=cmvn_file,
|
||||
lm_train_config=lm_train_config,
|
||||
lm_file=lm_file,
|
||||
token_type=token_type,
|
||||
bpemodel=bpemodel,
|
||||
device=device,
|
||||
maxlenratio=maxlenratio,
|
||||
minlenratio=minlenratio,
|
||||
dtype=dtype,
|
||||
beam_size=beam_size,
|
||||
ctc_weight=ctc_weight,
|
||||
lm_weight=lm_weight,
|
||||
ngram_weight=ngram_weight,
|
||||
penalty=penalty,
|
||||
nbest=nbest,
|
||||
hotword_list_or_file=hotword_list_or_file,
|
||||
)
|
||||
if export_mode:
|
||||
speech2text = Speech2TextExport(**speech2text_kwargs)
|
||||
else:
|
||||
speech2text = Speech2Text(**speech2text_kwargs)
|
||||
|
||||
def _forward(
|
||||
data_path_and_name_and_type,
|
||||
raw_inputs: Union[np.ndarray, torch.Tensor] = None,
|
||||
output_dir_v2: Optional[str] = None,
|
||||
fs: dict = None,
|
||||
param_dict: dict = None,
|
||||
**kwargs,
|
||||
):
|
||||
|
||||
hotword_list_or_file = None
|
||||
if param_dict is not None:
|
||||
hotword_list_or_file = param_dict.get('hotword')
|
||||
if 'hotword' in kwargs:
|
||||
hotword_list_or_file = kwargs['hotword']
|
||||
if hotword_list_or_file is not None or 'hotword' in kwargs:
|
||||
speech2text.hotword_list = speech2text.generate_hotwords_list(hotword_list_or_file)
|
||||
|
||||
# 3. Build data-iterator
|
||||
if data_path_and_name_and_type is None and raw_inputs is not None:
|
||||
if isinstance(raw_inputs, torch.Tensor):
|
||||
raw_inputs = raw_inputs.numpy()
|
||||
data_path_and_name_and_type = [raw_inputs, "speech", "waveform"]
|
||||
loader = ASRTask.build_streaming_iterator(
|
||||
data_path_and_name_and_type,
|
||||
dtype=dtype,
|
||||
fs=fs,
|
||||
batch_size=batch_size,
|
||||
key_file=key_file,
|
||||
num_workers=num_workers,
|
||||
preprocess_fn=ASRTask.build_preprocess_fn(speech2text.asr_train_args, False),
|
||||
collate_fn=ASRTask.build_collate_fn(speech2text.asr_train_args, False),
|
||||
allow_variable_data_keys=allow_variable_data_keys,
|
||||
inference=True,
|
||||
)
|
||||
|
||||
if param_dict is not None:
|
||||
use_timestamp = param_dict.get('use_timestamp', True)
|
||||
else:
|
||||
use_timestamp = True
|
||||
|
||||
forward_time_total = 0.0
|
||||
length_total = 0.0
|
||||
finish_count = 0
|
||||
file_count = 1
|
||||
cache = None
|
||||
# 7 .Start for-loop
|
||||
# FIXME(kamo): The output format should be discussed about
|
||||
asr_result_list = []
|
||||
output_path = output_dir_v2 if output_dir_v2 is not None else output_dir
|
||||
if output_path is not None:
|
||||
writer = DatadirWriter(output_path)
|
||||
else:
|
||||
writer = None
|
||||
if param_dict is not None and "cache" in param_dict:
|
||||
cache = param_dict["cache"]
|
||||
for keys, batch in loader:
|
||||
assert isinstance(batch, dict), type(batch)
|
||||
assert all(isinstance(s, str) for s in keys), keys
|
||||
_bs = len(next(iter(batch.values())))
|
||||
assert len(keys) == _bs, f"{len(keys)} != {_bs}"
|
||||
# batch = {k: v for k, v in batch.items() if not k.endswith("_lengths")}
|
||||
logging.info("decoding, utt_id: {}".format(keys))
|
||||
# N-best list of (text, token, token_int, hyp_object)
|
||||
|
||||
time_beg = time.time()
|
||||
results = speech2text(cache=cache, **batch)
|
||||
if len(results) < 1:
|
||||
hyp = Hypothesis(score=0.0, scores={}, states={}, yseq=[])
|
||||
results = [[" ", ["sil"], [2], hyp, 10, 6]] * nbest
|
||||
time_end = time.time()
|
||||
forward_time = time_end - time_beg
|
||||
lfr_factor = results[0][-1]
|
||||
length = results[0][-2]
|
||||
forward_time_total += forward_time
|
||||
length_total += length
|
||||
rtf_cur = "decoding, feature length: {}, forward_time: {:.4f}, rtf: {:.4f}".format(length, forward_time,
|
||||
100 * forward_time / (
|
||||
length * lfr_factor))
|
||||
logging.info(rtf_cur)
|
||||
|
||||
for batch_id in range(_bs):
|
||||
result = [results[batch_id][:-2]]
|
||||
|
||||
key = keys[batch_id]
|
||||
for n, result in zip(range(1, nbest + 1), result):
|
||||
text, token, token_int, hyp = result[0], result[1], result[2], result[3]
|
||||
time_stamp = None if len(result) < 5 else result[4]
|
||||
# Create a directory: outdir/{n}best_recog
|
||||
if writer is not None:
|
||||
ibest_writer = writer[f"{n}best_recog"]
|
||||
|
||||
# Write the result to each file
|
||||
ibest_writer["token"][key] = " ".join(token)
|
||||
# ibest_writer["token_int"][key] = " ".join(map(str, token_int))
|
||||
ibest_writer["score"][key] = str(hyp.score)
|
||||
ibest_writer["rtf"][key] = rtf_cur
|
||||
|
||||
if text is not None:
|
||||
if use_timestamp and time_stamp is not None:
|
||||
postprocessed_result = postprocess_utils.sentence_postprocess(token, time_stamp)
|
||||
else:
|
||||
postprocessed_result = postprocess_utils.sentence_postprocess(token)
|
||||
time_stamp_postprocessed = ""
|
||||
if len(postprocessed_result) == 3:
|
||||
text_postprocessed, time_stamp_postprocessed, word_lists = postprocessed_result[0], \
|
||||
postprocessed_result[1], \
|
||||
postprocessed_result[2]
|
||||
else:
|
||||
text_postprocessed, word_lists = postprocessed_result[0], postprocessed_result[1]
|
||||
item = {'key': key, 'value': text_postprocessed}
|
||||
if time_stamp_postprocessed != "":
|
||||
item['time_stamp'] = time_stamp_postprocessed
|
||||
asr_result_list.append(item)
|
||||
finish_count += 1
|
||||
# asr_utils.print_progress(finish_count / file_count)
|
||||
if writer is not None:
|
||||
ibest_writer["text"][key] = text_postprocessed
|
||||
|
||||
logging.info("decoding, utt: {}, predictions: {}".format(key, text))
|
||||
rtf_avg = "decoding, feature length total: {}, forward_time total: {:.4f}, rtf avg: {:.4f}".format(length_total,
|
||||
forward_time_total,
|
||||
100 * forward_time_total / (
|
||||
length_total * lfr_factor))
|
||||
logging.info(rtf_avg)
|
||||
if writer is not None:
|
||||
ibest_writer["rtf"]["rtf_avf"] = rtf_avg
|
||||
return asr_result_list
|
||||
|
||||
return _forward
|
||||
|
||||
|
||||
def get_parser():
|
||||
parser = config_argparse.ArgumentParser(
|
||||
description="ASR Decoding",
|
||||
formatter_class=argparse.ArgumentDefaultsHelpFormatter,
|
||||
)
|
||||
|
||||
# Note(kamo): Use '_' instead of '-' as separator.
|
||||
# '-' is confusing if written in yaml.
|
||||
parser.add_argument(
|
||||
"--log_level",
|
||||
type=lambda x: x.upper(),
|
||||
default="INFO",
|
||||
choices=("CRITICAL", "ERROR", "WARNING", "INFO", "DEBUG", "NOTSET"),
|
||||
help="The verbose level of logging",
|
||||
)
|
||||
|
||||
parser.add_argument("--output_dir", type=str, required=True)
|
||||
parser.add_argument(
|
||||
"--ngpu",
|
||||
type=int,
|
||||
default=0,
|
||||
help="The number of gpus. 0 indicates CPU mode",
|
||||
)
|
||||
parser.add_argument("--seed", type=int, default=0, help="Random seed")
|
||||
parser.add_argument(
|
||||
"--dtype",
|
||||
default="float32",
|
||||
choices=["float16", "float32", "float64"],
|
||||
help="Data type",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--num_workers",
|
||||
type=int,
|
||||
default=1,
|
||||
help="The number of workers used for DataLoader",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--hotword",
|
||||
type=str_or_none,
|
||||
default=None,
|
||||
help="hotword file path or hotwords seperated by space"
|
||||
)
|
||||
group = parser.add_argument_group("Input data related")
|
||||
group.add_argument(
|
||||
"--data_path_and_name_and_type",
|
||||
type=str2triple_str,
|
||||
required=False,
|
||||
action="append",
|
||||
)
|
||||
group.add_argument("--key_file", type=str_or_none)
|
||||
group.add_argument("--allow_variable_data_keys", type=str2bool, default=False)
|
||||
|
||||
group = parser.add_argument_group("The model configuration related")
|
||||
group.add_argument(
|
||||
"--asr_train_config",
|
||||
type=str,
|
||||
help="ASR training configuration",
|
||||
)
|
||||
group.add_argument(
|
||||
"--asr_model_file",
|
||||
type=str,
|
||||
help="ASR model parameter file",
|
||||
)
|
||||
group.add_argument(
|
||||
"--cmvn_file",
|
||||
type=str,
|
||||
help="Global cmvn file",
|
||||
)
|
||||
group.add_argument(
|
||||
"--lm_train_config",
|
||||
type=str,
|
||||
help="LM training configuration",
|
||||
)
|
||||
group.add_argument(
|
||||
"--lm_file",
|
||||
type=str,
|
||||
help="LM parameter file",
|
||||
)
|
||||
group.add_argument(
|
||||
"--word_lm_train_config",
|
||||
type=str,
|
||||
help="Word LM training configuration",
|
||||
)
|
||||
group.add_argument(
|
||||
"--word_lm_file",
|
||||
type=str,
|
||||
help="Word LM parameter file",
|
||||
)
|
||||
group.add_argument(
|
||||
"--ngram_file",
|
||||
type=str,
|
||||
help="N-gram parameter file",
|
||||
)
|
||||
group.add_argument(
|
||||
"--model_tag",
|
||||
type=str,
|
||||
help="Pretrained model tag. If specify this option, *_train_config and "
|
||||
"*_file will be overwritten",
|
||||
)
|
||||
|
||||
group = parser.add_argument_group("Beam-search related")
|
||||
group.add_argument(
|
||||
"--batch_size",
|
||||
type=int,
|
||||
default=1,
|
||||
help="The batch size for inference",
|
||||
)
|
||||
group.add_argument("--nbest", type=int, default=1, help="Output N-best hypotheses")
|
||||
group.add_argument("--beam_size", type=int, default=20, help="Beam size")
|
||||
group.add_argument("--penalty", type=float, default=0.0, help="Insertion penalty")
|
||||
group.add_argument(
|
||||
"--maxlenratio",
|
||||
type=float,
|
||||
default=0.0,
|
||||
help="Input length ratio to obtain max output length. "
|
||||
"If maxlenratio=0.0 (default), it uses a end-detect "
|
||||
"function "
|
||||
"to automatically find maximum hypothesis lengths."
|
||||
"If maxlenratio<0.0, its absolute value is interpreted"
|
||||
"as a constant max output length",
|
||||
)
|
||||
group.add_argument(
|
||||
"--minlenratio",
|
||||
type=float,
|
||||
default=0.0,
|
||||
help="Input length ratio to obtain min output length",
|
||||
)
|
||||
group.add_argument(
|
||||
"--ctc_weight",
|
||||
type=float,
|
||||
default=0.5,
|
||||
help="CTC weight in joint decoding",
|
||||
)
|
||||
group.add_argument("--lm_weight", type=float, default=1.0, help="RNNLM weight")
|
||||
group.add_argument("--ngram_weight", type=float, default=0.9, help="ngram weight")
|
||||
group.add_argument("--streaming", type=str2bool, default=False)
|
||||
|
||||
group.add_argument(
|
||||
"--frontend_conf",
|
||||
default=None,
|
||||
help="",
|
||||
)
|
||||
group.add_argument("--raw_inputs", type=list, default=None)
|
||||
# example=[{'key':'EdevDEWdIYQ_0021','file':'/mnt/data/jiangyu.xzy/test_data/speech_io/SPEECHIO_ASR_ZH00007_zhibodaihuo/wav/EdevDEWdIYQ_0021.wav'}])
|
||||
|
||||
group = parser.add_argument_group("Text converter related")
|
||||
group.add_argument(
|
||||
"--token_type",
|
||||
type=str_or_none,
|
||||
default=None,
|
||||
choices=["char", "bpe", None],
|
||||
help="The token type for ASR model. "
|
||||
"If not given, refers from the training args",
|
||||
)
|
||||
group.add_argument(
|
||||
"--bpemodel",
|
||||
type=str_or_none,
|
||||
default=None,
|
||||
help="The model path of sentencepiece. "
|
||||
"If not given, refers from the training args",
|
||||
)
|
||||
|
||||
return parser
|
||||
|
||||
|
||||
def main(cmd=None):
|
||||
print(get_commandline_args(), file=sys.stderr)
|
||||
parser = get_parser()
|
||||
args = parser.parse_args(cmd)
|
||||
param_dict = {'hotword': args.hotword}
|
||||
kwargs = vars(args)
|
||||
kwargs.pop("config", None)
|
||||
kwargs['param_dict'] = param_dict
|
||||
inference(**kwargs)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
|
||||
# from modelscope.pipelines import pipeline
|
||||
# from modelscope.utils.constant import Tasks
|
||||
#
|
||||
# inference_16k_pipline = pipeline(
|
||||
# task=Tasks.auto_speech_recognition,
|
||||
# model='damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch')
|
||||
#
|
||||
# rec_result = inference_16k_pipline(audio_in='https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/asr_example_zh.wav')
|
||||
# print(rec_result)
|
||||
|
||||
@ -58,7 +58,7 @@ class Speech2Text:
|
||||
|
||||
Examples:
|
||||
>>> import soundfile
|
||||
>>> speech2text = Speech2Text("asr_config.yml", "asr.pth")
|
||||
>>> speech2text = Speech2Text("asr_config.yml", "asr.pb")
|
||||
>>> audio, rate = soundfile.read("speech.wav")
|
||||
>>> speech2text(audio)
|
||||
[(text, token, token_int, hypothesis object), ...]
|
||||
|
||||
@ -46,7 +46,7 @@ class Speech2Text:
|
||||
|
||||
Examples:
|
||||
>>> import soundfile
|
||||
>>> speech2text = Speech2Text("asr_config.yml", "asr.pth")
|
||||
>>> speech2text = Speech2Text("asr_config.yml", "asr.pb")
|
||||
>>> audio, rate = soundfile.read("speech.wav")
|
||||
>>> speech2text(audio)
|
||||
[(text, token, token_int, hypothesis object), ...]
|
||||
|
||||
@ -46,7 +46,7 @@ class Speech2Text:
|
||||
|
||||
Examples:
|
||||
>>> import soundfile
|
||||
>>> speech2text = Speech2Text("asr_config.yml", "asr.pth")
|
||||
>>> speech2text = Speech2Text("asr_config.yml", "asr.pb")
|
||||
>>> audio, rate = soundfile.read("speech.wav")
|
||||
>>> speech2text(audio)
|
||||
[(text, token, token_int, hypothesis object), ...]
|
||||
|
||||
@ -133,7 +133,7 @@ def inference_launch(mode, **kwargs):
|
||||
param_dict = {
|
||||
"extract_profile": True,
|
||||
"sv_train_config": "sv.yaml",
|
||||
"sv_model_file": "sv.pth",
|
||||
"sv_model_file": "sv.pb",
|
||||
}
|
||||
if "param_dict" in kwargs and kwargs["param_dict"] is not None:
|
||||
for key in param_dict:
|
||||
@ -142,6 +142,9 @@ def inference_launch(mode, **kwargs):
|
||||
else:
|
||||
kwargs["param_dict"] = param_dict
|
||||
return inference_modelscope(mode=mode, **kwargs)
|
||||
elif mode == "eend-ola":
|
||||
from funasr.bin.eend_ola_inference import inference_modelscope
|
||||
return inference_modelscope(mode=mode, **kwargs)
|
||||
else:
|
||||
logging.info("Unknown decoding mode: {}".format(mode))
|
||||
return None
|
||||
|
||||
@ -16,6 +16,7 @@ from typing import Union
|
||||
|
||||
import numpy as np
|
||||
import torch
|
||||
from scipy.signal import medfilt
|
||||
from typeguard import check_argument_types
|
||||
|
||||
from funasr.models.frontend.wav_frontend import WavFrontendMel23
|
||||
@ -34,7 +35,7 @@ class Speech2Diarization:
|
||||
Examples:
|
||||
>>> import soundfile
|
||||
>>> import numpy as np
|
||||
>>> speech2diar = Speech2Diarization("diar_sond_config.yml", "diar_sond.pth")
|
||||
>>> speech2diar = Speech2Diarization("diar_sond_config.yml", "diar_sond.pb")
|
||||
>>> profile = np.load("profiles.npy")
|
||||
>>> audio, rate = soundfile.read("speech.wav")
|
||||
>>> speech2diar(audio, profile)
|
||||
@ -146,7 +147,7 @@ def inference_modelscope(
|
||||
output_dir: Optional[str] = None,
|
||||
batch_size: int = 1,
|
||||
dtype: str = "float32",
|
||||
ngpu: int = 0,
|
||||
ngpu: int = 1,
|
||||
num_workers: int = 0,
|
||||
log_level: Union[int, str] = "INFO",
|
||||
key_file: Optional[str] = None,
|
||||
@ -179,7 +180,6 @@ def inference_modelscope(
|
||||
diar_model_file=diar_model_file,
|
||||
device=device,
|
||||
dtype=dtype,
|
||||
streaming=streaming,
|
||||
)
|
||||
logging.info("speech2diarization_kwargs: {}".format(speech2diar_kwargs))
|
||||
speech2diar = Speech2Diarization.from_pretrained(
|
||||
@ -209,7 +209,7 @@ def inference_modelscope(
|
||||
if data_path_and_name_and_type is None and raw_inputs is not None:
|
||||
if isinstance(raw_inputs, torch.Tensor):
|
||||
raw_inputs = raw_inputs.numpy()
|
||||
data_path_and_name_and_type = [raw_inputs, "speech", "waveform"]
|
||||
data_path_and_name_and_type = [raw_inputs[0], "speech", "sound"]
|
||||
loader = EENDOLADiarTask.build_streaming_iterator(
|
||||
data_path_and_name_and_type,
|
||||
dtype=dtype,
|
||||
@ -236,9 +236,23 @@ def inference_modelscope(
|
||||
# batch = {k: v[0] for k, v in batch.items() if not k.endswith("_lengths")}
|
||||
|
||||
results = speech2diar(**batch)
|
||||
|
||||
# post process
|
||||
a = results[0][0].cpu().numpy()
|
||||
a = medfilt(a, (11, 1))
|
||||
rst = []
|
||||
for spkid, frames in enumerate(a.T):
|
||||
frames = np.pad(frames, (1, 1), 'constant')
|
||||
changes, = np.where(np.diff(frames, axis=0) != 0)
|
||||
fmt = "SPEAKER {:s} 1 {:7.2f} {:7.2f} <NA> <NA> {:s} <NA>"
|
||||
for s, e in zip(changes[::2], changes[1::2]):
|
||||
st = s / 10.
|
||||
dur = (e - s) / 10.
|
||||
rst.append(fmt.format(keys[0], st, dur, "{}_{}".format(keys[0], str(spkid))))
|
||||
|
||||
# Only supporting batch_size==1
|
||||
key, value = keys[0], output_results_str(results, keys[0])
|
||||
item = {"key": key, "value": value}
|
||||
value = "\n".join(rst)
|
||||
item = {"key": keys[0], "value": value}
|
||||
result_list.append(item)
|
||||
if output_path is not None:
|
||||
output_writer.write(value)
|
||||
|
||||
@ -42,7 +42,7 @@ class Speech2Diarization:
|
||||
Examples:
|
||||
>>> import soundfile
|
||||
>>> import numpy as np
|
||||
>>> speech2diar = Speech2Diarization("diar_sond_config.yml", "diar_sond.pth")
|
||||
>>> speech2diar = Speech2Diarization("diar_sond_config.yml", "diar_sond.pb")
|
||||
>>> profile = np.load("profiles.npy")
|
||||
>>> audio, rate = soundfile.read("speech.wav")
|
||||
>>> speech2diar(audio, profile)
|
||||
@ -54,7 +54,7 @@ class Speech2Diarization:
|
||||
self,
|
||||
diar_train_config: Union[Path, str] = None,
|
||||
diar_model_file: Union[Path, str] = None,
|
||||
device: str = "cpu",
|
||||
device: Union[str, torch.device] = "cpu",
|
||||
batch_size: int = 1,
|
||||
dtype: str = "float32",
|
||||
streaming: bool = False,
|
||||
@ -114,9 +114,19 @@ class Speech2Diarization:
|
||||
# little-endian order: lower bit first
|
||||
return (np.array(list(b)[::-1]) == '1').astype(dtype)
|
||||
|
||||
return np.row_stack([int2vec(int(x), vec_dim) for x in seq])
|
||||
# process oov
|
||||
seq = np.array([int(x) for x in seq])
|
||||
new_seq = []
|
||||
for i, x in enumerate(seq):
|
||||
if x < 2 ** vec_dim:
|
||||
new_seq.append(x)
|
||||
else:
|
||||
idx_list = np.where(seq < 2 ** vec_dim)[0]
|
||||
idx = np.abs(idx_list - i).argmin()
|
||||
new_seq.append(seq[idx_list[idx]])
|
||||
return np.row_stack([int2vec(x, vec_dim) for x in new_seq])
|
||||
|
||||
def post_processing(self, raw_logits: torch.Tensor, spk_num: int):
|
||||
def post_processing(self, raw_logits: torch.Tensor, spk_num: int, output_format: str = "speaker_turn"):
|
||||
logits_idx = raw_logits.argmax(-1) # B, T, vocab_size -> B, T
|
||||
# upsampling outputs to match inputs
|
||||
ut = logits_idx.shape[1] * self.diar_model.encoder.time_ds_ratio
|
||||
@ -127,8 +137,14 @@ class Speech2Diarization:
|
||||
).squeeze(1).long()
|
||||
logits_idx = logits_idx[0].tolist()
|
||||
pse_labels = [self.token_list[x] for x in logits_idx]
|
||||
if output_format == "pse_labels":
|
||||
return pse_labels, None
|
||||
|
||||
multi_labels = self.seq2arr(pse_labels, spk_num)[:, :spk_num] # remove padding speakers
|
||||
multi_labels = self.smooth_multi_labels(multi_labels)
|
||||
if output_format == "binary_labels":
|
||||
return multi_labels, None
|
||||
|
||||
spk_list = ["spk{}".format(i + 1) for i in range(spk_num)]
|
||||
spk_turns = self.calc_spk_turns(multi_labels, spk_list)
|
||||
results = OrderedDict()
|
||||
@ -149,6 +165,7 @@ class Speech2Diarization:
|
||||
self,
|
||||
speech: Union[torch.Tensor, np.ndarray],
|
||||
profile: Union[torch.Tensor, np.ndarray],
|
||||
output_format: str = "speaker_turn"
|
||||
):
|
||||
"""Inference
|
||||
|
||||
@ -178,7 +195,7 @@ class Speech2Diarization:
|
||||
batch = to_device(batch, device=self.device)
|
||||
|
||||
logits = self.diar_model.prediction_forward(**batch)
|
||||
results, pse_labels = self.post_processing(logits, profile.shape[1])
|
||||
results, pse_labels = self.post_processing(logits, profile.shape[1], output_format)
|
||||
|
||||
return results, pse_labels
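A brief usage sketch of the `output_format` switch added above, mirroring the class docstring; the config, model, and profile file names are the placeholders used there:

```python
# Hedged usage sketch of the output_format option handled in post_processing above.
# Speech2Diarization is the class defined in this file; run this from the same module/context.
import numpy as np
import soundfile

speech2diar = Speech2Diarization("diar_sond_config.yml", "diar_sond.pb")
profile = np.load("profiles.npy")          # speaker profile matrix, as in the docstring
audio, rate = soundfile.read("speech.wav")

# "speaker_turn" (default), "pse_labels", or "binary_labels"
results, _ = speech2diar(audio, profile, output_format="binary_labels")
print(results)
```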
|
||||
|
||||
@ -367,7 +384,7 @@ def inference_modelscope(
|
||||
pse_label_writer = open("{}/labels.txt".format(output_path), "w")
|
||||
logging.info("Start to diarize...")
|
||||
result_list = []
|
||||
for keys, batch in loader:
|
||||
for idx, (keys, batch) in enumerate(loader):
|
||||
assert isinstance(batch, dict), type(batch)
|
||||
assert all(isinstance(s, str) for s in keys), keys
|
||||
_bs = len(next(iter(batch.values())))
|
||||
@ -385,6 +402,9 @@ def inference_modelscope(
|
||||
pse_label_writer.write("{} {}\n".format(key, " ".join(pse_labels)))
|
||||
pse_label_writer.flush()
|
||||
|
||||
if idx % 100 == 0:
|
||||
logging.info("Processing {:5d}: {}".format(idx, key))
|
||||
|
||||
if output_path is not None:
|
||||
output_writer.close()
|
||||
pse_label_writer.close()
|
||||
|
||||
@ -36,7 +36,7 @@ class Speech2Xvector:
|
||||
|
||||
Examples:
|
||||
>>> import soundfile
|
||||
>>> speech2xvector = Speech2Xvector("sv_config.yml", "sv.pth")
|
||||
>>> speech2xvector = Speech2Xvector("sv_config.yml", "sv.pb")
|
||||
>>> audio, rate = soundfile.read("speech.wav")
|
||||
>>> speech2xvector(audio)
|
||||
[(text, token, token_int, hypothesis object), ...]
|
||||
@ -169,7 +169,7 @@ def inference_modelscope(
|
||||
log_level: Union[int, str] = "INFO",
|
||||
key_file: Optional[str] = None,
|
||||
sv_train_config: Optional[str] = "sv.yaml",
|
||||
sv_model_file: Optional[str] = "sv.pth",
|
||||
sv_model_file: Optional[str] = "sv.pb",
|
||||
model_tag: Optional[str] = None,
|
||||
allow_variable_data_keys: bool = True,
|
||||
streaming: bool = False,
|
||||
|
||||
@ -8,6 +8,7 @@ from typing import Dict
|
||||
from typing import Iterator
|
||||
from typing import Tuple
|
||||
from typing import Union
|
||||
from typing import List
|
||||
|
||||
import kaldiio
|
||||
import numpy as np
|
||||
@ -129,7 +130,7 @@ class IterableESPnetDataset(IterableDataset):
|
||||
non_iterable_list = []
|
||||
self.path_name_type_list = []
|
||||
|
||||
if not isinstance(path_name_type_list[0], Tuple):
|
||||
if not isinstance(path_name_type_list[0], (Tuple, List)):
|
||||
path = path_name_type_list[0]
|
||||
name = path_name_type_list[1]
|
||||
_type = path_name_type_list[2]
|
||||
@ -227,13 +228,9 @@ class IterableESPnetDataset(IterableDataset):
|
||||
name = self.path_name_type_list[i][1]
|
||||
_type = self.path_name_type_list[i][2]
|
||||
if _type == "sound":
|
||||
audio_type = os.path.basename(value).split(".")[-1].lower()
|
||||
if audio_type not in SUPPORT_AUDIO_TYPE_SETS:
|
||||
raise NotImplementedError(
|
||||
f'Not supported audio type: {audio_type}')
|
||||
if audio_type == "pcm":
|
||||
_type = "pcm"
|
||||
|
||||
audio_type = os.path.basename(value).lower()
|
||||
if audio_type.rfind(".pcm") >= 0:
|
||||
_type = "pcm"
|
||||
func = DATA_TYPES[_type]
|
||||
array = func(value)
|
||||
if self.fs is not None and (name == "speech" or name == "ref_speech"):
|
||||
@ -335,11 +332,8 @@ class IterableESPnetDataset(IterableDataset):
|
||||
# 2.a. Load data streamingly
|
||||
for value, (path, name, _type) in zip(values, self.path_name_type_list):
|
||||
if _type == "sound":
|
||||
audio_type = os.path.basename(value).split(".")[-1].lower()
|
||||
if audio_type not in SUPPORT_AUDIO_TYPE_SETS:
|
||||
raise NotImplementedError(
|
||||
f'Not supported audio type: {audio_type}')
|
||||
if audio_type == "pcm":
|
||||
audio_type = os.path.basename(value).lower()
|
||||
if audio_type.rfind(".pcm") >= 0:
|
||||
_type = "pcm"
|
||||
func = DATA_TYPES[_type]
|
||||
# Load entry
|
||||
@ -391,3 +385,4 @@ class IterableESPnetDataset(IterableDataset):
|
||||
|
||||
if count == 0:
|
||||
raise RuntimeError("No iteration")
|
||||
|
||||
|
||||
@ -18,15 +18,11 @@ def forward_segment(text, seg_dict):
|
||||
|
||||
def seg_tokenize(txt, seg_dict):
|
||||
out_txt = ""
|
||||
pattern = re.compile(r"([\u4E00-\u9FA5A-Za-z0-9])")
|
||||
for word in txt:
|
||||
if pattern.match(word):
|
||||
if word in seg_dict:
|
||||
out_txt += seg_dict[word] + " "
|
||||
else:
|
||||
out_txt += "<unk>" + " "
|
||||
if word in seg_dict:
|
||||
out_txt += seg_dict[word] + " "
|
||||
else:
|
||||
continue
|
||||
out_txt += "<unk>" + " "
|
||||
return out_txt.strip().split()
|
||||
|
||||
def tokenize(data,
|
||||
|
||||
@ -47,15 +47,11 @@ def forward_segment(text, dic):
|
||||
|
||||
def seg_tokenize(txt, seg_dict):
|
||||
out_txt = ""
|
||||
pattern = re.compile(r"([\u4E00-\u9FA5A-Za-z0-9])")
|
||||
for word in txt:
|
||||
if pattern.match(word):
|
||||
if word in seg_dict:
|
||||
out_txt += seg_dict[word] + " "
|
||||
else:
|
||||
out_txt += "<unk>" + " "
|
||||
if word in seg_dict:
|
||||
out_txt += seg_dict[word] + " "
|
||||
else:
|
||||
continue
|
||||
out_txt += "<unk>" + " "
|
||||
return out_txt.strip().split()
|
||||
|
||||
def seg_tokenize_wo_pattern(txt, seg_dict):
|
||||
|
||||
@@ -2,6 +2,8 @@
## Environments
torch >= 1.11.0
modelscope >= 1.2.0
torch-quant >= 0.4.0 (required for exporting quantized torchscript format model)
# pip install torch-quant -i https://pypi.org/simple

## Install modelscope and funasr

@@ -11,31 +13,46 @@ The installation is the same as [funasr](../../README.md)
`Tips`: torch>=1.11.0

```shell
python -m funasr.export.export_model [model_name] [export_dir] [onnx]
python -m funasr.export.export_model \
    --model-name [model_name] \
    --export-dir [export_dir] \
    --type [onnx, torch] \
    --quantize [true, false] \
    --fallback-num [fallback_num]
```
`model_name`: the model to export. It could be a model from ModelScope, or a local finetuned model (named `model.pb`).
`export_dir`: the dir where the onnx model is exported.
`onnx`: `true`, export an onnx format model; `false`, export a torchscript format model.
`model-name`: the model to export. It could be a model from ModelScope, or a local finetuned model (named `model.pb`).

`export-dir`: the dir where the model is exported.

`type`: `onnx` or `torch`, export an onnx format model or a torchscript format model.

`quantize`: `true`, export a quantized model at the same time; `false`, export the fp32 model only.

`fallback-num`: the number of fallback layers to use when performing automatic mixed-precision quantization.


## For example
### Export onnx format model
Export a model from ModelScope
```shell
python -m funasr.export.export_model 'damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch' "./export" true
python -m funasr.export.export_model --model-name damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch --export-dir ./export --type onnx
```
Export a model from a local path; the model file must be named `model.pb`.
```shell
python -m funasr.export.export_model '/mnt/workspace/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch' "./export" true
python -m funasr.export.export_model --model-name /mnt/workspace/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch --export-dir ./export --type onnx
```

### Export torchscript format model
Export a model from ModelScope
```shell
python -m funasr.export.export_model 'damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch' "./export" false
python -m funasr.export.export_model --model-name damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch --export-dir ./export --type torch
```

Export a model from a local path; the model file must be named `model.pb`.
```shell
python -m funasr.export.export_model '/mnt/workspace/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch' "./export" false
python -m funasr.export.export_model --model-name /mnt/workspace/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch --export-dir ./export --type torch
```
|
||||
|
||||
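The exported torchscript file can be checked the same way by loading it back with `torch.jit.load`. The file name below follows the `{model_name}.torchscripts` / `{model_name}_quant.torchscripts` pattern used by the exporter; the dummy shapes are assumptions.

```python
# Minimal sketch: load and run the exported (optionally quantized) torchscript model.
import torch

model = torch.jit.load("./export/model.torchscripts", map_location="cpu")  # or model_quant.torchscripts
model.eval()

feats = torch.randn(1, 100, 560)                     # dummy LFR fbank features
feats_len = torch.tensor([100], dtype=torch.int32)   # may need int64

with torch.no_grad():
    outputs = model(feats, feats_len)
print(type(outputs))
```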
## Acknowledge
Torch model quantization is supported by [BladeDISC](https://github.com/alibaba/BladeDISC), an end-to-end DynamIc Shape Compiler project for machine learning workloads. BladeDISC provides general, transparent, and easy-to-use performance optimization for TensorFlow/PyTorch workloads on GPGPU and CPU backends. If you are interested, please contact us.

@ -10,12 +10,20 @@ import torch
|
||||
from funasr.export.models import get_model
|
||||
import numpy as np
|
||||
import random
|
||||
|
||||
from funasr.utils.types import str2bool
|
||||
# torch_version = float(".".join(torch.__version__.split(".")[:2]))
|
||||
# assert torch_version > 1.9
|
||||
|
||||
class ASRModelExportParaformer:
|
||||
def __init__(self, cache_dir: Union[Path, str] = None, onnx: bool = True):
|
||||
def __init__(
|
||||
self,
|
||||
cache_dir: Union[Path, str] = None,
|
||||
onnx: bool = True,
|
||||
quant: bool = True,
|
||||
fallback_num: int = 0,
|
||||
audio_in: str = None,
|
||||
calib_num: int = 200,
|
||||
):
|
||||
assert check_argument_types()
|
||||
self.set_all_random_seed(0)
|
||||
if cache_dir is None:
|
||||
@ -28,6 +36,11 @@ class ASRModelExportParaformer:
|
||||
)
|
||||
print("output dir: {}".format(self.cache_dir))
|
||||
self.onnx = onnx
|
||||
self.quant = quant
|
||||
self.fallback_num = fallback_num
|
||||
self.frontend = None
|
||||
self.audio_in = audio_in
|
||||
self.calib_num = calib_num
|
||||
|
||||
|
||||
def _export(
|
||||
@ -56,6 +69,43 @@ class ASRModelExportParaformer:
|
||||
print("output dir: {}".format(export_dir))
|
||||
|
||||
|
||||
def _torch_quantize(self, model):
|
||||
def _run_calibration_data(m):
|
||||
# use dummy inputs as an example
|
||||
if self.audio_in is not None:
|
||||
feats, feats_len = self.load_feats(self.audio_in)
|
||||
for i, (feat, len) in enumerate(zip(feats, feats_len)):
|
||||
with torch.no_grad():
|
||||
m(feat, len)
|
||||
else:
|
||||
dummy_input = model.get_dummy_inputs()
|
||||
m(*dummy_input)
|
||||
|
||||
|
||||
from torch_quant.module import ModuleFilter
|
||||
from torch_quant.quantizer import Backend, Quantizer
|
||||
from funasr.export.models.modules.decoder_layer import DecoderLayerSANM
|
||||
from funasr.export.models.modules.encoder_layer import EncoderLayerSANM
|
||||
module_filter = ModuleFilter(include_classes=[EncoderLayerSANM, DecoderLayerSANM])
|
||||
module_filter.exclude_op_types = [torch.nn.Conv1d]
|
||||
quantizer = Quantizer(
|
||||
module_filter=module_filter,
|
||||
backend=Backend.FBGEMM,
|
||||
)
|
||||
model.eval()
|
||||
calib_model = quantizer.calib(model)
|
||||
_run_calibration_data(calib_model)
|
||||
if self.fallback_num > 0:
|
||||
# perform automatic mixed precision quantization
|
||||
amp_model = quantizer.amp(model)
|
||||
_run_calibration_data(amp_model)
|
||||
quantizer.fallback(amp_model, num=self.fallback_num)
|
||||
print('Fallback layers:')
|
||||
print('\n'.join(quantizer.module_filter.exclude_names))
|
||||
quant_model = quantizer.quantize(model)
|
||||
return quant_model
|
||||
|
||||
|
||||
def _export_torchscripts(self, model, verbose, path, enc_size=None):
|
||||
if enc_size:
|
||||
dummy_input = model.get_dummy_inputs(enc_size)
|
||||
@ -66,10 +116,49 @@ class ASRModelExportParaformer:
|
||||
model_script = torch.jit.trace(model, dummy_input)
|
||||
model_script.save(os.path.join(path, f'{model.model_name}.torchscripts'))
|
||||
|
||||
if self.quant:
|
||||
quant_model = self._torch_quantize(model)
|
||||
model_script = torch.jit.trace(quant_model, dummy_input)
|
||||
model_script.save(os.path.join(path, f'{model.model_name}_quant.torchscripts'))
|
||||
|
||||
|
||||
def set_all_random_seed(self, seed: int):
|
||||
random.seed(seed)
|
||||
np.random.seed(seed)
|
||||
torch.random.manual_seed(seed)
|
||||
|
||||
def parse_audio_in(self, audio_in):
|
||||
|
||||
wav_list, name_list = [], []
|
||||
if audio_in.endswith(".scp"):
|
||||
f = open(audio_in, 'r')
|
||||
lines = f.readlines()[:self.calib_num]
|
||||
for line in lines:
|
||||
name, path = line.strip().split()
|
||||
name_list.append(name)
|
||||
wav_list.append(path)
|
||||
else:
|
||||
wav_list = [audio_in,]
|
||||
name_list = ["test",]
|
||||
return wav_list, name_list
|
||||
|
||||
def load_feats(self, audio_in: str = None):
|
||||
import torchaudio
|
||||
|
||||
wav_list, name_list = self.parse_audio_in(audio_in)
|
||||
feats = []
|
||||
feats_len = []
|
||||
for line in wav_list:
|
||||
path = line.strip()
|
||||
waveform, sampling_rate = torchaudio.load(path)
|
||||
if sampling_rate != self.frontend.fs:
|
||||
waveform = torchaudio.transforms.Resample(orig_freq=sampling_rate,
|
||||
new_freq=self.frontend.fs)(waveform)
|
||||
fbank, fbank_len = self.frontend(waveform, [waveform.size(1)])
|
||||
feats.append(fbank)
|
||||
feats_len.append(fbank_len)
|
||||
return feats, feats_len
|
||||
|
||||
def export(self,
|
||||
tag_name: str = 'damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch',
|
||||
mode: str = 'paraformer',
|
||||
@ -96,6 +185,7 @@ class ASRModelExportParaformer:
|
||||
model, asr_train_args = ASRTask.build_model_from_file(
|
||||
asr_train_config, asr_model_file, cmvn_file, 'cpu'
|
||||
)
|
||||
self.frontend = model.frontend
|
||||
self._export(model, tag_name)
|
||||
|
||||
|
||||
@ -107,11 +197,12 @@ class ASRModelExportParaformer:
|
||||
|
||||
# model_script = torch.jit.script(model)
|
||||
model_script = model #torch.jit.trace(model)
|
||||
model_path = os.path.join(path, f'{model.model_name}.onnx')
|
||||
|
||||
torch.onnx.export(
|
||||
model_script,
|
||||
dummy_input,
|
||||
os.path.join(path, f'{model.model_name}.onnx'),
|
||||
model_path,
|
||||
verbose=verbose,
|
||||
opset_version=14,
|
||||
input_names=model.get_input_names(),
|
||||
@ -119,17 +210,42 @@ class ASRModelExportParaformer:
|
||||
dynamic_axes=model.get_dynamic_axes()
|
||||
)
|
||||
|
||||
if self.quant:
|
||||
from onnxruntime.quantization import QuantType, quantize_dynamic
|
||||
import onnx
|
||||
quant_model_path = os.path.join(path, f'{model.model_name}_quant.onnx')
|
||||
onnx_model = onnx.load(model_path)
|
||||
nodes = [n.name for n in onnx_model.graph.node]
|
||||
nodes_to_exclude = [m for m in nodes if 'output' in m]
|
||||
quantize_dynamic(
|
||||
model_input=model_path,
|
||||
model_output=quant_model_path,
|
||||
op_types_to_quantize=['MatMul'],
|
||||
per_channel=True,
|
||||
reduce_range=False,
|
||||
weight_type=QuantType.QUInt8,
|
||||
nodes_to_exclude=nodes_to_exclude,
|
||||
)
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
import sys
|
||||
|
||||
model_path = sys.argv[1]
|
||||
output_dir = sys.argv[2]
|
||||
onnx = sys.argv[3]
|
||||
onnx = onnx.lower()
|
||||
onnx = onnx == 'true'
|
||||
# model_path = 'damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch'
|
||||
# output_dir = "../export"
|
||||
export_model = ASRModelExportParaformer(cache_dir=output_dir, onnx=onnx)
|
||||
export_model.export(model_path)
|
||||
# export_model.export('/root/cache/export/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch')
|
||||
import argparse
|
||||
parser = argparse.ArgumentParser()
|
||||
parser.add_argument('--model-name', type=str, required=True)
|
||||
parser.add_argument('--export-dir', type=str, required=True)
|
||||
parser.add_argument('--type', type=str, default='onnx', help='["onnx", "torch"]')
|
||||
parser.add_argument('--quantize', type=str2bool, default=False, help='export quantized model')
|
||||
parser.add_argument('--fallback-num', type=int, default=0, help='amp fallback number')
|
||||
parser.add_argument('--audio_in', type=str, default=None, help='["wav", "wav.scp"]')
|
||||
parser.add_argument('--calib_num', type=int, default=200, help='calib max num')
|
||||
args = parser.parse_args()
|
||||
|
||||
export_model = ASRModelExportParaformer(
|
||||
cache_dir=args.export_dir,
|
||||
onnx=args.type == 'onnx',
|
||||
quant=args.quantize,
|
||||
fallback_num=args.fallback_num,
|
||||
audio_in=args.audio_in,
|
||||
calib_num=args.calib_num,
|
||||
)
|
||||
export_model.export(args.model_name)
|
||||
|
||||
@ -16,6 +16,7 @@ class EncoderLayerSANM(nn.Module):
|
||||
self.feed_forward = model.feed_forward
|
||||
self.norm1 = model.norm1
|
||||
self.norm2 = model.norm2
|
||||
self.in_size = model.in_size
|
||||
self.size = model.size
|
||||
|
||||
def forward(self, x, mask):
|
||||
@ -23,13 +24,12 @@ class EncoderLayerSANM(nn.Module):
|
||||
residual = x
|
||||
x = self.norm1(x)
|
||||
x = self.self_attn(x, mask)
|
||||
if x.size(2) == residual.size(2):
|
||||
if self.in_size == self.size:
|
||||
x = x + residual
|
||||
residual = x
|
||||
x = self.norm2(x)
|
||||
x = self.feed_forward(x)
|
||||
if x.size(2) == residual.size(2):
|
||||
x = x + residual
|
||||
x = x + residual
|
||||
|
||||
return x, mask
|
||||
|
||||
|
||||
@ -64,6 +64,23 @@ class MultiHeadedAttentionSANM(nn.Module):
|
||||
return self.linear_out(context_layer) # (batch, time1, d_model)
|
||||
|
||||
|
||||
def preprocess_for_attn(x, mask, cache, pad_fn):
|
||||
x = x * mask
|
||||
x = x.transpose(1, 2)
|
||||
if cache is None:
|
||||
x = pad_fn(x)
|
||||
else:
|
||||
x = torch.cat((cache[:, :, 1:], x), dim=2)
|
||||
cache = x
|
||||
return x, cache
|
||||
|
||||
|
||||
torch_version = float(".".join(torch.__version__.split(".")[:2]))
|
||||
if torch_version >= 1.8:
|
||||
import torch.fx
|
||||
torch.fx.wrap('preprocess_for_attn')
|
||||
|
||||
|
||||
class MultiHeadedAttentionSANMDecoder(nn.Module):
|
||||
def __init__(self, model):
|
||||
super().__init__()
|
||||
@ -73,16 +90,7 @@ class MultiHeadedAttentionSANMDecoder(nn.Module):
|
||||
self.attn = None
|
||||
|
||||
def forward(self, inputs, mask, cache=None):
|
||||
# b, t, d = inputs.size()
|
||||
# mask = torch.reshape(mask, (b, -1, 1))
|
||||
inputs = inputs * mask
|
||||
|
||||
x = inputs.transpose(1, 2)
|
||||
if cache is None:
|
||||
x = self.pad_fn(x)
|
||||
else:
|
||||
x = torch.cat((cache[:, :, 1:], x), dim=2)
|
||||
cache = x
|
||||
x, cache = preprocess_for_attn(inputs, mask, cache, self.pad_fn)
|
||||
x = self.fsmn_block(x)
|
||||
x = x.transpose(1, 2)
|
||||
|
||||
@ -232,4 +240,4 @@ class OnnxRelPosMultiHeadedAttention(OnnxMultiHeadedAttention):
|
||||
new_context_layer_shape = context_layer.size()[:-2] + (self.all_head_size,)
|
||||
context_layer = context_layer.view(new_context_layer_shape)
|
||||
return self.linear_out(context_layer) # (batch, time1, d_model)
|
||||
|
||||
|
||||
|
||||
@ -66,13 +66,13 @@ def average_nbest_models(
|
||||
elif n == 1:
|
||||
# The averaged model is the same as the best model
|
||||
e, _ = epoch_and_values[0]
|
||||
op = output_dir / f"{e}epoch.pth"
|
||||
sym_op = output_dir / f"{ph}.{cr}.ave_1best.{suffix}pth"
|
||||
op = output_dir / f"{e}epoch.pb"
|
||||
sym_op = output_dir / f"{ph}.{cr}.ave_1best.{suffix}pb"
|
||||
if sym_op.is_symlink() or sym_op.exists():
|
||||
sym_op.unlink()
|
||||
sym_op.symlink_to(op.name)
|
||||
else:
|
||||
op = output_dir / f"{ph}.{cr}.ave_{n}best.{suffix}pth"
|
||||
op = output_dir / f"{ph}.{cr}.ave_{n}best.{suffix}pb"
|
||||
logging.info(
|
||||
f"Averaging {n}best models: " f'criterion="{ph}.{cr}": {op}'
|
||||
)
|
||||
@ -83,12 +83,12 @@ def average_nbest_models(
|
||||
if e not in _loaded:
|
||||
if oss_bucket is None:
|
||||
_loaded[e] = torch.load(
|
||||
output_dir / f"{e}epoch.pth",
|
||||
output_dir / f"{e}epoch.pb",
|
||||
map_location="cpu",
|
||||
)
|
||||
else:
|
||||
buffer = BytesIO(
|
||||
oss_bucket.get_object(os.path.join(pai_output_dir, f"{e}epoch.pth")).read())
|
||||
oss_bucket.get_object(os.path.join(pai_output_dir, f"{e}epoch.pb")).read())
|
||||
_loaded[e] = torch.load(buffer)
|
||||
states = _loaded[e]
|
||||
|
||||
@ -115,13 +115,13 @@ def average_nbest_models(
|
||||
else:
|
||||
buffer = BytesIO()
|
||||
torch.save(avg, buffer)
|
||||
oss_bucket.put_object(os.path.join(pai_output_dir, f"{ph}.{cr}.ave_{n}best.{suffix}pth"),
|
||||
oss_bucket.put_object(os.path.join(pai_output_dir, f"{ph}.{cr}.ave_{n}best.{suffix}pb"),
|
||||
buffer.getvalue())
|
||||
|
||||
# 3. *.*.ave.pth is a symlink to the max ave model
|
||||
# 3. *.*.ave.pb is a symlink to the max ave model
|
||||
if oss_bucket is None:
|
||||
op = output_dir / f"{ph}.{cr}.ave_{max(_nbests)}best.{suffix}pth"
|
||||
sym_op = output_dir / f"{ph}.{cr}.ave.{suffix}pth"
|
||||
op = output_dir / f"{ph}.{cr}.ave_{max(_nbests)}best.{suffix}pb"
|
||||
sym_op = output_dir / f"{ph}.{cr}.ave.{suffix}pb"
|
||||
if sym_op.is_symlink() or sym_op.exists():
|
||||
sym_op.unlink()
|
||||
sym_op.symlink_to(op.name)
|
||||
|
||||
@ -191,12 +191,12 @@ def unpack(
|
||||
|
||||
Examples:
|
||||
tarfile:
|
||||
model.pth
|
||||
model.pb
|
||||
some1.file
|
||||
some2.file
|
||||
|
||||
>>> unpack("tarfile", "out")
|
||||
{'asr_model_file': 'out/model.pth'}
|
||||
{'asr_model_file': 'out/model.pb'}
|
||||
"""
|
||||
input_archive = Path(input_archive)
|
||||
outpath = Path(outpath)
|
||||
|
||||
@ -90,6 +90,47 @@ class DecoderLayerSANM(nn.Module):
|
||||
tgt = self.norm1(tgt)
|
||||
tgt = self.feed_forward(tgt)
|
||||
|
||||
x = tgt
|
||||
if self.self_attn:
|
||||
if self.normalize_before:
|
||||
tgt = self.norm2(tgt)
|
||||
x, _ = self.self_attn(tgt, tgt_mask)
|
||||
x = residual + self.dropout(x)
|
||||
|
||||
if self.src_attn is not None:
|
||||
residual = x
|
||||
if self.normalize_before:
|
||||
x = self.norm3(x)
|
||||
|
||||
x = residual + self.dropout(self.src_attn(x, memory, memory_mask))
|
||||
|
||||
|
||||
return x, tgt_mask, memory, memory_mask, cache
|
||||
|
||||
def forward_chunk(self, tgt, tgt_mask, memory, memory_mask=None, cache=None):
|
||||
"""Compute decoded features.
|
||||
|
||||
Args:
|
||||
tgt (torch.Tensor): Input tensor (#batch, maxlen_out, size).
|
||||
tgt_mask (torch.Tensor): Mask for input tensor (#batch, maxlen_out).
|
||||
memory (torch.Tensor): Encoded memory, float32 (#batch, maxlen_in, size).
|
||||
memory_mask (torch.Tensor): Encoded memory mask (#batch, maxlen_in).
|
||||
cache (List[torch.Tensor]): List of cached tensors.
|
||||
Each tensor shape should be (#batch, maxlen_out - 1, size).
|
||||
|
||||
Returns:
|
||||
torch.Tensor: Output tensor(#batch, maxlen_out, size).
|
||||
torch.Tensor: Mask for output tensor (#batch, maxlen_out).
|
||||
torch.Tensor: Encoded memory (#batch, maxlen_in, size).
|
||||
torch.Tensor: Encoded memory mask (#batch, maxlen_in).
|
||||
|
||||
"""
|
||||
# tgt = self.dropout(tgt)
|
||||
residual = tgt
|
||||
if self.normalize_before:
|
||||
tgt = self.norm1(tgt)
|
||||
tgt = self.feed_forward(tgt)
|
||||
|
||||
x = tgt
|
||||
if self.self_attn:
|
||||
if self.normalize_before:
|
||||
@ -109,7 +150,6 @@ class DecoderLayerSANM(nn.Module):
|
||||
|
||||
return x, tgt_mask, memory, memory_mask, cache
|
||||
|
||||
|
||||
class FsmnDecoderSCAMAOpt(BaseTransformerDecoder):
|
||||
"""
|
||||
author: Speech Lab, Alibaba Group, China
|
||||
@ -947,6 +987,65 @@ class ParaformerSANMDecoder(BaseTransformerDecoder):
|
||||
)
|
||||
return logp.squeeze(0), state
|
||||
|
||||
def forward_chunk(
|
||||
self,
|
||||
memory: torch.Tensor,
|
||||
tgt: torch.Tensor,
|
||||
cache: dict = None,
|
||||
) -> Tuple[torch.Tensor, torch.Tensor]:
|
||||
"""Forward decoder.
|
||||
|
||||
Args:
|
||||
hs_pad: encoded memory, float32 (batch, maxlen_in, feat)
|
||||
hlens: (batch)
|
||||
ys_in_pad:
|
||||
input token ids, int64 (batch, maxlen_out)
|
||||
if input_layer == "embed"
|
||||
input tensor (batch, maxlen_out, #mels) in the other cases
|
||||
ys_in_lens: (batch)
|
||||
Returns:
|
||||
(tuple): tuple containing:
|
||||
|
||||
x: decoded token score before softmax (batch, maxlen_out, token)
|
||||
if use_output_layer is True,
|
||||
olens: (batch, )
|
||||
"""
|
||||
x = tgt
|
||||
if cache["decode_fsmn"] is None:
|
||||
cache_layer_num = len(self.decoders)
|
||||
if self.decoders2 is not None:
|
||||
cache_layer_num += len(self.decoders2)
|
||||
new_cache = [None] * cache_layer_num
|
||||
else:
|
||||
new_cache = cache["decode_fsmn"]
|
||||
for i in range(self.att_layer_num):
|
||||
decoder = self.decoders[i]
|
||||
x, tgt_mask, memory, memory_mask, c_ret = decoder.forward_chunk(
|
||||
x, None, memory, None, cache=new_cache[i]
|
||||
)
|
||||
new_cache[i] = c_ret
|
||||
|
||||
if self.num_blocks - self.att_layer_num > 1:
|
||||
for i in range(self.num_blocks - self.att_layer_num):
|
||||
j = i + self.att_layer_num
|
||||
decoder = self.decoders2[i]
|
||||
x, tgt_mask, memory, memory_mask, c_ret = decoder.forward_chunk(
|
||||
x, None, memory, None, cache=new_cache[j]
|
||||
)
|
||||
new_cache[j] = c_ret
|
||||
|
||||
for decoder in self.decoders3:
|
||||
|
||||
x, tgt_mask, memory, memory_mask, _ = decoder.forward_chunk(
|
||||
x, None, memory, None, cache=None
|
||||
)
|
||||
if self.normalize_before:
|
||||
x = self.after_norm(x)
|
||||
if self.output_layer is not None:
|
||||
x = self.output_layer(x)
|
||||
cache["decode_fsmn"] = new_cache
|
||||
return x
|
||||
|
||||
def forward_one_step(
|
||||
self,
|
||||
tgt: torch.Tensor,
|
||||
|
||||
@ -325,6 +325,65 @@ class Paraformer(AbsESPnetModel):
|
||||
|
||||
return encoder_out, encoder_out_lens
|
||||
|
||||
def encode_chunk(
|
||||
self, speech: torch.Tensor, speech_lengths: torch.Tensor, cache: dict = None
|
||||
) -> Tuple[torch.Tensor, torch.Tensor]:
|
||||
"""Frontend + Encoder. Note that this method is used by asr_inference.py
|
||||
|
||||
Args:
|
||||
speech: (Batch, Length, ...)
|
||||
speech_lengths: (Batch, )
|
||||
"""
|
||||
with autocast(False):
|
||||
# 1. Extract feats
|
||||
feats, feats_lengths = self._extract_feats(speech, speech_lengths)
|
||||
|
||||
# 2. Data augmentation
|
||||
if self.specaug is not None and self.training:
|
||||
feats, feats_lengths = self.specaug(feats, feats_lengths)
|
||||
|
||||
# 3. Normalization for feature: e.g. Global-CMVN, Utterance-CMVN
|
||||
if self.normalize is not None:
|
||||
feats, feats_lengths = self.normalize(feats, feats_lengths)
|
||||
|
||||
# Pre-encoder, e.g. used for raw input data
|
||||
if self.preencoder is not None:
|
||||
feats, feats_lengths = self.preencoder(feats, feats_lengths)
|
||||
|
||||
# 4. Forward encoder
|
||||
# feats: (Batch, Length, Dim)
|
||||
# -> encoder_out: (Batch, Length2, Dim2)
|
||||
if self.encoder.interctc_use_conditioning:
|
||||
encoder_out, encoder_out_lens, _ = self.encoder.forward_chunk(
|
||||
feats, feats_lengths, cache=cache["encoder"], ctc=self.ctc
|
||||
)
|
||||
else:
|
||||
encoder_out, encoder_out_lens, _ = self.encoder.forward_chunk(feats, feats_lengths, cache=cache["encoder"])
|
||||
intermediate_outs = None
|
||||
if isinstance(encoder_out, tuple):
|
||||
intermediate_outs = encoder_out[1]
|
||||
encoder_out = encoder_out[0]
|
||||
|
||||
# Post-encoder, e.g. NLU
|
||||
if self.postencoder is not None:
|
||||
encoder_out, encoder_out_lens = self.postencoder(
|
||||
encoder_out, encoder_out_lens
|
||||
)
|
||||
|
||||
assert encoder_out.size(0) == speech.size(0), (
|
||||
encoder_out.size(),
|
||||
speech.size(0),
|
||||
)
|
||||
assert encoder_out.size(1) <= encoder_out_lens.max(), (
|
||||
encoder_out.size(),
|
||||
encoder_out_lens.max(),
|
||||
)
|
||||
|
||||
if intermediate_outs is not None:
|
||||
return (encoder_out, intermediate_outs), encoder_out_lens
|
||||
|
||||
return encoder_out, encoder_out_lens
|
||||
|
||||
def calc_predictor(self, encoder_out, encoder_out_lens):
|
||||
|
||||
encoder_out_mask = (~make_pad_mask(encoder_out_lens, maxlen=encoder_out.size(1))[:, None, :]).to(
|
||||
@ -333,6 +392,11 @@ class Paraformer(AbsESPnetModel):
|
||||
ignore_id=self.ignore_id)
|
||||
return pre_acoustic_embeds, pre_token_length, alphas, pre_peak_index
|
||||
|
||||
def calc_predictor_chunk(self, encoder_out, cache=None):
|
||||
|
||||
pre_acoustic_embeds, pre_token_length, alphas, pre_peak_index = self.predictor.forward_chunk(encoder_out, cache["encoder"])
|
||||
return pre_acoustic_embeds, pre_token_length, alphas, pre_peak_index
|
||||
|
||||
def cal_decoder_with_predictor(self, encoder_out, encoder_out_lens, sematic_embeds, ys_pad_lens):
|
||||
|
||||
decoder_outs = self.decoder(
|
||||
@ -342,6 +406,14 @@ class Paraformer(AbsESPnetModel):
|
||||
decoder_out = torch.log_softmax(decoder_out, dim=-1)
|
||||
return decoder_out, ys_pad_lens
|
||||
|
||||
def cal_decoder_with_predictor_chunk(self, encoder_out, sematic_embeds, cache=None):
|
||||
decoder_outs = self.decoder.forward_chunk(
|
||||
encoder_out, sematic_embeds, cache["decoder"]
|
||||
)
|
||||
decoder_out = decoder_outs
|
||||
decoder_out = torch.log_softmax(decoder_out, dim=-1)
|
||||
return decoder_out
|
||||
|
||||
def _extract_feats(
|
||||
self, speech: torch.Tensor, speech_lengths: torch.Tensor
|
||||
) -> Tuple[torch.Tensor, torch.Tensor]:
|
||||
@ -1459,4 +1531,4 @@ class ContextualParaformer(Paraformer):
|
||||
"torch tensor: {}, {}, loading from tf tensor: {}, {}".format(name, data_tf.size(), name_tf,
|
||||
var_dict_tf[name_tf].shape))
|
||||
|
||||
return var_dict_torch_update
|
||||
return var_dict_torch_update
|
||||
|
||||
@ -52,15 +52,15 @@ class DiarEENDOLAModel(AbsESPnetModel):
|
||||
|
||||
super().__init__()
|
||||
self.frontend = frontend
|
||||
self.encoder = encoder
|
||||
self.encoder_decoder_attractor = encoder_decoder_attractor
|
||||
self.enc = encoder
|
||||
self.eda = encoder_decoder_attractor
|
||||
self.attractor_loss_weight = attractor_loss_weight
|
||||
self.max_n_speaker = max_n_speaker
|
||||
if mapping_dict is None:
|
||||
mapping_dict = generate_mapping_dict(max_speaker_num=self.max_n_speaker)
|
||||
self.mapping_dict = mapping_dict
|
||||
# PostNet
|
||||
self.PostNet = nn.LSTM(self.max_n_speaker, n_units, 1, batch_first=True)
|
||||
self.postnet = nn.LSTM(self.max_n_speaker, n_units, 1, batch_first=True)
|
||||
self.output_layer = nn.Linear(n_units, mapping_dict['oov'] + 1)
|
||||
|
||||
def forward_encoder(self, xs, ilens):
|
||||
@ -68,7 +68,7 @@ class DiarEENDOLAModel(AbsESPnetModel):
|
||||
pad_shape = xs.shape
|
||||
xs_mask = [torch.ones(ilen).to(xs.device) for ilen in ilens]
|
||||
xs_mask = torch.nn.utils.rnn.pad_sequence(xs_mask, batch_first=True, padding_value=0).unsqueeze(-2)
|
||||
emb = self.encoder(xs, xs_mask)
|
||||
emb = self.enc(xs, xs_mask)
|
||||
emb = torch.split(emb.view(pad_shape[0], pad_shape[1], -1), 1, dim=0)
|
||||
emb = [e[0][:ilen] for e, ilen in zip(emb, ilens)]
|
||||
return emb
|
||||
@ -76,8 +76,8 @@ class DiarEENDOLAModel(AbsESPnetModel):
|
||||
def forward_post_net(self, logits, ilens):
|
||||
maxlen = torch.max(ilens).to(torch.int).item()
|
||||
logits = nn.utils.rnn.pad_sequence(logits, batch_first=True, padding_value=-1)
|
||||
logits = nn.utils.rnn.pack_padded_sequence(logits, ilens, batch_first=True, enforce_sorted=False)
|
||||
outputs, (_, _) = self.PostNet(logits)
|
||||
logits = nn.utils.rnn.pack_padded_sequence(logits, ilens.cpu().to(torch.int64), batch_first=True, enforce_sorted=False)
|
||||
outputs, (_, _) = self.postnet(logits)
|
||||
outputs = nn.utils.rnn.pad_packed_sequence(outputs, batch_first=True, padding_value=-1, total_length=maxlen)[0]
|
||||
outputs = [output[:ilens[i].to(torch.int).item()] for i, output in enumerate(outputs)]
|
||||
outputs = [self.output_layer(output) for output in outputs]
|
||||
@ -112,7 +112,7 @@ class DiarEENDOLAModel(AbsESPnetModel):
|
||||
text = text[:, : text_lengths.max()]
|
||||
|
||||
# 1. Encoder
|
||||
encoder_out, encoder_out_lens = self.encode(speech, speech_lengths)
|
||||
encoder_out, encoder_out_lens = self.enc(speech, speech_lengths)
|
||||
intermediate_outs = None
|
||||
if isinstance(encoder_out, tuple):
|
||||
intermediate_outs = encoder_out[1]
|
||||
@ -190,18 +190,16 @@ class DiarEENDOLAModel(AbsESPnetModel):
|
||||
shuffle: bool = True,
|
||||
threshold: float = 0.5,
|
||||
**kwargs):
|
||||
if self.frontend is not None:
|
||||
speech = self.frontend(speech)
|
||||
speech = [s[:s_len] for s, s_len in zip(speech, speech_lengths)]
|
||||
emb = self.forward_encoder(speech, speech_lengths)
|
||||
if shuffle:
|
||||
orders = [np.arange(e.shape[0]) for e in emb]
|
||||
for order in orders:
|
||||
np.random.shuffle(order)
|
||||
attractors, probs = self.encoder_decoder_attractor.estimate(
|
||||
attractors, probs = self.eda.estimate(
|
||||
[e[torch.from_numpy(order).to(torch.long).to(speech[0].device)] for e, order in zip(emb, orders)])
|
||||
else:
|
||||
attractors, probs = self.encoder_decoder_attractor.estimate(emb)
|
||||
attractors, probs = self.eda.estimate(emb)
|
||||
attractors_active = []
|
||||
for p, att, e in zip(probs, attractors, emb):
|
||||
if n_speakers and n_speakers >= 0:
|
||||
@ -233,10 +231,23 @@ class DiarEENDOLAModel(AbsESPnetModel):
|
||||
pred[i] = pred[i - 1]
|
||||
else:
|
||||
pred[i] = 0
|
||||
pred = [self.reporter.inv_mapping_func(i, self.mapping_dict) for i in pred]
|
||||
pred = [self.inv_mapping_func(i) for i in pred]
|
||||
decisions = [bin(num)[2:].zfill(self.max_n_speaker)[::-1] for num in pred]
|
||||
decisions = torch.from_numpy(
|
||||
np.stack([np.array([int(i) for i in dec]) for dec in decisions], axis=0)).to(logit.device).to(
|
||||
torch.float32)
|
||||
decisions = decisions[:, :n_speaker]
|
||||
return decisions
|
||||
|
||||
def inv_mapping_func(self, label):
|
||||
|
||||
if not isinstance(label, int):
|
||||
label = int(label)
|
||||
if label in self.mapping_dict['label2dec'].keys():
|
||||
num = self.mapping_dict['label2dec'][label]
|
||||
else:
|
||||
num = -1
|
||||
return num
|
||||
|
||||
def collect_feats(self, **batch: torch.Tensor) -> Dict[str, torch.Tensor]:
|
||||
pass
|
||||
@ -59,7 +59,8 @@ class DiarSondModel(AbsESPnetModel):
|
||||
normalize_speech_speaker: bool = False,
|
||||
ignore_id: int = -1,
|
||||
speaker_discrimination_loss_weight: float = 1.0,
|
||||
inter_score_loss_weight: float = 0.0
|
||||
inter_score_loss_weight: float = 0.0,
|
||||
inputs_type: str = "raw",
|
||||
):
|
||||
assert check_argument_types()
|
||||
|
||||
@ -86,14 +87,12 @@ class DiarSondModel(AbsESPnetModel):
|
||||
)
|
||||
self.criterion_bce = SequenceBinaryCrossEntropy(normalize_length=length_normalized_loss)
|
||||
self.pse_embedding = self.generate_pse_embedding()
|
||||
# self.register_buffer("pse_embedding", pse_embedding)
|
||||
self.power_weight = torch.from_numpy(2 ** np.arange(max_spk_num)[np.newaxis, np.newaxis, :]).float()
|
||||
# self.register_buffer("power_weight", power_weight)
|
||||
self.int_token_arr = torch.from_numpy(np.array(self.token_list).astype(int)[np.newaxis, np.newaxis, :]).int()
|
||||
# self.register_buffer("int_token_arr", int_token_arr)
|
||||
self.speaker_discrimination_loss_weight = speaker_discrimination_loss_weight
|
||||
self.inter_score_loss_weight = inter_score_loss_weight
|
||||
self.forward_steps = 0
|
||||
self.inputs_type = inputs_type
|
||||
|
||||
def generate_pse_embedding(self):
|
||||
embedding = np.zeros((len(self.token_list), self.max_spk_num), dtype=np.float)
|
||||
@ -125,9 +124,14 @@ class DiarSondModel(AbsESPnetModel):
|
||||
binary_labels: (Batch, frames, max_spk_num)
|
||||
binary_labels_lengths: (Batch,)
|
||||
"""
|
||||
assert speech.shape[0] == binary_labels.shape[0], (speech.shape, binary_labels.shape)
|
||||
assert speech.shape[0] <= binary_labels.shape[0], (speech.shape, binary_labels.shape)
|
||||
batch_size = speech.shape[0]
|
||||
self.forward_steps = self.forward_steps + 1
|
||||
if self.pse_embedding.device != speech.device:
|
||||
self.pse_embedding = self.pse_embedding.to(speech.device)
|
||||
self.power_weight = self.power_weight.to(speech.device)
|
||||
self.int_token_arr = self.int_token_arr.to(speech.device)
|
||||
|
||||
# 1. Network forward
|
||||
pred, inter_outputs = self.prediction_forward(
|
||||
speech, speech_lengths,
|
||||
@ -149,9 +153,13 @@ class DiarSondModel(AbsESPnetModel):
|
||||
# the sequence length of 'pred' might be slightly less than the
|
||||
# length of 'spk_labels'. Here we force them to be equal.
|
||||
length_diff_tolerance = 2
|
||||
length_diff = pse_labels.shape[1] - pred.shape[1]
|
||||
if 0 < length_diff <= length_diff_tolerance:
|
||||
pse_labels = pse_labels[:, 0: pred.shape[1]]
|
||||
length_diff = abs(pse_labels.shape[1] - pred.shape[1])
|
||||
if length_diff <= length_diff_tolerance:
|
||||
min_len = min(pred.shape[1], pse_labels.shape[1])
|
||||
pse_labels = pse_labels[:, :min_len]
|
||||
pred = pred[:, :min_len]
|
||||
cd_score = cd_score[:, :min_len]
|
||||
ci_score = ci_score[:, :min_len]
|
||||
|
||||
loss_diar = self.classification_loss(pred, pse_labels, binary_labels_lengths)
|
||||
loss_spk_dis = self.speaker_discrimination_loss(profile, profile_lengths)
|
||||
@ -299,7 +307,7 @@ class DiarSondModel(AbsESPnetModel):
|
||||
speech: torch.Tensor,
|
||||
speech_lengths: torch.Tensor,
|
||||
) -> Tuple[torch.Tensor, torch.Tensor]:
|
||||
if self.encoder is not None:
|
||||
if self.encoder is not None and self.inputs_type == "raw":
|
||||
speech, speech_lengths = self.encode(speech, speech_lengths)
|
||||
speech_mask = ~make_pad_mask(speech_lengths, maxlen=speech.shape[1])
|
||||
speech_mask = speech_mask.to(speech.device).unsqueeze(-1).float()
|
||||
|
||||
@ -347,6 +347,48 @@ class SANMEncoder(AbsEncoder):
|
||||
return (xs_pad, intermediate_outs), olens, None
|
||||
return xs_pad, olens, None
|
||||
|
||||
def forward_chunk(self,
|
||||
xs_pad: torch.Tensor,
|
||||
ilens: torch.Tensor,
|
||||
cache: dict = None,
|
||||
ctc: CTC = None,
|
||||
):
|
||||
xs_pad *= self.output_size() ** 0.5
|
||||
if self.embed is None:
|
||||
xs_pad = xs_pad
|
||||
else:
|
||||
xs_pad = self.embed.forward_chunk(xs_pad, cache)
|
||||
|
||||
encoder_outs = self.encoders0(xs_pad, None, None, None, None)
|
||||
xs_pad, masks = encoder_outs[0], encoder_outs[1]
|
||||
intermediate_outs = []
|
||||
if len(self.interctc_layer_idx) == 0:
|
||||
encoder_outs = self.encoders(xs_pad, None, None, None, None)
|
||||
xs_pad, masks = encoder_outs[0], encoder_outs[1]
|
||||
else:
|
||||
for layer_idx, encoder_layer in enumerate(self.encoders):
|
||||
encoder_outs = encoder_layer(xs_pad, None, None, None, None)
|
||||
xs_pad, masks = encoder_outs[0], encoder_outs[1]
|
||||
if layer_idx + 1 in self.interctc_layer_idx:
|
||||
encoder_out = xs_pad
|
||||
|
||||
# intermediate outputs are also normalized
|
||||
if self.normalize_before:
|
||||
encoder_out = self.after_norm(encoder_out)
|
||||
|
||||
intermediate_outs.append((layer_idx + 1, encoder_out))
|
||||
|
||||
if self.interctc_use_conditioning:
|
||||
ctc_out = ctc.softmax(encoder_out)
|
||||
xs_pad = xs_pad + self.conditioning_layer(ctc_out)
|
||||
|
||||
if self.normalize_before:
|
||||
xs_pad = self.after_norm(xs_pad)
|
||||
|
||||
if len(intermediate_outs) > 0:
|
||||
return (xs_pad, intermediate_outs), None, None
|
||||
return xs_pad, ilens, None
|
||||
|
||||
def gen_tf2torch_map_dict(self):
|
||||
tensor_name_prefix_torch = self.tf2torch_tensor_name_prefix_torch
|
||||
tensor_name_prefix_tf = self.tf2torch_tensor_name_prefix_tf
|
||||
|
||||
@ -1,14 +1,15 @@
|
||||
# Copyright (c) Alibaba, Inc. and its affiliates.
|
||||
# Part of the implementation is borrowed from espnet/espnet.
|
||||
from abc import ABC
|
||||
from typing import Tuple
|
||||
|
||||
import numpy as np
|
||||
import torch
|
||||
import torchaudio.compliance.kaldi as kaldi
|
||||
from funasr.models.frontend.abs_frontend import AbsFrontend
|
||||
from typeguard import check_argument_types
|
||||
from torch.nn.utils.rnn import pad_sequence
|
||||
from typeguard import check_argument_types
|
||||
|
||||
import funasr.models.frontend.eend_ola_feature as eend_ola_feature
|
||||
from funasr.models.frontend.abs_frontend import AbsFrontend
|
||||
|
||||
|
||||
def load_cmvn(cmvn_file):
|
||||
@ -275,7 +276,8 @@ class WavFrontendOnline(AbsFrontend):
|
||||
# inputs tensor has catted the cache tensor
|
||||
# def apply_lfr(inputs: torch.Tensor, lfr_m: int, lfr_n: int, inputs_lfr_cache: torch.Tensor = None,
|
||||
# is_final: bool = False) -> Tuple[torch.Tensor, torch.Tensor, int]:
|
||||
def apply_lfr(inputs: torch.Tensor, lfr_m: int, lfr_n: int, is_final: bool = False) -> Tuple[torch.Tensor, torch.Tensor, int]:
|
||||
def apply_lfr(inputs: torch.Tensor, lfr_m: int, lfr_n: int, is_final: bool = False) -> Tuple[
|
||||
torch.Tensor, torch.Tensor, int]:
|
||||
"""
|
||||
Apply lfr with data
|
||||
"""
|
||||
@ -376,7 +378,8 @@ class WavFrontendOnline(AbsFrontend):
|
||||
if self.lfr_m != 1 or self.lfr_n != 1:
|
||||
# update self.lfr_splice_cache in self.apply_lfr
|
||||
# mat, self.lfr_splice_cache[i], lfr_splice_frame_idx = self.apply_lfr(mat, self.lfr_m, self.lfr_n, self.lfr_splice_cache[i],
|
||||
mat, self.lfr_splice_cache[i], lfr_splice_frame_idx = self.apply_lfr(mat, self.lfr_m, self.lfr_n, is_final)
|
||||
mat, self.lfr_splice_cache[i], lfr_splice_frame_idx = self.apply_lfr(mat, self.lfr_m, self.lfr_n,
|
||||
is_final)
|
||||
if self.cmvn_file is not None:
|
||||
mat = self.apply_cmvn(mat, self.cmvn)
|
||||
feat_length = mat.size(0)
|
||||
@ -398,9 +401,10 @@ class WavFrontendOnline(AbsFrontend):
|
||||
assert batch_size == 1, 'we support to extract feature online only when the batch size is equal to 1 now'
|
||||
waveforms, feats, feats_lengths = self.forward_fbank(input, input_lengths) # input shape: B T D
|
||||
if feats.shape[0]:
|
||||
#if self.reserve_waveforms is None and self.lfr_m > 1:
|
||||
# if self.reserve_waveforms is None and self.lfr_m > 1:
|
||||
# self.reserve_waveforms = waveforms[:, :(self.lfr_m - 1) // 2 * self.frame_shift_sample_length]
|
||||
self.waveforms = waveforms if self.reserve_waveforms is None else torch.cat((self.reserve_waveforms, waveforms), dim=1)
|
||||
self.waveforms = waveforms if self.reserve_waveforms is None else torch.cat(
|
||||
(self.reserve_waveforms, waveforms), dim=1)
|
||||
if not self.lfr_splice_cache:  # initialize the splice cache
|
||||
for i in range(batch_size):
|
||||
self.lfr_splice_cache.append(feats[i][0, :].unsqueeze(dim=0).repeat((self.lfr_m - 1) // 2, 1))
|
||||
@ -409,7 +413,8 @@ class WavFrontendOnline(AbsFrontend):
|
||||
lfr_splice_cache_tensor = torch.stack(self.lfr_splice_cache) # B T D
|
||||
feats = torch.cat((lfr_splice_cache_tensor, feats), dim=1)
|
||||
feats_lengths += lfr_splice_cache_tensor[0].shape[0]
|
||||
frame_from_waveforms = int((self.waveforms.shape[1] - self.frame_sample_length) / self.frame_shift_sample_length + 1)
|
||||
frame_from_waveforms = int(
|
||||
(self.waveforms.shape[1] - self.frame_sample_length) / self.frame_shift_sample_length + 1)
|
||||
minus_frame = (self.lfr_m - 1) // 2 if self.reserve_waveforms is None else 0
|
||||
feats, feats_lengths, lfr_splice_frame_idxs = self.forward_lfr_cmvn(feats, feats_lengths, is_final)
|
||||
if self.lfr_m == 1:
|
||||
@ -423,14 +428,15 @@ class WavFrontendOnline(AbsFrontend):
|
||||
self.waveforms = self.waveforms[:, :sample_length]
|
||||
else:
|
||||
# update self.reserve_waveforms and self.lfr_splice_cache
|
||||
self.reserve_waveforms = self.waveforms[:, :-(self.frame_sample_length - self.frame_shift_sample_length)]
|
||||
self.reserve_waveforms = self.waveforms[:,
|
||||
:-(self.frame_sample_length - self.frame_shift_sample_length)]
|
||||
for i in range(batch_size):
|
||||
self.lfr_splice_cache[i] = torch.cat((self.lfr_splice_cache[i], feats[i]), dim=0)
|
||||
return torch.empty(0), feats_lengths
|
||||
else:
|
||||
if is_final:
|
||||
self.waveforms = waveforms if self.reserve_waveforms is None else self.reserve_waveforms
|
||||
feats = torch.stack(self.lfr_splice_cache)
|
||||
feats = torch.stack(self.lfr_splice_cache)
|
||||
feats_lengths = torch.zeros(batch_size, dtype=torch.int) + feats.shape[1]
|
||||
feats, feats_lengths, _ = self.forward_lfr_cmvn(feats, feats_lengths, is_final)
|
||||
if is_final:
|
||||
@ -444,3 +450,54 @@ class WavFrontendOnline(AbsFrontend):
|
||||
self.reserve_waveforms = None
|
||||
self.input_cache = None
|
||||
self.lfr_splice_cache = []
|
||||
|
||||
|
||||
class WavFrontendMel23(AbsFrontend):
|
||||
"""Conventional frontend structure for ASR.
|
||||
"""
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
fs: int = 16000,
|
||||
frame_length: int = 25,
|
||||
frame_shift: int = 10,
|
||||
lfr_m: int = 1,
|
||||
lfr_n: int = 1,
|
||||
):
|
||||
assert check_argument_types()
|
||||
super().__init__()
|
||||
self.fs = fs
|
||||
self.frame_length = frame_length
|
||||
self.frame_shift = frame_shift
|
||||
self.lfr_m = lfr_m
|
||||
self.lfr_n = lfr_n
|
||||
self.n_mels = 23
|
||||
|
||||
def output_size(self) -> int:
|
||||
return self.n_mels * (2 * self.lfr_m + 1)
|
||||
|
||||
def forward(
|
||||
self,
|
||||
input: torch.Tensor,
|
||||
input_lengths: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
|
||||
batch_size = input.size(0)
|
||||
feats = []
|
||||
feats_lens = []
|
||||
for i in range(batch_size):
|
||||
waveform_length = input_lengths[i]
|
||||
waveform = input[i][:waveform_length]
|
||||
waveform = waveform.numpy()
|
||||
mat = eend_ola_feature.stft(waveform, self.frame_length, self.frame_shift)
|
||||
mat = eend_ola_feature.transform(mat)
|
||||
mat = eend_ola_feature.splice(mat, context_size=self.lfr_m)
|
||||
mat = mat[::self.lfr_n]
|
||||
mat = torch.from_numpy(mat)
|
||||
feat_length = mat.size(0)
|
||||
feats.append(mat)
|
||||
feats_lens.append(feat_length)
|
||||
|
||||
feats_lens = torch.as_tensor(feats_lens)
|
||||
feats_pad = pad_sequence(feats,
|
||||
batch_first=True,
|
||||
padding_value=0.0)
|
||||
return feats_pad, feats_lens
|
||||
|
||||
@ -199,6 +199,63 @@ class CifPredictorV2(nn.Module):
|
||||
|
||||
return acoustic_embeds, token_num, alphas, cif_peak
|
||||
|
||||
def forward_chunk(self, hidden, cache=None):
|
||||
h = hidden
|
||||
context = h.transpose(1, 2)
|
||||
queries = self.pad(context)
|
||||
output = torch.relu(self.cif_conv1d(queries))
|
||||
output = output.transpose(1, 2)
|
||||
output = self.cif_output(output)
|
||||
alphas = torch.sigmoid(output)
|
||||
alphas = torch.nn.functional.relu(alphas * self.smooth_factor - self.noise_threshold)
|
||||
|
||||
alphas = alphas.squeeze(-1)
|
||||
mask_chunk_predictor = None
|
||||
if cache is not None:
|
||||
mask_chunk_predictor = None
|
||||
mask_chunk_predictor = torch.zeros_like(alphas)
|
||||
mask_chunk_predictor[:, cache["pad_left"]:cache["stride"] + cache["pad_left"]] = 1.0
|
||||
|
||||
if mask_chunk_predictor is not None:
|
||||
alphas = alphas * mask_chunk_predictor
|
||||
|
||||
if cache is not None:
|
||||
if cache["cif_hidden"] is not None:
|
||||
hidden = torch.cat((cache["cif_hidden"], hidden), 1)
|
||||
if cache["cif_alphas"] is not None:
|
||||
alphas = torch.cat((cache["cif_alphas"], alphas), -1)
|
||||
|
||||
token_num = alphas.sum(-1)
|
||||
acoustic_embeds, cif_peak = cif(hidden, alphas, self.threshold)
|
||||
len_time = alphas.size(-1)
|
||||
last_fire_place = len_time - 1
|
||||
last_fire_remainds = 0.0
|
||||
pre_alphas_length = 0
|
||||
|
||||
mask_chunk_peak_predictor = None
|
||||
if cache is not None:
|
||||
mask_chunk_peak_predictor = None
|
||||
mask_chunk_peak_predictor = torch.zeros_like(cif_peak)
|
||||
if cache["cif_alphas"] is not None:
|
||||
pre_alphas_length = cache["cif_alphas"].size(-1)
|
||||
mask_chunk_peak_predictor[:, :pre_alphas_length] = 1.0
|
||||
mask_chunk_peak_predictor[:, pre_alphas_length + cache["pad_left"]:pre_alphas_length + cache["stride"] + cache["pad_left"]] = 1.0
|
||||
|
||||
|
||||
if mask_chunk_peak_predictor is not None:
|
||||
cif_peak = cif_peak * mask_chunk_peak_predictor.squeeze(-1)
|
||||
|
||||
for i in range(len_time):
|
||||
if cif_peak[0][len_time - 1 - i] > self.threshold or cif_peak[0][len_time - 1 - i] == self.threshold:
|
||||
last_fire_place = len_time - 1 - i
|
||||
last_fire_remainds = cif_peak[0][len_time - 1 - i] - self.threshold
|
||||
break
|
||||
last_fire_remainds = torch.tensor([last_fire_remainds], dtype=alphas.dtype).to(alphas.device)
|
||||
cache["cif_hidden"] = hidden[:, last_fire_place:, :]
|
||||
cache["cif_alphas"] = torch.cat((last_fire_remainds.unsqueeze(0), alphas[:, last_fire_place+1:]), -1)
|
||||
token_num_int = token_num.floor().type(torch.int32).item()
|
||||
return acoustic_embeds[:, 0:token_num_int, :], token_num, alphas, cif_peak
|
||||
|
||||
def tail_process_fn(self, hidden, alphas, token_num=None, mask=None):
|
||||
b, t, d = hidden.size()
|
||||
tail_threshold = self.tail_threshold
|
||||
|
||||
@ -347,15 +347,17 @@ class MultiHeadedAttentionSANM(nn.Module):
|
||||
mask = torch.reshape(mask, (b, -1, 1))
|
||||
if mask_shfit_chunk is not None:
|
||||
mask = mask * mask_shfit_chunk
|
||||
inputs = inputs * mask
|
||||
|
||||
inputs = inputs * mask
|
||||
x = inputs.transpose(1, 2)
|
||||
x = self.pad_fn(x)
|
||||
x = self.fsmn_block(x)
|
||||
x = x.transpose(1, 2)
|
||||
x += inputs
|
||||
x = self.dropout(x)
|
||||
return x * mask
|
||||
if mask is not None:
|
||||
x = x * mask
|
||||
return x
|
||||
|
||||
def forward_qkv(self, x):
|
||||
"""Transform query, key and value.
|
||||
@ -505,7 +507,7 @@ class MultiHeadedAttentionSANMDecoder(nn.Module):
|
||||
# print("in fsmn, cache is None, x", x.size())
|
||||
|
||||
x = self.pad_fn(x)
|
||||
if not self.training and t <= 1:
|
||||
if not self.training:
|
||||
cache = x
|
||||
else:
|
||||
# print("in fsmn, cache is not None, x", x.size())
|
||||
@ -513,7 +515,7 @@ class MultiHeadedAttentionSANMDecoder(nn.Module):
|
||||
# if t < self.kernel_size:
|
||||
# x = self.pad_fn(x)
|
||||
x = torch.cat((cache[:, :, 1:], x), dim=2)
|
||||
x = x[:, :, -self.kernel_size:]
|
||||
x = x[:, :, -(self.kernel_size+t-1):]
|
||||
# print("in fsmn, cache is not None, x_cat", x.size())
|
||||
cache = x
|
||||
x = self.fsmn_block(x)
|
||||
|
||||
@ -87,7 +87,7 @@ class EENDOLATransformerEncoder(nn.Module):
|
||||
n_layers: int,
|
||||
n_units: int,
|
||||
e_units: int = 2048,
|
||||
h: int = 8,
|
||||
h: int = 4,
|
||||
dropout_rate: float = 0.1,
|
||||
use_pos_emb: bool = False):
|
||||
super(EENDOLATransformerEncoder, self).__init__()
|
||||
|
||||
@ -16,12 +16,12 @@ class EncoderDecoderAttractor(nn.Module):
|
||||
self.n_units = n_units
|
||||
|
||||
def forward_core(self, xs, zeros):
|
||||
ilens = torch.from_numpy(np.array([x.shape[0] for x in xs])).to(torch.float32).to(xs[0].device)
|
||||
ilens = torch.from_numpy(np.array([x.shape[0] for x in xs])).to(torch.int64)
|
||||
xs = [self.enc0_dropout(x) for x in xs]
|
||||
xs = nn.utils.rnn.pad_sequence(xs, batch_first=True, padding_value=-1)
|
||||
xs = nn.utils.rnn.pack_padded_sequence(xs, ilens, batch_first=True, enforce_sorted=False)
|
||||
_, (hx, cx) = self.encoder(xs)
|
||||
zlens = torch.from_numpy(np.array([z.shape[0] for z in zeros])).to(torch.float32).to(zeros[0].device)
|
||||
zlens = torch.from_numpy(np.array([z.shape[0] for z in zeros])).to(torch.int64)
|
||||
max_zlen = torch.max(zlens).to(torch.int).item()
|
||||
zeros = [self.enc0_dropout(z) for z in zeros]
|
||||
zeros = nn.utils.rnn.pad_sequence(zeros, batch_first=True, padding_value=-1)
|
||||
@ -47,4 +47,4 @@ class EncoderDecoderAttractor(nn.Module):
|
||||
zeros = [torch.zeros(max_n_speakers, self.n_units).to(torch.float32).to(xs[0].device) for _ in xs]
|
||||
attractors = self.forward_core(xs, zeros)
|
||||
probs = [torch.sigmoid(torch.flatten(self.counter(att))) for att in attractors]
|
||||
return attractors, probs
|
||||
return attractors, probs
|
||||
|
||||
@ -405,4 +405,13 @@ class SinusoidalPositionEncoder(torch.nn.Module):
|
||||
positions = torch.arange(1, timesteps+1)[None, :]
|
||||
position_encoding = self.encode(positions, input_dim, x.dtype).to(x.device)
|
||||
|
||||
return x + position_encoding
|
||||
return x + position_encoding
|
||||
|
||||
def forward_chunk(self, x, cache=None):
|
||||
start_idx = 0
|
||||
batch_size, timesteps, input_dim = x.size()
|
||||
if cache is not None:
|
||||
start_idx = cache["start_idx"]
|
||||
positions = torch.arange(1, timesteps+start_idx+1)[None, :]
|
||||
position_encoding = self.encode(positions, input_dim, x.dtype).to(x.device)
|
||||
return x + position_encoding[:, start_idx: start_idx + timesteps]
|
||||
|
||||
funasr/runtime/grpc/CMakeLists.txt (new file, 83 lines)
@ -0,0 +1,83 @@
|
||||
# Copyright 2018 gRPC authors.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
#
|
||||
# cmake build file for C++ paraformer example.
|
||||
# Assumes protobuf and gRPC have been installed using cmake.
|
||||
# See cmake_externalproject/CMakeLists.txt for all-in-one cmake build
|
||||
# that automatically builds all the dependencies before building paraformer.
|
||||
|
||||
cmake_minimum_required(VERSION 3.10)
|
||||
|
||||
project(ASR C CXX)
|
||||
|
||||
include(common.cmake)
|
||||
|
||||
# Proto file
|
||||
get_filename_component(rg_proto "../python/grpc/proto/paraformer.proto" ABSOLUTE)
|
||||
get_filename_component(rg_proto_path "${rg_proto}" PATH)
|
||||
|
||||
# Generated sources
|
||||
set(rg_proto_srcs "${CMAKE_CURRENT_BINARY_DIR}/paraformer.pb.cc")
|
||||
set(rg_proto_hdrs "${CMAKE_CURRENT_BINARY_DIR}/paraformer.pb.h")
|
||||
set(rg_grpc_srcs "${CMAKE_CURRENT_BINARY_DIR}/paraformer.grpc.pb.cc")
|
||||
set(rg_grpc_hdrs "${CMAKE_CURRENT_BINARY_DIR}/paraformer.grpc.pb.h")
|
||||
add_custom_command(
|
||||
OUTPUT "${rg_proto_srcs}" "${rg_proto_hdrs}" "${rg_grpc_srcs}" "${rg_grpc_hdrs}"
|
||||
COMMAND ${_PROTOBUF_PROTOC}
|
||||
ARGS --grpc_out "${CMAKE_CURRENT_BINARY_DIR}"
|
||||
--cpp_out "${CMAKE_CURRENT_BINARY_DIR}"
|
||||
-I "${rg_proto_path}"
|
||||
--plugin=protoc-gen-grpc="${_GRPC_CPP_PLUGIN_EXECUTABLE}"
|
||||
"${rg_proto}"
|
||||
DEPENDS "${rg_proto}")
|
||||
|
||||
|
||||
# Include generated *.pb.h files
|
||||
include_directories("${CMAKE_CURRENT_BINARY_DIR}")
|
||||
|
||||
include_directories(../onnxruntime/include/)
|
||||
link_directories(../onnxruntime/build/src/)
|
||||
link_directories(../onnxruntime/build/third_party/webrtc/)
|
||||
|
||||
link_directories(${ONNXRUNTIME_DIR}/lib)
|
||||
add_subdirectory("../onnxruntime/src" onnx_src)
|
||||
|
||||
# rg_grpc_proto
|
||||
add_library(rg_grpc_proto
|
||||
${rg_grpc_srcs}
|
||||
${rg_grpc_hdrs}
|
||||
${rg_proto_srcs}
|
||||
${rg_proto_hdrs})
|
||||
|
||||
|
||||
|
||||
target_link_libraries(rg_grpc_proto
|
||||
${_REFLECTION}
|
||||
${_GRPC_GRPCPP}
|
||||
${_PROTOBUF_LIBPROTOBUF})
|
||||
|
||||
# Targets paraformer_(server)
|
||||
foreach(_target
|
||||
paraformer_server)
|
||||
add_executable(${_target}
|
||||
"${_target}.cc")
|
||||
target_link_libraries(${_target}
|
||||
rg_grpc_proto
|
||||
rapidasr
|
||||
webrtcvad
|
||||
${EXTRA_LIBS}
|
||||
${_REFLECTION}
|
||||
${_GRPC_GRPCPP}
|
||||
${_PROTOBUF_LIBPROTOBUF})
|
||||
endforeach()
|
||||
funasr/runtime/grpc/Readme.md (new file, 57 lines)
@ -0,0 +1,57 @@
|
||||
## paraformer grpc onnx server in c++


#### Step 1. Build ../onnxruntime following its documentation
|
||||
```
|
||||
# put the onnx lib, the onnx asr model and vocab.txt into /path/to/asrmodel (e.g. /data/asrmodel)
|
||||
ls /data/asrmodel/
|
||||
onnxruntime-linux-x64-1.14.0 speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch
|
||||
|
||||
file /data/asrmodel/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch/vocab.txt
|
||||
UTF-8 Unicode text
|
||||
```
|
||||
|
||||
#### Step 2. Compile and install grpc v1.52.0 to avoid known grpc bugs
|
||||
```
|
||||
export GRPC_INSTALL_DIR=/data/soft/grpc
|
||||
export PKG_CONFIG_PATH=$GRPC_INSTALL_DIR/lib/pkgconfig
|
||||
|
||||
git clone -b v1.52.0 --depth=1 https://github.com/grpc/grpc.git
|
||||
cd grpc
|
||||
git submodule update --init --recursive
|
||||
|
||||
mkdir -p cmake/build
|
||||
pushd cmake/build
|
||||
cmake -DgRPC_INSTALL=ON \
|
||||
-DgRPC_BUILD_TESTS=OFF \
|
||||
-DCMAKE_INSTALL_PREFIX=$GRPC_INSTALL_DIR \
|
||||
../..
|
||||
make
|
||||
make install
|
||||
popd
|
||||
|
||||
echo "export GRPC_INSTALL_DIR=/data/soft/grpc" >> ~/.bashrc
|
||||
echo "export PKG_CONFIG_PATH=\$GRPC_INSTALL_DIR/lib/pkgconfig" >> ~/.bashrc
|
||||
echo "export PATH=\$GRPC_INSTALL_DIR/bin/:\$PKG_CONFIG_PATH:\$PATH" >> ~/.bashrc
|
||||
source ~/.bashrc
|
||||
```
|
||||
|
||||
#### Step 3. Compile the grpc onnx paraformer server
|
||||
```
|
||||
# set -DONNXRUNTIME_DIR=/path/to/asrmodel/onnxruntime-linux-x64-1.14.0
|
||||
./rebuild.sh
|
||||
```
|
||||
|
||||
#### Step 4. Start grpc paraformer server
|
||||
```
|
||||
Usage: ./cmake/build/paraformer_server port thread_num /path/to/model_file
|
||||
./cmake/build/paraformer_server 10108 4 /data/asrmodel/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch
|
||||
```
|
||||
|
||||
|
||||
|
||||
#### Step 5. Start grpc python paraformer client on PC with MIC
|
||||
```
|
||||
cd ../python/grpc
|
||||
python grpc_main_client_mic.py --host $server_ip --port 10108
|
||||
```
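
Besides the microphone client, a minimal file-based streaming client can be sketched in Python. This assumes stubs generated from `../python/grpc/proto/paraformer.proto` (`paraformer_pb2`, `paraformer_pb2_grpc`) and uses the field names implied by the C++ server above; the exact field spelling (e.g. `isEnd`) and the PCM file are assumptions.

```python
# Sketch of a streaming client for paraformer_server (file input instead of a MIC).
import grpc
import paraformer_pb2
import paraformer_pb2_grpc


def request_stream(pcm_bytes, user="demo", language="zh-CN", chunk=9600):
    # stream audio while "speaking", then request decoding, then terminate
    for i in range(0, len(pcm_bytes), chunk):
        yield paraformer_pb2.Request(user=user, language=language, speaking=True,
                                     isEnd=False, audio_data=pcm_bytes[i:i + chunk])
    yield paraformer_pb2.Request(user=user, language=language, speaking=False, isEnd=False)
    yield paraformer_pb2.Request(user=user, language=language, speaking=False, isEnd=True)


with open("test.pcm", "rb") as f:  # hypothetical 16 kHz, 16-bit mono PCM file
    audio = f.read()

with grpc.insecure_channel("127.0.0.1:10108") as channel:
    stub = paraformer_pb2_grpc.ASRStub(channel)
    for response in stub.Recognize(request_stream(audio)):
        print(response.action, response.sentence)
```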
|
||||
funasr/runtime/grpc/common.cmake (new file, 125 lines)
@ -0,0 +1,125 @@
|
||||
# Copyright 2018 gRPC authors.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
#
|
||||
# cmake build file for C++ route_guide example.
|
||||
# Assumes protobuf and gRPC have been installed using cmake.
|
||||
# See cmake_externalproject/CMakeLists.txt for all-in-one cmake build
|
||||
# that automatically builds all the dependencies before building route_guide.
|
||||
|
||||
cmake_minimum_required(VERSION 3.5.1)
|
||||
|
||||
if (NOT DEFINED CMAKE_CXX_STANDARD)
|
||||
set (CMAKE_CXX_STANDARD 14)
|
||||
endif()
|
||||
|
||||
if(MSVC)
|
||||
add_definitions(-D_WIN32_WINNT=0x600)
|
||||
endif()
|
||||
|
||||
find_package(Threads REQUIRED)
|
||||
|
||||
if(GRPC_AS_SUBMODULE)
|
||||
# One way to build a projects that uses gRPC is to just include the
|
||||
# entire gRPC project tree via "add_subdirectory".
|
||||
# This approach is very simple to use, but the are some potential
|
||||
# disadvantages:
|
||||
# * it includes gRPC's CMakeLists.txt directly into your build script
|
||||
# without and that can make gRPC's internal setting interfere with your
|
||||
# own build.
|
||||
# * depending on what's installed on your system, the contents of submodules
|
||||
# in gRPC's third_party/* might need to be available (and there might be
|
||||
# additional prerequisites required to build them). Consider using
|
||||
# the gRPC_*_PROVIDER options to fine-tune the expected behavior.
|
||||
#
|
||||
# A more robust approach to add dependency on gRPC is using
|
||||
# cmake's ExternalProject_Add (see cmake_externalproject/CMakeLists.txt).
|
||||
|
||||
# Include the gRPC's cmake build (normally grpc source code would live
|
||||
# in a git submodule called "third_party/grpc", but this example lives in
|
||||
# the same repository as gRPC sources, so we just look a few directories up)
|
||||
add_subdirectory(../../.. ${CMAKE_CURRENT_BINARY_DIR}/grpc EXCLUDE_FROM_ALL)
|
||||
message(STATUS "Using gRPC via add_subdirectory.")
|
||||
|
||||
# After using add_subdirectory, we can now use the grpc targets directly from
|
||||
# this build.
|
||||
set(_PROTOBUF_LIBPROTOBUF libprotobuf)
|
||||
set(_REFLECTION grpc++_reflection)
|
||||
if(CMAKE_CROSSCOMPILING)
|
||||
find_program(_PROTOBUF_PROTOC protoc)
|
||||
else()
|
||||
set(_PROTOBUF_PROTOC $<TARGET_FILE:protobuf::protoc>)
|
||||
endif()
|
||||
set(_GRPC_GRPCPP grpc++)
|
||||
if(CMAKE_CROSSCOMPILING)
|
||||
find_program(_GRPC_CPP_PLUGIN_EXECUTABLE grpc_cpp_plugin)
|
||||
else()
|
||||
set(_GRPC_CPP_PLUGIN_EXECUTABLE $<TARGET_FILE:grpc_cpp_plugin>)
|
||||
endif()
|
||||
elseif(GRPC_FETCHCONTENT)
|
||||
# Another way is to use CMake's FetchContent module to clone gRPC at
|
||||
# configure time. This makes gRPC's source code available to your project,
|
||||
# similar to a git submodule.
|
||||
message(STATUS "Using gRPC via add_subdirectory (FetchContent).")
|
||||
include(FetchContent)
|
||||
FetchContent_Declare(
|
||||
grpc
|
||||
GIT_REPOSITORY https://github.com/grpc/grpc.git
|
||||
# when using gRPC, you will actually set this to an existing tag, such as
|
||||
# v1.25.0, v1.26.0 etc..
|
||||
# For the purpose of testing, we override the tag used to the commit
|
||||
# that's currently under test.
|
||||
GIT_TAG vGRPC_TAG_VERSION_OF_YOUR_CHOICE)
|
||||
FetchContent_MakeAvailable(grpc)
|
||||
|
||||
# Since FetchContent uses add_subdirectory under the hood, we can use
|
||||
# the grpc targets directly from this build.
|
||||
set(_PROTOBUF_LIBPROTOBUF libprotobuf)
|
||||
set(_REFLECTION grpc++_reflection)
|
||||
set(_PROTOBUF_PROTOC $<TARGET_FILE:protoc>)
|
||||
set(_GRPC_GRPCPP grpc++)
|
||||
if(CMAKE_CROSSCOMPILING)
|
||||
find_program(_GRPC_CPP_PLUGIN_EXECUTABLE grpc_cpp_plugin)
|
||||
else()
|
||||
set(_GRPC_CPP_PLUGIN_EXECUTABLE $<TARGET_FILE:grpc_cpp_plugin>)
|
||||
endif()
|
||||
else()
|
||||
# This branch assumes that gRPC and all its dependencies are already installed
|
||||
# on this system, so they can be located by find_package().
|
||||
|
||||
# Find Protobuf installation
|
||||
# Looks for protobuf-config.cmake file installed by Protobuf's cmake installation.
|
||||
set(protobuf_MODULE_COMPATIBLE TRUE)
|
||||
find_package(Protobuf CONFIG REQUIRED)
|
||||
message(STATUS "Using protobuf ${Protobuf_VERSION}")
|
||||
|
||||
set(_PROTOBUF_LIBPROTOBUF protobuf::libprotobuf)
|
||||
set(_REFLECTION gRPC::grpc++_reflection)
|
||||
if(CMAKE_CROSSCOMPILING)
|
||||
find_program(_PROTOBUF_PROTOC protoc)
|
||||
else()
|
||||
set(_PROTOBUF_PROTOC $<TARGET_FILE:protobuf::protoc>)
|
||||
endif()
|
||||
|
||||
# Find gRPC installation
|
||||
# Looks for gRPCConfig.cmake file installed by gRPC's cmake installation.
|
||||
find_package(gRPC CONFIG REQUIRED)
|
||||
message(STATUS "Using gRPC ${gRPC_VERSION}")
|
||||
|
||||
set(_GRPC_GRPCPP gRPC::grpc++)
|
||||
if(CMAKE_CROSSCOMPILING)
|
||||
find_program(_GRPC_CPP_PLUGIN_EXECUTABLE grpc_cpp_plugin)
|
||||
else()
|
||||
set(_GRPC_CPP_PLUGIN_EXECUTABLE $<TARGET_FILE:gRPC::grpc_cpp_plugin>)
|
||||
endif()
|
||||
endif()
|
||||
funasr/runtime/grpc/paraformer_server.cc (new file, 195 lines)
@ -0,0 +1,195 @@
#include <algorithm>
#include <chrono>
#include <cmath>
#include <iostream>
#include <sstream>
#include <memory>
#include <string>

#include <grpc/grpc.h>
#include <grpcpp/server.h>
#include <grpcpp/server_builder.h>
#include <grpcpp/server_context.h>
#include <grpcpp/security/server_credentials.h>

#include "paraformer.grpc.pb.h"
#include "paraformer_server.h"

using grpc::Server;
using grpc::ServerBuilder;
using grpc::ServerContext;
using grpc::ServerReader;
using grpc::ServerReaderWriter;
using grpc::ServerWriter;
using grpc::Status;

using paraformer::Request;
using paraformer::Response;
using paraformer::ASR;

ASRServicer::ASRServicer(const char* model_path, int thread_num) {
    AsrHanlde = RapidAsrInit(model_path, thread_num);
    std::cout << "ASRServicer init" << std::endl;
    init_flag = 0;
}

void ASRServicer::clear_states(const std::string& user) {
    clear_buffers(user);
    clear_transcriptions(user);
}

void ASRServicer::clear_buffers(const std::string& user) {
    if (client_buffers.count(user)) {
        client_buffers.erase(user);
    }
}

void ASRServicer::clear_transcriptions(const std::string& user) {
    if (client_transcription.count(user)) {
        client_transcription.erase(user);
    }
}

void ASRServicer::disconnect(const std::string& user) {
    clear_states(user);
    std::cout << "Disconnecting user: " << user << std::endl;
}

grpc::Status ASRServicer::Recognize(
        grpc::ServerContext* context,
        grpc::ServerReaderWriter<Response, Request>* stream) {
    Request req;
    while (stream->Read(&req)) {
        if (req.isend()) {
            // Client signalled the end of the session: drop its state and confirm.
            std::cout << "asr end" << std::endl;
            disconnect(req.user());
            Response res;
            res.set_sentence(R"({"success": true, "detail": "asr end"})");
            res.set_user(req.user());
            res.set_action("terminate");
            res.set_language(req.language());
            stream->Write(res);
        } else if (req.speaking()) {
            // Client is still speaking: append the audio chunk to its buffer.
            if (req.audio_data().size() > 0) {
                auto& buf = client_buffers[req.user()];
                buf.insert(buf.end(), req.audio_data().begin(), req.audio_data().end());
            }
            Response res;
            res.set_sentence(R"({"success": true, "detail": "speaking"})");
            res.set_user(req.user());
            res.set_action("speaking");
            res.set_language(req.language());
            stream->Write(res);
        } else if (!req.speaking()) {
            if (client_buffers.count(req.user()) == 0) {
                // Nothing buffered yet for this user.
                Response res;
                res.set_sentence(R"({"success": true, "detail": "waiting_for_voice"})");
                res.set_user(req.user());
                res.set_action("waiting");
                res.set_language(req.language());
                stream->Write(res);
            } else {
                // End of speech: decode the buffered audio.
                auto begin_time = std::chrono::duration_cast<std::chrono::milliseconds>(
                    std::chrono::system_clock::now().time_since_epoch()).count();
                std::string tmp_data = this->client_buffers[req.user()];
                this->clear_states(req.user());

                Response res;
                int data_len_int = tmp_data.length();
                std::string data_len = std::to_string(data_len_int);
                std::stringstream ss;
                ss << R"({"success": true, "detail": "decoding data: )" << data_len << R"( bytes"})";
                std::string result = ss.str();
                res.set_sentence(result);
                res.set_user(req.user());
                res.set_action("decoding");
                res.set_language(req.language());
                stream->Write(res);

                if (tmp_data.length() < 800) {  // min input length for the asr model
                    auto end_time = std::chrono::duration_cast<std::chrono::milliseconds>(
                        std::chrono::system_clock::now().time_since_epoch()).count();
                    std::string delay_str = std::to_string(end_time - begin_time);
                    std::cout << "user: " << req.user() << " , delay(ms): " << delay_str
                              << ", error: data_is_not_long_enough" << std::endl;
                    Response res;
                    std::stringstream ss;
                    std::string asr_result = "";
                    ss << R"({"success": true, "detail": "finish_sentence","server_delay_ms":)" << delay_str
                       << R"(,"text":")" << asr_result << R"("})";
                    std::string result = ss.str();
                    res.set_sentence(result);
                    res.set_user(req.user());
                    res.set_action("finish");
                    res.set_language(req.language());
                    stream->Write(res);
                } else {
                    RPASR_RESULT Result = RapidAsrRecogPCMBuffer(AsrHanlde, tmp_data.c_str(), data_len_int, RASR_NONE, NULL);
                    std::string asr_result = ((RPASR_RECOG_RESULT*)Result)->msg;

                    auto end_time = std::chrono::duration_cast<std::chrono::milliseconds>(
                        std::chrono::system_clock::now().time_since_epoch()).count();
                    std::string delay_str = std::to_string(end_time - begin_time);

                    std::cout << "user: " << req.user() << " , delay(ms): " << delay_str
                              << ", text: " << asr_result << std::endl;
                    Response res;
                    std::stringstream ss;
                    ss << R"({"success": true, "detail": "finish_sentence","server_delay_ms":)" << delay_str
                       << R"(,"text":")" << asr_result << R"("})";
                    std::string result = ss.str();
                    res.set_sentence(result);
                    res.set_user(req.user());
                    res.set_action("finish");
                    res.set_language(req.language());
                    stream->Write(res);
                }
            }
        } else {
            Response res;
            res.set_sentence(R"({"success": false, "detail": "error, no condition matched! Unknown reason."})");
            res.set_user(req.user());
            res.set_action("terminate");
            res.set_language(req.language());
            stream->Write(res);
        }
    }
    return Status::OK;
}

void RunServer(const std::string& port, int thread_num, const char* model_path) {
    std::string server_address = "0.0.0.0:" + port;
    ASRServicer service(model_path, thread_num);

    ServerBuilder builder;
    builder.AddListeningPort(server_address, grpc::InsecureServerCredentials());
    builder.RegisterService(&service);
    std::unique_ptr<Server> server(builder.BuildAndStart());
    std::cout << "Server listening on " << server_address << std::endl;
    server->Wait();
}

int main(int argc, char* argv[]) {
    if (argc < 4) {
        printf("Usage: %s port thread_num /path/to/model_file\n", argv[0]);
        exit(-1);
    }

    RunServer(argv[1], atoi(argv[2]), argv[3]);
    return 0;
}
56  funasr/runtime/grpc/paraformer_server.h  Normal file
@ -0,0 +1,56 @@
#pragma once

#include <algorithm>
#include <chrono>
#include <cmath>
#include <iostream>
#include <memory>
#include <string>
#include <unordered_map>

#include <grpc/grpc.h>
#include <grpcpp/server.h>
#include <grpcpp/server_builder.h>
#include <grpcpp/server_context.h>
#include <grpcpp/security/server_credentials.h>

#include "paraformer.grpc.pb.h"
#include "librapidasrapi.h"

using grpc::Server;
using grpc::ServerBuilder;
using grpc::ServerContext;
using grpc::ServerReader;
using grpc::ServerReaderWriter;
using grpc::ServerWriter;
using grpc::Status;

using paraformer::Request;
using paraformer::Response;
using paraformer::ASR;

typedef struct {
    std::string msg;
    float snippet_time;
} RPASR_RECOG_RESULT;

class ASRServicer final : public ASR::Service {
  private:
    int init_flag;
    // Per-user audio buffers and (reserved) transcription cache, keyed by user id.
    std::unordered_map<std::string, std::string> client_buffers;
    std::unordered_map<std::string, std::string> client_transcription;

  public:
    ASRServicer(const char* model_path, int thread_num);
    void clear_states(const std::string& user);
    void clear_buffers(const std::string& user);
    void clear_transcriptions(const std::string& user);
    void disconnect(const std::string& user);
    grpc::Status Recognize(grpc::ServerContext* context, grpc::ServerReaderWriter<Response, Request>* stream);
    RPASR_HANDLE AsrHanlde;
};
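For reference, here is a minimal client sketch for the `Recognize` stream above (not part of this commit). It assumes the `paraformer.proto` behind `paraformer.grpc.pb.h` defines a bidirectional-streaming `Recognize` RPC on service `ASR`, with the `user`, `language`, `speaking`, `isend` and `audio_data` request fields and the `sentence`/`action` response fields used by the server; the user id, language tag and file path are placeholders.

```cpp
// Hypothetical client for the Recognize stream; field names follow the server code above.
#include <fstream>
#include <iostream>
#include <iterator>
#include <memory>
#include <string>

#include <grpcpp/grpcpp.h>

#include "paraformer.grpc.pb.h"

using grpc::ClientContext;
using grpc::ClientReaderWriter;
using paraformer::ASR;
using paraformer::Request;
using paraformer::Response;

int main(int argc, char* argv[]) {
    if (argc < 3) {
        std::cerr << "Usage: " << argv[0] << " host:port /path/to/16k_16bit.pcm" << std::endl;
        return -1;
    }

    // Read raw PCM bytes (the server buffers them as an opaque byte string).
    std::ifstream ifs(argv[2], std::ios::binary);
    std::string pcm((std::istreambuf_iterator<char>(ifs)), std::istreambuf_iterator<char>());

    auto channel = grpc::CreateChannel(argv[1], grpc::InsecureChannelCredentials());
    std::unique_ptr<ASR::Stub> stub = ASR::NewStub(channel);

    ClientContext context;
    std::unique_ptr<ClientReaderWriter<Request, Response>> stream(stub->Recognize(&context));

    Request req;
    req.set_user("demo_user");   // placeholder user id
    req.set_language("zh-CN");   // placeholder language tag

    // 1) Send audio while speaking.
    req.set_speaking(true);
    req.set_isend(false);
    req.set_audio_data(pcm);
    stream->Write(req);

    // 2) Stop speaking so the server decodes the buffered audio.
    req.set_speaking(false);
    req.clear_audio_data();
    stream->Write(req);

    // 3) End the session.
    req.set_isend(true);
    stream->Write(req);
    stream->WritesDone();

    // Print every JSON status/result the server streams back.
    Response res;
    while (stream->Read(&res)) {
        std::cout << res.action() << ": " << res.sentence() << std::endl;
    }
    return stream->Finish().ok() ? 0 : 1;
}
```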
12  funasr/runtime/grpc/rebuild.sh  Normal file
@ -0,0 +1,12 @@
#!/bin/bash

# Rebuild from scratch.
rm -rf cmake
mkdir -p cmake/build

cd cmake/build

cmake -DCMAKE_BUILD_TYPE=release -DONNXRUNTIME_DIR=/data/asrmodel/onnxruntime-linux-x64-1.14.0 ../..
make

echo "Build cmake/build/paraformer_server successfully!"
@ -41,8 +41,8 @@ pip install --editable ./
```
Export the onnx model ([details](https://github.com/alibaba-damo-academy/FunASR/tree/main/funasr/export)); as an example, export a model from ModelScope:

```
python -m funasr.export.export_model 'damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch' "./export" true
```shell
python -m funasr.export.export_model --model-name damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch --export-dir ./export --type onnx --quantize False
```

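As a sanity check after export (not part of the diff above), the exported graph can be opened with the onnxruntime C++ API. A minimal sketch, assuming the export step leaves a `model.onnx` somewhere under `./export`; the exact path and file name depend on the exporter and are placeholders here:

```cpp
// Minimal sketch: open an exported Paraformer ONNX graph and list its inputs.
// The model path is a placeholder; point it at wherever the export step wrote model.onnx.
#include <iostream>
#include <onnxruntime_cxx_api.h>

int main() {
    Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "paraformer-export-check");
    Ort::SessionOptions opts;
    opts.SetIntraOpNumThreads(1);

    Ort::Session session(env, "./export/model.onnx", opts);  // placeholder path

    Ort::AllocatorWithDefaultOptions alloc;
    for (size_t i = 0; i < session.GetInputCount(); ++i) {
        auto name = session.GetInputNameAllocated(i, alloc);  // requires onnxruntime >= 1.13
        std::cout << "input[" << i << "]: " << name.get() << std::endl;
    }
    return 0;
}
```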
## Building Guidance for Linux/Unix
@ -237,7 +237,7 @@ bool Audio::loadpcmwav(const char* buf, int nBufLen)

    size_t nOffset = 0;

#define WAV_HEADER_SIZE 44

    speech_len = nBufLen / 2;
    speech_align_len = (int)(ceil((float)speech_len / align_size) * align_size);
@ -263,7 +263,8 @@ bool Audio::loadpcmwav(const char* buf, int nBufLen)
        speech_data[i] = (float)speech_buff[i] / scale;
    }

    AudioFrame* frame = new AudioFrame(speech_len);
    frame_queue.push(frame);
    return true;
}
@ -26,8 +26,9 @@ extern "C" {
        return nullptr;

    Audio audio(1);
    audio.loadwav(szBuf, nLen);
    audio.split();
    if (!audio.loadwav(szBuf, nLen))
        return nullptr;
    //audio.split();

    float* buff;
    int len;
@ -58,8 +59,9 @@ extern "C" {
        return nullptr;

    Audio audio(1);
    audio.loadpcmwav(szBuf, nLen);
    audio.split();
    if (!audio.loadpcmwav(szBuf, nLen))
        return nullptr;
    //audio.split();

    float* buff;
    int len;
@ -91,8 +93,9 @@ extern "C" {
        return nullptr;

    Audio audio(1);
    audio.loadpcmwav(szFileName);
    audio.split();
    if (!audio.loadpcmwav(szFileName))
        return nullptr;
    //audio.split();

    float* buff;
    int len;
@ -125,7 +128,7 @@ extern "C" {
    Audio audio(1);
    if (!audio.loadwav(szWavfile))
        return nullptr;
    audio.split();
    //audio.split();

    float* buff;
    int len;
@ -8,7 +8,7 @@
#include "librapidasrapi.h"

#include <iostream>

#include <fstream>
using namespace std;

int main(int argc, char *argv[])
@ -40,10 +40,13 @@ int main(int argc, char *argv[])

    gettimeofday(&start, NULL);

    RPASR_RESULT Result=RapidAsrRecogPCMFile(AsrHanlde, argv[2], RASR_NONE, NULL);
    gettimeofday(&end, NULL);
    float snippet_time = 0.0f;

    RPASR_RESULT Result=RapidAsrRecogFile(AsrHanlde, argv[2], RASR_NONE, NULL);

    gettimeofday(&end, NULL);

    if (Result)
    {
        string msg = RapidAsrGetResult(Result, 0);
@ -56,11 +59,51 @@ int main(int argc, char *argv[])
    }
    else
    {
        cout <<("no return data!");
        cout <<"no return data!";
    }

    printf("Audio length %lfs.\n", (double)snippet_time);

    //char* buff = nullptr;
    //int len = 0;
    //ifstream ifs(argv[2], std::ios::binary | std::ios::in);
    //if (ifs.is_open())
    //{
    //    ifs.seekg(0, std::ios::end);
    //    len = ifs.tellg();
    //    ifs.seekg(0, std::ios::beg);

    //    buff = new char[len];

    //    ifs.read(buff, len);

    //    //RPASR_RESULT Result = RapidAsrRecogPCMFile(AsrHanlde, argv[2], RASR_NONE, NULL);

    //    RPASR_RESULT Result=RapidAsrRecogPCMBuffer(AsrHanlde, buff, len, RASR_NONE, NULL);
    //    //RPASR_RESULT Result = RapidAsrRecogPCMFile(AsrHanlde, argv[2], RASR_NONE, NULL);
    //    gettimeofday(&end, NULL);
    //
    //    if (Result)
    //    {
    //        string msg = RapidAsrGetResult(Result, 0);
    //        setbuf(stdout, NULL);
    //        cout << "Result: \"";
    //        cout << msg << endl;
    //        cout << "\"." << endl;
    //        snippet_time = RapidAsrGetRetSnippetTime(Result);
    //        RapidAsrFreeResult(Result);
    //    }
    //    else
    //    {
    //        cout <<"no return data!";
    //    }
    //
    //    delete[] buff;
    //}

    printf("Audio length %lfs.\n", (double)snippet_time);
    seconds = (end.tv_sec - start.tv_sec);
    long taking_micros = ((seconds * 1000000) + end.tv_usec) - (start.tv_usec);
    printf("Model inference takes %lfs.\n", (double)taking_micros / 1000000);
45  funasr/runtime/python/benchmark_libtorch.md  Normal file
@ -0,0 +1,45 @@
# Benchmark

### Data set
Aishell1 [test set](https://www.openslr.org/33/); the total audio duration is 36108.919 seconds.

### Tools
- Install ModelScope and FunASR

```shell
pip install "modelscope[audio_asr]" --upgrade -f https://modelscope.oss-cn-beijing.aliyuncs.com/releases/repo.html
git clone https://github.com/alibaba-damo-academy/FunASR.git && cd FunASR
pip install --editable ./
cd funasr/runtime/python/utils
pip install -r requirements.txt
```

- Recipe

Set the model, data path and output_dir, then run:

```shell
nohup bash test_rtf.sh &> log.txt &
```


## [Paraformer-large](https://www.modelscope.cn/models/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary)

### Intel(R) Xeon(R) Platinum 8269CY CPU @ 2.50GHz, 16 cores / 32 processors, with avx512_vnni

| concurrent-tasks | processing time(s) | RTF | Speedup Rate |
|:----------------:|:------------------:|:------:|:------------:|
| 1 (torch fp32) | 3522 | 0.0976 | 10.3 |
| 1 (torch int8) | 1746 | 0.0484 | 20.7 |
| 32 (torch fp32) | 236 | 0.0066 | 152.7 |
| 32 (torch int8) | 114 | 0.0032 | 317.4 |
| 64 (torch fp32) | 235 | 0.0065 | 153.7 |
| 64 (torch int8) | 113 | 0.0031 | 319.2 |
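The RTF and Speedup columns follow directly from the processing time and the 36108.919 s test-set duration (RTF = processing time / audio duration, Speedup = audio duration / processing time). A minimal standalone sketch, not part of the benchmark scripts; small differences from the table come from rounding of the reported processing times:

```cpp
// Minimal sketch: derive RTF and Speedup for one table row, assuming
//   RTF     = processing_time / total_audio_duration
//   Speedup = total_audio_duration / processing_time (i.e. 1 / RTF)
#include <cstdio>

int main() {
    const double total_audio_s = 36108.919;  // Aishell1 test set duration (s)
    const double processing_s  = 3522.0;     // e.g. 1 concurrent task, torch fp32
    const double rtf     = processing_s / total_audio_s;
    const double speedup = total_audio_s / processing_s;
    std::printf("RTF = %.4f, Speedup = %.1f\n", rtf, speedup);  // ~0.0975, ~10.3
    return 0;
}
```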
[//]: # (### Intel(R) Xeon(R) Platinum 8163 CPU @ 2.50GHz, 32 cores / 64 processors, without avx512_vnni)

## [Paraformer](https://modelscope.cn/models/damo/speech_paraformer_asr_nat-zh-cn-16k-common-vocab8358-tensorflow1/summary)
89  funasr/runtime/python/benchmark_onnx.md  Normal file
@ -0,0 +1,89 @@
# Benchmark

### Data set
Aishell1 [test set](https://www.openslr.org/33/); the total audio duration is 36108.919 seconds.

### Tools
- Install ModelScope and FunASR

```shell
pip install "modelscope[audio_asr]" --upgrade -f https://modelscope.oss-cn-beijing.aliyuncs.com/releases/repo.html
git clone https://github.com/alibaba-damo-academy/FunASR.git && cd FunASR
pip install --editable ./
cd funasr/runtime/python/utils
pip install -r requirements.txt
```

- Recipe

Set the model, data path and output_dir, then run:

```shell
nohup bash test_rtf.sh &> log.txt &
```


## [Paraformer-large](https://www.modelscope.cn/models/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary)

### Intel(R) Xeon(R) Platinum 8369B CPU @ 2.90GHz, 16 cores / 32 processors, with avx512_vnni

| concurrent-tasks | processing time(s) | RTF | Speedup Rate |
|:----------------:|:------------------:|:-------:|:------------:|
| 1 (onnx fp32) | 2806 | 0.0777 | 12.9 |
| 1 (onnx int8) | 1611 | 0.0446 | 22.4 |
| 8 (onnx fp32) | 538 | 0.0149 | 67.1 |
| 8 (onnx int8) | 210 | 0.0058 | 172.4 |
| 16 (onnx fp32) | 288 | 0.0080 | 125.2 |
| 16 (onnx int8) | 117 | 0.0032 | 309.9 |
| 32 (onnx fp32) | 167 | 0.0046 | 216.5 |
| 32 (onnx int8) | 86 | 0.0024 | 420.0 |
| 64 (onnx fp32) | 158 | 0.0044 | 228.1 |
| 64 (onnx int8) | 82 | 0.0023 | 442.8 |
| 96 (onnx fp32) | 151 | 0.0042 | 238.0 |
| 96 (onnx int8) | 80 | 0.0022 | 452.0 |


### Intel(R) Xeon(R) Platinum 8269CY CPU @ 2.50GHz, 16 cores / 32 processors, with avx512_vnni

| concurrent-tasks | processing time(s) | RTF | Speedup Rate |
|:----------------:|:------------------:|:------:|:------------:|
| 1 (onnx fp32) | 2613 | 0.0724 | 13.8 |
| 1 (onnx int8) | 1321 | 0.0366 | 22.4 |
| 32 (onnx fp32) | 170 | 0.0047 | 212.7 |
| 32 (onnx int8) | 89 | 0.0025 | 407.0 |
| 64 (onnx fp32) | 166 | 0.0046 | 217.1 |
| 64 (onnx int8) | 87 | 0.0024 | 414.7 |


### Intel(R) Xeon(R) Platinum 8163 CPU @ 2.50GHz, 32 cores / 64 processors, without avx512_vnni

| concurrent-tasks | processing time(s) | RTF | Speedup Rate |
|:----------------:|:------------------:|:------:|:------------:|
| 1 (onnx fp32) | 2959 | 0.0820 | 12.2 |
| 1 (onnx int8) | 2814 | 0.0778 | 12.8 |
| 16 (onnx fp32) | 373 | 0.0103 | 96.9 |
| 16 (onnx int8) | 331 | 0.0091 | 109.0 |
| 32 (onnx fp32) | 211 | 0.0058 | 171.4 |
| 32 (onnx int8) | 181 | 0.0050 | 200.0 |
| 64 (onnx fp32) | 153 | 0.0042 | 235.9 |
| 64 (onnx int8) | 103 | 0.0029 | 349.9 |
| 96 (onnx fp32) | 146 | 0.0041 | 247.0 |
| 96 (onnx int8) | 108 | 0.0030 | 334.1 |

## [Paraformer](https://modelscope.cn/models/damo/speech_paraformer_asr_nat-zh-cn-16k-common-vocab8358-tensorflow1/summary)

### Intel(R) Xeon(R) Platinum 8369B CPU @ 2.90GHz, 16 cores / 32 processors, with avx512_vnni

| concurrent-tasks | processing time(s) | RTF | Speedup Rate |
|:----------------:|:------------------:|:------:|:------------:|
| 1 (onnx fp32) | 1173 | 0.0325 | 30.8 |
| 1 (onnx int8) | 976 | 0.0270 | 37.0 |
| 16 (onnx fp32) | 91 | 0.0025 | 395.2 |
| 16 (onnx int8) | 78 | 0.0022 | 463.0 |
| 32 (onnx fp32) | 60 | 0.0017 | 598.8 |
| 32 (onnx int8) | 40 | 0.0011 | 892.9 |
| 64 (onnx fp32) | 55 | 0.0015 | 653.6 |
| 64 (onnx int8) | 31 | 0.0009 | 1162.8 |
| 96 (onnx fp32) | 57 | 0.0016 | 632.9 |
| 96 (onnx int8) | 33 | 0.0009 | 1098.9 |
Some files were not shown because too many files have changed in this diff.