create

2025-09-15 14:48:36 +08:00 · 2022-11-26 21:56:51 +08:00 · 2022-11-26 21:56:51 +08:00 · c087854f71
commit c087854f71
parent 460ab97011
314 changed files with 42966 additions and 1 deletions
--- a/2
+++ b/2
@ -18,4 +18,4 @@ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
 AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
 LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
 OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
-SOFTWARE.
+SOFTWARE.
--- a/README.md
+++ b/README.md
@ -0,0 +1,78 @@
+<div align="left"><img src="image/funasr_logo.jpg" width="400"/></div>
+
+# FunASR: A Fundamental End-to-End Speech Recognition Toolkit
+
+<strong>FunASR</strong> aims to bridge academic research and industrial applications in speech recognition. By supporting training and fine-tuning of the industrial-grade speech recognition models released on [ModelScope](https://www.modelscope.cn/models?page=1&tasks=auto-speech-recognition), it lets researchers and developers study and deploy speech recognition models more conveniently, and helps the speech recognition ecosystem grow. ASR for Fun!
+
+## Installation (Training and Developing)
+
+- Clone the repo:
+``` sh
+git clone https://github.com/alibaba/FunASR.git
+```
+
+- Install Conda:
+``` sh
+wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
+sh Miniconda3-latest-Linux-x86_64.sh
+conda create -n funasr python=3.7
+conda activate funasr
+```
+
+- Install Pytorch (version >= 1.7.0): 
+
+| cuda  | |
+|:-----:| --- |
+|  9.2  | conda install pytorch==1.7.0 torchvision==0.8.0 torchaudio==0.7.0 cudatoolkit=9.2 -c pytorch |
+| 10.2  | conda install pytorch==1.8.0 torchvision==0.9.0 torchaudio==0.8.0 cudatoolkit=10.2 -c pytorch |
+| 11.1  | conda install pytorch==1.8.0 torchvision==0.9.0 torchaudio==0.8.0 cudatoolkit=11.1 -c pytorch |
+
+For more versions, please see https://pytorch.org/get-started/locally/
+
+- Install ModelScope:
+``` sh
+pip install "modelscope[audio]" -f https://modelscope.oss-cn-beijing.aliyuncs.com/releases/repo.html
+```
+
+- Install other packages: 
+
+``` sh
+pip install --editable ./
+```
+
+## Contact
+
+If you have any questions about FunASR, please contact us by
+
+- email: [funasr@list.alibaba-inc.com](mailto:funasr@list.alibaba-inc.com)
+
+- Dingding group:
+<div align="left"><img src="image/dingding.jpg" width="400"/></div>
+
+
+## Acknowledgements
+
+1. We borrowed a lot of code from [Kaldi](http://kaldi-asr.org/) for data preparation.
+2. We borrowed a lot of code from [ESPnet](https://github.com/espnet/espnet). FunASR follows the training and fine-tuning pipelines of ESPnet.
+3. We referred to [WeNet](https://github.com/wenet-e2e/wenet) when building the dataloader for large-scale data training.
+
+## License
+This project is licensed under [The MIT License](https://opensource.org/licenses/MIT). FunASR also contains various third-party components and some code modified from other repos, which are covered by other open source licenses.
+
+## Citations
+
+``` bibtex
+@inproceedings{gao2020universal,
+  title={Universal ASR: Unifying Streaming and Non-Streaming ASR Using a Single Encoder-Decoder Model},
+  author={Gao, Zhifu and Zhang, Shiliang and Lei, Ming and McLoughlin, Ian},
+  booktitle={arXiv preprint arXiv:2010.14099},
+  year={2020}
+}
+
+@inproceedings{gao2022paraformer,
+  title={Paraformer: Fast and Accurate Parallel Transformer for Non-autoregressive End-to-End Speech Recognition},
+  author={Gao, Zhifu and Zhang, Shiliang and McLoughlin, Ian and Yan, Zhijie},
+  booktitle={INTERSPEECH},
+  year={2022}
+}
+```
--- a/egs/aishell/conformer/README.md
+++ b/egs/aishell/conformer/README.md
@ -0,0 +1,17 @@
+
+# Conformer Result
+
+## Training Config
+- Feature info: 80-dim fbank, global CMVN, speed perturbation (0.9, 1.0, 1.1), SpecAugment
+- Train info: lr 5e-4, batch_bins 25000, 2 GPUs (Tesla V100), accum_grad 1, 50 epochs
+- Train config: conf/train_asr_conformer.yaml
+- LM config: LM was not used
+- Model size: 46M
+
+## Results (CER)
+- Decode config: conf/decode_asr_transformer.yaml (ctc weight:0.5)
+
+|   testset   | CER(%)  |
+|:-----------:|:-------:|
+|     dev     |  4.42   |
+|    test     |  4.87   |
--- a/egs/aishell/conformer/conf/decode_asr_transformer.yaml
+++ b/egs/aishell/conformer/conf/decode_asr_transformer.yaml
@ -0,0 +1,6 @@
+beam_size: 10
+penalty: 0.0
+maxlenratio: 0.0
+minlenratio: 0.0
+ctc_weight: 0.5
+lm_weight: 0.7
--- a/egs/aishell/conformer/conf/train_asr_conformer.yaml
+++ b/egs/aishell/conformer/conf/train_asr_conformer.yaml
@ -0,0 +1,80 @@
+# network architecture
+# encoder related
+encoder: conformer
+encoder_conf:
+    output_size: 256    # dimension of attention
+    attention_heads: 4
+    linear_units: 2048  # the number of units of position-wise feed forward
+    num_blocks: 12      # the number of encoder blocks
+    dropout_rate: 0.1
+    positional_dropout_rate: 0.1
+    attention_dropout_rate: 0.0
+    input_layer: conv2d # encoder architecture type
+    normalize_before: true
+    pos_enc_layer_type: rel_pos
+    selfattention_layer_type: rel_selfattn
+    activation_type: swish
+    macaron_style: true
+    use_cnn_module: true
+    cnn_module_kernel: 15
+
+# decoder related
+decoder: transformer
+decoder_conf:
+    attention_heads: 4
+    linear_units: 2048
+    num_blocks: 6
+    dropout_rate: 0.1
+    positional_dropout_rate: 0.1
+    self_attention_dropout_rate: 0.0
+    src_attention_dropout_rate: 0.0
+
+# hybrid CTC/attention
+model_conf:
+    ctc_weight: 0.3
+    lsm_weight: 0.1     # label smoothing option
+    length_normalized_loss: false
+
+# minibatch related
+batch_type: length
+batch_bins: 25000
+num_workers: 16
+
+# optimization related
+accum_grad: 1
+grad_clip: 5
+max_epoch: 50
+val_scheduler_criterion:
+    - valid
+    - acc
+best_model_criterion:
+-   - valid
+    - acc
+    - max
+keep_nbest_models: 10
+
+optim: adam
+optim_conf:
+   lr: 0.0005
+scheduler: warmuplr
+scheduler_conf:
+   warmup_steps: 30000
+
+specaug: specaug
+specaug_conf:
+    apply_time_warp: true
+    time_warp_window: 5
+    time_warp_mode: bicubic
+    apply_freq_mask: true
+    freq_mask_width_range:
+    - 0
+    - 30
+    num_freq_mask: 2
+    apply_time_mask: true
+    time_mask_width_range:
+    - 0
+    - 40
+    num_time_mask: 2
+
+log_interval: 50
+normalize: None
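Reviewer note on `scheduler: warmuplr`: the config above pairs `lr: 0.0005` with `warmup_steps: 30000`. The sketch below shows how the learning rate would evolve if the scheduler follows the ESPnet-style rule `lr = base_lr * warmup^0.5 * min(step^-0.5, step * warmup^-1.5)`. That rule is an assumption, since the scheduler implementation is not part of this excerpt, and `warmup_lr` is a hypothetical helper name.

```sh
#!/bin/sh
# Hypothetical helper: learning rate at a given step under the ASSUMED
# ESPnet-style WarmupLR rule (not taken from this commit).
warmup_lr() {
    # $1 = base lr, $2 = warmup steps, $3 = current step (>= 1)
    awk -v lr="$1" -v w="$2" -v s="$3" 'BEGIN {
        decay  = 1 / sqrt(s)          # step^-0.5, used after warmup
        warmup = s / (w * sqrt(w))    # step * warmup^-1.5, used during warmup
        m = (decay < warmup) ? decay : warmup
        printf "%.6f\n", lr * sqrt(w) * m
    }'
}

warmup_lr 0.0005 30000 7500     # a quarter of the way through warmup
warmup_lr 0.0005 30000 30000    # end of warmup, peak learning rate
warmup_lr 0.0005 30000 120000   # inverse square-root decay
```

Under this assumption the rate rises linearly to the configured `lr` at `warmup_steps` and then decays with the inverse square root of the step count.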
--- a/egs/aishell/conformer/local/aishell_data_prep.sh
+++ b/egs/aishell/conformer/local/aishell_data_prep.sh
@ -0,0 +1,66 @@
+#!/bin/bash
+
+# Copyright 2017 Xingyu Na
+# Apache 2.0
+
+#. ./path.sh || exit 1;
+
+if [ $# != 3 ]; then
+  echo "Usage: $0 <audio-path> <text-path> <output-path>"
+  echo " $0 /export/a05/xna/data/data_aishell/wav /export/a05/xna/data/data_aishell/transcript data"
+  exit 1;
+fi
+
+aishell_audio_dir=$1
+aishell_text=$2/aishell_transcript_v0.8.txt
+output_dir=$3
+
+train_dir=$output_dir/data/local/train
+dev_dir=$output_dir/data/local/dev
+test_dir=$output_dir/data/local/test
+tmp_dir=$output_dir/data/local/tmp
+
+mkdir -p $train_dir
+mkdir -p $dev_dir
+mkdir -p $test_dir
+mkdir -p $tmp_dir
+
+# data directory check
+if [ ! -d $aishell_audio_dir ] || [ ! -f $aishell_text ]; then
+  echo "Error: $0 requires an existing audio directory and a transcript file"
+  exit 1;
+fi
+
+# find wav audio file for train, dev and test resp.
+find $aishell_audio_dir -iname "*.wav" > $tmp_dir/wav.flist
+n=`cat $tmp_dir/wav.flist | wc -l`
+[ $n -ne 141925 ] && \
+  echo Warning: expected 141925 data files, found $n
+
+grep -i "wav/train" $tmp_dir/wav.flist > $train_dir/wav.flist || exit 1;
+grep -i "wav/dev" $tmp_dir/wav.flist > $dev_dir/wav.flist || exit 1;
+grep -i "wav/test" $tmp_dir/wav.flist > $test_dir/wav.flist || exit 1;
+
+rm -r $tmp_dir
+
+# Transcriptions preparation
+for dir in $train_dir $dev_dir $test_dir; do
+  echo Preparing $dir transcriptions
+  sed -e 's/\.wav//' $dir/wav.flist | awk -F '/' '{print $NF}' > $dir/utt.list
+  paste -d' ' $dir/utt.list $dir/wav.flist > $dir/wav.scp_all
+  utils/filter_scp.pl -f 1 $dir/utt.list $aishell_text > $dir/transcripts.txt
+  awk '{print $1}' $dir/transcripts.txt > $dir/utt.list
+  utils/filter_scp.pl -f 1 $dir/utt.list $dir/wav.scp_all | sort -u > $dir/wav.scp
+  sort -u $dir/transcripts.txt > $dir/text
+done
+
+mkdir -p $output_dir/data/train $output_dir/data/dev $output_dir/data/test
+
+for f in wav.scp text; do
+  cp $train_dir/$f $output_dir/data/train/$f || exit 1;
+  cp $dev_dir/$f $output_dir/data/dev/$f || exit 1;
+  cp $test_dir/$f $output_dir/data/test/$f || exit 1;
+done
+
+echo "$0: AISHELL data preparation succeeded"
+exit 0;
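Reviewer note on the transcription loop in `aishell_data_prep.sh`: utterance IDs are derived from the wav paths by stripping the `.wav` suffix and keeping the last path component. A minimal sketch of that step, with made-up sample paths and a hypothetical function name `utt_ids`:

```sh
#!/bin/sh
# Derive utterance IDs from wav paths with the same sed/awk pipeline the
# script uses: drop ".wav", then keep the last "/"-separated field.
# utt_ids is a hypothetical wrapper; the sample paths are illustrative only.
utt_ids() {
    sed -e 's/\.wav//' | awk -F '/' '{print $NF}'
}

printf '%s\n' \
    '/data/data_aishell/wav/train/S0002/BAC009S0002W0122.wav' \
    '/data/data_aishell/wav/dev/S0724/BAC009S0724W0121.wav' | utt_ids
```

Each resulting ID is then pasted next to its path to form a `wav.scp` line, so the IDs must match the first column of the transcript file exactly.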
--- a/egs/aishell/conformer/local/prepare_data.sh
+++ b/egs/aishell/conformer/local/prepare_data.sh
@ -0,0 +1,53 @@
+#!/usr/bin/env bash
+# Copyright 2018 AIShell-Foundation(Authors:Jiayu DU, Xingyu NA, Bengu WU, Hao ZHENG)
+#           2018 Beijing Shell Shell Tech. Co. Ltd. (Author: Hui BU)
+# Apache 2.0
+
+# transform raw AISHELL-2 data to kaldi format
+
+. ./path.sh || exit 1;
+
+tmp=
+dir=
+
+if [ $# != 3 ]; then
+  echo "Usage: $0 <corpus-data-dir> <tmp-dir> <output-dir>"
+  echo " $0 /export/AISHELL-2/iOS/train data/local/train data/train"
+  exit 1;
+fi
+
+corpus=$1
+tmp=$2
+dir=$3
+
+echo "prepare_data.sh: Preparing data in $corpus"
+
+mkdir -p $tmp
+mkdir -p $dir
+
+# corpus check
+if [ ! -d $corpus ] || [ ! -f $corpus/wav.scp ] || [ ! -f $corpus/trans.txt ]; then
+  echo "Error: $0 requires wav.scp and trans.txt under $corpus directory."
+  exit 1;
+fi
+
+# validate utt-key list, IC0803W0380 is a bad utterance
+awk '{print $1}' $corpus/wav.scp | grep -v 'IC0803W0380' > $tmp/wav_utt.list
+awk '{print $1}' $corpus/trans.txt > $tmp/trans_utt.list
+utils/filter_scp.pl -f 1 $tmp/wav_utt.list $tmp/trans_utt.list > $tmp/utt.list
+
+# wav.scp
+awk -F'\t' -v path_prefix=$corpus '{printf("%s\t%s/%s\n",$1,path_prefix,$2)}' $corpus/wav.scp > $tmp/tmp_wav.scp
+utils/filter_scp.pl -f 1 $tmp/utt.list $tmp/tmp_wav.scp | sort -k 1 | uniq > $tmp/wav.scp
+
+# text
+utils/filter_scp.pl -f 1 $tmp/utt.list $corpus/trans.txt | sort -k 1 | uniq > $tmp/text
+
+# copy prepared resources from tmp_dir to target dir
+mkdir -p $dir
+for f in wav.scp text; do
+  cp $tmp/$f $dir/$f || exit 1;
+done
+
+echo "local/prepare_data.sh succeeded"
+exit 0;
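Reviewer note on `utils/filter_scp.pl -f 1 <id-list> <table>`: both data preparation scripts rely on it to keep only the table lines whose first field appears in the ID list. The Perl utility itself is not shown in this excerpt, so the awk stand-in below is only a sketch of the behavior assumed here, with a hypothetical function name `filter_by_id` and made-up sample data. It also assumes a non-empty ID list.

```sh
#!/bin/sh
# Rough awk stand-in for "utils/filter_scp.pl -f 1 <id-list> <table>":
# print the lines of <table> whose first field appears in <id-list>.
# Assumes <id-list> is non-empty (NR == FNR only holds for the first file).
filter_by_id() {
    # $1 = id list file, $2 = table file
    awk 'NR == FNR { keep[$1] = 1; next } ($1 in keep)' "$1" "$2"
}

tmp=$(mktemp -d)
printf '%s\n' 'utt1' 'utt3' > "$tmp/utt.list"
printf '%s\n' 'utt1 hello' 'utt2 world' 'utt3 speech' > "$tmp/trans.txt"
filter_by_id "$tmp/utt.list" "$tmp/trans.txt"   # keeps the utt1 and utt3 lines
rm -r "$tmp"
```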
--- a/egs/aishell/conformer/path.sh
+++ b/egs/aishell/conformer/path.sh
@ -0,0 +1,5 @@
+export FUNASR_DIR=$PWD/../../..
+
+# NOTE(kan-bayashi): Use UTF-8 in Python to avoid UnicodeDecodeError when LC_ALL=C
+export PYTHONIOENCODING=UTF-8
+export PATH=$FUNASR_DIR/funasr/bin:$PATH
--- a/egs/aishell/conformer/run.sh
+++ b/egs/aishell/conformer/run.sh
@ -0,0 +1,208 @@
+#!/usr/bin/env bash
+
+. ./path.sh || exit 1;
+
+# machines configuration
+CUDA_VISIBLE_DEVICES="0,1"
+gpu_num=2
+count=1
+gpu_inference=true  # Whether to perform gpu decoding, set false for cpu decoding
+# for gpu decoding, inference_nj=ngpu*njob; for cpu decoding, inference_nj=njob
+njob=8
+train_cmd=utils/run.pl
+
+# general configuration
+feats_dir=".." # feature output directory, for large data
+exp_dir="."
+lang=zh
+dumpdir=dump/fbank
+feats_type=fbank
+token_type=char
+scp=feats.scp
+type=kaldi_ark
+stage=0
+stop_stage=4
+
+# feature configuration
+feats_dim=80
+sample_frequency=16000
+nj=32
+speed_perturb="0.9,1.0,1.1"
+
+# data
+data_aishell=
+
+# exp tag
+tag=""
+
+. utils/parse_options.sh || exit 1;
+
+# Set bash to strict mode; it will exit on:
+# -e 'error', -u 'undefined variable', -o pipefail 'error in pipeline'
+set -e
+set -u
+set -o pipefail
+
+train_set=train
+valid_set=dev
+test_sets="dev test"
+
+asr_config=conf/train_asr_conformer.yaml
+model_dir="baseline_$(basename "${asr_config}" .yaml)_${feats_type}_${lang}_${token_type}_${tag}"
+
+inference_config=conf/decode_asr_transformer.yaml
+inference_asr_model=valid.acc.ave_10best.pth
+
+# you can set gpu num for decoding here
+gpuid_list=$CUDA_VISIBLE_DEVICES  # set gpus for decoding, the same as training stage by default
+ngpu=$(echo $gpuid_list | awk -F "," '{print NF}')
+
+if ${gpu_inference}; then
+    inference_nj=$((ngpu * njob))
+else
+    inference_nj=$njob
+fi
+
+if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
+    echo "stage 0: Data preparation"
+    # Data preparation
+    local/aishell_data_prep.sh ${data_aishell}/data_aishell/wav ${data_aishell}/data_aishell/transcript ${feats_dir}
+    for x in train dev test; do
+        cp ${feats_dir}/data/${x}/text ${feats_dir}/data/${x}/text.org
+        paste -d " " <(cut -f 1 -d" " ${feats_dir}/data/${x}/text.org) <(cut -f 2- -d" " ${feats_dir}/data/${x}/text.org | tr -d " ") \
+            > ${feats_dir}/data/${x}/text
+        utils/text2token.py -n 1 -s 1 ${feats_dir}/data/${x}/text > ${feats_dir}/data/${x}/text.org
+        mv ${feats_dir}/data/${x}/text.org ${feats_dir}/data/${x}/text
+    done
+fi
+
+feat_train_dir=${feats_dir}/${dumpdir}/train; mkdir -p ${feat_train_dir}
+feat_dev_dir=${feats_dir}/${dumpdir}/dev; mkdir -p ${feat_dev_dir}
+feat_test_dir=${feats_dir}/${dumpdir}/test; mkdir -p ${feat_test_dir}
+if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
+    echo "stage 1: Feature Generation"
+    # compute fbank features
+    fbankdir=${feats_dir}/fbank
+    utils/compute_fbank.sh --cmd "$train_cmd" --nj $nj --speed_perturb ${speed_perturb} \
+        ${feats_dir}/data/train ${exp_dir}/exp/make_fbank/train ${fbankdir}/train
+    utils/fix_data_feat.sh ${fbankdir}/train
+    utils/compute_fbank.sh --cmd "$train_cmd" --nj $nj \
+        ${feats_dir}/data/dev ${exp_dir}/exp/make_fbank/dev ${fbankdir}/dev
+    utils/fix_data_feat.sh ${fbankdir}/dev
+    utils/compute_fbank.sh --cmd "$train_cmd" --nj $nj \
+        ${feats_dir}/data/test ${exp_dir}/exp/make_fbank/test ${fbankdir}/test
+    utils/fix_data_feat.sh ${fbankdir}/test
+     
+    # compute global cmvn
+    utils/compute_cmvn.sh --cmd "$train_cmd" --nj $nj \
+        ${fbankdir}/train ${exp_dir}/exp/make_fbank/train
+
+    # apply cmvn 
+    utils/apply_cmvn.sh --cmd "$train_cmd" --nj $nj \
+        ${fbankdir}/train ${fbankdir}/train/cmvn.json ${exp_dir}/exp/make_fbank/train ${feat_train_dir}
+    utils/apply_cmvn.sh --cmd "$train_cmd" --nj $nj \
+        ${fbankdir}/dev ${fbankdir}/train/cmvn.json ${exp_dir}/exp/make_fbank/dev ${feat_dev_dir}
+    utils/apply_cmvn.sh --cmd "$train_cmd" --nj $nj \
+        ${fbankdir}/test ${fbankdir}/train/cmvn.json ${exp_dir}/exp/make_fbank/test ${feat_test_dir}
+    
+    cp ${fbankdir}/train/text ${fbankdir}/train/speech_shape ${fbankdir}/train/text_shape ${feat_train_dir}
+    cp ${fbankdir}/dev/text ${fbankdir}/dev/speech_shape ${fbankdir}/dev/text_shape ${feat_dev_dir}
+    cp ${fbankdir}/test/text ${fbankdir}/test/speech_shape ${fbankdir}/test/text_shape ${feat_test_dir}
+
+    utils/fix_data_feat.sh ${feat_train_dir}
+    utils/fix_data_feat.sh ${feat_dev_dir}
+    utils/fix_data_feat.sh ${feat_test_dir}
+fi
+
+token_list=${feats_dir}/data/${lang}_token_list/char/tokens.txt
+echo "dictionary: ${token_list}"
+if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
+    echo "stage 2: Dictionary Preparation"
+    mkdir -p ${feats_dir}/data/${lang}_token_list/char/
+   
+    echo "make a dictionary"
+    echo "<blank>" > ${token_list}
+    echo "<s>" >> ${token_list}
+    echo "</s>" >> ${token_list}
+    utils/text2token.py -s 1 -n 1 --space "" ${feats_dir}/data/train/text | cut -f 2- -d" " | tr " " "\n" \
+        | sort | uniq | grep -a -v -e '^\s*$' | awk '{print $0}' >> ${token_list}
+    num_token=$(cat ${token_list} | wc -l)
+    echo "<unk>" >> ${token_list}
+    vocab_size=$(cat ${token_list} | wc -l)
+    awk -v v=,${vocab_size} '{print $0v}' ${feat_train_dir}/text_shape > ${feat_train_dir}/text_shape.char
+    awk -v v=,${vocab_size} '{print $0v}' ${feat_dev_dir}/text_shape > ${feat_dev_dir}/text_shape.char
+    mkdir -p ${feats_dir}/asr_stats_fbank_zh_char/train 
+    mkdir -p ${feats_dir}/asr_stats_fbank_zh_char/dev
+    cp ${feat_train_dir}/speech_shape ${feat_train_dir}/text_shape ${feat_train_dir}/text_shape.char ${feats_dir}/asr_stats_fbank_zh_char/train
+    cp ${feat_dev_dir}/speech_shape ${feat_dev_dir}/text_shape ${feat_dev_dir}/text_shape.char ${feats_dir}/asr_stats_fbank_zh_char/dev
+fi
+
+# Training Stage
+world_size=$gpu_num  # run on one machine
+if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
+    mkdir -p ${exp_dir}/exp/${model_dir}
+    mkdir -p ${exp_dir}/exp/${model_dir}/log
+    INIT_FILE=$exp_dir/ddp_init
+    if [ -f $INIT_FILE ];then
+        rm -f $INIT_FILE
+    fi 
+    init_method=file://$(readlink -f $INIT_FILE)
+    echo "$0: init method is $init_method"
+    for ((i = 0; i < $gpu_num; ++i)); do
+        {
+            rank=$i
+            local_rank=$i
+            gpu_id=$(echo $CUDA_VISIBLE_DEVICES | cut -d',' -f$((i + 1)))
+            asr_train.py \
+                --gpu_id $gpu_id \
+                --use_preprocessor true \
+                --token_type char \
+                --token_list $token_list \
+                --train_data_path_and_name_and_type ${feats_dir}/${dumpdir}/${train_set}/${scp},speech,${type} \
+                --train_data_path_and_name_and_type ${feats_dir}/${dumpdir}/${train_set}/text,text,text \
+                --train_shape_file ${feats_dir}/asr_stats_fbank_zh_char/${train_set}/speech_shape \
+                --train_shape_file ${feats_dir}/asr_stats_fbank_zh_char/${train_set}/text_shape.char \
+                --valid_data_path_and_name_and_type ${feats_dir}/${dumpdir}/${valid_set}/${scp},speech,${type} \
+                --valid_data_path_and_name_and_type ${feats_dir}/${dumpdir}/${valid_set}/text,text,text \
+                --valid_shape_file ${feats_dir}/asr_stats_fbank_zh_char/${valid_set}/speech_shape \
+                --valid_shape_file ${feats_dir}/asr_stats_fbank_zh_char/${valid_set}/text_shape.char  \
+                --resume true \
+                --output_dir ${exp_dir}/exp/${model_dir} \
+                --config $asr_config \
+                --input_size $feats_dim \
+                --ngpu $gpu_num \
+                --num_worker_count $count \
+                --multiprocessing_distributed true \
+                --dist_init_method $init_method \
+                --dist_world_size $world_size \
+                --dist_rank $rank \
+                --local_rank $local_rank 1> ${exp_dir}/exp/${model_dir}/log/train.log.$i 2>&1
+        } &
+    done
+    wait
+fi
+
+# Testing Stage
+if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
+    utils/easy_asr_infer.sh \
+        --lang zh \
+        --datadir ${feats_dir} \
+        --feats_type ${feats_type} \
+        --feats_dim ${feats_dim} \
+        --token_type ${token_type} \
+        --gpu_inference ${gpu_inference} \
+        --inference_config "${inference_config}" \
+        --test_sets "${test_sets}" \
+        --token_list $token_list \
+        --asr_exp ${exp_dir}/${model_dir} \
+        --stage 12 \
+        --stop_stage 12 \
+        --scp $scp \
+        --text text \
+        --inference_nj $inference_nj \
+        --njob $njob \
+        --inference_asr_model $inference_asr_model \
+        --gpuid_list $gpuid_list \
+        --mode asr
+fi
+
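Reviewer note on `run.sh`: every stage is guarded by `[ ${stage} -le N ] && [ ${stop_stage} -ge N ]`, and the number of decoding jobs is the GPU count times `njob` when `gpu_inference` is true. The sketch below reproduces both pieces in isolation; `should_run` and `count_gpus` are hypothetical helper names, and the values are examples rather than recommended settings.

```sh
#!/bin/sh
# should_run mirrors the guard "[ ${stage} -le N ] && [ ${stop_stage} -ge N ]".
should_run() {
    # $1 = stage, $2 = stop_stage, $3 = stage number to test
    [ "$1" -le "$3" ] && [ "$2" -ge "$3" ]
}

# count_gpus counts the entries of a comma-separated GPU id list.
count_gpus() {
    echo "$1" | awk -F "," '{print NF}'
}

stage=3
stop_stage=4
for n in 0 1 2 3 4; do
    if should_run "$stage" "$stop_stage" "$n"; then
        echo "stage $n: run"
    else
        echo "stage $n: skip"
    fi
done

ngpu=$(count_gpus "0,1")
njob=8
echo "inference_nj=$((ngpu * njob))"
```

With `stage=3` and `stop_stage=4`, only training and decoding run, which is the usual way to resume after features and the dictionary already exist.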
--- a/egs/aishell/conformer/utils
+++ b/egs/aishell/conformer/utils
@ -0,0 +1 @@
+../tranformer/utils
--- a/egs/aishell/paraformer/README.md
+++ b/egs/aishell/paraformer/README.md
@ -0,0 +1,24 @@
+# Paraformer
+Pretrained model on [ModelScope](https://www.modelscope.cn/home): [speech_paraformer_asr_nat-aishell1-pytorch](https://www.modelscope.cn/models/damo/speech_paraformer_asr_nat-aishell1-pytorch/summary)
+
+## Training Config
+- Feature info: 80-dim fbank, global CMVN, speed perturbation (0.9, 1.0, 1.1), SpecAugment
+- Train info: lr 5e-4, batch_bins 25000, 2 GPUs (Tesla V100), accum_grad 1, 50 epochs
+- Train config: conf/train_asr_paraformer_conformer_12e_6d_2048_256.yaml
+- LM config: LM was not used
+
+## Results (CER)
+
+- Decode config: conf/decode_asr_transformer_noctc_1best.yaml (ctc weight:0.0)
+
+|   testset   | CER(%)  |
+|:-----------:|:-------:|
+|     dev     |  4.66   |
+|    test     |  5.11   |
+
+- Decode config: conf/decode_asr_transformer.yaml (ctc weight:0.5)
+
+|   testset   | CER(%)  |
+|:-----------:|:-------:|
+|     dev     |  4.52   |
+|    test     |  4.94   |
--- a/egs/aishell/paraformer/conf/decode_asr_transformer.yaml
+++ b/egs/aishell/paraformer/conf/decode_asr_transformer.yaml
@ -0,0 +1,6 @@
+beam_size: 10
+penalty: 0.0
+maxlenratio: 0.0
+minlenratio: 0.0
+ctc_weight: 0.5
+lm_weight: 0.7
--- a/egs/aishell/paraformer/conf/decode_asr_transformer_noctc_1best.yaml
+++ b/egs/aishell/paraformer/conf/decode_asr_transformer_noctc_1best.yaml
@ -0,0 +1,6 @@
+beam_size: 1
+penalty: 0.0
+maxlenratio: 0.0
+minlenratio: 0.0
+ctc_weight: 0.0
+lm_weight: 0.15
--- a/egs/aishell/paraformer/conf/train_asr_paraformer_conformer_12e_6d_2048_256.yaml
+++ b/egs/aishell/paraformer/conf/train_asr_paraformer_conformer_12e_6d_2048_256.yaml
@ -0,0 +1,91 @@
+# network architecture
+# encoder related
+encoder: conformer
+encoder_conf:
+    output_size: 256    # dimension of attention
+    attention_heads: 4
+    linear_units: 2048  # the number of units of position-wise feed forward
+    num_blocks: 12      # the number of encoder blocks
+    dropout_rate: 0.1
+    positional_dropout_rate: 0.1
+    attention_dropout_rate: 0.0
+    input_layer: conv2d # encoder architecture type
+    normalize_before: true
+    pos_enc_layer_type: rel_pos
+    selfattention_layer_type: rel_selfattn
+    activation_type: swish
+    macaron_style: true
+    use_cnn_module: true
+    cnn_module_kernel: 15
+
+# decoder related
+decoder: paraformer_decoder_san
+decoder_conf:
+    attention_heads: 4
+    linear_units: 2048
+    num_blocks: 6
+    dropout_rate: 0.1
+    positional_dropout_rate: 0.1
+    self_attention_dropout_rate: 0.0
+    src_attention_dropout_rate: 0.0
+
+model: paraformer
+model_conf:
+    ctc_weight: 0.3
+    lsm_weight: 0.1
+    length_normalized_loss: false
+    predictor_weight: 1.0
+    sampling_ratio: 0.4
+
+# minibatch related
+batch_type: length
+batch_bins: 25000
+num_workers: 16
+
+# optimization related
+accum_grad: 1
+grad_clip: 5
+max_epoch: 50
+val_scheduler_criterion:
+    - valid
+    - acc
+best_model_criterion:
+-   - valid
+    - acc
+    - max
+keep_nbest_models: 10
+
+optim: adam
+optim_conf:
+   lr: 0.0005
+scheduler: warmuplr
+scheduler_conf:
+   warmup_steps: 30000
+
+specaug: specaug
+specaug_conf:
+    apply_time_warp: true
+    time_warp_window: 5
+    time_warp_mode: bicubic
+    apply_freq_mask: true
+    freq_mask_width_range:
+    - 0
+    - 30
+    num_freq_mask: 2
+    apply_time_mask: true
+    time_mask_width_range:
+    - 0
+    - 40
+    num_time_mask: 2
+
+predictor: cif_predictor
+predictor_conf:
+  idim: 256
+  threshold: 1.0
+  l_order: 1
+  r_order: 1
+  tail_threshold: 0.45
+
+
+log_interval: 50
+normalize: None
--- a/egs/aishell/paraformer/conf/train_asr_paraformer_conformer_20e_6d_1280_320.yaml
+++ b/egs/aishell/paraformer/conf/train_asr_paraformer_conformer_20e_6d_1280_320.yaml
@ -0,0 +1,92 @@
+# network architecture
+# encoder related
+encoder: conformer
+encoder_conf:
+    output_size: 320    # dimension of attention
+    attention_heads: 4
+    linear_units: 1280  # the number of units of position-wise feed forward
+    num_blocks: 20      # the number of encoder blocks
+    dropout_rate: 0.1
+    positional_dropout_rate: 0.1
+    attention_dropout_rate: 0.0
+    input_layer: conv2d # encoder architecture type
+    normalize_before: true
+    pos_enc_layer_type: rel_pos
+    selfattention_layer_type: rel_selfattn
+    activation_type: swish
+    macaron_style: true
+    use_cnn_module: true
+    cnn_module_kernel: 15
+
+# decoder related
+decoder: paraformer_decoder_san
+decoder_conf:
+    attention_heads: 4
+    linear_units: 2048
+    num_blocks: 6
+    dropout_rate: 0.1
+    positional_dropout_rate: 0.1
+    self_attention_dropout_rate: 0.0
+    src_attention_dropout_rate: 0.0
+
+# hybrid CTC/attention
+model: paraformer
+model_conf:
+    ctc_weight: 0.3
+    lsm_weight: 0.1     # label smoothing option
+    length_normalized_loss: false
+    predictor_weight: 1.0
+    sampling_ratio: 0.4
+
+# minibatch related
+batch_type: length
+batch_bins: 25000
+num_workers: 16
+
+# optimization related
+accum_grad: 4
+grad_clip: 5
+max_epoch: 50
+val_scheduler_criterion:
+    - valid
+    - acc
+best_model_criterion:
+-   - valid
+    - acc
+    - max
+keep_nbest_models: 10
+
+optim: adam
+optim_conf:
+   lr: 0.0005
+scheduler: warmuplr
+scheduler_conf:
+   warmup_steps: 30000
+
+specaug: specaug
+specaug_conf:
+    apply_time_warp: true
+    time_warp_window: 5
+    time_warp_mode: bicubic
+    apply_freq_mask: true
+    freq_mask_width_range:
+    - 0
+    - 30
+    num_freq_mask: 2
+    apply_time_mask: true
+    time_mask_width_range:
+    - 0
+    - 40
+    num_time_mask: 2
+
+predictor: cif_predictor
+predictor_conf:
+  idim: 256
+  threshold: 1.0
+  l_order: 1
+  r_order: 1
+  tail_threshold: 0.45
+
+
+log_interval: 50
+normalize: None
--- a/egs/aishell/paraformer/conf/train_asr_paraformer_sanm_tf_40e_12d_1280_320_lfr6.yaml
+++ b/egs/aishell/paraformer/conf/train_asr_paraformer_sanm_tf_40e_12d_1280_320_lfr6.yaml
@ -0,0 +1,114 @@
+# network architecture
+# encoder related
+encoder: sanm
+encoder_conf:
+    output_size: 320    # dimension of attention
+    attention_heads: 4
+    linear_units: 1280  # the number of units of position-wise feed forward
+    num_blocks: 40      # the number of encoder blocks
+    dropout_rate: 0.1
+    positional_dropout_rate: 0.1
+    attention_dropout_rate: 0.1
+    input_layer: pe # encoder architecture type
+    pos_enc_class: SinusoidalPositionEncoder
+    normalize_before: true
+    kernel_size: 11
+    sanm_shfit: 0
+    selfattention_layer_type: sanm
+
+# decoder related
+decoder: paraformer_decoder_sanm
+decoder_conf:
+    attention_heads: 4
+    linear_units: 1280
+    num_blocks: 12
+    dropout_rate: 0.1
+    positional_dropout_rate: 0.1
+    self_attention_dropout_rate: 0.1
+    src_attention_dropout_rate: 0.1
+    att_layer_num: 6
+    kernel_size: 11
+    sanm_shfit: 0
+
+
+predictor: cif_predictor
+predictor_conf:
+  idim: 320
+  threshold: 1.0
+  l_order: 1
+  r_order: 1
+  tail_threshold: 0.45
+
+# hybrid CTC/attention
+model: paraformer
+model_conf:
+    ctc_weight: 0.0
+    lsm_weight: 0.1     # label smoothing option
+    length_normalized_loss: true
+    predictor_weight: 1.0
+    predictor_bias: 0
+    sampling_ratio: 0.75
+
+
+# minibatch related
+# dataset_type: small
+batch_type: length
+batch_bins: 6000
+num_workers: 16
+# dataset_type: large
+dataset_conf:
+    filter_conf:
+        min_length: 10
+        max_length: 250
+        min_token_length: 1
+        max_token_length: 200
+    shuffle: True
+    shuffle_conf:
+        shuffle_size: 10240
+        sort_size: 500
+    batch_conf:
+        batch_type: token
+        batch_size: 6000
+    num_workers: 16
+
+# optimization related
+accum_grad: 1
+grad_clip: 5
+max_epoch: 20
+val_scheduler_criterion:
+    - valid
+    - acc
+best_model_criterion:
+-   - valid
+    - acc
+    - max
+keep_nbest_models: 5
+
+optim: adam
+optim_conf:
+   lr: 0.0005
+scheduler: warmuplr
+scheduler_conf:
+   warmup_steps: 15000
+
+specaug: specaug_lfr
+specaug_conf:
+    apply_time_warp: false
+    time_warp_window: 5
+    time_warp_mode: bicubic
+    apply_freq_mask: true
+    freq_mask_width_range:
+    - 0
+    - 30
+    lfr_rate: 6
+    num_freq_mask: 1
+    apply_time_mask: true
+    time_mask_width_range:
+    - 0
+    - 12
+    num_time_mask: 1
+
+unused_parameters: true
+log_interval: 50
+normalize: None
+split_with_space: true
--- a/egs/aishell/paraformer/conf/train_asr_paraformer_sanm_tf_50e_16d_2048_512_lfr6.yaml
+++ b/egs/aishell/paraformer/conf/train_asr_paraformer_sanm_tf_50e_16d_2048_512_lfr6.yaml
@ -0,0 +1,114 @@
+# network architecture
+# encoder related
+encoder: sanm
+encoder_conf:
+    output_size: 512    # dimension of attention
+    attention_heads: 4
+    linear_units: 2048  # the number of units of position-wise feed forward
+    num_blocks: 50      # the number of encoder blocks
+    dropout_rate: 0.1
+    positional_dropout_rate: 0.1
+    attention_dropout_rate: 0.1
+    input_layer: pe # encoder architecture type
+    pos_enc_class: SinusoidalPositionEncoder
+    normalize_before: true
+    kernel_size: 11
+    sanm_shfit: 0
+    selfattention_layer_type: sanm
+
+# decoder related
+decoder: paraformer_decoder_sanm
+decoder_conf:
+    attention_heads: 4
+    linear_units: 2048
+    num_blocks: 16
+    dropout_rate: 0.1
+    positional_dropout_rate: 0.1
+    self_attention_dropout_rate: 0.1
+    src_attention_dropout_rate: 0.1
+    att_layer_num: 16
+    kernel_size: 11
+    sanm_shfit: 0
+
+
+predictor: cif_predictor_v2
+predictor_conf:
+  idim: 512
+  threshold: 1.0
+  l_order: 1
+  r_order: 1
+  tail_threshold: 0.45
+
+# hybrid CTC/attention
+model: paraformer
+model_conf:
+    ctc_weight: 0.0
+    lsm_weight: 0.1     # label smoothing option
+    length_normalized_loss: true
+    predictor_weight: 1.0
+    predictor_bias: 1
+    sampling_ratio: 0.75
+
+
+# minibatch related
+# dataset_type: small
+batch_type: length
+batch_bins: 10000
+num_workers: 16
+# dataset_type: large
+dataset_conf:
+    filter_conf:
+        min_length: 10
+        max_length: 250
+        min_token_length: 1
+        max_token_length: 200
+    shuffle: true
+    shuffle_conf:
+        shuffle_size: 10240
+        sort_size: 500
+    batch_conf:
+        batch_type: 'token'
+        batch_size: 6000
+    num_workers: 16
+
+# optimization related
+accum_grad: 1
+grad_clip: 5
+max_epoch: 20
+val_scheduler_criterion:
+    - valid
+    - acc
+best_model_criterion:
+-   - valid
+    - acc
+    - max
+keep_nbest_models: 5
+
+optim: adam
+optim_conf:
+   lr: 0.0005
+scheduler: warmuplr
+scheduler_conf:
+   warmup_steps: 30000
+
+specaug: specaug_lfr
+specaug_conf:
+    apply_time_warp: false
+    time_warp_window: 5
+    time_warp_mode: bicubic
+    apply_freq_mask: true
+    freq_mask_width_range:
+    - 0
+    - 30
+    lfr_rate: 6
+    num_freq_mask: 1
+    apply_time_mask: true
+    time_mask_width_range:
+    - 0
+    - 12
+    num_time_mask: 1
+
+unused_parameters: true
+log_interval: 50
+normalize: None
+split_with_space: true
--- a/egs/aishell/paraformer/conf/train_asr_paraformerbert_conformer_12e_6d_2048_256.yaml
+++ b/egs/aishell/paraformer/conf/train_asr_paraformerbert_conformer_12e_6d_2048_256.yaml
@ -0,0 +1,99 @@
+# network architecture
+# encoder related
+encoder: conformer
+encoder_conf:
+    output_size: 256    # dimension of attention
+    attention_heads: 4
+    linear_units: 2048  # the number of units of position-wise feed forward
+    num_blocks: 12      # the number of encoder blocks
+    dropout_rate: 0.1
+    positional_dropout_rate: 0.1
+    attention_dropout_rate: 0.0
+    input_layer: conv2d # encoder architecture type
+    normalize_before: true
+    pos_enc_layer_type: rel_pos
+    selfattention_layer_type: rel_selfattn
+    activation_type: swish
+    macaron_style: true
+    use_cnn_module: true
+    cnn_module_kernel: 15
+
+# decoder related
+decoder: paraformer_decoder_san
+decoder_conf:
+    attention_heads: 4
+    linear_units: 2048
+    num_blocks: 6
+    dropout_rate: 0.1
+    positional_dropout_rate: 0.1
+    self_attention_dropout_rate: 0.0
+    src_attention_dropout_rate: 0.0
+
+# hybrid CTC/attention
+model: paraformer_bert
+model_conf:
+    ctc_weight: 0.3
+    lsm_weight: 0.1     # label smoothing option
+    length_normalized_loss: false
+    predictor_weight: 1.0
+    sampling_ratio: 0.4
+    embeds_id: 3
+    embed_dims: 768
+    embeds_loss_weight: 2.0
+
+
+
+# minibatch related
+#batch_type: length
+#batch_bins: 40000
+batch_type: numel
+batch_bins: 2000000
+num_workers: 16
+
+# optimization related
+accum_grad: 4
+grad_clip: 5
+max_epoch: 50
+val_scheduler_criterion:
+    - valid
+    - acc
+best_model_criterion:
+-   - valid
+    - acc
+    - max
+keep_nbest_models: 10
+
+optim: adam
+optim_conf:
+   lr: 0.0005
+scheduler: warmuplr
+scheduler_conf:
+   warmup_steps: 30000
+
+specaug: specaug
+specaug_conf:
+    apply_time_warp: true
+    time_warp_window: 5
+    time_warp_mode: bicubic
+    apply_freq_mask: true
+    freq_mask_width_range:
+    - 0
+    - 30
+    num_freq_mask: 2
+    apply_time_mask: true
+    time_mask_width_range:
+    - 0
+    - 40
+    num_time_mask: 2
+
+predictor: cif_predictor
+predictor_conf:
+  idim: 256
+  threshold: 1.0
+  l_order: 1
+  r_order: 1
+  tail_threshold: 0.45
+
+
+log_interval: 50
+normalize: None
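
Review note on `scheduler: warmuplr` with `warmup_steps: 30000` and `lr: 0.0005`: a minimal sketch of the learning-rate curve these settings imply, assuming the scheduler follows the Noam-style warmup formula used by ESPnet's `WarmupLR`. The function name `warmup_lr` is hypothetical and is not part of FunASR.

```python
def warmup_lr(step: int, base_lr: float = 0.0005, warmup_steps: int = 30000) -> float:
    """Learning rate at a given optimizer step (1-based).

    Assumed formula (ESPnet-style WarmupLR):
        lr = base_lr * warmup_steps**0.5 * min(step**-0.5, step * warmup_steps**-1.5)
    The rate grows linearly until `warmup_steps`, peaks at `base_lr`,
    then decays with the inverse square root of the step.
    """
    if step < 1:
        raise ValueError("step must be >= 1")
    return base_lr * warmup_steps ** 0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


if __name__ == "__main__":
    for s in (1, 15000, 30000, 120000):
        print(s, warmup_lr(s))
```

Under this assumption the peak rate equals `lr` and is reached exactly at `warmup_steps`.
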
--- a/egs/aishell/paraformer/local/aishell_data_prep.sh
+++ b/egs/aishell/paraformer/local/aishell_data_prep.sh
@ -0,0 +1,66 @@
+#!/bin/bash
+
+# Copyright 2017 Xingyu Na
+# Apache 2.0
+
+#. ./path.sh || exit 1;
+
+if [ $# != 3 ]; then
+  echo "Usage: $0 <audio-path> <text-path> <output-path>"
+  echo " $0 /export/a05/xna/data/data_aishell/wav /export/a05/xna/data/data_aishell/transcript data"
+  exit 1;
+fi
+
+aishell_audio_dir=$1
+aishell_text=$2/aishell_transcript_v0.8.txt
+output_dir=$3
+
+train_dir=$output_dir/data/local/train
+dev_dir=$output_dir/data/local/dev
+test_dir=$output_dir/data/local/test
+tmp_dir=$output_dir/data/local/tmp
+
+mkdir -p $train_dir
+mkdir -p $dev_dir
+mkdir -p $test_dir
+mkdir -p $tmp_dir
+
+# data directory check
+if [ ! -d $aishell_audio_dir ] || [ ! -f $aishell_text ]; then
+  echo "Error: $0 requires two directory arguments"
+  exit 1;
+fi
+
+# find wav audio file for train, dev and test resp.
+find $aishell_audio_dir -iname "*.wav" > $tmp_dir/wav.flist
+n=`cat $tmp_dir/wav.flist | wc -l`
+[ $n -ne 141925 ] && \
+  echo Warning: expected 141925 data files, found $n
+
+grep -i "wav/train" $tmp_dir/wav.flist > $train_dir/wav.flist || exit 1;
+grep -i "wav/dev" $tmp_dir/wav.flist > $dev_dir/wav.flist || exit 1;
+grep -i "wav/test" $tmp_dir/wav.flist > $test_dir/wav.flist || exit 1;
+
+rm -r $tmp_dir
+
+# Transcriptions preparation
+for dir in $train_dir $dev_dir $test_dir; do
+  echo Preparing $dir transcriptions
+  sed -e 's/\.wav//' $dir/wav.flist | awk -F '/' '{print $NF}' > $dir/utt.list
+  paste -d' ' $dir/utt.list $dir/wav.flist > $dir/wav.scp_all
+  utils/filter_scp.pl -f 1 $dir/utt.list $aishell_text > $dir/transcripts.txt
+  awk '{print $1}' $dir/transcripts.txt > $dir/utt.list
+  utils/filter_scp.pl -f 1 $dir/utt.list $dir/wav.scp_all | sort -u > $dir/wav.scp
+  sort -u $dir/transcripts.txt > $dir/text
+done
+
+mkdir -p $output_dir/data/train $output_dir/data/dev $output_dir/data/test
+
+for f in wav.scp text; do
+  cp $train_dir/$f $output_dir/data/train/$f || exit 1;
+  cp $dev_dir/$f $output_dir/data/dev/$f || exit 1;
+  cp $test_dir/$f $output_dir/data/test/$f || exit 1;
+done
+
+echo "$0: AISHELL data preparation succeeded"
+exit 0;
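
Review note on the transcription loop in `aishell_data_prep.sh`: the `sed`/`awk`/`filter_scp.pl` pipeline joins wav files and transcripts on the utterance ID and keeps only utterances present in both. A minimal Python sketch of that join, for illustration only; `build_scp_and_text` is a hypothetical helper, and sorting by utterance ID stands in for `sort -u`, whose order depends on the locale.

```python
from pathlib import PurePosixPath


def build_scp_and_text(wav_paths, transcripts):
    """Join wav paths and transcripts on the utterance ID.

    wav_paths:   iterable of wav file paths (strings)
    transcripts: dict mapping utterance ID -> transcript text
    Returns (wav_scp_lines, text_lines), each sorted by utterance ID.
    """
    # Utterance ID = file name without the ".wav" suffix.
    wav_by_utt = {
        PurePosixPath(p).name[: -len(".wav")]: p
        for p in wav_paths
        if p.lower().endswith(".wav")
    }
    # Keep only utterances that have both audio and a transcript.
    kept = sorted(utt for utt in wav_by_utt if utt in transcripts)
    wav_scp = [f"{utt} {wav_by_utt[utt]}" for utt in kept]
    text = [f"{utt} {transcripts[utt]}" for utt in kept]
    return wav_scp, text


if __name__ == "__main__":
    wavs = ["/d/wav/train/S0002/BAC009S0002W0122.wav",
            "/d/wav/train/S0002/BAC009S0002W0123.wav"]
    trans = {"BAC009S0002W0122": "而 对 楼市"}
    print(build_scp_and_text(wavs, trans))
```

The second wav file has no transcript, so it is dropped from both outputs, which mirrors the two `filter_scp.pl` passes in the script.
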
--- a/egs/aishell/paraformer/local/prepare_data.sh
+++ b/egs/aishell/paraformer/local/prepare_data.sh
@ -0,0 +1,53 @@
+#!/usr/bin/env bash
+# Copyright 2018 AIShell-Foundation(Authors:Jiayu DU, Xingyu NA, Bengu WU, Hao ZHENG)
+#           2018 Beijing Shell Shell Tech. Co. Ltd. (Author: Hui BU)
+# Apache 2.0
+
+# transform raw AISHELL-2 data to kaldi format
+
+. ./path.sh || exit 1;
+
+tmp=
+dir=
+
+if [ $# != 3 ]; then
+  echo "Usage: $0 <corpus-data-dir> <tmp-dir> <output-dir>"
+  echo " $0 /export/AISHELL-2/iOS/train data/local/train data/train"
+  exit 1;
+fi
+
+corpus=$1
+tmp=$2
+dir=$3
+
+echo "prepare_data.sh: Preparing data in $corpus"
+
+mkdir -p $tmp
+mkdir -p $dir
+
+# corpus check
+if [ ! -d $corpus ] || [ ! -f $corpus/wav.scp ] || [ ! -f $corpus/trans.txt ]; then
+  echo "Error: $0 requires wav.scp and trans.txt under $corpus directory."
+  exit 1;
+fi
+
+# validate utt-key list, IC0803W0380 is a bad utterance
+awk '{print $1}' $corpus/wav.scp | grep -v 'IC0803W0380' > $tmp/wav_utt.list
+awk '{print $1}' $corpus/trans.txt > $tmp/trans_utt.list
+utils/filter_scp.pl -f 1 $tmp/wav_utt.list $tmp/trans_utt.list > $tmp/utt.list
+
+# wav.scp
+awk -F'\t' -v path_prefix=$corpus '{printf("%s\t%s/%s\n",$1,path_prefix,$2)}' $corpus/wav.scp > $tmp/tmp_wav.scp
+utils/filter_scp.pl -f 1 $tmp/utt.list $tmp/tmp_wav.scp | sort -k 1 | uniq > $tmp/wav.scp
+
+# text
+utils/filter_scp.pl -f 1 $tmp/utt.list $corpus/trans.txt | sort -k 1 | uniq > $tmp/text
+
+# copy prepared resources from tmp_dir to target dir
+mkdir -p $dir
+for f in wav.scp text; do
+  cp $tmp/$f $dir/$f || exit 1;
+done
+
+echo "local/prepare_data.sh succeeded"
+exit 0;
--- a/egs/aishell/paraformer/path.sh
+++ b/egs/aishell/paraformer/path.sh
@ -0,0 +1,5 @@
+export FUNASR_DIR=$PWD/../../..
+
+# NOTE(kan-bayashi): Use UTF-8 in Python to avoid UnicodeDecodeError when LC_ALL=C
+export PYTHONIOENCODING=UTF-8
+export PATH=$FUNASR_DIR/funasr/bin:$PATH
--- a/egs/aishell/paraformer/run.sh
+++ b/egs/aishell/paraformer/run.sh
@ -0,0 +1,208 @@
+#!/usr/bin/env bash
+
+. ./path.sh || exit 1;
+
+# machines configuration
+CUDA_VISIBLE_DEVICES="0,1"
+gpu_num=2
+count=1
+gpu_inference=true  # Whether to perform gpu decoding, set false for cpu decoding
+# for gpu decoding, inference_nj=ngpu*njob; for cpu decoding, inference_nj=njob
+njob=8
+train_cmd=utils/run.pl
+
+# general configuration
+feats_dir=".." # feature output directory, for large data
+exp_dir="."
+lang=zh
+dumpdir=dump/fbank
+feats_type=fbank
+token_type=char
+scp=feats.scp
+type=kaldi_ark
+stage=0
+stop_stage=4
+
+# feature configuration
+feats_dim=80
+sample_frequency=16000
+nj=32
+speed_perturb="0.9,1.0,1.1"
+
+# data
+data_aishell=
+
+# exp tag
+tag=""
+
+. utils/parse_options.sh || exit 1;
+
+# Set bash to strict mode; it will exit on:
+# -e 'error', -u 'undefined variable', -o pipefail 'error in pipeline'
+set -e
+set -u
+set -o pipefail
+
+train_set=train
+valid_set=dev
+test_sets="dev test"
+
+asr_config=conf/train_asr_paraformer_conformer_12e_6d_2048_256.yaml
+model_dir="baseline_$(basename "${asr_config}" .yaml)_${feats_type}_${lang}_${token_type}_${tag}"
+
+inference_config=conf/decode_asr_transformer_noctc_1best.yaml
+inference_asr_model=valid.acc.ave_10best.pth
+
+# you can set gpu num for decoding here
+gpuid_list=$CUDA_VISIBLE_DEVICES  # set gpus for decoding, the same as training stage by default
+ngpu=$(echo $gpuid_list | awk -F "," '{print NF}')
+
+if ${gpu_inference}; then
+    inference_nj=$((ngpu * njob))
+else
+    inference_nj=$njob
+fi
+
+if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
+    echo "stage 0: Data preparation"
+    # Data preparation
+    local/aishell_data_prep.sh ${data_aishell}/data_aishell/wav ${data_aishell}/data_aishell/transcript ${feats_dir}
+    for x in train dev test; do
+        cp ${feats_dir}/data/${x}/text ${feats_dir}/data/${x}/text.org
+        paste -d " " <(cut -f 1 -d" " ${feats_dir}/data/${x}/text.org) <(cut -f 2- -d" " ${feats_dir}/data/${x}/text.org | tr -d " ") \
+            > ${feats_dir}/data/${x}/text
+        utils/text2token.py -n 1 -s 1 ${feats_dir}/data/${x}/text > ${feats_dir}/data/${x}/text.org
+        mv ${feats_dir}/data/${x}/text.org ${feats_dir}/data/${x}/text
+    done
+fi
+
+feat_train_dir=${feats_dir}/${dumpdir}/train; mkdir -p ${feat_train_dir}
+feat_dev_dir=${feats_dir}/${dumpdir}/dev; mkdir -p ${feat_dev_dir}
+feat_test_dir=${feats_dir}/${dumpdir}/test; mkdir -p ${feat_test_dir}
+if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
+    echo "stage 1: Feature Generation"
+    # compute fbank features
+    fbankdir=${feats_dir}/fbank
+    utils/compute_fbank.sh --cmd "$train_cmd" --nj $nj --speed_perturb ${speed_perturb} \
+        ${feats_dir}/data/train ${exp_dir}/exp/make_fbank/train ${fbankdir}/train
+    utils/fix_data_feat.sh ${fbankdir}/train
+    utils/compute_fbank.sh --cmd "$train_cmd" --nj $nj \
+        ${feats_dir}/data/dev ${exp_dir}/exp/make_fbank/dev ${fbankdir}/dev
+    utils/fix_data_feat.sh ${fbankdir}/dev
+    utils/compute_fbank.sh --cmd "$train_cmd" --nj $nj \
+        ${feats_dir}/data/test ${exp_dir}/exp/make_fbank/test ${fbankdir}/test
+    utils/fix_data_feat.sh ${fbankdir}/test
+     
+    # compute global cmvn
+    utils/compute_cmvn.sh --cmd "$train_cmd" --nj $nj \
+        ${fbankdir}/train ${exp_dir}/exp/make_fbank/train
+
+    # apply cmvn 
+    utils/apply_cmvn.sh --cmd "$train_cmd" --nj $nj \
+        ${fbankdir}/train ${fbankdir}/train/cmvn.json ${exp_dir}/exp/make_fbank/train ${feat_train_dir}
+    utils/apply_cmvn.sh --cmd "$train_cmd" --nj $nj \
+        ${fbankdir}/dev ${fbankdir}/train/cmvn.json ${exp_dir}/exp/make_fbank/dev ${feat_dev_dir}
+    utils/apply_cmvn.sh --cmd "$train_cmd" --nj $nj \
+        ${fbankdir}/test ${fbankdir}/train/cmvn.json ${exp_dir}/exp/make_fbank/test ${feat_test_dir}
+    
+    cp ${fbankdir}/train/text ${fbankdir}/train/speech_shape ${fbankdir}/train/text_shape ${feat_train_dir}
+    cp ${fbankdir}/dev/text ${fbankdir}/dev/speech_shape ${fbankdir}/dev/text_shape ${feat_dev_dir}
+    cp ${fbankdir}/test/text ${fbankdir}/test/speech_shape ${fbankdir}/test/text_shape ${feat_test_dir}
+
+    utils/fix_data_feat.sh ${feat_train_dir}
+    utils/fix_data_feat.sh ${feat_dev_dir}
+    utils/fix_data_feat.sh ${feat_test_dir}
+fi
+
+token_list=${feats_dir}/data/${lang}_token_list/char/tokens.txt
+echo "dictionary: ${token_list}"
+if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
+    echo "stage 2: Dictionary Preparation"
+    mkdir -p ${feats_dir}/data/${lang}_token_list/char/
+   
+    echo "make a dictionary"
+    echo "<blank>" > ${token_list}
+    echo "<s>" >> ${token_list}
+    echo "</s>" >> ${token_list}
+    utils/text2token.py -s 1 -n 1 --space "" ${feats_dir}/data/train/text | cut -f 2- -d" " | tr " " "\n" \
+        | sort | uniq | grep -a -v -e '^\s*$' | awk '{print $0}' >> ${token_list}
+    num_token=$(cat ${token_list} | wc -l)
+    echo "<unk>" >> ${token_list}
+    vocab_size=$(cat ${token_list} | wc -l)
+    awk -v v=,${vocab_size} '{print $0v}' ${feat_train_dir}/text_shape > ${feat_train_dir}/text_shape.char
+    awk -v v=,${vocab_size} '{print $0v}' ${feat_dev_dir}/text_shape > ${feat_dev_dir}/text_shape.char
+    mkdir -p ${feats_dir}/asr_stats_fbank_zh_char/train 
+    mkdir -p ${feats_dir}/asr_stats_fbank_zh_char/dev
+    cp ${feat_train_dir}/speech_shape ${feat_train_dir}/text_shape ${feat_train_dir}/text_shape.char ${feats_dir}/asr_stats_fbank_zh_char/train
+    cp ${feat_dev_dir}/speech_shape ${feat_dev_dir}/text_shape ${feat_dev_dir}/text_shape.char ${feats_dir}/asr_stats_fbank_zh_char/dev
+fi
+
+# Training Stage
+world_size=$gpu_num  # run on one machine
+if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
+    mkdir -p ${exp_dir}/exp/${model_dir}
+    mkdir -p ${exp_dir}/exp/${model_dir}/log
+    INIT_FILE=$exp_dir/ddp_init
+    if [ -f $INIT_FILE ];then
+        rm -f $INIT_FILE
+    fi 
+    init_method=file://$(readlink -f $INIT_FILE)
+    echo "$0: init method is $init_method"
+    for ((i = 0; i < $gpu_num; ++i)); do
+        {
+            rank=$i
+            local_rank=$i
+            gpu_id=$(echo $CUDA_VISIBLE_DEVICES | cut -d',' -f$((i + 1)))
+            asr_train_paraformer.py \
+                --gpu_id $gpu_id \
+                --use_preprocessor true \
+                --token_type char \
+                --token_list $token_list \
+                --train_data_path_and_name_and_type ${feats_dir}/${dumpdir}/${train_set}/${scp},speech,${type} \
+                --train_data_path_and_name_and_type ${feats_dir}/${dumpdir}/${train_set}/text,text,text \
+                --train_shape_file ${feats_dir}/asr_stats_fbank_zh_char/${train_set}/speech_shape \
+                --train_shape_file ${feats_dir}/asr_stats_fbank_zh_char/${train_set}/text_shape.char \
+                --valid_data_path_and_name_and_type ${feats_dir}/${dumpdir}/${valid_set}/${scp},speech,${type} \
+                --valid_data_path_and_name_and_type ${feats_dir}/${dumpdir}/${valid_set}/text,text,text \
+                --valid_shape_file ${feats_dir}/asr_stats_fbank_zh_char/${valid_set}/speech_shape \
+                --valid_shape_file ${feats_dir}/asr_stats_fbank_zh_char/${valid_set}/text_shape.char  \
+                --resume true \
+                --output_dir ${exp_dir}/exp/${model_dir} \
+                --config $asr_config \
+                --input_size $feats_dim \
+                --ngpu $gpu_num \
+                --num_worker_count $count \
+                --multiprocessing_distributed true \
+                --dist_init_method $init_method \
+                --dist_world_size $world_size \
+                --dist_rank $rank \
+                --local_rank $local_rank 1> ${exp_dir}/exp/${model_dir}/log/train.log.$i 2>&1
+        } &
+    done
+    wait
+fi
+
+# Testing Stage
+if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
+    utils/easy_asr_infer.sh \
+        --lang zh \
+        --datadir ${feats_dir} \
+        --feats_type ${feats_type} \
+        --feats_dim ${feats_dim} \
+        --token_type ${token_type} \
+        --gpu_inference ${gpu_inference} \
+        --inference_config "${inference_config}" \
+        --test_sets "${test_sets}" \
+        --token_list $token_list \
+        --asr_exp ${exp_dir}/exp/${model_dir} \
+        --stage 12 \
+        --stop_stage 12 \
+        --scp $scp \
+        --text text \
+        --inference_nj $inference_nj \
+        --njob $njob \
+        --inference_asr_model $inference_asr_model \
+        --gpuid_list $gpuid_list \
+        --mode paraformer
+fi
+
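
Review note on stage 2 of `run.sh` (dictionary preparation): the token list is `<blank>`, `<s>`, `</s>`, then every distinct character in the training transcripts, then `<unk>`. A minimal Python sketch of the same construction, assuming each input line is `<utt-id> <transcript>`; `build_token_list` is a hypothetical helper, and code-point order stands in for the shell `sort`, which is locale-dependent.

```python
def build_token_list(lines):
    """Build a character-level token list from `<utt-id> <transcript>` lines."""
    chars = set()
    for line in lines:
        parts = line.strip().split(maxsplit=1)
        if len(parts) < 2:
            continue  # utterance ID without a transcript
        # Every non-whitespace character of the transcript is one token.
        chars.update(ch for ch in parts[1] if not ch.isspace())
    # Special symbols first, <unk> last, mirroring the order in the script.
    return ["<blank>", "<s>", "</s>", *sorted(chars), "<unk>"]


if __name__ == "__main__":
    tokens = build_token_list(["utt1 你 好", "utt2 好 的"])
    print(tokens, len(tokens))
```

The length of the returned list corresponds to `vocab_size`, which the script appends to each line of `text_shape` to produce `text_shape.char`.
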
--- a/egs/aishell/paraformer/utils
+++ b/egs/aishell/paraformer/utils
@ -0,0 +1 @@
+../transformer/utils
--- a/egs/aishell/paraformer2/README.md
+++ b/egs/aishell/paraformer2/README.md
@ -0,0 +1,18 @@
+# ParaformerBert + SpecAugment + speed perturbation
+## Environments
+- date: `Mon Nov 21 13:25:30 CST 2022`
+- python version: `3.7.12`
+- FunASR version: `0.1.0`
+- pytorch version: `pytorch 1.7.0`
+
+## Config files
+- train config: conf/train_asr_paraformerbert_conformer_12e_6d_2048_256.yaml
+- model size: 46M
+- lm config: LM was not used
+- decode config: conf/decode_asr_transformer_noctc_1best.yaml (CTC was not used)
+
+## Results (CER)
+|   testset   | CER(%)  |
+|:-----------:|:-------:|
+|     dev     |  4.30   |
+|    test     |  4.80   |
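
Review note on the results table: the README does not say which scoring tool produced these numbers. As a reference point, a minimal sketch of the conventional character error rate, i.e. the character-level edit distance summed over all utterances and divided by the total number of reference characters. The helper names `edit_distance` and `cer` are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitute/insert/delete cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def cer(pairs):
    """CER in percent over (reference, hypothesis) string pairs."""
    errors = sum(edit_distance(ref, hyp) for ref, hyp in pairs)
    total = sum(len(ref) for ref, _ in pairs)
    return 100.0 * errors / total


if __name__ == "__main__":
    print(cer([("今天天气好", "今天天汽好")]))
```

One substituted character out of five reference characters gives a CER of 20%.
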
--- a/egs/aishell/paraformer2/conf/decode_asr_transformer.yaml
+++ b/egs/aishell/paraformer2/conf/decode_asr_transformer.yaml
@ -0,0 +1,6 @@
+beam_size: 10
+penalty: 0.0
+maxlenratio: 0.0
+minlenratio: 0.0
+ctc_weight: 0.5
+lm_weight: 0.7
--- a/egs/aishell/paraformer2/conf/decode_asr_transformer_noctc_1best.yaml
+++ b/egs/aishell/paraformer2/conf/decode_asr_transformer_noctc_1best.yaml
@ -0,0 +1,6 @@
+beam_size: 1
+penalty: 0.0
+maxlenratio: 0.0
+minlenratio: 0.0
+ctc_weight: 0.0
+lm_weight: 0.15
--- a/egs/aishell/paraformer2/conf/train_asr_paraformer_conformer_12e_6d_2048_256.yaml
+++ b/egs/aishell/paraformer2/conf/train_asr_paraformer_conformer_12e_6d_2048_256.yaml
@ -0,0 +1,92 @@
+# network architecture
+# encoder related
+encoder: conformer
+encoder_conf:
+    output_size: 256    # dimension of attention
+    attention_heads: 4
+    linear_units: 2048  # the number of units of position-wise feed forward
+    num_blocks: 12      # the number of encoder blocks
+    dropout_rate: 0.1
+    positional_dropout_rate: 0.1
+    attention_dropout_rate: 0.0
+    input_layer: conv2d # encoder architecture type
+    normalize_before: true
+    pos_enc_layer_type: rel_pos
+    selfattention_layer_type: rel_selfattn
+    activation_type: swish
+    macaron_style: true
+    use_cnn_module: true
+    cnn_module_kernel: 15
+
+# decoder related
+decoder: paraformer_decoder_san
+decoder_conf:
+    attention_heads: 4
+    linear_units: 2048
+    num_blocks: 6
+    dropout_rate: 0.1
+    positional_dropout_rate: 0.1
+    self_attention_dropout_rate: 0.0
+    src_attention_dropout_rate: 0.0
+
+# hybrid CTC/attention
+model: paraformer
+model_conf:
+    ctc_weight: 0.3
+    lsm_weight: 0.1     # label smoothing option
+    length_normalized_loss: false
+    predictor_weight: 1.0
+    sampling_ratio: 0.4
+
+# minibatch related
+batch_type: length
+batch_bins: 25000
+num_workers: 16
+
+# optimization related
+accum_grad: 1
+grad_clip: 5
+max_epoch: 50
+val_scheduler_criterion:
+    - valid
+    - acc
+best_model_criterion:
+-   - valid
+    - acc
+    - max
+keep_nbest_models: 10
+
+optim: adam
+optim_conf:
+   lr: 0.0005
+scheduler: warmuplr
+scheduler_conf:
+   warmup_steps: 30000
+
+specaug: specaug
+specaug_conf:
+    apply_time_warp: true
+    time_warp_window: 5
+    time_warp_mode: bicubic
+    apply_freq_mask: true
+    freq_mask_width_range:
+    - 0
+    - 30
+    num_freq_mask: 2
+    apply_time_mask: true
+    time_mask_width_range:
+    - 0
+    - 40
+    num_time_mask: 2
+
+predictor: cif_predictor
+predictor_conf:
+  idim: 256
+  threshold: 1.0
+  l_order: 1
+  r_order: 1
+  tail_threshold: 0.45
+
+
+log_interval: 50
+normalize: None
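
Review note on `specaug_conf`: a minimal pure-Python sketch of what `num_freq_mask: 2` with `freq_mask_width_range: [0, 30]` and `num_time_mask: 2` with `time_mask_width_range: [0, 40]` mean for one utterance. It assumes each mask zeroes a contiguous band whose width is drawn uniformly from the range with the upper bound excluded; whether the real implementation includes the upper bound is not confirmed here. `spec_mask` is a hypothetical helper, and time warping is not shown.

```python
import random


def spec_mask(spec, width_range, num_mask, axis, rng):
    """Return a copy of `spec` (T frames x F bins, nested lists) with bands zeroed.

    axis: "time" masks whole frames, "freq" masks whole frequency bins.
    """
    out = [row[:] for row in spec]
    n_frames, n_bins = len(out), len(out[0])
    size = n_frames if axis == "time" else n_bins
    for _ in range(num_mask):
        width = rng.randrange(width_range[0], width_range[1])  # upper bound excluded
        start = rng.randrange(0, max(1, size - width))
        for t in range(n_frames):
            for f in range(n_bins):
                idx = t if axis == "time" else f
                if start <= idx < start + width:
                    out[t][f] = 0.0
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    feats = [[1.0] * 80 for _ in range(100)]       # 100 frames of 80-dim fbank
    feats = spec_mask(feats, (0, 30), 2, "freq", rng)
    feats = spec_mask(feats, (0, 40), 2, "time", rng)
    print(sum(v == 0.0 for row in feats for v in row), "values masked")
```

With these settings, at most 58 of the 80 frequency bins and at most 78 frames can be masked in a single utterance under the stated assumption.
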
--- a/egs/aishell/paraformer2/conf/train_asr_paraformer_conformer_20e_6d_1280_320.yaml
+++ b/egs/aishell/paraformer2/conf/train_asr_paraformer_conformer_20e_6d_1280_320.yaml
@ -0,0 +1,94 @@
+# network architecture
+# encoder related
+encoder: conformer
+encoder_conf:
+    output_size: 320    # dimension of attention
+    attention_heads: 4
+    linear_units: 1280  # the number of units of position-wise feed forward
+    num_blocks: 20      # the number of encoder blocks
+    dropout_rate: 0.1
+    positional_dropout_rate: 0.1
+    attention_dropout_rate: 0.0
+    input_layer: conv2d # encoder architecture type
+    normalize_before: true
+    pos_enc_layer_type: rel_pos
+    selfattention_layer_type: rel_selfattn
+    activation_type: swish
+    macaron_style: true
+    use_cnn_module: true
+    cnn_module_kernel: 15
+
+# decoder related
+decoder: paraformer_decoder_san
+decoder_conf:
+    attention_heads: 4
+    linear_units: 2048
+    num_blocks: 6
+    dropout_rate: 0.1
+    positional_dropout_rate: 0.1
+    self_attention_dropout_rate: 0.0
+    src_attention_dropout_rate: 0.0
+
+# hybrid CTC/attention
+model: paraformer
+model_conf:
+    ctc_weight: 0.3
+    lsm_weight: 0.1     # label smoothing option
+    length_normalized_loss: false
+    predictor_weight: 1.0
+    sampling_ratio: 0.4
+
+# minibatch related
+#batch_type: length
+#batch_bins: 40000
+batch_type: numel
+batch_bins: 2000000
+num_workers: 16
+
+# optimization related
+accum_grad: 4
+grad_clip: 5
+max_epoch: 50
+val_scheduler_criterion:
+    - valid
+    - acc
+best_model_criterion:
+-   - valid
+    - acc
+    - max
+keep_nbest_models: 10
+
+optim: adam
+optim_conf:
+   lr: 0.0005
+scheduler: warmuplr
+scheduler_conf:
+   warmup_steps: 30000
+
+specaug: specaug
+specaug_conf:
+    apply_time_warp: true
+    time_warp_window: 5
+    time_warp_mode: bicubic
+    apply_freq_mask: true
+    freq_mask_width_range:
+    - 0
+    - 30
+    num_freq_mask: 2
+    apply_time_mask: true
+    time_mask_width_range:
+    - 0
+    - 40
+    num_time_mask: 2
+
+predictor: cif_predictor
+predictor_conf:
+  idim: 320  # must match encoder_conf.output_size
+  threshold: 1.0
+  l_order: 1
+  r_order: 1
+  tail_threshold: 0.45
+
+
+log_interval: 50
+normalize: None
--- a/egs/aishell/paraformer2/conf/train_asr_paraformerbert_conformer_12e_6d_2048_256.yaml
+++ b/egs/aishell/paraformer2/conf/train_asr_paraformerbert_conformer_12e_6d_2048_256.yaml
@ -0,0 +1,100 @@
+# network architecture
+# encoder related
+encoder: conformer
+encoder_conf:
+    output_size: 256    # dimension of attention
+    attention_heads: 4
+    linear_units: 2048  # the number of units of position-wise feed forward
+    num_blocks: 12      # the number of encoder blocks
+    dropout_rate: 0.1
+    positional_dropout_rate: 0.1
+    attention_dropout_rate: 0.0
+    input_layer: conv2d # encoder architecture type
+    normalize_before: true
+    pos_enc_layer_type: rel_pos
+    selfattention_layer_type: rel_selfattn
+    activation_type: swish
+    macaron_style: true
+    use_cnn_module: true
+    cnn_module_kernel: 15
+
+# decoder related
+decoder: paraformer_decoder_san
+decoder_conf:
+    attention_heads: 4
+    linear_units: 2048
+    num_blocks: 6
+    dropout_rate: 0.1
+    positional_dropout_rate: 0.1
+    self_attention_dropout_rate: 0.0
+    src_attention_dropout_rate: 0.0
+
+# hybrid CTC/attention
+model: paraformer_bert
+model_conf:
+    ctc_weight: 0.3
+    lsm_weight: 0.1     # label smoothing option
+    length_normalized_loss: false
+    predictor_weight: 1.0
+    sampling_ratio: 0.4
+    embeds_id: 3
+    embed_dims: 768
+    embeds_loss_weight: 2.0
+
+
+
+# minibatch related
+#batch_type: length
+#batch_bins: 40000
+batch_type: numel
+batch_bins: 2000000
+num_workers: 16
+
+# optimization related
+accum_grad: 4
+grad_clip: 5
+max_epoch: 50
+val_scheduler_criterion:
+    - valid
+    - acc
+best_model_criterion:
+-   - valid
+    - acc
+    - max
+keep_nbest_models: 10
+
+optim: adam
+optim_conf:
+   lr: 0.0005
+scheduler: warmuplr
+scheduler_conf:
+   warmup_steps: 30000
+
+specaug: specaug
+specaug_conf:
+    apply_time_warp: true
+    time_warp_window: 5
+    time_warp_mode: bicubic
+    apply_freq_mask: true
+    freq_mask_width_range:
+    - 0
+    - 30
+    num_freq_mask: 2
+    apply_time_mask: true
+    time_mask_width_range:
+    - 0
+    - 40
+    num_time_mask: 2
+
+predictor: cif_predictor
+predictor_conf:
+  idim: 256
+  threshold: 1.0
+  l_order: 1
+  r_order: 1
+  tail_threshold: 0.45
+
+
+log_interval: 50
+normalize: None
+allow_variable_data_keys: true
--- a/egs/aishell/paraformer2/local/aishell_data_prep.sh
+++ b/egs/aishell/paraformer2/local/aishell_data_prep.sh
@ -0,0 +1,65 @@
+#!/bin/bash
+
+# Copyright 2017 Xingyu Na
+# Apache 2.0
+
+#. ./path.sh || exit 1;
+
+if [ $# != 2 ]; then
+  echo "Usage: $0 <audio-path> <text-path>"
+  echo " $0 /export/a05/xna/data/data_aishell/wav /export/a05/xna/data/data_aishell/transcript"
+  exit 1;
+fi
+
+aishell_audio_dir=$1
+aishell_text=$2/aishell_transcript_v0.8.txt
+
+train_dir=data/local/train
+dev_dir=data/local/dev
+test_dir=data/local/test
+tmp_dir=data/local/tmp
+
+mkdir -p $train_dir
+mkdir -p $dev_dir
+mkdir -p $test_dir
+mkdir -p $tmp_dir
+
+# data directory check
+if [ ! -d $aishell_audio_dir ] || [ ! -f $aishell_text ]; then
+  echo "Error: $0 requires two directory arguments"
+  exit 1;
+fi
+
+# find wav audio file for train, dev and test resp.
+find $aishell_audio_dir -iname "*.wav" > $tmp_dir/wav.flist
+n=`cat $tmp_dir/wav.flist | wc -l`
+[ $n -ne 141925 ] && \
+  echo Warning: expected 141925 data files, found $n
+
+grep -i "wav/train" $tmp_dir/wav.flist > $train_dir/wav.flist || exit 1;
+grep -i "wav/dev" $tmp_dir/wav.flist > $dev_dir/wav.flist || exit 1;
+grep -i "wav/test" $tmp_dir/wav.flist > $test_dir/wav.flist || exit 1;
+
+rm -r $tmp_dir
+
+# Transcriptions preparation
+for dir in $train_dir $dev_dir $test_dir; do
+  echo Preparing $dir transcriptions
+  sed -e 's/\.wav//' $dir/wav.flist | awk -F '/' '{print $NF}' > $dir/utt.list
+  paste -d' ' $dir/utt.list $dir/wav.flist > $dir/wav.scp_all
+  utils/filter_scp.pl -f 1 $dir/utt.list $aishell_text > $dir/transcripts.txt
+  awk '{print $1}' $dir/transcripts.txt > $dir/utt.list
+  utils/filter_scp.pl -f 1 $dir/utt.list $dir/wav.scp_all | sort -u > $dir/wav.scp
+  sort -u $dir/transcripts.txt > $dir/text
+done
+
+mkdir -p data/train data/dev data/test
+
+for f in wav.scp text; do
+  cp $train_dir/$f data/train/$f || exit 1;
+  cp $dev_dir/$f data/dev/$f || exit 1;
+  cp $test_dir/$f data/test/$f || exit 1;
+done
+
+echo "$0: AISHELL data preparation succeeded"
+exit 0;
--- a/egs/aishell/paraformer2/local/extract_embeds.sh
+++ b/egs/aishell/paraformer2/local/extract_embeds.sh
@ -0,0 +1,67 @@
+#!/usr/bin/env bash
+
+stage=1
+stop_stage=3
+
+bert_model_root="../../huggingface_models"
+bert_model_name="bert-base-chinese"
+#bert_model_name="chinese-roberta-wwm-ext"
+#bert_model_name="mengzi-bert-base"
+raw_dataset_path=~/Funasr_data/aishell-1
+model_path=${bert_model_root}/${bert_model_name}
+
+. utils/parse_options.sh || exit 1;
+
+nj=32
+
+for data_set in train dev test;do
+    scp=$raw_dataset_path/dump/fbank/${data_set}/text
+    local_scp_dir_raw=$raw_dataset_path/embeds/$bert_model_name/${data_set}
+    local_scp_dir=$local_scp_dir_raw/split$nj
+    local_records_dir=$local_scp_dir_raw/ark
+
+    mkdir -p $local_records_dir
+    mkdir -p $local_scp_dir
+
+    split_scps=""
+    for JOB in $(seq ${nj}); do
+        split_scps="$split_scps $local_scp_dir/data.$JOB.text"
+    done
+
+    utils/split_scp.pl $scp ${split_scps}
+
+
+    for num in {0..7};do
+        tmp=`expr $num \* 4`
+
+        if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
+            for idx in {1..4}; do
+                JOB=`expr $tmp + $idx`
+                echo "process jobid=$JOB"
+                {
+
+                beg=0
+                gpu=`expr $beg + $idx`
+                echo ${local_scp_dir}/log.${JOB}
+                python utils/extract_embeds.py $local_scp_dir/data.$JOB.text ${local_records_dir}/embeds.${JOB}.ark ${local_records_dir}/embeds.${JOB}.scp ${local_records_dir}/embeds.${JOB}.shape ${gpu} ${model_path} &> ${local_scp_dir}/log.${JOB}
+            } &
+            done
+            wait
+        fi
+    done
+
+    if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
+        for JOB in $(seq ${nj}); do
+            cat ${local_records_dir}/embeds.${JOB}.scp || exit 1;
+        done > ${local_scp_dir_raw}/embeds.scp
+
+        sed 's#nfs#data\/volume1#g' ${local_scp_dir_raw}/embeds.scp > ${local_scp_dir_raw}/embeds.scp.pai
+
+        for JOB in $(seq ${nj}); do
+            cat ${local_records_dir}/embeds.${JOB}.shape || exit 1;
+        done > ${local_scp_dir_raw}/embeds.shape
+    fi
+done
+
+echo "embeds are in: ${local_scp_dir_raw}"
+echo "success"
--- a/egs/aishell/paraformer2/path.sh
+++ b/egs/aishell/paraformer2/path.sh
@ -0,0 +1,5 @@
+export FUNASR_DIR=$PWD/../../..
+
+# NOTE(kan-bayashi): Use UTF-8 in Python to avoid UnicodeDecodeError when LC_ALL=C
+export PYTHONIOENCODING=UTF-8
+export PATH=$FUNASR_DIR/funasr/bin:$PATH
--- a/egs/aishell/paraformer2/run.sh
+++ b/egs/aishell/paraformer2/run.sh
@ -0,0 +1,226 @@
+#!/usr/bin/env bash
+
+. ./path.sh || exit 1;
+
+# machines configuration
+CUDA_VISIBLE_DEVICES="0,1"
+gpu_num=2
+count=1
+gpu_inference=true  # Whether to perform gpu decoding, set false for cpu decoding
+# for gpu decoding, inference_nj=ngpu*njob; for cpu decoding, inference_nj=njob
+njob=8
+train_cmd=utils/run.pl
+
+# general configuration
+feats_dir=".." # feature output directory, for large data
+lang=zh
+dumpdir=dump/fbank
+feats_type=fbank
+token_type=char
+scp=feats.scp
+type=kaldi_ark
+stage=0
+stop_stage=4
+
+skip_extract_embed=false
+bert_model_root="../../huggingface_models"
+bert_model_name="bert-base-chinese"
+
+# feature configuration
+feats_dim=80
+sample_frequency=16000
+nj=32
+speed_perturb="0.9,1.0,1.1"
+
+# data
+data_aishell=
+
+# exp tag
+tag=""
+
+. utils/parse_options.sh || exit 1;
+
+# Set bash to strict mode; it will exit on:
+# -e 'error', -u 'undefined variable', -o pipefail 'error in pipeline'
+set -e
+set -u
+set -o pipefail
+
+train_set=train
+valid_set=dev
+test_sets="dev test"
+
+asr_config=conf/train_asr_paraformerbert_conformer_12e_6d_2048_256.yaml
+run_dir="exp"
+model_dir="baseline_$(basename "${asr_config}" .yaml)_${feats_type}_${lang}_${token_type}_${tag}"
+exp_dir=$run_dir/$model_dir
+
+inference_config=conf/decode_asr_transformer.yaml
+inference_asr_model=valid.acc.ave_10best.pth
+
+# you can set gpu num for decoding here
+gpuid_list=$CUDA_VISIBLE_DEVICES  # set gpus for decoding, the same as training stage by default
+ngpu=$(echo $gpuid_list | awk -F "," '{print NF}')
+
+if ${gpu_inference}; then
+    inference_nj=$((ngpu * njob))
+else
+    inference_nj=$njob
+fi
+
+if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
+    echo "stage 0: Data preparation"
+    # Data preparation
+    local/aishell_data_prep.sh ${data_aishell}/data_aishell/wav ${data_aishell}/data_aishell/transcript
+    for x in train dev test; do
+        cp data/${x}/text data/${x}/text.org
+        paste -d " " <(cut -f 1 -d" " data/${x}/text.org) <(cut -f 2- -d" " data/${x}/text.org | tr -d " ") \
+            > data/${x}/text
+        utils/text2token.py -n 1 -s 1 data/${x}/text > data/${x}/text.org
+        mv data/${x}/text.org data/${x}/text
+    done
+fi
+
+feat_train_dir=${feats_dir}/${dumpdir}/train; mkdir -p ${feat_train_dir}
+feat_dev_dir=${feats_dir}/${dumpdir}/dev; mkdir -p ${feat_dev_dir}
+feat_test_dir=${feats_dir}/${dumpdir}/test; mkdir -p ${feat_test_dir}
+if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
+    echo "stage 1: Feature Generation"
+    # compute fbank features
+    fbankdir=${feats_dir}/fbank
+    utils/compute_fbank.sh --cmd "$train_cmd" --nj $nj --speed_perturb ${speed_perturb} \
+        data/train exp/make_fbank/train ${fbankdir}/train
+    utils/fix_data_feat.sh ${fbankdir}/train
+    utils/compute_fbank.sh --cmd "$train_cmd" --nj $nj \
+        data/dev exp/make_fbank/dev ${fbankdir}/dev
+    utils/fix_data_feat.sh ${fbankdir}/dev
+    utils/compute_fbank.sh --cmd "$train_cmd" --nj $nj \
+        data/test exp/make_fbank/test ${fbankdir}/test
+    utils/fix_data_feat.sh ${fbankdir}/test
+     
+    # compute global cmvn
+    utils/compute_cmvn.sh --cmd "$train_cmd" --nj $nj \
+        ${fbankdir}/train exp/make_fbank/train
+
+    # apply cmvn 
+    utils/apply_cmvn.sh --cmd "$train_cmd" --nj $nj \
+        ${fbankdir}/train ${fbankdir}/train/cmvn.json exp/make_fbank/train ${feat_train_dir}
+    utils/apply_cmvn.sh --cmd "$train_cmd" --nj $nj \
+        ${fbankdir}/dev ${fbankdir}/train/cmvn.json exp/make_fbank/dev ${feat_dev_dir}
+    utils/apply_cmvn.sh --cmd "$train_cmd" --nj $nj \
+        ${fbankdir}/test ${fbankdir}/train/cmvn.json exp/make_fbank/test ${feat_test_dir}
+    
+    cp ${fbankdir}/train/text ${fbankdir}/train/speech_shape ${fbankdir}/train/text_shape ${feat_train_dir}
+    cp ${fbankdir}/dev/text ${fbankdir}/dev/speech_shape ${fbankdir}/dev/text_shape ${feat_dev_dir}
+    cp ${fbankdir}/test/text ${fbankdir}/test/speech_shape ${fbankdir}/test/text_shape ${feat_test_dir}
+
+    utils/fix_data_feat.sh ${feat_train_dir}
+    utils/fix_data_feat.sh ${feat_dev_dir}
+    utils/fix_data_feat.sh ${feat_test_dir}
+fi
+
+token_list=${feats_dir}/data/${lang}_token_list/char/tokens.txt
+echo "dictionary: ${token_list}"
+if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
+    echo "stage 2: Dictionary Preparation"
+    mkdir -p ${feats_dir}/data/${lang}_token_list/char/
+   
+    echo "make a dictionary"
+    echo "<blank>" > ${token_list}
+    echo "<s>" >> ${token_list}
+    echo "</s>" >> ${token_list}
+    utils/text2token.py -s 1 -n 1 --space "" data/train/text | cut -f 2- -d" " | tr " " "\n" \
+        | sort | uniq | grep -a -v -e '^\s*$' | awk '{print $0}' >> ${token_list}
+    num_token=$(cat ${token_list} | wc -l)
+    echo "<unk>" >> ${token_list}
+    vocab_size=$(cat ${token_list} | wc -l)
+    awk -v v=,${vocab_size} '{print $0v}' ${feat_train_dir}/text_shape > ${feat_train_dir}/text_shape.char
+    awk -v v=,${vocab_size} '{print $0v}' ${feat_dev_dir}/text_shape > ${feat_dev_dir}/text_shape.char
+    mkdir -p ${feats_dir}/asr_stats_fbank_zh_char/train
+    mkdir -p ${feats_dir}/asr_stats_fbank_zh_char/dev
+    cp ${feat_train_dir}/speech_shape ${feat_train_dir}/text_shape ${feat_train_dir}/text_shape.char ${feats_dir}/asr_stats_fbank_zh_char/train
+    cp ${feat_dev_dir}/speech_shape ${feat_dev_dir}/text_shape ${feat_dev_dir}/text_shape.char ${feats_dir}/asr_stats_fbank_zh_char/dev
+fi
+
+if ! "${skip_extract_embed}"; then
+    local/extract_embeds.sh \
+        --bert_model_root ${bert_model_root} \
+        --bert_model_name ${bert_model_name} \
+        --raw_dataset_path ${feats_dir}
+fi
+
+# Training Stage
+world_size=$gpu_num  # run on one machine
+if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
+    mkdir -p $exp_dir
+    mkdir -p $exp_dir/log
+    INIT_FILE=$exp_dir/ddp_init
+    if [ -f $INIT_FILE ];then
+        rm -f $INIT_FILE
+    fi
+    init_method=file://$(readlink -f $INIT_FILE)
+    echo "$0: init method is $init_method"
+    for ((i = 0; i < $gpu_num; ++i)); do
+        {
+            rank=$i
+            local_rank=$i
+            gpu_id=$(echo $CUDA_VISIBLE_DEVICES | cut -d',' -f$((i + 1)))
+            asr_train_paraformer.py \
+                --gpu_id $gpu_id \
+                --use_preprocessor true \
+                --token_type char \
+                --token_list $token_list \
+                --train_data_path_and_name_and_type ${feats_dir}/${dumpdir}/${train_set}/${scp},speech,${type} \
+                --train_data_path_and_name_and_type ${feats_dir}/${dumpdir}/${train_set}/text,text,text \
+                --train_data_path_and_name_and_type ${feats_dir}/embeds/${bert_model_name}/${train_set}/embeds.scp,embed,${type} \
+                --train_shape_file ${feats_dir}/asr_stats_fbank_zh_char/${train_set}/speech_shape \
+                --train_shape_file ${feats_dir}/asr_stats_fbank_zh_char/${train_set}/text_shape.char \
+                --train_shape_file ${feats_dir}/embeds/${bert_model_name}/${train_set}/embeds.shape \
+                --valid_data_path_and_name_and_type ${feats_dir}/${dumpdir}/${valid_set}/${scp},speech,${type} \
+                --valid_data_path_and_name_and_type ${feats_dir}/${dumpdir}/${valid_set}/text,text,text \
+                --valid_data_path_and_name_and_type ${feats_dir}/embeds/${bert_model_name}/${valid_set}/embeds.scp,embed,${type} \
+                --valid_shape_file ${feats_dir}/asr_stats_fbank_zh_char/${valid_set}/speech_shape \
+                --valid_shape_file ${feats_dir}/asr_stats_fbank_zh_char/${valid_set}/text_shape.char  \
+                --valid_shape_file ${feats_dir}/embeds/${bert_model_name}/${valid_set}/embeds.shape \
+                --resume true \
+                --output_dir $exp_dir \
+                --config $asr_config \
+                --input_size $feats_dim \
+                --ngpu $gpu_num \
+                --num_worker_count $count \
+                --multiprocessing_distributed true \
+                --dist_init_method $init_method \
+                --dist_world_size $world_size \
+                --dist_rank $rank \
+                --allow_variable_data_keys true \
+                --local_rank $local_rank 1> $exp_dir/log/train.log.$i 2>&1
+        } &
+    done
+    wait
+fi
+
+# Testing Stage
+if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
+    utils/easy_asr_infer.sh \
+        --lang zh \
+        --datadir ${feats_dir} \
+        --feats_type ${feats_type} \
+        --feats_dim ${feats_dim} \
+        --token_type ${token_type} \
+        --gpu_inference ${gpu_inference} \
+        --inference_config "${inference_config}" \
+        --test_sets "${test_sets}" \
+        --token_list $token_list \
+        --asr_exp $exp_dir \
+        --stage 12 \
+        --stop_stage 12 \
+        --scp $scp \
+        --text text \
+        --inference_nj $inference_nj \
+        --njob $njob \
+        --inference_asr_model $inference_asr_model \
+        --gpuid_list $gpuid_list \
+        --gpu_inference ${gpu_inference} \
+        --mode paraformer
+fi
+
--- a/egs/aishell/paraformer2/utils
+++ b/egs/aishell/paraformer2/utils
@ -0,0 +1 @@
+../tranformer/utils
--- a/egs/aishell/tranformer/conf/decode_asr_transformer.yaml
+++ b/egs/aishell/tranformer/conf/decode_asr_transformer.yaml
@ -0,0 +1,6 @@
+beam_size: 10
+penalty: 0.0
+maxlenratio: 0.0
+minlenratio: 0.0
+ctc_weight: 0.5
+lm_weight: 0.7
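How these decoding weights are consumed is not shown in this diff. The sketch below assumes the ESPnet-style convention for joint CTC/attention beam search, where the attention decoder is weighted by `1 - ctc_weight`, the LM score is added with `lm_weight`, and `penalty` acts as a per-token length bonus. The function name `joint_score` and its arguments are hypothetical and only illustrate the arithmetic.

```python
def joint_score(att_logp, ctc_logp, lm_logp, length,
                ctc_weight=0.5, lm_weight=0.7, penalty=0.0):
    """Combine per-hypothesis log-probabilities (assumed ESPnet-style weighting)."""
    return ((1.0 - ctc_weight) * att_logp   # attention decoder score
            + ctc_weight * ctc_logp         # CTC prefix score
            + lm_weight * lm_logp           # external language model score
            + penalty * length)             # length bonus, 0.0 in this config


# Hypothetical scores for one 5-token hypothesis.
score = joint_score(att_logp=-1.0, ctc_logp=-2.0, lm_logp=-3.0, length=5)
print(score)
```

With `ctc_weight: 0.5` both acoustic scorers contribute equally; under this assumed convention, setting it to `0.0` would reduce decoding to attention plus LM.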
--- a/egs/aishell/tranformer/conf/train_asr_conformer.yaml
+++ b/egs/aishell/tranformer/conf/train_asr_conformer.yaml
@ -0,0 +1,80 @@
+# network architecture
+# encoder related
+encoder: conformer
+encoder_conf:
+    output_size: 256    # dimension of attention
+    attention_heads: 4
+    linear_units: 2048  # the number of units of position-wise feed forward
+    num_blocks: 12      # the number of encoder blocks
+    dropout_rate: 0.1
+    positional_dropout_rate: 0.1
+    attention_dropout_rate: 0.0
+    input_layer: conv2d # encoder architecture type
+    normalize_before: true
+    pos_enc_layer_type: rel_pos
+    selfattention_layer_type: rel_selfattn
+    activation_type: swish
+    macaron_style: true
+    use_cnn_module: true
+    cnn_module_kernel: 15
+
+# decoder related
+decoder: transformer
+decoder_conf:
+    attention_heads: 4
+    linear_units: 2048
+    num_blocks: 6
+    dropout_rate: 0.1
+    positional_dropout_rate: 0.1
+    self_attention_dropout_rate: 0.0
+    src_attention_dropout_rate: 0.0
+
+# hybrid CTC/attention
+model_conf:
+    ctc_weight: 0.3
+    lsm_weight: 0.1     # label smoothing option
+    length_normalized_loss: false
+
+# minibatch related
+batch_type: length
+batch_bins: 25000
+num_workers: 16
+
+# optimization related
+accum_grad: 1
+grad_clip: 5
+max_epoch: 50
+val_scheduler_criterion:
+    - valid
+    - acc
+best_model_criterion:
+-   - valid
+    - acc
+    - max
+keep_nbest_models: 10
+
+optim: adam
+optim_conf:
+   lr: 0.0005
+scheduler: warmuplr
+scheduler_conf:
+   warmup_steps: 30000
+
+specaug: specaug
+specaug_conf:
+    apply_time_warp: true
+    time_warp_window: 5
+    time_warp_mode: bicubic
+    apply_freq_mask: true
+    freq_mask_width_range:
+    - 0
+    - 30
+    num_freq_mask: 2
+    apply_time_mask: true
+    time_mask_width_range:
+    - 0
+    - 40
+    num_time_mask: 2
+
+log_interval: 50
+normalize: None
--- a/egs/aishell/tranformer/conf/train_asr_transformer.yaml
+++ b/egs/aishell/tranformer/conf/train_asr_transformer.yaml
@ -0,0 +1,70 @@
+# network architecture
+# encoder related
+encoder: transformer
+encoder_conf:
+    output_size: 256    # dimension of attention
+    attention_heads: 4
+    linear_units: 2048  # the number of units of position-wise feed forward
+    num_blocks: 12      # the number of encoder blocks
+    dropout_rate: 0.1
+    positional_dropout_rate: 0.1
+    attention_dropout_rate: 0.0
+    input_layer: conv2d # encoder architecture type
+    normalize_before: true
+
+# decoder related
+decoder: transformer
+decoder_conf:
+    attention_heads: 4
+    linear_units: 2048
+    num_blocks: 6
+    dropout_rate: 0.1
+    positional_dropout_rate: 0.1
+    self_attention_dropout_rate: 0.0
+    src_attention_dropout_rate: 0.0
+
+# hybrid CTC/attention
+model_conf:
+    ctc_weight: 0.3
+    lsm_weight: 0.1     # label smoothing option
+    length_normalized_loss: false
+
+# minibatch related
+batch_type: length
+batch_bins: 32000
+num_workers: 8
+
+# optimization related
+accum_grad: 1
+grad_clip: 5
+patience: 3
+max_epoch: 20
+val_scheduler_criterion:
+    - valid
+    - acc
+best_model_criterion:
+-   - valid
+    - acc
+    - max
+keep_nbest_models: 10
+
+# NoamLR is deprecated. Use WarmupLR.
+# The following is equivalent setting for NoamLR:
+#
+#    optim: adam
+#    optim_conf:
+#        lr: 10.
+#    scheduler: noamlr
+#    scheduler_conf:
+#        model_size: 256
+#        warmup_steps: 25000
+#
+optim: adam
+optim_conf:
+    lr: 0.002
+scheduler: warmuplr     # pytorch v1.1.0+ required
+scheduler_conf:
+    warmup_steps: 25000
+
+log_interval: 50
+normalize: None
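The NoamLR comment in this config can be checked numerically. The sketch below assumes the ESPnet-style definitions of the two schedulers (the scheduler code itself is not part of this diff), and the helper names are hypothetical. Under those definitions a NoamLR setup with `lr: 10.`, `model_size: 256`, `warmup_steps: 25000` corresponds to a WarmupLR base rate of roughly 0.00395, so the active `lr: 0.002` is about half of that rather than an exact equivalent.

```python
def noam_lr(step, base_lr, model_size, warmup_steps):
    # Assumed NoamLR rule: lr * d_model^-0.5 * min(step^-0.5, step * warmup^-1.5)
    return base_lr * model_size ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


def warmup_lr(step, base_lr, warmup_steps):
    # Assumed WarmupLR rule: lr * warmup^0.5 * min(step^-0.5, step * warmup^-1.5)
    return base_lr * warmup_steps ** 0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


def equivalent_warmup_base_lr(noam_base_lr, model_size, warmup_steps):
    # Base rate that makes warmup_lr match noam_lr at every step.
    return noam_base_lr * model_size ** -0.5 * warmup_steps ** -0.5


eq = equivalent_warmup_base_lr(10.0, 256, 25000)
print(round(eq, 6))                    # peak rate of the commented NoamLR setup
print(warmup_lr(25000, 0.002, 25000))  # peak rate of the active setup
```

In WarmupLR the base rate is also the peak rate, reached at `step == warmup_steps`, which makes the config easier to read than the NoamLR form.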
--- a/egs/aishell/tranformer/local/aishell_data_prep.sh
+++ b/egs/aishell/tranformer/local/aishell_data_prep.sh
@ -0,0 +1,66 @@
+#!/bin/bash
+
+# Copyright 2017 Xingyu Na
+# Apache 2.0
+
+#. ./path.sh || exit 1;
+
+if [ $# != 3 ]; then
+  echo "Usage: $0 <audio-path> <text-path> <output-path>"
+  echo " $0 /export/a05/xna/data/data_aishell/wav /export/a05/xna/data/data_aishell/transcript data"
+  exit 1;
+fi
+
+aishell_audio_dir=$1
+aishell_text=$2/aishell_transcript_v0.8.txt
+output_dir=$3
+
+train_dir=$output_dir/data/local/train
+dev_dir=$output_dir/data/local/dev
+test_dir=$output_dir/data/local/test
+tmp_dir=$output_dir/data/local/tmp
+
+mkdir -p $train_dir
+mkdir -p $dev_dir
+mkdir -p $test_dir
+mkdir -p $tmp_dir
+
+# data directory check
+if [ ! -d $aishell_audio_dir ] || [ ! -f $aishell_text ]; then
+  echo "Error: $0: audio directory $aishell_audio_dir or transcript file $aishell_text not found"
+  exit 1;
+fi
+
+# find wav audio file for train, dev and test resp.
+find $aishell_audio_dir -iname "*.wav" > $tmp_dir/wav.flist
+n=`cat $tmp_dir/wav.flist | wc -l`
+[ $n -ne 141925 ] && \
+  echo Warning: expected 141925 data files, found $n
+
+grep -i "wav/train" $tmp_dir/wav.flist > $train_dir/wav.flist || exit 1;
+grep -i "wav/dev" $tmp_dir/wav.flist > $dev_dir/wav.flist || exit 1;
+grep -i "wav/test" $tmp_dir/wav.flist > $test_dir/wav.flist || exit 1;
+
+rm -r $tmp_dir
+
+# Transcriptions preparation
+for dir in $train_dir $dev_dir $test_dir; do
+  echo Preparing $dir transcriptions
+  sed -e 's/\.wav//' $dir/wav.flist | awk -F '/' '{print $NF}' > $dir/utt.list
+  paste -d' ' $dir/utt.list $dir/wav.flist > $dir/wav.scp_all
+  utils/filter_scp.pl -f 1 $dir/utt.list $aishell_text > $dir/transcripts.txt
+  awk '{print $1}' $dir/transcripts.txt > $dir/utt.list
+  utils/filter_scp.pl -f 1 $dir/utt.list $dir/wav.scp_all | sort -u > $dir/wav.scp
+  sort -u $dir/transcripts.txt > $dir/text
+done
+
+mkdir -p $output_dir/data/train $output_dir/data/dev $output_dir/data/test
+
+for f in wav.scp text; do
+  cp $train_dir/$f $output_dir/data/train/$f || exit 1;
+  cp $dev_dir/$f $output_dir/data/dev/$f || exit 1;
+  cp $test_dir/$f $output_dir/data/test/$f || exit 1;
+done
+
+echo "$0: AISHELL data preparation succeeded"
+exit 0;
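The transcription loop in `aishell_data_prep.sh` boils down to a join between utterance IDs derived from wav file names and the IDs in the transcript file. A minimal Python sketch of that join follows; it is not part of the recipe, `build_tables` is a hypothetical name, and the sample paths and transcript are made up for illustration.

```python
import os


def build_tables(wav_paths, transcripts):
    """Return (wav_scp_lines, text_lines) for utterances that have both audio and text."""
    # Utterance ID = file name without directory and extension.
    by_id = {os.path.splitext(os.path.basename(p))[0]: p for p in wav_paths}
    # Keep only IDs present in the transcript table, sorted like `sort -u`.
    kept = sorted(utt for utt in by_id if utt in transcripts)
    wav_scp = ["{} {}".format(utt, by_id[utt]) for utt in kept]
    text = ["{} {}".format(utt, transcripts[utt]) for utt in kept]
    return wav_scp, text


# Hypothetical inputs: two audio files, only one of which has a transcript.
paths = ["/corpus/wav/train/S0002/UTT0001.wav", "/corpus/wav/train/S0002/UTT0002.wav"]
trans = {"UTT0001": "ni hao"}
wav_scp, text = build_tables(paths, trans)
print(wav_scp)
print(text)
```

Audio without a transcript is dropped, which matches the effect of the two `filter_scp.pl` calls in the loop.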
--- a/egs/aishell/tranformer/local/prepare_data.sh
+++ b/egs/aishell/tranformer/local/prepare_data.sh
@ -0,0 +1,53 @@
+#!/usr/bin/env bash
+# Copyright 2018 AIShell-Foundation(Authors:Jiayu DU, Xingyu NA, Bengu WU, Hao ZHENG)
+#           2018 Beijing Shell Shell Tech. Co. Ltd. (Author: Hui BU)
+# Apache 2.0
+
+# transform raw AISHELL-2 data to kaldi format
+
+. ./path.sh || exit 1;
+
+tmp=
+dir=
+
+if [ $# != 3 ]; then
+  echo "Usage: $0 <corpus-data-dir> <tmp-dir> <output-dir>"
+  echo " $0 /export/AISHELL-2/iOS/train data/local/train data/train"
+  exit 1;
+fi
+
+corpus=$1
+tmp=$2
+dir=$3
+
+echo "prepare_data.sh: Preparing data in $corpus"
+
+mkdir -p $tmp
+mkdir -p $dir
+
+# corpus check
+if [ ! -d $corpus ] || [ ! -f $corpus/wav.scp ] || [ ! -f $corpus/trans.txt ]; then
+  echo "Error: $0 requires wav.scp and trans.txt under $corpus directory."
+  exit 1;
+fi
+
+# validate utt-key list, IC0803W0380 is a bad utterance
+awk '{print $1}' $corpus/wav.scp | grep -v 'IC0803W0380' > $tmp/wav_utt.list
+awk '{print $1}' $corpus/trans.txt > $tmp/trans_utt.list
+utils/filter_scp.pl -f 1 $tmp/wav_utt.list $tmp/trans_utt.list > $tmp/utt.list
+
+# wav.scp
+awk -F'\t' -v path_prefix=$corpus '{printf("%s\t%s/%s\n",$1,path_prefix,$2)}' $corpus/wav.scp > $tmp/tmp_wav.scp
+utils/filter_scp.pl -f 1 $tmp/utt.list $tmp/tmp_wav.scp | sort -k 1 | uniq > $tmp/wav.scp
+
+# text
+utils/filter_scp.pl -f 1 $tmp/utt.list $corpus/trans.txt | sort -k 1 | uniq > $tmp/text
+
+# copy prepared resources from tmp_dir to target dir
+mkdir -p $dir
+for f in wav.scp text; do
+  cp $tmp/$f $dir/$f || exit 1;
+done
+
+echo "local/prepare_data.sh succeeded"
+exit 0;
--- a/egs/aishell/tranformer/path.sh
+++ b/egs/aishell/tranformer/path.sh
@ -0,0 +1,5 @@
+export FUNASR_DIR=$PWD/../../..
+
+# NOTE(kan-bayashi): Use UTF-8 in Python to avoid UnicodeDecodeError when LC_ALL=C
+export PYTHONIOENCODING=UTF-8
+export PATH=$FUNASR_DIR/funasr/bin:$PATH
--- a/egs/aishell/tranformer/run.sh
+++ b/egs/aishell/tranformer/run.sh
@ -0,0 +1,208 @@
+#!/usr/bin/env bash
+
+. ./path.sh || exit 1;
+
+# machines configuration
+CUDA_VISIBLE_DEVICES="0,1"
+gpu_num=2
+count=1
+gpu_inference=true  # Whether to perform gpu decoding, set false for cpu decoding
+# for gpu decoding, inference_nj=ngpu*njob; for cpu decoding, inference_nj=njob
+njob=8
+train_cmd=utils/run.pl
+
+# general configuration
+feats_dir=".." # feature output directory, for large data
+exp_dir="."
+lang=zh
+dumpdir=dump/fbank
+feats_type=fbank
+token_type=char
+scp=feats.scp
+type=kaldi_ark
+stage=0
+stop_stage=4
+
+# feature configuration
+feats_dim=80
+sample_frequency=16000
+nj=32
+speed_perturb="0.9,1.0,1.1"
+
+# data
+data_aishell=
+
+# exp tag
+tag=""
+
+. utils/parse_options.sh || exit 1;
+
+# Set bash to strict mode; it will exit on:
+# -e 'error', -u 'undefined variable', -o pipefail 'error in pipeline'
+set -e
+set -u
+set -o pipefail
+
+train_set=train
+valid_set=dev
+test_sets="dev test"
+
+asr_config=conf/train_asr_conformer.yaml
+model_dir="baseline_$(basename "${asr_config}" .yaml)_${feats_type}_${lang}_${token_type}_${tag}"
+
+inference_config=conf/decode_asr_transformer.yaml
+inference_asr_model=valid.acc.ave_10best.pth
+
+# you can set gpu num for decoding here
+gpuid_list=$CUDA_VISIBLE_DEVICES  # set gpus for decoding, the same as training stage by default
+ngpu=$(echo $gpuid_list | awk -F "," '{print NF}')
+
+if ${gpu_inference}; then
+    inference_nj=$((ngpu * njob))
+else
+    inference_nj=$njob
+fi
+
+if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
+    echo "stage 0: Data preparation"
+    # Data preparation
+    local/aishell_data_prep.sh ${data_aishell}/data_aishell/wav ${data_aishell}/data_aishell/transcript ${feats_dir}
+    for x in train dev test; do
+        cp ${feats_dir}/data/${x}/text ${feats_dir}/data/${x}/text.org
+        paste -d " " <(cut -f 1 -d" " ${feats_dir}/data/${x}/text.org) <(cut -f 2- -d" " ${feats_dir}/data/${x}/text.org | tr -d " ") \
+            > ${feats_dir}/data/${x}/text
+        utils/text2token.py -n 1 -s 1 ${feats_dir}/data/${x}/text > ${feats_dir}/data/${x}/text.org
+        mv ${feats_dir}/data/${x}/text.org ${feats_dir}/data/${x}/text
+    done
+fi
+
+feat_train_dir=${feats_dir}/${dumpdir}/train; mkdir -p ${feat_train_dir}
+feat_dev_dir=${feats_dir}/${dumpdir}/dev; mkdir -p ${feat_dev_dir}
+feat_test_dir=${feats_dir}/${dumpdir}/test; mkdir -p ${feat_test_dir}
+if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
+    echo "stage 1: Feature Generation"
+    # compute fbank features
+    fbankdir=${feats_dir}/fbank
+    utils/compute_fbank.sh --cmd "$train_cmd" --nj $nj --speed_perturb ${speed_perturb} \
+        ${feats_dir}/data/train ${exp_dir}/exp/make_fbank/train ${fbankdir}/train
+    utils/fix_data_feat.sh ${fbankdir}/train
+    utils/compute_fbank.sh --cmd "$train_cmd" --nj $nj \
+        ${feats_dir}/data/dev ${exp_dir}/exp/make_fbank/dev ${fbankdir}/dev
+    utils/fix_data_feat.sh ${fbankdir}/dev
+    utils/compute_fbank.sh --cmd "$train_cmd" --nj $nj \
+        ${feats_dir}/data/test ${exp_dir}/exp/make_fbank/test ${fbankdir}/test
+    utils/fix_data_feat.sh ${fbankdir}/test
+     
+    # compute global cmvn
+    utils/compute_cmvn.sh --cmd "$train_cmd" --nj $nj \
+        ${fbankdir}/train ${exp_dir}/exp/make_fbank/train
+
+    # apply cmvn 
+    utils/apply_cmvn.sh --cmd "$train_cmd" --nj $nj \
+        ${fbankdir}/train ${fbankdir}/train/cmvn.json ${exp_dir}/exp/make_fbank/train ${feat_train_dir}
+    utils/apply_cmvn.sh --cmd "$train_cmd" --nj $nj \
+        ${fbankdir}/dev ${fbankdir}/train/cmvn.json ${exp_dir}/exp/make_fbank/dev ${feat_dev_dir}
+    utils/apply_cmvn.sh --cmd "$train_cmd" --nj $nj \
+        ${fbankdir}/test ${fbankdir}/train/cmvn.json ${exp_dir}/exp/make_fbank/test ${feat_test_dir}
+    
+    cp ${fbankdir}/train/text ${fbankdir}/train/speech_shape ${fbankdir}/train/text_shape ${feat_train_dir}
+    cp ${fbankdir}/dev/text ${fbankdir}/dev/speech_shape ${fbankdir}/dev/text_shape ${feat_dev_dir}
+    cp ${fbankdir}/test/text ${fbankdir}/test/speech_shape ${fbankdir}/test/text_shape ${feat_test_dir}
+
+    utils/fix_data_feat.sh ${feat_train_dir}
+    utils/fix_data_feat.sh ${feat_dev_dir}
+    utils/fix_data_feat.sh ${feat_test_dir}
+fi
+
+token_list=${feats_dir}/data/${lang}_token_list/char/tokens.txt
+echo "dictionary: ${token_list}"
+if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
+    echo "stage 2: Dictionary Preparation"
+    mkdir -p ${feats_dir}/data/${lang}_token_list/char/
+   
+    echo "make a dictionary"
+    echo "<blank>" > ${token_list}
+    echo "<s>" >> ${token_list}
+    echo "</s>" >> ${token_list}
+    utils/text2token.py -s 1 -n 1 --space "" ${feats_dir}/data/train/text | cut -f 2- -d" " | tr " " "\n" \
+        | sort | uniq | grep -a -v -e '^\s*$' | awk '{print $0}' >> ${token_list}
+    num_token=$(cat ${token_list} | wc -l)
+    echo "<unk>" >> ${token_list}
+    vocab_size=$(cat ${token_list} | wc -l)
+    awk -v v=,${vocab_size} '{print $0v}' ${feat_train_dir}/text_shape > ${feat_train_dir}/text_shape.char
+    awk -v v=,${vocab_size} '{print $0v}' ${feat_dev_dir}/text_shape > ${feat_dev_dir}/text_shape.char
+    mkdir -p ${feats_dir}/asr_stats_fbank_zh_char/train 
+    mkdir -p ${feats_dir}/asr_stats_fbank_zh_char/dev
+    cp ${feat_train_dir}/speech_shape ${feat_train_dir}/text_shape ${feat_train_dir}/text_shape.char ${feats_dir}/asr_stats_fbank_zh_char/train
+    cp ${feat_dev_dir}/speech_shape ${feat_dev_dir}/text_shape ${feat_dev_dir}/text_shape.char ${feats_dir}/asr_stats_fbank_zh_char/dev
+fi
+
+# Training Stage
+world_size=$gpu_num  # run on one machine
+if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
+    mkdir -p ${exp_dir}/exp/${model_dir}
+    mkdir -p ${exp_dir}/exp/${model_dir}/log
+    INIT_FILE=$exp_dir/ddp_init
+    if [ -f $INIT_FILE ];then
+        rm -f $INIT_FILE
+    fi 
+    init_method=file://$(readlink -f $INIT_FILE)
+    echo "$0: init method is $init_method"
+    for ((i = 0; i < $gpu_num; ++i)); do
+        {
+            rank=$i
+            local_rank=$i
+            gpu_id=$(echo $CUDA_VISIBLE_DEVICES | cut -d',' -f$((i + 1)))
+            asr_train.py \
+                --gpu_id $gpu_id \
+                --use_preprocessor true \
+                --token_type char \
+                --token_list $token_list \
+                --train_data_path_and_name_and_type ${feats_dir}/${dumpdir}/${train_set}/${scp},speech,${type} \
+                --train_data_path_and_name_and_type ${feats_dir}/${dumpdir}/${train_set}/text,text,text \
+                --train_shape_file ${feats_dir}/asr_stats_fbank_zh_char/${train_set}/speech_shape \
+                --train_shape_file ${feats_dir}/asr_stats_fbank_zh_char/${train_set}/text_shape.char \
+                --valid_data_path_and_name_and_type ${feats_dir}/${dumpdir}/${valid_set}/${scp},speech,${type} \
+                --valid_data_path_and_name_and_type ${feats_dir}/${dumpdir}/${valid_set}/text,text,text \
+                --valid_shape_file ${feats_dir}/asr_stats_fbank_zh_char/${valid_set}/speech_shape \
+                --valid_shape_file ${feats_dir}/asr_stats_fbank_zh_char/${valid_set}/text_shape.char  \
+                --resume true \
+                --output_dir ${exp_dir}/exp/${model_dir} \
+                --config $asr_config \
+                --input_size $feats_dim \
+                --ngpu $gpu_num \
+                --num_worker_count $count \
+                --multiprocessing_distributed true \
+                --dist_init_method $init_method \
+                --dist_world_size $world_size \
+                --dist_rank $rank \
+                --local_rank $local_rank 1> ${exp_dir}/exp/${model_dir}/log/train.log.$i 2>&1
+        } &
+    done
+    wait
+fi
+
+# Testing Stage
+if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
+    utils/easy_asr_infer.sh \
+        --lang zh \
+        --datadir ${feats_dir} \
+        --feats_type ${feats_type} \
+        --feats_dim ${feats_dim} \
+        --token_type ${token_type} \
+        --gpu_inference ${gpu_inference} \
+        --inference_config "${inference_config}" \
+        --test_sets "${test_sets}" \
+        --token_list $token_list \
+        --asr_exp ${exp_dir}/exp/${model_dir} \
+        --stage 12 \
+        --stop_stage 12 \
+        --scp $scp \
+        --text text \
+        --inference_nj $inference_nj \
+        --njob $njob \
+        --inference_asr_model $inference_asr_model \
+        --gpuid_list $gpuid_list \
+        --mode asr
+fi
+
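Stage 2 of `run.sh` builds the character dictionary with a shell pipeline. The same layout can be sketched in Python, assuming one `utt_id token token ...` entry per line in the training text. `build_token_list` is a hypothetical helper, the sample lines are made up, and Python sorts by code point, so the order of the middle entries may differ from the locale-dependent `sort` in the script.

```python
def build_token_list(lines):
    """Return [<blank>, <s>, </s>, sorted unique characters..., <unk>]."""
    chars = set()
    for line in lines:
        parts = line.strip().split(maxsplit=1)
        if len(parts) < 2:
            continue  # utterance ID without any text
        chars.update(ch for ch in parts[1] if not ch.isspace())
    return ["<blank>", "<s>", "</s>"] + sorted(chars) + ["<unk>"]


# Hypothetical training text.
tokens = build_token_list(["utt1 你 好", "utt2 好 吗"])
vocab_size = len(tokens)
print(tokens, vocab_size)
```

`vocab_size` counts every line including `<unk>`, and the script appends it to each `text_shape` entry to produce `text_shape.char`.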
--- a/egs/aishell/tranformer/utils/__init__.py
+++ b/egs/aishell/tranformer/utils/__init__.py
--- a/egs/aishell/tranformer/utils/apply_cmvn.py
+++ b/egs/aishell/tranformer/utils/apply_cmvn.py
@ -0,0 +1,79 @@
+from kaldiio import ReadHelper
+from kaldiio import WriteHelper
+
+import argparse
+import json
+import math
+import numpy as np
+
+
+def get_parser():
+    parser = argparse.ArgumentParser(
+        description="apply cmvn",
+        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
+    )
+    parser.add_argument(
+        "--ark-file",
+        "-a",
+        default=False,
+        required=True,
+        type=str,
+        help="fbank ark file",
+    )
+    parser.add_argument(
+        "--cmvn-file",
+        "-c",
+        default=False,
+        required=True,
+        type=str,
+        help="cmvn file",
+    )
+    parser.add_argument(
+        "--ark-index",
+        "-i",
+        default=1,
+        required=True,
+        type=int,
+        help="ark index",
+    )
+    parser.add_argument(
+        "--output-dir",
+        "-o",
+        default=False,
+        required=True,
+        type=str,
+        help="output dir",
+    )
+    return parser
+
+
+def main():
+    parser = get_parser()
+    args = parser.parse_args()
+
+    ark_file = args.output_dir + "/feats." + str(args.ark_index) + ".ark"
+    scp_file = args.output_dir + "/feats." + str(args.ark_index) + ".scp"
+    ark_writer = WriteHelper('ark,scp:{},{}'.format(ark_file, scp_file))
+
+    with open(args.cmvn_file) as f:
+        cmvn_stats = json.load(f)
+
+    means = cmvn_stats['mean_stats']
+    vars = cmvn_stats['var_stats']
+    total_frames = cmvn_stats['total_frames']
+
+    for i in range(len(means)):
+        means[i] /= total_frames
+        vars[i] = vars[i] / total_frames - means[i] * means[i]
+        if vars[i] < 1.0e-20:
+            vars[i] = 1.0e-20
+        vars[i] = 1.0 / math.sqrt(vars[i])
+
+    with ReadHelper('ark:{}'.format(args.ark_file)) as ark_reader:
+        for key, mat in ark_reader:
+            mat = (mat - means) * vars
+            ark_writer(key, mat)
+
+
+if __name__ == '__main__':
+    main()
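The loop in `apply_cmvn.py` turns accumulated sums into a mean and an inverse standard deviation, then normalizes every frame. A vectorized NumPy sketch of the same arithmetic is shown below. The function names are hypothetical and the random features only stand in for real fbank matrices.

```python
import numpy as np


def cmvn_from_stats(sum_x, sum_x2, total_frames, floor=1.0e-20):
    """Convert accumulated sums into (mean, inverse std) per feature dimension."""
    mean = np.asarray(sum_x, dtype=np.float64) / total_frames
    var = np.asarray(sum_x2, dtype=np.float64) / total_frames - mean ** 2
    inv_std = 1.0 / np.sqrt(np.maximum(var, floor))  # floor avoids division by zero
    return mean, inv_std


def apply_cmvn(mat, mean, inv_std):
    """Normalize a (frames, dim) matrix to zero mean and unit variance."""
    return (mat - mean) * inv_std


rng = np.random.default_rng(0)
feats = rng.normal(loc=3.0, scale=2.0, size=(500, 4))  # stand-in features
mean, inv_std = cmvn_from_stats(feats.sum(axis=0), (feats ** 2).sum(axis=0), len(feats))
normed = apply_cmvn(feats, mean, inv_std)
print(normed.mean(axis=0), normed.std(axis=0))
```

Storing raw sums instead of means keeps the statistics additive, so the per-job files can simply be summed before this conversion.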
--- a/egs/aishell/tranformer/utils/apply_cmvn.sh
+++ b/egs/aishell/tranformer/utils/apply_cmvn.sh
@ -0,0 +1,29 @@
+#!/usr/bin/env bash
+
+. ./path.sh || exit 1;
+# Begin configuration section.
+nj=32
+cmd=./utils/run.pl
+
+echo "$0 $@"
+
+. utils/parse_options.sh || exit 1;
+
+fbankdir=$1
+cmvn_file=$2
+logdir=$3
+output_dir=$4
+
+dump_dir=${output_dir}/ark; mkdir -p ${dump_dir}
+mkdir -p ${logdir}
+
+$cmd JOB=1:$nj $logdir/apply_cmvn.JOB.log \
+    python utils/apply_cmvn.py -a $fbankdir/ark/feats.JOB.ark \
+        -c $cmvn_file -i JOB -o ${dump_dir} \
+        || exit 1;
+
+for n in $(seq $nj); do
+    cat ${dump_dir}/feats.$n.scp || exit 1
+done > ${output_dir}/feats.scp || exit 1
+
+echo "$0: Succeeded apply cmvn"
--- a/egs/aishell/tranformer/utils/apply_lfr_and_cmvn.py
+++ b/egs/aishell/tranformer/utils/apply_lfr_and_cmvn.py
@ -0,0 +1,143 @@
+from kaldiio import ReadHelper, WriteHelper
+
+import argparse
+import numpy as np
+
+
+def build_LFR_features(inputs, m=7, n=6):
+    LFR_inputs = []
+    T = inputs.shape[0]
+    T_lfr = int(np.ceil(T / n))
+    left_padding = np.tile(inputs[0], ((m - 1) // 2, 1))
+    inputs = np.vstack((left_padding, inputs))
+    T = T + (m - 1) // 2
+    for i in range(T_lfr):
+        if m <= T - i * n:
+            LFR_inputs.append(np.hstack(inputs[i * n:i * n + m]))
+        else:
+            num_padding = m - (T - i * n)
+            frame = np.hstack(inputs[i * n:])
+            for _ in range(num_padding):
+                frame = np.hstack((frame, inputs[-1]))
+            LFR_inputs.append(frame)
+    return np.vstack(LFR_inputs)
+
+
+def build_CMVN_features(inputs, mvn_file):  # noqa
+    with open(mvn_file, 'r', encoding='utf-8') as f:
+        lines = f.readlines()
+
+    add_shift_list = []
+    rescale_list = []
+    for i in range(len(lines)):
+        line_item = lines[i].split()
+        if line_item[0] == '<AddShift>':
+            line_item = lines[i + 1].split()
+            if line_item[0] == '<LearnRateCoef>':
+                add_shift_line = line_item[3:(len(line_item) - 1)]
+                add_shift_list = list(add_shift_line)
+                continue
+        elif line_item[0] == '<Rescale>':
+            line_item = lines[i + 1].split()
+            if line_item[0] == '<LearnRateCoef>':
+                rescale_line = line_item[3:(len(line_item) - 1)]
+                rescale_list = list(rescale_line)
+                continue
+
+    for j in range(inputs.shape[0]):
+        for k in range(inputs.shape[1]):
+            add_shift_value = add_shift_list[k]
+            rescale_value = rescale_list[k]
+            inputs[j, k] = float(inputs[j, k]) + float(add_shift_value)
+            inputs[j, k] = float(inputs[j, k]) * float(rescale_value)
+
+    return inputs
+
+
+def get_parser():
+    parser = argparse.ArgumentParser(
+        description="apply low_frame_rate and cmvn",
+        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
+    )
+    parser.add_argument(
+        "--ark-file",
+        "-a",
+        default=False,
+        required=True,
+        type=str,
+        help="fbank ark file",
+    )
+    parser.add_argument(
+        "--lfr",
+        "-f",
+        default=True,
+        type=lambda s: s.lower() in ("true", "1", "yes"),
+        help="low frame rate",
+    )
+    parser.add_argument(
+        "--lfr-m",
+        "-m",
+        default=7,
+        type=int,
+        help="number of frames to stack",
+    )
+    parser.add_argument(
+        "--lfr-n",
+        "-n",
+        default=6,
+        type=int,
+        help="number of frames to skip",
+    )
+    parser.add_argument(
+        "--cmvn-file",
+        "-c",
+        default=False,
+        required=True,
+        type=str,
+        help="global cmvn file",
+    )
+    parser.add_argument(
+        "--ark-index",
+        "-i",
+        default=1,
+        required=True,
+        type=int,
+        help="ark index",
+    )
+    parser.add_argument(
+        "--output-dir",
+        "-o",
+        default=False,
+        required=True,
+        type=str,
+        help="output dir",
+    )
+    return parser
+
+
+def main():
+    parser = get_parser()
+    args = parser.parse_args()
+
+    dump_ark_file = args.output_dir + "/feats." + str(args.ark_index) + ".ark"
+    dump_scp_file = args.output_dir + "/feats." + str(args.ark_index) + ".scp"
+    shape_file = args.output_dir + "/len." + str(args.ark_index)
+    ark_writer = WriteHelper('ark,scp:{},{}'.format(dump_ark_file, dump_scp_file))
+
+    with open(shape_file, 'w') as shape_writer, \
+            ReadHelper('ark:{}'.format(args.ark_file)) as ark_reader:
+        for key, mat in ark_reader:
+            if args.lfr:
+                lfr = build_LFR_features(mat, args.lfr_m, args.lfr_n)
+            else:
+                lfr = mat
+            cmvn = build_CMVN_features(lfr, args.cmvn_file)
+            dims = cmvn.shape[1]
+            lens = cmvn.shape[0]
+            shape_writer.write(key + " " + str(lens) + "," + str(dims) + '\n')
+            ark_writer(key, cmvn)
+
+
+if __name__ == '__main__':
+    main()
+
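The effect of `build_LFR_features` on shapes is easiest to see with a small example. The sketch below re-implements the stacking under the same padding rules (left-pad with `(m - 1) // 2` copies of the first frame, right-pad with copies of the last frame). `stack_lfr` is a hypothetical name and the toy input is not real fbank data.

```python
import numpy as np


def stack_lfr(feats, m=7, n=6):
    """Stack m consecutive frames and advance by n frames (low frame rate)."""
    num_frames = feats.shape[0]
    t_out = int(np.ceil(num_frames / n))
    left = (m - 1) // 2
    padded = np.vstack([np.tile(feats[0], (left, 1)), feats])
    # Right-pad with the last frame so the final window is complete.
    needed = (t_out - 1) * n + m
    if needed > padded.shape[0]:
        pad = np.tile(feats[-1], (needed - padded.shape[0], 1))
        padded = np.vstack([padded, pad])
    return np.stack([padded[i * n:i * n + m].reshape(-1) for i in range(t_out)])


toy = np.arange(40, dtype=np.float64).reshape(20, 2)  # 20 frames, 2 dims
out = stack_lfr(toy)
print(toy.shape, "->", out.shape)
```

With the defaults, 20 input frames become `ceil(20 / 6) = 4` output frames, each `7 * dim` wide, so 80-dimensional fbank features would come out 560-dimensional.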
--- a/egs/aishell/tranformer/utils/apply_lfr_and_cmvn.sh
+++ b/egs/aishell/tranformer/utils/apply_lfr_and_cmvn.sh
@ -0,0 +1,38 @@
+#!/usr/bin/env bash
+
+
+# Begin configuration section.
+nj=32
+cmd=utils/run.pl
+
+# feature configuration
+lfr=True
+lfr_m=7
+lfr_n=6
+
+echo "$0 $@"
+
+. utils/parse_options.sh || exit 1;
+
+fbankdir=$1
+cmvn_file=$2
+logdir=$3
+output_dir=$4
+
+dump_dir=${output_dir}/ark; mkdir -p ${dump_dir}
+mkdir -p ${logdir}
+
+$cmd JOB=1:$nj $logdir/apply_lfr_and_cmvn.JOB.log \
+    python utils/apply_lfr_and_cmvn.py -a $fbankdir/ark/feats.JOB.ark \
+        -f $lfr -m $lfr_m -n $lfr_n -c $cmvn_file -i JOB -o ${dump_dir} \
+        || exit 1;
+
+for n in $(seq $nj); do
+    cat ${dump_dir}/feats.$n.scp || exit 1
+done > ${output_dir}/feats.scp || exit 1
+
+for n in $(seq $nj); do
+  cat ${dump_dir}/len.$n || exit 1
+done > ${output_dir}/speech_shape || exit 1
+
+echo "$0: Succeeded apply low frame rate and cmvn"
--- a/egs/aishell/tranformer/utils/combine_cmvn_file.py
+++ b/egs/aishell/tranformer/utils/combine_cmvn_file.py
@ -0,0 +1,66 @@
+import argparse
+import json
+import numpy as np
+
+def get_parser():
+    parser = argparse.ArgumentParser(
+        description="combine cmvn file",
+        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
+    )
+    parser.add_argument(
+        "--cmvn-dir",
+        "-c",
+        default=False,
+        required=True,
+        type=str,
+        help="cmvn dir",
+    )
+
+    parser.add_argument(
+        "--nj",
+        "-n",
+        default=1,
+        required=True,
+        type=int,
+        help="number of cmvn files",
+    )
+    parser.add_argument(
+        "--output-dir",
+        "-o",
+        default=False,
+        required=True,
+        type=str,
+        help="output dir",
+    )
+    return parser
+
+
+def main():
+    parser = get_parser()
+    args = parser.parse_args()
+
+    total_means = 0.0
+    total_vars = 0.0
+    total_frames = 0
+
+    cmvn_file = args.output_dir + "/cmvn.json"
+
+    for i in range(1, args.nj+1):
+        with open(args.cmvn_dir + "/cmvn." + str(i) + ".json", "r") as fin:
+            cmvn_stats = json.load(fin)
+
+        total_means += np.array(cmvn_stats["mean_stats"])
+        total_vars += np.array(cmvn_stats["var_stats"])
+        total_frames += cmvn_stats["total_frames"]
+
+    cmvn_info = {
+        'mean_stats': list(total_means.tolist()),
+        'var_stats': list(total_vars.tolist()),
+        'total_frames': total_frames
+    }
+    with open(cmvn_file, 'w') as fout:
+        fout.write(json.dumps(cmvn_info))
+
+
+if __name__ == '__main__':
+    main()
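`combine_cmvn_file.py` relies on the per-job statistics being plain sums, so adding them gives the same result as one pass over all data. A small check of that property is sketched below, with hypothetical helper names and random matrices in place of real features.

```python
import numpy as np


def accumulate_stats(mats):
    """Sum of x, sum of x^2 and frame count over a list of (frames, dim) matrices."""
    dim = mats[0].shape[1]
    stats = {"mean_stats": np.zeros(dim), "var_stats": np.zeros(dim), "total_frames": 0}
    for mat in mats:
        stats["mean_stats"] += mat.sum(axis=0)
        stats["var_stats"] += np.square(mat).sum(axis=0)
        stats["total_frames"] += mat.shape[0]
    return stats


def combine_stats(stats_list):
    """Merge per-job statistics by element-wise addition."""
    return {
        "mean_stats": sum(s["mean_stats"] for s in stats_list),
        "var_stats": sum(s["var_stats"] for s in stats_list),
        "total_frames": sum(s["total_frames"] for s in stats_list),
    }


rng = np.random.default_rng(0)
job1, job2 = rng.normal(size=(30, 3)), rng.normal(size=(50, 3))
merged = combine_stats([accumulate_stats([job1]), accumulate_stats([job2])])
whole = accumulate_stats([job1, job2])
print(merged["total_frames"], np.allclose(merged["mean_stats"], whole["mean_stats"]))
```

Despite their names, `mean_stats` and `var_stats` hold unnormalized sums; the division by `total_frames` happens later, when the statistics are applied.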
--- a/egs/aishell/tranformer/utils/compute_cmvn.py
+++ b/egs/aishell/tranformer/utils/compute_cmvn.py
@ -0,0 +1,67 @@
+from kaldiio import ReadHelper
+
+import argparse
+import numpy as np
+import json
+
+
+def get_parser():
+    parser = argparse.ArgumentParser(
+        description="compute global cmvn",
+        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
+    )
+    parser.add_argument(
+        "--ark-file",
+        "-a",
+        default=False,
+        required=True,
+        type=str,
+        help="directory containing the fbank ark files",
+    )
+    parser.add_argument(
+        "--ark-index",
+        "-i",
+        default=1,
+        required=True,
+        type=int,
+        help="ark index",
+    )
+    parser.add_argument(
+        "--output-dir",
+        "-o",
+        default=False,
+        required=True,
+        type=str,
+        help="output dir",
+    )
+    return parser
+
+
+def main():
+    parser = get_parser()
+    args = parser.parse_args()
+
+    ark_file = args.ark_file + "/feats." + str(args.ark_index) + ".ark"
+    cmvn_file = args.output_dir + "/cmvn." + str(args.ark_index) + ".json"
+
+    mean_stats = 0.0
+    var_stats = 0.0
+    total_frames = 0
+
+    with ReadHelper('ark:{}'.format(ark_file)) as ark_reader:
+        for key, mat in ark_reader:
+            mean_stats += np.sum(mat, axis=0)
+            var_stats += np.sum(np.square(mat), axis=0)
+            total_frames += mat.shape[0]
+
+    cmvn_info = {
+        'mean_stats': list(mean_stats.tolist()),
+        'var_stats': list(var_stats.tolist()),
+        'total_frames': total_frames
+    }
+    with open(cmvn_file, 'w') as fout:
+        fout.write(json.dumps(cmvn_info))
+
+
+if __name__ == '__main__':
+    main()
--- a/egs/aishell/tranformer/utils/compute_cmvn.sh
+++ b/egs/aishell/tranformer/utils/compute_cmvn.sh
@ -0,0 +1,24 @@
+#!/usr/bin/env bash
+
+. ./path.sh || exit 1;
+# Begin configuration section.
+nj=32
+cmd=./utils/run.pl
+
+echo "$0 $@"
+
+. utils/parse_options.sh || exit 1;
+
+fbankdir=$1
+logdir=$2
+
+output_dir=${fbankdir}/cmvn; mkdir -p ${output_dir}
+mkdir -p ${logdir}
+
+$cmd JOB=1:$nj $logdir/cmvn.JOB.log \
+    python utils/compute_cmvn.py -a $fbankdir/ark -i JOB -o ${output_dir} \
+        || exit 1;
+
+python utils/combine_cmvn_file.py -c ${output_dir} -n $nj -o $fbankdir
+
+echo "$0: Succeeded in computing global cmvn"
--- a/egs/aishell/tranformer/utils/compute_fbank.py
+++ b/egs/aishell/tranformer/utils/compute_fbank.py
@ -0,0 +1,153 @@
+from kaldiio import WriteHelper
+
+import argparse
+import numpy as np
+import json
+import torch
+import torchaudio
+import torchaudio.compliance.kaldi as kaldi
+
+
+def compute_fbank(wav_file,
+                  num_mel_bins=80,
+                  frame_length=25,
+                  frame_shift=10,
+                  dither=0.0,
+                  resample_rate=16000,
+                  speed=1.0):
+
+    waveform, sample_rate = torchaudio.load(wav_file)
+    if resample_rate != sample_rate:
+        waveform = torchaudio.transforms.Resample(orig_freq=sample_rate,
+                                                  new_freq=resample_rate)(waveform)
+    if speed != 1.0:
+        waveform, _ = torchaudio.sox_effects.apply_effects_tensor(
+            waveform, resample_rate,
+            [['speed', str(speed)], ['rate', str(resample_rate)]]
+        )
+
+    waveform = waveform * (1 << 15)
+    mat = kaldi.fbank(waveform,
+                      num_mel_bins=num_mel_bins,
+                      frame_length=frame_length,
+                      frame_shift=frame_shift,
+                      dither=dither,
+                      energy_floor=0.0,
+                      window_type='hamming',
+                      sample_frequency=resample_rate)
+
+    return mat.numpy()
+
+
+def get_parser():
+    parser = argparse.ArgumentParser(
+        description="compute features",
+        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
+    )
+    parser.add_argument(
+        "--wav-lists",
+        "-w",
+        default=False,
+        required=True,
+        type=str,
+        help="input wav lists",
+    )
+    parser.add_argument(
+        "--text-files",
+        "-t",
+        default=False,
+        required=True,
+        type=str,
+        help="input text files",
+    )
+    parser.add_argument(
+        "--dims",
+        "-d",
+        default=80,
+        type=int,
+        help="feature dims",
+    )
+    parser.add_argument(
+        "--sample-frequency",
+        "-s",
+        default=16000,
+        type=int,
+        help="sample frequency",
+    )
+    parser.add_argument(
+        "--speed-perturb",
+        "-p",
+        default="1.0",
+        type=str,
+        help="speed perturb",
+    )
+    parser.add_argument(
+        "--ark-index",
+        "-a",
+        default=1,
+        required=True,
+        type=int,
+        help="ark index",
+    )
+    parser.add_argument(
+        "--output-dir",
+        "-o",
+        default=False,
+        required=True,
+        type=str,
+        help="output dir",
+    )
+    return parser
+
+
+def main():
+    parser = get_parser()
+    args = parser.parse_args()
+
+    ark_file = args.output_dir + "/ark/feats." + str(args.ark_index) + ".ark"
+    scp_file = args.output_dir + "/ark/feats." + str(args.ark_index) + ".scp"
+    text_file = args.output_dir + "/txt/text." + str(args.ark_index) + ".txt"  
+    feats_shape_file = args.output_dir + "/ark/len." + str(args.ark_index)
+    text_shape_file = args.output_dir + "/txt/len." + str(args.ark_index)
+
+    ark_writer = WriteHelper('ark,scp:{},{}'.format(ark_file, scp_file))
+    text_writer = open(text_file, 'w')
+    feats_shape_writer = open(feats_shape_file, 'w')
+    text_shape_writer = open(text_shape_file, 'w')
+
+    speed_perturb_list = args.speed_perturb.split(',')
+    
+    for speed in speed_perturb_list:
+        with open(args.wav_lists, 'r', encoding='utf-8') as wavfile:
+            with open(args.text_files, 'r', encoding='utf-8') as textfile:
+                for wav, text in zip(wavfile, textfile): 
+                    s_w = wav.strip().split()
+                    wav_id = s_w[0]
+                    wav_file = s_w[1]
+
+                    s_t = text.strip().split()
+                    text_id = s_t[0]
+                    txt = s_t[1:]
+                    fbank = compute_fbank(wav_file,
+                                          num_mel_bins=args.dims,
+                                          resample_rate=args.sample_frequency,
+                                          speed=float(speed)
+                                          )
+                    feats_dims = fbank.shape[1]
+                    feats_lens = fbank.shape[0]
+                    txt_lens = len(txt)
+                    if float(speed) == 1.0:  # numeric compare, so "1" and "1.00" also count as unperturbed
+                        wav_id_sp = wav_id
+                    else: 
+                        wav_id_sp = wav_id + "_sp" + speed
+
+                    feats_shape_writer.write(wav_id_sp + " " + str(feats_lens) + "," + str(feats_dims) + '\n')
+                    text_shape_writer.write(wav_id_sp + " " + str(txt_lens) + '\n')
+
+                    text_writer.write(wav_id_sp + " " + " ".join(txt) + '\n')
+                    ark_writer(wav_id_sp, fbank)
+                    
+
+if __name__ == '__main__':
+    main()
+
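
Review note: `feats_lens` written to the `len.*` files is the number of frames returned by `kaldi.fbank`. The sketch below estimates that count from the sample count, assuming the Kaldi-style `snip_edges=True` framing (the default in `torchaudio.compliance.kaldi.fbank`, which the call above does not override). The function name `num_fbank_frames` is hypothetical.

```python
def num_fbank_frames(num_samples,
                     sample_rate=16000,
                     frame_length_ms=25.0,
                     frame_shift_ms=10.0):
    """Estimate the number of fbank frames under snip_edges=True framing.

    Only frames that fit entirely inside the waveform are kept, so a
    signal shorter than one window yields zero frames.
    """
    window = int(sample_rate * frame_length_ms / 1000)  # samples per frame
    shift = int(sample_rate * frame_shift_ms / 1000)    # samples per hop
    if num_samples < window:
        return 0
    return 1 + (num_samples - window) // shift
```

With the defaults used in `compute_fbank` (25 ms window, 10 ms shift, 16 kHz), the window is 400 samples and the hop is 160 samples.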
--- a/egs/aishell/tranformer/utils/compute_fbank.sh
+++ b/egs/aishell/tranformer/utils/compute_fbank.sh
@ -0,0 +1,51 @@
+#!/usr/bin/env bash
+
+. ./path.sh || exit 1;
+# Begin configuration section.
+nj=32
+cmd=./utils/run.pl
+
+# feature configuration
+feat_dims=80
+sample_frequency=16000
+speed_perturb="1.0"
+
+echo "$0 $@"
+
+. utils/parse_options.sh || exit 1;
+
+data=$1
+logdir=$2
+fbankdir=$3
+
+[ ! -f $data/wav.scp ] && echo "$0: no such file $data/wav.scp" && exit 1;
+[ ! -f $data/text ] && echo "$0: no such file $data/text" && exit 1;
+
+python utils/split_data.py $data $data $nj
+
+ark_dir=${fbankdir}/ark; mkdir -p ${ark_dir}
+text_dir=${fbankdir}/txt; mkdir -p ${text_dir}
+mkdir -p ${logdir}
+
+$cmd JOB=1:$nj $logdir/make_fbank.JOB.log \
+    python utils/compute_fbank.py -w $data/split${nj}/JOB/wav.scp -t $data/split${nj}/JOB/text \
+        -d $feat_dims -s $sample_frequency -p ${speed_perturb} -a JOB -o ${fbankdir} \
+        || exit 1;
+
+for n in $(seq $nj); do
+    cat ${ark_dir}/feats.$n.scp || exit 1
+done > $fbankdir/feats.scp || exit 1
+
+for n in $(seq $nj); do
+    cat ${text_dir}/text.$n.txt || exit 1
+done > $fbankdir/text || exit 1
+
+for n in $(seq $nj); do
+    cat ${ark_dir}/len.$n || exit 1
+done > $fbankdir/speech_shape || exit 1
+
+for n in $(seq $nj); do
+    cat ${text_dir}/len.$n || exit 1
+done > $fbankdir/text_shape || exit 1
+
+echo "$0: Succeeded in computing FBANK features"
--- a/egs/aishell/tranformer/utils/compute_wer.py
+++ b/egs/aishell/tranformer/utils/compute_wer.py
@ -0,0 +1,157 @@
+import os
+import numpy as np
+import sys
+
+def compute_wer(ref_file,
+                hyp_file,
+                cer_detail_file):
+    rst = {
+        'Wrd': 0,
+        'Corr': 0,
+        'Ins': 0,
+        'Del': 0,
+        'Sub': 0,
+        'Snt': 0,
+        'Err': 0.0,
+        'S.Err': 0.0,
+        'wrong_words': 0,
+        'wrong_sentences': 0
+    }
+
+    hyp_dict = {}
+    ref_dict = {}
+    with open(hyp_file, 'r') as hyp_reader:
+        for line in hyp_reader:
+            key = line.strip().split()[0]
+            value = line.strip().split()[1:]
+            hyp_dict[key] = value
+    with open(ref_file, 'r') as ref_reader:
+        for line in ref_reader:
+            key = line.strip().split()[0]
+            value = line.strip().split()[1:]
+            ref_dict[key] = value
+
+    cer_detail_writer = open(cer_detail_file, 'w')
+    for hyp_key in hyp_dict:
+        if hyp_key in ref_dict:
+           out_item = compute_wer_by_line(hyp_dict[hyp_key], ref_dict[hyp_key])
+           rst['Wrd'] += out_item['nwords']
+           rst['Corr'] += out_item['cor']
+           rst['wrong_words'] += out_item['wrong']
+           rst['Ins'] += out_item['ins']
+           rst['Del'] += out_item['del']
+           rst['Sub'] += out_item['sub']
+           rst['Snt'] += 1
+           if out_item['wrong'] > 0:
+               rst['wrong_sentences'] += 1
+           cer_detail_writer.write(hyp_key + print_cer_detail(out_item) + '\n')
+           cer_detail_writer.write("ref:" + '\t' + "".join(ref_dict[hyp_key]) + '\n')
+           cer_detail_writer.write("hyp:" + '\t' + "".join(hyp_dict[hyp_key]) + '\n')
+
+    if rst['Wrd'] > 0:
+        rst['Err'] = round(rst['wrong_words'] * 100 / rst['Wrd'], 2)
+    if rst['Snt'] > 0:
+        rst['S.Err'] = round(rst['wrong_sentences'] * 100 / rst['Snt'], 2)
+
+    cer_detail_writer.write('\n')
+    cer_detail_writer.write("%WER " + str(rst['Err']) + " [ " + str(rst['wrong_words'])+ " / " + str(rst['Wrd']) +
+                            ", " + str(rst['Ins']) + " ins, " + str(rst['Del']) + " del, " + str(rst['Sub']) + " sub ]" + '\n')
+    cer_detail_writer.write("%SER " + str(rst['S.Err']) + " [ " + str(rst['wrong_sentences']) + " / " + str(rst['Snt']) + " ]" + '\n')
+    cer_detail_writer.write("Scored " + str(len(hyp_dict)) + " sentences, " + str(len(hyp_dict) - rst['Snt']) + " not present in ref." + '\n')
+
+     
+def compute_wer_by_line(hyp,
+                        ref):
+    hyp = list(map(lambda x: x.lower(), hyp))
+    ref = list(map(lambda x: x.lower(), ref))
+
+    len_hyp = len(hyp)
+    len_ref = len(ref)
+
+    cost_matrix = np.zeros((len_hyp + 1, len_ref + 1), dtype=np.int16)
+
+    ops_matrix = np.zeros((len_hyp + 1, len_ref + 1), dtype=np.int8)
+
+    for i in range(len_hyp + 1):
+        cost_matrix[i][0] = i
+    for j in range(len_ref + 1):
+        cost_matrix[0][j] = j
+
+    for i in range(1, len_hyp + 1):
+        for j in range(1, len_ref + 1):
+            if hyp[i - 1] == ref[j - 1]:
+                cost_matrix[i][j] = cost_matrix[i - 1][j - 1]
+            else:
+                substitution = cost_matrix[i - 1][j - 1] + 1
+                insertion = cost_matrix[i - 1][j] + 1
+                deletion = cost_matrix[i][j - 1] + 1
+
+                compare_val = [substitution, insertion, deletion]
+
+                min_val = min(compare_val)
+                operation_idx = compare_val.index(min_val) + 1
+                cost_matrix[i][j] = min_val
+                ops_matrix[i][j] = operation_idx
+
+    match_idx = []
+    i = len_hyp
+    j = len_ref
+    rst = {
+        'nwords': len_ref,
+        'cor': 0,
+        'wrong': 0,
+        'ins': 0,
+        'del': 0,
+        'sub': 0
+    }
+    while i >= 0 or j >= 0:
+        i_idx = max(0, i)
+        j_idx = max(0, j)
+
+        if ops_matrix[i_idx][j_idx] == 0:  # correct
+            if i - 1 >= 0 and j - 1 >= 0:
+                match_idx.append((j - 1, i - 1))
+                rst['cor'] += 1
+
+            i -= 1
+            j -= 1
+
+        elif ops_matrix[i_idx][j_idx] == 2:  # insert
+            i -= 1
+            rst['ins'] += 1
+
+        elif ops_matrix[i_idx][j_idx] == 3:  # delete
+            j -= 1
+            rst['del'] += 1
+
+        elif ops_matrix[i_idx][j_idx] == 1:  # substitute
+            i -= 1
+            j -= 1
+            rst['sub'] += 1
+
+        if i < 0 and j >= 0:
+            rst['del'] += 1
+        elif j < 0 and i >= 0:
+            rst['ins'] += 1
+
+    match_idx.reverse()
+    wrong_cnt = cost_matrix[len_hyp][len_ref]
+    rst['wrong'] = wrong_cnt
+
+    return rst
+
+def print_cer_detail(rst):
+    return ("(" + "nwords=" + str(rst['nwords']) + ",cor=" + str(rst['cor'])
+            + ",ins=" + str(rst['ins']) + ",del=" + str(rst['del']) + ",sub="
+            + str(rst['sub']) + ") corr:" + '{:.2%}'.format(rst['cor']/rst['nwords'])
+            + ",cer:" + '{:.2%}'.format(rst['wrong']/rst['nwords']))
+
+if __name__ == '__main__':
+    if len(sys.argv) != 4:
+        print("usage : python compute_wer.py test.ref test.hyp test.wer")
+        sys.exit(1)
+
+    ref_file = sys.argv[1]
+    hyp_file = sys.argv[2]
+    cer_detail_file = sys.argv[3]
+    compute_wer(ref_file, hyp_file, cer_detail_file)
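
Review note: the scoring above builds a full cost matrix and backtraces through it to count insertions, deletions, and substitutions. When only the total error count is needed, the same edit distance can be computed with two rows. The sketch below is a simplified, standard Levenshtein variant with hypothetical names (`edit_distance`, `error_rate`); it is not a drop-in replacement for `compute_wer_by_line`, because it does not report per-operation counts.

```python
def edit_distance(ref, hyp):
    """Token-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))  # distances for an empty ref prefix
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            curr[j] = min(
                prev[j - 1] + (r != h),  # match (cost 0) or substitution
                prev[j] + 1,             # deletion: ref token missing in hyp
                curr[j - 1] + 1,         # insertion: extra token in hyp
            )
        prev = curr
    return prev[-1]


def error_rate(ref, hyp):
    """Error rate in percent, relative to the reference length."""
    if not ref:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)
```

Python integers do not overflow, so this variant also sidesteps the `np.int16` limit of the cost matrix for very long sequences.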
--- a/egs/aishell/tranformer/utils/easy_asr_infer.sh
+++ b/egs/aishell/tranformer/utils/easy_asr_infer.sh
@ -0,0 +1,407 @@
+#!/usr/bin/env bash
+
+# Set bash to strict mode; the script exits on:
+# -e 'error', -u 'undefined variable', -o pipefail 'error in pipeline'.
+set -e
+set -u
+set -o pipefail
+
+log() {
+    local fname=${BASH_SOURCE[1]##*/}
+    echo -e "$(date '+%Y-%m-%dT%H:%M:%S') (${fname}:${BASH_LINENO[0]}:${FUNCNAME[1]}) $*"
+}
+min() {
+  local a b
+  a=$1
+  for b in "$@"; do
+      if [ "${b}" -le "${a}" ]; then
+          a="${b}"
+      fi
+  done
+  echo "${a}"
+}
+SECONDS=0
+
+# General configuration
+stage=1              # Processing starts from the specified stage.
+stop_stage=10000     # Processing stops at the specified stage.
+skip_data_prep=true  # Skip data preparation stages.
+skip_train=false     # Skip training stages.
+skip_eval=false      # Skip decoding and evaluation stages.
+skip_upload=true     # Skip packing and uploading stages.
+skip_upload_hf=true  # Skip uploading to hugging face stages.
+cuda_cmd=utils/run.pl
+decode_cmd=utils/run.pl
+ngpu=1               # The number of gpus ("0" uses cpu, otherwise use gpu).
+njob=1               # the number of jobs for each gpu
+gpuid_list=
+num_nodes=1          # The number of nodes.
+nj=32                # The number of parallel jobs.
+inference_nj=32      # The number of parallel jobs in decoding.
+gpu_inference=true  # Whether to perform gpu decoding, set false for cpu decoding
+datadir="./"
+dumpdir=dump         # Directory to dump features.
+expdir=exp           # Directory to save experiments.
+python=python        # Specify python to execute funasr commands.
+
+# Data preparation related
+local_data_opts= # The options given to local/data.sh.
+
+# Speed perturbation related
+speed_perturb_factors=  # perturbation factors, e.g. "0.9 1.0 1.1" (separated by space).
+
+# Feature extraction related
+feats_type=fbank       # Feature type (raw or fbank_pitch).
+feats_dim=
+audio_format=flac    # Audio format: wav, flac, wav.ark, flac.ark  (only in feats_type=raw).
+fs=16k               # Sampling rate.
+min_wav_duration=0.1 # Minimum duration in seconds.
+max_wav_duration=20  # Maximum duration in seconds.
+
+# Tokenization related
+token_type=bpe      # Tokenization type (char or bpe).
+nbpe=30             # The number of BPE vocabulary.
+bpemode=unigram     # Mode of BPE (unigram or bpe).
+oov="<unk>"         # Out of vocabulary symbol.
+blank="<blank>"     # CTC blank symbol
+sos_eos="<sos/eos>" # sos and eos symbol
+bpe_input_sentence_size=100000000 # Size of input sentence for BPE.
+bpe_nlsyms=         # non-linguistic symbols list, separated by a comma, for BPE
+bpe_char_cover=1.0  # character coverage when modeling BPE
+
+# Ngram model related
+use_ngram=false
+ngram_exp=
+ngram_num=3
+
+# Language model related
+use_lm=false       # Use language model for ASR decoding.
+lm_tag=           # Suffix to the result dir for language model training.
+lm_exp=           # Specify the directory path for LM experiment.
+                  # If this option is specified, lm_tag is ignored.
+lm_stats_dir=     # Specify the directory path for LM statistics.
+lm_config=        # Config for language model training.
+lm_args=          # Arguments for language model training, e.g., "--max_epoch 10".
+                  # Note that it will overwrite args in lm config.
+use_word_lm=false # Whether to use word language model.
+num_splits_lm=1   # Number of splits for the LM corpus.
+# shellcheck disable=SC2034
+word_vocab_size=10000 # Size of word vocabulary.
+
+# ASR model related
+asr_tag=       # Suffix to the result dir for asr model training.
+asr_exp=       # Specify the directory path for ASR experiment.
+               # If this option is specified, asr_tag is ignored.
+asr_stats_dir= # Specify the directory path for ASR statistics.
+asr_config=    # Config for asr model training.
+asr_args=      # Arguments for asr model training, e.g., "--max_epoch 10".
+               # Note that it will overwrite args in asr config.
+pretrained_model=              # Pretrained model to load
+ignore_init_mismatch=false      # Ignore initial mismatch
+feats_normalize=global_mvn # Normalization layer type.
+num_splits_asr=1           # Number of splits for the ASR corpus.
+
+# Upload model related
+hf_repo=
+
+# Decoding related
+use_k2=false      # Whether to use k2 based decoder
+k2_ctc_decoding=true
+use_nbest_rescoring=true # use transformer-decoder
+                         # and transformer language model for nbest rescoring
+num_paths=1000 # The 3rd argument of k2.random_paths.
+nll_batch_size=100 # Affect GPU memory usage when computing nll
+                   # during nbest rescoring
+k2_config=./conf/decode_asr_transformer_with_k2.yaml
+
+use_streaming=false # Whether to use streaming decoding
+
+use_maskctc=false # Whether to use maskctc decoding
+
+batch_size=1
+inference_tag=    # Suffix to the result dir for decoding.
+inference_config= # Config for decoding.
+inference_args=   # Arguments for decoding, e.g., "--lm_weight 0.1".
+                  # Note that it will overwrite args in inference config.
+inference_lm=valid.loss.ave.pth       # Language model path for decoding.
+inference_ngram=${ngram_num}gram.bin
+inference_asr_model=valid.acc.ave.pth # ASR model path for decoding.
+                                      # e.g.
+                                      # inference_asr_model=train.loss.best.pth
+                                      # inference_asr_model=3epoch.pth
+                                      # inference_asr_model=valid.acc.best.pth
+                                      # inference_asr_model=valid.loss.ave.pth
+download_model= # Download a model from Model Zoo and use it for decoding.
+
+# [Task dependent] Set the datadir name created by local/data.sh
+train_set=       # Name of training set.
+valid_set=       # Name of validation set used for monitoring/tuning network training.
+test_sets=       # Names of test sets. Multiple items (e.g., both dev and eval sets) can be specified.
+bpe_train_text=  # Text file path of bpe training set.
+lm_train_text=   # Text file path of language model training set.
+lm_dev_text=     # Text file path of language model development set.
+lm_test_text=    # Text file path of language model evaluation set.
+nlsyms_txt=none  # Non-linguistic symbol list if existing.
+cleaner=none     # Text cleaner.
+g2p=none         # g2p method (needed if token_type=phn).
+lang=noinfo      # The language type of corpus.
+score_opts=                # The options given to sclite scoring
+local_score_opts=          # The options given to local/score.sh.
+asr_speech_fold_length=800 # fold_length for speech data during ASR training.
+asr_text_fold_length=150   # fold_length for text data during ASR training.
+lm_fold_length=150         # fold_length for LM training.
+
+oss_path=
+token_list=
+scp=
+text=
+
+mode=
+
+help_message=$(cat << EOF
+Usage: $0 --train-set "<train_set_name>" --valid-set "<valid_set_name>" --test_sets "<test_set_names>"
+
+Options:
+    # General configuration
+    --stage          # Processing starts from the specified stage (default="${stage}").
+    --stop_stage     # Processing stops at the specified stage (default="${stop_stage}").
+    --skip_data_prep # Skip data preparation stages (default="${skip_data_prep}").
+    --skip_train     # Skip training stages (default="${skip_train}").
+    --skip_eval      # Skip decoding and evaluation stages (default="${skip_eval}").
+    --skip_upload    # Skip packing and uploading stages (default="${skip_upload}").
+    --ngpu           # The number of gpus ("0" uses cpu, otherwise use gpu, default="${ngpu}").
+    --num_nodes      # The number of nodes (default="${num_nodes}").
+    --nj             # The number of parallel jobs (default="${nj}").
+    --inference_nj   # The number of parallel jobs in decoding (default="${inference_nj}").
+    --gpu_inference  # Whether to perform gpu decoding (default="${gpu_inference}").
+    --dumpdir        # Directory to dump features (default="${dumpdir}").
+    --expdir         # Directory to save experiments (default="${expdir}").
+    --python         # Specify python to execute funasr commands (default="${python}").
+
+    # Data preparation related
+    --local_data_opts # The options given to local/data.sh (default="${local_data_opts}").
+
+    # Speed perturbation related
+    --speed_perturb_factors # speed perturbation factors, e.g. "0.9 1.0 1.1" (separated by space, default="${speed_perturb_factors}").
+
+    # Feature extraction related
+    --feats_type       # Feature type (raw, fbank_pitch or extracted, default="${feats_type}").
+    --audio_format     # Audio format: wav, flac, wav.ark, flac.ark  (only in feats_type=raw, default="${audio_format}").
+    --fs               # Sampling rate (default="${fs}").
+    --min_wav_duration # Minimum duration in seconds (default="${min_wav_duration}").
+    --max_wav_duration # Maximum duration in seconds (default="${max_wav_duration}").
+
+    # Tokenization related
+    --token_type              # Tokenization type (char or bpe, default="${token_type}").
+    --nbpe                    # The number of BPE vocabulary (default="${nbpe}").
+    --bpemode                 # Mode of BPE (unigram or bpe, default="${bpemode}").
+    --oov                     # Out of vocabulary symbol (default="${oov}").
+    --blank                   # CTC blank symbol (default="${blank}").
+    --sos_eos                 # sos and eos symbol (default="${sos_eos}").
+    --bpe_input_sentence_size # Size of input sentence for BPE (default="${bpe_input_sentence_size}").
+    --bpe_nlsyms              # Non-linguistic symbol list for sentencepiece, separated by a comma. (default="${bpe_nlsyms}").
+    --bpe_char_cover          # Character coverage when modeling BPE (default="${bpe_char_cover}").
+
+    # Language model related
+    --lm_tag          # Suffix to the result dir for language model training (default="${lm_tag}").
+    --lm_exp          # Specify the directory path for LM experiment.
+                      # If this option is specified, lm_tag is ignored (default="${lm_exp}").
+    --lm_stats_dir    # Specify the directory path for LM statistics (default="${lm_stats_dir}").
+    --lm_config       # Config for language model training (default="${lm_config}").
+    --lm_args         # Arguments for language model training (default="${lm_args}").
+                      # e.g., --lm_args "--max_epoch 10"
+                      # Note that it will overwrite args in lm config.
+    --use_word_lm     # Whether to use word language model (default="${use_word_lm}").
+    --word_vocab_size # Size of word vocabulary (default="${word_vocab_size}").
+    --num_splits_lm   # Number of splits for the LM corpus (default="${num_splits_lm}").
+
+    # ASR model related
+    --asr_tag          # Suffix to the result dir for asr model training (default="${asr_tag}").
+    --asr_exp          # Specify the directory path for ASR experiment.
+                       # If this option is specified, asr_tag is ignored (default="${asr_exp}").
+    --asr_stats_dir    # Specify the directory path for ASR statistics (default="${asr_stats_dir}").
+    --asr_config       # Config for asr model training (default="${asr_config}").
+    --asr_args         # Arguments for asr model training (default="${asr_args}").
+                       # e.g., --asr_args "--max_epoch 10"
+                       # Note that it will overwrite args in asr config.
+    --pretrained_model=          # Pretrained model to load (default="${pretrained_model}").
+    --ignore_init_mismatch=      # Ignore mismatch parameter init with pretrained model (default="${ignore_init_mismatch}").
+    --feats_normalize  # Normalization layer type (default="${feats_normalize}").
+    --num_splits_asr   # Number of splits for the ASR corpus (default="${num_splits_asr}").
+
+    # Decoding related
+    --inference_tag       # Suffix to the result dir for decoding (default="${inference_tag}").
+    --inference_config    # Config for decoding (default="${inference_config}").
+    --inference_args      # Arguments for decoding (default="${inference_args}").
+                          # e.g., --inference_args "--lm_weight 0.1"
+                          # Note that it will overwrite args in inference config.
+    --inference_lm        # Language model path for decoding (default="${inference_lm}").
+    --inference_asr_model # ASR model path for decoding (default="${inference_asr_model}").
+    --download_model      # Download a model from Model Zoo and use it for decoding (default="${download_model}").
+    --use_streaming       # Whether to use streaming decoding (default="${use_streaming}").
+    --use_maskctc         # Whether to use maskctc decoding (default="${use_maskctc}").
+
+    # [Task dependent] Set the datadir name created by local/data.sh
+    --train_set     # Name of training set (required).
+    --valid_set     # Name of validation set used for monitoring/tuning network training (required).
+    --test_sets     # Names of test sets.
+                    # Multiple items (e.g., both dev and eval sets) can be specified (required).
+    --bpe_train_text # Text file path of bpe training set.
+    --lm_train_text  # Text file path of language model training set.
+    --lm_dev_text   # Text file path of language model development set (default="${lm_dev_text}").
+    --lm_test_text  # Text file path of language model evaluation set (default="${lm_test_text}").
+    --nlsyms_txt    # Non-linguistic symbol list if existing (default="${nlsyms_txt}").
+    --cleaner       # Text cleaner (default="${cleaner}").
+    --g2p           # g2p method (default="${g2p}").
+    --lang          # The language type of corpus (default=${lang}).
+    --score_opts             # The options given to sclite scoring (default="${score_opts}").
+    --local_score_opts       # The options given to local/score.sh (default="${local_score_opts}").
+    --asr_speech_fold_length # fold_length for speech data during ASR training (default="${asr_speech_fold_length}").
+    --asr_text_fold_length   # fold_length for text data during ASR training (default="${asr_text_fold_length}").
+    --lm_fold_length         # fold_length for LM training (default="${lm_fold_length}").
+EOF
+)
+
+log "$0 $*"
+# Save command line args for logging (they will be lost after utils/parse_options.sh)
+run_args=$(utils/print_args.py $0 "$@")
+. utils/parse_options.sh
+
+if [ $# -ne 0 ]; then
+    log "${help_message}"
+    log "Error: No positional arguments are required."
+    exit 2
+fi
+
+# set absolute dump dir path
+dumpdir=${datadir}/${dumpdir}
+
+if [ -z "${inference_tag}" ]; then
+    if [ -n "${inference_config}" ]; then
+        inference_tag="$(basename "${inference_config}" .yaml)"
+    else
+        inference_tag=inference
+    fi
+
+    if "${use_k2}"; then
+        inference_tag+="_use_k2"
+        inference_tag+="_k2_ctc_decoding_${k2_ctc_decoding}"
+        inference_tag+="_use_nbest_rescoring_${use_nbest_rescoring}"
+    fi
+fi
+
+# ========================== Main stages start from here. ==========================
+
+if [ ${stage} -le 12 ] && [ ${stop_stage} -ge 12 ]; then
+    log "Stage 12: Decoding: training_dir=${asr_exp}"
+
+    if ${gpu_inference}; then
+        _cmd="${cuda_cmd}"
+        _ngpu=1
+    else
+        _cmd="${decode_cmd}"
+        _ngpu=0
+    fi
+
+    _opts=
+    if [ -n "${inference_config}" ]; then
+        _opts+="--config ${inference_config} "
+    fi
+
+    if "${use_lm}"; then
+        if "${use_word_lm}"; then
+            _opts+="--word_lm_train_config ${lm_exp}/config.yaml "
+            _opts+="--word_lm_file ${lm_exp}/${inference_lm} "
+        else
+            _opts+="--lm_train_config ${lm_exp}/config.yaml "
+            _opts+="--lm_file ${lm_exp}/${inference_lm} "
+        fi
+    fi
+
+    if "${use_ngram}"; then
+         _opts+="--ngram_file ${ngram_exp}/${inference_ngram}"
+         inference_tag=${inference_tag}.${inference_ngram}
+    fi
+
+    # 2. Generate run.sh
+    log "Generate '${asr_exp}/${inference_tag}/run.sh'. You can resume the process from stage 12 using this script"
+    mkdir -p "${asr_exp}/${inference_tag}"; echo "${run_args} --stage 12 \"\$@\"; exit \$?" > "${asr_exp}/${inference_tag}/run.sh"; chmod +x "${asr_exp}/${inference_tag}/run.sh"
+
+    if "${use_streaming}"; then
+        asr_inference_tool="funasr.bin.asr_inference_streaming"
+    elif "${use_maskctc}"; then
+        asr_inference_tool="funasr.bin.asr_inference_maskctc"
+    else
+        asr_inference_tool="funasr.bin.asr_inference_launch"
+    fi
+
+    for dset in ${test_sets}; do
+        if [ $feats_type == "ark_wav" ]; then
+            _data="${dumpdir}/wav/${dset}"
+        else
+            _data="${dumpdir}/$feats_type/${dset}"
+        fi
+        _dir="${asr_exp}/${inference_tag}/${inference_asr_model}/${dset}"
+        _logdir="${_dir}/logdir"
+
+        if [ -d ${_dir} ]; then
+            #echo "${_dir} is already exists. if you want to decode again, please delete this dir first."
+            rm -r ${_dir}
+        fi
+        mkdir -p "${_logdir}"
+
+        _scp=$scp
+        _type=kaldi_ark
+
+
+        # 1. Split the key file
+        key_file=${_data}/${_scp}
+        split_scps=""
+        if "${use_k2}"; then
+            # Now only _nj=1 is verified if using k2
+            _nj=1
+        else
+            _nj=$(min "${inference_nj}" "$(<${key_file} wc -l)")
+        fi
+
+        for n in $(seq "${_nj}"); do
+            split_scps+=" ${_logdir}/keys.${n}.scp"
+        done
+        # shellcheck disable=SC2086
+        utils/split_scp.pl "${key_file}" ${split_scps}
+
+        # 2. Submit decoding jobs
+        log "Decoding started... log: '${_logdir}/asr_inference.*.log'"
+        # shellcheck disable=SC2086
+        ${_cmd} --gpu "${_ngpu}" --max-jobs-run "${_nj}" JOB=1:"${_nj}" "${_logdir}"/asr_inference.JOB.log \
+            ${python} -m ${asr_inference_tool} \
+                --batch_size ${batch_size} \
+                --ngpu "${_ngpu}" \
+                --njob ${njob} \
+                --gpuid_list ${gpuid_list} \
+                --data_path_and_name_and_type "${_data}/${_scp},speech,${_type}" \
+                --key_file "${_logdir}"/keys.JOB.scp \
+                --asr_train_config "${asr_exp}"/config.yaml \
+                --asr_model_file "${asr_exp}"/"${inference_asr_model}" \
+                --output_dir "${_logdir}"/output.JOB \
+                --mode $mode \
+                ${_opts} ${inference_args}
+
+        # 3. Concatenate the output files from each job
+        for f in token token_int score text; do
+            if [ -f "${_logdir}/output.1/1best_recog/${f}" ]; then
+                for i in $(seq "${_nj}"); do
+                    cat "${_logdir}/output.${i}/1best_recog/${f}"
+                done | sort -k1 >"${_dir}/${f}"
+            fi
+        done
+        python utils/proce_text.py ${_dir}/text ${_dir}/text.proc
+        python utils/proce_text.py ${_data}/text ${_data}/text.proc
+        python utils/compute_wer.py ${_data}/text.proc ${_dir}/text.proc ${_dir}/text.cer
+        tail -n 3 ${_dir}/text.cer > ${_dir}/text.cer.txt
+        cat ${_dir}/text.cer.txt
+    done
+fi
+
+log "Successfully finished. [elapsed=${SECONDS}s]"
+
--- a/egs/aishell/tranformer/utils/error_rate_zh
+++ b/egs/aishell/tranformer/utils/error_rate_zh
@ -0,0 +1,370 @@
+#!/usr/bin/env python3
+# coding=utf8
+
+# Copyright  2021  Jiayu DU
+
+import sys
+import argparse
+import json
+import logging
+logging.basicConfig(stream=sys.stderr, level=logging.INFO, format='[%(levelname)s] %(message)s')
+
+DEBUG = None
+
+def GetEditType(ref_token, hyp_token):
+    if ref_token is None and hyp_token is not None:
+        return 'I'
+    elif ref_token is not None and hyp_token is None:
+        return 'D'
+    elif ref_token == hyp_token:
+        return 'C'
+    elif ref_token != hyp_token:
+        return 'S'
+    else:
+        raise RuntimeError
+
+class AlignmentArc:
+    def __init__(self, src, dst, ref, hyp):
+        self.src = src
+        self.dst = dst
+        self.ref = ref
+        self.hyp = hyp
+        self.edit_type = GetEditType(ref, hyp)
+
+def similarity_score_function(ref_token, hyp_token):
+    return 0 if (ref_token == hyp_token) else -1.0
+
+def insertion_score_function(token):
+    return -1.0
+
+def deletion_score_function(token):
+    return -1.0
+
+def EditDistance(
+        ref,
+        hyp, 
+        similarity_score_function = similarity_score_function,
+        insertion_score_function = insertion_score_function,
+        deletion_score_function = deletion_score_function):
+    assert(len(ref) != 0)
+    class DPState:
+        def __init__(self):
+            self.score = -float('inf')
+            # backpointer
+            self.prev_r = None
+            self.prev_h = None
+    
+    def print_search_grid(S, R, H, fstream):
+        print(file=fstream)
+        for r in range(R):
+            for h in range(H):
+                print(F'[{r},{h}]:{S[r][h].score:4.3f}:({S[r][h].prev_r},{S[r][h].prev_h}) ', end='', file=fstream)
+            print(file=fstream)
+
+    R = len(ref) + 1
+    H = len(hyp) + 1
+
+    # Construct DP search space, a (R x H) grid
+    S = [ [] for r in range(R) ]
+    for r in range(R):
+        S[r] = [ DPState() for x in range(H) ]
+
+    # initialize DP search grid origin, S(r = 0, h = 0)
+    S[0][0].score = 0.0
+    S[0][0].prev_r = None
+    S[0][0].prev_h = None
+
+    # initialize REF axis
+    for r in range(1, R):
+        S[r][0].score = S[r-1][0].score + deletion_score_function(ref[r-1])
+        S[r][0].prev_r = r-1
+        S[r][0].prev_h = 0
+
+    # initialize HYP axis
+    for h in range(1, H):
+        S[0][h].score = S[0][h-1].score + insertion_score_function(hyp[h-1])
+        S[0][h].prev_r = 0
+        S[0][h].prev_h = h-1
+
+    best_score = S[0][0].score
+    best_state = (0, 0)
+
+    for r in range(1, R):
+        for h in range(1, H):
+            sub_or_cor_score = similarity_score_function(ref[r-1], hyp[h-1])
+            new_score = S[r-1][h-1].score + sub_or_cor_score
+            if new_score >= S[r][h].score:
+                S[r][h].score = new_score
+                S[r][h].prev_r = r-1
+                S[r][h].prev_h = h-1
+
+            del_score = deletion_score_function(ref[r-1])
+            new_score = S[r-1][h].score + del_score
+            if new_score >= S[r][h].score:
+                S[r][h].score = new_score
+                S[r][h].prev_r = r - 1
+                S[r][h].prev_h = h
+
+            ins_score = insertion_score_function(hyp[h-1])
+            new_score = S[r][h-1].score + ins_score
+            if new_score >= S[r][h].score:
+                S[r][h].score = new_score
+                S[r][h].prev_r = r
+                S[r][h].prev_h = h-1
+
+    best_score = S[R-1][H-1].score
+    best_state = (R-1, H-1)
+
+    if DEBUG:
+        print_search_grid(S, R, H, sys.stderr)
+
+    # Backtracing best alignment path, i.e. a list of arcs
+    # arc = (src, dst, ref, hyp, edit_type)
+    # src/dst = (r, h), where r/h refers to search grid state-id along Ref/Hyp axis
+    best_path = []
+    r, h = best_state[0], best_state[1]
+    prev_r, prev_h = S[r][h].prev_r, S[r][h].prev_h
+    score = S[r][h].score
+    # loop invariant:
+    #   1. (prev_r, prev_h) -> (r, h) is a "forward arc" on best alignment path
+    #   2. score is the value of point(r, h) on DP search grid
+    while prev_r is not None or prev_h is not None:
+        src = (prev_r, prev_h)
+        dst = (r, h)
+        if (r == prev_r + 1 and h == prev_h + 1): # Substitution or correct
+            arc = AlignmentArc(src, dst, ref[prev_r], hyp[prev_h])
+        elif (r == prev_r + 1 and h == prev_h): # Deletion
+            arc = AlignmentArc(src, dst, ref[prev_r], None)
+        elif (r == prev_r and h == prev_h + 1): # Insertion
+            arc = AlignmentArc(src, dst, None, hyp[prev_h])
+        else:
+            raise RuntimeError
+        best_path.append(arc)
+        r, h = prev_r, prev_h
+        prev_r, prev_h = S[r][h].prev_r, S[r][h].prev_h
+        score = S[r][h].score
+    
+    best_path.reverse()
+    return (best_path, best_score)
+
+def PrettyPrintAlignment(alignment, stream = sys.stderr):
+    def get_token_str(token):
+        if token is None:
+            return "*"
+        return token
+    
+    def is_double_width_char(ch):
+        if (ch >= '\u4e00') and (ch <= '\u9fa5'): # codepoint ranges for Chinese chars
+            return True
+        # TODO: support other double-width-char language such as Japanese, Korean 
+        else:
+            return False
+    
+    def display_width(token_str):
+        m = 0
+        for c in token_str:
+            if is_double_width_char(c):
+                m += 2
+            else:
+                m += 1
+        return m
+
+    R = '  REF  : '
+    H = '  HYP  : '
+    E = '  EDIT : '
+    for arc in alignment:
+        r = get_token_str(arc.ref)
+        h = get_token_str(arc.hyp)
+        e = arc.edit_type if arc.edit_type != 'C' else ''
+
+        nr, nh, ne = display_width(r), display_width(h), display_width(e)
+        n = max(nr, nh, ne) + 1
+
+        R += r + ' ' * (n-nr)
+        H += h + ' ' * (n-nh)
+        E += e + ' ' * (n-ne)
+
+    print(R, file=stream)
+    print(H, file=stream)
+    print(E, file=stream)
+
+def CountEdits(alignment):
+    c, s, i, d = 0, 0, 0, 0
+    for arc in alignment:
+        if arc.edit_type == 'C':
+            c += 1
+        elif arc.edit_type == 'S':
+            s += 1
+        elif arc.edit_type == 'I':
+            i += 1
+        elif arc.edit_type == 'D':
+            d += 1
+        else:
+            raise RuntimeError
+    return (c, s, i, d)
+
+def ComputeTokenErrorRate(c, s, i, d):
+    return 100.0 * (s + d + i) / (s + d + c)
+
+def ComputeSentenceErrorRate(num_err_utts, num_utts):
+    assert(num_utts != 0)
+    return 100.0 * num_err_utts / num_utts
+
+
+class EvaluationResult:
+    def __init__(self):
+        self.num_ref_utts = 0
+        self.num_hyp_utts = 0
+        self.num_eval_utts = 0 # seen in both ref & hyp
+        self.num_hyp_without_ref = 0
+
+        self.C = 0
+        self.S = 0
+        self.I = 0
+        self.D = 0
+        self.token_error_rate = 0.0
+
+        self.num_utts_with_error = 0
+        self.sentence_error_rate = 0.0
+    
+    def to_json(self):
+        return json.dumps(self.__dict__)
+    
+    def to_kaldi(self):
+        info = (
+            F'%WER {self.token_error_rate:.2f} [ {self.S + self.D + self.I} / {self.C + self.S + self.D}, {self.I} ins, {self.D} del, {self.S} sub ]\n'
+            F'%SER {self.sentence_error_rate:.2f} [ {self.num_utts_with_error} / {self.num_eval_utts} ]\n'
+        )
+        return info
+    
+    def to_sclite(self):
+        return "TODO"
+    
+    def to_espnet(self):
+        return "TODO"
+    
+    def to_summary(self):
+        #return json.dumps(self.__dict__, indent=4)
+        summary = (
+            '==================== Overall Statistics ====================\n'
+            F'num_ref_utts: {self.num_ref_utts}\n'
+            F'num_hyp_utts: {self.num_hyp_utts}\n'
+            F'num_hyp_without_ref: {self.num_hyp_without_ref}\n'
+            F'num_eval_utts: {self.num_eval_utts}\n'
+            F'sentence_error_rate: {self.sentence_error_rate:.2f}%\n'
+            F'token_error_rate: {self.token_error_rate:.2f}%\n'
+            F'token_stats:\n'
+            F'  - tokens:{self.C + self.S + self.D:>7}\n'
+            F'  - edits: {self.S + self.I + self.D:>7}\n'
+            F'  - cor:   {self.C:>7}\n'
+            F'  - sub:   {self.S:>7}\n'
+            F'  - ins:   {self.I:>7}\n'
+            F'  - del:   {self.D:>7}\n'
+            '============================================================\n'
+        )
+        return summary
+
+
+class Utterance:
+    def __init__(self, uid, text):
+        self.uid = uid
+        self.text = text
+
+
+def LoadUtterances(filepath, format):
+    utts = {}
+    if format == 'text': # utt_id word1 word2 ...
+        with open(filepath, 'r', encoding='utf8') as f:
+            for line in f:
+                line = line.strip()
+                if line:
+                    cols = line.split(maxsplit=1)
+                    assert(len(cols) == 2 or len(cols) == 1)
+                    uid = cols[0]
+                    text = cols[1] if len(cols) == 2 else ''
+                    if utts.get(uid) is not None:
+                        raise RuntimeError(F'Found duplicated utterance id {uid}')
+                    utts[uid] = Utterance(uid, text)
+    else:
+        raise RuntimeError(F'Unsupported text format {format}')
+    return utts
+
+
+def tokenize_text(text, tokenizer):
+    if tokenizer == 'whitespace':
+        return text.split()
+    elif tokenizer == 'char':
+        return [ ch for ch in ''.join(text.split()) ]
+    else:
+        raise RuntimeError(F'ERROR: Unsupported tokenizer {tokenizer}')
+
+
+if __name__ == '__main__':
+    parser = argparse.ArgumentParser()
+    # optional
+    parser.add_argument('--tokenizer', choices=['whitespace', 'char'], default='whitespace', help='whitespace for WER, char for CER')
+    parser.add_argument('--ref-format', choices=['text'], default='text', help='reference format, first col is utt_id, the rest is text')
+    parser.add_argument('--hyp-format', choices=['text'], default='text', help='hypothesis format, first col is utt_id, the rest is text')
+    # required
+    parser.add_argument('--ref', type=str, required=True, help='input reference file')
+    parser.add_argument('--hyp', type=str, required=True, help='input hypothesis file')
+
+    parser.add_argument('result_file', type=str)
+    args = parser.parse_args()
+    logging.info(args)
+
+    ref_utts = LoadUtterances(args.ref, args.ref_format)
+    hyp_utts = LoadUtterances(args.hyp, args.hyp_format)
+
+    r = EvaluationResult()
+
+    # check valid utterances in hyp that have matched non-empty reference
+    eval_utts = []
+    r.num_hyp_without_ref = 0
+    for uid in sorted(hyp_utts.keys()):
+        if uid in ref_utts.keys(): # TODO: efficiency
+            if ref_utts[uid].text.strip(): # non-empty reference
+                eval_utts.append(uid)
+            else:
+                logging.warning(F'Found {uid} with empty reference, skipping...')
+        else:
+            logging.warning(F'Found {uid} without reference, skipping...')
+            r.num_hyp_without_ref += 1
+
+    r.num_hyp_utts = len(hyp_utts)
+    r.num_ref_utts = len(ref_utts)
+    r.num_eval_utts = len(eval_utts)
+
+    with open(args.result_file, 'w+', encoding='utf8') as fo:
+        for uid in eval_utts:
+            ref = ref_utts[uid]
+            hyp = hyp_utts[uid]
+
+            alignment, score = EditDistance(
+                tokenize_text(ref.text, args.tokenizer),
+                tokenize_text(hyp.text, args.tokenizer)
+            )
+            
+            c, s, i, d = CountEdits(alignment)
+            utt_ter = ComputeTokenErrorRate(c, s, i, d)
+
+            # utt-level evaluation result
+            print(F'{{"uid":{uid}, "score":{score}, "ter":{utt_ter:.2f}, "cor":{c}, "sub":{s}, "ins":{i}, "del":{d}}}', file=fo)
+            PrettyPrintAlignment(alignment, fo)
+
+            r.C += c
+            r.S += s
+            r.I += i
+            r.D += d
+
+            if utt_ter > 0:
+                r.num_utts_with_error += 1
+
+        # corpus level evaluation result
+        r.sentence_error_rate = ComputeSentenceErrorRate(r.num_utts_with_error, r.num_eval_utts)
+        r.token_error_rate = ComputeTokenErrorRate(r.C, r.S, r.I, r.D)
+
+        print(r.to_summary(), file=fo)
+
+    print(r.to_json())
+    print(r.to_kaldi())
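Note on `error_rate_zh`: the script reports the token error rate as (S + D + I) / (S + D + C), i.e. the number of edits divided by the reference length. The following is a minimal standalone sketch of that computation using a plain Levenshtein table with unit costs. The names `edit_counts` and `token_error_rate` are illustrative and do not appear in the script, and when several alignments tie, the tie-breaking here may differ from the script's.

```python
def edit_counts(ref, hyp):
    """Return (cor, sub, ins, del) for one minimum-edit alignment of ref vs hyp."""
    R, H = len(ref), len(hyp)
    # cost[r][h] = minimum number of edits turning ref[:r] into hyp[:h]
    cost = [[0] * (H + 1) for _ in range(R + 1)]
    for r in range(1, R + 1):
        cost[r][0] = r  # delete every reference token
    for h in range(1, H + 1):
        cost[0][h] = h  # insert every hypothesis token
    for r in range(1, R + 1):
        for h in range(1, H + 1):
            diag = cost[r - 1][h - 1] + int(ref[r - 1] != hyp[h - 1])
            cost[r][h] = min(diag, cost[r - 1][h] + 1, cost[r][h - 1] + 1)

    # Backtrace from the bottom-right corner, counting each edit type.
    c = s = i = d = 0
    r, h = R, H
    while r > 0 or h > 0:
        if r > 0 and h > 0 and cost[r][h] == cost[r - 1][h - 1] + int(ref[r - 1] != hyp[h - 1]):
            if ref[r - 1] == hyp[h - 1]:
                c += 1
            else:
                s += 1
            r, h = r - 1, h - 1
        elif r > 0 and cost[r][h] == cost[r - 1][h] + 1:
            d += 1
            r -= 1
        else:
            i += 1
            h -= 1
    return c, s, i, d


def token_error_rate(c, s, i, d):
    # Same formula as ComputeTokenErrorRate: edits over reference length.
    return 100.0 * (s + d + i) / (s + d + c)
```

Because insertions count as errors but do not add to the reference length, the rate can exceed 100% when the hypothesis is much longer than the reference.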
--- a/egs/aishell/tranformer/utils/extract_embeds.py
+++ b/egs/aishell/tranformer/utils/extract_embeds.py
@ -0,0 +1,47 @@
+from transformers import AutoTokenizer, AutoModel, pipeline
+import numpy as np
+import sys
+import os
+import torch
+from kaldiio import WriteHelper
+import re
+text_file_json = sys.argv[1]
+out_ark = sys.argv[2]
+out_scp = sys.argv[3]
+out_shape = sys.argv[4]
+device = int(sys.argv[5])
+model_path = sys.argv[6]
+
+model = AutoModel.from_pretrained(model_path)
+tokenizer = AutoTokenizer.from_pretrained(model_path)
+extractor = pipeline(task="feature-extraction", model=model, tokenizer=tokenizer, device=device)
+
+with open(text_file_json, 'r') as f:
+    js = f.readlines()
+
+
+f_shape = open(out_shape, "w")
+with WriteHelper('ark,scp:{},{}'.format(out_ark, out_scp)) as writer:
+    with torch.no_grad():
+        for idx, line in enumerate(js):
+            id, tokens = line.strip().split(" ", 1)
+            tokens = re.sub(" ", "", tokens.strip())
+            tokens = ' '.join([j for j in tokens])
+            token_num = len(tokens.split(" "))
+            outputs = extractor(tokens)
+            outputs = np.array(outputs)
+            embeds = outputs[0, 1:-1, :]
+
+            token_num_embeds, dim = embeds.shape
+            if token_num == token_num_embeds:
+                writer(id, embeds)
+                shape_line = "{} {},{}\n".format(id, token_num_embeds, dim)
+                f_shape.write(shape_line)
+            else:
+                print("{}, size has changed, {}, {}, {}".format(id, token_num, token_num_embeds, tokens))
+
+
+
+f_shape.close()
+
+
--- a/egs/aishell/tranformer/utils/filter_scp.pl
+++ b/egs/aishell/tranformer/utils/filter_scp.pl
@ -0,0 +1,87 @@
+#!/usr/bin/env perl
+# Copyright 2010-2012 Microsoft Corporation
+#                     Johns Hopkins University (author: Daniel Povey)
+
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#  http://www.apache.org/licenses/LICENSE-2.0
+#
+# THIS CODE IS PROVIDED *AS IS* BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY IMPLIED
+# WARRANTIES OR CONDITIONS OF TITLE, FITNESS FOR A PARTICULAR PURPOSE,
+# MERCHANTABLITY OR NON-INFRINGEMENT.
+# See the Apache 2 License for the specific language governing permissions and
+# limitations under the License.
+
+
+# This script takes a list of utterance-ids or any file whose first field
+# of each line is an utterance-id, and filters an scp
+# file (or any file whose "n-th" field is an utterance id), printing
+# out only those lines whose "n-th" field is in id_list. The index of
+# the "n-th" field is 1, by default, but can be changed by using
+# the -f <n> switch
+
+$exclude = 0;
+$field = 1;
+$shifted = 0;
+
+do {
+  $shifted=0;
+  if ($ARGV[0] eq "--exclude") {
+    $exclude = 1;
+    shift @ARGV;
+    $shifted=1;
+  }
+  if ($ARGV[0] eq "-f") {
+    $field = $ARGV[1];
+    shift @ARGV; shift @ARGV;
+    $shifted=1
+  }
+} while ($shifted);
+
+if(@ARGV < 1 || @ARGV > 2) {
+  die "Usage: filter_scp.pl [--exclude] [-f <field-to-filter-on>] id_list [in.scp] > out.scp \n" .
+      "Prints only the input lines whose f'th field (default: first) is in 'id_list'.\n" .
+      "Note: only the first field of each line in id_list matters.  With --exclude, prints\n" .
+      "only the lines that were *not* in id_list.\n" .
+      "Caution: previously, the -f option was interpreted as a zero-based field index.\n" .
+      "If your older scripts (written before Oct 2014) stopped working and you used the\n" .
+      "-f option, add 1 to the argument.\n" .
+      "See also: scripts/filter_scp.pl .\n";
+}
+
+
+$idlist = shift @ARGV;
+open(F, "<$idlist") || die "Could not open id-list file $idlist";
+while(<F>) {
+  @A = split;
+  @A>=1 || die "Invalid id-list file line $_";
+  $seen{$A[0]} = 1;
+}
+
+if ($field == 1) { # Treat this as special case, since it is common.
+  while(<>) {
+    $_ =~ m/\s*(\S+)\s*/ || die "Bad line $_, could not get first field.";
+    # $1 is what we filter on.
+    if ((!$exclude && $seen{$1}) || ($exclude && !defined $seen{$1})) {
+      print $_;
+    }
+  }
+} else {
+  while(<>) {
+    @A = split;
+    @A > 0 || die "Invalid scp file line $_";
+    @A >= $field || die "Invalid scp file line $_";
+    if ((!$exclude && $seen{$A[$field-1]}) || ($exclude && !defined $seen{$A[$field-1]})) {
+      print $_;
+    }
+  }
+}
+
+# tests:
+# the following should print "foo 1"
+# ( echo foo 1; echo bar 2 ) | scripts/filter_scp.pl <(echo foo)
+# the following should print "bar 2".
+# ( echo foo 1; echo bar 2 ) | scripts/filter_scp.pl -f 2 <(echo 2)
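Note on `filter_scp.pl`: the filtering rule described in the header comment can be sketched in Python as below. This is an illustration of the logic only, not a replacement for the Perl script. The function name `filter_scp` and its keyword arguments are assumptions made for this sketch, and it works on lists of lines rather than files.

```python
def filter_scp(id_list_lines, scp_lines, field=1, exclude=False):
    """Keep the scp lines whose `field`-th (1-based) column is in the id list.

    With exclude=True, keep the lines whose column is NOT in the id list.
    Only the first column of each id-list line is used.
    """
    seen = set()
    for line in id_list_lines:
        cols = line.split()
        if not cols:
            raise ValueError("Invalid id-list line: {!r}".format(line))
        seen.add(cols[0])

    kept = []
    for line in scp_lines:
        cols = line.split()
        if len(cols) < field:
            raise ValueError("Invalid scp line: {!r}".format(line))
        if (cols[field - 1] in seen) != exclude:
            kept.append(line)
    return kept
```

The two cases in the script's trailing "tests" comment correspond to `filter_scp(["foo"], ["foo 1", "bar 2"])` and `filter_scp(["2"], ["foo 1", "bar 2"], field=2)`.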
--- a/egs/aishell/tranformer/utils/fix_data.sh
+++ b/egs/aishell/tranformer/utils/fix_data.sh
@ -0,0 +1,35 @@
+#!/usr/bin/env bash
+
+echo "$0 $@"
+data_dir=$1
+
+if [ ! -f ${data_dir}/wav.scp ]; then
+  echo "$0: wav.scp is not found"
+  exit 1;
+fi
+
+if [ ! -f ${data_dir}/text ]; then
+  echo "$0: text is not found"
+  exit 1;
+fi
+
+
+
+mkdir -p ${data_dir}/.backup
+
+awk '{print $1}' ${data_dir}/wav.scp > ${data_dir}/.backup/wav_id
+awk '{print $1}' ${data_dir}/text > ${data_dir}/.backup/text_id
+
+sort ${data_dir}/.backup/wav_id ${data_dir}/.backup/text_id | uniq -d > ${data_dir}/.backup/id
+
+cp ${data_dir}/wav.scp ${data_dir}/.backup/wav.scp
+cp ${data_dir}/text ${data_dir}/.backup/text
+
+mv ${data_dir}/wav.scp ${data_dir}/wav.scp.bak
+mv ${data_dir}/text ${data_dir}/text.bak
+
+utils/filter_scp.pl -f 1 ${data_dir}/.backup/id ${data_dir}/wav.scp.bak > ${data_dir}/wav.scp
+utils/filter_scp.pl -f 1 ${data_dir}/.backup/id ${data_dir}/text.bak > ${data_dir}/text
+
+rm ${data_dir}/wav.scp.bak
+rm ${data_dir}/text.bak
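Note on `fix_data.sh`: the `sort ... | uniq -d` step keeps the utterance ids that occur in both `wav.scp` and `text`, provided each id appears at most once per file. A hedged Python sketch of the same idea follows. The name `common_ids` is illustrative. It uses a set intersection, so unlike `uniq -d` it would not report an id that is merely duplicated inside a single file.

```python
def common_ids(wav_scp_lines, text_lines):
    """Return the sorted utterance ids present in both files (first column)."""
    wav_ids = {line.split()[0] for line in wav_scp_lines if line.strip()}
    text_ids = {line.split()[0] for line in text_lines if line.strip()}
    return sorted(wav_ids & text_ids)
```

The resulting id list plays the role of `.backup/id`, which is then passed to `utils/filter_scp.pl` to trim both files to the shared utterances.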
--- a/egs/aishell/tranformer/utils/fix_data_feat.sh
+++ b/egs/aishell/tranformer/utils/fix_data_feat.sh
@ -0,0 +1,52 @@
+#!/usr/bin/env bash
+
+echo "$0 $@"
+data_dir=$1
+
+if [ ! -f ${data_dir}/feats.scp ]; then
+  echo "$0: feats.scp is not found"
+  exit 1;
+fi
+
+if [ ! -f ${data_dir}/text ]; then
+  echo "$0: text is not found"
+  exit 1;
+fi
+
+if [ ! -f ${data_dir}/speech_shape ]; then
+  echo "$0: speech_shape (feature lengths) is not found"
+  exit 1;
+fi
+
+if [ ! -f ${data_dir}/text_shape ]; then
+  echo "$0: text_shape (text lengths) is not found"
+  exit 1;
+fi
+
+mkdir -p ${data_dir}/.backup
+
+awk '{print $1}' ${data_dir}/feats.scp > ${data_dir}/.backup/wav_id
+awk '{print $1}' ${data_dir}/text > ${data_dir}/.backup/text_id
+
+sort ${data_dir}/.backup/wav_id ${data_dir}/.backup/text_id | uniq -d > ${data_dir}/.backup/id
+
+cp ${data_dir}/feats.scp ${data_dir}/.backup/feats.scp
+cp ${data_dir}/text ${data_dir}/.backup/text
+cp ${data_dir}/speech_shape ${data_dir}/.backup/speech_shape
+cp ${data_dir}/text_shape ${data_dir}/.backup/text_shape
+
+mv ${data_dir}/feats.scp ${data_dir}/feats.scp.bak
+mv ${data_dir}/text ${data_dir}/text.bak
+mv ${data_dir}/speech_shape ${data_dir}/speech_shape.bak
+mv ${data_dir}/text_shape ${data_dir}/text_shape.bak
+
+utils/filter_scp.pl -f 1 ${data_dir}/.backup/id ${data_dir}/feats.scp.bak > ${data_dir}/feats.scp
+utils/filter_scp.pl -f 1 ${data_dir}/.backup/id ${data_dir}/text.bak > ${data_dir}/text
+utils/filter_scp.pl -f 1 ${data_dir}/.backup/id ${data_dir}/speech_shape.bak > ${data_dir}/speech_shape
+utils/filter_scp.pl -f 1 ${data_dir}/.backup/id ${data_dir}/text_shape.bak > ${data_dir}/text_shape
+
+rm ${data_dir}/feats.scp.bak
+rm ${data_dir}/text.bak
+rm ${data_dir}/speech_shape.bak
+rm ${data_dir}/text_shape.bak
+
--- a/egs/aishell/tranformer/utils/gen_ark_list.sh
+++ b/egs/aishell/tranformer/utils/gen_ark_list.sh
@ -0,0 +1,20 @@
+#!/usr/bin/env bash
+
+
+# Begin configuration section.
+nj=4
+cmd=./utils/run.pl
+
+echo "$0 $@"
+
+. utils/parse_options.sh || exit 1;
+
+data=$1
+
+[ ! -d ${data}/ark ] && echo "$0: ark data is required" && exit 1;
+[ ! -d ${data}/txt ] && echo "$0: txt data is required" && exit 1;
+
+for n in $(seq $nj); do
+  echo "$data/ark/feats.$n.ark $data/txt/text.$n" || exit 1
+done > $data/ark_txt.scp || exit 1
+
--- a/egs/aishell/tranformer/utils/parse_options.sh
+++ b/egs/aishell/tranformer/utils/parse_options.sh
@ -0,0 +1,97 @@
+#!/usr/bin/env bash
+
+# Copyright 2012  Johns Hopkins University (Author: Daniel Povey);
+#                 Arnab Ghoshal, Karel Vesely
+
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#  http://www.apache.org/licenses/LICENSE-2.0
+#
+# THIS CODE IS PROVIDED *AS IS* BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY IMPLIED
+# WARRANTIES OR CONDITIONS OF TITLE, FITNESS FOR A PARTICULAR PURPOSE,
+# MERCHANTABLITY OR NON-INFRINGEMENT.
+# See the Apache 2 License for the specific language governing permissions and
+# limitations under the License.
+
+
+# Parse command-line options.
+# To be sourced by another script (as in ". parse_options.sh").
+# Option format is: --option-name arg
+# and shell variable "option_name" gets set to value "arg."
+# The exception is --help, which takes no arguments, but prints the
+# $help_message variable (if defined).
+
+
+###
+### The --config file options have lower priority to command line
+### options, so we need to import them first...
+###
+
+# Now import all the configs specified by command-line, in left-to-right order
+for ((argpos=1; argpos<$#; argpos++)); do
+  if [ "${!argpos}" == "--config" ]; then
+    argpos_plus1=$((argpos+1))
+    config=${!argpos_plus1}
+    [ ! -r $config ] && echo "$0: missing config '$config'" && exit 1
+    . $config  # source the config file.
+  fi
+done
+
+
+###
+### Now we process the command line options
+###
+while true; do
+  [ -z "${1:-}" ] && break;  # break if there are no arguments
+  case "$1" in
+    # If the enclosing script is called with --help option, print the help
+    # message and exit.  Scripts should put help messages in $help_message
+    --help|-h) if [ -z "$help_message" ]; then echo "No help found." 1>&2;
+      else printf "$help_message\n" 1>&2 ; fi;
+      exit 0 ;;
+    --*=*) echo "$0: options to scripts must be of the form --name value, got '$1'"
+      exit 1 ;;
+    # If the first command-line argument begins with "--" (e.g. --foo-bar),
+    # then work out the variable name as $name, which will equal "foo_bar".
+    --*) name=`echo "$1" | sed s/^--// | sed s/-/_/g`;
+      # Next we test whether the variable in question is undefined-- if so it's
+      # an invalid option and we die.  Note: $0 evaluates to the name of the
+      # enclosing script.
+      # The test [ -z ${foo_bar+xxx} ] will return true if the variable foo_bar
+      # is undefined.  We then have to wrap this test inside "eval" because
+      # foo_bar is itself inside a variable ($name).
+      eval '[ -z "${'$name'+xxx}" ]' && echo "$0: invalid option $1" 1>&2 && exit 1;
+
+      oldval="`eval echo \\$$name`";
+      # Work out whether we seem to be expecting a Boolean argument.
+      if [ "$oldval" == "true" ] || [ "$oldval" == "false" ]; then
+        was_bool=true;
+      else
+        was_bool=false;
+      fi
+
+      # Set the variable to the right value-- the escaped quotes make it work if
+      # the option had spaces, like --cmd "queue.pl -sync y"
+      eval $name=\"$2\";
+
+      # Check that Boolean-valued arguments are really Boolean.
+      if $was_bool && [[ "$2" != "true" && "$2" != "false" ]]; then
+        echo "$0: expected \"true\" or \"false\": $1 $2" 1>&2
+        exit 1;
+      fi
+      shift 2;
+      ;;
+  *) break;
+  esac
+done
+
+
+# Check for an empty argument to the --cmd option, which can easily occur as a
+# result of scripting errors.
+[ ! -z "${cmd+xxx}" ] && [ -z "$cmd" ] && echo "$0: empty argument to --cmd option" 1>&2 && exit 1;
+
+
+true; # so this script returns exit code 0.
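Note on `parse_options.sh`: the convention it implements is that leading `--option-name value` pairs overwrite already-defined variables named `option_name`, unknown options are rejected, and variables whose default is `true` or `false` only accept those two values. The sketch below mirrors that convention in Python under simplifying assumptions: `--help` and `--config` handling are omitted, a missing value is treated as an error, and the function name `parse_options` and its dict-based interface are invented for illustration.

```python
def parse_options(defaults, argv):
    """Parse leading '--name value' pairs; return (options, remaining_args).

    `defaults` maps variable names to string defaults, like the shell
    variables a calling script defines before sourcing parse_options.sh.
    """
    opts = dict(defaults)
    args = list(argv)
    while args and args[0].startswith("--"):
        flag = args[0]
        if "=" in flag:
            raise ValueError("options must be of the form --name value, got {!r}".format(flag))
        name = flag[2:].replace("-", "_")  # --foo-bar -> foo_bar
        if name not in opts:
            raise ValueError("invalid option {}".format(flag))
        if len(args) < 2:
            raise ValueError("missing value for {}".format(flag))
        value = args[1]
        # A default of "true"/"false" marks the option as Boolean.
        if opts[name] in ("true", "false") and value not in ("true", "false"):
            raise ValueError('expected "true" or "false": {} {}'.format(flag, value))
        opts[name] = value
        args = args[2:]
    return opts, args
```

For example, a script with defaults `nj=4` and `cmd=./utils/run.pl`, called with `--nj 8 data/train`, ends up with `nj=8` and one positional argument left.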
--- a/egs/aishell/tranformer/utils/print_args.py
+++ b/egs/aishell/tranformer/utils/print_args.py
@ -0,0 +1,45 @@
+#!/usr/bin/env python
+import sys
+
+
+def get_commandline_args(no_executable=True):
+    extra_chars = [
+        " ",
+        ";",
+        "&",
+        "|",
+        "<",
+        ">",
+        "?",
+        "*",
+        "~",
+        "`",
+        '"',
+        "'",
+        "\\",
+        "{",
+        "}",
+        "(",
+        ")",
+    ]
+
+    # Escape the extra characters for shell
+    argv = [
+        arg.replace("'", "'\\''")
+        if all(char not in arg for char in extra_chars)
+        else "'" + arg.replace("'", "'\\''") + "'"
+        for arg in sys.argv
+    ]
+
+    if no_executable:
+        return " ".join(argv[1:])
+    else:
+        return sys.executable + " " + " ".join(argv)
+
+
+def main():
+    print(get_commandline_args())
+
+
+if __name__ == "__main__":
+    main()
--- a/egs/aishell/tranformer/utils/proc_conf_oss.py
+++ b/egs/aishell/tranformer/utils/proc_conf_oss.py
@ -0,0 +1,35 @@
+from pathlib import Path
+
+import torch
+import yaml
+
+
+class NoAliasSafeDumper(yaml.SafeDumper):
+    # Disable anchor/alias in yaml because it looks ugly
+    def ignore_aliases(self, data):
+        return True
+
+
+def yaml_no_alias_safe_dump(data, stream=None, **kwargs):
+    """Safe-dump in yaml with no anchor/alias"""
+    return yaml.dump(
+        data, stream, allow_unicode=True, Dumper=NoAliasSafeDumper, **kwargs
+    )
+
+
+def gen_conf(file, out_dir):
+    conf = torch.load(file)["config"]
+    conf["oss_bucket"] = "null"
+    print(conf)
+    output_dir = Path(out_dir)
+    output_dir.mkdir(parents=True, exist_ok=True)
+    with (output_dir / "config.yaml").open("w", encoding="utf-8") as f:
+        yaml_no_alias_safe_dump(conf, f, indent=4, sort_keys=False)
+
+
+if __name__ == "__main__":
+    import sys
+
+    in_f = sys.argv[1]
+    out_f = sys.argv[2]
+    gen_conf(in_f, out_f)
--- a/egs/aishell/tranformer/utils/proce_text.py
+++ b/egs/aishell/tranformer/utils/proce_text.py
@ -0,0 +1,31 @@
+
+import sys
+import re
+
+in_f = sys.argv[1]
+out_f = sys.argv[2]
+
+
+with open(in_f, "r", encoding="utf-8") as f:
+  lines = f.readlines()
+
+with open(out_f, "w", encoding="utf-8") as f:
+  for line in lines:
+    outs = line.strip().split(" ", 1)
+    if len(outs) == 2:
+      idx, text = outs
+      text = re.sub("</s>", "", text)
+      text = re.sub("<s>", "", text)
+      text = re.sub("@@", "", text)
+      text = re.sub("@", "", text)
+      text = re.sub("<unk>", "", text)
+      text = re.sub(" ", "", text)
+      text = text.lower()
+    else:
+      idx = outs[0]
+      text = " "
+
+    text = [x for x in text]
+    text = " ".join(text)
+    out = "{} {}\n".format(idx, text)
+    f.write(out)
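Note on `proce_text.py`: each line is normalized by removing the `<s>`, `</s>`, `<unk>` and `@@`/`@` markers and all spaces, lowercasing, and then re-inserting a single space between characters so that the result can be scored per character. A small function-style sketch of the same steps is given below. The name `normalize_line` is illustrative, and it uses `str.replace` where the script uses `re.sub` with literal patterns.

```python
def normalize_line(line):
    """Return 'uid c1 c2 ...' with special markers removed and text lowercased."""
    parts = line.strip().split(" ", 1)
    if len(parts) == 2:
        uid, text = parts
        # Order matters: remove "</s>" before "<s>" and "@@" before "@".
        for marker in ("</s>", "<s>", "@@", "@", "<unk>", " "):
            text = text.replace(marker, "")
        text = text.lower()
    else:
        # No transcript: keep the id and emit a single blank as the text.
        uid, text = parts[0], " "
    return "{} {}".format(uid, " ".join(text))
```

Running both the reference and the hypothesis through the same normalization keeps the later error-rate comparison character-based on both sides.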
--- a/egs/aishell/tranformer/utils/run.pl
+++ b/egs/aishell/tranformer/utils/run.pl
@ -0,0 +1,356 @@
+#!/usr/bin/env perl
+use warnings; #sed replacement for -w perl parameter
+# In general, doing
+#  run.pl some.log a b c is like running the command a b c in
+# the bash shell, and putting the standard error and output into some.log.
+# To run parallel jobs (backgrounded on the host machine), you can do (e.g.)
+#  run.pl JOB=1:4 some.JOB.log a b c JOB is like running the command a b c JOB
+# and putting it in some.JOB.log, for each one. [Note: JOB can be any identifier].
+# If any of the jobs fails, this script will fail.
+
+# A typical example is:
+#  run.pl some.log my-prog "--opt=foo bar" foo \|  other-prog baz
+# and run.pl will run something like:
+# ( my-prog '--opt=foo bar' foo |  other-prog baz ) >& some.log
+#
+# Basically it takes the command-line arguments, quotes them
+# as necessary to preserve spaces, and evaluates them with bash.
+# In addition it puts the command line at the top of the log, and
+# the start and end times of the command at the beginning and end.
+# The reason why this is useful is so that we can create a different
+# version of this program that uses a queueing system instead.
+
+#use Data::Dumper;
+
+@ARGV < 2 && die "usage: run.pl log-file command-line arguments...";
+
+#print STDERR "COMMAND-LINE: " .  Dumper(\@ARGV) . "\n";
+$job_pick = 'all';
+$max_jobs_run = -1;
+$jobstart = 1;
+$jobend = 1;
+$ignored_opts = ""; # These will be ignored.
+
+# First parse an option like JOB=1:4, and any
+# options that would normally be given to
+# queue.pl, which we will just discard.
+
+for (my $x = 1; $x <= 2; $x++) { # This for-loop is to
+  # allow the JOB=1:n option to be interleaved with the
+  # options to qsub.
+  while (@ARGV >= 2 && $ARGV[0] =~ m:^-:) {
+    # parse any options that would normally go to qsub, but which will be ignored here.
+    my $switch = shift @ARGV;
+    if ($switch eq "-V") {
+      $ignored_opts .= "-V ";
+    } elsif ($switch eq "--max-jobs-run" || $switch eq "-tc") {
+      # we do support the option --max-jobs-run n, and its GridEngine form -tc n.
+      # If the option appears multiple times, the smallest value is used.
+      if ( $max_jobs_run <= 0 ) {
+          $max_jobs_run =  shift @ARGV;
+      } else {
+        my $new_constraint = shift @ARGV;
+        if ( ($new_constraint < $max_jobs_run) ) {
+          $max_jobs_run = $new_constraint;
+        }
+      }
+      
+      if (! ($max_jobs_run > 0)) {
+        die "run.pl: invalid option --max-jobs-run $max_jobs_run";
+      }
+    } else {
+      my $argument = shift @ARGV;
+      if ($argument =~ m/^--/) {
+        print STDERR "run.pl: WARNING: suspicious argument '$argument' to $switch; starts with '-'\n";
+      }
+      if ($switch eq "-sync" && $argument =~ m/^[yY]/) {
+        $ignored_opts .= "-sync "; # Note: in the
+        # corresponding code in queue.pl it says instead, just "$sync = 1;".
+      } elsif ($switch eq "-pe") { # e.g. -pe smp 5
+        my $argument2 = shift @ARGV;
+        $ignored_opts .= "$switch $argument $argument2 ";
+      } elsif ($switch eq "--gpu") {
+        $using_gpu = $argument;
+      } elsif ($switch eq "--pick") {
+        if($argument =~ m/^(all|failed|incomplete)$/) {
+          $job_pick = $argument;
+        } else {
+          print STDERR "run.pl: ERROR: --pick argument must be one of 'all', 'failed' or 'incomplete'\n";
+        }
+      } else {
+        # Ignore option.
+        $ignored_opts .= "$switch $argument ";
+      }
+    }
+  }
+  if ($ARGV[0] =~ m/^([\w_][\w\d_]*)+=(\d+):(\d+)$/) { # e.g. JOB=1:20
+    $jobname = $1;
+    $jobstart = $2;
+    $jobend = $3;
+    if ($jobstart > $jobend) {
+      die "run.pl: invalid job range $ARGV[0]";
+    }
+    if ($jobstart <= 0) {
+      die "run.pl: invalid job range $ARGV[0], start must be strictly positive (this is required for GridEngine compatibility).";
+    }
+    shift;
+  } elsif ($ARGV[0] =~ m/^([\w_][\w\d_]*)+=(\d+)$/) { # e.g. JOB=1.
+    $jobname = $1;
+    $jobstart = $2;
+    $jobend = $2;
+    shift;
+  } elsif ($ARGV[0] =~ m/.+\=.*\:.*$/) {
+    print STDERR "run.pl: Warning: suspicious first argument to run.pl: $ARGV[0]\n";
+  }
+}
+
+# Users found this message confusing so we are removing it.
+# if ($ignored_opts ne "") {
+#   print STDERR "run.pl: Warning: ignoring options \"$ignored_opts\"\n";
+# }
+
+if ($max_jobs_run == -1) { # If --max-jobs-run option not set,
+                           # then work out the number of processors if possible,
+                           # and set it based on that.
+  $max_jobs_run = 0;
+  if ($using_gpu) {
+    if (open(P, "nvidia-smi -L |")) {
+      $max_jobs_run++ while (<P>);
+      close(P);
+    }
+    if ($max_jobs_run == 0) {
+      $max_jobs_run = 1;
+      print STDERR "run.pl: Warning: failed to detect number of GPUs from nvidia-smi, using ${max_jobs_run}\n";
+    }
+  } elsif (open(P, "</proc/cpuinfo")) {  # Linux
+    while (<P>) { if (m/^processor/) { $max_jobs_run++; } }
+    if ($max_jobs_run == 0) {
+      print STDERR "run.pl: Warning: failed to detect any processors from /proc/cpuinfo\n";
+      $max_jobs_run = 10;  # reasonable default.
+    }
+    close(P);
+  } elsif (open(P, "sysctl -a |")) {  # BSD/Darwin
+    while (<P>) {
+      if (m/hw\.ncpu\s*[:=]\s*(\d+)/) { # hw.ncpu = 4, or hw.ncpu: 4
+        $max_jobs_run = $1;
+        last;
+      }
+    }
+    close(P);
+    if ($max_jobs_run == 0) {
+      print STDERR "run.pl: Warning: failed to detect any processors from sysctl -a\n";
+      $max_jobs_run = 10;  # reasonable default.
+    }
+  } else {
+    # allow at most 32 jobs at once, on non-UNIX systems; change this code
+    # if you need to change this default.
+    $max_jobs_run = 32;
+  }
+  # The just-computed value of $max_jobs_run is just the number of processors
+  # (or our best guess); and if it happens that the number of jobs we need to
+  # run is just slightly above $max_jobs_run, it will make sense to increase
+  # $max_jobs_run to equal the number of jobs, so we don't have a small number
+  # of leftover jobs.
+  $num_jobs = $jobend - $jobstart + 1;
+  if (!$using_gpu &&
+      $num_jobs > $max_jobs_run && $num_jobs < 1.4 * $max_jobs_run) {
+    $max_jobs_run = $num_jobs;
+  }
+}
+
+sub pick_or_exit {
+  # pick_or_exit ( $logfile )
+  # Invoked before each job is started; helps to run jobs selectively.
+  #
+  # Given the name of the output logfile, decides whether the job must be
+  # executed (by returning from the subroutine) or not (by terminating the
+  # process with a call to exit).
+  # 
+  # PRE: $job_pick is a global variable set by command line switch --pick
+  #      and indicates which class of jobs must be executed.
+  #
+  # 1) If a failed job is not executed, the process exit code will indicate
+  #    failure, just as if the task had just been executed and failed.
+  #
+  # 2) If a task is incomplete it will be executed. Incomplete may be either
+  #    a job whose log file does not contain the accounting notes in the end,
+  #    or a job whose log file does not exist.
+  #
+  # 3) If the $job_pick is set to 'all' (default behavior) a task will be
+  #    executed regardless of the result of previous attempts.
+  #
+  # This logic could have been implemented in the main execution loop,
+  # but it lives in a subroutine to preserve the current level of
+  # readability of that part of the code.
+  #
+  # Alexandre Felipe, (o.alexandre.felipe@gmail.com) 14th of August of 2020
+  #
+  if($job_pick eq 'all'){
+    return; # no need to bother with the previous log
+  }
+  open my $fh, "<", $_[0] or return; # job not executed yet
+  my $log_line;
+  my $cur_line;
+  while ($cur_line = <$fh>) {
+    if( $cur_line =~ m/# Ended \(code .*/ ) {
+      $log_line = $cur_line;
+    }
+  }
+  close $fh;
+  if (! defined($log_line)){
+    return; # incomplete
+  }
+  if ( $log_line =~ m/# Ended \(code 0\).*/ ) {
+    exit(0); # complete
+  } elsif ( $log_line =~ m/# Ended \(code \d+(; signal \d+)?\).*/ ){
+    if ($job_pick !~ m/^(failed|all)$/) {
+      exit(1); # failed but not going to run
+    } else {
+      return; # failed
+    }
+  } elsif ( $log_line =~ m/.*\S.*/ ) {
+    return; # incomplete jobs are always run
+  }
+}
+
+
+$logfile = shift @ARGV;
+
+if (defined $jobname && $logfile !~ m/$jobname/ &&
+    $jobend > $jobstart) {
+  print STDERR "run.pl: you are trying to run a parallel job but "
+    . "you are putting the output into just one log file ($logfile)\n";
+  exit(1);
+}
+
+$cmd = "";
+
+foreach $x (@ARGV) {
+    if ($x =~ m/^\S+$/) { $cmd .=  $x . " "; }
+    elsif ($x =~ m:\":) { $cmd .= "'$x' "; }
+    else { $cmd .= "\"$x\" "; }
+}
+
+#$Data::Dumper::Indent=0;
+$ret = 0;
+$numfail = 0;
+%active_pids=();
+
+use POSIX ":sys_wait_h";
+for ($jobid = $jobstart; $jobid <= $jobend; $jobid++) {
+  if (scalar(keys %active_pids) >= $max_jobs_run) {
+
+    # Let's wait for a change in any child's status.
+    # Then we have to work out which child finished.
+    $r = waitpid(-1, 0);
+    $code = $?;
+    if ($r < 0 ) { die "run.pl: Error waiting for child process"; } # should never happen.
+    if ( defined $active_pids{$r} ) {
+        $jid=$active_pids{$r};
+        $fail[$jid]=$code;
+        if ($code !=0) { $numfail++;}
+        delete $active_pids{$r};
+        # print STDERR "Finished: $r/$jid " .  Dumper(\%active_pids) . "\n";
+    } else {
+        die "run.pl: Cannot find the PID of the child process that just finished.";
+    }
+
+    # In theory we could do a non-blocking waitpid over all jobs running just
+    # to find out if only one or more jobs finished during the previous waitpid()
+    # However, we just omit this and will reap the next one in the next pass
+    # through the for(;;) cycle
+  }
+  $childpid = fork();
+  if (!defined $childpid) { die "run.pl: Error forking in run.pl (writing to $logfile)"; }
+  if ($childpid == 0) { # We're in the child... this branch
+    # executes the job and returns (possibly with an error status).
+    if (defined $jobname) {
+      $cmd =~ s/$jobname/$jobid/g;
+      $logfile =~ s/$jobname/$jobid/g;
+    }
+    # exit if the job does not need to be executed
+    pick_or_exit( $logfile );
+
+    system("mkdir -p `dirname $logfile` 2>/dev/null");
+    open(F, ">$logfile") || die "run.pl: Error opening log file $logfile";
+    print F "# " . $cmd . "\n";
+    print F "# Started at " . `date`;
+    $starttime = `date +'%s'`;
+    print F "#\n";
+    close(F);
+
+    # Pipe into bash.. make sure we're not using any other shell.
+    open(B, "|bash") || die "run.pl: Error opening shell command";
+    print B "( " . $cmd . ") 2>>$logfile >> $logfile";
+    close(B);                   # If there was an error, exit status is in $?
+    $ret = $?;
+
+    $lowbits = $ret & 127;
+    $highbits = $ret >> 8;
+    if ($lowbits != 0) { $return_str = "code $highbits; signal $lowbits" }
+    else { $return_str = "code $highbits"; }
+
+    $endtime = `date +'%s'`;
+    open(F, ">>$logfile") || die "run.pl: Error opening log file $logfile (again)";
+    $enddate = `date`;
+    chop $enddate;
+    print F "# Accounting: time=" . ($endtime - $starttime) . " threads=1\n";
+    print F "# Ended ($return_str) at " . $enddate . ", elapsed time " . ($endtime-$starttime) . " seconds\n";
+    close(F);
+    exit($ret == 0 ? 0 : 1);
+  } else {
+    $pid[$jobid] = $childpid;
+    $active_pids{$childpid} = $jobid;
+    # print STDERR "Queued: " .  Dumper(\%active_pids) . "\n";
+  }
+}
+
+# Now that we have submitted all the jobs, let's wait until they all finish.
+foreach $child (keys %active_pids) {
+    $jobid=$active_pids{$child};
+    $r = waitpid($pid[$jobid], 0);
+    $code = $?;
+    if ($r == -1) { die "run.pl: Error waiting for child process"; } # should never happen.
+    if ($r != 0) { $fail[$jobid]=$code; $numfail++ if $code!=0; } # Record the exit status; count nonzero ones as failures.
+}
+
+# Some sanity checks:
+# The $fail array should not contain undefined codes
+# The number of non-zeros in that array  should be equal to $numfail
+# We cannot do foreach() here, as the JOB ids do not start at zero
+$failed_jids=0;
+for ($jobid = $jobstart; $jobid <= $jobend; $jobid++) {
+  $job_return = $fail[$jobid];
+  if (not defined $job_return ) {
+    # print Dumper(\@fail);
+
+    die "run.pl: Sanity check failed: we have indication that some jobs are running " .
+      "even after we waited for all jobs to finish" ;
+  }
+  if ($job_return != 0 ){ $failed_jids++;}
+}
+if ($failed_jids != $numfail) {
+  die "run.pl: Sanity check failed: cannot find out how many jobs failed ($failed_jids x $numfail)."
+}
+if ($numfail > 0) { $ret = 1; }
+
+if ($ret != 0) {
+  $njobs = $jobend - $jobstart + 1;
+  if ($njobs == 1) {
+    if (defined $jobname) {
+      $logfile =~ s/$jobname/$jobstart/; # only one numbered job, so replace name with
+                                         # that job.
+    }
+    print STDERR "run.pl: job failed, log is in $logfile\n";
+    if ($logfile =~ m/JOB/) {
+      print STDERR "run.pl: probably you forgot to put JOB=1:\$nj in your script.\n";
+    }
+  }
+  else {
+    $logfile =~ s/$jobname/*/g;
+    print STDERR "run.pl: $numfail / $njobs failed, log is in $logfile\n";
+  }
+}
+
+
+exit ($ret);
--- a/egs/aishell/tranformer/utils/shuffle_list.pl
+++ b/egs/aishell/tranformer/utils/shuffle_list.pl
@ -0,0 +1,44 @@
+#!/usr/bin/env perl
+
+# Copyright 2013  Johns Hopkins University (author: Daniel Povey)
+
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#  http://www.apache.org/licenses/LICENSE-2.0
+#
+# THIS CODE IS PROVIDED *AS IS* BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY IMPLIED
+# WARRANTIES OR CONDITIONS OF TITLE, FITNESS FOR A PARTICULAR PURPOSE,
+# MERCHANTABLITY OR NON-INFRINGEMENT.
+# See the Apache 2 License for the specific language governing permissions and
+# limitations under the License.
+
+
+if ($ARGV[0] eq "--srand") {
+  $n = $ARGV[1];
+  $n =~ m/^\d+$/ || die "Bad argument to --srand option: \"$n\"";
+  srand($ARGV[1]);
+  shift;
+  shift;
+} else {
+  srand(0); # Gives inconsistent behavior if we don't seed.
+}
+
+if (@ARGV > 1 || $ARGV[0] =~ m/^-.+/) { # >1 args, or an option we
+  # don't understand.
+  print "Usage: shuffle_list.pl [--srand N] [input file]  > output\n";
+  print "randomizes the order of lines of input.\n";
+  exit(1);
+}
+
+@lines = ();
+while (<>) {
+  push @lines, [ (rand(), $_)] ;
+}
+
+@lines = sort { $a->[0] cmp $b->[0] } @lines;
+foreach $l (@lines) {
+    print $l->[1];
+}
--- a/egs/aishell/tranformer/utils/split_data.py
+++ b/egs/aishell/tranformer/utils/split_data.py
@ -0,0 +1,60 @@
+import os
+import sys
+import random
+
+
+in_dir = sys.argv[1]
+out_dir = sys.argv[2]
+num_split = sys.argv[3]
+
+
+def split_scp(scp, num):
+    assert len(scp) >= num
+    avg = len(scp) // num
+    out = []
+    begin = 0
+
+    for i in range(num):
+        if i == num - 1:
+            out.append(scp[begin:])
+        else:
+            out.append(scp[begin:begin+avg])
+        begin += avg
+
+    return out
+
+
+assert os.path.exists("{}/wav.scp".format(in_dir))
+assert os.path.exists("{}/text".format(in_dir))
+
+with open("{}/wav.scp".format(in_dir), 'r') as infile:
+    wav_list = infile.readlines()
+
+with open("{}/text".format(in_dir), 'r') as infile:
+    text_list = infile.readlines()
+
+assert len(wav_list) == len(text_list)
+
+x = list(zip(wav_list, text_list))
+random.shuffle(x)
+wav_shuffle_list, text_shuffle_list = zip(*x)
+
+num_split = int(num_split)
+wav_split_list = split_scp(wav_shuffle_list, num_split)
+text_split_list = split_scp(text_shuffle_list, num_split)
+
+for idx, wav_list in enumerate(wav_split_list, 1):
+    path = out_dir + "/split" + str(num_split) + "/" + str(idx)
+    if not os.path.exists(path):
+        os.makedirs(path)
+    with open("{}/wav.scp".format(path), 'w') as wav_writer:
+        for line in wav_list:
+            wav_writer.write(line)
+
+for idx, text_list in enumerate(text_split_list, 1):
+    path = out_dir + "/split" + str(num_split) + "/" + str(idx)
+    if not os.path.exists(path):
+        os.makedirs(path)
+    with open("{}/text".format(path), 'w') as text_writer:
+        for line in text_list:
+            text_writer.write(line)
--- a/egs/aishell/tranformer/utils/split_scp.pl
+++ b/egs/aishell/tranformer/utils/split_scp.pl
@ -0,0 +1,246 @@
+#!/usr/bin/env perl
+
+# Copyright 2010-2011 Microsoft Corporation
+
+# See ../../COPYING for clarification regarding multiple authors
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#  http://www.apache.org/licenses/LICENSE-2.0
+#
+# THIS CODE IS PROVIDED *AS IS* BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY IMPLIED
+# WARRANTIES OR CONDITIONS OF TITLE, FITNESS FOR A PARTICULAR PURPOSE,
+# MERCHANTABLITY OR NON-INFRINGEMENT.
+# See the Apache 2 License for the specific language governing permissions and
+# limitations under the License.
+
+
+# This program splits up any kind of .scp or archive-type file.
+# If there is no utt2spk option it will work on any text file and
+# will split it up with an approximately equal number of lines in
+# each piece.
+# With the --utt2spk option it will work on anything that has the
+# utterance-id as the first entry on each line; the utt2spk file is
+# of the form "utterance speaker" (on each line).
+# It splits it into equal size chunks as far as it can.  If you use the utt2spk
+# option it will make sure these chunks coincide with speaker boundaries.  In
+# this case, if there are more chunks than speakers (and in some other
+# circumstances), some of the resulting chunks will be empty and it will print
+# an error message and exit with nonzero status.
+# You will normally call this like:
+# split_scp.pl scp scp.1 scp.2 scp.3 ...
+# or
+# split_scp.pl --utt2spk=utt2spk scp scp.1 scp.2 scp.3 ...
+# Note that you can use this script to split the utt2spk file itself,
+# e.g. split_scp.pl --utt2spk=utt2spk utt2spk utt2spk.1 utt2spk.2 ...
+
+# You can also call the scripts like:
+# split_scp.pl -j 3 0 scp scp.0
+# [note: with this option, it assumes zero-based indexing of the split parts,
+# i.e. the second number must be 0 <= n < num-jobs.]
+
+use warnings;
+
+$num_jobs = 0;
+$job_id = 0;
+$utt2spk_file = "";
+$one_based = 0;
+
+for ($x = 1; $x <= 3 && @ARGV > 0; $x++) {
+    if ($ARGV[0] eq "-j") {
+        shift @ARGV;
+        $num_jobs = shift @ARGV;
+        $job_id = shift @ARGV;
+    }
+    if ($ARGV[0] =~ /--utt2spk=(.+)/) {
+        $utt2spk_file=$1;
+        shift;
+    }
+    if ($ARGV[0] eq '--one-based') {
+        $one_based = 1;
+        shift @ARGV;
+    }
+}
+
+if ($num_jobs != 0 && ($num_jobs < 0 || $job_id - $one_based < 0 ||
+                       $job_id - $one_based >= $num_jobs)) {
+  die "$0: Invalid job number/index values for '-j $num_jobs $job_id" .
+      ($one_based ? " --one-based" : "") . "'\n"
+}
+
+$one_based
+    and $job_id--;
+
+if(($num_jobs == 0 && @ARGV < 2) || ($num_jobs > 0 && (@ARGV < 1 || @ARGV > 2))) {
+    die
+"Usage: split_scp.pl [--utt2spk=<utt2spk_file>] in.scp out1.scp out2.scp ...
+   or: split_scp.pl -j num-jobs job-id [--one-based] [--utt2spk=<utt2spk_file>] in.scp [out.scp]
+ ... where 0 <= job-id < num-jobs, or 1 <= job-id <= num-jobs if --one-based.\n";
+}
+
+$error = 0;
+$inscp = shift @ARGV;
+if ($num_jobs == 0) { # without -j option
+    @OUTPUTS = @ARGV;
+} else {
+    for ($j = 0; $j < $num_jobs; $j++) {
+        if ($j == $job_id) {
+            if (@ARGV > 0) { push @OUTPUTS, $ARGV[0]; }
+            else { push @OUTPUTS, "-"; }
+        } else {
+            push @OUTPUTS, "/dev/null";
+        }
+    }
+}
+
+if ($utt2spk_file ne "") {  # We have the --utt2spk option...
+    open($u_fh, '<', $utt2spk_file) || die "$0: Error opening utt2spk file $utt2spk_file: $!\n";
+    while(<$u_fh>) {
+        @A = split;
+        @A == 2 || die "$0: Bad line $_ in utt2spk file $utt2spk_file\n";
+        ($u,$s) = @A;
+        $utt2spk{$u} = $s;
+    }
+    close $u_fh;
+    open($i_fh, '<', $inscp) || die "$0: Error opening input scp file $inscp: $!\n";
+    @spkrs = ();
+    while(<$i_fh>) {
+        @A = split;
+        if(@A == 0) { die "$0: Empty or space-only line in scp file $inscp\n"; }
+        $u = $A[0];
+        $s = $utt2spk{$u};
+        defined $s || die "$0: No utterance $u in utt2spk file $utt2spk_file\n";
+        if(!defined $spk_count{$s}) {
+            push @spkrs, $s;
+            $spk_count{$s} = 0;
+            $spk_data{$s} = [];  # ref to new empty array.
+        }
+        $spk_count{$s}++;
+        push @{$spk_data{$s}}, $_;
+    }
+    # Now split as equally as possible ..
+    # First allocate spks to files by allocating an approximately
+    # equal number of speakers.
+    $numspks = @spkrs;  # number of speakers.
+    $numscps = @OUTPUTS; # number of output files.
+    if ($numspks < $numscps) {
+      die "$0: Refusing to split data because number of speakers $numspks " .
+          "is less than the number of output .scp files $numscps\n";
+    }
+    for($scpidx = 0; $scpidx < $numscps; $scpidx++) {
+        $scparray[$scpidx] = []; # [] is array reference.
+    }
+    for ($spkidx = 0; $spkidx < $numspks; $spkidx++) {
+        $scpidx = int(($spkidx*$numscps) / $numspks);
+        $spk = $spkrs[$spkidx];
+        push @{$scparray[$scpidx]}, $spk;
+        $scpcount[$scpidx] += $spk_count{$spk};
+    }
+
+    # Now will try to reassign beginning + ending speakers
+    # to different scp's and see if it gets more balanced.
+    # Suppose objf we're minimizing is sum_i (num utts in scp[i] - average)^2.
+    # We can show that if considering changing just 2 scp's, we minimize
+    # this by minimizing the squared difference in sizes.  This is
+    # equivalent to minimizing the absolute difference in sizes.  This
+    # shows this method is bound to converge.
+
+    $changed = 1;
+    while($changed) {
+        $changed = 0;
+        for($scpidx = 0; $scpidx < $numscps; $scpidx++) {
+            # First try to reassign ending spk of this scp.
+            if($scpidx < $numscps-1) {
+                $sz = @{$scparray[$scpidx]};
+                if($sz > 0) {
+                    $spk = $scparray[$scpidx]->[$sz-1];
+                    $count = $spk_count{$spk};
+                    $nutt1 = $scpcount[$scpidx];
+                    $nutt2 = $scpcount[$scpidx+1];
+                    if( abs( ($nutt2+$count) - ($nutt1-$count))
+                        < abs($nutt2 - $nutt1))  { # Would decrease
+                        # size-diff by reassigning spk...
+                        $scpcount[$scpidx+1] += $count;
+                        $scpcount[$scpidx] -= $count;
+                        pop @{$scparray[$scpidx]};
+                        unshift @{$scparray[$scpidx+1]}, $spk;
+                        $changed = 1;
+                    }
+                }
+            }
+            if($scpidx > 0 && @{$scparray[$scpidx]} > 0) {
+                $spk = $scparray[$scpidx]->[0];
+                $count = $spk_count{$spk};
+                $nutt1 = $scpcount[$scpidx-1];
+                $nutt2 = $scpcount[$scpidx];
+                if( abs( ($nutt2-$count) - ($nutt1+$count))
+                    < abs($nutt2 - $nutt1))  { # Would decrease
+                    # size-diff by reassigning spk...
+                    $scpcount[$scpidx-1] += $count;
+                    $scpcount[$scpidx] -= $count;
+                    shift @{$scparray[$scpidx]};
+                    push @{$scparray[$scpidx-1]}, $spk;
+                    $changed = 1;
+                }
+            }
+        }
+    }
+    # Now print out the files...
+    for($scpidx = 0; $scpidx < $numscps; $scpidx++) {
+        $scpfile = $OUTPUTS[$scpidx];
+        ($scpfile ne '-' ? open($f_fh, '>', $scpfile)
+                         : open($f_fh, '>&', \*STDOUT)) ||
+            die "$0: Could not open scp file $scpfile for writing: $!\n";
+        $count = 0;
+        if(@{$scparray[$scpidx]} == 0) {
+            print STDERR "$0: Error: split_scp.pl producing empty .scp file " .
+                         "$scpfile (too many splits and too few speakers?)\n";
+            $error = 1;
+        } else {
+            foreach $spk ( @{$scparray[$scpidx]} ) {
+                print $f_fh @{$spk_data{$spk}};
+                $count += $spk_count{$spk};
+            }
+            $count == $scpcount[$scpidx] || die "Count mismatch [code error]";
+        }
+        close($f_fh);
+    }
+} else {
+   # This block is the "normal" case where there is no --utt2spk
+   # option and we just break into equal size chunks.
+
+    open($i_fh, '<', $inscp) || die "$0: Error opening input scp file $inscp: $!\n";
+
+    $numscps = @OUTPUTS;  # size of array.
+    @F = ();
+    while(<$i_fh>) {
+        push @F, $_;
+    }
+    $numlines = @F;
+    if($numlines == 0) {
+        print STDERR "$0: error: empty input scp file $inscp\n";
+        $error = 1;
+    }
+    $linesperscp = int( $numlines / $numscps); # the "whole part"..
+    $linesperscp >= 1 || die "$0: You are splitting into too many pieces! [reduce \$nj ($numscps) to be smaller than the number of lines ($numlines) in $inscp]\n";
+    $remainder = $numlines - ($linesperscp * $numscps);
+    ($remainder >= 0 && $remainder < $numlines) || die "bad remainder $remainder";
+    # [just doing int() rounds down].
+    $n = 0;
+    for($scpidx = 0; $scpidx < @OUTPUTS; $scpidx++) {
+        $scpfile = $OUTPUTS[$scpidx];
+        ($scpfile ne '-' ? open($o_fh, '>', $scpfile)
+                         : open($o_fh, '>&', \*STDOUT)) ||
+            die "$0: Could not open scp file $scpfile for writing: $!\n";
+        for($k = 0; $k < $linesperscp + ($scpidx < $remainder ? 1 : 0); $k++) {
+            print $o_fh $F[$n++];
+        }
+        close($o_fh) || die "$0: Error closing scp file $scpfile: $!\n";
+    }
+    $n == $numlines || die "$n != $numlines [code error]";
+}
+
+exit ($error);
--- a/egs/aishell/tranformer/utils/subset_data_dir_tr_cv.sh
+++ b/egs/aishell/tranformer/utils/subset_data_dir_tr_cv.sh
@ -0,0 +1,30 @@
+#!/usr/bin/env bash
+
+dev_num_utt=1000
+
+echo "$0 $@"
+. utils/parse_options.sh || exit 1;
+
+train_data=$1
+out_dir=$2
+
+[ ! -f ${train_data}/wav.scp ] && echo "$0: no such file ${train_data}/wav.scp" && exit 1;
+[ ! -f ${train_data}/text ] && echo "$0: no such file ${train_data}/text" && exit 1;
+
+mkdir -p ${out_dir}/train && mkdir -p ${out_dir}/dev
+
+cp ${train_data}/wav.scp ${out_dir}/train/wav.scp.bak
+cp ${train_data}/text ${out_dir}/train/text.bak
+
+num_utt=$(wc -l <${out_dir}/train/wav.scp.bak)
+
+utils/shuffle_list.pl --srand 1 ${out_dir}/train/wav.scp.bak > ${out_dir}/train/wav.scp.shuf
+head -n ${dev_num_utt} ${out_dir}/train/wav.scp.shuf > ${out_dir}/dev/wav.scp
+tail -n $((${num_utt}-${dev_num_utt})) ${out_dir}/train/wav.scp.shuf > ${out_dir}/train/wav.scp
+
+utils/shuffle_list.pl --srand 1 ${out_dir}/train/text.bak > ${out_dir}/train/text.shuf
+head -n ${dev_num_utt} ${out_dir}/train/text.shuf > ${out_dir}/dev/text
+tail -n $((${num_utt}-${dev_num_utt})) ${out_dir}/train/text.shuf > ${out_dir}/train/text
+
+rm ${out_dir}/train/wav.scp.bak ${out_dir}/train/text.bak
+rm ${out_dir}/train/wav.scp.shuf ${out_dir}/train/text.shuf
--- a/egs/aishell/tranformer/utils/text2token.py
+++ b/egs/aishell/tranformer/utils/text2token.py
@ -0,0 +1,135 @@
+#!/usr/bin/env python3
+
+# Copyright 2017 Johns Hopkins University (Shinji Watanabe)
+#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)
+
+
+import argparse
+import codecs
+import re
+import sys
+
+is_python2 = sys.version_info[0] == 2
+
+
+def exist_or_not(i, match_pos):
+    start_pos = None
+    end_pos = None
+    for pos in match_pos:
+        if pos[0] <= i < pos[1]:
+            start_pos = pos[0]
+            end_pos = pos[1]
+            break
+
+    return start_pos, end_pos
+
+
+def get_parser():
+    parser = argparse.ArgumentParser(
+        description="convert raw text to tokenized text",
+        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
+    )
+    parser.add_argument(
+        "--nchar",
+        "-n",
+        default=1,
+        type=int,
+        help="number of characters to split, i.e., \
+                        aabb -> a a b b with -n 1 and aa bb with -n 2",
+    )
+    parser.add_argument(
+        "--skip-ncols", "-s", default=0, type=int, help="skip first n columns"
+    )
+    parser.add_argument("--space", default="<space>", type=str, help="space symbol")
+    parser.add_argument(
+        "--non-lang-syms",
+        "-l",
+        default=None,
+        type=str,
+        help="list of non-linguistic symbols, e.g., <NOISE> etc.",
+    )
+    parser.add_argument("text", type=str, default=False, nargs="?", help="input text")
+    parser.add_argument(
+        "--trans_type",
+        "-t",
+        type=str,
+        default="char",
+        choices=["char", "phn"],
+        help="""Transcript type. char/phn. e.g., for TIMIT FADG0_SI1279 -
+                        If trans_type is char,
+                        read from SI1279.WRD file -> "bricks are an alternative"
+                        Else if trans_type is phn,
+                        read from SI1279.PHN file -> "sil b r ih sil k s aa r er n aa l
+                        sil t er n ih sil t ih v sil" """,
+    )
+    return parser
+
+
+def main():
+    parser = get_parser()
+    args = parser.parse_args()
+
+    rs = []
+    if args.non_lang_syms is not None:
+        with codecs.open(args.non_lang_syms, "r", encoding="utf-8") as f:
+            nls = [x.rstrip() for x in f.readlines()]
+            rs = [re.compile(re.escape(x)) for x in nls]
+
+    if args.text:
+        f = codecs.open(args.text, encoding="utf-8")
+    else:
+        f = codecs.getreader("utf-8")(sys.stdin if is_python2 else sys.stdin.buffer)
+
+    sys.stdout = codecs.getwriter("utf-8")(
+        sys.stdout if is_python2 else sys.stdout.buffer
+    )
+    line = f.readline()
+    n = args.nchar
+    while line:
+        x = line.split()
+        print(" ".join(x[: args.skip_ncols]), end=" ")
+        a = " ".join(x[args.skip_ncols :])
+
+        # get all matched positions
+        match_pos = []
+        for r in rs:
+            i = 0
+            while i >= 0:
+                m = r.search(a, i)
+                if m:
+                    match_pos.append([m.start(), m.end()])
+                    i = m.end()
+                else:
+                    break
+
+        if args.trans_type == "phn":
+            a = a.split(" ")
+        else:
+            if len(match_pos) > 0:
+                chars = []
+                i = 0
+                while i < len(a):
+                    start_pos, end_pos = exist_or_not(i, match_pos)
+                    if start_pos is not None:
+                        chars.append(a[start_pos:end_pos])
+                        i = end_pos
+                    else:
+                        chars.append(a[i])
+                        i += 1
+                a = chars
+
+            a = [a[j : j + n] for j in range(0, len(a), n)]
+
+        a_flat = []
+        for z in a:
+            a_flat.append("".join(z))
+
+        a_chars = [z.replace(" ", args.space) for z in a_flat]
+        if args.trans_type == "phn":
+            a_chars = [z.replace("sil", args.space) for z in a_chars]
+        print(" ".join(a_chars))
+        line = f.readline()
+
+
+if __name__ == "__main__":
+    main()
--- a/egs/aishell/tranformer/utils/text_tokenize.py
+++ b/egs/aishell/tranformer/utils/text_tokenize.py
@ -0,0 +1,106 @@
+import re
+import argparse
+
+
+def load_dict(seg_file):
+    seg_dict = {}
+    with open(seg_file, 'r') as infile:
+        for line in infile:
+            s = line.strip().split()
+            key = s[0]
+            value = s[1:]
+            seg_dict[key] = " ".join(value)
+    return seg_dict
+
+
+def forward_segment(text, dic):
+    word_list = []
+    i = 0
+    while i < len(text):
+        longest_word = text[i]
+        for j in range(i + 1, len(text) + 1):
+            word = text[i:j]
+            if word in dic:
+                if len(word) > len(longest_word):
+                    longest_word = word
+        word_list.append(longest_word)
+        i += len(longest_word)
+    return word_list
+
+
+def tokenize(txt,
+             seg_dict):
+    out_txt = ""
+    pattern = re.compile(r"([\u4E00-\u9FA5A-Za-z0-9])")
+    for word in txt:
+        if pattern.match(word):
+            if word in seg_dict:
+                out_txt += seg_dict[word] + " "
+            else:
+                out_txt += "<unk>" + " "
+        else:
+            continue
+    return out_txt.strip()
+
+
+def get_parser():
+    parser = argparse.ArgumentParser(
+        description="text tokenize",
+        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
+    )
+    parser.add_argument(
+        "--text-file",
+        "-t",
+        default=False,
+        required=True,
+        type=str,
+        help="input text",
+    )
+    parser.add_argument(
+        "--seg-file",
+        "-s",
+        default=False,
+        required=True,
+        type=str,
+        help="seg file",
+    )
+    parser.add_argument(
+        "--txt-index",
+        "-i",
+        default=1,
+        required=True,
+        type=int,
+        help="txt index",
+    )
+    parser.add_argument(
+        "--output-dir",
+        "-o",
+        default=False,
+        required=True,
+        type=str,
+        help="output dir",
+    )
+    return parser
+
+
+def main():
+    parser = get_parser()
+    args = parser.parse_args()
+
+    seg_dict = load_dict(args.seg_file)
+    txt_path = "{}/text.{}.txt".format(args.output_dir, args.txt_index)
+    shape_path = "{}/len.{}".format(args.output_dir, args.txt_index)
+    with open(args.text_file, 'r') as infile, open(txt_path, 'w') as txt_writer, open(shape_path, 'w') as shape_writer:
+        for line in infile:
+            s = line.strip().split()
+            text_id = s[0]
+            text_list = forward_segment("".join(s[1:]).lower(), seg_dict)
+            text = tokenize(text_list, seg_dict)
+            lens = len(text.strip().split())
+            txt_writer.write(text_id + " " + text + '\n')
+            shape_writer.write(text_id + " " + str(lens) + '\n')
+
+
+if __name__ == '__main__':
+    main()
+
--- a/egs/aishell/tranformer/utils/text_tokenize.sh
+++ b/egs/aishell/tranformer/utils/text_tokenize.sh
@ -0,0 +1,35 @@
+#!/usr/bin/env bash
+
+
+# Begin configuration section.
+nj=32
+cmd=utils/run.pl
+
+echo "$0 $@"
+
+. utils/parse_options.sh || exit 1;
+
+# tokenize configuration
+text_dir=$1
+seg_file=$2
+logdir=$3
+output_dir=$4
+
+txt_dir=${output_dir}/txt; mkdir -p ${output_dir}/txt
+mkdir -p ${logdir}
+
+$cmd JOB=1:$nj $logdir/text_tokenize.JOB.log \
+  python utils/text_tokenize.py -t ${text_dir}/txt/text.JOB.txt \
+      -s ${seg_file} -i JOB -o ${txt_dir} \
+      || exit 1;
+
+# concatenate the text files together.
+for n in $(seq $nj); do
+  cat ${txt_dir}/text.$n.txt || exit 1
+done > ${output_dir}/text || exit 1
+
+for n in $(seq $nj); do
+  cat ${txt_dir}/len.$n || exit 1
+done > ${output_dir}/text_shape || exit 1
+
+echo "$0: Succeeded text tokenize"
--- a/egs/aishell/tranformer/utils/textnorm_zh.py
+++ b/egs/aishell/tranformer/utils/textnorm_zh.py
@ -0,0 +1,834 @@
+#!/usr/bin/env python3
+# coding=utf-8
+
+# Authors:
+#   2019.5 Zhiyang Zhou (https://github.com/Joee1995/chn_text_norm.git)
+#   2019.9 Jiayu DU
+#
+# requirements:
+#   - python 3.X
+# notes: python 2.X WILL fail or produce misleading results
+
+import sys, os, argparse, codecs, string, re
+
+# ================================================================================ #
+#                                    basic constant
+# ================================================================================ #
+CHINESE_DIGIS = u'零一二三四五六七八九'
+BIG_CHINESE_DIGIS_SIMPLIFIED = u'零壹贰叁肆伍陆柒捌玖'
+BIG_CHINESE_DIGIS_TRADITIONAL = u'零壹貳參肆伍陸柒捌玖'
+SMALLER_BIG_CHINESE_UNITS_SIMPLIFIED = u'十百千万'
+SMALLER_BIG_CHINESE_UNITS_TRADITIONAL = u'拾佰仟萬'
+LARGER_CHINESE_NUMERING_UNITS_SIMPLIFIED = u'亿兆京垓秭穰沟涧正载'
+LARGER_CHINESE_NUMERING_UNITS_TRADITIONAL = u'億兆京垓秭穰溝澗正載'
+SMALLER_CHINESE_NUMERING_UNITS_SIMPLIFIED = u'十百千万'
+SMALLER_CHINESE_NUMERING_UNITS_TRADITIONAL = u'拾佰仟萬'
+
+ZERO_ALT = u'〇'
+ONE_ALT = u'幺'
+TWO_ALTS = [u'两', u'兩']
+
+POSITIVE = [u'正', u'正']
+NEGATIVE = [u'负', u'負']
+POINT = [u'点', u'點']
+# PLUS = [u'加', u'加']
+# SIL = [u'杠', u'槓']
+
+FILLER_CHARS = ['呃', '啊']
+ER_WHITELIST = '(儿女|儿子|儿孙|女儿|儿媳|妻儿|' \
+             '胎儿|婴儿|新生儿|婴幼儿|幼儿|少儿|小儿|儿歌|儿童|儿科|托儿所|孤儿|' \
+             '儿戏|儿化|台儿庄|鹿儿岛|正儿八经|吊儿郎当|生儿育女|托儿带女|养儿防老|痴儿呆女|' \
+             '佳儿佳妇|儿怜兽扰|儿无常父|儿不嫌母丑|儿行千里母担忧|儿大不由爷|苏乞儿)'
+
+# Chinese numbering system types
+NUMBERING_TYPES = ['low', 'mid', 'high']
+
+CURRENCY_NAMES = '(人民币|美元|日元|英镑|欧元|马克|法郎|加拿大元|澳元|港币|先令|芬兰马克|爱尔兰镑|' \
+                 '里拉|荷兰盾|埃斯库多|比塞塔|印尼盾|林吉特|新西兰元|比索|卢布|新加坡元|韩元|泰铢)'
+CURRENCY_UNITS = '((亿|千万|百万|万|千|百)|(亿|千万|百万|万|千|百|)元|(亿|千万|百万|万|千|百|)块|角|毛|分)'
+COM_QUANTIFIERS = '(匹|张|座|回|场|尾|条|个|首|阙|阵|网|炮|顶|丘|棵|只|支|袭|辆|挑|担|颗|壳|窠|曲|墙|群|腔|' \
+                  '砣|座|客|贯|扎|捆|刀|令|打|手|罗|坡|山|岭|江|溪|钟|队|单|双|对|出|口|头|脚|板|跳|枝|件|贴|' \
+                  '针|线|管|名|位|身|堂|课|本|页|家|户|层|丝|毫|厘|分|钱|两|斤|担|铢|石|钧|锱|忽|(千|毫|微)克|' \
+                  '毫|厘|分|寸|尺|丈|里|寻|常|铺|程|(千|分|厘|毫|微)米|撮|勺|合|升|斗|石|盘|碗|碟|叠|桶|笼|盆|' \
+                  '盒|杯|钟|斛|锅|簋|篮|盘|桶|罐|瓶|壶|卮|盏|箩|箱|煲|啖|袋|钵|年|月|日|季|刻|时|周|天|秒|分|旬|' \
+                  '纪|岁|世|更|夜|春|夏|秋|冬|代|伏|辈|丸|泡|粒|颗|幢|堆|条|根|支|道|面|片|张|颗|块)'
+
+# punctuation information is based on the Zhon project (https://github.com/tsroten/zhon.git)
+CHINESE_PUNC_STOP = '！？｡。'
+CHINESE_PUNC_NON_STOP = '＂＃＄％＆＇（）＊＋，－／：；＜＝＞＠［＼］＾＿｀｛｜｝～｟｠｢｣､、〃》「」『』【】〔〕〖〗〘〙〚〛〜〝〞〟〰〾〿–—‘’‛“”„‟…‧﹏'
+CHINESE_PUNC_LIST = CHINESE_PUNC_STOP + CHINESE_PUNC_NON_STOP
+
+# ================================================================================ #
+#                                    basic class
+# ================================================================================ #
+class ChineseChar(object):
+    """
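+    A Chinese character.
+    Each character has a simplified and a traditional form,
+    e.g. simplified = '负', traditional = '負'.
+    Conversion can target either the simplified or the traditional form.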
+    """
+
+    def __init__(self, simplified, traditional):
+        self.simplified = simplified
+        self.traditional = traditional
+        #self.__repr__ = self.__str__
+
+    def __str__(self):
+        # __str__ must return a str; fall back to an empty string rather than None
+        return self.simplified or self.traditional or ''
+
+    def __repr__(self):
+        return self.__str__()
+
+
+class ChineseNumberUnit(ChineseChar):
+    """
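+    A Chinese number unit (place-value) character.
+    Besides its simplified and traditional forms, each character also has a "big" (financial) form,
+    e.g. '陆' and '陸'.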
+    """
+
+    def __init__(self, power, simplified, traditional, big_s, big_t):
+        super(ChineseNumberUnit, self).__init__(simplified, traditional)
+        self.power = power
+        self.big_s = big_s
+        self.big_t = big_t
+
+    def __str__(self):
+        return '10^{}'.format(self.power)
+
+    @classmethod
+    def create(cls, index, value, numbering_type=NUMBERING_TYPES[1], small_unit=False):
+
+        if small_unit:
+            return ChineseNumberUnit(power=index + 1,
+                                     simplified=value[0], traditional=value[1], big_s=value[1], big_t=value[1])
+        elif numbering_type == NUMBERING_TYPES[0]:
+            return ChineseNumberUnit(power=index + 8,
+                                     simplified=value[0], traditional=value[1], big_s=value[0], big_t=value[1])
+        elif numbering_type == NUMBERING_TYPES[1]:
+            return ChineseNumberUnit(power=(index + 2) * 4,
+                                     simplified=value[0], traditional=value[1], big_s=value[0], big_t=value[1])
+        elif numbering_type == NUMBERING_TYPES[2]:
+            return ChineseNumberUnit(power=pow(2, index + 3),
+                                     simplified=value[0], traditional=value[1], big_s=value[0], big_t=value[1])
+        else:
+            raise ValueError(
+                'Counting type should be in {0} ({1} provided).'.format(NUMBERING_TYPES, numbering_type))
+
+
+class ChineseNumberDigit(ChineseChar):
+    """
+    A Chinese digit character.
+    """
+
+    def __init__(self, value, simplified, traditional, big_s, big_t, alt_s=None, alt_t=None):
+        super(ChineseNumberDigit, self).__init__(simplified, traditional)
+        self.value = value
+        self.big_s = big_s
+        self.big_t = big_t
+        self.alt_s = alt_s
+        self.alt_t = alt_t
+
+    def __str__(self):
+        return str(self.value)
+
+    @classmethod
+    def create(cls, i, v):
+        return ChineseNumberDigit(i, v[0], v[1], v[2], v[3])
+
+
+class ChineseMath(ChineseChar):
+    """
+    A Chinese math symbol character (sign or decimal point).
+    """
+
+    def __init__(self, simplified, traditional, symbol, expression=None):
+        super(ChineseMath, self).__init__(simplified, traditional)
+        self.symbol = symbol
+        self.expression = expression
+        self.big_s = simplified
+        self.big_t = traditional
+
+
+CC, CNU, CND, CM = ChineseChar, ChineseNumberUnit, ChineseNumberDigit, ChineseMath
+
+
+class NumberSystem(object):
+    """
+    A Chinese numbering system.
+    """
+    pass
+
+
+class MathSymbol(object):
+    """
+    Math symbols used by the Chinese numbering system (simplified/traditional), e.g.
+    positive = ['正', '正']
+    negative = ['负', '負']
+    point = ['点', '點']
+    """
+
+    def __init__(self, positive, negative, point):
+        self.positive = positive
+        self.negative = negative
+        self.point = point
+
+    def __iter__(self):
+        for v in self.__dict__.values():
+            yield v
+
+
+# class OtherSymbol(object):
+#     """
+#     Other symbols
+#     """
+#
+#     def __init__(self, sil):
+#         self.sil = sil
+#
+#     def __iter__(self):
+#         for v in self.__dict__.values():
+#             yield v
+
+
+# ================================================================================ #
+#                                    basic utils
+# ================================================================================ #
+def create_system(numbering_type=NUMBERING_TYPES[1]):
+    """
+    Create and return the numbering system for the given numbering type (default: 'mid').
+    NUMBERING_TYPES = ['low', 'mid', 'high']: Chinese numbering system types
+        low:  '兆' = '亿' * '十' = $10^{9}$,  '京' = '兆' * '十', etc.
+        mid:  '兆' = '亿' * '万' = $10^{12}$, '京' = '兆' * '万', etc.
+        high: '兆' = '亿' * '亿' = $10^{16}$, '京' = '兆' * '兆', etc.
+    Returns the corresponding numbering system.
+    """
+
+    # chinese number units of '亿' and larger
+    all_larger_units = zip(
+        LARGER_CHINESE_NUMERING_UNITS_SIMPLIFIED, LARGER_CHINESE_NUMERING_UNITS_TRADITIONAL)
+    larger_units = [CNU.create(i, v, numbering_type, False)
+                    for i, v in enumerate(all_larger_units)]
+    # chinese number units of '十, 百, 千, 万'
+    all_smaller_units = zip(
+        SMALLER_CHINESE_NUMERING_UNITS_SIMPLIFIED, SMALLER_CHINESE_NUMERING_UNITS_TRADITIONAL)
+    smaller_units = [CNU.create(i, v, small_unit=True)
+                     for i, v in enumerate(all_smaller_units)]
+    # digits
+    chinese_digis = zip(CHINESE_DIGIS, CHINESE_DIGIS,
+                        BIG_CHINESE_DIGIS_SIMPLIFIED, BIG_CHINESE_DIGIS_TRADITIONAL)
+    digits = [CND.create(i, v) for i, v in enumerate(chinese_digis)]
+    digits[0].alt_s, digits[0].alt_t = ZERO_ALT, ZERO_ALT
+    digits[1].alt_s, digits[1].alt_t = ONE_ALT, ONE_ALT
+    digits[2].alt_s, digits[2].alt_t = TWO_ALTS[0], TWO_ALTS[1]
+
+    # symbols
+    positive_cn = CM(POSITIVE[0], POSITIVE[1], '+', lambda x: x)
+    negative_cn = CM(NEGATIVE[0], NEGATIVE[1], '-', lambda x: -x)
+    point_cn = CM(POINT[0], POINT[1], '.', lambda x,
+                  y: float(str(x) + '.' + str(y)))
+    # sil_cn = CM(SIL[0], SIL[1], '-', lambda x, y: float(str(x) + '-' + str(y)))
+    system = NumberSystem()
+    system.units = smaller_units + larger_units
+    system.digits = digits
+    system.math = MathSymbol(positive_cn, negative_cn, point_cn)
+    # system.symbols = OtherSymbol(sil_cn)
+    return system
+
+
+def chn2num(chinese_string, numbering_type=NUMBERING_TYPES[1]):
+
+    def get_symbol(char, system):
+        for u in system.units:
+            if char in [u.traditional, u.simplified, u.big_s, u.big_t]:
+                return u
+        for d in system.digits:
+            if char in [d.traditional, d.simplified, d.big_s, d.big_t, d.alt_s, d.alt_t]:
+                return d
+        for m in system.math:
+            if char in [m.traditional, m.simplified]:
+                return m
+
+    def string2symbols(chinese_string, system):
+        int_string, dec_string = chinese_string, ''
+        for p in [system.math.point.simplified, system.math.point.traditional]:
+            if p in chinese_string:
+                int_string, dec_string = chinese_string.split(p)
+                break
+        return [get_symbol(c, system) for c in int_string], \
+               [get_symbol(c, system) for c in dec_string]
+
+    def correct_symbols(integer_symbols, system):
+        """
+        一百八 to 一百八十
+        一亿一千三百万 to 一亿 一千万 三百万
+        """
+
+        if integer_symbols and isinstance(integer_symbols[0], CNU):
+            if integer_symbols[0].power == 1:
+                integer_symbols = [system.digits[1]] + integer_symbols
+
+        if len(integer_symbols) > 1:
+            if isinstance(integer_symbols[-1], CND) and isinstance(integer_symbols[-2], CNU):
+                integer_symbols.append(
+                    CNU(integer_symbols[-2].power - 1, None, None, None, None))
+
+        result = []
+        unit_count = 0
+        for s in integer_symbols:
+            if isinstance(s, CND):
+                result.append(s)
+                unit_count = 0
+            elif isinstance(s, CNU):
+                current_unit = CNU(s.power, None, None, None, None)
+                unit_count += 1
+
+            if unit_count == 1:
+                result.append(current_unit)
+            elif unit_count > 1:
+                for i in range(len(result)):
+                    if isinstance(result[-i - 1], CNU) and result[-i - 1].power < current_unit.power:
+                        result[-i - 1] = CNU(result[-i - 1].power +
+                                             current_unit.power, None, None, None, None)
+        return result
+
+    def compute_value(integer_symbols):
+        """
+        Compute the value.
+        When current unit is larger than previous unit, current unit * all previous units will be used as all previous units.
+        e.g. '两千万' = 2000 * 10000 not 2000 + 10000
+        """
+        value = [0]
+        last_power = 0
+        for s in integer_symbols:
+            if isinstance(s, CND):
+                value[-1] = s.value
+            elif isinstance(s, CNU):
+                value[-1] *= pow(10, s.power)
+                if s.power > last_power:
+                    value[:-1] = list(map(lambda v: v *
+                                                    pow(10, s.power), value[:-1]))
+                    last_power = s.power
+                value.append(0)
+        return sum(value)
+
+    system = create_system(numbering_type)
+    int_part, dec_part = string2symbols(chinese_string, system)
+    int_part = correct_symbols(int_part, system)
+    int_str = str(compute_value(int_part))
+    dec_str = ''.join([str(d.value) for d in dec_part])
+    if dec_part:
+        return '{0}.{1}'.format(int_str, dec_str)
+    else:
+        return int_str
+
+
+def num2chn(number_string, numbering_type=NUMBERING_TYPES[1], big=False,
+            traditional=False, alt_zero=False, alt_one=False, alt_two=True,
+            use_zeros=True, use_units=True):
+
+    def get_value(value_string, use_zeros=True):
+
+        striped_string = value_string.lstrip('0')
+
+        # record nothing if all zeros
+        if not striped_string:
+            return []
+
+        # record a single digit
+        elif len(striped_string) == 1:
+            if use_zeros and len(value_string) != len(striped_string):
+                return [system.digits[0], system.digits[int(striped_string)]]
+            else:
+                return [system.digits[int(striped_string)]]
+
+        # recursively record multiple digits
+        else:
+            result_unit = next(u for u in reversed(
+                system.units) if u.power < len(striped_string))
+            result_string = value_string[:-result_unit.power]
+            return get_value(result_string) + [result_unit] + get_value(striped_string[-result_unit.power:])
+
+    system = create_system(numbering_type)
+
+    int_dec = number_string.split('.')
+    if len(int_dec) == 1:
+        int_string = int_dec[0]
+        dec_string = ""
+    elif len(int_dec) == 2:
+        int_string = int_dec[0]
+        dec_string = int_dec[1]
+    else:
+        raise ValueError(
+            "invalid input num string with more than one dot: {}".format(number_string))
+
+    if use_units and len(int_string) > 1:
+        result_symbols = get_value(int_string)
+    else:
+        result_symbols = [system.digits[int(c)] for c in int_string]
+    dec_symbols = [system.digits[int(c)] for c in dec_string]
+    if dec_string:
+        result_symbols += [system.math.point] + dec_symbols
+
+    if alt_two:
+        liang = CND(2, system.digits[2].alt_s, system.digits[2].alt_t,
+                    system.digits[2].big_s, system.digits[2].big_t)
+        for i, v in enumerate(result_symbols):
+            if isinstance(v, CND) and v.value == 2:
+                next_symbol = result_symbols[i +
+                                             1] if i < len(result_symbols) - 1 else None
+                previous_symbol = result_symbols[i - 1] if i > 0 else None
+                if isinstance(next_symbol, CNU) and isinstance(previous_symbol, (CNU, type(None))):
+                    if next_symbol.power != 1 and ((previous_symbol is None) or (previous_symbol.power != 1)):
+                        result_symbols[i] = liang
+
+    # if big is True, '两' will not be used and `alt_two` has no impact on output
+    if big:
+        attr_name = 'big_'
+        if traditional:
+            attr_name += 't'
+        else:
+            attr_name += 's'
+    else:
+        if traditional:
+            attr_name = 'traditional'
+        else:
+            attr_name = 'simplified'
+
+    result = ''.join([getattr(s, attr_name) for s in result_symbols])
+
+    # if not use_zeros:
+    #     result = result.strip(getattr(system.digits[0], attr_name))
+
+    if alt_zero:
+        result = result.replace(
+            getattr(system.digits[0], attr_name), system.digits[0].alt_s)
+
+    if alt_one:
+        result = result.replace(
+            getattr(system.digits[1], attr_name), system.digits[1].alt_s)
+
+    for i, p in enumerate(POINT):
+        if result.startswith(p):
+            return CHINESE_DIGIS[0] + result
+
+    # ^10, 11, .., 19
+    if len(result) >= 2 and result[1] in [SMALLER_CHINESE_NUMERING_UNITS_SIMPLIFIED[0],
+                                          SMALLER_CHINESE_NUMERING_UNITS_TRADITIONAL[0]] and \
+            result[0] in [CHINESE_DIGIS[1], BIG_CHINESE_DIGIS_SIMPLIFIED[1], BIG_CHINESE_DIGIS_TRADITIONAL[1]]:
+        result = result[1:]
+
+    return result
+
+
+# ================================================================================ #
+#                          different types of rewriters
+# ================================================================================ #
+class Cardinal:
+    """
+    CARDINAL class
+    """
+
+    def __init__(self, cardinal=None, chntext=None):
+        self.cardinal = cardinal
+        self.chntext = chntext
+
+    def chntext2cardinal(self):
+        return chn2num(self.chntext)
+
+    def cardinal2chntext(self):
+        return num2chn(self.cardinal)
+
+class Digit:
+    """
+    DIGIT class
+    """
+
+    def __init__(self, digit=None, chntext=None):
+        self.digit = digit
+        self.chntext = chntext
+
+    # def chntext2digit(self):
+    #     return chn2num(self.chntext)
+
+    def digit2chntext(self):
+        return num2chn(self.digit, alt_two=False, use_units=False)
+
+
+class TelePhone:
+    """
+    TELEPHONE class
+    """
+
+    def __init__(self, telephone=None, raw_chntext=None, chntext=None):
+        self.telephone = telephone
+        self.raw_chntext = raw_chntext
+        self.chntext = chntext
+
+    # def chntext2telephone(self):
+    #     sil_parts = self.raw_chntext.split('<SIL>')
+    #     self.telephone = '-'.join([
+    #         str(chn2num(p)) for p in sil_parts
+    #     ])
+    #     return self.telephone
+
+    def telephone2chntext(self, fixed=False):
+
+        if fixed:
+            sil_parts = self.telephone.split('-')
+            self.raw_chntext = '<SIL>'.join([
+                num2chn(part, alt_two=False, use_units=False) for part in sil_parts
+            ])
+            self.chntext = self.raw_chntext.replace('<SIL>', '')
+        else:
+            sp_parts = self.telephone.strip('+').split()
+            self.raw_chntext = '<SP>'.join([
+                num2chn(part, alt_two=False, use_units=False) for part in sp_parts
+            ])
+            self.chntext = self.raw_chntext.replace('<SP>', '')
+        return self.chntext
+
+
+class Fraction:
+    """
+    FRACTION class
+    """
+
+    def __init__(self, fraction=None, chntext=None):
+        self.fraction = fraction
+        self.chntext = chntext
+
+    def chntext2fraction(self):
+        denominator, numerator = self.chntext.split('分之')
+        return chn2num(numerator) + '/' + chn2num(denominator)
+
+    def fraction2chntext(self):
+        numerator, denominator = self.fraction.split('/')
+        return num2chn(denominator) + '分之' + num2chn(numerator)
+
+
+class Date:
+    """
+    DATE class
+    """
+
+    def __init__(self, date=None, chntext=None):
+        self.date = date
+        self.chntext = chntext
+
+    # def chntext2date(self):
+    #     chntext = self.chntext
+    #     try:
+    #         year, other = chntext.strip().split('年', maxsplit=1)
+    #         year = Digit(chntext=year).digit2chntext() + '年'
+    #     except ValueError:
+    #         other = chntext
+    #         year = ''
+    #     if other:
+    #         try:
+    #             month, day = other.strip().split('月', maxsplit=1)
+    #             month = Cardinal(chntext=month).chntext2cardinal() + '月'
+    #         except ValueError:
+    #             day = chntext
+    #             month = ''
+    #         if day:
+    #             day = Cardinal(chntext=day[:-1]).chntext2cardinal() + day[-1]
+    #     else:
+    #         month = ''
+    #         day = ''
+    #     date = year + month + day
+    #     self.date = date
+    #     return self.date
+
+    def date2chntext(self):
+        date = self.date
+        try:
+            year, other = date.strip().split('年', 1)
+            year = Digit(digit=year).digit2chntext() + '年'
+        except ValueError:
+            other = date
+            year = ''
+        if other:
+            try:
+                month, day = other.strip().split('月', 1)
+                month = Cardinal(cardinal=month).cardinal2chntext() + '月'
+            except ValueError:
+                day = date
+                month = ''
+            if day:
+                day = Cardinal(cardinal=day[:-1]).cardinal2chntext() + day[-1]
+        else:
+            month = ''
+            day = ''
+        chntext = year + month + day
+        self.chntext = chntext
+        return self.chntext
+
+
+class Money:
+    """
+    MONEY class
+    """
+
+    def __init__(self, money=None, chntext=None):
+        self.money = money
+        self.chntext = chntext
+
+    # def chntext2money(self):
+    #     return self.money
+
+    def money2chntext(self):
+        money = self.money
+        pattern = re.compile(r'(\d+(\.\d+)?)')
+        matchers = pattern.findall(money)
+        if matchers:
+            for matcher in matchers:
+                money = money.replace(matcher[0], Cardinal(cardinal=matcher[0]).cardinal2chntext())
+        self.chntext = money
+        return self.chntext
+
+
+class Percentage:
+    """
+    PERCENTAGE class
+    """
+
+    def __init__(self, percentage=None, chntext=None):
+        self.percentage = percentage
+        self.chntext = chntext
+
+    def chntext2percentage(self):
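+        text = self.chntext.strip()
+        # str.strip('百分之') would strip any of those characters from both ends; remove only the prefix
+        if text.startswith('百分之'):
+            text = text[len('百分之'):]
+        return chn2num(text) + '%'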
+
+    def percentage2chntext(self):
+        return '百分之' + num2chn(self.percentage.strip().strip('%'))
+
+
+def remove_erhua(text, er_whitelist):
+    """
+    Remove the erhua suffix '儿' from words, keeping whitelisted words intact:
+    他女儿在那边儿 -> 他女儿在那边
+    """
+
+    er_pattern = re.compile(er_whitelist)
+    new_str=''
+    while re.search('儿',text):
+        a = re.search('儿',text).span()
+        remove_er_flag = 0
+
+        if er_pattern.search(text):
+            b = er_pattern.search(text).span()
+            if b[0] <= a[0]:
+                remove_er_flag = 1
+
+        if remove_er_flag == 0 :
+            new_str = new_str + text[0:a[0]]
+            text = text[a[1]:]
+        else:
+            new_str = new_str + text[0:b[1]]
+            text = text[b[1]:]
+
+    text = new_str + text
+    return text
+
+# ================================================================================ #
+#                            NSW Normalizer
+# ================================================================================ #
+class NSWNormalizer:
+    def __init__(self, raw_text):
+        self.raw_text = '^' + raw_text + '$'
+        self.norm_text = ''
+
+    def _particular(self):
+        text = self.norm_text
+        pattern = re.compile(r"(([a-zA-Z]+)二([a-zA-Z]+))")
+        matchers = pattern.findall(text)
+        if matchers:
+            # print('particular')
+            for matcher in matchers:
+                text = text.replace(matcher[0], matcher[1]+'2'+matcher[2], 1)
+        self.norm_text = text
+        return self.norm_text
+
+    def normalize(self):
+        text = self.raw_text
+
+        # normalize dates
+        pattern = re.compile(r"\D+((([089]\d|(19|20)\d{2})年)?(\d{1,2}月(\d{1,2}[日号])?)?)")
+        matchers = pattern.findall(text)
+        if matchers:
+            #print('date')
+            for matcher in matchers:
+                text = text.replace(matcher[0], Date(date=matcher[0]).date2chntext(), 1)
+
+        # normalize money
+        pattern = re.compile(r"\D+((\d+(\.\d+)?)[多余几]?" + CURRENCY_UNITS + r"(\d" + CURRENCY_UNITS + r"?)?)")
+        matchers = pattern.findall(text)
+        if matchers:
+            #print('money')
+            for matcher in matchers:
+                text = text.replace(matcher[0], Money(money=matcher[0]).money2chntext(), 1)
+
+        # normalize landline/mobile phone numbers
+        # mobile
+        # http://www.jihaoba.com/news/show/13680
+        # China Mobile: 139, 138, 137, 136, 135, 134, 159, 158, 157, 150, 151, 152, 188, 187, 182, 183, 184, 178, 198
+        # China Unicom: 130, 131, 132, 156, 155, 186, 185, 176
+        # China Telecom: 133, 153, 189, 180, 181, 177
+        pattern = re.compile(r"\D((\+?86 ?)?1([38]\d|5[0-35-9]|7[678]|9[89])\d{8})\D")
+        matchers = pattern.findall(text)
+        if matchers:
+            #print('telephone')
+            for matcher in matchers:
+                text = text.replace(matcher[0], TelePhone(telephone=matcher[0]).telephone2chntext(), 1)
+        # landline
+        pattern = re.compile(r"\D((0(10|2[1-3]|[3-9]\d{2})-?)?[1-9]\d{6,7})\D")
+        matchers = pattern.findall(text)
+        if matchers:
+            # print('fixed telephone')
+            for matcher in matchers:
+                text = text.replace(matcher[0], TelePhone(telephone=matcher[0]).telephone2chntext(fixed=True), 1)
+
+        # normalize fractions
+        pattern = re.compile(r"(\d+/\d+)")
+        matchers = pattern.findall(text)
+        if matchers:
+            #print('fraction')
+            for matcher in matchers:
+                text = text.replace(matcher, Fraction(fraction=matcher).fraction2chntext(), 1)
+
+        # normalize percentages
+        text = text.replace('％', '%')
+        pattern = re.compile(r"(\d+(\.\d+)?%)")
+        matchers = pattern.findall(text)
+        if matchers:
+            #print('percentage')
+            for matcher in matchers:
+                text = text.replace(matcher[0], Percentage(percentage=matcher[0]).percentage2chntext(), 1)
+
+        # normalize plain numbers followed by a quantifier
+        pattern = re.compile(r"(\d+(\.\d+)?)[多余几]?" + COM_QUANTIFIERS)
+        matchers = pattern.findall(text)
+        if matchers:
+            #print('cardinal+quantifier')
+            for matcher in matchers:
+                text = text.replace(matcher[0], Cardinal(cardinal=matcher[0]).cardinal2chntext(), 1)
+
+        # normalize numeric IDs (read digit by digit)
+        pattern = re.compile(r"(\d{4,32})")
+        matchers = pattern.findall(text)
+        if matchers:
+            #print('digit')
+            for matcher in matchers:
+                text = text.replace(matcher, Digit(digit=matcher).digit2chntext(), 1)
+
+        # normalize plain numbers
+        pattern = re.compile(r"(\d+(\.\d+)?)")
+        matchers = pattern.findall(text)
+        if matchers:
+            #print('cardinal')
+            for matcher in matchers:
+                text = text.replace(matcher[0], Cardinal(cardinal=matcher[0]).cardinal2chntext(), 1)
+
+        self.norm_text = text
+        self._particular()
+
+        return self.norm_text.lstrip('^').rstrip('$')
+
+
+def nsw_test_case(raw_text):
+    print('I:' + raw_text)
+    print('O:' + NSWNormalizer(raw_text).normalize())
+    print('')
+
+
+def nsw_test():
+    nsw_test_case('固话：0595-23865596或23880880。')
+    nsw_test_case('固话：0595-23865596或23880880。')
+    nsw_test_case('手机：+86 19859213959或15659451527。')
+    nsw_test_case('分数：32477/76391。')
+    nsw_test_case('百分数：80.03%。')
+    nsw_test_case('编号：31520181154418。')
+    nsw_test_case('纯数：2983.07克或12345.60米。')
+    nsw_test_case('日期：1999年2月20日或09年3月15号。')
+    nsw_test_case('金钱：12块5，34.5元，20.1万')
+    nsw_test_case('特殊：O2O或B2C。')
+    nsw_test_case('3456万吨')
+    nsw_test_case('2938个')
+    nsw_test_case('938')
+    nsw_test_case('今天吃了115个小笼包231个馒头')
+    nsw_test_case('有62％的概率')
+
+
+if __name__ == '__main__':
+    #nsw_test()
+
+    p = argparse.ArgumentParser()
+    p.add_argument('ifile', help='input filename, assume utf-8 encoding')
+    p.add_argument('ofile', help='output filename')
+    p.add_argument('--to_upper', action='store_true', help='convert to upper case')
+    p.add_argument('--to_lower', action='store_true', help='convert to lower case')
+    p.add_argument('--has_key', action='store_true', help="input text has Kaldi's key as first field.")
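+    # note: argparse's type=bool treats any non-empty string (including "False") as True, so parse the text explicitly
+    p.add_argument('--remove_fillers', type=lambda s: s.lower() in ('true', '1', 'yes'), default=True, help='remove filler chars such as "呃, 啊"')
+    p.add_argument('--remove_erhua', type=lambda s: s.lower() in ('true', '1', 'yes'), default=True, help='remove erhua chars such as "这儿"')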
+    p.add_argument('--log_interval', type=int, default=10000, help='log interval in number of processed lines')
+    args = p.parse_args()
+
+    ifile = codecs.open(args.ifile, 'r', 'utf8')
+    ofile = codecs.open(args.ofile, 'w+', 'utf8')
+
+    n = 0
+    for l in ifile:
+        key = ''
+        text = ''
+        if args.has_key:
+            cols = l.split(maxsplit=1)
+            key = cols[0]
+            if len(cols) == 2:
+                text = cols[1].strip()
+            else:
+                text = ''
+        else:
+            text = l.strip()
+
+        # cases
+        if args.to_upper and args.to_lower:
+            sys.stderr.write('text norm: to_upper OR to_lower?\n')
+            sys.exit(1)
+        if args.to_upper:
+            text = text.upper()
+        if args.to_lower:
+            text = text.lower()
+
+        # Filler chars removal
+        if args.remove_fillers:
+            for ch in FILLER_CHARS:
+                text = text.replace(ch, '')
+
+        if args.remove_erhua:
+            text = remove_erhua(text, ER_WHITELIST)
+
+        # NSW(Non-Standard-Word) normalization
+        text = NSWNormalizer(text).normalize()
+
+        # Punctuations removal
+        old_chars = CHINESE_PUNC_LIST + string.punctuation # includes all CN and EN punctuations
+        new_chars = ' ' * len(old_chars)
+        del_chars = ''
+        text = text.translate(str.maketrans(old_chars, new_chars, del_chars))
+
+        # write the normalized line
+        if args.has_key:
+            ofile.write(key + '\t' + text + '\n')
+        else:
+            ofile.write(text + '\n')
+
+        n += 1
+        if n % args.log_interval == 0:
+            sys.stderr.write("text norm: {} lines done.\n".format(n))
+
+    sys.stderr.write("text norm: {} lines done in total.\n".format(n))
+
+    ifile.close()
+    ofile.close()
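Note on `textnorm_zh.py`: the `create_system` docstring describes three numbering types ('low', 'mid', 'high') that differ only in how fast the units from '亿' upward grow. The sketch below restates the exponent arithmetic from `ChineseNumberUnit.create` as a standalone function; `larger_unit_power` and `LARGER_UNITS` are illustrative names, not part of the file.

```python
# Illustrative sketch: exponent assigned to the i-th unit at or above '亿'
# under each numbering type, mirroring the arithmetic in ChineseNumberUnit.create.
LARGER_UNITS = '亿兆京垓秭穰沟涧正载'


def larger_unit_power(index, numbering_type='mid'):
    """Return p such that the unit at `index` is worth 10**p."""
    if numbering_type == 'low':
        return index + 8            # each unit is 10x the previous one
    if numbering_type == 'mid':
        return (index + 2) * 4      # each unit is 10**4 times the previous one
    if numbering_type == 'high':
        return 2 ** (index + 3)     # each unit is the square of the previous one
    raise ValueError('unknown numbering type: {}'.format(numbering_type))


if __name__ == '__main__':
    for name in ('low', 'mid', 'high'):
        powers = [larger_unit_power(i, name) for i in range(3)]
        print(name, dict(zip(LARGER_UNITS, powers)))
```

For '兆' (index 1) this gives exponents 9, 12, and 16, which agrees with the values listed in the docstring.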
--- a/egs_modelscope/aishell/paraformer/README.md
+++ b/egs_modelscope/aishell/paraformer/README.md
@ -0,0 +1,38 @@
+# ModelScope: Paraformer-large Model
+
+## Highlight
+
+### ModelScope: Paraformer-Large Model
+- <strong>Fast</strong>: As a non-autoregressive (NAR) model, Paraformer achieves performance comparable to a state-of-the-art autoregressive (AR) Transformer, with more than 10x speedup.
+- <strong>Accurate</strong>: State-of-the-art results on many public ASR tasks, with a significant relative improvement; suitable for industrial deployment.
+- <strong>Convenient</strong>: Quickly and easily download Paraformer-large from ModelScope for finetuning and inference.
+    - Supports finetuning and inference on AISHELL-1 and AISHELL-2.
+    - Supports inference on AISHELL-1, AISHELL-2, WenetSpeech, SpeechIO, and other audio.
+
+## How to finetune and infer using a pretrained ModelScope Paraformer-large Model
+
+### Finetune
+- Modify the finetuning-related training parameters in `conf/train_asr_paraformer_sanm_50e_16d_2048_512_lfr6.yaml`
+- Set the parameters in `paraformer_large_finetune.sh`
+    - <strong>data_aishell:</strong> please set the aishell data path
+    - <strong>tag:</strong> exp tag
+    - <strong>init_model_name:</strong> speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch # pre-trained model, downloaded from ModelScope during finetuning
+- Then run the pipeline to finetune the model downloaded from ModelScope and run inference afterwards:
+```sh
+    sh ./paraformer_large_finetune.sh
+``` 
+
+### Inference
+
+Alternatively, download the model from ModelScope and run inference directly.
+
+- Set the parameters in `paraformer_large_infer.sh`
+    - <strong>ori_data:</strong> please set the aishell raw data path
+    - <strong>data_dir:</strong> data output directory
+    - <strong>exp_dir:</strong> the result path
+    - <strong>model_name:</strong> speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch # pre-trained model, downloaded from ModelScope
+    - <strong>test_sets:</strong> please set the test set names
+- Then run the inference pipeline:
+```sh
+    sh ./paraformer_large_infer.sh
+```
--- a/egs_modelscope/aishell/paraformer/RESULTS.md
+++ b/egs_modelscope/aishell/paraformer/RESULTS.md
@ -0,0 +1,24 @@
+# Paraformer-Large
+- Model link: <https://www.modelscope.cn/models/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary>
+- Model size: 220M
+- Train config: conf/train_asr_paraformer_sanm_50e_16d_2048_512_lfr6.yaml
+
+# Environments
+- date: `Tue Nov 22 18:48:39 CST 2022`
+- python version: `3.7.12`
+- FunASR version: `0.1.0`
+- pytorch version: `pytorch 1.7.0`
+- Git hash: ``
+- Commit date: ``
+
+# Benchmark Results
+
+## AISHELL-1
+- Decode config: conf/decode_asr_transformer_noctc_1best.yaml
+  - Decode without CTC
+  - Decode without LM
+
+| testset   | CER(%)|
+|:---------:|:-----:|
+| dev       | 1.75  |
+| test      | 1.95  |
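Note on `RESULTS.md`: the table reports CER (character error rate). The repository's own scoring script is not part of this excerpt, so the following is only a minimal sketch of the standard definition (total edit distance divided by total reference characters); `edit_distance` and `cer` are illustrative names.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion of a reference character
                            curr[j - 1] + 1,      # insertion of a hypothesis character
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(refs, hyps):
    """Corpus-level CER in percent: total edits divided by total reference characters."""
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    chars = sum(len(r) for r in refs)
    return 100.0 * errors / chars


if __name__ == '__main__':
    print(cer(['今天天气很好'], ['今天天汽很好']))
```

Computing the rate over the whole corpus rather than averaging per-utterance rates keeps short utterances from dominating the figure.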
--- a/egs_modelscope/aishell/paraformer/conf/decode_asr_transformer_noctc_10best_lm_weight_0.15.yaml
+++ b/egs_modelscope/aishell/paraformer/conf/decode_asr_transformer_noctc_10best_lm_weight_0.15.yaml
@ -0,0 +1,6 @@
+beam_size: 10
+penalty: 0.0
+maxlenratio: 0.0
+minlenratio: 0.0
+ctc_weight: 0.0
+lm_weight: 0.15
--- a/egs_modelscope/aishell/paraformer/conf/decode_asr_transformer_noctc_1best.yaml
+++ b/egs_modelscope/aishell/paraformer/conf/decode_asr_transformer_noctc_1best.yaml
@ -0,0 +1,6 @@
+beam_size: 1
+penalty: 0.0
+maxlenratio: 0.0
+minlenratio: 0.0
+ctc_weight: 0.0
+lm_weight: 0.0
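Note on the two decode configs: they differ only in `beam_size` and `lm_weight`. As a rough sketch, and assuming an ESPnet-style log-linear combination of scorers (the decoder code itself is not shown in this excerpt, so the formula is an assumption), the weights would enter the hypothesis score like this; `combined_score` is an illustrative name.

```python
def combined_score(att_logp, ctc_logp, lm_logp, length,
                   ctc_weight=0.0, lm_weight=0.0, penalty=0.0):
    """Hypothetical log-linear score of one hypothesis (assumed ESPnet-style weighting)."""
    return ((1.0 - ctc_weight) * att_logp   # decoder (attention) log-probability
            + ctc_weight * ctc_logp         # CTC log-probability
            + lm_weight * lm_logp           # external language model log-probability
            + penalty * length)             # per-token length bonus


if __name__ == '__main__':
    # 1best config: ctc_weight=0.0, lm_weight=0.0 -> only the decoder score counts
    print(combined_score(-3.0, -5.0, -7.0, 4))
    # 10best config: lm_weight=0.15 adds a scaled LM term
    print(combined_score(-3.0, -5.0, -7.0, 4, lm_weight=0.15))
```

Under this assumption, `ctc_weight: 0.0` and `lm_weight: 0.0` reduce the score to the decoder term alone, which matches the "Decode without CTC / without LM" note in `RESULTS.md`.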
--- a/egs_modelscope/aishell/paraformer/conf/train_asr_paraformer_sanm_50e_16d_2048_512_lfr6.yaml
+++ b/egs_modelscope/aishell/paraformer/conf/train_asr_paraformer_sanm_50e_16d_2048_512_lfr6.yaml
@ -0,0 +1,91 @@
+# network architecture
+# encoder related
+encoder_conf:
+    dropout_rate: 0.1
+    positional_dropout_rate: 0.1
+    attention_dropout_rate: 0.1
+
+# decoder related
+decoder_conf:
+    dropout_rate: 0.1
+    positional_dropout_rate: 0.1
+    self_attention_dropout_rate: 0.1
+    src_attention_dropout_rate: 0.1
+
+predictor_conf:
+  threshold: 1.0
+  l_order: 1
+  r_order: 1
+  tail_threshold: 0.45
+
+# hybrid CTC/attention
+model_conf:
+    ctc_weight: 0.0
+    lsm_weight: 0.1     # label smoothing option
+    length_normalized_loss: true
+    predictor_weight: 1.0
+    predictor_bias: 1
+    sampling_ratio: 0.75
+
+# minibatch related
+# dataset_type: small
+batch_type: length
+batch_bins: 2000
+num_workers: 16
+# dataset_type: large
+dataset_conf:
+    filter_conf:
+        min_length: 10
+        max_length: 250
+        min_token_length: 1
+        max_token_length: 200
+    shuffle: true
+    shuffle_conf:
+        shuffle_size: 10240
+        sort_size: 500
+    batch_conf:
+        batch_type: 'token'
+        batch_size: 6000
+    num_workers: 16
+
+# optimization related
+accum_grad: 1
+grad_clip: 5
+max_epoch: 20
+val_scheduler_criterion:
+    - valid
+    - acc
+best_model_criterion:
+-   - valid
+    - acc
+    - max
+keep_nbest_models: 10
+
+optim: adam
+optim_conf:
+   lr: 0.0005
+scheduler: warmuplr
+scheduler_conf:
+   warmup_steps: 30000
+
+specaug: specaug_lfr
+specaug_conf:
+    apply_time_warp: false
+    time_warp_window: 5
+    time_warp_mode: bicubic
+    apply_freq_mask: true
+    freq_mask_width_range:
+    - 0
+    - 30
+    lfr_rate: 6
+    num_freq_mask: 1
+    apply_time_mask: true
+    time_mask_width_range:
+    - 0
+    - 12
+    num_time_mask: 1
+
+unused_parameters: true
+log_interval: 50
+normalize: None
+split_with_space: true
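
The optimizer settings in this config (`lr: 0.0005`, `scheduler: warmuplr`, `warmup_steps: 30000`) can be made concrete with a small sketch. It assumes `warmuplr` follows the Noam-style rule used by ESPnet's `WarmupLR` (a linear ramp up to the peak rate, then inverse-square-root decay); that is an assumption about the scheduler, not something the config states. `warmup_lr` is a hypothetical helper name.

```sh
# Hypothetical helper: learning rate at a given step, ASSUMING
#   lr(step) = base_lr * sqrt(warmup) * min(1/sqrt(step), step / warmup^1.5)
# Arguments: base_lr warmup_steps step
warmup_lr() {
    awk -v lr="$1" -v w="$2" -v s="$3" 'BEGIN {
        decay = 1 / sqrt(s)          # inverse-square-root decay branch
        ramp  = s / (w * sqrt(w))    # linear warm-up branch
        m = (decay < ramp) ? decay : ramp
        printf "%.6g\n", lr * sqrt(w) * m
    }'
}

# Values taken from optim_conf and scheduler_conf in the config.
for step in 15000 30000 120000; do
    echo "step ${step}: lr $(warmup_lr 0.0005 30000 "${step}")"
done
```

Under this assumption the rate peaks at `lr` exactly at `warmup_steps`, and is half the peak both halfway through warm-up and at four times `warmup_steps`.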
--- a/egs_modelscope/aishell/paraformer/local/aishell_data_prep.sh
+++ b/egs_modelscope/aishell/paraformer/local/aishell_data_prep.sh
@ -0,0 +1,66 @@
+#!/bin/bash
+
+# Copyright 2017 Xingyu Na
+# Apache 2.0
+
+#. ./path.sh || exit 1;
+
+if [ $# != 3 ]; then
+  echo "Usage: $0 <audio-path> <text-path> <output-path>"
+  echo " $0 /export/a05/xna/data/data_aishell/wav /export/a05/xna/data/data_aishell/transcript data"
+  exit 1;
+fi
+
+aishell_audio_dir=$1
+aishell_text=$2/aishell_transcript_v0.8.txt
+output_dir=$3
+
+train_dir=$output_dir/data/local/train
+dev_dir=$output_dir/data/local/dev
+test_dir=$output_dir/data/local/test
+tmp_dir=$output_dir/data/local/tmp
+
+mkdir -p $train_dir
+mkdir -p $dev_dir
+mkdir -p $test_dir
+mkdir -p $tmp_dir
+
+# data directory check
+if [ ! -d $aishell_audio_dir ] || [ ! -f $aishell_text ]; then
+  echo "Error: $0 requires an existing audio directory and the transcript file $aishell_text"
+  exit 1;
+fi
+
+# find wav audio file for train, dev and test resp.
+find $aishell_audio_dir -iname "*.wav" > $tmp_dir/wav.flist
+n=`cat $tmp_dir/wav.flist | wc -l`
+[ $n -ne 141925 ] && \
+  echo Warning: expected 141925 data files, found $n
+
+grep -i "wav/train" $tmp_dir/wav.flist > $train_dir/wav.flist || exit 1;
+grep -i "wav/dev" $tmp_dir/wav.flist > $dev_dir/wav.flist || exit 1;
+grep -i "wav/test" $tmp_dir/wav.flist > $test_dir/wav.flist || exit 1;
+
+rm -r $tmp_dir
+
+# Transcriptions preparation
+for dir in $train_dir $dev_dir $test_dir; do
+  echo Preparing $dir transcriptions
+  sed -e 's/\.wav//' $dir/wav.flist | awk -F '/' '{print $NF}' > $dir/utt.list
+  paste -d' ' $dir/utt.list $dir/wav.flist > $dir/wav.scp_all
+  utils/filter_scp.pl -f 1 $dir/utt.list $aishell_text > $dir/transcripts.txt
+  awk '{print $1}' $dir/transcripts.txt > $dir/utt.list
+  utils/filter_scp.pl -f 1 $dir/utt.list $dir/wav.scp_all | sort -u > $dir/wav.scp
+  sort -u $dir/transcripts.txt > $dir/text
+done
+
+mkdir -p $output_dir/data/train $output_dir/data/dev $output_dir/data/test
+
+for f in wav.scp text; do
+  cp $train_dir/$f $output_dir/data/train/$f || exit 1;
+  cp $dev_dir/$f $output_dir/data/dev/$f || exit 1;
+  cp $test_dir/$f $output_dir/data/test/$f || exit 1;
+done
+
+echo "$0: AISHELL data preparation succeeded"
+exit 0;
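
The transcription loop in `aishell_data_prep.sh` derives each utterance ID from its wav path before pairing the two into `wav.scp_all`. The sketch below isolates that step using the same `sed` and `awk` commands as the script; the function name `utt_id_from_path` and the sample path are hypothetical.

```sh
# Derive an utterance ID from a wav path, as the data-prep loop does:
# drop the ".wav" suffix, then keep the last "/"-separated field.
utt_id_from_path() {
    printf '%s\n' "$1" | sed -e 's/\.wav//' | awk -F '/' '{print $NF}'
}

# Illustrative (hypothetical) path; prints "<utt-id> <wav-path>",
# the same layout as one line of wav.scp_all.
wav=/data/data_aishell/wav/train/S0002/BAC009S0002W0122.wav
echo "$(utt_id_from_path "$wav") $wav"
```

Because `sed` removes only the first `.wav` match on a line, this relies on no directory name in the path containing `.wav`.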
--- a/egs_modelscope/aishell/paraformer/modelscope_utils
+++ b/egs_modelscope/aishell/paraformer/modelscope_utils
@ -0,0 +1 @@
+../../common/modelscope_utils
--- a/egs_modelscope/aishell/paraformer/paraformer_large_finetune.sh
+++ b/egs_modelscope/aishell/paraformer/paraformer_large_finetune.sh
@ -0,0 +1,224 @@
+#!/usr/bin/env bash
+
+. ./path.sh || exit 1;
+
+# machines configuration
+CUDA_VISIBLE_DEVICES="0,1" # set gpus, e.g., CUDA_VISIBLE_DEVICES="0,1"
+gpu_num=2
+count=1
+gpu_inference=true # Whether to perform gpu decoding, set false for cpu decoding
+njob=4 # the number of jobs for each gpu
+train_cmd=utils/run.pl
+
+# general configuration
+feats_dir="." # feature output directory, for large data
+exp_dir="."
+lang=zh
+dumpdir=dump/fbank
+feats_type=fbank
+token_type=char
+scp=feats.scp
+type=kaldi_ark
+stage=0
+stop_stage=4
+
+# feature configuration
+feats_dim=560
+sample_frequency=16000
+nj=32
+speed_perturb="1.0"
+lfr=True
+lfr_m=7
+lfr_n=6
+
+init_model_name=speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch  # pre-trained model, download from modelscope during fine-tuning
+cmvn_file=init_model/${init_model_name}/am.mvn
+seg_file=init_model/${init_model_name}/seg_dict
+vocab=init_model/${init_model_name}/tokens.txt
+
+# data
+data_aishell=
+
+# exp tag
+tag=""
+
+# Set bash to strict mode; the script will exit on:
+# -e 'error', -u 'undefined variable', -o pipefail 'error in pipeline'
+set -e
+set -u
+set -o pipefail
+
+train_set=train
+valid_set=dev
+test_sets="dev test"
+
+asr_config=conf/train_asr_paraformer_sanm_50e_16d_2048_512_lfr6.yaml
+init_param="init_model/${init_model_name}/${init_model_name}"
+
+inference_config=conf/decode_asr_transformer_noctc_1best.yaml
+inference_asr_model=valid.acc.ave_10best.pth
+
+. utils/parse_options.sh || exit 1;
+
+# download model from modelscope
+python modelscope_utils/download_model.py --model_name ${init_model_name}
+
+if [ ! -d ${HOME}/.cache/modelscope/hub/damo/${init_model_name} ]; then
+    echo "${HOME}/.cache/modelscope/hub/damo/${init_model_name} must exist"
+    exit 1
+else
+    if [ -d init_model/${init_model_name} ]; then
+        echo "init_model/${init_model_name} already exists. If you want to copy the pre-trained model again, please delete init_model/${init_model_name} first."
+    else
+        mkdir -p init_model/${init_model_name}
+        cp -r ${HOME}/.cache/modelscope/hub/damo/${init_model_name}/* init_model/${init_model_name}
+    fi
+fi
+
+model_dir="baseline_$(basename "${asr_config}" .yaml)_${feats_type}_${lang}_${token_type}_${tag}"
+
+# you can set gpu num for decoding here
+gpuid_list=$CUDA_VISIBLE_DEVICES  # set gpus for decoding, the same as training stage by default
+ngpu=$(echo $gpuid_list | awk -F "," '{print NF}')
+inference_nj=$[${ngpu}*${njob}]
+
+if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
+    echo "stage 0: Data preparation"
+    # Data preparation
+    local/aishell_data_prep.sh ${data_aishell}/data_aishell/wav ${data_aishell}/data_aishell/transcript ${feats_dir}
+    for x in train dev test; do
+        cp ${feats_dir}/data/${x}/text ${feats_dir}/data/${x}/text.org
+        paste -d " " <(cut -f 1 -d" " ${feats_dir}/data/${x}/text.org) <(cut -f 2- -d" " ${feats_dir}/data/${x}/text.org | tr -d " ") \
+            > ${feats_dir}/data/${x}/text
+        rm ${feats_dir}/data/${x}/text.org
+    done
+fi
+
+feat_train_dir=${feats_dir}/${dumpdir}/train; mkdir -p ${feat_train_dir}
+feat_dev_dir=${feats_dir}/${dumpdir}/dev; mkdir -p ${feat_dev_dir}
+feat_test_dir=${feats_dir}/${dumpdir}/test; mkdir -p ${feat_test_dir}
+if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
+    echo "Feature Generation"
+    # compute fbank features
+    fbankdir=${feats_dir}/fbank
+    utils/compute_fbank.sh --cmd "$train_cmd" --nj $nj --speed_perturb ${speed_perturb} \
+        ${feats_dir}/data/train ${exp_dir}/exp/make_fbank/train ${fbankdir}/train
+    utils/fix_data_feat.sh ${fbankdir}/train
+    utils/compute_fbank.sh --cmd "$train_cmd" --nj $nj \
+        ${feats_dir}/data/dev ${exp_dir}/exp/make_fbank/dev ${fbankdir}/dev
+    utils/fix_data_feat.sh ${fbankdir}/dev
+    utils/compute_fbank.sh --cmd "$train_cmd" --nj $nj \
+        ${feats_dir}/data/test ${exp_dir}/exp/make_fbank/test ${fbankdir}/test
+    utils/fix_data_feat.sh ${fbankdir}/test
+
+    echo "apply low_frame_rate and cmvn"
+    [ ! -f ${cmvn_file} ] && echo "$0: cmvn file is required" && exit 1;
+    utils/apply_lfr_and_cmvn.sh --cmd "$train_cmd" --nj $nj \
+        --lfr $lfr --lfr-m $lfr_m --lfr-n $lfr_n \
+        ${fbankdir}/train ${cmvn_file} ${exp_dir}/exp/make_fbank/train ${feat_train_dir}
+    utils/apply_lfr_and_cmvn.sh --cmd "$train_cmd" --nj $nj \
+        --lfr $lfr --lfr-m $lfr_m --lfr-n $lfr_n \
+        ${fbankdir}/dev ${cmvn_file} ${exp_dir}/exp/make_fbank/dev ${feat_dev_dir}
+    utils/apply_lfr_and_cmvn.sh --cmd "$train_cmd" --nj $nj \
+        --lfr $lfr --lfr-m $lfr_m --lfr-n $lfr_n \
+        ${fbankdir}/test ${cmvn_file} ${exp_dir}/exp/make_fbank/test ${feat_test_dir}
+
+    echo "Text Tokenize"
+    # 我爱reading->我 爱 read@@ ing
+    utils/text_tokenize.sh --cmd "$train_cmd" --nj $nj ${fbankdir}/train ${seg_file} ${feat_train_dir}/log ${feat_train_dir}
+    utils/fix_data_feat.sh ${feat_train_dir}
+    utils/text_tokenize.sh --cmd "$train_cmd" --nj $nj ${fbankdir}/dev ${seg_file} ${feat_dev_dir}/log ${feat_dev_dir}
+    utils/fix_data_feat.sh ${feat_dev_dir}
+    cp ${fbankdir}/test/text ${feat_test_dir}
+fi
+
+token_list=${feats_dir}/data/${lang}_token_list/char/tokens.txt
+echo "dictionary: ${token_list}"
+if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
+    echo "stage 2: Dictionary Preparation"
+    mkdir -p ${feats_dir}/data/${lang}_token_list/char/
+    cp $vocab ${token_list}
+
+    vocab_size=$(wc -l <${token_list})
+    awk -v v=,${vocab_size} '{print $0v}' ${feat_train_dir}/text_shape > ${feat_train_dir}/text_shape.char
+    awk -v v=,${vocab_size} '{print $0v}' ${feat_dev_dir}/text_shape > ${feat_dev_dir}/text_shape.char
+    mkdir -p ${feats_dir}/asr_stats_fbank_zh_char/train
+    mkdir -p ${feats_dir}/asr_stats_fbank_zh_char/dev
+    cp ${feat_train_dir}/speech_shape ${feat_train_dir}/text_shape ${feat_train_dir}/text_shape.char ${feats_dir}/asr_stats_fbank_zh_char/train
+    cp ${feat_dev_dir}/speech_shape ${feat_dev_dir}/text_shape ${feat_dev_dir}/text_shape.char ${feats_dir}/asr_stats_fbank_zh_char/dev
+fi
+
+# Training Stage
+world_size=$gpu_num  # run on one machine
+if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
+    # update asr train config.yaml
+    python modelscope_utils/update_config.py --modelscope_config init_model/${init_model_name}/asr_train_config.yaml --finetune_config ${asr_config} --output_config init_model/${init_model_name}/asr_finetune_config.yaml
+    finetune_config=init_model/${init_model_name}/asr_finetune_config.yaml
+
+    mkdir -p ${exp_dir}/exp/${model_dir}
+    mkdir -p ${exp_dir}/exp/${model_dir}/log
+    INIT_FILE=$exp_dir/ddp_init
+    if [ -f $INIT_FILE ];then
+        rm -f $INIT_FILE
+    fi
+    init_method=file://$(readlink -f $INIT_FILE)
+    echo "$0: init method is $init_method"
+    for ((i = 0; i < $gpu_num; ++i)); do
+        {
+            rank=$i
+            local_rank=$i
+            gpu_id=$(echo $CUDA_VISIBLE_DEVICES | cut -d',' -f$[$i+1])
+            asr_train_paraformer.py \
+                --gpu_id $gpu_id \
+                --use_preprocessor true \
+                --token_type $token_type \
+                --token_list $token_list \
+                --train_data_path_and_name_and_type ${feats_dir}/${dumpdir}/${train_set}/${scp},speech,${type} \
+                --train_data_path_and_name_and_type ${feats_dir}/${dumpdir}/${train_set}/text,text,text \
+                --train_shape_file ${feats_dir}/asr_stats_fbank_zh_char/${train_set}/speech_shape \
+                --train_shape_file ${feats_dir}/asr_stats_fbank_zh_char/${train_set}/text_shape.char \
+                --valid_data_path_and_name_and_type ${feats_dir}/${dumpdir}/${valid_set}/${scp},speech,${type} \
+                --valid_data_path_and_name_and_type ${feats_dir}/${dumpdir}/${valid_set}/text,text,text \
+                --valid_shape_file ${feats_dir}/asr_stats_fbank_zh_char/${valid_set}/speech_shape \
+                --valid_shape_file ${feats_dir}/asr_stats_fbank_zh_char/${valid_set}/text_shape.char  \
+                --resume true \
+                --output_dir ${exp_dir}/exp/${model_dir} \
+                --init_param $init_param \
+                --config $finetune_config \
+                --input_size $feats_dim \
+                --ngpu $gpu_num \
+                --num_worker_count $count \
+                --multiprocessing_distributed true \
+                --dist_init_method $init_method \
+                --dist_world_size $world_size \
+                --dist_rank $rank \
+                --local_rank $local_rank 1> ${exp_dir}/exp/${model_dir}/log/train.log.$i 2>&1
+        } &
+        done
+        wait
+fi
+
+# Testing Stage
+if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
+    ./utils/easy_asr_infer.sh \
+        --lang zh \
+        --datadir ${feats_dir} \
+        --feats_type ${feats_type} \
+        --feats_dim ${feats_dim} \
+        --token_type ${token_type} \
+        --gpu_inference ${gpu_inference} \
+        --inference_config "${inference_config}" \
+        --test_sets "${test_sets}" \
+        --token_list $token_list \
+        --asr_exp ${exp_dir}/exp/${model_dir} \
+        --stage 12 \
+        --stop_stage 12 \
+        --scp $scp \
+        --text text \
+        --inference_nj $inference_nj \
+        --njob $njob \
+        --inference_asr_model $inference_asr_model \
+        --gpuid_list $gpuid_list \
+        --mode paraformer
+fi
+
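
The training stage of `paraformer_large_finetune.sh` starts one process per rank and picks that rank's GPU out of `CUDA_VISIBLE_DEVICES` with `cut`, while the decoding setup counts GPUs with `awk`. A minimal sketch of both steps follows; the function names and the device list are hypothetical, and it uses POSIX `$(( ))` arithmetic in place of the script's `$[ ]`.

```sh
# Count the GPUs in a comma-separated list, as the script does for ngpu.
count_gpus() {
    echo "$1" | awk -F "," '{print NF}'
}

# Pick the GPU id for a zero-based rank (cut fields are one-based).
gpu_for_rank() {
    echo "$1" | cut -d',' -f"$(( $2 + 1 ))"
}

devices="2,5,7"   # hypothetical CUDA_VISIBLE_DEVICES value
rank=0
while [ "$rank" -lt "$(count_gpus "$devices")" ]; do
    echo "rank ${rank} -> gpu $(gpu_for_rank "$devices" "$rank")"
    rank=$(( rank + 1 ))
done
```

In the script, `gpu_num` is set independently of `CUDA_VISIBLE_DEVICES`; if it exceeds the number of listed devices, the `cut` lookup yields an empty GPU id for the extra ranks, so the two settings should agree.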
--- a/egs_modelscope/aishell/paraformer/paraformer_large_infer.sh
+++ b/egs_modelscope/aishell/paraformer/paraformer_large_infer.sh
@ -0,0 +1,70 @@
+#!/usr/bin/env bash
+
+set -e
+set -u
+set -o pipefail
+
+ori_data=
+data_dir=
+exp_dir=
+model_name=speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch
+inference_nj=32
+gpuid_list="0,1" # set gpus, e.g., gpuid_list="0,1"
+ngpu=$(echo $gpuid_list | awk -F "," '{print NF}')
+njob=4  # the number of jobs for each gpu
+gpu_inference=true  # Whether to perform gpu decoding, set false for cpu decoding
+
+if ${gpu_inference}; then
+    inference_nj=$[${ngpu}*${njob}]
+else
+    inference_nj=$njob
+fi
+
+# LM configs
+use_lm=false
+beam_size=1
+lm_weight=0.0
+
+test_sets="dev test"
+
+. utils/parse_options.sh
+
+aishell_audio_dir=$ori_data/data_aishell/wav
+aishell_text=$ori_data/data_aishell/transcript/aishell_transcript_v0.8.txt
+dev_dir=${data_dir}/aishell/dev
+test_dir=${data_dir}/aishell/test
+tmp_dir=${data_dir}/aishell/tmp
+
+mkdir -p ${dev_dir}
+mkdir -p ${test_dir}
+mkdir -p ${tmp_dir}
+
+find $aishell_audio_dir -iname "*.wav" > $tmp_dir/wav.flist
+grep -i "wav/dev" $tmp_dir/wav.flist > $dev_dir/wav.flist || exit 1;
+grep -i "wav/test" $tmp_dir/wav.flist > $test_dir/wav.flist || exit 1;
+
+rm -r $tmp_dir
+
+for dir in $dev_dir $test_dir; do
+    sed -e 's/\.wav//' $dir/wav.flist | awk -F '/' '{print $NF}' > $dir/utt.list
+    paste -d' ' $dir/utt.list $dir/wav.flist > $dir/wav.scp_all
+    utils/filter_scp.pl -f 1 $dir/utt.list $aishell_text > $dir/transcripts.txt
+    awk '{print $1}' $dir/transcripts.txt > $dir/utt.list
+    utils/filter_scp.pl -f 1 $dir/utt.list $dir/wav.scp_all | sort -u > $dir/wav.scp
+    sort -u $dir/transcripts.txt > $dir/text
+done
+
+mkdir -p ${exp_dir}/aishell
+
+modelscope_utils/modelscope_infer.sh \
+        --data_dir ${data_dir}/aishell \
+        --exp_dir ${exp_dir}/aishell \
+        --test_sets "${test_sets}" \
+        --model_name ${model_name} \
+        --inference_nj ${inference_nj} \
+        --gpuid_list ${gpuid_list} \
+        --njob ${njob} \
+        --gpu_inference ${gpu_inference} \
+        --use_lm ${use_lm} \
+        --beam_size ${beam_size} \
+        --lm_weight ${lm_weight}
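
The number of parallel decoding jobs in `paraformer_large_infer.sh` depends on whether GPU decoding is enabled: `ngpu * njob` on GPU, plain `njob` on CPU. The sketch below restates that rule as a function; `inference_jobs` is a hypothetical name, not part of the recipe.

```sh
# Arguments: gpu_inference (true/false), comma-separated GPU ids, jobs per GPU.
inference_jobs() {
    if [ "$1" = true ]; then
        ngpu=$(echo "$2" | awk -F "," '{print NF}')
        echo $(( ngpu * $3 ))   # one group of jobs per listed GPU
    else
        echo "$3"               # CPU decoding: njob jobs in total
    fi
}

echo "gpu: $(inference_jobs true "0,1" 4) jobs"
echo "cpu: $(inference_jobs false "0,1" 4) jobs"
```

In the script this value is computed before `utils/parse_options.sh` is sourced, so, assuming Kaldi-style option parsing, overriding `--gpuid_list` or `--njob` on the command line would not recompute `inference_nj`; the initial `inference_nj=32` is likewise always overwritten.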
--- a/egs_modelscope/aishell/paraformer/path.sh
+++ b/egs_modelscope/aishell/paraformer/path.sh
@ -0,0 +1,5 @@
+export FUNASR_DIR=$PWD/../../..
+
+# NOTE(kan-bayashi): Use UTF-8 in Python to avoid UnicodeDecodeError when LC_ALL=C
+export PYTHONIOENCODING=UTF-8
+export PATH=$FUNASR_DIR/funasr/bin:$PATH
--- a/egs_modelscope/aishell/paraformer/utils
+++ b/egs_modelscope/aishell/paraformer/utils
@ -0,0 +1 @@
+../../../egs/aishell/tranformer/utils/
--- a/egs_modelscope/aishell2/paraformer/README.md
+++ b/egs_modelscope/aishell2/paraformer/README.md
@ -0,0 +1,39 @@
+# ModelScope: Paraformer-large Model
+
+## Highlight
+
+### ModelScope: Paraformer-Large Model
+- <strong>Fast</strong>: Paraformer is a non-autoregressive (NAR) model that achieves performance comparable to a state-of-the-art autoregressive (AR) Transformer, with more than 10x speedup.
+- <strong>Accurate</strong>: State-of-the-art results on many public ASR tasks, with significant relative improvements; suitable for industrial deployment.
+- <strong>Convenient</strong>: Quickly and easily download Paraformer-large from ModelScope for finetuning and inference.
+    - Supports finetuning and inference on AISHELL-1 and AISHELL-2.
+    - Supports inference on AISHELL-1, AISHELL-2, WenetSpeech, SpeechIO and other audio.
+
+## How to finetune and infer using a pretrained ModelScope Paraformer-large Model
+
+### Finetune
+- Modify the finetuning-related training parameters in `conf/train_asr_paraformer_sanm_50e_16d_2048_512_lfr6.yaml`
+- Set the following parameters in `paraformer_large_finetune.sh`:
+    - <strong>tr_dir:</strong> the AISHELL-2 training data path
+    - <strong>dev_tst_dir:</strong> the AISHELL-2 dev/test data path
+    - <strong>tag:</strong> the experiment tag
+    - <strong>init_model_name:</strong> speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch # pre-trained model, downloaded from ModelScope during finetuning
+- Then run the pipeline to finetune the model downloaded from ModelScope and run inference with the finetuned model:
+```sh
+    bash ./paraformer_large_finetune.sh
+``` 
+
+### Inference
+
+Alternatively, download the model from ModelScope and run inference directly.
+
+- Set the following parameters in `paraformer_large_infer.sh`:
+    - <strong>ori_data:</strong> the AISHELL-2 dev/test raw data path
+    - <strong>data_dir:</strong> the data output directory
+    - <strong>exp_dir:</strong> the result path
+    - <strong>model_name:</strong> speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch # pre-trained model, downloaded from ModelScope
+    - <strong>test_sets:</strong> the names of the test sets
+- Then run the inference pipeline:
+```sh
+    bash ./paraformer_large_infer.sh
+```
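
Both recipe scripts define their defaults first and then source `utils/parse_options.sh`. Assuming that file behaves like Kaldi's option parser (an assumption; its contents are not shown here), each `--name value` pair on the command line overwrites the matching shell variable, so the parameters listed above could be passed as options instead of edited in place. The function below is a simplified, hypothetical stand-in that illustrates the idea; it is not the real parser, and the paths are placeholders.

```sh
# Defaults, as they would appear at the top of a recipe script.
ori_data=
exp_dir=
test_sets="dev test"

# Simplified stand-in for Kaldi-style option parsing (NOT utils/parse_options.sh):
# each "--some-name value" pair overwrites the shell variable "some_name".
parse_opts_sketch() {
    while [ $# -ge 2 ]; do
        case "$1" in
            --*) name=$(printf '%s' "${1#--}" | tr '-' '_')
                 eval "${name}=\$2"
                 shift 2 ;;
            *)   break ;;
        esac
    done
}

parse_opts_sketch --ori_data /path/to/AISHELL-2 --exp_dir ./exp --test_sets "test_ios"
echo "ori_data=${ori_data} exp_dir=${exp_dir} test_sets=${test_sets}"
```

Unlike this sketch, a production parser would normally reject option names that do not correspond to a variable already defined in the script.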
--- a/egs_modelscope/aishell2/paraformer/RESULTS.md
+++ b/egs_modelscope/aishell2/paraformer/RESULTS.md
@ -0,0 +1,26 @@
+# Paraformer-Large
+- Model link: <https://www.modelscope.cn/models/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary>
+- Model size: 220M
+- Train config: conf/train_asr_paraformer_sanm_50e_16d_2048_512_lfr6.yaml
+
+# Environments
+- date: `Tue Nov 22 18:48:39 CST 2022`
+- python version: `3.7.12`
+- FunASR version: `0.1.0`
+- pytorch version: `pytorch 1.7.0`
+- Git hash: ``
+- Commit date: ``
+
+# Benchmark Results
+
+## AISHELL-2
+- Decode config: conf/decode_asr_transformer_noctc_1best.yaml
+  - Decode without CTC
+  - Decode without LM
+
+| testset      | CER(%)|
+|:------------:|:-----:|
+| dev_ios      | 2.80  |
+| test_android | 3.13  |
+| test_ios     | 2.85  |
+| test_mic     | 3.06  |
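
The CER column reports character error rate in percent. The results file does not spell out the formula; assuming the usual edit-distance definition, with $S$ substituted, $D$ deleted, and $I$ inserted characters against a reference of $N$ characters, it is:

```latex
\mathrm{CER} = \frac{S + D + I}{N} \times 100\%
```

Under this definition, a CER of 2.80 on `dev_ios` corresponds to roughly 28 character errors per 1000 reference characters.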
--- a/egs_modelscope/aishell2/paraformer/conf/decode_asr_transformer_noctc_10best_lm_weight_0.15.yaml
+++ b/egs_modelscope/aishell2/paraformer/conf/decode_asr_transformer_noctc_10best_lm_weight_0.15.yaml
@ -0,0 +1,6 @@
+beam_size: 10
+penalty: 0.0
+maxlenratio: 0.0
+minlenratio: 0.0
+ctc_weight: 0.0
+lm_weight: 0.15
--- a/egs_modelscope/aishell2/paraformer/conf/decode_asr_transformer_noctc_1best.yaml
+++ b/egs_modelscope/aishell2/paraformer/conf/decode_asr_transformer_noctc_1best.yaml
@ -0,0 +1,6 @@
+beam_size: 1
+penalty: 0.0
+maxlenratio: 0.0
+minlenratio: 0.0
+ctc_weight: 0.0
+lm_weight: 0.0
--- a/egs_modelscope/aishell2/paraformer/conf/train_asr_paraformer_sanm_50e_16d_2048_512_lfr6.yaml
+++ b/egs_modelscope/aishell2/paraformer/conf/train_asr_paraformer_sanm_50e_16d_2048_512_lfr6.yaml
@ -0,0 +1,91 @@
+# network architecture
+# encoder related
+encoder_conf:
+    dropout_rate: 0.1
+    positional_dropout_rate: 0.1
+    attention_dropout_rate: 0.1
+
+# decoder related
+decoder_conf:
+    dropout_rate: 0.1
+    positional_dropout_rate: 0.1
+    self_attention_dropout_rate: 0.1
+    src_attention_dropout_rate: 0.1
+
+predictor_conf:
+  threshold: 1.0
+  l_order: 1
+  r_order: 1
+  tail_threshold: 0.45
+
+# hybrid CTC/attention
+model_conf:
+    ctc_weight: 0.0
+    lsm_weight: 0.1     # label smoothing option
+    length_normalized_loss: true
+    predictor_weight: 1.0
+    predictor_bias: 1
+    sampling_ratio: 0.75
+
+# minibatch related
+# dataset_type: small
+batch_type: length
+batch_bins: 2000
+num_workers: 16
+# dataset_type: large
+dataset_conf:
+    filter_conf:
+        min_length: 10
+        max_length: 250
+        min_token_length: 1
+        max_token_length: 200
+    shuffle: true
+    shuffle_conf:
+        shuffle_size: 10240
+        sort_size: 500
+    batch_conf:
+        batch_type: 'token'
+        batch_size: 6000
+    num_workers: 16
+
+# optimization related
+accum_grad: 1
+grad_clip: 5
+max_epoch: 20
+val_scheduler_criterion:
+    - valid
+    - acc
+best_model_criterion:
+-   - valid
+    - acc
+    - max
+keep_nbest_models: 10
+
+optim: adam
+optim_conf:
+   lr: 0.0005
+scheduler: warmuplr
+scheduler_conf:
+   warmup_steps: 30000
+
+specaug: specaug_lfr
+specaug_conf:
+    apply_time_warp: false
+    time_warp_window: 5
+    time_warp_mode: bicubic
+    apply_freq_mask: true
+    freq_mask_width_range:
+    - 0
+    - 30
+    lfr_rate: 6
+    num_freq_mask: 1
+    apply_time_mask: true
+    time_mask_width_range:
+    - 0
+    - 12
+    num_time_mask: 1
+
+unused_parameters: true
+log_interval: 50
+normalize: None
+split_with_space: true
--- a/egs_modelscope/aishell2/paraformer/local/aishell2_data_prep.sh
+++ b/egs_modelscope/aishell2/paraformer/local/aishell2_data_prep.sh
@ -0,0 +1,53 @@
+#!/usr/bin/env bash
+# Copyright 2018 AIShell-Foundation(Authors:Jiayu DU, Xingyu NA, Bengu WU, Hao ZHENG)
+#           2018 Beijing Shell Shell Tech. Co. Ltd. (Author: Hui BU)
+# Apache 2.0
+
+# transform raw AISHELL-2 data to kaldi format
+
+. ./path.sh || exit 1;
+
+tmp=
+dir=
+
+if [ $# != 3 ]; then
+  echo "Usage: $0 <corpus-data-dir> <tmp-dir> <output-dir>"
+  echo " $0 /export/AISHELL-2/iOS/train data/local/train data/train"
+  exit 1;
+fi
+
+corpus=$1
+tmp=$2
+dir=$3
+
+echo "$0: Preparing data in $corpus"
+
+mkdir -p $tmp
+mkdir -p $dir
+
+# corpus check
+if [ ! -d $corpus ] || [ ! -f $corpus/wav.scp ] || [ ! -f $corpus/trans.txt ]; then
+  echo "Error: $0 requires wav.scp and trans.txt under $corpus directory."
+  exit 1;
+fi
+
+# validate utt-key list, IC0803W0380 is a bad utterance
+awk '{print $1}' $corpus/wav.scp | grep -v 'IC0803W0380' > $tmp/wav_utt.list
+awk '{print $1}' $corpus/trans.txt > $tmp/trans_utt.list
+utils/filter_scp.pl -f 1 $tmp/wav_utt.list $tmp/trans_utt.list > $tmp/utt.list
+
+# wav.scp
+awk -F'\t' -v path_prefix=$corpus '{printf("%s\t%s/%s\n",$1,path_prefix,$2)}' $corpus/wav.scp > $tmp/tmp_wav.scp
+utils/filter_scp.pl -f 1 $tmp/utt.list $tmp/tmp_wav.scp | sort -k 1 | uniq > $tmp/wav.scp
+
+# text
+utils/filter_scp.pl -f 1 $tmp/utt.list $corpus/trans.txt | sort -k 1 | uniq > $tmp/text
+
+# copy prepared resources from tmp_dir to target dir
+mkdir -p $dir
+for f in wav.scp text; do
+  cp $tmp/$f $dir/$f || exit 1;
+done
+
+echo "$0: AISHELL-2 data preparation succeeded"
+exit 0;
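
`aishell2_data_prep.sh` keeps only utterances that have both audio and a transcript, and drops the known-bad utterance `IC0803W0380`. The sketch below reproduces that intersection on toy data. `filter_by_id` is a hypothetical `awk` stand-in for `utils/filter_scp.pl -f 1 LIST FILE`, whose behavior (keep lines of FILE whose first field is in LIST) is assumed rather than taken from the source.

```sh
# Keep only lines of FILE ($2) whose first field appears in LIST ($1).
# Assumes LIST is non-empty (the NR == FNR idiom needs a first-file line).
filter_by_id() {
    awk 'NR == FNR { keep[$1] = 1; next } ($1 in keep)' "$1" "$2"
}

tmp=$(mktemp -d)
printf 'utt1\tutt1.wav\nutt2\tutt2.wav\nIC0803W0380\tbad.wav\n' > "$tmp/wav.scp"
printf 'utt2\thello\nutt3\tworld\nIC0803W0380\tbad\n' > "$tmp/trans.txt"

# IDs that have audio, minus the known-bad utterance ...
awk '{print $1}' "$tmp/wav.scp" | grep -v 'IC0803W0380' > "$tmp/wav_utt.list"
# ... intersected with IDs that have a transcript.
awk '{print $1}' "$tmp/trans.txt" > "$tmp/trans_utt.list"
filter_by_id "$tmp/wav_utt.list" "$tmp/trans_utt.list" > "$tmp/utt.list"

cat "$tmp/utt.list"
rm -r "$tmp"
```

Only `utt2` survives here: `utt1` has no transcript, `utt3` has no audio, and the bad utterance is removed before the intersection.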
--- a/egs_modelscope/aishell2/paraformer/modelscope_utils
+++ b/egs_modelscope/aishell2/paraformer/modelscope_utils
@ -0,0 +1 @@
+../../common/modelscope_utils
--- a/egs_modelscope/aishell2/paraformer/paraformer_large_finetune.sh
+++ b/egs_modelscope/aishell2/paraformer/paraformer_large_finetune.sh
@ -0,0 +1,239 @@
+#!/usr/bin/env bash
+
+. ./path.sh || exit 1;
+
+# machines configuration
+CUDA_VISIBLE_DEVICES="0,1" # set gpus, e.g., CUDA_VISIBLE_DEVICES="0,1"
+gpu_num=2
+count=1
+gpu_inference=true # Whether to perform gpu decoding, set false for cpu decoding
+njob=4 # the number of jobs for each gpu
+train_cmd=utils/run.pl
+
+# general configuration
+feats_dir="." # feature output directory, for large data
+exp_dir="."
+lang=zh
+dumpdir=dump/fbank
+feats_type=fbank
+token_type=char
+scp=feats.scp
+type=kaldi_ark
+stage=0
+stop_stage=4
+
+# feature configuration
+feats_dim=560
+sample_frequency=16000
+nj=100
+speed_perturb="1.0"
+lfr=True
+lfr_m=7
+lfr_n=6
+
+init_model_name=speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch  # pre-trained model, download from modelscope during fine-tuning
+cmvn_file=init_model/${init_model_name}/am.mvn
+seg_file=init_model/${init_model_name}/seg_dict
+vocab=init_model/${init_model_name}/tokens.txt
+
+# data
+tr_dir=
+dev_tst_dir=
+
+# exp tag
+tag=""
+
+# Set bash to strict mode; the script will exit on:
+# -e 'error', -u 'undefined variable', -o pipefail 'error in pipeline'
+set -e
+set -u
+set -o pipefail
+
+train_set=train
+valid_set=dev_ios
+test_sets="dev_ios test_android test_ios test_mic"
+
+asr_config=conf/train_asr_paraformer_sanm_50e_16d_2048_512_lfr6.yaml
+init_param="init_model/${init_model_name}/${init_model_name}"
+
+inference_config=conf/decode_asr_transformer_noctc_1best.yaml
+inference_asr_model=valid.acc.ave_10best.pth
+
+. utils/parse_options.sh || exit 1;
+
+# download model from modelscope
+python modelscope_utils/download_model.py --model_name ${init_model_name}
+
+if [ ! -d ${HOME}/.cache/modelscope/hub/damo/${init_model_name} ]; then
+    echo "${HOME}/.cache/modelscope/hub/damo/${init_model_name} must exist"
+    exit 1
+else
+    if [ -d init_model/${init_model_name} ]; then
+        echo "init_model/${init_model_name} already exists. If you want to copy the pre-trained model again, please delete init_model/${init_model_name} first."
+    else
+        mkdir -p init_model/${init_model_name}
+        cp -r ${HOME}/.cache/modelscope/hub/damo/${init_model_name}/* init_model/${init_model_name}
+    fi
+fi
+
+model_dir="baseline_$(basename "${asr_config}" .yaml)_${feats_type}_${lang}_${token_type}_${tag}"
+
+# you can set gpu num for decoding here
+gpuid_list=$CUDA_VISIBLE_DEVICES  # set gpus for decoding, the same as training stage by default
+ngpu=$(echo $gpuid_list | awk -F "," '{print NF}')
+inference_nj=$[${ngpu}*${njob}]
+
+if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
+    echo "stage 0: Data preparation"
+    # For training set
+    local/aishell2_data_prep.sh ${tr_dir} ${feats_dir}/data/local/train ${feats_dir}/data/train || exit 1;
+    # For dev and test sets
+    for x in Android iOS Mic; do
+        local/aishell2_data_prep.sh ${dev_tst_dir}/${x}/dev ${feats_dir}/data/local/dev_${x,,} ${feats_dir}/data/dev_${x,,} || exit 1;
+        local/aishell2_data_prep.sh ${dev_tst_dir}/${x}/test ${feats_dir}/data/local/test_${x,,} ${feats_dir}/data/test_${x,,} || exit 1;
+    done
+    # Normalize text to lowercase letters and remove spaces
+    for x in train dev_android dev_ios dev_mic test_android test_ios test_mic; do
+        mv ${feats_dir}/data/${x}/text ${feats_dir}/data/${x}/text.org
+        paste -d " " <(cut -f 1 ${feats_dir}/data/${x}/text.org) <(cut -f 2- ${feats_dir}/data/${x}/text.org \
+             | tr 'A-Z' 'a-z' | tr -d " ") \
+            > ${feats_dir}/data/${x}/text
+        rm ${feats_dir}/data/${x}/text.org
+    done
+fi
+
+feat_train_dir=${feats_dir}/${dumpdir}/${train_set}; mkdir -p ${feat_train_dir}
+feat_dev_dir=${feats_dir}/${dumpdir}/${valid_set}; mkdir -p ${feat_dev_dir}
+if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
+    echo "Feature Generation"
+    # compute fbank features
+    fbankdir=${feats_dir}/fbank
+    utils/compute_fbank.sh --cmd "$train_cmd" --nj $nj --speed_perturb ${speed_perturb} \
+        ${feats_dir}/data/train ${exp_dir}/exp/make_fbank/train ${fbankdir}/train
+    utils/fix_data_feat.sh ${fbankdir}/train
+    for x in android ios mic; do
+        utils/compute_fbank.sh --cmd "$train_cmd" --nj $nj \
+            ${feats_dir}/data/dev_${x} ${exp_dir}/exp/make_fbank/dev_${x} ${fbankdir}/dev_${x}
+        utils/fix_data_feat.sh ${fbankdir}/dev_${x}
+        utils/compute_fbank.sh --cmd "$train_cmd" --nj $nj \
+            ${feats_dir}/data/test_${x} ${exp_dir}/exp/make_fbank/test_${x} ${fbankdir}/test_${x}
+        utils/fix_data_feat.sh ${fbankdir}/test_${x}
+    done
+
+    echo "apply low_frame_rate and cmvn"
+    [ ! -f ${cmvn_file} ] && echo "$0: cmvn file is required" && exit 1;
+    utils/apply_lfr_and_cmvn.sh --cmd "$train_cmd" --nj $nj \
+        --lfr $lfr --lfr-m $lfr_m --lfr-n $lfr_n \
+        ${fbankdir}/${train_set} ${cmvn_file} ${exp_dir}/exp/make_fbank/train ${feat_train_dir}
+    utils/apply_lfr_and_cmvn.sh --cmd "$train_cmd" --nj $nj \
+        --lfr $lfr --lfr-m $lfr_m --lfr-n $lfr_n \
+        ${fbankdir}/${valid_set} ${cmvn_file} ${exp_dir}/exp/make_fbank/dev ${feat_dev_dir}
+    for x in android ios mic; do
+        feat_test_dir=${feats_dir}/${dumpdir}/test_${x}; mkdir -p ${feat_test_dir}
+        utils/apply_lfr_and_cmvn.sh --cmd "$train_cmd" --nj $nj \
+            --lfr $lfr --lfr-m $lfr_m --lfr-n $lfr_n \
+            ${fbankdir}/test_${x} ${cmvn_file} ${exp_dir}/exp/make_fbank/test_${x} ${feat_test_dir}
+    done
+
+    echo "Text Tokenize"
+    # 我爱reading->我 爱 read@@ ing
+    utils/text_tokenize.sh --cmd "$train_cmd" --nj $nj ${fbankdir}/${train_set} ${seg_file} ${feat_train_dir}/log ${feat_train_dir}
+    utils/fix_data_feat.sh ${feat_train_dir}
+    utils/text_tokenize.sh --cmd "$train_cmd" --nj $nj ${fbankdir}/${valid_set} ${seg_file} ${feat_dev_dir}/log ${feat_dev_dir}
+    utils/fix_data_feat.sh ${feat_dev_dir}
+    for x in android ios mic; do
+      feat_test_dir=${feats_dir}/${dumpdir}/test_${x} 
+      cp ${fbankdir}/test_${x}/text  ${feat_test_dir}
+    done
+fi
+
+token_list=${feats_dir}/data/${lang}_token_list/char/tokens.txt
+echo "dictionary: ${token_list}"
+if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
+    echo "stage 2: Dictionary Preparation"
+    mkdir -p ${feats_dir}/data/${lang}_token_list/char/
+    cp $vocab ${token_list}
+
+    vocab_size=$(wc -l <${token_list})
+    awk -v v=,${vocab_size} '{print $0v}' ${feat_train_dir}/text_shape > ${feat_train_dir}/text_shape.char
+    awk -v v=,${vocab_size} '{print $0v}' ${feat_dev_dir}/text_shape > ${feat_dev_dir}/text_shape.char
+    mkdir -p ${feats_dir}/asr_stats_fbank_zh_char/train
+    mkdir -p ${feats_dir}/asr_stats_fbank_zh_char/dev_ios
+    cp ${feat_train_dir}/speech_shape ${feat_train_dir}/text_shape ${feat_train_dir}/text_shape.char ${feats_dir}/asr_stats_fbank_zh_char/${train_set}
+    cp ${feat_dev_dir}/speech_shape ${feat_dev_dir}/text_shape ${feat_dev_dir}/text_shape.char ${feats_dir}/asr_stats_fbank_zh_char/${valid_set}
+fi
+
+# Training Stage
+world_size=$gpu_num  # run on one machine
+if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
+    # update asr train config.yaml
+    python modelscope_utils/update_config.py --modelscope_config init_model/${init_model_name}/asr_train_config.yaml --finetune_config ${asr_config} --output_config init_model/${init_model_name}/asr_finetune_config.yaml
+    finetune_config=init_model/${init_model_name}/asr_finetune_config.yaml
+
+    mkdir -p ${exp_dir}/exp/${model_dir}
+    mkdir -p ${exp_dir}/exp/${model_dir}/log
+    INIT_FILE=$exp_dir/ddp_init
+    if [ -f $INIT_FILE ];then
+        rm -f $INIT_FILE
+    fi
+    init_method=file://$(readlink -f $INIT_FILE)
+    echo "$0: init method is $init_method"
+    for ((i = 0; i < $gpu_num; ++i)); do
+        {
+            rank=$i
+            local_rank=$i
+            gpu_id=$(echo $CUDA_VISIBLE_DEVICES | cut -d',' -f$[$i+1])
+            asr_train_paraformer.py \
+                --gpu_id $gpu_id \
+                --use_preprocessor true \
+                --token_type $token_type \
+                --token_list $token_list \
+                --train_data_path_and_name_and_type ${feats_dir}/${dumpdir}/${train_set}/${scp},speech,${type} \
+                --train_data_path_and_name_and_type ${feats_dir}/${dumpdir}/${train_set}/text,text,text \
+                --train_shape_file ${feats_dir}/asr_stats_fbank_zh_char/${train_set}/speech_shape \
+                --train_shape_file ${feats_dir}/asr_stats_fbank_zh_char/${train_set}/text_shape.char \
+                --valid_data_path_and_name_and_type ${feats_dir}/${dumpdir}/${valid_set}/${scp},speech,${type} \
+                --valid_data_path_and_name_and_type ${feats_dir}/${dumpdir}/${valid_set}/text,text,text \
+                --valid_shape_file ${feats_dir}/asr_stats_fbank_zh_char/${valid_set}/speech_shape \
+                --valid_shape_file ${feats_dir}/asr_stats_fbank_zh_char/${valid_set}/text_shape.char  \
+                --resume true \
+                --output_dir ${exp_dir}/exp/${model_dir} \
+                --init_param $init_param \
+                --config $finetune_config \
+                --input_size $feats_dim \
+                --ngpu $gpu_num \
+                --num_worker_count $count \
+                --multiprocessing_distributed true \
+                --dist_init_method $init_method \
+                --dist_world_size $world_size \
+                --dist_rank $rank \
+                --local_rank $local_rank 1> ${exp_dir}/exp/${model_dir}/log/train.log.$i 2>&1
+        } &
+    done
+    wait
+fi
+
+# Testing Stage
+if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
+    ./utils/easy_asr_infer.sh \
+        --lang zh \
+        --datadir ${feats_dir} \
+        --feats_type ${feats_type} \
+        --feats_dim ${feats_dim} \
+        --token_type ${token_type} \
+        --gpu_inference ${gpu_inference} \
+        --inference_config "${inference_config}" \
+        --test_sets "${test_sets}" \
+        --token_list $token_list \
+        --asr_exp ${exp_dir}/exp/${model_dir} \
+        --stage 12 \
+        --stop_stage 12 \
+        --scp $scp \
+        --text text \
+        --inference_nj $inference_nj \
+        --njob $njob \
+        --inference_asr_model $inference_asr_model \
+        --gpuid_list $gpuid_list \
+        --mode paraformer
+fi
+
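Note on the stage gating used above: each step runs only when `stage <= N <= stop_stage`. A minimal sketch of that pattern follows; the helper name `should_run` is hypothetical and does not appear in the commit.

```sh
# Hypothetical helper: succeed only when stage <= $1 <= stop_stage.
should_run() {
    [ "${stage}" -le "$1" ] && [ "${stop_stage}" -ge "$1" ]
}

stage=3
stop_stage=4
for s in 1 2 3 4; do
    if should_run "$s"; then
        echo "stage $s: run"
    else
        echo "stage $s: skip"
    fi
done
```

With `stage=3` and `stop_stage=4`, only the training and testing stages would run, which is how a finished feature-extraction step can be skipped on a rerun.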
--- a/egs_modelscope/aishell2/paraformer/paraformer_large_infer.sh
+++ b/egs_modelscope/aishell2/paraformer/paraformer_large_infer.sh
@ -0,0 +1,56 @@
+#!/usr/bin/env bash
+
+set -e
+set -u
+set -o pipefail
+
+ori_data=
+data_dir=
+exp_dir=
+model_name=speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch
+inference_nj=32
+gpuid_list="0,1" # set gpus, e.g., gpuid_list="0,1"
+ngpu=$(echo $gpuid_list | awk -F "," '{print NF}')
+njob=4  # the number of jobs for each gpu
+gpu_inference=true  # Whether to perform gpu decoding, set false for cpu decoding
+
+if ${gpu_inference}; then
+    inference_nj=$((ngpu * njob))
+else
+    inference_nj=$njob
+fi
+
+use_lm=false
+beam_size=1
+lm_weight=0.0
+
+test_sets="dev_ios test_android test_ios test_mic"
+
+. utils/parse_options.sh
+
+for x in Android iOS Mic; do
+    local/aishell2_data_prep.sh ${ori_data}/${x}/dev ${data_dir}/aishell2/local/dev_${x,,} ${data_dir}/aishell2/dev_${x,,} || exit 1;
+    local/aishell2_data_prep.sh ${ori_data}/${x}/test ${data_dir}/aishell2/local/test_${x,,} ${data_dir}/aishell2/test_${x,,} || exit 1;
+done
+for x in dev_android dev_ios dev_mic test_android test_ios test_mic; do
+    mv ${data_dir}/aishell2/${x}/text ${data_dir}/aishell2/${x}/text.org
+    paste -d " " <(cut -f 1 ${data_dir}/aishell2/${x}/text.org) <(cut -f 2- ${data_dir}/aishell2/${x}/text.org \
+        | tr 'A-Z' 'a-z' | tr -d " ") \
+       > ${data_dir}/aishell2/${x}/text
+    rm ${data_dir}/aishell2/${x}/text.org
+done
+
+mkdir -p ${exp_dir}/aishell2
+
+modelscope_utils/modelscope_infer.sh \
+        --data_dir ${data_dir}/aishell2 \
+        --exp_dir ${exp_dir}/aishell2 \
+        --test_sets "${test_sets}" \
+        --model_name ${model_name} \
+        --inference_nj ${inference_nj} \
+        --gpuid_list ${gpuid_list} \
+        --njob ${njob} \
+        --gpu_inference ${gpu_inference} \
+        --use_lm ${use_lm} \
+        --beam_size ${beam_size} \
+        --lm_weight ${lm_weight}
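Note on the transcript normalization in `paraformer_large_infer.sh`: the `paste`/`cut`/`tr` pipeline keeps the utterance ID, lowercases ASCII letters, and removes spaces from the transcript. The sketch below reproduces it on a toy file. It uses temporary files instead of process substitution so that it also runs under a plain POSIX shell; the utterance IDs and text are made up.

```sh
workdir=$(mktemp -d)
# Toy transcript file: <utt_id><TAB><text>
printf 'utt001\tHELLO 世界 World\nutt002\tA B C\n' > "${workdir}/text.org"

# Keep column 1; lowercase ASCII letters and drop spaces in the remaining columns.
cut -f 1 "${workdir}/text.org" > "${workdir}/keys"
cut -f 2- "${workdir}/text.org" | tr 'A-Z' 'a-z' | tr -d " " > "${workdir}/vals"
paste -d " " "${workdir}/keys" "${workdir}/vals" > "${workdir}/text"

cat "${workdir}/text"
```

Because `cut -f` splits on tabs by default, this assumes the ID and the transcript are tab-separated, as the original pipeline does.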
--- a/egs_modelscope/aishell2/paraformer/path.sh
+++ b/egs_modelscope/aishell2/paraformer/path.sh
@ -0,0 +1,5 @@
+export FUNASR_DIR=$PWD/../../..
+
+# NOTE(kan-bayashi): Use UTF-8 in Python to avoid UnicodeDecodeError when LC_ALL=C
+export PYTHONIOENCODING=UTF-8
+export PATH=$FUNASR_DIR/funasr/bin:$PATH
--- a/egs_modelscope/aishell2/paraformer/utils
+++ b/egs_modelscope/aishell2/paraformer/utils
@ -0,0 +1 @@
+../../../egs/aishell/tranformer/utils/
--- a/egs_modelscope/common/README.md
+++ b/egs_modelscope/common/README.md
@ -0,0 +1,27 @@
+# ModelScope Model
+
+## How to finetune and infer using a pretrained ModelScope Model
+
+### Finetune
+- Modify the finetuning-related training parameters in `conf/train_asr_paraformer_sanm_50e_16d_2048_512_lfr6.yaml`
+- Set the parameters in `modelscope_common_finetune.sh`
+    - <strong>dataset:</strong> the dataset directory, which must contain train/wav.scp and train/text; dev/wav.scp, dev/text, test/wav.scp, and test/text are optional
+    - <strong>tag:</strong> the experiment tag
+    - <strong>init_model_name:</strong> speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch # pre-trained model, downloaded from ModelScope during fine-tuning
+- Then run the pipeline to finetune the model downloaded from ModelScope:
+```sh
+    sh ./modelscope_common_finetune.sh
+``` 
+
+### Inference
+
+Or you can use the finetuned model for inference directly.
+
+- Set the parameters in `modelscope_common_infer.sh`
+    - <strong>data_dir:</strong> the directory containing the wav list, `${data_dir}/wav.scp`
+    - <strong>exp_dir:</strong> the output directory for the results
+    - <strong>model_name:</strong> speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch # pre-trained model, downloaded from ModelScope
+- Then run the inference pipeline:
+```sh
+    sh ./modelscope_common_infer.sh
+```
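Note on the dataset layout the README asks for: the sketch below builds a toy dataset directory and checks it for the required and optional files. It assumes Kaldi-style two-column files (`<utt_id> <wav path>` in `wav.scp`, `<utt_id> <transcript>` in `text`). The helper name `check_dataset`, the utterance IDs, and the paths are hypothetical.

```sh
# Hypothetical helper: verify that a dataset dir has the files the README lists.
check_dataset() {
    _ds=$1
    for _f in train/wav.scp train/text; do
        if [ ! -f "${_ds}/${_f}" ]; then
            echo "missing required file: ${_ds}/${_f}" >&2
            return 1
        fi
    done
    # dev/ and test/ are optional according to the README.
    for _f in dev/wav.scp dev/text test/wav.scp test/text; do
        [ -f "${_ds}/${_f}" ] || echo "optional file not found: ${_ds}/${_f}"
    done
    return 0
}

# Build a toy dataset (IDs and paths are made up for illustration).
dataset=$(mktemp -d)
mkdir -p "${dataset}/train"
printf '%s\n' "utt001 /path/to/utt001.wav" "utt002 /path/to/utt002.wav" > "${dataset}/train/wav.scp"
printf '%s\n' "utt001 今天天气很好" "utt002 我爱reading" > "${dataset}/train/text"

check_dataset "${dataset}" && echo "dataset OK"
```

When `dev/` is absent, the finetune script in this commit carves a dev subset out of `train/`, so only the two training files are strictly required.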
--- a/egs_modelscope/common/conf/decode_asr_transformer_noctc_10best_lm_weight_0.15.yaml
+++ b/egs_modelscope/common/conf/decode_asr_transformer_noctc_10best_lm_weight_0.15.yaml
@ -0,0 +1,6 @@
+beam_size: 10
+penalty: 0.0
+maxlenratio: 0.0
+minlenratio: 0.0
+ctc_weight: 0.0
+lm_weight: 0.15
--- a/egs_modelscope/common/conf/decode_asr_transformer_noctc_1best.yaml
+++ b/egs_modelscope/common/conf/decode_asr_transformer_noctc_1best.yaml
@ -0,0 +1,6 @@
+beam_size: 1
+penalty: 0.0
+maxlenratio: 0.0
+minlenratio: 0.0
+ctc_weight: 0.0
+lm_weight: 0.0
--- a/egs_modelscope/common/conf/train_asr_paraformer_sanm_50e_16d_2048_512_lfr6.yaml
+++ b/egs_modelscope/common/conf/train_asr_paraformer_sanm_50e_16d_2048_512_lfr6.yaml
@ -0,0 +1,91 @@
+# network architecture
+# encoder related
+encoder_conf:
+    dropout_rate: 0.1
+    positional_dropout_rate: 0.1
+    attention_dropout_rate: 0.1
+
+# decoder related
+decoder_conf:
+    dropout_rate: 0.1
+    positional_dropout_rate: 0.1
+    self_attention_dropout_rate: 0.1
+    src_attention_dropout_rate: 0.1
+
+predictor_conf:
+  threshold: 1.0
+  l_order: 1
+  r_order: 1
+  tail_threshold: 0.45
+
+# hybrid CTC/attention
+model_conf:
+    ctc_weight: 0.0
+    lsm_weight: 0.1     # label smoothing option
+    length_normalized_loss: true
+    predictor_weight: 1.0
+    predictor_bias: 1
+    sampling_ratio: 0.75
+
+# minibatch related
+# dataset_type: small
+batch_type: length
+batch_bins: 2000
+num_workers: 16
+# dataset_type: large
+dataset_conf:
+    filter_conf:
+        min_length: 10
+        max_length: 250
+        min_token_length: 1
+        max_token_length: 200
+    shuffle: true
+    shuffle_conf:
+        shuffle_size: 10240
+        sort_size: 500
+    batch_conf:
+        batch_type: 'token'
+        batch_size: 6000
+    num_workers: 16
+
+# optimization related
+accum_grad: 1
+grad_clip: 5
+max_epoch: 20
+val_scheduler_criterion:
+    - valid
+    - acc
+best_model_criterion:
+-   - valid
+    - acc
+    - max
+keep_nbest_models: 10
+
+optim: adam
+optim_conf:
+   lr: 0.0005
+scheduler: warmuplr
+scheduler_conf:
+   warmup_steps: 30000
+
+specaug: specaug_lfr
+specaug_conf:
+    apply_time_warp: false
+    time_warp_window: 5
+    time_warp_mode: bicubic
+    apply_freq_mask: true
+    freq_mask_width_range:
+    - 0
+    - 30
+    lfr_rate: 6
+    num_freq_mask: 1
+    apply_time_mask: true
+    time_mask_width_range:
+    - 0
+    - 12
+    num_time_mask: 1
+
+unused_parameters: true
+log_interval: 50
+normalize: None
+split_with_space: true
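Note on `scheduler: warmuplr` with `lr: 0.0005` and `warmup_steps: 30000`: the sketch below tabulates the learning rate under the assumption that the scheduler follows the ESPnet-style WarmupLR rule, `lr(step) = base_lr * warmup^0.5 * min(step^-0.5, step * warmup^-1.5)`. This commit does not confirm that formula, and the helper name `warmup_lr` is hypothetical.

```sh
# Assumed rule (ESPnet-style WarmupLR), not confirmed by this commit:
#   lr(step) = base_lr * sqrt(warmup) * min(1/sqrt(step), step / warmup^1.5)
warmup_lr() {
    awk -v lr="$1" -v w="$2" -v s="$3" 'BEGIN {
        a = 1 / sqrt(s)          # decay branch
        b = s / (w * sqrt(w))    # linear warm-up branch
        m = (a < b) ? a : b
        printf "%.6g\n", lr * sqrt(w) * m
    }'
}

for step in 1 15000 30000 120000; do
    echo "step ${step}: lr=$(warmup_lr 0.0005 30000 "${step}")"
done
```

Under this assumption the rate peaks at the configured `lr` exactly at `warmup_steps` and then decays with the inverse square root of the step count.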
--- a/egs_modelscope/common/modelscope_common_finetune.sh
+++ b/egs_modelscope/common/modelscope_common_finetune.sh
@ -0,0 +1,230 @@
+#!/usr/bin/env bash
+
+. ./path.sh || exit 1;
+
+# machines configuration
+CUDA_VISIBLE_DEVICES="0,1" # set gpus, e.g., CUDA_VISIBLE_DEVICES="0,1"
+gpu_num=2
+count=1
+gpu_inference=true # Whether to perform gpu decoding, set false for cpu decoding
+njob=4 # the number of jobs for each gpu
+train_cmd=utils/run.pl
+
+# general configuration
+feats_dir="." # feature output directory, for large data
+exp_dir="."
+lang=zh
+dumpdir=dump/fbank
+feats_type=fbank
+token_type=char
+scp=feats.scp
+type=kaldi_ark
+stage=1
+stop_stage=4
+
+# feature configuration
+feats_dim=560
+sample_frequency=16000
+nj=32
+speed_perturb="1.0"
+lfr=True
+lfr_m=7
+lfr_n=6
+
+init_model_name=speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch  # pre-trained model, downloaded from ModelScope during fine-tuning
+cmvn_file=init_model/${init_model_name}/am.mvn
+seg_file=init_model/${init_model_name}/seg_dict
+vocab=init_model/${init_model_name}/tokens.txt
+
+# data
+dataset=  # dataset (include train/wav.scp, train/text, dev/wav.scp, dev/text, optional test/wav.scp test/text)
+
+# exp tag
+tag=""
+
+# Run bash in strict mode; the script exits on:
+# -e 'error', -u 'undefined variable', -o pipefail 'error in pipeline'
+set -e
+set -u
+set -o pipefail
+
+train_set=train
+valid_set=dev
+test_sets="dev test"
+
+asr_config=conf/train_asr_paraformer_sanm_50e_16d_2048_512_lfr6.yaml
+init_param="init_model/${init_model_name}/${init_model_name}"
+
+inference_config=conf/decode_asr_transformer_noctc_1best.yaml
+inference_asr_model=valid.acc.ave_10best.pth
+
+. utils/parse_options.sh || exit 1;
+
+# download model from modelscope
+python modelscope_utils/download_model.py --model_name ${init_model_name}
+
+if [ ! -d ${HOME}/.cache/modelscope/hub/damo/${init_model_name} ]; then
+    echo "${HOME}/.cache/modelscope/hub/damo/${init_model_name} must exist"
+    exit 1
+else
+    if [ -d init_model/${init_model_name} ]; then
+        echo "init_model/${init_model_name} already exists. If you want to copy the model again, please delete init_model/${init_model_name} first."
+    else
+        mkdir -p init_model/${init_model_name}
+        cp -r ${HOME}/.cache/modelscope/hub/damo/${init_model_name}/* init_model/${init_model_name}
+    fi
+fi
+
+model_dir="baseline_$(basename "${asr_config}" .yaml)_${feats_type}_${lang}_${token_type}_${tag}"
+
+# you can set gpu num for decoding here
+gpuid_list=$CUDA_VISIBLE_DEVICES  # set gpus for decoding, the same as training stage by default
+ngpu=$(echo $gpuid_list | awk -F "," '{print NF}')
+inference_nj=$((ngpu * njob))
+
+[ ! -d "${dataset}" ] && echo "$0: Training data is required" && exit 1;
+{ [ ! -f "${dataset}/train/wav.scp" ] || [ ! -f "${dataset}/train/text" ]; } && echo "$0: Training data wav.scp or text is not found" && exit 1;
+
+if [ ! -d "${dataset}/dev" ]; then
+    utils/fix_data.sh ${dataset}/train
+    utils/subset_data_dir_tr_cv.sh --dev-num-utt 1000 ${dataset}/train ${dataset}
+fi
+if [ ! -d "${dataset}/test" ]; then
+   test_sets="dev" 
+fi
+
+feat_train_dir=${feats_dir}/${dumpdir}/train; mkdir -p ${feat_train_dir}
+feat_dev_dir=${feats_dir}/${dumpdir}/dev; mkdir -p ${feat_dev_dir}
+feat_test_dir=${feats_dir}/${dumpdir}/test; mkdir -p ${feat_test_dir}
+
+if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
+    echo "Feature Generation"
+    # compute fbank features
+    fbankdir=${feats_dir}/fbank
+    utils/compute_fbank.sh --cmd "$train_cmd" --nj $nj --speed_perturb ${speed_perturb} \
+        ${dataset}/train ${exp_dir}/exp/make_fbank/train ${fbankdir}/train
+    utils/fix_data_feat.sh ${fbankdir}/train
+    utils/compute_fbank.sh --cmd "$train_cmd" --nj $nj \
+        ${dataset}/dev ${exp_dir}/exp/make_fbank/dev ${fbankdir}/dev
+    utils/fix_data_feat.sh ${fbankdir}/dev
+    if [ -d "${dataset}/test" ]; then
+        utils/compute_fbank.sh --cmd "$train_cmd" --nj $nj \
+            ${dataset}/test ${exp_dir}/exp/make_fbank/test ${fbankdir}/test
+        utils/fix_data_feat.sh ${fbankdir}/test
+    fi
+
+    echo "apply low_frame_rate and cmvn"
+    [ ! -f ${cmvn_file} ] && echo "$0: cmvn file is required" && exit 1;
+    utils/apply_lfr_and_cmvn.sh --cmd "$train_cmd" --nj $nj \
+        --lfr $lfr --lfr-m $lfr_m --lfr-n $lfr_n \
+        ${fbankdir}/train ${cmvn_file} ${exp_dir}/exp/make_fbank/train ${feat_train_dir}
+    utils/apply_lfr_and_cmvn.sh --cmd "$train_cmd" --nj $nj \
+        --lfr $lfr --lfr-m $lfr_m --lfr-n $lfr_n \
+        ${fbankdir}/dev ${cmvn_file} ${exp_dir}/exp/make_fbank/dev ${feat_dev_dir}
+    if [ -d "${dataset}/test" ]; then
+        utils/apply_lfr_and_cmvn.sh --cmd "$train_cmd" --nj $nj \
+            --lfr $lfr --lfr-m $lfr_m --lfr-n $lfr_n \
+            ${fbankdir}/test ${cmvn_file} ${exp_dir}/exp/make_fbank/test ${feat_test_dir}
+    fi
+
+    echo "Text Tokenize"
+    # e.g. 我爱reading -> 我 爱 read@@ ing (Chinese is split into characters, English into subword units)
+    utils/text_tokenize.sh --cmd "$train_cmd" --nj $nj ${fbankdir}/train ${seg_file} ${feat_train_dir}/log ${feat_train_dir}
+    utils/fix_data_feat.sh ${feat_train_dir}
+    utils/text_tokenize.sh --cmd "$train_cmd" --nj $nj ${fbankdir}/dev ${seg_file} ${feat_dev_dir}/log ${feat_dev_dir}
+    utils/fix_data_feat.sh ${feat_dev_dir}
+    if [ -d "${dataset}/test" ]; then
+        cp ${fbankdir}/test/text ${feat_test_dir}
+    fi
+fi
+
+token_list=${feats_dir}/data/${lang}_token_list/char/tokens.txt
+echo "dictionary: ${token_list}"
+if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
+    echo "stage 2: Dictionary Preparation"
+    mkdir -p ${feats_dir}/data/${lang}_token_list/char/
+    cp $vocab ${token_list}
+
+    vocab_size=$(wc -l <${token_list})
+    awk -v v=,${vocab_size} '{print $0v}' ${feat_train_dir}/text_shape > ${feat_train_dir}/text_shape.char
+    awk -v v=,${vocab_size} '{print $0v}' ${feat_dev_dir}/text_shape > ${feat_dev_dir}/text_shape.char
+    mkdir -p ${feats_dir}/asr_stats_fbank_zh_char/train
+    mkdir -p ${feats_dir}/asr_stats_fbank_zh_char/dev
+    cp ${feat_train_dir}/speech_shape ${feat_train_dir}/text_shape ${feat_train_dir}/text_shape.char ${feats_dir}/asr_stats_fbank_zh_char/train
+    cp ${feat_dev_dir}/speech_shape ${feat_dev_dir}/text_shape ${feat_dev_dir}/text_shape.char ${feats_dir}/asr_stats_fbank_zh_char/dev
+fi
+
+# Training Stage
+world_size=$gpu_num  # run on one machine
+if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
+    # update asr train config.yaml
+    python modelscope_utils/update_config.py --modelscope_config init_model/${init_model_name}/asr_train_config.yaml --finetune_config ${asr_config} --output_config init_model/${init_model_name}/asr_finetune_config.yaml
+    finetune_config=init_model/${init_model_name}/asr_finetune_config.yaml
+
+    mkdir -p ${exp_dir}/exp/${model_dir}
+    mkdir -p ${exp_dir}/exp/${model_dir}/log
+    INIT_FILE=$exp_dir/ddp_init
+    if [ -f $INIT_FILE ];then
+        rm -f $INIT_FILE
+    fi
+    init_method=file://$(readlink -f $INIT_FILE)
+    echo "$0: init method is $init_method"
+    for ((i = 0; i < $gpu_num; ++i)); do
+        {
+            rank=$i
+            local_rank=$i
+            gpu_id=$(echo $CUDA_VISIBLE_DEVICES | cut -d',' -f$((i + 1)))
+            asr_train_paraformer.py \
+                --gpu_id $gpu_id \
+                --use_preprocessor true \
+                --token_type $token_type \
+                --token_list $token_list \
+                --train_data_path_and_name_and_type ${feats_dir}/${dumpdir}/${train_set}/${scp},speech,${type} \
+                --train_data_path_and_name_and_type ${feats_dir}/${dumpdir}/${train_set}/text,text,text \
+                --train_shape_file ${feats_dir}/asr_stats_fbank_zh_char/${train_set}/speech_shape \
+                --train_shape_file ${feats_dir}/asr_stats_fbank_zh_char/${train_set}/text_shape.char \
+                --valid_data_path_and_name_and_type ${feats_dir}/${dumpdir}/${valid_set}/${scp},speech,${type} \
+                --valid_data_path_and_name_and_type ${feats_dir}/${dumpdir}/${valid_set}/text,text,text \
+                --valid_shape_file ${feats_dir}/asr_stats_fbank_zh_char/${valid_set}/speech_shape \
+                --valid_shape_file ${feats_dir}/asr_stats_fbank_zh_char/${valid_set}/text_shape.char  \
+                --resume true \
+                --output_dir ${exp_dir}/exp/${model_dir} \
+                --init_param $init_param \
+                --config $finetune_config \
+                --input_size $feats_dim \
+                --ngpu $gpu_num \
+                --num_worker_count $count \
+                --multiprocessing_distributed true \
+                --dist_init_method $init_method \
+                --dist_world_size $world_size \
+                --dist_rank $rank \
+                --local_rank $local_rank 1> ${exp_dir}/exp/${model_dir}/log/train.log.$i 2>&1
+        } &
+    done
+    wait
+fi
+
+# Testing Stage
+if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
+    ./utils/easy_asr_infer.sh \
+        --lang zh \
+        --datadir ${feats_dir} \
+        --feats_type ${feats_type} \
+        --feats_dim ${feats_dim} \
+        --token_type ${token_type} \
+        --gpu_inference ${gpu_inference} \
+        --inference_config "${inference_config}" \
+        --test_sets "${test_sets}" \
+        --token_list $token_list \
+        --asr_exp ${exp_dir}/exp/${model_dir} \
+        --stage 12 \
+        --stop_stage 12 \
+        --scp $scp \
+        --text text \
+        --inference_nj $inference_nj \
+        --njob $njob \
+        --inference_asr_model $inference_asr_model \
+        --gpuid_list $gpuid_list \
+        --mode paraformer
+fi
+
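Note on the rank-to-GPU mapping in the training loop: each background job takes the entry of `CUDA_VISIBLE_DEVICES` at position `rank + 1`. The sketch below shows that mapping and the `inference_nj` arithmetic. Unlike the script, which sets `gpu_num` by hand, it derives `gpu_num` from the list; that change, the helper name `rank_to_gpu`, and the example device list are assumptions for illustration.

```sh
CUDA_VISIBLE_DEVICES="2,5,7"   # example value, not taken from the script
njob=4

# Count the comma-separated entries instead of hard-coding gpu_num.
gpu_num=$(echo "${CUDA_VISIBLE_DEVICES}" | awk -F "," '{print NF}')

# Map a rank to the physical GPU id at the same position in the list.
rank_to_gpu() {
    echo "${CUDA_VISIBLE_DEVICES}" | cut -d',' -f"$(( $1 + 1 ))"
}

i=0
while [ "$i" -lt "${gpu_num}" ]; do
    echo "rank $i -> gpu $(rank_to_gpu "$i")"
    i=$((i + 1))
done

# One decoding job per (GPU, njob) pair.
inference_nj=$((gpu_num * njob))
echo "inference_nj=${inference_nj}"
```

Deriving the count from the list avoids the case where `gpu_num` and `CUDA_VISIBLE_DEVICES` disagree, in which `cut` would return an empty GPU id for the extra ranks.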
--- a/egs_modelscope/common/modelscope_common_infer.sh
+++ b/egs_modelscope/common/modelscope_common_infer.sh
@ -0,0 +1,78 @@
+#!/usr/bin/env bash
+
+set -e
+set -u
+set -o pipefail
+
+model_name=speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch  # pre-trained model, downloaded from ModelScope
+data_dir=  # wav list, ${data_dir}/wav.scp
+exp_dir="exp"
+gpuid_list="0,1"
+ngpu=$(echo $gpuid_list | awk -F "," '{print NF}')
+njob=4
+gpu_inference=true
+decode_cmd=utils/run.pl
+
+# LM configs (defined before parse_options.sh so they can be overridden on the command line)
+use_lm=false
+beam_size=1
+lm_weight=0.0
+
+. utils/parse_options.sh
+
+if ${gpu_inference}; then
+    inference_nj=$((ngpu * njob))
+    _ngpu=1
+else
+    inference_nj=${njob}
+    _ngpu=0
+fi
+
+python modelscope_utils/download_model.py \
+          --model_name ${model_name}
+
+if [ -d ${exp_dir} ]; then
+    echo "${exp_dir} already exists. If you want to decode again, please delete ${exp_dir} first."
+    exit 1
+else
+    mkdir -p ${exp_dir}/${model_name}
+    cp -r ${HOME}/.cache/modelscope/hub/damo/${model_name}/* ${exp_dir}/${model_name}/
+    _dir=${exp_dir}/decode_asr
+    _logdir=${_dir}/logdir
+    mkdir -p "${_dir}"
+    mkdir -p "${_logdir}"
+fi
+
+for n in $(seq "${inference_nj}"); do
+    split_scps+=" ${_logdir}/keys.${n}.scp"
+done
+# shellcheck disable=SC2086
+utils/split_scp.pl "${data_dir}/wav.scp" ${split_scps}
+
+if "${use_lm}"; then
+    cp ${exp_dir}/${model_name}/decode_asr_transformer.yaml ${exp_dir}/${model_name}/decode_asr_transformer.yaml.back
+    cp ${exp_dir}/${model_name}/decode_asr_transformer_wav.yaml ${exp_dir}/${model_name}/decode_asr_transformer_wav.yaml.back
+    sed -i "s#beam_size: [0-9]*#beam_size: ${beam_size}#g" ${exp_dir}/${model_name}/decode_asr_transformer.yaml
+    sed -i "s#beam_size: [0-9]*#beam_size: ${beam_size}#g" ${exp_dir}/${model_name}/decode_asr_transformer_wav.yaml
+    sed -i "s#lm_weight: 0.[0-9]*#lm_weight: ${lm_weight}#g" ${exp_dir}/${model_name}/decode_asr_transformer.yaml
+    sed -i "s#lm_weight: 0.[0-9]*#lm_weight: ${lm_weight}#g" ${exp_dir}/${model_name}/decode_asr_transformer_wav.yaml
+fi
+
+echo "Decoding started... log: '${_logdir}/asr_inference.*.log'"
+# shellcheck disable=SC2086
+${decode_cmd} --max-jobs-run "${inference_nj}" JOB=1:"${inference_nj}" "${_logdir}"/asr_inference.JOB.log \
+    python -m funasr.bin.modelscope_infer \
+          --local_model_path ${exp_dir}/${model_name} \
+          --wav_list ${_logdir}/keys.JOB.scp \
+          --output_file ${_logdir}/text.JOB \
+          --gpuid_list ${gpuid_list} \
+          --njob ${njob} \
+          --ngpu ${_ngpu}
+
+for i in $(seq ${inference_nj}); do
+    cat ${_logdir}/text.${i}
+done | sort -k1 >${_dir}/text
+
+if "${use_lm}"; then mv ${exp_dir}/${model_name}/decode_asr_transformer.yaml.back ${exp_dir}/${model_name}/decode_asr_transformer.yaml; fi
+if "${use_lm}"; then mv ${exp_dir}/${model_name}/decode_asr_transformer_wav.yaml.back ${exp_dir}/${model_name}/decode_asr_transformer_wav.yaml; fi
+
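Note on the back-up, patch, and restore handling of the decode configs: the sketch below applies the same `sed` substitutions to a toy config and restores the original through an `EXIT` trap, so the restore also happens if decoding fails part-way. The trap is an illustrative variation rather than what the script does, and the temporary file stands in for `decode_asr_transformer.yaml`.

```sh
cfg=$(mktemp)
printf 'beam_size: 1\nlm_weight: 0.0\n' > "${cfg}"   # toy decode config
beam_size=10
lm_weight=0.15

cp "${cfg}" "${cfg}.back"
# Restore the original config when the shell exits, even after a failure.
trap 'mv -f "${cfg}.back" "${cfg}" 2>/dev/null' EXIT

# Patch from the backup into the live file (avoids the non-portable `sed -i`).
sed -e "s#beam_size: [0-9]*#beam_size: ${beam_size}#g" \
    -e "s#lm_weight: 0\.[0-9]*#lm_weight: ${lm_weight}#g" \
    "${cfg}.back" > "${cfg}"

patched=$(cat "${cfg}")
echo "${patched}"
```

Only creating the backup and registering the trap when the LM options are in use keeps the restore step from failing on a backup that was never written.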