diff --git a/LICENSE b/LICENSE
index 3f0f5bf7a..303aaf82d 100644
--- a/LICENSE
+++ b/LICENSE
@@ -18,4 +18,4 @@ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
 AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
 LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
 OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
-SOFTWARE.
+SOFTWARE.
\ No newline at end of file
diff --git a/README.md b/README.md
new file mode 100644
index 000000000..795a1308e
--- /dev/null
+++ b/README.md
@@ -0,0 +1,78 @@
+<div align="left"><img src="image/funasr_logo.jpg" width="400"/></div>
+
+# FunASR: A Fundamental End-to-End Speech Recognition Toolkit
+
+<strong>FunASR</strong> aims to bridge academic research and industrial applications of speech recognition. By supporting training and fine-tuning of the industrial-grade speech recognition models released on [ModelScope](https://www.modelscope.cn/models?page=1&tasks=auto-speech-recognition), it lets researchers and developers study and deploy speech recognition models more conveniently, and helps grow the speech recognition ecosystem. ASR for Fun!
+
+## Installation (Training and Development)
+
+- Clone the repo:
+``` sh
+git clone https://github.com/alibaba/FunASR.git
+```
+
+- Install Conda:
+``` sh
+wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
+sh Miniconda3-latest-Linux-x86_64.sh
+conda create -n funasr python=3.7
+conda activate funasr
+```
+
+- Install PyTorch (version >= 1.7.0):
+
+| CUDA  | Install command |
+|:-----:| --- |
+|  9.2  | conda install pytorch==1.7.0 torchvision==0.8.0 torchaudio==0.7.0 cudatoolkit=9.2 -c pytorch |
+| 10.2  | conda install pytorch==1.8.0 torchvision==0.9.0 torchaudio==0.8.0 cudatoolkit=10.2 -c pytorch |
+| 11.1  | conda install pytorch==1.8.0 torchvision==0.9.0 torchaudio==0.8.0 cudatoolkit=11.1 -c pytorch |
+
+For more versions, please see https://pytorch.org/get-started/locally/
+
+- Install ModelScope:
+``` sh
+pip install "modelscope[audio]" -f https://modelscope.oss-cn-beijing.aliyuncs.com/releases/repo.html
+```
+
+- Install other packages: 
+
+``` sh
+pip install --editable ./
+```
+
+## Contact
+
+If you have any questions about FunASR, please contact us by
+
+- email: [funasr@list.alibaba-inc.com](mailto:funasr@list.alibaba-inc.com)
+
+- Dingding group:
+<div align="left"><img src="image/dingding.jpg" width="400"/></div>
+
+
+## Acknowledgements
+
+1. We borrowed a lot of code from [Kaldi](http://kaldi-asr.org/) for data preparation.
+2. We borrowed a lot of code from [ESPnet](https://github.com/espnet/espnet). FunASR follows the training and fine-tuning pipelines of ESPnet.
+3. We referred to [Wenet](https://github.com/wenet-e2e/wenet) when building the dataloader for large-scale training.
+
+## License
+This project is licensed under [The MIT License](https://opensource.org/licenses/MIT). FunASR also contains various third-party components and some code modified from other repos under other open source licenses.
+
+## Citations
+
+``` bibtex
+@inproceedings{gao2020universal,
+  title={Universal ASR: Unifying Streaming and Non-Streaming ASR Using a Single Encoder-Decoder Model},
+  author={Gao, Zhifu and Zhang, Shiliang and Lei, Ming and McLoughlin, Ian},
+  booktitle={arXiv preprint arXiv:2010.14099},
+  year={2020}
+}
+
+@inproceedings{gao2022paraformer,
+  title={Paraformer: Fast and Accurate Parallel Transformer for Non-autoregressive End-to-End Speech Recognition},
+  author={Gao, Zhifu and Zhang, Shiliang and McLoughlin, Ian and Yan, Zhijie},
+  booktitle={INTERSPEECH},
+  year={2022}
+}
+```
diff --git a/egs/aishell/conformer/README.md b/egs/aishell/conformer/README.md
new file mode 100644
index 000000000..a67b183ed
--- /dev/null
+++ b/egs/aishell/conformer/README.md
@@ -0,0 +1,17 @@
+
+# Conformer Result
+
+## Training Config
+- Feature info: 80-dim fbank, global CMVN, speed perturbation (0.9, 1.0, 1.1), SpecAugment
+- Train info: lr 5e-4, batch_bins 25000, 2 GPUs (Tesla V100), accum_grad 1, 50 epochs
+- Train config: conf/train_asr_conformer.yaml
+- LM config: LM was not used
+- Model size: 46M
+
+## Results (CER)
+- Decode config: conf/decode_asr_transformer.yaml (ctc weight:0.5)
+
+|   testset   | CER(%)  |
+|:-----------:|:-------:|
+|     dev     |  4.42   |
+|    test     |  4.87   |
\ No newline at end of file
diff --git a/egs/aishell/conformer/conf/decode_asr_transformer.yaml b/egs/aishell/conformer/conf/decode_asr_transformer.yaml
new file mode 100644
index 000000000..a147fa79d
--- /dev/null
+++ b/egs/aishell/conformer/conf/decode_asr_transformer.yaml
@@ -0,0 +1,6 @@
+beam_size: 10
+penalty: 0.0
+maxlenratio: 0.0
+minlenratio: 0.0
+ctc_weight: 0.5
+lm_weight: 0.7
diff --git a/egs/aishell/conformer/conf/train_asr_conformer.yaml b/egs/aishell/conformer/conf/train_asr_conformer.yaml
new file mode 100644
index 000000000..ddf217ec0
--- /dev/null
+++ b/egs/aishell/conformer/conf/train_asr_conformer.yaml
@@ -0,0 +1,80 @@
+# network architecture
+# encoder related
+encoder: conformer
+encoder_conf:
+    output_size: 256    # dimension of attention
+    attention_heads: 4
+    linear_units: 2048  # the number of units of position-wise feed forward
+    num_blocks: 12      # the number of encoder blocks
+    dropout_rate: 0.1
+    positional_dropout_rate: 0.1
+    attention_dropout_rate: 0.0
+    input_layer: conv2d # encoder architecture type
+    normalize_before: true
+    pos_enc_layer_type: rel_pos
+    selfattention_layer_type: rel_selfattn
+    activation_type: swish
+    macaron_style: true
+    use_cnn_module: true
+    cnn_module_kernel: 15
+
+# decoder related
+decoder: transformer
+decoder_conf:
+    attention_heads: 4
+    linear_units: 2048
+    num_blocks: 6
+    dropout_rate: 0.1
+    positional_dropout_rate: 0.1
+    self_attention_dropout_rate: 0.0
+    src_attention_dropout_rate: 0.0
+
+# hybrid CTC/attention
+model_conf:
+    ctc_weight: 0.3
+    lsm_weight: 0.1     # label smoothing option
+    length_normalized_loss: false
+
+# minibatch related
+batch_type: length
+batch_bins: 25000
+num_workers: 16
+
+# optimization related
+accum_grad: 1
+grad_clip: 5
+max_epoch: 50
+val_scheduler_criterion:
+    - valid
+    - acc
+best_model_criterion:
+-   - valid
+    - acc
+    - max
+keep_nbest_models: 10
+
+optim: adam
+optim_conf:
+   lr: 0.0005
+scheduler: warmuplr
+scheduler_conf:
+   warmup_steps: 30000
+
+specaug: specaug
+specaug_conf:
+    apply_time_warp: true
+    time_warp_window: 5
+    time_warp_mode: bicubic
+    apply_freq_mask: true
+    freq_mask_width_range:
+    - 0
+    - 30
+    num_freq_mask: 2
+    apply_time_mask: true
+    time_mask_width_range:
+    - 0
+    - 40
+    num_time_mask: 2
+
+log_interval: 50
+normalize: None
diff --git a/egs/aishell/conformer/local/aishell_data_prep.sh b/egs/aishell/conformer/local/aishell_data_prep.sh
new file mode 100755
index 000000000..83f489b3c
--- /dev/null
+++ b/egs/aishell/conformer/local/aishell_data_prep.sh
@@ -0,0 +1,66 @@
+#!/bin/bash
+
+# Copyright 2017 Xingyu Na
+# Apache 2.0
+
+#. ./path.sh || exit 1;
+
+if [ $# != 3 ]; then
+  echo "Usage: $0 <audio-path> <text-path> <output-path>"
+  echo " $0 /export/a05/xna/data/data_aishell/wav /export/a05/xna/data/data_aishell/transcript data"
+  exit 1;
+fi
+
+aishell_audio_dir=$1
+aishell_text=$2/aishell_transcript_v0.8.txt
+output_dir=$3
+
+train_dir=$output_dir/data/local/train
+dev_dir=$output_dir/data/local/dev
+test_dir=$output_dir/data/local/test
+tmp_dir=$output_dir/data/local/tmp
+
+mkdir -p $train_dir
+mkdir -p $dev_dir
+mkdir -p $test_dir
+mkdir -p $tmp_dir
+
+# data directory check
+if [ ! -d $aishell_audio_dir ] || [ ! -f $aishell_text ]; then
+  echo "Error: $0 requires an existing audio directory and transcript file ($aishell_text)"
+  exit 1;
+fi
+
+# find wav audio file for train, dev and test resp.
+find $aishell_audio_dir -iname "*.wav" > $tmp_dir/wav.flist
+n=`cat $tmp_dir/wav.flist | wc -l`
+[ $n -ne 141925 ] && \
+  echo Warning: expected 141925 data files, found $n
+
+grep -i "wav/train" $tmp_dir/wav.flist > $train_dir/wav.flist || exit 1;
+grep -i "wav/dev" $tmp_dir/wav.flist > $dev_dir/wav.flist || exit 1;
+grep -i "wav/test" $tmp_dir/wav.flist > $test_dir/wav.flist || exit 1;
+
+rm -r $tmp_dir
+
+# Transcriptions preparation
+for dir in $train_dir $dev_dir $test_dir; do
+  echo Preparing $dir transcriptions
+  sed -e 's/\.wav//' $dir/wav.flist | awk -F '/' '{print $NF}' > $dir/utt.list
+  paste -d' ' $dir/utt.list $dir/wav.flist > $dir/wav.scp_all
+  utils/filter_scp.pl -f 1 $dir/utt.list $aishell_text > $dir/transcripts.txt
+  awk '{print $1}' $dir/transcripts.txt > $dir/utt.list
+  utils/filter_scp.pl -f 1 $dir/utt.list $dir/wav.scp_all | sort -u > $dir/wav.scp
+  sort -u $dir/transcripts.txt > $dir/text
+done
+
+mkdir -p $output_dir/data/train $output_dir/data/dev $output_dir/data/test
+
+for f in wav.scp text; do
+  cp $train_dir/$f $output_dir/data/train/$f || exit 1;
+  cp $dev_dir/$f $output_dir/data/dev/$f || exit 1;
+  cp $test_dir/$f $output_dir/data/test/$f || exit 1;
+done
+
+echo "$0: AISHELL data preparation succeeded"
+exit 0;
diff --git a/egs/aishell/conformer/local/prepare_data.sh b/egs/aishell/conformer/local/prepare_data.sh
new file mode 100755
index 000000000..77791f9c1
--- /dev/null
+++ b/egs/aishell/conformer/local/prepare_data.sh
@@ -0,0 +1,53 @@
+#!/usr/bin/env bash
+# Copyright 2018 AIShell-Foundation(Authors:Jiayu DU, Xingyu NA, Bengu WU, Hao ZHENG)
+#           2018 Beijing Shell Shell Tech. Co. Ltd. (Author: Hui BU)
+# Apache 2.0
+
+# transform raw AISHELL-2 data to kaldi format
+
+. ./path.sh || exit 1;
+
+tmp=
+dir=
+
+if [ $# != 3 ]; then
+  echo "Usage: $0 <corpus-data-dir> <tmp-dir> <output-dir>"
+  echo " $0 /export/AISHELL-2/iOS/train data/local/train data/train"
+  exit 1;
+fi
+
+corpus=$1
+tmp=$2
+dir=$3
+
+echo "prepare_data.sh: Preparing data in $corpus"
+
+mkdir -p $tmp
+mkdir -p $dir
+
+# corpus check
+if [ ! -d $corpus ] || [ ! -f $corpus/wav.scp ] || [ ! -f $corpus/trans.txt ]; then
+  echo "Error: $0 requires wav.scp and trans.txt under $corpus directory."
+  exit 1;
+fi
+
+# validate utt-key list, IC0803W0380 is a bad utterance
+awk '{print $1}' $corpus/wav.scp | grep -v 'IC0803W0380' > $tmp/wav_utt.list
+awk '{print $1}' $corpus/trans.txt > $tmp/trans_utt.list
+utils/filter_scp.pl -f 1 $tmp/wav_utt.list $tmp/trans_utt.list > $tmp/utt.list
+
+# wav.scp
+awk -F'\t' -v path_prefix=$corpus '{printf("%s\t%s/%s\n",$1,path_prefix,$2)}' $corpus/wav.scp > $tmp/tmp_wav.scp
+utils/filter_scp.pl -f 1 $tmp/utt.list $tmp/tmp_wav.scp | sort -k 1 | uniq > $tmp/wav.scp
+
+# text
+utils/filter_scp.pl -f 1 $tmp/utt.list $corpus/trans.txt | sort -k 1 | uniq > $tmp/text
+
+# copy prepared resources from tmp_dir to target dir
+mkdir -p $dir
+for f in wav.scp text; do
+  cp $tmp/$f $dir/$f || exit 1;
+done
+
+echo "local/prepare_data.sh succeeded"
+exit 0;
diff --git a/egs/aishell/conformer/path.sh b/egs/aishell/conformer/path.sh
new file mode 100755
index 000000000..7972642d0
--- /dev/null
+++ b/egs/aishell/conformer/path.sh
@@ -0,0 +1,5 @@
+export FUNASR_DIR=$PWD/../../..
+
+# NOTE(kan-bayashi): Use UTF-8 in Python to avoid UnicodeDecodeError when LC_ALL=C
+export PYTHONIOENCODING=UTF-8
+export PATH=$FUNASR_DIR/funasr/bin:$PATH
diff --git a/egs/aishell/conformer/run.sh b/egs/aishell/conformer/run.sh
new file mode 100755
index 000000000..16ebc6759
--- /dev/null
+++ b/egs/aishell/conformer/run.sh
@@ -0,0 +1,208 @@
+#!/usr/bin/env bash
+
+. ./path.sh || exit 1;
+
+# machines configuration
+CUDA_VISIBLE_DEVICES="0,1"
+gpu_num=2
+count=1
+gpu_inference=true  # Whether to perform gpu decoding, set false for cpu decoding
+# for gpu decoding, inference_nj=ngpu*njob; for cpu decoding, inference_nj=njob
+njob=8
+train_cmd=utils/run.pl
+
+# general configuration
+feats_dir=".." # feature output directory, for large data
+exp_dir="."
+lang=zh
+dumpdir=dump/fbank
+feats_type=fbank
+token_type=char
+scp=feats.scp
+type=kaldi_ark
+stage=0
+stop_stage=4
+
+# feature configuration
+feats_dim=80
+sample_frequency=16000
+nj=32
+speed_perturb="0.9,1.0,1.1"
+
+# data
+data_aishell=
+
+# exp tag
+tag=""
+
+. utils/parse_options.sh || exit 1;
+
+# Set bash to strict mode; it will exit on:
+# -e 'error', -u 'undefined variable', -o pipefail 'error in pipeline'
+set -e
+set -u
+set -o pipefail
+
+train_set=train
+valid_set=dev
+test_sets="dev test"
+
+asr_config=conf/train_asr_conformer.yaml
+model_dir="baseline_$(basename "${asr_config}" .yaml)_${feats_type}_${lang}_${token_type}_${tag}"
+
+inference_config=conf/decode_asr_transformer.yaml
+inference_asr_model=valid.acc.ave_10best.pth
+
+# you can set gpu num for decoding here
+gpuid_list=$CUDA_VISIBLE_DEVICES  # set gpus for decoding, the same as training stage by default
+ngpu=$(echo $gpuid_list | awk -F "," '{print NF}')
+
+if ${gpu_inference}; then
+    inference_nj=$((ngpu * njob))
+else
+    inference_nj=$njob
+fi
+
+if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
+    echo "stage 0: Data preparation"
+    # Data preparation
+    local/aishell_data_prep.sh ${data_aishell}/data_aishell/wav ${data_aishell}/data_aishell/transcript ${feats_dir}
+    for x in train dev test; do
+        cp ${feats_dir}/data/${x}/text ${feats_dir}/data/${x}/text.org
+        paste -d " " <(cut -f 1 -d" " ${feats_dir}/data/${x}/text.org) <(cut -f 2- -d" " ${feats_dir}/data/${x}/text.org | tr -d " ") \
+            > ${feats_dir}/data/${x}/text
+        utils/text2token.py -n 1 -s 1 ${feats_dir}/data/${x}/text > ${feats_dir}/data/${x}/text.org
+        mv ${feats_dir}/data/${x}/text.org ${feats_dir}/data/${x}/text
+    done
+fi
+
+feat_train_dir=${feats_dir}/${dumpdir}/train; mkdir -p ${feat_train_dir}
+feat_dev_dir=${feats_dir}/${dumpdir}/dev; mkdir -p ${feat_dev_dir}
+feat_test_dir=${feats_dir}/${dumpdir}/test; mkdir -p ${feat_test_dir}
+if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
+    echo "stage 1: Feature Generation"
+    # compute fbank features
+    fbankdir=${feats_dir}/fbank
+    utils/compute_fbank.sh --cmd "$train_cmd" --nj $nj --speed_perturb ${speed_perturb} \
+        ${feats_dir}/data/train ${exp_dir}/exp/make_fbank/train ${fbankdir}/train
+    utils/fix_data_feat.sh ${fbankdir}/train
+    utils/compute_fbank.sh --cmd "$train_cmd" --nj $nj \
+        ${feats_dir}/data/dev ${exp_dir}/exp/make_fbank/dev ${fbankdir}/dev
+    utils/fix_data_feat.sh ${fbankdir}/dev
+    utils/compute_fbank.sh --cmd "$train_cmd" --nj $nj \
+        ${feats_dir}/data/test ${exp_dir}/exp/make_fbank/test ${fbankdir}/test
+    utils/fix_data_feat.sh ${fbankdir}/test
+     
+    # compute global cmvn
+    utils/compute_cmvn.sh --cmd "$train_cmd" --nj $nj \
+        ${fbankdir}/train ${exp_dir}/exp/make_fbank/train
+
+    # apply cmvn 
+    utils/apply_cmvn.sh --cmd "$train_cmd" --nj $nj \
+        ${fbankdir}/train ${fbankdir}/train/cmvn.json ${exp_dir}/exp/make_fbank/train ${feat_train_dir}
+    utils/apply_cmvn.sh --cmd "$train_cmd" --nj $nj \
+        ${fbankdir}/dev ${fbankdir}/train/cmvn.json ${exp_dir}/exp/make_fbank/dev ${feat_dev_dir}
+    utils/apply_cmvn.sh --cmd "$train_cmd" --nj $nj \
+        ${fbankdir}/test ${fbankdir}/train/cmvn.json ${exp_dir}/exp/make_fbank/test ${feat_test_dir}
+    
+    cp ${fbankdir}/train/text ${fbankdir}/train/speech_shape ${fbankdir}/train/text_shape ${feat_train_dir}
+    cp ${fbankdir}/dev/text ${fbankdir}/dev/speech_shape ${fbankdir}/dev/text_shape ${feat_dev_dir}
+    cp ${fbankdir}/test/text ${fbankdir}/test/speech_shape ${fbankdir}/test/text_shape ${feat_test_dir}
+
+    utils/fix_data_feat.sh ${feat_train_dir}
+    utils/fix_data_feat.sh ${feat_dev_dir}
+    utils/fix_data_feat.sh ${feat_test_dir}
+fi
+
+token_list=${feats_dir}/data/${lang}_token_list/char/tokens.txt
+echo "dictionary: ${token_list}"
+if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
+    echo "stage 2: Dictionary Preparation"
+    mkdir -p ${feats_dir}/data/${lang}_token_list/char/
+   
+    echo "make a dictionary"
+    echo "<blank>" > ${token_list}
+    echo "<s>" >> ${token_list}
+    echo "</s>" >> ${token_list}
+    utils/text2token.py -s 1 -n 1 --space "" ${feats_dir}/data/train/text | cut -f 2- -d" " | tr " " "\n" \
+        | sort | uniq | grep -a -v -e '^\s*$' | awk '{print $0}' >> ${token_list}
+    num_token=$(cat ${token_list} | wc -l)
+    echo "<unk>" >> ${token_list}
+    vocab_size=$(cat ${token_list} | wc -l)
+    awk -v v=,${vocab_size} '{print $0v}' ${feat_train_dir}/text_shape > ${feat_train_dir}/text_shape.char
+    awk -v v=,${vocab_size} '{print $0v}' ${feat_dev_dir}/text_shape > ${feat_dev_dir}/text_shape.char
+    mkdir -p ${feats_dir}/asr_stats_fbank_zh_char/train 
+    mkdir -p ${feats_dir}/asr_stats_fbank_zh_char/dev
+    cp ${feat_train_dir}/speech_shape ${feat_train_dir}/text_shape ${feat_train_dir}/text_shape.char ${feats_dir}/asr_stats_fbank_zh_char/train
+    cp ${feat_dev_dir}/speech_shape ${feat_dev_dir}/text_shape ${feat_dev_dir}/text_shape.char ${feats_dir}/asr_stats_fbank_zh_char/dev
+fi
+
+# Training Stage
+world_size=$gpu_num  # run on one machine
+if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
+    mkdir -p ${exp_dir}/exp/${model_dir}
+    mkdir -p ${exp_dir}/exp/${model_dir}/log
+    INIT_FILE=$exp_dir/ddp_init
+    if [ -f $INIT_FILE ];then
+        rm -f $INIT_FILE
+    fi 
+    init_method=file://$(readlink -f $INIT_FILE)
+    echo "$0: init method is $init_method"
+    for ((i = 0; i < $gpu_num; ++i)); do
+        {
+            rank=$i
+            local_rank=$i
+            gpu_id=$(echo $CUDA_VISIBLE_DEVICES | cut -d',' -f$((i + 1)))
+            asr_train.py \
+                --gpu_id $gpu_id \
+                --use_preprocessor true \
+                --token_type char \
+                --token_list $token_list \
+                --train_data_path_and_name_and_type ${feats_dir}/${dumpdir}/${train_set}/${scp},speech,${type} \
+                --train_data_path_and_name_and_type ${feats_dir}/${dumpdir}/${train_set}/text,text,text \
+                --train_shape_file ${feats_dir}/asr_stats_fbank_zh_char/${train_set}/speech_shape \
+                --train_shape_file ${feats_dir}/asr_stats_fbank_zh_char/${train_set}/text_shape.char \
+                --valid_data_path_and_name_and_type ${feats_dir}/${dumpdir}/${valid_set}/${scp},speech,${type} \
+                --valid_data_path_and_name_and_type ${feats_dir}/${dumpdir}/${valid_set}/text,text,text \
+                --valid_shape_file ${feats_dir}/asr_stats_fbank_zh_char/${valid_set}/speech_shape \
+                --valid_shape_file ${feats_dir}/asr_stats_fbank_zh_char/${valid_set}/text_shape.char  \
+                --resume true \
+                --output_dir ${exp_dir}/exp/${model_dir} \
+                --config $asr_config \
+                --input_size $feats_dim \
+                --ngpu $gpu_num \
+                --num_worker_count $count \
+                --multiprocessing_distributed true \
+                --dist_init_method $init_method \
+                --dist_world_size $world_size \
+                --dist_rank $rank \
+                --local_rank $local_rank 1> ${exp_dir}/exp/${model_dir}/log/train.log.$i 2>&1
+        } &
+    done
+    wait
+fi
+
+# Testing Stage
+if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
+    utils/easy_asr_infer.sh \
+        --lang zh \
+        --datadir ${feats_dir} \
+        --feats_type ${feats_type} \
+        --feats_dim ${feats_dim} \
+        --token_type ${token_type} \
+        --gpu_inference ${gpu_inference} \
+        --inference_config "${inference_config}" \
+        --test_sets "${test_sets}" \
+        --token_list $token_list \
+        --asr_exp ${exp_dir}/exp/${model_dir} \
+        --stage 12 \
+        --stop_stage 12 \
+        --scp $scp \
+        --text text \
+        --inference_nj $inference_nj \
+        --njob $njob \
+        --inference_asr_model $inference_asr_model \
+        --gpuid_list $gpuid_list \
+        --mode asr
+fi
+
diff --git a/egs/aishell/conformer/utils b/egs/aishell/conformer/utils
new file mode 120000
index 000000000..40e14f57f
--- /dev/null
+++ b/egs/aishell/conformer/utils
@@ -0,0 +1 @@
+../tranformer/utils
\ No newline at end of file
diff --git a/egs/aishell/paraformer/README.md b/egs/aishell/paraformer/README.md
new file mode 100644
index 000000000..c0385db86
--- /dev/null
+++ b/egs/aishell/paraformer/README.md
@@ -0,0 +1,24 @@
+# Paraformer
+Pretrained model on [ModelScope](https://www.modelscope.cn/home): [speech_paraformer_asr_nat-aishell1-pytorch](https://www.modelscope.cn/models/damo/speech_paraformer_asr_nat-aishell1-pytorch/summary)
+
+## Training Config
+- Feature info: 80-dim fbank, global CMVN, speed perturbation (0.9, 1.0, 1.1), SpecAugment
+- Train info: lr 5e-4, batch_bins 25000, 2 GPUs (Tesla V100), accum_grad 1, 50 epochs
+- Train config: conf/train_asr_paraformer_conformer_12e_6d_2048_256.yaml
+- LM config: LM was not used
+
+## Results (CER)
+
+- Decode config: conf/decode_asr_transformer_noctc_1best.yaml (ctc weight:0.0)
+
+|   testset   | CER(%)  |
+|:-----------:|:-------:|
+|     dev     |  4.66   |
+|    test     |  5.11   |
+
+- Decode config: conf/decode_asr_transformer.yaml (ctc weight:0.5)
+
+|   testset   | CER(%)  |
+|:-----------:|:-------:|
+|     dev     |  4.52   |
+|    test     |  4.94   |
\ No newline at end of file
diff --git a/egs/aishell/paraformer/conf/decode_asr_transformer.yaml b/egs/aishell/paraformer/conf/decode_asr_transformer.yaml
new file mode 100644
index 000000000..a147fa79d
--- /dev/null
+++ b/egs/aishell/paraformer/conf/decode_asr_transformer.yaml
@@ -0,0 +1,6 @@
+beam_size: 10
+penalty: 0.0
+maxlenratio: 0.0
+minlenratio: 0.0
+ctc_weight: 0.5
+lm_weight: 0.7
diff --git a/egs/aishell/paraformer/conf/decode_asr_transformer_noctc_1best.yaml b/egs/aishell/paraformer/conf/decode_asr_transformer_noctc_1best.yaml
new file mode 100644
index 000000000..5436b12e4
--- /dev/null
+++ b/egs/aishell/paraformer/conf/decode_asr_transformer_noctc_1best.yaml
@@ -0,0 +1,6 @@
+beam_size: 1
+penalty: 0.0
+maxlenratio: 0.0
+minlenratio: 0.0
+ctc_weight: 0.0
+lm_weight: 0.15
diff --git a/egs/aishell/paraformer/conf/train_asr_paraformer_conformer_12e_6d_2048_256.yaml b/egs/aishell/paraformer/conf/train_asr_paraformer_conformer_12e_6d_2048_256.yaml
new file mode 100644
index 000000000..b5ab916fc
--- /dev/null
+++ b/egs/aishell/paraformer/conf/train_asr_paraformer_conformer_12e_6d_2048_256.yaml
@@ -0,0 +1,91 @@
+# network architecture
+# encoder related
+encoder: conformer
+encoder_conf:
+    output_size: 256    # dimension of attention
+    attention_heads: 4
+    linear_units: 2048  # the number of units of position-wise feed forward
+    num_blocks: 12      # the number of encoder blocks
+    dropout_rate: 0.1
+    positional_dropout_rate: 0.1
+    attention_dropout_rate: 0.0
+    input_layer: conv2d # encoder architecture type
+    normalize_before: true
+    pos_enc_layer_type: rel_pos
+    selfattention_layer_type: rel_selfattn
+    activation_type: swish
+    macaron_style: true
+    use_cnn_module: true
+    cnn_module_kernel: 15
+
+# decoder related
+decoder: paraformer_decoder_san
+decoder_conf:
+    attention_heads: 4
+    linear_units: 2048
+    num_blocks: 6
+    dropout_rate: 0.1
+    positional_dropout_rate: 0.1
+    self_attention_dropout_rate: 0.0
+    src_attention_dropout_rate: 0.0
+
+model: paraformer
+model_conf:
+    ctc_weight: 0.3
+    lsm_weight: 0.1
+    length_normalized_loss: false
+    predictor_weight: 1.0
+    sampling_ratio: 0.4
+
+# minibatch related
+batch_type: length
+batch_bins: 25000
+num_workers: 16
+
+# optimization related
+accum_grad: 1
+grad_clip: 5
+max_epoch: 50
+val_scheduler_criterion:
+    - valid
+    - acc
+best_model_criterion:
+-   - valid
+    - acc
+    - max
+keep_nbest_models: 10
+
+optim: adam
+optim_conf:
+   lr: 0.0005
+scheduler: warmuplr
+scheduler_conf:
+   warmup_steps: 30000
+
+specaug: specaug
+specaug_conf:
+    apply_time_warp: true
+    time_warp_window: 5
+    time_warp_mode: bicubic
+    apply_freq_mask: true
+    freq_mask_width_range:
+    - 0
+    - 30
+    num_freq_mask: 2
+    apply_time_mask: true
+    time_mask_width_range:
+    - 0
+    - 40
+    num_time_mask: 2
+
+predictor: cif_predictor
+predictor_conf:
+  idim: 256
+  threshold: 1.0
+  l_order: 1
+  r_order: 1
+  tail_threshold: 0.45
+
+
+log_interval: 50
+normalize: None
\ No newline at end of file
diff --git a/egs/aishell/paraformer/conf/train_asr_paraformer_conformer_20e_6d_1280_320.yaml b/egs/aishell/paraformer/conf/train_asr_paraformer_conformer_20e_6d_1280_320.yaml
new file mode 100644
index 000000000..2b5e2d1a3
--- /dev/null
+++ b/egs/aishell/paraformer/conf/train_asr_paraformer_conformer_20e_6d_1280_320.yaml
@@ -0,0 +1,92 @@
+# network architecture
+# encoder related
+encoder: conformer
+encoder_conf:
+    output_size: 320    # dimension of attention
+    attention_heads: 4
+    linear_units: 1280  # the number of units of position-wise feed forward
+    num_blocks: 20      # the number of encoder blocks
+    dropout_rate: 0.1
+    positional_dropout_rate: 0.1
+    attention_dropout_rate: 0.0
+    input_layer: conv2d # encoder architecture type
+    normalize_before: true
+    pos_enc_layer_type: rel_pos
+    selfattention_layer_type: rel_selfattn
+    activation_type: swish
+    macaron_style: true
+    use_cnn_module: true
+    cnn_module_kernel: 15
+
+# decoder related
+decoder: paraformer_decoder_san
+decoder_conf:
+    attention_heads: 4
+    linear_units: 2048
+    num_blocks: 6
+    dropout_rate: 0.1
+    positional_dropout_rate: 0.1
+    self_attention_dropout_rate: 0.0
+    src_attention_dropout_rate: 0.0
+
+# hybrid CTC/attention
+model: paraformer
+model_conf:
+    ctc_weight: 0.3
+    lsm_weight: 0.1     # label smoothing option
+    length_normalized_loss: false
+    predictor_weight: 1.0
+    sampling_ratio: 0.4
+
+# minibatch related
+batch_type: length
+batch_bins: 25000
+num_workers: 16
+
+# optimization related
+accum_grad: 4
+grad_clip: 5
+max_epoch: 50
+val_scheduler_criterion:
+    - valid
+    - acc
+best_model_criterion:
+-   - valid
+    - acc
+    - max
+keep_nbest_models: 10
+
+optim: adam
+optim_conf:
+   lr: 0.0005
+scheduler: warmuplr
+scheduler_conf:
+   warmup_steps: 30000
+
+specaug: specaug
+specaug_conf:
+    apply_time_warp: true
+    time_warp_window: 5
+    time_warp_mode: bicubic
+    apply_freq_mask: true
+    freq_mask_width_range:
+    - 0
+    - 30
+    num_freq_mask: 2
+    apply_time_mask: true
+    time_mask_width_range:
+    - 0
+    - 40
+    num_time_mask: 2
+
+predictor: cif_predictor
+predictor_conf:
+  idim: 320
+  threshold: 1.0
+  l_order: 1
+  r_order: 1
+  tail_threshold: 0.45
+
+
+log_interval: 50
+normalize: None
\ No newline at end of file
diff --git a/egs/aishell/paraformer/conf/train_asr_paraformer_sanm_tf_40e_12d_1280_320_lfr6.yaml b/egs/aishell/paraformer/conf/train_asr_paraformer_sanm_tf_40e_12d_1280_320_lfr6.yaml
new file mode 100644
index 000000000..864350755
--- /dev/null
+++ b/egs/aishell/paraformer/conf/train_asr_paraformer_sanm_tf_40e_12d_1280_320_lfr6.yaml
@@ -0,0 +1,114 @@
+# network architecture
+# encoder related
+encoder: sanm
+encoder_conf:
+    output_size: 320    # dimension of attention
+    attention_heads: 4
+    linear_units: 1280  # the number of units of position-wise feed forward
+    num_blocks: 40      # the number of encoder blocks
+    dropout_rate: 0.1
+    positional_dropout_rate: 0.1
+    attention_dropout_rate: 0.1
+    input_layer: pe # encoder architecture type
+    pos_enc_class: SinusoidalPositionEncoder
+    normalize_before: true
+    kernel_size: 11
+    sanm_shfit: 0
+    selfattention_layer_type: sanm
+
+# decoder related
+decoder: paraformer_decoder_sanm
+decoder_conf:
+    attention_heads: 4
+    linear_units: 1280
+    num_blocks: 12
+    dropout_rate: 0.1
+    positional_dropout_rate: 0.1
+    self_attention_dropout_rate: 0.1
+    src_attention_dropout_rate: 0.1
+    att_layer_num: 6
+    kernel_size: 11
+    sanm_shfit: 0
+
+
+predictor: cif_predictor
+predictor_conf:
+  idim: 320
+  threshold: 1.0
+  l_order: 1
+  r_order: 1
+  tail_threshold: 0.45
+
+# hybrid CTC/attention
+model: paraformer
+model_conf:
+    ctc_weight: 0.0
+    lsm_weight: 0.1     # label smoothing option
+    length_normalized_loss: true
+    predictor_weight: 1.0
+    predictor_bias: 0
+    sampling_ratio: 0.75
+
+
+# minibatch related
+# dataset_type: small
+batch_type: length
+batch_bins: 6000
+num_workers: 16
+# dataset_type: large
+dataset_conf:
+    filter_conf:
+        min_length: 10
+        max_length: 250
+        min_token_length: 1
+        max_token_length: 200
+    shuffle: true
+    shuffle_conf:
+        shuffle_size: 10240
+        sort_size: 500
+    batch_conf:
+        batch_type: token
+        batch_size: 6000
+    num_workers: 16
+
+# optimization related
+accum_grad: 1
+grad_clip: 5
+max_epoch: 20
+val_scheduler_criterion:
+    - valid
+    - acc
+best_model_criterion:
+-   - valid
+    - acc
+    - max
+keep_nbest_models: 5
+
+optim: adam
+optim_conf:
+   lr: 0.0005
+scheduler: warmuplr
+scheduler_conf:
+   warmup_steps: 15000
+
+specaug: specaug_lfr
+specaug_conf:
+    apply_time_warp: false
+    time_warp_window: 5
+    time_warp_mode: bicubic
+    apply_freq_mask: true
+    freq_mask_width_range:
+    - 0
+    - 30
+    lfr_rate: 6
+    num_freq_mask: 1
+    apply_time_mask: true
+    time_mask_width_range:
+    - 0
+    - 12
+    num_time_mask: 1
+
+unused_parameters: true
+log_interval: 50
+normalize: None
+split_with_space: true
\ No newline at end of file
diff --git a/egs/aishell/paraformer/conf/train_asr_paraformer_sanm_tf_50e_16d_2048_512_lfr6.yaml b/egs/aishell/paraformer/conf/train_asr_paraformer_sanm_tf_50e_16d_2048_512_lfr6.yaml
new file mode 100644
index 000000000..67983b379
--- /dev/null
+++ b/egs/aishell/paraformer/conf/train_asr_paraformer_sanm_tf_50e_16d_2048_512_lfr6.yaml
@@ -0,0 +1,114 @@
+# network architecture
+# encoder related
+encoder: sanm
+encoder_conf:
+    output_size: 512    # dimension of attention
+    attention_heads: 4
+    linear_units: 2048  # the number of units of position-wise feed forward
+    num_blocks: 50      # the number of encoder blocks
+    dropout_rate: 0.1
+    positional_dropout_rate: 0.1
+    attention_dropout_rate: 0.1
+    input_layer: pe # encoder architecture type
+    pos_enc_class: SinusoidalPositionEncoder
+    normalize_before: true
+    kernel_size: 11
+    sanm_shfit: 0
+    selfattention_layer_type: sanm
+
+# decoder related
+decoder: paraformer_decoder_sanm
+decoder_conf:
+    attention_heads: 4
+    linear_units: 2048
+    num_blocks: 16
+    dropout_rate: 0.1
+    positional_dropout_rate: 0.1
+    self_attention_dropout_rate: 0.1
+    src_attention_dropout_rate: 0.1
+    att_layer_num: 16
+    kernel_size: 11
+    sanm_shfit: 0
+
+
+predictor: cif_predictor_v2
+predictor_conf:
+  idim: 512
+  threshold: 1.0
+  l_order: 1
+  r_order: 1
+  tail_threshold: 0.45
+
+# hybrid CTC/attention
+model: paraformer
+model_conf:
+    ctc_weight: 0.0
+    lsm_weight: 0.1     # label smoothing option
+    length_normalized_loss: true
+    predictor_weight: 1.0
+    predictor_bias: 1
+    sampling_ratio: 0.75
+
+
+# minibatch related
+# dataset_type: small
+batch_type: length
+batch_bins: 10000
+num_workers: 16
+# dataset_type: large
+dataset_conf:
+    filter_conf:
+        min_length: 10
+        max_length: 250
+        min_token_length: 1
+        max_token_length: 200
+    shuffle: true
+    shuffle_conf:
+        shuffle_size: 10240
+        sort_size: 500
+    batch_conf:
+        batch_type: 'token'
+        batch_size: 6000
+    num_workers: 16
+
+# optimization related
+accum_grad: 1
+grad_clip: 5
+max_epoch: 20
+val_scheduler_criterion:
+    - valid
+    - acc
+best_model_criterion:
+-   - valid
+    - acc
+    - max
+keep_nbest_models: 5
+
+optim: adam
+optim_conf:
+   lr: 0.0005
+scheduler: warmuplr
+scheduler_conf:
+   warmup_steps: 30000
+
+specaug: specaug_lfr
+specaug_conf:
+    apply_time_warp: false
+    time_warp_window: 5
+    time_warp_mode: bicubic
+    apply_freq_mask: true
+    freq_mask_width_range:
+    - 0
+    - 30
+    lfr_rate: 6
+    num_freq_mask: 1
+    apply_time_mask: true
+    time_mask_width_range:
+    - 0
+    - 12
+    num_time_mask: 1
+
+unused_parameters: true
+log_interval: 50
+normalize: None
+split_with_space: true
diff --git a/egs/aishell/paraformer/conf/train_asr_paraformerbert_conformer_12e_6d_2048_256.yaml b/egs/aishell/paraformer/conf/train_asr_paraformerbert_conformer_12e_6d_2048_256.yaml
new file mode 100644
index 000000000..f369b3d19
--- /dev/null
+++ b/egs/aishell/paraformer/conf/train_asr_paraformerbert_conformer_12e_6d_2048_256.yaml
@@ -0,0 +1,99 @@
+# network architecture
+# encoder related
+encoder: conformer
+encoder_conf:
+    output_size: 256    # dimension of attention
+    attention_heads: 4
+    linear_units: 2048  # the number of units of position-wise feed forward
+    num_blocks: 12      # the number of encoder blocks
+    dropout_rate: 0.1
+    positional_dropout_rate: 0.1
+    attention_dropout_rate: 0.0
+    input_layer: conv2d # encoder architecture type
+    normalize_before: true
+    pos_enc_layer_type: rel_pos
+    selfattention_layer_type: rel_selfattn
+    activation_type: swish
+    macaron_style: true
+    use_cnn_module: true
+    cnn_module_kernel: 15
+
+# decoder related
+decoder: paraformer_decoder_san
+decoder_conf:
+    attention_heads: 4
+    linear_units: 2048
+    num_blocks: 6
+    dropout_rate: 0.1
+    positional_dropout_rate: 0.1
+    self_attention_dropout_rate: 0.0
+    src_attention_dropout_rate: 0.0
+
+# hybrid CTC/attention
+model: paraformer_bert
+model_conf:
+    ctc_weight: 0.3
+    lsm_weight: 0.1     # label smoothing option
+    length_normalized_loss: false
+    predictor_weight: 1.0
+    sampling_ratio: 0.4
+    embeds_id: 3
+    embed_dims: 768
+    embeds_loss_weight: 2.0
+
+
+
+# minibatch related
+#batch_type: length
+#batch_bins: 40000
+batch_type: numel
+batch_bins: 2000000
+num_workers: 16
+
+# optimization related
+accum_grad: 4
+grad_clip: 5
+max_epoch: 50
+val_scheduler_criterion:
+    - valid
+    - acc
+best_model_criterion:
+-   - valid
+    - acc
+    - max
+keep_nbest_models: 10
+
+optim: adam
+optim_conf:
+   lr: 0.0005
+scheduler: warmuplr
+scheduler_conf:
+   warmup_steps: 30000
+
+specaug: specaug
+specaug_conf:
+    apply_time_warp: true
+    time_warp_window: 5
+    time_warp_mode: bicubic
+    apply_freq_mask: true
+    freq_mask_width_range:
+    - 0
+    - 30
+    num_freq_mask: 2
+    apply_time_mask: true
+    time_mask_width_range:
+    - 0
+    - 40
+    num_time_mask: 2
+
+predictor: cif_predictor
+predictor_conf:
+  idim: 256
+  threshold: 1.0
+  l_order: 1
+  r_order: 1
+  tail_threshold: 0.45
+
+
+log_interval: 50
+normalize: None
\ No newline at end of file
diff --git a/egs/aishell/paraformer/local/aishell_data_prep.sh b/egs/aishell/paraformer/local/aishell_data_prep.sh
new file mode 100755
index 000000000..83f489b3c
--- /dev/null
+++ b/egs/aishell/paraformer/local/aishell_data_prep.sh
@@ -0,0 +1,66 @@
+#!/bin/bash
+
+# Copyright 2017 Xingyu Na
+# Apache 2.0
+
+#. ./path.sh || exit 1;
+
+if [ $# != 3 ]; then
+  echo "Usage: $0 <audio-path> <text-path> <output-path>"
+  echo " $0 /export/a05/xna/data/data_aishell/wav /export/a05/xna/data/data_aishell/transcript data"
+  exit 1;
+fi
+
+aishell_audio_dir=$1
+aishell_text=$2/aishell_transcript_v0.8.txt
+output_dir=$3
+
+train_dir=$output_dir/data/local/train
+dev_dir=$output_dir/data/local/dev
+test_dir=$output_dir/data/local/test
+tmp_dir=$output_dir/data/local/tmp
+
+mkdir -p $train_dir
+mkdir -p $dev_dir
+mkdir -p $test_dir
+mkdir -p $tmp_dir
+
+# data directory check
+if [ ! -d $aishell_audio_dir ] || [ ! -f $aishell_text ]; then
+  echo "Error: $0 requires an existing audio directory and transcript file"
+  exit 1;
+fi
+
+# find wav audio file for train, dev and test resp.
+find $aishell_audio_dir -iname "*.wav" > $tmp_dir/wav.flist
+n=`cat $tmp_dir/wav.flist | wc -l`
+[ $n -ne 141925 ] && \
+  echo Warning: expected 141925 data files, found $n
+
+grep -i "wav/train" $tmp_dir/wav.flist > $train_dir/wav.flist || exit 1;
+grep -i "wav/dev" $tmp_dir/wav.flist > $dev_dir/wav.flist || exit 1;
+grep -i "wav/test" $tmp_dir/wav.flist > $test_dir/wav.flist || exit 1;
+
+rm -r $tmp_dir
+
+# Transcriptions preparation
+for dir in $train_dir $dev_dir $test_dir; do
+  echo Preparing $dir transcriptions
+  sed -e 's/\.wav//' $dir/wav.flist | awk -F '/' '{print $NF}' > $dir/utt.list
+  paste -d' ' $dir/utt.list $dir/wav.flist > $dir/wav.scp_all
+  utils/filter_scp.pl -f 1 $dir/utt.list $aishell_text > $dir/transcripts.txt
+  awk '{print $1}' $dir/transcripts.txt > $dir/utt.list
+  utils/filter_scp.pl -f 1 $dir/utt.list $dir/wav.scp_all | sort -u > $dir/wav.scp
+  sort -u $dir/transcripts.txt > $dir/text
+done
+
+mkdir -p $output_dir/data/train $output_dir/data/dev $output_dir/data/test
+
+for f in wav.scp text; do
+  cp $train_dir/$f $output_dir/data/train/$f || exit 1;
+  cp $dev_dir/$f $output_dir/data/dev/$f || exit 1;
+  cp $test_dir/$f $output_dir/data/test/$f || exit 1;
+done
+
+echo "$0: AISHELL data preparation succeeded"
+exit 0;
diff --git a/egs/aishell/paraformer/local/prepare_data.sh b/egs/aishell/paraformer/local/prepare_data.sh
new file mode 100755
index 000000000..77791f9c1
--- /dev/null
+++ b/egs/aishell/paraformer/local/prepare_data.sh
@@ -0,0 +1,53 @@
+#!/usr/bin/env bash
+# Copyright 2018 AIShell-Foundation(Authors:Jiayu DU, Xingyu NA, Bengu WU, Hao ZHENG)
+#           2018 Beijing Shell Shell Tech. Co. Ltd. (Author: Hui BU)
+# Apache 2.0
+
+# transform raw AISHELL-2 data to kaldi format
+
+. ./path.sh || exit 1;
+
+tmp=
+dir=
+
+if [ $# != 3 ]; then
+  echo "Usage: $0 <corpus-data-dir> <tmp-dir> <output-dir>"
+  echo " $0 /export/AISHELL-2/iOS/train data/local/train data/train"
+  exit 1;
+fi
+
+corpus=$1
+tmp=$2
+dir=$3
+
+echo "prepare_data.sh: Preparing data in $corpus"
+
+mkdir -p $tmp
+mkdir -p $dir
+
+# corpus check
+if [ ! -d $corpus ] || [ ! -f $corpus/wav.scp ] || [ ! -f $corpus/trans.txt ]; then
+  echo "Error: $0 requires wav.scp and trans.txt under $corpus directory."
+  exit 1;
+fi
+
+# validate utt-key list, IC0803W0380 is a bad utterance
+awk '{print $1}' $corpus/wav.scp | grep -v 'IC0803W0380' > $tmp/wav_utt.list
+awk '{print $1}' $corpus/trans.txt > $tmp/trans_utt.list
+utils/filter_scp.pl -f 1 $tmp/wav_utt.list $tmp/trans_utt.list > $tmp/utt.list
+
+# wav.scp
+awk -F'\t' -v path_prefix=$corpus '{printf("%s\t%s/%s\n",$1,path_prefix,$2)}' $corpus/wav.scp > $tmp/tmp_wav.scp
+utils/filter_scp.pl -f 1 $tmp/utt.list $tmp/tmp_wav.scp | sort -k 1 | uniq > $tmp/wav.scp
+
+# text
+utils/filter_scp.pl -f 1 $tmp/utt.list $corpus/trans.txt | sort -k 1 | uniq > $tmp/text
+
+# copy prepared resources from tmp_dir to target dir
+mkdir -p $dir
+for f in wav.scp text; do
+  cp $tmp/$f $dir/$f || exit 1;
+done
+
+echo "local/prepare_data.sh succeeded"
+exit 0;
diff --git a/egs/aishell/paraformer/path.sh b/egs/aishell/paraformer/path.sh
new file mode 100755
index 000000000..7972642d0
--- /dev/null
+++ b/egs/aishell/paraformer/path.sh
@@ -0,0 +1,5 @@
+export FUNASR_DIR=$PWD/../../..
+
+# NOTE(kan-bayashi): Use UTF-8 in Python to avoid UnicodeDecodeError when LC_ALL=C
+export PYTHONIOENCODING=UTF-8
+export PATH=$FUNASR_DIR/funasr/bin:$PATH
diff --git a/egs/aishell/paraformer/run.sh b/egs/aishell/paraformer/run.sh
new file mode 100755
index 000000000..bebb646e0
--- /dev/null
+++ b/egs/aishell/paraformer/run.sh
@@ -0,0 +1,208 @@
+#!/usr/bin/env bash
+
+. ./path.sh || exit 1;
+
+# machines configuration
+CUDA_VISIBLE_DEVICES="0,1"
+gpu_num=2
+count=1
+gpu_inference=true  # Whether to perform gpu decoding, set false for cpu decoding
+# for gpu decoding, inference_nj=ngpu*njob; for cpu decoding, inference_nj=njob
+njob=8
+train_cmd=utils/run.pl
+
+# general configuration
+feats_dir=".." # feature output directory, for large data
+exp_dir="."
+lang=zh
+dumpdir=dump/fbank
+feats_type=fbank
+token_type=char
+scp=feats.scp
+type=kaldi_ark
+stage=0
+stop_stage=4
+
+# feature configuration
+feats_dim=80
+sample_frequency=16000
+nj=32
+speed_perturb="0.9,1.0,1.1"
+
+# data
+data_aishell=
+
+# exp tag
+tag=""
+
+. utils/parse_options.sh || exit 1;
+
+# Set bash to strict mode; it will exit on:
+# -e 'error', -u 'undefined variable', -o pipefail 'error in pipeline'
+set -e
+set -u
+set -o pipefail
+
+train_set=train
+valid_set=dev
+test_sets="dev test"
+
+asr_config=conf/train_asr_paraformer_conformer_12e_6d_2048_256.yaml
+model_dir="baseline_$(basename "${asr_config}" .yaml)_${feats_type}_${lang}_${token_type}_${tag}"
+
+inference_config=conf/decode_asr_transformer_noctc_1best.yaml
+inference_asr_model=valid.acc.ave_10best.pth
+
+# you can set gpu num for decoding here
+gpuid_list=$CUDA_VISIBLE_DEVICES  # set gpus for decoding, the same as training stage by default
+ngpu=$(echo $gpuid_list | awk -F "," '{print NF}')
+
+if ${gpu_inference}; then
+    inference_nj=$((ngpu * njob))
+else
+    inference_nj=$njob
+fi
+
+if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
+    echo "stage 0: Data preparation"
+    # Data preparation
+    local/aishell_data_prep.sh ${data_aishell}/data_aishell/wav ${data_aishell}/data_aishell/transcript ${feats_dir}
+    for x in train dev test; do
+        cp ${feats_dir}/data/${x}/text ${feats_dir}/data/${x}/text.org
+        paste -d " " <(cut -f 1 -d" " ${feats_dir}/data/${x}/text.org) <(cut -f 2- -d" " ${feats_dir}/data/${x}/text.org | tr -d " ") \
+            > ${feats_dir}/data/${x}/text
+        utils/text2token.py -n 1 -s 1 ${feats_dir}/data/${x}/text > ${feats_dir}/data/${x}/text.org
+        mv ${feats_dir}/data/${x}/text.org ${feats_dir}/data/${x}/text
+    done
+fi
+
+feat_train_dir=${feats_dir}/${dumpdir}/train; mkdir -p ${feat_train_dir}
+feat_dev_dir=${feats_dir}/${dumpdir}/dev; mkdir -p ${feat_dev_dir}
+feat_test_dir=${feats_dir}/${dumpdir}/test; mkdir -p ${feat_test_dir}
+if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
+    echo "stage 1: Feature Generation"
+    # compute fbank features
+    fbankdir=${feats_dir}/fbank
+    utils/compute_fbank.sh --cmd "$train_cmd" --nj $nj --speed_perturb ${speed_perturb} \
+        ${feats_dir}/data/train ${exp_dir}/exp/make_fbank/train ${fbankdir}/train
+    utils/fix_data_feat.sh ${fbankdir}/train
+    utils/compute_fbank.sh --cmd "$train_cmd" --nj $nj \
+        ${feats_dir}/data/dev ${exp_dir}/exp/make_fbank/dev ${fbankdir}/dev
+    utils/fix_data_feat.sh ${fbankdir}/dev
+    utils/compute_fbank.sh --cmd "$train_cmd" --nj $nj \
+        ${feats_dir}/data/test ${exp_dir}/exp/make_fbank/test ${fbankdir}/test
+    utils/fix_data_feat.sh ${fbankdir}/test
+     
+    # compute global cmvn
+    utils/compute_cmvn.sh --cmd "$train_cmd" --nj $nj \
+        ${fbankdir}/train ${exp_dir}/exp/make_fbank/train
+
+    # apply cmvn 
+    utils/apply_cmvn.sh --cmd "$train_cmd" --nj $nj \
+        ${fbankdir}/train ${fbankdir}/train/cmvn.json ${exp_dir}/exp/make_fbank/train ${feat_train_dir}
+    utils/apply_cmvn.sh --cmd "$train_cmd" --nj $nj \
+        ${fbankdir}/dev ${fbankdir}/train/cmvn.json ${exp_dir}/exp/make_fbank/dev ${feat_dev_dir}
+    utils/apply_cmvn.sh --cmd "$train_cmd" --nj $nj \
+        ${fbankdir}/test ${fbankdir}/train/cmvn.json ${exp_dir}/exp/make_fbank/test ${feat_test_dir}
+    
+    cp ${fbankdir}/train/text ${fbankdir}/train/speech_shape ${fbankdir}/train/text_shape ${feat_train_dir}
+    cp ${fbankdir}/dev/text ${fbankdir}/dev/speech_shape ${fbankdir}/dev/text_shape ${feat_dev_dir}
+    cp ${fbankdir}/test/text ${fbankdir}/test/speech_shape ${fbankdir}/test/text_shape ${feat_test_dir}
+
+    utils/fix_data_feat.sh ${feat_train_dir}
+    utils/fix_data_feat.sh ${feat_dev_dir}
+    utils/fix_data_feat.sh ${feat_test_dir}
+fi
+
+token_list=${feats_dir}/data/${lang}_token_list/char/tokens.txt
+echo "dictionary: ${token_list}"
+if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
+    echo "stage 2: Dictionary Preparation"
+    mkdir -p ${feats_dir}/data/${lang}_token_list/char/
+   
+    echo "make a dictionary"
+    echo "<blank>" > ${token_list}
+    echo "<s>" >> ${token_list}
+    echo "</s>" >> ${token_list}
+    utils/text2token.py -s 1 -n 1 --space "" ${feats_dir}/data/train/text | cut -f 2- -d" " | tr " " "\n" \
+        | sort | uniq | grep -a -v -e '^\s*$' | awk '{print $0}' >> ${token_list}
+    num_token=$(cat ${token_list} | wc -l)
+    echo "<unk>" >> ${token_list}
+    vocab_size=$(cat ${token_list} | wc -l)
+    awk -v v=,${vocab_size} '{print $0v}' ${feat_train_dir}/text_shape > ${feat_train_dir}/text_shape.char
+    awk -v v=,${vocab_size} '{print $0v}' ${feat_dev_dir}/text_shape > ${feat_dev_dir}/text_shape.char
+    mkdir -p ${feats_dir}/asr_stats_fbank_zh_char/train 
+    mkdir -p ${feats_dir}/asr_stats_fbank_zh_char/dev
+    cp ${feat_train_dir}/speech_shape ${feat_train_dir}/text_shape ${feat_train_dir}/text_shape.char ${feats_dir}/asr_stats_fbank_zh_char/train
+    cp ${feat_dev_dir}/speech_shape ${feat_dev_dir}/text_shape ${feat_dev_dir}/text_shape.char ${feats_dir}/asr_stats_fbank_zh_char/dev
+fi
+
+# Training Stage
+world_size=$gpu_num  # run on one machine
+if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
+    mkdir -p ${exp_dir}/exp/${model_dir}
+    mkdir -p ${exp_dir}/exp/${model_dir}/log
+    INIT_FILE=$exp_dir/ddp_init
+    if [ -f $INIT_FILE ];then
+        rm -f $INIT_FILE
+    fi 
+    init_method=file://$(readlink -f $INIT_FILE)
+    echo "$0: init method is $init_method"
+    for ((i = 0; i < $gpu_num; ++i)); do
+        {
+            rank=$i
+            local_rank=$i
+            gpu_id=$(echo $CUDA_VISIBLE_DEVICES | cut -d',' -f$((i + 1)))
+            asr_train_paraformer.py \
+                --gpu_id $gpu_id \
+                --use_preprocessor true \
+                --token_type char \
+                --token_list $token_list \
+                --train_data_path_and_name_and_type ${feats_dir}/${dumpdir}/${train_set}/${scp},speech,${type} \
+                --train_data_path_and_name_and_type ${feats_dir}/${dumpdir}/${train_set}/text,text,text \
+                --train_shape_file ${feats_dir}/asr_stats_fbank_zh_char/${train_set}/speech_shape \
+                --train_shape_file ${feats_dir}/asr_stats_fbank_zh_char/${train_set}/text_shape.char \
+                --valid_data_path_and_name_and_type ${feats_dir}/${dumpdir}/${valid_set}/${scp},speech,${type} \
+                --valid_data_path_and_name_and_type ${feats_dir}/${dumpdir}/${valid_set}/text,text,text \
+                --valid_shape_file ${feats_dir}/asr_stats_fbank_zh_char/${valid_set}/speech_shape \
+                --valid_shape_file ${feats_dir}/asr_stats_fbank_zh_char/${valid_set}/text_shape.char  \
+                --resume true \
+                --output_dir ${exp_dir}/exp/${model_dir} \
+                --config $asr_config \
+                --input_size $feats_dim \
+                --ngpu $gpu_num \
+                --num_worker_count $count \
+                --multiprocessing_distributed true \
+                --dist_init_method $init_method \
+                --dist_world_size $world_size \
+                --dist_rank $rank \
+                --local_rank $local_rank 1> ${exp_dir}/exp/${model_dir}/log/train.log.$i 2>&1
+        } &
+        done
+        wait
+fi
+
+# Testing Stage
+if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
+    utils/easy_asr_infer.sh \
+        --lang zh \
+        --datadir ${feats_dir} \
+        --feats_type ${feats_type} \
+        --feats_dim ${feats_dim} \
+        --token_type ${token_type} \
+        --gpu_inference ${gpu_inference} \
+        --inference_config "${inference_config}" \
+        --test_sets "${test_sets}" \
+        --token_list $token_list \
+        --asr_exp ${exp_dir}/exp/${model_dir} \
+        --stage 12 \
+        --stop_stage 12 \
+        --scp $scp \
+        --text text \
+        --inference_nj $inference_nj \
+        --njob $njob \
+        --inference_asr_model $inference_asr_model \
+        --gpuid_list $gpuid_list \
+        --mode paraformer
+fi
+
diff --git a/egs/aishell/paraformer/utils b/egs/aishell/paraformer/utils
new file mode 120000
index 000000000..40e14f57f
--- /dev/null
+++ b/egs/aishell/paraformer/utils
@@ -0,0 +1 @@
+../tranformer/utils
\ No newline at end of file
diff --git a/egs/aishell/paraformer2/README.md b/egs/aishell/paraformer2/README.md
new file mode 100644
index 000000000..ffce9493e
--- /dev/null
+++ b/egs/aishell/paraformer2/README.md
@@ -0,0 +1,18 @@
+# ParaformerBert + SpecAugment + speed perturbation
+## Environments
+- date: `Mon Nov 21 13:25:30 CST 2022`
+- python version: `3.7.12`
+- FunASR version: `0.1.0`
+- pytorch version: `pytorch 1.7.0`
+
+## Config files
+- train config: conf/train_asr_paraformerbert_conformer_12e_6d_2048_256.yaml
+- model size: 46M
+- lm config: LM was not used
+- decode config: conf/decode_asr_transformer_noctc_1best.yaml (CTC was not used)
+
+## Results (CER)
+|   testset   | CER(%)  |
+|:-----------:|:-------:|
+|     dev     |  4.30   |
+|    test     |  4.80   |
\ No newline at end of file
diff --git a/egs/aishell/paraformer2/conf/decode_asr_transformer.yaml b/egs/aishell/paraformer2/conf/decode_asr_transformer.yaml
new file mode 100644
index 000000000..a147fa79d
--- /dev/null
+++ b/egs/aishell/paraformer2/conf/decode_asr_transformer.yaml
@@ -0,0 +1,6 @@
+beam_size: 10
+penalty: 0.0
+maxlenratio: 0.0
+minlenratio: 0.0
+ctc_weight: 0.5
+lm_weight: 0.7
diff --git a/egs/aishell/paraformer2/conf/decode_asr_transformer_noctc_1best.yaml b/egs/aishell/paraformer2/conf/decode_asr_transformer_noctc_1best.yaml
new file mode 100644
index 000000000..5436b12e4
--- /dev/null
+++ b/egs/aishell/paraformer2/conf/decode_asr_transformer_noctc_1best.yaml
@@ -0,0 +1,6 @@
+beam_size: 1
+penalty: 0.0
+maxlenratio: 0.0
+minlenratio: 0.0
+ctc_weight: 0.0
+lm_weight: 0.15
diff --git a/egs/aishell/paraformer2/conf/train_asr_paraformer_conformer_12e_6d_2048_256.yaml b/egs/aishell/paraformer2/conf/train_asr_paraformer_conformer_12e_6d_2048_256.yaml
new file mode 100644
index 000000000..779c7a913
--- /dev/null
+++ b/egs/aishell/paraformer2/conf/train_asr_paraformer_conformer_12e_6d_2048_256.yaml
@@ -0,0 +1,92 @@
+# network architecture
+# encoder related
+encoder: conformer
+encoder_conf:
+    output_size: 256    # dimension of attention
+    attention_heads: 4
+    linear_units: 2048  # the number of units of position-wise feed forward
+    num_blocks: 12      # the number of encoder blocks
+    dropout_rate: 0.1
+    positional_dropout_rate: 0.1
+    attention_dropout_rate: 0.0
+    input_layer: conv2d # encoder architecture type
+    normalize_before: true
+    pos_enc_layer_type: rel_pos
+    selfattention_layer_type: rel_selfattn
+    activation_type: swish
+    macaron_style: true
+    use_cnn_module: true
+    cnn_module_kernel: 15
+
+# decoder related
+decoder: paraformer_decoder_san
+decoder_conf:
+    attention_heads: 4
+    linear_units: 2048
+    num_blocks: 6
+    dropout_rate: 0.1
+    positional_dropout_rate: 0.1
+    self_attention_dropout_rate: 0.0
+    src_attention_dropout_rate: 0.0
+
+# hybrid CTC/attention
+model: paraformer
+model_conf:
+    ctc_weight: 0.3
+    lsm_weight: 0.1     # label smoothing option
+    length_normalized_loss: false
+    predictor_weight: 1.0
+    sampling_ratio: 0.4
+
+# minibatch related
+batch_type: length
+batch_bins: 25000
+num_workers: 16
+
+# optimization related
+accum_grad: 1
+grad_clip: 5
+max_epoch: 50
+val_scheduler_criterion:
+    - valid
+    - acc
+best_model_criterion:
+-   - valid
+    - acc
+    - max
+keep_nbest_models: 10
+
+optim: adam
+optim_conf:
+   lr: 0.0005
+scheduler: warmuplr
+scheduler_conf:
+   warmup_steps: 30000
+
+specaug: specaug
+specaug_conf:
+    apply_time_warp: true
+    time_warp_window: 5
+    time_warp_mode: bicubic
+    apply_freq_mask: true
+    freq_mask_width_range:
+    - 0
+    - 30
+    num_freq_mask: 2
+    apply_time_mask: true
+    time_mask_width_range:
+    - 0
+    - 40
+    num_time_mask: 2
+
+predictor: cif_predictor
+predictor_conf:
+  idim: 256
+  threshold: 1.0
+  l_order: 1
+  r_order: 1
+  tail_threshold: 0.45
+
+
+log_interval: 50
+normalize: None
\ No newline at end of file
diff --git a/egs/aishell/paraformer2/conf/train_asr_paraformer_conformer_20e_6d_1280_320.yaml b/egs/aishell/paraformer2/conf/train_asr_paraformer_conformer_20e_6d_1280_320.yaml
new file mode 100644
index 000000000..29b9ca6d3
--- /dev/null
+++ b/egs/aishell/paraformer2/conf/train_asr_paraformer_conformer_20e_6d_1280_320.yaml
@@ -0,0 +1,94 @@
+# network architecture
+# encoder related
+encoder: conformer
+encoder_conf:
+    output_size: 320    # dimension of attention
+    attention_heads: 4
+    linear_units: 1280  # the number of units of position-wise feed forward
+    num_blocks: 20      # the number of encoder blocks
+    dropout_rate: 0.1
+    positional_dropout_rate: 0.1
+    attention_dropout_rate: 0.0
+    input_layer: conv2d # encoder architecture type
+    normalize_before: true
+    pos_enc_layer_type: rel_pos
+    selfattention_layer_type: rel_selfattn
+    activation_type: swish
+    macaron_style: true
+    use_cnn_module: true
+    cnn_module_kernel: 15
+
+# decoder related
+decoder: paraformer_decoder_san
+decoder_conf:
+    attention_heads: 4
+    linear_units: 2048
+    num_blocks: 6
+    dropout_rate: 0.1
+    positional_dropout_rate: 0.1
+    self_attention_dropout_rate: 0.0
+    src_attention_dropout_rate: 0.0
+
+# hybrid CTC/attention
+model: paraformer
+model_conf:
+    ctc_weight: 0.3
+    lsm_weight: 0.1     # label smoothing option
+    length_normalized_loss: false
+    predictor_weight: 1.0
+    sampling_ratio: 0.4
+
+# minibatch related
+#batch_type: length
+#batch_bins: 40000
+batch_type: numel
+batch_bins: 2000000
+num_workers: 16
+
+# optimization related
+accum_grad: 4
+grad_clip: 5
+max_epoch: 50
+val_scheduler_criterion:
+    - valid
+    - acc
+best_model_criterion:
+-   - valid
+    - acc
+    - max
+keep_nbest_models: 10
+
+optim: adam
+optim_conf:
+   lr: 0.0005
+scheduler: warmuplr
+scheduler_conf:
+   warmup_steps: 30000
+
+specaug: specaug
+specaug_conf:
+    apply_time_warp: true
+    time_warp_window: 5
+    time_warp_mode: bicubic
+    apply_freq_mask: true
+    freq_mask_width_range:
+    - 0
+    - 30
+    num_freq_mask: 2
+    apply_time_mask: true
+    time_mask_width_range:
+    - 0
+    - 40
+    num_time_mask: 2
+
+predictor: cif_predictor
+predictor_conf:
+  idim: 320
+  threshold: 1.0
+  l_order: 1
+  r_order: 1
+  tail_threshold: 0.45
+
+
+log_interval: 50
+normalize: None
\ No newline at end of file
diff --git a/egs/aishell/paraformer2/conf/train_asr_paraformerbert_conformer_12e_6d_2048_256.yaml b/egs/aishell/paraformer2/conf/train_asr_paraformerbert_conformer_12e_6d_2048_256.yaml
new file mode 100644
index 000000000..7562a49fb
--- /dev/null
+++ b/egs/aishell/paraformer2/conf/train_asr_paraformerbert_conformer_12e_6d_2048_256.yaml
@@ -0,0 +1,100 @@
+# network architecture
+# encoder related
+encoder: conformer
+encoder_conf:
+    output_size: 256    # dimension of attention
+    attention_heads: 4
+    linear_units: 2048  # the number of units of position-wise feed forward
+    num_blocks: 12      # the number of encoder blocks
+    dropout_rate: 0.1
+    positional_dropout_rate: 0.1
+    attention_dropout_rate: 0.0
+    input_layer: conv2d # encoder architecture type
+    normalize_before: true
+    pos_enc_layer_type: rel_pos
+    selfattention_layer_type: rel_selfattn
+    activation_type: swish
+    macaron_style: true
+    use_cnn_module: true
+    cnn_module_kernel: 15
+
+# decoder related
+decoder: paraformer_decoder_san
+decoder_conf:
+    attention_heads: 4
+    linear_units: 2048
+    num_blocks: 6
+    dropout_rate: 0.1
+    positional_dropout_rate: 0.1
+    self_attention_dropout_rate: 0.0
+    src_attention_dropout_rate: 0.0
+
+# hybrid CTC/attention
+model: paraformer_bert
+model_conf:
+    ctc_weight: 0.3
+    lsm_weight: 0.1     # label smoothing option
+    length_normalized_loss: false
+    predictor_weight: 1.0
+    sampling_ratio: 0.4
+    embeds_id: 3
+    embed_dims: 768
+    embeds_loss_weight: 2.0
+
+
+
+# minibatch related
+#batch_type: length
+#batch_bins: 40000
+batch_type: numel
+batch_bins: 2000000
+num_workers: 16
+
+# optimization related
+accum_grad: 4
+grad_clip: 5
+max_epoch: 50
+val_scheduler_criterion:
+    - valid
+    - acc
+best_model_criterion:
+-   - valid
+    - acc
+    - max
+keep_nbest_models: 10
+
+optim: adam
+optim_conf:
+   lr: 0.0005
+scheduler: warmuplr
+scheduler_conf:
+   warmup_steps: 30000
+
+specaug: specaug
+specaug_conf:
+    apply_time_warp: true
+    time_warp_window: 5
+    time_warp_mode: bicubic
+    apply_freq_mask: true
+    freq_mask_width_range:
+    - 0
+    - 30
+    num_freq_mask: 2
+    apply_time_mask: true
+    time_mask_width_range:
+    - 0
+    - 40
+    num_time_mask: 2
+
+predictor: cif_predictor
+predictor_conf:
+  idim: 256
+  threshold: 1.0
+  l_order: 1
+  r_order: 1
+  tail_threshold: 0.45
+
+
+log_interval: 50
+normalize: None
+allow_variable_data_keys: true
\ No newline at end of file
diff --git a/egs/aishell/paraformer2/local/aishell_data_prep.sh b/egs/aishell/paraformer2/local/aishell_data_prep.sh
new file mode 100755
index 000000000..b6ea36b72
--- /dev/null
+++ b/egs/aishell/paraformer2/local/aishell_data_prep.sh
@@ -0,0 +1,65 @@
+#!/bin/bash
+
+# Copyright 2017 Xingyu Na
+# Apache 2.0
+
+#. ./path.sh || exit 1;
+
+if [ $# != 2 ]; then
+  echo "Usage: $0 <audio-path> <text-path>"
+  echo " $0 /export/a05/xna/data/data_aishell/wav /export/a05/xna/data/data_aishell/transcript"
+  exit 1;
+fi
+
+aishell_audio_dir=$1
+aishell_text=$2/aishell_transcript_v0.8.txt
+
+train_dir=data/local/train
+dev_dir=data/local/dev
+test_dir=data/local/test
+tmp_dir=data/local/tmp
+
+mkdir -p $train_dir
+mkdir -p $dev_dir
+mkdir -p $test_dir
+mkdir -p $tmp_dir
+
+# data directory check
+if [ ! -d $aishell_audio_dir ] || [ ! -f $aishell_text ]; then
+  echo "Error: $0 requires two directory arguments"
+  exit 1;
+fi
+
+# find wav audio file for train, dev and test resp.
+find $aishell_audio_dir -iname "*.wav" > $tmp_dir/wav.flist
+n=`cat $tmp_dir/wav.flist | wc -l`
+[ $n -ne 141925 ] && \
+  echo Warning: expected 141925 data files, found $n
+
+grep -i "wav/train" $tmp_dir/wav.flist > $train_dir/wav.flist || exit 1;
+grep -i "wav/dev" $tmp_dir/wav.flist > $dev_dir/wav.flist || exit 1;
+grep -i "wav/test" $tmp_dir/wav.flist > $test_dir/wav.flist || exit 1;
+
+rm -r $tmp_dir
+
+# Transcriptions preparation
+for dir in $train_dir $dev_dir $test_dir; do
+  echo Preparing $dir transcriptions
+  sed -e 's/\.wav//' $dir/wav.flist | awk -F '/' '{print $NF}' > $dir/utt.list
+  paste -d' ' $dir/utt.list $dir/wav.flist > $dir/wav.scp_all
+  utils/filter_scp.pl -f 1 $dir/utt.list $aishell_text > $dir/transcripts.txt
+  awk '{print $1}' $dir/transcripts.txt > $dir/utt.list
+  utils/filter_scp.pl -f 1 $dir/utt.list $dir/wav.scp_all | sort -u > $dir/wav.scp
+  sort -u $dir/transcripts.txt > $dir/text
+done
+
+mkdir -p data/train data/dev data/test
+
+for f in wav.scp text; do
+  cp $train_dir/$f data/train/$f || exit 1;
+  cp $dev_dir/$f data/dev/$f || exit 1;
+  cp $test_dir/$f data/test/$f || exit 1;
+done
+
+echo "$0: AISHELL data preparation succeeded"
+exit 0;
diff --git a/egs/aishell/paraformer2/local/extract_embeds.sh b/egs/aishell/paraformer2/local/extract_embeds.sh
new file mode 100755
index 000000000..6d9939077
--- /dev/null
+++ b/egs/aishell/paraformer2/local/extract_embeds.sh
@@ -0,0 +1,67 @@
+#!/usr/bin/env bash
+
+stage=1
+stop_stage=3
+
+bert_model_root="../../huggingface_models"
+bert_model_name="bert-base-chinese"
+#bert_model_name="chinese-roberta-wwm-ext"
+#bert_model_name="mengzi-bert-base"
+raw_dataset_path=~/Funasr_data/aishell-1
+model_path=${bert_model_root}/${bert_model_name}
+
+. utils/parse_options.sh || exit 1;
+
+nj=32
+
+for data_set in train dev test;do
+    scp=$raw_dataset_path/dump/fbank/${data_set}/text
+    local_scp_dir_raw=$raw_dataset_path/embeds/$bert_model_name/${data_set}
+    local_scp_dir=$local_scp_dir_raw/split$nj
+    local_records_dir=$local_scp_dir_raw/ark
+
+    mkdir -p $local_records_dir
+    mkdir -p $local_scp_dir
+
+    split_scps=""
+    for JOB in $(seq ${nj}); do
+        split_scps="$split_scps $local_scp_dir/data.$JOB.text"
+    done
+
+    utils/split_scp.pl $scp ${split_scps}
+
+
+    for num in {0..7};do
+        tmp=`expr $num \* 4`
+
+        if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
+            for idx in {1..4}; do
+                JOB=`expr $tmp + $idx`
+                echo "process jobid=$JOB"
+                {
+
+                beg=0
+                gpu=`expr $beg + $idx`
+                echo ${local_scp_dir}/log.${JOB}
+                python utils/extract_embeds.py $local_scp_dir/data.$JOB.text ${local_records_dir}/embeds.${JOB}.ark ${local_records_dir}/embeds.${JOB}.scp ${local_records_dir}/embeds.${JOB}.shape ${gpu} ${model_path} &> ${local_scp_dir}/log.${JOB}
+            } &
+            done
+            wait
+        fi
+    done
+
+    if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
+        for JOB in $(seq ${nj}); do
+            cat ${local_records_dir}/embeds.${JOB}.scp || exit 1;
+        done > ${local_scp_dir_raw}/embeds.scp
+
+        sed 's#nfs#data\/volume1#g' ${local_scp_dir_raw}/embeds.scp > ${local_scp_dir_raw}/embeds.scp.pai
+
+        for JOB in $(seq ${nj}); do
+            cat ${local_records_dir}/embeds.${JOB}.shape || exit 1;
+        done > ${local_scp_dir_raw}/embeds.shape
+    fi
+done
+
+echo "embeds are in: ${local_scp_dir_raw}"
+echo "success"
\ No newline at end of file
diff --git a/egs/aishell/paraformer2/path.sh b/egs/aishell/paraformer2/path.sh
new file mode 100755
index 000000000..7972642d0
--- /dev/null
+++ b/egs/aishell/paraformer2/path.sh
@@ -0,0 +1,5 @@
+export FUNASR_DIR=$PWD/../../..
+
+# NOTE(kan-bayashi): Use UTF-8 in Python to avoid UnicodeDecodeError when LC_ALL=C
+export PYTHONIOENCODING=UTF-8
+export PATH=$FUNASR_DIR/funasr/bin:$PATH
diff --git a/egs/aishell/paraformer2/run.sh b/egs/aishell/paraformer2/run.sh
new file mode 100755
index 000000000..15d659c4e
--- /dev/null
+++ b/egs/aishell/paraformer2/run.sh
@@ -0,0 +1,226 @@
+#!/usr/bin/env bash
+
+. ./path.sh || exit 1;
+
+# machines configuration
+CUDA_VISIBLE_DEVICES="0,1"
+gpu_num=2
+count=1
+gpu_inference=true  # Whether to perform gpu decoding, set false for cpu decoding
+# for gpu decoding, inference_nj=ngpu*njob; for cpu decoding, inference_nj=njob
+njob=8
+train_cmd=utils/run.pl
+
+# general configuration
+feats_dir=".." # feature output directory, for large data
+lang=zh
+dumpdir=dump/fbank
+feats_type=fbank
+token_type=char
+scp=feats.scp
+type=kaldi_ark
+stage=0
+stop_stage=4
+
+skip_extract_embed=false
+bert_model_root="../../huggingface_models"
+bert_model_name="bert-base-chinese"
+
+# feature configuration
+feats_dim=80
+sample_frequency=16000
+nj=32
+speed_perturb="0.9,1.0,1.1"
+
+# data
+data_aishell=
+
+# exp tag
+tag=""
+
+. utils/parse_options.sh || exit 1;
+
+# Set bash to strict mode; it will exit on:
+# -e 'error', -u 'undefined variable', -o pipefail 'error in pipeline'
+set -e
+set -u
+set -o pipefail
+
+train_set=train
+valid_set=dev
+test_sets="dev test"
+
+asr_config=conf/train_asr_paraformerbert_conformer_12e_6d_2048_256.yaml
+run_dir="exp"
+model_dir="baseline_$(basename "${asr_config}" .yaml)_${feats_type}_${lang}_${token_type}_${tag}"
+exp_dir=$run_dir/$model_dir
+
+inference_config=conf/decode_asr_transformer.yaml
+inference_asr_model=valid.acc.ave_10best.pth
+
+# you can set gpu num for decoding here
+gpuid_list=$CUDA_VISIBLE_DEVICES  # set gpus for decoding, the same as training stage by default
+ngpu=$(echo $gpuid_list | awk -F "," '{print NF}')
+
+if ${gpu_inference}; then
+    inference_nj=$((ngpu * njob))
+else
+    inference_nj=$njob
+fi
+
+if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
+    echo "stage 0: Data preparation"
+    # Data preparation
+    local/aishell_data_prep.sh ${data_aishell}/data_aishell/wav ${data_aishell}/data_aishell/transcript
+    for x in train dev test; do
+        cp data/${x}/text data/${x}/text.org
+        paste -d " " <(cut -f 1 -d" " data/${x}/text.org) <(cut -f 2- -d" " data/${x}/text.org | tr -d " ") \
+            > data/${x}/text
+        utils/text2token.py -n 1 -s 1 data/${x}/text > data/${x}/text.org
+        mv data/${x}/text.org data/${x}/text
+    done
+fi
+
+feat_train_dir=${feats_dir}/${dumpdir}/train; mkdir -p ${feat_train_dir}
+feat_dev_dir=${feats_dir}/${dumpdir}/dev; mkdir -p ${feat_dev_dir}
+feat_test_dir=${feats_dir}/${dumpdir}/test; mkdir -p ${feat_test_dir}
+if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
+    echo "stage 1: Feature Generation"
+    # compute fbank features
+    fbankdir=${feats_dir}/fbank
+    utils/compute_fbank.sh --cmd "$train_cmd" --nj $nj --speed_perturb ${speed_perturb} \
+        data/train exp/make_fbank/train ${fbankdir}/train
+    utils/fix_data_feat.sh ${fbankdir}/train
+    utils/compute_fbank.sh --cmd "$train_cmd" --nj $nj \
+        data/dev exp/make_fbank/dev ${fbankdir}/dev
+    utils/fix_data_feat.sh ${fbankdir}/dev
+    utils/compute_fbank.sh --cmd "$train_cmd" --nj $nj \
+        data/test exp/make_fbank/test ${fbankdir}/test
+    utils/fix_data_feat.sh ${fbankdir}/test
+     
+    # compute global cmvn
+    utils/compute_cmvn.sh --cmd "$train_cmd" --nj $nj \
+        ${fbankdir}/train exp/make_fbank/train
+
+    # apply cmvn 
+    utils/apply_cmvn.sh --cmd "$train_cmd" --nj $nj \
+        ${fbankdir}/train ${fbankdir}/train/cmvn.json exp/make_fbank/train ${feat_train_dir}
+    utils/apply_cmvn.sh --cmd "$train_cmd" --nj $nj \
+        ${fbankdir}/dev ${fbankdir}/train/cmvn.json exp/make_fbank/dev ${feat_dev_dir}
+    utils/apply_cmvn.sh --cmd "$train_cmd" --nj $nj \
+        ${fbankdir}/test ${fbankdir}/train/cmvn.json exp/make_fbank/test ${feat_test_dir}
+    
+    cp ${fbankdir}/train/text ${fbankdir}/train/speech_shape ${fbankdir}/train/text_shape ${feat_train_dir}
+    cp ${fbankdir}/dev/text ${fbankdir}/dev/speech_shape ${fbankdir}/dev/text_shape ${feat_dev_dir}
+    cp ${fbankdir}/test/text ${fbankdir}/test/speech_shape ${fbankdir}/test/text_shape ${feat_test_dir}
+
+    utils/fix_data_feat.sh ${feat_train_dir}
+    utils/fix_data_feat.sh ${feat_dev_dir}
+    utils/fix_data_feat.sh ${feat_test_dir}
+fi
+
+token_list=${feats_dir}/data/${lang}_token_list/char/tokens.txt
+echo "dictionary: ${token_list}"
+if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
+    echo "stage 2: Dictionary Preparation"
+    mkdir -p "$(dirname "${token_list}")"
+   
+    echo "make a dictionary"
+    echo "<blank>" > ${token_list}
+    echo "<s>" >> ${token_list}
+    echo "</s>" >> ${token_list}
+    utils/text2token.py -s 1 -n 1 --space "" data/train/text | cut -f 2- -d" " | tr " " "\n" \
+        | sort | uniq | grep -a -v -e '^\s*$' | awk '{print $0}' >> ${token_list}
+    num_token=$(cat ${token_list} | wc -l)
+    echo "<unk>" >> ${token_list}
+    vocab_size=$(cat ${token_list} | wc -l)
+    awk -v v=,${vocab_size} '{print $0v}' ${feat_train_dir}/text_shape > ${feat_train_dir}/text_shape.char
+    awk -v v=,${vocab_size} '{print $0v}' ${feat_dev_dir}/text_shape > ${feat_dev_dir}/text_shape.char
+    mkdir -p ${feats_dir}/asr_stats_fbank_zh_char/train
+    mkdir -p ${feats_dir}/asr_stats_fbank_zh_char/dev
+    cp ${feat_train_dir}/speech_shape ${feat_train_dir}/text_shape ${feat_train_dir}/text_shape.char ${feats_dir}/asr_stats_fbank_zh_char/train
+    cp ${feat_dev_dir}/speech_shape ${feat_dev_dir}/text_shape ${feat_dev_dir}/text_shape.char ${feats_dir}/asr_stats_fbank_zh_char/dev
+fi
+
+if ! "${skip_extract_embed}"; then
+    local/extract_embeds.sh \
+        --bert_model_root ${bert_model_root} \
+        --bert_model_name ${bert_model_name} \
+        --raw_dataset_path ${feats_dir}
+fi
+
+# Training Stage
+world_size=$gpu_num  # run on one machine
+if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
+    mkdir -p $exp_dir
+    mkdir -p $exp_dir/log
+    INIT_FILE=$exp_dir/ddp_init
+    if [ -f $INIT_FILE ];then
+        rm -f $INIT_FILE
+    fi
+    init_method=file://$(readlink -f $INIT_FILE)
+    echo "$0: init method is $init_method"
+    for ((i = 0; i < $gpu_num; ++i)); do
+        {
+            rank=$i
+            local_rank=$i
+            gpu_id=$(echo $CUDA_VISIBLE_DEVICES | cut -d',' -f$((i + 1)))
+            asr_train_paraformer.py \
+                --gpu_id $gpu_id \
+                --use_preprocessor true \
+                --token_type char \
+                --token_list $token_list \
+                --train_data_path_and_name_and_type ${feats_dir}/${dumpdir}/${train_set}/${scp},speech,${type} \
+                --train_data_path_and_name_and_type ${feats_dir}/${dumpdir}/${train_set}/text,text,text \
+                --train_data_path_and_name_and_type ${feats_dir}/embeds/${bert_model_name}/${train_set}/embeds.scp,embed,${type} \
+                --train_shape_file ${feats_dir}/asr_stats_fbank_zh_char/${train_set}/speech_shape \
+                --train_shape_file ${feats_dir}/asr_stats_fbank_zh_char/${train_set}/text_shape.char \
+                --train_shape_file ${feats_dir}/embeds/${bert_model_name}/${train_set}/embeds.shape \
+                --valid_data_path_and_name_and_type ${feats_dir}/${dumpdir}/${valid_set}/${scp},speech,${type} \
+                --valid_data_path_and_name_and_type ${feats_dir}/${dumpdir}/${valid_set}/text,text,text \
+                --valid_data_path_and_name_and_type ${feats_dir}/embeds/${bert_model_name}/${valid_set}/embeds.scp,embed,${type} \
+                --valid_shape_file ${feats_dir}/asr_stats_fbank_zh_char/${valid_set}/speech_shape \
+                --valid_shape_file ${feats_dir}/asr_stats_fbank_zh_char/${valid_set}/text_shape.char  \
+                --valid_shape_file ${feats_dir}/embeds/${bert_model_name}/${valid_set}/embeds.shape \
+                --resume true \
+                --output_dir $exp_dir \
+                --config $asr_config \
+                --input_size $feats_dim \
+                --ngpu $gpu_num \
+                --num_worker_count $count \
+                --multiprocessing_distributed true \
+                --dist_init_method $init_method \
+                --dist_world_size $world_size \
+                --dist_rank $rank \
+                --allow_variable_data_keys true \
+                --local_rank $local_rank 1> $exp_dir/log/train.log.$i 2>&1
+        } &
+    done
+    wait
+fi
+
+# Testing Stage
+if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
+    utils/easy_asr_infer.sh \
+        --lang zh \
+        --datadir ${feats_dir} \
+        --feats_type ${feats_type} \
+        --feats_dim ${feats_dim} \
+        --token_type ${token_type} \
+        --gpu_inference ${gpu_inference} \
+        --inference_config "${inference_config}" \
+        --test_sets "${test_sets}" \
+        --token_list $token_list \
+        --asr_exp $exp_dir \
+        --stage 12 \
+        --stop_stage 12 \
+        --scp $scp \
+        --text text \
+        --inference_nj $inference_nj \
+        --njob $njob \
+        --inference_asr_model $inference_asr_model \
+        --gpuid_list $gpuid_list \
+        --gpu_inference ${gpu_inference} \
+        --mode paraformer
+fi
+
diff --git a/egs/aishell/paraformer2/utils b/egs/aishell/paraformer2/utils
new file mode 120000
index 000000000..40e14f57f
--- /dev/null
+++ b/egs/aishell/paraformer2/utils
@@ -0,0 +1 @@
+../tranformer/utils
\ No newline at end of file
diff --git a/egs/aishell/tranformer/conf/decode_asr_transformer.yaml b/egs/aishell/tranformer/conf/decode_asr_transformer.yaml
new file mode 100644
index 000000000..a147fa79d
--- /dev/null
+++ b/egs/aishell/tranformer/conf/decode_asr_transformer.yaml
@@ -0,0 +1,6 @@
+beam_size: 10
+penalty: 0.0
+maxlenratio: 0.0
+minlenratio: 0.0
+ctc_weight: 0.5
+lm_weight: 0.7
diff --git a/egs/aishell/tranformer/conf/train_asr_conformer.yaml b/egs/aishell/tranformer/conf/train_asr_conformer.yaml
new file mode 100644
index 000000000..ddf217ec0
--- /dev/null
+++ b/egs/aishell/tranformer/conf/train_asr_conformer.yaml
@@ -0,0 +1,80 @@
+# network architecture
+# encoder related
+encoder: conformer
+encoder_conf:
+    output_size: 256    # dimension of attention
+    attention_heads: 4
+    linear_units: 2048  # the number of units of position-wise feed forward
+    num_blocks: 12      # the number of encoder blocks
+    dropout_rate: 0.1
+    positional_dropout_rate: 0.1
+    attention_dropout_rate: 0.0
+    input_layer: conv2d # encoder architecture type
+    normalize_before: true
+    pos_enc_layer_type: rel_pos
+    selfattention_layer_type: rel_selfattn
+    activation_type: swish
+    macaron_style: true
+    use_cnn_module: true
+    cnn_module_kernel: 15
+
+# decoder related
+decoder: transformer
+decoder_conf:
+    attention_heads: 4
+    linear_units: 2048
+    num_blocks: 6
+    dropout_rate: 0.1
+    positional_dropout_rate: 0.1
+    self_attention_dropout_rate: 0.0
+    src_attention_dropout_rate: 0.0
+
+# hybrid CTC/attention
+model_conf:
+    ctc_weight: 0.3
+    lsm_weight: 0.1     # label smoothing option
+    length_normalized_loss: false
+
+# minibatch related
+batch_type: length
+batch_bins: 25000
+num_workers: 16
+
+# optimization related
+accum_grad: 1
+grad_clip: 5
+max_epoch: 50
+val_scheduler_criterion:
+    - valid
+    - acc
+best_model_criterion:
+-   - valid
+    - acc
+    - max
+keep_nbest_models: 10
+
+optim: adam
+optim_conf:
+   lr: 0.0005
+scheduler: warmuplr
+scheduler_conf:
+   warmup_steps: 30000
+
+specaug: specaug
+specaug_conf:
+    apply_time_warp: true
+    time_warp_window: 5
+    time_warp_mode: bicubic
+    apply_freq_mask: true
+    freq_mask_width_range:
+    - 0
+    - 30
+    num_freq_mask: 2
+    apply_time_mask: true
+    time_mask_width_range:
+    - 0
+    - 40
+    num_time_mask: 2
+
+log_interval: 50
+normalize: None
diff --git a/egs/aishell/tranformer/conf/train_asr_transformer.yaml b/egs/aishell/tranformer/conf/train_asr_transformer.yaml
new file mode 100644
index 000000000..ce987e7c9
--- /dev/null
+++ b/egs/aishell/tranformer/conf/train_asr_transformer.yaml
@@ -0,0 +1,70 @@
+# network architecture
+# encoder related
+encoder: transformer
+encoder_conf:
+    output_size: 256    # dimension of attention
+    attention_heads: 4
+    linear_units: 2048  # the number of units of position-wise feed forward
+    num_blocks: 12      # the number of encoder blocks
+    dropout_rate: 0.1
+    positional_dropout_rate: 0.1
+    attention_dropout_rate: 0.0
+    input_layer: conv2d # encoder architecture type
+    normalize_before: true
+
+# decoder related
+decoder: transformer
+decoder_conf:
+    attention_heads: 4
+    linear_units: 2048
+    num_blocks: 6
+    dropout_rate: 0.1
+    positional_dropout_rate: 0.1
+    self_attention_dropout_rate: 0.0
+    src_attention_dropout_rate: 0.0
+
+# hybrid CTC/attention
+model_conf:
+    ctc_weight: 0.3
+    lsm_weight: 0.1     # label smoothing option
+    length_normalized_loss: false
+
+# minibatch related
+batch_type: length
+batch_bins: 32000
+num_workers: 8
+
+# optimization related
+accum_grad: 1
+grad_clip: 5
+patience: 3
+max_epoch: 20
+val_scheduler_criterion:
+    - valid
+    - acc
+best_model_criterion:
+-   - valid
+    - acc
+    - max
+keep_nbest_models: 10
+
+# NoamLR is deprecated. Use WarmupLR.
+# The following is the equivalent setting for NoamLR:
+#
+#    optim: adam
+#    optim_conf:
+#        lr: 10.
+#    scheduler: noamlr
+#    scheduler_conf:
+#        model_size: 256
+#        warmup_steps: 25000
+#
+optim: adam
+optim_conf:
+    lr: 0.002
+scheduler: warmuplr     # pytorch v1.1.0+ required
+scheduler_conf:
+    warmup_steps: 25000
+
+log_interval: 50
+normalize: None
diff --git a/egs/aishell/tranformer/local/aishell_data_prep.sh b/egs/aishell/tranformer/local/aishell_data_prep.sh
new file mode 100755
index 000000000..83f489b3c
--- /dev/null
+++ b/egs/aishell/tranformer/local/aishell_data_prep.sh
@@ -0,0 +1,66 @@
+#!/bin/bash
+
+# Copyright 2017 Xingyu Na
+# Apache 2.0
+
+#. ./path.sh || exit 1;
+
+if [ $# != 3 ]; then
+  echo "Usage: $0 <audio-path> <text-path> <output-path>"
+  echo " $0 /export/a05/xna/data/data_aishell/wav /export/a05/xna/data/data_aishell/transcript data"
+  exit 1;
+fi
+
+aishell_audio_dir=$1
+aishell_text=$2/aishell_transcript_v0.8.txt
+output_dir=$3
+
+train_dir=$output_dir/data/local/train
+dev_dir=$output_dir/data/local/dev
+test_dir=$output_dir/data/local/test
+tmp_dir=$output_dir/data/local/tmp
+
+mkdir -p $train_dir
+mkdir -p $dev_dir
+mkdir -p $test_dir
+mkdir -p $tmp_dir
+
+# data directory check
+if [ ! -d $aishell_audio_dir ] || [ ! -f $aishell_text ]; then
+  echo "Error: $0 requires an existing audio directory and a transcript directory containing aishell_transcript_v0.8.txt"
+  exit 1;
+fi
+
+# find wav audio file for train, dev and test resp.
+find $aishell_audio_dir -iname "*.wav" > $tmp_dir/wav.flist
+n=`cat $tmp_dir/wav.flist | wc -l`
+[ $n -ne 141925 ] && \
+  echo Warning: expected 141925 data files, found $n
+
+grep -i "wav/train" $tmp_dir/wav.flist > $train_dir/wav.flist || exit 1;
+grep -i "wav/dev" $tmp_dir/wav.flist > $dev_dir/wav.flist || exit 1;
+grep -i "wav/test" $tmp_dir/wav.flist > $test_dir/wav.flist || exit 1;
+
+rm -r $tmp_dir
+
+# Transcriptions preparation
+for dir in $train_dir $dev_dir $test_dir; do
+  echo Preparing $dir transcriptions
+  sed -e 's/\.wav//' $dir/wav.flist | awk -F '/' '{print $NF}' > $dir/utt.list
+  paste -d' ' $dir/utt.list $dir/wav.flist > $dir/wav.scp_all
+  utils/filter_scp.pl -f 1 $dir/utt.list $aishell_text > $dir/transcripts.txt
+  awk '{print $1}' $dir/transcripts.txt > $dir/utt.list
+  utils/filter_scp.pl -f 1 $dir/utt.list $dir/wav.scp_all | sort -u > $dir/wav.scp
+  sort -u $dir/transcripts.txt > $dir/text
+done
+
+mkdir -p $output_dir/data/train $output_dir/data/dev $output_dir/data/test
+
+for f in wav.scp text; do
+  cp $train_dir/$f $output_dir/data/train/$f || exit 1;
+  cp $dev_dir/$f $output_dir/data/dev/$f || exit 1;
+  cp $test_dir/$f $output_dir/data/test/$f || exit 1;
+done
+
+echo "$0: AISHELL data preparation succeeded"
+exit 0;
diff --git a/egs/aishell/tranformer/local/prepare_data.sh b/egs/aishell/tranformer/local/prepare_data.sh
new file mode 100755
index 000000000..77791f9c1
--- /dev/null
+++ b/egs/aishell/tranformer/local/prepare_data.sh
@@ -0,0 +1,53 @@
+#!/usr/bin/env bash
+# Copyright 2018 AIShell-Foundation(Authors:Jiayu DU, Xingyu NA, Bengu WU, Hao ZHENG)
+#           2018 Beijing Shell Shell Tech. Co. Ltd. (Author: Hui BU)
+# Apache 2.0
+
+# transform raw AISHELL-2 data to kaldi format
+
+. ./path.sh || exit 1;
+
+tmp=
+dir=
+
+if [ $# != 3 ]; then
+  echo "Usage: $0 <corpus-data-dir> <tmp-dir> <output-dir>"
+  echo " $0 /export/AISHELL-2/iOS/train data/local/train data/train"
+  exit 1;
+fi
+
+corpus=$1
+tmp=$2
+dir=$3
+
+echo "prepare_data.sh: Preparing data in $corpus"
+
+mkdir -p $tmp
+mkdir -p $dir
+
+# corpus check
+if [ ! -d $corpus ] || [ ! -f $corpus/wav.scp ] || [ ! -f $corpus/trans.txt ]; then
+  echo "Error: $0 requires wav.scp and trans.txt under $corpus directory."
+  exit 1;
+fi
+
+# validate utt-key list, IC0803W0380 is a bad utterance
+awk '{print $1}' $corpus/wav.scp | grep -v 'IC0803W0380' > $tmp/wav_utt.list
+awk '{print $1}' $corpus/trans.txt > $tmp/trans_utt.list
+utils/filter_scp.pl -f 1 $tmp/wav_utt.list $tmp/trans_utt.list > $tmp/utt.list
+
+# wav.scp
+awk -F'\t' -v path_prefix=$corpus '{printf("%s\t%s/%s\n",$1,path_prefix,$2)}' $corpus/wav.scp > $tmp/tmp_wav.scp
+utils/filter_scp.pl -f 1 $tmp/utt.list $tmp/tmp_wav.scp | sort -k 1 | uniq > $tmp/wav.scp
+
+# text
+utils/filter_scp.pl -f 1 $tmp/utt.list $corpus/trans.txt | sort -k 1 | uniq > $tmp/text
+
+# copy prepared resources from tmp_dir to target dir
+mkdir -p $dir
+for f in wav.scp text; do
+  cp $tmp/$f $dir/$f || exit 1;
+done
+
+echo "local/prepare_data.sh succeeded"
+exit 0;
diff --git a/egs/aishell/tranformer/path.sh b/egs/aishell/tranformer/path.sh
new file mode 100755
index 000000000..7972642d0
--- /dev/null
+++ b/egs/aishell/tranformer/path.sh
@@ -0,0 +1,5 @@
+export FUNASR_DIR=$PWD/../../..
+
+# NOTE(kan-bayashi): Use UTF-8 in Python to avoid UnicodeDecodeError when LC_ALL=C
+export PYTHONIOENCODING=UTF-8
+export PATH=$FUNASR_DIR/funasr/bin:$PATH
diff --git a/egs/aishell/tranformer/run.sh b/egs/aishell/tranformer/run.sh
new file mode 100755
index 000000000..16ebc6759
--- /dev/null
+++ b/egs/aishell/tranformer/run.sh
@@ -0,0 +1,208 @@
+#!/usr/bin/env bash
+
+. ./path.sh || exit 1;
+
+# machines configuration
+CUDA_VISIBLE_DEVICES="0,1"
+gpu_num=2
+count=1
+gpu_inference=true  # Whether to perform gpu decoding, set false for cpu decoding
+# for gpu decoding, inference_nj=ngpu*njob; for cpu decoding, inference_nj=njob
+njob=8
+train_cmd=utils/run.pl
+
+# general configuration
+feats_dir=".." # feature output directory, for large data
+exp_dir="."
+lang=zh
+dumpdir=dump/fbank
+feats_type=fbank
+token_type=char
+scp=feats.scp
+type=kaldi_ark
+stage=0
+stop_stage=4
+
+# feature configuration
+feats_dim=80
+sample_frequency=16000
+nj=32
+speed_perturb="0.9,1.0,1.1"
+
+# data
+data_aishell=
+
+# exp tag
+tag=""
+
+. utils/parse_options.sh || exit 1;
+
+# Set bash to strict mode; it will exit on:
+# -e 'error', -u 'undefined variable', -o pipefail 'error in pipeline'
+set -e
+set -u
+set -o pipefail
+
+train_set=train
+valid_set=dev
+test_sets="dev test"
+
+asr_config=conf/train_asr_conformer.yaml
+model_dir="baseline_$(basename "${asr_config}" .yaml)_${feats_type}_${lang}_${token_type}_${tag}"
+
+inference_config=conf/decode_asr_transformer.yaml
+inference_asr_model=valid.acc.ave_10best.pth
+
+# you can set gpu num for decoding here
+gpuid_list=$CUDA_VISIBLE_DEVICES  # set gpus for decoding, the same as training stage by default
+ngpu=$(echo $gpuid_list | awk -F "," '{print NF}')
+
+if ${gpu_inference}; then
+    inference_nj=$((ngpu * njob))
+else
+    inference_nj=$njob
+fi
+
+if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
+    echo "stage 0: Data preparation"
+    # Data preparation
+    local/aishell_data_prep.sh ${data_aishell}/data_aishell/wav ${data_aishell}/data_aishell/transcript ${feats_dir}
+    for x in train dev test; do
+        cp ${feats_dir}/data/${x}/text ${feats_dir}/data/${x}/text.org
+        paste -d " " <(cut -f 1 -d" " ${feats_dir}/data/${x}/text.org) <(cut -f 2- -d" " ${feats_dir}/data/${x}/text.org | tr -d " ") \
+            > ${feats_dir}/data/${x}/text
+        utils/text2token.py -n 1 -s 1 ${feats_dir}/data/${x}/text > ${feats_dir}/data/${x}/text.org
+        mv ${feats_dir}/data/${x}/text.org ${feats_dir}/data/${x}/text
+    done
+fi
+
+feat_train_dir=${feats_dir}/${dumpdir}/train; mkdir -p ${feat_train_dir}
+feat_dev_dir=${feats_dir}/${dumpdir}/dev; mkdir -p ${feat_dev_dir}
+feat_test_dir=${feats_dir}/${dumpdir}/test; mkdir -p ${feat_test_dir}
+if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
+    echo "stage 1: Feature Generation"
+    # compute fbank features
+    fbankdir=${feats_dir}/fbank
+    utils/compute_fbank.sh --cmd "$train_cmd" --nj $nj --speed_perturb ${speed_perturb} \
+        ${feats_dir}/data/train ${exp_dir}/exp/make_fbank/train ${fbankdir}/train
+    utils/fix_data_feat.sh ${fbankdir}/train
+    utils/compute_fbank.sh --cmd "$train_cmd" --nj $nj \
+        ${feats_dir}/data/dev ${exp_dir}/exp/make_fbank/dev ${fbankdir}/dev
+    utils/fix_data_feat.sh ${fbankdir}/dev
+    utils/compute_fbank.sh --cmd "$train_cmd" --nj $nj \
+        ${feats_dir}/data/test ${exp_dir}/exp/make_fbank/test ${fbankdir}/test
+    utils/fix_data_feat.sh ${fbankdir}/test
+     
+    # compute global cmvn
+    utils/compute_cmvn.sh --cmd "$train_cmd" --nj $nj \
+        ${fbankdir}/train ${exp_dir}/exp/make_fbank/train
+
+    # apply cmvn 
+    utils/apply_cmvn.sh --cmd "$train_cmd" --nj $nj \
+        ${fbankdir}/train ${fbankdir}/train/cmvn.json ${exp_dir}/exp/make_fbank/train ${feat_train_dir}
+    utils/apply_cmvn.sh --cmd "$train_cmd" --nj $nj \
+        ${fbankdir}/dev ${fbankdir}/train/cmvn.json ${exp_dir}/exp/make_fbank/dev ${feat_dev_dir}
+    utils/apply_cmvn.sh --cmd "$train_cmd" --nj $nj \
+        ${fbankdir}/test ${fbankdir}/train/cmvn.json ${exp_dir}/exp/make_fbank/test ${feat_test_dir}
+    
+    cp ${fbankdir}/train/text ${fbankdir}/train/speech_shape ${fbankdir}/train/text_shape ${feat_train_dir}
+    cp ${fbankdir}/dev/text ${fbankdir}/dev/speech_shape ${fbankdir}/dev/text_shape ${feat_dev_dir}
+    cp ${fbankdir}/test/text ${fbankdir}/test/speech_shape ${fbankdir}/test/text_shape ${feat_test_dir}
+
+    utils/fix_data_feat.sh ${feat_train_dir}
+    utils/fix_data_feat.sh ${feat_dev_dir}
+    utils/fix_data_feat.sh ${feat_test_dir}
+fi
+
+token_list=${feats_dir}/data/${lang}_token_list/char/tokens.txt
+echo "dictionary: ${token_list}"
+if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
+    echo "stage 2: Dictionary Preparation"
+    mkdir -p ${feats_dir}/data/${lang}_token_list/char/
+   
+    echo "make a dictionary"
+    echo "<blank>" > ${token_list}
+    echo "<s>" >> ${token_list}
+    echo "</s>" >> ${token_list}
+    utils/text2token.py -s 1 -n 1 --space "" ${feats_dir}/data/train/text | cut -f 2- -d" " | tr " " "\n" \
+        | sort | uniq | grep -a -v -e '^\s*$' | awk '{print $0}' >> ${token_list}
+    num_token=$(cat ${token_list} | wc -l)
+    echo "<unk>" >> ${token_list}
+    vocab_size=$(cat ${token_list} | wc -l)
+    awk -v v=,${vocab_size} '{print $0v}' ${feat_train_dir}/text_shape > ${feat_train_dir}/text_shape.char
+    awk -v v=,${vocab_size} '{print $0v}' ${feat_dev_dir}/text_shape > ${feat_dev_dir}/text_shape.char
+    mkdir -p ${feats_dir}/asr_stats_fbank_zh_char/train 
+    mkdir -p ${feats_dir}/asr_stats_fbank_zh_char/dev
+    cp ${feat_train_dir}/speech_shape ${feat_train_dir}/text_shape ${feat_train_dir}/text_shape.char ${feats_dir}/asr_stats_fbank_zh_char/train
+    cp ${feat_dev_dir}/speech_shape ${feat_dev_dir}/text_shape ${feat_dev_dir}/text_shape.char ${feats_dir}/asr_stats_fbank_zh_char/dev
+fi
+
+# Training Stage
+world_size=$gpu_num  # run on one machine
+if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
+    mkdir -p ${exp_dir}/exp/${model_dir}
+    mkdir -p ${exp_dir}/exp/${model_dir}/log
+    INIT_FILE=$exp_dir/ddp_init
+    if [ -f $INIT_FILE ];then
+        rm -f $INIT_FILE
+    fi 
+    init_method=file://$(readlink -f $INIT_FILE)
+    echo "$0: init method is $init_method"
+    for ((i = 0; i < $gpu_num; ++i)); do
+        {
+            rank=$i
+            local_rank=$i
+            gpu_id=$(echo $CUDA_VISIBLE_DEVICES | cut -d',' -f$((i + 1)))
+            asr_train.py \
+                --gpu_id $gpu_id \
+                --use_preprocessor true \
+                --token_type char \
+                --token_list $token_list \
+                --train_data_path_and_name_and_type ${feats_dir}/${dumpdir}/${train_set}/${scp},speech,${type} \
+                --train_data_path_and_name_and_type ${feats_dir}/${dumpdir}/${train_set}/text,text,text \
+                --train_shape_file ${feats_dir}/asr_stats_fbank_zh_char/${train_set}/speech_shape \
+                --train_shape_file ${feats_dir}/asr_stats_fbank_zh_char/${train_set}/text_shape.char \
+                --valid_data_path_and_name_and_type ${feats_dir}/${dumpdir}/${valid_set}/${scp},speech,${type} \
+                --valid_data_path_and_name_and_type ${feats_dir}/${dumpdir}/${valid_set}/text,text,text \
+                --valid_shape_file ${feats_dir}/asr_stats_fbank_zh_char/${valid_set}/speech_shape \
+                --valid_shape_file ${feats_dir}/asr_stats_fbank_zh_char/${valid_set}/text_shape.char  \
+                --resume true \
+                --output_dir ${exp_dir}/exp/${model_dir} \
+                --config $asr_config \
+                --input_size $feats_dim \
+                --ngpu $gpu_num \
+                --num_worker_count $count \
+                --multiprocessing_distributed true \
+                --dist_init_method $init_method \
+                --dist_world_size $world_size \
+                --dist_rank $rank \
+                --local_rank $local_rank 1> ${exp_dir}/exp/${model_dir}/log/train.log.$i 2>&1
+        } &
+    done
+    wait
+fi
+
+# Testing Stage
+if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
+    utils/easy_asr_infer.sh \
+        --lang zh \
+        --datadir ${feats_dir} \
+        --feats_type ${feats_type} \
+        --feats_dim ${feats_dim} \
+        --token_type ${token_type} \
+        --gpu_inference ${gpu_inference} \
+        --inference_config "${inference_config}" \
+        --test_sets "${test_sets}" \
+        --token_list $token_list \
+        --asr_exp ${exp_dir}/exp/${model_dir} \
+        --stage 12 \
+        --stop_stage 12 \
+        --scp $scp \
+        --text text \
+        --inference_nj $inference_nj \
+        --njob $njob \
+        --inference_asr_model $inference_asr_model \
+        --gpuid_list $gpuid_list \
+        --mode asr
+fi
+
diff --git a/egs/aishell/tranformer/utils/__init__.py b/egs/aishell/tranformer/utils/__init__.py
new file mode 100644
index 000000000..e69de29bb
diff --git a/egs/aishell/tranformer/utils/apply_cmvn.py b/egs/aishell/tranformer/utils/apply_cmvn.py
new file mode 100755
index 000000000..b5c5086b3
--- /dev/null
+++ b/egs/aishell/tranformer/utils/apply_cmvn.py
@@ -0,0 +1,79 @@
+from kaldiio import ReadHelper
+from kaldiio import WriteHelper
+
+import argparse
+import json
+import math
+import numpy as np
+
+
+def get_parser():
+    parser = argparse.ArgumentParser(
+        description="apply cmvn",
+        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
+    )
+    parser.add_argument(
+        "--ark-file",
+        "-a",
+        default=False,
+        required=True,
+        type=str,
+        help="fbank ark file",
+    )
+    parser.add_argument(
+        "--cmvn-file",
+        "-c",
+        default=False,
+        required=True,
+        type=str,
+        help="cmvn file",
+    )
+    parser.add_argument(
+        "--ark-index",
+        "-i",
+        default=1,
+        required=True,
+        type=int,
+        help="ark index",
+    )
+    parser.add_argument(
+        "--output-dir",
+        "-o",
+        default=False,
+        required=True,
+        type=str,
+        help="output dir",
+    )
+    return parser
+
+
+def main():
+    parser = get_parser()
+    args = parser.parse_args()
+
+    ark_file = args.output_dir + "/feats." + str(args.ark_index) + ".ark"
+    scp_file = args.output_dir + "/feats." + str(args.ark_index) + ".scp"
+    ark_writer = WriteHelper('ark,scp:{},{}'.format(ark_file, scp_file))
+
+    with open(args.cmvn_file) as f:
+        cmvn_stats = json.load(f)
+
+    means = cmvn_stats['mean_stats']
+    vars = cmvn_stats['var_stats']
+    total_frames = cmvn_stats['total_frames']
+
+    for i in range(len(means)):
+        means[i] /= total_frames
+        vars[i] = vars[i] / total_frames - means[i] * means[i]
+        if vars[i] < 1.0e-20:
+            vars[i] = 1.0e-20
+        vars[i] = 1.0 / math.sqrt(vars[i])
+
+    with ReadHelper('ark:{}'.format(args.ark_file)) as ark_reader:
+        for key, mat in ark_reader:
+            mat = (mat - means) * vars
+            ark_writer(key, mat)
+
+
+if __name__ == '__main__':
+    main()
diff --git a/egs/aishell/tranformer/utils/apply_cmvn.sh b/egs/aishell/tranformer/utils/apply_cmvn.sh
new file mode 100755
index 000000000..f8fd1d140
--- /dev/null
+++ b/egs/aishell/tranformer/utils/apply_cmvn.sh
@@ -0,0 +1,29 @@
+#!/usr/bin/env bash
+
+. ./path.sh || exit 1;
+# Begin configuration section.
+nj=32
+cmd=./utils/run.pl
+
+echo "$0 $@"
+
+. utils/parse_options.sh || exit 1;
+
+fbankdir=$1
+cmvn_file=$2
+logdir=$3
+output_dir=$4
+
+dump_dir=${output_dir}/ark; mkdir -p ${dump_dir}
+mkdir -p ${logdir}
+
+$cmd JOB=1:$nj $logdir/apply_cmvn.JOB.log \
+    python utils/apply_cmvn.py -a $fbankdir/ark/feats.JOB.ark \
+        -c $cmvn_file -i JOB -o ${dump_dir} \
+        || exit 1;
+
+for n in $(seq $nj); do
+    cat ${dump_dir}/feats.$n.scp || exit 1
+done > ${output_dir}/feats.scp || exit 1
+
+echo "$0: Successfully applied CMVN"
diff --git a/egs/aishell/tranformer/utils/apply_lfr_and_cmvn.py b/egs/aishell/tranformer/utils/apply_lfr_and_cmvn.py
new file mode 100755
index 000000000..50d18d1a4
--- /dev/null
+++ b/egs/aishell/tranformer/utils/apply_lfr_and_cmvn.py
@@ -0,0 +1,143 @@
+from kaldiio import ReadHelper, WriteHelper
+
+import argparse
+import numpy as np
+
+
+def build_LFR_features(inputs, m=7, n=6):
+    LFR_inputs = []
+    T = inputs.shape[0]
+    T_lfr = int(np.ceil(T / n))
+    left_padding = np.tile(inputs[0], ((m - 1) // 2, 1))
+    inputs = np.vstack((left_padding, inputs))
+    T = T + (m - 1) // 2
+    for i in range(T_lfr):
+        if m <= T - i * n:
+            LFR_inputs.append(np.hstack(inputs[i * n:i * n + m]))
+        else:
+            num_padding = m - (T - i * n)
+            frame = np.hstack(inputs[i * n:])
+            for _ in range(num_padding):
+                frame = np.hstack((frame, inputs[-1]))
+            LFR_inputs.append(frame)
+    return np.vstack(LFR_inputs)
+
+
+def build_CMVN_features(inputs, mvn_file):  # noqa
+    with open(mvn_file, 'r', encoding='utf-8') as f:
+        lines = f.readlines()
+
+    add_shift_list = []
+    rescale_list = []
+    for i in range(len(lines)):
+        line_item = lines[i].split()
+        if line_item[0] == '<AddShift>':
+            line_item = lines[i + 1].split()
+            if line_item[0] == '<LearnRateCoef>':
+                add_shift_line = line_item[3:(len(line_item) - 1)]
+                add_shift_list = list(add_shift_line)
+                continue
+        elif line_item[0] == '<Rescale>':
+            line_item = lines[i + 1].split()
+            if line_item[0] == '<LearnRateCoef>':
+                rescale_line = line_item[3:(len(line_item) - 1)]
+                rescale_list = list(rescale_line)
+                continue
+
+    for j in range(inputs.shape[0]):
+        for k in range(inputs.shape[1]):
+            add_shift_value = add_shift_list[k]
+            rescale_value = rescale_list[k]
+            inputs[j, k] = float(inputs[j, k]) + float(add_shift_value)
+            inputs[j, k] = float(inputs[j, k]) * float(rescale_value)
+
+    return inputs
+
+
+def get_parser():
+    parser = argparse.ArgumentParser(
+        description="apply low_frame_rate and cmvn",
+        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
+    )
+    parser.add_argument(
+        "--ark-file",
+        "-a",
+        default=False,
+        required=True,
+        type=str,
+        help="fbank ark file",
+    )
+    parser.add_argument(
+        "--lfr",
+        "-f",
+        default=True,
+        type=str,
+        help="low frame rate",
+    )
+    parser.add_argument(
+        "--lfr-m",
+        "-m",
+        default=7,
+        type=int,
+        help="number of frames to stack",
+    )
+    parser.add_argument(
+        "--lfr-n",
+        "-n",
+        default=6,
+        type=int,
+        help="number of frames to skip",
+    )
+    parser.add_argument(
+        "--cmvn-file",
+        "-c",
+        default=False,
+        required=True,
+        type=str,
+        help="global cmvn file",
+    )
+    parser.add_argument(
+        "--ark-index",
+        "-i",
+        default=1,
+        required=True,
+        type=int,
+        help="ark index",
+    )
+    parser.add_argument(
+        "--output-dir",
+        "-o",
+        default=False,
+        required=True,
+        type=str,
+        help="output dir",
+    )
+    return parser
+
+
+def main():
+    parser = get_parser()
+    args = parser.parse_args()
+
+    dump_ark_file = args.output_dir + "/feats." + str(args.ark_index) + ".ark"
+    dump_scp_file = args.output_dir + "/feats." + str(args.ark_index) + ".scp"
+    shape_file = args.output_dir + "/len." + str(args.ark_index)
+    # Context managers ensure the ark/scp and shape files are flushed and closed.
+    with WriteHelper('ark,scp:{},{}'.format(dump_ark_file, dump_scp_file)) as ark_writer, \
+            open(shape_file, 'w') as shape_writer, \
+            ReadHelper('ark:{}'.format(args.ark_file)) as ark_reader:
+        for key, mat in ark_reader:
+            if args.lfr:
+                lfr = build_LFR_features(mat, args.lfr_m, args.lfr_n)
+            else:
+                lfr = mat
+            cmvn = build_CMVN_features(lfr, args.cmvn_file)
+            dims = cmvn.shape[1]
+            lens = cmvn.shape[0]
+            shape_writer.write(key + " " + str(lens) + "," + str(dims) + '\n')
+            ark_writer(key, cmvn)
+
+
+if __name__ == '__main__':
+    main()
+
diff --git a/egs/aishell/tranformer/utils/apply_lfr_and_cmvn.sh b/egs/aishell/tranformer/utils/apply_lfr_and_cmvn.sh
new file mode 100755
index 000000000..3119fdb8f
--- /dev/null
+++ b/egs/aishell/tranformer/utils/apply_lfr_and_cmvn.sh
@@ -0,0 +1,38 @@
+#!/usr/bin/env bash
+
+
+# Begin configuration section.
+nj=32
+cmd=utils/run.pl
+
+# feature configuration
+lfr=True
+lfr_m=7
+lfr_n=6
+
+echo "$0 $@"
+
+. utils/parse_options.sh || exit 1;
+
+fbankdir=$1
+cmvn_file=$2
+logdir=$3
+output_dir=$4
+
+dump_dir=${output_dir}/ark; mkdir -p ${dump_dir}
+mkdir -p ${logdir}
+
+$cmd JOB=1:$nj $logdir/apply_lfr_and_cmvn.JOB.log \
+    python utils/apply_lfr_and_cmvn.py -a $fbankdir/ark/feats.JOB.ark \
+        -f $lfr -m $lfr_m -n $lfr_n -c $cmvn_file -i JOB -o ${dump_dir} \
+        || exit 1;
+
+for n in $(seq $nj); do
+    cat ${dump_dir}/feats.$n.scp || exit 1
+done > ${output_dir}/feats.scp || exit 1
+
+for n in $(seq $nj); do
+  cat ${dump_dir}/len.$n || exit 1
+done > ${output_dir}/speech_shape || exit 1
+
+echo "$0: Succeeded apply low frame rate and cmvn"
diff --git a/egs/aishell/tranformer/utils/combine_cmvn_file.py b/egs/aishell/tranformer/utils/combine_cmvn_file.py
new file mode 100755
index 000000000..e16174c60
--- /dev/null
+++ b/egs/aishell/tranformer/utils/combine_cmvn_file.py
@@ -0,0 +1,66 @@
+import argparse
+import json
+import numpy as np
+
+def get_parser():
+    parser = argparse.ArgumentParser(
+        description="combine cmvn file",
+        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
+    )
+    parser.add_argument(
+        "--cmvn-dir",
+        "-c",
+        default=False,
+        required=True,
+        type=str,
+        help="cmvn dir",
+    )
+
+    parser.add_argument(
+        "--nj",
+        "-n",
+        default=1,
+        required=True,
+        type=int,
+        help="num of cmvn file",
+    )
+    parser.add_argument(
+        "--output-dir",
+        "-o",
+        default=False,
+        required=True,
+        type=str,
+        help="output dir",
+    )
+    return parser
+
+
+def main():
+    parser = get_parser()
+    args = parser.parse_args()
+
+    total_means = 0.0
+    total_vars = 0.0
+    total_frames = 0
+
+    cmvn_file = args.output_dir + "/cmvn.json"
+
+    for i in range(1, args.nj+1):
+        with open(args.cmvn_dir + "/cmvn." + str(i) + ".json", "r") as fin:
+            cmvn_stats = json.load(fin)
+
+        total_means += np.array(cmvn_stats["mean_stats"])
+        total_vars += np.array(cmvn_stats["var_stats"])
+        total_frames += cmvn_stats["total_frames"]
+
+    cmvn_info = {
+        'mean_stats': list(total_means.tolist()),
+        'var_stats': list(total_vars.tolist()),
+        'total_frames': total_frames
+    }
+    with open(cmvn_file, 'w') as fout:
+        fout.write(json.dumps(cmvn_info))
+
+
+if __name__ == '__main__':
+    main()
diff --git a/egs/aishell/tranformer/utils/compute_cmvn.py b/egs/aishell/tranformer/utils/compute_cmvn.py
new file mode 100755
index 000000000..988d6dc9e
--- /dev/null
+++ b/egs/aishell/tranformer/utils/compute_cmvn.py
@@ -0,0 +1,67 @@
+from kaldiio import ReadHelper
+
+import argparse
+import numpy as np
+import json
+
+
+def get_parser():
+    parser = argparse.ArgumentParser(
+        description="compute global cmvn",
+        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
+    )
+    parser.add_argument(
+        "--ark-file",
+        "-a",
+        default=False,
+        required=True,
+        type=str,
+        help="directory containing the fbank ark files (feats.<index>.ark)",
+    )
+    parser.add_argument(
+        "--ark-index",
+        "-i",
+        default=1,
+        required=True,
+        type=int,
+        help="ark index",
+    )
+    parser.add_argument(
+        "--output-dir",
+        "-o",
+        default=False,
+        required=True,
+        type=str,
+        help="output dir",
+    )
+    return parser
+
+
+def main():
+    parser = get_parser()
+    args = parser.parse_args()
+
+    ark_file = args.ark_file + "/feats." + str(args.ark_index) + ".ark"
+    cmvn_file = args.output_dir + "/cmvn." + str(args.ark_index) + ".json"
+
+    mean_stats = 0.0
+    var_stats = 0.0
+    total_frames = 0
+
+    with ReadHelper('ark:{}'.format(ark_file)) as ark_reader:
+        for key, mat in ark_reader:
+            mean_stats += np.sum(mat, axis=0)
+            var_stats += np.sum(np.square(mat), axis=0)
+            total_frames += mat.shape[0]
+
+    cmvn_info = {
+        'mean_stats': list(mean_stats.tolist()),
+        'var_stats': list(var_stats.tolist()),
+        'total_frames': total_frames
+    }
+    with open(cmvn_file, 'w') as fout:
+        fout.write(json.dumps(cmvn_info))
+
+
+if __name__ == '__main__':
+    main()
diff --git a/egs/aishell/tranformer/utils/compute_cmvn.sh b/egs/aishell/tranformer/utils/compute_cmvn.sh
new file mode 100755
index 000000000..3a3019016
--- /dev/null
+++ b/egs/aishell/tranformer/utils/compute_cmvn.sh
@@ -0,0 +1,24 @@
+#!/usr/bin/env bash
+
+. ./path.sh || exit 1;
+# Begin configuration section.
+nj=32
+cmd=./utils/run.pl
+
+echo "$0 $@"
+
+. utils/parse_options.sh || exit 1;
+
+fbankdir=$1
+logdir=$2
+
+output_dir=${fbankdir}/cmvn; mkdir -p ${output_dir}
+mkdir -p ${logdir}
+
+$cmd JOB=1:$nj $logdir/cmvn.JOB.log \
+    python utils/compute_cmvn.py -a $fbankdir/ark -i JOB -o ${output_dir} \
+        || exit 1;
+
+python utils/combine_cmvn_file.py -c ${output_dir} -n $nj -o $fbankdir
+
+echo "$0: Succeeded compute global cmvn"
diff --git a/egs/aishell/tranformer/utils/compute_fbank.py b/egs/aishell/tranformer/utils/compute_fbank.py
new file mode 100755
index 000000000..d03b5a826
--- /dev/null
+++ b/egs/aishell/tranformer/utils/compute_fbank.py
@@ -0,0 +1,153 @@
+from kaldiio import WriteHelper
+
+import argparse
+import numpy as np
+import json
+import torch
+import torchaudio
+import torchaudio.compliance.kaldi as kaldi
+
+
+def compute_fbank(wav_file,
+                  num_mel_bins=80,
+                  frame_length=25,
+                  frame_shift=10,
+                  dither=0.0,
+                  resample_rate=16000,
+                  speed=1.0):
+
+    waveform, sample_rate = torchaudio.load(wav_file)
+    if resample_rate != sample_rate:
+        waveform = torchaudio.transforms.Resample(orig_freq=sample_rate,
+                                                  new_freq=resample_rate)(waveform)
+    if speed != 1.0:
+        waveform, _ = torchaudio.sox_effects.apply_effects_tensor(
+            waveform, resample_rate,
+            [['speed', str(speed)], ['rate', str(resample_rate)]]
+        )
+
+    waveform = waveform * (1 << 15)
+    mat = kaldi.fbank(waveform,
+                      num_mel_bins=num_mel_bins,
+                      frame_length=frame_length,
+                      frame_shift=frame_shift,
+                      dither=dither,
+                      energy_floor=0.0,
+                      window_type='hamming',
+                      sample_frequency=resample_rate)
+
+    return mat.numpy()
+
+
+def get_parser():
+    parser = argparse.ArgumentParser(
+        description="compute fbank features",
+        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
+    )
+    parser.add_argument(
+        "--wav-lists",
+        "-w",
+        default=False,
+        required=True,
+        type=str,
+        help="input wav lists",
+    )
+    parser.add_argument(
+        "--text-files",
+        "-t",
+        default=False,
+        required=True,
+        type=str,
+        help="input text files",
+    )
+    parser.add_argument(
+        "--dims",
+        "-d",
+        default=80,
+        type=int,
+        help="feature dims",
+    )
+    parser.add_argument(
+        "--sample-frequency",
+        "-s",
+        default=16000,
+        type=int,
+        help="sample frequency",
+    )
+    parser.add_argument(
+        "--speed-perturb",
+        "-p",
+        default="1.0",
+        type=str,
+        help="speed perturb",
+    )
+    parser.add_argument(
+        "--ark-index",
+        "-a",
+        default=1,
+        required=True,
+        type=int,
+        help="ark index",
+    )
+    parser.add_argument(
+        "--output-dir",
+        "-o",
+        default=False,
+        required=True,
+        type=str,
+        help="output dir",
+    )
+    return parser
+
+
+def main():
+    parser = get_parser()
+    args = parser.parse_args()
+
+    ark_file = args.output_dir + "/ark/feats." + str(args.ark_index) + ".ark"
+    scp_file = args.output_dir + "/ark/feats." + str(args.ark_index) + ".scp"
+    text_file = args.output_dir + "/txt/text." + str(args.ark_index) + ".txt"  
+    feats_shape_file = args.output_dir + "/ark/len." + str(args.ark_index)
+    text_shape_file = args.output_dir + "/txt/len." + str(args.ark_index)
+
+    ark_writer = WriteHelper('ark,scp:{},{}'.format(ark_file, scp_file))
+    text_writer = open(text_file, 'w')
+    feats_shape_writer = open(feats_shape_file, 'w')
+    text_shape_writer = open(text_shape_file, 'w')
+
+    speed_perturb_list = args.speed_perturb.split(',')
+    
+    for speed in speed_perturb_list:
+        with open(args.wav_lists, 'r', encoding='utf-8') as wavfile:
+            with open(args.text_files, 'r', encoding='utf-8') as textfile:
+                for wav, text in zip(wavfile, textfile): 
+                    s_w = wav.strip().split()
+                    wav_id = s_w[0]
+                    wav_file = s_w[1]
+
+                    s_t = text.strip().split()
+                    text_id = s_t[0]
+                    txt = s_t[1:]
+                    fbank = compute_fbank(wav_file,
+                                          num_mel_bins=args.dims,
+                                          resample_rate=args.sample_frequency,
+                                          speed=float(speed)
+                                          )
+                    feats_dims = fbank.shape[1]
+                    feats_lens = fbank.shape[0]
+                    txt_lens = len(txt)
+                    if float(speed) == 1.0:  # numeric compare, so "1" and "1.00" also match
+                        wav_id_sp = wav_id
+                    else:
+                        wav_id_sp = wav_id + "_sp" + speed
+
+                    feats_shape_writer.write(wav_id_sp + " " + str(feats_lens) + "," + str(feats_dims) + '\n')
+                    text_shape_writer.write(wav_id_sp + " " + str(txt_lens) + '\n')
+
+                    text_writer.write(wav_id_sp + " " + " ".join(txt) + '\n')
+                    ark_writer(wav_id_sp, fbank)
+                    
+
+if __name__ == '__main__':
+    main()
+
diff --git a/egs/aishell/tranformer/utils/compute_fbank.sh b/egs/aishell/tranformer/utils/compute_fbank.sh
new file mode 100755
index 000000000..b456b4d83
--- /dev/null
+++ b/egs/aishell/tranformer/utils/compute_fbank.sh
@@ -0,0 +1,51 @@
+#!/usr/bin/env bash
+
+. ./path.sh || exit 1;
+# Begin configuration section.
+nj=32
+cmd=./utils/run.pl
+
+# feature configuration
+feat_dims=80
+sample_frequency=16000
+speed_perturb="1.0"
+
+echo "$0 $@"
+
+. utils/parse_options.sh || exit 1;
+
+data=$1
+logdir=$2
+fbankdir=$3
+
+[ ! -f $data/wav.scp ] && echo "$0: no such file $data/wav.scp" && exit 1;
+[ ! -f $data/text ] && echo "$0: no such file $data/text" && exit 1;
+
+python utils/split_data.py $data $data $nj
+
+ark_dir=${fbankdir}/ark; mkdir -p ${ark_dir}
+text_dir=${fbankdir}/txt; mkdir -p ${text_dir}
+mkdir -p ${logdir}
+
+$cmd JOB=1:$nj $logdir/make_fbank.JOB.log \
+    python utils/compute_fbank.py -w $data/split${nj}/JOB/wav.scp -t $data/split${nj}/JOB/text \
+        -d $feat_dims -s $sample_frequency -p ${speed_perturb} -a JOB -o ${fbankdir} \
+        || exit 1;
+
+for n in $(seq $nj); do
+    cat ${ark_dir}/feats.$n.scp || exit 1
+done > $fbankdir/feats.scp || exit 1
+
+for n in $(seq $nj); do
+    cat ${text_dir}/text.$n.txt || exit 1
+done > $fbankdir/text || exit 1
+
+for n in $(seq $nj); do
+    cat ${ark_dir}/len.$n || exit 1
+done > $fbankdir/speech_shape || exit 1
+
+for n in $(seq $nj); do
+    cat ${text_dir}/len.$n || exit 1
+done > $fbankdir/text_shape || exit 1
+
+echo "$0: Succeeded compute FBANK features"
diff --git a/egs/aishell/tranformer/utils/compute_wer.py b/egs/aishell/tranformer/utils/compute_wer.py
new file mode 100755
index 000000000..349a3f609
--- /dev/null
+++ b/egs/aishell/tranformer/utils/compute_wer.py
@@ -0,0 +1,157 @@
+import os
+import numpy as np
+import sys
+
+def compute_wer(ref_file,
+                hyp_file,
+                cer_detail_file):
+    rst = {
+        'Wrd': 0,
+        'Corr': 0,
+        'Ins': 0,
+        'Del': 0,
+        'Sub': 0,
+        'Snt': 0,
+        'Err': 0.0,
+        'S.Err': 0.0,
+        'wrong_words': 0,
+        'wrong_sentences': 0
+    }
+
+    hyp_dict = {}
+    ref_dict = {}
+    with open(hyp_file, 'r') as hyp_reader:
+        for line in hyp_reader:
+            key = line.strip().split()[0]
+            value = line.strip().split()[1:]
+            hyp_dict[key] = value
+    with open(ref_file, 'r') as ref_reader:
+        for line in ref_reader:
+            key = line.strip().split()[0]
+            value = line.strip().split()[1:]
+            ref_dict[key] = value
+
+    cer_detail_writer = open(cer_detail_file, 'w')
+    for hyp_key in hyp_dict:
+        if hyp_key in ref_dict:
+            out_item = compute_wer_by_line(hyp_dict[hyp_key], ref_dict[hyp_key])
+            rst['Wrd'] += out_item['nwords']
+            rst['Corr'] += out_item['cor']
+            rst['wrong_words'] += out_item['wrong']
+            rst['Ins'] += out_item['ins']
+            rst['Del'] += out_item['del']
+            rst['Sub'] += out_item['sub']
+            rst['Snt'] += 1
+            if out_item['wrong'] > 0:
+                rst['wrong_sentences'] += 1
+            cer_detail_writer.write(hyp_key + print_cer_detail(out_item) + '\n')
+            cer_detail_writer.write("ref:" + '\t' + "".join(ref_dict[hyp_key]) + '\n')
+            cer_detail_writer.write("hyp:" + '\t' + "".join(hyp_dict[hyp_key]) + '\n')
+
+    if rst['Wrd'] > 0:
+        rst['Err'] = round(rst['wrong_words'] * 100 / rst['Wrd'], 2)
+    if rst['Snt'] > 0:
+        rst['S.Err'] = round(rst['wrong_sentences'] * 100 / rst['Snt'], 2)
+
+    cer_detail_writer.write('\n')
+    cer_detail_writer.write("%WER " + str(rst['Err']) + " [ " + str(rst['wrong_words'])+ " / " + str(rst['Wrd']) +
+                            ", " + str(rst['Ins']) + " ins, " + str(rst['Del']) + " del, " + str(rst['Sub']) + " sub ]" + '\n')
+    cer_detail_writer.write("%SER " + str(rst['S.Err']) + " [ " + str(rst['wrong_sentences']) + " / " + str(rst['Snt']) + " ]" + '\n')
+    cer_detail_writer.write("Scored " + str(rst['Snt']) + " sentences, " + str(len(ref_dict) - rst['Snt']) + " not present in hyp." + '\n')
+    cer_detail_writer.close()
+
+def compute_wer_by_line(hyp,
+                        ref):
+    hyp = list(map(lambda x: x.lower(), hyp))
+    ref = list(map(lambda x: x.lower(), ref))
+
+    len_hyp = len(hyp)
+    len_ref = len(ref)
+
+    cost_matrix = np.zeros((len_hyp + 1, len_ref + 1), dtype=np.int32)
+
+    ops_matrix = np.zeros((len_hyp + 1, len_ref + 1), dtype=np.int8)
+
+    for i in range(len_hyp + 1):
+        cost_matrix[i][0] = i
+    for j in range(len_ref + 1):
+        cost_matrix[0][j] = j
+
+    for i in range(1, len_hyp + 1):
+        for j in range(1, len_ref + 1):
+            if hyp[i - 1] == ref[j - 1]:
+                cost_matrix[i][j] = cost_matrix[i - 1][j - 1]
+            else:
+                substitution = cost_matrix[i - 1][j - 1] + 1
+                insertion = cost_matrix[i - 1][j] + 1
+                deletion = cost_matrix[i][j - 1] + 1
+
+                compare_val = [substitution, insertion, deletion]
+
+                min_val = min(compare_val)
+                operation_idx = compare_val.index(min_val) + 1
+                cost_matrix[i][j] = min_val
+                ops_matrix[i][j] = operation_idx
+
+    match_idx = []
+    i = len_hyp
+    j = len_ref
+    rst = {
+        'nwords': len_ref,
+        'cor': 0,
+        'wrong': 0,
+        'ins': 0,
+        'del': 0,
+        'sub': 0
+    }
+    while i >= 0 or j >= 0:
+        i_idx = max(0, i)
+        j_idx = max(0, j)
+
+        if ops_matrix[i_idx][j_idx] == 0:  # correct
+            if i - 1 >= 0 and j - 1 >= 0:
+                match_idx.append((j - 1, i - 1))
+                rst['cor'] += 1
+
+            i -= 1
+            j -= 1
+
+        elif ops_matrix[i_idx][j_idx] == 2:  # insert
+            i -= 1
+            rst['ins'] += 1
+
+        elif ops_matrix[i_idx][j_idx] == 3:  # delete
+            j -= 1
+            rst['del'] += 1
+
+        elif ops_matrix[i_idx][j_idx] == 1:  # substitute
+            i -= 1
+            j -= 1
+            rst['sub'] += 1
+
+        if i < 0 and j >= 0:
+            rst['del'] += 1
+        elif j < 0 and i >= 0:
+            rst['ins'] += 1
+
+    match_idx.reverse()
+    wrong_cnt = cost_matrix[len_hyp][len_ref]
+    rst['wrong'] = wrong_cnt
+
+    return rst
+
+def print_cer_detail(rst):
+    return ("(" + "nwords=" + str(rst['nwords']) + ",cor=" + str(rst['cor'])
+            + ",ins=" + str(rst['ins']) + ",del=" + str(rst['del']) + ",sub="
+            + str(rst['sub']) + ") corr:" + '{:.2%}'.format(rst['cor']/max(rst['nwords'], 1))
+            + ",cer:" + '{:.2%}'.format(rst['wrong']/max(rst['nwords'], 1)))
+
+if __name__ == '__main__':
+    if len(sys.argv) != 4:
+        print("usage : python compute_wer.py test.ref test.hyp test.wer")
+        sys.exit(1)
+
+    ref_file = sys.argv[1]
+    hyp_file = sys.argv[2]
+    cer_detail_file = sys.argv[3]
+    compute_wer(ref_file, hyp_file, cer_detail_file)
diff --git a/egs/aishell/tranformer/utils/easy_asr_infer.sh b/egs/aishell/tranformer/utils/easy_asr_infer.sh
new file mode 100755
index 000000000..1b8db3469
--- /dev/null
+++ b/egs/aishell/tranformer/utils/easy_asr_infer.sh
@@ -0,0 +1,407 @@
+#!/usr/bin/env bash
+
+# Set bash to strict mode; the script exits on:
+# -e 'error', -u 'undefined variable', -o pipefail 'error in pipeline'.
+set -e
+set -u
+set -o pipefail
+
+log() {
+    local fname=${BASH_SOURCE[1]##*/}
+    echo -e "$(date '+%Y-%m-%dT%H:%M:%S') (${fname}:${BASH_LINENO[0]}:${FUNCNAME[1]}) $*"
+}
+min() {
+  local a b
+  a=$1
+  for b in "$@"; do
+      if [ "${b}" -le "${a}" ]; then
+          a="${b}"
+      fi
+  done
+  echo "${a}"
+}
+SECONDS=0
+
+# General configuration
+stage=1              # Processing starts from the specified stage.
+stop_stage=10000     # Processing stops at the specified stage.
+skip_data_prep=true  # Skip data preparation stages.
+skip_train=false     # Skip training stages.
+skip_eval=false      # Skip decoding and evaluation stages.
+skip_upload=true     # Skip packing and uploading stages.
+skip_upload_hf=true  # Skip uploading to hugging face stages.
+cuda_cmd=utils/run.pl
+decode_cmd=utils/run.pl
+ngpu=1               # The number of gpus ("0" uses cpu, otherwise use gpu).
+njob=1               # the number of jobs for each gpu
+gpuid_list=
+num_nodes=1          # The number of nodes.
+nj=32                # The number of parallel jobs.
+inference_nj=32      # The number of parallel jobs in decoding.
+gpu_inference=true  # Whether to perform gpu decoding, set false for cpu decoding
+datadir="./"
+dumpdir=dump         # Directory to dump features.
+expdir=exp           # Directory to save experiments.
+python=python        # Specify python to execute funasr commands.
+
+# Data preparation related
+local_data_opts= # The options given to local/data.sh.
+
+# Speed perturbation related
+speed_perturb_factors=  # perturbation factors, e.g. "0.9 1.0 1.1" (separated by space).
+
+# Feature extraction related
+feats_type=fbank       # Feature type (raw or fbank_pitch).
+feats_dim=
+audio_format=flac    # Audio format: wav, flac, wav.ark, flac.ark  (only in feats_type=raw).
+fs=16k               # Sampling rate.
+min_wav_duration=0.1 # Minimum duration in seconds.
+max_wav_duration=20  # Maximum duration in seconds.
+
+# Tokenization related
+token_type=bpe      # Tokenization type (char or bpe).
+nbpe=30             # The number of BPE vocabulary.
+bpemode=unigram     # Mode of BPE (unigram or bpe).
+oov="<unk>"         # Out of vocabulary symbol.
+blank="<blank>"     # CTC blank symbol
+sos_eos="<sos/eos>" # sos and eos symbol
+bpe_input_sentence_size=100000000 # Size of input sentence for BPE.
+bpe_nlsyms=         # non-linguistic symbols list, separated by a comma, for BPE
+bpe_char_cover=1.0  # character coverage when modeling BPE
+
+# Ngram model related
+use_ngram=false
+ngram_exp=
+ngram_num=3
+
+# Language model related
+use_lm=false       # Use language model for ASR decoding.
+lm_tag=           # Suffix to the result dir for language model training.
+lm_exp=           # Specify the directory path for LM experiment.
+                  # If this option is specified, lm_tag is ignored.
+lm_stats_dir=     # Specify the directory path for LM statistics.
+lm_config=        # Config for language model training.
+lm_args=          # Arguments for language model training, e.g., "--max_epoch 10".
+                  # Note that it will overwrite args in lm config.
+use_word_lm=false # Whether to use word language model.
+num_splits_lm=1   # Number of splits for the LM corpus.
+# shellcheck disable=SC2034
+word_vocab_size=10000 # Size of word vocabulary.
+
+# ASR model related
+asr_tag=       # Suffix to the result dir for asr model training.
+asr_exp=       # Specify the directory path for ASR experiment.
+               # If this option is specified, asr_tag is ignored.
+asr_stats_dir= # Specify the directory path for ASR statistics.
+asr_config=    # Config for asr model training.
+asr_args=      # Arguments for asr model training, e.g., "--max_epoch 10".
+               # Note that it will overwrite args in asr config.
+pretrained_model=              # Pretrained model to load
+ignore_init_mismatch=false      # Ignore initial mismatch
+feats_normalize=global_mvn # Normalization layer type.
+num_splits_asr=1           # Number of splits for the ASR corpus.
+
+# Upload model related
+hf_repo=
+
+# Decoding related
+use_k2=false      # Whether to use k2 based decoder
+k2_ctc_decoding=true
+use_nbest_rescoring=true # use transformer-decoder
+                         # and transformer language model for nbest rescoring
+num_paths=1000 # The 3rd argument of k2.random_paths.
+nll_batch_size=100 # Affect GPU memory usage when computing nll
+                   # during nbest rescoring
+k2_config=./conf/decode_asr_transformer_with_k2.yaml
+
+use_streaming=false # Whether to use streaming decoding
+
+use_maskctc=false # Whether to use maskctc decoding
+
+batch_size=1
+inference_tag=    # Suffix to the result dir for decoding.
+inference_config= # Config for decoding.
+inference_args=   # Arguments for decoding, e.g., "--lm_weight 0.1".
+                  # Note that it will overwrite args in inference config.
+inference_lm=valid.loss.ave.pth       # Language model path for decoding.
+inference_ngram=${ngram_num}gram.bin
+inference_asr_model=valid.acc.ave.pth # ASR model path for decoding.
+                                      # e.g.
+                                      # inference_asr_model=train.loss.best.pth
+                                      # inference_asr_model=3epoch.pth
+                                      # inference_asr_model=valid.acc.best.pth
+                                      # inference_asr_model=valid.loss.ave.pth
+download_model= # Download a model from Model Zoo and use it for decoding.
+
+# [Task dependent] Set the datadir name created by local/data.sh
+train_set=       # Name of training set.
+valid_set=       # Name of validation set used for monitoring/tuning network training.
+test_sets=       # Names of test sets. Multiple items (e.g., both dev and eval sets) can be specified.
+bpe_train_text=  # Text file path of bpe training set.
+lm_train_text=   # Text file path of language model training set.
+lm_dev_text=     # Text file path of language model development set.
+lm_test_text=    # Text file path of language model evaluation set.
+nlsyms_txt=none  # Non-linguistic symbol list if existing.
+cleaner=none     # Text cleaner.
+g2p=none         # g2p method (needed if token_type=phn).
+lang=noinfo      # The language type of corpus.
+score_opts=                # The options given to sclite scoring
+local_score_opts=          # The options given to local/score.sh.
+asr_speech_fold_length=800 # fold_length for speech data during ASR training.
+asr_text_fold_length=150   # fold_length for text data during ASR training.
+lm_fold_length=150         # fold_length for LM training.
+
+oss_path=
+token_list=
+scp=
+text=
+
+mode=
+
+help_message=$(cat << EOF
+Usage: $0 --train_set "<train_set_name>" --valid_set "<valid_set_name>" --test_sets "<test_set_names>"
+
+Options:
+    # General configuration
+    --stage          # Processing starts from the specified stage (default="${stage}").
+    --stop_stage     # Processing stops at the specified stage (default="${stop_stage}").
+    --skip_data_prep # Skip data preparation stages (default="${skip_data_prep}").
+    --skip_train     # Skip training stages (default="${skip_train}").
+    --skip_eval      # Skip decoding and evaluation stages (default="${skip_eval}").
+    --skip_upload    # Skip packing and uploading stages (default="${skip_upload}").
+    --ngpu           # The number of gpus ("0" uses cpu, otherwise use gpu, default="${ngpu}").
+    --num_nodes      # The number of nodes (default="${num_nodes}").
+    --nj             # The number of parallel jobs (default="${nj}").
+    --inference_nj   # The number of parallel jobs in decoding (default="${inference_nj}").
+    --gpu_inference  # Whether to perform gpu decoding (default="${gpu_inference}").
+    --dumpdir        # Directory to dump features (default="${dumpdir}").
+    --expdir         # Directory to save experiments (default="${expdir}").
+    --python         # Specify python to execute funasr commands (default="${python}").
+
+    # Data preparation related
+    --local_data_opts # The options given to local/data.sh (default="${local_data_opts}").
+
+    # Speed perturbation related
+    --speed_perturb_factors # speed perturbation factors, e.g. "0.9 1.0 1.1" (separated by space, default="${speed_perturb_factors}").
+
+    # Feature extraction related
+    --feats_type       # Feature type (raw, fbank_pitch or extracted, default="${feats_type}").
+    --audio_format     # Audio format: wav, flac, wav.ark, flac.ark  (only in feats_type=raw, default="${audio_format}").
+    --fs               # Sampling rate (default="${fs}").
+    --min_wav_duration # Minimum duration in seconds (default="${min_wav_duration}").
+    --max_wav_duration # Maximum duration in seconds (default="${max_wav_duration}").
+
+    # Tokenization related
+    --token_type              # Tokenization type (char or bpe, default="${token_type}").
+    --nbpe                    # The number of BPE vocabulary (default="${nbpe}").
+    --bpemode                 # Mode of BPE (unigram or bpe, default="${bpemode}").
+    --oov                     # Out of vocabulary symbol (default="${oov}").
+    --blank                   # CTC blank symbol (default="${blank}").
+    --sos_eos                 # sos and eos symbol (default="${sos_eos}").
+    --bpe_input_sentence_size # Size of input sentence for BPE (default="${bpe_input_sentence_size}").
+    --bpe_nlsyms              # Non-linguistic symbol list for sentencepiece, separated by a comma. (default="${bpe_nlsyms}").
+    --bpe_char_cover          # Character coverage when modeling BPE (default="${bpe_char_cover}").
+
+    # Language model related
+    --lm_tag          # Suffix to the result dir for language model training (default="${lm_tag}").
+    --lm_exp          # Specify the directory path for LM experiment.
+                      # If this option is specified, lm_tag is ignored (default="${lm_exp}").
+    --lm_stats_dir    # Specify the directory path for LM statistics (default="${lm_stats_dir}").
+    --lm_config       # Config for language model training (default="${lm_config}").
+    --lm_args         # Arguments for language model training (default="${lm_args}").
+                      # e.g., --lm_args "--max_epoch 10"
+                      # Note that it will overwrite args in lm config.
+    --use_word_lm     # Whether to use word language model (default="${use_word_lm}").
+    --word_vocab_size # Size of word vocabulary (default="${word_vocab_size}").
+    --num_splits_lm   # Number of splits for the LM corpus (default="${num_splits_lm}").
+
+    # ASR model related
+    --asr_tag          # Suffix to the result dir for asr model training (default="${asr_tag}").
+    --asr_exp          # Specify the directory path for ASR experiment.
+                       # If this option is specified, asr_tag is ignored (default="${asr_exp}").
+    --asr_stats_dir    # Specify the directory path for ASR statistics (default="${asr_stats_dir}").
+    --asr_config       # Config for asr model training (default="${asr_config}").
+    --asr_args         # Arguments for asr model training (default="${asr_args}").
+                       # e.g., --asr_args "--max_epoch 10"
+                       # Note that it will overwrite args in asr config.
+    --pretrained_model      # Pretrained model to load (default="${pretrained_model}").
+    --ignore_init_mismatch  # Ignore parameter mismatches when initializing from the pretrained model (default="${ignore_init_mismatch}").
+    --feats_normalize  # Normalization layer type (default="${feats_normalize}").
+    --num_splits_asr   # Number of splits for the ASR corpus (default="${num_splits_asr}").
+
+    # Decoding related
+    --inference_tag       # Suffix to the result dir for decoding (default="${inference_tag}").
+    --inference_config    # Config for decoding (default="${inference_config}").
+    --inference_args      # Arguments for decoding (default="${inference_args}").
+                          # e.g., --inference_args "--lm_weight 0.1"
+                          # Note that it will overwrite args in inference config.
+    --inference_lm        # Language model path for decoding (default="${inference_lm}").
+    --inference_asr_model # ASR model path for decoding (default="${inference_asr_model}").
+    --download_model      # Download a model from Model Zoo and use it for decoding (default="${download_model}").
+    --use_streaming       # Whether to use streaming decoding (default="${use_streaming}").
+    --use_maskctc         # Whether to use maskctc decoding (default="${use_maskctc}").
+
+    # [Task dependent] Set the datadir name created by local/data.sh
+    --train_set     # Name of training set (required).
+    --valid_set     # Name of validation set used for monitoring/tuning network training (required).
+    --test_sets     # Names of test sets.
+                    # Multiple items (e.g., both dev and eval sets) can be specified (required).
+    --bpe_train_text # Text file path of bpe training set.
+    --lm_train_text  # Text file path of language model training set.
+    --lm_dev_text   # Text file path of language model development set (default="${lm_dev_text}").
+    --lm_test_text  # Text file path of language model evaluation set (default="${lm_test_text}").
+    --nlsyms_txt    # Non-linguistic symbol list if existing (default="${nlsyms_txt}").
+    --cleaner       # Text cleaner (default="${cleaner}").
+    --g2p           # g2p method (default="${g2p}").
+    --lang          # The language type of corpus (default=${lang}).
+    --score_opts             # The options given to sclite scoring (default="{score_opts}").
+    --local_score_opts       # The options given to local/score.sh (default="{local_score_opts}").
+    --asr_speech_fold_length # fold_length for speech data during ASR training (default="${asr_speech_fold_length}").
+    --asr_text_fold_length   # fold_length for text data during ASR training (default="${asr_text_fold_length}").
+    --lm_fold_length         # fold_length for LM training (default="${lm_fold_length}").
+EOF
+)
+
+log "$0 $*"
+# Save command line args for logging (they will be lost after utils/parse_options.sh)
+run_args=$(utils/print_args.py $0 "$@")
+. utils/parse_options.sh
+
+if [ $# -ne 0 ]; then
+    log "${help_message}"
+    log "Error: No positional arguments are required."
+    exit 2
+fi
+
+# set absolute dump dir path
+dumpdir=${datadir}/${dumpdir}
+
+if [ -z "${inference_tag}" ]; then
+    if [ -n "${inference_config}" ]; then
+        inference_tag="$(basename "${inference_config}" .yaml)"
+    else
+        inference_tag=inference
+    fi
+
+    if "${use_k2}"; then
+        inference_tag+="_use_k2"
+        inference_tag+="_k2_ctc_decoding_${k2_ctc_decoding}"
+        inference_tag+="_use_nbest_rescoring_${use_nbest_rescoring}"
+    fi
+fi
+
+# ========================== Main stages start from here. ==========================
+
+if [ ${stage} -le 12 ] && [ ${stop_stage} -ge 12 ]; then
+    log "Stage 12: Decoding: training_dir=${asr_exp}"
+
+    if ${gpu_inference}; then
+        _cmd="${cuda_cmd}"
+        _ngpu=1
+    else
+        _cmd="${decode_cmd}"
+        _ngpu=0
+    fi
+
+    _opts=
+    if [ -n "${inference_config}" ]; then
+        _opts+="--config ${inference_config} "
+    fi
+
+    if "${use_lm}"; then
+        if "${use_word_lm}"; then
+            _opts+="--word_lm_train_config ${lm_exp}/config.yaml "
+            _opts+="--word_lm_file ${lm_exp}/${inference_lm} "
+        else
+            _opts+="--lm_train_config ${lm_exp}/config.yaml "
+            _opts+="--lm_file ${lm_exp}/${inference_lm} "
+        fi
+    fi
+
+    if "${use_ngram}"; then
+         _opts+="--ngram_file ${ngram_exp}/${inference_ngram} "
+         inference_tag=${inference_tag}.${inference_ngram}
+    fi
+
+    # 2. Generate run.sh
+    log "Generate '${asr_exp}/${inference_tag}/run.sh'. You can resume the process from stage 12 using this script"
+    mkdir -p "${asr_exp}/${inference_tag}"; echo "${run_args} --stage 12 \"\$@\"; exit \$?" > "${asr_exp}/${inference_tag}/run.sh"; chmod +x "${asr_exp}/${inference_tag}/run.sh"
+
+    if "${use_streaming}"; then
+        asr_inference_tool="funasr.bin.asr_inference_streaming"
+    elif "${use_maskctc}"; then
+        asr_inference_tool="funasr.bin.asr_inference_maskctc"
+    else
+        asr_inference_tool="funasr.bin.asr_inference_launch"
+    fi
+
+    for dset in ${test_sets}; do
+        if [ $feats_type == "ark_wav" ]; then
+            _data="${dumpdir}/wav/${dset}"
+        else
+            _data="${dumpdir}/$feats_type/${dset}"
+        fi
+        _dir="${asr_exp}/${inference_tag}/${inference_asr_model}/${dset}"
+        _logdir="${_dir}/logdir"
+
+        if [ -d ${_dir} ]; then
+            #echo "${_dir} already exists. If you want to decode again, please delete this dir first."
+            rm -r ${_dir}
+        fi
+        mkdir -p "${_logdir}"
+
+        _scp=$scp
+        _type=kaldi_ark
+
+
+        # 1. Split the key file
+        key_file=${_data}/${_scp}
+        split_scps=""
+        if "${use_k2}"; then
+            # Now only _nj=1 is verified if using k2
+            _nj=1
+        else
+            _nj=$(min "${inference_nj}" "$(<${key_file} wc -l)")
+        fi
+
+        for n in $(seq "${_nj}"); do
+            split_scps+=" ${_logdir}/keys.${n}.scp"
+        done
+        # shellcheck disable=SC2086
+        utils/split_scp.pl "${key_file}" ${split_scps}
+
+        # 2. Submit decoding jobs
+        log "Decoding started... log: '${_logdir}/asr_inference.*.log'"
+        # shellcheck disable=SC2086
+        ${_cmd} --gpu "${_ngpu}" --max-jobs-run "${_nj}" JOB=1:"${_nj}" "${_logdir}"/asr_inference.JOB.log \
+            ${python} -m ${asr_inference_tool} \
+                --batch_size ${batch_size} \
+                --ngpu "${_ngpu}" \
+                --njob ${njob} \
+                --gpuid_list ${gpuid_list} \
+                --data_path_and_name_and_type "${_data}/${_scp},speech,${_type}" \
+                --key_file "${_logdir}"/keys.JOB.scp \
+                --asr_train_config "${asr_exp}"/config.yaml \
+                --asr_model_file "${asr_exp}"/"${inference_asr_model}" \
+                --output_dir "${_logdir}"/output.JOB \
+                --mode $mode \
+                ${_opts} ${inference_args}
+
+        # 3. Concatenate the output files from each job
+        for f in token token_int score text; do
+            if [ -f "${_logdir}/output.1/1best_recog/${f}" ]; then
+                for i in $(seq "${_nj}"); do
+                    cat "${_logdir}/output.${i}/1best_recog/${f}"
+                done | sort -k1 >"${_dir}/${f}"
+            fi
+        done
+        python utils/proce_text.py ${_dir}/text ${_dir}/text.proc
+        python utils/proce_text.py ${_data}/text ${_data}/text.proc
+        python utils/compute_wer.py ${_data}/text.proc ${_dir}/text.proc ${_dir}/text.cer
+        tail -n 3 ${_dir}/text.cer > ${_dir}/text.cer.txt
+        cat ${_dir}/text.cer.txt
+    done
+fi
+
+log "Successfully finished. [elapsed=${SECONDS}s]"
+
diff --git a/egs/aishell/tranformer/utils/error_rate_zh b/egs/aishell/tranformer/utils/error_rate_zh
new file mode 100755
index 000000000..6871a07fa
--- /dev/null
+++ b/egs/aishell/tranformer/utils/error_rate_zh
@@ -0,0 +1,370 @@
+#!/usr/bin/env python3
+# coding=utf8
+
+# Copyright  2021  Jiayu DU
+
+import sys
+import argparse
+import json
+import logging
+logging.basicConfig(stream=sys.stderr, level=logging.INFO, format='[%(levelname)s] %(message)s')
+
+DEBUG = None
+
+def GetEditType(ref_token, hyp_token):
+    if ref_token is None and hyp_token is not None:
+        return 'I'
+    elif ref_token is not None and hyp_token is None:
+        return 'D'
+    elif ref_token == hyp_token:
+        return 'C'
+    elif ref_token != hyp_token:
+        return 'S'
+    else:
+        raise RuntimeError
+
+class AlignmentArc:
+    def __init__(self, src, dst, ref, hyp):
+        self.src = src
+        self.dst = dst
+        self.ref = ref
+        self.hyp = hyp
+        self.edit_type = GetEditType(ref, hyp)
+
+def similarity_score_function(ref_token, hyp_token):
+    return 0 if (ref_token == hyp_token) else -1.0
+
+def insertion_score_function(token):
+    return -1.0
+
+def deletion_score_function(token):
+    return -1.0
+
+def EditDistance(
+        ref,
+        hyp, 
+        similarity_score_function = similarity_score_function,
+        insertion_score_function = insertion_score_function,
+        deletion_score_function = deletion_score_function):
+    assert(len(ref) != 0)
+    class DPState:
+        def __init__(self):
+            self.score = -float('inf')
+            # backpointer
+            self.prev_r = None
+            self.prev_h = None
+    
+    def print_search_grid(S, R, H, fstream):
+        print(file=fstream)
+        for r in range(R):
+            for h in range(H):
+                print(F'[{r},{h}]:{S[r][h].score:4.3f}:({S[r][h].prev_r},{S[r][h].prev_h}) ', end='', file=fstream)
+            print(file=fstream)
+
+    R = len(ref) + 1
+    H = len(hyp) + 1
+
+    # Construct DP search space, a (R x H) grid
+    S = [ [] for r in range(R) ]
+    for r in range(R):
+        S[r] = [ DPState() for x in range(H) ]
+
+    # initialize DP search grid origin, S(r = 0, h = 0)
+    S[0][0].score = 0.0
+    S[0][0].prev_r = None
+    S[0][0].prev_h = None
+
+    # initialize REF axis
+    for r in range(1, R):
+        S[r][0].score = S[r-1][0].score + deletion_score_function(ref[r-1])
+        S[r][0].prev_r = r-1
+        S[r][0].prev_h = 0
+
+    # initialize HYP axis
+    for h in range(1, H):
+        S[0][h].score = S[0][h-1].score + insertion_score_function(hyp[h-1])
+        S[0][h].prev_r = 0
+        S[0][h].prev_h = h-1
+
+    best_score = S[0][0].score
+    best_state = (0, 0)
+
+    for r in range(1, R):
+        for h in range(1, H):
+            sub_or_cor_score = similarity_score_function(ref[r-1], hyp[h-1])
+            new_score = S[r-1][h-1].score + sub_or_cor_score
+            if new_score >= S[r][h].score:
+                S[r][h].score = new_score
+                S[r][h].prev_r = r-1
+                S[r][h].prev_h = h-1
+
+            del_score = deletion_score_function(ref[r-1])
+            new_score = S[r-1][h].score + del_score
+            if new_score >= S[r][h].score:
+                S[r][h].score = new_score
+                S[r][h].prev_r = r - 1
+                S[r][h].prev_h = h
+
+            ins_score = insertion_score_function(hyp[h-1])
+            new_score = S[r][h-1].score + ins_score
+            if new_score >= S[r][h].score:
+                S[r][h].score = new_score
+                S[r][h].prev_r = r
+                S[r][h].prev_h = h-1
+
+    best_score = S[R-1][H-1].score
+    best_state = (R-1, H-1)
+
+    if DEBUG:
+        print_search_grid(S, R, H, sys.stderr)
+
+    # Backtracing best alignment path, i.e. a list of arcs
+    # arc = (src, dst, ref, hyp, edit_type)
+    # src/dst = (r, h), where r/h refers to search grid state-id along Ref/Hyp axis
+    best_path = []
+    r, h = best_state[0], best_state[1]
+    prev_r, prev_h = S[r][h].prev_r, S[r][h].prev_h
+    score = S[r][h].score
+    # loop invariant:
+    #   1. (prev_r, prev_h) -> (r, h) is a "forward arc" on best alignment path
+    #   2. score is the value of point(r, h) on DP search grid
+    while prev_r is not None or prev_h is not None:
+        src = (prev_r, prev_h)
+        dst = (r, h)
+        if (r == prev_r + 1 and h == prev_h + 1): # Substitution or correct
+            arc = AlignmentArc(src, dst, ref[prev_r], hyp[prev_h])
+        elif (r == prev_r + 1 and h == prev_h): # Deletion
+            arc = AlignmentArc(src, dst, ref[prev_r], None)
+        elif (r == prev_r and h == prev_h + 1): # Insertion
+            arc = AlignmentArc(src, dst, None, hyp[prev_h])
+        else:
+            raise RuntimeError
+        best_path.append(arc)
+        r, h = prev_r, prev_h
+        prev_r, prev_h = S[r][h].prev_r, S[r][h].prev_h
+        score = S[r][h].score
+    
+    best_path.reverse()
+    return (best_path, best_score)
+
+def PrettyPrintAlignment(alignment, stream = sys.stderr):
+    def get_token_str(token):
+        if token is None:
+            return "*"
+        return token
+    
+    def is_double_width_char(ch):
+        if (ch >= '\u4e00') and (ch <= '\u9fa5'): # codepoint ranges for Chinese chars
+            return True
+        # TODO: support other double-width-char language such as Japanese, Korean 
+        else:
+            return False
+    
+    def display_width(token_str):
+        m = 0
+        for c in token_str:
+            if is_double_width_char(c):
+                m += 2
+            else:
+                m += 1
+        return m
+
+    R = '  REF  : '
+    H = '  HYP  : '
+    E = '  EDIT : '
+    for arc in alignment:
+        r = get_token_str(arc.ref)
+        h = get_token_str(arc.hyp)
+        e = arc.edit_type if arc.edit_type != 'C' else ''
+
+        nr, nh, ne = display_width(r), display_width(h), display_width(e)
+        n = max(nr, nh, ne) + 1
+
+        R += r + ' ' * (n-nr)
+        H += h + ' ' * (n-nh)
+        E += e + ' ' * (n-ne)
+
+    print(R, file=stream)
+    print(H, file=stream)
+    print(E, file=stream)
+
+def CountEdits(alignment):
+    c, s, i, d = 0, 0, 0, 0
+    for arc in alignment:
+        if arc.edit_type == 'C':
+            c += 1
+        elif arc.edit_type == 'S':
+            s += 1
+        elif arc.edit_type == 'I':
+            i += 1
+        elif arc.edit_type == 'D':
+            d += 1
+        else:
+            raise RuntimeError
+    return (c, s, i, d)
+
+def ComputeTokenErrorRate(c, s, i, d):
+    return 100.0 * (s + d + i) / (s + d + c)
+
+def ComputeSentenceErrorRate(num_err_utts, num_utts):
+    assert(num_utts != 0)
+    return 100.0 * num_err_utts / num_utts
+
+
+class EvaluationResult:
+    def __init__(self):
+        self.num_ref_utts = 0
+        self.num_hyp_utts = 0
+        self.num_eval_utts = 0 # seen in both ref & hyp
+        self.num_hyp_without_ref = 0
+
+        self.C = 0
+        self.S = 0
+        self.I = 0
+        self.D = 0
+        self.token_error_rate = 0.0
+
+        self.num_utts_with_error = 0
+        self.sentence_error_rate = 0.0
+    
+    def to_json(self):
+        return json.dumps(self.__dict__)
+    
+    def to_kaldi(self):
+        info = (
+            F'%WER {self.token_error_rate:.2f} [ {self.S + self.D + self.I} / {self.C + self.S + self.D}, {self.I} ins, {self.D} del, {self.S} sub ]\n'
+            F'%SER {self.sentence_error_rate:.2f} [ {self.num_utts_with_error} / {self.num_eval_utts} ]\n'
+        )
+        return info
+    
+    def to_sclite(self):
+        return "TODO"
+    
+    def to_espnet(self):
+        return "TODO"
+    
+    def to_summary(self):
+        #return json.dumps(self.__dict__, indent=4)
+        summary = (
+            '==================== Overall Statistics ====================\n'
+            F'num_ref_utts: {self.num_ref_utts}\n'
+            F'num_hyp_utts: {self.num_hyp_utts}\n'
+            F'num_hyp_without_ref: {self.num_hyp_without_ref}\n'
+            F'num_eval_utts: {self.num_eval_utts}\n'
+            F'sentence_error_rate: {self.sentence_error_rate:.2f}%\n'
+            F'token_error_rate: {self.token_error_rate:.2f}%\n'
+            F'token_stats:\n'
+            F'  - tokens:{self.C + self.S + self.D:>7}\n'
+            F'  - edits: {self.S + self.I + self.D:>7}\n'
+            F'  - cor:   {self.C:>7}\n'
+            F'  - sub:   {self.S:>7}\n'
+            F'  - ins:   {self.I:>7}\n'
+            F'  - del:   {self.D:>7}\n'
+            '============================================================\n'
+        )
+        return summary
+
+
+class Utterance:
+    def __init__(self, uid, text):
+        self.uid = uid
+        self.text = text
+
+
+def LoadUtterances(filepath, format):
+    utts = {}
+    if format == 'text': # utt_id word1 word2 ...
+        with open(filepath, 'r', encoding='utf8') as f:
+            for line in f:
+                line = line.strip()
+                if line:
+                    cols = line.split(maxsplit=1)
+                    assert(len(cols) == 2 or len(cols) == 1)
+                    uid = cols[0]
+                    text = cols[1] if len(cols) == 2 else ''
+                    if utts.get(uid) is not None:
+                        raise RuntimeError(F'Found duplicated utterance id {uid}')
+                    utts[uid] = Utterance(uid, text)
+    else:
+        raise RuntimeError(F'Unsupported text format {format}')
+    return utts
+
+
+def tokenize_text(text, tokenizer):
+    if tokenizer == 'whitespace':
+        return text.split()
+    elif tokenizer == 'char':
+        return [ ch for ch in ''.join(text.split()) ]
+    else:
+        raise RuntimeError(F'ERROR: Unsupported tokenizer {tokenizer}')
+
+
+if __name__ == '__main__':
+    parser = argparse.ArgumentParser()
+    # optional
+    parser.add_argument('--tokenizer', choices=['whitespace', 'char'], default='whitespace', help='whitespace for WER, char for CER')
+    parser.add_argument('--ref-format', choices=['text'], default='text', help='reference format, first col is utt_id, the rest is text')
+    parser.add_argument('--hyp-format', choices=['text'], default='text', help='hypothesis format, first col is utt_id, the rest is text')
+    # required
+    parser.add_argument('--ref', type=str, required=True, help='input reference file')
+    parser.add_argument('--hyp', type=str, required=True, help='input hypothesis file')
+
+    parser.add_argument('result_file', type=str)
+    args = parser.parse_args()
+    logging.info(args)
+
+    ref_utts = LoadUtterances(args.ref, args.ref_format)
+    hyp_utts = LoadUtterances(args.hyp, args.hyp_format)
+
+    r = EvaluationResult()
+
+    # check valid utterances in hyp that have matched non-empty reference
+    eval_utts = []
+    r.num_hyp_without_ref = 0
+    for uid in sorted(hyp_utts.keys()):
+        if uid in ref_utts.keys(): # TODO: efficiency
+            if ref_utts[uid].text.strip(): # non-empty reference
+                eval_utts.append(uid)
+            else:
+                logging.warning(F'Found {uid} with empty reference, skipping...')
+        else:
+            logging.warning(F'Found {uid} without reference, skipping...')
+            r.num_hyp_without_ref += 1
+
+    r.num_hyp_utts = len(hyp_utts)
+    r.num_ref_utts = len(ref_utts)
+    r.num_eval_utts = len(eval_utts)
+
+    with open(args.result_file, 'w+', encoding='utf8') as fo:
+        for uid in eval_utts:
+            ref = ref_utts[uid]
+            hyp = hyp_utts[uid]
+
+            alignment, score = EditDistance(
+                tokenize_text(ref.text, args.tokenizer),
+                tokenize_text(hyp.text, args.tokenizer)
+            )
+            
+            c, s, i, d = CountEdits(alignment)
+            utt_ter = ComputeTokenErrorRate(c, s, i, d)
+
+            # utt-level evaluation result
+            print(F'{{"uid":"{uid}", "score":{score}, "ter":{utt_ter:.2f}, "cor":{c}, "sub":{s}, "ins":{i}, "del":{d}}}', file=fo)
+            PrettyPrintAlignment(alignment, fo)
+
+            r.C += c
+            r.S += s
+            r.I += i
+            r.D += d
+
+            if utt_ter > 0:
+                r.num_utts_with_error += 1
+
+        # corpus level evaluation result
+        r.sentence_error_rate = ComputeSentenceErrorRate(r.num_utts_with_error, r.num_eval_utts)
+        r.token_error_rate = ComputeTokenErrorRate(r.C, r.S, r.I, r.D)
+
+        print(r.to_summary(), file=fo)
+
+    print(r.to_json())
+    print(r.to_kaldi())
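Note on `error_rate_zh`: the token error rate it reports is (S + D + I) / (S + D + C), that is, edits divided by the number of reference tokens. The following is a minimal Python sketch of that computation, not the script's own code. The names `edit_counts` and `token_error_rate` are hypothetical, and the sketch carries the edit counts through the DP table directly instead of backtracing an alignment as `EditDistance` does.

```python
def edit_counts(ref, hyp):
    """Return (cor, sub, ins, del) for a minimum-edit alignment of hyp to ref."""
    # prev[h] = (edits, cor, sub, ins, del) for ref[:r-1] vs hyp[:h]
    prev = [(h, 0, 0, h, 0) for h in range(len(hyp) + 1)]  # only insertions
    for r in range(1, len(ref) + 1):
        cur = [(r, 0, 0, 0, r)]  # only deletions
        for h in range(1, len(hyp) + 1):
            e, c, s, i, d = prev[h - 1]
            if ref[r - 1] == hyp[h - 1]:
                diag = (e, c + 1, s, i, d)       # correct
            else:
                diag = (e + 1, c, s + 1, i, d)   # substitution
            e, c, s, i, d = prev[h]
            dele = (e + 1, c, s, i, d + 1)       # deletion
            e, c, s, i, d = cur[h - 1]
            ins = (e + 1, c, s, i + 1, d)        # insertion
            cur.append(min(diag, dele, ins, key=lambda t: t[0]))
        prev = cur
    return prev[-1][1:]


def token_error_rate(c, s, i, d):
    # Denominator is the reference length: every ref token is C, S or D.
    return 100.0 * (s + d + i) / (s + d + c)


if __name__ == "__main__":
    c, s, i, d = edit_counts(list("今天天气很好"), list("今天气很不好"))
    print(c, s, i, d, round(token_error_rate(c, s, i, d), 2))
```

Because insertions count in the numerator but not the denominator, the rate can exceed 100% when the hypothesis is much longer than the reference.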
diff --git a/egs/aishell/tranformer/utils/extract_embeds.py b/egs/aishell/tranformer/utils/extract_embeds.py
new file mode 100755
index 000000000..7b817d8ca
--- /dev/null
+++ b/egs/aishell/tranformer/utils/extract_embeds.py
@@ -0,0 +1,47 @@
+from transformers import AutoTokenizer, AutoModel, pipeline
+import numpy as np
+import sys
+import os
+import torch
+from kaldiio import WriteHelper
+import re
+text_file_json = sys.argv[1]
+out_ark = sys.argv[2]
+out_scp = sys.argv[3]
+out_shape = sys.argv[4]
+device = int(sys.argv[5])
+model_path = sys.argv[6]
+
+model = AutoModel.from_pretrained(model_path)
+tokenizer = AutoTokenizer.from_pretrained(model_path)
+extractor = pipeline(task="feature-extraction", model=model, tokenizer=tokenizer, device=device)
+
+with open(text_file_json, 'r', encoding='utf-8') as f:
+    js = f.readlines()
+
+
+f_shape = open(out_shape, "w")
+with WriteHelper('ark,scp:{},{}'.format(out_ark, out_scp)) as writer:
+    with torch.no_grad():
+        for idx, line in enumerate(js):
+            id, tokens = line.strip().split(" ", 1)
+            tokens = re.sub(" ", "", tokens.strip())
+            tokens = ' '.join([j for j in tokens])
+            token_num = len(tokens.split(" "))
+            outputs = extractor(tokens)
+            outputs = np.array(outputs)
+            embeds = outputs[0, 1:-1, :]
+
+            token_num_embeds, dim = embeds.shape
+            if token_num == token_num_embeds:
+                writer(id, embeds)
+                shape_line = "{} {},{}\n".format(id, token_num_embeds, dim)
+                f_shape.write(shape_line)
+            else:
+                print("{}, size has changed, {}, {}, {}".format(id, token_num, token_num_embeds, tokens))
+
+
+
+f_shape.close()
+
+
diff --git a/egs/aishell/tranformer/utils/filter_scp.pl b/egs/aishell/tranformer/utils/filter_scp.pl
new file mode 100755
index 000000000..003530d53
--- /dev/null
+++ b/egs/aishell/tranformer/utils/filter_scp.pl
@@ -0,0 +1,87 @@
+#!/usr/bin/env perl
+# Copyright 2010-2012 Microsoft Corporation
+#                     Johns Hopkins University (author: Daniel Povey)
+
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#  http://www.apache.org/licenses/LICENSE-2.0
+#
+# THIS CODE IS PROVIDED *AS IS* BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY IMPLIED
+# WARRANTIES OR CONDITIONS OF TITLE, FITNESS FOR A PARTICULAR PURPOSE,
+# MERCHANTABLITY OR NON-INFRINGEMENT.
+# See the Apache 2 License for the specific language governing permissions and
+# limitations under the License.
+
+
+# This script takes a list of utterance-ids or any file whose first field
+# of each line is an utterance-id, and filters an scp
+# file (or any file whose "n-th" field is an utterance id), printing
+# out only those lines whose "n-th" field is in id_list. The index of
+# the "n-th" field is 1, by default, but can be changed by using
+# the -f <n> switch
+
+$exclude = 0;
+$field = 1;
+$shifted = 0;
+
+do {
+  $shifted=0;
+  if ($ARGV[0] eq "--exclude") {
+    $exclude = 1;
+    shift @ARGV;
+    $shifted=1;
+  }
+  if ($ARGV[0] eq "-f") {
+    $field = $ARGV[1];
+    shift @ARGV; shift @ARGV;
+    $shifted=1
+  }
+} while ($shifted);
+
+if(@ARGV < 1 || @ARGV > 2) {
+  die "Usage: filter_scp.pl [--exclude] [-f <field-to-filter-on>] id_list [in.scp] > out.scp \n" .
+      "Prints only the input lines whose f'th field (default: first) is in 'id_list'.\n" .
+      "Note: only the first field of each line in id_list matters.  With --exclude, prints\n" .
+      "only the lines that were *not* in id_list.\n" .
+      "Caution: previously, the -f option was interpreted as a zero-based field index.\n" .
+      "If your older scripts (written before Oct 2014) stopped working and you used the\n" .
+      "-f option, add 1 to the argument.\n" .
+      "See also: scripts/filter_scp.pl .\n";
+}
+
+
+$idlist = shift @ARGV;
+open(F, "<$idlist") || die "Could not open id-list file $idlist";
+while(<F>) {
+  @A = split;
+  @A>=1 || die "Invalid id-list file line $_";
+  $seen{$A[0]} = 1;
+}
+
+if ($field == 1) { # Treat this as special case, since it is common.
+  while(<>) {
+    $_ =~ m/\s*(\S+)\s*/ || die "Bad line $_, could not get first field.";
+    # $1 is what we filter on.
+    if ((!$exclude && $seen{$1}) || ($exclude && !defined $seen{$1})) {
+      print $_;
+    }
+  }
+} else {
+  while(<>) {
+    @A = split;
+    @A > 0 || die "Invalid scp file line $_";
+    @A >= $field || die "Invalid scp file line $_";
+    if ((!$exclude && $seen{$A[$field-1]}) || ($exclude && !defined $seen{$A[$field-1]})) {
+      print $_;
+    }
+  }
+}
+
+# tests:
+# the following should print "foo 1"
+# ( echo foo 1; echo bar 2 ) | scripts/filter_scp.pl <(echo foo)
+# the following should print "bar 2".
+# ( echo foo 1; echo bar 2 ) | scripts/filter_scp.pl -f 2 <(echo 2)
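Note on `filter_scp.pl`: the filtering rule described in its header comment can be sketched in Python as below. This is an illustration under stated assumptions, not a port: `filter_by_id` is a hypothetical name, it works on in-memory lists of lines rather than files, and it skips blank id-list lines where the Perl script dies.

```python
def filter_by_id(id_lines, scp_lines, field=1, exclude=False):
    """Keep lines whose 1-based `field` is in the id list (or is not, with exclude)."""
    # Only the first field of each id-list line matters.
    seen = {line.split()[0] for line in id_lines if line.split()}
    kept = []
    for line in scp_lines:
        cols = line.split()
        if len(cols) < field:
            raise ValueError(f"Invalid scp line: {line!r}")
        # Membership XOR exclude decides whether the line is printed.
        if (cols[field - 1] in seen) != exclude:
            kept.append(line)
    return kept


if __name__ == "__main__":
    scp = ["foo 1", "bar 2"]
    print(filter_by_id(["foo"], scp))
    print(filter_by_id(["2"], scp, field=2))
```

The two calls mirror the two self-tests listed at the end of the Perl script.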
diff --git a/egs/aishell/tranformer/utils/fix_data.sh b/egs/aishell/tranformer/utils/fix_data.sh
new file mode 100755
index 000000000..32cdde593
--- /dev/null
+++ b/egs/aishell/tranformer/utils/fix_data.sh
@@ -0,0 +1,35 @@
+#!/usr/bin/env bash
+
+echo "$0 $@"
+data_dir=$1
+
+if [ ! -f ${data_dir}/wav.scp ]; then
+  echo "$0: wav.scp is not found"
+  exit 1;
+fi
+
+if [ ! -f ${data_dir}/text ]; then
+  echo "$0: text is not found"
+  exit 1;
+fi
+
+
+
+mkdir -p ${data_dir}/.backup
+
+awk '{print $1}' ${data_dir}/wav.scp > ${data_dir}/.backup/wav_id
+awk '{print $1}' ${data_dir}/text > ${data_dir}/.backup/text_id
+
+sort ${data_dir}/.backup/wav_id ${data_dir}/.backup/text_id | uniq -d > ${data_dir}/.backup/id
+
+cp ${data_dir}/wav.scp ${data_dir}/.backup/wav.scp
+cp ${data_dir}/text ${data_dir}/.backup/text
+
+mv ${data_dir}/wav.scp ${data_dir}/wav.scp.bak
+mv ${data_dir}/text ${data_dir}/text.bak
+
+utils/filter_scp.pl -f 1 ${data_dir}/.backup/id ${data_dir}/wav.scp.bak > ${data_dir}/wav.scp
+utils/filter_scp.pl -f 1 ${data_dir}/.backup/id ${data_dir}/text.bak > ${data_dir}/text
+
+rm ${data_dir}/wav.scp.bak
+rm ${data_dir}/text.bak
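Note on `fix_data.sh`: the script keeps only the utterances whose id appears in both `wav.scp` and `text`. A rough Python sketch of the same idea follows; `keep_common` is a hypothetical helper operating on lists of lines. One assumption differs from the shell version: the sketch takes a true set intersection, whereas `sort | uniq -d` would also report an id that is duplicated inside a single file.

```python
def keep_common(wav_lines, text_lines):
    """Return both line lists restricted to ids present in each of them."""
    def first_field(line):
        return line.split(maxsplit=1)[0]

    wav_ids = {first_field(l) for l in wav_lines if l.strip()}
    text_ids = {first_field(l) for l in text_lines if l.strip()}
    common = wav_ids & text_ids

    def keep(lines):
        # Preserve the original order of each file.
        return [l for l in lines if l.strip() and first_field(l) in common]

    return keep(wav_lines), keep(text_lines)


if __name__ == "__main__":
    wav = ["utt1 /path/a.wav", "utt2 /path/b.wav"]
    text = ["utt2 hello", "utt3 world"]
    print(keep_common(wav, text))
```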
diff --git a/egs/aishell/tranformer/utils/fix_data_feat.sh b/egs/aishell/tranformer/utils/fix_data_feat.sh
new file mode 100755
index 000000000..2c92d7f71
--- /dev/null
+++ b/egs/aishell/tranformer/utils/fix_data_feat.sh
@@ -0,0 +1,52 @@
+#!/usr/bin/env bash
+
+echo "$0 $@"
+data_dir=$1
+
+if [ ! -f ${data_dir}/feats.scp ]; then
+  echo "$0: feats.scp is not found"
+  exit 1;
+fi
+
+if [ ! -f ${data_dir}/text ]; then
+  echo "$0: text is not found"
+  exit 1;
+fi
+
+if [ ! -f ${data_dir}/speech_shape ]; then
+  echo "$0: speech_shape is not found"
+  exit 1;
+fi
+
+if [ ! -f ${data_dir}/text_shape ]; then
+  echo "$0: text_shape is not found"
+  exit 1;
+fi
+
+mkdir -p ${data_dir}/.backup
+
+awk '{print $1}' ${data_dir}/feats.scp > ${data_dir}/.backup/wav_id
+awk '{print $1}' ${data_dir}/text > ${data_dir}/.backup/text_id
+
+sort ${data_dir}/.backup/wav_id ${data_dir}/.backup/text_id | uniq -d > ${data_dir}/.backup/id
+
+cp ${data_dir}/feats.scp ${data_dir}/.backup/feats.scp
+cp ${data_dir}/text ${data_dir}/.backup/text
+cp ${data_dir}/speech_shape ${data_dir}/.backup/speech_shape
+cp ${data_dir}/text_shape ${data_dir}/.backup/text_shape
+
+mv ${data_dir}/feats.scp ${data_dir}/feats.scp.bak
+mv ${data_dir}/text ${data_dir}/text.bak
+mv ${data_dir}/speech_shape ${data_dir}/speech_shape.bak
+mv ${data_dir}/text_shape ${data_dir}/text_shape.bak
+
+utils/filter_scp.pl -f 1 ${data_dir}/.backup/id ${data_dir}/feats.scp.bak > ${data_dir}/feats.scp
+utils/filter_scp.pl -f 1 ${data_dir}/.backup/id ${data_dir}/text.bak > ${data_dir}/text
+utils/filter_scp.pl -f 1 ${data_dir}/.backup/id ${data_dir}/speech_shape.bak > ${data_dir}/speech_shape
+utils/filter_scp.pl -f 1 ${data_dir}/.backup/id ${data_dir}/text_shape.bak > ${data_dir}/text_shape
+
+rm ${data_dir}/feats.scp.bak
+rm ${data_dir}/text.bak
+rm ${data_dir}/speech_shape.bak
+rm ${data_dir}/text_shape.bak
+
diff --git a/egs/aishell/tranformer/utils/gen_ark_list.sh b/egs/aishell/tranformer/utils/gen_ark_list.sh
new file mode 100755
index 000000000..be60f7be8
--- /dev/null
+++ b/egs/aishell/tranformer/utils/gen_ark_list.sh
@@ -0,0 +1,20 @@
+#!/usr/bin/env bash
+
+
+# Begin configuration section.
+nj=4
+cmd=./utils/run.pl
+
+echo "$0 $@"
+
+. utils/parse_options.sh || exit 1;
+
+data=$1
+
+[ ! -d ${data}/ark ] && echo "$0: ark data is required" && exit 1;
+[ ! -d ${data}/txt ] && echo "$0: txt data is required" && exit 1;
+
+for n in $(seq $nj); do
+  echo "$data/ark/feats.$n.ark $data/txt/text.$n" || exit 1
+done > $data/ark_txt.scp || exit 1
+
diff --git a/egs/aishell/tranformer/utils/parse_options.sh b/egs/aishell/tranformer/utils/parse_options.sh
new file mode 100755
index 000000000..71fb9e5ea
--- /dev/null
+++ b/egs/aishell/tranformer/utils/parse_options.sh
@@ -0,0 +1,97 @@
+#!/usr/bin/env bash
+
+# Copyright 2012  Johns Hopkins University (Author: Daniel Povey);
+#                 Arnab Ghoshal, Karel Vesely
+
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#  http://www.apache.org/licenses/LICENSE-2.0
+#
+# THIS CODE IS PROVIDED *AS IS* BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY IMPLIED
+# WARRANTIES OR CONDITIONS OF TITLE, FITNESS FOR A PARTICULAR PURPOSE,
+# MERCHANTABLITY OR NON-INFRINGEMENT.
+# See the Apache 2 License for the specific language governing permissions and
+# limitations under the License.
+
+
+# Parse command-line options.
+# To be sourced by another script (as in ". parse_options.sh").
+# Option format is: --option-name arg
+# and shell variable "option_name" gets set to value "arg."
+# The exception is --help, which takes no arguments, but prints the
+# $help_message variable (if defined).
+
+
+###
+### The --config file options have lower priority to command line
+### options, so we need to import them first...
+###
+
+# Now import all the configs specified by command-line, in left-to-right order
+for ((argpos=1; argpos<$#; argpos++)); do
+  if [ "${!argpos}" == "--config" ]; then
+    argpos_plus1=$((argpos+1))
+    config=${!argpos_plus1}
+    [ ! -r $config ] && echo "$0: missing config '$config'" && exit 1
+    . $config  # source the config file.
+  fi
+done
+
+
+###
+### Now we process the command line options
+###
+while true; do
+  [ -z "${1:-}" ] && break;  # break if there are no arguments
+  case "$1" in
+    # If the enclosing script is called with --help option, print the help
+    # message and exit.  Scripts should put help messages in $help_message
+    --help|-h) if [ -z "$help_message" ]; then echo "No help found." 1>&2;
+      else printf "$help_message\n" 1>&2 ; fi;
+      exit 0 ;;
+    --*=*) echo "$0: options to scripts must be of the form --name value, got '$1'"
+      exit 1 ;;
+    # If the first command-line argument begins with "--" (e.g. --foo-bar),
+    # then work out the variable name as $name, which will equal "foo_bar".
+    --*) name=`echo "$1" | sed s/^--// | sed s/-/_/g`;
+      # Next we test whether the variable in question is undefined-- if so it's
+      # an invalid option and we die.  Note: $0 evaluates to the name of the
+      # enclosing script.
+      # The test [ -z ${foo_bar+xxx} ] will return true if the variable foo_bar
+      # is undefined.  We then have to wrap this test inside "eval" because
+      # foo_bar is itself inside a variable ($name).
+      eval '[ -z "${'$name'+xxx}" ]' && echo "$0: invalid option $1" 1>&2 && exit 1;
+
+      oldval="`eval echo \\$$name`";
+      # Work out whether we seem to be expecting a Boolean argument.
+      if [ "$oldval" == "true" ] || [ "$oldval" == "false" ]; then
+        was_bool=true;
+      else
+        was_bool=false;
+      fi
+
+      # Set the variable to the right value-- the escaped quotes make it work if
+      # the option had spaces, like --cmd "queue.pl -sync y"
+      eval $name=\"$2\";
+
+      # Check that Boolean-valued arguments are really Boolean.
+      if $was_bool && [[ "$2" != "true" && "$2" != "false" ]]; then
+        echo "$0: expected \"true\" or \"false\": $1 $2" 1>&2
+        exit 1;
+      fi
+      shift 2;
+      ;;
+  *) break;
+  esac
+done
+
+
+# Check for an empty argument to the --cmd option, which can easily occur as a
+# result of scripting errors.
+[ ! -z "${cmd+xxx}" ] && [ -z "$cmd" ] && echo "$0: empty argument to --cmd option" 1>&2 && exit 1;
+
+
+true; # so this script returns exit code 0.
diff --git a/egs/aishell/tranformer/utils/print_args.py b/egs/aishell/tranformer/utils/print_args.py
new file mode 100755
index 000000000..b0c61e5b4
--- /dev/null
+++ b/egs/aishell/tranformer/utils/print_args.py
@@ -0,0 +1,45 @@
+#!/usr/bin/env python
+import sys
+
+
+def get_commandline_args(no_executable=True):
+    extra_chars = [
+        " ",
+        ";",
+        "&",
+        "|",
+        "<",
+        ">",
+        "?",
+        "*",
+        "~",
+        "`",
+        '"',
+        "'",
+        "\\",
+        "{",
+        "}",
+        "(",
+        ")",
+    ]
+
+    # Escape the extra characters for shell
+    argv = [
+        arg.replace("'", "'\\''")
+        if all(char not in arg for char in extra_chars)
+        else "'" + arg.replace("'", "'\\''") + "'"
+        for arg in sys.argv
+    ]
+
+    if no_executable:
+        return " ".join(argv[1:])
+    else:
+        return sys.executable + " " + " ".join(argv)
+
+
+def main():
+    print(get_commandline_args())
+
+
+if __name__ == "__main__":
+    main()
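Note on `get_commandline_args` above: it wraps any argument that contains a shell metacharacter in single quotes and escapes embedded single quotes. For comparison only, the sketch below does the same job with the standard library's `shlex.quote`. The helper name `quote_args` is hypothetical and not part of this patch, and the exact quoting style may differ from the hand-written version.

```python
import shlex


def quote_args(args):
    """Join arguments into one shell-safe string (illustrative helper)."""
    # shlex.quote returns "safe" strings unchanged and single-quotes the rest,
    # escaping any embedded single quotes.
    return " ".join(shlex.quote(arg) for arg in args)


if __name__ == "__main__":
    # An option value with spaces, like the --cmd example in parse_options.sh.
    print(quote_args(["--cmd", "queue.pl -sync y", "it's"]))
```

Parsing the result back with `shlex.split` should return the original argument list, which is a convenient way to check any quoting scheme.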
diff --git a/egs/aishell/tranformer/utils/proc_conf_oss.py b/egs/aishell/tranformer/utils/proc_conf_oss.py
new file mode 100755
index 000000000..c4a90c5c1
--- /dev/null
+++ b/egs/aishell/tranformer/utils/proc_conf_oss.py
@@ -0,0 +1,35 @@
+from pathlib import Path
+
+import torch
+import yaml
+
+
+class NoAliasSafeDumper(yaml.SafeDumper):
+    # Disable anchors/aliases in the YAML output because they look ugly
+    def ignore_aliases(self, data):
+        return True
+
+
+def yaml_no_alias_safe_dump(data, stream=None, **kwargs):
+    """Safe-dump in yaml with no anchor/alias"""
+    return yaml.dump(
+        data, stream, allow_unicode=True, Dumper=NoAliasSafeDumper, **kwargs
+    )
+
+
+def gen_conf(file, out_dir):
+    conf = torch.load(file)["config"]
+    conf["oss_bucket"] = "null"
+    print(conf)
+    output_dir = Path(out_dir)
+    output_dir.mkdir(parents=True, exist_ok=True)
+    with (output_dir / "config.yaml").open("w", encoding="utf-8") as f:
+        yaml_no_alias_safe_dump(conf, f, indent=4, sort_keys=False)
+
+
+if __name__ == "__main__":
+    import sys
+
+    in_f = sys.argv[1]
+    out_f = sys.argv[2]
+    gen_conf(in_f, out_f)
diff --git a/egs/aishell/tranformer/utils/proce_text.py b/egs/aishell/tranformer/utils/proce_text.py
new file mode 100755
index 000000000..9e517a4e1
--- /dev/null
+++ b/egs/aishell/tranformer/utils/proce_text.py
@@ -0,0 +1,31 @@
+
+import sys
+import re
+
+in_f = sys.argv[1]
+out_f = sys.argv[2]
+
+
+with open(in_f, "r", encoding="utf-8") as f:
+  lines = f.readlines()
+
+with open(out_f, "w", encoding="utf-8") as f:
+  for line in lines:
+    outs = line.strip().split(" ", 1)
+    if len(outs) == 2:
+      idx, text = outs
+      text = re.sub("</s>", "", text)
+      text = re.sub("<s>", "", text)
+      text = re.sub("@@", "", text)
+      text = re.sub("@", "", text)
+      text = re.sub("<unk>", "", text)
+      text = re.sub(" ", "", text)
+      text = text.lower()
+    else:
+      idx = outs[0]
+      text = " "
+
+    text = [x for x in text]
+    text = " ".join(text)
+    out = "{} {}\n".format(idx, text)
+    f.write(out)
diff --git a/egs/aishell/tranformer/utils/run.pl b/egs/aishell/tranformer/utils/run.pl
new file mode 100755
index 000000000..483f95bc6
--- /dev/null
+++ b/egs/aishell/tranformer/utils/run.pl
@@ -0,0 +1,356 @@
+#!/usr/bin/env perl
+use warnings; #sed replacement for -w perl parameter
+# In general, doing
+#  run.pl some.log a b c is like running the command a b c in
+# the bash shell, and putting the standard error and output into some.log.
+# To run parallel jobs (backgrounded on the host machine), you can do (e.g.)
+#  run.pl JOB=1:4 some.JOB.log a b c JOB is like running the command a b c JOB
+# and putting it in some.JOB.log, for each one. [Note: JOB can be any identifier].
+# If any of the jobs fails, this script will fail.
+
+# A typical example is:
+#  run.pl some.log my-prog "--opt=foo bar" foo \|  other-prog baz
+# and run.pl will run something like:
+# ( my-prog '--opt=foo bar' foo |  other-prog baz ) >& some.log
+#
+# Basically it takes the command-line arguments, quotes them
+# as necessary to preserve spaces, and evaluates them with bash.
+# In addition it puts the command line at the top of the log, and
+# the start and end times of the command at the beginning and end.
+# The reason why this is useful is so that we can create a different
+# version of this program that uses a queueing system instead.
+
+#use Data::Dumper;
+
+@ARGV < 2 && die "usage: run.pl log-file command-line arguments...";
+
+#print STDERR "COMMAND-LINE: " .  Dumper(\@ARGV) . "\n";
+$job_pick = 'all';
+$max_jobs_run = -1;
+$jobstart = 1;
+$jobend = 1;
+$ignored_opts = ""; # These will be ignored.
+
+# First parse an option like JOB=1:4, and any
+# options that would normally be given to
+# queue.pl, which we will just discard.
+
+for (my $x = 1; $x <= 2; $x++) { # This for-loop is to
+  # allow the JOB=1:n option to be interleaved with the
+  # options to qsub.
+  while (@ARGV >= 2 && $ARGV[0] =~ m:^-:) {
+    # parse any options that would normally go to qsub, but which will be ignored here.
+    my $switch = shift @ARGV;
+    if ($switch eq "-V") {
+      $ignored_opts .= "-V ";
+    } elsif ($switch eq "--max-jobs-run" || $switch eq "-tc") {
+      # we do support the option --max-jobs-run n, and its GridEngine form -tc n.
+      # If the option appears multiple times, use the smallest value.
+      if ( $max_jobs_run <= 0 ) {
+          $max_jobs_run =  shift @ARGV;
+      } else {
+        my $new_constraint = shift @ARGV;
+        if ( ($new_constraint < $max_jobs_run) ) {
+          $max_jobs_run = $new_constraint;
+        }
+      }
+      
+      if (! ($max_jobs_run > 0)) {
+        die "run.pl: invalid option --max-jobs-run $max_jobs_run";
+      }
+    } else {
+      my $argument = shift @ARGV;
+      if ($argument =~ m/^--/) {
+        print STDERR "run.pl: WARNING: suspicious argument '$argument' to $switch; starts with '-'\n";
+      }
+      if ($switch eq "-sync" && $argument =~ m/^[yY]/) {
+        $ignored_opts .= "-sync "; # Note: in the
+        # corresponding code in queue.pl it says instead, just "$sync = 1;".
+      } elsif ($switch eq "-pe") { # e.g. -pe smp 5
+        my $argument2 = shift @ARGV;
+        $ignored_opts .= "$switch $argument $argument2 ";
+      } elsif ($switch eq "--gpu") {
+        $using_gpu = $argument;
+      } elsif ($switch eq "--pick") {
+        if($argument =~ m/^(all|failed|incomplete)$/) {
+          $job_pick = $argument;
+        } else {
+          print STDERR "run.pl: ERROR: --pick argument must be one of 'all', 'failed' or 'incomplete'\n";
+        }
+      } else {
+        # Ignore option.
+        $ignored_opts .= "$switch $argument ";
+      }
+    }
+  }
+  if ($ARGV[0] =~ m/^([\w_][\w\d_]*)+=(\d+):(\d+)$/) { # e.g. JOB=1:20
+    $jobname = $1;
+    $jobstart = $2;
+    $jobend = $3;
+    if ($jobstart > $jobend) {
+      die "run.pl: invalid job range $ARGV[0]";
+    }
+    if ($jobstart <= 0) {
+      die "run.pl: invalid job range $ARGV[0], start must be strictly positive (this is required for GridEngine compatibility).";
+    }
+    shift;
+  } elsif ($ARGV[0] =~ m/^([\w_][\w\d_]*)+=(\d+)$/) { # e.g. JOB=1.
+    $jobname = $1;
+    $jobstart = $2;
+    $jobend = $2;
+    shift;
+  } elsif ($ARGV[0] =~ m/.+\=.*\:.*$/) {
+    print STDERR "run.pl: Warning: suspicious first argument to run.pl: $ARGV[0]\n";
+  }
+}
+
+# Users found this message confusing so we are removing it.
+# if ($ignored_opts ne "") {
+#   print STDERR "run.pl: Warning: ignoring options \"$ignored_opts\"\n";
+# }
+
+if ($max_jobs_run == -1) { # If --max-jobs-run option not set,
+                           # then work out the number of processors if possible,
+                           # and set it based on that.
+  $max_jobs_run = 0;
+  if ($using_gpu) {
+    if (open(P, "nvidia-smi -L |")) {
+      $max_jobs_run++ while (<P>);
+      close(P);
+    }
+    if ($max_jobs_run == 0) {
+      $max_jobs_run = 1;
+      print STDERR "run.pl: Warning: failed to detect number of GPUs from nvidia-smi, using ${max_jobs_run}\n";
+    }
+  } elsif (open(P, "</proc/cpuinfo")) {  # Linux
+    while (<P>) { if (m/^processor/) { $max_jobs_run++; } }
+    if ($max_jobs_run == 0) {
+      print STDERR "run.pl: Warning: failed to detect any processors from /proc/cpuinfo\n";
+      $max_jobs_run = 10;  # reasonable default.
+    }
+    close(P);
+  } elsif (open(P, "sysctl -a |")) {  # BSD/Darwin
+    while (<P>) {
+      if (m/hw\.ncpu\s*[:=]\s*(\d+)/) { # hw.ncpu = 4, or hw.ncpu: 4
+        $max_jobs_run = $1;
+        last;
+      }
+    }
+    close(P);
+    if ($max_jobs_run == 0) {
+      print STDERR "run.pl: Warning: failed to detect any processors from sysctl -a\n";
+      $max_jobs_run = 10;  # reasonable default.
+    }
+  } else {
+    # allow at most 32 jobs at once, on non-UNIX systems; change this code
+    # if you need to change this default.
+    $max_jobs_run = 32;
+  }
+  # The just-computed value of $max_jobs_run is just the number of processors
+  # (or our best guess); and if it happens that the number of jobs we need to
+  # run is just slightly above $max_jobs_run, it will make sense to increase
+  # $max_jobs_run to equal the number of jobs, so we don't have a small number
+  # of leftover jobs.
+  $num_jobs = $jobend - $jobstart + 1;
+  if (!$using_gpu &&
+      $num_jobs > $max_jobs_run && $num_jobs < 1.4 * $max_jobs_run) {
+    $max_jobs_run = $num_jobs;
+  }
+}
+
+sub pick_or_exit {
+  # pick_or_exit ( $logfile ) 
+  # Invoked before each job is started; helps to run jobs selectively.
+  #
+  # Given the name of the output logfile decides whether the job must be 
+  # executed (by returning from the subroutine) or not (by terminating the
+  # process calling exit)
+  # 
+  # PRE: $job_pick is a global variable set by command line switch --pick
+  #      and indicates which class of jobs must be executed.
+  #
+  # 1) If a failed job is not executed the process exit code will indicate 
+  #    failure, just as if the task was just executed  and failed.
+  #
+  # 2) If a task is incomplete it will be executed. Incomplete may be either
+  #    a job whose log file does not contain the accounting notes in the end,
+  #    or a job whose log file does not exist.
+  #
+  # 3) If the $job_pick is set to 'all' (default behavior) a task will be
+  #    executed regardless of the result of previous attempts.
+  #
+  # This logic could have been implemented in the main execution loop
+  # but a subroutine is used to preserve the current level of readability of
+  # that part of the code.
+  #
+  # Alexandre Felipe, (o.alexandre.felipe@gmail.com) 14th of August of 2020
+  #
+  if($job_pick eq 'all'){
+    return; # no need to bother with the previous log
+  }
+  open my $fh, "<", $_[0] or return; # job not executed yet
+  my $log_line;
+  my $cur_line;
+  while ($cur_line = <$fh>) {
+    if( $cur_line =~ m/# Ended \(code .*/ ) {
+      $log_line = $cur_line;
+    }
+  }
+  close $fh;
+  if (! defined($log_line)){
+    return; # incomplete
+  }
+  if ( $log_line =~ m/# Ended \(code 0\).*/ ) {
+    exit(0); # complete
+  } elsif ( $log_line =~ m/# Ended \(code \d+(; signal \d+)?\).*/ ){
+    if ($job_pick !~ m/^(failed|all)$/) {
+      exit(1); # failed but not going to run
+    } else {
+      return; # failed
+    }
+  } elsif ( $log_line =~ m/.*\S.*/ ) {
+    return; # incomplete jobs are always run
+  }
+}
+
+
+$logfile = shift @ARGV;
+
+if (defined $jobname && $logfile !~ m/$jobname/ &&
+    $jobend > $jobstart) {
+  print STDERR "run.pl: you are trying to run a parallel job but "
+    . "you are putting the output into just one log file ($logfile)\n";
+  exit(1);
+}
+
+$cmd = "";
+
+foreach $x (@ARGV) {
+    if ($x =~ m/^\S+$/) { $cmd .=  $x . " "; }
+    elsif ($x =~ m:\":) { $cmd .= "'$x' "; }
+    else { $cmd .= "\"$x\" "; }
+}
+
+#$Data::Dumper::Indent=0;
+$ret = 0;
+$numfail = 0;
+%active_pids=();
+
+use POSIX ":sys_wait_h";
+for ($jobid = $jobstart; $jobid <= $jobend; $jobid++) {
+  if (scalar(keys %active_pids) >= $max_jobs_run) {
+
+    # Let's wait for a change in any child's status
+    # Then we have to work out which child finished
+    $r = waitpid(-1, 0);
+    $code = $?;
+    if ($r < 0 ) { die "run.pl: Error waiting for child process"; } # should never happen.
+    if ( defined $active_pids{$r} ) {
+        $jid=$active_pids{$r};
+        $fail[$jid]=$code;
+        if ($code !=0) { $numfail++;}
+        delete $active_pids{$r};
+        # print STDERR "Finished: $r/$jid " .  Dumper(\%active_pids) . "\n";
+    } else {
+        die "run.pl: Cannot find the PID of the child process that just finished.";
+    }
+
+    # In theory we could do a non-blocking waitpid over all jobs running just
+    # to find out if only one or more jobs finished during the previous waitpid()
+    # However, we just omit this and will reap the next one in the next pass
+    # through the for(;;) cycle
+  }
+  $childpid = fork();
+  if (!defined $childpid) { die "run.pl: Error forking in run.pl (writing to $logfile)"; }
+  if ($childpid == 0) { # We're in the child... this branch
+    # executes the job and returns (possibly with an error status).
+    if (defined $jobname) {
+      $cmd =~ s/$jobname/$jobid/g;
+      $logfile =~ s/$jobname/$jobid/g;
+    }
+    # exit if the job does not need to be executed
+    pick_or_exit( $logfile );
+
+    system("mkdir -p `dirname $logfile` 2>/dev/null");
+    open(F, ">$logfile") || die "run.pl: Error opening log file $logfile";
+    print F "# " . $cmd . "\n";
+    print F "# Started at " . `date`;
+    $starttime = `date +'%s'`;
+    print F "#\n";
+    close(F);
+
+    # Pipe into bash.. make sure we're not using any other shell.
+    open(B, "|bash") || die "run.pl: Error opening shell command";
+    print B "( " . $cmd . ") 2>>$logfile >> $logfile";
+    close(B);                   # If there was an error, exit status is in $?
+    $ret = $?;
+
+    $lowbits = $ret & 127;
+    $highbits = $ret >> 8;
+    if ($lowbits != 0) { $return_str = "code $highbits; signal $lowbits" }
+    else { $return_str = "code $highbits"; }
+
+    $endtime = `date +'%s'`;
+    open(F, ">>$logfile") || die "run.pl: Error opening log file $logfile (again)";
+    $enddate = `date`;
+    chop $enddate;
+    print F "# Accounting: time=" . ($endtime - $starttime) . " threads=1\n";
+    print F "# Ended ($return_str) at " . $enddate . ", elapsed time " . ($endtime-$starttime) . " seconds\n";
+    close(F);
+    exit($ret == 0 ? 0 : 1);
+  } else {
+    $pid[$jobid] = $childpid;
+    $active_pids{$childpid} = $jobid;
+    # print STDERR "Queued: " .  Dumper(\%active_pids) . "\n";
+  }
+}
+
+# Now we have submitted all the jobs, let's wait until all the jobs finish
+foreach $child (keys %active_pids) {
+    $jobid=$active_pids{$child};
+    $r = waitpid($pid[$jobid], 0);
+    $code = $?;
+    if ($r == -1) { die "run.pl: Error waiting for child process"; } # should never happen.
+    if ($r != 0) { $fail[$jobid]=$code; $numfail++ if $code!=0; } # Child reaped; record its exit status
+}
+
+# Some sanity checks:
+# The $fail array should not contain undefined codes
+# The number of non-zeros in that array  should be equal to $numfail
+# We cannot do foreach() here, as the JOB ids do not start at zero
+$failed_jids=0;
+for ($jobid = $jobstart; $jobid <= $jobend; $jobid++) {
+  $job_return = $fail[$jobid];
+  if (not defined $job_return ) {
+    # print Dumper(\@fail);
+
+    die "run.pl: Sanity check failed: we have indication that some jobs are running " .
+      "even after we waited for all jobs to finish" ;
+  }
+  if ($job_return != 0 ){ $failed_jids++;}
+}
+if ($failed_jids != $numfail) {
+  die "run.pl: Sanity check failed: cannot find out how many jobs failed ($failed_jids x $numfail)."
+}
+if ($numfail > 0) { $ret = 1; }
+
+if ($ret != 0) {
+  $njobs = $jobend - $jobstart + 1;
+  if ($njobs == 1) {
+    if (defined $jobname) {
+      $logfile =~ s/$jobname/$jobstart/; # only one numbered job, so replace name with
+                                         # that job.
+    }
+    print STDERR "run.pl: job failed, log is in $logfile\n";
+    if ($logfile =~ m/JOB/) {
+      print STDERR "run.pl: probably you forgot to put JOB=1:\$nj in your script.";
+    }
+  }
+  else {
+    $logfile =~ s/$jobname/*/g;
+    print STDERR "run.pl: $numfail / $njobs failed, log is in $logfile\n";
+  }
+}
+
+
+exit ($ret);
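Note on the log footer written by run.pl above: the child splits the raw wait status into its low 7 bits (`$ret & 127`, the terminating signal) and its high byte (`$ret >> 8`, the exit code), and `pick_or_exit` later matches the resulting `# Ended (code ...)` line. A minimal Python sketch of that decoding follows. The function name `describe_status` is hypothetical and only illustrates the bit arithmetic.

```python
def describe_status(status):
    """Format a raw wait status the way run.pl's '# Ended (...)' line does."""
    signal = status & 127  # low 7 bits: terminating signal, 0 if none
    code = status >> 8     # high byte: exit code of the command
    if signal != 0:
        return "code {}; signal {}".format(code, signal)
    return "code {}".format(code)


if __name__ == "__main__":
    for raw in (0, 256, 9):
        print(raw, "->", describe_status(raw))
```

A status of 0 is the only value that `pick_or_exit` treats as a completed job. Any other code, with or without a signal, counts as failed.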
diff --git a/egs/aishell/tranformer/utils/shuffle_list.pl b/egs/aishell/tranformer/utils/shuffle_list.pl
new file mode 100755
index 000000000..a116200f4
--- /dev/null
+++ b/egs/aishell/tranformer/utils/shuffle_list.pl
@@ -0,0 +1,44 @@
+#!/usr/bin/env perl
+
+# Copyright 2013  Johns Hopkins University (author: Daniel Povey)
+
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#  http://www.apache.org/licenses/LICENSE-2.0
+#
+# THIS CODE IS PROVIDED *AS IS* BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY IMPLIED
+# WARRANTIES OR CONDITIONS OF TITLE, FITNESS FOR A PARTICULAR PURPOSE,
+# MERCHANTABLITY OR NON-INFRINGEMENT.
+# See the Apache 2 License for the specific language governing permissions and
+# limitations under the License.
+
+
+if ($ARGV[0] eq "--srand") {
+  $n = $ARGV[1];
+  $n =~ m/\d+/ || die "Bad argument to --srand option: \"$n\"";
+  srand($ARGV[1]);
+  shift;
+  shift;
+} else {
+  srand(0); # Gives inconsistent behavior if we don't seed.
+}
+
+if (@ARGV > 1 || $ARGV[0] =~ m/^-.+/) { # >1 args, or an option we
+  # don't understand.
+  print "Usage: shuffle_list.pl [--srand N] [input file]  > output\n";
+  print "randomizes the order of lines of input.\n";
+  exit(1);
+}
+
+@lines;
+while (<>) {
+  push @lines, [ (rand(), $_)] ;
+}
+
+@lines = sort { $a->[0] cmp $b->[0] } @lines;
+foreach $l (@lines) {
+    print $l->[1];
+}
\ No newline at end of file
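Note on shuffle_list.pl above: it attaches a random key to every input line and sorts by that key, so with a fixed seed the resulting permutation depends only on the seed and the number of lines. subset_data_dir_tr_cv.sh relies on this when it shuffles `wav.scp` and `text` separately with `--srand 1`. The Python sketch below illustrates the idea under that assumption. `shuffle_lines` is a hypothetical name, and Python's generator will not reproduce the order produced by Perl's `rand()`.

```python
import random


def shuffle_lines(lines, seed=0):
    """Return lines in an order that depends only on the seed and line count."""
    rng = random.Random(seed)
    # Decorate each line with a random key, then sort by the key alone.
    keyed = [(rng.random(), line) for line in lines]
    keyed.sort(key=lambda pair: pair[0])
    return [line for _, line in keyed]


if __name__ == "__main__":
    wav = ["utt1 a.wav", "utt2 b.wav", "utt3 c.wav"]
    text = ["utt1 hello", "utt2 world", "utt3 again"]
    # Same seed and same length, so both lists receive the same permutation.
    for w, t in zip(shuffle_lines(wav, 1), shuffle_lines(text, 1)):
        print(w.split()[0], t.split()[0])
```

The alignment only holds when both files have the same number of lines in the same utterance order. If they differ, the two shuffles diverge silently.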
diff --git a/egs/aishell/tranformer/utils/split_data.py b/egs/aishell/tranformer/utils/split_data.py
new file mode 100755
index 000000000..060eae6d3
--- /dev/null
+++ b/egs/aishell/tranformer/utils/split_data.py
@@ -0,0 +1,60 @@
+import os
+import sys
+import random
+
+
+in_dir = sys.argv[1]
+out_dir = sys.argv[2]
+num_split = sys.argv[3]
+
+
+def split_scp(scp, num):
+    assert len(scp) >= num
+    avg = len(scp) // num
+    out = []
+    begin = 0
+
+    for i in range(num):
+        if i == num - 1:
+            out.append(scp[begin:])
+        else:
+            out.append(scp[begin:begin+avg])
+        begin += avg
+
+    return out
+
+
+assert os.path.exists("{}/wav.scp".format(in_dir))
+assert os.path.exists("{}/text".format(in_dir))
+
+with open("{}/wav.scp".format(in_dir), 'r') as infile:
+    wav_list = infile.readlines()
+
+with open("{}/text".format(in_dir), 'r') as infile:
+    text_list = infile.readlines()
+
+assert len(wav_list) == len(text_list)
+
+x = list(zip(wav_list, text_list))
+random.shuffle(x)
+wav_shuffle_list, text_shuffle_list = zip(*x)
+
+num_split = int(num_split)
+wav_split_list = split_scp(wav_shuffle_list, num_split)
+text_split_list = split_scp(text_shuffle_list, num_split)
+
+for idx, wav_list in enumerate(wav_split_list, 1):
+    path = out_dir + "/split" + str(num_split) + "/" + str(idx)
+    if not os.path.exists(path):
+        os.makedirs(path)
+    with open("{}/wav.scp".format(path), 'w') as wav_writer:
+        for line in wav_list:
+            wav_writer.write(line)
+
+for idx, text_list in enumerate(text_split_list, 1):
+    path = out_dir + "/split" + str(num_split) + "/" + str(idx)
+    if not os.path.exists(path):
+        os.makedirs(path)
+    with open("{}/text".format(path), 'w') as text_writer:
+        for line in text_list:
+            text_writer.write(line)
diff --git a/egs/aishell/tranformer/utils/split_scp.pl b/egs/aishell/tranformer/utils/split_scp.pl
new file mode 100755
index 000000000..0876dcb6d
--- /dev/null
+++ b/egs/aishell/tranformer/utils/split_scp.pl
@@ -0,0 +1,246 @@
+#!/usr/bin/env perl
+
+# Copyright 2010-2011 Microsoft Corporation
+
+# See ../../COPYING for clarification regarding multiple authors
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#  http://www.apache.org/licenses/LICENSE-2.0
+#
+# THIS CODE IS PROVIDED *AS IS* BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY IMPLIED
+# WARRANTIES OR CONDITIONS OF TITLE, FITNESS FOR A PARTICULAR PURPOSE,
+# MERCHANTABLITY OR NON-INFRINGEMENT.
+# See the Apache 2 License for the specific language governing permissions and
+# limitations under the License.
+
+
+# This program splits up any kind of .scp or archive-type file.
+# If there is no utt2spk option it will work on any text  file and
+# will split it up with an approximately equal number of lines in
+# each part.
+# With the --utt2spk option it will work on anything that has the
+# utterance-id as the first entry on each line; the utt2spk file is
+# of the form "utterance speaker" (on each line).
+# It splits it into equal size chunks as far as it can.  If you use the utt2spk
+# option it will make sure these chunks coincide with speaker boundaries.  In
+# this case, if there are more chunks than speakers (and in some other
+# circumstances), some of the resulting chunks will be empty and it will print
+# an error message and exit with nonzero status.
+# You will normally call this like:
+# split_scp.pl scp scp.1 scp.2 scp.3 ...
+# or
+# split_scp.pl --utt2spk=utt2spk scp scp.1 scp.2 scp.3 ...
+# Note that you can use this script to split the utt2spk file itself,
+# e.g. split_scp.pl --utt2spk=utt2spk utt2spk utt2spk.1 utt2spk.2 ...
+
+# You can also call the scripts like:
+# split_scp.pl -j 3 0 scp scp.0
+# [note: with this option, it assumes zero-based indexing of the split parts,
+# i.e. the second number must be 0 <= n < num-jobs.]
+
+use warnings;
+
+$num_jobs = 0;
+$job_id = 0;
+$utt2spk_file = "";
+$one_based = 0;
+
+for ($x = 1; $x <= 3 && @ARGV > 0; $x++) {
+    if ($ARGV[0] eq "-j") {
+        shift @ARGV;
+        $num_jobs = shift @ARGV;
+        $job_id = shift @ARGV;
+    }
+    if ($ARGV[0] =~ /--utt2spk=(.+)/) {
+        $utt2spk_file=$1;
+        shift;
+    }
+    if ($ARGV[0] eq '--one-based') {
+        $one_based = 1;
+        shift @ARGV;
+    }
+}
+
+if ($num_jobs != 0 && ($num_jobs < 0 || $job_id - $one_based < 0 ||
+                       $job_id - $one_based >= $num_jobs)) {
+  die "$0: Invalid job number/index values for '-j $num_jobs $job_id" .
+      ($one_based ? " --one-based" : "") . "'\n"
+}
+
+$one_based
+    and $job_id--;
+
+if(($num_jobs == 0 && @ARGV < 2) || ($num_jobs > 0 && (@ARGV < 1 || @ARGV > 2))) {
+    die
+"Usage: split_scp.pl [--utt2spk=<utt2spk_file>] in.scp out1.scp out2.scp ...
+   or: split_scp.pl -j num-jobs job-id [--one-based] [--utt2spk=<utt2spk_file>] in.scp [out.scp]
+ ... where 0 <= job-id < num-jobs, or 1 <= job-id <= num-jobs if --one-based.\n";
+}
+
+$error = 0;
+$inscp = shift @ARGV;
+if ($num_jobs == 0) { # without -j option
+    @OUTPUTS = @ARGV;
+} else {
+    for ($j = 0; $j < $num_jobs; $j++) {
+        if ($j == $job_id) {
+            if (@ARGV > 0) { push @OUTPUTS, $ARGV[0]; }
+            else { push @OUTPUTS, "-"; }
+        } else {
+            push @OUTPUTS, "/dev/null";
+        }
+    }
+}
+
+if ($utt2spk_file ne "") {  # We have the --utt2spk option...
+    open($u_fh, '<', $utt2spk_file) || die "$0: Error opening utt2spk file $utt2spk_file: $!\n";
+    while(<$u_fh>) {
+        @A = split;
+        @A == 2 || die "$0: Bad line $_ in utt2spk file $utt2spk_file\n";
+        ($u,$s) = @A;
+        $utt2spk{$u} = $s;
+    }
+    close $u_fh;
+    open($i_fh, '<', $inscp) || die "$0: Error opening input scp file $inscp: $!\n";
+    @spkrs = ();
+    while(<$i_fh>) {
+        @A = split;
+        if(@A == 0) { die "$0: Empty or space-only line in scp file $inscp\n"; }
+        $u = $A[0];
+        $s = $utt2spk{$u};
+        defined $s || die "$0: No utterance $u in utt2spk file $utt2spk_file\n";
+        if(!defined $spk_count{$s}) {
+            push @spkrs, $s;
+            $spk_count{$s} = 0;
+            $spk_data{$s} = [];  # ref to new empty array.
+        }
+        $spk_count{$s}++;
+        push @{$spk_data{$s}}, $_;
+    }
+    # Now split as equally as possible ..
+    # First allocate spks to files by allocating an approximately
+    # equal number of speakers.
+    $numspks = @spkrs;  # number of speakers.
+    $numscps = @OUTPUTS; # number of output files.
+    if ($numspks < $numscps) {
+      die "$0: Refusing to split data because number of speakers $numspks " .
+          "is less than the number of output .scp files $numscps\n";
+    }
+    for($scpidx = 0; $scpidx < $numscps; $scpidx++) {
+        $scparray[$scpidx] = []; # [] is array reference.
+    }
+    for ($spkidx = 0; $spkidx < $numspks; $spkidx++) {
+        $scpidx = int(($spkidx*$numscps) / $numspks);
+        $spk = $spkrs[$spkidx];
+        push @{$scparray[$scpidx]}, $spk;
+        $scpcount[$scpidx] += $spk_count{$spk};
+    }
+
+    # Now will try to reassign beginning + ending speakers
+    # to different scp's and see if it gets more balanced.
+    # Suppose objf we're minimizing is sum_i (num utts in scp[i] - average)^2.
+    # We can show that if considering changing just 2 scp's, we minimize
+    # this by minimizing the squared difference in sizes.  This is
+    # equivalent to minimizing the absolute difference in sizes.  This
+    # shows this method is bound to converge.
+
+    $changed = 1;
+    while($changed) {
+        $changed = 0;
+        for($scpidx = 0; $scpidx < $numscps; $scpidx++) {
+            # First try to reassign ending spk of this scp.
+            if($scpidx < $numscps-1) {
+                $sz = @{$scparray[$scpidx]};
+                if($sz > 0) {
+                    $spk = $scparray[$scpidx]->[$sz-1];
+                    $count = $spk_count{$spk};
+                    $nutt1 = $scpcount[$scpidx];
+                    $nutt2 = $scpcount[$scpidx+1];
+                    if( abs( ($nutt2+$count) - ($nutt1-$count))
+                        < abs($nutt2 - $nutt1))  { # Would decrease
+                        # size-diff by reassigning spk...
+                        $scpcount[$scpidx+1] += $count;
+                        $scpcount[$scpidx] -= $count;
+                        pop @{$scparray[$scpidx]};
+                        unshift @{$scparray[$scpidx+1]}, $spk;
+                        $changed = 1;
+                    }
+                }
+            }
+            if($scpidx > 0 && @{$scparray[$scpidx]} > 0) {
+                $spk = $scparray[$scpidx]->[0];
+                $count = $spk_count{$spk};
+                $nutt1 = $scpcount[$scpidx-1];
+                $nutt2 = $scpcount[$scpidx];
+                if( abs( ($nutt2-$count) - ($nutt1+$count))
+                    < abs($nutt2 - $nutt1))  { # Would decrease
+                    # size-diff by reassigning spk...
+                    $scpcount[$scpidx-1] += $count;
+                    $scpcount[$scpidx] -= $count;
+                    shift @{$scparray[$scpidx]};
+                    push @{$scparray[$scpidx-1]}, $spk;
+                    $changed = 1;
+                }
+            }
+        }
+    }
+    # Now print out the files...
+    for($scpidx = 0; $scpidx < $numscps; $scpidx++) {
+        $scpfile = $OUTPUTS[$scpidx];
+        ($scpfile ne '-' ? open($f_fh, '>', $scpfile)
+                         : open($f_fh, '>&', \*STDOUT)) ||
+            die "$0: Could not open scp file $scpfile for writing: $!\n";
+        $count = 0;
+        if(@{$scparray[$scpidx]} == 0) {
+            print STDERR "$0: error: split_scp.pl producing empty .scp file " .
+                         "$scpfile (too many splits and too few speakers?)\n";
+            $error = 1;
+        } else {
+            foreach $spk ( @{$scparray[$scpidx]} ) {
+                print $f_fh @{$spk_data{$spk}};
+                $count += $spk_count{$spk};
+            }
+            $count == $scpcount[$scpidx] || die "Count mismatch [code error]";
+        }
+        close($f_fh);
+    }
+} else {
+   # This block is the "normal" case where there is no --utt2spk
+   # option and we just break into equal size chunks.
+
+    open($i_fh, '<', $inscp) || die "$0: Error opening input scp file $inscp: $!\n";
+
+    $numscps = @OUTPUTS;  # size of array.
+    @F = ();
+    while(<$i_fh>) {
+        push @F, $_;
+    }
+    $numlines = @F;
+    if($numlines == 0) {
+        print STDERR "$0: error: empty input scp file $inscp\n";
+        $error = 1;
+    }
+    $linesperscp = int( $numlines / $numscps); # the "whole part"..
+    $linesperscp >= 1 || die "$0: You are splitting into too many pieces! [reduce \$nj ($numscps) to be smaller than the number of lines ($numlines) in $inscp]\n";
+    $remainder = $numlines - ($linesperscp * $numscps);
+    ($remainder >= 0 && $remainder < $numlines) || die "bad remainder $remainder";
+    # [just doing int() rounds down].
+    $n = 0;
+    for($scpidx = 0; $scpidx < @OUTPUTS; $scpidx++) {
+        $scpfile = $OUTPUTS[$scpidx];
+        ($scpfile ne '-' ? open($o_fh, '>', $scpfile)
+                         : open($o_fh, '>&', \*STDOUT)) ||
+            die "$0: Could not open scp file $scpfile for writing: $!\n";
+        for($k = 0; $k < $linesperscp + ($scpidx < $remainder ? 1 : 0); $k++) {
+            print $o_fh $F[$n++];
+        }
+        close($o_fh) || die "$0: Error closing scp file $scpfile: $!\n";
+    }
+    $n == $numlines || die "$n != $numlines [code error]";
+}
+
+exit ($error);
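Note on the no-utt2spk branch of split_scp.pl above: it gives every output `int(numlines / numscps)` lines and hands one extra line to each of the first `remainder` outputs, so chunk sizes differ by at most one. By contrast, `split_scp` in split_data.py puts the whole remainder into the last chunk. The Python sketch below shows the first scheme. `split_evenly` is a hypothetical name used only for illustration.

```python
def split_evenly(lines, num_parts):
    """Split lines into num_parts contiguous chunks whose sizes differ by <= 1."""
    if num_parts <= 0 or len(lines) < num_parts:
        raise ValueError("need at least one line per part")
    base, remainder = divmod(len(lines), num_parts)
    parts = []
    start = 0
    for idx in range(num_parts):
        # The first `remainder` chunks each take one extra line.
        size = base + (1 if idx < remainder else 0)
        parts.append(lines[start:start + size])
        start += size
    return parts


if __name__ == "__main__":
    chunks = split_evenly(list(range(10)), 3)
    print([len(c) for c in chunks])
```

Spreading the remainder this way keeps parallel jobs closer in length than loading it all onto the final chunk.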
diff --git a/egs/aishell/tranformer/utils/subset_data_dir_tr_cv.sh b/egs/aishell/tranformer/utils/subset_data_dir_tr_cv.sh
new file mode 100755
index 000000000..e16cebdf1
--- /dev/null
+++ b/egs/aishell/tranformer/utils/subset_data_dir_tr_cv.sh
@@ -0,0 +1,30 @@
+#!/usr/bin/env bash
+
+dev_num_utt=1000
+
+echo "$0 $@"
+. utils/parse_options.sh || exit 1;
+
+train_data=$1
+out_dir=$2
+
+[ ! -f ${train_data}/wav.scp ] && echo "$0: no such file ${train_data}/wav.scp" && exit 1;
+[ ! -f ${train_data}/text ] && echo "$0: no such file ${train_data}/text" && exit 1;
+
+mkdir -p ${out_dir}/train && mkdir -p ${out_dir}/dev
+
+cp ${train_data}/wav.scp ${out_dir}/train/wav.scp.bak
+cp ${train_data}/text ${out_dir}/train/text.bak
+
+num_utt=$(wc -l <${out_dir}/train/wav.scp.bak)
+
+utils/shuffle_list.pl --srand 1 ${out_dir}/train/wav.scp.bak > ${out_dir}/train/wav.scp.shuf
+head -n ${dev_num_utt} ${out_dir}/train/wav.scp.shuf > ${out_dir}/dev/wav.scp
+tail -n $((${num_utt}-${dev_num_utt})) ${out_dir}/train/wav.scp.shuf > ${out_dir}/train/wav.scp
+
+utils/shuffle_list.pl --srand 1 ${out_dir}/train/text.bak > ${out_dir}/train/text.shuf
+head -n ${dev_num_utt} ${out_dir}/train/text.shuf > ${out_dir}/dev/text
+tail -n $((${num_utt}-${dev_num_utt})) ${out_dir}/train/text.shuf > ${out_dir}/train/text
+
+rm ${out_dir}/train/wav.scp.bak ${out_dir}/train/text.bak
+rm ${out_dir}/train/wav.scp.shuf ${out_dir}/train/text.shuf
diff --git a/egs/aishell/tranformer/utils/text2token.py b/egs/aishell/tranformer/utils/text2token.py
new file mode 100755
index 000000000..56c39138f
--- /dev/null
+++ b/egs/aishell/tranformer/utils/text2token.py
@@ -0,0 +1,135 @@
+#!/usr/bin/env python3
+
+# Copyright 2017 Johns Hopkins University (Shinji Watanabe)
+#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)
+
+
+import argparse
+import codecs
+import re
+import sys
+
+is_python2 = sys.version_info[0] == 2
+
+
+def exist_or_not(i, match_pos):
+    start_pos = None
+    end_pos = None
+    for pos in match_pos:
+        if pos[0] <= i < pos[1]:
+            start_pos = pos[0]
+            end_pos = pos[1]
+            break
+
+    return start_pos, end_pos
+
+
+def get_parser():
+    parser = argparse.ArgumentParser(
+        description="convert raw text to tokenized text",
+        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
+    )
+    parser.add_argument(
+        "--nchar",
+        "-n",
+        default=1,
+        type=int,
+        help="number of characters to split, i.e., \
+                        aabb -> a a b b with -n 1 and aa bb with -n 2",
+    )
+    parser.add_argument(
+        "--skip-ncols", "-s", default=0, type=int, help="skip first n columns"
+    )
+    parser.add_argument("--space", default="<space>", type=str, help="space symbol")
+    parser.add_argument(
+        "--non-lang-syms",
+        "-l",
+        default=None,
+        type=str,
+        help="list of non-linguistic symbols, e.g., <NOISE> etc.",
+    )
+    parser.add_argument("text", type=str, default=False, nargs="?", help="input text")
+    parser.add_argument(
+        "--trans_type",
+        "-t",
+        type=str,
+        default="char",
+        choices=["char", "phn"],
+        help="""Transcript type. char/phn. e.g., for TIMIT FADG0_SI1279 -
+                        If trans_type is char,
+                        read from SI1279.WRD file -> "bricks are an alternative"
+                        Else if trans_type is phn,
+                        read from SI1279.PHN file -> "sil b r ih sil k s aa r er n aa l
+                        sil t er n ih sil t ih v sil" """,
+    )
+    return parser
+
+
+def main():
+    parser = get_parser()
+    args = parser.parse_args()
+
+    rs = []
+    if args.non_lang_syms is not None:
+        with codecs.open(args.non_lang_syms, "r", encoding="utf-8") as f:
+            nls = [x.rstrip() for x in f.readlines()]
+            rs = [re.compile(re.escape(x)) for x in nls]
+
+    if args.text:
+        f = codecs.open(args.text, encoding="utf-8")
+    else:
+        f = codecs.getreader("utf-8")(sys.stdin if is_python2 else sys.stdin.buffer)
+
+    sys.stdout = codecs.getwriter("utf-8")(
+        sys.stdout if is_python2 else sys.stdout.buffer
+    )
+    line = f.readline()
+    n = args.nchar
+    while line:
+        x = line.split()
+        print(" ".join(x[: args.skip_ncols]), end=" ")
+        a = " ".join(x[args.skip_ncols :])
+
+        # get all matched positions
+        match_pos = []
+        for r in rs:
+            i = 0
+            while i >= 0:
+                m = r.search(a, i)
+                if m:
+                    match_pos.append([m.start(), m.end()])
+                    i = m.end()
+                else:
+                    break
+
+        if args.trans_type == "phn":
+            a = a.split(" ")
+        else:
+            if len(match_pos) > 0:
+                chars = []
+                i = 0
+                while i < len(a):
+                    start_pos, end_pos = exist_or_not(i, match_pos)
+                    if start_pos is not None:
+                        chars.append(a[start_pos:end_pos])
+                        i = end_pos
+                    else:
+                        chars.append(a[i])
+                        i += 1
+                a = chars
+
+            a = [a[j : j + n] for j in range(0, len(a), n)]
+
+        a_flat = []
+        for z in a:
+            a_flat.append("".join(z))
+
+        a_chars = [z.replace(" ", args.space) for z in a_flat]
+        if args.trans_type == "phn":
+            a_chars = [z.replace("sil", args.space) for z in a_chars]
+        print(" ".join(a_chars))
+        line = f.readline()
+
+
+if __name__ == "__main__":
+    main()
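
Review note: the character branch of `main()` in `text2token.py` can be hard to follow inline, so here is a minimal standalone sketch of it. It assumes `--nchar 1` and `--trans_type char`. The helper name `split_chars` and its parameters are hypothetical and not part of the patch. It shows how symbols listed via `--non-lang-syms` are kept whole while all other text is split into single characters.

```python
import re


def split_chars(text, non_lang_syms=(), space="<space>"):
    """Split text into single characters, keeping listed symbols whole.

    Hypothetical helper for illustration only; it mirrors the char branch
    of text2token.py with --nchar 1 and is not part of the patch.
    """
    # Collect [start, end) spans of every non-linguistic symbol occurrence.
    spans = []
    for sym in non_lang_syms:
        for m in re.finditer(re.escape(sym), text):
            spans.append((m.start(), m.end()))

    tokens = []
    i = 0
    while i < len(text):
        # Emit the whole symbol if position i falls inside one of the spans.
        hit = next((s for s in spans if s[0] <= i < s[1]), None)
        if hit is not None:
            tokens.append(text[hit[0]:hit[1]])
            i = hit[1]
        else:
            tokens.append(text[i])
            i += 1
    # A literal space becomes the space symbol, as in the script.
    return [t.replace(" ", space) for t in tokens]


print(" ".join(split_chars("ab <NOISE> c", ["<NOISE>"])))
```

With the input above, `<NOISE>` survives as one token and each literal space is rendered as `<space>`.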
diff --git a/egs/aishell/tranformer/utils/text_tokenize.py b/egs/aishell/tranformer/utils/text_tokenize.py
new file mode 100755
index 000000000..962ea11bc
--- /dev/null
+++ b/egs/aishell/tranformer/utils/text_tokenize.py
@@ -0,0 +1,106 @@
+import re
+import argparse
+
+
+def load_dict(seg_file):
+    seg_dict = {}
+    with open(seg_file, 'r') as infile:
+        for line in infile:
+            s = line.strip().split()
+            key = s[0]
+            value = s[1:]
+            seg_dict[key] = " ".join(value)
+    return seg_dict
+
+
+def forward_segment(text, dic):
+    word_list = []
+    i = 0
+    while i < len(text):
+        longest_word = text[i]
+        for j in range(i + 1, len(text) + 1):
+            word = text[i:j]
+            if word in dic:
+                if len(word) > len(longest_word):
+                    longest_word = word
+        word_list.append(longest_word)
+        i += len(longest_word)
+    return word_list
+
+
+def tokenize(txt,
+             seg_dict):
+    out_txt = ""
+    pattern = re.compile(r"([\u4E00-\u9FA5A-Za-z0-9])")
+    for word in txt:
+        if pattern.match(word):
+            if word in seg_dict:
+                out_txt += seg_dict[word] + " "
+            else:
+                out_txt += "<unk>" + " "
+        else:
+            continue
+    return out_txt.strip()
+
+
+def get_parser():
+    parser = argparse.ArgumentParser(
+        description="text tokenize",
+        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
+    )
+    parser.add_argument(
+        "--text-file",
+        "-t",
+        default=False,
+        required=True,
+        type=str,
+        help="input text",
+    )
+    parser.add_argument(
+        "--seg-file",
+        "-s",
+        default=False,
+        required=True,
+        type=str,
+        help="seg file",
+    )
+    parser.add_argument(
+        "--txt-index",
+        "-i",
+        default=1,
+        required=True,
+        type=int,
+        help="txt index",
+    )
+    parser.add_argument(
+        "--output-dir",
+        "-o",
+        default=False,
+        required=True,
+        type=str,
+        help="output dir",
+    )
+    return parser
+
+
+def main():
+    parser = get_parser()
+    args = parser.parse_args()
+
+    seg_dict = load_dict(args.seg_file)
+    txt_path = "{}/text.{}.txt".format(args.output_dir, args.txt_index)
+    shape_path = "{}/len.{}".format(args.output_dir, args.txt_index)
+    with open(args.text_file, 'r') as infile, \
+            open(txt_path, 'w') as txt_writer, open(shape_path, 'w') as shape_writer:
+        for line in infile:
+            s = line.strip().split()
+            text_id = s[0]
+            text_list = forward_segment("".join(s[1:]).lower(), seg_dict)
+            text = tokenize(text_list, seg_dict)
+            txt_writer.write(text_id + " " + text + '\n')
+            shape_writer.write(text_id + " " + str(len(text.split())) + '\n')
+
+
+if __name__ == '__main__':
+    main()
+
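
Review note: `forward_segment` in `text_tokenize.py` is greedy forward maximum matching. At each position it takes the longest dictionary entry that starts there and falls back to a single character when nothing matches. The sketch below repeats that logic with a toy dictionary whose entries are made up for illustration; the real `seg_dict` comes from the file passed via `--seg-file`.

```python
def forward_segment(text, dic):
    """Greedy forward maximum matching, following text_tokenize.py."""
    words = []
    i = 0
    while i < len(text):
        longest = text[i]  # fall back to a single character
        for j in range(i + 1, len(text) + 1):
            candidate = text[i:j]
            if candidate in dic and len(candidate) > len(longest):
                longest = candidate
        words.append(longest)
        i += len(longest)
    return words


# Toy dictionary; keys and values are assumptions made for this example.
toy_dict = {"北京": "北 京", "北京大学": "北 京 大 学", "大学": "大 学", "生": "生"}

print(forward_segment("北京大学生", toy_dict))
```

Because the longest match wins, `北京大学` is chosen over `北京` here, leaving `生` as the final segment. The inner loop scans every suffix length, so the cost is quadratic in the line length, which is usually acceptable for short transcripts.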
diff --git a/egs/aishell/tranformer/utils/text_tokenize.sh b/egs/aishell/tranformer/utils/text_tokenize.sh
new file mode 100755
index 000000000..6b74fef80
--- /dev/null
+++ b/egs/aishell/tranformer/utils/text_tokenize.sh
@@ -0,0 +1,35 @@
+#!/usr/bin/env bash
+
+
+# Begin configuration section.
+nj=32
+cmd=utils/run.pl
+
+echo "$0 $@"
+
+. utils/parse_options.sh || exit 1;
+
+# tokenize configuration
+text_dir=$1
+seg_file=$2
+logdir=$3
+output_dir=$4
+
+txt_dir=${output_dir}/txt; mkdir -p ${output_dir}/txt
+mkdir -p ${logdir}
+
+$cmd JOB=1:$nj $logdir/text_tokenize.JOB.log \
+  python utils/text_tokenize.py -t ${text_dir}/txt/text.JOB.txt \
+      -s ${seg_file} -i JOB -o ${txt_dir} \
+      || exit 1;
+
+# concatenate the text files together.
+for n in $(seq $nj); do
+  cat ${txt_dir}/text.$n.txt || exit 1
+done > ${output_dir}/text || exit 1
+
+for n in $(seq $nj); do
+  cat ${txt_dir}/len.$n || exit 1
+done > ${output_dir}/text_shape || exit 1
+
+echo "$0: Succeeded text tokenize"
diff --git a/egs/aishell/tranformer/utils/textnorm_zh.py b/egs/aishell/tranformer/utils/textnorm_zh.py
new file mode 100755
index 000000000..79feb83fd
--- /dev/null
+++ b/egs/aishell/tranformer/utils/textnorm_zh.py
@@ -0,0 +1,834 @@
+#!/usr/bin/env python3
+# coding=utf-8
+
+# Authors:
+#   2019.5 Zhiyang Zhou (https://github.com/Joee1995/chn_text_norm.git)
+#   2019.9 Jiayu DU
+#
+# requirements:
+#   - python 3.X
+# notes: python 2.X WILL fail or produce misleading results
+
+import sys, os, argparse, codecs, string, re
+
+# ================================================================================ #
+#                                    basic constant
+# ================================================================================ #
+CHINESE_DIGIS = u'零一二三四五六七八九'
+BIG_CHINESE_DIGIS_SIMPLIFIED = u'零壹贰叁肆伍陆柒捌玖'
+BIG_CHINESE_DIGIS_TRADITIONAL = u'零壹貳參肆伍陸柒捌玖'
+SMALLER_BIG_CHINESE_UNITS_SIMPLIFIED = u'十百千万'
+SMALLER_BIG_CHINESE_UNITS_TRADITIONAL = u'拾佰仟萬'
+LARGER_CHINESE_NUMERING_UNITS_SIMPLIFIED = u'亿兆京垓秭穰沟涧正载'
+LARGER_CHINESE_NUMERING_UNITS_TRADITIONAL = u'億兆京垓秭穰溝澗正載'
+SMALLER_CHINESE_NUMERING_UNITS_SIMPLIFIED = u'十百千万'
+SMALLER_CHINESE_NUMERING_UNITS_TRADITIONAL = u'拾佰仟萬'
+
+ZERO_ALT = u'〇'
+ONE_ALT = u'幺'
+TWO_ALTS = [u'两', u'兩']
+
+POSITIVE = [u'正', u'正']
+NEGATIVE = [u'负', u'負']
+POINT = [u'点', u'點']
+# PLUS = [u'加', u'加']
+# SIL = [u'杠', u'槓']
+
+FILLER_CHARS = ['呃', '啊']
+ER_WHITELIST = '(儿女|儿子|儿孙|女儿|儿媳|妻儿|' \
+             '胎儿|婴儿|新生儿|婴幼儿|幼儿|少儿|小儿|儿歌|儿童|儿科|托儿所|孤儿|' \
+             '儿戏|儿化|台儿庄|鹿儿岛|正儿八经|吊儿郎当|生儿育女|托儿带女|养儿防老|痴儿呆女|' \
+             '佳儿佳妇|儿怜兽扰|儿无常父|儿不嫌母丑|儿行千里母担忧|儿大不由爷|苏乞儿)'
+
+# Chinese numbering system types
+NUMBERING_TYPES = ['low', 'mid', 'high']
+
+CURRENCY_NAMES = '(人民币|美元|日元|英镑|欧元|马克|法郎|加拿大元|澳元|港币|先令|芬兰马克|爱尔兰镑|' \
+                 '里拉|荷兰盾|埃斯库多|比塞塔|印尼盾|林吉特|新西兰元|比索|卢布|新加坡元|韩元|泰铢)'
+CURRENCY_UNITS = '((亿|千万|百万|万|千|百)|(亿|千万|百万|万|千|百|)元|(亿|千万|百万|万|千|百|)块|角|毛|分)'
+COM_QUANTIFIERS = '(匹|张|座|回|场|尾|条|个|首|阙|阵|网|炮|顶|丘|棵|只|支|袭|辆|挑|担|颗|壳|窠|曲|墙|群|腔|' \
+                  '砣|座|客|贯|扎|捆|刀|令|打|手|罗|坡|山|岭|江|溪|钟|队|单|双|对|出|口|头|脚|板|跳|枝|件|贴|' \
+                  '针|线|管|名|位|身|堂|课|本|页|家|户|层|丝|毫|厘|分|钱|两|斤|担|铢|石|钧|锱|忽|(千|毫|微)克|' \
+                  '毫|厘|分|寸|尺|丈|里|寻|常|铺|程|(千|分|厘|毫|微)米|撮|勺|合|升|斗|石|盘|碗|碟|叠|桶|笼|盆|' \
+                  '盒|杯|钟|斛|锅|簋|篮|盘|桶|罐|瓶|壶|卮|盏|箩|箱|煲|啖|袋|钵|年|月|日|季|刻|时|周|天|秒|分|旬|' \
+                  '纪|岁|世|更|夜|春|夏|秋|冬|代|伏|辈|丸|泡|粒|颗|幢|堆|条|根|支|道|面|片|张|颗|块)'
+
+# punctuation information is based on the Zhon project (https://github.com/tsroten/zhon.git)
+CHINESE_PUNC_STOP = '！？｡。'
+CHINESE_PUNC_NON_STOP = '＂＃＄％＆＇（）＊＋，－／：；＜＝＞＠［＼］＾＿｀｛｜｝～｟｠｢｣､、〃》「」『』【】〔〕〖〗〘〙〚〛〜〝〞〟〰〾〿–—‘’‛“”„‟…‧﹏'
+CHINESE_PUNC_LIST = CHINESE_PUNC_STOP + CHINESE_PUNC_NON_STOP
+
+# ================================================================================ #
+#                                    basic class
+# ================================================================================ #
+class ChineseChar(object):
+    """
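+    Chinese character.
+    Each character has a simplified and a traditional form,
+    e.g. simplified = '负', traditional = '負'.
+    Either form can be selected on conversion.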
+    """
+
+    def __init__(self, simplified, traditional):
+        self.simplified = simplified
+        self.traditional = traditional
+        #self.__repr__ = self.__str__
+
+    def __str__(self):
+        return self.simplified or self.traditional or ''
+
+    def __repr__(self):
+        return self.__str__()
+
+
+class ChineseNumberUnit(ChineseChar):
+    """
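+    Chinese number / place-value unit character.
+    Besides the simplified and traditional forms, each has a formal "big" (financial) form,
+    e.g. '陆' and '陸'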
+    """
+
+    def __init__(self, power, simplified, traditional, big_s, big_t):
+        super(ChineseNumberUnit, self).__init__(simplified, traditional)
+        self.power = power
+        self.big_s = big_s
+        self.big_t = big_t
+
+    def __str__(self):
+        return '10^{}'.format(self.power)
+
+    @classmethod
+    def create(cls, index, value, numbering_type=NUMBERING_TYPES[1], small_unit=False):
+
+        if small_unit:
+            return ChineseNumberUnit(power=index + 1,
+                                     simplified=value[0], traditional=value[1], big_s=value[1], big_t=value[1])
+        elif numbering_type == NUMBERING_TYPES[0]:
+            return ChineseNumberUnit(power=index + 8,
+                                     simplified=value[0], traditional=value[1], big_s=value[0], big_t=value[1])
+        elif numbering_type == NUMBERING_TYPES[1]:
+            return ChineseNumberUnit(power=(index + 2) * 4,
+                                     simplified=value[0], traditional=value[1], big_s=value[0], big_t=value[1])
+        elif numbering_type == NUMBERING_TYPES[2]:
+            return ChineseNumberUnit(power=pow(2, index + 3),
+                                     simplified=value[0], traditional=value[1], big_s=value[0], big_t=value[1])
+        else:
+            raise ValueError(
+                'Counting type should be in {0} ({1} provided).'.format(NUMBERING_TYPES, numbering_type))
+
+
+class ChineseNumberDigit(ChineseChar):
+    """
+    Chinese digit character.
+    """
+
+    def __init__(self, value, simplified, traditional, big_s, big_t, alt_s=None, alt_t=None):
+        super(ChineseNumberDigit, self).__init__(simplified, traditional)
+        self.value = value
+        self.big_s = big_s
+        self.big_t = big_t
+        self.alt_s = alt_s
+        self.alt_t = alt_t
+
+    def __str__(self):
+        return str(self.value)
+
+    @classmethod
+    def create(cls, i, v):
+        return ChineseNumberDigit(i, v[0], v[1], v[2], v[3])
+
+
+class ChineseMath(ChineseChar):
+    """
+    Chinese math symbol character (sign or decimal point).
+    """
+
+    def __init__(self, simplified, traditional, symbol, expression=None):
+        super(ChineseMath, self).__init__(simplified, traditional)
+        self.symbol = symbol
+        self.expression = expression
+        self.big_s = simplified
+        self.big_t = traditional
+
+
+CC, CNU, CND, CM = ChineseChar, ChineseNumberUnit, ChineseNumberDigit, ChineseMath
+
+
+class NumberSystem(object):
+    """
+    Chinese numbering system.
+    """
+    pass
+
+
+class MathSymbol(object):
+    """
+    Math symbols used by the Chinese numbering system (simplified/traditional), e.g.
+    positive = ['正', '正']
+    negative = ['负', '負']
+    point = ['点', '點']
+    """
+
+    def __init__(self, positive, negative, point):
+        self.positive = positive
+        self.negative = negative
+        self.point = point
+
+    def __iter__(self):
+        for v in self.__dict__.values():
+            yield v
+
+
+# class OtherSymbol(object):
+#     """
+#     其他符号
+#     """
+#
+#     def __init__(self, sil):
+#         self.sil = sil
+#
+#     def __iter__(self):
+#         for v in self.__dict__.values():
+#             yield v
+
+
+# ================================================================================ #
+#                                    basic utils
+# ================================================================================ #
+def create_system(numbering_type=NUMBERING_TYPES[1]):
+    """
+    Create and return the numbering system for the given type (default: mid).
+    NUMBERING_TYPES = ['low', 'mid', 'high']: Chinese numbering system types
+        low:  '兆' = '亿' * '十' = $10^{9}$,  '京' = '兆' * '十', etc.
+        mid:  '兆' = '亿' * '万' = $10^{12}$, '京' = '兆' * '万', etc.
+        high: '兆' = '亿' * '亿' = $10^{16}$, '京' = '兆' * '兆', etc.
+    Returns the corresponding NumberSystem.
+    """
+
+    # chinese number units of '亿' and larger
+    all_larger_units = zip(
+        LARGER_CHINESE_NUMERING_UNITS_SIMPLIFIED, LARGER_CHINESE_NUMERING_UNITS_TRADITIONAL)
+    larger_units = [CNU.create(i, v, numbering_type, False)
+                    for i, v in enumerate(all_larger_units)]
+    # chinese number units of '十, 百, 千, 万'
+    all_smaller_units = zip(
+        SMALLER_CHINESE_NUMERING_UNITS_SIMPLIFIED, SMALLER_CHINESE_NUMERING_UNITS_TRADITIONAL)
+    smaller_units = [CNU.create(i, v, small_unit=True)
+                     for i, v in enumerate(all_smaller_units)]
+    # digits
+    chinese_digis = zip(CHINESE_DIGIS, CHINESE_DIGIS,
+                        BIG_CHINESE_DIGIS_SIMPLIFIED, BIG_CHINESE_DIGIS_TRADITIONAL)
+    digits = [CND.create(i, v) for i, v in enumerate(chinese_digis)]
+    digits[0].alt_s, digits[0].alt_t = ZERO_ALT, ZERO_ALT
+    digits[1].alt_s, digits[1].alt_t = ONE_ALT, ONE_ALT
+    digits[2].alt_s, digits[2].alt_t = TWO_ALTS[0], TWO_ALTS[1]
+
+    # symbols
+    positive_cn = CM(POSITIVE[0], POSITIVE[1], '+', lambda x: x)
+    negative_cn = CM(NEGATIVE[0], NEGATIVE[1], '-', lambda x: -x)
+    point_cn = CM(POINT[0], POINT[1], '.', lambda x,
+                  y: float(str(x) + '.' + str(y)))
+    # sil_cn = CM(SIL[0], SIL[1], '-', lambda x, y: float(str(x) + '-' + str(y)))
+    system = NumberSystem()
+    system.units = smaller_units + larger_units
+    system.digits = digits
+    system.math = MathSymbol(positive_cn, negative_cn, point_cn)
+    # system.symbols = OtherSymbol(sil_cn)
+    return system
+
+
+def chn2num(chinese_string, numbering_type=NUMBERING_TYPES[1]):
+
+    def get_symbol(char, system):
+        for u in system.units:
+            if char in [u.traditional, u.simplified, u.big_s, u.big_t]:
+                return u
+        for d in system.digits:
+            if char in [d.traditional, d.simplified, d.big_s, d.big_t, d.alt_s, d.alt_t]:
+                return d
+        for m in system.math:
+            if char in [m.traditional, m.simplified]:
+                return m
+
+    def string2symbols(chinese_string, system):
+        int_string, dec_string = chinese_string, ''
+        for p in [system.math.point.simplified, system.math.point.traditional]:
+            if p in chinese_string:
+                int_string, dec_string = chinese_string.split(p)
+                break
+        return [get_symbol(c, system) for c in int_string], \
+               [get_symbol(c, system) for c in dec_string]
+
+    def correct_symbols(integer_symbols, system):
+        """
+        一百八 to 一百八十
+        一亿一千三百万 to 一亿 一千万 三百万
+        """
+
+        if integer_symbols and isinstance(integer_symbols[0], CNU):
+            if integer_symbols[0].power == 1:
+                integer_symbols = [system.digits[1]] + integer_symbols
+
+        if len(integer_symbols) > 1:
+            if isinstance(integer_symbols[-1], CND) and isinstance(integer_symbols[-2], CNU):
+                integer_symbols.append(
+                    CNU(integer_symbols[-2].power - 1, None, None, None, None))
+
+        result = []
+        unit_count = 0
+        for s in integer_symbols:
+            if isinstance(s, CND):
+                result.append(s)
+                unit_count = 0
+            elif isinstance(s, CNU):
+                current_unit = CNU(s.power, None, None, None, None)
+                unit_count += 1
+
+            if unit_count == 1:
+                result.append(current_unit)
+            elif unit_count > 1:
+                for i in range(len(result)):
+                    if isinstance(result[-i - 1], CNU) and result[-i - 1].power < current_unit.power:
+                        result[-i - 1] = CNU(result[-i - 1].power +
+                                             current_unit.power, None, None, None, None)
+        return result
+
+    def compute_value(integer_symbols):
+        """
+        Compute the value.
+        When current unit is larger than previous unit, current unit * all previous units will be used as all previous units.
+        e.g. '两千万' = 2000 * 10000 not 2000 + 10000
+        """
+        value = [0]
+        last_power = 0
+        for s in integer_symbols:
+            if isinstance(s, CND):
+                value[-1] = s.value
+            elif isinstance(s, CNU):
+                value[-1] *= pow(10, s.power)
+                if s.power > last_power:
+                    value[:-1] = list(map(lambda v: v *
+                                                    pow(10, s.power), value[:-1]))
+                    last_power = s.power
+                value.append(0)
+        return sum(value)
+
+    system = create_system(numbering_type)
+    int_part, dec_part = string2symbols(chinese_string, system)
+    int_part = correct_symbols(int_part, system)
+    int_str = str(compute_value(int_part))
+    dec_str = ''.join([str(d.value) for d in dec_part])
+    if dec_part:
+        return '{0}.{1}'.format(int_str, dec_str)
+    else:
+        return int_str
+
+
+def num2chn(number_string, numbering_type=NUMBERING_TYPES[1], big=False,
+            traditional=False, alt_zero=False, alt_one=False, alt_two=True,
+            use_zeros=True, use_units=True):
+
+    def get_value(value_string, use_zeros=True):
+
+        striped_string = value_string.lstrip('0')
+
+        # record nothing if all zeros
+        if not striped_string:
+            return []
+
+        # record a single digit
+        elif len(striped_string) == 1:
+            if use_zeros and len(value_string) != len(striped_string):
+                return [system.digits[0], system.digits[int(striped_string)]]
+            else:
+                return [system.digits[int(striped_string)]]
+
+        # recursively record multiple digits
+        else:
+            result_unit = next(u for u in reversed(
+                system.units) if u.power < len(striped_string))
+            result_string = value_string[:-result_unit.power]
+            return get_value(result_string) + [result_unit] + get_value(striped_string[-result_unit.power:])
+
+    system = create_system(numbering_type)
+
+    int_dec = number_string.split('.')
+    if len(int_dec) == 1:
+        int_string = int_dec[0]
+        dec_string = ""
+    elif len(int_dec) == 2:
+        int_string = int_dec[0]
+        dec_string = int_dec[1]
+    else:
+        raise ValueError(
+            "invalid input num string with more than one dot: {}".format(number_string))
+
+    if use_units and len(int_string) > 1:
+        result_symbols = get_value(int_string)
+    else:
+        result_symbols = [system.digits[int(c)] for c in int_string]
+    dec_symbols = [system.digits[int(c)] for c in dec_string]
+    if dec_string:
+        result_symbols += [system.math.point] + dec_symbols
+
+    if alt_two:
+        liang = CND(2, system.digits[2].alt_s, system.digits[2].alt_t,
+                    system.digits[2].big_s, system.digits[2].big_t)
+        for i, v in enumerate(result_symbols):
+            if isinstance(v, CND) and v.value == 2:
+                next_symbol = result_symbols[i +
+                                             1] if i < len(result_symbols) - 1 else None
+                previous_symbol = result_symbols[i - 1] if i > 0 else None
+                if isinstance(next_symbol, CNU) and isinstance(previous_symbol, (CNU, type(None))):
+                    if next_symbol.power != 1 and ((previous_symbol is None) or (previous_symbol.power != 1)):
+                        result_symbols[i] = liang
+
+    # if big is True, '两' will not be used and `alt_two` has no impact on output
+    if big:
+        attr_name = 'big_'
+        if traditional:
+            attr_name += 't'
+        else:
+            attr_name += 's'
+    else:
+        if traditional:
+            attr_name = 'traditional'
+        else:
+            attr_name = 'simplified'
+
+    result = ''.join([getattr(s, attr_name) for s in result_symbols])
+
+    # if not use_zeros:
+    #     result = result.strip(getattr(system.digits[0], attr_name))
+
+    if alt_zero:
+        result = result.replace(
+            getattr(system.digits[0], attr_name), system.digits[0].alt_s)
+
+    if alt_one:
+        result = result.replace(
+            getattr(system.digits[1], attr_name), system.digits[1].alt_s)
+
+    for i, p in enumerate(POINT):
+        if result.startswith(p):
+            return CHINESE_DIGIS[0] + result
+
+    # ^10, 11, .., 19
+    if len(result) >= 2 and result[1] in [SMALLER_CHINESE_NUMERING_UNITS_SIMPLIFIED[0],
+                                          SMALLER_CHINESE_NUMERING_UNITS_TRADITIONAL[0]] and \
+            result[0] in [CHINESE_DIGIS[1], BIG_CHINESE_DIGIS_SIMPLIFIED[1], BIG_CHINESE_DIGIS_TRADITIONAL[1]]:
+        result = result[1:]
+
+    return result
+
+
+# ================================================================================ #
+#                          different types of rewriters
+# ================================================================================ #
+class Cardinal:
+    """
+    CARDINAL class
+    """
+
+    def __init__(self, cardinal=None, chntext=None):
+        self.cardinal = cardinal
+        self.chntext = chntext
+
+    def chntext2cardinal(self):
+        return chn2num(self.chntext)
+
+    def cardinal2chntext(self):
+        return num2chn(self.cardinal)
+
+class Digit:
+    """
+    DIGIT class
+    """
+
+    def __init__(self, digit=None, chntext=None):
+        self.digit = digit
+        self.chntext = chntext
+
+    # def chntext2digit(self):
+    #     return chn2num(self.chntext)
+
+    def digit2chntext(self):
+        return num2chn(self.digit, alt_two=False, use_units=False)
+
+
+class TelePhone:
+    """
+    TELEPHONE class
+    """
+
+    def __init__(self, telephone=None, raw_chntext=None, chntext=None):
+        self.telephone = telephone
+        self.raw_chntext = raw_chntext
+        self.chntext = chntext
+
+    # def chntext2telephone(self):
+    #     sil_parts = self.raw_chntext.split('<SIL>')
+    #     self.telephone = '-'.join([
+    #         str(chn2num(p)) for p in sil_parts
+    #     ])
+    #     return self.telephone
+
+    def telephone2chntext(self, fixed=False):
+
+        if fixed:
+            sil_parts = self.telephone.split('-')
+            self.raw_chntext = '<SIL>'.join([
+                num2chn(part, alt_two=False, use_units=False) for part in sil_parts
+            ])
+            self.chntext = self.raw_chntext.replace('<SIL>', '')
+        else:
+            sp_parts = self.telephone.strip('+').split()
+            self.raw_chntext = '<SP>'.join([
+                num2chn(part, alt_two=False, use_units=False) for part in sp_parts
+            ])
+            self.chntext = self.raw_chntext.replace('<SP>', '')
+        return self.chntext
+
+
+class Fraction:
+    """
+    FRACTION class
+    """
+
+    def __init__(self, fraction=None, chntext=None):
+        self.fraction = fraction
+        self.chntext = chntext
+
+    def chntext2fraction(self):
+        denominator, numerator = self.chntext.split('分之')
+        return chn2num(numerator) + '/' + chn2num(denominator)
+
+    def fraction2chntext(self):
+        numerator, denominator = self.fraction.split('/')
+        return num2chn(denominator) + '分之' + num2chn(numerator)
+
+
+class Date:
+    """
+    DATE class
+    """
+
+    def __init__(self, date=None, chntext=None):
+        self.date = date
+        self.chntext = chntext
+
+    # def chntext2date(self):
+    #     chntext = self.chntext
+    #     try:
+    #         year, other = chntext.strip().split('年', maxsplit=1)
+    #         year = Digit(chntext=year).digit2chntext() + '年'
+    #     except ValueError:
+    #         other = chntext
+    #         year = ''
+    #     if other:
+    #         try:
+    #             month, day = other.strip().split('月', maxsplit=1)
+    #             month = Cardinal(chntext=month).chntext2cardinal() + '月'
+    #         except ValueError:
+    #             day = chntext
+    #             month = ''
+    #         if day:
+    #             day = Cardinal(chntext=day[:-1]).chntext2cardinal() + day[-1]
+    #     else:
+    #         month = ''
+    #         day = ''
+    #     date = year + month + day
+    #     self.date = date
+    #     return self.date
+
+    def date2chntext(self):
+        date = self.date
+        try:
+            year, other = date.strip().split('年', 1)
+            year = Digit(digit=year).digit2chntext() + '年'
+        except ValueError:
+            other = date
+            year = ''
+        if other:
+            try:
+                month, day = other.strip().split('月', 1)
+                month = Cardinal(cardinal=month).cardinal2chntext() + '月'
+            except ValueError:
+                day = other
+                month = ''
+            if day:
+                day = Cardinal(cardinal=day[:-1]).cardinal2chntext() + day[-1]
+        else:
+            month = ''
+            day = ''
+        chntext = year + month + day
+        self.chntext = chntext
+        return self.chntext
+
+
+class Money:
+    """
+    MONEY class
+    """
+
+    def __init__(self, money=None, chntext=None):
+        self.money = money
+        self.chntext = chntext
+
+    # def chntext2money(self):
+    #     return self.money
+
+    def money2chntext(self):
+        money = self.money
+        pattern = re.compile(r'(\d+(\.\d+)?)')
+        matchers = pattern.findall(money)
+        if matchers:
+            for matcher in matchers:
+                money = money.replace(matcher[0], Cardinal(cardinal=matcher[0]).cardinal2chntext())
+        self.chntext = money
+        return self.chntext
+
+
+class Percentage:
+    """
+    PERCENTAGE class
+    """
+
+    def __init__(self, percentage=None, chntext=None):
+        self.percentage = percentage
+        self.chntext = chntext
+
+    def chntext2percentage(self):
+        return chn2num(re.sub(r'^百分之', '', self.chntext.strip())) + '%'
+
+    def percentage2chntext(self):
+        return '百分之' + num2chn(self.percentage.strip().strip('%'))
+
+
+def remove_erhua(text, er_whitelist):
+    """
+    Remove the erhua suffix '儿' from words, leaving whitelisted words intact:
+    他女儿在那边儿 -> 他女儿在那边
+    """
+
+    er_pattern = re.compile(er_whitelist)
+    new_str=''
+    while re.search('儿',text):
+        a = re.search('儿',text).span()
+        remove_er_flag = 0
+
+        if er_pattern.search(text):
+            b = er_pattern.search(text).span()
+            if b[0] <= a[0]:
+                remove_er_flag = 1
+
+        if remove_er_flag == 0 :
+            new_str = new_str + text[0:a[0]]
+            text = text[a[1]:]
+        else:
+            new_str = new_str + text[0:b[1]]
+            text = text[b[1]:]
+
+    text = new_str + text
+    return text
+
+# ================================================================================ #
+#                            NSW Normalizer
+# ================================================================================ #
+class NSWNormalizer:
+    def __init__(self, raw_text):
+        self.raw_text = '^' + raw_text + '$'
+        self.norm_text = ''
+
+    def _particular(self):
+        text = self.norm_text
+        pattern = re.compile(r"(([a-zA-Z]+)二([a-zA-Z]+))")
+        matchers = pattern.findall(text)
+        if matchers:
+            # print('particular')
+            for matcher in matchers:
+                text = text.replace(matcher[0], matcher[1]+'2'+matcher[2], 1)
+        self.norm_text = text
+        return self.norm_text
+
+    def normalize(self):
+        text = self.raw_text
+
+        # normalize dates
+        pattern = re.compile(r"\D+((([089]\d|(19|20)\d{2})年)?(\d{1,2}月(\d{1,2}[日号])?)?)")
+        matchers = pattern.findall(text)
+        if matchers:
+            #print('date')
+            for matcher in matchers:
+                text = text.replace(matcher[0], Date(date=matcher[0]).date2chntext(), 1)
+
+        # normalize money amounts
+        pattern = re.compile(r"\D+((\d+(\.\d+)?)[多余几]?" + CURRENCY_UNITS + r"(\d" + CURRENCY_UNITS + r"?)?)")
+        matchers = pattern.findall(text)
+        if matchers:
+            #print('money')
+            for matcher in matchers:
+                text = text.replace(matcher[0], Money(money=matcher[0]).money2chntext(), 1)
+
+        # normalize landline / mobile phone numbers
+        # mobile
+        # http://www.jihaoba.com/news/show/13680
+        # China Mobile: 139, 138, 137, 136, 135, 134, 159, 158, 157, 150, 151, 152, 188, 187, 182, 183, 184, 178, 198
+        # China Unicom: 130, 131, 132, 156, 155, 186, 185, 176
+        # China Telecom: 133, 153, 189, 180, 181, 177
+        pattern = re.compile(r"\D((\+?86 ?)?1([38]\d|5[0-35-9]|7[678]|9[89])\d{8})\D")
+        matchers = pattern.findall(text)
+        if matchers:
+            #print('telephone')
+            for matcher in matchers:
+                text = text.replace(matcher[0], TelePhone(telephone=matcher[0]).telephone2chntext(), 1)
+        # landline
+        pattern = re.compile(r"\D((0(10|2[1-3]|[3-9]\d{2})-?)?[1-9]\d{6,7})\D")
+        matchers = pattern.findall(text)
+        if matchers:
+            # print('fixed telephone')
+            for matcher in matchers:
+                text = text.replace(matcher[0], TelePhone(telephone=matcher[0]).telephone2chntext(fixed=True), 1)
+
+        # normalize fractions
+        pattern = re.compile(r"(\d+/\d+)")
+        matchers = pattern.findall(text)
+        if matchers:
+            #print('fraction')
+            for matcher in matchers:
+                text = text.replace(matcher, Fraction(fraction=matcher).fraction2chntext(), 1)
+
+        # normalize percentages
+        text = text.replace('％', '%')
+        pattern = re.compile(r"(\d+(\.\d+)?%)")
+        matchers = pattern.findall(text)
+        if matchers:
+            #print('percentage')
+            for matcher in matchers:
+                text = text.replace(matcher[0], Percentage(percentage=matcher[0]).percentage2chntext(), 1)
+
+        # normalize plain number + quantifier
+        pattern = re.compile(r"(\d+(\.\d+)?)[多余几]?" + COM_QUANTIFIERS)
+        matchers = pattern.findall(text)
+        if matchers:
+            #print('cardinal+quantifier')
+            for matcher in matchers:
+                text = text.replace(matcher[0], Cardinal(cardinal=matcher[0]).cardinal2chntext(), 1)
+
+        # Normalize digit sequences (IDs, serial numbers)
+        pattern = re.compile(r"(\d{4,32})")
+        matchers = pattern.findall(text)
+        if matchers:
+            #print('digit')
+            for matcher in matchers:
+                text = text.replace(matcher, Digit(digit=matcher).digit2chntext(), 1)
+
+        # Normalize plain cardinal numbers
+        pattern = re.compile(r"(\d+(\.\d+)?)")
+        matchers = pattern.findall(text)
+        if matchers:
+            #print('cardinal')
+            for matcher in matchers:
+                text = text.replace(matcher[0], Cardinal(cardinal=matcher[0]).cardinal2chntext(), 1)
+
+        self.norm_text = text
+        self._particular()
+
+        return self.norm_text.lstrip('^').rstrip('$')
+
+
+def nsw_test_case(raw_text):
+    print('I:' + raw_text)
+    print('O:' + NSWNormalizer(raw_text).normalize())
+    print('')
+
+
+def nsw_test():
+    nsw_test_case('固话：0595-23865596或23880880。')
+    nsw_test_case('固话：0595-23865596或23880880。')
+    nsw_test_case('手机：+86 19859213959或15659451527。')
+    nsw_test_case('分数：32477/76391。')
+    nsw_test_case('百分数：80.03%。')
+    nsw_test_case('编号：31520181154418。')
+    nsw_test_case('纯数：2983.07克或12345.60米。')
+    nsw_test_case('日期：1999年2月20日或09年3月15号。')
+    nsw_test_case('金钱：12块5，34.5元，20.1万')
+    nsw_test_case('特殊：O2O或B2C。')
+    nsw_test_case('3456万吨')
+    nsw_test_case('2938个')
+    nsw_test_case('938')
+    nsw_test_case('今天吃了115个小笼包231个馒头')
+    nsw_test_case('有62％的概率')
+
+
+if __name__ == '__main__':
+    #nsw_test()
+
+    p = argparse.ArgumentParser()
+    p.add_argument('ifile', help='input filename, assume utf-8 encoding')
+    p.add_argument('ofile', help='output filename')
+    p.add_argument('--to_upper', action='store_true', help='convert to upper case')
+    p.add_argument('--to_lower', action='store_true', help='convert to lower case')
+    p.add_argument('--has_key', action='store_true', help="input text has Kaldi's key as first field.")
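+    p.add_argument('--remove_fillers', type=lambda s: s.lower() in ('true', '1', 'yes'), default=True, help='remove filler chars such as "呃, 啊" (true/false)')
+    p.add_argument('--remove_erhua', type=lambda s: s.lower() in ('true', '1', 'yes'), default=True, help='remove erhua chars such as "这儿" (true/false)')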
+    p.add_argument('--log_interval', type=int, default=10000, help='log interval in number of processed lines')
+    args = p.parse_args()
+
+    ifile = codecs.open(args.ifile, 'r', 'utf8')
+    ofile = codecs.open(args.ofile, 'w+', 'utf8')
+
+    n = 0
+    for l in ifile:
+        key = ''
+        text = ''
+        if args.has_key:
+            cols = l.split(maxsplit=1)
+            key = cols[0]
+            if len(cols) == 2:
+                text = cols[1].strip()
+            else:
+                text = ''
+        else:
+            text = l.strip()
+
+        # cases
+        if args.to_upper and args.to_lower:
+            sys.stderr.write('text norm: --to_upper and --to_lower are mutually exclusive.\n')
+            sys.exit(1)
+        if args.to_upper:
+            text = text.upper()
+        if args.to_lower:
+            text = text.lower()
+
+        # Filler chars removal
+        if args.remove_fillers:
+            for ch in FILLER_CHARS:
+                text = text.replace(ch, '')
+
+        if args.remove_erhua:
+            text = remove_erhua(text, ER_WHITELIST)
+
+        # NSW(Non-Standard-Word) normalization
+        text = NSWNormalizer(text).normalize()
+
+        # Punctuations removal
+        old_chars = CHINESE_PUNC_LIST + string.punctuation # includes all CN and EN punctuations
+        new_chars = ' ' * len(old_chars)
+        del_chars = ''
+        text = text.translate(str.maketrans(old_chars, new_chars, del_chars))
+
+        #
+        if args.has_key:
+            ofile.write(key + '\t' + text + '\n')
+        else:
+            ofile.write(text + '\n')
+
+        n += 1
+        if n % args.log_interval == 0:
+            sys.stderr.write("text norm: {} lines done.\n".format(n))
+
+    sys.stderr.write("text norm: {} lines done in total.\n".format(n))
+
+    ifile.close()
+    ofile.close()
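The numeric passes in `normalize()` above share one pattern: compile a regex, collect the matches with `findall`, then replace each match once, in order. The sketch below isolates that pattern for digit runs of 4 to 32 characters. `digits_to_text` is a hypothetical stand-in for `Digit(...).digit2chntext()`, whose definition is not shown in this part of the diff and may read digits differently; treat it as an assumption for illustration only.

```python
import re

# Hypothetical stand-in for Digit(...).digit2chntext(): reads digits one by one.
DIGIT_CHARS = "零一二三四五六七八九"


def digits_to_text(digits: str) -> str:
    """Map every ASCII digit to its Chinese character."""
    return "".join(DIGIT_CHARS[int(d)] for d in digits)


def normalize_digit_runs(text: str) -> str:
    """Replace runs of 4 to 32 digits, one match at a time."""
    pattern = re.compile(r"(\d{4,32})")
    for match in pattern.findall(text):
        # count=1 rewrites only the first remaining occurrence, so repeated
        # numbers are consumed in the order findall returned them.
        text = text.replace(match, digits_to_text(match), 1)
    return text


if __name__ == "__main__":
    print(normalize_digit_runs("编号：31520181154418。"))
```

Shorter runs such as `938` are left untouched by this pass; in the script they fall through to the later cardinal-number pass.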
diff --git a/egs_modelscope/aishell/paraformer/README.md b/egs_modelscope/aishell/paraformer/README.md
new file mode 100644
index 000000000..48a5621b1
--- /dev/null
+++ b/egs_modelscope/aishell/paraformer/README.md
@@ -0,0 +1,38 @@
+# ModelScope: Paraformer-large Model
+
+## Highlight
+
+### ModelScope: Paraformer-Large Model
+- <strong>Fast</strong>: As a non-autoregressive (NAR) model, Paraformer achieves performance comparable to the state-of-the-art autoregressive (AR) Transformer, with more than 10x speedup.
+- <strong>Accurate</strong>: State-of-the-art results on many public ASR tasks, with significant relative improvements; suitable for industrial deployment.
+- <strong>Convenient</strong>: Quickly and easily download Paraformer-large from ModelScope for finetuning and inference.
+    - Supports finetuning and inference on AISHELL-1 and AISHELL-2.
+    - Supports inference on AISHELL-1, AISHELL-2, Wenetspeech, SpeechIO and other audio.
+
+## How to finetune and infer using a pretrained ModelScope Paraformer-large Model
+
+### Finetune
+- Modify the finetuning parameters in `conf/train_asr_paraformer_sanm_50e_16d_2048_512_lfr6.yaml`.
+- Set the following parameters in `paraformer_large_finetune.sh`:
+    - <strong>data_aishell:</strong> path to the AISHELL data
+    - <strong>tag:</strong> experiment tag
+    - <strong>init_model_name:</strong> speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch # pre-trained model, downloaded from ModelScope during finetuning
+- Then run the pipeline to finetune the model downloaded from ModelScope and run inference with the finetuned model:
+```sh
+    bash ./paraformer_large_finetune.sh
+```
+
+### Inference
+
+Alternatively, download the model from ModelScope and run inference directly.
+
+- Set the following parameters in `paraformer_large_infer.sh`:
+    - <strong>ori_data:</strong> path to the raw AISHELL data
+    - <strong>data_dir:</strong> data output directory
+    - <strong>exp_dir:</strong> result output path
+    - <strong>model_name:</strong> speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch # pre-trained model, downloaded from ModelScope
+    - <strong>test_sets:</strong> names of the test sets
+- Then run the inference pipeline:
+```sh
+    bash ./paraformer_large_infer.sh
+```
diff --git a/egs_modelscope/aishell/paraformer/RESULTS.md b/egs_modelscope/aishell/paraformer/RESULTS.md
new file mode 100644
index 000000000..516750453
--- /dev/null
+++ b/egs_modelscope/aishell/paraformer/RESULTS.md
@@ -0,0 +1,24 @@
+# Paraformer-Large
+- Model link: <https://www.modelscope.cn/models/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary>
+- Model size: 220M
+- Train config: conf/train_asr_paraformer_sanm_50e_16d_2048_512_lfr6.yaml
+
+# Environments
+- date: `Tue Nov 22 18:48:39 CST 2022`
+- python version: `3.7.12`
+- FunASR version: `0.1.0`
+- pytorch version: `pytorch 1.7.0`
+- Git hash: ``
+- Commit date: ``
+
+# Benchmark Results
+
+## AISHELL-1
+- Decode config: conf/decode_asr_transformer_noctc_1best.yaml
+  - Decode without CTC
+  - Decode without LM
+
+| testset   | CER(%)|
+|:---------:|:-----:|
+| dev       | 1.75  |
+| test      | 1.95  |
diff --git a/egs_modelscope/aishell/paraformer/conf/decode_asr_transformer_noctc_10best_lm_weight_0.15.yaml b/egs_modelscope/aishell/paraformer/conf/decode_asr_transformer_noctc_10best_lm_weight_0.15.yaml
new file mode 100644
index 000000000..22f02d913
--- /dev/null
+++ b/egs_modelscope/aishell/paraformer/conf/decode_asr_transformer_noctc_10best_lm_weight_0.15.yaml
@@ -0,0 +1,6 @@
+beam_size: 10
+penalty: 0.0
+maxlenratio: 0.0
+minlenratio: 0.0
+ctc_weight: 0.0
+lm_weight: 0.15
diff --git a/egs_modelscope/aishell/paraformer/conf/decode_asr_transformer_noctc_1best.yaml b/egs_modelscope/aishell/paraformer/conf/decode_asr_transformer_noctc_1best.yaml
new file mode 100644
index 000000000..e6231927c
--- /dev/null
+++ b/egs_modelscope/aishell/paraformer/conf/decode_asr_transformer_noctc_1best.yaml
@@ -0,0 +1,6 @@
+beam_size: 1
+penalty: 0.0
+maxlenratio: 0.0
+minlenratio: 0.0
+ctc_weight: 0.0
+lm_weight: 0.0
diff --git a/egs_modelscope/aishell/paraformer/conf/train_asr_paraformer_sanm_50e_16d_2048_512_lfr6.yaml b/egs_modelscope/aishell/paraformer/conf/train_asr_paraformer_sanm_50e_16d_2048_512_lfr6.yaml
new file mode 100644
index 000000000..e9210f373
--- /dev/null
+++ b/egs_modelscope/aishell/paraformer/conf/train_asr_paraformer_sanm_50e_16d_2048_512_lfr6.yaml
@@ -0,0 +1,91 @@
+# network architecture
+# encoder related
+encoder_conf:
+    dropout_rate: 0.1
+    positional_dropout_rate: 0.1
+    attention_dropout_rate: 0.1
+
+# decoder related
+decoder_conf:
+    dropout_rate: 0.1
+    positional_dropout_rate: 0.1
+    self_attention_dropout_rate: 0.1
+    src_attention_dropout_rate: 0.1
+
+predictor_conf:
+  threshold: 1.0
+  l_order: 1
+  r_order: 1
+  tail_threshold: 0.45
+
+# hybrid CTC/attention
+model_conf:
+    ctc_weight: 0.0
+    lsm_weight: 0.1     # label smoothing option
+    length_normalized_loss: true
+    predictor_weight: 1.0
+    predictor_bias: 1
+    sampling_ratio: 0.75
+
+# minibatch related
+# dataset_type: small
+batch_type: length
+batch_bins: 2000
+num_workers: 16
+# dataset_type: large
+dataset_conf:
+    filter_conf:
+        min_length: 10
+        max_length: 250
+        min_token_length: 1
+        max_token_length: 200
+    shuffle: true
+    shuffle_conf:
+        shuffle_size: 10240
+        sort_size: 500
+    batch_conf:
+        batch_type: 'token'
+        batch_size: 6000
+    num_workers: 16
+
+# optimization related
+accum_grad: 1
+grad_clip: 5
+max_epoch: 20
+val_scheduler_criterion:
+    - valid
+    - acc
+best_model_criterion:
+-   - valid
+    - acc
+    - max
+keep_nbest_models: 10
+
+optim: adam
+optim_conf:
+   lr: 0.0005
+scheduler: warmuplr
+scheduler_conf:
+   warmup_steps: 30000
+
+specaug: specaug_lfr
+specaug_conf:
+    apply_time_warp: false
+    time_warp_window: 5
+    time_warp_mode: bicubic
+    apply_freq_mask: true
+    freq_mask_width_range:
+    - 0
+    - 30
+    lfr_rate: 6
+    num_freq_mask: 1
+    apply_time_mask: true
+    time_mask_width_range:
+    - 0
+    - 12
+    num_time_mask: 1
+
+unused_parameters: true
+log_interval: 50
+normalize: None
+split_with_space: true
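The optimizer settings above (`lr: 0.0005`, `scheduler: warmuplr`, `warmup_steps: 30000`) can be made concrete with a small sketch. It assumes that `warmuplr` follows the ESPnet-style warmup rule, where the rate rises linearly until `warmup_steps` and then decays with the inverse square root of the step. The scheduler implementation is not part of this diff, so the formula is an assumption and not a statement about FunASR's code.

```python
def warmup_lr(step: int, base_lr: float = 0.0005, warmup_steps: int = 30000) -> float:
    """Learning rate at a given step (step >= 1) under the assumed warmup rule."""
    # Linear ramp while step < warmup_steps, inverse-sqrt decay afterwards.
    scale = min(step ** -0.5, step * warmup_steps ** -1.5)
    return base_lr * warmup_steps ** 0.5 * scale


if __name__ == "__main__":
    for step in (1, 15000, 30000, 120000):
        print(step, warmup_lr(step))
```

Under this rule the rate peaks at exactly `lr` when the step count reaches `warmup_steps`, so `lr` in the config is the maximum rate rather than the starting one.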
diff --git a/egs_modelscope/aishell/paraformer/local/aishell_data_prep.sh b/egs_modelscope/aishell/paraformer/local/aishell_data_prep.sh
new file mode 100755
index 000000000..83f489b3c
--- /dev/null
+++ b/egs_modelscope/aishell/paraformer/local/aishell_data_prep.sh
@@ -0,0 +1,66 @@
+#!/bin/bash
+
+# Copyright 2017 Xingyu Na
+# Apache 2.0
+
+#. ./path.sh || exit 1;
+
+if [ $# != 3 ]; then
+  echo "Usage: $0 <audio-path> <text-path> <output-path>"
+  echo " $0 /export/a05/xna/data/data_aishell/wav /export/a05/xna/data/data_aishell/transcript data"
+  exit 1;
+fi
+
+aishell_audio_dir=$1
+aishell_text=$2/aishell_transcript_v0.8.txt
+output_dir=$3
+
+train_dir=$output_dir/data/local/train
+dev_dir=$output_dir/data/local/dev
+test_dir=$output_dir/data/local/test
+tmp_dir=$output_dir/data/local/tmp
+
+mkdir -p $train_dir
+mkdir -p $dev_dir
+mkdir -p $test_dir
+mkdir -p $tmp_dir
+
+# data directory check
+if [ ! -d $aishell_audio_dir ] || [ ! -f $aishell_text ]; then
+  echo "Error: $0 requires an audio directory and a transcript directory containing aishell_transcript_v0.8.txt"
+  exit 1;
+fi
+
+# find wav audio file for train, dev and test resp.
+find $aishell_audio_dir -iname "*.wav" > $tmp_dir/wav.flist
+n=`cat $tmp_dir/wav.flist | wc -l`
+[ $n -ne 141925 ] && \
+  echo Warning: expected 141925 data files, found $n
+
+grep -i "wav/train" $tmp_dir/wav.flist > $train_dir/wav.flist || exit 1;
+grep -i "wav/dev" $tmp_dir/wav.flist > $dev_dir/wav.flist || exit 1;
+grep -i "wav/test" $tmp_dir/wav.flist > $test_dir/wav.flist || exit 1;
+
+rm -r $tmp_dir
+
+# Transcriptions preparation
+for dir in $train_dir $dev_dir $test_dir; do
+  echo Preparing $dir transcriptions
+  sed -e 's/\.wav//' $dir/wav.flist | awk -F '/' '{print $NF}' > $dir/utt.list
+  paste -d' ' $dir/utt.list $dir/wav.flist > $dir/wav.scp_all
+  utils/filter_scp.pl -f 1 $dir/utt.list $aishell_text > $dir/transcripts.txt
+  awk '{print $1}' $dir/transcripts.txt > $dir/utt.list
+  utils/filter_scp.pl -f 1 $dir/utt.list $dir/wav.scp_all | sort -u > $dir/wav.scp
+  sort -u $dir/transcripts.txt > $dir/text
+done
+
+mkdir -p $output_dir/data/train $output_dir/data/dev $output_dir/data/test
+
+for f in wav.scp text; do
+  cp $train_dir/$f $output_dir/data/train/$f || exit 1;
+  cp $dev_dir/$f $output_dir/data/dev/$f || exit 1;
+  cp $test_dir/$f $output_dir/data/test/$f || exit 1;
+done
+
+echo "$0: AISHELL data preparation succeeded"
+exit 0;
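The transcription loop in `aishell_data_prep.sh` derives an utterance ID from each wav file name, keeps only utterances that have a transcript, and writes sorted `wav.scp` and `text` files. The Python sketch below mirrors that logic in memory as an approximation of the `sed`/`awk`/`filter_scp.pl`/`sort -u` pipeline. The function name and the sample paths are hypothetical, and sort order may differ from `sort -u` under some locales.

```python
import os


def build_wav_scp(wav_paths, transcripts):
    """Return (wav_scp_lines, text_lines) for utterances that have a transcript.

    wav_paths:   iterable of paths ending in .wav
    transcripts: dict mapping utterance ID -> transcript string
    """
    # Utterance ID = file name without directory and without the .wav suffix.
    utt_to_wav = {
        os.path.splitext(os.path.basename(path))[0]: path for path in wav_paths
    }
    # Keep only utterances present in the transcript file, sorted by ID.
    kept = sorted(utt for utt in utt_to_wav if utt in transcripts)
    wav_scp = [f"{utt} {utt_to_wav[utt]}" for utt in kept]
    text = [f"{utt} {transcripts[utt]}" for utt in kept]
    return wav_scp, text


if __name__ == "__main__":
    # Made-up sample data for illustration only.
    wavs = ["/data/wav/dev/S0001/UTT0002.wav", "/data/wav/dev/S0001/UTT0001.wav"]
    trans = {"UTT0001": "你 好"}
    print(build_wav_scp(wavs, trans))
```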
diff --git a/egs_modelscope/aishell/paraformer/modelscope_utils b/egs_modelscope/aishell/paraformer/modelscope_utils
new file mode 120000
index 000000000..fc97768c8
--- /dev/null
+++ b/egs_modelscope/aishell/paraformer/modelscope_utils
@@ -0,0 +1 @@
+../../common/modelscope_utils
\ No newline at end of file
diff --git a/egs_modelscope/aishell/paraformer/paraformer_large_finetune.sh b/egs_modelscope/aishell/paraformer/paraformer_large_finetune.sh
new file mode 100755
index 000000000..a68338fb9
--- /dev/null
+++ b/egs_modelscope/aishell/paraformer/paraformer_large_finetune.sh
@@ -0,0 +1,224 @@
+#!/usr/bin/env bash
+
+. ./path.sh || exit 1;
+
+# machines configuration
+CUDA_VISIBLE_DEVICES="0,1" # set gpus, e.g., CUDA_VISIBLE_DEVICES="0,1"
+gpu_num=2
+count=1
+gpu_inference=true # Whether to perform gpu decoding, set false for cpu decoding
+njob=4 # the number of jobs for each gpu
+train_cmd=utils/run.pl
+
+# general configuration
+feats_dir="." # feature output directory, for large data
+exp_dir="."
+lang=zh
+dumpdir=dump/fbank
+feats_type=fbank
+token_type=char
+scp=feats.scp
+type=kaldi_ark
+stage=0
+stop_stage=4
+
+# feature configuration
+feats_dim=560
+sample_frequency=16000
+nj=32
+speed_perturb="1.0"
+lfr=True
+lfr_m=7
+lfr_n=6
+
+init_model_name=speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch  # pre-trained model, download from modelscope during fine-tuning
+cmvn_file=init_model/${init_model_name}/am.mvn
+seg_file=init_model/${init_model_name}/seg_dict
+vocab=init_model/${init_model_name}/tokens.txt
+
+# data
+data_aishell=
+
+# exp tag
+tag=""
+
+# Set bash to strict mode; it will exit on:
+# -e 'error', -u 'undefined variable', -o pipefail 'error in pipeline'
+set -e
+set -u
+set -o pipefail
+
+train_set=train
+valid_set=dev
+test_sets="dev test"
+
+asr_config=conf/train_asr_paraformer_sanm_50e_16d_2048_512_lfr6.yaml
+init_param="init_model/${init_model_name}/${init_model_name}"
+
+inference_config=conf/decode_asr_transformer_noctc_1best.yaml
+inference_asr_model=valid.acc.ave_10best.pth
+
+. utils/parse_options.sh || exit 1;
+
+# download model from modelscope
+python modelscope_utils/download_model.py --model_name ${init_model_name}
+
+if [ ! -d ${HOME}/.cache/modelscope/hub/damo/${init_model_name} ]; then
+    echo "${HOME}/.cache/modelscope/hub/damo/${init_model_name} must exist"
+    exit 1
+else
+    if [ -d init_model/${init_model_name} ]; then
+        echo "init_model/${init_model_name} already exists. If you want to copy the downloaded model again, please delete init_model/${init_model_name} first."
+    else
+        mkdir -p init_model/${init_model_name}
+        cp -r ${HOME}/.cache/modelscope/hub/damo/${init_model_name}/* init_model/${init_model_name}
+    fi
+fi
+
+model_dir="baseline_$(basename "${asr_config}" .yaml)_${feats_type}_${lang}_${token_type}_${tag}"
+
+# you can set gpu num for decoding here
+gpuid_list=$CUDA_VISIBLE_DEVICES  # set gpus for decoding, the same as training stage by default
+ngpu=$(echo $gpuid_list | awk -F "," '{print NF}')
+inference_nj=$((ngpu * njob))
+
+if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
+    echo "stage 0: Data preparation"
+    # Data preparation
+    local/aishell_data_prep.sh ${data_aishell}/data_aishell/wav ${data_aishell}/data_aishell/transcript ${feats_dir}
+    for x in train dev test; do
+        cp ${feats_dir}/data/${x}/text ${feats_dir}/data/${x}/text.org
+        paste -d " " <(cut -f 1 -d" " ${feats_dir}/data/${x}/text.org) <(cut -f 2- -d" " ${feats_dir}/data/${x}/text.org | tr -d " ") \
+            > ${feats_dir}/data/${x}/text
+        rm ${feats_dir}/data/${x}/text.org
+    done
+fi
+
+feat_train_dir=${feats_dir}/${dumpdir}/train; mkdir -p ${feat_train_dir}
+feat_dev_dir=${feats_dir}/${dumpdir}/dev; mkdir -p ${feat_dev_dir}
+feat_test_dir=${feats_dir}/${dumpdir}/test; mkdir -p ${feat_test_dir}
+if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
+    echo "stage 1: Feature Generation"
+    # compute fbank features
+    fbankdir=${feats_dir}/fbank
+    utils/compute_fbank.sh --cmd "$train_cmd" --nj $nj --speed_perturb ${speed_perturb} \
+        ${feats_dir}/data/train ${exp_dir}/exp/make_fbank/train ${fbankdir}/train
+    utils/fix_data_feat.sh ${fbankdir}/train
+    utils/compute_fbank.sh --cmd "$train_cmd" --nj $nj \
+        ${feats_dir}/data/dev ${exp_dir}/exp/make_fbank/dev ${fbankdir}/dev
+    utils/fix_data_feat.sh ${fbankdir}/dev
+    utils/compute_fbank.sh --cmd "$train_cmd" --nj $nj \
+        ${feats_dir}/data/test ${exp_dir}/exp/make_fbank/test ${fbankdir}/test
+    utils/fix_data_feat.sh ${fbankdir}/test
+
+    echo "apply low_frame_rate and cmvn"
+    [ ! -f ${cmvn_file} ] && echo "$0: cmvn file is required" && exit 1;
+    utils/apply_lfr_and_cmvn.sh --cmd "$train_cmd" --nj $nj \
+        --lfr $lfr --lfr-m $lfr_m --lfr-n $lfr_n \
+        ${fbankdir}/train ${cmvn_file} ${exp_dir}/exp/make_fbank/train ${feat_train_dir}
+    utils/apply_lfr_and_cmvn.sh --cmd "$train_cmd" --nj $nj \
+        --lfr $lfr --lfr-m $lfr_m --lfr-n $lfr_n \
+        ${fbankdir}/dev ${cmvn_file} ${exp_dir}/exp/make_fbank/dev ${feat_dev_dir}
+    utils/apply_lfr_and_cmvn.sh --cmd "$train_cmd" --nj $nj \
+        --lfr $lfr --lfr-m $lfr_m --lfr-n $lfr_n \
+        ${fbankdir}/test ${cmvn_file} ${exp_dir}/exp/make_fbank/test ${feat_test_dir}
+
+    echo "Text Tokenize"
+    # 我爱reading->我 爱 read@@ ing
+    utils/text_tokenize.sh --cmd "$train_cmd" --nj $nj ${fbankdir}/train ${seg_file} ${feat_train_dir}/log ${feat_train_dir}
+    utils/fix_data_feat.sh ${feat_train_dir}
+    utils/text_tokenize.sh --cmd "$train_cmd" --nj $nj ${fbankdir}/dev ${seg_file} ${feat_dev_dir}/log ${feat_dev_dir}
+    utils/fix_data_feat.sh ${feat_dev_dir}
+    cp ${fbankdir}/test/text ${feat_test_dir}
+fi
+
+token_list=${feats_dir}/data/${lang}_token_list/char/tokens.txt
+echo "dictionary: ${token_list}"
+if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
+    echo "stage 2: Dictionary Preparation"
+    mkdir -p ${feats_dir}/data/${lang}_token_list/char/
+    cp $vocab ${token_list}
+
+    vocab_size=$(wc -l <${token_list})
+    awk -v v=,${vocab_size} '{print $0v}' ${feat_train_dir}/text_shape > ${feat_train_dir}/text_shape.char
+    awk -v v=,${vocab_size} '{print $0v}' ${feat_dev_dir}/text_shape > ${feat_dev_dir}/text_shape.char
+    mkdir -p ${feats_dir}/asr_stats_fbank_zh_char/train
+    mkdir -p ${feats_dir}/asr_stats_fbank_zh_char/dev
+    cp ${feat_train_dir}/speech_shape ${feat_train_dir}/text_shape ${feat_train_dir}/text_shape.char ${feats_dir}/asr_stats_fbank_zh_char/train
+    cp ${feat_dev_dir}/speech_shape ${feat_dev_dir}/text_shape ${feat_dev_dir}/text_shape.char ${feats_dir}/asr_stats_fbank_zh_char/dev
+fi
+
+# Training Stage
+world_size=$gpu_num  # run on one machine
+if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
+    # update asr train config.yaml
+    python modelscope_utils/update_config.py --modelscope_config init_model/${init_model_name}/asr_train_config.yaml --finetune_config ${asr_config} --output_config init_model/${init_model_name}/asr_finetune_config.yaml
+    finetune_config=init_model/${init_model_name}/asr_finetune_config.yaml
+
+    mkdir -p ${exp_dir}/exp/${model_dir}
+    mkdir -p ${exp_dir}/exp/${model_dir}/log
+    INIT_FILE=$exp_dir/ddp_init
+    if [ -f $INIT_FILE ];then
+        rm -f $INIT_FILE
+    fi
+    init_method=file://$(readlink -f $INIT_FILE)
+    echo "$0: init method is $init_method"
+    for ((i = 0; i < $gpu_num; ++i)); do
+        {
+            rank=$i
+            local_rank=$i
+            gpu_id=$(echo $CUDA_VISIBLE_DEVICES | cut -d',' -f$((i + 1)))
+            asr_train_paraformer.py \
+                --gpu_id $gpu_id \
+                --use_preprocessor true \
+                --token_type $token_type \
+                --token_list $token_list \
+                --train_data_path_and_name_and_type ${feats_dir}/${dumpdir}/${train_set}/${scp},speech,${type} \
+                --train_data_path_and_name_and_type ${feats_dir}/${dumpdir}/${train_set}/text,text,text \
+                --train_shape_file ${feats_dir}/asr_stats_fbank_zh_char/${train_set}/speech_shape \
+                --train_shape_file ${feats_dir}/asr_stats_fbank_zh_char/${train_set}/text_shape.char \
+                --valid_data_path_and_name_and_type ${feats_dir}/${dumpdir}/${valid_set}/${scp},speech,${type} \
+                --valid_data_path_and_name_and_type ${feats_dir}/${dumpdir}/${valid_set}/text,text,text \
+                --valid_shape_file ${feats_dir}/asr_stats_fbank_zh_char/${valid_set}/speech_shape \
+                --valid_shape_file ${feats_dir}/asr_stats_fbank_zh_char/${valid_set}/text_shape.char  \
+                --resume true \
+                --output_dir ${exp_dir}/exp/${model_dir} \
+                --init_param $init_param \
+                --config $finetune_config \
+                --input_size $feats_dim \
+                --ngpu $gpu_num \
+                --num_worker_count $count \
+                --multiprocessing_distributed true \
+                --dist_init_method $init_method \
+                --dist_world_size $world_size \
+                --dist_rank $rank \
+                --local_rank $local_rank 1> ${exp_dir}/exp/${model_dir}/log/train.log.$i 2>&1
+        } &
+    done
+    wait
+fi
+
+# Testing Stage
+if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
+    ./utils/easy_asr_infer.sh \
+        --lang zh \
+        --datadir ${feats_dir} \
+        --feats_type ${feats_type} \
+        --feats_dim ${feats_dim} \
+        --token_type ${token_type} \
+        --gpu_inference ${gpu_inference} \
+        --inference_config "${inference_config}" \
+        --test_sets "${test_sets}" \
+        --token_list $token_list \
+        --asr_exp ${exp_dir}/exp/${model_dir} \
+        --stage 12 \
+        --stop_stage 12 \
+        --scp $scp \
+        --text text \
+        --inference_nj $inference_nj \
+        --njob $njob \
+        --inference_asr_model $inference_asr_model \
+        --gpuid_list $gpuid_list \
+        --mode paraformer
+fi
+
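The feature settings in `paraformer_large_finetune.sh` (`lfr_m=7`, `lfr_n=6`, `feats_dim=560`) describe low frame rate (LFR) stacking: `lfr_m` consecutive frames are concatenated into one vector and the window advances `lfr_n` frames per output. A dimension of 560 is consistent with 80-dimensional fbank frames (560 / 7 = 80), though the 80 is an inference and not stated in the script. The sketch below is a simplified illustration that pads the tail by repeating the last frame; `utils/apply_lfr_and_cmvn.sh` is not shown in this diff and may pad differently.

```python
def apply_lfr(frames, lfr_m=7, lfr_n=6):
    """Stack lfr_m consecutive frames, advancing lfr_n frames per output.

    frames: list of per-frame feature vectors (lists of floats).
    Returns a list of stacked vectors, each lfr_m times as long as a frame.
    """
    if not frames:
        return []
    stacked = []
    for start in range(0, len(frames), lfr_n):
        window = frames[start:start + lfr_m]
        # Simplifying assumption: pad a short final window with the last frame.
        window = window + [frames[-1]] * (lfr_m - len(window))
        stacked.append([value for frame in window for value in frame])
    return stacked


if __name__ == "__main__":
    # 100 frames of 80-dim features (assumed fbank size), filled with the frame index.
    feats = [[float(i)] * 80 for i in range(100)]
    out = apply_lfr(feats)
    print(len(out), len(out[0]))
```

With these settings the frame rate drops by a factor of six while each frame carries seven frames of context, which is why the training command passes `--input_size $feats_dim` with the stacked dimension.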
diff --git a/egs_modelscope/aishell/paraformer/paraformer_large_infer.sh b/egs_modelscope/aishell/paraformer/paraformer_large_infer.sh
new file mode 100755
index 000000000..8e2c8f33d
--- /dev/null
+++ b/egs_modelscope/aishell/paraformer/paraformer_large_infer.sh
@@ -0,0 +1,70 @@
+#!/usr/bin/env bash
+
+set -e
+set -u
+set -o pipefail
+
+ori_data=
+data_dir=
+exp_dir=
+model_name=speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch
+inference_nj=32
+gpuid_list="0,1" # set gpus, e.g., gpuid_list="0,1"
+ngpu=$(echo $gpuid_list | awk -F "," '{print NF}')
+njob=4  # the number of jobs for each gpu
+gpu_inference=true  # Whether to perform gpu decoding, set false for cpu decoding
+
+if ${gpu_inference}; then
+    inference_nj=$((ngpu * njob))
+else
+    inference_nj=$njob
+fi
+
+# LM configs
+use_lm=false
+beam_size=1
+lm_weight=0.0
+
+test_sets="dev test"
+
+. utils/parse_options.sh
+
+aishell_audio_dir=$ori_data/data_aishell/wav
+aishell_text=$ori_data/data_aishell/transcript/aishell_transcript_v0.8.txt
+dev_dir=${data_dir}/aishell/dev
+test_dir=${data_dir}/aishell/test
+tmp_dir=${data_dir}/aishell/tmp
+
+mkdir -p ${dev_dir}
+mkdir -p ${test_dir}
+mkdir -p ${tmp_dir}
+
+find $aishell_audio_dir -iname "*.wav" > $tmp_dir/wav.flist
+grep -i "wav/dev" $tmp_dir/wav.flist > $dev_dir/wav.flist || exit 1;
+grep -i "wav/test" $tmp_dir/wav.flist > $test_dir/wav.flist || exit 1;
+
+rm -r $tmp_dir
+
+for dir in $dev_dir $test_dir; do
+    sed -e 's/\.wav//' $dir/wav.flist | awk -F '/' '{print $NF}' > $dir/utt.list
+    paste -d' ' $dir/utt.list $dir/wav.flist > $dir/wav.scp_all
+    utils/filter_scp.pl -f 1 $dir/utt.list $aishell_text > $dir/transcripts.txt
+    awk '{print $1}' $dir/transcripts.txt > $dir/utt.list
+    utils/filter_scp.pl -f 1 $dir/utt.list $dir/wav.scp_all | sort -u > $dir/wav.scp
+    sort -u $dir/transcripts.txt > $dir/text
+done
+
+mkdir -p ${exp_dir}/aishell
+
+modelscope_utils/modelscope_infer.sh \
+        --data_dir ${data_dir}/aishell \
+        --exp_dir ${exp_dir}/aishell \
+        --test_sets "${test_sets}" \
+        --model_name ${model_name} \
+        --inference_nj ${inference_nj} \
+        --gpuid_list ${gpuid_list} \
+        --njob ${njob} \
+        --gpu_inference ${gpu_inference} \
+        --use_lm ${use_lm} \
+        --beam_size ${beam_size} \
+        --lm_weight ${lm_weight}
diff --git a/egs_modelscope/aishell/paraformer/path.sh b/egs_modelscope/aishell/paraformer/path.sh
new file mode 100755
index 000000000..7972642d0
--- /dev/null
+++ b/egs_modelscope/aishell/paraformer/path.sh
@@ -0,0 +1,5 @@
+export FUNASR_DIR=$PWD/../../..
+
+# NOTE(kan-bayashi): Use UTF-8 in Python to avoid UnicodeDecodeError when LC_ALL=C
+export PYTHONIOENCODING=UTF-8
+export PATH=$FUNASR_DIR/funasr/bin:$PATH
diff --git a/egs_modelscope/aishell/paraformer/utils b/egs_modelscope/aishell/paraformer/utils
new file mode 120000
index 000000000..37d976175
--- /dev/null
+++ b/egs_modelscope/aishell/paraformer/utils
@@ -0,0 +1 @@
+../../../egs/aishell/tranformer/utils/
\ No newline at end of file
diff --git a/egs_modelscope/aishell2/paraformer/README.md b/egs_modelscope/aishell2/paraformer/README.md
new file mode 100644
index 000000000..46bd3ad71
--- /dev/null
+++ b/egs_modelscope/aishell2/paraformer/README.md
@@ -0,0 +1,39 @@
+# ModelScope: Paraformer-large Model
+
+## Highlight
+
+### ModelScope: Paraformer-Large Model
+- <strong>Fast</strong>: As a non-autoregressive (NAR) model, Paraformer achieves performance comparable to the state-of-the-art autoregressive (AR) Transformer, with more than 10x speedup.
+- <strong>Accurate</strong>: State-of-the-art results on many public ASR tasks, with significant relative improvements; suitable for industrial deployment.
+- <strong>Convenient</strong>: Quickly and easily download Paraformer-large from ModelScope for finetuning and inference.
+    - Supports finetuning and inference on AISHELL-1 and AISHELL-2.
+    - Supports inference on AISHELL-1, AISHELL-2, Wenetspeech, SpeechIO and other audio.
+
+## How to finetune and infer using a pretrained ModelScope Paraformer-large Model
+
+### Finetune
+- Modify the finetuning parameters in `conf/train_asr_paraformer_sanm_50e_16d_2048_512_lfr6.yaml`.
+- Set the following parameters in `paraformer_large_finetune.sh`:
+    - <strong>tr_dir:</strong> path to the AISHELL-2 training data
+    - <strong>dev_tst_dir:</strong> path to the AISHELL-2 dev/test data
+    - <strong>tag:</strong> experiment tag
+    - <strong>init_model_name:</strong> speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch # pre-trained model, downloaded from ModelScope during finetuning
+- Then run the pipeline to finetune the model downloaded from ModelScope and run inference with the finetuned model:
+```sh
+    bash ./paraformer_large_finetune.sh
+```
+
+### Inference
+
+Alternatively, download the model from ModelScope and run inference directly.
+
+- Set the following parameters in `paraformer_large_infer.sh`:
+    - <strong>ori_data:</strong> path to the raw AISHELL-2 dev/test data
+    - <strong>data_dir:</strong> data output directory
+    - <strong>exp_dir:</strong> result output path
+    - <strong>model_name:</strong> speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch # pre-trained model, downloaded from ModelScope
+    - <strong>test_sets:</strong> names of the test sets
+- Then run the inference pipeline:
+```sh
+    bash ./paraformer_large_infer.sh
+```
diff --git a/egs_modelscope/aishell2/paraformer/RESULTS.md b/egs_modelscope/aishell2/paraformer/RESULTS.md
new file mode 100644
index 000000000..a265a749c
--- /dev/null
+++ b/egs_modelscope/aishell2/paraformer/RESULTS.md
@@ -0,0 +1,26 @@
+# Paraformer-Large
+- Model link: <https://www.modelscope.cn/models/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary>
+- Model size: 220M
+- Train config: conf/train_asr_paraformer_sanm_50e_16d_2048_512_lfr6.yaml
+
+# Environments
+- date: `Tue Nov 22 18:48:39 CST 2022`
+- python version: `3.7.12`
+- FunASR version: `0.1.0`
+- pytorch version: `pytorch 1.7.0`
+- Git hash: ``
+- Commit date: ``
+
+# Benchmark Results
+
+## AISHELL-2
+- Decode config: conf/decode_asr_transformer_noctc_1best.yaml
+  - Decode without CTC
+  - Decode without LM
+
+| testset      | CER(%)|
+|:------------:|:-----:|
+| dev_ios      | 2.80  |
+| test_android | 3.13  |
+| test_ios     | 2.85  |
+| test_mic     | 3.06  |
diff --git a/egs_modelscope/aishell2/paraformer/conf/decode_asr_transformer_noctc_10best_lm_weight_0.15.yaml b/egs_modelscope/aishell2/paraformer/conf/decode_asr_transformer_noctc_10best_lm_weight_0.15.yaml
new file mode 100644
index 000000000..22f02d913
--- /dev/null
+++ b/egs_modelscope/aishell2/paraformer/conf/decode_asr_transformer_noctc_10best_lm_weight_0.15.yaml
@@ -0,0 +1,6 @@
+beam_size: 10
+penalty: 0.0
+maxlenratio: 0.0
+minlenratio: 0.0
+ctc_weight: 0.0
+lm_weight: 0.15
diff --git a/egs_modelscope/aishell2/paraformer/conf/decode_asr_transformer_noctc_1best.yaml b/egs_modelscope/aishell2/paraformer/conf/decode_asr_transformer_noctc_1best.yaml
new file mode 100644
index 000000000..e6231927c
--- /dev/null
+++ b/egs_modelscope/aishell2/paraformer/conf/decode_asr_transformer_noctc_1best.yaml
@@ -0,0 +1,6 @@
+beam_size: 1
+penalty: 0.0
+maxlenratio: 0.0
+minlenratio: 0.0
+ctc_weight: 0.0
+lm_weight: 0.0
diff --git a/egs_modelscope/aishell2/paraformer/conf/train_asr_paraformer_sanm_50e_16d_2048_512_lfr6.yaml b/egs_modelscope/aishell2/paraformer/conf/train_asr_paraformer_sanm_50e_16d_2048_512_lfr6.yaml
new file mode 100644
index 000000000..e9210f373
--- /dev/null
+++ b/egs_modelscope/aishell2/paraformer/conf/train_asr_paraformer_sanm_50e_16d_2048_512_lfr6.yaml
@@ -0,0 +1,91 @@
+# network architecture
+# encoder related
+encoder_conf:
+    dropout_rate: 0.1
+    positional_dropout_rate: 0.1
+    attention_dropout_rate: 0.1
+
+# decoder related
+decoder_conf:
+    dropout_rate: 0.1
+    positional_dropout_rate: 0.1
+    self_attention_dropout_rate: 0.1
+    src_attention_dropout_rate: 0.1
+
+predictor_conf:
+  threshold: 1.0
+  l_order: 1
+  r_order: 1
+  tail_threshold: 0.45
+
+# hybrid CTC/attention
+model_conf:
+    ctc_weight: 0.0
+    lsm_weight: 0.1     # label smoothing option
+    length_normalized_loss: true
+    predictor_weight: 1.0
+    predictor_bias: 1
+    sampling_ratio: 0.75
+
+# minibatch related
+# dataset_type: small
+batch_type: length
+batch_bins: 2000
+num_workers: 16
+# dataset_type: large
+dataset_conf:
+    filter_conf:
+        min_length: 10
+        max_length: 250
+        min_token_length: 1
+        max_token_length: 200
+    shuffle: true
+    shuffle_conf:
+        shuffle_size: 10240
+        sort_size: 500
+    batch_conf:
+        batch_type: 'token'
+        batch_size: 6000
+    num_workers: 16
+
+# optimization related
+accum_grad: 1
+grad_clip: 5
+max_epoch: 20
+val_scheduler_criterion:
+    - valid
+    - acc
+best_model_criterion:
+-   - valid
+    - acc
+    - max
+keep_nbest_models: 10
+
+optim: adam
+optim_conf:
+   lr: 0.0005
+scheduler: warmuplr
+scheduler_conf:
+   warmup_steps: 30000
+
+specaug: specaug_lfr
+specaug_conf:
+    apply_time_warp: false
+    time_warp_window: 5
+    time_warp_mode: bicubic
+    apply_freq_mask: true
+    freq_mask_width_range:
+    - 0
+    - 30
+    lfr_rate: 6
+    num_freq_mask: 1
+    apply_time_mask: true
+    time_mask_width_range:
+    - 0
+    - 12
+    num_time_mask: 1
+
+unused_parameters: true
+log_interval: 50
+normalize: None
+split_with_space: true
diff --git a/egs_modelscope/aishell2/paraformer/local/aishell2_data_prep.sh b/egs_modelscope/aishell2/paraformer/local/aishell2_data_prep.sh
new file mode 100755
index 000000000..77791f9c1
--- /dev/null
+++ b/egs_modelscope/aishell2/paraformer/local/aishell2_data_prep.sh
@@ -0,0 +1,53 @@
+#!/usr/bin/env bash
+# Copyright 2018 AIShell-Foundation(Authors:Jiayu DU, Xingyu NA, Bengu WU, Hao ZHENG)
+#           2018 Beijing Shell Shell Tech. Co. Ltd. (Author: Hui BU)
+# Apache 2.0
+
+# transform raw AISHELL-2 data to kaldi format
+
+. ./path.sh || exit 1;
+
+tmp=
+dir=
+
+if [ $# != 3 ]; then
+  echo "Usage: $0 <corpus-data-dir> <tmp-dir> <output-dir>"
+  echo " $0 /export/AISHELL-2/iOS/train data/local/train data/train"
+  exit 1;
+fi
+
+corpus=$1
+tmp=$2
+dir=$3
+
+echo "$0: Preparing data in $corpus"
+
+mkdir -p $tmp
+mkdir -p $dir
+
+# corpus check
+if [ ! -d $corpus ] || [ ! -f $corpus/wav.scp ] || [ ! -f $corpus/trans.txt ]; then
+  echo "Error: $0 requires wav.scp and trans.txt under $corpus directory."
+  exit 1;
+fi
+
+# validate utt-key list, IC0803W0380 is a bad utterance
+awk '{print $1}' $corpus/wav.scp | grep -v 'IC0803W0380' > $tmp/wav_utt.list
+awk '{print $1}' $corpus/trans.txt > $tmp/trans_utt.list
+utils/filter_scp.pl -f 1 $tmp/wav_utt.list $tmp/trans_utt.list > $tmp/utt.list
+
+# wav.scp
+awk -F'\t' -v path_prefix=$corpus '{printf("%s\t%s/%s\n",$1,path_prefix,$2)}' $corpus/wav.scp > $tmp/tmp_wav.scp
+utils/filter_scp.pl -f 1 $tmp/utt.list $tmp/tmp_wav.scp | sort -k 1 | uniq > $tmp/wav.scp
+
+# text
+utils/filter_scp.pl -f 1 $tmp/utt.list $corpus/trans.txt | sort -k 1 | uniq > $tmp/text
+
+# copy prepared resources from tmp_dir to target dir
+mkdir -p $dir
+for f in wav.scp text; do
+  cp $tmp/$f $dir/$f || exit 1;
+done
+
+echo "$0 succeeded"
+exit 0;
diff --git a/egs_modelscope/aishell2/paraformer/modelscope_utils b/egs_modelscope/aishell2/paraformer/modelscope_utils
new file mode 120000
index 000000000..fc97768c8
--- /dev/null
+++ b/egs_modelscope/aishell2/paraformer/modelscope_utils
@@ -0,0 +1 @@
+../../common/modelscope_utils
\ No newline at end of file
diff --git a/egs_modelscope/aishell2/paraformer/paraformer_large_finetune.sh b/egs_modelscope/aishell2/paraformer/paraformer_large_finetune.sh
new file mode 100755
index 000000000..d4b5dde73
--- /dev/null
+++ b/egs_modelscope/aishell2/paraformer/paraformer_large_finetune.sh
@@ -0,0 +1,239 @@
+#!/usr/bin/env bash
+
+. ./path.sh || exit 1;
+
+# machines configuration
+CUDA_VISIBLE_DEVICES="0,1" # set gpus, e.g., CUDA_VISIBLE_DEVICES="0,1"
+gpu_num=2
+count=1
+gpu_inference=true # Whether to perform gpu decoding, set false for cpu decoding
+njob=4 # the number of jobs for each gpu
+train_cmd=utils/run.pl
+
+# general configuration
+feats_dir="." # feature output directory, for large data
+exp_dir="."
+lang=zh
+dumpdir=dump/fbank
+feats_type=fbank
+token_type=char
+scp=feats.scp
+type=kaldi_ark
+stage=0
+stop_stage=4
+
+# feature configuration
+feats_dim=560
+sample_frequency=16000
+nj=100
+speed_perturb="1.0"
+lfr=True
+lfr_m=7
+lfr_n=6
+
+init_model_name=speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch  # pre-trained model, download from modelscope during fine-tuning
+cmvn_file=init_model/${init_model_name}/am.mvn
+seg_file=init_model/${init_model_name}/seg_dict
+vocab=init_model/${init_model_name}/tokens.txt
+
+# data
+tr_dir=
+dev_tst_dir=
+
+# exp tag
+tag=""
+
+# Set bash to strict mode; it will exit on:
+# -e 'error', -u 'undefined variable', -o pipefail 'error in pipeline'
+set -e
+set -u
+set -o pipefail
+
+train_set=train
+valid_set=dev_ios
+test_sets="dev_ios test_android test_ios test_mic"
+
+asr_config=conf/train_asr_paraformer_sanm_50e_16d_2048_512_lfr6.yaml
+init_param="init_model/${init_model_name}/${init_model_name}"
+
+inference_config=conf/decode_asr_transformer_noctc_1best.yaml
+inference_asr_model=valid.acc.ave_10best.pth
+
+. utils/parse_options.sh || exit 1;
+
+# download model from modelscope
+python modelscope_utils/download_model.py --model_name ${init_model_name}
+
+if [ ! -d ${HOME}/.cache/modelscope/hub/damo/${init_model_name} ]; then
+    echo "${HOME}/.cache/modelscope/hub/damo/${init_model_name} must exist"
+    exit 1
+else
+    if [ -d init_model/${init_model_name} ]; then
+        echo "init_model/${init_model_name} already exists. To copy the model again, please delete init_model/${init_model_name} first."
+    else
+        mkdir -p init_model/${init_model_name}
+        cp -r ${HOME}/.cache/modelscope/hub/damo/${init_model_name}/* init_model/${init_model_name}
+    fi
+fi
+
+model_dir="baseline_$(basename "${asr_config}" .yaml)_${feats_type}_${lang}_${token_type}_${tag}"
+
+# you can set gpu num for decoding here
+gpuid_list=$CUDA_VISIBLE_DEVICES  # set gpus for decoding, the same as training stage by default
+ngpu=$(echo $gpuid_list | awk -F "," '{print NF}')
+inference_nj=$[${ngpu}*${njob}]
+
+if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
+    echo "stage 0: Data preparation"
+    # For training set
+    local/aishell2_data_prep.sh ${tr_dir} ${feats_dir}/data/local/train ${feats_dir}/data/train || exit 1;
+    # For dev and test sets
+    for x in Android iOS Mic; do
+        local/aishell2_data_prep.sh ${dev_tst_dir}/${x}/dev ${feats_dir}/data/local/dev_${x,,} ${feats_dir}/data/dev_${x,,} || exit 1;
+        local/aishell2_data_prep.sh ${dev_tst_dir}/${x}/test ${feats_dir}/data/local/test_${x,,} ${feats_dir}/data/test_${x,,} || exit 1;
+    done
+    # Normalize text to lowercase letters and remove spaces
+    for x in train dev_android dev_ios dev_mic test_android test_ios test_mic; do
+        mv ${feats_dir}/data/${x}/text ${feats_dir}/data/${x}/text.org
+        paste -d " " <(cut -f 1 ${feats_dir}/data/${x}/text.org) <(cut -f 2- ${feats_dir}/data/${x}/text.org \
+             | tr 'A-Z' 'a-z' | tr -d " ") \
+            > ${feats_dir}/data/${x}/text
+        rm ${feats_dir}/data/${x}/text.org
+    done
+fi
+
+feat_train_dir=${feats_dir}/${dumpdir}/${train_set}; mkdir -p ${feat_train_dir}
+feat_dev_dir=${feats_dir}/${dumpdir}/${valid_set}; mkdir -p ${feat_dev_dir}
+if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
+    echo "Feature Generation"
+    # compute fbank features
+    fbankdir=${feats_dir}/fbank
+    utils/compute_fbank.sh --cmd "$train_cmd" --nj $nj --speed_perturb ${speed_perturb} \
+        ${feats_dir}/data/train ${exp_dir}/exp/make_fbank/train ${fbankdir}/train
+    utils/fix_data_feat.sh ${fbankdir}/train
+    for x in android ios mic; do
+        utils/compute_fbank.sh --cmd "$train_cmd" --nj $nj \
+            ${feats_dir}/data/dev_${x} ${exp_dir}/exp/make_fbank/dev_${x} ${fbankdir}/dev_${x}
+        utils/fix_data_feat.sh ${fbankdir}/dev_${x}
+        utils/compute_fbank.sh --cmd "$train_cmd" --nj $nj \
+            ${feats_dir}/data/test_${x} ${exp_dir}/exp/make_fbank/test_${x} ${fbankdir}/test_${x}
+        utils/fix_data_feat.sh ${fbankdir}/test_${x}
+    done
+
+    echo "apply low_frame_rate and cmvn"
+    [ ! -f ${cmvn_file} ] && echo "$0: cmvn file is required" && exit 1;
+    utils/apply_lfr_and_cmvn.sh --cmd "$train_cmd" --nj $nj \
+        --lfr $lfr --lfr-m $lfr_m --lfr-n $lfr_n \
+        ${fbankdir}/${train_set} ${cmvn_file} ${exp_dir}/exp/make_fbank/train ${feat_train_dir}
+    utils/apply_lfr_and_cmvn.sh --cmd "$train_cmd" --nj $nj \
+        --lfr $lfr --lfr-m $lfr_m --lfr-n $lfr_n \
+        ${fbankdir}/${valid_set} ${cmvn_file} ${exp_dir}/exp/make_fbank/dev ${feat_dev_dir}
+    for x in android ios mic; do
+        feat_test_dir=${feats_dir}/${dumpdir}/test_${x}; mkdir -p ${feat_test_dir}
+        utils/apply_lfr_and_cmvn.sh --cmd "$train_cmd" --nj $nj \
+            --lfr $lfr --lfr-m $lfr_m --lfr-n $lfr_n \
+            ${fbankdir}/test_${x} ${cmvn_file} ${exp_dir}/exp/make_fbank/test_${x} ${feat_test_dir}
+    done
+
+    echo "Text Tokenize"
+    # 我爱reading->我 爱 read@@ ing
+    utils/text_tokenize.sh --cmd "$train_cmd" --nj $nj ${fbankdir}/${train_set} ${seg_file} ${feat_train_dir}/log ${feat_train_dir}
+    utils/fix_data_feat.sh ${feat_train_dir}
+    utils/text_tokenize.sh --cmd "$train_cmd" --nj $nj ${fbankdir}/${valid_set} ${seg_file} ${feat_dev_dir}/log ${feat_dev_dir}
+    utils/fix_data_feat.sh ${feat_dev_dir}
+    for x in android ios mic; do
+      feat_test_dir=${feats_dir}/${dumpdir}/test_${x} 
+      cp ${fbankdir}/test_${x}/text  ${feat_test_dir}
+    done
+fi
+
+token_list=${feats_dir}/data/${lang}_token_list/char/tokens.txt
+echo "dictionary: ${token_list}"
+if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
+    echo "stage 2: Dictionary Preparation"
+    mkdir -p ${feats_dir}/data/${lang}_token_list/char/
+    cp $vocab ${token_list}
+
+    vocab_size=$(wc -l <${token_list})
+    awk -v v=,${vocab_size} '{print $0v}' ${feat_train_dir}/text_shape > ${feat_train_dir}/text_shape.char
+    awk -v v=,${vocab_size} '{print $0v}' ${feat_dev_dir}/text_shape > ${feat_dev_dir}/text_shape.char
+    mkdir -p ${feats_dir}/asr_stats_fbank_zh_char/train
+    mkdir -p ${feats_dir}/asr_stats_fbank_zh_char/dev_ios
+    cp ${feat_train_dir}/speech_shape ${feat_train_dir}/text_shape ${feat_train_dir}/text_shape.char ${feats_dir}/asr_stats_fbank_zh_char/${train_set}
+    cp ${feat_dev_dir}/speech_shape ${feat_dev_dir}/text_shape ${feat_dev_dir}/text_shape.char ${feats_dir}/asr_stats_fbank_zh_char/${valid_set}
+fi
+
+# Training Stage
+world_size=$gpu_num  # run on one machine
+if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
+    # update asr train config.yaml
+    python modelscope_utils/update_config.py --modelscope_config init_model/${init_model_name}/asr_train_config.yaml --finetune_config ${asr_config} --output_config init_model/${init_model_name}/asr_finetune_config.yaml
+    finetune_config=init_model/${init_model_name}/asr_finetune_config.yaml
+
+    mkdir -p ${exp_dir}/exp/${model_dir}
+    mkdir -p ${exp_dir}/exp/${model_dir}/log
+    INIT_FILE=$exp_dir/ddp_init
+    if [ -f $INIT_FILE ];then
+        rm -f $INIT_FILE
+    fi
+    init_method=file://$(readlink -f $INIT_FILE)
+    echo "$0: init method is $init_method"
+    for ((i = 0; i < $gpu_num; ++i)); do
+        {
+            rank=$i
+            local_rank=$i
+            gpu_id=$(echo $CUDA_VISIBLE_DEVICES | cut -d',' -f$[$i+1])
+            asr_train_paraformer.py \
+                --gpu_id $gpu_id \
+                --use_preprocessor true \
+                --token_type $token_type \
+                --token_list $token_list \
+                --train_data_path_and_name_and_type ${feats_dir}/${dumpdir}/${train_set}/${scp},speech,${type} \
+                --train_data_path_and_name_and_type ${feats_dir}/${dumpdir}/${train_set}/text,text,text \
+                --train_shape_file ${feats_dir}/asr_stats_fbank_zh_char/${train_set}/speech_shape \
+                --train_shape_file ${feats_dir}/asr_stats_fbank_zh_char/${train_set}/text_shape.char \
+                --valid_data_path_and_name_and_type ${feats_dir}/${dumpdir}/${valid_set}/${scp},speech,${type} \
+                --valid_data_path_and_name_and_type ${feats_dir}/${dumpdir}/${valid_set}/text,text,text \
+                --valid_shape_file ${feats_dir}/asr_stats_fbank_zh_char/${valid_set}/speech_shape \
+                --valid_shape_file ${feats_dir}/asr_stats_fbank_zh_char/${valid_set}/text_shape.char  \
+                --resume true \
+                --output_dir ${exp_dir}/exp/${model_dir} \
+                --init_param $init_param \
+                --config $finetune_config \
+                --input_size $feats_dim \
+                --ngpu $gpu_num \
+                --num_worker_count $count \
+                --multiprocessing_distributed true \
+                --dist_init_method $init_method \
+                --dist_world_size $world_size \
+                --dist_rank $rank \
+                --local_rank $local_rank 1> ${exp_dir}/exp/${model_dir}/log/train.log.$i 2>&1
+        } &
+    done
+    wait
+fi
+
+# Testing Stage
+if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
+    ./utils/easy_asr_infer.sh \
+        --lang zh \
+        --datadir ${feats_dir} \
+        --feats_type ${feats_type} \
+        --feats_dim ${feats_dim} \
+        --token_type ${token_type} \
+        --gpu_inference ${gpu_inference} \
+        --inference_config "${inference_config}" \
+        --test_sets "${test_sets}" \
+        --token_list $token_list \
+        --asr_exp ${exp_dir}/exp/${model_dir} \
+        --stage 12 \
+        --stop_stage 12 \
+        --scp $scp \
+        --text text \
+        --inference_nj $inference_nj \
+        --njob $njob \
+        --inference_asr_model $inference_asr_model \
+        --gpuid_list $gpuid_list \
+        --mode paraformer
+fi
+
diff --git a/egs_modelscope/aishell2/paraformer/paraformer_large_infer.sh b/egs_modelscope/aishell2/paraformer/paraformer_large_infer.sh
new file mode 100755
index 000000000..95b32fc75
--- /dev/null
+++ b/egs_modelscope/aishell2/paraformer/paraformer_large_infer.sh
@@ -0,0 +1,56 @@
+#!/usr/bin/env bash
+
+set -e
+set -u
+set -o pipefail
+
+ori_data=
+data_dir=
+exp_dir=
+model_name=speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch
+inference_nj=32
+gpuid_list="0,1" # set gpus, e.g., gpuid_list="0,1"
+ngpu=$(echo $gpuid_list | awk -F "," '{print NF}')
+njob=4  # the number of jobs for each gpu
+gpu_inference=true  # Whether to perform gpu decoding, set false for cpu decoding
+
+if ${gpu_inference}; then
+    inference_nj=$[${ngpu}*${njob}]
+else
+    inference_nj=$njob
+fi
+
+use_lm=false
+beam_size=1
+lm_weight=0.0
+
+test_sets="dev_ios test_android test_ios test_mic"
+
+. utils/parse_options.sh
+
+for x in Android iOS Mic; do
+    local/aishell2_data_prep.sh ${ori_data}/${x}/dev ${data_dir}/aishell2/local/dev_${x,,} ${data_dir}/aishell2/dev_${x,,} || exit 1;
+    local/aishell2_data_prep.sh ${ori_data}/${x}/test ${data_dir}/aishell2/local/test_${x,,} ${data_dir}/aishell2/test_${x,,} || exit 1;
+done
+for x in dev_android dev_ios dev_mic test_android test_ios test_mic; do
+    mv ${data_dir}/aishell2/${x}/text ${data_dir}/aishell2/${x}/text.org
+    paste -d " " <(cut -f 1 ${data_dir}/aishell2/${x}/text.org) <(cut -f 2- ${data_dir}/aishell2/${x}/text.org \
+        | tr 'A-Z' 'a-z' | tr -d " ") \
+       > ${data_dir}/aishell2/${x}/text
+    rm ${data_dir}/aishell2/${x}/text.org
+done
+
+mkdir -p ${exp_dir}/aishell2
+
+modelscope_utils/modelscope_infer.sh \
+        --data_dir ${data_dir}/aishell2 \
+        --exp_dir ${exp_dir}/aishell2 \
+        --test_sets "${test_sets}" \
+        --model_name ${model_name} \
+        --inference_nj ${inference_nj} \
+        --gpuid_list ${gpuid_list} \
+        --njob ${njob} \
+        --gpu_inference ${gpu_inference} \
+        --use_lm ${use_lm} \
+        --beam_size ${beam_size} \
+        --lm_weight ${lm_weight}
diff --git a/egs_modelscope/aishell2/paraformer/path.sh b/egs_modelscope/aishell2/paraformer/path.sh
new file mode 100755
index 000000000..7972642d0
--- /dev/null
+++ b/egs_modelscope/aishell2/paraformer/path.sh
@@ -0,0 +1,5 @@
+export FUNASR_DIR=$PWD/../../..
+
+# NOTE(kan-bayashi): Use UTF-8 in Python to avoid UnicodeDecodeError when LC_ALL=C
+export PYTHONIOENCODING=UTF-8
+export PATH=$FUNASR_DIR/funasr/bin:$PATH
diff --git a/egs_modelscope/aishell2/paraformer/utils b/egs_modelscope/aishell2/paraformer/utils
new file mode 120000
index 000000000..37d976175
--- /dev/null
+++ b/egs_modelscope/aishell2/paraformer/utils
@@ -0,0 +1 @@
+../../../egs/aishell/tranformer/utils/
\ No newline at end of file
diff --git a/egs_modelscope/common/README.md b/egs_modelscope/common/README.md
new file mode 100644
index 000000000..f2049e2f0
--- /dev/null
+++ b/egs_modelscope/common/README.md
@@ -0,0 +1,27 @@
+# ModelScope Model
+
+## How to finetune and infer using a pretrained ModelScope Model
+
+### Finetune
+- Modify the fine-tuning parameters in `conf/train_asr_paraformer_sanm_50e_16d_2048_512_lfr6.yaml`
+- Set the parameters in `modelscope_common_finetune.sh`
+    - <strong>dataset:</strong> the dataset directory, which must include train/wav.scp and train/text; dev/wav.scp, dev/text, test/wav.scp, and test/text are optional
+    - <strong>tag:</strong> experiment tag
+    - <strong>init_model_name:</strong> speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch # pre-trained model, downloaded from ModelScope during fine-tuning
+- Then run the pipeline to fine-tune the model downloaded from ModelScope:
+```sh
+    sh ./modelscope_common_finetune.sh
+```
+
+### Inference
+
+Alternatively, you can run inference directly with the pre-trained model.
+
+- Set the parameters in `modelscope_common_infer.sh`
+    - <strong>data_dir:</strong> directory containing the wav list, `${data_dir}/wav.scp`
+    - <strong>exp_dir:</strong> the output path for the results
+    - <strong>model_name:</strong> speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch # pre-trained model, downloaded from ModelScope
+- Then run the inference pipeline:
+```sh
+    sh ./modelscope_common_infer.sh
+```
diff --git a/egs_modelscope/common/conf/decode_asr_transformer_noctc_10best_lm_weight_0.15.yaml b/egs_modelscope/common/conf/decode_asr_transformer_noctc_10best_lm_weight_0.15.yaml
new file mode 100644
index 000000000..22f02d913
--- /dev/null
+++ b/egs_modelscope/common/conf/decode_asr_transformer_noctc_10best_lm_weight_0.15.yaml
@@ -0,0 +1,6 @@
+beam_size: 10
+penalty: 0.0
+maxlenratio: 0.0
+minlenratio: 0.0
+ctc_weight: 0.0
+lm_weight: 0.15
diff --git a/egs_modelscope/common/conf/decode_asr_transformer_noctc_1best.yaml b/egs_modelscope/common/conf/decode_asr_transformer_noctc_1best.yaml
new file mode 100644
index 000000000..e6231927c
--- /dev/null
+++ b/egs_modelscope/common/conf/decode_asr_transformer_noctc_1best.yaml
@@ -0,0 +1,6 @@
+beam_size: 1
+penalty: 0.0
+maxlenratio: 0.0
+minlenratio: 0.0
+ctc_weight: 0.0
+lm_weight: 0.0
diff --git a/egs_modelscope/common/conf/train_asr_paraformer_sanm_50e_16d_2048_512_lfr6.yaml b/egs_modelscope/common/conf/train_asr_paraformer_sanm_50e_16d_2048_512_lfr6.yaml
new file mode 100644
index 000000000..e9210f373
--- /dev/null
+++ b/egs_modelscope/common/conf/train_asr_paraformer_sanm_50e_16d_2048_512_lfr6.yaml
@@ -0,0 +1,91 @@
+# network architecture
+# encoder related
+encoder_conf:
+    dropout_rate: 0.1
+    positional_dropout_rate: 0.1
+    attention_dropout_rate: 0.1
+
+# decoder related
+decoder_conf:
+    dropout_rate: 0.1
+    positional_dropout_rate: 0.1
+    self_attention_dropout_rate: 0.1
+    src_attention_dropout_rate: 0.1
+
+predictor_conf:
+  threshold: 1.0
+  l_order: 1
+  r_order: 1
+  tail_threshold: 0.45
+
+# hybrid CTC/attention
+model_conf:
+    ctc_weight: 0.0
+    lsm_weight: 0.1     # label smoothing option
+    length_normalized_loss: true
+    predictor_weight: 1.0
+    predictor_bias: 1
+    sampling_ratio: 0.75
+
+# minibatch related
+# dataset_type: small
+batch_type: length
+batch_bins: 2000
+num_workers: 16
+# dataset_type: large
+dataset_conf:
+    filter_conf:
+        min_length: 10
+        max_length: 250
+        min_token_length: 1
+        max_token_length: 200
+    shuffle: true
+    shuffle_conf:
+        shuffle_size: 10240
+        sort_size: 500
+    batch_conf:
+        batch_type: 'token'
+        batch_size: 6000
+    num_workers: 16
+
+# optimization related
+accum_grad: 1
+grad_clip: 5
+max_epoch: 20
+val_scheduler_criterion:
+    - valid
+    - acc
+best_model_criterion:
+-   - valid
+    - acc
+    - max
+keep_nbest_models: 10
+
+optim: adam
+optim_conf:
+   lr: 0.0005
+scheduler: warmuplr
+scheduler_conf:
+   warmup_steps: 30000
+
+specaug: specaug_lfr
+specaug_conf:
+    apply_time_warp: false
+    time_warp_window: 5
+    time_warp_mode: bicubic
+    apply_freq_mask: true
+    freq_mask_width_range:
+    - 0
+    - 30
+    lfr_rate: 6
+    num_freq_mask: 1
+    apply_time_mask: true
+    time_mask_width_range:
+    - 0
+    - 12
+    num_time_mask: 1
+
+unused_parameters: true
+log_interval: 50
+normalize: None
+split_with_space: true
diff --git a/egs_modelscope/common/modelscope_common_finetune.sh b/egs_modelscope/common/modelscope_common_finetune.sh
new file mode 100755
index 000000000..a43083f0c
--- /dev/null
+++ b/egs_modelscope/common/modelscope_common_finetune.sh
@@ -0,0 +1,230 @@
+#!/usr/bin/env bash
+
+. ./path.sh || exit 1;
+
+# machines configuration
+CUDA_VISIBLE_DEVICES="0,1" # set gpus, e.g., CUDA_VISIBLE_DEVICES="0,1"
+gpu_num=2
+count=1
+gpu_inference=true # Whether to perform gpu decoding, set false for cpu decoding
+njob=4 # the number of jobs for each gpu
+train_cmd=utils/run.pl
+
+# general configuration
+feats_dir="." # feature output directory, for large data
+exp_dir="."
+lang=zh
+dumpdir=dump/fbank
+feats_type=fbank
+token_type=char
+scp=feats.scp
+type=kaldi_ark
+stage=1
+stop_stage=4
+
+# feature configuration
+feats_dim=560
+sample_frequency=16000
+nj=32
+speed_perturb="1.0"
+lfr=True
+lfr_m=7
+lfr_n=6
+
+init_model_name=speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch  # pre-trained model, download from modelscope during fine-tuning
+cmvn_file=init_model/${init_model_name}/am.mvn
+seg_file=init_model/${init_model_name}/seg_dict
+vocab=init_model/${init_model_name}/tokens.txt
+
+# data
+dataset=  # dataset dir (must include train/wav.scp and train/text; optional dev/wav.scp, dev/text, test/wav.scp, test/text)
+
+# exp tag
+tag=""
+
+# Set bash to strict mode; it will exit on:
+# -e 'error', -u 'undefined variable', -o pipefail 'error in pipeline'
+set -e
+set -u
+set -o pipefail
+
+train_set=train
+valid_set=dev
+test_sets="dev test"
+
+asr_config=conf/train_asr_paraformer_sanm_50e_16d_2048_512_lfr6.yaml
+init_param="init_model/${init_model_name}/${init_model_name}"
+
+inference_config=conf/decode_asr_transformer_noctc_1best.yaml
+inference_asr_model=valid.acc.ave_10best.pth
+
+. utils/parse_options.sh || exit 1;
+
+# download model from modelscope
+python modelscope_utils/download_model.py --model_name ${init_model_name}
+
+if [ ! -d ${HOME}/.cache/modelscope/hub/damo/${init_model_name} ]; then
+    echo "${HOME}/.cache/modelscope/hub/damo/${init_model_name} must exist"
+    exit 1
+else
+    if [ -d init_model/${init_model_name} ]; then
+        echo "init_model/${init_model_name} already exists. To copy the model again, please delete init_model/${init_model_name} first."
+    else
+        mkdir -p init_model/${init_model_name}
+        cp -r ${HOME}/.cache/modelscope/hub/damo/${init_model_name}/* init_model/${init_model_name}
+    fi
+fi
+
+model_dir="baseline_$(basename "${asr_config}" .yaml)_${feats_type}_${lang}_${token_type}_${tag}"
+
+# you can set gpu num for decoding here
+gpuid_list=$CUDA_VISIBLE_DEVICES  # set gpus for decoding, the same as training stage by default
+ngpu=$(echo $gpuid_list | awk -F "," '{print NF}')
+inference_nj=$[${ngpu}*${njob}]
+
+[ ! -d "${dataset}" ] && echo "$0: Training data is required" && exit 1;
+{ [ ! -f "${dataset}/train/wav.scp" ] || [ ! -f "${dataset}/train/text" ]; } && echo "$0: Training data wav.scp or text is not found" && exit 1;
+
+if [ ! -d "${dataset}/dev" ]; then
+    utils/fix_data.sh ${dataset}/train
+    utils/subset_data_dir_tr_cv.sh --dev-num-utt 1000 ${dataset}/train ${dataset}
+fi
+if [ ! -d "${dataset}/test" ]; then
+   test_sets="dev" 
+fi
+
+feat_train_dir=${feats_dir}/${dumpdir}/train; mkdir -p ${feat_train_dir}
+feat_dev_dir=${feats_dir}/${dumpdir}/dev; mkdir -p ${feat_dev_dir}
+feat_test_dir=${feats_dir}/${dumpdir}/test; mkdir -p ${feat_test_dir}
+
+if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
+    echo "Feature Generation"
+    # compute fbank features
+    fbankdir=${feats_dir}/fbank
+    utils/compute_fbank.sh --cmd "$train_cmd" --nj $nj --speed_perturb ${speed_perturb} \
+        ${dataset}/train ${exp_dir}/exp/make_fbank/train ${fbankdir}/train
+    utils/fix_data_feat.sh ${fbankdir}/train
+    utils/compute_fbank.sh --cmd "$train_cmd" --nj $nj \
+        ${dataset}/dev ${exp_dir}/exp/make_fbank/dev ${fbankdir}/dev
+    utils/fix_data_feat.sh ${fbankdir}/dev
+    if [ -d "${dataset}/test" ]; then
+        utils/compute_fbank.sh --cmd "$train_cmd" --nj $nj \
+            ${dataset}/test ${exp_dir}/exp/make_fbank/test ${fbankdir}/test
+        utils/fix_data_feat.sh ${fbankdir}/test
+    fi
+
+    echo "apply low_frame_rate and cmvn"
+    [ ! -f ${cmvn_file} ] && echo "$0: cmvn file is required" && exit 1;
+    utils/apply_lfr_and_cmvn.sh --cmd "$train_cmd" --nj $nj \
+        --lfr $lfr --lfr-m $lfr_m --lfr-n $lfr_n \
+        ${fbankdir}/train ${cmvn_file} ${exp_dir}/exp/make_fbank/train ${feat_train_dir}
+    utils/apply_lfr_and_cmvn.sh --cmd "$train_cmd" --nj $nj \
+        --lfr $lfr --lfr-m $lfr_m --lfr-n $lfr_n \
+        ${fbankdir}/dev ${cmvn_file} ${exp_dir}/exp/make_fbank/dev ${feat_dev_dir}
+    if [ -d "${dataset}/test" ]; then
+        utils/apply_lfr_and_cmvn.sh --cmd "$train_cmd" --nj $nj \
+            --lfr $lfr --lfr-m $lfr_m --lfr-n $lfr_n \
+            ${fbankdir}/test ${cmvn_file} ${exp_dir}/exp/make_fbank/test ${feat_test_dir}
+    fi
+
+    echo "Text Tokenize"
+    # 我爱reading->我 爱 read@@ ing
+    utils/text_tokenize.sh --cmd "$train_cmd" --nj $nj ${fbankdir}/train ${seg_file} ${feat_train_dir}/log ${feat_train_dir}
+    utils/fix_data_feat.sh ${feat_train_dir}
+    utils/text_tokenize.sh --cmd "$train_cmd" --nj $nj ${fbankdir}/dev ${seg_file} ${feat_dev_dir}/log ${feat_dev_dir}
+    utils/fix_data_feat.sh ${feat_dev_dir}
+    if [ -d "${dataset}/test" ]; then
+        cp ${fbankdir}/test/text ${feat_test_dir}
+    fi
+fi
+
+token_list=${feats_dir}/data/${lang}_token_list/char/tokens.txt
+echo "dictionary: ${token_list}"
+if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
+    echo "stage 2: Dictionary Preparation"
+    mkdir -p ${feats_dir}/data/${lang}_token_list/char/
+    cp $vocab ${token_list}
+
+    vocab_size=$(wc -l <${token_list})
+    awk -v v=,${vocab_size} '{print $0v}' ${feat_train_dir}/text_shape > ${feat_train_dir}/text_shape.char
+    awk -v v=,${vocab_size} '{print $0v}' ${feat_dev_dir}/text_shape > ${feat_dev_dir}/text_shape.char
+    mkdir -p ${feats_dir}/asr_stats_fbank_zh_char/train
+    mkdir -p ${feats_dir}/asr_stats_fbank_zh_char/dev
+    cp ${feat_train_dir}/speech_shape ${feat_train_dir}/text_shape ${feat_train_dir}/text_shape.char ${feats_dir}/asr_stats_fbank_zh_char/train
+    cp ${feat_dev_dir}/speech_shape ${feat_dev_dir}/text_shape ${feat_dev_dir}/text_shape.char ${feats_dir}/asr_stats_fbank_zh_char/dev
+fi
+
+# Training Stage
+world_size=$gpu_num  # run on one machine
+if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
+    # update asr train config.yaml
+    python modelscope_utils/update_config.py --modelscope_config init_model/${init_model_name}/asr_train_config.yaml --finetune_config ${asr_config} --output_config init_model/${init_model_name}/asr_finetune_config.yaml
+    finetune_config=init_model/${init_model_name}/asr_finetune_config.yaml
+
+    mkdir -p ${exp_dir}/exp/${model_dir}
+    mkdir -p ${exp_dir}/exp/${model_dir}/log
+    INIT_FILE=$exp_dir/ddp_init
+    if [ -f $INIT_FILE ];then
+        rm -f $INIT_FILE
+    fi
+    init_method=file://$(readlink -f $INIT_FILE)
+    echo "$0: init method is $init_method"
+    for ((i = 0; i < $gpu_num; ++i)); do
+        {
+            rank=$i
+            local_rank=$i
+            gpu_id=$(echo $CUDA_VISIBLE_DEVICES | cut -d',' -f$[$i+1])
+            asr_train_paraformer.py \
+                --gpu_id $gpu_id \
+                --use_preprocessor true \
+                --token_type $token_type \
+                --token_list $token_list \
+                --train_data_path_and_name_and_type ${feats_dir}/${dumpdir}/${train_set}/${scp},speech,${type} \
+                --train_data_path_and_name_and_type ${feats_dir}/${dumpdir}/${train_set}/text,text,text \
+                --train_shape_file ${feats_dir}/asr_stats_fbank_zh_char/${train_set}/speech_shape \
+                --train_shape_file ${feats_dir}/asr_stats_fbank_zh_char/${train_set}/text_shape.char \
+                --valid_data_path_and_name_and_type ${feats_dir}/${dumpdir}/${valid_set}/${scp},speech,${type} \
+                --valid_data_path_and_name_and_type ${feats_dir}/${dumpdir}/${valid_set}/text,text,text \
+                --valid_shape_file ${feats_dir}/asr_stats_fbank_zh_char/${valid_set}/speech_shape \
+                --valid_shape_file ${feats_dir}/asr_stats_fbank_zh_char/${valid_set}/text_shape.char  \
+                --resume true \
+                --output_dir ${exp_dir}/exp/${model_dir} \
+                --init_param $init_param \
+                --config $finetune_config \
+                --input_size $feats_dim \
+                --ngpu $gpu_num \
+                --num_worker_count $count \
+                --multiprocessing_distributed true \
+                --dist_init_method $init_method \
+                --dist_world_size $world_size \
+                --dist_rank $rank \
+                --local_rank $local_rank 1> ${exp_dir}/exp/${model_dir}/log/train.log.$i 2>&1
+        } &
+    done
+    wait
+fi
+
+# Testing Stage
+if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
+    ./utils/easy_asr_infer.sh \
+        --lang zh \
+        --datadir ${feats_dir} \
+        --feats_type ${feats_type} \
+        --feats_dim ${feats_dim} \
+        --token_type ${token_type} \
+        --gpu_inference ${gpu_inference} \
+        --inference_config "${inference_config}" \
+        --test_sets "${test_sets}" \
+        --token_list $token_list \
+        --asr_exp ${exp_dir}/exp/${model_dir} \
+        --stage 12 \
+        --stop_stage 12 \
+        --scp $scp \
+        --text text \
+        --inference_nj $inference_nj \
+        --njob $njob \
+        --inference_asr_model $inference_asr_model \
+        --gpuid_list $gpuid_list \
+        --mode paraformer
+fi
+
diff --git a/egs_modelscope/common/modelscope_common_infer.sh b/egs_modelscope/common/modelscope_common_infer.sh
new file mode 100755
index 000000000..12b2cbcb2
--- /dev/null
+++ b/egs_modelscope/common/modelscope_common_infer.sh
@@ -0,0 +1,78 @@
+#!/usr/bin/env bash
+
+set -e
+set -u
+set -o pipefail
+
+model_name=speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch  # pre-trained model, download from modelscope
+data_dir=  # wav list, ${data_dir}/wav.scp
+exp_dir="exp"
+gpuid_list="0,1"
+ngpu=$(echo $gpuid_list | awk -F "," '{print NF}')
+njob=4
+gpu_inference=true
+decode_cmd=utils/run.pl
+
+. utils/parse_options.sh
+
+if ${gpu_inference}; then
+    inference_nj=$[${ngpu}*${njob}]
+    _ngpu=1
+else
+    inference_nj=${njob}
+    _ngpu=0
+fi
+
+# LM configs
+use_lm=false
+beam_size=1
+lm_weight=0.0
+
+python modelscope_utils/download_model.py \
+          --model_name ${model_name}
+
+if [ -d ${exp_dir} ]; then
+    echo "${exp_dir} already exists. If you want to decode again, please delete ${exp_dir} first."
+    exit 1
+else
+    mkdir -p ${exp_dir}/${model_name}
+    cp ${HOME}/.cache/modelscope/hub/damo/${model_name}/* ${exp_dir}/${model_name}/. -r
+    _dir=${exp_dir}/decode_asr
+    _logdir=${_dir}/logdir
+    mkdir -p "${_dir}"
+    mkdir -p "${_logdir}"
+fi
+
+for n in $(seq "${inference_nj}"); do
+    split_scps+=" ${_logdir}/keys.${n}.scp"
+done
+# shellcheck disable=SC2086
+utils/split_scp.pl "${data_dir}/wav.scp" ${split_scps}
+
+if "${use_lm}"; then
+    cp ${exp_dir}/${model_name}/decode_asr_transformer.yaml ${exp_dir}/${model_name}/decode_asr_transformer.yaml.back
+    cp ${exp_dir}/${model_name}/decode_asr_transformer_wav.yaml ${exp_dir}/${model_name}/decode_asr_transformer_wav.yaml.back
+    sed -i "s#beam_size: [0-9]*#beam_size: ${beam_size}#g" ${exp_dir}/${model_name}/decode_asr_transformer.yaml
+    sed -i "s#beam_size: [0-9]*#beam_size: ${beam_size}#g" ${exp_dir}/${model_name}/decode_asr_transformer_wav.yaml
+    sed -i "s#lm_weight: 0\.[0-9]*#lm_weight: ${lm_weight}#g" ${exp_dir}/${model_name}/decode_asr_transformer.yaml
+    sed -i "s#lm_weight: 0\.[0-9]*#lm_weight: ${lm_weight}#g" ${exp_dir}/${model_name}/decode_asr_transformer_wav.yaml
+fi
+
+echo "Decoding started... log: '${_logdir}/asr_inference.*.log'"
+# shellcheck disable=SC2086
+${decode_cmd} --max-jobs-run "${inference_nj}" JOB=1:"${inference_nj}" "${_logdir}"/asr_inference.JOB.log \
+    python -m funasr.bin.modelscope_infer \
+          --local_model_path ${exp_dir}/${model_name} \
+          --wav_list ${_logdir}/keys.JOB.scp \
+          --output_file ${_logdir}/text.JOB \
+          --gpuid_list ${gpuid_list} \
+          --njob ${njob} \
+          --ngpu ${_ngpu} \
+
+    for i in $(seq ${inference_nj}); do
+        cat ${_logdir}/text.${i}
+    done | sort -k1 >${_dir}/text
+
+if "${use_lm}"; then  # the .back files exist only when use_lm=true
+    for f in decode_asr_transformer decode_asr_transformer_wav; do mv ${exp_dir}/${model_name}/${f}.yaml.back ${exp_dir}/${model_name}/${f}.yaml; done
+fi
diff --git a/egs_modelscope/common/modelscope_common_infer_after_finetune.sh b/egs_modelscope/common/modelscope_common_infer_after_finetune.sh
new file mode 100755
index 000000000..00dd28336
--- /dev/null
+++ b/egs_modelscope/common/modelscope_common_infer_after_finetune.sh
@@ -0,0 +1,66 @@
+#!/usr/bin/env bash
+
+set -e
+set -u
+set -o pipefail
+
+pretrained_model_name=speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch  # pre-trained model, download from modelscope
+data_dir=  # wav list, ${data_dir}/wav.scp
+finetune_model_name=  # fine-tuning model name
+finetune_exp_dir=  # fine-tuning model experiment result path
+gpuid_list="0,1"
+njob=4
+gpu_inference=true
+decode_cmd=utils/run.pl
+
+. utils/parse_options.sh
+ngpu=$(echo $gpuid_list | awk -F "," '{print NF}')  # count GPUs after option parsing so a --gpuid_list override is honored
+
+if ${gpu_inference}; then
+    inference_nj=$((ngpu * njob))
+    _ngpu=1
+else
+    # CPU decoding: run njob parallel jobs in total
+    inference_nj=${njob}
+    _ngpu=0
+fi
+
+if [ ! -d ${HOME}/.cache/modelscope/hub/damo/${pretrained_model_name} ]; then
+    echo "${HOME}/.cache/modelscope/hub/damo/${pretrained_model_name} must exist."
+    exit 1
+else
+    exp_dir=${finetune_exp_dir}/${finetune_model_name}.modelscope
+    mkdir -p $exp_dir
+    cp ${finetune_exp_dir}/${finetune_model_name} ${exp_dir}/${finetune_model_name}.modelscope
+    cp ${HOME}/.cache/modelscope/hub/damo/${pretrained_model_name}/* ${exp_dir}/. -r
+fi
+
+_dir=${exp_dir}/decode_asr
+_logdir=${_dir}/logdir
+if [ -d ${_dir} ]; then
+    echo "${_dir} already exists. If you want to decode again, please delete ${_dir} first."
+else
+    mkdir -p "${_dir}"
+    mkdir -p "${_logdir}"
+fi
+
+for n in $(seq "${inference_nj}"); do
+    split_scps+=" ${_logdir}/keys.${n}.scp"
+done
+# shellcheck disable=SC2086
+utils/split_scp.pl "${data_dir}/wav.scp" ${split_scps}
+
+echo "Decoding started... log: '${_logdir}/asr_inference.*.log'"
+# shellcheck disable=SC2086
+${decode_cmd} --max-jobs-run "${inference_nj}" JOB=1:"${inference_nj}" "${_logdir}"/asr_inference.JOB.log \
+    python -m funasr.bin.modelscope_infer \
+          --local_model_path ${exp_dir} \
+          --wav_list ${_logdir}/keys.JOB.scp \
+          --output_file ${_logdir}/text.JOB \
+          --gpuid_list ${gpuid_list} \
+          --njob ${njob} \
+          --ngpu ${_ngpu} \
+
+    for i in $(seq ${inference_nj}); do
+        cat ${_logdir}/text.${i}
+    done | sort -k1 >${_dir}/text
\ No newline at end of file
diff --git a/egs_modelscope/common/modelscope_utils/download_model.py b/egs_modelscope/common/modelscope_utils/download_model.py
new file mode 100755
index 000000000..5d5f70dd1
--- /dev/null
+++ b/egs_modelscope/common/modelscope_utils/download_model.py
@@ -0,0 +1,21 @@
+#!/usr/bin/env python3
+import argparse
+
+from modelscope.pipelines import pipeline
+from modelscope.utils.constant import Tasks
+
+if __name__ == '__main__':
+    parser = argparse.ArgumentParser(
+        description="download model configs",
+        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
+    )
+    parser.add_argument("--model_name",
+                        type=str,
+                        default="speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch",
+                        help="model name in modelscope")
+    args = parser.parse_args()
+
+    inference_pipeline = pipeline(
+        task=Tasks.auto_speech_recognition,
+        model='damo/{}'.format(args.model_name),
+        model_revision='v1.0.0')
diff --git a/egs_modelscope/common/modelscope_utils/modelscope_infer.sh b/egs_modelscope/common/modelscope_utils/modelscope_infer.sh
new file mode 100755
index 000000000..1a56dce98
--- /dev/null
+++ b/egs_modelscope/common/modelscope_utils/modelscope_infer.sh
@@ -0,0 +1,90 @@
+#!/usr/bin/env bash
+
+set -e
+set -u
+set -o pipefail
+
+data_dir=
+exp_dir=
+model_name=
+inference_nj=32
+gpuid_list="0,1,2,3"
+njob=32
+gpu_inference=true
+
+test_sets="dev test"
+decode_cmd=utils/run.pl
+
+# LM configs
+use_lm=false
+beam_size=1
+lm_weight=0.0
+
+. utils/parse_options.sh
+
+if ${gpu_inference}; then
+    _ngpu=1
+else
+    _ngpu=0
+fi
+
+# download model from modelscope
+python modelscope_utils/download_model.py \
+          --model_name ${model_name}
+
+modelscope_dir=${HOME}/.cache/modelscope/hub/damo/${model_name}
+
+
+for dset in ${test_sets}; do
+    _dir=${exp_dir}/${model_name}/decode_asr/${dset}
+    _logdir=${_dir}/logdir
+    _data=${data_dir}/${dset}
+    if [ -d ${_dir} ]; then
+        echo "${_dir} already exists. If you want to decode again, please delete ${_dir} first."
+        exit 1
+    else
+        mkdir -p "${_dir}"
+        mkdir -p "${_logdir}"
+    fi
+
+    if "${use_lm}"; then
+        [ -f ${modelscope_dir}/decode_asr_transformer.yaml.back ] || cp ${modelscope_dir}/decode_asr_transformer.yaml ${modelscope_dir}/decode_asr_transformer.yaml.back  # back up once, not once per test set
+        [ -f ${modelscope_dir}/decode_asr_transformer_wav.yaml.back ] || cp ${modelscope_dir}/decode_asr_transformer_wav.yaml ${modelscope_dir}/decode_asr_transformer_wav.yaml.back
+        sed -i "s#beam_size: [0-9]*#beam_size: ${beam_size}#g" ${modelscope_dir}/decode_asr_transformer.yaml
+        sed -i "s#beam_size: [0-9]*#beam_size: ${beam_size}#g" ${modelscope_dir}/decode_asr_transformer_wav.yaml
+        sed -i "s#lm_weight: 0\.[0-9]*#lm_weight: ${lm_weight}#g" ${modelscope_dir}/decode_asr_transformer.yaml
+        sed -i "s#lm_weight: 0\.[0-9]*#lm_weight: ${lm_weight}#g" ${modelscope_dir}/decode_asr_transformer_wav.yaml
+    fi
+
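+    # Reset per test set; otherwise key files from earlier test sets accumulate in the split list.
+    split_scps=""
+    for n in $(seq "${inference_nj}"); do split_scps+=" ${_logdir}/keys.${n}.scp"; done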
+    # shellcheck disable=SC2086
+    utils/split_scp.pl "${data_dir}/${dset}/wav.scp" ${split_scps}
+
+    echo "Decoding started... log: '${_logdir}/asr_inference.*.log'"
+    # shellcheck disable=SC2086
+    ${decode_cmd} --max-jobs-run "${inference_nj}" JOB=1:"${inference_nj}" "${_logdir}"/asr_inference.JOB.log \
+        python -m funasr.bin.modelscope_infer \
+              --model_name ${model_name} \
+              --wav_list ${_logdir}/keys.JOB.scp \
+              --output_file ${_logdir}/text.JOB \
+              --gpuid_list ${gpuid_list} \
+              --njob ${njob} \
+              --ngpu ${_ngpu} \
+
+        for i in $(seq ${inference_nj}); do
+            cat ${_logdir}/text.${i}
+        done | sort -k1 >${_dir}/text
+
+        python utils/proce_text.py ${_dir}/text ${_dir}/text.proc
+        python utils/proce_text.py ${_data}/text ${_data}/text.proc
+        python utils/compute_wer.py ${_data}/text.proc ${_dir}/text.proc ${_dir}/text.cer
+        tail -n 3 ${_dir}/text.cer > ${_dir}/text.cer.txt
+        cat ${_dir}/text.cer.txt
+done
+
+if "${use_lm}"; then
+    mv ${modelscope_dir}/decode_asr_transformer.yaml.back  ${modelscope_dir}/decode_asr_transformer.yaml
+    mv ${modelscope_dir}/decode_asr_transformer_wav.yaml.back ${modelscope_dir}/decode_asr_transformer_wav.yaml
+fi
diff --git a/egs_modelscope/common/modelscope_utils/update_config.py b/egs_modelscope/common/modelscope_utils/update_config.py
new file mode 100644
index 000000000..88466edcd
--- /dev/null
+++ b/egs_modelscope/common/modelscope_utils/update_config.py
@@ -0,0 +1,41 @@
+import yaml
+import argparse
+
+def update_dct(fin_configs, root):
+    if root == {}:
+        return {}
+    for root_key, root_value in root.items():
+        if not isinstance(root_value, dict):
+            fin_configs[root_key] = root_value
+        else:
+            base = fin_configs.get(root_key)  # section may be missing from the base config
+            fin_configs[root_key] = update_dct(base if isinstance(base, dict) else {}, root_value)
+    return fin_configs
+
+if __name__ == '__main__':
+    parser = argparse.ArgumentParser(
+        description="update configs",
+        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
+    )
+    parser.add_argument("--modelscope_config",
+                        type=str,
+                        help="modelscope config file")
+    parser.add_argument("--finetune_config",
+                        type=str,
+                        help="finetune config file")
+    parser.add_argument("--output_config",
+                        type=str,
+                        help="output config file")
+    args = parser.parse_args()
+
+    with open(args.modelscope_config) as f:
+        modelscope_configs = yaml.safe_load(f)
+
+    with open(args.finetune_config) as f:
+        finetune_configs = yaml.safe_load(f)
+
+    # update configs, e.g., lr, batch_size, ...
+    modelscope_configs = update_dct(modelscope_configs, finetune_configs)
+
+    with open(args.output_config, "w") as f:
+        yaml.dump(modelscope_configs, f, indent=4)
diff --git a/egs_modelscope/common/path.sh b/egs_modelscope/common/path.sh
new file mode 100755
index 000000000..c340218c2
--- /dev/null
+++ b/egs_modelscope/common/path.sh
@@ -0,0 +1,5 @@
+export FUNASR_DIR=$PWD/../..
+
+# NOTE(kan-bayashi): Use UTF-8 in Python to avoid UnicodeDecodeError when LC_ALL=C
+export PYTHONIOENCODING=UTF-8
+export PATH=$FUNASR_DIR/funasr/bin:$PATH
diff --git a/egs_modelscope/common/utils b/egs_modelscope/common/utils
new file mode 120000
index 000000000..cbef564a5
--- /dev/null
+++ b/egs_modelscope/common/utils
@@ -0,0 +1 @@
+../../egs/aishell/tranformer/utils/
\ No newline at end of file
diff --git a/egs_modelscope/speechio/paraformer/README.md b/egs_modelscope/speechio/paraformer/README.md
new file mode 100644
index 000000000..669185f6e
--- /dev/null
+++ b/egs_modelscope/speechio/paraformer/README.md
@@ -0,0 +1,24 @@
+# ModelScope: Paraformer-large Model
+
+## Highlight
+
+### ModelScope: Paraformer-Large Model
+- <strong>Fast</strong>: As a non-autoregressive (NAR) model, Paraformer achieves performance comparable to the state-of-the-art autoregressive (AR) Transformer with more than 10x speedup.
+- <strong>Accurate</strong>: State-of-the-art results on many public ASR tasks, with significant relative improvements; suitable for industrial deployment.
+- <strong>Convenient</strong>: Quickly and easily download Paraformer-large from ModelScope for fine-tuning and inference.
+    - Supports fine-tuning and inference on AISHELL-1 and AISHELL-2.
+    - Supports inference on AISHELL-1, AISHELL-2, WenetSpeech, SpeechIO, and other audio.
+
+## How to infer using a pretrained ModelScope Paraformer-large Model
+
+### Inference
+- Set the following parameters in `paraformer_large_infer.sh`:
+    - <strong>ori_data:</strong> path to the raw SpeechIO data
+    - <strong>data_dir:</strong> output directory for the prepared data
+    - <strong>exp_dir:</strong> directory for the results
+    - <strong>model_name:</strong> speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch # base model, downloaded from ModelScope
+    - <strong>test_sets:</strong> names of the test sets to decode
+- Then run the inference pipeline:
+```sh
+    bash ./paraformer_large_infer.sh
+```
diff --git a/egs_modelscope/speechio/paraformer/RESULTS.md b/egs_modelscope/speechio/paraformer/RESULTS.md
new file mode 100644
index 000000000..9938e74fe
--- /dev/null
+++ b/egs_modelscope/speechio/paraformer/RESULTS.md
@@ -0,0 +1,42 @@
+# Paraformer-Large
+- Model link: <https://www.modelscope.cn/models/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary>
+- Model size: 220M
+- Train config: conf/train_asr_paraformer_sanm_50e_16d_2048_512_lfr6.yaml
+
+# Environments
+- date: `Tue Nov 22 18:48:39 CST 2022`
+- python version: `3.7.12`
+- FunASR version: `0.1.0`
+- pytorch version: `pytorch 1.7.0`
+- Git hash: ``
+- Commit date: ``
+
+# Benchmark Results
+
+
+## SpeechIO TIOBE
+- Decode config 1: conf/decode_asr_transformer_noctc_1best.yaml
+  - Decode without CTC
+  - Decode without LM
+- Decode config 2: conf/decode_asr_transformer_noctc_10best_lm_weight_0.15.yaml
+  - Decode without CTC
+  - Decode with Transformer-LM
+  - LM weight: 0.15
+
+| testset | w/o LM | w/ LM |
+|:------------------:|:----:|:----:|
+|SPEECHIO_ASR_ZH00001| 0.49 | 0.35 |
+|SPEECHIO_ASR_ZH00002| 3.23 | 2.86 |
+|SPEECHIO_ASR_ZH00003| 1.13 | 0.80 |
+|SPEECHIO_ASR_ZH00004| 1.33 | 1.10 |
+|SPEECHIO_ASR_ZH00005| 1.41 | 1.18 |
+|SPEECHIO_ASR_ZH00006| 5.25 | 4.85 |
+|SPEECHIO_ASR_ZH00007| 5.51 | 4.97 |
+|SPEECHIO_ASR_ZH00008| 3.69 | 3.18 |
+|SPEECHIO_ASR_ZH00009| 3.02 | 2.78 |
+|SPEECHIO_ASR_ZH00010| 3.35 | 2.99 |
+|SPEECHIO_ASR_ZH00011| 1.54 | 1.25 |
+|SPEECHIO_ASR_ZH00012| 2.06 | 1.68 |
+|SPEECHIO_ASR_ZH00013| 2.57 | 2.25 |
+|SPEECHIO_ASR_ZH00014| 3.86 | 3.08 |
+|SPEECHIO_ASR_ZH00015| 3.34 | 2.67 |
diff --git a/egs_modelscope/speechio/paraformer/modelscope_utils b/egs_modelscope/speechio/paraformer/modelscope_utils
new file mode 120000
index 000000000..fc97768c8
--- /dev/null
+++ b/egs_modelscope/speechio/paraformer/modelscope_utils
@@ -0,0 +1 @@
+../../common/modelscope_utils
\ No newline at end of file
diff --git a/egs_modelscope/speechio/paraformer/paraformer_large_infer.sh b/egs_modelscope/speechio/paraformer/paraformer_large_infer.sh
new file mode 100755
index 000000000..bcf8c331c
--- /dev/null
+++ b/egs_modelscope/speechio/paraformer/paraformer_large_infer.sh
@@ -0,0 +1,81 @@
+#!/usr/bin/env bash
+
+set -e
+set -u
+set -o pipefail
+
+ori_data=
+data_dir=
+exp_dir=
+model_name=speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch
+inference_nj=32
+gpuid_list="0,1" # set gpus, e.g., gpuid_list="0,1"
+ngpu=$(echo $gpuid_list | awk -F "," '{print NF}')
+njob=4  # the number of jobs for each gpu
+gpu_inference=true  # Whether to perform gpu decoding, set false for cpu decoding
+
+if ${gpu_inference}; then
+    inference_nj=$((ngpu * njob))
+else
+    inference_nj=$njob
+fi
+
+# LM configs
+use_lm=false
+beam_size=1
+lm_weight=0.0
+
+test_sets="SPEECHIO_ASR_ZH00001 SPEECHIO_ASR_ZH00002 SPEECHIO_ASR_ZH00003 SPEECHIO_ASR_ZH00004 SPEECHIO_ASR_ZH00005 SPEECHIO_ASR_ZH00006 SPEECHIO_ASR_ZH00007 SPEECHIO_ASR_ZH00008 SPEECHIO_ASR_ZH00009 SPEECHIO_ASR_ZH00010 SPEECHIO_ASR_ZH00011 SPEECHIO_ASR_ZH00012 SPEECHIO_ASR_ZH00013 SPEECHIO_ASR_ZH00014 SPEECHIO_ASR_ZH00015"
+
+. utils/parse_options.sh
+
+for tset_name in ${test_sets}; do
+    test_dir=${data_dir}/speechio/${tset_name}
+    mkdir -p ${test_dir}
+    find ${ori_data}/${tset_name} -iname "*.wav" > ${test_dir}/wav.flist
+    sed -e 's/\.wav//' ${test_dir}/wav.flist | awk -F '/' '{print $NF}' > ${test_dir}/utt.list
+    paste -d' ' ${test_dir}/utt.list ${test_dir}/wav.flist > ${test_dir}/wav.scp
+    cp ${ori_data}/${tset_name}/trans.txt ${test_dir}/text
+    sed -i "s/\t/ /g" ${test_dir}/text
+done
+
+mkdir -p ${exp_dir}/speechio
+
+modelscope_utils/modelscope_infer.sh \
+        --data_dir ${data_dir}/speechio \
+        --exp_dir ${exp_dir}/speechio \
+        --test_sets "${test_sets}" \
+        --model_name ${model_name} \
+        --inference_nj ${inference_nj} \
+        --gpuid_list ${gpuid_list} \
+        --njob ${njob} \
+        --gpu_inference ${gpu_inference} \
+        --use_lm ${use_lm} \
+        --beam_size ${beam_size} \
+        --lm_weight ${lm_weight}
+
+#  SpeechIO TIOBE textnorm
+for tset_name in ${test_sets}; do
+    echo "$0 --> Normalizing REF text ..."
+    ./utils/textnorm_zh.py \
+        --has_key --to_upper \
+        ${ori_data}/${tset_name}/trans.txt \
+        ${data_dir}/speechio/${tset_name}/ref.txt
+    
+    cp ${exp_dir}/speechio/${model_name}/decode_asr/${tset_name}/text.proc ${exp_dir}/speechio/${model_name}/decode_asr/${tset_name}/raw_rec.txt
+    sed -i "s#</s>##g" ${exp_dir}/speechio/${model_name}/decode_asr/${tset_name}/raw_rec.txt 
+    echo "$0 --> Normalizing HYP text ..."
+    ./utils/textnorm_zh.py \
+        --has_key --to_upper \
+        ${exp_dir}/speechio/${model_name}/decode_asr/${tset_name}/raw_rec.txt \
+        ${exp_dir}/speechio/${model_name}/decode_asr/${tset_name}/rec.txt
+    grep -v $'\t$' ${exp_dir}/speechio/${model_name}/decode_asr/${tset_name}/rec.txt > ${exp_dir}/speechio/${model_name}/decode_asr/${tset_name}/rec_non_empty.txt
+
+    echo "$0 --> computing WER/CER and alignment ..."
+    ./utils/error_rate_zh \
+        --tokenizer char \
+        --ref ${data_dir}/speechio/${tset_name}/ref.txt \
+        --hyp ${exp_dir}/speechio/${model_name}/decode_asr/${tset_name}/rec_non_empty.txt \
+        ${exp_dir}/speechio/${model_name}/decode_asr/${tset_name}/DETAILS.txt | tee ${exp_dir}/speechio/${model_name}/decode_asr/${tset_name}/RESULTS.txt
+done
+
diff --git a/egs_modelscope/speechio/paraformer/path.sh b/egs_modelscope/speechio/paraformer/path.sh
new file mode 100755
index 000000000..7972642d0
--- /dev/null
+++ b/egs_modelscope/speechio/paraformer/path.sh
@@ -0,0 +1,5 @@
+export FUNASR_DIR=$PWD/../../..
+
+# NOTE(kan-bayashi): Use UTF-8 in Python to avoid UnicodeDecodeError when LC_ALL=C
+export PYTHONIOENCODING=UTF-8
+export PATH=$FUNASR_DIR/funasr/bin:$PATH
diff --git a/egs_modelscope/speechio/paraformer/utils b/egs_modelscope/speechio/paraformer/utils
new file mode 120000
index 000000000..37d976175
--- /dev/null
+++ b/egs_modelscope/speechio/paraformer/utils
@@ -0,0 +1 @@
+../../../egs/aishell/tranformer/utils/
\ No newline at end of file
diff --git a/egs_modelscope/wenetspeech/paraformer/README.md b/egs_modelscope/wenetspeech/paraformer/README.md
new file mode 100644
index 000000000..9dc5f3f4b
--- /dev/null
+++ b/egs_modelscope/wenetspeech/paraformer/README.md
@@ -0,0 +1,24 @@
+# ModelScope: Paraformer-large Model
+
+## Highlight
+
+### ModelScope: Paraformer-Large Model
+- <strong>Fast</strong>: As a non-autoregressive (NAR) model, Paraformer achieves performance comparable to the state-of-the-art autoregressive (AR) Transformer with more than 10x speedup.
+- <strong>Accurate</strong>: State-of-the-art results on many public ASR tasks, with significant relative improvements; suitable for industrial deployment.
+- <strong>Convenient</strong>: Quickly and easily download Paraformer-large from ModelScope for fine-tuning and inference.
+    - Supports fine-tuning and inference on AISHELL-1 and AISHELL-2.
+    - Supports inference on AISHELL-1, AISHELL-2, WenetSpeech, SpeechIO, and other audio.
+
+## How to infer using a pretrained ModelScope Paraformer-large Model
+
+### Inference
+- Set the following parameters in `paraformer_large_infer.sh`:
+    - <strong>ori_data:</strong> path to the raw WenetSpeech data
+    - <strong>data_dir:</strong> output directory for the prepared data
+    - <strong>exp_dir:</strong> directory for the results
+    - <strong>model_name:</strong> speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch # base model, downloaded from ModelScope
+    - <strong>test_sets:</strong> names of the test sets to decode
+- Then run the inference pipeline:
+```sh
+    bash ./paraformer_large_infer.sh
+```
diff --git a/egs_modelscope/wenetspeech/paraformer/RESULTS.md b/egs_modelscope/wenetspeech/paraformer/RESULTS.md
new file mode 100644
index 000000000..a912c92a6
--- /dev/null
+++ b/egs_modelscope/wenetspeech/paraformer/RESULTS.md
@@ -0,0 +1,25 @@
+# Paraformer-Large
+- Model link: <https://www.modelscope.cn/models/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary>
+- Model size: 220M
+- Train config: conf/train_asr_paraformer_sanm_50e_16d_2048_512_lfr6.yaml
+
+# Environments
+- date: `Tue Nov 22 18:48:39 CST 2022`
+- python version: `3.7.12`
+- FunASR version: `0.1.0`
+- pytorch version: `pytorch 1.7.0`
+- Git hash: ``
+- Commit date: ``
+
+# Benchmark Results
+
+## Wenetspeech
+- Decode config: conf/decode_asr_transformer_noctc_1best.yaml
+  - Decode without CTC
+  - Decode without LM
+
+| testset      | CER(%) |
+|:------------:|:------:|
+| dev          | 3.57   |
+| test_meeting | 6.97   |
+| test_net     | 6.74   |
diff --git a/egs_modelscope/wenetspeech/paraformer/modelscope_utils b/egs_modelscope/wenetspeech/paraformer/modelscope_utils
new file mode 120000
index 000000000..fc97768c8
--- /dev/null
+++ b/egs_modelscope/wenetspeech/paraformer/modelscope_utils
@@ -0,0 +1 @@
+../../common/modelscope_utils
\ No newline at end of file
diff --git a/egs_modelscope/wenetspeech/paraformer/paraformer_large_infer.sh b/egs_modelscope/wenetspeech/paraformer/paraformer_large_infer.sh
new file mode 100755
index 000000000..182a32488
--- /dev/null
+++ b/egs_modelscope/wenetspeech/paraformer/paraformer_large_infer.sh
@@ -0,0 +1,56 @@
+#!/usr/bin/env bash
+
+set -e
+set -u
+set -o pipefail
+
+ori_data=
+data_dir=
+exp_dir=
+model_name=speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch
+inference_nj=32
+gpuid_list="0,1" # set gpus, e.g., gpuid_list="0,1"
+ngpu=$(echo $gpuid_list | awk -F "," '{print NF}')
+njob=4  # the number of jobs for each gpu
+gpu_inference=true  # Whether to perform gpu decoding, set false for cpu decoding
+
+if ${gpu_inference}; then
+    inference_nj=$((ngpu * njob))
+else
+    inference_nj=$njob
+fi
+
+# LM configs
+use_lm=false
+beam_size=1
+lm_weight=0.0
+
+test_sets="dev test_meeting test_net"
+
+. utils/parse_options.sh
+
+for tset_name in ${test_sets}; do
+    test_dir=${data_dir}/wenetspeech/${tset_name}
+    mkdir -p ${test_dir} 
+    find ${ori_data}/${tset_name} -iname "*.wav" > ${test_dir}/wav.flist
+    sed -e 's/\.wav//' ${test_dir}/wav.flist | awk -F '/' '{print $NF}' > ${test_dir}/utt.list
+    paste -d' ' ${test_dir}/utt.list ${test_dir}/wav.flist > ${test_dir}/wav.scp
+    cp ${ori_data}/${tset_name}/trans.txt ${test_dir}/text
+    sed -i "s/\t/ /g" ${test_dir}/text
+done
+
+mkdir -p ${exp_dir}/wenetspeech
+
+modelscope_utils/modelscope_infer.sh \
+        --data_dir ${data_dir}/wenetspeech \
+        --exp_dir ${exp_dir}/wenetspeech \
+        --test_sets "${test_sets}" \
+        --model_name ${model_name} \
+        --inference_nj ${inference_nj} \
+        --gpuid_list ${gpuid_list} \
+        --njob ${njob} \
+        --gpu_inference ${gpu_inference} \
+        --use_lm ${use_lm} \
+        --beam_size ${beam_size} \
+        --lm_weight ${lm_weight}
+
diff --git a/egs_modelscope/wenetspeech/paraformer/path.sh b/egs_modelscope/wenetspeech/paraformer/path.sh
new file mode 100755
index 000000000..7972642d0
--- /dev/null
+++ b/egs_modelscope/wenetspeech/paraformer/path.sh
@@ -0,0 +1,5 @@
+export FUNASR_DIR=$PWD/../../..
+
+# NOTE(kan-bayashi): Use UTF-8 in Python to avoid UnicodeDecodeError when LC_ALL=C
+export PYTHONIOENCODING=UTF-8
+export PATH=$FUNASR_DIR/funasr/bin:$PATH
diff --git a/egs_modelscope/wenetspeech/paraformer/utils b/egs_modelscope/wenetspeech/paraformer/utils
new file mode 120000
index 000000000..37d976175
--- /dev/null
+++ b/egs_modelscope/wenetspeech/paraformer/utils
@@ -0,0 +1 @@
+../../../egs/aishell/tranformer/utils/
\ No newline at end of file
diff --git a/funasr/__init__.py b/funasr/__init__.py
new file mode 100644
index 000000000..f297bc3e6
--- /dev/null
+++ b/funasr/__init__.py
@@ -0,0 +1,8 @@
+"""Initialize funasr package."""
+
+import os
+
+dirname = os.path.dirname(__file__)
+version_file = os.path.join(dirname, "version.txt")
+with open(version_file, "r") as f:
+    __version__ = f.read().strip()
diff --git a/funasr/bin/__init__.py b/funasr/bin/__init__.py
new file mode 100644
index 000000000..e69de29bb
diff --git a/funasr/bin/aggregate_stats_dirs.py b/funasr/bin/aggregate_stats_dirs.py
new file mode 100755
index 000000000..94cbdf888
--- /dev/null
+++ b/funasr/bin/aggregate_stats_dirs.py
@@ -0,0 +1,108 @@
+#!/usr/bin/env python3
+import argparse
+import logging
+import sys
+from pathlib import Path
+from typing import Iterable
+from typing import Union
+
+import numpy as np
+
+from funasr.utils.cli_utils import get_commandline_args
+
+
+def aggregate_stats_dirs(
+        input_dir: Iterable[Union[str, Path]],
+        output_dir: Union[str, Path],
+        log_level: str,
+        skip_sum_stats: bool,
+):
+    logging.basicConfig(
+        level=log_level,
+        format="%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s",
+    )
+
+    input_dirs = [Path(p) for p in input_dir]
+    output_dir = Path(output_dir)
+
+    for mode in ["train", "valid"]:
+        with (input_dirs[0] / mode / "batch_keys").open("r", encoding="utf-8") as f:
+            batch_keys = [line.strip() for line in f if line.strip() != ""]
+        with (input_dirs[0] / mode / "stats_keys").open("r", encoding="utf-8") as f:
+            stats_keys = [line.strip() for line in f if line.strip() != ""]
+        (output_dir / mode).mkdir(parents=True, exist_ok=True)
+
+        for key in batch_keys:
+            with (output_dir / mode / f"{key}_shape").open(
+                    "w", encoding="utf-8"
+            ) as fout:
+                for idir in input_dirs:
+                    with (idir / mode / f"{key}_shape").open(
+                            "r", encoding="utf-8"
+                    ) as fin:
+                        # Read to the last in order to sort keys
+                        # because the order can be changed if num_workers>=1
+                        lines = fin.readlines()
+                        lines = sorted(lines, key=lambda x: x.split()[0])
+                        for line in lines:
+                            fout.write(line)
+
+        for key in stats_keys:
+            if not skip_sum_stats:
+                sum_stats = None
+                for idir in input_dirs:
+                    stats = np.load(idir / mode / f"{key}_stats.npz")
+                    if sum_stats is None:
+                        sum_stats = dict(**stats)
+                    else:
+                        for k in stats:
+                            sum_stats[k] += stats[k]
+
+                np.savez(output_dir / mode / f"{key}_stats.npz", **sum_stats)
+
+            # if --write_collected_feats=true
+            p = Path(mode) / "collect_feats" / f"{key}.scp"
+            scp = input_dirs[0] / p
+            if scp.exists():
+                (output_dir / p).parent.mkdir(parents=True, exist_ok=True)
+                with (output_dir / p).open("w", encoding="utf-8") as fout:
+                    for idir in input_dirs:
+                        with (idir / p).open("r", encoding="utf-8") as fin:
+                            for line in fin:
+                                fout.write(line)
+
+
+def get_parser() -> argparse.ArgumentParser:
+    parser = argparse.ArgumentParser(
+        description="Aggregate statistics directories into one directory",
+        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
+    )
+    parser.add_argument(
+        "--log_level",
+        type=lambda x: x.upper(),
+        default="INFO",
+        choices=("CRITICAL", "ERROR", "WARNING", "INFO", "DEBUG", "NOTSET"),
+        help="The verbose level of logging",
+    )
+    parser.add_argument(
+        "--skip_sum_stats",
+        default=False,
+        action="store_true",
+        help="Skip computing the sum of statistics.",
+    )
+
+    parser.add_argument("--input_dir", action="append", help="Input directories")
+    parser.add_argument("--output_dir", required=True, help="Output directory")
+    return parser
+
+
+def main(cmd=None):
+    print(get_commandline_args(), file=sys.stderr)
+    parser = get_parser()
+    args = parser.parse_args(cmd)
+    kwargs = vars(args)
+    aggregate_stats_dirs(**kwargs)
+
+
+if __name__ == "__main__":
+    main()
diff --git a/funasr/bin/asr_inference.py b/funasr/bin/asr_inference.py
new file mode 100755
index 000000000..6ee0ffef8
--- /dev/null
+++ b/funasr/bin/asr_inference.py
@@ -0,0 +1,548 @@
+#!/usr/bin/env python3
+# Copyright ESPnet (https://github.com/espnet/espnet). All Rights Reserved.
+#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)
+
+import argparse
+import logging
+import sys
+from pathlib import Path
+from typing import Any
+from typing import List
+from typing import Optional
+from typing import Sequence
+from typing import Tuple
+from typing import Union
+
+import numpy as np
+import torch
+from typeguard import check_argument_types
+from typeguard import check_return_type
+
+from funasr.fileio.datadir_writer import DatadirWriter
+from funasr.modules.beam_search.batch_beam_search import BatchBeamSearch
+from funasr.modules.beam_search.batch_beam_search_online_sim import BatchBeamSearchOnlineSim
+from funasr.modules.beam_search.beam_search import BeamSearch
+from funasr.modules.beam_search.beam_search import Hypothesis
+from funasr.modules.scorers.ctc import CTCPrefixScorer
+from funasr.modules.scorers.length_bonus import LengthBonus
+from funasr.modules.scorers.scorer_interface import BatchScorerInterface
+from funasr.modules.subsampling import TooShortUttError
+from funasr.tasks.asr import ASRTask
+from funasr.tasks.lm import LMTask
+from funasr.text.build_tokenizer import build_tokenizer
+from funasr.text.token_id_converter import TokenIDConverter
+from funasr.torch_utils.device_funcs import to_device
+from funasr.torch_utils.set_all_random_seed import set_all_random_seed
+from funasr.utils import config_argparse
+from funasr.utils.cli_utils import get_commandline_args
+from funasr.utils.types import str2bool
+from funasr.utils.types import str2triple_str
+from funasr.utils.types import str_or_none
+
+
+class Speech2Text:
+    """Speech2Text class
+
+    Examples:
+        >>> import soundfile
+        >>> speech2text = Speech2Text("asr_config.yml", "asr.pth")
+        >>> audio, rate = soundfile.read("speech.wav")
+        >>> speech2text(audio)
+        [(text, token, token_int, hypothesis object), ...]
+
+    """
+
+    def __init__(
+            self,
+            asr_train_config: Union[Path, str] = None,
+            asr_model_file: Union[Path, str] = None,
+            lm_train_config: Union[Path, str] = None,
+            lm_file: Union[Path, str] = None,
+            token_type: str = None,
+            bpemodel: str = None,
+            device: str = "cpu",
+            maxlenratio: float = 0.0,
+            minlenratio: float = 0.0,
+            batch_size: int = 1,
+            dtype: str = "float32",
+            beam_size: int = 20,
+            ctc_weight: float = 0.5,
+            lm_weight: float = 1.0,
+            ngram_weight: float = 0.9,
+            penalty: float = 0.0,
+            nbest: int = 1,
+            streaming: bool = False,
+            **kwargs,
+    ):
+        assert check_argument_types()
+
+        # 1. Build ASR model
+        scorers = {}
+        asr_model, asr_train_args = ASRTask.build_model_from_file(
+            asr_train_config, asr_model_file, device
+        )
+        logging.info("asr_model: {}".format(asr_model))
+        logging.info("asr_train_args: {}".format(asr_train_args))
+        asr_model.to(dtype=getattr(torch, dtype)).eval()
+
+        decoder = asr_model.decoder
+
+        ctc = CTCPrefixScorer(ctc=asr_model.ctc, eos=asr_model.eos)
+        token_list = asr_model.token_list
+        scorers.update(
+            decoder=decoder,
+            ctc=ctc,
+            length_bonus=LengthBonus(len(token_list)),
+        )
+
+        # 2. Build Language model
+        if lm_train_config is not None:
+            lm, lm_train_args = LMTask.build_model_from_file(
+                lm_train_config, lm_file, device
+            )
+            scorers["lm"] = lm.lm
+
+        # 3. Build ngram model
+        # ngram is not supported now
+        ngram = None
+        scorers["ngram"] = ngram
+
+        # 4. Build BeamSearch object
+        # transducer is not supported now
+        beam_search_transducer = None
+
+        weights = dict(
+            decoder=1.0 - ctc_weight,
+            ctc=ctc_weight,
+            lm=lm_weight,
+            ngram=ngram_weight,
+            length_bonus=penalty,
+        )
+        beam_search = BeamSearch(
+            beam_size=beam_size,
+            weights=weights,
+            scorers=scorers,
+            sos=asr_model.sos,
+            eos=asr_model.eos,
+            vocab_size=len(token_list),
+            token_list=token_list,
+            pre_beam_score_key=None if ctc_weight == 1.0 else "full",
+        )
+
+        # TODO(karita): make all scorers batchified
+        if batch_size == 1:
+            non_batch = [
+                k
+                for k, v in beam_search.full_scorers.items()
+                if not isinstance(v, BatchScorerInterface)
+            ]
+            if len(non_batch) == 0:
+                if streaming:
+                    beam_search.__class__ = BatchBeamSearchOnlineSim
+                    beam_search.set_streaming_config(asr_train_config)
+                    logging.info(
+                        "BatchBeamSearchOnlineSim implementation is selected."
+                    )
+                else:
+                    beam_search.__class__ = BatchBeamSearch
+                    logging.info("BatchBeamSearch implementation is selected.")
+            else:
+                logging.warning(
+                    f"As non-batch scorers {non_batch} are found, "
+                    f"fall back to non-batch implementation."
+                )
+
+            beam_search.to(device=device, dtype=getattr(torch, dtype)).eval()
+            for scorer in scorers.values():
+                if isinstance(scorer, torch.nn.Module):
+                    scorer.to(device=device, dtype=getattr(torch, dtype)).eval()
+            logging.info(f"Beam_search: {beam_search}")
+            logging.info(f"Decoding device={device}, dtype={dtype}")
+
+        # 5. [Optional] Build Text converter: e.g. bpe-sym -> Text
+        if token_type is None:
+            token_type = asr_train_args.token_type
+        if bpemodel is None:
+            bpemodel = asr_train_args.bpemodel
+
+        if token_type is None:
+            tokenizer = None
+        elif token_type == "bpe":
+            if bpemodel is not None:
+                tokenizer = build_tokenizer(token_type=token_type, bpemodel=bpemodel)
+            else:
+                tokenizer = None
+        else:
+            tokenizer = build_tokenizer(token_type=token_type)
+        converter = TokenIDConverter(token_list=token_list)
+        logging.info(f"Text tokenizer: {tokenizer}")
+
+        self.asr_model = asr_model
+        self.asr_train_args = asr_train_args
+        self.converter = converter
+        self.tokenizer = tokenizer
+        self.beam_search = beam_search
+        self.beam_search_transducer = beam_search_transducer
+        self.maxlenratio = maxlenratio
+        self.minlenratio = minlenratio
+        self.device = device
+        self.dtype = dtype
+        self.nbest = nbest
+
+    @torch.no_grad()
+    def __call__(
+            self, speech: Union[torch.Tensor, np.ndarray]
+    ) -> List[
+        Tuple[
+            Optional[str],
+            List[str],
+            List[int],
+            Union[Hypothesis],
+        ]
+    ]:
+        """Inference
+
+        Args:
+            speech: Input speech data
+        Returns:
+            text, token, token_int, hyp
+
+        """
+        assert check_argument_types()
+
+        # Input as audio signal
+        if isinstance(speech, np.ndarray):
+            speech = torch.tensor(speech)
+
+        # data: (Nsamples,) -> (1, Nsamples)
+        speech = speech.unsqueeze(0).to(getattr(torch, self.dtype))
+        # lengths: (1,)
+        lengths = speech.new_full([1], dtype=torch.long, fill_value=speech.size(1))
+        batch = {"speech": speech, "speech_lengths": lengths}
+
+        # a. To device
+        batch = to_device(batch, device=self.device)
+
+        # b. Forward Encoder
+        enc, _ = self.asr_model.encode(**batch)
+        if isinstance(enc, tuple):
+            enc = enc[0]
+        assert len(enc) == 1, len(enc)
+
+        # c. Pass the encoder result to the beam search
+        nbest_hyps = self.beam_search(
+            x=enc[0], maxlenratio=self.maxlenratio, minlenratio=self.minlenratio
+        )
+
+        nbest_hyps = nbest_hyps[: self.nbest]
+
+        results = []
+        for hyp in nbest_hyps:
+            assert isinstance(hyp, (Hypothesis)), type(hyp)
+
+            # remove sos/eos and get results
+            last_pos = -1
+            if isinstance(hyp.yseq, list):
+                token_int = hyp.yseq[1:last_pos]
+            else:
+                token_int = hyp.yseq[1:last_pos].tolist()
+
+            # remove blank symbol id, which is assumed to be 0
+            token_int = list(filter(lambda x: x != 0, token_int))
+
+            # Change integer-ids to tokens
+            token = self.converter.ids2tokens(token_int)
+
+            if self.tokenizer is not None:
+                text = self.tokenizer.tokens2text(token)
+            else:
+                text = None
+            results.append((text, token, token_int, hyp))
+
+        assert check_return_type(results)
+        return results
+
+
+def inference(
+        output_dir: str,
+        maxlenratio: float,
+        minlenratio: float,
+        batch_size: int,
+        dtype: str,
+        beam_size: int,
+        ngpu: int,
+        seed: int,
+        ctc_weight: float,
+        lm_weight: float,
+        ngram_weight: float,
+        penalty: float,
+        nbest: int,
+        num_workers: int,
+        log_level: Union[int, str],
+        data_path_and_name_and_type: Sequence[Tuple[str, str, str]],
+        key_file: Optional[str],
+        asr_train_config: Optional[str],
+        asr_model_file: Optional[str],
+        lm_train_config: Optional[str],
+        lm_file: Optional[str],
+        word_lm_train_config: Optional[str],
+        token_type: Optional[str],
+        bpemodel: Optional[str],
+        allow_variable_data_keys: bool,
+        streaming: bool,
+        **kwargs,
+):
+    assert check_argument_types()
+    if batch_size > 1:
+        raise NotImplementedError("batch decoding is not implemented")
+    if word_lm_train_config is not None:
+        raise NotImplementedError("Word LM is not implemented")
+    if ngpu > 1:
+        raise NotImplementedError("only single GPU decoding is supported")
+
+    logging.basicConfig(
+        level=log_level,
+        format="%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s",
+    )
+
+    if ngpu >= 1:
+        device = "cuda"
+    else:
+        device = "cpu"
+
+    # 1. Set random-seed
+    set_all_random_seed(seed)
+
+    # 2. Build speech2text
+    speech2text_kwargs = dict(
+        asr_train_config=asr_train_config,
+        asr_model_file=asr_model_file,
+        lm_train_config=lm_train_config,
+        lm_file=lm_file,
+        token_type=token_type,
+        bpemodel=bpemodel,
+        device=device,
+        maxlenratio=maxlenratio,
+        minlenratio=minlenratio,
+        dtype=dtype,
+        beam_size=beam_size,
+        ctc_weight=ctc_weight,
+        lm_weight=lm_weight,
+        ngram_weight=ngram_weight,
+        penalty=penalty,
+        nbest=nbest,
+        streaming=streaming,
+    )
+    logging.info("speech2text_kwargs: {}".format(speech2text_kwargs))
+    speech2text = Speech2Text(**speech2text_kwargs)
+
+    # 3. Build data-iterator
+    loader = ASRTask.build_streaming_iterator(
+        data_path_and_name_and_type,
+        dtype=dtype,
+        batch_size=batch_size,
+        key_file=key_file,
+        num_workers=num_workers,
+        preprocess_fn=ASRTask.build_preprocess_fn(speech2text.asr_train_args, False),
+        collate_fn=ASRTask.build_collate_fn(speech2text.asr_train_args, False),
+        allow_variable_data_keys=allow_variable_data_keys,
+        inference=True,
+    )
+
+    # 4. Start for-loop
+    # FIXME(kamo): The output format should be discussed
+    with DatadirWriter(output_dir) as writer:
+        for keys, batch in loader:
+            assert isinstance(batch, dict), type(batch)
+            assert all(isinstance(s, str) for s in keys), keys
+            _bs = len(next(iter(batch.values())))
+            assert len(keys) == _bs, f"{len(keys)} != {_bs}"
+            batch = {k: v[0] for k, v in batch.items() if not k.endswith("_lengths")}
+
+            # N-best list of (text, token, token_int, hyp_object)
+            try:
+                results = speech2text(**batch)
+            except TooShortUttError as e:
+                logging.warning(f"Utterance {keys} {e}")
+                hyp = Hypothesis(score=0.0, scores={}, states={}, yseq=[])
+                results = [[" ", ["<space>"], [2], hyp]] * nbest
+
+            # Only supporting batch_size==1
+            key = keys[0]
+            for n, (text, token, token_int, hyp) in zip(range(1, nbest + 1), results):
+                # Create a directory: outdir/{n}best_recog
+                ibest_writer = writer[f"{n}best_recog"]
+
+                # Write the result to each file
+                ibest_writer["token"][key] = " ".join(token)
+                ibest_writer["token_int"][key] = " ".join(map(str, token_int))
+                ibest_writer["score"][key] = str(hyp.score)
+
+                if text is not None:
+                    ibest_writer["text"][key] = text
+
+
+def get_parser():
+    parser = config_argparse.ArgumentParser(
+        description="ASR Decoding",
+        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
+    )
+
+    # Note(kamo): Use '_' instead of '-' as separator.
+    # '-' is confusing if written in yaml.
+    parser.add_argument(
+        "--log_level",
+        type=lambda x: x.upper(),
+        default="INFO",
+        choices=("CRITICAL", "ERROR", "WARNING", "INFO", "DEBUG", "NOTSET"),
+        help="The verbose level of logging",
+    )
+
+    parser.add_argument("--output_dir", type=str, required=True)
+    parser.add_argument(
+        "--ngpu",
+        type=int,
+        default=0,
+        help="The number of gpus. 0 indicates CPU mode",
+    )
+    parser.add_argument(
+        "--gpuid_list",
+        type=str,
+        default="",
+        help="The visible gpus",
+    )
+    parser.add_argument("--seed", type=int, default=0, help="Random seed")
+    parser.add_argument(
+        "--dtype",
+        default="float32",
+        choices=["float16", "float32", "float64"],
+        help="Data type",
+    )
+    parser.add_argument(
+        "--num_workers",
+        type=int,
+        default=1,
+        help="The number of workers used for DataLoader",
+    )
+
+    group = parser.add_argument_group("Input data related")
+    group.add_argument(
+        "--data_path_and_name_and_type",
+        type=str2triple_str,
+        required=True,
+        action="append",
+    )
+    group.add_argument("--key_file", type=str_or_none)
+    group.add_argument("--allow_variable_data_keys", type=str2bool, default=False)
+
+    group = parser.add_argument_group("The model configuration related")
+    group.add_argument(
+        "--asr_train_config",
+        type=str,
+        help="ASR training configuration",
+    )
+    group.add_argument(
+        "--asr_model_file",
+        type=str,
+        help="ASR model parameter file",
+    )
+    group.add_argument(
+        "--lm_train_config",
+        type=str,
+        help="LM training configuration",
+    )
+    group.add_argument(
+        "--lm_file",
+        type=str,
+        help="LM parameter file",
+    )
+    group.add_argument(
+        "--word_lm_train_config",
+        type=str,
+        help="Word LM training configuration",
+    )
+    group.add_argument(
+        "--word_lm_file",
+        type=str,
+        help="Word LM parameter file",
+    )
+    group.add_argument(
+        "--ngram_file",
+        type=str,
+        help="N-gram parameter file",
+    )
+    group.add_argument(
+        "--model_tag",
+        type=str,
+        help="Pretrained model tag. If this option is specified, *_train_config and "
+             "*_file will be overwritten",
+    )
+
+    group = parser.add_argument_group("Beam-search related")
+    group.add_argument(
+        "--batch_size",
+        type=int,
+        default=1,
+        help="The batch size for inference",
+    )
+    group.add_argument("--nbest", type=int, default=1, help="Output N-best hypotheses")
+    group.add_argument("--beam_size", type=int, default=20, help="Beam size")
+    group.add_argument("--penalty", type=float, default=0.0, help="Insertion penalty")
+    group.add_argument(
+        "--maxlenratio",
+        type=float,
+        default=0.0,
+        help="Input length ratio to obtain max output length. "
+             "If maxlenratio=0.0 (default), it uses an end-detect "
+             "function "
+             "to automatically find maximum hypothesis lengths. "
+             "If maxlenratio<0.0, its absolute value is interpreted "
+             "as a constant max output length",
+    )
+    group.add_argument(
+        "--minlenratio",
+        type=float,
+        default=0.0,
+        help="Input length ratio to obtain min output length",
+    )
+    group.add_argument(
+        "--ctc_weight",
+        type=float,
+        default=0.5,
+        help="CTC weight in joint decoding",
+    )
+    group.add_argument("--lm_weight", type=float, default=1.0, help="RNNLM weight")
+    group.add_argument("--ngram_weight", type=float, default=0.9, help="ngram weight")
+    group.add_argument("--streaming", type=str2bool, default=False)
+
+    group = parser.add_argument_group("Text converter related")
+    group.add_argument(
+        "--token_type",
+        type=str_or_none,
+        default=None,
+        choices=["char", "bpe", None],
+        help="The token type for ASR model. "
+             "If not given, it is taken from the training args",
+    )
+    group.add_argument(
+        "--bpemodel",
+        type=str_or_none,
+        default=None,
+        help="The model path of sentencepiece. "
+             "If not given, it is taken from the training args",
+    )
+
+    return parser
+
+
+def main(cmd=None):
+    print(get_commandline_args(), file=sys.stderr)
+    parser = get_parser()
+    args = parser.parse_args(cmd)
+    kwargs = vars(args)
+    kwargs.pop("config", None)
+    inference(**kwargs)
+
+
+if __name__ == "__main__":
+    main()
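Reviewer note on the hypothesis post-processing in `Speech2Text.__call__` above: each hypothesis has its first and last ids (sos/eos) dropped, then every id equal to 0 (assumed to be the blank symbol) is filtered out, and the remaining ids are mapped to tokens. A minimal standalone sketch of that step follows. The helper names `strip_hypothesis` and `ids_to_tokens` and the toy `token_list` are hypothetical and are not part of FunASR; in the patch the mapping is done by `TokenIDConverter.ids2tokens`.

```python
from typing import List, Sequence


def strip_hypothesis(yseq: Sequence[int], blank_id: int = 0) -> List[int]:
    """Drop the leading sos, the trailing eos, and any blank ids."""
    inner = list(yseq)[1:-1]  # same slice as hyp.yseq[1:last_pos] with last_pos = -1
    return [t for t in inner if t != blank_id]


def ids_to_tokens(token_int: Sequence[int], token_list: Sequence[str]) -> List[str]:
    """Look up each id in the vocabulary (stand-in for the real converter)."""
    return [token_list[i] for i in token_int]


# Toy vocabulary (an assumption for illustration): id 0 is blank, id 5 is sos/eos.
token_list = ["<blank>", "<unk>", "<space>", "h", "i", "<sos/eos>"]
yseq = [5, 3, 0, 4, 5]

token_int = strip_hypothesis(yseq)             # [3, 4]
tokens = ids_to_tokens(token_int, token_list)  # ['h', 'i']
```

Note that an empty `yseq`, as used in the `TooShortUttError` fallback, yields an empty list here as well, since slicing an empty sequence is safe.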
diff --git a/funasr/bin/asr_inference_launch.py b/funasr/bin/asr_inference_launch.py
new file mode 100755
index 000000000..9d328ad21
--- /dev/null
+++ b/funasr/bin/asr_inference_launch.py
@@ -0,0 +1,225 @@
+#!/usr/bin/env python3
+# Copyright ESPnet (https://github.com/espnet/espnet). All Rights Reserved.
+#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)
+
+import argparse
+import logging
+import os
+import sys
+
+from funasr.utils import config_argparse
+from funasr.utils.cli_utils import get_commandline_args
+from funasr.utils.types import str2bool
+from funasr.utils.types import str2triple_str
+from funasr.utils.types import str_or_none
+
+
+def get_parser():
+    parser = config_argparse.ArgumentParser(
+        description="ASR Decoding",
+        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
+    )
+
+    # Note(kamo): Use '_' instead of '-' as separator.
+    # '-' is confusing if written in yaml.
+    parser.add_argument(
+        "--log_level",
+        type=lambda x: x.upper(),
+        default="INFO",
+        choices=("CRITICAL", "ERROR", "WARNING", "INFO", "DEBUG", "NOTSET"),
+        help="The verbose level of logging",
+    )
+
+    parser.add_argument("--output_dir", type=str, required=True)
+    parser.add_argument(
+        "--ngpu",
+        type=int,
+        default=0,
+        help="The number of gpus. 0 indicates CPU mode",
+    )
+    parser.add_argument(
+        "--njob",
+        type=int,
+        default=1,
+        help="The number of jobs for each gpu",
+    )
+    parser.add_argument(
+        "--gpuid_list",
+        type=str,
+        default="",
+        help="The visible gpus",
+    )
+    parser.add_argument("--seed", type=int, default=0, help="Random seed")
+    parser.add_argument(
+        "--dtype",
+        default="float32",
+        choices=["float16", "float32", "float64"],
+        help="Data type",
+    )
+    parser.add_argument(
+        "--num_workers",
+        type=int,
+        default=1,
+        help="The number of workers used for DataLoader",
+    )
+
+    group = parser.add_argument_group("Input data related")
+    group.add_argument(
+        "--data_path_and_name_and_type",
+        type=str2triple_str,
+        required=True,
+        action="append",
+    )
+    group.add_argument("--key_file", type=str_or_none)
+    group.add_argument("--allow_variable_data_keys", type=str2bool, default=False)
+
+    group = parser.add_argument_group("The model configuration related")
+    group.add_argument(
+        "--asr_train_config",
+        type=str,
+        help="ASR training configuration",
+    )
+    group.add_argument(
+        "--asr_model_file",
+        type=str,
+        help="ASR model parameter file",
+    )
+    group.add_argument(
+        "--lm_train_config",
+        type=str,
+        help="LM training configuration",
+    )
+    group.add_argument(
+        "--lm_file",
+        type=str,
+        help="LM parameter file",
+    )
+    group.add_argument(
+        "--word_lm_train_config",
+        type=str,
+        help="Word LM training configuration",
+    )
+    group.add_argument(
+        "--word_lm_file",
+        type=str,
+        help="Word LM parameter file",
+    )
+    group.add_argument(
+        "--ngram_file",
+        type=str,
+        help="N-gram parameter file",
+    )
+    group.add_argument(
+        "--model_tag",
+        type=str,
+        help="Pretrained model tag. If this option is specified, *_train_config and "
+             "*_file will be overwritten",
+    )
+
+    group = parser.add_argument_group("Beam-search related")
+    group.add_argument(
+        "--batch_size",
+        type=int,
+        default=1,
+        help="The batch size for inference",
+    )
+    group.add_argument("--nbest", type=int, default=5, help="Output N-best hypotheses")
+    group.add_argument("--beam_size", type=int, default=20, help="Beam size")
+    group.add_argument("--penalty", type=float, default=0.0, help="Insertion penalty")
+    group.add_argument(
+        "--maxlenratio",
+        type=float,
+        default=0.0,
+        help="Input length ratio to obtain max output length. "
+             "If maxlenratio=0.0 (default), it uses an end-detect "
+             "function "
+             "to automatically find maximum hypothesis lengths. "
+             "If maxlenratio<0.0, its absolute value is interpreted "
+             "as a constant max output length",
+    )
+    group.add_argument(
+        "--minlenratio",
+        type=float,
+        default=0.0,
+        help="Input length ratio to obtain min output length",
+    )
+    group.add_argument(
+        "--ctc_weight",
+        type=float,
+        default=0.5,
+        help="CTC weight in joint decoding",
+    )
+    group.add_argument("--lm_weight", type=float, default=1.0, help="RNNLM weight")
+    group.add_argument("--ngram_weight", type=float, default=0.9, help="ngram weight")
+    group.add_argument("--streaming", type=str2bool, default=False)
+
+    group = parser.add_argument_group("Text converter related")
+    group.add_argument(
+        "--token_type",
+        type=str_or_none,
+        default=None,
+        choices=["char", "bpe", None],
+        help="The token type for ASR model. "
+             "If not given, it is taken from the training args",
+    )
+    group.add_argument(
+        "--bpemodel",
+        type=str_or_none,
+        default=None,
+        help="The model path of sentencepiece. "
+             "If not given, it is taken from the training args",
+    )
+    group.add_argument("--token_num_relax", type=int, default=1, help="")
+    group.add_argument("--decoding_ind", type=int, default=0, help="")
+    group.add_argument("--decoding_mode", type=str, default="model1", help="")
+    group.add_argument(
+        "--ctc_weight2",
+        type=float,
+        default=0.0,
+        help="CTC weight in joint decoding",
+    )
+    return parser
+
+
+def main(cmd=None):
+    print(get_commandline_args(), file=sys.stderr)
+    parser = get_parser()
+    parser.add_argument(
+        "--mode",
+        type=str,
+        default="asr",
+        help="The decoding mode",
+    )
+    args = parser.parse_args(cmd)
+    kwargs = vars(args)
+    kwargs.pop("config", None)
+
+    # set logging messages
+    logging.basicConfig(
+        level=args.log_level,
+        format="%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s",
+    )
+    logging.info("Decoding args: {}".format(kwargs))
+
+    # gpu setting
+    if args.ngpu > 0:
+        jobid = int(args.output_dir.split(".")[-1])
+        gpuid = args.gpuid_list.split(",")[(jobid - 1) // args.njob]
+        os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
+        os.environ["CUDA_VISIBLE_DEVICES"] = gpuid
+
+    if args.mode == "asr":
+        from funasr.bin.asr_inference import inference
+        inference(**kwargs)
+    elif args.mode == "uniasr":
+        from funasr.bin.asr_inference_uniasr import inference
+        inference(**kwargs)
+    elif args.mode == "paraformer":
+        from funasr.bin.asr_inference_paraformer import inference
+        inference(**kwargs)
+    else:
+        logging.error("Unknown decoding mode: {}".format(args.mode))
+
+
+if __name__ == "__main__":
+    main()
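Reviewer note on the GPU selection in `main()` above: the job id is parsed from the suffix after the last "." in `--output_dir`, and consecutive groups of `--njob` jobs share one entry of `--gpuid_list`. A minimal sketch of that arithmetic is below. The function name `pick_gpu` and the example paths are hypothetical; the sketch assumes job ids are 1-based, as the `(jobid - 1)` term in the patch suggests.

```python
def pick_gpu(output_dir: str, gpuid_list: str, njob: int) -> str:
    """Return the GPU id string for the job encoded in output_dir's suffix."""
    # "exp/decode/output.3" -> jobid 3; a non-numeric suffix raises ValueError.
    jobid = int(output_dir.split(".")[-1])
    # Jobs 1..njob map to the first GPU, njob+1..2*njob to the second, and so on.
    return gpuid_list.split(",")[(jobid - 1) // njob]


# With two jobs per GPU, jobs 1-2 land on GPU "0" and jobs 3-4 on GPU "1".
assignments = [pick_gpu(f"exp/decode/output.{j}", "0,1", 2) for j in range(1, 5)]
```

Two consequences follow from this scheme: an `--output_dir` that does not end in `.<integer>` fails at the `int(...)` conversion whenever `--ngpu` is greater than 0, and a job id larger than `njob * len(gpuid_list)` raises `IndexError`.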
diff --git a/funasr/bin/asr_inference_paraformer.py b/funasr/bin/asr_inference_paraformer.py
new file mode 100755
index 000000000..ed75010d1
--- /dev/null
+++ b/funasr/bin/asr_inference_paraformer.py
@@ -0,0 +1,528 @@
+#!/usr/bin/env python3
+import argparse
+import logging
+import sys
+import time
+from pathlib import Path
+from typing import Optional
+from typing import Sequence
+from typing import Tuple
+from typing import Union
+
+import numpy as np
+import torch
+from typeguard import check_argument_types
+
+from funasr.fileio.datadir_writer import DatadirWriter
+from funasr.modules.beam_search.beam_search import BeamSearchPara as BeamSearch
+from funasr.modules.beam_search.beam_search import Hypothesis
+from funasr.modules.scorers.ctc import CTCPrefixScorer
+from funasr.modules.scorers.length_bonus import LengthBonus
+from funasr.modules.subsampling import TooShortUttError
+from funasr.tasks.asr import ASRTaskParaformer as ASRTask
+from funasr.tasks.lm import LMTask
+from funasr.text.build_tokenizer import build_tokenizer
+from funasr.text.token_id_converter import TokenIDConverter
+from funasr.torch_utils.device_funcs import to_device
+from funasr.torch_utils.set_all_random_seed import set_all_random_seed
+from funasr.utils import config_argparse
+from funasr.utils.cli_utils import get_commandline_args
+from funasr.utils.types import str2bool
+from funasr.utils.types import str2triple_str
+from funasr.utils.types import str_or_none
+
+
+class Speech2Text:
+    """Speech2Text class
+
+    Examples:
+            >>> import soundfile
+            >>> speech2text = Speech2Text("asr_config.yml", "asr.pth")
+            >>> audio, rate = soundfile.read("speech.wav")
+            >>> speech2text(audio)
+            [(text, token, token_int, hypothesis object), ...]
+
+    """
+
+    def __init__(
+            self,
+            asr_train_config: Union[Path, str] = None,
+            asr_model_file: Union[Path, str] = None,
+            lm_train_config: Union[Path, str] = None,
+            lm_file: Union[Path, str] = None,
+            token_type: str = None,
+            bpemodel: str = None,
+            device: str = "cpu",
+            maxlenratio: float = 0.0,
+            minlenratio: float = 0.0,
+            dtype: str = "float32",
+            beam_size: int = 20,
+            ctc_weight: float = 0.5,
+            lm_weight: float = 1.0,
+            ngram_weight: float = 0.9,
+            penalty: float = 0.0,
+            nbest: int = 1,
+            **kwargs,
+    ):
+        assert check_argument_types()
+
+        # 1. Build ASR model
+        scorers = {}
+        asr_model, asr_train_args = ASRTask.build_model_from_file(
+            asr_train_config, asr_model_file, device
+        )
+        logging.info("asr_model: {}".format(asr_model))
+        logging.info("asr_train_args: {}".format(asr_train_args))
+        asr_model.to(dtype=getattr(torch, dtype)).eval()
+
+        ctc = CTCPrefixScorer(ctc=asr_model.ctc, eos=asr_model.eos)
+        token_list = asr_model.token_list
+        scorers.update(
+            ctc=ctc,
+            length_bonus=LengthBonus(len(token_list)),
+        )
+
+        # 2. Build Language model
+        if lm_train_config is not None:
+            lm, lm_train_args = LMTask.build_model_from_file(
+                lm_train_config, lm_file, device
+            )
+            scorers["lm"] = lm.lm
+
+        # 3. Build ngram model
+        # ngram is not supported now
+        ngram = None
+        scorers["ngram"] = ngram
+
+        # 4. Build BeamSearch object
+        # transducer is not supported now
+        beam_search_transducer = None
+
+        weights = dict(
+            decoder=1.0 - ctc_weight,
+            ctc=ctc_weight,
+            lm=lm_weight,
+            ngram=ngram_weight,
+            length_bonus=penalty,
+        )
+        beam_search = BeamSearch(
+            beam_size=beam_size,
+            weights=weights,
+            scorers=scorers,
+            sos=asr_model.sos,
+            eos=asr_model.eos,
+            vocab_size=len(token_list),
+            token_list=token_list,
+            pre_beam_score_key=None if ctc_weight == 1.0 else "full",
+        )
+
+        beam_search.to(device=device, dtype=getattr(torch, dtype)).eval()
+        for scorer in scorers.values():
+            if isinstance(scorer, torch.nn.Module):
+                scorer.to(device=device, dtype=getattr(torch, dtype)).eval()
+        logging.info(f"Beam_search: {beam_search}")
+        logging.info(f"Decoding device={device}, dtype={dtype}")
+
+        # 5. [Optional] Build Text converter: e.g. bpe-sym -> Text
+        if token_type is None:
+            token_type = asr_train_args.token_type
+        if bpemodel is None:
+            bpemodel = asr_train_args.bpemodel
+
+        if token_type is None:
+            tokenizer = None
+        elif token_type == "bpe":
+            if bpemodel is not None:
+                tokenizer = build_tokenizer(token_type=token_type, bpemodel=bpemodel)
+            else:
+                tokenizer = None
+        else:
+            tokenizer = build_tokenizer(token_type=token_type)
+        converter = TokenIDConverter(token_list=token_list)
+        logging.info(f"Text tokenizer: {tokenizer}")
+
+        self.asr_model = asr_model
+        self.asr_train_args = asr_train_args
+        self.converter = converter
+        self.tokenizer = tokenizer
+        self.beam_search = beam_search
+        self.beam_search_transducer = beam_search_transducer
+        self.maxlenratio = maxlenratio
+        self.minlenratio = minlenratio
+        self.device = device
+        self.dtype = dtype
+        self.nbest = nbest
+
+    @torch.no_grad()
+    def __call__(
+            self, speech: Union[torch.Tensor, np.ndarray]
+    ):
+        """Inference
+
+        Args:
+                speech: Input speech data
+        Returns:
+                text, token, token_int, hyp
+
+        """
+        assert check_argument_types()
+
+        # Input as audio signal
+        if isinstance(speech, np.ndarray):
+            speech = torch.tensor(speech)
+
+        # data: (Nsamples,) -> (1, Nsamples)
+        speech = speech.unsqueeze(0).to(getattr(torch, self.dtype))
+        lfr_factor = max(1, (speech.size()[-1]//80)-1)
+        # lengths: (1,)
+        lengths = speech.new_full([1], dtype=torch.long, fill_value=speech.size(1))
+        batch = {"speech": speech, "speech_lengths": lengths}
+
+        # a. To device
+        batch = to_device(batch, device=self.device)
+
+        # b. Forward Encoder
+        enc, enc_len = self.asr_model.encode(**batch)
+        if isinstance(enc, tuple):
+            enc = enc[0]
+        assert len(enc) == 1, len(enc)
+
+        predictor_outs = self.asr_model.calc_predictor(enc, enc_len)
+        pre_acoustic_embeds, pre_token_length = predictor_outs[0], predictor_outs[1]
+        pre_token_length = torch.tensor([pre_acoustic_embeds.size(1)], device=pre_acoustic_embeds.device)
+        decoder_outs = self.asr_model.cal_decoder_with_predictor(enc, enc_len, pre_acoustic_embeds, pre_token_length)
+        decoder_out, ys_pad_lens = decoder_outs[0], decoder_outs[1]
+
+        nbest_hyps = self.beam_search(
+            x=enc[0], am_scores=decoder_out[0], maxlenratio=self.maxlenratio, minlenratio=self.minlenratio
+        )
+
+        nbest_hyps = nbest_hyps[: self.nbest]
+        results = []
+        for hyp in nbest_hyps:
+            assert isinstance(hyp, (Hypothesis)), type(hyp)
+
+            # remove sos/eos and get results
+            last_pos = -1
+            if isinstance(hyp.yseq, list):
+                token_int = hyp.yseq[1:last_pos]
+            else:
+                token_int = hyp.yseq[1:last_pos].tolist()
+
+            # remove blank symbol id, which is assumed to be 0
+            token_int = list(filter(lambda x: x != 0, token_int))
+
+            # Change integer-ids to tokens
+            token = self.converter.ids2tokens(token_int)
+
+            if self.tokenizer is not None:
+                text = self.tokenizer.tokens2text(token)
+            else:
+                text = None
+
+            results.append((text, token, token_int, hyp, speech.size(1), lfr_factor))
+
+        # assert check_return_type(results)
+        return results
+
+
+def inference(
+        output_dir: str,
+        maxlenratio: float,
+        minlenratio: float,
+        batch_size: int,
+        dtype: str,
+        beam_size: int,
+        ngpu: int,
+        seed: int,
+        ctc_weight: float,
+        lm_weight: float,
+        ngram_weight: float,
+        penalty: float,
+        nbest: int,
+        num_workers: int,
+        log_level: Union[int, str],
+        data_path_and_name_and_type: Sequence[Tuple[str, str, str]],
+        key_file: Optional[str],
+        asr_train_config: Optional[str],
+        asr_model_file: Optional[str],
+        lm_train_config: Optional[str],
+        lm_file: Optional[str],
+        word_lm_train_config: Optional[str],
+        token_type: Optional[str],
+        bpemodel: Optional[str],
+        allow_variable_data_keys: bool,
+        **kwargs,
+):
+    assert check_argument_types()
+    if batch_size > 1:
+        raise NotImplementedError("batch decoding is not implemented")
+    if word_lm_train_config is not None:
+        raise NotImplementedError("Word LM is not implemented")
+    if ngpu > 1:
+        raise NotImplementedError("only single GPU decoding is supported")
+
+    logging.basicConfig(
+        level=log_level,
+        format="%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s",
+    )
+
+    if ngpu >= 1:
+        device = "cuda"
+    else:
+        device = "cpu"
+
+    # 1. Set random-seed
+    set_all_random_seed(seed)
+
+    # 2. Build speech2text
+    speech2text_kwargs = dict(
+        asr_train_config=asr_train_config,
+        asr_model_file=asr_model_file,
+        lm_train_config=lm_train_config,
+        lm_file=lm_file,
+        token_type=token_type,
+        bpemodel=bpemodel,
+        device=device,
+        maxlenratio=maxlenratio,
+        minlenratio=minlenratio,
+        dtype=dtype,
+        beam_size=beam_size,
+        ctc_weight=ctc_weight,
+        lm_weight=lm_weight,
+        ngram_weight=ngram_weight,
+        penalty=penalty,
+        nbest=nbest,
+    )
+    speech2text = Speech2Text(**speech2text_kwargs)
+
+    # 3. Build data-iterator
+    loader = ASRTask.build_streaming_iterator(
+        data_path_and_name_and_type,
+        dtype=dtype,
+        batch_size=batch_size,
+        key_file=key_file,
+        num_workers=num_workers,
+        preprocess_fn=ASRTask.build_preprocess_fn(speech2text.asr_train_args, False),
+        collate_fn=ASRTask.build_collate_fn(speech2text.asr_train_args, False),
+        allow_variable_data_keys=allow_variable_data_keys,
+        inference=True,
+    )
+
+    forward_time_total = 0.0
+    length_total = 0.0
+    # 4. Start for-loop
+    # FIXME(kamo): The output format should be discussed
+    with DatadirWriter(output_dir) as writer:
+        for keys, batch in loader:
+            assert isinstance(batch, dict), type(batch)
+            assert all(isinstance(s, str) for s in keys), keys
+            _bs = len(next(iter(batch.values())))
+            assert len(keys) == _bs, f"{len(keys)} != {_bs}"
+            batch = {k: v[0] for k, v in batch.items() if not k.endswith("_lengths")}
+
+            logging.info("decoding, utt_id: {}".format(keys))
+            # N-best list of (text, token, token_int, hyp_object)
+
+            try:
+                time_beg = time.time()
+                results = speech2text(**batch)
+                time_end = time.time()
+                forward_time = time_end - time_beg
+                lfr_factor = results[0][-1]
+                length = results[0][-2]
+                results = [results[0][:-2]]
+                forward_time_total += forward_time
+                length_total += length
+                logging.info(
+                    "decoding, feature length: {}, forward_time: {:.4f}, rtf: {:.4f}".
+                        format(length, forward_time, 100 * forward_time / (length*lfr_factor)))
+            except TooShortUttError as e:
+                logging.warning(f"Utterance {keys} {e}")
+                hyp = Hypothesis(score=0.0, scores={}, states={}, yseq=[])
+                results = [[" ", ["<space>"], [2], hyp]] * nbest
+
+            # Only supporting batch_size==1
+            key = keys[0]
+            for n, (text, token, token_int, hyp) in zip(range(1, nbest + 1), results):
+                # Create a directory: outdir/{n}best_recog
+                ibest_writer = writer[f"{n}best_recog"]
+
+                # Write the result to each file
+                ibest_writer["token"][key] = " ".join(token)
+                ibest_writer["token_int"][key] = " ".join(map(str, token_int))
+                ibest_writer["score"][key] = str(hyp.score)
+
+                if text is not None:
+                    ibest_writer["text"][key] = text
+
+                logging.info("decoding, predictions: {}".format(text))
+
+    logging.info("decoding, feature length total: {}, forward_time total: {:.4f}, rtf avg: {:.4f}".
+                 format(length_total, forward_time_total, 100 * forward_time_total / (length_total*lfr_factor)))
+
+
+def get_parser():
+    parser = config_argparse.ArgumentParser(
+        description="ASR Decoding",
+        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
+    )
+
+    # Note(kamo): Use '_' instead of '-' as separator.
+    # '-' is confusing if written in yaml.
+    parser.add_argument(
+        "--log_level",
+        type=lambda x: x.upper(),
+        default="INFO",
+        choices=("CRITICAL", "ERROR", "WARNING", "INFO", "DEBUG", "NOTSET"),
+        help="The verbose level of logging",
+    )
+
+    parser.add_argument("--output_dir", type=str, required=True)
+    parser.add_argument(
+        "--ngpu",
+        type=int,
+        default=0,
+        help="The number of gpus. 0 indicates CPU mode",
+    )
+    parser.add_argument("--seed", type=int, default=0, help="Random seed")
+    parser.add_argument(
+        "--dtype",
+        default="float32",
+        choices=["float16", "float32", "float64"],
+        help="Data type",
+    )
+    parser.add_argument(
+        "--num_workers",
+        type=int,
+        default=1,
+        help="The number of workers used for DataLoader",
+    )
+
+    group = parser.add_argument_group("Input data related")
+    group.add_argument(
+        "--data_path_and_name_and_type",
+        type=str2triple_str,
+        required=True,
+        action="append",
+    )
+    group.add_argument("--key_file", type=str_or_none)
+    group.add_argument("--allow_variable_data_keys", type=str2bool, default=False)
+
+    group = parser.add_argument_group("The model configuration related")
+    group.add_argument(
+        "--asr_train_config",
+        type=str,
+        help="ASR training configuration",
+    )
+    group.add_argument(
+        "--asr_model_file",
+        type=str,
+        help="ASR model parameter file",
+    )
+    group.add_argument(
+        "--lm_train_config",
+        type=str,
+        help="LM training configuration",
+    )
+    group.add_argument(
+        "--lm_file",
+        type=str,
+        help="LM parameter file",
+    )
+    group.add_argument(
+        "--word_lm_train_config",
+        type=str,
+        help="Word LM training configuration",
+    )
+    group.add_argument(
+        "--word_lm_file",
+        type=str,
+        help="Word LM parameter file",
+    )
+    group.add_argument(
+        "--ngram_file",
+        type=str,
+        help="N-gram parameter file",
+    )
+    group.add_argument(
+        "--model_tag",
+        type=str,
+        help="Pretrained model tag. If this option is specified, *_train_config and "
+             "*_file will be overwritten",
+    )
+
+    group = parser.add_argument_group("Beam-search related")
+    group.add_argument(
+        "--batch_size",
+        type=int,
+        default=1,
+        help="The batch size for inference",
+    )
+    group.add_argument("--nbest", type=int, default=1, help="Output N-best hypotheses")
+    group.add_argument("--beam_size", type=int, default=20, help="Beam size")
+    group.add_argument("--penalty", type=float, default=0.0, help="Insertion penalty")
+    group.add_argument(
+        "--maxlenratio",
+        type=float,
+        default=0.0,
+        help="Input length ratio to obtain max output length. "
+             "If maxlenratio=0.0 (default), it uses an end-detect "
+             "function "
+             "to automatically find maximum hypothesis lengths. "
+             "If maxlenratio<0.0, its absolute value is interpreted "
+             "as a constant max output length",
+    )
+    group.add_argument(
+        "--minlenratio",
+        type=float,
+        default=0.0,
+        help="Input length ratio to obtain min output length",
+    )
+    group.add_argument(
+        "--ctc_weight",
+        type=float,
+        default=0.5,
+        help="CTC weight in joint decoding",
+    )
+    group.add_argument("--lm_weight", type=float, default=1.0, help="RNNLM weight")
+    group.add_argument("--ngram_weight", type=float, default=0.9, help="ngram weight")
+    group.add_argument("--streaming", type=str2bool, default=False)
+
+    group.add_argument(
+        "--frontend_conf",
+        default=None,
+        help="",
+    )
+
+    group = parser.add_argument_group("Text converter related")
+    group.add_argument(
+        "--token_type",
+        type=str_or_none,
+        default=None,
+        choices=["char", "bpe", None],
+        help="The token type for ASR model. "
+             "If not given, it is taken from the training args",
+    )
+    group.add_argument(
+        "--bpemodel",
+        type=str_or_none,
+        default=None,
+        help="The model path of sentencepiece. "
+             "If not given, it is taken from the training args",
+    )
+
+    return parser
+
+
+def main(cmd=None):
+    print(get_commandline_args(), file=sys.stderr)
+    parser = get_parser()
+    args = parser.parse_args(cmd)
+    kwargs = vars(args)
+    kwargs.pop("config", None)
+    inference(**kwargs)
+
+
+if __name__ == "__main__":
+    main()
diff --git a/funasr/bin/asr_inference_uniasr.py b/funasr/bin/asr_inference_uniasr.py
new file mode 100755
index 000000000..796c5b303
--- /dev/null
+++ b/funasr/bin/asr_inference_uniasr.py
@@ -0,0 +1,543 @@
+#!/usr/bin/env python3
+import argparse
+import logging
+import sys
+from pathlib import Path
+from typing import List
+from typing import Optional
+from typing import Sequence
+from typing import Tuple
+from typing import Union
+
+import numpy as np
+import torch
+from typeguard import check_argument_types
+from typeguard import check_return_type
+
+from funasr.fileio.datadir_writer import DatadirWriter
+from funasr.modules.beam_search.beam_search import BeamSearchScama as BeamSearch
+from funasr.modules.beam_search.beam_search import Hypothesis
+from funasr.modules.scorers.ctc import CTCPrefixScorer
+from funasr.modules.scorers.length_bonus import LengthBonus
+from funasr.modules.subsampling import TooShortUttError
+from funasr.tasks.asr import ASRTaskUniASR as ASRTask
+from funasr.tasks.lm import LMTask
+from funasr.text.build_tokenizer import build_tokenizer
+from funasr.text.token_id_converter import TokenIDConverter
+from funasr.torch_utils.device_funcs import to_device
+from funasr.torch_utils.set_all_random_seed import set_all_random_seed
+from funasr.utils import config_argparse
+from funasr.utils.cli_utils import get_commandline_args
+from funasr.utils.types import str2bool
+from funasr.utils.types import str2triple_str
+from funasr.utils.types import str_or_none
+
+
+class Speech2Text:
+    """Speech2Text class
+
+    Examples:
+        >>> import soundfile
+        >>> speech2text = Speech2Text("asr_config.yml", "asr.pth")
+        >>> audio, rate = soundfile.read("speech.wav")
+        >>> speech2text(audio)
+        [(text, token, token_int, hypothesis object), ...]
+
+    """
+
+    def __init__(
+            self,
+            asr_train_config: Union[Path, str] = None,
+            asr_model_file: Union[Path, str] = None,
+            lm_train_config: Union[Path, str] = None,
+            lm_file: Union[Path, str] = None,
+            token_type: str = None,
+            bpemodel: str = None,
+            device: str = "cpu",
+            maxlenratio: float = 0.0,
+            minlenratio: float = 0.0,
+            dtype: str = "float32",
+            beam_size: int = 20,
+            ctc_weight: float = 0.5,
+            lm_weight: float = 1.0,
+            ngram_weight: float = 0.9,
+            penalty: float = 0.0,
+            nbest: int = 1,
+            token_num_relax: int = 1,
+            decoding_ind: int = 0,
+            decoding_mode: str = "model1",
+            **kwargs,
+    ):
+        assert check_argument_types()
+
+        # 1. Build ASR model
+        scorers = {}
+        asr_model, asr_train_args = ASRTask.build_model_from_file(
+            asr_train_config, asr_model_file, device
+        )
+        asr_model.to(dtype=getattr(torch, dtype)).eval()
+        if decoding_mode == "model1":
+            decoder = asr_model.decoder
+        else:
+            decoder = asr_model.decoder2
+
+        ctc = CTCPrefixScorer(ctc=asr_model.ctc, eos=asr_model.eos)
+        token_list = asr_model.token_list
+        scorers.update(
+            decoder=decoder,
+            ctc=ctc,
+            length_bonus=LengthBonus(len(token_list)),
+        )
+
+        # 2. Build Language model
+        if lm_train_config is not None:
+            lm, lm_train_args = LMTask.build_model_from_file(
+                lm_train_config, lm_file, device
+            )
+            scorers["lm"] = lm.lm
+
+        # 3. Build ngram model
+        # ngram is not supported now
+        ngram = None
+        scorers["ngram"] = ngram
+
+        # 4. Build BeamSearch object
+        # transducer is not supported now
+        beam_search_transducer = None
+
+        weights = dict(
+            decoder=1.0 - ctc_weight,
+            ctc=ctc_weight,
+            lm=lm_weight,
+            ngram=ngram_weight,
+            length_bonus=penalty,
+        )
+        beam_search = BeamSearch(
+            beam_size=beam_size,
+            weights=weights,
+            scorers=scorers,
+            sos=asr_model.sos,
+            eos=asr_model.eos,
+            vocab_size=len(token_list),
+            token_list=token_list,
+            pre_beam_score_key=None if ctc_weight == 1.0 else "full",
+        )
+
+        beam_search.to(device=device, dtype=getattr(torch, dtype)).eval()
+        for scorer in scorers.values():
+            if isinstance(scorer, torch.nn.Module):
+                scorer.to(device=device, dtype=getattr(torch, dtype)).eval()
+        logging.info(f"Beam_search: {beam_search}")
+        logging.info(f"Decoding device={device}, dtype={dtype}")
+
+        # 5. [Optional] Build Text converter: e.g. bpe-sym -> Text
+        if token_type is None:
+            token_type = asr_train_args.token_type
+        if bpemodel is None:
+            bpemodel = asr_train_args.bpemodel
+
+        if token_type is None:
+            tokenizer = None
+        elif token_type == "bpe":
+            if bpemodel is not None:
+                tokenizer = build_tokenizer(token_type=token_type, bpemodel=bpemodel)
+            else:
+                tokenizer = None
+        else:
+            tokenizer = build_tokenizer(token_type=token_type)
+        converter = TokenIDConverter(token_list=token_list)
+        logging.info(f"Text tokenizer: {tokenizer}")
+
+        self.asr_model = asr_model
+        self.asr_train_args = asr_train_args
+        self.converter = converter
+        self.tokenizer = tokenizer
+        self.beam_search = beam_search
+        self.beam_search_transducer = beam_search_transducer
+        self.maxlenratio = maxlenratio
+        self.minlenratio = minlenratio
+        self.device = device
+        self.dtype = dtype
+        self.nbest = nbest
+        self.token_num_relax = token_num_relax
+        self.decoding_ind = decoding_ind
+        self.decoding_mode = decoding_mode
+
+    @torch.no_grad()
+    def __call__(
+            self, speech: Union[torch.Tensor, np.ndarray]
+    ) -> List[
+        Tuple[
+            Optional[str],
+            List[str],
+            List[int],
+            Hypothesis,
+        ]
+    ]:
+        """Inference
+
+        Args:
+            speech: Input speech data
+        Returns:
+            text, token, token_int, hyp
+
+        """
+        assert check_argument_types()
+
+        # Input as audio signal
+        if isinstance(speech, np.ndarray):
+            speech = torch.tensor(speech)
+
+        # speech: (Nsamples,) -> (1, Nsamples)
+        speech = speech.unsqueeze(0).to(getattr(torch, self.dtype))
+        # lengths: (1,)
+        lengths = speech.new_full([1], dtype=torch.long, fill_value=speech.size(1))
+        batch = {"speech": speech, "speech_lengths": lengths}
+
+        # a. To device
+        batch = to_device(batch, device=self.device)
+        # b. Forward Encoder
+        speech_raw = speech.clone().to(self.device)
+        enc, enc_len = self.asr_model.encode(**batch, ind=self.decoding_ind)
+        if isinstance(enc, tuple):
+            enc = enc[0]
+        assert len(enc) == 1, len(enc)
+        if self.decoding_mode == "model1":
+            predictor_outs = self.asr_model.calc_predictor_mask(enc, enc_len)
+        else:
+            enc, enc_len = self.asr_model.encode2(enc, enc_len, speech_raw, lengths, ind=self.decoding_ind)
+            predictor_outs = self.asr_model.calc_predictor_mask2(enc, enc_len)
+
+        scama_mask = predictor_outs[4]
+        pre_token_length = predictor_outs[1]
+        pre_acoustic_embeds = predictor_outs[0]
+        maxlen = pre_token_length.sum().item() + self.token_num_relax
+        minlen = max(0, pre_token_length.sum().item() - self.token_num_relax)
+        # c. Pass the encoder result to the beam search
+        nbest_hyps = self.beam_search(
+            x=enc[0], scama_mask=scama_mask, pre_acoustic_embeds=pre_acoustic_embeds, maxlenratio=self.maxlenratio,
+            minlenratio=self.minlenratio, maxlen=int(maxlen), minlen=int(minlen),
+        )
+
+        nbest_hyps = nbest_hyps[: self.nbest]
+
+        results = []
+        for hyp in nbest_hyps:
+            assert isinstance(hyp, (Hypothesis)), type(hyp)
+
+            # remove sos/eos and get results
+            last_pos = -1
+            if isinstance(hyp.yseq, list):
+                token_int = hyp.yseq[1:last_pos]
+            else:
+                token_int = hyp.yseq[1:last_pos].tolist()
+
+            # remove blank symbol id, which is assumed to be 0
+            token_int = list(filter(lambda x: x != 0, token_int))
+
+            # Change integer-ids to tokens
+            token = self.converter.ids2tokens(token_int)
+
+            if self.tokenizer is not None:
+                text = self.tokenizer.tokens2text(token)
+            else:
+                text = None
+            results.append((text, token, token_int, hyp))
+
+        assert check_return_type(results)
+        return results
+
+
+def inference(
+        output_dir: str,
+        maxlenratio: float,
+        minlenratio: float,
+        batch_size: int,
+        dtype: str,
+        beam_size: int,
+        ngpu: int,
+        seed: int,
+        ctc_weight: float,
+        lm_weight: float,
+        ngram_weight: float,
+        penalty: float,
+        nbest: int,
+        num_workers: int,
+        log_level: Union[int, str],
+        data_path_and_name_and_type: Sequence[Tuple[str, str, str]],
+        key_file: Optional[str],
+        asr_train_config: Optional[str],
+        asr_model_file: Optional[str],
+        lm_train_config: Optional[str],
+        lm_file: Optional[str],
+        word_lm_train_config: Optional[str],
+        ngram_file: Optional[str],
+        token_type: Optional[str],
+        bpemodel: Optional[str],
+        allow_variable_data_keys: bool,
+        streaming: bool,
+        token_num_relax: int = 1,
+        decoding_ind: int = 0,
+        decoding_mode: str = "model1",
+        **kwargs,
+):
+    assert check_argument_types()
+    if batch_size > 1:
+        raise NotImplementedError("batch decoding is not implemented")
+    if word_lm_train_config is not None:
+        raise NotImplementedError("Word LM is not implemented")
+    if ngpu > 1:
+        raise NotImplementedError("only single GPU decoding is supported")
+
+    logging.basicConfig(
+        level=log_level,
+        format="%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s",
+    )
+
+    if ngpu >= 1:
+        device = "cuda"
+    else:
+        device = "cpu"
+
+    # 1. Set random-seed
+    set_all_random_seed(seed)
+
+    # 2. Build speech2text
+    speech2text_kwargs = dict(
+        asr_train_config=asr_train_config,
+        asr_model_file=asr_model_file,
+        lm_train_config=lm_train_config,
+        lm_file=lm_file,
+        ngram_file=ngram_file,
+        token_type=token_type,
+        bpemodel=bpemodel,
+        device=device,
+        maxlenratio=maxlenratio,
+        minlenratio=minlenratio,
+        dtype=dtype,
+        beam_size=beam_size,
+        ctc_weight=ctc_weight,
+        lm_weight=lm_weight,
+        ngram_weight=ngram_weight,
+        penalty=penalty,
+        nbest=nbest,
+        streaming=streaming,
+        token_num_relax=token_num_relax,
+        decoding_ind=decoding_ind,
+        decoding_mode=decoding_mode,
+    )
+    speech2text = Speech2Text(**speech2text_kwargs)
+
+    # 3. Build data-iterator
+    loader = ASRTask.build_streaming_iterator(
+        data_path_and_name_and_type,
+        dtype=dtype,
+        batch_size=batch_size,
+        key_file=key_file,
+        num_workers=num_workers,
+        preprocess_fn=ASRTask.build_preprocess_fn(speech2text.asr_train_args, False),
+        collate_fn=ASRTask.build_collate_fn(speech2text.asr_train_args, False),
+        allow_variable_data_keys=allow_variable_data_keys,
+        inference=True,
+    )
+
+    # 4. Start for-loop
+    # FIXME(kamo): The output format should be discussed
+    with DatadirWriter(output_dir) as writer:
+        for keys, batch in loader:
+            assert isinstance(batch, dict), type(batch)
+            assert all(isinstance(s, str) for s in keys), keys
+            _bs = len(next(iter(batch.values())))
+            assert len(keys) == _bs, f"{len(keys)} != {_bs}"
+            batch = {k: v[0] for k, v in batch.items() if not k.endswith("_lengths")}
+
+            # N-best list of (text, token, token_int, hyp_object)
+            try:
+                results = speech2text(**batch)
+            except TooShortUttError as e:
+                logging.warning(f"Utterance {keys} {e}")
+                hyp = Hypothesis(score=0.0, scores={}, states={}, yseq=[])
+                results = [[" ", ["<space>"], [2], hyp]] * nbest
+
+            # Only supporting batch_size==1
+            key = keys[0]
+            logging.info(f"Utterance: {key}")
+            for n, (text, token, token_int, hyp) in zip(range(1, nbest + 1), results):
+                # Create a directory: outdir/{n}best_recog
+                ibest_writer = writer[f"{n}best_recog"]
+
+                # Write the result to each file
+                ibest_writer["token"][key] = " ".join(token)
+                ibest_writer["token_int"][key] = " ".join(map(str, token_int))
+                ibest_writer["score"][key] = str(hyp.score)
+
+                if text is not None:
+                    ibest_writer["text"][key] = text
+
+
+def get_parser():
+    parser = config_argparse.ArgumentParser(
+        description="ASR Decoding",
+        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
+    )
+
+    # Note(kamo): Use '_' instead of '-' as separator.
+    # '-' is confusing if written in yaml.
+    parser.add_argument(
+        "--log_level",
+        type=lambda x: x.upper(),
+        default="INFO",
+        choices=("CRITICAL", "ERROR", "WARNING", "INFO", "DEBUG", "NOTSET"),
+        help="The verbose level of logging",
+    )
+
+    parser.add_argument("--output_dir", type=str, required=True)
+    parser.add_argument(
+        "--ngpu",
+        type=int,
+        default=0,
+        help="The number of gpus. 0 indicates CPU mode",
+    )
+    parser.add_argument("--seed", type=int, default=0, help="Random seed")
+    parser.add_argument(
+        "--dtype",
+        default="float32",
+        choices=["float16", "float32", "float64"],
+        help="Data type",
+    )
+    parser.add_argument(
+        "--num_workers",
+        type=int,
+        default=1,
+        help="The number of workers used for DataLoader",
+    )
+
+    group = parser.add_argument_group("Input data related")
+    group.add_argument(
+        "--data_path_and_name_and_type",
+        type=str2triple_str,
+        required=True,
+        action="append",
+    )
+    group.add_argument("--key_file", type=str_or_none)
+    group.add_argument("--allow_variable_data_keys", type=str2bool, default=False)
+
+    group = parser.add_argument_group("The model configuration related")
+    group.add_argument(
+        "--asr_train_config",
+        type=str,
+        help="ASR training configuration",
+    )
+    group.add_argument(
+        "--asr_model_file",
+        type=str,
+        help="ASR model parameter file",
+    )
+    group.add_argument(
+        "--lm_train_config",
+        type=str,
+        help="LM training configuration",
+    )
+    group.add_argument(
+        "--lm_file",
+        type=str,
+        help="LM parameter file",
+    )
+    group.add_argument(
+        "--word_lm_train_config",
+        type=str,
+        help="Word LM training configuration",
+    )
+    group.add_argument(
+        "--word_lm_file",
+        type=str,
+        help="Word LM parameter file",
+    )
+    group.add_argument(
+        "--ngram_file",
+        type=str,
+        help="N-gram parameter file",
+    )
+    group.add_argument(
+        "--model_tag",
+        type=str,
+        help="Pretrained model tag. If this option is specified, *_train_config and "
+             "*_file will be overwritten",
+    )
+
+    group = parser.add_argument_group("Beam-search related")
+    group.add_argument(
+        "--batch_size",
+        type=int,
+        default=1,
+        help="The batch size for inference",
+    )
+    group.add_argument("--nbest", type=int, default=1, help="Output N-best hypotheses")
+    group.add_argument("--beam_size", type=int, default=20, help="Beam size")
+    group.add_argument("--penalty", type=float, default=0.0, help="Insertion penalty")
+    group.add_argument(
+        "--maxlenratio",
+        type=float,
+        default=0.0,
+        help="Input length ratio to obtain max output length. "
+             "If maxlenratio=0.0 (default), it uses an end-detect "
+             "function "
+             "to automatically find maximum hypothesis lengths. "
+             "If maxlenratio<0.0, its absolute value is interpreted "
+             "as a constant max output length",
+    )
+    group.add_argument(
+        "--minlenratio",
+        type=float,
+        default=0.0,
+        help="Input length ratio to obtain min output length",
+    )
+    group.add_argument(
+        "--ctc_weight",
+        type=float,
+        default=0.5,
+        help="CTC weight in joint decoding",
+    )
+    group.add_argument("--lm_weight", type=float, default=1.0, help="RNNLM weight")
+    group.add_argument("--ngram_weight", type=float, default=0.9, help="ngram weight")
+    group.add_argument("--streaming", type=str2bool, default=False)
+
+    group = parser.add_argument_group("Text converter related")
+    group.add_argument(
+        "--token_type",
+        type=str_or_none,
+        default=None,
+        choices=["char", "bpe", None],
+        help="The token type for ASR model. "
+             "If not given, it is taken from the training args",
+    )
+    group.add_argument(
+        "--bpemodel",
+        type=str_or_none,
+        default=None,
+        help="The model path of sentencepiece. "
+             "If not given, it is taken from the training args",
+    )
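+    group.add_argument("--token_num_relax", type=int, default=1, help="Tolerance around the predicted token count for max/min output length")
+    group.add_argument("--decoding_ind", type=int, default=0, help="Index passed to the encoder as `ind`")
+    group.add_argument("--decoding_mode", type=str, default="model1", help="'model1' decodes with decoder; any other value uses decoder2")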
+    group.add_argument(
+        "--ctc_weight2",
+        type=float,
+        default=0.0,
+        help="CTC weight in joint decoding",
+    )
+    return parser
+
+
+def main(cmd=None):
+    print(get_commandline_args(), file=sys.stderr)
+    parser = get_parser()
+    args = parser.parse_args(cmd)
+    kwargs = vars(args)
+    kwargs.pop("config", None)
+    inference(**kwargs)
+
+
+if __name__ == "__main__":
+    main()
diff --git a/funasr/bin/asr_train.py b/funasr/bin/asr_train.py
new file mode 100755
index 000000000..bba50daf0
--- /dev/null
+++ b/funasr/bin/asr_train.py
@@ -0,0 +1,46 @@
+#!/usr/bin/env python3
+
+import os
+
+from funasr.tasks.asr import ASRTask
+
+
+# for ASR Training
+def parse_args():
+    parser = ASRTask.get_parser()
+    parser.add_argument(
+        "--gpu_id",
+        type=int,
+        default=0,
+        help="local gpu id.",
+    )
+    args = parser.parse_args()
+    return args
+
+
+def main(args=None, cmd=None):
+    # for ASR Training
+    ASRTask.main(args=args, cmd=cmd)
+
+
+if __name__ == '__main__':
+    args = parse_args()
+
+    # setup local gpu_id
+    os.environ['CUDA_VISIBLE_DEVICES'] = str(args.gpu_id)
+
+    # DDP settings
+    if args.ngpu > 1:
+        args.distributed = True
+    else:
+        args.distributed = False
+    assert args.num_worker_count == 1
+
+    # re-compute batch size: when dataset type is small
+    if args.dataset_type == "small":
+        if args.batch_size is not None:
+            args.batch_size = args.batch_size * args.ngpu
+        if args.batch_bins is not None:
+            args.batch_bins = args.batch_bins * args.ngpu
+
+    main(args=args)
diff --git a/funasr/bin/asr_train_paraformer.py b/funasr/bin/asr_train_paraformer.py
new file mode 100755
index 000000000..76943d5b7
--- /dev/null
+++ b/funasr/bin/asr_train_paraformer.py
@@ -0,0 +1,46 @@
+#!/usr/bin/env python3
+
+import os
+
+from funasr.tasks.asr import ASRTaskParaformer as ASRTask
+
+
+# for ASR Training
+def parse_args():
+    parser = ASRTask.get_parser()
+    parser.add_argument(
+        "--gpu_id",
+        type=int,
+        default=0,
+        help="local gpu id.",
+    )
+    args = parser.parse_args()
+    return args
+
+
+def main(args=None, cmd=None):
+    # for ASR Training
+    ASRTask.main(args=args, cmd=cmd)
+
+
+if __name__ == '__main__':
+    args = parse_args()
+
+    # setup local gpu_id
+    os.environ['CUDA_VISIBLE_DEVICES'] = str(args.gpu_id)
+
+    # DDP settings
+    if args.ngpu > 1:
+        args.distributed = True
+    else:
+        args.distributed = False
+    assert args.num_worker_count == 1
+
+    # re-compute batch size: when dataset type is small
+    if args.dataset_type == "small":
+        if args.batch_size is not None:
+            args.batch_size = args.batch_size * args.ngpu
+        if args.batch_bins is not None:
+            args.batch_bins = args.batch_bins * args.ngpu
+
+    main(args=args)
diff --git a/funasr/bin/asr_train_uniasr.py b/funasr/bin/asr_train_uniasr.py
new file mode 100755
index 000000000..a40b5032c
--- /dev/null
+++ b/funasr/bin/asr_train_uniasr.py
@@ -0,0 +1,46 @@
+#!/usr/bin/env python3
+
+import os
+
+from funasr.tasks.asr import ASRTaskUniASR
+
+
+# for ASR Training
+def parse_args():
+    parser = ASRTaskUniASR.get_parser()
+    parser.add_argument(
+        "--gpu_id",
+        type=int,
+        default=0,
+        help="local gpu id.",
+    )
+    args = parser.parse_args()
+    return args
+
+
+def main(args=None, cmd=None):
+    # for ASR Training
+    ASRTaskUniASR.main(args=args, cmd=cmd)
+
+
+if __name__ == '__main__':
+    args = parse_args()
+
+    # setup local gpu_id
+    os.environ['CUDA_VISIBLE_DEVICES'] = str(args.gpu_id)
+
+    # DDP settings
+    if args.ngpu > 1:
+        args.distributed = True
+    else:
+        args.distributed = False
+    assert args.num_worker_count == 1
+
+    # re-compute batch size: when dataset type is small
+    if args.dataset_type == "small":
+        if args.batch_size is not None:
+            args.batch_size = args.batch_size * args.ngpu
+        if args.batch_bins is not None:
+            args.batch_bins = args.batch_bins * args.ngpu
+
+    main(args=args)
diff --git a/funasr/bin/lm_calc_perplexity.py b/funasr/bin/lm_calc_perplexity.py
new file mode 100755
index 000000000..27a8a71fc
--- /dev/null
+++ b/funasr/bin/lm_calc_perplexity.py
@@ -0,0 +1,210 @@
+#!/usr/bin/env python3
+import argparse
+import logging
+from pathlib import Path
+import sys
+from typing import Optional
+from typing import Sequence
+from typing import Tuple
+from typing import Union
+
+import numpy as np
+import torch
+from torch.nn.parallel import data_parallel
+from typeguard import check_argument_types
+
+from funasr.utils.cli_utils import get_commandline_args
+from funasr.fileio.datadir_writer import DatadirWriter
+from funasr.tasks.lm import LMTask
+from funasr.torch_utils.device_funcs import to_device
+from funasr.torch_utils.forward_adaptor import ForwardAdaptor
+from funasr.torch_utils.set_all_random_seed import set_all_random_seed
+from funasr.utils import config_argparse
+from funasr.utils.types import float_or_none
+from funasr.utils.types import str2bool
+from funasr.utils.types import str2triple_str
+from funasr.utils.types import str_or_none
+
+
+def calc_perplexity(
+    output_dir: str,
+    batch_size: int,
+    dtype: str,
+    ngpu: int,
+    seed: int,
+    num_workers: int,
+    log_level: Union[int, str],
+    data_path_and_name_and_type: Sequence[Tuple[str, str, str]],
+    key_file: Optional[str],
+    train_config: Optional[str],
+    model_file: Optional[str],
+    log_base: Optional[float],
+    allow_variable_data_keys: bool,
+):
+    assert check_argument_types()
+    logging.basicConfig(
+        level=log_level,
+        format="%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s",
+    )
+
+    if ngpu >= 1:
+        device = "cuda"
+    else:
+        device = "cpu"
+
+    # 1. Set random-seed
+    set_all_random_seed(seed)
+
+    # 2. Build LM
+    model, train_args = LMTask.build_model_from_file(train_config, model_file, device)
+    # Wrap the model so that model.nll() can run data-parallel
+    wrapped_model = ForwardAdaptor(model, "nll")
+    wrapped_model.to(dtype=getattr(torch, dtype)).eval()
+    logging.info(f"Model:\n{model}")
+
+    # 3. Build data-iterator
+    loader = LMTask.build_streaming_iterator(
+        data_path_and_name_and_type,
+        dtype=dtype,
+        batch_size=batch_size,
+        key_file=key_file,
+        num_workers=num_workers,
+        preprocess_fn=LMTask.build_preprocess_fn(train_args, False),
+        collate_fn=LMTask.build_collate_fn(train_args, False),
+        allow_variable_data_keys=allow_variable_data_keys,
+        inference=True,
+    )
+
+    # 4. Start for-loop
+    with DatadirWriter(output_dir) as writer:
+        total_nll = 0.0
+        total_ntokens = 0
+        for keys, batch in loader:
+            assert isinstance(batch, dict), type(batch)
+            assert all(isinstance(s, str) for s in keys), keys
+            _bs = len(next(iter(batch.values())))
+            assert len(keys) == _bs, f"{len(keys)} != {_bs}"
+
+            with torch.no_grad():
+                batch = to_device(batch, device)
+                if ngpu <= 1:
+                    # NOTE(kamo): data_parallel also should work with ngpu=1,
+                    # but for debuggability it's better to keep this block.
+                    nll, lengths = wrapped_model(**batch)
+                else:
+                    nll, lengths = data_parallel(
+                        wrapped_model, (), range(ngpu), module_kwargs=batch
+                    )
+
+            assert _bs == len(nll) == len(lengths), (_bs, len(nll), len(lengths))
+            # nll: (B, L) -> (B,)
+            nll = nll.detach().cpu().numpy().sum(1)
+            # lengths: (B,)
+            lengths = lengths.detach().cpu().numpy()
+            total_nll += nll.sum()
+            total_ntokens += lengths.sum()
+
+            for key, _nll, ntoken in zip(keys, nll, lengths):
+                if log_base is None:
+                    utt_ppl = np.exp(_nll / ntoken)
+                else:
+                    utt_ppl = log_base ** (_nll / ntoken / np.log(log_base))
+
+                # Write the PPL of each utterance for debugging or analysis
+                writer["utt2ppl"][key] = str(utt_ppl)
+                writer["utt2ntokens"][key] = str(ntoken)
+
+        if log_base is None:
+            ppl = np.exp(total_nll / total_ntokens)
+        else:
+            ppl = log_base ** (total_nll / total_ntokens / np.log(log_base))
+
+        with (Path(output_dir) / "ppl").open("w", encoding="utf-8") as f:
+            f.write(f"{ppl}\n")
+        with (Path(output_dir) / "base").open("w", encoding="utf-8") as f:
+            if log_base is None:
+                _log_base = np.e
+            else:
+                _log_base = log_base
+            f.write(f"{_log_base}\n")
+        logging.info(f"PPL={ppl}")
+
+
+def get_parser():
+    parser = config_argparse.ArgumentParser(
+        description="Calc perplexity",
+        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
+    )
+
+    # Note(kamo): Use '_' instead of '-' as separator.
+    # '-' is confusing if written in yaml.
+    parser.add_argument(
+        "--log_level",
+        type=lambda x: x.upper(),
+        default="INFO",
+        choices=("CRITICAL", "ERROR", "WARNING", "INFO", "DEBUG", "NOTSET"),
+        help="The verbose level of logging",
+    )
+
+    parser.add_argument("--output_dir", type=str, required=True)
+    parser.add_argument(
+        "--ngpu",
+        type=int,
+        default=0,
+        help="The number of gpus. 0 indicates CPU mode",
+    )
+    parser.add_argument("--seed", type=int, default=0, help="Random seed")
+    parser.add_argument(
+        "--dtype",
+        default="float32",
+        choices=["float16", "float32", "float64"],
+        help="Data type",
+    )
+    parser.add_argument(
+        "--num_workers",
+        type=int,
+        default=1,
+        help="The number of workers used for DataLoader",
+    )
+    parser.add_argument(
+        "--batch_size",
+        type=int,
+        default=1,
+        help="The batch size for inference",
+    )
+    parser.add_argument(
+        "--log_base",
+        type=float_or_none,
+        default=None,
+        help="The base of the logarithm for perplexity. "
+        "If None, Napier's constant (e) is used.",
+    )
+
+    group = parser.add_argument_group("Input data related")
+    group.add_argument(
+        "--data_path_and_name_and_type",
+        type=str2triple_str,
+        required=True,
+        action="append",
+    )
+    group.add_argument("--key_file", type=str_or_none)
+    group.add_argument("--allow_variable_data_keys", type=str2bool, default=False)
+
+    group = parser.add_argument_group("The model configuration related")
+    group.add_argument("--train_config", type=str)
+    group.add_argument("--model_file", type=str)
+
+    return parser
+
+
+def main(cmd=None):
+    print(get_commandline_args(), file=sys.stderr)
+    parser = get_parser()
+    args = parser.parse_args(cmd)
+    kwargs = vars(args)
+    kwargs.pop("config", None)
+    calc_perplexity(**kwargs)
+
+
+if __name__ == "__main__":
+    main()
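Review note on `lm_calc_perplexity.py`: the two branches on `log_base` can be hard to compare at a glance, so here is a minimal standalone sketch of the same arithmetic using only the standard library. The function name `perplexity` and the sample numbers are hypothetical and not part of the patch; the NLL is assumed to be in nats, as the `np.log(log_base)` conversion in the script suggests.

```python
import math


def perplexity(total_nll, total_ntokens, log_base=None):
    """Corpus-level perplexity from a summed negative log-likelihood in nats."""
    avg_nll = total_nll / total_ntokens
    if log_base is None:
        return math.exp(avg_nll)
    # Convert nats to base-`log_base` units, then exponentiate in that base.
    return log_base ** (avg_nll / math.log(log_base))


# Hypothetical per-utterance (summed NLL in nats, token count) pairs.
utts = [(6.0, 3), (10.0, 4)]
total_nll = sum(nll for nll, _ in utts)
total_ntokens = sum(n for _, n in utts)

ppl_e = perplexity(total_nll, total_ntokens)
ppl_2 = perplexity(total_nll, total_ntokens, log_base=2.0)
```

Under these assumptions both calls agree up to floating-point error, because `b ** (x / ln b)` equals `exp(x)` for any base `b`. The corpus-level value is computed from the totals, so it is generally not the mean of the per-utterance perplexities.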
diff --git a/funasr/bin/lm_train.py b/funasr/bin/lm_train.py
new file mode 100755
index 000000000..faa7a4596
--- /dev/null
+++ b/funasr/bin/lm_train.py
@@ -0,0 +1,22 @@
+#!/usr/bin/env python3
+from funasr.tasks.lm import LMTask
+
+
+def get_parser():
+    parser = LMTask.get_parser()
+    return parser
+
+
+def main(cmd=None):
+    """LM training.
+
+    Example:
+
+        % python lm_train.py --print_config --optim adadelta
+        % python lm_train.py --config conf/train_asr.yaml
+    """
+    LMTask.main(cmd=cmd)
+
+
+if __name__ == "__main__":
+    main()
diff --git a/funasr/bin/modelscope_infer.py b/funasr/bin/modelscope_infer.py
new file mode 100755
index 000000000..440c88163
--- /dev/null
+++ b/funasr/bin/modelscope_infer.py
@@ -0,0 +1,82 @@
+#!/usr/bin/env python3
+import argparse
+import logging
+import os
+
+from modelscope.pipelines import pipeline
+from modelscope.utils.constant import Tasks
+
+if __name__ == '__main__':
+    parser = argparse.ArgumentParser(
+        description="decoding configs",
+        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
+    )
+    parser.add_argument("--model_name",
+                        type=str,
+                        default="speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch",
+                        help="model name in modelscope")
+    parser.add_argument("--local_model_path",
+                        type=str,
+                        default=None,
+                        help="local model path, usually for fine-tuning")
+    parser.add_argument("--wav_list",
+                        type=str,
+                        help="input wav list")
+    parser.add_argument("--output_file",
+                        type=str,
+                        help="saving decoding results")
+    parser.add_argument(
+        "--njob",
+        type=int,
+        default=1,
+        help="The number of jobs for each gpu",
+    )
+    parser.add_argument(
+        "--gpuid_list",
+        type=str,
+        default="",
+        help="The visible gpus",
+    )
+    parser.add_argument(
+        "--ngpu",
+        type=int,
+        default=0,
+        help="The number of gpus. 0 indicates CPU mode",
+    )
+    args = parser.parse_args()
+
+    # set logging messages
+    logging.basicConfig(
+        level=logging.INFO,
+        format="%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s",
+    )
+    logging.info("Decoding args: {}".format(args))
+
+    # gpu setting
+    if args.ngpu > 0:
+        jobid = int(args.output_file.split(".")[-1])
+        gpuid = args.gpuid_list.split(",")[(jobid - 1) // args.njob]
+        os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
+        os.environ["CUDA_VISIBLE_DEVICES"] = gpuid
+
+    if args.local_model_path is None:
+        inference_pipeline = pipeline(
+            task=Tasks.auto_speech_recognition,
+            model="damo/{}".format(args.model_name))
+    else:
+        inference_pipeline = pipeline(
+            task=Tasks.auto_speech_recognition,
+            model=args.local_model_path)
+
+
+    with open(args.wav_list, 'r', encoding="utf-8") as f_wav:
+        wav_lines = f_wav.readlines()
+
+    with open(args.output_file, "w", encoding="utf-8") as f_out:
+        for line in wav_lines:
+            wav_id, wav_path = line.strip().split()
+            logging.info("decoding, utt_id: ['{}']".format(wav_id))
+            rec_result = inference_pipeline(audio_in=wav_path)
+            text = rec_result["text"]
+            f_out.write(wav_id + " " + text + "\n")
+            logging.info("best hypo: {} \n".format(text))
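Review note on `modelscope_infer.py`: the GPU selection depends on the numeric suffix of `--output_file` and on `--njob`, which is easy to misread. The sketch below restates that mapping in isolation; the helper names `jobid_from_output` and `gpu_for_job` and the file name `decode/text.3` are hypothetical, chosen only for illustration.

```python
def jobid_from_output(output_file):
    """Read the 1-based job index from the suffix, e.g. 'decode/text.3' gives 3."""
    return int(output_file.split(".")[-1])


def gpu_for_job(jobid, gpuid_list, njob):
    """Pick the GPU id for a 1-based job index, with `njob` jobs sharing each GPU."""
    gpus = gpuid_list.split(",")
    return gpus[(jobid - 1) // njob]


# Two visible GPUs, two jobs per GPU: jobs 1-2 share GPU "0", jobs 3-4 share GPU "1".
assignment = {j: gpu_for_job(j, "0,1", 2) for j in range(1, 5)}
job = jobid_from_output("decode/text.3")
```

Under this scheme an output file without a numeric suffix raises `ValueError`, and a job index beyond `njob * len(gpus)` raises `IndexError`, so callers need to keep the file naming and `--gpuid_list` consistent.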
diff --git a/funasr/datasets/__init__.py b/funasr/datasets/__init__.py
new file mode 100644
index 000000000..e69de29bb
diff --git a/funasr/datasets/collate_fn.py b/funasr/datasets/collate_fn.py
new file mode 100644
index 000000000..d52032f9e
--- /dev/null
+++ b/funasr/datasets/collate_fn.py
@@ -0,0 +1,83 @@
+from typing import Collection
+from typing import Dict
+from typing import List
+from typing import Tuple
+from typing import Union
+
+import numpy as np
+import torch
+from typeguard import check_argument_types
+from typeguard import check_return_type
+
+from funasr.modules.nets_utils import pad_list
+
+
+class CommonCollateFn:
+    """Functor class of common_collate_fn()"""
+
+    def __init__(
+            self,
+            float_pad_value: Union[float, int] = 0.0,
+            int_pad_value: int = -32768,
+            not_sequence: Collection[str] = (),
+            max_sample_size=None
+    ):
+        assert check_argument_types()
+        self.float_pad_value = float_pad_value
+        self.int_pad_value = int_pad_value
+        self.not_sequence = set(not_sequence)
+        self.max_sample_size = max_sample_size
+
+    def __repr__(self):
+        return (
+            f"{self.__class__}(float_pad_value={self.float_pad_value}, "
+            f"int_pad_value={self.int_pad_value})"
+        )
+
+    def __call__(
+            self, data: Collection[Tuple[str, Dict[str, np.ndarray]]]
+    ) -> Tuple[List[str], Dict[str, torch.Tensor]]:
+        return common_collate_fn(
+            data,
+            float_pad_value=self.float_pad_value,
+            int_pad_value=self.int_pad_value,
+            not_sequence=self.not_sequence,
+        )
+
+
+def common_collate_fn(
+        data: Collection[Tuple[str, Dict[str, np.ndarray]]],
+        float_pad_value: Union[float, int] = 0.0,
+        int_pad_value: int = -32768,
+        not_sequence: Collection[str] = (),
+) -> Tuple[List[str], Dict[str, torch.Tensor]]:
+    """Concatenate ndarray-list to an array and convert to torch.Tensor.
+    """
+    assert check_argument_types()
+    uttids = [u for u, _ in data]
+    data = [d for _, d in data]
+
+    assert all(set(data[0]) == set(d) for d in data), "dict-keys mismatching"
+    assert all(
+        not k.endswith("_lengths") for k in data[0]
+    ), f"*_lengths is reserved: {list(data[0])}"
+
+    output = {}
+    for key in data[0]:
+        if data[0][key].dtype.kind == "i":
+            pad_value = int_pad_value
+        else:
+            pad_value = float_pad_value
+
+        array_list = [d[key] for d in data]
+        tensor_list = [torch.from_numpy(a) for a in array_list]
+        tensor = pad_list(tensor_list, pad_value)
+        output[key] = tensor
+
+        if key not in not_sequence:
+            lens = torch.tensor([d[key].shape[0] for d in data], dtype=torch.long)
+            output[key + "_lengths"] = lens
+
+    output = (uttids, output)
+    assert check_return_type(output)
+    return output
\ No newline at end of file
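Review note on `collate_fn.py`: the contract of `common_collate_fn` (pad every key to the longest sequence in the batch and emit a matching `*_lengths` entry) can be illustrated without torch. The following is a simplified pure-Python sketch, not the patch's implementation; `pad_and_collate` is a hypothetical name, and plain lists stand in for the ndarrays and tensors used in the real code.

```python
def pad_and_collate(batch, pad_value=0.0):
    """Pad each key to the longest sequence in the batch.

    batch: list of (utt_id, {name: list of numbers}) pairs.
    Returns (utt_ids, {name: padded rows, name + "_lengths": original lengths}).
    """
    utt_ids = [uid for uid, _ in batch]
    items = [d for _, d in batch]
    # Every sample must provide the same set of keys.
    assert all(set(items[0]) == set(d) for d in items), "dict-keys mismatching"

    output = {}
    for key in items[0]:
        seqs = [d[key] for d in items]
        max_len = max(len(s) for s in seqs)
        output[key] = [list(s) + [pad_value] * (max_len - len(s)) for s in seqs]
        output[key + "_lengths"] = [len(s) for s in seqs]
    return utt_ids, output


ids, out = pad_and_collate(
    [("utt_a", {"speech": [0.1, 0.2, 0.3]}), ("utt_b", {"speech": [0.5]})]
)
```

The lengths entry is what lets downstream code mask out the padded positions, which is why keys ending in `_lengths` are reserved in the real function.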
diff --git a/funasr/datasets/dataset.py b/funasr/datasets/dataset.py
new file mode 100644
index 000000000..2af93d0bc
--- /dev/null
+++ b/funasr/datasets/dataset.py
@@ -0,0 +1,444 @@
+# Copyright ESPnet (https://github.com/espnet/espnet). All Rights Reserved.
+#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)
+
+from abc import ABC
+from abc import abstractmethod
+import collections
+import copy
+import functools
+import logging
+import numbers
+import re
+from typing import Any
+from typing import Callable
+from typing import Collection
+from typing import Dict
+from typing import Mapping
+from typing import Tuple
+from typing import Union
+
+import h5py
+import humanfriendly
+import kaldiio
+import numpy as np
+import torch
+from torch.utils.data.dataset import Dataset
+from typeguard import check_argument_types
+from typeguard import check_return_type
+
+from funasr.fileio.npy_scp import NpyScpReader
+from funasr.fileio.rand_gen_dataset import FloatRandomGenerateDataset
+from funasr.fileio.rand_gen_dataset import IntRandomGenerateDataset
+from funasr.fileio.read_text import load_num_sequence_text
+from funasr.fileio.read_text import read_2column_text
+from funasr.fileio.sound_scp import SoundScpReader
+from funasr.utils.sized_dict import SizedDict
+
+
+class AdapterForSoundScpReader(collections.abc.Mapping):
+    def __init__(self, loader, dtype=None):
+        assert check_argument_types()
+        self.loader = loader
+        self.dtype = dtype
+        self.rate = None
+
+    def keys(self):
+        return self.loader.keys()
+
+    def __len__(self):
+        return len(self.loader)
+
+    def __iter__(self):
+        return iter(self.loader)
+
+    def __getitem__(self, key: str) -> np.ndarray:
+        retval = self.loader[key]
+
+        if isinstance(retval, tuple):
+            assert len(retval) == 2, len(retval)
+            if isinstance(retval[0], int) and isinstance(retval[1], np.ndarray):
+                # sound scp case
+                rate, array = retval
+            elif isinstance(retval[1], int) and isinstance(retval[0], np.ndarray):
+                # Extended ark format case
+                array, rate = retval
+            else:
+                raise RuntimeError(
+                    f"Unexpected type: {type(retval[0])}, {type(retval[1])}"
+                )
+
+            if self.rate is not None and self.rate != rate:
+                raise RuntimeError(
+                    f"Sampling rates are mismatched: {self.rate} != {rate}"
+                )
+            self.rate = rate
+            # Multichannel wave file
+            # array: (NSample, Channel) or (Nsample)
+            if self.dtype is not None:
+                array = array.astype(self.dtype)
+
+        else:
+            # Normal ark case
+            assert isinstance(retval, np.ndarray), type(retval)
+            array = retval
+            if self.dtype is not None:
+                array = array.astype(self.dtype)
+
+        assert isinstance(array, np.ndarray), type(array)
+        return array
+
+
+class H5FileWrapper:
+    def __init__(self, path: str):
+        self.path = path
+        self.h5_file = h5py.File(path, "r")
+
+    def __repr__(self) -> str:
+        return str(self.h5_file)
+
+    def __len__(self) -> int:
+        return len(self.h5_file)
+
+    def __iter__(self):
+        return iter(self.h5_file)
+
+    def __getitem__(self, key) -> np.ndarray:
+        value = self.h5_file[key]
+        return value[()]
+
+
+def sound_loader(path, float_dtype=None):
+    # The file is as follows:
+    #   utterance_id_A /some/where/a.wav
+    #   utterance_id_B /some/where/a.flac
+
+    # NOTE(kamo): SoundScpReader doesn't support pipe-fashion
+    # like Kaldi e.g. "cat a.wav |".
+    # NOTE(kamo): The audio signal is normalized to [-1,1] range.
+    loader = SoundScpReader(path, normalize=True, always_2d=False)
+
+    # SoundScpReader.__getitem__() returns Tuple[int, ndarray],
+    # but ndarray is desired, so Adapter class is inserted here
+    return AdapterForSoundScpReader(loader, float_dtype)
+
+
+def kaldi_loader(path, float_dtype=None, max_cache_fd: int = 0):
+    loader = kaldiio.load_scp(path, max_cache_fd=max_cache_fd)
+    return AdapterForSoundScpReader(loader, float_dtype)
+
+
+def rand_int_loader(filepath, loader_type):
+    # e.g. rand_int_3_10
+    try:
+        low, high = map(int, loader_type[len("rand_int_") :].split("_"))
+    except ValueError:
+        raise RuntimeError(f"e.g. rand_int_3_10, but got {loader_type}")
+    return IntRandomGenerateDataset(filepath, low, high)
+
+
+DATA_TYPES = {
+    "sound": dict(
+        func=sound_loader,
+        kwargs=["float_dtype"],
+        help="Audio format types supported by sndfile: wav, flac, etc."
+        "\n\n"
+        "   utterance_id_a a.wav\n"
+        "   utterance_id_b b.wav\n"
+        "   ...",
+    ),
+    "kaldi_ark": dict(
+        func=kaldi_loader,
+        kwargs=["max_cache_fd"],
+        help="Kaldi-ark file type."
+        "\n\n"
+        "   utterance_id_A /some/where/a.ark:123\n"
+        "   utterance_id_B /some/where/a.ark:456\n"
+        "   ...",
+    ),
+    "npy": dict(
+        func=NpyScpReader,
+        kwargs=[],
+        help="Npy file format."
+        "\n\n"
+        "   utterance_id_A /some/where/a.npy\n"
+        "   utterance_id_B /some/where/b.npy\n"
+        "   ...",
+    ),
+    "text_int": dict(
+        func=functools.partial(load_num_sequence_text, loader_type="text_int"),
+        kwargs=[],
+        help="A text file in which is written a sequence of integer numbers "
+        "separated by space."
+        "\n\n"
+        "   utterance_id_A 12 0 1 3\n"
+        "   utterance_id_B 3 3 1\n"
+        "   ...",
+    ),
+    "csv_int": dict(
+        func=functools.partial(load_num_sequence_text, loader_type="csv_int"),
+        kwargs=[],
+        help="A text file in which is written a sequence of integer numbers "
+        "separated by comma."
+        "\n\n"
+        "   utterance_id_A 100,80\n"
+        "   utterance_id_B 143,80\n"
+        "   ...",
+    ),
+    "text_float": dict(
+        func=functools.partial(load_num_sequence_text, loader_type="text_float"),
+        kwargs=[],
+        help="A text file in which is written a sequence of float numbers "
+        "separated by space."
+        "\n\n"
+        "   utterance_id_A 12. 3.1 3.4 4.4\n"
+        "   utterance_id_B 3. 3.12 1.1\n"
+        "   ...",
+    ),
+    "csv_float": dict(
+        func=functools.partial(load_num_sequence_text, loader_type="csv_float"),
+        kwargs=[],
+        help="A text file in which is written a sequence of float numbers "
+        "separated by comma."
+        "\n\n"
+        "   utterance_id_A 12.,3.1,3.4,4.4\n"
+        "   utterance_id_B 3.,3.12,1.1\n"
+        "   ...",
+    ),
+    "text": dict(
+        func=read_2column_text,
+        kwargs=[],
+        help="Return text as is. The text must be converted to ndarray "
+        "by 'preprocess'."
+        "\n\n"
+        "   utterance_id_A hello world\n"
+        "   utterance_id_B foo bar\n"
+        "   ...",
+    ),
+    "hdf5": dict(
+        func=H5FileWrapper,
+        kwargs=[],
+        help="An HDF5 file which contains arrays at the first level or the second level.\n\n"
+        "   >>> f = h5py.File('file.h5')\n"
+        "   >>> array1 = f['utterance_id_A']\n"
+        "   >>> array2 = f['utterance_id_B']\n",
+    ),
+    "rand_float": dict(
+        func=FloatRandomGenerateDataset,
+        kwargs=[],
+        help="Generate random float-ndarray which has the given shapes "
+        "in the file."
+        "\n\n"
+        "   utterance_id_A 3,4\n"
+        "   utterance_id_B 10,4\n"
+        "   ...",
+    ),
+    "rand_int_\\d+_\\d+": dict(
+        func=rand_int_loader,
+        kwargs=["loader_type"],
+        help="e.g. 'rand_int_0_10'. Generate random int-ndarray which has the given "
+        "shapes in the path. "
+        "Give the lower and upper value by the file type. e.g. "
+        "rand_int_0_10 -> Generate integers from 0 to 10."
+        "\n\n"
+        "   utterance_id_A 3,4\n"
+        "   utterance_id_B 10,4\n"
+        "   ...",
+    ),
+}
+
+
+class AbsDataset(Dataset, ABC):
+    @abstractmethod
+    def has_name(self, name) -> bool:
+        raise NotImplementedError
+
+    @abstractmethod
+    def names(self) -> Tuple[str, ...]:
+        raise NotImplementedError
+
+    @abstractmethod
+    def __getitem__(self, uid) -> Tuple[Any, Dict[str, np.ndarray]]:
+        raise NotImplementedError
+
+
+class ESPnetDataset(AbsDataset):
+    """PyTorch Dataset class for ESPnet.
+
+    Examples:
+        >>> dataset = ESPnetDataset([('wav.scp', 'input', 'sound'),
+        ...                          ('token_int', 'output', 'text_int')])
+        >>> uttid, data = dataset['uttid']
+        >>> data
+        {'input': per_utt_array, 'output': per_utt_array}
+    """
+
+    def __init__(
+        self,
+        path_name_type_list: Collection[Tuple[str, str, str]],
+        preprocess: Callable[
+            [str, Dict[str, np.ndarray]], Dict[str, np.ndarray]
+        ] = None,
+        float_dtype: str = "float32",
+        int_dtype: str = "long",
+        max_cache_size: Union[float, int, str] = 0.0,
+        max_cache_fd: int = 0,
+    ):
+        assert check_argument_types()
+        if len(path_name_type_list) == 0:
+            raise ValueError(
+                '1 or more elements are required for "path_name_type_list"'
+            )
+
+        path_name_type_list = copy.deepcopy(path_name_type_list)
+        self.preprocess = preprocess
+
+        self.float_dtype = float_dtype
+        self.int_dtype = int_dtype
+        self.max_cache_fd = max_cache_fd
+
+        self.loader_dict = {}
+        self.debug_info = {}
+        for path, name, _type in path_name_type_list:
+            if name in self.loader_dict:
+                raise RuntimeError(f'"{name}" is duplicated for data-key')
+
+            loader = self._build_loader(path, _type)
+            self.loader_dict[name] = loader
+            self.debug_info[name] = path, _type
+            if len(self.loader_dict[name]) == 0:
+                raise RuntimeError(f"{path} has no samples")
+
+            # TODO(kamo): Should check consistency of each utt-keys?
+
+        if isinstance(max_cache_size, str):
+            max_cache_size = humanfriendly.parse_size(max_cache_size)
+        self.max_cache_size = max_cache_size
+        if max_cache_size > 0:
+            self.cache = SizedDict(shared=True)
+        else:
+            self.cache = None
+
+    def _build_loader(
+        self, path: str, loader_type: str
+    ) -> Mapping[str, Union[np.ndarray, torch.Tensor, str, numbers.Number]]:
+        """Helper function to instantiate Loader.
+
+        Args:
+            path:  The file path
+            loader_type:  loader_type. sound, npy, text_int, text_float, etc
+        """
+        for key, dic in DATA_TYPES.items():
+            # e.g. loader_type="sound"
+            # -> return DATA_TYPES["sound"]["func"](path)
+            if re.fullmatch(key, loader_type):
+                kwargs = {}
+                for key2 in dic["kwargs"]:
+                    if key2 == "loader_type":
+                        kwargs["loader_type"] = loader_type
+                    elif key2 == "float_dtype":
+                        kwargs["float_dtype"] = self.float_dtype
+                    elif key2 == "int_dtype":
+                        kwargs["int_dtype"] = self.int_dtype
+                    elif key2 == "max_cache_fd":
+                        kwargs["max_cache_fd"] = self.max_cache_fd
+                    else:
+                        raise RuntimeError(f"Not implemented keyword argument: {key2}")
+
+                func = dic["func"]
+                try:
+                    return func(path, **kwargs)
+                except Exception:
+                    if hasattr(func, "__name__"):
+                        name = func.__name__
+                    else:
+                        name = str(func)
+                    logging.error(f"An error happened with {name}({path})")
+                    raise
+        else:
+            raise RuntimeError(f"Not supported: loader_type={loader_type}")
+
+    def has_name(self, name) -> bool:
+        return name in self.loader_dict
+
+    def names(self) -> Tuple[str, ...]:
+        return tuple(self.loader_dict)
+
+    def __iter__(self):
+        return iter(next(iter(self.loader_dict.values())))
+
+    def __repr__(self):
+        _mes = self.__class__.__name__
+        _mes += "("
+        for name, (path, _type) in self.debug_info.items():
+            _mes += f'\n  {name}: {{"path": "{path}", "type": "{_type}"}}'
+        _mes += f"\n  preprocess: {self.preprocess})"
+        return _mes
+
+    def __getitem__(self, uid: Union[str, int]) -> Tuple[str, Dict[str, np.ndarray]]:
+        assert check_argument_types()
+
+        # Change integer-id to string-id
+        if isinstance(uid, int):
+            d = next(iter(self.loader_dict.values()))
+            uid = list(d)[uid]
+
+        if self.cache is not None and uid in self.cache:
+            data = self.cache[uid]
+            return uid, data
+
+        data = {}
+        # 1. Load data from each loader
+        for name, loader in self.loader_dict.items():
+            try:
+                value = loader[uid]
+                if isinstance(value, (list, tuple)):
+                    value = np.array(value)
+                if not isinstance(
+                    value, (np.ndarray, torch.Tensor, str, numbers.Number)
+                ):
+                    raise TypeError(
+                        f"Must be ndarray, torch.Tensor, str or Number: {type(value)}"
+                    )
+            except Exception:
+                path, _type = self.debug_info[name]
+                logging.error(
+                    f"Error happened with path={path}, type={_type}, id={uid}"
+                )
+                raise
+
+            # torch.Tensor is converted to ndarray
+            if isinstance(value, torch.Tensor):
+                value = value.numpy()
+            elif isinstance(value, numbers.Number):
+                value = np.array([value])
+            data[name] = value
+
+        # 2. [Option] Apply preprocessing
+        #   e.g. funasr.train.preprocessor:CommonPreprocessor
+        if self.preprocess is not None:
+            data = self.preprocess(uid, data)
+
+        # 3. Force data-precision
+        for name in data:
+            value = data[name]
+            if not isinstance(value, np.ndarray):
+                raise RuntimeError(
+                    f"All values must be converted to np.ndarray object "
+                    f'by preprocessing, but "{name}" is still {type(value)}.'
+                )
+
+            # Cast to desired type
+            if value.dtype.kind == "f":
+                value = value.astype(self.float_dtype)
+            elif value.dtype.kind == "i":
+                value = value.astype(self.int_dtype)
+            else:
+                raise NotImplementedError(f"Not supported dtype: {value.dtype}")
+            data[name] = value
+
+        if self.cache is not None and self.cache.size < self.max_cache_size:
+            self.cache[uid] = data
+
+        retval = uid, data
+        assert check_return_type(retval)
+        return retval
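Review note on `dataset.py`: `_build_loader` treats the keys of `DATA_TYPES` as regular expressions, so the choice of matching function decides whether a misspelled loader type is rejected. The sketch below contrasts prefix matching (`re.match`) with whole-string matching (`re.fullmatch`) on a small hypothetical table; `LOADERS`, `find_loader`, and the lambdas are stand-ins invented for illustration, not the real reader classes.

```python
import re

# Hypothetical loader table; each lambda stands in for a real reader.
LOADERS = {
    "text_int": lambda path: ("text_int", path),
    r"rand_int_\d+_\d+": lambda path: ("rand_int", path),
    "text": lambda path: ("text", path),
}


def find_loader(loader_type, matcher=re.fullmatch):
    """Return the first loader whose key pattern matches `loader_type`."""
    for pattern, func in LOADERS.items():
        if matcher(pattern, loader_type):
            return func
    raise RuntimeError(f"Not supported: loader_type={loader_type}")


# A parameterized type resolves through its regex key.
kind, _ = find_loader("rand_int_0_10")("shape.scp")

# Prefix matching silently accepts a misspelled type via the "text" key.
prefix_kind, _ = find_loader("text_typo", matcher=re.match)("text.scp")
```

With whole-string matching the same misspelled type raises `RuntimeError` instead of falling through to the plain `text` loader, which is the stricter and usually safer behaviour for user-supplied type names.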
diff --git a/funasr/datasets/iterable_dataset.py b/funasr/datasets/iterable_dataset.py
new file mode 100644
index 000000000..319dd7ffe
--- /dev/null
+++ b/funasr/datasets/iterable_dataset.py
@@ -0,0 +1,237 @@
+"""Iterable dataset module."""
+import copy
+from io import StringIO
+from pathlib import Path
+from typing import Callable
+from typing import Collection
+from typing import Dict
+from typing import Iterator
+from typing import Tuple
+from typing import Union
+
+import kaldiio
+import numpy as np
+import soundfile
+import torch
+from torch.utils.data.dataset import IterableDataset
+from typeguard import check_argument_types
+
+from funasr.datasets.dataset import ESPnetDataset
+
+
+def load_kaldi(input):
+    retval = kaldiio.load_mat(input)
+    if isinstance(retval, tuple):
+        assert len(retval) == 2, len(retval)
+        if isinstance(retval[0], int) and isinstance(retval[1], np.ndarray):
+            # sound scp case
+            rate, array = retval
+        elif isinstance(retval[1], int) and isinstance(retval[0], np.ndarray):
+            # Extended ark format case
+            array, rate = retval
+        else:
+            raise RuntimeError(f"Unexpected type: {type(retval[0])}, {type(retval[1])}")
+
+        # Multichannel wave file
+        # array: (NSample, Channel) or (Nsample)
+
+    else:
+        # Normal ark case
+        assert isinstance(retval, np.ndarray), type(retval)
+        array = retval
+    return array
+
+
+DATA_TYPES = {
+    "sound": lambda x: soundfile.read(x)[0],
+    "kaldi_ark": load_kaldi,
+    "npy": np.load,
+    "text_int": lambda x: np.loadtxt(
+        StringIO(x), ndmin=1, dtype=np.int64, delimiter=" "
+    ),
+    "csv_int": lambda x: np.loadtxt(StringIO(x), ndmin=1, dtype=np.int64, delimiter=","),
+    "text_float": lambda x: np.loadtxt(
+        StringIO(x), ndmin=1, dtype=np.float32, delimiter=" "
+    ),
+    "csv_float": lambda x: np.loadtxt(
+        StringIO(x), ndmin=1, dtype=np.float32, delimiter=","
+    ),
+    "text": lambda x: x,
+}
+
+
+class IterableESPnetDataset(IterableDataset):
+    """PyTorch Dataset class for ESPnet.
+
+    Examples:
+        >>> dataset = IterableESPnetDataset([('wav.scp', 'input', 'sound'),
+        ...                                  ('token_int', 'output', 'text_int')],
+        ...                                )
+        >>> for uid, data in dataset:
+        ...     data
+        {'input': per_utt_array, 'output': per_utt_array}
+    """
+
+    def __init__(
+        self,
+        path_name_type_list: Collection[Tuple[str, str, str]],
+        preprocess: Callable[
+            [str, Dict[str, np.ndarray]], Dict[str, np.ndarray]
+        ] = None,
+        float_dtype: str = "float32",
+        int_dtype: str = "long",
+        key_file: str = None,
+    ):
+        assert check_argument_types()
+        if len(path_name_type_list) == 0:
+            raise ValueError(
+                '1 or more elements are required for "path_name_type_list"'
+            )
+
+        path_name_type_list = copy.deepcopy(path_name_type_list)
+        self.preprocess = preprocess
+
+        self.float_dtype = float_dtype
+        self.int_dtype = int_dtype
+        self.key_file = key_file
+
+        self.debug_info = {}
+        non_iterable_list = []
+        self.path_name_type_list = []
+
+        for path, name, _type in path_name_type_list:
+            if name in self.debug_info:
+                raise RuntimeError(f'"{name}" is duplicated for data-key')
+            self.debug_info[name] = path, _type
+            if _type not in DATA_TYPES:
+                non_iterable_list.append((path, name, _type))
+            else:
+                self.path_name_type_list.append((path, name, _type))
+
+        if len(non_iterable_list) != 0:
+            # Some types don't support iterable mode
+            self.non_iterable_dataset = ESPnetDataset(
+                path_name_type_list=non_iterable_list,
+                preprocess=preprocess,
+                float_dtype=float_dtype,
+                int_dtype=int_dtype,
+            )
+        else:
+            self.non_iterable_dataset = None
+
+        if Path(Path(path_name_type_list[0][0]).parent, "utt2category").exists():
+            self.apply_utt2category = True
+        else:
+            self.apply_utt2category = False
+
+    def has_name(self, name) -> bool:
+        return name in self.debug_info
+
+    def names(self) -> Tuple[str, ...]:
+        return tuple(self.debug_info)
+
+    def __repr__(self):
+        _mes = self.__class__.__name__
+        _mes += "("
+        for name, (path, _type) in self.debug_info.items():
+            _mes += f'\n  {name}: {{"path": "{path}", "type": "{_type}"}}'
+        _mes += f"\n  preprocess: {self.preprocess})"
+        return _mes
+
+    def __iter__(self) -> Iterator[Tuple[Union[str, int], Dict[str, np.ndarray]]]:
+        if self.key_file is not None:
+            uid_iter = (
+                line.rstrip().split(maxsplit=1)[0]
+                for line in open(self.key_file, encoding="utf-8")
+            )
+        elif len(self.path_name_type_list) != 0:
+            uid_iter = (
+                line.rstrip().split(maxsplit=1)[0]
+                for line in open(self.path_name_type_list[0][0], encoding="utf-8")
+            )
+        else:
+            uid_iter = iter(self.non_iterable_dataset)
+
+        files = [open(lis[0], encoding="utf-8") for lis in self.path_name_type_list]
+
+        worker_info = torch.utils.data.get_worker_info()
+
+        linenum = 0
+        count = 0
+        for count, uid in enumerate(uid_iter, 1):
+            # If num_workers>=1, split keys
+            if worker_info is not None:
+                if (count - 1) % worker_info.num_workers != worker_info.id:
+                    continue
+
+            # 1. Read a line from each file
+            while True:
+                keys = []
+                values = []
+                for f in files:
+                    linenum += 1
+                    try:
+                        line = next(f)
+                    except StopIteration:
+                        raise RuntimeError(f"{uid} is not found in the files")
+                    sps = line.rstrip().split(maxsplit=1)
+                    if len(sps) != 2:
+                        raise RuntimeError(
+                            f"This line doesn't include a space:"
+                            f" {f}:L{linenum}: {line})"
+                        )
+                    key, value = sps
+                    keys.append(key)
+                    values.append(value)
+
+                for k_idx, k in enumerate(keys):
+                    if k != keys[0]:
+                        raise RuntimeError(
+                            f"Keys are mismatched. Text file (idx={k_idx}) is "
+                            f"not sorted or does not have the same keys at L{linenum}"
+                        )
+
+                # If the key is matched, break the loop
+                if len(keys) == 0 or keys[0] == uid:
+                    break
+
+            # 2. Load the entry from each line and create a dict
+            data = {}
+            # 2.a. Load data streamingly
+            for value, (path, name, _type) in zip(values, self.path_name_type_list):
+                func = DATA_TYPES[_type]
+                # Load entry
+                array = func(value)
+                data[name] = array
+            if self.non_iterable_dataset is not None:
+                # 2.b. Load data from non-iterable dataset
+                _, from_non_iterable = self.non_iterable_dataset[uid]
+                data.update(from_non_iterable)
+
+            # 3. [Optional] Apply preprocessing
+            #   e.g. funasr.train.preprocessor:CommonPreprocessor
+            if self.preprocess is not None:
+                data = self.preprocess(uid, data)
+
+            # 4. Force data-precision
+            for name in data:
+                value = data[name]
+                if not isinstance(value, np.ndarray):
+                    raise RuntimeError(
+                        f"All values must be converted to np.ndarray object "
+                        f'by preprocessing, but "{name}" is still {type(value)}.'
+                    )
+
+                # Cast to desired type
+                if value.dtype.kind == "f":
+                    value = value.astype(self.float_dtype)
+                elif value.dtype.kind == "i":
+                    value = value.astype(self.int_dtype)
+                else:
+                    raise NotImplementedError(f"Not supported dtype: {value.dtype}")
+                data[name] = value
+
+            yield uid, data
+
+        if count == 0:
+            raise RuntimeError("No iteration")
diff --git a/funasr/datasets/large_datasets/__init__.py b/funasr/datasets/large_datasets/__init__.py
new file mode 100644
index 000000000..e69de29bb
diff --git a/funasr/datasets/large_datasets/build_dataloader.py b/funasr/datasets/large_datasets/build_dataloader.py
new file mode 100644
index 000000000..37fbb7cb4
--- /dev/null
+++ b/funasr/datasets/large_datasets/build_dataloader.py
@@ -0,0 +1,41 @@
+import logging
+
+import yaml
+
+from torch.utils.data import DataLoader
+from funasr.datasets.large_datasets.dataset import Dataset
+from funasr.iterators.abs_iter_factory import AbsIterFactory
+
+
+def read_symbol_table(symbol_table_file):
+    if isinstance(symbol_table_file, str):
+        symbol_table = {}
+        with open(symbol_table_file, "r", encoding="utf8") as fin:
+            for i, line in enumerate(fin):
+                char = line.strip()
+                symbol_table[char] = i
+    else:
+        assert isinstance(symbol_table_file, list)
+        symbol_table = {}
+        for i, char in enumerate(symbol_table_file):
+            symbol_table[char] = i
+    return symbol_table
+
+
+class ArkDataLoader(AbsIterFactory):
+    def __init__(self, data_list, dict_file, config_file, mode="train"):
+        symbol_table = read_symbol_table(dict_file)
+        with open(config_file, "r") as fin:
+            configs = yaml.load(fin, Loader=yaml.FullLoader)
+        self.dataset_conf = configs["dataset_conf"]
+        logging.info("dataloader config: {}".format(self.dataset_conf))
+        self.dataset = Dataset(data_list, symbol_table,
+                               self.dataset_conf, mode=mode)
+
+    def build_iter(self, epoch, shuffle=True):
+        self.dataset.set_epoch(epoch)
+        data_loader = DataLoader(self.dataset,
+                                 batch_size=None,
+                                 pin_memory=True,
+                                 num_workers=self.dataset_conf.get("num_workers", 8))
+        return data_loader
diff --git a/funasr/datasets/large_datasets/datapipes/__init__.py b/funasr/datasets/large_datasets/datapipes/__init__.py
new file mode 100644
index 000000000..e69de29bb
diff --git a/funasr/datasets/large_datasets/datapipes/batch.py b/funasr/datasets/large_datasets/datapipes/batch.py
new file mode 100644
index 000000000..9c85d5edc
--- /dev/null
+++ b/funasr/datasets/large_datasets/datapipes/batch.py
@@ -0,0 +1,148 @@
+import random
+
+from itertools import count
+from functools import partial
+from torch.utils.data import IterableDataset
+from funasr.datasets.large_datasets.datapipes.map import MapperIterDataPipe
+
+tiebreaker = count()
+
+
+def _default_len_fn(token):
+    return len(token)  # _token_len_fn already appends the tiebreaker
+
+
+def _token_len_fn(token, len_fn):
+    return len_fn(token), next(tiebreaker), token
+
+
+class MaxTokenBucketizerIterDataPipe(IterableDataset):
+
+    def __init__(
+            self,
+            datapipe,
+            batch_size=8000,
+            len_fn=_default_len_fn,
+            buffer_size=10240,
+            sort_size=500
+    ):
+        assert batch_size > 0, "Batch size is required to be larger than 0!"
+        assert buffer_size >= -1, "Buffer size is required to be no less than -1!"
+        assert sort_size > 0, "Sort size is required to be larger than 0!"
+
+        datapipe = MapperIterDataPipe(datapipe, fn=partial(_token_len_fn, len_fn=len_fn))
+        self.datapipe = datapipe
+        self.batch_size = batch_size
+        self.buffer_size = buffer_size
+        self.sort_size = sort_size
+
+    def set_epoch(self, epoch):
+        self.epoch = epoch
+
+    def __iter__(self):
+        buffer = []
+        batch = []
+        bucket = []
+        max_lengths = 0
+        batch_lengths = 0
+
+        if self.buffer_size == -1:
+            for d in self.datapipe:
+                if d[0] > self.batch_size:
+                    continue
+                buffer.append(d)
+            buffer.sort()
+            for sample in buffer:
+                length, _, token = sample
+                if length > max_lengths:
+                    max_lengths = length
+                batch_lengths = max_lengths * (len(batch) + 1)
+                if batch_lengths > self.batch_size:
+                    bucket.append(batch)
+                    batch = []
+                    max_lengths = length
+                batch.append(token)
+            random.shuffle(bucket)
+            if bucket:
+                for batch_sample in bucket:
+                    yield batch_sample
+            if batch:
+                yield batch
+
+        elif self.buffer_size == 0:
+            for d in self.datapipe:
+                if d[0] > self.batch_size:
+                    continue
+                length, _, token = d
+                if length > self.batch_size:
+                    continue
+                if length > max_lengths:
+                    max_lengths = length
+                batch_lengths = max_lengths * (len(batch) + 1)
+                if batch_lengths > self.batch_size:
+                    yield batch
+                    batch = []
+                    max_lengths = length
+                batch.append(token)
+            if batch:
+                yield batch
+
+        else:
+            for d in self.datapipe:
+                if d[0] > self.batch_size:
+                    continue
+                buffer.append(d)
+                if len(buffer) == self.buffer_size:
+                    random.shuffle(buffer)
+                    for sample in buffer:
+                        bucket.append(sample)
+                        if len(bucket) == self.sort_size:
+                            bucket.sort()
+                            for x in bucket:
+                                length, _, token = x
+                                if length > max_lengths:
+                                    max_lengths = length
+                                batch_lengths = max_lengths * (len(batch) + 1)
+                                if batch_lengths > self.batch_size:
+                                    yield batch
+                                    batch = []
+                                    max_lengths = length
+                                batch.append(token)
+                            bucket = []
+                    buffer = []
+
+            if buffer:
+                random.shuffle(buffer)
+                for sample in buffer:
+                    bucket.append(sample)
+                    if len(bucket) == self.sort_size:
+                        bucket.sort()
+                        for x in bucket:
+                            length, _, token = x
+                            if length > max_lengths:
+                                max_lengths = length
+                            batch_lengths = max_lengths * (len(batch) + 1)
+                            if batch_lengths > self.batch_size:
+                                yield batch
+                                batch = []
+                                max_lengths = length
+                            batch.append(token)
+                        bucket = []
+                buffer = []
+
+            if bucket:
+                bucket.sort()
+                for x in bucket:
+                    length, _, token = x
+                    if length > max_lengths:
+                        max_lengths = length
+                    batch_lengths = max_lengths * (len(batch) + 1)
+                    if batch_lengths > self.batch_size:
+                        yield batch
+                        batch = []
+                        max_lengths = length
+                    batch.append(token)
+                bucket = []
+
+            if batch:
+                yield batch
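
Note on `MaxTokenBucketizerIterDataPipe`: every branch of `__iter__` repeats the same packing rule, which closes a batch once `max_length * (batch_count + 1)` would exceed `batch_size`. The rule can be read in isolation in the minimal sketch below. It is not part of this patch; `max_token_batches` is a hypothetical helper name, and it assumes the input is already a sequence of `(length, item)` pairs rather than the `(length, tiebreaker, token)` triples used above.

```python
def max_token_batches(samples, batch_size):
    """Group (length, item) pairs so the padded size stays within batch_size.

    Hypothetical helper, written only to illustrate the packing rule.
    """
    batch, max_length = [], 0
    for length, item in samples:
        if length > batch_size:
            # A sample that cannot fit on its own is dropped.
            continue
        max_length = max(max_length, length)
        # Padded cost of the batch if this sample were added to it.
        if max_length * (len(batch) + 1) > batch_size:
            yield batch
            batch, max_length = [], length
        batch.append(item)
    if batch:
        yield batch


# Sorting by length first keeps similar lengths together, as the bucket sort does.
samples = sorted([(3, "a"), (5, "b"), (4, "c"), (9, "d"), (2, "e")])
batches = list(max_token_batches(samples, batch_size=10))
```

With `batch_type == "token"` the length is the number of speech frames, so `batch_size` bounds the padded frame count per batch rather than the number of utterances.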
diff --git a/funasr/datasets/large_datasets/datapipes/filter.py b/funasr/datasets/large_datasets/datapipes/filter.py
new file mode 100644
index 000000000..e79934d18
--- /dev/null
+++ b/funasr/datasets/large_datasets/datapipes/filter.py
@@ -0,0 +1,24 @@
+from torch.utils.data import IterableDataset
+
+def default_fn(data):
+    return data
+
+
+class FilterIterDataPipe(IterableDataset):
+
+    def __init__(self,
+                 datapipe,
+                 fn=default_fn):
+        self.datapipe = datapipe
+        self.fn = fn
+
+    def set_epoch(self, epoch):
+        self.epoch = epoch
+
+    def __iter__(self):
+        assert callable(self.fn)
+        for data in self.datapipe:
+            if self.fn(data):
+                yield data
+            else:
+                continue
\ No newline at end of file
diff --git a/funasr/datasets/large_datasets/datapipes/map.py b/funasr/datasets/large_datasets/datapipes/map.py
new file mode 100644
index 000000000..6e0168de0
--- /dev/null
+++ b/funasr/datasets/large_datasets/datapipes/map.py
@@ -0,0 +1,22 @@
+from torch.utils.data import IterableDataset
+
+
+def default_fn(data):
+    return data
+
+
+class MapperIterDataPipe(IterableDataset):
+
+    def __init__(self,
+                 datapipe,
+                 fn=default_fn):
+        self.datapipe = datapipe
+        self.fn = fn
+
+    def set_epoch(self, epoch):
+        self.epoch = epoch
+
+    def __iter__(self):
+        assert callable(self.fn)
+        for data in self.datapipe:
+            yield self.fn(data)
diff --git a/funasr/datasets/large_datasets/dataset.py b/funasr/datasets/large_datasets/dataset.py
new file mode 100644
index 000000000..60c5abd1b
--- /dev/null
+++ b/funasr/datasets/large_datasets/dataset.py
@@ -0,0 +1,175 @@
+import os
+import random
+from functools import partial
+
+import torch
+import torch.distributed as dist
+from kaldiio import ReadHelper
+from torch.utils.data import IterableDataset
+
+from funasr.datasets.large_datasets.datapipes.batch import MaxTokenBucketizerIterDataPipe
+from funasr.datasets.large_datasets.datapipes.filter import FilterIterDataPipe
+from funasr.datasets.large_datasets.datapipes.map import MapperIterDataPipe
+from funasr.datasets.large_datasets.utils.filter import filter
+from funasr.datasets.large_datasets.utils.padding import padding
+from funasr.datasets.large_datasets.utils.tokenize import tokenize
+
+
+def read_lists(list_file):
+    lists = []
+    with open(list_file, 'r', encoding='utf8') as fin:
+        for line in fin:
+            parts = line.strip()
+            lists.append(parts)
+    return lists
+
+
+class AudioDataset(IterableDataset):
+    def __init__(self, scp_lists, data_names, data_types, shuffle=True, mode="train"):
+        self.scp_lists = scp_lists
+        self.data_names = data_names
+        self.data_types = data_types
+        self.shuffle = shuffle
+        self.mode = mode
+        self.epoch = -1
+        self.rank = 0
+        self.world_size = 1
+        self.worker_id = 0
+        self.num_workers = 1
+
+    def set_epoch(self, epoch):
+        self.epoch = epoch
+
+    def get_rank_data_list(self, data_index):
+        assert dist.is_available()
+        if dist.is_initialized():
+            self.rank = dist.get_rank()
+            self.world_size = dist.get_world_size()
+        else:
+            self.rank = 0
+            self.world_size = 1
+
+        if self.mode == "train":
+            if self.shuffle:
+                random.seed(self.epoch)
+                random.shuffle(data_index)
+            return data_index[self.rank::self.world_size]
+
+        return data_index
+
+    def get_worker_data_list(self, rank_data_index):
+        worker_info = torch.utils.data.get_worker_info()
+        if worker_info is None:
+            self.worker_id = 0
+            self.num_workers = 1
+        else:
+            self.worker_id = worker_info.id
+            self.num_workers = worker_info.num_workers
+
+        return rank_data_index[self.worker_id::self.num_workers]
+
+    def close_reader(self, reader_list):
+        for reader in reader_list:
+            reader.close()
+
+    def __iter__(self):
+        data_index = list(range(len(self.scp_lists)))
+        rank_data_index = self.get_rank_data_list(data_index)
+        worker_data_index = self.get_worker_data_list(rank_data_index)
+
+        for index in worker_data_index:
+            data = dict(scp=self.scp_lists[index])
+
+            assert 'scp' in data
+            scp = data['scp']
+            data_file_list = scp.strip().split()
+            data_name_list = self.data_names.split(",")
+            data_type_list = self.data_types.split(",")
+
+            for file in data_file_list:
+                assert os.path.exists(file), "{} does not exist".format(file)
+
+            assert len(data_file_list) == len(data_name_list) == len(data_type_list), \
+                "The number of data files, data_names and data_types must be the same"
+
+            reader_list = []
+            for data_file, data_type in zip(data_file_list, data_type_list):
+                if data_type == "kaldi_ark":
+                    ark_reader = ReadHelper('ark:{}'.format(data_file))
+                    reader_list.append(ark_reader)
+                elif data_type == "text":
+                    text_reader = open(data_file, "r", encoding="utf8")
+                    reader_list.append(text_reader)
+                else:
+                    raise TypeError("Data type {} is not supported".format(data_type))
+
+            for items in zip(*reader_list):
+                sample_dict = {}
+                for item, (data_name, data_type) in zip(items, zip(data_name_list, data_type_list)):
+                    if data_type == "kaldi_ark":
+                        key, mat = item
+                        sample_dict[data_name] = mat
+                        if data_name == "speech":
+                            sample_dict["key"] = key
+                    else:
+                        text = item
+                        sample_dict[data_name] = text.strip().split()[1:]
+                yield sample_dict
+
+            self.close_reader(reader_list)
+
+
+def len_fn_example(data):
+    return len(data)
+
+
+def len_fn_token(data):
+    assert "speech" in data
+    return data["speech"].shape[0]
+
+
+def Dataset(data_list_file,
+            dict,
+            conf,
+            mode="train"):
+    scp_lists = read_lists(data_list_file)
+    shuffle = conf.get('shuffle', True)
+    data_names = conf.get("data_names", "speech,text")
+    data_types = conf.get("data_types", "kaldi_ark,text")
+    dataset = AudioDataset(scp_lists, data_names, data_types, shuffle=shuffle, mode=mode)
+
+    filter_conf = conf.get('filter_conf', {})
+    filter_fn = partial(filter, **filter_conf)
+    dataset = FilterIterDataPipe(dataset, fn=filter_fn)
+
+    vocab = {'vocab': dict}
+    tokenize_fn = partial(tokenize, **vocab)
+    dataset = MapperIterDataPipe(dataset, fn=tokenize_fn)
+
+    if shuffle:
+        buffer_conf = conf.get('shuffle_conf', {})
+        buffer_size = buffer_conf['shuffle_size']
+        sort_size = buffer_conf['sort_size']
+    else:
+        buffer_size = 0
+        sort_size = 1
+
+    batch_conf = conf.get('batch_conf', {})
+    batch_size = batch_conf['batch_size']
+    batch_type = batch_conf['batch_type']
+
+    assert batch_type in ["example", "token"]
+    if batch_type == 'example':
+        len_fn = len_fn_example
+    else:
+        len_fn = len_fn_token
+
+    dataset = MaxTokenBucketizerIterDataPipe(dataset,
+                                             batch_size=batch_size,
+                                             len_fn=len_fn,
+                                             buffer_size=buffer_size,
+                                             sort_size=sort_size)
+
+    dataset = MapperIterDataPipe(dataset, fn=padding)
+
+    return dataset
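
Note on `AudioDataset`: `get_rank_data_list` and `get_worker_data_list` split the list of scp entries with two strided slices, first across distributed ranks and then across DataLoader workers. The sketch below shows that the combined split is disjoint and complete. It is only an illustration; `shard_indices` is a hypothetical helper that is not in this patch, and it leaves out the epoch-seeded shuffle.

```python
def shard_indices(num_items, rank, world_size, worker_id, num_workers):
    """Hypothetical helper: strided split by rank, then by DataLoader worker."""
    indices = list(range(num_items))
    rank_indices = indices[rank::world_size]
    return rank_indices[worker_id::num_workers]


# Two ranks with two workers each, sharing ten scp entries.
shards = [
    shard_indices(10, rank, 2, worker, 2)
    for rank in range(2)
    for worker in range(2)
]
```

In the patch the rank-level split only applies when `mode == "train"`; in any other mode each rank iterates over the full list and only the worker-level split is applied.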
diff --git a/funasr/datasets/large_datasets/utils/__init__.py b/funasr/datasets/large_datasets/utils/__init__.py
new file mode 100644
index 000000000..e69de29bb
diff --git a/funasr/datasets/large_datasets/utils/filter.py b/funasr/datasets/large_datasets/utils/filter.py
new file mode 100644
index 000000000..5dc911f56
--- /dev/null
+++ b/funasr/datasets/large_datasets/utils/filter.py
@@ -0,0 +1,15 @@
+#!/usr/bin/env python
+
+
+def filter(data,
+           min_length=10,
+           max_length=10000,
+           min_token_length=0,
+           max_token_length=200):
+    assert "speech" in data
+    assert "text" in data
+
+    num_frames = data["speech"].shape[0]
+    num_tokens = len(data['text'])
+
+    return min_length < num_frames < max_length and min_token_length < num_tokens < max_token_length
\ No newline at end of file
diff --git a/funasr/datasets/large_datasets/utils/low_frame_rate.py b/funasr/datasets/large_datasets/utils/low_frame_rate.py
new file mode 100644
index 000000000..76eb2da93
--- /dev/null
+++ b/funasr/datasets/large_datasets/utils/low_frame_rate.py
@@ -0,0 +1,30 @@
+import numpy as np
+
+
+def build_LFR_features(data, m, n):
+    """
+    Build low frame rate (LFR) features by stacking and skipping frames.
+    If m = 1 and n = 1, the original features are returned.
+    If m = 1 and n > 1, it works like frame skipping.
+    If m > 1 and n = 1, it works like frame stacking, using right-context frames only.
+    If m > 1 and n > 1, it works like LFR.
+
+    Args:
+        data: input features, a T x D np.ndarray
+        m: number of frames to stack
+        n: number of frames to skip
+    """
+
+    LFR_inputs = []
+    T = data.shape[0]
+    T_lfr = int(np.ceil(T / n))
+    for i in range(T_lfr):
+        if m <= T - i * n:
+            LFR_inputs.append(np.hstack(data[i*n:i*n+m]))
+        else:
+            num_padding = m - (T - i * n)
+            frame = np.hstack(data[i*n:])
+            for _ in range(num_padding):
+                frame = np.hstack((frame, data[-1]))
+            LFR_inputs.append(frame)
+    return np.vstack(LFR_inputs)
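
Note on `build_LFR_features`: the stacking and skipping behaviour is easier to check on a tiny input. The sketch below is an assumed equivalent written for illustration, not the function from this patch; `lfr_sketch` is a hypothetical name. It stacks `m` frames, advances by `n`, and pads the tail by repeating the last frame.

```python
import numpy as np


def lfr_sketch(data, m, n):
    """Stack m frames, advance by n, repeat the last frame to pad the tail."""
    num_frames = data.shape[0]
    rows = []
    for start in range(0, num_frames, n):
        window = data[start:start + m]
        if window.shape[0] < m:
            # Fewer than m frames remain: pad with copies of the last frame.
            pad = np.repeat(data[-1:], m - window.shape[0], axis=0)
            window = np.concatenate([window, pad], axis=0)
        rows.append(window.reshape(-1))
    return np.stack(rows)


feats = np.arange(10, dtype=np.float32).reshape(5, 2)  # T = 5 frames, D = 2
lfr = lfr_sketch(feats, m=3, n=2)  # shape (ceil(5 / 2), 3 * 2) = (3, 6)
```

The output has `ceil(T / n)` rows of dimension `m * D`, so the frame rate drops by a factor of `n` while each row carries `m` frames of context.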
diff --git a/funasr/datasets/large_datasets/utils/padding.py b/funasr/datasets/large_datasets/utils/padding.py
new file mode 100644
index 000000000..2e91e78c1
--- /dev/null
+++ b/funasr/datasets/large_datasets/utils/padding.py
@@ -0,0 +1,35 @@
+import numpy as np
+import torch
+from torch.nn.utils.rnn import pad_sequence
+
+
+def padding(data, float_pad_value=0.0, int_pad_value=-1):
+    assert isinstance(data, list)
+    assert "key" in data[0]
+    assert "speech" in data[0]
+    assert "text" in data[0]
+
+    keys = [x["key"] for x in data]
+
+    batch = {}
+    data_names = data[0].keys()
+    for data_name in data_names:
+        if data_name == "key":
+            continue
+        else:
+            if data[0][data_name].dtype.kind == "i":
+                pad_value = int_pad_value
+                tensor_type = torch.int64
+            else:
+                pad_value = float_pad_value
+                tensor_type = torch.float32
+
+            tensor_list = [torch.tensor(np.copy(d[data_name]), dtype=tensor_type) for d in data]
+            tensor_lengths = torch.tensor([len(d[data_name]) for d in data], dtype=torch.int32)
+            tensor_pad = pad_sequence(tensor_list,
+                                      batch_first=True,
+                                      padding_value=pad_value)
+            batch[data_name] = tensor_pad
+            batch[data_name + "_lengths"] = tensor_lengths
+
+    return keys, batch
diff --git a/funasr/datasets/large_datasets/utils/tokenize.py b/funasr/datasets/large_datasets/utils/tokenize.py
new file mode 100644
index 000000000..937e14482
--- /dev/null
+++ b/funasr/datasets/large_datasets/utils/tokenize.py
@@ -0,0 +1,17 @@
+#!/usr/bin/env python
+import numpy as np
+
+def tokenize(data,
+             vocab=None):
+    assert "text" in data
+    assert isinstance(vocab, dict)
+    text = data["text"]
+    token = []
+    for x in text:
+        if x in vocab:
+            token.append(vocab[x])
+        else:
+            token.append(vocab['<unk>'])
+
+    data["text"] = np.array(token)
+    return data
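
Note on `tokenize`: the lookup falls back to the id of `<unk>` for any token missing from the vocabulary, so the symbol table must contain an `<unk>` entry. A minimal sketch of the same mapping, using a hypothetical `tokens_to_ids` helper and a made-up four-entry vocabulary:

```python
def tokens_to_ids(tokens, vocab, unk="<unk>"):
    """Hypothetical helper: map tokens to ids, falling back to the <unk> id."""
    unk_id = vocab[unk]  # raises KeyError if the vocabulary has no <unk> entry
    return [vocab.get(tok, unk_id) for tok in tokens]


# Illustrative vocabulary; a real symbol table is read from the dict file.
vocab = {"<blank>": 0, "<unk>": 1, "hello": 2, "world": 3}
ids = tokens_to_ids(["hello", "there", "world"], vocab)
```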
diff --git a/funasr/datasets/preprocessor.py b/funasr/datasets/preprocessor.py
new file mode 100644
index 000000000..80d1adcfe
--- /dev/null
+++ b/funasr/datasets/preprocessor.py
@@ -0,0 +1,496 @@
+from abc import ABC
+from abc import abstractmethod
+from pathlib import Path
+from typing import Collection
+from typing import Dict
+from typing import Iterable
+from typing import List
+from typing import Union
+
+import numpy as np
+import scipy.signal
+import soundfile
+from typeguard import check_argument_types
+from typeguard import check_return_type
+
+from funasr.text.build_tokenizer import build_tokenizer
+from funasr.text.cleaner import TextCleaner
+from funasr.text.token_id_converter import TokenIDConverter
+
+
+class AbsPreprocessor(ABC):
+    def __init__(self, train: bool):
+        self.train = train
+
+    @abstractmethod
+    def __call__(
+        self, uid: str, data: Dict[str, Union[str, np.ndarray]]
+    ) -> Dict[str, np.ndarray]:
+        raise NotImplementedError
+
+
+def framing(
+    x,
+    frame_length: int = 512,
+    frame_shift: int = 256,
+    centered: bool = True,
+    padded: bool = True,
+):
+    if x.size == 0:
+        raise ValueError("Input array size is zero")
+    if frame_length < 1:
+        raise ValueError("frame_length must be a positive integer")
+    if frame_length > x.shape[-1]:
+        raise ValueError("frame_length is greater than input length")
+    if 0 >= frame_shift:
+        raise ValueError("frame_shift must be greater than 0")
+
+    if centered:
+        pad_shape = [(0, 0) for _ in range(x.ndim - 1)] + [
+            (frame_length // 2, frame_length // 2)
+        ]
+        x = np.pad(x, pad_shape, mode="constant", constant_values=0)
+
+    if padded:
+        # Pad to an integer number of windowed segments,
+        # i.e. make x.shape[-1] = frame_length + (nseg - 1) * frame_shift
+        #  with integer nseg
+        nadd = (-(x.shape[-1] - frame_length) % frame_shift) % frame_length
+        pad_shape = [(0, 0) for _ in range(x.ndim - 1)] + [(0, nadd)]
+        x = np.pad(x, pad_shape, mode="constant", constant_values=0)
+
+    # Create a strided array of data segments
+    if frame_length == 1 and frame_length == frame_shift:
+        result = x[..., None]
+    else:
+        shape = x.shape[:-1] + (
+            (x.shape[-1] - frame_length) // frame_shift + 1,
+            frame_length,
+        )
+        strides = x.strides[:-1] + (frame_shift * x.strides[-1], x.strides[-1])
+        result = np.lib.stride_tricks.as_strided(x, shape=shape, strides=strides)
+    return result
+
+
+def detect_non_silence(
+    x: np.ndarray,
+    threshold: float = 0.01,
+    frame_length: int = 1024,
+    frame_shift: int = 512,
+    window: str = "boxcar",
+) -> np.ndarray:
+    """Power based voice activity detection.
+
+    Args:
+        x: (Channel, Time)
+    >>> x = np.random.randn(1000)
+    >>> detect = detect_non_silence(x)
+    >>> assert x.shape == detect.shape
+    >>> assert detect.dtype == bool
+    """
+    if x.shape[-1] < frame_length:
+        return np.full(x.shape, fill_value=True, dtype=bool)
+
+    if x.dtype.kind == "i":
+        x = x.astype(np.float64)
+    # framed_w: (C, T, F)
+    framed_w = framing(
+        x,
+        frame_length=frame_length,
+        frame_shift=frame_shift,
+        centered=False,
+        padded=True,
+    )
+    framed_w *= scipy.signal.get_window(window, frame_length).astype(framed_w.dtype)
+    # power: (C, T)
+    power = (framed_w**2).mean(axis=-1)
+    # mean_power: (C, 1)
+    mean_power = np.mean(power, axis=-1, keepdims=True)
+    if np.all(mean_power == 0):
+        return np.full(x.shape, fill_value=True, dtype=bool)
+    # detect_frames: (C, T)
+    detect_frames = power / mean_power > threshold
+    # detects: (C, T, F)
+    detects = np.broadcast_to(
+        detect_frames[..., None], detect_frames.shape + (frame_shift,)
+    )
+    # detects: (C, TF)
+    detects = detects.reshape(*detect_frames.shape[:-1], -1)
+    # detects: (C, TF)
+    return np.pad(
+        detects,
+        [(0, 0)] * (x.ndim - 1) + [(0, x.shape[-1] - detects.shape[-1])],
+        mode="edge",
+    )
+
+
+class CommonPreprocessor(AbsPreprocessor):
+    def __init__(
+        self,
+        train: bool,
+        token_type: str = None,
+        token_list: Union[Path, str, Iterable[str]] = None,
+        bpemodel: Union[Path, str, Iterable[str]] = None,
+        text_cleaner: Collection[str] = None,
+        g2p_type: str = None,
+        unk_symbol: str = "<unk>",
+        space_symbol: str = "<space>",
+        non_linguistic_symbols: Union[Path, str, Iterable[str]] = None,
+        delimiter: str = None,
+        rir_scp: str = None,
+        rir_apply_prob: float = 1.0,
+        noise_scp: str = None,
+        noise_apply_prob: float = 1.0,
+        noise_db_range: str = "3_10",
+        speech_volume_normalize: float = None,
+        speech_name: str = "speech",
+        text_name: str = "text",
+        split_with_space: bool = False,
+    ):
+        super().__init__(train)
+        self.train = train
+        self.speech_name = speech_name
+        self.text_name = text_name
+        self.speech_volume_normalize = speech_volume_normalize
+        self.rir_apply_prob = rir_apply_prob
+        self.noise_apply_prob = noise_apply_prob
+        self.split_with_space = split_with_space
+
+        if token_type is not None:
+            if token_list is None:
+                raise ValueError("token_list is required if token_type is not None")
+            self.text_cleaner = TextCleaner(text_cleaner)
+
+            self.tokenizer = build_tokenizer(
+                token_type=token_type,
+                bpemodel=bpemodel,
+                delimiter=delimiter,
+                space_symbol=space_symbol,
+                non_linguistic_symbols=non_linguistic_symbols,
+                g2p_type=g2p_type,
+            )
+            self.token_id_converter = TokenIDConverter(
+                token_list=token_list,
+                unk_symbol=unk_symbol,
+            )
+        else:
+            self.text_cleaner = None
+            self.tokenizer = None
+            self.token_id_converter = None
+
+        if train and rir_scp is not None:
+            self.rirs = []
+            with open(rir_scp, "r", encoding="utf-8") as f:
+                for line in f:
+                    sps = line.strip().split(None, 1)
+                    if len(sps) == 1:
+                        self.rirs.append(sps[0])
+                    else:
+                        self.rirs.append(sps[1])
+        else:
+            self.rirs = None
+
+        if train and noise_scp is not None:
+            self.noises = []
+            with open(noise_scp, "r", encoding="utf-8") as f:
+                for line in f:
+                    sps = line.strip().split(None, 1)
+                    if len(sps) == 1:
+                        self.noises.append(sps[0])
+                    else:
+                        self.noises.append(sps[1])
+            sps = noise_db_range.split("_")
+            if len(sps) == 1:
+                self.noise_db_low = self.noise_db_high = float(sps[0])
+            elif len(sps) == 2:
+                self.noise_db_low, self.noise_db_high = float(sps[0]), float(sps[1])
+            else:
+                raise ValueError(
+                    f"Format error: '{noise_db_range}' e.g. -3_4 -> [-3db,4db]"
+                )
+        else:
+            self.noises = None
+
+    def _speech_process(
+        self, data: Dict[str, Union[str, np.ndarray]]
+    ) -> Dict[str, Union[str, np.ndarray]]:
+        assert check_argument_types()
+        if self.speech_name in data:
+            if self.train and (self.rirs is not None or self.noises is not None):
+                speech = data[self.speech_name]
+                nsamples = len(speech)
+
+                # speech: (Nmic, Time)
+                if speech.ndim == 1:
+                    speech = speech[None, :]
+                else:
+                    speech = speech.T
+                # Calc power on non-silence region
+                power = (speech[detect_non_silence(speech)] ** 2).mean()
+
+                # 1. Convolve RIR
+                if self.rirs is not None and self.rir_apply_prob >= np.random.random():
+                    rir_path = np.random.choice(self.rirs)
+                    if rir_path is not None:
+                        rir, _ = soundfile.read(
+                            rir_path, dtype=np.float64, always_2d=True
+                        )
+
+                        # rir: (Nmic, Time)
+                        rir = rir.T
+
+                        # speech: (Nmic, Time)
+                        # Note that this operation doesn't change the signal length
+                        speech = scipy.signal.convolve(speech, rir, mode="full")[
+                            :, : speech.shape[1]
+                        ]
+                        # Restore the mean power to the original power
+                        power2 = (speech[detect_non_silence(speech)] ** 2).mean()
+                        speech = np.sqrt(power / max(power2, 1e-10)) * speech
+
+                # 2. Add Noise
+                if (
+                    self.noises is not None
+                    and self.noise_apply_prob >= np.random.random()
+                ):
+                    noise_path = np.random.choice(self.noises)
+                    if noise_path is not None:
+                        noise_db = np.random.uniform(
+                            self.noise_db_low, self.noise_db_high
+                        )
+                        with soundfile.SoundFile(noise_path) as f:
+                            if f.frames == nsamples:
+                                noise = f.read(dtype=np.float64, always_2d=True)
+                            elif f.frames < nsamples:
+                                offset = np.random.randint(0, nsamples - f.frames)
+                                # noise: (Time, Nmic)
+                                noise = f.read(dtype=np.float64, always_2d=True)
+                                # Repeat noise
+                                noise = np.pad(
+                                    noise,
+                                    [(offset, nsamples - f.frames - offset), (0, 0)],
+                                    mode="wrap",
+                                )
+                            else:
+                                offset = np.random.randint(0, f.frames - nsamples)
+                                f.seek(offset)
+                                # noise: (Time, Nmic)
+                                noise = f.read(
+                                    nsamples, dtype=np.float64, always_2d=True
+                                )
+                                if len(noise) != nsamples:
+                                    raise RuntimeError(f"Something wrong: {noise_path}")
+                        # noise: (Nmic, Time)
+                        noise = noise.T
+
+                        noise_power = (noise**2).mean()
+                        scale = (
+                            10 ** (-noise_db / 20)
+                            * np.sqrt(power)
+                            / np.sqrt(max(noise_power, 1e-10))
+                        )
+                        speech = speech + scale * noise
+
+                speech = speech.T
+                ma = np.max(np.abs(speech))
+                if ma > 1.0:
+                    speech /= ma
+                data[self.speech_name] = speech
+
+            if self.speech_volume_normalize is not None:
+                speech = data[self.speech_name]
+                ma = np.max(np.abs(speech))
+                data[self.speech_name] = speech * self.speech_volume_normalize / ma
+        assert check_return_type(data)
+        return data
+
+    def _text_process(
+        self, data: Dict[str, Union[str, np.ndarray]]
+    ) -> Dict[str, np.ndarray]:
+        if self.text_name in data and self.tokenizer is not None:
+            text = data[self.text_name]
+            text = self.text_cleaner(text)
+            if self.split_with_space:
+                tokens = text.strip().split(" ")
+            else:
+                tokens = self.tokenizer.text2tokens(text)
+            text_ints = self.token_id_converter.tokens2ids(tokens)
+            data[self.text_name] = np.array(text_ints, dtype=np.int64)
+        assert check_return_type(data)
+        return data
+
+    def __call__(
+        self, uid: str, data: Dict[str, Union[str, np.ndarray]]
+    ) -> Dict[str, np.ndarray]:
+        assert check_argument_types()
+
+        data = self._speech_process(data)
+        data = self._text_process(data)
+        return data
+
+
+class CommonPreprocessor_multi(AbsPreprocessor):
+    def __init__(
+        self,
+        train: bool,
+        token_type: str = None,
+        token_list: Union[Path, str, Iterable[str]] = None,
+        bpemodel: Union[Path, str, Iterable[str]] = None,
+        text_cleaner: Collection[str] = None,
+        g2p_type: str = None,
+        unk_symbol: str = "<unk>",
+        space_symbol: str = "<space>",
+        non_linguistic_symbols: Union[Path, str, Iterable[str]] = None,
+        delimiter: str = None,
+        speech_name: str = "speech",
+        text_name: List[str] = ["text"],
+    ):
+        super().__init__(train)
+        self.train = train
+        self.speech_name = speech_name
+        self.text_name = text_name
+
+        if token_type is not None:
+            if token_list is None:
+                raise ValueError("token_list is required if token_type is not None")
+            self.text_cleaner = TextCleaner(text_cleaner)
+
+            self.tokenizer = build_tokenizer(
+                token_type=token_type,
+                bpemodel=bpemodel,
+                delimiter=delimiter,
+                space_symbol=space_symbol,
+                non_linguistic_symbols=non_linguistic_symbols,
+                g2p_type=g2p_type,
+            )
+            self.token_id_converter = TokenIDConverter(
+                token_list=token_list,
+                unk_symbol=unk_symbol,
+            )
+        else:
+            self.text_cleaner = None
+            self.tokenizer = None
+            self.token_id_converter = None
+
+    def _text_process(
+        self, data: Dict[str, Union[str, np.ndarray]]
+    ) -> Dict[str, np.ndarray]:
+        for text_n in self.text_name:
+            if text_n in data and self.tokenizer is not None:
+                text = data[text_n]
+                text = self.text_cleaner(text)
+                tokens = self.tokenizer.text2tokens(text)
+                text_ints = self.token_id_converter.tokens2ids(tokens)
+                data[text_n] = np.array(text_ints, dtype=np.int64)
+        assert check_return_type(data)
+        return data
+
+    def __call__(
+        self, uid: str, data: Dict[str, Union[str, np.ndarray]]
+    ) -> Dict[str, np.ndarray]:
+        assert check_argument_types()
+
+        if self.speech_name in data:
+            # Nothing now: candidates:
+            # - STFT
+            # - Fbank
+            # - CMVN
+            # - Data augmentation
+            pass
+
+        data = self._text_process(data)
+        return data
+
+
+class MutliTokenizerCommonPreprocessor(CommonPreprocessor):
+    def __init__(
+        self,
+        train: bool,
+        token_type: List[str] = [None],
+        token_list: List[Union[Path, str, Iterable[str]]] = [None],
+        bpemodel: List[Union[Path, str, Iterable[str]]] = [None],
+        text_cleaner: Collection[str] = None,
+        g2p_type: str = None,
+        unk_symbol: str = "<unk>",
+        space_symbol: str = "<space>",
+        non_linguistic_symbols: Union[Path, str, Iterable[str]] = None,
+        delimiter: str = None,
+        rir_scp: str = None,
+        rir_apply_prob: float = 1.0,
+        noise_scp: str = None,
+        noise_apply_prob: float = 1.0,
+        noise_db_range: str = "3_10",
+        speech_volume_normalize: float = None,
+        speech_name: str = "speech",
+        text_name: List[str] = ["text"],
+    ):
+        # TODO(jiatong): sync with Kamo and Jing on interface for preprocessor
+        super().__init__(
+            train=train,
+            token_type=token_type[0],
+            token_list=token_list[0],
+            bpemodel=bpemodel[0],
+            text_cleaner=text_cleaner,
+            g2p_type=g2p_type,
+            unk_symbol=unk_symbol,
+            space_symbol=space_symbol,
+            non_linguistic_symbols=non_linguistic_symbols,
+            delimiter=delimiter,
+            speech_name=speech_name,
+            text_name=text_name[0],
+            rir_scp=rir_scp,
+            rir_apply_prob=rir_apply_prob,
+            noise_scp=noise_scp,
+            noise_apply_prob=noise_apply_prob,
+            noise_db_range=noise_db_range,
+            speech_volume_normalize=speech_volume_normalize,
+        )
+
+        assert (
+            len(token_type) == len(token_list) == len(bpemodel) == len(text_name)
+        ), "token_type, token_list, bpemodel, or processing text_name mismatched"
+        self.num_tokenizer = len(token_type)
+        self.tokenizer = []
+        self.token_id_converter = []
+
+        for i in range(self.num_tokenizer):
+            if token_type[i] is not None:
+                if token_list[i] is None:
+                    raise ValueError("token_list is required if token_type is not None")
+
+                self.tokenizer.append(
+                    build_tokenizer(
+                        token_type=token_type[i],
+                        bpemodel=bpemodel[i],
+                        delimiter=delimiter,
+                        space_symbol=space_symbol,
+                        non_linguistic_symbols=non_linguistic_symbols,
+                        g2p_type=g2p_type,
+                    )
+                )
+                self.token_id_converter.append(
+                    TokenIDConverter(
+                        token_list=token_list[i],
+                        unk_symbol=unk_symbol,
+                    )
+                )
+            else:
+                self.tokenizer.append(None)
+                self.token_id_converter.append(None)
+
+        self.text_cleaner = TextCleaner(text_cleaner)
+        self.text_name = text_name  # override the text_name from CommonPreprocessor
+
+    def _text_process(
+        self, data: Dict[str, Union[str, np.ndarray]]
+    ) -> Dict[str, np.ndarray]:
+        for i in range(self.num_tokenizer):
+            text_name = self.text_name[i]
+            if text_name in data and self.tokenizer[i] is not None:
+                text = data[text_name]
+                text = self.text_cleaner(text)
+                tokens = self.tokenizer[i].text2tokens(text)
+                text_ints = self.token_id_converter[i].tokens2ids(tokens)
+                data[text_name] = np.array(text_ints, dtype=np.int64)
+        assert check_return_type(data)
+        return data
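Review note: the noise branch of `CommonPreprocessor._speech_process` scales the noise by `10 ** (-noise_db / 20) * sqrt(power) / sqrt(noise_power)`, so `noise_db` acts as the target signal-to-noise ratio in dB. The sketch below checks that rule in isolation. It is a simplified illustration, not the preprocessor itself: plain Python lists stand in for NumPy arrays, the power is taken over the whole signal rather than over the non-silence region, and the names `mean_power` and `mix_at_snr` are hypothetical.

```python
import math
import random


def mean_power(x):
    """Mean of squared samples."""
    return sum(v * v for v in x) / len(x)


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech so that speech power / added-noise power equals snr_db (in dB)."""
    power = mean_power(speech)
    noise_power = mean_power(noise)
    # Same scaling rule as the noise branch of _speech_process
    scale = (
        10 ** (-snr_db / 20)
        * math.sqrt(power)
        / math.sqrt(max(noise_power, 1e-10))
    )
    return [s + scale * n for s, n in zip(speech, noise)]


random.seed(0)
speech = [random.gauss(0.0, 1.0) for _ in range(1000)]
noise = [random.gauss(0.0, 0.3) for _ in range(1000)]

mixed = mix_at_snr(speech, noise, snr_db=10.0)
added = [m - s for m, s in zip(mixed, speech)]
measured_snr_db = 10 * math.log10(mean_power(speech) / mean_power(added))
```

Because the scale divides by the noise's own power, the measured ratio matches the requested one regardless of how loud the noise file is.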
diff --git a/funasr/fileio/__init__.py b/funasr/fileio/__init__.py
new file mode 100644
index 000000000..e69de29bb
diff --git a/funasr/fileio/datadir_writer.py b/funasr/fileio/datadir_writer.py
new file mode 100644
index 000000000..bafdf984f
--- /dev/null
+++ b/funasr/fileio/datadir_writer.py
@@ -0,0 +1,77 @@
+from pathlib import Path
+from typing import Union
+import warnings
+
+from typeguard import check_argument_types
+from typeguard import check_return_type
+
+
+class DatadirWriter:
+    """Writer class to create a Kaldi-like data directory.
+
+    Examples:
+        >>> with DatadirWriter("output") as writer:
+        ...     # output/sub.txt is created here
+        ...     subwriter = writer["sub.txt"]
+        ...     # Write "uttidA some/where/a.wav"
+        ...     subwriter["uttidA"] = "some/where/a.wav"
+        ...     subwriter["uttidB"] = "some/where/b.wav"
+
+    """
+
+    def __init__(self, p: Union[Path, str]):
+        assert check_argument_types()
+        self.path = Path(p)
+        self.chilidren = {}
+        self.fd = None
+        self.has_children = False
+        self.keys = set()
+
+    def __enter__(self):
+        return self
+
+    def __getitem__(self, key: str) -> "DatadirWriter":
+        assert check_argument_types()
+        if self.fd is not None:
+            raise RuntimeError("This writer points to a file")
+
+        if key not in self.chilidren:
+            w = DatadirWriter((self.path / key))
+            self.chilidren[key] = w
+            self.has_children = True
+
+        retval = self.chilidren[key]
+        assert check_return_type(retval)
+        return retval
+
+    def __setitem__(self, key: str, value: str):
+        assert check_argument_types()
+        if self.has_children:
+            raise RuntimeError("This writer points to a directory")
+        if key in self.keys:
+            warnings.warn(f"Duplicated: {key}")
+
+        if self.fd is None:
+            self.path.parent.mkdir(parents=True, exist_ok=True)
+            self.fd = self.path.open("w", encoding="utf-8")
+
+        self.keys.add(key)
+        self.fd.write(f"{key} {value}\n")
+
+    def __exit__(self, exc_type, exc_val, exc_tb):
+        self.close()
+
+    def close(self):
+        if self.has_children:
+            prev_child = None
+            for child in self.chilidren.values():
+                child.close()
+                if prev_child is not None and prev_child.keys != child.keys:
+                    warnings.warn(
+                        f"Ids are mismatching between "
+                        f"{prev_child.path} and {child.path}"
+                    )
+                prev_child = child
+
+        elif self.fd is not None:
+            self.fd.close()
diff --git a/funasr/fileio/npy_scp.py b/funasr/fileio/npy_scp.py
new file mode 100644
index 000000000..26666b678
--- /dev/null
+++ b/funasr/fileio/npy_scp.py
@@ -0,0 +1,97 @@
+import collections.abc
+from pathlib import Path
+from typing import Union
+
+import numpy as np
+from typeguard import check_argument_types
+
+from funasr.fileio.read_text import read_2column_text
+
+
+class NpyScpWriter:
+    """Writer class for a scp file of numpy file.
+
+    Examples:
+        key1 /some/path/a.npy
+        key2 /some/path/b.npy
+        key3 /some/path/c.npy
+        key4 /some/path/d.npy
+        ...
+
+        >>> writer = NpyScpWriter('./data/', './data/feat.scp')
+        >>> writer['aa'] = numpy_array
+        >>> writer['bb'] = numpy_array
+
+    """
+
+    def __init__(self, outdir: Union[Path, str], scpfile: Union[Path, str]):
+        assert check_argument_types()
+        self.dir = Path(outdir)
+        self.dir.mkdir(parents=True, exist_ok=True)
+        scpfile = Path(scpfile)
+        scpfile.parent.mkdir(parents=True, exist_ok=True)
+        self.fscp = scpfile.open("w", encoding="utf-8")
+
+        self.data = {}
+
+    def get_path(self, key):
+        return self.data[key]
+
+    def __setitem__(self, key, value):
+        assert isinstance(value, np.ndarray), type(value)
+        p = self.dir / f"{key}.npy"
+        p.parent.mkdir(parents=True, exist_ok=True)
+        np.save(str(p), value)
+        self.fscp.write(f"{key} {p}\n")
+
+        # Store the file path
+        self.data[key] = str(p)
+
+    def __enter__(self):
+        return self
+
+    def __exit__(self, exc_type, exc_val, exc_tb):
+        self.close()
+
+    def close(self):
+        self.fscp.close()
+
+
+class NpyScpReader(collections.abc.Mapping):
+    """Reader class for a scp file of numpy file.
+
+    Examples:
+        key1 /some/path/a.npy
+        key2 /some/path/b.npy
+        key3 /some/path/c.npy
+        key4 /some/path/d.npy
+        ...
+
+        >>> reader = NpyScpReader('npy.scp')
+        >>> array = reader['key1']
+
+    """
+
+    def __init__(self, fname: Union[Path, str]):
+        assert check_argument_types()
+        self.fname = Path(fname)
+        self.data = read_2column_text(fname)
+
+    def get_path(self, key):
+        return self.data[key]
+
+    def __getitem__(self, key) -> np.ndarray:
+        p = self.data[key]
+        return np.load(p)
+
+    def __contains__(self, item):
+        return item in self.data
+
+    def __len__(self):
+        return len(self.data)
+
+    def __iter__(self):
+        return iter(self.data)
+
+    def keys(self):
+        return self.data.keys()
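Review note: for a `collections.abc.Mapping` subclass such as `NpyScpReader`, `key in reader` should report whether the key is present in the scp table. The sketch below shows that behaviour on a hypothetical stand-in class, `ScpMapping`, which keeps a plain dict instead of reading a file; it is an illustration only and is not part of the patch.

```python
import collections.abc


class ScpMapping(collections.abc.Mapping):
    """Hypothetical stand-in for an scp reader backed by a key -> path dict."""

    def __init__(self, table):
        self.data = dict(table)

    def __getitem__(self, key):
        return self.data[key]

    def __len__(self):
        return len(self.data)

    def __iter__(self):
        return iter(self.data)

    def __contains__(self, item):
        # Membership is decided by the key table, not by the truthiness of item
        return item in self.data


reader = ScpMapping({"key1": "/some/path/a.npy"})
has_key1 = "key1" in reader
has_missing = "missing" in reader
```

Omitting `__contains__` altogether would also work, since `Mapping` supplies a default that tries `__getitem__` and catches `KeyError`; the explicit version simply avoids loading the value.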
diff --git a/funasr/fileio/rand_gen_dataset.py b/funasr/fileio/rand_gen_dataset.py
new file mode 100644
index 000000000..2faef3a04
--- /dev/null
+++ b/funasr/fileio/rand_gen_dataset.py
@@ -0,0 +1,86 @@
+import collections.abc
+from pathlib import Path
+from typing import Union
+
+import numpy as np
+from typeguard import check_argument_types
+
+from funasr.fileio.read_text import load_num_sequence_text
+
+
+class FloatRandomGenerateDataset(collections.abc.Mapping):
+    """Generate float array from shape.txt.
+
+    Examples:
+        shape.txt
+        uttA 123,83
+        uttB 34,83
+        >>> dataset = FloatRandomGenerateDataset("shape.txt")
+        >>> array = dataset["uttA"]
+        >>> assert array.shape == (123, 83)
+        >>> array = dataset["uttB"]
+        >>> assert array.shape == (34, 83)
+
+    """
+
+    def __init__(
+        self,
+        shape_file: Union[Path, str],
+        dtype: Union[str, np.dtype] = "float32",
+        loader_type: str = "csv_int",
+    ):
+        assert check_argument_types()
+        shape_file = Path(shape_file)
+        self.utt2shape = load_num_sequence_text(shape_file, loader_type)
+        self.dtype = np.dtype(dtype)
+
+    def __iter__(self):
+        return iter(self.utt2shape)
+
+    def __len__(self):
+        return len(self.utt2shape)
+
+    def __getitem__(self, item) -> np.ndarray:
+        shape = self.utt2shape[item]
+        return np.random.randn(*shape).astype(self.dtype)
+
+
+class IntRandomGenerateDataset(collections.abc.Mapping):
+    """Generate int array from shape.txt
+
+    Examples:
+        shape.txt
+        uttA 123,83
+        uttB 34,83
+        >>> dataset = IntRandomGenerateDataset("shape.txt", low=0, high=10)
+        >>> array = dataset["uttA"]
+        >>> assert array.shape == (123, 83)
+        >>> array = dataset["uttB"]
+        >>> assert array.shape == (34, 83)
+
+    """
+
+    def __init__(
+        self,
+        shape_file: Union[Path, str],
+        low: int,
+        high: int = None,
+        dtype: Union[str, np.dtype] = "int64",
+        loader_type: str = "csv_int",
+    ):
+        assert check_argument_types()
+        shape_file = Path(shape_file)
+        self.utt2shape = load_num_sequence_text(shape_file, loader_type)
+        self.dtype = np.dtype(dtype)
+        self.low = low
+        self.high = high
+
+    def __iter__(self):
+        return iter(self.utt2shape)
+
+    def __len__(self):
+        return len(self.utt2shape)
+
+    def __getitem__(self, item) -> np.ndarray:
+        shape = self.utt2shape[item]
+        return np.random.randint(self.low, self.high, size=shape, dtype=self.dtype)
diff --git a/funasr/fileio/read_text.py b/funasr/fileio/read_text.py
new file mode 100644
index 000000000..e26e7a1c5
--- /dev/null
+++ b/funasr/fileio/read_text.py
@@ -0,0 +1,81 @@
+import logging
+from pathlib import Path
+from typing import Dict
+from typing import List
+from typing import Union
+
+from typeguard import check_argument_types
+
+
+def read_2column_text(path: Union[Path, str]) -> Dict[str, str]:
+    """Read a two-column text file as a dict object.
+
+    Examples:
+        wav.scp:
+            key1 /some/path/a.wav
+            key2 /some/path/b.wav
+
+        >>> read_2column_text('wav.scp')
+        {'key1': '/some/path/a.wav', 'key2': '/some/path/b.wav'}
+
+    """
+    assert check_argument_types()
+
+    data = {}
+    with Path(path).open("r", encoding="utf-8") as f:
+        for linenum, line in enumerate(f, 1):
+            sps = line.rstrip().split(maxsplit=1)
+            if len(sps) == 1:
+                k, v = sps[0], ""
+            else:
+                k, v = sps
+            if k in data:
+                raise RuntimeError(f"{k} is duplicated ({path}:{linenum})")
+            data[k] = v
+    return data
+
+
+def load_num_sequence_text(
+    path: Union[Path, str], loader_type: str = "csv_int"
+) -> Dict[str, List[Union[float, int]]]:
+    """Read a text file whose lines hold a key and a sequence of numbers.
+
+    Examples:
+        key1 1 2 3
+        key2 34 5 6
+
+        >>> d = load_num_sequence_text('text', loader_type='text_int')
+        >>> assert d["key1"] == [1, 2, 3]
+    """
+    assert check_argument_types()
+    if loader_type == "text_int":
+        delimiter = " "
+        dtype = int
+    elif loader_type == "text_float":
+        delimiter = " "
+        dtype = float
+    elif loader_type == "csv_int":
+        delimiter = ","
+        dtype = int
+    elif loader_type == "csv_float":
+        delimiter = ","
+        dtype = float
+    else:
+        raise ValueError(f"Not supported loader_type={loader_type}")
+
+    # path looks like:
+    #   utta 1,0
+    #   uttb 3,4,5
+    # -> return {'utta': [1, 0],
+    #            'uttb': [3, 4, 5]}
+    d = read_2column_text(path)
+
+    # Using for-loop instead of dict-comprehension for debuggability
+    retval = {}
+    for k, v in d.items():
+        try:
+            retval[k] = [dtype(i) for i in v.split(delimiter)]
+        except (TypeError, ValueError):
+            logging.error(f'Error happened with path="{path}", id="{k}", value="{v}"')
+            raise
+    return retval
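Review note: `load_num_sequence_text` works in two steps: split each line into a key and the rest, then split the rest on the loader's delimiter and convert each field. The sketch below reproduces that flow in a standalone, hypothetical helper, `parse_num_sequences`, using a temporary `shape.txt` in the comma-separated layout that the `csv_int` loader expects. It assumes every non-blank line has both columns and is not a replacement for the function in the patch.

```python
import tempfile
from pathlib import Path


def parse_num_sequences(path, delimiter=",", dtype=int):
    """Hypothetical standalone parse: 'key v1,v2,...' per line -> {key: [v1, v2, ...]}."""
    result = {}
    with Path(path).open("r", encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # skipping blank lines is an assumption of this sketch
            # First whitespace run separates the key from the value column
            key, value = line.rstrip().split(maxsplit=1)
            result[key] = [dtype(field) for field in value.split(delimiter)]
    return result


with tempfile.TemporaryDirectory() as tmp:
    shape_file = Path(tmp) / "shape.txt"
    shape_file.write_text("uttA 123,83\nuttB 34,83\n", encoding="utf-8")
    shapes = parse_num_sequences(shape_file)
```

The values come back as plain Python lists, which is what `FloatRandomGenerateDataset` later unpacks with `np.random.randn(*shape)`.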
diff --git a/funasr/fileio/sound_scp.py b/funasr/fileio/sound_scp.py
new file mode 100644
index 000000000..459369efb
--- /dev/null
+++ b/funasr/fileio/sound_scp.py
@@ -0,0 +1,131 @@
+import collections.abc
+from pathlib import Path
+from typing import Union
+
+import numpy as np
+import soundfile
+from typeguard import check_argument_types
+
+from funasr.fileio.read_text import read_2column_text
+
+
+class SoundScpReader(collections.abc.Mapping):
+    """Reader class for 'wav.scp'.
+
+    Examples:
+        key1 /some/path/a.wav
+        key2 /some/path/b.wav
+        key3 /some/path/c.wav
+        key4 /some/path/d.wav
+        ...
+
+        >>> reader = SoundScpReader('wav.scp')
+        >>> rate, array = reader['key1']
+
+    """
+
+    def __init__(
+        self,
+        fname,
+        dtype=np.int16,
+        always_2d: bool = False,
+        normalize: bool = False,
+    ):
+        assert check_argument_types()
+        self.fname = fname
+        self.dtype = dtype
+        self.always_2d = always_2d
+        self.normalize = normalize
+        self.data = read_2column_text(fname)
+
+    def __getitem__(self, key):
+        wav = self.data[key]
+        if self.normalize:
+            # soundfile.read normalizes data to [-1,1] if dtype is not given
+            array, rate = soundfile.read(wav, always_2d=self.always_2d)
+        else:
+            array, rate = soundfile.read(
+                wav, dtype=self.dtype, always_2d=self.always_2d
+            )
+
+        return rate, array
+
+    def get_path(self, key):
+        return self.data[key]
+
+    def __contains__(self, item):
+        return item in self.data
+
+    def __len__(self):
+        return len(self.data)
+
+    def __iter__(self):
+        return iter(self.data)
+
+    def keys(self):
+        return self.data.keys()
+
+
+class SoundScpWriter:
+    """Writer class for 'wav.scp'
+
+    Examples:
+        key1 /some/path/a.wav
+        key2 /some/path/b.wav
+        key3 /some/path/c.wav
+        key4 /some/path/d.wav
+        ...
+
+        >>> writer = SoundScpWriter('./data/', './data/feat.scp')
+        >>> writer['aa'] = 16000, numpy_array
+        >>> writer['bb'] = 16000, numpy_array
+
+    """
+
+    def __init__(
+        self,
+        outdir: Union[Path, str],
+        scpfile: Union[Path, str],
+        format="wav",
+        dtype=None,
+    ):
+        assert check_argument_types()
+        self.dir = Path(outdir)
+        self.dir.mkdir(parents=True, exist_ok=True)
+        scpfile = Path(scpfile)
+        scpfile.parent.mkdir(parents=True, exist_ok=True)
+        self.fscp = scpfile.open("w", encoding="utf-8")
+        self.format = format
+        self.dtype = dtype
+
+        self.data = {}
+
+    def __setitem__(self, key: str, value):
+        rate, signal = value
+        assert isinstance(rate, int), type(rate)
+        assert isinstance(signal, np.ndarray), type(signal)
+        if signal.ndim not in (1, 2):
+            raise RuntimeError(f"Input signal must be 1 or 2 dimension: {signal.ndim}")
+        if signal.ndim == 1:
+            signal = signal[:, None]
+
+        wav = self.dir / f"{key}.{self.format}"
+        wav.parent.mkdir(parents=True, exist_ok=True)
+        soundfile.write(str(wav), signal, rate)
+
+        self.fscp.write(f"{key} {wav}\n")
+
+        # Store the file path
+        self.data[key] = str(wav)
+
+    def get_path(self, key):
+        return self.data[key]
+
+    def __enter__(self):
+        return self
+
+    def __exit__(self, exc_type, exc_val, exc_tb):
+        self.close()
+
+    def close(self):
+        self.fscp.close()
diff --git a/funasr/iterators/__init__.py b/funasr/iterators/__init__.py
new file mode 100644
index 000000000..e69de29bb
diff --git a/funasr/iterators/abs_iter_factory.py b/funasr/iterators/abs_iter_factory.py
new file mode 100644
index 000000000..36e4dd2c5
--- /dev/null
+++ b/funasr/iterators/abs_iter_factory.py
@@ -0,0 +1,9 @@
+from abc import ABC
+from abc import abstractmethod
+from typing import Iterator
+
+
+class AbsIterFactory(ABC):
+    @abstractmethod
+    def build_iter(self, epoch: int, shuffle: bool = None) -> Iterator:
+        raise NotImplementedError
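Review note: `ChunkIterFactory.build_iter` in `funasr/iterators/chunk_iter_factory.py` cuts each sequence of length `L` into windows of width `W` with shift `S = int(W * chunk_shift_ratio)`, which yields `N = (L - W) // S + 1` chunks starting at an offset `Z`. The sketch below isolates that arithmetic in a hypothetical helper, `split_into_chunks`, operating on a plain list; the guard on `S` is an addition of this sketch.

```python
def split_into_chunks(seq, chunk_len, shift_ratio=0.5, offset=0):
    """Split seq into overlapping chunks using the L/W/S/N rule described above."""
    L = len(seq)
    W = chunk_len
    S = int(W * shift_ratio)  # shift width to the next chunk
    if W >= L:
        raise ValueError("chunk_len must be shorter than the sequence")
    if S < 1:
        raise ValueError("shift width must be at least one frame")
    if not 0 <= offset <= (L - W) % S:
        raise ValueError("offset must lie within the leftover frames")
    N = (L - W) // S + 1  # number of chunks
    return [seq[offset + i * S : offset + i * S + W] for i in range(N)]


chunks = split_into_chunks(list(range(10)), chunk_len=4)
```

With `L = 10`, `W = 4` and the default ratio of 0.5, the shift is 2 and four chunks are produced, each sharing half its frames with the next.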
diff --git a/funasr/iterators/chunk_iter_factory.py b/funasr/iterators/chunk_iter_factory.py
new file mode 100644
index 000000000..cec637040
--- /dev/null
+++ b/funasr/iterators/chunk_iter_factory.py
@@ -0,0 +1,215 @@
+import logging
+from typing import Any
+from typing import Dict
+from typing import Iterator
+from typing import List
+from typing import Sequence
+from typing import Tuple
+from typing import Union
+
+import numpy as np
+import torch
+from typeguard import check_argument_types
+
+from funasr.iterators.abs_iter_factory import AbsIterFactory
+from funasr.iterators.sequence_iter_factory import SequenceIterFactory
+from funasr.samplers.abs_sampler import AbsSampler
+
+
+class ChunkIterFactory(AbsIterFactory):
+    """Creates chunks from a sequence
+
+    Examples:
+        >>> batches = [["id1"], ["id2"], ...]
+        >>> batch_size = 128
+        >>> chunk_length = 1000
+        >>> iter_factory = ChunkIterFactory(dataset, batch_size, batches, chunk_length)
+        >>> it = iter_factory.build_iter(epoch)
+        >>> for ids, batch in it:
+        ...     ...
+
+    - The number of mini-batches varies from epoch to epoch and
+      cannot be known in advance
+      because the iter factory is not given the sequence lengths.
+    - For this reason, "num_iters_per_epoch" can't be implemented
+      for this iterator. "num_samples_per_epoch" is implemented instead.
+
+    """
+
+    def __init__(
+        self,
+        dataset,
+        batch_size: int,
+        batches: Union[AbsSampler, Sequence[Sequence[Any]]],
+        chunk_length: Union[int, str],
+        chunk_shift_ratio: float = 0.5,
+        num_cache_chunks: int = 1024,
+        num_samples_per_epoch: int = None,
+        seed: int = 0,
+        shuffle: bool = False,
+        num_workers: int = 0,
+        collate_fn=None,
+        pin_memory: bool = False,
+    ):
+        assert check_argument_types()
+        assert all(len(x) == 1 for x in batches), "batch-size must be 1"
+
+        self.per_sample_iter_factory = SequenceIterFactory(
+            dataset=dataset,
+            batches=batches,
+            num_iters_per_epoch=num_samples_per_epoch,
+            seed=seed,
+            shuffle=shuffle,
+            num_workers=num_workers,
+            collate_fn=collate_fn,
+            pin_memory=pin_memory,
+        )
+
+        self.num_cache_chunks = max(num_cache_chunks, batch_size)
+        if isinstance(chunk_length, str):
+            if len(chunk_length) == 0:
+                raise ValueError("e.g. 5,8 or 3-5: but got empty string")
+
+            self.chunk_lengths = []
+            for x in chunk_length.split(","):
+                try:
+                    sps = list(map(int, x.split("-")))
+                except ValueError:
+                    raise ValueError(f"e.g. 5,8 or 3-5: but got {chunk_length}")
+
+                if len(sps) > 2:
+                    raise ValueError(f"e.g. 5,8 or 3-5: but got {chunk_length}")
+                elif len(sps) == 2:
+                    # Append all numbers between the range into the candidates
+                    self.chunk_lengths += list(range(sps[0], sps[1] + 1))
+                else:
+                    self.chunk_lengths += [sps[0]]
+        else:
+            # Single candidates: Fixed chunk length
+            self.chunk_lengths = [chunk_length]
+
+        self.chunk_shift_ratio = chunk_shift_ratio
+        self.batch_size = batch_size
+        self.seed = seed
+        self.shuffle = shuffle
+
+    def build_iter(
+        self,
+        epoch: int,
+        shuffle: bool = None,
+    ) -> Iterator[Tuple[List[str], Dict[str, torch.Tensor]]]:
+        per_sample_loader = self.per_sample_iter_factory.build_iter(epoch, shuffle)
+
+        if shuffle is None:
+            shuffle = self.shuffle
+        state = np.random.RandomState(epoch + self.seed)
+
+        # NOTE(kamo):
+        #   This iterator supports multiple chunk lengths and
+        #   keeps chunks for each length here until enough are collected
+        cache_chunks_dict = {}
+        cache_id_list_dict = {}
+        for ids, batch in per_sample_loader:
+            # Must be per-sample-loader
+            assert len(ids) == 1, f"Must be per-sample-loader: {len(ids)}"
+            assert all(len(x) == 1 for x in batch.values())
+
+            # Get keys of sequence data
+            sequence_keys = []
+            for key in batch:
+                if key + "_lengths" in batch:
+                    sequence_keys.append(key)
+            # Remove lengths data and get the first sample
+            batch = {k: v[0] for k, v in batch.items() if not k.endswith("_lengths")}
+            id_ = ids[0]
+
+            for key in sequence_keys:
+                if len(batch[key]) != len(batch[sequence_keys[0]]):
+                    raise RuntimeError(
+                        f"All sequences must have the same length: "
+                        f"{len(batch[key])} != {len(batch[sequence_keys[0]])}"
+                    )
+
+            L = len(batch[sequence_keys[0]])
+            # Select chunk length
+            chunk_lengths = [lg for lg in self.chunk_lengths if lg < L]
+            if len(chunk_lengths) == 0:
+                logging.warning(
+                    f"The length of '{id_}' is {L}, but it is not longer than "
+                    f"any candidate chunk length: {self.chunk_lengths}"
+                )
+                continue
+
+            W = int(state.choice(chunk_lengths, 1))
+            cache_id_list = cache_id_list_dict.setdefault(W, [])
+            cache_chunks = cache_chunks_dict.setdefault(W, {})
+
+            # Shift width to the next chunk
+            S = int(W * self.chunk_shift_ratio)
+            # Number of chunks
+            N = (L - W) // S + 1
+            if shuffle:
+                Z = state.randint(0, (L - W) % S + 1)
+            else:
+                Z = 0
+
+            # Split a sequence into chunks.
+            # Note that leftover frames that do not fill a whole chunk are discarded
+            for k, v in batch.items():
+                if k not in cache_chunks:
+                    cache_chunks[k] = []
+                if k in sequence_keys:
+                    # Shift chunks with overlapped length for data augmentation
+                    cache_chunks[k] += [v[Z + i * S : Z + i * S + W] for i in range(N)]
+                else:
+                    # If not sequence, use whole data instead of chunk
+                    cache_chunks[k] += [v for _ in range(N)]
+            cache_id_list += [id_ for _ in range(N)]
+
+            if len(cache_id_list) > self.num_cache_chunks:
+                cache_id_list, cache_chunks = yield from self._generate_mini_batches(
+                    cache_id_list,
+                    cache_chunks,
+                    shuffle,
+                    state,
+                )
+
+            cache_id_list_dict[W] = cache_id_list
+            cache_chunks_dict[W] = cache_chunks
+
+        else:
+            for W in cache_id_list_dict:
+                cache_id_list = cache_id_list_dict.setdefault(W, [])
+                cache_chunks = cache_chunks_dict.setdefault(W, {})
+
+                yield from self._generate_mini_batches(
+                    cache_id_list,
+                    cache_chunks,
+                    shuffle,
+                    state,
+                )
+
+    def _generate_mini_batches(
+        self,
+        id_list: List[str],
+        batches: Dict[str, List[torch.Tensor]],
+        shuffle: bool,
+        state: np.random.RandomState,
+    ):
+        if shuffle:
+            indices = np.arange(0, len(id_list))
+            state.shuffle(indices)
+            batches = {k: [v[i] for i in indices] for k, v in batches.items()}
+            id_list = [id_list[i] for i in indices]
+
+        bs = self.batch_size
+        while len(id_list) >= bs:
+            # Make mini-batch and yield
+            yield (
+                id_list[:bs],
+                {k: torch.stack(v[:bs], 0) for k, v in batches.items()},
+            )
+            id_list = id_list[bs:]
+            batches = {k: v[bs:] for k, v in batches.items()}
+
+        return id_list, batches
diff --git a/funasr/iterators/multiple_iter_factory.py b/funasr/iterators/multiple_iter_factory.py
new file mode 100644
index 000000000..088016cf3
--- /dev/null
+++ b/funasr/iterators/multiple_iter_factory.py
@@ -0,0 +1,37 @@
+import logging
+from typing import Callable
+from typing import Collection
+from typing import Iterator
+
+import numpy as np
+from typeguard import check_argument_types
+
+from funasr.iterators.abs_iter_factory import AbsIterFactory
+
+
+class MultipleIterFactory(AbsIterFactory):
+    def __init__(
+        self,
+        build_funcs: Collection[Callable[[], AbsIterFactory]],
+        seed: int = 0,
+        shuffle: bool = False,
+    ):
+        assert check_argument_types()
+        self.build_funcs = list(build_funcs)
+        self.seed = seed
+        self.shuffle = shuffle
+
+    def build_iter(self, epoch: int, shuffle: bool = None) -> Iterator:
+        if shuffle is None:
+            shuffle = self.shuffle
+
+        build_funcs = list(self.build_funcs)
+
+        if shuffle:
+            np.random.RandomState(epoch + self.seed).shuffle(build_funcs)
+
+        for i, build_func in enumerate(build_funcs):
+            logging.info(f"Building {i}th iter-factory...")
+            iter_factory = build_func()
+            assert isinstance(iter_factory, AbsIterFactory), type(iter_factory)
+            yield from iter_factory.build_iter(epoch, shuffle)
diff --git a/funasr/iterators/sequence_iter_factory.py b/funasr/iterators/sequence_iter_factory.py
new file mode 100644
index 000000000..39d083446
--- /dev/null
+++ b/funasr/iterators/sequence_iter_factory.py
@@ -0,0 +1,143 @@
+from typing import Any
+from typing import Sequence
+from typing import Union
+
+import numpy as np
+from torch.utils.data import DataLoader
+from typeguard import check_argument_types
+
+from funasr.iterators.abs_iter_factory import AbsIterFactory
+from funasr.samplers.abs_sampler import AbsSampler
+
+
+class RawSampler(AbsSampler):
+    def __init__(self, batches):
+        self.batches = batches
+
+    def __len__(self):
+        return len(self.batches)
+
+    def __iter__(self):
+        return iter(self.batches)
+
+    def generate(self, seed):
+        return list(self.batches)
+
+
+class SequenceIterFactory(AbsIterFactory):
+    """Build iterator for each epoch.
+
+    This class simply creates a PyTorch DataLoader, with two differences:
+    - The random seed is derived from the epoch number. This guarantees
+      reproducibility when resuming from the middle of training.
+    - The number of iterations per epoch can be restricted. This controls
+      the interval between training and evaluation.
+
+    """
+
+    def __init__(
+        self,
+        dataset,
+        batches: Union[AbsSampler, Sequence[Sequence[Any]]],
+        num_iters_per_epoch: int = None,
+        seed: int = 0,
+        shuffle: bool = False,
+        num_workers: int = 0,
+        collate_fn=None,
+        pin_memory: bool = False,
+    ):
+        assert check_argument_types()
+
+        if not isinstance(batches, AbsSampler):
+            self.sampler = RawSampler(batches)
+        else:
+            self.sampler = batches
+
+        self.dataset = dataset
+        self.num_iters_per_epoch = num_iters_per_epoch
+        self.shuffle = shuffle
+        self.seed = seed
+        self.num_workers = num_workers
+        self.collate_fn = collate_fn
+        # https://discuss.pytorch.org/t/what-is-the-disadvantage-of-using-pin-memory/1702
+        self.pin_memory = pin_memory
+
+    def build_iter(self, epoch: int, shuffle: bool = None) -> DataLoader:
+        if shuffle is None:
+            shuffle = self.shuffle
+
+        if self.num_iters_per_epoch is not None:
+            N = len(self.sampler)
+            # If the corpus size is larger than num_iters_per_epoch
+            if self.num_iters_per_epoch < N:
+                N = len(self.sampler)
+                real_epoch, offset = divmod(self.num_iters_per_epoch * epoch, N)
+
+                if offset >= self.num_iters_per_epoch:
+                    current_batches = self.sampler.generate(real_epoch + self.seed)
+                    if shuffle:
+                        np.random.RandomState(real_epoch + self.seed).shuffle(
+                            current_batches
+                        )
+                    batches = current_batches[
+                        offset - self.num_iters_per_epoch : offset
+                    ]
+                else:
+                    prev_batches = self.sampler.generate(real_epoch - 1 + self.seed)
+                    current_batches = self.sampler.generate(real_epoch + self.seed)
+                    if shuffle:
+                        np.random.RandomState(real_epoch - 1 + self.seed).shuffle(
+                            prev_batches
+                        )
+                        np.random.RandomState(real_epoch + self.seed).shuffle(
+                            current_batches
+                        )
+                    batches = (
+                        prev_batches[offset - self.num_iters_per_epoch :]
+                        + current_batches[:offset]
+                    )
+
+            # If the corpus size is less than or equal to num_iters_per_epoch
+            else:
+                _epoch, _cursor = divmod(self.num_iters_per_epoch * (epoch - 1), N)
+                _remain = self.num_iters_per_epoch
+                batches = []
+                current_batches = self.sampler.generate(_epoch + self.seed)
+                if shuffle:
+                    np.random.RandomState(_epoch + self.seed).shuffle(current_batches)
+                while _remain > 0:
+
+                    _batches = current_batches[_cursor : _cursor + _remain]
+                    batches += _batches
+                    if _cursor + _remain >= N:
+                        _epoch += 1
+                        _cursor = 0
+                        current_batches = self.sampler.generate(_epoch + self.seed)
+                        if shuffle:
+                            np.random.RandomState(_epoch + self.seed).shuffle(
+                                current_batches
+                            )
+                    else:
+                        _cursor = _cursor + _remain
+                    _remain -= len(_batches)
+
+                assert len(batches) == self.num_iters_per_epoch
+
+        else:
+            batches = self.sampler.generate(epoch + self.seed)
+            if shuffle:
+                np.random.RandomState(epoch + self.seed).shuffle(batches)
+
+        # For backward compatibility for pytorch DataLoader
+        if self.collate_fn is not None:
+            kwargs = dict(collate_fn=self.collate_fn)
+        else:
+            kwargs = {}
+
+        return DataLoader(
+            dataset=self.dataset,
+            batch_sampler=batches,
+            num_workers=self.num_workers,
+            pin_memory=self.pin_memory,
+            **kwargs,
+        )
diff --git a/funasr/layers/__init__.py b/funasr/layers/__init__.py
new file mode 100644
index 000000000..e69de29bb
diff --git a/funasr/layers/abs_normalize.py b/funasr/layers/abs_normalize.py
new file mode 100644
index 000000000..f2be748dd
--- /dev/null
+++ b/funasr/layers/abs_normalize.py
@@ -0,0 +1,14 @@
+from abc import ABC
+from abc import abstractmethod
+from typing import Tuple
+
+import torch
+
+
+class AbsNormalize(torch.nn.Module, ABC):
+    @abstractmethod
+    def forward(
+        self, input: torch.Tensor, input_lengths: torch.Tensor = None
+    ) -> Tuple[torch.Tensor, torch.Tensor]:
+        # return output, output_lengths
+        raise NotImplementedError
diff --git a/funasr/layers/complex_utils.py b/funasr/layers/complex_utils.py
new file mode 100644
index 000000000..bf4799f58
--- /dev/null
+++ b/funasr/layers/complex_utils.py
@@ -0,0 +1,191 @@
+"""Utility functions for complex tensors (ComplexTensor and torch.complex)."""
+from distutils.version import LooseVersion
+from typing import Sequence
+from typing import Tuple
+from typing import Union
+
+import torch
+from torch_complex import functional as FC
+from torch_complex.tensor import ComplexTensor
+
+
+EPS = torch.finfo(torch.double).eps
+is_torch_1_8_plus = LooseVersion(torch.__version__) >= LooseVersion("1.8.0")
+is_torch_1_9_plus = LooseVersion(torch.__version__) >= LooseVersion("1.9.0")
+
+
+def new_complex_like(
+    ref: Union[torch.Tensor, ComplexTensor],
+    real_imag: Tuple[torch.Tensor, torch.Tensor],
+):
+    if isinstance(ref, ComplexTensor):
+        return ComplexTensor(*real_imag)
+    elif is_torch_complex_tensor(ref):
+        return torch.complex(*real_imag)
+    else:
+        raise ValueError(
+            "Please update your PyTorch version to 1.9+ for complex support."
+        )
+
+
+def is_torch_complex_tensor(c):
+    return (
+        not isinstance(c, ComplexTensor) and is_torch_1_9_plus and torch.is_complex(c)
+    )
+
+
+def is_complex(c):
+    return isinstance(c, ComplexTensor) or is_torch_complex_tensor(c)
+
+
+def to_double(c):
+    if not isinstance(c, ComplexTensor) and is_torch_1_9_plus and torch.is_complex(c):
+        return c.to(dtype=torch.complex128)
+    else:
+        return c.double()
+
+
+def to_float(c):
+    if not isinstance(c, ComplexTensor) and is_torch_1_9_plus and torch.is_complex(c):
+        return c.to(dtype=torch.complex64)
+    else:
+        return c.float()
+
+
+def cat(seq: Sequence[Union[ComplexTensor, torch.Tensor]], *args, **kwargs):
+    if not isinstance(seq, (list, tuple)):
+        raise TypeError(
+            "cat(): argument 'tensors' (position 1) must be tuple of Tensors, "
+            "not Tensor"
+        )
+    if isinstance(seq[0], ComplexTensor):
+        return FC.cat(seq, *args, **kwargs)
+    else:
+        return torch.cat(seq, *args, **kwargs)
+
+
+def complex_norm(
+    c: Union[torch.Tensor, ComplexTensor], dim=-1, keepdim=False
+) -> torch.Tensor:
+    if not is_complex(c):
+        raise TypeError("Input is not a complex tensor.")
+    if is_torch_complex_tensor(c):
+        return torch.norm(c, dim=dim, keepdim=keepdim)
+    else:
+        return torch.sqrt(
+            (c.real**2 + c.imag**2).sum(dim=dim, keepdim=keepdim) + EPS
+        )
+
+
+def einsum(equation, *operands):
+    # NOTE: Do not mix ComplexTensor and torch.complex in the input!
+    # NOTE (wangyou): Until PyTorch 1.9.0, torch.einsum does not support
+    # mixed input with complex and real tensors.
+    if len(operands) == 1:
+        if isinstance(operands[0], (tuple, list)):
+            operands = operands[0]
+        complex_module = FC if isinstance(operands[0], ComplexTensor) else torch
+        return complex_module.einsum(equation, *operands)
+    elif len(operands) != 2:
+        op0 = operands[0]
+        same_type = all(op.dtype == op0.dtype for op in operands[1:])
+        if same_type:
+            _einsum = FC.einsum if isinstance(op0, ComplexTensor) else torch.einsum
+            return _einsum(equation, *operands)
+        else:
+            raise ValueError("More than 2 operands of mixed dtypes are not supported.")
+    a, b = operands
+    if isinstance(a, ComplexTensor) or isinstance(b, ComplexTensor):
+        return FC.einsum(equation, a, b)
+    elif is_torch_1_9_plus and (torch.is_complex(a) or torch.is_complex(b)):
+        if not torch.is_complex(a):
+            o_real = torch.einsum(equation, a, b.real)
+            o_imag = torch.einsum(equation, a, b.imag)
+            return torch.complex(o_real, o_imag)
+        elif not torch.is_complex(b):
+            o_real = torch.einsum(equation, a.real, b)
+            o_imag = torch.einsum(equation, a.imag, b)
+            return torch.complex(o_real, o_imag)
+        else:
+            return torch.einsum(equation, a, b)
+    else:
+        return torch.einsum(equation, a, b)
+
+
+def inverse(
+    c: Union[torch.Tensor, ComplexTensor]
+) -> Union[torch.Tensor, ComplexTensor]:
+    if isinstance(c, ComplexTensor):
+        return c.inverse2()
+    else:
+        return c.inverse()
+
+
+def matmul(
+    a: Union[torch.Tensor, ComplexTensor], b: Union[torch.Tensor, ComplexTensor]
+) -> Union[torch.Tensor, ComplexTensor]:
+    # NOTE: Do not mix ComplexTensor and torch.complex in the input!
+    # NOTE (wangyou): Until PyTorch 1.9.0, torch.matmul does not support
+    # multiplication between complex and real tensors.
+    if isinstance(a, ComplexTensor) or isinstance(b, ComplexTensor):
+        return FC.matmul(a, b)
+    elif is_torch_1_9_plus and (torch.is_complex(a) or torch.is_complex(b)):
+        if not torch.is_complex(a):
+            o_real = torch.matmul(a, b.real)
+            o_imag = torch.matmul(a, b.imag)
+            return torch.complex(o_real, o_imag)
+        elif not torch.is_complex(b):
+            o_real = torch.matmul(a.real, b)
+            o_imag = torch.matmul(a.imag, b)
+            return torch.complex(o_real, o_imag)
+        else:
+            return torch.matmul(a, b)
+    else:
+        return torch.matmul(a, b)
+
+
+def trace(a: Union[torch.Tensor, ComplexTensor]):
+    # NOTE (wangyou): until PyTorch 1.9.0, torch.trace does not
+    # support batch processing. Use FC.trace() as fallback.
+    return FC.trace(a)
+
+
+def reverse(a: Union[torch.Tensor, ComplexTensor], dim=0):
+    if isinstance(a, ComplexTensor):
+        return FC.reverse(a, dim=dim)
+    else:
+        return torch.flip(a, dims=(dim,))
+
+
+def solve(b: Union[torch.Tensor, ComplexTensor], a: Union[torch.Tensor, ComplexTensor]):
+    """Solve the linear equation ax = b."""
+    # NOTE: Do not mix ComplexTensor and torch.complex in the input!
+    # NOTE (wangyou): Until PyTorch 1.9.0, torch.solve does not support
+    # mixed input with complex and real tensors.
+    if isinstance(a, ComplexTensor) or isinstance(b, ComplexTensor):
+        if isinstance(a, ComplexTensor) and isinstance(b, ComplexTensor):
+            return FC.solve(b, a, return_LU=False)
+        else:
+            return matmul(inverse(a), b)
+    elif is_torch_1_9_plus and (torch.is_complex(a) or torch.is_complex(b)):
+        if torch.is_complex(a) and torch.is_complex(b):
+            return torch.linalg.solve(a, b)
+        else:
+            return matmul(inverse(a), b)
+    else:
+        if is_torch_1_8_plus:
+            return torch.linalg.solve(a, b)
+        else:
+            return torch.solve(b, a)[0]
+
+
+def stack(seq: Sequence[Union[ComplexTensor, torch.Tensor]], *args, **kwargs):
+    if not isinstance(seq, (list, tuple)):
+        raise TypeError(
+            "stack(): argument 'tensors' (position 1) must be tuple of Tensors, "
+            "not Tensor"
+        )
+    if isinstance(seq[0], ComplexTensor):
+        return FC.stack(seq, *args, **kwargs)
+    else:
+        return torch.stack(seq, *args, **kwargs)
diff --git a/funasr/layers/global_mvn.py b/funasr/layers/global_mvn.py
new file mode 100644
index 000000000..5515cdde6
--- /dev/null
+++ b/funasr/layers/global_mvn.py
@@ -0,0 +1,121 @@
+from pathlib import Path
+from typing import Tuple
+from typing import Union
+
+import numpy as np
+import torch
+from typeguard import check_argument_types
+
+from funasr.modules.nets_utils import make_pad_mask
+from funasr.layers.abs_normalize import AbsNormalize
+from funasr.layers.inversible_interface import InversibleInterface
+
+
+class GlobalMVN(AbsNormalize, InversibleInterface):
+    """Apply global mean and variance normalization
+
+    TODO(kamo): Make this class portable somehow
+
+    Args:
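+        stats_file: npy file (Kaldi-like stats) or npz file (count, sum, sum_square)
+        norm_means: Apply mean normalization
+        norm_vars: Apply variance normalization
+        eps: Floor applied to the variance before taking the square root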
+    """
+
+    def __init__(
+        self,
+        stats_file: Union[Path, str],
+        norm_means: bool = True,
+        norm_vars: bool = True,
+        eps: float = 1.0e-20,
+    ):
+        assert check_argument_types()
+        super().__init__()
+        self.norm_means = norm_means
+        self.norm_vars = norm_vars
+        self.eps = eps
+        stats_file = Path(stats_file)
+
+        self.stats_file = stats_file
+        stats = np.load(stats_file)
+        if isinstance(stats, np.ndarray):
+            # Kaldi like stats
+            count = stats[0].flatten()[-1]
+            mean = stats[0, :-1] / count
+            var = stats[1, :-1] / count - mean * mean
+        else:
+            # New style: Npz file
+            count = stats["count"]
+            sum_v = stats["sum"]
+            sum_square_v = stats["sum_square"]
+            mean = sum_v / count
+            var = sum_square_v / count - mean * mean
+        std = np.sqrt(np.maximum(var, eps))
+
+        self.register_buffer("mean", torch.from_numpy(mean))
+        self.register_buffer("std", torch.from_numpy(std))
+
+    def extra_repr(self):
+        return (
+            f"stats_file={self.stats_file}, "
+            f"norm_means={self.norm_means}, norm_vars={self.norm_vars}"
+        )
+
+    def forward(
+        self, x: torch.Tensor, ilens: torch.Tensor = None
+    ) -> Tuple[torch.Tensor, torch.Tensor]:
+        """Forward function
+
+        Args:
+            x: (B, L, ...)
+            ilens: (B,)
+        """
+        if ilens is None:
+            ilens = x.new_full([x.size(0)], x.size(1))
+        norm_means = self.norm_means
+        norm_vars = self.norm_vars
+        self.mean = self.mean.to(x.device, x.dtype)
+        self.std = self.std.to(x.device, x.dtype)
+        mask = make_pad_mask(ilens, x, 1)
+
+        # feat: (B, T, D)
+        if norm_means:
+            if x.requires_grad:
+                x = x - self.mean
+            else:
+                x -= self.mean
+        if x.requires_grad:
+            x = x.masked_fill(mask, 0.0)
+        else:
+            x.masked_fill_(mask, 0.0)
+
+        if norm_vars:
+            x /= self.std
+
+        return x, ilens
+
+    def inverse(
+        self, x: torch.Tensor, ilens: torch.Tensor = None
+    ) -> Tuple[torch.Tensor, torch.Tensor]:
+        if ilens is None:
+            ilens = x.new_full([x.size(0)], x.size(1))
+        norm_means = self.norm_means
+        norm_vars = self.norm_vars
+        self.mean = self.mean.to(x.device, x.dtype)
+        self.std = self.std.to(x.device, x.dtype)
+        mask = make_pad_mask(ilens, x, 1)
+
+        if x.requires_grad:
+            x = x.masked_fill(mask, 0.0)
+        else:
+            x.masked_fill_(mask, 0.0)
+
+        if norm_vars:
+            x *= self.std
+
+        # feat: (B, T, D)
+        if norm_means:
+            x += self.mean
+            x.masked_fill_(make_pad_mask(ilens, x, 1), 0.0)
+        return x, ilens
diff --git a/funasr/layers/inversible_interface.py b/funasr/layers/inversible_interface.py
new file mode 100644
index 000000000..a1a59399a
--- /dev/null
+++ b/funasr/layers/inversible_interface.py
@@ -0,0 +1,14 @@
+from abc import ABC
+from abc import abstractmethod
+from typing import Tuple
+
+import torch
+
+
+class InversibleInterface(ABC):
+    @abstractmethod
+    def inverse(
+        self, input: torch.Tensor, input_lengths: torch.Tensor = None
+    ) -> Tuple[torch.Tensor, torch.Tensor]:
+        # return output, output_lengths
+        raise NotImplementedError
diff --git a/funasr/layers/label_aggregation.py b/funasr/layers/label_aggregation.py
new file mode 100644
index 000000000..075e19d90
--- /dev/null
+++ b/funasr/layers/label_aggregation.py
@@ -0,0 +1,82 @@
+import torch
+from typeguard import check_argument_types
+from typing import Optional
+from typing import Tuple
+
+from funasr.modules.nets_utils import make_pad_mask
+
+
+class LabelAggregate(torch.nn.Module):
+    def __init__(
+        self,
+        win_length: int = 512,
+        hop_length: int = 128,
+        center: bool = True,
+    ):
+        assert check_argument_types()
+        super().__init__()
+
+        self.win_length = win_length
+        self.hop_length = hop_length
+        self.center = center
+
+    def extra_repr(self):
+        return (
+            f"win_length={self.win_length}, "
+            f"hop_length={self.hop_length}, "
+            f"center={self.center}"
+        )
+
+    def forward(
+        self, input: torch.Tensor, ilens: torch.Tensor = None
+    ) -> Tuple[torch.Tensor, Optional[torch.Tensor]]:
+        """LabelAggregate forward function.
+
+        Args:
+            input: (Batch, Nsamples, Label_dim)
+            ilens: (Batch)
+        Returns:
+            output: (Batch, Frames, Label_dim)
+
+        """
+        bs = input.size(0)
+        max_length = input.size(1)
+        label_dim = input.size(2)
+
+        # NOTE(jiatong):
+        #   The default behaviour of label aggregation is compatible with
+        #   torch.stft about framing and padding.
+
+        # Step1: center padding
+        if self.center:
+            pad = self.win_length // 2
+            max_length = max_length + 2 * pad
+            input = torch.nn.functional.pad(input, (0, 0, pad, pad), "constant", 0)
+            input[:, :pad, :] = input[:, pad : (2 * pad), :]
+            input[:, (max_length - pad) : max_length, :] = input[
+                :, (max_length - 2 * pad) : (max_length - pad), :
+            ]
+        nframe = (max_length - self.win_length) // self.hop_length + 1
+
+        # Step2: framing
+        output = input.as_strided(
+            (bs, nframe, self.win_length, label_dim),
+            (max_length * label_dim, self.hop_length * label_dim, label_dim, 1),
+        )
+
+        # Step3: aggregate label
+        output = torch.gt(output.sum(dim=2, keepdim=False), self.win_length // 2)
+        output = output.float()
+
+        # Step4: process lengths
+        if ilens is not None:
+            if self.center:
+                pad = self.win_length // 2
+                ilens = ilens + 2 * pad
+
+            olens = (ilens - self.win_length) // self.hop_length + 1
+            output.masked_fill_(make_pad_mask(olens, output, 1), 0.0)
+        else:
+            olens = None
+
+        return output, olens
diff --git a/funasr/layers/log_mel.py b/funasr/layers/log_mel.py
new file mode 100644
index 000000000..2285f6d4f
--- /dev/null
+++ b/funasr/layers/log_mel.py
@@ -0,0 +1,83 @@
+import librosa
+import torch
+from typing import Tuple
+
+from funasr.modules.nets_utils import make_pad_mask
+
+
+class LogMel(torch.nn.Module):
+    """Convert STFT to fbank feats
+
+    The arguments are the same as those of librosa.filters.mel
+
+    Args:
+        fs: number > 0 [scalar] sampling rate of the incoming signal
+        n_fft: int > 0 [scalar] number of FFT components
+        n_mels: int > 0 [scalar] number of Mel bands to generate
+        fmin: float >= 0 [scalar] lowest frequency (in Hz)
+        fmax: float >= 0 [scalar] highest frequency (in Hz).
+            If `None`, use `fmax = fs / 2.0`
+        htk: use HTK formula instead of Slaney
+    """
+
+    def __init__(
+        self,
+        fs: int = 16000,
+        n_fft: int = 512,
+        n_mels: int = 80,
+        fmin: float = None,
+        fmax: float = None,
+        htk: bool = False,
+        log_base: float = None,
+    ):
+        super().__init__()
+
+        fmin = 0 if fmin is None else fmin
+        fmax = fs / 2 if fmax is None else fmax
+        _mel_options = dict(
+            sr=fs,
+            n_fft=n_fft,
+            n_mels=n_mels,
+            fmin=fmin,
+            fmax=fmax,
+            htk=htk,
+        )
+        self.mel_options = _mel_options
+        self.log_base = log_base
+
+        # Note(kamo): The mel matrix of librosa is different from kaldi.
+        melmat = librosa.filters.mel(**_mel_options)
+        # melmat: (D2, D1) -> (D1, D2)
+        self.register_buffer("melmat", torch.from_numpy(melmat.T).float())
+
+    def extra_repr(self):
+        return ", ".join(f"{k}={v}" for k, v in self.mel_options.items())
+
+    def forward(
+        self,
+        feat: torch.Tensor,
+        ilens: torch.Tensor = None,
+    ) -> Tuple[torch.Tensor, torch.Tensor]:
+        # feat: (B, T, D1) x melmat: (D1, D2) -> mel_feat: (B, T, D2)
+        mel_feat = torch.matmul(feat, self.melmat)
+        mel_feat = torch.clamp(mel_feat, min=1e-10)
+
+        if self.log_base is None:
+            logmel_feat = mel_feat.log()
+        elif self.log_base == 2.0:
+            logmel_feat = mel_feat.log2()
+        elif self.log_base == 10.0:
+            logmel_feat = mel_feat.log10()
+        else:
+            logmel_feat = mel_feat.log() / mel_feat.new_tensor(self.log_base).log()
+
+        # Zero padding
+        if ilens is not None:
+            logmel_feat = logmel_feat.masked_fill(
+                make_pad_mask(ilens, logmel_feat, 1), 0.0
+            )
+        else:
+            ilens = feat.new_full(
+                [feat.size(0)], fill_value=feat.size(1), dtype=torch.long
+            )
+        return logmel_feat, ilens
diff --git a/funasr/layers/mask_along_axis.py b/funasr/layers/mask_along_axis.py
new file mode 100644
index 000000000..e49e621cc
--- /dev/null
+++ b/funasr/layers/mask_along_axis.py
@@ -0,0 +1,340 @@
+import math
+import torch
+from typeguard import check_argument_types
+from typing import Sequence
+from typing import Union
+
+
+def mask_along_axis(
+    spec: torch.Tensor,
+    spec_lengths: torch.Tensor,
+    mask_width_range: Sequence[int] = (0, 30),
+    dim: int = 1,
+    num_mask: int = 2,
+    replace_with_zero: bool = True,
+):
+    """Apply mask along the specified direction.
+
+    Args:
+        spec: (Batch, Length, Freq)
+        spec_lengths: (Batch,): Lengths are not used in this implementation
+        mask_width_range: Select the width randomly between this range
+    """
+
+    org_size = spec.size()
+    if spec.dim() == 4:
+        # spec: (Batch, Channel, Length, Freq) -> (Batch * Channel, Length, Freq)
+        spec = spec.view(-1, spec.size(2), spec.size(3))
+
+    B = spec.shape[0]
+    # D = Length or Freq
+    D = spec.shape[dim]
+    # mask_length: (B, num_mask, 1)
+    mask_length = torch.randint(
+        mask_width_range[0],
+        mask_width_range[1],
+        (B, num_mask),
+        device=spec.device,
+    ).unsqueeze(2)
+
+    # mask_pos: (B, num_mask, 1)
+    mask_pos = torch.randint(
+        0, max(1, D - mask_length.max()), (B, num_mask), device=spec.device
+    ).unsqueeze(2)
+
+    # aran: (1, 1, D)
+    aran = torch.arange(D, device=spec.device)[None, None, :]
+    # mask: (Batch, num_mask, D)
+    mask = (mask_pos <= aran) * (aran < (mask_pos + mask_length))
+    # Combine masks (logical OR): (Batch, num_mask, D) -> (Batch, D)
+    mask = mask.any(dim=1)
+    if dim == 1:
+        # mask: (Batch, Length, 1)
+        mask = mask.unsqueeze(2)
+    elif dim == 2:
+        # mask: (Batch, 1, Freq)
+        mask = mask.unsqueeze(1)
+
+    if replace_with_zero:
+        value = 0.0
+    else:
+        value = spec.mean()
+
+    if spec.requires_grad:
+        spec = spec.masked_fill(mask, value)
+    else:
+        spec = spec.masked_fill_(mask, value)
+    spec = spec.view(*org_size)
+    return spec, spec_lengths
+
+def mask_along_axis_lfr(
+    spec: torch.Tensor,
+    spec_lengths: torch.Tensor,
+    mask_width_range: Sequence[int] = (0, 30),
+    dim: int = 1,
+    num_mask: int = 2,
+    replace_with_zero: bool = True,
+    lfr_rate: int = 1,
+):
+    """Apply mask along the specified direction.
+
+    Args:
+        spec: (Batch, Length, Freq)
+        spec_lengths: (Batch,): Lengths are not used in this implementation
+        mask_width_range: Select the width randomly between this range
+        lfr_rate: low frame rate
+    """
+
+    org_size = spec.size()
+    if spec.dim() == 4:
+        # spec: (Batch, Channel, Length, Freq) -> (Batch * Channel, Length, Freq)
+        spec = spec.view(-1, spec.size(2), spec.size(3))
+
+    B = spec.shape[0]
+    # D = Length or Freq
+    D = spec.shape[dim] // lfr_rate
+    # mask_length: (B, num_mask, 1)
+    mask_length = torch.randint(
+        mask_width_range[0],
+        mask_width_range[1],
+        (B, num_mask),
+        device=spec.device,
+    ).unsqueeze(2)
+    if lfr_rate > 1:
+        mask_length = mask_length.repeat(1, lfr_rate, 1)
+    # mask_pos: (B, num_mask, 1)
+    mask_pos = torch.randint(
+        0, max(1, D - mask_length.max()), (B, num_mask), device=spec.device
+    ).unsqueeze(2)
+    if lfr_rate > 1:
+        mask_pos_raw = mask_pos.clone()
+        mask_pos = torch.zeros((B, 0, 1), device=spec.device, dtype=torch.int32)
+        for i in range(lfr_rate):
+            mask_pos_i = mask_pos_raw + D * i
+            mask_pos = torch.cat((mask_pos, mask_pos_i), dim=1)
+    # aran: (1, 1, D)
+    D = spec.shape[dim]
+    aran = torch.arange(D, device=spec.device)[None, None, :]
+    # mask: (Batch, num_mask, D)
+    mask = (mask_pos <= aran) * (aran < (mask_pos + mask_length))
+    # Combine masks (logical OR): (Batch, num_mask, D) -> (Batch, D)
+    mask = mask.any(dim=1)
+    if dim == 1:
+        # mask: (Batch, Length, 1)
+        mask = mask.unsqueeze(2)
+    elif dim == 2:
+        # mask: (Batch, 1, Freq)
+        mask = mask.unsqueeze(1)
+
+    if replace_with_zero:
+        value = 0.0
+    else:
+        value = spec.mean()
+
+    if spec.requires_grad:
+        spec = spec.masked_fill(mask, value)
+    else:
+        spec = spec.masked_fill_(mask, value)
+    spec = spec.view(*org_size)
+    return spec, spec_lengths
+
+
+class MaskAlongAxis(torch.nn.Module):
+    def __init__(
+        self,
+        mask_width_range: Union[int, Sequence[int]] = (0, 30),
+        num_mask: int = 2,
+        dim: Union[int, str] = "time",
+        replace_with_zero: bool = True,
+    ):
+        assert check_argument_types()
+        if isinstance(mask_width_range, int):
+            mask_width_range = (0, mask_width_range)
+        if len(mask_width_range) != 2:
+            raise TypeError(
+                f"mask_width_range must be a tuple of int and int values: "
+                f"{mask_width_range}",
+            )
+
+        assert mask_width_range[1] > mask_width_range[0]
+        if isinstance(dim, str):
+            if dim == "time":
+                dim = 1
+            elif dim == "freq":
+                dim = 2
+            else:
+                raise ValueError("dim must be int, 'time' or 'freq'")
+        if dim == 1:
+            self.mask_axis = "time"
+        elif dim == 2:
+            self.mask_axis = "freq"
+        else:
+            self.mask_axis = "unknown"
+
+        super().__init__()
+        self.mask_width_range = mask_width_range
+        self.num_mask = num_mask
+        self.dim = dim
+        self.replace_with_zero = replace_with_zero
+
+    def extra_repr(self):
+        return (
+            f"mask_width_range={self.mask_width_range}, "
+            f"num_mask={self.num_mask}, axis={self.mask_axis}"
+        )
+
+    def forward(self, spec: torch.Tensor, spec_lengths: torch.Tensor = None):
+        """Forward function.
+
+        Args:
+            spec: (Batch, Length, Freq)
+        """
+
+        return mask_along_axis(
+            spec,
+            spec_lengths,
+            mask_width_range=self.mask_width_range,
+            dim=self.dim,
+            num_mask=self.num_mask,
+            replace_with_zero=self.replace_with_zero,
+        )
+
+
+class MaskAlongAxisVariableMaxWidth(torch.nn.Module):
+    """Mask input spec along a specified axis with variable maximum width.
+
+    Formula:
+        max_width = max_width_ratio * seq_len
+    """
+
+    def __init__(
+        self,
+        mask_width_ratio_range: Union[float, Sequence[float]] = (0.0, 0.05),
+        num_mask: int = 2,
+        dim: Union[int, str] = "time",
+        replace_with_zero: bool = True,
+    ):
+        assert check_argument_types()
+        if isinstance(mask_width_ratio_range, float):
+            mask_width_ratio_range = (0.0, mask_width_ratio_range)
+        if len(mask_width_ratio_range) != 2:
+            raise TypeError(
+                f"mask_width_ratio_range must be a tuple of float and float values: "
+                f"{mask_width_ratio_range}",
+            )
+
+        assert mask_width_ratio_range[1] > mask_width_ratio_range[0]
+        if isinstance(dim, str):
+            if dim == "time":
+                dim = 1
+            elif dim == "freq":
+                dim = 2
+            else:
+                raise ValueError("dim must be int, 'time' or 'freq'")
+        if dim == 1:
+            self.mask_axis = "time"
+        elif dim == 2:
+            self.mask_axis = "freq"
+        else:
+            self.mask_axis = "unknown"
+
+        super().__init__()
+        self.mask_width_ratio_range = mask_width_ratio_range
+        self.num_mask = num_mask
+        self.dim = dim
+        self.replace_with_zero = replace_with_zero
+
+    def extra_repr(self):
+        return (
+            f"mask_width_ratio_range={self.mask_width_ratio_range}, "
+            f"num_mask={self.num_mask}, axis={self.mask_axis}"
+        )
+
+    def forward(self, spec: torch.Tensor, spec_lengths: torch.Tensor = None):
+        """Forward function.
+
+        Args:
+            spec: (Batch, Length, Freq)
+        """
+
+        max_seq_len = spec.shape[self.dim]
+        min_mask_width = math.floor(max_seq_len * self.mask_width_ratio_range[0])
+        min_mask_width = max([0, min_mask_width])
+        max_mask_width = math.floor(max_seq_len * self.mask_width_ratio_range[1])
+        max_mask_width = min([max_seq_len, max_mask_width])
+
+        if max_mask_width > min_mask_width:
+            return mask_along_axis(
+                spec,
+                spec_lengths,
+                mask_width_range=(min_mask_width, max_mask_width),
+                dim=self.dim,
+                num_mask=self.num_mask,
+                replace_with_zero=self.replace_with_zero,
+            )
+        return spec, spec_lengths
+
+class MaskAlongAxisLFR(torch.nn.Module):
+    def __init__(
+        self,
+        mask_width_range: Union[int, Sequence[int]] = (0, 30),
+        num_mask: int = 2,
+        dim: Union[int, str] = "time",
+        replace_with_zero: bool = True,
+        lfr_rate: int = 1,
+    ):
+        assert check_argument_types()
+        if isinstance(mask_width_range, int):
+            mask_width_range = (0, mask_width_range)
+        if len(mask_width_range) != 2:
+            raise TypeError(
+                f"mask_width_range must be a tuple of int and int values: "
+                f"{mask_width_range}",
+            )
+
+        assert mask_width_range[1] > mask_width_range[0]
+        if isinstance(dim, str):
+            if dim == "time":
+                dim = 1
+                lfr_rate = 1
+            elif dim == "freq":
+                dim = 2
+            else:
+                raise ValueError("dim must be int, 'time' or 'freq'")
+        if dim == 1:
+            self.mask_axis = "time"
+            lfr_rate = 1
+        elif dim == 2:
+            self.mask_axis = "freq"
+        else:
+            self.mask_axis = "unknown"
+
+        super().__init__()
+        self.mask_width_range = mask_width_range
+        self.num_mask = num_mask
+        self.dim = dim
+        self.replace_with_zero = replace_with_zero
+        self.lfr_rate = lfr_rate
+
+    def extra_repr(self):
+        return (
+            f"mask_width_range={self.mask_width_range}, "
+            f"num_mask={self.num_mask}, axis={self.mask_axis}"
+        )
+
+    def forward(self, spec: torch.Tensor, spec_lengths: torch.Tensor = None):
+        """Forward function.
+
+        Args:
+            spec: (Batch, Length, Freq)
+        """
+
+        return mask_along_axis_lfr(
+            spec,
+            spec_lengths,
+            mask_width_range=self.mask_width_range,
+            dim=self.dim,
+            num_mask=self.num_mask,
+            replace_with_zero=self.replace_with_zero,
+            lfr_rate=self.lfr_rate,
+        )
\ No newline at end of file
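The span-mask construction in `mask_along_axis_lfr` above can be illustrated without tensors. The following is a minimal pure-Python sketch, not the module's implementation: `span_mask` and `lfr_spans` are hypothetical helper names, and the fixed starts and widths stand in for the values the real code draws with `torch.randint`.

```python
from typing import List, Sequence, Tuple


def span_mask(length: int, starts: Sequence[int], widths: Sequence[int]) -> List[bool]:
    """Return a mask that is True inside any [start, start + width) span.

    Mirrors `(mask_pos <= aran) * (aran < mask_pos + mask_length)` followed by
    `any(dim=1)`, for a single sequence.
    """
    return [
        any(s <= i < s + w for s, w in zip(starts, widths))
        for i in range(length)
    ]


def lfr_spans(
    starts: Sequence[int], widths: Sequence[int], base_dim: int, lfr_rate: int
) -> Tuple[List[int], List[int]]:
    """Repeat every span in each of the lfr_rate blocks of size base_dim."""
    new_starts = [s + base_dim * i for i in range(lfr_rate) for s in starts]
    new_widths = list(widths) * lfr_rate
    return new_starts, new_widths


# Two spans on an axis of length 8.
plain = span_mask(8, starts=[1, 5], widths=[2, 1])

# One span, repeated across 2 stacked blocks of 3 bins each (axis length 6).
starts, widths = lfr_spans([1], [1], base_dim=3, lfr_rate=2)
stacked = span_mask(6, starts, widths)
```

With `lfr_rate=2`, the same bin is masked in both stacked blocks, which is the effect of offsetting the positions by `D * i` in the tensor code.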
diff --git a/funasr/layers/sinc_conv.py b/funasr/layers/sinc_conv.py
new file mode 100644
index 000000000..33df97fbc
--- /dev/null
+++ b/funasr/layers/sinc_conv.py
@@ -0,0 +1,273 @@
+#!/usr/bin/env python3
+#  2020, Technische Universität München;  Ludwig Kürzinger
+#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)
+
+"""Sinc convolutions."""
+import math
+import torch
+from typeguard import check_argument_types
+from typing import Union
+
+
+class LogCompression(torch.nn.Module):
+    """Log Compression Activation.
+
+    Activation function `log(abs(x) + 1)`.
+    """
+
+    def __init__(self):
+        """Initialize."""
+        super().__init__()
+
+    def forward(self, x: torch.Tensor) -> torch.Tensor:
+        """Forward.
+
+        Applies the Log Compression function elementwise on tensor x.
+        """
+        return torch.log(torch.abs(x) + 1)
+
+
+class SincConv(torch.nn.Module):
+    """Sinc Convolution.
+
+    This module performs a convolution using Sinc filters in time domain as kernel.
+    Sinc filters function as band passes in spectral domain.
+    The filtering is done as a convolution in time domain, and no transformation
+    to spectral domain is necessary.
+
+    This implementation of the Sinc convolution is heavily inspired
+    by Ravanelli et al. https://github.com/mravanelli/SincNet,
+    and adapted for the ESPnet toolkit.
+    Combine Sinc convolutions with a log compression activation function, as in:
+    https://arxiv.org/abs/2010.07597
+
+    Notes:
+    Currently, the same filters are applied to all input channels.
+    The windowing function is applied on the kernel to obtain a smoother filter,
+    and not on the input values, which differs from traditional ASR.
+    """
+
+    def __init__(
+        self,
+        in_channels: int,
+        out_channels: int,
+        kernel_size: int,
+        stride: int = 1,
+        padding: int = 0,
+        dilation: int = 1,
+        window_func: str = "hamming",
+        scale_type: str = "mel",
+        fs: Union[int, float] = 16000,
+    ):
+        """Initialize Sinc convolutions.
+
+        Args:
+            in_channels: Number of input channels.
+            out_channels: Number of output channels.
+            kernel_size: Sinc filter kernel size (needs to be an odd number).
+            stride: See torch.nn.functional.conv1d.
+            padding: See torch.nn.functional.conv1d.
+            dilation: See torch.nn.functional.conv1d.
+            window_func: Window function on the filter, one of ["hamming", "none"].
+            fs (int, float): Sample rate of the input data.
+        """
+        assert check_argument_types()
+        super().__init__()
+        window_funcs = {
+            "none": self.none_window,
+            "hamming": self.hamming_window,
+        }
+        if window_func not in window_funcs:
+            raise NotImplementedError(
+                f"Window function has to be one of {list(window_funcs.keys())}",
+            )
+        self.window_func = window_funcs[window_func]
+        scale_choices = {
+            "mel": MelScale,
+            "bark": BarkScale,
+        }
+        if scale_type not in scale_choices:
+            raise NotImplementedError(
+                f"Scale has to be one of {list(scale_choices.keys())}",
+            )
+        self.scale = scale_choices[scale_type]
+        self.in_channels = in_channels
+        self.out_channels = out_channels
+        self.kernel_size = kernel_size
+        self.padding = padding
+        self.dilation = dilation
+        self.stride = stride
+        self.fs = float(fs)
+        if self.kernel_size % 2 == 0:
+            raise ValueError("SincConv: Kernel size must be odd.")
+        self.f = None
+        N = self.kernel_size // 2
+        self._x = 2 * math.pi * torch.linspace(1, N, N)
+        self._window = self.window_func(torch.linspace(1, N, N))
+        # init may get overwritten by E2E network,
+        # but is still required to calculate output dim
+        self.init_filters()
+
+    @staticmethod
+    def sinc(x: torch.Tensor) -> torch.Tensor:
+        """Sinc function."""
+        x2 = x + 1e-6
+        return torch.sin(x2) / x2
+
+    @staticmethod
+    def none_window(x: torch.Tensor) -> torch.Tensor:
+        """Identity-like windowing function."""
+        return torch.ones_like(x)
+
+    @staticmethod
+    def hamming_window(x: torch.Tensor) -> torch.Tensor:
+        """Hamming Windowing function."""
+        L = 2 * x.size(0) + 1
+        x = x.flip(0)
+        return 0.54 - 0.46 * torch.cos(2.0 * math.pi * x / L)
+
+    def init_filters(self):
+        """Initialize filters with filterbank values."""
+        f = self.scale.bank(self.out_channels, self.fs)
+        f = torch.div(f, self.fs)
+        self.f = torch.nn.Parameter(f, requires_grad=True)
+
+    def _create_filters(self, device: str):
+        """Calculate coefficients.
+
+        This function (re-)calculates the filter convolutions coefficients.
+        """
+        f_mins = torch.abs(self.f[:, 0])
+        f_maxs = torch.abs(self.f[:, 0]) + torch.abs(self.f[:, 1] - self.f[:, 0])
+
+        self._x = self._x.to(device)
+        self._window = self._window.to(device)
+
+        f_mins_x = torch.matmul(f_mins.view(-1, 1), self._x.view(1, -1))
+        f_maxs_x = torch.matmul(f_maxs.view(-1, 1), self._x.view(1, -1))
+
+        kernel = (torch.sin(f_maxs_x) - torch.sin(f_mins_x)) / (0.5 * self._x)
+        kernel = kernel * self._window
+
+        kernel_left = kernel.flip(1)
+        kernel_center = (2 * f_maxs - 2 * f_mins).unsqueeze(1)
+        filters = torch.cat([kernel_left, kernel_center, kernel], dim=1)
+
+        filters = filters.view(filters.size(0), 1, filters.size(1))
+        self.sinc_filters = filters
+
+    def forward(self, xs: torch.Tensor) -> torch.Tensor:
+        """Sinc convolution forward function.
+
+        Args:
+            xs: Batch in form of torch.Tensor (B, C_in, D_in).
+
+        Returns:
+            xs: Batch in form of torch.Tensor (B, C_out, D_out).
+        """
+        self._create_filters(xs.device)
+        xs = torch.nn.functional.conv1d(
+            xs,
+            self.sinc_filters,
+            padding=self.padding,
+            stride=self.stride,
+            dilation=self.dilation,
+            groups=self.in_channels,
+        )
+        return xs
+
+    def get_odim(self, idim: int) -> int:
+        """Obtain the output dimension of the filter."""
+        D_out = idim + 2 * self.padding - self.dilation * (self.kernel_size - 1) - 1
+        D_out = (D_out // self.stride) + 1
+        return D_out
+
+
+class MelScale:
+    """Mel frequency scale."""
+
+    @staticmethod
+    def convert(f):
+        """Convert Hz to mel."""
+        return 1125.0 * torch.log(torch.div(f, 700.0) + 1.0)
+
+    @staticmethod
+    def invert(x):
+        """Convert mel to Hz."""
+        return 700.0 * (torch.exp(torch.div(x, 1125.0)) - 1.0)
+
+    @classmethod
+    def bank(cls, channels: int, fs: float) -> torch.Tensor:
+        """Obtain initialization values for the mel scale.
+
+        Args:
+            channels: Number of channels.
+            fs: Sample rate.
+
+        Returns:
+            torch.Tensor: Filter start and stop frequencies,
+                stacked along dim 1 with shape (channels, 2).
+        """
+        assert check_argument_types()
+        # min and max bandpass edge frequencies
+        min_frequency = torch.tensor(30.0)
+        max_frequency = torch.tensor(fs * 0.5)
+        frequencies = torch.linspace(
+            cls.convert(min_frequency), cls.convert(max_frequency), channels + 2
+        )
+        frequencies = cls.invert(frequencies)
+        f1, f2 = frequencies[:-2], frequencies[2:]
+        return torch.stack([f1, f2], dim=1)
+
+
+class BarkScale:
+    """Bark frequency scale.
+
+    Has wider bandwidths at lower frequencies, see:
+    Critical bandwidth: BARK
+    Zwicker and Terhardt, 1980
+    """
+
+    @staticmethod
+    def convert(f):
+        """Convert Hz to Bark."""
+        b = torch.div(f, 1000.0)
+        b = torch.pow(b, 2.0) * 1.4
+        b = torch.pow(b + 1.0, 0.69)
+        return b * 75.0 + 25.0
+
+    @staticmethod
+    def invert(x):
+        """Convert Bark to Hz."""
+        f = torch.div(x - 25.0, 75.0)
+        f = torch.pow(f, (1.0 / 0.69))
+        f = torch.div(f - 1.0, 1.4)
+        f = torch.pow(f, 0.5)
+        return f * 1000.0
+
+    @classmethod
+    def bank(cls, channels: int, fs: float) -> torch.Tensor:
+        """Obtain initialization values for the Bark scale.
+
+        Args:
+            channels: Number of channels.
+            fs: Sample rate.
+
+        Returns:
+            torch.Tensor: Filter start and stop frequencies,
+                stacked along dim 1 with shape (channels, 2).
+        """
+        assert check_argument_types()
+        # min and max BARK center frequencies by approximation
+        min_center_frequency = torch.tensor(70.0)
+        max_center_frequency = torch.tensor(fs * 0.45)
+        center_frequencies = torch.linspace(
+            cls.convert(min_center_frequency),
+            cls.convert(max_center_frequency),
+            channels,
+        )
+        center_frequencies = cls.invert(center_frequencies)
+
+        f1 = center_frequencies - torch.div(cls.convert(center_frequencies), 2)
+        f2 = center_frequencies + torch.div(cls.convert(center_frequencies), 2)
+        return torch.stack([f1, f2], dim=1)
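The mel filterbank initialization (`MelScale.bank`) and the output-length formula (`SincConv.get_odim`) in the file above reduce to a few lines of arithmetic. Here is a minimal sketch using only the standard library; the function names are hypothetical and plain floats replace tensors, but the constants (1125, 700, the 30 Hz lower edge, `fs / 2` upper edge) are taken from the code above.

```python
import math
from typing import List, Tuple


def hz_to_mel(f: float) -> float:
    """Convert Hz to mel, as in MelScale.convert."""
    return 1125.0 * math.log(f / 700.0 + 1.0)


def mel_to_hz(m: float) -> float:
    """Convert mel to Hz, as in MelScale.invert."""
    return 700.0 * (math.exp(m / 1125.0) - 1.0)


def mel_band_edges(channels: int, fs: float) -> List[Tuple[float, float]]:
    """Return (start, stop) frequencies in Hz for each band-pass filter."""
    lo, hi = hz_to_mel(30.0), hz_to_mel(fs * 0.5)
    step = (hi - lo) / (channels + 1)
    # channels + 2 points, evenly spaced on the mel scale
    points = [mel_to_hz(lo + step * k) for k in range(channels + 2)]
    # filter k spans points[k] .. points[k + 2], so neighbours overlap
    return list(zip(points[:-2], points[2:]))


def conv1d_out_len(
    idim: int, kernel_size: int, stride: int = 1, padding: int = 0, dilation: int = 1
) -> int:
    """Output length of a 1-D convolution, as in SincConv.get_odim."""
    d_out = idim + 2 * padding - dilation * (kernel_size - 1) - 1
    return d_out // stride + 1


edges = mel_band_edges(channels=4, fs=16000.0)
out_len = conv1d_out_len(400, kernel_size=101)  # -> 300
```

In the module itself these edges are additionally divided by `fs` before being stored as the trainable parameter `f`.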
diff --git a/funasr/layers/stft.py b/funasr/layers/stft.py
new file mode 100644
index 000000000..21beaae6f
--- /dev/null
+++ b/funasr/layers/stft.py
@@ -0,0 +1,229 @@
+from distutils.version import LooseVersion
+from typing import Optional
+from typing import Tuple
+from typing import Union
+
+import torch
+from torch_complex.tensor import ComplexTensor
+from typeguard import check_argument_types
+
+from funasr.modules.nets_utils import make_pad_mask
+from funasr.layers.complex_utils import is_complex
+from funasr.layers.inversible_interface import InversibleInterface
+import librosa
+import numpy as np
+
+is_torch_1_9_plus = LooseVersion(torch.__version__) >= LooseVersion("1.9.0")
+
+
+is_torch_1_7_plus = LooseVersion(torch.__version__) >= LooseVersion("1.7")
+
+
+class Stft(torch.nn.Module, InversibleInterface):
+    def __init__(
+        self,
+        n_fft: int = 512,
+        win_length: int = None,
+        hop_length: int = 128,
+        window: Optional[str] = "hann",
+        center: bool = True,
+        normalized: bool = False,
+        onesided: bool = True,
+    ):
+        assert check_argument_types()
+        super().__init__()
+        self.n_fft = n_fft
+        if win_length is None:
+            self.win_length = n_fft
+        else:
+            self.win_length = win_length
+        self.hop_length = hop_length
+        self.center = center
+        self.normalized = normalized
+        self.onesided = onesided
+        if window is not None and not hasattr(torch, f"{window}_window"):
+            raise ValueError(f"{window} window is not implemented")
+        self.window = window
+
+    def extra_repr(self):
+        return (
+            f"n_fft={self.n_fft}, "
+            f"win_length={self.win_length}, "
+            f"hop_length={self.hop_length}, "
+            f"center={self.center}, "
+            f"normalized={self.normalized}, "
+            f"onesided={self.onesided}"
+        )
+
+    def forward(
+        self, input: torch.Tensor, ilens: torch.Tensor = None
+    ) -> Tuple[torch.Tensor, Optional[torch.Tensor]]:
+        """STFT forward function.
+
+        Args:
+            input: (Batch, Nsamples) or (Batch, Nsamples, Channels)
+            ilens: (Batch)
+        Returns:
+            output: (Batch, Frames, Freq, 2) or (Batch, Frames, Channels, Freq, 2)
+
+        """
+        bs = input.size(0)
+        if input.dim() == 3:
+            multi_channel = True
+            # input: (Batch, Nsample, Channels) -> (Batch * Channels, Nsample)
+            input = input.transpose(1, 2).reshape(-1, input.size(1))
+        else:
+            multi_channel = False
+
+        # NOTE(kamo):
+        #   The default behaviour of torch.stft is compatible with librosa.stft
+        #   about padding and scaling.
+        #   Note that it's different from scipy.signal.stft
+
+        # output: (Batch, Freq, Frames, 2=real_imag)
+        # or (Batch, Channel, Freq, Frames, 2=real_imag)
+        if self.window is not None:
+            window_func = getattr(torch, f"{self.window}_window")
+            window = window_func(
+                self.win_length, dtype=input.dtype, device=input.device
+            )
+        else:
+            window = None
+
+        # For compatibility with ARM devices, which do not support
+        # torch.stft() due to the lack of MKL.
+        if input.is_cuda or torch.backends.mkl.is_available():
+            stft_kwargs = dict(
+                n_fft=self.n_fft,
+                win_length=self.win_length,
+                hop_length=self.hop_length,
+                center=self.center,
+                window=window,
+                normalized=self.normalized,
+                onesided=self.onesided,
+            )
+            if is_torch_1_7_plus:
+                stft_kwargs["return_complex"] = False
+            output = torch.stft(input, **stft_kwargs)
+        else:
+            if self.training:
+                raise NotImplementedError(
+                    "stft is implemented with librosa on this device, which does not "
+                    "support the training mode."
+                )
+
+            # use stft_kwargs to flexibly control different PyTorch versions' kwargs
+            stft_kwargs = dict(
+                n_fft=self.n_fft,
+                win_length=self.win_length,
+                hop_length=self.hop_length,
+                center=self.center,
+                window=window,
+            )
+
+            if window is not None:
+                # pad the given window to n_fft
+                n_pad_left = (self.n_fft - window.shape[0]) // 2
+                n_pad_right = self.n_fft - window.shape[0] - n_pad_left
+                stft_kwargs["window"] = torch.cat(
+                    [torch.zeros(n_pad_left), window, torch.zeros(n_pad_right)], 0
+                ).numpy()
+            else:
+                win_length = (
+                    self.win_length if self.win_length is not None else self.n_fft
+                )
+                stft_kwargs["window"] = torch.ones(win_length)
+
+            output = []
+            # iterate over instances in a batch
+            for i, instance in enumerate(input):
+                stft = librosa.stft(input[i].numpy(), **stft_kwargs)
+                output.append(torch.tensor(np.stack([stft.real, stft.imag], -1)))
+            output = torch.stack(output, 0)
+            if not self.onesided:
+                len_conj = self.n_fft - output.shape[1]
+                conj = output[:, 1 : 1 + len_conj].flip(1)
+                conj[:, :, :, -1].data *= -1
+                output = torch.cat([output, conj], 1)
+            if self.normalized:
+                output = output * (stft_kwargs["window"].shape[0] ** (-0.5))
+
+        # output: (Batch, Freq, Frames, 2=real_imag)
+        # -> (Batch, Frames, Freq, 2=real_imag)
+        output = output.transpose(1, 2)
+        if multi_channel:
+            # output: (Batch * Channel, Frames, Freq, 2=real_imag)
+            # -> (Batch, Frame, Channel, Freq, 2=real_imag)
+            output = output.view(bs, -1, output.size(1), output.size(2), 2).transpose(
+                1, 2
+            )
+
+        if ilens is not None:
+            if self.center:
+                pad = self.n_fft // 2
+                ilens = ilens + 2 * pad
+
+            olens = (ilens - self.n_fft) // self.hop_length + 1
+            output.masked_fill_(make_pad_mask(olens, output, 1), 0.0)
+        else:
+            olens = None
+
+        return output, olens
+
+    def inverse(
+        self, input: Union[torch.Tensor, ComplexTensor], ilens: torch.Tensor = None
+    ) -> Tuple[torch.Tensor, Optional[torch.Tensor]]:
+        """Inverse STFT.
+
+        Args:
+            input: Tensor(batch, T, F, 2) or ComplexTensor(batch, T, F)
+            ilens: (batch,)
+        Returns:
+            wavs: (batch, samples)
+            ilens: (batch,)
+        """
+        if LooseVersion(torch.__version__) >= LooseVersion("1.6.0"):
+            istft = torch.functional.istft
+        else:
+            try:
+                import torchaudio
+            except ImportError:
+                raise ImportError(
+                    "Please install torchaudio>=0.3.0 or use torch>=1.6.0"
+                )
+
+            if not hasattr(torchaudio.functional, "istft"):
+                raise ImportError(
+                    "Please install torchaudio>=0.3.0 or use torch>=1.6.0"
+                )
+            istft = torchaudio.functional.istft
+
+        if self.window is not None:
+            window_func = getattr(torch, f"{self.window}_window")
+            if is_complex(input):
+                datatype = input.real.dtype
+            else:
+                datatype = input.dtype
+            window = window_func(self.win_length, dtype=datatype, device=input.device)
+        else:
+            window = None
+
+        if is_complex(input):
+            input = torch.stack([input.real, input.imag], dim=-1)
+        elif input.shape[-1] != 2:
+            raise TypeError("Invalid input type")
+        input = input.transpose(1, 2)
+
+        wavs = istft(
+            input,
+            n_fft=self.n_fft,
+            hop_length=self.hop_length,
+            win_length=self.win_length,
+            window=window,
+            center=self.center,
+            normalized=self.normalized,
+            onesided=self.onesided,
+            length=ilens.max() if ilens is not None else ilens,
+        )
+
+        return wavs, ilens
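The output-length computation at the end of `Stft.forward` above (`olens`) can be checked by hand. The sketch below restates it as a standalone function; `stft_num_frames` is a hypothetical name, and the defaults mirror the constructor defaults of `Stft`.

```python
def stft_num_frames(
    num_samples: int, n_fft: int = 512, hop_length: int = 128, center: bool = True
) -> int:
    """Number of STFT frames for one utterance, following Stft.forward."""
    if center:
        # center=True pads n_fft // 2 samples on both sides of the signal
        num_samples += 2 * (n_fft // 2)
    return (num_samples - n_fft) // hop_length + 1


centered = stft_num_frames(16000)                  # -> 126
uncentered = stft_num_frames(16000, center=False)  # -> 122
```

One second of 16 kHz audio therefore yields 126 frames with the default settings; frames beyond this count are the ones zeroed by `make_pad_mask`.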
diff --git a/funasr/layers/time_warp.py b/funasr/layers/time_warp.py
new file mode 100644
index 000000000..b55461872
--- /dev/null
+++ b/funasr/layers/time_warp.py
@@ -0,0 +1,88 @@
+"""Time warp module."""
+import torch
+
+from funasr.modules.nets_utils import pad_list
+
+DEFAULT_TIME_WARP_MODE = "bicubic"
+
+
+def time_warp(x: torch.Tensor, window: int = 80, mode: str = DEFAULT_TIME_WARP_MODE):
+    """Time warping using torch.interpolate.
+
+    Args:
+        x: (Batch, Time, Freq)
+        window: time warp parameter
+        mode: Interpolate mode
+    """
+
+    # bicubic supports 4D or more dimension tensor
+    org_size = x.size()
+    if x.dim() == 3:
+        # x: (Batch, Time, Freq) -> (Batch, 1, Time, Freq)
+        x = x[:, None]
+
+    t = x.shape[2]
+    if t - window <= window:
+        return x.view(*org_size)
+
+    center = torch.randint(window, t - window, (1,))[0]
+    warped = torch.randint(center - window, center + window, (1,))[0] + 1
+
+    # left: (Batch, Channel, warped, Freq)
+    # right: (Batch, Channel, time - warped, Freq)
+    left = torch.nn.functional.interpolate(
+        x[:, :, :center], (warped, x.shape[3]), mode=mode, align_corners=False
+    )
+    right = torch.nn.functional.interpolate(
+        x[:, :, center:], (t - warped, x.shape[3]), mode=mode, align_corners=False
+    )
+
+    if x.requires_grad:
+        x = torch.cat([left, right], dim=-2)
+    else:
+        x[:, :, :warped] = left
+        x[:, :, warped:] = right
+
+    return x.view(*org_size)
+
+
+class TimeWarp(torch.nn.Module):
+    """Time warping using torch.interpolate.
+
+    Args:
+        window: time warp parameter
+        mode: Interpolate mode
+    """
+
+    def __init__(self, window: int = 80, mode: str = DEFAULT_TIME_WARP_MODE):
+        super().__init__()
+        self.window = window
+        self.mode = mode
+
+    def extra_repr(self):
+        return f"window={self.window}, mode={self.mode}"
+
+    def forward(self, x: torch.Tensor, x_lengths: torch.Tensor = None):
+        """Forward function.
+
+        Args:
+            x: (Batch, Time, Freq)
+            x_lengths: (Batch,)
+        """
+
+        if x_lengths is None or all(le == x_lengths[0] for le in x_lengths):
+            # Note that the same warping is applied to every sample
+            y = time_warp(x, window=self.window, mode=self.mode)
+        else:
+            # FIXME(kamo): I have no idea how to batchify TimeWarp
+            ys = []
+            for i in range(x.size(0)):
+                _y = time_warp(
+                    x[i][None, : x_lengths[i]],
+                    window=self.window,
+                    mode=self.mode,
+                )[0]
+                ys.append(_y)
+            y = pad_list(ys, 0.0)
+
+        return y, x_lengths
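The way `time_warp` above picks its warp point can be sketched with the standard library alone. This is an illustration under assumptions rather than the module's code: `sample_warp` is a hypothetical helper, and `random.Random.randrange` stands in for `torch.randint` (both exclude the upper bound).

```python
import random
from typing import Optional, Tuple


def sample_warp(
    t: int, window: int = 80, rng: Optional[random.Random] = None
) -> Optional[Tuple[int, int]]:
    """Sample (center, warped) as time_warp does, or None if t is too short."""
    rng = rng or random.Random()
    if t - window <= window:
        # Sequence too short to warp: the input is returned unchanged
        return None
    center = rng.randrange(window, t - window)
    warped = rng.randrange(center - window, center + window) + 1
    return center, warped


rng = random.Random(0)
t = 300
samples = [sample_warp(t, window=80, rng=rng) for _ in range(1000)]
too_short = sample_warp(100, window=80)  # -> None
```

The first `center` frames are resized to `warped` frames and the remaining `t - center` frames to `t - warped` frames, so the total length stays `t`. Both segments are always non-empty because `1 <= warped <= t - 1`.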
diff --git a/funasr/layers/utterance_mvn.py b/funasr/layers/utterance_mvn.py
new file mode 100644
index 000000000..50f27cd55
--- /dev/null
+++ b/funasr/layers/utterance_mvn.py
@@ -0,0 +1,88 @@
+from typing import Tuple
+
+import torch
+from typeguard import check_argument_types
+
+from funasr.modules.nets_utils import make_pad_mask
+from funasr.layers.abs_normalize import AbsNormalize
+
+
+class UtteranceMVN(AbsNormalize):
+    def __init__(
+        self,
+        norm_means: bool = True,
+        norm_vars: bool = False,
+        eps: float = 1.0e-20,
+    ):
+        assert check_argument_types()
+        super().__init__()
+        self.norm_means = norm_means
+        self.norm_vars = norm_vars
+        self.eps = eps
+
+    def extra_repr(self):
+        return f"norm_means={self.norm_means}, norm_vars={self.norm_vars}"
+
+    def forward(
+        self, x: torch.Tensor, ilens: torch.Tensor = None
+    ) -> Tuple[torch.Tensor, torch.Tensor]:
+        """Forward function
+
+        Args:
+            x: (B, L, ...)
+            ilens: (B,)
+
+        """
+        return utterance_mvn(
+            x,
+            ilens,
+            norm_means=self.norm_means,
+            norm_vars=self.norm_vars,
+            eps=self.eps,
+        )
+
+
+def utterance_mvn(
+    x: torch.Tensor,
+    ilens: torch.Tensor = None,
+    norm_means: bool = True,
+    norm_vars: bool = False,
+    eps: float = 1.0e-20,
+) -> Tuple[torch.Tensor, torch.Tensor]:
+    """Apply utterance mean and variance normalization
+
+    Args:
+        x: (B, T, D), assumed zero padded
+        ilens: (B,)
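+        norm_means: If True, subtract the per-utterance mean.
+        norm_vars: If True, divide by the per-utterance standard deviation.
+        eps: Lower bound applied to the standard deviation.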
+
+    """
+    if ilens is None:
+        ilens = x.new_full([x.size(0)], x.size(1))
+    ilens_ = ilens.to(x.device, x.dtype).view(-1, *[1 for _ in range(x.dim() - 1)])
+    # Zero padding
+    if x.requires_grad:
+        x = x.masked_fill(make_pad_mask(ilens, x, 1), 0.0)
+    else:
+        x.masked_fill_(make_pad_mask(ilens, x, 1), 0.0)
+    # mean: (B, 1, D)
+    mean = x.sum(dim=1, keepdim=True) / ilens_
+
+    if norm_means:
+        x -= mean
+
+        if norm_vars:
+            var = x.pow(2).sum(dim=1, keepdim=True) / ilens_
+            std = torch.clamp(var.sqrt(), min=eps)
+            x = x / std
+        return x, ilens
+    else:
+        if norm_vars:
+            y = x - mean
+            y.masked_fill_(make_pad_mask(ilens, y, 1), 0.0)
+            var = y.pow(2).sum(dim=1, keepdim=True) / ilens_
+            std = torch.clamp(var.sqrt(), min=eps)
+            x /= std
+        return x, ilens
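For a single zero-padded utterance with one feature dimension, the normalization in `utterance_mvn` above amounts to the following. This is a simplified pure-Python sketch with a hypothetical function name. It assumes that statistics are taken over the first `length` frames only and, unlike the batched code's `norm_means` branch (which subtracts the mean from every frame), it keeps padded frames at zero.

```python
import math
from typing import List


def utterance_mvn_1d(
    frames: List[float], length: int, norm_vars: bool = False, eps: float = 1.0e-20
) -> List[float]:
    """Mean (and optionally variance) normalize the first `length` frames."""
    valid = frames[:length]
    mean = sum(valid) / length
    centered = [v - mean for v in valid]
    if norm_vars:
        var = sum(c * c for c in centered) / length
        std = max(math.sqrt(var), eps)  # same role as torch.clamp(..., min=eps)
        centered = [c / std for c in centered]
    # padded frames are left at zero in this sketch
    return centered + [0.0] * (len(frames) - length)


mean_only = utterance_mvn_1d([1.0, 2.0, 3.0, 0.0], length=3)
mean_var = utterance_mvn_1d([1.0, 2.0, 3.0, 0.0], length=3, norm_vars=True)
```

After `norm_vars=True`, the valid frames have zero mean and unit variance, which is a quick sanity check for any batched implementation.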
diff --git a/funasr/lm/__init__.py b/funasr/lm/__init__.py
new file mode 100644
index 000000000..e69de29bb
diff --git a/funasr/lm/abs_model.py b/funasr/lm/abs_model.py
new file mode 100644
index 000000000..0ad1e71bc
--- /dev/null
+++ b/funasr/lm/abs_model.py
@@ -0,0 +1,29 @@
+from abc import ABC
+from abc import abstractmethod
+from typing import Tuple
+
+import torch
+
+from funasr.modules.scorers.scorer_interface import BatchScorerInterface
+
+
+class AbsLM(torch.nn.Module, BatchScorerInterface, ABC):
+    """The abstract LM class
+
+    To share the loss calculation among different models,
+    we use the delegate pattern here:
+    an instance of this class should be passed to "ESPnetLanguageModel".
+
+    >>> from funasr.lm.abs_model import AbsLM
+    >>> lm = AbsLM()
+    >>> model = ESPnetLanguageModel(lm=lm)
+
+    This "model" is one of mediator objects for "Task" class.
+
+    """
+
+    @abstractmethod
+    def forward(
+        self, input: torch.Tensor, hidden: torch.Tensor
+    ) -> Tuple[torch.Tensor, torch.Tensor]:
+        raise NotImplementedError
diff --git a/funasr/lm/espnet_model.py b/funasr/lm/espnet_model.py
new file mode 100644
index 000000000..4fc3b49c8
--- /dev/null
+++ b/funasr/lm/espnet_model.py
@@ -0,0 +1,131 @@
+from typing import Dict
+from typing import Optional
+from typing import Tuple
+
+import torch
+import torch.nn.functional as F
+from typeguard import check_argument_types
+
+from funasr.modules.nets_utils import make_pad_mask
+from funasr.lm.abs_model import AbsLM
+from funasr.torch_utils.device_funcs import force_gatherable
+from funasr.train.abs_espnet_model import AbsESPnetModel
+
+
+class ESPnetLanguageModel(AbsESPnetModel):
+    def __init__(self, lm: AbsLM, vocab_size: int, ignore_id: int = 0):
+        assert check_argument_types()
+        super().__init__()
+        self.lm = lm
+        self.sos = 1
+        self.eos = 2
+
+        # ignore_id may be assumed as 0, shared with CTC-blank symbol for ASR.
+        self.ignore_id = ignore_id
+
+    def nll(
+        self,
+        text: torch.Tensor,
+        text_lengths: torch.Tensor,
+        max_length: Optional[int] = None,
+    ) -> Tuple[torch.Tensor, torch.Tensor]:
+        """Compute negative log likelihood(nll)
+
+        Normally, this function is called in batchify_nll.
+        Args:
+            text: (Batch, Length)
+            text_lengths: (Batch,)
+            max_length: int
+        """
+        batch_size = text.size(0)
+        # For data parallel
+        if max_length is None:
+            text = text[:, : text_lengths.max()]
+        else:
+            text = text[:, :max_length]
+
+        # 1. Create a sentence pair like '<sos> w1 w2 w3' and 'w1 w2 w3 <eos>'
+        # text: (Batch, Length) -> x, t: (Batch, Length + 1)
+        x = F.pad(text, [1, 0], "constant", self.sos)
+        t = F.pad(text, [0, 1], "constant", self.ignore_id)
+        for i, l in enumerate(text_lengths):
+            t[i, l] = self.eos
+        x_lengths = text_lengths + 1
+
+        # 2. Forward Language model
+        # x: (Batch, Length) -> y: (Batch, Length, NVocab)
+        y, _ = self.lm(x, None)
+
+        # 3. Calc negative log likelihood
+        # nll: (BxL,)
+        nll = F.cross_entropy(y.view(-1, y.shape[-1]), t.view(-1), reduction="none")
+        # nll: (BxL,) -> (BxL,)
+        if max_length is None:
+            nll.masked_fill_(make_pad_mask(x_lengths).to(nll.device).view(-1), 0.0)
+        else:
+            nll.masked_fill_(
+                make_pad_mask(x_lengths, maxlen=max_length + 1).to(nll.device).view(-1),
+                0.0,
+            )
+        # nll: (BxL,) -> (B, L)
+        nll = nll.view(batch_size, -1)
+        return nll, x_lengths
+
+    def batchify_nll(
+        self, text: torch.Tensor, text_lengths: torch.Tensor, batch_size: int = 100
+    ) -> Tuple[torch.Tensor, torch.Tensor]:
+        """Compute negative log likelihood(nll) from transformer language model
+
+        To avoid OOM, this function separates the input into batches,
+        then calls nll for each batch and combines and returns the results.
+        Args:
+            text: (Batch, Length)
+            text_lengths: (Batch,)
+            batch_size: int, number of samples per batch when computing nll;
+                        decrease it to avoid OOM, or increase it for speed.
+
+        """
+        total_num = text.size(0)
+        if total_num <= batch_size:
+            nll, x_lengths = self.nll(text, text_lengths)
+        else:
+            nlls = []
+            x_lengths = []
+            max_length = int(text_lengths.max())
+
+            start_idx = 0
+            while True:
+                end_idx = min(start_idx + batch_size, total_num)
+                batch_text = text[start_idx:end_idx, :]
+                batch_text_lengths = text_lengths[start_idx:end_idx]
+                # batch_nll: (B, T)
+                batch_nll, batch_x_lengths = self.nll(
+                    batch_text, batch_text_lengths, max_length=max_length
+                )
+                nlls.append(batch_nll)
+                x_lengths.append(batch_x_lengths)
+                start_idx = end_idx
+                if start_idx == total_num:
+                    break
+            nll = torch.cat(nlls)
+            x_lengths = torch.cat(x_lengths)
+        assert nll.size(0) == total_num
+        assert x_lengths.size(0) == total_num
+        return nll, x_lengths
+
+    def forward(
+        self, text: torch.Tensor, text_lengths: torch.Tensor
+    ) -> Tuple[torch.Tensor, Dict[str, torch.Tensor], torch.Tensor]:
+        nll, y_lengths = self.nll(text, text_lengths)
+        ntokens = y_lengths.sum()
+        loss = nll.sum() / ntokens
+        stats = dict(loss=loss.detach())
+
+        # force_gatherable: to-device and to-tensor if scalar for DataParallel
+        loss, stats, weight = force_gatherable((loss, stats, ntokens), loss.device)
+        return loss, stats, weight
+
+    def collect_feats(
+        self, text: torch.Tensor, text_lengths: torch.Tensor
+    ) -> Dict[str, torch.Tensor]:
+        return {}
diff --git a/funasr/lm/seq_rnn_lm.py b/funasr/lm/seq_rnn_lm.py
new file mode 100644
index 000000000..09d1e4ae5
--- /dev/null
+++ b/funasr/lm/seq_rnn_lm.py
@@ -0,0 +1,174 @@
+"""Sequential implementation of Recurrent Neural Network Language Model."""
+from typing import Tuple
+from typing import Union
+
+import torch
+import torch.nn as nn
+from typeguard import check_argument_types
+
+from funasr.lm.abs_model import AbsLM
+
+
+class SequentialRNNLM(AbsLM):
+    """Sequential RNNLM.
+
+    See also:
+        https://github.com/pytorch/examples/blob/4581968193699de14b56527296262dd76ab43557/word_language_model/model.py
+
+    """
+
+    def __init__(
+        self,
+        vocab_size: int,
+        unit: int = 650,
+        nhid: int = None,
+        nlayers: int = 2,
+        dropout_rate: float = 0.0,
+        tie_weights: bool = False,
+        rnn_type: str = "lstm",
+        ignore_id: int = 0,
+    ):
+        assert check_argument_types()
+        super().__init__()
+
+        ninp = unit
+        if nhid is None:
+            nhid = unit
+        rnn_type = rnn_type.upper()
+
+        self.drop = nn.Dropout(dropout_rate)
+        self.encoder = nn.Embedding(vocab_size, ninp, padding_idx=ignore_id)
+        if rnn_type in ["LSTM", "GRU"]:
+            rnn_class = getattr(nn, rnn_type)
+            self.rnn = rnn_class(
+                ninp, nhid, nlayers, dropout=dropout_rate, batch_first=True
+            )
+        else:
+            try:
+                nonlinearity = {"RNN_TANH": "tanh", "RNN_RELU": "relu"}[rnn_type]
+            except KeyError:
+                raise ValueError(
+                    """An invalid option for `rnn_type` was supplied,
+                    options are ['LSTM', 'GRU', 'RNN_TANH' or 'RNN_RELU']"""
+                )
+            self.rnn = nn.RNN(
+                ninp,
+                nhid,
+                nlayers,
+                nonlinearity=nonlinearity,
+                dropout=dropout_rate,
+                batch_first=True,
+            )
+        self.decoder = nn.Linear(nhid, vocab_size)
+
+        # Optionally tie weights as in:
+        # "Using the Output Embedding to Improve Language Models"
+        # (Press & Wolf 2016) https://arxiv.org/abs/1608.05859
+        # and
+        # "Tying Word Vectors and Word Classifiers:
+        # A Loss Framework for Language Modeling" (Inan et al. 2016)
+        # https://arxiv.org/abs/1611.01462
+        if tie_weights:
+            if nhid != ninp:
+                raise ValueError(
+                    "When using the tied flag, nhid must be equal to unit"
+                )
+            self.decoder.weight = self.encoder.weight
+
+        self.rnn_type = rnn_type
+        self.nhid = nhid
+        self.nlayers = nlayers
+
+    def zero_state(self):
+        """Initialize LM state filled with zero values."""
+        if isinstance(self.rnn, torch.nn.LSTM):
+            h = torch.zeros((self.nlayers, self.nhid), dtype=torch.float)
+            c = torch.zeros((self.nlayers, self.nhid), dtype=torch.float)
+            state = h, c
+        else:
+            state = torch.zeros((self.nlayers, self.nhid), dtype=torch.float)
+
+        return state
+
+    def forward(
+        self, input: torch.Tensor, hidden: torch.Tensor
+    ) -> Tuple[torch.Tensor, torch.Tensor]:
+        emb = self.drop(self.encoder(input))
+        output, hidden = self.rnn(emb, hidden)
+        output = self.drop(output)
+        decoded = self.decoder(
+            output.contiguous().view(output.size(0) * output.size(1), output.size(2))
+        )
+        return (
+            decoded.view(output.size(0), output.size(1), decoded.size(1)),
+            hidden,
+        )
+
+    def score(
+        self,
+        y: torch.Tensor,
+        state: Union[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]],
+        x: torch.Tensor,
+    ) -> Tuple[torch.Tensor, Union[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]]]:
+        """Score new token.
+
+        Args:
+            y: 1D torch.int64 prefix tokens.
+            state: Scorer state for prefix tokens
+            x: 2D encoder feature that generates ys.
+
+        Returns:
+            Tuple of
+                torch.float32 scores for next token (n_vocab)
+                and next state for ys
+
+        """
+        y, new_state = self(y[-1].view(1, 1), state)
+        logp = y.log_softmax(dim=-1).view(-1)
+        return logp, new_state
+
+    def batch_score(
+        self, ys: torch.Tensor, states: torch.Tensor, xs: torch.Tensor
+    ) -> Tuple[torch.Tensor, torch.Tensor]:
+        """Score new token batch.
+
+        Args:
+            ys (torch.Tensor): torch.int64 prefix tokens (n_batch, ylen).
+            states (List[Any]): Scorer states for prefix tokens.
+            xs (torch.Tensor):
+                The encoder feature that generates ys (n_batch, xlen, n_feat).
+
+        Returns:
+            tuple[torch.Tensor, List[Any]]: Tuple of
+                batchified scores for next token with shape of `(n_batch, n_vocab)`
+                and next state list for ys.
+
+        """
+        if states[0] is None:
+            states = None
+        elif isinstance(self.rnn, torch.nn.LSTM):
+            # states: Batch x 2 x (Nlayers, Dim) -> 2 x (Nlayers, Batch, Dim)
+            h = torch.stack([h for h, c in states], dim=1)
+            c = torch.stack([c for h, c in states], dim=1)
+            states = h, c
+        else:
+            # states: Batch x (Nlayers, Dim) -> (Nlayers, Batch, Dim)
+            states = torch.stack(states, dim=1)
+
+        ys, states = self(ys[:, -1:], states)
+        # ys: (Batch, 1, Nvocab) -> (Batch, NVocab)
+        assert ys.size(1) == 1, ys.shape
+        ys = ys.squeeze(1)
+        logp = ys.log_softmax(dim=-1)
+
+        # state: Change to batch first
+        if isinstance(self.rnn, torch.nn.LSTM):
+            # h, c: (Nlayers, Batch, Dim)
+            h, c = states
+            # states: Batch x 2 x (Nlayers, Dim)
+            states = [(h[:, i], c[:, i]) for i in range(h.size(1))]
+        else:
+            # states: (Nlayers, Batch, Dim) -> Batch x (Nlayers, Dim)
+            states = [states[:, i] for i in range(states.size(1))]
+
+        return logp, states
diff --git a/funasr/lm/transformer_lm.py b/funasr/lm/transformer_lm.py
new file mode 100644
index 000000000..52af45bdc
--- /dev/null
+++ b/funasr/lm/transformer_lm.py
@@ -0,0 +1,131 @@
+from typing import Any
+from typing import List
+from typing import Tuple
+
+import torch
+import torch.nn as nn
+
+from funasr.modules.embedding import PositionalEncoding
+from funasr.models.encoder.transformer_encoder import TransformerEncoder_s0 as Encoder
+from funasr.modules.mask import subsequent_mask
+from funasr.lm.abs_model import AbsLM
+
+
+class TransformerLM(AbsLM):
+    def __init__(
+        self,
+        vocab_size: int,
+        pos_enc: str = None,
+        embed_unit: int = 128,
+        att_unit: int = 256,
+        head: int = 2,
+        unit: int = 1024,
+        layer: int = 4,
+        dropout_rate: float = 0.5,
+    ):
+        super().__init__()
+        if pos_enc == "sinusoidal":
+            pos_enc_class = PositionalEncoding
+        elif pos_enc is None:
+
+            def pos_enc_class(*args, **kwargs):
+                return nn.Sequential()  # identity
+
+        else:
+            raise ValueError(f"unknown pos-enc option: {pos_enc}")
+
+        self.embed = nn.Embedding(vocab_size, embed_unit)
+        self.encoder = Encoder(
+            idim=embed_unit,
+            attention_dim=att_unit,
+            attention_heads=head,
+            linear_units=unit,
+            num_blocks=layer,
+            dropout_rate=dropout_rate,
+            input_layer="linear",
+            pos_enc_class=pos_enc_class,
+        )
+        self.decoder = nn.Linear(att_unit, vocab_size)
+
+    def _target_mask(self, ys_in_pad):
+        ys_mask = ys_in_pad != 0
+        m = subsequent_mask(ys_mask.size(-1), device=ys_mask.device).unsqueeze(0)
+        return ys_mask.unsqueeze(-2) & m
+
+    def forward(self, input: torch.Tensor, hidden: None) -> Tuple[torch.Tensor, None]:
+        """Compute output logits for the input token sequences.
+
+        Args:
+            input (torch.Tensor): Input ids. (batch, len)
+            hidden (None): Unused; kept for compatibility with AbsLM.forward.
+
+        """
+        x = self.embed(input)
+        mask = self._target_mask(input)
+        h, _ = self.encoder(x, mask)
+        y = self.decoder(h)
+        return y, None
+
+    def score(
+        self, y: torch.Tensor, state: Any, x: torch.Tensor
+    ) -> Tuple[torch.Tensor, Any]:
+        """Score new token.
+
+        Args:
+            y (torch.Tensor): 1D torch.int64 prefix tokens.
+            state: Scorer state for prefix tokens
+            x (torch.Tensor): encoder feature that generates ys.
+
+        Returns:
+            tuple[torch.Tensor, Any]: Tuple of
+                torch.float32 scores for next token (vocab_size)
+                and next state for ys
+
+        """
+        y = y.unsqueeze(0)
+        h, _, cache = self.encoder.forward_one_step(
+            self.embed(y), self._target_mask(y), cache=state
+        )
+        h = self.decoder(h[:, -1])
+        logp = h.log_softmax(dim=-1).squeeze(0)
+        return logp, cache
+
+    def batch_score(
+        self, ys: torch.Tensor, states: List[Any], xs: torch.Tensor
+    ) -> Tuple[torch.Tensor, List[Any]]:
+        """Score new token batch.
+
+        Args:
+            ys (torch.Tensor): torch.int64 prefix tokens (n_batch, ylen).
+            states (List[Any]): Scorer states for prefix tokens.
+            xs (torch.Tensor):
+                The encoder feature that generates ys (n_batch, xlen, n_feat).
+
+        Returns:
+            tuple[torch.Tensor, List[Any]]: Tuple of
+                batchified scores for next token with shape of `(n_batch, vocab_size)`
+                and next state list for ys.
+
+        """
+        # merge states
+        n_batch = len(ys)
+        n_layers = len(self.encoder.encoders)
+        if states[0] is None:
+            batch_state = None
+        else:
+            # transpose state of [batch, layer] into [layer, batch]
+            batch_state = [
+                torch.stack([states[b][i] for b in range(n_batch)])
+                for i in range(n_layers)
+            ]
+
+        # batch decoding
+        h, _, states = self.encoder.forward_one_step(
+            self.embed(ys), self._target_mask(ys), cache=batch_state
+        )
+        h = self.decoder(h[:, -1])
+        logp = h.log_softmax(dim=-1)
+
+        # transpose state of [layer, batch] into [batch, layer]
+        state_list = [[states[i][b] for i in range(n_layers)] for b in range(n_batch)]
+        return logp, state_list
diff --git a/funasr/losses/label_smoothing_loss.py b/funasr/losses/label_smoothing_loss.py
new file mode 100644
index 000000000..0d8b30338
--- /dev/null
+++ b/funasr/losses/label_smoothing_loss.py
@@ -0,0 +1,63 @@
+#!/usr/bin/env python3
+# -*- coding: utf-8 -*-
+
+# Copyright 2019 Shigeki Karita
+#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)
+
+"""Label smoothing module."""
+
+import torch
+from torch import nn
+
+
+class LabelSmoothingLoss(nn.Module):
+    """Label-smoothing loss.
+
+    :param int size: the number of classes
+    :param int padding_idx: ignored class id
+    :param float smoothing: smoothing rate (0.0 means the conventional CE)
+    :param bool normalize_length: normalize loss by sequence length if True
+    :param torch.nn.Module criterion: loss function to be smoothed
+    """
+
+    def __init__(
+        self,
+        size,
+        padding_idx,
+        smoothing,
+        normalize_length=False,
+        criterion=nn.KLDivLoss(reduction="none"),
+    ):
+        """Construct a LabelSmoothingLoss object."""
+        super(LabelSmoothingLoss, self).__init__()
+        self.criterion = criterion
+        self.padding_idx = padding_idx
+        self.confidence = 1.0 - smoothing
+        self.smoothing = smoothing
+        self.size = size
+        self.true_dist = None
+        self.normalize_length = normalize_length
+
+    def forward(self, x, target):
+        """Compute loss between x and target.
+
+        :param torch.Tensor x: prediction (batch, seqlen, class)
+        :param torch.Tensor target:
+            target signal masked with self.padding_idx (batch, seqlen)
+        :return: scalar float value
+        :rtype: torch.Tensor
+        """
+        assert x.size(2) == self.size
+        batch_size = x.size(0)
+        x = x.view(-1, self.size)
+        target = target.view(-1)
+        with torch.no_grad():
+            true_dist = x.clone()
+            true_dist.fill_(self.smoothing / (self.size - 1))
+            ignore = target == self.padding_idx  # (B * seqlen,)
+            total = len(target) - ignore.sum().item()
+            target = target.masked_fill(ignore, 0)  # avoid -1 index
+            true_dist.scatter_(1, target.unsqueeze(1), self.confidence)
+        kl = self.criterion(torch.log_softmax(x, dim=1), true_dist)
+        denom = total if self.normalize_length else batch_size
+        return kl.masked_fill(ignore.unsqueeze(1), 0).sum() / denom
diff --git a/funasr/main_funcs/__init__.py b/funasr/main_funcs/__init__.py
new file mode 100644
index 000000000..e69de29bb
diff --git a/funasr/main_funcs/average_nbest_models.py b/funasr/main_funcs/average_nbest_models.py
new file mode 100644
index 000000000..53f956800
--- /dev/null
+++ b/funasr/main_funcs/average_nbest_models.py
@@ -0,0 +1,127 @@
+import logging
+from pathlib import Path
+from typing import Optional
+from typing import Sequence
+from typing import Union
+import warnings
+import os
+from io import BytesIO
+
+import torch
+from typeguard import check_argument_types
+from typing import Collection
+
+from funasr.train.reporter import Reporter
+
+
+@torch.no_grad()
+def average_nbest_models(
+    output_dir: Path,
+    reporter: Reporter,
+    best_model_criterion: Sequence[Sequence[str]],
+    nbest: Union[Collection[int], int],
+    suffix: Optional[str] = None,
+    oss_bucket=None,
+    pai_output_dir=None,
+) -> None:
+    """Generate averaged model from n-best models
+
+    Args:
+        output_dir: The directory that contains the model file for each epoch
+        reporter: Reporter instance
+        best_model_criterion: Criteria used to decide the best model.
+            e.g. [("valid", "loss", "min"), ("train", "acc", "max")]
+        nbest: Number of best model files to be averaged
+        suffix: A suffix added to the averaged model file name
+    """
+    assert check_argument_types()
+    if isinstance(nbest, int):
+        nbests = [nbest]
+    else:
+        nbests = list(nbest)
+    if len(nbests) == 0:
+        warnings.warn("At least one nbest value is required")
+        nbests = [1]
+    if suffix is not None:
+        suffix = suffix + "."
+    else:
+        suffix = ""
+
+    # 1. Get nbests: List[Tuple[str, str, List[Tuple[epoch, value]]]]
+    nbest_epochs = [
+        (ph, k, reporter.sort_epochs_and_values(ph, k, m)[: max(nbests)])
+        for ph, k, m in best_model_criterion
+        if reporter.has(ph, k)
+    ]
+
+    _loaded = {}
+    for ph, cr, epoch_and_values in nbest_epochs:
+        _nbests = [i for i in nbests if i <= len(epoch_and_values)]
+        if len(_nbests) == 0:
+            _nbests = [1]
+
+        for n in _nbests:
+            if n == 0:
+                continue
+            elif n == 1:
+                # The averaged model is same as the best model
+                e, _ = epoch_and_values[0]
+                op = output_dir / f"{e}epoch.pth"
+                sym_op = output_dir / f"{ph}.{cr}.ave_1best.{suffix}pth"
+                if sym_op.is_symlink() or sym_op.exists():
+                    sym_op.unlink()
+                sym_op.symlink_to(op.name)
+            else:
+                op = output_dir / f"{ph}.{cr}.ave_{n}best.{suffix}pth"
+                logging.info(
+                    f"Averaging {n}best models: " f'criterion="{ph}.{cr}": {op}'
+                )
+
+                avg = None
+                # 2.a. Averaging model
+                for e, _ in epoch_and_values[:n]:
+                    if e not in _loaded:
+                        if oss_bucket is None:
+                            _loaded[e] = torch.load(
+                                output_dir / f"{e}epoch.pth",
+                                map_location="cpu",
+                            )
+                        else:
+                            buffer = BytesIO(
+                                oss_bucket.get_object(os.path.join(pai_output_dir, f"{e}epoch.pth")).read())
+                            _loaded[e] = torch.load(buffer, map_location="cpu")
+                    states = _loaded[e]
+
+                    if avg is None:
+                        avg = states.copy()  # keep cached _loaded[e] intact
+                    else:
+                        # Accumulated
+                        for k in avg:
+                            avg[k] = avg[k] + states[k]
+                for k in avg:
+                    if str(avg[k].dtype).startswith("torch.int"):
+                        # For int type, not averaged, but only accumulated.
+                        # e.g. BatchNorm.num_batches_tracked
+                        # (If there are any cases that require averaging
+                        #  or the other reducing method, e.g. max/min, for integer type,
+                        #  please report.)
+                        pass
+                    else:
+                        avg[k] = avg[k] / n
+
+                # 2.b. Save the ave model and create a symlink
+                if oss_bucket is None:
+                    torch.save(avg, op)
+                else:
+                    buffer = BytesIO()
+                    torch.save(avg, buffer)
+                    oss_bucket.put_object(os.path.join(pai_output_dir, f"{ph}.{cr}.ave_{n}best.{suffix}pth"),
+                                          buffer.getvalue())
+
+        # 3. *.*.ave.pth is a symlink to the max ave model
+        if oss_bucket is None:
+            op = output_dir / f"{ph}.{cr}.ave_{max(_nbests)}best.{suffix}pth"
+            sym_op = output_dir / f"{ph}.{cr}.ave.{suffix}pth"
+            if sym_op.is_symlink() or sym_op.exists():
+                sym_op.unlink()
+            sym_op.symlink_to(op.name)
diff --git a/funasr/main_funcs/calculate_all_attentions.py b/funasr/main_funcs/calculate_all_attentions.py
new file mode 100644
index 000000000..8f238c6bf
--- /dev/null
+++ b/funasr/main_funcs/calculate_all_attentions.py
@@ -0,0 +1,160 @@
+from collections import defaultdict
+from typing import Dict
+from typing import List
+
+import torch
+
+from funasr.modules.rnn.attentions import AttAdd
+from funasr.modules.rnn.attentions import AttCov
+from funasr.modules.rnn.attentions import AttCovLoc
+from funasr.modules.rnn.attentions import AttDot
+from funasr.modules.rnn.attentions import AttForward
+from funasr.modules.rnn.attentions import AttForwardTA
+from funasr.modules.rnn.attentions import AttLoc
+from funasr.modules.rnn.attentions import AttLoc2D
+from funasr.modules.rnn.attentions import AttLocRec
+from funasr.modules.rnn.attentions import AttMultiHeadAdd
+from funasr.modules.rnn.attentions import AttMultiHeadDot
+from funasr.modules.rnn.attentions import AttMultiHeadLoc
+from funasr.modules.rnn.attentions import AttMultiHeadMultiResLoc
+from funasr.modules.rnn.attentions import NoAtt
+from funasr.modules.attention import MultiHeadedAttention
+
+
+from funasr.train.abs_espnet_model import AbsESPnetModel
+
+
+@torch.no_grad()
+def calculate_all_attentions(
+    model: AbsESPnetModel, batch: Dict[str, torch.Tensor]
+) -> Dict[str, List[torch.Tensor]]:
+    """Derive the outputs from all the attention layers
+
+    Args:
+        model:
+        batch: same as forward
+    Returns:
+        return_dict: A dict of a list of tensor.
+        key_names x batch x (D1, D2, ...)
+
+    """
+    bs = len(next(iter(batch.values())))
+    assert all(len(v) == bs for v in batch.values()), {
+        k: v.shape for k, v in batch.items()
+    }
+
+    # 1. Register forward_hook fn to save the output from specific layers
+    outputs = {}
+    handles = {}
+    for name, modu in model.named_modules():
+
+        def hook(module, input, output, name=name):
+            if isinstance(module, MultiHeadedAttention):
+                # NOTE(kamo): MultiHeadedAttention doesn't return attention weight
+                # attn: (B, Head, Tout, Tin)
+                outputs[name] = module.attn.detach().cpu()
+            elif isinstance(module, AttLoc2D):
+                c, w = output
+                # w: concatenated previous attentions
+                # w: (B, nprev, Tin)
+                att_w = w[:, -1].detach().cpu()
+                outputs.setdefault(name, []).append(att_w)
+            elif isinstance(module, (AttCov, AttCovLoc)):
+                c, w = output
+                assert isinstance(w, list), type(w)
+                # w: list of previous attentions
+                # w: nprev x (B, Tin)
+                att_w = w[-1].detach().cpu()
+                outputs.setdefault(name, []).append(att_w)
+            elif isinstance(module, AttLocRec):
+                # w: (B, Tin)
+                c, (w, (att_h, att_c)) = output
+                att_w = w.detach().cpu()
+                outputs.setdefault(name, []).append(att_w)
+            elif isinstance(
+                module,
+                (
+                    AttMultiHeadDot,
+                    AttMultiHeadAdd,
+                    AttMultiHeadLoc,
+                    AttMultiHeadMultiResLoc,
+                ),
+            ):
+                c, w = output
+                # w: nhead x (B, Tin)
+                assert isinstance(w, list), type(w)
+                att_w = [_w.detach().cpu() for _w in w]
+                outputs.setdefault(name, []).append(att_w)
+            elif isinstance(
+                module,
+                (
+                    AttAdd,
+                    AttDot,
+                    AttForward,
+                    AttForwardTA,
+                    AttLoc,
+                    NoAtt,
+                ),
+            ):
+                c, w = output
+                att_w = w.detach().cpu()
+                outputs.setdefault(name, []).append(att_w)
+
+        handle = modu.register_forward_hook(hook)
+        handles[name] = handle
+
+    # 2. Forward the samples one by one.
+    # Batch mode is not used, to keep the requirements on each model small.
+    keys = []
+    for k in batch:
+        if not k.endswith("_lengths"):
+            keys.append(k)
+
+    return_dict = defaultdict(list)
+    for ibatch in range(bs):
+        # *: (B, L, ...) -> (1, L2, ...)
+        _sample = {
+            k: batch[k][ibatch, None, : batch[k + "_lengths"][ibatch]]
+            if k + "_lengths" in batch
+            else batch[k][ibatch, None]
+            for k in keys
+        }
+
+        # *_lengths: (B,) -> (1,)
+        _sample.update(
+            {
+                k + "_lengths": batch[k + "_lengths"][ibatch, None]
+                for k in keys
+                if k + "_lengths" in batch
+            }
+        )
+        model(**_sample)
+
+        # Derive the attention results
+        for name, output in outputs.items():
+            if isinstance(output, list):
+                if isinstance(output[0], list):
+                    # output: nhead x (Tout, Tin)
+                    output = torch.stack(
+                        [
+                            # Tout x (1, Tin) -> (Tout, Tin)
+                            torch.cat([o[idx] for o in output], dim=0)
+                            for idx in range(len(output[0]))
+                        ],
+                        dim=0,
+                    )
+                else:
+                    # Tout x (1, Tin) -> (Tout, Tin)
+                    output = torch.cat(output, dim=0)
+            else:
+                # output: (1, NHead, Tout, Tin) -> (NHead, Tout, Tin)
+                output = output.squeeze(0)
+            # output: (Tout, Tin) or (NHead, Tout, Tin)
+            return_dict[name].append(output)
+        outputs.clear()
+
+    # 3. Remove all hooks
+    for _, handle in handles.items():
+        handle.remove()
+
+    return dict(return_dict)
diff --git a/funasr/main_funcs/collect_stats.py b/funasr/main_funcs/collect_stats.py
new file mode 100644
index 000000000..bacda8f2f
--- /dev/null
+++ b/funasr/main_funcs/collect_stats.py
@@ -0,0 +1,126 @@
+from collections import defaultdict
+import logging
+from pathlib import Path
+from typing import Dict
+from typing import Iterable
+from typing import List
+from typing import Optional
+from typing import Tuple
+
+import numpy as np
+import torch
+from torch.nn.parallel import data_parallel
+from torch.utils.data import DataLoader
+from typeguard import check_argument_types
+
+from funasr.fileio.datadir_writer import DatadirWriter
+from funasr.fileio.npy_scp import NpyScpWriter
+from funasr.torch_utils.device_funcs import to_device
+from funasr.torch_utils.forward_adaptor import ForwardAdaptor
+from funasr.train.abs_espnet_model import AbsESPnetModel
+
+
+@torch.no_grad()
+def collect_stats(
+    model: AbsESPnetModel,
+    train_iter: DataLoader and Iterable[Tuple[List[str], Dict[str, torch.Tensor]]],
+    valid_iter: DataLoader and Iterable[Tuple[List[str], Dict[str, torch.Tensor]]],
+    output_dir: Path,
+    ngpu: Optional[int],
+    log_interval: Optional[int],
+    write_collected_feats: bool,
+) -> None:
+    """Run in collect_stats mode.
+
+    Derives the shape information from the data
+    and gathers statistics.
+    This function is used before executing train().
+
+    """
+    assert check_argument_types()
+
+    npy_scp_writers = {}
+    for itr, mode in zip([train_iter, valid_iter], ["train", "valid"]):
+        if log_interval is None:
+            try:
+                log_interval = max(len(itr) // 20, 10)
+            except TypeError:
+                log_interval = 100
+
+        sum_dict = defaultdict(lambda: 0)
+        sq_dict = defaultdict(lambda: 0)
+        count_dict = defaultdict(lambda: 0)
+
+        with DatadirWriter(output_dir / mode) as datadir_writer:
+            for iiter, (keys, batch) in enumerate(itr, 1):
+                batch = to_device(batch, "cuda" if ngpu > 0 else "cpu")
+
+                # 1. Write shape file
+                for name in batch:
+                    if name.endswith("_lengths"):
+                        continue
+                    for i, (key, data) in enumerate(zip(keys, batch[name])):
+                        if f"{name}_lengths" in batch:
+                            lg = int(batch[f"{name}_lengths"][i])
+                            data = data[:lg]
+                        datadir_writer[f"{name}_shape"][key] = ",".join(
+                            map(str, data.shape)
+                        )
+
+                # 2. Extract feats
+                if ngpu <= 1:
+                    data = model.collect_feats(**batch)
+                else:
+                    # Note that data_parallel can parallelize only "forward()"
+                    data = data_parallel(
+                        ForwardAdaptor(model, "collect_feats"),
+                        (),
+                        range(ngpu),
+                        module_kwargs=batch,
+                    )
+
+                # 3. Calculate sum and square sum
+                for key, v in data.items():
+                    for i, (uttid, seq) in enumerate(zip(keys, v.cpu().numpy())):
+                        # Truncate zero-padding region
+                        if f"{key}_lengths" in data:
+                            length = data[f"{key}_lengths"][i]
+                            # seq: (Length, Dim, ...)
+                            seq = seq[:length]
+                        else:
+                            # seq: (Dim, ...) -> (1, Dim, ...)
+                            seq = seq[None]
+                        # Accumulate value, its square, and count
+                        sum_dict[key] += seq.sum(0)
+                        sq_dict[key] += (seq**2).sum(0)
+                        count_dict[key] += len(seq)
+
+                        # 4. [Optional] Write derived features as npy format file.
+                        if write_collected_feats:
+                            # Instantiate NpyScpWriter for the first iteration
+                            if (key, mode) not in npy_scp_writers:
+                                p = output_dir / mode / "collect_feats"
+                                npy_scp_writers[(key, mode)] = NpyScpWriter(
+                                    p / f"data_{key}", p / f"{key}.scp"
+                                )
+                            # Save array as npy file
+                            npy_scp_writers[(key, mode)][uttid] = seq
+
+                if iiter % log_interval == 0:
+                    logging.info(f"Niter: {iiter}")
+
+        for key in sum_dict:
+            np.savez(
+                output_dir / mode / f"{key}_stats.npz",
+                count=count_dict[key],
+                sum=sum_dict[key],
+                sum_square=sq_dict[key],
+            )
+
+        # batch_keys and stats_keys are used by aggregate_stats_dirs.py
+        with (output_dir / mode / "batch_keys").open("w", encoding="utf-8") as f:
+            f.write(
+                "\n".join(filter(lambda x: not x.endswith("_lengths"), batch)) + "\n"
+            )
+        with (output_dir / mode / "stats_keys").open("w", encoding="utf-8") as f:
+            f.write("\n".join(sum_dict) + "\n")
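Note on the statistics written above: each `{key}_stats.npz` file holds `count`, `sum`, and `sum_square`, which is enough to recover a per-dimension mean and standard deviation later. The following is a minimal sketch of that arithmetic, assuming NumPy is available; `mean_and_std`, `eps`, and the toy arrays are hypothetical names used for illustration and are not part of this patch.

```python
import numpy as np


def mean_and_std(count, total, total_square, eps=1e-20):
    """Recover mean/std from accumulated count, sum, and sum of squares."""
    mean = total / count
    # Var[x] = E[x^2] - E[x]^2; clip to avoid a negative value from rounding
    var = total_square / count - mean**2
    std = np.sqrt(np.maximum(var, eps))
    return mean, std


# Toy data: two "utterances" of different length, feature dimension 2
utts = [np.array([[1.0, 2.0], [3.0, 4.0]]), np.array([[5.0, 6.0]])]

# Accumulate the same three quantities that collect_stats stores per key
count = sum(len(u) for u in utts)
total = sum(u.sum(0) for u in utts)
total_square = sum((u**2).sum(0) for u in utts)

mean, std = mean_and_std(count, total, total_square)
```

Because only sums are stored, statistics from several directories can be merged by adding the three arrays before calling such a helper.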
diff --git a/funasr/main_funcs/pack_funcs.py b/funasr/main_funcs/pack_funcs.py
new file mode 100644
index 000000000..ffa807e23
--- /dev/null
+++ b/funasr/main_funcs/pack_funcs.py
@@ -0,0 +1,302 @@
+from datetime import datetime
+from io import BytesIO
+from io import TextIOWrapper
+import os
+from pathlib import Path
+import sys
+import tarfile
+from typing import Dict
+from typing import Iterable
+from typing import Optional
+from typing import Union
+import zipfile
+
+import yaml
+
+
+class Archiver:
+    def __init__(self, file, mode="r"):
+        if Path(file).suffix == ".tar":
+            self.type = "tar"
+        elif Path(file).suffix == ".tgz" or Path(file).suffixes == [".tar", ".gz"]:
+            self.type = "tar"
+            if mode == "w":
+                mode = "w:gz"
+        elif Path(file).suffix == ".tbz2" or Path(file).suffixes == [".tar", ".bz2"]:
+            self.type = "tar"
+            if mode == "w":
+                mode = "w:bz2"
+        elif Path(file).suffix == ".txz" or Path(file).suffixes == [".tar", ".xz"]:
+            self.type = "tar"
+            if mode == "w":
+                mode = "w:xz"
+        elif Path(file).suffix == ".zip":
+            self.type = "zip"
+        else:
+            raise ValueError(f"Cannot detect archive format: file={file}")
+
+        if self.type == "tar":
+            self.fopen = tarfile.open(file, mode=mode)
+        elif self.type == "zip":
+
+            self.fopen = zipfile.ZipFile(file, mode=mode)
+        else:
+            raise ValueError(f"Not supported: type={self.type}")
+
+    def __enter__(self):
+        return self
+
+    def __exit__(self, exc_type, exc_val, exc_tb):
+        self.fopen.close()
+
+    def close(self):
+        self.fopen.close()
+
+    def __iter__(self):
+        if self.type == "tar":
+            return iter(self.fopen)
+        elif self.type == "zip":
+            return iter(self.fopen.infolist())
+        else:
+            raise ValueError(f"Not supported: type={self.type}")
+
+    def add(self, filename, arcname=None, recursive: bool = True):
+        if arcname is not None:
+            print(f"adding: {arcname}")
+        else:
+            print(f"adding: {filename}")
+
+        if recursive and Path(filename).is_dir():
+            for f in Path(filename).glob("**/*"):
+                if f.is_dir():
+                    continue
+
+                if arcname is not None:
+                    _arcname = Path(arcname) / f
+                else:
+                    _arcname = None
+
+                self.add(f, _arcname)
+            return
+
+        if self.type == "tar":
+            return self.fopen.add(filename, arcname)
+        elif self.type == "zip":
+            return self.fopen.write(filename, arcname)
+        else:
+            raise ValueError(f"Not supported: type={self.type}")
+
+    def addfile(self, info, fileobj):
+        print(f"adding: {self.get_name_from_info(info)}")
+
+        if self.type == "tar":
+            return self.fopen.addfile(info, fileobj)
+        elif self.type == "zip":
+            return self.fopen.writestr(info, fileobj.read())
+        else:
+            raise ValueError(f"Not supported: type={self.type}")
+
+    def generate_info(self, name, size) -> Union[tarfile.TarInfo, zipfile.ZipInfo]:
+        """Generate TarInfo using system information"""
+        if self.type == "tar":
+            tarinfo = tarfile.TarInfo(str(name))
+            if os.name == "posix":
+                tarinfo.gid = os.getgid()
+                tarinfo.uid = os.getuid()
+            tarinfo.mtime = datetime.now().timestamp()
+            tarinfo.size = size
+            # Keep mode as default
+            return tarinfo
+        elif self.type == "zip":
+            zipinfo = zipfile.ZipInfo(str(name), datetime.now().timetuple()[:6])
+            zipinfo.file_size = size
+            return zipinfo
+        else:
+            raise ValueError(f"Not supported: type={self.type}")
+
+    def get_name_from_info(self, info):
+        if self.type == "tar":
+            assert isinstance(info, tarfile.TarInfo), type(info)
+            return info.name
+        elif self.type == "zip":
+            assert isinstance(info, zipfile.ZipInfo), type(info)
+            return info.filename
+        else:
+            raise ValueError(f"Not supported: type={self.type}")
+
+    def extract(self, info, path=None):
+        if self.type == "tar":
+            return self.fopen.extract(info, path)
+        elif self.type == "zip":
+            return self.fopen.extract(info, path)
+        else:
+            raise ValueError(f"Not supported: type={self.type}")
+
+    def extractfile(self, info, mode="r"):
+        if self.type == "tar":
+            f = self.fopen.extractfile(info)
+            if mode == "r":
+                return TextIOWrapper(f)
+            else:
+                return f
+        elif self.type == "zip":
+            if mode == "rb":
+                mode = "r"
+            return self.fopen.open(info, mode)
+        else:
+            raise ValueError(f"Not supported: type={self.type}")
+
+
+def find_path_and_change_it_recursive(value, src: str, tgt: str):
+    if isinstance(value, dict):
+        return {
+            k: find_path_and_change_it_recursive(v, src, tgt) for k, v in value.items()
+        }
+    elif isinstance(value, (list, tuple)):
+        return [find_path_and_change_it_recursive(v, src, tgt) for v in value]
+    elif isinstance(value, str) and Path(value) == Path(src):
+        return tgt
+    else:
+        return value
+
+
+def get_dict_from_cache(meta: Union[Path, str]) -> Optional[Dict[str, str]]:
+    meta = Path(meta)
+    outpath = meta.parent.parent
+    if not meta.exists():
+        return None
+
+    with meta.open("r", encoding="utf-8") as f:
+        d = yaml.safe_load(f)
+        assert isinstance(d, dict), type(d)
+        yaml_files = d["yaml_files"]
+        files = d["files"]
+        assert isinstance(yaml_files, dict), type(yaml_files)
+        assert isinstance(files, dict), type(files)
+
+        retval = {}
+        for key, value in list(yaml_files.items()) + list(files.items()):
+            if not (outpath / value).exists():
+                return None
+            retval[key] = str(outpath / value)
+        return retval
+
+
+def unpack(
+    input_archive: Union[Path, str],
+    outpath: Union[Path, str],
+    use_cache: bool = True,
+) -> Dict[str, str]:
+    """Scan all files in the archive file and return as a dict of files.
+
+    Examples:
+        tarfile:
+           model.pth
+           some1.file
+           some2.file
+
+        >>> unpack("tarfile", "out")
+        {'asr_model_file': 'out/model.pth'}
+    """
+    input_archive = Path(input_archive)
+    outpath = Path(outpath)
+
+    with Archiver(input_archive) as archive:
+        for info in archive:
+            if Path(archive.get_name_from_info(info)).name == "meta.yaml":
+                if (
+                    use_cache
+                    and (outpath / Path(archive.get_name_from_info(info))).exists()
+                ):
+                    retval = get_dict_from_cache(
+                        outpath / Path(archive.get_name_from_info(info))
+                    )
+                    if retval is not None:
+                        return retval
+                d = yaml.safe_load(archive.extractfile(info))
+                assert isinstance(d, dict), type(d)
+                yaml_files = d["yaml_files"]
+                files = d["files"]
+                assert isinstance(yaml_files, dict), type(yaml_files)
+                assert isinstance(files, dict), type(files)
+                break
+        else:
+            raise RuntimeError("Format error: meta.yaml not found")
+
+        for info in archive:
+            fname = archive.get_name_from_info(info)
+            outname = outpath / fname
+            outname.parent.mkdir(parents=True, exist_ok=True)
+            if fname in set(yaml_files.values()):
+                d = yaml.safe_load(archive.extractfile(info))
+                # Rewrite yaml
+                for info2 in archive:
+                    name = archive.get_name_from_info(info2)
+                    d = find_path_and_change_it_recursive(d, name, str(outpath / name))
+                with outname.open("w", encoding="utf-8") as f:
+                    yaml.safe_dump(d, f)
+            else:
+                archive.extract(info, path=outpath)
+
+        retval = {}
+        for key, value in list(yaml_files.items()) + list(files.items()):
+            retval[key] = str(outpath / value)
+        return retval
+
+
+def _to_relative_or_resolve(f):
+    # Resolve to avoid symbolic link
+    p = Path(f).resolve()
+    try:
+        # Change to relative if it can
+        p = p.relative_to(Path(".").resolve())
+    except ValueError:
+        pass
+    return str(p)
+
+
+def pack(
+    files: Dict[str, Union[str, Path]],
+    yaml_files: Dict[str, Union[str, Path]],
+    outpath: Union[str, Path],
+    option: Iterable[Union[str, Path]] = (),
+):
+    for v in list(files.values()) + list(yaml_files.values()) + list(option):
+        if not Path(v).exists():
+            raise FileNotFoundError(f"No such file or directory: {v}")
+
+    files = {k: _to_relative_or_resolve(v) for k, v in files.items()}
+    yaml_files = {k: _to_relative_or_resolve(v) for k, v in yaml_files.items()}
+    option = [_to_relative_or_resolve(v) for v in option]
+
+    meta_objs = dict(
+        files=files,
+        yaml_files=yaml_files,
+        timestamp=datetime.now().timestamp(),
+        python=sys.version,
+    )
+
+    try:
+        import torch
+
+        meta_objs.update(torch=str(torch.__version__))
+    except ImportError:
+        pass
+    try:
+        import espnet
+
+        meta_objs.update(espnet=espnet.__version__)
+    except ImportError:
+        pass
+
+    Path(outpath).parent.mkdir(parents=True, exist_ok=True)
+    with Archiver(outpath, mode="w") as archive:
+        # Write meta.yaml
+        fileobj = BytesIO(yaml.safe_dump(meta_objs).encode())
+        info = archive.generate_info("meta.yaml", fileobj.getbuffer().nbytes)
+        archive.addfile(info, fileobj=fileobj)
+
+        for f in list(yaml_files.values()) + list(files.values()) + list(option):
+            archive.add(f)
+
+    print(f"Generate: {outpath}")
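Note on `pack()`: the metadata file is never written to disk first; it is serialized into a `BytesIO` buffer and added to the archive with an explicit size. The sketch below reproduces that pattern with the standard library only. It is an illustration under stated assumptions: it uses JSON instead of YAML to stay dependency-free, and `write_meta`, `read_meta`, and the file names are hypothetical and not part of this patch.

```python
import io
import json
import tarfile
import tempfile
from pathlib import Path


def write_meta(archive_path, meta):
    """Add an in-memory metadata file to a new gzip-compressed tar archive."""
    payload = json.dumps(meta).encode("utf-8")
    info = tarfile.TarInfo("meta.json")
    # The size must be set explicitly, otherwise zero bytes are stored
    info.size = len(payload)
    with tarfile.open(archive_path, mode="w:gz") as tar:
        tar.addfile(info, fileobj=io.BytesIO(payload))


def read_meta(archive_path):
    """Scan the archive for the metadata member and parse it."""
    # mode="r" lets tarfile detect the compression transparently
    with tarfile.open(archive_path, mode="r") as tar:
        for info in tar:
            if Path(info.name).name == "meta.json":
                return json.load(tar.extractfile(info))
    raise RuntimeError("meta.json not found")


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "model.tar.gz"
    write_meta(path, {"files": {"asr_model_file": "exp/model.pth"}, "yaml_files": {}})
    restored = read_meta(path)
```

The lookup loop mirrors how `unpack()` searches every member for the metadata entry before extracting anything else.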
diff --git a/funasr/models/ctc.py b/funasr/models/ctc.py
new file mode 100644
index 000000000..64b87106a
--- /dev/null
+++ b/funasr/models/ctc.py
@@ -0,0 +1,187 @@
+import logging
+
+import torch
+import torch.nn.functional as F
+from typeguard import check_argument_types
+
+
+class CTC(torch.nn.Module):
+    """CTC module.
+
+    Args:
+        odim: dimension of outputs
+        encoder_output_size: number of encoder projection units
+        dropout_rate: dropout rate (0.0 ~ 1.0)
+        ctc_type: builtin, warpctc, or gtnctc
+        reduce: reduce the CTC loss into a scalar
+    """
+
+    def __init__(
+        self,
+        odim: int,
+        encoder_output_size: int,
+        dropout_rate: float = 0.0,
+        ctc_type: str = "builtin",
+        reduce: bool = True,
+        ignore_nan_grad: bool = True,
+    ):
+        assert check_argument_types()
+        super().__init__()
+        eprojs = encoder_output_size
+        self.dropout_rate = dropout_rate
+        self.ctc_lo = torch.nn.Linear(eprojs, odim)
+        self.ctc_type = ctc_type
+        self.ignore_nan_grad = ignore_nan_grad
+
+        if self.ctc_type == "builtin":
+            self.ctc_loss = torch.nn.CTCLoss(reduction="none")
+        elif self.ctc_type == "warpctc":
+            import warpctc_pytorch as warp_ctc
+
+            if ignore_nan_grad:
+                logging.warning("ignore_nan_grad option is not supported for warp_ctc")
+            self.ctc_loss = warp_ctc.CTCLoss(size_average=True, reduce=reduce)
+
+        elif self.ctc_type == "gtnctc":
+            from espnet.nets.pytorch_backend.gtn_ctc import GTNCTCLossFunction
+
+            self.ctc_loss = GTNCTCLossFunction.apply
+        else:
+            raise ValueError(
+                f'ctc_type must be "builtin", "warpctc" or "gtnctc": {self.ctc_type}'
+            )
+
+        self.reduce = reduce
+
+    def loss_fn(self, th_pred, th_target, th_ilen, th_olen) -> torch.Tensor:
+        if self.ctc_type == "builtin":
+            th_pred = th_pred.log_softmax(2)
+            loss = self.ctc_loss(th_pred, th_target, th_ilen, th_olen)
+
+            if loss.requires_grad and self.ignore_nan_grad:
+                # ctc_grad: (L, B, O)
+                ctc_grad = loss.grad_fn(torch.ones_like(loss))
+                ctc_grad = ctc_grad.sum([0, 2])
+                indices = torch.isfinite(ctc_grad)
+                size = indices.long().sum()
+                if size == 0:
+                    # Return as is
+                    logging.warning(
+                        "All samples in this mini-batch got nan grad."
+                        " Returning nan value instead of CTC loss"
+                    )
+                elif size != th_pred.size(1):
+                    logging.warning(
+                        f"{th_pred.size(1) - size}/{th_pred.size(1)}"
+                        " samples got nan grad."
+                        " These were ignored for CTC loss."
+                    )
+
+                    # Create mask for target
+                    target_mask = torch.full(
+                        [th_target.size(0)],
+                        1,
+                        dtype=torch.bool,
+                        device=th_target.device,
+                    )
+                    s = 0
+                    for ind, le in enumerate(th_olen):
+                        if not indices[ind]:
+                            target_mask[s : s + le] = 0
+                        s += le
+
+                    # Calc loss again using masked data
+                    loss = self.ctc_loss(
+                        th_pred[:, indices, :],
+                        th_target[target_mask],
+                        th_ilen[indices],
+                        th_olen[indices],
+                    )
+            else:
+                size = th_pred.size(1)
+
+            if self.reduce:
+                # Batch-size average
+                loss = loss.sum() / size
+            else:
+                loss = loss / size
+            return loss
+
+        elif self.ctc_type == "warpctc":
+            # warpctc only supports float32
+            th_pred = th_pred.to(dtype=torch.float32)
+
+            th_target = th_target.cpu().int()
+            th_ilen = th_ilen.cpu().int()
+            th_olen = th_olen.cpu().int()
+            loss = self.ctc_loss(th_pred, th_target, th_ilen, th_olen)
+            if self.reduce:
+                # NOTE: sum() is needed for consistency, since warpctc
+                # returns a tensor with shape (1,)
+                # but builtin returns a tensor without shape (scalar).
+                loss = loss.sum()
+            return loss
+
+        elif self.ctc_type == "gtnctc":
+            log_probs = torch.nn.functional.log_softmax(th_pred, dim=2)
+            return self.ctc_loss(log_probs, th_target, th_ilen, 0, "none")
+
+        else:
+            raise NotImplementedError
+
+    def forward(self, hs_pad, hlens, ys_pad, ys_lens):
+        """Calculate CTC loss.
+
+        Args:
+            hs_pad: batch of padded hidden state sequences (B, Tmax, D)
+            hlens: batch of lengths of hidden state sequences (B)
+            ys_pad: batch of padded character id sequence tensor (B, Lmax)
+            ys_lens: batch of lengths of character sequence (B)
+        """
+        # hs_pad: (B, L, NProj) -> ys_hat: (B, L, Nvocab)
+        ys_hat = self.ctc_lo(F.dropout(hs_pad, p=self.dropout_rate))
+
+        if self.ctc_type == "gtnctc":
+            # gtn expects list form for ys
+            ys_true = [y[y != -1] for y in ys_pad]  # parse padded ys
+        else:
+            # ys_hat: (B, L, D) -> (L, B, D)
+            ys_hat = ys_hat.transpose(0, 1)
+            # (B, Lmax) -> (sum(ys_lens),)
+            ys_true = torch.cat([ys_pad[i, :l] for i, l in enumerate(ys_lens)])
+
+        loss = self.loss_fn(ys_hat, ys_true, hlens, ys_lens).to(
+            device=hs_pad.device, dtype=hs_pad.dtype
+        )
+
+        return loss
+
+    def softmax(self, hs_pad):
+        """softmax of frame activations
+
+        Args:
+            Tensor hs_pad: 3d tensor (B, Tmax, eprojs)
+        Returns:
+            torch.Tensor: softmax applied 3d tensor (B, Tmax, odim)
+        """
+        return F.softmax(self.ctc_lo(hs_pad), dim=2)
+
+    def log_softmax(self, hs_pad):
+        """log_softmax of frame activations
+
+        Args:
+            Tensor hs_pad: 3d tensor (B, Tmax, eprojs)
+        Returns:
+            torch.Tensor: log softmax applied 3d tensor (B, Tmax, odim)
+        """
+        return F.log_softmax(self.ctc_lo(hs_pad), dim=2)
+
+    def argmax(self, hs_pad):
+        """argmax of frame activations
+
+        Args:
+            torch.Tensor hs_pad: 3d tensor (B, Tmax, eprojs)
+        Returns:
+            torch.Tensor: argmax applied 2d tensor (B, Tmax)
+        """
+        return torch.argmax(self.ctc_lo(hs_pad), dim=2)
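Note on the target handling in `CTC.forward()` and `CTC.loss_fn()`: padded targets are concatenated into one flat sequence, and when some samples produce non-finite gradients, their slice of that flat sequence is masked out before the loss is recomputed. The sketch below shows only this bookkeeping with plain Python lists rather than tensors; `flatten_targets`, `mask_targets`, and the toy values are hypothetical and not part of this patch.

```python
def flatten_targets(ys_pad, ys_lens):
    """Concatenate the valid prefix of each padded row into one flat list."""
    flat = []
    for row, length in zip(ys_pad, ys_lens):
        flat.extend(row[:length])
    return flat


def mask_targets(flat_targets, ys_lens, keep):
    """Drop the flat-target slice of every sample whose keep flag is False."""
    kept = []
    start = 0
    for length, ok in zip(ys_lens, keep):
        if ok:
            kept.extend(flat_targets[start : start + length])
        # Advance the offset whether or not the sample was kept
        start += length
    return kept


# Three samples padded with -1; the second is assumed to have a non-finite grad
ys_pad = [[3, 4, -1], [5, -1, -1], [6, 7, 8]]
ys_lens = [2, 1, 3]
keep = [True, False, True]

flat = flatten_targets(ys_pad, ys_lens)
kept = mask_targets(flat, ys_lens, keep)
```

The running offset is the important detail: it has to advance for dropped samples too, otherwise later samples would be paired with the wrong targets.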
diff --git a/funasr/models/decoder/__init__.py b/funasr/models/decoder/__init__.py
new file mode 100644
index 000000000..e69de29bb
diff --git a/funasr/models/decoder/abs_decoder.py b/funasr/models/decoder/abs_decoder.py
new file mode 100644
index 000000000..bc8acf44c
--- /dev/null
+++ b/funasr/models/decoder/abs_decoder.py
@@ -0,0 +1,19 @@
+from abc import ABC
+from abc import abstractmethod
+from typing import Tuple
+
+import torch
+
+from funasr.modules.scorers.scorer_interface import ScorerInterface
+
+
+class AbsDecoder(torch.nn.Module, ScorerInterface, ABC):
+    @abstractmethod
+    def forward(
+        self,
+        hs_pad: torch.Tensor,
+        hlens: torch.Tensor,
+        ys_in_pad: torch.Tensor,
+        ys_in_lens: torch.Tensor,
+    ) -> Tuple[torch.Tensor, torch.Tensor]:
+        raise NotImplementedError
diff --git a/funasr/models/decoder/rnn_decoder.py b/funasr/models/decoder/rnn_decoder.py
new file mode 100644
index 000000000..80709c9be
--- /dev/null
+++ b/funasr/models/decoder/rnn_decoder.py
@@ -0,0 +1,334 @@
+import random
+
+import numpy as np
+import torch
+import torch.nn.functional as F
+from typeguard import check_argument_types
+
+from funasr.modules.nets_utils import make_pad_mask
+from funasr.modules.nets_utils import to_device
+from funasr.modules.rnn.attentions import initial_att
+from funasr.models.decoder.abs_decoder import AbsDecoder
+from funasr.utils.get_default_kwargs import get_default_kwargs
+
+
+def build_attention_list(
+    eprojs: int,
+    dunits: int,
+    atype: str = "location",
+    num_att: int = 1,
+    num_encs: int = 1,
+    aheads: int = 4,
+    adim: int = 320,
+    awin: int = 5,
+    aconv_chans: int = 10,
+    aconv_filts: int = 100,
+    han_mode: bool = False,
+    han_type=None,
+    han_heads: int = 4,
+    han_dim: int = 320,
+    han_conv_chans: int = -1,
+    han_conv_filts: int = 100,
+    han_win: int = 5,
+):
+
+    att_list = torch.nn.ModuleList()
+    if num_encs == 1:
+        for i in range(num_att):
+            att = initial_att(
+                atype,
+                eprojs,
+                dunits,
+                aheads,
+                adim,
+                awin,
+                aconv_chans,
+                aconv_filts,
+            )
+            att_list.append(att)
+    elif num_encs > 1:  # no multi-speaker mode
+        if han_mode:
+            att = initial_att(
+                han_type,
+                eprojs,
+                dunits,
+                han_heads,
+                han_dim,
+                han_win,
+                han_conv_chans,
+                han_conv_filts,
+                han_mode=True,
+            )
+            return att
+        else:
+            att_list = torch.nn.ModuleList()
+            for idx in range(num_encs):
+                att = initial_att(
+                    atype[idx],
+                    eprojs,
+                    dunits,
+                    aheads[idx],
+                    adim[idx],
+                    awin[idx],
+                    aconv_chans[idx],
+                    aconv_filts[idx],
+                )
+                att_list.append(att)
+    else:
+        raise ValueError(
+            "Number of encoders needs to be at least one. {}".format(num_encs)
+        )
+    return att_list
+
+
+class RNNDecoder(AbsDecoder):
+    def __init__(
+        self,
+        vocab_size: int,
+        encoder_output_size: int,
+        rnn_type: str = "lstm",
+        num_layers: int = 1,
+        hidden_size: int = 320,
+        sampling_probability: float = 0.0,
+        dropout: float = 0.0,
+        context_residual: bool = False,
+        replace_sos: bool = False,
+        num_encs: int = 1,
+        att_conf: dict = get_default_kwargs(build_attention_list),
+    ):
+        # FIXME(kamo): The parts of num_spk should be refactored further
+        assert check_argument_types()
+        if rnn_type not in {"lstm", "gru"}:
+            raise ValueError(f"Not supported: rnn_type={rnn_type}")
+
+        super().__init__()
+        eprojs = encoder_output_size
+        self.dtype = rnn_type
+        self.dunits = hidden_size
+        self.dlayers = num_layers
+        self.context_residual = context_residual
+        self.sos = vocab_size - 1
+        self.eos = vocab_size - 1
+        self.odim = vocab_size
+        self.sampling_probability = sampling_probability
+        self.dropout = dropout
+        self.num_encs = num_encs
+
+        # for multilingual translation
+        self.replace_sos = replace_sos
+
+        self.embed = torch.nn.Embedding(vocab_size, hidden_size)
+        self.dropout_emb = torch.nn.Dropout(p=dropout)
+
+        self.decoder = torch.nn.ModuleList()
+        self.dropout_dec = torch.nn.ModuleList()
+        self.decoder += [
+            torch.nn.LSTMCell(hidden_size + eprojs, hidden_size)
+            if self.dtype == "lstm"
+            else torch.nn.GRUCell(hidden_size + eprojs, hidden_size)
+        ]
+        self.dropout_dec += [torch.nn.Dropout(p=dropout)]
+        for _ in range(1, self.dlayers):
+            self.decoder += [
+                torch.nn.LSTMCell(hidden_size, hidden_size)
+                if self.dtype == "lstm"
+                else torch.nn.GRUCell(hidden_size, hidden_size)
+            ]
+            self.dropout_dec += [torch.nn.Dropout(p=dropout)]
+            # NOTE: dropout is applied only for the vertical connections
+            # see https://arxiv.org/pdf/1409.2329.pdf
+
+        if context_residual:
+            self.output = torch.nn.Linear(hidden_size + eprojs, vocab_size)
+        else:
+            self.output = torch.nn.Linear(hidden_size, vocab_size)
+
+        self.att_list = build_attention_list(
+            eprojs=eprojs, dunits=hidden_size, **att_conf
+        )
+
+    def zero_state(self, hs_pad):
+        return hs_pad.new_zeros(hs_pad.size(0), self.dunits)
+
+    def rnn_forward(self, ey, z_list, c_list, z_prev, c_prev):
+        if self.dtype == "lstm":
+            z_list[0], c_list[0] = self.decoder[0](ey, (z_prev[0], c_prev[0]))
+            for i in range(1, self.dlayers):
+                z_list[i], c_list[i] = self.decoder[i](
+                    self.dropout_dec[i - 1](z_list[i - 1]),
+                    (z_prev[i], c_prev[i]),
+                )
+        else:
+            z_list[0] = self.decoder[0](ey, z_prev[0])
+            for i in range(1, self.dlayers):
+                z_list[i] = self.decoder[i](
+                    self.dropout_dec[i - 1](z_list[i - 1]), z_prev[i]
+                )
+        return z_list, c_list
+
+    def forward(self, hs_pad, hlens, ys_in_pad, ys_in_lens, strm_idx=0):
+        # To support multiple-encoder ASR mode, in single-encoder mode
+        # convert torch.Tensor to a list of torch.Tensor
+        if self.num_encs == 1:
+            hs_pad = [hs_pad]
+            hlens = [hlens]
+
+        # attention index for the attention module
+        # in SPA (speaker parallel attention),
+        # att_idx is used to select attention module. In other cases, it is 0.
+        att_idx = min(strm_idx, len(self.att_list) - 1)
+
+        # hlens should be list of list of integer
+        hlens = [list(map(int, hlens[idx])) for idx in range(self.num_encs)]
+
+        # get dim, length info
+        olength = ys_in_pad.size(1)
+
+        # initialization
+        c_list = [self.zero_state(hs_pad[0])]
+        z_list = [self.zero_state(hs_pad[0])]
+        for _ in range(1, self.dlayers):
+            c_list.append(self.zero_state(hs_pad[0]))
+            z_list.append(self.zero_state(hs_pad[0]))
+        z_all = []
+        if self.num_encs == 1:
+            att_w = None
+            self.att_list[att_idx].reset()  # reset pre-computation of h
+        else:
+            att_w_list = [None] * (self.num_encs + 1)  # atts + han
+            att_c_list = [None] * self.num_encs  # atts
+            for idx in range(self.num_encs + 1):
+                # reset pre-computation of h in atts and han
+                self.att_list[idx].reset()
+
+        # pre-computation of embedding
+        eys = self.dropout_emb(self.embed(ys_in_pad))  # utt x olen x zdim
+
+        # loop for an output sequence
+        for i in range(olength):
+            if self.num_encs == 1:
+                att_c, att_w = self.att_list[att_idx](
+                    hs_pad[0], hlens[0], self.dropout_dec[0](z_list[0]), att_w
+                )
+            else:
+                for idx in range(self.num_encs):
+                    att_c_list[idx], att_w_list[idx] = self.att_list[idx](
+                        hs_pad[idx],
+                        hlens[idx],
+                        self.dropout_dec[0](z_list[0]),
+                        att_w_list[idx],
+                    )
+                hs_pad_han = torch.stack(att_c_list, dim=1)
+                hlens_han = [self.num_encs] * len(ys_in_pad)
+                att_c, att_w_list[self.num_encs] = self.att_list[self.num_encs](
+                    hs_pad_han,
+                    hlens_han,
+                    self.dropout_dec[0](z_list[0]),
+                    att_w_list[self.num_encs],
+                )
+            if i > 0 and random.random() < self.sampling_probability:
+                z_out = self.output(z_all[-1])
+                z_out = np.argmax(z_out.detach().cpu(), axis=1)
+                z_out = self.dropout_emb(self.embed(to_device(self, z_out)))
+                ey = torch.cat((z_out, att_c), dim=1)  # utt x (zdim + hdim)
+            else:
+                # utt x (zdim + hdim)
+                ey = torch.cat((eys[:, i, :], att_c), dim=1)
+            z_list, c_list = self.rnn_forward(ey, z_list, c_list, z_list, c_list)
+            if self.context_residual:
+                z_all.append(
+                    torch.cat((self.dropout_dec[-1](z_list[-1]), att_c), dim=-1)
+                )  # utt x (zdim + hdim)
+            else:
+                z_all.append(self.dropout_dec[-1](z_list[-1]))  # utt x (zdim)
+
+        z_all = torch.stack(z_all, dim=1)
+        z_all = self.output(z_all)
+        z_all.masked_fill_(
+            make_pad_mask(ys_in_lens, z_all, 1),
+            0,
+        )
+        return z_all, ys_in_lens
+
+    def init_state(self, x):
+        # To support multiple-encoder ASR mode, in single-encoder mode
+        # convert torch.Tensor to a list of torch.Tensor
+        if self.num_encs == 1:
+            x = [x]
+
+        c_list = [self.zero_state(x[0].unsqueeze(0))]
+        z_list = [self.zero_state(x[0].unsqueeze(0))]
+        for _ in range(1, self.dlayers):
+            c_list.append(self.zero_state(x[0].unsqueeze(0)))
+            z_list.append(self.zero_state(x[0].unsqueeze(0)))
+        # TODO(karita): support strm_index for `asr_mix`
+        strm_index = 0
+        att_idx = min(strm_index, len(self.att_list) - 1)
+        if self.num_encs == 1:
+            a = None
+            self.att_list[att_idx].reset()  # reset pre-computation of h
+        else:
+            a = [None] * (self.num_encs + 1)  # atts + han
+            for idx in range(self.num_encs + 1):
+                # reset pre-computation of h in atts and han
+                self.att_list[idx].reset()
+        return dict(
+            c_prev=c_list[:],
+            z_prev=z_list[:],
+            a_prev=a,
+            workspace=(att_idx, z_list, c_list),
+        )
+
+    def score(self, yseq, state, x):
+        # To support multiple-encoder ASR mode, in single-encoder mode
+        # convert torch.Tensor to a list of torch.Tensor
+        if self.num_encs == 1:
+            x = [x]
+
+        att_idx, z_list, c_list = state["workspace"]
+        vy = yseq[-1].unsqueeze(0)
+        ey = self.dropout_emb(self.embed(vy))  # utt list (1) x zdim
+        if self.num_encs == 1:
+            att_c, att_w = self.att_list[att_idx](
+                x[0].unsqueeze(0),
+                [x[0].size(0)],
+                self.dropout_dec[0](state["z_prev"][0]),
+                state["a_prev"],
+            )
+        else:
+            att_w = [None] * (self.num_encs + 1)  # atts + han
+            att_c_list = [None] * self.num_encs  # atts
+            for idx in range(self.num_encs):
+                att_c_list[idx], att_w[idx] = self.att_list[idx](
+                    x[idx].unsqueeze(0),
+                    [x[idx].size(0)],
+                    self.dropout_dec[0](state["z_prev"][0]),
+                    state["a_prev"][idx],
+                )
+            h_han = torch.stack(att_c_list, dim=1)
+            att_c, att_w[self.num_encs] = self.att_list[self.num_encs](
+                h_han,
+                [self.num_encs],
+                self.dropout_dec[0](state["z_prev"][0]),
+                state["a_prev"][self.num_encs],
+            )
+        ey = torch.cat((ey, att_c), dim=1)  # utt(1) x (zdim + hdim)
+        z_list, c_list = self.rnn_forward(
+            ey, z_list, c_list, state["z_prev"], state["c_prev"]
+        )
+        if self.context_residual:
+            logits = self.output(
+                torch.cat((self.dropout_dec[-1](z_list[-1]), att_c), dim=-1)
+            )
+        else:
+            logits = self.output(self.dropout_dec[-1](z_list[-1]))
+        logp = F.log_softmax(logits, dim=1).squeeze(0)
+        return (
+            logp,
+            dict(
+                c_prev=c_list[:],
+                z_prev=z_list[:],
+                a_prev=att_w,
+                workspace=(att_idx, z_list, c_list),
+            ),
+        )
diff --git a/funasr/models/decoder/sanm_decoder.py b/funasr/models/decoder/sanm_decoder.py
new file mode 100644
index 000000000..a5db353ba
--- /dev/null
+++ b/funasr/models/decoder/sanm_decoder.py
@@ -0,0 +1,616 @@
+from typing import List
+from typing import Tuple
+
+import torch
+import torch.nn as nn
+from funasr.modules.streaming_utils import utils as myutils
+from funasr.models.decoder.transformer_decoder import BaseTransformerDecoder
+from typeguard import check_argument_types
+
+from funasr.modules.attention import MultiHeadedAttentionSANMDecoder, MultiHeadedAttentionCrossAtt
+from funasr.modules.embedding import PositionalEncoding
+from funasr.modules.layer_norm import LayerNorm
+from funasr.modules.positionwise_feed_forward import PositionwiseFeedForwardDecoderSANM
+from funasr.modules.repeat import repeat
+
+
+class DecoderLayerSANM(nn.Module):
+    """Single decoder layer module.
+
+    Args:
+        size (int): Input dimension.
+        self_attn (torch.nn.Module): Self-attention module instance.
+            `MultiHeadedAttention` instance can be used as the argument.
+        src_attn (torch.nn.Module): Source-attention (cross-attention) module instance.
+            `MultiHeadedAttention` instance can be used as the argument.
+        feed_forward (torch.nn.Module): Feed-forward module instance.
+            `PositionwiseFeedForward`, `MultiLayeredConv1d`, or `Conv1dLinear` instance
+            can be used as the argument.
+        dropout_rate (float): Dropout rate.
+        normalize_before (bool): Whether to use layer_norm before the first block.
+        concat_after (bool): Whether to concat attention layer's input and output.
+            if True, additional linear will be applied.
+            i.e. x -> x + linear(concat(x, att(x)))
+            if False, no additional linear will be applied. i.e. x -> x + att(x)
+
+
+    """
+
+    def __init__(
+        self,
+        size,
+        self_attn,
+        src_attn,
+        feed_forward,
+        dropout_rate,
+        normalize_before=True,
+        concat_after=False,
+    ):
+        """Construct a DecoderLayerSANM object."""
+        super(DecoderLayerSANM, self).__init__()
+        self.size = size
+        self.self_attn = self_attn
+        self.src_attn = src_attn
+        self.feed_forward = feed_forward
+        self.norm1 = LayerNorm(size)
+        if self_attn is not None:
+            self.norm2 = LayerNorm(size)
+        if src_attn is not None:
+            self.norm3 = LayerNorm(size)
+        self.dropout = nn.Dropout(dropout_rate)
+        self.normalize_before = normalize_before
+        self.concat_after = concat_after
+        if self.concat_after:
+            self.concat_linear1 = nn.Linear(size + size, size)
+            self.concat_linear2 = nn.Linear(size + size, size)
+
+    def forward(self, tgt, tgt_mask, memory, memory_mask=None, cache=None):
+        """Compute decoded features.
+
+        Args:
+            tgt (torch.Tensor): Input tensor (#batch, maxlen_out, size).
+            tgt_mask (torch.Tensor): Mask for input tensor (#batch, maxlen_out).
+            memory (torch.Tensor): Encoded memory, float32 (#batch, maxlen_in, size).
+            memory_mask (torch.Tensor): Encoded memory mask (#batch, maxlen_in).
+            cache (List[torch.Tensor]): List of cached tensors.
+                Each tensor shape should be (#batch, maxlen_out - 1, size).
+
+        Returns:
+            torch.Tensor: Output tensor(#batch, maxlen_out, size).
+            torch.Tensor: Mask for output tensor (#batch, maxlen_out).
+            torch.Tensor: Encoded memory (#batch, maxlen_in, size).
+            torch.Tensor: Encoded memory mask (#batch, maxlen_in).
+            torch.Tensor: Self-attention cache (the input cache if self_attn is None).
+        """
+        # tgt = self.dropout(tgt)
+        residual = tgt
+        if self.normalize_before:
+            tgt = self.norm1(tgt)
+        tgt = self.feed_forward(tgt)
+
+        x = tgt
+        if self.self_attn is not None:
+            if self.normalize_before:
+                tgt = self.norm2(tgt)
+            if self.training:
+                cache = None
+            x, cache = self.self_attn(tgt, tgt_mask, cache=cache)
+            x = residual + self.dropout(x)
+
+        if self.src_attn is not None:
+            residual = x
+            if self.normalize_before:
+                x = self.norm3(x)
+
+            x = residual + self.dropout(self.src_attn(x, memory, memory_mask))
+
+
+        return x, tgt_mask, memory, memory_mask, cache
+
+
+class FsmnDecoderSCAMAOpt(BaseTransformerDecoder):
+    """
+    author: Speech Lab, Alibaba Group, China
+    SCAMA: Streaming chunk-aware multihead attention for online end-to-end speech recognition
+    https://arxiv.org/abs/2006.01713
+
+    """
+    def __init__(
+            self,
+            vocab_size: int,
+            encoder_output_size: int,
+            attention_heads: int = 4,
+            linear_units: int = 2048,
+            num_blocks: int = 6,
+            dropout_rate: float = 0.1,
+            positional_dropout_rate: float = 0.1,
+            self_attention_dropout_rate: float = 0.0,
+            src_attention_dropout_rate: float = 0.0,
+            input_layer: str = "embed",
+            use_output_layer: bool = True,
+            pos_enc_class=PositionalEncoding,
+            normalize_before: bool = True,
+            concat_after: bool = False,
+            att_layer_num: int = 6,
+            kernel_size: int = 21,
+            sanm_shfit: int = None,
+            concat_embeds: bool = False,
+            attention_dim: int = None,
+    ):
+        assert check_argument_types()
+        super().__init__(
+            vocab_size=vocab_size,
+            encoder_output_size=encoder_output_size,
+            dropout_rate=dropout_rate,
+            positional_dropout_rate=positional_dropout_rate,
+            input_layer=input_layer,
+            use_output_layer=use_output_layer,
+            pos_enc_class=pos_enc_class,
+            normalize_before=normalize_before,
+        )
+        if attention_dim is None:
+            attention_dim = encoder_output_size
+
+        if input_layer == "embed":
+            self.embed = torch.nn.Sequential(
+                torch.nn.Embedding(vocab_size, attention_dim),
+            )
+        elif input_layer == "linear":
+            self.embed = torch.nn.Sequential(
+                torch.nn.Linear(vocab_size, attention_dim),
+                torch.nn.LayerNorm(attention_dim),
+                torch.nn.Dropout(dropout_rate),
+                torch.nn.ReLU(),
+                pos_enc_class(attention_dim, positional_dropout_rate),
+            )
+        else:
+            raise ValueError(f"only 'embed' or 'linear' is supported: {input_layer}")
+
+        self.normalize_before = normalize_before
+        if self.normalize_before:
+            self.after_norm = LayerNorm(attention_dim)
+        if use_output_layer:
+            self.output_layer = torch.nn.Linear(attention_dim, vocab_size)
+        else:
+            self.output_layer = None
+
+        self.att_layer_num = att_layer_num
+        self.num_blocks = num_blocks
+        if sanm_shfit is None:
+            sanm_shfit = (kernel_size - 1) // 2
+        self.decoders = repeat(
+            att_layer_num,
+            lambda lnum: DecoderLayerSANM(
+                attention_dim,
+                MultiHeadedAttentionSANMDecoder(
+                    attention_dim, self_attention_dropout_rate, kernel_size, sanm_shfit=sanm_shfit
+                ),
+                MultiHeadedAttentionCrossAtt(
+                    attention_heads, attention_dim, src_attention_dropout_rate, encoder_output_size=encoder_output_size
+                ),
+                PositionwiseFeedForwardDecoderSANM(attention_dim, linear_units, dropout_rate),
+                dropout_rate,
+                normalize_before,
+                concat_after,
+            ),
+        )
+        if num_blocks - att_layer_num <= 0:
+            self.decoders2 = None
+        else:
+            self.decoders2 = repeat(
+                num_blocks - att_layer_num,
+                lambda lnum: DecoderLayerSANM(
+                    attention_dim,
+                    MultiHeadedAttentionSANMDecoder(
+                        attention_dim, self_attention_dropout_rate, kernel_size, sanm_shfit=sanm_shfit
+                    ),
+                    None,
+                    PositionwiseFeedForwardDecoderSANM(attention_dim, linear_units, dropout_rate),
+                    dropout_rate,
+                    normalize_before,
+                    concat_after,
+                ),
+            )
+
+        self.decoders3 = repeat(
+            1,
+            lambda lnum: DecoderLayerSANM(
+                attention_dim,
+                None,
+                None,
+                PositionwiseFeedForwardDecoderSANM(attention_dim, linear_units, dropout_rate),
+                dropout_rate,
+                normalize_before,
+                concat_after,
+            ),
+        )
+        if concat_embeds:
+            self.embed_concat_ffn = repeat(
+                1,
+                lambda lnum: DecoderLayerSANM(
+                    attention_dim + encoder_output_size,
+                    None,
+                    None,
+                    PositionwiseFeedForwardDecoderSANM(attention_dim + encoder_output_size, linear_units, dropout_rate,
+                                                      adim=attention_dim),
+                    dropout_rate,
+                    normalize_before,
+                    concat_after,
+                ),
+            )
+        else:
+            self.embed_concat_ffn = None
+        self.concat_embeds = concat_embeds
+
+    def forward(
+            self,
+            hs_pad: torch.Tensor,
+            hlens: torch.Tensor,
+            ys_in_pad: torch.Tensor,
+            ys_in_lens: torch.Tensor,
+            chunk_mask: torch.Tensor = None,
+            pre_acoustic_embeds: torch.Tensor = None,
+    ) -> Tuple[torch.Tensor, torch.Tensor]:
+        """Forward decoder.
+
+        Args:
+            hs_pad: encoded memory, float32  (batch, maxlen_in, feat)
+            hlens: (batch)
+            ys_in_pad:
+                input token ids, int64 (batch, maxlen_out)
+                if input_layer == "embed"
+                input tensor (batch, maxlen_out, #mels) in the other cases
+            ys_in_lens: (batch)
+        Returns:
+            (tuple): tuple containing:
+
+            x: decoded token score before softmax (batch, maxlen_out, token)
+                if use_output_layer is True,
+            olens: (batch, )
+        """
+        tgt = ys_in_pad
+        tgt_mask = myutils.sequence_mask(ys_in_lens, device=tgt.device)[:, :, None]
+
+        memory = hs_pad
+        memory_mask = myutils.sequence_mask(hlens, device=memory.device)[:, None, :]
+        if chunk_mask is not None:
+            memory_mask = memory_mask * chunk_mask
+            if tgt_mask.size(1) != memory_mask.size(1):
+                memory_mask = torch.cat((memory_mask, memory_mask[:, -2:-1, :]), dim=1)
+
+        x = self.embed(tgt)
+
+        if pre_acoustic_embeds is not None and self.concat_embeds:
+            x = torch.cat((x, pre_acoustic_embeds), dim=-1)
+            x, _, _, _, _ = self.embed_concat_ffn(x, None, None, None, None)
+
+        x, tgt_mask, memory, memory_mask, _ = self.decoders(
+            x, tgt_mask, memory, memory_mask
+        )
+        if self.decoders2 is not None:
+            x, tgt_mask, memory, memory_mask, _ = self.decoders2(
+                x, tgt_mask, memory, memory_mask
+            )
+        x, tgt_mask, memory, memory_mask, _ = self.decoders3(
+            x, tgt_mask, memory, memory_mask
+        )
+        if self.normalize_before:
+            x = self.after_norm(x)
+        if self.output_layer is not None:
+            x = self.output_layer(x)
+
+        olens = tgt_mask.sum(1)
+        return x, olens
+
+    def score(self, ys, state, x, x_mask=None, pre_acoustic_embeds: torch.Tensor = None):
+        """Score."""
+        ys_mask = myutils.sequence_mask(torch.tensor([len(ys)], dtype=torch.int32), device=x.device)[:, :, None]
+        logp, state = self.forward_one_step(
+            ys.unsqueeze(0), ys_mask, x.unsqueeze(0), memory_mask=x_mask, pre_acoustic_embeds=pre_acoustic_embeds,
+            cache=state
+        )
+        return logp.squeeze(0), state
+
+    def forward_one_step(
+            self,
+            tgt: torch.Tensor,
+            tgt_mask: torch.Tensor,
+            memory: torch.Tensor,
+            memory_mask: torch.Tensor = None,
+            pre_acoustic_embeds: torch.Tensor = None,
+            cache: List[torch.Tensor] = None,
+    ) -> Tuple[torch.Tensor, List[torch.Tensor]]:
+        """Forward one step.
+
+        Args:
+            tgt: input token ids, int64 (batch, maxlen_out)
+            tgt_mask: input token mask,  (batch, maxlen_out)
+                      dtype=torch.uint8 in PyTorch 1.2-
+                      dtype=torch.bool in PyTorch 1.2+ (include 1.2)
+            memory: encoded memory, float32  (batch, maxlen_in, feat)
+            cache: cached output list of (batch, max_time_out-1, size)
+        Returns:
+            y, cache: NN output value and cache per `self.decoders`.
+            `y.shape` is (batch, token)
+        """
+
+        x = tgt[:, -1:]
+        tgt_mask = None
+        x = self.embed(x)
+
+        if pre_acoustic_embeds is not None and self.concat_embeds:
+            x = torch.cat((x, pre_acoustic_embeds), dim=-1)
+            x, _, _, _, _ = self.embed_concat_ffn(x, None, None, None, None)
+
+        if cache is None:
+            cache_layer_num = len(self.decoders)
+            if self.decoders2 is not None:
+                cache_layer_num += len(self.decoders2)
+            cache = [None] * cache_layer_num
+        new_cache = []
+        # for c, decoder in zip(cache, self.decoders):
+        for i in range(self.att_layer_num):
+            decoder = self.decoders[i]
+            c = cache[i]
+            x, tgt_mask, memory, memory_mask, c_ret = decoder(
+                x, tgt_mask, memory, memory_mask, cache=c
+            )
+            new_cache.append(c_ret)
+
+        if self.num_blocks - self.att_layer_num >= 1:
+            for i in range(self.num_blocks - self.att_layer_num):
+                j = i + self.att_layer_num
+                decoder = self.decoders2[i]
+                c = cache[j]
+                x, tgt_mask, memory, memory_mask, c_ret = decoder(
+                    x, tgt_mask, memory, memory_mask, cache=c
+                )
+                new_cache.append(c_ret)
+
+        for decoder in self.decoders3:
+            x, tgt_mask, memory, memory_mask, _ = decoder(
+                x, tgt_mask, memory, None, cache=None
+            )
+
+        if self.normalize_before:
+            y = self.after_norm(x[:, -1])
+        else:
+            y = x[:, -1]
+        if self.output_layer is not None:
+            y = self.output_layer(y)
+            y = torch.log_softmax(y, dim=-1)
+
+        return y, new_cache
+
+class ParaformerSANMDecoder(BaseTransformerDecoder):
+    """
+    author: Speech Lab, Alibaba Group, China
+    Paraformer: Fast and Accurate Parallel Transformer for Non-autoregressive End-to-End Speech Recognition
+    https://arxiv.org/abs/2206.08317
+    """
+    def __init__(
+        self,
+        vocab_size: int,
+        encoder_output_size: int,
+        attention_heads: int = 4,
+        linear_units: int = 2048,
+        num_blocks: int = 6,
+        dropout_rate: float = 0.1,
+        positional_dropout_rate: float = 0.1,
+        self_attention_dropout_rate: float = 0.0,
+        src_attention_dropout_rate: float = 0.0,
+        input_layer: str = "embed",
+        use_output_layer: bool = True,
+        pos_enc_class=PositionalEncoding,
+        normalize_before: bool = True,
+        concat_after: bool = False,
+        att_layer_num: int = 6,
+        kernel_size: int = 21,
+        sanm_shfit: int = 0,
+    ):
+        assert check_argument_types()
+        super().__init__(
+            vocab_size=vocab_size,
+            encoder_output_size=encoder_output_size,
+            dropout_rate=dropout_rate,
+            positional_dropout_rate=positional_dropout_rate,
+            input_layer=input_layer,
+            use_output_layer=use_output_layer,
+            pos_enc_class=pos_enc_class,
+            normalize_before=normalize_before,
+        )
+
+        attention_dim = encoder_output_size
+
+        if input_layer == "embed":
+            self.embed = torch.nn.Sequential(
+                torch.nn.Embedding(vocab_size, attention_dim),
+                # pos_enc_class(attention_dim, positional_dropout_rate),
+            )
+        elif input_layer == "linear":
+            self.embed = torch.nn.Sequential(
+                torch.nn.Linear(vocab_size, attention_dim),
+                torch.nn.LayerNorm(attention_dim),
+                torch.nn.Dropout(dropout_rate),
+                torch.nn.ReLU(),
+                pos_enc_class(attention_dim, positional_dropout_rate),
+            )
+        else:
+            raise ValueError(f"only 'embed' or 'linear' is supported: {input_layer}")
+
+        self.normalize_before = normalize_before
+        if self.normalize_before:
+            self.after_norm = LayerNorm(attention_dim)
+        if use_output_layer:
+            self.output_layer = torch.nn.Linear(attention_dim, vocab_size)
+        else:
+            self.output_layer = None
+
+        self.att_layer_num = att_layer_num
+        self.num_blocks = num_blocks
+        if sanm_shfit is None:
+            sanm_shfit = (kernel_size - 1) // 2
+        self.decoders = repeat(
+            att_layer_num,
+            lambda lnum: DecoderLayerSANM(
+                attention_dim,
+                MultiHeadedAttentionSANMDecoder(
+                    attention_dim, self_attention_dropout_rate, kernel_size, sanm_shfit=sanm_shfit
+                ),
+                MultiHeadedAttentionCrossAtt(
+                    attention_heads, attention_dim, src_attention_dropout_rate
+                ),
+                PositionwiseFeedForwardDecoderSANM(attention_dim, linear_units, dropout_rate),
+                dropout_rate,
+                normalize_before,
+                concat_after,
+            ),
+        )
+        if num_blocks - att_layer_num <= 0:
+            self.decoders2 = None
+        else:
+            self.decoders2 = repeat(
+                num_blocks - att_layer_num,
+                lambda lnum: DecoderLayerSANM(
+                    attention_dim,
+                    MultiHeadedAttentionSANMDecoder(
+                        attention_dim, self_attention_dropout_rate, kernel_size, sanm_shfit=0
+                    ),
+                    None,
+                    PositionwiseFeedForwardDecoderSANM(attention_dim, linear_units, dropout_rate),
+                    dropout_rate,
+                    normalize_before,
+                    concat_after,
+                ),
+            )
+
+        self.decoders3 = repeat(
+            1,
+            lambda lnum: DecoderLayerSANM(
+                attention_dim,
+                None,
+                None,
+                PositionwiseFeedForwardDecoderSANM(attention_dim, linear_units, dropout_rate),
+                dropout_rate,
+                normalize_before,
+                concat_after,
+            ),
+        )
+
+    def forward(
+        self,
+        hs_pad: torch.Tensor,
+        hlens: torch.Tensor,
+        ys_in_pad: torch.Tensor,
+        ys_in_lens: torch.Tensor,
+    ) -> Tuple[torch.Tensor, torch.Tensor]:
+        """Forward decoder.
+
+        Args:
+            hs_pad: encoded memory, float32  (batch, maxlen_in, feat)
+            hlens: (batch)
+            ys_in_pad:
+                input token ids, int64 (batch, maxlen_out)
+                if input_layer == "embed"
+                input tensor (batch, maxlen_out, #mels) in the other cases
+            ys_in_lens: (batch)
+        Returns:
+            (tuple): tuple containing:
+
+            x: decoded token score before softmax (batch, maxlen_out, token)
+                if use_output_layer is True,
+            olens: (batch, )
+        """
+        tgt = ys_in_pad
+        tgt_mask = myutils.sequence_mask(ys_in_lens, device=tgt.device)[:, :, None]
+
+        memory = hs_pad
+        memory_mask = myutils.sequence_mask(hlens, device=memory.device)[:, None, :]
+
+        x = tgt
+        x, tgt_mask, memory, memory_mask, _ = self.decoders(
+            x, tgt_mask, memory, memory_mask
+        )
+        if self.decoders2 is not None:
+            x, tgt_mask, memory, memory_mask, _ = self.decoders2(
+                x, tgt_mask, memory, memory_mask
+            )
+        x, tgt_mask, memory, memory_mask, _ = self.decoders3(
+            x, tgt_mask, memory, memory_mask
+        )
+        if self.normalize_before:
+            x = self.after_norm(x)
+        if self.output_layer is not None:
+            x = self.output_layer(x)
+
+        olens = tgt_mask.sum(1)
+        return x, olens
+
+    def score(self, ys, state, x):
+        """Score."""
+        ys_mask = myutils.sequence_mask(torch.tensor([len(ys)], dtype=torch.int32), device=x.device)[:, :, None]
+        logp, state = self.forward_one_step(
+            ys.unsqueeze(0), ys_mask, x.unsqueeze(0), cache=state
+        )
+        return logp.squeeze(0), state
+
+    def forward_one_step(
+        self,
+        tgt: torch.Tensor,
+        tgt_mask: torch.Tensor,
+        memory: torch.Tensor,
+        cache: List[torch.Tensor] = None,
+    ) -> Tuple[torch.Tensor, List[torch.Tensor]]:
+        """Forward one step.
+
+        Args:
+            tgt: input token ids, int64 (batch, maxlen_out)
+            tgt_mask: input token mask,  (batch, maxlen_out)
+                      dtype=torch.uint8 in PyTorch 1.2-
+                      dtype=torch.bool in PyTorch 1.2+ (include 1.2)
+            memory: encoded memory, float32  (batch, maxlen_in, feat)
+            cache: cached output list of (batch, max_time_out-1, size)
+        Returns:
+            y, cache: NN output value and cache per `self.decoders`.
+            `y.shape` is (batch, token)
+        """
+        x = self.embed(tgt)
+        if cache is None:
+            cache_layer_num = len(self.decoders)
+            if self.decoders2 is not None:
+                cache_layer_num += len(self.decoders2)
+            cache = [None] * cache_layer_num
+        new_cache = []
+        # for c, decoder in zip(cache, self.decoders):
+        for i in range(self.att_layer_num):
+            decoder = self.decoders[i]
+            c = cache[i]
+            x, tgt_mask, memory, memory_mask, c_ret = decoder(
+                x, tgt_mask, memory, None, cache=c
+            )
+            new_cache.append(c_ret)
+
+        if self.num_blocks - self.att_layer_num >= 1:
+            for i in range(self.num_blocks - self.att_layer_num):
+                j = i + self.att_layer_num
+                decoder = self.decoders2[i]
+                c = cache[j]
+                x, tgt_mask, memory, memory_mask, c_ret = decoder(
+                    x, tgt_mask, memory, None, cache=c
+                )
+                new_cache.append(c_ret)
+
+        for decoder in self.decoders3:
+
+            x, tgt_mask, memory, memory_mask, _ = decoder(
+                x, tgt_mask, memory, None, cache=None
+            )
+
+        if self.normalize_before:
+            y = self.after_norm(x[:, -1])
+        else:
+            y = x[:, -1]
+        if self.output_layer is not None:
+            y = torch.log_softmax(self.output_layer(y), dim=-1)
+
+        return y, new_cache
\ No newline at end of file
diff --git a/funasr/models/decoder/transformer_decoder.py b/funasr/models/decoder/transformer_decoder.py
new file mode 100644
index 000000000..5f1bb2436
--- /dev/null
+++ b/funasr/models/decoder/transformer_decoder.py
@@ -0,0 +1,766 @@
+# Copyright 2019 Shigeki Karita
+#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)
+
+"""Decoder definition."""
+from typing import Any
+from typing import List
+from typing import Sequence
+from typing import Tuple
+
+import torch
+from torch import nn
+from typeguard import check_argument_types
+
+from funasr.models.decoder.abs_decoder import AbsDecoder
+from funasr.modules.attention import MultiHeadedAttention
+from funasr.modules.dynamic_conv import DynamicConvolution
+from funasr.modules.dynamic_conv2d import DynamicConvolution2D
+from funasr.modules.embedding import PositionalEncoding
+from funasr.modules.layer_norm import LayerNorm
+from funasr.modules.lightconv import LightweightConvolution
+from funasr.modules.lightconv2d import LightweightConvolution2D
+from funasr.modules.mask import subsequent_mask
+from funasr.modules.nets_utils import make_pad_mask
+from funasr.modules.positionwise_feed_forward import (
+    PositionwiseFeedForward,  # noqa: H301
+)
+from funasr.modules.repeat import repeat
+from funasr.modules.scorers.scorer_interface import BatchScorerInterface
+
+
+class DecoderLayer(nn.Module):
+    """Single decoder layer module.
+
+    Args:
+        size (int): Input dimension.
+        self_attn (torch.nn.Module): Self-attention module instance.
+            `MultiHeadedAttention` instance can be used as the argument.
+        src_attn (torch.nn.Module): Source-attention (cross-attention) module instance.
+            `MultiHeadedAttention` instance can be used as the argument.
+        feed_forward (torch.nn.Module): Feed-forward module instance.
+            `PositionwiseFeedForward`, `MultiLayeredConv1d`, or `Conv1dLinear` instance
+            can be used as the argument.
+        dropout_rate (float): Dropout rate.
+        normalize_before (bool): Whether to use layer_norm before the first block.
+        concat_after (bool): Whether to concat attention layer's input and output.
+            if True, additional linear will be applied.
+            i.e. x -> x + linear(concat(x, att(x)))
+            if False, no additional linear will be applied. i.e. x -> x + att(x)
+
+
+    """
+
+    def __init__(
+            self,
+            size,
+            self_attn,
+            src_attn,
+            feed_forward,
+            dropout_rate,
+            normalize_before=True,
+            concat_after=False,
+    ):
+        """Construct a DecoderLayer object."""
+        super(DecoderLayer, self).__init__()
+        self.size = size
+        self.self_attn = self_attn
+        self.src_attn = src_attn
+        self.feed_forward = feed_forward
+        self.norm1 = LayerNorm(size)
+        self.norm2 = LayerNorm(size)
+        self.norm3 = LayerNorm(size)
+        self.dropout = nn.Dropout(dropout_rate)
+        self.normalize_before = normalize_before
+        self.concat_after = concat_after
+        if self.concat_after:
+            self.concat_linear1 = nn.Linear(size + size, size)
+            self.concat_linear2 = nn.Linear(size + size, size)
+
+    def forward(self, tgt, tgt_mask, memory, memory_mask, cache=None):
+        """Compute decoded features.
+
+        Args:
+            tgt (torch.Tensor): Input tensor (#batch, maxlen_out, size).
+            tgt_mask (torch.Tensor): Mask for input tensor (#batch, maxlen_out).
+            memory (torch.Tensor): Encoded memory, float32 (#batch, maxlen_in, size).
+            memory_mask (torch.Tensor): Encoded memory mask (#batch, maxlen_in).
+            cache (List[torch.Tensor]): List of cached tensors.
+                Each tensor shape should be (#batch, maxlen_out - 1, size).
+
+        Returns:
+            torch.Tensor: Output tensor(#batch, maxlen_out, size).
+            torch.Tensor: Mask for output tensor (#batch, maxlen_out).
+            torch.Tensor: Encoded memory (#batch, maxlen_in, size).
+            torch.Tensor: Encoded memory mask (#batch, maxlen_in).
+
+        """
+        residual = tgt
+        if self.normalize_before:
+            tgt = self.norm1(tgt)
+
+        if cache is None:
+            tgt_q = tgt
+            tgt_q_mask = tgt_mask
+        else:
+            # compute only the last frame query keeping dim: max_time_out -> 1
+            assert cache.shape == (
+                tgt.shape[0],
+                tgt.shape[1] - 1,
+                self.size,
+            ), f"{cache.shape} == {(tgt.shape[0], tgt.shape[1] - 1, self.size)}"
+            tgt_q = tgt[:, -1:, :]
+            residual = residual[:, -1:, :]
+            tgt_q_mask = None
+            if tgt_mask is not None:
+                tgt_q_mask = tgt_mask[:, -1:, :]
+
+        if self.concat_after:
+            tgt_concat = torch.cat(
+                (tgt_q, self.self_attn(tgt_q, tgt, tgt, tgt_q_mask)), dim=-1
+            )
+            x = residual + self.concat_linear1(tgt_concat)
+        else:
+            x = residual + self.dropout(self.self_attn(tgt_q, tgt, tgt, tgt_q_mask))
+        if not self.normalize_before:
+            x = self.norm1(x)
+
+        residual = x
+        if self.normalize_before:
+            x = self.norm2(x)
+        if self.concat_after:
+            x_concat = torch.cat(
+                (x, self.src_attn(x, memory, memory, memory_mask)), dim=-1
+            )
+            x = residual + self.concat_linear2(x_concat)
+        else:
+            x = residual + self.dropout(self.src_attn(x, memory, memory, memory_mask))
+        if not self.normalize_before:
+            x = self.norm2(x)
+
+        residual = x
+        if self.normalize_before:
+            x = self.norm3(x)
+        x = residual + self.dropout(self.feed_forward(x))
+        if not self.normalize_before:
+            x = self.norm3(x)
+
+        if cache is not None:
+            x = torch.cat([cache, x], dim=1)
+
+        return x, tgt_mask, memory, memory_mask
+
+
+class BaseTransformerDecoder(AbsDecoder, BatchScorerInterface):
+    """Base class of Transformer decoder module.
+
+    Args:
+        vocab_size: output dim
+        encoder_output_size: dimension of attention
+        attention_heads: the number of heads of multi head attention
+        linear_units: the number of units of position-wise feed forward
+        num_blocks: the number of decoder blocks
+        dropout_rate: dropout rate
+        self_attention_dropout_rate: dropout rate for attention
+        input_layer: input layer type
+        use_output_layer: whether to use output layer
+        pos_enc_class: PositionalEncoding or ScaledPositionalEncoding
+        normalize_before: whether to use layer_norm before the first block
+        concat_after: whether to concat attention layer's input and output
+            if True, additional linear will be applied.
+            i.e. x -> x + linear(concat(x, att(x)))
+            if False, no additional linear will be applied.
+            i.e. x -> x + att(x)
+    """
+
+    def __init__(
+            self,
+            vocab_size: int,
+            encoder_output_size: int,
+            dropout_rate: float = 0.1,
+            positional_dropout_rate: float = 0.1,
+            input_layer: str = "embed",
+            use_output_layer: bool = True,
+            pos_enc_class=PositionalEncoding,
+            normalize_before: bool = True,
+    ):
+        assert check_argument_types()
+        super().__init__()
+        attention_dim = encoder_output_size
+
+        if input_layer == "embed":
+            self.embed = torch.nn.Sequential(
+                torch.nn.Embedding(vocab_size, attention_dim),
+                pos_enc_class(attention_dim, positional_dropout_rate),
+            )
+        elif input_layer == "linear":
+            self.embed = torch.nn.Sequential(
+                torch.nn.Linear(vocab_size, attention_dim),
+                torch.nn.LayerNorm(attention_dim),
+                torch.nn.Dropout(dropout_rate),
+                torch.nn.ReLU(),
+                pos_enc_class(attention_dim, positional_dropout_rate),
+            )
+        else:
+            raise ValueError(f"only 'embed' or 'linear' is supported: {input_layer}")
+
+        self.normalize_before = normalize_before
+        if self.normalize_before:
+            self.after_norm = LayerNorm(attention_dim)
+        if use_output_layer:
+            self.output_layer = torch.nn.Linear(attention_dim, vocab_size)
+        else:
+            self.output_layer = None
+
+        # Must be set by the subclass
+        self.decoders = None
+
+    def forward(
+            self,
+            hs_pad: torch.Tensor,
+            hlens: torch.Tensor,
+            ys_in_pad: torch.Tensor,
+            ys_in_lens: torch.Tensor,
+    ) -> Tuple[torch.Tensor, torch.Tensor]:
+        """Forward decoder.
+
+        Args:
+            hs_pad: encoded memory, float32  (batch, maxlen_in, feat)
+            hlens: (batch)
+            ys_in_pad:
+                input token ids, int64 (batch, maxlen_out)
+                if input_layer == "embed"
+                input tensor (batch, maxlen_out, #mels) in the other cases
+            ys_in_lens: (batch)
+        Returns:
+            (tuple): tuple containing:
+
+            x: decoded token score before softmax (batch, maxlen_out, token)
+                if use_output_layer is True,
+            olens: (batch, )
+        """
+        tgt = ys_in_pad
+        # tgt_mask: (B, 1, L)
+        tgt_mask = (~make_pad_mask(ys_in_lens)[:, None, :]).to(tgt.device)
+        # m: (1, L, L)
+        m = subsequent_mask(tgt_mask.size(-1), device=tgt_mask.device).unsqueeze(0)
+        # tgt_mask: (B, L, L)
+        tgt_mask = tgt_mask & m
+
+        memory = hs_pad
+        memory_mask = (~make_pad_mask(hlens, maxlen=memory.size(1)))[:, None, :].to(
+            memory.device
+        )
+        # Padding for Longformer
+        if memory_mask.shape[-1] != memory.shape[1]:
+            padlen = memory.shape[1] - memory_mask.shape[-1]
+            memory_mask = torch.nn.functional.pad(
+                memory_mask, (0, padlen), "constant", False
+            )
+
+        x = self.embed(tgt)
+        x, tgt_mask, memory, memory_mask = self.decoders(
+            x, tgt_mask, memory, memory_mask
+        )
+        if self.normalize_before:
+            x = self.after_norm(x)
+        if self.output_layer is not None:
+            x = self.output_layer(x)
+
+        olens = tgt_mask.sum(1)
+        return x, olens
+
+    def forward_one_step(
+            self,
+            tgt: torch.Tensor,
+            tgt_mask: torch.Tensor,
+            memory: torch.Tensor,
+            cache: List[torch.Tensor] = None,
+    ) -> Tuple[torch.Tensor, List[torch.Tensor]]:
+        """Forward one step.
+
+        Args:
+            tgt: input token ids, int64 (batch, maxlen_out)
+            tgt_mask: input token mask,  (batch, maxlen_out)
+                      dtype=torch.uint8 in PyTorch 1.2-
+                      dtype=torch.bool in PyTorch 1.2+ (include 1.2)
+            memory: encoded memory, float32  (batch, maxlen_in, feat)
+            cache: cached output list of (batch, max_time_out-1, size)
+        Returns:
+            y, cache: NN output value and cache per `self.decoders`.
+            `y.shape` is (batch, token): only the last position is scored.
+        """
+        x = self.embed(tgt)
+        if cache is None:
+            cache = [None] * len(self.decoders)
+        new_cache = []
+        for c, decoder in zip(cache, self.decoders):
+            x, tgt_mask, memory, memory_mask = decoder(
+                x, tgt_mask, memory, None, cache=c
+            )
+            new_cache.append(x)
+
+        if self.normalize_before:
+            y = self.after_norm(x[:, -1])
+        else:
+            y = x[:, -1]
+        if self.output_layer is not None:
+            y = torch.log_softmax(self.output_layer(y), dim=-1)
+
+        return y, new_cache
+
+    def score(self, ys, state, x):
+        """Score."""
+        ys_mask = subsequent_mask(len(ys), device=x.device).unsqueeze(0)
+        logp, state = self.forward_one_step(
+            ys.unsqueeze(0), ys_mask, x.unsqueeze(0), cache=state
+        )
+        return logp.squeeze(0), state
+
+    def batch_score(
+            self, ys: torch.Tensor, states: List[Any], xs: torch.Tensor
+    ) -> Tuple[torch.Tensor, List[Any]]:
+        """Score new token batch.
+
+        Args:
+            ys (torch.Tensor): torch.int64 prefix tokens (n_batch, ylen).
+            states (List[Any]): Scorer states for prefix tokens.
+            xs (torch.Tensor):
+                The encoder feature that generates ys (n_batch, xlen, n_feat).
+
+        Returns:
+            tuple[torch.Tensor, List[Any]]: Tuple of
+                batchified scores for next token with shape of `(n_batch, n_vocab)`
+                and next state list for ys.
+
+        """
+        # merge states
+        n_batch = len(ys)
+        n_layers = len(self.decoders)
+        if states[0] is None:
+            batch_state = None
+        else:
+            # transpose state of [batch, layer] into [layer, batch]
+            batch_state = [
+                torch.stack([states[b][i] for b in range(n_batch)])
+                for i in range(n_layers)
+            ]
+
+        # batch decoding
+        ys_mask = subsequent_mask(ys.size(-1), device=xs.device).unsqueeze(0)
+        logp, states = self.forward_one_step(ys, ys_mask, xs, cache=batch_state)
+
+        # transpose state of [layer, batch] into [batch, layer]
+        state_list = [[states[i][b] for i in range(n_layers)] for b in range(n_batch)]
+        return logp, state_list
+
+
+class TransformerDecoder(BaseTransformerDecoder):
+    def __init__(
+            self,
+            vocab_size: int,
+            encoder_output_size: int,
+            attention_heads: int = 4,
+            linear_units: int = 2048,
+            num_blocks: int = 6,
+            dropout_rate: float = 0.1,
+            positional_dropout_rate: float = 0.1,
+            self_attention_dropout_rate: float = 0.0,
+            src_attention_dropout_rate: float = 0.0,
+            input_layer: str = "embed",
+            use_output_layer: bool = True,
+            pos_enc_class=PositionalEncoding,
+            normalize_before: bool = True,
+            concat_after: bool = False,
+    ):
+        assert check_argument_types()
+        super().__init__(
+            vocab_size=vocab_size,
+            encoder_output_size=encoder_output_size,
+            dropout_rate=dropout_rate,
+            positional_dropout_rate=positional_dropout_rate,
+            input_layer=input_layer,
+            use_output_layer=use_output_layer,
+            pos_enc_class=pos_enc_class,
+            normalize_before=normalize_before,
+        )
+
+        attention_dim = encoder_output_size
+        self.decoders = repeat(
+            num_blocks,
+            lambda lnum: DecoderLayer(
+                attention_dim,
+                MultiHeadedAttention(
+                    attention_heads, attention_dim, self_attention_dropout_rate
+                ),
+                MultiHeadedAttention(
+                    attention_heads, attention_dim, src_attention_dropout_rate
+                ),
+                PositionwiseFeedForward(attention_dim, linear_units, dropout_rate),
+                dropout_rate,
+                normalize_before,
+                concat_after,
+            ),
+        )
+
+
+class ParaformerDecoderSAN(BaseTransformerDecoder):
+    """
+    author: Speech Lab, Alibaba Group, China
+    Paraformer: Fast and Accurate Parallel Transformer for Non-autoregressive End-to-End Speech Recognition
+    https://arxiv.org/abs/2206.08317
+    """
+    def __init__(
+            self,
+            vocab_size: int,
+            encoder_output_size: int,
+            attention_heads: int = 4,
+            linear_units: int = 2048,
+            num_blocks: int = 6,
+            dropout_rate: float = 0.1,
+            positional_dropout_rate: float = 0.1,
+            self_attention_dropout_rate: float = 0.0,
+            src_attention_dropout_rate: float = 0.0,
+            input_layer: str = "embed",
+            use_output_layer: bool = True,
+            pos_enc_class=PositionalEncoding,
+            normalize_before: bool = True,
+            concat_after: bool = False,
+            embeds_id: int = -1,
+    ):
+        assert check_argument_types()
+        super().__init__(
+            vocab_size=vocab_size,
+            encoder_output_size=encoder_output_size,
+            dropout_rate=dropout_rate,
+            positional_dropout_rate=positional_dropout_rate,
+            input_layer=input_layer,
+            use_output_layer=use_output_layer,
+            pos_enc_class=pos_enc_class,
+            normalize_before=normalize_before,
+        )
+
+        attention_dim = encoder_output_size
+        self.decoders = repeat(
+            num_blocks,
+            lambda lnum: DecoderLayer(
+                attention_dim,
+                MultiHeadedAttention(
+                    attention_heads, attention_dim, self_attention_dropout_rate
+                ),
+                MultiHeadedAttention(
+                    attention_heads, attention_dim, src_attention_dropout_rate
+                ),
+                PositionwiseFeedForward(attention_dim, linear_units, dropout_rate),
+                dropout_rate,
+                normalize_before,
+                concat_after,
+            ),
+        )
+        self.embeds_id = embeds_id
+        self.attention_dim = attention_dim
+
+    def forward(
+            self,
+            hs_pad: torch.Tensor,
+            hlens: torch.Tensor,
+            ys_in_pad: torch.Tensor,
+            ys_in_lens: torch.Tensor,
+    ) -> Tuple[torch.Tensor, torch.Tensor]:
+        """Forward decoder.
+
+        Args:
+            hs_pad: encoded memory, float32  (batch, maxlen_in, feat)
+            hlens: (batch)
+            ys_in_pad:
+                input token ids, int64 (batch, maxlen_out)
+                if input_layer == "embed"
+                input tensor (batch, maxlen_out, #mels) in the other cases
+            ys_in_lens: (batch)
+        Returns:
+            (tuple): tuple containing:
+
+            x: decoded token score before softmax (batch, maxlen_out, token)
+                if use_output_layer is True,
+            olens: (batch, )
+        """
+        tgt = ys_in_pad
+        tgt_mask = (~make_pad_mask(ys_in_lens)[:, None, :]).to(tgt.device)
+
+        memory = hs_pad
+        memory_mask = (~make_pad_mask(hlens, maxlen=memory.size(1)))[:, None, :].to(
+            memory.device
+        )
+        # Padding for Longformer
+        if memory_mask.shape[-1] != memory.shape[1]:
+            padlen = memory.shape[1] - memory_mask.shape[-1]
+            memory_mask = torch.nn.functional.pad(
+                memory_mask, (0, padlen), "constant", False
+            )
+
+        # x = self.embed(tgt)
+        x = tgt
+        embeds_outputs = None
+        for layer_id, decoder in enumerate(self.decoders):
+            x, tgt_mask, memory, memory_mask = decoder(
+                x, tgt_mask, memory, memory_mask
+            )
+            if layer_id == self.embeds_id:
+                embeds_outputs = x
+        if self.normalize_before:
+            x = self.after_norm(x)
+        if self.output_layer is not None:
+            x = self.output_layer(x)
+
+        olens = tgt_mask.sum(1)
+        if embeds_outputs is not None:
+            return x, olens, embeds_outputs
+        else:
+            return x, olens
+
+
+class LightweightConvolutionTransformerDecoder(BaseTransformerDecoder):
+    def __init__(
+            self,
+            vocab_size: int,
+            encoder_output_size: int,
+            attention_heads: int = 4,
+            linear_units: int = 2048,
+            num_blocks: int = 6,
+            dropout_rate: float = 0.1,
+            positional_dropout_rate: float = 0.1,
+            self_attention_dropout_rate: float = 0.0,
+            src_attention_dropout_rate: float = 0.0,
+            input_layer: str = "embed",
+            use_output_layer: bool = True,
+            pos_enc_class=PositionalEncoding,
+            normalize_before: bool = True,
+            concat_after: bool = False,
+            conv_wshare: int = 4,
+            conv_kernel_length: Sequence[int] = (11, 11, 11, 11, 11, 11),
+            conv_usebias: int = False,
+    ):
+        assert check_argument_types()
+        if len(conv_kernel_length) != num_blocks:
+            raise ValueError(
+                "conv_kernel_length must have equal number of values to num_blocks: "
+                f"{len(conv_kernel_length)} != {num_blocks}"
+            )
+        super().__init__(
+            vocab_size=vocab_size,
+            encoder_output_size=encoder_output_size,
+            dropout_rate=dropout_rate,
+            positional_dropout_rate=positional_dropout_rate,
+            input_layer=input_layer,
+            use_output_layer=use_output_layer,
+            pos_enc_class=pos_enc_class,
+            normalize_before=normalize_before,
+        )
+
+        attention_dim = encoder_output_size
+        self.decoders = repeat(
+            num_blocks,
+            lambda lnum: DecoderLayer(
+                attention_dim,
+                LightweightConvolution(
+                    wshare=conv_wshare,
+                    n_feat=attention_dim,
+                    dropout_rate=self_attention_dropout_rate,
+                    kernel_size=conv_kernel_length[lnum],
+                    use_kernel_mask=True,
+                    use_bias=conv_usebias,
+                ),
+                MultiHeadedAttention(
+                    attention_heads, attention_dim, src_attention_dropout_rate
+                ),
+                PositionwiseFeedForward(attention_dim, linear_units, dropout_rate),
+                dropout_rate,
+                normalize_before,
+                concat_after,
+            ),
+        )
+
+
+class LightweightConvolution2DTransformerDecoder(BaseTransformerDecoder):
+    def __init__(
+            self,
+            vocab_size: int,
+            encoder_output_size: int,
+            attention_heads: int = 4,
+            linear_units: int = 2048,
+            num_blocks: int = 6,
+            dropout_rate: float = 0.1,
+            positional_dropout_rate: float = 0.1,
+            self_attention_dropout_rate: float = 0.0,
+            src_attention_dropout_rate: float = 0.0,
+            input_layer: str = "embed",
+            use_output_layer: bool = True,
+            pos_enc_class=PositionalEncoding,
+            normalize_before: bool = True,
+            concat_after: bool = False,
+            conv_wshare: int = 4,
+            conv_kernel_length: Sequence[int] = (11, 11, 11, 11, 11, 11),
+            conv_usebias: int = False,
+    ):
+        assert check_argument_types()
+        if len(conv_kernel_length) != num_blocks:
+            raise ValueError(
+                "conv_kernel_length must have equal number of values to num_blocks: "
+                f"{len(conv_kernel_length)} != {num_blocks}"
+            )
+        super().__init__(
+            vocab_size=vocab_size,
+            encoder_output_size=encoder_output_size,
+            dropout_rate=dropout_rate,
+            positional_dropout_rate=positional_dropout_rate,
+            input_layer=input_layer,
+            use_output_layer=use_output_layer,
+            pos_enc_class=pos_enc_class,
+            normalize_before=normalize_before,
+        )
+
+        attention_dim = encoder_output_size
+        self.decoders = repeat(
+            num_blocks,
+            lambda lnum: DecoderLayer(
+                attention_dim,
+                LightweightConvolution2D(
+                    wshare=conv_wshare,
+                    n_feat=attention_dim,
+                    dropout_rate=self_attention_dropout_rate,
+                    kernel_size=conv_kernel_length[lnum],
+                    use_kernel_mask=True,
+                    use_bias=conv_usebias,
+                ),
+                MultiHeadedAttention(
+                    attention_heads, attention_dim, src_attention_dropout_rate
+                ),
+                PositionwiseFeedForward(attention_dim, linear_units, dropout_rate),
+                dropout_rate,
+                normalize_before,
+                concat_after,
+            ),
+        )
+
+
+class DynamicConvolutionTransformerDecoder(BaseTransformerDecoder):
+    def __init__(
+            self,
+            vocab_size: int,
+            encoder_output_size: int,
+            attention_heads: int = 4,
+            linear_units: int = 2048,
+            num_blocks: int = 6,
+            dropout_rate: float = 0.1,
+            positional_dropout_rate: float = 0.1,
+            self_attention_dropout_rate: float = 0.0,
+            src_attention_dropout_rate: float = 0.0,
+            input_layer: str = "embed",
+            use_output_layer: bool = True,
+            pos_enc_class=PositionalEncoding,
+            normalize_before: bool = True,
+            concat_after: bool = False,
+            conv_wshare: int = 4,
+            conv_kernel_length: Sequence[int] = (11, 11, 11, 11, 11, 11),
+            conv_usebias: int = False,
+    ):
+        assert check_argument_types()
+        if len(conv_kernel_length) != num_blocks:
+            raise ValueError(
+                "conv_kernel_length must have equal number of values to num_blocks: "
+                f"{len(conv_kernel_length)} != {num_blocks}"
+            )
+        super().__init__(
+            vocab_size=vocab_size,
+            encoder_output_size=encoder_output_size,
+            dropout_rate=dropout_rate,
+            positional_dropout_rate=positional_dropout_rate,
+            input_layer=input_layer,
+            use_output_layer=use_output_layer,
+            pos_enc_class=pos_enc_class,
+            normalize_before=normalize_before,
+        )
+        attention_dim = encoder_output_size
+
+        self.decoders = repeat(
+            num_blocks,
+            lambda lnum: DecoderLayer(
+                attention_dim,
+                DynamicConvolution(
+                    wshare=conv_wshare,
+                    n_feat=attention_dim,
+                    dropout_rate=self_attention_dropout_rate,
+                    kernel_size=conv_kernel_length[lnum],
+                    use_kernel_mask=True,
+                    use_bias=conv_usebias,
+                ),
+                MultiHeadedAttention(
+                    attention_heads, attention_dim, src_attention_dropout_rate
+                ),
+                PositionwiseFeedForward(attention_dim, linear_units, dropout_rate),
+                dropout_rate,
+                normalize_before,
+                concat_after,
+            ),
+        )
+
+
+class DynamicConvolution2DTransformerDecoder(BaseTransformerDecoder):
+    def __init__(
+            self,
+            vocab_size: int,
+            encoder_output_size: int,
+            attention_heads: int = 4,
+            linear_units: int = 2048,
+            num_blocks: int = 6,
+            dropout_rate: float = 0.1,
+            positional_dropout_rate: float = 0.1,
+            self_attention_dropout_rate: float = 0.0,
+            src_attention_dropout_rate: float = 0.0,
+            input_layer: str = "embed",
+            use_output_layer: bool = True,
+            pos_enc_class=PositionalEncoding,
+            normalize_before: bool = True,
+            concat_after: bool = False,
+            conv_wshare: int = 4,
+            conv_kernel_length: Sequence[int] = (11, 11, 11, 11, 11, 11),
+            conv_usebias: int = False,
+    ):
+        assert check_argument_types()
+        if len(conv_kernel_length) != num_blocks:
+            raise ValueError(
+                "conv_kernel_length must have equal number of values to num_blocks: "
+                f"{len(conv_kernel_length)} != {num_blocks}"
+            )
+        super().__init__(
+            vocab_size=vocab_size,
+            encoder_output_size=encoder_output_size,
+            dropout_rate=dropout_rate,
+            positional_dropout_rate=positional_dropout_rate,
+            input_layer=input_layer,
+            use_output_layer=use_output_layer,
+            pos_enc_class=pos_enc_class,
+            normalize_before=normalize_before,
+        )
+        attention_dim = encoder_output_size
+
+        self.decoders = repeat(
+            num_blocks,
+            lambda lnum: DecoderLayer(
+                attention_dim,
+                DynamicConvolution2D(
+                    wshare=conv_wshare,
+                    n_feat=attention_dim,
+                    dropout_rate=self_attention_dropout_rate,
+                    kernel_size=conv_kernel_length[lnum],
+                    use_kernel_mask=True,
+                    use_bias=conv_usebias,
+                ),
+                MultiHeadedAttention(
+                    attention_heads, attention_dim, src_attention_dropout_rate
+                ),
+                PositionwiseFeedForward(attention_dim, linear_units, dropout_rate),
+                dropout_rate,
+                normalize_before,
+                concat_after,
+            ),
+        )
\ No newline at end of file
diff --git a/funasr/models/e2e_asr.py b/funasr/models/e2e_asr.py
new file mode 100644
index 000000000..f64ea3dbe
--- /dev/null
+++ b/funasr/models/e2e_asr.py
@@ -0,0 +1,458 @@
+# Copyright ESPnet (https://github.com/espnet/espnet). All Rights Reserved.
+#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)
+
+import logging
+from contextlib import contextmanager
+from distutils.version import LooseVersion
+from typing import Dict
+from typing import List
+from typing import Optional
+from typing import Tuple
+from typing import Union
+
+import torch
+from typeguard import check_argument_types
+
+from funasr.layers.abs_normalize import AbsNormalize
+from funasr.losses.label_smoothing_loss import (
+    LabelSmoothingLoss,  # noqa: H301
+)
+from funasr.models.ctc import CTC
+from funasr.models.decoder.abs_decoder import AbsDecoder
+from funasr.models.encoder.abs_encoder import AbsEncoder
+from funasr.models.frontend.abs_frontend import AbsFrontend
+from funasr.models.postencoder.abs_postencoder import AbsPostEncoder
+from funasr.models.preencoder.abs_preencoder import AbsPreEncoder
+from funasr.models.specaug.abs_specaug import AbsSpecAug
+from funasr.modules.add_sos_eos import add_sos_eos
+from funasr.modules.e2e_asr_common import ErrorCalculator
+from funasr.modules.nets_utils import th_accuracy
+from funasr.torch_utils.device_funcs import force_gatherable
+from funasr.train.abs_espnet_model import AbsESPnetModel
+
+if LooseVersion(torch.__version__) >= LooseVersion("1.6.0"):
+    from torch.cuda.amp import autocast
+else:
+    # Nothing to do if torch<1.6.0
+    @contextmanager
+    def autocast(enabled=True):
+        yield
+
+
+class ESPnetASRModel(AbsESPnetModel):
+    """CTC-attention hybrid Encoder-Decoder model"""
+
+    def __init__(
+            self,
+            vocab_size: int,
+            token_list: Union[Tuple[str, ...], List[str]],
+            frontend: Optional[AbsFrontend],
+            specaug: Optional[AbsSpecAug],
+            normalize: Optional[AbsNormalize],
+            preencoder: Optional[AbsPreEncoder],
+            encoder: AbsEncoder,
+            postencoder: Optional[AbsPostEncoder],
+            decoder: AbsDecoder,
+            ctc: CTC,
+            ctc_weight: float = 0.5,
+            interctc_weight: float = 0.0,
+            ignore_id: int = -1,
+            lsm_weight: float = 0.0,
+            length_normalized_loss: bool = False,
+            report_cer: bool = True,
+            report_wer: bool = True,
+            sym_space: str = "<space>",
+            sym_blank: str = "<blank>",
+            extract_feats_in_collect_stats: bool = True,
+    ):
+        assert check_argument_types()
+        assert 0.0 <= ctc_weight <= 1.0, ctc_weight
+        assert 0.0 <= interctc_weight < 1.0, interctc_weight
+
+        super().__init__()
+        # note that sos and eos use distinct IDs here (sos=1, eos=2)
+        self.blank_id = 0
+        self.sos = 1
+        self.eos = 2
+        self.vocab_size = vocab_size
+        self.ignore_id = ignore_id
+        self.ctc_weight = ctc_weight
+        self.interctc_weight = interctc_weight
+        self.token_list = token_list.copy()
+
+        self.frontend = frontend
+        self.specaug = specaug
+        self.normalize = normalize
+        self.preencoder = preencoder
+        self.postencoder = postencoder
+        self.encoder = encoder
+
+        if not hasattr(self.encoder, "interctc_use_conditioning"):
+            self.encoder.interctc_use_conditioning = False
+        if self.encoder.interctc_use_conditioning:
+            self.encoder.conditioning_layer = torch.nn.Linear(
+                vocab_size, self.encoder.output_size()
+            )
+
+        self.error_calculator = None
+
+
+        # we set self.decoder = None in the CTC mode since
+        # self.decoder parameters were never used and PyTorch complained
+        # and threw an Exception in the multi-GPU experiment.
+        # thanks Jeff Farris for pointing out the issue.
+        if ctc_weight == 1.0:
+            self.decoder = None
+        else:
+            self.decoder = decoder
+
+        self.criterion_att = LabelSmoothingLoss(
+            size=vocab_size,
+            padding_idx=ignore_id,
+            smoothing=lsm_weight,
+            normalize_length=length_normalized_loss,
+        )
+
+        if report_cer or report_wer:
+            self.error_calculator = ErrorCalculator(
+                token_list, sym_space, sym_blank, report_cer, report_wer
+            )
+
+        if ctc_weight == 0.0:
+            self.ctc = None
+        else:
+            self.ctc = ctc
+
+        self.extract_feats_in_collect_stats = extract_feats_in_collect_stats
+
+    def forward(
+            self,
+            speech: torch.Tensor,
+            speech_lengths: torch.Tensor,
+            text: torch.Tensor,
+            text_lengths: torch.Tensor,
+    ) -> Tuple[torch.Tensor, Dict[str, torch.Tensor], torch.Tensor]:
+        """Frontend + Encoder + Decoder + Calc loss
+
+        Args:
+            speech: (Batch, Length, ...)
+            speech_lengths: (Batch, )
+            text: (Batch, Length)
+            text_lengths: (Batch,)
+        """
+        assert text_lengths.dim() == 1, text_lengths.shape
+        # Check that the batch size is consistent across all inputs
+        assert (
+                speech.shape[0]
+                == speech_lengths.shape[0]
+                == text.shape[0]
+                == text_lengths.shape[0]
+        ), (speech.shape, speech_lengths.shape, text.shape, text_lengths.shape)
+        batch_size = speech.shape[0]
+
+        # for data-parallel
+        text = text[:, : text_lengths.max()]
+
+        # 1. Encoder
+        encoder_out, encoder_out_lens = self.encode(speech, speech_lengths)
+        intermediate_outs = None
+        if isinstance(encoder_out, tuple):
+            intermediate_outs = encoder_out[1]
+            encoder_out = encoder_out[0]
+
+        loss_att, acc_att, cer_att, wer_att = None, None, None, None
+        loss_ctc, cer_ctc = None, None
+        stats = dict()
+
+        # 2a. CTC branch
+        if self.ctc_weight != 0.0:
+            loss_ctc, cer_ctc = self._calc_ctc_loss(
+                encoder_out, encoder_out_lens, text, text_lengths
+            )
+
+            # Collect CTC branch stats
+            stats["loss_ctc"] = loss_ctc.detach() if loss_ctc is not None else None
+            stats["cer_ctc"] = cer_ctc
+
+        # Intermediate CTC (optional)
+        loss_interctc = 0.0
+        if self.interctc_weight != 0.0 and intermediate_outs is not None:
+            for layer_idx, intermediate_out in intermediate_outs:
+                # we assume intermediate_out has the same length & padding
+                # as those of encoder_out
+                loss_ic, cer_ic = self._calc_ctc_loss(
+                    intermediate_out, encoder_out_lens, text, text_lengths
+                )
+                loss_interctc = loss_interctc + loss_ic
+
+                # Collect Intermediate CTC stats
+                stats["loss_interctc_layer{}".format(layer_idx)] = (
+                    loss_ic.detach() if loss_ic is not None else None
+                )
+                stats["cer_interctc_layer{}".format(layer_idx)] = cer_ic
+
+            loss_interctc = loss_interctc / len(intermediate_outs)
+
+            # calculate whole encoder loss
+            loss_ctc = (
+                1 - self.interctc_weight
+            ) * loss_ctc + self.interctc_weight * loss_interctc
+
+
+        # 2b. Attention decoder branch
+        if self.ctc_weight != 1.0:
+            loss_att, acc_att, cer_att, wer_att = self._calc_att_loss(
+                encoder_out, encoder_out_lens, text, text_lengths
+            )
+
+        # 3. CTC-Att loss definition
+        if self.ctc_weight == 0.0:
+            loss = loss_att
+        elif self.ctc_weight == 1.0:
+            loss = loss_ctc
+        else:
+            loss = self.ctc_weight * loss_ctc + (1 - self.ctc_weight) * loss_att
+
+        # Collect Attn branch stats
+        stats["loss_att"] = loss_att.detach() if loss_att is not None else None
+        stats["acc"] = acc_att
+        stats["cer"] = cer_att
+        stats["wer"] = wer_att
+
+        # Collect total loss stats
+        stats["loss"] = torch.clone(loss.detach())
+
+        # force_gatherable: to-device and to-tensor if scalar for DataParallel
+        loss, stats, weight = force_gatherable((loss, stats, batch_size), loss.device)
+        return loss, stats, weight
+
+    def collect_feats(
+            self,
+            speech: torch.Tensor,
+            speech_lengths: torch.Tensor,
+            text: torch.Tensor,
+            text_lengths: torch.Tensor,
+    ) -> Dict[str, torch.Tensor]:
+        if self.extract_feats_in_collect_stats:
+            feats, feats_lengths = self._extract_feats(speech, speech_lengths)
+        else:
+            # Generate dummy stats if extract_feats_in_collect_stats is False
+            logging.warning(
+                "Generating dummy stats for feats and feats_lengths, "
+                "because encoder_conf.extract_feats_in_collect_stats is "
+                f"{self.extract_feats_in_collect_stats}"
+            )
+            feats, feats_lengths = speech, speech_lengths
+        return {"feats": feats, "feats_lengths": feats_lengths}
+
+    def encode(
+            self, speech: torch.Tensor, speech_lengths: torch.Tensor
+    ) -> Tuple[torch.Tensor, torch.Tensor]:
+        """Frontend + Encoder. Note that this method is used by asr_inference.py
+
+        Args:
+            speech: (Batch, Length, ...)
+            speech_lengths: (Batch, )
+        """
+        with autocast(False):
+            # 1. Extract feats
+            feats, feats_lengths = self._extract_feats(speech, speech_lengths)
+
+            # 2. Data augmentation
+            if self.specaug is not None and self.training:
+                feats, feats_lengths = self.specaug(feats, feats_lengths)
+
+            # 3. Normalization for feature: e.g. Global-CMVN, Utterance-CMVN
+            if self.normalize is not None:
+                feats, feats_lengths = self.normalize(feats, feats_lengths)
+
+        # Pre-encoder, e.g. used for raw input data
+        if self.preencoder is not None:
+            feats, feats_lengths = self.preencoder(feats, feats_lengths)
+
+        # 4. Forward encoder
+        # feats: (Batch, Length, Dim)
+        # -> encoder_out: (Batch, Length2, Dim2)
+        if self.encoder.interctc_use_conditioning:
+            encoder_out, encoder_out_lens, _ = self.encoder(
+                feats, feats_lengths, ctc=self.ctc
+            )
+        else:
+            encoder_out, encoder_out_lens, _ = self.encoder(feats, feats_lengths)
+        intermediate_outs = None
+        if isinstance(encoder_out, tuple):
+            intermediate_outs = encoder_out[1]
+            encoder_out = encoder_out[0]
+
+        # Post-encoder, e.g. NLU
+        if self.postencoder is not None:
+            encoder_out, encoder_out_lens = self.postencoder(
+                encoder_out, encoder_out_lens
+            )
+
+        assert encoder_out.size(0) == speech.size(0), (
+            encoder_out.size(),
+            speech.size(0),
+        )
+        assert encoder_out.size(1) <= encoder_out_lens.max(), (
+            encoder_out.size(),
+            encoder_out_lens.max(),
+        )
+
+        if intermediate_outs is not None:
+            return (encoder_out, intermediate_outs), encoder_out_lens
+
+        return encoder_out, encoder_out_lens
+
+    def _extract_feats(
+            self, speech: torch.Tensor, speech_lengths: torch.Tensor
+    ) -> Tuple[torch.Tensor, torch.Tensor]:
+        assert speech_lengths.dim() == 1, speech_lengths.shape
+
+        # for data-parallel
+        speech = speech[:, : speech_lengths.max()]
+
+        if self.frontend is not None:
+            # Frontend
+            #  e.g. STFT and Feature extract
+            #       data_loader may send time-domain signal in this case
+            # speech (Batch, NSamples) -> feats: (Batch, NFrames, Dim)
+            feats, feats_lengths = self.frontend(speech, speech_lengths)
+        else:
+            # No frontend and no feature extract
+            feats, feats_lengths = speech, speech_lengths
+        return feats, feats_lengths
+
+    def nll(
+            self,
+            encoder_out: torch.Tensor,
+            encoder_out_lens: torch.Tensor,
+            ys_pad: torch.Tensor,
+            ys_pad_lens: torch.Tensor,
+    ) -> torch.Tensor:
+        """Compute negative log likelihood(nll) from transformer-decoder
+
+        Normally, this function is called in batchify_nll.
+
+        Args:
+            encoder_out: (Batch, Length, Dim)
+            encoder_out_lens: (Batch,)
+            ys_pad: (Batch, Length)
+            ys_pad_lens: (Batch,)
+        """
+        ys_in_pad, ys_out_pad = add_sos_eos(ys_pad, self.sos, self.eos, self.ignore_id)
+        ys_in_lens = ys_pad_lens + 1
+
+        # 1. Forward decoder
+        decoder_out, _ = self.decoder(
+            encoder_out, encoder_out_lens, ys_in_pad, ys_in_lens
+        )  # [batch, seqlen, dim]
+        batch_size = decoder_out.size(0)
+        decoder_num_class = decoder_out.size(2)
+        # nll: negative log-likelihood
+        nll = torch.nn.functional.cross_entropy(
+            decoder_out.view(-1, decoder_num_class),
+            ys_out_pad.view(-1),
+            ignore_index=self.ignore_id,
+            reduction="none",
+        )
+        nll = nll.view(batch_size, -1)
+        nll = nll.sum(dim=1)
+        assert nll.size(0) == batch_size
+        return nll
+
+    def batchify_nll(
+            self,
+            encoder_out: torch.Tensor,
+            encoder_out_lens: torch.Tensor,
+            ys_pad: torch.Tensor,
+            ys_pad_lens: torch.Tensor,
+            batch_size: int = 100,
+    ):
+        """Compute negative log likelihood(nll) from transformer-decoder
+
+        To avoid OOM, this function separates the input into batches,
+        then calls nll on each batch and concatenates the results.
+        Args:
+            encoder_out: (Batch, Length, Dim)
+            encoder_out_lens: (Batch,)
+            ys_pad: (Batch, Length)
+            ys_pad_lens: (Batch,)
+            batch_size: number of samples per batch when computing nll;
+                        lower it to avoid OOM, or raise it to use more
+                        GPU memory
+        """
+        total_num = encoder_out.size(0)
+        if total_num <= batch_size:
+            nll = self.nll(encoder_out, encoder_out_lens, ys_pad, ys_pad_lens)
+        else:
+            nll = []
+            start_idx = 0
+            while True:
+                end_idx = min(start_idx + batch_size, total_num)
+                batch_encoder_out = encoder_out[start_idx:end_idx, :, :]
+                batch_encoder_out_lens = encoder_out_lens[start_idx:end_idx]
+                batch_ys_pad = ys_pad[start_idx:end_idx, :]
+                batch_ys_pad_lens = ys_pad_lens[start_idx:end_idx]
+                batch_nll = self.nll(
+                    batch_encoder_out,
+                    batch_encoder_out_lens,
+                    batch_ys_pad,
+                    batch_ys_pad_lens,
+                )
+                nll.append(batch_nll)
+                start_idx = end_idx
+                if start_idx == total_num:
+                    break
+            nll = torch.cat(nll)
+        assert nll.size(0) == total_num
+        return nll
+
+    def _calc_att_loss(
+            self,
+            encoder_out: torch.Tensor,
+            encoder_out_lens: torch.Tensor,
+            ys_pad: torch.Tensor,
+            ys_pad_lens: torch.Tensor,
+    ):
+        ys_in_pad, ys_out_pad = add_sos_eos(ys_pad, self.sos, self.eos, self.ignore_id)
+        ys_in_lens = ys_pad_lens + 1
+
+        # 1. Forward decoder
+        decoder_out, _ = self.decoder(
+            encoder_out, encoder_out_lens, ys_in_pad, ys_in_lens
+        )
+
+        # 2. Compute attention loss
+        loss_att = self.criterion_att(decoder_out, ys_out_pad)
+        acc_att = th_accuracy(
+            decoder_out.view(-1, self.vocab_size),
+            ys_out_pad,
+            ignore_label=self.ignore_id,
+        )
+
+        # Compute cer/wer using attention-decoder
+        if self.training or self.error_calculator is None:
+            cer_att, wer_att = None, None
+        else:
+            ys_hat = decoder_out.argmax(dim=-1)
+            cer_att, wer_att = self.error_calculator(ys_hat.cpu(), ys_pad.cpu())
+
+        return loss_att, acc_att, cer_att, wer_att
+
+    def _calc_ctc_loss(
+            self,
+            encoder_out: torch.Tensor,
+            encoder_out_lens: torch.Tensor,
+            ys_pad: torch.Tensor,
+            ys_pad_lens: torch.Tensor,
+    ):
+        # Calc CTC loss
+        loss_ctc = self.ctc(encoder_out, encoder_out_lens, ys_pad, ys_pad_lens)
+
+        # Calc CER using CTC
+        cer_ctc = None
+        if not self.training and self.error_calculator is not None:
+            ys_hat = self.ctc.argmax(encoder_out).data
+            cer_ctc = self.error_calculator(ys_hat.cpu(), ys_pad.cpu(), is_ctc=True)
+        return loss_ctc, cer_ctc
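
Review note on `batchify_nll` above: the loop walks the batch dimension in fixed-size slices, so at most `batch_size` utterances pass through the decoder at a time. The slicing itself can be sketched without torch. `chunk_bounds` is a hypothetical helper name used only for illustration and is not part of this patch.

```python
def chunk_bounds(total_num, batch_size=100):
    """Yield (start, end) index pairs that cover range(total_num) in order.

    Hypothetical helper: mirrors the start_idx/end_idx bookkeeping in
    batchify_nll, where the last slice may be shorter than batch_size.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, total_num, batch_size):
        yield start, min(start + batch_size, total_num)


if __name__ == "__main__":
    # 250 utterances in slices of at most 100
    for start, end in chunk_bounds(250, 100):
        print(start, end)
```

Each pair would index `encoder_out[start:end]`, `ys_pad[start:end]` and the matching length tensors. The per-slice results can then be concatenated in order, which is what `torch.cat(nll)` does in the patch.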
diff --git a/funasr/models/e2e_asr_common.py b/funasr/models/e2e_asr_common.py
new file mode 100644
index 000000000..92f90796a
--- /dev/null
+++ b/funasr/models/e2e_asr_common.py
@@ -0,0 +1,249 @@
+#!/usr/bin/env python3
+# encoding: utf-8
+
+# Copyright 2017 Johns Hopkins University (Shinji Watanabe)
+#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)
+
+"""Common functions for ASR."""
+
+import json
+import logging
+import sys
+
+from itertools import groupby
+import numpy as np
+import six
+
+
+def end_detect(ended_hyps, i, M=3, D_end=np.log(1 * np.exp(-10))):
+    """End detection.
+
+    described in Eq. (50) of S. Watanabe et al
+    "Hybrid CTC/Attention Architecture for End-to-End Speech Recognition"
+
+    :param ended_hyps:
+    :param i:
+    :param M:
+    :param D_end:
+    :return:
+    """
+    if len(ended_hyps) == 0:
+        return False
+    count = 0
+    best_hyp = sorted(ended_hyps, key=lambda x: x["score"], reverse=True)[0]
+    for m in six.moves.range(M):
+        # collect the ended hypotheses whose length is i - m
+        hyp_length = i - m
+        hyps_same_length = [x for x in ended_hyps if len(x["yseq"]) == hyp_length]
+        if len(hyps_same_length) > 0:
+            best_hyp_same_length = sorted(
+                hyps_same_length, key=lambda x: x["score"], reverse=True
+            )[0]
+            if best_hyp_same_length["score"] - best_hyp["score"] < D_end:
+                count += 1
+
+    if count == M:
+        return True
+    else:
+        return False
+
+
+# TODO(takaaki-hori): add different smoothing methods
+def label_smoothing_dist(odim, lsm_type, transcript=None, blank=0):
+    """Obtain label distribution for loss smoothing.
+
+    :param odim:
+    :param lsm_type:
+    :param blank:
+    :param transcript:
+    :return:
+    """
+    if transcript is not None:
+        with open(transcript, "rb") as f:
+            trans_json = json.load(f)["utts"]
+
+    if lsm_type == "unigram":
+        assert transcript is not None, (
+            "transcript is required for %s label smoothing" % lsm_type
+        )
+        labelcount = np.zeros(odim)
+        for k, v in trans_json.items():
+            ids = np.array([int(n) for n in v["output"][0]["tokenid"].split()])
+            # to avoid an error when there is no text in an utterance
+            if len(ids) > 0:
+                labelcount[ids] += 1
+        labelcount[odim - 1] = len(trans_json)  # count <eos>: one per utterance
+        labelcount[labelcount == 0] = 1  # flooring
+        labelcount[blank] = 0  # remove counts for blank
+        labeldist = labelcount.astype(np.float32) / np.sum(labelcount)
+    else:
+        logging.error("Error: unexpected label smoothing type: %s" % lsm_type)
+        sys.exit()
+
+    return labeldist
+
+
+def get_vgg2l_odim(idim, in_channel=3, out_channel=128):
+    """Return the output size of the VGG frontend.
+
+    :param in_channel: input channel size
+    :param out_channel: output channel size
+    :return: output size
+    :rtype int
+    """
+    idim = idim / in_channel
+    idim = np.ceil(np.array(idim, dtype=np.float32) / 2)  # 1st max pooling
+    idim = np.ceil(np.array(idim, dtype=np.float32) / 2)  # 2nd max pooling
+    return int(idim) * out_channel  # number of channels
+
+
+class ErrorCalculator(object):
+    """Calculate CER and WER for E2E_ASR and CTC models during training.
+
+    :param y_hats: numpy array with predicted text
+    :param y_pads: numpy array with true (target) text
+    :param char_list:
+    :param sym_space:
+    :param sym_blank:
+    :return:
+    """
+
+    def __init__(
+        self, char_list, sym_space, sym_blank, report_cer=False, report_wer=False
+    ):
+        """Construct an ErrorCalculator object."""
+        super(ErrorCalculator, self).__init__()
+
+        self.report_cer = report_cer
+        self.report_wer = report_wer
+
+        self.char_list = char_list
+        self.space = sym_space
+        self.blank = sym_blank
+        self.idx_blank = self.char_list.index(self.blank)
+        if self.space in self.char_list:
+            self.idx_space = self.char_list.index(self.space)
+        else:
+            self.idx_space = None
+
+    def __call__(self, ys_hat, ys_pad, is_ctc=False):
+        """Calculate sentence-level WER/CER score.
+
+        :param torch.Tensor ys_hat: prediction (batch, seqlen)
+        :param torch.Tensor ys_pad: reference (batch, seqlen)
+        :param bool is_ctc: calculate CER score for CTC
+        :return: sentence-level CER score
+        :rtype float
+        :return: sentence-level WER score
+        :rtype float
+        """
+        cer, wer = None, None
+        if is_ctc:
+            return self.calculate_cer_ctc(ys_hat, ys_pad)
+        elif not self.report_cer and not self.report_wer:
+            return cer, wer
+
+        seqs_hat, seqs_true = self.convert_to_char(ys_hat, ys_pad)
+        if self.report_cer:
+            cer = self.calculate_cer(seqs_hat, seqs_true)
+
+        if self.report_wer:
+            wer = self.calculate_wer(seqs_hat, seqs_true)
+        return cer, wer
+
+    def calculate_cer_ctc(self, ys_hat, ys_pad):
+        """Calculate sentence-level CER score for CTC.
+
+        :param torch.Tensor ys_hat: prediction (batch, seqlen)
+        :param torch.Tensor ys_pad: reference (batch, seqlen)
+        :return: average sentence-level CER score
+        :rtype float
+        """
+        import editdistance
+
+        cers, char_ref_lens = [], []
+        for i, y in enumerate(ys_hat):
+            y_hat = [x[0] for x in groupby(y)]
+            y_true = ys_pad[i]
+            seq_hat, seq_true = [], []
+            for idx in y_hat:
+                idx = int(idx)
+                if idx != -1 and idx != self.idx_blank and idx != self.idx_space:
+                    seq_hat.append(self.char_list[int(idx)])
+
+            for idx in y_true:
+                idx = int(idx)
+                if idx != -1 and idx != self.idx_blank and idx != self.idx_space:
+                    seq_true.append(self.char_list[int(idx)])
+
+            hyp_chars = "".join(seq_hat)
+            ref_chars = "".join(seq_true)
+            if len(ref_chars) > 0:
+                cers.append(editdistance.eval(hyp_chars, ref_chars))
+                char_ref_lens.append(len(ref_chars))
+
+        cer_ctc = float(sum(cers)) / sum(char_ref_lens) if cers else None
+        return cer_ctc
+
+    def convert_to_char(self, ys_hat, ys_pad):
+        """Convert index to character.
+
+        :param torch.Tensor ys_hat: prediction (batch, seqlen)
+        :param torch.Tensor ys_pad: reference (batch, seqlen)
+        :return: token list of prediction
+        :rtype list
+        :return: token list of reference
+        :rtype list
+        """
+        seqs_hat, seqs_true = [], []
+        for i, y_hat in enumerate(ys_hat):
+            y_true = ys_pad[i]
+            eos_true = np.where(y_true == -1)[0]
+            ymax = eos_true[0] if len(eos_true) > 0 else len(y_true)
+            # NOTE: padding index (-1) in y_true is used to pad y_hat
+            seq_hat = [self.char_list[int(idx)] for idx in y_hat[:ymax]]
+            seq_true = [self.char_list[int(idx)] for idx in y_true if int(idx) != -1]
+            seq_hat_text = "".join(seq_hat).replace(self.space, " ")
+            seq_hat_text = seq_hat_text.replace(self.blank, "")
+            seq_true_text = "".join(seq_true).replace(self.space, " ")
+            seqs_hat.append(seq_hat_text)
+            seqs_true.append(seq_true_text)
+        return seqs_hat, seqs_true
+
+    def calculate_cer(self, seqs_hat, seqs_true):
+        """Calculate sentence-level CER score.
+
+        :param list seqs_hat: prediction
+        :param list seqs_true: reference
+        :return: average sentence-level CER score
+        :rtype float
+        """
+        import editdistance
+
+        char_eds, char_ref_lens = [], []
+        for i, seq_hat_text in enumerate(seqs_hat):
+            seq_true_text = seqs_true[i]
+            hyp_chars = seq_hat_text.replace(" ", "")
+            ref_chars = seq_true_text.replace(" ", "")
+            char_eds.append(editdistance.eval(hyp_chars, ref_chars))
+            char_ref_lens.append(len(ref_chars))
+        return float(sum(char_eds)) / sum(char_ref_lens)
+
+    def calculate_wer(self, seqs_hat, seqs_true):
+        """Calculate sentence-level WER score.
+
+        :param list seqs_hat: prediction
+        :param list seqs_true: reference
+        :return: average sentence-level WER score
+        :rtype float
+        """
+        import editdistance
+
+        word_eds, word_ref_lens = [], []
+        for i, seq_hat_text in enumerate(seqs_hat):
+            seq_true_text = seqs_true[i]
+            hyp_words = seq_hat_text.split()
+            ref_words = seq_true_text.split()
+            word_eds.append(editdistance.eval(hyp_words, ref_words))
+            word_ref_lens.append(len(ref_words))
+        return float(sum(word_eds)) / sum(word_ref_lens)
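
Review note on `ErrorCalculator` above: `calculate_cer_ctc` merges repeated frame labels with `groupby` and drops blank and padding ids. It then divides the summed edit distance by the summed reference length. The sketch below reproduces that flow in plain Python under a few assumptions. `levenshtein` is a stand-in for `editdistance.eval`, `ctc_collapse` and `corpus_cer` are hypothetical names, and the space-symbol filtering done in the patch is left out for brevity.

```python
from itertools import groupby


def levenshtein(hyp, ref):
    """Edit distance between two sequences (stand-in for editdistance.eval)."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        cur = [i]
        for j, r in enumerate(ref, start=1):
            cur.append(min(
                prev[j] + 1,             # delete h from the hypothesis
                cur[j - 1] + 1,          # insert r into the hypothesis
                prev[j - 1] + (h != r),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def ctc_collapse(ids, blank_id=0, pad_id=-1):
    """Merge repeated frame labels, then drop blank and padding ids."""
    return [k for k, _ in groupby(ids) if k not in (blank_id, pad_id)]


def corpus_cer(hyps, refs):
    """Total edit distance divided by total reference length.

    Utterances with an empty reference are skipped; returns None when
    nothing is left, as calculate_cer_ctc does.
    """
    edits, ref_len = 0, 0
    for hyp, ref in zip(hyps, refs):
        if len(ref) > 0:
            edits += levenshtein(hyp, ref)
            ref_len += len(ref)
    return edits / ref_len if ref_len else None


if __name__ == "__main__":
    frames = [0, 3, 3, 0, 3, 4, 4, -1, -1]   # frame-level CTC argmax ids
    hyp = ctc_collapse(frames)
    print(hyp, corpus_cer([hyp], [[3, 4]]))
```

Because numerator and denominator are summed over the batch before dividing, the result is weighted by reference length and is not a mean of per-sentence rates.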
diff --git a/funasr/models/e2e_asr_paraformer.py b/funasr/models/e2e_asr_paraformer.py
new file mode 100644
index 000000000..5ea28f31a
--- /dev/null
+++ b/funasr/models/e2e_asr_paraformer.py
@@ -0,0 +1,820 @@
+import logging
+from contextlib import contextmanager
+from distutils.version import LooseVersion
+from typing import Dict
+from typing import List
+from typing import Optional
+from typing import Tuple
+from typing import Union
+
+import torch
+from typeguard import check_argument_types
+
+from funasr.layers.abs_normalize import AbsNormalize
+from funasr.losses.label_smoothing_loss import (
+	LabelSmoothingLoss,  # noqa: H301
+)
+from funasr.models.ctc import CTC
+from funasr.models.decoder.abs_decoder import AbsDecoder
+from funasr.models.e2e_asr_common import ErrorCalculator
+from funasr.models.encoder.abs_encoder import AbsEncoder
+from funasr.models.frontend.abs_frontend import AbsFrontend
+from funasr.models.postencoder.abs_postencoder import AbsPostEncoder
+from funasr.models.preencoder.abs_preencoder import AbsPreEncoder
+from funasr.models.specaug.abs_specaug import AbsSpecAug
+from funasr.modules.add_sos_eos import add_sos_eos
+from funasr.modules.nets_utils import make_pad_mask
+from funasr.modules.nets_utils import th_accuracy
+from funasr.models.predictor.cif import mae_loss
+from funasr.torch_utils.device_funcs import force_gatherable
+from funasr.train.abs_espnet_model import AbsESPnetModel
+
+if LooseVersion(torch.__version__) >= LooseVersion("1.6.0"):
+	from torch.cuda.amp import autocast
+else:
+	# Nothing to do if torch<1.6.0
+	@contextmanager
+	def autocast(enabled=True):
+		yield
+
+class Paraformer(AbsESPnetModel):
+	"""
+	Author: Speech Lab, Alibaba Group, China
+	Paraformer: Fast and Accurate Parallel Transformer for Non-autoregressive End-to-End Speech Recognition
+	https://arxiv.org/abs/2206.08317
+	"""
+
+	def __init__(
+		self,
+		vocab_size: int,
+		token_list: Union[Tuple[str, ...], List[str]],
+		frontend: Optional[AbsFrontend],
+		specaug: Optional[AbsSpecAug],
+		normalize: Optional[AbsNormalize],
+		preencoder: Optional[AbsPreEncoder],
+		encoder: AbsEncoder,
+		postencoder: Optional[AbsPostEncoder],
+		decoder: AbsDecoder,
+		ctc: CTC,
+		ctc_weight: float = 0.5,
+		interctc_weight: float = 0.0,
+		ignore_id: int = -1,
+		blank_id: int = 0,
+		sos: int = 1,
+		eos: int = 2,
+		lsm_weight: float = 0.0,
+		length_normalized_loss: bool = False,
+		report_cer: bool = True,
+		report_wer: bool = True,
+		sym_space: str = "<space>",
+		sym_blank: str = "<blank>",
+		extract_feats_in_collect_stats: bool = True,
+		predictor = None,
+		predictor_weight: float = 0.0,
+		predictor_bias: int = 0,
+		sampling_ratio: float = 0.2,
+
+	):
+		assert check_argument_types()
+		assert 0.0 <= ctc_weight <= 1.0, ctc_weight
+		assert 0.0 <= interctc_weight < 1.0, interctc_weight
+
+		super().__init__()
+		# sos and eos each fall back to vocab_size - 1 when passed as None
+		self.blank_id = blank_id
+		self.sos = vocab_size - 1 if sos is None else sos
+		self.eos = vocab_size - 1 if eos is None else eos
+		self.vocab_size = vocab_size
+		self.ignore_id = ignore_id
+		self.ctc_weight = ctc_weight
+		self.interctc_weight = interctc_weight
+		self.token_list = token_list.copy()
+
+		self.frontend = frontend
+		self.specaug = specaug
+		self.normalize = normalize
+		self.preencoder = preencoder
+		self.postencoder = postencoder
+		self.encoder = encoder
+
+		if not hasattr(self.encoder, "interctc_use_conditioning"):
+			self.encoder.interctc_use_conditioning = False
+		if self.encoder.interctc_use_conditioning:
+			self.encoder.conditioning_layer = torch.nn.Linear(
+				vocab_size, self.encoder.output_size()
+			)
+
+		self.error_calculator = None
+
+
+		if ctc_weight == 1.0:
+			self.decoder = None
+		else:
+			self.decoder = decoder
+
+		self.criterion_att = LabelSmoothingLoss(
+			size=vocab_size,
+			padding_idx=ignore_id,
+			smoothing=lsm_weight,
+			normalize_length=length_normalized_loss,
+		)
+
+		if report_cer or report_wer:
+			self.error_calculator = ErrorCalculator(
+				token_list, sym_space, sym_blank, report_cer, report_wer
+			)
+
+		if ctc_weight == 0.0:
+			self.ctc = None
+		else:
+			self.ctc = ctc
+
+		self.extract_feats_in_collect_stats = extract_feats_in_collect_stats
+		self.predictor = predictor
+		self.predictor_weight = predictor_weight
+		self.predictor_bias = predictor_bias
+		self.sampling_ratio = sampling_ratio
+		self.criterion_pre = mae_loss(normalize_length=length_normalized_loss)
+		self.step_cur = 0
+
+
+	def forward(
+		self,
+		speech: torch.Tensor,
+		speech_lengths: torch.Tensor,
+		text: torch.Tensor,
+		text_lengths: torch.Tensor,
+	) -> Tuple[torch.Tensor, Dict[str, torch.Tensor], torch.Tensor]:
+		"""Frontend + Encoder + Decoder + Calc loss
+
+		Args:
+				speech: (Batch, Length, ...)
+				speech_lengths: (Batch, )
+				text: (Batch, Length)
+				text_lengths: (Batch,)
+		"""
+		assert text_lengths.dim() == 1, text_lengths.shape
+		# Check that batch_size is unified
+		assert (
+			speech.shape[0]
+			== speech_lengths.shape[0]
+			== text.shape[0]
+			== text_lengths.shape[0]
+		), (speech.shape, speech_lengths.shape, text.shape, text_lengths.shape)
+		batch_size = speech.shape[0]
+		self.step_cur += 1
+		# for data-parallel
+		text = text[:, : text_lengths.max()]
+		speech = speech[:, :speech_lengths.max(), :]
+
+		# 1. Encoder
+		encoder_out, encoder_out_lens = self.encode(speech, speech_lengths)
+		intermediate_outs = None
+		if isinstance(encoder_out, tuple):
+			intermediate_outs = encoder_out[1]
+			encoder_out = encoder_out[0]
+
+
+		loss_att, acc_att, cer_att, wer_att = None, None, None, None
+		loss_ctc, cer_ctc = None, None
+		loss_pre = None
+		stats = dict()
+
+		# 2a. CTC branch
+		if self.ctc_weight != 0.0:
+			loss_ctc, cer_ctc = self._calc_ctc_loss(
+				encoder_out, encoder_out_lens, text, text_lengths
+			)
+
+			# Collect CTC branch stats
+			stats["loss_ctc"] = loss_ctc.detach() if loss_ctc is not None else None
+			stats["cer_ctc"] = cer_ctc
+
+		# Intermediate CTC (optional)
+		loss_interctc = 0.0
+		if self.interctc_weight != 0.0 and intermediate_outs is not None:
+			for layer_idx, intermediate_out in intermediate_outs:
+				# we assume intermediate_out has the same length & padding
+				# as those of encoder_out
+				loss_ic, cer_ic = self._calc_ctc_loss(
+					intermediate_out, encoder_out_lens, text, text_lengths
+				)
+				loss_interctc = loss_interctc + loss_ic
+
+				# Collect Intermediate CTC stats
+				stats["loss_interctc_layer{}".format(layer_idx)] = (
+					loss_ic.detach() if loss_ic is not None else None
+				)
+				stats["cer_interctc_layer{}".format(layer_idx)] = cer_ic
+
+			loss_interctc = loss_interctc / len(intermediate_outs)
+
+			# calculate whole encoder loss
+			loss_ctc = (
+				1 - self.interctc_weight
+			) * loss_ctc + self.interctc_weight * loss_interctc
+
+
+		# 2b. Attention decoder branch
+		if self.ctc_weight != 1.0:
+
+			loss_att, acc_att, cer_att, wer_att, loss_pre = self._calc_att_loss(
+				encoder_out, encoder_out_lens, text, text_lengths
+			)
+
+		# 3. CTC-Att loss definition
+		if self.ctc_weight == 0.0:
+			loss = loss_att + loss_pre * self.predictor_weight
+		elif self.ctc_weight == 1.0:
+			loss = loss_ctc
+		else:
+			loss = self.ctc_weight * loss_ctc + (1 - self.ctc_weight) * loss_att + loss_pre * self.predictor_weight
+
+		# Collect Attn branch stats
+		stats["loss_att"] = loss_att.detach() if loss_att is not None else None
+		stats["acc"] = acc_att
+		stats["cer"] = cer_att
+		stats["wer"] = wer_att
+		stats["loss_pre"] = loss_pre.detach().cpu() if loss_pre is not None else None
+
+		stats["loss"] = torch.clone(loss.detach())
+
+		# force_gatherable: to-device and to-tensor if scalar for DataParallel
+		loss, stats, weight = force_gatherable((loss, stats, batch_size), loss.device)
+		return loss, stats, weight
+
+	def collect_feats(
+		self,
+		speech: torch.Tensor,
+		speech_lengths: torch.Tensor,
+		text: torch.Tensor,
+		text_lengths: torch.Tensor,
+	) -> Dict[str, torch.Tensor]:
+		if self.extract_feats_in_collect_stats:
+			feats, feats_lengths = self._extract_feats(speech, speech_lengths)
+		else:
+			# Generate dummy stats if extract_feats_in_collect_stats is False
+			logging.warning(
+				"Generating dummy stats for feats and feats_lengths, "
+				"because encoder_conf.extract_feats_in_collect_stats is "
+				f"{self.extract_feats_in_collect_stats}"
+			)
+			feats, feats_lengths = speech, speech_lengths
+		return {"feats": feats, "feats_lengths": feats_lengths}
+
+	def encode(
+		self, speech: torch.Tensor, speech_lengths: torch.Tensor
+	) -> Tuple[torch.Tensor, torch.Tensor]:
+		"""Frontend + Encoder. Note that this method is used by asr_inference.py
+
+		Args:
+				speech: (Batch, Length, ...)
+				speech_lengths: (Batch, )
+		"""
+		with autocast(False):
+			# 1. Extract feats
+			feats, feats_lengths = self._extract_feats(speech, speech_lengths)
+
+			# 2. Data augmentation
+			if self.specaug is not None and self.training:
+				feats, feats_lengths = self.specaug(feats, feats_lengths)
+
+			# 3. Normalization for feature: e.g. Global-CMVN, Utterance-CMVN
+			if self.normalize is not None:
+				feats, feats_lengths = self.normalize(feats, feats_lengths)
+
+		# Pre-encoder, e.g. used for raw input data
+		if self.preencoder is not None:
+			feats, feats_lengths = self.preencoder(feats, feats_lengths)
+
+		# 4. Forward encoder
+		# feats: (Batch, Length, Dim)
+		# -> encoder_out: (Batch, Length2, Dim2)
+		if self.encoder.interctc_use_conditioning:
+			encoder_out, encoder_out_lens, _ = self.encoder(
+				feats, feats_lengths, ctc=self.ctc
+			)
+		else:
+			encoder_out, encoder_out_lens, _ = self.encoder(feats, feats_lengths)
+		intermediate_outs = None
+		if isinstance(encoder_out, tuple):
+			intermediate_outs = encoder_out[1]
+			encoder_out = encoder_out[0]
+
+		# Post-encoder, e.g. NLU
+		if self.postencoder is not None:
+			encoder_out, encoder_out_lens = self.postencoder(
+				encoder_out, encoder_out_lens
+			)
+
+		assert encoder_out.size(0) == speech.size(0), (
+			encoder_out.size(),
+			speech.size(0),
+		)
+		assert encoder_out.size(1) <= encoder_out_lens.max(), (
+			encoder_out.size(),
+			encoder_out_lens.max(),
+		)
+
+		if intermediate_outs is not None:
+			return (encoder_out, intermediate_outs), encoder_out_lens
+
+		return encoder_out, encoder_out_lens
+
+	def calc_predictor(self, encoder_out, encoder_out_lens):
+
+		encoder_out_mask = (~make_pad_mask(encoder_out_lens, maxlen=encoder_out.size(1))[:, None, :]).to(encoder_out.device)
+		pre_acoustic_embeds, pre_token_length, _, pre_peak_index = self.predictor(encoder_out, None, encoder_out_mask, ignore_id=self.ignore_id)
+		return pre_acoustic_embeds, pre_token_length
+
+
+	def cal_decoder_with_predictor(self, encoder_out, encoder_out_lens, sematic_embeds, ys_pad_lens):
+
+		decoder_out, _ = self.decoder(
+			encoder_out, encoder_out_lens, sematic_embeds, ys_pad_lens
+		)
+		decoder_out = torch.log_softmax(decoder_out, dim=-1)
+		return decoder_out, ys_pad_lens
+
+	def _extract_feats(
+		self, speech: torch.Tensor, speech_lengths: torch.Tensor
+	) -> Tuple[torch.Tensor, torch.Tensor]:
+		assert speech_lengths.dim() == 1, speech_lengths.shape
+
+		# for data-parallel
+		speech = speech[:, : speech_lengths.max()]
+		if self.frontend is not None:
+			# Frontend
+			#  e.g. STFT and Feature extract
+			#       data_loader may send time-domain signal in this case
+			# speech (Batch, NSamples) -> feats: (Batch, NFrames, Dim)
+			feats, feats_lengths = self.frontend(speech, speech_lengths)
+		else:
+			# No frontend and no feature extract
+			feats, feats_lengths = speech, speech_lengths
+		return feats, feats_lengths
+
+	def nll(
+		self,
+		encoder_out: torch.Tensor,
+		encoder_out_lens: torch.Tensor,
+		ys_pad: torch.Tensor,
+		ys_pad_lens: torch.Tensor,
+	) -> torch.Tensor:
+		"""Compute negative log likelihood(nll) from transformer-decoder
+
+		Normally, this function is called in batchify_nll.
+
+		Args:
+				encoder_out: (Batch, Length, Dim)
+				encoder_out_lens: (Batch,)
+				ys_pad: (Batch, Length)
+				ys_pad_lens: (Batch,)
+		"""
+		ys_in_pad, ys_out_pad = add_sos_eos(ys_pad, self.sos, self.eos, self.ignore_id)
+		ys_in_lens = ys_pad_lens + 1
+
+		# 1. Forward decoder
+		decoder_out, _ = self.decoder(
+			encoder_out, encoder_out_lens, ys_in_pad, ys_in_lens
+		)  # [batch, seqlen, dim]
+		batch_size = decoder_out.size(0)
+		decoder_num_class = decoder_out.size(2)
+		# nll: negative log-likelihood
+		nll = torch.nn.functional.cross_entropy(
+			decoder_out.view(-1, decoder_num_class),
+			ys_out_pad.view(-1),
+			ignore_index=self.ignore_id,
+			reduction="none",
+		)
+		nll = nll.view(batch_size, -1)
+		nll = nll.sum(dim=1)
+		assert nll.size(0) == batch_size
+		return nll
+
+	def batchify_nll(
+		self,
+		encoder_out: torch.Tensor,
+		encoder_out_lens: torch.Tensor,
+		ys_pad: torch.Tensor,
+		ys_pad_lens: torch.Tensor,
+		batch_size: int = 100,
+	):
+		"""Compute negative log likelihood(nll) from transformer-decoder
+
+		To avoid OOM, this function separates the input into batches,
+		then calls nll on each batch and concatenates the results.
+		Args:
+				encoder_out: (Batch, Length, Dim)
+				encoder_out_lens: (Batch,)
+				ys_pad: (Batch, Length)
+				ys_pad_lens: (Batch,)
+				batch_size: number of samples per batch when computing nll;
+					lower it to avoid OOM, or raise it to use more
+					GPU memory
+		"""
+		total_num = encoder_out.size(0)
+		if total_num <= batch_size:
+			nll = self.nll(encoder_out, encoder_out_lens, ys_pad, ys_pad_lens)
+		else:
+			nll = []
+			start_idx = 0
+			while True:
+				end_idx = min(start_idx + batch_size, total_num)
+				batch_encoder_out = encoder_out[start_idx:end_idx, :, :]
+				batch_encoder_out_lens = encoder_out_lens[start_idx:end_idx]
+				batch_ys_pad = ys_pad[start_idx:end_idx, :]
+				batch_ys_pad_lens = ys_pad_lens[start_idx:end_idx]
+				batch_nll = self.nll(
+					batch_encoder_out,
+					batch_encoder_out_lens,
+					batch_ys_pad,
+					batch_ys_pad_lens,
+				)
+				nll.append(batch_nll)
+				start_idx = end_idx
+				if start_idx == total_num:
+					break
+			nll = torch.cat(nll)
+		assert nll.size(0) == total_num
+		return nll
+
+	def _calc_att_loss(
+		self,
+		encoder_out: torch.Tensor,
+		encoder_out_lens: torch.Tensor,
+		ys_pad: torch.Tensor,
+		ys_pad_lens: torch.Tensor,
+	):
+		encoder_out_mask = (~make_pad_mask(encoder_out_lens, maxlen=encoder_out.size(1))[:, None, :]).to(encoder_out.device)
+		if self.predictor_bias == 1:
+			_, ys_pad = add_sos_eos(ys_pad, self.sos, self.eos, self.ignore_id)
+			ys_pad_lens = ys_pad_lens + self.predictor_bias
+		pre_acoustic_embeds, pre_token_length, _, pre_peak_index = self.predictor(encoder_out, ys_pad, encoder_out_mask, ignore_id=self.ignore_id)
+
+		# 0. sampler
+		decoder_out_1st = None
+		if self.sampling_ratio > 0.0:
+			if self.step_cur < 2:
+				logging.info("enable sampler in paraformer, sampling_ratio: {}".format(self.sampling_ratio))
+			sematic_embeds, decoder_out_1st = self.sampler(encoder_out, encoder_out_lens, ys_pad, ys_pad_lens, pre_acoustic_embeds)
+		else:
+			if self.step_cur < 2:
+				logging.info("disable sampler in paraformer, sampling_ratio: {}".format(self.sampling_ratio))
+			sematic_embeds = pre_acoustic_embeds
+
+		# 1. Forward decoder
+		decoder_outs = self.decoder(
+			encoder_out, encoder_out_lens, sematic_embeds, ys_pad_lens
+		)
+		decoder_out, _ = decoder_outs[0], decoder_outs[1]
+
+		if decoder_out_1st is None:
+			decoder_out_1st = decoder_out
+		# 2. Compute attention loss
+		loss_att = self.criterion_att(decoder_out, ys_pad)
+		acc_att = th_accuracy(
+			decoder_out_1st.view(-1, self.vocab_size),
+			ys_pad,
+			ignore_label=self.ignore_id,
+		)
+		loss_pre = self.criterion_pre(ys_pad_lens.type_as(pre_token_length), pre_token_length)
+
+		# Compute cer/wer using attention-decoder
+		if self.training or self.error_calculator is None:
+			cer_att, wer_att = None, None
+		else:
+			ys_hat = decoder_out_1st.argmax(dim=-1)
+			cer_att, wer_att = self.error_calculator(ys_hat.cpu(), ys_pad.cpu())
+
+		return loss_att, acc_att, cer_att, wer_att, loss_pre
+
+	def sampler(self, encoder_out, encoder_out_lens, ys_pad, ys_pad_lens, pre_acoustic_embeds):
+
+		tgt_mask = (~make_pad_mask(ys_pad_lens, maxlen=ys_pad_lens.max())[:, :, None]).to(ys_pad.device)
+		ys_pad *= tgt_mask[:, :, 0]
+		ys_pad_embed = self.decoder.embed(ys_pad)
+		with torch.no_grad():
+			decoder_outs = self.decoder(
+				encoder_out, encoder_out_lens, pre_acoustic_embeds, ys_pad_lens
+			)
+			decoder_out, _ = decoder_outs[0], decoder_outs[1]
+			pred_tokens = decoder_out.argmax(-1)
+			nonpad_positions = ys_pad.ne(self.ignore_id)
+			seq_lens = (nonpad_positions).sum(1)
+			same_num = ((pred_tokens == ys_pad) & nonpad_positions).sum(1)
+			input_mask = torch.ones_like(nonpad_positions)
+			bsz, seq_len = ys_pad.size()
+			for li in range(bsz):
+				target_num = (((seq_lens[li] - same_num[li].sum()).float()) * self.sampling_ratio).long()
+				if target_num > 0:
+					input_mask[li].scatter_(dim=0, index=torch.randperm(seq_lens[li])[:target_num].to(input_mask.device), value=0)
+			input_mask = input_mask.eq(1)
+			input_mask = input_mask.masked_fill(~nonpad_positions, False)
+			input_mask_expand_dim = input_mask.unsqueeze(2).to(pre_acoustic_embeds.device)
+
+		sematic_embeds = pre_acoustic_embeds.masked_fill(~input_mask_expand_dim, 0) + ys_pad_embed.masked_fill(
+			input_mask_expand_dim, 0)
+		return sematic_embeds * tgt_mask, decoder_out * tgt_mask
+
+
+	def _calc_ctc_loss(
+		self,
+		encoder_out: torch.Tensor,
+		encoder_out_lens: torch.Tensor,
+		ys_pad: torch.Tensor,
+		ys_pad_lens: torch.Tensor,
+	):
+		# Calc CTC loss
+		loss_ctc = self.ctc(encoder_out, encoder_out_lens, ys_pad, ys_pad_lens)
+
+		# Calc CER using CTC
+		cer_ctc = None
+		if not self.training and self.error_calculator is not None:
+			ys_hat = self.ctc.argmax(encoder_out).data
+			cer_ctc = self.error_calculator(ys_hat.cpu(), ys_pad.cpu(), is_ctc=True)
+		return loss_ctc, cer_ctc
+
+class ParaformerBert(Paraformer):
+	"""
+	Author: Speech Lab, Alibaba Group, China
+	Paraformer2: an advanced Paraformer with LF-MMI and BERT for non-autoregressive end-to-end speech recognition
+	"""
+
+	def __init__(
+		self,
+		vocab_size: int,
+		token_list: Union[Tuple[str, ...], List[str]],
+		frontend: Optional[AbsFrontend],
+		specaug: Optional[AbsSpecAug],
+		normalize: Optional[AbsNormalize],
+		preencoder: Optional[AbsPreEncoder],
+		encoder: AbsEncoder,
+		postencoder: Optional[AbsPostEncoder],
+		decoder: AbsDecoder,
+		ctc: CTC,
+		joint_network: Optional[torch.nn.Module],
+		ctc_weight: float = 0.5,
+		interctc_weight: float = 0.0,
+		ignore_id: int = -1,
+		blank_id: int = 0,
+		sos: int = 1,
+		eos: int = 2,
+		lsm_weight: float = 0.0,
+		length_normalized_loss: bool = False,
+		report_cer: bool = True,
+		report_wer: bool = True,
+		sym_space: str = "<space>",
+		sym_blank: str = "<blank>",
+		extract_feats_in_collect_stats: bool = True,
+		predictor = None,
+		predictor_weight: float = 0.0,
+		predictor_bias: int = 0,
+		sampling_ratio: float = 0.2,
+		embeds_id: int = 2,
+		embeds_loss_weight: float = 0.0,
+		embed_dims: int = 768,
+	):
+		assert check_argument_types()
+		assert 0.0 <= ctc_weight <= 1.0, ctc_weight
+		assert 0.0 <= interctc_weight < 1.0, interctc_weight
+
+		super().__init__(
+			vocab_size=vocab_size,
+			token_list=token_list,
+			frontend=frontend,
+			specaug=specaug,
+			normalize=normalize,
+			preencoder=preencoder,
+			encoder=encoder,
+			postencoder=postencoder,
+			decoder=decoder,
+			ctc=ctc,
+			joint_network=joint_network,
+			ctc_weight=ctc_weight,
+			interctc_weight=interctc_weight,
+			ignore_id=ignore_id,
+			blank_id=blank_id,
+			sos=sos,
+			eos=eos,
+			lsm_weight=lsm_weight,
+			length_normalized_loss=length_normalized_loss,
+			report_cer=report_cer,
+			report_wer=report_wer,
+			sym_space=sym_space,
+			sym_blank=sym_blank,
+			extract_feats_in_collect_stats=extract_feats_in_collect_stats,
+			predictor=predictor,
+			predictor_weight=predictor_weight,
+			predictor_bias=predictor_bias,
+			sampling_ratio=sampling_ratio,
+		)
+		self.decoder.embeds_id = embeds_id
+		decoder_attention_dim = self.decoder.attention_dim
+		self.pro_nn = torch.nn.Linear(decoder_attention_dim, embed_dims)
+		self.cos = torch.nn.CosineSimilarity(dim=-1, eps=1e-6)
+		self.embeds_loss_weight = embeds_loss_weight
+		self.length_normalized_loss = length_normalized_loss
+
+	def _calc_embed_loss(self,
+	                     ys_pad: torch.Tensor,
+	                     ys_pad_lens: torch.Tensor,
+	                     embed: torch.Tensor = None,
+	                     embed_lengths: torch.Tensor = None,
+	                     embeds_outputs: torch.Tensor = None,
+	                     ):
+		embeds_outputs = self.pro_nn(embeds_outputs)
+		tgt_mask = (~make_pad_mask(ys_pad_lens, maxlen=ys_pad_lens.max())[:, :, None]).to(ys_pad.device)
+		embeds_outputs *= tgt_mask  # b x l x d
+		embed *= tgt_mask  # b x l x d
+		cos_loss = 1.0 - self.cos(embeds_outputs, embed)
+		cos_loss *= tgt_mask.squeeze(2)
+		if self.length_normalized_loss:
+			token_num_total = torch.sum(tgt_mask)
+		else:
+			token_num_total = tgt_mask.size()[0]
+		cos_loss_total = torch.sum(cos_loss)
+		cos_loss = cos_loss_total / token_num_total
+		# print("cos_loss: {}".format(cos_loss))
+		return cos_loss
+
+
+	def _calc_att_loss(
+		self,
+		encoder_out: torch.Tensor,
+		encoder_out_lens: torch.Tensor,
+		ys_pad: torch.Tensor,
+		ys_pad_lens: torch.Tensor,
+	):
+		encoder_out_mask = (~make_pad_mask(encoder_out_lens, maxlen=encoder_out.size(1))[:, None, :]).to(encoder_out.device)
+		if self.predictor_bias == 1:
+			_, ys_pad = add_sos_eos(ys_pad, self.sos, self.eos, self.ignore_id)
+			ys_pad_lens = ys_pad_lens + self.predictor_bias
+		pre_acoustic_embeds, pre_token_length, _, pre_peak_index = self.predictor(encoder_out, ys_pad, encoder_out_mask, ignore_id=self.ignore_id)
+
+		# 0. sampler
+		decoder_out_1st = None
+		if self.sampling_ratio > 0.0:
+			if self.step_cur < 2:
+				logging.info(
+					"enable sampler in paraformer, sampling_ratio: {}".format(self.sampling_ratio))
+			sematic_embeds, decoder_out_1st = self.sampler(encoder_out, encoder_out_lens, ys_pad, ys_pad_lens, pre_acoustic_embeds)
+		else:
+			if self.step_cur < 2:
+				logging.info(
+					"disable sampler in paraformer, sampling_ratio: {}".format(self.sampling_ratio))
+			sematic_embeds = pre_acoustic_embeds
+
+		# 1. Forward decoder
+		decoder_outs = self.decoder(
+			encoder_out, encoder_out_lens, sematic_embeds, ys_pad_lens
+		)
+		decoder_out, _ = decoder_outs[0], decoder_outs[1]
+		embeds_outputs = None
+		if len(decoder_outs) > 2:
+			embeds_outputs = decoder_outs[2]
+
+		if decoder_out_1st is None:
+			decoder_out_1st = decoder_out
+		# 2. Compute attention loss
+		loss_att = self.criterion_att(decoder_out, ys_pad)
+		acc_att = th_accuracy(
+			decoder_out_1st.view(-1, self.vocab_size),
+			ys_pad,
+			ignore_label=self.ignore_id,
+		)
+		loss_pre = self.criterion_pre(ys_pad_lens.type_as(pre_token_length), pre_token_length)
+
+		# Compute cer/wer using attention-decoder
+		if self.training or self.error_calculator is None:
+			cer_att, wer_att = None, None
+		else:
+			ys_hat = decoder_out_1st.argmax(dim=-1)
+			cer_att, wer_att = self.error_calculator(ys_hat.cpu(), ys_pad.cpu())
+
+		return loss_att, acc_att, cer_att, wer_att, loss_pre, embeds_outputs
+
+
+	def forward(
+		self,
+		speech: torch.Tensor,
+		speech_lengths: torch.Tensor,
+		text: torch.Tensor,
+		text_lengths: torch.Tensor,
+		embed: torch.Tensor = None,
+		embed_lengths: torch.Tensor = None,
+	) -> Tuple[torch.Tensor, Dict[str, torch.Tensor], torch.Tensor]:
+		"""Frontend + Encoder + Decoder + Calc loss
+
+		Args:
+			speech: (Batch, Length, ...)
+			speech_lengths: (Batch, )
+			text: (Batch, Length)
+			text_lengths: (Batch,)
+		"""
+		assert text_lengths.dim() == 1, text_lengths.shape
+		# Check that batch_size is unified
+		assert (
+			speech.shape[0]
+			== speech_lengths.shape[0]
+			== text.shape[0]
+			== text_lengths.shape[0]
+		), (speech.shape, speech_lengths.shape, text.shape, text_lengths.shape)
+		batch_size = speech.shape[0]
+		self.step_cur += 1
+		# for data-parallel
+		text = text[:, : text_lengths.max()]
+		speech = speech[:, :speech_lengths.max(), :]
+		if embed is not None:
+			embed = embed[:, :embed_lengths.max(), :]
+
+		# 1. Encoder
+		encoder_out, encoder_out_lens = self.encode(speech, speech_lengths)
+		intermediate_outs = None
+		if isinstance(encoder_out, tuple):
+			intermediate_outs = encoder_out[1]
+			encoder_out = encoder_out[0]
+
+
+		loss_att, acc_att, cer_att, wer_att = None, None, None, None
+		loss_ctc, cer_ctc = None, None
+		loss_pre = 0.0
+		cos_loss = 0.0
+		stats = dict()
+
+		# 1. CTC branch
+		if self.ctc_weight != 0.0:
+			loss_ctc, cer_ctc = self._calc_ctc_loss(
+				encoder_out, encoder_out_lens, text, text_lengths
+			)
+
+			# Collect CTC branch stats
+			stats["loss_ctc"] = loss_ctc.detach() if loss_ctc is not None else None
+			stats["cer_ctc"] = cer_ctc
+
+		# Intermediate CTC (optional)
+		loss_interctc = 0.0
+		if self.interctc_weight != 0.0 and intermediate_outs is not None:
+			for layer_idx, intermediate_out in intermediate_outs:
+				# we assume intermediate_out has the same length & padding
+				# as those of encoder_out
+				loss_ic, cer_ic = self._calc_ctc_loss(
+					intermediate_out, encoder_out_lens, text, text_lengths
+				)
+				loss_interctc = loss_interctc + loss_ic
+
+				# Collect Intermediate CTC stats
+				stats["loss_interctc_layer{}".format(layer_idx)] = (
+					loss_ic.detach() if loss_ic is not None else None
+				)
+				stats["cer_interctc_layer{}".format(layer_idx)] = cer_ic
+
+			loss_interctc = loss_interctc / len(intermediate_outs)
+
+			# calculate whole encoder loss
+			loss_ctc = (
+				           1 - self.interctc_weight
+			           ) * loss_ctc + self.interctc_weight * loss_interctc
+
+
+		# 2b. Attention decoder branch
+		if self.ctc_weight != 1.0:
+
+			loss_ret = self._calc_att_loss(
+				encoder_out, encoder_out_lens, text, text_lengths
+			)
+			loss_att, acc_att, cer_att, wer_att, loss_pre = loss_ret[:5]
+			embeds_outputs = None
+			if len(loss_ret) > 5:
+				embeds_outputs = loss_ret[5]
+			if embeds_outputs is not None:
+				cos_loss = self._calc_embed_loss(text, text_lengths, embed, embed_lengths, embeds_outputs)
+
+		# 3. CTC-Att loss definition
+		if self.ctc_weight == 0.0:
+			loss = loss_att + loss_pre * self.predictor_weight + cos_loss * self.embeds_loss_weight
+		elif self.ctc_weight == 1.0:
+			loss = loss_ctc
+		else:
+			loss = self.ctc_weight * loss_ctc + (1 - self.ctc_weight) * loss_att + loss_pre * self.predictor_weight + cos_loss * self.embeds_loss_weight
+
+		# Collect Attn branch stats
+		stats["loss_att"] = loss_att.detach() if loss_att is not None else None
+		stats["acc"] = acc_att
+		stats["cer"] = cer_att
+		stats["wer"] = wer_att
+		stats["loss_pre"] = loss_pre.detach().cpu() if loss_pre > 0.0 else None
+		stats["cos_loss"] = cos_loss.detach().cpu() if cos_loss > 0.0 else None
+
+		stats["loss"] = torch.clone(loss.detach())
+
+		# force_gatherable: to-device and to-tensor if scalar for DataParallel
+		loss, stats, weight = force_gatherable((loss, stats, batch_size), loss.device)
+		return loss, stats, weight
+
+
+
+
+
+
+
diff --git a/funasr/models/e2e_uni_asr.py b/funasr/models/e2e_uni_asr.py
new file mode 100644
index 000000000..03fbca9af
--- /dev/null
+++ b/funasr/models/e2e_uni_asr.py
@@ -0,0 +1,1076 @@
+import logging
+from contextlib import contextmanager
+from distutils.version import LooseVersion
+from typing import Dict
+from typing import List
+from typing import Optional
+from typing import Tuple
+from typing import Union
+
+import torch
+from typeguard import check_argument_types
+
+from funasr.models.e2e_asr_common import ErrorCalculator
+from funasr.modules.nets_utils import th_accuracy
+from funasr.modules.add_sos_eos import add_sos_eos
+from funasr.losses.label_smoothing_loss import (
+    LabelSmoothingLoss,  # noqa: H301
+)
+from funasr.models.ctc import CTC
+from funasr.models.decoder.abs_decoder import AbsDecoder
+from funasr.models.encoder.abs_encoder import AbsEncoder
+from funasr.models.frontend.abs_frontend import AbsFrontend
+from funasr.models.postencoder.abs_postencoder import AbsPostEncoder
+from funasr.models.preencoder.abs_preencoder import AbsPreEncoder
+from funasr.models.specaug.abs_specaug import AbsSpecAug
+from funasr.layers.abs_normalize import AbsNormalize
+from funasr.torch_utils.device_funcs import force_gatherable
+from funasr.train.abs_espnet_model import AbsESPnetModel
+from funasr.modules.streaming_utils.chunk_utilis import sequence_mask
+from funasr.models.predictor.cif import mae_loss
+
+if LooseVersion(torch.__version__) >= LooseVersion("1.6.0"):
+    from torch.cuda.amp import autocast
+else:
+    # Nothing to do if torch<1.6.0
+    @contextmanager
+    def autocast(enabled=True):
+        yield
+
+
+class UniASR(AbsESPnetModel):
+    """
+    Author: Speech Lab, Alibaba Group, China
+    """
+
+    def __init__(
+        self,
+        vocab_size: int,
+        token_list: Union[Tuple[str, ...], List[str]],
+        frontend: Optional[AbsFrontend],
+        specaug: Optional[AbsSpecAug],
+        normalize: Optional[AbsNormalize],
+        preencoder: Optional[AbsPreEncoder],
+        encoder: AbsEncoder,
+        postencoder: Optional[AbsPostEncoder],
+        decoder: AbsDecoder,
+        ctc: CTC,
+        ctc_weight: float = 0.5,
+        interctc_weight: float = 0.0,
+        ignore_id: int = -1,
+        lsm_weight: float = 0.0,
+        length_normalized_loss: bool = False,
+        report_cer: bool = True,
+        report_wer: bool = True,
+        sym_space: str = "<space>",
+        sym_blank: str = "<blank>",
+        extract_feats_in_collect_stats: bool = True,
+        predictor=None,
+        predictor_weight: float = 0.0,
+        decoder_attention_chunk_type: str = 'chunk',
+        encoder2: AbsEncoder = None,
+        decoder2: AbsDecoder = None,
+        ctc2: CTC = None,
+        ctc_weight2: float = 0.5,
+        interctc_weight2: float = 0.0,
+        predictor2=None,
+        predictor_weight2: float = 0.0,
+        decoder_attention_chunk_type2: str = 'chunk',
+        stride_conv=None,
+        loss_weight_model1: float = 0.5,
+        enable_maas_finetune: bool = False,
+        freeze_encoder2: bool = False,
+        encoder1_encoder2_joint_training: bool = True,
+    ):
+        assert check_argument_types()
+        assert 0.0 <= ctc_weight <= 1.0, ctc_weight
+        assert 0.0 <= interctc_weight < 1.0, interctc_weight
+
+        super().__init__()
+        self.blank_id = 0
+        self.sos = 1
+        self.eos = 2
+        self.vocab_size = vocab_size
+        self.ignore_id = ignore_id
+        self.ctc_weight = ctc_weight
+        self.interctc_weight = interctc_weight
+        self.token_list = token_list.copy()
+
+        self.frontend = frontend
+        self.specaug = specaug
+        self.normalize = normalize
+        self.preencoder = preencoder
+        self.postencoder = postencoder
+        self.encoder = encoder
+
+        if not hasattr(self.encoder, "interctc_use_conditioning"):
+            self.encoder.interctc_use_conditioning = False
+        if self.encoder.interctc_use_conditioning:
+            self.encoder.conditioning_layer = torch.nn.Linear(
+                vocab_size, self.encoder.output_size()
+            )
+
+        self.error_calculator = None
+
+        # we set self.decoder = None in the CTC mode since
+        # self.decoder parameters were never used and PyTorch complained
+        # and threw an Exception in the multi-GPU experiment.
+        # thanks Jeff Farris for pointing out the issue.
+        if ctc_weight == 1.0:
+            self.decoder = None
+        else:
+            self.decoder = decoder
+
+        self.criterion_att = LabelSmoothingLoss(
+            size=vocab_size,
+            padding_idx=ignore_id,
+            smoothing=lsm_weight,
+            normalize_length=length_normalized_loss,
+        )
+
+        if report_cer or report_wer:
+            self.error_calculator = ErrorCalculator(
+                token_list, sym_space, sym_blank, report_cer, report_wer
+            )
+
+        if ctc_weight == 0.0:
+            self.ctc = None
+        else:
+            self.ctc = ctc
+
+        self.extract_feats_in_collect_stats = extract_feats_in_collect_stats
+        self.predictor = predictor
+        self.predictor_weight = predictor_weight
+        self.criterion_pre = mae_loss(normalize_length=length_normalized_loss)
+        self.step_cur = 0
+        if self.encoder.overlap_chunk_cls is not None:
+            from funasr.modules.streaming_utils.chunk_utilis import build_scama_mask_for_cross_attention_decoder
+            self.build_scama_mask_for_cross_attention_decoder_fn = build_scama_mask_for_cross_attention_decoder
+            self.decoder_attention_chunk_type = decoder_attention_chunk_type
+
+        self.encoder2 = encoder2
+        self.decoder2 = decoder2
+        self.ctc_weight2 = ctc_weight2
+        if ctc_weight2 == 0.0:
+            self.ctc2 = None
+        else:
+            self.ctc2 = ctc2
+        self.interctc_weight2 = interctc_weight2
+        self.predictor2 = predictor2
+        self.predictor_weight2 = predictor_weight2
+        self.decoder_attention_chunk_type2 = decoder_attention_chunk_type2
+        self.stride_conv = stride_conv
+        self.loss_weight_model1 = loss_weight_model1
+        if self.encoder2 is not None and self.encoder2.overlap_chunk_cls is not None:
+            from funasr.modules.streaming_utils.chunk_utilis import build_scama_mask_for_cross_attention_decoder
+            self.build_scama_mask_for_cross_attention_decoder_fn2 = build_scama_mask_for_cross_attention_decoder
+            self.decoder_attention_chunk_type2 = decoder_attention_chunk_type2
+
+        self.enable_maas_finetune = enable_maas_finetune
+        self.freeze_encoder2 = freeze_encoder2
+        self.encoder1_encoder2_joint_training = encoder1_encoder2_joint_training
+
+    def forward(
+        self,
+        speech: torch.Tensor,
+        speech_lengths: torch.Tensor,
+        text: torch.Tensor,
+        text_lengths: torch.Tensor,
+        decoding_ind: int = None,
+    ) -> Tuple[torch.Tensor, Dict[str, torch.Tensor], torch.Tensor]:
+        """Frontend + Encoder + Decoder + Calc loss
+
+        Args:
+            speech: (Batch, Length, ...)
+            speech_lengths: (Batch, )
+            text: (Batch, Length)
+            text_lengths: (Batch,)
+        """
+        assert text_lengths.dim() == 1, text_lengths.shape
+        # Check that batch_size is unified
+        assert (
+            speech.shape[0]
+            == speech_lengths.shape[0]
+            == text.shape[0]
+            == text_lengths.shape[0]
+        ), (speech.shape, speech_lengths.shape, text.shape, text_lengths.shape)
+        batch_size = speech.shape[0]
+
+        # for data-parallel
+        text = text[:, : text_lengths.max()]
+        speech = speech[:, :speech_lengths.max(), :]
+
+        ind = self.encoder.overlap_chunk_cls.random_choice(self.training, decoding_ind)
+        speech_raw = speech.clone()
+        # 1. Encoder
+        if self.enable_maas_finetune:
+            with torch.no_grad():
+                encoder_out, encoder_out_lens = self.encode(speech, speech_lengths, ind=ind)
+        else:
+            encoder_out, encoder_out_lens = self.encode(speech, speech_lengths, ind=ind)
+
+        intermediate_outs = None
+        if isinstance(encoder_out, tuple):
+            intermediate_outs = encoder_out[1]
+            encoder_out = encoder_out[0]
+
+        loss_att, acc_att, cer_att, wer_att = None, None, None, None
+        loss_ctc, cer_ctc = None, None
+        stats = dict()
+        loss_pre = None
+        loss, loss1, loss2 = 0.0, 0.0, 0.0
+
+        if self.loss_weight_model1 > 0.0:
+            ## model1
+            # 1. CTC branch
+            if self.enable_maas_finetune:
+                with torch.no_grad():
+                    if self.ctc_weight != 0.0:
+                        if self.encoder.overlap_chunk_cls is not None:
+                            encoder_out_ctc, encoder_out_lens_ctc = self.encoder.overlap_chunk_cls.remove_chunk(encoder_out,
+                                                                                                                encoder_out_lens,
+                                                                                                                chunk_outs=None)
+                        loss_ctc, cer_ctc = self._calc_ctc_loss(
+                            encoder_out_ctc, encoder_out_lens_ctc, text, text_lengths
+                        )
+
+                        # Collect CTC branch stats
+                        stats["loss_ctc"] = loss_ctc.detach() if loss_ctc is not None else None
+                        stats["cer_ctc"] = cer_ctc
+
+                    # Intermediate CTC (optional)
+                    loss_interctc = 0.0
+                    if self.interctc_weight != 0.0 and intermediate_outs is not None:
+                        for layer_idx, intermediate_out in intermediate_outs:
+                            # we assume intermediate_out has the same length & padding
+                            # as those of encoder_out
+                            if self.encoder.overlap_chunk_cls is not None:
+                                encoder_out_ctc, encoder_out_lens_ctc = \
+                                    self.encoder.overlap_chunk_cls.remove_chunk(
+                                        intermediate_out,
+                                        encoder_out_lens,
+                                        chunk_outs=None)
+                            loss_ic, cer_ic = self._calc_ctc_loss(
+                                encoder_out_ctc, encoder_out_lens_ctc, text, text_lengths
+                            )
+                            loss_interctc = loss_interctc + loss_ic
+
+                            # Collect Intermediate CTC stats
+                            stats["loss_interctc_layer{}".format(layer_idx)] = (
+                                loss_ic.detach() if loss_ic is not None else None
+                            )
+                            stats["cer_interctc_layer{}".format(layer_idx)] = cer_ic
+
+                        loss_interctc = loss_interctc / len(intermediate_outs)
+
+                        # calculate whole encoder loss
+                        loss_ctc = (
+                                    1 - self.interctc_weight
+                                ) * loss_ctc + self.interctc_weight * loss_interctc
+
+                    # 2b. Attention decoder branch
+                    if self.ctc_weight != 1.0:
+                        loss_att, acc_att, cer_att, wer_att, loss_pre = self._calc_att_predictor_loss(
+                            encoder_out, encoder_out_lens, text, text_lengths
+                        )
+
+                    # 3. CTC-Att loss definition
+                    if self.ctc_weight == 0.0:
+                        loss = loss_att + loss_pre * self.predictor_weight
+                    elif self.ctc_weight == 1.0:
+                        loss = loss_ctc
+                    else:
+                        loss = self.ctc_weight * loss_ctc + (1 - self.ctc_weight) * loss_att + loss_pre * self.predictor_weight
+
+                    # Collect Attn branch stats
+                    stats["loss_att"] = loss_att.detach() if loss_att is not None else None
+                    stats["acc"] = acc_att
+                    stats["cer"] = cer_att
+                    stats["wer"] = wer_att
+                    stats["loss_pre"] = loss_pre.detach().cpu() if loss_pre is not None else None
+            else:
+                if self.ctc_weight != 0.0:
+                    if self.encoder.overlap_chunk_cls is not None:
+                        encoder_out_ctc, encoder_out_lens_ctc = self.encoder.overlap_chunk_cls.remove_chunk(encoder_out,
+                                                                                                            encoder_out_lens,
+                                                                                                            chunk_outs=None)
+                    loss_ctc, cer_ctc = self._calc_ctc_loss(
+                        encoder_out_ctc, encoder_out_lens_ctc, text, text_lengths
+                    )
+
+                    # Collect CTC branch stats
+                    stats["loss_ctc"] = loss_ctc.detach() if loss_ctc is not None else None
+                    stats["cer_ctc"] = cer_ctc
+
+                # Intermediate CTC (optional)
+                loss_interctc = 0.0
+                if self.interctc_weight != 0.0 and intermediate_outs is not None:
+                    for layer_idx, intermediate_out in intermediate_outs:
+                        # we assume intermediate_out has the same length & padding
+                        # as those of encoder_out
+                        if self.encoder.overlap_chunk_cls is not None:
+                            encoder_out_ctc, encoder_out_lens_ctc = \
+                                self.encoder.overlap_chunk_cls.remove_chunk(
+                                    intermediate_out,
+                                    encoder_out_lens,
+                                    chunk_outs=None)
+                        loss_ic, cer_ic = self._calc_ctc_loss(
+                            encoder_out_ctc, encoder_out_lens_ctc, text, text_lengths
+                        )
+                        loss_interctc = loss_interctc + loss_ic
+
+                        # Collect Intermediate CTC stats
+                        stats["loss_interctc_layer{}".format(layer_idx)] = (
+                            loss_ic.detach() if loss_ic is not None else None
+                        )
+                        stats["cer_interctc_layer{}".format(layer_idx)] = cer_ic
+
+                    loss_interctc = loss_interctc / len(intermediate_outs)
+
+                    # calculate whole encoder loss
+                    loss_ctc = (
+                                1 - self.interctc_weight
+                            ) * loss_ctc + self.interctc_weight * loss_interctc
+
+                # 2b. Attention decoder branch
+                if self.ctc_weight != 1.0:
+                    loss_att, acc_att, cer_att, wer_att, loss_pre = self._calc_att_predictor_loss(
+                        encoder_out, encoder_out_lens, text, text_lengths
+                    )
+
+                # 3. CTC-Att loss definition
+                if self.ctc_weight == 0.0:
+                    loss = loss_att + loss_pre * self.predictor_weight
+                elif self.ctc_weight == 1.0:
+                    loss = loss_ctc
+                else:
+                    loss = self.ctc_weight * loss_ctc + (1 - self.ctc_weight) * loss_att + loss_pre * self.predictor_weight
+
+                # Collect Attn branch stats
+                stats["loss_att"] = loss_att.detach() if loss_att is not None else None
+                stats["acc"] = acc_att
+                stats["cer"] = cer_att
+                stats["wer"] = wer_att
+                stats["loss_pre"] = loss_pre.detach().cpu() if loss_pre is not None else None
+
+        loss1 = loss
+
+        if self.loss_weight_model1 < 1.0:
+            ## model2
+
+            # encoder2
+            if self.freeze_encoder2:
+                with torch.no_grad():
+                    encoder_out, encoder_out_lens = self.encode2(encoder_out, encoder_out_lens, speech_raw, speech_lengths, ind=ind)
+            else:
+                encoder_out, encoder_out_lens = self.encode2(encoder_out, encoder_out_lens, speech_raw, speech_lengths, ind=ind)
+
+            intermediate_outs = None
+            if isinstance(encoder_out, tuple):
+                intermediate_outs = encoder_out[1]
+                encoder_out = encoder_out[0]
+            # CTC2
+            if self.ctc_weight2 != 0.0:
+                if self.encoder2.overlap_chunk_cls is not None:
+                    encoder_out_ctc, encoder_out_lens_ctc = \
+                        self.encoder2.overlap_chunk_cls.remove_chunk(
+                            encoder_out,
+                            encoder_out_lens,
+                            chunk_outs=None,
+                        )
+                loss_ctc, cer_ctc = self._calc_ctc_loss2(
+                    encoder_out_ctc, encoder_out_lens_ctc, text, text_lengths
+                )
+
+                # Collect CTC branch stats
+                stats["loss_ctc2"] = loss_ctc.detach() if loss_ctc is not None else None
+                stats["cer_ctc2"] = cer_ctc
+
+            # Intermediate CTC (optional)
+            loss_interctc = 0.0
+            if self.interctc_weight2 != 0.0 and intermediate_outs is not None:
+                for layer_idx, intermediate_out in intermediate_outs:
+                    # we assume intermediate_out has the same length & padding
+                    # as those of encoder_out
+                    if self.encoder2.overlap_chunk_cls is not None:
+                        encoder_out_ctc, encoder_out_lens_ctc = \
+                            self.encoder2.overlap_chunk_cls.remove_chunk(
+                                intermediate_out,
+                                encoder_out_lens,
+                                chunk_outs=None)
+                    loss_ic, cer_ic = self._calc_ctc_loss2(
+                        encoder_out_ctc, encoder_out_lens_ctc, text, text_lengths
+                    )
+                    loss_interctc = loss_interctc + loss_ic
+
+                    # Collect Intermediate CTC stats
+                    stats["loss_interctc_layer{}2".format(layer_idx)] = (
+                        loss_ic.detach() if loss_ic is not None else None
+                    )
+                    stats["cer_interctc_layer{}2".format(layer_idx)] = cer_ic
+
+                loss_interctc = loss_interctc / len(intermediate_outs)
+
+                # calculate whole encoder loss
+                loss_ctc = (
+                               1 - self.interctc_weight2
+                           ) * loss_ctc + self.interctc_weight2 * loss_interctc
+
+            # 2b. Attention decoder branch
+            if self.ctc_weight2 != 1.0:
+                loss_att, acc_att, cer_att, wer_att, loss_pre = self._calc_att_predictor_loss2(
+                    encoder_out, encoder_out_lens, text, text_lengths
+                )
+
+            # 3. CTC-Att loss definition
+            if self.ctc_weight2 == 0.0:
+                loss = loss_att + loss_pre * self.predictor_weight2
+            elif self.ctc_weight2 == 1.0:
+                loss = loss_ctc
+            else:
+                loss = self.ctc_weight2 * loss_ctc + (
+                    1 - self.ctc_weight2) * loss_att + loss_pre * self.predictor_weight2
+
+            # Collect Attn branch stats
+            stats["loss_att2"] = loss_att.detach() if loss_att is not None else None
+            stats["acc2"] = acc_att
+            stats["cer2"] = cer_att
+            stats["wer2"] = wer_att
+            stats["loss_pre2"] = loss_pre.detach().cpu() if loss_pre is not None else None
+        loss2 = loss
+
+        loss = loss1 * self.loss_weight_model1 + loss2 * (1 - self.loss_weight_model1)
+        stats["loss1"] = torch.clone(loss1.detach())
+        stats["loss2"] = torch.clone(loss2.detach())
+        stats["loss"] = torch.clone(loss.detach())
+        # force_gatherable: to-device and to-tensor if scalar for DataParallel
+        loss, stats, weight = force_gatherable((loss, stats, batch_size), loss.device)
+        return loss, stats, weight
+
+    def collect_feats(
+        self,
+        speech: torch.Tensor,
+        speech_lengths: torch.Tensor,
+        text: torch.Tensor,
+        text_lengths: torch.Tensor,
+    ) -> Dict[str, torch.Tensor]:
+        if self.extract_feats_in_collect_stats:
+            feats, feats_lengths = self._extract_feats(speech, speech_lengths)
+        else:
+            # Generate dummy stats if extract_feats_in_collect_stats is False
+            logging.warning(
+                "Generating dummy stats for feats and feats_lengths, "
+                "because encoder_conf.extract_feats_in_collect_stats is "
+                f"{self.extract_feats_in_collect_stats}"
+            )
+            feats, feats_lengths = speech, speech_lengths
+        return {"feats": feats, "feats_lengths": feats_lengths}
+
+    def encode(
+        self, speech: torch.Tensor, speech_lengths: torch.Tensor, ind: int = 0,
+    ) -> Tuple[torch.Tensor, torch.Tensor]:
+        """Frontend + Encoder. Note that this method is used by asr_inference.py
+
+        Args:
+            speech: (Batch, Length, ...)
+            speech_lengths: (Batch, )
+        """
+        with autocast(False):
+            # 1. Extract feats
+            feats, feats_lengths = self._extract_feats(speech, speech_lengths)
+
+            # 2. Data augmentation
+            if self.specaug is not None and self.training:
+                feats, feats_lengths = self.specaug(feats, feats_lengths)
+
+            # 3. Normalization for feature: e.g. Global-CMVN, Utterance-CMVN
+            if self.normalize is not None:
+                feats, feats_lengths = self.normalize(feats, feats_lengths)
+
+        # Pre-encoder, e.g. used for raw input data
+        if self.preencoder is not None:
+            feats, feats_lengths = self.preencoder(feats, feats_lengths)
+
+        # 4. Forward encoder
+        # feats: (Batch, Length, Dim)
+        # -> encoder_out: (Batch, Length2, Dim2)
+        if self.encoder.interctc_use_conditioning:
+            encoder_out, encoder_out_lens, _ = self.encoder(
+                feats, feats_lengths, ctc=self.ctc, ind=ind
+            )
+        else:
+            encoder_out, encoder_out_lens, _ = self.encoder(feats, feats_lengths, ind=ind)
+        intermediate_outs = None
+        if isinstance(encoder_out, tuple):
+            intermediate_outs = encoder_out[1]
+            encoder_out = encoder_out[0]
+
+        # Post-encoder, e.g. NLU
+        if self.postencoder is not None:
+            encoder_out, encoder_out_lens = self.postencoder(
+                encoder_out, encoder_out_lens
+            )
+
+        assert encoder_out.size(0) == speech.size(0), (
+            encoder_out.size(),
+            speech.size(0),
+        )
+        assert encoder_out.size(1) <= encoder_out_lens.max(), (
+            encoder_out.size(),
+            encoder_out_lens.max(),
+        )
+
+        if intermediate_outs is not None:
+            return (encoder_out, intermediate_outs), encoder_out_lens
+
+        return encoder_out, encoder_out_lens
+
+    def encode2(
+        self,
+        encoder_out: torch.Tensor,
+        encoder_out_lens: torch.Tensor,
+        speech: torch.Tensor,
+        speech_lengths: torch.Tensor,
+        ind: int = 0,
+    ) -> Tuple[torch.Tensor, torch.Tensor]:
+        """Second encoder (encoder2). Note that this method is used by asr_inference.py
+
+        Args:
+            encoder_out, encoder_out_lens: output of the first encoder and its lengths
+            speech, speech_lengths: (Batch, Length, ...) and (Batch, )
+        """
+        # with autocast(False):
+        # 	# 1. Extract feats
+        # 	feats, feats_lengths = self._extract_feats(speech, speech_lengths)
+        #
+        # 	# 2. Data augmentation
+        # 	if self.specaug is not None and self.training:
+        # 		feats, feats_lengths = self.specaug(feats, feats_lengths)
+        #
+        # 	# 3. Normalization for feature: e.g. Global-CMVN, Utterance-CMVN
+        # 	if self.normalize is not None:
+        # 		feats, feats_lengths = self.normalize(feats, feats_lengths)
+
+        # Pre-encoder, e.g. used for raw input data
+        # if self.preencoder is not None:
+        # 	feats, feats_lengths = self.preencoder(feats, feats_lengths)
+        encoder_out_rm, encoder_out_lens_rm = self.encoder.overlap_chunk_cls.remove_chunk(
+            encoder_out,
+            encoder_out_lens,
+            chunk_outs=None,
+        )
+        # residual_input
+        encoder_out = torch.cat((speech, encoder_out_rm), dim=-1)
+        encoder_out_lens = encoder_out_lens_rm
+        if self.stride_conv is not None:
+            speech, speech_lengths = self.stride_conv(encoder_out, encoder_out_lens)
+        if not self.encoder1_encoder2_joint_training:
+            speech = speech.detach()
+            speech_lengths = speech_lengths.detach()
+        # 4. Forward encoder
+        # feats: (Batch, Length, Dim)
+        # -> encoder_out: (Batch, Length2, Dim2)
+        if self.encoder2.interctc_use_conditioning:
+            encoder_out, encoder_out_lens, _ = self.encoder2(
+                speech, speech_lengths, ctc=self.ctc2, ind=ind
+            )
+        else:
+            encoder_out, encoder_out_lens, _ = self.encoder2(speech, speech_lengths, ind=ind)
+        intermediate_outs = None
+        if isinstance(encoder_out, tuple):
+            intermediate_outs = encoder_out[1]
+            encoder_out = encoder_out[0]
+
+        # # Post-encoder, e.g. NLU
+        # if self.postencoder is not None:
+        # 	encoder_out, encoder_out_lens = self.postencoder(
+        # 		encoder_out, encoder_out_lens
+        # 	)
+
+        assert encoder_out.size(0) == speech.size(0), (
+            encoder_out.size(),
+            speech.size(0),
+        )
+        assert encoder_out.size(1) <= encoder_out_lens.max(), (
+            encoder_out.size(),
+            encoder_out_lens.max(),
+        )
+
+        if intermediate_outs is not None:
+            return (encoder_out, intermediate_outs), encoder_out_lens
+
+        return encoder_out, encoder_out_lens
+
+    def _extract_feats(
+        self, speech: torch.Tensor, speech_lengths: torch.Tensor
+    ) -> Tuple[torch.Tensor, torch.Tensor]:
+        assert speech_lengths.dim() == 1, speech_lengths.shape
+
+        # for data-parallel
+        speech = speech[:, : speech_lengths.max()]
+
+        if self.frontend is not None:
+            # Frontend
+            #  e.g. STFT and Feature extract
+            #       data_loader may send time-domain signal in this case
+            # speech (Batch, NSamples) -> feats: (Batch, NFrames, Dim)
+            feats, feats_lengths = self.frontend(speech, speech_lengths)
+        else:
+            # No frontend and no feature extract
+            feats, feats_lengths = speech, speech_lengths
+        return feats, feats_lengths
+
+    def nll(
+        self,
+        encoder_out: torch.Tensor,
+        encoder_out_lens: torch.Tensor,
+        ys_pad: torch.Tensor,
+        ys_pad_lens: torch.Tensor,
+    ) -> torch.Tensor:
+        """Compute negative log likelihood(nll) from transformer-decoder
+
+        Normally, this function is called in batchify_nll.
+
+        Args:
+            encoder_out: (Batch, Length, Dim)
+            encoder_out_lens: (Batch,)
+            ys_pad: (Batch, Length)
+            ys_pad_lens: (Batch,)
+        """
+        ys_in_pad, ys_out_pad = add_sos_eos(ys_pad, self.sos, self.eos, self.ignore_id)
+        ys_in_lens = ys_pad_lens + 1
+
+        # 1. Forward decoder
+        decoder_out, _ = self.decoder(
+            encoder_out, encoder_out_lens, ys_in_pad, ys_in_lens
+        )  # [batch, seqlen, dim]
+        batch_size = decoder_out.size(0)
+        decoder_num_class = decoder_out.size(2)
+        # nll: negative log-likelihood
+        nll = torch.nn.functional.cross_entropy(
+            decoder_out.view(-1, decoder_num_class),
+            ys_out_pad.view(-1),
+            ignore_index=self.ignore_id,
+            reduction="none",
+        )
+        nll = nll.view(batch_size, -1)
+        nll = nll.sum(dim=1)
+        assert nll.size(0) == batch_size
+        return nll
+
+    def batchify_nll(
+        self,
+        encoder_out: torch.Tensor,
+        encoder_out_lens: torch.Tensor,
+        ys_pad: torch.Tensor,
+        ys_pad_lens: torch.Tensor,
+        batch_size: int = 100,
+    ):
+        """Compute negative log likelihood(nll) from transformer-decoder
+
+        To avoid OOM, this function separates the input into batches,
+        then calls nll for each batch and combines and returns the results.
+        Args:
+            encoder_out: (Batch, Length, Dim)
+            encoder_out_lens: (Batch,)
+            ys_pad: (Batch, Length)
+            ys_pad_lens: (Batch,)
+            batch_size: int, number of samples per batch when computing nll;
+                lower this to avoid OOM, or raise it to use more
+                GPU memory
+        """
+        total_num = encoder_out.size(0)
+        if total_num <= batch_size:
+            nll = self.nll(encoder_out, encoder_out_lens, ys_pad, ys_pad_lens)
+        else:
+            nll = []
+            start_idx = 0
+            while True:
+                end_idx = min(start_idx + batch_size, total_num)
+                batch_encoder_out = encoder_out[start_idx:end_idx, :, :]
+                batch_encoder_out_lens = encoder_out_lens[start_idx:end_idx]
+                batch_ys_pad = ys_pad[start_idx:end_idx, :]
+                batch_ys_pad_lens = ys_pad_lens[start_idx:end_idx]
+                batch_nll = self.nll(
+                    batch_encoder_out,
+                    batch_encoder_out_lens,
+                    batch_ys_pad,
+                    batch_ys_pad_lens,
+                )
+                nll.append(batch_nll)
+                start_idx = end_idx
+                if start_idx == total_num:
+                    break
+            nll = torch.cat(nll)
+        assert nll.size(0) == total_num
+        return nll
+
+    def _calc_att_loss(
+        self,
+        encoder_out: torch.Tensor,
+        encoder_out_lens: torch.Tensor,
+        ys_pad: torch.Tensor,
+        ys_pad_lens: torch.Tensor,
+    ):
+        ys_in_pad, ys_out_pad = add_sos_eos(ys_pad, self.sos, self.eos, self.ignore_id)
+        ys_in_lens = ys_pad_lens + 1
+
+        # 1. Forward decoder
+        decoder_out, _ = self.decoder(
+            encoder_out, encoder_out_lens, ys_in_pad, ys_in_lens
+        )
+
+        # 2. Compute attention loss
+        loss_att = self.criterion_att(decoder_out, ys_out_pad)
+        acc_att = th_accuracy(
+            decoder_out.view(-1, self.vocab_size),
+            ys_out_pad,
+            ignore_label=self.ignore_id,
+        )
+
+        # Compute cer/wer using attention-decoder
+        if self.training or self.error_calculator is None:
+            cer_att, wer_att = None, None
+        else:
+            ys_hat = decoder_out.argmax(dim=-1)
+            cer_att, wer_att = self.error_calculator(ys_hat.cpu(), ys_pad.cpu())
+
+        return loss_att, acc_att, cer_att, wer_att
+
+    def _calc_att_predictor_loss(
+        self,
+        encoder_out: torch.Tensor,
+        encoder_out_lens: torch.Tensor,
+        ys_pad: torch.Tensor,
+        ys_pad_lens: torch.Tensor,
+    ):
+        ys_in_pad, ys_out_pad = add_sos_eos(ys_pad, self.sos, self.eos, self.ignore_id)
+        ys_in_lens = ys_pad_lens + 1
+
+        encoder_out_mask = sequence_mask(encoder_out_lens, maxlen=encoder_out.size(1), dtype=encoder_out.dtype,
+                                         device=encoder_out.device)[:, None, :]
+        mask_chunk_predictor = None
+        if self.encoder.overlap_chunk_cls is not None:
+            mask_chunk_predictor = self.encoder.overlap_chunk_cls.get_mask_chunk_predictor(None,
+                                                                                           device=encoder_out.device,
+                                                                                           batch_size=encoder_out.size(
+                                                                                               0))
+            mask_shfit_chunk = self.encoder.overlap_chunk_cls.get_mask_shfit_chunk(None, device=encoder_out.device,
+                                                                                   batch_size=encoder_out.size(0))
+            encoder_out = encoder_out * mask_shfit_chunk
+        pre_acoustic_embeds, pre_token_length, pre_alphas, _ = self.predictor(encoder_out,
+                                                                              ys_out_pad,
+                                                                              encoder_out_mask,
+                                                                              ignore_id=self.ignore_id,
+                                                                              mask_chunk_predictor=mask_chunk_predictor,
+                                                                              target_label_length=ys_in_lens,
+                                                                              )
+        predictor_alignments, predictor_alignments_len = self.predictor.gen_frame_alignments(pre_alphas,
+                                                                                             encoder_out_lens)
+
+        scama_mask = None
+        if self.encoder.overlap_chunk_cls is not None and self.decoder_attention_chunk_type == 'chunk':
+            encoder_chunk_size = self.encoder.overlap_chunk_cls.chunk_size_pad_shift_cur
+            attention_chunk_center_bias = 0
+            attention_chunk_size = encoder_chunk_size
+            decoder_att_look_back_factor = self.encoder.overlap_chunk_cls.decoder_att_look_back_factor_cur
+            mask_shift_att_chunk_decoder = self.encoder.overlap_chunk_cls.get_mask_shift_att_chunk_decoder(None,
+                                                                                                           device=encoder_out.device,
+                                                                                                           batch_size=encoder_out.size(
+                                                                                                               0))
+            scama_mask = self.build_scama_mask_for_cross_attention_decoder_fn(
+                predictor_alignments=predictor_alignments,
+                encoder_sequence_length=encoder_out_lens,
+                chunk_size=1,
+                encoder_chunk_size=encoder_chunk_size,
+                attention_chunk_center_bias=attention_chunk_center_bias,
+                attention_chunk_size=attention_chunk_size,
+                attention_chunk_type=self.decoder_attention_chunk_type,
+                step=None,
+                predictor_mask_chunk_hopping=mask_chunk_predictor,
+                decoder_att_look_back_factor=decoder_att_look_back_factor,
+                mask_shift_att_chunk_decoder=mask_shift_att_chunk_decoder,
+                target_length=ys_in_lens,
+                is_training=self.training,
+            )
+        elif self.encoder.overlap_chunk_cls is not None:
+            encoder_out, encoder_out_lens = self.encoder.overlap_chunk_cls.remove_chunk(encoder_out, encoder_out_lens,
+                                                                                        chunk_outs=None)
+        # try:
+        # 1. Forward decoder
+        decoder_out, _ = self.decoder(
+            encoder_out,
+            encoder_out_lens,
+            ys_in_pad,
+            ys_in_lens,
+            chunk_mask=scama_mask,
+            pre_acoustic_embeds=pre_acoustic_embeds,
+
+        )
+
+        # 2. Compute attention loss
+        loss_att = self.criterion_att(decoder_out, ys_out_pad)
+        acc_att = th_accuracy(
+            decoder_out.view(-1, self.vocab_size),
+            ys_out_pad,
+            ignore_label=self.ignore_id,
+        )
+        # predictor loss
+        loss_pre = self.criterion_pre(ys_in_lens.type_as(pre_token_length), pre_token_length)
+        # Compute cer/wer using attention-decoder
+        if self.training or self.error_calculator is None:
+            cer_att, wer_att = None, None
+        else:
+            ys_hat = decoder_out.argmax(dim=-1)
+            cer_att, wer_att = self.error_calculator(ys_hat.cpu(), ys_pad.cpu())
+
+        return loss_att, acc_att, cer_att, wer_att, loss_pre
+
+    def _calc_att_predictor_loss2(
+        self,
+        encoder_out: torch.Tensor,
+        encoder_out_lens: torch.Tensor,
+        ys_pad: torch.Tensor,
+        ys_pad_lens: torch.Tensor,
+    ):
+        ys_in_pad, ys_out_pad = add_sos_eos(ys_pad, self.sos, self.eos, self.ignore_id)
+        ys_in_lens = ys_pad_lens + 1
+
+        encoder_out_mask = sequence_mask(encoder_out_lens, maxlen=encoder_out.size(1), dtype=encoder_out.dtype,
+                                         device=encoder_out.device)[:, None, :]
+        mask_chunk_predictor = None
+        if self.encoder2.overlap_chunk_cls is not None:
+            mask_chunk_predictor = self.encoder2.overlap_chunk_cls.get_mask_chunk_predictor(None,
+                                                                                            device=encoder_out.device,
+                                                                                            batch_size=encoder_out.size(
+                                                                                                0))
+            mask_shfit_chunk = self.encoder2.overlap_chunk_cls.get_mask_shfit_chunk(None, device=encoder_out.device,
+                                                                                    batch_size=encoder_out.size(0))
+            encoder_out = encoder_out * mask_shfit_chunk
+        pre_acoustic_embeds, pre_token_length, pre_alphas, _ = self.predictor2(encoder_out,
+                                                                               ys_out_pad,
+                                                                               encoder_out_mask,
+                                                                               ignore_id=self.ignore_id,
+                                                                               mask_chunk_predictor=mask_chunk_predictor,
+                                                                               target_label_length=ys_in_lens,
+                                                                               )
+        predictor_alignments, predictor_alignments_len = self.predictor2.gen_frame_alignments(pre_alphas,
+                                                                                              encoder_out_lens)
+
+        scama_mask = None
+        if self.encoder2.overlap_chunk_cls is not None and self.decoder_attention_chunk_type2 == 'chunk':
+            encoder_chunk_size = self.encoder2.overlap_chunk_cls.chunk_size_pad_shift_cur
+            attention_chunk_center_bias = 0
+            attention_chunk_size = encoder_chunk_size
+            decoder_att_look_back_factor = self.encoder2.overlap_chunk_cls.decoder_att_look_back_factor_cur
+            mask_shift_att_chunk_decoder = self.encoder2.overlap_chunk_cls.get_mask_shift_att_chunk_decoder(None,
+                                                                                                            device=encoder_out.device,
+                                                                                                            batch_size=encoder_out.size(
+                                                                                                                0))
+            scama_mask = self.build_scama_mask_for_cross_attention_decoder_fn2(
+                predictor_alignments=predictor_alignments,
+                encoder_sequence_length=encoder_out_lens,
+                chunk_size=1,
+                encoder_chunk_size=encoder_chunk_size,
+                attention_chunk_center_bias=attention_chunk_center_bias,
+                attention_chunk_size=attention_chunk_size,
+                attention_chunk_type=self.decoder_attention_chunk_type2,
+                step=None,
+                predictor_mask_chunk_hopping=mask_chunk_predictor,
+                decoder_att_look_back_factor=decoder_att_look_back_factor,
+                mask_shift_att_chunk_decoder=mask_shift_att_chunk_decoder,
+                target_length=ys_in_lens,
+                is_training=self.training,
+            )
+        elif self.encoder2.overlap_chunk_cls is not None:
+            encoder_out, encoder_out_lens = self.encoder2.overlap_chunk_cls.remove_chunk(encoder_out, encoder_out_lens,
+                                                                                         chunk_outs=None)
+        # try:
+        # 1. Forward decoder
+        decoder_out, _ = self.decoder2(
+            encoder_out,
+            encoder_out_lens,
+            ys_in_pad,
+            ys_in_lens,
+            chunk_mask=scama_mask,
+            pre_acoustic_embeds=pre_acoustic_embeds,
+        )
+
+        # 2. Compute attention loss
+        loss_att = self.criterion_att(decoder_out, ys_out_pad)
+        acc_att = th_accuracy(
+            decoder_out.view(-1, self.vocab_size),
+            ys_out_pad,
+            ignore_label=self.ignore_id,
+        )
+        # predictor loss
+        loss_pre = self.criterion_pre(ys_in_lens.type_as(pre_token_length), pre_token_length)
+        # Compute cer/wer using attention-decoder
+        if self.training or self.error_calculator is None:
+            cer_att, wer_att = None, None
+        else:
+            ys_hat = decoder_out.argmax(dim=-1)
+            cer_att, wer_att = self.error_calculator(ys_hat.cpu(), ys_pad.cpu())
+
+        return loss_att, acc_att, cer_att, wer_att, loss_pre
+
+    def calc_predictor_mask(
+        self,
+        encoder_out: torch.Tensor,
+        encoder_out_lens: torch.Tensor,
+        ys_pad: torch.Tensor = None,
+        ys_pad_lens: torch.Tensor = None,
+    ):
+        # ys_in_pad, ys_out_pad = add_sos_eos(ys_pad, self.sos, self.eos, self.ignore_id)
+        # ys_in_lens = ys_pad_lens + 1
+        ys_out_pad, ys_in_lens = None, None
+
+        encoder_out_mask = sequence_mask(encoder_out_lens, maxlen=encoder_out.size(1), dtype=encoder_out.dtype,
+                                         device=encoder_out.device)[:, None, :]
+        mask_chunk_predictor = None
+        if self.encoder.overlap_chunk_cls is not None:
+            mask_chunk_predictor = self.encoder.overlap_chunk_cls.get_mask_chunk_predictor(None,
+                                                                                           device=encoder_out.device,
+                                                                                           batch_size=encoder_out.size(
+                                                                                               0))
+            mask_shfit_chunk = self.encoder.overlap_chunk_cls.get_mask_shfit_chunk(None, device=encoder_out.device,
+                                                                                   batch_size=encoder_out.size(0))
+            encoder_out = encoder_out * mask_shfit_chunk
+        pre_acoustic_embeds, pre_token_length, pre_alphas, _ = self.predictor(encoder_out,
+                                                                              ys_out_pad,
+                                                                              encoder_out_mask,
+                                                                              ignore_id=self.ignore_id,
+                                                                              mask_chunk_predictor=mask_chunk_predictor,
+                                                                              target_label_length=ys_in_lens,
+                                                                              )
+        predictor_alignments, predictor_alignments_len = self.predictor.gen_frame_alignments(pre_alphas,
+                                                                                             encoder_out_lens)
+
+        scama_mask = None
+        if self.encoder.overlap_chunk_cls is not None and self.decoder_attention_chunk_type == 'chunk':
+            encoder_chunk_size = self.encoder.overlap_chunk_cls.chunk_size_pad_shift_cur
+            attention_chunk_center_bias = 0
+            attention_chunk_size = encoder_chunk_size
+            decoder_att_look_back_factor = self.encoder.overlap_chunk_cls.decoder_att_look_back_factor_cur
+            mask_shift_att_chunk_decoder = self.encoder.overlap_chunk_cls.get_mask_shift_att_chunk_decoder(None,
+                                                                                                           device=encoder_out.device,
+                                                                                                           batch_size=encoder_out.size(
+                                                                                                               0))
+            scama_mask = self.build_scama_mask_for_cross_attention_decoder_fn(
+                predictor_alignments=predictor_alignments,
+                encoder_sequence_length=encoder_out_lens,
+                chunk_size=1,
+                encoder_chunk_size=encoder_chunk_size,
+                attention_chunk_center_bias=attention_chunk_center_bias,
+                attention_chunk_size=attention_chunk_size,
+                attention_chunk_type=self.decoder_attention_chunk_type,
+                step=None,
+                predictor_mask_chunk_hopping=mask_chunk_predictor,
+                decoder_att_look_back_factor=decoder_att_look_back_factor,
+                mask_shift_att_chunk_decoder=mask_shift_att_chunk_decoder,
+                target_length=ys_in_lens,
+                is_training=self.training,
+            )
+        elif self.encoder.overlap_chunk_cls is not None:
+            encoder_out, encoder_out_lens = self.encoder.overlap_chunk_cls.remove_chunk(encoder_out, encoder_out_lens,
+                                                                                        chunk_outs=None)
+
+        return pre_acoustic_embeds, pre_token_length, predictor_alignments, predictor_alignments_len, scama_mask
+
+    def calc_predictor_mask2(
+        self,
+        encoder_out: torch.Tensor,
+        encoder_out_lens: torch.Tensor,
+        ys_pad: torch.Tensor = None,
+        ys_pad_lens: torch.Tensor = None,
+    ):
+        # ys_in_pad, ys_out_pad = add_sos_eos(ys_pad, self.sos, self.eos, self.ignore_id)
+        # ys_in_lens = ys_pad_lens + 1
+        ys_out_pad, ys_in_lens = None, None
+
+        encoder_out_mask = sequence_mask(encoder_out_lens, maxlen=encoder_out.size(1), dtype=encoder_out.dtype,
+                                         device=encoder_out.device)[:, None, :]
+        mask_chunk_predictor = None
+        if self.encoder2.overlap_chunk_cls is not None:
+            mask_chunk_predictor = self.encoder2.overlap_chunk_cls.get_mask_chunk_predictor(None,
+                                                                                            device=encoder_out.device,
+                                                                                            batch_size=encoder_out.size(
+                                                                                                0))
+            mask_shfit_chunk = self.encoder2.overlap_chunk_cls.get_mask_shfit_chunk(None, device=encoder_out.device,
+                                                                                    batch_size=encoder_out.size(0))
+            encoder_out = encoder_out * mask_shfit_chunk
+        pre_acoustic_embeds, pre_token_length, pre_alphas, _ = self.predictor2(encoder_out,
+                                                                               ys_out_pad,
+                                                                               encoder_out_mask,
+                                                                               ignore_id=self.ignore_id,
+                                                                               mask_chunk_predictor=mask_chunk_predictor,
+                                                                               target_label_length=ys_in_lens,
+                                                                               )
+        predictor_alignments, predictor_alignments_len = self.predictor2.gen_frame_alignments(pre_alphas,
+                                                                                              encoder_out_lens)
+
+        scama_mask = None
+        if self.encoder2.overlap_chunk_cls is not None and self.decoder_attention_chunk_type2 == 'chunk':
+            encoder_chunk_size = self.encoder2.overlap_chunk_cls.chunk_size_pad_shift_cur
+            attention_chunk_center_bias = 0
+            attention_chunk_size = encoder_chunk_size
+            decoder_att_look_back_factor = self.encoder2.overlap_chunk_cls.decoder_att_look_back_factor_cur
+            mask_shift_att_chunk_decoder = self.encoder2.overlap_chunk_cls.get_mask_shift_att_chunk_decoder(None,
+                                                                                                            device=encoder_out.device,
+                                                                                                            batch_size=encoder_out.size(
+                                                                                                                0))
+            scama_mask = self.build_scama_mask_for_cross_attention_decoder_fn2(
+                predictor_alignments=predictor_alignments,
+                encoder_sequence_length=encoder_out_lens,
+                chunk_size=1,
+                encoder_chunk_size=encoder_chunk_size,
+                attention_chunk_center_bias=attention_chunk_center_bias,
+                attention_chunk_size=attention_chunk_size,
+                attention_chunk_type=self.decoder_attention_chunk_type2,
+                step=None,
+                predictor_mask_chunk_hopping=mask_chunk_predictor,
+                decoder_att_look_back_factor=decoder_att_look_back_factor,
+                mask_shift_att_chunk_decoder=mask_shift_att_chunk_decoder,
+                target_length=ys_in_lens,
+                is_training=self.training,
+            )
+        elif self.encoder2.overlap_chunk_cls is not None:
+            encoder_out, encoder_out_lens = self.encoder2.overlap_chunk_cls.remove_chunk(encoder_out, encoder_out_lens,
+                                                                                         chunk_outs=None)
+
+        return pre_acoustic_embeds, pre_token_length, predictor_alignments, predictor_alignments_len, scama_mask
+
+    def _calc_ctc_loss(
+        self,
+        encoder_out: torch.Tensor,
+        encoder_out_lens: torch.Tensor,
+        ys_pad: torch.Tensor,
+        ys_pad_lens: torch.Tensor,
+    ):
+        # Calc CTC loss
+        loss_ctc = self.ctc(encoder_out, encoder_out_lens, ys_pad, ys_pad_lens)
+
+        # Calc CER using CTC
+        cer_ctc = None
+        if not self.training and self.error_calculator is not None:
+            ys_hat = self.ctc.argmax(encoder_out).data
+            cer_ctc = self.error_calculator(ys_hat.cpu(), ys_pad.cpu(), is_ctc=True)
+        return loss_ctc, cer_ctc
+
+    def _calc_ctc_loss2(
+        self,
+        encoder_out: torch.Tensor,
+        encoder_out_lens: torch.Tensor,
+        ys_pad: torch.Tensor,
+        ys_pad_lens: torch.Tensor,
+    ):
+        # Calc CTC loss
+        loss_ctc = self.ctc2(encoder_out, encoder_out_lens, ys_pad, ys_pad_lens)
+
+        # Calc CER using CTC
+        cer_ctc = None
+        if not self.training and self.error_calculator is not None:
+            ys_hat = self.ctc2.argmax(encoder_out).data
+            cer_ctc = self.error_calculator(ys_hat.cpu(), ys_pad.cpu(), is_ctc=True)
+        return loss_ctc, cer_ctc
+
diff --git a/funasr/models/encoder/__init__.py b/funasr/models/encoder/__init__.py
new file mode 100644
index 000000000..e69de29bb
diff --git a/funasr/models/encoder/abs_encoder.py b/funasr/models/encoder/abs_encoder.py
new file mode 100644
index 000000000..1fb7c97c3
--- /dev/null
+++ b/funasr/models/encoder/abs_encoder.py
@@ -0,0 +1,21 @@
+from abc import ABC
+from abc import abstractmethod
+from typing import Optional
+from typing import Tuple
+
+import torch
+
+
+class AbsEncoder(torch.nn.Module, ABC):
+    @abstractmethod
+    def output_size(self) -> int:
+        raise NotImplementedError
+
+    @abstractmethod
+    def forward(
+        self,
+        xs_pad: torch.Tensor,
+        ilens: torch.Tensor,
+        prev_states: torch.Tensor = None,
+    ) -> Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]]:
+        raise NotImplementedError
diff --git a/funasr/models/encoder/conformer_encoder.py b/funasr/models/encoder/conformer_encoder.py
new file mode 100644
index 000000000..2df2ba608
--- /dev/null
+++ b/funasr/models/encoder/conformer_encoder.py
@@ -0,0 +1,598 @@
+# Copyright 2020 Tomoki Hayashi
+#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)
+
+"""Conformer encoder definition."""
+
+import logging
+from typing import List
+from typing import Optional
+from typing import Tuple
+from typing import Union
+
+import torch
+from torch import nn
+from typeguard import check_argument_types
+
+from funasr.models.ctc import CTC
+from funasr.models.encoder.abs_encoder import AbsEncoder
+from funasr.modules.attention import (
+    MultiHeadedAttention,  # noqa: H301
+    RelPositionMultiHeadedAttention,  # noqa: H301
+    LegacyRelPositionMultiHeadedAttention,  # noqa: H301
+)
+from funasr.modules.embedding import (
+    PositionalEncoding,  # noqa: H301
+    ScaledPositionalEncoding,  # noqa: H301
+    RelPositionalEncoding,  # noqa: H301
+    LegacyRelPositionalEncoding,  # noqa: H301
+)
+from funasr.modules.layer_norm import LayerNorm
+from funasr.modules.multi_layer_conv import Conv1dLinear
+from funasr.modules.multi_layer_conv import MultiLayeredConv1d
+from funasr.modules.nets_utils import get_activation
+from funasr.modules.nets_utils import make_pad_mask
+from funasr.modules.positionwise_feed_forward import (
+    PositionwiseFeedForward,  # noqa: H301
+)
+from funasr.modules.repeat import repeat
+from funasr.modules.subsampling import Conv2dSubsampling
+from funasr.modules.subsampling import Conv2dSubsampling2
+from funasr.modules.subsampling import Conv2dSubsampling6
+from funasr.modules.subsampling import Conv2dSubsampling8
+from funasr.modules.subsampling import TooShortUttError
+from funasr.modules.subsampling import check_short_utt
+
+class ConvolutionModule(nn.Module):
+    """ConvolutionModule in Conformer model.
+
+    Args:
+        channels (int): The number of channels of conv layers.
+        kernel_size (int): Kernel size of conv layers.
+
+    """
+
+    def __init__(self, channels, kernel_size, activation=nn.ReLU(), bias=True):
+        """Construct a ConvolutionModule object."""
+        super(ConvolutionModule, self).__init__()
+        # kernel_size should be an odd number for 'SAME' padding
+        assert (kernel_size - 1) % 2 == 0
+
+        self.pointwise_conv1 = nn.Conv1d(
+            channels,
+            2 * channels,
+            kernel_size=1,
+            stride=1,
+            padding=0,
+            bias=bias,
+        )
+        self.depthwise_conv = nn.Conv1d(
+            channels,
+            channels,
+            kernel_size,
+            stride=1,
+            padding=(kernel_size - 1) // 2,
+            groups=channels,
+            bias=bias,
+        )
+        self.norm = nn.BatchNorm1d(channels)
+        self.pointwise_conv2 = nn.Conv1d(
+            channels,
+            channels,
+            kernel_size=1,
+            stride=1,
+            padding=0,
+            bias=bias,
+        )
+        self.activation = activation
+
+    def forward(self, x):
+        """Compute convolution module.
+
+        Args:
+            x (torch.Tensor): Input tensor (#batch, time, channels).
+
+        Returns:
+            torch.Tensor: Output tensor (#batch, time, channels).
+
+        """
+        # exchange the temporal dimension and the feature dimension
+        x = x.transpose(1, 2)
+
+        # GLU mechanism
+        x = self.pointwise_conv1(x)  # (batch, 2*channels, time)
+        x = nn.functional.glu(x, dim=1)  # (batch, channels, time)
+
+        # 1D Depthwise Conv
+        x = self.depthwise_conv(x)
+        x = self.activation(self.norm(x))
+
+        x = self.pointwise_conv2(x)
+
+        return x.transpose(1, 2)
+
+
+class EncoderLayer(nn.Module):
+    """Encoder layer module.
+
+    Args:
+        size (int): Input dimension.
+        self_attn (torch.nn.Module): Self-attention module instance.
+            `MultiHeadedAttention` or `RelPositionMultiHeadedAttention` instance
+            can be used as the argument.
+        feed_forward (torch.nn.Module): Feed-forward module instance.
+            `PositionwiseFeedForward`, `MultiLayeredConv1d`, or `Conv1dLinear` instance
+            can be used as the argument.
+        feed_forward_macaron (torch.nn.Module): Additional feed-forward module instance.
+            `PositionwiseFeedForward`, `MultiLayeredConv1d`, or `Conv1dLinear` instance
+            can be used as the argument.
+        conv_module (torch.nn.Module): Convolution module instance.
+            `ConvolutionModule` instance can be used as the argument.
+        dropout_rate (float): Dropout rate.
+        normalize_before (bool): Whether to use layer_norm before the first block.
+        concat_after (bool): Whether to concat attention layer's input and output.
+            if True, additional linear will be applied.
+            i.e. x -> x + linear(concat(x, att(x)))
+            if False, no additional linear will be applied. i.e. x -> x + att(x)
+        stochastic_depth_rate (float): Probability to skip this layer.
+            During training, the layer may skip residual computation and return input
+            as-is with given probability.
+    """
+
+    def __init__(
+            self,
+            size,
+            self_attn,
+            feed_forward,
+            feed_forward_macaron,
+            conv_module,
+            dropout_rate,
+            normalize_before=True,
+            concat_after=False,
+            stochastic_depth_rate=0.0,
+    ):
+        """Construct an EncoderLayer object."""
+        super(EncoderLayer, self).__init__()
+        self.self_attn = self_attn
+        self.feed_forward = feed_forward
+        self.feed_forward_macaron = feed_forward_macaron
+        self.conv_module = conv_module
+        self.norm_ff = LayerNorm(size)  # for the FFN module
+        self.norm_mha = LayerNorm(size)  # for the MHA module
+        if feed_forward_macaron is not None:
+            self.norm_ff_macaron = LayerNorm(size)
+            self.ff_scale = 0.5
+        else:
+            self.ff_scale = 1.0
+        if self.conv_module is not None:
+            self.norm_conv = LayerNorm(size)  # for the CNN module
+            self.norm_final = LayerNorm(size)  # for the final output of the block
+        self.dropout = nn.Dropout(dropout_rate)
+        self.size = size
+        self.normalize_before = normalize_before
+        self.concat_after = concat_after
+        if self.concat_after:
+            self.concat_linear = nn.Linear(size + size, size)
+        self.stochastic_depth_rate = stochastic_depth_rate
+
+    def forward(self, x_input, mask, cache=None):
+        """Compute encoded features.
+
+        Args:
+            x_input (Union[Tuple, torch.Tensor]): Input tensor w/ or w/o pos emb.
+                - w/ pos emb: Tuple of tensors [(#batch, time, size), (1, time, size)].
+                - w/o pos emb: Tensor (#batch, time, size).
+            mask (torch.Tensor): Mask tensor for the input (#batch, 1, time).
+            cache (torch.Tensor): Cache tensor of the input (#batch, time - 1, size).
+
+        Returns:
+            torch.Tensor: Output tensor (#batch, time, size).
+            torch.Tensor: Mask tensor (#batch, 1, time).
+
+        """
+        if isinstance(x_input, tuple):
+            x, pos_emb = x_input[0], x_input[1]
+        else:
+            x, pos_emb = x_input, None
+
+        skip_layer = False
+        # with stochastic depth, residual connection `x + f(x)` becomes
+        # `x <- x + 1 / (1 - p) * f(x)` at training time.
+        stoch_layer_coeff = 1.0
+        if self.training and self.stochastic_depth_rate > 0:
+            skip_layer = torch.rand(1).item() < self.stochastic_depth_rate
+            stoch_layer_coeff = 1.0 / (1 - self.stochastic_depth_rate)
+
+        if skip_layer:
+            if cache is not None:
+                x = torch.cat([cache, x], dim=1)
+            if pos_emb is not None:
+                return (x, pos_emb), mask
+            return x, mask
+
+        # whether to use macaron style
+        if self.feed_forward_macaron is not None:
+            residual = x
+            if self.normalize_before:
+                x = self.norm_ff_macaron(x)
+            x = residual + stoch_layer_coeff * self.ff_scale * self.dropout(
+                self.feed_forward_macaron(x)
+            )
+            if not self.normalize_before:
+                x = self.norm_ff_macaron(x)
+
+        # multi-headed self-attention module
+        residual = x
+        if self.normalize_before:
+            x = self.norm_mha(x)
+
+        if cache is None:
+            x_q = x
+        else:
+            assert cache.shape == (x.shape[0], x.shape[1] - 1, self.size)
+            x_q = x[:, -1:, :]
+            residual = residual[:, -1:, :]
+            mask = None if mask is None else mask[:, -1:, :]
+
+        if pos_emb is not None:
+            x_att = self.self_attn(x_q, x, x, pos_emb, mask)
+        else:
+            x_att = self.self_attn(x_q, x, x, mask)
+
+        if self.concat_after:
+            x_concat = torch.cat((x, x_att), dim=-1)
+            x = residual + stoch_layer_coeff * self.concat_linear(x_concat)
+        else:
+            x = residual + stoch_layer_coeff * self.dropout(x_att)
+        if not self.normalize_before:
+            x = self.norm_mha(x)
+
+        # convolution module
+        if self.conv_module is not None:
+            residual = x
+            if self.normalize_before:
+                x = self.norm_conv(x)
+            x = residual + stoch_layer_coeff * self.dropout(self.conv_module(x))
+            if not self.normalize_before:
+                x = self.norm_conv(x)
+
+        # feed forward module
+        residual = x
+        if self.normalize_before:
+            x = self.norm_ff(x)
+        x = residual + stoch_layer_coeff * self.ff_scale * self.dropout(
+            self.feed_forward(x)
+        )
+        if not self.normalize_before:
+            x = self.norm_ff(x)
+
+        if self.conv_module is not None:
+            x = self.norm_final(x)
+
+        if cache is not None:
+            x = torch.cat([cache, x], dim=1)
+
+        if pos_emb is not None:
+            return (x, pos_emb), mask
+
+        return x, mask
+
+
+class ConformerEncoder(AbsEncoder):
+    """Conformer encoder module.
+
+    Args:
+        input_size (int): Input dimension.
+        output_size (int): Dimension of attention.
+        attention_heads (int): The number of heads of multi head attention.
+        linear_units (int): The number of units of position-wise feed forward.
+        num_blocks (int): The number of encoder blocks.
+        dropout_rate (float): Dropout rate.
+        attention_dropout_rate (float): Dropout rate in attention.
+        positional_dropout_rate (float): Dropout rate after adding positional encoding.
+        input_layer (Union[str, torch.nn.Module]): Input layer type.
+        normalize_before (bool): Whether to use layer_norm before the first block.
+        concat_after (bool): Whether to concat attention layer's input and output.
+            If True, additional linear will be applied.
+            i.e. x -> x + linear(concat(x, att(x)))
+            If False, no additional linear will be applied. i.e. x -> x + att(x)
+        positionwise_layer_type (str): "linear", "conv1d", or "conv1d-linear".
+        positionwise_conv_kernel_size (int): Kernel size of positionwise conv1d layer.
+        rel_pos_type (str): Whether to use the latest relative positional encoding or
+            the legacy one. The legacy relative positional encoding will be deprecated
+            in the future. More details can be found in
+            https://github.com/espnet/espnet/pull/2816.
+        pos_enc_layer_type (str): Encoder positional encoding layer type.
+        selfattention_layer_type (str): Encoder attention layer type.
+        activation_type (str): Encoder activation function type.
+        macaron_style (bool): Whether to use macaron style for positionwise layer.
+        use_cnn_module (bool): Whether to use convolution module.
+        zero_triu (bool): Whether to zero the upper triangular part of attention matrix.
+        cnn_module_kernel (int): Kernel size of convolution module.
+        padding_idx (int): Padding idx for input_layer=embed.
+
+    """
+
+    def __init__(
+            self,
+            input_size: int,
+            output_size: int = 256,
+            attention_heads: int = 4,
+            linear_units: int = 2048,
+            num_blocks: int = 6,
+            dropout_rate: float = 0.1,
+            positional_dropout_rate: float = 0.1,
+            attention_dropout_rate: float = 0.0,
+            input_layer: str = "conv2d",
+            normalize_before: bool = True,
+            concat_after: bool = False,
+            positionwise_layer_type: str = "linear",
+            positionwise_conv_kernel_size: int = 3,
+            macaron_style: bool = False,
+            rel_pos_type: str = "legacy",
+            pos_enc_layer_type: str = "rel_pos",
+            selfattention_layer_type: str = "rel_selfattn",
+            activation_type: str = "swish",
+            use_cnn_module: bool = True,
+            zero_triu: bool = False,
+            cnn_module_kernel: int = 31,
+            padding_idx: int = -1,
+            interctc_layer_idx: List[int] = [],
+            interctc_use_conditioning: bool = False,
+            stochastic_depth_rate: Union[float, List[float]] = 0.0,
+    ):
+        assert check_argument_types()
+        super().__init__()
+        self._output_size = output_size
+
+        if rel_pos_type == "legacy":
+            if pos_enc_layer_type == "rel_pos":
+                pos_enc_layer_type = "legacy_rel_pos"
+            if selfattention_layer_type == "rel_selfattn":
+                selfattention_layer_type = "legacy_rel_selfattn"
+        elif rel_pos_type == "latest":
+            assert selfattention_layer_type != "legacy_rel_selfattn"
+            assert pos_enc_layer_type != "legacy_rel_pos"
+        else:
+            raise ValueError("unknown rel_pos_type: " + rel_pos_type)
+
+        activation = get_activation(activation_type)
+        if pos_enc_layer_type == "abs_pos":
+            pos_enc_class = PositionalEncoding
+        elif pos_enc_layer_type == "scaled_abs_pos":
+            pos_enc_class = ScaledPositionalEncoding
+        elif pos_enc_layer_type == "rel_pos":
+            assert selfattention_layer_type == "rel_selfattn"
+            pos_enc_class = RelPositionalEncoding
+        elif pos_enc_layer_type == "legacy_rel_pos":
+            assert selfattention_layer_type == "legacy_rel_selfattn"
+            pos_enc_class = LegacyRelPositionalEncoding
+            logging.warning(
+                "Using legacy_rel_pos and it will be deprecated in the future."
+            )
+        else:
+            raise ValueError("unknown pos_enc_layer: " + pos_enc_layer_type)
+
+        if input_layer == "linear":
+            self.embed = torch.nn.Sequential(
+                torch.nn.Linear(input_size, output_size),
+                torch.nn.LayerNorm(output_size),
+                torch.nn.Dropout(dropout_rate),
+                pos_enc_class(output_size, positional_dropout_rate),
+            )
+        elif input_layer == "conv2d":
+            self.embed = Conv2dSubsampling(
+                input_size,
+                output_size,
+                dropout_rate,
+                pos_enc_class(output_size, positional_dropout_rate),
+            )
+        elif input_layer == "conv2d2":
+            self.embed = Conv2dSubsampling2(
+                input_size,
+                output_size,
+                dropout_rate,
+                pos_enc_class(output_size, positional_dropout_rate),
+            )
+        elif input_layer == "conv2d6":
+            self.embed = Conv2dSubsampling6(
+                input_size,
+                output_size,
+                dropout_rate,
+                pos_enc_class(output_size, positional_dropout_rate),
+            )
+        elif input_layer == "conv2d8":
+            self.embed = Conv2dSubsampling8(
+                input_size,
+                output_size,
+                dropout_rate,
+                pos_enc_class(output_size, positional_dropout_rate),
+            )
+        elif input_layer == "embed":
+            self.embed = torch.nn.Sequential(
+                torch.nn.Embedding(input_size, output_size, padding_idx=padding_idx),
+                pos_enc_class(output_size, positional_dropout_rate),
+            )
+        elif isinstance(input_layer, torch.nn.Module):
+            self.embed = torch.nn.Sequential(
+                input_layer,
+                pos_enc_class(output_size, positional_dropout_rate),
+            )
+        elif input_layer is None:
+            self.embed = torch.nn.Sequential(
+                pos_enc_class(output_size, positional_dropout_rate)
+            )
+        else:
+            raise ValueError(f"unknown input_layer: {input_layer}")
+        self.normalize_before = normalize_before
+        if positionwise_layer_type == "linear":
+            positionwise_layer = PositionwiseFeedForward
+            positionwise_layer_args = (
+                output_size,
+                linear_units,
+                dropout_rate,
+                activation,
+            )
+        elif positionwise_layer_type == "conv1d":
+            positionwise_layer = MultiLayeredConv1d
+            positionwise_layer_args = (
+                output_size,
+                linear_units,
+                positionwise_conv_kernel_size,
+                dropout_rate,
+            )
+        elif positionwise_layer_type == "conv1d-linear":
+            positionwise_layer = Conv1dLinear
+            positionwise_layer_args = (
+                output_size,
+                linear_units,
+                positionwise_conv_kernel_size,
+                dropout_rate,
+            )
+        else:
+            raise NotImplementedError("Support only linear, conv1d, or conv1d-linear.")
+
+        if selfattention_layer_type == "selfattn":
+            encoder_selfattn_layer = MultiHeadedAttention
+            encoder_selfattn_layer_args = (
+                attention_heads,
+                output_size,
+                attention_dropout_rate,
+            )
+        elif selfattention_layer_type == "legacy_rel_selfattn":
+            assert pos_enc_layer_type == "legacy_rel_pos"
+            encoder_selfattn_layer = LegacyRelPositionMultiHeadedAttention
+            encoder_selfattn_layer_args = (
+                attention_heads,
+                output_size,
+                attention_dropout_rate,
+            )
+            logging.warning(
+                "Using legacy_rel_selfattn and it will be deprecated in the future."
+            )
+        elif selfattention_layer_type == "rel_selfattn":
+            assert pos_enc_layer_type == "rel_pos"
+            encoder_selfattn_layer = RelPositionMultiHeadedAttention
+            encoder_selfattn_layer_args = (
+                attention_heads,
+                output_size,
+                attention_dropout_rate,
+                zero_triu,
+            )
+        else:
+            raise ValueError("unknown encoder_attn_layer: " + selfattention_layer_type)
+
+        convolution_layer = ConvolutionModule
+        convolution_layer_args = (output_size, cnn_module_kernel, activation)
+
+        if isinstance(stochastic_depth_rate, float):
+            stochastic_depth_rate = [stochastic_depth_rate] * num_blocks
+
+        if len(stochastic_depth_rate) != num_blocks:
+            raise ValueError(
+                f"Length of stochastic_depth_rate ({len(stochastic_depth_rate)}) "
+                f"should be equal to num_blocks ({num_blocks})"
+            )
+
+        self.encoders = repeat(
+            num_blocks,
+            lambda lnum: EncoderLayer(
+                output_size,
+                encoder_selfattn_layer(*encoder_selfattn_layer_args),
+                positionwise_layer(*positionwise_layer_args),
+                positionwise_layer(*positionwise_layer_args) if macaron_style else None,
+                convolution_layer(*convolution_layer_args) if use_cnn_module else None,
+                dropout_rate,
+                normalize_before,
+                concat_after,
+                stochastic_depth_rate[lnum],
+            ),
+        )
+        if self.normalize_before:
+            self.after_norm = LayerNorm(output_size)
+
+        self.interctc_layer_idx = interctc_layer_idx
+        if len(interctc_layer_idx) > 0:
+            assert 0 < min(interctc_layer_idx) and max(interctc_layer_idx) < num_blocks
+        self.interctc_use_conditioning = interctc_use_conditioning
+        self.conditioning_layer = None
+
+    def output_size(self) -> int:
+        return self._output_size
+
+    def forward(
+            self,
+            xs_pad: torch.Tensor,
+            ilens: torch.Tensor,
+            prev_states: torch.Tensor = None,
+            ctc: CTC = None,
+    ) -> Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]]:
+        """Calculate forward propagation.
+
+        Args:
+            xs_pad (torch.Tensor): Input tensor (#batch, L, input_size).
+            ilens (torch.Tensor): Input length (#batch).
+            prev_states (torch.Tensor): Not to be used now.
+
+        Returns:
+            torch.Tensor: Output tensor (#batch, L, output_size).
+            torch.Tensor: Output length (#batch).
+            torch.Tensor: Not to be used now.
+
+        """
+        masks = (~make_pad_mask(ilens)[:, None, :]).to(xs_pad.device)
+
+        if (
+                isinstance(self.embed, Conv2dSubsampling)
+                or isinstance(self.embed, Conv2dSubsampling2)
+                or isinstance(self.embed, Conv2dSubsampling6)
+                or isinstance(self.embed, Conv2dSubsampling8)
+        ):
+            short_status, limit_size = check_short_utt(self.embed, xs_pad.size(1))
+            if short_status:
+                raise TooShortUttError(
+                    f"has {xs_pad.size(1)} frames and is too short for subsampling "
+                    + f"(it needs more than {limit_size} frames), return empty results",
+                    xs_pad.size(1),
+                    limit_size,
+                )
+            xs_pad, masks = self.embed(xs_pad, masks)
+        else:
+            xs_pad = self.embed(xs_pad)
+
+        intermediate_outs = []
+        if len(self.interctc_layer_idx) == 0:
+            xs_pad, masks = self.encoders(xs_pad, masks)
+        else:
+            for layer_idx, encoder_layer in enumerate(self.encoders):
+                xs_pad, masks = encoder_layer(xs_pad, masks)
+
+                if layer_idx + 1 in self.interctc_layer_idx:
+                    encoder_out = xs_pad
+                    if isinstance(encoder_out, tuple):
+                        encoder_out = encoder_out[0]
+
+                    # intermediate outputs are also normalized
+                    if self.normalize_before:
+                        encoder_out = self.after_norm(encoder_out)
+
+                    intermediate_outs.append((layer_idx + 1, encoder_out))
+
+                    if self.interctc_use_conditioning:
+                        ctc_out = ctc.softmax(encoder_out)
+
+                        if isinstance(xs_pad, tuple):
+                            x, pos_emb = xs_pad
+                            x = x + self.conditioning_layer(ctc_out)
+                            xs_pad = (x, pos_emb)
+                        else:
+                            xs_pad = xs_pad + self.conditioning_layer(ctc_out)
+
+        if isinstance(xs_pad, tuple):
+            xs_pad = xs_pad[0]
+        if self.normalize_before:
+            xs_pad = self.after_norm(xs_pad)
+
+        olens = masks.squeeze(1).sum(1)
+        if len(intermediate_outs) > 0:
+            return (xs_pad, intermediate_outs), olens, None
+        return xs_pad, olens, None
diff --git a/funasr/models/encoder/rnn_encoder.py b/funasr/models/encoder/rnn_encoder.py
new file mode 100644
index 000000000..7a3b05399
--- /dev/null
+++ b/funasr/models/encoder/rnn_encoder.py
@@ -0,0 +1,115 @@
+from typing import Optional
+from typing import Sequence
+from typing import Tuple
+
+import numpy as np
+import torch
+from typeguard import check_argument_types
+
+from funasr.modules.nets_utils import make_pad_mask
+from funasr.modules.rnn.encoders import RNN
+from funasr.modules.rnn.encoders import RNNP
+from funasr.models.encoder.abs_encoder import AbsEncoder
+
+
+class RNNEncoder(AbsEncoder):
+    """RNNEncoder class.
+
+    Args:
+        input_size: The number of expected features in the input
+        output_size: The number of output features
+        hidden_size: The number of hidden features
+        bidirectional: If ``True``, use a bidirectional RNN (LSTM or GRU)
+        use_projection: Use projection layer or not
+        num_layers: Number of recurrent layers
+        dropout: dropout probability
+
+    """
+
+    def __init__(
+        self,
+        input_size: int,
+        rnn_type: str = "lstm",
+        bidirectional: bool = True,
+        use_projection: bool = True,
+        num_layers: int = 4,
+        hidden_size: int = 320,
+        output_size: int = 320,
+        dropout: float = 0.0,
+        subsample: Optional[Sequence[int]] = (2, 2, 1, 1),
+    ):
+        assert check_argument_types()
+        super().__init__()
+        self._output_size = output_size
+        self.rnn_type = rnn_type
+        self.bidirectional = bidirectional
+        self.use_projection = use_projection
+
+        if rnn_type not in {"lstm", "gru"}:
+            raise ValueError(f"Not supported rnn_type={rnn_type}")
+
+        if subsample is None:
+            subsample = np.ones(num_layers + 1, dtype=int)
+        else:
+            subsample = subsample[:num_layers]
+            # Prepend 1 because only the second and later entries are used
+            subsample = np.pad(
+                np.array(subsample, dtype=int),
+                [1, num_layers - len(subsample)],
+                mode="constant",
+                constant_values=1,
+            )
+
+        rnn_type = ("b" if bidirectional else "") + rnn_type
+        if use_projection:
+            self.enc = torch.nn.ModuleList(
+                [
+                    RNNP(
+                        input_size,
+                        num_layers,
+                        hidden_size,
+                        output_size,
+                        subsample,
+                        dropout,
+                        typ=rnn_type,
+                    )
+                ]
+            )
+
+        else:
+            self.enc = torch.nn.ModuleList(
+                [
+                    RNN(
+                        input_size,
+                        num_layers,
+                        hidden_size,
+                        output_size,
+                        dropout,
+                        typ=rnn_type,
+                    )
+                ]
+            )
+
+    def output_size(self) -> int:
+        return self._output_size
+
+    def forward(
+        self,
+        xs_pad: torch.Tensor,
+        ilens: torch.Tensor,
+        prev_states: torch.Tensor = None,
+    ) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
+        if prev_states is None:
+            prev_states = [None] * len(self.enc)
+        assert len(prev_states) == len(self.enc)
+
+        current_states = []
+        for module, prev_state in zip(self.enc, prev_states):
+            xs_pad, ilens, states = module(xs_pad, ilens, prev_state=prev_state)
+            current_states.append(states)
+
+        if self.use_projection:
+            xs_pad.masked_fill_(make_pad_mask(ilens, xs_pad, 1), 0.0)
+        else:
+            xs_pad = xs_pad.masked_fill(make_pad_mask(ilens, xs_pad, 1), 0.0)
+        return xs_pad, ilens, current_states
diff --git a/funasr/models/encoder/sanm_encoder.py b/funasr/models/encoder/sanm_encoder.py
new file mode 100644
index 000000000..3d8079dd3
--- /dev/null
+++ b/funasr/models/encoder/sanm_encoder.py
@@ -0,0 +1,595 @@
+from typing import List
+from typing import Optional
+from typing import Sequence
+from typing import Tuple
+from typing import Union
+
+import torch
+import torch.nn as nn
+from funasr.modules.streaming_utils.chunk_utilis import overlap_chunk
+from typeguard import check_argument_types
+
+from funasr.modules.nets_utils import make_pad_mask
+from funasr.modules.attention import MultiHeadedAttention, MultiHeadedAttentionSANM
+from funasr.modules.embedding import SinusoidalPositionEncoder
+from funasr.modules.layer_norm import LayerNorm
+from funasr.modules.multi_layer_conv import Conv1dLinear
+from funasr.modules.multi_layer_conv import MultiLayeredConv1d
+from funasr.modules.positionwise_feed_forward import (
+    PositionwiseFeedForward,  # noqa: H301
+)
+from funasr.modules.repeat import repeat
+from funasr.modules.subsampling import Conv2dSubsampling
+from funasr.modules.subsampling import Conv2dSubsampling2
+from funasr.modules.subsampling import Conv2dSubsampling6
+from funasr.modules.subsampling import Conv2dSubsampling8
+from funasr.modules.subsampling import TooShortUttError
+from funasr.modules.subsampling import check_short_utt
+from funasr.models.ctc import CTC
+from funasr.models.encoder.abs_encoder import AbsEncoder
+
+class EncoderLayerSANM(nn.Module):
+    def __init__(
+        self,
+        in_size,
+        size,
+        self_attn,
+        feed_forward,
+        dropout_rate,
+        normalize_before=True,
+        concat_after=False,
+        stochastic_depth_rate=0.0,
+    ):
+        """Construct an EncoderLayerSANM object."""
+        super(EncoderLayerSANM, self).__init__()
+        self.self_attn = self_attn
+        self.feed_forward = feed_forward
+        self.norm1 = LayerNorm(in_size)
+        self.norm2 = LayerNorm(size)
+        self.dropout = nn.Dropout(dropout_rate)
+        self.in_size = in_size
+        self.size = size
+        self.normalize_before = normalize_before
+        self.concat_after = concat_after
+        if self.concat_after:
+            self.concat_linear = nn.Linear(size + size, size)
+        self.stochastic_depth_rate = stochastic_depth_rate
+        self.dropout_rate = dropout_rate
+
+    def forward(self, x, mask, cache=None, mask_shfit_chunk=None, mask_att_chunk_encoder=None):
+        """Compute encoded features.
+
+        Args:
+            x (torch.Tensor): Input tensor (#batch, time, in_size).
+            mask (torch.Tensor): Mask tensor for the input (#batch, time).
+            cache (torch.Tensor): Cache tensor of the input (#batch, time - 1, size).
+
+        Returns:
+            torch.Tensor: Output tensor (#batch, time, size).
+            torch.Tensor: Mask tensor (#batch, time), followed by the unchanged cache and chunk masks.
+
+        """
+        skip_layer = False
+        # with stochastic depth, residual connection `x + f(x)` becomes
+        # `x <- x + 1 / (1 - p) * f(x)` at training time.
+        stoch_layer_coeff = 1.0
+        if self.training and self.stochastic_depth_rate > 0:
+            skip_layer = torch.rand(1).item() < self.stochastic_depth_rate
+            stoch_layer_coeff = 1.0 / (1 - self.stochastic_depth_rate)
+
+        if skip_layer:
+            if cache is not None:
+                x = torch.cat([cache, x], dim=1)
+            return x, mask, cache, mask_shfit_chunk, mask_att_chunk_encoder
+
+        residual = x
+        if self.normalize_before:
+            x = self.norm1(x)
+
+        if self.concat_after:
+            x_concat = torch.cat((x, self.self_attn(x, mask, mask_shfit_chunk=mask_shfit_chunk, mask_att_chunk_encoder=mask_att_chunk_encoder)), dim=-1)
+            if self.in_size == self.size:
+                x = residual + stoch_layer_coeff * self.concat_linear(x_concat)
+            else:
+                x = stoch_layer_coeff * self.concat_linear(x_concat)
+        else:
+            if self.in_size == self.size:
+                x = residual + stoch_layer_coeff * self.dropout(
+                    self.self_attn(x, mask, mask_shfit_chunk=mask_shfit_chunk, mask_att_chunk_encoder=mask_att_chunk_encoder)
+                )
+            else:
+                x = stoch_layer_coeff * self.dropout(
+                    self.self_attn(x, mask, mask_shfit_chunk=mask_shfit_chunk, mask_att_chunk_encoder=mask_att_chunk_encoder)
+                )
+        if not self.normalize_before:
+            x = self.norm1(x)
+
+        residual = x
+        if self.normalize_before:
+            x = self.norm2(x)
+        x = residual + stoch_layer_coeff * self.dropout(self.feed_forward(x))
+        if not self.normalize_before:
+            x = self.norm2(x)
+
+
+        return x, mask, cache, mask_shfit_chunk, mask_att_chunk_encoder
+
+class SANMEncoder(AbsEncoder):
+    """
+    author: Speech Lab, Alibaba Group, China
+    SAN-M: Memory equipped self-attention for end-to-end speech recognition
+    https://arxiv.org/abs/2006.01713
+
+    """
+
+    def __init__(
+        self,
+        input_size: int,
+        output_size: int = 256,
+        attention_heads: int = 4,
+        linear_units: int = 2048,
+        num_blocks: int = 6,
+        dropout_rate: float = 0.1,
+        positional_dropout_rate: float = 0.1,
+        attention_dropout_rate: float = 0.0,
+        input_layer: Optional[str] = "conv2d",
+        pos_enc_class=SinusoidalPositionEncoder,
+        normalize_before: bool = True,
+        concat_after: bool = False,
+        positionwise_layer_type: str = "linear",
+        positionwise_conv_kernel_size: int = 1,
+        padding_idx: int = -1,
+        interctc_layer_idx: List[int] = [],
+        interctc_use_conditioning: bool = False,
+        kernel_size: int = 11,
+        sanm_shfit: int = 0,
+        selfattention_layer_type: str = "sanm",
+    ):
+        assert check_argument_types()
+        super().__init__()
+        self._output_size = output_size
+
+        if input_layer == "linear":
+            self.embed = torch.nn.Sequential(
+                torch.nn.Linear(input_size, output_size),
+                torch.nn.LayerNorm(output_size),
+                torch.nn.Dropout(dropout_rate),
+                torch.nn.ReLU(),
+                pos_enc_class(output_size, positional_dropout_rate),
+            )
+        elif input_layer == "conv2d":
+            self.embed = Conv2dSubsampling(input_size, output_size, dropout_rate)
+        elif input_layer == "conv2d2":
+            self.embed = Conv2dSubsampling2(input_size, output_size, dropout_rate)
+        elif input_layer == "conv2d6":
+            self.embed = Conv2dSubsampling6(input_size, output_size, dropout_rate)
+        elif input_layer == "conv2d8":
+            self.embed = Conv2dSubsampling8(input_size, output_size, dropout_rate)
+        elif input_layer == "embed":
+            self.embed = torch.nn.Sequential(
+                torch.nn.Embedding(input_size, output_size, padding_idx=padding_idx),
+                pos_enc_class(output_size, positional_dropout_rate),
+            )
+        elif input_layer is None:
+            if input_size == output_size:
+                self.embed = None
+            else:
+                self.embed = torch.nn.Linear(input_size, output_size)
+        elif input_layer == "pe":
+            self.embed = SinusoidalPositionEncoder()
+        else:
+            raise ValueError("unknown input_layer: " + input_layer)
+        self.normalize_before = normalize_before
+        if positionwise_layer_type == "linear":
+            positionwise_layer = PositionwiseFeedForward
+            positionwise_layer_args = (
+                output_size,
+                linear_units,
+                dropout_rate,
+            )
+        elif positionwise_layer_type == "conv1d":
+            positionwise_layer = MultiLayeredConv1d
+            positionwise_layer_args = (
+                output_size,
+                linear_units,
+                positionwise_conv_kernel_size,
+                dropout_rate,
+            )
+        elif positionwise_layer_type == "conv1d-linear":
+            positionwise_layer = Conv1dLinear
+            positionwise_layer_args = (
+                output_size,
+                linear_units,
+                positionwise_conv_kernel_size,
+                dropout_rate,
+            )
+        else:
+            raise NotImplementedError("Support only linear, conv1d, or conv1d-linear.")
+
+        if selfattention_layer_type == "selfattn":
+            encoder_selfattn_layer = MultiHeadedAttention
+            encoder_selfattn_layer_args = (
+                attention_heads,
+                output_size,
+                attention_dropout_rate,
+            )
+
+        elif selfattention_layer_type == "sanm":
+            encoder_selfattn_layer = MultiHeadedAttentionSANM
+            encoder_selfattn_layer_args0 = (
+                attention_heads,
+                input_size,
+                output_size,
+                attention_dropout_rate,
+                kernel_size,
+                sanm_shfit,
+            )
+
+            encoder_selfattn_layer_args = (
+                attention_heads,
+                output_size,
+                output_size,
+                attention_dropout_rate,
+                kernel_size,
+                sanm_shfit,
+            )
+        self.encoders0 = repeat(
+            1,
+            lambda lnum: EncoderLayerSANM(
+                input_size,
+                output_size,
+                encoder_selfattn_layer(*encoder_selfattn_layer_args0),
+                positionwise_layer(*positionwise_layer_args),
+                dropout_rate,
+                normalize_before,
+                concat_after,
+            ),
+        )
+
+        self.encoders = repeat(
+            num_blocks - 1,
+            lambda lnum: EncoderLayerSANM(
+                output_size,
+                output_size,
+                encoder_selfattn_layer(*encoder_selfattn_layer_args),
+                positionwise_layer(*positionwise_layer_args),
+                dropout_rate,
+                normalize_before,
+                concat_after,
+            ),
+        )
+        if self.normalize_before:
+            self.after_norm = LayerNorm(output_size)
+
+        self.interctc_layer_idx = interctc_layer_idx
+        if len(interctc_layer_idx) > 0:
+            assert 0 < min(interctc_layer_idx) and max(interctc_layer_idx) < num_blocks
+        self.interctc_use_conditioning = interctc_use_conditioning
+        self.conditioning_layer = None
+        self.dropout = nn.Dropout(dropout_rate)
+
+    def output_size(self) -> int:
+        return self._output_size
+
+    def forward(
+        self,
+        xs_pad: torch.Tensor,
+        ilens: torch.Tensor,
+        prev_states: torch.Tensor = None,
+        ctc: CTC = None,
+    ) -> Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]]:
+        """Encode a padded batch of input features.
+
+        Args:
+            xs_pad: input tensor (B, L, D)
+            ilens: input length (B)
+            prev_states: Not to be used now.
+        Returns:
+            encoded tensor (or a tuple with intermediate outputs), output lengths (B), and None
+        """
+        masks = (~make_pad_mask(ilens)[:, None, :]).to(xs_pad.device)
+        xs_pad = xs_pad * self.output_size()**0.5
+        if self.embed is None:
+            pass
+        elif (
+            isinstance(self.embed, Conv2dSubsampling)
+            or isinstance(self.embed, Conv2dSubsampling2)
+            or isinstance(self.embed, Conv2dSubsampling6)
+            or isinstance(self.embed, Conv2dSubsampling8)
+        ):
+            short_status, limit_size = check_short_utt(self.embed, xs_pad.size(1))
+            if short_status:
+                raise TooShortUttError(
+                    f"has {xs_pad.size(1)} frames and is too short for subsampling "
+                    + f"(it needs more than {limit_size} frames), return empty results",
+                    xs_pad.size(1),
+                    limit_size,
+                )
+            xs_pad, masks = self.embed(xs_pad, masks)
+        else:
+            xs_pad = self.embed(xs_pad)
+
+        # xs_pad = self.dropout(xs_pad)
+        encoder_outs = self.encoders0(xs_pad, masks)
+        xs_pad, masks = encoder_outs[0], encoder_outs[1]
+        intermediate_outs = []
+        if len(self.interctc_layer_idx) == 0:
+            encoder_outs = self.encoders(xs_pad, masks)
+            xs_pad, masks = encoder_outs[0], encoder_outs[1]
+        else:
+            for layer_idx, encoder_layer in enumerate(self.encoders):
+                encoder_outs = encoder_layer(xs_pad, masks)
+                xs_pad, masks = encoder_outs[0], encoder_outs[1]
+
+                if layer_idx + 1 in self.interctc_layer_idx:
+                    encoder_out = xs_pad
+
+                    # intermediate outputs are also normalized
+                    if self.normalize_before:
+                        encoder_out = self.after_norm(encoder_out)
+
+                    intermediate_outs.append((layer_idx + 1, encoder_out))
+
+                    if self.interctc_use_conditioning:
+                        ctc_out = ctc.softmax(encoder_out)
+                        xs_pad = xs_pad + self.conditioning_layer(ctc_out)
+
+        if self.normalize_before:
+            xs_pad = self.after_norm(xs_pad)
+
+        olens = masks.squeeze(1).sum(1)
+        if len(intermediate_outs) > 0:
+            return (xs_pad, intermediate_outs), olens, None
+        return xs_pad, olens, None
+
+
+class SANMEncoderChunkOpt(AbsEncoder):
+    """
+    author: Speech Lab, Alibaba Group, China
+    SCAMA: Streaming chunk-aware multihead attention for online end-to-end speech recognition
+    https://arxiv.org/abs/2006.01712
+
+    """
+
+    def __init__(
+            self,
+            input_size: int,
+            output_size: int = 256,
+            attention_heads: int = 4,
+            linear_units: int = 2048,
+            num_blocks: int = 6,
+            dropout_rate: float = 0.1,
+            positional_dropout_rate: float = 0.1,
+            attention_dropout_rate: float = 0.0,
+            input_layer: Optional[str] = "conv2d",
+            pos_enc_class=SinusoidalPositionEncoder,
+            normalize_before: bool = True,
+            concat_after: bool = False,
+            positionwise_layer_type: str = "linear",
+            positionwise_conv_kernel_size: int = 1,
+            padding_idx: int = -1,
+            interctc_layer_idx: List[int] = [],
+            interctc_use_conditioning: bool = False,
+            kernel_size: int = 11,
+            sanm_shfit: int = 0,
+            selfattention_layer_type: str = "sanm",
+            chunk_size: Union[int, Sequence[int]] = (16,),
+            stride: Union[int, Sequence[int]] = (10,),
+            pad_left: Union[int, Sequence[int]] = (0,),
+            encoder_att_look_back_factor: Union[int, Sequence[int]] = (1,),
+            decoder_att_look_back_factor: Union[int, Sequence[int]] = (1,),
+    ):
+        assert check_argument_types()
+        super().__init__()
+        self._output_size = output_size
+
+        if input_layer == "linear":
+            self.embed = torch.nn.Sequential(
+                torch.nn.Linear(input_size, output_size),
+                torch.nn.LayerNorm(output_size),
+                torch.nn.Dropout(dropout_rate),
+                torch.nn.ReLU(),
+                pos_enc_class(output_size, positional_dropout_rate),
+            )
+        elif input_layer == "conv2d":
+            self.embed = Conv2dSubsampling(input_size, output_size, dropout_rate)
+        elif input_layer == "conv2d2":
+            self.embed = Conv2dSubsampling2(input_size, output_size, dropout_rate)
+        elif input_layer == "conv2d6":
+            self.embed = Conv2dSubsampling6(input_size, output_size, dropout_rate)
+        elif input_layer == "conv2d8":
+            self.embed = Conv2dSubsampling8(input_size, output_size, dropout_rate)
+        elif input_layer == "embed":
+            self.embed = torch.nn.Sequential(
+                torch.nn.Embedding(input_size, output_size, padding_idx=padding_idx),
+                pos_enc_class(output_size, positional_dropout_rate),
+            )
+        elif input_layer is None:
+            if input_size == output_size:
+                self.embed = None
+            else:
+                self.embed = torch.nn.Linear(input_size, output_size)
+        elif input_layer == "pe":
+            self.embed = SinusoidalPositionEncoder()
+        else:
+            raise ValueError("unknown input_layer: " + input_layer)
+        self.normalize_before = normalize_before
+        if positionwise_layer_type == "linear":
+            positionwise_layer = PositionwiseFeedForward
+            positionwise_layer_args = (
+                output_size,
+                linear_units,
+                dropout_rate,
+            )
+        elif positionwise_layer_type == "conv1d":
+            positionwise_layer = MultiLayeredConv1d
+            positionwise_layer_args = (
+                output_size,
+                linear_units,
+                positionwise_conv_kernel_size,
+                dropout_rate,
+            )
+        elif positionwise_layer_type == "conv1d-linear":
+            positionwise_layer = Conv1dLinear
+            positionwise_layer_args = (
+                output_size,
+                linear_units,
+                positionwise_conv_kernel_size,
+                dropout_rate,
+            )
+        else:
+            raise NotImplementedError("Support only linear, conv1d, or conv1d-linear.")
+
+        if selfattention_layer_type == "selfattn":
+            encoder_selfattn_layer = MultiHeadedAttention
+            encoder_selfattn_layer_args = (
+                attention_heads,
+                output_size,
+                attention_dropout_rate,
+            )
+        elif selfattention_layer_type == "sanm":
+            encoder_selfattn_layer = MultiHeadedAttentionSANM
+            encoder_selfattn_layer_args0 = (
+                attention_heads,
+                input_size,
+                output_size,
+                attention_dropout_rate,
+                kernel_size,
+                sanm_shfit,
+            )
+
+            encoder_selfattn_layer_args = (
+                attention_heads,
+                output_size,
+                output_size,
+                attention_dropout_rate,
+                kernel_size,
+                sanm_shfit,
+            )
+        self.encoders0 = repeat(
+            1,
+            lambda lnum: EncoderLayerSANM(
+                input_size,
+                output_size,
+                encoder_selfattn_layer(*encoder_selfattn_layer_args0),
+                positionwise_layer(*positionwise_layer_args),
+                dropout_rate,
+                normalize_before,
+                concat_after,
+            ),
+        )
+
+        self.encoders = repeat(
+            num_blocks - 1,
+            lambda lnum: EncoderLayerSANM(
+                output_size,
+                output_size,
+                encoder_selfattn_layer(*encoder_selfattn_layer_args),
+                positionwise_layer(*positionwise_layer_args),
+                dropout_rate,
+                normalize_before,
+                concat_after,
+            ),
+        )
+        if self.normalize_before:
+            self.after_norm = LayerNorm(output_size)
+
+        self.interctc_layer_idx = interctc_layer_idx
+        if len(interctc_layer_idx) > 0:
+            assert 0 < min(interctc_layer_idx) and max(interctc_layer_idx) < num_blocks
+        self.interctc_use_conditioning = interctc_use_conditioning
+        self.conditioning_layer = None
+        shfit_fsmn = (kernel_size - 1) // 2
+        self.overlap_chunk_cls = overlap_chunk(
+            chunk_size=chunk_size,
+            stride=stride,
+            pad_left=pad_left,
+            shfit_fsmn=shfit_fsmn,
+            encoder_att_look_back_factor=encoder_att_look_back_factor,
+            decoder_att_look_back_factor=decoder_att_look_back_factor,
+        )
+
+    def output_size(self) -> int:
+        return self._output_size
+
+    def forward(
+            self,
+            xs_pad: torch.Tensor,
+            ilens: torch.Tensor,
+            prev_states: torch.Tensor = None,
+            ctc: CTC = None,
+            ind: int = 0,
+    ) -> Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]]:
+        """Encode a padded batch of input features chunk by chunk.
+
+        Args:
+            xs_pad: input tensor (B, L, D)
+            ilens: input length (B)
+            prev_states: Not to be used now.
+        Returns:
+            encoded tensor (or a tuple with intermediate outputs), output lengths (B), and None
+        """
+        masks = (~make_pad_mask(ilens)[:, None, :]).to(xs_pad.device)
+        xs_pad = xs_pad * self.output_size() ** 0.5
+        if self.embed is None:
+            pass
+        elif (
+                isinstance(self.embed, Conv2dSubsampling)
+                or isinstance(self.embed, Conv2dSubsampling2)
+                or isinstance(self.embed, Conv2dSubsampling6)
+                or isinstance(self.embed, Conv2dSubsampling8)
+        ):
+            short_status, limit_size = check_short_utt(self.embed, xs_pad.size(1))
+            if short_status:
+                raise TooShortUttError(
+                    f"has {xs_pad.size(1)} frames and is too short for subsampling "
+                    + f"(it needs more than {limit_size} frames), return empty results",
+                    xs_pad.size(1),
+                    limit_size,
+                )
+            xs_pad, masks = self.embed(xs_pad, masks)
+        else:
+            xs_pad = self.embed(xs_pad)
+
+        mask_shfit_chunk, mask_att_chunk_encoder = None, None
+        if self.overlap_chunk_cls is not None:
+            ilens = masks.squeeze(1).sum(1)
+            chunk_outs = self.overlap_chunk_cls.gen_chunk_mask(ilens, ind)
+            xs_pad, ilens = self.overlap_chunk_cls.split_chunk(xs_pad, ilens, chunk_outs=chunk_outs)
+            masks = (~make_pad_mask(ilens)[:, None, :]).to(xs_pad.device)
+            mask_shfit_chunk = self.overlap_chunk_cls.get_mask_shfit_chunk(chunk_outs, xs_pad.device, xs_pad.size(0),
+                                                                           dtype=xs_pad.dtype)
+            mask_att_chunk_encoder = self.overlap_chunk_cls.get_mask_att_chunk_encoder(chunk_outs, xs_pad.device,
+                                                                                       xs_pad.size(0),
+                                                                                       dtype=xs_pad.dtype)
+
+        encoder_outs = self.encoders0(xs_pad, masks, None, mask_shfit_chunk, mask_att_chunk_encoder)
+        xs_pad, masks = encoder_outs[0], encoder_outs[1]
+        intermediate_outs = []
+        if len(self.interctc_layer_idx) == 0:
+            encoder_outs = self.encoders(xs_pad, masks, None, mask_shfit_chunk, mask_att_chunk_encoder)
+            xs_pad, masks = encoder_outs[0], encoder_outs[1]
+        else:
+            for layer_idx, encoder_layer in enumerate(self.encoders):
+                encoder_outs = encoder_layer(xs_pad, masks, None, mask_shfit_chunk, mask_att_chunk_encoder)
+                xs_pad, masks = encoder_outs[0], encoder_outs[1]
+                if layer_idx + 1 in self.interctc_layer_idx:
+                    encoder_out = xs_pad
+
+                    # intermediate outputs are also normalized
+                    if self.normalize_before:
+                        encoder_out = self.after_norm(encoder_out)
+
+                    intermediate_outs.append((layer_idx + 1, encoder_out))
+
+                    if self.interctc_use_conditioning:
+                        ctc_out = ctc.softmax(encoder_out)
+                        xs_pad = xs_pad + self.conditioning_layer(ctc_out)
+
+        if self.normalize_before:
+            xs_pad = self.after_norm(xs_pad)
+
+        olens = masks.squeeze(1).sum(1)
+        if len(intermediate_outs) > 0:
+            return (xs_pad, intermediate_outs), olens, None
+        return xs_pad, olens, None
diff --git a/funasr/models/encoder/transformer_encoder.py b/funasr/models/encoder/transformer_encoder.py
new file mode 100644
index 000000000..ff9c3db51
--- /dev/null
+++ b/funasr/models/encoder/transformer_encoder.py
@@ -0,0 +1,684 @@
+# Copyright 2019 Shigeki Karita
+#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)
+
+"""Transformer encoder definition."""
+
+from typing import List
+from typing import Optional
+from typing import Tuple
+
+import torch
+from torch import nn
+from typeguard import check_argument_types
+import logging
+
+from funasr.models.ctc import CTC
+from funasr.models.encoder.abs_encoder import AbsEncoder
+from funasr.modules.attention import MultiHeadedAttention
+from funasr.modules.embedding import PositionalEncoding
+from funasr.modules.layer_norm import LayerNorm
+from funasr.modules.multi_layer_conv import Conv1dLinear
+from funasr.modules.multi_layer_conv import MultiLayeredConv1d
+from funasr.modules.nets_utils import make_pad_mask
+from funasr.modules.positionwise_feed_forward import (
+    PositionwiseFeedForward,  # noqa: H301
+)
+from funasr.modules.repeat import repeat
+from funasr.modules.nets_utils import rename_state_dict
+from funasr.modules.dynamic_conv import DynamicConvolution
+from funasr.modules.dynamic_conv2d import DynamicConvolution2D
+from funasr.modules.lightconv import LightweightConvolution
+from funasr.modules.lightconv2d import LightweightConvolution2D
+from funasr.modules.subsampling import Conv2dSubsampling
+from funasr.modules.subsampling import Conv2dSubsampling2
+from funasr.modules.subsampling import Conv2dSubsampling6
+from funasr.modules.subsampling import Conv2dSubsampling8
+from funasr.modules.subsampling import TooShortUttError
+from funasr.modules.subsampling import check_short_utt
+
+
+class EncoderLayer(nn.Module):
+    """Encoder layer module.
+
+    Args:
+        size (int): Input dimension.
+        self_attn (torch.nn.Module): Self-attention module instance.
+            `MultiHeadedAttention` or `RelPositionMultiHeadedAttention` instance
+            can be used as the argument.
+        feed_forward (torch.nn.Module): Feed-forward module instance.
+            `PositionwiseFeedForward`, `MultiLayeredConv1d`, or `Conv1dLinear` instance
+            can be used as the argument.
+        dropout_rate (float): Dropout rate.
+        normalize_before (bool): Whether to use layer_norm before the first block.
+        concat_after (bool): Whether to concat attention layer's input and output.
+            if True, additional linear will be applied.
+            i.e. x -> x + linear(concat(x, att(x)))
+            if False, no additional linear will be applied. i.e. x -> x + att(x)
+        stochastic_depth_rate (float): Probability to skip this layer.
+            During training, the layer may skip residual computation and return input
+            as-is with given probability.
+    """
+
+    def __init__(
+            self,
+            size,
+            self_attn,
+            feed_forward,
+            dropout_rate,
+            normalize_before=True,
+            concat_after=False,
+            stochastic_depth_rate=0.0,
+    ):
+        """Construct an EncoderLayer object."""
+        super(EncoderLayer, self).__init__()
+        self.self_attn = self_attn
+        self.feed_forward = feed_forward
+        self.norm1 = LayerNorm(size)
+        self.norm2 = LayerNorm(size)
+        self.dropout = nn.Dropout(dropout_rate)
+        self.size = size
+        self.normalize_before = normalize_before
+        self.concat_after = concat_after
+        if self.concat_after:
+            self.concat_linear = nn.Linear(size + size, size)
+        self.stochastic_depth_rate = stochastic_depth_rate
+
+    def forward(self, x, mask, cache=None):
+        """Compute encoded features.
+
+        Args:
+            x (torch.Tensor): Input tensor (#batch, time, size).
+            mask (torch.Tensor): Mask tensor for the input (#batch, time).
+            cache (torch.Tensor): Cache tensor of the input (#batch, time - 1, size).
+
+        Returns:
+            torch.Tensor: Output tensor (#batch, time, size).
+            torch.Tensor: Mask tensor (#batch, time).
+
+        """
+        skip_layer = False
+        # with stochastic depth, residual connection `x + f(x)` becomes
+        # `x <- x + 1 / (1 - p) * f(x)` at training time.
+        stoch_layer_coeff = 1.0
+        if self.training and self.stochastic_depth_rate > 0:
+            skip_layer = torch.rand(1).item() < self.stochastic_depth_rate
+            stoch_layer_coeff = 1.0 / (1 - self.stochastic_depth_rate)
+
+        if skip_layer:
+            if cache is not None:
+                x = torch.cat([cache, x], dim=1)
+            return x, mask
+
+        residual = x
+        if self.normalize_before:
+            x = self.norm1(x)
+
+        if cache is None:
+            x_q = x
+        else:
+            assert cache.shape == (x.shape[0], x.shape[1] - 1, self.size)
+            x_q = x[:, -1:, :]
+            residual = residual[:, -1:, :]
+            mask = None if mask is None else mask[:, -1:, :]
+
+        if self.concat_after:
+            x_concat = torch.cat((x, self.self_attn(x_q, x, x, mask)), dim=-1)
+            x = residual + stoch_layer_coeff * self.concat_linear(x_concat)
+        else:
+            x = residual + stoch_layer_coeff * self.dropout(
+                self.self_attn(x_q, x, x, mask)
+            )
+        if not self.normalize_before:
+            x = self.norm1(x)
+
+        residual = x
+        if self.normalize_before:
+            x = self.norm2(x)
+        x = residual + stoch_layer_coeff * self.dropout(self.feed_forward(x))
+        if not self.normalize_before:
+            x = self.norm2(x)
+
+        if cache is not None:
+            x = torch.cat([cache, x], dim=1)
+
+        return x, mask
+
+
+class TransformerEncoder(AbsEncoder):
+    """Transformer encoder module.
+
+    Args:
+        input_size: input dim
+        output_size: dimension of attention
+        attention_heads: the number of heads of multi head attention
+        linear_units: the number of units of position-wise feed forward
+        num_blocks: the number of encoder blocks
+        dropout_rate: dropout rate
+        attention_dropout_rate: dropout rate in attention
+        positional_dropout_rate: dropout rate after adding positional encoding
+        input_layer: input layer type
+        pos_enc_class: PositionalEncoding or ScaledPositionalEncoding
+        normalize_before: whether to use layer_norm before the first block
+        concat_after: whether to concat attention layer's input and output
+            if True, additional linear will be applied.
+            i.e. x -> x + linear(concat(x, att(x)))
+            if False, no additional linear will be applied.
+            i.e. x -> x + att(x)
+        positionwise_layer_type: linear, conv1d, or conv1d-linear
+        positionwise_conv_kernel_size: kernel size of positionwise conv1d layer
+        padding_idx: padding_idx for input_layer=embed
+    """
+
+    def __init__(
+            self,
+            input_size: int,
+            output_size: int = 256,
+            attention_heads: int = 4,
+            linear_units: int = 2048,
+            num_blocks: int = 6,
+            dropout_rate: float = 0.1,
+            positional_dropout_rate: float = 0.1,
+            attention_dropout_rate: float = 0.0,
+            input_layer: Optional[str] = "conv2d",
+            pos_enc_class=PositionalEncoding,
+            normalize_before: bool = True,
+            concat_after: bool = False,
+            positionwise_layer_type: str = "linear",
+            positionwise_conv_kernel_size: int = 1,
+            padding_idx: int = -1,
+            interctc_layer_idx: List[int] = [],
+            interctc_use_conditioning: bool = False,
+    ):
+        assert check_argument_types()
+        super().__init__()
+        self._output_size = output_size
+
+        if input_layer == "linear":
+            self.embed = torch.nn.Sequential(
+                torch.nn.Linear(input_size, output_size),
+                torch.nn.LayerNorm(output_size),
+                torch.nn.Dropout(dropout_rate),
+                torch.nn.ReLU(),
+                pos_enc_class(output_size, positional_dropout_rate),
+            )
+        elif input_layer == "conv2d":
+            self.embed = Conv2dSubsampling(input_size, output_size, dropout_rate)
+        elif input_layer == "conv2d2":
+            self.embed = Conv2dSubsampling2(input_size, output_size, dropout_rate)
+        elif input_layer == "conv2d6":
+            self.embed = Conv2dSubsampling6(input_size, output_size, dropout_rate)
+        elif input_layer == "conv2d8":
+            self.embed = Conv2dSubsampling8(input_size, output_size, dropout_rate)
+        elif input_layer == "embed":
+            self.embed = torch.nn.Sequential(
+                torch.nn.Embedding(input_size, output_size, padding_idx=padding_idx),
+                pos_enc_class(output_size, positional_dropout_rate),
+            )
+        elif input_layer is None:
+            if input_size == output_size:
+                self.embed = None
+            else:
+                self.embed = torch.nn.Linear(input_size, output_size)
+        else:
+            raise ValueError("unknown input_layer: " + input_layer)
+        self.normalize_before = normalize_before
+        if positionwise_layer_type == "linear":
+            positionwise_layer = PositionwiseFeedForward
+            positionwise_layer_args = (
+                output_size,
+                linear_units,
+                dropout_rate,
+            )
+        elif positionwise_layer_type == "conv1d":
+            positionwise_layer = MultiLayeredConv1d
+            positionwise_layer_args = (
+                output_size,
+                linear_units,
+                positionwise_conv_kernel_size,
+                dropout_rate,
+            )
+        elif positionwise_layer_type == "conv1d-linear":
+            positionwise_layer = Conv1dLinear
+            positionwise_layer_args = (
+                output_size,
+                linear_units,
+                positionwise_conv_kernel_size,
+                dropout_rate,
+            )
+        else:
+            raise NotImplementedError("Support only linear, conv1d, or conv1d-linear.")
+        self.encoders = repeat(
+            num_blocks,
+            lambda lnum: EncoderLayer(
+                output_size,
+                MultiHeadedAttention(
+                    attention_heads, output_size, attention_dropout_rate
+                ),
+                positionwise_layer(*positionwise_layer_args),
+                dropout_rate,
+                normalize_before,
+                concat_after,
+            ),
+        )
+        if self.normalize_before:
+            self.after_norm = LayerNorm(output_size)
+
+        self.interctc_layer_idx = interctc_layer_idx
+        if len(interctc_layer_idx) > 0:
+            assert 0 < min(interctc_layer_idx) and max(interctc_layer_idx) < num_blocks
+        self.interctc_use_conditioning = interctc_use_conditioning
+        self.conditioning_layer = None
+
+    def output_size(self) -> int:
+        return self._output_size
+
+    def forward(
+            self,
+            xs_pad: torch.Tensor,
+            ilens: torch.Tensor,
+            prev_states: torch.Tensor = None,
+            ctc: CTC = None,
+    ) -> Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]]:
+        """Embed positions in tensor.
+
+        Args:
+            xs_pad: input tensor (B, L, D)
+            ilens: input length (B)
+            prev_states: Not to be used now.
+        Returns:
+            encoded tensor (paired with intermediate outputs if any), output lengths, None
+        """
+        masks = (~make_pad_mask(ilens)[:, None, :]).to(xs_pad.device)
+
+        if self.embed is None:
+            xs_pad = xs_pad
+        elif (
+                isinstance(self.embed, Conv2dSubsampling)
+                or isinstance(self.embed, Conv2dSubsampling2)
+                or isinstance(self.embed, Conv2dSubsampling6)
+                or isinstance(self.embed, Conv2dSubsampling8)
+        ):
+            short_status, limit_size = check_short_utt(self.embed, xs_pad.size(1))
+            if short_status:
+                raise TooShortUttError(
+                    f"has {xs_pad.size(1)} frames and is too short for subsampling "
+                    + f"(it needs more than {limit_size} frames), return empty results",
+                    xs_pad.size(1),
+                    limit_size,
+                )
+            xs_pad, masks = self.embed(xs_pad, masks)
+        else:
+            xs_pad = self.embed(xs_pad)
+
+        intermediate_outs = []
+        if len(self.interctc_layer_idx) == 0:
+            xs_pad, masks = self.encoders(xs_pad, masks)
+        else:
+            for layer_idx, encoder_layer in enumerate(self.encoders):
+                xs_pad, masks = encoder_layer(xs_pad, masks)
+
+                if layer_idx + 1 in self.interctc_layer_idx:
+                    encoder_out = xs_pad
+
+                    # intermediate outputs are also normalized
+                    if self.normalize_before:
+                        encoder_out = self.after_norm(encoder_out)
+
+                    intermediate_outs.append((layer_idx + 1, encoder_out))
+
+                    if self.interctc_use_conditioning:
+                        ctc_out = ctc.softmax(encoder_out)
+                        xs_pad = xs_pad + self.conditioning_layer(ctc_out)
+
+        if self.normalize_before:
+            xs_pad = self.after_norm(xs_pad)
+
+        olens = masks.squeeze(1).sum(1)
+        if len(intermediate_outs) > 0:
+            return (xs_pad, intermediate_outs), olens, None
+        return xs_pad, olens, None
+
+
+def _pre_hook(
+    state_dict,
+    prefix,
+    local_metadata,
+    strict,
+    missing_keys,
+    unexpected_keys,
+    error_msgs,
+):
+    # https://github.com/espnet/espnet/commit/21d70286c354c66c0350e65dc098d2ee236faccc#diff-bffb1396f038b317b2b64dd96e6d3563
+    rename_state_dict(prefix + "input_layer.", prefix + "embed.", state_dict)
+    # https://github.com/espnet/espnet/commit/3d422f6de8d4f03673b89e1caef698745ec749ea#diff-bffb1396f038b317b2b64dd96e6d3563
+    rename_state_dict(prefix + "norm.", prefix + "after_norm.", state_dict)
+
+
+class TransformerEncoder_s0(torch.nn.Module):
+    """Transformer encoder module.
+
+    Args:
+        idim (int): Input dimension.
+        attention_dim (int): Dimension of attention.
+        attention_heads (int): The number of heads of multi head attention.
+        conv_wshare (int): The number of convolution kernels. Only used in
+            selfattention_layer_type == "lightconv*" or "dynamicconv*".
+        conv_kernel_length (Union[int, str]): Kernel size str of convolution
+            (e.g. 71_71_71_71_71_71). Only used in selfattention_layer_type
+            == "lightconv*" or "dynamicconv*".
+        conv_usebias (bool): Whether to use bias in convolution. Only used in
+            selfattention_layer_type == "lightconv*" or "dynamicconv*".
+        linear_units (int): The number of units of position-wise feed forward.
+        num_blocks (int): The number of encoder blocks.
+        dropout_rate (float): Dropout rate.
+        positional_dropout_rate (float): Dropout rate after adding positional encoding.
+        attention_dropout_rate (float): Dropout rate in attention.
+        input_layer (Union[str, torch.nn.Module]): Input layer type.
+        pos_enc_class (torch.nn.Module): Positional encoding module class.
+            `PositionalEncoding` or `ScaledPositionalEncoding`
+        normalize_before (bool): Whether to use layer_norm before the first block.
+        concat_after (bool): Whether to concat attention layer's input and output.
+            if True, additional linear will be applied.
+            i.e. x -> x + linear(concat(x, att(x)))
+            if False, no additional linear will be applied. i.e. x -> x + att(x)
+        positionwise_layer_type (str): "linear", "conv1d", or "conv1d-linear".
+        positionwise_conv_kernel_size (int): Kernel size of positionwise conv1d layer.
+        selfattention_layer_type (str): Encoder attention layer type.
+        padding_idx (int): Padding idx for input_layer=embed.
+        stochastic_depth_rate (float): Maximum probability to skip the encoder layer.
+        intermediate_layers (Union[List[int], None]): indices of intermediate CTC layer.
+            indices start from 1.
+            if not None, intermediate outputs are returned (which changes return type
+            signature.)
+
+    """
+
+    def __init__(
+        self,
+        idim,
+        attention_dim=256,
+        attention_heads=4,
+        conv_wshare=4,
+        conv_kernel_length="11",
+        conv_usebias=False,
+        linear_units=2048,
+        num_blocks=6,
+        dropout_rate=0.1,
+        positional_dropout_rate=0.1,
+        attention_dropout_rate=0.0,
+        input_layer="conv2d",
+        pos_enc_class=PositionalEncoding,
+        normalize_before=True,
+        concat_after=False,
+        positionwise_layer_type="linear",
+        positionwise_conv_kernel_size=1,
+        selfattention_layer_type="selfattn",
+        padding_idx=-1,
+        stochastic_depth_rate=0.0,
+        intermediate_layers=None,
+        ctc_softmax=None,
+        conditioning_layer_dim=None,
+    ):
+        """Construct an Encoder object."""
+        super(TransformerEncoder_s0, self).__init__()
+        self._register_load_state_dict_pre_hook(_pre_hook)
+
+        self.conv_subsampling_factor = 1
+        if input_layer == "linear":
+            self.embed = torch.nn.Sequential(
+                torch.nn.Linear(idim, attention_dim),
+                torch.nn.LayerNorm(attention_dim),
+                torch.nn.Dropout(dropout_rate),
+                torch.nn.ReLU(),
+                pos_enc_class(attention_dim, positional_dropout_rate),
+            )
+        elif input_layer == "conv2d":
+            self.embed = Conv2dSubsampling(idim, attention_dim, dropout_rate)
+            self.conv_subsampling_factor = 4
+        elif input_layer == "conv2d-scaled-pos-enc":
+            self.embed = Conv2dSubsampling(
+                idim,
+                attention_dim,
+                dropout_rate,
+                pos_enc_class(attention_dim, positional_dropout_rate),
+            )
+            self.conv_subsampling_factor = 4
+        elif input_layer == "conv2d6":
+            self.embed = Conv2dSubsampling6(idim, attention_dim, dropout_rate)
+            self.conv_subsampling_factor = 6
+        elif input_layer == "conv2d8":
+            self.embed = Conv2dSubsampling8(idim, attention_dim, dropout_rate)
+            self.conv_subsampling_factor = 8
+        elif input_layer == "embed":
+            self.embed = torch.nn.Sequential(
+                torch.nn.Embedding(idim, attention_dim, padding_idx=padding_idx),
+                pos_enc_class(attention_dim, positional_dropout_rate),
+            )
+        elif isinstance(input_layer, torch.nn.Module):
+            self.embed = torch.nn.Sequential(
+                input_layer,
+                pos_enc_class(attention_dim, positional_dropout_rate),
+            )
+        elif input_layer is None:
+            self.embed = torch.nn.Sequential(
+                pos_enc_class(attention_dim, positional_dropout_rate)
+            )
+        else:
+            raise ValueError("unknown input_layer: " + input_layer)
+        self.normalize_before = normalize_before
+        positionwise_layer, positionwise_layer_args = self.get_positionwise_layer(
+            positionwise_layer_type,
+            attention_dim,
+            linear_units,
+            dropout_rate,
+            positionwise_conv_kernel_size,
+        )
+        if selfattention_layer_type in [
+            "selfattn",
+            "rel_selfattn",
+            "legacy_rel_selfattn",
+        ]:
+            logging.info("encoder self-attention layer type = self-attention")
+            encoder_selfattn_layer = MultiHeadedAttention
+            encoder_selfattn_layer_args = [
+                (
+                    attention_heads,
+                    attention_dim,
+                    attention_dropout_rate,
+                )
+            ] * num_blocks
+        elif selfattention_layer_type == "lightconv":
+            logging.info("encoder self-attention layer type = lightweight convolution")
+            encoder_selfattn_layer = LightweightConvolution
+            encoder_selfattn_layer_args = [
+                (
+                    conv_wshare,
+                    attention_dim,
+                    attention_dropout_rate,
+                    int(conv_kernel_length.split("_")[lnum]),
+                    False,
+                    conv_usebias,
+                )
+                for lnum in range(num_blocks)
+            ]
+        elif selfattention_layer_type == "lightconv2d":
+            logging.info(
+                "encoder self-attention layer "
+                "type = lightweight convolution 2-dimensional"
+            )
+            encoder_selfattn_layer = LightweightConvolution2D
+            encoder_selfattn_layer_args = [
+                (
+                    conv_wshare,
+                    attention_dim,
+                    attention_dropout_rate,
+                    int(conv_kernel_length.split("_")[lnum]),
+                    False,
+                    conv_usebias,
+                )
+                for lnum in range(num_blocks)
+            ]
+        elif selfattention_layer_type == "dynamicconv":
+            logging.info("encoder self-attention layer type = dynamic convolution")
+            encoder_selfattn_layer = DynamicConvolution
+            encoder_selfattn_layer_args = [
+                (
+                    conv_wshare,
+                    attention_dim,
+                    attention_dropout_rate,
+                    int(conv_kernel_length.split("_")[lnum]),
+                    False,
+                    conv_usebias,
+                )
+                for lnum in range(num_blocks)
+            ]
+        elif selfattention_layer_type == "dynamicconv2d":
+            logging.info(
+                "encoder self-attention layer type = dynamic convolution 2-dimensional"
+            )
+            encoder_selfattn_layer = DynamicConvolution2D
+            encoder_selfattn_layer_args = [
+                (
+                    conv_wshare,
+                    attention_dim,
+                    attention_dropout_rate,
+                    int(conv_kernel_length.split("_")[lnum]),
+                    False,
+                    conv_usebias,
+                )
+                for lnum in range(num_blocks)
+            ]
+        else:
+            raise NotImplementedError(selfattention_layer_type)
+
+        self.encoders = repeat(
+            num_blocks,
+            lambda lnum: EncoderLayer(
+                attention_dim,
+                encoder_selfattn_layer(*encoder_selfattn_layer_args[lnum]),
+                positionwise_layer(*positionwise_layer_args),
+                dropout_rate,
+                normalize_before,
+                concat_after,
+                stochastic_depth_rate * float(1 + lnum) / num_blocks,
+            ),
+        )
+        if self.normalize_before:
+            self.after_norm = LayerNorm(attention_dim)
+
+        self.intermediate_layers = intermediate_layers
+        self.use_conditioning = ctc_softmax is not None
+        if self.use_conditioning:
+            self.ctc_softmax = ctc_softmax
+            self.conditioning_layer = torch.nn.Linear(
+                conditioning_layer_dim, attention_dim
+            )
+
+    def get_positionwise_layer(
+        self,
+        positionwise_layer_type="linear",
+        attention_dim=256,
+        linear_units=2048,
+        dropout_rate=0.1,
+        positionwise_conv_kernel_size=1,
+    ):
+        """Define positionwise layer."""
+        if positionwise_layer_type == "linear":
+            positionwise_layer = PositionwiseFeedForward
+            positionwise_layer_args = (attention_dim, linear_units, dropout_rate)
+        elif positionwise_layer_type == "conv1d":
+            positionwise_layer = MultiLayeredConv1d
+            positionwise_layer_args = (
+                attention_dim,
+                linear_units,
+                positionwise_conv_kernel_size,
+                dropout_rate,
+            )
+        elif positionwise_layer_type == "conv1d-linear":
+            positionwise_layer = Conv1dLinear
+            positionwise_layer_args = (
+                attention_dim,
+                linear_units,
+                positionwise_conv_kernel_size,
+                dropout_rate,
+            )
+        else:
+            raise NotImplementedError("Support only linear, conv1d, or conv1d-linear.")
+        return positionwise_layer, positionwise_layer_args
+
+    def forward(self, xs, masks):
+        """Encode input sequence.
+
+        Args:
+            xs (torch.Tensor): Input tensor (#batch, time, idim).
+            masks (torch.Tensor): Mask tensor (#batch, 1, time).
+
+        Returns:
+            torch.Tensor: Output tensor (#batch, time, attention_dim).
+            torch.Tensor: Mask tensor (#batch, 1, time).
+
+        """
+        if isinstance(
+            self.embed,
+            (Conv2dSubsampling, Conv2dSubsampling6, Conv2dSubsampling8),
+        ):
+            xs, masks = self.embed(xs, masks)
+        else:
+            xs = self.embed(xs)
+
+        if self.intermediate_layers is None:
+            xs, masks = self.encoders(xs, masks)
+        else:
+            intermediate_outputs = []
+            for layer_idx, encoder_layer in enumerate(self.encoders):
+                xs, masks = encoder_layer(xs, masks)
+
+                if (
+                    self.intermediate_layers is not None
+                    and layer_idx + 1 in self.intermediate_layers
+                ):
+                    encoder_output = xs
+                    # intermediate branches also require normalization.
+                    if self.normalize_before:
+                        encoder_output = self.after_norm(encoder_output)
+                    intermediate_outputs.append(encoder_output)
+
+                    if self.use_conditioning:
+                        intermediate_result = self.ctc_softmax(encoder_output)
+                        xs = xs + self.conditioning_layer(intermediate_result)
+
+        if self.normalize_before:
+            xs = self.after_norm(xs)
+
+        if self.intermediate_layers is not None:
+            return xs, masks, intermediate_outputs
+        return xs, masks
+
+    def forward_one_step(self, xs, masks, cache=None):
+        """Encode input frame.
+
+        Args:
+            xs (torch.Tensor): Input tensor.
+            masks (torch.Tensor): Mask tensor.
+            cache (List[torch.Tensor]): List of cache tensors.
+
+        Returns:
+            torch.Tensor: Output tensor.
+            torch.Tensor: Mask tensor.
+            List[torch.Tensor]: List of new cache tensors.
+
+        """
+        if isinstance(self.embed, Conv2dSubsampling):
+            xs, masks = self.embed(xs, masks)
+        else:
+            xs = self.embed(xs)
+        if cache is None:
+            cache = [None for _ in range(len(self.encoders))]
+        new_cache = []
+        for c, e in zip(cache, self.encoders):
+            xs, masks = e(xs, masks, cache=c)
+            new_cache.append(xs)
+        if self.normalize_before:
+            xs = self.after_norm(xs)
+        return xs, masks, new_cache
+
diff --git a/funasr/models/frontend/__init__.py b/funasr/models/frontend/__init__.py
new file mode 100644
index 000000000..e69de29bb
diff --git a/funasr/models/frontend/abs_frontend.py b/funasr/models/frontend/abs_frontend.py
new file mode 100644
index 000000000..538236fe9
--- /dev/null
+++ b/funasr/models/frontend/abs_frontend.py
@@ -0,0 +1,17 @@
+from abc import ABC
+from abc import abstractmethod
+from typing import Tuple
+
+import torch
+
+
+class AbsFrontend(torch.nn.Module, ABC):
+    @abstractmethod
+    def output_size(self) -> int:
+        raise NotImplementedError
+
+    @abstractmethod
+    def forward(
+        self, input: torch.Tensor, input_lengths: torch.Tensor
+    ) -> Tuple[torch.Tensor, torch.Tensor]:
+        raise NotImplementedError
diff --git a/funasr/models/frontend/default.py b/funasr/models/frontend/default.py
new file mode 100644
index 000000000..fad6b70f2
--- /dev/null
+++ b/funasr/models/frontend/default.py
@@ -0,0 +1,133 @@
+import copy
+from typing import Optional
+from typing import Tuple
+from typing import Union
+
+import humanfriendly
+import numpy as np
+import torch
+from torch_complex.tensor import ComplexTensor
+from typeguard import check_argument_types
+
+from funasr.layers.log_mel import LogMel
+from funasr.layers.stft import Stft
+from funasr.models.frontend.abs_frontend import AbsFrontend
+from funasr.modules.frontends.frontend import Frontend
+from funasr.utils.get_default_kwargs import get_default_kwargs
+
+
+class DefaultFrontend(AbsFrontend):
+    """Conventional frontend structure for ASR.
+
+    Stft -> WPE -> MVDR-Beamformer -> Power-spec -> Mel-Fbank -> CMVN
+    """
+
+    def __init__(
+            self,
+            fs: Union[int, str] = 16000,
+            n_fft: int = 512,
+            win_length: int = None,
+            hop_length: int = 128,
+            window: Optional[str] = "hann",
+            center: bool = True,
+            normalized: bool = False,
+            onesided: bool = True,
+            n_mels: int = 80,
+            fmin: int = None,
+            fmax: int = None,
+            htk: bool = False,
+            frontend_conf: Optional[dict] = get_default_kwargs(Frontend),
+            apply_stft: bool = True,
+    ):
+        assert check_argument_types()
+        super().__init__()
+        if isinstance(fs, str):
+            fs = humanfriendly.parse_size(fs)
+
+        # Deepcopy (In general, dict shouldn't be used as default arg)
+        frontend_conf = copy.deepcopy(frontend_conf)
+        self.hop_length = hop_length
+
+        if apply_stft:
+            self.stft = Stft(
+                n_fft=n_fft,
+                win_length=win_length,
+                hop_length=hop_length,
+                center=center,
+                window=window,
+                normalized=normalized,
+                onesided=onesided,
+            )
+        else:
+            self.stft = None
+        self.apply_stft = apply_stft
+
+        if frontend_conf is not None:
+            self.frontend = Frontend(idim=n_fft // 2 + 1, **frontend_conf)
+        else:
+            self.frontend = None
+
+        self.logmel = LogMel(
+            fs=fs,
+            n_fft=n_fft,
+            n_mels=n_mels,
+            fmin=fmin,
+            fmax=fmax,
+            htk=htk,
+        )
+        self.n_mels = n_mels
+        self.frontend_type = "default"
+
+    def output_size(self) -> int:
+        return self.n_mels
+
+    def forward(
+            self, input: torch.Tensor, input_lengths: torch.Tensor
+    ) -> Tuple[torch.Tensor, torch.Tensor]:
+        # 1. Domain-conversion: e.g. Stft: time -> time-freq
+        if self.stft is not None:
+            input_stft, feats_lens = self._compute_stft(input, input_lengths)
+        else:
+            input_stft = ComplexTensor(input[..., 0], input[..., 1])
+            feats_lens = input_lengths
+        # 2. [Option] Speech enhancement
+        if self.frontend is not None:
+            assert isinstance(input_stft, ComplexTensor), type(input_stft)
+            # input_stft: (Batch, Length, [Channel], Freq)
+            input_stft, _, mask = self.frontend(input_stft, feats_lens)
+
+        # 3. [Multi channel case]: Select a channel
+        if input_stft.dim() == 4:
+            # h: (B, T, C, F) -> h: (B, T, F)
+            if self.training:
+                # Select 1ch randomly
+                ch = np.random.randint(input_stft.size(2))
+                input_stft = input_stft[:, :, ch, :]
+            else:
+                # Use the first channel
+                input_stft = input_stft[:, :, 0, :]
+
+        # 4. STFT -> Power spectrum
+        # h: ComplexTensor(B, T, F) -> torch.Tensor(B, T, F)
+        input_power = input_stft.real ** 2 + input_stft.imag ** 2
+
+        # 5. Feature transform e.g. Stft -> Log-Mel-Fbank
+        # input_power: (Batch, [Channel,] Length, Freq)
+        #       -> input_feats: (Batch, Length, Dim)
+        input_feats, _ = self.logmel(input_power, feats_lens)
+
+        return input_feats, feats_lens
+
+    def _compute_stft(
+            self, input: torch.Tensor, input_lengths: torch.Tensor
+    ) -> Tuple[ComplexTensor, torch.Tensor]:
+        input_stft, feats_lens = self.stft(input, input_lengths)
+
+        assert input_stft.dim() >= 4, input_stft.shape
+        # "2" refers to the real/imag parts of Complex
+        assert input_stft.shape[-1] == 2, input_stft.shape
+
+        # Change torch.Tensor to ComplexTensor
+        # input_stft: (..., F, 2) -> (..., F)
+        input_stft = ComplexTensor(input_stft[..., 0], input_stft[..., 1])
+        return input_stft, feats_lens
diff --git a/funasr/models/frontend/fused.py b/funasr/models/frontend/fused.py
new file mode 100644
index 000000000..8b5e56ebf
--- /dev/null
+++ b/funasr/models/frontend/fused.py
@@ -0,0 +1,146 @@
+from funasr.models.frontend.abs_frontend import AbsFrontend
+from funasr.models.frontend.default import DefaultFrontend
+from funasr.models.frontend.s3prl import S3prlFrontend
+import numpy as np
+import torch
+from typeguard import check_argument_types
+from typing import Tuple
+
+
+class FusedFrontends(AbsFrontend):
+    def __init__(
+        self, frontends=None, align_method="linear_projection", proj_dim=100, fs=16000
+    ):
+
+        assert check_argument_types()
+        super().__init__()
+        self.align_method = (
+            align_method  # fusing method : linear_projection only for now
+        )
+        self.proj_dim = proj_dim  # dim of the projection done on each frontend
+        self.frontends = []  # list of the frontends to combine
+
+        for i, frontend in enumerate(frontends):
+            frontend_type = frontend["frontend_type"]
+            if frontend_type == "default":
+                n_mels, fs, n_fft, win_length, hop_length = (
+                    frontend.get("n_mels", 80),
+                    fs,
+                    frontend.get("n_fft", 512),
+                    frontend.get("win_length"),
+                    frontend.get("hop_length", 128),
+                )
+                window, center, normalized, onesided = (
+                    frontend.get("window", "hann"),
+                    frontend.get("center", True),
+                    frontend.get("normalized", False),
+                    frontend.get("onesided", True),
+                )
+                fmin, fmax, htk, apply_stft = (
+                    frontend.get("fmin", None),
+                    frontend.get("fmax", None),
+                    frontend.get("htk", False),
+                    frontend.get("apply_stft", True),
+                )
+
+                self.frontends.append(
+                    DefaultFrontend(
+                        n_mels=n_mels,
+                        n_fft=n_fft,
+                        fs=fs,
+                        win_length=win_length,
+                        hop_length=hop_length,
+                        window=window,
+                        center=center,
+                        normalized=normalized,
+                        onesided=onesided,
+                        fmin=fmin,
+                        fmax=fmax,
+                        htk=htk,
+                        apply_stft=apply_stft,
+                    )
+                )
+            elif frontend_type == "s3prl":
+                frontend_conf, download_dir, multilayer_feature = (
+                    frontend.get("frontend_conf"),
+                    frontend.get("download_dir"),
+                    frontend.get("multilayer_feature"),
+                )
+                self.frontends.append(
+                    S3prlFrontend(
+                        fs=fs,
+                        frontend_conf=frontend_conf,
+                        download_dir=download_dir,
+                        multilayer_feature=multilayer_feature,
+                    )
+                )
+
+            else:
+                raise NotImplementedError  # frontends are only default or s3prl
+
+        self.frontends = torch.nn.ModuleList(self.frontends)
+
+        self.gcd = np.gcd.reduce([frontend.hop_length for frontend in self.frontends])
+        self.factors = [frontend.hop_length // self.gcd for frontend in self.frontends]
+        if torch.cuda.is_available():
+            dev = "cuda"
+        else:
+            dev = "cpu"
+        if self.align_method == "linear_projection":
+            self.projection_layers = [
+                torch.nn.Linear(
+                    in_features=frontend.output_size(),
+                    out_features=self.factors[i] * self.proj_dim,
+                )
+                for i, frontend in enumerate(self.frontends)
+            ]
+            self.projection_layers = torch.nn.ModuleList(self.projection_layers)
+            self.projection_layers = self.projection_layers.to(torch.device(dev))
+
+    def output_size(self) -> int:
+        return len(self.frontends) * self.proj_dim
+
+    def forward(
+        self, input: torch.Tensor, input_lengths: torch.Tensor
+    ) -> Tuple[torch.Tensor, torch.Tensor]:
+
+        # step 0 : get all frontends features
+        self.feats = []
+        for frontend in self.frontends:
+            with torch.no_grad():
+                input_feats, feats_lens = frontend.forward(input, input_lengths)
+            self.feats.append([input_feats, feats_lens])
+
+        if (
+            self.align_method == "linear_projection"
+        ):  # TODO(Dan): to add other align methods
+
+            # 1st step : projections
+            self.feats_proj = []
+            for i, frontend in enumerate(self.frontends):
+                input_feats = self.feats[i][0]
+                self.feats_proj.append(self.projection_layers[i](input_feats))
+
+            # 2nd step : reshape
+            self.feats_reshaped = []
+            for i, frontend in enumerate(self.frontends):
+                input_feats_proj = self.feats_proj[i]
+                bs, nf, dim = input_feats_proj.shape
+                input_feats_reshaped = torch.reshape(
+                    input_feats_proj, (bs, nf * self.factors[i], dim // self.factors[i])
+                )
+                self.feats_reshaped.append(input_feats_reshaped)
+
+            # 3rd step : drop the few last frames
+            m = min([x.shape[1] for x in self.feats_reshaped])
+            self.feats_final = [x[:, :m, :] for x in self.feats_reshaped]
+
+            input_feats = torch.cat(
+                self.feats_final, dim=-1
+            )  # change the input size of the preencoder : proj_dim * n_frontends
+            feats_lens = torch.ones_like(self.feats[0][1]) * (m)
+
+        else:
+            raise NotImplementedError
+
+        return input_feats, feats_lens
diff --git a/funasr/models/frontend/s3prl.py b/funasr/models/frontend/s3prl.py
new file mode 100644
index 000000000..f2b6107d9
--- /dev/null
+++ b/funasr/models/frontend/s3prl.py
@@ -0,0 +1,143 @@
+import copy
+import logging
+import os
+from argparse import Namespace
+from typing import Optional
+from typing import Tuple
+from typing import Union
+
+import humanfriendly
+import torch
+from typeguard import check_argument_types
+
+from funasr.models.frontend.abs_frontend import AbsFrontend
+from funasr.modules.frontends.frontend import Frontend
+from funasr.modules.nets_utils import pad_list
+from funasr.utils.get_default_kwargs import get_default_kwargs
+
+
+def base_s3prl_setup(args):
+    args.upstream_feature_selection = getattr(args, "upstream_feature_selection", None)
+    args.upstream_model_config = getattr(args, "upstream_model_config", None)
+    args.upstream_refresh = getattr(args, "upstream_refresh", False)
+    args.upstream_ckpt = getattr(args, "upstream_ckpt", None)
+    args.init_ckpt = getattr(args, "init_ckpt", None)
+    args.verbose = getattr(args, "verbose", False)
+    args.tile_factor = getattr(args, "tile_factor", 1)
+    return args
+
+
+class S3prlFrontend(AbsFrontend):
+    """Speech Pretrained Representation frontend structure for ASR."""
+
+    def __init__(
+            self,
+            fs: Union[int, str] = 16000,
+            frontend_conf: Optional[dict] = get_default_kwargs(Frontend),
+            download_dir: str = None,
+            multilayer_feature: bool = False,
+    ):
+        assert check_argument_types()
+        super().__init__()
+        if isinstance(fs, str):
+            fs = humanfriendly.parse_size(fs)
+
+        if download_dir is not None:
+            torch.hub.set_dir(download_dir)
+
+        self.multilayer_feature = multilayer_feature
+        self.upstream, self.featurizer = self._get_upstream(frontend_conf)
+        self.pretrained_params = copy.deepcopy(self.upstream.state_dict())
+        self.output_dim = self.featurizer.output_dim
+        self.frontend_type = "s3prl"
+        self.hop_length = self.upstream.get_downsample_rates("key")
+
+    def _get_upstream(self, frontend_conf):
+        """Get S3PRL upstream model."""
+        s3prl_args = base_s3prl_setup(
+            Namespace(**frontend_conf, device="cpu"),
+        )
+        self.args = s3prl_args
+
+        s3prl_path = None
+        python_path_list = os.environ.get("PYTHONPATH", "(None)").split(":")
+        for p in python_path_list:
+            if p.endswith("s3prl"):
+                s3prl_path = p
+                break
+        assert s3prl_path is not None
+
+        s3prl_upstream = torch.hub.load(
+            s3prl_path,
+            s3prl_args.upstream,
+            ckpt=s3prl_args.upstream_ckpt,
+            model_config=s3prl_args.upstream_model_config,
+            refresh=s3prl_args.upstream_refresh,
+            source="local",
+        ).to("cpu")
+
+        if getattr(
+                s3prl_upstream, "model", None
+        ) is not None and s3prl_upstream.model.__class__.__name__ in [
+            "Wav2Vec2Model",
+            "HubertModel",
+        ]:
+            s3prl_upstream.model.encoder.layerdrop = 0.0
+
+        from s3prl.upstream.interfaces import Featurizer
+
+        if not self.multilayer_feature:
+            feature_selection = "last_hidden_state"
+        else:
+            feature_selection = "hidden_states"
+        s3prl_featurizer = Featurizer(
+            upstream=s3prl_upstream,
+            feature_selection=feature_selection,
+            upstream_device="cpu",
+        )
+
+        return s3prl_upstream, s3prl_featurizer
+
+    def _tile_representations(self, feature):
+        """Tile up the representations by `tile_factor`.
+
+        Input - sequence of representations
+                shape: (batch_size, seq_len, feature_dim)
+        Output - sequence of tiled representations
+                 shape: (batch_size, seq_len * factor, feature_dim)
+        """
+        assert (
+                len(feature.shape) == 3
+        ), "Input argument `feature` has invalid shape: {}".format(feature.shape)
+        tiled_feature = feature.repeat(1, 1, self.args.tile_factor)
+        tiled_feature = tiled_feature.reshape(
+            feature.size(0), feature.size(1) * self.args.tile_factor, feature.size(2)
+        )
+        return tiled_feature
+
+    def output_size(self) -> int:
+        return self.output_dim
+
+    def forward(
+            self, input: torch.Tensor, input_lengths: torch.Tensor
+    ) -> Tuple[torch.Tensor, torch.Tensor]:
+        wavs = [wav[: input_lengths[i]] for i, wav in enumerate(input)]
+        self.upstream.eval()
+        with torch.no_grad():
+            feats = self.upstream(wavs)
+        feats = self.featurizer(wavs, feats)
+
+        if self.args.tile_factor != 1:
+            feats = self._tile_representations(feats)
+
+        input_feats = pad_list(feats, 0.0)
+        feats_lens = torch.tensor([f.shape[0] for f in feats], dtype=torch.long)
+
+        # Saving CUDA Memory
+        del feats
+
+        return input_feats, feats_lens
+
+    def reload_pretrained_parameters(self):
+        self.upstream.load_state_dict(self.pretrained_params)
+        logging.info("Pretrained S3PRL frontend model parameters reloaded!")
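Review note on `_tile_representations` in the file above: `repeat` along the feature axis followed by `reshape` has the effect of repeating every frame `tile_factor` times in place, so the sequence length grows while the feature dimension stays the same. The sketch below shows the same effect for a single utterance using plain Python lists. The helper name `tile_frames` is hypothetical and not part of the patch.

```python
from typing import List


def tile_frames(frames: List[List[float]], tile_factor: int) -> List[List[float]]:
    """Repeat every frame `tile_factor` times, keeping the frame order."""
    if tile_factor < 1:
        raise ValueError("tile_factor must be >= 1")
    # Each frame is copied so that the tiled frames do not share storage.
    return [list(frame) for frame in frames for _ in range(tile_factor)]


frames = [[1.0, 2.0], [3.0, 4.0]]  # (seq_len=2, feature_dim=2)
tiled = tile_frames(frames, 2)     # (seq_len=4, feature_dim=2)
print(tiled)
```

With `tile_factor` equal to 1 the sequence is returned unchanged, which matches the `forward` method skipping the tiling step in that case.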
diff --git a/funasr/models/frontend/windowing.py b/funasr/models/frontend/windowing.py
new file mode 100644
index 000000000..7c4c56853
--- /dev/null
+++ b/funasr/models/frontend/windowing.py
@@ -0,0 +1,81 @@
+#!/usr/bin/env python3
+#  2020, Technische Universität München;  Ludwig Kürzinger
+#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)
+
+"""Sliding Window for raw audio input data."""
+
+from funasr.models.frontend.abs_frontend import AbsFrontend
+import torch
+from typeguard import check_argument_types
+from typing import Tuple
+
+
+class SlidingWindow(AbsFrontend):
+    """Sliding Window.
+
+    Provides a sliding window over a batched continuous raw audio tensor.
+    Optionally, provides padding (Currently not implemented).
+    Combine this module with a pre-encoder compatible with raw audio data,
+    for example Sinc convolutions.
+
+    Known issues:
+    Output length is calculated incorrectly if audio shorter than win_length.
+    WARNING: trailing values are discarded - padding not implemented yet.
+    There is currently no additional window function applied to input values.
+    """
+
+    def __init__(
+        self,
+        win_length: int = 400,
+        hop_length: int = 160,
+        channels: int = 1,
+        padding: int = None,
+        fs=None,
+    ):
+        """Initialize.
+
+        Args:
+            win_length: Length of frame.
+            hop_length: Relative starting point of next frame.
+            channels: Number of input channels.
+            padding: Padding (placeholder, currently not implemented).
+            fs:  Sampling rate (placeholder for compatibility, not used).
+        """
+        assert check_argument_types()
+        super().__init__()
+        self.fs = fs
+        self.win_length = win_length
+        self.hop_length = hop_length
+        self.channels = channels
+        self.padding = padding
+
+    def forward(
+        self, input: torch.Tensor, input_lengths: torch.Tensor
+    ) -> Tuple[torch.Tensor, torch.Tensor]:
+        """Apply a sliding window on the input.
+
+        Args:
+            input: Input (B, T, C*D) or (B, T*C*D), with D=C=1.
+            input_lengths: Input lengths within batch.
+
+        Returns:
+            Tensor: Output with dimensions (B, T, C, D), with D=win_length.
+            Tensor: Output lengths within batch.
+        """
+        input_size = input.size()
+        B = input_size[0]
+        T = input_size[1]
+        C = self.channels
+        D = self.win_length
+        # (B, T, C) --> (T, B, C)
+        continuous = input.view(B, T, C).permute(1, 0, 2)
+        windowed = continuous.unfold(0, D, self.hop_length)
+        # (T, B, C, D) --> (B, T, C, D)
+        output = windowed.permute(1, 0, 2, 3).contiguous()
+        # After unfold(), windowed lengths change:
+        output_lengths = (input_lengths - self.win_length) // self.hop_length + 1
+        return output, output_lengths
+
+    def output_size(self) -> int:
+        """Return output length of feature dimension D, i.e. the window length."""
+        return self.win_length
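Review note on `SlidingWindow.forward` in the file above: the number of frames follows `(length - win_length) // hop_length + 1`, and samples that do not fill a complete final window are dropped. The sketch below reproduces that arithmetic for one sequence with plain Python lists. The names `num_frames` and `sliding_frames` are hypothetical, and clamping the count at zero for inputs shorter than `win_length` is an assumption of this sketch, not something the patch does.

```python
from typing import List, Sequence


def num_frames(length: int, win_length: int = 400, hop_length: int = 160) -> int:
    """Frame count from the formula in the patch, clamped at 0 (sketch-only choice)."""
    return max(0, (length - win_length) // hop_length + 1)


def sliding_frames(
    samples: Sequence[int], win_length: int, hop_length: int
) -> List[List[int]]:
    """Cut `samples` into overlapping windows; trailing samples are discarded."""
    count = num_frames(len(samples), win_length, hop_length)
    return [
        list(samples[i * hop_length : i * hop_length + win_length])
        for i in range(count)
    ]


windows = sliding_frames(list(range(11)), win_length=4, hop_length=3)
print(windows)
print(num_frames(1000))
```

In the small example the last sample (index 10) does not appear in any window, which illustrates the "trailing values are discarded" warning in the class docstring.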
diff --git a/funasr/models/postencoder/__init__.py b/funasr/models/postencoder/__init__.py
new file mode 100644
index 000000000..e69de29bb
diff --git a/funasr/models/postencoder/abs_postencoder.py b/funasr/models/postencoder/abs_postencoder.py
new file mode 100644
index 000000000..f5ac03be2
--- /dev/null
+++ b/funasr/models/postencoder/abs_postencoder.py
@@ -0,0 +1,17 @@
+from abc import ABC
+from abc import abstractmethod
+from typing import Tuple
+
+import torch
+
+
+class AbsPostEncoder(torch.nn.Module, ABC):
+    @abstractmethod
+    def output_size(self) -> int:
+        raise NotImplementedError
+
+    @abstractmethod
+    def forward(
+        self, input: torch.Tensor, input_lengths: torch.Tensor
+    ) -> Tuple[torch.Tensor, torch.Tensor]:
+        raise NotImplementedError
diff --git a/funasr/models/postencoder/hugging_face_transformers_postencoder.py b/funasr/models/postencoder/hugging_face_transformers_postencoder.py
new file mode 100644
index 000000000..1aad15d79
--- /dev/null
+++ b/funasr/models/postencoder/hugging_face_transformers_postencoder.py
@@ -0,0 +1,115 @@
+#!/usr/bin/env python3
+#  2021, University of Stuttgart;  Pavel Denisov
+#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)
+
+"""Hugging Face Transformers PostEncoder."""
+
+from funasr.modules.nets_utils import make_pad_mask
+from funasr.models.postencoder.abs_postencoder import AbsPostEncoder
+from typeguard import check_argument_types
+from typing import Tuple
+
+import copy
+import logging
+import torch
+
+try:
+    from transformers import AutoModel
+
+    is_transformers_available = True
+except ImportError:
+    is_transformers_available = False
+
+
+class HuggingFaceTransformersPostEncoder(AbsPostEncoder):
+    """Hugging Face Transformers PostEncoder."""
+
+    def __init__(
+        self,
+        input_size: int,
+        model_name_or_path: str,
+    ):
+        """Initialize the module."""
+        assert check_argument_types()
+        super().__init__()
+
+        if not is_transformers_available:
+            raise ImportError(
+                "`transformers` is not available. "
+                "Please install it via "
+                "`pip install transformers`."
+            )
+
+        model = AutoModel.from_pretrained(model_name_or_path)
+
+        if hasattr(model, "encoder"):
+            self.transformer = model.encoder
+        else:
+            self.transformer = model
+
+        if hasattr(self.transformer, "embed_tokens"):
+            del self.transformer.embed_tokens
+        if hasattr(self.transformer, "wte"):
+            del self.transformer.wte
+        if hasattr(self.transformer, "word_embedding"):
+            del self.transformer.word_embedding
+
+        self.pretrained_params = copy.deepcopy(self.transformer.state_dict())
+
+        if (
+            self.transformer.config.is_encoder_decoder
+            or self.transformer.config.model_type in ["xlnet", "t5"]
+        ):
+            self.use_inputs_embeds = True
+            self.extend_attention_mask = False
+        elif self.transformer.config.model_type == "gpt2":
+            self.use_inputs_embeds = True
+            self.extend_attention_mask = True
+        else:
+            self.use_inputs_embeds = False
+            self.extend_attention_mask = True
+
+        self.linear_in = torch.nn.Linear(
+            input_size, self.transformer.config.hidden_size
+        )
+
+    def forward(
+        self, input: torch.Tensor, input_lengths: torch.Tensor
+    ) -> Tuple[torch.Tensor, torch.Tensor]:
+        """Forward."""
+        input = self.linear_in(input)
+
+        args = {"return_dict": True}
+
+        mask = (~make_pad_mask(input_lengths)).to(input.device).float()
+
+        if self.extend_attention_mask:
+            args["attention_mask"] = _extend_attention_mask(mask)
+        else:
+            args["attention_mask"] = mask
+
+        if self.use_inputs_embeds:
+            args["inputs_embeds"] = input
+        else:
+            args["hidden_states"] = input
+
+        if self.transformer.config.model_type == "mpnet":
+            args["head_mask"] = [None for _ in self.transformer.layer]
+
+        output = self.transformer(**args).last_hidden_state
+
+        return output, input_lengths
+
+    def reload_pretrained_parameters(self):
+        self.transformer.load_state_dict(self.pretrained_params)
+        logging.info("Pretrained Transformers model parameters reloaded!")
+
+    def output_size(self) -> int:
+        """Get the output size."""
+        return self.transformer.config.hidden_size
+
+
+def _extend_attention_mask(mask: torch.Tensor) -> torch.Tensor:
+    mask = mask[:, None, None, :]
+    mask = (1.0 - mask) * -10000.0
+    return mask
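Review note on `_extend_attention_mask` in the file above: it turns a keep-mask (1 for valid positions, 0 for padding) into an additive mask of shape (B, 1, 1, T), where valid positions contribute 0 and padded positions contribute -10000 to the attention scores. The sketch below shows the same mapping with nested Python lists. The helper names `keep_mask` and `extend_attention_mask` are hypothetical stand-ins and do not reproduce the exact behaviour of `make_pad_mask`.

```python
from typing import List


def keep_mask(lengths: List[int]) -> List[List[float]]:
    """Return 1.0 for valid positions and 0.0 for padding, shape (B, T_max)."""
    max_len = max(lengths)
    return [[1.0 if t < n else 0.0 for t in range(max_len)] for n in lengths]


def extend_attention_mask(mask: List[List[float]]) -> List[List[List[List[float]]]]:
    """Map a (B, T) keep-mask to an additive (B, 1, 1, T) mask."""
    return [[[[(1.0 - m) * -10000.0 for m in row]]] for row in mask]


mask = keep_mask([3, 1])
extended = extend_attention_mask(mask)
print(extended[1][0][0])
```

The two singleton axes let the mask broadcast over attention heads and query positions when it is added to the attention scores.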
diff --git a/funasr/models/predictor/__init__.py b/funasr/models/predictor/__init__.py
new file mode 100644
index 000000000..e69de29bb
diff --git a/funasr/models/predictor/cif.py b/funasr/models/predictor/cif.py
new file mode 100644
index 000000000..819970870
--- /dev/null
+++ b/funasr/models/predictor/cif.py
@@ -0,0 +1,266 @@
+import torch
+from torch import nn
+
+from funasr.modules.nets_utils import make_pad_mask
+
+class CifPredictor(nn.Module):
+    def __init__(self, idim, l_order, r_order, threshold=1.0, dropout=0.1, smooth_factor=1.0, noise_threshold=0):
+        super(CifPredictor, self).__init__()
+
+        self.pad = nn.ConstantPad1d((l_order, r_order), 0)
+        self.cif_conv1d = nn.Conv1d(idim, idim, l_order + r_order + 1, groups=idim)
+        self.cif_output = nn.Linear(idim, 1)
+        self.dropout = torch.nn.Dropout(p=dropout)
+        self.threshold = threshold
+        self.smooth_factor = smooth_factor
+        self.noise_threshold = noise_threshold
+
+    def forward(self, hidden, target_label=None, mask=None, ignore_id=-1, mask_chunk_predictor=None,
+                target_label_length=None):
+        h = hidden
+        context = h.transpose(1, 2)
+        queries = self.pad(context)
+        memory = self.cif_conv1d(queries)
+        output = memory + context
+        output = self.dropout(output)
+        output = output.transpose(1, 2)
+        output = torch.relu(output)
+        output = self.cif_output(output)
+        alphas = torch.sigmoid(output)
+        alphas = torch.nn.functional.relu(alphas * self.smooth_factor - self.noise_threshold)
+        if mask is not None:
+            alphas = alphas * mask.transpose(-1, -2).float()
+        if mask_chunk_predictor is not None:
+            alphas = alphas * mask_chunk_predictor
+        alphas = alphas.squeeze(-1)
+        if target_label_length is not None:
+            target_length = target_label_length
+        elif target_label is not None:
+            target_length = (target_label != ignore_id).float().sum(-1)
+        else:
+            target_length = None
+        token_num = alphas.sum(-1)
+        if target_length is not None:
+            alphas *= (target_length / token_num)[:, None].repeat(1, alphas.size(1))
+        acoustic_embeds, cif_peak = cif(hidden, alphas, self.threshold)
+        return acoustic_embeds, token_num, alphas, cif_peak
+
+    def gen_frame_alignments(self,
+                             alphas: torch.Tensor = None,
+                             encoder_sequence_length: torch.Tensor = None):
+        batch_size, maximum_length = alphas.size()
+        int_type = torch.int32
+
+        is_training = self.training
+        if is_training:
+            token_num = torch.round(torch.sum(alphas, dim=1)).type(int_type)
+        else:
+            token_num = torch.floor(torch.sum(alphas, dim=1)).type(int_type)
+
+        max_token_num = torch.max(token_num).item()
+
+        alphas_cumsum = torch.cumsum(alphas, dim=1)
+        alphas_cumsum = torch.floor(alphas_cumsum).type(int_type)
+        alphas_cumsum = alphas_cumsum[:, None, :].repeat(1, max_token_num, 1)
+
+        index = torch.ones([batch_size, max_token_num], dtype=int_type)
+        index = torch.cumsum(index, dim=1)
+        index = index[:, :, None].repeat(1, 1, maximum_length).to(alphas_cumsum.device)
+
+        index_div = torch.floor(torch.true_divide(alphas_cumsum, index)).type(int_type)
+        index_div_bool_zeros = index_div.eq(0)
+        index_div_bool_zeros_count = torch.sum(index_div_bool_zeros, dim=-1) + 1
+        index_div_bool_zeros_count = torch.clamp(index_div_bool_zeros_count, 0, encoder_sequence_length.max())
+        token_num_mask = (~make_pad_mask(token_num, maxlen=max_token_num)).to(token_num.device)
+        index_div_bool_zeros_count *= token_num_mask
+
+        index_div_bool_zeros_count_tile = index_div_bool_zeros_count[:, :, None].repeat(1, 1, maximum_length)
+        ones = torch.ones_like(index_div_bool_zeros_count_tile)
+        zeros = torch.zeros_like(index_div_bool_zeros_count_tile)
+        ones = torch.cumsum(ones, dim=2)
+        cond = index_div_bool_zeros_count_tile == ones
+        index_div_bool_zeros_count_tile = torch.where(cond, zeros, ones)
+
+        index_div_bool_zeros_count_tile_bool = index_div_bool_zeros_count_tile.type(torch.bool)
+        index_div_bool_zeros_count_tile = 1 - index_div_bool_zeros_count_tile_bool.type(int_type)
+        index_div_bool_zeros_count_tile_out = torch.sum(index_div_bool_zeros_count_tile, dim=1)
+        index_div_bool_zeros_count_tile_out = index_div_bool_zeros_count_tile_out.type(int_type)
+        predictor_mask = (~make_pad_mask(encoder_sequence_length, maxlen=encoder_sequence_length.max())).type(
+            int_type).to(encoder_sequence_length.device)
+        index_div_bool_zeros_count_tile_out = index_div_bool_zeros_count_tile_out * predictor_mask
+
+        predictor_alignments = index_div_bool_zeros_count_tile_out
+        predictor_alignments_length = predictor_alignments.sum(-1).type(encoder_sequence_length.dtype)
+        return predictor_alignments.detach(), predictor_alignments_length.detach()
+
+
+class CifPredictorV2(nn.Module):
+    def __init__(self, idim, l_order, r_order, threshold=1.0, dropout=0.1, smooth_factor=1.0, noise_threshold=0,
+                 tail_threshold=0.0):
+        super(CifPredictorV2, self).__init__()
+
+        self.pad = nn.ConstantPad1d((l_order, r_order), 0)
+        self.cif_conv1d = nn.Conv1d(idim, idim, l_order + r_order + 1)
+        self.cif_output = nn.Linear(idim, 1)
+        self.dropout = torch.nn.Dropout(p=dropout)
+        self.threshold = threshold
+        self.smooth_factor = smooth_factor
+        self.noise_threshold = noise_threshold
+        self.tail_threshold = tail_threshold
+
+    def forward(self, hidden, target_label=None, mask=None, ignore_id=-1, mask_chunk_predictor=None,
+                target_label_length=None):
+        h = hidden
+        context = h.transpose(1, 2)
+        queries = self.pad(context)
+        output = torch.relu(self.cif_conv1d(queries))
+        output = output.transpose(1, 2)
+
+        output = self.cif_output(output)
+        alphas = torch.sigmoid(output)
+        alphas = torch.nn.functional.relu(alphas * self.smooth_factor - self.noise_threshold)
+        if mask is not None:
+            alphas = alphas * mask.transpose(-1, -2).float()
+        if mask_chunk_predictor is not None:
+            alphas = alphas * mask_chunk_predictor
+        alphas = alphas.squeeze(-1)
+        if target_label_length is not None:
+            target_length = target_label_length
+        elif target_label is not None:
+            target_length = (target_label != ignore_id).float().sum(-1)
+        else:
+            target_length = None
+        token_num = alphas.sum(-1)
+        if target_length is not None:
+            alphas *= (target_length / token_num)[:, None].repeat(1, alphas.size(1))
+        elif self.tail_threshold > 0.0:
+            hidden, alphas, token_num = self.tail_process_fn(hidden, alphas, token_num)
+
+        acoustic_embeds, cif_peak = cif(hidden, alphas, self.threshold)
+        if target_length is None and self.tail_threshold > 0.0:
+            token_num_int = torch.max(token_num).type(torch.int32).item()
+            acoustic_embeds = acoustic_embeds[:, :token_num_int, :]
+
+        return acoustic_embeds, token_num, alphas, cif_peak
+
+    def tail_process_fn(self, hidden, alphas, token_num=None):
+        b, t, d = hidden.size()
+        tail_threshold = self.tail_threshold
+        tail_threshold = torch.tensor([tail_threshold], dtype=alphas.dtype).to(alphas.device)
+        tail_threshold = torch.reshape(tail_threshold, (1, 1))
+        alphas = torch.cat([alphas, tail_threshold], dim=1)
+        zeros = torch.zeros((b, 1, d), dtype=hidden.dtype).to(hidden.device)
+        hidden = torch.cat([hidden, zeros], dim=1)
+        token_num = alphas.sum(dim=-1)
+        token_num_floor = torch.floor(token_num)
+
+        return hidden, alphas, token_num_floor
+
+    def gen_frame_alignments(self,
+                             alphas: torch.Tensor = None,
+                             encoder_sequence_length: torch.Tensor = None):
+        batch_size, maximum_length = alphas.size()
+        int_type = torch.int32
+
+        is_training = self.training
+        if is_training:
+            token_num = torch.round(torch.sum(alphas, dim=1)).type(int_type)
+        else:
+            token_num = torch.floor(torch.sum(alphas, dim=1)).type(int_type)
+
+        max_token_num = torch.max(token_num).item()
+
+        alphas_cumsum = torch.cumsum(alphas, dim=1)
+        alphas_cumsum = torch.floor(alphas_cumsum).type(int_type)
+        alphas_cumsum = alphas_cumsum[:, None, :].repeat(1, max_token_num, 1)
+
+        index = torch.ones([batch_size, max_token_num], dtype=int_type)
+        index = torch.cumsum(index, dim=1)
+        index = index[:, :, None].repeat(1, 1, maximum_length).to(alphas_cumsum.device)
+
+        index_div = torch.floor(torch.true_divide(alphas_cumsum, index)).type(int_type)
+        index_div_bool_zeros = index_div.eq(0)
+        index_div_bool_zeros_count = torch.sum(index_div_bool_zeros, dim=-1) + 1
+        index_div_bool_zeros_count = torch.clamp(index_div_bool_zeros_count, 0, encoder_sequence_length.max())
+        token_num_mask = (~make_pad_mask(token_num, maxlen=max_token_num)).to(token_num.device)
+        index_div_bool_zeros_count *= token_num_mask
+
+        index_div_bool_zeros_count_tile = index_div_bool_zeros_count[:, :, None].repeat(1, 1, maximum_length)
+        ones = torch.ones_like(index_div_bool_zeros_count_tile)
+        zeros = torch.zeros_like(index_div_bool_zeros_count_tile)
+        ones = torch.cumsum(ones, dim=2)
+        cond = index_div_bool_zeros_count_tile == ones
+        index_div_bool_zeros_count_tile = torch.where(cond, zeros, ones)
+
+        index_div_bool_zeros_count_tile_bool = index_div_bool_zeros_count_tile.type(torch.bool)
+        index_div_bool_zeros_count_tile = 1 - index_div_bool_zeros_count_tile_bool.type(int_type)
+        index_div_bool_zeros_count_tile_out = torch.sum(index_div_bool_zeros_count_tile, dim=1)
+        index_div_bool_zeros_count_tile_out = index_div_bool_zeros_count_tile_out.type(int_type)
+        predictor_mask = (~make_pad_mask(encoder_sequence_length, maxlen=encoder_sequence_length.max())).type(
+            int_type).to(encoder_sequence_length.device)
+        index_div_bool_zeros_count_tile_out = index_div_bool_zeros_count_tile_out * predictor_mask
+
+        predictor_alignments = index_div_bool_zeros_count_tile_out
+        predictor_alignments_length = predictor_alignments.sum(-1).type(encoder_sequence_length.dtype)
+        return predictor_alignments.detach(), predictor_alignments_length.detach()
+
+
+class mae_loss(nn.Module):
+
+    def __init__(self, normalize_length=False):
+        super(mae_loss, self).__init__()
+        self.normalize_length = normalize_length
+        self.criterion = torch.nn.L1Loss(reduction='sum')
+
+    def forward(self, token_length, pre_token_length):
+        loss_token_normalizer = token_length.size(0)
+        if self.normalize_length:
+            loss_token_normalizer = token_length.sum().type(torch.float32)
+        loss = self.criterion(token_length, pre_token_length)
+        loss = loss / loss_token_normalizer
+        return loss
+
+
+def cif(hidden, alphas, threshold):
+    batch_size, len_time, hidden_size = hidden.size()
+
+    # loop vars
+    integrate = torch.zeros([batch_size], device=hidden.device)
+    frame = torch.zeros([batch_size, hidden_size], device=hidden.device)
+    # intermediate vars along time
+    list_fires = []
+    list_frames = []
+
+    for t in range(len_time):
+        alpha = alphas[:, t]
+        distribution_completion = torch.ones([batch_size], device=hidden.device) - integrate
+
+        integrate += alpha
+        list_fires.append(integrate)
+
+        fire_place = integrate >= threshold
+        integrate = torch.where(fire_place,
+                                integrate - torch.ones([batch_size], device=hidden.device),
+                                integrate)
+        cur = torch.where(fire_place,
+                          distribution_completion,
+                          alpha)
+        remainds = alpha - cur
+
+        frame += cur[:, None] * hidden[:, t, :]
+        list_frames.append(frame)
+        frame = torch.where(fire_place[:, None].repeat(1, hidden_size),
+                            remainds[:, None] * hidden[:, t, :],
+                            frame)
+
+    fires = torch.stack(list_fires, 1)
+    frames = torch.stack(list_frames, 1)
+    list_ls = []
+    len_labels = torch.round(alphas.sum(-1)).int()
+    max_label_len = len_labels.max()
+    for b in range(batch_size):
+        fire = fires[b, :]
+        l = torch.index_select(frames[b, :, :], 0, torch.nonzero(fire >= threshold).squeeze())
+        pad_l = torch.zeros([max_label_len - l.size(0), hidden_size], device=hidden.device)
+        list_ls.append(torch.cat([l, pad_l], 0))
+    return torch.stack(list_ls, 0), fires
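Review note on `cif` in the file above: the loop accumulates the weights `alphas` frame by frame, and whenever the running sum reaches the threshold it emits one token embedding. The firing frame's weight is split: the part needed to complete the current token goes into that token, and the remainder starts the next one. The sketch below follows the same steps for a single sequence with plain Python lists. The name `cif_single` is hypothetical, the sketch assumes a threshold of 1.0 as in the default arguments, and it returns only the fired frames without the padding step of the tensor version.

```python
from typing import List


def cif_single(
    hidden: List[List[float]], alphas: List[float], threshold: float = 1.0
) -> List[List[float]]:
    """Continuous integrate-and-fire for one sequence (sketch, threshold 1.0)."""
    dim = len(hidden[0])
    integrate = 0.0          # weight accumulated for the current token
    frame = [0.0] * dim      # weighted sum of hidden states for the current token
    fired: List[List[float]] = []
    for h, alpha in zip(hidden, alphas):
        completion = 1.0 - integrate  # weight still needed to finish the token
        integrate += alpha
        if integrate >= threshold:
            # Use only the part of alpha that completes the token, then fire.
            frame = [f + completion * x for f, x in zip(frame, h)]
            fired.append(frame)
            integrate -= 1.0
            # The leftover weight seeds the next token.
            leftover = alpha - completion
            frame = [leftover * x for x in h]
        else:
            frame = [f + alpha * x for f, x in zip(frame, h)]
    return fired


tokens = cif_single([[4.0], [8.0], [2.0]], [0.75, 0.75, 0.5])
print(tokens)
```

In the example the second frame fires with 0.25 of its weight going to the first token and 0.5 carried over, so the weights sum to 2.0 and two tokens are emitted.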
diff --git a/funasr/models/preencoder/__init__.py b/funasr/models/preencoder/__init__.py
new file mode 100644
index 000000000..e69de29bb
diff --git a/funasr/models/preencoder/abs_preencoder.py b/funasr/models/preencoder/abs_preencoder.py
new file mode 100644
index 000000000..3ecdc6b91
--- /dev/null
+++ b/funasr/models/preencoder/abs_preencoder.py
@@ -0,0 +1,17 @@
+from abc import ABC
+from abc import abstractmethod
+from typing import Tuple
+
+import torch
+
+
+class AbsPreEncoder(torch.nn.Module, ABC):
+    @abstractmethod
+    def output_size(self) -> int:
+        raise NotImplementedError
+
+    @abstractmethod
+    def forward(
+        self, input: torch.Tensor, input_lengths: torch.Tensor
+    ) -> Tuple[torch.Tensor, torch.Tensor]:
+        raise NotImplementedError
diff --git a/funasr/models/preencoder/linear.py b/funasr/models/preencoder/linear.py
new file mode 100644
index 000000000..c69b6ce92
--- /dev/null
+++ b/funasr/models/preencoder/linear.py
@@ -0,0 +1,38 @@
+#!/usr/bin/env python3
+#  2021, Carnegie Mellon University;  Xuankai Chang
+#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)
+
+"""Linear Projection."""
+
+from funasr.models.preencoder.abs_preencoder import AbsPreEncoder
+from typeguard import check_argument_types
+from typing import Tuple
+
+import torch
+
+
+class LinearProjection(AbsPreEncoder):
+    """Linear Projection Preencoder."""
+
+    def __init__(
+        self,
+        input_size: int,
+        output_size: int,
+    ):
+        """Initialize the module."""
+        assert check_argument_types()
+        super().__init__()
+
+        self.output_dim = output_size
+        self.linear_out = torch.nn.Linear(input_size, output_size)
+
+    def forward(
+        self, input: torch.Tensor, input_lengths: torch.Tensor
+    ) -> Tuple[torch.Tensor, torch.Tensor]:
+        """Forward."""
+        output = self.linear_out(input)
+        return output, input_lengths  # no state in this layer
+
+    def output_size(self) -> int:
+        """Get the output size."""
+        return self.output_dim
diff --git a/funasr/models/preencoder/sinc.py b/funasr/models/preencoder/sinc.py
new file mode 100644
index 000000000..fe6d2af1b
--- /dev/null
+++ b/funasr/models/preencoder/sinc.py
@@ -0,0 +1,282 @@
+#!/usr/bin/env python3
+#  2020, Technische Universität München;  Ludwig Kürzinger
+#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)
+
+"""Sinc convolutions for raw audio input."""
+
+from collections import OrderedDict
+from funasr.models.preencoder.abs_preencoder import AbsPreEncoder
+from funasr.layers.sinc_conv import LogCompression
+from funasr.layers.sinc_conv import SincConv
+import humanfriendly
+import torch
+from typeguard import check_argument_types
+from typing import Optional
+from typing import Tuple
+from typing import Union
+
+
+class LightweightSincConvs(AbsPreEncoder):
+    """Lightweight Sinc Convolutions.
+
+    Instead of using precomputed features, end-to-end speech recognition
+    can also be done directly from raw audio using sinc convolutions, as
+    described in "Lightweight End-to-End Speech Recognition from Raw Audio
+    Data Using Sinc-Convolutions" by Kürzinger et al.
+    https://arxiv.org/abs/2010.07597
+
+    To use Sinc convolutions in your model instead of the default f-bank
+    frontend, set this module as your pre-encoder with `preencoder: sinc`
+    and feed it raw audio frames from the sliding window frontend with
+    `frontend: sliding_window` in your yaml configuration file.
+    The resulting process flow is:
+
+    Frontend (SlidingWindow) -> SpecAug -> Normalization ->
+    Pre-encoder (LightweightSincConvs) -> Encoder -> Decoder
+
+    Note that this method also performs data augmentation in time domain
+    (vs. in spectral domain in the default frontend).
+    Use `plot_sinc_filters.py` to visualize the learned Sinc filters.
+    """
+
+    def __init__(
+        self,
+        fs: Union[int, str, float] = 16000,
+        in_channels: int = 1,
+        out_channels: int = 256,
+        activation_type: str = "leakyrelu",
+        dropout_type: str = "dropout",
+        windowing_type: str = "hamming",
+        scale_type: str = "mel",
+    ):
+        """Initialize the module.
+
+        Args:
+            fs: Sample rate.
+            in_channels: Number of input channels.
+            out_channels: Number of output channels (for each input channel).
+            activation_type: Choice of activation function.
+            dropout_type: Choice of dropout function.
+            windowing_type: Choice of windowing function.
+            scale_type:  Choice of filter-bank initialization scale.
+        """
+        assert check_argument_types()
+        super().__init__()
+        if isinstance(fs, str):
+            fs = humanfriendly.parse_size(fs)
+        self.fs = fs
+        self.in_channels = in_channels
+        self.out_channels = out_channels
+        self.activation_type = activation_type
+        self.dropout_type = dropout_type
+        self.windowing_type = windowing_type
+        self.scale_type = scale_type
+
+        self.choices_dropout = {
+            "dropout": torch.nn.Dropout,
+            "spatial": SpatialDropout,
+            "dropout2d": torch.nn.Dropout2d,
+        }
+        if dropout_type not in self.choices_dropout:
+            raise NotImplementedError(
+                f"Dropout type has to be one of "
+                f"{list(self.choices_dropout.keys())}",
+            )
+
+        self.choices_activation = {
+            "leakyrelu": torch.nn.LeakyReLU,
+            "relu": torch.nn.ReLU,
+        }
+        if activation_type not in self.choices_activation:
+            raise NotImplementedError(
+                f"Activation type has to be one of "
+                f"{list(self.choices_activation.keys())}",
+            )
+
+        # initialization
+        self._create_sinc_convs()
+        # Sinc filters require custom initialization
+        self.espnet_initialization_fn()
+
+    def _create_sinc_convs(self):
+        blocks = OrderedDict()
+
+        # SincConvBlock
+        out_channels = 128
+        self.filters = SincConv(
+            self.in_channels,
+            out_channels,
+            kernel_size=101,
+            stride=1,
+            fs=self.fs,
+            window_func=self.windowing_type,
+            scale_type=self.scale_type,
+        )
+        block = OrderedDict(
+            [
+                ("Filters", self.filters),
+                ("LogCompression", LogCompression()),
+                ("BatchNorm", torch.nn.BatchNorm1d(out_channels, affine=True)),
+                ("AvgPool", torch.nn.AvgPool1d(2)),
+            ]
+        )
+        blocks["SincConvBlock"] = torch.nn.Sequential(block)
+        in_channels = out_channels
+
+        # First convolutional block, connects the sinc output to the front-end "body"
+        out_channels = 128
+        blocks["DConvBlock1"] = self.gen_lsc_block(
+            in_channels,
+            out_channels,
+            depthwise_kernel_size=25,
+            depthwise_stride=2,
+            pointwise_groups=0,
+            avgpool=True,
+            dropout_probability=0.1,
+        )
+        in_channels = out_channels
+
+        # Second convolutional block, multiple convolutional layers
+        out_channels = self.out_channels
+        for layer in [2, 3, 4]:
+            blocks[f"DConvBlock{layer}"] = self.gen_lsc_block(
+                in_channels, out_channels, depthwise_kernel_size=9, depthwise_stride=1
+            )
+            in_channels = out_channels
+
+        # Third Convolutional block, acts as coupling to encoder
+        out_channels = self.out_channels
+        blocks["DConvBlock5"] = self.gen_lsc_block(
+            in_channels,
+            out_channels,
+            depthwise_kernel_size=7,
+            depthwise_stride=1,
+            pointwise_groups=0,
+        )
+
+        self.blocks = torch.nn.Sequential(blocks)
+
+    def gen_lsc_block(
+        self,
+        in_channels: int,
+        out_channels: int,
+        depthwise_kernel_size: int = 9,
+        depthwise_stride: int = 1,
+        depthwise_groups=None,
+        pointwise_groups=0,
+        dropout_probability: float = 0.15,
+        avgpool=False,
+    ):
+        """Generate a convolutional block for Lightweight Sinc convolutions.
+
+        Each block consists of either a depthwise or a depthwise-separable
+        convolution together with an activation, a batch-normalization layer,
+        an optional average-pooling layer, and dropout.
+
+        Args:
+            in_channels: Number of input channels.
+            out_channels: Number of output channels.
+            depthwise_kernel_size: Kernel size of the depthwise convolution.
+            depthwise_stride: Stride of the depthwise convolution.
+            depthwise_groups: Number of groups of the depthwise convolution.
+            pointwise_groups: Number of groups of the pointwise convolution.
+            dropout_probability: Dropout probability in the block.
+            avgpool: If True, an AvgPool layer is inserted.
+
+        Returns:
+            torch.nn.Sequential: Neural network building block.
+        """
+        block = OrderedDict()
+        if not depthwise_groups:
+            # GCD(in_channels, out_channels) to prevent size mismatches
+            depthwise_groups, r = in_channels, out_channels
+            while r != 0:
+                depthwise_groups, r = r, depthwise_groups % r
+        block["depthwise"] = torch.nn.Conv1d(
+            in_channels,
+            out_channels,
+            depthwise_kernel_size,
+            depthwise_stride,
+            groups=depthwise_groups,
+        )
+        if pointwise_groups:
+            block["pointwise"] = torch.nn.Conv1d(
+                out_channels, out_channels, 1, 1, groups=pointwise_groups
+            )
+        block["activation"] = self.choices_activation[self.activation_type]()
+        block["batchnorm"] = torch.nn.BatchNorm1d(out_channels, affine=True)
+        if avgpool:
+            block["avgpool"] = torch.nn.AvgPool1d(2)
+        block["dropout"] = self.choices_dropout[self.dropout_type](dropout_probability)
+        return torch.nn.Sequential(block)
+
+    def espnet_initialization_fn(self):
+        """Initialize sinc filters with filterbank values."""
+        self.filters.init_filters()
+        for block in self.blocks:
+            for layer in block:
+                if type(layer) == torch.nn.BatchNorm1d and layer.affine:
+                    layer.weight.data[:] = 1.0
+                    layer.bias.data[:] = 0.0
+
+    def forward(
+        self, input: torch.Tensor, input_lengths: torch.Tensor
+    ) -> Tuple[torch.Tensor, torch.Tensor]:
+        """Apply Lightweight Sinc Convolutions.
+
+        The input shall be formatted as (B, T, C_in, D_in)
+        with B as batch size, T as time dimension, C_in as channels,
+        and D_in as feature dimension.
+
+        The output will then be (B, T, C_out*D_out)
+        with C_out and D_out as output dimensions.
+
+        The current module structure only handles D_in=400, so that D_out=1.
+        Remark for the multichannel case: C_out is the number of out_channels
+        given at initialization multiplied with C_in.
+        """
+        # Transform input data:
+        #   (B, T, C_in, D_in) -> (B*T, C_in, D_in)
+        B, T, C_in, D_in = input.size()
+        input_frames = input.view(B * T, C_in, D_in)
+        output_frames = self.blocks.forward(input_frames)
+
+        # ---TRANSFORM: (B*T, C_out, D_out) -> (B, T, C_out*D_out)
+        _, C_out, D_out = output_frames.size()
+        output_frames = output_frames.view(B, T, C_out * D_out)
+        return output_frames, input_lengths  # no state in this layer
+
+    def output_size(self) -> int:
+        """Get the output size."""
+        return self.out_channels * self.in_channels
+
+
+class SpatialDropout(torch.nn.Module):
+    """Spatial dropout module.
+
+    Apply dropout to full channels on tensors of input (B, C, D)
+    """
+
+    def __init__(
+        self,
+        dropout_probability: float = 0.15,
+        shape: Optional[Union[tuple, list]] = None,
+    ):
+        """Initialize.
+
+        Args:
+            dropout_probability: Dropout probability.
+            shape (tuple, list): Shape of input tensors.
+        """
+        assert check_argument_types()
+        super().__init__()
+        if shape is None:
+            shape = (0, 2, 1)
+        self.dropout = torch.nn.Dropout2d(dropout_probability)
+        self.shape = (shape,)
+
+    def forward(self, x: torch.Tensor) -> torch.Tensor:
+        """Forward of spatial dropout module."""
+        y = x.permute(*self.shape)
+        y = self.dropout(y)
+        return y.permute(*self.shape)
diff --git a/funasr/models/specaug/__init__.py b/funasr/models/specaug/__init__.py
new file mode 100644
index 000000000..e69de29bb
diff --git a/funasr/models/specaug/abs_specaug.py b/funasr/models/specaug/abs_specaug.py
new file mode 100644
index 000000000..3cbac418f
--- /dev/null
+++ b/funasr/models/specaug/abs_specaug.py
@@ -0,0 +1,18 @@
+from typing import Optional
+from typing import Tuple
+
+import torch
+
+
+class AbsSpecAug(torch.nn.Module):
+    """Abstract class for the augmentation of spectrogram
+
+    The process-flow:
+
+    Frontend  -> SpecAug -> Normalization -> Encoder -> Decoder
+    """
+
+    def forward(
+        self, x: torch.Tensor, x_lengths: torch.Tensor = None
+    ) -> Tuple[torch.Tensor, Optional[torch.Tensor]]:
+        raise NotImplementedError
diff --git a/funasr/models/specaug/specaug.py b/funasr/models/specaug/specaug.py
new file mode 100644
index 000000000..6074f86fd
--- /dev/null
+++ b/funasr/models/specaug/specaug.py
@@ -0,0 +1,184 @@
+"""SpecAugment module."""
+from typing import Optional
+from typing import Sequence
+from typing import Union
+
+from funasr.models.specaug.abs_specaug import AbsSpecAug
+from funasr.layers.mask_along_axis import MaskAlongAxis
+from funasr.layers.mask_along_axis import MaskAlongAxisVariableMaxWidth
+from funasr.layers.mask_along_axis import MaskAlongAxisLFR
+from funasr.layers.time_warp import TimeWarp
+
+
+class SpecAug(AbsSpecAug):
+    """Implementation of SpecAug.
+
+    Reference:
+        Daniel S. Park et al.
+        "SpecAugment: A Simple Data
+         Augmentation Method for Automatic Speech Recognition"
+
+    .. warning::
+        When using cuda mode, time_warp doesn't have reproducibility
+        due to `torch.nn.functional.interpolate`.
+
+    """
+
+    def __init__(
+        self,
+        apply_time_warp: bool = True,
+        time_warp_window: int = 5,
+        time_warp_mode: str = "bicubic",
+        apply_freq_mask: bool = True,
+        freq_mask_width_range: Union[int, Sequence[int]] = (0, 20),
+        num_freq_mask: int = 2,
+        apply_time_mask: bool = True,
+        time_mask_width_range: Optional[Union[int, Sequence[int]]] = None,
+        time_mask_width_ratio_range: Optional[Union[float, Sequence[float]]] = None,
+        num_time_mask: int = 2,
+    ):
+        if not apply_time_warp and not apply_time_mask and not apply_freq_mask:
+            raise ValueError(
+                "Either one of time_warp, time_mask, or freq_mask should be applied"
+            )
+        if (
+            apply_time_mask
+            and (time_mask_width_range is not None)
+            and (time_mask_width_ratio_range is not None)
+        ):
+            raise ValueError(
+                'Either one of "time_mask_width_range" or '
+                '"time_mask_width_ratio_range" can be used'
+            )
+        super().__init__()
+        self.apply_time_warp = apply_time_warp
+        self.apply_freq_mask = apply_freq_mask
+        self.apply_time_mask = apply_time_mask
+
+        if apply_time_warp:
+            self.time_warp = TimeWarp(window=time_warp_window, mode=time_warp_mode)
+        else:
+            self.time_warp = None
+
+        if apply_freq_mask:
+            self.freq_mask = MaskAlongAxis(
+                dim="freq",
+                mask_width_range=freq_mask_width_range,
+                num_mask=num_freq_mask,
+            )
+        else:
+            self.freq_mask = None
+
+        if apply_time_mask:
+            if time_mask_width_range is not None:
+                self.time_mask = MaskAlongAxis(
+                    dim="time",
+                    mask_width_range=time_mask_width_range,
+                    num_mask=num_time_mask,
+                )
+            elif time_mask_width_ratio_range is not None:
+                self.time_mask = MaskAlongAxisVariableMaxWidth(
+                    dim="time",
+                    mask_width_ratio_range=time_mask_width_ratio_range,
+                    num_mask=num_time_mask,
+                )
+            else:
+                raise ValueError(
+                    'Either one of "time_mask_width_range" or '
+                    '"time_mask_width_ratio_range" should be used.'
+                )
+        else:
+            self.time_mask = None
+
+    def forward(self, x, x_lengths=None):
+        if self.time_warp is not None:
+            x, x_lengths = self.time_warp(x, x_lengths)
+        if self.freq_mask is not None:
+            x, x_lengths = self.freq_mask(x, x_lengths)
+        if self.time_mask is not None:
+            x, x_lengths = self.time_mask(x, x_lengths)
+        return x, x_lengths
+
+class SpecAugLFR(AbsSpecAug):
+    """Implementation of SpecAug for low frame rate (LFR) features.
+    lfr_rate: low frame rate
+    """
+
+    def __init__(
+        self,
+        apply_time_warp: bool = True,
+        time_warp_window: int = 5,
+        time_warp_mode: str = "bicubic",
+        apply_freq_mask: bool = True,
+        freq_mask_width_range: Union[int, Sequence[int]] = (0, 20),
+        num_freq_mask: int = 2,
+        lfr_rate: int = 0,
+        apply_time_mask: bool = True,
+        time_mask_width_range: Optional[Union[int, Sequence[int]]] = None,
+        time_mask_width_ratio_range: Optional[Union[float, Sequence[float]]] = None,
+        num_time_mask: int = 2,
+    ):
+        if not apply_time_warp and not apply_time_mask and not apply_freq_mask:
+            raise ValueError(
+                "Either one of time_warp, time_mask, or freq_mask should be applied"
+            )
+        if (
+            apply_time_mask
+            and (time_mask_width_range is not None)
+            and (time_mask_width_ratio_range is not None)
+        ):
+            raise ValueError(
+                'Either one of "time_mask_width_range" or '
+                '"time_mask_width_ratio_range" can be used'
+            )
+        super().__init__()
+        self.apply_time_warp = apply_time_warp
+        self.apply_freq_mask = apply_freq_mask
+        self.apply_time_mask = apply_time_mask
+
+        if apply_time_warp:
+            self.time_warp = TimeWarp(window=time_warp_window, mode=time_warp_mode)
+        else:
+            self.time_warp = None
+
+        if apply_freq_mask:
+            self.freq_mask = MaskAlongAxisLFR(
+                dim="freq",
+                mask_width_range=freq_mask_width_range,
+                num_mask=num_freq_mask,
+                lfr_rate=lfr_rate+1,
+            )
+
+        else:
+            self.freq_mask = None
+
+        if apply_time_mask:
+            if time_mask_width_range is not None:
+                self.time_mask = MaskAlongAxisLFR(
+                    dim="time",
+                    mask_width_range=time_mask_width_range,
+                    num_mask=num_time_mask,
+                    lfr_rate=lfr_rate + 1,
+                )
+            elif time_mask_width_ratio_range is not None:
+                self.time_mask = MaskAlongAxisVariableMaxWidth(
+                    dim="time",
+                    mask_width_ratio_range=time_mask_width_ratio_range,
+                    num_mask=num_time_mask,
+                )
+            else:
+                raise ValueError(
+                    'Either one of "time_mask_width_range" or '
+                    '"time_mask_width_ratio_range" should be used.'
+                )
+        else:
+            self.time_mask = None
+
+    def forward(self, x, x_lengths=None):
+        if self.time_warp is not None:
+            x, x_lengths = self.time_warp(x, x_lengths)
+        if self.freq_mask is not None:
+            x, x_lengths = self.freq_mask(x, x_lengths)
+        if self.time_mask is not None:
+            x, x_lengths = self.time_mask(x, x_lengths)
+        return x, x_lengths
\ No newline at end of file
diff --git a/funasr/modules/add_sos_eos.py b/funasr/modules/add_sos_eos.py
new file mode 100644
index 000000000..ada1c2e01
--- /dev/null
+++ b/funasr/modules/add_sos_eos.py
@@ -0,0 +1,31 @@
+#!/usr/bin/env python3
+# -*- coding: utf-8 -*-
+
+# Copyright 2019 Shigeki Karita
+#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)
+
+"""Utility functions for Transformer."""
+
+import torch
+from funasr.modules.nets_utils import pad_list
+
+
+def add_sos_eos(ys_pad, sos, eos, ignore_id):
+    """Add <sos> and <eos> labels.
+
+    :param torch.Tensor ys_pad: batch of padded target sequences (B, Lmax)
+    :param int sos: index of <sos>
+    :param int eos: index of <eos>
+    :param int ignore_id: index of padding
+    :return: input sequences with <sos> prepended, padded with <eos>
+    :rtype: torch.Tensor
+    :return: target sequences with <eos> appended, padded with ignore_id
+    :rtype: torch.Tensor
+    """
+
+    _sos = ys_pad.new([sos])
+    _eos = ys_pad.new([eos])
+    ys = [y[y != ignore_id] for y in ys_pad]  # parse padded ys
+    ys_in = [torch.cat([_sos, y], dim=0) for y in ys]
+    ys_out = [torch.cat([y, _eos], dim=0) for y in ys]
+    return pad_list(ys_in, eos), pad_list(ys_out, ignore_id)
diff --git a/funasr/modules/attention.py b/funasr/modules/attention.py
new file mode 100644
index 000000000..e3ad56a5a
--- /dev/null
+++ b/funasr/modules/attention.py
@@ -0,0 +1,625 @@
+#!/usr/bin/env python3
+# -*- coding: utf-8 -*-
+
+# Copyright 2019 Shigeki Karita
+#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)
+
+"""Multi-Head Attention layer definition."""
+
+import math
+
+import numpy
+import torch
+from torch import nn
+
+
+class MultiHeadedAttention(nn.Module):
+    """Multi-Head Attention layer.
+
+    Args:
+        n_head (int): The number of heads.
+        n_feat (int): The number of features.
+        dropout_rate (float): Dropout rate.
+
+    """
+
+    def __init__(self, n_head, n_feat, dropout_rate):
+        """Construct a MultiHeadedAttention object."""
+        super(MultiHeadedAttention, self).__init__()
+        assert n_feat % n_head == 0
+        # We assume d_v always equals d_k
+        self.d_k = n_feat // n_head
+        self.h = n_head
+        self.linear_q = nn.Linear(n_feat, n_feat)
+        self.linear_k = nn.Linear(n_feat, n_feat)
+        self.linear_v = nn.Linear(n_feat, n_feat)
+        self.linear_out = nn.Linear(n_feat, n_feat)
+        self.attn = None
+        self.dropout = nn.Dropout(p=dropout_rate)
+
+    def forward_qkv(self, query, key, value):
+        """Transform query, key and value.
+
+        Args:
+            query (torch.Tensor): Query tensor (#batch, time1, size).
+            key (torch.Tensor): Key tensor (#batch, time2, size).
+            value (torch.Tensor): Value tensor (#batch, time2, size).
+
+        Returns:
+            torch.Tensor: Transformed query tensor (#batch, n_head, time1, d_k).
+            torch.Tensor: Transformed key tensor (#batch, n_head, time2, d_k).
+            torch.Tensor: Transformed value tensor (#batch, n_head, time2, d_k).
+
+        """
+        n_batch = query.size(0)
+        q = self.linear_q(query).view(n_batch, -1, self.h, self.d_k)
+        k = self.linear_k(key).view(n_batch, -1, self.h, self.d_k)
+        v = self.linear_v(value).view(n_batch, -1, self.h, self.d_k)
+        q = q.transpose(1, 2)  # (batch, head, time1, d_k)
+        k = k.transpose(1, 2)  # (batch, head, time2, d_k)
+        v = v.transpose(1, 2)  # (batch, head, time2, d_k)
+
+        return q, k, v
+
+    def forward_attention(self, value, scores, mask):
+        """Compute attention context vector.
+
+        Args:
+            value (torch.Tensor): Transformed value (#batch, n_head, time2, d_k).
+            scores (torch.Tensor): Attention score (#batch, n_head, time1, time2).
+            mask (torch.Tensor): Mask (#batch, 1, time2) or (#batch, time1, time2).
+
+        Returns:
+            torch.Tensor: Transformed value (#batch, time1, d_model)
+                weighted by the attention score (#batch, time1, time2).
+
+        """
+        n_batch = value.size(0)
+        if mask is not None:
+            mask = mask.unsqueeze(1).eq(0)  # (batch, 1, *, time2)
+            min_value = float(
+                numpy.finfo(torch.tensor(0, dtype=scores.dtype).numpy().dtype).min
+            )
+            scores = scores.masked_fill(mask, min_value)
+            self.attn = torch.softmax(scores, dim=-1).masked_fill(
+                mask, 0.0
+            )  # (batch, head, time1, time2)
+        else:
+            self.attn = torch.softmax(scores, dim=-1)  # (batch, head, time1, time2)
+
+        p_attn = self.dropout(self.attn)
+        x = torch.matmul(p_attn, value)  # (batch, head, time1, d_k)
+        x = (
+            x.transpose(1, 2).contiguous().view(n_batch, -1, self.h * self.d_k)
+        )  # (batch, time1, d_model)
+
+        return self.linear_out(x)  # (batch, time1, d_model)
+
+    def forward(self, query, key, value, mask):
+        """Compute scaled dot product attention.
+
+        Args:
+            query (torch.Tensor): Query tensor (#batch, time1, size).
+            key (torch.Tensor): Key tensor (#batch, time2, size).
+            value (torch.Tensor): Value tensor (#batch, time2, size).
+            mask (torch.Tensor): Mask tensor (#batch, 1, time2) or
+                (#batch, time1, time2).
+
+        Returns:
+            torch.Tensor: Output tensor (#batch, time1, d_model).
+
+        """
+        q, k, v = self.forward_qkv(query, key, value)
+        scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(self.d_k)
+        return self.forward_attention(v, scores, mask)
+
+
+class LegacyRelPositionMultiHeadedAttention(MultiHeadedAttention):
+    """Multi-Head Attention layer with relative position encoding (old version).
+
+    Details can be found in https://github.com/espnet/espnet/pull/2816.
+
+    Paper: https://arxiv.org/abs/1901.02860
+
+    Args:
+        n_head (int): The number of heads.
+        n_feat (int): The number of features.
+        dropout_rate (float): Dropout rate.
+        zero_triu (bool): Whether to zero the upper triangular part of attention matrix.
+
+    """
+
+    def __init__(self, n_head, n_feat, dropout_rate, zero_triu=False):
+        """Construct a LegacyRelPositionMultiHeadedAttention object."""
+        super().__init__(n_head, n_feat, dropout_rate)
+        self.zero_triu = zero_triu
+        # linear transformation for positional encoding
+        self.linear_pos = nn.Linear(n_feat, n_feat, bias=False)
+        # these two learnable bias are used in matrix c and matrix d
+        # as described in https://arxiv.org/abs/1901.02860 Section 3.3
+        self.pos_bias_u = nn.Parameter(torch.Tensor(self.h, self.d_k))
+        self.pos_bias_v = nn.Parameter(torch.Tensor(self.h, self.d_k))
+        torch.nn.init.xavier_uniform_(self.pos_bias_u)
+        torch.nn.init.xavier_uniform_(self.pos_bias_v)
+
+    def rel_shift(self, x):
+        """Compute relative positional encoding.
+
+        Args:
+            x (torch.Tensor): Input tensor (batch, head, time1, time2).
+
+        Returns:
+            torch.Tensor: Output tensor.
+
+        """
+        zero_pad = torch.zeros((*x.size()[:3], 1), device=x.device, dtype=x.dtype)
+        x_padded = torch.cat([zero_pad, x], dim=-1)
+
+        x_padded = x_padded.view(*x.size()[:2], x.size(3) + 1, x.size(2))
+        x = x_padded[:, :, 1:].view_as(x)
+
+        if self.zero_triu:
+            ones = torch.ones((x.size(2), x.size(3)), device=x.device)
+            x = x * torch.tril(ones, x.size(3) - x.size(2))[None, None, :, :]
+
+        return x
+
+    def forward(self, query, key, value, pos_emb, mask):
+        """Compute 'Scaled Dot Product Attention' with rel. positional encoding.
+
+        Args:
+            query (torch.Tensor): Query tensor (#batch, time1, size).
+            key (torch.Tensor): Key tensor (#batch, time2, size).
+            value (torch.Tensor): Value tensor (#batch, time2, size).
+            pos_emb (torch.Tensor): Positional embedding tensor (#batch, time1, size).
+            mask (torch.Tensor): Mask tensor (#batch, 1, time2) or
+                (#batch, time1, time2).
+
+        Returns:
+            torch.Tensor: Output tensor (#batch, time1, d_model).
+
+        """
+        q, k, v = self.forward_qkv(query, key, value)
+        q = q.transpose(1, 2)  # (batch, time1, head, d_k)
+
+        n_batch_pos = pos_emb.size(0)
+        p = self.linear_pos(pos_emb).view(n_batch_pos, -1, self.h, self.d_k)
+        p = p.transpose(1, 2)  # (batch, head, time1, d_k)
+
+        # (batch, head, time1, d_k)
+        q_with_bias_u = (q + self.pos_bias_u).transpose(1, 2)
+        # (batch, head, time1, d_k)
+        q_with_bias_v = (q + self.pos_bias_v).transpose(1, 2)
+
+        # compute attention score
+        # first compute matrix a and matrix c
+        # as described in https://arxiv.org/abs/1901.02860 Section 3.3
+        # (batch, head, time1, time2)
+        matrix_ac = torch.matmul(q_with_bias_u, k.transpose(-2, -1))
+
+        # compute matrix b and matrix d
+        # (batch, head, time1, time1)
+        matrix_bd = torch.matmul(q_with_bias_v, p.transpose(-2, -1))
+        matrix_bd = self.rel_shift(matrix_bd)
+
+        scores = (matrix_ac + matrix_bd) / math.sqrt(
+            self.d_k
+        )  # (batch, head, time1, time2)
+
+        return self.forward_attention(v, scores, mask)
+
+
+class RelPositionMultiHeadedAttention(MultiHeadedAttention):
+    """Multi-Head Attention layer with relative position encoding (new implementation).
+
+    Details can be found in https://github.com/espnet/espnet/pull/2816.
+
+    Paper: https://arxiv.org/abs/1901.02860
+
+    Args:
+        n_head (int): The number of heads.
+        n_feat (int): The number of features.
+        dropout_rate (float): Dropout rate.
+        zero_triu (bool): Whether to zero the upper triangular part of attention matrix.
+
+    """
+
+    def __init__(self, n_head, n_feat, dropout_rate, zero_triu=False):
+        """Construct a RelPositionMultiHeadedAttention object."""
+        super().__init__(n_head, n_feat, dropout_rate)
+        self.zero_triu = zero_triu
+        # linear transformation for positional encoding
+        self.linear_pos = nn.Linear(n_feat, n_feat, bias=False)
+        # these two learnable bias are used in matrix c and matrix d
+        # as described in https://arxiv.org/abs/1901.02860 Section 3.3
+        self.pos_bias_u = nn.Parameter(torch.Tensor(self.h, self.d_k))
+        self.pos_bias_v = nn.Parameter(torch.Tensor(self.h, self.d_k))
+        torch.nn.init.xavier_uniform_(self.pos_bias_u)
+        torch.nn.init.xavier_uniform_(self.pos_bias_v)
+
+    def rel_shift(self, x):
+        """Compute relative positional encoding.
+
+        Args:
+            x (torch.Tensor): Input tensor (batch, head, time1, 2*time1-1).
+            time1 means the length of query vector.
+
+        Returns:
+            torch.Tensor: Output tensor.
+
+        """
+        zero_pad = torch.zeros((*x.size()[:3], 1), device=x.device, dtype=x.dtype)
+        x_padded = torch.cat([zero_pad, x], dim=-1)
+
+        x_padded = x_padded.view(*x.size()[:2], x.size(3) + 1, x.size(2))
+        x = x_padded[:, :, 1:].view_as(x)[
+            :, :, :, : x.size(-1) // 2 + 1
+            ]  # only keep the positions from 0 to time2
+
+        if self.zero_triu:
+            ones = torch.ones((x.size(2), x.size(3)), device=x.device)
+            x = x * torch.tril(ones, x.size(3) - x.size(2))[None, None, :, :]
+
+        return x
+
+    def forward(self, query, key, value, pos_emb, mask):
+        """Compute 'Scaled Dot Product Attention' with rel. positional encoding.
+
+        Args:
+            query (torch.Tensor): Query tensor (#batch, time1, size).
+            key (torch.Tensor): Key tensor (#batch, time2, size).
+            value (torch.Tensor): Value tensor (#batch, time2, size).
+            pos_emb (torch.Tensor): Positional embedding tensor
+                (#batch, 2*time1-1, size).
+            mask (torch.Tensor): Mask tensor (#batch, 1, time2) or
+                (#batch, time1, time2).
+
+        Returns:
+            torch.Tensor: Output tensor (#batch, time1, d_model).
+
+        """
+        q, k, v = self.forward_qkv(query, key, value)
+        q = q.transpose(1, 2)  # (batch, time1, head, d_k)
+
+        n_batch_pos = pos_emb.size(0)
+        p = self.linear_pos(pos_emb).view(n_batch_pos, -1, self.h, self.d_k)
+        p = p.transpose(1, 2)  # (batch, head, 2*time1-1, d_k)
+
+        # (batch, head, time1, d_k)
+        q_with_bias_u = (q + self.pos_bias_u).transpose(1, 2)
+        # (batch, head, time1, d_k)
+        q_with_bias_v = (q + self.pos_bias_v).transpose(1, 2)
+
+        # compute attention score
+        # first compute matrix a and matrix c
+        # as described in https://arxiv.org/abs/1901.02860 Section 3.3
+        # (batch, head, time1, time2)
+        matrix_ac = torch.matmul(q_with_bias_u, k.transpose(-2, -1))
+
+        # compute matrix b and matrix d
+        # (batch, head, time1, 2*time1-1)
+        matrix_bd = torch.matmul(q_with_bias_v, p.transpose(-2, -1))
+        matrix_bd = self.rel_shift(matrix_bd)
+
+        scores = (matrix_ac + matrix_bd) / math.sqrt(
+            self.d_k
+        )  # (batch, head, time1, time2)
+
+        return self.forward_attention(v, scores, mask)
+
+
+class MultiHeadedAttentionSANM(nn.Module):
+    """Multi-Head Attention layer.
+
+    Args:
+        n_head (int): The number of heads.
+        n_feat (int): The number of features.
+        dropout_rate (float): Dropout rate.
+
+    """
+
+    def __init__(self, n_head, in_feat, n_feat, dropout_rate, kernel_size, sanm_shfit=0):
+        """Construct a MultiHeadedAttentionSANM object."""
+        super(MultiHeadedAttentionSANM, self).__init__()
+        assert n_feat % n_head == 0
+        # We assume d_v always equals d_k
+        self.d_k = n_feat // n_head
+        self.h = n_head
+        # self.linear_q = nn.Linear(n_feat, n_feat)
+        # self.linear_k = nn.Linear(n_feat, n_feat)
+        # self.linear_v = nn.Linear(n_feat, n_feat)
+        self.linear_out = nn.Linear(n_feat, n_feat)
+        self.linear_q_k_v = nn.Linear(in_feat, n_feat * 3)
+        self.attn = None
+        self.dropout = nn.Dropout(p=dropout_rate)
+
+        self.fsmn_block = nn.Conv1d(n_feat, n_feat, kernel_size, stride=1, padding=0, groups=n_feat, bias=False)
+        # padding
+        left_padding = (kernel_size - 1) // 2
+        if sanm_shfit > 0:
+            left_padding = left_padding + sanm_shfit
+        right_padding = kernel_size - 1 - left_padding
+        self.pad_fn = nn.ConstantPad1d((left_padding, right_padding), 0.0)
+
+    def forward_fsmn(self, inputs, mask, mask_shfit_chunk=None):
+        b, t, d = inputs.size()
+        if mask is not None:
+            mask = torch.reshape(mask, (b, -1, 1))
+            if mask_shfit_chunk is not None:
+                mask = mask * mask_shfit_chunk
+            inputs = inputs * mask
+
+        x = inputs.transpose(1, 2)
+        x = self.pad_fn(x)
+        x = self.fsmn_block(x)
+        x = x.transpose(1, 2)
+        x += inputs
+        x = self.dropout(x)
+        return x if mask is None else x * mask
+
+    def forward_qkv(self, x):
+        """Transform query, key and value.
+
+        Args:
+            x (torch.Tensor): Input tensor (#batch, time, in_feat).
+
+        Returns:
+            torch.Tensor: Transformed query tensor (#batch, n_head, time, d_k).
+            torch.Tensor: Transformed key tensor (#batch, n_head, time, d_k).
+            torch.Tensor: Transformed value tensor (#batch, n_head, time, d_k).
+            torch.Tensor: Value tensor before splitting into heads
+                (#batch, time, n_feat).
+
+        """
+        b, t, d = x.size()
+        q_k_v = self.linear_q_k_v(x)
+        q, k, v = torch.split(q_k_v, int(self.h * self.d_k), dim=-1)
+        q_h = torch.reshape(q, (b, t, self.h, self.d_k)).transpose(1, 2)  # (batch, head, time1, d_k)
+        k_h = torch.reshape(k, (b, t, self.h, self.d_k)).transpose(1, 2)  # (batch, head, time2, d_k)
+        v_h = torch.reshape(v, (b, t, self.h, self.d_k)).transpose(1, 2)  # (batch, head, time2, d_k)
+
+        return q_h, k_h, v_h, v
+
+    def forward_attention(self, value, scores, mask, mask_att_chunk_encoder=None):
+        """Compute attention context vector.
+
+        Args:
+            value (torch.Tensor): Transformed value (#batch, n_head, time2, d_k).
+            scores (torch.Tensor): Attention score (#batch, n_head, time1, time2).
+            mask (torch.Tensor): Mask (#batch, 1, time2) or (#batch, time1, time2).
+
+        Returns:
+            torch.Tensor: Transformed value (#batch, time1, d_model)
+                weighted by the attention score (#batch, time1, time2).
+
+        """
+        n_batch = value.size(0)
+        if mask is not None:
+            if mask_att_chunk_encoder is not None:
+                mask = mask * mask_att_chunk_encoder
+
+            mask = mask.unsqueeze(1).eq(0)  # (batch, 1, *, time2)
+
+            min_value = float(
+                numpy.finfo(torch.tensor(0, dtype=scores.dtype).numpy().dtype).min
+            )
+            scores = scores.masked_fill(mask, min_value)
+            self.attn = torch.softmax(scores, dim=-1).masked_fill(
+                mask, 0.0
+            )  # (batch, head, time1, time2)
+        else:
+            self.attn = torch.softmax(scores, dim=-1)  # (batch, head, time1, time2)
+
+        p_attn = self.dropout(self.attn)
+        x = torch.matmul(p_attn, value)  # (batch, head, time1, d_k)
+        x = (
+            x.transpose(1, 2).contiguous().view(n_batch, -1, self.h * self.d_k)
+        )  # (batch, time1, d_model)
+
+        return self.linear_out(x)  # (batch, time1, d_model)
+
+    def forward(self, x, mask, mask_shfit_chunk=None, mask_att_chunk_encoder=None):
+        """Compute scaled dot product attention.
+
+        Args:
+            x (torch.Tensor): Input tensor (#batch, time, in_feat).
+            mask (torch.Tensor): Mask tensor (#batch, 1, time).
+            mask_shfit_chunk (torch.Tensor): Optional chunk mask for the FSMN branch.
+            mask_att_chunk_encoder (torch.Tensor): Optional chunk mask for
+                the attention scores.
+
+        Returns:
+            torch.Tensor: Output tensor (#batch, time1, d_model).
+
+        """
+        q_h, k_h, v_h, v = self.forward_qkv(x)
+        fsmn_memory = self.forward_fsmn(v, mask, mask_shfit_chunk)
+        q_h = q_h * self.d_k ** (-0.5)
+        scores = torch.matmul(q_h, k_h.transpose(-2, -1))
+        att_outs = self.forward_attention(v_h, scores, mask, mask_att_chunk_encoder)
+        return att_outs + fsmn_memory
+
+class MultiHeadedAttentionSANMDecoder(nn.Module):
+    """Multi-Head Attention layer.
+
+    Args:
+        n_feat (int): The number of features.
+        dropout_rate (float): Dropout rate.
+        kernel_size (int): Kernel size of the depthwise FSMN convolution.
+
+    """
+
+    def __init__(self, n_feat, dropout_rate, kernel_size, sanm_shfit=0):
+        """Construct a MultiHeadedAttentionSANMDecoder object."""
+        super(MultiHeadedAttentionSANMDecoder, self).__init__()
+
+        self.dropout = nn.Dropout(p=dropout_rate)
+
+        self.fsmn_block = nn.Conv1d(n_feat, n_feat,
+                                    kernel_size, stride=1, padding=0, groups=n_feat, bias=False)
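+        # Pad so the output length matches the input: kernel_size - 1 frames in
+        # total, with sanm_shfit extra frames moved from the right to the left.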
+        left_padding = (kernel_size - 1) // 2
+        if sanm_shfit > 0:
+            left_padding = left_padding + sanm_shfit
+        right_padding = kernel_size - 1 - left_padding
+        self.pad_fn = nn.ConstantPad1d((left_padding, right_padding), 0.0)
+        self.kernel_size = kernel_size
+
+    def forward(self, inputs, mask, cache=None, mask_shfit_chunk=None):
+        '''
+        :param inputs: (#batch, time1, size).
+        :param mask: Mask tensor (#batch, 1, time)
+        :return: Output tensor and the updated cache.
+        '''
+        # print("in fsmn, inputs", inputs.size())
+        b, t, d = inputs.size()
+        # logging.info(
+        #     "mask: {}".format(mask.size()))
+        if mask is not None:
+            mask = torch.reshape(mask, (b, -1, 1))
+            # logging.info("in fsmn, mask: {}, {}".format(mask.size(), mask[0:100:50, :, :]))
+            if mask_shfit_chunk is not None:
+                # logging.info("in fsmn, mask_fsmn: {}, {}".format(mask_shfit_chunk.size(), mask_shfit_chunk[0:100:50, :, :]))
+                mask = mask * mask_shfit_chunk
+            # logging.info("in fsmn, mask_after_fsmn: {}, {}".format(mask.size(), mask[0:100:50, :, :]))
+            # print("in fsmn, mask", mask.size())
+            # print("in fsmn, inputs", inputs.size())
+            inputs = inputs * mask
+
+        x = inputs.transpose(1, 2)
+        b, d, t = x.size()
+        if cache is None:
+            # print("in fsmn, cache is None, x", x.size())
+
+            x = self.pad_fn(x)
+            if not self.training and t <= 1:
+                cache = x
+        else:
+            # print("in fsmn, cache is not None, x", x.size())
+            # x = torch.cat((x, cache), dim=2)[:, :, :-1]
+            # if t < self.kernel_size:
+            #     x = self.pad_fn(x)
+            x = torch.cat((cache[:, :, 1:], x), dim=2)
+            x = x[:, :, -self.kernel_size:]
+            # print("in fsmn, cache is not None, x_cat", x.size())
+            cache = x
+        x = self.fsmn_block(x)
+        x = x.transpose(1, 2)
+        # print("in fsmn, fsmn_out", x.size())
+        if x.size(1) != inputs.size(1):
+            inputs = inputs[:, -1, :]
+
+        x = x + inputs
+        x = self.dropout(x)
+        if mask is not None:
+            x = x * mask
+        return x, cache
+
+class MultiHeadedAttentionCrossAtt(nn.Module):
+    """Multi-Head cross-attention layer.
+
+    Args:
+        n_head (int): The number of heads.
+        n_feat (int): The number of features.
+        dropout_rate (float): Dropout rate.
+        encoder_output_size (int): Key/value input size; defaults to n_feat.
+    """
+
+    def __init__(self, n_head, n_feat, dropout_rate, encoder_output_size=None):
+        """Construct a MultiHeadedAttentionCrossAtt object."""
+        super(MultiHeadedAttentionCrossAtt, self).__init__()
+        assert n_feat % n_head == 0
+        # We assume d_v always equals d_k
+        self.d_k = n_feat // n_head
+        self.h = n_head
+        self.linear_q = nn.Linear(n_feat, n_feat)
+        # self.linear_k = nn.Linear(n_feat, n_feat)
+        # self.linear_v = nn.Linear(n_feat, n_feat)
+        self.linear_k_v = nn.Linear(n_feat if encoder_output_size is None else encoder_output_size, n_feat*2)
+        self.linear_out = nn.Linear(n_feat, n_feat)
+        self.attn = None
+        self.dropout = nn.Dropout(p=dropout_rate)
+
+    def forward_qkv(self, x, memory):
+        """Transform query, key and value.
+
+        Args:
+            x (torch.Tensor): Query input tensor (#batch, time1, size).
+            memory (torch.Tensor): Encoder output projected to key and value
+                (#batch, time2, size).
+
+        Returns:
+            torch.Tensor: Transformed query tensor (#batch, n_head, time1, d_k).
+            torch.Tensor: Transformed key tensor (#batch, n_head, time2, d_k).
+            torch.Tensor: Transformed value tensor (#batch, n_head, time2, d_k).
+
+        """
+
+        # print("in forward_qkv, x", x.size())
+        b = x.size(0)
+        q = self.linear_q(x)
+        q_h = torch.reshape(q, (b, -1, self.h, self.d_k)).transpose(1, 2)    # (batch, head, time1, d_k)
+
+        k_v = self.linear_k_v(memory)
+        k, v = torch.split(k_v, int(self.h*self.d_k), dim=-1)
+        k_h = torch.reshape(k, (b, -1, self.h, self.d_k)).transpose(1, 2)    # (batch, head, time2, d_k)
+        v_h = torch.reshape(v, (b, -1, self.h, self.d_k)).transpose(1, 2)    # (batch, head, time2, d_k)
+
+
+        return q_h, k_h, v_h
+
+    def forward_attention(self, value, scores, mask):
+        """Compute attention context vector.
+
+        Args:
+            value (torch.Tensor): Transformed value (#batch, n_head, time2, d_k).
+            scores (torch.Tensor): Attention score (#batch, n_head, time1, time2).
+            mask (torch.Tensor): Mask (#batch, 1, time2) or (#batch, time1, time2).
+
+        Returns:
+            torch.Tensor: Transformed value (#batch, time1, d_model)
+                weighted by the attention score (#batch, time1, time2).
+
+        """
+        n_batch = value.size(0)
+        if mask is not None:
+            mask = mask.unsqueeze(1).eq(0)  # (batch, 1, *, time2)
+            min_value = float(
+                numpy.finfo(torch.tensor(0, dtype=scores.dtype).numpy().dtype).min
+            )
+            # logging.info(
+            #     "scores: {}, mask_size: {}".format(scores.size(), mask.size()))
+            scores = scores.masked_fill(mask, min_value)
+            self.attn = torch.softmax(scores, dim=-1).masked_fill(
+                mask, 0.0
+            )  # (batch, head, time1, time2)
+        else:
+            self.attn = torch.softmax(scores, dim=-1)  # (batch, head, time1, time2)
+
+        p_attn = self.dropout(self.attn)
+        x = torch.matmul(p_attn, value)  # (batch, head, time1, d_k)
+        x = (
+            x.transpose(1, 2).contiguous().view(n_batch, -1, self.h * self.d_k)
+        )  # (batch, time1, d_model)
+
+        return self.linear_out(x)  # (batch, time1, d_model)
+
+    def forward(self, x, memory, memory_mask):
+        """Compute scaled dot product attention.
+
+        Args:
+            x (torch.Tensor): Query input tensor (#batch, time1, size).
+            memory (torch.Tensor): Encoder output used as key and value
+                (#batch, time2, size).
+            memory_mask (torch.Tensor): Mask tensor (#batch, 1, time2) or
+                (#batch, time1, time2).
+
+        Returns:
+            torch.Tensor: Output tensor (#batch, time1, d_model).
+
+        """
+        q_h, k_h, v_h = self.forward_qkv(x, memory)
+        q_h = q_h * self.d_k ** (-0.5)
+        scores = torch.matmul(q_h, k_h.transpose(-2, -1))
+        return self.forward_attention(v_h, scores, memory_mask)
\ No newline at end of file
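
The `forward_attention` methods above fill masked scores with the dtype minimum before the softmax and then zero the masked weights afterwards. A minimal pure-Python sketch of that double masking on a single score row follows; `masked_softmax` and its `keep` argument are illustrative names, not part of the patch, and plain lists stand in for the tensors.

```python
import math
from typing import List

# Approximate float32 minimum, standing in for numpy.finfo(...).min.
MIN_VALUE = -3.4028234663852886e38


def masked_softmax(scores: List[float], keep: List[int]) -> List[float]:
    """Softmax over one row, ignoring positions where keep is 0."""
    # Step 1: push masked scores to the minimum so they get ~0 probability.
    filled = [s if k else MIN_VALUE for s, k in zip(scores, keep)]
    peak = max(filled)
    exps = [math.exp(s - peak) for s in filled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Step 2: zero masked positions explicitly. This matters when every
    # position is masked, where step 1 alone would yield a uniform row.
    return [p if k else 0.0 for p, k in zip(probs, keep)]


weights = masked_softmax([1.0, 2.0, 3.0], [1, 1, 0])
all_masked = masked_softmax([1.0, 2.0], [0, 0])
print(weights, all_masked)
```

The second masking step is what keeps a fully padded row from contributing uniform attention weights.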
diff --git a/funasr/modules/beam_search/__init__.py b/funasr/modules/beam_search/__init__.py
new file mode 100644
index 000000000..e69de29bb
diff --git a/funasr/modules/beam_search/batch_beam_search.py b/funasr/modules/beam_search/batch_beam_search.py
new file mode 100644
index 000000000..6d2da8d9a
--- /dev/null
+++ b/funasr/modules/beam_search/batch_beam_search.py
@@ -0,0 +1,348 @@
+"""Parallel beam search module."""
+
+import logging
+from typing import Any
+from typing import Dict
+from typing import List
+from typing import NamedTuple
+from typing import Tuple
+
+import torch
+from torch.nn.utils.rnn import pad_sequence
+
+from funasr.modules.beam_search.beam_search import BeamSearch
+from funasr.modules.beam_search.beam_search import Hypothesis
+
+
+class BatchHypothesis(NamedTuple):
+    """Batchified/vectorized hypothesis data type."""
+
+    yseq: torch.Tensor = torch.tensor([])  # (batch, maxlen)
+    score: torch.Tensor = torch.tensor([])  # (batch,)
+    length: torch.Tensor = torch.tensor([])  # (batch,)
+    scores: Dict[str, torch.Tensor] = dict()  # values: (batch,)
+    states: Dict[str, Dict] = dict()
+
+    def __len__(self) -> int:
+        """Return a batch size."""
+        return len(self.length)
+
+
+class BatchBeamSearch(BeamSearch):
+    """Batch beam search implementation."""
+
+    def batchfy(self, hyps: List[Hypothesis]) -> BatchHypothesis:
+        """Convert list to batch."""
+        if len(hyps) == 0:
+            return BatchHypothesis()
+        return BatchHypothesis(
+            yseq=pad_sequence(
+                [h.yseq for h in hyps], batch_first=True, padding_value=self.eos
+            ),
+            length=torch.tensor([len(h.yseq) for h in hyps], dtype=torch.int64),
+            score=torch.tensor([h.score for h in hyps]),
+            scores={k: torch.tensor([h.scores[k] for h in hyps]) for k in self.scorers},
+            states={k: [h.states[k] for h in hyps] for k in self.scorers},
+        )
+
+    def _batch_select(self, hyps: BatchHypothesis, ids: List[int]) -> BatchHypothesis:
+        return BatchHypothesis(
+            yseq=hyps.yseq[ids],
+            score=hyps.score[ids],
+            length=hyps.length[ids],
+            scores={k: v[ids] for k, v in hyps.scores.items()},
+            states={
+                k: [self.scorers[k].select_state(v, i) for i in ids]
+                for k, v in hyps.states.items()
+            },
+        )
+
+    def _select(self, hyps: BatchHypothesis, i: int) -> Hypothesis:
+        return Hypothesis(
+            yseq=hyps.yseq[i, : hyps.length[i]],
+            score=hyps.score[i],
+            scores={k: v[i] for k, v in hyps.scores.items()},
+            states={
+                k: self.scorers[k].select_state(v, i) for k, v in hyps.states.items()
+            },
+        )
+
+    def unbatchfy(self, batch_hyps: BatchHypothesis) -> List[Hypothesis]:
+        """Revert batch to list."""
+        return [
+            Hypothesis(
+                yseq=batch_hyps.yseq[i][: batch_hyps.length[i]],
+                score=batch_hyps.score[i],
+                scores={k: batch_hyps.scores[k][i] for k in self.scorers},
+                states={
+                    k: v.select_state(batch_hyps.states[k], i)
+                    for k, v in self.scorers.items()
+                },
+            )
+            for i in range(len(batch_hyps.length))
+        ]
+
+    def batch_beam(
+        self, weighted_scores: torch.Tensor, ids: torch.Tensor
+    ) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]:
+        """Batch-compute topk full token ids and partial token ids.
+
+        Args:
+            weighted_scores (torch.Tensor): The weighted sum scores for each token.
+                Its shape is `(n_beam, self.n_vocab)`.
+            ids (torch.Tensor): The partial token ids to compute topk.
+                Its shape is `(n_beam, self.pre_beam_size)`.
+
+        Returns:
+            Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]:
+                The topk full (prev_hyp, new_token) ids
+                and partial (prev_hyp, new_token) ids.
+                Their shapes are all `(self.beam_size,)`
+
+        """
+        top_ids = weighted_scores.view(-1).topk(self.beam_size)[1]
+        # Because of the flatten above, `top_ids` is organized as:
+        # [hyp1 * V + token1, hyp2 * V + token2, ..., hypK * V + tokenK],
+        # where V is `self.n_vocab` and K is `self.beam_size`
+        prev_hyp_ids = top_ids // self.n_vocab
+        new_token_ids = top_ids % self.n_vocab
+        return prev_hyp_ids, new_token_ids, prev_hyp_ids, new_token_ids
+
+    def init_hyp(self, x: torch.Tensor) -> BatchHypothesis:
+        """Get the initial hypothesis data.
+
+        Args:
+            x (torch.Tensor): The encoder output feature
+
+        Returns:
+            BatchHypothesis: The initial hypothesis.
+
+        """
+        init_states = dict()
+        init_scores = dict()
+        for k, d in self.scorers.items():
+            init_states[k] = d.batch_init_state(x)
+            init_scores[k] = 0.0
+        return self.batchfy(
+            [
+                Hypothesis(
+                    score=0.0,
+                    scores=init_scores,
+                    states=init_states,
+                    yseq=torch.tensor([self.sos], device=x.device),
+                )
+            ]
+        )
+
+    def score_full(
+        self, hyp: BatchHypothesis, x: torch.Tensor
+    ) -> Tuple[Dict[str, torch.Tensor], Dict[str, Any]]:
+        """Score new hypothesis by `self.full_scorers`.
+
+        Args:
+            hyp (BatchHypothesis): Hypothesis with prefix tokens to score
+            x (torch.Tensor): Corresponding input feature
+
+        Returns:
+            Tuple[Dict[str, torch.Tensor], Dict[str, Any]]: Tuple of
+                score dict of `hyp` that has string keys of `self.full_scorers`
+                and tensor score values of shape: `(n_batch, self.n_vocab)`,
+                and state dict that has string keys
+                and state values of `self.full_scorers`
+
+        """
+        scores = dict()
+        states = dict()
+        for k, d in self.full_scorers.items():
+            scores[k], states[k] = d.batch_score(hyp.yseq, hyp.states[k], x)
+        return scores, states
+
+    def score_partial(
+        self, hyp: BatchHypothesis, ids: torch.Tensor, x: torch.Tensor
+    ) -> Tuple[Dict[str, torch.Tensor], Dict[str, Any]]:
+        """Score new hypothesis by `self.part_scorers`.
+
+        Args:
+            hyp (BatchHypothesis): Hypothesis with prefix tokens to score
+            ids (torch.Tensor): 2D tensor of new partial tokens to score
+            x (torch.Tensor): Corresponding input feature
+
+        Returns:
+            Tuple[Dict[str, torch.Tensor], Dict[str, Any]]: Tuple of
+                score dict of `hyp` that has string keys of `self.part_scorers`
+                and tensor score values of shape: `(n_batch, self.n_vocab)`,
+                and state dict that has string keys
+                and state values of `self.part_scorers`
+
+        """
+        scores = dict()
+        states = dict()
+        for k, d in self.part_scorers.items():
+            scores[k], states[k] = d.batch_score_partial(
+                hyp.yseq, ids, hyp.states[k], x
+            )
+        return scores, states
+
+    def merge_states(self, states: Any, part_states: Any, part_idx: int) -> Any:
+        """Merge states for new hypothesis.
+
+        Args:
+            states: states of `self.full_scorers`
+            part_states: states of `self.part_scorers`
+            part_idx (int): The new token id for `part_scores`
+
+        Returns:
+            Dict[str, Any]: The new state dict.
+                Its keys are names of `self.full_scorers` and `self.part_scorers`.
+                Its values are states of the scorers.
+
+        """
+        new_states = dict()
+        for k, v in states.items():
+            new_states[k] = v
+        for k, v in part_states.items():
+            new_states[k] = v
+        return new_states
+
+    def search(self, running_hyps: BatchHypothesis, x: torch.Tensor) -> BatchHypothesis:
+        """Search new tokens for running hypotheses and encoded speech x.
+
+        Args:
+            running_hyps (BatchHypothesis): Running hypotheses on beam
+            x (torch.Tensor): Encoded speech feature (T, D)
+
+        Returns:
+            BatchHypothesis: Best sorted hypotheses
+
+        """
+        n_batch = len(running_hyps)
+        part_ids = None  # no pre-beam
+        # batch scoring
+        weighted_scores = torch.zeros(
+            n_batch, self.n_vocab, dtype=x.dtype, device=x.device
+        )
+        scores, states = self.score_full(running_hyps, x.expand(n_batch, *x.shape))
+        for k in self.full_scorers:
+            weighted_scores += self.weights[k] * scores[k]
+        # partial scoring
+        if self.do_pre_beam:
+            pre_beam_scores = (
+                weighted_scores
+                if self.pre_beam_score_key == "full"
+                else scores[self.pre_beam_score_key]
+            )
+            part_ids = torch.topk(pre_beam_scores, self.pre_beam_size, dim=-1)[1]
+        # NOTE(takaaki-hori): Unlike BeamSearch, we assume that score_partial returns
+        # full-size score matrices, which has non-zero scores for part_ids and zeros
+        # for others.
+        part_scores, part_states = self.score_partial(running_hyps, part_ids, x)
+        for k in self.part_scorers:
+            weighted_scores += self.weights[k] * part_scores[k]
+        # add previous hyp scores
+        weighted_scores += running_hyps.score.to(
+            dtype=x.dtype, device=x.device
+        ).unsqueeze(1)
+
+        # TODO(karita): do not use list. use batch instead
+        # see also https://github.com/espnet/espnet/pull/1402#discussion_r354561029
+        # update hyps
+        best_hyps = []
+        prev_hyps = self.unbatchfy(running_hyps)
+        for (
+            full_prev_hyp_id,
+            full_new_token_id,
+            part_prev_hyp_id,
+            part_new_token_id,
+        ) in zip(*self.batch_beam(weighted_scores, part_ids)):
+            prev_hyp = prev_hyps[full_prev_hyp_id]
+            best_hyps.append(
+                Hypothesis(
+                    score=weighted_scores[full_prev_hyp_id, full_new_token_id],
+                    yseq=self.append_token(prev_hyp.yseq, full_new_token_id),
+                    scores=self.merge_scores(
+                        prev_hyp.scores,
+                        {k: v[full_prev_hyp_id] for k, v in scores.items()},
+                        full_new_token_id,
+                        {k: v[part_prev_hyp_id] for k, v in part_scores.items()},
+                        part_new_token_id,
+                    ),
+                    states=self.merge_states(
+                        {
+                            k: self.full_scorers[k].select_state(v, full_prev_hyp_id)
+                            for k, v in states.items()
+                        },
+                        {
+                            k: self.part_scorers[k].select_state(
+                                v, part_prev_hyp_id, part_new_token_id
+                            )
+                            for k, v in part_states.items()
+                        },
+                        part_new_token_id,
+                    ),
+                )
+            )
+        return self.batchfy(best_hyps)
+
+    def post_process(
+        self,
+        i: int,
+        maxlen: int,
+        maxlenratio: float,
+        running_hyps: BatchHypothesis,
+        ended_hyps: List[Hypothesis],
+    ) -> BatchHypothesis:
+        """Perform post-processing of beam search iterations.
+
+        Args:
+            i (int): The length of hypothesis tokens.
+            maxlen (int): The maximum length of tokens in beam search.
+            maxlenratio (float): The maximum length ratio in beam search.
+            running_hyps (BatchHypothesis): The running hypotheses in beam search.
+            ended_hyps (List[Hypothesis]): The ended hypotheses in beam search.
+
+        Returns:
+            BatchHypothesis: The new running hypotheses.
+
+        """
+        n_batch = running_hyps.yseq.shape[0]
+        logging.debug(f"the number of running hypotheses: {n_batch}")
+        if self.token_list is not None:
+            logging.debug(
+                "best hypo: "
+                + "".join(
+                    [
+                        self.token_list[x]
+                        for x in running_hyps.yseq[0, 1 : running_hyps.length[0]]
+                    ]
+                )
+            )
+        # add eos in the final loop to avoid that there are no ended hyps
+        if i == maxlen - 1:
+            logging.info("adding <eos> in the last position in the loop")
+            yseq_eos = torch.cat(
+                (
+                    running_hyps.yseq,
+                    torch.full(
+                        (n_batch, 1),
+                        self.eos,
+                        device=running_hyps.yseq.device,
+                        dtype=torch.int64,
+                    ),
+                ),
+                1,
+            )
+            running_hyps.yseq.resize_as_(yseq_eos)
+            running_hyps.yseq[:] = yseq_eos
+            running_hyps.length[:] = yseq_eos.shape[1]
+
+        # add ended hypotheses to a final list, and remove them from current hypotheses
+        # (this will be a problem, number of hyps < beam)
+        is_eos = (
+            running_hyps.yseq[torch.arange(n_batch), running_hyps.length - 1]
+            == self.eos
+        )
+        for b in torch.nonzero(is_eos, as_tuple=False).view(-1):
+            hyp = self._select(running_hyps, b)
+            ended_hyps.append(hyp)
+        remained_ids = torch.nonzero(is_eos == 0, as_tuple=False).view(-1)
+        return self._batch_select(running_hyps, remained_ids)
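
The index arithmetic in `batch_beam` above (flatten the score matrix, take the top-k, then recover the hypothesis with `//` and the token with `%`) can be sketched without tensors. This is only an illustration: `flat_topk` is a hypothetical helper, and `sorted` stands in for `Tensor.topk`.

```python
from typing import List, Tuple


def flat_topk(scores: List[List[float]], beam_size: int) -> Tuple[List[int], List[int]]:
    """Pick the beam_size best (hypothesis, token) pairs from a score matrix."""
    n_vocab = len(scores[0])
    # Flatten row-major: entry (hyp, token) lands at index hyp * n_vocab + token.
    flat = [s for row in scores for s in row]
    # Indices of the largest scores, best first (stand-in for Tensor.topk).
    top_ids = sorted(range(len(flat)), key=lambda i: flat[i], reverse=True)[:beam_size]
    # Undo the flattening: the quotient is the hypothesis, the remainder the token.
    prev_hyp_ids = [i // n_vocab for i in top_ids]
    new_token_ids = [i % n_vocab for i in top_ids]
    return prev_hyp_ids, new_token_ids


scores = [
    [-1.0, -3.0, -0.5],  # hypothesis 0
    [-0.2, -4.0, -2.0],  # hypothesis 1
]
hyp_ids, token_ids = flat_topk(scores, beam_size=2)
print(hyp_ids, token_ids)  # [1, 0] [0, 2]
```

Here the best pair is token 0 extending hypothesis 1 (score -0.2), followed by token 2 extending hypothesis 0 (score -0.5).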
diff --git a/funasr/modules/beam_search/batch_beam_search_online_sim.py b/funasr/modules/beam_search/batch_beam_search_online_sim.py
new file mode 100644
index 000000000..4d3debd23
--- /dev/null
+++ b/funasr/modules/beam_search/batch_beam_search_online_sim.py
@@ -0,0 +1,270 @@
+"""Parallel beam search module for online simulation."""
+
+import logging
+from pathlib import Path
+from typing import List
+
+import yaml
+
+import torch
+
+from funasr.modules.beam_search.batch_beam_search import BatchBeamSearch
+from funasr.modules.beam_search.beam_search import Hypothesis
+from funasr.modules.e2e_asr_common import end_detect
+
+
+class BatchBeamSearchOnlineSim(BatchBeamSearch):
+    """Online beam search implementation.
+
+    This simulates streaming decoding.
+    It requires encoded features of entire utterance and
+    extracts block by block from it as it should be done
+    in streaming processing.
+    This is based on Tsunoo et al, "STREAMING TRANSFORMER ASR
+    WITH BLOCKWISE SYNCHRONOUS BEAM SEARCH"
+    (https://arxiv.org/abs/2006.14941).
+    """
+
+    def set_streaming_config(self, asr_config: str):
+        """Set config file for streaming decoding.
+
+        Args:
+            asr_config (str): The config file for asr training
+
+        """
+        train_config_file = Path(asr_config)
+        self.block_size = None
+        self.hop_size = None
+        self.look_ahead = None
+        config = None
+        with train_config_file.open("r", encoding="utf-8") as f:
+            args = yaml.safe_load(f)
+            if "encoder_conf" in args.keys():
+                if "block_size" in args["encoder_conf"].keys():
+                    self.block_size = args["encoder_conf"]["block_size"]
+                if "hop_size" in args["encoder_conf"].keys():
+                    self.hop_size = args["encoder_conf"]["hop_size"]
+                if "look_ahead" in args["encoder_conf"].keys():
+                    self.look_ahead = args["encoder_conf"]["look_ahead"]
+            elif "config" in args.keys():
+                config = args["config"]
+                if config is None:
+                    logging.info(
+                        "Cannot find config file for streaming decoding: "
+                        + "apply batch beam search instead."
+                    )
+                    return
+        if (
+            self.block_size is None or self.hop_size is None or self.look_ahead is None
+        ) and config is not None:
+            config_file = Path(config)
+            with config_file.open("r", encoding="utf-8") as f:
+                args = yaml.safe_load(f)
+            # Use .get so enc_args is defined even without an encoder_conf entry.
+            enc_args = args.get("encoder_conf")
+            if enc_args and "block_size" in enc_args:
+                self.block_size = enc_args["block_size"]
+            if enc_args and "hop_size" in enc_args:
+                self.hop_size = enc_args["hop_size"]
+            if enc_args and "look_ahead" in enc_args:
+                self.look_ahead = enc_args["look_ahead"]
+
+    def set_block_size(self, block_size: int):
+        """Set block size for streaming decoding.
+
+        Args:
+            block_size (int): The block size of encoder
+        """
+        self.block_size = block_size
+
+    def set_hop_size(self, hop_size: int):
+        """Set hop size for streaming decoding.
+
+        Args:
+            hop_size (int): The hop size of encoder
+        """
+        self.hop_size = hop_size
+
+    def set_look_ahead(self, look_ahead: int):
+        """Set look ahead size for streaming decoding.
+
+        Args:
+            look_ahead (int): The look ahead size of encoder
+        """
+        self.look_ahead = look_ahead
+
+    def forward(
+        self, x: torch.Tensor, maxlenratio: float = 0.0, minlenratio: float = 0.0
+    ) -> List[Hypothesis]:
+        """Perform beam search.
+
+        Args:
+            x (torch.Tensor): Encoded speech feature (T, D)
+            maxlenratio (float): Input length ratio to obtain max output length.
+                If maxlenratio=0.0 (default), it uses an end-detect function
+                to automatically find maximum hypothesis lengths
+            minlenratio (float): Input length ratio to obtain min output length.
+
+        Returns:
+            list[Hypothesis]: N-best decoding results
+
+        """
+        self.conservative = True  # always true
+
+        if self.block_size and self.hop_size and self.look_ahead:
+            cur_end_frame = int(self.block_size - self.look_ahead)
+        else:
+            cur_end_frame = x.shape[0]
+        process_idx = 0
+        if cur_end_frame < x.shape[0]:
+            h = x.narrow(0, 0, cur_end_frame)
+        else:
+            h = x
+
+        # set length bounds
+        if maxlenratio == 0:
+            maxlen = x.shape[0]
+        else:
+            maxlen = max(1, int(maxlenratio * x.size(0)))
+        minlen = int(minlenratio * x.size(0))
+        logging.info("decoder input length: " + str(x.shape[0]))
+        logging.info("max output length: " + str(maxlen))
+        logging.info("min output length: " + str(minlen))
+
+        # main loop of prefix search
+        running_hyps = self.init_hyp(h)
+        prev_hyps = []
+        ended_hyps = []
+        prev_repeat = False
+
+        continue_decode = True
+
+        while continue_decode:
+            move_to_next_block = False
+            if cur_end_frame < x.shape[0]:
+                h = x.narrow(0, 0, cur_end_frame)
+            else:
+                h = x
+
+            # extend states for ctc
+            self.extend(h, running_hyps)
+
+            while process_idx < maxlen:
+                logging.debug("position " + str(process_idx))
+                best = self.search(running_hyps, h)
+
+                if process_idx == maxlen - 1:
+                    # end decoding
+                    running_hyps = self.post_process(
+                        process_idx, maxlen, maxlenratio, best, ended_hyps
+                    )
+                n_batch = best.yseq.shape[0]
+                local_ended_hyps = []
+                is_local_eos = (
+                    best.yseq[torch.arange(n_batch), best.length - 1] == self.eos
+                )
+                for i in range(is_local_eos.shape[0]):
+                    if is_local_eos[i]:
+                        hyp = self._select(best, i)
+                        local_ended_hyps.append(hyp)
+                    # NOTE(tsunoo): check repetitions here
+                    # This is an implicit implementation of
+                    # Eq (11) in https://arxiv.org/abs/2006.14941
+                    # A flag prev_repeat is used instead of using set
+                    elif (
+                        not prev_repeat
+                        and best.yseq[i, -1] in best.yseq[i, :-1]
+                        and cur_end_frame < x.shape[0]
+                    ):
+                        move_to_next_block = True
+                        prev_repeat = True
+                if maxlenratio == 0.0 and end_detect(
+                    [lh.asdict() for lh in local_ended_hyps], process_idx
+                ):
+                    logging.info(f"end detected at {process_idx}")
+                    continue_decode = False
+                    break
+                if len(local_ended_hyps) > 0 and cur_end_frame < x.shape[0]:
+                    move_to_next_block = True
+
+                if move_to_next_block:
+                    if (
+                        self.hop_size
+                        and cur_end_frame + int(self.hop_size) + int(self.look_ahead)
+                        < x.shape[0]
+                    ):
+                        cur_end_frame += int(self.hop_size)
+                    else:
+                        cur_end_frame = x.shape[0]
+                    logging.debug("Going to next block: %d", cur_end_frame)
+                    if process_idx > 1 and len(prev_hyps) > 0 and self.conservative:
+                        running_hyps = prev_hyps
+                        process_idx -= 1
+                        prev_hyps = []
+                    break
+
+                prev_repeat = False
+                prev_hyps = running_hyps
+                running_hyps = self.post_process(
+                    process_idx, maxlen, maxlenratio, best, ended_hyps
+                )
+
+                if cur_end_frame >= x.shape[0]:
+                    for hyp in local_ended_hyps:
+                        ended_hyps.append(hyp)
+
+                if len(running_hyps) == 0:
+                    logging.info("no hypothesis. Finish decoding.")
+                    continue_decode = False
+                    break
+                else:
+                    logging.debug(f"remained hypotheses: {len(running_hyps)}")
+                # increment number
+                process_idx += 1
+
+        nbest_hyps = sorted(ended_hyps, key=lambda x: x.score, reverse=True)
+        # check the number of hypotheses reaching eos
+        if len(nbest_hyps) == 0:
+            logging.warning(
+                "there are no N-best results, perform recognition "
+                "again with smaller minlenratio."
+            )
+            return (
+                []
+                if minlenratio < 0.1
+                else self.forward(x, maxlenratio, max(0.0, minlenratio - 0.1))
+            )
+
+        # report the best result
+        best = nbest_hyps[0]
+        for k, v in best.scores.items():
+            logging.info(
+                f"{v:6.2f} * {self.weights[k]:3} = {v * self.weights[k]:6.2f} for {k}"
+            )
+        logging.info(f"total log probability: {best.score:.2f}")
+        logging.info(f"normalized log probability: {best.score / len(best.yseq):.2f}")
+        logging.info(f"total number of ended hypotheses: {len(nbest_hyps)}")
+        if self.token_list is not None:
+            logging.info(
+                "best hypo: "
+                + "".join([self.token_list[x] for x in best.yseq[1:-1]])
+                + "\n"
+            )
+        return nbest_hyps
+
+    def extend(self, x: torch.Tensor, hyps: Hypothesis) -> List[Hypothesis]:
+        """Extend probabilities and states with more encoded chunks.
+
+        Args:
+            x (torch.Tensor): The extended encoder output feature
+            hyps (Hypothesis): Current list of hypothesis
+
+        Returns:
+            Hypothesis: The extended hypothesis
+
+        """
+        for k, d in self.scorers.items():
+            if hasattr(d, "extend_prob"):
+                d.extend_prob(x)
+            if hasattr(d, "extend_state"):
+                hyps.states[k] = d.extend_state(hyps.states[k])
diff --git a/funasr/modules/beam_search/beam_search.py b/funasr/modules/beam_search/beam_search.py
new file mode 100644
index 000000000..51fa60100
--- /dev/null
+++ b/funasr/modules/beam_search/beam_search.py
@@ -0,0 +1,1400 @@
+"""Beam search module."""
+
+from itertools import chain
+import logging
+from typing import Any
+from typing import Dict
+from typing import List
+from typing import NamedTuple
+from typing import Tuple
+from typing import Union
+
+import torch
+
+from funasr.modules.e2e_asr_common import end_detect
+from funasr.modules.scorers.scorer_interface import PartialScorerInterface
+from funasr.modules.scorers.scorer_interface import ScorerInterface
+
+
+class Hypothesis(NamedTuple):
+    """Hypothesis data type."""
+
+    yseq: torch.Tensor
+    score: Union[float, torch.Tensor] = 0
+    scores: Dict[str, Union[float, torch.Tensor]] = dict()
+    states: Dict[str, Any] = dict()
+
+    def asdict(self) -> dict:
+        """Convert data to JSON-friendly dict."""
+        return self._replace(
+            yseq=self.yseq.tolist(),
+            score=float(self.score),
+            scores={k: float(v) for k, v in self.scores.items()},
+        )._asdict()
+
+
+class BeamSearch(torch.nn.Module):
+    """Beam search implementation."""
+
+    def __init__(
+        self,
+        scorers: Dict[str, ScorerInterface],
+        weights: Dict[str, float],
+        beam_size: int,
+        vocab_size: int,
+        sos: int,
+        eos: int,
+        token_list: List[str] = None,
+        pre_beam_ratio: float = 1.5,
+        pre_beam_score_key: str = None,
+    ):
+        """Initialize beam search.
+
+        Args:
+            scorers (dict[str, ScorerInterface]): Dict of decoder modules
+                e.g., Decoder, CTCPrefixScorer, LM
+                The scorer will be ignored if it is `None`
+            weights (dict[str, float]): Dict of weights for each scorer
+                The scorer will be ignored if its weight is 0
+            beam_size (int): The number of hypotheses kept during search
+            vocab_size (int): The vocabulary size
+            sos (int): Start of sequence id
+            eos (int): End of sequence id
+            token_list (list[str]): List of tokens for debug log
+            pre_beam_score_key (str): key of scores to perform pre-beam search
+            pre_beam_ratio (float): beam size in the pre-beam search
+                will be `int(pre_beam_ratio * beam_size)`
+
+        """
+        super().__init__()
+        # set scorers
+        self.weights = weights
+        self.scorers = dict()
+        self.full_scorers = dict()
+        self.part_scorers = dict()
+        # this module dict is required for recursive cast
+        # `self.to(device, dtype)` in `recog.py`
+        self.nn_dict = torch.nn.ModuleDict()
+        for k, v in scorers.items():
+            w = weights.get(k, 0)
+            if w == 0 or v is None:
+                continue
+            assert isinstance(
+                v, ScorerInterface
+            ), f"{k} ({type(v)}) does not implement ScorerInterface"
+            self.scorers[k] = v
+            if isinstance(v, PartialScorerInterface):
+                self.part_scorers[k] = v
+            else:
+                self.full_scorers[k] = v
+            if isinstance(v, torch.nn.Module):
+                self.nn_dict[k] = v
+
+        # set configurations
+        self.sos = sos
+        self.eos = eos
+        self.token_list = token_list
+        self.pre_beam_size = int(pre_beam_ratio * beam_size)
+        self.beam_size = beam_size
+        self.n_vocab = vocab_size
+        if (
+            pre_beam_score_key is not None
+            and pre_beam_score_key != "full"
+            and pre_beam_score_key not in self.full_scorers
+        ):
+            raise KeyError(f"{pre_beam_score_key} is not found in {self.full_scorers}")
+        self.pre_beam_score_key = pre_beam_score_key
+        self.do_pre_beam = (
+            self.pre_beam_score_key is not None
+            and self.pre_beam_size < self.n_vocab
+            and len(self.part_scorers) > 0
+        )
+
+    def init_hyp(self, x: torch.Tensor) -> List[Hypothesis]:
+        """Get the initial hypothesis list.
+
+        Args:
+            x (torch.Tensor): The encoder output feature
+
+        Returns:
+            List[Hypothesis]: A one-element list holding the initial hypothesis.
+
+        """
+        init_states = dict()
+        init_scores = dict()
+        for k, d in self.scorers.items():
+            init_states[k] = d.init_state(x)
+            init_scores[k] = 0.0
+        return [
+            Hypothesis(
+                score=0.0,
+                scores=init_scores,
+                states=init_states,
+                yseq=torch.tensor([self.sos], device=x.device),
+            )
+        ]
+
+    @staticmethod
+    def append_token(xs: torch.Tensor, x: int) -> torch.Tensor:
+        """Append new token to prefix tokens.
+
+        Args:
+            xs (torch.Tensor): The prefix token
+            x (int): The new token to append
+
+        Returns:
+            torch.Tensor: New tensor contains: xs + [x] with xs.dtype and xs.device
+
+        """
+        x = torch.tensor([x], dtype=xs.dtype, device=xs.device)
+        return torch.cat((xs, x))
+
+    def score_full(
+        self, hyp: Hypothesis, x: torch.Tensor
+    ) -> Tuple[Dict[str, torch.Tensor], Dict[str, Any]]:
+        """Score new hypothesis by `self.full_scorers`.
+
+        Args:
+            hyp (Hypothesis): Hypothesis with prefix tokens to score
+            x (torch.Tensor): Corresponding input feature
+
+        Returns:
+            Tuple[Dict[str, torch.Tensor], Dict[str, Any]]: Tuple of
+                score dict of `hyp` that has string keys of `self.full_scorers`
+                and tensor score values of shape: `(self.n_vocab,)`,
+                and state dict that has string keys
+                and state values of `self.full_scorers`
+
+        """
+        scores = dict()
+        states = dict()
+        for k, d in self.full_scorers.items():
+            scores[k], states[k] = d.score(hyp.yseq, hyp.states[k], x)
+        return scores, states
+
+    def score_partial(
+        self, hyp: Hypothesis, ids: torch.Tensor, x: torch.Tensor
+    ) -> Tuple[Dict[str, torch.Tensor], Dict[str, Any]]:
+        """Score new hypothesis by `self.part_scorers`.
+
+        Args:
+            hyp (Hypothesis): Hypothesis with prefix tokens to score
+            ids (torch.Tensor): 1D tensor of new partial tokens to score
+            x (torch.Tensor): Corresponding input feature
+
+        Returns:
+            Tuple[Dict[str, torch.Tensor], Dict[str, Any]]: Tuple of
+                score dict of `hyp` that has string keys of `self.part_scorers`
+                and tensor score values of shape: `(len(ids),)`,
+                and state dict that has string keys
+                and state values of `self.part_scorers`
+
+        """
+        scores = dict()
+        states = dict()
+        for k, d in self.part_scorers.items():
+            scores[k], states[k] = d.score_partial(hyp.yseq, ids, hyp.states[k], x)
+        return scores, states
+
+    def beam(
+        self, weighted_scores: torch.Tensor, ids: torch.Tensor
+    ) -> Tuple[torch.Tensor, torch.Tensor]:
+        """Compute topk full token ids and partial token ids.
+
+        Args:
+            weighted_scores (torch.Tensor): The weighted sum scores for each token.
+                Its shape is `(self.n_vocab,)`.
+            ids (torch.Tensor): The partial token ids to compute topk
+
+        Returns:
+            Tuple[torch.Tensor, torch.Tensor]:
+                The topk full token ids and partial token ids.
+                Their shapes are `(self.beam_size,)`
+
+        """
+        # no pre beam performed
+        if weighted_scores.size(0) == ids.size(0):
+            top_ids = weighted_scores.topk(self.beam_size)[1]
+            return top_ids, top_ids
+
+        # mask pruned in pre-beam not to select in topk
+        tmp = weighted_scores[ids]
+        weighted_scores[:] = -float("inf")
+        weighted_scores[ids] = tmp
+        top_ids = weighted_scores.topk(self.beam_size)[1]
+        local_ids = weighted_scores[ids].topk(self.beam_size)[1]
+        return top_ids, local_ids
+
+    @staticmethod
+    def merge_scores(
+        prev_scores: Dict[str, float],
+        next_full_scores: Dict[str, torch.Tensor],
+        full_idx: int,
+        next_part_scores: Dict[str, torch.Tensor],
+        part_idx: int,
+    ) -> Dict[str, torch.Tensor]:
+        """Merge scores for new hypothesis.
+
+        Args:
+            prev_scores (Dict[str, float]):
+                The previous hypothesis scores by `self.scorers`
+            next_full_scores (Dict[str, torch.Tensor]): scores by `self.full_scorers`
+            full_idx (int): The next token id for `next_full_scores`
+            next_part_scores (Dict[str, torch.Tensor]):
+                scores of partial tokens by `self.part_scorers`
+            part_idx (int): The new token id for `next_part_scores`
+
+        Returns:
+            Dict[str, torch.Tensor]: The new score dict.
+                Its keys are names of `self.full_scorers` and `self.part_scorers`.
+                Its values are scalar tensors by the scorers.
+
+        """
+        new_scores = dict()
+        for k, v in next_full_scores.items():
+            new_scores[k] = prev_scores[k] + v[full_idx]
+        for k, v in next_part_scores.items():
+            new_scores[k] = prev_scores[k] + v[part_idx]
+        return new_scores
+
+    def merge_states(self, states: Any, part_states: Any, part_idx: int) -> Any:
+        """Merge states for new hypothesis.
+
+        Args:
+            states: states of `self.full_scorers`
+            part_states: states of `self.part_scorers`
+            part_idx (int): The new token id for `part_scores`
+
+        Returns:
+            Dict[str, Any]: The new state dict.
+                Its keys are names of `self.full_scorers` and `self.part_scorers`.
+                Its values are states of the scorers.
+
+        """
+        new_states = dict()
+        for k, v in states.items():
+            new_states[k] = v
+        for k, d in self.part_scorers.items():
+            new_states[k] = d.select_state(part_states[k], part_idx)
+        return new_states
+
+    def search(
+        self, running_hyps: List[Hypothesis], x: torch.Tensor
+    ) -> List[Hypothesis]:
+        """Search new tokens for running hypotheses and encoded speech x.
+
+        Args:
+            running_hyps (List[Hypothesis]): Running hypotheses on beam
+            x (torch.Tensor): Encoded speech feature (T, D)
+
+        Returns:
+            List[Hypothesis]: Best sorted hypotheses
+
+        """
+        best_hyps = []
+        part_ids = torch.arange(self.n_vocab, device=x.device)  # no pre-beam
+        for hyp in running_hyps:
+            # scoring
+            weighted_scores = torch.zeros(self.n_vocab, dtype=x.dtype, device=x.device)
+            scores, states = self.score_full(hyp, x)
+            for k in self.full_scorers:
+                weighted_scores += self.weights[k] * scores[k]
+            # partial scoring
+            if self.do_pre_beam:
+                pre_beam_scores = (
+                    weighted_scores
+                    if self.pre_beam_score_key == "full"
+                    else scores[self.pre_beam_score_key]
+                )
+                part_ids = torch.topk(pre_beam_scores, self.pre_beam_size)[1]
+            part_scores, part_states = self.score_partial(hyp, part_ids, x)
+            for k in self.part_scorers:
+                weighted_scores[part_ids] += self.weights[k] * part_scores[k]
+            # add previous hyp score
+            weighted_scores += hyp.score
+
+            # update hyps
+            for j, part_j in zip(*self.beam(weighted_scores, part_ids)):
+                # will be (2 x beam at most)
+                best_hyps.append(
+                    Hypothesis(
+                        score=weighted_scores[j],
+                        yseq=self.append_token(hyp.yseq, j),
+                        scores=self.merge_scores(
+                            hyp.scores, scores, j, part_scores, part_j
+                        ),
+                        states=self.merge_states(states, part_states, part_j),
+                    )
+                )
+
+            # sort and prune 2 x beam -> beam
+            best_hyps = sorted(best_hyps, key=lambda x: x.score, reverse=True)[
+                : min(len(best_hyps), self.beam_size)
+            ]
+        return best_hyps
+
+    def forward(
+        self, x: torch.Tensor, maxlenratio: float = 0.0, minlenratio: float = 0.0
+    ) -> List[Hypothesis]:
+        """Perform beam search.
+
+        Args:
+            x (torch.Tensor): Encoded speech feature (T, D)
+            maxlenratio (float): Input length ratio to obtain max output length.
+                If maxlenratio=0.0 (default), it uses an end-detect function
+                to automatically find maximum hypothesis lengths
+                If maxlenratio<0.0, its absolute value is interpreted
+                as a constant max output length.
+            minlenratio (float): Input length ratio to obtain min output length.
+
+        Returns:
+            list[Hypothesis]: N-best decoding results
+
+        """
+        # set length bounds
+        if maxlenratio == 0:
+            maxlen = x.shape[0]
+        elif maxlenratio < 0:
+            maxlen = -1 * int(maxlenratio)
+        else:
+            maxlen = max(1, int(maxlenratio * x.size(0)))
+        minlen = int(minlenratio * x.size(0))
+        logging.info("decoder input length: " + str(x.shape[0]))
+        logging.info("max output length: " + str(maxlen))
+        logging.info("min output length: " + str(minlen))
+
+        # main loop of prefix search
+        running_hyps = self.init_hyp(x)
+        ended_hyps = []
+        for i in range(maxlen):
+            logging.debug("position " + str(i))
+            best = self.search(running_hyps, x)
+            # post process of one iteration
+            running_hyps = self.post_process(i, maxlen, maxlenratio, best, ended_hyps)
+            # end detection
+            if maxlenratio == 0.0 and end_detect([h.asdict() for h in ended_hyps], i):
+                logging.info(f"end detected at {i}")
+                break
+            if len(running_hyps) == 0:
+                logging.info("no hypothesis. Finish decoding.")
+                break
+            else:
+                logging.debug(f"remained hypotheses: {len(running_hyps)}")
+
+        nbest_hyps = sorted(ended_hyps, key=lambda x: x.score, reverse=True)
+        # check the number of hypotheses that reached eos
+        if len(nbest_hyps) == 0:
+            logging.warning(
+                "there are no N-best results, perform recognition "
+                "again with smaller minlenratio."
+            )
+            return (
+                []
+                if minlenratio < 0.1
+                else self.forward(x, maxlenratio, max(0.0, minlenratio - 0.1))
+            )
+
+        # report the best result
+        best = nbest_hyps[0]
+        for k, v in best.scores.items():
+            logging.info(
+                f"{v:6.2f} * {self.weights[k]:3} = {v * self.weights[k]:6.2f} for {k}"
+            )
+        logging.info(f"total log probability: {best.score:.2f}")
+        logging.info(f"normalized log probability: {best.score / len(best.yseq):.2f}")
+        logging.info(f"total number of ended hypotheses: {len(nbest_hyps)}")
+        if self.token_list is not None:
+            logging.info(
+                "best hypo: "
+                + "".join([self.token_list[x] for x in best.yseq[1:-1]])
+                + "\n"
+            )
+        return nbest_hyps
+
+    def post_process(
+        self,
+        i: int,
+        maxlen: int,
+        maxlenratio: float,
+        running_hyps: List[Hypothesis],
+        ended_hyps: List[Hypothesis],
+    ) -> List[Hypothesis]:
+        """Perform post-processing of beam search iterations.
+
+        Args:
+            i (int): The length of hypothesis tokens.
+            maxlen (int): The maximum length of tokens in beam search.
+            maxlenratio (float): The maximum length ratio in beam search.
+            running_hyps (List[Hypothesis]): The running hypotheses in beam search.
+            ended_hyps (List[Hypothesis]): The ended hypotheses in beam search.
+
+        Returns:
+            List[Hypothesis]: The new running hypotheses.
+
+        """
+        logging.debug(f"the number of running hypotheses: {len(running_hyps)}")
+        if self.token_list is not None:
+            logging.debug(
+                "best hypo: "
+                + "".join([self.token_list[x] for x in running_hyps[0].yseq[1:]])
+            )
+        # add eos in the final loop to avoid that there are no ended hyps
+        if i == maxlen - 1:
+            logging.info("adding <eos> in the last position in the loop")
+            running_hyps = [
+                h._replace(yseq=self.append_token(h.yseq, self.eos))
+                for h in running_hyps
+            ]
+
+        # add ended hypotheses to a final list, and remove them from current hypotheses
+        # (caveat: this can leave fewer than beam_size running hypotheses)
+        remained_hyps = []
+        for hyp in running_hyps:
+            if hyp.yseq[-1] == self.eos:
+                # e.g., Word LM needs to add final <eos> score
+                for k, d in chain(self.full_scorers.items(), self.part_scorers.items()):
+                    s = d.final_score(hyp.states[k])
+                    hyp.scores[k] += s
+                    hyp = hyp._replace(score=hyp.score + self.weights[k] * s)
+                ended_hyps.append(hyp)
+            else:
+                remained_hyps.append(hyp)
+        return remained_hyps
+
+
+def beam_search(
+    x: torch.Tensor,
+    sos: int,
+    eos: int,
+    beam_size: int,
+    vocab_size: int,
+    scorers: Dict[str, ScorerInterface],
+    weights: Dict[str, float],
+    token_list: List[str] = None,
+    maxlenratio: float = 0.0,
+    minlenratio: float = 0.0,
+    pre_beam_ratio: float = 1.5,
+    pre_beam_score_key: str = "full",
+) -> list:
+    """Perform beam search with scorers.
+
+    Args:
+        x (torch.Tensor): Encoded speech feature (T, D)
+        sos (int): Start of sequence id
+        eos (int): End of sequence id
+        beam_size (int): The number of hypotheses kept during search
+        vocab_size (int): The vocabulary size
+        scorers (dict[str, ScorerInterface]): Dict of decoder modules
+            e.g., Decoder, CTCPrefixScorer, LM
+            The scorer will be ignored if it is `None`
+        weights (dict[str, float]): Dict of weights for each scorer
+            The scorer will be ignored if its weight is 0
+        token_list (list[str]): List of tokens for debug log
+        maxlenratio (float): Input length ratio to obtain max output length.
+            If maxlenratio=0.0 (default), it uses an end-detect function
+            to automatically find maximum hypothesis lengths
+        minlenratio (float): Input length ratio to obtain min output length.
+        pre_beam_score_key (str): key of scores to perform pre-beam search
+        pre_beam_ratio (float): beam size in the pre-beam search
+            will be `int(pre_beam_ratio * beam_size)`
+
+    Returns:
+        list: N-best decoding results
+
+    """
+    ret = BeamSearch(
+        scorers,
+        weights,
+        beam_size=beam_size,
+        vocab_size=vocab_size,
+        pre_beam_ratio=pre_beam_ratio,
+        pre_beam_score_key=pre_beam_score_key,
+        sos=sos,
+        eos=eos,
+        token_list=token_list,
+    ).forward(x=x, maxlenratio=maxlenratio, minlenratio=minlenratio)
+    return [h.asdict() for h in ret]
+
+class BeamSearchScama(torch.nn.Module):
+    """Beam search variant that forwards a mask and acoustic embeddings to scorers."""
+
+    def __init__(
+        self,
+        scorers: Dict[str, ScorerInterface],
+        weights: Dict[str, float],
+        beam_size: int,
+        vocab_size: int,
+        sos: int,
+        eos: int,
+        token_list: List[str] = None,
+        pre_beam_ratio: float = 1.5,
+        pre_beam_score_key: str = None,
+    ):
+        """Initialize beam search.
+
+        Args:
+            scorers (dict[str, ScorerInterface]): Dict of decoder modules
+                e.g., Decoder, CTCPrefixScorer, LM
+                The scorer will be ignored if it is `None`
+            weights (dict[str, float]): Dict of weights for each scorer
+                The scorer will be ignored if its weight is 0
+            beam_size (int): The number of hypotheses kept during search
+            vocab_size (int): The vocabulary size
+            sos (int): Start of sequence id
+            eos (int): End of sequence id
+            token_list (list[str]): List of tokens for debug log
+            pre_beam_score_key (str): key of scores to perform pre-beam search
+            pre_beam_ratio (float): beam size in the pre-beam search
+                will be `int(pre_beam_ratio * beam_size)`
+
+        """
+        super().__init__()
+        # set scorers
+        self.weights = weights
+        self.scorers = dict()
+        self.full_scorers = dict()
+        self.part_scorers = dict()
+        # this module dict is required for recursive cast
+        # `self.to(device, dtype)` in `recog.py`
+        self.nn_dict = torch.nn.ModuleDict()
+        for k, v in scorers.items():
+            w = weights.get(k, 0)
+            if w == 0 or v is None:
+                continue
+            assert isinstance(
+                v, ScorerInterface
+            ), f"{k} ({type(v)}) does not implement ScorerInterface"
+            self.scorers[k] = v
+            if isinstance(v, PartialScorerInterface):
+                self.part_scorers[k] = v
+            else:
+                self.full_scorers[k] = v
+            if isinstance(v, torch.nn.Module):
+                self.nn_dict[k] = v
+
+        # set configurations
+        self.sos = sos
+        self.eos = eos
+        self.token_list = token_list
+        self.pre_beam_size = int(pre_beam_ratio * beam_size)
+        self.beam_size = beam_size
+        self.n_vocab = vocab_size
+        if (
+            pre_beam_score_key is not None
+            and pre_beam_score_key != "full"
+            and pre_beam_score_key not in self.full_scorers
+        ):
+            raise KeyError(f"{pre_beam_score_key} is not found in {self.full_scorers}")
+        self.pre_beam_score_key = pre_beam_score_key
+        self.do_pre_beam = (
+            self.pre_beam_score_key is not None
+            and self.pre_beam_size < self.n_vocab
+            and len(self.part_scorers) > 0
+        )
+
+    def init_hyp(self, x: torch.Tensor) -> List[Hypothesis]:
+        """Get the initial hypothesis list.
+
+        Args:
+            x (torch.Tensor): The encoder output feature
+
+        Returns:
+            List[Hypothesis]: A one-element list holding the initial hypothesis.
+
+        """
+        init_states = dict()
+        init_scores = dict()
+        for k, d in self.scorers.items():
+            init_states[k] = d.init_state(x)
+            init_scores[k] = 0.0
+        return [
+            Hypothesis(
+                score=0.0,
+                scores=init_scores,
+                states=init_states,
+                yseq=torch.tensor([self.sos], device=x.device),
+            )
+        ]
+
+    @staticmethod
+    def append_token(xs: torch.Tensor, x: int) -> torch.Tensor:
+        """Append new token to prefix tokens.
+
+        Args:
+            xs (torch.Tensor): The prefix token
+            x (int): The new token to append
+
+        Returns:
+            torch.Tensor: New tensor contains: xs + [x] with xs.dtype and xs.device
+
+        """
+        x = torch.tensor([x], dtype=xs.dtype, device=xs.device)
+        return torch.cat((xs, x))
+
+    def score_full(
+        self, hyp: Hypothesis,
+        x: torch.Tensor,
+        x_mask: torch.Tensor = None,
+        pre_acoustic_embeds: torch.Tensor = None,
+    ) -> Tuple[Dict[str, torch.Tensor], Dict[str, Any]]:
+        """Score new hypothesis by `self.full_scorers`.
+
+        Args:
+            hyp (Hypothesis): Hypothesis with prefix tokens to score
+            x (torch.Tensor): Corresponding input feature
+            x_mask (torch.Tensor): Optional mask forwarded to each full scorer
+            pre_acoustic_embeds (torch.Tensor): Optional acoustic embeddings
+                forwarded to each full scorer
+
+        Returns:
+            Tuple[Dict[str, torch.Tensor], Dict[str, Any]]: Scores of shape
+                `(self.n_vocab,)` and states, both keyed by `self.full_scorers`
+
+        """
+        scores = dict()
+        states = dict()
+        for k, d in self.full_scorers.items():
+            scores[k], states[k] = d.score(hyp.yseq, hyp.states[k], x, x_mask=x_mask, pre_acoustic_embeds=pre_acoustic_embeds)
+        return scores, states
+
+    def score_partial(
+        self, hyp: Hypothesis, ids: torch.Tensor, x: torch.Tensor
+    ) -> Tuple[Dict[str, torch.Tensor], Dict[str, Any]]:
+        """Score new hypothesis by `self.part_scorers`.
+
+        Args:
+            hyp (Hypothesis): Hypothesis with prefix tokens to score
+            ids (torch.Tensor): 1D tensor of new partial tokens to score
+            x (torch.Tensor): Corresponding input feature
+
+        Returns:
+            Tuple[Dict[str, torch.Tensor], Dict[str, Any]]: Tuple of
+                score dict of `hyp` that has string keys of `self.part_scorers`
+                and tensor score values of shape: `(len(ids),)`,
+                and state dict that has string keys
+                and state values of `self.part_scorers`
+
+        """
+        scores = dict()
+        states = dict()
+        for k, d in self.part_scorers.items():
+            scores[k], states[k] = d.score_partial(hyp.yseq, ids, hyp.states[k], x)
+        return scores, states
+
+    def beam(
+        self, weighted_scores: torch.Tensor, ids: torch.Tensor
+    ) -> Tuple[torch.Tensor, torch.Tensor]:
+        """Compute topk full token ids and partial token ids.
+
+        Args:
+            weighted_scores (torch.Tensor): The weighted sum scores for each token.
+                Its shape is `(self.n_vocab,)`.
+            ids (torch.Tensor): The partial token ids to compute topk
+
+        Returns:
+            Tuple[torch.Tensor, torch.Tensor]:
+                The topk full token ids and partial token ids.
+                Their shapes are `(self.beam_size,)`
+
+        """
+        # no pre beam performed
+        if weighted_scores.size(0) == ids.size(0):
+            top_ids = weighted_scores.topk(self.beam_size)[1]
+            return top_ids, top_ids
+
+        # mask pruned in pre-beam not to select in topk
+        tmp = weighted_scores[ids]
+        weighted_scores[:] = -float("inf")
+        weighted_scores[ids] = tmp
+        top_ids = weighted_scores.topk(self.beam_size)[1]
+        local_ids = weighted_scores[ids].topk(self.beam_size)[1]
+        return top_ids, local_ids
+
+    @staticmethod
+    def merge_scores(
+        prev_scores: Dict[str, float],
+        next_full_scores: Dict[str, torch.Tensor],
+        full_idx: int,
+        next_part_scores: Dict[str, torch.Tensor],
+        part_idx: int,
+    ) -> Dict[str, torch.Tensor]:
+        """Merge scores for new hypothesis.
+
+        Args:
+            prev_scores (Dict[str, float]):
+                The previous hypothesis scores by `self.scorers`
+            next_full_scores (Dict[str, torch.Tensor]): scores by `self.full_scorers`
+            full_idx (int): The next token id for `next_full_scores`
+            next_part_scores (Dict[str, torch.Tensor]):
+                scores of partial tokens by `self.part_scorers`
+            part_idx (int): The new token id for `next_part_scores`
+
+        Returns:
+            Dict[str, torch.Tensor]: The new score dict.
+                Its keys are names of `self.full_scorers` and `self.part_scorers`.
+                Its values are scalar tensors by the scorers.
+
+        """
+        new_scores = dict()
+        for k, v in next_full_scores.items():
+            new_scores[k] = prev_scores[k] + v[full_idx]
+        for k, v in next_part_scores.items():
+            new_scores[k] = prev_scores[k] + v[part_idx]
+        return new_scores
+
+    def merge_states(self, states: Any, part_states: Any, part_idx: int) -> Any:
+        """Merge states for new hypothesis.
+
+        Args:
+            states: states of `self.full_scorers`
+            part_states: states of `self.part_scorers`
+            part_idx (int): The new token id for `part_scores`
+
+        Returns:
+            Dict[str, Any]: The new state dict.
+                Its keys are names of `self.full_scorers` and `self.part_scorers`.
+                Its values are states of the scorers.
+
+        """
+        new_states = dict()
+        for k, v in states.items():
+            new_states[k] = v
+        for k, d in self.part_scorers.items():
+            new_states[k] = d.select_state(part_states[k], part_idx)
+        return new_states
+
+    def search(
+        self, running_hyps: List[Hypothesis],
+        x: torch.Tensor,
+        x_mask: torch.Tensor = None,
+        pre_acoustic_embeds: torch.Tensor = None,
+    ) -> List[Hypothesis]:
+        """Search new tokens for running hypotheses and encoded speech x.
+
+        Args:
+            running_hyps (List[Hypothesis]): Running hypotheses on beam
+            x (torch.Tensor): Encoded speech feature (T, D)
+
+        Returns:
+            List[Hypothesis]: Best sorted hypotheses
+
+        """
+        best_hyps = []
+        part_ids = torch.arange(self.n_vocab, device=x.device)  # no pre-beam
+        for hyp in running_hyps:
+            # scoring
+            weighted_scores = torch.zeros(self.n_vocab, dtype=x.dtype, device=x.device)
+            scores, states = self.score_full(hyp, x, x_mask=x_mask, pre_acoustic_embeds=pre_acoustic_embeds)
+            for k in self.full_scorers:
+                weighted_scores += self.weights[k] * scores[k]
+            # partial scoring
+            if self.do_pre_beam:
+                pre_beam_scores = (
+                    weighted_scores
+                    if self.pre_beam_score_key == "full"
+                    else scores[self.pre_beam_score_key]
+                )
+                part_ids = torch.topk(pre_beam_scores, self.pre_beam_size)[1]
+            part_scores, part_states = self.score_partial(hyp, part_ids, x)
+            for k in self.part_scorers:
+                weighted_scores[part_ids] += self.weights[k] * part_scores[k]
+            # add previous hyp score
+            weighted_scores += hyp.score
+
+            # update hyps
+            for j, part_j in zip(*self.beam(weighted_scores, part_ids)):
+                # will be (2 x beam at most)
+                best_hyps.append(
+                    Hypothesis(
+                        score=weighted_scores[j],
+                        yseq=self.append_token(hyp.yseq, j),
+                        scores=self.merge_scores(
+                            hyp.scores, scores, j, part_scores, part_j
+                        ),
+                        states=self.merge_states(states, part_states, part_j),
+                    )
+                )
+
+            # sort and prune 2 x beam -> beam
+            best_hyps = sorted(best_hyps, key=lambda x: x.score, reverse=True)[
+                : min(len(best_hyps), self.beam_size)
+            ]
+        return best_hyps
+
+    def forward(
+        self, x: torch.Tensor,
+        scama_mask: torch.Tensor = None,
+        pre_acoustic_embeds: torch.Tensor = None,
+        maxlenratio: float = 0.0,
+        minlenratio: float = 0.0,
+        maxlen: int = None,
+        minlen: int = 0,
+    ) -> List[Hypothesis]:
+        """Perform beam search.
+
+        Args:
+            x (torch.Tensor): Encoded speech feature (T, D)
+            maxlenratio (float): Input length ratio to obtain max output length.
+                If maxlenratio=0.0 (default), it uses an end-detect function
+                to automatically find maximum hypothesis lengths
+                If maxlenratio<0.0, its absolute value is interpreted
+                as a constant max output length.
+            minlenratio (float): Input length ratio to obtain min output length.
+
+        Returns:
+            list[Hypothesis]: N-best decoding results
+
+        """
+        if maxlen is None:
+            # set length bounds
+            if maxlenratio == 0:
+                maxlen = x.shape[0]
+            elif maxlenratio < 0:
+                maxlen = -1 * int(maxlenratio)
+            else:
+                maxlen = max(1, int(maxlenratio * x.size(0)))
+            minlen = int(minlenratio * x.size(0))
+
+        logging.info("decoder input length: " + str(x.shape[0]))
+        logging.info("max output length: " + str(maxlen))
+        logging.info("min output length: " + str(minlen))
+
+        # main loop of prefix search
+        running_hyps = self.init_hyp(x)
+        ended_hyps = []
+        for i in range(maxlen):
+            logging.debug("position " + str(i))
+            mask_enc = None
+            if scama_mask is not None:
+                token_num_predictor = scama_mask.size(1)
+                token_id_slice = min(i, token_num_predictor-1)
+                mask_enc = scama_mask[:, token_id_slice:token_id_slice+1, :]
+                # if mask_enc.size(1) == 0:
+                #     mask_enc = scama_mask[:, -2:-1, :]
+                #     # mask_enc = torch.zeros_like(mask_enc)
+            pre_acoustic_embeds_cur = None
+            if pre_acoustic_embeds is not None:
+                b, t, d = pre_acoustic_embeds.size()
+                pad = torch.zeros((b, 1, d), dtype=pre_acoustic_embeds.dtype, device=pre_acoustic_embeds.device)
+                pre_acoustic_embeds = torch.cat((pre_acoustic_embeds, pad), dim=1)
+                token_id_slice = min(i, t)
+                pre_acoustic_embeds_cur = pre_acoustic_embeds[:, token_id_slice:token_id_slice+1, :]
+
+            best = self.search(running_hyps, x, x_mask=mask_enc, pre_acoustic_embeds=pre_acoustic_embeds_cur)
+            # post process of one iteration
+            running_hyps = self.post_process(i, maxlen, maxlenratio, best, ended_hyps)
+            # end detection
+            if maxlenratio == 0.0 and end_detect([h.asdict() for h in ended_hyps], i):
+                logging.info(f"end detected at {i}")
+                break
+            if len(running_hyps) == 0:
+                logging.info("no hypothesis. Finish decoding.")
+                break
+            else:
+                logging.debug(f"remained hypotheses: {len(running_hyps)}")
+
+        nbest_hyps = sorted(ended_hyps, key=lambda x: x.score, reverse=True)
+        # check the number of hypotheses reaching to eos
+        if len(nbest_hyps) == 0:
+            logging.warning(
+                "there are no N-best results, perform recognition "
+                "again with smaller minlenratio."
+            )
+            return (
+                []
+                if minlenratio < 0.1
+                else self.forward(x, scama_mask, pre_acoustic_embeds, maxlenratio, max(0.0, minlenratio - 0.1))
+            )
+
+        # report the best result
+        for hyp in nbest_hyps:
+            yseq = "".join(self.token_list[t] for t in hyp.yseq) if self.token_list is not None else ""
+            logging.debug("nbest: y: {}, yseq: {}, score: {}".format(hyp.yseq, yseq, hyp.score))
+        best = nbest_hyps[0]
+        for k, v in best.scores.items():
+            logging.info(
+                f"{v:6.2f} * {self.weights[k]:3} = {v * self.weights[k]:6.2f} for {k}"
+            )
+        logging.info(f"total log probability: {best.score:.2f}")
+        logging.info(f"normalized log probability: {best.score / len(best.yseq):.2f}")
+        logging.info(f"total number of ended hypotheses: {len(nbest_hyps)}")
+        if self.token_list is not None:
+            logging.info(
+                "best hypo: "
+                + "".join([self.token_list[x] for x in best.yseq[1:-1]])
+                + "\n"
+            )
+        return nbest_hyps
+
+    def post_process(
+        self,
+        i: int,
+        maxlen: int,
+        maxlenratio: float,
+        running_hyps: List[Hypothesis],
+        ended_hyps: List[Hypothesis],
+    ) -> List[Hypothesis]:
+        """Perform post-processing of beam search iterations.
+
+        Args:
+            i (int): The length of hypothesis tokens.
+            maxlen (int): The maximum length of tokens in beam search.
+            maxlenratio (float): The maximum length ratio in beam search.
+            running_hyps (List[Hypothesis]): The running hypotheses in beam search.
+            ended_hyps (List[Hypothesis]): The ended hypotheses in beam search.
+
+        Returns:
+            List[Hypothesis]: The new running hypotheses.
+
+        """
+        logging.debug(f"the number of running hypotheses: {len(running_hyps)}")
+        if self.token_list is not None:
+            logging.debug(
+                "best hypo: "
+                + "".join([self.token_list[x] for x in running_hyps[0].yseq[1:]])
+            )
+        # add eos in the final loop so that at least one hypothesis ends
+        if i == maxlen - 1:
+            logging.info("adding <eos> in the last position in the loop")
+            running_hyps = [
+                h._replace(yseq=self.append_token(h.yseq, self.eos))
+                for h in running_hyps
+            ]
+
+        # add ended hypotheses to a final list, and remove them from current hypotheses
+        # (this will be a problem, number of hyps < beam)
+        remained_hyps = []
+        for hyp in running_hyps:
+            if hyp.yseq[-1] == self.eos:
+                # e.g., Word LM needs to add final <eos> score
+                for k, d in chain(self.full_scorers.items(), self.part_scorers.items()):
+                    s = d.final_score(hyp.states[k])
+                    hyp.scores[k] += s
+                    hyp = hyp._replace(score=hyp.score + self.weights[k] * s)
+                ended_hyps.append(hyp)
+            else:
+                remained_hyps.append(hyp)
+        return remained_hyps
+
+class BeamSearchPara(torch.nn.Module):
+    """Beam search implementation."""
+
+    def __init__(
+        self,
+        scorers: Dict[str, ScorerInterface],
+        weights: Dict[str, float],
+        beam_size: int,
+        vocab_size: int,
+        sos: int,
+        eos: int,
+        token_list: List[str] = None,
+        pre_beam_ratio: float = 1.5,
+        pre_beam_score_key: str = None,
+    ):
+        """Initialize beam search.
+
+        Args:
+            scorers (dict[str, ScorerInterface]): Dict of decoder modules
+                e.g., Decoder, CTCPrefixScorer, LM
+                The scorer will be ignored if it is `None`
+            weights (dict[str, float]): Dict of weights for each scorer
+                The scorer will be ignored if its weight is 0
+            beam_size (int): The number of hypotheses kept during search
+            vocab_size (int): The vocabulary size
+            sos (int): Start of sequence id
+            eos (int): End of sequence id
+            token_list (list[str]): List of tokens for debug log
+            pre_beam_score_key (str): key of scores to perform pre-beam search
+            pre_beam_ratio (float): beam size in the pre-beam search
+                will be `int(pre_beam_ratio * beam_size)`
+
+        """
+        super().__init__()
+        # set scorers
+        self.weights = weights
+        self.scorers = dict()
+        self.full_scorers = dict()
+        self.part_scorers = dict()
+        # this module dict is required for recursive cast
+        # `self.to(device, dtype)` in `recog.py`
+        self.nn_dict = torch.nn.ModuleDict()
+        for k, v in scorers.items():
+            w = weights.get(k, 0)
+            if w == 0 or v is None:
+                continue
+            assert isinstance(
+                v, ScorerInterface
+            ), f"{k} ({type(v)}) does not implement ScorerInterface"
+            self.scorers[k] = v
+            if isinstance(v, PartialScorerInterface):
+                self.part_scorers[k] = v
+            else:
+                self.full_scorers[k] = v
+            if isinstance(v, torch.nn.Module):
+                self.nn_dict[k] = v
+
+        # set configurations
+        self.sos = sos
+        self.eos = eos
+        self.token_list = token_list
+        self.pre_beam_size = int(pre_beam_ratio * beam_size)
+        self.beam_size = beam_size
+        self.n_vocab = vocab_size
+        if (
+            pre_beam_score_key is not None
+            and pre_beam_score_key != "full"
+            and pre_beam_score_key not in self.full_scorers
+        ):
+            raise KeyError(f"{pre_beam_score_key} is not found in {self.full_scorers}")
+        self.pre_beam_score_key = pre_beam_score_key
+        self.do_pre_beam = (
+            self.pre_beam_score_key is not None
+            and self.pre_beam_size < self.n_vocab
+            and len(self.part_scorers) > 0
+        )
+
+    def init_hyp(self, x: torch.Tensor) -> List[Hypothesis]:
+        """Get the initial hypothesis.
+
+        Args:
+            x (torch.Tensor): The encoder output feature
+
+        Returns:
+            List[Hypothesis]: A list containing the initial hypothesis.
+
+        """
+        init_states = dict()
+        init_scores = dict()
+        for k, d in self.scorers.items():
+            init_states[k] = d.init_state(x)
+            init_scores[k] = 0.0
+        return [
+            Hypothesis(
+                score=0.0,
+                scores=init_scores,
+                states=init_states,
+                yseq=torch.tensor([self.sos], device=x.device),
+            )
+        ]
+
+    @staticmethod
+    def append_token(xs: torch.Tensor, x: int) -> torch.Tensor:
+        """Append new token to prefix tokens.
+
+        Args:
+            xs (torch.Tensor): The prefix token
+            x (int): The new token to append
+
+        Returns:
+            torch.Tensor: New tensor containing xs + [x], with xs.dtype and xs.device
+
+        """
+        x = torch.tensor([x], dtype=xs.dtype, device=xs.device)
+        return torch.cat((xs, x))
+
+    def score_full(
+        self, hyp: Hypothesis, x: torch.Tensor
+    ) -> Tuple[Dict[str, torch.Tensor], Dict[str, Any]]:
+        """Score new hypothesis by `self.full_scorers`.
+
+        Args:
+            hyp (Hypothesis): Hypothesis with prefix tokens to score
+            x (torch.Tensor): Corresponding input feature
+
+        Returns:
+            Tuple[Dict[str, torch.Tensor], Dict[str, Any]]: Tuple of
+                score dict of `hyp` that has string keys of `self.full_scorers`
+                and tensor score values of shape: `(self.n_vocab,)`,
+                and state dict that has string keys
+                and state values of `self.full_scorers`
+
+        """
+        scores = dict()
+        states = dict()
+        for k, d in self.full_scorers.items():
+            scores[k], states[k] = d.score(hyp.yseq, hyp.states[k], x)
+        return scores, states
+
+    def score_partial(
+        self, hyp: Hypothesis, ids: torch.Tensor, x: torch.Tensor
+    ) -> Tuple[Dict[str, torch.Tensor], Dict[str, Any]]:
+        """Score new hypothesis by `self.part_scorers`.
+
+        Args:
+            hyp (Hypothesis): Hypothesis with prefix tokens to score
+            ids (torch.Tensor): 1D tensor of new partial tokens to score
+            x (torch.Tensor): Corresponding input feature
+
+        Returns:
+            Tuple[Dict[str, torch.Tensor], Dict[str, Any]]: Tuple of
+                score dict of `hyp` that has string keys of `self.part_scorers`
+                and tensor score values of shape: `(len(ids),)`,
+                and state dict that has string keys
+                and state values of `self.part_scorers`
+
+        """
+        scores = dict()
+        states = dict()
+        for k, d in self.part_scorers.items():
+            scores[k], states[k] = d.score_partial(hyp.yseq, ids, hyp.states[k], x)
+        return scores, states
+
+    def beam(
+        self, weighted_scores: torch.Tensor, ids: torch.Tensor
+    ) -> Tuple[torch.Tensor, torch.Tensor]:
+        """Compute topk full token ids and partial token ids.
+
+        Args:
+            weighted_scores (torch.Tensor): The weighted sum scores for each token.
+                Its shape is `(self.n_vocab,)`.
+            ids (torch.Tensor): The partial token ids to compute topk
+
+        Returns:
+            Tuple[torch.Tensor, torch.Tensor]:
+                The topk full token ids and partial token ids.
+                Their shapes are `(self.beam_size,)`
+
+        """
+        # no pre beam performed
+        if weighted_scores.size(0) == ids.size(0):
+            top_ids = weighted_scores.topk(self.beam_size)[1]
+            return top_ids, top_ids
+
+        # mask pruned in pre-beam not to select in topk
+        tmp = weighted_scores[ids]
+        weighted_scores[:] = -float("inf")
+        weighted_scores[ids] = tmp
+        top_ids = weighted_scores.topk(self.beam_size)[1]
+        local_ids = weighted_scores[ids].topk(self.beam_size)[1]
+        return top_ids, local_ids
+
+    @staticmethod
+    def merge_scores(
+        prev_scores: Dict[str, float],
+        next_full_scores: Dict[str, torch.Tensor],
+        full_idx: int,
+        next_part_scores: Dict[str, torch.Tensor],
+        part_idx: int,
+    ) -> Dict[str, torch.Tensor]:
+        """Merge scores for new hypothesis.
+
+        Args:
+            prev_scores (Dict[str, float]):
+                The previous hypothesis scores by `self.scorers`
+            next_full_scores (Dict[str, torch.Tensor]): scores by `self.full_scorers`
+            full_idx (int): The next token id for `next_full_scores`
+            next_part_scores (Dict[str, torch.Tensor]):
+                scores of partial tokens by `self.part_scorers`
+            part_idx (int): The new token id for `next_part_scores`
+
+        Returns:
+            Dict[str, torch.Tensor]: The new score dict.
+                Its keys are names of `self.full_scorers` and `self.part_scorers`.
+                Its values are scalar tensors by the scorers.
+
+        """
+        new_scores = dict()
+        for k, v in next_full_scores.items():
+            new_scores[k] = prev_scores[k] + v[full_idx]
+        for k, v in next_part_scores.items():
+            new_scores[k] = prev_scores[k] + v[part_idx]
+        return new_scores
+
+    def merge_states(self, states: Any, part_states: Any, part_idx: int) -> Any:
+        """Merge states for new hypothesis.
+
+        Args:
+            states: states of `self.full_scorers`
+            part_states: states of `self.part_scorers`
+            part_idx (int): The new token id for `part_states`
+
+        Returns:
+            Dict[str, Any]: The new state dict.
+                Its keys are names of `self.full_scorers` and `self.part_scorers`.
+                Its values are states of the scorers.
+
+        """
+        new_states = dict()
+        for k, v in states.items():
+            new_states[k] = v
+        for k, d in self.part_scorers.items():
+            new_states[k] = d.select_state(part_states[k], part_idx)
+        return new_states
+
+    def search(
+        self, running_hyps: List[Hypothesis], x: torch.Tensor, am_score: torch.Tensor
+    ) -> List[Hypothesis]:
+        """Search new tokens for running hypotheses and encoded speech x.
+
+        Args:
+            running_hyps (List[Hypothesis]): Running hypotheses on beam
+            x (torch.Tensor): Encoded speech feature (T, D)
+
+        Returns:
+            List[Hypothesis]: Best sorted hypotheses
+
+        """
+        best_hyps = []
+        part_ids = torch.arange(self.n_vocab, device=x.device)  # no pre-beam
+        for hyp in running_hyps:
+            # scoring
+            weighted_scores = torch.zeros(self.n_vocab, dtype=x.dtype, device=x.device)
+            weighted_scores += am_score
+            scores, states = self.score_full(hyp, x)
+            for k in self.full_scorers:
+                weighted_scores += self.weights[k] * scores[k]
+            # partial scoring
+            if self.do_pre_beam:
+                pre_beam_scores = (
+                    weighted_scores
+                    if self.pre_beam_score_key == "full"
+                    else scores[self.pre_beam_score_key]
+                )
+                part_ids = torch.topk(pre_beam_scores, self.pre_beam_size)[1]
+            part_scores, part_states = self.score_partial(hyp, part_ids, x)
+            for k in self.part_scorers:
+                weighted_scores[part_ids] += self.weights[k] * part_scores[k]
+            # add previous hyp score
+            weighted_scores += hyp.score
+
+            # update hyps
+            for j, part_j in zip(*self.beam(weighted_scores, part_ids)):
+                # will be (2 x beam at most)
+                best_hyps.append(
+                    Hypothesis(
+                        score=weighted_scores[j],
+                        yseq=self.append_token(hyp.yseq, j),
+                        scores=self.merge_scores(
+                            hyp.scores, scores, j, part_scores, part_j
+                        ),
+                        states=self.merge_states(states, part_states, part_j),
+                    )
+                )
+
+            # sort and prune 2 x beam -> beam
+            best_hyps = sorted(best_hyps, key=lambda x: x.score, reverse=True)[
+                : min(len(best_hyps), self.beam_size)
+            ]
+        return best_hyps
+
+    def forward(
+        self, x: torch.Tensor, am_scores: torch.Tensor, maxlenratio: float = 0.0, minlenratio: float = 0.0
+    ) -> List[Hypothesis]:
+        """Perform beam search.
+
+        Args:
+            x (torch.Tensor): Encoded speech feature (T, D)
+            maxlenratio (float): Input length ratio to obtain max output length.
+                If maxlenratio=0.0 (default), it uses an end-detect function
+                to automatically find maximum hypothesis lengths
+                If maxlenratio<0.0, its absolute value is interpreted
+                as a constant max output length.
+            minlenratio (float): Input length ratio to obtain min output length.
+
+        Returns:
+            list[Hypothesis]: N-best decoding results
+
+        """
+        # set length bounds
+        maxlen = am_scores.shape[0]
+        logging.info("decoder input length: " + str(x.shape[0]))
+        logging.info("max output length: " + str(maxlen))
+
+        # main loop of prefix search
+        running_hyps = self.init_hyp(x)
+        ended_hyps = []
+        for i in range(maxlen):
+            logging.debug("position " + str(i))
+            best = self.search(running_hyps, x, am_scores[i])
+            # post process of one iteration
+            running_hyps = self.post_process(i, maxlen, maxlenratio, best, ended_hyps)
+            # end detection
+            if maxlenratio == 0.0 and end_detect([h.asdict() for h in ended_hyps], i):
+                logging.info(f"end detected at {i}")
+                break
+            if len(running_hyps) == 0:
+                logging.info("no hypothesis. Finish decoding.")
+                break
+            else:
+                logging.debug(f"remained hypotheses: {len(running_hyps)}")
+
+        nbest_hyps = sorted(ended_hyps, key=lambda x: x.score, reverse=True)
+        # check the number of hypotheses reaching to eos
+        if len(nbest_hyps) == 0:
+            logging.warning(
+                "there are no N-best results, perform recognition "
+                "again with smaller minlenratio."
+            )
+            return (
+                []
+                if minlenratio < 0.1
+                else self.forward(x, am_scores, maxlenratio, max(0.0, minlenratio - 0.1))
+            )
+
+        # report the best result
+        best = nbest_hyps[0]
+        for k, v in best.scores.items():
+            logging.info(
+                f"{v:6.2f} * {self.weights[k]:3} = {v * self.weights[k]:6.2f} for {k}"
+            )
+        logging.info(f"total log probability: {best.score:.2f}")
+        logging.info(f"normalized log probability: {best.score / len(best.yseq):.2f}")
+        logging.info(f"total number of ended hypotheses: {len(nbest_hyps)}")
+        if self.token_list is not None:
+            logging.info(
+                "best hypo: "
+                + "".join([self.token_list[x] for x in best.yseq[1:-1]])
+                + "\n"
+            )
+        return nbest_hyps
+
+    def post_process(
+        self,
+        i: int,
+        maxlen: int,
+        maxlenratio: float,
+        running_hyps: List[Hypothesis],
+        ended_hyps: List[Hypothesis],
+    ) -> List[Hypothesis]:
+        """Perform post-processing of beam search iterations.
+
+        Args:
+            i (int): The length of hypothesis tokens.
+            maxlen (int): The maximum length of tokens in beam search.
+            maxlenratio (float): The maximum length ratio in beam search.
+            running_hyps (List[Hypothesis]): The running hypotheses in beam search.
+            ended_hyps (List[Hypothesis]): The ended hypotheses in beam search.
+
+        Returns:
+            List[Hypothesis]: The new running hypotheses.
+
+        """
+        logging.debug(f"the number of running hypotheses: {len(running_hyps)}")
+        if self.token_list is not None:
+            logging.debug(
+                "best hypo: "
+                + "".join([self.token_list[x] for x in running_hyps[0].yseq[1:]])
+            )
+        # add eos in the final loop so that at least one hypothesis ends
+        if i == maxlen - 1:
+            logging.info("adding <eos> in the last position in the loop")
+            running_hyps = [
+                h._replace(yseq=self.append_token(h.yseq, self.eos))
+                for h in running_hyps
+            ]
+
+        # add ended hypotheses to a final list, and remove them from current hypotheses
+        # (this will be a problem, number of hyps < beam)
+        remained_hyps = []
+        for hyp in running_hyps:
+            if hyp.yseq[-1] == self.eos:
+                # e.g., Word LM needs to add final <eos> score
+                for k, d in chain(self.full_scorers.items(), self.part_scorers.items()):
+                    s = d.final_score(hyp.states[k])
+                    hyp.scores[k] += s
+                    hyp = hyp._replace(score=hyp.score + self.weights[k] * s)
+                ended_hyps.append(hyp)
+            else:
+                remained_hyps.append(hyp)
+        return remained_hyps
+
diff --git a/funasr/modules/dynamic_conv.py b/funasr/modules/dynamic_conv.py
new file mode 100644
index 000000000..8a2a0c1ea
--- /dev/null
+++ b/funasr/modules/dynamic_conv.py
@@ -0,0 +1,125 @@
+"""Dynamic Convolution module."""
+
+import numpy
+import torch
+from torch import nn
+import torch.nn.functional as F
+
+
+MIN_VALUE = float(numpy.finfo(numpy.float32).min)
+
+
+class DynamicConvolution(nn.Module):
+    """Dynamic Convolution layer.
+
+    This implementation is based on
+    https://github.com/pytorch/fairseq/tree/master/fairseq
+
+    Args:
+        wshare (int): the number of convolution kernels
+        n_feat (int): the number of features
+        dropout_rate (float): dropout_rate
+        kernel_size (int): kernel size (length)
+        use_kernel_mask (bool): Use causal mask or not for convolution kernel
+        use_bias (bool): Use bias term or not.
+
+    """
+
+    def __init__(
+        self,
+        wshare,
+        n_feat,
+        dropout_rate,
+        kernel_size,
+        use_kernel_mask=False,
+        use_bias=False,
+    ):
+        """Construct Dynamic Convolution layer."""
+        super(DynamicConvolution, self).__init__()
+
+        assert n_feat % wshare == 0
+        self.wshare = wshare
+        self.use_kernel_mask = use_kernel_mask
+        self.dropout_rate = dropout_rate
+        self.kernel_size = kernel_size
+        self.attn = None
+
+        # linear -> GLU -- -> lightconv -> linear
+        #               \        /
+        #                 Linear
+        self.linear1 = nn.Linear(n_feat, n_feat * 2)
+        self.linear2 = nn.Linear(n_feat, n_feat)
+        self.linear_weight = nn.Linear(n_feat, self.wshare * 1 * kernel_size)
+        nn.init.xavier_uniform_(self.linear_weight.weight)
+        self.act = nn.GLU()
+
+        # dynamic conv related
+        self.use_bias = use_bias
+        if self.use_bias:
+            self.bias = nn.Parameter(torch.zeros(n_feat))
+
+    def forward(self, query, key, value, mask):
+        """Forward of 'Dynamic Convolution'.
+
+        This function takes query, key and value but uses only query.
+        This is just for compatibility with self-attention layer (attention.py)
+
+        Args:
+            query (torch.Tensor): (batch, time1, d_model) input tensor
+            key (torch.Tensor): (batch, time2, d_model) NOT USED
+            value (torch.Tensor): (batch, time2, d_model) NOT USED
+            mask (torch.Tensor): (batch, time1, time2) mask
+
+        Return:
+            x (torch.Tensor): (batch, time1, d_model) output
+
+        """
+        # linear -> GLU -- -> lightconv -> linear
+        #               \        /
+        #                 Linear
+        x = query
+        B, T, C = x.size()
+        H = self.wshare
+        k = self.kernel_size
+
+        # first linear layer
+        x = self.linear1(x)
+
+        # GLU activation
+        x = self.act(x)
+
+        # get kernel of convolution
+        weight = self.linear_weight(x)  # B x T x kH
+        weight = F.dropout(weight, self.dropout_rate, training=self.training)
+        weight = weight.view(B, T, H, k).transpose(1, 2).contiguous()  # B x H x T x k
+        weight_new = torch.full(
+            (B, H, T, T + k - 1), float("-inf"), dtype=weight.dtype, device=x.device
+        )  # B x H x T x T+k-1
+        weight_new.as_strided(
+            (B, H, T, k), ((T + k - 1) * T * H, (T + k - 1) * T, T + k, 1)
+        ).copy_(weight)
+        weight_new = weight_new.narrow(-1, int((k - 1) / 2), T)  # B x H x T x T(k)
+        if self.use_kernel_mask:
+            kernel_mask = torch.tril(torch.ones(T, T, device=x.device)).unsqueeze(0)
+            weight_new = weight_new.masked_fill(kernel_mask == 0.0, float("-inf"))
+        weight_new = F.softmax(weight_new, dim=-1)
+        self.attn = weight_new
+        weight_new = weight_new.view(B * H, T, T)
+
+        # convolution
+        x = x.transpose(1, 2).contiguous()  # B x C x T
+        x = x.view(B * H, int(C / H), T).transpose(1, 2)
+        x = torch.bmm(weight_new, x)  # BH x T x C/H
+        x = x.transpose(1, 2).contiguous().view(B, C, T)
+
+        if self.use_bias:
+            x = x + self.bias.view(1, -1, 1)
+        x = x.transpose(1, 2)  # B x T x C
+
+        if mask is not None and not self.use_kernel_mask:
+            mask = mask.transpose(-1, -2)
+            x = x.masked_fill(mask == 0, 0.0)
+
+        # second linear layer
+        x = self.linear2(x)
+        return x
diff --git a/funasr/modules/dynamic_conv2d.py b/funasr/modules/dynamic_conv2d.py
new file mode 100644
index 000000000..f8a4dd6e9
--- /dev/null
+++ b/funasr/modules/dynamic_conv2d.py
@@ -0,0 +1,138 @@
+"""Dynamic 2-Dimensional Convolution module."""
+
+import numpy
+import torch
+from torch import nn
+import torch.nn.functional as F
+
+
+MIN_VALUE = float(numpy.finfo(numpy.float32).min)
+
+
+class DynamicConvolution2D(nn.Module):
+    """Dynamic 2-Dimensional Convolution layer.
+
+    This implementation is based on
+    https://github.com/pytorch/fairseq/tree/master/fairseq
+
+    Args:
+        wshare (int): the number of convolution kernels
+        n_feat (int): the number of features
+        dropout_rate (float): dropout_rate
+        kernel_size (int): kernel size (length)
+        use_kernel_mask (bool): Use causal mask or not for convolution kernel
+        use_bias (bool): Use bias term or not.
+
+    """
+
+    def __init__(
+        self,
+        wshare,
+        n_feat,
+        dropout_rate,
+        kernel_size,
+        use_kernel_mask=False,
+        use_bias=False,
+    ):
+        """Construct Dynamic 2-Dimensional Convolution layer."""
+        super(DynamicConvolution2D, self).__init__()
+
+        assert n_feat % wshare == 0
+        self.wshare = wshare
+        self.use_kernel_mask = use_kernel_mask
+        self.dropout_rate = dropout_rate
+        self.kernel_size = kernel_size
+        self.padding_size = int(kernel_size / 2)
+        self.attn_t = None
+        self.attn_f = None
+
+        # linear -> GLU -- -> lightconv -> linear
+        #               \        /
+        #                 Linear
+        self.linear1 = nn.Linear(n_feat, n_feat * 2)
+        self.linear2 = nn.Linear(n_feat * 2, n_feat)
+        self.linear_weight = nn.Linear(n_feat, self.wshare * 1 * kernel_size)
+        nn.init.xavier_uniform_(self.linear_weight.weight)
+        self.linear_weight_f = nn.Linear(n_feat, kernel_size)
+        nn.init.xavier_uniform_(self.linear_weight_f.weight)
+        self.act = nn.GLU()
+
+        # dynamic conv related
+        self.use_bias = use_bias
+        if self.use_bias:
+            self.bias = nn.Parameter(torch.zeros(n_feat))
+
+    def forward(self, query, key, value, mask):
+        """Forward of 'Dynamic 2-Dimensional Convolution'.
+
+        This function takes query, key and value but uses only query.
+        This is just for compatibility with self-attention layer (attention.py)
+
+        Args:
+            query (torch.Tensor): (batch, time1, d_model) input tensor
+            key (torch.Tensor): (batch, time2, d_model) NOT USED
+            value (torch.Tensor): (batch, time2, d_model) NOT USED
+            mask (torch.Tensor): (batch, time1, time2) mask
+
+        Return:
+            x (torch.Tensor): (batch, time1, d_model) output
+
+        """
+        # linear -> GLU -- -> lightconv -> linear
+        #               \        /
+        #                 Linear
+        x = query
+        B, T, C = x.size()
+        H = self.wshare
+        k = self.kernel_size
+
+        # first linear layer
+        x = self.linear1(x)
+
+        # GLU activation
+        x = self.act(x)
+
+        # convolution of frequency axis
+        weight_f = self.linear_weight_f(x).view(B * T, 1, k)  # BT x 1 x k
+        self.attn_f = weight_f.view(B, T, k).unsqueeze(1)
+        xf = F.conv1d(
+            x.view(1, B * T, C), weight_f, padding=self.padding_size, groups=B * T
+        )
+        xf = xf.view(B, T, C)
+
+        # get kernel of convolution
+        weight = self.linear_weight(x)  # B x T x kH
+        weight = F.dropout(weight, self.dropout_rate, training=self.training)
+        weight = weight.view(B, T, H, k).transpose(1, 2).contiguous()  # B x H x T x k
+        weight_new = torch.full(
+            (B, H, T, T + k - 1), float("-inf"), dtype=weight.dtype, device=x.device
+        )  # B x H x T x T+k-1
+        weight_new.as_strided(
+            (B, H, T, k), ((T + k - 1) * T * H, (T + k - 1) * T, T + k, 1)
+        ).copy_(weight)
+        weight_new = weight_new.narrow(-1, int((k - 1) / 2), T)  # B x H x T x T(k)
+        if self.use_kernel_mask:
+            kernel_mask = torch.tril(torch.ones(T, T, device=x.device)).unsqueeze(0)
+            weight_new = weight_new.masked_fill(kernel_mask == 0.0, float("-inf"))
+        weight_new = F.softmax(weight_new, dim=-1)
+        self.attn_t = weight_new
+        weight_new = weight_new.view(B * H, T, T)
+
+        # convolution
+        x = x.transpose(1, 2).contiguous()  # B x C x T
+        x = x.view(B * H, int(C / H), T).transpose(1, 2)
+        x = torch.bmm(weight_new, x)
+        x = x.transpose(1, 2).contiguous().view(B, C, T)
+
+        if self.use_bias:
+            x = x + self.bias.view(1, -1, 1)
+        x = x.transpose(1, 2)  # B x T x C
+        x = torch.cat((x, xf), -1)  # B x T x Cx2
+
+        if mask is not None and not self.use_kernel_mask:
+            mask = mask.transpose(-1, -2)
+            x = x.masked_fill(mask == 0, 0.0)
+
+        # second linear layer
+        x = self.linear2(x)
+        return x
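The time-axis kernel built in `forward` above is hard to follow because of the `as_strided` trick: each time step's `k` weights are written onto a diagonal band of a `T x T` matrix filled with `-inf`, and a row-wise softmax then turns the band into mixing weights. A minimal pure-Python sketch of that idea for a single batch item and head, assuming no kernel mask and scalar features; `banded_softmax_kernel` and `apply_kernel` are hypothetical names used only for illustration and are not part of this patch:

```python
import math


def banded_softmax_kernel(weights, k):
    """Build a T x T mixing matrix from per-step kernel scores.

    weights: list of T rows, each holding k raw scores for that time step.
    Row t places its scores on columns t - half .. t - half + k - 1,
    leaves every other column at -inf, then applies a softmax.
    """
    T = len(weights)
    half = (k - 1) // 2  # same offset as narrow(-1, int((k - 1) / 2), T)
    kernel = []
    for t, row in enumerate(weights):
        scores = [float("-inf")] * T
        for j, w in enumerate(row):
            pos = t + j - half
            if 0 <= pos < T:  # taps falling outside the sequence are dropped
                scores[pos] = w
        top = max(scores)  # finite, because the centre tap is always in range
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        kernel.append([e / total for e in exps])
    return kernel


def apply_kernel(kernel, x):
    """Mix time steps: out[t] = sum_s kernel[t][s] * x[s]."""
    return [sum(w * v for w, v in zip(row, x)) for row in kernel]


if __name__ == "__main__":
    K = banded_softmax_kernel([[0.0, 0.0, 0.0]] * 4, k=3)
    for row in K:
        print([round(v, 3) for v in row])
```

Because the out-of-band entries are `-inf` before the softmax, they contribute exactly zero weight, so each output frame only sees its `k` neighbours. The patch does the same thing for all batches and heads at once with `torch.bmm`.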
diff --git a/funasr/modules/e2e_asr_common.py b/funasr/modules/e2e_asr_common.py
new file mode 100644
index 000000000..92f90796a
--- /dev/null
+++ b/funasr/modules/e2e_asr_common.py
@@ -0,0 +1,249 @@
+#!/usr/bin/env python3
+# encoding: utf-8
+
+# Copyright 2017 Johns Hopkins University (Shinji Watanabe)
+#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)
+
+"""Common functions for ASR."""
+
+import json
+import logging
+import sys
+
+from itertools import groupby
+import numpy as np
+import six
+
+
+def end_detect(ended_hyps, i, M=3, D_end=np.log(1 * np.exp(-10))):
+    """End detection.
+
+    described in Eq. (50) of S. Watanabe et al
+    "Hybrid CTC/Attention Architecture for End-to-End Speech Recognition"
+
+    :param ended_hyps:
+    :param i:
+    :param M:
+    :param D_end:
+    :return:
+    """
+    if len(ended_hyps) == 0:
+        return False
+    count = 0
+    best_hyp = sorted(ended_hyps, key=lambda x: x["score"], reverse=True)[0]
+    for m in six.moves.range(M):
+        # get ended_hyps whose length is i - m
+        hyp_length = i - m
+        hyps_same_length = [x for x in ended_hyps if len(x["yseq"]) == hyp_length]
+        if len(hyps_same_length) > 0:
+            best_hyp_same_length = sorted(
+                hyps_same_length, key=lambda x: x["score"], reverse=True
+            )[0]
+            if best_hyp_same_length["score"] - best_hyp["score"] < D_end:
+                count += 1
+
+    if count == M:
+        return True
+    else:
+        return False
+
+
+# TODO(takaaki-hori): add different smoothing methods
+def label_smoothing_dist(odim, lsm_type, transcript=None, blank=0):
+    """Obtain label distribution for loss smoothing.
+
+    :param odim:
+    :param lsm_type:
+    :param blank:
+    :param transcript:
+    :return:
+    """
+    if transcript is not None:
+        with open(transcript, "rb") as f:
+            trans_json = json.load(f)["utts"]
+
+    if lsm_type == "unigram":
+        assert transcript is not None, (
+            "transcript is required for %s label smoothing" % lsm_type
+        )
+        labelcount = np.zeros(odim)
+        for k, v in trans_json.items():
+            ids = np.array([int(n) for n in v["output"][0]["tokenid"].split()])
+            # to avoid an error when there is no text in an utterance
+            if len(ids) > 0:
+                labelcount[ids] += 1
+        labelcount[odim - 1] = len(trans_json)  # count <eos>: one per utterance
+        labelcount[labelcount == 0] = 1  # flooring
+        labelcount[blank] = 0  # remove counts for blank
+        labeldist = labelcount.astype(np.float32) / np.sum(labelcount)
+    else:
+        logging.error("Error: unexpected label smoothing type: %s" % lsm_type)
+        sys.exit()
+
+    return labeldist
+
+
+def get_vgg2l_odim(idim, in_channel=3, out_channel=128):
+    """Return the output size of the VGG frontend.
+
+    :param in_channel: input channel size
+    :param out_channel: output channel size
+    :return: output size
+    :rtype int
+    """
+    idim = idim / in_channel
+    idim = np.ceil(np.array(idim, dtype=np.float32) / 2)  # 1st max pooling
+    idim = np.ceil(np.array(idim, dtype=np.float32) / 2)  # 2nd max pooling
+    return int(idim) * out_channel  # number of channels
+
+
+class ErrorCalculator(object):
+    """Calculate CER and WER for E2E_ASR and CTC models during training.
+
+    :param y_hats: numpy array with predicted text
+    :param y_pads: numpy array with true (target) text
+    :param char_list:
+    :param sym_space:
+    :param sym_blank:
+    :return:
+    """
+
+    def __init__(
+        self, char_list, sym_space, sym_blank, report_cer=False, report_wer=False
+    ):
+        """Construct an ErrorCalculator object."""
+        super(ErrorCalculator, self).__init__()
+
+        self.report_cer = report_cer
+        self.report_wer = report_wer
+
+        self.char_list = char_list
+        self.space = sym_space
+        self.blank = sym_blank
+        self.idx_blank = self.char_list.index(self.blank)
+        if self.space in self.char_list:
+            self.idx_space = self.char_list.index(self.space)
+        else:
+            self.idx_space = None
+
+    def __call__(self, ys_hat, ys_pad, is_ctc=False):
+        """Calculate sentence-level WER/CER score.
+
+        :param torch.Tensor ys_hat: prediction (batch, seqlen)
+        :param torch.Tensor ys_pad: reference (batch, seqlen)
+        :param bool is_ctc: calculate CER score for CTC
+        :return: sentence-level CER score
+        :rtype float
+        :return: sentence-level WER score
+        :rtype float
+        """
+        cer, wer = None, None
+        if is_ctc:
+            return self.calculate_cer_ctc(ys_hat, ys_pad)
+        elif not self.report_cer and not self.report_wer:
+            return cer, wer
+
+        seqs_hat, seqs_true = self.convert_to_char(ys_hat, ys_pad)
+        if self.report_cer:
+            cer = self.calculate_cer(seqs_hat, seqs_true)
+
+        if self.report_wer:
+            wer = self.calculate_wer(seqs_hat, seqs_true)
+        return cer, wer
+
+    def calculate_cer_ctc(self, ys_hat, ys_pad):
+        """Calculate sentence-level CER score for CTC.
+
+        :param torch.Tensor ys_hat: prediction (batch, seqlen)
+        :param torch.Tensor ys_pad: reference (batch, seqlen)
+        :return: average sentence-level CER score
+        :rtype float
+        """
+        import editdistance
+
+        cers, char_ref_lens = [], []
+        for i, y in enumerate(ys_hat):
+            y_hat = [x[0] for x in groupby(y)]
+            y_true = ys_pad[i]
+            seq_hat, seq_true = [], []
+            for idx in y_hat:
+                idx = int(idx)
+                if idx != -1 and idx != self.idx_blank and idx != self.idx_space:
+                    seq_hat.append(self.char_list[int(idx)])
+
+            for idx in y_true:
+                idx = int(idx)
+                if idx != -1 and idx != self.idx_blank and idx != self.idx_space:
+                    seq_true.append(self.char_list[int(idx)])
+
+            hyp_chars = "".join(seq_hat)
+            ref_chars = "".join(seq_true)
+            if len(ref_chars) > 0:
+                cers.append(editdistance.eval(hyp_chars, ref_chars))
+                char_ref_lens.append(len(ref_chars))
+
+        cer_ctc = float(sum(cers)) / sum(char_ref_lens) if cers else None
+        return cer_ctc
+
+    def convert_to_char(self, ys_hat, ys_pad):
+        """Convert index to character.
+
+        :param torch.Tensor ys_hat: prediction (batch, seqlen)
+        :param torch.Tensor ys_pad: reference (batch, seqlen)
+        :return: token list of prediction
+        :rtype list
+        :return: token list of reference
+        :rtype list
+        """
+        seqs_hat, seqs_true = [], []
+        for i, y_hat in enumerate(ys_hat):
+            y_true = ys_pad[i]
+            eos_true = np.where(y_true == -1)[0]
+            ymax = eos_true[0] if len(eos_true) > 0 else len(y_true)
+            # NOTE: padding index (-1) in y_true is used to pad y_hat
+            seq_hat = [self.char_list[int(idx)] for idx in y_hat[:ymax]]
+            seq_true = [self.char_list[int(idx)] for idx in y_true if int(idx) != -1]
+            seq_hat_text = "".join(seq_hat).replace(self.space, " ")
+            seq_hat_text = seq_hat_text.replace(self.blank, "")
+            seq_true_text = "".join(seq_true).replace(self.space, " ")
+            seqs_hat.append(seq_hat_text)
+            seqs_true.append(seq_true_text)
+        return seqs_hat, seqs_true
+
+    def calculate_cer(self, seqs_hat, seqs_true):
+        """Calculate sentence-level CER score.
+
+        :param list seqs_hat: prediction
+        :param list seqs_true: reference
+        :return: average sentence-level CER score
+        :rtype float
+        """
+        import editdistance
+
+        char_eds, char_ref_lens = [], []
+        for i, seq_hat_text in enumerate(seqs_hat):
+            seq_true_text = seqs_true[i]
+            hyp_chars = seq_hat_text.replace(" ", "")
+            ref_chars = seq_true_text.replace(" ", "")
+            char_eds.append(editdistance.eval(hyp_chars, ref_chars))
+            char_ref_lens.append(len(ref_chars))
+        return float(sum(char_eds)) / sum(char_ref_lens)
+
+    def calculate_wer(self, seqs_hat, seqs_true):
+        """Calculate sentence-level WER score.
+
+        :param list seqs_hat: prediction
+        :param list seqs_true: reference
+        :return: average sentence-level WER score
+        :rtype float
+        """
+        import editdistance
+
+        word_eds, word_ref_lens = [], []
+        for i, seq_hat_text in enumerate(seqs_hat):
+            seq_true_text = seqs_true[i]
+            hyp_words = seq_hat_text.split()
+            ref_words = seq_true_text.split()
+            word_eds.append(editdistance.eval(hyp_words, ref_words))
+            word_ref_lens.append(len(ref_words))
+        return float(sum(word_eds)) / sum(word_ref_lens)
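`ErrorCalculator` pools edit distances and reference lengths over the whole batch before dividing, rather than averaging per-sentence rates, and its CTC path first collapses repeated labels and drops blanks. A minimal sketch of both steps in pure Python follows. The `edit_distance` function is a hand-written Levenshtein stand-in for `editdistance.eval`, and all three function names are hypothetical, chosen only for this illustration:

```python
from itertools import groupby


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(
                min(
                    prev[j] + 1,  # deletion
                    cur[j - 1] + 1,  # insertion
                    prev[j - 1] + (r != h),  # substitution or match
                )
            )
        prev = cur
    return prev[-1]


def pooled_error_rates(hyps, refs):
    """Return (CER, WER) pooled over all sentences, as calculate_cer/wer do."""
    char_eds = char_len = word_eds = word_len = 0
    for hyp, ref in zip(hyps, refs):
        hyp_chars, ref_chars = hyp.replace(" ", ""), ref.replace(" ", "")
        char_eds += edit_distance(ref_chars, hyp_chars)
        char_len += len(ref_chars)
        word_eds += edit_distance(ref.split(), hyp.split())
        word_len += len(ref.split())
    return char_eds / char_len, word_eds / word_len


def ctc_collapse(ids, blank=0):
    """Merge repeated labels, then remove blanks (the groupby step above)."""
    return [i for i, _ in groupby(ids) if i != blank]


if __name__ == "__main__":
    cer, wer = pooled_error_rates(
        ["the cat sat", "hello word"], ["the cat sat", "hello world"]
    )
    print(cer, wer)
    print(ctc_collapse([0, 3, 3, 0, 3, 5, 5, 0]))
```

Like `calculate_cer` and `calculate_wer` in the patch, this sketch divides by the total reference length, so it would raise `ZeroDivisionError` if every reference were empty.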
diff --git a/funasr/modules/embedding.py b/funasr/modules/embedding.py
new file mode 100644
index 000000000..b61a61a88
--- /dev/null
+++ b/funasr/modules/embedding.py
@@ -0,0 +1,408 @@
+#!/usr/bin/env python3
+# -*- coding: utf-8 -*-
+
+# Copyright 2019 Shigeki Karita
+#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)
+
+"""Positional Encoding Module."""
+
+import math
+import torch
+
+
+def _pre_hook(
+    state_dict,
+    prefix,
+    local_metadata,
+    strict,
+    missing_keys,
+    unexpected_keys,
+    error_msgs,
+):
+    """Perform pre-hook in load_state_dict for backward compatibility.
+
+    Note:
+        We saved self.pe until v.0.5.2 but we have omitted it later.
+        Therefore, we remove the item "pe" from `state_dict` for backward compatibility.
+
+    """
+    k = prefix + "pe"
+    if k in state_dict:
+        state_dict.pop(k)
+
+
+class PositionalEncoding(torch.nn.Module):
+    """Positional encoding.
+
+    Args:
+        d_model (int): Embedding dimension.
+        dropout_rate (float): Dropout rate.
+        max_len (int): Maximum input length.
+        reverse (bool): Whether to reverse the input position. Only for
+        the class LegacyRelPositionalEncoding. We remove it in the current
+        class RelPositionalEncoding.
+    """
+
+    def __init__(self, d_model, dropout_rate, max_len=5000, reverse=False):
+        """Construct a PositionalEncoding object."""
+        super(PositionalEncoding, self).__init__()
+        self.d_model = d_model
+        self.reverse = reverse
+        self.xscale = math.sqrt(self.d_model)
+        self.dropout = torch.nn.Dropout(p=dropout_rate)
+        self.pe = None
+        self.extend_pe(torch.tensor(0.0).expand(1, max_len))
+        self._register_load_state_dict_pre_hook(_pre_hook)
+
+    def extend_pe(self, x):
+        """Reset the positional encodings."""
+        if self.pe is not None:
+            if self.pe.size(1) >= x.size(1):
+                if self.pe.dtype != x.dtype or self.pe.device != x.device:
+                    self.pe = self.pe.to(dtype=x.dtype, device=x.device)
+                return
+        pe = torch.zeros(x.size(1), self.d_model)
+        if self.reverse:
+            position = torch.arange(
+                x.size(1) - 1, -1, -1.0, dtype=torch.float32
+            ).unsqueeze(1)
+        else:
+            position = torch.arange(0, x.size(1), dtype=torch.float32).unsqueeze(1)
+        div_term = torch.exp(
+            torch.arange(0, self.d_model, 2, dtype=torch.float32)
+            * -(math.log(10000.0) / self.d_model)
+        )
+        pe[:, 0::2] = torch.sin(position * div_term)
+        pe[:, 1::2] = torch.cos(position * div_term)
+        pe = pe.unsqueeze(0)
+        self.pe = pe.to(device=x.device, dtype=x.dtype)
+
+    def forward(self, x: torch.Tensor):
+        """Add positional encoding.
+
+        Args:
+            x (torch.Tensor): Input tensor (batch, time, `*`).
+
+        Returns:
+            torch.Tensor: Encoded tensor (batch, time, `*`).
+        """
+        self.extend_pe(x)
+        x = x * self.xscale + self.pe[:, : x.size(1)]
+        return self.dropout(x)
+
+
+class ScaledPositionalEncoding(PositionalEncoding):
+    """Scaled positional encoding module.
+
+    See Sec. 3.2  https://arxiv.org/abs/1809.08895
+
+    Args:
+        d_model (int): Embedding dimension.
+        dropout_rate (float): Dropout rate.
+        max_len (int): Maximum input length.
+
+    """
+
+    def __init__(self, d_model, dropout_rate, max_len=5000):
+        """Initialize class."""
+        super().__init__(d_model=d_model, dropout_rate=dropout_rate, max_len=max_len)
+        self.alpha = torch.nn.Parameter(torch.tensor(1.0))
+
+    def reset_parameters(self):
+        """Reset parameters."""
+        self.alpha.data = torch.tensor(1.0)
+
+    def forward(self, x):
+        """Add positional encoding.
+
+        Args:
+            x (torch.Tensor): Input tensor (batch, time, `*`).
+
+        Returns:
+            torch.Tensor: Encoded tensor (batch, time, `*`).
+
+        """
+        self.extend_pe(x)
+        x = x + self.alpha * self.pe[:, : x.size(1)]
+        return self.dropout(x)
+
+
+class LearnableFourierPosEnc(torch.nn.Module):
+    """Learnable Fourier Features for Positional Encoding.
+
+    See https://arxiv.org/pdf/2106.02795.pdf
+
+    Args:
+        d_model (int): Embedding dimension.
+        dropout_rate (float): Dropout rate.
+        max_len (int): Maximum input length.
+        gamma (float): init parameter for the positional kernel variance
+            see https://arxiv.org/pdf/2106.02795.pdf.
+        apply_scaling (bool): Whether to scale the input before adding the pos encoding.
+        hidden_dim (int): if not None, we modulate the pos encodings with
+            an MLP whose hidden layer has hidden_dim neurons.
+    """
+
+    def __init__(
+        self,
+        d_model,
+        dropout_rate=0.0,
+        max_len=5000,
+        gamma=1.0,
+        apply_scaling=False,
+        hidden_dim=None,
+    ):
+        """Initialize class."""
+        super(LearnableFourierPosEnc, self).__init__()
+
+        self.d_model = d_model
+
+        if apply_scaling:
+            self.xscale = math.sqrt(self.d_model)
+        else:
+            self.xscale = 1.0
+
+        self.dropout = torch.nn.Dropout(dropout_rate)
+        self.max_len = max_len
+
+        self.gamma = gamma
+        if self.gamma is None:
+            self.gamma = self.d_model // 2
+
+        assert (
+            d_model % 2 == 0
+        ), "d_model should be divisible by two in order to use this layer."
+        self.w_r = torch.nn.Parameter(torch.empty(1, d_model // 2))
+        self._reset()  # init the weights
+
+        self.hidden_dim = hidden_dim
+        if self.hidden_dim is not None:
+            self.mlp = torch.nn.Sequential(
+                torch.nn.Linear(d_model, hidden_dim),
+                torch.nn.GELU(),
+                torch.nn.Linear(hidden_dim, d_model),
+            )
+
+    def _reset(self):
+        self.w_r.data = torch.normal(
+            0, (1 / math.sqrt(self.gamma)), (1, self.d_model // 2)
+        )
+
+    def extend_pe(self, x):
+        """Reset the positional encodings."""
+        position_v = torch.arange(0, x.size(1), dtype=torch.float32).unsqueeze(1).to(x)
+
+        cosine = torch.cos(torch.matmul(position_v, self.w_r))
+        sine = torch.sin(torch.matmul(position_v, self.w_r))
+        pos_enc = torch.cat((cosine, sine), -1)
+        pos_enc /= math.sqrt(self.d_model)
+
+        if self.hidden_dim is None:
+            return pos_enc.unsqueeze(0)
+        else:
+            return self.mlp(pos_enc.unsqueeze(0))
+
+    def forward(self, x: torch.Tensor):
+        """Add positional encoding.
+
+        Args:
+            x (torch.Tensor): Input tensor (batch, time, `*`).
+
+        Returns:
+            torch.Tensor: Encoded tensor (batch, time, `*`).
+        """
+        pe = self.extend_pe(x)
+        x = x * self.xscale + pe
+        return self.dropout(x)
+
+
+class LegacyRelPositionalEncoding(PositionalEncoding):
+    """Relative positional encoding module (old version).
+
+    Details can be found in https://github.com/espnet/espnet/pull/2816.
+
+    See : Appendix B in https://arxiv.org/abs/1901.02860
+
+    Args:
+        d_model (int): Embedding dimension.
+        dropout_rate (float): Dropout rate.
+        max_len (int): Maximum input length.
+
+    """
+
+    def __init__(self, d_model, dropout_rate, max_len=5000):
+        """Initialize class."""
+        super().__init__(
+            d_model=d_model,
+            dropout_rate=dropout_rate,
+            max_len=max_len,
+            reverse=True,
+        )
+
+    def forward(self, x):
+        """Compute positional encoding.
+
+        Args:
+            x (torch.Tensor): Input tensor (batch, time, `*`).
+
+        Returns:
+            torch.Tensor: Encoded tensor (batch, time, `*`).
+            torch.Tensor: Positional embedding tensor (1, time, `*`).
+
+        """
+        self.extend_pe(x)
+        x = x * self.xscale
+        pos_emb = self.pe[:, : x.size(1)]
+        return self.dropout(x), self.dropout(pos_emb)
+
+
+class RelPositionalEncoding(torch.nn.Module):
+    """Relative positional encoding module (new implementation).
+
+    Details can be found in https://github.com/espnet/espnet/pull/2816.
+
+    See : Appendix B in https://arxiv.org/abs/1901.02860
+
+    Args:
+        d_model (int): Embedding dimension.
+        dropout_rate (float): Dropout rate.
+        max_len (int): Maximum input length.
+
+    """
+
+    def __init__(self, d_model, dropout_rate, max_len=5000):
+        """Construct a RelPositionalEncoding object."""
+        super(RelPositionalEncoding, self).__init__()
+        self.d_model = d_model
+        self.xscale = math.sqrt(self.d_model)
+        self.dropout = torch.nn.Dropout(p=dropout_rate)
+        self.pe = None
+        self.extend_pe(torch.tensor(0.0).expand(1, max_len))
+
+    def extend_pe(self, x):
+        """Reset the positional encodings."""
+        if self.pe is not None:
+            # self.pe contains both positive and negative parts
+            # the length of self.pe is 2 * input_len - 1
+            if self.pe.size(1) >= x.size(1) * 2 - 1:
+                if self.pe.dtype != x.dtype or self.pe.device != x.device:
+                    self.pe = self.pe.to(dtype=x.dtype, device=x.device)
+                return
+        # Suppose `i` is the position of the query vector and `j` is the
+        # position of the key vector. We use positive relative positions when keys
+        # are to the left (i>j) and negative relative positions otherwise (i<j).
+        pe_positive = torch.zeros(x.size(1), self.d_model)
+        pe_negative = torch.zeros(x.size(1), self.d_model)
+        position = torch.arange(0, x.size(1), dtype=torch.float32).unsqueeze(1)
+        div_term = torch.exp(
+            torch.arange(0, self.d_model, 2, dtype=torch.float32)
+            * -(math.log(10000.0) / self.d_model)
+        )
+        pe_positive[:, 0::2] = torch.sin(position * div_term)
+        pe_positive[:, 1::2] = torch.cos(position * div_term)
+        pe_negative[:, 0::2] = torch.sin(-1 * position * div_term)
+        pe_negative[:, 1::2] = torch.cos(-1 * position * div_term)
+
+        # Reverse the order of positive indices and concat both positive and
+        # negative indices. This is used to support the shifting trick
+        # as in https://arxiv.org/abs/1901.02860
+        pe_positive = torch.flip(pe_positive, [0]).unsqueeze(0)
+        pe_negative = pe_negative[1:].unsqueeze(0)
+        pe = torch.cat([pe_positive, pe_negative], dim=1)
+        self.pe = pe.to(device=x.device, dtype=x.dtype)
+
+    def forward(self, x: torch.Tensor):
+        """Add positional encoding.
+
+        Args:
+            x (torch.Tensor): Input tensor (batch, time, `*`).
+
+        Returns:
+            torch.Tensor: Encoded tensor (batch, time, `*`).
+
+        """
+        self.extend_pe(x)
+        x = x * self.xscale
+        pos_emb = self.pe[
+            :,
+            self.pe.size(1) // 2 - x.size(1) + 1 : self.pe.size(1) // 2 + x.size(1),
+        ]
+        return self.dropout(x), self.dropout(pos_emb)
+
+
+class StreamPositionalEncoding(torch.nn.Module):
+    """Streaming Positional encoding.
+
+    Args:
+        d_model (int): Embedding dimension.
+        dropout_rate (float): Dropout rate.
+        max_len (int): Maximum input length.
+
+    """
+
+    def __init__(self, d_model, dropout_rate, max_len=5000):
+        """Construct a StreamPositionalEncoding object."""
+        super(StreamPositionalEncoding, self).__init__()
+        self.d_model = d_model
+        self.xscale = math.sqrt(self.d_model)
+        self.dropout = torch.nn.Dropout(p=dropout_rate)
+        self.pe = None
+        self.tmp = torch.tensor(0.0).expand(1, max_len)
+        self.extend_pe(self.tmp.size(1), self.tmp.device, self.tmp.dtype)
+        self._register_load_state_dict_pre_hook(_pre_hook)
+
+    def extend_pe(self, length, device, dtype):
+        """Reset the positional encodings."""
+        if self.pe is not None:
+            if self.pe.size(1) >= length:
+                if self.pe.dtype != dtype or self.pe.device != device:
+                    self.pe = self.pe.to(dtype=dtype, device=device)
+                return
+        pe = torch.zeros(length, self.d_model)
+        position = torch.arange(0, length, dtype=torch.float32).unsqueeze(1)
+        div_term = torch.exp(
+            torch.arange(0, self.d_model, 2, dtype=torch.float32)
+            * -(math.log(10000.0) / self.d_model)
+        )
+        pe[:, 0::2] = torch.sin(position * div_term)
+        pe[:, 1::2] = torch.cos(position * div_term)
+        pe = pe.unsqueeze(0)
+        self.pe = pe.to(device=device, dtype=dtype)
+
+    def forward(self, x: torch.Tensor, start_idx: int = 0):
+        """Add positional encoding.
+
+        Args:
+            x (torch.Tensor): Input tensor (batch, time, `*`).
+
+        Returns:
+            torch.Tensor: Encoded tensor (batch, time, `*`).
+
+        """
+        self.extend_pe(x.size(1) + start_idx, x.device, x.dtype)
+        x = x * self.xscale + self.pe[:, start_idx : start_idx + x.size(1)]
+        return self.dropout(x)
+
+class SinusoidalPositionEncoder(torch.nn.Module):
+    '''Sinusoidal position encoder that adds sin/cos encodings to the input.
+
+    '''
+    def __init__(self, d_model=80, dropout_rate=0.1):
+        super().__init__()
+
+    def encode(self, positions: torch.Tensor = None, depth: int = None, dtype: torch.dtype = torch.float32):
+        batch_size = positions.size(0)
+        positions = positions.type(dtype)
+        log_timescale_increment = torch.log(torch.tensor([10000], dtype=dtype)) / (depth / 2 - 1)
+        inv_timescales = torch.exp(torch.arange(depth / 2).type(dtype) * (-log_timescale_increment))
+        inv_timescales = torch.reshape(inv_timescales, [batch_size, -1])
+        scaled_time = torch.reshape(positions, [1, -1, 1]) * torch.reshape(inv_timescales, [1, 1, -1])
+        encoding = torch.cat([torch.sin(scaled_time), torch.cos(scaled_time)], dim=2)
+        return encoding.type(dtype)
+
+    def forward(self, x):
+        batch_size, timesteps, input_dim = x.size()
+        positions = torch.arange(1, timesteps+1)[None, :]
+        position_encoding = self.encode(positions, input_dim, x.dtype).to(x.device)
+
+        return x + position_encoding
\ No newline at end of file
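The absolute encodings in `PositionalEncoding.extend_pe` interleave sine and cosine values: even feature index `i` gets `sin(pos * exp(-i * ln(10000) / d_model))` and the following odd index gets the cosine of the same angle. A minimal pure-Python sketch of that table, and of the scaled addition in `PositionalEncoding.forward` without dropout; `sinusoidal_pe` and `add_positional_encoding` are hypothetical names used only for illustration:

```python
import math


def sinusoidal_pe(length, d_model):
    """Return a length x d_model table of interleaved sin/cos encodings."""
    pe = [[0.0] * d_model for _ in range(length)]
    for pos in range(length):
        for i in range(0, d_model, 2):
            # same frequency term as div_term in extend_pe
            angle = pos * math.exp(-i * math.log(10000.0) / d_model)
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe


def add_positional_encoding(x, pe):
    """Compute x * sqrt(d_model) + pe for one sequence (list of feature rows)."""
    scale = math.sqrt(len(x[0]))
    return [
        [v * scale + p for v, p in zip(row, pe_row)] for row, pe_row in zip(x, pe)
    ]


if __name__ == "__main__":
    table = sinusoidal_pe(length=4, d_model=4)
    for row in table:
        print([round(v, 4) for v in row])
```

Position 0 always encodes to alternating `0, 1` values, which is a quick sanity check when comparing against the tensor version.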
diff --git a/funasr/modules/frontends/__init__.py b/funasr/modules/frontends/__init__.py
new file mode 100644
index 000000000..b7f177368
--- /dev/null
+++ b/funasr/modules/frontends/__init__.py
@@ -0,0 +1 @@
+"""Initialize sub package."""
diff --git a/funasr/modules/frontends/beamformer.py b/funasr/modules/frontends/beamformer.py
new file mode 100644
index 000000000..f3eccee4c
--- /dev/null
+++ b/funasr/modules/frontends/beamformer.py
@@ -0,0 +1,84 @@
+import torch
+from torch_complex import functional as FC
+from torch_complex.tensor import ComplexTensor
+
+
+def get_power_spectral_density_matrix(
+    xs: ComplexTensor, mask: torch.Tensor, normalization=True, eps: float = 1e-15
+) -> ComplexTensor:
+    """Return cross-channel power spectral density (PSD) matrix
+
+    Args:
+        xs (ComplexTensor): (..., F, C, T)
+        mask (torch.Tensor): (..., F, C, T)
+        normalization (bool):
+        eps (float):
+    Returns:
+        psd (ComplexTensor): (..., F, C, C)
+
+    """
+    # outer product: (..., C_1, T) x (..., C_2, T) -> (..., T, C_1, C_2)
+    psd_Y = FC.einsum("...ct,...et->...tce", [xs, xs.conj()])
+
+    # Averaging mask along C: (..., C, T) -> (..., T)
+    mask = mask.mean(dim=-2)
+
+    # Normalized mask along T: (..., T)
+    if normalization:
+        # If assuming the tensor is padded with zero, the summation along
+        # the time axis is same regardless of the padding length.
+        mask = mask / (mask.sum(dim=-1, keepdim=True) + eps)
+
+    # psd: (..., T, C, C)
+    psd = psd_Y * mask[..., None, None]
+    # (..., T, C, C) -> (..., C, C)
+    psd = psd.sum(dim=-3)
+
+    return psd
+
+
+def get_mvdr_vector(
+    psd_s: ComplexTensor,
+    psd_n: ComplexTensor,
+    reference_vector: torch.Tensor,
+    eps: float = 1e-15,
+) -> ComplexTensor:
+    """Return the MVDR(Minimum Variance Distortionless Response) vector:
+
+        h = (Npsd^-1 @ Spsd) / (Tr(Npsd^-1 @ Spsd)) @ u
+
+    Reference:
+        On optimal frequency-domain multichannel linear filtering
+        for noise reduction; M. Souden et al., 2010;
+        https://ieeexplore.ieee.org/document/5089420
+
+    Args:
+        psd_s (ComplexTensor): (..., F, C, C)
+        psd_n (ComplexTensor): (..., F, C, C)
+        reference_vector (torch.Tensor): (..., C)
+        eps (float):
+    Returns:
+        beamform_vector (ComplexTensor): (..., F, C)
+    """
+    # Add eps
+    C = psd_n.size(-1)
+    eye = torch.eye(C, dtype=psd_n.dtype, device=psd_n.device)
+    shape = [1 for _ in range(psd_n.dim() - 2)] + [C, C]
+    eye = eye.view(*shape)
+    psd_n += eps * eye
+
+    # numerator: (..., C_1, C_2) x (..., C_2, C_3) -> (..., C_1, C_3)
+    numerator = FC.einsum("...ec,...cd->...ed", [psd_n.inverse(), psd_s])
+    # ws: (..., C, C) / (...,) -> (..., C, C)
+    ws = numerator / (FC.trace(numerator)[..., None, None] + eps)
+    # h: (..., F, C_1, C_2) x (..., C_2) -> (..., F, C_1)
+    beamform_vector = FC.einsum("...fec,...c->...fe", [ws, reference_vector])
+    return beamform_vector
+
+
+def apply_beamforming_vector(
+    beamform_vector: ComplexTensor, mix: ComplexTensor
+) -> ComplexTensor:
+    # (..., C) x (..., C, T) -> (..., T)
+    es = FC.einsum("...c,...ct->...t", [beamform_vector.conj(), mix])
+    return es
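The three functions in `beamformer.py` can be read as the following per-frequency-bin equations, written here as a summary of what the code computes. The symbols are notation chosen for this note and are not taken from the patch: $m_{c,t}$ is the mask for channel $c$ and frame $t$, $\mathbf{x}_t$ the $C$-channel observation, $\Phi_S$ and $\Phi_N$ the speech and noise PSD matrices, $\mathbf{u}$ the reference vector, and $\epsilon$ the `eps` argument.

```latex
\begin{aligned}
\bar m_t &= \frac{1}{C}\sum_{c=1}^{C} m_{c,t}, \qquad
\tilde m_t = \frac{\bar m_t}{\sum_{\tau} \bar m_\tau + \epsilon} \\
\Phi &= \sum_{t} \tilde m_t \, \mathbf{x}_t \mathbf{x}_t^{\mathsf H} \\
\mathbf{h} &= \frac{(\Phi_N + \epsilon I)^{-1}\,\Phi_S}
               {\operatorname{Tr}\!\big((\Phi_N + \epsilon I)^{-1}\,\Phi_S\big) + \epsilon}\;\mathbf{u} \\
\hat s_t &= \mathbf{h}^{\mathsf H}\,\mathbf{x}_t
\end{aligned}
```

The first two lines correspond to `get_power_spectral_density_matrix` with `normalization=True`, the third to `get_mvdr_vector`, and the last to `apply_beamforming_vector`.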
diff --git a/funasr/modules/frontends/dnn_beamformer.py b/funasr/modules/frontends/dnn_beamformer.py
new file mode 100644
index 000000000..e75d771d3
--- /dev/null
+++ b/funasr/modules/frontends/dnn_beamformer.py
@@ -0,0 +1,172 @@
+"""DNN beamformer module."""
+from typing import Tuple
+
+import torch
+from torch.nn import functional as F
+
+from funasr.modules.frontends.beamformer import apply_beamforming_vector
+from funasr.modules.frontends.beamformer import get_mvdr_vector
+from funasr.modules.frontends.beamformer import (
+    get_power_spectral_density_matrix,  # noqa: H301
+)
+from funasr.modules.frontends.mask_estimator import MaskEstimator
+from torch_complex.tensor import ComplexTensor
+
+
+class DNN_Beamformer(torch.nn.Module):
+    """DNN mask based Beamformer
+
+    Citation:
+        Multichannel End-to-end Speech Recognition; T. Ochiai et al., 2017;
+        https://arxiv.org/abs/1703.04783
+
+    """
+
+    def __init__(
+        self,
+        bidim,
+        btype="blstmp",
+        blayers=3,
+        bunits=300,
+        bprojs=320,
+        bnmask=2,
+        dropout_rate=0.0,
+        badim=320,
+        ref_channel: int = -1,
+        beamformer_type="mvdr",
+    ):
+        super().__init__()
+        self.mask = MaskEstimator(
+            btype, bidim, blayers, bunits, bprojs, dropout_rate, nmask=bnmask
+        )
+        self.ref = AttentionReference(bidim, badim)
+        self.ref_channel = ref_channel
+
+        self.nmask = bnmask
+
+        if beamformer_type != "mvdr":
+            raise ValueError(
+                "Not supporting beamformer_type={}".format(beamformer_type)
+            )
+        self.beamformer_type = beamformer_type
+
+    def forward(
+        self, data: ComplexTensor, ilens: torch.LongTensor
+    ) -> Tuple[ComplexTensor, torch.LongTensor, ComplexTensor]:
+        """The forward function
+
+        Notation:
+            B: Batch
+            C: Channel
+            T: Time or Sequence length
+            F: Freq
+
+        Args:
+            data (ComplexTensor): (B, T, C, F)
+            ilens (torch.Tensor): (B,)
+        Returns:
+            enhanced (ComplexTensor): (B, T, F)
+            ilens (torch.Tensor): (B,)
+
+        """
+
+        def apply_beamforming(data, ilens, psd_speech, psd_noise):
+            # u: (B, C)
+            if self.ref_channel < 0:
+                u, _ = self.ref(psd_speech, ilens)
+            else:
+                # (optional) Create onehot vector for fixed reference microphone
+                u = torch.zeros(
+                    *(data.size()[:-3] + (data.size(-2),)), device=data.device
+                )
+                u[..., self.ref_channel].fill_(1)
+
+            ws = get_mvdr_vector(psd_speech, psd_noise, u)
+            enhanced = apply_beamforming_vector(ws, data)
+
+            return enhanced, ws
+
+        # data (B, T, C, F) -> (B, F, C, T)
+        data = data.permute(0, 3, 2, 1)
+
+        # mask: (B, F, C, T)
+        masks, _ = self.mask(data, ilens)
+        assert self.nmask == len(masks)
+
+        if self.nmask == 2:  # (mask_speech, mask_noise)
+            mask_speech, mask_noise = masks
+
+            psd_speech = get_power_spectral_density_matrix(data, mask_speech)
+            psd_noise = get_power_spectral_density_matrix(data, mask_noise)
+
+            enhanced, ws = apply_beamforming(data, ilens, psd_speech, psd_noise)
+
+            # (..., F, T) -> (..., T, F)
+            enhanced = enhanced.transpose(-1, -2)
+            mask_speech = mask_speech.transpose(-1, -3)
+        else:  # multi-speaker case: (mask_speech1, ..., mask_noise)
+            mask_speech = list(masks[:-1])
+            mask_noise = masks[-1]
+
+            psd_speeches = [
+                get_power_spectral_density_matrix(data, mask) for mask in mask_speech
+            ]
+            psd_noise = get_power_spectral_density_matrix(data, mask_noise)
+
+            enhanced = []
+            ws = []
+            for i in range(self.nmask - 1):
+                psd_speech = psd_speeches.pop(i)
+                # treat all other speakers' psd_speech as noises
+                enh, w = apply_beamforming(
+                    data, ilens, psd_speech, sum(psd_speeches) + psd_noise
+                )
+                psd_speeches.insert(i, psd_speech)
+
+                # (..., F, T) -> (..., T, F)
+                enh = enh.transpose(-1, -2)
+                mask_speech[i] = mask_speech[i].transpose(-1, -3)
+
+                enhanced.append(enh)
+                ws.append(w)
+
+        return enhanced, ilens, mask_speech
+
+
+class AttentionReference(torch.nn.Module):
+    def __init__(self, bidim, att_dim):
+        super().__init__()
+        self.mlp_psd = torch.nn.Linear(bidim, att_dim)
+        self.gvec = torch.nn.Linear(att_dim, 1)
+
+    def forward(
+        self, psd_in: ComplexTensor, ilens: torch.LongTensor, scaling: float = 2.0
+    ) -> Tuple[torch.Tensor, torch.LongTensor]:
+        """The forward function
+
+        Args:
+            psd_in (ComplexTensor): (B, F, C, C)
+            ilens (torch.Tensor): (B,)
+            scaling (float): Factor applied to the scores before the softmax.
+        Returns:
+            u (torch.Tensor): (B, C)
+            ilens (torch.Tensor): (B,)
+        """
+        B, _, C = psd_in.size()[:3]
+        assert psd_in.size(2) == psd_in.size(3), psd_in.size()
+        # psd_in: (B, F, C, C)
+        psd = psd_in.masked_fill(
+            torch.eye(C, dtype=torch.bool, device=psd_in.device), 0
+        )
+        # psd: (B, F, C, C) -> (B, C, F)
+        psd = (psd.sum(dim=-1) / (C - 1)).transpose(-1, -2)
+
+        # Calculate amplitude
+        psd_feat = (psd.real**2 + psd.imag**2) ** 0.5
+
+        # (B, C, F) -> (B, C, F2)
+        mlp_psd = self.mlp_psd(psd_feat)
+        # (B, C, F2) -> (B, C, 1) -> (B, C)
+        e = self.gvec(torch.tanh(mlp_psd)).squeeze(-1)
+        u = F.softmax(scaling * e, dim=-1)
+        return u, ilens
diff --git a/funasr/modules/frontends/dnn_wpe.py b/funasr/modules/frontends/dnn_wpe.py
new file mode 100644
index 000000000..9596765c8
--- /dev/null
+++ b/funasr/modules/frontends/dnn_wpe.py
@@ -0,0 +1,93 @@
+from typing import Tuple
+
+from pytorch_wpe import wpe_one_iteration
+import torch
+from torch_complex.tensor import ComplexTensor
+
+from funasr.modules.frontends.mask_estimator import MaskEstimator
+from funasr.modules.nets_utils import make_pad_mask
+
+
+class DNN_WPE(torch.nn.Module):
+    def __init__(
+        self,
+        wtype: str = "blstmp",
+        widim: int = 257,
+        wlayers: int = 3,
+        wunits: int = 300,
+        wprojs: int = 320,
+        dropout_rate: float = 0.0,
+        taps: int = 5,
+        delay: int = 3,
+        use_dnn_mask: bool = True,
+        iterations: int = 1,
+        normalization: bool = False,
+    ):
+        super().__init__()
+        self.iterations = iterations
+        self.taps = taps
+        self.delay = delay
+
+        self.normalization = normalization
+        self.use_dnn_mask = use_dnn_mask
+
+        self.inverse_power = True
+
+        if self.use_dnn_mask:
+            self.mask_est = MaskEstimator(
+                wtype, widim, wlayers, wunits, wprojs, dropout_rate, nmask=1
+            )
+
+    def forward(
+        self, data: ComplexTensor, ilens: torch.LongTensor
+    ) -> Tuple[ComplexTensor, torch.LongTensor, ComplexTensor]:
+        """The forward function
+
+        Notation:
+            B: Batch
+            C: Channel
+            T: Time or Sequence length
+            F: Freq or Some dimension of the feature vector
+
+        Args:
+            data: (B, T, C, F)
+            ilens: (B,)
+        Returns:
+            data: (B, T, C, F)
+            ilens: (B,)
+        """
+        # (B, T, C, F) -> (B, F, C, T)
+        enhanced = data = data.permute(0, 3, 2, 1)
+        mask = None
+
+        for i in range(self.iterations):
+            # Calculate power: (..., C, T)
+            power = enhanced.real**2 + enhanced.imag**2
+            if i == 0 and self.use_dnn_mask:
+                # mask: (B, F, C, T)
+                (mask,), _ = self.mask_est(enhanced, ilens)
+                if self.normalization:
+                    # Normalize along T
+                    mask = mask / mask.sum(dim=-1)[..., None]
+                # (..., C, T) * (..., C, T) -> (..., C, T)
+                power = power * mask
+
+            # Averaging along the channel axis: (..., C, T) -> (..., T)
+            power = power.mean(dim=-2)
+
+            # enhanced: (..., C, T) -> (..., C, T)
+            enhanced = wpe_one_iteration(
+                data.contiguous(),
+                power,
+                taps=self.taps,
+                delay=self.delay,
+                inverse_power=self.inverse_power,
+            )
+
+            enhanced.masked_fill_(make_pad_mask(ilens, enhanced.real), 0)
+
+        # (B, F, C, T) -> (B, T, C, F)
+        enhanced = enhanced.permute(0, 3, 2, 1)
+        if mask is not None:
+            mask = mask.transpose(-1, -3)
+        return enhanced, ilens, mask
diff --git a/funasr/modules/frontends/feature_transform.py b/funasr/modules/frontends/feature_transform.py
new file mode 100644
index 000000000..353dca1a6
--- /dev/null
+++ b/funasr/modules/frontends/feature_transform.py
@@ -0,0 +1,263 @@
+from typing import List
+from typing import Tuple
+from typing import Union
+
+import librosa
+import numpy as np
+import torch
+from torch_complex.tensor import ComplexTensor
+
+from funasr.modules.nets_utils import make_pad_mask
+
+
+class FeatureTransform(torch.nn.Module):
+    def __init__(
+        self,
+        # Mel options,
+        fs: int = 16000,
+        n_fft: int = 512,
+        n_mels: int = 80,
+        fmin: float = 0.0,
+        fmax: float = None,
+        # Normalization
+        stats_file: str = None,
+        apply_uttmvn: bool = True,
+        uttmvn_norm_means: bool = True,
+        uttmvn_norm_vars: bool = False,
+    ):
+        super().__init__()
+        self.apply_uttmvn = apply_uttmvn
+
+        self.logmel = LogMel(fs=fs, n_fft=n_fft, n_mels=n_mels, fmin=fmin, fmax=fmax)
+        self.stats_file = stats_file
+        if stats_file is not None:
+            self.global_mvn = GlobalMVN(stats_file)
+        else:
+            self.global_mvn = None
+
+        if self.apply_uttmvn:
+            self.uttmvn = UtteranceMVN(
+                norm_means=uttmvn_norm_means, norm_vars=uttmvn_norm_vars
+            )
+        else:
+            self.uttmvn = None
+
+    def forward(
+        self, x: ComplexTensor, ilens: Union[torch.LongTensor, np.ndarray, List[int]]
+    ) -> Tuple[torch.Tensor, torch.LongTensor]:
+        # (B, T, F) or (B, T, C, F)
+        if x.dim() not in (3, 4):
+            raise ValueError(f"Input dim must be 3 or 4: {x.dim()}")
+        if not torch.is_tensor(ilens):
+            ilens = torch.from_numpy(np.asarray(ilens)).to(x.device)
+
+        if x.dim() == 4:
+            # h: (B, T, C, F) -> h: (B, T, F)
+            if self.training:
+                # Select 1ch randomly
+                ch = np.random.randint(x.size(2))
+                h = x[:, :, ch, :]
+            else:
+                # Use the first channel
+                h = x[:, :, 0, :]
+        else:
+            h = x
+
+        # h: ComplexTensor(B, T, F) -> torch.Tensor(B, T, F)
+        h = h.real**2 + h.imag**2
+
+        h, _ = self.logmel(h, ilens)
+        if self.stats_file is not None:
+            h, _ = self.global_mvn(h, ilens)
+        if self.apply_uttmvn:
+            h, _ = self.uttmvn(h, ilens)
+
+        return h, ilens
+
+
+class LogMel(torch.nn.Module):
+    """Convert STFT to fbank feats
+
+    The arguments are the same as those of librosa.filters.mel
+
+    Args:
+        fs: number > 0 [scalar] sampling rate of the incoming signal
+        n_fft: int > 0 [scalar] number of FFT components
+        n_mels: int > 0 [scalar] number of Mel bands to generate
+        fmin: float >= 0 [scalar] lowest frequency (in Hz)
+        fmax: float >= 0 [scalar] highest frequency (in Hz).
+            If `None`, use `fmax = fs / 2.0`
+        htk: use HTK formula instead of Slaney
+        norm: {None, 1, np.inf} [scalar]
+            if 1, divide the triangular mel weights by the width of the mel band
+            (area normalization).  Otherwise, leave all the triangles aiming for
+            a peak value of 1.0
+
+    """
+
+    def __init__(
+        self,
+        fs: int = 16000,
+        n_fft: int = 512,
+        n_mels: int = 80,
+        fmin: float = 0.0,
+        fmax: float = None,
+        htk: bool = False,
+        norm=1,
+    ):
+        super().__init__()
+
+        _mel_options = dict(
+            sr=fs, n_fft=n_fft, n_mels=n_mels, fmin=fmin, fmax=fmax, htk=htk, norm=norm
+        )
+        self.mel_options = _mel_options
+
+        # Note(kamo): The mel matrix of librosa is different from kaldi.
+        melmat = librosa.filters.mel(**_mel_options)
+        # melmat: (D2, D1) -> (D1, D2)
+        self.register_buffer("melmat", torch.from_numpy(melmat.T).float())
+
+    def extra_repr(self):
+        return ", ".join(f"{k}={v}" for k, v in self.mel_options.items())
+
+    def forward(
+        self, feat: torch.Tensor, ilens: torch.LongTensor
+    ) -> Tuple[torch.Tensor, torch.LongTensor]:
+        # feat: (B, T, D1) x melmat: (D1, D2) -> mel_feat: (B, T, D2)
+        mel_feat = torch.matmul(feat, self.melmat)
+
+        logmel_feat = (mel_feat + 1e-20).log()
+        # Zero padding
+        logmel_feat = logmel_feat.masked_fill(make_pad_mask(ilens, logmel_feat, 1), 0.0)
+        return logmel_feat, ilens
+
+
+class GlobalMVN(torch.nn.Module):
+    """Apply global mean and variance normalization
+
+    Args:
+        stats_file (str): npy file holding a 1-dim array.
+            The elements from the first one up to
+            the {(len(array) - 1) / 2}th one are treated as
+            the sum of features,
+            the remaining elements except the last one are
+            treated as the sum of the squared features,
+            and the last element equals the number of samples.
+        eps (float): Floor applied to the standard deviation.
+    """
+
+    def __init__(
+        self,
+        stats_file: str,
+        norm_means: bool = True,
+        norm_vars: bool = True,
+        eps: float = 1.0e-20,
+    ):
+        super().__init__()
+        self.norm_means = norm_means
+        self.norm_vars = norm_vars
+
+        self.stats_file = stats_file
+        stats = np.load(stats_file)
+
+        stats = stats.astype(float)
+        assert (len(stats) - 1) % 2 == 0, stats.shape
+
+        count = stats.flatten()[-1]
+        mean = stats[: (len(stats) - 1) // 2] / count
+        var = stats[(len(stats) - 1) // 2 : -1] / count - mean * mean
+        std = np.maximum(np.sqrt(var), eps)
+
+        self.register_buffer("bias", torch.from_numpy(-mean.astype(np.float32)))
+        self.register_buffer("scale", torch.from_numpy(1 / std.astype(np.float32)))
+
+    def extra_repr(self):
+        return (
+            f"stats_file={self.stats_file}, "
+            f"norm_means={self.norm_means}, norm_vars={self.norm_vars}"
+        )
+
+    def forward(
+        self, x: torch.Tensor, ilens: torch.LongTensor
+    ) -> Tuple[torch.Tensor, torch.LongTensor]:
+        # feat: (B, T, D)
+        if self.norm_means:
+            x += self.bias.type_as(x)
+            x.masked_fill_(make_pad_mask(ilens, x, 1), 0.0)
+
+        if self.norm_vars:
+            x *= self.scale.type_as(x)
+        return x, ilens
+
+
+class UtteranceMVN(torch.nn.Module):
+    def __init__(
+        self, norm_means: bool = True, norm_vars: bool = False, eps: float = 1.0e-20
+    ):
+        super().__init__()
+        self.norm_means = norm_means
+        self.norm_vars = norm_vars
+        self.eps = eps
+
+    def extra_repr(self):
+        return f"norm_means={self.norm_means}, norm_vars={self.norm_vars}"
+
+    def forward(
+        self, x: torch.Tensor, ilens: torch.LongTensor
+    ) -> Tuple[torch.Tensor, torch.LongTensor]:
+        return utterance_mvn(
+            x, ilens, norm_means=self.norm_means, norm_vars=self.norm_vars, eps=self.eps
+        )
+
+
+def utterance_mvn(
+    x: torch.Tensor,
+    ilens: torch.LongTensor,
+    norm_means: bool = True,
+    norm_vars: bool = False,
+    eps: float = 1.0e-20,
+) -> Tuple[torch.Tensor, torch.LongTensor]:
+    """Apply utterance mean and variance normalization
+
+    Args:
+        x: (B, T, D), assumed zero padded
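+        ilens: (B,)
+        norm_means: subtract the per-utterance mean
+        norm_vars: divide by the per-utterance standard deviation
+        eps: floor applied to the variance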
+
+    """
+    ilens_ = ilens.type_as(x)
+    # mean: (B, D)
+    mean = x.sum(dim=1) / ilens_[:, None]
+
+    if norm_means:
+        x -= mean[:, None, :]
+        x_ = x
+    else:
+        x_ = x - mean[:, None, :]
+
+    # Zero padding
+    x_.masked_fill_(make_pad_mask(ilens, x_, 1), 0.0)
+    if norm_vars:
+        var = x_.pow(2).sum(dim=1) / ilens_[:, None]
+        var = torch.clamp(var, min=eps)
+        x /= var.sqrt()[:, None, :]
+        x_ = x
+    return x_, ilens
+
+
+def feature_transform_for(args, n_fft):
+    return FeatureTransform(
+        # Mel options,
+        fs=args.fbank_fs,
+        n_fft=n_fft,
+        n_mels=args.n_mels,
+        fmin=args.fbank_fmin,
+        fmax=args.fbank_fmax,
+        # Normalization
+        stats_file=args.stats_file,
+        apply_uttmvn=args.apply_uttmvn,
+        uttmvn_norm_means=args.uttmvn_norm_means,
+        uttmvn_norm_vars=args.uttmvn_norm_vars,
+    )
diff --git a/funasr/modules/frontends/frontend.py b/funasr/modules/frontends/frontend.py
new file mode 100644
index 000000000..ab5ea3b79
--- /dev/null
+++ b/funasr/modules/frontends/frontend.py
@@ -0,0 +1,151 @@
+from typing import List
+from typing import Optional
+from typing import Tuple
+from typing import Union
+
+import numpy
+import torch
+import torch.nn as nn
+from torch_complex.tensor import ComplexTensor
+
+from funasr.modules.frontends.dnn_beamformer import DNN_Beamformer
+from funasr.modules.frontends.dnn_wpe import DNN_WPE
+
+
+class Frontend(nn.Module):
+    def __init__(
+        self,
+        idim: int,
+        # WPE options
+        use_wpe: bool = False,
+        wtype: str = "blstmp",
+        wlayers: int = 3,
+        wunits: int = 300,
+        wprojs: int = 320,
+        wdropout_rate: float = 0.0,
+        taps: int = 5,
+        delay: int = 3,
+        use_dnn_mask_for_wpe: bool = True,
+        # Beamformer options
+        use_beamformer: bool = False,
+        btype: str = "blstmp",
+        blayers: int = 3,
+        bunits: int = 300,
+        bprojs: int = 320,
+        bnmask: int = 2,
+        badim: int = 320,
+        ref_channel: int = -1,
+        bdropout_rate: float = 0.0,
+    ):
+        super().__init__()
+
+        self.use_beamformer = use_beamformer
+        self.use_wpe = use_wpe
+        self.use_dnn_mask_for_wpe = use_dnn_mask_for_wpe
+        # use frontend for all the data,
+        # e.g. in the case of multi-speaker speech separation
+        self.use_frontend_for_all = bnmask > 2
+
+        if self.use_wpe:
+            if self.use_dnn_mask_for_wpe:
+                # Use DNN for power estimation
+                # (No significant gains were observed)
+                iterations = 1
+            else:
+                # Behaves as conventional WPE, without the DNN estimator
+                iterations = 2
+
+            self.wpe = DNN_WPE(
+                wtype=wtype,
+                widim=idim,
+                wunits=wunits,
+                wprojs=wprojs,
+                wlayers=wlayers,
+                taps=taps,
+                delay=delay,
+                dropout_rate=wdropout_rate,
+                iterations=iterations,
+                use_dnn_mask=use_dnn_mask_for_wpe,
+            )
+        else:
+            self.wpe = None
+
+        if self.use_beamformer:
+            self.beamformer = DNN_Beamformer(
+                btype=btype,
+                bidim=idim,
+                bunits=bunits,
+                bprojs=bprojs,
+                blayers=blayers,
+                bnmask=bnmask,
+                dropout_rate=bdropout_rate,
+                badim=badim,
+                ref_channel=ref_channel,
+            )
+        else:
+            self.beamformer = None
+
+    def forward(
+        self, x: ComplexTensor, ilens: Union[torch.LongTensor, numpy.ndarray, List[int]]
+    ) -> Tuple[ComplexTensor, torch.LongTensor, Optional[ComplexTensor]]:
+        assert len(x) == len(ilens), (len(x), len(ilens))
+        # (B, T, F) or (B, T, C, F)
+        if x.dim() not in (3, 4):
+            raise ValueError(f"Input dim must be 3 or 4: {x.dim()}")
+        if not torch.is_tensor(ilens):
+            ilens = torch.from_numpy(numpy.asarray(ilens)).to(x.device)
+
+        mask = None
+        h = x
+        if h.dim() == 4:
+            if self.training:
+                choices = [(False, False)] if not self.use_frontend_for_all else []
+                if self.use_wpe:
+                    choices.append((True, False))
+
+                if self.use_beamformer:
+                    choices.append((False, True))
+
+                use_wpe, use_beamformer = choices[numpy.random.randint(len(choices))]
+
+            else:
+                use_wpe = self.use_wpe
+                use_beamformer = self.use_beamformer
+
+            # 1. WPE
+            if use_wpe:
+                # h: (B, T, C, F) -> h: (B, T, C, F)
+                h, ilens, mask = self.wpe(h, ilens)
+
+            # 2. Beamformer
+            if use_beamformer:
+                # h: (B, T, C, F) -> h: (B, T, F)
+                h, ilens, mask = self.beamformer(h, ilens)
+
+        return h, ilens, mask
+
+
+def frontend_for(args, idim):
+    return Frontend(
+        idim=idim,
+        # WPE options
+        use_wpe=args.use_wpe,
+        wtype=args.wtype,
+        wlayers=args.wlayers,
+        wunits=args.wunits,
+        wprojs=args.wprojs,
+        wdropout_rate=args.wdropout_rate,
+        taps=args.wpe_taps,
+        delay=args.wpe_delay,
+        use_dnn_mask_for_wpe=args.use_dnn_mask_for_wpe,
+        # Beamformer options
+        use_beamformer=args.use_beamformer,
+        btype=args.btype,
+        blayers=args.blayers,
+        bunits=args.bunits,
+        bprojs=args.bprojs,
+        bnmask=args.bnmask,
+        badim=args.badim,
+        ref_channel=args.ref_channel,
+        bdropout_rate=args.bdropout_rate,
+    )
diff --git a/funasr/modules/frontends/mask_estimator.py b/funasr/modules/frontends/mask_estimator.py
new file mode 100644
index 000000000..cf4385769
--- /dev/null
+++ b/funasr/modules/frontends/mask_estimator.py
@@ -0,0 +1,77 @@
+from typing import Tuple
+
+import numpy as np
+import torch
+from torch.nn import functional as F
+from torch_complex.tensor import ComplexTensor
+
+from funasr.modules.nets_utils import make_pad_mask
+from funasr.modules.rnn.encoders import RNN
+from funasr.modules.rnn.encoders import RNNP
+
+
+class MaskEstimator(torch.nn.Module):
+    def __init__(self, type, idim, layers, units, projs, dropout, nmask=1):
+        super().__init__()
+        subsample = np.ones(layers + 1, dtype=int)
+
+        typ = (type[3:] if type.startswith("vgg") else type).rstrip("p")
+        if type[-1] == "p":
+            self.brnn = RNNP(idim, layers, units, projs, subsample, dropout, typ=typ)
+        else:
+            self.brnn = RNN(idim, layers, units, projs, dropout, typ=typ)
+
+        self.type = type
+        self.nmask = nmask
+        self.linears = torch.nn.ModuleList(
+            [torch.nn.Linear(projs, idim) for _ in range(nmask)]
+        )
+
+    def forward(
+        self, xs: ComplexTensor, ilens: torch.LongTensor
+    ) -> Tuple[Tuple[torch.Tensor, ...], torch.LongTensor]:
+        """The forward function
+
+        Args:
+            xs: (B, F, C, T)
+            ilens: (B,)
+        Returns:
+            masks: A tuple of ``nmask`` masks,
+                each of shape (B, F, C, T)
+            ilens: (B,)
+        """
+        assert xs.size(0) == ilens.size(0), (xs.size(0), ilens.size(0))
+        _, _, C, input_length = xs.size()
+        # (B, F, C, T) -> (B, C, T, F)
+        xs = xs.permute(0, 2, 3, 1)
+
+        # Calculate amplitude: (B, C, T, F) -> (B, C, T, F)
+        xs = (xs.real**2 + xs.imag**2) ** 0.5
+        # xs: (B, C, T, F) -> xs: (B * C, T, F)
+        xs = xs.contiguous().view(-1, xs.size(-2), xs.size(-1))
+        # ilens: (B,) -> ilens_: (B * C)
+        ilens_ = ilens[:, None].expand(-1, C).contiguous().view(-1)
+
+        # xs: (B * C, T, F) -> xs: (B * C, T, D)
+        xs, _, _ = self.brnn(xs, ilens_)
+        # xs: (B * C, T, D) -> xs: (B, C, T, D)
+        xs = xs.view(-1, C, xs.size(-2), xs.size(-1))
+
+        masks = []
+        for linear in self.linears:
+            # xs: (B, C, T, D) -> mask:(B, C, T, F)
+            mask = linear(xs)
+
+            mask = torch.sigmoid(mask)
+            # Zero padding
+            mask = mask.masked_fill(make_pad_mask(ilens, mask, length_dim=2), 0)
+
+            # (B, C, T, F) -> (B, F, C, T)
+            mask = mask.permute(0, 3, 1, 2)
+
+            # Handle the multi-GPU case: input_length > max(ilens)
+            if mask.size(-1) < input_length:
+                mask = F.pad(mask, [0, input_length - mask.size(-1)], value=0)
+            masks.append(mask)
+
+        return tuple(masks), ilens
diff --git a/funasr/modules/layer_norm.py b/funasr/modules/layer_norm.py
new file mode 100644
index 000000000..6e934e644
--- /dev/null
+++ b/funasr/modules/layer_norm.py
@@ -0,0 +1,42 @@
+#!/usr/bin/env python3
+# -*- coding: utf-8 -*-
+
+# Copyright 2019 Shigeki Karita
+#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)
+
+"""Layer normalization module."""
+
+import torch
+
+
+class LayerNorm(torch.nn.LayerNorm):
+    """Layer normalization module.
+
+    Args:
+        nout (int): Output dim size.
+        dim (int): Dimension to be normalized.
+
+    """
+
+    def __init__(self, nout, dim=-1):
+        """Construct an LayerNorm object."""
+        super(LayerNorm, self).__init__(nout, eps=1e-12)
+        self.dim = dim
+
+    def forward(self, x):
+        """Apply layer normalization.
+
+        Args:
+            x (torch.Tensor): Input tensor.
+
+        Returns:
+            torch.Tensor: Normalized tensor.
+
+        """
+        if self.dim == -1:
+            return super(LayerNorm, self).forward(x)
+        return (
+            super(LayerNorm, self)
+            .forward(x.transpose(self.dim, -1))
+            .transpose(self.dim, -1)
+        )
diff --git a/funasr/modules/lightconv.py b/funasr/modules/lightconv.py
new file mode 100644
index 000000000..b24940259
--- /dev/null
+++ b/funasr/modules/lightconv.py
@@ -0,0 +1,112 @@
+"""Lightweight Convolution Module."""
+
+import numpy
+import torch
+from torch import nn
+import torch.nn.functional as F
+
+
+MIN_VALUE = float(numpy.finfo(numpy.float32).min)
+
+
+class LightweightConvolution(nn.Module):
+    """Lightweight Convolution layer.
+
+    This implementation is based on
+    https://github.com/pytorch/fairseq/tree/master/fairseq
+
+    Args:
+        wshare (int): the number of convolution kernels
+        n_feat (int): the number of features
+        dropout_rate (float): dropout_rate
+        kernel_size (int): kernel size (length)
+        use_kernel_mask (bool): Use causal mask or not for convolution kernel
+        use_bias (bool): Use bias term or not.
+
+    """
+
+    def __init__(
+        self,
+        wshare,
+        n_feat,
+        dropout_rate,
+        kernel_size,
+        use_kernel_mask=False,
+        use_bias=False,
+    ):
+        """Construct Lightweight Convolution layer."""
+        super(LightweightConvolution, self).__init__()
+
+        assert n_feat % wshare == 0
+        self.wshare = wshare
+        self.use_kernel_mask = use_kernel_mask
+        self.dropout_rate = dropout_rate
+        self.kernel_size = kernel_size
+        self.padding_size = int(kernel_size / 2)
+
+        # linear -> GLU -> lightconv -> linear
+        self.linear1 = nn.Linear(n_feat, n_feat * 2)
+        self.linear2 = nn.Linear(n_feat, n_feat)
+        self.act = nn.GLU()
+
+        # lightconv related
+        self.weight = nn.Parameter(
+            torch.Tensor(self.wshare, 1, kernel_size).uniform_(0, 1)
+        )
+        self.use_bias = use_bias
+        if self.use_bias:
+            self.bias = nn.Parameter(torch.zeros(n_feat))
+
+        # mask of kernel
+        kernel_mask0 = torch.zeros(self.wshare, int(kernel_size / 2))
+        kernel_mask1 = torch.ones(self.wshare, int(kernel_size / 2 + 1))
+        self.kernel_mask = torch.cat((kernel_mask1, kernel_mask0), dim=-1).unsqueeze(1)
+
+    def forward(self, query, key, value, mask):
+        """Forward of 'Lightweight Convolution'.
+
+        This function takes query, key and value but uses only query.
+        This is just for compatibility with self-attention layer (attention.py)
+
+        Args:
+            query (torch.Tensor): (batch, time1, d_model) input tensor
+            key (torch.Tensor): (batch, time2, d_model) NOT USED
+            value (torch.Tensor): (batch, time2, d_model) NOT USED
+            mask (torch.Tensor): (batch, time1, time2) mask
+
+        Return:
+            x (torch.Tensor): (batch, time1, d_model) output
+
+        """
+        # linear -> GLU -> lightconv -> linear
+        x = query
+        B, T, C = x.size()
+        H = self.wshare
+
+        # first linear layer
+        x = self.linear1(x)
+
+        # GLU activation
+        x = self.act(x)
+
+        # lightconv
+        x = x.transpose(1, 2).contiguous().view(-1, H, T)  # (B * C / H) x H x T
+        weight = F.dropout(self.weight, self.dropout_rate, training=self.training)
+        if self.use_kernel_mask:
+            self.kernel_mask = self.kernel_mask.to(x.device)
+            weight = weight.masked_fill(self.kernel_mask == 0.0, float("-inf"))
+        weight = F.softmax(weight, dim=-1)
+        x = F.conv1d(x, weight, padding=self.padding_size, groups=self.wshare).view(
+            B, C, T
+        )
+        if self.use_bias:
+            x = x + self.bias.view(1, -1, 1)
+        x = x.transpose(1, 2)  # B x T x C
+
+        if mask is not None and not self.use_kernel_mask:
+            mask = mask.transpose(-1, -2)
+            x = x.masked_fill(mask == 0, 0.0)
+
+        # second linear layer
+        x = self.linear2(x)
+        return x
diff --git a/funasr/modules/lightconv2d.py b/funasr/modules/lightconv2d.py
new file mode 100644
index 000000000..294d23244
--- /dev/null
+++ b/funasr/modules/lightconv2d.py
@@ -0,0 +1,124 @@
+"""Lightweight 2-Dimensional Convolution module."""
+
+import numpy
+import torch
+from torch import nn
+import torch.nn.functional as F
+
+
+MIN_VALUE = float(numpy.finfo(numpy.float32).min)
+
+
+class LightweightConvolution2D(nn.Module):
+    """Lightweight 2-Dimensional Convolution layer.
+
+    This implementation is based on
+    https://github.com/pytorch/fairseq/tree/master/fairseq
+
+    Args:
+        wshare (int): the number of convolution kernels
+        n_feat (int): the number of features
+        dropout_rate (float): dropout_rate
+        kernel_size (int): kernel size (length)
+        use_kernel_mask (bool): Use causal mask or not for convolution kernel
+        use_bias (bool): Use bias term or not.
+
+    """
+
+    def __init__(
+        self,
+        wshare,
+        n_feat,
+        dropout_rate,
+        kernel_size,
+        use_kernel_mask=False,
+        use_bias=False,
+    ):
+        """Construct Lightweight 2-Dimensional Convolution layer."""
+        super(LightweightConvolution2D, self).__init__()
+
+        assert n_feat % wshare == 0
+        self.wshare = wshare
+        self.use_kernel_mask = use_kernel_mask
+        self.dropout_rate = dropout_rate
+        self.kernel_size = kernel_size
+        self.padding_size = int(kernel_size / 2)
+
+        # linear -> GLU -> lightconv -> linear
+        self.linear1 = nn.Linear(n_feat, n_feat * 2)
+        self.linear2 = nn.Linear(n_feat * 2, n_feat)
+        self.act = nn.GLU()
+
+        # lightconv related
+        self.weight = nn.Parameter(
+            torch.Tensor(self.wshare, 1, kernel_size).uniform_(0, 1)
+        )
+        self.weight_f = nn.Parameter(torch.Tensor(1, 1, kernel_size).uniform_(0, 1))
+        self.use_bias = use_bias
+        if self.use_bias:
+            self.bias = nn.Parameter(torch.zeros(n_feat))
+
+        # mask of kernel
+        kernel_mask0 = torch.zeros(self.wshare, int(kernel_size / 2))
+        kernel_mask1 = torch.ones(self.wshare, int(kernel_size / 2 + 1))
+        self.kernel_mask = torch.cat((kernel_mask1, kernel_mask0), dim=-1).unsqueeze(1)
+
+    def forward(self, query, key, value, mask):
+        """Forward of 'Lightweight 2-Dimensional Convolution'.
+
+        This function takes query, key and value but uses only query.
+        This is just for compatibility with self-attention layer (attention.py)
+
+        Args:
+            query (torch.Tensor): (batch, time1, d_model) input tensor
+            key (torch.Tensor): (batch, time2, d_model) NOT USED
+            value (torch.Tensor): (batch, time2, d_model) NOT USED
+            mask (torch.Tensor): (batch, time1, time2) mask
+
+        Return:
+            x (torch.Tensor): (batch, time1, d_model) output
+
+        """
+        # linear -> GLU -> lightconv -> linear
+        x = query
+        B, T, C = x.size()
+        H = self.wshare
+
+        # first linear layer
+        x = self.linear1(x)
+
+        # GLU activation
+        x = self.act(x)
+
+        # convolution along frequency axis
+        weight_f = F.softmax(self.weight_f, dim=-1)
+        weight_f = F.dropout(weight_f, self.dropout_rate, training=self.training)
+        weight_new = torch.zeros(
+            B * T, 1, self.kernel_size, device=x.device, dtype=x.dtype
+        ).copy_(weight_f)
+        xf = F.conv1d(
+            x.view(1, B * T, C), weight_new, padding=self.padding_size, groups=B * T
+        ).view(B, T, C)
+
+        # lightconv
+        x = x.transpose(1, 2).contiguous().view(-1, H, T)  # (B * C / H) x H x T
+        weight = F.dropout(self.weight, self.dropout_rate, training=self.training)
+        if self.use_kernel_mask:
+            self.kernel_mask = self.kernel_mask.to(x.device)
+            weight = weight.masked_fill(self.kernel_mask == 0.0, float("-inf"))
+        weight = F.softmax(weight, dim=-1)
+        x = F.conv1d(x, weight, padding=self.padding_size, groups=self.wshare).view(
+            B, C, T
+        )
+        if self.use_bias:
+            x = x + self.bias.view(1, -1, 1)
+        x = x.transpose(1, 2)  # B x T x C
+        x = torch.cat((x, xf), -1)  # B x T x Cx2
+
+        if mask is not None and not self.use_kernel_mask:
+            mask = mask.transpose(-1, -2)
+            x = x.masked_fill(mask == 0, 0.0)
+
+        # second linear layer
+        x = self.linear2(x)
+        return x
diff --git a/funasr/modules/mask.py b/funasr/modules/mask.py
new file mode 100644
index 000000000..8f068e11c
--- /dev/null
+++ b/funasr/modules/mask.py
@@ -0,0 +1,35 @@
+# Copyright 2019 Shigeki Karita
+#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)
+
+"""Mask module."""
+
+import torch
+
+
+def subsequent_mask(size, device="cpu", dtype=torch.bool):
+    """Create mask for subsequent steps (size, size).
+
+    :param int size: size of mask
+    :param str device: "cpu" or "cuda" or torch.Tensor.device
+    :param torch.dtype dtype: result dtype
+    :rtype: torch.Tensor
+    >>> subsequent_mask(3)
+    [[1, 0, 0],
+     [1, 1, 0],
+     [1, 1, 1]]
+    """
+    ret = torch.ones(size, size, device=device, dtype=dtype)
+    return torch.tril(ret, out=ret)
+
+
+def target_mask(ys_in_pad, ignore_id):
+    """Create mask for decoder self-attention.
+
+    :param torch.Tensor ys_in_pad: batch of padded target sequences (B, Lmax)
+    :param int ignore_id: index of padding
+    :return: boolean mask, True at positions that may be attended to
+    :rtype: torch.Tensor (B, Lmax, Lmax)
+    """
+    ys_mask = ys_in_pad != ignore_id
+    m = subsequent_mask(ys_mask.size(-1), device=ys_mask.device).unsqueeze(0)
+    return ys_mask.unsqueeze(-2) & m
diff --git a/funasr/modules/multi_layer_conv.py b/funasr/modules/multi_layer_conv.py
new file mode 100644
index 000000000..5fb0717b0
--- /dev/null
+++ b/funasr/modules/multi_layer_conv.py
@@ -0,0 +1,105 @@
+#!/usr/bin/env python3
+# -*- coding: utf-8 -*-
+
+# Copyright 2019 Tomoki Hayashi
+#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)
+
+"""Layer modules for FFT block in FastSpeech (Feed-forward Transformer)."""
+
+import torch
+
+
+class MultiLayeredConv1d(torch.nn.Module):
+    """Multi-layered conv1d for Transformer block.
+
+    This is a module of multi-layered conv1d designed
+    to replace the positionwise feed-forward network
+    in the Transformer block, which is introduced in
+    `FastSpeech: Fast, Robust and Controllable Text to Speech`_.
+
+    .. _`FastSpeech: Fast, Robust and Controllable Text to Speech`:
+        https://arxiv.org/pdf/1905.09263.pdf
+
+    """
+
+    def __init__(self, in_chans, hidden_chans, kernel_size, dropout_rate):
+        """Initialize MultiLayeredConv1d module.
+
+        Args:
+            in_chans (int): Number of input channels.
+            hidden_chans (int): Number of hidden channels.
+            kernel_size (int): Kernel size of conv1d.
+            dropout_rate (float): Dropout rate.
+
+        """
+        super(MultiLayeredConv1d, self).__init__()
+        self.w_1 = torch.nn.Conv1d(
+            in_chans,
+            hidden_chans,
+            kernel_size,
+            stride=1,
+            padding=(kernel_size - 1) // 2,
+        )
+        self.w_2 = torch.nn.Conv1d(
+            hidden_chans,
+            in_chans,
+            kernel_size,
+            stride=1,
+            padding=(kernel_size - 1) // 2,
+        )
+        self.dropout = torch.nn.Dropout(dropout_rate)
+
+    def forward(self, x):
+        """Calculate forward propagation.
+
+        Args:
+            x (torch.Tensor): Batch of input tensors (B, T, in_chans).
+
+        Returns:
+            torch.Tensor: Batch of output tensors (B, T, in_chans).
+
+        """
+        x = torch.relu(self.w_1(x.transpose(-1, 1))).transpose(-1, 1)
+        return self.w_2(self.dropout(x).transpose(-1, 1)).transpose(-1, 1)
+
+
+class Conv1dLinear(torch.nn.Module):
+    """Conv1D + Linear for Transformer block.
+
+    A variant of MultiLayeredConv1d, which replaces second conv-layer to linear.
+
+    """
+
+    def __init__(self, in_chans, hidden_chans, kernel_size, dropout_rate):
+        """Initialize Conv1dLinear module.
+
+        Args:
+            in_chans (int): Number of input channels.
+            hidden_chans (int): Number of hidden channels.
+            kernel_size (int): Kernel size of conv1d.
+            dropout_rate (float): Dropout rate.
+
+        """
+        super(Conv1dLinear, self).__init__()
+        self.w_1 = torch.nn.Conv1d(
+            in_chans,
+            hidden_chans,
+            kernel_size,
+            stride=1,
+            padding=(kernel_size - 1) // 2,
+        )
+        self.w_2 = torch.nn.Linear(hidden_chans, in_chans)
+        self.dropout = torch.nn.Dropout(dropout_rate)
+
+    def forward(self, x):
+        """Calculate forward propagation.
+
+        Args:
+            x (torch.Tensor): Batch of input tensors (B, T, in_chans).
+
+        Returns:
+            torch.Tensor: Batch of output tensors (B, T, in_chans).
+
+        """
+        x = torch.relu(self.w_1(x.transpose(-1, 1))).transpose(-1, 1)
+        return self.w_2(self.dropout(x))
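Note on `padding=(kernel_size - 1) // 2`: this choice keeps the time dimension unchanged only for odd kernel sizes. The sketch below applies the standard 1-D convolution output-length formula to check this; `conv1d_out_len` and `same_padding` are hypothetical helper names used for illustration only.

```python
def conv1d_out_len(length, kernel_size, padding, stride=1, dilation=1):
    """Output length of a 1-D convolution (standard formula)."""
    return (length + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1


def same_padding(kernel_size):
    """Padding used by both conv layers in this module."""
    return (kernel_size - 1) // 2


# Output length for an input of 100 frames, per kernel size.
lengths = {k: conv1d_out_len(100, k, same_padding(k)) for k in (1, 3, 4, 5)}
```

With an even kernel such as 4, each convolution drops one frame, so the documented `(B, T, ...)` output shape assumes an odd `kernel_size`.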
diff --git a/funasr/modules/nets_utils.py b/funasr/modules/nets_utils.py
new file mode 100644
index 000000000..6d77d69a6
--- /dev/null
+++ b/funasr/modules/nets_utils.py
@@ -0,0 +1,508 @@
+# -*- coding: utf-8 -*-
+
+"""Network related utility tools."""
+
+import logging
+from typing import Dict
+
+import numpy as np
+import torch
+
+
+def to_device(m, x):
+    """Send tensor into the device of the module.
+
+    Args:
+        m (torch.nn.Module or torch.Tensor): Module or tensor whose device is used.
+        x (Tensor): Torch tensor.
+
+    Returns:
+        Tensor: Torch tensor located in the same place as torch module.
+
+    """
+    if isinstance(m, torch.nn.Module):
+        device = next(m.parameters()).device
+    elif isinstance(m, torch.Tensor):
+        device = m.device
+    else:
+        raise TypeError(
+            "Expected torch.nn.Module or torch.Tensor, " f"but got: {type(m)}"
+        )
+    return x.to(device)
+
+
+def pad_list(xs, pad_value):
+    """Perform padding for the list of tensors.
+
+    Args:
+        xs (List): List of Tensors [(T_1, `*`), (T_2, `*`), ..., (T_B, `*`)].
+        pad_value (float): Value for padding.
+
+    Returns:
+        Tensor: Padded tensor (B, Tmax, `*`).
+
+    Examples:
+        >>> x = [torch.ones(4), torch.ones(2), torch.ones(1)]
+        >>> x
+        [tensor([1., 1., 1., 1.]), tensor([1., 1.]), tensor([1.])]
+        >>> pad_list(x, 0)
+        tensor([[1., 1., 1., 1.],
+                [1., 1., 0., 0.],
+                [1., 0., 0., 0.]])
+
+    """
+    n_batch = len(xs)
+    max_len = max(x.size(0) for x in xs)
+    pad = xs[0].new(n_batch, max_len, *xs[0].size()[1:]).fill_(pad_value)
+
+    for i in range(n_batch):
+        pad[i, : xs[i].size(0)] = xs[i]
+
+    return pad
+
+
+def make_pad_mask(lengths, xs=None, length_dim=-1, maxlen=None):
+    """Make mask tensor containing indices of padded part.
+
+    Args:
+        lengths (LongTensor or List): Batch of lengths (B,).
+        xs (Tensor, optional): The reference tensor.
+            If set, masks will be the same shape as this tensor.
+        length_dim (int, optional): Dimension indicator of the above tensor.
+            See the example.
+
+    Returns:
+        Tensor: Mask tensor containing indices of padded part.
+                dtype=torch.uint8 in PyTorch 1.2-
+                dtype=torch.bool in PyTorch 1.2+ (including 1.2)
+
+    Examples:
+        With only lengths.
+
+        >>> lengths = [5, 3, 2]
+        >>> make_pad_mask(lengths)
+        masks = [[0, 0, 0, 0, 0],
+                 [0, 0, 0, 1, 1],
+                 [0, 0, 1, 1, 1]]
+
+        With the reference tensor.
+
+        >>> xs = torch.zeros((3, 2, 4))
+        >>> make_pad_mask(lengths, xs)
+        tensor([[[0, 0, 0, 0],
+                 [0, 0, 0, 0]],
+                [[0, 0, 0, 1],
+                 [0, 0, 0, 1]],
+                [[0, 0, 1, 1],
+                 [0, 0, 1, 1]]], dtype=torch.uint8)
+        >>> xs = torch.zeros((3, 2, 6))
+        >>> make_pad_mask(lengths, xs)
+        tensor([[[0, 0, 0, 0, 0, 1],
+                 [0, 0, 0, 0, 0, 1]],
+                [[0, 0, 0, 1, 1, 1],
+                 [0, 0, 0, 1, 1, 1]],
+                [[0, 0, 1, 1, 1, 1],
+                 [0, 0, 1, 1, 1, 1]]], dtype=torch.uint8)
+
+        With the reference tensor and dimension indicator.
+
+        >>> xs = torch.zeros((3, 6, 6))
+        >>> make_pad_mask(lengths, xs, 1)
+        tensor([[[0, 0, 0, 0, 0, 0],
+                 [0, 0, 0, 0, 0, 0],
+                 [0, 0, 0, 0, 0, 0],
+                 [0, 0, 0, 0, 0, 0],
+                 [0, 0, 0, 0, 0, 0],
+                 [1, 1, 1, 1, 1, 1]],
+                [[0, 0, 0, 0, 0, 0],
+                 [0, 0, 0, 0, 0, 0],
+                 [0, 0, 0, 0, 0, 0],
+                 [1, 1, 1, 1, 1, 1],
+                 [1, 1, 1, 1, 1, 1],
+                 [1, 1, 1, 1, 1, 1]],
+                [[0, 0, 0, 0, 0, 0],
+                 [0, 0, 0, 0, 0, 0],
+                 [1, 1, 1, 1, 1, 1],
+                 [1, 1, 1, 1, 1, 1],
+                 [1, 1, 1, 1, 1, 1],
+                 [1, 1, 1, 1, 1, 1]]], dtype=torch.uint8)
+        >>> make_pad_mask(lengths, xs, 2)
+        tensor([[[0, 0, 0, 0, 0, 1],
+                 [0, 0, 0, 0, 0, 1],
+                 [0, 0, 0, 0, 0, 1],
+                 [0, 0, 0, 0, 0, 1],
+                 [0, 0, 0, 0, 0, 1],
+                 [0, 0, 0, 0, 0, 1]],
+                [[0, 0, 0, 1, 1, 1],
+                 [0, 0, 0, 1, 1, 1],
+                 [0, 0, 0, 1, 1, 1],
+                 [0, 0, 0, 1, 1, 1],
+                 [0, 0, 0, 1, 1, 1],
+                 [0, 0, 0, 1, 1, 1]],
+                [[0, 0, 1, 1, 1, 1],
+                 [0, 0, 1, 1, 1, 1],
+                 [0, 0, 1, 1, 1, 1],
+                 [0, 0, 1, 1, 1, 1],
+                 [0, 0, 1, 1, 1, 1],
+                 [0, 0, 1, 1, 1, 1]]], dtype=torch.uint8)
+
+    """
+    if length_dim == 0:
+        raise ValueError("length_dim cannot be 0: {}".format(length_dim))
+
+    if not isinstance(lengths, list):
+        lengths = lengths.tolist()
+    bs = int(len(lengths))
+    if maxlen is None:
+        if xs is None:
+            maxlen = int(max(lengths))
+        else:
+            maxlen = xs.size(length_dim)
+    else:
+        assert xs is None
+        assert maxlen >= int(max(lengths))
+
+    seq_range = torch.arange(0, maxlen, dtype=torch.int64)
+    seq_range_expand = seq_range.unsqueeze(0).expand(bs, maxlen)
+    seq_length_expand = seq_range_expand.new(lengths).unsqueeze(-1)
+    mask = seq_range_expand >= seq_length_expand
+
+    if xs is not None:
+        assert xs.size(0) == bs, (xs.size(0), bs)
+
+        if length_dim < 0:
+            length_dim = xs.dim() + length_dim
+        # ind = (:, None, ..., None, :, , None, ..., None)
+        ind = tuple(
+            slice(None) if i in (0, length_dim) else None for i in range(xs.dim())
+        )
+        mask = mask[ind].expand_as(xs).to(xs.device)
+    return mask
+
+
+def make_non_pad_mask(lengths, xs=None, length_dim=-1):
+    """Make mask tensor containing indices of non-padded part.
+
+    Args:
+        lengths (LongTensor or List): Batch of lengths (B,).
+        xs (Tensor, optional): The reference tensor.
+            If set, masks will be the same shape as this tensor.
+        length_dim (int, optional): Dimension indicator of the above tensor.
+            See the example.
+
+    Returns:
+        Tensor: Mask tensor containing indices of non-padded part.
+                    dtype=torch.uint8 in PyTorch 1.2-
+                    dtype=torch.bool in PyTorch 1.2+ (including 1.2)
+
+    Examples:
+        With only lengths.
+
+        >>> lengths = [5, 3, 2]
+        >>> make_non_pad_mask(lengths)
+        masks = [[1, 1, 1, 1, 1],
+                 [1, 1, 1, 0, 0],
+                 [1, 1, 0, 0, 0]]
+
+        With the reference tensor.
+
+        >>> xs = torch.zeros((3, 2, 4))
+        >>> make_non_pad_mask(lengths, xs)
+        tensor([[[1, 1, 1, 1],
+                 [1, 1, 1, 1]],
+                [[1, 1, 1, 0],
+                 [1, 1, 1, 0]],
+                [[1, 1, 0, 0],
+                 [1, 1, 0, 0]]], dtype=torch.uint8)
+        >>> xs = torch.zeros((3, 2, 6))
+        >>> make_non_pad_mask(lengths, xs)
+        tensor([[[1, 1, 1, 1, 1, 0],
+                 [1, 1, 1, 1, 1, 0]],
+                [[1, 1, 1, 0, 0, 0],
+                 [1, 1, 1, 0, 0, 0]],
+                [[1, 1, 0, 0, 0, 0],
+                 [1, 1, 0, 0, 0, 0]]], dtype=torch.uint8)
+
+        With the reference tensor and dimension indicator.
+
+        >>> xs = torch.zeros((3, 6, 6))
+        >>> make_non_pad_mask(lengths, xs, 1)
+        tensor([[[1, 1, 1, 1, 1, 1],
+                 [1, 1, 1, 1, 1, 1],
+                 [1, 1, 1, 1, 1, 1],
+                 [1, 1, 1, 1, 1, 1],
+                 [1, 1, 1, 1, 1, 1],
+                 [0, 0, 0, 0, 0, 0]],
+                [[1, 1, 1, 1, 1, 1],
+                 [1, 1, 1, 1, 1, 1],
+                 [1, 1, 1, 1, 1, 1],
+                 [0, 0, 0, 0, 0, 0],
+                 [0, 0, 0, 0, 0, 0],
+                 [0, 0, 0, 0, 0, 0]],
+                [[1, 1, 1, 1, 1, 1],
+                 [1, 1, 1, 1, 1, 1],
+                 [0, 0, 0, 0, 0, 0],
+                 [0, 0, 0, 0, 0, 0],
+                 [0, 0, 0, 0, 0, 0],
+                 [0, 0, 0, 0, 0, 0]]], dtype=torch.uint8)
+        >>> make_non_pad_mask(lengths, xs, 2)
+        tensor([[[1, 1, 1, 1, 1, 0],
+                 [1, 1, 1, 1, 1, 0],
+                 [1, 1, 1, 1, 1, 0],
+                 [1, 1, 1, 1, 1, 0],
+                 [1, 1, 1, 1, 1, 0],
+                 [1, 1, 1, 1, 1, 0]],
+                [[1, 1, 1, 0, 0, 0],
+                 [1, 1, 1, 0, 0, 0],
+                 [1, 1, 1, 0, 0, 0],
+                 [1, 1, 1, 0, 0, 0],
+                 [1, 1, 1, 0, 0, 0],
+                 [1, 1, 1, 0, 0, 0]],
+                [[1, 1, 0, 0, 0, 0],
+                 [1, 1, 0, 0, 0, 0],
+                 [1, 1, 0, 0, 0, 0],
+                 [1, 1, 0, 0, 0, 0],
+                 [1, 1, 0, 0, 0, 0],
+                 [1, 1, 0, 0, 0, 0]]], dtype=torch.uint8)
+
+    """
+    return ~make_pad_mask(lengths, xs, length_dim)
+
+
+def mask_by_length(xs, lengths, fill=0):
+    """Mask tensor according to length.
+
+    Args:
+        xs (Tensor): Batch of input tensor (B, `*`).
+        lengths (LongTensor or List): Batch of lengths (B,).
+        fill (int or float): Value to fill masked part.
+
+    Returns:
+        Tensor: Batch of masked input tensor (B, `*`).
+
+    Examples:
+        >>> x = torch.arange(5).repeat(3, 1) + 1
+        >>> x
+        tensor([[1, 2, 3, 4, 5],
+                [1, 2, 3, 4, 5],
+                [1, 2, 3, 4, 5]])
+        >>> lengths = [5, 3, 2]
+        >>> mask_by_length(x, lengths)
+        tensor([[1, 2, 3, 4, 5],
+                [1, 2, 3, 0, 0],
+                [1, 2, 0, 0, 0]])
+
+    """
+    assert xs.size(0) == len(lengths)
+    ret = xs.data.new(*xs.size()).fill_(fill)
+    for i, l in enumerate(lengths):
+        ret[i, :l] = xs[i, :l]
+    return ret
+
+
+def th_accuracy(pad_outputs, pad_targets, ignore_label):
+    """Calculate accuracy.
+
+    Args:
+        pad_outputs (Tensor): Prediction tensors (B * Lmax, D).
+        pad_targets (LongTensor): Target label tensors (B, Lmax).
+        ignore_label (int): Ignore label id.
+
+    Returns:
+        float: Accuracy value (0.0 - 1.0).
+
+    """
+    pad_pred = pad_outputs.view(
+        pad_targets.size(0), pad_targets.size(1), pad_outputs.size(1)
+    ).argmax(2)
+    mask = pad_targets != ignore_label
+    numerator = torch.sum(
+        pad_pred.masked_select(mask) == pad_targets.masked_select(mask)
+    )
+    denominator = torch.sum(mask)
+    return float(numerator) / float(denominator)
+
+
+def to_torch_tensor(x):
+    """Change to torch.Tensor or ComplexTensor from numpy.ndarray.
+
+    Args:
+        x: Inputs. It should be one of numpy.ndarray, Tensor, ComplexTensor, and dict.
+
+    Returns:
+        Tensor or ComplexTensor: Type converted inputs.
+
+    Examples:
+        >>> xs = np.ones(3, dtype=np.float32)
+        >>> xs = to_torch_tensor(xs)
+        tensor([1., 1., 1.])
+        >>> xs = torch.ones(3, 4, 5)
+        >>> assert to_torch_tensor(xs) is xs
+        >>> xs = {'real': xs, 'imag': xs}
+        >>> to_torch_tensor(xs)
+        ComplexTensor(
+        Real:
+        tensor([1., 1., 1.])
+        Imag:
+        tensor([1., 1., 1.])
+        )
+
+    """
+    # If numpy, change to torch tensor
+    if isinstance(x, np.ndarray):
+        if x.dtype.kind == "c":
+            # Dynamically importing because torch_complex requires python3
+            from torch_complex.tensor import ComplexTensor
+
+            return ComplexTensor(x)
+        else:
+            return torch.from_numpy(x)
+
+    # If {'real': ..., 'imag': ...}, convert to ComplexTensor
+    elif isinstance(x, dict):
+        # Dynamically importing because torch_complex requires python3
+        from torch_complex.tensor import ComplexTensor
+
+        if "real" not in x or "imag" not in x:
+            raise ValueError("x must have 'real' and 'imag' keys: {}".format(list(x)))
+        # Relative importing because of using python3 syntax
+        return ComplexTensor(x["real"], x["imag"])
+
+    # If torch.Tensor, as it is
+    elif isinstance(x, torch.Tensor):
+        return x
+
+    else:
+        error = (
+            "x must be numpy.ndarray, torch.Tensor or a dict like "
+            "{{'real': torch.Tensor, 'imag': torch.Tensor}}, "
+            "but got {}".format(type(x))
+        )
+        try:
+            from torch_complex.tensor import ComplexTensor
+        except Exception:
+            # If PY2
+            raise ValueError(error)
+        else:
+            # If PY3
+            if isinstance(x, ComplexTensor):
+                return x
+            else:
+                raise ValueError(error)
+
+
+def get_subsample(train_args, mode, arch):
+    """Parse the subsampling factors from the args for the specified `mode` and `arch`.
+
+    Args:
+        train_args: argument Namespace containing options.
+        mode: one of ('asr', 'mt', 'st')
+        arch: one of ('rnn', 'rnn-t', 'rnn_mix', 'rnn_mulenc', 'transformer')
+
+    Returns:
+        np.ndarray / List[np.ndarray]: subsampling factors.
+    """
+    if arch == "transformer":
+        return np.array([1])
+
+    elif mode == "mt" and arch == "rnn":
+        # +1 means input (+1) and layers outputs (train_args.elayers)
+        subsample = np.ones(train_args.elayers + 1, dtype=int)
+        logging.warning("Subsampling is not performed for machine translation.")
+        logging.info("subsample: " + " ".join([str(x) for x in subsample]))
+        return subsample
+
+    elif (
+            (mode == "asr" and arch in ("rnn", "rnn-t"))
+            or (mode == "mt" and arch == "rnn")
+            or (mode == "st" and arch == "rnn")
+    ):
+        subsample = np.ones(train_args.elayers + 1, dtype=int)
+        if train_args.etype.endswith("p") and not train_args.etype.startswith("vgg"):
+            ss = train_args.subsample.split("_")
+            for j in range(min(train_args.elayers + 1, len(ss))):
+                subsample[j] = int(ss[j])
+        else:
+            logging.warning(
+                "Subsampling is not performed for vgg*. "
+                "It is performed in max pooling layers at CNN."
+            )
+        logging.info("subsample: " + " ".join([str(x) for x in subsample]))
+        return subsample
+
+    elif mode == "asr" and arch == "rnn_mix":
+        subsample = np.ones(
+            train_args.elayers_sd + train_args.elayers + 1, dtype=int
+        )
+        if train_args.etype.endswith("p") and not train_args.etype.startswith("vgg"):
+            ss = train_args.subsample.split("_")
+            for j in range(
+                    min(train_args.elayers_sd + train_args.elayers + 1, len(ss))
+            ):
+                subsample[j] = int(ss[j])
+        else:
+            logging.warning(
+                "Subsampling is not performed for vgg*. "
+                "It is performed in max pooling layers at CNN."
+            )
+        logging.info("subsample: " + " ".join([str(x) for x in subsample]))
+        return subsample
+
+    elif mode == "asr" and arch == "rnn_mulenc":
+        subsample_list = []
+        for idx in range(train_args.num_encs):
+            subsample = np.ones(train_args.elayers[idx] + 1, dtype=int)
+            if train_args.etype[idx].endswith("p") and not train_args.etype[
+                idx
+            ].startswith("vgg"):
+                ss = train_args.subsample[idx].split("_")
+                for j in range(min(train_args.elayers[idx] + 1, len(ss))):
+                    subsample[j] = int(ss[j])
+            else:
+                logging.warning(
+                    "Encoder %d: Subsampling is not performed for vgg*. "
+                    "It is performed in max pooling layers at CNN.",
+                    idx + 1,
+                )
+            logging.info("subsample: " + " ".join([str(x) for x in subsample]))
+            subsample_list.append(subsample)
+        return subsample_list
+
+    else:
+        raise ValueError("Invalid options: mode={}, arch={}".format(mode, arch))
+
+
+def rename_state_dict(
+        old_prefix: str, new_prefix: str, state_dict: Dict[str, torch.Tensor]
+):
+    """Replace keys of old prefix with new prefix in state dict."""
+    # need this list not to break the dict iterator
+    old_keys = [k for k in state_dict if k.startswith(old_prefix)]
+    if len(old_keys) > 0:
+        logging.warning(f"Rename: {old_prefix} -> {new_prefix}")
+    for k in old_keys:
+        v = state_dict.pop(k)
+        new_k = k.replace(old_prefix, new_prefix, 1)
+        state_dict[new_k] = v
+
+
+class Swish(torch.nn.Module):
+    """Construct a Swish object."""
+
+    def forward(self, x):
+        """Return Swish activation function."""
+        return x * torch.sigmoid(x)
+
+
+def get_activation(act):
+    """Return activation function."""
+
+    activation_funcs = {
+        "hardtanh": torch.nn.Hardtanh,
+        "tanh": torch.nn.Tanh,
+        "relu": torch.nn.ReLU,
+        "selu": torch.nn.SELU,
+        "swish": Swish,
+    }
+
+    return activation_funcs[act]()
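Note on `make_pad_mask` / `make_non_pad_mask`: with only `lengths` given, the mask is True wherever the frame index is at or beyond the sequence length. A minimal pure-Python sketch of that comparison follows, assuming nested lists in place of tensors; `pad_mask_rows` is a hypothetical name and does not cover the `xs` / `length_dim` broadcasting path.

```python
def pad_mask_rows(lengths, maxlen=None):
    """Return rows where True marks padded positions (index >= length)."""
    if maxlen is None:
        maxlen = max(lengths)
    # Compare each frame index t against the sequence length n.
    return [[t >= n for t in range(maxlen)] for n in lengths]


pad = pad_mask_rows([5, 3, 2])
# The non-pad mask is the element-wise negation of the pad mask.
non_pad = [[not v for v in row] for row in pad]
```

Passing a larger `maxlen` simply extends every row with additional padded positions.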
diff --git a/funasr/modules/positionwise_feed_forward.py b/funasr/modules/positionwise_feed_forward.py
new file mode 100644
index 000000000..61b874fda
--- /dev/null
+++ b/funasr/modules/positionwise_feed_forward.py
@@ -0,0 +1,58 @@
+#!/usr/bin/env python3
+# -*- coding: utf-8 -*-
+
+# Copyright 2019 Shigeki Karita
+#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)
+
+"""Positionwise feed forward layer definition."""
+
+import torch
+
+from funasr.modules.layer_norm import LayerNorm
+
+
+class PositionwiseFeedForward(torch.nn.Module):
+    """Positionwise feed forward layer.
+
+    Args:
+        idim (int): Input dimension.
+        hidden_units (int): The number of hidden units.
+        dropout_rate (float): Dropout rate.
+
+    """
+
+    def __init__(self, idim, hidden_units, dropout_rate, activation=torch.nn.ReLU()):
+        """Construct a PositionwiseFeedForward object."""
+        super(PositionwiseFeedForward, self).__init__()
+        self.w_1 = torch.nn.Linear(idim, hidden_units)
+        self.w_2 = torch.nn.Linear(hidden_units, idim)
+        self.dropout = torch.nn.Dropout(dropout_rate)
+        self.activation = activation
+
+    def forward(self, x):
+        """Forward function."""
+        return self.w_2(self.dropout(self.activation(self.w_1(x))))
+
+
+class PositionwiseFeedForwardDecoderSANM(torch.nn.Module):
+    """Positionwise feed forward layer for the SANM decoder.
+
+    Args:
+        idim (int): Input dimension.
+        hidden_units (int): The number of hidden units.
+        dropout_rate (float): Dropout rate.
+
+    """
+
+    def __init__(self, idim, hidden_units, dropout_rate, adim=None, activation=torch.nn.ReLU()):
+        """Construct a PositionwiseFeedForwardDecoderSANM object."""
+        super(PositionwiseFeedForwardDecoderSANM, self).__init__()
+        self.w_1 = torch.nn.Linear(idim, hidden_units)
+        self.w_2 = torch.nn.Linear(hidden_units, idim if adim is None else adim, bias=False)
+        self.dropout = torch.nn.Dropout(dropout_rate)
+        self.activation = activation
+        self.norm = LayerNorm(hidden_units)
+
+    def forward(self, x):
+        """Forward function."""
+        return self.w_2(self.norm(self.dropout(self.activation(self.w_1(x)))))
diff --git a/funasr/modules/repeat.py b/funasr/modules/repeat.py
new file mode 100644
index 000000000..a3d2676a8
--- /dev/null
+++ b/funasr/modules/repeat.py
@@ -0,0 +1,33 @@
+#!/usr/bin/env python3
+# -*- coding: utf-8 -*-
+
+# Copyright 2019 Shigeki Karita
+#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)
+
+"""Repeat the same layer definition."""
+
+import torch
+
+
+class MultiSequential(torch.nn.Sequential):
+    """Multi-input multi-output torch.nn.Sequential."""
+
+    def forward(self, *args):
+        """Repeat."""
+        for m in self:
+            args = m(*args)
+        return args
+
+
+def repeat(N, fn):
+    """Repeat module N times.
+
+    Args:
+        N (int): Number of times to repeat.
+        fn (Callable): Function to generate module.
+
+    Returns:
+        MultiSequential: Repeated model instance.
+
+    """
+    return MultiSequential(*[fn(n) for n in range(N)])
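Note on `MultiSequential`: because `forward` unpacks `args` into each layer, every layer must return a tuple with the same arity it accepts. The sketch below mirrors that contract with plain callables instead of `torch.nn.Module`; `TupleChain`, `repeat_chain`, and `make_layer` are hypothetical stand-ins, not part of this patch.

```python
class TupleChain:
    """Call layers in order, feeding each layer's tuple output to the next."""

    def __init__(self, *layers):
        self.layers = layers

    def __call__(self, *args):
        for layer in self.layers:
            # Each layer receives the unpacked tuple and must return a tuple.
            args = layer(*args)
        return args


def repeat_chain(n, fn):
    """Build n layers with fn(index) and chain them."""
    return TupleChain(*[fn(i) for i in range(n)])


def make_layer(index):
    def layer(x, mask):
        # Transform x, pass mask through untouched.
        return x + index, mask

    return layer


out, mask = repeat_chain(3, make_layer)(10, "mask")
```

A layer that returned a bare value instead of a tuple would break the next call, which is why layers used with `repeat` conventionally return `(x, mask)` pairs.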
diff --git a/funasr/modules/rnn/__init__.py b/funasr/modules/rnn/__init__.py
new file mode 100644
index 000000000..b7f177368
--- /dev/null
+++ b/funasr/modules/rnn/__init__.py
@@ -0,0 +1 @@
+"""Initialize sub package."""
diff --git a/funasr/modules/rnn/argument.py b/funasr/modules/rnn/argument.py
new file mode 100644
index 000000000..b4c89d25f
--- /dev/null
+++ b/funasr/modules/rnn/argument.py
@@ -0,0 +1,156 @@
+# Copyright 2020 Hirofumi Inaguma
+#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)
+
+"""RNN common arguments."""
+
+
+def add_arguments_rnn_encoder_common(group):
+    """Define common arguments for RNN encoder."""
+    group.add_argument(
+        "--etype",
+        default="blstmp",
+        type=str,
+        choices=[
+            "lstm",
+            "blstm",
+            "lstmp",
+            "blstmp",
+            "vgglstmp",
+            "vggblstmp",
+            "vgglstm",
+            "vggblstm",
+            "gru",
+            "bgru",
+            "grup",
+            "bgrup",
+            "vgggrup",
+            "vggbgrup",
+            "vgggru",
+            "vggbgru",
+        ],
+        help="Type of encoder network architecture",
+    )
+    group.add_argument(
+        "--elayers",
+        default=4,
+        type=int,
+        help="Number of encoder layers",
+    )
+    group.add_argument(
+        "--eunits",
+        "-u",
+        default=300,
+        type=int,
+        help="Number of encoder hidden units",
+    )
+    group.add_argument(
+        "--eprojs", default=320, type=int, help="Number of encoder projection units"
+    )
+    group.add_argument(
+        "--subsample",
+        default="1",
+        type=str,
+        help="Subsample input frames x_y_z means "
+        "subsample every x frame at 1st layer, "
+        "every y frame at 2nd layer etc.",
+    )
+    return group
+
+
+def add_arguments_rnn_decoder_common(group):
+    """Define common arguments for RNN decoder."""
+    group.add_argument(
+        "--dtype",
+        default="lstm",
+        type=str,
+        choices=["lstm", "gru"],
+        help="Type of decoder network architecture",
+    )
+    group.add_argument(
+        "--dlayers", default=1, type=int, help="Number of decoder layers"
+    )
+    group.add_argument(
+        "--dunits", default=320, type=int, help="Number of decoder hidden units"
+    )
+    group.add_argument(
+        "--dropout-rate-decoder",
+        default=0.0,
+        type=float,
+        help="Dropout rate for the decoder",
+    )
+    group.add_argument(
+        "--sampling-probability",
+        default=0.0,
+        type=float,
+        help="Ratio of predicted labels fed back to decoder",
+    )
+    group.add_argument(
+        "--lsm-type",
+        const="",
+        default="",
+        type=str,
+        nargs="?",
+        choices=["", "unigram"],
+        help="Apply label smoothing with a specified distribution type",
+    )
+    return group
+
+
+def add_arguments_rnn_attention_common(group):
+    """Define common arguments for RNN attention."""
+    group.add_argument(
+        "--atype",
+        default="dot",
+        type=str,
+        choices=[
+            "noatt",
+            "dot",
+            "add",
+            "location",
+            "coverage",
+            "coverage_location",
+            "location2d",
+            "location_recurrent",
+            "multi_head_dot",
+            "multi_head_add",
+            "multi_head_loc",
+            "multi_head_multi_res_loc",
+        ],
+        help="Type of attention architecture",
+    )
+    group.add_argument(
+        "--adim",
+        default=320,
+        type=int,
+        help="Number of attention transformation dimensions",
+    )
+    group.add_argument(
+        "--awin", default=5, type=int, help="Window size for location2d attention"
+    )
+    group.add_argument(
+        "--aheads",
+        default=4,
+        type=int,
+        help="Number of heads for multi head attention",
+    )
+    group.add_argument(
+        "--aconv-chans",
+        default=-1,
+        type=int,
+        help="Number of attention convolution channels "
+        "(negative value indicates no location-aware attention)",
+    )
+    group.add_argument(
+        "--aconv-filts",
+        default=100,
+        type=int,
+        help="Number of attention convolution filters "
+        "(negative value indicates no location-aware attention)",
+    )
+    group.add_argument(
+        "--dropout-rate",
+        default=0.0,
+        type=float,
+        help="Dropout rate for the encoder",
+    )
+    return group
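Note on `--subsample`: the `x_y_z` string is split on underscores and applied layer by layer, which is how `get_subsample` in `nets_utils.py` consumes it for projection-type encoders. A minimal sketch with the standard `argparse` module follows; it registers only an illustrative subset of the options, and `parse_subsample` is a hypothetical helper that returns a plain list rather than a NumPy array.

```python
import argparse

parser = argparse.ArgumentParser()
group = parser.add_argument_group("rnn encoder (illustrative subset)")
group.add_argument("--etype", default="blstmp", type=str)
group.add_argument("--elayers", default=4, type=int)
group.add_argument("--subsample", default="1", type=str)


def parse_subsample(args):
    """Return one subsampling factor for the input plus each encoder layer."""
    factors = [1] * (args.elayers + 1)
    # Only projection types ("...p") without a VGG front end are subsampled.
    if args.etype.endswith("p") and not args.etype.startswith("vgg"):
        ss = args.subsample.split("_")
        for j in range(min(args.elayers + 1, len(ss))):
            factors[j] = int(ss[j])
    return factors


factors = parse_subsample(parser.parse_args(["--subsample", "1_2_2_1_1"]))
defaults = parse_subsample(parser.parse_args([]))
vgg = parse_subsample(parser.parse_args(["--etype", "vggblstmp", "--subsample", "1_2_2_1_1"]))
```

Entries missing from the string default to 1, and VGG-based encoders ignore the option because they subsample in their pooling layers.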
diff --git a/funasr/modules/rnn/attentions.py b/funasr/modules/rnn/attentions.py
new file mode 100644
index 000000000..71d1786d6
--- /dev/null
+++ b/funasr/modules/rnn/attentions.py
@@ -0,0 +1,1808 @@
+"""Attention modules for RNN."""
+
+import math
+import six
+
+import torch
+import torch.nn.functional as F
+
+from funasr.modules.nets_utils import make_pad_mask
+from funasr.modules.nets_utils import to_device
+
+
+def _apply_attention_constraint(
+    e, last_attended_idx, backward_window=1, forward_window=3
+):
+    """Apply monotonic attention constraint.
+
+    This function applies the monotonic attention constraint
+    introduced in `Deep Voice 3: Scaling
+    Text-to-Speech with Convolutional Sequence Learning`_.
+
+    Args:
+        e (Tensor): Attention energy before applying softmax (1, T).
+        last_attended_idx (int): Index of the last attended input, in [0, T].
+        backward_window (int, optional): Backward window size in attention constraint.
+        forward_window (int, optional): Forward window size in attention constraint.
+
+    Returns:
+        Tensor: Monotonic constrained attention energy (1, T).
+
+    .. _`Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning`:
+        https://arxiv.org/abs/1710.07654
+
+    """
+    if e.size(0) != 1:
+        raise NotImplementedError("Batch attention constraining is not yet supported.")
+    backward_idx = last_attended_idx - backward_window
+    forward_idx = last_attended_idx + forward_window
+    if backward_idx > 0:
+        e[:, :backward_idx] = -float("inf")
+    if forward_idx < e.size(1):
+        e[:, forward_idx:] = -float("inf")
+    return e
+
+
+class NoAtt(torch.nn.Module):
+    """No attention"""
+
+    def __init__(self):
+        super(NoAtt, self).__init__()
+        self.h_length = None
+        self.enc_h = None
+        self.pre_compute_enc_h = None
+        self.c = None
+
+    def reset(self):
+        """reset states"""
+        self.h_length = None
+        self.enc_h = None
+        self.pre_compute_enc_h = None
+        self.c = None
+
+    def forward(self, enc_hs_pad, enc_hs_len, dec_z, att_prev):
+        """NoAtt forward
+
+        :param torch.Tensor enc_hs_pad: padded encoder hidden state (B, T_max, D_enc)
+        :param list enc_hs_len: padded encoder hidden state length (B)
+        :param torch.Tensor dec_z: dummy (does not use)
+        :param torch.Tensor att_prev: dummy (does not use)
+        :return: attention weighted encoder state (B, D_enc)
+        :rtype: torch.Tensor
+        :return: previous attention weights
+        :rtype: torch.Tensor
+        """
+        batch = len(enc_hs_pad)
+        # pre-compute all h outside the decoder loop
+        if self.pre_compute_enc_h is None:
+            self.enc_h = enc_hs_pad  # utt x frame x hdim
+            self.h_length = self.enc_h.size(1)
+
+        # initialize attention weight with uniform dist.
+        if att_prev is None:
+            # if no bias, 0 0-pad goes 0
+            mask = 1.0 - make_pad_mask(enc_hs_len).float()
+            att_prev = mask / mask.new(enc_hs_len).unsqueeze(-1)
+            att_prev = att_prev.to(self.enc_h)
+            self.c = torch.sum(
+                self.enc_h * att_prev.view(batch, self.h_length, 1), dim=1
+            )
+
+        return self.c, att_prev
+
+
+class AttDot(torch.nn.Module):
+    """Dot product attention
+
+    :param int eprojs: # projection-units of encoder
+    :param int dunits: # units of decoder
+    :param int att_dim: attention dimension
+    :param bool han_mode: flag to switch on hierarchical attention mode,
+        in which pre_compute_enc_h is recomputed on every call
+    """
+
+    def __init__(self, eprojs, dunits, att_dim, han_mode=False):
+        super(AttDot, self).__init__()
+        self.mlp_enc = torch.nn.Linear(eprojs, att_dim)
+        self.mlp_dec = torch.nn.Linear(dunits, att_dim)
+
+        self.dunits = dunits
+        self.eprojs = eprojs
+        self.att_dim = att_dim
+        self.h_length = None
+        self.enc_h = None
+        self.pre_compute_enc_h = None
+        self.mask = None
+        self.han_mode = han_mode
+
+    def reset(self):
+        """reset states"""
+        self.h_length = None
+        self.enc_h = None
+        self.pre_compute_enc_h = None
+        self.mask = None
+
+    def forward(self, enc_hs_pad, enc_hs_len, dec_z, att_prev, scaling=2.0):
+        """AttDot forward
+
+        :param torch.Tensor enc_hs_pad: padded encoder hidden state (B x T_max x D_enc)
+        :param list enc_hs_len: padded encoder hidden state length (B)
+        :param torch.Tensor dec_z: decoder hidden state (B x D_dec)
+        :param torch.Tensor att_prev: dummy (not used)
+        :param float scaling: scaling parameter before applying softmax
+        :return: attention weighted encoder state (B, D_enc)
+        :rtype: torch.Tensor
+        :return: previous attention weight (B x T_max)
+        :rtype: torch.Tensor
+        """
+
+        batch = enc_hs_pad.size(0)
+        # pre-compute all h outside the decoder loop
+        if self.pre_compute_enc_h is None or self.han_mode:
+            self.enc_h = enc_hs_pad  # utt x frame x hdim
+            self.h_length = self.enc_h.size(1)
+            # utt x frame x att_dim
+            self.pre_compute_enc_h = torch.tanh(self.mlp_enc(self.enc_h))
+
+        if dec_z is None:
+            dec_z = enc_hs_pad.new_zeros(batch, self.dunits)
+        else:
+            dec_z = dec_z.view(batch, self.dunits)
+
+        e = torch.sum(
+            self.pre_compute_enc_h
+            * torch.tanh(self.mlp_dec(dec_z)).view(batch, 1, self.att_dim),
+            dim=2,
+        )  # utt x frame
+
+        # NOTE: mask out zero padding when computing w.
+        if self.mask is None:
+            self.mask = to_device(enc_hs_pad, make_pad_mask(enc_hs_len))
+        e.masked_fill_(self.mask, -float("inf"))
+        w = F.softmax(scaling * e, dim=1)
+
+        # weighted sum over frames
+        # utt x hdim
+        # NOTE use bmm instead of sum(*)
+        c = torch.sum(self.enc_h * w.view(batch, self.h_length, 1), dim=1)
+        return c, w
+
+
+class AttAdd(torch.nn.Module):
+    """Additive attention
+
+    :param int eprojs: # projection-units of encoder
+    :param int dunits: # units of decoder
+    :param int att_dim: attention dimension
+    :param bool han_mode: flag to switch on hierarchical attention mode,
+        in which pre_compute_enc_h is recomputed on every call
+    """
+
+    def __init__(self, eprojs, dunits, att_dim, han_mode=False):
+        super(AttAdd, self).__init__()
+        self.mlp_enc = torch.nn.Linear(eprojs, att_dim)
+        self.mlp_dec = torch.nn.Linear(dunits, att_dim, bias=False)
+        self.gvec = torch.nn.Linear(att_dim, 1)
+        self.dunits = dunits
+        self.eprojs = eprojs
+        self.att_dim = att_dim
+        self.h_length = None
+        self.enc_h = None
+        self.pre_compute_enc_h = None
+        self.mask = None
+        self.han_mode = han_mode
+
+    def reset(self):
+        """reset states"""
+        self.h_length = None
+        self.enc_h = None
+        self.pre_compute_enc_h = None
+        self.mask = None
+
+    def forward(self, enc_hs_pad, enc_hs_len, dec_z, att_prev, scaling=2.0):
+        """AttAdd forward
+
+        :param torch.Tensor enc_hs_pad: padded encoder hidden state (B x T_max x D_enc)
+        :param list enc_hs_len: padded encoder hidden state length (B)
+        :param torch.Tensor dec_z: decoder hidden state (B x D_dec)
+        :param torch.Tensor att_prev: dummy (not used)
+        :param float scaling: scaling parameter before applying softmax
+        :return: attention weighted encoder state (B, D_enc)
+        :rtype: torch.Tensor
+        :return: previous attention weights (B x T_max)
+        :rtype: torch.Tensor
+        """
+
+        batch = len(enc_hs_pad)
+        # pre-compute all h outside the decoder loop
+        if self.pre_compute_enc_h is None or self.han_mode:
+            self.enc_h = enc_hs_pad  # utt x frame x hdim
+            self.h_length = self.enc_h.size(1)
+            # utt x frame x att_dim
+            self.pre_compute_enc_h = self.mlp_enc(self.enc_h)
+
+        if dec_z is None:
+            dec_z = enc_hs_pad.new_zeros(batch, self.dunits)
+        else:
+            dec_z = dec_z.view(batch, self.dunits)
+
+        # dec_z_tiled: utt x 1 x att_dim (broadcast over frames)
+        dec_z_tiled = self.mlp_dec(dec_z).view(batch, 1, self.att_dim)
+
+        # dot with gvec
+        # utt x frame x att_dim -> utt x frame
+        e = self.gvec(torch.tanh(self.pre_compute_enc_h + dec_z_tiled)).squeeze(2)
+
+        # NOTE: mask out zero padding when computing w.
+        if self.mask is None:
+            self.mask = to_device(enc_hs_pad, make_pad_mask(enc_hs_len))
+        e.masked_fill_(self.mask, -float("inf"))
+        w = F.softmax(scaling * e, dim=1)
+
+        # weighted sum over frames
+        # utt x hdim
+        # NOTE use bmm instead of sum(*)
+        c = torch.sum(self.enc_h * w.view(batch, self.h_length, 1), dim=1)
+
+        return c, w
+
+
+class AttLoc(torch.nn.Module):
+    """location-aware attention module.
+
+    Reference: Attention-Based Models for Speech Recognition
+        (https://arxiv.org/pdf/1506.07503.pdf)
+
+    :param int eprojs: # projection-units of encoder
+    :param int dunits: # units of decoder
+    :param int att_dim: attention dimension
+    :param int aconv_chans: # channels of attention convolution
+    :param int aconv_filts: filter size of attention convolution
+    :param bool han_mode: flag to switch on hierarchical attention mode,
+        in which pre_compute_enc_h is recomputed on every call
+    """
+
+    def __init__(
+        self, eprojs, dunits, att_dim, aconv_chans, aconv_filts, han_mode=False
+    ):
+        super(AttLoc, self).__init__()
+        self.mlp_enc = torch.nn.Linear(eprojs, att_dim)
+        self.mlp_dec = torch.nn.Linear(dunits, att_dim, bias=False)
+        self.mlp_att = torch.nn.Linear(aconv_chans, att_dim, bias=False)
+        self.loc_conv = torch.nn.Conv2d(
+            1,
+            aconv_chans,
+            (1, 2 * aconv_filts + 1),
+            padding=(0, aconv_filts),
+            bias=False,
+        )
+        self.gvec = torch.nn.Linear(att_dim, 1)
+
+        self.dunits = dunits
+        self.eprojs = eprojs
+        self.att_dim = att_dim
+        self.h_length = None
+        self.enc_h = None
+        self.pre_compute_enc_h = None
+        self.mask = None
+        self.han_mode = han_mode
+
+    def reset(self):
+        """reset states"""
+        self.h_length = None
+        self.enc_h = None
+        self.pre_compute_enc_h = None
+        self.mask = None
+
+    def forward(
+        self,
+        enc_hs_pad,
+        enc_hs_len,
+        dec_z,
+        att_prev,
+        scaling=2.0,
+        last_attended_idx=None,
+        backward_window=1,
+        forward_window=3,
+    ):
+        """Calculate AttLoc forward propagation.
+
+        :param torch.Tensor enc_hs_pad: padded encoder hidden state (B x T_max x D_enc)
+        :param list enc_hs_len: padded encoder hidden state length (B)
+        :param torch.Tensor dec_z: decoder hidden state (B x D_dec)
+        :param torch.Tensor att_prev: previous attention weight (B x T_max)
+        :param float scaling: scaling parameter before applying softmax
+        :param int last_attended_idx: index of the last attended input frame
+        :param int backward_window: backward window size in attention constraint
+        :param int forward_window: forward window size in attention constraint
+        :return: attention weighted encoder state (B, D_enc)
+        :rtype: torch.Tensor
+        :return: previous attention weights (B x T_max)
+        :rtype: torch.Tensor
+        """
+        batch = len(enc_hs_pad)
+        # pre-compute all h outside the decoder loop
+        if self.pre_compute_enc_h is None or self.han_mode:
+            self.enc_h = enc_hs_pad  # utt x frame x hdim
+            self.h_length = self.enc_h.size(1)
+            # utt x frame x att_dim
+            self.pre_compute_enc_h = self.mlp_enc(self.enc_h)
+
+        if dec_z is None:
+            dec_z = enc_hs_pad.new_zeros(batch, self.dunits)
+        else:
+            dec_z = dec_z.view(batch, self.dunits)
+
+        # initialize attention weight with uniform dist.
+        if att_prev is None:
+            # valid frames get weight 1 / length; padded frames get 0
+            att_prev = 1.0 - make_pad_mask(enc_hs_len).to(
+                device=dec_z.device, dtype=dec_z.dtype
+            )
+            att_prev = att_prev / att_prev.new(enc_hs_len).unsqueeze(-1)
+
+        # att_prev: utt x frame -> utt x 1 x 1 x frame
+        # -> utt x att_conv_chans x 1 x frame
+        att_conv = self.loc_conv(att_prev.view(batch, 1, 1, self.h_length))
+        # att_conv: utt x att_conv_chans x 1 x frame -> utt x frame x att_conv_chans
+        att_conv = att_conv.squeeze(2).transpose(1, 2)
+        # att_conv: utt x frame x att_conv_chans -> utt x frame x att_dim
+        att_conv = self.mlp_att(att_conv)
+
+        # dec_z_tiled: utt x 1 x att_dim (broadcast over frames)
+        dec_z_tiled = self.mlp_dec(dec_z).view(batch, 1, self.att_dim)
+
+        # dot with gvec
+        # utt x frame x att_dim -> utt x frame
+        e = self.gvec(
+            torch.tanh(att_conv + self.pre_compute_enc_h + dec_z_tiled)
+        ).squeeze(2)
+
+        # NOTE: mask out zero padding when computing w.
+        if self.mask is None:
+            self.mask = to_device(enc_hs_pad, make_pad_mask(enc_hs_len))
+        e.masked_fill_(self.mask, -float("inf"))
+
+        # apply monotonic attention constraint (mainly for TTS)
+        if last_attended_idx is not None:
+            e = _apply_attention_constraint(
+                e, last_attended_idx, backward_window, forward_window
+            )
+
+        w = F.softmax(scaling * e, dim=1)
+
+        # weighted sum over frames
+        # utt x hdim
+        c = torch.sum(self.enc_h * w.view(batch, self.h_length, 1), dim=1)
+
+        return c, w
+
+
+class AttCov(torch.nn.Module):
+    """Coverage mechanism attention
+
+    Reference: Get To The Point: Summarization with Pointer-Generator Networks
+       (https://arxiv.org/abs/1704.04368)
+
+    :param int eprojs: # projection-units of encoder
+    :param int dunits: # units of decoder
+    :param int att_dim: attention dimension
+    :param bool han_mode: flag to switch on hierarchical attention mode,
+        in which pre_compute_enc_h is recomputed on every call
+    """
+
+    def __init__(self, eprojs, dunits, att_dim, han_mode=False):
+        super(AttCov, self).__init__()
+        self.mlp_enc = torch.nn.Linear(eprojs, att_dim)
+        self.mlp_dec = torch.nn.Linear(dunits, att_dim, bias=False)
+        self.wvec = torch.nn.Linear(1, att_dim)
+        self.gvec = torch.nn.Linear(att_dim, 1)
+
+        self.dunits = dunits
+        self.eprojs = eprojs
+        self.att_dim = att_dim
+        self.h_length = None
+        self.enc_h = None
+        self.pre_compute_enc_h = None
+        self.mask = None
+        self.han_mode = han_mode
+
+    def reset(self):
+        """reset states"""
+        self.h_length = None
+        self.enc_h = None
+        self.pre_compute_enc_h = None
+        self.mask = None
+
+    def forward(self, enc_hs_pad, enc_hs_len, dec_z, att_prev_list, scaling=2.0):
+        """AttCov forward
+
+        :param torch.Tensor enc_hs_pad: padded encoder hidden state (B x T_max x D_enc)
+        :param list enc_hs_len: padded encoder hidden state length (B)
+        :param torch.Tensor dec_z: decoder hidden state (B x D_dec)
+        :param list att_prev_list: list of previous attention weights
+        :param float scaling: scaling parameter before applying softmax
+        :return: attention weighted encoder state (B, D_enc)
+        :rtype: torch.Tensor
+        :return: list of previous attention weights
+        :rtype: list
+        """
+
+        batch = len(enc_hs_pad)
+        # pre-compute all h outside the decoder loop
+        if self.pre_compute_enc_h is None or self.han_mode:
+            self.enc_h = enc_hs_pad  # utt x frame x hdim
+            self.h_length = self.enc_h.size(1)
+            # utt x frame x att_dim
+            self.pre_compute_enc_h = self.mlp_enc(self.enc_h)
+
+        if dec_z is None:
+            dec_z = enc_hs_pad.new_zeros(batch, self.dunits)
+        else:
+            dec_z = dec_z.view(batch, self.dunits)
+
+        # initialize attention weight with uniform dist.
+        if att_prev_list is None:
+            # valid frames get weight 1 / length; padded frames get 0
+            att_prev_list = to_device(
+                enc_hs_pad, (1.0 - make_pad_mask(enc_hs_len).float())
+            )
+            att_prev_list = [
+                att_prev_list / att_prev_list.new(enc_hs_len).unsqueeze(-1)
+            ]
+
+        # att_prev_list: L' * [B x T] => cov_vec B x T
+        cov_vec = sum(att_prev_list)
+        # cov_vec: B x T => B x T x 1 => B x T x att_dim
+        cov_vec = self.wvec(cov_vec.unsqueeze(-1))
+
+        # dec_z_tiled: utt x 1 x att_dim (broadcast over frames)
+        dec_z_tiled = self.mlp_dec(dec_z).view(batch, 1, self.att_dim)
+
+        # dot with gvec
+        # utt x frame x att_dim -> utt x frame
+        e = self.gvec(
+            torch.tanh(cov_vec + self.pre_compute_enc_h + dec_z_tiled)
+        ).squeeze(2)
+
+        # NOTE: mask out zero padding when computing w.
+        if self.mask is None:
+            self.mask = to_device(enc_hs_pad, make_pad_mask(enc_hs_len))
+        e.masked_fill_(self.mask, -float("inf"))
+        w = F.softmax(scaling * e, dim=1)
+        att_prev_list += [w]
+
+        # weighted sum over frames
+        # utt x hdim
+        # NOTE use bmm instead of sum(*)
+        c = torch.sum(self.enc_h * w.view(batch, self.h_length, 1), dim=1)
+
+        return c, att_prev_list
+
+
+class AttLoc2D(torch.nn.Module):
+    """2D location-aware attention
+
+    This attention is an extended version of location-aware attention.
+    It takes into account not only the attention weights of the previous step,
+    but also those of earlier steps.
+
+    :param int eprojs: # projection-units of encoder
+    :param int dunits: # units of decoder
+    :param int att_dim: attention dimension
+    :param int att_win: attention window size
+    :param int aconv_chans: # channels of attention convolution
+    :param int aconv_filts: filter size of attention convolution
+    :param bool han_mode: flag to switch on hierarchical attention mode,
+        in which pre_compute_enc_h is recomputed on every call
+    """
+
+    def __init__(
+        self, eprojs, dunits, att_dim, att_win, aconv_chans, aconv_filts, han_mode=False
+    ):
+        super(AttLoc2D, self).__init__()
+        self.mlp_enc = torch.nn.Linear(eprojs, att_dim)
+        self.mlp_dec = torch.nn.Linear(dunits, att_dim, bias=False)
+        self.mlp_att = torch.nn.Linear(aconv_chans, att_dim, bias=False)
+        self.loc_conv = torch.nn.Conv2d(
+            1,
+            aconv_chans,
+            (att_win, 2 * aconv_filts + 1),
+            padding=(0, aconv_filts),
+            bias=False,
+        )
+        self.gvec = torch.nn.Linear(att_dim, 1)
+
+        self.dunits = dunits
+        self.eprojs = eprojs
+        self.att_dim = att_dim
+        self.h_length = None
+        self.enc_h = None
+        self.pre_compute_enc_h = None
+        self.aconv_chans = aconv_chans
+        self.att_win = att_win
+        self.mask = None
+        self.han_mode = han_mode
+
+    def reset(self):
+        """reset states"""
+        self.h_length = None
+        self.enc_h = None
+        self.pre_compute_enc_h = None
+        self.mask = None
+
+    def forward(self, enc_hs_pad, enc_hs_len, dec_z, att_prev, scaling=2.0):
+        """AttLoc2D forward
+
+        :param torch.Tensor enc_hs_pad: padded encoder hidden state (B x T_max x D_enc)
+        :param list enc_hs_len: padded encoder hidden state length (B)
+        :param torch.Tensor dec_z: decoder hidden state (B x D_dec)
+        :param torch.Tensor att_prev: previous attention weight (B x att_win x T_max)
+        :param float scaling: scaling parameter before applying softmax
+        :return: attention weighted encoder state (B, D_enc)
+        :rtype: torch.Tensor
+        :return: previous attention weights (B x att_win x T_max)
+        :rtype: torch.Tensor
+        """
+
+        batch = len(enc_hs_pad)
+        # pre-compute all h outside the decoder loop
+        if self.pre_compute_enc_h is None or self.han_mode:
+            self.enc_h = enc_hs_pad  # utt x frame x hdim
+            self.h_length = self.enc_h.size(1)
+            # utt x frame x att_dim
+            self.pre_compute_enc_h = self.mlp_enc(self.enc_h)
+
+        if dec_z is None:
+            dec_z = enc_hs_pad.new_zeros(batch, self.dunits)
+        else:
+            dec_z = dec_z.view(batch, self.dunits)
+
+        # initialize attention weight with uniform dist.
+        if att_prev is None:
+            # att_prev: B x Tmax -> B x att_win x Tmax
+            # valid frames get weight 1 / length; padded frames get 0
+            att_prev = to_device(enc_hs_pad, (1.0 - make_pad_mask(enc_hs_len).float()))
+            att_prev = att_prev / att_prev.new(enc_hs_len).unsqueeze(-1)
+            att_prev = att_prev.unsqueeze(1).expand(-1, self.att_win, -1)
+
+        # att_prev: B x att_win x Tmax -> B x 1 x att_win x Tmax -> B x C x 1 x Tmax
+        att_conv = self.loc_conv(att_prev.unsqueeze(1))
+        # att_conv: B x C x 1 x Tmax -> B x Tmax x C
+        att_conv = att_conv.squeeze(2).transpose(1, 2)
+        # att_conv: utt x frame x att_conv_chans -> utt x frame x att_dim
+        att_conv = self.mlp_att(att_conv)
+
+        # dec_z_tiled: utt x 1 x att_dim (broadcast over frames)
+        dec_z_tiled = self.mlp_dec(dec_z).view(batch, 1, self.att_dim)
+
+        # dot with gvec
+        # utt x frame x att_dim -> utt x frame
+        e = self.gvec(
+            torch.tanh(att_conv + self.pre_compute_enc_h + dec_z_tiled)
+        ).squeeze(2)
+
+        # NOTE: mask out zero padding when computing w.
+        if self.mask is None:
+            self.mask = to_device(enc_hs_pad, make_pad_mask(enc_hs_len))
+        e.masked_fill_(self.mask, -float("inf"))
+        w = F.softmax(scaling * e, dim=1)
+
+        # weighted sum over frames
+        # utt x hdim
+        # NOTE use bmm instead of sum(*)
+        c = torch.sum(self.enc_h * w.view(batch, self.h_length, 1), dim=1)
+
+        # update att_prev: B x att_win x Tmax -> B x att_win+1 x Tmax
+        # -> B x att_win x Tmax
+        att_prev = torch.cat([att_prev, w.unsqueeze(1)], dim=1)
+        att_prev = att_prev[:, 1:]
+
+        return c, att_prev
+
+
+class AttLocRec(torch.nn.Module):
+    """location-aware recurrent attention
+
+    This attention is an extended version of location-aware attention.
+    It uses an RNN to take the history of attention weights into account.
+
+    :param int eprojs: # projection-units of encoder
+    :param int dunits: # units of decoder
+    :param int att_dim: attention dimension
+    :param int aconv_chans: # channels of attention convolution
+    :param int aconv_filts: filter size of attention convolution
+    :param bool han_mode: flag to switch on hierarchical attention mode,
+        in which pre_compute_enc_h is recomputed on every call
+    """
+
+    def __init__(
+        self, eprojs, dunits, att_dim, aconv_chans, aconv_filts, han_mode=False
+    ):
+        super(AttLocRec, self).__init__()
+        self.mlp_enc = torch.nn.Linear(eprojs, att_dim)
+        self.mlp_dec = torch.nn.Linear(dunits, att_dim, bias=False)
+        self.loc_conv = torch.nn.Conv2d(
+            1,
+            aconv_chans,
+            (1, 2 * aconv_filts + 1),
+            padding=(0, aconv_filts),
+            bias=False,
+        )
+        self.att_lstm = torch.nn.LSTMCell(aconv_chans, att_dim, bias=False)
+        self.gvec = torch.nn.Linear(att_dim, 1)
+
+        self.dunits = dunits
+        self.eprojs = eprojs
+        self.att_dim = att_dim
+        self.h_length = None
+        self.enc_h = None
+        self.pre_compute_enc_h = None
+        self.mask = None
+        self.han_mode = han_mode
+
+    def reset(self):
+        """reset states"""
+        self.h_length = None
+        self.enc_h = None
+        self.pre_compute_enc_h = None
+        self.mask = None
+
+    def forward(self, enc_hs_pad, enc_hs_len, dec_z, att_prev_states, scaling=2.0):
+        """AttLocRec forward
+
+        :param torch.Tensor enc_hs_pad: padded encoder hidden state (B x T_max x D_enc)
+        :param list enc_hs_len: padded encoder hidden state length (B)
+        :param torch.Tensor dec_z: decoder hidden state (B x D_dec)
+        :param tuple att_prev_states: previous attention weight and lstm states
+                                      ((B, T_max), ((B, att_dim), (B, att_dim)))
+        :param float scaling: scaling parameter before applying softmax
+        :return: attention weighted encoder state (B, D_enc)
+        :rtype: torch.Tensor
+        :return: previous attention weights and lstm states (w, (hx, cx))
+                 ((B, T_max), ((B, att_dim), (B, att_dim)))
+        :rtype: tuple
+        """
+
+        batch = len(enc_hs_pad)
+        # pre-compute all h outside the decoder loop
+        if self.pre_compute_enc_h is None or self.han_mode:
+            self.enc_h = enc_hs_pad  # utt x frame x hdim
+            self.h_length = self.enc_h.size(1)
+            # utt x frame x att_dim
+            self.pre_compute_enc_h = self.mlp_enc(self.enc_h)
+
+        if dec_z is None:
+            dec_z = enc_hs_pad.new_zeros(batch, self.dunits)
+        else:
+            dec_z = dec_z.view(batch, self.dunits)
+
+        if att_prev_states is None:
+            # initialize attention weight with uniform dist.
+            # valid frames get weight 1 / length; padded frames get 0
+            att_prev = to_device(enc_hs_pad, (1.0 - make_pad_mask(enc_hs_len).float()))
+            att_prev = att_prev / att_prev.new(enc_hs_len).unsqueeze(-1)
+
+            # initialize lstm states
+            att_h = enc_hs_pad.new_zeros(batch, self.att_dim)
+            att_c = enc_hs_pad.new_zeros(batch, self.att_dim)
+            att_states = (att_h, att_c)
+        else:
+            att_prev = att_prev_states[0]
+            att_states = att_prev_states[1]
+
+        # B x 1 x 1 x T -> B x C x 1 x T
+        att_conv = self.loc_conv(att_prev.view(batch, 1, 1, self.h_length))
+        # apply non-linearity
+        att_conv = F.relu(att_conv)
+        # B x C x 1 x T -> B x C x 1 x 1 -> B x C
+        att_conv = F.max_pool2d(att_conv, (1, att_conv.size(3))).view(batch, -1)
+
+        att_h, att_c = self.att_lstm(att_conv, att_states)
+
+        # dec_z_tiled: utt x 1 x att_dim (broadcast over frames)
+        dec_z_tiled = self.mlp_dec(dec_z).view(batch, 1, self.att_dim)
+
+        # dot with gvec
+        # utt x frame x att_dim -> utt x frame
+        e = self.gvec(
+            torch.tanh(att_h.unsqueeze(1) + self.pre_compute_enc_h + dec_z_tiled)
+        ).squeeze(2)
+
+        # NOTE: mask out zero padding when computing w.
+        if self.mask is None:
+            self.mask = to_device(enc_hs_pad, make_pad_mask(enc_hs_len))
+        e.masked_fill_(self.mask, -float("inf"))
+        w = F.softmax(scaling * e, dim=1)
+
+        # weighted sum over frames
+        # utt x hdim
+        # NOTE use bmm instead of sum(*)
+        c = torch.sum(self.enc_h * w.view(batch, self.h_length, 1), dim=1)
+
+        return c, (w, (att_h, att_c))
+
+
+class AttCovLoc(torch.nn.Module):
+    """Coverage mechanism location aware attention
+
+    This attention is a combination of coverage and location-aware attentions.
+
+    :param int eprojs: # projection-units of encoder
+    :param int dunits: # units of decoder
+    :param int att_dim: attention dimension
+    :param int aconv_chans: # channels of attention convolution
+    :param int aconv_filts: filter size of attention convolution
+    :param bool han_mode: flag to switch on hierarchical attention mode,
+        in which pre_compute_enc_h is recomputed on every call
+    """
+
+    def __init__(
+        self, eprojs, dunits, att_dim, aconv_chans, aconv_filts, han_mode=False
+    ):
+        super(AttCovLoc, self).__init__()
+        self.mlp_enc = torch.nn.Linear(eprojs, att_dim)
+        self.mlp_dec = torch.nn.Linear(dunits, att_dim, bias=False)
+        self.mlp_att = torch.nn.Linear(aconv_chans, att_dim, bias=False)
+        self.loc_conv = torch.nn.Conv2d(
+            1,
+            aconv_chans,
+            (1, 2 * aconv_filts + 1),
+            padding=(0, aconv_filts),
+            bias=False,
+        )
+        self.gvec = torch.nn.Linear(att_dim, 1)
+
+        self.dunits = dunits
+        self.eprojs = eprojs
+        self.att_dim = att_dim
+        self.h_length = None
+        self.enc_h = None
+        self.pre_compute_enc_h = None
+        self.aconv_chans = aconv_chans
+        self.mask = None
+        self.han_mode = han_mode
+
+    def reset(self):
+        """reset states"""
+        self.h_length = None
+        self.enc_h = None
+        self.pre_compute_enc_h = None
+        self.mask = None
+
+    def forward(self, enc_hs_pad, enc_hs_len, dec_z, att_prev_list, scaling=2.0):
+        """AttCovLoc forward
+
+        :param torch.Tensor enc_hs_pad: padded encoder hidden state (B x T_max x D_enc)
+        :param list enc_hs_len: padded encoder hidden state length (B)
+        :param torch.Tensor dec_z: decoder hidden state (B x D_dec)
+        :param list att_prev_list: list of previous attention weights
+        :param float scaling: scaling parameter before applying softmax
+        :return: attention weighted encoder state (B, D_enc)
+        :rtype: torch.Tensor
+        :return: list of previous attention weights
+        :rtype: list
+        """
+
+        batch = len(enc_hs_pad)
+        # pre-compute all h outside the decoder loop
+        if self.pre_compute_enc_h is None or self.han_mode:
+            self.enc_h = enc_hs_pad  # utt x frame x hdim
+            self.h_length = self.enc_h.size(1)
+            # utt x frame x att_dim
+            self.pre_compute_enc_h = self.mlp_enc(self.enc_h)
+
+        if dec_z is None:
+            dec_z = enc_hs_pad.new_zeros(batch, self.dunits)
+        else:
+            dec_z = dec_z.view(batch, self.dunits)
+
+        # initialize attention weight with uniform dist.
+        if att_prev_list is None:
+            # valid frames get weight 1 / length; padded frames get 0
+            mask = 1.0 - make_pad_mask(enc_hs_len).float()
+            att_prev_list = [
+                to_device(enc_hs_pad, mask / mask.new(enc_hs_len).unsqueeze(-1))
+            ]
+
+        # att_prev_list: L' * [B x T] => cov_vec B x T
+        cov_vec = sum(att_prev_list)
+
+        # cov_vec: B x T -> B x 1 x 1 x T -> B x C x 1 x T
+        att_conv = self.loc_conv(cov_vec.view(batch, 1, 1, self.h_length))
+        # att_conv: utt x att_conv_chans x 1 x frame -> utt x frame x att_conv_chans
+        att_conv = att_conv.squeeze(2).transpose(1, 2)
+        # att_conv: utt x frame x att_conv_chans -> utt x frame x att_dim
+        att_conv = self.mlp_att(att_conv)
+
+        # dec_z_tiled: utt x 1 x att_dim (broadcast over frames)
+        dec_z_tiled = self.mlp_dec(dec_z).view(batch, 1, self.att_dim)
+
+        # dot with gvec
+        # utt x frame x att_dim -> utt x frame
+        e = self.gvec(
+            torch.tanh(att_conv + self.pre_compute_enc_h + dec_z_tiled)
+        ).squeeze(2)
+
+        # NOTE: mask out zero padding when computing w.
+        if self.mask is None:
+            self.mask = to_device(enc_hs_pad, make_pad_mask(enc_hs_len))
+        e.masked_fill_(self.mask, -float("inf"))
+        w = F.softmax(scaling * e, dim=1)
+        att_prev_list += [w]
+
+        # weighted sum over frames
+        # utt x hdim
+        # NOTE use bmm instead of sum(*)
+        c = torch.sum(self.enc_h * w.view(batch, self.h_length, 1), dim=1)
+
+        return c, att_prev_list
+
+
+class AttMultiHeadDot(torch.nn.Module):
+    """Multi head dot product attention
+
+    Reference: Attention is all you need
+        (https://arxiv.org/abs/1706.03762)
+
+    :param int eprojs: # projection-units of encoder
+    :param int dunits: # units of decoder
+    :param int aheads: # heads of multi head attention
+    :param int att_dim_k: dimension k in multi head attention
+    :param int att_dim_v: dimension v in multi head attention
+    :param bool han_mode: flag to switch on hierarchical attention mode,
+        in which pre_compute_k and pre_compute_v are recomputed on every call
+    """
+
+    def __init__(self, eprojs, dunits, aheads, att_dim_k, att_dim_v, han_mode=False):
+        super(AttMultiHeadDot, self).__init__()
+        self.mlp_q = torch.nn.ModuleList()
+        self.mlp_k = torch.nn.ModuleList()
+        self.mlp_v = torch.nn.ModuleList()
+        for _ in six.moves.range(aheads):
+            self.mlp_q += [torch.nn.Linear(dunits, att_dim_k)]
+            self.mlp_k += [torch.nn.Linear(eprojs, att_dim_k, bias=False)]
+            self.mlp_v += [torch.nn.Linear(eprojs, att_dim_v, bias=False)]
+        self.mlp_o = torch.nn.Linear(aheads * att_dim_v, eprojs, bias=False)
+        self.dunits = dunits
+        self.eprojs = eprojs
+        self.aheads = aheads
+        self.att_dim_k = att_dim_k
+        self.att_dim_v = att_dim_v
+        self.scaling = 1.0 / math.sqrt(att_dim_k)
+        self.h_length = None
+        self.enc_h = None
+        self.pre_compute_k = None
+        self.pre_compute_v = None
+        self.mask = None
+        self.han_mode = han_mode
+
+    def reset(self):
+        """reset states"""
+        self.h_length = None
+        self.enc_h = None
+        self.pre_compute_k = None
+        self.pre_compute_v = None
+        self.mask = None
+
+    def forward(self, enc_hs_pad, enc_hs_len, dec_z, att_prev):
+        """AttMultiHeadDot forward
+
+        :param torch.Tensor enc_hs_pad: padded encoder hidden state (B x T_max x D_enc)
+        :param list enc_hs_len: padded encoder hidden state length (B)
+        :param torch.Tensor dec_z: decoder hidden state (B x D_dec)
+        :param torch.Tensor att_prev: dummy (not used)
+        :return: attention weighted encoder state (B x D_enc)
+        :rtype: torch.Tensor
+        :return: list of previous attention weight (B x T_max) * aheads
+        :rtype: list
+        """
+
+        batch = enc_hs_pad.size(0)
+        # pre-compute all k and v outside the decoder loop
+        if self.pre_compute_k is None or self.han_mode:
+            self.enc_h = enc_hs_pad  # utt x frame x hdim
+            self.h_length = self.enc_h.size(1)
+            # aheads * [utt x frame x att_dim_k]
+            self.pre_compute_k = [
+                torch.tanh(self.mlp_k[h](self.enc_h))
+                for h in six.moves.range(self.aheads)
+            ]
+
+        if self.pre_compute_v is None or self.han_mode:
+            self.enc_h = enc_hs_pad  # utt x frame x hdim
+            self.h_length = self.enc_h.size(1)
+            # aheads * [utt x frame x att_dim_v]
+            self.pre_compute_v = [
+                self.mlp_v[h](self.enc_h) for h in six.moves.range(self.aheads)
+            ]
+
+        if dec_z is None:
+            dec_z = enc_hs_pad.new_zeros(batch, self.dunits)
+        else:
+            dec_z = dec_z.view(batch, self.dunits)
+
+        c = []
+        w = []
+        for h in six.moves.range(self.aheads):
+            e = torch.sum(
+                self.pre_compute_k[h]
+                * torch.tanh(self.mlp_q[h](dec_z)).view(batch, 1, self.att_dim_k),
+                dim=2,
+            )  # utt x frame
+
+            # NOTE consider zero padding when computing w.
+            if self.mask is None:
+                self.mask = to_device(enc_hs_pad, make_pad_mask(enc_hs_len))
+            e.masked_fill_(self.mask, -float("inf"))
+            w += [F.softmax(self.scaling * e, dim=1)]
+
+            # weighted sum over frames
+            # utt x hdim
+            # NOTE use bmm instead of sum(*)
+            c += [
+                torch.sum(
+                    self.pre_compute_v[h] * w[h].view(batch, self.h_length, 1), dim=1
+                )
+            ]
+
+        # concat all of c
+        c = self.mlp_o(torch.cat(c, dim=1))
+
+        return c, w
+
+
+class AttMultiHeadAdd(torch.nn.Module):
+    """Multi head additive attention
+
+    Reference: Attention is all you need
+        (https://arxiv.org/abs/1706.03762)
+
+    This attention is multi head attention using additive attention for each head.
+
+    :param int eprojs: # projection-units of encoder
+    :param int dunits: # units of decoder
+    :param int aheads: # heads of multi head attention
+    :param int att_dim_k: dimension k in multi head attention
+    :param int att_dim_v: dimension v in multi head attention
+    :param bool han_mode: flag to switch on mode of hierarchical attention
+        and not store pre_compute_k and pre_compute_v
+    """
+
+    def __init__(self, eprojs, dunits, aheads, att_dim_k, att_dim_v, han_mode=False):
+        super(AttMultiHeadAdd, self).__init__()
+        self.mlp_q = torch.nn.ModuleList()
+        self.mlp_k = torch.nn.ModuleList()
+        self.mlp_v = torch.nn.ModuleList()
+        self.gvec = torch.nn.ModuleList()
+        for _ in six.moves.range(aheads):
+            self.mlp_q += [torch.nn.Linear(dunits, att_dim_k)]
+            self.mlp_k += [torch.nn.Linear(eprojs, att_dim_k, bias=False)]
+            self.mlp_v += [torch.nn.Linear(eprojs, att_dim_v, bias=False)]
+            self.gvec += [torch.nn.Linear(att_dim_k, 1)]
+        self.mlp_o = torch.nn.Linear(aheads * att_dim_v, eprojs, bias=False)
+        self.dunits = dunits
+        self.eprojs = eprojs
+        self.aheads = aheads
+        self.att_dim_k = att_dim_k
+        self.att_dim_v = att_dim_v
+        self.scaling = 1.0 / math.sqrt(att_dim_k)
+        self.h_length = None
+        self.enc_h = None
+        self.pre_compute_k = None
+        self.pre_compute_v = None
+        self.mask = None
+        self.han_mode = han_mode
+
+    def reset(self):
+        """reset states"""
+        self.h_length = None
+        self.enc_h = None
+        self.pre_compute_k = None
+        self.pre_compute_v = None
+        self.mask = None
+
+    def forward(self, enc_hs_pad, enc_hs_len, dec_z, att_prev):
+        """AttMultiHeadAdd forward
+
+        :param torch.Tensor enc_hs_pad: padded encoder hidden state (B x T_max x D_enc)
+        :param list enc_hs_len: padded encoder hidden state length (B)
+        :param torch.Tensor dec_z: decoder hidden state (B x D_dec)
+        :param torch.Tensor att_prev: dummy (not used)
+        :return: attention weighted encoder state (B, D_enc)
+        :rtype: torch.Tensor
+        :return: list of previous attention weight (B x T_max) * aheads
+        :rtype: list
+        """
+
+        batch = enc_hs_pad.size(0)
+        # pre-compute all k and v outside the decoder loop
+        if self.pre_compute_k is None or self.han_mode:
+            self.enc_h = enc_hs_pad  # utt x frame x hdim
+            self.h_length = self.enc_h.size(1)
+            # utt x frame x att_dim
+            self.pre_compute_k = [
+                self.mlp_k[h](self.enc_h) for h in six.moves.range(self.aheads)
+            ]
+
+        if self.pre_compute_v is None or self.han_mode:
+            self.enc_h = enc_hs_pad  # utt x frame x hdim
+            self.h_length = self.enc_h.size(1)
+            # utt x frame x att_dim
+            self.pre_compute_v = [
+                self.mlp_v[h](self.enc_h) for h in six.moves.range(self.aheads)
+            ]
+
+        if dec_z is None:
+            dec_z = enc_hs_pad.new_zeros(batch, self.dunits)
+        else:
+            dec_z = dec_z.view(batch, self.dunits)
+
+        c = []
+        w = []
+        for h in six.moves.range(self.aheads):
+            e = self.gvec[h](
+                torch.tanh(
+                    self.pre_compute_k[h]
+                    + self.mlp_q[h](dec_z).view(batch, 1, self.att_dim_k)
+                )
+            ).squeeze(2)
+
+            # NOTE consider zero padding when computing w.
+            if self.mask is None:
+                self.mask = to_device(enc_hs_pad, make_pad_mask(enc_hs_len))
+            e.masked_fill_(self.mask, -float("inf"))
+            w += [F.softmax(self.scaling * e, dim=1)]
+
+            # weighted sum over frames
+            # utt x hdim
+            # NOTE use bmm instead of sum(*)
+            c += [
+                torch.sum(
+                    self.pre_compute_v[h] * w[h].view(batch, self.h_length, 1), dim=1
+                )
+            ]
+
+        # concat all of c
+        c = self.mlp_o(torch.cat(c, dim=1))
+
+        return c, w
+
+
+class AttMultiHeadLoc(torch.nn.Module):
+    """Multi head location based attention
+
+    Reference: Attention is all you need
+        (https://arxiv.org/abs/1706.03762)
+
+    This attention is multi head attention using location-aware attention for each head.
+
+    :param int eprojs: # projection-units of encoder
+    :param int dunits: # units of decoder
+    :param int aheads: # heads of multi head attention
+    :param int att_dim_k: dimension k in multi head attention
+    :param int att_dim_v: dimension v in multi head attention
+    :param int aconv_chans: # channels of attention convolution
+    :param int aconv_filts: filter size of attention convolution
+    :param bool han_mode: flag to switch on mode of hierarchical attention
+        and not store pre_compute_k and pre_compute_v
+    """
+
+    def __init__(
+        self,
+        eprojs,
+        dunits,
+        aheads,
+        att_dim_k,
+        att_dim_v,
+        aconv_chans,
+        aconv_filts,
+        han_mode=False,
+    ):
+        super(AttMultiHeadLoc, self).__init__()
+        self.mlp_q = torch.nn.ModuleList()
+        self.mlp_k = torch.nn.ModuleList()
+        self.mlp_v = torch.nn.ModuleList()
+        self.gvec = torch.nn.ModuleList()
+        self.loc_conv = torch.nn.ModuleList()
+        self.mlp_att = torch.nn.ModuleList()
+        for _ in six.moves.range(aheads):
+            self.mlp_q += [torch.nn.Linear(dunits, att_dim_k)]
+            self.mlp_k += [torch.nn.Linear(eprojs, att_dim_k, bias=False)]
+            self.mlp_v += [torch.nn.Linear(eprojs, att_dim_v, bias=False)]
+            self.gvec += [torch.nn.Linear(att_dim_k, 1)]
+            self.loc_conv += [
+                torch.nn.Conv2d(
+                    1,
+                    aconv_chans,
+                    (1, 2 * aconv_filts + 1),
+                    padding=(0, aconv_filts),
+                    bias=False,
+                )
+            ]
+            self.mlp_att += [torch.nn.Linear(aconv_chans, att_dim_k, bias=False)]
+        self.mlp_o = torch.nn.Linear(aheads * att_dim_v, eprojs, bias=False)
+        self.dunits = dunits
+        self.eprojs = eprojs
+        self.aheads = aheads
+        self.att_dim_k = att_dim_k
+        self.att_dim_v = att_dim_v
+        self.scaling = 1.0 / math.sqrt(att_dim_k)
+        self.h_length = None
+        self.enc_h = None
+        self.pre_compute_k = None
+        self.pre_compute_v = None
+        self.mask = None
+        self.han_mode = han_mode
+
+    def reset(self):
+        """reset states"""
+        self.h_length = None
+        self.enc_h = None
+        self.pre_compute_k = None
+        self.pre_compute_v = None
+        self.mask = None
+
+    def forward(self, enc_hs_pad, enc_hs_len, dec_z, att_prev, scaling=2.0):
+        """AttMultiHeadLoc forward
+
+        :param torch.Tensor enc_hs_pad: padded encoder hidden state (B x T_max x D_enc)
+        :param list enc_hs_len: padded encoder hidden state length (B)
+        :param torch.Tensor dec_z: decoder hidden state (B x D_dec)
+        :param list att_prev:
+            list of previous attention weight (B x T_max) * aheads
+        :param float scaling: scaling parameter before applying softmax
+        :return: attention weighted encoder state (B x D_enc)
+        :rtype: torch.Tensor
+        :return: list of previous attention weight (B x T_max) * aheads
+        :rtype: list
+        """
+
+        batch = enc_hs_pad.size(0)
+        # pre-compute all k and v outside the decoder loop
+        if self.pre_compute_k is None or self.han_mode:
+            self.enc_h = enc_hs_pad  # utt x frame x hdim
+            self.h_length = self.enc_h.size(1)
+            # utt x frame x att_dim
+            self.pre_compute_k = [
+                self.mlp_k[h](self.enc_h) for h in six.moves.range(self.aheads)
+            ]
+
+        if self.pre_compute_v is None or self.han_mode:
+            self.enc_h = enc_hs_pad  # utt x frame x hdim
+            self.h_length = self.enc_h.size(1)
+            # utt x frame x att_dim
+            self.pre_compute_v = [
+                self.mlp_v[h](self.enc_h) for h in six.moves.range(self.aheads)
+            ]
+
+        if dec_z is None:
+            dec_z = enc_hs_pad.new_zeros(batch, self.dunits)
+        else:
+            dec_z = dec_z.view(batch, self.dunits)
+
+        if att_prev is None:
+            att_prev = []
+            for _ in six.moves.range(self.aheads):
+                # the convolution has no bias, so zero-padded frames stay zero
+                mask = 1.0 - make_pad_mask(enc_hs_len).float()
+                att_prev += [
+                    to_device(enc_hs_pad, mask / mask.new(enc_hs_len).unsqueeze(-1))
+                ]
+
+        c = []
+        w = []
+        for h in six.moves.range(self.aheads):
+            att_conv = self.loc_conv[h](att_prev[h].view(batch, 1, 1, self.h_length))
+            att_conv = att_conv.squeeze(2).transpose(1, 2)
+            att_conv = self.mlp_att[h](att_conv)
+
+            e = self.gvec[h](
+                torch.tanh(
+                    self.pre_compute_k[h]
+                    + att_conv
+                    + self.mlp_q[h](dec_z).view(batch, 1, self.att_dim_k)
+                )
+            ).squeeze(2)
+
+            # NOTE consider zero padding when computing w.
+            if self.mask is None:
+                self.mask = to_device(enc_hs_pad, make_pad_mask(enc_hs_len))
+            e.masked_fill_(self.mask, -float("inf"))
+            w += [F.softmax(scaling * e, dim=1)]
+
+            # weighted sum over frames
+            # utt x hdim
+            # NOTE use bmm instead of sum(*)
+            c += [
+                torch.sum(
+                    self.pre_compute_v[h] * w[h].view(batch, self.h_length, 1), dim=1
+                )
+            ]
+
+        # concat all of c
+        c = self.mlp_o(torch.cat(c, dim=1))
+
+        return c, w
+
+
+class AttMultiHeadMultiResLoc(torch.nn.Module):
+    """Multi head multi resolution location based attention
+
+    Reference: Attention is all you need
+        (https://arxiv.org/abs/1706.03762)
+
+    This attention is multi head attention using location-aware attention for each head.
+    Furthermore, it uses different filter size for each head.
+
+    :param int eprojs: # projection-units of encoder
+    :param int dunits: # units of decoder
+    :param int aheads: # heads of multi head attention
+    :param int att_dim_k: dimension k in multi head attention
+    :param int att_dim_v: dimension v in multi head attention
+    :param int aconv_chans: # channels of attention convolution
+    :param int aconv_filts: maximum filter size of attention convolution
+        each head uses filter size = aconv_filts * (head + 1) // aheads
+        e.g. aheads=4, aconv_filts=100 => filter size = 25, 50, 75, 100
+    :param bool han_mode: flag to switch on mode of hierarchical attention
+        and not store pre_compute_k and pre_compute_v
+    """
+
+    def __init__(
+        self,
+        eprojs,
+        dunits,
+        aheads,
+        att_dim_k,
+        att_dim_v,
+        aconv_chans,
+        aconv_filts,
+        han_mode=False,
+    ):
+        super(AttMultiHeadMultiResLoc, self).__init__()
+        self.mlp_q = torch.nn.ModuleList()
+        self.mlp_k = torch.nn.ModuleList()
+        self.mlp_v = torch.nn.ModuleList()
+        self.gvec = torch.nn.ModuleList()
+        self.loc_conv = torch.nn.ModuleList()
+        self.mlp_att = torch.nn.ModuleList()
+        for h in six.moves.range(aheads):
+            self.mlp_q += [torch.nn.Linear(dunits, att_dim_k)]
+            self.mlp_k += [torch.nn.Linear(eprojs, att_dim_k, bias=False)]
+            self.mlp_v += [torch.nn.Linear(eprojs, att_dim_v, bias=False)]
+            self.gvec += [torch.nn.Linear(att_dim_k, 1)]
+            afilts = aconv_filts * (h + 1) // aheads
+            self.loc_conv += [
+                torch.nn.Conv2d(
+                    1, aconv_chans, (1, 2 * afilts + 1), padding=(0, afilts), bias=False
+                )
+            ]
+            self.mlp_att += [torch.nn.Linear(aconv_chans, att_dim_k, bias=False)]
+        self.mlp_o = torch.nn.Linear(aheads * att_dim_v, eprojs, bias=False)
+        self.dunits = dunits
+        self.eprojs = eprojs
+        self.aheads = aheads
+        self.att_dim_k = att_dim_k
+        self.att_dim_v = att_dim_v
+        self.scaling = 1.0 / math.sqrt(att_dim_k)
+        self.h_length = None
+        self.enc_h = None
+        self.pre_compute_k = None
+        self.pre_compute_v = None
+        self.mask = None
+        self.han_mode = han_mode
+
+    def reset(self):
+        """reset states"""
+        self.h_length = None
+        self.enc_h = None
+        self.pre_compute_k = None
+        self.pre_compute_v = None
+        self.mask = None
+
+    def forward(self, enc_hs_pad, enc_hs_len, dec_z, att_prev):
+        """AttMultiHeadMultiResLoc forward
+
+        :param torch.Tensor enc_hs_pad: padded encoder hidden state (B x T_max x D_enc)
+        :param list enc_hs_len: padded encoder hidden state length (B)
+        :param torch.Tensor dec_z: decoder hidden state (B x D_dec)
+        :param list att_prev: list of previous attention weight
+            (B x T_max) * aheads
+        :return: attention weighted encoder state (B x D_enc)
+        :rtype: torch.Tensor
+        :return: list of previous attention weight (B x T_max) * aheads
+        :rtype: list
+        """
+
+        batch = enc_hs_pad.size(0)
+        # pre-compute all k and v outside the decoder loop
+        if self.pre_compute_k is None or self.han_mode:
+            self.enc_h = enc_hs_pad  # utt x frame x hdim
+            self.h_length = self.enc_h.size(1)
+            # utt x frame x att_dim
+            self.pre_compute_k = [
+                self.mlp_k[h](self.enc_h) for h in six.moves.range(self.aheads)
+            ]
+
+        if self.pre_compute_v is None or self.han_mode:
+            self.enc_h = enc_hs_pad  # utt x frame x hdim
+            self.h_length = self.enc_h.size(1)
+            # utt x frame x att_dim
+            self.pre_compute_v = [
+                self.mlp_v[h](self.enc_h) for h in six.moves.range(self.aheads)
+            ]
+
+        if dec_z is None:
+            dec_z = enc_hs_pad.new_zeros(batch, self.dunits)
+        else:
+            dec_z = dec_z.view(batch, self.dunits)
+
+        if att_prev is None:
+            att_prev = []
+            for _ in six.moves.range(self.aheads):
+                # the convolution has no bias, so zero-padded frames stay zero
+                mask = 1.0 - make_pad_mask(enc_hs_len).float()
+                att_prev += [
+                    to_device(enc_hs_pad, mask / mask.new(enc_hs_len).unsqueeze(-1))
+                ]
+
+        c = []
+        w = []
+        for h in six.moves.range(self.aheads):
+            att_conv = self.loc_conv[h](att_prev[h].view(batch, 1, 1, self.h_length))
+            att_conv = att_conv.squeeze(2).transpose(1, 2)
+            att_conv = self.mlp_att[h](att_conv)
+
+            e = self.gvec[h](
+                torch.tanh(
+                    self.pre_compute_k[h]
+                    + att_conv
+                    + self.mlp_q[h](dec_z).view(batch, 1, self.att_dim_k)
+                )
+            ).squeeze(2)
+
+            # NOTE consider zero padding when computing w.
+            if self.mask is None:
+                self.mask = to_device(enc_hs_pad, make_pad_mask(enc_hs_len))
+            e.masked_fill_(self.mask, -float("inf"))
+            w += [F.softmax(self.scaling * e, dim=1)]
+
+            # weighted sum over frames
+            # utt x hdim
+            # NOTE use bmm instead of sum(*)
+            c += [
+                torch.sum(
+                    self.pre_compute_v[h] * w[h].view(batch, self.h_length, 1), dim=1
+                )
+            ]
+
+        # concat all of c
+        c = self.mlp_o(torch.cat(c, dim=1))
+
+        return c, w
+
+
+class AttForward(torch.nn.Module):
+    """Forward attention module.
+
+    Reference:
+    Forward attention in sequence-to-sequence acoustic modeling for speech synthesis
+        (https://arxiv.org/pdf/1807.06736.pdf)
+
+    :param int eprojs: # projection-units of encoder
+    :param int dunits: # units of decoder
+    :param int att_dim: attention dimension
+    :param int aconv_chans: # channels of attention convolution
+    :param int aconv_filts: filter size of attention convolution
+    """
+
+    def __init__(self, eprojs, dunits, att_dim, aconv_chans, aconv_filts):
+        super(AttForward, self).__init__()
+        self.mlp_enc = torch.nn.Linear(eprojs, att_dim)
+        self.mlp_dec = torch.nn.Linear(dunits, att_dim, bias=False)
+        self.mlp_att = torch.nn.Linear(aconv_chans, att_dim, bias=False)
+        self.loc_conv = torch.nn.Conv2d(
+            1,
+            aconv_chans,
+            (1, 2 * aconv_filts + 1),
+            padding=(0, aconv_filts),
+            bias=False,
+        )
+        self.gvec = torch.nn.Linear(att_dim, 1)
+        self.dunits = dunits
+        self.eprojs = eprojs
+        self.att_dim = att_dim
+        self.h_length = None
+        self.enc_h = None
+        self.pre_compute_enc_h = None
+        self.mask = None
+
+    def reset(self):
+        """reset states"""
+        self.h_length = None
+        self.enc_h = None
+        self.pre_compute_enc_h = None
+        self.mask = None
+
+    def forward(
+        self,
+        enc_hs_pad,
+        enc_hs_len,
+        dec_z,
+        att_prev,
+        scaling=1.0,
+        last_attended_idx=None,
+        backward_window=1,
+        forward_window=3,
+    ):
+        """Calculate AttForward forward propagation.
+
+        :param torch.Tensor enc_hs_pad: padded encoder hidden state (B x T_max x D_enc)
+        :param list enc_hs_len: padded encoder hidden state length (B)
+        :param torch.Tensor dec_z: decoder hidden state (B x D_dec)
+        :param torch.Tensor att_prev: attention weights of previous step
+        :param float scaling: scaling parameter before applying softmax
+        :param int last_attended_idx: index of the last attended input
+        :param int backward_window: backward window size in attention constraint
+        :param int forward_window: forward window size in attention constraint
+        :return: attention weighted encoder state (B, D_enc)
+        :rtype: torch.Tensor
+        :return: previous attention weights (B x T_max)
+        :rtype: torch.Tensor
+        """
+        batch = len(enc_hs_pad)
+        # pre-compute all h outside the decoder loop
+        if self.pre_compute_enc_h is None:
+            self.enc_h = enc_hs_pad  # utt x frame x hdim
+            self.h_length = self.enc_h.size(1)
+            # utt x frame x att_dim
+            self.pre_compute_enc_h = self.mlp_enc(self.enc_h)
+
+        if dec_z is None:
+            dec_z = enc_hs_pad.new_zeros(batch, self.dunits)
+        else:
+            dec_z = dec_z.view(batch, self.dunits)
+
+        if att_prev is None:
+            # initial attention will be [1, 0, 0, ...]
+            att_prev = enc_hs_pad.new_zeros(*enc_hs_pad.size()[:2])
+            att_prev[:, 0] = 1.0
+
+        # att_prev: utt x frame -> utt x 1 x 1 x frame
+        # -> utt x att_conv_chans x 1 x frame
+        att_conv = self.loc_conv(att_prev.view(batch, 1, 1, self.h_length))
+        # att_conv: utt x att_conv_chans x 1 x frame -> utt x frame x att_conv_chans
+        att_conv = att_conv.squeeze(2).transpose(1, 2)
+        # att_conv: utt x frame x att_conv_chans -> utt x frame x att_dim
+        att_conv = self.mlp_att(att_conv)
+
+        # dec_z_tiled: utt x 1 x att_dim (broadcast over frames)
+        dec_z_tiled = self.mlp_dec(dec_z).unsqueeze(1)
+
+        # dot with gvec
+        # utt x frame x att_dim -> utt x frame
+        e = self.gvec(
+            torch.tanh(self.pre_compute_enc_h + dec_z_tiled + att_conv)
+        ).squeeze(2)
+
+        # NOTE: consider zero padding when computing w.
+        if self.mask is None:
+            self.mask = to_device(enc_hs_pad, make_pad_mask(enc_hs_len))
+        e.masked_fill_(self.mask, -float("inf"))
+
+        # apply monotonic attention constraint (mainly for TTS)
+        if last_attended_idx is not None:
+            e = _apply_attention_constraint(
+                e, last_attended_idx, backward_window, forward_window
+            )
+
+        w = F.softmax(scaling * e, dim=1)
+
+        # forward attention
+        att_prev_shift = F.pad(att_prev, (1, 0))[:, :-1]
+        w = (att_prev + att_prev_shift) * w
+        # NOTE: clamp is needed to avoid nan gradient
+        w = F.normalize(torch.clamp(w, 1e-6), p=1, dim=1)
+
+        # weighted sum over frames
+        # utt x hdim
+        # NOTE use bmm instead of sum(*)
+        c = torch.sum(self.enc_h * w.unsqueeze(-1), dim=1)
+
+        return c, w
+
+
+class AttForwardTA(torch.nn.Module):
+    """Forward attention with transition agent module.
+
+    Reference:
+    Forward attention in sequence-to-sequence acoustic modeling for speech synthesis
+        (https://arxiv.org/pdf/1807.06736.pdf)
+
+    :param int eunits: # units of encoder
+    :param int dunits: # units of decoder
+    :param int att_dim: attention dimension
+    :param int aconv_chans: # channels of attention convolution
+    :param int aconv_filts: filter size of attention convolution
+    :param int odim: output dimension
+    """
+
+    def __init__(self, eunits, dunits, att_dim, aconv_chans, aconv_filts, odim):
+        super(AttForwardTA, self).__init__()
+        self.mlp_enc = torch.nn.Linear(eunits, att_dim)
+        self.mlp_dec = torch.nn.Linear(dunits, att_dim, bias=False)
+        self.mlp_ta = torch.nn.Linear(eunits + dunits + odim, 1)
+        self.mlp_att = torch.nn.Linear(aconv_chans, att_dim, bias=False)
+        self.loc_conv = torch.nn.Conv2d(
+            1,
+            aconv_chans,
+            (1, 2 * aconv_filts + 1),
+            padding=(0, aconv_filts),
+            bias=False,
+        )
+        self.gvec = torch.nn.Linear(att_dim, 1)
+        self.dunits = dunits
+        self.eunits = eunits
+        self.att_dim = att_dim
+        self.h_length = None
+        self.enc_h = None
+        self.pre_compute_enc_h = None
+        self.mask = None
+        self.trans_agent_prob = 0.5
+
+    def reset(self):
+        self.h_length = None
+        self.enc_h = None
+        self.pre_compute_enc_h = None
+        self.mask = None
+        self.trans_agent_prob = 0.5
+
+    def forward(
+        self,
+        enc_hs_pad,
+        enc_hs_len,
+        dec_z,
+        att_prev,
+        out_prev,
+        scaling=1.0,
+        last_attended_idx=None,
+        backward_window=1,
+        forward_window=3,
+    ):
+        """Calculate AttForwardTA forward propagation.
+
+        :param torch.Tensor enc_hs_pad: padded encoder hidden state (B, Tmax, eunits)
+        :param list enc_hs_len: padded encoder hidden state length (B)
+        :param torch.Tensor dec_z: decoder hidden state (B, dunits)
+        :param torch.Tensor att_prev: attention weights of previous step
+        :param torch.Tensor out_prev: decoder outputs of previous step (B, odim)
+        :param float scaling: scaling parameter before applying softmax
+        :param int last_attended_idx: index of the last attended input
+        :param int backward_window: backward window size in attention constraint
+        :param int forward_window: forward window size in attention constraint
+        :return: attention weighted encoder state (B, eunits)
+        :rtype: torch.Tensor
+        :return: previous attention weights (B, Tmax)
+        :rtype: torch.Tensor
+        """
+        batch = len(enc_hs_pad)
+        # pre-compute all h outside the decoder loop
+        if self.pre_compute_enc_h is None:
+            self.enc_h = enc_hs_pad  # utt x frame x hdim
+            self.h_length = self.enc_h.size(1)
+            # utt x frame x att_dim
+            self.pre_compute_enc_h = self.mlp_enc(self.enc_h)
+
+        if dec_z is None:
+            dec_z = enc_hs_pad.new_zeros(batch, self.dunits)
+        else:
+            dec_z = dec_z.view(batch, self.dunits)
+
+        if att_prev is None:
+            # initial attention will be [1, 0, 0, ...]
+            att_prev = enc_hs_pad.new_zeros(*enc_hs_pad.size()[:2])
+            att_prev[:, 0] = 1.0
+
+        # att_prev: utt x frame -> utt x 1 x 1 x frame
+        # -> utt x att_conv_chans x 1 x frame
+        att_conv = self.loc_conv(att_prev.view(batch, 1, 1, self.h_length))
+        # att_conv: utt x att_conv_chans x 1 x frame -> utt x frame x att_conv_chans
+        att_conv = att_conv.squeeze(2).transpose(1, 2)
+        # att_conv: utt x frame x att_conv_chans -> utt x frame x att_dim
+        att_conv = self.mlp_att(att_conv)
+
+        # dec_z_tiled: utt x 1 x att_dim (broadcast over frames)
+        dec_z_tiled = self.mlp_dec(dec_z).view(batch, 1, self.att_dim)
+
+        # dot with gvec
+        # utt x frame x att_dim -> utt x frame
+        e = self.gvec(
+            torch.tanh(att_conv + self.pre_compute_enc_h + dec_z_tiled)
+        ).squeeze(2)
+
+        # NOTE consider zero padding when computing w.
+        if self.mask is None:
+            self.mask = to_device(enc_hs_pad, make_pad_mask(enc_hs_len))
+        e.masked_fill_(self.mask, -float("inf"))
+
+        # apply monotonic attention constraint (mainly for TTS)
+        if last_attended_idx is not None:
+            e = _apply_attention_constraint(
+                e, last_attended_idx, backward_window, forward_window
+            )
+
+        w = F.softmax(scaling * e, dim=1)
+
+        # forward attention
+        att_prev_shift = F.pad(att_prev, (1, 0))[:, :-1]
+        w = (
+            self.trans_agent_prob * att_prev
+            + (1 - self.trans_agent_prob) * att_prev_shift
+        ) * w
+        # NOTE: clamp is needed to avoid nan gradient
+        w = F.normalize(torch.clamp(w, 1e-6), p=1, dim=1)
+
+        # weighted sum over frames
+        # utt x hdim
+        # NOTE use bmm instead of sum(*)
+        c = torch.sum(self.enc_h * w.view(batch, self.h_length, 1), dim=1)
+
+        # update transition agent prob
+        self.trans_agent_prob = torch.sigmoid(
+            self.mlp_ta(torch.cat([c, out_prev, dec_z], dim=1))
+        )
+
+        return c, w
+
+
+def att_for(args, num_att=1, han_mode=False):
+    """Instantiates an attention module given the program arguments
+
+    :param Namespace args: The arguments
+    :param int num_att: number of attention modules
+        (in multi-speaker case, it can be 2 or more)
+    :param bool han_mode: switch on/off mode of hierarchical attention network (HAN)
+    :rtype: torch.nn.Module
+    :return: The attention module
+    """
+    att_list = torch.nn.ModuleList()
+    num_encs = getattr(args, "num_encs", 1)  # use getattr to keep compatibility
+    aheads = getattr(args, "aheads", None)
+    awin = getattr(args, "awin", None)
+    aconv_chans = getattr(args, "aconv_chans", None)
+    aconv_filts = getattr(args, "aconv_filts", None)
+
+    if num_encs == 1:
+        for i in range(num_att):
+            att = initial_att(
+                args.atype,
+                args.eprojs,
+                args.dunits,
+                aheads,
+                args.adim,
+                awin,
+                aconv_chans,
+                aconv_filts,
+            )
+            att_list.append(att)
+    elif num_encs > 1:  # no multi-speaker mode
+        if han_mode:
+            att = initial_att(
+                args.han_type,
+                args.eprojs,
+                args.dunits,
+                args.han_heads,
+                args.han_dim,
+                args.han_win,
+                args.han_conv_chans,
+                args.han_conv_filts,
+                han_mode=True,
+            )
+            return att
+        else:
+            att_list = torch.nn.ModuleList()
+            for idx in range(num_encs):
+                att = initial_att(
+                    args.atype[idx],
+                    args.eprojs,
+                    args.dunits,
+                    aheads[idx],
+                    args.adim[idx],
+                    awin[idx],
+                    aconv_chans[idx],
+                    aconv_filts[idx],
+                )
+                att_list.append(att)
+    else:
+        raise ValueError(
+            "Number of encoders needs to be at least one. {}".format(num_encs)
+        )
+    return att_list
+
+
+def initial_att(
+    atype, eprojs, dunits, aheads, adim, awin, aconv_chans, aconv_filts, han_mode=False
+):
+    """Instantiates a single attention module
+
+    :param str atype: attention type
+    :param int eprojs: # projection-units of encoder
+    :param int dunits: # units of decoder
+    :param int aheads: # heads of multi head attention
+    :param int adim: attention dimension
+    :param int awin: attention window size
+    :param int aconv_chans: # channels of attention convolution
+    :param int aconv_filts: filter size of attention convolution
+    :param bool han_mode: flag to switch on mode of hierarchical attention
+    :return: The attention module
+    """
+
+    if atype == "noatt":
+        att = NoAtt()
+    elif atype == "dot":
+        att = AttDot(eprojs, dunits, adim, han_mode)
+    elif atype == "add":
+        att = AttAdd(eprojs, dunits, adim, han_mode)
+    elif atype == "location":
+        att = AttLoc(eprojs, dunits, adim, aconv_chans, aconv_filts, han_mode)
+    elif atype == "location2d":
+        att = AttLoc2D(eprojs, dunits, adim, awin, aconv_chans, aconv_filts, han_mode)
+    elif atype == "location_recurrent":
+        att = AttLocRec(eprojs, dunits, adim, aconv_chans, aconv_filts, han_mode)
+    elif atype == "coverage":
+        att = AttCov(eprojs, dunits, adim, han_mode)
+    elif atype == "coverage_location":
+        att = AttCovLoc(eprojs, dunits, adim, aconv_chans, aconv_filts, han_mode)
+    elif atype == "multi_head_dot":
+        att = AttMultiHeadDot(eprojs, dunits, aheads, adim, adim, han_mode)
+    elif atype == "multi_head_add":
+        att = AttMultiHeadAdd(eprojs, dunits, aheads, adim, adim, han_mode)
+    elif atype == "multi_head_loc":
+        att = AttMultiHeadLoc(
+            eprojs, dunits, aheads, adim, adim, aconv_chans, aconv_filts, han_mode
+        )
+    elif atype == "multi_head_multi_res_loc":
+        att = AttMultiHeadMultiResLoc(
+            eprojs, dunits, aheads, adim, adim, aconv_chans, aconv_filts, han_mode
+        )
+    return att
+
+
+def att_to_numpy(att_ws, att):
+    """Converts attention weights to a numpy array given the attention
+
+    :param list att_ws: The attention weights
+    :param torch.nn.Module att: The attention
+    :rtype: np.ndarray
+    :return: The numpy array of the attention weights
+    """
+    # convert to numpy array with the shape (B, Lmax, Tmax)
+    if isinstance(att, AttLoc2D):
+        # att_ws => list of previous concatenated attentions
+        att_ws = torch.stack([aw[:, -1] for aw in att_ws], dim=1).cpu().numpy()
+    elif isinstance(att, (AttCov, AttCovLoc)):
+        # att_ws => list of list of previous attentions
+        att_ws = (
+            torch.stack([aw[idx] for idx, aw in enumerate(att_ws)], dim=1).cpu().numpy()
+        )
+    elif isinstance(att, AttLocRec):
+        # att_ws => list of tuple of attention and hidden states
+        att_ws = torch.stack([aw[0] for aw in att_ws], dim=1).cpu().numpy()
+    elif isinstance(
+        att,
+        (AttMultiHeadDot, AttMultiHeadAdd, AttMultiHeadLoc, AttMultiHeadMultiResLoc),
+    ):
+        # att_ws => list of list of each head attention
+        n_heads = len(att_ws[0])
+        att_ws_sorted_by_head = []
+        for h in six.moves.range(n_heads):
+            att_ws_head = torch.stack([aw[h] for aw in att_ws], dim=1)
+            att_ws_sorted_by_head += [att_ws_head]
+        att_ws = torch.stack(att_ws_sorted_by_head, dim=1).cpu().numpy()
+    else:
+        # att_ws => list of attentions
+        att_ws = torch.stack(att_ws, dim=1).cpu().numpy()
+    return att_ws
diff --git a/funasr/modules/rnn/decoders.py b/funasr/modules/rnn/decoders.py
new file mode 100644
index 000000000..c5d886f30
--- /dev/null
+++ b/funasr/modules/rnn/decoders.py
@@ -0,0 +1,1211 @@
+"""RNN decoder module."""
+import logging
+import math
+import random
+from argparse import Namespace
+
+import numpy as np
+import six
+import torch
+import torch.nn.functional as F
+
+from funasr.modules.scorers.ctc_prefix_score import CTCPrefixScore
+from funasr.modules.scorers.ctc_prefix_score import CTCPrefixScoreTH
+from funasr.modules.scorers.scorer_interface import ScorerInterface
+from funasr.modules.e2e_asr_common import end_detect
+from funasr.modules.nets_utils import mask_by_length
+from funasr.modules.nets_utils import pad_list
+from funasr.modules.nets_utils import th_accuracy
+from funasr.modules.nets_utils import to_device
+from funasr.modules.rnn.attentions import att_to_numpy
+
+MAX_DECODER_OUTPUT = 5
+CTC_SCORING_RATIO = 1.5
+
+
+class Decoder(torch.nn.Module, ScorerInterface):
+    """Decoder module
+
+    :param int eprojs: encoder projection units
+    :param int odim: dimension of outputs
+    :param str dtype: gru or lstm
+    :param int dlayers: decoder layers
+    :param int dunits: decoder units
+    :param int sos: start of sequence symbol id
+    :param int eos: end of sequence symbol id
+    :param torch.nn.Module att: attention module
+    :param int verbose: verbose level
+    :param list char_list: list of character strings
+    :param ndarray labeldist: distribution of label smoothing
+    :param float lsm_weight: label smoothing weight
+    :param float sampling_probability: scheduled sampling probability
+    :param float dropout: dropout rate
+    :param bool context_residual: if True, use context vector for token generation
+    :param bool replace_sos: if True, replace sos with the target language id (multilingual speech/text translation)
+    """
+
+    def __init__(
+            self,
+            eprojs,
+            odim,
+            dtype,
+            dlayers,
+            dunits,
+            sos,
+            eos,
+            att,
+            verbose=0,
+            char_list=None,
+            labeldist=None,
+            lsm_weight=0.0,
+            sampling_probability=0.0,
+            dropout=0.0,
+            context_residual=False,
+            replace_sos=False,
+            num_encs=1,
+    ):
+
+        torch.nn.Module.__init__(self)
+        self.dtype = dtype
+        self.dunits = dunits
+        self.dlayers = dlayers
+        self.context_residual = context_residual
+        self.embed = torch.nn.Embedding(odim, dunits)
+        self.dropout_emb = torch.nn.Dropout(p=dropout)
+
+        self.decoder = torch.nn.ModuleList()
+        self.dropout_dec = torch.nn.ModuleList()
+        self.decoder += [
+            torch.nn.LSTMCell(dunits + eprojs, dunits)
+            if self.dtype == "lstm"
+            else torch.nn.GRUCell(dunits + eprojs, dunits)
+        ]
+        self.dropout_dec += [torch.nn.Dropout(p=dropout)]
+        for _ in six.moves.range(1, self.dlayers):
+            self.decoder += [
+                torch.nn.LSTMCell(dunits, dunits)
+                if self.dtype == "lstm"
+                else torch.nn.GRUCell(dunits, dunits)
+            ]
+            self.dropout_dec += [torch.nn.Dropout(p=dropout)]
+            # NOTE: dropout is applied only for the vertical connections
+            # see https://arxiv.org/pdf/1409.2329.pdf
+        self.ignore_id = -1
+
+        if context_residual:
+            self.output = torch.nn.Linear(dunits + eprojs, odim)
+        else:
+            self.output = torch.nn.Linear(dunits, odim)
+
+        self.loss = None
+        self.att = att
+        self.dunits = dunits
+        self.sos = sos
+        self.eos = eos
+        self.odim = odim
+        self.verbose = verbose
+        self.char_list = char_list
+        # for label smoothing
+        self.labeldist = labeldist
+        self.vlabeldist = None
+        self.lsm_weight = lsm_weight
+        self.sampling_probability = sampling_probability
+        self.dropout = dropout
+        self.num_encs = num_encs
+
+        # for multilingual E2E-ST
+        self.replace_sos = replace_sos
+
+        self.logzero = -10000000000.0
+
+    def zero_state(self, hs_pad):
+        return hs_pad.new_zeros(hs_pad.size(0), self.dunits)
+
+    def rnn_forward(self, ey, z_list, c_list, z_prev, c_prev):
+        if self.dtype == "lstm":
+            z_list[0], c_list[0] = self.decoder[0](ey, (z_prev[0], c_prev[0]))
+            for i in six.moves.range(1, self.dlayers):
+                z_list[i], c_list[i] = self.decoder[i](
+                    self.dropout_dec[i - 1](z_list[i - 1]), (z_prev[i], c_prev[i])
+                )
+        else:
+            z_list[0] = self.decoder[0](ey, z_prev[0])
+            for i in six.moves.range(1, self.dlayers):
+                z_list[i] = self.decoder[i](
+                    self.dropout_dec[i - 1](z_list[i - 1]), z_prev[i]
+                )
+        return z_list, c_list
+
+    def forward(self, hs_pad, hlens, ys_pad, strm_idx=0, lang_ids=None):
+        """Decoder forward
+
+        :param torch.Tensor hs_pad: batch of padded hidden state sequences (B, Tmax, D)
+                                    [in multi-encoder case,
+                                    list of torch.Tensor,
+                                    [(B, Tmax_1, D), (B, Tmax_2, D), ..., ] ]
+        :param torch.Tensor hlens: batch of lengths of hidden state sequences (B)
+                                   [in multi-encoder case, list of torch.Tensor,
+                                   [(B), (B), ..., ]
+        :param torch.Tensor ys_pad: batch of padded character id sequence tensor
+                                    (B, Lmax)
+        :param int strm_idx: stream index indicates the index of decoding stream.
+        :param torch.Tensor lang_ids: batch of target language id tensor (B, 1)
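+        :return: tuple of (attention loss value, accuracy, perplexity);
+            the perplexity is computed from the mean cross-entropy before
+            the loss is rescaled by the average target length
+        :rtype: tuple(torch.Tensor, float, float)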
+        """
+        # to support multiple-encoder ASR mode, in single-encoder mode,
+        # convert torch.Tensor to List of torch.Tensor
+        if self.num_encs == 1:
+            hs_pad = [hs_pad]
+            hlens = [hlens]
+
+        # TODO(kan-bayashi): find a smarter way to do this
+        ys = [y[y != self.ignore_id] for y in ys_pad]  # parse padded ys
+        # attention index for the attention module
+        # in SPA (speaker parallel attention),
+        # att_idx is used to select attention module. In other cases, it is 0.
+        att_idx = min(strm_idx, len(self.att) - 1)
+
+        # hlens should be list of list of integer
+        hlens = [list(map(int, hlens[idx])) for idx in range(self.num_encs)]
+
+        self.loss = None
+        # prepare input and output word sequences with sos/eos IDs
+        eos = ys[0].new([self.eos])
+        sos = ys[0].new([self.sos])
+        if self.replace_sos:
+            ys_in = [torch.cat([idx, y], dim=0) for idx, y in zip(lang_ids, ys)]
+        else:
+            ys_in = [torch.cat([sos, y], dim=0) for y in ys]
+        ys_out = [torch.cat([y, eos], dim=0) for y in ys]
+
+        # pad ys_in with eos and ys_out with ignore_id (-1)
+        # padded ys: utt x olen
+        ys_in_pad = pad_list(ys_in, self.eos)
+        ys_out_pad = pad_list(ys_out, self.ignore_id)
+
+        # get dim, length info
+        batch = ys_out_pad.size(0)
+        olength = ys_out_pad.size(1)
+        for idx in range(self.num_encs):
+            logging.info(
+                self.__class__.__name__
+                + " Number of Encoder:{}; enc{}: input lengths: {}.".format(
+                    self.num_encs, idx + 1, hlens[idx]
+                )
+            )
+        logging.info(
+            self.__class__.__name__
+            + " output lengths: "
+            + str([y.size(0) for y in ys_out])
+        )
+
+        # initialization
+        c_list = [self.zero_state(hs_pad[0])]
+        z_list = [self.zero_state(hs_pad[0])]
+        for _ in six.moves.range(1, self.dlayers):
+            c_list.append(self.zero_state(hs_pad[0]))
+            z_list.append(self.zero_state(hs_pad[0]))
+        z_all = []
+        if self.num_encs == 1:
+            att_w = None
+            self.att[att_idx].reset()  # reset pre-computation of h
+        else:
+            att_w_list = [None] * (self.num_encs + 1)  # atts + han
+            att_c_list = [None] * (self.num_encs)  # atts
+            for idx in range(self.num_encs + 1):
+                self.att[idx].reset()  # reset pre-computation of h in atts and han
+
+        # pre-computation of embedding
+        eys = self.dropout_emb(self.embed(ys_in_pad))  # utt x olen x zdim
+
+        # loop for an output sequence
+        for i in six.moves.range(olength):
+            if self.num_encs == 1:
+                att_c, att_w = self.att[att_idx](
+                    hs_pad[0], hlens[0], self.dropout_dec[0](z_list[0]), att_w
+                )
+            else:
+                for idx in range(self.num_encs):
+                    att_c_list[idx], att_w_list[idx] = self.att[idx](
+                        hs_pad[idx],
+                        hlens[idx],
+                        self.dropout_dec[0](z_list[0]),
+                        att_w_list[idx],
+                    )
+                hs_pad_han = torch.stack(att_c_list, dim=1)
+                hlens_han = [self.num_encs] * len(ys_in)
+                att_c, att_w_list[self.num_encs] = self.att[self.num_encs](
+                    hs_pad_han,
+                    hlens_han,
+                    self.dropout_dec[0](z_list[0]),
+                    att_w_list[self.num_encs],
+                )
+            if i > 0 and random.random() < self.sampling_probability:
+                logging.info(" scheduled sampling ")
+                z_out = self.output(z_all[-1])
+                z_out = torch.argmax(z_out.detach(), dim=1)
+                z_out = self.dropout_emb(self.embed(to_device(hs_pad[0], z_out)))
+                ey = torch.cat((z_out, att_c), dim=1)  # utt x (zdim + hdim)
+            else:
+                ey = torch.cat((eys[:, i, :], att_c), dim=1)  # utt x (zdim + hdim)
+            z_list, c_list = self.rnn_forward(ey, z_list, c_list, z_list, c_list)
+            if self.context_residual:
+                z_all.append(
+                    torch.cat((self.dropout_dec[-1](z_list[-1]), att_c), dim=-1)
+                )  # utt x (zdim + hdim)
+            else:
+                z_all.append(self.dropout_dec[-1](z_list[-1]))  # utt x (zdim)
+
+        z_all = torch.stack(z_all, dim=1).view(batch * olength, -1)
+        # compute loss
+        y_all = self.output(z_all)
+        self.loss = F.cross_entropy(
+            y_all,
+            ys_out_pad.view(-1),
+            ignore_index=self.ignore_id,
+            reduction="mean",
+        )
+        # compute perplexity
+        ppl = math.exp(self.loss.item())
+        # -1: eos, which is removed in the loss computation
+        self.loss *= np.mean([len(x) for x in ys_in]) - 1
+        acc = th_accuracy(y_all, ys_out_pad, ignore_label=self.ignore_id)
+        logging.info("att loss:" + "".join(str(self.loss.item()).split("\n")))
+
+        # show predicted character sequence for debug
+        if self.verbose > 0 and self.char_list is not None:
+            ys_hat = y_all.view(batch, olength, -1)
+            ys_true = ys_out_pad
+            for (i, y_hat), y_true in zip(
+                    enumerate(ys_hat.detach().cpu().numpy()), ys_true.detach().cpu().numpy()
+            ):
+                if i == MAX_DECODER_OUTPUT:
+                    break
+                idx_hat = np.argmax(y_hat[y_true != self.ignore_id], axis=1)
+                idx_true = y_true[y_true != self.ignore_id]
+                seq_hat = [self.char_list[int(idx)] for idx in idx_hat]
+                seq_true = [self.char_list[int(idx)] for idx in idx_true]
+                seq_hat = "".join(seq_hat)
+                seq_true = "".join(seq_true)
+                logging.info("groundtruth[%d]: " % i + seq_true)
+                logging.info("prediction [%d]: " % i + seq_hat)
+
+        if self.labeldist is not None:
+            if self.vlabeldist is None:
+                self.vlabeldist = to_device(hs_pad[0], torch.from_numpy(self.labeldist))
+            loss_reg = -torch.sum(
+                (F.log_softmax(y_all, dim=1) * self.vlabeldist).view(-1), dim=0
+            ) / len(ys_in)
+            self.loss = (1.0 - self.lsm_weight) * self.loss + self.lsm_weight * loss_reg
+
+        return self.loss, acc, ppl
+
+    def recognize_beam(self, h, lpz, recog_args, char_list, rnnlm=None, strm_idx=0):
+        """beam search implementation
+
+        :param torch.Tensor h: encoder hidden state (T, eprojs)
+                                [in multi-encoder case, list of torch.Tensor,
+                                [(T1, eprojs), (T2, eprojs), ...] ]
+        :param torch.Tensor lpz: ctc log softmax output (T, odim)
+                                [in multi-encoder case, list of torch.Tensor,
+                                [(T1, odim), (T2, odim), ...] ]
+        :param Namespace recog_args: argument Namespace containing options
+        :param list char_list: list of character strings
+        :param torch.nn.Module rnnlm: language model module
+        :param int strm_idx:
+            stream index for speaker parallel attention in multi-speaker case
+        :return: N-best decoding results
+        :rtype: list of dicts
+        """
+        # to support multiple-encoder ASR mode, in single-encoder mode,
+        # convert torch.Tensor to List of torch.Tensor
+        if self.num_encs == 1:
+            h = [h]
+            lpz = [lpz]
+        if self.num_encs > 1 and lpz is None:
+            lpz = [lpz] * self.num_encs
+
+        for idx in range(self.num_encs):
+            logging.info(
+                "Number of Encoder:{}; enc{}: input lengths: {}.".format(
+                    self.num_encs, idx + 1, h[idx].size(0)
+                )
+            )
+        att_idx = min(strm_idx, len(self.att) - 1)
+        # initialization
+        c_list = [self.zero_state(h[0].unsqueeze(0))]
+        z_list = [self.zero_state(h[0].unsqueeze(0))]
+        for _ in six.moves.range(1, self.dlayers):
+            c_list.append(self.zero_state(h[0].unsqueeze(0)))
+            z_list.append(self.zero_state(h[0].unsqueeze(0)))
+        if self.num_encs == 1:
+            a = None
+            self.att[att_idx].reset()  # reset pre-computation of h
+        else:
+            a = [None] * (self.num_encs + 1)  # atts + han
+            att_w_list = [None] * (self.num_encs + 1)  # atts + han
+            att_c_list = [None] * (self.num_encs)  # atts
+            for idx in range(self.num_encs + 1):
+                self.att[idx].reset()  # reset pre-computation of h in atts and han
+
+        # search params
+        beam = recog_args.beam_size
+        penalty = recog_args.penalty
+        ctc_weight = getattr(recog_args, "ctc_weight", 0.0)  # default 0.0 for NMT
+
+        if lpz[0] is not None and self.num_encs > 1:
+            # weights-ctc,
+            # e.g. ctc_loss = w_1*ctc_1_loss + w_2 * ctc_2_loss + w_N * ctc_N_loss
+            weights_ctc_dec = recog_args.weights_ctc_dec / np.sum(
+                recog_args.weights_ctc_dec
+            )  # normalize
+            logging.info(
+                "ctc weights (decoding): " + " ".join([str(x) for x in weights_ctc_dec])
+            )
+        else:
+            weights_ctc_dec = [1.0]
+
+        # prepare sos
+        if self.replace_sos and recog_args.tgt_lang:
+            y = char_list.index(recog_args.tgt_lang)
+        else:
+            y = self.sos
+        logging.info("<sos> index: " + str(y))
+        logging.info("<sos> mark: " + char_list[y])
+        vy = h[0].new_zeros(1).long()
+
+        maxlen = np.amin([h[idx].size(0) for idx in range(self.num_encs)])
+        if recog_args.maxlenratio != 0:
+            # maxlen >= 1
+            maxlen = max(1, int(recog_args.maxlenratio * maxlen))
+        minlen = int(recog_args.minlenratio * maxlen)
+        logging.info("max output length: " + str(maxlen))
+        logging.info("min output length: " + str(minlen))
+
+        # initialize hypothesis
+        if rnnlm:
+            hyp = {
+                "score": 0.0,
+                "yseq": [y],
+                "c_prev": c_list,
+                "z_prev": z_list,
+                "a_prev": a,
+                "rnnlm_prev": None,
+            }
+        else:
+            hyp = {
+                "score": 0.0,
+                "yseq": [y],
+                "c_prev": c_list,
+                "z_prev": z_list,
+                "a_prev": a,
+            }
+        if lpz[0] is not None:
+            ctc_prefix_score = [
+                CTCPrefixScore(lpz[idx].detach().cpu().numpy(), 0, self.eos, np)
+                for idx in range(self.num_encs)
+            ]
+            hyp["ctc_state_prev"] = [
+                ctc_prefix_score[idx].initial_state() for idx in range(self.num_encs)
+            ]
+            hyp["ctc_score_prev"] = [0.0] * self.num_encs
+            if ctc_weight != 1.0:
+                # pre-pruning based on attention scores
+                ctc_beam = min(lpz[0].shape[-1], int(beam * CTC_SCORING_RATIO))
+            else:
+                ctc_beam = lpz[0].shape[-1]
+        hyps = [hyp]
+        ended_hyps = []
+
+        for i in six.moves.range(maxlen):
+            logging.debug("position " + str(i))
+
+            hyps_best_kept = []
+            for hyp in hyps:
+                vy[0] = hyp["yseq"][i]
+                ey = self.dropout_emb(self.embed(vy))  # utt list (1) x zdim
+                if self.num_encs == 1:
+                    att_c, att_w = self.att[att_idx](
+                        h[0].unsqueeze(0),
+                        [h[0].size(0)],
+                        self.dropout_dec[0](hyp["z_prev"][0]),
+                        hyp["a_prev"],
+                    )
+                else:
+                    for idx in range(self.num_encs):
+                        att_c_list[idx], att_w_list[idx] = self.att[idx](
+                            h[idx].unsqueeze(0),
+                            [h[idx].size(0)],
+                            self.dropout_dec[0](hyp["z_prev"][0]),
+                            hyp["a_prev"][idx],
+                        )
+                    h_han = torch.stack(att_c_list, dim=1)
+                    att_c, att_w_list[self.num_encs] = self.att[self.num_encs](
+                        h_han,
+                        [self.num_encs],
+                        self.dropout_dec[0](hyp["z_prev"][0]),
+                        hyp["a_prev"][self.num_encs],
+                    )
+                ey = torch.cat((ey, att_c), dim=1)  # utt(1) x (zdim + hdim)
+                z_list, c_list = self.rnn_forward(
+                    ey, z_list, c_list, hyp["z_prev"], hyp["c_prev"]
+                )
+
+                # get nbest local scores and their ids
+                if self.context_residual:
+                    logits = self.output(
+                        torch.cat((self.dropout_dec[-1](z_list[-1]), att_c), dim=-1)
+                    )
+                else:
+                    logits = self.output(self.dropout_dec[-1](z_list[-1]))
+                local_att_scores = F.log_softmax(logits, dim=1)
+                if rnnlm:
+                    rnnlm_state, local_lm_scores = rnnlm.predict(hyp["rnnlm_prev"], vy)
+                    local_scores = (
+                        local_att_scores + recog_args.lm_weight * local_lm_scores
+                    )
+                else:
+                    local_scores = local_att_scores
+
+                if lpz[0] is not None:
+                    local_best_scores, local_best_ids = torch.topk(
+                        local_att_scores, ctc_beam, dim=1
+                    )
+                    ctc_scores, ctc_states = (
+                        [None] * self.num_encs,
+                        [None] * self.num_encs,
+                    )
+                    for idx in range(self.num_encs):
+                        ctc_scores[idx], ctc_states[idx] = ctc_prefix_score[idx](
+                            hyp["yseq"], local_best_ids[0], hyp["ctc_state_prev"][idx]
+                        )
+                    local_scores = (1.0 - ctc_weight) * local_att_scores[
+                        :, local_best_ids[0]
+                    ]
+                    if self.num_encs == 1:
+                        local_scores += ctc_weight * torch.from_numpy(
+                            ctc_scores[0] - hyp["ctc_score_prev"][0]
+                        )
+                    else:
+                        for idx in range(self.num_encs):
+                            local_scores += (
+                                ctc_weight
+                                * weights_ctc_dec[idx]
+                                * torch.from_numpy(
+                                    ctc_scores[idx] - hyp["ctc_score_prev"][idx]
+                                )
+                            )
+                    if rnnlm:
+                        local_scores += (
+                            recog_args.lm_weight * local_lm_scores[:, local_best_ids[0]]
+                        )
+                    local_best_scores, joint_best_ids = torch.topk(
+                        local_scores, beam, dim=1
+                    )
+                    local_best_ids = local_best_ids[:, joint_best_ids[0]]
+                else:
+                    local_best_scores, local_best_ids = torch.topk(
+                        local_scores, beam, dim=1
+                    )
+
+                for j in six.moves.range(beam):
+                    new_hyp = {}
+                    # [:] is needed!
+                    new_hyp["z_prev"] = z_list[:]
+                    new_hyp["c_prev"] = c_list[:]
+                    if self.num_encs == 1:
+                        new_hyp["a_prev"] = att_w[:]
+                    else:
+                        new_hyp["a_prev"] = [
+                            att_w_list[idx][:] for idx in range(self.num_encs + 1)
+                        ]
+                    new_hyp["score"] = hyp["score"] + local_best_scores[0, j]
+                    new_hyp["yseq"] = [0] * (1 + len(hyp["yseq"]))
+                    new_hyp["yseq"][: len(hyp["yseq"])] = hyp["yseq"]
+                    new_hyp["yseq"][len(hyp["yseq"])] = int(local_best_ids[0, j])
+                    if rnnlm:
+                        new_hyp["rnnlm_prev"] = rnnlm_state
+                    if lpz[0] is not None:
+                        new_hyp["ctc_state_prev"] = [
+                            ctc_states[idx][joint_best_ids[0, j]]
+                            for idx in range(self.num_encs)
+                        ]
+                        new_hyp["ctc_score_prev"] = [
+                            ctc_scores[idx][joint_best_ids[0, j]]
+                            for idx in range(self.num_encs)
+                        ]
+                    # will be (2 x beam) hyps at most
+                    hyps_best_kept.append(new_hyp)
+
+                hyps_best_kept = sorted(
+                    hyps_best_kept, key=lambda x: x["score"], reverse=True
+                )[:beam]
+
+            # sort and get nbest
+            hyps = hyps_best_kept
+            logging.debug("number of pruned hypotheses: " + str(len(hyps)))
+            logging.debug(
+                "best hypo: "
+                + "".join([char_list[int(x)] for x in hyps[0]["yseq"][1:]])
+            )
+
+            # append eos in the final iteration so that at least one hypothesis ends
+            if i == maxlen - 1:
+                logging.info("adding <eos> in the last position in the loop")
+                for hyp in hyps:
+                    hyp["yseq"].append(self.eos)
+
+            # add ended hypotheses to a final list,
+            # and remove them from the current hypotheses
+            # (this can leave fewer than beam hypotheses, which may be a problem)
+            remained_hyps = []
+            for hyp in hyps:
+                if hyp["yseq"][-1] == self.eos:
+                    # only store the sequence that has more than minlen outputs
+                    # also add penalty
+                    if len(hyp["yseq"]) > minlen:
+                        hyp["score"] += (i + 1) * penalty
+                        if rnnlm:  # Word LM needs to add final <eos> score
+                            hyp["score"] += recog_args.lm_weight * rnnlm.final(
+                                hyp["rnnlm_prev"]
+                            )
+                        ended_hyps.append(hyp)
+                else:
+                    remained_hyps.append(hyp)
+
+            # end detection
+            if end_detect(ended_hyps, i) and recog_args.maxlenratio == 0.0:
+                logging.info("end detected at %d", i)
+                break
+
+            hyps = remained_hyps
+            if len(hyps) > 0:
+                logging.debug("remaining hypotheses: " + str(len(hyps)))
+            else:
+                logging.info("no hypothesis. Finish decoding.")
+                break
+
+            for hyp in hyps:
+                logging.debug(
+                    "hypo: " + "".join([char_list[int(x)] for x in hyp["yseq"][1:]])
+                )
+
+            logging.debug("number of ended hypotheses: " + str(len(ended_hyps)))
+
+        nbest_hyps = sorted(ended_hyps, key=lambda x: x["score"], reverse=True)[
+            : min(len(ended_hyps), recog_args.nbest)
+        ]
+
+        # check number of hypotheses
+        if len(nbest_hyps) == 0:
+            logging.warning(
+                "there are no N-best results, "
+                "perform recognition again with smaller minlenratio."
+            )
+            # should copy because Namespace will be overwritten globally
+            recog_args = Namespace(**vars(recog_args))
+            recog_args.minlenratio = max(0.0, recog_args.minlenratio - 0.1)
+            if self.num_encs == 1:
+                h, lpz = h[0], lpz[0]  # undo the single-encoder list wrapping
+            # keep strm_idx so the retry uses the same attention module
+            return self.recognize_beam(h, lpz, recog_args, char_list, rnnlm, strm_idx)
+
+        logging.info("total log probability: " + str(nbest_hyps[0]["score"]))
+        logging.info(
+            "normalized log probability: "
+            + str(nbest_hyps[0]["score"] / len(nbest_hyps[0]["yseq"]))
+        )
+
+        # NOTE: yseq in each returned hypothesis still starts with the sos symbol
+        return nbest_hyps
+
+    def recognize_beam_batch(
+            self,
+            h,
+            hlens,
+            lpz,
+            recog_args,
+            char_list,
+            rnnlm=None,
+            normalize_score=True,
+            strm_idx=0,
+            lang_ids=None,
+    ):
+        # to support multiple-encoder ASR mode, in single-encoder mode,
+        # convert torch.Tensor to List of torch.Tensor
+        if self.num_encs == 1:
+            h = [h]
+            hlens = [hlens]
+            lpz = [lpz]
+        if self.num_encs > 1 and lpz is None:
+            lpz = [lpz] * self.num_encs
+
+        att_idx = min(strm_idx, len(self.att) - 1)
+        for idx in range(self.num_encs):
+            logging.info(
+                "Number of Encoder:{}; enc{}: input lengths: {}.".format(
+                    self.num_encs, idx + 1, h[idx].size(1)
+                )
+            )
+            h[idx] = mask_by_length(h[idx], hlens[idx], 0.0)
+
+        # search params
+        batch = len(hlens[0])
+        beam = recog_args.beam_size
+        penalty = recog_args.penalty
+        ctc_weight = getattr(recog_args, "ctc_weight", 0)  # for NMT
+        att_weight = 1.0 - ctc_weight
+        ctc_margin = getattr(
+            recog_args, "ctc_window_margin", 0
+        )  # use getattr to keep compatibility
+        # weights-ctc,
+        # e.g. ctc_loss = w_1*ctc_1_loss + w_2 * ctc_2_loss + w_N * ctc_N_loss
+        if lpz[0] is not None and self.num_encs > 1:
+            weights_ctc_dec = recog_args.weights_ctc_dec / np.sum(
+                recog_args.weights_ctc_dec
+            )  # normalize
+            logging.info(
+                "ctc weights (decoding): " + " ".join([str(x) for x in weights_ctc_dec])
+            )
+        else:
+            weights_ctc_dec = [1.0]
+
+        n_bb = batch * beam
+        pad_b = to_device(h[0], torch.arange(batch) * beam).view(-1, 1)
+
+        max_hlen = np.amin([max(hlens[idx]) for idx in range(self.num_encs)])
+        if recog_args.maxlenratio == 0:
+            maxlen = max_hlen
+        else:
+            maxlen = max(1, int(recog_args.maxlenratio * max_hlen))
+        minlen = int(recog_args.minlenratio * max_hlen)
+        logging.info("max output length: " + str(maxlen))
+        logging.info("min output length: " + str(minlen))
+
+        # initialization
+        c_prev = [
+            to_device(h[0], torch.zeros(n_bb, self.dunits)) for _ in range(self.dlayers)
+        ]
+        z_prev = [
+            to_device(h[0], torch.zeros(n_bb, self.dunits)) for _ in range(self.dlayers)
+        ]
+        c_list = [
+            to_device(h[0], torch.zeros(n_bb, self.dunits)) for _ in range(self.dlayers)
+        ]
+        z_list = [
+            to_device(h[0], torch.zeros(n_bb, self.dunits)) for _ in range(self.dlayers)
+        ]
+        vscores = to_device(h[0], torch.zeros(batch, beam))
+
+        rnnlm_state = None
+        if self.num_encs == 1:
+            a_prev = [None]
+            att_w_list, ctc_scorer, ctc_state = [None], [None], [None]
+            self.att[att_idx].reset()  # reset pre-computation of h
+        else:
+            a_prev = [None] * (self.num_encs + 1)  # atts + han
+            att_w_list = [None] * (self.num_encs + 1)  # atts + han
+            att_c_list = [None] * (self.num_encs)  # atts
+            ctc_scorer, ctc_state = [None] * (self.num_encs), [None] * (self.num_encs)
+            for idx in range(self.num_encs + 1):
+                self.att[idx].reset()  # reset pre-computation of h in atts and han
+
+        if self.replace_sos and recog_args.tgt_lang:
+            logging.info("<sos> index: " + str(char_list.index(recog_args.tgt_lang)))
+            logging.info("<sos> mark: " + recog_args.tgt_lang)
+            yseq = [
+                [char_list.index(recog_args.tgt_lang)] for _ in six.moves.range(n_bb)
+            ]
+        elif lang_ids is not None:
+            # NOTE: used for evaluation during training
+            yseq = [
+                [lang_ids[b // recog_args.beam_size]] for b in six.moves.range(n_bb)
+            ]
+        else:
+            logging.info("<sos> index: " + str(self.sos))
+            logging.info("<sos> mark: " + char_list[self.sos])
+            yseq = [[self.sos] for _ in six.moves.range(n_bb)]
+
+        accum_odim_ids = [self.sos for _ in six.moves.range(n_bb)]
+        stop_search = [False for _ in six.moves.range(batch)]
+        nbest_hyps = [[] for _ in six.moves.range(batch)]
+        ended_hyps = [[] for _ in range(batch)]
+
+        exp_hlens = [
+            hlens[idx].repeat(beam).view(beam, batch).transpose(0, 1).contiguous()
+            for idx in range(self.num_encs)
+        ]
+        exp_hlens = [exp_hlens[idx].view(-1).tolist() for idx in range(self.num_encs)]
+        exp_h = [
+            h[idx].unsqueeze(1).repeat(1, beam, 1, 1).contiguous()
+            for idx in range(self.num_encs)
+        ]
+        exp_h = [
+            exp_h[idx].view(n_bb, h[idx].size()[1], h[idx].size()[2])
+            for idx in range(self.num_encs)
+        ]
+
+        if lpz[0] is not None:
+            scoring_num = min(
+                int(beam * CTC_SCORING_RATIO)
+                if att_weight > 0.0 and not lpz[0].is_cuda
+                else 0,
+                lpz[0].size(-1),
+            )
+            ctc_scorer = [
+                CTCPrefixScoreTH(
+                    lpz[idx],
+                    hlens[idx],
+                    0,
+                    self.eos,
+                    margin=ctc_margin,
+                )
+                for idx in range(self.num_encs)
+            ]
+
+        for i in six.moves.range(maxlen):
+            logging.debug("position " + str(i))
+
+            vy = to_device(h[0], torch.LongTensor(self._get_last_yseq(yseq)))
+            ey = self.dropout_emb(self.embed(vy))
+            if self.num_encs == 1:
+                att_c, att_w = self.att[att_idx](
+                    exp_h[0], exp_hlens[0], self.dropout_dec[0](z_prev[0]), a_prev[0]
+                )
+                att_w_list = [att_w]
+            else:
+                for idx in range(self.num_encs):
+                    att_c_list[idx], att_w_list[idx] = self.att[idx](
+                        exp_h[idx],
+                        exp_hlens[idx],
+                        self.dropout_dec[0](z_prev[0]),
+                        a_prev[idx],
+                    )
+                exp_h_han = torch.stack(att_c_list, dim=1)
+                att_c, att_w_list[self.num_encs] = self.att[self.num_encs](
+                    exp_h_han,
+                    [self.num_encs] * n_bb,
+                    self.dropout_dec[0](z_prev[0]),
+                    a_prev[self.num_encs],
+                )
+            ey = torch.cat((ey, att_c), dim=1)
+
+            # attention decoder
+            z_list, c_list = self.rnn_forward(ey, z_list, c_list, z_prev, c_prev)
+            if self.context_residual:
+                logits = self.output(
+                    torch.cat((self.dropout_dec[-1](z_list[-1]), att_c), dim=-1)
+                )
+            else:
+                logits = self.output(self.dropout_dec[-1](z_list[-1]))
+            local_scores = att_weight * F.log_softmax(logits, dim=1)
+
+            # rnnlm
+            if rnnlm:
+                rnnlm_state, local_lm_scores = rnnlm.buff_predict(rnnlm_state, vy, n_bb)
+                local_scores = local_scores + recog_args.lm_weight * local_lm_scores
+
+            # ctc
+            if ctc_scorer[0]:
+                local_scores[:, 0] = self.logzero  # avoid choosing blank
+                part_ids = (
+                    torch.topk(local_scores, scoring_num, dim=-1)[1]
+                    if scoring_num > 0
+                    else None
+                )
+                for idx in range(self.num_encs):
+                    att_w = att_w_list[idx]
+                    att_w_ = att_w if isinstance(att_w, torch.Tensor) else att_w[0]
+                    local_ctc_scores, ctc_state[idx] = ctc_scorer[idx](
+                        yseq, ctc_state[idx], part_ids, att_w_
+                    )
+                    local_scores = (
+                            local_scores
+                            + ctc_weight * weights_ctc_dec[idx] * local_ctc_scores
+                    )
+
+            local_scores = local_scores.view(batch, beam, self.odim)
+            if i == 0:
+                local_scores[:, 1:, :] = self.logzero
+
+            # accumulate scores
+            eos_vscores = local_scores[:, :, self.eos] + vscores
+            vscores = vscores.view(batch, beam, 1).repeat(1, 1, self.odim)
+            vscores[:, :, self.eos] = self.logzero
+            vscores = (vscores + local_scores).view(batch, -1)
+
+            # global pruning
+            accum_best_scores, accum_best_ids = torch.topk(vscores, beam, 1)
+            accum_odim_ids = (
+                torch.fmod(accum_best_ids, self.odim).view(-1).data.cpu().tolist()
+            )
+            accum_padded_beam_ids = (
+                (accum_best_ids // self.odim + pad_b).view(-1).data.cpu().tolist()
+            )
+
+            y_prev = yseq[:][:]
+            yseq = self._index_select_list(yseq, accum_padded_beam_ids)
+            yseq = self._append_ids(yseq, accum_odim_ids)
+            vscores = accum_best_scores
+            vidx = to_device(h[0], torch.LongTensor(accum_padded_beam_ids))
+
+            a_prev = []
+            num_atts = self.num_encs if self.num_encs == 1 else self.num_encs + 1
+            for idx in range(num_atts):
+                if isinstance(att_w_list[idx], torch.Tensor):
+                    _a_prev = torch.index_select(
+                        att_w_list[idx].view(n_bb, *att_w_list[idx].shape[1:]), 0, vidx
+                    )
+                elif isinstance(att_w_list[idx], list):
+                    # handle the case of multi-head attention
+                    _a_prev = [
+                        torch.index_select(att_w_one.view(n_bb, -1), 0, vidx)
+                        for att_w_one in att_w_list[idx]
+                    ]
+                else:
+                    # handle the case of location_recurrent when return is a tuple
+                    _a_prev_ = torch.index_select(
+                        att_w_list[idx][0].view(n_bb, -1), 0, vidx
+                    )
+                    _h_prev_ = torch.index_select(
+                        att_w_list[idx][1][0].view(n_bb, -1), 0, vidx
+                    )
+                    _c_prev_ = torch.index_select(
+                        att_w_list[idx][1][1].view(n_bb, -1), 0, vidx
+                    )
+                    _a_prev = (_a_prev_, (_h_prev_, _c_prev_))
+                a_prev.append(_a_prev)
+            z_prev = [
+                torch.index_select(z_list[li].view(n_bb, -1), 0, vidx)
+                for li in range(self.dlayers)
+            ]
+            c_prev = [
+                torch.index_select(c_list[li].view(n_bb, -1), 0, vidx)
+                for li in range(self.dlayers)
+            ]
+
+            # pick ended hyps
+            if i >= minlen:
+                k = 0
+                penalty_i = (i + 1) * penalty
+                thr = accum_best_scores[:, -1]
+                for samp_i in six.moves.range(batch):
+                    if stop_search[samp_i]:
+                        k = k + beam
+                        continue
+                    for beam_j in six.moves.range(beam):
+                        _vscore = None
+                        if eos_vscores[samp_i, beam_j] > thr[samp_i]:
+                            yk = y_prev[k][:]
+                            if len(yk) <= min(
+                                    hlens[idx][samp_i] for idx in range(self.num_encs)
+                            ):
+                                _vscore = eos_vscores[samp_i][beam_j] + penalty_i
+                        elif i == maxlen - 1:
+                            yk = yseq[k][:]
+                            _vscore = vscores[samp_i][beam_j] + penalty_i
+                        if _vscore is not None:
+                            yk.append(self.eos)
+                            if rnnlm:
+                                _vscore += recog_args.lm_weight * rnnlm.final(
+                                    rnnlm_state, index=k
+                                )
+                            _score = _vscore.data.cpu().numpy()
+                            ended_hyps[samp_i].append(
+                                {"yseq": yk, "vscore": _vscore, "score": _score}
+                            )
+                        k = k + 1
+
+            # end detection
+            stop_search = [
+                stop_search[samp_i] or end_detect(ended_hyps[samp_i], i)
+                for samp_i in six.moves.range(batch)
+            ]
+            stop_search_summary = list(set(stop_search))
+            if len(stop_search_summary) == 1 and stop_search_summary[0]:
+                break
+
+            if rnnlm:
+                rnnlm_state = self._index_select_lm_state(rnnlm_state, 0, vidx)
+            if ctc_scorer[0]:
+                for idx in range(self.num_encs):
+                    ctc_state[idx] = ctc_scorer[idx].index_select_state(
+                        ctc_state[idx], accum_best_ids
+                    )
+
+        torch.cuda.empty_cache()
+
+        dummy_hyps = [
+            {"yseq": [self.sos, self.eos], "score": np.array([-float("inf")])}
+        ]
+        ended_hyps = [
+            ended_hyps[samp_i] if len(ended_hyps[samp_i]) != 0 else dummy_hyps
+            for samp_i in six.moves.range(batch)
+        ]
+        if normalize_score:
+            for samp_i in six.moves.range(batch):
+                for x in ended_hyps[samp_i]:
+                    x["score"] /= len(x["yseq"])
+
+        nbest_hyps = [
+            sorted(ended_hyps[samp_i], key=lambda x: x["score"], reverse=True)[
+            : min(len(ended_hyps[samp_i]), recog_args.nbest)
+            ]
+            for samp_i in six.moves.range(batch)
+        ]
+
+        return nbest_hyps
+
+    def calculate_all_attentions(self, hs_pad, hlen, ys_pad, strm_idx=0, lang_ids=None):
+        """Calculate all attention weights
+
+        :param torch.Tensor hs_pad: batch of padded hidden state sequences
+                                    (B, Tmax, D)
+                                    in multi-encoder case, list of torch.Tensor,
+                                    [(B, Tmax_1, D), (B, Tmax_2, D), ...]
+        :param torch.Tensor hlen: batch of lengths of hidden state sequences (B)
+                                    in multi-encoder case, list of torch.Tensor,
+                                    [(B), (B), ...]
+        :param torch.Tensor ys_pad:
+            batch of padded character id sequence tensor (B, Lmax)
+        :param int strm_idx:
+            stream index for parallel speaker attention in multi-speaker case
+        :param torch.Tensor lang_ids: batch of target language id tensor (B, 1)
+        :return: attention weights with the following shape,
+            1) multi-head case => attention weights (B, H, Lmax, Tmax),
+            2) multi-encoder case =>
+                [(B, Lmax, Tmax1), (B, Lmax, Tmax2), ..., (B, Lmax, NumEncs)]
+            3) other case => attention weights (B, Lmax, Tmax).
+        :rtype: float ndarray
+        """
+        # to support multiple encoder asr mode, in single encoder mode,
+        # convert torch.Tensor to List of torch.Tensor
+        if self.num_encs == 1:
+            hs_pad = [hs_pad]
+            hlen = [hlen]
+
+        # TODO(kan-bayashi): find a smarter way to do this
+        ys = [y[y != self.ignore_id] for y in ys_pad]  # parse padded ys
+        att_idx = min(strm_idx, len(self.att) - 1)
+
+        # hlen should be list of list of integer
+        hlen = [list(map(int, hlen[idx])) for idx in range(self.num_encs)]
+
+        self.loss = None
+        # prepare input and output word sequences with sos/eos IDs
+        eos = ys[0].new([self.eos])
+        sos = ys[0].new([self.sos])
+        if self.replace_sos:
+            ys_in = [torch.cat([idx, y], dim=0) for idx, y in zip(lang_ids, ys)]
+        else:
+            ys_in = [torch.cat([sos, y], dim=0) for y in ys]
+        ys_out = [torch.cat([y, eos], dim=0) for y in ys]
+
+        # pad ys_in with eos and ys_out with ignore_id
+        # ys_in_pad, ys_out_pad: utt x olen
+        ys_in_pad = pad_list(ys_in, self.eos)
+        ys_out_pad = pad_list(ys_out, self.ignore_id)
+
+        # get length info
+        olength = ys_out_pad.size(1)
+
+        # initialization
+        c_list = [self.zero_state(hs_pad[0])]
+        z_list = [self.zero_state(hs_pad[0])]
+        for _ in six.moves.range(1, self.dlayers):
+            c_list.append(self.zero_state(hs_pad[0]))
+            z_list.append(self.zero_state(hs_pad[0]))
+        att_ws = []
+        if self.num_encs == 1:
+            att_w = None
+            self.att[att_idx].reset()  # reset pre-computation of h
+        else:
+            att_w_list = [None] * (self.num_encs + 1)  # atts + han
+            att_c_list = [None] * (self.num_encs)  # atts
+            for idx in range(self.num_encs + 1):
+                self.att[idx].reset()  # reset pre-computation of h in atts and han
+
+        # pre-computation of embedding
+        eys = self.dropout_emb(self.embed(ys_in_pad))  # utt x olen x zdim
+
+        # loop for an output sequence
+        for i in six.moves.range(olength):
+            if self.num_encs == 1:
+                att_c, att_w = self.att[att_idx](
+                    hs_pad[0], hlen[0], self.dropout_dec[0](z_list[0]), att_w
+                )
+                att_ws.append(att_w)
+            else:
+                for idx in range(self.num_encs):
+                    att_c_list[idx], att_w_list[idx] = self.att[idx](
+                        hs_pad[idx],
+                        hlen[idx],
+                        self.dropout_dec[0](z_list[0]),
+                        att_w_list[idx],
+                    )
+                hs_pad_han = torch.stack(att_c_list, dim=1)
+                hlen_han = [self.num_encs] * len(ys_in)
+                att_c, att_w_list[self.num_encs] = self.att[self.num_encs](
+                    hs_pad_han,
+                    hlen_han,
+                    self.dropout_dec[0](z_list[0]),
+                    att_w_list[self.num_encs],
+                )
+                att_ws.append(att_w_list.copy())
+            ey = torch.cat((eys[:, i, :], att_c), dim=1)  # utt x (zdim + hdim)
+            z_list, c_list = self.rnn_forward(ey, z_list, c_list, z_list, c_list)
+
+        if self.num_encs == 1:
+            # convert to numpy array with the shape (B, Lmax, Tmax)
+            att_ws = att_to_numpy(att_ws, self.att[att_idx])
+        else:
+            _att_ws = []
+            for idx, ws in enumerate(zip(*att_ws)):
+                ws = att_to_numpy(ws, self.att[idx])
+                _att_ws.append(ws)
+            att_ws = _att_ws
+        return att_ws
+
+    @staticmethod
+    def _get_last_yseq(exp_yseq):
+        last = []
+        for y_seq in exp_yseq:
+            last.append(y_seq[-1])
+        return last
+
+    @staticmethod
+    def _append_ids(yseq, ids):
+        if isinstance(ids, list):
+            for i, j in enumerate(ids):
+                yseq[i].append(j)
+        else:
+            for i in range(len(yseq)):
+                yseq[i].append(ids)
+        return yseq
+
+    @staticmethod
+    def _index_select_list(yseq, lst):
+        new_yseq = []
+        for i in lst:
+            new_yseq.append(yseq[i][:])
+        return new_yseq
+
+    @staticmethod
+    def _index_select_lm_state(rnnlm_state, dim, vidx):
+        if isinstance(rnnlm_state, dict):
+            new_state = {}
+            for k, v in rnnlm_state.items():
+                new_state[k] = [torch.index_select(vi, dim, vidx) for vi in v]
+        elif isinstance(rnnlm_state, list):
+            new_state = []
+            for i in vidx:
+                new_state.append(rnnlm_state[int(i)][:])
+        return new_state
+
+    # scorer interface methods
+    def init_state(self, x):
+        # to support multiple encoder asr mode, in single encoder mode,
+        # convert torch.Tensor to List of torch.Tensor
+        if self.num_encs == 1:
+            x = [x]
+
+        c_list = [self.zero_state(x[0].unsqueeze(0))]
+        z_list = [self.zero_state(x[0].unsqueeze(0))]
+        for _ in six.moves.range(1, self.dlayers):
+            c_list.append(self.zero_state(x[0].unsqueeze(0)))
+            z_list.append(self.zero_state(x[0].unsqueeze(0)))
+        # TODO(karita): support strm_index for `asr_mix`
+        strm_index = 0
+        att_idx = min(strm_index, len(self.att) - 1)
+        if self.num_encs == 1:
+            a = None
+            self.att[att_idx].reset()  # reset pre-computation of h
+        else:
+            a = [None] * (self.num_encs + 1)  # atts + han
+            for idx in range(self.num_encs + 1):
+                self.att[idx].reset()  # reset pre-computation of h in atts and han
+        return dict(
+            c_prev=c_list[:],
+            z_prev=z_list[:],
+            a_prev=a,
+            workspace=(att_idx, z_list, c_list),
+        )
+
+    def score(self, yseq, state, x):
+        # to support multiple encoder asr mode, in single encoder mode,
+        # convert torch.Tensor to List of torch.Tensor
+        if self.num_encs == 1:
+            x = [x]
+
+        att_idx, z_list, c_list = state["workspace"]
+        vy = yseq[-1].unsqueeze(0)
+        ey = self.dropout_emb(self.embed(vy))  # utt list (1) x zdim
+        if self.num_encs == 1:
+            att_c, att_w = self.att[att_idx](
+                x[0].unsqueeze(0),
+                [x[0].size(0)],
+                self.dropout_dec[0](state["z_prev"][0]),
+                state["a_prev"],
+            )
+        else:
+            att_w = [None] * (self.num_encs + 1)  # atts + han
+            att_c_list = [None] * (self.num_encs)  # atts
+            for idx in range(self.num_encs):
+                att_c_list[idx], att_w[idx] = self.att[idx](
+                    x[idx].unsqueeze(0),
+                    [x[idx].size(0)],
+                    self.dropout_dec[0](state["z_prev"][0]),
+                    state["a_prev"][idx],
+                )
+            h_han = torch.stack(att_c_list, dim=1)
+            att_c, att_w[self.num_encs] = self.att[self.num_encs](
+                h_han,
+                [self.num_encs],
+                self.dropout_dec[0](state["z_prev"][0]),
+                state["a_prev"][self.num_encs],
+            )
+        ey = torch.cat((ey, att_c), dim=1)  # utt(1) x (zdim + hdim)
+        z_list, c_list = self.rnn_forward(
+            ey, z_list, c_list, state["z_prev"], state["c_prev"]
+        )
+        if self.context_residual:
+            logits = self.output(
+                torch.cat((self.dropout_dec[-1](z_list[-1]), att_c), dim=-1)
+            )
+        else:
+            logits = self.output(self.dropout_dec[-1](z_list[-1]))
+        logp = F.log_softmax(logits, dim=1).squeeze(0)
+        return (
+            logp,
+            dict(
+                c_prev=c_list[:],
+                z_prev=z_list[:],
+                a_prev=att_w,
+                workspace=(att_idx, z_list, c_list),
+            ),
+        )
+
+
+def decoder_for(args, odim, sos, eos, att, labeldist):
+    return Decoder(
+        args.eprojs,
+        odim,
+        args.dtype,
+        args.dlayers,
+        args.dunits,
+        sos,
+        eos,
+        att,
+        args.verbose,
+        args.char_list,
+        labeldist,
+        args.lsm_weight,
+        args.sampling_probability,
+        args.dropout_rate_decoder,
+        getattr(args, "context_residual", False),  # use getattr to keep compatibility
+        getattr(args, "replace_sos", False),  # use getattr to keep compatibility
+        getattr(args, "num_encs", 1),
+    )  # use getattr to keep compatibility
diff --git a/funasr/modules/rnn/encoders.py b/funasr/modules/rnn/encoders.py
new file mode 100644
index 000000000..a7320c282
--- /dev/null
+++ b/funasr/modules/rnn/encoders.py
@@ -0,0 +1,372 @@
+import logging
+
+import numpy as np
+import six
+import torch
+import torch.nn.functional as F
+from torch.nn.utils.rnn import pack_padded_sequence
+from torch.nn.utils.rnn import pad_packed_sequence
+
+from funasr.modules.e2e_asr_common import get_vgg2l_odim
+from funasr.modules.nets_utils import make_pad_mask
+from funasr.modules.nets_utils import to_device
+
+
+class RNNP(torch.nn.Module):
+    """RNN with projection layer module
+
+    :param int idim: dimension of inputs
+    :param int elayers: number of encoder layers
+    :param int cdim: number of rnn units (results in cdim * 2 if bidirectional)
+    :param int hdim: number of projection units
+    :param np.ndarray subsample: list of subsampling numbers
+    :param float dropout: dropout rate
+    :param str typ: The RNN type
+    """
+
+    def __init__(self, idim, elayers, cdim, hdim, subsample, dropout, typ="blstm"):
+        super(RNNP, self).__init__()
+        bidir = typ[0] == "b"
+        for i in six.moves.range(elayers):
+            if i == 0:
+                inputdim = idim
+            else:
+                inputdim = hdim
+
+            RNN = torch.nn.LSTM if "lstm" in typ else torch.nn.GRU
+            rnn = RNN(
+                inputdim, cdim, num_layers=1, bidirectional=bidir, batch_first=True
+            )
+
+            setattr(self, "%s%d" % ("birnn" if bidir else "rnn", i), rnn)
+
+            # bottleneck layer to merge
+            if bidir:
+                setattr(self, "bt%d" % i, torch.nn.Linear(2 * cdim, hdim))
+            else:
+                setattr(self, "bt%d" % i, torch.nn.Linear(cdim, hdim))
+
+        self.elayers = elayers
+        self.cdim = cdim
+        self.subsample = subsample
+        self.typ = typ
+        self.bidir = bidir
+        self.dropout = dropout
+
+    def forward(self, xs_pad, ilens, prev_state=None):
+        """RNNP forward
+
+        :param torch.Tensor xs_pad: batch of padded input sequences (B, Tmax, idim)
+        :param torch.Tensor ilens: batch of lengths of input sequences (B)
+        :param torch.Tensor prev_state: batch of previous RNN states
+        :return: batch of hidden state sequences (B, Tmax, hdim)
+        :rtype: torch.Tensor
+        """
+        logging.debug(self.__class__.__name__ + " input lengths: " + str(ilens))
+        elayer_states = []
+        for layer in six.moves.range(self.elayers):
+            if not isinstance(ilens, torch.Tensor):
+                ilens = torch.tensor(ilens)
+            xs_pack = pack_padded_sequence(xs_pad, ilens.cpu(), batch_first=True)
+            rnn = getattr(self, ("birnn" if self.bidir else "rnn") + str(layer))
+            rnn.flatten_parameters()
+            if prev_state is not None and rnn.bidirectional:
+                prev_state = reset_backward_rnn_state(prev_state)
+            ys, states = rnn(
+                xs_pack, hx=None if prev_state is None else prev_state[layer]
+            )
+            elayer_states.append(states)
+            # ys: utt list of frame x cdim x 2 (2: means bidirectional)
+            ys_pad, ilens = pad_packed_sequence(ys, batch_first=True)
+            sub = self.subsample[layer + 1]
+            if sub > 1:
+                ys_pad = ys_pad[:, ::sub]
+                ilens = torch.tensor([(int(i) + sub - 1) // sub for i in ilens])
+            # (sum _utt frame_utt) x dim
+            projection_layer = getattr(self, "bt%d" % layer)
+            projected = projection_layer(ys_pad.contiguous().view(-1, ys_pad.size(2)))
+            xs_pad = projected.view(ys_pad.size(0), ys_pad.size(1), -1)
+            if layer < self.elayers - 1:
+                xs_pad = torch.tanh(F.dropout(xs_pad, p=self.dropout, training=self.training))
+
+        return xs_pad, ilens, elayer_states  # x: utt list of frame x dim
+
+
+class RNN(torch.nn.Module):
+    """RNN module
+
+    :param int idim: dimension of inputs
+    :param int elayers: number of encoder layers
+    :param int cdim: number of rnn units (results in cdim * 2 if bidirectional)
+    :param int hdim: number of final projection units
+    :param float dropout: dropout rate
+    :param str typ: The RNN type
+    """
+
+    def __init__(self, idim, elayers, cdim, hdim, dropout, typ="blstm"):
+        super(RNN, self).__init__()
+        bidir = typ[0] == "b"
+        self.nbrnn = (
+            torch.nn.LSTM(
+                idim,
+                cdim,
+                elayers,
+                batch_first=True,
+                dropout=dropout,
+                bidirectional=bidir,
+            )
+            if "lstm" in typ
+            else torch.nn.GRU(
+                idim,
+                cdim,
+                elayers,
+                batch_first=True,
+                dropout=dropout,
+                bidirectional=bidir,
+            )
+        )
+        if bidir:
+            self.l_last = torch.nn.Linear(cdim * 2, hdim)
+        else:
+            self.l_last = torch.nn.Linear(cdim, hdim)
+        self.typ = typ
+
+    def forward(self, xs_pad, ilens, prev_state=None):
+        """RNN forward
+
+        :param torch.Tensor xs_pad: batch of padded input sequences (B, Tmax, D)
+        :param torch.Tensor ilens: batch of lengths of input sequences (B)
+        :param torch.Tensor prev_state: batch of previous RNN states
+        :return: batch of hidden state sequences (B, Tmax, eprojs)
+        :rtype: torch.Tensor
+        """
+        logging.debug(self.__class__.__name__ + " input lengths: " + str(ilens))
+        if not isinstance(ilens, torch.Tensor):
+            ilens = torch.tensor(ilens)
+        xs_pack = pack_padded_sequence(xs_pad, ilens.cpu(), batch_first=True)
+        self.nbrnn.flatten_parameters()
+        if prev_state is not None and self.nbrnn.bidirectional:
+            # We assume that when previous state is passed,
+            # it means that we're streaming the input
+            # and therefore cannot propagate backward BRNN state
+            # (otherwise it goes in the wrong direction)
+            prev_state = reset_backward_rnn_state(prev_state)
+        ys, states = self.nbrnn(xs_pack, hx=prev_state)
+        # ys: utt list of frame x cdim x 2 (2: means bidirectional)
+        ys_pad, ilens = pad_packed_sequence(ys, batch_first=True)
+        # (sum _utt frame_utt) x dim
+        projected = torch.tanh(
+            self.l_last(ys_pad.contiguous().view(-1, ys_pad.size(2)))
+        )
+        xs_pad = projected.view(ys_pad.size(0), ys_pad.size(1), -1)
+        return xs_pad, ilens, states  # x: utt list of frame x dim
+
+
+def reset_backward_rnn_state(states):
+    """Sets backward BRNN states to zeroes
+
+    Useful in processing of sliding windows over the inputs
+    """
+    if isinstance(states, (list, tuple)):
+        for state in states:
+            state[1::2] = 0.0
+    else:
+        states[1::2] = 0.0
+    return states
+
+
+class VGG2L(torch.nn.Module):
+    """VGG-like module
+
+    :param int in_channel: number of input channels
+    """
+
+    def __init__(self, in_channel=1):
+        super(VGG2L, self).__init__()
+        # CNN layer (VGG motivated)
+        self.conv1_1 = torch.nn.Conv2d(in_channel, 64, 3, stride=1, padding=1)
+        self.conv1_2 = torch.nn.Conv2d(64, 64, 3, stride=1, padding=1)
+        self.conv2_1 = torch.nn.Conv2d(64, 128, 3, stride=1, padding=1)
+        self.conv2_2 = torch.nn.Conv2d(128, 128, 3, stride=1, padding=1)
+
+        self.in_channel = in_channel
+
+    def forward(self, xs_pad, ilens, **kwargs):
+        """VGG2L forward
+
+        :param torch.Tensor xs_pad: batch of padded input sequences (B, Tmax, D)
+        :param torch.Tensor ilens: batch of lengths of input sequences (B)
+        :return: batch of padded hidden state sequences (B, Tmax // 4, 128 * D // 4)
+        :rtype: torch.Tensor
+        """
+        logging.debug(self.__class__.__name__ + " input lengths: " + str(ilens))
+
+        # x: utt x frame x dim
+        # xs_pad = F.pad_sequence(xs_pad)
+
+        # x: utt x 1 (input channel num) x frame x dim
+        xs_pad = xs_pad.view(
+            xs_pad.size(0),
+            xs_pad.size(1),
+            self.in_channel,
+            xs_pad.size(2) // self.in_channel,
+        ).transpose(1, 2)
+
+        # NOTE: max_pool1d ?
+        xs_pad = F.relu(self.conv1_1(xs_pad))
+        xs_pad = F.relu(self.conv1_2(xs_pad))
+        xs_pad = F.max_pool2d(xs_pad, 2, stride=2, ceil_mode=True)
+
+        xs_pad = F.relu(self.conv2_1(xs_pad))
+        xs_pad = F.relu(self.conv2_2(xs_pad))
+        xs_pad = F.max_pool2d(xs_pad, 2, stride=2, ceil_mode=True)
+        if torch.is_tensor(ilens):
+            ilens = ilens.cpu().numpy()
+        else:
+            ilens = np.array(ilens, dtype=np.float32)
+        ilens = np.array(np.ceil(ilens / 2), dtype=np.int64)
+        ilens = np.array(
+            np.ceil(np.array(ilens, dtype=np.float32) / 2), dtype=np.int64
+        ).tolist()
+
+        # x: utt_list of frame (remove zero-padded frames) x (input channel num x dim)
+        xs_pad = xs_pad.transpose(1, 2)
+        xs_pad = xs_pad.contiguous().view(
+            xs_pad.size(0), xs_pad.size(1), xs_pad.size(2) * xs_pad.size(3)
+        )
+        return xs_pad, ilens, None  # no state in this layer
+
+
+class Encoder(torch.nn.Module):
+    """Encoder module
+
+    :param str etype: type of encoder network
+    :param int idim: number of input dimensions of encoder network
+    :param int elayers: number of layers of encoder network
+    :param int eunits: number of lstm units of encoder network
+    :param int eprojs: number of projection units of encoder network
+    :param np.ndarray subsample: list of subsampling numbers
+    :param float dropout: dropout rate
+    :param int in_channel: number of input channels
+    """
+
+    def __init__(
+            self, etype, idim, elayers, eunits, eprojs, subsample, dropout, in_channel=1
+    ):
+        super(Encoder, self).__init__()
+        typ = (etype[3:] if etype.startswith("vgg") else etype).rstrip("p")
+        if typ not in ["lstm", "gru", "blstm", "bgru"]:
+            logging.error("Error: need to specify an appropriate encoder architecture")
+
+        if etype.startswith("vgg"):
+            if etype[-1] == "p":
+                self.enc = torch.nn.ModuleList(
+                    [
+                        VGG2L(in_channel),
+                        RNNP(
+                            get_vgg2l_odim(idim, in_channel=in_channel),
+                            elayers,
+                            eunits,
+                            eprojs,
+                            subsample,
+                            dropout,
+                            typ=typ,
+                        ),
+                    ]
+                )
+                logging.info("Use CNN-VGG + " + typ.upper() + "P for encoder")
+            else:
+                self.enc = torch.nn.ModuleList(
+                    [
+                        VGG2L(in_channel),
+                        RNN(
+                            get_vgg2l_odim(idim, in_channel=in_channel),
+                            elayers,
+                            eunits,
+                            eprojs,
+                            dropout,
+                            typ=typ,
+                        ),
+                    ]
+                )
+                logging.info("Use CNN-VGG + " + typ.upper() + " for encoder")
+            self.conv_subsampling_factor = 4
+        else:
+            if etype[-1] == "p":
+                self.enc = torch.nn.ModuleList(
+                    [RNNP(idim, elayers, eunits, eprojs, subsample, dropout, typ=typ)]
+                )
+                logging.info(typ.upper() + " with every-layer projection for encoder")
+            else:
+                self.enc = torch.nn.ModuleList(
+                    [RNN(idim, elayers, eunits, eprojs, dropout, typ=typ)]
+                )
+                logging.info(typ.upper() + " without projection for encoder")
+            self.conv_subsampling_factor = 1
+
+    def forward(self, xs_pad, ilens, prev_states=None):
+        """Encoder forward
+
+        :param torch.Tensor xs_pad: batch of padded input sequences (B, Tmax, D)
+        :param torch.Tensor ilens: batch of lengths of input sequences (B)
+        :param list prev_states: previous hidden states, one per module in self.enc
+        :return: hidden state sequences (B, Tmax, eprojs), lengths (B), module states
+        :rtype: tuple
+        """
+        if prev_states is None:
+            prev_states = [None] * len(self.enc)
+        assert len(prev_states) == len(self.enc)
+
+        current_states = []
+        for module, prev_state in zip(self.enc, prev_states):
+            xs_pad, ilens, states = module(xs_pad, ilens, prev_state=prev_state)
+            current_states.append(states)
+
+        # make mask to remove bias value in padded part
+        mask = to_device(xs_pad, make_pad_mask(ilens).unsqueeze(-1))
+
+        return xs_pad.masked_fill(mask, 0.0), ilens, current_states
+
+
+def encoder_for(args, idim, subsample):
+    """Instantiates an encoder module given the program arguments
+
+    :param Namespace args: The arguments
+    :param int or List of integer idim: dimension of input, e.g. 83, or
+                                        List of dimensions of inputs, e.g. [83,83]
+    :param List or List of List subsample: subsample factors, e.g. [1,2,2,1,1], or
+                                        List of subsample factors of each encoder.
+                                         e.g. [[1,2,2,1,1], [1,2,2,1,1]]
+    :rtype: torch.nn.Module
+    :return: The encoder module
+    """
+    num_encs = getattr(args, "num_encs", 1)  # use getattr to keep compatibility
+    if num_encs == 1:
+        # compatible with single encoder asr mode
+        return Encoder(
+            args.etype,
+            idim,
+            args.elayers,
+            args.eunits,
+            args.eprojs,
+            subsample,
+            args.dropout_rate,
+        )
+    elif num_encs > 1:
+        enc_list = torch.nn.ModuleList()
+        for idx in range(num_encs):
+            enc = Encoder(
+                args.etype[idx],
+                idim[idx],
+                args.elayers[idx],
+                args.eunits[idx],
+                args.eprojs,
+                subsample[idx],
+                args.dropout_rate[idx],
+            )
+            enc_list.append(enc)
+        return enc_list
+    else:
+        raise ValueError(
+            "Number of encoders needs to be at least one. {}".format(num_encs)
+        )
diff --git a/funasr/modules/scorers/__init__.py b/funasr/modules/scorers/__init__.py
new file mode 100644
index 000000000..b7f177368
--- /dev/null
+++ b/funasr/modules/scorers/__init__.py
@@ -0,0 +1 @@
+"""Initialize sub package."""
diff --git a/funasr/modules/scorers/ctc.py b/funasr/modules/scorers/ctc.py
new file mode 100644
index 000000000..61deace59
--- /dev/null
+++ b/funasr/modules/scorers/ctc.py
@@ -0,0 +1,158 @@
+"""ScorerInterface implementation for CTC."""
+
+import numpy as np
+import torch
+
+from funasr.modules.scorers.ctc_prefix_score import CTCPrefixScore
+from funasr.modules.scorers.ctc_prefix_score import CTCPrefixScoreTH
+from funasr.modules.scorers.scorer_interface import BatchPartialScorerInterface
+
+
+class CTCPrefixScorer(BatchPartialScorerInterface):
+    """Decoder interface wrapper for CTCPrefixScore."""
+
+    def __init__(self, ctc: torch.nn.Module, eos: int):
+        """Initialize class.
+
+        Args:
+            ctc (torch.nn.Module): The CTC implementation.
+                For example, :class:`espnet.nets.pytorch_backend.ctc.CTC`
+            eos (int): The end-of-sequence id.
+
+        """
+        self.ctc = ctc
+        self.eos = eos
+        self.impl = None
+
+    def init_state(self, x: torch.Tensor):
+        """Get an initial state for decoding.
+
+        Args:
+            x (torch.Tensor): The encoded feature tensor
+
+        Returns: initial state
+
+        """
+        logp = self.ctc.log_softmax(x.unsqueeze(0)).detach().squeeze(0).cpu().numpy()
+        # TODO(karita): use CTCPrefixScoreTH
+        self.impl = CTCPrefixScore(logp, 0, self.eos, np)
+        return 0, self.impl.initial_state()
+
+    def select_state(self, state, i, new_id=None):
+        """Select state with relative ids in the main beam search.
+
+        Args:
+            state: Decoder state for prefix tokens
+            i (int): Index to select a state in the main beam search
+            new_id (int): New label id to select a state if necessary
+
+        Returns:
+            state: pruned state
+
+        """
+        if isinstance(state, tuple):
+            if len(state) == 2:  # for CTCPrefixScore
+                sc, st = state
+                return sc[i], st[i]
+            else:  # for CTCPrefixScoreTH (need new_id > 0)
+                r, log_psi, f_min, f_max, scoring_idmap = state
+                s = log_psi[i, new_id].expand(log_psi.size(1))
+                if scoring_idmap is not None:
+                    return r[:, :, i, scoring_idmap[i, new_id]], s, f_min, f_max
+                else:
+                    return r[:, :, i, new_id], s, f_min, f_max
+        return None if state is None else state[i]
+
+    def score_partial(self, y, ids, state, x):
+        """Score new token.
+
+        Args:
+            y (torch.Tensor): 1D prefix token
+            ids (torch.Tensor): torch.int64 next token to score
+            state: decoder state for prefix tokens
+            x (torch.Tensor): 2D encoder feature that generates ys
+
+        Returns:
+            tuple[torch.Tensor, Any]:
+                Tuple of a score tensor for y that has a shape `(len(ids),)`
+                and next state for ys
+
+        """
+        prev_score, state = state
+        presub_score, new_st = self.impl(y.cpu(), ids.cpu(), state)
+        tscore = torch.as_tensor(
+            presub_score - prev_score, device=x.device, dtype=x.dtype
+        )
+        return tscore, (presub_score, new_st)
+
+    def batch_init_state(self, x: torch.Tensor):
+        """Get an initial state for decoding.
+
+        Args:
+            x (torch.Tensor): The encoded feature tensor
+
+        Returns: initial state
+
+        """
+        logp = self.ctc.log_softmax(x.unsqueeze(0))  # assuming batch_size = 1
+        xlen = torch.tensor([logp.size(1)])
+        self.impl = CTCPrefixScoreTH(logp, xlen, 0, self.eos)
+        return None
+
+    def batch_score_partial(self, y, ids, state, x):
+        """Score new token.
+
+        Args:
+            y (torch.Tensor): 1D prefix token
+            ids (torch.Tensor): torch.int64 next token to score
+            state: decoder state for prefix tokens
+            x (torch.Tensor): 2D encoder feature that generates ys
+
+        Returns:
+            tuple[torch.Tensor, Any]:
+                Tuple of a score tensor for ys that has a shape `(n_batch, n_vocab)`
+                and next state for ys
+
+        """
+        batch_state = (
+            (
+                torch.stack([s[0] for s in state], dim=2),
+                torch.stack([s[1] for s in state]),
+                state[0][2],
+                state[0][3],
+            )
+            if state[0] is not None
+            else None
+        )
+        return self.impl(y, batch_state, ids)
+
+    def extend_prob(self, x: torch.Tensor):
+        """Extend probs for decoding.
+
+        This extension is for streaming decoding
+        as in Eq (14) in https://arxiv.org/abs/2006.14941
+
+        Args:
+            x (torch.Tensor): The encoded feature tensor
+
+        """
+        logp = self.ctc.log_softmax(x.unsqueeze(0))
+        self.impl.extend_prob(logp)
+
+    def extend_state(self, state):
+        """Extend state for decoding.
+
+        This extension is for streaming decoding
+        as in Eq (14) in https://arxiv.org/abs/2006.14941
+
+        Args:
+            state: The states of hyps
+
+        Returns: extended state
+
+        """
+        new_state = []
+        for s in state:
+            new_state.append(self.impl.extend_state(s))
+
+        return new_state
diff --git a/funasr/modules/scorers/ctc_prefix_score.py b/funasr/modules/scorers/ctc_prefix_score.py
new file mode 100644
index 000000000..0c67ecd09
--- /dev/null
+++ b/funasr/modules/scorers/ctc_prefix_score.py
@@ -0,0 +1,359 @@
+#!/usr/bin/env python3
+
+# Copyright 2018 Mitsubishi Electric Research Labs (Takaaki Hori)
+#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)
+
+import torch
+
+import numpy as np
+import six
+
+
+class CTCPrefixScoreTH(object):
+    """Batch processing of CTCPrefixScore
+
+    which is based on Algorithm 2 in WATANABE et al.
+    "HYBRID CTC/ATTENTION ARCHITECTURE FOR END-TO-END SPEECH RECOGNITION,"
+    but extended to efficiently compute the label probabilities for multiple
+    hypotheses simultaneously
+    See also Seki et al. "Vectorized Beam Search for CTC-Attention-Based
+    Speech Recognition," In INTERSPEECH (pp. 3825-3829), 2019.
+    """
+
+    def __init__(self, x, xlens, blank, eos, margin=0):
+        """Construct CTC prefix scorer
+
+        :param torch.Tensor x: input label posterior sequences (B, T, O)
+        :param torch.Tensor xlens: input lengths (B,)
+        :param int blank: blank label id
+        :param int eos: end-of-sequence id
+        :param int margin: margin parameter for windowing (0 means no windowing)
+        """
+        # In the comment lines,
+        # we assume T: input_length, B: batch size, W: beam width, O: output dim.
+        self.logzero = -10000000000.0
+        self.blank = blank
+        self.eos = eos
+        self.batch = x.size(0)
+        self.input_length = x.size(1)
+        self.odim = x.size(2)
+        self.dtype = x.dtype
+        self.device = (
+            torch.device("cuda:%d" % x.get_device())
+            if x.is_cuda
+            else torch.device("cpu")
+        )
+        # Pad the rest of posteriors in the batch
+        # TODO(takaaki-hori): need a better way without for-loops
+        for i, l in enumerate(xlens):
+            if l < self.input_length:
+                x[i, l:, :] = self.logzero
+                x[i, l:, blank] = 0
+        # Reshape input x
+        xn = x.transpose(0, 1)  # (B, T, O) -> (T, B, O)
+        xb = xn[:, :, self.blank].unsqueeze(2).expand(-1, -1, self.odim)
+        self.x = torch.stack([xn, xb])  # (2, T, B, O)
+        self.end_frames = torch.as_tensor(xlens) - 1
+
+        # Setup CTC windowing
+        self.margin = margin
+        if margin > 0:
+            self.frame_ids = torch.arange(
+                self.input_length, dtype=self.dtype, device=self.device
+            )
+        # Base indices for index conversion
+        self.idx_bh = None
+        self.idx_b = torch.arange(self.batch, device=self.device)
+        self.idx_bo = (self.idx_b * self.odim).unsqueeze(1)
+
+    def __call__(self, y, state, scoring_ids=None, att_w=None):
+        """Compute CTC prefix scores for next labels
+
+        :param list y: prefix label sequences
+        :param tuple state: previous CTC state
+        :param torch.Tensor scoring_ids: pre-selected label ids to score (BW, S)
+        :param torch.Tensor att_w: attention weights to decide CTC window
+        :return: ctc_local_scores (BW, O), new_state
+        """
+        output_length = len(y[0]) - 1  # ignore sos
+        last_ids = [yi[-1] for yi in y]  # last output label ids
+        n_bh = len(last_ids)  # batch * hyps
+        n_hyps = n_bh // self.batch  # assuming each utterance has the same # of hyps
+        self.scoring_num = scoring_ids.size(-1) if scoring_ids is not None else 0
+        # prepare state info
+        if state is None:
+            r_prev = torch.full(
+                (self.input_length, 2, self.batch, n_hyps),
+                self.logzero,
+                dtype=self.dtype,
+                device=self.device,
+            )
+            r_prev[:, 1] = torch.cumsum(self.x[0, :, :, self.blank], 0).unsqueeze(2)
+            r_prev = r_prev.view(-1, 2, n_bh)
+            s_prev = 0.0
+            f_min_prev = 0
+            f_max_prev = 1
+        else:
+            r_prev, s_prev, f_min_prev, f_max_prev = state
+
+        # select input dimensions for scoring
+        if self.scoring_num > 0:
+            scoring_idmap = torch.full(
+                (n_bh, self.odim), -1, dtype=torch.long, device=self.device
+            )
+            snum = self.scoring_num
+            if self.idx_bh is None or n_bh > len(self.idx_bh):
+                self.idx_bh = torch.arange(n_bh, device=self.device).view(-1, 1)
+            scoring_idmap[self.idx_bh[:n_bh], scoring_ids] = torch.arange(
+                snum, device=self.device
+            )
+            scoring_idx = (
+                scoring_ids + self.idx_bo.repeat(1, n_hyps).view(-1, 1)
+            ).view(-1)
+            x_ = torch.index_select(
+                self.x.view(2, -1, self.batch * self.odim), 2, scoring_idx
+            ).view(2, -1, n_bh, snum)
+        else:
+            scoring_ids = None
+            scoring_idmap = None
+            snum = self.odim
+            x_ = self.x.unsqueeze(3).repeat(1, 1, 1, n_hyps, 1).view(2, -1, n_bh, snum)
+
+        # new CTC forward probs are prepared as a (T x 2 x BW x S) tensor
+        # that corresponds to r_t^n(h) and r_t^b(h) in a batch.
+        r = torch.full(
+            (self.input_length, 2, n_bh, snum),
+            self.logzero,
+            dtype=self.dtype,
+            device=self.device,
+        )
+        if output_length == 0:
+            r[0, 0] = x_[0, 0]
+
+        r_sum = torch.logsumexp(r_prev, 1)
+        log_phi = r_sum.unsqueeze(2).repeat(1, 1, snum)
+        if scoring_ids is not None:
+            for idx in range(n_bh):
+                pos = scoring_idmap[idx, last_ids[idx]]
+                if pos >= 0:
+                    log_phi[:, idx, pos] = r_prev[:, 1, idx]
+        else:
+            for idx in range(n_bh):
+                log_phi[:, idx, last_ids[idx]] = r_prev[:, 1, idx]
+
+        # decide start and end frames based on attention weights
+        if att_w is not None and self.margin > 0:
+            f_arg = torch.matmul(att_w, self.frame_ids)
+            f_min = max(int(f_arg.min().cpu()), f_min_prev)
+            f_max = max(int(f_arg.max().cpu()), f_max_prev)
+            start = min(f_max_prev, max(f_min - self.margin, output_length, 1))
+            end = min(f_max + self.margin, self.input_length)
+        else:
+            f_min = f_max = 0
+            start = max(output_length, 1)
+            end = self.input_length
+
+        # compute forward probabilities log(r_t^n(h)) and log(r_t^b(h))
+        for t in range(start, end):
+            rp = r[t - 1]
+            rr = torch.stack([rp[0], log_phi[t - 1], rp[0], rp[1]]).view(
+                2, 2, n_bh, snum
+            )
+            r[t] = torch.logsumexp(rr, 1) + x_[:, t]
+
+        # compute log prefix probabilities log(psi)
+        log_phi_x = torch.cat((log_phi[0].unsqueeze(0), log_phi[:-1]), dim=0) + x_[0]
+        if scoring_ids is not None:
+            log_psi = torch.full(
+                (n_bh, self.odim), self.logzero, dtype=self.dtype, device=self.device
+            )
+            log_psi_ = torch.logsumexp(
+                torch.cat((log_phi_x[start:end], r[start - 1, 0].unsqueeze(0)), dim=0),
+                dim=0,
+            )
+            for si in range(n_bh):
+                log_psi[si, scoring_ids[si]] = log_psi_[si]
+        else:
+            log_psi = torch.logsumexp(
+                torch.cat((log_phi_x[start:end], r[start - 1, 0].unsqueeze(0)), dim=0),
+                dim=0,
+            )
+
+        for si in range(n_bh):
+            log_psi[si, self.eos] = r_sum[self.end_frames[si // n_hyps], si]
+
+        # exclude blank probs
+        log_psi[:, self.blank] = self.logzero
+
+        return (log_psi - s_prev), (r, log_psi, f_min, f_max, scoring_idmap)
+
+    def index_select_state(self, state, best_ids):
+        """Select CTC states according to best ids
+
+        :param state    : CTC state
+        :param best_ids : index numbers selected by beam pruning (B, W)
+        :return selected_state
+        """
+        r, s, f_min, f_max, scoring_idmap = state
+        # convert ids to BHO space
+        n_bh = len(s)
+        n_hyps = n_bh // self.batch
+        vidx = (best_ids + (self.idx_b * (n_hyps * self.odim)).view(-1, 1)).view(-1)
+        # select hypothesis scores
+        s_new = torch.index_select(s.view(-1), 0, vidx)
+        s_new = s_new.view(-1, 1).repeat(1, self.odim).view(n_bh, self.odim)
+        # convert ids to BHS space (S: scoring_num)
+        if scoring_idmap is not None:
+            snum = self.scoring_num
+            hyp_idx = (best_ids // self.odim + (self.idx_b * n_hyps).view(-1, 1)).view(
+                -1
+            )
+            label_ids = torch.fmod(best_ids, self.odim).view(-1)
+            score_idx = scoring_idmap[hyp_idx, label_ids]
+            score_idx[score_idx == -1] = 0
+            vidx = score_idx + hyp_idx * snum
+        else:
+            snum = self.odim
+        # select forward probabilities
+        r_new = torch.index_select(r.view(-1, 2, n_bh * snum), 2, vidx).view(
+            -1, 2, n_bh
+        )
+        return r_new, s_new, f_min, f_max
+
+    def extend_prob(self, x):
+        """Extend CTC prob.
+
+        :param torch.Tensor x: input label posterior sequences (B, T, O)
+        """
+
+        if self.x.shape[1] < x.shape[1]:  # self.x (2,T,B,O); x (B,T,O)
+            # Pad the rest of posteriors in the batch
+            # TODO(takaaki-hori): need a better way without for-loops
+            xlens = [x.size(1)]
+            for i, l in enumerate(xlens):
+                if l < self.input_length:
+                    x[i, l:, :] = self.logzero
+                    x[i, l:, self.blank] = 0
+            tmp_x = self.x
+            xn = x.transpose(0, 1)  # (B, T, O) -> (T, B, O)
+            xb = xn[:, :, self.blank].unsqueeze(2).expand(-1, -1, self.odim)
+            self.x = torch.stack([xn, xb])  # (2, T, B, O)
+            self.x[:, : tmp_x.shape[1], :, :] = tmp_x
+            self.input_length = x.size(1)
+            self.end_frames = torch.as_tensor(xlens) - 1
+
+    def extend_state(self, state):
+        """Compute CTC prefix state.
+
+
+        :param state    : CTC state
+        :return ctc_state
+        """
+
+        if state is None:
+            # nothing to do
+            return state
+        else:
+            r_prev, s_prev, f_min_prev, f_max_prev = state
+
+            r_prev_new = torch.full(
+                (self.input_length, 2),
+                self.logzero,
+                dtype=self.dtype,
+                device=self.device,
+            )
+            start = max(r_prev.shape[0], 1)
+            r_prev_new[0:start] = r_prev
+            for t in six.moves.range(start, self.input_length):
+                r_prev_new[t, 1] = r_prev_new[t - 1, 1] + self.x[0, t, :, self.blank]
+
+            return (r_prev_new, s_prev, f_min_prev, f_max_prev)
+
+
+class CTCPrefixScore(object):
+    """Compute CTC label sequence scores
+
+    which is based on Algorithm 2 in WATANABE et al.
+    "HYBRID CTC/ATTENTION ARCHITECTURE FOR END-TO-END SPEECH RECOGNITION,"
+    but extended to efficiently compute the probabilities of multiple labels
+    simultaneously
+    """
+
+    def __init__(self, x, blank, eos, xp):
+        self.xp = xp
+        self.logzero = -10000000000.0
+        self.blank = blank
+        self.eos = eos
+        self.input_length = len(x)
+        self.x = x
+
+    def initial_state(self):
+        """Obtain an initial CTC state
+
+        :return: CTC state
+        """
+        # initial CTC state is made of a frame x 2 tensor that corresponds to
+        # r_t^n(<sos>) and r_t^b(<sos>), where 0 and 1 of axis=1 represent
+        # superscripts n and b (non-blank and blank), respectively.
+        r = self.xp.full((self.input_length, 2), self.logzero, dtype=np.float32)
+        r[0, 1] = self.x[0, self.blank]
+        for i in six.moves.range(1, self.input_length):
+            r[i, 1] = r[i - 1, 1] + self.x[i, self.blank]
+        return r
+
+    def __call__(self, y, cs, r_prev):
+        """Compute CTC prefix scores for next labels
+
+        :param y     : prefix label sequence
+        :param cs    : array of next labels
+        :param r_prev: previous CTC state
+        :return ctc_scores, ctc_states
+        """
+        # initialize CTC states
+        output_length = len(y) - 1  # ignore sos
+        # new CTC states are prepared as a frame x (n or b) x n_labels tensor
+        # that corresponds to r_t^n(h) and r_t^b(h).
+        r = self.xp.ndarray((self.input_length, 2, len(cs)), dtype=np.float32)
+        xs = self.x[:, cs]
+        if output_length == 0:
+            r[0, 0] = xs[0]
+            r[0, 1] = self.logzero
+        else:
+            r[output_length - 1] = self.logzero
+
+        # prepare forward probabilities for the last label
+        r_sum = self.xp.logaddexp(
+            r_prev[:, 0], r_prev[:, 1]
+        )  # log(r_t^n(g) + r_t^b(g))
+        last = y[-1]
+        if output_length > 0 and last in cs:
+            log_phi = self.xp.ndarray((self.input_length, len(cs)), dtype=np.float32)
+            for i in six.moves.range(len(cs)):
+                log_phi[:, i] = r_sum if cs[i] != last else r_prev[:, 1]
+        else:
+            log_phi = r_sum
+
+        # compute forward probabilities log(r_t^n(h)), log(r_t^b(h)),
+        # and log prefix probabilities log(psi)
+        start = max(output_length, 1)
+        log_psi = r[start - 1, 0]
+        for t in six.moves.range(start, self.input_length):
+            r[t, 0] = self.xp.logaddexp(r[t - 1, 0], log_phi[t - 1]) + xs[t]
+            r[t, 1] = (
+                self.xp.logaddexp(r[t - 1, 0], r[t - 1, 1]) + self.x[t, self.blank]
+            )
+            log_psi = self.xp.logaddexp(log_psi, log_phi[t - 1] + xs[t])
+
+        # get P(...eos|X) that ends with the prefix itself
+        eos_pos = self.xp.where(cs == self.eos)[0]
+        if len(eos_pos) > 0:
+            log_psi[eos_pos] = r_sum[-1]  # log(r_T^n(g) + r_T^b(g))
+
+        # exclude blank probs
+        blank_pos = self.xp.where(cs == self.blank)[0]
+        if len(blank_pos) > 0:
+            log_psi[blank_pos] = self.logzero
+
+        # return the log prefix probability and CTC states, where the label axis
+        # of the CTC states is moved to the first axis to slice it easily
+        return log_psi, self.xp.rollaxis(r, 2)
diff --git a/funasr/modules/scorers/length_bonus.py b/funasr/modules/scorers/length_bonus.py
new file mode 100644
index 000000000..7f576e04f
--- /dev/null
+++ b/funasr/modules/scorers/length_bonus.py
@@ -0,0 +1,61 @@
+"""Length bonus module."""
+from typing import Any
+from typing import List
+from typing import Tuple
+
+import torch
+
+from funasr.modules.scorers.scorer_interface import BatchScorerInterface
+
+
+class LengthBonus(BatchScorerInterface):
+    """Length bonus in beam search."""
+
+    def __init__(self, n_vocab: int):
+        """Initialize class.
+
+        Args:
+            n_vocab (int): The number of tokens in vocabulary for beam search
+
+        """
+        self.n = n_vocab
+
+    def score(self, y, state, x):
+        """Score new token.
+
+        Args:
+            y (torch.Tensor): 1D torch.int64 prefix tokens.
+            state: Scorer state for prefix tokens
+            x (torch.Tensor): 2D encoder feature that generates ys.
+
+        Returns:
+            tuple[torch.Tensor, Any]: Tuple of
+                torch.float32 scores for next token (n_vocab)
+                and None
+
+        """
+        return torch.tensor([1.0], device=x.device, dtype=x.dtype).expand(self.n), None
+
+    def batch_score(
+        self, ys: torch.Tensor, states: List[Any], xs: torch.Tensor
+    ) -> Tuple[torch.Tensor, List[Any]]:
+        """Score new token batch.
+
+        Args:
+            ys (torch.Tensor): torch.int64 prefix tokens (n_batch, ylen).
+            states (List[Any]): Scorer states for prefix tokens.
+            xs (torch.Tensor):
+                The encoder feature that generates ys (n_batch, xlen, n_feat).
+
+        Returns:
+            tuple[torch.Tensor, List[Any]]: Tuple of
+                batchified scores for next token with shape of `(n_batch, n_vocab)`
+                and next state list for ys.
+
+        """
+        return (
+            torch.tensor([1.0], device=xs.device, dtype=xs.dtype).expand(
+                ys.shape[0], self.n
+            ),
+            None,
+        )
diff --git a/funasr/modules/scorers/scorer_interface.py b/funasr/modules/scorers/scorer_interface.py
new file mode 100644
index 000000000..946ec6be3
--- /dev/null
+++ b/funasr/modules/scorers/scorer_interface.py
@@ -0,0 +1,188 @@
+"""Scorer interface module."""
+
+from typing import Any
+from typing import List
+from typing import Tuple
+
+import torch
+import warnings
+
+
+class ScorerInterface:
+    """Scorer interface for beam search.
+
+    The scorer performs scoring of all tokens in the vocabulary.
+
+    Examples:
+        * Search heuristics
+            * :class:`espnet.nets.scorers.length_bonus.LengthBonus`
+        * Decoder networks of the sequence-to-sequence models
+            * :class:`espnet.nets.pytorch_backend.nets.transformer.decoder.Decoder`
+            * :class:`espnet.nets.pytorch_backend.nets.rnn.decoders.Decoder`
+        * Neural language models
+            * :class:`espnet.nets.pytorch_backend.lm.transformer.TransformerLM`
+            * :class:`espnet.nets.pytorch_backend.lm.default.DefaultRNNLM`
+            * :class:`espnet.nets.pytorch_backend.lm.seq_rnn.SequentialRNNLM`
+
+    """
+
+    def init_state(self, x: torch.Tensor) -> Any:
+        """Get an initial state for decoding (optional).
+
+        Args:
+            x (torch.Tensor): The encoded feature tensor
+
+        Returns: initial state
+
+        """
+        return None
+
+    def select_state(self, state: Any, i: int, new_id: int = None) -> Any:
+        """Select state with relative ids in the main beam search.
+
+        Args:
+            state: Decoder state for prefix tokens
+            i (int): Index to select a state in the main beam search
+            new_id (int): New label index to select a state if necessary
+
+        Returns:
+            state: pruned state
+
+        """
+        return None if state is None else state[i]
+
+    def score(
+        self, y: torch.Tensor, state: Any, x: torch.Tensor
+    ) -> Tuple[torch.Tensor, Any]:
+        """Score new token (required).
+
+        Args:
+            y (torch.Tensor): 1D torch.int64 prefix tokens.
+            state: Scorer state for prefix tokens
+            x (torch.Tensor): The encoder feature that generates ys.
+
+        Returns:
+            tuple[torch.Tensor, Any]: Tuple of
+                scores for next token that has a shape of `(n_vocab)`
+                and next state for ys
+
+        """
+        raise NotImplementedError
+
+    def final_score(self, state: Any) -> float:
+        """Score eos (optional).
+
+        Args:
+            state: Scorer state for prefix tokens
+
+        Returns:
+            float: final score
+
+        """
+        return 0.0
+
+
+class BatchScorerInterface(ScorerInterface):
+    """Batch scorer interface."""
+
+    def batch_init_state(self, x: torch.Tensor) -> Any:
+        """Get an initial state for decoding (optional).
+
+        Args:
+            x (torch.Tensor): The encoded feature tensor
+
+        Returns: initial state
+
+        """
+        return self.init_state(x)
+
+    def batch_score(
+        self, ys: torch.Tensor, states: List[Any], xs: torch.Tensor
+    ) -> Tuple[torch.Tensor, List[Any]]:
+        """Score new token batch (required).
+
+        Args:
+            ys (torch.Tensor): torch.int64 prefix tokens (n_batch, ylen).
+            states (List[Any]): Scorer states for prefix tokens.
+            xs (torch.Tensor):
+                The encoder feature that generates ys (n_batch, xlen, n_feat).
+
+        Returns:
+            tuple[torch.Tensor, List[Any]]: Tuple of
+                batchified scores for next token with shape of `(n_batch, n_vocab)`
+                and next state list for ys.
+
+        """
+        warnings.warn(
+            "{} batch score is implemented through for loop not parallelized".format(
+                self.__class__.__name__
+            )
+        )
+        scores = list()
+        outstates = list()
+        for i, (y, state, x) in enumerate(zip(ys, states, xs)):
+            score, outstate = self.score(y, state, x)
+            outstates.append(outstate)
+            scores.append(score)
+        scores = torch.cat(scores, 0).view(ys.shape[0], -1)
+        return scores, outstates
+
+
+class PartialScorerInterface(ScorerInterface):
+    """Partial scorer interface for beam search.
+
+    The partial scorer runs after the non-partial scorers have finished scoring.
+    It receives only the pre-pruned next tokens, because scoring every token
+    in the vocabulary would be too expensive.
+
+    Examples:
+         * Prefix search for connectionist-temporal-classification models
+             * :class:`espnet.nets.scorers.ctc.CTCPrefixScorer`
+
+    """
+
+    def score_partial(
+        self, y: torch.Tensor, next_tokens: torch.Tensor, state: Any, x: torch.Tensor
+    ) -> Tuple[torch.Tensor, Any]:
+        """Score new token (required).
+
+        Args:
+            y (torch.Tensor): 1D prefix token
+            next_tokens (torch.Tensor): torch.int64 next token to score
+            state: decoder state for prefix tokens
+            x (torch.Tensor): The encoder feature that generates ys
+
+        Returns:
+            tuple[torch.Tensor, Any]:
+                Tuple of a score tensor for y that has a shape `(len(next_tokens),)`
+                and next state for ys
+
+        """
+        raise NotImplementedError
+
+
+class BatchPartialScorerInterface(BatchScorerInterface, PartialScorerInterface):
+    """Batch partial scorer interface for beam search."""
+
+    def batch_score_partial(
+        self,
+        ys: torch.Tensor,
+        next_tokens: torch.Tensor,
+        states: List[Any],
+        xs: torch.Tensor,
+    ) -> Tuple[torch.Tensor, Any]:
+        """Score new token (required).
+
+        Args:
+            ys (torch.Tensor): torch.int64 prefix tokens (n_batch, ylen).
+            next_tokens (torch.Tensor): torch.int64 tokens to score (n_batch, n_token).
+            states (List[Any]): Scorer states for prefix tokens.
+            xs (torch.Tensor):
+                The encoder feature that generates ys (n_batch, xlen, n_feat).
+
+        Returns:
+            tuple[torch.Tensor, Any]:
+                Tuple of a score tensor for ys that has a shape `(n_batch, n_vocab)`
+                and next states for ys
+        """
+        raise NotImplementedError
diff --git a/funasr/modules/streaming_utils/chunk_utilis.py b/funasr/modules/streaming_utils/chunk_utilis.py
new file mode 100644
index 000000000..ea37c68cc
--- /dev/null
+++ b/funasr/modules/streaming_utils/chunk_utilis.py
@@ -0,0 +1,390 @@
+
+import torch
+import numpy as np
+import math
+from funasr.modules.nets_utils import make_pad_mask
+import logging
+import torch.nn.functional as F
+from funasr.modules.streaming_utils.utils import sequence_mask
+
+
+
+class overlap_chunk():
+	"""
+	author: Speech Lab, Alibaba Group, China
+	SAN-M: Memory Equipped Self-Attention for End-to-End Speech Recognition
+	https://arxiv.org/abs/2006.01713
+
+	"""
+	def __init__(self,
+		chunk_size: tuple = (16,),
+		stride: tuple = (10,),
+		pad_left: tuple = (0,),
+		encoder_att_look_back_factor: tuple = (1,),
+		shfit_fsmn: int = 0,
+		decoder_att_look_back_factor: tuple = (1,),
+	):
+
+		pad_left = self.check_chunk_size_args(chunk_size, pad_left)
+		encoder_att_look_back_factor = self.check_chunk_size_args(chunk_size, encoder_att_look_back_factor)
+		decoder_att_look_back_factor = self.check_chunk_size_args(chunk_size, decoder_att_look_back_factor)
+		self.chunk_size, self.stride, self.pad_left, self.encoder_att_look_back_factor, self.decoder_att_look_back_factor \
+			= chunk_size, stride, pad_left, encoder_att_look_back_factor, decoder_att_look_back_factor
+		self.shfit_fsmn = shfit_fsmn
+		self.x_add_mask = None
+		self.x_rm_mask = None
+		self.x_len = None
+		self.mask_shfit_chunk = None
+		self.mask_chunk_predictor = None
+		self.mask_att_chunk_encoder = None
+		self.mask_shift_att_chunk_decoder = None
+		self.chunk_outs = None
+		self.chunk_size_cur, self.stride_cur, self.pad_left_cur, self.encoder_att_look_back_factor_cur, self.chunk_size_pad_shift_cur \
+			= None, None, None, None, None
+
+	def check_chunk_size_args(self, chunk_size, x):
+		if len(x) < len(chunk_size):
+			x = [x[0] for i in chunk_size]
+		return x
+
+	def get_chunk_size(self,
+		ind: int = 0
+	):
+		# with torch.no_grad:
+		chunk_size, stride, pad_left, encoder_att_look_back_factor, decoder_att_look_back_factor = \
+			self.chunk_size[ind], self.stride[ind], self.pad_left[ind], self.encoder_att_look_back_factor[ind], self.decoder_att_look_back_factor[ind]
+		self.chunk_size_cur, self.stride_cur, self.pad_left_cur, self.encoder_att_look_back_factor_cur, self.chunk_size_pad_shift_cur, self.decoder_att_look_back_factor_cur \
+			= chunk_size, stride, pad_left, encoder_att_look_back_factor, chunk_size + self.shfit_fsmn, decoder_att_look_back_factor
+		return self.chunk_size_cur, self.stride_cur, self.pad_left_cur, self.encoder_att_look_back_factor_cur, self.chunk_size_pad_shift_cur
+
+	def random_choice(self, training=True, decoding_ind=None):
+		chunk_num = len(self.chunk_size)
+		ind = 0
+		if training and chunk_num > 1:
+			ind = torch.randint(0, chunk_num, ()).cpu().item()  # upper bound of randint is exclusive
+		if not training and decoding_ind is not None:
+			ind = int(decoding_ind)
+
+		return ind
+
+
+
+
+	def gen_chunk_mask(self, x_len, ind=0, num_units=1, num_units_predictor=1):
+
+		with torch.no_grad():
+			x_len = x_len.cpu().numpy()
+			x_len_max = x_len.max()
+
+			chunk_size, stride, pad_left, encoder_att_look_back_factor, chunk_size_pad_shift = self.get_chunk_size(ind)
+			shfit_fsmn = self.shfit_fsmn
+			pad_right = chunk_size - stride - pad_left
+
+			chunk_num_batch = np.ceil(x_len/stride).astype(np.int32)
+			x_len_chunk = (chunk_num_batch-1) * chunk_size_pad_shift + shfit_fsmn + pad_left + 0 + x_len - (chunk_num_batch-1) * stride
+			x_len_chunk = x_len_chunk.astype(x_len.dtype)
+			x_len_chunk_max = x_len_chunk.max()
+
+			chunk_num = int(math.ceil(x_len_max/stride))
+			dtype = np.int32
+			max_len_for_x_mask_tmp = max(chunk_size, x_len_max + pad_left)
+			x_add_mask = np.zeros([0, max_len_for_x_mask_tmp], dtype=dtype)
+			x_rm_mask = np.zeros([max_len_for_x_mask_tmp, 0], dtype=dtype)
+			mask_shfit_chunk = np.zeros([0, num_units], dtype=dtype)
+			mask_chunk_predictor = np.zeros([0, num_units_predictor], dtype=dtype)
+			mask_shift_att_chunk_decoder = np.zeros([0, 1], dtype=dtype)
+			mask_att_chunk_encoder = np.zeros([0, chunk_num*chunk_size_pad_shift], dtype=dtype)
+			for chunk_ids in range(chunk_num):
+				# x_mask add
+				fsmn_padding = np.zeros((shfit_fsmn, max_len_for_x_mask_tmp), dtype=dtype)
+				x_mask_cur = np.diag(np.ones(chunk_size, dtype=np.float32))
+				x_mask_pad_left = np.zeros((chunk_size, chunk_ids * stride), dtype=dtype)
+				x_mask_pad_right = np.zeros((chunk_size, max_len_for_x_mask_tmp), dtype=dtype)
+				x_cur_pad = np.concatenate([x_mask_pad_left, x_mask_cur, x_mask_pad_right], axis=1)
+				x_cur_pad = x_cur_pad[:chunk_size, :max_len_for_x_mask_tmp]
+				x_add_mask_fsmn = np.concatenate([fsmn_padding, x_cur_pad], axis=0)
+				x_add_mask = np.concatenate([x_add_mask, x_add_mask_fsmn], axis=0)
+
+				# x_mask rm
+				fsmn_padding = np.zeros((max_len_for_x_mask_tmp, shfit_fsmn),dtype=dtype)
+				padding_mask_left = np.zeros((max_len_for_x_mask_tmp, pad_left),dtype=dtype)
+				padding_mask_right = np.zeros((max_len_for_x_mask_tmp, pad_right), dtype=dtype)
+				x_mask_cur = np.diag(np.ones(stride, dtype=dtype))
+				x_mask_cur_pad_top = np.zeros((chunk_ids*stride, stride), dtype=dtype)
+				x_mask_cur_pad_bottom = np.zeros((max_len_for_x_mask_tmp, stride), dtype=dtype)
+				x_rm_mask_cur = np.concatenate([x_mask_cur_pad_top, x_mask_cur, x_mask_cur_pad_bottom], axis=0)
+				x_rm_mask_cur = x_rm_mask_cur[:max_len_for_x_mask_tmp, :stride]
+				x_rm_mask_cur_fsmn = np.concatenate([fsmn_padding, padding_mask_left, x_rm_mask_cur, padding_mask_right], axis=1)
+				x_rm_mask = np.concatenate([x_rm_mask, x_rm_mask_cur_fsmn], axis=1)
+
+				# fsmn_padding_mask
+				pad_shfit_mask = np.zeros([shfit_fsmn, num_units], dtype=dtype)
+				ones_1 = np.ones([chunk_size, num_units], dtype=dtype)
+				mask_shfit_chunk_cur = np.concatenate([pad_shfit_mask, ones_1], axis=0)
+				mask_shfit_chunk = np.concatenate([mask_shfit_chunk, mask_shfit_chunk_cur], axis=0)
+
+				# predictor mask
+				zeros_1 = np.zeros([shfit_fsmn + pad_left, num_units_predictor], dtype=dtype)
+				ones_2 = np.ones([stride, num_units_predictor], dtype=dtype)
+				zeros_3 = np.zeros([chunk_size - stride - pad_left, num_units_predictor], dtype=dtype)
+				ones_zeros = np.concatenate([ones_2, zeros_3], axis=0)
+				mask_chunk_predictor_cur = np.concatenate([zeros_1, ones_zeros], axis=0)
+				mask_chunk_predictor = np.concatenate([mask_chunk_predictor, mask_chunk_predictor_cur], axis=0)
+
+				# encoder att mask
+				zeros_1_top = np.zeros([shfit_fsmn, chunk_num*chunk_size_pad_shift], dtype=dtype)
+
+				zeros_2_num = max(chunk_ids - encoder_att_look_back_factor, 0)
+				zeros_2 = np.zeros([chunk_size, zeros_2_num*chunk_size_pad_shift], dtype=dtype)
+
+				encoder_att_look_back_num = max(chunk_ids - zeros_2_num, 0)
+				zeros_2_left = np.zeros([chunk_size, shfit_fsmn], dtype=dtype)
+				ones_2_mid = np.ones([stride, stride], dtype=dtype)
+				zeros_2_bottom = np.zeros([chunk_size-stride, stride], dtype=dtype)
+				zeros_2_right = np.zeros([chunk_size, chunk_size-stride], dtype=dtype)
+				ones_2 = np.concatenate([ones_2_mid, zeros_2_bottom], axis=0)
+				ones_2 = np.concatenate([zeros_2_left, ones_2, zeros_2_right], axis=1)
+				ones_2 = np.tile(ones_2, [1, encoder_att_look_back_num])
+
+				zeros_3_left = np.zeros([chunk_size, shfit_fsmn], dtype=dtype)
+				ones_3_right = np.ones([chunk_size, chunk_size], dtype=dtype)
+				ones_3 = np.concatenate([zeros_3_left, ones_3_right], axis=1)
+
+				zeros_remain_num = max(chunk_num - 1 - chunk_ids, 0)
+				zeros_remain = np.zeros([chunk_size, zeros_remain_num*chunk_size_pad_shift], dtype=dtype)
+
+				ones2_bottom = np.concatenate([zeros_2, ones_2, ones_3, zeros_remain], axis=1)
+				mask_att_chunk_encoder_cur = np.concatenate([zeros_1_top, ones2_bottom], axis=0)
+				mask_att_chunk_encoder = np.concatenate([mask_att_chunk_encoder, mask_att_chunk_encoder_cur], axis=0)
+
+
+				# decoder fsmn_shift_att_mask
+				zeros_1 = np.zeros([shfit_fsmn, 1])
+				ones_1 = np.ones([chunk_size, 1])
+				mask_shift_att_chunk_decoder_cur = np.concatenate([zeros_1, ones_1], axis=0)
+				mask_shift_att_chunk_decoder = np.concatenate(
+					[mask_shift_att_chunk_decoder, mask_shift_att_chunk_decoder_cur], axis=0)
+
+			self.x_add_mask = x_add_mask[:x_len_chunk_max, :x_len_max+pad_left]
+			self.x_len_chunk = x_len_chunk
+			self.x_rm_mask = x_rm_mask[:x_len_max, :x_len_chunk_max]
+			self.x_len = x_len
+			self.mask_shfit_chunk = mask_shfit_chunk[:x_len_chunk_max, :]
+			self.mask_chunk_predictor = mask_chunk_predictor[:x_len_chunk_max, :]
+			self.mask_att_chunk_encoder = mask_att_chunk_encoder[:x_len_chunk_max, :x_len_chunk_max]
+			self.mask_shift_att_chunk_decoder = mask_shift_att_chunk_decoder[:x_len_chunk_max, :]
+			self.chunk_outs = (self.x_add_mask,
+		        self.x_len_chunk,
+		        self.x_rm_mask,
+		        self.x_len,
+		        self.mask_shfit_chunk,
+		        self.mask_chunk_predictor,
+		        self.mask_att_chunk_encoder,
+		        self.mask_shift_att_chunk_decoder)
+
+		return self.chunk_outs
+
+
+	def split_chunk(self, x, x_len, chunk_outs):
+		"""
+		:param x: (b, t, d)
+		:param x_len: (b)
+		:param chunk_outs: tuple returned by gen_chunk_mask
+		:return: x_chunk (b, t_chunk, d), x_len_chunk (b)
+		"""
+		x = x[:, :x_len.max(), :]
+		b, t, d = x.size()
+		x_len_mask = (~make_pad_mask(x_len, maxlen=t)).to(
+			x.device)
+		x *= x_len_mask[:, :, None]
+
+		x_add_mask = self.get_x_add_mask(chunk_outs, x.device, dtype=x.dtype)
+		x_len_chunk = self.get_x_len_chunk(chunk_outs, x_len.device, dtype=x_len.dtype)
+		pad = (0, 0, self.pad_left_cur, 0)
+		x = F.pad(x, pad, "constant", 0.0)
+		b, t, d = x.size()
+		x = torch.transpose(x, 1, 0)
+		x = torch.reshape(x, [t, -1])
+		x_chunk = torch.mm(x_add_mask, x)
+		x_chunk = torch.reshape(x_chunk, [-1, b, d]).transpose(1, 0)
+
+		return x_chunk, x_len_chunk
+
+	def remove_chunk(self, x_chunk, x_len_chunk, chunk_outs):
+		x_chunk = x_chunk[:, :x_len_chunk.max(), :]
+		b, t, d = x_chunk.size()
+		x_len_chunk_mask = (~make_pad_mask(x_len_chunk, maxlen=t)).to(
+			x_chunk.device)
+		x_chunk *= x_len_chunk_mask[:, :, None]
+
+		x_rm_mask = self.get_x_rm_mask(chunk_outs, x_chunk.device, dtype=x_chunk.dtype)
+		x_len = self.get_x_len(chunk_outs, x_len_chunk.device, dtype=x_len_chunk.dtype)
+		x_chunk = torch.transpose(x_chunk, 1, 0)
+		x_chunk = torch.reshape(x_chunk, [t, -1])
+		x = torch.mm(x_rm_mask, x_chunk)
+		x = torch.reshape(x, [-1, b, d]).transpose(1, 0)
+
+		return x, x_len
+
+	def get_x_add_mask(self, chunk_outs=None, device='cpu', idx=0, dtype=torch.float32):
+		with torch.no_grad():
+			x = chunk_outs[idx] if chunk_outs is not None else  self.chunk_outs[idx]
+			x = torch.from_numpy(x).type(dtype).to(device)
+		return x
+
+	def get_x_len_chunk(self, chunk_outs=None, device='cpu', idx=1, dtype=torch.float32):
+		with torch.no_grad():
+			x = chunk_outs[idx] if chunk_outs is not None else  self.chunk_outs[idx]
+			x = torch.from_numpy(x).type(dtype).to(device)
+		return x
+
+
+	def get_x_rm_mask(self, chunk_outs=None, device='cpu', idx=2, dtype=torch.float32):
+		with torch.no_grad():
+			x = chunk_outs[idx] if chunk_outs is not None else  self.chunk_outs[idx]
+			x = torch.from_numpy(x).type(dtype).to(device)
+		return x
+
+	def get_x_len(self, chunk_outs=None, device='cpu', idx=3, dtype=torch.float32):
+		with torch.no_grad():
+			x = chunk_outs[idx] if chunk_outs is not None else  self.chunk_outs[idx]
+			x = torch.from_numpy(x).type(dtype).to(device)
+		return x
+
+
+	def get_mask_shfit_chunk(self, chunk_outs=None, device='cpu', batch_size=1, num_units=1, idx=4, dtype=torch.float32):
+		with torch.no_grad():
+			x = chunk_outs[idx] if chunk_outs is not None else  self.chunk_outs[idx]
+			x = np.tile(x[None, :, :, ], [batch_size, 1, num_units])
+			x = torch.from_numpy(x).type(dtype).to(device)
+		return x
+
+	def get_mask_chunk_predictor(self, chunk_outs=None, device='cpu', batch_size=1, num_units=1, idx=5, dtype=torch.float32):
+		with torch.no_grad():
+			x = chunk_outs[idx] if chunk_outs is not None else  self.chunk_outs[idx]
+			x = np.tile(x[None, :, :, ], [batch_size, 1, num_units])
+			x = torch.from_numpy(x).type(dtype).to(device)
+		return x
+
+	def get_mask_att_chunk_encoder(self, chunk_outs=None, device='cpu', batch_size=1, idx=6, dtype=torch.float32):
+		with torch.no_grad():
+			x = chunk_outs[idx] if chunk_outs is not None else  self.chunk_outs[idx]
+			x = np.tile(x[None, :, :, ], [batch_size, 1, 1])
+			x = torch.from_numpy(x).type(dtype).to(device)
+		return x
+
+	def get_mask_shift_att_chunk_decoder(self, chunk_outs=None, device='cpu', batch_size=1, idx=7, dtype=torch.float32):
+		with torch.no_grad():
+			x = chunk_outs[idx] if chunk_outs is not None else  self.chunk_outs[idx]
+			x = np.tile(x[None, None, :, 0], [batch_size, 1, 1])
+			x = torch.from_numpy(x).type(dtype).to(device)
+		return x
+
+
+
+def build_scama_mask_for_cross_attention_decoder(
+							  predictor_alignments: torch.Tensor,
+                              encoder_sequence_length: torch.Tensor,
+                              chunk_size: int = 5,
+                              encoder_chunk_size: int = 5,
+                              attention_chunk_center_bias: int = 0,
+                              attention_chunk_size: int = 1,
+                              attention_chunk_type: str = 'chunk',
+                              step=None,
+							  predictor_mask_chunk_hopping: torch.Tensor = None,
+							  decoder_att_look_back_factor: int = 1,
+							  mask_shift_att_chunk_decoder: torch.Tensor = None,
+							  target_length: torch.Tensor = None,
+							  is_training=True,
+                              dtype: torch.dtype = torch.float32):
+	with torch.no_grad():
+		device = predictor_alignments.device
+		batch_size, chunk_num = predictor_alignments.size()
+		maximum_encoder_length = encoder_sequence_length.max().item()
+		int_type = predictor_alignments.dtype
+		if not is_training:
+			target_length = predictor_alignments.sum(dim=-1).type(encoder_sequence_length.dtype)
+		maximum_target_length = target_length.max()
+		predictor_alignments_cumsum = torch.cumsum(predictor_alignments, dim=1)
+		predictor_alignments_cumsum = predictor_alignments_cumsum[:, None, :].repeat(1, maximum_target_length, 1)
+	
+	
+		index = torch.ones([batch_size, maximum_target_length], dtype=int_type).to(device)
+		index = torch.cumsum(index, dim=1)
+		index = index[:, :, None].repeat(1, 1, chunk_num)
+	
+		index_div = torch.floor(torch.divide(predictor_alignments_cumsum, index)).type(int_type)
+		index_div_bool_zeros = index_div == 0
+		index_div_bool_zeros_count = torch.sum(index_div_bool_zeros.type(int_type), dim=-1) + 1
+	
+		index_div_bool_zeros_count = torch.clip(index_div_bool_zeros_count, min=1, max=chunk_num)
+	
+		index_div_bool_zeros_count *= chunk_size
+		index_div_bool_zeros_count += attention_chunk_center_bias
+		index_div_bool_zeros_count = torch.clip(index_div_bool_zeros_count-1, min=0, max=maximum_encoder_length)
+		index_div_bool_zeros_count_ori = index_div_bool_zeros_count
+	
+		index_div_bool_zeros_count = (torch.floor(index_div_bool_zeros_count / encoder_chunk_size)+1)*encoder_chunk_size
+		max_len_chunk = math.ceil(maximum_encoder_length / encoder_chunk_size) * encoder_chunk_size
+	
+		mask_flip, mask_flip2 = None, None
+		if attention_chunk_size is not None:
+			index_div_bool_zeros_count_beg = index_div_bool_zeros_count - attention_chunk_size
+			index_div_bool_zeros_count_beg = torch.clip(index_div_bool_zeros_count_beg, 0, max_len_chunk)
+			index_div_bool_zeros_count_beg_mask = sequence_mask(index_div_bool_zeros_count_beg, maxlen=max_len_chunk, dtype=int_type, device=device)
+			mask_flip = 1 - index_div_bool_zeros_count_beg_mask
+			attention_chunk_size2 = attention_chunk_size * (decoder_att_look_back_factor+1)
+			index_div_bool_zeros_count_beg = index_div_bool_zeros_count - attention_chunk_size2
+	
+			index_div_bool_zeros_count_beg = torch.clip(index_div_bool_zeros_count_beg, 0, max_len_chunk)
+			index_div_bool_zeros_count_beg_mask = sequence_mask(index_div_bool_zeros_count_beg, maxlen=max_len_chunk, dtype=int_type, device=device)
+			mask_flip2 = 1 - index_div_bool_zeros_count_beg_mask
+	
+		mask = sequence_mask(index_div_bool_zeros_count, maxlen=max_len_chunk, dtype=dtype, device=device)
+	
+		if predictor_mask_chunk_hopping is not None:
+				b, k, t = mask.size()
+				predictor_mask_chunk_hopping = predictor_mask_chunk_hopping[:, None, :, 0].repeat(1, k, 1)
+	
+				mask_mask_flip = mask
+				if mask_flip is not None:
+						mask_mask_flip = mask_flip * mask
+	
+				def _fn():
+						mask_sliced = mask[:b, :k, encoder_chunk_size:t]
+						zero_pad_right = torch.zeros([b, k, encoder_chunk_size], dtype=mask_sliced.dtype).to(device)
+						mask_sliced = torch.cat([mask_sliced, zero_pad_right], dim=2)
+						_, _, tt = predictor_mask_chunk_hopping.size()
+						pad_right_p = max_len_chunk - tt
+						predictor_mask_chunk_hopping_pad = torch.nn.functional.pad(predictor_mask_chunk_hopping, [0, pad_right_p], "constant", 0)
+						masked = mask_sliced * predictor_mask_chunk_hopping_pad
+	
+						mask_true = mask_mask_flip + masked
+						return mask_true
+	
+				mask = _fn() if t > chunk_size else mask_mask_flip
+	
+	
+	
+		if mask_flip2 is not None:
+			mask *= mask_flip2
+	
+		mask_target = sequence_mask(target_length, maxlen=maximum_target_length, dtype=mask.dtype, device=device)
+		mask = mask[:, :maximum_target_length, :] * mask_target[:, :, None]
+	
+	
+	
+		mask_len = sequence_mask(encoder_sequence_length, maxlen=maximum_encoder_length, dtype=mask.dtype, device=device)
+		mask = mask[:, :, :maximum_encoder_length] * mask_len[:, None, :]
+	
+	
+	
+	
+		if attention_chunk_type == 'full':
+			mask = torch.ones_like(mask).to(device)
+		if mask_shift_att_chunk_decoder is not None:
+			mask = mask * mask_shift_att_chunk_decoder
+		mask = mask[:, :maximum_target_length, :maximum_encoder_length].type(dtype).to(device)
+
+	return mask
+
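The chunk-length bookkeeping in `gen_chunk_mask` above can be checked by hand. The following is a minimal sketch, not part of the patch, that copies only the `x_len_chunk` expression for a single utterance using plain integers. The function name `chunked_length` and the spelling `shift_fsmn` are illustrative assumptions, not names from the module.

```python
import math


def chunked_length(x_len, chunk_size, stride, pad_left=0, shift_fsmn=0):
    """Length of one utterance after overlap chunking (hypothetical helper).

    Term-by-term copy of the x_len_chunk expression in gen_chunk_mask,
    written for a scalar x_len instead of a NumPy array.
    """
    chunk_num = math.ceil(x_len / stride)
    chunk_size_pad_shift = chunk_size + shift_fsmn
    # Contribution of all chunks except the last one.
    leading = (chunk_num - 1) * chunk_size_pad_shift
    # Frames that are left for the last chunk.
    remaining = x_len - (chunk_num - 1) * stride
    return leading + shift_fsmn + pad_left + remaining


# 25 frames, chunk_size=16, stride=10: three chunks, 2 * 16 + 5 frames.
print(chunked_length(25, chunk_size=16, stride=10))
```

With the constructor defaults (`chunk_size=16`, `stride=10`), the chunked sequence is longer than the input whenever more than one chunk is needed, because consecutive chunks overlap by `chunk_size - stride` frames.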
diff --git a/funasr/modules/streaming_utils/utils.py b/funasr/modules/streaming_utils/utils.py
new file mode 100644
index 000000000..dd76de923
--- /dev/null
+++ b/funasr/modules/streaming_utils/utils.py
@@ -0,0 +1,47 @@
+import torch
+from torch.nn import functional as F
+
+import numpy as np
+
+def sequence_mask(lengths, maxlen=None, dtype=torch.float32, device=None):
+	if maxlen is None:
+		maxlen = lengths.max()
+	row_vector = torch.arange(0, maxlen, 1).to(lengths.device)
+	matrix = torch.unsqueeze(lengths, dim=-1)
+	mask = row_vector < matrix
+	mask = mask.detach()
+
+	return mask.type(dtype).to(device) if device is not None else mask.type(dtype)
+
+def apply_cmvn(inputs, mvn):
+	device = inputs.device
+	dtype = inputs.dtype
+	frame, dim = inputs.shape
+	means = np.tile(mvn[0:1, :dim], (frame, 1))
+	scales = np.tile(mvn[1:2, :dim], (frame, 1))  # second row of mvn is applied multiplicatively
+	inputs -= torch.from_numpy(means).type(dtype).to(device)
+	inputs *= torch.from_numpy(scales).type(dtype).to(device)
+
+	return inputs.type(torch.float32)
+
+
+
+
+def drop_and_add(inputs: torch.Tensor,
+                 outputs: torch.Tensor,
+                 training: bool,
+                 dropout_rate: float = 0.1,
+                 stoch_layer_coeff: float = 1.0):
+
+
+
+	outputs = F.dropout(outputs, p=dropout_rate, training=training, inplace=True)
+	outputs *= stoch_layer_coeff
+
+	input_dim = inputs.size(-1)
+	output_dim = outputs.size(-1)
+
+	if input_dim == output_dim:
+		outputs += inputs
+	return outputs
+
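The semantics of `sequence_mask` above (position `j` of row `i` is set when `j < lengths[i]`) can be illustrated without tensors. This is a pure-Python sketch under that reading; `sequence_mask_rows` is a hypothetical name, and it returns nested lists of floats rather than a `torch.Tensor`.

```python
def sequence_mask_rows(lengths, maxlen=None):
    """List-based illustration of the comparison used in sequence_mask."""
    if maxlen is None:
        maxlen = max(lengths)
    # Same rule as `row_vector < matrix`: 1.0 inside the sequence, 0.0 after it.
    return [[1.0 if pos < n else 0.0 for pos in range(maxlen)] for n in lengths]


for row in sequence_mask_rows([2, 3], maxlen=4):
    print(row)
```

Passing a `maxlen` larger than the longest length simply appends zero columns, which is how the callers in `chunk_utilis.py` pad masks to a common width.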
diff --git a/funasr/modules/subsampling.py b/funasr/modules/subsampling.py
new file mode 100644
index 000000000..f9a1c16e6
--- /dev/null
+++ b/funasr/modules/subsampling.py
@@ -0,0 +1,304 @@
+#!/usr/bin/env python3
+# -*- coding: utf-8 -*-
+
+# Copyright 2019 Shigeki Karita
+#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)
+
+"""Subsampling layer definition."""
+
+import torch
+import torch.nn.functional as F
+from funasr.modules.embedding import PositionalEncoding
+
+
+class TooShortUttError(Exception):
+    """Raised when the utt is too short for subsampling.
+
+    Args:
+        message (str): Message for error catch
+        actual_size (int): the short size that cannot pass the subsampling
+        limit (int): the limit size for subsampling
+
+    """
+
+    def __init__(self, message, actual_size, limit):
+        """Construct a TooShortUttError for error handler."""
+        super().__init__(message)
+        self.actual_size = actual_size
+        self.limit = limit
+
+
+def check_short_utt(ins, size):
+    """Check if the utterance is too short for subsampling."""
+    if isinstance(ins, Conv2dSubsampling2) and size < 7:
+        return True, 7
+    if isinstance(ins, Conv2dSubsampling) and size < 7:
+        return True, 7
+    if isinstance(ins, Conv2dSubsampling6) and size < 11:
+        return True, 11
+    if isinstance(ins, Conv2dSubsampling8) and size < 15:
+        return True, 15
+    return False, -1
+
+
+class Conv2dSubsampling(torch.nn.Module):
+    """Convolutional 2D subsampling (to 1/4 length).
+
+    Args:
+        idim (int): Input dimension.
+        odim (int): Output dimension.
+        dropout_rate (float): Dropout rate.
+        pos_enc (torch.nn.Module): Custom position encoding layer.
+
+    """
+
+    def __init__(self, idim, odim, dropout_rate, pos_enc=None):
+        """Construct a Conv2dSubsampling object."""
+        super(Conv2dSubsampling, self).__init__()
+        self.conv = torch.nn.Sequential(
+            torch.nn.Conv2d(1, odim, 3, 2),
+            torch.nn.ReLU(),
+            torch.nn.Conv2d(odim, odim, 3, 2),
+            torch.nn.ReLU(),
+        )
+        self.out = torch.nn.Sequential(
+            torch.nn.Linear(odim * (((idim - 1) // 2 - 1) // 2), odim),
+            pos_enc if pos_enc is not None else PositionalEncoding(odim, dropout_rate),
+        )
+
+    def forward(self, x, x_mask):
+        """Subsample x.
+
+        Args:
+            x (torch.Tensor): Input tensor (#batch, time, idim).
+            x_mask (torch.Tensor): Input mask (#batch, 1, time).
+
+        Returns:
+            torch.Tensor: Subsampled tensor (#batch, time', odim),
+                where time' = ((time - 1) // 2 - 1) // 2 (roughly time // 4).
+            torch.Tensor: Subsampled mask (#batch, 1, time'),
+                where time' = ((time - 1) // 2 - 1) // 2 (roughly time // 4).
+
+        """
+        x = x.unsqueeze(1)  # (b, c, t, f)
+        x = self.conv(x)
+        b, c, t, f = x.size()
+        x = self.out(x.transpose(1, 2).contiguous().view(b, t, c * f))
+        if x_mask is None:
+            return x, None
+        return x, x_mask[:, :, :-2:2][:, :, :-2:2]
+
+    def __getitem__(self, key):
+        """Get item.
+
+        When reset_parameters() is called, if use_scaled_pos_enc is used,
+            return the positioning encoding.
+
+        """
+        if key != -1:
+            raise NotImplementedError("Support only `-1` (for `reset_parameters`).")
+        return self.out[key]
+
+
+class Conv2dSubsampling2(torch.nn.Module):
+    """Convolutional 2D subsampling (to 1/2 length).
+
+    Args:
+        idim (int): Input dimension.
+        odim (int): Output dimension.
+        dropout_rate (float): Dropout rate.
+        pos_enc (torch.nn.Module): Custom position encoding layer.
+
+    """
+
+    def __init__(self, idim, odim, dropout_rate, pos_enc=None):
+        """Construct a Conv2dSubsampling2 object."""
+        super(Conv2dSubsampling2, self).__init__()
+        self.conv = torch.nn.Sequential(
+            torch.nn.Conv2d(1, odim, 3, 2),
+            torch.nn.ReLU(),
+            torch.nn.Conv2d(odim, odim, 3, 1),
+            torch.nn.ReLU(),
+        )
+        self.out = torch.nn.Sequential(
+            torch.nn.Linear(odim * (((idim - 1) // 2 - 2)), odim),
+            pos_enc if pos_enc is not None else PositionalEncoding(odim, dropout_rate),
+        )
+
+    def forward(self, x, x_mask):
+        """Subsample x.
+
+        Args:
+            x (torch.Tensor): Input tensor (#batch, time, idim).
+            x_mask (torch.Tensor): Input mask (#batch, 1, time).
+
+        Returns:
+            torch.Tensor: Subsampled tensor (#batch, time', odim),
+                where time' = (time - 1) // 2 - 2 (roughly time // 2).
+            torch.Tensor: Subsampled mask (#batch, 1, time'),
+                where time' = (time - 1) // 2 - 2 (roughly time // 2).
+
+        """
+        x = x.unsqueeze(1)  # (b, c, t, f)
+        x = self.conv(x)
+        b, c, t, f = x.size()
+        x = self.out(x.transpose(1, 2).contiguous().view(b, t, c * f))
+        if x_mask is None:
+            return x, None
+        return x, x_mask[:, :, :-2:2][:, :, :-2:1]
+
+    def __getitem__(self, key):
+        """Get item.
+
+        When reset_parameters() is called, if use_scaled_pos_enc is used,
+            return the positioning encoding.
+
+        """
+        if key != -1:
+            raise NotImplementedError("Support only `-1` (for `reset_parameters`).")
+        return self.out[key]
+
+
+class Conv2dSubsampling6(torch.nn.Module):
+    """Convolutional 2D subsampling (to 1/6 length).
+
+    Args:
+        idim (int): Input dimension.
+        odim (int): Output dimension.
+        dropout_rate (float): Dropout rate.
+        pos_enc (torch.nn.Module): Custom position encoding layer.
+
+    """
+
+    def __init__(self, idim, odim, dropout_rate, pos_enc=None):
+        """Construct a Conv2dSubsampling6 object."""
+        super(Conv2dSubsampling6, self).__init__()
+        self.conv = torch.nn.Sequential(
+            torch.nn.Conv2d(1, odim, 3, 2),
+            torch.nn.ReLU(),
+            torch.nn.Conv2d(odim, odim, 5, 3),
+            torch.nn.ReLU(),
+        )
+        self.out = torch.nn.Sequential(
+            torch.nn.Linear(odim * (((idim - 1) // 2 - 2) // 3), odim),
+            pos_enc if pos_enc is not None else PositionalEncoding(odim, dropout_rate),
+        )
+
+    def forward(self, x, x_mask):
+        """Subsample x.
+
+        Args:
+            x (torch.Tensor): Input tensor (#batch, time, idim).
+            x_mask (torch.Tensor): Input mask (#batch, 1, time).
+
+        Returns:
+            torch.Tensor: Subsampled tensor (#batch, time', odim),
+                where time' = ((time - 1) // 2 - 2) // 3 (roughly time // 6).
+            torch.Tensor: Subsampled mask (#batch, 1, time'),
+                where time' = ((time - 1) // 2 - 2) // 3 (roughly time // 6).
+
+        """
+        x = x.unsqueeze(1)  # (b, c, t, f)
+        x = self.conv(x)
+        b, c, t, f = x.size()
+        x = self.out(x.transpose(1, 2).contiguous().view(b, t, c * f))
+        if x_mask is None:
+            return x, None
+        return x, x_mask[:, :, :-2:2][:, :, :-4:3]
+
+
+class Conv2dSubsampling8(torch.nn.Module):
+    """Convolutional 2D subsampling (to 1/8 length).
+
+    Args:
+        idim (int): Input dimension.
+        odim (int): Output dimension.
+        dropout_rate (float): Dropout rate.
+        pos_enc (torch.nn.Module): Custom position encoding layer.
+
+    """
+
+    def __init__(self, idim, odim, dropout_rate, pos_enc=None):
+        """Construct a Conv2dSubsampling8 object."""
+        super(Conv2dSubsampling8, self).__init__()
+        self.conv = torch.nn.Sequential(
+            torch.nn.Conv2d(1, odim, 3, 2),
+            torch.nn.ReLU(),
+            torch.nn.Conv2d(odim, odim, 3, 2),
+            torch.nn.ReLU(),
+            torch.nn.Conv2d(odim, odim, 3, 2),
+            torch.nn.ReLU(),
+        )
+        self.out = torch.nn.Sequential(
+            torch.nn.Linear(odim * ((((idim - 1) // 2 - 1) // 2 - 1) // 2), odim),
+            pos_enc if pos_enc is not None else PositionalEncoding(odim, dropout_rate),
+        )
+
+    def forward(self, x, x_mask):
+        """Subsample x.
+
+        Args:
+            x (torch.Tensor): Input tensor (#batch, time, idim).
+            x_mask (torch.Tensor): Input mask (#batch, 1, time).
+
+        Returns:
+            torch.Tensor: Subsampled tensor (#batch, time', odim),
+                where time' = (((time - 1) // 2 - 1) // 2 - 1) // 2 (roughly time // 8).
+            torch.Tensor: Subsampled mask (#batch, 1, time'),
+                where time' = (((time - 1) // 2 - 1) // 2 - 1) // 2 (roughly time // 8).
+
+        """
+        x = x.unsqueeze(1)  # (b, c, t, f)
+        x = self.conv(x)
+        b, c, t, f = x.size()
+        x = self.out(x.transpose(1, 2).contiguous().view(b, t, c * f))
+        if x_mask is None:
+            return x, None
+        return x, x_mask[:, :, :-2:2][:, :, :-2:2][:, :, :-2:2]
+
+class Conv1dSubsampling(torch.nn.Module):
+    """Convolutional 1D subsampling (to roughly 1/stride length).
+
+    Args:
+        idim (int): Input dimension.
+        odim (int): Output dimension.
+        kernel_size (int): Kernel size of the 1D convolution.
+        stride (int): Stride of the 1D convolution.
+        pad: Padding passed to torch.nn.ConstantPad1d.
+    """
+
+    def __init__(self, idim, odim, kernel_size, stride, pad):
+        super(Conv1dSubsampling, self).__init__()
+        self.conv = torch.nn.Conv1d(idim, odim, kernel_size, stride)
+        self.pad_fn = torch.nn.ConstantPad1d(pad, 0.0)
+        self.stride = stride
+        self.odim = odim
+
+    def output_size(self) -> int:
+        return self.odim
+
+    def forward(self, x, x_len):
+        """Subsample x.
+
+        """
+        x = x.transpose(1, 2)  # (b, d ,t)
+        x = self.pad_fn(x)
+        x = F.relu(self.conv(x))
+        x = x.transpose(1, 2)  # (b, t ,d)
+
+        if x_len is None:
+
+            return x, None
+        x_len = (x_len - 1) // self.stride + 1
+        return x, x_len
+
+    def __getitem__(self, key):
+        """Get item.
+
+        Mirrors the Conv2d subsampling layers, but note that this class
+            defines no `self.out`, so the lookup below raises AttributeError.
+
+        """
+        if key != -1:
+            raise NotImplementedError("Support only `-1` (for `reset_parameters`).")
+        return self.out[key]
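The time lengths produced by the Conv2d subsampling classes above, and the limits used in `check_short_utt`, follow from the unpadded convolution length formula. The sketch below is not part of the patch; it assumes only the `(kernel, stride)` pairs visible in the constructors, and the helper names and constants are hypothetical.

```python
def conv_out_len(length, kernel, stride):
    """Output length of an unpadded convolution along one axis."""
    return (length - kernel) // stride + 1


def subsampled_len(length, layers):
    """Time length after a stack of (kernel, stride) convolutions."""
    for kernel, stride in layers:
        length = conv_out_len(length, kernel, stride)
    return length


def min_input_len(layers):
    """Smallest input length for which the last layer still yields one frame."""
    need = 1
    for kernel, stride in reversed(layers):
        need = (need - 1) * stride + kernel
    return need


# (kernel, stride) pairs read off the constructors in the diff above.
CONV2D_SUBSAMPLING = [(3, 2), (3, 2)]
CONV2D_SUBSAMPLING6 = [(3, 2), (5, 3)]
CONV2D_SUBSAMPLING8 = [(3, 2), (3, 2), (3, 2)]

# The mask slicing in forward() should produce the same number of frames.
frames = list(range(100))
print(subsampled_len(100, CONV2D_SUBSAMPLING), len(frames[:-2:2][:-2:2]))

for layers in (CONV2D_SUBSAMPLING, CONV2D_SUBSAMPLING6, CONV2D_SUBSAMPLING8):
    print(min_input_len(layers))
```

For these three stacks the minimum input lengths come out as 7, 11 and 15, which are the limits `check_short_utt` returns for `Conv2dSubsampling`, `Conv2dSubsampling6` and `Conv2dSubsampling8`.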
diff --git a/funasr/modules/subsampling_without_posenc.py b/funasr/modules/subsampling_without_posenc.py
new file mode 100644
index 000000000..239d3f1ad
--- /dev/null
+++ b/funasr/modules/subsampling_without_posenc.py
@@ -0,0 +1,61 @@
+# Copyright 2020 Emiru Tsunoo
+#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)
+
+"""Subsampling layer definition."""
+
+import math
+import torch
+
+
+class Conv2dSubsamplingWOPosEnc(torch.nn.Module):
+    """Convolutional 2D subsampling.
+
+    Args:
+        idim (int): Input dimension.
+        odim (int): Output dimension.
+        dropout_rate (float): Dropout rate (currently unused).
+        kernels (list): kernel sizes
+        strides (list): stride sizes
+
+    """
+
+    def __init__(self, idim, odim, dropout_rate, kernels, strides):
+        """Construct a Conv2dSubsamplingWOPosEnc object."""
+        assert len(kernels) == len(strides)
+        super().__init__()
+        conv = []
+        olen = idim
+        for i, (k, s) in enumerate(zip(kernels, strides)):
+            conv += [
+                torch.nn.Conv2d(1 if i == 0 else odim, odim, k, s),
+                torch.nn.ReLU(),
+            ]
+            olen = math.floor((olen - k) / s + 1)
+        self.conv = torch.nn.Sequential(*conv)
+        self.out = torch.nn.Linear(odim * olen, odim)
+        self.strides = strides
+        self.kernels = kernels
+
+    def forward(self, x, x_mask):
+        """Subsample x.
+
+        Args:
+            x (torch.Tensor): Input tensor (#batch, time, idim).
+            x_mask (torch.Tensor): Input mask (#batch, 1, time).
+
+        Returns:
+            torch.Tensor: Subsampled tensor (#batch, time', odim),
+                where time' depends on `kernels` and `strides`.
+            torch.Tensor: Subsampled mask (#batch, 1, time'),
+                where time' depends on `kernels` and `strides`.
+
+        """
+        x = x.unsqueeze(1)  # (b, c, t, f)
+        x = self.conv(x)
+        b, c, t, f = x.size()
+        x = self.out(x.transpose(1, 2).contiguous().view(b, t, c * f))
+        if x_mask is None:
+            return x, None
+        for k, s in zip(self.kernels, self.strides):
+            x_mask = x_mask[:, :, : -k + 1 : s]
+        return x, x_mask
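One edge case in the mask loop of `Conv2dSubsamplingWOPosEnc.forward` is worth spelling out: for a kernel size of 1 the stop index `-k + 1` evaluates to `0`, and in Python a slice `[:0:s]` is empty. The sketch below reproduces the slicing on a plain list and shows one possible variant with an explicit stop index. Both function names are hypothetical, and the variant is a suggestion rather than the behavior of the patch.

```python
def slice_mask_as_in_forward(mask, kernels, strides):
    """Apply the same slicing as the loop in forward() to a 1-D list."""
    for k, s in zip(kernels, strides):
        mask = mask[: -k + 1 : s]
    return mask


def slice_mask_explicit_stop(mask, kernels, strides):
    """Variant with an explicit stop index, so kernel size 1 keeps its frames."""
    for k, s in zip(kernels, strides):
        mask = mask[: len(mask) - k + 1 : s]
    return mask


frames = list(range(10))
# Kernel 3, stride 2: both versions keep the same frames.
print(slice_mask_as_in_forward(frames, [3], [2]))
print(slice_mask_explicit_stop(frames, [3], [2]))
# Kernel 1, stride 1: the original slicing drops every frame.
print(slice_mask_as_in_forward(frames, [1], [1]))
print(slice_mask_explicit_stop(frames, [1], [1]))
```

The two versions agree whenever every kernel size is at least 2, so the difference only matters if a configuration passes `1` in `kernels`.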
diff --git a/funasr/optimizers/__init__.py b/funasr/optimizers/__init__.py
new file mode 100644
index 000000000..e69de29bb
diff --git a/funasr/optimizers/sgd.py b/funasr/optimizers/sgd.py
new file mode 100644
index 000000000..3f0d3d1c9
--- /dev/null
+++ b/funasr/optimizers/sgd.py
@@ -0,0 +1,32 @@
+import torch
+from typeguard import check_argument_types
+
+
+class SGD(torch.optim.SGD):
+    """Thin subclass of torch.optim.SGD that gives 'lr' a default value.
+
+    Note that
+    the arguments of the optimizer invoked by AbsTask.main()
+    must have default values, except for 'params'.
+
+    torch.optim.SGD has historically required 'lr' to be passed explicitly.
+    """
+
+    def __init__(
+        self,
+        params,
+        lr: float = 0.1,
+        momentum: float = 0.0,
+        dampening: float = 0.0,
+        weight_decay: float = 0.0,
+        nesterov: bool = False,
+    ):
+        assert check_argument_types()
+        super().__init__(
+            params,
+            lr=lr,
+            momentum=momentum,
+            dampening=dampening,
+            weight_decay=weight_decay,
+            nesterov=nesterov,
+        )
diff --git a/funasr/samplers/__init__.py b/funasr/samplers/__init__.py
new file mode 100644
index 000000000..e69de29bb
diff --git a/funasr/samplers/abs_sampler.py b/funasr/samplers/abs_sampler.py
new file mode 100644
index 000000000..2f7aa539b
--- /dev/null
+++ b/funasr/samplers/abs_sampler.py
@@ -0,0 +1,19 @@
+from abc import ABC
+from abc import abstractmethod
+from typing import Iterator
+from typing import Tuple
+
+from torch.utils.data import Sampler
+
+
+class AbsSampler(Sampler, ABC):
+    @abstractmethod
+    def __len__(self) -> int:
+        raise NotImplementedError
+
+    @abstractmethod
+    def __iter__(self) -> Iterator[Tuple[str, ...]]:
+        raise NotImplementedError
+
+    def generate(self, seed):
+        return list(self)
diff --git a/funasr/samplers/build_batch_sampler.py b/funasr/samplers/build_batch_sampler.py
new file mode 100644
index 000000000..edda6ba02
--- /dev/null
+++ b/funasr/samplers/build_batch_sampler.py
@@ -0,0 +1,167 @@
+from typing import List
+from typing import Sequence
+from typing import Tuple
+from typing import Union
+
+from typeguard import check_argument_types
+from typeguard import check_return_type
+
+from funasr.samplers.abs_sampler import AbsSampler
+from funasr.samplers.folded_batch_sampler import FoldedBatchSampler
+from funasr.samplers.length_batch_sampler import LengthBatchSampler
+from funasr.samplers.num_elements_batch_sampler import NumElementsBatchSampler
+from funasr.samplers.sorted_batch_sampler import SortedBatchSampler
+from funasr.samplers.unsorted_batch_sampler import UnsortedBatchSampler
+
+
+BATCH_TYPES = dict(
+    unsorted="UnsortedBatchSampler has no special features and "
+    "just creates mini-batches with a constant batch_size. "
+    "This sampler doesn't require any length "
+    "information for each feature. "
+    "'key_file' is just a text file which lists each sample name."
+    "\n\n"
+    "    utterance_id_a\n"
+    "    utterance_id_b\n"
+    "    utterance_id_c\n"
+    "\n"
+    "Only the first column is used, so 'shape file' can be used, too.\n\n"
+    "    utterance_id_a 100,80\n"
+    "    utterance_id_b 400,80\n"
+    "    utterance_id_c 512,80\n",
+    sorted="SortedBatchSampler sorts samples by the length of the first input "
+    "so that the samples in each mini-batch have similar lengths. "
+    "This sampler requires a text file which describes the length for each sample."
+    "\n\n"
+    "    utterance_id_a 1000\n"
+    "    utterance_id_b 1453\n"
+    "    utterance_id_c 1241\n"
+    "\n"
+    "Only the first element of the feature dimensions is used, "
+    "so 'shape_file' can also be used.\n\n"
+    "    utterance_id_a 1000,80\n"
+    "    utterance_id_b 1453,80\n"
+    "    utterance_id_c 1241,80\n",
+    folded="FoldedBatchSampler supports variable batch_size, decided by\n"
+    "    batch_size = base_batch_size // (1 + L // fold_length)\n"
+    "where L is the length of the first sample of the mini-batch "
+    "(samples are taken in ascending order of length). "
+    "This sampler requires the same length information as SortedBatchSampler.\n",
+    length="LengthBatchSampler supports variable batch_size. "
+    "This sampler creates mini-batches whose number of 'bins' is as uniform "
+    "as possible, counted as the total length of each feature in the mini-batch. "
+    "This sampler requires a text file which describes the length for each sample. "
+    "\n\n"
+    "    utterance_id_a 1000\n"
+    "    utterance_id_b 1453\n"
+    "    utterance_id_c 1241\n"
+    "\n"
+    "Only the first element of the feature dimensions is used, "
+    "so 'shape_file' can also be used.\n\n"
+    "    utterance_id_a 1000,80\n"
+    "    utterance_id_b 1453,80\n"
+    "    utterance_id_c 1241,80\n",
+    numel="NumElementsBatchSampler supports variable batch_size. "
+    "Just like LengthBatchSampler, this sampler creates mini-batches"
+    " whose number of 'bins' is as uniform as possible, "
+    "counted as the total number of elements of each feature "
+    "instead of the length. "
+    "Thus this sampler requires the full shape information of the features. "
+    "\n\n"
+    "    utterance_id_a 1000,80\n"
+    "    utterance_id_b 1453,80\n"
+    "    utterance_id_c 1241,80\n",
+)
+
+
+def build_batch_sampler(
+    type: str,
+    batch_size: int,
+    batch_bins: int,
+    shape_files: Union[Tuple[str, ...], List[str]],
+    sort_in_batch: str = "descending",
+    sort_batch: str = "ascending",
+    drop_last: bool = False,
+    min_batch_size: int = 1,
+    fold_lengths: Sequence[int] = (),
+    padding: bool = True,
+    utt2category_file: str = None,
+) -> AbsSampler:
+    """Helper function to instantiate BatchSampler.
+
+    Args:
+        type: mini-batch type. "unsorted", "sorted", "folded", "numel", or "length"
+        batch_size: The mini-batch size. Used for "unsorted", "sorted", "folded" mode
+        batch_bins: Used for "numel" and "length" mode
+        shape_files: Text files describing the length and dimension
+            of each feature. e.g. uttA 1330,80
+        sort_in_batch: Sample order in a mini-batch, "descending" or "ascending"
+        sort_batch: Order of the mini-batches, "ascending" or "descending"
+        drop_last: Whether to drop the last mini-batch if it is incomplete
+        min_batch_size:  Used for "numel", "length", or "folded" mode
+        fold_lengths: Used for "folded" mode
+        padding: Whether inputs are padded tensors. Used for "numel", "length" mode
+        utt2category_file: Optional utterance-to-category file. Used for "folded" mode
+    """
+    assert check_argument_types()
+    if len(shape_files) == 0:
+        raise ValueError("No shape files are given")
+
+    if type == "unsorted":
+        retval = UnsortedBatchSampler(
+            batch_size=batch_size, key_file=shape_files[0], drop_last=drop_last
+        )
+
+    elif type == "sorted":
+        retval = SortedBatchSampler(
+            batch_size=batch_size,
+            shape_file=shape_files[0],
+            sort_in_batch=sort_in_batch,
+            sort_batch=sort_batch,
+            drop_last=drop_last,
+        )
+
+    elif type == "folded":
+        if len(fold_lengths) != len(shape_files):
+            raise ValueError(
+                f"The number of fold_lengths must be equal to "
+                f"the number of shape_files: "
+                f"{len(fold_lengths)} != {len(shape_files)}"
+            )
+        retval = FoldedBatchSampler(
+            batch_size=batch_size,
+            shape_files=shape_files,
+            fold_lengths=fold_lengths,
+            sort_in_batch=sort_in_batch,
+            sort_batch=sort_batch,
+            drop_last=drop_last,
+            min_batch_size=min_batch_size,
+            utt2category_file=utt2category_file,
+        )
+
+    elif type == "numel":
+        retval = NumElementsBatchSampler(
+            batch_bins=batch_bins,
+            shape_files=shape_files,
+            sort_in_batch=sort_in_batch,
+            sort_batch=sort_batch,
+            drop_last=drop_last,
+            padding=padding,
+            min_batch_size=min_batch_size,
+        )
+
+    elif type == "length":
+        retval = LengthBatchSampler(
+            batch_bins=batch_bins,
+            shape_files=shape_files,
+            sort_in_batch=sort_in_batch,
+            sort_batch=sort_batch,
+            drop_last=drop_last,
+            padding=padding,
+            min_batch_size=min_batch_size,
+        )
+
+    else:
+        raise ValueError(f"Not supported: {type}")
+    assert check_return_type(retval)
+    return retval
diff --git a/funasr/samplers/folded_batch_sampler.py b/funasr/samplers/folded_batch_sampler.py
new file mode 100644
index 000000000..48e960454
--- /dev/null
+++ b/funasr/samplers/folded_batch_sampler.py
@@ -0,0 +1,156 @@
+from typing import Iterator
+from typing import List
+from typing import Sequence
+from typing import Tuple
+from typing import Union
+
+from typeguard import check_argument_types
+
+from funasr.fileio.read_text import load_num_sequence_text
+from funasr.fileio.read_text import read_2column_text
+from funasr.samplers.abs_sampler import AbsSampler
+
+
+class FoldedBatchSampler(AbsSampler):
+    def __init__(
+        self,
+        batch_size: int,
+        shape_files: Union[Tuple[str, ...], List[str]],
+        fold_lengths: Sequence[int],
+        min_batch_size: int = 1,
+        sort_in_batch: str = "descending",
+        sort_batch: str = "ascending",
+        drop_last: bool = False,
+        utt2category_file: str = None,
+    ):
+        assert check_argument_types()
+        assert batch_size > 0
+        if sort_batch != "ascending" and sort_batch != "descending":
+            raise ValueError(
+                f"sort_batch must be ascending or descending: {sort_batch}"
+            )
+        if sort_in_batch != "descending" and sort_in_batch != "ascending":
+            raise ValueError(
+                f"sort_in_batch must be ascending or descending: {sort_in_batch}"
+            )
+
+        self.batch_size = batch_size
+        self.shape_files = shape_files
+        self.sort_in_batch = sort_in_batch
+        self.sort_batch = sort_batch
+        self.drop_last = drop_last
+
+        # utt2shape: (Length, ...)
+        #    uttA 100,...
+        #    uttB 201,...
+        utt2shapes = [
+            load_num_sequence_text(s, loader_type="csv_int") for s in shape_files
+        ]
+
+        first_utt2shape = utt2shapes[0]
+        for s, d in zip(shape_files, utt2shapes):
+            if set(d) != set(first_utt2shape):
+                raise RuntimeError(
+                    f"keys are mismatched between {s} != {shape_files[0]}"
+                )
+
+        # Sort samples in ascending order
+        # (shape order should be like (Length, Dim))
+        keys = sorted(first_utt2shape, key=lambda k: first_utt2shape[k][0])
+        if len(keys) == 0:
+            raise RuntimeError(f"0 lines found: {shape_files[0]}")
+
+        category2utt = {}
+        if utt2category_file is not None:
+            utt2category = read_2column_text(utt2category_file)
+            if set(utt2category) != set(first_utt2shape):
+                raise RuntimeError(
+                    "keys are mismatched between "
+                    f"{utt2category_file} != {shape_files[0]}"
+                )
+            for k in keys:
+                category2utt.setdefault(utt2category[k], []).append(k)
+        else:
+            category2utt["default_category"] = keys
+
+        self.batch_list = []
+        for d, v in category2utt.items():
+            category_keys = v
+            # Decide batch-sizes
+            start = 0
+            batch_sizes = []
+            while True:
+                k = category_keys[start]
+                factor = max(int(d[k][0] / m) for d, m in zip(utt2shapes, fold_lengths))
+                bs = max(min_batch_size, int(batch_size / (1 + factor)))
+                if self.drop_last and start + bs > len(category_keys):
+                    # This if-block avoids 0-batches
+                    if len(batch_sizes) > 0:
+                        break
+
+                bs = min(len(category_keys) - start, bs)
+                batch_sizes.append(bs)
+                start += bs
+                if start >= len(category_keys):
+                    break
+
+            if len(batch_sizes) == 0:
+                # Maybe we can't reach here
+                raise RuntimeError("0 batches")
+
+            # If the last batch-size is smaller than minimum batch_size,
+            # the samples are redistributed to the other mini-batches
+            if len(batch_sizes) > 1 and batch_sizes[-1] < min_batch_size:
+                for i in range(batch_sizes.pop(-1)):
+                    batch_sizes[-(i % len(batch_sizes)) - 1] += 1
+
+            if not self.drop_last:
+                # Bug check
+                assert sum(batch_sizes) == len(
+                    category_keys
+                ), f"{sum(batch_sizes)} != {len(category_keys)}"
+
+            # Set mini-batch
+            cur_batch_list = []
+            start = 0
+            for bs in batch_sizes:
+                assert len(category_keys) >= start + bs, "Bug"
+                minibatch_keys = category_keys[start : start + bs]
+                start += bs
+                if sort_in_batch == "descending":
+                    minibatch_keys.reverse()
+                elif sort_in_batch == "ascending":
+                    # Keys are already sorted in ascending order
+                    pass
+                else:
+                    raise ValueError(
+                        "sort_in_batch must be ascending or "
+                        f"descending: {sort_in_batch}"
+                    )
+                cur_batch_list.append(tuple(minibatch_keys))
+
+            if sort_batch == "ascending":
+                pass
+            elif sort_batch == "descending":
+                cur_batch_list.reverse()
+            else:
+                raise ValueError(
+                    f"sort_batch must be ascending or descending: {sort_batch}"
+                )
+            self.batch_list.extend(cur_batch_list)
+
+    def __repr__(self):
+        return (
+            f"{self.__class__.__name__}("
+            f"N-batch={len(self)}, "
+            f"batch_size={self.batch_size}, "
+            f"shape_files={self.shape_files}, "
+            f"sort_in_batch={self.sort_in_batch}, "
+            f"sort_batch={self.sort_batch})"
+        )
+
+    def __len__(self):
+        return len(self.batch_list)
+
+    def __iter__(self) -> Iterator[Tuple[str, ...]]:
+        return iter(self.batch_list)
diff --git a/funasr/samplers/length_batch_sampler.py b/funasr/samplers/length_batch_sampler.py
new file mode 100644
index 000000000..cdf0e5809
--- /dev/null
+++ b/funasr/samplers/length_batch_sampler.py
@@ -0,0 +1,143 @@
+from typing import Iterator
+from typing import List
+from typing import Tuple
+from typing import Union
+
+from typeguard import check_argument_types
+
+from funasr.fileio.read_text import load_num_sequence_text
+from funasr.samplers.abs_sampler import AbsSampler
+
+
+class LengthBatchSampler(AbsSampler):
+    def __init__(
+        self,
+        batch_bins: int,
+        shape_files: Union[Tuple[str, ...], List[str]],
+        min_batch_size: int = 1,
+        sort_in_batch: str = "descending",
+        sort_batch: str = "ascending",
+        drop_last: bool = False,
+        padding: bool = True,
+    ):
+        assert check_argument_types()
+        assert batch_bins > 0
+        if sort_batch != "ascending" and sort_batch != "descending":
+            raise ValueError(
+                f"sort_batch must be ascending or descending: {sort_batch}"
+            )
+        if sort_in_batch != "descending" and sort_in_batch != "ascending":
+            raise ValueError(
+                f"sort_in_batch must be ascending or descending: {sort_in_batch}"
+            )
+
+        self.batch_bins = batch_bins
+        self.shape_files = shape_files
+        self.sort_in_batch = sort_in_batch
+        self.sort_batch = sort_batch
+        self.drop_last = drop_last
+
+        # utt2shape: (Length, ...)
+        #    uttA 100,...
+        #    uttB 201,...
+        utt2shapes = [
+            load_num_sequence_text(s, loader_type="csv_int") for s in shape_files
+        ]
+
+        first_utt2shape = utt2shapes[0]
+        for s, d in zip(shape_files, utt2shapes):
+            if set(d) != set(first_utt2shape):
+                raise RuntimeError(
+                    f"keys are mismatched between {s} != {shape_files[0]}"
+                )
+
+        # Sort samples in ascending order
+        # (shape order should be like (Length, Dim))
+        keys = sorted(first_utt2shape, key=lambda k: first_utt2shape[k][0])
+        if len(keys) == 0:
+            raise RuntimeError(f"0 lines found: {shape_files[0]}")
+
+        # Decide batch-sizes
+        batch_sizes = []
+        current_batch_keys = []
+        for key in keys:
+            current_batch_keys.append(key)
+            # shape: (Length, dim1, dim2, ...)
+            if padding:
+                # bins = bs x max_length
+                bins = sum(len(current_batch_keys) * sh[key][0] for sh in utt2shapes)
+            else:
+                # bins = sum of lengths
+                bins = sum(d[k][0] for k in current_batch_keys for d in utt2shapes)
+
+            if bins > batch_bins and len(current_batch_keys) >= min_batch_size:
+                batch_sizes.append(len(current_batch_keys))
+                current_batch_keys = []
+        else:
+            if len(current_batch_keys) != 0 and (
+                not self.drop_last or len(batch_sizes) == 0
+            ):
+                batch_sizes.append(len(current_batch_keys))
+
+        if len(batch_sizes) == 0:
+            # Maybe we can't reach here
+            raise RuntimeError("0 batches")
+
+        # If the last batch-size is smaller than minimum batch_size,
+        # the samples are redistributed to the other mini-batches
+        if len(batch_sizes) > 1 and batch_sizes[-1] < min_batch_size:
+            for i in range(batch_sizes.pop(-1)):
+                batch_sizes[-(i % len(batch_sizes)) - 1] += 1
+
+        if not self.drop_last:
+            # Bug check
+            assert sum(batch_sizes) == len(keys), f"{sum(batch_sizes)} != {len(keys)}"
+
+        # Set mini-batch
+        self.batch_list = []
+        iter_bs = iter(batch_sizes)
+        bs = next(iter_bs)
+        minibatch_keys = []
+        for key in keys:
+            minibatch_keys.append(key)
+            if len(minibatch_keys) == bs:
+                if sort_in_batch == "descending":
+                    minibatch_keys.reverse()
+                elif sort_in_batch == "ascending":
+                    # Keys are already sorted in ascending order
+                    pass
+                else:
+                    raise ValueError(
+                        "sort_in_batch must be ascending"
+                        f" or descending: {sort_in_batch}"
+                    )
+                self.batch_list.append(tuple(minibatch_keys))
+                minibatch_keys = []
+                try:
+                    bs = next(iter_bs)
+                except StopIteration:
+                    break
+
+        if sort_batch == "ascending":
+            pass
+        elif sort_batch == "descending":
+            self.batch_list.reverse()
+        else:
+            raise ValueError(
+                f"sort_batch must be ascending or descending: {sort_batch}"
+            )
+
+    def __repr__(self):
+        return (
+            f"{self.__class__.__name__}("
+            f"N-batch={len(self)}, "
+            f"batch_bins={self.batch_bins}, "
+            f"sort_in_batch={self.sort_in_batch}, "
+            f"sort_batch={self.sort_batch})"
+        )
+
+    def __len__(self):
+        return len(self.batch_list)
+
+    def __iter__(self) -> Iterator[Tuple[str, ...]]:
+        return iter(self.batch_list)
diff --git a/funasr/samplers/num_elements_batch_sampler.py b/funasr/samplers/num_elements_batch_sampler.py
new file mode 100644
index 000000000..0ffad9289
--- /dev/null
+++ b/funasr/samplers/num_elements_batch_sampler.py
@@ -0,0 +1,160 @@
+from typing import Iterator
+from typing import List
+from typing import Tuple
+from typing import Union
+
+import numpy as np
+from typeguard import check_argument_types
+
+from funasr.fileio.read_text import load_num_sequence_text
+from funasr.samplers.abs_sampler import AbsSampler
+
+
+class NumElementsBatchSampler(AbsSampler):
+    def __init__(
+        self,
+        batch_bins: int,
+        shape_files: Union[Tuple[str, ...], List[str]],
+        min_batch_size: int = 1,
+        sort_in_batch: str = "descending",
+        sort_batch: str = "ascending",
+        drop_last: bool = False,
+        padding: bool = True,
+    ):
+        assert check_argument_types()
+        assert batch_bins > 0
+        if sort_batch != "ascending" and sort_batch != "descending":
+            raise ValueError(
+                f"sort_batch must be ascending or descending: {sort_batch}"
+            )
+        if sort_in_batch != "descending" and sort_in_batch != "ascending":
+            raise ValueError(
+                f"sort_in_batch must be ascending or descending: {sort_in_batch}"
+            )
+
+        self.batch_bins = batch_bins
+        self.shape_files = shape_files
+        self.sort_in_batch = sort_in_batch
+        self.sort_batch = sort_batch
+        self.drop_last = drop_last
+
+        # utt2shape: (Length, ...)
+        #    uttA 100,...
+        #    uttB 201,...
+        utt2shapes = [
+            load_num_sequence_text(s, loader_type="csv_int") for s in shape_files
+        ]
+
+        first_utt2shape = utt2shapes[0]
+        for s, d in zip(shape_files, utt2shapes):
+            if set(d) != set(first_utt2shape):
+                raise RuntimeError(
+                    f"keys are mismatched between {s} != {shape_files[0]}"
+                )
+
+        # Sort samples in ascending order
+        # (shape order should be like (Length, Dim))
+        keys = sorted(first_utt2shape, key=lambda k: first_utt2shape[k][0])
+        if len(keys) == 0:
+            raise RuntimeError(f"0 lines found: {shape_files[0]}")
+        if padding:
+            # If padding case, the feat-dim must be same over whole corpus,
+            # therefore the first sample is referred
+            feat_dims = [np.prod(d[keys[0]][1:]) for d in utt2shapes]
+        else:
+            feat_dims = None
+
+        # Decide batch-sizes
+        batch_sizes = []
+        current_batch_keys = []
+        for key in keys:
+            current_batch_keys.append(key)
+            # shape: (Length, dim1, dim2, ...)
+            if padding:
+                for d, s in zip(utt2shapes, shape_files):
+                    if tuple(d[key][1:]) != tuple(d[keys[0]][1:]):
+                        raise RuntimeError(
+                            "If padding=True, the "
+                            f"feature dimension must be unified: {s}",
+                        )
+                bins = sum(
+                    len(current_batch_keys) * sh[key][0] * d
+                    for sh, d in zip(utt2shapes, feat_dims)
+                )
+            else:
+                bins = sum(
+                    np.prod(d[k]) for k in current_batch_keys for d in utt2shapes
+                )
+
+            if bins > batch_bins and len(current_batch_keys) >= min_batch_size:
+                batch_sizes.append(len(current_batch_keys))
+                current_batch_keys = []
+        else:
+            if len(current_batch_keys) != 0 and (
+                not self.drop_last or len(batch_sizes) == 0
+            ):
+                batch_sizes.append(len(current_batch_keys))
+
+        if len(batch_sizes) == 0:
+            # Maybe we can't reach here
+            raise RuntimeError("0 batches")
+
+        # If the last batch-size is smaller than minimum batch_size,
+        # the samples are redistributed to the other mini-batches
+        if len(batch_sizes) > 1 and batch_sizes[-1] < min_batch_size:
+            for i in range(batch_sizes.pop(-1)):
+                batch_sizes[-(i % len(batch_sizes)) - 1] += 1
+
+        if not self.drop_last:
+            # Bug check
+            assert sum(batch_sizes) == len(keys), f"{sum(batch_sizes)} != {len(keys)}"
+
+        # Set mini-batch
+        self.batch_list = []
+        iter_bs = iter(batch_sizes)
+        bs = next(iter_bs)
+        minibatch_keys = []
+        for key in keys:
+            minibatch_keys.append(key)
+            if len(minibatch_keys) == bs:
+                if sort_in_batch == "descending":
+                    minibatch_keys.reverse()
+                elif sort_in_batch == "ascending":
+                    # Keys are already sorted in ascending order
+                    pass
+                else:
+                    raise ValueError(
+                        "sort_in_batch must be ascending"
+                        f" or descending: {sort_in_batch}"
+                    )
+
+                self.batch_list.append(tuple(minibatch_keys))
+                minibatch_keys = []
+                try:
+                    bs = next(iter_bs)
+                except StopIteration:
+                    break
+
+        if sort_batch == "ascending":
+            pass
+        elif sort_batch == "descending":
+            self.batch_list.reverse()
+        else:
+            raise ValueError(
+                f"sort_batch must be ascending or descending: {sort_batch}"
+            )
+
+    def __repr__(self):
+        return (
+            f"{self.__class__.__name__}("
+            f"N-batch={len(self)}, "
+            f"batch_bins={self.batch_bins}, "
+            f"sort_in_batch={self.sort_in_batch}, "
+            f"sort_batch={self.sort_batch})"
+        )
+
+    def __len__(self):
+        return len(self.batch_list)
+
+    def __iter__(self) -> Iterator[Tuple[str, ...]]:
+        return iter(self.batch_list)
diff --git a/funasr/samplers/sorted_batch_sampler.py b/funasr/samplers/sorted_batch_sampler.py
new file mode 100644
index 000000000..d6c3b4111
--- /dev/null
+++ b/funasr/samplers/sorted_batch_sampler.py
@@ -0,0 +1,95 @@
+import logging
+from typing import Iterator
+from typing import Tuple
+
+from typeguard import check_argument_types
+
+from funasr.fileio.read_text import load_num_sequence_text
+from funasr.samplers.abs_sampler import AbsSampler
+
+
+class SortedBatchSampler(AbsSampler):
+    """BatchSampler with sorted samples by length.
+
+    Args:
+        batch_size:
+        shape_file:
+        sort_in_batch: 'descending' or 'ascending'.
+        sort_batch:
+    """
+
+    def __init__(
+        self,
+        batch_size: int,
+        shape_file: str,
+        sort_in_batch: str = "descending",
+        sort_batch: str = "ascending",
+        drop_last: bool = False,
+    ):
+        assert check_argument_types()
+        assert batch_size > 0
+        self.batch_size = batch_size
+        self.shape_file = shape_file
+        self.sort_in_batch = sort_in_batch
+        self.sort_batch = sort_batch
+        self.drop_last = drop_last
+
+        # utt2shape: (Length, ...)
+        #    uttA 100,...
+        #    uttB 201,...
+        utt2shape = load_num_sequence_text(shape_file, loader_type="csv_int")
+        if sort_in_batch == "descending":
+            # Sort samples in descending order (required by RNN)
+            keys = sorted(utt2shape, key=lambda k: -utt2shape[k][0])
+        elif sort_in_batch == "ascending":
+            # Sort samples in ascending order
+            keys = sorted(utt2shape, key=lambda k: utt2shape[k][0])
+        else:
+            raise ValueError(
+                f"sort_in_batch must be either "
+                f"ascending or descending: {sort_in_batch}"
+            )
+        if len(keys) == 0:
+            raise RuntimeError(f"0 lines found: {shape_file}")
+
+        # Apply max(, 1) to avoid 0-batches
+        N = max(len(keys) // batch_size, 1)
+        if not self.drop_last:
+            # Split keys evenly; if N != 1, every batch has >= batch_size items
+            self.batch_list = [
+                tuple(keys[i * len(keys) // N : (i + 1) * len(keys) // N])
+                for i in range(N)
+            ]
+        else:
+            self.batch_list = [
+                tuple(keys[i * batch_size : (i + 1) * batch_size]) for i in range(N)
+            ]
+
+        if len(self.batch_list) == 0:
+            logging.warning(f"{shape_file} is empty")
+
+        if sort_in_batch != sort_batch:
+            if sort_batch not in ("ascending", "descending"):
+                raise ValueError(
+                    f"sort_batch must be ascending or descending: {sort_batch}"
+                )
+            self.batch_list.reverse()
+
+        if len(self.batch_list) == 0:
+            raise RuntimeError("0 batches")
+
+    def __repr__(self):
+        return (
+            f"{self.__class__.__name__}("
+            f"N-batch={len(self)}, "
+            f"batch_size={self.batch_size}, "
+            f"shape_file={self.shape_file}, "
+            f"sort_in_batch={self.sort_in_batch}, "
+            f"sort_batch={self.sort_batch})"
+        )
+
+    def __len__(self):
+        return len(self.batch_list)
+
+    def __iter__(self) -> Iterator[Tuple[str, ...]]:
+        return iter(self.batch_list)
diff --git a/funasr/samplers/unsorted_batch_sampler.py b/funasr/samplers/unsorted_batch_sampler.py
new file mode 100644
index 000000000..349e526c7
--- /dev/null
+++ b/funasr/samplers/unsorted_batch_sampler.py
@@ -0,0 +1,91 @@
+import logging
+from typing import Iterator
+from typing import Tuple
+
+from typeguard import check_argument_types
+
+from funasr.fileio.read_text import read_2column_text
+from funasr.samplers.abs_sampler import AbsSampler
+
+
+class UnsortedBatchSampler(AbsSampler):
+    """BatchSampler with constant batch-size.
+
+    No sorting is done in this class,
+    so no length information is required.
+    This class is convenient for decoding mode,
+    or for non-seq2seq tasks, e.g. classification.
+
+    Args:
+        batch_size:
+        key_file:
+    """
+
+    def __init__(
+        self,
+        batch_size: int,
+        key_file: str,
+        drop_last: bool = False,
+        utt2category_file: str = None,
+    ):
+        assert check_argument_types()
+        assert batch_size > 0
+        self.batch_size = batch_size
+        self.key_file = key_file
+        self.drop_last = drop_last
+
+        # utt2any:
+        #    uttA <anything is OK>
+        #    uttB <anything is OK>
+        utt2any = read_2column_text(key_file)
+        if len(utt2any) == 0:
+            logging.warning(f"{key_file} is empty")
+        # In this case, only the first column is used
+        keys = list(utt2any)
+        if len(keys) == 0:
+            raise RuntimeError(f"0 lines found: {key_file}")
+
+        category2utt = {}
+        if utt2category_file is not None:
+            utt2category = read_2column_text(utt2category_file)
+            if set(utt2category) != set(keys):
+                raise RuntimeError(
+                    f"keys are mismatched between {utt2category_file} != {key_file}"
+                )
+            for k, v in utt2category.items():
+                category2utt.setdefault(v, []).append(k)
+        else:
+            category2utt["default_category"] = keys
+
+        self.batch_list = []
+        for d, v in category2utt.items():
+            category_keys = v
+            # Apply max(, 1) to avoid 0-batches
+            N = max(len(category_keys) // batch_size, 1)
+            if not self.drop_last:
+                # Split keys evenly; if N != 1, every batch has >= batch_size items
+                n = len(category_keys)
+                cur_batch_list = [
+                    tuple(category_keys[i * n // N : (i + 1) * n // N])
+                    for i in range(N)
+                ]
+            else:
+                cur_batch_list = [
+                    tuple(category_keys[i * batch_size : (i + 1) * batch_size])
+                    for i in range(N)
+                ]
+            self.batch_list.extend(cur_batch_list)
+
+    def __repr__(self):
+        return (
+            f"{self.__class__.__name__}("
+            f"N-batch={len(self)}, "
+            f"batch_size={self.batch_size}, "
+            f"key_file={self.key_file})"
+        )
+
+    def __len__(self):
+        return len(self.batch_list)
+
+    def __iter__(self) -> Iterator[Tuple[str, ...]]:
+        return iter(self.batch_list)
diff --git a/funasr/schedulers/__init__.py b/funasr/schedulers/__init__.py
new file mode 100644
index 000000000..e69de29bb
diff --git a/funasr/schedulers/abs_scheduler.py b/funasr/schedulers/abs_scheduler.py
new file mode 100644
index 000000000..7395f259c
--- /dev/null
+++ b/funasr/schedulers/abs_scheduler.py
@@ -0,0 +1,84 @@
+from abc import ABC
+from abc import abstractmethod
+
+import torch.optim.lr_scheduler as L
+
+
+class AbsScheduler(ABC):
+    @abstractmethod
+    def step(self, epoch: int = None):
+        pass
+
+    @abstractmethod
+    def state_dict(self):
+        pass
+
+    @abstractmethod
+    def load_state_dict(self, state):
+        pass
+
+
+# If you need to define a custom scheduler, please inherit one of these classes
+class AbsBatchStepScheduler(AbsScheduler):
+    @abstractmethod
+    def step(self, epoch: int = None):
+        pass
+
+    @abstractmethod
+    def state_dict(self):
+        pass
+
+    @abstractmethod
+    def load_state_dict(self, state):
+        pass
+
+
+class AbsEpochStepScheduler(AbsScheduler):
+    @abstractmethod
+    def step(self, epoch: int = None):
+        pass
+
+    @abstractmethod
+    def state_dict(self):
+        pass
+
+    @abstractmethod
+    def load_state_dict(self, state):
+        pass
+
+
+class AbsValEpochStepScheduler(AbsEpochStepScheduler):
+    @abstractmethod
+    def step(self, val, epoch: int = None):
+        pass
+
+    @abstractmethod
+    def state_dict(self):
+        pass
+
+    @abstractmethod
+    def load_state_dict(self, state):
+        pass
+
+
+# Create alias type to check the type
+# Note(kamo): Currently PyTorch doesn't provide the base class
+# to judge these classes.
+AbsValEpochStepScheduler.register(L.ReduceLROnPlateau)
+for s in [
+    L.ReduceLROnPlateau,
+    L.LambdaLR,
+    L.StepLR,
+    L.MultiStepLR,
+    L.MultiStepLR,
+    L.ExponentialLR,
+    L.CosineAnnealingLR,
+]:
+    AbsEpochStepScheduler.register(s)
+
+AbsBatchStepScheduler.register(L.CyclicLR)
+for s in [
+    L.OneCycleLR,
+    L.CosineAnnealingWarmRestarts,
+]:
+    AbsBatchStepScheduler.register(s)
diff --git a/funasr/schedulers/noam_lr.py b/funasr/schedulers/noam_lr.py
new file mode 100644
index 000000000..80df019ac
--- /dev/null
+++ b/funasr/schedulers/noam_lr.py
@@ -0,0 +1,65 @@
+"""Noam learning rate scheduler module."""
+from typing import Union
+import warnings
+
+import torch
+from torch.optim.lr_scheduler import _LRScheduler
+from typeguard import check_argument_types
+
+from funasr.schedulers.abs_scheduler import AbsBatchStepScheduler
+
+
+class NoamLR(_LRScheduler, AbsBatchStepScheduler):
+    """The LR scheduler proposed by Noam
+
+    Ref:
+        "Attention Is All You Need", https://arxiv.org/pdf/1706.03762.pdf
+
+    FIXME(kamo): PyTorch doesn't provide _LRScheduler as a public class,
+     thus the behaviour isn't guaranteed in future PyTorch versions.
+
+    NOTE(kamo): The "model_size" in original implementation is derived from
+     the model, but in this implementation, this parameter is a constant value.
+     You need to change it if the model is changed.
+
+    """
+
+    def __init__(
+        self,
+        optimizer: torch.optim.Optimizer,
+        model_size: Union[int, float] = 320,
+        warmup_steps: Union[int, float] = 25000,
+        last_epoch: int = -1,
+    ):
+        assert check_argument_types()
+        self.model_size = model_size
+        self.warmup_steps = warmup_steps
+
+        lr = list(optimizer.param_groups)[0]["lr"]
+        new_lr = self.lr_for_WarmupLR(lr)
+        warnings.warn(
+            f"NoamLR is deprecated. "
+            f"Use WarmupLR(warmup_steps={warmup_steps}) with Optimizer(lr={new_lr})",
+        )
+
+        # The fields must be set before invoking super().__init__()
+        # because step(), and hence get_lr(), is also invoked in __init__()
+        super().__init__(optimizer, last_epoch)
+
+    def lr_for_WarmupLR(self, lr: float) -> float:
+        return lr / self.model_size**0.5 / self.warmup_steps**0.5
+
+    def __repr__(self):
+        return (
+            f"{self.__class__.__name__}(model_size={self.model_size}, "
+            f"warmup_steps={self.warmup_steps})"
+        )
+
+    def get_lr(self):
+        step_num = self.last_epoch + 1
+        return [
+            lr
+            * self.model_size**-0.5
+            * min(step_num**-0.5, step_num * self.warmup_steps**-1.5)
+            for lr in self.base_lrs
+        ]
diff --git a/funasr/schedulers/warmup_lr.py b/funasr/schedulers/warmup_lr.py
new file mode 100644
index 000000000..dbf3aca53
--- /dev/null
+++ b/funasr/schedulers/warmup_lr.py
@@ -0,0 +1,50 @@
+"""Warm up learning rate scheduler module."""
+from typing import Union
+
+import torch
+from torch.optim.lr_scheduler import _LRScheduler
+from typeguard import check_argument_types
+
+from funasr.schedulers.abs_scheduler import AbsBatchStepScheduler
+
+
+class WarmupLR(_LRScheduler, AbsBatchStepScheduler):
+    """The WarmupLR scheduler
+
+    This scheduler is almost the same as NoamLR except for the following difference:
+
+    NoamLR:
+        lr = optimizer.lr * model_size ** -0.5
+             * min(step ** -0.5, step * warmup_steps ** -1.5)
+    WarmupLR:
+        lr = optimizer.lr * warmup_steps ** 0.5
+             * min(step ** -0.5, step * warmup_steps ** -1.5)
+
+    Note that the maximum lr equals optimizer.lr in this scheduler.
+
+    """
+
+    def __init__(
+        self,
+        optimizer: torch.optim.Optimizer,
+        warmup_steps: Union[int, float] = 25000,
+        last_epoch: int = -1,
+    ):
+        assert check_argument_types()
+        self.warmup_steps = warmup_steps
+
+        # The fields must be set before invoking super().__init__()
+        # because step(), and hence get_lr(), is also invoked in __init__()
+        super().__init__(optimizer, last_epoch)
+
+    def __repr__(self):
+        return f"{self.__class__.__name__}(warmup_steps={self.warmup_steps})"
+
+    def get_lr(self):
+        step_num = self.last_epoch + 1
+        return [
+            lr
+            * self.warmup_steps**0.5
+            * min(step_num**-0.5, step_num * self.warmup_steps**-1.5)
+            for lr in self.base_lrs
+        ]
diff --git a/funasr/tasks/__init__.py b/funasr/tasks/__init__.py
new file mode 100644
index 000000000..e69de29bb
diff --git a/funasr/tasks/abs_task.py b/funasr/tasks/abs_task.py
new file mode 100644
index 000000000..5ea78c349
--- /dev/null
+++ b/funasr/tasks/abs_task.py
@@ -0,0 +1,1795 @@
+# Copyright ESPnet (https://github.com/espnet/espnet). All Rights Reserved.
+#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)
+
+"""Abstract task module."""
+import argparse
+import functools
+import logging
+import os
+import sys
+from abc import ABC
+from abc import abstractmethod
+from dataclasses import dataclass
+from distutils.version import LooseVersion
+from io import BytesIO
+from pathlib import Path
+from typing import Any
+from typing import Callable
+from typing import Dict
+from typing import List
+from typing import Optional
+from typing import Sequence
+from typing import Tuple
+from typing import Union
+
+import humanfriendly
+import numpy as np
+import torch
+import torch.multiprocessing
+import torch.nn
+import torch.optim
+import yaml
+from torch.utils.data import DataLoader
+from typeguard import check_argument_types
+from typeguard import check_return_type
+
+from funasr import __version__
+from funasr.datasets.dataset import AbsDataset
+from funasr.datasets.dataset import DATA_TYPES
+from funasr.datasets.dataset import ESPnetDataset
+from funasr.datasets.iterable_dataset import IterableESPnetDataset
+from funasr.iterators.abs_iter_factory import AbsIterFactory
+from funasr.iterators.chunk_iter_factory import ChunkIterFactory
+from funasr.iterators.multiple_iter_factory import MultipleIterFactory
+from funasr.iterators.sequence_iter_factory import SequenceIterFactory
+from funasr.optimizers.sgd import SGD
+from funasr.samplers.build_batch_sampler import BATCH_TYPES
+from funasr.samplers.build_batch_sampler import build_batch_sampler
+from funasr.samplers.unsorted_batch_sampler import UnsortedBatchSampler
+from funasr.schedulers.noam_lr import NoamLR
+from funasr.schedulers.warmup_lr import WarmupLR
+from funasr.torch_utils.load_pretrained_model import load_pretrained_model
+from funasr.torch_utils.model_summary import model_summary
+from funasr.torch_utils.pytorch_version import pytorch_cudnn_version
+from funasr.torch_utils.set_all_random_seed import set_all_random_seed
+from funasr.train.abs_espnet_model import AbsESPnetModel
+from funasr.train.class_choices import ClassChoices
+from funasr.train.distributed_utils import DistributedOption
+from funasr.train.trainer import Trainer
+from funasr.utils import config_argparse
+from funasr.utils.build_dataclass import build_dataclass
+from funasr.utils.cli_utils import get_commandline_args
+from funasr.utils.get_default_kwargs import get_default_kwargs
+from funasr.utils.nested_dict_action import NestedDictAction
+from funasr.utils.types import humanfriendly_parse_size_or_none
+from funasr.utils.types import int_or_none
+from funasr.utils.types import str2bool
+from funasr.utils.types import str2triple_str
+from funasr.utils.types import str_or_int
+from funasr.utils.types import str_or_none
+from funasr.utils.yaml_no_alias_safe_dump import yaml_no_alias_safe_dump
+
+try:
+    import wandb
+except Exception:
+    wandb = None
+
+if LooseVersion(torch.__version__) >= LooseVersion("1.5.0"):
+    pass
+else:
+    pass
+
+optim_classes = dict(
+    adam=torch.optim.Adam,
+    adamw=torch.optim.AdamW,
+    sgd=SGD,
+    adadelta=torch.optim.Adadelta,
+    adagrad=torch.optim.Adagrad,
+    adamax=torch.optim.Adamax,
+    asgd=torch.optim.ASGD,
+    lbfgs=torch.optim.LBFGS,
+    rmsprop=torch.optim.RMSprop,
+    rprop=torch.optim.Rprop,
+)
+if LooseVersion(torch.__version__) >= LooseVersion("1.10.0"):
+    # From 1.10.0, RAdam is officially supported
+    optim_classes.update(
+        radam=torch.optim.RAdam,
+    )
+try:
+    import torch_optimizer
+
+    optim_classes.update(
+        accagd=torch_optimizer.AccSGD,
+        adabound=torch_optimizer.AdaBound,
+        adamod=torch_optimizer.AdaMod,
+        diffgrad=torch_optimizer.DiffGrad,
+        lamb=torch_optimizer.Lamb,
+        novograd=torch_optimizer.NovoGrad,
+        pid=torch_optimizer.PID,
+        # torch_optimizer<=0.0.1a10 doesn't support
+        # qhadam=torch_optimizer.QHAdam,
+        qhm=torch_optimizer.QHM,
+        sgdw=torch_optimizer.SGDW,
+        yogi=torch_optimizer.Yogi,
+    )
+    if LooseVersion(torch_optimizer.__version__) < LooseVersion("0.2.0"):
+        # From 0.2.0, RAdam is dropped
+        optim_classes.update(
+            radam=torch_optimizer.RAdam,
+        )
+    del torch_optimizer
+except ImportError:
+    pass
+try:
+    import apex
+
+    optim_classes.update(
+        fusedadam=apex.optimizers.FusedAdam,
+        fusedlamb=apex.optimizers.FusedLAMB,
+        fusednovograd=apex.optimizers.FusedNovoGrad,
+        fusedsgd=apex.optimizers.FusedSGD,
+    )
+    del apex
+except ImportError:
+    pass
+try:
+    import fairscale
+except ImportError:
+    fairscale = None
+
+scheduler_classes = dict(
+    ReduceLROnPlateau=torch.optim.lr_scheduler.ReduceLROnPlateau,
+    lambdalr=torch.optim.lr_scheduler.LambdaLR,
+    steplr=torch.optim.lr_scheduler.StepLR,
+    multisteplr=torch.optim.lr_scheduler.MultiStepLR,
+    exponentiallr=torch.optim.lr_scheduler.ExponentialLR,
+    CosineAnnealingLR=torch.optim.lr_scheduler.CosineAnnealingLR,
+    noamlr=NoamLR,
+    warmuplr=WarmupLR,
+    cycliclr=torch.optim.lr_scheduler.CyclicLR,
+    onecyclelr=torch.optim.lr_scheduler.OneCycleLR,
+    CosineAnnealingWarmRestarts=torch.optim.lr_scheduler.CosineAnnealingWarmRestarts,
+)
+# To lower keys
+optim_classes = {k.lower(): v for k, v in optim_classes.items()}
+scheduler_classes = {k.lower(): v for k, v in scheduler_classes.items()}
+
+
+@dataclass
+class IteratorOptions:
+    preprocess_fn: callable
+    collate_fn: callable
+    data_path_and_name_and_type: list
+    shape_files: list
+    batch_size: int
+    batch_bins: int
+    batch_type: str
+    max_cache_size: float
+    max_cache_fd: int
+    distributed: bool
+    num_batches: Optional[int]
+    num_iters_per_epoch: Optional[int]
+    train: bool
+
+
+class AbsTask(ABC):
+    # Use @staticmethod, or @classmethod,
+    # instead of instance method to avoid God classes
+
+    # If you need more than one optimizer, change this value in the subclass
+    num_optimizers: int = 1
+    trainer = Trainer
+    class_choices_list: List[ClassChoices] = []
+
+    def __init__(self):
+        raise RuntimeError("This class can't be instantiated.")
+
+    @classmethod
+    @abstractmethod
+    def add_task_arguments(cls, parser: argparse.ArgumentParser):
+        pass
+
+    @classmethod
+    @abstractmethod
+    def build_collate_fn(
+            cls, args: argparse.Namespace, train: bool
+    ) -> Callable[[Sequence[Dict[str, np.ndarray]]], Dict[str, torch.Tensor]]:
+        """Return "collate_fn", which is a callable object and given to DataLoader.
+
+        >>> from torch.utils.data import DataLoader
+        >>> loader = DataLoader(collate_fn=cls.build_collate_fn(args, train=True), ...)
+
+        In many cases, you can use our common collate_fn.
+        """
+        raise NotImplementedError
+
+    @classmethod
+    @abstractmethod
+    def build_preprocess_fn(
+            cls, args: argparse.Namespace, train: bool
+    ) -> Optional[Callable[[str, Dict[str, np.ndarray]], Dict[str, np.ndarray]]]:
+        raise NotImplementedError
+
+    @classmethod
+    @abstractmethod
+    def required_data_names(
+            cls, train: bool = True, inference: bool = False
+    ) -> Tuple[str, ...]:
+        """Define the required names by Task
+
+        This function is used by
+        >>> cls.check_task_requirements()
+        If your model is defined as follows,
+
+        >>> from funasr.train.abs_espnet_model import AbsESPnetModel
+        >>> class Model(AbsESPnetModel):
+        ...     def forward(self, input, output, opt=None):  pass
+
+        then "required_data_names" should be as
+
+        >>> required_data_names = ('input', 'output')
+        """
+        raise NotImplementedError
+
+    @classmethod
+    @abstractmethod
+    def optional_data_names(
+            cls, train: bool = True, inference: bool = False
+    ) -> Tuple[str, ...]:
+        """Define the optional names by Task
+
+        This function is used by
+        >>> cls.check_task_requirements()
+        If your model is defined as follows,
+
+        >>> from funasr.train.abs_espnet_model import AbsESPnetModel
+        >>> class Model(AbsESPnetModel):
+        ...     def forward(self, input, output, opt=None):  pass
+
+        then "optional_data_names" should be as
+
+        >>> optional_data_names = ('opt',)
+        """
+        raise NotImplementedError
+
+    @classmethod
+    @abstractmethod
+    def build_model(cls, args: argparse.Namespace) -> AbsESPnetModel:
+        raise NotImplementedError
+
+    @classmethod
+    def get_parser(cls) -> config_argparse.ArgumentParser:
+        assert check_argument_types()
+
+        class ArgumentDefaultsRawTextHelpFormatter(
+            argparse.RawTextHelpFormatter,
+            argparse.ArgumentDefaultsHelpFormatter,
+        ):
+            pass
+
+        parser = config_argparse.ArgumentParser(
+            description="base parser",
+            formatter_class=ArgumentDefaultsRawTextHelpFormatter,
+        )
+
+        # NOTE(kamo): Use '_' instead of '-' to avoid confusion.
+        #  I think '-' looks really confusing if it's written in yaml.
+
+        # NOTE(kamo): add_arguments(..., required=True) can't be used
+        #  if --print_config mode is to be supported. Instead, do as follows:
+        parser.set_defaults(required=["output_dir"])
+
+        group = parser.add_argument_group("Common configuration")
+
+        group.add_argument(
+            "--print_config",
+            action="store_true",
+            help="Print the config file and exit",
+        )
+        group.add_argument(
+            "--log_level",
+            type=lambda x: x.upper(),
+            default="INFO",
+            choices=("ERROR", "WARNING", "INFO", "DEBUG", "NOTSET"),
+            help="The verbose level of logging",
+        )
+        group.add_argument(
+            "--dry_run",
+            type=str2bool,
+            default=False,
+            help="Perform process without training",
+        )
+        group.add_argument(
+            "--iterator_type",
+            type=str,
+            choices=["sequence", "chunk", "task", "none"],
+            default="sequence",
+            help="Specify iterator type",
+        )
+
+        group.add_argument("--output_dir", type=str_or_none, default=None)
+        group.add_argument(
+            "--ngpu",
+            type=int,
+            default=0,
+            help="The number of gpus. 0 indicates CPU mode",
+        )
+        group.add_argument("--seed", type=int, default=0, help="Random seed")
+        group.add_argument(
+            "--num_workers",
+            type=int,
+            default=1,
+            help="The number of workers used for DataLoader",
+        )
+        group.add_argument(
+            "--num_att_plot",
+            type=int,
+            default=3,
+            help="The number of images to plot the outputs from attention. "
+                 "This option makes sense only for attention-based models. "
+                 "The attention plot can also be disabled by setting it to 0",
+        )
+
+        group = parser.add_argument_group("distributed training related")
+        group.add_argument(
+            "--dist_backend",
+            default="nccl",
+            type=str,
+            help="distributed backend",
+        )
+        group.add_argument(
+            "--dist_init_method",
+            type=str,
+            default="env://",
+            help='if init_method="env://", env values of "MASTER_PORT", "MASTER_ADDR", '
+                 '"WORLD_SIZE", and "RANK" are referred.',
+        )
+        group.add_argument(
+            "--dist_world_size",
+            default=None,
+            type=int_or_none,
+            help="number of nodes for distributed training",
+        )
+        group.add_argument(
+            "--dist_rank",
+            type=int_or_none,
+            default=None,
+            help="node rank for distributed training",
+        )
+        group.add_argument(
+            # Not starting with "dist_" for compatibility to launch.py
+            "--local_rank",
+            type=int_or_none,
+            default=None,
+            help="local rank for distributed training. This option is used if "
+                 "--multiprocessing_distributed=false",
+        )
+        group.add_argument(
+            "--dist_master_addr",
+            default=None,
+            type=str_or_none,
+            help="The master address for distributed training. "
+                 "This value is used when dist_init_method == 'env://'",
+        )
+        group.add_argument(
+            "--dist_master_port",
+            default=None,
+            type=int_or_none,
+            help="The master port for distributed training. "
+                 "This value is used when dist_init_method == 'env://'",
+        )
+        group.add_argument(
+            "--dist_launcher",
+            default=None,
+            type=str_or_none,
+            choices=["slurm", "mpi", None],
+            help="The launcher type for distributed training",
+        )
+        group.add_argument(
+            "--multiprocessing_distributed",
+            default=False,
+            type=str2bool,
+            help="Use multi-processing distributed training to launch "
+                 "N processes per node, which has N GPUs. This is the "
+                 "fastest way to use PyTorch for either single node or "
+                 "multi node data parallel training",
+        )
+        group.add_argument(
+            "--unused_parameters",
+            type=str2bool,
+            default=False,
+            help="Whether to use the find_unused_parameters in "
+                 "torch.nn.parallel.DistributedDataParallel ",
+        )
+        group.add_argument(
+            "--sharded_ddp",
+            default=False,
+            type=str2bool,
+            help="Enable sharded training provided by fairscale",
+        )
+
+        group = parser.add_argument_group("cudnn mode related")
+        group.add_argument(
+            "--cudnn_enabled",
+            type=str2bool,
+            default=torch.backends.cudnn.enabled,
+            help="Enable CUDNN",
+        )
+        group.add_argument(
+            "--cudnn_benchmark",
+            type=str2bool,
+            default=torch.backends.cudnn.benchmark,
+            help="Enable cudnn-benchmark mode",
+        )
+        group.add_argument(
+            "--cudnn_deterministic",
+            type=str2bool,
+            default=True,
+            help="Enable cudnn-deterministic mode",
+        )
+
+        group = parser.add_argument_group("collect stats mode related")
+        group.add_argument(
+            "--collect_stats",
+            type=str2bool,
+            default=False,
+            help='Run in "collect stats" mode',
+        )
+        group.add_argument(
+            "--write_collected_feats",
+            type=str2bool,
+            default=False,
+            help='Write the output features from the model in "collect stats" mode',
+        )
+
+        group = parser.add_argument_group("Trainer related")
+        group.add_argument(
+            "--max_epoch",
+            type=int,
+            default=40,
+            help="The maximum number of epochs to train",
+        )
+        group.add_argument(
+            "--max_update",
+            type=int,
+            default=sys.maxsize,
+            help="The maximum number of update steps to train",
+        )
+        group.add_argument(
+            "--patience",
+            type=int_or_none,
+            default=None,
+            help="Number of epochs to wait without improvement "
+                 "before stopping the training",
+        )
+        group.add_argument(
+            "--val_scheduler_criterion",
+            type=str,
+            nargs=2,
+            default=("valid", "loss"),
+            help="The criterion used for the value given to the lr scheduler. "
+                 'Give a pair referring to the phase, "train" or "valid", '
+                 'and the criterion name. The mode specifying "min" or "max" can '
+                 "be changed by --scheduler_conf",
+        )
+        group.add_argument(
+            "--early_stopping_criterion",
+            type=str,
+            nargs=3,
+            default=("valid", "loss", "min"),
+            help="The criterion used for judging early stopping. "
+                 'Give three values: the phase, "train" or "valid", '
+                 'the criterion name, and the mode, "min" or "max", e.g. valid acc max.',
+        )
+        group.add_argument(
+            "--best_model_criterion",
+            type=str2triple_str,
+            nargs="+",
+            default=[
+                ("train", "loss", "min"),
+                ("valid", "loss", "min"),
+                ("train", "acc", "max"),
+                ("valid", "acc", "max"),
+            ],
+            help="The criterion used for judging the best model. "
+                 'Give a triple referring to the phase, "train" or "valid", '
+                 'the criterion name, and the mode, "min" or "max", e.g. "valid,acc,max".',
+        )
+        group.add_argument(
+            "--keep_nbest_models",
+            type=int,
+            nargs="+",
+            default=[10],
+            help="Remove previous snapshots excluding the n-best scored epochs",
+        )
+        group.add_argument(
+            "--nbest_averaging_interval",
+            type=int,
+            default=0,
+            help="The epoch interval to apply model averaging and save nbest models",
+        )
+        group.add_argument(
+            "--grad_clip",
+            type=float,
+            default=5.0,
+            help="Gradient norm threshold to clip",
+        )
+        group.add_argument(
+            "--grad_clip_type",
+            type=float,
+            default=2.0,
+            help="The type of the used p-norm for gradient clip. Can be inf",
+        )
+        group.add_argument(
+            "--grad_noise",
+            type=str2bool,
+            default=False,
+            help="The flag to switch to use noise injection to "
+                 "gradients during training",
+        )
+        group.add_argument(
+            "--accum_grad",
+            type=int,
+            default=1,
+            help="The number of gradient accumulation steps",
+        )
+        group.add_argument(
+            "--no_forward_run",
+            type=str2bool,
+            default=False,
+            help="Only iterate over the data loading without "
+                 "model forwarding and training",
+        )
+        group.add_argument(
+            "--resume",
+            type=str2bool,
+            default=False,
+            help="Enable resuming if a checkpoint exists",
+        )
+        group.add_argument(
+            "--train_dtype",
+            default="float32",
+            choices=["float16", "float32", "float64"],
+            help="Data type for training.",
+        )
+        group.add_argument(
+            "--use_amp",
+            type=str2bool,
+            default=False,
+            help="Enable Automatic Mixed Precision. This feature requires pytorch>=1.6",
+        )
+        group.add_argument(
+            "--log_interval",
+            type=int_or_none,
+            default=None,
+            help="Show the logs every given number of iterations in each epoch during the "
+                 "training phase. If None is given, it is decided automatically according "
+                 "to the number of training samples.",
+        )
+        group.add_argument(
+            "--use_tensorboard",
+            type=str2bool,
+            default=True,
+            help="Enable tensorboard logging",
+        )
+        group.add_argument(
+            "--use_wandb",
+            type=str2bool,
+            default=False,
+            help="Enable wandb logging",
+        )
+        group.add_argument(
+            "--wandb_project",
+            type=str,
+            default=None,
+            help="Specify wandb project",
+        )
+        group.add_argument(
+            "--wandb_id",
+            type=str,
+            default=None,
+            help="Specify wandb id",
+        )
+        group.add_argument(
+            "--wandb_entity",
+            type=str,
+            default=None,
+            help="Specify wandb entity",
+        )
+        group.add_argument(
+            "--wandb_name",
+            type=str,
+            default=None,
+            help="Specify wandb run name",
+        )
+        group.add_argument(
+            "--wandb_model_log_interval",
+            type=int,
+            default=-1,
+            help="Set the model log period",
+        )
+        group.add_argument(
+            "--detect_anomaly",
+            type=str2bool,
+            default=False,
+            help="Set torch.autograd.set_detect_anomaly",
+        )
+
+        group = parser.add_argument_group("Pretraining model related")
+        group.add_argument("--pretrain_path", help="This option is obsolete")
+        group.add_argument(
+            "--init_param",
+            type=str,
+            default=[],
+            nargs="*",
+            help="Specify the file path used for initialization of parameters. "
+                 "The format is '<file_path>:<src_key>:<dst_key>:<exclude_keys>', "
+                 "where file_path is the model file path, "
+                 "src_key specifies the key of model states to be used in the model file, "
+                 "dst_key specifies the attribute of the model to be initialized, "
+                 "and exclude_keys excludes keys of model states "
+                 "from the initialization. e.g.\n"
+                 "  # Load all parameters\n"
+                 "  --init_param some/where/model.pth\n"
+                 "  # Load only decoder parameters\n"
+                 "  --init_param some/where/model.pth:decoder:decoder\n"
+                 "  # Load only decoder parameters excluding decoder.embed\n"
+                 "  --init_param "
+                 "some/where/model.pth:decoder:decoder:decoder.embed\n",
+        )
+        group.add_argument(
+            "--ignore_init_mismatch",
+            type=str2bool,
+            default=False,
+            help="Ignore size mismatch when loading pre-trained model",
+        )
+        group.add_argument(
+            "--freeze_param",
+            type=str,
+            default=[],
+            nargs="*",
+            help="Freeze parameters",
+        )
+
+        group = parser.add_argument_group("BatchSampler related")
+        group.add_argument(
+            "--num_iters_per_epoch",
+            type=int_or_none,
+            default=None,
+            help="Restrict the number of iterations for training per epoch",
+        )
+        group.add_argument(
+            "--batch_size",
+            type=int,
+            default=20,
+            help="The mini-batch size used for training. Used if batch_type='unsorted',"
+                 " 'sorted', or 'folded'.",
+        )
+        group.add_argument(
+            "--valid_batch_size",
+            type=int_or_none,
+            default=None,
+            help="If not given, the value of --batch_size is used",
+        )
+        group.add_argument(
+            "--batch_bins",
+            type=int,
+            default=1000000,
+            help="The number of batch bins. Used if batch_type='length' or 'numel'",
+        )
+        group.add_argument(
+            "--valid_batch_bins",
+            type=int_or_none,
+            default=None,
+            help="If not given, the value of --batch_bins is used",
+        )
+
+        group.add_argument("--train_shape_file", type=str, action="append", default=[])
+        group.add_argument("--valid_shape_file", type=str, action="append", default=[])
+
+        group = parser.add_argument_group("Sequence iterator related")
+        _batch_type_help = ""
+        for key, value in BATCH_TYPES.items():
+            _batch_type_help += f'"{key}":\n{value}\n'
+        group.add_argument(
+            "--batch_type",
+            type=str,
+            default="folded",
+            choices=list(BATCH_TYPES),
+            help=_batch_type_help,
+        )
+        group.add_argument(
+            "--valid_batch_type",
+            type=str_or_none,
+            default=None,
+            choices=list(BATCH_TYPES) + [None],
+            help="If not given, the value of --batch_type is used",
+        )
+        group.add_argument("--fold_length", type=int, action="append", default=[])
+        group.add_argument(
+            "--sort_in_batch",
+            type=str,
+            default="descending",
+            choices=["descending", "ascending"],
+            help="Sort the samples in each mini-batch by the sample "
+                 'lengths. To enable this, "shape_file" must have the length information.',
+        )
+        group.add_argument(
+            "--sort_batch",
+            type=str,
+            default="descending",
+            choices=["descending", "ascending"],
+            help="Sort mini-batches by the sample lengths",
+        )
+        group.add_argument(
+            "--multiple_iterator",
+            type=str2bool,
+            default=False,
+            help="Use multiple iterator mode",
+        )
+
+        group = parser.add_argument_group("Chunk iterator related")
+        group.add_argument(
+            "--chunk_length",
+            type=str_or_int,
+            default=500,
+            help="Specify chunk length. e.g. '300', '300,400,500', or '300-400'. "
+                 "If multiple numbers separated by commas are given, "
+                 "one of them is selected randomly for each sample. "
+                 "If two numbers are given with '-', it indicates the range of the choices. "
+                 "Note that if the sequence length is shorter than all chunk_lengths, "
+                 "the sample is discarded. ",
+        )
+        group.add_argument(
+            "--chunk_shift_ratio",
+            type=float,
+            default=0.5,
+            help="Specify the shift width of chunks. If it's less than 1, "
+                 "chunks overlap, and if it's greater than 1, there are gaps "
+                 "between chunks.",
+        )
+        group.add_argument(
+            "--num_cache_chunks",
+            type=int,
+            default=1024,
+            help="Shuffle within the specified number of chunks and generate mini-batches. "
+                 "The larger this value, the more randomness is obtained.",
+        )
+
+        group = parser.add_argument_group("Dataset related")
+        _data_path_and_name_and_type_help = (
+            "Give three words separated by commas. It's used for the training data. "
+            "e.g. '--train_data_path_and_name_and_type some/path/a.scp,foo,sound'. "
+            "The first value, some/path/a.scp, indicates the file path, "
+            "and the second, foo, is the key name used for the mini-batch data, "
+            "and the last, sound, decides the file type. "
+            "This option is repeatable, so you can input any number of features "
+            "for your task. Supported file types are as follows:\n\n"
+        )
+        for key, dic in DATA_TYPES.items():
+            _data_path_and_name_and_type_help += f'"{key}":\n{dic["help"]}\n\n'
+
+        # for large dataset
+        group.add_argument(
+            "--dataset_type",
+            type=str,
+            default="small",
+            help='Dataset type, "small" or "large". Use "large" for the large-dataset dataloader',
+        )
+        group.add_argument(
+            "--dataset_conf",
+            action=NestedDictAction,
+            default=dict(),
+            help="The keyword arguments for dataset",
+        )
+        group.add_argument(
+            "--train_data_file",
+            type=str,
+            default=None,
+            help="train_list for large dataset",
+        )
+        group.add_argument(
+            "--valid_data_file",
+            type=str,
+            default=None,
+            help="valid_list for large dataset",
+        )
+
+        group.add_argument(
+            "--train_data_path_and_name_and_type",
+            type=str2triple_str,
+            action="append",
+            default=[],
+            help=_data_path_and_name_and_type_help,
+        )
+        group.add_argument(
+            "--valid_data_path_and_name_and_type",
+            type=str2triple_str,
+            action="append",
+            default=[],
+        )
+        group.add_argument(
+            "--allow_variable_data_keys",
+            type=str2bool,
+            default=False,
+            help="Allow arbitrary keys in the mini-batch, ignoring "
+                 "the task requirements",
+        )
+        group.add_argument(
+            "--max_cache_size",
+            type=humanfriendly.parse_size,
+            default=0.0,
+            help="The maximum cache size for data loader. e.g. 10MB, 20GB.",
+        )
+        group.add_argument(
+            "--max_cache_fd",
+            type=int,
+            default=32,
+            help="The maximum number of file descriptors to be kept "
+                 "as opened for ark files. "
+                 "This feature is only valid when data type is 'kaldi_ark'.",
+        )
+        group.add_argument(
+            "--valid_max_cache_size",
+            type=humanfriendly_parse_size_or_none,
+            default=None,
+            help="The maximum cache size for validation data loader. e.g. 10MB, 20GB. "
+                 "If None, 5 percent of --max_cache_size is used",
+        )
+
+        group = parser.add_argument_group("Optimizer related")
+        for i in range(1, cls.num_optimizers + 1):
+            suf = "" if i == 1 else str(i)
+            group.add_argument(
+                f"--optim{suf}",
+                type=lambda x: x.lower(),
+                default="adadelta",
+                choices=list(optim_classes),
+                help="The optimizer type",
+            )
+            group.add_argument(
+                f"--optim{suf}_conf",
+                action=NestedDictAction,
+                default=dict(),
+                help="The keyword arguments for optimizer",
+            )
+            group.add_argument(
+                f"--scheduler{suf}",
+                type=lambda x: str_or_none(x.lower()),
+                default=None,
+                choices=list(scheduler_classes) + [None],
+                help="The lr scheduler type",
+            )
+            group.add_argument(
+                f"--scheduler{suf}_conf",
+                action=NestedDictAction,
+                default=dict(),
+                help="The keyword arguments for lr scheduler",
+            )
+
+        # for training on PAI
+        group = parser.add_argument_group("PAI training related")
+        group.add_argument(
+            "--use_pai",
+            type=str2bool,
+            default=False,
+            help="flag to indicate whether training on PAI",
+        )
+        group.add_argument(
+            "--num_worker_count",
+            type=int,
+            default=1,
+            help="The number of machines on PAI.",
+        )
+        group.add_argument(
+            "--access_key_id",
+            type=str,
+            default=None,
+            help="The username for oss.",
+        )
+        group.add_argument(
+            "--access_key_secret",
+            type=str,
+            default=None,
+            help="The password for oss.",
+        )
+        group.add_argument(
+            "--endpoint",
+            type=str,
+            default=None,
+            help="The endpoint for oss.",
+        )
+        group.add_argument(
+            "--bucket_name",
+            type=str,
+            default=None,
+            help="The bucket name for oss.",
+        )
+        group.add_argument(
+            "--oss_bucket",
+            default=None,
+            help="oss bucket.",
+        )
+
+        cls.trainer.add_arguments(parser)
+        cls.add_task_arguments(parser)
+
+        assert check_return_type(parser)
+        return parser
+
+    @classmethod
+    def build_optimizers(
+            cls,
+            args: argparse.Namespace,
+            model: torch.nn.Module,
+    ) -> List[torch.optim.Optimizer]:
+        if cls.num_optimizers != 1:
+            raise RuntimeError(
+                "build_optimizers() must be overridden if num_optimizers != 1"
+            )
+
+        optim_class = optim_classes.get(args.optim)
+        if optim_class is None:
+            raise ValueError(f"must be one of {list(optim_classes)}: {args.optim}")
+        if args.sharded_ddp:
+            if fairscale is None:
+                raise RuntimeError("Requiring fairscale. Do 'pip install fairscale'")
+            optim = fairscale.optim.oss.OSS(
+                params=model.parameters(), optim=optim_class, **args.optim_conf
+            )
+        else:
+            optim = optim_class(model.parameters(), **args.optim_conf)
+
+        optimizers = [optim]
+        return optimizers
+
+    @classmethod
+    def exclude_opts(cls) -> Tuple[str, ...]:
+        """The options not to be shown by --print_config"""
+        return "required", "print_config", "config", "ngpu"
+
+    @classmethod
+    def get_default_config(cls) -> Dict[str, Any]:
+        """Return the configuration as dict.
+
+        This method is used by print_config()
+        """
+
+        def get_class_type(name: str, classes: dict):
+            _cls = classes.get(name)
+            if _cls is None:
+                raise ValueError(f"must be one of {list(classes)}: {name}")
+            return _cls
+
+        # This method is used only for --print_config
+        assert check_argument_types()
+        parser = cls.get_parser()
+        args, _ = parser.parse_known_args()
+        config = vars(args)
+        # Excludes the options not to be shown
+        for k in cls.exclude_opts():
+            config.pop(k)
+
+        for i in range(1, cls.num_optimizers + 1):
+            suf = "" if i == 1 else str(i)
+            name = config[f"optim{suf}"]
+            optim_class = get_class_type(name, optim_classes)
+            conf = get_default_kwargs(optim_class)
+            # Overwrite the default by the arguments,
+            conf.update(config[f"optim{suf}_conf"])
+            # and set it again
+            config[f"optim{suf}_conf"] = conf
+
+            name = config[f"scheduler{suf}"]
+            if name is not None:
+                scheduler_class = get_class_type(name, scheduler_classes)
+                conf = get_default_kwargs(scheduler_class)
+                # Overwrite the default by the arguments,
+                conf.update(config[f"scheduler{suf}_conf"])
+                # and set it again
+                config[f"scheduler{suf}_conf"] = conf
+
+        for class_choices in cls.class_choices_list:
+            if getattr(args, class_choices.name) is not None:
+                class_obj = class_choices.get_class(getattr(args, class_choices.name))
+                conf = get_default_kwargs(class_obj)
+                name = class_choices.name
+                # Overwrite the default by the arguments,
+                conf.update(config[f"{name}_conf"])
+                # and set it again
+                config[f"{name}_conf"] = conf
+        return config
+
+    @classmethod
+    def check_required_command_args(cls, args: argparse.Namespace):
+        assert check_argument_types()
+        for k in vars(args):
+            if "-" in k:
+                raise RuntimeError(f'Use "_" instead of "-": parser.get_parser("{k}")')
+
+        required = ", ".join(
+            f"--{a}" for a in args.required if getattr(args, a) is None
+        )
+
+        if len(required) != 0:
+            parser = cls.get_parser()
+            parser.print_help(file=sys.stderr)
+            p = Path(sys.argv[0]).name
+            print(file=sys.stderr)
+            print(
+                f"{p}: error: the following arguments are required: " f"{required}",
+                file=sys.stderr,
+            )
+            sys.exit(2)
+
+    @classmethod
+    def check_task_requirements(
+            cls,
+            dataset: Union[AbsDataset, IterableESPnetDataset],
+            allow_variable_data_keys: bool,
+            train: bool,
+            inference: bool = False,
+    ) -> None:
+        """Check if the dataset satisfies the requirements of the current Task"""
+        assert check_argument_types()
+        mes = (
+            f"If you intend to use an additional input, modify "
+            f'"{cls.__name__}.required_data_names()" or '
+            f'"{cls.__name__}.optional_data_names()". '
+            f"Otherwise you need to set --allow_variable_data_keys true "
+        )
+
+        for k in cls.required_data_names(train, inference):
+            if not dataset.has_name(k):
+                raise RuntimeError(
+                    f'"{cls.required_data_names(train, inference)}" are required for'
+                    f' {cls.__name__}, but "{dataset.names()}" are given.\n{mes}'
+                )
+        if not allow_variable_data_keys:
+            task_keys = cls.required_data_names(
+                train, inference
+            ) + cls.optional_data_names(train, inference)
+            for k in dataset.names():
+                if k not in task_keys:
+                    raise RuntimeError(
+                        f"The data-name must be one of {task_keys} "
+                        f'for {cls.__name__}: "{k}" is not allowed.\n{mes}'
+                    )
+
+    @classmethod
+    def print_config(cls, file=sys.stdout) -> None:
+        assert check_argument_types()
+        # Shows the config: e.g. python train.py asr --print_config
+        config = cls.get_default_config()
+        file.write(yaml_no_alias_safe_dump(config, indent=4, sort_keys=False))
+
+    @classmethod
+    def main(cls, args: argparse.Namespace = None, cmd: Sequence[str] = None):
+        assert check_argument_types()
+        print(get_commandline_args(), file=sys.stderr)
+        if args is None:
+            parser = cls.get_parser()
+            args = parser.parse_args(cmd)
+        args.version = __version__
+        if args.pretrain_path is not None:
+            raise RuntimeError("--pretrain_path is deprecated. Use --init_param")
+        if args.print_config:
+            cls.print_config()
+            sys.exit(0)
+        cls.check_required_command_args(args)
+
+        if not args.distributed or not args.multiprocessing_distributed:
+            cls.main_worker(args)
+        else:
+            assert args.ngpu > 1
+            cls.main_worker(args)
+
+    @classmethod
+    def main_worker(cls, args: argparse.Namespace):
+        assert check_argument_types()
+
+        # 0. Init distributed process
+        distributed_option = build_dataclass(DistributedOption, args)
+        # Setting distributed_option.dist_rank, etc.
+        if args.use_pai:
+            distributed_option.init_options_pai()
+        else:
+            distributed_option.init_options()
+
+        # NOTE(kamo): Don't use logging before invoking logging.basicConfig()
+        if not distributed_option.distributed or distributed_option.dist_rank == 0:
+            if not distributed_option.distributed:
+                _rank = ""
+            else:
+                _rank = (
+                    f":{distributed_option.dist_rank}/"
+                    f"{distributed_option.dist_world_size}"
+                )
+
+            # NOTE(kamo):
+            # logging.basicConfig() is invoked in main_worker() instead of main()
+            # because it can be invoked only once in a process.
+            # FIXME(kamo): Should we use logging.getLogger()?
+            logging.basicConfig(
+                level=args.log_level,
+                format=f"[{os.uname()[1].split('.')[0]}]"
+                       f" %(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s",
+            )
+        else:
+            # Suppress logging if RANK != 0
+            logging.basicConfig(
+                level="ERROR",
+                format=f"[{os.uname()[1].split('.')[0]}]"
+                       f" %(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s",
+            )
+        # Invoking torch.distributed.init_process_group
+        if args.use_pai:
+            distributed_option.init_torch_distributed_pai(args)
+        else:
+            distributed_option.init_torch_distributed(args)
+
+        # 1. Set random-seed
+        set_all_random_seed(args.seed)
+        torch.backends.cudnn.enabled = args.cudnn_enabled
+        torch.backends.cudnn.benchmark = args.cudnn_benchmark
+        torch.backends.cudnn.deterministic = args.cudnn_deterministic
+        if args.detect_anomaly:
+            logging.info("Invoking torch.autograd.set_detect_anomaly(True)")
+            torch.autograd.set_detect_anomaly(args.detect_anomaly)
+
+        # 2. Build model
+        model = cls.build_model(args=args)
+        if not isinstance(model, AbsESPnetModel):
+            raise RuntimeError(
+                f"model must inherit {AbsESPnetModel.__name__}, but got {type(model)}"
+            )
+        model = model.to(
+            dtype=getattr(torch, args.train_dtype),
+            device="cuda" if args.ngpu > 0 else "cpu",
+        )
+        for t in args.freeze_param:
+            for k, p in model.named_parameters():
+                if k.startswith(t + ".") or k == t:
+                    logging.info(f"Setting {k}.requires_grad = False")
+                    p.requires_grad = False
+
+        # 3. Build optimizer
+        optimizers = cls.build_optimizers(args, model=model)
+
+        # 4. Build schedulers
+        schedulers = []
+        for i, optim in enumerate(optimizers, 1):
+            suf = "" if i == 1 else str(i)
+            name = getattr(args, f"scheduler{suf}")
+            conf = getattr(args, f"scheduler{suf}_conf")
+            if name is not None:
+                cls_ = scheduler_classes.get(name)
+                if cls_ is None:
+                    raise ValueError(
+                        f"must be one of {list(scheduler_classes)}: {name}"
+                    )
+                scheduler = cls_(optim, **conf)
+            else:
+                scheduler = None
+
+            schedulers.append(scheduler)
+
+        logging.info(pytorch_cudnn_version())
+        logging.info(model_summary(model))
+        for i, (o, s) in enumerate(zip(optimizers, schedulers), 1):
+            suf = "" if i == 1 else str(i)
+            logging.info(f"Optimizer{suf}:\n{o}")
+            logging.info(f"Scheduler{suf}: {s}")
+
+        # 5. Dump "args" to config.yaml
+        # NOTE(kamo): "args" should be saved after object-buildings are done
+        #  because they are allowed to modify "args".
+        output_dir = Path(args.output_dir)
+        if not distributed_option.distributed or distributed_option.dist_rank == 0:
+            output_dir.mkdir(parents=True, exist_ok=True)
+            with (output_dir / "config.yaml").open("w", encoding="utf-8") as f:
+                logging.info(
+                    f'Saving the configuration in {output_dir / "config.yaml"}'
+                )
+                if args.use_pai:
+                    buffer = BytesIO()
+                    torch.save({"config": vars(args)}, buffer)
+                    args.oss_bucket.put_object(os.path.join(args.output_dir, "config.dict"), buffer.getvalue())
+                else:
+                    yaml_no_alias_safe_dump(vars(args), f, indent=4, sort_keys=False)
+
+        if args.dry_run:
+            pass
+        else:
+            logging.info("Training args: {}".format(args))
+            # 6. Loads pre-trained model
+            for p in args.init_param:
+                logging.info(f"Loading pretrained params from {p}")
+                load_pretrained_model(
+                    model=model,
+                    init_param=p,
+                    ignore_init_mismatch=args.ignore_init_mismatch,
+                    # NOTE(kamo): "cuda" for torch.load always indicates cuda:0
+                    #   in PyTorch<=1.4
+                    map_location=f"cuda:{torch.cuda.current_device()}"
+                    if args.ngpu > 0
+                    else "cpu",
+                    oss_bucket=args.oss_bucket,
+                )
+
+            # 7. Build iterator factories
+            if args.dataset_type == "large":
+                from funasr.datasets.large_datasets.build_dataloader import ArkDataLoader
+                train_iter_factory = ArkDataLoader(args.train_data_file, args.token_list,
+                                                   args.config, mode="train")
+                valid_iter_factory = ArkDataLoader(args.valid_data_file, args.token_list,
+                                                   args.config, mode="eval")
+            elif args.dataset_type == "small":
+                train_iter_factory = cls.build_iter_factory(
+                    args=args,
+                    distributed_option=distributed_option,
+                    mode="train",
+                )
+                valid_iter_factory = cls.build_iter_factory(
+                    args=args,
+                    distributed_option=distributed_option,
+                    mode="valid",
+                )
+            else:
+                raise ValueError(f"Not supported dataset_type={args.dataset_type}")
+
+            if args.scheduler == "tri_stage":
+                for scheduler in schedulers:
+                    scheduler.init_tri_stage_scheudler(max_update=args.max_update)
+
+            # 8. Start training
+            if args.use_wandb:
+                if wandb is None:
+                    raise RuntimeError("Please install wandb")
+
+                try:
+                    wandb.login()
+                except wandb.errors.UsageError:
+                    logging.info("wandb not configured! run `wandb login` to enable")
+                    args.use_wandb = False
+
+            if args.use_wandb:
+                if (
+                        not distributed_option.distributed
+                        or distributed_option.dist_rank == 0
+                ):
+                    if args.wandb_project is None:
+                        project = "FunASR_" + cls.__name__
+                    else:
+                        project = args.wandb_project
+
+                    if args.wandb_name is None:
+                        name = str(Path(".").resolve()).replace("/", "_")
+                    else:
+                        name = args.wandb_name
+
+                    wandb.init(
+                        entity=args.wandb_entity,
+                        project=project,
+                        name=name,
+                        dir=output_dir,
+                        id=args.wandb_id,
+                        resume="allow",
+                    )
+                    wandb.config.update(args)
+                else:
+                    # wandb also supports grouping for distributed training,
+                    # but we only log aggregated data,
+                    # so it's enough to perform on rank0 node.
+                    args.use_wandb = False
+
+            # Don't give args to trainer.run() directly!!!
+            # Instead, define an "Options" object and build it here.
+            trainer_options = cls.trainer.build_options(args)
+            cls.trainer.run(
+                model=model,
+                optimizers=optimizers,
+                schedulers=schedulers,
+                train_iter_factory=train_iter_factory,
+                valid_iter_factory=valid_iter_factory,
+                trainer_options=trainer_options,
+                distributed_option=distributed_option,
+            )
+
+            if args.use_wandb and wandb.run:
+                wandb.finish()
+
+    @classmethod
+    def build_iter_options(
+            cls,
+            args: argparse.Namespace,
+            distributed_option: DistributedOption,
+            mode: str,
+    ):
+        if mode == "train":
+            preprocess_fn = cls.build_preprocess_fn(args, train=True)
+            collate_fn = cls.build_collate_fn(args, train=True)
+            data_path_and_name_and_type = args.train_data_path_and_name_and_type
+            shape_files = args.train_shape_file
+            batch_size = args.batch_size
+            batch_bins = args.batch_bins
+            batch_type = args.batch_type
+            max_cache_size = args.max_cache_size
+            max_cache_fd = args.max_cache_fd
+            distributed = distributed_option.distributed
+            num_batches = None
+            num_iters_per_epoch = args.num_iters_per_epoch
+            train = True
+
+        elif mode == "valid":
+            preprocess_fn = cls.build_preprocess_fn(args, train=False)
+            collate_fn = cls.build_collate_fn(args, train=False)
+            data_path_and_name_and_type = args.valid_data_path_and_name_and_type
+            shape_files = args.valid_shape_file
+
+            if args.valid_batch_type is None:
+                batch_type = args.batch_type
+            else:
+                batch_type = args.valid_batch_type
+            if args.valid_batch_size is None:
+                batch_size = args.batch_size
+            else:
+                batch_size = args.valid_batch_size
+            if args.valid_batch_bins is None:
+                batch_bins = args.batch_bins
+            else:
+                batch_bins = args.valid_batch_bins
+            if args.valid_max_cache_size is None:
+                # Cache 5% of maximum size for validation loader
+                max_cache_size = 0.05 * args.max_cache_size
+            else:
+                max_cache_size = args.valid_max_cache_size
+            max_cache_fd = args.max_cache_fd
+            distributed = distributed_option.distributed
+            num_batches = None
+            num_iters_per_epoch = None
+            train = False
+        else:
+            raise NotImplementedError(f"mode={mode}")
+
+        return IteratorOptions(
+            preprocess_fn=preprocess_fn,
+            collate_fn=collate_fn,
+            data_path_and_name_and_type=data_path_and_name_and_type,
+            shape_files=shape_files,
+            batch_type=batch_type,
+            batch_size=batch_size,
+            batch_bins=batch_bins,
+            num_batches=num_batches,
+            max_cache_size=max_cache_size,
+            max_cache_fd=max_cache_fd,
+            distributed=distributed,
+            num_iters_per_epoch=num_iters_per_epoch,
+            train=train,
+        )
+
+    @classmethod
+    def build_iter_factory(
+            cls,
+            args: argparse.Namespace,
+            distributed_option: DistributedOption,
+            mode: str,
+            kwargs: dict = None,
+    ) -> AbsIterFactory:
+        """Build a factory object of mini-batch iterator.
+
+        This object is invoked at every epoch to build the iterator for that epoch
+        as follows:
+
+        >>> iter_factory = cls.build_iter_factory(...)
+        >>> for epoch in range(1, max_epoch):
+        ...     for keys, batch in iter_factory.build_iter(epoch):
+        ...         model(**batch)
+
+        The mini-batches for each epoch are fully controlled by this class.
+        Note that the random seed used for shuffling is decided as "seed + epoch" and
+        the generated mini-batches can be reproduced when resuming.
+
+        Note that an "epoch" does not always mean
+        a full pass over the whole training corpus.
+        "--num_iters_per_epoch" option restricts the number of iterations for each epoch
+        and the remaining samples of the original epoch are left for the next epoch.
+        e.g. If the number of mini-batches equals 4, the following two are the same:
+
+        - 1 epoch without "--num_iters_per_epoch"
+        - 4 epochs with "--num_iters_per_epoch" == 1
+
+        """
+        assert check_argument_types()
+        iter_options = cls.build_iter_options(args, distributed_option, mode)
+
+        # Overwrite iter_options if any kwargs is given
+        if kwargs is not None:
+            for k, v in kwargs.items():
+                setattr(iter_options, k, v)
+
+        if args.iterator_type == "sequence":
+            return cls.build_sequence_iter_factory(
+                args=args,
+                iter_options=iter_options,
+                mode=mode,
+            )
+        elif args.iterator_type == "chunk":
+            return cls.build_chunk_iter_factory(
+                args=args,
+                iter_options=iter_options,
+                mode=mode,
+            )
+        elif args.iterator_type == "task":
+            return cls.build_task_iter_factory(
+                args=args,
+                iter_options=iter_options,
+                mode=mode,
+            )
+        else:
+            raise RuntimeError(f"Not supported: iterator_type={args.iterator_type}")
+
+    @classmethod
+    def build_sequence_iter_factory(
+            cls, args: argparse.Namespace, iter_options: IteratorOptions, mode: str
+    ) -> AbsIterFactory:
+        assert check_argument_types()
+
+        dataset = ESPnetDataset(
+            iter_options.data_path_and_name_and_type,
+            float_dtype=args.train_dtype,
+            preprocess=iter_options.preprocess_fn,
+            max_cache_size=iter_options.max_cache_size,
+            max_cache_fd=iter_options.max_cache_fd,
+        )
+        cls.check_task_requirements(
+            dataset, args.allow_variable_data_keys, train=iter_options.train
+        )
+
+        if Path(
+                Path(iter_options.data_path_and_name_and_type[0][0]).parent, "utt2category"
+        ).exists():
+            utt2category_file = str(
+                Path(
+                    Path(iter_options.data_path_and_name_and_type[0][0]).parent,
+                    "utt2category",
+                )
+            )
+        else:
+            utt2category_file = None
+        batch_sampler = build_batch_sampler(
+            type=iter_options.batch_type,
+            shape_files=iter_options.shape_files,
+            fold_lengths=args.fold_length,
+            batch_size=iter_options.batch_size,
+            batch_bins=iter_options.batch_bins,
+            sort_in_batch=args.sort_in_batch,
+            sort_batch=args.sort_batch,
+            drop_last=False,
+            min_batch_size=torch.distributed.get_world_size()
+            if iter_options.distributed
+            else 1,
+            utt2category_file=utt2category_file,
+        )
+
+        batches = list(batch_sampler)
+        if iter_options.num_batches is not None:
+            batches = batches[: iter_options.num_batches]
+
+        bs_list = [len(batch) for batch in batches]
+
+        logging.info(f"[{mode}] dataset:\n{dataset}")
+        logging.info(f"[{mode}] Batch sampler: {batch_sampler}")
+        logging.info(
+            f"[{mode}] mini-batch sizes summary: N-batch={len(bs_list)}, "
+            f"mean={np.mean(bs_list):.1f}, min={np.min(bs_list)}, max={np.max(bs_list)}"
+        )
+
+        if args.scheduler == "tri_stage" and mode == "train":
+            args.max_update = len(bs_list) * args.max_epoch
+            logging.info("Max update: {}".format(args.max_update))
+
+        if iter_options.distributed:
+            world_size = torch.distributed.get_world_size()
+            rank = torch.distributed.get_rank()
+            for batch in batches:
+                if len(batch) < world_size:
+                    raise RuntimeError(
+                        f"The batch-size must be equal to or greater than world_size: "
+                        f"{len(batch)} < {world_size}"
+                    )
+            batches = [batch[rank::world_size] for batch in batches]
+
+        return SequenceIterFactory(
+            dataset=dataset,
+            batches=batches,
+            seed=args.seed,
+            num_iters_per_epoch=iter_options.num_iters_per_epoch,
+            shuffle=iter_options.train,
+            num_workers=args.num_workers,
+            collate_fn=iter_options.collate_fn,
+            pin_memory=args.ngpu > 0,
+        )
+
+    @classmethod
+    def build_chunk_iter_factory(
+            cls,
+            args: argparse.Namespace,
+            iter_options: IteratorOptions,
+            mode: str,
+    ) -> AbsIterFactory:
+        assert check_argument_types()
+
+        dataset = ESPnetDataset(
+            iter_options.data_path_and_name_and_type,
+            float_dtype=args.train_dtype,
+            preprocess=iter_options.preprocess_fn,
+            max_cache_size=iter_options.max_cache_size,
+            max_cache_fd=iter_options.max_cache_fd,
+        )
+        cls.check_task_requirements(
+            dataset, args.allow_variable_data_keys, train=iter_options.train
+        )
+
+        if len(iter_options.shape_files) == 0:
+            key_file = iter_options.data_path_and_name_and_type[0][0]
+        else:
+            key_file = iter_options.shape_files[0]
+
+        batch_sampler = UnsortedBatchSampler(batch_size=1, key_file=key_file)
+        batches = list(batch_sampler)
+        if iter_options.num_batches is not None:
+            batches = batches[: iter_options.num_batches]
+        logging.info(f"[{mode}] dataset:\n{dataset}")
+
+        if iter_options.distributed:
+            world_size = torch.distributed.get_world_size()
+            rank = torch.distributed.get_rank()
+            if len(batches) < world_size:
+                raise RuntimeError("Number of samples is smaller than world_size")
+            if iter_options.batch_size < world_size:
+                raise RuntimeError("batch_size must be equal to or greater than world_size")
+
+            if rank < iter_options.batch_size % world_size:
+                batch_size = iter_options.batch_size // world_size + 1
+            else:
+                batch_size = iter_options.batch_size // world_size
+            num_cache_chunks = args.num_cache_chunks // world_size
+            # NOTE(kamo): Split the whole corpus by sample count without considering
+            #   each sample's length, so the iteration counts are not always
+            #   equal across ranks, and the iterations are limited
+            #   by the fewest iterations,
+            #   i.e. the samples beyond that count are discarded.
+            batches = batches[rank::world_size]
+        else:
+            batch_size = iter_options.batch_size
+            num_cache_chunks = args.num_cache_chunks
+
+        return ChunkIterFactory(
+            dataset=dataset,
+            batches=batches,
+            seed=args.seed,
+            batch_size=batch_size,
+            # For chunk iterator,
+            # --num_iters_per_epoch doesn't indicate the number of iterations,
+            # but indicates the number of samples.
+            num_samples_per_epoch=iter_options.num_iters_per_epoch,
+            shuffle=iter_options.train,
+            num_workers=args.num_workers,
+            collate_fn=iter_options.collate_fn,
+            pin_memory=args.ngpu > 0,
+            chunk_length=args.chunk_length,
+            chunk_shift_ratio=args.chunk_shift_ratio,
+            num_cache_chunks=num_cache_chunks,
+        )
+
+    # NOTE(kamo): Not abstract class
+    @classmethod
+    def build_task_iter_factory(
+            cls,
+            args: argparse.Namespace,
+            iter_options: IteratorOptions,
+            mode: str,
+    ) -> AbsIterFactory:
+        """Build task specific iterator factory
+
+        Example:
+
+            >>> class YourTask(AbsTask):
+            ... @classmethod
+            ... def add_task_arguments(cls, parser: argparse.ArgumentParser):
+            ...     parser.set_defaults(iterator_type="task")
+            ...
+            ... @classmethod
+            ... def build_task_iter_factory(
+            ...     cls,
+            ...     args: argparse.Namespace,
+            ...     iter_options: IteratorOptions,
+            ...     mode: str,
+            ... ):
+            ...     return FooIterFactory(...)
+            ...
+            ... @classmethod
+            ... def build_iter_options(
+            ...     args: argparse.Namespace,
+            ...     distributed_option: DistributedOption,
+            ...     mode: str
+            ... ):
+            ...     # if you need to customize options object
+        """
+        raise NotImplementedError
+
+    @classmethod
+    def build_multiple_iter_factory(
+            cls, args: argparse.Namespace, distributed_option: DistributedOption, mode: str
+    ):
+        assert check_argument_types()
+        iter_options = cls.build_iter_options(args, distributed_option, mode)
+        assert len(iter_options.data_path_and_name_and_type) > 0, len(
+            iter_options.data_path_and_name_and_type
+        )
+
+        # 1. Sanity check
+        num_splits = None
+        for path in [
+            path for path, _, _ in iter_options.data_path_and_name_and_type
+        ] + list(iter_options.shape_files):
+            if not Path(path).is_dir():
+                raise RuntimeError(f"{path} is not a directory")
+            p = Path(path) / "num_splits"
+            if not p.exists():
+                raise FileNotFoundError(f"{p} is not found")
+            with p.open() as f:
+                _num_splits = int(f.read())
+                if num_splits is not None and num_splits != _num_splits:
+                    raise RuntimeError(
+                        f"Number of splits is mismatched: "
+                        f"{iter_options.data_path_and_name_and_type[0][0]} and {path}"
+                    )
+                num_splits = _num_splits
+
+            for i in range(num_splits):
+                p = Path(path) / f"split.{i}"
+                if not p.exists():
+                    raise FileNotFoundError(f"{p} is not found")
+
+        # 2. Create functions to build an iter factory for each split
+        data_path_and_name_and_type_list = [
+            [
+                (str(Path(p) / f"split.{i}"), n, t)
+                for p, n, t in iter_options.data_path_and_name_and_type
+            ]
+            for i in range(num_splits)
+        ]
+        shape_files_list = [
+            [str(Path(s) / f"split.{i}") for s in iter_options.shape_files]
+            for i in range(num_splits)
+        ]
+        num_iters_per_epoch_list = [
+            (iter_options.num_iters_per_epoch + i) // num_splits
+            if iter_options.num_iters_per_epoch is not None
+            else None
+            for i in range(num_splits)
+        ]
+        max_cache_size = iter_options.max_cache_size / num_splits
+
+        # Note that iter-factories are built for each epoch at runtime lazily.
+        build_funcs = [
+            functools.partial(
+                cls.build_iter_factory,
+                args,
+                distributed_option,
+                mode,
+                kwargs=dict(
+                    data_path_and_name_and_type=_data_path_and_name_and_type,
+                    shape_files=_shape_files,
+                    num_iters_per_epoch=_num_iters_per_epoch,
+                    max_cache_size=max_cache_size,
+                ),
+            )
+            for (
+                _data_path_and_name_and_type,
+                _shape_files,
+                _num_iters_per_epoch,
+            ) in zip(
+                data_path_and_name_and_type_list,
+                shape_files_list,
+                num_iters_per_epoch_list,
+            )
+        ]
+
+        # 3. Build MultipleIterFactory
+        return MultipleIterFactory(
+            build_funcs=build_funcs, shuffle=iter_options.train, seed=args.seed
+        )
+
+    @classmethod
+    def build_streaming_iterator(
+            cls,
+            data_path_and_name_and_type,
+            preprocess_fn,
+            collate_fn,
+            key_file: str = None,
+            batch_size: int = 1,
+            dtype: str = np.float32,
+            num_workers: int = 1,
+            allow_variable_data_keys: bool = False,
+            ngpu: int = 0,
+            inference: bool = False,
+    ) -> DataLoader:
+        """Build DataLoader using iterable dataset"""
+        assert check_argument_types()
+        # For backward compatibility for pytorch DataLoader
+        if collate_fn is not None:
+            kwargs = dict(collate_fn=collate_fn)
+        else:
+            kwargs = {}
+
+        dataset = IterableESPnetDataset(
+            data_path_and_name_and_type,
+            float_dtype=dtype,
+            preprocess=preprocess_fn,
+            key_file=key_file,
+        )
+        if dataset.apply_utt2category:
+            kwargs.update(batch_size=1)
+        else:
+            kwargs.update(batch_size=batch_size)
+
+        cls.check_task_requirements(
+            dataset, allow_variable_data_keys, train=False, inference=inference
+        )
+
+        return DataLoader(
+            dataset=dataset,
+            pin_memory=ngpu > 0,
+            num_workers=num_workers,
+            **kwargs,
+        )
+
+    # ~~~~~~~~~ The methods below are mainly used for inference ~~~~~~~~~
+    @classmethod
+    def build_model_from_file(
+            cls,
+            config_file: Union[Path, str] = None,
+            model_file: Union[Path, str] = None,
+            device: str = "cpu",
+    ) -> Tuple[AbsESPnetModel, argparse.Namespace]:
+        """Build model from the files.
+
+        This method is used for inference or fine-tuning.
+
+        Args:
+            config_file: The yaml file saved when training.
+            model_file: The model file saved when training.
+            device: Device type, "cpu", "cuda", or "cuda:N".
+
+        """
+        assert check_argument_types()
+        if config_file is None:
+            assert model_file is not None, (
+                "The argument 'model_file' must be provided "
+                "if the argument 'config_file' is not specified."
+            )
+            config_file = Path(model_file).parent / "config.yaml"
+        else:
+            config_file = Path(config_file)
+
+        with config_file.open("r", encoding="utf-8") as f:
+            args = yaml.safe_load(f)
+        args = argparse.Namespace(**args)
+        model = cls.build_model(args)
+        if not isinstance(model, AbsESPnetModel):
+            raise RuntimeError(
+                f"model must inherit {AbsESPnetModel.__name__}, but got {type(model)}"
+            )
+        model.to(device)
+        if model_file is not None:
+            if device == "cuda":
+                # NOTE(kamo): "cuda" for torch.load always indicates cuda:0
+                #   in PyTorch<=1.4
+                device = f"cuda:{torch.cuda.current_device()}"
+            model.load_state_dict(torch.load(model_file, map_location=device))
+
+        return model, args
diff --git a/funasr/tasks/asr.py b/funasr/tasks/asr.py
new file mode 100644
index 000000000..9367ed32e
--- /dev/null
+++ b/funasr/tasks/asr.py
@@ -0,0 +1,879 @@
+import argparse
+import logging
+from typing import Callable
+from typing import Collection
+from typing import Dict
+from typing import List
+from typing import Optional
+from typing import Tuple
+
+import numpy as np
+import torch
+from typeguard import check_argument_types
+from typeguard import check_return_type
+
+from funasr.datasets.collate_fn import CommonCollateFn
+from funasr.datasets.preprocessor import CommonPreprocessor
+from funasr.models.ctc import CTC
+from funasr.models.decoder.abs_decoder import AbsDecoder
+from funasr.models.decoder.rnn_decoder import RNNDecoder
+from funasr.models.decoder.transformer_decoder import (
+	DynamicConvolution2DTransformerDecoder,  # noqa: H301
+)
+from funasr.models.decoder.transformer_decoder import DynamicConvolutionTransformerDecoder
+from funasr.models.decoder.transformer_decoder import (
+	LightweightConvolution2DTransformerDecoder,  # noqa: H301
+)
+from funasr.models.decoder.transformer_decoder import (
+	LightweightConvolutionTransformerDecoder,  # noqa: H301
+)
+from funasr.models.decoder.transformer_decoder import TransformerDecoder
+from funasr.models.encoder.abs_encoder import AbsEncoder
+from funasr.models.encoder.conformer_encoder import ConformerEncoder
+from funasr.models.encoder.rnn_encoder import RNNEncoder
+from funasr.models.encoder.transformer_encoder import TransformerEncoder
+from funasr.models.frontend.abs_frontend import AbsFrontend
+from funasr.models.frontend.default import DefaultFrontend
+from funasr.models.frontend.fused import FusedFrontends
+from funasr.models.frontend.s3prl import S3prlFrontend
+from funasr.models.frontend.windowing import SlidingWindow
+from funasr.models.postencoder.abs_postencoder import AbsPostEncoder
+from funasr.models.postencoder.hugging_face_transformers_postencoder import (
+	HuggingFaceTransformersPostEncoder,  # noqa: H301
+)
+from funasr.models.preencoder.abs_preencoder import AbsPreEncoder
+from funasr.models.preencoder.linear import LinearProjection
+from funasr.models.preencoder.sinc import LightweightSincConvs
+from funasr.models.specaug.abs_specaug import AbsSpecAug
+from funasr.models.specaug.specaug import SpecAug
+from funasr.layers.abs_normalize import AbsNormalize
+from funasr.layers.global_mvn import GlobalMVN
+from funasr.layers.utterance_mvn import UtteranceMVN
+from funasr.tasks.abs_task import AbsTask
+from funasr.text.phoneme_tokenizer import g2p_choices
+from funasr.torch_utils.initialize import initialize
+from funasr.train.abs_espnet_model import AbsESPnetModel
+from funasr.train.class_choices import ClassChoices
+from funasr.train.trainer import Trainer
+from funasr.utils.get_default_kwargs import get_default_kwargs
+from funasr.utils.nested_dict_action import NestedDictAction
+from funasr.utils.types import float_or_none
+from funasr.utils.types import int_or_none
+from funasr.utils.types import str2bool
+from funasr.utils.types import str_or_none
+
+from funasr.models.specaug.specaug import SpecAugLFR
+from funasr.models.predictor.cif import CifPredictor, CifPredictorV2
+from funasr.modules.subsampling import Conv1dSubsampling
+from funasr.models.e2e_asr import ESPnetASRModel
+from funasr.models.e2e_uni_asr import UniASR
+from funasr.models.encoder.sanm_encoder import SANMEncoder, SANMEncoderChunkOpt
+from funasr.models.decoder.sanm_decoder import ParaformerSANMDecoder, FsmnDecoderSCAMAOpt
+from funasr.models.e2e_asr_paraformer import Paraformer, ParaformerBert
+from funasr.models.decoder.transformer_decoder import ParaformerDecoderSAN
+
+frontend_choices = ClassChoices(
+	name="frontend",
+	classes=dict(
+		default=DefaultFrontend,
+		sliding_window=SlidingWindow,
+		s3prl=S3prlFrontend,
+		fused=FusedFrontends,
+	),
+	type_check=AbsFrontend,
+	default="default",
+)
+specaug_choices = ClassChoices(
+	name="specaug",
+	classes=dict(
+		specaug=SpecAug,
+		specaug_lfr=SpecAugLFR,
+	),
+	type_check=AbsSpecAug,
+	default=None,
+	optional=True,
+)
+normalize_choices = ClassChoices(
+	"normalize",
+	classes=dict(
+		global_mvn=GlobalMVN,
+		utterance_mvn=UtteranceMVN,
+	),
+	type_check=AbsNormalize,
+	default=None,
+	optional=True,
+)
+model_choices = ClassChoices(
+	"model",
+	classes=dict(
+		asr=ESPnetASRModel,
+		uniasr=UniASR,
+		paraformer=Paraformer,
+		paraformer_bert=ParaformerBert,
+	),
+	type_check=AbsESPnetModel,
+	default="asr",
+)
+preencoder_choices = ClassChoices(
+	name="preencoder",
+	classes=dict(
+		sinc=LightweightSincConvs,
+		linear=LinearProjection,
+	),
+	type_check=AbsPreEncoder,
+	default=None,
+	optional=True,
+)
+encoder_choices = ClassChoices(
+	"encoder",
+	classes=dict(
+		conformer=ConformerEncoder,
+		transformer=TransformerEncoder,
+		rnn=RNNEncoder,
+		sanm=SANMEncoder,
+		sanm_chunk_opt=SANMEncoderChunkOpt,
+	),
+	type_check=AbsEncoder,
+	default="rnn",
+)
+encoder_choices2 = ClassChoices(
+	"encoder2",
+	classes=dict(
+		conformer=ConformerEncoder,
+		transformer=TransformerEncoder,
+		rnn=RNNEncoder,
+		sanm=SANMEncoder,
+		sanm_chunk_opt=SANMEncoderChunkOpt,
+	),
+	type_check=AbsEncoder,
+	default="rnn",
+)
+postencoder_choices = ClassChoices(
+	name="postencoder",
+	classes=dict(
+		hugging_face_transformers=HuggingFaceTransformersPostEncoder,
+	),
+	type_check=AbsPostEncoder,
+	default=None,
+	optional=True,
+)
+decoder_choices = ClassChoices(
+	"decoder",
+	classes=dict(
+		transformer=TransformerDecoder,
+		lightweight_conv=LightweightConvolutionTransformerDecoder,
+		lightweight_conv2d=LightweightConvolution2DTransformerDecoder,
+		dynamic_conv=DynamicConvolutionTransformerDecoder,
+		dynamic_conv2d=DynamicConvolution2DTransformerDecoder,
+		rnn=RNNDecoder,
+		fsmn_scama_opt=FsmnDecoderSCAMAOpt,
+		paraformer_decoder_sanm=ParaformerSANMDecoder,
+		paraformer_decoder_san=ParaformerDecoderSAN,
+	),
+	type_check=AbsDecoder,
+	default="rnn",
+)
+decoder_choices2 = ClassChoices(
+	"decoder2",
+	classes=dict(
+		transformer=TransformerDecoder,
+		lightweight_conv=LightweightConvolutionTransformerDecoder,
+		lightweight_conv2d=LightweightConvolution2DTransformerDecoder,
+		dynamic_conv=DynamicConvolutionTransformerDecoder,
+		dynamic_conv2d=DynamicConvolution2DTransformerDecoder,
+		rnn=RNNDecoder,
+		fsmn_scama_opt=FsmnDecoderSCAMAOpt,
+		paraformer_decoder_sanm=ParaformerSANMDecoder,
+	),
+	type_check=AbsDecoder,
+	default="rnn",
+)
+predictor_choices = ClassChoices(
+	name="predictor",
+	classes=dict(
+		cif_predictor=CifPredictor,
+		ctc_predictor=None,
+		cif_predictor_v2=CifPredictorV2,
+	),
+	type_check=None,
+	default="cif_predictor",
+	optional=True,
+)
+predictor_choices2 = ClassChoices(
+	name="predictor2",
+	classes=dict(
+		cif_predictor=CifPredictor,
+		ctc_predictor=None,
+		cif_predictor_v2=CifPredictorV2,
+	),
+	type_check=None,
+	default="cif_predictor",
+	optional=True,
+)
+stride_conv_choices = ClassChoices(
+	name="stride_conv",
+	classes=dict(
+		stride_conv1d=Conv1dSubsampling
+	),
+	type_check=None,
+	default="stride_conv1d",
+	optional=True,
+)
+
+
+class ASRTask(AbsTask):
+	# If you need more than one optimizer, change this value
+	num_optimizers: int = 1
+
+	# Add variable objects configurations
+	class_choices_list = [
+		# --frontend and --frontend_conf
+		frontend_choices,
+		# --specaug and --specaug_conf
+		specaug_choices,
+		# --normalize and --normalize_conf
+		normalize_choices,
+		# --model and --model_conf
+		model_choices,
+		# --preencoder and --preencoder_conf
+		preencoder_choices,
+		# --encoder and --encoder_conf
+		encoder_choices,
+		# --postencoder and --postencoder_conf
+		postencoder_choices,
+		# --decoder and --decoder_conf
+		decoder_choices,
+	]
+
+	# If you need to modify train() or eval() procedures, change Trainer class here
+	trainer = Trainer
+
+	@classmethod
+	def add_task_arguments(cls, parser: argparse.ArgumentParser):
+		group = parser.add_argument_group(description="Task related")
+
+		# NOTE(kamo): add_argument(..., required=True) can't be used here because it
+		# would break --print_config mode. Register required options as follows instead.
+		required = parser.get_default("required")
+		required += ["token_list"]
+
+		group.add_argument(
+			"--token_list",
+			type=str_or_none,
+			default=None,
+			help="A text mapping int-id to token",
+		)
+		group.add_argument(
+			"--split_with_space",
+			type=str2bool,
+			default=True,
+			help="whether to split text using <space>",
+		)
+		group.add_argument(
+			"--init",
+			type=lambda x: str_or_none(x.lower()),
+			default=None,
+			help="The initialization method",
+			choices=[
+				"chainer",
+				"xavier_uniform",
+				"xavier_normal",
+				"kaiming_uniform",
+				"kaiming_normal",
+				None,
+			],
+		)
+
+		group.add_argument(
+			"--input_size",
+			type=int_or_none,
+			default=None,
+			help="The number of input dimensions of the feature",
+		)
+
+		group.add_argument(
+			"--ctc_conf",
+			action=NestedDictAction,
+			default=get_default_kwargs(CTC),
+			help="The keyword arguments for CTC class.",
+		)
+		group.add_argument(
+			"--joint_net_conf",
+			action=NestedDictAction,
+			default=None,
+			help="The keyword arguments for joint network class.",
+		)
+
+		group = parser.add_argument_group(description="Preprocess related")
+		group.add_argument(
+			"--use_preprocessor",
+			type=str2bool,
+			default=True,
+			help="Apply preprocessing to data or not",
+		)
+		group.add_argument(
+			"--token_type",
+			type=str,
+			default="bpe",
+			choices=["bpe", "char", "word", "phn"],
+			help="The text will be tokenized at the specified token level",
+		)
+		group.add_argument(
+			"--bpemodel",
+			type=str_or_none,
+			default=None,
+			help="The model file of sentencepiece",
+		)
+		parser.add_argument(
+			"--non_linguistic_symbols",
+			type=str_or_none,
+			default=None,
+			help="non_linguistic_symbols file path",
+		)
+		parser.add_argument(
+			"--cleaner",
+			type=str_or_none,
+			choices=[None, "tacotron", "jaconv", "vietnamese"],
+			default=None,
+			help="Apply text cleaning",
+		)
+		parser.add_argument(
+			"--g2p",
+			type=str_or_none,
+			choices=g2p_choices,
+			default=None,
+			help="Specify g2p method if --token_type=phn",
+		)
+		parser.add_argument(
+			"--speech_volume_normalize",
+			type=float_or_none,
+			default=None,
+			help="Scale the maximum amplitude to the given value.",
+		)
+		parser.add_argument(
+			"--rir_scp",
+			type=str_or_none,
+			default=None,
+			help="The file path of rir scp file.",
+		)
+		parser.add_argument(
+			"--rir_apply_prob",
+			type=float,
+			default=1.0,
+			help="The probability of applying RIR convolution.",
+		)
+		parser.add_argument(
+			"--noise_scp",
+			type=str_or_none,
+			default=None,
+			help="The file path of noise scp file.",
+		)
+		parser.add_argument(
+			"--noise_apply_prob",
+			type=float,
+			default=1.0,
+			help="The probability of applying noise addition.",
+		)
+		parser.add_argument(
+			"--noise_db_range",
+			type=str,
+			default="13_15",
+			help="The range of noise decibel level.",
+		)
+
+		for class_choices in cls.class_choices_list:
+			# Append --<name> and --<name>_conf.
+			# e.g. --encoder and --encoder_conf
+			class_choices.add_arguments(group)
+
+	@classmethod
+	def build_collate_fn(
+		cls, args: argparse.Namespace, train: bool
+	) -> Callable[
+		[Collection[Tuple[str, Dict[str, np.ndarray]]]],
+		Tuple[List[str], Dict[str, torch.Tensor]],
+	]:
+		assert check_argument_types()
+		# NOTE(kamo): int value = 0 is reserved for the CTC-blank symbol
+		return CommonCollateFn(float_pad_value=0.0, int_pad_value=-1)
+
+	@classmethod
+	def build_preprocess_fn(
+		cls, args: argparse.Namespace, train: bool
+	) -> Optional[Callable[[str, Dict[str, np.ndarray]], Dict[str, np.ndarray]]]:
+		assert check_argument_types()
+		if args.use_preprocessor:
+			retval = CommonPreprocessor(
+				train=train,
+				token_type=args.token_type,
+				token_list=args.token_list,
+				bpemodel=args.bpemodel,
+				non_linguistic_symbols=args.non_linguistic_symbols,
+				text_cleaner=args.cleaner,
+				g2p_type=args.g2p,
+				split_with_space=args.split_with_space if hasattr(args, "split_with_space") else False,
+				# NOTE(kamo): Check attribute existence for backward compatibility
+				rir_scp=args.rir_scp if hasattr(args, "rir_scp") else None,
+				rir_apply_prob=args.rir_apply_prob
+				if hasattr(args, "rir_apply_prob")
+				else 1.0,
+				noise_scp=args.noise_scp if hasattr(args, "noise_scp") else None,
+				noise_apply_prob=args.noise_apply_prob
+				if hasattr(args, "noise_apply_prob")
+				else 1.0,
+				noise_db_range=args.noise_db_range
+				if hasattr(args, "noise_db_range")
+				else "13_15",
+				speech_volume_normalize=args.speech_volume_normalize
+				if hasattr(args, "speech_volume_normalize")
+				else None,
+			)
+		else:
+			retval = None
+		assert check_return_type(retval)
+		return retval
+
+	@classmethod
+	def required_data_names(
+		cls, train: bool = True, inference: bool = False
+	) -> Tuple[str, ...]:
+		if not inference:
+			retval = ("speech", "text")
+		else:
+			# Recognition mode
+			retval = ("speech",)
+		return retval
+
+	@classmethod
+	def optional_data_names(
+		cls, train: bool = True, inference: bool = False
+	) -> Tuple[str, ...]:
+		retval = ()
+		assert check_return_type(retval)
+		return retval
+
+	@classmethod
+	def build_model(cls, args: argparse.Namespace):
+		assert check_argument_types()
+		if isinstance(args.token_list, str):
+			with open(args.token_list, encoding="utf-8") as f:
+				token_list = [line.rstrip() for line in f]
+
+			# Overwriting token_list to keep it as "portable".
+			args.token_list = list(token_list)
+		elif isinstance(args.token_list, (tuple, list)):
+			token_list = list(args.token_list)
+		else:
+			raise RuntimeError("token_list must be str or list")
+		vocab_size = len(token_list)
+		logging.info(f"Vocabulary size: {vocab_size}")
+
+		# 1. frontend
+		if args.input_size is None:
+			# Extract features in the model
+			frontend_class = frontend_choices.get_class(args.frontend)
+			frontend = frontend_class(**args.frontend_conf)
+			input_size = frontend.output_size()
+		else:
+			# Give features from data-loader
+			args.frontend = None
+			args.frontend_conf = {}
+			frontend = None
+			input_size = args.input_size
+
+		# 2. Data augmentation for spectrogram
+		if args.specaug is not None:
+			specaug_class = specaug_choices.get_class(args.specaug)
+			specaug = specaug_class(**args.specaug_conf)
+		else:
+			specaug = None
+
+		# 3. Normalization layer
+		if args.normalize is not None:
+			normalize_class = normalize_choices.get_class(args.normalize)
+			normalize = normalize_class(**args.normalize_conf)
+		else:
+			normalize = None
+
+		# 4. Pre-encoder input block
+		# NOTE(kan-bayashi): Use getattr to keep the compatibility
+		if getattr(args, "preencoder", None) is not None:
+			preencoder_class = preencoder_choices.get_class(args.preencoder)
+			preencoder = preencoder_class(**args.preencoder_conf)
+			input_size = preencoder.output_size()
+		else:
+			preencoder = None
+
+		# 5. Encoder
+		encoder_class = encoder_choices.get_class(args.encoder)
+		encoder = encoder_class(input_size=input_size, **args.encoder_conf)
+
+		# 6. Post-encoder block
+		# NOTE(kan-bayashi): Use getattr to keep the compatibility
+		encoder_output_size = encoder.output_size()
+		if getattr(args, "postencoder", None) is not None:
+			postencoder_class = postencoder_choices.get_class(args.postencoder)
+			postencoder = postencoder_class(
+				input_size=encoder_output_size, **args.postencoder_conf
+			)
+			encoder_output_size = postencoder.output_size()
+		else:
+			postencoder = None
+
+		# 7. Decoder
+		decoder_class = decoder_choices.get_class(args.decoder)
+		decoder = decoder_class(
+			vocab_size=vocab_size,
+			encoder_output_size=encoder_output_size,
+			**args.decoder_conf,
+		)
+
+		# 8. CTC
+		ctc = CTC(
+			odim=vocab_size, encoder_output_size=encoder_output_size, **args.ctc_conf
+		)
+
+		# 9. Build model
+		try:
+			model_class = model_choices.get_class(args.model)
+		except AttributeError:
+			model_class = model_choices.get_class("asr")
+		model = model_class(
+			vocab_size=vocab_size,
+			frontend=frontend,
+			specaug=specaug,
+			normalize=normalize,
+			preencoder=preencoder,
+			encoder=encoder,
+			postencoder=postencoder,
+			decoder=decoder,
+			ctc=ctc,
+			token_list=token_list,
+			**args.model_conf,
+		)
+
+		# 10. Initialize
+		if args.init is not None:
+			initialize(model, args.init)
+
+		assert check_return_type(model)
+		return model
+
+
+class ASRTaskUniASR(ASRTask):
+	# If you need more than one optimizer, change this value
+	num_optimizers: int = 1
+
+	# Add variable objects configurations
+	class_choices_list = [
+		# --frontend and --frontend_conf
+		frontend_choices,
+		# --specaug and --specaug_conf
+		specaug_choices,
+		# --normalize and --normalize_conf
+		normalize_choices,
+		# --model and --model_conf
+		model_choices,
+		# --preencoder and --preencoder_conf
+		preencoder_choices,
+		# --encoder and --encoder_conf
+		encoder_choices,
+		# --postencoder and --postencoder_conf
+		postencoder_choices,
+		# --decoder and --decoder_conf
+		decoder_choices,
+		# --predictor and --predictor_conf
+		predictor_choices,
+		# --encoder2 and --encoder2_conf
+		encoder_choices2,
+		# --decoder2 and --decoder2_conf
+		decoder_choices2,
+		# --predictor2 and --predictor2_conf
+		predictor_choices2,
+		# --stride_conv and --stride_conv_conf
+		stride_conv_choices,
+	]
+
+	# If you need to modify train() or eval() procedures, change Trainer class here
+	trainer = Trainer
+
+	@classmethod
+	def build_model(cls, args: argparse.Namespace):
+		assert check_argument_types()
+		if isinstance(args.token_list, str):
+			with open(args.token_list, encoding="utf-8") as f:
+				token_list = [line.rstrip() for line in f]
+
+			# Overwriting token_list to keep it as "portable".
+			args.token_list = list(token_list)
+		elif isinstance(args.token_list, (tuple, list)):
+			token_list = list(args.token_list)
+		else:
+			raise RuntimeError("token_list must be str or list")
+		vocab_size = len(token_list)
+		logging.info(f"Vocabulary size: {vocab_size}")
+
+		# 1. frontend
+		if args.input_size is None:
+			# Extract features in the model
+			frontend_class = frontend_choices.get_class(args.frontend)
+			frontend = frontend_class(**args.frontend_conf)
+			input_size = frontend.output_size()
+		else:
+			# Give features from data-loader
+			args.frontend = None
+			args.frontend_conf = {}
+			frontend = None
+			input_size = args.input_size
+
+		# 2. Data augmentation for spectrogram
+		if args.specaug is not None:
+			specaug_class = specaug_choices.get_class(args.specaug)
+			specaug = specaug_class(**args.specaug_conf)
+		else:
+			specaug = None
+
+		# 3. Normalization layer
+		if args.normalize is not None:
+			normalize_class = normalize_choices.get_class(args.normalize)
+			normalize = normalize_class(**args.normalize_conf)
+		else:
+			normalize = None
+
+		# 4. Pre-encoder input block
+		# NOTE(kan-bayashi): Use getattr to keep the compatibility
+		if getattr(args, "preencoder", None) is not None:
+			preencoder_class = preencoder_choices.get_class(args.preencoder)
+			preencoder = preencoder_class(**args.preencoder_conf)
+			input_size = preencoder.output_size()
+		else:
+			preencoder = None
+
+		# 5. Encoder
+		encoder_class = encoder_choices.get_class(args.encoder)
+		encoder = encoder_class(input_size=input_size, **args.encoder_conf)
+		encoder_output_size = encoder.output_size()
+
+		stride_conv_class = stride_conv_choices.get_class(args.stride_conv)
+		stride_conv = stride_conv_class(**args.stride_conv_conf, idim=input_size + encoder_output_size,
+		                                odim=input_size + encoder_output_size)
+		stride_conv_output_size = stride_conv.output_size()
+
+		# 6. Encoder2
+		encoder_class2 = encoder_choices2.get_class(args.encoder2)
+		encoder2 = encoder_class2(input_size=stride_conv_output_size, **args.encoder2_conf)
+
+		# 7. Post-encoder block
+		# NOTE(kan-bayashi): Use getattr to keep the compatibility
+		encoder_output_size2 = encoder2.output_size()
+		if getattr(args, "postencoder", None) is not None:
+			postencoder_class = postencoder_choices.get_class(args.postencoder)
+			postencoder = postencoder_class(
+				input_size=encoder_output_size, **args.postencoder_conf
+			)
+			encoder_output_size = postencoder.output_size()
+		else:
+			postencoder = None
+
+		# 8. Decoder & Decoder2
+		decoder_class = decoder_choices.get_class(args.decoder)
+		decoder_class2 = decoder_choices2.get_class(args.decoder2)
+		decoder = decoder_class(
+			vocab_size=vocab_size,
+			encoder_output_size=encoder_output_size,
+			**args.decoder_conf,
+		)
+		decoder2 = decoder_class2(
+			vocab_size=vocab_size,
+			encoder_output_size=encoder_output_size2,
+			**args.decoder2_conf,
+		)
+
+		# 9. CTC
+		ctc = CTC(
+			odim=vocab_size, encoder_output_size=encoder_output_size, **args.ctc_conf
+		)
+		ctc2 = CTC(
+			odim=vocab_size, encoder_output_size=encoder_output_size2, **args.ctc_conf
+		)
+
+		# 10. Predictor
+		predictor_class = predictor_choices.get_class(args.predictor)
+		predictor = predictor_class(**args.predictor_conf)
+
+		predictor_class = predictor_choices2.get_class(args.predictor2)
+		predictor2 = predictor_class(**args.predictor2_conf)
+
+		# 11. Build model
+		try:
+			model_class = model_choices.get_class(args.model)
+		except AttributeError:
+			model_class = model_choices.get_class("asr")
+		model = model_class(
+			vocab_size=vocab_size,
+			frontend=frontend,
+			specaug=specaug,
+			normalize=normalize,
+			preencoder=preencoder,
+			encoder=encoder,
+			postencoder=postencoder,
+			decoder=decoder,
+			ctc=ctc,
+			token_list=token_list,
+			predictor=predictor,
+			ctc2=ctc2,
+			encoder2=encoder2,
+			decoder2=decoder2,
+			predictor2=predictor2,
+			stride_conv=stride_conv,
+			**args.model_conf,
+		)
+
+		# 12. Initialize
+		if args.init is not None:
+			initialize(model, args.init)
+
+		assert check_return_type(model)
+		return model
+
+
+class ASRTaskParaformer(ASRTask):
+	# If you need more than one optimizer, change this value
+	num_optimizers: int = 1
+
+	# Add variable objects configurations
+	class_choices_list = [
+		# --frontend and --frontend_conf
+		frontend_choices,
+		# --specaug and --specaug_conf
+		specaug_choices,
+		# --normalize and --normalize_conf
+		normalize_choices,
+		# --model and --model_conf
+		model_choices,
+		# --preencoder and --preencoder_conf
+		preencoder_choices,
+		# --encoder and --encoder_conf
+		encoder_choices,
+		# --postencoder and --postencoder_conf
+		postencoder_choices,
+		# --decoder and --decoder_conf
+		decoder_choices,
+		# --predictor and --predictor_conf
+		predictor_choices,
+	]
+
+	# If you need to modify train() or eval() procedures, change Trainer class here
+	trainer = Trainer
+
+	@classmethod
+	def build_model(cls, args: argparse.Namespace):
+		assert check_argument_types()
+		if isinstance(args.token_list, str):
+			with open(args.token_list, encoding="utf-8") as f:
+				token_list = [line.rstrip() for line in f]
+
+			# Overwriting token_list to keep it as "portable".
+			args.token_list = list(token_list)
+		elif isinstance(args.token_list, (tuple, list)):
+			token_list = list(args.token_list)
+		else:
+			raise RuntimeError("token_list must be str or list")
+		vocab_size = len(token_list)
+		logging.info(f"Vocabulary size: {vocab_size}")
+
+		# 1. frontend
+		if args.input_size is None:
+			# Extract features in the model
+			frontend_class = frontend_choices.get_class(args.frontend)
+			frontend = frontend_class(**args.frontend_conf)
+			input_size = frontend.output_size()
+		else:
+			# Give features from data-loader
+			args.frontend = None
+			args.frontend_conf = {}
+			frontend = None
+			input_size = args.input_size
+
+		# 2. Data augmentation for spectrogram
+		if args.specaug is not None:
+			specaug_class = specaug_choices.get_class(args.specaug)
+			specaug = specaug_class(**args.specaug_conf)
+		else:
+			specaug = None
+
+		# 3. Normalization layer
+		if args.normalize is not None:
+			normalize_class = normalize_choices.get_class(args.normalize)
+			normalize = normalize_class(**args.normalize_conf)
+		else:
+			normalize = None
+
+		# 4. Pre-encoder input block
+		# NOTE(kan-bayashi): Use getattr to keep the compatibility
+		if getattr(args, "preencoder", None) is not None:
+			preencoder_class = preencoder_choices.get_class(args.preencoder)
+			preencoder = preencoder_class(**args.preencoder_conf)
+			input_size = preencoder.output_size()
+		else:
+			preencoder = None
+
+		# 5. Encoder
+		encoder_class = encoder_choices.get_class(args.encoder)
+		encoder = encoder_class(input_size=input_size, **args.encoder_conf)
+
+		# 6. Post-encoder block
+		# NOTE(kan-bayashi): Use getattr to keep the compatibility
+		encoder_output_size = encoder.output_size()
+		if getattr(args, "postencoder", None) is not None:
+			postencoder_class = postencoder_choices.get_class(args.postencoder)
+			postencoder = postencoder_class(
+				input_size=encoder_output_size, **args.postencoder_conf
+			)
+			encoder_output_size = postencoder.output_size()
+		else:
+			postencoder = None
+
+		# 7. Decoder
+		decoder_class = decoder_choices.get_class(args.decoder)
+		decoder = decoder_class(
+			vocab_size=vocab_size,
+			encoder_output_size=encoder_output_size,
+			**args.decoder_conf,
+		)
+
+		# 8. CTC
+		ctc = CTC(
+			odim=vocab_size, encoder_output_size=encoder_output_size, **args.ctc_conf
+		)
+
+		# 9. Predictor
+		predictor_class = predictor_choices.get_class(args.predictor)
+		predictor = predictor_class(**args.predictor_conf)
+
+		# 10. Build model
+		try:
+			model_class = model_choices.get_class(args.model)
+		except AttributeError:
+			model_class = model_choices.get_class("asr")
+		model = model_class(
+			vocab_size=vocab_size,
+			frontend=frontend,
+			specaug=specaug,
+			normalize=normalize,
+			preencoder=preencoder,
+			encoder=encoder,
+			postencoder=postencoder,
+			decoder=decoder,
+			ctc=ctc,
+			token_list=token_list,
+			predictor=predictor,
+			**args.model_conf,
+		)
+
+		# 11. Initialize
+		if args.init is not None:
+			initialize(model, args.init)
+
+		assert check_return_type(model)
+		return model
diff --git a/funasr/tasks/lm.py b/funasr/tasks/lm.py
new file mode 100644
index 000000000..46b9fe089
--- /dev/null
+++ b/funasr/tasks/lm.py
@@ -0,0 +1,211 @@
+import argparse
+import logging
+from typing import Callable
+from typing import Collection
+from typing import Dict
+from typing import List
+from typing import Optional
+from typing import Tuple
+
+import numpy as np
+import torch
+from typeguard import check_argument_types
+from typeguard import check_return_type
+
+from funasr.datasets.collate_fn import CommonCollateFn
+from funasr.datasets.preprocessor import CommonPreprocessor
+from funasr.lm.abs_model import AbsLM
+from funasr.lm.espnet_model import ESPnetLanguageModel
+from funasr.lm.seq_rnn_lm import SequentialRNNLM
+from funasr.lm.transformer_lm import TransformerLM
+from funasr.tasks.abs_task import AbsTask
+from funasr.text.phoneme_tokenizer import g2p_choices
+from funasr.torch_utils.initialize import initialize
+from funasr.train.class_choices import ClassChoices
+from funasr.train.trainer import Trainer
+from funasr.utils.get_default_kwargs import get_default_kwargs
+from funasr.utils.nested_dict_action import NestedDictAction
+from funasr.utils.types import str2bool
+from funasr.utils.types import str_or_none
+
+lm_choices = ClassChoices(
+    "lm",
+    classes=dict(
+        seq_rnn=SequentialRNNLM,
+        transformer=TransformerLM,
+    ),
+    type_check=AbsLM,
+    default="seq_rnn",
+)
+
+
+class LMTask(AbsTask):
+    # If you need more than one optimizer, change this value
+    num_optimizers: int = 1
+
+    # Add variable objects configurations
+    class_choices_list = [lm_choices]
+
+    # If you need to modify train() or eval() procedures, change Trainer class here
+    trainer = Trainer
+
+    @classmethod
+    def add_task_arguments(cls, parser: argparse.ArgumentParser):
+        # NOTE(kamo): Use '_' instead of '-' to avoid confusion
+        assert check_argument_types()
+        group = parser.add_argument_group(description="Task related")
+
+        # NOTE(kamo): add_argument(..., required=True) can't be used here,
+        # because --print_config mode must keep working. Instead, do the following:
+        required = parser.get_default("required")
+        required += ["token_list"]
+
+        group.add_argument(
+            "--token_list",
+            type=str_or_none,
+            default=None,
+            help="A text mapping int-id to token",
+        )
+        group.add_argument(
+            "--init",
+            type=lambda x: str_or_none(x.lower()),
+            default=None,
+            help="The initialization method",
+            choices=[
+                "chainer",
+                "xavier_uniform",
+                "xavier_normal",
+                "kaiming_uniform",
+                "kaiming_normal",
+                None,
+            ],
+        )
+        group.add_argument(
+            "--model_conf",
+            action=NestedDictAction,
+            default=get_default_kwargs(ESPnetLanguageModel),
+            help="The keyword arguments for model class.",
+        )
+
+        group = parser.add_argument_group(description="Preprocess related")
+        group.add_argument(
+            "--use_preprocessor",
+            type=str2bool,
+            default=True,
+            help="Apply preprocessing to data or not",
+        )
+        group.add_argument(
+            "--token_type",
+            type=str,
+            default="bpe",
+            choices=["bpe", "char", "word"],
+            help="The text will be tokenized at the specified level",
+        )
+        group.add_argument(
+            "--bpemodel",
+            type=str_or_none,
+            default=None,
+            help="The model file of sentencepiece",
+        )
+        parser.add_argument(
+            "--non_linguistic_symbols",
+            type=str_or_none,
+            help="non_linguistic_symbols file path",
+        )
+        parser.add_argument(
+            "--cleaner",
+            type=str_or_none,
+            choices=[None, "tacotron", "jaconv", "vietnamese"],
+            default=None,
+            help="Apply text cleaning",
+        )
+        parser.add_argument(
+            "--g2p",
+            type=str_or_none,
+            choices=g2p_choices,
+            default=None,
+            help="Specify g2p method if --token_type=phn",
+        )
+
+        for class_choices in cls.class_choices_list:
+            class_choices.add_arguments(group)
+
+        assert check_return_type(parser)
+        return parser
+
+    @classmethod
+    def build_collate_fn(
+            cls, args: argparse.Namespace, train: bool
+    ) -> Callable[
+        [Collection[Tuple[str, Dict[str, np.ndarray]]]],
+        Tuple[List[str], Dict[str, torch.Tensor]],
+    ]:
+        assert check_argument_types()
+        return CommonCollateFn(int_pad_value=0)
+
+    @classmethod
+    def build_preprocess_fn(
+            cls, args: argparse.Namespace, train: bool
+    ) -> Optional[Callable[[str, Dict[str, np.ndarray]], Dict[str, np.ndarray]]]:
+        assert check_argument_types()
+        if args.use_preprocessor:
+            retval = CommonPreprocessor(
+                train=train,
+                token_type=args.token_type,
+                token_list=args.token_list,
+                bpemodel=args.bpemodel,
+                text_cleaner=args.cleaner,
+                g2p_type=args.g2p,
+                non_linguistic_symbols=args.non_linguistic_symbols,
+            )
+        else:
+            retval = None
+        assert check_return_type(retval)
+        return retval
+
+    @classmethod
+    def required_data_names(
+            cls, train: bool = True, inference: bool = False
+    ) -> Tuple[str, ...]:
+        retval = ("text",)
+        return retval
+
+    @classmethod
+    def optional_data_names(
+            cls, train: bool = True, inference: bool = False
+    ) -> Tuple[str, ...]:
+        retval = ()
+        return retval
+
+    @classmethod
+    def build_model(cls, args: argparse.Namespace) -> ESPnetLanguageModel:
+        assert check_argument_types()
+        if isinstance(args.token_list, str):
+            with open(args.token_list, encoding="utf-8") as f:
+                token_list = [line.rstrip() for line in f]
+
+            # "args" is saved as it is in a yaml file by BaseTask.main().
+            # Overwriting token_list to keep it as "portable".
+            args.token_list = token_list.copy()
+        elif isinstance(args.token_list, (tuple, list)):
+            token_list = args.token_list.copy()
+        else:
+            raise RuntimeError("token_list must be str or list")
+
+        vocab_size = len(token_list)
+        logging.info(f"Vocabulary size: {vocab_size}")
+
+        # 1. Build LM model
+        lm_class = lm_choices.get_class(args.lm)
+        lm = lm_class(vocab_size=vocab_size, **args.lm_conf)
+
+        # 2. Build ESPnetModel
+        # Assume the last-id is sos_and_eos
+        model = ESPnetLanguageModel(lm=lm, vocab_size=vocab_size, **args.model_conf)
+
+        # 3. Initialize
+        if args.init is not None:
+            initialize(model, args.init)
+
+        assert check_return_type(model)
+        return model
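Review note: both `build_model` implementations above normalize `args.token_list` the same way before computing `vocab_size`. The branch logic can be sketched in isolation as below. The helper name `load_token_list` is hypothetical and is not part of this patch; the sketch only assumes what the code above shows, namely that a `str` is a path to a file with one token per line and that a tuple or list is copied.

```python
from typing import List, Sequence, Union


def load_token_list(token_list: Union[str, Sequence[str]]) -> List[str]:
    """Return the vocabulary as a fresh list of tokens.

    Hypothetical helper mirroring the branches in build_model().
    """
    if isinstance(token_list, str):
        # A str is treated as a path to a one-token-per-line text file.
        with open(token_list, encoding="utf-8") as f:
            # rstrip() drops the newline and any other trailing whitespace.
            return [line.rstrip() for line in f]
    if isinstance(token_list, (tuple, list)):
        # Copy so that later mutation does not affect the caller's object.
        return list(token_list)
    raise RuntimeError("token_list must be str or list")


if __name__ == "__main__":
    tokens = load_token_list(("<blank>", "<unk>", "a", "b", "<sos/eos>"))
    # len(tokens) is what build_model() passes on as vocab_size.
    print(len(tokens))
```

Returning a copy in the list branch matches the patch, where `args.token_list` is overwritten with a plain list so that the saved configuration stays portable.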
diff --git a/funasr/text/__init__.py b/funasr/text/__init__.py
new file mode 100644
index 000000000..e69de29bb
diff --git a/funasr/text/abs_tokenizer.py b/funasr/text/abs_tokenizer.py
new file mode 100644
index 000000000..fc2ccb3c3
--- /dev/null
+++ b/funasr/text/abs_tokenizer.py
@@ -0,0 +1,14 @@
+from abc import ABC
+from abc import abstractmethod
+from typing import Iterable
+from typing import List
+
+
+class AbsTokenizer(ABC):
+    @abstractmethod
+    def text2tokens(self, line: str) -> List[str]:
+        raise NotImplementedError
+
+    @abstractmethod
+    def tokens2text(self, tokens: Iterable[str]) -> str:
+        raise NotImplementedError
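Review note: `AbsTokenizer` only fixes the two-method contract that every tokenizer in `funasr/text` implements. A minimal sketch of a conforming subclass follows. The interface is re-declared locally so the snippet runs without funasr installed, and `WhitespaceTokenizer` is a hypothetical example class, not something this patch adds.

```python
from abc import ABC, abstractmethod
from typing import Iterable, List


class AbsTokenizer(ABC):
    # Local copy of the interface from abs_tokenizer.py above.
    @abstractmethod
    def text2tokens(self, line: str) -> List[str]:
        raise NotImplementedError

    @abstractmethod
    def tokens2text(self, tokens: Iterable[str]) -> str:
        raise NotImplementedError


class WhitespaceTokenizer(AbsTokenizer):
    """Hypothetical subclass: splits on runs of whitespace."""

    def text2tokens(self, line: str) -> List[str]:
        return line.split()

    def tokens2text(self, tokens: Iterable[str]) -> str:
        return " ".join(tokens)


if __name__ == "__main__":
    tok = WhitespaceTokenizer()
    print(tok.text2tokens("hello  world"))
    print(tok.tokens2text(["hello", "world"]))
```

Because both methods are abstract, instantiating `AbsTokenizer` directly raises `TypeError`; a subclass must override both before it can be constructed.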
diff --git a/funasr/text/build_tokenizer.py b/funasr/text/build_tokenizer.py
new file mode 100644
index 000000000..8e29d3ed5
--- /dev/null
+++ b/funasr/text/build_tokenizer.py
@@ -0,0 +1,63 @@
+from pathlib import Path
+from typing import Iterable
+from typing import Union
+
+from typeguard import check_argument_types
+
+from funasr.text.abs_tokenizer import AbsTokenizer
+from funasr.text.char_tokenizer import CharTokenizer
+from funasr.text.phoneme_tokenizer import PhonemeTokenizer
+from funasr.text.sentencepiece_tokenizer import SentencepiecesTokenizer
+from funasr.text.word_tokenizer import WordTokenizer
+
+
+def build_tokenizer(
+    token_type: str,
+    bpemodel: Union[Path, str, Iterable[str]] = None,
+    non_linguistic_symbols: Union[Path, str, Iterable[str]] = None,
+    remove_non_linguistic_symbols: bool = False,
+    space_symbol: str = "<space>",
+    delimiter: str = None,
+    g2p_type: str = None,
+) -> AbsTokenizer:
+    """A helper function to instantiate Tokenizer"""
+    assert check_argument_types()
+    if token_type == "bpe":
+        if bpemodel is None:
+            raise ValueError('bpemodel is required if token_type = "bpe"')
+
+        if remove_non_linguistic_symbols:
+            raise RuntimeError(
+                "remove_non_linguistic_symbols is not implemented for token_type=bpe"
+            )
+        return SentencepiecesTokenizer(bpemodel)
+
+    elif token_type == "word":
+        if remove_non_linguistic_symbols and non_linguistic_symbols is not None:
+            return WordTokenizer(
+                delimiter=delimiter,
+                non_linguistic_symbols=non_linguistic_symbols,
+                remove_non_linguistic_symbols=True,
+            )
+        else:
+            return WordTokenizer(delimiter=delimiter)
+
+    elif token_type == "char":
+        return CharTokenizer(
+            non_linguistic_symbols=non_linguistic_symbols,
+            space_symbol=space_symbol,
+            remove_non_linguistic_symbols=remove_non_linguistic_symbols,
+        )
+
+    elif token_type == "phn":
+        return PhonemeTokenizer(
+            g2p_type=g2p_type,
+            non_linguistic_symbols=non_linguistic_symbols,
+            space_symbol=space_symbol,
+            remove_non_linguistic_symbols=remove_non_linguistic_symbols,
+        )
+
+    else:
+        raise ValueError(
+            f"token_type must be one of bpe, word, char or phn: {token_type}"
+        )
diff --git a/funasr/text/char_tokenizer.py b/funasr/text/char_tokenizer.py
new file mode 100644
index 000000000..00ae42732
--- /dev/null
+++ b/funasr/text/char_tokenizer.py
@@ -0,0 +1,62 @@
+from pathlib import Path
+from typing import Iterable
+from typing import List
+from typing import Union
+import warnings
+
+from typeguard import check_argument_types
+
+from funasr.text.abs_tokenizer import AbsTokenizer
+
+
+class CharTokenizer(AbsTokenizer):
+    def __init__(
+        self,
+        non_linguistic_symbols: Union[Path, str, Iterable[str]] = None,
+        space_symbol: str = "<space>",
+        remove_non_linguistic_symbols: bool = False,
+    ):
+        assert check_argument_types()
+        self.space_symbol = space_symbol
+        if non_linguistic_symbols is None:
+            self.non_linguistic_symbols = set()
+        elif isinstance(non_linguistic_symbols, (Path, str)):
+            non_linguistic_symbols = Path(non_linguistic_symbols)
+            try:
+                with non_linguistic_symbols.open("r", encoding="utf-8") as f:
+                    self.non_linguistic_symbols = set(line.rstrip() for line in f)
+            except FileNotFoundError:
+                warnings.warn(f"{non_linguistic_symbols} doesn't exist.")
+                self.non_linguistic_symbols = set()
+        else:
+            self.non_linguistic_symbols = set(non_linguistic_symbols)
+        self.remove_non_linguistic_symbols = remove_non_linguistic_symbols
+
+    def __repr__(self):
+        return (
+            f"{self.__class__.__name__}("
+            f'space_symbol="{self.space_symbol}", '
+            f'non_linguistic_symbols="{self.non_linguistic_symbols}"'
+            f")"
+        )
+
+    def text2tokens(self, line: Union[str, list]) -> List[str]:
+        tokens = []
+        while len(line) != 0:
+            for w in self.non_linguistic_symbols:
+                if line.startswith(w):
+                    if not self.remove_non_linguistic_symbols:
+                        tokens.append(line[: len(w)])
+                    line = line[len(w) :]
+                    break
+            else:
+                t = line[0]
+                if t == " ":
+                    t = self.space_symbol
+                tokens.append(t)
+                line = line[1:]
+        return tokens
+
+    def tokens2text(self, tokens: Iterable[str]) -> str:
+        tokens = [t if t != self.space_symbol else " " for t in tokens]
+        return "".join(tokens)
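Review note: `CharTokenizer.text2tokens` is a greedy left-to-right scan. At each position it first tries to match one of the non-linguistic symbols as a whole token and only then falls back to a single character. A standalone sketch of that loop is given below. The function name `char_tokenize` is hypothetical, and the filter that skips empty symbols is an addition of this sketch (an empty string matches every prefix and would never advance the scan); it is not present in the class above.

```python
from typing import Iterable, List


def char_tokenize(
    line: str,
    non_linguistic_symbols: Iterable[str] = (),
    space_symbol: str = "<space>",
    remove_non_linguistic_symbols: bool = False,
) -> List[str]:
    """Hypothetical standalone version of the CharTokenizer scan."""
    # Assumption of this sketch: ignore empty symbols to guarantee progress.
    symbols = [w for w in non_linguistic_symbols if w]
    tokens: List[str] = []
    while line:
        for w in symbols:
            if line.startswith(w):
                # Keep or drop the symbol, but always consume it.
                if not remove_non_linguistic_symbols:
                    tokens.append(w)
                line = line[len(w):]
                break
        else:
            # No symbol matched: emit one character, mapping " " to space_symbol.
            t = line[0]
            tokens.append(space_symbol if t == " " else t)
            line = line[1:]
    return tokens


if __name__ == "__main__":
    print(char_tokenize("hi <noise>!", ["<noise>"]))
```

The `for ... else` form is the same construct the class uses: the `else` branch runs only when the loop finishes without hitting `break`.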
diff --git a/funasr/text/cleaner.py b/funasr/text/cleaner.py
new file mode 100644
index 000000000..be26940c4
--- /dev/null
+++ b/funasr/text/cleaner.py
@@ -0,0 +1,49 @@
+from typing import Collection
+
+from jaconv import jaconv
+import tacotron_cleaner.cleaners
+from typeguard import check_argument_types
+from funasr.text.korean_cleaner import KoreanCleaner
+
+try:
+    from vietnamese_cleaner import vietnamese_cleaners
+except ImportError:
+    vietnamese_cleaners = None
+
+
+class TextCleaner:
+    """Text cleaner.
+
+    Examples:
+        >>> cleaner = TextCleaner("tacotron")
+        >>> cleaner("(Hello-World);   &  jr. & dr.")
+        'HELLO WORLD, AND JUNIOR AND DOCTOR'
+
+    """
+
+    def __init__(self, cleaner_types: Collection[str] = None):
+        assert check_argument_types()
+
+        if cleaner_types is None:
+            self.cleaner_types = []
+        elif isinstance(cleaner_types, str):
+            self.cleaner_types = [cleaner_types]
+        else:
+            self.cleaner_types = list(cleaner_types)
+
+    def __call__(self, text: str) -> str:
+        for t in self.cleaner_types:
+            if t == "tacotron":
+                text = tacotron_cleaner.cleaners.custom_english_cleaners(text)
+            elif t == "jaconv":
+                text = jaconv.normalize(text)
+            elif t == "vietnamese":
+                if vietnamese_cleaners is None:
+                    raise RuntimeError("Please install underthesea")
+                text = vietnamese_cleaners.vietnamese_cleaner(text)
+            elif t == "korean_cleaner":
+                text = KoreanCleaner.normalize_text(text)
+            else:
+                raise RuntimeError(f"Not supported: type={t}")
+
+        return text
diff --git a/funasr/text/korean_cleaner.py b/funasr/text/korean_cleaner.py
new file mode 100644
index 000000000..ee556d42a
--- /dev/null
+++ b/funasr/text/korean_cleaner.py
@@ -0,0 +1,77 @@
+# Referenced from https://github.com/hccho2/Tacotron-Wavenet-Vocoder-Korean
+
+import re
+
+
+class KoreanCleaner:
+    @classmethod
+    def _normalize_numbers(cls, text):
+        number_to_kor = {
+            "0": "영",
+            "1": "일",
+            "2": "이",
+            "3": "삼",
+            "4": "사",
+            "5": "오",
+            "6": "육",
+            "7": "칠",
+            "8": "팔",
+            "9": "구",
+        }
+        new_text = "".join(
+            number_to_kor[char] if char in number_to_kor.keys() else char
+            for char in text
+        )
+        return new_text
+
+    @classmethod
+    def _normalize_english_text(cls, text):
+        upper_alphabet_to_kor = {
+            "A": "에이",
+            "B": "비",
+            "C": "씨",
+            "D": "디",
+            "E": "이",
+            "F": "에프",
+            "G": "지",
+            "H": "에이치",
+            "I": "아이",
+            "J": "제이",
+            "K": "케이",
+            "L": "엘",
+            "M": "엠",
+            "N": "엔",
+            "O": "오",
+            "P": "피",
+            "Q": "큐",
+            "R": "알",
+            "S": "에스",
+            "T": "티",
+            "U": "유",
+            "V": "브이",
+            "W": "더블유",
+            "X": "엑스",
+            "Y": "와이",
+            "Z": "지",
+        }
+        new_text = re.sub("[a-z]+", lambda x: str.upper(x.group()), text)
+        new_text = "".join(
+            upper_alphabet_to_kor[char]
+            if char in upper_alphabet_to_kor.keys()
+            else char
+            for char in new_text
+        )
+
+        return new_text
+
+    @classmethod
+    def normalize_text(cls, text):
+        # stage 0 : text strip
+        text = text.strip()
+
+        # stage 1 : normalize numbers
+        text = cls._normalize_numbers(text)
+
+        # stage 2 : normalize english text
+        text = cls._normalize_english_text(text)
+        return text
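Review note: `KoreanCleaner._normalize_numbers` replaces digits one character at a time, so a number is read digit by digit rather than as a quantity. The same mapping can be sketched with `str.translate`, as below. The function name `normalize_digits` is hypothetical; the table is copied from the class above.

```python
# Digit-to-Hangul table copied from KoreanCleaner._normalize_numbers.
_DIGIT_TO_KOR = str.maketrans(
    {
        "0": "영",
        "1": "일",
        "2": "이",
        "3": "삼",
        "4": "사",
        "5": "오",
        "6": "육",
        "7": "칠",
        "8": "팔",
        "9": "구",
    }
)


def normalize_digits(text: str) -> str:
    """Hypothetical equivalent of KoreanCleaner._normalize_numbers."""
    # Characters without an entry in the table pass through unchanged.
    return text.translate(_DIGIT_TO_KOR)


if __name__ == "__main__":
    # Each digit is mapped on its own: 2, 0, 2, 3.
    print(normalize_digits("2023년"))
```

A consequence worth keeping in mind when preparing transcripts: "2023" becomes "이영이삼", not the spoken form of the full number.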
diff --git a/funasr/text/phoneme_tokenizer.py b/funasr/text/phoneme_tokenizer.py
new file mode 100644
index 000000000..d424b40a8
--- /dev/null
+++ b/funasr/text/phoneme_tokenizer.py
@@ -0,0 +1,528 @@
+import logging
+from pathlib import Path
+import re
+from typing import Iterable
+from typing import List
+from typing import Optional
+from typing import Union
+import warnings
+
+# import g2p_en
+import jamo
+from typeguard import check_argument_types
+
+from funasr.text.abs_tokenizer import AbsTokenizer
+
+
+g2p_choices = [
+    None,
+    "g2p_en",
+    "g2p_en_no_space",
+    "pyopenjtalk",
+    "pyopenjtalk_kana",
+    "pyopenjtalk_accent",
+    "pyopenjtalk_accent_with_pause",
+    "pyopenjtalk_prosody",
+    "pypinyin_g2p",
+    "pypinyin_g2p_phone",
+    "espeak_ng_arabic",
+    "espeak_ng_german",
+    "espeak_ng_french",
+    "espeak_ng_spanish",
+    "espeak_ng_russian",
+    "espeak_ng_greek",
+    "espeak_ng_finnish",
+    "espeak_ng_hungarian",
+    "espeak_ng_dutch",
+    "espeak_ng_english_us_vits",
+    "espeak_ng_hindi",
+    "g2pk",
+    "g2pk_no_space",
+    "korean_jaso",
+    "korean_jaso_no_space",
+]
+
+
+def split_by_space(text) -> List[str]:
+    if "   " in text:
+        text = text.replace("   ", " <space> ")
+        return [c.replace("<space>", " ") for c in text.split(" ")]
+    else:
+        return text.split(" ")
+
+
+def pyopenjtalk_g2p(text) -> List[str]:
+    import pyopenjtalk
+
+    # phones is a str object separated by space
+    phones = pyopenjtalk.g2p(text, kana=False)
+    phones = phones.split(" ")
+    return phones
+
+
+def pyopenjtalk_g2p_accent(text) -> List[str]:
+    import pyopenjtalk
+    import re
+
+    phones = []
+    for labels in pyopenjtalk.run_frontend(text)[1]:
+        p = re.findall(r"\-(.*?)\+.*?\/A:([0-9\-]+).*?\/F:.*?_([0-9]+)", labels)
+        if len(p) == 1:
+            phones += [p[0][0], p[0][2], p[0][1]]
+    return phones
+
+
+def pyopenjtalk_g2p_accent_with_pause(text) -> List[str]:
+    import pyopenjtalk
+    import re
+
+    phones = []
+    for labels in pyopenjtalk.run_frontend(text)[1]:
+        if labels.split("-")[1].split("+")[0] == "pau":
+            phones += ["pau"]
+            continue
+        p = re.findall(r"\-(.*?)\+.*?\/A:([0-9\-]+).*?\/F:.*?_([0-9]+)", labels)
+        if len(p) == 1:
+            phones += [p[0][0], p[0][2], p[0][1]]
+    return phones
+
+
+def pyopenjtalk_g2p_kana(text) -> List[str]:
+    import pyopenjtalk
+
+    kanas = pyopenjtalk.g2p(text, kana=True)
+    return list(kanas)
+
+
+def pyopenjtalk_g2p_prosody(text: str, drop_unvoiced_vowels: bool = True) -> List[str]:
+    """Extract phoneme + prosody symbol sequence from input full-context labels.
+
+    The algorithm is based on `Prosodic features control by symbols as input of
+    sequence-to-sequence acoustic modeling for neural TTS`_ with some r9y9's tweaks.
+
+    Args:
+        text (str): Input text.
+        drop_unvoiced_vowels (bool): whether to drop unvoiced vowels.
+
+    Returns:
+        List[str]: List of phoneme + prosody symbols.
+
+    Examples:
+        >>> from funasr.text.phoneme_tokenizer import pyopenjtalk_g2p_prosody
+        >>> pyopenjtalk_g2p_prosody("こんにちは。")
+        ['^', 'k', 'o', '[', 'N', 'n', 'i', 'ch', 'i', 'w', 'a', '$']
+
+    .. _`Prosodic features control by symbols as input of sequence-to-sequence acoustic
+        modeling for neural TTS`: https://doi.org/10.1587/transinf.2020EDP7104
+
+    """
+    import pyopenjtalk
+
+    labels = pyopenjtalk.run_frontend(text)[1]
+    N = len(labels)
+
+    phones = []
+    for n in range(N):
+        lab_curr = labels[n]
+
+        # current phoneme
+        p3 = re.search(r"\-(.*?)\+", lab_curr).group(1)
+
+        # treat unvoiced vowels as normal vowels
+        if drop_unvoiced_vowels and p3 in "AEIOU":
+            p3 = p3.lower()
+
+        # deal with sil at the beginning and the end of text
+        if p3 == "sil":
+            assert n == 0 or n == N - 1
+            if n == 0:
+                phones.append("^")
+            elif n == N - 1:
+                # check question form or not
+                e3 = _numeric_feature_by_regex(r"!(\d+)_", lab_curr)
+                if e3 == 0:
+                    phones.append("$")
+                elif e3 == 1:
+                    phones.append("?")
+            continue
+        elif p3 == "pau":
+            phones.append("_")
+            continue
+        else:
+            phones.append(p3)
+
+        # accent type and position info (forward or backward)
+        a1 = _numeric_feature_by_regex(r"/A:([0-9\-]+)\+", lab_curr)
+        a2 = _numeric_feature_by_regex(r"\+(\d+)\+", lab_curr)
+        a3 = _numeric_feature_by_regex(r"\+(\d+)/", lab_curr)
+
+        # number of mora in accent phrase
+        f1 = _numeric_feature_by_regex(r"/F:(\d+)_", lab_curr)
+
+        a2_next = _numeric_feature_by_regex(r"\+(\d+)\+", labels[n + 1])
+        # accent phrase border
+        if a3 == 1 and a2_next == 1 and p3 in "aeiouAEIOUNcl":
+            phones.append("#")
+        # pitch falling
+        elif a1 == 0 and a2_next == a2 + 1 and a2 != f1:
+            phones.append("]")
+        # pitch rising
+        elif a2 == 1 and a2_next == 2:
+            phones.append("[")
+
+    return phones
+
+
+def _numeric_feature_by_regex(regex, s):
+    match = re.search(regex, s)
+    if match is None:
+        return -50
+    return int(match.group(1))
+
+
+def pypinyin_g2p(text) -> List[str]:
+    from pypinyin import pinyin
+    from pypinyin import Style
+
+    phones = [phone[0] for phone in pinyin(text, style=Style.TONE3)]
+    return phones
+
+
+def pypinyin_g2p_phone(text) -> List[str]:
+    from pypinyin import pinyin
+    from pypinyin import Style
+    from pypinyin.style._utils import get_finals
+    from pypinyin.style._utils import get_initials
+
+    phones = [
+        p
+        for phone in pinyin(text, style=Style.TONE3)
+        for p in [
+            get_initials(phone[0], strict=True),
+            get_finals(phone[0], strict=True),
+        ]
+        if len(p) != 0
+    ]
+    return phones
+
+
+class G2p_en:
+    """On behalf of g2p_en.G2p.
+
+    g2p_en.G2p isn't picklable and it can't be copied to the other processes
+    via multiprocessing module.
+    As a workaround, g2p_en.G2p is instantiated upon calling this class.
+
+    """
+
+    def __init__(self, no_space: bool = False):
+        self.no_space = no_space
+        self.g2p = None
+
+    def __call__(self, text) -> List[str]:
+        if self.g2p is None:
+            # Delayed import: the module-level "import g2p_en" is commented out.
+            import g2p_en
+            self.g2p = g2p_en.G2p()
+        phones = self.g2p(text)
+        if self.no_space:
+            # remove the space that represents the word separator
+            phones = list(filter(lambda s: s != " ", phones))
+        return phones
+
+
+class G2pk:
+    """On behalf of g2pk.G2p.
+
+    g2pk.G2p isn't picklable and it can't be copied to the other processes
+    via multiprocessing module.
+    As a workaround, g2pk.G2p is instantiated upon calling this class.
+
+    """
+
+    def __init__(
+        self, descritive=False, group_vowels=False, to_syl=False, no_space=False
+    ):
+        self.descritive = descritive
+        self.group_vowels = group_vowels
+        self.to_syl = to_syl
+        self.no_space = no_space
+        self.g2p = None
+
+    def __call__(self, text) -> List[str]:
+        if self.g2p is None:
+            import g2pk
+
+            self.g2p = g2pk.G2p()
+
+        phones = list(
+            self.g2p(
+                text,
+                descriptive=self.descritive,
+                group_vowels=self.group_vowels,
+                to_syl=self.to_syl,
+            )
+        )
+        if self.no_space:
+            # remove the space that represents the word separator
+            phones = list(filter(lambda s: s != " ", phones))
+        return phones
+
+
+class Jaso:
+    PUNC = "!'(),-.:;?"
+    SPACE = " "
+
+    JAMO_LEADS = "".join([chr(_) for _ in range(0x1100, 0x1113)])
+    JAMO_VOWELS = "".join([chr(_) for _ in range(0x1161, 0x1176)])
+    JAMO_TAILS = "".join([chr(_) for _ in range(0x11A8, 0x11C3)])
+
+    VALID_CHARS = JAMO_LEADS + JAMO_VOWELS + JAMO_TAILS + PUNC + SPACE
+
+    def __init__(self, space_symbol=" ", no_space=False):
+        self.space_symbol = space_symbol
+        self.no_space = no_space
+
+    def _text_to_jaso(self, line: str) -> List[str]:
+        jasos = list(jamo.hangul_to_jamo(line))
+        return jasos
+
+    def _remove_non_korean_characters(self, tokens):
+        new_tokens = [token for token in tokens if token in self.VALID_CHARS]
+        return new_tokens
+
+    def __call__(self, text) -> List[str]:
+        graphemes = [x for x in self._text_to_jaso(text)]
+        graphemes = self._remove_non_korean_characters(graphemes)
+
+        if self.no_space:
+            graphemes = list(filter(lambda s: s != " ", graphemes))
+        else:
+            graphemes = [x if x != " " else self.space_symbol for x in graphemes]
+        return graphemes
+
+
+class Phonemizer:
+    """Phonemizer module for various languages.
+
+    This is a wrapper module for https://github.com/bootphon/phonemizer.
+    You can define various g2p modules by specifying options for phonemizer.
+
+    See available options:
+        https://github.com/bootphon/phonemizer/blob/master/phonemizer/phonemize.py#L32
+
+    """
+
+    def __init__(
+        self,
+        backend,
+        word_separator: Optional[str] = None,
+        syllable_separator: Optional[str] = None,
+        phone_separator: Optional[str] = " ",
+        strip=False,
+        split_by_single_token: bool = False,
+        **phonemizer_kwargs,
+    ):
+        # delayed import
+        from phonemizer.backend import BACKENDS
+        from phonemizer.separator import Separator
+
+        self.separator = Separator(
+            word=word_separator,
+            syllable=syllable_separator,
+            phone=phone_separator,
+        )
+
+        # define logger to suppress the warning in phonemizer
+        logger = logging.getLogger("phonemizer")
+        logger.setLevel(logging.ERROR)
+        self.phonemizer = BACKENDS[backend](
+            **phonemizer_kwargs,
+            logger=logger,
+        )
+        self.strip = strip
+        self.split_by_single_token = split_by_single_token
+
+    def __call__(self, text) -> List[str]:
+        tokens = self.phonemizer.phonemize(
+            [text],
+            separator=self.separator,
+            strip=self.strip,
+            njobs=1,
+        )[0]
+        if not self.split_by_single_token:
+            return tokens.split()
+        else:
+            # "a: ab" -> ["a", ":", "<space>",  "a", "b"]
+            # TODO(kan-bayashi): space replacement should be dealt in PhonemeTokenizer
+            return [c.replace(" ", "<space>") for c in tokens]
+
+
+class PhonemeTokenizer(AbsTokenizer):
+    def __init__(
+        self,
+        g2p_type: Union[None, str],
+        non_linguistic_symbols: Union[Path, str, Iterable[str]] = None,
+        space_symbol: str = "<space>",
+        remove_non_linguistic_symbols: bool = False,
+    ):
+        assert check_argument_types()
+        if g2p_type is None:
+            self.g2p = split_by_space
+        elif g2p_type == "g2p_en":
+            self.g2p = G2p_en(no_space=False)
+        elif g2p_type == "g2p_en_no_space":
+            self.g2p = G2p_en(no_space=True)
+        elif g2p_type == "pyopenjtalk":
+            self.g2p = pyopenjtalk_g2p
+        elif g2p_type == "pyopenjtalk_kana":
+            self.g2p = pyopenjtalk_g2p_kana
+        elif g2p_type == "pyopenjtalk_accent":
+            self.g2p = pyopenjtalk_g2p_accent
+        elif g2p_type == "pyopenjtalk_accent_with_pause":
+            self.g2p = pyopenjtalk_g2p_accent_with_pause
+        elif g2p_type == "pyopenjtalk_prosody":
+            self.g2p = pyopenjtalk_g2p_prosody
+        elif g2p_type == "pypinyin_g2p":
+            self.g2p = pypinyin_g2p
+        elif g2p_type == "pypinyin_g2p_phone":
+            self.g2p = pypinyin_g2p_phone
+        elif g2p_type == "espeak_ng_arabic":
+            self.g2p = Phonemizer(
+                language="ar",
+                backend="espeak",
+                with_stress=True,
+                preserve_punctuation=True,
+            )
+        elif g2p_type == "espeak_ng_german":
+            self.g2p = Phonemizer(
+                language="de",
+                backend="espeak",
+                with_stress=True,
+                preserve_punctuation=True,
+            )
+        elif g2p_type == "espeak_ng_french":
+            self.g2p = Phonemizer(
+                language="fr-fr",
+                backend="espeak",
+                with_stress=True,
+                preserve_punctuation=True,
+            )
+        elif g2p_type == "espeak_ng_spanish":
+            self.g2p = Phonemizer(
+                language="es",
+                backend="espeak",
+                with_stress=True,
+                preserve_punctuation=True,
+            )
+        elif g2p_type == "espeak_ng_russian":
+            self.g2p = Phonemizer(
+                language="ru",
+                backend="espeak",
+                with_stress=True,
+                preserve_punctuation=True,
+            )
+        elif g2p_type == "espeak_ng_greek":
+            self.g2p = Phonemizer(
+                language="el",
+                backend="espeak",
+                with_stress=True,
+                preserve_punctuation=True,
+            )
+        elif g2p_type == "espeak_ng_finnish":
+            self.g2p = Phonemizer(
+                language="fi",
+                backend="espeak",
+                with_stress=True,
+                preserve_punctuation=True,
+            )
+        elif g2p_type == "espeak_ng_hungarian":
+            self.g2p = Phonemizer(
+                language="hu",
+                backend="espeak",
+                with_stress=True,
+                preserve_punctuation=True,
+            )
+        elif g2p_type == "espeak_ng_dutch":
+            self.g2p = Phonemizer(
+                language="nl",
+                backend="espeak",
+                with_stress=True,
+                preserve_punctuation=True,
+            )
+        elif g2p_type == "espeak_ng_hindi":
+            self.g2p = Phonemizer(
+                language="hi",
+                backend="espeak",
+                with_stress=True,
+                preserve_punctuation=True,
+            )
+        elif g2p_type == "g2pk":
+            self.g2p = G2pk(no_space=False)
+        elif g2p_type == "g2pk_no_space":
+            self.g2p = G2pk(no_space=True)
+        elif g2p_type == "espeak_ng_english_us_vits":
+            # VITS official implementation-like processing
+            # Reference: https://github.com/jaywalnut310/vits
+            self.g2p = Phonemizer(
+                language="en-us",
+                backend="espeak",
+                with_stress=True,
+                preserve_punctuation=True,
+                strip=True,
+                word_separator=" ",
+                phone_separator="",
+                split_by_single_token=True,
+            )
+        elif g2p_type == "korean_jaso":
+            self.g2p = Jaso(space_symbol=space_symbol, no_space=False)
+        elif g2p_type == "korean_jaso_no_space":
+            self.g2p = Jaso(no_space=True)
+        else:
+            raise NotImplementedError(f"Not supported: g2p_type={g2p_type}")
+
+        self.g2p_type = g2p_type
+        self.space_symbol = space_symbol
+        if non_linguistic_symbols is None:
+            self.non_linguistic_symbols = set()
+        elif isinstance(non_linguistic_symbols, (Path, str)):
+            non_linguistic_symbols = Path(non_linguistic_symbols)
+            try:
+                with non_linguistic_symbols.open("r", encoding="utf-8") as f:
+                    self.non_linguistic_symbols = set(line.rstrip() for line in f)
+            except FileNotFoundError:
+                warnings.warn(f"{non_linguistic_symbols} doesn't exist.")
+                self.non_linguistic_symbols = set()
+        else:
+            self.non_linguistic_symbols = set(non_linguistic_symbols)
+        self.remove_non_linguistic_symbols = remove_non_linguistic_symbols
+
+    def __repr__(self):
+        return (
+            f"{self.__class__.__name__}("
+            f'g2p_type="{self.g2p_type}", '
+            f'space_symbol="{self.space_symbol}", '
+            f'non_linguistic_symbols="{self.non_linguistic_symbols}"'
+            ")"
+        )
+
+    def text2tokens(self, line: str) -> List[str]:
+        tokens = []
+        while len(line) != 0:
+            for w in self.non_linguistic_symbols:
+                if line.startswith(w):
+                    if not self.remove_non_linguistic_symbols:
+                        tokens.append(line[: len(w)])
+                    line = line[len(w) :]
+                    break
+            else:
+                t = line[0]
+                tokens.append(t)
+                line = line[1:]
+
+        line = "".join(tokens)
+        tokens = self.g2p(line)
+        return tokens
+
+    def tokens2text(self, tokens: Iterable[str]) -> str:
+        # phoneme type is not invertible
+        return "".join(tokens)
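Note on `PhonemeTokenizer.text2tokens` above: the loop scans the line, consuming a known non-linguistic symbol when one matches at the current position and a single character otherwise. The following is a minimal standalone sketch of that scan, not the code in this patch. The helper name `strip_non_linguistic`, the longest-first ordering, and the guard against empty symbols are assumptions added for illustration.

```python
from typing import Iterable, List


def strip_non_linguistic(line: str, symbols: Iterable[str], remove: bool) -> List[str]:
    """Split `line` into characters, keeping or dropping known symbols."""
    # Assumption: try longer symbols first so overlapping symbols
    # (e.g. "<noise>" and "<noise2>") resolve deterministically.
    ordered = sorted(symbols, key=len, reverse=True)
    tokens: List[str] = []
    while line:
        for sym in ordered:
            # Assumption: skip empty symbols, which would never consume input.
            if sym and line.startswith(sym):
                if not remove:
                    tokens.append(sym)
                line = line[len(sym):]
                break
        else:
            # No symbol matched at this position: consume one character.
            tokens.append(line[0])
            line = line[1:]
    return tokens


print(strip_non_linguistic("hi<noise>!", {"<noise>"}, remove=True))   # ['h', 'i', '!']
print(strip_non_linguistic("hi<noise>!", {"<noise>"}, remove=False))  # ['h', 'i', '<noise>', '!']
```

In the patch itself the resulting tokens are joined back into a string before being handed to the g2p function, so the scan only matters when symbols are removed.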
diff --git a/funasr/text/sentencepiece_tokenizer.py b/funasr/text/sentencepiece_tokenizer.py
new file mode 100644
index 000000000..e4cc15272
--- /dev/null
+++ b/funasr/text/sentencepiece_tokenizer.py
@@ -0,0 +1,38 @@
+from pathlib import Path
+from typing import Iterable
+from typing import List
+from typing import Union
+
+import sentencepiece as spm
+from typeguard import check_argument_types
+
+from funasr.text.abs_tokenizer import AbsTokenizer
+
+
+class SentencepiecesTokenizer(AbsTokenizer):
+    def __init__(self, model: Union[Path, str]):
+        assert check_argument_types()
+        self.model = str(model)
+        # NOTE(kamo):
+        # Don't build SentencePieceProcessor in __init__()
+        # because it isn't picklable and may cause the following error,
+        # "TypeError: can't pickle SwigPyObject objects",
+        # when this tokenizer is passed to "multiprocessing.Process()".
+        self.sp = None
+
+    def __repr__(self):
+        return f'{self.__class__.__name__}(model="{self.model}")'
+
+    def _build_sentence_piece_processor(self):
+        # Build SentencePieceProcessor lazily.
+        if self.sp is None:
+            self.sp = spm.SentencePieceProcessor()
+            self.sp.load(self.model)
+
+    def text2tokens(self, line: str) -> List[str]:
+        self._build_sentence_piece_processor()
+        return self.sp.EncodeAsPieces(line)
+
+    def tokens2text(self, tokens: Iterable[str]) -> str:
+        self._build_sentence_piece_processor()
+        return self.sp.DecodePieces(list(tokens))
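Note on the lazy construction in `SentencepiecesTokenizer` above: keeping only the model path in `__init__` lets the object be pickled before its first use. Here is a small sketch of the same pattern under stated assumptions: `LazyTokenizer` is a hypothetical class, and a `threading.Lock` stands in for the unpicklable `SentencePieceProcessor`.

```python
import pickle
import threading
from typing import List


class LazyTokenizer:
    """Holds only a picklable path until the first call needs the heavy object."""

    def __init__(self, model: str):
        self.model = model
        self._engine = None  # built on first use

    def _build(self) -> None:
        if self._engine is None:
            # Stand-in for the real processor; a lock is likewise unpicklable.
            self._engine = threading.Lock()

    def text2tokens(self, line: str) -> List[str]:
        self._build()
        with self._engine:
            return line.split()


tok = LazyTokenizer("bpe.model")         # hypothetical model path
clone = pickle.loads(pickle.dumps(tok))  # works: nothing unpicklable is held yet
print(clone.text2tokens("hello world"))  # ['hello', 'world']
```

In this sketch the object stops being picklable once `_build()` has run, so it should be handed to worker processes before its first call.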
diff --git a/funasr/text/token_id_converter.py b/funasr/text/token_id_converter.py
new file mode 100644
index 000000000..c9a6b2863
--- /dev/null
+++ b/funasr/text/token_id_converter.py
@@ -0,0 +1,60 @@
+from pathlib import Path
+from typing import Dict
+from typing import Iterable
+from typing import List
+from typing import Union
+
+import numpy as np
+from typeguard import check_argument_types
+
+
+class TokenIDConverter:
+    def __init__(
+        self,
+        token_list: Union[Path, str, Iterable[str]],
+        unk_symbol: str = "<unk>",
+    ):
+        assert check_argument_types()
+
+        if isinstance(token_list, (Path, str)):
+            token_list = Path(token_list)
+            self.token_list_repr = str(token_list)
+            self.token_list: List[str] = []
+
+            with token_list.open("r", encoding="utf-8") as f:
+                for line in f:
+                    line = line.rstrip()
+                    self.token_list.append(line)
+
+        else:
+            self.token_list: List[str] = list(token_list)
+            self.token_list_repr = ""
+            for i, t in enumerate(self.token_list):
+                if i == 3:
+                    break
+                self.token_list_repr += f"{t}, "
+            self.token_list_repr += f"... (NVocab={(len(self.token_list))})"
+
+        self.token2id: Dict[str, int] = {}
+        for i, t in enumerate(self.token_list):
+            if t in self.token2id:
+                raise RuntimeError(f'Symbol "{t}" is duplicated')
+            self.token2id[t] = i
+
+        self.unk_symbol = unk_symbol
+        if self.unk_symbol not in self.token2id:
+            raise RuntimeError(
+                f"Unknown symbol '{unk_symbol}' doesn't exist in the token_list"
+            )
+        self.unk_id = self.token2id[self.unk_symbol]
+
+    def get_num_vocabulary_size(self) -> int:
+        return len(self.token_list)
+
+    def ids2tokens(self, integers: Union[np.ndarray, Iterable[int]]) -> List[str]:
+        if isinstance(integers, np.ndarray) and integers.ndim != 1:
+            raise ValueError(f"Must be 1 dim ndarray, but got {integers.ndim}")
+        return [self.token_list[i] for i in integers]
+
+    def tokens2ids(self, tokens: Iterable[str]) -> List[int]:
+        return [self.token2id.get(i, self.unk_id) for i in tokens]
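Note on `TokenIDConverter` above: the mapping is a plain list for id-to-token and a dict for token-to-id, with unknown tokens falling back to the id of `<unk>`. A minimal sketch with an illustrative vocabulary follows. `build_token2id` is a hypothetical helper that mirrors the checks in `__init__`.

```python
from typing import Dict, Iterable


def build_token2id(token_list: Iterable[str], unk_symbol: str = "<unk>") -> Dict[str, int]:
    token2id: Dict[str, int] = {}
    for i, t in enumerate(token_list):
        if t in token2id:
            raise RuntimeError(f'Symbol "{t}" is duplicated')
        token2id[t] = i
    if unk_symbol not in token2id:
        raise RuntimeError(f"Unknown symbol '{unk_symbol}' doesn't exist in the token_list")
    return token2id


vocab = ["<blank>", "<unk>", "a", "b"]  # illustrative vocabulary
token2id = build_token2id(vocab)
unk_id = token2id["<unk>"]

ids = [token2id.get(t, unk_id) for t in ["a", "z", "b"]]
print(ids)                      # [2, 1, 3]
print([vocab[i] for i in ids])  # ['a', '<unk>', 'b']
```

The round trip is lossy by design: the out-of-vocabulary token `z` comes back as `<unk>`.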
diff --git a/funasr/text/word_tokenizer.py b/funasr/text/word_tokenizer.py
new file mode 100644
index 000000000..842734e75
--- /dev/null
+++ b/funasr/text/word_tokenizer.py
@@ -0,0 +1,58 @@
+from pathlib import Path
+from typing import Iterable
+from typing import List
+from typing import Union
+import warnings
+
+from typeguard import check_argument_types
+
+from funasr.text.abs_tokenizer import AbsTokenizer
+
+
+class WordTokenizer(AbsTokenizer):
+    def __init__(
+        self,
+        delimiter: str = None,
+        non_linguistic_symbols: Union[Path, str, Iterable[str]] = None,
+        remove_non_linguistic_symbols: bool = False,
+    ):
+        assert check_argument_types()
+        self.delimiter = delimiter
+
+        if not remove_non_linguistic_symbols and non_linguistic_symbols is not None:
+            warnings.warn(
+                "non_linguistic_symbols is only used "
+                "when remove_non_linguistic_symbols = True"
+            )
+
+        if non_linguistic_symbols is None:
+            self.non_linguistic_symbols = set()
+        elif isinstance(non_linguistic_symbols, (Path, str)):
+            non_linguistic_symbols = Path(non_linguistic_symbols)
+            try:
+                with non_linguistic_symbols.open("r", encoding="utf-8") as f:
+                    self.non_linguistic_symbols = set(line.rstrip() for line in f)
+            except FileNotFoundError:
+                warnings.warn(f"{non_linguistic_symbols} doesn't exist.")
+                self.non_linguistic_symbols = set()
+        else:
+            self.non_linguistic_symbols = set(non_linguistic_symbols)
+        self.remove_non_linguistic_symbols = remove_non_linguistic_symbols
+
+    def __repr__(self):
+        return f'{self.__class__.__name__}(delimiter="{self.delimiter}")'
+
+    def text2tokens(self, line: str) -> List[str]:
+        tokens = []
+        for t in line.split(self.delimiter):
+            if self.remove_non_linguistic_symbols and t in self.non_linguistic_symbols:
+                continue
+            tokens.append(t)
+        return tokens
+
+    def tokens2text(self, tokens: Iterable[str]) -> str:
+        if self.delimiter is None:
+            delimiter = " "
+        else:
+            delimiter = self.delimiter
+        return delimiter.join(tokens)
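Note on `WordTokenizer.text2tokens` above: the result depends on whether `delimiter` is `None` or an explicit string, because `str.split(None)` collapses runs of whitespace while `str.split(" ")` does not. A short sketch of the difference follows. `split_words` is a hypothetical helper, and it always drops the listed symbols.

```python
from typing import List, Optional, Set


def split_words(line: str, delimiter: Optional[str], drop: Set[str]) -> List[str]:
    # str.split(None) collapses whitespace runs; an explicit delimiter keeps
    # the empty strings that sit between consecutive delimiters.
    return [t for t in line.split(delimiter) if t not in drop]


print(split_words("hello  <noise> world", None, {"<noise>"}))  # ['hello', 'world']
print(split_words("hello  <noise> world", " ", {"<noise>"}))   # ['hello', '', 'world']
```

With an explicit delimiter, callers may want to filter out empty tokens before converting them to ids.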
diff --git a/funasr/torch_utils/__init__.py b/funasr/torch_utils/__init__.py
new file mode 100644
index 000000000..e69de29bb
diff --git a/funasr/torch_utils/add_gradient_noise.py b/funasr/torch_utils/add_gradient_noise.py
new file mode 100644
index 000000000..654e928ec
--- /dev/null
+++ b/funasr/torch_utils/add_gradient_noise.py
@@ -0,0 +1,31 @@
+import torch
+
+
+def add_gradient_noise(
+    model: torch.nn.Module,
+    iteration: int,
+    duration: float = 100,
+    eta: float = 1.0,
+    scale_factor: float = 0.55,
+):
+    """Adds noise from a standard normal distribution to the gradients.
+
+    The standard deviation (`sigma`) is controlled
+    by the three hyper-parameters below.
+    `sigma` goes to zero (no noise) with more iterations.
+
+    Args:
+        model: Model.
+        iteration: Number of iterations.
+        duration: {100, 1000}: Number of iterations per interval;
+            `sigma` is lowered once per interval.
+        eta: {0.01, 0.3, 1.0}: The initial magnitude of `sigma`.
+        scale_factor: {0.55}: The exponent that sets how fast `sigma` decays.
+    """
+    interval = (iteration // duration) + 1
+    sigma = eta / interval**scale_factor
+    for param in model.parameters():
+        if param.grad is not None:
+            _shape = param.grad.size()
+            noise = sigma * torch.randn(_shape).to(param.device)
+            param.grad += noise
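Note on `add_gradient_noise` above: the noise scale follows sigma = eta / (iteration // duration + 1) ** scale_factor, so it stays constant within each block of `duration` iterations and shrinks from one block to the next. The sketch below isolates that schedule without torch. `noise_sigma` is a hypothetical helper name.

```python
def noise_sigma(
    iteration: int,
    duration: float = 100,
    eta: float = 1.0,
    scale_factor: float = 0.55,
) -> float:
    """Standard deviation of the gradient noise at a given iteration."""
    interval = (iteration // duration) + 1
    return eta / interval**scale_factor


# sigma is flat inside the first interval, then decays step by step.
for it in (0, 99, 100, 1000, 10000):
    print(it, round(noise_sigma(it), 4))
```

With the defaults, iterations 0 through 99 all use sigma = 1.0, and iteration 100 is the first to use a smaller value.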
diff --git a/funasr/torch_utils/device_funcs.py b/funasr/torch_utils/device_funcs.py
new file mode 100644
index 000000000..7919e7d92
--- /dev/null
+++ b/funasr/torch_utils/device_funcs.py
@@ -0,0 +1,71 @@
+import dataclasses
+import warnings
+
+import numpy as np
+import torch
+
+
+def to_device(data, device=None, dtype=None, non_blocking=False, copy=False):
+    """Change the device of object recursively"""
+    if isinstance(data, dict):
+        return {
+            k: to_device(v, device, dtype, non_blocking, copy) for k, v in data.items()
+        }
+    elif dataclasses.is_dataclass(data) and not isinstance(data, type):
+        return type(data)(
+            *[
+                to_device(v, device, dtype, non_blocking, copy)
+                for v in dataclasses.astuple(data)
+            ]
+        )
+    # Probably a namedtuple: a tuple subclass that isn't plain tuple (heuristic check).
+    elif isinstance(data, tuple) and type(data) is not tuple:
+        return type(data)(
+            *[to_device(o, device, dtype, non_blocking, copy) for o in data]
+        )
+    elif isinstance(data, (list, tuple)):
+        return type(data)(to_device(v, device, dtype, non_blocking, copy) for v in data)
+    elif isinstance(data, np.ndarray):
+        return to_device(torch.from_numpy(data), device, dtype, non_blocking, copy)
+    elif isinstance(data, torch.Tensor):
+        return data.to(device, dtype, non_blocking, copy)
+    else:
+        return data
+
+
+def force_gatherable(data, device):
+    """Change object to gatherable in torch.nn.DataParallel recursively
+
+    Unlike to_device(), this also converts bare float and int values
+    to torch.Tensor.
+
+    DataParallel places these restrictions on returned values:
+        Each object must be
+        - a torch.cuda.Tensor
+        - with 1 or more dimensions (a 0-dim tensor triggers a warning),
+        or a list, tuple, or dict of such tensors.
+
+    """
+    if isinstance(data, dict):
+        return {k: force_gatherable(v, device) for k, v in data.items()}
+    # DataParallel can't handle NamedTuple well
+    elif isinstance(data, tuple) and type(data) is not tuple:
+        return type(data)(*[force_gatherable(o, device) for o in data])
+    elif isinstance(data, (list, tuple, set)):
+        return type(data)(force_gatherable(v, device) for v in data)
+    elif isinstance(data, np.ndarray):
+        return force_gatherable(torch.from_numpy(data), device)
+    elif isinstance(data, torch.Tensor):
+        if data.dim() == 0:
+            # To 1-dim array
+            data = data[None]
+        return data.to(device)
+    elif isinstance(data, float):
+        return torch.tensor([data], dtype=torch.float, device=device)
+    elif isinstance(data, int):
+        return torch.tensor([data], dtype=torch.long, device=device)
+    elif data is None:
+        return None
+    else:
+        warnings.warn(f"{type(data)} may not be gatherable by DataParallel")
+        return data
diff --git a/funasr/torch_utils/forward_adaptor.py b/funasr/torch_utils/forward_adaptor.py
new file mode 100644
index 000000000..114af7851
--- /dev/null
+++ b/funasr/torch_utils/forward_adaptor.py
@@ -0,0 +1,33 @@
+import torch
+from typeguard import check_argument_types
+
+
+class ForwardAdaptor(torch.nn.Module):
+    """Wrapped module to parallelize specified method
+
+    torch.nn.DataParallel parallelizes only "forward()",
+    so a method with any other name can't be parallelized
+    unless the module is wrapped, as this class does.
+
+    Examples:
+        >>> class A(torch.nn.Module):
+        ...     def foo(self, x):
+        ...         ...
+        >>> model = A()
+        >>> model = ForwardAdaptor(model, "foo")
+        >>> model = torch.nn.DataParallel(model, device_ids=[0, 1])
+        >>> x = torch.randn(2, 10)
+        >>> model(x)
+    """
+
+    def __init__(self, module: torch.nn.Module, name: str):
+        assert check_argument_types()
+        super().__init__()
+        self.module = module
+        self.name = name
+        if not hasattr(module, name):
+            raise ValueError(f"{module} doesn't have {name}")
+
+    def forward(self, *args, **kwargs):
+        func = getattr(self.module, self.name)
+        return func(*args, **kwargs)
diff --git a/funasr/torch_utils/initialize.py b/funasr/torch_utils/initialize.py
new file mode 100644
index 000000000..2c0e7a435
--- /dev/null
+++ b/funasr/torch_utils/initialize.py
@@ -0,0 +1,102 @@
+#!/usr/bin/env python3
+
+"""Initialize modules for espnet2 neural networks."""
+
+import math
+import torch
+from typeguard import check_argument_types
+
+
+def initialize(model: torch.nn.Module, init: str):
+    """Initialize weights of a neural network module.
+
+    Parameters are initialized using the given method or distribution.
+
+    Custom initialization routines can be implemented into submodules
+    as function `espnet_initialization_fn` within the custom module.
+
+    Args:
+        model: Target.
+        init: Method of initialization.
+    """
+    assert check_argument_types()
+
+    if init == "chainer":
+        # 1. lecun_normal_init_parameters
+        for p in model.parameters():
+            data = p.data
+            if data.dim() == 1:
+                # bias
+                data.zero_()
+            elif data.dim() == 2:
+                # linear weight
+                n = data.size(1)
+                stdv = 1.0 / math.sqrt(n)
+                data.normal_(0, stdv)
+            elif data.dim() in (3, 4):
+                # conv weight
+                n = data.size(1)
+                for k in data.size()[2:]:
+                    n *= k
+                stdv = 1.0 / math.sqrt(n)
+                data.normal_(0, stdv)
+            else:
+                raise NotImplementedError
+
+        for mod in model.modules():
+            # 2. embed weight ~ Normal(0, 1)
+            if isinstance(mod, torch.nn.Embedding):
+                mod.weight.data.normal_(0, 1)
+            # 3. forget-bias = 1.0
+            elif isinstance(mod, torch.nn.RNNCellBase):
+                n = mod.bias_ih.size(0)
+                mod.bias_ih.data[n // 4 : n // 2].fill_(1.0)
+            elif isinstance(mod, torch.nn.RNNBase):
+                for name, param in mod.named_parameters():
+                    if "bias" in name:
+                        n = param.size(0)
+                        param.data[n // 4 : n // 2].fill_(1.0)
+            if hasattr(mod, "espnet_initialization_fn"):
+                mod.espnet_initialization_fn()
+
+    else:
+        # weight init
+        for p in model.parameters():
+            if p.dim() > 1:
+                if init == "xavier_uniform":
+                    torch.nn.init.xavier_uniform_(p.data)
+                elif init == "xavier_normal":
+                    torch.nn.init.xavier_normal_(p.data)
+                elif init == "kaiming_uniform":
+                    torch.nn.init.kaiming_uniform_(p.data, nonlinearity="relu")
+                elif init == "kaiming_normal":
+                    torch.nn.init.kaiming_normal_(p.data, nonlinearity="relu")
+                else:
+                    raise ValueError("Unknown initialization: " + init)
+        # bias init
+        for p in model.parameters():
+            if p.dim() == 1:
+                p.data.zero_()
+
+        # reset some modules with default init
+        for m in model.modules():
+            if isinstance(
+                m, (torch.nn.Embedding, torch.nn.LayerNorm, torch.nn.GroupNorm)
+            ):
+                m.reset_parameters()
+            if hasattr(m, "espnet_initialization_fn"):
+                m.espnet_initialization_fn()
+
+        # TODO(xkc): Hacking s3prl_frontend and wav2vec2encoder initialization
+        if getattr(model, "encoder", None) and getattr(
+            model.encoder, "reload_pretrained_parameters", None
+        ):
+            model.encoder.reload_pretrained_parameters()
+        if getattr(model, "frontend", None) and getattr(
+            model.frontend, "reload_pretrained_parameters", None
+        ):
+            model.frontend.reload_pretrained_parameters()
+        if getattr(model, "postencoder", None) and getattr(
+            model.postencoder, "reload_pretrained_parameters", None
+        ):
+            model.postencoder.reload_pretrained_parameters()
diff --git a/funasr/torch_utils/load_pretrained_model.py b/funasr/torch_utils/load_pretrained_model.py
new file mode 100644
index 000000000..8e3f05e1e
--- /dev/null
+++ b/funasr/torch_utils/load_pretrained_model.py
@@ -0,0 +1,125 @@
+from typing import Any
+from typing import Dict
+from typing import Union
+from io import BytesIO
+
+import logging
+import torch
+import torch.nn
+import torch.optim
+
+
+def filter_state_dict(
+    dst_state: Dict[str, Union[float, torch.Tensor]],
+    src_state: Dict[str, Union[float, torch.Tensor]],
+):
+    """Filter name, size mismatch instances between dicts.
+
+    Args:
+        dst_state: state dict of the model being loaded into (the reference)
+        src_state: pretrained state dict whose mismatched entries are dropped
+
+    """
+    match_state = {}
+    for key, value in src_state.items():
+        if key in dst_state and (dst_state[key].size() == src_state[key].size()):
+            match_state[key] = value
+        else:
+            if key not in dst_state:
+                logging.warning(
+                    f"Filter out {key} from pretrained dict"
+                    + " because of name not found in target dict"
+                )
+            else:
+                logging.warning(
+                    f"Filter out {key} from pretrained dict"
+                    + " because of size mismatch"
+                    + f"({dst_state[key].size()}-{src_state[key].size()})"
+                )
+    return match_state
+
+
+def load_pretrained_model(
+    init_param: str,
+    model: torch.nn.Module,
+    ignore_init_mismatch: bool,
+    map_location: str = "cpu",
+    oss_bucket=None,
+):
+    """Load a model state and set it to the model.
+
+    Args:
+        init_param: <file_path>:<src_key>:<dst_key>:<exclude_keys>
+
+    Examples:
+        >>> load_pretrained_model("somewhere/model.pth", model)
+        >>> load_pretrained_model("somewhere/model.pth:decoder:decoder", model)
+        >>> load_pretrained_model("somewhere/model.pth:decoder:decoder:", model)
+        >>> load_pretrained_model(
+        ...     "somewhere/model.pth:decoder:decoder:decoder.embed", model
+        ... )
+        >>> load_pretrained_model("somewhere/decoder.pth::decoder", model)
+    """
+    sps = init_param.split(":", 3)
+    if len(sps) == 4:
+        path, src_key, dst_key, excludes = sps
+    elif len(sps) == 3:
+        path, src_key, dst_key = sps
+        excludes = None
+    elif len(sps) == 2:
+        path, src_key = sps
+        dst_key, excludes = None, None
+    else:
+        (path,) = sps
+        src_key, dst_key, excludes = None, None, None
+    if src_key == "":
+        src_key = None
+    if dst_key == "":
+        dst_key = None
+
+    if dst_key is None:
+        obj = model
+    else:
+
+        def get_attr(obj: Any, key: str):
+            """Get a nested attribute.
+
+            >>> class A(torch.nn.Module):
+            ...     def __init__(self):
+            ...         super().__init__()
+            ...         self.linear = torch.nn.Linear(10, 10)
+            >>> a = A()
+            >>> assert a.linear.weight is get_attr(a, 'linear.weight')
+
+            """
+            if key.strip() == "":
+                return obj
+            for k in key.split("."):
+                obj = getattr(obj, k)
+            return obj
+
+        obj = get_attr(model, dst_key)
+
+    if oss_bucket is None:
+        src_state = torch.load(path, map_location=map_location)
+    else:
+        buffer = BytesIO(oss_bucket.get_object(path).read())
+        src_state = torch.load(buffer, map_location=map_location)
+    if excludes:  # None or "": an empty prefix would match (and drop) every key
+        for e in excludes.split(","):
+            src_state = {k: v for k, v in src_state.items() if not k.startswith(e)}
+
+    if src_key is not None:
+        src_state = {
+            k[len(src_key) + 1 :]: v
+            for k, v in src_state.items()
+            if k.startswith(src_key)
+        }
+
+    dst_state = obj.state_dict()
+    if ignore_init_mismatch:
+        src_state = filter_state_dict(dst_state, src_state)
+
+    logging.info("Loaded src_state keys: {}".format(src_state.keys()))
+    dst_state.update(src_state)
+    obj.load_state_dict(dst_state)
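Note on `load_pretrained_model` above: the `init_param` string packs up to four colon-separated fields, and the source key is stripped from the front of each matching state-dict key. The sketch below walks through that parsing on a toy dict. `parse_init_param` is a hypothetical helper, not part of this patch. As assumptions that differ from the patch, it ignores empty exclude entries and matches `src_key` only at a `.` boundary.

```python
from typing import List, Optional, Tuple


def parse_init_param(init_param: str) -> Tuple[str, Optional[str], Optional[str], List[str]]:
    """Split "<file_path>:<src_key>:<dst_key>:<exclude_keys>" into its fields."""
    parts = init_param.split(":", 3) + [""] * 3  # pad missing trailing fields
    path, src_key, dst_key, excludes = parts[:4]
    # Empty fields mean "not given"; empty exclude entries are skipped.
    return path, src_key or None, dst_key or None, [e for e in excludes.split(",") if e]


print(parse_init_param("somewhere/model.pth"))
print(parse_init_param("somewhere/decoder.pth::decoder"))

# Toy state dict standing in for torch.load(path).
state = {"decoder.embed.weight": 1, "decoder.out.bias": 2, "encoder.w": 3}
path, src_key, dst_key, excludes = parse_init_param(
    "somewhere/model.pth:decoder:decoder:decoder.embed"
)
state = {k: v for k, v in state.items() if not any(k.startswith(e) for e in excludes)}
sub = {k[len(src_key) + 1:]: v for k, v in state.items() if k.startswith(src_key + ".")}
print(sub)  # {'out.bias': 2}
```

Because the separator is `:`, a file path that itself contains a colon cannot be expressed in this format.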
diff --git a/funasr/torch_utils/model_summary.py b/funasr/torch_utils/model_summary.py
new file mode 100644
index 000000000..8d7f14f8c
--- /dev/null
+++ b/funasr/torch_utils/model_summary.py
@@ -0,0 +1,70 @@
+import humanfriendly
+import numpy as np
+import torch
+
+
+def get_human_readable_count(number: int) -> str:
+    """Return human_readable_count
+
+    Originated from:
+    https://github.com/PyTorchLightning/pytorch-lightning/blob/master/pytorch_lightning/core/memory.py
+
+    Abbreviates an integer number with K, M, B, T for thousands, millions,
+    billions and trillions, respectively.
+    Examples:
+        >>> get_human_readable_count(123)
+        '123.00  '
+        >>> get_human_readable_count(1234)  # (one thousand)
+        '1.23 K'
+        >>> get_human_readable_count(2e6)   # (two million)
+        '2.00 M'
+        >>> get_human_readable_count(3e9)   # (three billion)
+        '3.00 B'
+        >>> get_human_readable_count(4e12)  # (four trillion)
+        '4.00 T'
+        >>> get_human_readable_count(5e15)  # (more than trillion)
+        '5000.00 T'
+    Args:
+        number: a non-negative number
+    Return:
+        A string formatted according to the pattern described above.
+    """
+    assert number >= 0
+    labels = [" ", "K", "M", "B", "T"]
+    num_digits = int(np.floor(np.log10(number)) + 1 if number > 0 else 1)
+    num_groups = int(np.ceil(num_digits / 3))
+    num_groups = min(num_groups, len(labels))  # don't abbreviate beyond trillions
+    shift = -3 * (num_groups - 1)
+    number = number * (10**shift)
+    index = num_groups - 1
+    return f"{number:.2f} {labels[index]}"
+
+
+def to_bytes(dtype) -> int:
+    # Bytes per element, parsed from the dtype name: torch.float16 -> 16 bits -> 2
+    return int(str(dtype)[-2:]) // 8
+
+
+def model_summary(model: torch.nn.Module) -> str:
+    message = "Model structure:\n"
+    message += str(model)
+    tot_params = sum(p.numel() for p in model.parameters())
+    num_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
+    percent_trainable = "{:.1f}".format(num_params * 100.0 / tot_params)
+    tot_params = get_human_readable_count(tot_params)
+    num_params = get_human_readable_count(num_params)
+    message += "\n\nModel summary:\n"
+    message += f"    Class Name: {model.__class__.__name__}\n"
+    message += f"    Total Number of model parameters: {tot_params}\n"
+    message += (
+        f"    Number of trainable parameters: {num_params} ({percent_trainable}%)\n"
+    )
+    num_bytes = humanfriendly.format_size(
+        sum(
+            p.numel() * to_bytes(p.dtype) for p in model.parameters() if p.requires_grad
+        )
+    )
+    message += f"    Size: {num_bytes}\n"
+    dtype = next(iter(model.parameters())).dtype
+    message += f"    Type: {dtype}"
+    return message
diff --git a/funasr/torch_utils/pytorch_version.py b/funasr/torch_utils/pytorch_version.py
new file mode 100644
index 000000000..01f17cc74
--- /dev/null
+++ b/funasr/torch_utils/pytorch_version.py
@@ -0,0 +1,16 @@
+import torch
+
+
+def pytorch_cudnn_version() -> str:
+    message = (
+        f"pytorch.version={torch.__version__}, "
+        f"cuda.available={torch.cuda.is_available()}, "
+    )
+
+    if torch.backends.cudnn.enabled:
+        message += (
+            f"cudnn.version={torch.backends.cudnn.version()}, "
+            f"cudnn.benchmark={torch.backends.cudnn.benchmark}, "
+            f"cudnn.deterministic={torch.backends.cudnn.deterministic}"
+        )
+    return message
diff --git a/funasr/torch_utils/recursive_op.py b/funasr/torch_utils/recursive_op.py
new file mode 100644
index 000000000..286a92daf
--- /dev/null
+++ b/funasr/torch_utils/recursive_op.py
@@ -0,0 +1,47 @@
+"""Torch utility module."""
+import torch
+
+if torch.distributed.is_available():
+    from torch.distributed import ReduceOp
+
+
+def recursive_sum(obj, weight: torch.Tensor, distributed: bool = False):
+    assert weight.dim() == 1, weight.size()
+    if isinstance(obj, (tuple, list)):
+        return type(obj)(recursive_sum(v, weight, distributed) for v in obj)
+    elif isinstance(obj, dict):
+        return {k: recursive_sum(v, weight, distributed) for k, v in obj.items()}
+    elif isinstance(obj, torch.Tensor):
+        assert obj.size() == weight.size(), (obj.size(), weight.size())
+        obj = (obj * weight.type(obj.dtype)).sum()
+        if distributed:
+            torch.distributed.all_reduce(obj, op=ReduceOp.SUM)
+        return obj
+    elif obj is None:
+        return None
+    else:
+        raise ValueError(type(obj))
+
+
+def recursive_divide(a, b: torch.Tensor):
+    if isinstance(a, (tuple, list)):
+        return type(a)(recursive_divide(v, b) for v in a)
+    elif isinstance(a, dict):
+        return {k: recursive_divide(v, b) for k, v in a.items()}
+    elif isinstance(a, torch.Tensor):
+        assert a.size() == b.size(), (a.size(), b.size())
+        return a / b.type(a.dtype)
+    elif a is None:
+        return None
+    else:
+        raise ValueError(type(a))
+
+
+def recursive_average(obj, weight: torch.Tensor, distributed: bool = False):
+    obj = recursive_sum(obj, weight, distributed)
+    weight = weight.sum()
+    if distributed:
+        torch.distributed.all_reduce(weight, op=ReduceOp.SUM)
+    # Normalize weight to be sum-to-1
+    obj = recursive_divide(obj, weight)
+    return obj, weight
diff --git a/funasr/torch_utils/set_all_random_seed.py b/funasr/torch_utils/set_all_random_seed.py
new file mode 100644
index 000000000..ebdca3f53
--- /dev/null
+++ b/funasr/torch_utils/set_all_random_seed.py
@@ -0,0 +1,10 @@
+import random
+
+import numpy as np
+import torch
+
+
+def set_all_random_seed(seed: int):
+    random.seed(seed)
+    np.random.seed(seed)
+    torch.random.manual_seed(seed)
diff --git a/funasr/train/__init__.py b/funasr/train/__init__.py
new file mode 100644
index 000000000..e69de29bb
diff --git a/funasr/train/abs_espnet_model.py b/funasr/train/abs_espnet_model.py
new file mode 100644
index 000000000..cc6a5a2a0
--- /dev/null
+++ b/funasr/train/abs_espnet_model.py
@@ -0,0 +1,55 @@
+# Copyright ESPnet (https://github.com/espnet/espnet). All Rights Reserved.
+#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)
+
+from abc import ABC
+from abc import abstractmethod
+from typing import Dict
+from typing import Tuple
+
+import torch
+
+
+class AbsESPnetModel(torch.nn.Module, ABC):
+    """The common abstract base class for all tasks.
+
+    "ESPnetModel" refers to a class that inherits torch.nn.Module,
+    holds the DNN models it forwards through as member fields
+    (a.k.a. the delegate pattern),
+    and defines "loss", "stats", and "weight" for the task.
+
+    If you intend to implement a new task in ESPnet,
+    the model must inherit from this class.
+    In other words, the only "mediator" objects between
+    the training system and your task class are
+    these three values: loss, stats, and weight.
+
+    Example:
+        >>> from funasr.tasks.abs_task import AbsTask
+        >>> class YourESPnetModel(AbsESPnetModel):
+        ...     def forward(self, input, input_lengths):
+        ...         ...
+        ...         return loss, stats, weight
+        >>> class YourTask(AbsTask):
+        ...     @classmethod
+        ...     def build_model(cls, args: argparse.Namespace) -> YourESPnetModel:
+    """
+
+    def __init__(self):
+        super().__init__()
+        self.num_updates = 0
+
+    @abstractmethod
+    def forward(
+        self, **batch: torch.Tensor
+    ) -> Tuple[torch.Tensor, Dict[str, torch.Tensor], torch.Tensor]:
+        raise NotImplementedError
+
+    @abstractmethod
+    def collect_feats(self, **batch: torch.Tensor) -> Dict[str, torch.Tensor]:
+        raise NotImplementedError
+
+    def set_num_updates(self, num_updates):
+        self.num_updates = num_updates
+
+    def get_num_updates(self):
+        return self.num_updates
diff --git a/funasr/train/class_choices.py b/funasr/train/class_choices.py
new file mode 100644
index 000000000..658d29166
--- /dev/null
+++ b/funasr/train/class_choices.py
@@ -0,0 +1,95 @@
+from typing import Mapping
+from typing import Optional
+from typing import Tuple
+
+from typeguard import check_argument_types
+from typeguard import check_return_type
+
+from funasr.utils.nested_dict_action import NestedDictAction
+from funasr.utils.types import str_or_none
+
+
+class ClassChoices:
+    """Helper class to manage the options for variable objects and their configuration.
+
+    Example:
+
+    >>> class A:
+    ...     def __init__(self, foo=3):  pass
+    >>> class B:
+    ...     def __init__(self, bar="aaaa"):  pass
+    >>> choices = ClassChoices("var", dict(a=A, b=B), default="a")
+    >>> import argparse
+    >>> parser = argparse.ArgumentParser()
+    >>> choices.add_arguments(parser)
+    >>> args = parser.parse_args(["--var", "a", "--var_conf", "foo=4"])
+    >>> args.var
+    'a'
+    >>> args.var_conf
+    {'foo': 4}
+    >>> class_obj = choices.get_class(args.var)
+    >>> a_object = class_obj(**args.var_conf)
+
+    """
+
+    def __init__(
+        self,
+        name: str,
+        classes: Mapping[str, type],
+        type_check: type = None,
+        default: str = None,
+        optional: bool = False,
+    ):
+        assert check_argument_types()
+        self.name = name
+        self.base_type = type_check
+        self.classes = {k.lower(): v for k, v in classes.items()}
+        if "none" in self.classes or "nil" in self.classes or "null" in self.classes:
+            raise ValueError('"none", "nil", and "null" are reserved.')
+        if type_check is not None:
+            for v in self.classes.values():
+                if not issubclass(v, type_check):
+                    raise ValueError(f"must be {type_check.__name__}, but got {v}")
+
+        self.optional = optional
+        self.default = default
+        if default is None:
+            self.optional = True
+
+    def choices(self) -> Tuple[Optional[str], ...]:
+        retval = tuple(self.classes)
+        if self.optional:
+            return retval + (None,)
+        else:
+            return retval
+
+    def get_class(self, name: Optional[str]) -> Optional[type]:
+        assert check_argument_types()
+        if name is None or (self.optional and name.lower() in ("none", "null", "nil")):
+            retval = None
+        elif name.lower() in self.classes:
+            class_obj = self.classes[name.lower()]
+            assert check_return_type(class_obj)
+            retval = class_obj
+        else:
+            raise ValueError(
+                f"--{self.name} must be one of {self.choices()}: "
+                f"--{self.name} {name.lower()}"
+            )
+
+        return retval
+
+    def add_arguments(self, parser):
+        parser.add_argument(
+            f"--{self.name}",
+            type=lambda x: str_or_none(x.lower()),
+            default=self.default,
+            choices=self.choices(),
+            help=f"The {self.name} type",
+        )
+        parser.add_argument(
+            f"--{self.name}_conf",
+            action=NestedDictAction,
+            default=dict(),
+            help=f"The keyword arguments for {self.name}",
+        )
diff --git a/funasr/train/distributed_utils.py b/funasr/train/distributed_utils.py
new file mode 100644
index 000000000..088203a58
--- /dev/null
+++ b/funasr/train/distributed_utils.py
@@ -0,0 +1,384 @@
+# Copyright ESPnet (https://github.com/espnet/espnet). All Rights Reserved.
+#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)
+
+import dataclasses
+import logging
+import os
+import socket
+from typing import Optional
+
+import torch
+import torch.distributed
+
+
+@dataclasses.dataclass
+class DistributedOption:
+    # Enable distributed Training
+    distributed: bool = False
+    # torch.distributed.Backend: "nccl", "mpi", "gloo", or "tcp"
+    dist_backend: str = "nccl"
+    # If init_method="env://", the environment variables
+    # "MASTER_PORT", "MASTER_ADDR", "WORLD_SIZE", and "RANK" are used.
+    dist_init_method: str = "env://"
+    dist_world_size: Optional[int] = None
+    dist_rank: Optional[int] = None
+    local_rank: Optional[int] = None
+    ngpu: int = 0
+    dist_master_addr: Optional[str] = None
+    dist_master_port: Optional[int] = None
+    dist_launcher: Optional[str] = None
+    multiprocessing_distributed: bool = True
+
+    def init_options(self):
+        if self.distributed:
+            if self.dist_init_method == "env://":
+                if get_master_addr(self.dist_master_addr, self.dist_launcher) is None:
+                    raise RuntimeError(
+                        "--dist_master_addr or MASTER_ADDR must be set "
+                        "if --dist_init_method == 'env://'"
+                    )
+                if get_master_port(self.dist_master_port) is None:
+                    raise RuntimeError(
+                        "--dist_master_port or MASTER_PORT must be set "
+                        "if --dist_init_method == 'env://'"
+                    )
+
+    def init_torch_distributed(self, args):
+        if self.distributed:
+            # See:
+            # https://docs.nvidia.com/deeplearning/sdk/nccl-developer-guide/docs/env.html
+            os.environ.setdefault("NCCL_DEBUG", "INFO")
+
+            # See:
+            # https://pytorch.org/docs/stable/distributed.html#torch.distributed.init_process_group
+            os.environ.setdefault("NCCL_BLOCKING_WAIT", "1")
+
+            torch.distributed.init_process_group(backend=self.dist_backend,
+                                                 init_method=self.dist_init_method,
+                                                 world_size=args.dist_world_size,
+                                                 rank=args.dist_rank)
+            self.dist_rank = torch.distributed.get_rank()
+            self.dist_world_size = torch.distributed.get_world_size()
+            self.local_rank = args.local_rank
+            logging.info("world size: {}, rank: {}, local_rank: {}".format(self.dist_world_size, self.dist_rank,
+                                                                           self.local_rank))
+
+    def init_options_pai(self):
+        if self.distributed:
+            if self.dist_init_method == "env://":
+                if get_master_addr(self.dist_master_addr, self.dist_launcher) is None:
+                    raise RuntimeError(
+                        "--dist_master_addr or MASTER_ADDR must be set "
+                        "if --dist_init_method == 'env://'"
+                    )
+                if get_master_port(self.dist_master_port) is None:
+                    raise RuntimeError(
+                        "--dist_master_port or MASTER_PORT must be set "
+                        "if --dist_init_method == 'env://'"
+                    )
+
+            self.dist_rank = get_rank(self.dist_rank, self.dist_launcher)
+            self.dist_world_size = get_world_size(
+                self.dist_world_size, self.dist_launcher
+            )
+            self.local_rank = get_local_rank(self.local_rank, self.dist_launcher)
+
+            if (
+                    self.dist_rank is not None
+                    and self.dist_world_size is not None
+                    and self.dist_rank >= self.dist_world_size
+            ):
+                raise RuntimeError(
+                    f"RANK >= WORLD_SIZE: {self.dist_rank} >= {self.dist_world_size}"
+                )
+
+            if self.dist_init_method == "env://":
+                self.dist_master_addr = get_master_addr(
+                    self.dist_master_addr, self.dist_launcher
+                )
+                self.dist_master_port = get_master_port(self.dist_master_port)
+                if (
+                        self.dist_master_addr is not None
+                        and self.dist_master_port is not None
+                ):
+                    self.dist_init_method = (
+                        f"tcp://{self.dist_master_addr}:{self.dist_master_port}"
+                    )
+
+    def init_torch_distributed_pai(self, args):
+        if self.distributed:
+            # See:
+            # https://docs.nvidia.com/deeplearning/sdk/nccl-developer-guide/docs/env.html
+            os.environ.setdefault("NCCL_DEBUG", "INFO")
+
+            # See:
+            # https://pytorch.org/docs/stable/distributed.html#torch.distributed.init_process_group
+            os.environ.setdefault("NCCL_BLOCKING_WAIT", "1")
+
+            torch.distributed.init_process_group(backend=self.dist_backend, init_method='env://')
+            self.dist_rank = torch.distributed.get_rank()
+            self.dist_world_size = torch.distributed.get_world_size()
+            self.local_rank = args.local_rank
+            logging.info("world size: {}, rank: {}, local_rank: {}".format(self.dist_world_size, self.dist_rank,
+                                                                           self.local_rank))
+
+
+def resolve_distributed_mode(args):
+    # Note that args.distributed is set only by this function;
+    # ArgumentParser has no such option.
+
+    if args.multiprocessing_distributed:
+        num_nodes = get_num_nodes(args.dist_world_size, args.dist_launcher)
+        # a. multi-node
+        if num_nodes > 1:
+            args.distributed = True
+        # b. single-node and multi-gpu with multiprocessing_distributed mode
+        elif args.ngpu > 1:
+            args.distributed = True
+        # c. single-node and single-gpu
+        else:
+            args.distributed = False
+
+        if args.ngpu <= 1:
+            # Disable multiprocessing_distributed mode for one process per node or CPU mode
+            args.multiprocessing_distributed = False
+        if args.ngpu == 1:
+            # If the number of GPUs equals 1 in multiprocessing_distributed mode,
+            # LOCAL_RANK is always 0
+            args.local_rank = 0
+
+        if num_nodes > 1 and get_node_rank(args.dist_rank, args.dist_launcher) is None:
+            raise RuntimeError(
+                "--dist_rank or RANK must be set "
+                "if --multiprocessing_distributed == true"
+            )
+
+        # Note that RANK, LOCAL_RANK, and WORLD_SIZE are set automatically,
+        # so we don't need to check them here
+    else:
+        # d. multiprocess and multi-gpu with external launcher
+        #    e.g. torch.distributed.launch
+        if get_world_size(args.dist_world_size, args.dist_launcher) > 1:
+            args.distributed = True
+        # e. single-process
+        else:
+            args.distributed = False
+
+        if args.distributed and args.ngpu > 0:
+            if get_local_rank(args.local_rank, args.dist_launcher) is None:
+                raise RuntimeError(
+                    "--local_rank or LOCAL_RANK must be set "
+                    "if --multiprocessing_distributed == false"
+                )
+        if args.distributed:
+            if get_node_rank(args.dist_rank, args.dist_launcher) is None:
+                raise RuntimeError(
+                    "--dist_rank or RANK must be set "
+                    "if --multiprocessing_distributed == false"
+                )
+    if args.distributed and args.dist_launcher == "slurm" and not is_in_slurm_step():
+        raise RuntimeError("Launch by 'srun' command if --dist_launcher='slurm'")
+
+
+def is_in_slurm_job() -> bool:
+    return "SLURM_PROCID" in os.environ and "SLURM_NTASKS" in os.environ
+
+
+def is_in_slurm_step() -> bool:
+    return (
+            is_in_slurm_job()
+            and "SLURM_STEP_NUM_NODES" in os.environ
+            and "SLURM_STEP_NODELIST" in os.environ
+    )
+
+
+def _int_or_none(x: Optional[str]) -> Optional[int]:
+    if x is None:
+        return x
+    return int(x)
+
+
+def free_port():
+    """Find a free port using bind().
+
+    There is an interval between finding this port and using it,
+    and another process might take the port in the meantime.
+    Thus it is not guaranteed that the port is still free.
+
+    """
+    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
+        sock.bind(("", 0))
+        return sock.getsockname()[1]
+
+
+def get_rank(prior=None, launcher: str = None) -> Optional[int]:
+    if prior is None:
+        if launcher == "slurm":
+            if not is_in_slurm_step():
+                raise RuntimeError("This process seems not to be launched by 'srun'")
+            prior = os.environ["SLURM_PROCID"]
+        elif launcher == "mpi":
+            raise RuntimeError(
+                "launcher=mpi is used for 'multiprocessing-distributed' mode"
+            )
+        elif launcher is not None:
+            raise RuntimeError(f"launcher='{launcher}' is not supported")
+
+    if prior is not None:
+        return int(prior)
+    else:
+        # prior is None and RANK is None -> RANK = None
+        return _int_or_none(os.environ.get("RANK"))
+
+
+def get_world_size(prior=None, launcher: str = None) -> int:
+    if prior is None:
+        if launcher == "slurm":
+            if not is_in_slurm_step():
+                raise RuntimeError("This process seems not to be launched by 'srun'")
+            prior = int(os.environ["SLURM_NTASKS"])
+        elif launcher == "mpi":
+            raise RuntimeError(
+                "launcher=mpi is used for 'multiprocessing-distributed' mode"
+            )
+        elif launcher is not None:
+            raise RuntimeError(f"launcher='{launcher}' is not supported")
+
+    if prior is not None:
+        return int(prior)
+    else:
+        # prior is None and WORLD_SIZE is None -> WORLD_SIZE = 1
+        return int(os.environ.get("WORLD_SIZE", "1"))
+
+
+def get_local_rank(prior=None, launcher: str = None) -> Optional[int]:
+    # LOCAL_RANK is the same as the GPU device id
+
+    if prior is None:
+        if launcher == "slurm":
+            if not is_in_slurm_step():
+                raise RuntimeError("This process seems not to be launched by 'srun'")
+
+            prior = int(os.environ["SLURM_LOCALID"])
+        elif launcher == "mpi":
+            raise RuntimeError(
+                "launcher=mpi is used for 'multiprocessing-distributed' mode"
+            )
+        elif launcher is not None:
+            raise RuntimeError(f"launcher='{launcher}' is not supported")
+
+    if prior is not None:
+        return int(prior)
+
+    elif "LOCAL_RANK" in os.environ:
+        return int(os.environ["LOCAL_RANK"])
+
+    elif "CUDA_VISIBLE_DEVICES" in os.environ:
+        # There are two possibilities:
+        # - "CUDA_VISIBLE_DEVICES" is set to multiple GPU ids, e.g. "0,1,2"
+        #   => This specifies exactly which devices are to be used,
+        #      and the local_rank information is possibly insufficient.
+        # - "CUDA_VISIBLE_DEVICES" is set to a single id, e.g. "1"
+        #   => This could be used for LOCAL_RANK
+        cvd = os.environ["CUDA_VISIBLE_DEVICES"].split(",")
+        if len(cvd) == 1 and "LOCAL_RANK" not in os.environ:
+            # If CUDA_VISIBLE_DEVICES is set and LOCAL_RANK is not set,
+            # then use it as LOCAL_RANK.
+
+            # Unset CUDA_VISIBLE_DEVICES
+            # because the other device must be visible to communicate
+            return int(os.environ.pop("CUDA_VISIBLE_DEVICES"))
+        else:
+            return None
+    else:
+        return None
+
+
+def get_master_addr(prior=None, launcher: str = None) -> Optional[str]:
+    if prior is None:
+        if launcher == "slurm":
+            if not is_in_slurm_step():
+                raise RuntimeError("This process seems not to be launched by 'srun'")
+
+            # e.g nodelist = foo[1-10],bar[3-8] or foo4,bar[2-10]
+            nodelist = os.environ["SLURM_STEP_NODELIST"]
+            prior = nodelist.split(",")[0].split("-")[0].replace("[", "")
+
+    if prior is not None:
+        return str(prior)
+    else:
+        return os.environ.get("MASTER_ADDR")
+
+
+def get_master_port(prior=None) -> Optional[int]:
+    if prior is not None:
+        return prior
+    else:
+        return _int_or_none(os.environ.get("MASTER_PORT"))
+
+
+def get_node_rank(prior=None, launcher: str = None) -> Optional[int]:
+    """Get Node Rank.
+
+    Use for "multiprocessing distributed" mode.
+    The initial RANK equals to the Node id in this case and
+    the real Rank is set as (nGPU * NodeID) + LOCAL_RANK in torch.distributed.
+
+    """
+    if prior is not None:
+        return prior
+    elif launcher == "slurm":
+        if not is_in_slurm_step():
+            raise RuntimeError("This process seems not to be launched by 'srun'")
+
+        # Assume ntasks_per_node == 1
+        if os.environ["SLURM_STEP_NUM_NODES"] != os.environ["SLURM_NTASKS"]:
+            raise RuntimeError(
+                "Run with --ntasks-per-node=1 if multiprocessing_distributed=true"
+            )
+        return int(os.environ["SLURM_NODEID"])
+    elif launcher == "mpi":
+        # Use mpi4py only for initialization, not for communication
+        from mpi4py import MPI
+
+        comm = MPI.COMM_WORLD
+        # Assume ntasks_per_node == 1 (We can't check whether it is or not)
+        return comm.Get_rank()
+    elif launcher is not None:
+        raise RuntimeError(f"launcher='{launcher}' is not supported")
+    else:
+        return _int_or_none(os.environ.get("RANK"))
+
+
+def get_num_nodes(prior=None, launcher: str = None) -> Optional[int]:
+    """Get the number of nodes.
+
+    Use for "multiprocessing distributed" mode.
+    RANK equals to the Node id in this case and
+    the real Rank is set as (nGPU * NodeID) + LOCAL_RANK in torch.distributed.
+
+    """
+    if prior is not None:
+        return prior
+    elif launcher == "slurm":
+        if not is_in_slurm_step():
+            raise RuntimeError("This process seems not to be launched by 'srun'")
+
+        # Assume ntasks_per_node == 1
+        if os.environ["SLURM_STEP_NUM_NODES"] != os.environ["SLURM_NTASKS"]:
+            raise RuntimeError(
+                "Run with --ntasks-per-node=1 if multiprocessing_distributed=true"
+            )
+        return int(os.environ["SLURM_STEP_NUM_NODES"])
+    elif launcher == "mpi":
+        # Use mpi4py only for initialization, not for communication
+        from mpi4py import MPI
+
+        comm = MPI.COMM_WORLD
+        # Assume ntasks_per_node == 1 (We can't check whether it is or not)
+        return comm.Get_size()
+    elif launcher is not None:
+        raise RuntimeError(f"launcher='{launcher}' is not supported")
+    else:
+        # prior is None -> NUM_NODES = 1
+        return int(os.environ.get("WORLD_SIZE", 1))
diff --git a/funasr/train/reporter.py b/funasr/train/reporter.py
new file mode 100644
index 000000000..2921fef28
--- /dev/null
+++ b/funasr/train/reporter.py
@@ -0,0 +1,540 @@
+"""Reporter module."""
+import dataclasses
+import datetime
+import logging
+import time
+import warnings
+from collections import defaultdict
+from contextlib import contextmanager
+from distutils.version import LooseVersion
+from typing import ContextManager
+from typing import Dict
+from typing import List
+from typing import Optional
+from typing import Sequence
+from typing import Tuple
+from typing import Union
+
+import humanfriendly
+import numpy as np
+import torch
+from typeguard import check_argument_types
+from typeguard import check_return_type
+
+Num = Union[float, int, complex, torch.Tensor, np.ndarray]
+
+_reserved = {"time", "total_count"}
+
+
+def to_reported_value(v: Num, weight: Num = None) -> "ReportedValue":
+    assert check_argument_types()
+    if isinstance(v, (torch.Tensor, np.ndarray)):
+        if np.prod(v.shape) != 1:
+            raise ValueError(f"v must contain exactly one element: shape={tuple(v.shape)}")
+        v = v.item()
+
+    if isinstance(weight, (torch.Tensor, np.ndarray)):
+        if np.prod(weight.shape) != 1:
+            raise ValueError(f"weight must contain exactly one element: shape={tuple(weight.shape)}")
+        weight = weight.item()
+
+    if weight is not None:
+        retval = WeightedAverage(v, weight)
+    else:
+        retval = Average(v)
+    assert check_return_type(retval)
+    return retval
+
+
+def aggregate(values: Sequence["ReportedValue"]) -> Num:
+    assert check_argument_types()
+
+    for v in values:
+        if not isinstance(v, type(values[0])):
+            raise ValueError(
+                f"Can't use different Reported type together: "
+                f"{type(v)} != {type(values[0])}"
+            )
+
+    if len(values) == 0:
+        warnings.warn("No stats found")
+        retval = np.nan
+
+    elif isinstance(values[0], Average):
+        retval = np.nanmean([v.value for v in values])
+
+    elif isinstance(values[0], WeightedAverage):
+        # Excludes non finite values
+        invalid_indices = set()
+        for i, v in enumerate(values):
+            if not np.isfinite(v.value) or not np.isfinite(v.weight):
+                invalid_indices.add(i)
+        values = [v for i, v in enumerate(values) if i not in invalid_indices]
+
+        if len(values) != 0:
+            # Calc weighted average. Weights are normalized to sum to 1.
+            sum_weights = sum(v.weight for v in values)
+            sum_value = sum(v.value * v.weight for v in values)
+            if sum_weights == 0:
+                warnings.warn("weight is zero")
+                retval = np.nan
+            else:
+                retval = sum_value / sum_weights
+        else:
+            warnings.warn("No valid stats found")
+            retval = np.nan
+
+    else:
+        raise NotImplementedError(f"type={type(values[0])}")
+    assert check_return_type(retval)
+    return retval
+
+
+def wandb_get_prefix(key: str):
+    if key.startswith("valid"):
+        return "valid/"
+    if key.startswith("train"):
+        return "train/"
+    if key.startswith("attn"):
+        return "attn/"
+    return "metrics/"
+
+
+class ReportedValue:
+    pass
+
+
+@dataclasses.dataclass(frozen=True)
+class Average(ReportedValue):
+    value: Num
+
+
+@dataclasses.dataclass(frozen=True)
+class WeightedAverage(ReportedValue):
+    value: Num
+    weight: Num
+
+
+class SubReporter:
+    """This class is used in Reporter.
+
+    See the docstring of Reporter for the usage.
+    """
+
+    def __init__(self, key: str, epoch: int, total_count: int):
+        assert check_argument_types()
+        self.key = key
+        self.epoch = epoch
+        self.start_time = time.perf_counter()
+        self.stats = defaultdict(list)
+        self._finished = False
+        self.total_count = total_count
+        self.count = 0
+        self._seen_keys_in_the_step = set()
+
+    def get_total_count(self) -> int:
+        """Returns the number of iterations over all epochs."""
+        return self.total_count
+
+    def get_epoch(self) -> int:
+        return self.epoch
+
+    def next(self):
+        """Close up this step and reset state for the next step"""
+        for key, stats_list in self.stats.items():
+            if key not in self._seen_keys_in_the_step:
+                # Fill nan value if the key is not registered in this step
+                if isinstance(stats_list[0], WeightedAverage):
+                    stats_list.append(to_reported_value(np.nan, 0))
+                elif isinstance(stats_list[0], Average):
+                    stats_list.append(to_reported_value(np.nan))
+                else:
+                    raise NotImplementedError(f"type={type(stats_list[0])}")
+
+            assert len(stats_list) == self.count, (len(stats_list), self.count)
+
+        self._seen_keys_in_the_step = set()
+
+    def register(
+            self,
+            stats: Dict[str, Optional[Union[Num, Dict[str, Num]]]],
+            weight: Num = None,
+    ) -> None:
+        assert check_argument_types()
+        if self._finished:
+            raise RuntimeError("Already finished")
+        if len(self._seen_keys_in_the_step) == 0:
+            # Increment count as the first register in this step
+            self.total_count += 1
+            self.count += 1
+
+        for key2, v in stats.items():
+            if key2 in _reserved:
+                raise RuntimeError(f"{key2} is reserved.")
+            if key2 in self._seen_keys_in_the_step:
+                raise RuntimeError(f"{key2} is registered twice.")
+            if v is None:
+                v = np.nan
+            r = to_reported_value(v, weight)
+
+            if key2 not in self.stats:
+                # If this is the first time the key is registered,
+                # prepend nan values in front of the value
+                # so that it has the same length as the other stats
+                # e.g.
+                # stat A: [0.4, 0.3, 0.5]
+                # stat B: [nan, nan, 0.2]
+                nan = to_reported_value(np.nan, None if weight is None else 0)
+                self.stats[key2].extend(
+                    r if i == self.count - 1 else nan for i in range(self.count)
+                )
+            else:
+                self.stats[key2].append(r)
+            self._seen_keys_in_the_step.add(key2)
+
+    def log_message(self, start: int = None, end: int = None, num_updates: int = None) -> str:
+        if self._finished:
+            raise RuntimeError("Already finished")
+        if start is None:
+            start = 0
+        if start < 0:
+            start = self.count + start
+        if end is None:
+            end = self.count
+
+        if self.count == 0 or start == end:
+            return ""
+
+        message = f"{self.epoch}epoch:{self.key}:" f"{start + 1}-{end}batch:"
+        if num_updates is not None:
+            message += f"{num_updates}num_updates: "
+
+        for idx, (key2, stats_list) in enumerate(self.stats.items()):
+            assert len(stats_list) == self.count, (len(stats_list), self.count)
+            # values: List[ReportedValue]
+            values = stats_list[start:end]
+            if idx != 0:
+                message += ", "
+
+            v = aggregate(values)
+            if abs(v) > 1.0e3:
+                message += f"{key2}={v:.3e}"
+            elif abs(v) > 1.0e-3:
+                message += f"{key2}={v:.3f}"
+            else:
+                message += f"{key2}={v:.3e}"
+        return message
+
+    def tensorboard_add_scalar(self, summary_writer, start: int = None):
+        if start is None:
+            start = 0
+        if start < 0:
+            start = self.count + start
+
+        for key2, stats_list in self.stats.items():
+            assert len(stats_list) == self.count, (len(stats_list), self.count)
+            # values: List[ReportValue]
+            values = stats_list[start:]
+            v = aggregate(values)
+            summary_writer.add_scalar(f"{key2}", v, self.total_count)
+
+    def wandb_log(self, start: int = None):
+        import wandb
+
+        if start is None:
+            start = 0
+        if start < 0:
+            start = self.count + start
+
+        d = {}
+        for key2, stats_list in self.stats.items():
+            assert len(stats_list) == self.count, (len(stats_list), self.count)
+            # values: List[ReportValue]
+            values = stats_list[start:]
+            v = aggregate(values)
+            d[wandb_get_prefix(key2) + key2] = v
+        d["iteration"] = self.total_count
+        wandb.log(d)
+
+    def finished(self) -> None:
+        self._finished = True
+
+    @contextmanager
+    def measure_time(self, name: str):
+        start = time.perf_counter()
+        yield start
+        t = time.perf_counter() - start
+        self.register({name: t})
+
+    def measure_iter_time(self, iterable, name: str):
+        iterator = iter(iterable)
+        while True:
+            try:
+                start = time.perf_counter()
+                retval = next(iterator)
+                t = time.perf_counter() - start
+                self.register({name: t})
+                yield retval
+            except StopIteration:
+                break
+
+
+class Reporter:
+    """Reporter class.
+
+    Examples:
+
+        >>> reporter = Reporter()
+        >>> with reporter.observe('train') as sub_reporter:
+        ...     for batch in iterator:
+        ...         stats = dict(loss=0.2)
+        ...         sub_reporter.register(stats)
+
+    """
+
+    def __init__(self, epoch: int = 0):
+        assert check_argument_types()
+        if epoch < 0:
+            raise ValueError(f"epoch must be 0 or more: {epoch}")
+        self.epoch = epoch
+        # stats: Dict[int, Dict[str, Dict[str, float]]]
+        # e.g. self.stats[epoch]['train']['loss']
+        self.stats = {}
+
+    def get_epoch(self) -> int:
+        return self.epoch
+
+    def set_epoch(self, epoch: int) -> None:
+        if epoch < 0:
+            raise ValueError(f"epoch must be 0 or more: {epoch}")
+        self.epoch = epoch
+
+    @contextmanager
+    def observe(self, key: str, epoch: int = None) -> ContextManager[SubReporter]:
+        sub_reporter = self.start_epoch(key, epoch)
+        yield sub_reporter
+        # Receive the stats from sub_reporter
+        self.finish_epoch(sub_reporter)
+
+    def start_epoch(self, key: str, epoch: int = None) -> SubReporter:
+        if epoch is not None:
+            if epoch < 0:
+                raise ValueError(f"epoch must be 0 or more: {epoch}")
+            self.epoch = epoch
+
+        if self.epoch - 1 not in self.stats or key not in self.stats[self.epoch - 1]:
+            # If the previous epoch doesn't exist for some reason,
+            # maybe due to bug, this case also indicates 0-count.
+            if self.epoch - 1 != 0:
+                warnings.warn(
+                    f"The stats of the previous epoch={self.epoch - 1} "
+                    f"doesn't exist."
+                )
+            total_count = 0
+        else:
+            total_count = self.stats[self.epoch - 1][key]["total_count"]
+
+        sub_reporter = SubReporter(key, self.epoch, total_count)
+        # Clear the stats for the next epoch if it exists
+        self.stats.pop(epoch, None)
+        return sub_reporter
+
+    def finish_epoch(self, sub_reporter: SubReporter) -> None:
+        if self.epoch != sub_reporter.epoch:
+            raise RuntimeError(
+                f"Don't change epoch during observation: "
+                f"{self.epoch} != {sub_reporter.epoch}"
+            )
+
+        # Aggregate the current stats and store them as the stats of this epoch
+        stats = {}
+        for key2, values in sub_reporter.stats.items():
+            v = aggregate(values)
+            stats[key2] = v
+
+        stats["time"] = datetime.timedelta(
+            seconds=time.perf_counter() - sub_reporter.start_time
+        )
+        stats["total_count"] = sub_reporter.total_count
+        if LooseVersion(torch.__version__) >= LooseVersion("1.4.0"):
+            if torch.cuda.is_initialized():
+                stats["gpu_max_cached_mem_GB"] = (
+                        torch.cuda.max_memory_reserved() / 2 ** 30
+                )
+        else:
+            if torch.cuda.is_available() and torch.cuda.max_memory_cached() > 0:
+                stats["gpu_cached_mem_GB"] = torch.cuda.max_memory_cached() / 2 ** 30
+
+        self.stats.setdefault(self.epoch, {})[sub_reporter.key] = stats
+        sub_reporter.finished()
+
+    def sort_epochs_and_values(
+            self, key: str, key2: str, mode: str
+    ) -> List[Tuple[int, float]]:
+        """Return (epoch, value) pairs sorted from the best value to the worst.
+
+        Example:
+            >>> val = reporter.sort_epochs_and_values('eval', 'loss', 'min')
+            >>> e_1best, v_1best = val[0]
+            >>> e_2best, v_2best = val[1]
+        """
+        if mode not in ("min", "max"):
+            raise ValueError(f"mode must be min or max: {mode}")
+        if not self.has(key, key2):
+            raise KeyError(f"{key}.{key2} is not found: {self.get_all_keys()}")
+
+        # Collect (epoch, value) pairs for every recorded epoch
+        values = [(e, self.stats[e][key][key2]) for e in self.stats]
+
+        if mode == "min":
+            values = sorted(values, key=lambda x: x[1])
+        else:
+            values = sorted(values, key=lambda x: -x[1])
+        return values
+
+    def sort_epochs(self, key: str, key2: str, mode: str) -> List[int]:
+        return [e for e, v in self.sort_epochs_and_values(key, key2, mode)]
+
+    def sort_values(self, key: str, key2: str, mode: str) -> List[float]:
+        return [v for e, v in self.sort_epochs_and_values(key, key2, mode)]
+
+    def get_best_epoch(self, key: str, key2: str, mode: str, nbest: int = 0) -> int:
+        return self.sort_epochs(key, key2, mode)[nbest]
+
+    def check_early_stopping(
+            self,
+            patience: int,
+            key1: str,
+            key2: str,
+            mode: str,
+            epoch: int = None,
+            logger=None,
+    ) -> bool:
+        if logger is None:
+            logger = logging
+        if epoch is None:
+            epoch = self.get_epoch()
+
+        best_epoch = self.get_best_epoch(key1, key2, mode)
+        if epoch - best_epoch > patience:
+            logger.info(
+                f"[Early stopping] {key1}.{key2} has not "
+                f"improved for {epoch - best_epoch} consecutive epochs. "
+                f"The training was stopped at {epoch}epoch"
+            )
+            return True
+        else:
+            return False
+
+    def has(self, key: str, key2: str, epoch: int = None) -> bool:
+        if epoch is None:
+            epoch = self.get_epoch()
+        return (
+                epoch in self.stats
+                and key in self.stats[epoch]
+                and key2 in self.stats[epoch][key]
+        )
+
+    def log_message(self, epoch: int = None) -> str:
+        if epoch is None:
+            epoch = self.get_epoch()
+
+        message = ""
+        for key, d in self.stats[epoch].items():
+            _message = ""
+            for key2, v in d.items():
+                if v is not None:
+                    if len(_message) != 0:
+                        _message += ", "
+                    if isinstance(v, float):
+                        if abs(v) > 1.0e3:
+                            _message += f"{key2}={v:.3e}"
+                        elif abs(v) > 1.0e-3:
+                            _message += f"{key2}={v:.3f}"
+                        else:
+                            _message += f"{key2}={v:.3e}"
+                    elif isinstance(v, datetime.timedelta):
+                        _v = humanfriendly.format_timespan(v)
+                        _message += f"{key2}={_v}"
+                    else:
+                        _message += f"{key2}={v}"
+            if len(_message) != 0:
+                if len(message) == 0:
+                    message += f"{epoch}epoch results: "
+                else:
+                    message += ", "
+                message += f"[{key}] {_message}"
+        return message
+
+    def get_value(self, key: str, key2: str, epoch: int = None):
+        if not self.has(key, key2, epoch):
+            raise KeyError(f"{key}.{key2} is not found in stats: {self.get_all_keys(epoch)}")
+        if epoch is None:
+            epoch = self.get_epoch()
+        return self.stats[epoch][key][key2]
+
+    def get_keys(self, epoch: int = None) -> Tuple[str, ...]:
+        """Returns keys1 e.g. train,eval."""
+        if epoch is None:
+            epoch = self.get_epoch()
+        return tuple(self.stats[epoch])
+
+    def get_keys2(self, key: str, epoch: int = None) -> Tuple[str, ...]:
+        """Returns keys2 e.g. loss,acc."""
+        if epoch is None:
+            epoch = self.get_epoch()
+        d = self.stats[epoch][key]
+        keys2 = tuple(k for k in d if k not in ("time", "total_count"))
+        return keys2
+
+    def get_all_keys(self, epoch: int = None) -> Tuple[Tuple[str, str], ...]:
+        if epoch is None:
+            epoch = self.get_epoch()
+        all_keys = []
+        for key in self.stats[epoch]:
+            for key2 in self.stats[epoch][key]:
+                all_keys.append((key, key2))
+        return tuple(all_keys)
+
+    def tensorboard_add_scalar(
+            self, summary_writer, epoch: int = None, key1: str = None
+    ):
+        if epoch is None:
+            epoch = self.get_epoch()
+        # total_count is needed below even when the caller passes an explicit epoch
+        total_count = self.stats[epoch]["train"]["total_count"]
+        if key1 == "train":
+            summary_writer.add_scalar("iter_epoch", epoch, total_count)
+
+        if key1 is not None:
+            key1_iterator = tuple([key1])
+        else:
+            key1_iterator = self.get_keys(epoch)
+
+        for key1 in key1_iterator:
+            for key2 in self.get_keys2(key1, epoch):
+                summary_writer.add_scalar(
+                    f"{key2}", self.stats[epoch][key1][key2], total_count
+                )
+
+    def wandb_log(self, epoch: int = None):
+        import wandb
+
+        if epoch is None:
+            epoch = self.get_epoch()
+
+        d = {}
+        for key1 in self.get_keys(epoch):
+            for key2 in self.stats[epoch][key1]:
+                if key2 in ("time", "total_count"):
+                    continue
+                key = f"{key1}_{key2}_epoch"
+                d[wandb_get_prefix(key) + key] = self.stats[epoch][key1][key2]
+        d["epoch"] = epoch
+        wandb.log(d)
+
+    def state_dict(self):
+        return {"stats": self.stats, "epoch": self.epoch}
+
+    def load_state_dict(self, state_dict: dict):
+        self.epoch = state_dict["epoch"]
+        self.stats = state_dict["stats"]
diff --git a/funasr/train/trainer.py b/funasr/train/trainer.py
new file mode 100644
index 000000000..50bce477a
--- /dev/null
+++ b/funasr/train/trainer.py
@@ -0,0 +1,814 @@
+# Copyright ESPnet (https://github.com/espnet/espnet). All Rights Reserved.
+#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)
+
+"""Trainer module."""
+import argparse
+from contextlib import contextmanager
+import dataclasses
+from dataclasses import is_dataclass
+from distutils.version import LooseVersion
+import logging
+from pathlib import Path
+import time
+from typing import Dict
+from typing import Iterable
+from typing import List
+from typing import Optional
+from typing import Sequence
+from typing import Tuple
+from typing import Union
+
+import humanfriendly
+import oss2
+from io import BytesIO
+import os
+import numpy as np
+import torch
+import torch.nn
+import torch.optim
+from typeguard import check_argument_types
+
+from funasr.iterators.abs_iter_factory import AbsIterFactory
+from funasr.main_funcs.average_nbest_models import average_nbest_models
+from funasr.main_funcs.calculate_all_attentions import calculate_all_attentions
+from funasr.schedulers.abs_scheduler import AbsBatchStepScheduler
+from funasr.schedulers.abs_scheduler import AbsEpochStepScheduler
+from funasr.schedulers.abs_scheduler import AbsScheduler
+from funasr.schedulers.abs_scheduler import AbsValEpochStepScheduler
+from funasr.torch_utils.add_gradient_noise import add_gradient_noise
+from funasr.torch_utils.device_funcs import to_device
+from funasr.torch_utils.recursive_op import recursive_average
+from funasr.torch_utils.set_all_random_seed import set_all_random_seed
+from funasr.train.abs_espnet_model import AbsESPnetModel
+from funasr.train.distributed_utils import DistributedOption
+from funasr.train.reporter import Reporter
+from funasr.train.reporter import SubReporter
+from funasr.utils.build_dataclass import build_dataclass
+
+if torch.distributed.is_available():
+    from torch.distributed import ReduceOp
+
+if LooseVersion(torch.__version__) >= LooseVersion("1.6.0"):
+    from torch.cuda.amp import autocast
+    from torch.cuda.amp import GradScaler
+else:
+    # Nothing to do if torch<1.6.0
+    @contextmanager
+    def autocast(enabled=True):
+        yield
+
+    GradScaler = None
+
+try:
+    import fairscale
+except ImportError:
+    fairscale = None
+
+
+@dataclasses.dataclass
+class TrainerOptions:
+    ngpu: int
+    resume: bool
+    use_amp: bool
+    train_dtype: str
+    grad_noise: bool
+    accum_grad: int
+    grad_clip: float
+    grad_clip_type: float
+    log_interval: Optional[int]
+    no_forward_run: bool
+    use_tensorboard: bool
+    use_wandb: bool
+    output_dir: Union[Path, str]
+    max_epoch: int
+    max_update: int
+    seed: int
+    sharded_ddp: bool
+    patience: Optional[int]
+    keep_nbest_models: Union[int, List[int]]
+    nbest_averaging_interval: int
+    early_stopping_criterion: Sequence[str]
+    best_model_criterion: Sequence[Sequence[str]]
+    val_scheduler_criterion: Sequence[str]
+    unused_parameters: bool
+    wandb_model_log_interval: int
+    use_pai: bool
+    oss_bucket: Union[oss2.Bucket, None]
+
+
+class Trainer:
+    """Trainer having an optimizer.
+
+    If you'd like to use multiple optimizers, then inherit this class
+    and override the methods if necessary - at least "train_one_epoch()"
+
+    >>> class TwoOptimizerTrainer(Trainer):
+    ...     @classmethod
+    ...     def add_arguments(cls, parser):
+    ...         ...
+    ...
+    ...     @classmethod
+    ...     def train_one_epoch(cls, model, optimizers, ...):
+    ...         loss1 = model.model1(...)
+    ...         loss1.backward()
+    ...         optimizers[0].step()
+    ...
+    ...         loss2 = model.model2(...)
+    ...         loss2.backward()
+    ...         optimizers[1].step()
+
+    """
+
+    def __init__(self):
+        raise RuntimeError("This class can't be instantiated.")
+
+    @classmethod
+    def build_options(cls, args: argparse.Namespace) -> TrainerOptions:
+        """Build options consumed by train(), eval()"""
+        assert check_argument_types()
+        return build_dataclass(TrainerOptions, args)
+
+    @classmethod
+    def add_arguments(cls, parser: argparse.ArgumentParser):
+        """Reserved for future development of another Trainer"""
+        pass
+
+    @staticmethod
+    def resume(
+        checkpoint: Union[str, Path],
+        model: torch.nn.Module,
+        reporter: Reporter,
+        optimizers: Sequence[torch.optim.Optimizer],
+        schedulers: Sequence[Optional[AbsScheduler]],
+        scaler: Optional[GradScaler],
+        ngpu: int = 0,
+    ):
+        states = torch.load(
+            checkpoint,
+            map_location=f"cuda:{torch.cuda.current_device()}" if ngpu > 0 else "cpu",
+        )
+        model.load_state_dict(states["model"])
+        reporter.load_state_dict(states["reporter"])
+        for optimizer, state in zip(optimizers, states["optimizers"]):
+            optimizer.load_state_dict(state)
+        for scheduler, state in zip(schedulers, states["schedulers"]):
+            if scheduler is not None:
+                scheduler.load_state_dict(state)
+        if scaler is not None:
+            if states["scaler"] is None:
+                logging.warning("scaler state is not found")
+            else:
+                scaler.load_state_dict(states["scaler"])
+
+        logging.info(f"The training was resumed using {checkpoint}")
+
+    @classmethod
+    def run(
+        cls,
+        model: AbsESPnetModel,
+        optimizers: Sequence[torch.optim.Optimizer],
+        schedulers: Sequence[Optional[AbsScheduler]],
+        train_iter_factory: AbsIterFactory,
+        valid_iter_factory: AbsIterFactory,
+        trainer_options,
+        distributed_option: DistributedOption,
+    ) -> None:
+        """Perform training. This method performs the main process of training."""
+        assert check_argument_types()
+        # NOTE(kamo): Don't check the type of trainer_options more strictly
+        assert is_dataclass(trainer_options), type(trainer_options)
+        assert len(optimizers) == len(schedulers), (len(optimizers), len(schedulers))
+
+        if isinstance(trainer_options.keep_nbest_models, int):
+            keep_nbest_models = [trainer_options.keep_nbest_models]
+        else:
+            if len(trainer_options.keep_nbest_models) == 0:
+                logging.warning("No keep_nbest_models is given. Change to [1]")
+                trainer_options.keep_nbest_models = [1]
+            keep_nbest_models = trainer_options.keep_nbest_models
+
+        output_dir = Path(trainer_options.output_dir)
+        reporter = Reporter()
+        if trainer_options.use_amp:
+            if LooseVersion(torch.__version__) < LooseVersion("1.6.0"):
+                raise RuntimeError(
+                    "Require torch>=1.6.0 for Automatic Mixed Precision"
+                )
+            if trainer_options.sharded_ddp:
+                if fairscale is None:
+                    raise RuntimeError(
+                        "Requiring fairscale. Do 'pip install fairscale'"
+                    )
+                scaler = fairscale.optim.grad_scaler.ShardedGradScaler()
+            else:
+                scaler = GradScaler()
+        else:
+            scaler = None
+
+        if trainer_options.resume and (output_dir / "checkpoint.pth").exists():
+            cls.resume(
+                checkpoint=output_dir / "checkpoint.pth",
+                model=model,
+                optimizers=optimizers,
+                schedulers=schedulers,
+                reporter=reporter,
+                scaler=scaler,
+                ngpu=trainer_options.ngpu,
+            )
+
+        start_epoch = reporter.get_epoch() + 1
+        if start_epoch == trainer_options.max_epoch + 1:
+            logging.warning(
+                f"The training has already reached max_epoch: {trainer_options.max_epoch}"
+            )
+
+        if distributed_option.distributed:
+            if trainer_options.sharded_ddp:
+                dp_model = fairscale.nn.data_parallel.ShardedDataParallel(
+                    module=model,
+                    sharded_optimizer=optimizers,
+                )
+            else:
+                dp_model = torch.nn.parallel.DistributedDataParallel(
+                    model, find_unused_parameters=trainer_options.unused_parameters)
+        elif distributed_option.ngpu > 1:
+            dp_model = torch.nn.parallel.DataParallel(
+                model,
+                device_ids=list(range(distributed_option.ngpu)),
+            )
+        else:
+            # NOTE(kamo): DataParallel also should work with ngpu=1,
+            # but for debuggability it's better to keep this block.
+            dp_model = model
+
+        if trainer_options.use_tensorboard and (
+            not distributed_option.distributed or distributed_option.dist_rank == 0
+        ):
+            from torch.utils.tensorboard import SummaryWriter
+            if trainer_options.use_pai:
+                train_summary_writer = SummaryWriter(
+                    os.path.join(trainer_options.output_dir, "tensorboard/train")
+                )
+                valid_summary_writer = SummaryWriter(
+                    os.path.join(trainer_options.output_dir, "tensorboard/valid")
+                )
+            else:
+                train_summary_writer = SummaryWriter(
+                    str(output_dir / "tensorboard" / "train")
+                )
+                valid_summary_writer = SummaryWriter(
+                    str(output_dir / "tensorboard" / "valid")
+                )
+        else:
+            train_summary_writer = None
+
+        start_time = time.perf_counter()
+        for iepoch in range(start_epoch, trainer_options.max_epoch + 1):
+            if iepoch != start_epoch:
+                logging.info(
+                    "{}/{}epoch started. Estimated time to finish: {}".format(
+                        iepoch,
+                        trainer_options.max_epoch,
+                        humanfriendly.format_timespan(
+                            (time.perf_counter() - start_time)
+                            / (iepoch - start_epoch)
+                            * (trainer_options.max_epoch - iepoch + 1)
+                        ),
+                    )
+                )
+            else:
+                logging.info(f"{iepoch}/{trainer_options.max_epoch}epoch started")
+            set_all_random_seed(trainer_options.seed + iepoch)
+
+            reporter.set_epoch(iepoch)
+            # 1. Train and validation for one-epoch
+            with reporter.observe("train") as sub_reporter:
+                all_steps_are_invalid, max_update_stop = cls.train_one_epoch(
+                    model=dp_model,
+                    optimizers=optimizers,
+                    schedulers=schedulers,
+                    iterator=train_iter_factory.build_iter(iepoch),
+                    reporter=sub_reporter,
+                    scaler=scaler,
+                    summary_writer=train_summary_writer,
+                    options=trainer_options,
+                    distributed_option=distributed_option,
+                )
+
+            with reporter.observe("valid") as sub_reporter:
+                cls.validate_one_epoch(
+                    model=dp_model,
+                    iterator=valid_iter_factory.build_iter(iepoch),
+                    reporter=sub_reporter,
+                    options=trainer_options,
+                    distributed_option=distributed_option,
+                )
+
+            # 2. LR Scheduler step
+            for scheduler in schedulers:
+                if isinstance(scheduler, AbsValEpochStepScheduler):
+                    scheduler.step(
+                        reporter.get_value(*trainer_options.val_scheduler_criterion)
+                    )
+                elif isinstance(scheduler, AbsEpochStepScheduler):
+                    scheduler.step()
+            if trainer_options.sharded_ddp:
+                for optimizer in optimizers:
+                    if isinstance(optimizer, fairscale.optim.oss.OSS):
+                        optimizer.consolidate_state_dict()
+
+            if not distributed_option.distributed or distributed_option.dist_rank == 0:
+                # 3. Report the results
+                logging.info(reporter.log_message())
+                if train_summary_writer is not None:
+                    reporter.tensorboard_add_scalar(train_summary_writer, key1="train")
+                    reporter.tensorboard_add_scalar(valid_summary_writer, key1="valid")
+                if trainer_options.use_wandb:
+                    reporter.wandb_log()
+
+                # save tensorboard on oss
+                if trainer_options.use_pai and train_summary_writer is not None:
+                    def write_tensorboard_summary(summary_writer_path, oss_bucket):
+                        file_list = []
+                        for root, dirs, files in os.walk(summary_writer_path, topdown=False):
+                            for name in files:
+                                file_full_path = os.path.join(root, name)
+                                file_list.append(file_full_path)
+
+                        for file_full_path in file_list:
+                            with open(file_full_path, "rb") as f:
+                                oss_bucket.put_object(file_full_path, f)
+
+                    write_tensorboard_summary(os.path.join(trainer_options.output_dir, "tensorboard/train"), trainer_options.oss_bucket)
+                    write_tensorboard_summary(os.path.join(trainer_options.output_dir, "tensorboard/valid"), trainer_options.oss_bucket)
+
+
+                # 4. Save/Update the checkpoint
+                if trainer_options.use_pai:
+                    buffer = BytesIO()
+                    torch.save(
+                        {
+                            "model": model.state_dict(),
+                            "reporter": reporter.state_dict(),
+                            "optimizers": [o.state_dict() for o in optimizers],
+                            "schedulers": [
+                                s.state_dict() if s is not None else None
+                                for s in schedulers
+                            ],
+                            "scaler": scaler.state_dict() if scaler is not None else None,
+                            "ema_model": model.encoder.ema.model.state_dict()
+                            if hasattr(model.encoder, "ema") and model.encoder.ema is not None else None,
+                        },
+                        buffer,
+                    )
+                    trainer_options.oss_bucket.put_object(os.path.join(trainer_options.output_dir, "checkpoint.pth"), buffer.getvalue())
+                else:
+                    torch.save(
+                        {
+                            "model": model.state_dict(),
+                            "reporter": reporter.state_dict(),
+                            "optimizers": [o.state_dict() for o in optimizers],
+                            "schedulers": [
+                                s.state_dict() if s is not None else None
+                                for s in schedulers
+                            ],
+                            "scaler": scaler.state_dict() if scaler is not None else None,
+                        },
+                        output_dir / "checkpoint.pth",
+                    )
+
+                # 5. Save and log the model and update the link to the best model
+                if trainer_options.use_pai:
+                    buffer = BytesIO()
+                    torch.save(model.state_dict(), buffer)
+                    trainer_options.oss_bucket.put_object(os.path.join(trainer_options.output_dir,
+                                                                       f"{iepoch}epoch.pth"),buffer.getvalue())
+                else:
+                    torch.save(model.state_dict(), output_dir / f"{iepoch}epoch.pth")
+
+                # Creates a sym link latest.pth -> {iepoch}epoch.pth
+                if trainer_options.use_pai:
+                    p = os.path.join(trainer_options.output_dir, "latest.pth")
+                    if trainer_options.oss_bucket.object_exists(p):
+                        trainer_options.oss_bucket.delete_object(p)
+                    trainer_options.oss_bucket.copy_object(trainer_options.oss_bucket.bucket_name,
+                                           os.path.join(trainer_options.output_dir, f"{iepoch}epoch.pth"), p)
+                else:
+                    p = output_dir / "latest.pth"
+                    if p.is_symlink() or p.exists():
+                        p.unlink()
+                    p.symlink_to(f"{iepoch}epoch.pth")
+
+                _improved = []
+                for _phase, k, _mode in trainer_options.best_model_criterion:
+                    # e.g. _phase, k, _mode = "train", "loss", "min"
+                    if reporter.has(_phase, k):
+                        best_epoch = reporter.get_best_epoch(_phase, k, _mode)
+                        # Creates sym links if it's the best result
+                        if best_epoch == iepoch:
+                            if trainer_options.use_pai:
+                                p = os.path.join(trainer_options.output_dir, f"{_phase}.{k}.best.pth")
+                                if trainer_options.oss_bucket.object_exists(p):
+                                    trainer_options.oss_bucket.delete_object(p)
+                                trainer_options.oss_bucket.copy_object(trainer_options.oss_bucket.bucket_name,
+                                                       os.path.join(trainer_options.output_dir, f"{iepoch}epoch.pth"),p)
+                            else:
+                                p = output_dir / f"{_phase}.{k}.best.pth"
+                                if p.is_symlink() or p.exists():
+                                    p.unlink()
+                                p.symlink_to(f"{iepoch}epoch.pth")
+                            _improved.append(f"{_phase}.{k}")
+                if len(_improved) == 0:
+                    logging.info("There are no improvements in this epoch")
+                else:
+                    logging.info(
+                        "The best model has been updated: " + ", ".join(_improved)
+                    )
+
+                log_model = (
+                    trainer_options.wandb_model_log_interval > 0
+                    and iepoch % trainer_options.wandb_model_log_interval == 0
+                )
+                if log_model and trainer_options.use_wandb:
+                    import wandb
+
+                    logging.info("Logging Model on this epoch :::::")
+                    artifact = wandb.Artifact(
+                        name=f"model_{wandb.run.id}",
+                        type="model",
+                        metadata={"improved": _improved},
+                    )
+                    artifact.add_file(str(output_dir / f"{iepoch}epoch.pth"))
+                    aliases = [
+                        f"epoch-{iepoch}",
+                        "best" if best_epoch == iepoch else "",
+                    ]
+                    wandb.log_artifact(artifact, aliases=aliases)
+
+                # 6. Remove the model files excluding n-best epoch and latest epoch
+                _removed = []
+                # Get the union set of the n-best among multiple criteria
+                nbests = set().union(
+                    *[
+                        set(reporter.sort_epochs(ph, k, m)[: max(keep_nbest_models)])
+                        for ph, k, m in trainer_options.best_model_criterion
+                        if reporter.has(ph, k)
+                    ]
+                )
+
+                # Generate the n-best averaged model
+                if (
+                    trainer_options.nbest_averaging_interval > 0
+                    and iepoch % trainer_options.nbest_averaging_interval == 0
+                ):
+                    average_nbest_models(
+                        reporter=reporter,
+                        output_dir=output_dir,
+                        best_model_criterion=trainer_options.best_model_criterion,
+                        nbest=keep_nbest_models,
+                        suffix=f"till{iepoch}epoch",
+                        oss_bucket=trainer_options.oss_bucket,
+                        pai_output_dir=trainer_options.output_dir,
+                    )
+
+                for e in range(1, iepoch):
+                    if trainer_options.use_pai:
+                        p = os.path.join(trainer_options.output_dir, f"{e}epoch.pth")
+                        if trainer_options.oss_bucket.object_exists(p) and e not in nbests:
+                            trainer_options.oss_bucket.delete_object(p)
+                            _removed.append(str(p))
+                    else:
+                        p = output_dir / f"{e}epoch.pth"
+                        if p.exists() and e not in nbests:
+                            p.unlink()
+                            _removed.append(str(p))
+                if len(_removed) != 0:
+                    logging.info("The model files were removed: " + ", ".join(_removed))
+
+            # 7. If no update has happened, stop the training
+            if all_steps_are_invalid:
+                logging.warning(
+                    f"The gradients at all steps are invalid in this epoch. "
+                    f"Something seems wrong. This training was stopped at {iepoch}epoch"
+                )
+                break
+
+            if max_update_stop:
+                logging.info(
+                     f"Stopping training due to "
+                     f"num_updates: {trainer_options.num_updates} >= max_update: {trainer_options.max_update}"
+                )
+                break
+
+            # 8. Check early stopping
+            if trainer_options.patience is not None:
+                if reporter.check_early_stopping(
+                    trainer_options.patience, *trainer_options.early_stopping_criterion
+                ):
+                    break
+
+        else:
+            logging.info(
+                f"The training was finished at {trainer_options.max_epoch} epochs"
+            )
+
+        # Generate the n-best averaged model
+        if not distributed_option.distributed or distributed_option.dist_rank == 0:
+            average_nbest_models(
+                reporter=reporter,
+                output_dir=output_dir,
+                best_model_criterion=trainer_options.best_model_criterion,
+                nbest=keep_nbest_models,
+                oss_bucket=trainer_options.oss_bucket,
+                pai_output_dir=trainer_options.output_dir,
+            )
+
+    @classmethod
+    def train_one_epoch(
+        cls,
+        model: torch.nn.Module,
+        iterator: Iterable[Tuple[List[str], Dict[str, torch.Tensor]]],
+        optimizers: Sequence[torch.optim.Optimizer],
+        schedulers: Sequence[Optional[AbsScheduler]],
+        scaler: Optional[GradScaler],
+        reporter: SubReporter,
+        summary_writer,
+        options: TrainerOptions,
+        distributed_option: DistributedOption,
+    ) -> Tuple[bool, bool]:
+        assert check_argument_types()
+
+        grad_noise = options.grad_noise
+        accum_grad = options.accum_grad
+        grad_clip = options.grad_clip
+        grad_clip_type = options.grad_clip_type
+        log_interval = options.log_interval
+        no_forward_run = options.no_forward_run
+        ngpu = options.ngpu
+        use_wandb = options.use_wandb
+        distributed = distributed_option.distributed
+
+        if log_interval is None:
+            try:
+                log_interval = max(len(iterator) // 20, 10)
+            except TypeError:
+                log_interval = 100
+
+        model.train()
+        all_steps_are_invalid = True
+        max_update_stop = False
+        # [For distributed] Because iteration counts are not always equal between
+        # processes, send stop-flag to the other processes if iterator is finished
+        iterator_stop = torch.tensor(0).to("cuda" if ngpu > 0 else "cpu")
+
+        start_time = time.perf_counter()
+        for iiter, (_, batch) in enumerate(
+            reporter.measure_iter_time(iterator, "iter_time"), 1
+        ):
+            assert isinstance(batch, dict), type(batch)
+
+            if distributed:
+                torch.distributed.all_reduce(iterator_stop, ReduceOp.SUM)
+                if iterator_stop > 0:
+                    break
+
+            batch = to_device(batch, "cuda" if ngpu > 0 else "cpu")
+            if no_forward_run:
+                all_steps_are_invalid = False
+                continue
+
+            with autocast(scaler is not None):
+                with reporter.measure_time("forward_time"):
+                    retval = model(**batch)
+
+                    # Note(kamo):
+                    # Supporting two patterns for the returned value from the model
+                    #   a. dict type
+                    if isinstance(retval, dict):
+                        loss = retval["loss"]
+                        stats = retval["stats"]
+                        weight = retval["weight"]
+                        optim_idx = retval.get("optim_idx")
+                        if optim_idx is not None and not isinstance(optim_idx, int):
+                            if not isinstance(optim_idx, torch.Tensor):
+                                raise RuntimeError(
+                                    "optim_idx must be int or 1dim torch.Tensor, "
+                                    f"but got {type(optim_idx)}"
+                                )
+                            if optim_idx.dim() >= 2:
+                                raise RuntimeError(
+                                    "optim_idx must be int or 1dim torch.Tensor, "
+                                    f"but got {optim_idx.dim()}dim tensor"
+                                )
+                            if optim_idx.dim() == 1:
+                                for v in optim_idx:
+                                    if v != optim_idx[0]:
+                                        raise RuntimeError(
+                                            "optim_idx must be 1dim tensor "
+                                            "having same values for all entries"
+                                        )
+                                optim_idx = optim_idx[0].item()
+                            else:
+                                optim_idx = optim_idx.item()
+
+                    #   b. tuple or list type
+                    else:
+                        loss, stats, weight = retval
+                        optim_idx = None
+
+                stats = {k: v for k, v in stats.items() if v is not None}
+                if ngpu > 1 or distributed:
+                    # Apply weighted averaging for loss and stats
+                    loss = (loss * weight.type(loss.dtype)).sum()
+
+                    # if distributed, this method can also apply all_reduce()
+                    stats, weight = recursive_average(stats, weight, distributed)
+
+                    # Now weight is the sum over all workers
+                    loss /= weight
+                if distributed:
+                    # NOTE(kamo): Multiply world_size because DistributedDataParallel
+                    # automatically normalizes the gradient by world_size.
+                    loss *= torch.distributed.get_world_size()
+
+                loss /= accum_grad
+
+            reporter.register(stats, weight)
+
+            with reporter.measure_time("backward_time"):
+                if scaler is not None:
+                    # Scales loss.  Calls backward() on scaled loss
+                    # to create scaled gradients.
+                    # Backward passes under autocast are not recommended.
+                    # Backward ops run in the same dtype autocast chose
+                    # for corresponding forward ops.
+                    scaler.scale(loss).backward()
+                else:
+                    loss.backward()
+
+            if iiter % accum_grad == 0:
+                if scaler is not None:
+                    # Unscales the gradients of optimizer's assigned params in-place
+                    for iopt, optimizer in enumerate(optimizers):
+                        if optim_idx is not None and iopt != optim_idx:
+                            continue
+                        scaler.unscale_(optimizer)
+
+                # gradient noise injection
+                if grad_noise:
+                    add_gradient_noise(
+                        model,
+                        reporter.get_total_count(),
+                        duration=100,
+                        eta=1.0,
+                        scale_factor=0.55,
+                    )
+
+                # compute the gradient norm to check if it is normal or not
+                grad_norm = torch.nn.utils.clip_grad_norm_(
+                    model.parameters(),
+                    max_norm=grad_clip,
+                    norm_type=grad_clip_type,
+                )
+                # For PyTorch<=1.4, clip_grad_norm_ returns a float value
+                if not isinstance(grad_norm, torch.Tensor):
+                    grad_norm = torch.tensor(grad_norm)
+
+                if not torch.isfinite(grad_norm):
+                    logging.warning(
+                        f"The grad norm is {grad_norm}. Skipping updating the model."
+                    )
+
+                    # Must invoke scaler.update() if unscale_() is used in the iteration
+                    # to avoid the following error:
+                    #   RuntimeError: unscale_() has already been called
+                    #   on this optimizer since the last update().
+                    # Note that if the gradient has inf/nan values,
+                    # scaler.step skips optimizer.step().
+                    if scaler is not None:
+                        for iopt, optimizer in enumerate(optimizers):
+                            if optim_idx is not None and iopt != optim_idx:
+                                continue
+                            scaler.step(optimizer)
+                            scaler.update()
+
+                else:
+                    all_steps_are_invalid = False
+                    with reporter.measure_time("optim_step_time"):
+                        for iopt, (optimizer, scheduler) in enumerate(
+                            zip(optimizers, schedulers)
+                        ):
+                            if optim_idx is not None and iopt != optim_idx:
+                                continue
+                            if scaler is not None:
+                                # scaler.step() first unscales the gradients of
+                                # the optimizer's assigned params.
+                                scaler.step(optimizer)
+                                # Updates the scale for next iteration.
+                                scaler.update()
+                            else:
+                                optimizer.step()
+                            if isinstance(scheduler, AbsBatchStepScheduler):
+                                scheduler.step()
+                for iopt, optimizer in enumerate(optimizers):
+                    if optim_idx is not None and iopt != optim_idx:
+                        continue
+                    optimizer.zero_grad()
+
+                # Register lr and train/load time[sec/step],
+                # where step refers to accum_grad * mini-batch
+                reporter.register(
+                    dict(
+                        {
+                            f"optim{i}_lr{j}": pg["lr"]
+                            for i, optimizer in enumerate(optimizers)
+                            for j, pg in enumerate(optimizer.param_groups)
+                            if "lr" in pg
+                        },
+                        train_time=time.perf_counter() - start_time,
+                    ),
+                )
+                start_time = time.perf_counter()
+
+                # update num_updates
+                if distributed:
+                    if hasattr(model.module, "num_updates"):
+                        model.module.set_num_updates(model.module.get_num_updates() + 1)
+                        options.num_updates = model.module.get_num_updates()
+                        if model.module.get_num_updates() >= options.max_update:
+                            max_update_stop = True
+                else:
+                    if hasattr(model, "num_updates"):
+                        model.set_num_updates(model.get_num_updates() + 1)
+                        options.num_updates = model.get_num_updates()
+                        if model.get_num_updates() >= options.max_update:
+                            max_update_stop = True
+
+            # NOTE(kamo): Call log_message() after next()
+            reporter.next()
+            if iiter % log_interval == 0:
+                num_updates = options.num_updates if hasattr(options, "num_updates") else None
+                logging.info(reporter.log_message(-log_interval, num_updates=num_updates))
+                if summary_writer is not None:
+                    reporter.tensorboard_add_scalar(summary_writer, -log_interval)
+                if use_wandb:
+                    reporter.wandb_log()
+
+            if max_update_stop:
+                break
+
+        else:
+            if distributed:
+                iterator_stop.fill_(1)
+                torch.distributed.all_reduce(iterator_stop, ReduceOp.SUM)
+        return all_steps_are_invalid, max_update_stop
+
+    @classmethod
+    @torch.no_grad()
+    def validate_one_epoch(
+        cls,
+        model: torch.nn.Module,
+        iterator: Iterable[Dict[str, torch.Tensor]],
+        reporter: SubReporter,
+        options: TrainerOptions,
+        distributed_option: DistributedOption,
+    ) -> None:
+        assert check_argument_types()
+        ngpu = options.ngpu
+        no_forward_run = options.no_forward_run
+        distributed = distributed_option.distributed
+
+        model.eval()
+
+        # [For distributed] Because iteration counts are not always equal between
+        # processes, send stop-flag to the other processes if iterator is finished
+        iterator_stop = torch.tensor(0).to("cuda" if ngpu > 0 else "cpu")
+        for (_, batch) in iterator:
+            assert isinstance(batch, dict), type(batch)
+            if distributed:
+                torch.distributed.all_reduce(iterator_stop, ReduceOp.SUM)
+                if iterator_stop > 0:
+                    break
+
+            batch = to_device(batch, "cuda" if ngpu > 0 else "cpu")
+            if no_forward_run:
+                continue
+
+            retval = model(**batch)
+            if isinstance(retval, dict):
+                stats = retval["stats"]
+                weight = retval["weight"]
+            else:
+                _, stats, weight = retval
+            if ngpu > 1 or distributed:
+                # Apply weighted averaging for stats.
+                # if distributed, this method can also apply all_reduce()
+                stats, weight = recursive_average(stats, weight, distributed)
+
+            reporter.register(stats, weight)
+            reporter.next()
+
+        else:
+            if distributed:
+                iterator_stop.fill_(1)
+                torch.distributed.all_reduce(iterator_stop, ReduceOp.SUM)
\ No newline at end of file
diff --git a/funasr/utils/__init__.py b/funasr/utils/__init__.py
new file mode 100644
index 000000000..e69de29bb
diff --git a/funasr/utils/build_dataclass.py b/funasr/utils/build_dataclass.py
new file mode 100644
index 000000000..6675c99a0
--- /dev/null
+++ b/funasr/utils/build_dataclass.py
@@ -0,0 +1,17 @@
+import argparse
+import dataclasses
+
+from typeguard import check_type
+
+
+def build_dataclass(dataclass, args: argparse.Namespace):
+    """Helper function to build dataclass from 'args'."""
+    kwargs = {}
+    for field in dataclasses.fields(dataclass):
+        if not hasattr(args, field.name):
+            raise ValueError(
+                f"args doesn't have {field.name}. You need to add it to ArgumentParser"
+            )
+        check_type(field.name, getattr(args, field.name), field.type)
+        kwargs[field.name] = getattr(args, field.name)
+    return dataclass(**kwargs)
diff --git a/funasr/utils/cli_utils.py b/funasr/utils/cli_utils.py
new file mode 100644
index 000000000..c4a4cd15b
--- /dev/null
+++ b/funasr/utils/cli_utils.py
@@ -0,0 +1,65 @@
+from collections.abc import Sequence
+from distutils.util import strtobool as dist_strtobool
+import sys
+
+import numpy
+
+
+def strtobool(x):
+    # distutils.util.strtobool returns an int (1 or 0); convert it to bool.
+    return bool(dist_strtobool(x))
+
+
+def get_commandline_args():
+    extra_chars = [
+        " ",
+        ";",
+        "&",
+        "(",
+        ")",
+        "|",
+        "^",
+        "<",
+        ">",
+        "?",
+        "*",
+        "[",
+        "]",
+        "$",
+        "`",
+        '"',
+        "\\",
+        "!",
+        "{",
+        "}",
+    ]
+
+    # Escape the extra characters for shell
+    argv = [
+        arg.replace("'", "'\\''")
+        if all(char not in arg for char in extra_chars)
+        else "'" + arg.replace("'", "'\\''") + "'"
+        for arg in sys.argv
+    ]
+
+    return sys.executable + " " + " ".join(argv)
+
+
+def is_scipy_wav_style(value):
+    # Check whether value is Tuple[int, numpy.ndarray]
+    return (
+        isinstance(value, Sequence)
+        and len(value) == 2
+        and isinstance(value[0], int)
+        and isinstance(value[1], numpy.ndarray)
+    )
+
+
+def assert_scipy_wav_style(value):
+    assert is_scipy_wav_style(
+        value
+    ), "Must be Tuple[int, numpy.ndarray], but got {}".format(
+        type(value)
+        if not isinstance(value, Sequence)
+        else "{}[{}]".format(type(value), ", ".join(str(type(v)) for v in value))
+    )
diff --git a/funasr/utils/config_argparse.py b/funasr/utils/config_argparse.py
new file mode 100644
index 000000000..c9d7197a7
--- /dev/null
+++ b/funasr/utils/config_argparse.py
@@ -0,0 +1,47 @@
+import argparse
+from pathlib import Path
+
+import yaml
+
+
+class ArgumentParser(argparse.ArgumentParser):
+    """Simple implementation of ArgumentParser supporting config file
+
+    This class originates from https://github.com/bw2/ConfigArgParse,
+    but it lacks some of that library's features:
+
+    - Not supporting multiple config files
+    - Automatically adding "--config" as an option.
+    - Not supporting any formats other than yaml
+    - Not checking argument type
+
+    """
+
+    def __init__(self, *args, **kwargs):
+        super().__init__(*args, **kwargs)
+        self.add_argument("--config", help="Give config file in yaml format")
+
+    def parse_known_args(self, args=None, namespace=None):
+        # Parse once to obtain the value of "--config"
+        _args, _ = super().parse_known_args(args, namespace)
+        if _args.config is not None:
+            if not Path(_args.config).exists():
+                self.error(f"No such file: {_args.config}")
+
+            with open(_args.config, "r", encoding="utf-8") as f:
+                d = yaml.safe_load(f)
+            if not isinstance(d, dict):
+                self.error(f"Config file has non dict value: {_args.config}")
+
+            for key in d:
+                for action in self._actions:
+                    if key == action.dest:
+                        break
+                else:
+                    self.error(f"unrecognized arguments: {key} (from {_args.config})")
+
+            # NOTE(kamo): Ignore "--config" from a config file
+            # NOTE(kamo): Unlike "configargparse", this module doesn't check type.
+            #   i.e. We can set any type value regardless of argument type.
+            self.set_defaults(**d)
+        return super().parse_known_args(args, namespace)
diff --git a/funasr/utils/get_default_kwargs.py b/funasr/utils/get_default_kwargs.py
new file mode 100644
index 000000000..0f11e8af4
--- /dev/null
+++ b/funasr/utils/get_default_kwargs.py
@@ -0,0 +1,57 @@
+import inspect
+
+
+class Invalid:
+    """Marker object for a non-serializable object"""
+
+
+def get_default_kwargs(func):
+    """Get the default values of the input function.
+
+    Examples:
+        >>> def func(a, b=3):  pass
+        >>> get_default_kwargs(func)
+        {'b': 3}
+
+    """
+
+    def yaml_serializable(value):
+        # isinstance(x, tuple) includes namedtuple, so type is used here
+        if type(value) is tuple:
+            return yaml_serializable(list(value))
+        elif isinstance(value, set):
+            return yaml_serializable(list(value))
+        elif isinstance(value, dict):
+            if not all(isinstance(k, str) for k in value):
+                return Invalid
+            retval = {}
+            for k, v in value.items():
+                v2 = yaml_serializable(v)
+                # Register only valid object
+                if v2 not in (Invalid, inspect.Parameter.empty):
+                    retval[k] = v2
+            return retval
+        elif isinstance(value, list):
+            retval = []
+            for v in value:
+                v2 = yaml_serializable(v)
+                # If any elements in the list are invalid,
+                # the list also becomes invalid
+                if v2 is Invalid:
+                    return Invalid
+                else:
+                    retval.append(v2)
+            return retval
+        elif value is inspect.Parameter.empty or value is None:
+            return value
+        elif isinstance(value, (float, int, complex, bool, str, bytes)):
+            return value
+        else:
+            return Invalid
+
+    # params: An ordered mapping of inspect.Parameter
+    params = inspect.signature(func).parameters
+    data = {p.name: p.default for p in params.values()}
+    # Remove objects that are not yaml-serializable
+    data = yaml_serializable(data)
+    return data
diff --git a/funasr/utils/griffin_lim.py b/funasr/utils/griffin_lim.py
new file mode 100644
index 000000000..c1536d51b
--- /dev/null
+++ b/funasr/utils/griffin_lim.py
@@ -0,0 +1,192 @@
+#!/usr/bin/env python3
+
+"""Griffin-Lim related modules."""
+
+# Copyright 2019 Tomoki Hayashi
+#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)
+
+import logging
+
+from distutils.version import LooseVersion
+from functools import partial
+from typeguard import check_argument_types
+from typing import Optional
+
+import librosa
+import numpy as np
+import torch
+
+EPS = 1e-10
+
+
+def logmel2linear(
+    lmspc: np.ndarray,
+    fs: int,
+    n_fft: int,
+    n_mels: int,
+    fmin: Optional[int] = None,
+    fmax: Optional[int] = None,
+) -> np.ndarray:
+    """Convert log Mel filterbank to linear spectrogram.
+
+    Args:
+        lmspc: Log Mel filterbank (T, n_mels).
+        fs: Sampling frequency.
+        n_fft: The number of FFT points.
+        n_mels: The number of mel basis.
+        fmin: Minimum frequency to analyze.
+        fmax: Maximum frequency to analyze.
+
+    Returns:
+        Linear spectrogram (T, n_fft // 2 + 1).
+
+    """
+    assert lmspc.shape[1] == n_mels
+    fmin = 0 if fmin is None else fmin
+    fmax = fs / 2 if fmax is None else fmax
+    mspc = np.power(10.0, lmspc)
+    mel_basis = librosa.filters.mel(
+        sr=fs, n_fft=n_fft, n_mels=n_mels, fmin=fmin, fmax=fmax
+    )
+    inv_mel_basis = np.linalg.pinv(mel_basis)
+    return np.maximum(EPS, np.dot(inv_mel_basis, mspc.T).T)
+
+
+def griffin_lim(
+    spc: np.ndarray,
+    n_fft: int,
+    n_shift: int,
+    win_length: Optional[int] = None,
+    window: Optional[str] = "hann",
+    n_iter: Optional[int] = 32,
+) -> np.ndarray:
+    """Convert linear spectrogram into waveform using Griffin-Lim.
+
+    Args:
+        spc: Linear spectrogram (T, n_fft // 2 + 1).
+        n_fft: The number of FFT points.
+        n_shift: Shift size in points.
+        win_length: Window length in points.
+        window: Window function type.
+        n_iter: The number of iterations.
+
+    Returns:
+        Reconstructed waveform (N,).
+
+    """
+    # assert the size of input linear spectrogram
+    assert spc.shape[1] == n_fft // 2 + 1
+
+    if LooseVersion(librosa.__version__) >= LooseVersion("0.7.0"):
+        # use librosa's fast Griffin-Lim algorithm
+        spc = np.abs(spc.T)
+        y = librosa.griffinlim(
+            S=spc,
+            n_iter=n_iter,
+            hop_length=n_shift,
+            win_length=win_length,
+            window=window,
+            center=spc.shape[1] > 1,
+        )
+    else:
+        # use slower version of Griffin-Lim algorithm
+        logging.warning(
+            "librosa version is old. Use slow version of Griffin-Lim algorithm. "
+            "If you want to use fast Griffin-Lim, please update librosa via "
+            "`source ./path.sh && pip install librosa==0.7.0`."
+        )
+        cspc = np.abs(spc).astype(complex).T
+        angles = np.exp(2j * np.pi * np.random.rand(*cspc.shape))
+        y = librosa.istft(cspc * angles, n_shift, win_length, window=window)
+        for i in range(n_iter):
+            angles = np.exp(
+                1j
+                * np.angle(librosa.stft(y, n_fft, n_shift, win_length, window=window))
+            )
+            y = librosa.istft(cspc * angles, n_shift, win_length, window=window)
+
+    return y
+
+
+# TODO(kan-bayashi): write as torch.nn.Module
+class Spectrogram2Waveform(object):
+    """Spectrogram to waveform conversion module."""
+
+    def __init__(
+        self,
+        n_fft: int,
+        n_shift: int,
+        fs: Optional[int] = None,
+        n_mels: Optional[int] = None,
+        win_length: Optional[int] = None,
+        window: Optional[str] = "hann",
+        fmin: Optional[int] = None,
+        fmax: Optional[int] = None,
+        griffin_lim_iters: Optional[int] = 8,
+    ):
+        """Initialize module.
+
+        Args:
+            fs: Sampling frequency.
+            n_fft: The number of FFT points.
+            n_shift: Shift size in points.
+            n_mels: The number of mel basis.
+            win_length: Window length in points.
+            window: Window function type.
+            fmin: Minimum frequency to analyze.
+            fmax: Maximum frequency to analyze.
+            griffin_lim_iters: The number of iterations.
+
+        """
+        assert check_argument_types()
+        self.fs = fs
+        self.logmel2linear = (
+            partial(
+                logmel2linear, fs=fs, n_fft=n_fft, n_mels=n_mels, fmin=fmin, fmax=fmax
+            )
+            if n_mels is not None
+            else None
+        )
+        self.griffin_lim = partial(
+            griffin_lim,
+            n_fft=n_fft,
+            n_shift=n_shift,
+            win_length=win_length,
+            window=window,
+            n_iter=griffin_lim_iters,
+        )
+        self.params = dict(
+            n_fft=n_fft,
+            n_shift=n_shift,
+            win_length=win_length,
+            window=window,
+            n_iter=griffin_lim_iters,
+        )
+        if n_mels is not None:
+            self.params.update(fs=fs, n_mels=n_mels, fmin=fmin, fmax=fmax)
+
+    def __repr__(self):
+        retval = f"{self.__class__.__name__}("
+        for k, v in self.params.items():
+            retval += f"{k}={v}, "
+        retval += ")"
+        return retval
+
+    def __call__(self, spc: torch.Tensor) -> torch.Tensor:
+        """Convert spectrogram to waveform.
+
+        Args:
+            spc: Log Mel filterbank (T_feats, n_mels)
+                or linear spectrogram (T_feats, n_fft // 2 + 1).
+
+        Returns:
+            Tensor: Reconstructed waveform (T_wav,).
+
+        """
+        device = spc.device
+        dtype = spc.dtype
+        spc = spc.cpu().numpy()
+        if self.logmel2linear is not None:
+            spc = self.logmel2linear(spc)
+        wav = self.griffin_lim(spc)
+        return torch.tensor(wav).to(device=device, dtype=dtype)
diff --git a/funasr/utils/nested_dict_action.py b/funasr/utils/nested_dict_action.py
new file mode 100644
index 000000000..38ec57b31
--- /dev/null
+++ b/funasr/utils/nested_dict_action.py
@@ -0,0 +1,106 @@
+import argparse
+import copy
+
+import yaml
+
+
+class NestedDictAction(argparse.Action):
+    """Action class to append items to dict object.
+
+    Examples:
+        >>> parser = argparse.ArgumentParser()
+        >>> _ = parser.add_argument('--conf', action=NestedDictAction,
+        ...                         default={'a': 4})
+        >>> parser.parse_args(['--conf', 'a=3', '--conf', 'c=4'])
+        Namespace(conf={'a': 3, 'c': 4})
+        >>> parser.parse_args(['--conf', 'c.d=4'])
+        Namespace(conf={'a': 4, 'c': {'d': 4}})
+        >>> parser.parse_args(['--conf', 'c.d=4', '--conf', 'c=2'])
+        Namespace(conf={'a': 4, 'c': 2})
+        >>> parser.parse_args(['--conf', '{d: 5, e: 9}'])
+        Namespace(conf={'d': 5, 'e': 9})
+
+    """
+
+    _syntax = """Syntax:
+  {op} <key>=<yaml-string>
+  {op} <key>.<key2>=<yaml-string>
+  {op} <python-dict>
+  {op} <yaml-string>
+e.g.
+  {op} a=4
+  {op} a.b={{c: true}}
+  {op} {{"c": True}}
+  {op} {{a: 34.5}}
+"""
+
+    def __init__(
+        self,
+        option_strings,
+        dest,
+        nargs=None,
+        default=None,
+        choices=None,
+        required=False,
+        help=None,
+        metavar=None,
+    ):
+        super().__init__(
+            option_strings=option_strings,
+            dest=dest,
+            nargs=nargs,
+            default=copy.deepcopy(default),
+            type=None,
+            choices=choices,
+            required=required,
+            help=help,
+            metavar=metavar,
+        )
+
+    def __call__(self, parser, namespace, values, option_strings=None):
+        # --{option} a.b=3 -> {'a': {'b': 3}}
+        if "=" in values:
+            indict = copy.deepcopy(getattr(namespace, self.dest, {}))
+            key, value = values.split("=", maxsplit=1)
+            if value.strip() != "":
+                value = yaml.load(value, Loader=yaml.Loader)
+            if not isinstance(indict, dict):
+                indict = {}
+
+            keys = key.split(".")
+            d = indict
+            for idx, k in enumerate(keys):
+                if idx == len(keys) - 1:
+                    d[k] = value
+                else:
+                    if not isinstance(d.setdefault(k, {}), dict):
+                        # Remove the existing value and recreate it as an empty dict
+                        d[k] = {}
+                    d = d[k]
+
+            # Update the value
+            setattr(namespace, self.dest, indict)
+        else:
+            try:
+                # First, try eval(), i.e. Python syntax dict.
+                # e.g. --{option} "{'a': 3}" -> {'a': 3}
+                # This is a workaround for the internal behaviour of configargparse.
+                value = eval(values, {}, {})
+                if not isinstance(value, dict):
+                    syntax = self._syntax.format(op=option_strings)
+                    mes = f"must be interpreted as dict: but got {values}\n{syntax}"
+                    raise argparse.ArgumentTypeError(self, mes)
+            except Exception:
+                # Second, fall back to yaml.load
+                value = yaml.load(values, Loader=yaml.Loader)
+                if not isinstance(value, dict):
+                    syntax = self._syntax.format(op=option_strings)
+                    mes = f"must be interpreted as dict: but got {values}\n{syntax}"
+                    raise argparse.ArgumentError(self, mes)
+
+            d = getattr(namespace, self.dest, None)
+            if isinstance(d, dict):
+                d.update(value)
+            else:
+                # Remove existing params, and overwrite
+                setattr(namespace, self.dest, value)
diff --git a/funasr/utils/sized_dict.py b/funasr/utils/sized_dict.py
new file mode 100644
index 000000000..105d8c398
--- /dev/null
+++ b/funasr/utils/sized_dict.py
@@ -0,0 +1,75 @@
+import collections
+import sys
+
+from torch import multiprocessing
+
+
+def get_size(obj, seen=None):
+    """Recursively finds size of objects
+
+    Taken from https://github.com/bosswissam/pysize
+
+    """
+
+    size = sys.getsizeof(obj)
+    if seen is None:
+        seen = set()
+
+    obj_id = id(obj)
+    if obj_id in seen:
+        return 0
+
+    # Important: mark as seen *before* entering recursion to gracefully handle
+    # self-referential objects
+    seen.add(obj_id)
+
+    if isinstance(obj, dict):
+        size += sum([get_size(v, seen) for v in obj.values()])
+        size += sum([get_size(k, seen) for k in obj.keys()])
+    elif hasattr(obj, "__dict__"):
+        size += get_size(obj.__dict__, seen)
+    elif isinstance(obj, (list, set, tuple)):
+        size += sum([get_size(i, seen) for i in obj])
+
+    return size
+
+
+class SizedDict(collections.abc.MutableMapping):
+    def __init__(self, shared: bool = False, data: dict = None):
+        if data is None:
+            data = {}
+
+        if shared:
+            # NOTE(kamo): Don't set manager as a field because Manager, which includes
+            # weakref object, causes the following error with method="spawn",
+            # "TypeError: can't pickle weakref objects"
+            self.cache = multiprocessing.Manager().dict(**data)
+        else:
+            self.manager = None
+            self.cache = dict(**data)
+        self.size = 0
+
+    def __setitem__(self, key, value):
+        if key in self.cache:
+            self.size -= get_size(self.cache[key])
+        else:
+            self.size += sys.getsizeof(key)
+        self.size += get_size(value)
+        self.cache[key] = value
+
+    def __getitem__(self, key):
+        return self.cache[key]
+
+    def __delitem__(self, key):
+        self.size -= get_size(self.cache[key])
+        self.size -= sys.getsizeof(key)
+        del self.cache[key]
+
+    def __iter__(self):
+        return iter(self.cache)
+
+    def __contains__(self, key):
+        return key in self.cache
+
+    def __len__(self):
+        return len(self.cache)
diff --git a/funasr/utils/types.py b/funasr/utils/types.py
new file mode 100644
index 000000000..6b36f9c4b
--- /dev/null
+++ b/funasr/utils/types.py
@@ -0,0 +1,149 @@
+from distutils.util import strtobool
+from typing import Optional
+from typing import Tuple
+from typing import Union
+
+import humanfriendly
+
+
+def str2bool(value: str) -> bool:
+    return bool(strtobool(value))
+
+
+def remove_parenthesis(value: str):
+    value = value.strip()
+    if value.startswith("(") and value.endswith(")"):
+        value = value[1:-1]
+    elif value.startswith("[") and value.endswith("]"):
+        value = value[1:-1]
+    return value
+
+
+def remove_quotes(value: str):
+    value = value.strip()
+    if value.startswith('"') and value.endswith('"'):
+        value = value[1:-1]
+    elif value.startswith("'") and value.endswith("'"):
+        value = value[1:-1]
+    return value
+
+
+def int_or_none(value: str) -> Optional[int]:
+    """int_or_none.
+
+    Examples:
+        >>> import argparse
+        >>> parser = argparse.ArgumentParser()
+        >>> _ = parser.add_argument('--foo', type=int_or_none)
+        >>> parser.parse_args(['--foo', '456'])
+        Namespace(foo=456)
+        >>> parser.parse_args(['--foo', 'none'])
+        Namespace(foo=None)
+        >>> parser.parse_args(['--foo', 'null'])
+        Namespace(foo=None)
+        >>> parser.parse_args(['--foo', 'nil'])
+        Namespace(foo=None)
+
+    """
+    if value.strip().lower() in ("none", "null", "nil"):
+        return None
+    return int(value)
+
+
+def float_or_none(value: str) -> Optional[float]:
+    """float_or_none.
+
+    Examples:
+        >>> import argparse
+        >>> parser = argparse.ArgumentParser()
+        >>> _ = parser.add_argument('--foo', type=float_or_none)
+        >>> parser.parse_args(['--foo', '4.5'])
+        Namespace(foo=4.5)
+        >>> parser.parse_args(['--foo', 'none'])
+        Namespace(foo=None)
+        >>> parser.parse_args(['--foo', 'null'])
+        Namespace(foo=None)
+        >>> parser.parse_args(['--foo', 'nil'])
+        Namespace(foo=None)
+
+    """
+    if value.strip().lower() in ("none", "null", "nil"):
+        return None
+    return float(value)
+
+
+def humanfriendly_parse_size_or_none(value) -> Optional[int]:
+    if value.strip().lower() in ("none", "null", "nil"):
+        return None
+    return humanfriendly.parse_size(value)
+
+
+def str_or_int(value: str) -> Union[str, int]:
+    try:
+        return int(value)
+    except ValueError:
+        return value
+
+
+def str_or_none(value: str) -> Optional[str]:
+    """str_or_none.
+
+    Examples:
+        >>> import argparse
+        >>> parser = argparse.ArgumentParser()
+        >>> _ = parser.add_argument('--foo', type=str_or_none)
+        >>> parser.parse_args(['--foo', 'aaa'])
+        Namespace(foo='aaa')
+        >>> parser.parse_args(['--foo', 'none'])
+        Namespace(foo=None)
+        >>> parser.parse_args(['--foo', 'null'])
+        Namespace(foo=None)
+        >>> parser.parse_args(['--foo', 'nil'])
+        Namespace(foo=None)
+
+    """
+    if value.strip().lower() in ("none", "null", "nil"):
+        return None
+    return value
+
+
+def str2pair_str(value: str) -> Tuple[str, str]:
+    """str2pair_str.
+
+    Examples:
+        >>> import argparse
+        >>> str2pair_str('abc,def ')
+        ('abc', 'def')
+        >>> parser = argparse.ArgumentParser()
+        >>> _ = parser.add_argument('--foo', type=str2pair_str)
+        >>> parser.parse_args(['--foo', 'abc,def'])
+        Namespace(foo=('abc', 'def'))
+
+    """
+    value = remove_parenthesis(value)
+    a, b = value.split(",")
+
+    # Workaround for configargparse issues:
+    # If the list values are given from yaml file,
+    # the value given to type() is shaped as a Python list,
+    # e.g. ['a', 'b', 'c'],
+    # so we need to remove quotes from it.
+    return remove_quotes(a), remove_quotes(b)
+
+
+def str2triple_str(value: str) -> Tuple[str, str, str]:
+    """str2triple_str.
+
+    Examples:
+        >>> str2triple_str('abc,def ,ghi')
+        ('abc', 'def', 'ghi')
+    """
+    value = remove_parenthesis(value)
+    a, b, c = value.split(",")
+
+    # Workaround for configargparse issues:
+    # If the list values are given from yaml file,
+    # the value given to type() is shaped as a Python list,
+    # e.g. ['a', 'b', 'c'],
+    # so we need to remove quotes from it.
+    return remove_quotes(a), remove_quotes(b), remove_quotes(c)
diff --git a/funasr/utils/yaml_no_alias_safe_dump.py b/funasr/utils/yaml_no_alias_safe_dump.py
new file mode 100644
index 000000000..70a7b0e40
--- /dev/null
+++ b/funasr/utils/yaml_no_alias_safe_dump.py
@@ -0,0 +1,14 @@
+import yaml
+
+
+class NoAliasSafeDumper(yaml.SafeDumper):
+    # Disable anchors/aliases in YAML output because they look ugly
+    def ignore_aliases(self, data):
+        return True
+
+
+def yaml_no_alias_safe_dump(data, stream=None, **kwargs):
+    """Safe-dump in yaml with no anchor/alias"""
+    return yaml.dump(
+        data, stream, allow_unicode=True, Dumper=NoAliasSafeDumper, **kwargs
+    )
diff --git a/funasr/version.txt b/funasr/version.txt
new file mode 100644
index 000000000..6e8bf73aa
--- /dev/null
+++ b/funasr/version.txt
@@ -0,0 +1 @@
+0.1.0
diff --git a/image/dingding.jpg b/image/dingding.jpg
new file mode 100644
index 000000000..fb3ee9928
Binary files /dev/null and b/image/dingding.jpg differ
diff --git a/image/funasr_logo.jpg b/image/funasr_logo.jpg
new file mode 100644
index 000000000..a47243e76
Binary files /dev/null and b/image/funasr_logo.jpg differ
diff --git a/image/wechat.png b/image/wechat.png
new file mode 100644
index 000000000..962404c3f
Binary files /dev/null and b/image/wechat.png differ
diff --git a/setup.py b/setup.py
new file mode 100644
index 000000000..ac5960d79
--- /dev/null
+++ b/setup.py
@@ -0,0 +1,146 @@
+#!/usr/bin/env python3
+
+"""FunASR setup script."""
+
+import os
+
+from distutils.version import LooseVersion
+from setuptools import find_packages
+from setuptools import setup
+
+
+requirements = {
+    "install": [
+        "setuptools>=38.5.1",
+        "configargparse>=1.2.1",
+        "typeguard>=2.7.0",
+        "humanfriendly",
+        "scipy>=1.4.1",
+        "filelock",
+        "librosa>=0.8.0",
+        "jamo==0.4.1",  # For kss
+        "PyYAML>=5.1.2",
+        "soundfile>=0.10.2",
+        "h5py>=2.10.0",
+        "kaldiio>=2.17.0",
+        "torch_complex",
+        "nltk>=3.4.5",
+        # ASR
+        "sentencepiece",
+        "ctc-segmentation<1.8,>=1.6.6",
+        # TTS
+        "pyworld>=0.2.10",
+        "pypinyin<=0.44.0",
+        "espnet_tts_frontend",
+        # ENH
+        "ci_sdr",
+        "pytorch_wpe",
+        "editdistance==0.5.2",
+        "tensorboard>=1.14",
+        "g2p",
+    ],
+    # train: Modules invoked only during training.
+    "train": [
+        "pillow>=6.1.0",
+        "editdistance==0.5.2",
+        "wandb",
+    ],
+    # recipe: Modules that are not invoked by the main module of espnet,
+    #         but are used by the Python scripts in each recipe
+    "recipe": [
+        "espnet_model_zoo",
+        "gdown",
+        "resampy",
+        "pysptk>=0.1.17",
+        "morfessor",  # for zeroth-korean
+        "youtube_dl",  # for laborotv
+        "nnmnkwii",
+        "museval>=0.2.1",
+        "pystoi>=0.2.2",
+        "mir-eval>=0.6",
+        "fastdtw",
+        "nara_wpe>=0.0.5",
+        "sacrebleu>=1.5.1",
+    ],
+    # all: Modules that are installed optionally for some reason.
+    #      Please consider moving them to "install" occasionally
+    # NOTE(kamo): The modules in "train" and "recipe" are appended into "all"
+    "all": [
+        # NOTE(kamo): Append modules requiring specific pytorch version or torch>1.3.0
+        "torch_optimizer",
+        "fairscale",
+        "transformers",
+        "gtn==0.0.0",
+    ],
+    "setup": [
+        "numpy<=1.21.3",
+        "pytest-runner",
+    ],
+    "test": [
+        "pytest>=3.3.0",
+        "pytest-timeouts>=1.2.1",
+        "pytest-pythonpath>=0.7.3",
+        "pytest-cov>=2.7.1",
+        "hacking>=2.0.0",
+        "mock>=2.0.0",
+        "pycodestyle",
+        "jsondiff<2.0.0,>=1.2.0",
+        "flake8>=3.7.8",
+        "flake8-docstrings>=1.3.1",
+        "black",
+    ],
+    "doc": [
+        "Jinja2<3.1",
+        "Sphinx==2.1.2",
+        "sphinx-rtd-theme>=0.2.4",
+        "sphinx-argparse>=0.2.5",
+        "commonmark==0.8.1",
+        "recommonmark>=0.4.0",
+        "nbsphinx>=0.4.2",
+        "sphinx-markdown-tables>=0.0.12",
+    ],
+}
+requirements["all"].extend(requirements["train"] + requirements["recipe"])
+requirements["test"].extend(requirements["train"])
+
+install_requires = requirements["install"]
+setup_requires = requirements["setup"]
+tests_require = requirements["test"]
+extras_require = {
+    k: v for k, v in requirements.items() if k not in ["install", "setup"]
+}
+
+dirname = os.path.dirname(__file__)
+version_file = os.path.join(dirname, "funasr", "version.txt")
+with open(version_file, "r") as f:
+    version = f.read().strip()
+setup(
+    name="funasr",
+    version=version,
+    url="https://github.com/alibaba-damo-academy/FunASR.git",
+    author="Speech Lab, Alibaba Group, China",
+    author_email="funasr@list.alibaba-inc.com",
+    description="FunASR: A Fundamental End-to-End Speech Recognition Toolkit",
+    long_description=open(os.path.join(dirname, "README.md"), encoding="utf-8").read(),
+    long_description_content_type="text/markdown",
+    license="The MIT License",
+    packages=find_packages(include=["funasr*"]),
+    package_data={"funasr": ["version.txt"]},
+    install_requires=install_requires,
+    setup_requires=setup_requires,
+    tests_require=tests_require,
+    extras_require=extras_require,
+    python_requires=">=3.6.0",
+    classifiers=[
+        "Programming Language :: Python",
+        "Programming Language :: Python :: 3",
+        "Programming Language :: Python :: 3.7",
+        "Programming Language :: Python :: 3.8",
+        "Programming Language :: Python :: 3.9",
+        "Development Status :: 5 - Production/Stable",
+        "Intended Audience :: Science/Research",
+        "Operating System :: POSIX :: Linux",
+        "License :: OSI Approved :: MIT License",
+        "Topic :: Software Development :: Libraries :: Python Modules",
+    ],
+)