This commit is contained in:
游雁 2022-11-26 21:56:51 +08:00
parent 460ab97011
commit c087854f71
314 changed files with 42966 additions and 1 deletions

View File

@ -18,4 +18,4 @@ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
SOFTWARE.

78
README.md Normal file
View File

@ -0,0 +1,78 @@
<div align="left"><img src="image/funasr_logo.jpg" width="400"/></div>
# FunASR: A Fundamental End-to-End Speech Recognition Toolkit
<strong>FunASR</strong> hopes to build a bridge between academic research and industrial applications on speech recognition. By supporting the training & finetuning of the industrial-grade speech recognition model released on [ModelScope](https://www.modelscope.cn/models?page=1&tasks=auto-speech-recognition), researchers and developers can conduct research and production of speech recognition models more conveniently, and promote the development of speech recognition ecology. ASR for Fun
## Installation(Training and Developing)
- Clone the repo:
``` sh
git clone https://github.com/alibaba/FunASR.git
```
- Install Conda:
``` sh
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
sh Miniconda3-latest-Linux-x86_64.sh
conda create -n funasr python=3.7
conda activate funasr
```
- Install Pytorch (version >= 1.7.0):
| cuda | |
|:-----:| --- |
| 9.2 | conda install pytorch==1.7.0 torchvision==0.8.0 torchaudio==0.7.0 cudatoolkit=9.2 -c pytorch |
| 10.2 | conda install pytorch==1.8.0 torchvision==0.9.0 torchaudio==0.8.0 cudatoolkit=10.2 -c pytorch |
| 11.1 | conda install pytorch==1.8.0 torchvision==0.9.0 torchaudio==0.8.0 cudatoolkit=11.1 -c pytorch |
For more versions, please see https://pytorch.org/get-started/locally/
- Install ModelScope:
``` sh
pip install "modelscope[audio]" -f https://modelscope.oss-cn-beijing.aliyuncs.com/releases/repo.html
```
- Install other packages:
``` sh
pip install --editable ./
```
## Contact
If you have any questions about FunASR, please contact us by
- email: [funasr@list.alibaba-inc.com](funasr@list.alibaba-inc.com)
- Dingding group:
<div align="left"><img src="image/dingding.jpg" width="400"/></div>
## Acknowledge
1. We borrowed a lot of code from [Kaldi](http://kaldi-asr.org/) for data preparation.
2. We borrowed a lot of code from [ESPnet](https://github.com/espnet/espnet). FunASR follows up the training and finetuning pipelines of ESPnet.
3. We referred [Wenet](https://github.com/wenet-e2e/wenet) for building dataloader for large scale data training.
## License
This project is licensed under the [The MIT License](https://opensource.org/licenses/MIT). FunASR also contains various third-party components and some code modified from other repos under other open source licenses.
## Citations
``` bibtex
@inproceedings{gao2020universal,
title={Universal ASR: Unifying Streaming and Non-Streaming ASR Using a Single Encoder-Decoder Model},
author={Gao, Zhifu and Zhang, Shiliang and Lei, Ming and McLoughlin, Ian},
booktitle={arXiv preprint arXiv:2010.14099},
year={2010}
}
@inproceedings{gao2022paraformer,
title={Paraformer: Fast and Accurate Parallel Transformer for Non-autoregressive End-to-End Speech Recognition},
author={Gao, Zhifu and Zhang, Shiliang and McLoughlin, Ian and Yan, Zhijie},
booktitle={INTERSPEECH},
year={2022}
}
```

View File

@ -0,0 +1,17 @@
# Conformer Result
## Training Config
- Feature info: using 80 dims fbank, global cmvn, speed perturb(0.9, 1.0, 1.1), specaugment
- Train info: lr 5e-4, batch_size 25000, 2 gpu(Tesla V100), acc_grad 1, 50 epochs
- Train config: conf/train_asr_transformer.yaml
- LM config: LM was not used
- Model size: 46M
## Results (CER)
- Decode config: conf/decode_asr_transformer.yaml (ctc weight:0.5)
| testset | CER(%) |
|:-----------:|:-------:|
| dev | 4.42 |
| test | 4.87 |

View File

@ -0,0 +1,6 @@
beam_size: 10
penalty: 0.0
maxlenratio: 0.0
minlenratio: 0.0
ctc_weight: 0.5
lm_weight: 0.7

View File

@ -0,0 +1,80 @@
# network architecture
# encoder related
encoder: conformer
encoder_conf:
output_size: 256 # dimension of attention
attention_heads: 4
linear_units: 2048 # the number of units of position-wise feed forward
num_blocks: 12 # the number of encoder blocks
dropout_rate: 0.1
positional_dropout_rate: 0.1
attention_dropout_rate: 0.0
input_layer: conv2d # encoder architecture type
normalize_before: true
pos_enc_layer_type: rel_pos
selfattention_layer_type: rel_selfattn
activation_type: swish
macaron_style: true
use_cnn_module: true
cnn_module_kernel: 15
# decoder related
decoder: transformer
decoder_conf:
attention_heads: 4
linear_units: 2048
num_blocks: 6
dropout_rate: 0.1
positional_dropout_rate: 0.1
self_attention_dropout_rate: 0.0
src_attention_dropout_rate: 0.0
# hybrid CTC/attention
model_conf:
ctc_weight: 0.3
lsm_weight: 0.1 # label smoothing option
length_normalized_loss: false
# minibatch related
batch_type: length
batch_bins: 25000
num_workers: 16
# optimization related
accum_grad: 1
grad_clip: 5
max_epoch: 50
val_scheduler_criterion:
- valid
- acc
best_model_criterion:
- - valid
- acc
- max
keep_nbest_models: 10
optim: adam
optim_conf:
lr: 0.0005
scheduler: warmuplr
scheduler_conf:
warmup_steps: 30000
specaug: specaug
specaug_conf:
apply_time_warp: true
time_warp_window: 5
time_warp_mode: bicubic
apply_freq_mask: true
freq_mask_width_range:
- 0
- 30
num_freq_mask: 2
apply_time_mask: true
time_mask_width_range:
- 0
- 40
num_time_mask: 2
log_interval: 50
normalize: None

View File

@ -0,0 +1,66 @@
#!/bin/bash
# Copyright 2017 Xingyu Na
# Apache 2.0
#. ./path.sh || exit 1;
if [ $# != 3 ]; then
echo "Usage: $0 <audio-path> <text-path> <output-path>"
echo " $0 /export/a05/xna/data/data_aishell/wav /export/a05/xna/data/data_aishell/transcript data"
exit 1;
fi
aishell_audio_dir=$1
aishell_text=$2/aishell_transcript_v0.8.txt
output_dir=$3
train_dir=$output_dir/data/local/train
dev_dir=$output_dir/data/local/dev
test_dir=$output_dir/data/local/test
tmp_dir=$output_dir/data/local/tmp
mkdir -p $train_dir
mkdir -p $dev_dir
mkdir -p $test_dir
mkdir -p $tmp_dir
# data directory check
if [ ! -d $aishell_audio_dir ] || [ ! -f $aishell_text ]; then
echo "Error: $0 requires two directory arguments"
exit 1;
fi
# find wav audio file for train, dev and test resp.
find $aishell_audio_dir -iname "*.wav" > $tmp_dir/wav.flist
n=`cat $tmp_dir/wav.flist | wc -l`
[ $n -ne 141925 ] && \
echo Warning: expected 141925 data data files, found $n
grep -i "wav/train" $tmp_dir/wav.flist > $train_dir/wav.flist || exit 1;
grep -i "wav/dev" $tmp_dir/wav.flist > $dev_dir/wav.flist || exit 1;
grep -i "wav/test" $tmp_dir/wav.flist > $test_dir/wav.flist || exit 1;
rm -r $tmp_dir
# Transcriptions preparation
for dir in $train_dir $dev_dir $test_dir; do
echo Preparing $dir transcriptions
sed -e 's/\.wav//' $dir/wav.flist | awk -F '/' '{print $NF}' > $dir/utt.list
paste -d' ' $dir/utt.list $dir/wav.flist > $dir/wav.scp_all
utils/filter_scp.pl -f 1 $dir/utt.list $aishell_text > $dir/transcripts.txt
awk '{print $1}' $dir/transcripts.txt > $dir/utt.list
utils/filter_scp.pl -f 1 $dir/utt.list $dir/wav.scp_all | sort -u > $dir/wav.scp
sort -u $dir/transcripts.txt > $dir/text
done
mkdir -p $output_dir/data/train $output_dir/data/dev $output_dir/data/test
for f in wav.scp text; do
cp $train_dir/$f $output_dir/data/train/$f || exit 1;
cp $dev_dir/$f $output_dir/data/dev/$f || exit 1;
cp $test_dir/$f $output_dir/data/test/$f || exit 1;
done
echo "$0: AISHELL data preparation succeeded"
exit 0;

View File

@ -0,0 +1,53 @@
#!/usr/bin/env bash
# Copyright 2018 AIShell-Foundation(Authors:Jiayu DU, Xingyu NA, Bengu WU, Hao ZHENG)
# 2018 Beijing Shell Shell Tech. Co. Ltd. (Author: Hui BU)
# Apache 2.0
# transform raw AISHELL-2 data to kaldi format
. ./path.sh || exit 1;
tmp=
dir=
if [ $# != 3 ]; then
echo "Usage: $0 <corpus-data-dir> <tmp-dir> <output-dir>"
echo " $0 /export/AISHELL-2/iOS/train data/local/train data/train"
exit 1;
fi
corpus=$1
tmp=$2
dir=$3
echo "prepare_data.sh: Preparing data in $corpus"
mkdir -p $tmp
mkdir -p $dir
# corpus check
if [ ! -d $corpus ] || [ ! -f $corpus/wav.scp ] || [ ! -f $corpus/trans.txt ]; then
echo "Error: $0 requires wav.scp and trans.txt under $corpus directory."
exit 1;
fi
# validate utt-key list, IC0803W0380 is a bad utterance
awk '{print $1}' $corpus/wav.scp | grep -v 'IC0803W0380' > $tmp/wav_utt.list
awk '{print $1}' $corpus/trans.txt > $tmp/trans_utt.list
utils/filter_scp.pl -f 1 $tmp/wav_utt.list $tmp/trans_utt.list > $tmp/utt.list
# wav.scp
awk -F'\t' -v path_prefix=$corpus '{printf("%s\t%s/%s\n",$1,path_prefix,$2)}' $corpus/wav.scp > $tmp/tmp_wav.scp
utils/filter_scp.pl -f 1 $tmp/utt.list $tmp/tmp_wav.scp | sort -k 1 | uniq > $tmp/wav.scp
# text
utils/filter_scp.pl -f 1 $tmp/utt.list $corpus/trans.txt | sort -k 1 | uniq > $tmp/text
# copy prepared resources from tmp_dir to target dir
mkdir -p $dir
for f in wav.scp text; do
cp $tmp/$f $dir/$f || exit 1;
done
echo "local/prepare_data.sh succeeded"
exit 0;

5
egs/aishell/conformer/path.sh Executable file
View File

@ -0,0 +1,5 @@
export FUNASR_DIR=$PWD/../../..
# NOTE(kan-bayashi): Use UTF-8 in Python to avoid UnicodeDecodeError when LC_ALL=C
export PYTHONIOENCODING=UTF-8
export PATH=$FUNASR_DIR/funasr/bin:$PATH

208
egs/aishell/conformer/run.sh Executable file
View File

@ -0,0 +1,208 @@
#!/usr/bin/env bash
. ./path.sh || exit 1;
# machines configuration
CUDA_VISIBLE_DEVICES="0,1"
gpu_num=2
count=1
gpu_inference=true # Whether to perform gpu decoding, set false for cpu decoding
# for gpu decoding, inference_nj=ngpu*njob; for cpu decoding, inference_nj=njob
njob=8
train_cmd=utils/run.pl
# general configuration
feats_dir=".." #feature output dictionary, for large data
exp_dir="."
lang=zh
dumpdir=dump/fbank
feats_type=fbank
token_type=char
scp=feats.scp
type=kaldi_ark
stage=0
stop_stage=4
# feature configuration
feats_dim=80
sample_frequency=16000
nj=32
speed_perturb="0.9,1.0,1.1"
# data
data_aishell=
# exp tag
tag=""
. utils/parse_options.sh || exit 1;
# Set bash to 'debug' mode, it will exit on :
# -e 'error', -u 'undefined variable', -o ... 'error in pipeline', -x 'print commands',
set -e
set -u
set -o pipefail
train_set=train
valid_set=dev
test_sets="dev test"
asr_config=conf/train_asr_conformer.yaml
model_dir="baseline_$(basename "${asr_config}" .yaml)_${feats_type}_${lang}_${token_type}_${tag}"
inference_config=conf/decode_asr_transformer.yaml
inference_asr_model=valid.acc.ave_10best.pth
# you can set gpu num for decoding here
gpuid_list=$CUDA_VISIBLE_DEVICES # set gpus for decoding, the same as training stage by default
ngpu=$(echo $gpuid_list | awk -F "," '{print NF}')
if ${gpu_inference}; then
inference_nj=$[${ngpu}*${njob}]
else
inference_nj=$njob
fi
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
echo "stage 0: Data preparation"
# Data preparation
local/aishell_data_prep.sh ${data_aishell}/data_aishell/wav ${data_aishell}/data_aishell/transcript ${feats_dir}
for x in train dev test; do
cp ${feats_dir}/data/${x}/text ${feats_dir}/data/${x}/text.org
paste -d " " <(cut -f 1 -d" " ${feats_dir}/data/${x}/text.org) <(cut -f 2- -d" " ${feats_dir}/data/${x}/text.org | tr -d " ") \
> ${feats_dir}/data/${x}/text
utils/text2token.py -n 1 -s 1 ${feats_dir}/data/${x}/text > ${feats_dir}/data/${x}/text.org
mv ${feats_dir}/data/${x}/text.org ${feats_dir}/data/${x}/text
done
fi
feat_train_dir=${feats_dir}/${dumpdir}/train; mkdir -p ${feat_train_dir}
feat_dev_dir=${feats_dir}/${dumpdir}/dev; mkdir -p ${feat_dev_dir}
feat_test_dir=${feats_dir}/${dumpdir}/test; mkdir -p ${feat_test_dir}
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
echo "stage 1: Feature Generation"
# compute fbank features
fbankdir=${feats_dir}/fbank
utils/compute_fbank.sh --cmd "$train_cmd" --nj $nj --speed_perturb ${speed_perturb} \
${feats_dir}/data/train ${exp_dir}/exp/make_fbank/train ${fbankdir}/train
utils/fix_data_feat.sh ${fbankdir}/train
utils/compute_fbank.sh --cmd "$train_cmd" --nj $nj \
${feats_dir}/data/dev ${exp_dir}/exp/make_fbank/dev ${fbankdir}/dev
utils/fix_data_feat.sh ${fbankdir}/dev
utils/compute_fbank.sh --cmd "$train_cmd" --nj $nj \
${feats_dir}/data/test ${exp_dir}/exp/make_fbank/test ${fbankdir}/test
utils/fix_data_feat.sh ${fbankdir}/test
# compute global cmvn
utils/compute_cmvn.sh --cmd "$train_cmd" --nj $nj \
${fbankdir}/train ${exp_dir}/exp/make_fbank/train
# apply cmvn
utils/apply_cmvn.sh --cmd "$train_cmd" --nj $nj \
${fbankdir}/train ${fbankdir}/train/cmvn.json ${exp_dir}/exp/make_fbank/train ${feat_train_dir}
utils/apply_cmvn.sh --cmd "$train_cmd" --nj $nj \
${fbankdir}/dev ${fbankdir}/train/cmvn.json ${exp_dir}/exp/make_fbank/dev ${feat_dev_dir}
utils/apply_cmvn.sh --cmd "$train_cmd" --nj $nj \
${fbankdir}/test ${fbankdir}/train/cmvn.json ${exp_dir}/exp/make_fbank/test ${feat_test_dir}
cp ${fbankdir}/train/text ${fbankdir}/train/speech_shape ${fbankdir}/train/text_shape ${feat_train_dir}
cp ${fbankdir}/dev/text ${fbankdir}/dev/speech_shape ${fbankdir}/dev/text_shape ${feat_dev_dir}
cp ${fbankdir}/test/text ${fbankdir}/test/speech_shape ${fbankdir}/test/text_shape ${feat_test_dir}
utils/fix_data_feat.sh ${feat_train_dir}
utils/fix_data_feat.sh ${feat_dev_dir}
utils/fix_data_feat.sh ${feat_test_dir}
fi
token_list=${feats_dir}/data/${lang}_token_list/char/tokens.txt
echo "dictionary: ${token_list}"
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
echo "stage 2: Dictionary Preparation"
mkdir -p ${feats_dir}/data/${lang}_token_list/char/
echo "make a dictionary"
echo "<blank>" > ${token_list}
echo "<s>" >> ${token_list}
echo "</s>" >> ${token_list}
utils/text2token.py -s 1 -n 1 --space "" ${feats_dir}/data/train/text | cut -f 2- -d" " | tr " " "\n" \
| sort | uniq | grep -a -v -e '^\s*$' | awk '{print $0}' >> ${token_list}
num_token=$(cat ${token_list} | wc -l)
echo "<unk>" >> ${token_list}
vocab_size=$(cat ${token_list} | wc -l)
awk -v v=,${vocab_size} '{print $0v}' ${feat_train_dir}/text_shape > ${feat_train_dir}/text_shape.char
awk -v v=,${vocab_size} '{print $0v}' ${feat_dev_dir}/text_shape > ${feat_dev_dir}/text_shape.char
mkdir -p ${feats_dir}/asr_stats_fbank_zh_char/train
mkdir -p ${feats_dir}/asr_stats_fbank_zh_char/dev
cp ${feat_train_dir}/speech_shape ${feat_train_dir}/text_shape ${feat_train_dir}/text_shape.char ${feats_dir}/asr_stats_fbank_zh_char/train
cp ${feat_dev_dir}/speech_shape ${feat_dev_dir}/text_shape ${feat_dev_dir}/text_shape.char ${feats_dir}/asr_stats_fbank_zh_char/dev
fi
# Training Stage
world_size=$gpu_num # run on one machine
if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
mkdir -p ${exp_dir}/exp/${model_dir}
mkdir -p ${exp_dir}/exp/${model_dir}/log
INIT_FILE=$exp_dir/ddp_init
if [ -f $INIT_FILE ];then
rm -f $INIT_FILE
fi
init_method=file://$(readlink -f $INIT_FILE)
echo "$0: init method is $init_method"
for ((i = 0; i < $gpu_num; ++i)); do
{
rank=$i
local_rank=$i
gpu_id=$(echo $CUDA_VISIBLE_DEVICES | cut -d',' -f$[$i+1])
asr_train.py \
--gpu_id $gpu_id \
--use_preprocessor true \
--token_type char \
--token_list $token_list \
--train_data_path_and_name_and_type ${feats_dir}/${dumpdir}/${train_set}/${scp},speech,${type} \
--train_data_path_and_name_and_type ${feats_dir}/${dumpdir}/${train_set}/text,text,text \
--train_shape_file ${feats_dir}/asr_stats_fbank_zh_char/${train_set}/speech_shape \
--train_shape_file ${feats_dir}/asr_stats_fbank_zh_char/${train_set}/text_shape.char \
--valid_data_path_and_name_and_type ${feats_dir}/${dumpdir}/${valid_set}/${scp},speech,${type} \
--valid_data_path_and_name_and_type ${feats_dir}/${dumpdir}/${valid_set}/text,text,text \
--valid_shape_file ${feats_dir}/asr_stats_fbank_zh_char/${valid_set}/speech_shape \
--valid_shape_file ${feats_dir}/asr_stats_fbank_zh_char/${valid_set}/text_shape.char \
--resume true \
--output_dir ${exp_dir}/exp/${model_dir} \
--config $asr_config \
--input_size $feats_dim \
--ngpu $gpu_num \
--num_worker_count $count \
--multiprocessing_distributed true \
--dist_init_method $init_method \
--dist_world_size $world_size \
--dist_rank $rank \
--local_rank $local_rank 1> ${exp_dir}/exp/${model_dir}/log/train.log.$i 2>&1
} &
done
wait
fi
# Testing Stage
if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
utils/easy_asr_infer.sh \
--lang zh \
--datadir ${feats_dir} \
--feats_type ${feats_type} \
--feats_dim ${feats_dim} \
--token_type ${token_type} \
--gpu_inference ${gpu_inference} \
--inference_config "${inference_config}" \
--test_sets "${test_sets}" \
--token_list $token_list \
--asr_exp ${exp_dir}/${model_dir} \
--stage 12 \
--stop_stage 12 \
--scp $scp \
--text text \
--inference_nj $inference_nj \
--njob $njob \
--inference_asr_model $inference_asr_model \
--gpuid_list $gpuid_list \
--mode asr
fi

1
egs/aishell/conformer/utils Symbolic link
View File

@ -0,0 +1 @@
../tranformer/utils

View File

@ -0,0 +1,24 @@
# Paraformer
pretrained model in [ModelScope](https://www.modelscope.cn/home)[speech_paraformer_asr_nat-aishell1-pytorch](https://www.modelscope.cn/models/damo/speech_paraformer_asr_nat-aishell1-pytorch/summary)
## Training Config
- Feature info: using 80 dims fbank, global cmvn, speed perturb(0.9, 1.0, 1.1), specaugment
- Train info: lr 5e-4, batch_size 25000, 2 gpu(Tesla V100), acc_grad 1, 50 epochs
- Train config: conf/train_asr_paraformer_conformer_12e_6d_2048_256.yaml
- LM config: LM was not used
## Results (CER)
- Decode config: conf/decode_asr_transformer_noctc_1best.yaml (ctc weight:0.0)
| testset | CER(%) |
|:-----------:|:-------:|
| dev | 4.66 |
| test | 5.11 |
- Decode config: conf/decode_asr_transformer.yaml (ctc weight:0.5)
| testset | CER(%) |
|:-----------:|:-------:|
| dev | 4.52 |
| test | 4.94 |

View File

@ -0,0 +1,6 @@
beam_size: 10
penalty: 0.0
maxlenratio: 0.0
minlenratio: 0.0
ctc_weight: 0.5
lm_weight: 0.7

View File

@ -0,0 +1,6 @@
beam_size: 1
penalty: 0.0
maxlenratio: 0.0
minlenratio: 0.0
ctc_weight: 0.0
lm_weight: 0.15

View File

@ -0,0 +1,91 @@
# network architecture
# encoder related
encoder: conformer
encoder_conf:
output_size: 256 # dimension of attention
attention_heads: 4
linear_units: 2048 # the number of units of position-wise feed forward
num_blocks: 12 # the number of encoder blocks
dropout_rate: 0.1
positional_dropout_rate: 0.1
attention_dropout_rate: 0.0
input_layer: conv2d # encoder architecture type
normalize_before: true
pos_enc_layer_type: rel_pos
selfattention_layer_type: rel_selfattn
activation_type: swish
macaron_style: true
use_cnn_module: true
cnn_module_kernel: 15
# decoder related
decoder: paraformer_decoder_san
decoder_conf:
attention_heads: 4
linear_units: 2048
num_blocks: 6
dropout_rate: 0.1
positional_dropout_rate: 0.1
self_attention_dropout_rate: 0.0
src_attention_dropout_rate: 0.0
model: paraformer
model_conf:
ctc_weight: 0.3
lsm_weight: 0.1
length_normalized_loss: false
predictor_weight: 1.0
sampling_ratio: 0.4
# minibatch related
batch_type: length
batch_bins: 25000
num_workers: 16
# optimization related
accum_grad: 1
grad_clip: 5
max_epoch: 50
val_scheduler_criterion:
- valid
- acc
best_model_criterion:
- - valid
- acc
- max
keep_nbest_models: 10
optim: adam
optim_conf:
lr: 0.0005
scheduler: warmuplr
scheduler_conf:
warmup_steps: 30000
specaug: specaug
specaug_conf:
apply_time_warp: true
time_warp_window: 5
time_warp_mode: bicubic
apply_freq_mask: true
freq_mask_width_range:
- 0
- 30
num_freq_mask: 2
apply_time_mask: true
time_mask_width_range:
- 0
- 40
num_time_mask: 2
predictor: cif_predictor
predictor_conf:
idim: 256
threshold: 1.0
l_order: 1
r_order: 1
tail_threshold: 0.45
log_interval: 50
normalize: None

View File

@ -0,0 +1,92 @@
# network architecture
# encoder related
encoder: conformer
encoder_conf:
output_size: 320 # dimension of attention
attention_heads: 4
linear_units: 1280 # the number of units of position-wise feed forward
num_blocks: 20 # the number of encoder blocks
dropout_rate: 0.1
positional_dropout_rate: 0.1
attention_dropout_rate: 0.0
input_layer: conv2d # encoder architecture type
normalize_before: true
pos_enc_layer_type: rel_pos
selfattention_layer_type: rel_selfattn
activation_type: swish
macaron_style: true
use_cnn_module: true
cnn_module_kernel: 15
# decoder related
decoder: paraformer_decoder_san
decoder_conf:
attention_heads: 4
linear_units: 2048
num_blocks: 6
dropout_rate: 0.1
positional_dropout_rate: 0.1
self_attention_dropout_rate: 0.0
src_attention_dropout_rate: 0.0
# hybrid CTC/attention
model: paraformer
model_conf:
ctc_weight: 0.3
lsm_weight: 0.1 # label smoothing option
length_normalized_loss: false
predictor_weight: 1.0
sampling_ratio: 0.4
# minibatch related
batch_type: length
batch_bins: 25000
num_workers: 16
# optimization related
accum_grad: 4
grad_clip: 5
max_epoch: 50
val_scheduler_criterion:
- valid
- acc
best_model_criterion:
- - valid
- acc
- max
keep_nbest_models: 10
optim: adam
optim_conf:
lr: 0.0005
scheduler: warmuplr
scheduler_conf:
warmup_steps: 30000
specaug: specaug
specaug_conf:
apply_time_warp: true
time_warp_window: 5
time_warp_mode: bicubic
apply_freq_mask: true
freq_mask_width_range:
- 0
- 30
num_freq_mask: 2
apply_time_mask: true
time_mask_width_range:
- 0
- 40
num_time_mask: 2
predictor: cif_predictor
predictor_conf:
idim: 256
threshold: 1.0
l_order: 1
r_order: 1
tail_threshold: 0.45
log_interval: 50
normalize: None

View File

@ -0,0 +1,114 @@
# network architecture
# encoder related
encoder: sanm
encoder_conf:
output_size: 320 # dimension of attention
attention_heads: 4
linear_units: 1280 # the number of units of position-wise feed forward
num_blocks: 40 # the number of encoder blocks
dropout_rate: 0.1
positional_dropout_rate: 0.1
attention_dropout_rate: 0.1
input_layer: pe # encoder architecture type
pos_enc_class: SinusoidalPositionEncoder
normalize_before: true
kernel_size: 11
sanm_shfit: 0
selfattention_layer_type: sanm
# decoder related
decoder: paraformer_decoder_sanm
decoder_conf:
attention_heads: 4
linear_units: 1280
num_blocks: 12
dropout_rate: 0.1
positional_dropout_rate: 0.1
self_attention_dropout_rate: 0.1
src_attention_dropout_rate: 0.1
att_layer_num: 6
kernel_size: 11
sanm_shfit: 0
predictor: cif_predictor
predictor_conf:
idim: 320
threshold: 1.0
l_order: 1
r_order: 1
tail_threshold: 0.45
# hybrid CTC/attention
model: paraformer
model_conf:
ctc_weight: 0.0
lsm_weight: 0.1 # label smoothing option
length_normalized_loss: true
predictor_weight: 1.0
predictor_bias: 0
sampling_ratio: 0.75
# minibatch related
# dataset_type: small
batch_type: length
batch_bins: 6000
num_workers: 16
# dataset_type: large
dataset_conf:
filter_conf:
min_length: 10
max_length: 250
min_token_length: 1
max_token_length: 200
shuffle: True
shuffle_conf:
shuffle_size: 10240
sort_size: 500
batch_conf:
batch_type: token
batch_size: 6000
num_workers: 16
# optimization related
accum_grad: 1
grad_clip: 5
max_epoch: 20
val_scheduler_criterion:
- valid
- acc
best_model_criterion:
- - valid
- acc
- max
keep_nbest_models: 5
optim: adam
optim_conf:
lr: 0.0005
scheduler: warmuplr
scheduler_conf:
warmup_steps: 15000
specaug: specaug_lfr
specaug_conf:
apply_time_warp: false
time_warp_window: 5
time_warp_mode: bicubic
apply_freq_mask: true
freq_mask_width_range:
- 0
- 30
lfr_rate: 6
num_freq_mask: 1
apply_time_mask: true
time_mask_width_range:
- 0
- 12
num_time_mask: 1
unused_parameters: true
log_interval: 50
normalize: None
split_with_space: true

View File

@ -0,0 +1,114 @@
# network architecture
# encoder related
encoder: sanm
encoder_conf:
output_size: 512 # dimension of attention
attention_heads: 4
linear_units: 2048 # the number of units of position-wise feed forward
num_blocks: 50 # the number of encoder blocks
dropout_rate: 0.1
positional_dropout_rate: 0.1
attention_dropout_rate: 0.1
input_layer: pe # encoder architecture type
pos_enc_class: SinusoidalPositionEncoder
normalize_before: true
kernel_size: 11
sanm_shfit: 0
selfattention_layer_type: sanm
# decoder related
decoder: paraformer_decoder_sanm
decoder_conf:
attention_heads: 4
linear_units: 2048
num_blocks: 16
dropout_rate: 0.1
positional_dropout_rate: 0.1
self_attention_dropout_rate: 0.1
src_attention_dropout_rate: 0.1
att_layer_num: 16
kernel_size: 11
sanm_shfit: 0
predictor: cif_predictor_v2
predictor_conf:
idim: 512
threshold: 1.0
l_order: 1
r_order: 1
tail_threshold: 0.45
# hybrid CTC/attention
model: paraformer
model_conf:
ctc_weight: 0.0
lsm_weight: 0.1 # label smoothing option
length_normalized_loss: true
predictor_weight: 1.0
predictor_bias: 1
sampling_ratio: 0.75
# minibatch related
# dataset_type: small
batch_type: length
batch_bins: 10000
num_workers: 16
# dataset_type: large
dataset_conf:
filter_conf:
min_length: 10
max_length: 250
min_token_length: 1
max_token_length: 200
shuffle: true
shuffle_conf:
shuffle_size: 10240
sort_size: 500
batch_conf:
batch_type: 'token'
batch_size: 6000
num_workers: 16
# optimization related
accum_grad: 1
grad_clip: 5
max_epoch: 20
val_scheduler_criterion:
- valid
- acc
best_model_criterion:
- - valid
- acc
- max
keep_nbest_models: 5
optim: adam
optim_conf:
lr: 0.0005
scheduler: warmuplr
scheduler_conf:
warmup_steps: 30000
specaug: specaug_lfr
specaug_conf:
apply_time_warp: false
time_warp_window: 5
time_warp_mode: bicubic
apply_freq_mask: true
freq_mask_width_range:
- 0
- 30
lfr_rate: 6
num_freq_mask: 1
apply_time_mask: true
time_mask_width_range:
- 0
- 12
num_time_mask: 1
unused_parameters: true
log_interval: 50
normalize: None
split_with_space: true

View File

@ -0,0 +1,99 @@
# network architecture
# encoder related
encoder: conformer
encoder_conf:
output_size: 256 # dimension of attention
attention_heads: 4
linear_units: 2048 # the number of units of position-wise feed forward
num_blocks: 12 # the number of encoder blocks
dropout_rate: 0.1
positional_dropout_rate: 0.1
attention_dropout_rate: 0.0
input_layer: conv2d # encoder architecture type
normalize_before: true
pos_enc_layer_type: rel_pos
selfattention_layer_type: rel_selfattn
activation_type: swish
macaron_style: true
use_cnn_module: true
cnn_module_kernel: 15
# decoder related
decoder: paraformer_decoder_san
decoder_conf:
attention_heads: 4
linear_units: 2048
num_blocks: 6
dropout_rate: 0.1
positional_dropout_rate: 0.1
self_attention_dropout_rate: 0.0
src_attention_dropout_rate: 0.0
# hybrid CTC/attention
model: paraformer_bert
model_conf:
ctc_weight: 0.3
lsm_weight: 0.1 # label smoothing option
length_normalized_loss: false
predictor_weight: 1.0
sampling_ratio: 0.4
embeds_id: 3
embed_dims: 768
embeds_loss_weight: 2.0
# minibatch related
#batch_type: length
#batch_bins: 40000
batch_type: numel
batch_bins: 2000000
num_workers: 16
# optimization related
accum_grad: 4
grad_clip: 5
max_epoch: 50
val_scheduler_criterion:
- valid
- acc
best_model_criterion:
- - valid
- acc
- max
keep_nbest_models: 10
optim: adam
optim_conf:
lr: 0.0005
scheduler: warmuplr
scheduler_conf:
warmup_steps: 30000
specaug: specaug
specaug_conf:
apply_time_warp: true
time_warp_window: 5
time_warp_mode: bicubic
apply_freq_mask: true
freq_mask_width_range:
- 0
- 30
num_freq_mask: 2
apply_time_mask: true
time_mask_width_range:
- 0
- 40
num_time_mask: 2
predictor: cif_predictor
predictor_conf:
idim: 256
threshold: 1.0
l_order: 1
r_order: 1
tail_threshold: 0.45
log_interval: 50
normalize: None

View File

@ -0,0 +1,66 @@
#!/bin/bash
# Copyright 2017 Xingyu Na
# Apache 2.0
#. ./path.sh || exit 1;
if [ $# != 3 ]; then
echo "Usage: $0 <audio-path> <text-path> <output-path>"
echo " $0 /export/a05/xna/data/data_aishell/wav /export/a05/xna/data/data_aishell/transcript data"
exit 1;
fi
aishell_audio_dir=$1
aishell_text=$2/aishell_transcript_v0.8.txt
output_dir=$3
train_dir=$output_dir/data/local/train
dev_dir=$output_dir/data/local/dev
test_dir=$output_dir/data/local/test
tmp_dir=$output_dir/data/local/tmp
mkdir -p $train_dir
mkdir -p $dev_dir
mkdir -p $test_dir
mkdir -p $tmp_dir
# data directory check
if [ ! -d $aishell_audio_dir ] || [ ! -f $aishell_text ]; then
echo "Error: $0 requires two directory arguments"
exit 1;
fi
# find wav audio file for train, dev and test resp.
find $aishell_audio_dir -iname "*.wav" > $tmp_dir/wav.flist
n=`cat $tmp_dir/wav.flist | wc -l`
[ $n -ne 141925 ] && \
echo Warning: expected 141925 data data files, found $n
grep -i "wav/train" $tmp_dir/wav.flist > $train_dir/wav.flist || exit 1;
grep -i "wav/dev" $tmp_dir/wav.flist > $dev_dir/wav.flist || exit 1;
grep -i "wav/test" $tmp_dir/wav.flist > $test_dir/wav.flist || exit 1;
rm -r $tmp_dir
# Transcriptions preparation
for dir in $train_dir $dev_dir $test_dir; do
echo Preparing $dir transcriptions
sed -e 's/\.wav//' $dir/wav.flist | awk -F '/' '{print $NF}' > $dir/utt.list
paste -d' ' $dir/utt.list $dir/wav.flist > $dir/wav.scp_all
utils/filter_scp.pl -f 1 $dir/utt.list $aishell_text > $dir/transcripts.txt
awk '{print $1}' $dir/transcripts.txt > $dir/utt.list
utils/filter_scp.pl -f 1 $dir/utt.list $dir/wav.scp_all | sort -u > $dir/wav.scp
sort -u $dir/transcripts.txt > $dir/text
done
mkdir -p $output_dir/data/train $output_dir/data/dev $output_dir/data/test
for f in wav.scp text; do
cp $train_dir/$f $output_dir/data/train/$f || exit 1;
cp $dev_dir/$f $output_dir/data/dev/$f || exit 1;
cp $test_dir/$f $output_dir/data/test/$f || exit 1;
done
echo "$0: AISHELL data preparation succeeded"
exit 0;

View File

@ -0,0 +1,53 @@
#!/usr/bin/env bash
# Copyright 2018 AIShell-Foundation(Authors:Jiayu DU, Xingyu NA, Bengu WU, Hao ZHENG)
# 2018 Beijing Shell Shell Tech. Co. Ltd. (Author: Hui BU)
# Apache 2.0
# transform raw AISHELL-2 data to kaldi format
. ./path.sh || exit 1;
tmp=
dir=
if [ $# != 3 ]; then
echo "Usage: $0 <corpus-data-dir> <tmp-dir> <output-dir>"
echo " $0 /export/AISHELL-2/iOS/train data/local/train data/train"
exit 1;
fi
corpus=$1
tmp=$2
dir=$3
echo "prepare_data.sh: Preparing data in $corpus"
mkdir -p $tmp
mkdir -p $dir
# corpus check
if [ ! -d $corpus ] || [ ! -f $corpus/wav.scp ] || [ ! -f $corpus/trans.txt ]; then
echo "Error: $0 requires wav.scp and trans.txt under $corpus directory."
exit 1;
fi
# validate utt-key list, IC0803W0380 is a bad utterance
awk '{print $1}' $corpus/wav.scp | grep -v 'IC0803W0380' > $tmp/wav_utt.list
awk '{print $1}' $corpus/trans.txt > $tmp/trans_utt.list
utils/filter_scp.pl -f 1 $tmp/wav_utt.list $tmp/trans_utt.list > $tmp/utt.list
# wav.scp
awk -F'\t' -v path_prefix=$corpus '{printf("%s\t%s/%s\n",$1,path_prefix,$2)}' $corpus/wav.scp > $tmp/tmp_wav.scp
utils/filter_scp.pl -f 1 $tmp/utt.list $tmp/tmp_wav.scp | sort -k 1 | uniq > $tmp/wav.scp
# text
utils/filter_scp.pl -f 1 $tmp/utt.list $corpus/trans.txt | sort -k 1 | uniq > $tmp/text
# copy prepared resources from tmp_dir to target dir
mkdir -p $dir
for f in wav.scp text; do
cp $tmp/$f $dir/$f || exit 1;
done
echo "local/prepare_data.sh succeeded"
exit 0;

5
egs/aishell/paraformer/path.sh Executable file
View File

@ -0,0 +1,5 @@
export FUNASR_DIR=$PWD/../../..
# NOTE(kan-bayashi): Use UTF-8 in Python to avoid UnicodeDecodeError when LC_ALL=C
export PYTHONIOENCODING=UTF-8
export PATH=$FUNASR_DIR/funasr/bin:$PATH

208
egs/aishell/paraformer/run.sh Executable file
View File

@ -0,0 +1,208 @@
#!/usr/bin/env bash
. ./path.sh || exit 1;
# machines configuration
CUDA_VISIBLE_DEVICES="0,1"
gpu_num=2
count=1
gpu_inference=true # Whether to perform gpu decoding, set false for cpu decoding
# for gpu decoding, inference_nj=ngpu*njob; for cpu decoding, inference_nj=njob
njob=8
train_cmd=utils/run.pl
# general configuration
feats_dir=".." #feature output dictionary, for large data
exp_dir="."
lang=zh
dumpdir=dump/fbank
feats_type=fbank
token_type=char
scp=feats.scp
type=kaldi_ark
stage=0
stop_stage=4
# feature configuration
feats_dim=80
sample_frequency=16000
nj=32
speed_perturb="0.9,1.0,1.1"
# data
data_aishell=
# exp tag
tag=""
. utils/parse_options.sh || exit 1;
# Set bash to 'debug' mode, it will exit on :
# -e 'error', -u 'undefined variable', -o ... 'error in pipeline', -x 'print commands',
set -e
set -u
set -o pipefail
train_set=train
valid_set=dev
test_sets="dev test"
asr_config=conf/train_asr_paraformer_conformer_12e_6d_2048_256.yaml
model_dir="baseline_$(basename "${asr_config}" .yaml)_${feats_type}_${lang}_${token_type}_${tag}"
inference_config=conf/decode_asr_transformer_noctc_1best.yaml
inference_asr_model=valid.acc.ave_10best.pth
# you can set gpu num for decoding here
gpuid_list=$CUDA_VISIBLE_DEVICES # set gpus for decoding, the same as training stage by default
ngpu=$(echo $gpuid_list | awk -F "," '{print NF}')
if ${gpu_inference}; then
inference_nj=$[${ngpu}*${njob}]
else
inference_nj=$njob
fi
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
echo "stage 0: Data preparation"
# Data preparation
local/aishell_data_prep.sh ${data_aishell}/data_aishell/wav ${data_aishell}/data_aishell/transcript ${feats_dir}
for x in train dev test; do
cp ${feats_dir}/data/${x}/text ${feats_dir}/data/${x}/text.org
paste -d " " <(cut -f 1 -d" " ${feats_dir}/data/${x}/text.org) <(cut -f 2- -d" " ${feats_dir}/data/${x}/text.org | tr -d " ") \
> ${feats_dir}/data/${x}/text
utils/text2token.py -n 1 -s 1 ${feats_dir}/data/${x}/text > ${feats_dir}/data/${x}/text.org
mv ${feats_dir}/data/${x}/text.org ${feats_dir}/data/${x}/text
done
fi
feat_train_dir=${feats_dir}/${dumpdir}/train; mkdir -p ${feat_train_dir}
feat_dev_dir=${feats_dir}/${dumpdir}/dev; mkdir -p ${feat_dev_dir}
feat_test_dir=${feats_dir}/${dumpdir}/test; mkdir -p ${feat_test_dir}
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
echo "stage 1: Feature Generation"
# compute fbank features
fbankdir=${feats_dir}/fbank
utils/compute_fbank.sh --cmd "$train_cmd" --nj $nj --speed_perturb ${speed_perturb} \
${feats_dir}/data/train ${exp_dir}/exp/make_fbank/train ${fbankdir}/train
utils/fix_data_feat.sh ${fbankdir}/train
utils/compute_fbank.sh --cmd "$train_cmd" --nj $nj \
${feats_dir}/data/dev ${exp_dir}/exp/make_fbank/dev ${fbankdir}/dev
utils/fix_data_feat.sh ${fbankdir}/dev
utils/compute_fbank.sh --cmd "$train_cmd" --nj $nj \
${feats_dir}/data/test ${exp_dir}/exp/make_fbank/test ${fbankdir}/test
utils/fix_data_feat.sh ${fbankdir}/test
# compute global cmvn
utils/compute_cmvn.sh --cmd "$train_cmd" --nj $nj \
${fbankdir}/train ${exp_dir}/exp/make_fbank/train
# apply cmvn
utils/apply_cmvn.sh --cmd "$train_cmd" --nj $nj \
${fbankdir}/train ${fbankdir}/train/cmvn.json ${exp_dir}/exp/make_fbank/train ${feat_train_dir}
utils/apply_cmvn.sh --cmd "$train_cmd" --nj $nj \
${fbankdir}/dev ${fbankdir}/train/cmvn.json ${exp_dir}/exp/make_fbank/dev ${feat_dev_dir}
utils/apply_cmvn.sh --cmd "$train_cmd" --nj $nj \
${fbankdir}/test ${fbankdir}/train/cmvn.json ${exp_dir}/exp/make_fbank/test ${feat_test_dir}
cp ${fbankdir}/train/text ${fbankdir}/train/speech_shape ${fbankdir}/train/text_shape ${feat_train_dir}
cp ${fbankdir}/dev/text ${fbankdir}/dev/speech_shape ${fbankdir}/dev/text_shape ${feat_dev_dir}
cp ${fbankdir}/test/text ${fbankdir}/test/speech_shape ${fbankdir}/test/text_shape ${feat_test_dir}
utils/fix_data_feat.sh ${feat_train_dir}
utils/fix_data_feat.sh ${feat_dev_dir}
utils/fix_data_feat.sh ${feat_test_dir}
fi
token_list=${feats_dir}/data/${lang}_token_list/char/tokens.txt
echo "dictionary: ${token_list}"
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
echo "stage 2: Dictionary Preparation"
mkdir -p ${feats_dir}/data/${lang}_token_list/char/
echo "make a dictionary"
echo "<blank>" > ${token_list}
echo "<s>" >> ${token_list}
echo "</s>" >> ${token_list}
utils/text2token.py -s 1 -n 1 --space "" ${feats_dir}/data/train/text | cut -f 2- -d" " | tr " " "\n" \
| sort | uniq | grep -a -v -e '^\s*$' | awk '{print $0}' >> ${token_list}
num_token=$(cat ${token_list} | wc -l)
echo "<unk>" >> ${token_list}
vocab_size=$(cat ${token_list} | wc -l)
awk -v v=,${vocab_size} '{print $0v}' ${feat_train_dir}/text_shape > ${feat_train_dir}/text_shape.char
awk -v v=,${vocab_size} '{print $0v}' ${feat_dev_dir}/text_shape > ${feat_dev_dir}/text_shape.char
mkdir -p ${feats_dir}/asr_stats_fbank_zh_char/train
mkdir -p ${feats_dir}/asr_stats_fbank_zh_char/dev
cp ${feat_train_dir}/speech_shape ${feat_train_dir}/text_shape ${feat_train_dir}/text_shape.char ${feats_dir}/asr_stats_fbank_zh_char/train
cp ${feat_dev_dir}/speech_shape ${feat_dev_dir}/text_shape ${feat_dev_dir}/text_shape.char ${feats_dir}/asr_stats_fbank_zh_char/dev
fi
# Training Stage
world_size=$gpu_num # run on one machine
if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
mkdir -p ${exp_dir}/exp/${model_dir}
mkdir -p ${exp_dir}/exp/log
INIT_FILE=$exp_dir/ddp_init
if [ -f $INIT_FILE ];then
rm -f $INIT_FILE
fi
init_method=file://$(readlink -f $INIT_FILE)
echo "$0: init method is $init_method"
for ((i = 0; i < $gpu_num; ++i)); do
{
rank=$i
local_rank=$i
gpu_id=$(echo $CUDA_VISIBLE_DEVICES | cut -d',' -f$[$i+1])
asr_train_paraformer.py \
--gpu_id $gpu_id \
--use_preprocessor true \
--token_type char \
--token_list $token_list \
--train_data_path_and_name_and_type ${feats_dir}/${dumpdir}/${train_set}/${scp},speech,${type} \
--train_data_path_and_name_and_type ${feats_dir}/${dumpdir}/${train_set}/text,text,text \
--train_shape_file ${feats_dir}/asr_stats_fbank_zh_char/${train_set}/speech_shape \
--train_shape_file ${feats_dir}/asr_stats_fbank_zh_char/${train_set}/text_shape.char \
--valid_data_path_and_name_and_type ${feats_dir}/${dumpdir}/${valid_set}/${scp},speech,${type} \
--valid_data_path_and_name_and_type ${feats_dir}/${dumpdir}/${valid_set}/text,text,text \
--valid_shape_file ${feats_dir}/asr_stats_fbank_zh_char/${valid_set}/speech_shape \
--valid_shape_file ${feats_dir}/asr_stats_fbank_zh_char/${valid_set}/text_shape.char \
--resume true \
--output_dir ${exp_dir}/exp/${model_dir} \
--config $asr_config \
--input_size $feats_dim \
--ngpu $gpu_num \
--num_worker_count $count \
--multiprocessing_distributed true \
--dist_init_method $init_method \
--dist_world_size $world_size \
--dist_rank $rank \
--local_rank $local_rank 1> ${exp_dir}/exp/${model_dir}/log/train.log.$i 2>&1
} &
done
wait
fi
# Testing Stage
if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
utils/easy_asr_infer.sh \
--lang zh \
--datadir ${feats_dir} \
--feats_type ${feats_type} \
--feats_dim ${feats_dim} \
--token_type ${token_type} \
--gpu_inference ${gpu_inference} \
--inference_config "${inference_config}" \
--test_sets "${test_sets}" \
--token_list $token_list \
--asr_exp ${exp_dir}/${model_dir} \
--stage 12 \
--stop_stage 12 \
--scp $scp \
--text text \
--inference_nj $inference_nj \
--njob $njob \
--inference_asr_model $inference_asr_model \
--gpuid_list $gpuid_list \
--mode paraformer
fi

View File

@ -0,0 +1 @@
../tranformer/utils

View File

@ -0,0 +1,18 @@
# ParaformerBert + specaug + speed perturbation + specaugmentation
## Environments
- date: `Mon Nov 21 13:25:30 CST 2022`
- python version: `3.7.12`
- FunASR version: `0.1.0`
- pytorch version: `pytorch 1.7.0`
## Config files
- train config: conf/train_asr_paraformerbert_conformer_12e_6d_2048_256.yaml
- model size: 46M
- lm config: LM was not used
- decode config: conf/decode_asr_transformer_noctc_1best.yaml (CTC was not used)
## Results (CER)
| testset | CER(%) |
|:-----------:|:-------:|
| dev | 4.30 |
| test | 4.80 |

View File

@ -0,0 +1,6 @@
beam_size: 10
penalty: 0.0
maxlenratio: 0.0
minlenratio: 0.0
ctc_weight: 0.5
lm_weight: 0.7

View File

@ -0,0 +1,6 @@
beam_size: 1
penalty: 0.0
maxlenratio: 0.0
minlenratio: 0.0
ctc_weight: 0.0
lm_weight: 0.15

View File

@ -0,0 +1,92 @@
# network architecture
# encoder related
encoder: conformer
encoder_conf:
output_size: 256 # dimension of attention
attention_heads: 4
linear_units: 2048 # the number of units of position-wise feed forward
num_blocks: 12 # the number of encoder blocks
dropout_rate: 0.1
positional_dropout_rate: 0.1
attention_dropout_rate: 0.0
input_layer: conv2d # encoder architecture type
normalize_before: true
pos_enc_layer_type: rel_pos
selfattention_layer_type: rel_selfattn
activation_type: swish
macaron_style: true
use_cnn_module: true
cnn_module_kernel: 15
# decoder related
decoder: paraformer_decoder_san
decoder_conf:
attention_heads: 4
linear_units: 2048
num_blocks: 6
dropout_rate: 0.1
positional_dropout_rate: 0.1
self_attention_dropout_rate: 0.0
src_attention_dropout_rate: 0.0
# hybrid CTC/attention
model: paraformer
model_conf:
ctc_weight: 0.3
lsm_weight: 0.1 # label smoothing option
length_normalized_loss: false
predictor_weight: 1.0
sampling_ratio: 0.4
# minibatch related
batch_type: length
batch_bins: 25000
num_workers: 16
# optimization related
accum_grad: 1
grad_clip: 5
max_epoch: 50
val_scheduler_criterion:
- valid
- acc
best_model_criterion:
- - valid
- acc
- max
keep_nbest_models: 10
optim: adam
optim_conf:
lr: 0.0005
scheduler: warmuplr
scheduler_conf:
warmup_steps: 30000
specaug: specaug
specaug_conf:
apply_time_warp: true
time_warp_window: 5
time_warp_mode: bicubic
apply_freq_mask: true
freq_mask_width_range:
- 0
- 30
num_freq_mask: 2
apply_time_mask: true
time_mask_width_range:
- 0
- 40
num_time_mask: 2
predictor: cif_predictor
predictor_conf:
idim: 256
threshold: 1.0
l_order: 1
r_order: 1
tail_threshold: 0.45
log_interval: 50
normalize: None

View File

@ -0,0 +1,94 @@
# network architecture
# encoder related
encoder: conformer
encoder_conf:
output_size: 320 # dimension of attention
attention_heads: 4
linear_units: 1280 # the number of units of position-wise feed forward
num_blocks: 20 # the number of encoder blocks
dropout_rate: 0.1
positional_dropout_rate: 0.1
attention_dropout_rate: 0.0
input_layer: conv2d # encoder architecture type
normalize_before: true
pos_enc_layer_type: rel_pos
selfattention_layer_type: rel_selfattn
activation_type: swish
macaron_style: true
use_cnn_module: true
cnn_module_kernel: 15
# decoder related
decoder: paraformer_decoder_san
decoder_conf:
attention_heads: 4
linear_units: 2048
num_blocks: 6
dropout_rate: 0.1
positional_dropout_rate: 0.1
self_attention_dropout_rate: 0.0
src_attention_dropout_rate: 0.0
# hybrid CTC/attention
model: paraformer
model_conf:
ctc_weight: 0.3
lsm_weight: 0.1 # label smoothing option
length_normalized_loss: false
predictor_weight: 1.0
sampling_ratio: 0.4
# minibatch related
#batch_type: length
#batch_bins: 40000
batch_type: numel
batch_bins: 2000000
num_workers: 16
# optimization related
accum_grad: 4
grad_clip: 5
max_epoch: 50
val_scheduler_criterion:
- valid
- acc
best_model_criterion:
- - valid
- acc
- max
keep_nbest_models: 10
optim: adam
optim_conf:
lr: 0.0005
scheduler: warmuplr
scheduler_conf:
warmup_steps: 30000
specaug: specaug
specaug_conf:
apply_time_warp: true
time_warp_window: 5
time_warp_mode: bicubic
apply_freq_mask: true
freq_mask_width_range:
- 0
- 30
num_freq_mask: 2
apply_time_mask: true
time_mask_width_range:
- 0
- 40
num_time_mask: 2
predictor: cif_predictor
predictor_conf:
idim: 256
threshold: 1.0
l_order: 1
r_order: 1
tail_threshold: 0.45
log_interval: 50
normalize: None

View File

@ -0,0 +1,100 @@
# network architecture
# encoder related
encoder: conformer
encoder_conf:
output_size: 256 # dimension of attention
attention_heads: 4
linear_units: 2048 # the number of units of position-wise feed forward
num_blocks: 12 # the number of encoder blocks
dropout_rate: 0.1
positional_dropout_rate: 0.1
attention_dropout_rate: 0.0
input_layer: conv2d # encoder architecture type
normalize_before: true
pos_enc_layer_type: rel_pos
selfattention_layer_type: rel_selfattn
activation_type: swish
macaron_style: true
use_cnn_module: true
cnn_module_kernel: 15
# decoder related
decoder: paraformer_decoder_san
decoder_conf:
attention_heads: 4
linear_units: 2048
num_blocks: 6
dropout_rate: 0.1
positional_dropout_rate: 0.1
self_attention_dropout_rate: 0.0
src_attention_dropout_rate: 0.0
# hybrid CTC/attention
model: paraformer_bert
model_conf:
ctc_weight: 0.3
lsm_weight: 0.1 # label smoothing option
length_normalized_loss: false
predictor_weight: 1.0
sampling_ratio: 0.4
embeds_id: 3
embed_dims: 768
embeds_loss_weight: 2.0
# minibatch related
#batch_type: length
#batch_bins: 40000
batch_type: numel
batch_bins: 2000000
num_workers: 16
# optimization related
accum_grad: 4
grad_clip: 5
max_epoch: 50
val_scheduler_criterion:
- valid
- acc
best_model_criterion:
- - valid
- acc
- max
keep_nbest_models: 10
optim: adam
optim_conf:
lr: 0.0005
scheduler: warmuplr
scheduler_conf:
warmup_steps: 30000
specaug: specaug
specaug_conf:
apply_time_warp: true
time_warp_window: 5
time_warp_mode: bicubic
apply_freq_mask: true
freq_mask_width_range:
- 0
- 30
num_freq_mask: 2
apply_time_mask: true
time_mask_width_range:
- 0
- 40
num_time_mask: 2
predictor: cif_predictor
predictor_conf:
idim: 256
threshold: 1.0
l_order: 1
r_order: 1
tail_threshold: 0.45
log_interval: 50
normalize: None
allow_variable_data_keys: true

View File

@ -0,0 +1,65 @@
#!/bin/bash
# Copyright 2017 Xingyu Na
# Apache 2.0
#. ./path.sh || exit 1;
if [ $# != 2 ]; then
echo "Usage: $0 <audio-path> <text-path>"
echo " $0 /export/a05/xna/data/data_aishell/wav /export/a05/xna/data/data_aishell/transcript"
exit 1;
fi
aishell_audio_dir=$1
aishell_text=$2/aishell_transcript_v0.8.txt
train_dir=data/local/train
dev_dir=data/local/dev
test_dir=data/local/test
tmp_dir=data/local/tmp
mkdir -p $train_dir
mkdir -p $dev_dir
mkdir -p $test_dir
mkdir -p $tmp_dir
# data directory check
if [ ! -d $aishell_audio_dir ] || [ ! -f $aishell_text ]; then
echo "Error: $0 requires two directory arguments"
exit 1;
fi
# find wav audio file for train, dev and test resp.
find $aishell_audio_dir -iname "*.wav" > $tmp_dir/wav.flist
n=`cat $tmp_dir/wav.flist | wc -l`
[ $n -ne 141925 ] && \
echo Warning: expected 141925 data data files, found $n
grep -i "wav/train" $tmp_dir/wav.flist > $train_dir/wav.flist || exit 1;
grep -i "wav/dev" $tmp_dir/wav.flist > $dev_dir/wav.flist || exit 1;
grep -i "wav/test" $tmp_dir/wav.flist > $test_dir/wav.flist || exit 1;
rm -r $tmp_dir
# Transcriptions preparation
for dir in $train_dir $dev_dir $test_dir; do
echo Preparing $dir transcriptions
sed -e 's/\.wav//' $dir/wav.flist | awk -F '/' '{print $NF}' > $dir/utt.list
paste -d' ' $dir/utt.list $dir/wav.flist > $dir/wav.scp_all
utils/filter_scp.pl -f 1 $dir/utt.list $aishell_text > $dir/transcripts.txt
awk '{print $1}' $dir/transcripts.txt > $dir/utt.list
utils/filter_scp.pl -f 1 $dir/utt.list $dir/wav.scp_all | sort -u > $dir/wav.scp
sort -u $dir/transcripts.txt > $dir/text
done
mkdir -p data/train data/dev data/test
for f in wav.scp text; do
cp $train_dir/$f data/train/$f || exit 1;
cp $dev_dir/$f data/dev/$f || exit 1;
cp $test_dir/$f data/test/$f || exit 1;
done
echo "$0: AISHELL data preparation succeeded"
exit 0;

View File

@ -0,0 +1,67 @@
#!/usr/bin/env bash
stage=1
stop_stage=3
bert_model_root="../../huggingface_models"
bert_model_name="bert-base-chinese"
#bert_model_name="chinese-roberta-wwm-ext"
#bert_model_name="mengzi-bert-base"
raw_dataset_path=~/Funasr_data/aishell-1
model_path=${bert_model_root}/${bert_model_name}
. utils/parse_options.sh || exit 1;
nj=32
for data_set in train dev test;do
scp=$raw_dataset_path/dump/fbank/${data_set}/text
local_scp_dir_raw=$raw_dataset_path/embeds/$bert_model_name/${data_set}
local_scp_dir=$local_scp_dir_raw/split$nj
local_records_dir=$local_scp_dir_raw/ark
mkdir -p $local_records_dir
mkdir -p $local_scp_dir
split_scps=""
for JOB in $(seq ${nj}); do
split_scps="$split_scps $local_scp_dir/data.$JOB.text"
done
utils/split_scp.pl $scp ${split_scps}
for num in {0..7};do
tmp=`expr $num \* 4`
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
for idx in {1..4}; do
JOB=`expr $tmp + $idx`
echo "proces jobid=$JOB"
{
beg=0
gpu=`expr $beg + $idx`
echo ${local_scp_dir}/log.${JOB}
python utils/extract_embeds.py $local_scp_dir/data.$JOB.text ${local_records_dir}/embeds.${JOB}.ark ${local_records_dir}/embeds.${JOB}.scp ${local_records_dir}/embeds.${JOB}.shape ${gpu} ${model_path} &> ${local_scp_dir}/log.${JOB}
} &
done
wait
fi
done
if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
for JOB in $(seq ${nj}); do
cat ${local_records_dir}/embeds.${JOB}.scp || exit 1;
done > ${local_scp_dir_raw}/embeds.scp
sed 's#nfs#data\/volume1#g' ${local_scp_dir_raw}/embeds.scp > ${local_scp_dir_raw}/embeds.scp.pai
for JOB in $(seq ${nj}); do
cat ${local_records_dir}/embeds.${JOB}.shape || exit 1;
done > ${local_scp_dir_raw}/embeds.shape
fi
done
echo "embeds is in: ${local_scp_dir_raw}"
echo "success"

View File

@ -0,0 +1,5 @@
export FUNASR_DIR=$PWD/../../..
# NOTE(kan-bayashi): Use UTF-8 in Python to avoid UnicodeDecodeError when LC_ALL=C
export PYTHONIOENCODING=UTF-8
export PATH=$FUNASR_DIR/funasr/bin:$PATH

226
egs/aishell/paraformer2/run.sh Executable file
View File

@ -0,0 +1,226 @@
#!/usr/bin/env bash
. ./path.sh || exit 1;
# machines configuration
CUDA_VISIBLE_DEVICES="0,1"
gpu_num=2
count=1
gpu_inference=true # Whether to perform gpu decoding, set false for cpu decoding
# for gpu decoding, inference_nj=ngpu*njob; for cpu decoding, inference_nj=njob
njob=8
train_cmd=utils/run.pl
# general configuration
feats_dir=".." #feature output dictionary, for large data
lang=zh
dumpdir=dump/fbank
feats_type=fbank
token_type=char
scp=feats.scp
type=kaldi_ark
stage=0
stop_stage=4
skip_extract_embed=false
bert_model_root="../../huggingface_models"
bert_model_name="bert-base-chinese"
# feature configuration
feats_dim=80
sample_frequency=16000
nj=32
speed_perturb="0.9,1.0,1.1"
# data
data_aishell=
# exp tag
tag=""
. utils/parse_options.sh || exit 1;
# Set bash to 'debug' mode, it will exit on :
# -e 'error', -u 'undefined variable', -o ... 'error in pipeline', -x 'print commands',
set -e
set -u
set -o pipefail
train_set=train
valid_set=dev
test_sets="dev test"
asr_config=conf/train_asr_paraformerbert_conformer_12e_6d_2048_256.yaml
run_dir="exp"
model_dir="baseline_$(basename "${asr_config}" .yaml)_${feats_type}_${lang}_${token_type}_${tag}"
exp_dir=$run_dir/$model_dir
inference_config=conf/decode_asr_transformer.yaml
inference_asr_model=valid.acc.ave_10best.pth
# you can set gpu num for decoding here
gpuid_list=$CUDA_VISIBLE_DEVICES # set gpus for decoding, the same as training stage by default
ngpu=$(echo $gpuid_list | awk -F "," '{print NF}')
if ${gpu_inference}; then
inference_nj=$[${ngpu}*${njob}]
else
inference_nj=$njob
fi
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
echo "stage 0: Data preparation"
# Data preparation
local/aishell_data_prep.sh ${data_aishell}/data_aishell/wav ${data_aishell}/data_aishell/transcript
for x in train dev test; do
cp data/${x}/text data/${x}/text.org
paste -d " " <(cut -f 1 -d" " data/${x}/text.org) <(cut -f 2- -d" " data/${x}/text.org | tr -d " ") \
> data/${x}/text
utils/text2token.py -n 1 -s 1 data/${x}/text > data/${x}/text.org
mv data/${x}/text.org data/${x}/text
done
fi
feat_train_dir=${feats_dir}/${dumpdir}/train; mkdir -p ${feat_train_dir}
feat_dev_dir=${feats_dir}/${dumpdir}/dev; mkdir -p ${feat_dev_dir}
feat_test_dir=${feats_dir}/${dumpdir}/test; mkdir -p ${feat_test_dir}
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
echo "stage 1: Feature Generation"
# compute fbank features
fbankdir=${feats_dir}/fbank
utils/compute_fbank.sh --cmd "$train_cmd" --nj $nj --speed_perturb ${speed_perturb} \
data/train exp/make_fbank/train ${fbankdir}/train
utils/fix_data_feat.sh ${fbankdir}/train
utils/compute_fbank.sh --cmd "$train_cmd" --nj $nj \
data/dev exp/make_fbank/dev ${fbankdir}/dev
utils/fix_data_feat.sh ${fbankdir}/dev
utils/compute_fbank.sh --cmd "$train_cmd" --nj $nj \
data/test exp/make_fbank/test ${fbankdir}/test
utils/fix_data_feat.sh ${fbankdir}/test
# compute global cmvn
utils/compute_cmvn.sh --cmd "$train_cmd" --nj $nj \
${fbankdir}/train exp/make_fbank/train
# apply cmvn
utils/apply_cmvn.sh --cmd "$train_cmd" --nj $nj \
${fbankdir}/train ${fbankdir}/train/cmvn.json exp/make_fbank/train ${feat_train_dir}
utils/apply_cmvn.sh --cmd "$train_cmd" --nj $nj \
${fbankdir}/dev ${fbankdir}/train/cmvn.json exp/make_fbank/dev ${feat_dev_dir}
utils/apply_cmvn.sh --cmd "$train_cmd" --nj $nj \
${fbankdir}/test ${fbankdir}/train/cmvn.json exp/make_fbank/test ${feat_test_dir}
cp ${fbankdir}/train/text ${fbankdir}/train/speech_shape ${fbankdir}/train/text_shape ${feat_train_dir}
cp ${fbankdir}/dev/text ${fbankdir}/dev/speech_shape ${fbankdir}/dev/text_shape ${feat_dev_dir}
cp ${fbankdir}/test/text ${fbankdir}/test/speech_shape ${fbankdir}/test/text_shape ${feat_test_dir}
utils/fix_data_feat.sh ${feat_train_dir}
utils/fix_data_feat.sh ${feat_dev_dir}
utils/fix_data_feat.sh ${feat_test_dir}
fi
token_list=${feats_dir}/data/${lang}_token_list/char/tokens.txt
echo "dictionary: ${token_list}"
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
echo "stage 2: Dictionary Preparation"
mkdir -p data/${lang}_token_list/char/
echo "make a dictionary"
echo "<blank>" > ${token_list}
echo "<s>" >> ${token_list}
echo "</s>" >> ${token_list}
utils/text2token.py -s 1 -n 1 --space "" data/train/text | cut -f 2- -d" " | tr " " "\n" \
| sort | uniq | grep -a -v -e '^\s*$' | awk '{print $0}' >> ${token_list}
num_token=$(cat ${token_list} | wc -l)
echo "<unk>" >> ${token_list}
vocab_size=$(cat ${token_list} | wc -l)
awk -v v=,${vocab_size} '{print $0v}' ${feat_train_dir}/text_shape > ${feat_train_dir}/text_shape.char
awk -v v=,${vocab_size} '{print $0v}' ${feat_dev_dir}/text_shape > ${feat_dev_dir}/text_shape.char
mkdir -p asr_stats_fbank_zh_char/train
mkdir -p asr_stats_fbank_zh_char/dev
cp ${feat_train_dir}/speech_shape ${feat_train_dir}/text_shape ${feat_train_dir}/text_shape.char asr_stats_fbank_zh_char/train
cp ${feat_dev_dir}/speech_shape ${feat_dev_dir}/text_shape ${feat_dev_dir}/text_shape.char asr_stats_fbank_zh_char/dev
fi
if ! "${skip_extract_embed}"; then
local/extract_embeds.sh \
--bert_model_root ${bert_model_root} \
--bert_model_name ${bert_model_name} \
--raw_dataset_path ${feats_dir}
fi
# Training Stage
world_size=$gpu_num # run on one machine
if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
mkdir -p $exp_dir
mkdir -p $exp_dir/log
INIT_FILE=$exp_dir/ddp_init
if [ -f $INIT_FILE ];then
rm -f $INIT_FILE
fi
init_method=file://$(readlink -f $INIT_FILE)
echo "$0: init method is $init_method"
for ((i = 0; i < $gpu_num; ++i)); do
{
rank=$i
local_rank=$i
gpu_id=$(echo $CUDA_VISIBLE_DEVICES | cut -d',' -f$[$i+1])
asr_train_paraformer.py \
--gpu_id $gpu_id \
--use_preprocessor true \
--token_type char \
--token_list $token_list \
--train_data_path_and_name_and_type ${feats_dir}/${dumpdir}/${train_set}/${scp},speech,${type} \
--train_data_path_and_name_and_type ${feats_dir}/${dumpdir}/${train_set}/text,text,text \
--train_data_path_and_name_and_type ${feats_dir}/embeds/${bert_model_name}/${train_set}/embeds.scp,embed,${type} \
--train_shape_file ${feats_dir}/asr_stats_fbank_zh_char/${train_set}/speech_shape \
--train_shape_file ${feats_dir}/asr_stats_fbank_zh_char/${train_set}/text_shape.char \
--train_shape_file ${feats_dir}/embeds/${bert_model_name}/${train_set}/embeds.shape \
--valid_data_path_and_name_and_type ${feats_dir}/${dumpdir}/${valid_set}/${scp},speech,${type} \
--valid_data_path_and_name_and_type ${feats_dir}/${dumpdir}/${valid_set}/text,text,text \
--valid_data_path_and_name_and_type ${feats_dir}/embeds/${bert_model_name}/${valid_set}/embeds.scp,embed,${type} \
--valid_shape_file ${feats_dir}/asr_stats_fbank_zh_char/${valid_set}/speech_shape \
--valid_shape_file ${feats_dir}/asr_stats_fbank_zh_char/${valid_set}/text_shape.char \
--valid_shape_file ${feats_dir}/embeds/${bert_model_name}/${valid_set}/embeds.shape \
--resume true \
--output_dir $exp_dir \
--config $asr_config \
--input_size $feats_dim \
--ngpu $gpu_num \
--num_worker_count $count \
--multiprocessing_distributed true \
--dist_init_method $init_method \
--dist_world_size $world_size \
--dist_rank $rank \
--allow_variable_data_keys true \
--local_rank $local_rank 1> $exp_dir/log/train.log.$i 2>&1
} &
done
wait
fi
# Testing Stage
if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
utils/easy_asr_infer.sh \
--lang zh \
--datadir ${feats_dir} \
--feats_type ${feats_type} \
--feats_dim ${feats_dim} \
--token_type ${token_type} \
--gpu_inference ${gpu_inference} \
--inference_config "${inference_config}" \
--test_sets "${test_sets}" \
--token_list $token_list \
--asr_exp $exp_dir \
--stage 12 \
--stop_stage 12 \
--scp $scp \
--text text \
--inference_nj $inference_nj \
--njob $njob \
--inference_asr_model $inference_asr_model \
--gpuid_list $gpuid_list \
--gpu_inference ${gpu_inference} \
--mode paraformer
fi

View File

@ -0,0 +1 @@
../tranformer/utils

View File

@ -0,0 +1,6 @@
beam_size: 10
penalty: 0.0
maxlenratio: 0.0
minlenratio: 0.0
ctc_weight: 0.5
lm_weight: 0.7

View File

@ -0,0 +1,80 @@
# network architecture
# encoder related
encoder: conformer
encoder_conf:
output_size: 256 # dimension of attention
attention_heads: 4
linear_units: 2048 # the number of units of position-wise feed forward
num_blocks: 12 # the number of encoder blocks
dropout_rate: 0.1
positional_dropout_rate: 0.1
attention_dropout_rate: 0.0
input_layer: conv2d # encoder architecture type
normalize_before: true
pos_enc_layer_type: rel_pos
selfattention_layer_type: rel_selfattn
activation_type: swish
macaron_style: true
use_cnn_module: true
cnn_module_kernel: 15
# decoder related
decoder: transformer
decoder_conf:
attention_heads: 4
linear_units: 2048
num_blocks: 6
dropout_rate: 0.1
positional_dropout_rate: 0.1
self_attention_dropout_rate: 0.0
src_attention_dropout_rate: 0.0
# hybrid CTC/attention
model_conf:
ctc_weight: 0.3
lsm_weight: 0.1 # label smoothing option
length_normalized_loss: false
# minibatch related
batch_type: length
batch_bins: 25000
num_workers: 16
# optimization related
accum_grad: 1
grad_clip: 5
max_epoch: 50
val_scheduler_criterion:
- valid
- acc
best_model_criterion:
- - valid
- acc
- max
keep_nbest_models: 10
optim: adam
optim_conf:
lr: 0.0005
scheduler: warmuplr
scheduler_conf:
warmup_steps: 30000
specaug: specaug
specaug_conf:
apply_time_warp: true
time_warp_window: 5
time_warp_mode: bicubic
apply_freq_mask: true
freq_mask_width_range:
- 0
- 30
num_freq_mask: 2
apply_time_mask: true
time_mask_width_range:
- 0
- 40
num_time_mask: 2
log_interval: 50
normalize: None

View File

@ -0,0 +1,70 @@
# network architecture
# encoder related
encoder: transformer
encoder_conf:
output_size: 256 # dimension of attention
attention_heads: 4
linear_units: 2048 # the number of units of position-wise feed forward
num_blocks: 12 # the number of encoder blocks
dropout_rate: 0.1
positional_dropout_rate: 0.1
attention_dropout_rate: 0.0
input_layer: conv2d # encoder architecture type
normalize_before: true
# decoder related
decoder: transformer
decoder_conf:
attention_heads: 4
linear_units: 2048
num_blocks: 6
dropout_rate: 0.1
positional_dropout_rate: 0.1
self_attention_dropout_rate: 0.0
src_attention_dropout_rate: 0.0
# hybrid CTC/attention
model_conf:
ctc_weight: 0.3
lsm_weight: 0.1 # label smoothing option
length_normalized_loss: false
# minibatch related
batch_type: length
batch_bins: 32000
num_workers: 8
# optimization related
accum_grad: 1
grad_clip: 5
patience: 3
max_epoch: 20
val_scheduler_criterion:
- valid
- acc
best_model_criterion:
- - valid
- acc
- max
keep_nbest_models: 10
# NoamLR is deprecated. Use WarmupLR.
# The following is equivalent setting for NoamLR:
#
# optim: adam
# optim_conf:
# lr: 10.
# scheduler: noamlr
# scheduler_conf:
# model_size: 256
# warmup_steps: 25000
#
optim: adam
optim_conf:
lr: 0.002
scheduler: warmuplr # pytorch v1.1.0+ required
scheduler_conf:
warmup_steps: 25000
log_interval: 50
normalize: None

View File

@ -0,0 +1,66 @@
#!/bin/bash
# Copyright 2017 Xingyu Na
# Apache 2.0
#. ./path.sh || exit 1;
if [ $# != 3 ]; then
echo "Usage: $0 <audio-path> <text-path> <output-path>"
echo " $0 /export/a05/xna/data/data_aishell/wav /export/a05/xna/data/data_aishell/transcript data"
exit 1;
fi
aishell_audio_dir=$1
aishell_text=$2/aishell_transcript_v0.8.txt
output_dir=$3
train_dir=$output_dir/data/local/train
dev_dir=$output_dir/data/local/dev
test_dir=$output_dir/data/local/test
tmp_dir=$output_dir/data/local/tmp
mkdir -p $train_dir
mkdir -p $dev_dir
mkdir -p $test_dir
mkdir -p $tmp_dir
# data directory check
if [ ! -d $aishell_audio_dir ] || [ ! -f $aishell_text ]; then
echo "Error: $0 requires two directory arguments"
exit 1;
fi
# find wav audio file for train, dev and test resp.
find $aishell_audio_dir -iname "*.wav" > $tmp_dir/wav.flist
n=`cat $tmp_dir/wav.flist | wc -l`
[ $n -ne 141925 ] && \
echo Warning: expected 141925 data data files, found $n
grep -i "wav/train" $tmp_dir/wav.flist > $train_dir/wav.flist || exit 1;
grep -i "wav/dev" $tmp_dir/wav.flist > $dev_dir/wav.flist || exit 1;
grep -i "wav/test" $tmp_dir/wav.flist > $test_dir/wav.flist || exit 1;
rm -r $tmp_dir
# Transcriptions preparation
for dir in $train_dir $dev_dir $test_dir; do
echo Preparing $dir transcriptions
sed -e 's/\.wav//' $dir/wav.flist | awk -F '/' '{print $NF}' > $dir/utt.list
paste -d' ' $dir/utt.list $dir/wav.flist > $dir/wav.scp_all
utils/filter_scp.pl -f 1 $dir/utt.list $aishell_text > $dir/transcripts.txt
awk '{print $1}' $dir/transcripts.txt > $dir/utt.list
utils/filter_scp.pl -f 1 $dir/utt.list $dir/wav.scp_all | sort -u > $dir/wav.scp
sort -u $dir/transcripts.txt > $dir/text
done
mkdir -p $output_dir/data/train $output_dir/data/dev $output_dir/data/test
for f in wav.scp text; do
cp $train_dir/$f $output_dir/data/train/$f || exit 1;
cp $dev_dir/$f $output_dir/data/dev/$f || exit 1;
cp $test_dir/$f $output_dir/data/test/$f || exit 1;
done
echo "$0: AISHELL data preparation succeeded"
exit 0;

View File

@ -0,0 +1,53 @@
#!/usr/bin/env bash
# Copyright 2018 AIShell-Foundation(Authors:Jiayu DU, Xingyu NA, Bengu WU, Hao ZHENG)
# 2018 Beijing Shell Shell Tech. Co. Ltd. (Author: Hui BU)
# Apache 2.0
# transform raw AISHELL-2 data to kaldi format
. ./path.sh || exit 1;
tmp=
dir=
if [ $# != 3 ]; then
echo "Usage: $0 <corpus-data-dir> <tmp-dir> <output-dir>"
echo " $0 /export/AISHELL-2/iOS/train data/local/train data/train"
exit 1;
fi
corpus=$1
tmp=$2
dir=$3
echo "prepare_data.sh: Preparing data in $corpus"
mkdir -p $tmp
mkdir -p $dir
# corpus check
if [ ! -d $corpus ] || [ ! -f $corpus/wav.scp ] || [ ! -f $corpus/trans.txt ]; then
echo "Error: $0 requires wav.scp and trans.txt under $corpus directory."
exit 1;
fi
# validate utt-key list, IC0803W0380 is a bad utterance
awk '{print $1}' $corpus/wav.scp | grep -v 'IC0803W0380' > $tmp/wav_utt.list
awk '{print $1}' $corpus/trans.txt > $tmp/trans_utt.list
utils/filter_scp.pl -f 1 $tmp/wav_utt.list $tmp/trans_utt.list > $tmp/utt.list
# wav.scp
awk -F'\t' -v path_prefix=$corpus '{printf("%s\t%s/%s\n",$1,path_prefix,$2)}' $corpus/wav.scp > $tmp/tmp_wav.scp
utils/filter_scp.pl -f 1 $tmp/utt.list $tmp/tmp_wav.scp | sort -k 1 | uniq > $tmp/wav.scp
# text
utils/filter_scp.pl -f 1 $tmp/utt.list $corpus/trans.txt | sort -k 1 | uniq > $tmp/text
# copy prepared resources from tmp_dir to target dir
mkdir -p $dir
for f in wav.scp text; do
cp $tmp/$f $dir/$f || exit 1;
done
echo "local/prepare_data.sh succeeded"
exit 0;

5
egs/aishell/tranformer/path.sh Executable file
View File

@ -0,0 +1,5 @@
export FUNASR_DIR=$PWD/../../..
# NOTE(kan-bayashi): Use UTF-8 in Python to avoid UnicodeDecodeError when LC_ALL=C
export PYTHONIOENCODING=UTF-8
export PATH=$FUNASR_DIR/funasr/bin:$PATH

208
egs/aishell/tranformer/run.sh Executable file
View File

@ -0,0 +1,208 @@
#!/usr/bin/env bash
. ./path.sh || exit 1;
# machines configuration
CUDA_VISIBLE_DEVICES="0,1"
gpu_num=2
count=1
gpu_inference=true # Whether to perform gpu decoding, set false for cpu decoding
# for gpu decoding, inference_nj=ngpu*njob; for cpu decoding, inference_nj=njob
njob=8
train_cmd=utils/run.pl
# general configuration
feats_dir=".." #feature output dictionary, for large data
exp_dir="."
lang=zh
dumpdir=dump/fbank
feats_type=fbank
token_type=char
scp=feats.scp
type=kaldi_ark
stage=0
stop_stage=4
# feature configuration
feats_dim=80
sample_frequency=16000
nj=32
speed_perturb="0.9,1.0,1.1"
# data
data_aishell=
# exp tag
tag=""
. utils/parse_options.sh || exit 1;
# Set bash to 'debug' mode, it will exit on :
# -e 'error', -u 'undefined variable', -o ... 'error in pipeline', -x 'print commands',
set -e
set -u
set -o pipefail
train_set=train
valid_set=dev
test_sets="dev test"
asr_config=conf/train_asr_conformer.yaml
model_dir="baseline_$(basename "${asr_config}" .yaml)_${feats_type}_${lang}_${token_type}_${tag}"
inference_config=conf/decode_asr_transformer.yaml
inference_asr_model=valid.acc.ave_10best.pth
# you can set gpu num for decoding here
gpuid_list=$CUDA_VISIBLE_DEVICES # set gpus for decoding, the same as training stage by default
ngpu=$(echo $gpuid_list | awk -F "," '{print NF}')
if ${gpu_inference}; then
inference_nj=$[${ngpu}*${njob}]
else
inference_nj=$njob
fi
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
echo "stage 0: Data preparation"
# Data preparation
local/aishell_data_prep.sh ${data_aishell}/data_aishell/wav ${data_aishell}/data_aishell/transcript ${feats_dir}
for x in train dev test; do
cp ${feats_dir}/data/${x}/text ${feats_dir}/data/${x}/text.org
paste -d " " <(cut -f 1 -d" " ${feats_dir}/data/${x}/text.org) <(cut -f 2- -d" " ${feats_dir}/data/${x}/text.org | tr -d " ") \
> ${feats_dir}/data/${x}/text
utils/text2token.py -n 1 -s 1 ${feats_dir}/data/${x}/text > ${feats_dir}/data/${x}/text.org
mv ${feats_dir}/data/${x}/text.org ${feats_dir}/data/${x}/text
done
fi
feat_train_dir=${feats_dir}/${dumpdir}/train; mkdir -p ${feat_train_dir}
feat_dev_dir=${feats_dir}/${dumpdir}/dev; mkdir -p ${feat_dev_dir}
feat_test_dir=${feats_dir}/${dumpdir}/test; mkdir -p ${feat_test_dir}
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
echo "stage 1: Feature Generation"
# compute fbank features
fbankdir=${feats_dir}/fbank
utils/compute_fbank.sh --cmd "$train_cmd" --nj $nj --speed_perturb ${speed_perturb} \
${feats_dir}/data/train ${exp_dir}/exp/make_fbank/train ${fbankdir}/train
utils/fix_data_feat.sh ${fbankdir}/train
utils/compute_fbank.sh --cmd "$train_cmd" --nj $nj \
${feats_dir}/data/dev ${exp_dir}/exp/make_fbank/dev ${fbankdir}/dev
utils/fix_data_feat.sh ${fbankdir}/dev
utils/compute_fbank.sh --cmd "$train_cmd" --nj $nj \
${feats_dir}/data/test ${exp_dir}/exp/make_fbank/test ${fbankdir}/test
utils/fix_data_feat.sh ${fbankdir}/test
# compute global cmvn
utils/compute_cmvn.sh --cmd "$train_cmd" --nj $nj \
${fbankdir}/train ${exp_dir}/exp/make_fbank/train
# apply cmvn
utils/apply_cmvn.sh --cmd "$train_cmd" --nj $nj \
${fbankdir}/train ${fbankdir}/train/cmvn.json ${exp_dir}/exp/make_fbank/train ${feat_train_dir}
utils/apply_cmvn.sh --cmd "$train_cmd" --nj $nj \
${fbankdir}/dev ${fbankdir}/train/cmvn.json ${exp_dir}/exp/make_fbank/dev ${feat_dev_dir}
utils/apply_cmvn.sh --cmd "$train_cmd" --nj $nj \
${fbankdir}/test ${fbankdir}/train/cmvn.json ${exp_dir}/exp/make_fbank/test ${feat_test_dir}
cp ${fbankdir}/train/text ${fbankdir}/train/speech_shape ${fbankdir}/train/text_shape ${feat_train_dir}
cp ${fbankdir}/dev/text ${fbankdir}/dev/speech_shape ${fbankdir}/dev/text_shape ${feat_dev_dir}
cp ${fbankdir}/test/text ${fbankdir}/test/speech_shape ${fbankdir}/test/text_shape ${feat_test_dir}
utils/fix_data_feat.sh ${feat_train_dir}
utils/fix_data_feat.sh ${feat_dev_dir}
utils/fix_data_feat.sh ${feat_test_dir}
fi
token_list=${feats_dir}/data/${lang}_token_list/char/tokens.txt
echo "dictionary: ${token_list}"
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
echo "stage 2: Dictionary Preparation"
mkdir -p ${feats_dir}/data/${lang}_token_list/char/
echo "make a dictionary"
echo "<blank>" > ${token_list}
echo "<s>" >> ${token_list}
echo "</s>" >> ${token_list}
utils/text2token.py -s 1 -n 1 --space "" ${feats_dir}/data/train/text | cut -f 2- -d" " | tr " " "\n" \
| sort | uniq | grep -a -v -e '^\s*$' | awk '{print $0}' >> ${token_list}
num_token=$(cat ${token_list} | wc -l)
echo "<unk>" >> ${token_list}
vocab_size=$(cat ${token_list} | wc -l)
awk -v v=,${vocab_size} '{print $0v}' ${feat_train_dir}/text_shape > ${feat_train_dir}/text_shape.char
awk -v v=,${vocab_size} '{print $0v}' ${feat_dev_dir}/text_shape > ${feat_dev_dir}/text_shape.char
mkdir -p ${feats_dir}/asr_stats_fbank_zh_char/train
mkdir -p ${feats_dir}/asr_stats_fbank_zh_char/dev
cp ${feat_train_dir}/speech_shape ${feat_train_dir}/text_shape ${feat_train_dir}/text_shape.char ${feats_dir}/asr_stats_fbank_zh_char/train
cp ${feat_dev_dir}/speech_shape ${feat_dev_dir}/text_shape ${feat_dev_dir}/text_shape.char ${feats_dir}/asr_stats_fbank_zh_char/dev
fi
# Training Stage
world_size=$gpu_num # run on one machine
if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
mkdir -p ${exp_dir}/exp/${model_dir}
mkdir -p ${exp_dir}/exp/${model_dir}/log
INIT_FILE=$exp_dir/ddp_init
if [ -f $INIT_FILE ];then
rm -f $INIT_FILE
fi
init_method=file://$(readlink -f $INIT_FILE)
echo "$0: init method is $init_method"
for ((i = 0; i < $gpu_num; ++i)); do
{
rank=$i
local_rank=$i
gpu_id=$(echo $CUDA_VISIBLE_DEVICES | cut -d',' -f$[$i+1])
asr_train.py \
--gpu_id $gpu_id \
--use_preprocessor true \
--token_type char \
--token_list $token_list \
--train_data_path_and_name_and_type ${feats_dir}/${dumpdir}/${train_set}/${scp},speech,${type} \
--train_data_path_and_name_and_type ${feats_dir}/${dumpdir}/${train_set}/text,text,text \
--train_shape_file ${feats_dir}/asr_stats_fbank_zh_char/${train_set}/speech_shape \
--train_shape_file ${feats_dir}/asr_stats_fbank_zh_char/${train_set}/text_shape.char \
--valid_data_path_and_name_and_type ${feats_dir}/${dumpdir}/${valid_set}/${scp},speech,${type} \
--valid_data_path_and_name_and_type ${feats_dir}/${dumpdir}/${valid_set}/text,text,text \
--valid_shape_file ${feats_dir}/asr_stats_fbank_zh_char/${valid_set}/speech_shape \
--valid_shape_file ${feats_dir}/asr_stats_fbank_zh_char/${valid_set}/text_shape.char \
--resume true \
--output_dir ${exp_dir}/exp/${model_dir} \
--config $asr_config \
--input_size $feats_dim \
--ngpu $gpu_num \
--num_worker_count $count \
--multiprocessing_distributed true \
--dist_init_method $init_method \
--dist_world_size $world_size \
--dist_rank $rank \
--local_rank $local_rank 1> ${exp_dir}/exp/${model_dir}/log/train.log.$i 2>&1
} &
done
wait
fi
# Testing Stage
if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
utils/easy_asr_infer.sh \
--lang zh \
--datadir ${feats_dir} \
--feats_type ${feats_type} \
--feats_dim ${feats_dim} \
--token_type ${token_type} \
--gpu_inference ${gpu_inference} \
--inference_config "${inference_config}" \
--test_sets "${test_sets}" \
--token_list $token_list \
--asr_exp ${exp_dir}/${model_dir} \
--stage 12 \
--stop_stage 12 \
--scp $scp \
--text text \
--inference_nj $inference_nj \
--njob $njob \
--inference_asr_model $inference_asr_model \
--gpuid_list $gpuid_list \
--mode asr
fi

View File

View File

@ -0,0 +1,79 @@
from kaldiio import ReadHelper
from kaldiio import WriteHelper
import argparse
import json
import math
import numpy as np
def get_parser():
parser = argparse.ArgumentParser(
description="apply cmvn",
formatter_class=argparse.ArgumentDefaultsHelpFormatter,
)
parser.add_argument(
"--ark-file",
"-a",
default=False,
required=True,
type=str,
help="fbank ark file",
)
parser.add_argument(
"--cmvn-file",
"-c",
default=False,
required=True,
type=str,
help="cmvn file",
)
parser.add_argument(
"--ark-index",
"-i",
default=1,
required=True,
type=int,
help="ark index",
)
parser.add_argument(
"--output-dir",
"-o",
default=False,
required=True,
type=str,
help="output dir",
)
return parser
def main():
parser = get_parser()
args = parser.parse_args()
ark_file = args.output_dir + "/feats." + str(args.ark_index) + ".ark"
scp_file = args.output_dir + "/feats." + str(args.ark_index) + ".scp"
ark_writer = WriteHelper('ark,scp:{},{}'.format(ark_file, scp_file))
with open(args.cmvn_file) as f:
cmvn_stats = json.load(f)
means = cmvn_stats['mean_stats']
vars = cmvn_stats['var_stats']
total_frames = cmvn_stats['total_frames']
for i in range(len(means)):
means[i] /= total_frames
vars[i] = vars[i] / total_frames - means[i] * means[i]
if vars[i] < 1.0e-20:
vars[i] = 1.0e-20
vars[i] = 1.0 / math.sqrt(vars[i])
with ReadHelper('ark:{}'.format(args.ark_file)) as ark_reader:
for key, mat in ark_reader:
mat = (mat - means) * vars
ark_writer(key, mat)
if __name__ == '__main__':
main()

View File

@ -0,0 +1,29 @@
#!/usr/bin/env bash
. ./path.sh || exit 1;
# Begin configuration section.
nj=32
cmd=./utils/run.pl
echo "$0 $@"
. utils/parse_options.sh || exit 1;
fbankdir=$1
cmvn_file=$2
logdir=$3
output_dir=$4
dump_dir=${output_dir}/ark; mkdir -p ${dump_dir}
mkdir -p ${logdir}
$cmd JOB=1:$nj $logdir/apply_cmvn.JOB.log \
python utils/apply_cmvn.py -a $fbankdir/ark/feats.JOB.ark \
-c $cmvn_file -i JOB -o ${dump_dir} \
|| exit 1;
for n in $(seq $nj); do
cat ${dump_dir}/feats.$n.scp || exit 1
done > ${output_dir}/feats.scp || exit 1
echo "$0: Succeeded apply cmvn"

View File

@ -0,0 +1,143 @@
from kaldiio import ReadHelper, WriteHelper
import argparse
import numpy as np
def build_LFR_features(inputs, m=7, n=6):
LFR_inputs = []
T = inputs.shape[0]
T_lfr = int(np.ceil(T / n))
left_padding = np.tile(inputs[0], ((m - 1) // 2, 1))
inputs = np.vstack((left_padding, inputs))
T = T + (m - 1) // 2
for i in range(T_lfr):
if m <= T - i * n:
LFR_inputs.append(np.hstack(inputs[i * n:i * n + m]))
else:
num_padding = m - (T - i * n)
frame = np.hstack(inputs[i * n:])
for _ in range(num_padding):
frame = np.hstack((frame, inputs[-1]))
LFR_inputs.append(frame)
return np.vstack(LFR_inputs)
def build_CMVN_features(inputs, mvn_file): # noqa
with open(mvn_file, 'r', encoding='utf-8') as f:
lines = f.readlines()
add_shift_list = []
rescale_list = []
for i in range(len(lines)):
line_item = lines[i].split()
if line_item[0] == '<AddShift>':
line_item = lines[i + 1].split()
if line_item[0] == '<LearnRateCoef>':
add_shift_line = line_item[3:(len(line_item) - 1)]
add_shift_list = list(add_shift_line)
continue
elif line_item[0] == '<Rescale>':
line_item = lines[i + 1].split()
if line_item[0] == '<LearnRateCoef>':
rescale_line = line_item[3:(len(line_item) - 1)]
rescale_list = list(rescale_line)
continue
for j in range(inputs.shape[0]):
for k in range(inputs.shape[1]):
add_shift_value = add_shift_list[k]
rescale_value = rescale_list[k]
inputs[j, k] = float(inputs[j, k]) + float(add_shift_value)
inputs[j, k] = float(inputs[j, k]) * float(rescale_value)
return inputs
def get_parser():
parser = argparse.ArgumentParser(
description="apply low_frame_rate and cmvn",
formatter_class=argparse.ArgumentDefaultsHelpFormatter,
)
parser.add_argument(
"--ark-file",
"-a",
default=False,
required=True,
type=str,
help="fbank ark file",
)
parser.add_argument(
"--lfr",
"-f",
default=True,
type=str,
help="low frame rate",
)
parser.add_argument(
"--lfr-m",
"-m",
default=7,
type=int,
help="number of frames to stack",
)
parser.add_argument(
"--lfr-n",
"-n",
default=6,
type=int,
help="number of frames to skip",
)
parser.add_argument(
"--cmvn-file",
"-c",
default=False,
required=True,
type=str,
help="global cmvn file",
)
parser.add_argument(
"--ark-index",
"-i",
default=1,
required=True,
type=int,
help="ark index",
)
parser.add_argument(
"--output-dir",
"-o",
default=False,
required=True,
type=str,
help="output dir",
)
return parser
def main():
parser = get_parser()
args = parser.parse_args()
dump_ark_file = args.output_dir + "/feats." + str(args.ark_index) + ".ark"
dump_scp_file = args.output_dir + "/feats." + str(args.ark_index) + ".scp"
shape_file = args.output_dir + "/len." + str(args.ark_index)
ark_writer = WriteHelper('ark,scp:{},{}'.format(dump_ark_file, dump_scp_file))
shape_writer = open(shape_file, 'w')
with ReadHelper('ark:{}'.format(args.ark_file)) as ark_reader:
for key, mat in ark_reader:
if args.lfr:
lfr = build_LFR_features(mat, args.lfr_m, args.lfr_n)
else:
lfr = mat
cmvn = build_CMVN_features(lfr, args.cmvn_file)
dims = cmvn.shape[1]
lens = cmvn.shape[0]
shape_writer.write(key + " " + str(lens) + "," + str(dims) + '\n')
ark_writer(key, cmvn)
if __name__ == '__main__':
main()

View File

@ -0,0 +1,38 @@
#!/usr/bin/env bash
# Begin configuration section.
nj=32
cmd=utils/run.pl
# feature configuration
lfr=True
lfr_m=7
lfr_n=6
echo "$0 $@"
. utils/parse_options.sh || exit 1;
fbankdir=$1
cmvn_file=$2
logdir=$3
output_dir=$4
dump_dir=${output_dir}/ark; mkdir -p ${dump_dir}
mkdir -p ${logdir}
$cmd JOB=1:$nj $logdir/apply_lfr_and_cmvn.JOB.log \
python utils/apply_lfr_and_cmvn.py -a $fbankdir/ark/feats.JOB.ark \
-f $lfr -m $lfr_m -n $lfr_n -c $cmvn_file -i JOB -o ${dump_dir} \
|| exit 1;
for n in $(seq $nj); do
cat ${dump_dir}/feats.$n.scp || exit 1
done > ${output_dir}/feats.scp || exit 1
for n in $(seq $nj); do
cat ${dump_dir}/len.$n || exit 1
done > ${output_dir}/speech_shape || exit 1
echo "$0: Succeeded apply low frame rate and cmvn"

View File

@ -0,0 +1,66 @@
import argparse
import json
import numpy as np
def get_parser():
parser = argparse.ArgumentParser(
description="combine cmvn file",
formatter_class=argparse.ArgumentDefaultsHelpFormatter,
)
parser.add_argument(
"--cmvn-dir",
"-c",
default=False,
required=True,
type=str,
help="cmvn dir",
)
parser.add_argument(
"--nj",
"-n",
default=1,
required=True,
type=int,
help="num of cmvn file",
)
parser.add_argument(
"--output-dir",
"-o",
default=False,
required=True,
type=str,
help="output dir",
)
return parser
def main():
parser = get_parser()
args = parser.parse_args()
total_means = 0.0
total_vars = 0.0
total_frames = 0
cmvn_file = args.output_dir + "/cmvn.json"
for i in range(1, args.nj+1):
with open(args.cmvn_dir + "/cmvn." + str(i) + ".json", "r") as fin:
cmvn_stats = json.load(fin)
total_means += np.array(cmvn_stats["mean_stats"])
total_vars += np.array(cmvn_stats["var_stats"])
total_frames += cmvn_stats["total_frames"]
cmvn_info = {
'mean_stats': list(total_means.tolist()),
'var_stats': list(total_vars.tolist()),
'total_frames': total_frames
}
with open(cmvn_file, 'w') as fout:
fout.write(json.dumps(cmvn_info))
if __name__ == '__main__':
main()

View File

@ -0,0 +1,67 @@
from kaldiio import ReadHelper
import argparse
import numpy as np
import json
def get_parser():
parser = argparse.ArgumentParser(
description="computer global cmvn",
formatter_class=argparse.ArgumentDefaultsHelpFormatter,
)
parser.add_argument(
"--ark-file",
"-a",
default=False,
required=True,
type=str,
help="fbank ark file",
)
parser.add_argument(
"--ark-index",
"-i",
default=1,
required=True,
type=int,
help="ark index",
)
parser.add_argument(
"--output-dir",
"-o",
default=False,
required=True,
type=str,
help="output dir",
)
return parser
def main():
parser = get_parser()
args = parser.parse_args()
ark_file = args.ark_file + "/feats." + str(args.ark_index) + ".ark"
cmvn_file = args.output_dir + "/cmvn." + str(args.ark_index) + ".json"
mean_stats = 0.0
var_stats = 0.0
total_frames = 0
with ReadHelper('ark:{}'.format(ark_file)) as ark_reader:
for key, mat in ark_reader:
mean_stats += np.sum(mat, axis=0)
var_stats += np.sum(np.square(mat), axis=0)
total_frames += mat.shape[0]
cmvn_info = {
'mean_stats': list(mean_stats.tolist()),
'var_stats': list(var_stats.tolist()),
'total_frames': total_frames
}
with open(cmvn_file, 'w') as fout:
fout.write(json.dumps(cmvn_info))
if __name__ == '__main__':
main()

View File

@ -0,0 +1,24 @@
#!/usr/bin/env bash
. ./path.sh || exit 1;
# Begin configuration section.
nj=32
cmd=./utils/run.pl
echo "$0 $@"
. utils/parse_options.sh || exit 1;
fbankdir=$1
logdir=$2
output_dir=${fbankdir}/cmvn; mkdir -p ${output_dir}
mkdir -p ${logdir}
$cmd JOB=1:$nj $logdir/cmvn.JOB.log \
python utils/compute_cmvn.py -a $fbankdir/ark -i JOB -o ${output_dir} \
|| exit 1;
python utils/combine_cmvn_file.py -c ${output_dir} -n $nj -o $fbankdir
echo "$0: Succeeded compute global cmvn"

View File

@ -0,0 +1,153 @@
from kaldiio import WriteHelper
import argparse
import numpy as np
import json
import torch
import torchaudio
import torchaudio.compliance.kaldi as kaldi
def compute_fbank(wav_file,
num_mel_bins=80,
frame_length=25,
frame_shift=10,
dither=0.0,
resample_rate=16000,
speed=1.0):
waveform, sample_rate = torchaudio.load(wav_file)
if resample_rate != sample_rate:
waveform = torchaudio.transforms.Resample(orig_freq=sample_rate,
new_freq=resample_rate)(waveform)
if speed != 1.0:
waveform, _ = torchaudio.sox_effects.apply_effects_tensor(
waveform, resample_rate,
[['speed', str(speed)], ['rate', str(resample_rate)]]
)
waveform = waveform * (1 << 15)
mat = kaldi.fbank(waveform,
num_mel_bins=num_mel_bins,
frame_length=frame_length,
frame_shift=frame_shift,
dither=dither,
energy_floor=0.0,
window_type='hamming',
sample_frequency=resample_rate)
return mat.numpy()
def get_parser():
parser = argparse.ArgumentParser(
description="computer features",
formatter_class=argparse.ArgumentDefaultsHelpFormatter,
)
parser.add_argument(
"--wav-lists",
"-w",
default=False,
required=True,
type=str,
help="input wav lists",
)
parser.add_argument(
"--text-files",
"-t",
default=False,
required=True,
type=str,
help="input text files",
)
parser.add_argument(
"--dims",
"-d",
default=80,
type=int,
help="feature dims",
)
parser.add_argument(
"--sample-frequency",
"-s",
default=16000,
type=int,
help="sample frequency",
)
parser.add_argument(
"--speed-perturb",
"-p",
default="1.0",
type=str,
help="speed perturb",
)
parser.add_argument(
"--ark-index",
"-a",
default=1,
required=True,
type=int,
help="ark index",
)
parser.add_argument(
"--output-dir",
"-o",
default=False,
required=True,
type=str,
help="output dir",
)
return parser
def main():
parser = get_parser()
args = parser.parse_args()
ark_file = args.output_dir + "/ark/feats." + str(args.ark_index) + ".ark"
scp_file = args.output_dir + "/ark/feats." + str(args.ark_index) + ".scp"
text_file = args.output_dir + "/txt/text." + str(args.ark_index) + ".txt"
feats_shape_file = args.output_dir + "/ark/len." + str(args.ark_index)
text_shape_file = args.output_dir + "/txt/len." + str(args.ark_index)
ark_writer = WriteHelper('ark,scp:{},{}'.format(ark_file, scp_file))
text_writer = open(text_file, 'w')
feats_shape_writer = open(feats_shape_file, 'w')
text_shape_writer = open(text_shape_file, 'w')
speed_perturb_list = args.speed_perturb.split(',')
for speed in speed_perturb_list:
with open(args.wav_lists, 'r', encoding='utf-8') as wavfile:
with open(args.text_files, 'r', encoding='utf-8') as textfile:
for wav, text in zip(wavfile, textfile):
s_w = wav.strip().split()
wav_id = s_w[0]
wav_file = s_w[1]
s_t = text.strip().split()
text_id = s_t[0]
txt = s_t[1:]
fbank = compute_fbank(wav_file,
num_mel_bins=args.dims,
resample_rate=args.sample_frequency,
speed=float(speed)
)
feats_dims = fbank.shape[1]
feats_lens = fbank.shape[0]
txt_lens = len(txt)
if speed == "1.0":
wav_id_sp = wav_id
else:
wav_id_sp = wav_id + "_sp" + speed
feats_shape_writer.write(wav_id_sp + " " + str(feats_lens) + "," + str(feats_dims) + '\n')
text_shape_writer.write(wav_id_sp + " " + str(txt_lens) + '\n')
text_writer.write(wav_id_sp + " " + " ".join(txt) + '\n')
ark_writer(wav_id_sp, fbank)
if __name__ == '__main__':
main()

View File

@ -0,0 +1,51 @@
#!/usr/bin/env bash
. ./path.sh || exit 1;
# Begin configuration section.
nj=32
cmd=./utils/run.pl
# feature configuration
feat_dims=80
sample_frequency=16000
speed_perturb="1.0"
echo "$0 $@"
. utils/parse_options.sh || exit 1;
data=$1
logdir=$2
fbankdir=$3
[ ! -f $data/wav.scp ] && echo "$0: no such file $data/wav.scp" && exit 1;
[ ! -f $data/text ] && echo "$0: no such file $data/text" && exit 1;
python utils/split_data.py $data $data $nj
ark_dir=${fbankdir}/ark; mkdir -p ${ark_dir}
text_dir=${fbankdir}/txt; mkdir -p ${text_dir}
mkdir -p ${logdir}
$cmd JOB=1:$nj $logdir/make_fbank.JOB.log \
python utils/compute_fbank.py -w $data/split${nj}/JOB/wav.scp -t $data/split${nj}/JOB/text \
-d $feat_dims -s $sample_frequency -p ${speed_perturb} -a JOB -o ${fbankdir} \
|| exit 1;
for n in $(seq $nj); do
cat ${ark_dir}/feats.$n.scp || exit 1
done > $fbankdir/feats.scp || exit 1
for n in $(seq $nj); do
cat ${text_dir}/text.$n.txt || exit 1
done > $fbankdir/text || exit 1
for n in $(seq $nj); do
cat ${ark_dir}/len.$n || exit 1
done > $fbankdir/speech_shape || exit 1
for n in $(seq $nj); do
cat ${text_dir}/len.$n || exit 1
done > $fbankdir/text_shape || exit 1
echo "$0: Succeeded compute FBANK features"

View File

@ -0,0 +1,157 @@
import os
import numpy as np
import sys
def compute_wer(ref_file,
hyp_file,
cer_detail_file):
rst = {
'Wrd': 0,
'Corr': 0,
'Ins': 0,
'Del': 0,
'Sub': 0,
'Snt': 0,
'Err': 0.0,
'S.Err': 0.0,
'wrong_words': 0,
'wrong_sentences': 0
}
hyp_dict = {}
ref_dict = {}
with open(hyp_file, 'r') as hyp_reader:
for line in hyp_reader:
key = line.strip().split()[0]
value = line.strip().split()[1:]
hyp_dict[key] = value
with open(ref_file, 'r') as ref_reader:
for line in ref_reader:
key = line.strip().split()[0]
value = line.strip().split()[1:]
ref_dict[key] = value
cer_detail_writer = open(cer_detail_file, 'w')
for hyp_key in hyp_dict:
if hyp_key in ref_dict:
out_item = compute_wer_by_line(hyp_dict[hyp_key], ref_dict[hyp_key])
rst['Wrd'] += out_item['nwords']
rst['Corr'] += out_item['cor']
rst['wrong_words'] += out_item['wrong']
rst['Ins'] += out_item['ins']
rst['Del'] += out_item['del']
rst['Sub'] += out_item['sub']
rst['Snt'] += 1
if out_item['wrong'] > 0:
rst['wrong_sentences'] += 1
cer_detail_writer.write(hyp_key + print_cer_detail(out_item) + '\n')
cer_detail_writer.write("ref:" + '\t' + "".join(ref_dict[hyp_key]) + '\n')
cer_detail_writer.write("hyp:" + '\t' + "".join(hyp_dict[hyp_key]) + '\n')
if rst['Wrd'] > 0:
rst['Err'] = round(rst['wrong_words'] * 100 / rst['Wrd'], 2)
if rst['Snt'] > 0:
rst['S.Err'] = round(rst['wrong_sentences'] * 100 / rst['Snt'], 2)
cer_detail_writer.write('\n')
cer_detail_writer.write("%WER " + str(rst['Err']) + " [ " + str(rst['wrong_words'])+ " / " + str(rst['Wrd']) +
", " + str(rst['Ins']) + " ins, " + str(rst['Del']) + " del, " + str(rst['Sub']) + " sub ]" + '\n')
cer_detail_writer.write("%SER " + str(rst['S.Err']) + " [ " + str(rst['wrong_sentences']) + " / " + str(rst['Snt']) + " ]" + '\n')
cer_detail_writer.write("Scored " + str(len(hyp_dict)) + " sentences, " + str(len(hyp_dict) - rst['Snt']) + " not present in hyp." + '\n')
def compute_wer_by_line(hyp,
ref):
hyp = list(map(lambda x: x.lower(), hyp))
ref = list(map(lambda x: x.lower(), ref))
len_hyp = len(hyp)
len_ref = len(ref)
cost_matrix = np.zeros((len_hyp + 1, len_ref + 1), dtype=np.int16)
ops_matrix = np.zeros((len_hyp + 1, len_ref + 1), dtype=np.int8)
for i in range(len_hyp + 1):
cost_matrix[i][0] = i
for j in range(len_ref + 1):
cost_matrix[0][j] = j
for i in range(1, len_hyp + 1):
for j in range(1, len_ref + 1):
if hyp[i - 1] == ref[j - 1]:
cost_matrix[i][j] = cost_matrix[i - 1][j - 1]
else:
substitution = cost_matrix[i - 1][j - 1] + 1
insertion = cost_matrix[i - 1][j] + 1
deletion = cost_matrix[i][j - 1] + 1
compare_val = [substitution, insertion, deletion]
min_val = min(compare_val)
operation_idx = compare_val.index(min_val) + 1
cost_matrix[i][j] = min_val
ops_matrix[i][j] = operation_idx
match_idx = []
i = len_hyp
j = len_ref
rst = {
'nwords': len_ref,
'cor': 0,
'wrong': 0,
'ins': 0,
'del': 0,
'sub': 0
}
while i >= 0 or j >= 0:
i_idx = max(0, i)
j_idx = max(0, j)
if ops_matrix[i_idx][j_idx] == 0: # correct
if i - 1 >= 0 and j - 1 >= 0:
match_idx.append((j - 1, i - 1))
rst['cor'] += 1
i -= 1
j -= 1
elif ops_matrix[i_idx][j_idx] == 2: # insert
i -= 1
rst['ins'] += 1
elif ops_matrix[i_idx][j_idx] == 3: # delete
j -= 1
rst['del'] += 1
elif ops_matrix[i_idx][j_idx] == 1: # substitute
i -= 1
j -= 1
rst['sub'] += 1
if i < 0 and j >= 0:
rst['del'] += 1
elif j < 0 and i >= 0:
rst['ins'] += 1
match_idx.reverse()
wrong_cnt = cost_matrix[len_hyp][len_ref]
rst['wrong'] = wrong_cnt
return rst
def print_cer_detail(rst):
return ("(" + "nwords=" + str(rst['nwords']) + ",cor=" + str(rst['cor'])
+ ",ins=" + str(rst['ins']) + ",del=" + str(rst['del']) + ",sub="
+ str(rst['sub']) + ") corr:" + '{:.2%}'.format(rst['cor']/rst['nwords'])
+ ",cer:" + '{:.2%}'.format(rst['wrong']/rst['nwords']))
if __name__ == '__main__':
if len(sys.argv) != 4:
print("usage : python compute-wer.py test.ref test.hyp test.wer")
sys.exit(0)
ref_file = sys.argv[1]
hyp_file = sys.argv[2]
cer_detail_file = sys.argv[3]
compute_wer(ref_file, hyp_file, cer_detail_file)

View File

@ -0,0 +1,407 @@
#!/usr/bin/env bash
# Set bash to 'debug' mode, it will exit on :
# -e 'error', -u 'undefined variable', -o ... 'error in pipeline', -x 'print commands',
set -e
set -u
set -o pipefail
log() {
local fname=${BASH_SOURCE[1]##*/}
echo -e "$(date '+%Y-%m-%dT%H:%M:%S') (${fname}:${BASH_LINENO[0]}:${FUNCNAME[1]}) $*"
}
min() {
local a b
a=$1
for b in "$@"; do
if [ "${b}" -le "${a}" ]; then
a="${b}"
fi
done
echo "${a}"
}
SECONDS=0
# General configuration
stage=1 # Processes starts from the specified stage.
stop_stage=10000 # Processes is stopped at the specified stage.
skip_data_prep=true # Skip data preparation stages.
skip_train=false # Skip training stages.
skip_eval=false # Skip decoding and evaluation stages.
skip_upload=true # Skip packing and uploading stages.
skip_upload_hf=true # Skip uploading to hugging face stages.
cuda_cmd=utils/run.pl
decode_cmd=utils/run.pl
ngpu=1 # The number of gpus ("0" uses cpu, otherwise use gpu).
njob=1 # the number of jobs for each gpu
gpuid_list=
num_nodes=1 # The number of nodes.
nj=32 # The number of parallel jobs.
inference_nj=32 # The number of parallel jobs in decoding.
gpu_inference=true # Whether to perform gpu decoding, set false for cpu decoding
datadir="./"
dumpdir=dump # Directory to dump features.
expdir=exp # Directory to save experiments.
python=python # Specify python to execute funasr commands.
# Data preparation related
local_data_opts= # The options given to local/data.sh.
# Speed perturbation related
speed_perturb_factors= # perturbation factors, e.g. "0.9 1.0 1.1" (separated by space).
# Feature extraction related
feats_type=fbank # Feature type (raw or fbank_pitch).
feats_dim=
audio_format=flac # Audio format: wav, flac, wav.ark, flac.ark (only in feats_type=raw).
fs=16k # Sampling rate.
min_wav_duration=0.1 # Minimum duration in second.
max_wav_duration=20 # Maximum duration in second.
# Tokenization related
token_type=bpe # Tokenization type (char or bpe).
nbpe=30 # The number of BPE vocabulary.
bpemode=unigram # Mode of BPE (unigram or bpe).
oov="<unk>" # Out of vocabulary symbol.
blank="<blank>" # CTC blank symbol
sos_eos="<sos/eos>" # sos and eos symbole
bpe_input_sentence_size=100000000 # Size of input sentence for BPE.
bpe_nlsyms= # non-linguistic symbols list, separated by a comma, for BPE
bpe_char_cover=1.0 # character coverage when modeling BPE
# Ngram model related
use_ngram=false
ngram_exp=
ngram_num=3
# Language model related
use_lm=false # Use language model for ASR decoding.
lm_tag= # Suffix to the result dir for language model training.
lm_exp= # Specify the directory path for LM experiment.
# If this option is specified, lm_tag is ignored.
lm_stats_dir= # Specify the directory path for LM statistics.
lm_config= # Config for language model training.
lm_args= # Arguments for language model training, e.g., "--max_epoch 10".
# Note that it will overwrite args in lm config.
use_word_lm=false # Whether to use word language model.
num_splits_lm=1 # Number of splitting for lm corpus.
# shellcheck disable=SC2034
word_vocab_size=10000 # Size of word vocabulary.
# ASR model related
asr_tag= # Suffix to the result dir for asr model training.
asr_exp= # Specify the directory path for ASR experiment.
# If this option is specified, asr_tag is ignored.
asr_stats_dir= # Specify the directory path for ASR statistics.
asr_config= # Config for asr model training.
asr_args= # Arguments for asr model training, e.g., "--max_epoch 10".
# Note that it will overwrite args in asr config.
pretrained_model= # Pretrained model to load
ignore_init_mismatch=false # Ignore initial mismatch
feats_normalize=global_mvn # Normalizaton layer type.
num_splits_asr=1 # Number of splitting for lm corpus.
# Upload model related
hf_repo=
# Decoding related
use_k2=false # Whether to use k2 based decoder
k2_ctc_decoding=true
use_nbest_rescoring=true # use transformer-decoder
# and transformer language model for nbest rescoring
num_paths=1000 # The 3rd argument of k2.random_paths.
nll_batch_size=100 # Affect GPU memory usage when computing nll
# during nbest rescoring
k2_config=./conf/decode_asr_transformer_with_k2.yaml
use_streaming=false # Whether to use streaming decoding
use_maskctc=false # Whether to use maskctc decoding
batch_size=1
inference_tag= # Suffix to the result dir for decoding.
inference_config= # Config for decoding.
inference_args= # Arguments for decoding, e.g., "--lm_weight 0.1".
# Note that it will overwrite args in inference config.
inference_lm=valid.loss.ave.pth # Language model path for decoding.
inference_ngram=${ngram_num}gram.bin
inference_asr_model=valid.acc.ave.pth # ASR model path for decoding.
# e.g.
# inference_asr_model=train.loss.best.pth
# inference_asr_model=3epoch.pth
# inference_asr_model=valid.acc.best.pth
# inference_asr_model=valid.loss.ave.pth
download_model= # Download a model from Model Zoo and use it for decoding.
# [Task dependent] Set the datadir name created by local/data.sh
train_set= # Name of training set.
valid_set= # Name of validation set used for monitoring/tuning network training.
test_sets= # Names of test sets. Multiple items (e.g., both dev and eval sets) can be specified.
bpe_train_text= # Text file path of bpe training set.
lm_train_text= # Text file path of language model training set.
lm_dev_text= # Text file path of language model development set.
lm_test_text= # Text file path of language model evaluation set.
nlsyms_txt=none # Non-linguistic symbol list if existing.
cleaner=none # Text cleaner.
g2p=none # g2p method (needed if token_type=phn).
lang=noinfo # The language type of corpus.
score_opts= # The options given to sclite scoring
local_score_opts= # The options given to local/score.sh.
asr_speech_fold_length=800 # fold_length for speech data during ASR training.
asr_text_fold_length=150 # fold_length for text data during ASR training.
lm_fold_length=150 # fold_length for LM training.
oss_path=
token_list=
scp=
text=
mode=
help_message=$(cat << EOF
Usage: $0 --train-set "<train_set_name>" --valid-set "<valid_set_name>" --test_sets "<test_set_names>"
Options:
# General configuration
--stage # Processes starts from the specified stage (default="${stage}").
--stop_stage # Processes is stopped at the specified stage (default="${stop_stage}").
--skip_data_prep # Skip data preparation stages (default="${skip_data_prep}").
--skip_train # Skip training stages (default="${skip_train}").
--skip_eval # Skip decoding and evaluation stages (default="${skip_eval}").
--skip_upload # Skip packing and uploading stages (default="${skip_upload}").
--ngpu # The number of gpus ("0" uses cpu, otherwise use gpu, default="${ngpu}").
--num_nodes # The number of nodes (default="${num_nodes}").
--nj # The number of parallel jobs (default="${nj}").
--inference_nj # The number of parallel jobs in decoding (default="${inference_nj}").
--gpu_inference # Whether to perform gpu decoding (default="${gpu_inference}").
--dumpdir # Directory to dump features (default="${dumpdir}").
--expdir # Directory to save experiments (default="${expdir}").
--python # Specify python to execute espnet commands (default="${python}").
# Data preparation related
--local_data_opts # The options given to local/data.sh (default="${local_data_opts}").
# Speed perturbation related
--speed_perturb_factors # speed perturbation factors, e.g. "0.9 1.0 1.1" (separated by space, default="${speed_perturb_factors}").
# Feature extraction related
--feats_type # Feature type (raw, fbank_pitch or extracted, default="${feats_type}").
--audio_format # Audio format: wav, flac, wav.ark, flac.ark (only in feats_type=raw, default="${audio_format}").
--fs # Sampling rate (default="${fs}").
--min_wav_duration # Minimum duration in second (default="${min_wav_duration}").
--max_wav_duration # Maximum duration in second (default="${max_wav_duration}").
# Tokenization related
--token_type # Tokenization type (char or bpe, default="${token_type}").
--nbpe # The number of BPE vocabulary (default="${nbpe}").
--bpemode # Mode of BPE (unigram or bpe, default="${bpemode}").
--oov # Out of vocabulary symbol (default="${oov}").
--blank # CTC blank symbol (default="${blank}").
--sos_eos # sos and eos symbole (default="${sos_eos}").
--bpe_input_sentence_size # Size of input sentence for BPE (default="${bpe_input_sentence_size}").
--bpe_nlsyms # Non-linguistic symbol list for sentencepiece, separated by a comma. (default="${bpe_nlsyms}").
--bpe_char_cover # Character coverage when modeling BPE (default="${bpe_char_cover}").
# Language model related
--lm_tag # Suffix to the result dir for language model training (default="${lm_tag}").
--lm_exp # Specify the directory path for LM experiment.
# If this option is specified, lm_tag is ignored (default="${lm_exp}").
--lm_stats_dir # Specify the directory path for LM statistics (default="${lm_stats_dir}").
--lm_config # Config for language model training (default="${lm_config}").
--lm_args # Arguments for language model training (default="${lm_args}").
# e.g., --lm_args "--max_epoch 10"
# Note that it will overwrite args in lm config.
--use_word_lm # Whether to use word language model (default="${use_word_lm}").
--word_vocab_size # Size of word vocabulary (default="${word_vocab_size}").
--num_splits_lm # Number of splitting for lm corpus (default="${num_splits_lm}").
# ASR model related
--asr_tag # Suffix to the result dir for asr model training (default="${asr_tag}").
--asr_exp # Specify the directory path for ASR experiment.
# If this option is specified, asr_tag is ignored (default="${asr_exp}").
--asr_stats_dir # Specify the directory path for ASR statistics (default="${asr_stats_dir}").
--asr_config # Config for asr model training (default="${asr_config}").
--asr_args # Arguments for asr model training (default="${asr_args}").
# e.g., --asr_args "--max_epoch 10"
# Note that it will overwrite args in asr config.
--pretrained_model= # Pretrained model to load (default="${pretrained_model}").
--ignore_init_mismatch= # Ignore mismatch parameter init with pretrained model (default="${ignore_init_mismatch}").
--feats_normalize # Normalizaton layer type (default="${feats_normalize}").
--num_splits_asr # Number of splitting for lm corpus (default="${num_splits_asr}").
# Decoding related
--inference_tag # Suffix to the result dir for decoding (default="${inference_tag}").
--inference_config # Config for decoding (default="${inference_config}").
--inference_args # Arguments for decoding (default="${inference_args}").
# e.g., --inference_args "--lm_weight 0.1"
# Note that it will overwrite args in inference config.
--inference_lm # Language model path for decoding (default="${inference_lm}").
--inference_asr_model # ASR model path for decoding (default="${inference_asr_model}").
--download_model # Download a model from Model Zoo and use it for decoding (default="${download_model}").
--use_streaming # Whether to use streaming decoding (default="${use_streaming}").
--use_maskctc # Whether to use maskctc decoding (default="${use_streaming}").
# [Task dependent] Set the datadir name created by local/data.sh
--train_set # Name of training set (required).
--valid_set # Name of validation set used for monitoring/tuning network training (required).
--test_sets # Names of test sets.
# Multiple items (e.g., both dev and eval sets) can be specified (required).
--bpe_train_text # Text file path of bpe training set.
--lm_train_text # Text file path of language model training set.
--lm_dev_text # Text file path of language model development set (default="${lm_dev_text}").
--lm_test_text # Text file path of language model evaluation set (default="${lm_test_text}").
--nlsyms_txt # Non-linguistic symbol list if existing (default="${nlsyms_txt}").
--cleaner # Text cleaner (default="${cleaner}").
--g2p # g2p method (default="${g2p}").
--lang # The language type of corpus (default=${lang}).
--score_opts # The options given to sclite scoring (default="{score_opts}").
--local_score_opts # The options given to local/score.sh (default="{local_score_opts}").
--asr_speech_fold_length # fold_length for speech data during ASR training (default="${asr_speech_fold_length}").
--asr_text_fold_length # fold_length for text data during ASR training (default="${asr_text_fold_length}").
--lm_fold_length # fold_length for LM training (default="${lm_fold_length}").
EOF
)
log "$0 $*"
# Save command line args for logging (they will be lost after utils/parse_options.sh)
run_args=$(utils/print_args.py $0 "$@")
. utils/parse_options.sh
if [ $# -ne 0 ]; then
log "${help_message}"
log "Error: No positional arguments are required."
exit 2
fi
# set absolute dump dir path
dumpdir=${datadir}/${dumpdir}
if [ -z "${inference_tag}" ]; then
if [ -n "${inference_config}" ]; then
inference_tag="$(basename "${inference_config}" .yaml)"
else
inference_tag=inference
fi
if "${use_k2}"; then
inference_tag+="_use_k2"
inference_tag+="_k2_ctc_decoding_${k2_ctc_decoding}"
inference_tag+="_use_nbest_rescoring_${use_nbest_rescoring}"
fi
fi
# ========================== Main stages start from here. ==========================
if [ ${stage} -le 12 ] && [ ${stop_stage} -ge 12 ]; then
log "Stage 12: Decoding: training_dir=${asr_exp}"
if ${gpu_inference}; then
_cmd="${cuda_cmd}"
_ngpu=1
else
_cmd="${decode_cmd}"
_ngpu=0
fi
_opts=
if [ -n "${inference_config}" ]; then
_opts+="--config ${inference_config} "
fi
if "${use_lm}"; then
if "${use_word_lm}"; then
_opts+="--word_lm_train_config ${lm_exp}/config.yaml "
_opts+="--word_lm_file ${lm_exp}/${inference_lm} "
else
_opts+="--lm_train_config ${lm_exp}/config.yaml "
_opts+="--lm_file ${lm_exp}/${inference_lm} "
fi
fi
if "${use_ngram}"; then
_opts+="--ngram_file ${ngram_exp}/${inference_ngram}"
inference_tag=${inference_tag}.${inference_ngram}
fi
# 2. Generate run.sh
log "Generate '${asr_exp}/${inference_tag}/run.sh'. You can resume the process from stage 12 using this script"
mkdir -p "${asr_exp}/${inference_tag}"; echo "${run_args} --stage 12 \"\$@\"; exit \$?" > "${asr_exp}/${inference_tag}/run.sh"; chmod +x "${asr_exp}/${inference_tag}/run.sh"
if "${use_streaming}"; then
asr_inference_tool="funasr.bin.asr_inference_streaming"
elif "${use_maskctc}"; then
asr_inference_tool="funasr.bin.asr_inference_maskctc"
else
asr_inference_tool="funasr.bin.asr_inference_launch"
fi
for dset in ${test_sets}; do
if [ $feats_type == "ark_wav" ]; then
_data="${dumpdir}/wav/${dset}"
else
_data="${dumpdir}/$feats_type/${dset}"
fi
_dir="${asr_exp}/${inference_tag}/${inference_asr_model}/${dset}"
_logdir="${_dir}/logdir"
if [ -d ${_dir} ]; then
#echo "${_dir} is already exists. if you want to decode again, please delete this dir first."
rm -r ${_dir}
fi
mkdir -p "${_logdir}"
_scp=$scp
_type=kaldi_ark
# 1. Split the key file
key_file=${_data}/${_scp}
split_scps=""
if "${use_k2}"; then
# Now only _nj=1 is verified if using k2
_nj=1
else
_nj=$(min "${inference_nj}" "$(<${key_file} wc -l)")
fi
for n in $(seq "${_nj}"); do
split_scps+=" ${_logdir}/keys.${n}.scp"
done
# shellcheck disable=SC2086
utils/split_scp.pl "${key_file}" ${split_scps}
# 2. Submit decoding jobs
log "Decoding started... log: '${_logdir}/asr_inference.*.log'"
# shellcheck disable=SC2086
${_cmd} --gpu "${_ngpu}" --max-jobs-run "${_nj}" JOB=1:"${_nj}" "${_logdir}"/asr_inference.JOB.log \
${python} -m ${asr_inference_tool} \
--batch_size ${batch_size} \
--ngpu "${_ngpu}" \
--njob ${njob} \
--gpuid_list ${gpuid_list} \
--data_path_and_name_and_type "${_data}/${_scp},speech,${_type}" \
--key_file "${_logdir}"/keys.JOB.scp \
--asr_train_config "${asr_exp}"/config.yaml \
--asr_model_file "${asr_exp}"/"${inference_asr_model}" \
--output_dir "${_logdir}"/output.JOB \
--mode $mode \
${_opts} ${inference_args}
# 3. Concatenates the output files from each jobs
for f in token token_int score text; do
if [ -f "${_logdir}/output.1/1best_recog/${f}" ]; then
for i in $(seq "${_nj}"); do
cat "${_logdir}/output.${i}/1best_recog/${f}"
done | sort -k1 >"${_dir}/${f}"
fi
done
python utils/proce_text.py ${_dir}/text ${_dir}/${text}.proc
python utils/proce_text.py ${_data}/text ${_data}/${text}.proc
python utils/compute_wer.py ${_data}/text.proc ${_dir}/text.proc ${_dir}/text.cer
tail -n 3 ${_dir}/text.cer > ${_dir}/text.cer.txt
cat ${_dir}/text.cer.txt
done
fi
log "Successfully finished. [elapsed=${SECONDS}s]"

View File

@ -0,0 +1,370 @@
#!/usr/bin/env python3
# coding=utf8
# Copyright 2021 Jiayu DU
import sys
import argparse
import json
import logging
logging.basicConfig(stream=sys.stderr, level=logging.INFO, format='[%(levelname)s] %(message)s')
DEBUG = None
def GetEditType(ref_token, hyp_token):
if ref_token == None and hyp_token != None:
return 'I'
elif ref_token != None and hyp_token == None:
return 'D'
elif ref_token == hyp_token:
return 'C'
elif ref_token != hyp_token:
return 'S'
else:
raise RuntimeError
class AlignmentArc:
def __init__(self, src, dst, ref, hyp):
self.src = src
self.dst = dst
self.ref = ref
self.hyp = hyp
self.edit_type = GetEditType(ref, hyp)
def similarity_score_function(ref_token, hyp_token):
return 0 if (ref_token == hyp_token) else -1.0
def insertion_score_function(token):
return -1.0
def deletion_score_function(token):
return -1.0
def EditDistance(
ref,
hyp,
similarity_score_function = similarity_score_function,
insertion_score_function = insertion_score_function,
deletion_score_function = deletion_score_function):
assert(len(ref) != 0)
class DPState:
def __init__(self):
self.score = -float('inf')
# backpointer
self.prev_r = None
self.prev_h = None
def print_search_grid(S, R, H, fstream):
print(file=fstream)
for r in range(R):
for h in range(H):
print(F'[{r},{h}]:{S[r][h].score:4.3f}:({S[r][h].prev_r},{S[r][h].prev_h}) ', end='', file=fstream)
print(file=fstream)
R = len(ref) + 1
H = len(hyp) + 1
# Construct DP search space, a (R x H) grid
S = [ [] for r in range(R) ]
for r in range(R):
S[r] = [ DPState() for x in range(H) ]
# initialize DP search grid origin, S(r = 0, h = 0)
S[0][0].score = 0.0
S[0][0].prev_r = None
S[0][0].prev_h = None
# initialize REF axis
for r in range(1, R):
S[r][0].score = S[r-1][0].score + deletion_score_function(ref[r-1])
S[r][0].prev_r = r-1
S[r][0].prev_h = 0
# initialize HYP axis
for h in range(1, H):
S[0][h].score = S[0][h-1].score + insertion_score_function(hyp[h-1])
S[0][h].prev_r = 0
S[0][h].prev_h = h-1
best_score = S[0][0].score
best_state = (0, 0)
for r in range(1, R):
for h in range(1, H):
sub_or_cor_score = similarity_score_function(ref[r-1], hyp[h-1])
new_score = S[r-1][h-1].score + sub_or_cor_score
if new_score >= S[r][h].score:
S[r][h].score = new_score
S[r][h].prev_r = r-1
S[r][h].prev_h = h-1
del_score = deletion_score_function(ref[r-1])
new_score = S[r-1][h].score + del_score
if new_score >= S[r][h].score:
S[r][h].score = new_score
S[r][h].prev_r = r - 1
S[r][h].prev_h = h
ins_score = insertion_score_function(hyp[h-1])
new_score = S[r][h-1].score + ins_score
if new_score >= S[r][h].score:
S[r][h].score = new_score
S[r][h].prev_r = r
S[r][h].prev_h = h-1
best_score = S[R-1][H-1].score
best_state = (R-1, H-1)
if DEBUG:
print_search_grid(S, R, H, sys.stderr)
# Backtracing best alignment path, i.e. a list of arcs
# arc = (src, dst, ref, hyp, edit_type)
# src/dst = (r, h), where r/h refers to search grid state-id along Ref/Hyp axis
best_path = []
r, h = best_state[0], best_state[1]
prev_r, prev_h = S[r][h].prev_r, S[r][h].prev_h
score = S[r][h].score
# loop invariant:
# 1. (prev_r, prev_h) -> (r, h) is a "forward arc" on best alignment path
# 2. score is the value of point(r, h) on DP search grid
while prev_r != None or prev_h != None:
src = (prev_r, prev_h)
dst = (r, h)
if (r == prev_r + 1 and h == prev_h + 1): # Substitution or correct
arc = AlignmentArc(src, dst, ref[prev_r], hyp[prev_h])
elif (r == prev_r + 1 and h == prev_h): # Deletion
arc = AlignmentArc(src, dst, ref[prev_r], None)
elif (r == prev_r and h == prev_h + 1): # Insertion
arc = AlignmentArc(src, dst, None, hyp[prev_h])
else:
raise RuntimeError
best_path.append(arc)
r, h = prev_r, prev_h
prev_r, prev_h = S[r][h].prev_r, S[r][h].prev_h
score = S[r][h].score
best_path.reverse()
return (best_path, best_score)
def PrettyPrintAlignment(alignment, stream = sys.stderr):
def get_token_str(token):
if token == None:
return "*"
return token
def is_double_width_char(ch):
if (ch >= '\u4e00') and (ch <= '\u9fa5'): # codepoint ranges for Chinese chars
return True
# TODO: support other double-width-char language such as Japanese, Korean
else:
return False
def display_width(token_str):
m = 0
for c in token_str:
if is_double_width_char(c):
m += 2
else:
m += 1
return m
R = ' REF : '
H = ' HYP : '
E = ' EDIT : '
for arc in alignment:
r = get_token_str(arc.ref)
h = get_token_str(arc.hyp)
e = arc.edit_type if arc.edit_type != 'C' else ''
nr, nh, ne = display_width(r), display_width(h), display_width(e)
n = max(nr, nh, ne) + 1
R += r + ' ' * (n-nr)
H += h + ' ' * (n-nh)
E += e + ' ' * (n-ne)
print(R, file=stream)
print(H, file=stream)
print(E, file=stream)
def CountEdits(alignment):
c, s, i, d = 0, 0, 0, 0
for arc in alignment:
if arc.edit_type == 'C':
c += 1
elif arc.edit_type == 'S':
s += 1
elif arc.edit_type == 'I':
i += 1
elif arc.edit_type == 'D':
d += 1
else:
raise RuntimeError
return (c, s, i, d)
def ComputeTokenErrorRate(c, s, i, d):
return 100.0 * (s + d + i) / (s + d + c)
def ComputeSentenceErrorRate(num_err_utts, num_utts):
assert(num_utts != 0)
return 100.0 * num_err_utts / num_utts
class EvaluationResult:
def __init__(self):
self.num_ref_utts = 0
self.num_hyp_utts = 0
self.num_eval_utts = 0 # seen in both ref & hyp
self.num_hyp_without_ref = 0
self.C = 0
self.S = 0
self.I = 0
self.D = 0
self.token_error_rate = 0.0
self.num_utts_with_error = 0
self.sentence_error_rate = 0.0
def to_json(self):
return json.dumps(self.__dict__)
def to_kaldi(self):
info = (
F'%WER {self.token_error_rate:.2f} [ {self.S + self.D + self.I} / {self.C + self.S + self.D}, {self.I} ins, {self.D} del, {self.S} sub ]\n'
F'%SER {self.sentence_error_rate:.2f} [ {self.num_utts_with_error} / {self.num_eval_utts} ]\n'
)
return info
def to_sclite(self):
return "TODO"
def to_espnet(self):
return "TODO"
def to_summary(self):
#return json.dumps(self.__dict__, indent=4)
summary = (
'==================== Overall Statistics ====================\n'
F'num_ref_utts: {self.num_ref_utts}\n'
F'num_hyp_utts: {self.num_hyp_utts}\n'
F'num_hyp_without_ref: {self.num_hyp_without_ref}\n'
F'num_eval_utts: {self.num_eval_utts}\n'
F'sentence_error_rate: {self.sentence_error_rate:.2f}%\n'
F'token_error_rate: {self.token_error_rate:.2f}%\n'
F'token_stats:\n'
F' - tokens:{self.C + self.S + self.D:>7}\n'
F' - edits: {self.S + self.I + self.D:>7}\n'
F' - cor: {self.C:>7}\n'
F' - sub: {self.S:>7}\n'
F' - ins: {self.I:>7}\n'
F' - del: {self.D:>7}\n'
'============================================================\n'
)
return summary
class Utterance:
def __init__(self, uid, text):
self.uid = uid
self.text = text
def LoadUtterances(filepath, format):
utts = {}
if format == 'text': # utt_id word1 word2 ...
with open(filepath, 'r', encoding='utf8') as f:
for line in f:
line = line.strip()
if line:
cols = line.split(maxsplit=1)
assert(len(cols) == 2 or len(cols) == 1)
uid = cols[0]
text = cols[1] if len(cols) == 2 else ''
if utts.get(uid) != None:
raise RuntimeError(F'Found duplicated utterence id {uid}')
utts[uid] = Utterance(uid, text)
else:
raise RuntimeError(F'Unsupported text format {format}')
return utts
def tokenize_text(text, tokenizer):
if tokenizer == 'whitespace':
return text.split()
elif tokenizer == 'char':
return [ ch for ch in ''.join(text.split()) ]
else:
raise RuntimeError(F'ERROR: Unsupported tokenizer {tokenizer}')
if __name__ == '__main__':
parser = argparse.ArgumentParser()
# optional
parser.add_argument('--tokenizer', choices=['whitespace', 'char'], default='whitespace', help='whitespace for WER, char for CER')
parser.add_argument('--ref-format', choices=['text'], default='text', help='reference format, first col is utt_id, the rest is text')
parser.add_argument('--hyp-format', choices=['text'], default='text', help='hypothesis format, first col is utt_id, the rest is text')
# required
parser.add_argument('--ref', type=str, required=True, help='input reference file')
parser.add_argument('--hyp', type=str, required=True, help='input hypothesis file')
parser.add_argument('result_file', type=str)
args = parser.parse_args()
logging.info(args)
ref_utts = LoadUtterances(args.ref, args.ref_format)
hyp_utts = LoadUtterances(args.hyp, args.hyp_format)
r = EvaluationResult()
# check valid utterances in hyp that have matched non-empty reference
eval_utts = []
r.num_hyp_without_ref = 0
for uid in sorted(hyp_utts.keys()):
if uid in ref_utts.keys(): # TODO: efficiency
if ref_utts[uid].text.strip(): # non-empty reference
eval_utts.append(uid)
else:
logging.warn(F'Found {uid} with empty reference, skipping...')
else:
logging.warn(F'Found {uid} without reference, skipping...')
r.num_hyp_without_ref += 1
r.num_hyp_utts = len(hyp_utts)
r.num_ref_utts = len(ref_utts)
r.num_eval_utts = len(eval_utts)
with open(args.result_file, 'w+', encoding='utf8') as fo:
for uid in eval_utts:
ref = ref_utts[uid]
hyp = hyp_utts[uid]
alignment, score = EditDistance(
tokenize_text(ref.text, args.tokenizer),
tokenize_text(hyp.text, args.tokenizer)
)
c, s, i, d = CountEdits(alignment)
utt_ter = ComputeTokenErrorRate(c, s, i, d)
# utt-level evaluation result
print(F'{{"uid":{uid}, "score":{score}, "ter":{utt_ter:.2f}, "cor":{c}, "sub":{s}, "ins":{i}, "del":{d}}}', file=fo)
PrettyPrintAlignment(alignment, fo)
r.C += c
r.S += s
r.I += i
r.D += d
if utt_ter > 0:
r.num_utts_with_error += 1
# corpus level evaluation result
r.sentence_error_rate = ComputeSentenceErrorRate(r.num_utts_with_error, r.num_eval_utts)
r.token_error_rate = ComputeTokenErrorRate(r.C, r.S, r.I, r.D)
print(r.to_summary(), file=fo)
print(r.to_json())
print(r.to_kaldi())

View File

@ -0,0 +1,47 @@
from transformers import AutoTokenizer, AutoModel, pipeline
import numpy as np
import sys
import os
import torch
from kaldiio import WriteHelper
import re
text_file_json = sys.argv[1]
out_ark = sys.argv[2]
out_scp = sys.argv[3]
out_shape = sys.argv[4]
device = int(sys.argv[5])
model_path = sys.argv[6]
model = AutoModel.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)
extractor = pipeline(task="feature-extraction", model=model, tokenizer=tokenizer, device=device)
with open(text_file_json, 'r') as f:
js = f.readlines()
f_shape = open(out_shape, "w")
with WriteHelper('ark,scp:{},{}'.format(out_ark, out_scp)) as writer:
with torch.no_grad():
for idx, line in enumerate(js):
id, tokens = line.strip().split(" ", 1)
tokens = re.sub(" ", "", tokens.strip())
tokens = ' '.join([j for j in tokens])
token_num = len(tokens.split(" "))
outputs = extractor(tokens)
outputs = np.array(outputs)
embeds = outputs[0, 1:-1, :]
token_num_embeds, dim = embeds.shape
if token_num == token_num_embeds:
writer(id, embeds)
shape_line = "{} {},{}\n".format(id, token_num_embeds, dim)
f_shape.write(shape_line)
else:
print("{}, size has changed, {}, {}, {}".format(id, token_num, token_num_embeds, tokens))
f_shape.close()

View File

@ -0,0 +1,87 @@
#!/usr/bin/env perl
# Copyright 2010-2012 Microsoft Corporation
# Johns Hopkins University (author: Daniel Povey)
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# THIS CODE IS PROVIDED *AS IS* BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY IMPLIED
# WARRANTIES OR CONDITIONS OF TITLE, FITNESS FOR A PARTICULAR PURPOSE,
# MERCHANTABLITY OR NON-INFRINGEMENT.
# See the Apache 2 License for the specific language governing permissions and
# limitations under the License.
# This script takes a list of utterance-ids or any file whose first field
# of each line is an utterance-id, and filters an scp
# file (or any file whose "n-th" field is an utterance id), printing
# out only those lines whose "n-th" field is in id_list. The index of
# the "n-th" field is 1, by default, but can be changed by using
# the -f <n> switch
$exclude = 0;
$field = 1;
$shifted = 0;
do {
$shifted=0;
if ($ARGV[0] eq "--exclude") {
$exclude = 1;
shift @ARGV;
$shifted=1;
}
if ($ARGV[0] eq "-f") {
$field = $ARGV[1];
shift @ARGV; shift @ARGV;
$shifted=1
}
} while ($shifted);
if(@ARGV < 1 || @ARGV > 2) {
die "Usage: filter_scp.pl [--exclude] [-f <field-to-filter-on>] id_list [in.scp] > out.scp \n" .
"Prints only the input lines whose f'th field (default: first) is in 'id_list'.\n" .
"Note: only the first field of each line in id_list matters. With --exclude, prints\n" .
"only the lines that were *not* in id_list.\n" .
"Caution: previously, the -f option was interpreted as a zero-based field index.\n" .
"If your older scripts (written before Oct 2014) stopped working and you used the\n" .
"-f option, add 1 to the argument.\n" .
"See also: scripts/filter_scp.pl .\n";
}
$idlist = shift @ARGV;
open(F, "<$idlist") || die "Could not open id-list file $idlist";
while(<F>) {
@A = split;
@A>=1 || die "Invalid id-list file line $_";
$seen{$A[0]} = 1;
}
if ($field == 1) { # Treat this as special case, since it is common.
while(<>) {
$_ =~ m/\s*(\S+)\s*/ || die "Bad line $_, could not get first field.";
# $1 is what we filter on.
if ((!$exclude && $seen{$1}) || ($exclude && !defined $seen{$1})) {
print $_;
}
}
} else {
while(<>) {
@A = split;
@A > 0 || die "Invalid scp file line $_";
@A >= $field || die "Invalid scp file line $_";
if ((!$exclude && $seen{$A[$field-1]}) || ($exclude && !defined $seen{$A[$field-1]})) {
print $_;
}
}
}
# tests:
# the following should print "foo 1"
# ( echo foo 1; echo bar 2 ) | scripts/filter_scp.pl <(echo foo)
# the following should print "bar 2".
# ( echo foo 1; echo bar 2 ) | scripts/filter_scp.pl -f 2 <(echo 2)

View File

@ -0,0 +1,35 @@
#!/usr/bin/env bash
echo "$0 $@"
data_dir=$1
if [ ! -f ${data_dir}/wav.scp ]; then
echo "$0: wav.scp is not found"
exit 1;
fi
if [ ! -f ${data_dir}/text ]; then
echo "$0: text is not found"
exit 1;
fi
mkdir -p ${data_dir}/.backup
awk '{print $1}' ${data_dir}/wav.scp > ${data_dir}/.backup/wav_id
awk '{print $1}' ${data_dir}/text > ${data_dir}/.backup/text_id
sort ${data_dir}/.backup/wav_id ${data_dir}/.backup/text_id | uniq -d > ${data_dir}/.backup/id
cp ${data_dir}/wav.scp ${data_dir}/.backup/wav.scp
cp ${data_dir}/text ${data_dir}/.backup/text
mv ${data_dir}/wav.scp ${data_dir}/wav.scp.bak
mv ${data_dir}/text ${data_dir}/text.bak
utils/filter_scp.pl -f 1 ${data_dir}/.backup/id ${data_dir}/wav.scp.bak > ${data_dir}/wav.scp
utils/filter_scp.pl -f 1 ${data_dir}/.backup/id ${data_dir}/text.bak > ${data_dir}/text
rm ${data_dir}/wav.scp.bak
rm ${data_dir}/text.bak

View File

@ -0,0 +1,52 @@
#!/usr/bin/env bash
echo "$0 $@"
data_dir=$1
if [ ! -f ${data_dir}/feats.scp ]; then
echo "$0: feats.scp is not found"
exit 1;
fi
if [ ! -f ${data_dir}/text ]; then
echo "$0: text is not found"
exit 1;
fi
if [ ! -f ${data_dir}/speech_shape ]; then
echo "$0: feature lengths is not found"
exit 1;
fi
if [ ! -f ${data_dir}/text_shape ]; then
echo "$0: text lengths is not found"
exit 1;
fi
mkdir -p ${data_dir}/.backup
awk '{print $1}' ${data_dir}/feats.scp > ${data_dir}/.backup/wav_id
awk '{print $1}' ${data_dir}/text > ${data_dir}/.backup/text_id
sort ${data_dir}/.backup/wav_id ${data_dir}/.backup/text_id | uniq -d > ${data_dir}/.backup/id
cp ${data_dir}/feats.scp ${data_dir}/.backup/feats.scp
cp ${data_dir}/text ${data_dir}/.backup/text
cp ${data_dir}/speech_shape ${data_dir}/.backup/speech_shape
cp ${data_dir}/text_shape ${data_dir}/.backup/text_shape
mv ${data_dir}/feats.scp ${data_dir}/feats.scp.bak
mv ${data_dir}/text ${data_dir}/text.bak
mv ${data_dir}/speech_shape ${data_dir}/speech_shape.bak
mv ${data_dir}/text_shape ${data_dir}/text_shape.bak
utils/filter_scp.pl -f 1 ${data_dir}/.backup/id ${data_dir}/feats.scp.bak > ${data_dir}/feats.scp
utils/filter_scp.pl -f 1 ${data_dir}/.backup/id ${data_dir}/text.bak > ${data_dir}/text
utils/filter_scp.pl -f 1 ${data_dir}/.backup/id ${data_dir}/speech_shape.bak > ${data_dir}/speech_shape
utils/filter_scp.pl -f 1 ${data_dir}/.backup/id ${data_dir}/text_shape.bak > ${data_dir}/text_shape
rm ${data_dir}/feats.scp.bak
rm ${data_dir}/text.bak
rm ${data_dir}/speech_shape.bak
rm ${data_dir}/text_shape.bak

View File

@ -0,0 +1,20 @@
#!/usr/bin/env bash
# Begin configuration section.
nj=4
cmd=./utils/run.pl
echo "$0 $@"
. utils/parse_options.sh || exit 1;
data=$1
[ ! -d ${data}/ark ] && echo "$0: ark data is required" && exit 1;
[ ! -d ${data}/txt ] && echo "$0: txt data is required" && exit 1;
for n in $(seq $nj); do
echo "$data/ark/feats.$n.ark $data/txt/text.$n" || exit 1
done > $data/ark_txt.scp || exit 1

View File

@ -0,0 +1,97 @@
#!/usr/bin/env bash
# Copyright 2012 Johns Hopkins University (Author: Daniel Povey);
# Arnab Ghoshal, Karel Vesely
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# THIS CODE IS PROVIDED *AS IS* BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY IMPLIED
# WARRANTIES OR CONDITIONS OF TITLE, FITNESS FOR A PARTICULAR PURPOSE,
# MERCHANTABLITY OR NON-INFRINGEMENT.
# See the Apache 2 License for the specific language governing permissions and
# limitations under the License.
# Parse command-line options.
# To be sourced by another script (as in ". parse_options.sh").
# Option format is: --option-name arg
# and shell variable "option_name" gets set to value "arg."
# The exception is --help, which takes no arguments, but prints the
# $help_message variable (if defined).
###
### The --config file options have lower priority to command line
### options, so we need to import them first...
###
# Now import all the configs specified by command-line, in left-to-right order
for ((argpos=1; argpos<$#; argpos++)); do
if [ "${!argpos}" == "--config" ]; then
argpos_plus1=$((argpos+1))
config=${!argpos_plus1}
[ ! -r $config ] && echo "$0: missing config '$config'" && exit 1
. $config # source the config file.
fi
done
###
### Now we process the command line options
###
while true; do
[ -z "${1:-}" ] && break; # break if there are no arguments
case "$1" in
# If the enclosing script is called with --help option, print the help
# message and exit. Scripts should put help messages in $help_message
--help|-h) if [ -z "$help_message" ]; then echo "No help found." 1>&2;
else printf "$help_message\n" 1>&2 ; fi;
exit 0 ;;
--*=*) echo "$0: options to scripts must be of the form --name value, got '$1'"
exit 1 ;;
# If the first command-line argument begins with "--" (e.g. --foo-bar),
# then work out the variable name as $name, which will equal "foo_bar".
--*) name=`echo "$1" | sed s/^--// | sed s/-/_/g`;
# Next we test whether the variable in question is undefned-- if so it's
# an invalid option and we die. Note: $0 evaluates to the name of the
# enclosing script.
# The test [ -z ${foo_bar+xxx} ] will return true if the variable foo_bar
# is undefined. We then have to wrap this test inside "eval" because
# foo_bar is itself inside a variable ($name).
eval '[ -z "${'$name'+xxx}" ]' && echo "$0: invalid option $1" 1>&2 && exit 1;
oldval="`eval echo \\$$name`";
# Work out whether we seem to be expecting a Boolean argument.
if [ "$oldval" == "true" ] || [ "$oldval" == "false" ]; then
was_bool=true;
else
was_bool=false;
fi
# Set the variable to the right value-- the escaped quotes make it work if
# the option had spaces, like --cmd "queue.pl -sync y"
eval $name=\"$2\";
# Check that Boolean-valued arguments are really Boolean.
if $was_bool && [[ "$2" != "true" && "$2" != "false" ]]; then
echo "$0: expected \"true\" or \"false\": $1 $2" 1>&2
exit 1;
fi
shift 2;
;;
*) break;
esac
done
# Check for an empty argument to the --cmd option, which can easily occur as a
# result of scripting errors.
[ ! -z "${cmd+xxx}" ] && [ -z "$cmd" ] && echo "$0: empty argument to --cmd option" 1>&2 && exit 1;
true; # so this script returns exit code 0.

View File

@ -0,0 +1,45 @@
#!/usr/bin/env python
import sys
def get_commandline_args(no_executable=True):
extra_chars = [
" ",
";",
"&",
"|",
"<",
">",
"?",
"*",
"~",
"`",
'"',
"'",
"\\",
"{",
"}",
"(",
")",
]
# Escape the extra characters for shell
argv = [
arg.replace("'", "'\\''")
if all(char not in arg for char in extra_chars)
else "'" + arg.replace("'", "'\\''") + "'"
for arg in sys.argv
]
if no_executable:
return " ".join(argv[1:])
else:
return sys.executable + " " + " ".join(argv)
def main():
print(get_commandline_args())
if __name__ == "__main__":
main()

View File

@ -0,0 +1,35 @@
from pathlib import Path
import torch
import yaml
class NoAliasSafeDumper(yaml.SafeDumper):
# Disable anchor/alias in yaml because looks ugly
def ignore_aliases(self, data):
return True
def yaml_no_alias_safe_dump(data, stream=None, **kwargs):
"""Safe-dump in yaml with no anchor/alias"""
return yaml.dump(
data, stream, allow_unicode=True, Dumper=NoAliasSafeDumper, **kwargs
)
def gen_conf(file, out_dir):
conf = torch.load(file)["config"]
conf["oss_bucket"] = "null"
print(conf)
output_dir = Path(out_dir)
output_dir.mkdir(parents=True, exist_ok=True)
with (output_dir / "config.yaml").open("w", encoding="utf-8") as f:
yaml_no_alias_safe_dump(conf, f, indent=4, sort_keys=False)
if __name__ == "__main__":
import sys
in_f = sys.argv[1]
out_f = sys.argv[2]
gen_conf(in_f, out_f)

View File

@ -0,0 +1,31 @@
import sys
import re
in_f = sys.argv[1]
out_f = sys.argv[2]
with open(in_f, "r", encoding="utf-8") as f:
lines = f.readlines()
with open(out_f, "w", encoding="utf-8") as f:
for line in lines:
outs = line.strip().split(" ", 1)
if len(outs) == 2:
idx, text = outs
text = re.sub("</s>", "", text)
text = re.sub("<s>", "", text)
text = re.sub("@@", "", text)
text = re.sub("@", "", text)
text = re.sub("<unk>", "", text)
text = re.sub(" ", "", text)
text = text.lower()
else:
idx = outs[0]
text = " "
text = [x for x in text]
text = " ".join(text)
out = "{} {}\n".format(idx, text)
f.write(out)

View File

@ -0,0 +1,356 @@
#!/usr/bin/env perl
use warnings; #sed replacement for -w perl parameter
# In general, doing
# run.pl some.log a b c is like running the command a b c in
# the bash shell, and putting the standard error and output into some.log.
# To run parallel jobs (backgrounded on the host machine), you can do (e.g.)
# run.pl JOB=1:4 some.JOB.log a b c JOB is like running the command a b c JOB
# and putting it in some.JOB.log, for each one. [Note: JOB can be any identifier].
# If any of the jobs fails, this script will fail.
# A typical example is:
# run.pl some.log my-prog "--opt=foo bar" foo \| other-prog baz
# and run.pl will run something like:
# ( my-prog '--opt=foo bar' foo | other-prog baz ) >& some.log
#
# Basically it takes the command-line arguments, quotes them
# as necessary to preserve spaces, and evaluates them with bash.
# In addition it puts the command line at the top of the log, and
# the start and end times of the command at the beginning and end.
# The reason why this is useful is so that we can create a different
# version of this program that uses a queueing system instead.
#use Data::Dumper;
@ARGV < 2 && die "usage: run.pl log-file command-line arguments...";
#print STDERR "COMMAND-LINE: " . Dumper(\@ARGV) . "\n";
$job_pick = 'all';
$max_jobs_run = -1;
$jobstart = 1;
$jobend = 1;
$ignored_opts = ""; # These will be ignored.
# First parse an option like JOB=1:4, and any
# options that would normally be given to
# queue.pl, which we will just discard.
for (my $x = 1; $x <= 2; $x++) { # This for-loop is to
# allow the JOB=1:n option to be interleaved with the
# options to qsub.
while (@ARGV >= 2 && $ARGV[0] =~ m:^-:) {
# parse any options that would normally go to qsub, but which will be ignored here.
my $switch = shift @ARGV;
if ($switch eq "-V") {
$ignored_opts .= "-V ";
} elsif ($switch eq "--max-jobs-run" || $switch eq "-tc") {
# we do support the option --max-jobs-run n, and its GridEngine form -tc n.
# if the command appears multiple times uses the smallest option.
if ( $max_jobs_run <= 0 ) {
$max_jobs_run = shift @ARGV;
} else {
my $new_constraint = shift @ARGV;
if ( ($new_constraint < $max_jobs_run) ) {
$max_jobs_run = $new_constraint;
}
}
if (! ($max_jobs_run > 0)) {
die "run.pl: invalid option --max-jobs-run $max_jobs_run";
}
} else {
my $argument = shift @ARGV;
if ($argument =~ m/^--/) {
print STDERR "run.pl: WARNING: suspicious argument '$argument' to $switch; starts with '-'\n";
}
if ($switch eq "-sync" && $argument =~ m/^[yY]/) {
$ignored_opts .= "-sync "; # Note: in the
# corresponding code in queue.pl it says instead, just "$sync = 1;".
} elsif ($switch eq "-pe") { # e.g. -pe smp 5
my $argument2 = shift @ARGV;
$ignored_opts .= "$switch $argument $argument2 ";
} elsif ($switch eq "--gpu") {
$using_gpu = $argument;
} elsif ($switch eq "--pick") {
if($argument =~ m/^(all|failed|incomplete)$/) {
$job_pick = $argument;
} else {
print STDERR "run.pl: ERROR: --pick argument must be one of 'all', 'failed' or 'incomplete'"
}
} else {
# Ignore option.
$ignored_opts .= "$switch $argument ";
}
}
}
if ($ARGV[0] =~ m/^([\w_][\w\d_]*)+=(\d+):(\d+)$/) { # e.g. JOB=1:20
$jobname = $1;
$jobstart = $2;
$jobend = $3;
if ($jobstart > $jobend) {
die "run.pl: invalid job range $ARGV[0]";
}
if ($jobstart <= 0) {
die "run.pl: invalid job range $ARGV[0], start must be strictly positive (this is required for GridEngine compatibility).";
}
shift;
} elsif ($ARGV[0] =~ m/^([\w_][\w\d_]*)+=(\d+)$/) { # e.g. JOB=1.
$jobname = $1;
$jobstart = $2;
$jobend = $2;
shift;
} elsif ($ARGV[0] =~ m/.+\=.*\:.*$/) {
print STDERR "run.pl: Warning: suspicious first argument to run.pl: $ARGV[0]\n";
}
}
# Users found this message confusing so we are removing it.
# if ($ignored_opts ne "") {
# print STDERR "run.pl: Warning: ignoring options \"$ignored_opts\"\n";
# }
if ($max_jobs_run == -1) { # If --max-jobs-run option not set,
# then work out the number of processors if possible,
# and set it based on that.
$max_jobs_run = 0;
if ($using_gpu) {
if (open(P, "nvidia-smi -L |")) {
$max_jobs_run++ while (<P>);
close(P);
}
if ($max_jobs_run == 0) {
$max_jobs_run = 1;
print STDERR "run.pl: Warning: failed to detect number of GPUs from nvidia-smi, using ${max_jobs_run}\n";
}
} elsif (open(P, "</proc/cpuinfo")) { # Linux
while (<P>) { if (m/^processor/) { $max_jobs_run++; } }
if ($max_jobs_run == 0) {
print STDERR "run.pl: Warning: failed to detect any processors from /proc/cpuinfo\n";
$max_jobs_run = 10; # reasonable default.
}
close(P);
} elsif (open(P, "sysctl -a |")) { # BSD/Darwin
while (<P>) {
if (m/hw\.ncpu\s*[:=]\s*(\d+)/) { # hw.ncpu = 4, or hw.ncpu: 4
$max_jobs_run = $1;
last;
}
}
close(P);
if ($max_jobs_run == 0) {
print STDERR "run.pl: Warning: failed to detect any processors from sysctl -a\n";
$max_jobs_run = 10; # reasonable default.
}
} else {
# allow at most 32 jobs at once, on non-UNIX systems; change this code
# if you need to change this default.
$max_jobs_run = 32;
}
# The just-computed value of $max_jobs_run is just the number of processors
# (or our best guess); and if it happens that the number of jobs we need to
# run is just slightly above $max_jobs_run, it will make sense to increase
# $max_jobs_run to equal the number of jobs, so we don't have a small number
# of leftover jobs.
$num_jobs = $jobend - $jobstart + 1;
if (!$using_gpu &&
$num_jobs > $max_jobs_run && $num_jobs < 1.4 * $max_jobs_run) {
$max_jobs_run = $num_jobs;
}
}
sub pick_or_exit {
# pick_or_exit ( $logfile )
# Invoked before each job is started helps to run jobs selectively.
#
# Given the name of the output logfile decides whether the job must be
# executed (by returning from the subroutine) or not (by terminating the
# process calling exit)
#
# PRE: $job_pick is a global variable set by command line switch --pick
# and indicates which class of jobs must be executed.
#
# 1) If a failed job is not executed the process exit code will indicate
# failure, just as if the task was just executed and failed.
#
# 2) If a task is incomplete it will be executed. Incomplete may be either
# a job whose log file does not contain the accounting notes in the end,
# or a job whose log file does not exist.
#
# 3) If the $job_pick is set to 'all' (default behavior) a task will be
# executed regardless of the result of previous attempts.
#
# This logic could have been implemented in the main execution loop
# but a subroutine to preserve the current level of readability of
# that part of the code.
#
# Alexandre Felipe, (o.alexandre.felipe@gmail.com) 14th of August of 2020
#
if($job_pick eq 'all'){
return; # no need to bother with the previous log
}
open my $fh, "<", $_[0] or return; # job not executed yet
my $log_line;
my $cur_line;
while ($cur_line = <$fh>) {
if( $cur_line =~ m/# Ended \(code .*/ ) {
$log_line = $cur_line;
}
}
close $fh;
if (! defined($log_line)){
return; # incomplete
}
if ( $log_line =~ m/# Ended \(code 0\).*/ ) {
exit(0); # complete
} elsif ( $log_line =~ m/# Ended \(code \d+(; signal \d+)?\).*/ ){
if ($job_pick !~ m/^(failed|all)$/) {
exit(1); # failed but not going to run
} else {
return; # failed
}
} elsif ( $log_line =~ m/.*\S.*/ ) {
return; # incomplete jobs are always run
}
}
$logfile = shift @ARGV;
if (defined $jobname && $logfile !~ m/$jobname/ &&
$jobend > $jobstart) {
print STDERR "run.pl: you are trying to run a parallel job but "
. "you are putting the output into just one log file ($logfile)\n";
exit(1);
}
$cmd = "";
foreach $x (@ARGV) {
if ($x =~ m/^\S+$/) { $cmd .= $x . " "; }
elsif ($x =~ m:\":) { $cmd .= "'$x' "; }
else { $cmd .= "\"$x\" "; }
}
#$Data::Dumper::Indent=0;
$ret = 0;
$numfail = 0;
%active_pids=();
use POSIX ":sys_wait_h";
for ($jobid = $jobstart; $jobid <= $jobend; $jobid++) {
if (scalar(keys %active_pids) >= $max_jobs_run) {
# Lets wait for a change in any child's status
# Then we have to work out which child finished
$r = waitpid(-1, 0);
$code = $?;
if ($r < 0 ) { die "run.pl: Error waiting for child process"; } # should never happen.
if ( defined $active_pids{$r} ) {
$jid=$active_pids{$r};
$fail[$jid]=$code;
if ($code !=0) { $numfail++;}
delete $active_pids{$r};
# print STDERR "Finished: $r/$jid " . Dumper(\%active_pids) . "\n";
} else {
die "run.pl: Cannot find the PID of the child process that just finished.";
}
# In theory we could do a non-blocking waitpid over all jobs running just
# to find out if only one or more jobs finished during the previous waitpid()
# However, we just omit this and will reap the next one in the next pass
# through the for(;;) cycle
}
$childpid = fork();
if (!defined $childpid) { die "run.pl: Error forking in run.pl (writing to $logfile)"; }
if ($childpid == 0) { # We're in the child... this branch
# executes the job and returns (possibly with an error status).
if (defined $jobname) {
$cmd =~ s/$jobname/$jobid/g;
$logfile =~ s/$jobname/$jobid/g;
}
# exit if the job does not need to be executed
pick_or_exit( $logfile );
system("mkdir -p `dirname $logfile` 2>/dev/null");
open(F, ">$logfile") || die "run.pl: Error opening log file $logfile";
print F "# " . $cmd . "\n";
print F "# Started at " . `date`;
$starttime = `date +'%s'`;
print F "#\n";
close(F);
# Pipe into bash.. make sure we're not using any other shell.
open(B, "|bash") || die "run.pl: Error opening shell command";
print B "( " . $cmd . ") 2>>$logfile >> $logfile";
close(B); # If there was an error, exit status is in $?
$ret = $?;
$lowbits = $ret & 127;
$highbits = $ret >> 8;
if ($lowbits != 0) { $return_str = "code $highbits; signal $lowbits" }
else { $return_str = "code $highbits"; }
$endtime = `date +'%s'`;
open(F, ">>$logfile") || die "run.pl: Error opening log file $logfile (again)";
$enddate = `date`;
chop $enddate;
print F "# Accounting: time=" . ($endtime - $starttime) . " threads=1\n";
print F "# Ended ($return_str) at " . $enddate . ", elapsed time " . ($endtime-$starttime) . " seconds\n";
close(F);
exit($ret == 0 ? 0 : 1);
} else {
$pid[$jobid] = $childpid;
$active_pids{$childpid} = $jobid;
# print STDERR "Queued: " . Dumper(\%active_pids) . "\n";
}
}
# Now we have submitted all the jobs, lets wait until all the jobs finish
foreach $child (keys %active_pids) {
$jobid=$active_pids{$child};
$r = waitpid($pid[$jobid], 0);
$code = $?;
if ($r == -1) { die "run.pl: Error waiting for child process"; } # should never happen.
if ($r != 0) { $fail[$jobid]=$code; $numfail++ if $code!=0; } # Completed successfully
}
# Some sanity checks:
# The $fail array should not contain undefined codes
# The number of non-zeros in that array should be equal to $numfail
# We cannot do foreach() here, as the JOB ids do not start at zero
$failed_jids=0;
for ($jobid = $jobstart; $jobid <= $jobend; $jobid++) {
$job_return = $fail[$jobid];
if (not defined $job_return ) {
# print Dumper(\@fail);
die "run.pl: Sanity check failed: we have indication that some jobs are running " .
"even after we waited for all jobs to finish" ;
}
if ($job_return != 0 ){ $failed_jids++;}
}
if ($failed_jids != $numfail) {
die "run.pl: Sanity check failed: cannot find out how many jobs failed ($failed_jids x $numfail)."
}
if ($numfail > 0) { $ret = 1; }
if ($ret != 0) {
$njobs = $jobend - $jobstart + 1;
if ($njobs == 1) {
if (defined $jobname) {
$logfile =~ s/$jobname/$jobstart/; # only one numbered job, so replace name with
# that job.
}
print STDERR "run.pl: job failed, log is in $logfile\n";
if ($logfile =~ m/JOB/) {
print STDERR "run.pl: probably you forgot to put JOB=1:\$nj in your script.";
}
}
else {
$logfile =~ s/$jobname/*/g;
print STDERR "run.pl: $numfail / $njobs failed, log is in $logfile\n";
}
}
exit ($ret);

View File

@ -0,0 +1,44 @@
#!/usr/bin/env perl
# Copyright 2013 Johns Hopkins University (author: Daniel Povey)
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# THIS CODE IS PROVIDED *AS IS* BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY IMPLIED
# WARRANTIES OR CONDITIONS OF TITLE, FITNESS FOR A PARTICULAR PURPOSE,
# MERCHANTABLITY OR NON-INFRINGEMENT.
# See the Apache 2 License for the specific language governing permissions and
# limitations under the License.
if ($ARGV[0] eq "--srand") {
$n = $ARGV[1];
$n =~ m/\d+/ || die "Bad argument to --srand option: \"$n\"";
srand($ARGV[1]);
shift;
shift;
} else {
srand(0); # Gives inconsistent behavior if we don't seed.
}
if (@ARGV > 1 || $ARGV[0] =~ m/^-.+/) { # >1 args, or an option we
# don't understand.
print "Usage: shuffle_list.pl [--srand N] [input file] > output\n";
print "randomizes the order of lines of input.\n";
exit(1);
}
@lines;
while (<>) {
push @lines, [ (rand(), $_)] ;
}
@lines = sort { $a->[0] cmp $b->[0] } @lines;
foreach $l (@lines) {
print $l->[1];
}

View File

@ -0,0 +1,60 @@
import os
import sys
import random
in_dir = sys.argv[1]
out_dir = sys.argv[2]
num_split = sys.argv[3]
def split_scp(scp, num):
assert len(scp) >= num
avg = len(scp) // num
out = []
begin = 0
for i in range(num):
if i == num - 1:
out.append(scp[begin:])
else:
out.append(scp[begin:begin+avg])
begin += avg
return out
os.path.exists("{}/wav.scp".format(in_dir))
os.path.exists("{}/text".format(in_dir))
with open("{}/wav.scp".format(in_dir), 'r') as infile:
wav_list = infile.readlines()
with open("{}/text".format(in_dir), 'r') as infile:
text_list = infile.readlines()
assert len(wav_list) == len(text_list)
x = list(zip(wav_list, text_list))
random.shuffle(x)
wav_shuffle_list, text_shuffle_list = zip(*x)
num_split = int(num_split)
wav_split_list = split_scp(wav_shuffle_list, num_split)
text_split_list = split_scp(text_shuffle_list, num_split)
for idx, wav_list in enumerate(wav_split_list, 1):
path = out_dir + "/split" + str(num_split) + "/" + str(idx)
if not os.path.exists(path):
os.makedirs(path)
with open("{}/wav.scp".format(path), 'w') as wav_writer:
for line in wav_list:
wav_writer.write(line)
for idx, text_list in enumerate(text_split_list, 1):
path = out_dir + "/split" + str(num_split) + "/" + str(idx)
if not os.path.exists(path):
os.makedirs(path)
with open("{}/text".format(path), 'w') as text_writer:
for line in text_list:
text_writer.write(line)

View File

@ -0,0 +1,246 @@
#!/usr/bin/env perl
# Copyright 2010-2011 Microsoft Corporation
# See ../../COPYING for clarification regarding multiple authors
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# THIS CODE IS PROVIDED *AS IS* BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY IMPLIED
# WARRANTIES OR CONDITIONS OF TITLE, FITNESS FOR A PARTICULAR PURPOSE,
# MERCHANTABLITY OR NON-INFRINGEMENT.
# See the Apache 2 License for the specific language governing permissions and
# limitations under the License.
# This program splits up any kind of .scp or archive-type file.
# If there is no utt2spk option it will work on any text file and
# will split it up with an approximately equal number of lines in
# each but.
# With the --utt2spk option it will work on anything that has the
# utterance-id as the first entry on each line; the utt2spk file is
# of the form "utterance speaker" (on each line).
# It splits it into equal size chunks as far as it can. If you use the utt2spk
# option it will make sure these chunks coincide with speaker boundaries. In
# this case, if there are more chunks than speakers (and in some other
# circumstances), some of the resulting chunks will be empty and it will print
# an error message and exit with nonzero status.
# You will normally call this like:
# split_scp.pl scp scp.1 scp.2 scp.3 ...
# or
# split_scp.pl --utt2spk=utt2spk scp scp.1 scp.2 scp.3 ...
# Note that you can use this script to split the utt2spk file itself,
# e.g. split_scp.pl --utt2spk=utt2spk utt2spk utt2spk.1 utt2spk.2 ...
# You can also call the scripts like:
# split_scp.pl -j 3 0 scp scp.0
# [note: with this option, it assumes zero-based indexing of the split parts,
# i.e. the second number must be 0 <= n < num-jobs.]
use warnings;
$num_jobs = 0;
$job_id = 0;
$utt2spk_file = "";
$one_based = 0;
for ($x = 1; $x <= 3 && @ARGV > 0; $x++) {
if ($ARGV[0] eq "-j") {
shift @ARGV;
$num_jobs = shift @ARGV;
$job_id = shift @ARGV;
}
if ($ARGV[0] =~ /--utt2spk=(.+)/) {
$utt2spk_file=$1;
shift;
}
if ($ARGV[0] eq '--one-based') {
$one_based = 1;
shift @ARGV;
}
}
if ($num_jobs != 0 && ($num_jobs < 0 || $job_id - $one_based < 0 ||
$job_id - $one_based >= $num_jobs)) {
die "$0: Invalid job number/index values for '-j $num_jobs $job_id" .
($one_based ? " --one-based" : "") . "'\n"
}
$one_based
and $job_id--;
if(($num_jobs == 0 && @ARGV < 2) || ($num_jobs > 0 && (@ARGV < 1 || @ARGV > 2))) {
die
"Usage: split_scp.pl [--utt2spk=<utt2spk_file>] in.scp out1.scp out2.scp ...
or: split_scp.pl -j num-jobs job-id [--one-based] [--utt2spk=<utt2spk_file>] in.scp [out.scp]
... where 0 <= job-id < num-jobs, or 1 <= job-id <- num-jobs if --one-based.\n";
}
$error = 0;
$inscp = shift @ARGV;
if ($num_jobs == 0) { # without -j option
@OUTPUTS = @ARGV;
} else {
for ($j = 0; $j < $num_jobs; $j++) {
if ($j == $job_id) {
if (@ARGV > 0) { push @OUTPUTS, $ARGV[0]; }
else { push @OUTPUTS, "-"; }
} else {
push @OUTPUTS, "/dev/null";
}
}
}
if ($utt2spk_file ne "") { # We have the --utt2spk option...
open($u_fh, '<', $utt2spk_file) || die "$0: Error opening utt2spk file $utt2spk_file: $!\n";
while(<$u_fh>) {
@A = split;
@A == 2 || die "$0: Bad line $_ in utt2spk file $utt2spk_file\n";
($u,$s) = @A;
$utt2spk{$u} = $s;
}
close $u_fh;
open($i_fh, '<', $inscp) || die "$0: Error opening input scp file $inscp: $!\n";
@spkrs = ();
while(<$i_fh>) {
@A = split;
if(@A == 0) { die "$0: Empty or space-only line in scp file $inscp\n"; }
$u = $A[0];
$s = $utt2spk{$u};
defined $s || die "$0: No utterance $u in utt2spk file $utt2spk_file\n";
if(!defined $spk_count{$s}) {
push @spkrs, $s;
$spk_count{$s} = 0;
$spk_data{$s} = []; # ref to new empty array.
}
$spk_count{$s}++;
push @{$spk_data{$s}}, $_;
}
# Now split as equally as possible ..
# First allocate spks to files by allocating an approximately
# equal number of speakers.
$numspks = @spkrs; # number of speakers.
$numscps = @OUTPUTS; # number of output files.
if ($numspks < $numscps) {
die "$0: Refusing to split data because number of speakers $numspks " .
"is less than the number of output .scp files $numscps\n";
}
for($scpidx = 0; $scpidx < $numscps; $scpidx++) {
$scparray[$scpidx] = []; # [] is array reference.
}
for ($spkidx = 0; $spkidx < $numspks; $spkidx++) {
$scpidx = int(($spkidx*$numscps) / $numspks);
$spk = $spkrs[$spkidx];
push @{$scparray[$scpidx]}, $spk;
$scpcount[$scpidx] += $spk_count{$spk};
}
# Now will try to reassign beginning + ending speakers
# to different scp's and see if it gets more balanced.
# Suppose objf we're minimizing is sum_i (num utts in scp[i] - average)^2.
# We can show that if considering changing just 2 scp's, we minimize
# this by minimizing the squared difference in sizes. This is
# equivalent to minimizing the absolute difference in sizes. This
# shows this method is bound to converge.
$changed = 1;
while($changed) {
$changed = 0;
for($scpidx = 0; $scpidx < $numscps; $scpidx++) {
# First try to reassign ending spk of this scp.
if($scpidx < $numscps-1) {
$sz = @{$scparray[$scpidx]};
if($sz > 0) {
$spk = $scparray[$scpidx]->[$sz-1];
$count = $spk_count{$spk};
$nutt1 = $scpcount[$scpidx];
$nutt2 = $scpcount[$scpidx+1];
if( abs( ($nutt2+$count) - ($nutt1-$count))
< abs($nutt2 - $nutt1)) { # Would decrease
# size-diff by reassigning spk...
$scpcount[$scpidx+1] += $count;
$scpcount[$scpidx] -= $count;
pop @{$scparray[$scpidx]};
unshift @{$scparray[$scpidx+1]}, $spk;
$changed = 1;
}
}
}
if($scpidx > 0 && @{$scparray[$scpidx]} > 0) {
$spk = $scparray[$scpidx]->[0];
$count = $spk_count{$spk};
$nutt1 = $scpcount[$scpidx-1];
$nutt2 = $scpcount[$scpidx];
if( abs( ($nutt2-$count) - ($nutt1+$count))
< abs($nutt2 - $nutt1)) { # Would decrease
# size-diff by reassigning spk...
$scpcount[$scpidx-1] += $count;
$scpcount[$scpidx] -= $count;
shift @{$scparray[$scpidx]};
push @{$scparray[$scpidx-1]}, $spk;
$changed = 1;
}
}
}
}
# Now print out the files...
for($scpidx = 0; $scpidx < $numscps; $scpidx++) {
$scpfile = $OUTPUTS[$scpidx];
($scpfile ne '-' ? open($f_fh, '>', $scpfile)
: open($f_fh, '>&', \*STDOUT)) ||
die "$0: Could not open scp file $scpfile for writing: $!\n";
$count = 0;
if(@{$scparray[$scpidx]} == 0) {
print STDERR "$0: eError: split_scp.pl producing empty .scp file " .
"$scpfile (too many splits and too few speakers?)\n";
$error = 1;
} else {
foreach $spk ( @{$scparray[$scpidx]} ) {
print $f_fh @{$spk_data{$spk}};
$count += $spk_count{$spk};
}
$count == $scpcount[$scpidx] || die "Count mismatch [code error]";
}
close($f_fh);
}
} else {
# This block is the "normal" case where there is no --utt2spk
# option and we just break into equal size chunks.
open($i_fh, '<', $inscp) || die "$0: Error opening input scp file $inscp: $!\n";
$numscps = @OUTPUTS; # size of array.
@F = ();
while(<$i_fh>) {
push @F, $_;
}
$numlines = @F;
if($numlines == 0) {
print STDERR "$0: error: empty input scp file $inscp\n";
$error = 1;
}
$linesperscp = int( $numlines / $numscps); # the "whole part"..
$linesperscp >= 1 || die "$0: You are splitting into too many pieces! [reduce \$nj ($numscps) to be smaller than the number of lines ($numlines) in $inscp]\n";
$remainder = $numlines - ($linesperscp * $numscps);
($remainder >= 0 && $remainder < $numlines) || die "bad remainder $remainder";
# [just doing int() rounds down].
$n = 0;
for($scpidx = 0; $scpidx < @OUTPUTS; $scpidx++) {
$scpfile = $OUTPUTS[$scpidx];
($scpfile ne '-' ? open($o_fh, '>', $scpfile)
: open($o_fh, '>&', \*STDOUT)) ||
die "$0: Could not open scp file $scpfile for writing: $!\n";
for($k = 0; $k < $linesperscp + ($scpidx < $remainder ? 1 : 0); $k++) {
print $o_fh $F[$n++];
}
close($o_fh) || die "$0: Eror closing scp file $scpfile: $!\n";
}
$n == $numlines || die "$n != $numlines [code error]";
}
exit ($error);

View File

@ -0,0 +1,30 @@
#!/usr/bin/env bash
dev_num_utt=1000
echo "$0 $@"
. utils/parse_options.sh || exit 1;
train_data=$1
out_dir=$2
[ ! -f ${train_data}/wav.scp ] && echo "$0: no such file ${train_data}/wav.scp" && exit 1;
[ ! -f ${train_data}/text ] && echo "$0: no such file ${train_data}/text" && exit 1;
mkdir -p ${out_dir}/train && mkdir -p ${out_dir}/dev
cp ${train_data}/wav.scp ${out_dir}/train/wav.scp.bak
cp ${train_data}/text ${out_dir}/train/text.bak
num_utt=$(wc -l <${out_dir}/train/wav.scp.bak)
utils/shuffle_list.pl --srand 1 ${out_dir}/train/wav.scp.bak > ${out_dir}/train/wav.scp.shuf
head -n ${dev_num_utt} ${out_dir}/train/wav.scp.shuf > ${out_dir}/dev/wav.scp
tail -n $((${num_utt}-${dev_num_utt})) ${out_dir}/train/wav.scp.shuf > ${out_dir}/train/wav.scp
utils/shuffle_list.pl --srand 1 ${out_dir}/train/text.bak > ${out_dir}/train/text.shuf
head -n ${dev_num_utt} ${out_dir}/train/text.shuf > ${out_dir}/dev/text
tail -n $((${num_utt}-${dev_num_utt})) ${out_dir}/train/text.shuf > ${out_dir}/train/text
rm ${out_dir}/train/wav.scp.bak ${out_dir}/train/text.bak
rm ${out_dir}/train/wav.scp.shuf ${out_dir}/train/text.shuf

View File

@ -0,0 +1,135 @@
#!/usr/bin/env python3
# Copyright 2017 Johns Hopkins University (Shinji Watanabe)
# Apache 2.0 (http://www.apache.org/licenses/LICENSE-2.0)
import argparse
import codecs
import re
import sys
is_python2 = sys.version_info[0] == 2
def exist_or_not(i, match_pos):
start_pos = None
end_pos = None
for pos in match_pos:
if pos[0] <= i < pos[1]:
start_pos = pos[0]
end_pos = pos[1]
break
return start_pos, end_pos
def get_parser():
parser = argparse.ArgumentParser(
description="convert raw text to tokenized text",
formatter_class=argparse.ArgumentDefaultsHelpFormatter,
)
parser.add_argument(
"--nchar",
"-n",
default=1,
type=int,
help="number of characters to split, i.e., \
aabb -> a a b b with -n 1 and aa bb with -n 2",
)
parser.add_argument(
"--skip-ncols", "-s", default=0, type=int, help="skip first n columns"
)
parser.add_argument("--space", default="<space>", type=str, help="space symbol")
parser.add_argument(
"--non-lang-syms",
"-l",
default=None,
type=str,
help="list of non-linguistic symobles, e.g., <NOISE> etc.",
)
parser.add_argument("text", type=str, default=False, nargs="?", help="input text")
parser.add_argument(
"--trans_type",
"-t",
type=str,
default="char",
choices=["char", "phn"],
help="""Transcript type. char/phn. e.g., for TIMIT FADG0_SI1279 -
If trans_type is char,
read from SI1279.WRD file -> "bricks are an alternative"
Else if trans_type is phn,
read from SI1279.PHN file -> "sil b r ih sil k s aa r er n aa l
sil t er n ih sil t ih v sil" """,
)
return parser
def main():
parser = get_parser()
args = parser.parse_args()
rs = []
if args.non_lang_syms is not None:
with codecs.open(args.non_lang_syms, "r", encoding="utf-8") as f:
nls = [x.rstrip() for x in f.readlines()]
rs = [re.compile(re.escape(x)) for x in nls]
if args.text:
f = codecs.open(args.text, encoding="utf-8")
else:
f = codecs.getreader("utf-8")(sys.stdin if is_python2 else sys.stdin.buffer)
sys.stdout = codecs.getwriter("utf-8")(
sys.stdout if is_python2 else sys.stdout.buffer
)
line = f.readline()
n = args.nchar
while line:
x = line.split()
print(" ".join(x[: args.skip_ncols]), end=" ")
a = " ".join(x[args.skip_ncols :])
# get all matched positions
match_pos = []
for r in rs:
i = 0
while i >= 0:
m = r.search(a, i)
if m:
match_pos.append([m.start(), m.end()])
i = m.end()
else:
break
if args.trans_type == "phn":
a = a.split(" ")
else:
if len(match_pos) > 0:
chars = []
i = 0
while i < len(a):
start_pos, end_pos = exist_or_not(i, match_pos)
if start_pos is not None:
chars.append(a[start_pos:end_pos])
i = end_pos
else:
chars.append(a[i])
i += 1
a = chars
a = [a[j : j + n] for j in range(0, len(a), n)]
a_flat = []
for z in a:
a_flat.append("".join(z))
a_chars = [z.replace(" ", args.space) for z in a_flat]
if args.trans_type == "phn":
a_chars = [z.replace("sil", args.space) for z in a_chars]
print(" ".join(a_chars))
line = f.readline()
if __name__ == "__main__":
main()

View File

@ -0,0 +1,106 @@
import re
import argparse
def load_dict(seg_file):
seg_dict = {}
with open(seg_file, 'r') as infile:
for line in infile:
s = line.strip().split()
key = s[0]
value = s[1:]
seg_dict[key] = " ".join(value)
return seg_dict
def forward_segment(text, dic):
word_list = []
i = 0
while i < len(text):
longest_word = text[i]
for j in range(i + 1, len(text) + 1):
word = text[i:j]
if word in dic:
if len(word) > len(longest_word):
longest_word = word
word_list.append(longest_word)
i += len(longest_word)
return word_list
def tokenize(txt,
seg_dict):
out_txt = ""
pattern = re.compile(r"([\u4E00-\u9FA5A-Za-z0-9])")
for word in txt:
if pattern.match(word):
if word in seg_dict:
out_txt += seg_dict[word] + " "
else:
out_txt += "<unk>" + " "
else:
continue
return out_txt.strip()
def get_parser():
parser = argparse.ArgumentParser(
description="text tokenize",
formatter_class=argparse.ArgumentDefaultsHelpFormatter,
)
parser.add_argument(
"--text-file",
"-t",
default=False,
required=True,
type=str,
help="input text",
)
parser.add_argument(
"--seg-file",
"-s",
default=False,
required=True,
type=str,
help="seg file",
)
parser.add_argument(
"--txt-index",
"-i",
default=1,
required=True,
type=int,
help="txt index",
)
parser.add_argument(
"--output-dir",
"-o",
default=False,
required=True,
type=str,
help="output dir",
)
return parser
def main():
parser = get_parser()
args = parser.parse_args()
txt_writer = open("{}/text.{}.txt".format(args.output_dir, args.txt_index), 'w')
shape_writer = open("{}/len.{}".format(args.output_dir, args.txt_index), 'w')
seg_dict = load_dict(args.seg_file)
with open(args.text_file, 'r') as infile:
for line in infile:
s = line.strip().split()
text_id = s[0]
text_list = forward_segment("".join(s[1:]).lower(), seg_dict)
text = tokenize(text_list, seg_dict)
lens = len(text.strip().split())
txt_writer.write(text_id + " " + text + '\n')
shape_writer.write(text_id + " " + str(lens) + '\n')
if __name__ == '__main__':
main()

View File

@ -0,0 +1,35 @@
#!/usr/bin/env bash
# Begin configuration section.
nj=32
cmd=utils/run.pl
echo "$0 $@"
. utils/parse_options.sh || exit 1;
# tokenize configuration
text_dir=$1
seg_file=$2
logdir=$3
output_dir=$4
txt_dir=${output_dir}/txt; mkdir -p ${output_dir}/txt
mkdir -p ${logdir}
$cmd JOB=1:$nj $logdir/text_tokenize.JOB.log \
python utils/text_tokenize.py -t ${text_dir}/txt/text.JOB.txt \
-s ${seg_file} -i JOB -o ${txt_dir} \
|| exit 1;
# concatenate the text files together.
for n in $(seq $nj); do
cat ${txt_dir}/text.$n.txt || exit 1
done > ${output_dir}/text || exit 1
for n in $(seq $nj); do
cat ${txt_dir}/len.$n || exit 1
done > ${output_dir}/text_shape || exit 1
echo "$0: Succeeded text tokenize"

View File

@ -0,0 +1,834 @@
#!/usr/bin/env python3
# coding=utf-8
# Authors:
# 2019.5 Zhiyang Zhou (https://github.com/Joee1995/chn_text_norm.git)
# 2019.9 Jiayu DU
#
# requirements:
# - python 3.X
# notes: python 2.X WILL fail or produce misleading results
import sys, os, argparse, codecs, string, re
# ================================================================================ #
# basic constant
# ================================================================================ #
CHINESE_DIGIS = u'零一二三四五六七八九'
BIG_CHINESE_DIGIS_SIMPLIFIED = u'零壹贰叁肆伍陆柒捌玖'
BIG_CHINESE_DIGIS_TRADITIONAL = u'零壹貳參肆伍陸柒捌玖'
SMALLER_BIG_CHINESE_UNITS_SIMPLIFIED = u'十百千万'
SMALLER_BIG_CHINESE_UNITS_TRADITIONAL = u'拾佰仟萬'
LARGER_CHINESE_NUMERING_UNITS_SIMPLIFIED = u'亿兆京垓秭穰沟涧正载'
LARGER_CHINESE_NUMERING_UNITS_TRADITIONAL = u'億兆京垓秭穰溝澗正載'
SMALLER_CHINESE_NUMERING_UNITS_SIMPLIFIED = u'十百千万'
SMALLER_CHINESE_NUMERING_UNITS_TRADITIONAL = u'拾佰仟萬'
ZERO_ALT = u''
ONE_ALT = u''
TWO_ALTS = [u'', u'']
POSITIVE = [u'', u'']
NEGATIVE = [u'', u'']
POINT = [u'', u'']
# PLUS = [u'加', u'加']
# SIL = [u'杠', u'槓']
FILLER_CHARS = ['', '']
ER_WHITELIST = '(儿女|儿子|儿孙|女儿|儿媳|妻儿|' \
'胎儿|婴儿|新生儿|婴幼儿|幼儿|少儿|小儿|儿歌|儿童|儿科|托儿所|孤儿|' \
'儿戏|儿化|台儿庄|鹿儿岛|正儿八经|吊儿郎当|生儿育女|托儿带女|养儿防老|痴儿呆女|' \
'佳儿佳妇|儿怜兽扰|儿无常父|儿不嫌母丑|儿行千里母担忧|儿大不由爷|苏乞儿)'
# 中文数字系统类型
NUMBERING_TYPES = ['low', 'mid', 'high']
CURRENCY_NAMES = '(人民币|美元|日元|英镑|欧元|马克|法郎|加拿大元|澳元|港币|先令|芬兰马克|爱尔兰镑|' \
'里拉|荷兰盾|埃斯库多|比塞塔|印尼盾|林吉特|新西兰元|比索|卢布|新加坡元|韩元|泰铢)'
CURRENCY_UNITS = '((亿|千万|百万|万|千|百)|(亿|千万|百万|万|千|百|)元|(亿|千万|百万|万|千|百|)块|角|毛|分)'
COM_QUANTIFIERS = '(匹|张|座|回|场|尾|条|个|首|阙|阵|网|炮|顶|丘|棵|只|支|袭|辆|挑|担|颗|壳|窠|曲|墙|群|腔|' \
'砣|座|客|贯|扎|捆|刀|令|打|手|罗|坡|山|岭|江|溪|钟|队|单|双|对|出|口|头|脚|板|跳|枝|件|贴|' \
'针|线|管|名|位|身|堂|课|本|页|家|户|层|丝|毫|厘|分|钱|两|斤|担|铢|石|钧|锱|忽|(千|毫|微)克|' \
'毫|厘|分|寸|尺|丈|里|寻|常|铺|程|(千|分|厘|毫|微)米|撮|勺|合|升|斗|石|盘|碗|碟|叠|桶|笼|盆|' \
'盒|杯|钟|斛|锅|簋|篮|盘|桶|罐|瓶|壶|卮|盏|箩|箱|煲|啖|袋|钵|年|月|日|季|刻|时|周|天|秒|分|旬|' \
'纪|岁|世|更|夜|春|夏|秋|冬|代|伏|辈|丸|泡|粒|颗|幢|堆|条|根|支|道|面|片|张|颗|块)'
# punctuation information are based on Zhon project (https://github.com/tsroten/zhon.git)
CHINESE_PUNC_STOP = '!?。。'
CHINESE_PUNC_NON_STOP = '"#$%&'()*+,-/:;<=>@[\]^_`{|}~⦅⦆「」、、〃》「」『』【】〔〕〖〗〘〙〚〛〜〝〞〟〰〾〿–—‘’‛“”„‟…‧﹏'
CHINESE_PUNC_LIST = CHINESE_PUNC_STOP + CHINESE_PUNC_NON_STOP
# ================================================================================ #
# basic class
# ================================================================================ #
class ChineseChar(object):
"""
中文字符
每个字符对应简体和繁体,
e.g. 简体 = '', 繁体 = ''
转换时可转换为简体或繁体
"""
def __init__(self, simplified, traditional):
self.simplified = simplified
self.traditional = traditional
#self.__repr__ = self.__str__
def __str__(self):
return self.simplified or self.traditional or None
def __repr__(self):
return self.__str__()
class ChineseNumberUnit(ChineseChar):
"""
中文数字/数位字符
每个字符除繁简体外还有一个额外的大写字符
e.g. '' ''
"""
def __init__(self, power, simplified, traditional, big_s, big_t):
super(ChineseNumberUnit, self).__init__(simplified, traditional)
self.power = power
self.big_s = big_s
self.big_t = big_t
def __str__(self):
return '10^{}'.format(self.power)
@classmethod
def create(cls, index, value, numbering_type=NUMBERING_TYPES[1], small_unit=False):
if small_unit:
return ChineseNumberUnit(power=index + 1,
simplified=value[0], traditional=value[1], big_s=value[1], big_t=value[1])
elif numbering_type == NUMBERING_TYPES[0]:
return ChineseNumberUnit(power=index + 8,
simplified=value[0], traditional=value[1], big_s=value[0], big_t=value[1])
elif numbering_type == NUMBERING_TYPES[1]:
return ChineseNumberUnit(power=(index + 2) * 4,
simplified=value[0], traditional=value[1], big_s=value[0], big_t=value[1])
elif numbering_type == NUMBERING_TYPES[2]:
return ChineseNumberUnit(power=pow(2, index + 3),
simplified=value[0], traditional=value[1], big_s=value[0], big_t=value[1])
else:
raise ValueError(
'Counting type should be in {0} ({1} provided).'.format(NUMBERING_TYPES, numbering_type))
class ChineseNumberDigit(ChineseChar):
"""
中文数字字符
"""
def __init__(self, value, simplified, traditional, big_s, big_t, alt_s=None, alt_t=None):
super(ChineseNumberDigit, self).__init__(simplified, traditional)
self.value = value
self.big_s = big_s
self.big_t = big_t
self.alt_s = alt_s
self.alt_t = alt_t
def __str__(self):
return str(self.value)
@classmethod
def create(cls, i, v):
return ChineseNumberDigit(i, v[0], v[1], v[2], v[3])
class ChineseMath(ChineseChar):
"""
中文数位字符
"""
def __init__(self, simplified, traditional, symbol, expression=None):
super(ChineseMath, self).__init__(simplified, traditional)
self.symbol = symbol
self.expression = expression
self.big_s = simplified
self.big_t = traditional
CC, CNU, CND, CM = ChineseChar, ChineseNumberUnit, ChineseNumberDigit, ChineseMath
class NumberSystem(object):
"""
中文数字系统
"""
pass
class MathSymbol(object):
"""
用于中文数字系统的数学符号 (/简体), e.g.
positive = ['', '']
negative = ['', '']
point = ['', '']
"""
def __init__(self, positive, negative, point):
self.positive = positive
self.negative = negative
self.point = point
def __iter__(self):
for v in self.__dict__.values():
yield v
# class OtherSymbol(object):
# """
# 其他符号
# """
#
# def __init__(self, sil):
# self.sil = sil
#
# def __iter__(self):
# for v in self.__dict__.values():
# yield v
# ================================================================================ #
# basic utils
# ================================================================================ #
def create_system(numbering_type=NUMBERING_TYPES[1]):
"""
根据数字系统类型返回创建相应的数字系统默认为 mid
NUMBERING_TYPES = ['low', 'mid', 'high']: 中文数字系统类型
low: '' = '亿' * '' = $10^{9}$, '' = '' * '', etc.
mid: '' = '亿' * '' = $10^{12}$, '' = '' * '', etc.
high: '' = '亿' * '亿' = $10^{16}$, '' = '' * '', etc.
返回对应的数字系统
"""
# chinese number units of '亿' and larger
all_larger_units = zip(
LARGER_CHINESE_NUMERING_UNITS_SIMPLIFIED, LARGER_CHINESE_NUMERING_UNITS_TRADITIONAL)
larger_units = [CNU.create(i, v, numbering_type, False)
for i, v in enumerate(all_larger_units)]
# chinese number units of '十, 百, 千, 万'
all_smaller_units = zip(
SMALLER_CHINESE_NUMERING_UNITS_SIMPLIFIED, SMALLER_CHINESE_NUMERING_UNITS_TRADITIONAL)
smaller_units = [CNU.create(i, v, small_unit=True)
for i, v in enumerate(all_smaller_units)]
# digis
chinese_digis = zip(CHINESE_DIGIS, CHINESE_DIGIS,
BIG_CHINESE_DIGIS_SIMPLIFIED, BIG_CHINESE_DIGIS_TRADITIONAL)
digits = [CND.create(i, v) for i, v in enumerate(chinese_digis)]
digits[0].alt_s, digits[0].alt_t = ZERO_ALT, ZERO_ALT
digits[1].alt_s, digits[1].alt_t = ONE_ALT, ONE_ALT
digits[2].alt_s, digits[2].alt_t = TWO_ALTS[0], TWO_ALTS[1]
# symbols
positive_cn = CM(POSITIVE[0], POSITIVE[1], '+', lambda x: x)
negative_cn = CM(NEGATIVE[0], NEGATIVE[1], '-', lambda x: -x)
point_cn = CM(POINT[0], POINT[1], '.', lambda x,
y: float(str(x) + '.' + str(y)))
# sil_cn = CM(SIL[0], SIL[1], '-', lambda x, y: float(str(x) + '-' + str(y)))
system = NumberSystem()
system.units = smaller_units + larger_units
system.digits = digits
system.math = MathSymbol(positive_cn, negative_cn, point_cn)
# system.symbols = OtherSymbol(sil_cn)
return system
def chn2num(chinese_string, numbering_type=NUMBERING_TYPES[1]):
def get_symbol(char, system):
for u in system.units:
if char in [u.traditional, u.simplified, u.big_s, u.big_t]:
return u
for d in system.digits:
if char in [d.traditional, d.simplified, d.big_s, d.big_t, d.alt_s, d.alt_t]:
return d
for m in system.math:
if char in [m.traditional, m.simplified]:
return m
def string2symbols(chinese_string, system):
int_string, dec_string = chinese_string, ''
for p in [system.math.point.simplified, system.math.point.traditional]:
if p in chinese_string:
int_string, dec_string = chinese_string.split(p)
break
return [get_symbol(c, system) for c in int_string], \
[get_symbol(c, system) for c in dec_string]
def correct_symbols(integer_symbols, system):
"""
一百八 to 一百八十
一亿一千三百万 to 一亿 一千万 三百万
"""
if integer_symbols and isinstance(integer_symbols[0], CNU):
if integer_symbols[0].power == 1:
integer_symbols = [system.digits[1]] + integer_symbols
if len(integer_symbols) > 1:
if isinstance(integer_symbols[-1], CND) and isinstance(integer_symbols[-2], CNU):
integer_symbols.append(
CNU(integer_symbols[-2].power - 1, None, None, None, None))
result = []
unit_count = 0
for s in integer_symbols:
if isinstance(s, CND):
result.append(s)
unit_count = 0
elif isinstance(s, CNU):
current_unit = CNU(s.power, None, None, None, None)
unit_count += 1
if unit_count == 1:
result.append(current_unit)
elif unit_count > 1:
for i in range(len(result)):
if isinstance(result[-i - 1], CNU) and result[-i - 1].power < current_unit.power:
result[-i - 1] = CNU(result[-i - 1].power +
current_unit.power, None, None, None, None)
return result
def compute_value(integer_symbols):
"""
Compute the value.
When current unit is larger than previous unit, current unit * all previous units will be used as all previous units.
e.g. '两千万' = 2000 * 10000 not 2000 + 10000
"""
value = [0]
last_power = 0
for s in integer_symbols:
if isinstance(s, CND):
value[-1] = s.value
elif isinstance(s, CNU):
value[-1] *= pow(10, s.power)
if s.power > last_power:
value[:-1] = list(map(lambda v: v *
pow(10, s.power), value[:-1]))
last_power = s.power
value.append(0)
return sum(value)
system = create_system(numbering_type)
int_part, dec_part = string2symbols(chinese_string, system)
int_part = correct_symbols(int_part, system)
int_str = str(compute_value(int_part))
dec_str = ''.join([str(d.value) for d in dec_part])
if dec_part:
return '{0}.{1}'.format(int_str, dec_str)
else:
return int_str
def num2chn(number_string, numbering_type=NUMBERING_TYPES[1], big=False,
traditional=False, alt_zero=False, alt_one=False, alt_two=True,
use_zeros=True, use_units=True):
def get_value(value_string, use_zeros=True):
striped_string = value_string.lstrip('0')
# record nothing if all zeros
if not striped_string:
return []
# record one digits
elif len(striped_string) == 1:
if use_zeros and len(value_string) != len(striped_string):
return [system.digits[0], system.digits[int(striped_string)]]
else:
return [system.digits[int(striped_string)]]
# recursively record multiple digits
else:
result_unit = next(u for u in reversed(
system.units) if u.power < len(striped_string))
result_string = value_string[:-result_unit.power]
return get_value(result_string) + [result_unit] + get_value(striped_string[-result_unit.power:])
system = create_system(numbering_type)
int_dec = number_string.split('.')
if len(int_dec) == 1:
int_string = int_dec[0]
dec_string = ""
elif len(int_dec) == 2:
int_string = int_dec[0]
dec_string = int_dec[1]
else:
raise ValueError(
"invalid input num string with more than one dot: {}".format(number_string))
if use_units and len(int_string) > 1:
result_symbols = get_value(int_string)
else:
result_symbols = [system.digits[int(c)] for c in int_string]
dec_symbols = [system.digits[int(c)] for c in dec_string]
if dec_string:
result_symbols += [system.math.point] + dec_symbols
if alt_two:
liang = CND(2, system.digits[2].alt_s, system.digits[2].alt_t,
system.digits[2].big_s, system.digits[2].big_t)
for i, v in enumerate(result_symbols):
if isinstance(v, CND) and v.value == 2:
next_symbol = result_symbols[i +
1] if i < len(result_symbols) - 1 else None
previous_symbol = result_symbols[i - 1] if i > 0 else None
if isinstance(next_symbol, CNU) and isinstance(previous_symbol, (CNU, type(None))):
if next_symbol.power != 1 and ((previous_symbol is None) or (previous_symbol.power != 1)):
result_symbols[i] = liang
# if big is True, '两' will not be used and `alt_two` has no impact on output
if big:
attr_name = 'big_'
if traditional:
attr_name += 't'
else:
attr_name += 's'
else:
if traditional:
attr_name = 'traditional'
else:
attr_name = 'simplified'
result = ''.join([getattr(s, attr_name) for s in result_symbols])
# if not use_zeros:
# result = result.strip(getattr(system.digits[0], attr_name))
if alt_zero:
result = result.replace(
getattr(system.digits[0], attr_name), system.digits[0].alt_s)
if alt_one:
result = result.replace(
getattr(system.digits[1], attr_name), system.digits[1].alt_s)
for i, p in enumerate(POINT):
if result.startswith(p):
return CHINESE_DIGIS[0] + result
# ^10, 11, .., 19
if len(result) >= 2 and result[1] in [SMALLER_CHINESE_NUMERING_UNITS_SIMPLIFIED[0],
SMALLER_CHINESE_NUMERING_UNITS_TRADITIONAL[0]] and \
result[0] in [CHINESE_DIGIS[1], BIG_CHINESE_DIGIS_SIMPLIFIED[1], BIG_CHINESE_DIGIS_TRADITIONAL[1]]:
result = result[1:]
return result
# ================================================================================ #
# different types of rewriters
# ================================================================================ #
class Cardinal:
"""
CARDINAL类
"""
def __init__(self, cardinal=None, chntext=None):
self.cardinal = cardinal
self.chntext = chntext
def chntext2cardinal(self):
return chn2num(self.chntext)
def cardinal2chntext(self):
return num2chn(self.cardinal)
class Digit:
"""
DIGIT类
"""
def __init__(self, digit=None, chntext=None):
self.digit = digit
self.chntext = chntext
# def chntext2digit(self):
# return chn2num(self.chntext)
def digit2chntext(self):
return num2chn(self.digit, alt_two=False, use_units=False)
class TelePhone:
"""
TELEPHONE类
"""
def __init__(self, telephone=None, raw_chntext=None, chntext=None):
self.telephone = telephone
self.raw_chntext = raw_chntext
self.chntext = chntext
# def chntext2telephone(self):
# sil_parts = self.raw_chntext.split('<SIL>')
# self.telephone = '-'.join([
# str(chn2num(p)) for p in sil_parts
# ])
# return self.telephone
def telephone2chntext(self, fixed=False):
if fixed:
sil_parts = self.telephone.split('-')
self.raw_chntext = '<SIL>'.join([
num2chn(part, alt_two=False, use_units=False) for part in sil_parts
])
self.chntext = self.raw_chntext.replace('<SIL>', '')
else:
sp_parts = self.telephone.strip('+').split()
self.raw_chntext = '<SP>'.join([
num2chn(part, alt_two=False, use_units=False) for part in sp_parts
])
self.chntext = self.raw_chntext.replace('<SP>', '')
return self.chntext
class Fraction:
"""
FRACTION类
"""
def __init__(self, fraction=None, chntext=None):
self.fraction = fraction
self.chntext = chntext
def chntext2fraction(self):
denominator, numerator = self.chntext.split('分之')
return chn2num(numerator) + '/' + chn2num(denominator)
def fraction2chntext(self):
numerator, denominator = self.fraction.split('/')
return num2chn(denominator) + '分之' + num2chn(numerator)
class Date:
"""
DATE类
"""
def __init__(self, date=None, chntext=None):
self.date = date
self.chntext = chntext
# def chntext2date(self):
# chntext = self.chntext
# try:
# year, other = chntext.strip().split('年', maxsplit=1)
# year = Digit(chntext=year).digit2chntext() + '年'
# except ValueError:
# other = chntext
# year = ''
# if other:
# try:
# month, day = other.strip().split('月', maxsplit=1)
# month = Cardinal(chntext=month).chntext2cardinal() + '月'
# except ValueError:
# day = chntext
# month = ''
# if day:
# day = Cardinal(chntext=day[:-1]).chntext2cardinal() + day[-1]
# else:
# month = ''
# day = ''
# date = year + month + day
# self.date = date
# return self.date
def date2chntext(self):
date = self.date
try:
year, other = date.strip().split('', 1)
year = Digit(digit=year).digit2chntext() + ''
except ValueError:
other = date
year = ''
if other:
try:
month, day = other.strip().split('', 1)
month = Cardinal(cardinal=month).cardinal2chntext() + ''
except ValueError:
day = date
month = ''
if day:
day = Cardinal(cardinal=day[:-1]).cardinal2chntext() + day[-1]
else:
month = ''
day = ''
chntext = year + month + day
self.chntext = chntext
return self.chntext
class Money:
"""
MONEY类
"""
def __init__(self, money=None, chntext=None):
self.money = money
self.chntext = chntext
# def chntext2money(self):
# return self.money
def money2chntext(self):
money = self.money
pattern = re.compile(r'(\d+(\.\d+)?)')
matchers = pattern.findall(money)
if matchers:
for matcher in matchers:
money = money.replace(matcher[0], Cardinal(cardinal=matcher[0]).cardinal2chntext())
self.chntext = money
return self.chntext
class Percentage:
"""
PERCENTAGE类
"""
def __init__(self, percentage=None, chntext=None):
self.percentage = percentage
self.chntext = chntext
def chntext2percentage(self):
return chn2num(self.chntext.strip().strip('百分之')) + '%'
def percentage2chntext(self):
return '百分之' + num2chn(self.percentage.strip().strip('%'))
def remove_erhua(text, er_whitelist):
"""
去除儿化音词中的儿:
他女儿在那边儿 -> 他女儿在那边
"""
er_pattern = re.compile(er_whitelist)
new_str=''
while re.search('',text):
a = re.search('',text).span()
remove_er_flag = 0
if er_pattern.search(text):
b = er_pattern.search(text).span()
if b[0] <= a[0]:
remove_er_flag = 1
if remove_er_flag == 0 :
new_str = new_str + text[0:a[0]]
text = text[a[1]:]
else:
new_str = new_str + text[0:b[1]]
text = text[b[1]:]
text = new_str + text
return text
# ================================================================================ #
# NSW Normalizer
# ================================================================================ #
class NSWNormalizer:
def __init__(self, raw_text):
self.raw_text = '^' + raw_text + '$'
self.norm_text = ''
def _particular(self):
text = self.norm_text
pattern = re.compile(r"(([a-zA-Z]+)二([a-zA-Z]+))")
matchers = pattern.findall(text)
if matchers:
# print('particular')
for matcher in matchers:
text = text.replace(matcher[0], matcher[1]+'2'+matcher[2], 1)
self.norm_text = text
return self.norm_text
def normalize(self):
text = self.raw_text
# 规范化日期
pattern = re.compile(r"\D+((([089]\d|(19|20)\d{2})年)?(\d{1,2}月(\d{1,2}[日号])?)?)")
matchers = pattern.findall(text)
if matchers:
#print('date')
for matcher in matchers:
text = text.replace(matcher[0], Date(date=matcher[0]).date2chntext(), 1)
# 规范化金钱
pattern = re.compile(r"\D+((\d+(\.\d+)?)[多余几]?" + CURRENCY_UNITS + r"(\d" + CURRENCY_UNITS + r"?)?)")
matchers = pattern.findall(text)
if matchers:
#print('money')
for matcher in matchers:
text = text.replace(matcher[0], Money(money=matcher[0]).money2chntext(), 1)
# 规范化固话/手机号码
# 手机
# http://www.jihaoba.com/news/show/13680
# 移动139、138、137、136、135、134、159、158、157、150、151、152、188、187、182、183、184、178、198
# 联通130、131、132、156、155、186、185、176
# 电信133、153、189、180、181、177
pattern = re.compile(r"\D((\+?86 ?)?1([38]\d|5[0-35-9]|7[678]|9[89])\d{8})\D")
matchers = pattern.findall(text)
if matchers:
#print('telephone')
for matcher in matchers:
text = text.replace(matcher[0], TelePhone(telephone=matcher[0]).telephone2chntext(), 1)
# 固话
pattern = re.compile(r"\D((0(10|2[1-3]|[3-9]\d{2})-?)?[1-9]\d{6,7})\D")
matchers = pattern.findall(text)
if matchers:
# print('fixed telephone')
for matcher in matchers:
text = text.replace(matcher[0], TelePhone(telephone=matcher[0]).telephone2chntext(fixed=True), 1)
# 规范化分数
pattern = re.compile(r"(\d+/\d+)")
matchers = pattern.findall(text)
if matchers:
#print('fraction')
for matcher in matchers:
text = text.replace(matcher, Fraction(fraction=matcher).fraction2chntext(), 1)
# 规范化百分数
text = text.replace('', '%')
pattern = re.compile(r"(\d+(\.\d+)?%)")
matchers = pattern.findall(text)
if matchers:
#print('percentage')
for matcher in matchers:
text = text.replace(matcher[0], Percentage(percentage=matcher[0]).percentage2chntext(), 1)
# 规范化纯数+量词
pattern = re.compile(r"(\d+(\.\d+)?)[多余几]?" + COM_QUANTIFIERS)
matchers = pattern.findall(text)
if matchers:
#print('cardinal+quantifier')
for matcher in matchers:
text = text.replace(matcher[0], Cardinal(cardinal=matcher[0]).cardinal2chntext(), 1)
# 规范化数字编号
pattern = re.compile(r"(\d{4,32})")
matchers = pattern.findall(text)
if matchers:
#print('digit')
for matcher in matchers:
text = text.replace(matcher, Digit(digit=matcher).digit2chntext(), 1)
# 规范化纯数
pattern = re.compile(r"(\d+(\.\d+)?)")
matchers = pattern.findall(text)
if matchers:
#print('cardinal')
for matcher in matchers:
text = text.replace(matcher[0], Cardinal(cardinal=matcher[0]).cardinal2chntext(), 1)
self.norm_text = text
self._particular()
return self.norm_text.lstrip('^').rstrip('$')
def nsw_test_case(raw_text):
print('I:' + raw_text)
print('O:' + NSWNormalizer(raw_text).normalize())
print('')
def nsw_test():
nsw_test_case('固话0595-23865596或23880880。')
nsw_test_case('固话0595-23865596或23880880。')
nsw_test_case('手机:+86 19859213959或15659451527。')
nsw_test_case('分数32477/76391。')
nsw_test_case('百分数80.03%')
nsw_test_case('编号31520181154418。')
nsw_test_case('纯数2983.07克或12345.60米。')
nsw_test_case('日期1999年2月20日或09年3月15号。')
nsw_test_case('金钱12块534.5元20.1万')
nsw_test_case('特殊O2O或B2C。')
nsw_test_case('3456万吨')
nsw_test_case('2938个')
nsw_test_case('938')
nsw_test_case('今天吃了115个小笼包231个馒头')
nsw_test_case('有62的概率')
if __name__ == '__main__':
#nsw_test()
p = argparse.ArgumentParser()
p.add_argument('ifile', help='input filename, assume utf-8 encoding')
p.add_argument('ofile', help='output filename')
p.add_argument('--to_upper', action='store_true', help='convert to upper case')
p.add_argument('--to_lower', action='store_true', help='convert to lower case')
p.add_argument('--has_key', action='store_true', help="input text has Kaldi's key as first field.")
p.add_argument('--remove_fillers', type=bool, default=True, help='remove filler chars such as "呃, 啊"')
p.add_argument('--remove_erhua', type=bool, default=True, help='remove erhua chars such as "这儿"')
p.add_argument('--log_interval', type=int, default=10000, help='log interval in number of processed lines')
args = p.parse_args()
ifile = codecs.open(args.ifile, 'r', 'utf8')
ofile = codecs.open(args.ofile, 'w+', 'utf8')
n = 0
for l in ifile:
key = ''
text = ''
if args.has_key:
cols = l.split(maxsplit=1)
key = cols[0]
if len(cols) == 2:
text = cols[1].strip()
else:
text = ''
else:
text = l.strip()
# cases
if args.to_upper and args.to_lower:
sys.stderr.write('text norm: to_upper OR to_lower?')
exit(1)
if args.to_upper:
text = text.upper()
if args.to_lower:
text = text.lower()
# Filler chars removal
if args.remove_fillers:
for ch in FILLER_CHARS:
text = text.replace(ch, '')
if args.remove_erhua:
text = remove_erhua(text, ER_WHITELIST)
# NSW(Non-Standard-Word) normalization
text = NSWNormalizer(text).normalize()
# Punctuations removal
old_chars = CHINESE_PUNC_LIST + string.punctuation # includes all CN and EN punctuations
new_chars = ' ' * len(old_chars)
del_chars = ''
text = text.translate(str.maketrans(old_chars, new_chars, del_chars))
#
if args.has_key:
ofile.write(key + '\t' + text + '\n')
else:
ofile.write(text + '\n')
n += 1
if n % args.log_interval == 0:
sys.stderr.write("text norm: {} lines done.\n".format(n))
sys.stderr.write("text norm: {} lines done in total.\n".format(n))
ifile.close()
ofile.close()

View File

@ -0,0 +1,38 @@
# ModelScope: Paraformer-large Model
## Highlight
### ModelScope: Paraformer-Large Model
- <strong>Fast</strong>: Non-autoregressive (NAR) model, the Paraformer can achieve comparable performance to the state-of-the-art AR transformer, with more than 10x speedup.
- <strong>Accurate</strong>: SOTA in a lot of public ASR tasks, with a very significant relative improvement, capable of industrial implementation.
- <strong>Convenient</strong>: Quickly and easily download Paraformer-large from Modelscope for finetuning and inference.
- Support finetuning and inference on AISHELL-1 and AISHELL-2.
- Support inference on AISHELL-1, AISHELL-2, Wenetspeech, SpeechIO and other audio.
## How to finetune and infer using a pretrained ModelScope Paraformer-large Model
### Finetune
- Modify finetune training related parameters in `conf/train_asr_paraformer_sanm_50e_16d_2048_512_lfr6.yaml`
- Setting parameters in `paraformer_large_finetune.sh`
- <strong>data_aishell:</strong> please set the aishell data path
- <strong>tag:</strong> exp tag
- <strong>init_model_name:</strong> speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch # pre-trained model, download from modelscope during fine-tuning
- Then you can run the pipeline to finetune with our model download from modelscope and infer after finetune:
```sh
sh ./paraformer_large_finetune.sh
```
### Inference
Or you can download the model from ModelScope for inference directly.
- Setting parameters in `paraformer_large_infer.sh`
- <strong>ori_data:</strong> please set the aishell raw data path
- <strong>data_dir:</strong> data output dictionary
- <strong>exp_dir:</strong> the result path
- <strong>model_name:</strong> speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch # pre-trained model, download from modelscope
- <strong>test_sets:</strong> please set the testsets name
- Then you can run the pipeline to infer with:
```sh
sh ./paraformer_large_infer.sh
```

View File

@ -0,0 +1,24 @@
# Paraformer-Large
- Model link: <https://www.modelscope.cn/models/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary>
- Model size: 220M
- Train config: conf/train_asr_paraformer_sanm_50e_16d_2048_512_lfr6.yaml
# Environments
- date: `Tue Nov 22 18:48:39 CST 2022`
- python version: `3.7.12`
- FunASR version: `0.1.0`
- pytorch version: `pytorch 1.7.0`
- Git hash: ``
- Commit date: ``
# Beachmark Results
## AISHELL-1
- Decode config: conf/decode_asr_transformer_noctc_1best.yaml
- Decode without CTC
- Decode without LM
| testset | CER(%)|
|:---------:|:-----:|
| dev | 1.75 |
| test | 1.95 |

View File

@ -0,0 +1,6 @@
beam_size: 10
penalty: 0.0
maxlenratio: 0.0
minlenratio: 0.0
ctc_weight: 0.0
lm_weight: 0.15

View File

@ -0,0 +1,6 @@
beam_size: 1
penalty: 0.0
maxlenratio: 0.0
minlenratio: 0.0
ctc_weight: 0.0
lm_weight: 0.0

View File

@ -0,0 +1,91 @@
# network architecture
# encoder related
encoder_conf:
dropout_rate: 0.1
positional_dropout_rate: 0.1
attention_dropout_rate: 0.1
# decoder related
decoder_conf:
dropout_rate: 0.1
positional_dropout_rate: 0.1
self_attention_dropout_rate: 0.1
src_attention_dropout_rate: 0.1
predictor_conf:
threshold: 1.0
l_order: 1
r_order: 1
tail_threshold: 0.45
# hybrid CTC/attention
model_conf:
ctc_weight: 0.0
lsm_weight: 0.1 # label smoothing option
length_normalized_loss: true
predictor_weight: 1.0
predictor_bias: 1
sampling_ratio: 0.75
# minibatch related
# dataset_type: small
batch_type: length
batch_bins: 2000
num_workers: 16
# dataset_type: large
dataset_conf:
filter_conf:
min_length: 10
max_length: 250
min_token_length: 1
max_token_length: 200
shuffle: true
shuffle_conf:
shuffle_size: 10240
sort_size: 500
batch_conf:
batch_type: 'token'
batch_size: 6000
num_workers: 16
# optimization related
accum_grad: 1
grad_clip: 5
max_epoch: 20
val_scheduler_criterion:
- valid
- acc
best_model_criterion:
- - valid
- acc
- max
keep_nbest_models: 10
optim: adam
optim_conf:
lr: 0.0005
scheduler: warmuplr
scheduler_conf:
warmup_steps: 30000
specaug: specaug_lfr
specaug_conf:
apply_time_warp: false
time_warp_window: 5
time_warp_mode: bicubic
apply_freq_mask: true
freq_mask_width_range:
- 0
- 30
lfr_rate: 6
num_freq_mask: 1
apply_time_mask: true
time_mask_width_range:
- 0
- 12
num_time_mask: 1
unused_parameters: true
log_interval: 50
normalize: None
split_with_space: true

View File

@ -0,0 +1,66 @@
#!/bin/bash
# Copyright 2017 Xingyu Na
# Apache 2.0
#. ./path.sh || exit 1;
if [ $# != 3 ]; then
echo "Usage: $0 <audio-path> <text-path> <output-path>"
echo " $0 /export/a05/xna/data/data_aishell/wav /export/a05/xna/data/data_aishell/transcript data"
exit 1;
fi
aishell_audio_dir=$1
aishell_text=$2/aishell_transcript_v0.8.txt
output_dir=$3
train_dir=$output_dir/data/local/train
dev_dir=$output_dir/data/local/dev
test_dir=$output_dir/data/local/test
tmp_dir=$output_dir/data/local/tmp
mkdir -p $train_dir
mkdir -p $dev_dir
mkdir -p $test_dir
mkdir -p $tmp_dir
# data directory check
if [ ! -d $aishell_audio_dir ] || [ ! -f $aishell_text ]; then
echo "Error: $0 requires two directory arguments"
exit 1;
fi
# find wav audio file for train, dev and test resp.
find $aishell_audio_dir -iname "*.wav" > $tmp_dir/wav.flist
n=`cat $tmp_dir/wav.flist | wc -l`
[ $n -ne 141925 ] && \
echo Warning: expected 141925 data data files, found $n
grep -i "wav/train" $tmp_dir/wav.flist > $train_dir/wav.flist || exit 1;
grep -i "wav/dev" $tmp_dir/wav.flist > $dev_dir/wav.flist || exit 1;
grep -i "wav/test" $tmp_dir/wav.flist > $test_dir/wav.flist || exit 1;
rm -r $tmp_dir
# Transcriptions preparation
for dir in $train_dir $dev_dir $test_dir; do
echo Preparing $dir transcriptions
sed -e 's/\.wav//' $dir/wav.flist | awk -F '/' '{print $NF}' > $dir/utt.list
paste -d' ' $dir/utt.list $dir/wav.flist > $dir/wav.scp_all
utils/filter_scp.pl -f 1 $dir/utt.list $aishell_text > $dir/transcripts.txt
awk '{print $1}' $dir/transcripts.txt > $dir/utt.list
utils/filter_scp.pl -f 1 $dir/utt.list $dir/wav.scp_all | sort -u > $dir/wav.scp
sort -u $dir/transcripts.txt > $dir/text
done
mkdir -p $output_dir/data/train $output_dir/data/dev $output_dir/data/test
for f in wav.scp text; do
cp $train_dir/$f $output_dir/data/train/$f || exit 1;
cp $dev_dir/$f $output_dir/data/dev/$f || exit 1;
cp $test_dir/$f $output_dir/data/test/$f || exit 1;
done
echo "$0: AISHELL data preparation succeeded"
exit 0;

View File

@ -0,0 +1 @@
../../common/modelscope_utils

View File

@ -0,0 +1,224 @@
#!/usr/bin/env bash
. ./path.sh || exit 1;
# machines configuration
CUDA_VISIBLE_DEVICES="0,1" # set gpus, e.g., CUDA_VISIBLE_DEVICES="0,1"
gpu_num=2
count=1
gpu_inference=true # Whether to perform gpu decoding, set false for cpu decoding
njob=4 # the number of jobs for each gpu
train_cmd=utils/run.pl
# general configuration
feats_dir="." #feature output dictionary, for large data
exp_dir="."
lang=zh
dumpdir=dump/fbank
feats_type=fbank
token_type=char
scp=feats.scp
type=kaldi_ark
stage=0
stop_stage=4
# feature configuration
feats_dim=560
sample_frequency=16000
nj=32
speed_perturb="1.0"
lfr=True
lfr_m=7
lfr_n=6
init_model_name=speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch # pre-trained model, download from modelscope during fine-tuning
cmvn_file=init_model/${init_model_name}/am.mvn
seg_file=init_model/${init_model_name}/seg_dict
vocab=init_model/${init_model_name}/tokens.txt
# data
data_aishell=
# exp tag
tag=""
# Set bash to 'debug' mode, it will exit on :
# -e 'error', -u 'undefined variable', -o ... 'error in pipeline', -x 'print commands',
set -e
set -u
set -o pipefail
train_set=train
valid_set=dev
test_sets="dev test"
asr_config=conf/train_asr_paraformer_sanm_50e_16d_2048_512_lfr6.yaml
init_param="init_model/${init_model_name}/${init_model_name}"
inference_config=conf/decode_asr_transformer_noctc_1best.yaml
inference_asr_model=valid.acc.ave_10best.pth
. utils/parse_options.sh || exit 1;
# download model from modelscope
python modelscope_utils/download_model.py --model_name ${init_model_name}
if [ ! -d ${HOME}/.cache/modelscope/hub/damo/${init_model_name} ]; then
echo "${HOME}/.cache/modelscope/hub/damo/${init_model_name} must exist"
exit 1
else
if [ -d init_model/${init_model_name} ]; then
echo "init_model/${init_model_name} is already exists. if you want to decode again, please delete init_model/${init_model_name} first."
else
mkdir -p init_model/${init_model_name}
cp -r ${HOME}/.cache/modelscope/hub/damo/${init_model_name}/* init_model/${init_model_name}
fi
fi
model_dir="baseline_$(basename "${asr_config}" .yaml)_${feats_type}_${lang}_${token_type}_${tag}"
# you can set gpu num for decoding here
gpuid_list=$CUDA_VISIBLE_DEVICES # set gpus for decoding, the same as training stage by default
ngpu=$(echo $gpuid_list | awk -F "," '{print NF}')
inference_nj=$[${ngpu}*${njob}]
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
echo "stage 0: Data preparation"
# Data preparation
local/aishell_data_prep.sh ${data_aishell}/data_aishell/wav ${data_aishell}/data_aishell/transcript ${feats_dir}
for x in train dev test; do
cp ${feats_dir}/data/${x}/text ${feats_dir}/data/${x}/text.org
paste -d " " <(cut -f 1 -d" " ${feats_dir}/data/${x}/text.org) <(cut -f 2- -d" " ${feats_dir}/data/${x}/text.org | tr -d " ") \
> ${feats_dir}/data/${x}/text
rm ${feats_dir}/data/${x}/text.org
done
fi
feat_train_dir=${feats_dir}/${dumpdir}/train; mkdir -p ${feat_train_dir}
feat_dev_dir=${feats_dir}/${dumpdir}/dev; mkdir -p ${feat_dev_dir}
feat_test_dir=${feats_dir}/${dumpdir}/test; mkdir -p ${feat_test_dir}
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
echo "Feature Generation"
# compute fbank features
fbankdir=${feats_dir}/fbank
utils/compute_fbank.sh --cmd "$train_cmd" --nj $nj --speed_perturb ${speed_perturb} \
${feats_dir}/data/train ${exp_dir}/exp/make_fbank/train ${fbankdir}/train
utils/fix_data_feat.sh ${fbankdir}/train
utils/compute_fbank.sh --cmd "$train_cmd" --nj $nj \
${feats_dir}/data/dev ${exp_dir}/exp/make_fbank/dev ${fbankdir}/dev
utils/fix_data_feat.sh ${fbankdir}/dev
utils/compute_fbank.sh --cmd "$train_cmd" --nj $nj \
${feats_dir}/data/test ${exp_dir}/exp/make_fbank/test ${fbankdir}/test
utils/fix_data_feat.sh ${fbankdir}/test
echo "apply low_frame_rate and cmvn"
[ ! -f ${cmvn_file} ] && echo "$0: cmvn file is required" && exit 1;
utils/apply_lfr_and_cmvn.sh --cmd "$train_cmd" --nj $nj \
--lfr $lfr --lfr-m $lfr_m --lfr-n $lfr_n \
${fbankdir}/train ${cmvn_file} ${exp_dir}/exp/make_fbank/train ${feat_train_dir}
utils/apply_lfr_and_cmvn.sh --cmd "$train_cmd" --nj $nj \
--lfr $lfr --lfr-m $lfr_m --lfr-n $lfr_n \
${fbankdir}/dev ${cmvn_file} ${exp_dir}/exp/make_fbank/dev ${feat_dev_dir}
utils/apply_lfr_and_cmvn.sh --cmd "$train_cmd" --nj $nj \
--lfr $lfr --lfr-m $lfr_m --lfr-n $lfr_n \
${fbankdir}/test ${cmvn_file} ${exp_dir}/exp/make_fbank/test ${feat_test_dir}
echo "Text Tokenize"
# 我爱reading->我 爱 read@@ ing
utils/text_tokenize.sh --cmd "$train_cmd" --nj $nj ${fbankdir}/train ${seg_file} ${feat_train_dir}/log ${feat_train_dir}
utils/fix_data_feat.sh ${feat_train_dir}
utils/text_tokenize.sh --cmd "$train_cmd" --nj $nj ${fbankdir}/dev ${seg_file} ${feat_dev_dir}/log ${feat_dev_dir}
utils/fix_data_feat.sh ${feat_dev_dir}
cp ${fbankdir}/test/text ${feat_test_dir}
fi
token_list=${feats_dir}/data/${lang}_token_list/char/tokens.txt
echo "dictionary: ${token_list}"
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
echo "stage 2: Dictionary Preparation"
mkdir -p ${feats_dir}/data/${lang}_token_list/char/
cp $vocab ${token_list}
vocab_size=$(wc -l <${token_list})
awk -v v=,${vocab_size} '{print $0v}' ${feat_train_dir}/text_shape > ${feat_train_dir}/text_shape.char
awk -v v=,${vocab_size} '{print $0v}' ${feat_dev_dir}/text_shape > ${feat_dev_dir}/text_shape.char
mkdir -p ${feats_dir}/asr_stats_fbank_zh_char/train
mkdir -p ${feats_dir}/asr_stats_fbank_zh_char/dev
cp ${feat_train_dir}/speech_shape ${feat_train_dir}/text_shape ${feat_train_dir}/text_shape.char ${feats_dir}/asr_stats_fbank_zh_char/train
cp ${feat_dev_dir}/speech_shape ${feat_dev_dir}/text_shape ${feat_dev_dir}/text_shape.char ${feats_dir}/asr_stats_fbank_zh_char/dev
fi
# Training Stage
world_size=$gpu_num # run on one machine
if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
# update asr train config.yaml
python modelscope_utils/update_config.py --modelscope_config init_model/${init_model_name}/asr_train_config.yaml --finetune_config ${asr_config} --output_config init_model/${init_model_name}/asr_finetune_config.yaml
finetune_config=init_model/${init_model_name}/asr_finetune_config.yaml
mkdir -p ${exp_dir}/exp/${model_dir}
mkdir -p ${exp_dir}/exp/${model_dir}/log
INIT_FILE=$exp_dir/ddp_init
if [ -f $INIT_FILE ];then
rm -f $INIT_FILE
fi
init_method=file://$(readlink -f $INIT_FILE)
echo "$0: init method is $init_method"
for ((i = 0; i < $gpu_num; ++i)); do
{
rank=$i
local_rank=$i
gpu_id=$(echo $CUDA_VISIBLE_DEVICES | cut -d',' -f$[$i+1])
asr_train_paraformer.py \
--gpu_id $gpu_id \
--use_preprocessor true \
--token_type $token_type \
--token_list $token_list \
--train_data_path_and_name_and_type ${feats_dir}/${dumpdir}/${train_set}/${scp},speech,${type} \
--train_data_path_and_name_and_type ${feats_dir}/${dumpdir}/${train_set}/text,text,text \
--train_shape_file ${feats_dir}/asr_stats_fbank_zh_char/${train_set}/speech_shape \
--train_shape_file ${feats_dir}/asr_stats_fbank_zh_char/${train_set}/text_shape.char \
--valid_data_path_and_name_and_type ${feats_dir}/${dumpdir}/${valid_set}/${scp},speech,${type} \
--valid_data_path_and_name_and_type ${feats_dir}/${dumpdir}/${valid_set}/text,text,text \
--valid_shape_file ${feats_dir}/asr_stats_fbank_zh_char/${valid_set}/speech_shape \
--valid_shape_file ${feats_dir}/asr_stats_fbank_zh_char/${valid_set}/text_shape.char \
--resume true \
--output_dir ${exp_dir}/exp/${model_dir} \
--init_param $init_param \
--config $finetune_config \
--input_size $feats_dim \
--ngpu $gpu_num \
--num_worker_count $count \
--multiprocessing_distributed true \
--dist_init_method $init_method \
--dist_world_size $world_size \
--dist_rank $rank \
--local_rank $local_rank 1> ${exp_dir}/exp/${model_dir}/log/train.log.$i 2>&1
} &
done
wait
fi
# Testing Stage
if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
./utils/easy_asr_infer.sh \
--lang zh \
--datadir ${feats_dir} \
--feats_type ${feats_type} \
--feats_dim ${feats_dim} \
--token_type ${token_type} \
--gpu_inference ${gpu_inference} \
--inference_config "${inference_config}" \
--test_sets "${test_sets}" \
--token_list $token_list \
--asr_exp ${exp_dir}/exp/${model_dir} \
--stage 12 \
--stop_stage 12 \
--scp $scp \
--text text \
--inference_nj $inference_nj \
--njob $njob \
--inference_asr_model $inference_asr_model \
--gpuid_list $gpuid_list \
--mode paraformer
fi

View File

@ -0,0 +1,70 @@
#!/usr/bin/env bash
set -e
set -u
set -o pipefail
ori_data=
data_dir=
exp_dir=
model_name=speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch
inference_nj=32
gpuid_list="0,1" # set gpus, e.g., gpuid_list="0,1"
ngpu=$(echo $gpuid_list | awk -F "," '{print NF}')
njob=4 # the number of jobs for each gpu
gpu_inference=true # Whether to perform gpu decoding, set false for cpu decoding
if ${gpu_inference}; then
inference_nj=$[${ngpu}*${njob}]
else
inference_nj=$njob
fi
# LM configs
use_lm=false
beam_size=1
lm_weight=0.0
test_sets="dev test"
. utils/parse_options.sh
aishell_audio_dir=$ori_data/data_aishell/wav
aishell_text=$ori_data/data_aishell/transcript/aishell_transcript_v0.8.txt
dev_dir=${data_dir}/aishell/dev
test_dir=${data_dir}/aishell/test
tmp_dir=${data_dir}/aishell/tmp
mkdir -p ${dev_dir}
mkdir -p ${test_dir}
mkdir -p ${tmp_dir}
find $aishell_audio_dir -iname "*.wav" > $tmp_dir/wav.flist
grep -i "wav/dev" $tmp_dir/wav.flist > $dev_dir/wav.flist || exit 1;
grep -i "wav/test" $tmp_dir/wav.flist > $test_dir/wav.flist || exit 1;
rm -r $tmp_dir
for dir in $dev_dir $test_dir; do
sed -e 's/\.wav//' $dir/wav.flist | awk -F '/' '{print $NF}' > $dir/utt.list
paste -d' ' $dir/utt.list $dir/wav.flist > $dir/wav.scp_all
utils/filter_scp.pl -f 1 $dir/utt.list $aishell_text > $dir/transcripts.txt
awk '{print $1}' $dir/transcripts.txt > $dir/utt.list
utils/filter_scp.pl -f 1 $dir/utt.list $dir/wav.scp_all | sort -u > $dir/wav.scp
sort -u $dir/transcripts.txt > $dir/text
done
mkdir -p ${exp_dir}/aishell
modelscope_utils/modelscope_infer.sh \
--data_dir ${data_dir}/aishell \
--exp_dir ${exp_dir}/aishell \
--test_sets "${test_sets}" \
--model_name ${model_name} \
--inference_nj ${inference_nj} \
--gpuid_list ${gpuid_list} \
--njob ${njob} \
--gpu_inference ${gpu_inference} \
--use_lm ${use_lm} \
--beam_size ${beam_size} \
--lm_weight ${lm_weight}

View File

@ -0,0 +1,5 @@
export FUNASR_DIR=$PWD/../../..
# NOTE(kan-bayashi): Use UTF-8 in Python to avoid UnicodeDecodeError when LC_ALL=C
export PYTHONIOENCODING=UTF-8
export PATH=$FUNASR_DIR/funasr/bin:$PATH

View File

@ -0,0 +1 @@
../../../egs/aishell/tranformer/utils/

View File

@ -0,0 +1,39 @@
# ModelScope: Paraformer-large Model
## Highlight
### ModelScope: Paraformer-Large Model
- <strong>Fast</strong>: Non-autoregressive (NAR) model, the Paraformer can achieve comparable performance to the state-of-the-art AR transformer, with more than 10x speedup.
- <strong>Accurate</strong>: SOTA in a lot of public ASR tasks, with a very significant relative improvement, capable of industrial implementation.
- <strong>Convenient</strong>: Quickly and easily download Paraformer-large from Modelscope for finetuning and inference.
- Support finetuning and inference on AISHELL-1 and AISHELL-2.
- Support inference on AISHELL-1, AISHELL-2, Wenetspeech, SpeechIO and other audio.
## How to finetune and infer using a pretrained ModelScope Paraformer-large Model
### Finetune
- Modify finetune training related parameters in `conf/train_asr_paraformer_sanm_50e_16d_2048_512_lfr6.yaml`
- Setting parameters in `paraformer_large_finetune.sh`
- <strong>tr_dir:</strong> please set the aishell2 train data path
- <strong>dev_tst_dir:</strong> please set the aishell2 dev/test data path
- <strong>tag:</strong> exp tag
- <strong>init_model_name:</strong> speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch # pre-trained model, download from modelscope during fine-tuning
- Then you can run the pipeline to finetune with our model download from modelscope and infer after finetune:
```sh
sh ./paraformer_large_finetune.sh
```
### Inference
Or you can download the model from ModelScope for inference directly.
- Setting parameters in `paraformer_large_infer.sh`
- <strong>ori_data:</strong> please set the aishell2 dev/test raw data path
- <strong>data_dir:</strong> data output dictionary
- <strong>exp_dir:</strong> the result path
- <strong>model_name:</strong> speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch # pre-trained model, download from modelscope
- <strong>test_sets:</strong> please set the testsets name
- Then you can run the pipeline to infer with:
```sh
sh ./paraformer_large_infer.sh
```

View File

@ -0,0 +1,26 @@
# Paraformer-Large
- Model link: <https://www.modelscope.cn/models/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary>
- Model size: 220M
- Train config: conf/train_asr_paraformer_sanm_50e_16d_2048_512_lfr6.yaml
# Environments
- date: `Tue Nov 22 18:48:39 CST 2022`
- python version: `3.7.12`
- FunASR version: `0.1.0`
- pytorch version: `pytorch 1.7.0`
- Git hash: ``
- Commit date: ``
# Beachmark Results
## AISHELL-2
- Decode config: conf/decode_asr_transformer_noctc_1best.yaml
- Decode without CTC
- Decode without LM
| testset | CER(%)|
|:------------:|:-----:|
| dev_ios | 2.80 |
| test_android | 3.13 |
| test_ios | 2.85 |
| test_mic | 3.06 |

View File

@ -0,0 +1,6 @@
beam_size: 10
penalty: 0.0
maxlenratio: 0.0
minlenratio: 0.0
ctc_weight: 0.0
lm_weight: 0.15

View File

@ -0,0 +1,6 @@
beam_size: 1
penalty: 0.0
maxlenratio: 0.0
minlenratio: 0.0
ctc_weight: 0.0
lm_weight: 0.0

View File

@ -0,0 +1,91 @@
# network architecture
# encoder related
encoder_conf:
dropout_rate: 0.1
positional_dropout_rate: 0.1
attention_dropout_rate: 0.1
# decoder related
decoder_conf:
dropout_rate: 0.1
positional_dropout_rate: 0.1
self_attention_dropout_rate: 0.1
src_attention_dropout_rate: 0.1
predictor_conf:
threshold: 1.0
l_order: 1
r_order: 1
tail_threshold: 0.45
# hybrid CTC/attention
model_conf:
ctc_weight: 0.0
lsm_weight: 0.1 # label smoothing option
length_normalized_loss: true
predictor_weight: 1.0
predictor_bias: 1
sampling_ratio: 0.75
# minibatch related
# dataset_type: small
batch_type: length
batch_bins: 2000
num_workers: 16
# dataset_type: large
dataset_conf:
filter_conf:
min_length: 10
max_length: 250
min_token_length: 1
max_token_length: 200
shuffle: true
shuffle_conf:
shuffle_size: 10240
sort_size: 500
batch_conf:
batch_type: 'token'
batch_size: 6000
num_workers: 16
# optimization related
accum_grad: 1
grad_clip: 5
max_epoch: 20
val_scheduler_criterion:
- valid
- acc
best_model_criterion:
- - valid
- acc
- max
keep_nbest_models: 10
optim: adam
optim_conf:
lr: 0.0005
scheduler: warmuplr
scheduler_conf:
warmup_steps: 30000
specaug: specaug_lfr
specaug_conf:
apply_time_warp: false
time_warp_window: 5
time_warp_mode: bicubic
apply_freq_mask: true
freq_mask_width_range:
- 0
- 30
lfr_rate: 6
num_freq_mask: 1
apply_time_mask: true
time_mask_width_range:
- 0
- 12
num_time_mask: 1
unused_parameters: true
log_interval: 50
normalize: None
split_with_space: true

View File

@ -0,0 +1,53 @@
#!/usr/bin/env bash
# Copyright 2018 AIShell-Foundation(Authors:Jiayu DU, Xingyu NA, Bengu WU, Hao ZHENG)
# 2018 Beijing Shell Shell Tech. Co. Ltd. (Author: Hui BU)
# Apache 2.0
# transform raw AISHELL-2 data to kaldi format
. ./path.sh || exit 1;
tmp=
dir=
if [ $# != 3 ]; then
echo "Usage: $0 <corpus-data-dir> <tmp-dir> <output-dir>"
echo " $0 /export/AISHELL-2/iOS/train data/local/train data/train"
exit 1;
fi
corpus=$1
tmp=$2
dir=$3
echo "prepare_data.sh: Preparing data in $corpus"
mkdir -p $tmp
mkdir -p $dir
# corpus check
if [ ! -d $corpus ] || [ ! -f $corpus/wav.scp ] || [ ! -f $corpus/trans.txt ]; then
echo "Error: $0 requires wav.scp and trans.txt under $corpus directory."
exit 1;
fi
# validate utt-key list, IC0803W0380 is a bad utterance
awk '{print $1}' $corpus/wav.scp | grep -v 'IC0803W0380' > $tmp/wav_utt.list
awk '{print $1}' $corpus/trans.txt > $tmp/trans_utt.list
utils/filter_scp.pl -f 1 $tmp/wav_utt.list $tmp/trans_utt.list > $tmp/utt.list
# wav.scp
awk -F'\t' -v path_prefix=$corpus '{printf("%s\t%s/%s\n",$1,path_prefix,$2)}' $corpus/wav.scp > $tmp/tmp_wav.scp
utils/filter_scp.pl -f 1 $tmp/utt.list $tmp/tmp_wav.scp | sort -k 1 | uniq > $tmp/wav.scp
# text
utils/filter_scp.pl -f 1 $tmp/utt.list $corpus/trans.txt | sort -k 1 | uniq > $tmp/text
# copy prepared resources from tmp_dir to target dir
mkdir -p $dir
for f in wav.scp text; do
cp $tmp/$f $dir/$f || exit 1;
done
echo "local/prepare_data.sh succeeded"
exit 0;

View File

@ -0,0 +1 @@
../../common/modelscope_utils

View File

@ -0,0 +1,239 @@
#!/usr/bin/env bash
. ./path.sh || exit 1;
# machines configuration
CUDA_VISIBLE_DEVICES="0,1" # set gpus, e.g., CUDA_VISIBLE_DEVICES="0,1"
gpu_num=2
count=1
gpu_inference=true # Whether to perform gpu decoding, set false for cpu decoding
njob=4 # the number of jobs for each gpu
train_cmd=utils/run.pl
# general configuration
feats_dir="." #feature output dictionary, for large data
exp_dir="."
lang=zh
dumpdir=dump/fbank
feats_type=fbank
token_type=char
scp=feats.scp
type=kaldi_ark
stage=0
stop_stage=4
# feature configuration
feats_dim=560
sample_frequency=16000
nj=100
speed_perturb="1.0"
lfr=True
lfr_m=7
lfr_n=6
init_model_name=speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch # pre-trained model, download from modelscope during fine-tuning
cmvn_file=init_model/${init_model_name}/am.mvn
seg_file=init_model/${init_model_name}/seg_dict
vocab=init_model/${init_model_name}/tokens.txt
# data
tr_dir=
dev_tst_dir=
# exp tag
tag=""
# Set bash to 'debug' mode, it will exit on :
# -e 'error', -u 'undefined variable', -o ... 'error in pipeline', -x 'print commands',
set -e
set -u
set -o pipefail
train_set=train
valid_set=dev_ios
test_sets="dev_ios test_android test_ios test_mic"
asr_config=conf/train_asr_paraformer_sanm_50e_16d_2048_512_lfr6.yaml
init_param="init_model/${init_model_name}/${init_model_name}"
inference_config=conf/decode_asr_transformer_noctc_1best.yaml
inference_asr_model=valid.acc.ave_10best.pth
. utils/parse_options.sh || exit 1;
# download model from modelscope
python modelscope_utils/download_model.py --model_name ${init_model_name}
if [ ! -d ${HOME}/.cache/modelscope/hub/damo/${init_model_name} ]; then
echo "${HOME}/.cache/modelscope/hub/damo/${init_model_name} must exist"
exit 1
else
if [ -d init_model/${init_model_name} ]; then
echo "init_model/${init_model_name} is already exists. if you want to decode again, please delete init_model/${init_model_name} first."
else
mkdir -p init_model/${init_model_name}
cp -r ${HOME}/.cache/modelscope/hub/damo/${init_model_name}/* init_model/${init_model_name}
fi
fi
model_dir="baseline_$(basename "${asr_config}" .yaml)_${feats_type}_${lang}_${token_type}_${tag}"
# you can set gpu num for decoding here
gpuid_list=$CUDA_VISIBLE_DEVICES # set gpus for decoding, the same as training stage by default
ngpu=$(echo $gpuid_list | awk -F "," '{print NF}')
inference_nj=$[${ngpu}*${njob}]
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
echo "stage 0: Data preparation"
# For training set
local/aishell2_data_prep.sh ${tr_dir} ${feats_dir}/data/local/train ${feats_dir}/data/train || exit 1;
# # For dev and test set
for x in Android iOS Mic; do
local/aishell2_data_prep.sh ${dev_tst_dir}/${x}/dev ${feats_dir}/data/local/dev_${x,,} ${feats_dir}/data/dev_${x,,} || exit 1;
local/aishell2_data_prep.sh ${dev_tst_dir}/${x}/test ${feats_dir}/data/local/test_${x,,} ${feats_dir}/data/test_${x,,} || exit 1;
done
# Normalize text to capital letters
for x in train dev_android dev_ios dev_mic test_android test_ios test_mic; do
mv ${feats_dir}/data/${x}/text ${feats_dir}/data/${x}/text.org
paste -d " " <(cut -f 1 ${feats_dir}/data/${x}/text.org) <(cut -f 2- ${feats_dir}/data/${x}/text.org \
| tr 'A-Z' 'a-z' | tr -d " ") \
> ${feats_dir}/data/${x}/text
rm ${feats_dir}/data/${x}/text.org
done
fi
feat_train_dir=${feats_dir}/${dumpdir}/${train_set}; mkdir -p ${feat_train_dir}
feat_dev_dir=${feats_dir}/${dumpdir}/${valid_set}; mkdir -p ${feat_dev_dir}
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
echo "Feature Generation"
# compute fbank features
fbankdir=${feats_dir}/fbank
utils/compute_fbank.sh --cmd "$train_cmd" --nj $nj --speed_perturb ${speed_perturb} \
${feats_dir}/data/train ${exp_dir}/exp/make_fbank/train ${fbankdir}/train
utils/fix_data_feat.sh ${fbankdir}/train
for x in android ios mic; do
utils/compute_fbank.sh --cmd "$train_cmd" --nj $nj \
${feats_dir}/data/dev_${x} ${exp_dir}/exp/make_fbank/dev_${x} ${fbankdir}/dev_${x}
utils/fix_data_feat.sh ${fbankdir}/dev_${x}
utils/compute_fbank.sh --cmd "$train_cmd" --nj $nj \
${feats_dir}/data/test_${x} ${exp_dir}/exp/make_fbank/test_${x} ${fbankdir}/test_${x}
utils/fix_data_feat.sh ${fbankdir}/test_${x}
done
echo "apply low_frame_rate and cmvn"
[ ! -f ${cmvn_file} ] && echo "$0: cmvn file is required" && exit 1;
utils/apply_lfr_and_cmvn.sh --cmd "$train_cmd" --nj $nj \
--lfr $lfr --lfr-m $lfr_m --lfr-n $lfr_n \
${fbankdir}/${train_set} ${cmvn_file} ${exp_dir}/exp/make_fbank/train ${feat_train_dir}
utils/apply_lfr_and_cmvn.sh --cmd "$train_cmd" --nj $nj \
--lfr $lfr --lfr-m $lfr_m --lfr-n $lfr_n \
${fbankdir}/${valid_set} ${cmvn_file} ${exp_dir}/exp/make_fbank/dev ${feat_dev_dir}
for x in android ios mic; do
feat_test_dir=${feats_dir}/${dumpdir}/test_${x}; mkdir ${feat_test_dir}
utils/apply_lfr_and_cmvn.sh --cmd "$train_cmd" --nj $nj \
--lfr $lfr --lfr-m $lfr_m --lfr-n $lfr_n \
${fbankdir}/test_${x} ${cmvn_file} ${exp_dir}/exp/make_fbank/test_${x} ${feat_test_dir}
done
echo "Text Tokenize"
# 我爱reading->我 爱 read@@ ing
utils/text_tokenize.sh --cmd "$train_cmd" --nj $nj ${fbankdir}/${train_set} ${seg_file} ${feat_train_dir}/log ${feat_train_dir}
utils/fix_data_feat.sh ${feat_train_dir}
utils/text_tokenize.sh --cmd "$train_cmd" --nj $nj ${fbankdir}/${valid_set} ${seg_file} ${feat_dev_dir}/log ${feat_dev_dir}
utils/fix_data_feat.sh ${feat_dev_dir}
for x in android ios mic; do
feat_test_dir=${feats_dir}/${dumpdir}/test_${x}
cp ${fbankdir}/test_${x}/text ${feat_test_dir}
done
fi
token_list=${feats_dir}/data/${lang}_token_list/char/tokens.txt
echo "dictionary: ${token_list}"
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
echo "stage 2: Dictionary Preparation"
mkdir -p ${feats_dir}/data/${lang}_token_list/char/
cp $vocab ${token_list}
vocab_size=$(wc -l <${token_list})
awk -v v=,${vocab_size} '{print $0v}' ${feat_train_dir}/text_shape > ${feat_train_dir}/text_shape.char
awk -v v=,${vocab_size} '{print $0v}' ${feat_dev_dir}/text_shape > ${feat_dev_dir}/text_shape.char
mkdir -p ${feats_dir}/asr_stats_fbank_zh_char/train
mkdir -p ${feats_dir}/asr_stats_fbank_zh_char/dev_ios
cp ${feat_train_dir}/speech_shape ${feat_train_dir}/text_shape ${feat_train_dir}/text_shape.char ${feats_dir}/asr_stats_fbank_zh_char/${train_set}
cp ${feat_dev_dir}/speech_shape ${feat_dev_dir}/text_shape ${feat_dev_dir}/text_shape.char ${feats_dir}/asr_stats_fbank_zh_char/${valid_set}
fi
# Training Stage
world_size=$gpu_num # run on one machine
if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
# update asr train config.yaml
python modelscope_utils/update_config.py --modelscope_config init_model/${init_model_name}/asr_train_config.yaml --finetune_config ${asr_config} --output_config init_model/${init_model_name}/asr_finetune_config.yaml
finetune_config=init_model/${init_model_name}/asr_finetune_config.yaml
mkdir -p ${exp_dir}/exp/${model_dir}
mkdir -p ${exp_dir}/exp/${model_dir}/log
INIT_FILE=$exp_dir/ddp_init
if [ -f $INIT_FILE ];then
rm -f $INIT_FILE
fi
init_method=file://$(readlink -f $INIT_FILE)
echo "$0: init method is $init_method"
for ((i = 0; i < $gpu_num; ++i)); do
{
rank=$i
local_rank=$i
gpu_id=$(echo $CUDA_VISIBLE_DEVICES | cut -d',' -f$[$i+1])
asr_train_paraformer.py \
--gpu_id $gpu_id \
--use_preprocessor true \
--token_type $token_type \
--token_list $token_list \
--train_data_path_and_name_and_type ${feats_dir}/${dumpdir}/${train_set}/${scp},speech,${type} \
--train_data_path_and_name_and_type ${feats_dir}/${dumpdir}/${train_set}/text,text,text \
--train_shape_file ${feats_dir}/asr_stats_fbank_zh_char/${train_set}/speech_shape \
--train_shape_file ${feats_dir}/asr_stats_fbank_zh_char/${train_set}/text_shape.char \
--valid_data_path_and_name_and_type ${feats_dir}/${dumpdir}/${valid_set}/${scp},speech,${type} \
--valid_data_path_and_name_and_type ${feats_dir}/${dumpdir}/${valid_set}/text,text,text \
--valid_shape_file ${feats_dir}/asr_stats_fbank_zh_char/${valid_set}/speech_shape \
--valid_shape_file ${feats_dir}/asr_stats_fbank_zh_char/${valid_set}/text_shape.char \
--resume true \
--output_dir ${exp_dir}/exp/${model_dir} \
--init_param $init_param \
--config $finetune_config \
--input_size $feats_dim \
--ngpu $gpu_num \
--num_worker_count $count \
--multiprocessing_distributed true \
--dist_init_method $init_method \
--dist_world_size $world_size \
--dist_rank $rank \
--local_rank $local_rank 1> ${exp_dir}/exp/${model_dir}/log/train.log.$i 2>&1
} &
done
wait
fi
# Testing Stage
if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
./utils/easy_asr_infer.sh \
--lang zh \
--datadir ${feats_dir} \
--feats_type ${feats_type} \
--feats_dim ${feats_dim} \
--token_type ${token_type} \
--gpu_inference ${gpu_inference} \
--inference_config "${inference_config}" \
--test_sets "${test_sets}" \
--token_list $token_list \
--asr_exp ${exp_dir}/exp/${model_dir} \
--stage 12 \
--stop_stage 12 \
--scp $scp \
--text text \
--inference_nj $inference_nj \
--njob $njob \
--inference_asr_model $inference_asr_model \
--gpuid_list $gpuid_list \
--mode paraformer
fi

View File

@ -0,0 +1,56 @@
#!/usr/bin/env bash
set -e
set -u
set -o pipefail
ori_data=
data_dir=
exp_dir=
model_name=speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch
inference_nj=32
gpuid_list="0,1" # set gpus, e.g., gpuid_list="0,1"
ngpu=$(echo $gpuid_list | awk -F "," '{print NF}')
njob=4 # the number of jobs for each gpu
gpu_inference=true # Whether to perform gpu decoding, set false for cpu decoding
if ${gpu_inference}; then
inference_nj=$[${ngpu}*${njob}]
else
inference_nj=$njob
fi
use_lm=false
beam_size=1
lm_weight=0.0
test_sets="dev_ios test_android test_ios test_mic"
. utils/parse_options.sh
for x in Android iOS Mic; do
local/aishell2_data_prep.sh ${ori_data}/${x}/dev ${data_dir}/aishell2/local/dev_${x,,} ${data_dir}/aishell2/dev_${x,,} || exit 1;
local/aishell2_data_prep.sh ${ori_data}/${x}/test ${data_dir}/aishell2/local/test_${x,,} ${data_dir}/aishell2/test_${x,,} || exit 1;
done
for x in dev_android dev_ios dev_mic test_android test_ios test_mic; do
mv ${data_dir}/aishell2/${x}/text ${data_dir}/aishell2/${x}/text.org
paste -d " " <(cut -f 1 ${data_dir}/aishell2/${x}/text.org) <(cut -f 2- ${data_dir}/aishell2/${x}/text.org \
| tr 'A-Z' 'a-z' | tr -d " ") \
> ${data_dir}/aishell2/${x}/text
rm ${data_dir}/aishell2/${x}/text.org
done
mkdir -p ${exp_dir}/aishell2
modelscope_utils/modelscope_infer.sh \
--data_dir ${data_dir}/aishell2 \
--exp_dir ${exp_dir}/aishell2 \
--test_sets "${test_sets}" \
--model_name ${model_name} \
--inference_nj ${inference_nj} \
--gpuid_list ${gpuid_list} \
--njob ${njob} \
--gpu_inference ${gpu_inference} \
--use_lm ${use_lm} \
--beam_size ${beam_size} \
--lm_weight ${lm_weight}

View File

@ -0,0 +1,5 @@
export FUNASR_DIR=$PWD/../../..
# NOTE(kan-bayashi): Use UTF-8 in Python to avoid UnicodeDecodeError when LC_ALL=C
export PYTHONIOENCODING=UTF-8
export PATH=$FUNASR_DIR/funasr/bin:$PATH

View File

@ -0,0 +1 @@
../../../egs/aishell/tranformer/utils/

View File

@ -0,0 +1,27 @@
# ModelScope Model
## How to finetune and infer using a pretrained ModelScope Model
### Finetune
- Modify finetune training related parameters in `conf/train_asr_paraformer_sanm_50e_16d_2048_512_lfr6.yaml`
- Setting parameters in `modelscope_common_finetune.sh`
- <strong>dataset:</strong> the dataset dir needs to include files: train/wav.scp, train/text; optional dev/wav.scp, dev/text, test/wav.scp test/text
- <strong>tag:</strong> exp tag
- <strong>init_model_name:</strong> speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch # pre-trained model, download from modelscope during fine-tuning
- Then you can run the pipeline to finetune with our model download from modelscope:
```sh
sh ./modelscope_common_finetune.sh
```
### Inference
Or you can use the finetuned model for inference directly.
- Setting parameters in `modelscope_common_infer.sh`
- <strong>data_dir:</strong> # wav list, ${data_dir}/wav.scp
- <strong>exp_dir:</strong> the result path
- <strong>model_name:</strong> speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch # pre-trained model, download from modelscope
- Then you can run the pipeline to infer with:
```sh
sh ./modelscope_common_infer.sh
```

View File

@ -0,0 +1,6 @@
beam_size: 10
penalty: 0.0
maxlenratio: 0.0
minlenratio: 0.0
ctc_weight: 0.0
lm_weight: 0.15

View File

@ -0,0 +1,6 @@
beam_size: 1
penalty: 0.0
maxlenratio: 0.0
minlenratio: 0.0
ctc_weight: 0.0
lm_weight: 0.0

View File

@ -0,0 +1,91 @@
# network architecture
# encoder related
encoder_conf:
dropout_rate: 0.1
positional_dropout_rate: 0.1
attention_dropout_rate: 0.1
# decoder related
decoder_conf:
dropout_rate: 0.1
positional_dropout_rate: 0.1
self_attention_dropout_rate: 0.1
src_attention_dropout_rate: 0.1
predictor_conf:
threshold: 1.0
l_order: 1
r_order: 1
tail_threshold: 0.45
# hybrid CTC/attention
model_conf:
ctc_weight: 0.0
lsm_weight: 0.1 # label smoothing option
length_normalized_loss: true
predictor_weight: 1.0
predictor_bias: 1
sampling_ratio: 0.75
# minibatch related
# dataset_type: small
batch_type: length
batch_bins: 2000
num_workers: 16
# dataset_type: large
dataset_conf:
filter_conf:
min_length: 10
max_length: 250
min_token_length: 1
max_token_length: 200
shuffle: true
shuffle_conf:
shuffle_size: 10240
sort_size: 500
batch_conf:
batch_type: 'token'
batch_size: 6000
num_workers: 16
# optimization related
accum_grad: 1
grad_clip: 5
max_epoch: 20
val_scheduler_criterion:
- valid
- acc
best_model_criterion:
- - valid
- acc
- max
keep_nbest_models: 10
optim: adam
optim_conf:
lr: 0.0005
scheduler: warmuplr
scheduler_conf:
warmup_steps: 30000
specaug: specaug_lfr
specaug_conf:
apply_time_warp: false
time_warp_window: 5
time_warp_mode: bicubic
apply_freq_mask: true
freq_mask_width_range:
- 0
- 30
lfr_rate: 6
num_freq_mask: 1
apply_time_mask: true
time_mask_width_range:
- 0
- 12
num_time_mask: 1
unused_parameters: true
log_interval: 50
normalize: None
split_with_space: true

View File

@ -0,0 +1,230 @@
#!/usr/bin/env bash
. ./path.sh || exit 1;
# machines configuration
CUDA_VISIBLE_DEVICES="0,1" # set gpus, e.g., CUDA_VISIBLE_DEVICES="0,1"
gpu_num=2
count=1
gpu_inference=true # Whether to perform gpu decoding, set false for cpu decoding
njob=4 # the number of jobs for each gpu
train_cmd=utils/run.pl
# general configuration
feats_dir="." #feature output dictionary, for large data
exp_dir="."
lang=zh
dumpdir=dump/fbank
feats_type=fbank
token_type=char
scp=feats.scp
type=kaldi_ark
stage=1
stop_stage=4
# feature configuration
feats_dim=560
sample_frequency=16000
nj=32
speed_perturb="1.0"
lfr=True
lfr_m=7
lfr_n=6
init_model_name=speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch # pre-trained model, download from modelscope during fine-tuning
cmvn_file=init_model/${init_model_name}/am.mvn
seg_file=init_model/${init_model_name}/seg_dict
vocab=init_model/${init_model_name}/tokens.txt
# data
dataset= # dataset (include train/wav.scp, train/text, dev/wav.scp, dev/text, optional test/wav.scp test/text)
# exp tag
tag=""
# Set bash to 'debug' mode, it will exit on :
# -e 'error', -u 'undefined variable', -o ... 'error in pipeline', -x 'print commands',
set -e
set -u
set -o pipefail
train_set=train
valid_set=dev
test_sets="dev test"
asr_config=conf/train_asr_paraformer_sanm_50e_16d_2048_512_lfr6.yaml
init_param="init_model/${init_model_name}/${init_model_name}"
inference_config=conf/decode_asr_transformer_noctc_1best.yaml
inference_asr_model=valid.acc.ave_10best.pth
. utils/parse_options.sh || exit 1;
# download model from modelscope
python modelscope_utils/download_model.py --model_name ${init_model_name}
if [ ! -d ${HOME}/.cache/modelscope/hub/damo/${init_model_name} ]; then
echo "${HOME}/.cache/modelscope/hub/damo/${init_model_name} must exist"
exit 1
else
if [ -d init_model/${init_model_name} ]; then
echo "init_model/${init_model_name} is already exists. if you want to decode again, please delete init_model/${init_model_name} first."
else
mkdir -p init_model/${init_model_name}
cp -r ${HOME}/.cache/modelscope/hub/damo/${init_model_name}/* init_model/${init_model_name}
fi
fi
model_dir="baseline_$(basename "${asr_config}" .yaml)_${feats_type}_${lang}_${token_type}_${tag}"
# you can set gpu num for decoding here
gpuid_list=$CUDA_VISIBLE_DEVICES # set gpus for decoding, the same as training stage by default
ngpu=$(echo $gpuid_list | awk -F "," '{print NF}')
inference_nj=$[${ngpu}*${njob}]
[ ! -d ${dataset} ] && echo "$0: Training data is required" && exit 1;
[ ! -f ${dataset}/train/wav.scp ] && [ ! -f ${dataset}/train/text ] && echo "$0: Training data wav.scp or text is not found" && exit 1;
if [ ! -d "${dataset}/dev" ]; then
utils/fix_data.sh ${dataset}/train
utils/subset_data_dir_tr_cv.sh --dev-num-utt 1000 ${dataset}/train ${dataset}
fi
if [ ! -d "${dataset}/test" ]; then
test_sets="dev"
fi
feat_train_dir=${feats_dir}/${dumpdir}/train; mkdir -p ${feat_train_dir}
feat_dev_dir=${feats_dir}/${dumpdir}/dev; mkdir -p ${feat_dev_dir}
feat_test_dir=${feats_dir}/${dumpdir}/test; mkdir -p ${feat_test_dir}
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
echo "Feature Generation"
# compute fbank features
fbankdir=${feats_dir}/fbank
utils/compute_fbank.sh --cmd "$train_cmd" --nj $nj --speed_perturb ${speed_perturb} \
${dataset}/train ${exp_dir}/exp/make_fbank/train ${fbankdir}/train
utils/fix_data_feat.sh ${fbankdir}/train
utils/compute_fbank.sh --cmd "$train_cmd" --nj $nj \
${dataset}/dev ${exp_dir}/exp/make_fbank/dev ${fbankdir}/dev
utils/fix_data_feat.sh ${fbankdir}/dev
if [ -d "${dataset}/test" ]; then
utils/compute_fbank.sh --cmd "$train_cmd" --nj $nj \
${dataset}/test ${exp_dir}/exp/make_fbank/test ${fbankdir}/test
utils/fix_data_feat.sh ${fbankdir}/test
fi
echo "apply low_frame_rate and cmvn"
[ ! -f ${cmvn_file} ] && echo "$0: cmvn file is required" && exit 1;
utils/apply_lfr_and_cmvn.sh --cmd "$train_cmd" --nj $nj \
--lfr $lfr --lfr-m $lfr_m --lfr-n $lfr_n \
${fbankdir}/train ${cmvn_file} ${exp_dir}/exp/make_fbank/train ${feat_train_dir}
utils/apply_lfr_and_cmvn.sh --cmd "$train_cmd" --nj $nj \
--lfr $lfr --lfr-m $lfr_m --lfr-n $lfr_n \
${fbankdir}/dev ${cmvn_file} ${exp_dir}/exp/make_fbank/dev ${feat_dev_dir}
if [ -d "${dataset}/test" ]; then
utils/apply_lfr_and_cmvn.sh --cmd "$train_cmd" --nj $nj \
--lfr $lfr --lfr-m $lfr_m --lfr-n $lfr_n \
${fbankdir}/test ${cmvn_file} ${exp_dir}/exp/make_fbank/test ${feat_test_dir}
fi
echo "Text Tokenize"
# 我爱reading->我 爱 read@@ ing
utils/text_tokenize.sh --cmd "$train_cmd" --nj $nj ${fbankdir}/train ${seg_file} ${feat_train_dir}/log ${feat_train_dir}
utils/fix_data_feat.sh ${feat_train_dir}
utils/text_tokenize.sh --cmd "$train_cmd" --nj $nj ${fbankdir}/dev ${seg_file} ${feat_dev_dir}/log ${feat_dev_dir}
utils/fix_data_feat.sh ${feat_dev_dir}
if [ -d "${dataset}/test" ]; then
cp ${fbankdir}/test/text ${feat_test_dir}
fi
fi
token_list=${feats_dir}/data/${lang}_token_list/char/tokens.txt
echo "dictionary: ${token_list}"
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
echo "stage 2: Dictionary Preparation"
mkdir -p ${feats_dir}/data/${lang}_token_list/char/
cp $vocab ${token_list}
vocab_size=$(wc -l <${token_list})
awk -v v=,${vocab_size} '{print $0v}' ${feat_train_dir}/text_shape > ${feat_train_dir}/text_shape.char
awk -v v=,${vocab_size} '{print $0v}' ${feat_dev_dir}/text_shape > ${feat_dev_dir}/text_shape.char
mkdir -p ${feats_dir}/asr_stats_fbank_zh_char/train
mkdir -p ${feats_dir}/asr_stats_fbank_zh_char/dev
cp ${feat_train_dir}/speech_shape ${feat_train_dir}/text_shape ${feat_train_dir}/text_shape.char ${feats_dir}/asr_stats_fbank_zh_char/train
cp ${feat_dev_dir}/speech_shape ${feat_dev_dir}/text_shape ${feat_dev_dir}/text_shape.char ${feats_dir}/asr_stats_fbank_zh_char/dev
fi
# Training Stage
world_size=$gpu_num # run on one machine
if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
# update asr train config.yaml
python modelscope_utils/update_config.py --modelscope_config init_model/${init_model_name}/asr_train_config.yaml --finetune_config ${asr_config} --output_config init_model/${init_model_name}/asr_finetune_config.yaml
finetune_config=init_model/${init_model_name}/asr_finetune_config.yaml
mkdir -p ${exp_dir}/exp/${model_dir}
mkdir -p ${exp_dir}/exp/${model_dir}/log
INIT_FILE=$exp_dir/ddp_init
if [ -f $INIT_FILE ];then
rm -f $INIT_FILE
fi
init_method=file://$(readlink -f $INIT_FILE)
echo "$0: init method is $init_method"
for ((i = 0; i < $gpu_num; ++i)); do
{
rank=$i
local_rank=$i
gpu_id=$(echo $CUDA_VISIBLE_DEVICES | cut -d',' -f$[$i+1])
asr_train_paraformer.py \
--gpu_id $gpu_id \
--use_preprocessor true \
--token_type $token_type \
--token_list $token_list \
--train_data_path_and_name_and_type ${feats_dir}/${dumpdir}/${train_set}/${scp},speech,${type} \
--train_data_path_and_name_and_type ${feats_dir}/${dumpdir}/${train_set}/text,text,text \
--train_shape_file ${feats_dir}/asr_stats_fbank_zh_char/${train_set}/speech_shape \
--train_shape_file ${feats_dir}/asr_stats_fbank_zh_char/${train_set}/text_shape.char \
--valid_data_path_and_name_and_type ${feats_dir}/${dumpdir}/${valid_set}/${scp},speech,${type} \
--valid_data_path_and_name_and_type ${feats_dir}/${dumpdir}/${valid_set}/text,text,text \
--valid_shape_file ${feats_dir}/asr_stats_fbank_zh_char/${valid_set}/speech_shape \
--valid_shape_file ${feats_dir}/asr_stats_fbank_zh_char/${valid_set}/text_shape.char \
--resume true \
--output_dir ${exp_dir}/exp/${model_dir} \
--init_param $init_param \
--config $finetune_config \
--input_size $feats_dim \
--ngpu $gpu_num \
--num_worker_count $count \
--multiprocessing_distributed true \
--dist_init_method $init_method \
--dist_world_size $world_size \
--dist_rank $rank \
--local_rank $local_rank 1> ${exp_dir}/exp/${model_dir}/log/train.log.$i 2>&1
} &
done
wait
fi
# Testing Stage
if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
./utils/easy_asr_infer.sh \
--lang zh \
--datadir ${feats_dir} \
--feats_type ${feats_type} \
--feats_dim ${feats_dim} \
--token_type ${token_type} \
--gpu_inference ${gpu_inference} \
--inference_config "${inference_config}" \
--test_sets "${test_sets}" \
--token_list $token_list \
--asr_exp ${exp_dir}/exp/${model_dir} \
--stage 12 \
--stop_stage 12 \
--scp $scp \
--text text \
--inference_nj $inference_nj \
--njob $njob \
--inference_asr_model $inference_asr_model \
--gpuid_list $gpuid_list \
--mode paraformer
fi

View File

@ -0,0 +1,78 @@
#!/usr/bin/env bash
set -e
set -u
set -o pipefail
model_name=speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch # pre-trained model, download from modelscope
data_dir= # wav list, ${data_dir}/wav.scp
exp_dir="exp"
gpuid_list="0,1"
ngpu=$(echo $gpuid_list | awk -F "," '{print NF}')
njob=4
gpu_inference=true
decode_cmd=utils/run.pl
. utils/parse_options.sh
if ${gpu_inference}; then
inference_nj=$[${ngpu}*${njob}]
_ngpu=1
else
inference_nj=${njob}
_ngpu=0
fi
# LM configs
use_lm=false
beam_size=1
lm_weight=0.0
python modelscope_utils/download_model.py \
--model_name ${model_name}
if [ -d ${exp_dir} ]; then
echo "${exp_dir} is already exists. if you want to decode again, please delete ${exp_dir} first."
exit 1
else
mkdir -p ${exp_dir}/${model_name}
cp ${HOME}/.cache/modelscope/hub/damo/${model_name}/* ${exp_dir}/${model_name}/. -r
_dir=${exp_dir}/decode_asr
_logdir=${_dir}/logdir
mkdir -p "${_dir}"
mkdir -p "${_logdir}"
fi
for n in $(seq "${inference_nj}"); do
split_scps+=" ${_logdir}/keys.${n}.scp"
done
# shellcheck disable=SC2086
utils/split_scp.pl "${data_dir}/wav.scp" ${split_scps}
if "${use_lm}"; then
cp ${exp_dir}/${model_name}/decode_asr_transformer.yaml ${exp_dir}/${model_name}/decode_asr_transformer.yaml.back
cp ${exp_dir}/${model_name}/decode_asr_transformer_wav.yaml ${exp_dir}/${model_name}/decode_asr_transformer_wav.yaml.back
sed -i "s#beam_size: [0-9]*#beam_size: `echo $beam_size`#g" ${exp_dir}/${model_name}/decode_asr_transformer.yaml
sed -i "s#beam_size: [0-9]*#beam_size: `echo $beam_size`#g" ${exp_dir}/${model_name}/decode_asr_transformer_wav.yaml
sed -i "s#lm_weight: 0.[0-9]*#lm_weight: `echo $lm_weight`#g" ${exp_dir}/${model_name}/decode_asr_transformer.yaml
sed -i "s#lm_weight: 0.[0-9]*#lm_weight: `echo $lm_weight`#g" ${exp_dir}/${model_name}/decode_asr_transformer_wav.yaml
fi
echo "Decoding started... log: '${_logdir}/asr_inference.*.log'"
# shellcheck disable=SC2086
${decode_cmd} --max-jobs-run "${inference_nj}" JOB=1:"${inference_nj}" "${_logdir}"/asr_inference.JOB.log \
python -m funasr.bin.modelscope_infer \
--local_model_path ${exp_dir}/${model_name} \
--wav_list ${_logdir}/keys.JOB.scp \
--output_file ${_logdir}/text.JOB \
--gpuid_list ${gpuid_list} \
--njob ${njob} \
--ngpu ${_ngpu} \
for i in $(seq ${inference_nj}); do
cat ${_logdir}/text.${i}
done | sort -k1 >${_dir}/text
mv ${exp_dir}/${model_name}/decode_asr_transformer.yaml.back ${exp_dir}/${model_name}/decode_asr_transformer.yaml
mv ${exp_dir}/${model_name}/decode_asr_transformer_wav.yaml.back ${exp_dir}/${model_name}/decode_asr_transformer_wav.yaml

Some files were not shown because too many files have changed in this diff Show More