From af3f3dd5fd82823eaf64c0523b8c1b0fc79a8d8e Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?=E5=98=89=E6=B8=8A?=
Date: Wed, 31 May 2023 16:21:40 +0800
Subject: [PATCH] update repo

---
 docs/academic_recipe/asr_recipe.md | 266 -----------------------------
 1 file changed, 266 deletions(-)

diff --git a/docs/academic_recipe/asr_recipe.md b/docs/academic_recipe/asr_recipe.md
index 1cb7e74a8..5a11dc5ee 100644
--- a/docs/academic_recipe/asr_recipe.md
+++ b/docs/academic_recipe/asr_recipe.md
@@ -264,270 +264,4 @@ Users can use ModelScope for inference and fine-tuning based on a trained academ

### Decoding by CPU or GPU

We support CPU and GPU decoding. For CPU decoding, set `gpu_inference=false` and `njob` to specify the total number of CPU jobs. For GPU decoding, first set `gpu_inference=true`. Then set `gpuid_list` to specify which GPUs to use for decoding and `njob` to specify the number of decoding jobs on each GPU.

# Speech Recognition
In FunASR, we provide several ASR benchmarks, such as AISHELL, LibriSpeech and WenetSpeech, and support different model architectures, including conformer, paraformer and uniasr.

## Quick Start
After downloading and installing FunASR, users can use the provided recipes to easily reproduce the relevant experimental results. Here we take "paraformer on AISHELL-1" as an example.

First, move to the corresponding directory of the AISHELL-1 paraformer example:
```sh
cd egs/aishell/paraformer
```

Then you can directly start the recipe as follows:
```sh
conda activate funasr
. ./run.sh --CUDA_VISIBLE_DEVICES="0,1" --gpu_num=2
```

The training log files are saved in `${exp_dir}/exp/${model_dir}/log/train.log.*` and can be viewed with the following command:
```sh
vim exp/*_train_*/log/train.log.0
```

Users can observe the training loss, prediction accuracy and other training information, as follows:
```text
... 1epoch:train:751-800batch:800num_updates: ... loss_ctc=106.703, loss_att=86.877, acc=0.029, loss_pre=1.552 ...
... 1epoch:train:801-850batch:850num_updates: ... loss_ctc=107.890, loss_att=87.832, acc=0.029, loss_pre=1.702 ...
```

At the end of each epoch, the evaluation metrics are calculated on the validation set, as follows:
```text
... [valid] loss_ctc=99.914, cer_ctc=1.000, loss_att=80.512, acc=0.029, cer=0.971, wer=1.000, loss_pre=1.952, loss=88.285 ...
```

Users can also use TensorBoard to monitor this training information with the following command:
```sh
tensorboard --logdir ${exp_dir}/exp/${model_dir}/tensorboard/train
```
Here is an example of the loss curve.

The inference results are saved in `${exp_dir}/exp/${model_dir}/decode_asr_*/$dset`. The two main files are `text.cer` and `text.cer.txt`. `text.cer` stores the comparison between the recognized text and the reference text, as follows:
```text
...
BAC009S0764W0213(nwords=11,cor=11,ins=0,del=0,sub=0) corr=100.00%,cer=0.00%
ref: 构 建 良 好 的 旅 游 市 场 环 境
res: 构 建 良 好 的 旅 游 市 场 环 境
...
```
`text.cer.txt` stores the final results, as follows:
```text
%WER ...
%SER ...
Scored ... sentences, ...
```
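To check how a finished run went without opening an editor, the same log and score files can be inspected directly from the shell. The following is a minimal sketch assuming the default `exp/*_train_*` layout and a scored `test` set as shown above; adjust the paths and the utterance ID to your own setup:
```sh
# Per-epoch validation metrics (loss, acc, cer, wer) from the training log
grep "\[valid\]" exp/*_train_*/log/train.log.0

# Overall %WER / %SER summary produced by scoring
cat exp/*_train_*/decode_asr_*/test/text.cer.txt

# Reference/recognition comparison for a single utterance (the ref/res lines follow the summary line)
grep -A 2 "BAC009S0764W0213" exp/*_train_*/decode_asr_*/test/text.cer
```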
## Introduction
We provide a recipe `egs/aishell/paraformer/run.sh` for training a paraformer model on the AISHELL-1 dataset. This recipe is organized into the stages described below, supporting training on multiple GPUs and decoding by CPU or GPU. Before introducing each stage in detail, we first explain several parameters which should be set by users.

- `CUDA_VISIBLE_DEVICES`: `0,1` (Default), the list of visible GPUs
- `gpu_num`: `2` (Default), the number of GPUs used for training
- `gpu_inference`: `true` (Default), whether to use GPUs for decoding
- `njob`: `1` (Default), for CPU decoding, the total number of CPU jobs; for GPU decoding, the number of jobs on each GPU
- `raw_data`: the path of the raw AISHELL-1 dataset
- `feats_dir`: the path for saving processed data
- `token_type`: `char` (Default), indicates how to process text
- `type`: `sound` (Default), sets the input type
- `scp`: `wav.scp` (Default), sets the input file
- `nj`: `64` (Default), the number of jobs for data preparation
- `speed_perturb`: `"0.9, 1.0, 1.1"` (Default), the speed perturbation factors
- `exp_dir`: the path for saving experimental results
- `tag`: `exp1` (Default), the suffix of the experimental result directory
- `stage`: `0` (Default), start the recipe from the specified stage
- `stop_stage`: `5` (Default), stop the recipe at the specified stage

### Stage 0: Data preparation
This stage processes the raw AISHELL-1 dataset `$raw_data` and generates the corresponding `wav.scp` and `text` files in `$feats_dir/data/xxx`, where `xxx` is one of `train/dev/test`. Here we assume users have already downloaded the AISHELL-1 dataset. If not, users can download the data [here](https://www.openslr.org/33/) and set `$raw_data` accordingly. Examples of `wav.scp` and `text` are as follows:
* `wav.scp`
```
BAC009S0002W0122 /nfs/ASR_DATA/AISHELL-1/data_aishell/wav/train/S0002/BAC009S0002W0122.wav
BAC009S0002W0123 /nfs/ASR_DATA/AISHELL-1/data_aishell/wav/train/S0002/BAC009S0002W0123.wav
BAC009S0002W0124 /nfs/ASR_DATA/AISHELL-1/data_aishell/wav/train/S0002/BAC009S0002W0124.wav
...
```
* `text`
```
BAC009S0002W0122 而 对 楼 市 成 交 抑 制 作 用 最 大 的 限 购
BAC009S0002W0123 也 成 为 地 方 政 府 的 眼 中 钉
BAC009S0002W0124 自 六 月 底 呼 和 浩 特 市 率 先 宣 布 取 消 限 购 后
...
```
Both files have two columns: the first column contains the wav IDs, and the second column contains the corresponding wav paths or label tokens.
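Since every later stage keys on the utterance IDs in the first column, it can be worth verifying that `wav.scp` and `text` cover exactly the same utterances before training. The following is a minimal sketch using standard shell tools, assuming the `train` split under the `$feats_dir` layout described above:
```sh
# Compare the utterance-ID columns of wav.scp and text; any diff output indicates a mismatch.
diff <(awk '{print $1}' ${feats_dir}/data/train/wav.scp | sort) \
     <(awk '{print $1}' ${feats_dir}/data/train/text | sort)

# Spot-check that the first few referenced audio files actually exist on disk.
awk '{print $2}' ${feats_dir}/data/train/wav.scp | head -n 5 | xargs ls -l
```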
### Stage 1: Feature and CMVN Generation
This stage computes CMVN statistics on the `train` set, which are used in the following stages. Users can set `nj` to control the number of jobs for computing CMVN. The generated CMVN file is saved as `$feats_dir/data/train/cmvn/am.mvn`.

### Stage 2: Dictionary Preparation
This stage processes the dictionary, which is used as a mapping between label characters and integer indices during ASR training. The processed dictionary file is saved as `$feats_dir/data/$lang_token_list/$token_type/tokens.txt`. An example of `tokens.txt` is as follows:
```
<blank>
<s>
</s>
一
丁
...
龚
龟
<unk>
```
Four tokens must be specified:
* `<blank>`: (required) the blank token for CTC; it must be on the first line
* `<s>`: (required) the start-of-sentence token; it must be on the second line
* `</s>`: (required) the end-of-sentence token; it must be on the third line
* `<unk>`: (required) the out-of-vocabulary token; it must be on the last line

### Stage 3: LM Training

### Stage 4: ASR Training
This stage performs the training of the specified model. To start training, users should manually set `exp_dir` to specify the path for saving experimental results. By default, the best `$keep_nbest_models` checkpoints on the validation set are averaged to generate a better model, which is then used for decoding. FunASR implements `train.py` for training different models, and users can configure the following parameters if necessary. The training command is as follows:

```sh
train.py \
    --task_name asr \
    --use_preprocessor true \
    --token_list $token_list \
    --data_dir ${feats_dir}/data \
    --train_set ${train_set} \
    --valid_set ${valid_set} \
    --data_file_names "wav.scp,text" \
    --cmvn_file ${feats_dir}/data/${train_set}/cmvn/am.mvn \
    --speed_perturb ${speed_perturb} \
    --resume true \
    --output_dir ${exp_dir}/exp/${model_dir} \
    --config $asr_config \
    --ngpu $gpu_num \
    ...
```

* `task_name`: `asr` (Default), specifies the task type of the current recipe
* `ngpu`: `2` (Default), specifies the number of GPUs for training. When `ngpu > 1`, DistributedDataParallel training (DDP, see the details [here](https://pytorch.org/tutorials/intermediate/ddp_tutorial.html)) is enabled. Correspondingly, `CUDA_VISIBLE_DEVICES` should be set to specify which GPU IDs will be used.
* `use_preprocessor`: `true` (Default), specifies whether to apply pre-processing to each sample
* `token_list`: the path of the token list for training
* `dataset_type`: `small` (Default). FunASR supports the `small` dataset type for training on small datasets. Besides, an optional iterable-style DataLoader based on [PyTorch Iterable-style DataPipes](https://pytorch.org/data/beta/torchdata.datapipes.iter.html) is supported for large datasets, and users can specify `dataset_type=large` to enable it (see the sketch after this list).
* `data_dir`: the path of the data. Specifically, the training data is read from `$data_dir/data/$train_set` while the validation data is read from `$data_dir/data/$valid_set`
* `data_file_names`: `"wav.scp,text"`, specifies the speech and text file names for ASR
* `cmvn_file`: the path of the CMVN file
* `resume`: `true`, whether to enable checkpoint training (resuming from the latest checkpoint)
* `output_dir`: the path for saving training results
* `config`: the path of the configuration file, which is usually a YAML file in the `conf` directory. In FunASR, the training parameters, including model, optimization, dataset, etc., can also be set in this file. Note that if the same parameter is specified in both the recipe and the config file, the value from the recipe is used
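As a concrete illustration of the options above, here is a hypothetical variant of the training command for a 4-GPU DDP run with the iterable-style (`large`) dataset type. It only changes the documented options; the elided `...` arguments are the same as in the command above, and the exact flags should be checked against the `train.py` in your FunASR version:
```sh
# Hypothetical 4-GPU DDP training with the large (iterable-style) dataset type;
# all other arguments are the same as in the training command above.
export CUDA_VISIBLE_DEVICES="0,1,2,3"
train.py \
    --task_name asr \
    --ngpu 4 \
    --dataset_type large \
    --config $asr_config \
    --output_dir ${exp_dir}/exp/${model_dir} \
    ...
```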
### Stage 5: Decoding
This stage generates the recognition results and calculates the `CER` to verify the performance of the trained model.

* Mode Selection

As FunASR supports paraformer, uniasr, conformer and other models, a `mode` parameter should be specified as `asr/paraformer/uniasr` according to the trained model.

* Configuration

We support CTC decoding, attention decoding and hybrid CTC-attention decoding in FunASR, which can be selected via `ctc_weight` in a YAML file in the `conf` directory. Specifically, `ctc_weight=1.0` indicates CTC decoding, `ctc_weight=0.0` indicates attention decoding, and `0.0<ctc_weight<1.0` indicates hybrid CTC-attention decoding.

### How to load a pre-trained model

FunASR supports loading parameters from a pre-trained model via the `init_param` option, whose format is as follows:
```sh
train.py ... --init_param <file_path>:<src_key>:<dst_key>:<exclude_keys> ...
```
For example, the following command loads all pre-trained parameters whose names start with `decoder`, except `decoder.embed`, and assigns them to `model.decoder2`:
```shell
train.py ... --init_param model.pb:decoder:decoder2:decoder.embed ...
```
Besides, loading parameters from multiple pre-trained models is supported. For example, the following command loads encoder parameters from the pre-trained model1 and decoder parameters from the pre-trained model2:
```sh
train.py ... --init_param model1.pb:encoder --init_param model2.pb:decoder ...
```

### How to freeze part of the model parameters

In certain situations, users may want to fix part of the model parameters and update only the rest. FunASR employs `freeze_param` to achieve this. For example, to fix all parameters whose names start with `encoder.`, users can set `freeze_param` as follows:
```sh
train.py ... --freeze_param encoder ...
```

### ModelScope Usage

Users can use ModelScope for inference and fine-tuning based on a trained academic model. To achieve this, users need to run stage 6 in the script. In this stage, the relevant files required by ModelScope are generated automatically. Users can then use the corresponding ModelScope interface by replacing the model name with the path of the locally trained model. For the detailed usage of the ModelScope interface, please refer to [ModelScope Usage](https://alibaba-damo-academy.github.io/FunASR/en/modelscope_pipeline/quick_start.html).

### Decoding by CPU or GPU

We support CPU and GPU decoding. For CPU decoding, set `gpu_inference=false` and `njob` to specify the total number of CPU jobs. For GPU decoding, first set `gpu_inference=true`. Then set `gpuid_list` to specify which GPUs to use for decoding and `njob` to specify the number of decoding jobs on each GPU.
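The sketch below shows how these two settings might look in practice. It is a hypothetical invocation that assumes `run.sh` forwards these options on the command line as in the Quick Start example; if it does not, set the same variables at the top of `egs/aishell/paraformer/run.sh` instead:
```sh
# Hypothetical GPU decoding: 2 decoding jobs on each of GPU 0 and GPU 1 (decoding stage only)
. ./run.sh --stage=5 --stop_stage=5 --gpu_inference=true --gpuid_list="0,1" --njob=2

# Hypothetical CPU decoding: 32 CPU jobs in total
. ./run.sh --stage=5 --stop_stage=5 --gpu_inference=false --njob=32
```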