Merge branch 'main' into dev_smohan

This commit is contained in:
yhliang 2023-05-11 16:26:24 +08:00 committed by GitHub
commit 1d1ef01b4e
208 changed files with 3639 additions and 1953 deletions

.gitignore vendored

@@ -16,4 +16,6 @@ MaaS-lib
.egg*
dist
build
funasr.egg-info
docs/_build
modelscope


@@ -13,22 +13,22 @@
| [**Highlights**](#highlights)
| [**Installation**](#installation)
| [**Docs**](https://alibaba-damo-academy.github.io/FunASR/en/index.html)
| [**Tutorial_CN**](https://github.com/alibaba-damo-academy/FunASR/wiki#funasr%E7%94%A8%E6%88%B7%E6%89%8B%E5%86%8C)
| [**Papers**](https://github.com/alibaba-damo-academy/FunASR#citations)
| [**Runtime**](https://github.com/alibaba-damo-academy/FunASR/tree/main/funasr/runtime)
| [**Model Zoo**](https://github.com/alibaba-damo-academy/FunASR/blob/main/docs/model_zoo/modelscope_models.md)
| [**Contact**](#contact)
| [**M2MET2.0 Challenge**](https://github.com/alibaba-damo-academy/FunASR#multi-channel-multi-party-meeting-transcription-20-m2met20-challenge)
## What's new:
### Multi-Channel Multi-Party Meeting Transcription 2.0 (M2MeT2.0) Challenge
We are pleased to announce that the M2MeT2.0 challenge has been accepted as an ASRU 2023 challenge special session, and registration is now open. The baseline system is built on FunASR and is provided as a recipe for the AliMeeting corpus. For more details, see the M2MeT2.0 guidelines ([CN](https://alibaba-damo-academy.github.io/FunASR/m2met2_cn/index.html)/[EN](https://alibaba-damo-academy.github.io/FunASR/m2met2/index.html)).
### Release notes
For the release notes, please refer to [news](https://github.com/alibaba-damo-academy/FunASR/releases)
## Highlights
- FunASR supports speech recognition (ASR), multi-talker ASR, voice activity detection (VAD), punctuation restoration, language models, speaker verification and speaker diarization.
- We have released a large number of academic and industrial pretrained models on [ModelScope](https://www.modelscope.cn/models?page=1&tasks=auto-speech-recognition); refer to the [Model Zoo](https://github.com/alibaba-damo-academy/FunASR/blob/main/docs/model_zoo/modelscope_models.md)
- The pretrained model [Paraformer-large](https://www.modelscope.cn/models/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary) obtains the best performance on many tasks in [SpeechIO leaderboard](https://github.com/SpeechColab/Leaderboard)
- FunASR provides an easy-to-use pipeline to finetune pretrained models from [ModelScope](https://www.modelscope.cn/models?page=1&tasks=auto-speech-recognition)
- Compared to the [ESPnet](https://github.com/espnet/espnet) framework, training on large-scale datasets in FunASR is much faster owing to the optimized dataloader.
@@ -60,12 +60,8 @@ pip install -U modelscope
# pip install -U modelscope -f https://modelscope.oss-cn-beijing.aliyuncs.com/releases/repo.html -i https://mirror.sjtu.edu.cn/pypi/web/simple
```
For more details, please refer to [installation](https://alibaba-damo-academy.github.io/FunASR/en/installation/installation.html)
[//]: # ()
[//]: # (## Usage)
[//]: # (For users who are new to FunASR and ModelScope, please refer to FunASR Docs([CN](https://alibaba-damo-academy.github.io/FunASR/cn/index.html) / [EN](https://alibaba-damo-academy.github.io/FunASR/en/index.html)))
## Contact

docs/README.md Normal file

@@ -0,0 +1,19 @@
# FunASR document generation
## Generate HTML
For convenience, users can generate the HTML documentation locally.
First, install the following packages, which are required for building the HTML:
```sh
conda activate funasr
pip install requests sphinx nbsphinx sphinx_markdown_tables sphinx_rtd_theme recommonmark
```
Then you can generate HTML manually.
```sh
cd docs
make html
```
The generated files are all contained in the `FunASR/docs/_build` directory. You can browse the FunASR documentation by simply opening the `html/index.html` file in that directory in your browser.


@@ -1,129 +1,3 @@
# Speech Recognition
Here we take "Training a paraformer model from scratch using the AISHELL-1 dataset" as an example to introduce how to use FunASR. Following this example, users can similarly employ other datasets (such as AISHELL-2) to train other models (such as conformer or transformer).
## Overall Introduction
We provide a recipe, `egs/aishell/paraformer/run.sh`, for training a paraformer model on the AISHELL-1 dataset. This recipe consists of five stages, supports training on multiple GPUs, and supports decoding on CPU or GPU. Before introducing each stage in detail, we first explain several parameters that users should set.
- `CUDA_VISIBLE_DEVICES`: visible gpu list
- `gpu_num`: the number of GPUs used for training
- `gpu_inference`: whether to use GPUs for decoding
- `njob`: for CPU decoding, indicating the total number of CPU jobs; for GPU decoding, indicating the number of jobs on each GPU
- `data_aishell`: the raw path of AISHELL-1 dataset
- `feats_dir`: the path for saving processed data
- `nj`: the number of jobs for data preparation
- `speed_perturb`: the range of speed perturbation factors applied for data augmentation
- `exp_dir`: the path for saving experimental results
- `tag`: the suffix of experimental result directory
## Stage 0: Data preparation
This stage processes the raw AISHELL-1 dataset `$data_aishell` and generates the corresponding `wav.scp` and `text` files in `$feats_dir/data/xxx`, where `xxx` is one of `train/dev/test`. Here we assume users have already downloaded the AISHELL-1 dataset. If not, the data can be downloaded [here](https://www.openslr.org/33/) before setting `$data_aishell`. Examples of `wav.scp` and `text` are as follows:
* `wav.scp`
```
BAC009S0002W0122 /nfs/ASR_DATA/AISHELL-1/data_aishell/wav/train/S0002/BAC009S0002W0122.wav
BAC009S0002W0123 /nfs/ASR_DATA/AISHELL-1/data_aishell/wav/train/S0002/BAC009S0002W0123.wav
BAC009S0002W0124 /nfs/ASR_DATA/AISHELL-1/data_aishell/wav/train/S0002/BAC009S0002W0124.wav
...
```
* `text`
```
BAC009S0002W0122 而 对 楼 市 成 交 抑 制 作 用 最 大 的 限 购
BAC009S0002W0123 也 成 为 地 方 政 府 的 眼 中 钉
BAC009S0002W0124 自 六 月 底 呼 和 浩 特 市 率 先 宣 布 取 消 限 购 后
...
```
Both files have two columns: the first column contains the wav ids and the second column contains the corresponding wav paths or label tokens.
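The two-column format above is simple to consume programmatically. Below is a minimal, hypothetical Python sketch (not part of FunASR) that parses such files and checks that every labelled utterance has a matching wav entry:

```python
# Hypothetical helper (not part of FunASR): parse two-column files like wav.scp / text.
def parse_scp(lines):
    """Map the first column (utterance id) to the rest of the line."""
    entries = {}
    for line in lines:
        line = line.strip()
        if line:
            utt_id, _, rest = line.partition(" ")
            entries[utt_id] = rest
    return entries

wav_scp = ["BAC009S0002W0122 /nfs/ASR_DATA/AISHELL-1/data_aishell/wav/train/S0002/BAC009S0002W0122.wav"]
text = ["BAC009S0002W0122 而 对 楼 市 成 交 抑 制 作 用 最 大 的 限 购"]

wavs, texts = parse_scp(wav_scp), parse_scp(text)
# Every labelled utterance should have a corresponding wav path.
assert set(texts) <= set(wavs)
```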
## Stage 1: Feature Generation
This stage extracts FBank features from `wav.scp` and applies speed perturbation as data augmentation according to `speed_perturb`. Users can set `nj` to control the number of jobs for feature generation. The generated features are saved in `$feats_dir/dump/xxx/ark` and the corresponding `feats.scp` files are saved as `$feats_dir/dump/xxx/feats.scp`. An example of `feats.scp` is as follows:
* `feats.scp`
```
...
BAC009S0002W0122_sp0.9 /nfs/funasr_data/aishell-1/dump/fbank/train/ark/feats.16.ark:592751055
...
```
Note that the samples in this file have already been randomly shuffled. This file contains two columns: the first column is the wav ids and the second column is the Kaldi ark feature paths. Besides, `speech_shape` and `text_shape` are also generated in this stage, denoting the speech feature shape and the text length of each sample. Examples are shown as follows:
* `speech_shape`
```
...
BAC009S0002W0122_sp0.9 665,80
...
```
* `text_shape`
```
...
BAC009S0002W0122_sp0.9 15
...
```
Both files have two columns: the first column is the wav ids and the second column is the corresponding speech feature shape or text length.
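As an illustration of why these shape files are useful, the hypothetical sketch below (the `max_frames` threshold is made up, not a FunASR default) filters samples by length using only the shape entries, without re-reading the ark features:

```python
# Illustrative only: length filtering straight from the shape files.
speech_shape = {"BAC009S0002W0122_sp0.9": (665, 80)}  # (frames, feature dim)
text_shape = {"BAC009S0002W0122_sp0.9": 15}           # label length in tokens

max_frames = 3000  # assumed threshold for this sketch
keep = [
    utt for utt, (frames, _dim) in speech_shape.items()
    if frames <= max_frames and text_shape.get(utt, 0) > 0
]
```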
## Stage 2: Dictionary Preparation
This stage builds the dictionary, which maps label characters to integer indices during ASR training. The processed dictionary file is saved as `$feats_dir/data/$lang_token_list/$token_type/tokens.txt`. An example of `tokens.txt` is as follows:
* `tokens.txt`
```
<blank>
<s>
</s>
...
<unk>
```
* `<blank>`: indicates the blank token for CTC
* `<s>`: indicates the start-of-sentence token
* `</s>`: indicates the end-of-sentence token
* `<unk>`: indicates the out-of-vocabulary token
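A token's integer index is simply its line number in `tokens.txt`. The sketch below uses a tiny made-up vocabulary (not the real AISHELL-1 token list) to illustrate the mapping, with `<unk>` as the fallback for out-of-vocabulary characters:

```python
# Made-up vocabulary for illustration; in a real tokens.txt, a token's index
# is its zero-based line number.
tokens = ["<blank>", "<s>", "</s>", "而", "对", "楼", "<unk>"]
token2id = {tok: i for i, tok in enumerate(tokens)}
unk_id = token2id["<unk>"]

def encode(chars):
    """Map label characters to indices, falling back to <unk> for OOV characters."""
    return [token2id.get(c, unk_id) for c in chars]

assert encode(["而", "对", "市"]) == [3, 4, unk_id]  # "市" is OOV here
```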
## Stage 3: Training
This stage trains the specified model. To start training, users should manually set `exp_dir`, `CUDA_VISIBLE_DEVICES` and `gpu_num`, which have already been explained above. By default, the best `$keep_nbest_models` checkpoints on the validation set are averaged to generate a better model, which is then adopted for decoding.
* DDP Training
We support DistributedDataParallel (DDP) training; details can be found [here](https://pytorch.org/tutorials/intermediate/ddp_tutorial.html). To enable DDP training, set `gpu_num` to a value greater than 1. For example, if you set `CUDA_VISIBLE_DEVICES=0,1,5,6,7` and `gpu_num=3`, the GPUs with ids 0, 1 and 5 will be used for training.
* DataLoader
We support an optional iterable-style DataLoader based on [PyTorch Iterable-style DataPipes](https://pytorch.org/data/beta/torchdata.datapipes.iter.html) for large datasets; users can set `dataset_type=large` to enable it.
* Configuration
The training parameters, including the model, optimization and dataset settings, can be set in a YAML file in the `conf` directory. Users can also set them directly in the `run.sh` recipe; please avoid setting the same parameter in both places.
* Training Steps
We support two parameters to limit the training duration, namely `max_epoch` and `max_update`. `max_epoch` specifies the total number of training epochs, while `max_update` specifies the total number of training steps. If both are specified, training stops as soon as either limit is reached.
* Tensorboard
Users can use TensorBoard to monitor the loss, learning rate, etc. by running the following command:
```
tensorboard --logdir ${exp_dir}/exp/${model_dir}/tensorboard/train
```
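The interaction between `CUDA_VISIBLE_DEVICES` and `gpu_num` described in the DDP notes above can be sketched as follows (an illustrative helper, not FunASR code):

```python
# Illustrative helper: DDP training uses the first gpu_num entries of
# CUDA_VISIBLE_DEVICES.
def used_gpus(cuda_visible_devices, gpu_num):
    visible = cuda_visible_devices.split(",")
    if gpu_num > len(visible):
        raise ValueError("gpu_num exceeds the number of visible GPUs")
    return visible[:gpu_num]

# The example from the DDP notes above:
assert used_gpus("0,1,5,6,7", 3) == ["0", "1", "5"]
```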
## Stage 4: Decoding
This stage generates the recognition results and calculates the `CER` to verify the performance of the trained model.
* Mode Selection
Since FunASR supports paraformer, uniasr, conformer and other models, a `mode` parameter should be specified as one of `asr/paraformer/uniasr`, according to the trained model.
* Configuration
We support CTC decoding, attention decoding and hybrid CTC-attention decoding in FunASR, selected by `ctc_weight` in a YAML file in the `conf` directory. Specifically, `ctc_weight=1.0` means pure CTC decoding, `ctc_weight=0.0` means pure attention decoding, and `0.0<ctc_weight<1.0` means hybrid CTC-attention decoding.
* CPU/GPU Decoding
We support CPU and GPU decoding in FunASR. For CPU decoding, set `gpu_inference=False` and set `njob` to specify the total number of CPU decoding jobs. For GPU decoding, set `gpu_inference=True`, set `gpuid_list` to indicate which GPUs are used for decoding, and set `njob` to indicate the number of decoding jobs on each GPU.
* Performance
We adopt `CER` to evaluate performance. The results are saved in `$exp_dir/exp/$model_dir/$decoding_yaml_name/$average_model_name/$dset` as `text.cer` and `text.cer.txt`: `text.cer` stores the comparison between the recognized text and the reference text, while `text.cer.txt` stores the final `CER`. The following is an example of `text.cer`:
* `text.cer`
```
...
BAC009S0764W0213(nwords=11,cor=11,ins=0,del=0,sub=0) corr=100.00%,cer=0.00%
ref: 构 建 良 好 的 旅 游 市 场 环 境
res: 构 建 良 好 的 旅 游 市 场 环 境
...
```
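For intuition, `CER` is the Levenshtein edit distance between the recognized and reference character sequences, divided by the reference length. A self-contained sketch (not FunASR's actual scoring script) using the example above:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (single-row DP)."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            # dp[j]: delete; dp[j-1]: insert; prev: match/substitute.
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (r != h))
    return dp[-1]

def cer(ref_chars, hyp_chars):
    """Character error rate in percent, relative to the reference length."""
    return 100.0 * edit_distance(ref_chars, hyp_chars) / len(ref_chars)

ref = list("构建良好的旅游市场环境")
res = list("构建良好的旅游市场环境")
assert cer(ref, res) == 0.0  # matches the cer=0.00% line in the example above
```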
@@ -1,129 +1,2 @@
# Punctuation Restoration
@@ -1,129 +1,2 @@
# Speaker Diarization
@@ -1,129 +1,2 @@
# Speaker Verification
Here we take "Training a paraformer model from scratch using the AISHELL-1 dataset" as an example to introduce how to use FunASR. According to this example, users can similarly employ other datasets (such as AISHELL-2 dataset, etc.) to train other models (such as conformer, transformer, etc.).
## Overall Introduction
We provide a recipe `egs/aishell/paraformer/run.sh` for training a paraformer model on AISHELL-1 dataset. This recipe consists of five stages, supporting training on multiple GPUs and decoding by CPU or GPU. Before introducing each stage in detail, we first explain several parameters which should be set by users.
- `CUDA_VISIBLE_DEVICES`: visible gpu list
- `gpu_num`: the number of GPUs used for training
- `gpu_inference`: whether to use GPUs for decoding
- `njob`: for CPU decoding, indicating the total number of CPU jobs; for GPU decoding, indicating the number of jobs on each GPU
- `data_aishell`: the raw path of AISHELL-1 dataset
- `feats_dir`: the path for saving processed data
- `nj`: the number of jobs for data preparation
- `speed_perturb`: the range of speech perturbed
- `exp_dir`: the path for saving experimental results
- `tag`: the suffix of experimental result directory
## Stage 0: Data preparation
This stage processes raw AISHELL-1 dataset `$data_aishell` and generates the corresponding `wav.scp` and `text` in `$feats_dir/data/xxx`. `xxx` means `train/dev/test`. Here we assume users have already downloaded AISHELL-1 dataset. If not, users can download data [here](https://www.openslr.org/33/) and set the path for `$data_aishell`. The examples of `wav.scp` and `text` are as follows:
* `wav.scp`
```
BAC009S0002W0122 /nfs/ASR_DATA/AISHELL-1/data_aishell/wav/train/S0002/BAC009S0002W0122.wav
BAC009S0002W0123 /nfs/ASR_DATA/AISHELL-1/data_aishell/wav/train/S0002/BAC009S0002W0123.wav
BAC009S0002W0124 /nfs/ASR_DATA/AISHELL-1/data_aishell/wav/train/S0002/BAC009S0002W0124.wav
...
```
* `text`
```
BAC009S0002W0122 而 对 楼 市 成 交 抑 制 作 用 最 大 的 限 购
BAC009S0002W0123 也 成 为 地 方 政 府 的 眼 中 钉
BAC009S0002W0124 自 六 月 底 呼 和 浩 特 市 率 先 宣 布 取 消 限 购 后
...
```
These two files both have two columns, while the first column is wav ids and the second column is the corresponding wav paths/label tokens.
## Stage 1: Feature Generation
This stage extracts FBank features from `wav.scp` and apply speed perturbation as data augmentation according to `speed_perturb`. Users can set `nj` to control the number of jobs for feature generation. The generated features are saved in `$feats_dir/dump/xxx/ark` and the corresponding `feats.scp` files are saved as `$feats_dir/dump/xxx/feats.scp`. An example of `feats.scp` can be seen as follows:
* `feats.scp`
```
...
BAC009S0002W0122_sp0.9 /nfs/funasr_data/aishell-1/dump/fbank/train/ark/feats.16.ark:592751055
...
```
Note that samples in this file have already been shuffled randomly. This file contains two columns. The first column is wav ids while the second column is kaldi-ark feature paths. Besides, `speech_shape` and `text_shape` are also generated in this stage, denoting the speech feature shape and text length of each sample. The examples are shown as follows:
* `speech_shape`
```
...
BAC009S0002W0122_sp0.9 665,80
...
```
* `text_shape`
```
...
BAC009S0002W0122_sp0.9 15
...
```
These two files have two columns. The first column is wav ids and the second column is the corresponding speech feature shape and text length.
## Stage 2: Dictionary Preparation
This stage processes the dictionary, which is used as a mapping between label characters and integer indices during ASR training. The processed dictionary file is saved as `$feats_dir/data/$lang_toekn_list/$token_type/tokens.txt`. An example of `tokens.txt` is as follows:
* `tokens.txt`
```
<blank>
<s>
</s>
...
<unk>
```
* `<blank>`: indicates the blank token for CTC
* `<s>`: indicates the start-of-sentence token
* `</s>`: indicates the end-of-sentence token
* `<unk>`: indicates the out-of-vocabulary token
## Stage 3: Training
This stage achieves the training of the specified model. To start training, users should manually set `exp_dir`, `CUDA_VISIBLE_DEVICES` and `gpu_num`, which have already been explained above. By default, the best `$keep_nbest_models` checkpoints on validation dataset will be averaged to generate a better model and adopted for decoding.
* DDP Training
We support the DistributedDataParallel (DDP) training and the detail can be found [here](https://pytorch.org/tutorials/intermediate/ddp_tutorial.html). To enable DDP training, please set `gpu_num` greater than 1. For example, if you set `CUDA_VISIBLE_DEVICES=0,1,5,6,7` and `gpu_num=3`, then the gpus with ids 0, 1 and 5 will be used for training.
* DataLoader
We support an optional iterable-style DataLoader based on [PyTorch Iterable-style DataPipes](https://pytorch.org/data/beta/torchdata.datapipes.iter.html) for large datasets; users can set `dataset_type=large` to enable it.
* Configuration
The training parameters, covering the model, optimization, dataset, etc., can be set in a YAML file in the `conf` directory. Users can also set parameters directly in the `run.sh` recipe. Please avoid setting the same parameter in both the YAML file and the recipe.
* Training Steps
We support two parameters to control the training duration, namely `max_epoch` and `max_update`. `max_epoch` specifies the total number of training epochs, while `max_update` specifies the total number of training steps. If both are set, training stops as soon as either limit is reached.
* Tensorboard
Users can use TensorBoard to monitor the loss, learning rate, etc. by running the following command:
```
tensorboard --logdir ${exp_dir}/exp/${model_dir}/tensorboard/train
```
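The GPU-selection and stopping rules described above can be sketched in a few lines of Python (an illustrative toy, not FunASR internals; `used_gpus` and `should_stop` are names invented here):

```python
# Which GPUs DDP training uses, given CUDA_VISIBLE_DEVICES and gpu_num
# (mirrors the example values in the text above).
def used_gpus(cuda_visible_devices, gpu_num):
    visible = [int(g) for g in cuda_visible_devices.split(",")]
    return visible[:gpu_num]

# Combined max_epoch / max_update rule: stop when either limit is reached.
def should_stop(epoch, update, max_epoch, max_update):
    return epoch >= max_epoch or update >= max_update

print(used_gpus("0,1,5,6,7", gpu_num=3))  # [0, 1, 5]

# Toy training loop: with 100 steps per epoch, max_update=250 triggers first.
epoch, update, steps_per_epoch = 0, 0, 100
while not should_stop(epoch, update, max_epoch=10, max_update=250):
    update += 1
    if update % steps_per_epoch == 0:
        epoch += 1
print(epoch, update)  # stops after 250 updates, during epoch 2
```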
## Stage 4: Decoding
This stage generates the recognition results and calculates the `CER` to verify the performance of the trained model.
* Mode Selection
As FunASR supports paraformer, uniasr, conformer and other models, a `mode` parameter should be specified as `asr`, `paraformer` or `uniasr` according to the trained model.
* Configuration
We support CTC decoding, attention decoding and hybrid CTC-attention decoding in FunASR, selected by `ctc_weight` in a YAML file in the `conf` directory. Specifically, `ctc_weight=1.0` means pure CTC decoding, `ctc_weight=0.0` means pure attention decoding, and `0.0<ctc_weight<1.0` means hybrid CTC-attention decoding.
* CPU/GPU Decoding
We support both CPU and GPU decoding in FunASR. For CPU decoding, set `gpu_inference=False` and set `njob` to specify the total number of CPU decoding jobs. For GPU decoding, set `gpu_inference=True`, set `gpuid_list` to indicate which GPUs are used for decoding, and set `njob` to indicate the number of decoding jobs on each GPU.
* Performance
We adopt `CER` to measure performance. The results are saved in `$exp_dir/exp/$model_dir/$decoding_yaml_name/$average_model_name/$dset` as `text.cer` and `text.cer.txt`. `text.cer` stores the alignment between the recognized text and the reference text, while `text.cer.txt` stores the final `CER`. The following is an example of `text.cer`:
* `text.cer`
```
...
BAC009S0764W0213(nwords=11,cor=11,ins=0,del=0,sub=0) corr=100.00%,cer=0.00%
ref: 构 建 良 好 的 旅 游 市 场 环 境
res: 构 建 良 好 的 旅 游 市 场 环 境
...
```
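As a hedged illustration of the decoding options above, the snippet below sketches the `ctc_weight` interpolation and a character-level `CER` via edit distance (toy scores and strings; not FunASR's actual scorer):

```python
# Hybrid CTC-attention rescoring: interpolate the two log-probabilities with
# ctc_weight, as described in the Configuration bullet above.
def hybrid_score(ctc_logp, att_logp, ctc_weight):
    return ctc_weight * ctc_logp + (1.0 - ctc_weight) * att_logp

# Character error rate via Levenshtein distance, in the spirit of text.cer.
def edit_distance(ref, hyp):
    # dp[i][j]: edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[-1][-1]

def cer(ref, hyp):
    return 100.0 * edit_distance(ref, hyp) / max(len(ref), 1)

print(hybrid_score(-10.0, -8.0, ctc_weight=1.0))  # CTC only: -10.0
print(hybrid_score(-10.0, -8.0, ctc_weight=0.0))  # attention only: -8.0

ref = "构建良好的旅游市场环境"
print(cer(ref, ref))  # 0.0, matching the cer=0.00% example above
```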
@ -1,129 +1,2 @@
# Voice Activity Detection
Here we take "Training a paraformer model from scratch using the AISHELL-1 dataset" as an example to introduce how to use FunASR. Following this example, users can similarly employ other datasets (e.g., AISHELL-2) to train other models (e.g., conformer or transformer).
## Overall Introduction
We provide a recipe `egs/aishell/paraformer/run.sh` for training a paraformer model on the AISHELL-1 dataset. This recipe consists of five stages, supporting training on multiple GPUs and decoding on CPU or GPU. Before introducing each stage in detail, we first explain several parameters that users should set.
- `CUDA_VISIBLE_DEVICES`: visible gpu list
- `gpu_num`: the number of GPUs used for training
- `gpu_inference`: whether to use GPUs for decoding
- `njob`: for CPU decoding, indicating the total number of CPU jobs; for GPU decoding, indicating the number of jobs on each GPU
- `data_aishell`: the raw path of AISHELL-1 dataset
- `feats_dir`: the path for saving processed data
- `nj`: the number of jobs for data preparation
- `speed_perturb`: the speed perturbation factors applied to the speech
- `exp_dir`: the path for saving experimental results
- `tag`: the suffix of experimental result directory
## Stage 0: Data preparation
This stage processes the raw AISHELL-1 dataset `$data_aishell` and generates the corresponding `wav.scp` and `text` files in `$feats_dir/data/xxx`, where `xxx` is one of `train`, `dev` and `test`. Here we assume users have already downloaded the AISHELL-1 dataset. If not, the data can be downloaded [here](https://www.openslr.org/33/) and the path set via `$data_aishell`. Examples of `wav.scp` and `text` are as follows:
* `wav.scp`
```
BAC009S0002W0122 /nfs/ASR_DATA/AISHELL-1/data_aishell/wav/train/S0002/BAC009S0002W0122.wav
BAC009S0002W0123 /nfs/ASR_DATA/AISHELL-1/data_aishell/wav/train/S0002/BAC009S0002W0123.wav
BAC009S0002W0124 /nfs/ASR_DATA/AISHELL-1/data_aishell/wav/train/S0002/BAC009S0002W0124.wav
...
```
* `text`
```
BAC009S0002W0122 而 对 楼 市 成 交 抑 制 作 用 最 大 的 限 购
BAC009S0002W0123 也 成 为 地 方 政 府 的 眼 中 钉
BAC009S0002W0124 自 六 月 底 呼 和 浩 特 市 率 先 宣 布 取 消 限 购 后
...
```
Both files have two columns: the first column is the wav ID and the second column is the corresponding wav path or label tokens.
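A minimal sketch of reading these two-column files (illustrative only; `read_two_column` is a name invented here, not a FunASR helper):

```python
# Parse a Kaldi-style two-column file: first token is the ID, the rest is the value.
def read_two_column(lines):
    table = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        key, value = line.split(maxsplit=1)
        table[key] = value
    return table

wav_scp = read_two_column(
    ["BAC009S0002W0122 /nfs/ASR_DATA/AISHELL-1/data_aishell/wav/train/S0002/BAC009S0002W0122.wav"])
text = read_two_column(
    ["BAC009S0002W0122 而 对 楼 市 成 交 抑 制 作 用 最 大 的 限 购"])

print(wav_scp["BAC009S0002W0122"].endswith(".wav"))   # True
print(len(text["BAC009S0002W0122"].split()))          # 15 label tokens
```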
@ -17,8 +17,8 @@ Overview
:maxdepth: 1
:caption: Installation
./installation.md
./docker.md
./installation/installation.md
./installation/docker.md
.. toctree::
:maxdepth: 1
@ -44,6 +44,7 @@ Overview
./modelscope_pipeline/tp_pipeline.md
./modelscope_pipeline/sv_pipeline.md
./modelscope_pipeline/sd_pipeline.md
./modelscope_pipeline/itn_pipeline.md
.. toctree::
:maxdepth: 1
@ -56,8 +57,8 @@ Overview
:maxdepth: 1
:caption: Model Zoo
./modelscope_models.md
./huggingface_models.md
./model_zoo/modelscope_models.md
./model_zoo/huggingface_models.md
.. toctree::
:maxdepth: 1
@ -85,25 +86,25 @@ Overview
:maxdepth: 1
:caption: Funasr Library
./build_task.md
./reference/build_task.md
.. toctree::
:maxdepth: 1
:caption: Papers
./papers.md
./reference/papers.md
.. toctree::
:maxdepth: 1
:caption: Application
./application.md
./reference/application.md
.. toctree::
:maxdepth: 1
:caption: FQA
./FQA.md
./reference/FQA.md
Indices and tables


@ -1,13 +1,34 @@
# Baseline
## Overview
We will release an E2E SA-ASR~\cite{kanda21b_interspeech} baseline conducted on [FunASR](https://github.com/alibaba-damo-academy/FunASR) at the time according to the timeline. The model architecture is shown in Figure 3. The SpeakerEncoder is initialized with a pre-trained speaker verification model from ModelScope. This speaker verification model is also be used to extract the speaker embedding in the speaker profile.
We will release an E2E SA-ASR baseline built on [FunASR](https://github.com/alibaba-damo-academy/FunASR) according to the timeline. The model architecture is shown in Figure 3. The SpeakerEncoder is initialized with a pre-trained speaker verification model from ModelScope. This speaker verification model is also used to extract the speaker embedding in the speaker profile.
![model architecture](images/sa_asr_arch.png)
## Quick start
#TODO: fill with the README.md of the baseline
To run the baseline, you first need to install FunASR and ModelScope ([installation](https://alibaba-damo-academy.github.io/FunASR/en/installation.html)).
There are two startup scripts, `run.sh` for training and evaluating on the old eval and test sets, and `run_m2met_2023_infer.sh` for inference on the new test set of the Multi-Channel Multi-Party Meeting Transcription 2.0 ([M2MeT2.0](https://alibaba-damo-academy.github.io/FunASR/m2met2/index.html)) Challenge.
Before running `run.sh`, you must manually download and unpack the [AliMeeting](http://www.openslr.org/119/) corpus and place it in the `./dataset` directory:
```shell
dataset
|—— Eval_Ali_far
|—— Eval_Ali_near
|—— Test_Ali_far
|—— Test_Ali_near
|—— Train_Ali_far
|—— Train_Ali_near
```
Before running `run_m2met_2023_infer.sh`, you need to place the new test set `Test_2023_Ali_far` (to be released after the challenge starts) in the `./dataset` directory, which contains only raw audio. Then put the given `wav.scp`, `wav_raw.scp`, `segments`, `utt2spk` and `spk2utt` in the `./data/Test_2023_Ali_far` directory.
```shell
data/Test_2023_Ali_far
|—— wav.scp
|—— wav_raw.scp
|—— segments
|—— utt2spk
|—— spk2utt
```
For more details, see [here](https://github.com/alibaba-damo-academy/FunASR/blob/main/egs/alimeeting/sa-asr/README.md).
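Before launching the scripts, a quick sanity check against the required file list above can help (a hypothetical helper sketched here, not part of the released baseline):

```python
# The Kaldi-style files run_m2met_2023_infer.sh expects in data/Test_2023_Ali_far.
REQUIRED = ["wav.scp", "wav_raw.scp", "segments", "utt2spk", "spk2utt"]

def missing_files(present):
    # Names the required files that are not among those actually present.
    return [name for name in REQUIRED if name not in set(present)]

print(missing_files(["wav.scp", "segments", "utt2spk", "spk2utt"]))  # ['wav_raw.scp']
```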
## Baseline results
The results of the baseline system are shown in Table 3. The speaker profile adopts the oracle speaker embedding during training. However, since oracle speaker labels are unavailable during evaluation, the speaker profile produced by an additional spectral clustering step is used instead. Meanwhile, results using the oracle speaker profile on the Eval and Test sets are also provided to show the impact of speaker profile accuracy.
![baseline result](images/baseline_result.png)
![baseline_result](images/baseline_result.png)


@ -1,9 +1,9 @@
# Contact
If you have any questions about M2MET2.0 challenge, please contact us by
If you have any questions about M2MeT2.0 challenge, please contact us by
- email: [m2met.alimeeting@gmail.com](mailto:m2met.alimeeting@gmail.com)
| Wechat group |
|:------------------------------------------:|
<!-- | <img src="images/wechat.png" width="300"/> | -->
| <img src="images/qrcode.png" width="300"/> |


@ -6,23 +6,23 @@ Over the years, several challenges have been organized to advance the developmen
The ICASSP2022 M2MeT challenge focuses on meeting scenarios, and it comprises two main tasks: speaker diarization and multi-speaker automatic speech recognition. The former involves identifying who spoke when in the meeting, while the latter aims to transcribe speech from multiple speakers simultaneously, which poses significant technical difficulties due to overlapping speech and acoustic interferences.
Building on the success of the previous M2MeT challenge, we are excited to propose the M2MeT2.0 challenge as an ASRU2023 challenge special session. In the original M2MeT challenge, the evaluation metric was speaker-independent, which meant that the transcription could be determined, but not the corresponding speaker. To address this limitation and further advance the current multi-talker ASR system towards practicality, the M2MeT2.0 challenge proposes the speaker-attributed ASR task with two sub-tracks: fixed and open training conditions. The speaker-attribute automatic speech recognition (ASR) task aims to tackle the practical and challenging problem of identifying "who spoke what at when". To facilitate reproducible research in this field, we offer a comprehensive overview of the dataset, rules, evaluation metrics, and baseline systems. Furthermore, we will release a carefully curated test set, comprising approximately 10 hours of audio, according to the timeline. The new test set is designed to enable researchers to validate and compare their models' performance and advance the state of the art in this area.
Building on the success of the previous M2MeT challenge, we are excited to propose the M2MeT2.0 challenge as an ASRU 2023 challenge special session. In the original M2MeT challenge, the evaluation metric was speaker-independent, which meant that the transcription could be determined, but not the corresponding speaker. To address this limitation and further advance current multi-talker ASR systems towards practicality, the M2MeT2.0 challenge proposes the speaker-attributed ASR task with two sub-tracks: fixed and open training conditions. The speaker-attributed automatic speech recognition (ASR) task aims to tackle the practical and challenging problem of identifying "who spoke what and when". To facilitate reproducible research in this field, we offer a comprehensive overview of the dataset, rules, evaluation metrics, and baseline systems. Furthermore, we will release a carefully curated test set, comprising approximately 10 hours of audio, according to the timeline. The new test set is designed to enable researchers to validate and compare their models' performance and advance the state of the art in this area.
## Timeline (AOE Time)
- $ April~29, 2023: $ Challenge and registration open.
- $ May~8, 2023: $ Baseline release.
- $ May~15, 2023: $ Registration deadline, the due date for participants to join the Challenge.
- $ June~9, 2023: $ Test data release and leaderboard open.
- $ June~13, 2023: $ Final submission deadline.
- $ June~19, 2023: $ Evaluation result and ranking release.
- $ May~11, 2023: $ Baseline release.
- $ May~22, 2023: $ Registration deadline, the due date for participants to join the Challenge.
- $ June~16, 2023: $ Test data release and leaderboard open.
- $ June~20, 2023: $ Final submission deadline and leaderboard close.
- $ June~26, 2023: $ Evaluation result and ranking release.
- $ July~3, 2023: $ Deadline for paper submission.
- $ July~10, 2023: $ Deadline for final paper submission.
- $ December~12\ to\ 16, 2023: $ ASRU Workshop and challenge session
- $ December~12\ to\ 16, 2023: $ ASRU Workshop and Challenge Session.
## Guidelines
Interested participants, whether from academia or industry, must register for the challenge by completing the Google form below. The deadline for registration is May 15, 2023.
Interested participants, whether from academia or industry, must register for the challenge by completing the Google form below. The deadline for registration is May 22, 2023. Participants are also welcome to join the [WeChat group](https://alibaba-damo-academy.github.io/FunASR/m2met2/Contact.html) of M2MeT2.0 to keep up to date with the latest updates about the challenge.
[M2MET2.0 Registration](https://docs.google.com/forms/d/e/1FAIpQLSf77T9vAl7Ym-u5g8gXu18SBofoWRaFShBo26Ym0-HDxHW9PQ/viewform?usp=sf_link)
[M2MeT2.0 Registration](https://docs.google.com/forms/d/e/1FAIpQLSf77T9vAl7Ym-u5g8gXu18SBofoWRaFShBo26Ym0-HDxHW9PQ/viewform?usp=sf_link)
Within three working days, the challenge organizer will send email invitations to eligible teams to participate in the challenge. All qualified teams are required to adhere to the challenge rules, which will be published on the challenge page. Prior to the ranking release time, each participant must submit a system description document detailing their approach and methods. The organizer will select the top three submissions to be included in the ASRU2023 Proceedings.
Within three working days, the challenge organizer will send email invitations to eligible teams to participate in the challenge. All qualified teams are required to adhere to the challenge rules, which will be published on the challenge page. Prior to the ranking release time, each participant must submit a system description document detailing their approach and methods. The organizer will select the top-ranking submissions to be included in the ASRU 2023 Proceedings.


@ -1,5 +1,5 @@
# Organizers
***Lei Xie, Professor, Northwestern Polytechnical University, China***
***Lei Xie, Professor, AISHELL foundation, China***
Email: [lxie@nwpu.edu.cn](mailto:lxie@nwpu.edu.cn)


@ -1,4 +1,4 @@
# Sphinx build info version 1
# This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
config: 9907eab6bf227ca0fc6db297f26919da
config: a62852d90c3e533904d811bbf85f977d
tags: 645f666f9bcd5a90fca523b33c5a78b7


@ -15,7 +15,7 @@
<link rel="stylesheet" type="text/css" href="_static/css/bootstrap-theme.min.css" />
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Baseline &#8212; m2met2 documentation</title>
<title>Baseline &#8212; MULTI-PARTY MEETING TRANSCRIPTION CHALLENGE 2.0</title>
<link rel="stylesheet" type="text/css" href="_static/pygments.css" />
<link rel="stylesheet" type="text/css" href="_static/guzzle.css" />
<script data-url_root="./" id="documentation_options" src="_static/documentation_options.js"></script>
@ -44,7 +44,7 @@
<li class="right" >
<a href="Track_setting_and_evaluation.html" title="Track &amp; Evaluation"
accesskey="P">previous</a> |</li>
<li class="nav-item nav-item-0"><a href="index.html">m2met2 documentation</a> &#187;</li>
<li class="nav-item nav-item-0"><a href="index.html">MULTI-PARTY MEETING TRANSCRIPTION CHALLENGE 2.0</a> &#187;</li>
<li class="nav-item nav-item-this"><a href="">Baseline</a></li>
</ul>
</div>
@ -55,7 +55,7 @@
</div>
<div id="left-column">
<div class="sphinxsidebar"><a href="
index.html" class="text-logo">m2met2 documentation</a>
index.html" class="text-logo">MULTI-PARTY MEETING TRANSCRIPTION CHALLENGE 2.0</a>
<div class="sidebar-block">
<div class="sidebar-wrapper">
<div id="main-search">
@ -126,17 +126,38 @@
<h1>Baseline<a class="headerlink" href="#baseline" title="Permalink to this heading"></a></h1>
<section id="overview">
<h2>Overview<a class="headerlink" href="#overview" title="Permalink to this heading"></a></h2>
<p>We will release an E2E SA-ASR~\cite{kanda21b_interspeech} baseline conducted on <a class="reference external" href="https://github.com/alibaba-damo-academy/FunASR">FunASR</a> at the time according to the timeline. The model architecture is shown in Figure 3. The SpeakerEncoder is initialized with a pre-trained speaker verification model from ModelScope. This speaker verification model is also be used to extract the speaker embedding in the speaker profile.</p>
<p>We will release an E2E SA-ASR baseline conducted on <a class="reference external" href="https://github.com/alibaba-damo-academy/FunASR">FunASR</a> at the time according to the timeline. The model architecture is shown in Figure 3. The SpeakerEncoder is initialized with a pre-trained speaker verification model from ModelScope. This speaker verification model is also used to extract the speaker embedding in the speaker profile.</p>
<p><img alt="model architecture" src="_images/sa_asr_arch.png" /></p>
</section>
<section id="quick-start">
<h2>Quick start<a class="headerlink" href="#quick-start" title="Permalink to this heading"></a></h2>
<p>#TODO: fill with the README.md of the baseline</p>
<p>To run the baseline, first you need to install FunASR and ModelScope. (<a class="reference external" href="https://alibaba-damo-academy.github.io/FunASR/en/installation.html">installation</a>)<br />
There are two startup scripts, <code class="docutils literal notranslate"><span class="pre">run.sh</span></code> for training and evaluating on the old eval and test sets, and <code class="docutils literal notranslate"><span class="pre">run_m2met_2023_infer.sh</span></code> for inference on the new test set of the Multi-Channel Multi-Party Meeting Transcription 2.0 (<a class="reference external" href="https://alibaba-damo-academy.github.io/FunASR/m2met2/index.html">M2MeT2.0</a>) Challenge.<br />
Before running <code class="docutils literal notranslate"><span class="pre">run.sh</span></code>, you must manually download and unpack the <a class="reference external" href="http://www.openslr.org/119/">AliMeeting</a> corpus and place it in the <code class="docutils literal notranslate"><span class="pre">./dataset</span></code> directory:</p>
<div class="highlight-shell notranslate"><div class="highlight"><pre><span></span>dataset
<span class="p">|</span>——<span class="w"> </span>Eval_Ali_far
<span class="p">|</span>——<span class="w"> </span>Eval_Ali_near
<span class="p">|</span>——<span class="w"> </span>Test_Ali_far
<span class="p">|</span>——<span class="w"> </span>Test_Ali_near
<span class="p">|</span>——<span class="w"> </span>Train_Ali_far
<span class="p">|</span>——<span class="w"> </span>Train_Ali_near
</pre></div>
</div>
<p>Before running <code class="docutils literal notranslate"><span class="pre">run_m2met_2023_infer.sh</span></code>, you need to place the new test set <code class="docutils literal notranslate"><span class="pre">Test_2023_Ali_far</span></code> (to be released after the challenge starts) in the <code class="docutils literal notranslate"><span class="pre">./dataset</span></code> directory, which contains only raw audios. Then put the given <code class="docutils literal notranslate"><span class="pre">wav.scp</span></code>, <code class="docutils literal notranslate"><span class="pre">wav_raw.scp</span></code>, <code class="docutils literal notranslate"><span class="pre">segments</span></code>, <code class="docutils literal notranslate"><span class="pre">utt2spk</span></code> and <code class="docutils literal notranslate"><span class="pre">spk2utt</span></code> in the <code class="docutils literal notranslate"><span class="pre">./data/Test_2023_Ali_far</span></code> directory.</p>
<div class="highlight-shell notranslate"><div class="highlight"><pre><span></span>data/Test_2023_Ali_far
<span class="p">|</span>——<span class="w"> </span>wav.scp
<span class="p">|</span>——<span class="w"> </span>wav_raw.scp
<span class="p">|</span>——<span class="w"> </span>segments
<span class="p">|</span>——<span class="w"> </span>utt2spk
<span class="p">|</span>——<span class="w"> </span>spk2utt
</pre></div>
</div>
<p>For more details you can see <a class="reference external" href="https://github.com/alibaba-damo-academy/FunASR/blob/main/egs/alimeeting/sa-asr/README.md">here</a></p>
</section>
<section id="baseline-results">
<h2>Baseline results<a class="headerlink" href="#baseline-results" title="Permalink to this heading"></a></h2>
<p>The results of the baseline system are shown in Table 3. The speaker profile adopts the oracle speaker embedding during training. However, due to the lack of oracle speaker label during evaluation, the speaker profile provided by an additional spectral clustering is used. Meanwhile, the results of using the oracle speaker profile on Eval and Test Set are also provided to show the impact of speaker profile accuracy.</p>
<p><img alt="baseline result" src="_images/baseline_result.png" /></p>
<p><img alt="baseline_result" src="_images/baseline_result.png" /></p>
</section>
</section>
@ -170,7 +191,7 @@
<li class="right" >
<a href="Track_setting_and_evaluation.html" title="Track &amp; Evaluation"
>previous</a> |</li>
<li class="nav-item nav-item-0"><a href="index.html">m2met2 documentation</a> &#187;</li>
<li class="nav-item nav-item-0"><a href="index.html">MULTI-PARTY MEETING TRANSCRIPTION CHALLENGE 2.0</a> &#187;</li>
<li class="nav-item nav-item-this"><a href="">Baseline</a></li>
</ul>
</div>


@ -15,7 +15,7 @@
<link rel="stylesheet" type="text/css" href="_static/css/bootstrap-theme.min.css" />
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Contact &#8212; m2met2 documentation</title>
<title>Contact &#8212; MULTI-PARTY MEETING TRANSCRIPTION CHALLENGE 2.0</title>
<link rel="stylesheet" type="text/css" href="_static/pygments.css" />
<link rel="stylesheet" type="text/css" href="_static/guzzle.css" />
<script data-url_root="./" id="documentation_options" src="_static/documentation_options.js"></script>
@ -40,7 +40,7 @@
<li class="right" >
<a href="Organizers.html" title="Organizers"
accesskey="P">previous</a> |</li>
<li class="nav-item nav-item-0"><a href="index.html">m2met2 documentation</a> &#187;</li>
<li class="nav-item nav-item-0"><a href="index.html">MULTI-PARTY MEETING TRANSCRIPTION CHALLENGE 2.0</a> &#187;</li>
<li class="nav-item nav-item-this"><a href="">Contact</a></li>
</ul>
</div>
@ -51,7 +51,7 @@
</div>
<div id="left-column">
<div class="sphinxsidebar"><a href="
index.html" class="text-logo">m2met2 documentation</a>
index.html" class="text-logo">MULTI-PARTY MEETING TRANSCRIPTION CHALLENGE 2.0</a>
<div class="sidebar-block">
<div class="sidebar-wrapper">
<div id="main-search">
@ -120,7 +120,7 @@
<section id="contact">
<h1>Contact<a class="headerlink" href="#contact" title="Permalink to this heading"></a></h1>
<p>If you have any questions about M2MET2.0 challenge, please contact us by</p>
<p>If you have any questions about M2MeT2.0 challenge, please contact us by</p>
<ul class="simple">
<li><p>email: <a class="reference external" href="mailto:m2met&#46;alimeeting&#37;&#52;&#48;gmail&#46;com">m2met<span>&#46;</span>alimeeting<span>&#64;</span>gmail<span>&#46;</span>com</a></p></li>
</ul>
@ -129,8 +129,11 @@
<tr class="row-odd"><th class="head text-center"><p>Wechat group</p></th>
</tr>
</thead>
<tbody>
<tr class="row-even"><td class="text-center"><p><a class="reference internal" href="_images/qrcode.png"><img alt="_images/qrcode.png" src="_images/qrcode.png" style="width: 300px;" /></a></p></td>
</tr>
</tbody>
</table>
<!-- | <img src="images/wechat.png" width="300"/> | -->
</section>
@ -157,7 +160,7 @@
<li class="right" >
<a href="Organizers.html" title="Organizers"
>previous</a> |</li>
<li class="nav-item nav-item-0"><a href="index.html">m2met2 documentation</a> &#187;</li>
<li class="nav-item nav-item-0"><a href="index.html">MULTI-PARTY MEETING TRANSCRIPTION CHALLENGE 2.0</a> &#187;</li>
<li class="nav-item nav-item-this"><a href="">Contact</a></li>
</ul>
</div>


@ -15,7 +15,7 @@
<link rel="stylesheet" type="text/css" href="_static/css/bootstrap-theme.min.css" />
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Datasets &#8212; m2met2 documentation</title>
<title>Datasets &#8212; MULTI-PARTY MEETING TRANSCRIPTION CHALLENGE 2.0</title>
<link rel="stylesheet" type="text/css" href="_static/pygments.css" />
<link rel="stylesheet" type="text/css" href="_static/guzzle.css" />
<script data-url_root="./" id="documentation_options" src="_static/documentation_options.js"></script>
@ -45,7 +45,7 @@
<li class="right" >
<a href="Introduction.html" title="Introduction"
accesskey="P">previous</a> |</li>
<li class="nav-item nav-item-0"><a href="index.html">m2met2 documentation</a> &#187;</li>
<li class="nav-item nav-item-0"><a href="index.html">MULTI-PARTY MEETING TRANSCRIPTION CHALLENGE 2.0</a> &#187;</li>
<li class="nav-item nav-item-this"><a href="">Datasets</a></li>
</ul>
</div>
@ -56,7 +56,7 @@
</div>
<div id="left-column">
<div class="sphinxsidebar"><a href="
index.html" class="text-logo">m2met2 documentation</a>
index.html" class="text-logo">MULTI-PARTY MEETING TRANSCRIPTION CHALLENGE 2.0</a>
<div class="sidebar-block">
<div class="sidebar-wrapper">
<div id="main-search">
@ -181,7 +181,7 @@
<li class="right" >
<a href="Introduction.html" title="Introduction"
>previous</a> |</li>
<li class="nav-item nav-item-0"><a href="index.html">m2met2 documentation</a> &#187;</li>
<li class="nav-item nav-item-0"><a href="index.html">MULTI-PARTY MEETING TRANSCRIPTION CHALLENGE 2.0</a> &#187;</li>
<li class="nav-item nav-item-this"><a href="">Datasets</a></li>
</ul>
</div>


@ -15,7 +15,7 @@
<link rel="stylesheet" type="text/css" href="_static/css/bootstrap-theme.min.css" />
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Introduction &#8212; m2met2 documentation</title>
<title>Introduction &#8212; MULTI-PARTY MEETING TRANSCRIPTION CHALLENGE 2.0</title>
<link rel="stylesheet" type="text/css" href="_static/pygments.css" />
<link rel="stylesheet" type="text/css" href="_static/guzzle.css" />
<script data-url_root="./" id="documentation_options" src="_static/documentation_options.js"></script>
@ -45,7 +45,7 @@
<li class="right" >
<a href="index.html" title="ASRU 2023 MULTI-CHANNEL MULTI-PARTY MEETING TRANSCRIPTION CHALLENGE 2.0 (M2MeT2.0)"
accesskey="P">previous</a> |</li>
<li class="nav-item nav-item-0"><a href="index.html">m2met2 documentation</a> &#187;</li>
<li class="nav-item nav-item-0"><a href="index.html">MULTI-PARTY MEETING TRANSCRIPTION CHALLENGE 2.0</a> &#187;</li>
<li class="nav-item nav-item-this"><a href="">Introduction</a></li>
</ul>
</div>
@ -56,7 +56,7 @@
</div>
<div id="left-column">
<div class="sphinxsidebar"><a href="
index.html" class="text-logo">m2met2 documentation</a>
index.html" class="text-logo">MULTI-PARTY MEETING TRANSCRIPTION CHALLENGE 2.0</a>
<div class="sidebar-block">
<div class="sidebar-wrapper">
<div id="main-search">
@ -130,27 +130,27 @@
<p>Automatic speech recognition (ASR) and speaker diarization have made significant strides in recent years, resulting in a surge of speech technology applications across various domains. However, meetings present unique challenges to speech technologies due to their complex acoustic conditions and diverse speaking styles, including overlapping speech, variable numbers of speakers, far-field signals in large conference rooms, and environmental noise and reverberation.</p>
<p>Over the years, several challenges have been organized to advance the development of meeting transcription, including the Rich Transcription evaluation and Computational Hearing in Multisource Environments (CHIME) challenges. The latest iteration of the CHIME challenge has a particular focus on distant automatic speech recognition and developing systems that can generalize across various array topologies and application scenarios. However, while progress has been made in English meeting transcription, language differences remain a significant barrier to achieving comparable results in non-English languages, such as Mandarin. The Multimodal Information Based Speech Processing (MISP) and Multi-Channel Multi-Party Meeting Transcription (M2MeT) challenges have been instrumental in advancing Mandarin meeting transcription. The MISP challenge seeks to address the problem of audio-visual distant multi-microphone signal processing in everyday home environments, while the M2MeT challenge focuses on tackling the speech overlap issue in offline meeting rooms.</p>
<p>The ICASSP2022 M2MeT challenge focuses on meeting scenarios, and it comprises two main tasks: speaker diarization and multi-speaker automatic speech recognition. The former involves identifying who spoke when in the meeting, while the latter aims to transcribe speech from multiple speakers simultaneously, which poses significant technical difficulties due to overlapping speech and acoustic interferences.</p>
<p>Building on the success of the previous M2MeT challenge, we are excited to propose the M2MeT2.0 challenge as an ASRU2023 challenge special session. In the original M2MeT challenge, the evaluation metric was speaker-independent, which meant that the transcription could be determined, but not the corresponding speaker. To address this limitation and further advance the current multi-talker ASR system towards practicality, the M2MeT2.0 challenge proposes the speaker-attributed ASR task with two sub-tracks: fixed and open training conditions. The speaker-attribute automatic speech recognition (ASR) task aims to tackle the practical and challenging problem of identifying “who spoke what at when”. To facilitate reproducible research in this field, we offer a comprehensive overview of the dataset, rules, evaluation metrics, and baseline systems. Furthermore, we will release a carefully curated test set, comprising approximately 10 hours of audio, according to the timeline. The new test set is designed to enable researchers to validate and compare their models performance and advance the state of the art in this area.</p>
<p>Building on the success of the previous M2MeT challenge, we are excited to propose the M2MeT2.0 challenge as an ASRU 2023 challenge special session. In the original M2MeT challenge, the evaluation metric was speaker-independent, which meant that the transcription could be determined, but not the corresponding speaker. To address this limitation and further advance the current multi-talker ASR system towards practicality, the M2MeT2.0 challenge proposes the speaker-attributed ASR task with two sub-tracks: fixed and open training conditions. The speaker-attributed automatic speech recognition (ASR) task aims to tackle the practical and challenging problem of identifying “who spoke what and when”. To facilitate reproducible research in this field, we offer a comprehensive overview of the dataset, rules, evaluation metrics, and baseline systems. Furthermore, we will release a carefully curated test set, comprising approximately 10 hours of audio, according to the timeline. The new test set is designed to enable researchers to validate and compare their models' performance and advance the state of the art in this area.</p>
</section>
<section id="timeline-aoe-time">
<h2>Timeline (AOE Time)<a class="headerlink" href="#timeline-aoe-time" title="Permalink to this heading"></a></h2>
<ul class="simple">
<li><p><span class="math notranslate nohighlight">\( April~29, 2023: \)</span> Challenge and registration open.</p></li>
<li><p><span class="math notranslate nohighlight">\( May~8, 2023: \)</span> Baseline release.</p></li>
<li><p><span class="math notranslate nohighlight">\( May~15, 2023: \)</span> Registration deadline, the due date for participants to join the Challenge.</p></li>
<li><p><span class="math notranslate nohighlight">\( June~9, 2023: \)</span> Test data release and leaderboard open.</p></li>
<li><p><span class="math notranslate nohighlight">\( June~13, 2023: \)</span> Final submission deadline.</p></li>
<li><p><span class="math notranslate nohighlight">\( June~19, 2023: \)</span> Evaluation result and ranking release.</p></li>
<li><p><span class="math notranslate nohighlight">\( May~11, 2023: \)</span> Baseline release.</p></li>
<li><p><span class="math notranslate nohighlight">\( May~22, 2023: \)</span> Registration deadline, the due date for participants to join the Challenge.</p></li>
<li><p><span class="math notranslate nohighlight">\( June~16, 2023: \)</span> Test data release and leaderboard open.</p></li>
<li><p><span class="math notranslate nohighlight">\( June~20, 2023: \)</span> Final submission deadline and leaderboard close.</p></li>
<li><p><span class="math notranslate nohighlight">\( June~26, 2023: \)</span> Evaluation result and ranking release.</p></li>
<li><p><span class="math notranslate nohighlight">\( July~3, 2023: \)</span> Deadline for paper submission.</p></li>
<li><p><span class="math notranslate nohighlight">\( July~10, 2023: \)</span> Deadline for final paper submission.</p></li>
<li><p><span class="math notranslate nohighlight">\( December~12\ to\ 16, 2023: \)</span> ASRU Workshop and challenge session</p></li>
<li><p><span class="math notranslate nohighlight">\( December~12\ to\ 16, 2023: \)</span> ASRU Workshop and Challenge Session.</p></li>
</ul>
</section>
<section id="guidelines">
<h2>Guidelines<a class="headerlink" href="#guidelines" title="Permalink to this heading"></a></h2>
<p>Interested participants, whether from academia or industry, must register for the challenge by completing the Google form below. The deadline for registration is May 15, 2023.</p>
<p><a class="reference external" href="https://docs.google.com/forms/d/e/1FAIpQLSf77T9vAl7Ym-u5g8gXu18SBofoWRaFShBo26Ym0-HDxHW9PQ/viewform?usp=sf_link">M2MET2.0 Registration</a></p>
<p>Within three working days, the challenge organizer will send email invitations to eligible teams to participate in the challenge. All qualified teams are required to adhere to the challenge rules, which will be published on the challenge page. Prior to the ranking release time, each participant must submit a system description document detailing their approach and methods. The organizer will select the top three submissions to be included in the ASRU2023 Proceedings.</p>
<p>Interested participants, whether from academia or industry, must register for the challenge by completing the Google form below. The deadline for registration is May 22, 2023. Participants are also welcome to join the <a class="reference external" href="https://alibaba-damo-academy.github.io/FunASR/m2met2/Contact.html">WeChat group</a> of M2MeT2.0 to stay up to date with the latest news about the challenge.</p>
<p><a class="reference external" href="https://docs.google.com/forms/d/e/1FAIpQLSf77T9vAl7Ym-u5g8gXu18SBofoWRaFShBo26Ym0-HDxHW9PQ/viewform?usp=sf_link">M2MeT2.0 Registration</a></p>
<p>Within three working days, the challenge organizer will send email invitations to eligible teams to participate in the challenge. All qualified teams are required to adhere to the challenge rules, which will be published on the challenge page. Prior to the ranking release time, each participant must submit a system description document detailing their approach and methods. The organizer will select the top-ranking submissions to be included in the ASRU 2023 Proceedings.</p>
</section>
</section>
@ -184,7 +184,7 @@
<li class="right" >
<a href="index.html" title="ASRU 2023 MULTI-CHANNEL MULTI-PARTY MEETING TRANSCRIPTION CHALLENGE 2.0 (M2MeT2.0)"
>previous</a> |</li>
<li class="nav-item nav-item-0"><a href="index.html">m2met2 documentation</a> &#187;</li>
<li class="nav-item nav-item-0"><a href="index.html">MULTI-PARTY MEETING TRANSCRIPTION CHALLENGE 2.0</a> &#187;</li>
<li class="nav-item nav-item-this"><a href="">Introduction</a></li>
</ul>
</div>

View File

@ -15,7 +15,7 @@
<link rel="stylesheet" type="text/css" href="_static/css/bootstrap-theme.min.css" />
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Organizers &#8212; m2met2 documentation</title>
<title>Organizers &#8212; MULTI-PARTY MEETING TRANSCRIPTION CHALLENGE 2.0</title>
<link rel="stylesheet" type="text/css" href="_static/pygments.css" />
<link rel="stylesheet" type="text/css" href="_static/guzzle.css" />
<script data-url_root="./" id="documentation_options" src="_static/documentation_options.js"></script>
@ -44,7 +44,7 @@
<li class="right" >
<a href="Rules.html" title="Rules"
accesskey="P">previous</a> |</li>
<li class="nav-item nav-item-0"><a href="index.html">m2met2 documentation</a> &#187;</li>
<li class="nav-item nav-item-0"><a href="index.html">MULTI-PARTY MEETING TRANSCRIPTION CHALLENGE 2.0</a> &#187;</li>
<li class="nav-item nav-item-this"><a href="">Organizers</a></li>
</ul>
</div>
@ -55,7 +55,7 @@
</div>
<div id="left-column">
<div class="sphinxsidebar"><a href="
index.html" class="text-logo">m2met2 documentation</a>
index.html" class="text-logo">MULTI-PARTY MEETING TRANSCRIPTION CHALLENGE 2.0</a>
<div class="sidebar-block">
<div class="sidebar-wrapper">
<div id="main-search">
@ -124,7 +124,7 @@
<section id="organizers">
<h1>Organizers<a class="headerlink" href="#organizers" title="Permalink to this heading"></a></h1>
<p><em><strong>Lei Xie, Professor, Northwestern Polytechnical University, China</strong></em></p>
<p><em><strong>Lei Xie, Professor, AISHELL Foundation, China</strong></em></p>
<p>Email: <a class="reference external" href="mailto:lxie&#37;&#52;&#48;nwpu&#46;edu&#46;cn">lxie<span>&#64;</span>nwpu<span>&#46;</span>edu<span>&#46;</span>cn</a></p>
<a class="reference internal image-reference" href="_images/lxie.jpeg"><img alt="lxie" src="_images/lxie.jpeg" style="width: 20%;" /></a>
<p><em><strong>Kong Aik Lee, Senior Scientist at Institute for Infocomm Research, A*Star, Singapore</strong></em></p>
@ -180,7 +180,7 @@ Email: <a class="reference external" href="mailto:sly&#46;zsl&#37;&#52;&#48;alib
<li class="right" >
<a href="Rules.html" title="Rules"
>previous</a> |</li>
<li class="nav-item nav-item-0"><a href="index.html">m2met2 documentation</a> &#187;</li>
<li class="nav-item nav-item-0"><a href="index.html">MULTI-PARTY MEETING TRANSCRIPTION CHALLENGE 2.0</a> &#187;</li>
<li class="nav-item nav-item-this"><a href="">Organizers</a></li>
</ul>
</div>

View File

@ -15,7 +15,7 @@
<link rel="stylesheet" type="text/css" href="_static/css/bootstrap-theme.min.css" />
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Rules &#8212; m2met2 documentation</title>
<title>Rules &#8212; MULTI-PARTY MEETING TRANSCRIPTION CHALLENGE 2.0</title>
<link rel="stylesheet" type="text/css" href="_static/pygments.css" />
<link rel="stylesheet" type="text/css" href="_static/guzzle.css" />
<script data-url_root="./" id="documentation_options" src="_static/documentation_options.js"></script>
@ -44,7 +44,7 @@
<li class="right" >
<a href="Baseline.html" title="Baseline"
accesskey="P">previous</a> |</li>
<li class="nav-item nav-item-0"><a href="index.html">m2met2 documentation</a> &#187;</li>
<li class="nav-item nav-item-0"><a href="index.html">MULTI-PARTY MEETING TRANSCRIPTION CHALLENGE 2.0</a> &#187;</li>
<li class="nav-item nav-item-this"><a href="">Rules</a></li>
</ul>
</div>
@ -55,7 +55,7 @@
</div>
<div id="left-column">
<div class="sphinxsidebar"><a href="
index.html" class="text-logo">m2met2 documentation</a>
index.html" class="text-logo">MULTI-PARTY MEETING TRANSCRIPTION CHALLENGE 2.0</a>
<div class="sidebar-block">
<div class="sidebar-wrapper">
<div id="main-search">
@ -165,7 +165,7 @@
<li class="right" >
<a href="Baseline.html" title="Baseline"
>previous</a> |</li>
<li class="nav-item nav-item-0"><a href="index.html">m2met2 documentation</a> &#187;</li>
<li class="nav-item nav-item-0"><a href="index.html">MULTI-PARTY MEETING TRANSCRIPTION CHALLENGE 2.0</a> &#187;</li>
<li class="nav-item nav-item-this"><a href="">Rules</a></li>
</ul>
</div>

View File

@ -15,7 +15,7 @@
<link rel="stylesheet" type="text/css" href="_static/css/bootstrap-theme.min.css" />
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Track &amp; Evaluation &#8212; m2met2 documentation</title>
<title>Track &amp; Evaluation &#8212; MULTI-PARTY MEETING TRANSCRIPTION CHALLENGE 2.0</title>
<link rel="stylesheet" type="text/css" href="_static/pygments.css" />
<link rel="stylesheet" type="text/css" href="_static/guzzle.css" />
<script data-url_root="./" id="documentation_options" src="_static/documentation_options.js"></script>
@ -45,7 +45,7 @@
<li class="right" >
<a href="Dataset.html" title="Datasets"
accesskey="P">previous</a> |</li>
<li class="nav-item nav-item-0"><a href="index.html">m2met2 documentation</a> &#187;</li>
<li class="nav-item nav-item-0"><a href="index.html">MULTI-PARTY MEETING TRANSCRIPTION CHALLENGE 2.0</a> &#187;</li>
<li class="nav-item nav-item-this"><a href="">Track &amp; Evaluation</a></li>
</ul>
</div>
@ -56,7 +56,7 @@
</div>
<div id="left-column">
<div class="sphinxsidebar"><a href="
index.html" class="text-logo">m2met2 documentation</a>
index.html" class="text-logo">MULTI-PARTY MEETING TRANSCRIPTION CHALLENGE 2.0</a>
<div class="sidebar-block">
<div class="sidebar-wrapper">
<div id="main-search">
@ -180,7 +180,7 @@
<li class="right" >
<a href="Dataset.html" title="Datasets"
>previous</a> |</li>
<li class="nav-item nav-item-0"><a href="index.html">m2met2 documentation</a> &#187;</li>
<li class="nav-item nav-item-0"><a href="index.html">MULTI-PARTY MEETING TRANSCRIPTION CHALLENGE 2.0</a> &#187;</li>
<li class="nav-item nav-item-this"><a href="">Track &amp; Evaluation</a></li>
</ul>
</div>

Binary file not shown.

Before

Width:  |  Height:  |  Size: 144 KiB

After

Width:  |  Height:  |  Size: 119 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 152 KiB

View File

@ -1,13 +1,34 @@
# Baseline
## Overview
We will release an E2E SA-ASR~\cite{kanda21b_interspeech} baseline conducted on [FunASR](https://github.com/alibaba-damo-academy/FunASR) at the time according to the timeline. The model architecture is shown in Figure 3. The SpeakerEncoder is initialized with a pre-trained speaker verification model from ModelScope. This speaker verification model is also be used to extract the speaker embedding in the speaker profile.
We will release an E2E SA-ASR baseline conducted on [FunASR](https://github.com/alibaba-damo-academy/FunASR) according to the timeline. The model architecture is shown in Figure 3. The SpeakerEncoder is initialized with a pre-trained speaker verification model from ModelScope. This speaker verification model is also used to extract the speaker embeddings in the speaker profile.
![model architecture](images/sa_asr_arch.png)
## Quick start
#TODO: fill with the README.md of the baseline
To run the baseline, you first need to install FunASR and ModelScope ([installation](https://alibaba-damo-academy.github.io/FunASR/en/installation.html)).
There are two startup scripts: `run.sh`, for training and evaluating on the original Eval and Test sets, and `run_m2met_2023_infer.sh`, for inference on the new test set of the Multi-Channel Multi-Party Meeting Transcription 2.0 ([M2MeT2.0](https://alibaba-damo-academy.github.io/FunASR/m2met2/index.html)) Challenge.
Before running `run.sh`, you must manually download and unpack the [AliMeeting](http://www.openslr.org/119/) corpus and place it in the `./dataset` directory:
```shell
dataset
|—— Eval_Ali_far
|—— Eval_Ali_near
|—— Test_Ali_far
|—— Test_Ali_near
|—— Train_Ali_far
|—— Train_Ali_near
```
Before running `run_m2met_2023_infer.sh`, you need to place the new test set `Test_2023_Ali_far` (to be released after the challenge starts), which contains only raw audio, in the `./dataset` directory. Then put the given `wav.scp`, `wav_raw.scp`, `segments`, `utt2spk` and `spk2utt` in the `./data/Test_2023_Ali_far` directory.
```shell
data/Test_2023_Ali_far
|—— wav.scp
|—— wav_raw.scp
|—— segments
|—— utt2spk
|—— spk2utt
```
For more details, see [here](https://github.com/alibaba-damo-academy/FunASR/blob/main/egs/alimeeting/sa-asr/README.md)
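The data files above follow the usual Kaldi-style conventions: `wav.scp` maps a recording ID to an audio path, `segments` gives per-utterance time boundaries within a recording, and `utt2spk`/`spk2utt` map between utterances and speakers. As a minimal sketch (the utterance and speaker IDs below are invented for illustration, not taken from the challenge data), `spk2utt` can be derived from `utt2spk` like this:

```python
from collections import defaultdict

def utt2spk_to_spk2utt(utt2spk_lines):
    """Invert a Kaldi-style utt2spk mapping (one "utt-id spk-id" per line)
    into spk2utt lines ("spk-id utt-id1 utt-id2 ...")."""
    spk2utt = defaultdict(list)
    for line in utt2spk_lines:
        utt, spk = line.split()
        spk2utt[spk].append(utt)
    return {spk: " ".join([spk] + utts) for spk, utts in spk2utt.items()}

# Hypothetical IDs, for illustration only.
lines = [
    "SPK8001-M2MeT_0001 SPK8001",
    "SPK8002-M2MeT_0001 SPK8002",
    "SPK8001-M2MeT_0002 SPK8001",
]
print(utt2spk_to_spk2utt(lines)["SPK8001"])
# → SPK8001 SPK8001-M2MeT_0001 SPK8001-M2MeT_0002
```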
## Baseline results
The results of the baseline system are shown in Table 3. During training, the speaker profile adopts the oracle speaker embeddings. However, since oracle speaker labels are unavailable during evaluation, the speaker profile produced by an additional spectral clustering step is used instead. Results with the oracle speaker profile on the Eval and Test sets are also provided to show the impact of speaker profile accuracy.
![baseline result](images/baseline_result.png)
![baseline_result](images/baseline_result.png)
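As a toy illustration of the speaker-profile idea above (the function name and all numbers are invented for the example; the actual SA-ASR model attends to the profile softly inside the network rather than doing a hard post-hoc match), assigning token-level speaker embeddings to the closest profile entry by cosine similarity can be sketched as:

```python
import numpy as np

def assign_speakers(token_spk_embs, profile_embs):
    """Hard speaker assignment: match each token-level speaker embedding
    to the closest profile entry by cosine similarity."""
    a = token_spk_embs / np.linalg.norm(token_spk_embs, axis=1, keepdims=True)
    b = profile_embs / np.linalg.norm(profile_embs, axis=1, keepdims=True)
    return (a @ b.T).argmax(axis=1)  # profile index per token

# Toy 3-speaker profile and two token embeddings (made-up numbers).
profile = np.eye(3)
tokens = np.array([[0.9, 0.1, 0.0],
                   [0.0, 0.2, 1.0]])
print(assign_speakers(tokens, profile))  # → [0 2]
```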

View File

@ -1,9 +1,9 @@
# Contact
If you have any questions about M2MET2.0 challenge, please contact us by
If you have any questions about the M2MeT2.0 challenge, please contact us by
- email: [m2met.alimeeting@gmail.com](mailto:m2met.alimeeting@gmail.com)
| Wechat group |
|:------------------------------------------:|
<!-- | <img src="images/wechat.png" width="300"/> | -->
| <img src="images/qrcode.png" width="300"/> |

View File

@ -6,23 +6,23 @@ Over the years, several challenges have been organized to advance the developmen
The ICASSP2022 M2MeT challenge focuses on meeting scenarios, and it comprises two main tasks: speaker diarization and multi-speaker automatic speech recognition. The former involves identifying who spoke when in the meeting, while the latter aims to transcribe speech from multiple speakers simultaneously, which poses significant technical difficulties due to overlapping speech and acoustic interferences.
Building on the success of the previous M2MeT challenge, we are excited to propose the M2MeT2.0 challenge as an ASRU2023 challenge special session. In the original M2MeT challenge, the evaluation metric was speaker-independent, which meant that the transcription could be determined, but not the corresponding speaker. To address this limitation and further advance the current multi-talker ASR system towards practicality, the M2MeT2.0 challenge proposes the speaker-attributed ASR task with two sub-tracks: fixed and open training conditions. The speaker-attribute automatic speech recognition (ASR) task aims to tackle the practical and challenging problem of identifying "who spoke what at when". To facilitate reproducible research in this field, we offer a comprehensive overview of the dataset, rules, evaluation metrics, and baseline systems. Furthermore, we will release a carefully curated test set, comprising approximately 10 hours of audio, according to the timeline. The new test set is designed to enable researchers to validate and compare their models' performance and advance the state of the art in this area.
Building on the success of the previous M2MeT challenge, we are excited to propose the M2MeT2.0 challenge as an ASRU 2023 challenge special session. In the original M2MeT challenge, the evaluation metric was speaker-independent, which meant that the transcription could be determined, but not the corresponding speaker. To address this limitation and further advance the current multi-talker ASR system towards practicality, the M2MeT2.0 challenge proposes the speaker-attributed ASR task with two sub-tracks: fixed and open training conditions. The speaker-attributed automatic speech recognition (ASR) task aims to tackle the practical and challenging problem of identifying "who spoke what and when". To facilitate reproducible research in this field, we offer a comprehensive overview of the dataset, rules, evaluation metrics, and baseline systems. Furthermore, we will release a carefully curated test set, comprising approximately 10 hours of audio, according to the timeline. The new test set is designed to enable researchers to validate and compare their models' performance and advance the state of the art in this area.
## Timeline (AOE Time)
- $ April~29, 2023: $ Challenge and registration open.
- $ May~8, 2023: $ Baseline release.
- $ May~15, 2023: $ Registration deadline, the due date for participants to join the Challenge.
- $ June~9, 2023: $ Test data release and leaderboard open.
- $ June~13, 2023: $ Final submission deadline.
- $ June~19, 2023: $ Evaluation result and ranking release.
- $ May~11, 2023: $ Baseline release.
- $ May~22, 2023: $ Registration deadline, the due date for participants to join the Challenge.
- $ June~16, 2023: $ Test data release and leaderboard open.
- $ June~20, 2023: $ Final submission deadline and leaderboard close.
- $ June~26, 2023: $ Evaluation result and ranking release.
- $ July~3, 2023: $ Deadline for paper submission.
- $ July~10, 2023: $ Deadline for final paper submission.
- $ December~12\ to\ 16, 2023: $ ASRU Workshop and challenge session
- $ December~12\ to\ 16, 2023: $ ASRU Workshop and Challenge Session.
## Guidelines
Interested participants, whether from academia or industry, must register for the challenge by completing the Google form below. The deadline for registration is May 15, 2023.
Interested participants, whether from academia or industry, must register for the challenge by completing the Google form below. The deadline for registration is May 22, 2023. Participants are also welcome to join the [WeChat group](https://alibaba-damo-academy.github.io/FunASR/m2met2/Contact.html) of M2MeT2.0 to stay up to date with the latest news about the challenge.
[M2MET2.0 Registration](https://docs.google.com/forms/d/e/1FAIpQLSf77T9vAl7Ym-u5g8gXu18SBofoWRaFShBo26Ym0-HDxHW9PQ/viewform?usp=sf_link)
[M2MeT2.0 Registration](https://docs.google.com/forms/d/e/1FAIpQLSf77T9vAl7Ym-u5g8gXu18SBofoWRaFShBo26Ym0-HDxHW9PQ/viewform?usp=sf_link)
Within three working days, the challenge organizer will send email invitations to eligible teams to participate in the challenge. All qualified teams are required to adhere to the challenge rules, which will be published on the challenge page. Prior to the ranking release time, each participant must submit a system description document detailing their approach and methods. The organizer will select the top three submissions to be included in the ASRU2023 Proceedings.
Within three working days, the challenge organizer will send email invitations to eligible teams to participate in the challenge. All qualified teams are required to adhere to the challenge rules, which will be published on the challenge page. Prior to the ranking release time, each participant must submit a system description document detailing their approach and methods. The organizer will select the top-ranking submissions to be included in the ASRU 2023 Proceedings.

View File

@ -1,5 +1,5 @@
# Organizers
***Lei Xie, Professor, Northwestern Polytechnical University, China***
***Lei Xie, Professor, AISHELL Foundation, China***
Email: [lxie@nwpu.edu.cn](mailto:lxie@nwpu.edu.cn)

View File

@ -14,7 +14,7 @@
<link rel="stylesheet" type="text/css" href="_static/css/bootstrap-theme.min.css" />
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Index &#8212; m2met2 documentation</title>
<title>Index &#8212; MULTI-PARTY MEETING TRANSCRIPTION CHALLENGE 2.0</title>
<link rel="stylesheet" type="text/css" href="_static/pygments.css" />
<link rel="stylesheet" type="text/css" href="_static/guzzle.css" />
<script data-url_root="./" id="documentation_options" src="_static/documentation_options.js"></script>
@ -35,7 +35,7 @@
<li class="right" style="margin-right: 10px">
<a href="#" title="General Index"
accesskey="I">index</a></li>
<li class="nav-item nav-item-0"><a href="index.html">m2met2 documentation</a> &#187;</li>
<li class="nav-item nav-item-0"><a href="index.html">MULTI-PARTY MEETING TRANSCRIPTION CHALLENGE 2.0</a> &#187;</li>
<li class="nav-item nav-item-this"><a href="">Index</a></li>
</ul>
</div>
@ -46,7 +46,7 @@
</div>
<div id="left-column">
<div class="sphinxsidebar"><a href="
index.html" class="text-logo">m2met2 documentation</a>
index.html" class="text-logo">MULTI-PARTY MEETING TRANSCRIPTION CHALLENGE 2.0</a>
<div class="sidebar-block">
<div class="sidebar-wrapper">
<div id="main-search">
@ -132,7 +132,7 @@
<li class="right" style="margin-right: 10px">
<a href="#" title="General Index"
>index</a></li>
<li class="nav-item nav-item-0"><a href="index.html">m2met2 documentation</a> &#187;</li>
<li class="nav-item nav-item-0"><a href="index.html">MULTI-PARTY MEETING TRANSCRIPTION CHALLENGE 2.0</a> &#187;</li>
<li class="nav-item nav-item-this"><a href="">Index</a></li>
</ul>
</div>

View File

@ -15,7 +15,7 @@
<link rel="stylesheet" type="text/css" href="_static/css/bootstrap-theme.min.css" />
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>ASRU 2023 MULTI-CHANNEL MULTI-PARTY MEETING TRANSCRIPTION CHALLENGE 2.0 (M2MeT2.0) &#8212; m2met2 documentation</title>
<title>ASRU 2023 MULTI-CHANNEL MULTI-PARTY MEETING TRANSCRIPTION CHALLENGE 2.0 (M2MeT2.0) &#8212; MULTI-PARTY MEETING TRANSCRIPTION CHALLENGE 2.0</title>
<link rel="stylesheet" type="text/css" href="_static/pygments.css" />
<link rel="stylesheet" type="text/css" href="_static/guzzle.css" />
<script data-url_root="./" id="documentation_options" src="_static/documentation_options.js"></script>
@ -40,7 +40,7 @@
<li class="right" >
<a href="Introduction.html" title="Introduction"
accesskey="N">next</a> |</li>
<li class="nav-item nav-item-0"><a href="#">m2met2 documentation</a> &#187;</li>
<li class="nav-item nav-item-0"><a href="#">MULTI-PARTY MEETING TRANSCRIPTION CHALLENGE 2.0</a> &#187;</li>
<li class="nav-item nav-item-this"><a href="">ASRU 2023 MULTI-CHANNEL MULTI-PARTY MEETING TRANSCRIPTION CHALLENGE 2.0 (M2MeT2.0)</a></li>
</ul>
</div>
@ -51,7 +51,7 @@
</div>
<div id="left-column">
<div class="sphinxsidebar"><a href="
#" class="text-logo">m2met2 documentation</a>
#" class="text-logo">MULTI-PARTY MEETING TRANSCRIPTION CHALLENGE 2.0</a>
<div class="sidebar-block">
<div class="sidebar-wrapper">
<div id="main-search">
@ -160,7 +160,7 @@ To facilitate reproducible research, we provide a comprehensive overview of the
<li class="right" >
<a href="Introduction.html" title="Introduction"
>next</a> |</li>
<li class="nav-item nav-item-0"><a href="#">m2met2 documentation</a> &#187;</li>
<li class="nav-item nav-item-0"><a href="#">MULTI-PARTY MEETING TRANSCRIPTION CHALLENGE 2.0</a> &#187;</li>
<li class="nav-item nav-item-this"><a href="">ASRU 2023 MULTI-CHANNEL MULTI-PARTY MEETING TRANSCRIPTION CHALLENGE 2.0 (M2MeT2.0)</a></li>
</ul>
</div>

View File

@ -1,5 +1,5 @@
# Sphinx inventory version 2
# Project: m2met2
# Project: MULTI-PARTY MEETING TRANSCRIPTION CHALLENGE 2.0
# Version:
# The remainder of this file is compressed using zlib.

View File

@ -14,7 +14,7 @@
<link rel="stylesheet" type="text/css" href="_static/css/bootstrap-theme.min.css" />
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Search &#8212; m2met2 documentation</title>
<title>Search &#8212; MULTI-PARTY MEETING TRANSCRIPTION CHALLENGE 2.0</title>
<link rel="stylesheet" type="text/css" href="_static/pygments.css" />
<link rel="stylesheet" type="text/css" href="_static/guzzle.css" />
@ -41,7 +41,7 @@
<li class="right" style="margin-right: 10px">
<a href="genindex.html" title="General Index"
accesskey="I">index</a></li>
<li class="nav-item nav-item-0"><a href="index.html">m2met2 documentation</a> &#187;</li>
<li class="nav-item nav-item-0"><a href="index.html">MULTI-PARTY MEETING TRANSCRIPTION CHALLENGE 2.0</a> &#187;</li>
<li class="nav-item nav-item-this"><a href="">Search</a></li>
</ul>
</div>
@ -52,7 +52,7 @@
</div>
<div id="left-column">
<div class="sphinxsidebar"><a href="
index.html" class="text-logo">m2met2 documentation</a>
index.html" class="text-logo">MULTI-PARTY MEETING TRANSCRIPTION CHALLENGE 2.0</a>
<div class="sidebar-block">
<div class="sidebar-toc">
@ -149,7 +149,7 @@
<li class="right" style="margin-right: 10px">
<a href="genindex.html" title="General Index"
>index</a></li>
<li class="nav-item nav-item-0"><a href="index.html">m2met2 documentation</a> &#187;</li>
<li class="nav-item nav-item-0"><a href="index.html">MULTI-PARTY MEETING TRANSCRIPTION CHALLENGE 2.0</a> &#187;</li>
<li class="nav-item nav-item-this"><a href="">Search</a></li>
</ul>
</div>

File diff suppressed because one or more lines are too long

View File

@ -7,7 +7,7 @@ import guzzle_sphinx_theme
# -- Project information -----------------------------------------------------
# https://www.sphinx-doc.org/en/master/usage/configuration.html#project-information
project = 'm2met2'
project = 'MULTI-PARTY MEETING TRANSCRIPTION CHALLENGE 2.0'
copyright = '2023, Speech Lab, Alibaba Group; ASLP Group, Northwestern Polytechnical University'
author = 'Speech Lab, Alibaba Group; Audio, Speech and Language Processing Group, Northwestern Polytechnical University'

Binary file not shown.

Before

Width:  |  Height:  |  Size: 144 KiB

After

Width:  |  Height:  |  Size: 119 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 152 KiB

View File

@ -1,4 +1,4 @@
# Sphinx build info version 1
# This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
config: 5462207d1656a9ae4ca43c2890d094be
config: 06d9c1d4093817b45b9d4df7ab350eaf
tags: 645f666f9bcd5a90fca523b33c5a78b7

Binary file not shown.

Before

Width:  |  Height:  |  Size: 144 KiB

After

Width:  |  Height:  |  Size: 119 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 152 KiB

View File

@ -5,8 +5,8 @@
ASRU 2023 Multi-Channel Multi-Party Meeting Transcription Challenge 2.0
==================================================================================
Building on the success of the previous M2MET challenge, we will continue to hold the M2MET2.0 challenge at ASRU2023.
To push current multi-speaker speech recognition systems towards practical use, the M2MET2.0 challenge will be evaluated on a speaker-attributed task, with two sub-tracks: fixed and open training data.
Building on the success of the previous M2MeT challenge, we will continue to hold the M2MeT2.0 challenge at ASRU 2023.
To push current multi-speaker speech recognition systems towards practical use, the M2MeT2.0 challenge will be evaluated on a speaker-attributed task, with two sub-tracks: fixed and open training data.
We provide a detailed description of the datasets, rules, baseline systems, and evaluation methods to further promote research in multi-speaker speech recognition.
.. toctree::

View File

@ -5,8 +5,29 @@
![model archietecture](images/sa_asr_arch.png)
## 快速开始
#TODO: fill with the README.md of the baseline
首先需要安装FunASR和ModelScope[installation](https://alibaba-damo-academy.github.io/FunASR/en/installation.html))。
基线系统提供训练和测试两个脚本:`run.sh`用于训练基线系统并在M2MeT的验证集与测试集上评估`run_m2met_2023_infer.sh`用于在本次竞赛即将开放的全新测试集上推理,并生成符合竞赛最终提交格式的文件。
在运行 `run.sh`前,需要自行下载并解压[AliMeeting](http://www.openslr.org/119/)数据集并放置于`./dataset`目录下:
```shell
dataset
|—— Eval_Ali_far
|—— Eval_Ali_near
|—— Test_Ali_far
|—— Test_Ali_near
|—— Train_Ali_far
|—— Train_Ali_near
```
在运行`run_m2met_2023_infer.sh`前,需要将测试集`Test_2023_Ali_far`仅包含音频将于6.16发布)放置于`./dataset`目录下。然后将主办方提供的`wav.scp``wav_raw.scp``segments``utt2spk`和`spk2utt`放置于`./data/Test_2023_Ali_far`目录下。
```shell
data/Test_2023_Ali_far
|—— wav.scp
|—— wav_raw.scp
|—— segments
|—— utt2spk
|—— spk2utt
```
更多基线系统详情见[此处](https://github.com/alibaba-damo-academy/FunASR/blob/main/egs/alimeeting/sa-asr/README.md)
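上述两处目录结构可以在运行脚本前用一小段 Python 自检示意代码目录与文件名取自上文`dataset` 与 `data/Test_2023_Ali_far` 为相对工作目录的假设路径,并非官方脚本的一部分):

```python
from pathlib import Path

# run.sh 所需的 AliMeeting 子目录run_m2met_2023_infer.sh 所需的文件(均取自上文)
DATASET_DIRS = [f"{part}_Ali_{side}" for part in ("Eval", "Test", "Train")
                for side in ("far", "near")]
INFER_FILES = ["wav.scp", "wav_raw.scp", "segments", "utt2spk", "spk2utt"]

def check_layout(dataset_root="dataset", infer_root="data/Test_2023_Ali_far"):
    """返回缺失的目录/文件列表;列表为空说明数据已就绪。"""
    missing = [f"{dataset_root}/{d}" for d in DATASET_DIRS
               if not Path(dataset_root, d).is_dir()]
    missing += [f"{infer_root}/{name}" for name in INFER_FILES
                if not Path(infer_root, name).is_file()]
    return missing

if __name__ == "__main__":
    for path in check_layout():
        print("missing:", path)
```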
## 基线结果
基线系统的结果如表3所示。在训练期间说话人档案采用了真实说话人嵌入。然而由于在评估过程中缺乏真实说话人标签因此使用了由额外的谱聚类提供的说话人特征。同时我们还提供了在评估和测试集上使用真实说话人档案的结果以显示说话人档案准确性的影响。
![baseline result](images/baseline_result.png)
![baseline_result](images/baseline_result.png)


@ -1,32 +1,33 @@
# 简介
## 竞赛介绍
语音识别Automatic Speech Recognition、说话人日志Speaker Diarization等语音处理技术的最新发展激发了众多智能语音的广泛应用。然而会议场景由于其复杂的声学条件和不同的讲话风格包括重叠的讲话、不同数量的发言者、大会议室的远场信号以及环境噪声和混响仍然属于一项极具挑战性的任务。
为了推动会议场景语音识别的发展,已经有很多相关的挑战赛,如 Rich Transcription evaluation 和 CHIMEComputational Hearing in Multisource Environments 挑战赛。最新的CHIME挑战赛关注于远距离自动语音识别和开发能在各种不同拓扑结构的阵列和应用场景中通用的系统。然而不同语言之间的差异限制了非英语会议转录的进展。MISPMultimodal Information Based Speech Processing和M2MeTMulti-Channel Multi-Party Meeting Transcription挑战赛为推动普通话会议场景语音识别做出了贡献。MISP挑战赛侧重于用视听多模态的方法解决日常家庭环境中的远距离多麦克风信号处理问题而M2MeT挑战则侧重于解决离线会议室中会议转录的语音重叠问题。
ASSP2022 M2MeT挑战的侧重点是会议场景它包括两个赛道说话人日记和多说话人自动语音识别。前者涉及识别“谁在什么时候说了话”而后者旨在同时识别来自多个说话人的语音语音重叠和各种噪声带来了巨大的技术困难。
ICASSP 2022 M2MeT挑战的侧重点是会议场景它包括两个赛道说话人日志和多说话人自动语音识别。前者涉及识别“谁在什么时候说了话”而后者旨在同时识别来自多个说话人的语音语音重叠和各种噪声带来了巨大的技术困难。
在上一届M2MET成功举办的基础上我们将在ASRU2023上继续举办M2MET2.0挑战赛。在上一届M2MET挑战赛中评估指标是说话人无关的我们只能得到识别文本而不能确定相应的说话人。
为了解决这一局限性并将现在的多说话人语音识别系统推向实用化M2MET2.0挑战赛将在说话人相关的人物上评估并且同时设立限定数据与不限定数据两个子赛道。通过将语音归属于特定的说话人这项任务旨在提高多说话人ASR系统在真实世界环境中的准确性和适用性。
在上一届M2MeT成功举办的基础上我们将在ASRU 2023上继续举办M2MeT2.0挑战赛。在上一届M2MeT挑战赛中评估指标是说话人无关的我们只能得到识别文本而不能确定相应的说话人。
为了解决这一局限性并将现在的多说话人语音识别系统推向实用化M2MeT2.0挑战赛将在说话人相关的任务上评估并且同时设立限定数据与不限定数据两个子赛道。通过将语音归属于特定的说话人这项任务旨在提高多说话人ASR系统在真实世界环境中的准确性和适用性。
我们对数据集、规则、基线系统和评估方法进行了详细介绍以进一步促进多说话人语音识别领域研究的发展。此外我们将根据时间表发布一个全新的测试集包括大约10小时的音频。
## 时间安排(AOE时间)
- $ 2023.4.29: $ 开放注册
- $ 2023.5.8: $ 基线发布
- $ 2023.5.15: $ 注册截止
- $ 2023.6.9: $ 测试集数据发布
- $ 2023.6.13: $ 最终结果提交截止
- $ 2023.6.19: $ 评估结果和排名发布
- $ 2023.7.3: $ 论文提交截止
- $ 2023.7.10: $ 最终版论文提交截止
- $ 2023.12.12: $ ASRU Workshop & challenge session
- $ 2023.5.11: $ 基线发布
- $ 2023.5.22: $ 注册截止
- $ 2023.6.16: $ 测试集数据发布,排行榜开放
- $ 2023.6.20: $ 最终结果提交截止,排行榜关闭
- $ 2023.6.26: $ 评估结果和排名发布
- $ 2023.7.3: $ 论文提交截止通过ASRU2023官方投稿选择竞赛Session
- $ 2023.7.10: $ 最终版论文提交截止通过ASRU2023官方投稿选择竞赛Session
- $ 2023.12.12: $ ASRU Workshop & Challenge Session
## 竞赛报名
来自学术界和工业界的有意向参赛者均应在2023年5月15日及之前填写下方的谷歌表单
来自学术界和工业界的有意向参赛者均应在2023年5月22日及之前填写下方的谷歌表单。同时欢迎广大参赛者加入[官方交流微信群](https://alibaba-damo-academy.github.io/FunASR/m2met2_cn/%E8%81%94%E7%B3%BB%E6%96%B9%E5%BC%8F.html)交流并及时获取竞赛最新消息。
[M2MET2.0报名](https://docs.google.com/forms/d/e/1FAIpQLSf77T9vAl7Ym-u5g8gXu18SBofoWRaFShBo26Ym0-HDxHW9PQ/viewform?usp=sf_link)
[M2MeT2.0报名](https://docs.google.com/forms/d/e/1FAIpQLSf77T9vAl7Ym-u5g8gXu18SBofoWRaFShBo26Ym0-HDxHW9PQ/viewform?usp=sf_link)
主办方将在3个工作日内通过电子邮件通知符合条件的参赛团队团队必须遵守将在挑战网站上发布的挑战规则。在排名发布之前每个参赛者必须提交一份系统描述文件详细说明使用的方法和模型。主办方将选择前三名纳入ASRU2023论文集。
主办方将在3个工作日内通过电子邮件通知符合条件的参赛团队团队必须遵守将在挑战网站上发布的挑战规则。在排名发布之前每个参赛者必须提交一份系统描述文件详细说明使用的方法和模型。主办方将排名前列的队伍纳入ASRU2023论文集。


@ -1,9 +1,9 @@
# 联系方式
如果对M2MET2.0竞赛有任何疑问,欢迎通过以下方式联系我们:
如果对M2MeT2.0竞赛有任何疑问,欢迎通过以下方式联系我们:
- 邮件: [m2met.alimeeting@gmail.com](mailto:m2met.alimeeting@gmail.com)
| M2MET2.0竞赛官方微信群 |
| M2MeT2.0竞赛官方微信群 |
|:------------------------------------------:|
<!-- | <img src="images/wechat.png" width="300"/> | -->
| <img src="images/qrcode.png" width="300"/> |


@ -1,6 +1,6 @@
# 赛道设置与评估
## 说话人相关的语音识别
说话人相关的ASR任务需要从重叠的语音中识别每个说话人的语音并为识别内容分配一个说话人标签。图2展示了说话人相关语音识别任务和多说话人语音识别任务的主要区别。在本次竞赛中AliMeeting、Aishell4和Cn-Celeb数据集可作为受限数据源。在M2MeT挑战赛中使用的AliMeeting数据集包含训练、评估和测试集在M2MET2.0可以在训练和评估中使用。此外一个包含约10小时会议数据的新的Test-2023集将根据赛程安排发布并用于挑战赛的评分和排名。值得注意的是对于Test-2023测试集主办方将不再提供耳机的近场音频、转录以及真实时间戳。而是提供可以通过一个简单的VAD模型得到的包含多个说话人的片段。
说话人相关的ASR任务需要从重叠的语音中识别每个说话人的语音并为识别内容分配一个说话人标签。图2展示了说话人相关语音识别任务和多说话人语音识别任务的主要区别。在本次竞赛中AliMeeting、Aishell4和Cn-Celeb数据集可作为受限数据源。M2MeT挑战赛中使用的AliMeeting数据集包含训练、评估和测试集在M2MeT2.0中可用于训练和评估。此外一个包含约10小时会议数据的全新Test-2023集将根据赛程安排发布并用于挑战赛的评分和排名。值得注意的是对于Test-2023测试集主办方将不再提供耳机的近场音频、转录以及真实时间戳而是提供可以通过一个简单的VAD模型得到的包含多个说话人的片段。
![task difference](images/task_diff.png)
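为帮助理解“说话人相关”的评估思路,下面给出一个极简的 Python 示意对每个说话人拼接其识别文本在所有说话人排列中取总字错误率的最小值即后续评估采用的 cpCER 的基本思想编辑距离用标准动态规划实现此片段仅作说明并非官方评分脚本

```python
from itertools import permutations

def edit_distance(ref, hyp):
    # 标准动态规划编辑距离(插入 / 删除 / 替换各记 1
    n = len(hyp)
    dp = list(range(n + 1))
    for i in range(1, len(ref) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1, dp[j - 1] + 1,
                        prev + (ref[i - 1] != hyp[j - 1]))
            prev = cur
    return dp[n]

def cpcer(refs, hyps):
    """refs/hyps: 每个说话人拼接后的参考 / 识别字符序列列表(长度相同)。
    遍历说话人排列,取总编辑距离最小的对齐,返回百分比形式的错误率。"""
    total = sum(len(r) for r in refs)
    best = min(sum(edit_distance(r, h) for r, h in zip(refs, perm))
               for perm in permutations(hyps))
    return best / total * 100
```

例如,两个说话人的识别结果即使顺序与参考不一致,只要内容正确,`cpcer(["今天开会", "大家好"], ["大家好", "今天开会"])` 仍为 0说话人标签的排列歧义由取最小值消解。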


@ -14,7 +14,7 @@
<link rel="stylesheet" type="text/css" href="_static/css/bootstrap-theme.min.css" />
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>索引 &#8212; m2met2 文档</title>
<title>索引 &#8212; 多通道多方会议转录挑战2.0</title>
<link rel="stylesheet" type="text/css" href="_static/pygments.css" />
<link rel="stylesheet" type="text/css" href="_static/guzzle.css" />
<script data-url_root="./" id="documentation_options" src="_static/documentation_options.js"></script>
@ -36,7 +36,7 @@
<li class="right" style="margin-right: 10px">
<a href="#" title="总索引"
accesskey="I">索引</a></li>
<li class="nav-item nav-item-0"><a href="index.html">m2met2 文档</a> &#187;</li>
<li class="nav-item nav-item-0"><a href="index.html">多通道多方会议转录挑战2.0</a> &#187;</li>
<li class="nav-item nav-item-this"><a href="">索引</a></li>
</ul>
</div>
@ -47,7 +47,7 @@
</div>
<div id="left-column">
<div class="sphinxsidebar"><a href="
index.html" class="text-logo">m2met2 文档</a>
index.html" class="text-logo">多通道多方会议转录挑战2.0</a>
<div class="sidebar-block">
<div class="sidebar-wrapper">
<div id="main-search">
@ -133,7 +133,7 @@
<li class="right" style="margin-right: 10px">
<a href="#" title="总索引"
>索引</a></li>
<li class="nav-item nav-item-0"><a href="index.html">m2met2 文档</a> &#187;</li>
<li class="nav-item nav-item-0"><a href="index.html">多通道多方会议转录挑战2.0</a> &#187;</li>
<li class="nav-item nav-item-this"><a href="">索引</a></li>
</ul>
</div>


@ -15,7 +15,7 @@
<link rel="stylesheet" type="text/css" href="_static/css/bootstrap-theme.min.css" />
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>ASRU 2023 多通道多方会议转录挑战 2.0 &#8212; m2met2 文档</title>
<title>ASRU 2023 多通道多方会议转录挑战 2.0 &#8212; 多通道多方会议转录挑战2.0</title>
<link rel="stylesheet" type="text/css" href="_static/pygments.css" />
<link rel="stylesheet" type="text/css" href="_static/guzzle.css" />
<script data-url_root="./" id="documentation_options" src="_static/documentation_options.js"></script>
@ -41,7 +41,7 @@
<li class="right" >
<a href="%E7%AE%80%E4%BB%8B.html" title="简介"
accesskey="N">下一页</a> |</li>
<li class="nav-item nav-item-0"><a href="#">m2met2 文档</a> &#187;</li>
<li class="nav-item nav-item-0"><a href="#">多通道多方会议转录挑战2.0</a> &#187;</li>
<li class="nav-item nav-item-this"><a href="">ASRU 2023 多通道多方会议转录挑战 2.0</a></li>
</ul>
</div>
@ -52,7 +52,7 @@
</div>
<div id="left-column">
<div class="sphinxsidebar"><a href="
#" class="text-logo">m2met2 文档</a>
#" class="text-logo">多通道多方会议转录挑战2.0</a>
<div class="sidebar-block">
<div class="sidebar-wrapper">
<div id="main-search">
@ -121,8 +121,8 @@
<section id="asru-2023-2-0">
<h1>ASRU 2023 多通道多方会议转录挑战 2.0<a class="headerlink" href="#asru-2023-2-0" title="此标题的永久链接"></a></h1>
<p>在上一届M2MET竞赛成功举办的基础上我们将在ASRU2023上继续举办M2MET2.0挑战赛。
为了将现在的多说话人语音识别系统推向实用化M2MET2.0挑战赛将在说话人相关的人物上评估,并且同时设立限定数据与不限定数据两个子赛道。
<p>在上一届M2MeT竞赛成功举办的基础上我们将在ASRU2023上继续举办M2MeT2.0挑战赛。
为了将现在的多说话人语音识别系统推向实用化M2MeT2.0挑战赛将在说话人相关的任务上评估,并且同时设立限定数据与不限定数据两个子赛道。
我们对数据集、规则、基线系统和评估方法进行了详细介绍,以进一步促进多说话人语音识别领域研究的发展。</p>
<div class="toctree-wrapper compound">
<p class="caption" role="heading"><span class="caption-text">目录:</span></p>
@ -161,7 +161,7 @@
<li class="right" >
<a href="%E7%AE%80%E4%BB%8B.html" title="简介"
>下一页</a> |</li>
<li class="nav-item nav-item-0"><a href="#">m2met2 文档</a> &#187;</li>
<li class="nav-item nav-item-0"><a href="#">多通道多方会议转录挑战2.0</a> &#187;</li>
<li class="nav-item nav-item-this"><a href="">ASRU 2023 多通道多方会议转录挑战 2.0</a></li>
</ul>
</div>


@ -14,7 +14,7 @@
<link rel="stylesheet" type="text/css" href="_static/css/bootstrap-theme.min.css" />
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>搜索 &#8212; m2met2 文档</title>
<title>搜索 &#8212; 多通道多方会议转录挑战2.0</title>
<link rel="stylesheet" type="text/css" href="_static/pygments.css" />
<link rel="stylesheet" type="text/css" href="_static/guzzle.css" />
@ -42,7 +42,7 @@
<li class="right" style="margin-right: 10px">
<a href="genindex.html" title="总索引"
accesskey="I">索引</a></li>
<li class="nav-item nav-item-0"><a href="index.html">m2met2 文档</a> &#187;</li>
<li class="nav-item nav-item-0"><a href="index.html">多通道多方会议转录挑战2.0</a> &#187;</li>
<li class="nav-item nav-item-this"><a href="">搜索</a></li>
</ul>
</div>
@ -53,7 +53,7 @@
</div>
<div id="left-column">
<div class="sphinxsidebar"><a href="
index.html" class="text-logo">m2met2 文档</a>
index.html" class="text-logo">多通道多方会议转录挑战2.0</a>
<div class="sidebar-block">
<div class="sidebar-toc">
@ -149,7 +149,7 @@
<li class="right" style="margin-right: 10px">
<a href="genindex.html" title="总索引"
>索引</a></li>
<li class="nav-item nav-item-0"><a href="index.html">m2met2 文档</a> &#187;</li>
<li class="nav-item nav-item-0"><a href="index.html">多通道多方会议转录挑战2.0</a> &#187;</li>
<li class="nav-item nav-item-this"><a href="">搜索</a></li>
</ul>
</div>


@ -1 +1 @@
Search.setIndex({"docnames": ["index", "\u57fa\u7ebf", "\u6570\u636e\u96c6", "\u7b80\u4ecb", "\u7ec4\u59d4\u4f1a", "\u8054\u7cfb\u65b9\u5f0f", "\u89c4\u5219", "\u8d5b\u9053\u8bbe\u7f6e\u4e0e\u8bc4\u4f30"], "filenames": ["index.rst", "\u57fa\u7ebf.md", "\u6570\u636e\u96c6.md", "\u7b80\u4ecb.md", "\u7ec4\u59d4\u4f1a.md", "\u8054\u7cfb\u65b9\u5f0f.md", "\u89c4\u5219.md", "\u8d5b\u9053\u8bbe\u7f6e\u4e0e\u8bc4\u4f30.md"], "titles": ["ASRU 2023 \u591a\u901a\u9053\u591a\u65b9\u4f1a\u8bae\u8f6c\u5f55\u6311\u6218 2.0", "\u57fa\u7ebf", "\u6570\u636e\u96c6", "\u7b80\u4ecb", "\u7ec4\u59d4\u4f1a", "\u8054\u7cfb\u65b9\u5f0f", "\u7ade\u8d5b\u89c4\u5219", "\u8d5b\u9053\u8bbe\u7f6e\u4e0e\u8bc4\u4f30"], "terms": {"m2met": [0, 3, 5, 7], "asru2023": [0, 3], "m2met2": [0, 3, 5, 7], "funasr": 1, "sa": 1, "asr": [1, 3, 7], "speakerencod": 1, "modelscop": [1, 7], "todo": 1, "fill": 1, "with": 1, "the": 1, "readm": 1, "md": 1, "of": 1, "baselin": [1, 2], "aishel": [2, 7], "cn": [2, 4, 7], "celeb": [2, 7], "test": [2, 6, 7], "2023": [2, 3, 6, 7], "118": 2, "75": 2, "104": 2, "train": 2, "eval": [2, 6], "10": [2, 3, 7], "212": 2, "15": [2, 3], "30": 2, "456": 2, "25": 2, "13": [2, 3], "55": 2, "42": 2, "27": 2, "34": 2, "76": 2, "20": 2, "textgrid": 2, "id": 2, "openslr": 2, "automat": 3, "speech": 3, "recognit": 3, "speaker": 3, "diariz": 3, "rich": 3, "transcript": 3, "evalu": 3, "chime": 3, "comput": 3, "hear": 3, "in": 3, "multisourc": 3, "environ": 3, "misp": 3, "multimod": 3, "inform": 3, "base": 3, "process": 3, "multi": 3, "channel": 3, "parti": 3, "meet": 3, "assp2022": 3, "29": 3, "19": 3, "12": 3, "asru": 3, "workshop": 3, "challeng": 3, "session": 3, "lxie": 4, "nwpu": 4, "edu": 4, "kong": 4, "aik": 4, "lee": 4, "star": 4, "kongaik": 4, "ieee": 4, "org": 4, "zhiji": 4, "yzj": 4, "alibaba": 4, "inc": 4, "com": [4, 5], "sli": 4, "zsl": 4, "yanminqian": 4, "sjtu": 4, "zhuc": 4, "microsoft": 4, "wujian": 4, "ceo": 4, "buhui": 4, "aishelldata": 4, "alimeet": [5, 7], "gmail": 5, 
"cpcer": [6, 7], "las": 6, "rnnt": 6, "transform": 6, "aishell4": 7, "vad": 7, "cer": 7, "ins": 7, "sub": 7, "del": 7, "text": 7, "frac": 7, "mathcal": 7, "n_": 7, "total": 7, "time": 7, "100": 7, "hug": 7, "face": 7}, "objects": {}, "objtypes": {}, "objnames": {}, "titleterms": {"asru": 0, "2023": 0, "alimeet": 2, "aoe": 3}, "envversion": {"sphinx.domains.c": 2, "sphinx.domains.changeset": 1, "sphinx.domains.citation": 1, "sphinx.domains.cpp": 8, "sphinx.domains.index": 1, "sphinx.domains.javascript": 2, "sphinx.domains.math": 2, "sphinx.domains.python": 3, "sphinx.domains.rst": 2, "sphinx.domains.std": 2, "sphinx": 57}, "alltitles": {"ASRU 2023 \u591a\u901a\u9053\u591a\u65b9\u4f1a\u8bae\u8f6c\u5f55\u6311\u6218 2.0": [[0, "asru-2023-2-0"]], "\u76ee\u5f55:": [[0, null]], "\u57fa\u7ebf": [[1, "id1"]], "\u57fa\u7ebf\u6982\u8ff0": [[1, "id2"]], "\u5feb\u901f\u5f00\u59cb": [[1, "id3"]], "\u57fa\u7ebf\u7ed3\u679c": [[1, "id4"]], "\u6570\u636e\u96c6": [[2, "id1"]], "\u6570\u636e\u96c6\u6982\u8ff0": [[2, "id2"]], "Alimeeting\u6570\u636e\u96c6\u4ecb\u7ecd": [[2, "alimeeting"]], "\u83b7\u53d6\u6570\u636e": [[2, "id3"]], "\u7b80\u4ecb": [[3, "id1"]], "\u7ade\u8d5b\u4ecb\u7ecd": [[3, "id2"]], "\u65f6\u95f4\u5b89\u6392(AOE\u65f6\u95f4)": [[3, "aoe"]], "\u7ade\u8d5b\u62a5\u540d": [[3, "id3"]], "\u7ec4\u59d4\u4f1a": [[4, "id1"]], "\u8054\u7cfb\u65b9\u5f0f": [[5, "id1"]], "\u7ade\u8d5b\u89c4\u5219": [[6, "id1"]], "\u8d5b\u9053\u8bbe\u7f6e\u4e0e\u8bc4\u4f30": [[7, "id1"]], "\u8bf4\u8bdd\u4eba\u76f8\u5173\u7684\u8bed\u97f3\u8bc6\u522b": [[7, "id2"]], "\u8bc4\u4f30\u65b9\u6cd5": [[7, "id3"]], "\u5b50\u8d5b\u9053\u8bbe\u7f6e": [[7, "id4"]], "\u5b50\u8d5b\u9053\u4e00 (\u9650\u5b9a\u8bad\u7ec3\u6570\u636e):": [[7, "id5"]], "\u5b50\u8d5b\u9053\u4e8c (\u5f00\u653e\u8bad\u7ec3\u6570\u636e):": [[7, "id6"]]}, "indexentries": {}})
Search.setIndex({"docnames": ["index", "\u57fa\u7ebf", "\u6570\u636e\u96c6", "\u7b80\u4ecb", "\u7ec4\u59d4\u4f1a", "\u8054\u7cfb\u65b9\u5f0f", "\u89c4\u5219", "\u8d5b\u9053\u8bbe\u7f6e\u4e0e\u8bc4\u4f30"], "filenames": ["index.rst", "\u57fa\u7ebf.md", "\u6570\u636e\u96c6.md", "\u7b80\u4ecb.md", "\u7ec4\u59d4\u4f1a.md", "\u8054\u7cfb\u65b9\u5f0f.md", "\u89c4\u5219.md", "\u8d5b\u9053\u8bbe\u7f6e\u4e0e\u8bc4\u4f30.md"], "titles": ["ASRU 2023 \u591a\u901a\u9053\u591a\u65b9\u4f1a\u8bae\u8f6c\u5f55\u6311\u6218 2.0", "\u57fa\u7ebf", "\u6570\u636e\u96c6", "\u7b80\u4ecb", "\u7ec4\u59d4\u4f1a", "\u8054\u7cfb\u65b9\u5f0f", "\u7ade\u8d5b\u89c4\u5219", "\u8d5b\u9053\u8bbe\u7f6e\u4e0e\u8bc4\u4f30"], "terms": {"m2met": [0, 1, 3, 5, 7], "asru2023": [0, 3], "m2met2": [0, 3, 5, 7], "funasr": 1, "sa": 1, "asr": [1, 3, 7], "speakerencod": 1, "modelscop": [1, 7], "instal": 1, "run": 1, "sh": 1, "run_m2met_2023_inf": 1, "alimeet": [1, 5, 7], "dataset": 1, "eval_ali_far": 1, "eval_ali_near": 1, "test_ali_far": 1, "test_ali_near": 1, "train_ali_far": 1, "train_ali_near": 1, "test_2023_ali_far": 1, "16": [1, 3], "wav": 1, "scp": 1, "wav_raw": 1, "segment": 1, "utt2spk": 1, "spk2utt": 1, "data": 1, "aishel": [2, 7], "cn": [2, 4, 7], "celeb": [2, 7], "test": [2, 6, 7], "2023": [2, 3, 6, 7], "118": 2, "75": 2, "104": 2, "train": 2, "eval": [2, 6], "10": [2, 3, 7], "212": 2, "15": 2, "30": 2, "456": 2, "25": 2, "13": 2, "55": 2, "42": 2, "27": 2, "34": 2, "76": 2, "20": [2, 3], "textgrid": 2, "id": 2, "openslr": 2, "baselin": 2, "automat": 3, "speech": 3, "recognit": 3, "speaker": 3, "diariz": 3, "rich": 3, "transcript": 3, "evalu": 3, "chime": 3, "comput": 3, "hear": 3, "in": 3, "multisourc": 3, "environ": 3, "misp": 3, "multimod": 3, "inform": 3, "base": 3, "process": 3, "multi": 3, "channel": 3, "parti": 3, "meet": 3, "iassp2022": 3, "asru": 3, "29": 3, "11": 3, "22": 3, "26": 3, "session": 3, "12": 3, "workshop": 3, "challeng": 3, "lxie": 4, "nwpu": 4, "edu": 4, "kong": 4, "aik": 4, "lee": 
4, "star": 4, "kongaik": 4, "ieee": 4, "org": 4, "zhiji": 4, "yzj": 4, "alibaba": 4, "inc": 4, "com": [4, 5], "sli": 4, "zsl": 4, "yanminqian": 4, "sjtu": 4, "zhuc": 4, "microsoft": 4, "wujian": 4, "ceo": 4, "buhui": 4, "aishelldata": 4, "gmail": 5, "cpcer": [6, 7], "las": 6, "rnnt": 6, "transform": 6, "aishell4": 7, "vad": 7, "cer": 7, "ins": 7, "sub": 7, "del": 7, "text": 7, "frac": 7, "mathcal": 7, "n_": 7, "total": 7, "time": 7, "100": 7, "hug": 7, "face": 7}, "objects": {}, "objtypes": {}, "objnames": {}, "titleterms": {"asru": 0, "2023": 0, "alimeet": 2, "aoe": 3}, "envversion": {"sphinx.domains.c": 2, "sphinx.domains.changeset": 1, "sphinx.domains.citation": 1, "sphinx.domains.cpp": 8, "sphinx.domains.index": 1, "sphinx.domains.javascript": 2, "sphinx.domains.math": 2, "sphinx.domains.python": 3, "sphinx.domains.rst": 2, "sphinx.domains.std": 2, "sphinx": 57}, "alltitles": {"ASRU 2023 \u591a\u901a\u9053\u591a\u65b9\u4f1a\u8bae\u8f6c\u5f55\u6311\u6218 2.0": [[0, "asru-2023-2-0"]], "\u76ee\u5f55:": [[0, null]], "\u57fa\u7ebf": [[1, "id1"]], "\u57fa\u7ebf\u6982\u8ff0": [[1, "id2"]], "\u5feb\u901f\u5f00\u59cb": [[1, "id3"]], "\u57fa\u7ebf\u7ed3\u679c": [[1, "id4"]], "\u6570\u636e\u96c6": [[2, "id1"]], "\u6570\u636e\u96c6\u6982\u8ff0": [[2, "id2"]], "Alimeeting\u6570\u636e\u96c6\u4ecb\u7ecd": [[2, "alimeeting"]], "\u83b7\u53d6\u6570\u636e": [[2, "id3"]], "\u7b80\u4ecb": [[3, "id1"]], "\u7ade\u8d5b\u4ecb\u7ecd": [[3, "id2"]], "\u65f6\u95f4\u5b89\u6392(AOE\u65f6\u95f4)": [[3, "aoe"]], "\u7ade\u8d5b\u62a5\u540d": [[3, "id3"]], "\u7ec4\u59d4\u4f1a": [[4, "id1"]], "\u8054\u7cfb\u65b9\u5f0f": [[5, "id1"]], "\u7ade\u8d5b\u89c4\u5219": [[6, "id1"]], "\u8d5b\u9053\u8bbe\u7f6e\u4e0e\u8bc4\u4f30": [[7, "id1"]], "\u8bf4\u8bdd\u4eba\u76f8\u5173\u7684\u8bed\u97f3\u8bc6\u522b": [[7, "id2"]], "\u8bc4\u4f30\u65b9\u6cd5": [[7, "id3"]], "\u5b50\u8d5b\u9053\u8bbe\u7f6e": [[7, "id4"]], "\u5b50\u8d5b\u9053\u4e00 (\u9650\u5b9a\u8bad\u7ec3\u6570\u636e):": [[7, "id5"]], 
"\u5b50\u8d5b\u9053\u4e8c (\u5f00\u653e\u8bad\u7ec3\u6570\u636e):": [[7, "id6"]]}, "indexentries": {}})


@ -15,7 +15,7 @@
<link rel="stylesheet" type="text/css" href="_static/css/bootstrap-theme.min.css" />
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>基线 &#8212; m2met2 文档</title>
<title>基线 &#8212; 多通道多方会议转录挑战2.0</title>
<link rel="stylesheet" type="text/css" href="_static/pygments.css" />
<link rel="stylesheet" type="text/css" href="_static/guzzle.css" />
<script data-url_root="./" id="documentation_options" src="_static/documentation_options.js"></script>
@ -45,7 +45,7 @@
<li class="right" >
<a href="%E8%B5%9B%E9%81%93%E8%AE%BE%E7%BD%AE%E4%B8%8E%E8%AF%84%E4%BC%B0.html" title="赛道设置与评估"
accesskey="P">上一页</a> |</li>
<li class="nav-item nav-item-0"><a href="index.html">m2met2 文档</a> &#187;</li>
<li class="nav-item nav-item-0"><a href="index.html">多通道多方会议转录挑战2.0</a> &#187;</li>
<li class="nav-item nav-item-this"><a href="">基线</a></li>
</ul>
</div>
@ -56,7 +56,7 @@
</div>
<div id="left-column">
<div class="sphinxsidebar"><a href="
index.html" class="text-logo">m2met2 文档</a>
index.html" class="text-logo">多通道多方会议转录挑战2.0</a>
<div class="sidebar-block">
<div class="sidebar-wrapper">
<div id="main-search">
@ -132,12 +132,33 @@
</section>
<section id="id3">
<h2>快速开始<a class="headerlink" href="#id3" title="此标题的永久链接"></a></h2>
<p>#TODO: fill with the README.md of the baseline</p>
<p>首先需要安装FunASR和ModelScope<a class="reference external" href="https://alibaba-damo-academy.github.io/FunASR/en/installation.html">installation</a>)。<br />
基线系统提供训练和测试两个脚本:<code class="docutils literal notranslate"><span class="pre">run.sh</span></code>用于训练基线系统并在M2MeT的验证集与测试集上评估<code class="docutils literal notranslate"><span class="pre">run_m2met_2023_infer.sh</span></code>用于在本次竞赛即将开放的全新测试集上推理,并生成符合竞赛最终提交格式的文件。
在运行 <code class="docutils literal notranslate"><span class="pre">run.sh</span></code>前,需要自行下载并解压<a class="reference external" href="http://www.openslr.org/119/">AliMeeting</a>数据集并放置于<code class="docutils literal notranslate"><span class="pre">./dataset</span></code>目录下:</p>
<div class="highlight-shell notranslate"><div class="highlight"><pre><span></span>dataset
<span class="p">|</span>——<span class="w"> </span>Eval_Ali_far
<span class="p">|</span>——<span class="w"> </span>Eval_Ali_near
<span class="p">|</span>——<span class="w"> </span>Test_Ali_far
<span class="p">|</span>——<span class="w"> </span>Test_Ali_near
<span class="p">|</span>——<span class="w"> </span>Train_Ali_far
<span class="p">|</span>——<span class="w"> </span>Train_Ali_near
</pre></div>
</div>
<p>在运行<code class="docutils literal notranslate"><span class="pre">run_m2met_2023_infer.sh</span></code>前, 需要将测试集<code class="docutils literal notranslate"><span class="pre">Test_2023_Ali_far</span></code>仅包含音频将于6.16发布)放置于<code class="docutils literal notranslate"><span class="pre">./dataset</span></code>目录下。然后将主办方提供的<code class="docutils literal notranslate"><span class="pre">wav.scp</span></code><code class="docutils literal notranslate"><span class="pre">wav_raw.scp</span></code><code class="docutils literal notranslate"><span class="pre">segments</span></code><code class="docutils literal notranslate"><span class="pre">utt2spk</span></code><code class="docutils literal notranslate"><span class="pre">spk2utt</span></code>放置于<code class="docutils literal notranslate"><span class="pre">./data/Test_2023_Ali_far</span></code>目录下。</p>
<div class="highlight-shell notranslate"><div class="highlight"><pre><span></span>data/Test_2023_Ali_far
<span class="p">|</span>——<span class="w"> </span>wav.scp
<span class="p">|</span>——<span class="w"> </span>wav_raw.scp
<span class="p">|</span>——<span class="w"> </span>segments
<span class="p">|</span>——<span class="w"> </span>utt2spk
<span class="p">|</span>——<span class="w"> </span>spk2utt
</pre></div>
</div>
<p>更多基线系统详情见<a class="reference external" href="https://github.com/alibaba-damo-academy/FunASR/blob/main/egs/alimeeting/sa-asr/README.md">此处</a></p>
</section>
<section id="id4">
<h2>基线结果<a class="headerlink" href="#id4" title="此标题的永久链接"></a></h2>
<p>基线系统的结果如表3所示。在训练期间说话人档案采用了真实说话人嵌入。然而由于在评估过程中缺乏真实说话人标签因此使用了由额外的谱聚类提供的说话人特征。同时我们还提供了在评估和测试集上使用真实说话人档案的结果以显示说话人档案准确性的影响。
<img alt="baseline result" src="_images/baseline_result.png" /></p>
<p>基线系统的结果如表3所示。在训练期间说话人档案采用了真实说话人嵌入。然而由于在评估过程中缺乏真实说话人标签因此使用了由额外的谱聚类提供的说话人特征。同时我们还提供了在评估和测试集上使用真实说话人档案的结果以显示说话人档案准确性的影响。</p>
<p><img alt="baseline_result" src="_images/baseline_result.png" /></p>
</section>
</section>
@ -171,7 +192,7 @@
<li class="right" >
<a href="%E8%B5%9B%E9%81%93%E8%AE%BE%E7%BD%AE%E4%B8%8E%E8%AF%84%E4%BC%B0.html" title="赛道设置与评估"
>上一页</a> |</li>
<li class="nav-item nav-item-0"><a href="index.html">m2met2 文档</a> &#187;</li>
<li class="nav-item nav-item-0"><a href="index.html">多通道多方会议转录挑战2.0</a> &#187;</li>
<li class="nav-item nav-item-this"><a href="">基线</a></li>
</ul>
</div>


@ -15,7 +15,7 @@
<link rel="stylesheet" type="text/css" href="_static/css/bootstrap-theme.min.css" />
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>数据集 &#8212; m2met2 文档</title>
<title>数据集 &#8212; 多通道多方会议转录挑战2.0</title>
<link rel="stylesheet" type="text/css" href="_static/pygments.css" />
<link rel="stylesheet" type="text/css" href="_static/guzzle.css" />
<script data-url_root="./" id="documentation_options" src="_static/documentation_options.js"></script>
@ -45,7 +45,7 @@
<li class="right" >
<a href="%E7%AE%80%E4%BB%8B.html" title="简介"
accesskey="P">上一页</a> |</li>
<li class="nav-item nav-item-0"><a href="index.html">m2met2 文档</a> &#187;</li>
<li class="nav-item nav-item-0"><a href="index.html">多通道多方会议转录挑战2.0</a> &#187;</li>
<li class="nav-item nav-item-this"><a href="">数据集</a></li>
</ul>
</div>
@ -56,7 +56,7 @@
</div>
<div id="left-column">
<div class="sphinxsidebar"><a href="
index.html" class="text-logo">m2met2 文档</a>
index.html" class="text-logo">多通道多方会议转录挑战2.0</a>
<div class="sidebar-block">
<div class="sidebar-wrapper">
<div id="main-search">
@ -180,7 +180,7 @@ Test-2023测试集由20场会议组成这些会议是在与AliMeeting数据
<li class="right" >
<a href="%E7%AE%80%E4%BB%8B.html" title="简介"
>上一页</a> |</li>
<li class="nav-item nav-item-0"><a href="index.html">m2met2 文档</a> &#187;</li>
<li class="nav-item nav-item-0"><a href="index.html">多通道多方会议转录挑战2.0</a> &#187;</li>
<li class="nav-item nav-item-this"><a href="">数据集</a></li>
</ul>
</div>


@ -15,7 +15,7 @@
<link rel="stylesheet" type="text/css" href="_static/css/bootstrap-theme.min.css" />
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>简介 &#8212; m2met2 文档</title>
<title>简介 &#8212; 多通道多方会议转录挑战2.0</title>
<link rel="stylesheet" type="text/css" href="_static/pygments.css" />
<link rel="stylesheet" type="text/css" href="_static/guzzle.css" />
<script data-url_root="./" id="documentation_options" src="_static/documentation_options.js"></script>
@ -46,7 +46,7 @@
<li class="right" >
<a href="index.html" title="ASRU 2023 多通道多方会议转录挑战 2.0"
accesskey="P">上一页</a> |</li>
<li class="nav-item nav-item-0"><a href="index.html">m2met2 文档</a> &#187;</li>
<li class="nav-item nav-item-0"><a href="index.html">多通道多方会议转录挑战2.0</a> &#187;</li>
<li class="nav-item nav-item-this"><a href="">简介</a></li>
</ul>
</div>
@ -57,7 +57,7 @@
</div>
<div id="left-column">
<div class="sphinxsidebar"><a href="
index.html" class="text-logo">m2met2 文档</a>
index.html" class="text-logo">多通道多方会议转录挑战2.0</a>
<div class="sidebar-block">
<div class="sidebar-wrapper">
<div id="main-search">
@ -130,30 +130,30 @@
<h2>竞赛介绍<a class="headerlink" href="#id2" title="此标题的永久链接"></a></h2>
<p>语音识别Automatic Speech Recognition、说话人日志Speaker Diarization等语音处理技术的最新发展激发了众多智能语音的广泛应用。然而会议场景由于其复杂的声学条件和不同的讲话风格包括重叠的讲话、不同数量的发言者、大会议室的远场信号以及环境噪声和混响仍然属于一项极具挑战性的任务。</p>
<p>为了推动会议场景语音识别的发展,已经有很多相关的挑战赛,如 Rich Transcription evaluation 和 CHIMEComputational Hearing in Multisource Environments 挑战赛。最新的CHIME挑战赛关注于远距离自动语音识别和开发能在各种不同拓扑结构的阵列和应用场景中通用的系统。然而不同语言之间的差异限制了非英语会议转录的进展。MISPMultimodal Information Based Speech Processing和M2MeTMulti-Channel Multi-Party Meeting Transcription挑战赛为推动普通话会议场景语音识别做出了贡献。MISP挑战赛侧重于用视听多模态的方法解决日常家庭环境中的远距离多麦克风信号处理问题而M2MeT挑战则侧重于解决离线会议室中会议转录的语音重叠问题。</p>
<p>ASSP2022 M2MeT挑战的侧重点是会议场景它包括两个赛道说话人日记和多说话人自动语音识别。前者涉及识别“谁在什么时候说了话”而后者旨在同时识别来自多个说话人的语音语音重叠和各种噪声带来了巨大的技术困难。</p>
<p>在上一届M2MET成功举办的基础上我们将在ASRU2023上继续举办M2MET2.0挑战赛。在上一届M2MET挑战赛中评估指标是说话人无关的我们只能得到识别文本而不能确定相应的说话人。
为了解决这一局限性并将现在的多说话人语音识别系统推向实用化M2MET2.0挑战赛将在说话人相关的人物上评估并且同时设立限定数据与不限定数据两个子赛道。通过将语音归属于特定的说话人这项任务旨在提高多说话人ASR系统在真实世界环境中的准确性和适用性。
<p>ICASSP 2022 M2MeT挑战的侧重点是会议场景它包括两个赛道说话人日志和多说话人自动语音识别。前者涉及识别“谁在什么时候说了话”而后者旨在同时识别来自多个说话人的语音语音重叠和各种噪声带来了巨大的技术困难。</p>
<p>在上一届M2MeT成功举办的基础上我们将在ASRU 2023上继续举办M2MeT2.0挑战赛。在上一届M2MeT挑战赛中评估指标是说话人无关的我们只能得到识别文本而不能确定相应的说话人。
为了解决这一局限性并将现在的多说话人语音识别系统推向实用化M2MeT2.0挑战赛将在说话人相关的任务上评估并且同时设立限定数据与不限定数据两个子赛道。通过将语音归属于特定的说话人这项任务旨在提高多说话人ASR系统在真实世界环境中的准确性和适用性。
我们对数据集、规则、基线系统和评估方法进行了详细介绍以进一步促进多说话人语音识别领域研究的发展。此外我们将根据时间表发布一个全新的测试集包括大约10小时的音频。</p>
</section>
<section id="aoe">
<h2>时间安排(AOE时间)<a class="headerlink" href="#aoe" title="此标题的永久链接"></a></h2>
<ul class="simple">
<li><p><span class="math notranslate nohighlight">\( 2023.4.29: \)</span> 开放注册</p></li>
<li><p><span class="math notranslate nohighlight">\( 2023.5.8: \)</span> 基线发布</p></li>
<li><p><span class="math notranslate nohighlight">\( 2023.5.15: \)</span> 注册截止</p></li>
<li><p><span class="math notranslate nohighlight">\( 2023.6.9: \)</span> 测试集数据发布</p></li>
<li><p><span class="math notranslate nohighlight">\( 2023.6.13: \)</span> 最终结果提交截止</p></li>
<li><p><span class="math notranslate nohighlight">\( 2023.6.19: \)</span> 评估结果和排名发布</p></li>
<li><p><span class="math notranslate nohighlight">\( 2023.7.3: \)</span> 论文提交截止</p></li>
<li><p><span class="math notranslate nohighlight">\( 2023.7.10: \)</span> 最终版论文提交截止</p></li>
<li><p><span class="math notranslate nohighlight">\( 2023.12.12: \)</span> ASRU Workshop &amp; challenge session</p></li>
<li><p><span class="math notranslate nohighlight">\( 2023.5.11: \)</span> 基线发布</p></li>
<li><p><span class="math notranslate nohighlight">\( 2023.5.22: \)</span> 注册截止</p></li>
<li><p><span class="math notranslate nohighlight">\( 2023.6.16: \)</span> 测试集数据发布,排行榜开放</p></li>
<li><p><span class="math notranslate nohighlight">\( 2023.6.20: \)</span> 最终结果提交截止,排行榜关闭</p></li>
<li><p><span class="math notranslate nohighlight">\( 2023.6.26: \)</span> 评估结果和排名发布</p></li>
<li><p><span class="math notranslate nohighlight">\( 2023.7.3: \)</span> 论文提交截止通过ASRU2023官方投稿选择竞赛Session</p></li>
<li><p><span class="math notranslate nohighlight">\( 2023.7.10: \)</span> 最终版论文提交截止通过ASRU2023官方投稿选择竞赛Session</p></li>
<li><p><span class="math notranslate nohighlight">\( 2023.12.12: \)</span> ASRU Workshop &amp; Challenge Session</p></li>
</ul>
</section>
<section id="id3">
<h2>竞赛报名<a class="headerlink" href="#id3" title="此标题的永久链接"></a></h2>
<p>来自学术界和工业界的有意向参赛者均应在2023年5月15日及之前填写下方的谷歌表单</p>
<p><a class="reference external" href="https://docs.google.com/forms/d/e/1FAIpQLSf77T9vAl7Ym-u5g8gXu18SBofoWRaFShBo26Ym0-HDxHW9PQ/viewform?usp=sf_link">M2MET2.0报名</a></p>
<p>主办方将在3个工作日内通过电子邮件通知符合条件的参赛团队团队必须遵守将在挑战网站上发布的挑战规则。在排名发布之前每个参赛者必须提交一份系统描述文件详细说明使用的方法和模型。主办方将选择前三名纳入ASRU2023论文集。</p>
<p>来自学术界和工业界的有意向参赛者均应在2023年5月22日及之前填写下方的谷歌表单。同时欢迎广大参赛者加入<a class="reference external" href="https://alibaba-damo-academy.github.io/FunASR/m2met2_cn/%E8%81%94%E7%B3%BB%E6%96%B9%E5%BC%8F.html">官方交流微信群</a>交流并及时获取竞赛最新消息。</p>
<p><a class="reference external" href="https://docs.google.com/forms/d/e/1FAIpQLSf77T9vAl7Ym-u5g8gXu18SBofoWRaFShBo26Ym0-HDxHW9PQ/viewform?usp=sf_link">M2MeT2.0报名</a></p>
<p>主办方将在3个工作日内通过电子邮件通知符合条件的参赛团队团队必须遵守将在挑战网站上发布的挑战规则。在排名发布之前每个参赛者必须提交一份系统描述文件详细说明使用的方法和模型。主办方将排名前列的队伍纳入ASRU2023论文集。</p>
</section>
</section>
@ -187,7 +187,7 @@
<li class="right" >
<a href="index.html" title="ASRU 2023 多通道多方会议转录挑战 2.0"
>上一页</a> |</li>
<li class="nav-item nav-item-0"><a href="index.html">m2met2 文档</a> &#187;</li>
<li class="nav-item nav-item-0"><a href="index.html">多通道多方会议转录挑战2.0</a> &#187;</li>
<li class="nav-item nav-item-this"><a href="">简介</a></li>
</ul>
</div>


@ -15,7 +15,7 @@
<link rel="stylesheet" type="text/css" href="_static/css/bootstrap-theme.min.css" />
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>组委会 &#8212; m2met2 文档</title>
<title>组委会 &#8212; 多通道多方会议转录挑战2.0</title>
<link rel="stylesheet" type="text/css" href="_static/pygments.css" />
<link rel="stylesheet" type="text/css" href="_static/guzzle.css" />
<script data-url_root="./" id="documentation_options" src="_static/documentation_options.js"></script>
@ -45,7 +45,7 @@
<li class="right" >
<a href="%E8%A7%84%E5%88%99.html" title="竞赛规则"
accesskey="P">上一页</a> |</li>
<li class="nav-item nav-item-0"><a href="index.html">m2met2 文档</a> &#187;</li>
<li class="nav-item nav-item-0"><a href="index.html">多通道多方会议转录挑战2.0</a> &#187;</li>
<li class="nav-item nav-item-this"><a href="">组委会</a></li>
</ul>
</div>
@ -56,7 +56,7 @@
</div>
<div id="left-column">
<div class="sphinxsidebar"><a href="
index.html" class="text-logo">m2met2 文档</a>
index.html" class="text-logo">多通道多方会议转录挑战2.0</a>
<div class="sidebar-block">
<div class="sidebar-wrapper">
<div id="main-search">
@ -188,7 +188,7 @@
<li class="right" >
<a href="%E8%A7%84%E5%88%99.html" title="竞赛规则"
>上一页</a> |</li>
<li class="nav-item nav-item-0"><a href="index.html">m2met2 文档</a> &#187;</li>
<li class="nav-item nav-item-0"><a href="index.html">多通道多方会议转录挑战2.0</a> &#187;</li>
<li class="nav-item nav-item-this"><a href="">组委会</a></li>
</ul>
</div>


@ -15,7 +15,7 @@
<link rel="stylesheet" type="text/css" href="_static/css/bootstrap-theme.min.css" />
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>联系方式 &#8212; m2met2 文档</title>
<title>联系方式 &#8212; 多通道多方会议转录挑战2.0</title>
<link rel="stylesheet" type="text/css" href="_static/pygments.css" />
<link rel="stylesheet" type="text/css" href="_static/guzzle.css" />
<script data-url_root="./" id="documentation_options" src="_static/documentation_options.js"></script>
@ -41,7 +41,7 @@
<li class="right" >
<a href="%E7%BB%84%E5%A7%94%E4%BC%9A.html" title="组委会"
accesskey="P">上一页</a> |</li>
<li class="nav-item nav-item-0"><a href="index.html">m2met2 文档</a> &#187;</li>
<li class="nav-item nav-item-0"><a href="index.html">多通道多方会议转录挑战2.0</a> &#187;</li>
<li class="nav-item nav-item-this"><a href="">联系方式</a></li>
</ul>
</div>
@ -52,7 +52,7 @@
</div>
<div id="left-column">
<div class="sphinxsidebar"><a href="
index.html" class="text-logo">m2met2 文档</a>
index.html" class="text-logo">多通道多方会议转录挑战2.0</a>
<div class="sidebar-block">
<div class="sidebar-wrapper">
<div id="main-search">
@ -121,17 +121,20 @@
<section id="id1">
<h1>联系方式<a class="headerlink" href="#id1" title="此标题的永久链接"></a></h1>
<p>如果对M2MET2.0竞赛有任何疑问,欢迎通过以下方式联系我们:</p>
<p>如果对M2MeT2.0竞赛有任何疑问,欢迎通过以下方式联系我们:</p>
<ul class="simple">
<li><p>邮件: <a class="reference external" href="mailto:m2met&#46;alimeeting&#37;&#52;&#48;gmail&#46;com">m2met<span>&#46;</span>alimeeting<span>&#64;</span>gmail<span>&#46;</span>com</a></p></li>
</ul>
<table class="docutils align-default">
<thead>
<tr class="row-odd"><th class="head text-center"><p>M2MET2.0竞赛官方微信群</p></th>
<tr class="row-odd"><th class="head text-center"><p>M2MeT2.0竞赛官方微信群</p></th>
</tr>
</thead>
<tbody>
<tr class="row-even"><td class="text-center"><p><a class="reference internal" href="_images/qrcode.png"><img alt="_images/qrcode.png" src="_images/qrcode.png" style="width: 300px;" /></a></p></td>
</tr>
</tbody>
</table>
<!-- | <img src="images/wechat.png" width="300"/> | -->
</section>
@ -158,7 +161,7 @@
<li class="right" >
<a href="%E7%BB%84%E5%A7%94%E4%BC%9A.html" title="组委会"
>上一页</a> |</li>
<li class="nav-item nav-item-0"><a href="index.html">m2met2 文档</a> &#187;</li>
<li class="nav-item nav-item-0"><a href="index.html">多通道多方会议转录挑战2.0</a> &#187;</li>
<li class="nav-item nav-item-this"><a href="">联系方式</a></li>
</ul>
</div>

View File

@ -15,7 +15,7 @@
<link rel="stylesheet" type="text/css" href="_static/css/bootstrap-theme.min.css" />
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>竞赛规则 &#8212; m2met2 文档</title>
<title>竞赛规则 &#8212; 多通道多方会议转录挑战2.0</title>
<link rel="stylesheet" type="text/css" href="_static/pygments.css" />
<link rel="stylesheet" type="text/css" href="_static/guzzle.css" />
<script data-url_root="./" id="documentation_options" src="_static/documentation_options.js"></script>
@ -45,7 +45,7 @@
<li class="right" >
<a href="%E5%9F%BA%E7%BA%BF.html" title="基线"
accesskey="P">上一页</a> |</li>
<li class="nav-item nav-item-0"><a href="index.html">m2met2 文档</a> &#187;</li>
<li class="nav-item nav-item-0"><a href="index.html">多通道多方会议转录挑战2.0</a> &#187;</li>
<li class="nav-item nav-item-this"><a href="">竞赛规则</a></li>
</ul>
</div>
@ -56,7 +56,7 @@
</div>
<div id="left-column">
<div class="sphinxsidebar"><a href="
index.html" class="text-logo">m2met2 文档</a>
index.html" class="text-logo">多通道多方会议转录挑战2.0</a>
<div class="sidebar-block">
<div class="sidebar-wrapper">
<div id="main-search">
@ -166,7 +166,7 @@
<li class="right" >
<a href="%E5%9F%BA%E7%BA%BF.html" title="基线"
>上一页</a> |</li>
<li class="nav-item nav-item-0"><a href="index.html">m2met2 文档</a> &#187;</li>
<li class="nav-item nav-item-0"><a href="index.html">多通道多方会议转录挑战2.0</a> &#187;</li>
<li class="nav-item nav-item-this"><a href="">竞赛规则</a></li>
</ul>
</div>

View File

@ -15,7 +15,7 @@
<link rel="stylesheet" type="text/css" href="_static/css/bootstrap-theme.min.css" />
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>赛道设置与评估 &#8212; m2met2 文档</title>
<title>赛道设置与评估 &#8212; 多通道多方会议转录挑战2.0</title>
<link rel="stylesheet" type="text/css" href="_static/pygments.css" />
<link rel="stylesheet" type="text/css" href="_static/guzzle.css" />
<script data-url_root="./" id="documentation_options" src="_static/documentation_options.js"></script>
@ -46,7 +46,7 @@
<li class="right" >
<a href="%E6%95%B0%E6%8D%AE%E9%9B%86.html" title="数据集"
accesskey="P">上一页</a> |</li>
<li class="nav-item nav-item-0"><a href="index.html">m2met2 文档</a> &#187;</li>
<li class="nav-item nav-item-0"><a href="index.html">多通道多方会议转录挑战2.0</a> &#187;</li>
<li class="nav-item nav-item-this"><a href="">赛道设置与评估</a></li>
</ul>
</div>
@ -57,7 +57,7 @@
</div>
<div id="left-column">
<div class="sphinxsidebar"><a href="
index.html" class="text-logo">m2met2 文档</a>
index.html" class="text-logo">多通道多方会议转录挑战2.0</a>
<div class="sidebar-block">
<div class="sidebar-wrapper">
<div id="main-search">
@ -128,7 +128,7 @@
<h1>赛道设置与评估<a class="headerlink" href="#id1" title="此标题的永久链接"></a></h1>
<section id="id2">
<h2>说话人相关的语音识别<a class="headerlink" href="#id2" title="此标题的永久链接"></a></h2>
<p>说话人相关的ASR任务需要从重叠的语音中识别每个说话人的语音并为识别内容分配一个说话人标签。图2展示了说话人相关语音识别任务和多说话人语音识别任务的主要区别。在本次竞赛中AliMeeting、Aishell4和Cn-Celeb数据集可作为受限数据源。在M2MeT挑战赛中使用的AliMeeting数据集包含训练、评估和测试集在M2MET2.0可以在训练和评估中使用。此外一个包含约10小时会议数据的新的Test-2023集将根据赛程安排发布并用于挑战赛的评分和排名。值得注意的是对于Test-2023测试集主办方将不再提供耳机的近场音频、转录以及真实时间戳。而是提供可以通过一个简单的VAD模型得到的包含多个说话人的片段。</p>
<p>说话人相关的ASR任务需要从重叠的语音中识别每个说话人的语音,并为识别内容分配一个说话人标签。图2展示了说话人相关语音识别任务和多说话人语音识别任务的主要区别。在本次竞赛中,AliMeeting、Aishell4和Cn-Celeb数据集可作为受限数据源。在M2MeT挑战赛中使用的AliMeeting数据集包含训练、评估和测试集,在M2MeT2.0中可以在训练和评估中使用。此外,一个包含约10小时会议数据的新的Test-2023集将根据赛程安排发布,并用于挑战赛的评分和排名。值得注意的是,对于Test-2023测试集,主办方将不再提供耳机的近场音频、转录以及真实时间戳,而是提供可以通过一个简单的VAD模型得到的包含多个说话人的片段。</p>
<p><img alt="task difference" src="_images/task_diff.png" /></p>
</section>
<section id="id3">
@ -181,7 +181,7 @@
<li class="right" >
<a href="%E6%95%B0%E6%8D%AE%E9%9B%86.html" title="数据集"
>上一页</a> |</li>
<li class="nav-item nav-item-0"><a href="index.html">m2met2 文档</a> &#187;</li>
<li class="nav-item nav-item-0"><a href="index.html">多通道多方会议转录挑战2.0</a> &#187;</li>
<li class="nav-item nav-item-this"><a href="">赛道设置与评估</a></li>
</ul>
</div>

View File

@ -7,7 +7,7 @@ import guzzle_sphinx_theme
# -- Project information -----------------------------------------------------
# https://www.sphinx-doc.org/en/master/usage/configuration.html#project-information
project = 'm2met2'
project = '多通道多方会议转录挑战2.0'
copyright = '2023, Speech Lab, Alibaba Group; ASLP Group, Northwestern Polytechnical University'
author = 'Speech Lab, Alibaba Group; Audio, Speech and Language Processing Group, Northwestern Polytechnical University'

Binary file not shown.

Before

Width:  |  Height:  |  Size: 144 KiB

After

Width:  |  Height:  |  Size: 119 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 152 KiB

View File

@ -5,8 +5,8 @@
ASRU 2023 多通道多方会议转录挑战 2.0
==================================================================================
在上一届M2MET竞赛成功举办的基础上我们将在ASRU2023上继续举办M2MET2.0挑战赛。
为了将现在的多说话人语音识别系统推向实用化M2MET2.0挑战赛将在说话人相关的人物上评估,并且同时设立限定数据与不限定数据两个子赛道。
在上一届M2MeT竞赛成功举办的基础上我们将在ASRU2023上继续举办M2MeT2.0挑战赛。
为了将现在的多说话人语音识别系统推向实用化,M2MeT2.0挑战赛将在说话人相关的任务上评估,并且同时设立限定数据与不限定数据两个子赛道。
我们对数据集、规则、基线系统和评估方法进行了详细介绍,以进一步促进多说话人语音识别领域研究的发展。
.. toctree::

View File

@ -5,8 +5,29 @@
![model archietecture](images/sa_asr_arch.png)
## 快速开始
#TODO: fill with the README.md of the baseline
首先需要安装FunASR和ModelScope([installation](https://alibaba-damo-academy.github.io/FunASR/en/installation.html))。
基线系统有训练和测试两个脚本:`run.sh`用于训练基线系统并在M2MeT的验证与测试集上评估,而`run_m2met_2023_infer.sh`用于在此次竞赛即将开放的全新测试集上测试,同时生成符合竞赛最终提交格式的文件。
在运行 `run.sh`前,需要自行下载并解压[AliMeeting](http://www.openslr.org/119/)数据集并放置于`./dataset`目录下:
```shell
dataset
|—— Eval_Ali_far
|—— Eval_Ali_near
|—— Test_Ali_far
|—— Test_Ali_near
|—— Train_Ali_far
|—— Train_Ali_near
```
在运行`run_m2met_2023_infer.sh`前,需要将测试集`Test_2023_Ali_far`(仅包含音频,将于6.16发布)放置于`./dataset`目录下。然后将主办方提供的`wav.scp`、`wav_raw.scp`、`segments`、`utt2spk`和`spk2utt`放置于`./data/Test_2023_Ali_far`目录下。
```shell
data/Test_2023_Ali_far
|—— wav.scp
|—— wav_raw.scp
|—— segments
|—— utt2spk
|—— spk2utt
```
更多基线系统详情见[此处](https://github.com/alibaba-damo-academy/FunASR/blob/main/egs/alimeeting/sa-asr/README.md)
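作为示意,下面给出一个检查 `utt2spk` 与 `spk2utt` 两份映射是否互相一致的最小 Python 脚本(假设性示例,并非官方工具,文件格式沿用上述 Kaldi 风格约定):

```python
import tempfile
from pathlib import Path

def load_utt2spk(path):
    """Kaldi-style utt2spk: one '<utt-id> <spk-id>' pair per line."""
    pairs = {}
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        if line.strip():
            utt, spk = line.split(maxsplit=1)
            pairs[utt] = spk.strip()
    return pairs

def load_spk2utt(path):
    """Kaldi-style spk2utt: '<spk-id> <utt-id> <utt-id> ...' per line."""
    table = {}
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        if line.strip():
            spk, *utts = line.split()
            table[spk] = utts
    return table

def check_consistency(data_dir):
    """Return a list of mismatches between utt2spk and spk2utt (empty = OK)."""
    d = Path(data_dir)
    utt2spk = load_utt2spk(d / "utt2spk")
    spk2utt = load_spk2utt(d / "spk2utt")
    problems = []
    for utt, spk in utt2spk.items():
        if utt not in spk2utt.get(spk, []):
            problems.append(f"{utt} missing from spk2utt entry of {spk}")
    for spk, utts in spk2utt.items():
        for utt in utts:
            if utt2spk.get(utt) != spk:
                problems.append(f"{utt} not mapped to {spk} in utt2spk")
    return problems

# demo on a throwaway directory
tmp = Path(tempfile.mkdtemp())
(tmp / "utt2spk").write_text("utt1 spkA\nutt2 spkA\nutt3 spkB\n", encoding="utf-8")
(tmp / "spk2utt").write_text("spkA utt1 utt2\nspkB utt3\n", encoding="utf-8")
print(check_consistency(tmp))  # → []
```

若返回的列表非空,说明两份映射文件存在不一致,建议在运行推理脚本前先行修复。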
## 基线结果
基线系统的结果如表3所示。在训练期间,说话人档案采用了真实说话人嵌入。然而,由于在评估过程中缺乏真实说话人标签,因此使用了由额外的谱聚类提供的说话人特征。同时,我们还提供了在评估和测试集上使用真实说话人档案的结果,以显示说话人档案准确性的影响。
![baseline result](images/baseline_result.png)
![baseline_result](images/baseline_result.png)
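上文提到,评估时的说话人档案由额外的谱聚类从说话人嵌入得到。下面是谱聚类的一个概念性示意(简化实现,仅作说明,并非基线实际使用的聚类代码):

```python
import numpy as np

def spectral_cluster(embeddings, n_clusters=2):
    """Toy spectral clustering of speaker embeddings with cosine affinity."""
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    A = np.clip(X @ X.T, 0.0, None)          # non-negative cosine affinity
    np.fill_diagonal(A, 0.0)
    d = np.maximum(A.sum(axis=1), 1e-10)
    # symmetric normalized Laplacian: I - D^{-1/2} A D^{-1/2}
    L = np.eye(len(A)) - (A / np.sqrt(d)[:, None]) / np.sqrt(d)[None, :]
    _, eigvecs = np.linalg.eigh(L)           # eigenvalues in ascending order
    V = eigvecs[:, :n_clusters]              # spectral embedding
    V = V / np.maximum(np.linalg.norm(V, axis=1, keepdims=True), 1e-10)
    # k-means with farthest-point initialization
    idx = [0]
    for _ in range(1, n_clusters):
        dist = np.min(((V[:, None] - V[idx][None]) ** 2).sum(-1), axis=1)
        idx.append(int(np.argmax(dist)))
    centers = V[idx]
    for _ in range(50):
        labels = np.argmin(((V[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        centers = np.vstack([V[labels == k].mean(axis=0) if np.any(labels == k)
                             else centers[k] for k in range(n_clusters)])
    return labels

# toy "embeddings": two speakers, four segments each
rng = np.random.default_rng(0)
emb = np.vstack([rng.normal([5.0, 0.0, 0.0], 0.1, (4, 3)),
                 rng.normal([0.0, 5.0, 0.0], 0.1, (4, 3))])
labels = spectral_cluster(emb)
print(labels)
```

实际系统中嵌入由说话人模型提取,聚类数也需要额外估计;此处仅演示"余弦相似度、归一化拉普拉斯、谱嵌入、k-means"的基本流程。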

View File

@ -1,32 +1,33 @@
# 简介
## 竞赛介绍
语音识别(Automatic Speech Recognition)、说话人日志(Speaker Diarization)等语音处理技术的最新发展激发了众多智能语音应用。然而,会议场景由于其复杂的声学条件和不同的讲话风格(包括重叠的讲话、不同数量的发言者、大会议室的远场信号以及环境噪声和混响),仍然属于一项极具挑战性的任务。
为了推动会议场景语音识别的发展,已经有很多相关的挑战赛,如 Rich Transcription evaluation 和 CHiME(Computational Hearing in Multisource Environments)挑战赛。最新的CHiME挑战赛关注于远距离自动语音识别,以及开发能在各种不同拓扑结构的阵列和应用场景中通用的系统。然而,不同语言之间的差异限制了非英语会议转录的进展。MISP(Multimodal Information Based Speech Processing)和M2MeT(Multi-Channel Multi-Party Meeting Transcription)挑战赛为推动普通话会议场景语音识别做出了贡献。MISP挑战赛侧重于用视听多模态的方法解决日常家庭环境中的远距离多麦克风信号处理问题,而M2MeT挑战则侧重于解决离线会议室中会议转录的语音重叠问题。
ASSP2022 M2MeT挑战的侧重点是会议场景它包括两个赛道说话人日记和多说话人自动语音识别。前者涉及识别“谁在什么时候说了话”而后者旨在同时识别来自多个说话人的语音语音重叠和各种噪声带来了巨大的技术困难。
ICASSP2022 M2MeT挑战的侧重点是会议场景,它包括两个赛道:说话人日志和多说话人自动语音识别。前者涉及识别“谁在什么时候说了话”,而后者旨在同时识别来自多个说话人的语音,语音重叠和各种噪声带来了巨大的技术困难。
在上一届M2MET成功举办的基础上我们将在ASRU2023上继续举办M2MET2.0挑战赛。在上一届M2MET挑战赛中评估指标是说话人无关的我们只能得到识别文本而不能确定相应的说话人。
为了解决这一局限性并将现在的多说话人语音识别系统推向实用化M2MET2.0挑战赛将在说话人相关的人物上评估并且同时设立限定数据与不限定数据两个子赛道。通过将语音归属于特定的说话人这项任务旨在提高多说话人ASR系统在真实世界环境中的准确性和适用性。
在上一届M2MeT成功举办的基础上,我们将在ASRU 2023上继续举办M2MeT2.0挑战赛。在上一届M2MeT挑战赛中,评估指标是说话人无关的,我们只能得到识别文本而不能确定相应的说话人。
为了解决这一局限性并将现在的多说话人语音识别系统推向实用化,M2MeT2.0挑战赛将在说话人相关的任务上评估,并且同时设立限定数据与不限定数据两个子赛道。通过将语音归属于特定的说话人,这项任务旨在提高多说话人ASR系统在真实世界环境中的准确性和适用性。
我们对数据集、规则、基线系统和评估方法进行了详细介绍,以进一步促进多说话人语音识别领域研究的发展。此外,我们将根据时间表发布一个全新的测试集,包括大约10小时的音频。
## 时间安排(AOE时间)
- $ 2023.4.29: $ 开放注册
- $ 2023.5.8: $ 基线发布
- $ 2023.5.15: $ 注册截止
- $ 2023.6.9: $ 测试集数据发布
- $ 2023.6.13: $ 最终结果提交截止
- $ 2023.6.19: $ 评估结果和排名发布
- $ 2023.7.3: $ 论文提交截止
- $ 2023.7.10: $ 最终版论文提交截止
- $ 2023.12.12: $ ASRU Workshop & challenge session
- $ 2023.5.11: $ 基线发布
- $ 2023.5.22: $ 注册截止
- $ 2023.6.16: $ 测试集数据发布,排行榜开放
- $ 2023.6.20: $ 最终结果提交截止,排行榜关闭
- $ 2023.6.26: $ 评估结果和排名发布
- $ 2023.7.3: $ 论文提交截止(通过ASRU2023官方投稿,选择竞赛Session)
- $ 2023.7.10: $ 最终版论文提交截止(通过ASRU2023官方投稿,选择竞赛Session)
- $ 2023.12.12: $ ASRU Workshop & Challenge Session
## 竞赛报名
来自学术界和工业界的有意向参赛者均应在2023年5月15日及之前填写下方的谷歌表单
来自学术界和工业界的有意向参赛者均应在2023年5月22日及之前填写下方的谷歌表单。同时欢迎广大参赛者加入[官方交流微信群](https://alibaba-damo-academy.github.io/FunASR/m2met2_cn/%E8%81%94%E7%B3%BB%E6%96%B9%E5%BC%8F.html)交流,并及时获取竞赛最新消息。
[M2MET2.0报名](https://docs.google.com/forms/d/e/1FAIpQLSf77T9vAl7Ym-u5g8gXu18SBofoWRaFShBo26Ym0-HDxHW9PQ/viewform?usp=sf_link)
[M2MeT2.0报名](https://docs.google.com/forms/d/e/1FAIpQLSf77T9vAl7Ym-u5g8gXu18SBofoWRaFShBo26Ym0-HDxHW9PQ/viewform?usp=sf_link)
主办方将在3个工作日内通过电子邮件通知符合条件的参赛团队团队必须遵守将在挑战网站上发布的挑战规则。在排名发布之前每个参赛者必须提交一份系统描述文件详细说明使用的方法和模型。主办方将选择前三名纳入ASRU2023论文集。
主办方将在3个工作日内通过电子邮件通知符合条件的参赛团队,团队必须遵守将在挑战网站上发布的挑战规则。在排名发布之前,每个参赛者必须提交一份系统描述文件,详细说明使用的方法和模型。主办方将排名前列的队伍纳入ASRU2023论文集。

View File

@ -1,9 +1,9 @@
# 联系方式
如果对M2MET2.0竞赛有任何疑问,欢迎通过以下方式联系我们:
如果对M2MeT2.0竞赛有任何疑问,欢迎通过以下方式联系我们:
- 邮件: [m2met.alimeeting@gmail.com](mailto:m2met.alimeeting@gmail.com)
| M2MET2.0竞赛官方微信群 |
| M2MeT2.0竞赛官方微信群 |
|:------------------------------------------:|
<!-- | <img src="images/wechat.png" width="300"/> | -->
| <img src="images/qrcode.png" width="300"/> |

View File

@ -1,6 +1,6 @@
# 赛道设置与评估
## 说话人相关的语音识别
说话人相关的ASR任务需要从重叠的语音中识别每个说话人的语音并为识别内容分配一个说话人标签。图2展示了说话人相关语音识别任务和多说话人语音识别任务的主要区别。在本次竞赛中AliMeeting、Aishell4和Cn-Celeb数据集可作为受限数据源。在M2MeT挑战赛中使用的AliMeeting数据集包含训练、评估和测试集在M2MET2.0可以在训练和评估中使用。此外一个包含约10小时会议数据的新的Test-2023集将根据赛程安排发布并用于挑战赛的评分和排名。值得注意的是对于Test-2023测试集主办方将不再提供耳机的近场音频、转录以及真实时间戳。而是提供可以通过一个简单的VAD模型得到的包含多个说话人的片段。
说话人相关的ASR任务需要从重叠的语音中识别每个说话人的语音,并为识别内容分配一个说话人标签。图2展示了说话人相关语音识别任务和多说话人语音识别任务的主要区别。在本次竞赛中,AliMeeting、Aishell4和Cn-Celeb数据集可作为受限数据源。在M2MeT挑战赛中使用的AliMeeting数据集包含训练、评估和测试集,在M2MeT2.0中可以在训练和评估中使用。此外,一个包含约10小时会议数据的新的Test-2023集将根据赛程安排发布,并用于挑战赛的评分和排名。值得注意的是,对于Test-2023测试集,主办方将不再提供耳机的近场音频、转录以及真实时间戳,而是提供可以通过一个简单的VAD模型得到的包含多个说话人的片段。
![task difference](images/task_diff.png)

View File

@ -15,7 +15,8 @@ Here we provided several pretrained models on different datasets. The details of
| [Paraformer-large-long](https://www.modelscope.cn/models/damo/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary) | CN & EN | Alibaba Speech Data (60000hours) | 8404 | 220M | Offline | Which could deal with arbitrary length input wav |
| [Paraformer-large-contextual](https://www.modelscope.cn/models/damo/speech_paraformer-large-contextual_asr_nat-zh-cn-16k-common-vocab8404/summary) | CN & EN | Alibaba Speech Data (60000hours) | 8404 | 220M | Offline | Which supports the hotword customization based on the incentive enhancement, and improves the recall and precision of hotwords. |
| [Paraformer](https://modelscope.cn/models/damo/speech_paraformer_asr_nat-zh-cn-16k-common-vocab8358-tensorflow1/summary) | CN & EN | Alibaba Speech Data (50000hours) | 8358 | 68M | Offline | Duration of input wav <= 20s |
| [Paraformer-online](https://www.modelscope.cn/models/damo/speech_paraformer_asr_nat-zh-cn-16k-common-vocab8404-online/summary) | CN & EN | Alibaba Speech Data (50000hours) | 8404 | 68M | Online | Which could deal with streaming input |
| [Paraformer-online](https://www.modelscope.cn/models/damo/speech_paraformer_asr_nat-zh-cn-16k-common-vocab8404-online/summary) | CN & EN | Alibaba Speech Data (50000hours) | 8404 | 68M | Online | Which could deal with streaming input |
| [Paraformer-large-online](https://www.modelscope.cn/models/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-online/summary) | CN & EN | Alibaba Speech Data (60000hours) | 8404 | 220M | Online | Which could deal with streaming input |
| [Paraformer-tiny](https://www.modelscope.cn/models/damo/speech_paraformer-tiny-commandword_asr_nat-zh-cn-16k-vocab544-pytorch/summary) | CN | Alibaba Speech Data (200hours) | 544 | 5.2M | Offline | Lightweight Paraformer model which supports Mandarin command words recognition |
| [Paraformer-aishell](https://www.modelscope.cn/models/damo/speech_paraformer_asr_nat-aishell1-pytorch/summary) | CN | AISHELL (178hours) | 4234 | 43M | Offline | |
| [ParaformerBert-aishell](https://modelscope.cn/models/damo/speech_paraformerbert_asr_nat-zh-cn-16k-aishell1-vocab4234-pytorch/summary) | CN | AISHELL (178hours) | 4234 | 43M | Offline | |
@ -25,13 +26,27 @@ Here we provided several pretrained models on different datasets. The details of
#### UniASR Models
| Model Name | Language | Training Data | Vocab Size | Parameter | Offline/Online | Notes |
|:--------------------------------------------------------------------------------------------------------------------------------------:|:--------:|:--------------------------------:|:----------:|:---------:|:--------------:|:--------------------------------------------------------------------------------------------------------------------------------|
| [UniASR](https://modelscope.cn/models/damo/speech_UniASR_asr_2pass-zh-cn-16k-common-vocab8358-tensorflow1-online/summary) | CN & EN | Alibaba Speech Data (60000hours) | 8358 | 100M | Online | UniASR streaming offline unifying models |
| [UniASR-large](https://modelscope.cn/models/damo/speech_UniASR-large_asr_2pass-zh-cn-16k-common-vocab8358-tensorflow1-offline/summary) | CN & EN | Alibaba Speech Data (60000hours) | 8358 | 220M | Offline | UniASR streaming offline unifying models |
| [UniASR Burmese](https://modelscope.cn/models/damo/speech_UniASR_asr_2pass-my-16k-common-vocab696-pytorch/summary) | Burmese | Alibaba Speech Data (? hours) | 696 | 95M | Online | UniASR streaming offline unifying models |
| [UniASR Hebrew](https://modelscope.cn/models/damo/speech_UniASR_asr_2pass-he-16k-common-vocab1085-pytorch/summary) | Hebrew | Alibaba Speech Data (? hours) | 1085 | 95M | Online | UniASR streaming offline unifying models |
| [UniASR Urdu](https://modelscope.cn/models/damo/speech_UniASR_asr_2pass-ur-16k-common-vocab877-pytorch/summary) | Urdu | Alibaba Speech Data (? hours) | 877 | 95M | Online | UniASR streaming offline unifying models |
| Model Name | Language | Training Data | Vocab Size | Parameter | Offline/Online | Notes |
|:-------------------------------------------------------------------------------------------------------------------------------------------------:|:---------------:|:---------------------------------:|:----------:|:---------:|:--------------:|:--------------------------------------------------------------------------------------------------------------------------------|
| [UniASR](https://modelscope.cn/models/damo/speech_UniASR_asr_2pass-zh-cn-16k-common-vocab8358-tensorflow1-online/summary) | CN & EN | Alibaba Speech Data (60000 hours) | 8358 | 100M | Online | UniASR streaming offline unifying models |
| [UniASR-large](https://modelscope.cn/models/damo/speech_UniASR-large_asr_2pass-zh-cn-16k-common-vocab8358-tensorflow1-offline/summary) | CN & EN | Alibaba Speech Data (60000 hours) | 8358 | 220M | Offline | UniASR streaming offline unifying models |
| [UniASR English](https://modelscope.cn/models/damo/speech_UniASR_asr_2pass-en-16k-common-vocab1080-tensorflow1-online/summary) | EN | Alibaba Speech Data (10000 hours) | 1080 | 95M | Online | UniASR streaming online unifying models |
| [UniASR Russian](https://modelscope.cn/models/damo/speech_UniASR_asr_2pass-ru-16k-common-vocab1664-tensorflow1-online/summary) | RU | Alibaba Speech Data (5000 hours) | 1664 | 95M | Online | UniASR streaming online unifying models |
| [UniASR Japanese](https://modelscope.cn/models/damo/speech_UniASR_asr_2pass-ja-16k-common-vocab93-tensorflow1-online/summary) | JA | Alibaba Speech Data (5000 hours) | 5977 | 95M | Online | UniASR streaming offline unifying models |
| [UniASR Korean](https://modelscope.cn/models/damo/speech_UniASR_asr_2pass-ko-16k-common-vocab6400-tensorflow1-online/summary) | KO | Alibaba Speech Data (2000 hours) | 6400 | 95M | Online | UniASR streaming online unifying models |
| [UniASR Cantonese (CHS)](https://modelscope.cn/models/damo/speech_UniASR_asr_2pass-cantonese-CHS-16k-common-vocab1468-tensorflow1-online/summary) | Cantonese (CHS) | Alibaba Speech Data (5000 hours) | 1468 | 95M | Online | UniASR streaming online unifying models |
| [UniASR Indonesian](https://modelscope.cn/models/damo/speech_UniASR_asr_2pass-id-16k-common-vocab1067-tensorflow1-online/summary) | ID | Alibaba Speech Data (1000 hours) | 1067 | 95M | Online | UniASR streaming offline unifying models |
| [UniASR Vietnamese](https://modelscope.cn/models/damo/speech_UniASR_asr_2pass-vi-16k-common-vocab1001-pytorch-online/summary) | VI | Alibaba Speech Data (1000 hours) | 1001 | 95M | Online | UniASR streaming offline unifying models |
| [UniASR Spanish](https://modelscope.cn/models/damo/speech_UniASR_asr_2pass-es-16k-common-vocab3445-tensorflow1-online/summary) | ES | Alibaba Speech Data (1000 hours) | 3445 | 95M | Online | UniASR streaming online unifying models |
| [UniASR Portuguese](https://modelscope.cn/models/damo/speech_UniASR_asr_2pass-pt-16k-common-vocab1617-tensorflow1-online/summary) | PT | Alibaba Speech Data (1000 hours) | 1617 | 95M | Online | UniASR streaming offline unifying models |
| [UniASR French](https://modelscope.cn/models/damo/speech_UniASR_asr_2pass-fr-16k-common-vocab3472-tensorflow1-online/summary) | FR | Alibaba Speech Data (1000 hours) | 3472 | 95M | Online | UniASR streaming online unifying models |
| [UniASR German](https://modelscope.cn/models/damo/speech_UniASR_asr_2pass-de-16k-common-vocab3690-tensorflow1-online/summary) | GE | Alibaba Speech Data (1000 hours) | 3690 | 95M | Online | UniASR streaming online unifying models |
| [UniASR Persian](https://modelscope.cn/models/damo/speech_UniASR_asr_2pass-fa-16k-common-vocab1257-pytorch-online/summary) | FA | Alibaba Speech Data (1000 hours) | 1257 | 95M | Online | UniASR streaming offline unifying models |
| [UniASR Burmese](https://modelscope.cn/models/damo/speech_UniASR_asr_2pass-my-16k-common-vocab696-pytorch/summary) | MY | Alibaba Speech Data (1000 hours) | 696 | 95M | Online | UniASR streaming offline unifying models |
| [UniASR Hebrew](https://modelscope.cn/models/damo/speech_UniASR_asr_2pass-he-16k-common-vocab1085-pytorch/summary) | HE | Alibaba Speech Data (1000 hours) | 1085 | 95M | Online | UniASR streaming offline unifying models |
| [UniASR Urdu](https://modelscope.cn/models/damo/speech_UniASR_asr_2pass-ur-16k-common-vocab877-pytorch/summary) | UR | Alibaba Speech Data (1000 hours) | 877 | 95M | Online | UniASR streaming offline unifying models |
#### Conformer Models
@ -39,6 +54,7 @@ Here we provided several pretrained models on different datasets. The details of
|:----------------------------------------------------------------------------------------------------------------------:|:--------:|:---------------------:|:----------:|:---------:|:--------------:|:--------------------------------------------------------------------------------------------------------------------------------|
| [Conformer](https://modelscope.cn/models/damo/speech_conformer_asr_nat-zh-cn-16k-aishell1-vocab4234-pytorch/summary) | CN | AISHELL (178hours) | 4234 | 44M | Offline | Duration of input wav <= 20s |
| [Conformer](https://www.modelscope.cn/models/damo/speech_conformer_asr_nat-zh-cn-16k-aishell2-vocab5212-pytorch/summary) | CN | AISHELL-2 (1000hours) | 5212 | 44M | Offline | Duration of input wav <= 20s |
| [Conformer](https://modelscope.cn/models/damo/speech_conformer_asr-en-16k-vocab4199-pytorch/summary) | EN | Alibaba Speech Data (10000hours) | 4199 | 220M | Offline | Duration of input wav <= 20s |
#### RNN-T Models
@ -92,3 +108,19 @@ Here we provided several pretrained models on different datasets. The details of
| Model Name | Language | Training Data | Parameters | Notes |
|:--------------------------------------------------------------------------------------------------:|:--------------:|:-------------------:|:----------:|:------|
| [TP-Aligner](https://modelscope.cn/models/damo/speech_timestamp_prediction-v1-16k-offline/summary) | CN | Alibaba Speech Data (50000hours) | 37.8M | Timestamp prediction, Mandarin, middle size |
### Inverse Text Normalization (ITN) Models
| Model Name | Language | Parameters | Notes |
|:----------------------------------------------------------------------------------------------------------------:|:--------:|:----------:|:-------------------------|
| [English](https://modelscope.cn/models/damo/speech_inverse_text_processing_fun-text-processing-itn-en/summary) | EN | 1.54M | ITN, ASR post-processing |
| [Russian](https://modelscope.cn/models/damo/speech_inverse_text_processing_fun-text-processing-itn-ru/summary) | RU | 17.79M | ITN, ASR post-processing |
| [Japanese](https://modelscope.cn/models/damo/speech_inverse_text_processing_fun-text-processing-itn-ja/summary) | JA | 6.8M | ITN, ASR post-processing |
| [Korean](https://modelscope.cn/models/damo/speech_inverse_text_processing_fun-text-processing-itn-ko/summary) | KO | 1.28M | ITN, ASR post-processing |
| [Indonesian](https://modelscope.cn/models/damo/speech_inverse_text_processing_fun-text-processing-itn-id/summary) | ID | 2.06M | ITN, ASR post-processing |
| [Vietnamese](https://modelscope.cn/models/damo/speech_inverse_text_processing_fun-text-processing-itn-vi/summary) | VI | 0.92M | ITN, ASR post-processing |
| [Tagalog](https://modelscope.cn/models/damo/speech_inverse_text_processing_fun-text-processing-itn-tl/summary) | TL | 0.65M | ITN, ASR post-processing |
| [Spanish](https://modelscope.cn/models/damo/speech_inverse_text_processing_fun-text-processing-itn-es/summary) | ES | 1.32M | ITN, ASR post-processing |
| [Portuguese](https://modelscope.cn/models/damo/speech_inverse_text_processing_fun-text-processing-itn-pt/summary) | PT | 1.28M | ITN, ASR post-processing |
| [French](https://modelscope.cn/models/damo/speech_inverse_text_processing_fun-text-processing-itn-fr/summary) | FR | 4.39M | ITN, ASR post-processing |
| [German](https://modelscope.cn/models/damo/speech_inverse_text_processing_fun-text-processing-itn-de/summary)| GE | 3.95M | ITN, ASR post-processing |

View File

@ -0,0 +1,63 @@
# Inverse Text Normalization (ITN)
> **Note**:
> The modelscope pipeline supports all the models in [model zoo](https://modelscope.cn/models?page=1&tasks=inverse-text-processing&type=audio) for inference. Here we take the Japanese ITN model as an example to demonstrate the usage.
## Inference
### Quick start
#### [Japanese ITN model](https://modelscope.cn/models/damo/speech_inverse_text_processing_fun-text-processing-itn-ja/summary)
```python
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks
itn_inference_pipeline = pipeline(
    task=Tasks.inverse_text_processing,
    model='damo/speech_inverse_text_processing_fun-text-processing-itn-ja',
    model_revision=None)
itn_result = itn_inference_pipeline(text_in='百二十三')
print(itn_result)
# 123
```
- read text data directly.
```python
itn_result = itn_inference_pipeline(text_in='一九九九年に誕生した同商品にちなみ、約三十年前、二十四歳の頃の幸四郎の写真を公開。')
# 1999年に誕生した同商品にちなみ、約30年前、24歳の頃の幸四郎の写真を公開。
```
- text stored via url (example: https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_text/ja_itn_example.txt)
```python
itn_result = itn_inference_pipeline(text_in='https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_text/ja_itn_example.txt')
```
For the full demo code, please refer to [demo](https://github.com/alibaba-damo-academy/FunASR/tree/main/fun_text_processing/inverse_text_normalization)
### API-reference
#### Define pipeline
- `task`: `Tasks.inverse_text_processing`
- `model`: model name in [model zoo](https://modelscope.cn/models?page=1&tasks=inverse-text-processing&type=audio), or model path in local disk
- `output_dir`: `None` (Default), the output path of results if set
- `model_revision`: `None` (Default), setting the model version
#### Infer pipeline
- `text_in`: the input to decode, which could be:
- text bytes, `e.g.`: "一九九九年に誕生した同商品にちなみ、約三十年前、二十四歳の頃の幸四郎の写真を公開。"
- text file, `e.g.`: https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_text/ja_itn_example.txt
In the case of `text file` input, `output_dir` must be set to save the output results
## Modify Your Own ITN Model
The rule-based ITN code is open-sourced in [FunTextProcessing](https://github.com/alibaba-damo-academy/FunASR/tree/main/fun_text_processing), so users can modify the grammar rules for different languages on their own. Let's take Japanese as an example: users can add their own whitelist entries in ```FunASR/fun_text_processing/inverse_text_normalization/ja/data/whitelist.tsv```. After modifying the grammar rules, users can export and evaluate their own ITN models in a local directory.
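As a rough sketch of what a whitelist entry does (a simplified, hypothetical stand-in for the WFST grammars in FunTextProcessing, not the actual implementation), each `whitelist.tsv` row can be thought of as a spoken-to-written substitution applied with longest match first:

```python
def load_whitelist(lines):
    """Parse whitelist.tsv-style rules: '<spoken>\t<written>' per line."""
    rules = {}
    for line in lines:
        if line.strip() and not line.startswith("#"):
            spoken, written = line.rstrip("\n").split("\t")
            rules[spoken] = written
    return rules

def apply_whitelist(text, rules):
    """Greedy left-to-right replacement, longest spoken form first."""
    keys = sorted(rules, key=len, reverse=True)
    out, i = [], 0
    while i < len(text):
        for k in keys:
            if text.startswith(k, i):
                out.append(rules[k])
                i += len(k)
                break
        else:
            out.append(text[i])
            i += 1
    return "".join(out)

rules = load_whitelist(["パーセント\t%", "キロメートル\tkm"])
print(apply_whitelist("五十パーセント", rules))  # → 五十%
```

The real grammars compose WFSTs and handle context; this sketch only shows why longest-match ordering matters, so that a longer unit is not split by a shorter one.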
### Export ITN Model
Export an ITN model via ```FunASR/fun_text_processing/inverse_text_normalization/export_models.py```. An example of exporting an ITN model to a local folder is shown below.
```shell
cd FunASR/fun_text_processing/inverse_text_normalization/
python export_models.py --language ja --export_dir ./itn_models/
```
### Evaluate ITN Model
Users can evaluate their own ITN model in a local directory via ```FunASR/fun_text_processing/inverse_text_normalization/inverse_normalize.py```. Here is an example:
```shell
cd FunASR/fun_text_processing/inverse_text_normalization/
python inverse_normalize.py --input_file ja_itn_example.txt --cache_dir ./itn_models/ --output_file output.txt --language=ja
```
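Beyond eyeballing `output.txt`, a quick sanity check against a reference file is exact-match sentence accuracy. A minimal sketch (assuming one sentence per line; this is not the official evaluation metric):

```python
def sentence_accuracy(hyp_lines, ref_lines):
    """Fraction of lines where the ITN output exactly matches the reference."""
    assert len(hyp_lines) == len(ref_lines), "line counts must match"
    hits = sum(h.strip() == r.strip() for h, r in zip(hyp_lines, ref_lines))
    return hits / max(len(ref_lines), 1)

# e.g. compare the lines of output.txt against a hand-corrected reference
hyp = ["1999年に誕生", "約30年前", "二十四歳の頃"]
ref = ["1999年に誕生", "約30年前", "24歳の頃"]
print(f"{sentence_accuracy(hyp, ref):.2f}")  # 2 of 3 lines match
```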

View File

@ -1,7 +1,7 @@
# Quick Start
> **Note**:
> The modelscope pipeline supports all the models in [model zoo](https://alibaba-damo-academy.github.io/FunASR/en/modelscope_models.html#pretrained-models-on-modelscope) to inference and finetine. Here we take typic model as example to demonstrate the usage.
> The modelscope pipeline supports all the models in [model zoo](https://alibaba-damo-academy.github.io/FunASR/en/model_zoo/modelscope_models.html#pretrained-models-on-modelscope) for inference and finetuning. Here we take a typical model as an example to demonstrate the usage.
## Inference with pipeline

View File

@ -19,7 +19,7 @@ stage 6: Generate speaker profiles (Stage 6 takes a lot of time).
stage 7 - 9: Language model training (Optional).
stage 10 - 11: ASR training (SA-ASR requires loading the pre-trained ASR model).
stage 12: SA-ASR training.
stage 13 - 18: Inference and evaluation.
stage 13 - 16: Inference and evaluation.
```
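The scoring stages report CER via `utils/compute_wer.py`; conceptually this is a character-level edit distance divided by the reference length. A simplified illustration (not the official scorer, which also handles speaker attribution for SA-ASR):

```python
def cer(ref: str, hyp: str) -> float:
    """Character error rate: Levenshtein distance / reference length."""
    prev = list(range(len(hyp) + 1))
    for i, rc in enumerate(ref, 1):
        cur = [i]
        for j, hc in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (rc != hc)))   # substitution
        prev = cur
    return prev[-1] / max(len(ref), 1)

print(cer("今天天气很好", "今天天气真好"))  # one substitution out of six characters
```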
Before running `run_m2met_2023_infer.sh`, you need to place the new test set `Test_2023_Ali_far` (to be released after the challenge starts) in the `./dataset` directory, which contains only raw audio. Then put the given `wav.scp`, `wav_raw.scp`, `segments`, `utt2spk` and `spk2utt` in the `./data/Test_2023_Ali_far` directory.
```shell
@ -37,6 +37,10 @@ stage 2: Generate speaker profiles for inference.
stage 3: Inference.
stage 4: Generation of SA-ASR results required for final submission.
```
The baseline model is available on [ModelScope](https://www.modelscope.cn/models/damo/speech_saasr_asr-zh-cn-16k-alimeeting/summary).
After generating the stats of the AliMeeting corpus (stage 10 in `run.sh`), you can set `infer_with_pretrained_model=true` in `run.sh` to run inference with our official baseline model released on ModelScope without training.
# Format of Final Submission
Finally, you need to submit a file called `text_spk_merge` with the following format:
```shell

View File

@ -107,8 +107,8 @@ inference_asr_model=valid.acc.ave.pb # ASR model path for decoding.
# inference_asr_model=valid.acc.best.pth
# inference_asr_model=valid.loss.ave.pth
inference_sa_asr_model=valid.acc_spk.ave.pb
download_model= # Download a model from Model Zoo and use it for decoding.
infer_with_pretrained_model=false # Use pretrained model for decoding
download_sa_asr_model= # Download the SA-ASR model from ModelScope and use it for decoding.
# [Task dependent] Set the datadir name created by local/data.sh
train_set= # Name of training set.
valid_set= # Name of validation set used for monitoring/tuning network training.
@ -203,7 +203,8 @@ Options:
# Note that it will overwrite args in inference config.
--inference_lm # Language modle path for decoding (default="${inference_lm}").
--inference_asr_model # ASR model path for decoding (default="${inference_asr_model}").
--download_model # Download a model from Model Zoo and use it for decoding (default="${download_model}").
--infer_with_pretrained_model # Use pretrained model for decoding (default="${infer_with_pretrained_model}").
--download_sa_asr_model # Download the SA-ASR model from ModelScope and use it for decoding (default="${download_sa_asr_model}").
# [Task dependent] Set the datadir name created by local/data.sh
--train_set # Name of training set (required).
@ -304,6 +305,9 @@ else
lm_token_type="${token_type}"
fi
if ${infer_with_pretrained_model}; then
skip_train=true
fi
# Set tag for naming of model directory
if [ -z "${asr_tag}" ]; then
@ -1220,122 +1224,20 @@ else
log "Skip the training stages"
fi
if ${infer_with_pretrained_model}; then
log "Use ${download_sa_asr_model} for decoding and evaluation"
sa_asr_exp="${expdir}/${download_sa_asr_model}"
mkdir -p "${sa_asr_exp}"
python local/download_pretrained_model_from_modelscope.py $download_sa_asr_model ${expdir}
inference_sa_asr_model="model.pb"
inference_config=${sa_asr_exp}/decoding.yaml
fi
if ! "${skip_eval}"; then
if [ ${stage} -le 13 ] && [ ${stop_stage} -ge 13 ]; then
log "Stage 13: Decoding multi-talker ASR: training_dir=${asr_exp}"
if ${gpu_inference}; then
_cmd="${cuda_cmd}"
inference_nj=$[${ngpu}*${njob_infer}]
_ngpu=1
else
_cmd="${decode_cmd}"
inference_nj=$inference_nj
_ngpu=0
fi
_opts=
if [ -n "${inference_config}" ]; then
_opts+="--config ${inference_config} "
fi
if "${use_lm}"; then
if "${use_word_lm}"; then
_opts+="--word_lm_train_config ${lm_exp}/config.yaml "
_opts+="--word_lm_file ${lm_exp}/${inference_lm} "
else
_opts+="--lm_train_config ${lm_exp}/config.yaml "
_opts+="--lm_file ${lm_exp}/${inference_lm} "
fi
fi
# 2. Generate run.sh
log "Generate '${asr_exp}/${inference_tag}/run.sh'. You can resume the process from stage 13 using this script"
mkdir -p "${asr_exp}/${inference_tag}"; echo "${run_args} --stage 13 \"\$@\"; exit \$?" > "${asr_exp}/${inference_tag}/run.sh"; chmod +x "${asr_exp}/${inference_tag}/run.sh"
for dset in ${test_sets}; do
_data="${data_feats}/${dset}"
_dir="${asr_exp}/${inference_tag}/${dset}"
_logdir="${_dir}/logdir"
mkdir -p "${_logdir}"
_feats_type="$(<${_data}/feats_type)"
if [ "${_feats_type}" = raw ]; then
_scp=wav.scp
if [[ "${audio_format}" == *ark* ]]; then
_type=kaldi_ark
else
_type=sound
fi
else
_scp=feats.scp
_type=kaldi_ark
fi
# 1. Split the key file
key_file=${_data}/${_scp}
split_scps=""
_nj=$(min "${inference_nj}" "$(<${key_file} wc -l)")
echo $_nj
for n in $(seq "${_nj}"); do
split_scps+=" ${_logdir}/keys.${n}.scp"
done
# shellcheck disable=SC2086
utils/split_scp.pl "${key_file}" ${split_scps}
# 2. Submit decoding jobs
log "Decoding started... log: '${_logdir}/asr_inference.*.log'"
${_cmd} --gpu "${_ngpu}" --max-jobs-run "${_nj}" JOB=1:"${_nj}" "${_logdir}"/asr_inference.JOB.log \
python -m funasr.bin.asr_inference_launch \
--batch_size 1 \
--mc True \
--nbest 1 \
--ngpu "${_ngpu}" \
--njob ${njob_infer} \
--gpuid_list ${device} \
--data_path_and_name_and_type "${_data}/${_scp},speech,${_type}" \
--key_file "${_logdir}"/keys.JOB.scp \
--asr_train_config "${asr_exp}"/config.yaml \
--asr_model_file "${asr_exp}"/"${inference_asr_model}" \
--output_dir "${_logdir}"/output.JOB \
--mode asr \
${_opts}
# 3. Concatenate the output files from each job
for f in token token_int score text; do
for i in $(seq "${_nj}"); do
cat "${_logdir}/output.${i}/1best_recog/${f}"
done | LC_ALL=C sort -k1 >"${_dir}/${f}"
done
done
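The concatenation step above gathers each job's `1best_recog` outputs and key-sorts them; `LC_ALL=C sort -k1` is approximated below by a bytewise sort on the utterance key (a sketch, not the recipe's actual tooling):

```python
def merge_job_outputs(job_outputs):
    """Concatenate per-job result lines and sort them by utterance key,
    approximating `cat output.*/1best_recog/text | LC_ALL=C sort -k1`."""
    merged = [line for out in job_outputs for line in out]
    # encode() gives a bytewise ordering, like sort under LC_ALL=C
    return sorted(merged, key=lambda line: line.split(maxsplit=1)[0].encode())

jobs = [["utt2 hello", "utt9 foo"], ["utt1 world"]]
print(merge_job_outputs(jobs))  # → ['utt1 world', 'utt2 hello', 'utt9 foo']
```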
fi
if [ ${stage} -le 14 ] && [ ${stop_stage} -ge 14 ]; then
log "Stage 14: Scoring multi-talker ASR"
for dset in ${test_sets}; do
_data="${data_feats}/${dset}"
_dir="${asr_exp}/${inference_tag}/${dset}"
sed 's/\$//g' ${_data}/text > ${_data}/text_nosrc
sed 's/\$//g' ${_dir}/text > ${_dir}/text_nosrc
python utils/proce_text.py ${_data}/text_nosrc ${_data}/text.proc
python utils/proce_text.py ${_dir}/text_nosrc ${_dir}/text.proc
python utils/compute_wer.py ${_data}/text.proc ${_dir}/text.proc ${_dir}/text.cer
tail -n 3 ${_dir}/text.cer > ${_dir}/text.cer.txt
cat ${_dir}/text.cer.txt
done
fi
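`utils/compute_wer.py` reports a character error rate, which is conceptually a Levenshtein alignment between reference and hypothesis characters divided by the reference length. A minimal sketch of that metric (not the project's implementation):

```python
def cer(ref: str, hyp: str) -> float:
    """Character error rate: Levenshtein edit distance / reference length."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (r != h)))    # substitution
        prev = cur
    return prev[-1] / max(len(ref), 1)

print(cer("今天天气", "今天气"))  # one deleted char over 4 → 0.25
```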
if [ ${stage} -le 15 ] && [ ${stop_stage} -ge 15 ]; then
log "Stage 15: Decoding SA-ASR (oracle profile): training_dir=${sa_asr_exp}"
if ${gpu_inference}; then
_cmd="${cuda_cmd}"
@@ -1426,8 +1328,8 @@ if ! "${skip_eval}"; then
done
fi
if [ ${stage} -le 16 ] && [ ${stop_stage} -ge 16 ]; then
log "Stage 16: Scoring SA-ASR (oracle profile)"
for dset in ${test_sets}; do
_data="${data_feats}/${dset}"
@@ -1454,8 +1356,8 @@ if ! "${skip_eval}"; then
fi
if [ ${stage} -le 17 ] && [ ${stop_stage} -ge 17 ]; then
log "Stage 17: Decoding SA-ASR (cluster profile): training_dir=${sa_asr_exp}"
if ${gpu_inference}; then
_cmd="${cuda_cmd}"
@@ -1545,8 +1447,8 @@ if ! "${skip_eval}"; then
done
fi
if [ ${stage} -le 18 ] && [ ${stop_stage} -ge 18 ]; then
log "Stage 18: Scoring SA-ASR (cluster profile)"
for dset in ${test_sets}; do
_data="${data_feats}/${dset}"

@@ -0,0 +1,7 @@
import sys

from modelscope.hub.snapshot_download import snapshot_download

if __name__ == "__main__":
    model_tag = sys.argv[1]
    local_model_dir = sys.argv[2]
    model_dir = snapshot_download(model_tag, cache_dir=local_model_dir, revision='1.0.0')

@@ -8,8 +8,8 @@ set -o pipefail
ngpu=4
device="0,1,2,3"
stage=1
stop_stage=18
train_set=Train_Ali_far
@@ -18,6 +18,8 @@ test_sets="Test_Ali_far"
asr_config=conf/train_asr_conformer.yaml
sa_asr_config=conf/train_sa_asr_conformer.yaml
inference_config=conf/decode_asr_rnn.yaml
infer_with_pretrained_model=true
download_sa_asr_model="damo/speech_saasr_asr-zh-cn-16k-alimeeting"
lm_config=conf/train_lm_transformer.yaml
use_lm=false
@@ -29,6 +31,8 @@ use_wordlm=false
--stop_stage ${stop_stage} \
--gpu_inference true \
--njob_infer 4 \
--infer_with_pretrained_model ${infer_with_pretrained_model} \
--download_sa_asr_model $download_sa_asr_model \
--asr_exp exp/asr_train_multispeaker_conformer_raw_zh_char_data_alimeeting \
--sa_asr_exp exp/sa_asr_train_conformer_raw_zh_char_data_alimeeting \
--asr_stats_dir exp/asr_stats_multispeaker_conformer_raw_zh_char_data_alimeeting \

@@ -1,7 +1,7 @@
# Speech Recognition
> **Note**:
> The modelscope pipeline supports inference and fine-tuning for all the models in the [model zoo](https://alibaba-damo-academy.github.io/FunASR/en/model_zoo/modelscope_models.html#pretrained-models-on-modelscope). Here we take typical models as examples to demonstrate the usage.
## Inference
@@ -44,7 +44,7 @@ print(rec_result)
For the full demo code, please refer to [demo](https://github.com/alibaba-damo-academy/FunASR/discussions/241)
#### [UniASR Model](https://www.modelscope.cn/models/damo/speech_UniASR_asr_2pass-zh-cn-8k-common-vocab3445-pytorch-online/summary)
There are three decoding modes for the UniASR model (`fast`, `normal`, `offline`); for more model details, please refer to the [docs](https://www.modelscope.cn/models/damo/speech_UniASR_asr_2pass-zh-cn-8k-common-vocab3445-pytorch-online/summary)
```python
decoding_model = "fast" # "fast"、"normal"、"offline"
inference_pipeline = pipeline(
@@ -61,7 +61,7 @@ Full code of demo, please ref to [demo](https://github.com/alibaba-damo-academy/
#### [MFCCA Model](https://www.modelscope.cn/models/NPU-ASLP/speech_mfcca_asr-zh-cn-16k-alimeeting-vocab4950/summary)
For more model details, please refer to the [docs](https://www.modelscope.cn/models/NPU-ASLP/speech_mfcca_asr-zh-cn-16k-alimeeting-vocab4950/summary)
```python
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks
@@ -79,7 +79,7 @@ print(rec_result)
### API-reference
#### Define pipeline
- `task`: `Tasks.auto_speech_recognition`
- `model`: model name in the [model zoo](https://alibaba-damo-academy.github.io/FunASR/en/model_zoo/modelscope_models.html#pretrained-models-on-modelscope), or a model path on local disk
- `ngpu`: `1` (Default), decoding on GPU. If ngpu=0, decoding on CPU
- `ncpu`: `1` (Default), sets the number of threads used for intraop parallelism on CPU
- `output_dir`: `None` (Default), the output path of results if set
@@ -103,7 +103,7 @@ print(rec_result)
FunASR also offers the recipe [egs_modelscope/asr/TEMPLATE/infer.sh](https://github.com/alibaba-damo-academy/FunASR/blob/main/egs_modelscope/asr/TEMPLATE/infer.sh) for decoding with multi-threaded CPUs or multiple GPUs.
#### Settings of `infer.sh`
- `model`: model name in the [model zoo](https://alibaba-damo-academy.github.io/FunASR/en/model_zoo/modelscope_models.html#pretrained-models-on-modelscope), or a model path on local disk
- `data_dir`: the dataset dir must include `wav.scp`. If `${data_dir}/text` also exists, the CER will be computed
- `output_dir`: output dir of the recognition results
- `batch_size`: `64` (Default), batch size of inference on gpu

@@ -1 +1 @@
../../TEMPLATE/README.md

@@ -0,0 +1,37 @@
import os

from modelscope.metainfo import Trainers
from modelscope.trainers import build_trainer

from funasr.datasets.ms_dataset import MsDataset
from funasr.utils.modelscope_param import modelscope_args


def modelscope_finetune(params):
    if not os.path.exists(params.output_dir):
        os.makedirs(params.output_dir, exist_ok=True)
    # dataset split ["train", "validation"]
    ds_dict = MsDataset.load(params.data_path)
    kwargs = dict(
        model=params.model,
        model_revision="v1.0.2",
        data_dir=ds_dict,
        dataset_type=params.dataset_type,
        work_dir=params.output_dir,
        batch_bins=params.batch_bins,
        max_epoch=params.max_epoch,
        lr=params.lr)
    trainer = build_trainer(Trainers.speech_asr_trainer, default_args=kwargs)
    trainer.train()


if __name__ == '__main__':
    params = modelscope_args(model="damo/speech_paraformer-large-contextual_asr_nat-zh-cn-16k-common-vocab8404", data_path="./data")
    params.output_dir = "./checkpoint"    # path to save checkpoints
    params.data_path = "./example_data/"  # path to the training data
    params.dataset_type = "large"         # fine-tuning the contextual paraformer requires the "large" dataset type
    params.batch_bins = 200000            # batch size: fbank frames if dataset_type="small", milliseconds if dataset_type="large"
    params.max_epoch = 20                 # maximum number of training epochs
    params.lr = 0.0002                    # learning rate
    modelscope_finetune(params)

@@ -0,0 +1,105 @@
#!/usr/bin/env bash
set -e
set -u
set -o pipefail
stage=1
stop_stage=2
model="damo/speech_paraformer-large-contextual_asr_nat-zh-cn-16k-common-vocab8404"
data_dir="./data/test"
output_dir="./results"
batch_size=64
gpu_inference=true # whether to perform gpu decoding
gpuid_list="0,1" # set gpus, e.g., gpuid_list="0,1"
njob=10                # number of CPU decoding jobs (used when gpu_inference=false)
checkpoint_dir=
checkpoint_name="valid.cer_ctc.ave.pb"
hotword_txt=None
. utils/parse_options.sh || exit 1;
if [ "${gpu_inference}" == "true" ]; then
nj=$(echo $gpuid_list | awk -F "," '{print NF}')
else
nj=$njob
batch_size=1
gpuid_list=""
for JOB in $(seq ${nj}); do
gpuid_list=$gpuid_list"-1,"
done
fi
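The branch above pads `gpuid_list` with one `-1` per job when decoding on CPU; job `JOB` later reads element `JOB-1` of that list. The mapping can be sketched as follows (an illustrative helper, not part of the script):

```python
def job_device_ids(gpuid_list: str, gpu_inference: bool, njob: int):
    """One decoding job per GPU id when gpu_inference is true,
    otherwise njob CPU jobs, each with device id -1."""
    if gpu_inference:
        return [int(g) for g in gpuid_list.split(",") if g]
    return [-1] * njob  # job JOB uses the (JOB-1)-th entry

print(job_device_ids("0,1", True, 10))   # → [0, 1]
print(job_device_ids("0,1", False, 4))   # → [-1, -1, -1, -1]
```

Note that with GPU inference the number of jobs is derived from the id list itself, which is why the script counts fields in `gpuid_list` with awk.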
mkdir -p $output_dir/split
split_scps=""
for JOB in $(seq ${nj}); do
split_scps="$split_scps $output_dir/split/wav.$JOB.scp"
done
perl utils/split_scp.pl ${data_dir}/wav.scp ${split_scps}
if [ -n "${checkpoint_dir}" ]; then
python utils/prepare_checkpoint.py ${model} ${checkpoint_dir} ${checkpoint_name}
model=${checkpoint_dir}/${model}
fi
if [ $stage -le 1 ] && [ $stop_stage -ge 1 ];then
echo "Decoding ..."
gpuid_list_array=(${gpuid_list//,/ })
for JOB in $(seq ${nj}); do
{
id=$((JOB-1))
gpuid=${gpuid_list_array[$id]}
mkdir -p ${output_dir}/output.$JOB
python infer.py \
--model ${model} \
--audio_in ${output_dir}/split/wav.$JOB.scp \
--output_dir ${output_dir}/output.$JOB \
--batch_size ${batch_size} \
--hotword_txt ${hotword_txt} \
--gpuid ${gpuid}
}&
done
wait
mkdir -p ${output_dir}/1best_recog
for f in token score text; do
if [ -f "${output_dir}/output.1/1best_recog/${f}" ]; then
for i in $(seq "${nj}"); do
cat "${output_dir}/output.${i}/1best_recog/${f}"
done | sort -k1 >"${output_dir}/1best_recog/${f}"
fi
done
fi
if [ $stage -le 2 ] && [ $stop_stage -ge 2 ];then
echo "Computing WER ..."
cp ${output_dir}/1best_recog/text ${output_dir}/1best_recog/text.proc
cp ${data_dir}/text ${output_dir}/1best_recog/text.ref
python utils/compute_wer.py ${output_dir}/1best_recog/text.ref ${output_dir}/1best_recog/text.proc ${output_dir}/1best_recog/text.cer
tail -n 3 ${output_dir}/1best_recog/text.cer
fi
if [ $stage -le 3 ] && [ $stop_stage -ge 3 ];then
echo "SpeechIO TIOBE textnorm"
echo "$0 --> Normalizing REF text ..."
./utils/textnorm_zh.py \
--has_key --to_upper \
${data_dir}/text \
${output_dir}/1best_recog/ref.txt
echo "$0 --> Normalizing HYP text ..."
./utils/textnorm_zh.py \
--has_key --to_upper \
${output_dir}/1best_recog/text.proc \
${output_dir}/1best_recog/rec.txt
grep -v $'\t$' ${output_dir}/1best_recog/rec.txt > ${output_dir}/1best_recog/rec_non_empty.txt
echo "$0 --> computing WER/CER and alignment ..."
./utils/error_rate_zh \
--tokenizer char \
--ref ${output_dir}/1best_recog/ref.txt \
--hyp ${output_dir}/1best_recog/rec_non_empty.txt \
${output_dir}/1best_recog/DETAILS.txt | tee ${output_dir}/1best_recog/RESULTS.txt
rm -rf ${output_dir}/1best_recog/rec.txt ${output_dir}/1best_recog/rec_non_empty.txt
fi

@@ -0,0 +1,40 @@
import os
import tempfile
import codecs
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks
from modelscope.msdatasets import MsDataset
if __name__ == '__main__':
    param_dict = dict()
    param_dict['hotword'] = "https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/hotword.txt"
    output_dir = "./output"
    batch_size = 1
    # dataset split ['test']
    ds_dict = MsDataset.load(dataset_name='speech_asr_aishell1_hotwords_testsets', namespace='speech_asr')
    work_dir = tempfile.TemporaryDirectory().name
    if not os.path.exists(work_dir):
        os.makedirs(work_dir)
    wav_file_path = os.path.join(work_dir, "wav.scp")
    counter = 0
    with codecs.open(wav_file_path, 'w') as fin:
        for line in ds_dict:
            counter += 1
            wav = line["Audio:FILE"]
            idx = wav.split("/")[-1].split(".")[0]
            fin.writelines(idx + " " + wav + "\n")
            if counter == 50:
                break
    audio_in = wav_file_path
    inference_pipeline = pipeline(
        task=Tasks.auto_speech_recognition,
        model="damo/speech_paraformer-large-contextual_asr_nat-zh-cn-16k-common-vocab8404",
        output_dir=output_dir,
        batch_size=batch_size,
        param_dict=param_dict)
    rec_result = inference_pipeline(audio_in=audio_in)

@@ -34,6 +34,6 @@ for sample_offset in range(0, speech_length, min(stride_size, speech_length - sa
    rec_result = inference_pipeline(audio_in=speech[sample_offset: sample_offset + stride_size],
                                    param_dict=param_dict)
    if len(rec_result) != 0:
        final_result += rec_result['text'] + " "
    print(rec_result)
print(final_result)
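The stride loop above slices the waveform into fixed-size chunks for chunked inference; the chunk boundaries it walks can be sketched as follows (assuming `stride_size` is in samples, with a shorter final chunk):

```python
def chunk_spans(speech_length: int, stride_size: int):
    """(start, end) spans tiling [0, speech_length) in stride_size steps,
    as the streaming demo's for-loop does over the waveform."""
    return [(start, min(start + stride_size, speech_length))
            for start in range(0, speech_length, stride_size)]

print(chunk_spans(10, 4))  # → [(0, 4), (4, 8), (8, 10)]
```

Every sample falls in exactly one span, so concatenating the per-chunk texts covers the whole utterance.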

@@ -1 +1 @@
../../TEMPLATE/README.md

@@ -0,0 +1,103 @@
#!/usr/bin/env bash
set -e
set -u
set -o pipefail
stage=1
stop_stage=2
model="damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch"
data_dir="./data/test"
output_dir="./results"
batch_size=64
gpu_inference=true # whether to perform gpu decoding
gpuid_list="0,1" # set gpus, e.g., gpuid_list="0,1"
njob=64                # number of CPU decoding jobs (used when gpu_inference=false)
checkpoint_dir=
checkpoint_name="valid.cer_ctc.ave.pb"
. utils/parse_options.sh || exit 1;
if [ "${gpu_inference}" == "true" ]; then
nj=$(echo $gpuid_list | awk -F "," '{print NF}')
else
nj=$njob
batch_size=1
gpuid_list=""
for JOB in $(seq ${nj}); do
gpuid_list=$gpuid_list"-1,"
done
fi
mkdir -p $output_dir/split
split_scps=""
for JOB in $(seq ${nj}); do
split_scps="$split_scps $output_dir/split/wav.$JOB.scp"
done
perl utils/split_scp.pl ${data_dir}/wav.scp ${split_scps}
if [ -n "${checkpoint_dir}" ]; then
python utils/prepare_checkpoint.py ${model} ${checkpoint_dir} ${checkpoint_name}
model=${checkpoint_dir}/${model}
fi
if [ $stage -le 1 ] && [ $stop_stage -ge 1 ];then
echo "Decoding ..."
gpuid_list_array=(${gpuid_list//,/ })
for JOB in $(seq ${nj}); do
{
id=$((JOB-1))
gpuid=${gpuid_list_array[$id]}
mkdir -p ${output_dir}/output.$JOB
python infer.py \
--model ${model} \
--audio_in ${output_dir}/split/wav.$JOB.scp \
--output_dir ${output_dir}/output.$JOB \
--batch_size ${batch_size} \
--gpuid ${gpuid}
}&
done
wait
mkdir -p ${output_dir}/1best_recog
for f in token score text; do
if [ -f "${output_dir}/output.1/1best_recog/${f}" ]; then
for i in $(seq "${nj}"); do
cat "${output_dir}/output.${i}/1best_recog/${f}"
done | sort -k1 >"${output_dir}/1best_recog/${f}"
fi
done
fi
if [ $stage -le 2 ] && [ $stop_stage -ge 2 ];then
echo "Computing WER ..."
cp ${output_dir}/1best_recog/text ${output_dir}/1best_recog/text.proc
cp ${data_dir}/text ${output_dir}/1best_recog/text.ref
python utils/compute_wer.py ${output_dir}/1best_recog/text.ref ${output_dir}/1best_recog/text.proc ${output_dir}/1best_recog/text.cer
tail -n 3 ${output_dir}/1best_recog/text.cer
fi
if [ $stage -le 3 ] && [ $stop_stage -ge 3 ];then
echo "SpeechIO TIOBE textnorm"
echo "$0 --> Normalizing REF text ..."
./utils/textnorm_zh.py \
--has_key --to_upper \
${data_dir}/text \
${output_dir}/1best_recog/ref.txt
echo "$0 --> Normalizing HYP text ..."
./utils/textnorm_zh.py \
--has_key --to_upper \
${output_dir}/1best_recog/text.proc \
${output_dir}/1best_recog/rec.txt
grep -v $'\t$' ${output_dir}/1best_recog/rec.txt > ${output_dir}/1best_recog/rec_non_empty.txt
echo "$0 --> computing WER/CER and alignment ..."
./utils/error_rate_zh \
--tokenizer char \
--ref ${output_dir}/1best_recog/ref.txt \
--hyp ${output_dir}/1best_recog/rec_non_empty.txt \
${output_dir}/1best_recog/DETAILS.txt | tee ${output_dir}/1best_recog/RESULTS.txt
rm -rf ${output_dir}/1best_recog/rec.txt ${output_dir}/1best_recog/rec_non_empty.txt
fi

@@ -1 +1 @@
../../TEMPLATE/README.md

@@ -1 +1 @@
../../TEMPLATE/infer.py
