upload m2met2 pages

This commit is contained in:
yhliang 2023-04-12 13:02:14 +08:00
parent 60d38fa9ca
commit b429d90bc4
15 changed files with 241 additions and 0 deletions

34
.github/workflows/m2met2.yml vendored Normal file

@ -0,0 +1,34 @@
name: "M2MET2.0 Pages"
on:
pull_request:
branches:
- main
push:
branches:
- dev_lyh
jobs:
docs:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v1
- uses: ammaraskar/sphinx-action@master
with:
      docs-folder: "docs_m2met2/"
pre-build-command: "pip install sphinx-markdown-tables nbsphinx jinja2 recommonmark sphinx_rtd_theme"
- name: deploy copy
if: github.ref == 'refs/heads/main' || github.ref == 'refs/heads/dev_lyh'
run: |
mkdir public
mkdir public/m2met2
touch public/m2met2/.nojekyll
cp -r docs_m2met2/_build/html/* public/m2met2/
- name: deploy github.io pages
if: github.ref == 'refs/heads/main' || github.ref == 'refs/heads/dev_lyh'
uses: peaceiris/actions-gh-pages@v2.3.1
env:
GITHUB_TOKEN: ${{ secrets.ACCESS_TOKEN }}
PUBLISH_BRANCH: gh-pages
PUBLISH_DIR: public

12
docs_m2met2/Baseline.md Normal file

@ -0,0 +1,12 @@
# Baseline
## Overview
We provide an end-to-end SA-ASR baseline built on [FunASR](https://github.com/alibaba-damo-academy/FunASR) as a recipe. The model architecture is shown in Figure 3. The SpeakerEncoder is initialized with a pre-trained [speaker verification model](https://modelscope.cn/models/damo/speech_xvector_sv-zh-cn-cnceleb-16k-spk3465-pytorch/summary) from [ModelScope](https://modelscope.cn/home). This speaker verification model is also used to extract the speaker embeddings in the speaker profile.
![model architecture](images/sa_asr_arch.png)
## Quick start
#TODO: fill with the README.md of the baseline
## Baseline results
The results of the baseline system are shown in Table 3. The speaker profile uses the oracle speaker embeddings during training. However, since oracle speaker labels are unavailable during evaluation, the speaker profile produced by an additional spectral clustering step is used instead. Meanwhile, results using the oracle speaker profile on the Eval and Test sets are also provided to show the impact of speaker profile accuracy.
![baseline result](images/baseline_result.png)
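For illustration only, the sketch below shows one way such a clustering-based profile could be built: segment-level speaker embeddings are grouped with spectral clustering and averaged per cluster. This is not the baseline's actual script; `extract_embedding` is a hypothetical wrapper around the pre-trained speaker verification model, and scikit-learn is assumed to be available.
```python
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.metrics.pairwise import cosine_similarity

def build_speaker_profile(segments, n_speakers, extract_embedding):
    """segments: list of single-speaker waveform chunks from one session.
    extract_embedding: hypothetical callable wrapping the pre-trained
    speaker verification model; returns a fixed-dimensional embedding.
    Returns an (n_speakers, dim) array, one profile vector per speaker."""
    embeddings = np.stack([extract_embedding(seg) for seg in segments])
    # Cosine similarity between embeddings; clip negatives so the affinity
    # matrix stays non-negative, as spectral clustering expects.
    affinity = np.clip(cosine_similarity(embeddings), 0.0, None)
    labels = SpectralClustering(
        n_clusters=n_speakers, affinity="precomputed", random_state=0
    ).fit_predict(affinity)
    # Average embeddings within each cluster to form the speaker profile.
    return np.stack(
        [embeddings[labels == k].mean(axis=0) for k in range(n_speakers)]
    )
```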

24
docs_m2met2/Dataset.md Normal file

@ -0,0 +1,24 @@
# Datasets
## Overview of training data
In the fixed training condition, the training dataset is restricted to three publicly available corpora, namely, AliMeeting, AISHELL-4, and CN-Celeb. To evaluate the performance of the models trained on these datasets, we will release a new Test set called Test-2023 for scoring and ranking. We will describe the AliMeeting dataset and the Test-2023 set in detail.
## Detail of AliMeeting corpus
AliMeeting contains 118.75 hours of speech data in total. The dataset is divided into 104.75 hours for training (Train), 4 hours for evaluation (Eval), and 10 hours as a test set (Test) for scoring and ranking. Specifically, the Train and Eval sets contain 212 and 8 sessions, respectively. Each session consists of a 15- to 30-minute discussion by a group of participants. The Train and Eval sets include 456 and 25 participants, respectively, with balanced gender coverage.
The dataset was collected in 13 meeting venues, which are categorized into three types: small, medium, and large rooms, with sizes ranging from 8 m$^{2}$ to 55 m$^{2}$. The different rooms provide a variety of acoustic properties and layouts. The detailed parameters of each meeting venue will be released together with the Train data. The wall materials of the meeting venues include cement, glass, etc. Other furnishings in the meeting venues include sofas, TVs, blackboards, fans, air conditioners, plants, etc. During recording, the participants sit around a microphone array placed on the table and conduct a natural conversation. The microphone-speaker distance ranges from 0.3 m to 5.0 m. All participants are native Chinese speakers speaking Mandarin without strong accents. During the meeting, various kinds of indoor noise, including but not limited to clicking, keyboard, door opening/closing, fan, and babble noise, occur naturally. For both the Train and Eval sets, the participants are required to remain in the same position during recording. There is no speaker overlap between the Train and Eval sets. An example of a recording venue from the Train set is shown in Fig. 1.
![meeting room](images/meeting_room.png)
The number of participants within one meeting session ranges from 2 to 4. To ensure coverage of different overlap ratios, we selected various meeting topics during recording, including medical treatment, education, business, organization management, industrial production, and other daily routine meetings. The average speech overlap ratios of the Train and Eval sets are 42.27% and 34.76%, respectively. More details of AliMeeting are shown in Table 1. A detailed overlap ratio distribution of meeting sessions with different numbers of speakers in the Train and Eval sets is shown in Table 2.
![dataset detail](images/dataset_detail.png)
The Test-2023 set consists of 20 sessions that were recorded in an identical acoustic setting to that of the AliMeeting corpus. Each meeting session in the Test-2023 dataset comprises between 2 and 4 participants, thereby sharing a similar configuration with the AliMeeting test set.
We also record the near-field signal of each participant using a headset microphone and ensure that only the participant's own speech is recorded and transcribed. It is worth noting that the far-field audio recorded by the microphone array and the near-field audio recorded by the headset microphones are synchronized to a common timeline.
All transcriptions of the speech data are prepared in TextGrid format for each session, containing the session duration, speaker information (number of speakers, speaker ID, gender, etc.), the total number of segments per speaker, and the timestamp and transcription of each segment.
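As a minimal illustration of how such annotations could be parsed, the snippet below uses the third-party `textgrid` Python package (an assumption, not a tool shipped with the challenge); the file name and tier layout shown are hypothetical and should be adapted to the released corpus.
```python
import textgrid  # third-party package: pip install textgrid

# Hypothetical file name; real names follow the released corpus layout.
tg = textgrid.TextGrid.fromFile("R0003_M0046.TextGrid")
print(f"session duration: {tg.maxTime:.1f} s, tiers: {len(tg.tiers)}")
for tier in tg:                          # typically one interval tier per speaker
    spoken = [iv for iv in tier if iv.mark.strip()]
    print(f"{tier.name}: {len(spoken)} segments")
    for iv in spoken[:3]:                # timestamp and transcription of a segment
        print(f"  [{iv.minTime:.2f}-{iv.maxTime:.2f}] {iv.mark}")
```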
## Get the data
The three training datasets mentioned above can be downloaded from [OpenSLR](https://openslr.org/resources.php) via the following links. In particular, the baseline provides convenient data preparation scripts for the AliMeeting corpus.
- [AliMeeting](https://openslr.org/119/)
- [AISHELL-4](https://openslr.org/111/)
- [CN-Celeb](https://openslr.org/82/)

27
docs_m2met2/Introduction.md Normal file

@ -0,0 +1,27 @@
# Introduction
## Call for participation
Recent advancements in speech signal processing, including speech recognition and speaker diarization, have led to a proliferation of speech technology applications. Meetings are particularly challenging scenarios for speech technologies, given their varied speaking styles and complex acoustic conditions, such as overlapping speech, unknown numbers of speakers, far-field signals in large conference rooms, noise, and reverberation.
To advance the development of meeting transcription, several relevant challenges have been organized, such as the Rich Transcription evaluation and Computational Hearing in Multisource Environments (CHIME) challenges. However, the differences across languages limit the progress of non-English meeting transcription, such as Mandarin. The Multimodal Information Based Speech Processing (MISP) and Multi-Channel Multi-Party Meeting Transcription (M2MeT) challenges have contributed to advancing Mandarin meeting transcription. The MISP challenge addresses the problem of audio-visual distant multi-microphone signal processing in everyday home environments, while the M2MeT challenge focuses on solving the speech overlap problem of meeting transcription in offline meeting rooms.
Building on the success of the M2MeT challenge, we are pleased to announce the M2MeT2.0 challenge as an ASRU2023 Signal Processing Grand Challenge. In the M2MeT challenge, the evaluation metric was speaker-independent, meaning that we could only determine the transcription but not the corresponding speaker. To further advance current multi-talker ASR systems toward practicality, the M2MeT2.0 challenge proposes the speaker-attributed ASR task with two sub-tracks under fixed and open training conditions. We provide a detailed introduction of the dataset, rules, evaluation methods, and baseline systems to further promote reproducible research in this field. The organizer will select the top three papers and include them in the ASRU2023 Proceedings.
## Timeline (AoE Time)
- **May 5th, 2023:** Registration deadline; the due date for participants to join the challenge.
- **June 9th, 2023:** Test data release.
- **June 13th, 2023:** Final submission deadline.
- **June 19th, 2023:** Evaluation results and ranking release.
- **July 3rd, 2023:** Deadline for paper submission.
- **July 10th, 2023:** Deadline for final paper submission.
## Guidelines
Potential participants from both academia and industry should send an email to **m2met.alimeeting@gmail.com** to register for the challenge on or before April 21, with the following requirements:
- Email subject: [ASRU2023 M2MeT2.0 Challenge Registration] Team Name - Participating sub-track.
- Provide the team name, affiliation, participating sub-track, and team captain as well as members with contact information.
The organizer will notify qualified teams via email within 3 working days. Qualified teams must follow the challenge rules, which will be released on the challenge website.

20
docs_m2met2/Makefile Normal file

@ -0,0 +1,20 @@
# Minimal makefile for Sphinx documentation
#
# You can set these variables from the command line, and also
# from the environment for the first two.
SPHINXOPTS ?=
SPHINXBUILD ?= sphinx-build
SOURCEDIR = .
BUILDDIR = _build
# Put it first so that "make" without argument is like "make help".
help:
@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
.PHONY: help Makefile
# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)

1
docs_m2met2/Organizers.md Normal file

@ -0,0 +1 @@
# Organizers

16
docs_m2met2/Rules.md Normal file

@ -0,0 +1,16 @@
# Rules
All participants should adhere to the following rules to be eligible for the challenge.
- Data augmentation is allowed on the original training dataset, including, but not limited to, adding noise or reverberation, speed perturbation, and tone change.
- Any non-compliant use of the Test dataset is strictly prohibited, including but not limited to using the Test dataset to fine-tune or train the model.
- Multi-system fusion is allowed, but fusing systems with the same structure and different parameters is not encouraged.
- If two systems achieve the same cpCER on the Test dataset, the system with lower computational complexity will be judged as the superior one.
- If forced alignment is used to obtain frame-level classification labels, the forced alignment model must be trained only on the data allowed by the corresponding sub-track.
- Shallow fusion is allowed for end-to-end approaches, e.g., LAS, RNN-T, and Transformer, but the training data of the shallow-fusion language model may only come from the transcripts of the allowed training datasets.
- The right of final interpretation belongs to the organizer. In case of special circumstances, the organizer will coordinate the interpretation.

12
docs_m2met2/Track_setting_and_evaluation.md Normal file

@ -0,0 +1,12 @@
# Speaker-Attributed ASR (Main Track)
## Overview
The speaker-attributed ASR task presents the challenge of transcribing the speech of each individual speaker from overlapped speech and assigning a speaker label to each transcription. In this track, the AliMeeting, AISHELL-4, and CN-Celeb datasets can be used as constrained data sources. The AliMeeting dataset, which was used in the M2MeT challenge, contains Train, Eval, and Test sets that can be utilized during both training and evaluation. Additionally, a new Test-2023 set containing about 10 hours of meeting data will be released (according to the timeline) for challenge scoring and ranking. It is important to note that the organizers will not provide the headset near-field audio, transcriptions, or oracle timestamps. Instead of oracle timestamps for each speaker, segments containing multiple speakers are provided on the Test-2023 set. These segments can be obtained using a simple VAD model.
## Evaluation metric
The accuracy of a speaker-attributed ASR system is evaluated using the concatenated minimum-permutation character error rate (cpCER) metric. The calculation of cpCER involves three steps. First, the reference and hypothesis transcriptions of each speaker in a session are concatenated in chronological order. Second, the character error rate (CER) is calculated between the concatenated reference and hypothesis transcriptions, and this process is repeated for all possible speaker permutations. Finally, the permutation with the lowest CER is selected as the cpCER for that session. The CER is obtained by dividing the total number of insertions (Ins), substitutions (Sub), and deletions (Del) of characters required to transform the ASR output into the reference transcript by the total number of characters in the reference transcript. Specifically, CER is calculated by:
$$\text{CER} = \frac {\mathcal N_{\text{Ins}} + \mathcal N_{\text{Sub}} + \mathcal N_{\text{Del}} }{\mathcal N_{\text{Total}}} \times 100\%,$$
where $\mathcal N_{\text{Ins}}$, $\mathcal N_{\text{Sub}}$, and $\mathcal N_{\text{Del}}$ are the numbers of inserted, substituted, and deleted characters, respectively, and $\mathcal N_{\text{Total}}$ is the total number of characters in the reference transcript.
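For clarity, here is a minimal, self-contained sketch of the three-step computation described above (not the official scoring tool); it assumes equal numbers of reference and hypothesis speakers and scores Mandarin text at the character level.
```python
from itertools import permutations

def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance = Ins + Sub + Del."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution (or match)
        prev = curr
    return prev[-1]

def cpcer(ref_by_spk: dict, hyp_by_spk: dict) -> float:
    """ref_by_spk / hyp_by_spk map speaker-id -> list of segments in time order."""
    # Step 1: concatenate each speaker's segments chronologically.
    refs = ["".join(segs) for segs in ref_by_spk.values()]
    hyps = ["".join(segs) for segs in hyp_by_spk.values()]
    n_total = sum(len(r) for r in refs)
    # Step 2: compute the error count for every speaker permutation;
    # Step 3: keep the permutation with the lowest error rate.
    best = min(
        sum(edit_distance(r, hyps[p]) for r, p in zip(refs, perm))
        for perm in permutations(range(len(hyps)))
    )
    return 100.0 * best / n_total

# Toy usage: two speakers whose labels are swapped in the hypothesis.
ref = {"S1": ["你好"], "S2": ["今天开会"]}
hyp = {"A": ["今天开会"], "B": ["你好"]}
print(f"cpCER = {cpcer(ref, hyp):.2f}%")  # -> 0.00% under the best permutation
```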
## Sub-track arrangement
### Sub-track I (Fixed Training Condition):
Participants can only use the fixed constrained data to build systems; the use of extra data is strictly prohibited. In other words, system building is restricted to AliMeeting, AISHELL-4, and CN-Celeb. Participants are allowed to use open-source pretrained models from [Hugging Face](https://huggingface.co/models) and [ModelScope](https://www.modelscope.cn/models), provided that the utilized models are clearly listed in the final system description paper.
### Sub-track II (Open Training Condition):
Besides the fixed constrained data, participants may use any additional data, whether publicly available, privately recorded, or manually simulated, for system building. However, participants must clearly list the data used in the final system description paper. If manually simulated data is used, please describe the data simulation scheme in detail.

39
docs_m2met2/conf.py Normal file

@ -0,0 +1,39 @@
# Configuration file for the Sphinx documentation builder.
#
# For the full list of built-in configuration values, see the documentation:
# https://www.sphinx-doc.org/en/master/usage/configuration.html
# -- Project information -----------------------------------------------------
# https://www.sphinx-doc.org/en/master/usage/configuration.html#project-information
project = 'm2met2'
copyright = '2023, Speech Lab, Alibaba Group; Audio, Speech and Language Processing Group, Northwestern Polytechnical University'
author = 'Speech Lab, Alibaba Group; Audio, Speech and Language Processing Group, Northwestern Polytechnical University'
# -- General configuration ---------------------------------------------------
# https://www.sphinx-doc.org/en/master/usage/configuration.html#general-configuration
extensions = [
'nbsphinx',
'sphinx_rtd_theme',
"sphinx.ext.autodoc",
'sphinx.ext.napoleon',
'sphinx.ext.viewcode',
"sphinx.ext.mathjax",
"sphinx.ext.todo",
"sphinx_markdown_tables",
"sphinx.ext.githubpages",
'recommonmark',
]
source_suffix = [".rst", ".md"]
templates_path = ['_templates']
# exclude_patterns = ['_build', 'Thumbs.db', '.DS_Store']
exclude_patterns = []
pygments_style = "sphinx"
# -- Options for HTML output -------------------------------------------------
# https://www.sphinx-doc.org/en/master/usage/configuration.html#options-for-html-output
html_theme = 'sphinx_rtd_theme'
html_static_path = ['_static']

Binary files not shown: four image files added under docs_m2met2/images/ (144 KiB, 502 KiB, 610 KiB, 742 KiB).

21
docs_m2met2/index.rst Normal file

@ -0,0 +1,21 @@
.. m2met2 documentation master file, created by
sphinx-quickstart on Tue Apr 11 14:18:55 2023.
You can adapt this file completely to your liking, but it should at least
contain the root `toctree` directive.
ASRU 2023 MULTI-CHANNEL MULTI-PARTY MEETING TRANSCRIPTION CHALLENGE 2.0 (M2MeT2.0)
==================================================================================
Building on the success of the M2MeT challenge, we are pleased to announce the M2MeT2.0 challenge as an ASRU2023 Signal Processing Grand Challenge.
To further advance current multi-talker ASR systems toward practicality, the M2MeT2.0 challenge proposes the speaker-attributed ASR task with two sub-tracks under fixed and open training conditions.
We provide a detailed introduction of the dataset, rules, evaluation methods, and baseline systems to further promote reproducible research in this field.
.. toctree::
:maxdepth: 1
:caption: Contents:
./Introduction
./Dataset
./Track_setting_and_evaluation
./Baseline
./Rules
./Organizers

35
docs_m2met2/make.bat Normal file

@ -0,0 +1,35 @@
@ECHO OFF
pushd %~dp0
REM Command file for Sphinx documentation
if "%SPHINXBUILD%" == "" (
set SPHINXBUILD=sphinx-build
)
set SOURCEDIR=.
set BUILDDIR=_build
%SPHINXBUILD% >NUL 2>NUL
if errorlevel 9009 (
echo.
echo.The 'sphinx-build' command was not found. Make sure you have Sphinx
echo.installed, then set the SPHINXBUILD environment variable to point
echo.to the full path of the 'sphinx-build' executable. Alternatively you
echo.may add the Sphinx directory to PATH.
echo.
echo.If you don't have Sphinx installed, grab it from
echo.https://www.sphinx-doc.org/
exit /b 1
)
if "%1" == "" goto help
%SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
goto end
:help
%SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
:end
popd