add github docs for sd and sv

志浩 2023-04-27 17:36:07 +08:00
commit 6a1c2a4d05
973 changed files with 20016 additions and 191406 deletions

View File

@ -29,10 +29,10 @@ jobs:
cp -r docs/_build/html/* public/en/
mkdir public/m2met2
touch public/m2met2/.nojekyll
cp -r docs_m2met2/_build/html/* public/m2met2/
cp -r docs/m2met2/_build/html/* public/m2met2/
mkdir public/m2met2_cn
touch public/m2met2_cn/.nojekyll
cp -r docs_m2met2_cn/_build/html/* public/m2met2_cn/
cp -r docs/m2met2_cn/_build/html/* public/m2met2_cn/
- name: deploy github.io pages
if: github.ref == 'refs/heads/main' || github.ref == 'refs/heads/dev_wjm' || github.ref == 'refs/heads/dev_lyh'

View File

@ -18,14 +18,12 @@
| [**Runtime**](https://github.com/alibaba-damo-academy/FunASR/tree/main/funasr/runtime)
| [**Model Zoo**](https://github.com/alibaba-damo-academy/FunASR/blob/main/docs/modelscope_models.md)
| [**Contact**](#contact)
|
[**M2MET2.0 Guidance_CN**](https://alibaba-damo-academy.github.io/FunASR/m2met2_cn/index.html)
| [**M2MET2.0 Guidance_EN**](https://alibaba-damo-academy.github.io/FunASR/m2met2/index.html)
| [**M2MET2.0 Challenge**](https://github.com/alibaba-damo-academy/FunASR#multi-channel-multi-party-meeting-transcription-20-m2met20-challenge)
## Multi-Channel Multi-Party Meeting Transcription 2.0 (M2MET2.0) Challenge
We are pleased to announce that the M2MeT2.0 challenge will be held in the near future. The baseline system is built on FunASR and is provided as a recipe for the AliMeeting corpus. For more details, see the M2MET2.0 guidance ([CN](https://alibaba-damo-academy.github.io/FunASR/m2met2_cn/index.html)/[EN](https://alibaba-damo-academy.github.io/FunASR/m2met2/index.html)).
## What's new:
### Multi-Channel Multi-Party Meeting Transcription 2.0 (M2MET2.0) Challenge
We are pleased to announce that the M2MeT2.0 challenge will be held in the near future. The baseline system is built on FunASR and is provided as a recipe for the AliMeeting corpus. For more details, see the M2MET2.0 guidance ([CN](https://alibaba-damo-academy.github.io/FunASR/m2met2_cn/index.html)/[EN](https://alibaba-damo-academy.github.io/FunASR/m2met2/index.html)).
### Release notes
For the release notes, please refer to [news](https://github.com/alibaba-damo-academy/FunASR/releases).
## Highlights

View File

@ -37,13 +37,13 @@ Overview
:maxdepth: 1
:caption: ModelScope Egs
./modescope_pipeline/quick_start.md
./modescope_pipeline/asr_pipeline.md
./modescope_pipeline/vad_pipeline.md
./modescope_pipeline/punc_pipeline.md
./modescope_pipeline/tp_pipeline.md
./modescope_pipeline/sv_pipeline.md
./modescope_pipeline/sd_pipeline.md
./modelscope_pipeline/quick_start.md
./modelscope_pipeline/asr_pipeline.md
./modelscope_pipeline/vad_pipeline.md
./modelscope_pipeline/punc_pipeline.md
./modelscope_pipeline/tp_pipeline.md
./modelscope_pipeline/sv_pipeline.md
./modelscope_pipeline/sd_pipeline.md
.. toctree::
:maxdepth: 1
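
Since this commit adds the sv_pipeline and sd_pipeline pages referenced above, a brief usage sketch may help orient readers. It assumes the standard ModelScope `pipeline` interface already used by the other pipeline docs; the model ID and the `audio_in` argument below are illustrative assumptions, not text from the new pages.

```python
# Minimal sketch (assumed, not verbatim from the new docs): calling the speaker
# verification pipeline through ModelScope. The model ID is a placeholder-style
# example; see sv_pipeline.md / sd_pipeline.md for the exact names and options.
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

sv_pipeline = pipeline(
    task=Tasks.speaker_verification,
    model='damo/speech_xvector_sv-zh-cn-cnceleb-16k-spk3465-pytorch',  # assumed model ID
)

# Score whether the enrollment and test utterances come from the same speaker.
result = sv_pipeline(audio_in=('enroll.wav', 'test.wav'))
print(result)

# The speaker diarization (sd) pipeline is invoked through the same
# pipeline(task=..., model=...) interface; see sd_pipeline.md for its inputs.
```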

View File

@ -2,7 +2,7 @@
## Overview of training data
In the fixed training condition, the training dataset is restricted to three publicly available corpora, namely, AliMeeting, AISHELL-4, and CN-Celeb. To evaluate the performance of the models trained on these datasets, we will release a new Test set called Test-2023 for scoring and ranking. We will describe the AliMeeting dataset and the Test-2023 set in detail.
## Detail of AliMeeting corpus
AliMeeting contains 118.75 hours of speech data in total. The dataset is divided into 104.75 hours for training (Train), 4 hours for evaluation (Eval) and 10 hours as test set (Test) for scoring and ranking. Specifically, the Train and Eval sets contain 212 and 8 sessions, respectively. Each session consists of a 15 to 30-minute discussion by a group of participants. The total number of participants in Train and Eval sets is 456 and 25, respectively, with balanced gender coverage.
AliMeeting contains 118.75 hours of speech data in total. The dataset is divided into 104.75 hours for training (Train), 4 hours for evaluation (Eval) and 10 hours as test set (Test) for scoring and ranking. Specifically, the Train, Eval and Test sets contain 212, 8 and 20 sessions, respectively. Each session consists of a 15 to 30-minute discussion by a group of participants. The total number of participants in Train, Eval and Test sets is 456, 25 and 60, respectively, with balanced gender coverage.
The dataset is collected in 13 meeting venues, which are categorized into three types: small, medium, and large rooms with sizes ranging from 8 m$^{2}$ to 55 m$^{2}$. Different rooms give us a variety of acoustic properties and layouts. The detailed parameters of each meeting venue will be released together with the Train data. The type of wall material of the meeting venues covers cement, glass, etc. Other furnishings in meeting venues include sofa, TV, blackboard, fan, air conditioner, plants, etc. During recording, the participants of the meeting sit around the microphone array which is placed on the table and conduct a natural conversation. The microphone-speaker distance ranges from 0.3 m to 5.0 m. All participants are native Chinese speakers speaking Mandarin without strong accents. During the meeting, various kinds of indoor noise including but not limited to clicking, keyboard, door opening/closing, fan, bubble noise, etc., are made naturally. For both Train and Eval sets, the participants are required to remain in the same position during recording. There is no speaker overlap between the Train and Eval set. An example of the recording venue from the Train set is shown in Fig 1.

View File

@ -9,17 +9,20 @@ The ICASSP2022 M2MeT challenge focuses on meeting scenarios, and it comprises tw
Building on the success of the previous M2MeT challenge, we are excited to propose the M2MeT2.0 challenge as an ASRU2023 challenge special session. In the original M2MeT challenge, the evaluation metric was speaker-independent, which meant that the transcription could be determined, but not the corresponding speaker. To address this limitation and further advance current multi-talker ASR systems towards practicality, the M2MeT2.0 challenge proposes the speaker-attributed ASR task with two sub-tracks: fixed and open training conditions. The speaker-attributed automatic speech recognition (ASR) task aims to tackle the practical and challenging problem of identifying "who spoke what and when". To facilitate reproducible research in this field, we offer a comprehensive overview of the dataset, rules, evaluation metrics, and baseline systems. Furthermore, we will release a carefully curated test set, comprising approximately 10 hours of audio, according to the timeline. The new test set is designed to enable researchers to validate and compare their models' performance and advance the state of the art in this area.
## Timeline (AOE Time)
- $ May~5^{th}, 2023: $ Registration deadline, the due date for participants to join the Challenge.
- $ June~9^{th}, 2023: $ Test data release.
- $ June~13^{rd}, 2023: $ Final submission deadline.
- $ June~19^{th}, 2023: $ Evaluation result and ranking release.
- $ July~3^{rd}, 2023: $ Deadline for paper submission.
- $ July~10^{th}, 2023: $ Deadline for final paper submission.
- $ December~12^{nd}\ to\ 16^{th}, 2023: $ ASRU Workshop
- $ April~29, 2023: $ Challenge and registration open.
- $ May~8, 2023: $ Baseline release.
- $ May~15, 2023: $ Registration deadline, the due date for participants to join the Challenge.
- $ June~9, 2023: $ Test data release and leaderboard open.
- $ June~13, 2023: $ Final submission deadline.
- $ June~19, 2023: $ Evaluation result and ranking release.
- $ July~3, 2023: $ Deadline for paper submission.
- $ July~10, 2023: $ Deadline for final paper submission.
- $ December~12\ to\ 16, 2023: $ ASRU Workshop and challenge session
## Guidelines
Interested participants, whether from academia or industry, must register for the challenge by completing a Google form, which will be available here. The deadline for registration is May 5, 2023.
Interested participants, whether from academia or industry, must register for the challenge by completing the Google form below. The deadline for registration is May 15, 2023.
[M2MET2.0 Registration](https://docs.google.com/forms/d/e/1FAIpQLSf77T9vAl7Ym-u5g8gXu18SBofoWRaFShBo26Ym0-HDxHW9PQ/viewform?usp=sf_link)
Within three working days, the challenge organizer will send email invitations to eligible teams to participate in the challenge. All qualified teams are required to adhere to the challenge rules, which will be published on the challenge page. Prior to the ranking release time, each participant must submit a system description document detailing their approach and methods. The organizer will select the top three submissions to be included in the ASRU2023 Proceedings.

View File

@ -4,8 +4,6 @@ All participants should adhere to the following rules to be eligible for the cha
- Data augmentation is allowed on the original training dataset, including, but not limited to, adding noise or reverberation, speed perturbation and tone change.
- Participants are permitted to use the Eval set for model training, but it is not allowed to use the Test set for this purpose. Instead, the Test set should only be utilized for parameter tuning and model selection. Any use of the Test-2023 dataset that violates these rules is strictly prohibited, including but not limited to the use of the Test set for fine-tuning or training the model.
- Multi-system fusion is allowed, but combining systems with the same structure and different parameters is not encouraged.
- If two systems have the same cpCER on the Test dataset, the system with the lower computational complexity will be judged the superior one.
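
The rules above refer to cpCER as the ranking metric. As a rough orientation, the sketch below illustrates how cpCER (concatenated minimum-permutation character error rate) is typically computed for speaker-attributed ASR: each speaker's utterances are concatenated, every speaker assignment between reference and hypothesis is tried, and the assignment with the fewest character errors is kept. This is only an illustrative sketch, not the official scoring script.

```python
# Illustrative cpCER sketch (not the official scoring tool). Inputs map a
# speaker ID to that speaker's concatenated transcript (character string).
from itertools import permutations

def edit_distance(ref: str, hyp: str) -> int:
    # Character-level Levenshtein distance.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution / match
        prev = cur
    return prev[-1]

def cpcer(ref_spk2text: dict, hyp_spk2text: dict) -> float:
    refs, hyps = list(ref_spk2text.values()), list(hyp_spk2text.values())
    n = max(len(refs), len(hyps))
    refs += [""] * (n - len(refs))  # unmatched hypothesis speakers count as insertions
    hyps += [""] * (n - len(hyps))  # unmatched reference speakers count as deletions
    total_ref_chars = sum(len(r) for r in refs)
    # Try every speaker assignment and keep the one with the fewest errors.
    best = min(sum(edit_distance(r, h) for r, h in zip(refs, perm))
               for perm in permutations(hyps))
    return best / total_ref_chars

# Toy example: the same transcripts with swapped speaker labels give cpCER 0.0.
print(cpcer({"A": "今天开会", "B": "好的"}, {"1": "好的", "2": "今天开会"}))
```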

View File

@ -1,6 +1,6 @@
# Track & Evaluation
## Speaker-Attributed ASR (Main Track)
The speaker-attributed ASR task poses a unique challenge of transcribing speech from multiple speakers and assigning a speaker label to the transcription. Figure 2 illustrates the difference between the speaker-attributed ASR task and the multi-speaker ASR task. This track allows for the use of the AliMeeting, Aishell4, and Cn-Celeb datasets as constrained data sources during both training and evaluation. The AliMeeting dataset, which was used in the M2MeT challenge, includes Train, Eval, and Test sets. Additionally, a new Test-2023 set, consisting of approximately 10 hours of meeting data recorded in an identical acoustic setting as the AliMeeting corpus, will be released soon for challenge scoring and ranking. It's worth noting that the organizers will not provide the near-field audio, transcriptions, or oracle timestamps. Instead, segments containing multiple speakers will be provided on the Test-2023 set, which can be obtained using a simple voice activity detection (VAD) model.
## Speaker-Attributed ASR
The speaker-attributed ASR task poses a unique challenge of transcribing speech from multiple speakers and assigning a speaker label to the transcription. Figure 2 illustrates the difference between the speaker-attributed ASR task and the multi-speaker ASR task. This track allows for the use of the AliMeeting, Aishell4, and Cn-Celeb datasets as constrained data sources during both training and evaluation. The AliMeeting dataset, which was used in the M2MeT challenge, includes Train, Eval, and Test sets. Additionally, a new Test-2023 set, consisting of approximately 10 hours of meeting data recorded in an identical acoustic setting as the AliMeeting corpus, will be released soon for challenge scoring and ranking. It's worth noting that the organizers will not provide the near-field audio, transcriptions, or oracle timestamps of the Test-2023 set. Instead, segments containing multiple speakers will be provided, which can be obtained using a simple voice activity detection (VAD) model.
![task difference](images/task_diff.png)
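
The segments mentioned above are described as coming from a simple voice activity detection model. As a rough illustration of the kind of output such segmentation produces (start/end times of speech regions), here is a toy energy-threshold VAD; it is only a sketch, and the actual Test-2023 segments will come from the organizers' trained VAD model.

```python
# Toy energy-based VAD sketch: returns (start_sec, end_sec) speech regions.
# This is illustrative only and not the VAD used to prepare Test-2023.
import numpy as np

def energy_vad(wav, sr=16000, frame_ms=25, hop_ms=10, threshold_db=-40.0, min_dur=0.3):
    frame, hop = int(sr * frame_ms / 1000), int(sr * hop_ms / 1000)
    n_frames = max(0, 1 + (len(wav) - frame) // hop)
    # Per-frame log energy, later compared against the loudest frame.
    energies = np.array([10 * np.log10(np.mean(wav[i * hop:i * hop + frame] ** 2) + 1e-10)
                         for i in range(n_frames)])
    active = energies > (energies.max() + threshold_db)
    segments, start = [], None
    for i, is_speech in enumerate(active):
        t = i * hop / sr
        if is_speech and start is None:
            start = t
        elif not is_speech and start is not None:
            if t - start >= min_dur:
                segments.append((start, t))
            start = None
    end = n_frames * hop / sr
    if start is not None and end - start >= min_dur:
        segments.append((start, end))
    return segments

# Example: 1 s of faint noise followed by 1 s of a tone yields one segment.
sr = 16000
audio = np.concatenate([0.001 * np.random.randn(sr),
                        0.5 * np.sin(2 * np.pi * 220 * np.arange(sr) / sr)])
print(energy_vad(audio, sr))
```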


View File

@ -88,7 +88,7 @@
</ul>
</li>
<li class="toctree-l1"><a class="reference internal" href="Track_setting_and_evaluation.html">Track &amp; Evaluation</a><ul>
<li class="toctree-l2"><a class="reference internal" href="Track_setting_and_evaluation.html#speaker-attributed-asr-main-track">Speaker-Attributed ASR (Main Track)</a></li>
<li class="toctree-l2"><a class="reference internal" href="Track_setting_and_evaluation.html#speaker-attributed-asr">Speaker-Attributed ASR</a></li>
<li class="toctree-l2"><a class="reference internal" href="Track_setting_and_evaluation.html#evaluation-metric">Evaluation metric</a></li>
<li class="toctree-l2"><a class="reference internal" href="Track_setting_and_evaluation.html#sub-track-arrangement">Sub-track arrangement</a></li>
</ul>
@ -135,8 +135,8 @@
</section>
<section id="baseline-results">
<h2>Baseline results<a class="headerlink" href="#baseline-results" title="Permalink to this heading"></a></h2>
<p>The results of the baseline system are shown in Table 3. The speaker profile adopts the oracle speaker embedding during training. However, due to the lack of oracle speaker label during evaluation, the speaker profile provided by an additional spectral clustering is used. Meanwhile, the results of using the oracle speaker profile on Eval and Test Set are also provided to show the impact of speaker profile accuracy.
<img alt="baseline result" src="_images/baseline_result.png" /></p>
<p>The results of the baseline system are shown in Table 3. The speaker profile adopts the oracle speaker embeddings during training. However, due to the lack of oracle speaker labels during evaluation, the speaker profile provided by an additional spectral clustering is used. Meanwhile, the results of using the oracle speaker profile on the Eval and Test sets are also provided to show the impact of speaker profile accuracy.</p>
<p><img alt="baseline result" src="_images/baseline_result.png" /></p>
</section>
</section>

View File

@ -84,7 +84,7 @@
</ul>
</li>
<li class="toctree-l1"><a class="reference internal" href="Track_setting_and_evaluation.html">Track &amp; Evaluation</a><ul>
<li class="toctree-l2"><a class="reference internal" href="Track_setting_and_evaluation.html#speaker-attributed-asr-main-track">Speaker-Attributed ASR (Main Track)</a></li>
<li class="toctree-l2"><a class="reference internal" href="Track_setting_and_evaluation.html#speaker-attributed-asr">Speaker-Attributed ASR</a></li>
<li class="toctree-l2"><a class="reference internal" href="Track_setting_and_evaluation.html#evaluation-metric">Evaluation metric</a></li>
<li class="toctree-l2"><a class="reference internal" href="Track_setting_and_evaluation.html#sub-track-arrangement">Sub-track arrangement</a></li>
</ul>

View File

@ -89,7 +89,7 @@
</ul>
</li>
<li class="toctree-l1"><a class="reference internal" href="Track_setting_and_evaluation.html">Track &amp; Evaluation</a><ul>
<li class="toctree-l2"><a class="reference internal" href="Track_setting_and_evaluation.html#speaker-attributed-asr-main-track">Speaker-Attributed ASR (Main Track)</a></li>
<li class="toctree-l2"><a class="reference internal" href="Track_setting_and_evaluation.html#speaker-attributed-asr">Speaker-Attributed ASR</a></li>
<li class="toctree-l2"><a class="reference internal" href="Track_setting_and_evaluation.html#evaluation-metric">Evaluation metric</a></li>
<li class="toctree-l2"><a class="reference internal" href="Track_setting_and_evaluation.html#sub-track-arrangement">Sub-track arrangement</a></li>
</ul>
@ -131,7 +131,7 @@
</section>
<section id="detail-of-alimeeting-corpus">
<h2>Detail of AliMeeting corpus<a class="headerlink" href="#detail-of-alimeeting-corpus" title="Permalink to this heading"></a></h2>
<p>AliMeeting contains 118.75 hours of speech data in total. The dataset is divided into 104.75 hours for training (Train), 4 hours for evaluation (Eval) and 10 hours as test set (Test) for scoring and ranking. Specifically, the Train and Eval sets contain 212 and 8 sessions, respectively. Each session consists of a 15 to 30-minute discussion by a group of participants. The total number of participants in Train and Eval sets is 456 and 25, respectively, with balanced gender coverage.</p>
<p>AliMeeting contains 118.75 hours of speech data in total. The dataset is divided into 104.75 hours for training (Train), 4 hours for evaluation (Eval) and 10 hours as test set (Test) for scoring and ranking. Specifically, the Train, Eval and Test sets contain 212, 8 and 20 sessions, respectively. Each session consists of a 15 to 30-minute discussion by a group of participants. The total number of participants in Train, Eval and Test sets is 456, 25 and 60, respectively, with balanced gender coverage.</p>
<p>The dataset is collected in 13 meeting venues, which are categorized into three types: small, medium, and large rooms with sizes ranging from 8 m<span class="math notranslate nohighlight">\(^{2}\)</span> to 55 m<span class="math notranslate nohighlight">\(^{2}\)</span>. Different rooms give us a variety of acoustic properties and layouts. The detailed parameters of each meeting venue will be released together with the Train data. The type of wall material of the meeting venues covers cement, glass, etc. Other furnishings in meeting venues include sofa, TV, blackboard, fan, air conditioner, plants, etc. During recording, the participants of the meeting sit around the microphone array which is placed on the table and conduct a natural conversation. The microphone-speaker distance ranges from 0.3 m to 5.0 m. All participants are native Chinese speakers speaking Mandarin without strong accents. During the meeting, various kinds of indoor noise including but not limited to clicking, keyboard, door opening/closing, fan, bubble noise, etc., are made naturally. For both Train and Eval sets, the participants are required to remain in the same position during recording. There is no speaker overlap between the Train and Eval set. An example of the recording venue from the Train set is shown in Fig 1.</p>
<p><img alt="meeting room" src="_images/meeting_room.png" /></p>
<p>The number of participants within one meeting session ranges from 2 to 4. To ensure the coverage of different overlap ratios, we select various meeting topics during recording, including medical treatment, education, business, organization management, industrial production and other daily routine meetings. The average speech overlap ratios of the Train, Eval and Test sets are 42.27%, 34.76% and 42.8%, respectively. More details of AliMeeting are shown in Table 1. A detailed overlap ratio distribution of meeting sessions with different numbers of speakers in the Train, Eval and Test sets is shown in Table 2.</p>

View File

@ -89,7 +89,7 @@
</ul>
</li>
<li class="toctree-l1"><a class="reference internal" href="Track_setting_and_evaluation.html">Track &amp; Evaluation</a><ul>
<li class="toctree-l2"><a class="reference internal" href="Track_setting_and_evaluation.html#speaker-attributed-asr-main-track">Speaker-Attributed ASR (Main Track)</a></li>
<li class="toctree-l2"><a class="reference internal" href="Track_setting_and_evaluation.html#speaker-attributed-asr">Speaker-Attributed ASR</a></li>
<li class="toctree-l2"><a class="reference internal" href="Track_setting_and_evaluation.html#evaluation-metric">Evaluation metric</a></li>
<li class="toctree-l2"><a class="reference internal" href="Track_setting_and_evaluation.html#sub-track-arrangement">Sub-track arrangement</a></li>
</ul>
@ -128,26 +128,28 @@
<section id="call-for-participation">
<h2>Call for participation<a class="headerlink" href="#call-for-participation" title="Permalink to this heading"></a></h2>
<p>Automatic speech recognition (ASR) and speaker diarization have made significant strides in recent years, resulting in a surge of speech technology applications across various domains. However, meetings present unique challenges to speech technologies due to their complex acoustic conditions and diverse speaking styles, including overlapping speech, variable numbers of speakers, far-field signals in large conference rooms, and environmental noise and reverberation.</p>
<p>Over the years, several challenges have been organized to advance the development of meeting transcription, including the Rich Transcription evaluation and Computational Hearing in Multisource Environments (CHIME) challenges. The latest iteration of the CHIME challenge has a particular focus on distant automatic speech recognition (ASR) and developing systems that can generalize across various array topologies and application scenarios. However, while progress has been made in English meeting transcription, language differences remain a significant barrier to achieving comparable results in non-English languages, such as Mandarin.</p>
<p>The Multimodal Information Based Speech Processing (MISP) and Multi-Channel Multi-Party Meeting Transcription (M2MeT) challenges have been instrumental in advancing Mandarin meeting transcription. The MISP challenge seeks to address the problem of audio-visual distant multi-microphone signal processing in everyday home environments, while the M2MeT challenge focuses on tackling the speech overlap issue in offline meeting rooms.</p>
<p>The ICASSP2022 M2MeT challenge focuses on meeting scenarios, and it comprises two main tasks: speaker diarization and multi-speaker automatic speech recognition (ASR). The former involves identifying who spoke when in the meeting, while the latter aims to transcribe speech from multiple speakers simultaneously, which poses significant technical difficulties due to overlapping speech and acoustic interferences.</p>
<p>Building on the success of the previous M2MeT challenge, we are excited to propose the M2MeT2.0 challenge as an ASRU2023 challenge special session. In the original M2MeT challenge, the evaluation metric was speaker-independent, which meant that the transcription could be determined, but not the corresponding speaker. To address this limitation and further advance the current multi-talker ASR system towards practicality, the M2MeT2.0 challenge proposes the speaker-attributed ASR task with two sub-tracks: fixed and open training conditions. By attributing speech to specific speakers, this task aims to improve the accuracy and applicability of multi-talker ASR systems in real-world settings. The challenge provides detailed datasets, rules, evaluation methods, and baseline systems to facilitate reproducible research in this field. The speaker-attribute automatic speech recognition (ASR) task aims to tackle the practical and challenging problem of identifying “who spoke what at when”. To facilitate reproducible research in this field, we offer a comprehensive overview of the dataset, rules, evaluation metrics, and baseline systems. Furthermore, we will release a carefully curated test set, comprising approximately 10 hours of audio, according to the timeline. The new test set is designed to enable researchers to validate and compare their models performance and advance the state of the art in this area.</p>
<p>Over the years, several challenges have been organized to advance the development of meeting transcription, including the Rich Transcription evaluation and Computational Hearing in Multisource Environments (CHIME) challenges. The latest iteration of the CHIME challenge has a particular focus on distant automatic speech recognition and developing systems that can generalize across various array topologies and application scenarios. However, while progress has been made in English meeting transcription, language differences remain a significant barrier to achieving comparable results in non-English languages, such as Mandarin. The Multimodal Information Based Speech Processing (MISP) and Multi-Channel Multi-Party Meeting Transcription (M2MeT) challenges have been instrumental in advancing Mandarin meeting transcription. The MISP challenge seeks to address the problem of audio-visual distant multi-microphone signal processing in everyday home environments, while the M2MeT challenge focuses on tackling the speech overlap issue in offline meeting rooms.</p>
<p>The ICASSP2022 M2MeT challenge focuses on meeting scenarios, and it comprises two main tasks: speaker diarization and multi-speaker automatic speech recognition. The former involves identifying who spoke when in the meeting, while the latter aims to transcribe speech from multiple speakers simultaneously, which poses significant technical difficulties due to overlapping speech and acoustic interferences.</p>
<p>Building on the success of the previous M2MeT challenge, we are excited to propose the M2MeT2.0 challenge as an ASRU2023 challenge special session. In the original M2MeT challenge, the evaluation metric was speaker-independent, which meant that the transcription could be determined, but not the corresponding speaker. To address this limitation and further advance current multi-talker ASR systems towards practicality, the M2MeT2.0 challenge proposes the speaker-attributed ASR task with two sub-tracks: fixed and open training conditions. The speaker-attributed automatic speech recognition (ASR) task aims to tackle the practical and challenging problem of identifying “who spoke what and when”. To facilitate reproducible research in this field, we offer a comprehensive overview of the dataset, rules, evaluation metrics, and baseline systems. Furthermore, we will release a carefully curated test set, comprising approximately 10 hours of audio, according to the timeline. The new test set is designed to enable researchers to validate and compare their models' performance and advance the state of the art in this area.</p>
</section>
<section id="timeline-aoe-time">
<h2>Timeline(AOE Time)<a class="headerlink" href="#timeline-aoe-time" title="Permalink to this heading"></a></h2>
<ul class="simple">
<li><p><span class="math notranslate nohighlight">\( May~5^{th}, 2023: \)</span> Registration deadline, the due date for participants to join the Challenge.</p></li>
<li><p><span class="math notranslate nohighlight">\( June~9^{th}, 2023: \)</span> Test data release.</p></li>
<li><p><span class="math notranslate nohighlight">\( June~13^{rd}, 2023: \)</span> Final submission deadline.</p></li>
<li><p><span class="math notranslate nohighlight">\( June~19^{th}, 2023: \)</span> Evaluation result and ranking release.</p></li>
<li><p><span class="math notranslate nohighlight">\( July~3^{rd}, 2023: \)</span> Deadline for paper submission.</p></li>
<li><p><span class="math notranslate nohighlight">\( July~10^{th}, 2023: \)</span> Deadline for final paper submission.</p></li>
<li><p><span class="math notranslate nohighlight">\( December~12^{nd}\ to\ 16^{th}, 2023: \)</span> ASRU Workshop</p></li>
<li><p><span class="math notranslate nohighlight">\( April~29, 2023: \)</span> Challenge and registration open.</p></li>
<li><p><span class="math notranslate nohighlight">\( May~8, 2023: \)</span> Baseline release.</p></li>
<li><p><span class="math notranslate nohighlight">\( May~15, 2023: \)</span> Registration deadline, the due date for participants to join the Challenge.</p></li>
<li><p><span class="math notranslate nohighlight">\( June~9, 2023: \)</span> Test data release and leaderboard open.</p></li>
<li><p><span class="math notranslate nohighlight">\( June~13, 2023: \)</span> Final submission deadline.</p></li>
<li><p><span class="math notranslate nohighlight">\( June~19, 2023: \)</span> Evaluation result and ranking release.</p></li>
<li><p><span class="math notranslate nohighlight">\( July~3, 2023: \)</span> Deadline for paper submission.</p></li>
<li><p><span class="math notranslate nohighlight">\( July~10, 2023: \)</span> Deadline for final paper submission.</p></li>
<li><p><span class="math notranslate nohighlight">\( December~12\ to\ 16, 2023: \)</span> ASRU Workshop and challenge session</p></li>
</ul>
</section>
<section id="guidelines">
<h2>Guidelines<a class="headerlink" href="#guidelines" title="Permalink to this heading"></a></h2>
<p>Possible improved version: Interested participants, whether from academia or industry, must register for the challenge by completing a Google form, which will be available here. The deadline for registration is May 5, 2023.</p>
<p>Interested participants, whether from academia or industry, must register for the challenge by completing the Google form below. The deadline for registration is May 15, 2023.</p>
<p><a class="reference external" href="https://docs.google.com/forms/d/e/1FAIpQLSf77T9vAl7Ym-u5g8gXu18SBofoWRaFShBo26Ym0-HDxHW9PQ/viewform?usp=sf_link">M2MET2.0 Registration</a></p>
<p>Within three working days, the challenge organizer will send email invitations to eligible teams to participate in the challenge. All qualified teams are required to adhere to the challenge rules, which will be published on the challenge page. Prior to the ranking release time, each participant must submit a system description document detailing their approach and methods. The organizer will select the top three submissions to be included in the ASRU2023 Proceedings.</p>
</section>
</section>

View File

@ -88,7 +88,7 @@
</ul>
</li>
<li class="toctree-l1"><a class="reference internal" href="Track_setting_and_evaluation.html">Track &amp; Evaluation</a><ul>
<li class="toctree-l2"><a class="reference internal" href="Track_setting_and_evaluation.html#speaker-attributed-asr-main-track">Speaker-Attributed ASR (Main Track)</a></li>
<li class="toctree-l2"><a class="reference internal" href="Track_setting_and_evaluation.html#speaker-attributed-asr">Speaker-Attributed ASR</a></li>
<li class="toctree-l2"><a class="reference internal" href="Track_setting_and_evaluation.html#evaluation-metric">Evaluation metric</a></li>
<li class="toctree-l2"><a class="reference internal" href="Track_setting_and_evaluation.html#sub-track-arrangement">Sub-track arrangement</a></li>
</ul>
@ -127,36 +127,27 @@
<p><em><strong>Lei Xie, Professor, Northwestern Polytechnical University, China</strong></em></p>
<p>Email: <a class="reference external" href="mailto:lxie&#37;&#52;&#48;nwpu&#46;edu&#46;cn">lxie<span>&#64;</span>nwpu<span>&#46;</span>edu<span>&#46;</span>cn</a></p>
<a class="reference internal image-reference" href="_images/lxie.jpeg"><img alt="lxie" src="_images/lxie.jpeg" style="width: 20%;" /></a>
<p>Lei Xie received the Ph.D. degree in computer science from Northwestern Polytechnical University, Xi'an, China, in 2004. From 2001 to 2002, he was with the Department of Electronics and Information Processing, Vrije Universiteit Brussel (VUB), Brussels, Belgium, as a Visiting Scientist. From 2004 to 2006, he was a Senior Research Associate with the Center for Media Technology, School of Creative Media, City University of Hong Kong, Hong Kong, China. From 2006 to 2007, he was a Postdoctoral Fellow with the Human-Computer Communications Laboratory (HCCL), The Chinese University of Hong Kong, Hong Kong, China. He is currently a Professor with the School of Computer Science, Northwestern Polytechnical University, Xi'an, China, and leads the Audio, Speech and Language Processing Group (ASLP&#64;NPU). He has published over 200 papers in refereed journals and conferences, such as IEEE/ACM Transactions on Audio, Speech and Language Processing, IEEE Transactions on Multimedia, Interspeech, ICASSP, ASRU, ACL and ACM Multimedia. He has received several best paper awards at flagship conferences. His current research interests include general topics in speech and language processing, multimedia, and human-computer interaction. Dr. Xie is currently an Associate Editor (AE) of IEEE/ACM Transactions on Audio, Speech and Language Processing. He has actively served as a chair of many conferences and technical committees, and serves as a member of the IEEE Speech and Language Processing Technical Committee.</p>
<p><em><strong>Kong Aik Lee, Senior Scientist at Institute for Infocomm Research, A*Star, Singapore</strong></em></p>
<p>Email: <a class="reference external" href="mailto:kongaik&#46;lee&#37;&#52;&#48;ieee&#46;org">kongaik<span>&#46;</span>lee<span>&#64;</span>ieee<span>&#46;</span>org</a></p>
<a class="reference internal image-reference" href="_images/kong.png"><img alt="kong" src="_images/kong.png" style="width: 20%;" /></a>
<p>Kong Aik Lee started off his career as a researcher, then a team leader and a strategic planning manager, at the Institute for Infocomm Research, A*STAR, Singapore, working on speaker and language recognition research. From 2018 to 2020, he spent two and a half years at NEC Corporation, Japan, focusing on voice biometrics and multi-modal biometrics products. He is proud to have worked with a great team on the voice biometrics featured on the NEC Bio-Idiom platform. He returned to Singapore in July 2020 and now leads the speech and audio analytics research at the Institute for Infocomm Research as a Senior Scientist and PI. He also serves as an Editor for Elsevier Computer Speech and Language (since 2016), was an Associate Editor for IEEE/ACM Transactions on Audio, Speech and Language Processing (2017 - 2021), and was an elected member of the IEEE Speech and Language Technical Committee (2019 - 2021).</p>
<p><em><strong>Zhijie Yan, Principal Engineer at Alibaba, China</strong></em>
Email: <a class="reference external" href="mailto:zhijie&#46;yzj&#37;&#52;&#48;alibaba-inc&#46;com">zhijie<span>&#46;</span>yzj<span>&#64;</span>alibaba-inc<span>&#46;</span>com</a></p>
<a class="reference internal image-reference" href="_images/zhijie.jpg"><img alt="zhijie" src="_images/zhijie.jpg" style="width: 20%;" /></a>
<p>Zhijie Yan holds a PhD from the University of Science and Technology of China, and is a senior member of the Institute of Electrical and Electronics Engineers (IEEE). He is also an expert reviewer of top academic conferences and journals in the speech field. His research fields include speech recognition, speech synthesis, voiceprints, and speech interaction. His research results are applied in speech services provided by Alibaba Group and Ant Financial. He was awarded the title of “One of the Top 100 Grassroots Scientists” by the China Association for Science and Technology.</p>
<p><em><strong>Shiliang Zhang, Senior Engineer at Alibaba, China</strong></em>
Email: <a class="reference external" href="mailto:sly&#46;zsl&#37;&#52;&#48;alibaba-inc&#46;com">sly<span>&#46;</span>zsl<span>&#64;</span>alibaba-inc<span>&#46;</span>com</a></p>
<a class="reference internal image-reference" href="_images/zsl.JPG"><img alt="zsl" src="_images/zsl.JPG" style="width: 20%;" /></a>
<p>Shiliang Zhang graduated with a Ph.D. from the University of Science and Technology of China in 2017. His research areas mainly include speech recognition, natural language understanding, and machine learning. He has published over 40 papers in mainstream academic journals and conferences in the fields of speech and machine learning, and has applied for dozens of patents. After obtaining his doctoral degree, he joined the Alibaba Intelligent Speech team. He is currently leading the direction of speech recognition and fundamental technology at DAMO Academy's speech laboratory.</p>
<p><em><strong>Yanmin Qian, Professor, Shanghai Jiao Tong University, China</strong></em></p>
<p>Email: <a class="reference external" href="mailto:yanminqian&#37;&#52;&#48;sjtu&#46;edu&#46;cn">yanminqian<span>&#64;</span>sjtu<span>&#46;</span>edu<span>&#46;</span>cn</a></p>
<a class="reference internal image-reference" href="_images/qian.jpeg"><img alt="qian" src="_images/qian.jpeg" style="width: 20%;" /></a>
<p>Yanmin Qian received the B.S. degree from the Department of Electronic and Information Engineering, Huazhong University of Science and Technology, Wuhan, China, in 2007, and the Ph.D. degree from the Department of Electronic Engineering, Tsinghua University, Beijing, China, in 2012. Since 2013, he has been with the Department of Computer Science and Engineering, Shanghai Jiao Tong University (SJTU), Shanghai, China, where he is currently an Associate Professor. From 2015 to 2016, he also worked as a Research Associate in the Speech Group, Cambridge University Engineering Department, Cambridge, U.K. He is a senior member of IEEE and a member of ISCA, and one of the founding members of the Kaldi Speech Recognition Toolkit. He has published more than 110 papers on speech and language processing with 4000+ citations, including at top conferences such as ICASSP, INTERSPEECH and ASRU. His current research interests include acoustic and language modeling in speech recognition, speaker and language recognition, keyword spotting, and multimedia signal processing.</p>
<p><em><strong>Zhuo Chen, Applied Scientist in Microsoft, USA</strong></em></p>
<p>Email: <a class="reference external" href="mailto:zhuc&#37;&#52;&#48;microsoft&#46;com">zhuc<span>&#64;</span>microsoft<span>&#46;</span>com</a></p>
<a class="reference internal image-reference" href="_images/chenzhuo.jpg"><img alt="chenzhuo" src="_images/chenzhuo.jpg" style="width: 20%;" /></a>
<p>Zhuo Chen received the Ph.D. degree from Columbia University, New York, NY, USA, in 2017. He is currently a Principal Applied Data Scientist with Microsoft. He has authored or coauthored more than 80 papers in peer-reviewed journals and conferences with around 6000 citations. He is a reviewer or technical committee member for more than ten journals and conferences. His research interests include automatic conversation recognition, speech separation, diarisation, and speaker information extraction. He actively participated in the academic events and challenges, and won several awards. Meanwhile, he contributed to open-sourced datasets, such as WSJ0-2mix, LibriCSS, and AISHELL-4, that have been main benchmark datasets for multi-speaker processing research. In 2020, he was the Team Leader in 2020 Jelinek workshop, leading more than 30 researchers and students to push the state of the art in conversation transcription.</p>
<p><em><strong>Jian Wu, Applied Scientist in Microsoft, USA</strong></em></p>
<p>Email: <a class="reference external" href="mailto:wujian&#37;&#52;&#48;microsoft&#46;com">wujian<span>&#64;</span>microsoft<span>&#46;</span>com</a></p>
<a class="reference internal image-reference" href="_images/wujian.jpg"><img alt="wujian" src="_images/wujian.jpg" style="width: 20%;" /></a>
<p>Jian Wu received a master's degree from Northwestern Polytechnical University, Xi'an, China, in 2020, and he is currently an Applied Scientist at Microsoft, USA. His research interests cover multi-channel signal processing, robust and multi-talker speech recognition, speech enhancement, dereverberation and separation. He has around 30 conference publications with more than 1200 citations in total. He participated in several challenges such as CHiME5, DNS 2020 and FFSVC 2020 and contributed to the open-sourced datasets including LibriCSS and AISHELL-4. He is also a reviewer for several journals and conferences such as ICASSP, SLT, TASLP and SPL.</p>
<p><em><strong>Hui Bu, CEO, AISHELL foundation, China</strong></em></p>
<p>Email: <a class="reference external" href="mailto:buhui&#37;&#52;&#48;aishelldata&#46;com">buhui<span>&#64;</span>aishelldata<span>&#46;</span>com</a></p>
<a class="reference internal image-reference" href="_images/buhui.jpeg"><img alt="buhui" src="_images/buhui.jpeg" style="width: 20%;" /></a>
<p>Hui Bu received his master's degree from the Artificial Intelligence Laboratory of Korea University in 2014. He is the founder and the CEO of AISHELL and the AISHELL foundation. He participated in the release of the AISHELL 1 &amp; 2 &amp; 3 &amp; 4, DMASH and HI-MIA open-source database projects and is a co-founder of the China Kaldi offline Technology Forum.</p>
</section>

View File

@ -88,7 +88,7 @@
</ul>
</li>
<li class="toctree-l1"><a class="reference internal" href="Track_setting_and_evaluation.html">Track &amp; Evaluation</a><ul>
<li class="toctree-l2"><a class="reference internal" href="Track_setting_and_evaluation.html#speaker-attributed-asr-main-track">Speaker-Attributed ASR (Main Track)</a></li>
<li class="toctree-l2"><a class="reference internal" href="Track_setting_and_evaluation.html#speaker-attributed-asr">Speaker-Attributed ASR</a></li>
<li class="toctree-l2"><a class="reference internal" href="Track_setting_and_evaluation.html#evaluation-metric">Evaluation metric</a></li>
<li class="toctree-l2"><a class="reference internal" href="Track_setting_and_evaluation.html#sub-track-arrangement">Sub-track arrangement</a></li>
</ul>
@ -128,7 +128,6 @@
<ul class="simple">
<li><p>Data augmentation is allowed on the original training dataset, including, but not limited to, adding noise or reverberation, speed perturbation and tone change.</p></li>
<li><p>Participants are permitted to use the Eval set for model training, but it is not allowed to use the Test set for this purpose. Instead, the Test set should only be utilized for parameter tuning and model selection. Any use of the Test-2023 dataset that violates these rules is strictly prohibited, including but not limited to the use of the Test set for fine-tuning or training the model.</p></li>
<li><p>Multi-system fusion is allowed, but combining systems with the same structure and different parameters is not encouraged.</p></li>
<li><p>If two systems have the same cpCER on the Test dataset, the system with the lower computational complexity will be judged the superior one.</p></li>
<li><p>If the forced alignment is used to obtain the frame-level classification label, the forced alignment model must be trained on the basis of the data allowed by the corresponding sub-track.</p></li>
<li><p>Shallow fusion is allowed to the end-to-end approaches, e.g., LAS, RNNT and Transformer, but the training data of the shallow fusion language model can only come from the transcripts of the allowed training dataset.</p></li>

View File

@ -89,7 +89,7 @@
</ul>
</li>
<li class="toctree-l1 current"><a class="current reference internal" href="#">Track &amp; Evaluation</a><ul>
<li class="toctree-l2"><a class="reference internal" href="#speaker-attributed-asr-main-track">Speaker-Attributed ASR (Main Track)</a></li>
<li class="toctree-l2"><a class="reference internal" href="#speaker-attributed-asr">Speaker-Attributed ASR</a></li>
<li class="toctree-l2"><a class="reference internal" href="#evaluation-metric">Evaluation metric</a></li>
<li class="toctree-l2"><a class="reference internal" href="#sub-track-arrangement">Sub-track arrangement</a></li>
</ul>
@ -125,9 +125,9 @@
<section id="track-evaluation">
<h1>Track &amp; Evaluation<a class="headerlink" href="#track-evaluation" title="Permalink to this heading"></a></h1>
<section id="speaker-attributed-asr-main-track">
<h2>Speaker-Attributed ASR (Main Track)<a class="headerlink" href="#speaker-attributed-asr-main-track" title="Permalink to this heading"></a></h2>
<p>The speaker-attributed ASR task poses a unique challenge of transcribing speech from multiple speakers and assigning a speaker label to the transcription. Figure 2 illustrates the difference between the speaker-attributed ASR task and the multi-speaker ASR task. This track allows for the use of the AliMeeting, Aishell4, and Cn-Celeb datasets as constrained data sources during both training and evaluation. The AliMeeting dataset, which was used in the M2MeT challenge, includes Train, Eval, and Test sets. Additionally, a new Test-2023 set, consisting of approximately 10 hours of meeting data recorded in an identical acoustic setting as the AliMeeting corpus, will be released soon for challenge scoring and ranking. Its worth noting that the organizers will not provide the near-field audio, transcriptions, or oracle timestamps. Instead, segments containing multiple speakers will be provided on the Test-2023 set, which can be obtained using a simple voice activity detection (VAD) model.</p>
<section id="speaker-attributed-asr">
<h2>Speaker-Attributed ASR<a class="headerlink" href="#speaker-attributed-asr" title="Permalink to this heading"></a></h2>
<p>The speaker-attributed ASR task poses a unique challenge of transcribing speech from multiple speakers and assigning a speaker label to the transcription. Figure 2 illustrates the difference between the speaker-attributed ASR task and the multi-speaker ASR task. This track allows for the use of the AliMeeting, Aishell4, and Cn-Celeb datasets as constrained data sources during both training and evaluation. The AliMeeting dataset, which was used in the M2MeT challenge, includes Train, Eval, and Test sets. Additionally, a new Test-2023 set, consisting of approximately 10 hours of meeting data recorded in an identical acoustic setting as the AliMeeting corpus, will be released soon for challenge scoring and ranking. It's worth noting that the organizers will not provide the near-field audio, transcriptions, or oracle timestamps of the Test-2023 set. Instead, segments containing multiple speakers will be provided, which can be obtained using a simple voice activity detection (VAD) model.</p>
<p><img alt="task difference" src="_images/task_diff.png" /></p>
</section>
<section id="evaluation-metric">

View File

@ -9,4 +9,5 @@ We will release an E2E SA-ASR~\cite{kanda21b_interspeech} baseline conducted on
## Baseline results
The results of the baseline system are shown in Table 3. The speaker profile adopts the oracle speaker embeddings during training. However, due to the lack of oracle speaker labels during evaluation, the speaker profile provided by an additional spectral clustering is used. Meanwhile, the results of using the oracle speaker profile on the Eval and Test sets are also provided to show the impact of speaker profile accuracy.
![baseline result](images/baseline_result.png)
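
Since the baseline replaces oracle speaker labels with profiles obtained from spectral clustering at evaluation time, a small sketch may clarify that step: per-segment speaker embeddings are clustered and each cluster is averaged into one profile vector. The use of scikit-learn's SpectralClustering and a cosine affinity here is an assumption for illustration and may differ from the baseline's actual implementation.

```python
# Hedged sketch of building speaker profiles without oracle labels.
import numpy as np
from sklearn.cluster import SpectralClustering

def build_speaker_profiles(embeddings: np.ndarray, n_speakers: int) -> np.ndarray:
    """embeddings: (num_segments, dim) L2-normalized speaker embeddings.
    Returns an (n_speakers, dim) matrix of profile vectors."""
    # Map cosine similarity into [0, 1] so it can serve as a precomputed affinity.
    affinity = (embeddings @ embeddings.T + 1.0) / 2.0
    labels = SpectralClustering(n_clusters=n_speakers, affinity="precomputed",
                                assign_labels="kmeans", random_state=0).fit_predict(affinity)
    # Average each cluster's embeddings into a single profile vector.
    profiles = np.stack([embeddings[labels == k].mean(axis=0) for k in range(n_speakers)])
    return profiles / np.linalg.norm(profiles, axis=1, keepdims=True)

# Toy usage: 10 random 192-dimensional embeddings reduced to 2 profiles.
rng = np.random.default_rng(0)
emb = rng.normal(size=(10, 192))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
print(build_speaker_profiles(emb, n_speakers=2).shape)  # (2, 192)
```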

View File

@ -2,7 +2,7 @@
## Overview of training data
In the fixed training condition, the training dataset is restricted to three publicly available corpora, namely, AliMeeting, AISHELL-4, and CN-Celeb. To evaluate the performance of the models trained on these datasets, we will release a new Test set called Test-2023 for scoring and ranking. We will describe the AliMeeting dataset and the Test-2023 set in detail.
## Detail of AliMeeting corpus
AliMeeting contains 118.75 hours of speech data in total. The dataset is divided into 104.75 hours for training (Train), 4 hours for evaluation (Eval) and 10 hours as test set (Test) for scoring and ranking. Specifically, the Train and Eval sets contain 212 and 8 sessions, respectively. Each session consists of a 15 to 30-minute discussion by a group of participants. The total number of participants in Train and Eval sets is 456 and 25, respectively, with balanced gender coverage.
AliMeeting contains 118.75 hours of speech data in total. The dataset is divided into 104.75 hours for training (Train), 4 hours for evaluation (Eval) and 10 hours as test set (Test) for scoring and ranking. Specifically, the Train, Eval and Test sets contain 212, 8 and 20 sessions, respectively. Each session consists of a 15 to 30-minute discussion by a group of participants. The total number of participants in Train, Eval and Test sets is 456, 25 and 60, respectively, with balanced gender coverage.
The dataset is collected in 13 meeting venues, which are categorized into three types: small, medium, and large rooms with sizes ranging from 8 m$^{2}$ to 55 m$^{2}$. Different rooms give us a variety of acoustic properties and layouts. The detailed parameters of each meeting venue will be released together with the Train data. The type of wall material of the meeting venues covers cement, glass, etc. Other furnishings in meeting venues include sofa, TV, blackboard, fan, air conditioner, plants, etc. During recording, the participants of the meeting sit around the microphone array which is placed on the table and conduct a natural conversation. The microphone-speaker distance ranges from 0.3 m to 5.0 m. All participants are native Chinese speakers speaking Mandarin without strong accents. During the meeting, various kinds of indoor noise including but not limited to clicking, keyboard, door opening/closing, fan, bubble noise, etc., are made naturally. For both Train and Eval sets, the participants are required to remain in the same position during recording. There is no speaker overlap between the Train and Eval set. An example of the recording venue from the Train set is shown in Fig 1.

View File

@ -0,0 +1,28 @@
# Introduction
## Call for participation
Automatic speech recognition (ASR) and speaker diarization have made significant strides in recent years, resulting in a surge of speech technology applications across various domains. However, meetings present unique challenges to speech technologies due to their complex acoustic conditions and diverse speaking styles, including overlapping speech, variable numbers of speakers, far-field signals in large conference rooms, and environmental noise and reverberation.
Over the years, several challenges have been organized to advance the development of meeting transcription, including the Rich Transcription evaluation and Computational Hearing in Multisource Environments (CHIME) challenges. The latest iteration of the CHIME challenge has a particular focus on distant automatic speech recognition and developing systems that can generalize across various array topologies and application scenarios. However, while progress has been made in English meeting transcription, language differences remain a significant barrier to achieving comparable results in non-English languages, such as Mandarin. The Multimodal Information Based Speech Processing (MISP) and Multi-Channel Multi-Party Meeting Transcription (M2MeT) challenges have been instrumental in advancing Mandarin meeting transcription. The MISP challenge seeks to address the problem of audio-visual distant multi-microphone signal processing in everyday home environments, while the M2MeT challenge focuses on tackling the speech overlap issue in offline meeting rooms.
The ICASSP2022 M2MeT challenge focuses on meeting scenarios, and it comprises two main tasks: speaker diarization and multi-speaker automatic speech recognition. The former involves identifying who spoke when in the meeting, while the latter aims to transcribe speech from multiple speakers simultaneously, which poses significant technical difficulties due to overlapping speech and acoustic interferences.
Building on the success of the previous M2MeT challenge, we are excited to propose the M2MeT2.0 challenge as an ASRU2023 challenge special session. In the original M2MeT challenge, the evaluation metric was speaker-independent, which meant that the transcription could be determined, but not the corresponding speaker. To address this limitation and further advance current multi-talker ASR systems towards practicality, the M2MeT2.0 challenge proposes the speaker-attributed ASR task with two sub-tracks: fixed and open training conditions. The speaker-attributed automatic speech recognition (ASR) task aims to tackle the practical and challenging problem of identifying "who spoke what and when". To facilitate reproducible research in this field, we offer a comprehensive overview of the dataset, rules, evaluation metrics, and baseline systems. Furthermore, we will release a carefully curated test set, comprising approximately 10 hours of audio, according to the timeline. The new test set is designed to enable researchers to validate and compare their models' performance and advance the state of the art in this area.
## Timeline (AOE Time)
- $ April~29, 2023: $ Challenge and registration open.
- $ May~8, 2023: $ Baseline release.
- $ May~15, 2023: $ Registration deadline, the due date for participants to join the Challenge.
- $ June~9, 2023: $ Test data release and leaderboard open.
- $ June~13, 2023: $ Final submission deadline.
- $ June~19, 2023: $ Evaluation result and ranking release.
- $ July~3, 2023: $ Deadline for paper submission.
- $ July~10, 2023: $ Deadline for final paper submission.
- $ December~12\ to\ 16, 2023: $ ASRU Workshop and challenge session
## Guidelines
Interested participants, whether from academia or industry, must register for the challenge by completing the Google form below. The deadline for registration is May 15, 2023.
[M2MET2.0 Registration](https://docs.google.com/forms/d/e/1FAIpQLSf77T9vAl7Ym-u5g8gXu18SBofoWRaFShBo26Ym0-HDxHW9PQ/viewform?usp=sf_link)
Within three working days, the challenge organizer will send email invitations to eligible teams to participate in the challenge. All qualified teams are required to adhere to the challenge rules, which will be published on the challenge page. Prior to the ranking release time, each participant must submit a system description document detailing their approach and methods. The organizer will select the top three submissions to be included in the ASRU2023 Proceedings.

View File

@ -0,0 +1,48 @@
# Organizers
***Lei Xie, Professor, Northwestern Polytechnical University, China***
Email: [lxie@nwpu.edu.cn](mailto:lxie@nwpu.edu.cn)
<img src="images/lxie.jpeg" alt="lxie" width="20%">
***Kong Aik Lee, Senior Scientist at Institute for Infocomm Research, A\*Star, Singapore***
Email: [kongaik.lee@ieee.org](mailto:kongaik.lee@ieee.org)
<img src="images/kong.png" alt="kong" width="20%">
***Zhijie Yan, Principal Engineer at Alibaba, China***
Email: [zhijie.yzj@alibaba-inc.com](mailto:zhijie.yzj@alibaba-inc.com)
<img src="images/zhijie.jpg" alt="zhijie" width="20%">
***Shiliang Zhang, Senior Engineer at Alibaba, China***
Email: [sly.zsl@alibaba-inc.com](mailto:sly.zsl@alibaba-inc.com)
<img src="images/zsl.JPG" alt="zsl" width="20%">
***Yanmin Qian, Professor, Shanghai Jiao Tong University, China***
Email: [yanminqian@sjtu.edu.cn](mailto:yanminqian@sjtu.edu.cn)
<img src="images/qian.jpeg" alt="qian" width="20%">
***Zhuo Chen, Applied Scientist in Microsoft, USA***
Email: [zhuc@microsoft.com](mailto:zhuc@microsoft.com)
<img src="images/chenzhuo.jpg" alt="chenzhuo" width="20%">
***Jian Wu, Applied Scientist in Microsoft, USA***
Email: [wujian@microsoft.com](mailto:wujian@microsoft.com)
<img src="images/wujian.jpg" alt="wujian" width="20%">
***Hui Bu, CEO, AISHELL foundation, China***
Email: [buhui@aishelldata.com](mailto:buhui@aishelldata.com)
<img src="images/buhui.jpeg" alt="buhui" width="20%">

View File

@ -4,8 +4,6 @@ All participants should adhere to the following rules to be eligible for the cha
- Data augmentation is allowed on the original training dataset, including, but not limited to, adding noise or reverberation, speed perturbation and tone change.
- Participants are permitted to use the Eval set for model training, but it is not allowed to use the Test set for this purpose. Instead, the Test set should only be utilized for parameter tuning and model selection. Any use of the Test-2023 dataset that violates these rules is strictly prohibited, including but not limited to the use of the Test set for fine-tuning or training the model.
- Multi-system fusion is allowed, but combining systems with the same structure and different parameters is not encouraged.
- If two systems have the same cpCER on the Test dataset, the system with the lower computational complexity will be judged the superior one.

View File

@ -1,6 +1,6 @@
# Track & Evaluation
## Speaker-Attributed ASR (Main Track)
The speaker-attributed ASR task poses a unique challenge of transcribing speech from multiple speakers and assigning a speaker label to the transcription. Figure 2 illustrates the difference between the speaker-attributed ASR task and the multi-speaker ASR task. This track allows for the use of the AliMeeting, Aishell4, and Cn-Celeb datasets as constrained data sources during both training and evaluation. The AliMeeting dataset, which was used in the M2MeT challenge, includes Train, Eval, and Test sets. Additionally, a new Test-2023 set, consisting of approximately 10 hours of meeting data recorded in an identical acoustic setting as the AliMeeting corpus, will be released soon for challenge scoring and ranking. It's worth noting that the organizers will not provide the near-field audio, transcriptions, or oracle timestamps. Instead, segments containing multiple speakers will be provided on the Test-2023 set, which can be obtained using a simple voice activity detection (VAD) model.
## Speaker-Attributed ASR
The speaker-attributed ASR task poses a unique challenge of transcribing speech from multiple speakers and assigning a speaker label to the transcription. Figure 2 illustrates the difference between the speaker-attributed ASR task and the multi-speaker ASR task. This track allows for the use of the AliMeeting, Aishell4, and Cn-Celeb datasets as constrained data sources during both training and evaluation. The AliMeeting dataset, which was used in the M2MeT challenge, includes Train, Eval, and Test sets. Additionally, a new Test-2023 set, consisting of approximately 10 hours of meeting data recorded in an identical acoustic setting as the AliMeeting corpus, will be released soon for challenge scoring and ranking. It's worth noting that the organizers will not provide the near-field audio, transcriptions, or oracle timestamps of the Test-2023 set. Instead, segments containing multiple speakers will be provided, which can be obtained using a simple voice activity detection (VAD) model.
![task difference](images/task_diff.png)

View File

@ -20,10 +20,3 @@ To facilitate reproducible research, we provide a comprehensive overview of the
./Rules
./Organizers
./Contact
Indices and tables
==================
* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`


Some files were not shown because too many files have changed in this diff.