diff --git a/README.md b/README.md
deleted file mode 100644
index 665f42592..000000000
--- a/README.md
+++ /dev/null
@@ -1,118 +0,0 @@
-[//]: # (
-)
-
-# FunASR: A Fundamental End-to-End Speech Recognition Toolkit
-
-
-
-FunASR hopes to build a bridge between academic research and industrial applications of speech recognition. By supporting the training and finetuning of the industrial-grade speech recognition models released on [ModelScope](https://www.modelscope.cn/models?page=1&tasks=auto-speech-recognition), researchers and developers can conduct research on and production of speech recognition models more conveniently, and promote the development of the speech recognition ecosystem. ASR for Fun!
-
-[**News**](https://github.com/alibaba-damo-academy/FunASR#whats-new)
-| [**Highlights**](#highlights)
-| [**Installation**](#installation)
-| [**Docs**](https://alibaba-damo-academy.github.io/FunASR/en/index.html)
-| [**Tutorial**](https://github.com/alibaba-damo-academy/FunASR/wiki#funasr%E7%94%A8%E6%88%B7%E6%89%8B%E5%86%8C)
-| [**Papers**](https://github.com/alibaba-damo-academy/FunASR#citations)
-| [**Runtime**](https://github.com/alibaba-damo-academy/FunASR/tree/main/funasr/runtime)
-| [**Model Zoo**](https://github.com/alibaba-damo-academy/FunASR/blob/main/docs/modelscope_models.md)
-| [**Contact**](#contact)
-| [**M2MET2.0 Challenge**](https://github.com/alibaba-damo-academy/FunASR#multi-channel-multi-party-meeting-transcription-20-m2met20-challenge)
-
-## What's new
-### Multi-Channel Multi-Party Meeting Transcription 2.0 (M2MET2.0) Challenge
-We are pleased to announce that the M2MeT2.0 challenge will be held in the near future. The baseline system is built on FunASR and is provided as a recipe for the AliMeeting corpus. For more details, please see the M2MET2.0 guidelines ([CN](https://alibaba-damo-academy.github.io/FunASR/m2met2_cn/index.html)/[EN](https://alibaba-damo-academy.github.io/FunASR/m2met2/index.html)).
-### Release notes
-For the release notes, please refer to [news](https://github.com/alibaba-damo-academy/FunASR/releases).
-
-## Highlights
-- FunASR supports speech recognition (ASR), multi-talker ASR, voice activity detection (VAD), punctuation restoration, language models, speaker verification and speaker diarization.
-- We have released a large number of academic and industrial pretrained models on [ModelScope](https://www.modelscope.cn/models?page=1&tasks=auto-speech-recognition).
-- The pretrained model [Paraformer-large](https://www.modelscope.cn/models/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary) obtains the best performance on many tasks of the [SpeechIO leaderboard](https://github.com/SpeechColab/Leaderboard).
-- FunASR supplies an easy-to-use pipeline to finetune pretrained models from [ModelScope](https://www.modelscope.cn/models?page=1&tasks=auto-speech-recognition).
-- Compared to the [ESPnet](https://github.com/espnet/espnet) framework, training on large-scale datasets is much faster in FunASR owing to its optimized dataloader.
-
-## Installation
-
-Install from pip:
-```shell
-pip install -U funasr
-# For users in China, you can install with the command:
-# pip install -U funasr -i https://mirror.sjtu.edu.cn/pypi/web/simple
-```
-
-Or install from source code:
-```sh
-git clone https://github.com/alibaba/FunASR.git && cd FunASR
-pip install -e ./
-# For users in China, you can install with the command:
-# pip install -e ./ -i https://mirror.sjtu.edu.cn/pypi/web/simple
-```
-
-If you want to use the pretrained models from ModelScope, you should also install modelscope:
-```shell
-pip install -U modelscope
-# For users in China, you can install with the command:
-# pip install -U modelscope -f https://modelscope.oss-cn-beijing.aliyuncs.com/releases/repo.html -i https://mirror.sjtu.edu.cn/pypi/web/simple
-```
-
-For more details, please refer to [installation](https://alibaba-damo-academy.github.io/FunASR/en/installation.html).
-
-[//]: # ()
-[//]: # (## Usage)
-
-[//]: # (For users who are new to FunASR and ModelScope, please refer to FunASR Docs([CN](https://alibaba-damo-academy.github.io/FunASR/cn/index.html) / [EN](https://alibaba-damo-academy.github.io/FunASR/en/index.html)))
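After installing funasr and modelscope, a quick smoke test is to run an inference pipeline end to end. The following minimal sketch mirrors the `demo.py` touched later in this patch (the model, VAD and punctuation identifiers are taken from that file); `asr_example.wav` is only a placeholder path, not a file shipped with the patch:

```python
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

# Chain VAD, Paraformer ASR and punctuation restoration in one pipeline.
inference_pipeline = pipeline(
    task=Tasks.auto_speech_recognition,
    model='damo/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch',
    vad_model='damo/speech_fsmn_vad_zh-cn-16k-common-pytorch',
    punc_model='damo/punc_ct-transformer_zh-cn-common-vocab272727-pytorch',
)

# Decode a wav file (placeholder path) and print the recognized text.
rec_result = inference_pipeline(audio_in='asr_example.wav')
print(rec_result)
```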
-## Contact
-
-If you have any questions about FunASR, please contact us by
-
-- email: [funasr@list.alibaba-inc.com](funasr@list.alibaba-inc.com)
-
-|Dingding group | Wechat group |
-|:---:|:---:|
-| | |
-
-## Contributors
-
-| | | | | |
-|:---:|:---:|:---:|:---:|:---:|
-
-## Acknowledgements
-
-1. We borrowed a lot of code from [Kaldi](http://kaldi-asr.org/) for data preparation.
-2. We borrowed a lot of code from [ESPnet](https://github.com/espnet/espnet). FunASR follows the training and finetuning pipelines of ESPnet.
-3. We referred to [Wenet](https://github.com/wenet-e2e/wenet) when building the dataloader for large-scale data training.
-4. We acknowledge [ChinaTelecom](https://github.com/zhuzizyf/damo-fsmn-vad-infer-httpserver) for contributing the VAD runtime.
-5. We acknowledge [RapidAI](https://github.com/RapidAI) for contributing the Paraformer and CT_Transformer-punc runtime.
-6. We acknowledge [DeepScience](https://www.deepscience.cn) for contributing the grpc service.
-
-## License
-This project is licensed under the [MIT License](https://opensource.org/licenses/MIT). FunASR also contains various third-party components and some code modified from other repos under other open-source licenses.
-
-## Citations
-
-``` bibtex
-@inproceedings{gao2022paraformer,
-  title={Paraformer: Fast and Accurate Parallel Transformer for Non-autoregressive End-to-End Speech Recognition},
-  author={Gao, Zhifu and Zhang, Shiliang and McLoughlin, Ian and Yan, Zhijie},
-  booktitle={INTERSPEECH},
-  year={2022}
-}
-@inproceedings{gao2020universal,
-  title={Universal ASR: Unifying Streaming and Non-Streaming ASR Using a Single Encoder-Decoder Model},
-  author={Gao, Zhifu and Zhang, Shiliang and Lei, Ming and McLoughlin, Ian},
-  booktitle={arXiv preprint arXiv:2010.14099},
-  year={2020}
-}
-@inproceedings{Shi2023AchievingTP,
-  title={Achieving Timestamp Prediction While Recognizing with Non-Autoregressive End-to-End ASR Model},
-  author={Xian Shi and Yanni Chen and Shiliang Zhang and Zhijie Yan},
-  booktitle={arXiv preprint arXiv:2301.12343},
-  year={2023}
-}
-```
diff --git a/egs_modelscope/asr/TEMPLATE/README.md b/egs_modelscope/asr/TEMPLATE/README.md
index 54af50fa1..28a31a200 100644
--- a/egs_modelscope/asr/TEMPLATE/README.md
+++ b/egs_modelscope/asr/TEMPLATE/README.md
@@ -76,15 +76,15 @@ rec_result = inference_pipeline(audio_in='https://isv-data.oss-cn-hangzhou.aliyu
 print(rec_result)
 ```
 
-#### API-reference
-##### Define pipeline
+### API-reference
+#### Define pipeline
 - `task`: `Tasks.auto_speech_recognition`
 - `model`: model name in [model zoo](https://alibaba-damo-academy.github.io/FunASR/en/modelscope_models.html#pretrained-models-on-modelscope), or model path in local disk
 - `ngpu`: `1` (Default), decoding on GPU. If ngpu=0, decoding on CPU
 - `ncpu`: `1` (Default), sets the number of threads used for intraop parallelism on CPU
 - `output_dir`: `None` (Default), the output path of results if set
 - `batch_size`: `1` (Default), batch size when decoding
-##### Infer pipeline
+#### Infer pipeline
 - `audio_in`: the input to decode, which could be:
   - wav_path, `e.g.`: asr_example.wav,
   - pcm_path, `e.g.`: asr_example.pcm,
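To make the define/infer parameters in the hunk above concrete, here is a minimal sketch, assuming the Paraformer-large model named earlier in this patch; the wav path and the `./results` output directory are placeholders:

```python
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

# Define pipeline: decode on GPU (ngpu=1) and write results to ./results.
inference_pipeline = pipeline(
    task=Tasks.auto_speech_recognition,
    model='damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch',
    ngpu=1,
    output_dir='./results',
    batch_size=1,
)

# Infer pipeline: audio_in may be a wav path, pcm path, url, or wav.scp list.
rec_result = inference_pipeline(audio_in='asr_example.wav')
print(rec_result)
```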
diff --git a/egs_modelscope/asr_vad_punc/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch/demo.py b/egs_modelscope/asr_vad_punc/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch/demo.py
index c533e6728..2fce734ed 100644
--- a/egs_modelscope/asr_vad_punc/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch/demo.py
+++ b/egs_modelscope/asr_vad_punc/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch/demo.py
@@ -9,6 +9,7 @@ if __name__ == '__main__':
         model='damo/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch',
         vad_model='damo/speech_fsmn_vad_zh-cn-16k-common-pytorch',
         punc_model='damo/punc_ct-transformer_zh-cn-common-vocab272727-pytorch',
+        output_dir=output_dir
     )
     rec_result = inference_pipeline(audio_in=audio_in)
     print(rec_result)
diff --git a/egs_modelscope/punctuation/TEMPLATE/README.md b/egs_modelscope/punctuation/TEMPLATE/README.md
index 19600d3a3..3eaf68a11 100644
--- a/egs_modelscope/punctuation/TEMPLATE/README.md
+++ b/egs_modelscope/punctuation/TEMPLATE/README.md
@@ -52,15 +52,15 @@ print(rec_result_all)
 
 Full code of demo, please ref to [demo](https://github.com/alibaba-damo-academy/FunASR/discussions/238)
 
-#### API-reference
-##### Define pipeline
+### API-reference
+#### Define pipeline
 - `task`: `Tasks.punctuation`
 - `model`: model name in [model zoo](https://alibaba-damo-academy.github.io/FunASR/en/modelscope_models.html#pretrained-models-on-modelscope), or model path in local disk
 - `ngpu`: `1` (Default), decoding on GPU. If ngpu=0, decoding on CPU
 - `output_dir`: `None` (Default), the output path of results if set
 - `model_revision`: `None` (Default), setting the model version
-##### Infer pipeline
+#### Infer pipeline
 - `text_in`: the input to decode, which could be:
   - text bytes, `e.g.`: "我们都是木头人不会讲话不会动"
   - text file, `e.g.`: example/punc_example.txt
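To make the punctuation parameters concrete, here is a minimal sketch, assuming the CT-Transformer punctuation model already referenced in `demo.py` above; the input sentence is the example listed in the hunk, and `text_in` could equally be a path such as example/punc_example.txt:

```python
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

# Define the punctuation-restoration pipeline (ngpu=0 decodes on CPU).
inference_pipeline = pipeline(
    task=Tasks.punctuation,
    model='damo/punc_ct-transformer_zh-cn-common-vocab272727-pytorch',
    ngpu=0,
)

# Restore punctuation for a raw, unpunctuated text string.
rec_result = inference_pipeline(text_in='我们都是木头人不会讲话不会动')
print(rec_result)
```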
diff --git a/egs_modelscope/speaker_diarization/TEMPLATE/README.md b/egs_modelscope/speaker_diarization/TEMPLATE/README.md
index 2cd702cd8..99c9b593c 100644
--- a/egs_modelscope/speaker_diarization/TEMPLATE/README.md
+++ b/egs_modelscope/speaker_diarization/TEMPLATE/README.md
@@ -37,8 +37,8 @@ results = inference_diar_pipline(audio_in=audio_list)
 print(results)
 ```
 
-#### API-reference
-##### Define pipeline
+### API-reference
+#### Define pipeline
 - `task`: `Tasks.speaker_diarization`
 - `model`: model name in [model zoo](https://alibaba-damo-academy.github.io/FunASR/en/modelscope_models.html#pretrained-models-on-modelscope), or model path in local disk
 - `ngpu`: `1` (Default), decoding on GPU. If ngpu=0, decoding on CPU
@@ -50,7 +50,7 @@ print(results)
   - vad format: spk1: [1.0, 3.0], [5.0, 8.0]
   - rttm format: "SPEAKER test1 0 1.00 2.00 spk1 " and "SPEAKER test1 0 5.00 3.00 spk1 "
 
-##### Infer pipeline for speaker embedding extraction
+#### Infer pipeline for speaker embedding extraction
 - `audio_in`: the input to process, which could be:
   - list of url: `e.g.`: waveform files at a website
   - list of local file path: `e.g.`: path/to/a.wav
diff --git a/egs_modelscope/speaker_verification/TEMPLATE/README.md b/egs_modelscope/speaker_verification/TEMPLATE/README.md
index 957da9065..f7b64ce4b 100644
--- a/egs_modelscope/speaker_verification/TEMPLATE/README.md
+++ b/egs_modelscope/speaker_verification/TEMPLATE/README.md
@@ -47,8 +47,8 @@ speaker_embedding = rec_result["spk_embedding"]
 ```
 
 Full code of demo, please ref to [infer.py](https://github.com/alibaba-damo-academy/FunASR/blob/main/egs_modelscope/speaker_verification/speech_xvector_sv-zh-cn-cnceleb-16k-spk3465-pytorch/infer.py).
-#### API-reference
-##### Define pipeline
+### API-reference
+#### Define pipeline
 - `task`: `Tasks.speaker_verification`
 - `model`: model name in [model zoo](https://alibaba-damo-academy.github.io/FunASR/en/modelscope_models.html#pretrained-models-on-modelscope), or model path in local disk
 - `ngpu`: `1` (Default), decoding on GPU. If ngpu=0, decoding on CPU
@@ -57,7 +57,7 @@ Full code of demo, please ref to [infer.py](https://github.com/alibaba-damo-acad
 - `sv_threshold`: `0.9465` (Default), the similarity threshold to determine
 whether utterances belong to the same speaker (it should be in (0, 1))
 
-##### Infer pipeline for speaker embedding extraction
+#### Infer pipeline for speaker embedding extraction
 - `audio_in`: the input to process, which could be:
   - url (str): `e.g.`: https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/sv_example_enroll.wav
   - local_path: `e.g.`: path/to/a.wav
@@ -71,7 +71,7 @@ whether utterances belong to the same speaker (it should be in (0, 1))
   - fbank1.scp,speech,kaldi_ark: `e.g.`: extracted 80-dimensional fbank features with kaldi toolkits.
 
-##### Infer pipeline for speaker verification
+#### Infer pipeline for speaker verification
 - `audio_in`: the input to process, which could be:
   - Tuple(url1, url2): `e.g.`: (https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/sv_example_enroll.wav, https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/sv_example_different.wav)
   - Tuple(local_path1, local_path2): `e.g.`: (path/to/a.wav, path/to/b.wav)
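A hedged sketch of the speaker-verification pipeline described above: the model identifier is inferred from the infer.py path referenced in the hunk (it is not stated explicitly in this patch), and the two URLs are the enroll/different test files the hunk lists:

```python
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

# Define the speaker-verification pipeline with the default similarity threshold.
sv_pipeline = pipeline(
    task=Tasks.speaker_verification,
    model='damo/speech_xvector_sv-zh-cn-cnceleb-16k-spk3465-pytorch',
    sv_threshold=0.9465,
)

# Single input: extract a speaker embedding from one utterance.
enroll_url = 'https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/sv_example_enroll.wav'
rec_result = sv_pipeline(audio_in=enroll_url)
speaker_embedding = rec_result["spk_embedding"]

# Tuple input: decide whether two utterances come from the same speaker.
different_url = 'https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/sv_example_different.wav'
verify_result = sv_pipeline(audio_in=(enroll_url, different_url))
print(verify_result)
```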
diff --git a/egs_modelscope/tp/TEMPLATE/README.md b/egs_modelscope/tp/TEMPLATE/README.md
index 745249f86..d33d4e6d6 100644
--- a/egs_modelscope/tp/TEMPLATE/README.md
+++ b/egs_modelscope/tp/TEMPLATE/README.md
@@ -23,15 +23,15 @@ Timestamp pipeline can also be used after ASR pipeline to compose complete ASR f
 
 
-#### API-reference
-##### Define pipeline
+### API-reference
+#### Define pipeline
 - `task`: `Tasks.speech_timestamp`
 - `model`: model name in [model zoo](https://alibaba-damo-academy.github.io/FunASR/en/modelscope_models.html#pretrained-models-on-modelscope), or model path in local disk
 - `ngpu`: `1` (Default), decoding on GPU. If ngpu=0, decoding on CPU
 - `ncpu`: `1` (Default), sets the number of threads used for intraop parallelism on CPU
 - `output_dir`: `None` (Default), the output path of results if set
 - `batch_size`: `1` (Default), batch size when decoding
-##### Infer pipeline
+#### Infer pipeline
 - `audio_in`: the input speech to predict, which could be:
   - wav_path, `e.g.`: asr_example.wav (wav in local or url),
   - wav.scp, kaldi style wav list (`wav_id wav_path`), `e.g.`:
diff --git a/egs_modelscope/vad/TEMPLATE/README.md b/egs_modelscope/vad/TEMPLATE/README.md
index 0542331a4..9ad9a1ce2 100644
--- a/egs_modelscope/vad/TEMPLATE/README.md
+++ b/egs_modelscope/vad/TEMPLATE/README.md
@@ -43,15 +43,15 @@ Full code of demo, please ref to [demo](https://github.com/alibaba-damo-academy/
 
 
-#### API-reference
-##### Define pipeline
+### API-reference
+#### Define pipeline
 - `task`: `Tasks.voice_activity_detection`
 - `model`: model name in [model zoo](https://alibaba-damo-academy.github.io/FunASR/en/modelscope_models.html#pretrained-models-on-modelscope), or model path in local disk
 - `ngpu`: `1` (Default), decoding on GPU. If ngpu=0, decoding on CPU
 - `ncpu`: `1` (Default), sets the number of threads used for intraop parallelism on CPU
 - `output_dir`: `None` (Default), the output path of results if set
 - `batch_size`: `1` (Default), batch size when decoding
-##### Infer pipeline
+#### Infer pipeline
 - `audio_in`: the input to decode, which could be:
   - wav_path, `e.g.`: asr_example.wav,
   - pcm_path, `e.g.`: asr_example.pcm,