diff --git a/docs/huggingface_models.md b/docs/huggingface_models.md index 1568dd1e0..ad367dea5 100644 --- a/docs/huggingface_models.md +++ b/docs/huggingface_models.md @@ -9,11 +9,9 @@ Here we provided several pretrained models on different datasets. The details of ### Speech Recognition Models #### Paraformer Models -[//]: # (| Model Name | Language | Training Data | Vocab Size | Parameter | Offline/Online | Notes |) - -[//]: # (|:--------------------------------------------------------------------------------------------------------------------------------------------------:|:--------:|:--------------------------------:|:----------:|:---------:|:--------------:|:--------------------------------------------------------------------------------------------------------------------------------|) - -[//]: # (| [Paraformer-large](https://www.modelscope.cn/models/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary) | CN & EN | Alibaba Speech Data (60000hours) | 8404 | 220M | Offline | Duration of input wav <= 20s |) +| Model Name | Language | Training Data | Vocab Size | Parameter | Offline/Online | Notes | +|:-----------------------------------------------------------------------:|:--------:|:----------------------------------:|:----------:|:---------:|:--------------:|:--------------------------------------------------------------------------------------------------------------------------------| +| [Paraformer-large](https://huggingface.co/funasr/paraformer-large) | CN & EN | Alibaba Speech Data (60000hours) | 8404 | 220M | Offline | Duration of input wav <= 20s | [//]: # (| [Paraformer-large-long](https://www.modelscope.cn/models/damo/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary) | CN & EN | Alibaba Speech Data (60000hours) | 8404 | 220M | Offline | Which ould deal with arbitrary length input wav |) @@ -77,21 +75,17 @@ Here we provided several pretrained models on different datasets. 
The details of ### Voice Activity Detection Models -[//]: # (| Model Name | Training Data | Parameters | Sampling Rate | Notes |) - -[//]: # (|:----------------------------------------------------------------------------------------------:|:----------------------------:|:----------:|:-------------:|:------|) - -[//]: # (| [FSMN-VAD](https://modelscope.cn/models/damo/speech_fsmn_vad_zh-cn-16k-common-pytorch/summary) | Alibaba Speech Data (5000hours) | 0.4M | 16000 | |) +| Model Name | Training Data | Parameters | Sampling Rate | Notes | +|:----------------------------------------------------:|:----------------------------:|:----------:|:-------------:|:------| +| [FSMN-VAD](https://huggingface.co/funasr/FSMN-VAD) | Alibaba Speech Data (5000hours) | 0.4M | 16000 | | [//]: # (| [FSMN-VAD](https://modelscope.cn/models/damo/speech_fsmn_vad_zh-cn-8k-common/summary) | Alibaba Speech Data (5000hours) | 0.4M | 8000 | |) ### Punctuation Restoration Models -[//]: # (| Model Name | Training Data | Parameters | Vocab Size| Offline/Online | Notes |) - -[//]: # (|:--------------------------------------------------------------------------------------------------------------------------:|:----------------------------:|:----------:|:----------:|:--------------:|:------|) - -[//]: # (| [CT-Transformer](https://modelscope.cn/models/damo/punc_ct-transformer_zh-cn-common-vocab272727-pytorch/summary) | Alibaba Text Data | 70M | 272727 | Offline | offline punctuation model |) +| Model Name | Training Data | Parameters | Vocab Size| Offline/Online | Notes | +|:--------------------------------------------------------------------:|:----------------------------:|:----------:|:----------:|:--------------:|:------| +| [CT-Transformer](https://huggingface.co/funasr/CT-Transformer-punc) | Alibaba Text Data | 70M | 272727 | Offline | offline punctuation model | [//]: # (| [CT-Transformer](https://modelscope.cn/models/damo/punc_ct-transformer_zh-cn-common-vad_realtime-vocab272727/summary) | Alibaba 
Text Data | 70M | 272727 | Online | online punctuation model |) diff --git a/docs/index.rst b/docs/index.rst index f7afe809e..73f57fd1f 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -31,12 +31,7 @@ Overview ./academic_recipe/sd_recipe.md -.. toctree:: - :maxdepth: 1 - :caption: Model Zoo - ./modelscope_models.md - ./huggingface_models.md .. toctree:: :maxdepth: 1 @@ -56,11 +51,13 @@ Overview Undo + .. toctree:: :maxdepth: 1 - :caption: Funasr Library + :caption: Model Zoo - ./build_task.md + ./modelscope_models.md + ./huggingface_models.md .. toctree:: :maxdepth: 1 @@ -82,6 +79,13 @@ Overview ./benchmark/benchmark_onnx_cpp.md ./benchmark/benchmark_libtorch.md + +.. toctree:: + :maxdepth: 1 + :caption: Funasr Library + + ./build_task.md + .. toctree:: :maxdepth: 1 :caption: Papers diff --git a/docs/modelscope_models.md b/docs/modelscope_models.md index 3538ae0d3..b000fcaea 100644 --- a/docs/modelscope_models.md +++ b/docs/modelscope_models.md @@ -13,7 +13,7 @@ Here we provided several pretrained models on different datasets. 
The details of |:--------------------------------------------------------------------------------------------------------------------------------------------------:|:--------:|:--------------------------------:|:----------:|:---------:|:--------------:|:--------------------------------------------------------------------------------------------------------------------------------| | [Paraformer-large](https://www.modelscope.cn/models/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary) | CN & EN | Alibaba Speech Data (60000hours) | 8404 | 220M | Offline | Duration of input wav <= 20s | | [Paraformer-large-long](https://www.modelscope.cn/models/damo/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary) | CN & EN | Alibaba Speech Data (60000hours) | 8404 | 220M | Offline | Which ould deal with arbitrary length input wav | -| [paraformer-large-contextual](https://www.modelscope.cn/models/damo/speech_paraformer-large-contextual_asr_nat-zh-cn-16k-common-vocab8404/summary) | CN & EN | Alibaba Speech Data (60000hours) | 8404 | 220M | Offline | Which supports the hotword customization based on the incentive enhancement, and improves the recall and precision of hotwords. | +| [Paraformer-large-contextual](https://www.modelscope.cn/models/damo/speech_paraformer-large-contextual_asr_nat-zh-cn-16k-common-vocab8404/summary) | CN & EN | Alibaba Speech Data (60000hours) | 8404 | 220M | Offline | Which supports the hotword customization based on the incentive enhancement, and improves the recall and precision of hotwords. 
| | [Paraformer](https://modelscope.cn/models/damo/speech_paraformer_asr_nat-zh-cn-16k-common-vocab8358-tensorflow1/summary) | CN & EN | Alibaba Speech Data (50000hours) | 8358 | 68M | Offline | Duration of input wav <= 20s | | [Paraformer-online](https://modelscope.cn/models/damo/speech_paraformer_asr_nat-zh-cn-16k-common-vocab8358-tensorflow1/summary) | CN & EN | Alibaba Speech Data (50000hours) | 8404 | 68M | Online | Which could deal with streaming input | | [Paraformer-tiny](https://www.modelscope.cn/models/damo/speech_paraformer-tiny-commandword_asr_nat-zh-cn-16k-vocab544-pytorch/summary) | CN | Alibaba Speech Data (200hours) | 544 | 5.2M | Offline | Lightweight Paraformer model which supports Mandarin command words recognition | diff --git a/egs_modelscope/asr/TEMPLATE/README.md b/egs_modelscope/asr/TEMPLATE/README.md index 19acefeb9..c64503389 100644 --- a/egs_modelscope/asr/TEMPLATE/README.md +++ b/egs_modelscope/asr/TEMPLATE/README.md @@ -1,7 +1,7 @@ # Speech Recognition > **Note**: -> The modelscope pipeline supports all the models in [model zoo](https://alibaba-damo-academy.github.io/FunASR/en/modelscope_models.html#pretrained-models-on-modelscope) to inference and finetine. Here we take typic model as example to demonstrate the usage. +> The modelscope pipeline supports all the models in [model zoo](https://alibaba-damo-academy.github.io/FunASR/en/modelscope_models.html#pretrained-models-on-modelscope) for inference and finetuning. Here we take typical models as examples to demonstrate the usage. ## Inference @@ -62,10 +62,10 @@ Undo ##### Define pipeline - `task`: `Tasks.auto_speech_recognition` - `model`: model name in [model zoo](https://alibaba-damo-academy.github.io/FunASR/en/modelscope_models.html#pretrained-models-on-modelscope), or model path in local disk -- `ngpu`: `1` (Defalut), decoding on GPU. 
If ngpu=0, decoding on CPU -- `ncpu`: `1` (Defalut), sets the number of threads used for intraop parallelism on CPU -- `output_dir`: `None` (Defalut), the output path of results if set -- `batch_size`: `1` (Defalut), batch size when decoding +- `ngpu`: `1` (Default), decoding on GPU. If ngpu=0, decoding on CPU +- `ncpu`: `1` (Default), sets the number of threads used for intraop parallelism on CPU +- `output_dir`: `None` (Default), the output path of results if set +- `batch_size`: `1` (Default), batch size when decoding ##### Infer pipeline - `audio_in`: the input to decode, which could be: - wav_path, `e.g.`: asr_example.wav, @@ -79,7 +79,7 @@ Undo ``` In this case of `wav.scp` input, `output_dir` must be set to save the output results - `audio_fs`: audio sampling rate, only set when audio_in is pcm audio -- `output_dir`: None (Defalut), the output path of results if set +- `output_dir`: None (Default), the output path of results if set ### Inference with multi-thread CPUs or multi GPUs FunASR also offer recipes [infer.sh](https://github.com/alibaba-damo-academy/FunASR/blob/main/egs_modelscope/asr/TEMPLATE/infer.sh) to decode with multi-thread CPUs, or multi GPUs. diff --git a/egs_modelscope/vad/TEMPLATE/README.md b/egs_modelscope/vad/TEMPLATE/README.md index df45b35e7..a4b5e795f 100644 --- a/egs_modelscope/vad/TEMPLATE/README.md +++ b/egs_modelscope/vad/TEMPLATE/README.md @@ -1,7 +1,7 @@ # Voice Activity Detection > **Note**: -> The modelscope pipeline supports all the models in [model zoo](https://alibaba-damo-academy.github.io/FunASR/en/modelscope_models.html#pretrained-models-on-modelscope) to inference and finetine. Here we take model of FSMN-VAD as example to demonstrate the usage. +> The modelscope pipeline supports all the models in [model zoo](https://alibaba-damo-academy.github.io/FunASR/en/modelscope_models.html#pretrained-models-on-modelscope) for inference and finetuning. Here we take the FSMN-VAD model as an example to demonstrate the usage. 
## Inference @@ -47,10 +47,10 @@ Full code of demo, please ref to [demo](https://github.com/alibaba-damo-academy/ ##### Define pipeline - `task`: `Tasks.voice_activity_detection` - `model`: model name in [model zoo](https://alibaba-damo-academy.github.io/FunASR/en/modelscope_models.html#pretrained-models-on-modelscope), or model path in local disk -- `ngpu`: `1` (Defalut), decoding on GPU. If ngpu=0, decoding on CPU -- `ncpu`: `1` (Defalut), sets the number of threads used for intraop parallelism on CPU -- `output_dir`: `None` (Defalut), the output path of results if set -- `batch_size`: `1` (Defalut), batch size when decoding +- `ngpu`: `1` (Default), decoding on GPU. If ngpu=0, decoding on CPU +- `ncpu`: `1` (Default), sets the number of threads used for intraop parallelism on CPU +- `output_dir`: `None` (Default), the output path of results if set +- `batch_size`: `1` (Default), batch size when decoding ##### Infer pipeline - `audio_in`: the input to decode, which could be: - wav_path, `e.g.`: asr_example.wav, @@ -64,7 +64,7 @@ Full code of demo, please ref to [demo](https://github.com/alibaba-damo-academy/ ``` In this case of `wav.scp` input, `output_dir` must be set to save the output results - `audio_fs`: audio sampling rate, only set when audio_in is pcm audio -- `output_dir`: None (Defalut), the output path of results if set +- `output_dir`: None (Default), the output path of results if set ### Inference with multi-thread CPUs or multi GPUs FunASR also offer recipes [infer.sh](https://github.com/alibaba-damo-academy/FunASR/blob/main/egs_modelscope/vad/TEMPLATE/infer.sh) to decode with multi-thread CPUs, or multi GPUs. 
diff --git a/funasr/runtime/python/onnxruntime/README.md b/funasr/runtime/python/onnxruntime/README.md index 1f7fcaa68..ed3deb6d3 100644 --- a/funasr/runtime/python/onnxruntime/README.md +++ b/funasr/runtime/python/onnxruntime/README.md @@ -19,7 +19,7 @@ python -m funasr.export.export_model --model-name damo/speech_paraformer-large_a ``` -## Install the `funasr_onnx` +## Install `funasr_onnx` install from pip ```shell @@ -46,16 +46,22 @@ pip install -e ./ from funasr_onnx import Paraformer model_dir = "./export/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch" -model = Paraformer(model_dir, batch_size=1) +model = Paraformer(model_dir, batch_size=1, quantize=True) wav_path = ['./export/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch/example/asr_example.wav'] result = model(wav_path) print(result) ``` -- Model_dir: the model path, which contains `model.onnx`, `config.yaml`, `am.mvn` -- Input: wav formt file, support formats: `str, np.ndarray, List[str]` -- Output: `List[str]`: recognition result +- `model_dir`: the model path, which contains `model.onnx`, `config.yaml`, `am.mvn` +- `batch_size`: `1` (Default), the batch size during inference +- `device_id`: `-1` (Default), infer on CPU. If you want to infer with GPU, set it to gpu_id (Please make sure that you have installed onnxruntime-gpu) +- `quantize`: `False` (Default), load the model of `model.onnx` in `model_dir`. 
If set to `True`, load the model of `model_quant.onnx` in `model_dir` +- `intra_op_num_threads`: `4` (Default), sets the number of threads used for intraop parallelism on CPU + +Input: wav format file, supported types: `str, np.ndarray, List[str]` + +Output: `List[str]`: recognition result #### Paraformer-online @@ -71,9 +77,16 @@ model = Fsmn_vad(model_dir) result = model(wav_path) print(result) ``` -- Model_dir: the model path, which contains `model.onnx`, `config.yaml`, `am.mvn` -- Input: wav formt file, support formats: `str, np.ndarray, List[str]` -- Output: `List[str]`: recognition result +- `model_dir`: the model path, which contains `model.onnx`, `config.yaml`, `am.mvn` +- `batch_size`: `1` (Default), the batch size during inference +- `device_id`: `-1` (Default), infer on CPU. If you want to infer with GPU, set it to gpu_id (Please make sure that you have installed onnxruntime-gpu) +- `quantize`: `False` (Default), load the model of `model.onnx` in `model_dir`. If set to `True`, load the model of `model_quant.onnx` in `model_dir` +- `intra_op_num_threads`: `4` (Default), sets the number of threads used for intraop parallelism on CPU + +Input: wav format file, supported types: `str, np.ndarray, List[str]` + +Output: `List[str]`: recognition result + #### FSMN-VAD-online ```python @@ -105,9 +118,16 @@ for sample_offset in range(0, speech_length, min(step, speech_length - sample_of if segments_result: print(segments_result) ``` -- Model_dir: the model path, which contains `model.onnx`, `config.yaml`, `am.mvn` -- Input: wav formt file, support formats: `str, np.ndarray, List[str]` -- Output: `List[str]`: recognition result +- `model_dir`: the model path, which contains `model.onnx`, `config.yaml`, `am.mvn` +- `batch_size`: `1` (Default), the batch size during inference +- `device_id`: `-1` (Default), infer on CPU. 
If you want to infer with GPU, set it to gpu_id (Please make sure that you have installed onnxruntime-gpu) +- `quantize`: `False` (Default), load the model of `model.onnx` in `model_dir`. If set to `True`, load the model of `model_quant.onnx` in `model_dir` +- `intra_op_num_threads`: `4` (Default), sets the number of threads used for intraop parallelism on CPU + +Input: wav format file, supported types: `str, np.ndarray, List[str]` + +Output: `List[str]`: recognition result + ### Punctuation Restoration #### CT-Transformer @@ -121,9 +141,15 @@ text_in="跨境河流是养育沿岸人民的生命之源长期以来为帮助 result = model(text_in) print(result[0]) ``` -- Model_dir: the model path, which contains `model.onnx`, `config.yaml`, `am.mvn` -- Input: wav formt file, support formats: `str, np.ndarray, List[str]` -- Output: `List[str]`: recognition result +- `model_dir`: the model path, which contains `model.onnx`, `config.yaml`, `am.mvn` +- `device_id`: `-1` (Default), infer on CPU. If you want to infer with GPU, set it to gpu_id (Please make sure that you have installed onnxruntime-gpu) +- `quantize`: `False` (Default), load the model of `model.onnx` in `model_dir`. If set to `True`, load the model of `model_quant.onnx` in `model_dir` +- `intra_op_num_threads`: `4` (Default), sets the number of threads used for intraop parallelism on CPU + +Input: `str`, raw text of the ASR result + +Output: `List[str]`: recognition result + #### CT-Transformer-online ```python @@ -143,9 +169,14 @@ for vad in vads: print(rec_result_all) ``` -- Model_dir: the model path, which contains `model.onnx`, `config.yaml`, `am.mvn` -- Input: wav formt file, support formats: `str, np.ndarray, List[str]` -- Output: `List[str]`: recognition result +- `model_dir`: the model path, which contains `model.onnx`, `config.yaml`, `am.mvn` +- `device_id`: `-1` (Default), infer on CPU. 
If you want to infer with GPU, set it to gpu_id (Please make sure that you have installed onnxruntime-gpu) +- `quantize`: `False` (Default), load the model of `model.onnx` in `model_dir`. If set to `True`, load the model of `model_quant.onnx` in `model_dir` +- `intra_op_num_threads`: `4` (Default), sets the number of threads used for intraop parallelism on CPU + +Input: `str`, raw text of the ASR result + +Output: `List[str]`: recognition result ## Performance benchmark