mirror of https://github.com/modelscope/FunASR
synced 2025-09-15 14:48:36 +08:00
commit 5b2b979634: Merge branch 'alibaba-damo-academy:main' into main
@ -9,11 +9,9 @@ Here we provided several pretrained models on different datasets. The details of
### Speech Recognition Models

#### Paraformer Models

[//]: # (| Model Name | Language | Training Data | Vocab Size | Parameter | Offline/Online | Notes |)
[//]: # (|:---:|:---:|:---:|:---:|:---:|:---:|:---|)
[//]: # (| [Paraformer-large](https://www.modelscope.cn/models/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary) | CN & EN | Alibaba Speech Data (60000 hours) | 8404 | 220M | Offline | Duration of input wav <= 20s |)

| Model Name | Language | Training Data | Vocab Size | Parameter | Offline/Online | Notes |
|:---:|:---:|:---:|:---:|:---:|:---:|:---|
| [Paraformer-large](https://huggingface.co/funasr/paraformer-large) | CN & EN | Alibaba Speech Data (60000 hours) | 8404 | 220M | Offline | Duration of input wav <= 20s |

[//]: # (| [Paraformer-large-long](https://www.modelscope.cn/models/damo/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary) | CN & EN | Alibaba Speech Data (60000 hours) | 8404 | 220M | Offline | Supports input wav of arbitrary length |)
@ -77,21 +75,17 @@ Here we provided several pretrained models on different datasets. The details of
### Voice Activity Detection Models

[//]: # (| Model Name | Training Data | Parameters | Sampling Rate | Notes |)
[//]: # (|:---:|:---:|:---:|:---:|:---|)
[//]: # (| [FSMN-VAD](https://modelscope.cn/models/damo/speech_fsmn_vad_zh-cn-16k-common-pytorch/summary) | Alibaba Speech Data (5000 hours) | 0.4M | 16000 | |)

| Model Name | Training Data | Parameters | Sampling Rate | Notes |
|:---:|:---:|:---:|:---:|:---|
| [FSMN-VAD](https://huggingface.co/funasr/FSMN-VAD) | Alibaba Speech Data (5000 hours) | 0.4M | 16000 | |

[//]: # (| [FSMN-VAD](https://modelscope.cn/models/damo/speech_fsmn_vad_zh-cn-8k-common/summary) | Alibaba Speech Data (5000 hours) | 0.4M | 8000 | |)

### Punctuation Restoration Models

[//]: # (| Model Name | Training Data | Parameters | Vocab Size | Offline/Online | Notes |)
[//]: # (|:---:|:---:|:---:|:---:|:---:|:---|)
[//]: # (| [CT-Transformer](https://modelscope.cn/models/damo/punc_ct-transformer_zh-cn-common-vocab272727-pytorch/summary) | Alibaba Text Data | 70M | 272727 | Offline | Offline punctuation model |)

| Model Name | Training Data | Parameters | Vocab Size | Offline/Online | Notes |
|:---:|:---:|:---:|:---:|:---:|:---|
| [CT-Transformer](https://huggingface.co/funasr/CT-Transformer-punc) | Alibaba Text Data | 70M | 272727 | Offline | Offline punctuation model |

[//]: # (| [CT-Transformer](https://modelscope.cn/models/damo/punc_ct-transformer_zh-cn-common-vad_realtime-vocab272727/summary) | Alibaba Text Data | 70M | 272727 | Online | Online punctuation model |)
@ -31,12 +31,7 @@ Overview
./academic_recipe/sd_recipe.md

.. toctree::
   :maxdepth: 1
   :caption: Model Zoo

   ./modelscope_models.md
   ./huggingface_models.md

.. toctree::
   :maxdepth: 1
@ -56,11 +51,13 @@ Overview
.. toctree::
   :maxdepth: 1
   :caption: Model Zoo

   ./modelscope_models.md
   ./huggingface_models.md

.. toctree::
   :maxdepth: 1
@ -82,6 +79,13 @@ Overview
   ./benchmark/benchmark_onnx_cpp.md
   ./benchmark/benchmark_libtorch.md

.. toctree::
   :maxdepth: 1
   :caption: Funasr Library

   ./build_task.md

.. toctree::
   :maxdepth: 1
   :caption: Papers
@ -13,7 +13,7 @@ Here we provided several pretrained models on different datasets. The details of
|:---:|:---:|:---:|:---:|:---:|:---:|:---|
| [Paraformer-large](https://www.modelscope.cn/models/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary) | CN & EN | Alibaba Speech Data (60000 hours) | 8404 | 220M | Offline | Duration of input wav <= 20s |
| [Paraformer-large-long](https://www.modelscope.cn/models/damo/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary) | CN & EN | Alibaba Speech Data (60000 hours) | 8404 | 220M | Offline | Supports input wav of arbitrary length |
| [Paraformer-large-contextual](https://www.modelscope.cn/models/damo/speech_paraformer-large-contextual_asr_nat-zh-cn-16k-common-vocab8404/summary) | CN & EN | Alibaba Speech Data (60000 hours) | 8404 | 220M | Offline | Supports hotword customization based on incentive enhancement, improving the recall and precision of hotwords |
| [Paraformer](https://modelscope.cn/models/damo/speech_paraformer_asr_nat-zh-cn-16k-common-vocab8358-tensorflow1/summary) | CN & EN | Alibaba Speech Data (50000 hours) | 8358 | 68M | Offline | Duration of input wav <= 20s |
| [Paraformer-online](https://modelscope.cn/models/damo/speech_paraformer_asr_nat-zh-cn-16k-common-vocab8358-tensorflow1/summary) | CN & EN | Alibaba Speech Data (50000 hours) | 8404 | 68M | Online | Supports streaming input |
| [Paraformer-tiny](https://www.modelscope.cn/models/damo/speech_paraformer-tiny-commandword_asr_nat-zh-cn-16k-vocab544-pytorch/summary) | CN | Alibaba Speech Data (200 hours) | 544 | 5.2M | Offline | Lightweight Paraformer model for Mandarin command-word recognition |
@ -1,7 +1,7 @@
# Speech Recognition

> **Note**:
> The modelscope pipeline supports inference and finetuning for all the models in the [model zoo](https://alibaba-damo-academy.github.io/FunASR/en/modelscope_models.html#pretrained-models-on-modelscope). Here we take typical models as examples to demonstrate the usage.

## Inference
@ -62,10 +62,10 @@
##### Define pipeline
- `task`: `Tasks.auto_speech_recognition`
- `model`: model name in the [model zoo](https://alibaba-damo-academy.github.io/FunASR/en/modelscope_models.html#pretrained-models-on-modelscope), or model path on local disk
- `ngpu`: `1` (Default), decoding on GPU. If ngpu=0, decoding on CPU
- `ncpu`: `1` (Default), sets the number of threads used for intra-op parallelism on CPU
- `output_dir`: `None` (Default), the output path of results if set
- `batch_size`: `1` (Default), batch size when decoding

##### Infer pipeline
- `audio_in`: the input to decode, which could be:
  - wav_path, `e.g.`: asr_example.wav,
@ -79,7 +79,7 @@
```
In the case of `wav.scp` input, `output_dir` must be set to save the output results
- `audio_fs`: audio sampling rate, only set when audio_in is pcm audio
- `output_dir`: `None` (Default), the output path of results if set
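The `wav.scp` file mentioned above follows the Kaldi-style two-column format: one `utt_id wav_path` pair per line. A minimal sketch of parsing such a file, assuming that format (the helper name is ours, not part of the FunASR API):

```python
# Minimal sketch: parse a Kaldi-style wav.scp, one "utt_id wav_path" pair per
# line. Illustrative helper only; FunASR does this internally.
def parse_wav_scp(text: str) -> dict:
    """Map utterance IDs to wav paths, skipping blank lines."""
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        utt_id, wav_path = line.split(maxsplit=1)
        entries[utt_id] = wav_path
    return entries

scp = "asr_example1 ./audios/asr_example1.wav\nasr_example2 ./audios/asr_example2.wav\n"
print(parse_wav_scp(scp))
```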
### Inference with multi-thread CPUs or multi GPUs
FunASR also offers the recipe [infer.sh](https://github.com/alibaba-damo-academy/FunASR/blob/main/egs_modelscope/asr/TEMPLATE/infer.sh) to decode with multi-threaded CPUs or multiple GPUs.
@ -1,7 +1,7 @@
# Voice Activity Detection

> **Note**:
> The modelscope pipeline supports inference and finetuning for all the models in the [model zoo](https://alibaba-damo-academy.github.io/FunASR/en/modelscope_models.html#pretrained-models-on-modelscope). Here we take the FSMN-VAD model as an example to demonstrate the usage.

## Inference
@ -47,10 +47,10 @@ Full code of demo, please ref to [demo](https://github.com/alibaba-damo-academy/
##### Define pipeline
- `task`: `Tasks.voice_activity_detection`
- `model`: model name in the [model zoo](https://alibaba-damo-academy.github.io/FunASR/en/modelscope_models.html#pretrained-models-on-modelscope), or model path on local disk
- `ngpu`: `1` (Default), decoding on GPU. If ngpu=0, decoding on CPU
- `ncpu`: `1` (Default), sets the number of threads used for intra-op parallelism on CPU
- `output_dir`: `None` (Default), the output path of results if set
- `batch_size`: `1` (Default), batch size when decoding

##### Infer pipeline
- `audio_in`: the input to decode, which could be:
  - wav_path, `e.g.`: asr_example.wav,
@ -64,7 +64,7 @@ Full code of demo, please ref to [demo](https://github.com/alibaba-damo-academy/
```
In the case of `wav.scp` input, `output_dir` must be set to save the output results
- `audio_fs`: audio sampling rate, only set when audio_in is pcm audio
- `output_dir`: `None` (Default), the output path of results if set
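The VAD pipeline returns speech segments as start/end timestamps; in the ModelScope FSMN-VAD pipeline these are typically millisecond pairs such as `[[70, 2340], ...]`. A sketch of converting such segments to sample index ranges for slicing a waveform (the helper is illustrative, and the millisecond convention should be checked against the model card):

```python
# Sketch: convert VAD segments given in milliseconds to sample index ranges,
# so the detected speech can be sliced out of a waveform array.
# Illustrative helper only; verify the segment units against the model card.
def segments_to_samples(segments_ms, sample_rate=16000):
    factor = sample_rate / 1000.0  # samples per millisecond
    return [(int(beg * factor), int(end * factor)) for beg, end in segments_ms]

segments = [[70, 2340], [2620, 6200]]  # example VAD output, in ms
print(segments_to_samples(segments))
```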
### Inference with multi-thread CPUs or multi GPUs
FunASR also offers the recipe [infer.sh](https://github.com/alibaba-damo-academy/FunASR/blob/main/egs_modelscope/vad/TEMPLATE/infer.sh) to decode with multi-threaded CPUs or multiple GPUs.
@ -19,7 +19,7 @@ python -m funasr.export.export_model --model-name damo/speech_paraformer-large_a
```
## Install `funasr_onnx`

Install from pip:
```shell
@ -46,16 +46,22 @@ pip install -e ./
from funasr_onnx import Paraformer

model_dir = "./export/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch"
model = Paraformer(model_dir, batch_size=1, quantize=True)

wav_path = ['./export/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch/example/asr_example.wav']

result = model(wav_path)
print(result)
```
- `model_dir`: the model path, which contains `model.onnx`, `config.yaml`, `am.mvn`
- `batch_size`: `1` (Default), the batch size during inference
- `device_id`: `-1` (Default), infer on CPU. To infer with GPU, set it to the gpu_id (make sure you have installed onnxruntime-gpu)
- `quantize`: `False` (Default), load `model.onnx` in `model_dir`. If set `True`, load `model_quant.onnx` in `model_dir`
- `intra_op_num_threads`: `4` (Default), sets the number of threads used for intra-op parallelism on CPU

Input: wav format file, supported types: `str`, `np.ndarray`, `List[str]`

Output: `List[str]`: recognition result
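Since the `quantize` flag above selects between `model.onnx` and `model_quant.onnx` inside `model_dir`, the lookup can be sketched roughly as follows (a simplified illustration of the documented behavior, not the actual `funasr_onnx` source):

```python
import os

# Sketch of the documented file selection: quantize=False loads model.onnx,
# quantize=True loads model_quant.onnx from the same model_dir.
def resolve_onnx_path(model_dir: str, quantize: bool = False) -> str:
    name = "model_quant.onnx" if quantize else "model.onnx"
    return os.path.join(model_dir, name)

print(resolve_onnx_path("./export/paraformer", quantize=True))
```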
#### Paraformer-online
@ -71,9 +77,16 @@ model = Fsmn_vad(model_dir)
result = model(wav_path)
print(result)
```
- `model_dir`: the model path, which contains `model.onnx`, `config.yaml`, `am.mvn`
- `batch_size`: `1` (Default), the batch size during inference
- `device_id`: `-1` (Default), infer on CPU. To infer with GPU, set it to the gpu_id (make sure you have installed onnxruntime-gpu)
- `quantize`: `False` (Default), load `model.onnx` in `model_dir`. If set `True`, load `model_quant.onnx` in `model_dir`
- `intra_op_num_threads`: `4` (Default), sets the number of threads used for intra-op parallelism on CPU

Input: wav format file, supported types: `str`, `np.ndarray`, `List[str]`

Output: `List[str]`: recognition result
#### FSMN-VAD-online
```python
@ -105,9 +118,16 @@ for sample_offset in range(0, speech_length, min(step, speech_length - sample_of
    if segments_result:
        print(segments_result)
```
- `model_dir`: the model path, which contains `model.onnx`, `config.yaml`, `am.mvn`
- `batch_size`: `1` (Default), the batch size during inference
- `device_id`: `-1` (Default), infer on CPU. To infer with GPU, set it to the gpu_id (make sure you have installed onnxruntime-gpu)
- `quantize`: `False` (Default), load `model.onnx` in `model_dir`. If set `True`, load `model_quant.onnx` in `model_dir`
- `intra_op_num_threads`: `4` (Default), sets the number of threads used for intra-op parallelism on CPU

Input: wav format file, supported types: `str`, `np.ndarray`, `List[str]`

Output: `List[str]`: recognition result
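The online demo above walks the waveform in fixed-size chunks via `sample_offset`, shrinking and flagging the final window. Stripped of the model call, the chunking pattern can be sketched like this (a pure-Python illustration of the pattern, not an exact copy of the FunASR demo):

```python
# Sketch of the fixed-chunk streaming pattern used by the online demo above:
# walk the signal in `step`-sized windows, clipping the last window to the
# end of the signal and marking it as final.
def iter_chunks(speech_length: int, step: int):
    chunks = []
    for sample_offset in range(0, speech_length, step):
        is_final = sample_offset + step >= speech_length
        end = min(sample_offset + step, speech_length)
        chunks.append((sample_offset, end, is_final))
    return chunks

print(iter_chunks(speech_length=1000, step=300))
```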
### Punctuation Restoration
#### CT-Transformer
@ -121,9 +141,15 @@ text_in="跨境河流是养育沿岸人民的生命之源长期以来为帮助
result = model(text_in)
print(result[0])
```
- `model_dir`: the model path, which contains `model.onnx`, `config.yaml`, `am.mvn`
- `device_id`: `-1` (Default), infer on CPU. To infer with GPU, set it to the gpu_id (make sure you have installed onnxruntime-gpu)
- `quantize`: `False` (Default), load `model.onnx` in `model_dir`. If set `True`, load `model_quant.onnx` in `model_dir`
- `intra_op_num_threads`: `4` (Default), sets the number of threads used for intra-op parallelism on CPU

Input: `str`, raw text of asr result

Output: `List[str]`: recognition result
#### CT-Transformer-online
```python
@ -143,9 +169,14 @@ for vad in vads:
print(rec_result_all)
```
- `model_dir`: the model path, which contains `model.onnx`, `config.yaml`, `am.mvn`
- `device_id`: `-1` (Default), infer on CPU. To infer with GPU, set it to the gpu_id (make sure you have installed onnxruntime-gpu)
- `quantize`: `False` (Default), load `model.onnx` in `model_dir`. If set `True`, load `model_quant.onnx` in `model_dir`
- `intra_op_num_threads`: `4` (Default), sets the number of threads used for intra-op parallelism on CPU

Input: `str`, raw text of asr result

Output: `List[str]`: recognition result
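The online demo above feeds text piece by piece (as VAD would segment it) and accumulates the punctuated output in `rec_result_all`. The accumulation pattern can be sketched like this; `punctuate` here is a trivial stand-in for the model call, not a real `funasr_onnx` API:

```python
# Sketch of the accumulation pattern in the CT-Transformer-online demo:
# process VAD-segmented text pieces in order and concatenate the results.
def punctuate(piece: str) -> str:
    # Placeholder: the real online model inserts punctuation based on context.
    return piece + "，"

vads = ["跨境河流是养育沿岸人民的生命之源", "长期以来为帮助下游地区防灾减灾"]
rec_result_all = ""
for vad in vads:
    rec_result_all += punctuate(vad)
print(rec_result_all)
```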
## Performance benchmark