mirror of https://github.com/espressif/esp-sr.git
synced 2025-09-15 15:28:44 +08:00

docs: add vadnet docs

parent 7f65aca192
commit ccdac1a87a
@ -12,6 +12,7 @@ ESP-SR framework includes the following modules:

* [Audio Front-end AFE](https://docs.espressif.com/projects/esp-sr/en/latest/esp32s3/audio_front_end/README.html)
* [Wake Word Engine WakeNet](https://docs.espressif.com/projects/esp-sr/en/latest/esp32s3/wake_word_engine/README.html)
* [VAD Model VADNet](https://docs.espressif.com/projects/esp-sr/en/latest/esp32s3/vadnet/README.html)
* [Speech Command Word Recognition MultiNet](https://docs.espressif.com/projects/esp-sr/en/latest/esp32s3/speech_command_recognition/README.html)
* [Speech Synthesis](https://docs.espressif.com/projects/esp-sr/en/latest/esp32s3/speech_synthesis/readme.html)
@ -16,6 +16,7 @@ ESP-SR User Guide

    Getting Started <getting_started/readme>
    Audio Front-end (AFE) <audio_front_end/index>
    Wake Word WakeNet <wake_word_engine/index>
    VAD Model vadnet <vadnet/readme>
    Speech Command Word MultiNet <speech_command_recognition/README>
    Speech Synthesis (Only Supports Chinese Language) <speech_synthesis/readme>
    Flashing Models <flash_model/README>
69  docs/en/vadnet/readme.md  Normal file
@ -0,0 +1,69 @@

Voice Activity Detection Model
==============================

:link_to_translation:`zh_CN:[中文]`

VADNet is a Voice Activity Detection model built on a neural network, designed for low-power embedded MCUs.

Overview
--------

VADNet uses a model structure and data processing flow similar to WakeNet. For more details, refer to :doc:`WakeNet <../wake_word_engine/README>`.

VADNet is trained on about 5,000 hours of Chinese data, 5,000 hours of English data, and 5,000 hours of multilingual data.

Use VADNet
----------

- Select VADNet model

  To select a VADNet model, please refer to :doc:`Flashing Models <../flash_model/README>`.

- Run VADNet

  VADNet is currently included in the :doc:`AFE <../audio_front_end/README>`, which is enabled by default, and returns the detection results through the AFE fetch interface.

  The common VAD settings are as follows:

  ::

      afe_config->vad_init = true;          // Whether to initialize VAD in the AFE pipeline. Default is true.
      afe_config->vad_min_noise_ms = 1000;  // The minimum duration of noise or silence, in ms.
      afe_config->vad_min_speech_ms = 128;  // The minimum duration of speech, in ms.
      afe_config->vad_delay_ms = 128;       // The delay between the first VAD trigger frame and the first frame of speech data, in ms.
      afe_config->vad_mode = VAD_MODE_1;    // The larger the mode, the higher the speech trigger probability.
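To make the interplay of these durations concrete, here is a minimal, hypothetical sketch in plain C (illustrative names, 32 ms frames assumed; this is not the ESP-SR implementation) of how `vad_min_speech_ms` and `vad_min_noise_ms` debounce raw per-frame decisions into a stable speech/noise state:

```c
/* Hypothetical debouncer: NOT the ESP-SR implementation, only an
 * illustration of the vad_min_speech_ms / vad_min_noise_ms idea. */
typedef struct {
    int state;         /* 0 = noise/silence, 1 = speech */
    int run_ms;        /* how long the opposite raw decision has persisted */
    int min_speech_ms; /* cf. afe_config->vad_min_speech_ms */
    int min_noise_ms;  /* cf. afe_config->vad_min_noise_ms */
} vad_smoother_t;

/* Feed one frame's raw decision (1 = speech-like, 0 = noise-like);
 * the state only flips after the opposite decision has persisted for
 * the corresponding minimum duration. */
int vad_smooth(vad_smoother_t *s, int raw, int frame_ms)
{
    if (raw != s->state) {
        s->run_ms += frame_ms;
        if (s->run_ms >= (raw ? s->min_speech_ms : s->min_noise_ms)) {
            s->state = raw;
            s->run_ms = 0;
        }
    } else {
        s->run_ms = 0; /* streak broken: restart the count */
    }
    return s->state;
}
```

With `vad_min_speech_ms = 128` and 32 ms frames, four consecutive speech-like frames are needed before the state flips to speech; a much longer run of noise-like frames (`vad_min_noise_ms = 1000`) is needed to flip back.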

  If users want to enable/disable/reset VADNet temporarily, please use:

  ::

      afe_handle->disable_vad(afe_data);  // disable VADNet
      afe_handle->enable_vad(afe_data);   // enable VADNet
      afe_handle->reset_vad(afe_data);    // reset VADNet status

- VAD Cache and Detection

  Two aspects of the VAD settings can delay the first triggered frame of speech:

  1. The inherent delay of the VAD algorithm itself: VAD cannot reliably trigger on the first frame of speech and may lag by 1 to 3 frames.
  2. To avoid false triggers, VAD only fires once the continuous trigger duration reaches the `vad_min_speech_ms` parameter in the AFE configuration.

  For these two reasons, using the first triggered frame of VAD directly may truncate the first word. To avoid this, AFE V2.0 has added a VAD cache. You can determine whether a VAD cache needs to be saved by checking `vad_cache_size`:

  ::

      afe_fetch_result_t *result = afe_handle->fetch(afe_data);
      if (result->vad_cache_size > 0) {
          printf("vad cache size: %d\n", result->vad_cache_size);
          fwrite(result->vad_cache, 1, result->vad_cache_size, fp);
      }

      printf("vad state: %s\n", result->vad_state == VAD_SILENCE ? "noise" : "speech");
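The caching idea itself can be shown in isolation. Below is a hypothetical sketch in plain C (illustrative names and sizes, not the ESP-SR API): frames that arrive before the debounced trigger are kept in a small ring buffer, and flushing that buffer first on trigger preserves the 1-3 lagged frames plus the `vad_min_speech_ms` confirmation window:

```c
#include <string.h>

/* Hypothetical VAD cache: integer frame ids stand in for audio data. */
#define CACHE_FRAMES 8

typedef struct {
    int frames[CACHE_FRAMES];
    int count;
} vad_cache_t;

/* Push one pre-trigger frame, discarding the oldest when full. */
void cache_push(vad_cache_t *c, int frame)
{
    if (c->count == CACHE_FRAMES) {
        memmove(c->frames, c->frames + 1, (CACHE_FRAMES - 1) * sizeof(int));
        c->count--;
    }
    c->frames[c->count++] = frame;
}

/* On trigger: copy the cached frames to out (oldest first), reset the
 * cache, and return how many frames were cached. */
int cache_flush(vad_cache_t *c, int *out)
{
    int n = c->count;
    memcpy(out, c->frames, n * sizeof(int));
    c->count = 0;
    return n;
}
```

The consumer then plays back the flushed frames before the live stream, which is why reading `vad_cache` before the regular fetched data avoids clipping the first word.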

Resource Occupancy
------------------

For the resource occupancy of this model, see :doc:`Resource Occupancy <../benchmark/README>`.

@ -46,32 +46,6 @@ Please see the flow diagram of WakeNet below:

- Keyword Triggering Method:

  For a continuous audio stream, we average the recognition results (M) over several frames to generate a smoothed prediction and improve the accuracy of keyword triggering. A trigger command is sent only when the M value exceeds the set threshold.
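As a rough sketch of this smoothing step (plain C, illustrative window size and threshold; not the WakeNet implementation), averaging the last few per-frame scores keeps a single noisy spike from firing the trigger:

```c
/* Hypothetical score smoother: window length and threshold are
 * illustrative, not WakeNet's actual values. */
#define WIN 4

typedef struct {
    float scores[WIN]; /* ring buffer of recent per-frame scores */
    int   idx;
    int   filled;
} smoother_t;

/* Add one frame score; return 1 when the mean over a full window
 * exceeds thr, 0 otherwise. */
int smooth_trigger(smoother_t *s, float score, float thr)
{
    s->scores[s->idx] = score;
    s->idx = (s->idx + 1) % WIN;
    if (s->filled < WIN) s->filled++;
    if (s->filled < WIN) return 0; /* not enough history yet */
    float sum = 0.0f;
    for (int i = 0; i < WIN; i++) sum += s->scores[i];
    return (sum / WIN) > thr;
}
```

Requiring a full window before evaluating the mean is what suppresses one-frame spikes; the real model's window length and threshold are tuned per wake word.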

The wake words supported by Espressif chips are listed below:

.. _esp-open-wake-word:

+-----------------+-----------+-------------+-------------+-----------+-----------+-----------+-----------+
| Chip            | ESP32                                 | ESP32S3                                       |
+=================+===========+=============+=============+===========+===========+===========+===========+
| model           | WakeNet 5                             | WakeNet 8             | WakeNet 9             |
|                 +-----------+-------------+-------------+-----------+-----------+-----------+-----------+
|                 | WakeNet 5 | WakeNet 5X2 | WakeNet 5X3 | Q16       | Q8        | Q16       | Q8        |
+-----------------+-----------+-------------+-------------+-----------+-----------+-----------+-----------+
| Hi,Lexin        | √         | √           | √           |           |           |           | √         |
+-----------------+-----------+-------------+-------------+-----------+-----------+-----------+-----------+
| nihaoxiaozhi    | √         |             | √           |           |           |           | √         |
+-----------------+-----------+-------------+-------------+-----------+-----------+-----------+-----------+
| nihaoxiaoxin    |           |             | √           |           |           |           |           |
+-----------------+-----------+-------------+-------------+-----------+-----------+-----------+-----------+
| xiaoaitongxue   |           |             |             |           |           |           | √         |
+-----------------+-----------+-------------+-------------+-----------+-----------+-----------+-----------+
| Alexa           |           |             |             | √         |           |           | √         |
+-----------------+-----------+-------------+-------------+-----------+-----------+-----------+-----------+
| Hi,ESP          |           |             |             |           |           |           | √         |
+-----------------+-----------+-------------+-------------+-----------+-----------+-----------+-----------+
| Customized word |           |             |             |           |           |           | √         |
+-----------------+-----------+-------------+-------------+-----------+-----------+-----------+-----------+

Use WakeNet
-----------

@ -89,7 +63,7 @@ Use WakeNet

::

    afe_config.wakeNet_init = False.
    afe_config->wakeNet_init = false;

If users want to enable/disable WakeNet temporarily, please use:

@ -17,6 +17,7 @@ ESP-SR User Guide

    Getting Started <getting_started/readme>
    Audio Front-end AFE <audio_front_end/index>
    Wake Word WakeNet <wake_word_engine/index>
    VAD vadnet <vadnet/readme>
    Speech Commands MultiNet <speech_command_recognition/README>
    Speech Synthesis (Chinese only) <speech_synthesis/readme>
    Flashing Models <flash_model/README>

67  docs/zh_CN/vadnet/readme.rst  Normal file
@ -0,0 +1,67 @@

Voice Activity Detection Model
==============================

:link_to_translation:`en:[English]`

VADNet is a neural-network-based voice activity detection model, designed for low-power embedded MCUs.

Overview
--------

VADNet uses a model structure and data processing flow similar to WakeNet; for more implementation details, see :doc:`AFE <../audio_front_end/README>`.

VADNet is trained on about 5,000 hours of Chinese data, 5,000 hours of English data, and 5,000 hours of multilingual data.

Use VADNet
----------

- Select VADNet model

  To select a VADNet model, please refer to :doc:`Flashing Models <../flash_model/README>`.

- Run VADNet

  VADNet is currently integrated in the :doc:`AFE <../audio_front_end/README>`, enabled by default, and returns the detection results through the AFE fetch interface.

  The common VAD settings are as follows:

  ::

      afe_config->vad_init = true;          // Whether to initialize VAD in the AFE pipeline. Default is true.
      afe_config->vad_min_noise_ms = 1000;  // The minimum duration of noise or silence, in ms.
      afe_config->vad_min_speech_ms = 128;  // The minimum duration of speech, in ms.
      afe_config->vad_delay_ms = 128;       // The delay between the first VAD trigger frame and the first frame of speech data.
      afe_config->vad_mode = VAD_MODE_1;    // The larger the mode, the higher the speech trigger probability.

  To enable/disable/reset VADNet temporarily, use:

  ::

      afe_handle->disable_vad(afe_data);  // disable VAD
      afe_handle->enable_vad(afe_data);   // enable VAD
      afe_handle->reset_vad(afe_data);    // reset VAD status

- VAD Cache and Detection

  Two characteristics of the VAD configuration can delay the first triggered frame of speech:

  1. The inherent delay of the VAD algorithm: VAD cannot trigger precisely on the first frame and may lag by 1 to 3 frames.
  2. The false-trigger protection: VAD only fires once the continuous trigger duration reaches the configured `vad_min_speech_ms`.

  To prevent these delays from truncating the first word, AFE V2.0 adds a VAD cache mechanism. Check `vad_cache_size` to determine whether a VAD cache needs to be saved:

  ::

      afe_fetch_result_t *result = afe_handle->fetch(afe_data);
      if (result->vad_cache_size > 0) {
          printf("vad cache size: %d\n", result->vad_cache_size);
          fwrite(result->vad_cache, 1, result->vad_cache_size, fp);  // write out the cached data
      }

      printf("vad state: %s\n", result->vad_state == VAD_SILENCE ? "noise" : "speech");

Resource Occupancy
------------------

For the resource occupancy of this model, see :doc:`Resource Occupancy <../benchmark/README>`.