mirror of https://github.com/espressif/esp-sr.git
synced 2025-09-15 15:28:44 +08:00

docs: add vadnet docs

parent 7f65aca192
commit ccdac1a87a
@ -12,6 +12,7 @@ ESP-SR framework includes the following modules:

* [Audio Front-end AFE](https://docs.espressif.com/projects/esp-sr/en/latest/esp32s3/audio_front_end/README.html)
* [Wake Word Engine WakeNet](https://docs.espressif.com/projects/esp-sr/en/latest/esp32s3/wake_word_engine/README.html)
* [VAD Model VADNet](https://docs.espressif.com/projects/esp-sr/en/latest/esp32s3/vadnet/README.html)
* [Speech Command Word Recognition MultiNet](https://docs.espressif.com/projects/esp-sr/en/latest/esp32s3/speech_command_recognition/README.html)
* [Speech Synthesis](https://docs.espressif.com/projects/esp-sr/en/latest/esp32s3/speech_synthesis/readme.html)
@ -16,6 +16,7 @@ ESP-SR User Guide

    Getting Started <getting_started/readme>
    Audio Front-end (AFE) <audio_front_end/index>
    Wake Word WakeNet <wake_word_engine/index>
    VAD Model vadnet <vadnet/readme>
    Speech Command Word MultiNet <speech_command_recognition/README>
    Speech Synthesis (Only Supports Chinese Language) <speech_synthesis/readme>
    Flashing Models <flash_model/README>
69  docs/en/vadnet/readme.md  Normal file
@ -0,0 +1,69 @@

Voice Activity Detection Model
==============================

:link_to_translation:`zh_CN:[中文]`

VADNet is a Voice Activity Detection model built on a neural network, designed for low-power embedded MCUs.

Overview
--------

VADNet uses a model structure and data processing flow similar to WakeNet. For more details, refer to :doc:`WakeNet <../wake_word_engine/README>`.

VADNet is trained on about 5,000 hours of Chinese data, 5,000 hours of English data, and 5,000 hours of multilingual data.

Use VADNet
----------

- Select VADNet model

  To select a VADNet model, please refer to :doc:`Flashing Models <../flash_model/README>`.

- Run VADNet

  VADNet is currently included in the :doc:`AFE <../audio_front_end/README>`, which is enabled by default, and returns the detection results through the AFE fetch interface.

  The common VAD settings are as follows:

  ::

      afe_config->vad_init = true;          // Whether to initialize VAD in the AFE pipeline. Default is true.
      afe_config->vad_min_noise_ms = 1000;  // The minimum duration of noise or silence, in ms.
      afe_config->vad_min_speech_ms = 128;  // The minimum duration of speech, in ms.
      afe_config->vad_delay_ms = 128;       // The delay between the first VAD trigger frame and the first frame of speech data, in ms.
      afe_config->vad_mode = VAD_MODE_1;    // The larger the mode, the higher the speech trigger probability.
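To make the interplay of these durations concrete, here is a minimal, hypothetical sketch in plain C (illustrative names, 32 ms frames assumed; this is not the ESP-SR implementation) of how `vad_min_speech_ms` and `vad_min_noise_ms` debounce raw per-frame decisions into a stable speech/noise state:

```c
/* Hypothetical debouncer: NOT the ESP-SR implementation, only an
 * illustration of the vad_min_speech_ms / vad_min_noise_ms idea. */
typedef struct {
    int state;         /* 0 = noise/silence, 1 = speech */
    int run_ms;        /* how long the opposite raw decision has persisted */
    int min_speech_ms; /* cf. afe_config->vad_min_speech_ms */
    int min_noise_ms;  /* cf. afe_config->vad_min_noise_ms */
} vad_smoother_t;

/* Feed one frame's raw decision (1 = speech-like, 0 = noise-like);
 * the state only flips after the opposite decision has persisted for
 * the corresponding minimum duration. */
int vad_smooth(vad_smoother_t *s, int raw, int frame_ms)
{
    if (raw != s->state) {
        s->run_ms += frame_ms;
        if (s->run_ms >= (raw ? s->min_speech_ms : s->min_noise_ms)) {
            s->state = raw;
            s->run_ms = 0;
        }
    } else {
        s->run_ms = 0; /* streak broken: restart the count */
    }
    return s->state;
}
```

With `vad_min_speech_ms = 128` and 32 ms frames, four consecutive speech-like frames are needed before the state flips to speech; a much longer run of noise-like frames (`vad_min_noise_ms = 1000`) is needed to flip back.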

  If users want to enable/disable/reset VADNet temporarily, please use:

  ::

      afe_handle->disable_vad(afe_data);  // disable VADNet
      afe_handle->enable_vad(afe_data);   // enable VADNet
      afe_handle->reset_vad(afe_data);    // reset VADNet status

- VAD Cache and Detection

  Two aspects of the VAD settings can delay the first triggered frame of speech:

  1. The inherent delay of the VAD algorithm itself: VAD cannot reliably trigger on the first frame of speech and may lag by 1 to 3 frames.
  2. To avoid false triggers, VAD only fires once the continuous trigger duration reaches the `vad_min_speech_ms` parameter in the AFE configuration.

  For these two reasons, using the first triggered frame of VAD directly may truncate the first word. To avoid this, AFE V2.0 has added a VAD cache. You can determine whether a VAD cache needs to be saved by checking `vad_cache_size`:

  ::

      afe_fetch_result_t *result = afe_handle->fetch(afe_data);
      if (result->vad_cache_size > 0) {
          printf("vad cache size: %d\n", result->vad_cache_size);
          fwrite(result->vad_cache, 1, result->vad_cache_size, fp);
      }

      printf("vad state: %s\n", result->vad_state == VAD_SILENCE ? "noise" : "speech");
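The caching idea itself can be shown in isolation. Below is a hypothetical sketch in plain C (illustrative names and sizes, not the ESP-SR API): frames that arrive before the debounced trigger are kept in a small ring buffer, and flushing that buffer first on trigger preserves the 1-3 lagged frames plus the `vad_min_speech_ms` confirmation window:

```c
#include <string.h>

/* Hypothetical VAD cache: integer frame ids stand in for audio data. */
#define CACHE_FRAMES 8

typedef struct {
    int frames[CACHE_FRAMES];
    int count;
} vad_cache_t;

/* Push one pre-trigger frame, discarding the oldest when full. */
void cache_push(vad_cache_t *c, int frame)
{
    if (c->count == CACHE_FRAMES) {
        memmove(c->frames, c->frames + 1, (CACHE_FRAMES - 1) * sizeof(int));
        c->count--;
    }
    c->frames[c->count++] = frame;
}

/* On trigger: copy the cached frames to out (oldest first), reset the
 * cache, and return how many frames were cached. */
int cache_flush(vad_cache_t *c, int *out)
{
    int n = c->count;
    memcpy(out, c->frames, n * sizeof(int));
    c->count = 0;
    return n;
}
```

The consumer then plays back the flushed frames before the live stream, which is why reading `vad_cache` before the regular fetched data avoids clipping the first word.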

Resource Occupancy
------------------

For the resource occupancy of this model, see :doc:`Resource Occupancy <../benchmark/README>`.

@ -46,32 +46,6 @@ Please see the flow diagram of WakeNet below:

- Keyword Triggering Method:

  For a continuous audio stream, we average the recognition results (M) over several frames to generate a smoothed prediction and improve the accuracy of keyword triggering. A trigger command is sent only when the M value exceeds the set threshold.
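As a rough sketch of this smoothing step (plain C, illustrative window size and threshold; not the WakeNet implementation), averaging the last few per-frame scores keeps a single noisy spike from firing the trigger:

```c
/* Hypothetical score smoother: window length and threshold are
 * illustrative, not WakeNet's actual values. */
#define WIN 4

typedef struct {
    float scores[WIN]; /* ring buffer of recent per-frame scores */
    int   idx;
    int   filled;
} smoother_t;

/* Add one frame score; return 1 when the mean over a full window
 * exceeds thr, 0 otherwise. */
int smooth_trigger(smoother_t *s, float score, float thr)
{
    s->scores[s->idx] = score;
    s->idx = (s->idx + 1) % WIN;
    if (s->filled < WIN) s->filled++;
    if (s->filled < WIN) return 0; /* not enough history yet */
    float sum = 0.0f;
    for (int i = 0; i < WIN; i++) sum += s->scores[i];
    return (sum / WIN) > thr;
}
```

Requiring a full window before evaluating the mean is what suppresses one-frame spikes; the real model's window length and threshold are tuned per wake word.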

The wake words supported by Espressif chips are listed below:

.. _esp-open-wake-word:

+-----------------+-----------+-------------+-------------+-----------+-----------+-----------+-----------+
| Chip            | ESP32                                 | ESP32S3                                       |
+=================+===========+=============+=============+===========+===========+===========+===========+
| model           | WakeNet 5                             | WakeNet 8             | WakeNet 9             |
|                 +-----------+-------------+-------------+-----------+-----------+-----------+-----------+
|                 | WakeNet 5 | WakeNet 5X2 | WakeNet 5X3 | Q16       | Q8        | Q16       | Q8        |
+-----------------+-----------+-------------+-------------+-----------+-----------+-----------+-----------+
| Hi,Lexin        | √         | √           | √           |           |           |           | √         |
+-----------------+-----------+-------------+-------------+-----------+-----------+-----------+-----------+
| nihaoxiaozhi    | √         |             | √           |           |           |           | √         |
+-----------------+-----------+-------------+-------------+-----------+-----------+-----------+-----------+
| nihaoxiaoxin    |           |             | √           |           |           |           |           |
+-----------------+-----------+-------------+-------------+-----------+-----------+-----------+-----------+
| xiaoaitongxue   |           |             |             |           |           |           | √         |
+-----------------+-----------+-------------+-------------+-----------+-----------+-----------+-----------+
| Alexa           |           |             |             | √         |           |           | √         |
+-----------------+-----------+-------------+-------------+-----------+-----------+-----------+-----------+
| Hi,ESP          |           |             |             |           |           |           | √         |
+-----------------+-----------+-------------+-------------+-----------+-----------+-----------+-----------+
| Customized word |           |             |             |           |           |           | √         |
+-----------------+-----------+-------------+-------------+-----------+-----------+-----------+-----------+

Use WakeNet
-----------

@ -89,7 +63,7 @@ Use WakeNet

::

    afe_config.wakeNet_init = False.
    afe_config->wakeNet_init = false;

If users want to enable/disable WakeNet temporarily, please use:

@ -17,6 +17,7 @@ ESP-SR User Guide

    Getting Started <getting_started/readme>
    Audio Front-end AFE <audio_front_end/index>
    Wake Word WakeNet <wake_word_engine/index>
    VAD vadnet <vadnet/readme>
    Speech Commands MultiNet <speech_command_recognition/README>
    Speech Synthesis (Chinese only) <speech_synthesis/readme>
    Flashing Models <flash_model/README>

67  docs/zh_CN/vadnet/readme.rst  Normal file
@ -0,0 +1,67 @@

Voice Activity Detection Model
==============================

:link_to_translation:`en:[English]`

VADNet is a neural-network-based voice activity detection model, designed for low-power embedded MCUs.

Overview
--------

VADNet uses a model structure and data processing flow similar to WakeNet; for more implementation details, see :doc:`AFE <../audio_front_end/README>`.

VADNet is trained on about 5,000 hours of Chinese data, 5,000 hours of English data, and 5,000 hours of multilingual data.

Use VADNet
----------

- Select VADNet model

  To select a VADNet model, please refer to :doc:`Flashing Models <../flash_model/README>`.

- Run VADNet

  VADNet is currently integrated in the :doc:`AFE <../audio_front_end/README>`, enabled by default, and returns the detection results through the AFE fetch interface.

  The common VAD settings are as follows:

  ::

      afe_config->vad_init = true;          // Whether to initialize VAD in the AFE pipeline. Default is true.
      afe_config->vad_min_noise_ms = 1000;  // The minimum duration of noise or silence, in ms.
      afe_config->vad_min_speech_ms = 128;  // The minimum duration of speech, in ms.
      afe_config->vad_delay_ms = 128;       // The delay between the first VAD trigger frame and the first frame of speech data.
      afe_config->vad_mode = VAD_MODE_1;    // The larger the mode, the higher the speech trigger probability.

  To enable/disable/reset VADNet temporarily, use:

  ::

      afe_handle->disable_vad(afe_data);  // disable VAD
      afe_handle->enable_vad(afe_data);   // enable VAD
      afe_handle->reset_vad(afe_data);    // reset VAD status

- VAD Cache and Detection

  Two characteristics of the VAD configuration can delay the first triggered frame of speech:

  1. The inherent delay of the VAD algorithm: VAD cannot trigger precisely on the first frame and may lag by 1 to 3 frames.
  2. The false-trigger protection: VAD only fires once the continuous trigger duration reaches the configured `vad_min_speech_ms`.

  To prevent these delays from truncating the first word, AFE V2.0 adds a VAD cache mechanism. Check `vad_cache_size` to determine whether a VAD cache needs to be saved:

  ::

      afe_fetch_result_t *result = afe_handle->fetch(afe_data);
      if (result->vad_cache_size > 0) {
          printf("vad cache size: %d\n", result->vad_cache_size);
          fwrite(result->vad_cache, 1, result->vad_cache_size, fp);  // write out the cached data
      }

      printf("vad state: %s\n", result->vad_state == VAD_SILENCE ? "noise" : "speech");

Resource Occupancy
------------------

For the resource occupancy of this model, see :doc:`Resource Occupancy <../benchmark/README>`.