Merge branch 'vad/add_reset_func' into 'master'

Vad/add reset func

See merge request speech-recognition-framework/esp-sr!135
This commit is contained in:
Sun Xiang Yu 2025-02-08 12:29:04 +08:00
commit 10d431ee5f
22 changed files with 431 additions and 833 deletions


@ -6,7 +6,9 @@ Audio Front-end Framework
Overview
--------
Any voice-enabled product needs to perform well in a noisy environment, and audio front-end (AFE) algorithms play an important role in building a sensitive voice-user interface (VUI). Espressif's AI Lab has created a set of audio front-end algorithms that can offer this functionality. Customers can use these algorithms with Espressif's powerful {IDF_TARGET_NAME} series of chips, in order to build high-performance, yet low-cost, products with a voice-user interface.
This guide provides an overview of how to use the Audio Front End (AFE) framework and explains the definition of the input format.
The AFE framework is designed to process audio data for applications such as speech recognition and voice communication.
It includes various algorithms like Acoustic Echo Cancellation (AEC), Noise Suppression (NS), Voice Activity Detection (VAD), and Wake Word Detection (WakeNet).
.. list-table::
:widths: 25 75
@ -37,377 +39,106 @@ This section introduces two typical usage scenarios of Espressif AFE framework.
Speech Recognition
^^^^^^^^^^^^^^^^^^
Workflow
""""""""
.. figure:: ../../_static/AFE_SR_overview.png
:alt: overview
Data Flow
"""""""""
.. figure:: ../../_static/AFE_SR_workflow.png
:alt: overview
#. Use :cpp:func:`ESP_AFE_SR_HANDLE` to create and initialize the AFE. Note that :cpp:member:`voice_communication_init` must be configured as false.
#. Use :cpp:func:`feed` to input audio data. The AEC algorithm is performed first inside :cpp:func:`feed`.
#. The BSS/NS algorithms are then performed inside :cpp:func:`feed`.
#. Use :cpp:func:`fetch` to obtain the processed single-channel audio data and related information. Note that VAD processing and wake word detection are performed inside :cpp:func:`fetch`. The specific behavior depends on the configuration of the :cpp:type:`afe_config_t` structure.
Voice Communication
^^^^^^^^^^^^^^^^^^^
Workflow
""""""""
.. figure:: ../../_static/AFE_VOIP_overview.png
:alt: overview
Data Flow
"""""""""
.. figure:: ../../_static/AFE_VOIP_workflow.png
:alt: overview
#. Use :cpp:func:`ESP_AFE_VC_HANDLE` to create and initialize the AFE. Note that :cpp:member:`voice_communication_init` must be configured as true.
#. Use :cpp:func:`feed` to input audio data. The AEC algorithm is performed first inside :cpp:func:`feed`.
#. The BSS/NS algorithms are then performed inside :cpp:func:`feed`. For a dual-mic setup, the additional MISO algorithm is also performed.
#. Use :cpp:func:`fetch` to obtain the processed single-channel audio data and related information. AGC processing is carried out here, and the specific gain depends on the configuration of the :cpp:type:`afe_config_t` structure. For a dual-mic setup, NS processing is carried out before AGC.
Input Format Definition
----------------------------
The ``input_format`` parameter specifies the arrangement of audio channels in the input data. Each character in the string represents a channel type:
.. note::
#. The :cpp:member:`wakenet_init` and :cpp:member:`voice_communication_init` members of :cpp:type:`afe_config_t` cannot both be configured to true at the same time.
#. :cpp:func:`feed` and :cpp:func:`fetch` are visible to users, while other internal AFE tasks such as BSS/NS/MISO are not visible to users.
#. The AEC algorithm is performed in :cpp:func:`feed`.
#. When :cpp:member:`aec_init` is configured to false, the BSS/NS algorithms are performed in :cpp:func:`feed`.
+-----------+---------------------+
| Character | Description |
+===========+=====================+
| ``M`` | Microphone channel |
+-----------+---------------------+
| ``R`` | Playback reference |
| | channel |
+-----------+---------------------+
| ``N`` | Unused or unknown |
| | channel |
+-----------+---------------------+
Select AFE Handle
-----------------
**Example:**
- ``"MMNR"``: Indicates four channels: two microphone channels, one unused channel, and one playback reference channel.
Espressif AFE supports both single mic and dual mic setups, and allows flexible combinations of algorithms.
**Key Points:**
- The input data must be arranged in **channel-interleaved format**.
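As a minimal sketch of what channel-interleaved layout means, the helper below packs separate per-channel buffers into the interleaved order that ``feed`` expects (for ``"MMNR"``: M0, M1, N, R, M0, M1, N, R, ...). The function name is illustrative, not part of esp-sr:

```c
#include <stdint.h>
#include <stddef.h>

/* Interleave per-channel buffers into the channel-interleaved layout.
 * chans holds one pointer per channel; out must hold
 * samples_per_ch * nch samples. */
static void interleave_channels(const int16_t **chans, size_t nch,
                                size_t samples_per_ch, int16_t *out)
{
    for (size_t i = 0; i < samples_per_ch; i++) {
        for (size_t c = 0; c < nch; c++) {
            out[i * nch + c] = chans[c][i];
        }
    }
}
```

For ``"MMNR"``, ``nch`` is 4 and the channel order is mic 0, mic 1, unused, reference.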
* Single mic
* Internal task is performed inside the NS algorithm
* Dual mic
* Internal task is performed inside the BSS algorithm
* An additional internal task is performed inside the MISO algorithm for voice communication scenario (i.e., :cpp:member:`wakenet_init` = false and :cpp:member:`voice_communication_init` = true)
Using the AFE Framework
----------------------------
To obtain the AFE Handle, use the commands below:
In ``menuconfig`` -> ``ESP Speech Recognition``, select the required AFE (Audio Front End) models, such as the WakeNet model, the VAD (Voice Activity Detection) model, and the NS (Noise Suppression) model, then call the AFE framework in your code using the following steps.
For reference, you can check the code in :project_file:`test_apps/esp-sr/main/test_afe.cpp`.
* Speech recognition
Step 1: Initialize AFE Configuration
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
::
Get the default configuration using ``afe_config_init()`` and customize parameters as needed:
esp_afe_sr_iface_t *afe_handle = &ESP_AFE_SR_HANDLE;
.. code-block:: c
* Voice communication
srmodel_list_t *models = esp_srmodel_init("model");
afe_config_t *afe_config = afe_config_init("MMNR", models, AFE_TYPE_SR, AFE_MODE_HIGH_PERF);
::
- ``input_format``: Defines the channel arrangement (e.g., ``"MMNR"``).
- ``models``: List of models (e.g., for NS, VAD, or WakeNet).
- ``afe_type``: Type of AFE (e.g., ``AFE_TYPE_SR`` for speech recognition).
- ``afe_mode``: Performance mode (e.g., ``AFE_MODE_HIGH_PERF``).
esp_afe_sr_iface_t *afe_handle = &ESP_AFE_VC_HANDLE;
Step 2: Create AFE Instance
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. _input-audio-1:
Create an AFE instance using the configuration:
Input Audio Data
----------------
.. code-block:: c
Currently, the Espressif AFE framework supports both single-mic and dual-mic setups. Users can configure the number of channels based on the input audio (:cpp:func:`esp_afe_sr_iface_op_feed_t`).
// get handle
esp_afe_sr_iface_t *afe_handle = esp_afe_handle_from_config(afe_config);
// create instance
esp_afe_sr_data_t *afe_data = afe_handle->create_from_config(afe_config);
To be specific, users can configure the :cpp:member:`pcm_config` in :cpp:func:`AFE_CONFIG_DEFAULT()`:
Step 3: Feed Audio Data
^^^^^^^^^^^^^^^^^^^^^^^^^^
* :cpp:member:`total_ch_num` : total number of channels
* :cpp:member:`mic_num` : number of mic channels
* :cpp:member:`ref_num` : number of REF channels
Input audio data to the AFE for processing. The input data must match the ``input_format``:
When configuring, note the following requirements:
.. code-block:: c
1. :cpp:member:`total_ch_num` = :cpp:member:`mic_num` + :cpp:member:`ref_num`
2. :cpp:member:`ref_num` = 0 or :cpp:member:`ref_num` = 1 (AEC currently supports at most one reference channel)
int feed_chunksize = afe_handle->get_feed_chunksize(afe_data);
int feed_nch = afe_handle->get_feed_channel_num(afe_data);
int16_t *feed_buff = (int16_t *) malloc(feed_chunksize * feed_nch * sizeof(int16_t));
afe_handle->feed(afe_data, feed_buff);
- ``feed_chunksize``: Number of samples to feed per frame.
- ``feed_nch``: Number of channels in the input data.
- ``feed_buff``: Channel-interleaved audio data (16-bit signed, 16 kHz).
The supported configurations are:
Step 4: Fetch Processed Audio
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
::
Retrieve the processed single-channel audio output:
total_ch_num=1, mic_num=1, ref_num=0
total_ch_num=2, mic_num=1, ref_num=1
total_ch_num=2, mic_num=2, ref_num=0
total_ch_num=3, mic_num=2, ref_num=1
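Whichever combination is chosen, the feed buffer size follows directly from the chunk size and the total channel count. A small sketch (the helper name is illustrative; on target, the chunk size comes from ``get_feed_chunksize``):

```c
#include <stddef.h>
#include <stdint.h>

/* Bytes needed for one feed() frame: samples per channel, times the
 * total channel count (mic_num + ref_num), times sizeof(int16_t). */
static size_t feed_buffer_bytes(int feed_chunksize, int total_ch_num)
{
    return (size_t)feed_chunksize * (size_t)total_ch_num * sizeof(int16_t);
}
```

For example, with a chunk size of 512 samples and ``total_ch_num=3`` (two mics plus one reference), one frame occupies 512 * 3 * 2 = 3072 bytes.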
.. code-block:: c
AFE Single Mic
^^^^^^^^^^^^^^
- Input audio data format: 16 kHz, 16-bit, two channels (one mic channel and one reference channel). Note that if AEC is not required, no reference data is needed; users can configure a single mic channel and set :cpp:member:`ref_num` to 0.
- The input frame length varies with the algorithm modules configured by the user. Use :cpp:func:`get_feed_chunksize` to get the number of sampling points per channel (the data type of each sampling point is ``int16``).
The input data is arranged as follows:
.. figure:: ../../_static/AFE_mode_0.png
:alt: input data of single mic
:height: 0.7in
AFE Dual Mic
^^^^^^^^^^^^
- Input audio data format: 16 kHz, 16-bit, three channels (two mic channels and one reference channel). Note that if AEC is not required, no reference data is needed; users can configure two mic channels and set :cpp:member:`ref_num` to 0.
- The input frame length varies with the algorithm modules configured by the user. Use :cpp:func:`get_feed_chunksize` to obtain the required data size (i.e., :cpp:func:`get_feed_chunksize` * :cpp:member:`total_ch_num` * sizeof(short)).
The input data is arranged as follows:
.. figure:: ../../_static/AFE_mode_other.png
:alt: input data of dual mic
:height: 0.75in
Output Audio
------------
The output audio of AFE is single-channel data.
- In the speech recognition scenario, the AFE outputs single-channel data containing the target human voice when WakeNet is enabled.
- In the voice communication scenario, the AFE outputs single-channel data with a higher signal-to-noise ratio.
Enable Wake Word Engine WakeNet
--------------------------------
When performing AFE audio front-end processing, the user can choose whether to enable wake word engine :doc:`WakeNet <../wake_word_engine/README>` to allow waking up the chip via wake words.
Users can disable WakeNet to reduce CPU resource consumption and perform other operations after wake-up, such as offline or online speech recognition. To do so, call :cpp:func:`disable_wakenet()` to enter Bypass mode.
Users can also call :cpp:func:`enable_wakenet()` to enable WakeNet later whenever needed.
.. only:: esp32
ESP32 only supports one wake word. Users cannot switch between different wake words.
.. only:: esp32s3
ESP32-S3 allows users to switch among different wake words. After the AFE is initialized, ESP32-S3 allows users to change the wake word by calling :cpp:func:`set_wakenet()`. For example, use ``set_wakenet(afe_data, "wn9_hilexin")`` to use "Hi Lexin" as the wake word. For details on how to configure more than one wake word, see Section :doc:`flash_model <../flash_model/README>`.
Enable Acoustic Echo Cancellation (AEC)
----------------------------------------
The usage of AEC is similar to that of WakeNet. Users can disable or enable AEC according to requirements.
- Disable AEC
``afe->disable_aec(afe_data);``
- Enable AEC
``afe->enable_aec(afe_data);``
.. only:: html
Programming Procedures
----------------------
Define afe_handle
^^^^^^^^^^^^^^^^^
``afe_handle`` is the function handle through which the user calls the AFE interfaces. Therefore, the first step is to obtain ``afe_handle``.
- Speech recognition
::
esp_afe_sr_iface_t *afe_handle = &ESP_AFE_SR_HANDLE;
- Voice communication
::
esp_afe_sr_iface_t *afe_handle = &ESP_AFE_VC_HANDLE;
Configure AFE
^^^^^^^^^^^^^
Get the configuration of AFE:
::
afe_config_t afe_config = AFE_CONFIG_DEFAULT();
Users can further configure the corresponding parameters in ``afe_config``:
::
#define AFE_CONFIG_DEFAULT() { \
// Configures whether or not to enable AEC
.aec_init = true, \
// Configures whether or not to enable BSS/NS
.se_init = true, \
// Configures whether or not to enable VAD (only for speech recognition)
.vad_init = true, \
// Configures whether or not to enable WakeNet
.wakenet_init = true, \
// Configures whether or not to enable voice communication (cannot be enabled when wakenet_init is also enabled)
.voice_communication_init = false, \
// Configures whether or not to enable AGC for voice communication
.voice_communication_agc_init = false, \
// Configures the AGC gain (unit: dB)
.voice_communication_agc_gain = 15, \
// Configures the VAD mode (the larger the number is, the more aggressive VAD is)
.vad_mode = VAD_MODE_3, \
// Configures the wake model. See details below.
.wakenet_model_name = NULL, \
// Configures the wake mode (this corresponds to the number of wakeup channels and should be set based on the number of mic channels)
.wakenet_mode = DET_MODE_2CH_90, \
// Configures AFE mode (SR_MODE_LOW_COST or SR_MODE_HIGH_PERF)
.afe_mode = SR_MODE_LOW_COST, \
// Configures which CPU core runs the internal BSS/NS/MISO algorithms of AFE
.afe_perferred_core = 0, \
// Configures the priority of BSS/NS/MISO algorithm tasks
.afe_perferred_priority = 5, \
// Configures the internal ringbuf size
.afe_ringbuf_size = 50, \
// Configures the memory allocation mode. See details below.
.memory_alloc_mode = AFE_MEMORY_ALLOC_MORE_PSRAM, \
// Configures the linear audio amplification level. See details below.
.agc_mode = AFE_MN_PEAK_AGC_MODE_2, \
// Configures the total number of audio channels
.pcm_config.total_ch_num = 3, \
// Configures the number of microphone channels
.pcm_config.mic_num = 2, \
// Configures the number of reference channels
.pcm_config.ref_num = 1, \
}
* :cpp:member:`wakenet_model_name` : configures the wake model. The default value in :cpp:type:`AFE_CONFIG_DEFAULT()` is NULL. Note:
* After selecting the wake model via ``idf.py menuconfig``, configure this member to the name of the selected wake model (a string) before calling :cpp:member:`create_from_config`. For more information about wake models, see Section :doc:`flash_model <../flash_model/README>` .
* :cpp:func:`esp_srmodel_filter()` can be used to obtain the model name. However, if more than one model is configured via ``idf.py menuconfig``, this function returns one of the configured models at random.
* :cpp:member:`afe_mode` : configures the AFE mode.
.. list::
:esp32s3: - :cpp:enumerator:`SR_MODE_LOW_COST` : quantized, which uses fewer resources
- :cpp:enumerator:`SR_MODE_HIGH_PERF` : unquantized, which uses more resources
For details, see :cpp:enumerator:`afe_sr_mode_t` .
* :cpp:member:`memory_alloc_mode` : configures how the memory is allocated
- :cpp:enumerator:`AFE_MEMORY_ALLOC_MORE_INTERNAL` : allocate most memory from internal RAM
- :cpp:enumerator:`AFE_MEMORY_ALLOC_INTERNAL_PSRAM_BALANCE` : balance memory allocation between internal RAM and PSRAM
- :cpp:enumerator:`AFE_MEMORY_ALLOC_MORE_PSRAM` : allocate most memory from external PSRAM
- :cpp:member:`agc_mode` : configures the peak AGC mode. Note that this parameter is only for speech recognition scenarios and is only valid when WakeNet is enabled:
- :cpp:enumerator:`AFE_MN_PEAK_AGC_MODE_1` : feed linearly amplified audio signals to MultiNet, peak is -5 dB.
- :cpp:enumerator:`AFE_MN_PEAK_AGC_MODE_2` : feed linearly amplified audio signals to MultiNet, peak is -4 dB.
- :cpp:enumerator:`AFE_MN_PEAK_AGC_MODE_3` : feed linearly amplified audio signals to MultiNet, peak is -3 dB.
- :cpp:enumerator:`AFE_MN_PEAK_NO_AGC` : feed original audio signals to MultiNet.
- :cpp:member:`pcm_config` : configures the audio signals fed through :cpp:func:`feed` :
- :cpp:member:`total_ch_num` : total number of channels
- :cpp:member:`mic_num` : number of mic channels
- :cpp:member:`ref_num` : number of REF channels
There are some limitations when configuring these parameters. For details, see Section :ref:`input-audio-1` .
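The channel-count constraints can be captured in a small check. This is a sketch mirroring the rules stated in this document (the helper name is illustrative, not an esp-sr API):

```c
#include <stdbool.h>

/* Check a pcm_config combination against the documented constraints:
 * total_ch_num must equal mic_num + ref_num, and ref_num must be
 * 0 or 1, because AEC supports at most one reference channel. */
static bool pcm_config_is_valid(int total_ch_num, int mic_num, int ref_num)
{
    if (ref_num != 0 && ref_num != 1) {
        return false;
    }
    return total_ch_num == mic_num + ref_num;
}
```

The four supported combinations listed earlier (1/1/0, 2/1/1, 2/2/0, 3/2/1) all pass this check; a setup with two reference channels does not.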
Create afe_data
^^^^^^^^^^^^^^^
The user uses the :cpp:func:`esp_afe_sr_iface_op_create_from_config_t` function to create the data handle based on the parameters configured in previous steps.
::
/**
* @brief Function to initialize an AFE_SR instance
*
* @param afe_config The config of AFE_SR
* @returns Handle to the AFE_SR data
*/
typedef esp_afe_sr_data_t* (*esp_afe_sr_iface_op_create_from_config_t)(afe_config_t *afe_config);
Feed Audio Data
^^^^^^^^^^^^^^^
After initializing AFE, users need to input audio data into AFE by :cpp:func:`feed` function for processing. The format of input audio data can be found in Section :ref:`input-audio-1` .
::
/**
* @brief Feed samples of an audio stream to the AFE_SR
*
* @Warning The input data should be arranged in the format of channel interleaving.
* The last channel is reference signal if it has reference data.
*
* @param afe The AFE_SR object to query
*
* @param in The input microphone signal, only support signed 16-bit @ 16 KHZ. The frame size can be queried by the
* `get_feed_chunksize`.
* @return The size of input
*/
typedef int (*esp_afe_sr_iface_op_feed_t)(esp_afe_sr_data_t *afe, const int16_t* in);
Get the number of audio channels
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
:cpp:func:`get_total_channel_num()` function can provide the number of channels that need to be put into :cpp:func:`feed()` function. Its return value is equal to ``pcm_config.mic_num + pcm_config.ref_num`` configured in :cpp:func:`AFE_CONFIG_DEFAULT()`.
::
/**
* @brief Get the total channel number which is configured
*
* @param afe The AFE_SR object to query
* @return The amount of total channels
*/
typedef int (*esp_afe_sr_iface_op_get_total_channel_num_t)(esp_afe_sr_data_t *afe);
Fetch Audio Data
^^^^^^^^^^^^^^^^
Users can get the processed single-channel audio and related information via the :cpp:func:`fetch` function.
The number of data sampling points returned by :cpp:func:`fetch` (the data type of each sampling point is ``int16``) can be obtained with :cpp:func:`get_fetch_chunksize`.
::
/**
* @brief Get the amount of each channel samples per frame that need to be passed to the function
*
* Every speech enhancement AFE_SR processes a certain number of samples at the same time. This function
* can be used to query that amount. Note that the returned amount is in 16-bit samples, not in bytes.
*
* @param afe The AFE_SR object to query
* @return The amount of samples to feed the fetch function
*/
typedef int (*esp_afe_sr_iface_op_get_samp_chunksize_t)(esp_afe_sr_data_t *afe);
The declaration of :cpp:func:`fetch`:
::
/**
* @brief fetch enhanced samples of an audio stream from the AFE_SR
*
* @Warning The output is single channel data, no matter how many channels the input is.
*
* @param afe The AFE_SR object to query
* @return The result of output, please refer to the definition of `afe_fetch_result_t`. (The frame size of output audio can be queried by the `get_fetch_chunksize`.)
*/
typedef afe_fetch_result_t* (*esp_afe_sr_iface_op_fetch_t)(esp_afe_sr_data_t *afe);
Its return value is a pointer to a structure, which is defined as follows:
::
/**
* @brief The result of fetch function
*/
typedef struct afe_fetch_result_t
{
int16_t *data; // the audio data
int data_size; // the size of data, in bytes
int wakeup_state; // a wakenet_state_t value
int wake_word_index; // if a wake word is detected, stores the wake word index, starting from 1
int vad_state; // an afe_vad_state_t value
int trigger_channel_id; // the channel index of the output
int wake_word_length; // the length of the wake word, in samples
int ret_value; // the return state of the fetch function
void* reserved; // reserved for future use
} afe_fetch_result_t;
Example usage::

afe_fetch_result_t *result = fetch(afe_data);
int16_t *processed_audio = result->data;
vad_state_t vad_state = result->vad_state;
wakenet_state_t wakeup_state = result->wakeup_state;
// if a VAD cache exists, prepend it to processed_audio to avoid data loss
if (result->vad_cache_size > 0) {
int16_t *vad_cache = result->vad_cache;
}
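For host-side experimentation, the result handling above can be sketched with a local copy of the struct. The type name ``afe_fetch_result_sketch_t`` and the helper are illustrative only; on target, use the ``afe_fetch_result_t`` definition from esp-sr:

```c
#include <stdint.h>
#include <stddef.h>

/* Local copy of the fetch result struct shown above, so this sketch
 * compiles on its own. */
typedef struct {
    int16_t *data;         /* the audio data */
    int data_size;         /* size of data, in bytes */
    int wakeup_state;      /* wakenet_state_t value */
    int wake_word_index;   /* wake word index, starting from 1 */
    int vad_state;         /* afe_vad_state_t value */
    int trigger_channel_id;
    int wake_word_length;  /* in samples */
    int ret_value;
    void *reserved;
} afe_fetch_result_sketch_t;

/* Number of int16 samples in the fetched frame, derived from the
 * byte size reported in the result. */
static int fetched_samples(const afe_fetch_result_sketch_t *r)
{
    return r->data_size / (int)sizeof(int16_t);
}
```

Since ``data_size`` is in bytes while the audio is 16-bit, dividing by ``sizeof(int16_t)`` converts it back to a sample count.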
Resource Occupancy
------------------
For the resource occupancy of AFE, see :doc:`Resource Occupancy <../benchmark/README>`.


@ -9,6 +9,7 @@ AFE
Resource Consumption
~~~~~~~~~~~~~~~~~~~~
.. only:: esp32
+-----------------+-----------------+-----------------+-----------------+
@ -25,76 +26,139 @@ Resource Consumption
.. only:: esp32s3
+-----------------+-----------------+-----------------+-----------------+
| Algorithm Type | RAM | Average cpu | Frame Length |
| | | loading(compute | |
| | | with 2 cores) | |
+=================+=================+=================+=================+
| AEC(LOW_COST) | 152.3 KB | 8% | 32 ms |
+-----------------+-----------------+-----------------+-----------------+
| AEC(HIGH_PERF) | 166 KB | 11% | 32 ms |
+-----------------+-----------------+-----------------+-----------------+
| BSS(LOW_COST) | 198.7 KB | 6% | 64 ms |
+-----------------+-----------------+-----------------+-----------------+
| BSS(HIGH_PERF) | 215.5 KB | 7% | 64 ms |
+-----------------+-----------------+-----------------+-----------------+
| NS | 27 KB | 5% | 10 ms |
+-----------------+-----------------+-----------------+-----------------+
| MISO | 56 KB | 8% | 16 ms |
+-----------------+-----------------+-----------------+-----------------+
| AFE Layer | 227 KB | | |
+-----------------+-----------------+-----------------+-----------------+
.. list-table:: AFE configuration and pipeline
:widths: 25 75
:header-rows: 1
* - Config
- Pipeline
* - MR, SR, LOW_COST
- ``|AEC(SR_LOW_COST)| -> |VAD(vadnet1_medium)| -> |WakeNet(wn9_hilexin,)|``
* - MR, SR, HIGH_PERF
- ``|AEC(SR_HIGH_PERF)| -> |VAD(vadnet1_medium)| -> |WakeNet(wn9_hilexin,)|``
* - MR, VC, LOW_COST
- ``|AEC(VOIP_LOW_COST)| -> |NS(nsnet2)| -> |VAD(vadnet1_medium)|``
* - MR, VC, HIGH_PERF
- ``|AEC(VOIP_HIGH_PERF)| -> |NS(nsnet2)| -> |VAD(vadnet1_medium)|``
* - MMNR, SR, LOW_COST
- ``|AEC(SR_LOW_COST)| -> |SE(BSS)| -> |VAD(vadnet1_medium)| -> |WakeNet(wn9_hilexin,)|``
* - MMNR, SR, HIGH_PERF
- ``|AEC(SR_HIGH_PERF)| -> |SE(BSS)| -> |VAD(vadnet1_medium)| -> |WakeNet(wn9_hilexin,)|``
.. note::
- **MR:** one microphone channel and one playback channel
- **MMNR:** two microphone channels and one playback channel
- **Models:** nsnet2, vadnet1_medium, wn9_hilexin
.. list-table:: ESP32-S3 AFE configuration and Performance
:widths: 25 15 15 20 20
:header-rows: 1
* - Config
- Internal RAM (KB)
- PSRAM (KB)
- Feed CPU usage (1 core,%)
- Fetch CPU usage (1 core,%)
* - MR, SR, LOW_COST
- 72.3
- 732.7
- 8.4
- 15.0
* - MR, SR, HIGH_PERF
- 78.0
- 734.7
- 9.4
- 14.9
* - MR, VC, LOW_COST
- 50.3
- 821.4
- 60.0
- 8.2
* - MR, VC, HIGH_PERF
- 93.7
- 824.0
- 64.0
- 8.2
* - MMNR, SR, LOW_COST
- 76.6
- 1173.9
- 36.6
- 30.0
* - MMNR, SR, HIGH_PERF
- 99.0
- 1173.7
- 38.8
- 30.0
+--------------+------+-----------+---------------+------------+----------------+-----------------+
| Input Format | Type | Mode | Internal RAM | PSRAM | Feed Task CPU | Fetch Task CPU |
+==============+======+===========+===============+============+================+=================+
| MR | SR | LOW_COST | 72348 | 732932 | 8.4% | 14.9% |
+--------------+------+-----------+---------------+------------+----------------+-----------------+
| MR | SR | HIGH_PERF | 78016 | 734980 | 9.4% | 14.9% |
+--------------+------+-----------+---------------+------------+----------------+-----------------+
| MR | VC | LOW_COST | 50316 | 821564 | 60.0% | 8.1% |
+--------------+------+-----------+---------------+------------+----------------+-----------------+
| MR | VC | HIGH_PERF | 93668 | 824144 | 64.0% | 8.2% |
+--------------+------+-----------+---------------+------------+----------------+-----------------+
| MMR | SR | LOW_COST | 76684 | 1175148 | 36.6% | 30.2% |
+--------------+------+-----------+---------------+------------+----------------+-----------------+
| MMR | SR | HIGH_PERF | 99064 | 1174960 | 38.8% | 30.0% |
+--------------+------+-----------+---------------+------------+----------------+-----------------+
.. only:: esp32p4
+-----------------+-----------------+-----------------+-----------------+
| Algorithm Type | RAM | Average cpu | Frame Length |
| | | loading(compute | |
| | | with 2 cores) | |
+=================+=================+=================+=================+
| AEC(LOW_COST) | 152.3 KB | 6% | 32 ms |
+-----------------+-----------------+-----------------+-----------------+
| BSS(LOW_COST) | 198.7 KB | 3% | 64 ms |
+-----------------+-----------------+-----------------+-----------------+
| NS | 27 KB | 3% | 10 ms |
+-----------------+-----------------+-----------------+-----------------+
| MISO | 56 KB | 8% | 16 ms |
+-----------------+-----------------+-----------------+-----------------+
| AFE Layer | 227 KB | | |
+-----------------+-----------------+-----------------+-----------------+
.. list-table:: AFE configuration and pipeline
:widths: 25 75
:header-rows: 1
* - Config
- Pipeline
* - MR, SR, LOW_COST
- ``|AEC(SR_LOW_COST)| -> |VAD(vadnet1_medium)| -> |WakeNet(wn9_hilexin,)|``
* - MR, SR, HIGH_PERF
- ``|AEC(SR_HIGH_PERF)| -> |VAD(vadnet1_medium)| -> |WakeNet(wn9_hilexin,)|``
* - MR, VC, LOW_COST
- ``|AEC(VOIP_LOW_COST)| -> |NS(nsnet2)| -> |VAD(vadnet1_medium)|``
* - MR, VC, HIGH_PERF
- ``|AEC(VOIP_HIGH_PERF)| -> |NS(nsnet2)| -> |VAD(vadnet1_medium)|``
* - MMNR, SR, LOW_COST
- ``|AEC(SR_LOW_COST)| -> |SE(BSS)| -> |VAD(vadnet1_medium)| -> |WakeNet(wn9_hilexin,)|``
* - MMNR, SR, HIGH_PERF
- ``|AEC(SR_HIGH_PERF)| -> |SE(BSS)| -> |VAD(vadnet1_medium)| -> |WakeNet(wn9_hilexin,)|``
+--------------+------+-----------+---------------+------------+-----------------+-----------------+
| Input Format | Type | Mode | Internal RAM | PSRAM | Feed Task CPU | Fetch Task CPU |
+==============+======+===========+===============+============+=================+=================+
| MR | SR | LOW_COST | 75404 | 751292 | 10.6% | 11.3% |
+--------------+------+-----------+---------------+------------+-----------------+-----------------+
| MR | SR | HIGH_PERF | 75128 | 751292 | 10.6% | 11.3% |
+--------------+------+-----------+---------------+------------+-----------------+-----------------+
| MR | VC | LOW_COST | 76192 | 841300 | 40.3% | 5.7% |
+--------------+------+-----------+---------------+------------+-----------------+-----------------+
| MR | VC | HIGH_PERF | 119536 | 843880 | 42.6% | 5.7% |
+--------------+------+-----------+---------------+------------+-----------------+-----------------+
| MMR | SR | LOW_COST | 79940 | 1202692 | 28.4% | 24.9% |
+--------------+------+-----------+---------------+------------+-----------------+-----------------+
| MMR | SR | HIGH_PERF | 79940 | 1202692 | 28.4% | 24.9% |
+--------------+------+-----------+---------------+------------+-----------------+-----------------+
.. note::
- **MR:** one microphone channel and one playback channel
- **MMNR:** two microphone channels and one playback channel
- **Models:** nsnet2, vadnet1_medium, wn9_hilexin
.. list-table:: AFE configuration and Performance
:widths: 25 15 15 20 20
:header-rows: 1
* - Config
- Internal RAM (KB)
- PSRAM (KB)
- Feed CPU usage (1 core,%)
- Fetch CPU usage (1 core,%)
* - MR, SR, LOW_COST
- 73.6
- 733.2
- 10.6
- 11.2
* - MR, SR, HIGH_PERF
- 73.3
- 733.2
- 10.6
- 11.2
* - MR, VC, LOW_COST
- 74.4
- 821.3
- 40.2
- 5.7
* - MR, VC, HIGH_PERF
- 116.7
- 823.9
- 42.4
- 5.7
* - MMNR, SR, LOW_COST
- 78.0
- 1173.0
- 28.2
- 24.8
* - MMNR, SR, HIGH_PERF
- 78.0
- 1173.0
- 28.2
- 24.8
WakeNet
-------


@ -29,390 +29,112 @@ AFE Acoustic Front-End Algorithm Framework
* - WakeNet
- A neural-network-based wake word model designed for low-power embedded MCUs
Use Scenarios
-------------
This section introduces two typical use scenarios of the Espressif AFE framework.
Speech Recognition Scenario
^^^^^^^^^^^^^^^^^^^^^^^^^^^
Workflow
""""""""
.. figure:: ../../_static/AFE_SR_overview.png
:alt: overview
Data Flow
"""""""""
.. figure:: ../../_static/AFE_SR_workflow.png
:alt: overview
#. Use :cpp:func:`ESP_AFE_SR_HANDLE` to create and initialize the AFE. Note that :cpp:member:`voice_communication_init` must be configured as false.
#. Use :cpp:func:`feed` to input audio data. The AEC algorithm is performed first inside :cpp:func:`feed`.
#. The BSS/NS algorithms are then performed inside :cpp:func:`feed`.
#. Use :cpp:func:`fetch` to obtain the processed single-channel audio data and related information. VAD processing and wake word detection can be performed inside :cpp:func:`fetch`, configurable via the :cpp:type:`afe_config_t` structure.
Voice Communication Scenario
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Workflow
""""""""
.. figure:: ../../_static/AFE_VOIP_overview.png
:alt: overview
Data Flow
"""""""""
.. figure:: ../../_static/AFE_VOIP_workflow.png
:alt: overview
#. Use :cpp:func:`ESP_AFE_VC_HANDLE` to create and initialize the AFE. Note that :cpp:member:`voice_communication_init` must be configured as true.
#. Use :cpp:func:`feed` to input audio data. The AEC algorithm is performed first inside :cpp:func:`feed`.
#. The BSS/NS algorithms are then performed inside :cpp:func:`feed`. For a dual-mic setup, the additional MISO algorithm is also performed.
#. Use :cpp:func:`fetch` to obtain the processed single-channel audio data and related information. AGC amplification is applied to the output, with the specific gain configurable via the :cpp:type:`afe_config_t` structure. Note that for a dual-mic setup, NS processing is performed before AGC.
.. note::
#. The :cpp:member:`wakenet_init` and :cpp:member:`voice_communication_init` members of :cpp:type:`afe_config_t` cannot both be configured to true at the same time.
#. :cpp:func:`feed` and :cpp:func:`fetch` are visible to users, while the internal BSS/NS/MISO processing runs as independent internal AFE tasks that are not visible to users.
#. The AEC algorithm is performed in :cpp:func:`feed`.
#. When :cpp:member:`aec_init` is configured to false, the BSS/NS algorithms are performed in :cpp:func:`feed`.
Select AFE Handle
-----------------
Currently, the Espressif AFE framework supports single-mic and dual-mic setups and allows flexible configuration of the algorithm modules.
This section introduces two typical application scenarios of the Espressif AFE framework.
* Single-mic setup:
* The internal task is handled by the NS algorithm module
* Dual-mic setup:
* The internal task is handled by the BSS algorithm module
* In addition, for the voice communication scenario (i.e., :cpp:member:`wakenet_init` = false and :cpp:member:`voice_communication_init` = true), an additional internal task is handled by the MISO algorithm.
Speech Recognition
^^^^^^^^^^^^^^^^^^
Use the following to obtain the AFE handle:
.. figure:: ../../_static/AFE_SR_overview.png
:alt: overview
* Speech recognition scenario
Voice Communication
^^^^^^^^^^^^^^^^^^^
::
.. figure:: ../../_static/AFE_VOIP_overview.png
:alt: overview
esp_afe_sr_iface_t *afe_handle = &ESP_AFE_SR_HANDLE;
* Voice communication scenario
Input Format Definition
----------------------------
::
The ``input_format`` parameter defines the arrangement of audio channels in the input data. Each character in the string represents a channel type:
esp_afe_sr_iface_t *afe_handle = &ESP_AFE_VC_HANDLE;
+-----------+---------------------+
| Character | Description         |
+===========+=====================+
| ``M``     | Microphone channel  |
+-----------+---------------------+
| ``R``     | Playback reference  |
|           | channel             |
+-----------+---------------------+
| ``N``     | Unused or unknown   |
|           | channel             |
+-----------+---------------------+
.. _input-audio-1:
**Example:**
- ``"MMNR"``: Indicates four channels: two microphone channels, one unused channel, and one playback reference channel.
Input Audio
-----------
**Key Points:**
- The input data must be arranged in **channel-interleaved format**.
Currently, the Espressif AFE framework supports single-mic and dual-mic setups. The number of audio channels can be configured based on the input audio of :cpp:func:`esp_afe_sr_iface_op_feed_t`.
Using the AFE Framework
----------------------------
In ``menuconfig`` -> ``ESP Speech Recognition``, select the required AFE models, such as the WakeNet model, the VAD model, and the NS model, then call the AFE framework in your code using the following steps.
For reference, see the code in :project_file:`test_apps/esp-sr/main/test_afe.cpp`.
Specifically:
Configure the :cpp:member:`pcm_config` structure members in :cpp:func:`AFE_CONFIG_DEFAULT()`:
Step 1: Initialize the AFE Configuration
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
* :cpp:member:`total_ch_num` : total number of channels
* :cpp:member:`mic_num` : number of mic channels
* :cpp:member:`ref_num` : number of reference channels
Get the default configuration using ``afe_config_init()`` and adjust the parameters as needed:
When configuring, note the following requirements:
.. code-block:: c
srmodel_list_t *models = esp_srmodel_init("model");
afe_config_t *afe_config = afe_config_init("MMNR", models, AFE_TYPE_SR, AFE_MODE_HIGH_PERF);
1. :cpp:member:`total_ch_num` = :cpp:member:`mic_num` + :cpp:member:`ref_num`
2. :cpp:member:`ref_num` = 0 or :cpp:member:`ref_num` = 1 (AEC currently supports only one reference channel)
- ``input_format``: Defines the channel arrangement (e.g., ``"MMNR"``).
- ``models``: List of models (e.g., NS, VAD, or WakeNet models).
- ``afe_type``: Type of AFE (e.g., ``AFE_TYPE_SR`` for the speech recognition scenario).
- ``afe_mode``: Performance mode (e.g., ``AFE_MODE_HIGH_PERF`` for high performance).
Under the above requirements, the supported configuration combinations are:
步骤2创建AFE实例
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
::
通过配置创建AFE实例
total_ch_num=1, mic_num=1, ref_num=0
total_ch_num=2, mic_num=1, ref_num=1
total_ch_num=2, mic_num=2, ref_num=0
total_ch_num=3, mic_num=2, ref_num=1
.. code-block:: c
AFE 单麦配置
^^^^^^^^^^^^
* 输入音频的 **格式** 为 16 KHz、16 bit、双通道其中 1 个通道为 mic 数据,另 1 个通道为参考回路)。注意,若不需要 AEC 功能,则可只包含 1 个通道输入 mic 数据,而无需配置参考回路(即可配置 :cpp:member:`ref_num` = 0
* 根据用户配置的算法模块不同,输入音频的 **帧长** 将有所差异,具体可通过 :cpp:func:`get_feed_chunksize` 来获取需要的采样点数目(采样点数据类型为 ``int16``)。
// 获取句柄
esp_afe_sr_iface_t *afe_handle = esp_afe_handle_from_config(afe_config);
// 创建实例
esp_afe_sr_data_t *afe_data = afe_handle->create_from_config(afe_config);
数据排布示意如下:
步骤3输入音频数据
^^^^^^^^^^^^^^^^^^^^^^^^^^
.. figure:: ../../_static/AFE_mode_0.png
:alt: input data of single mic
:height: 1.2in
将音频数据输入AFE进行处理。输入数据格式需与 ``input_format`` 匹配:
AFE 双麦配置
^^^^^^^^^^^^
* 输入音频格式为 16 KHz、16 bit、三通道其中 2 个通道为 mic 数据,另 1 个通道为参考回路)。注意,若不需要 AEC 功能,则可只包含 2 个通道 mic 数据,而无需配置参考回路(即可配置 :cpp:member:`ref_num` = 0
* 根据用户配置的算法模块不同,输入音频的 **帧长** 将有所差异,具体可通过 :cpp:func:`get_feed_chunksize` 来获取需要填充的数据量(即 :cpp:func:`get_feed_chunksize` * :cpp:member:`total_ch_num` * sizeof(short))。
.. code-block:: c
数据排布示意如下:
int feed_chunksize = afe_handle->get_feed_chunksize(afe_data);
int feed_nch = afe_handle->get_feed_channel_num(afe_data);
int16_t *feed_buff = (int16_t *) malloc(feed_chunksize * feed_nch * sizeof(int16_t));
afe_handle->feed(afe_data, feed_buff);
.. figure:: ../../_static/AFE_mode_other.png
:alt: input data of dual mic
:height: 0.75in
- **``feed_chunksize``**:每帧输入的样本数。
- **``feed_nch``**:输入数据的通道数。
- **``feed_buff``**通道交错的音频数据16位有符号16 kHz
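The channel-interleaved layout that ``feed_buff`` expects can be sketched with a small helper. This is illustrative only, not part of the esp-sr API; it assumes one separate ``int16_t`` buffer per channel captured from the codec:

```c
#include <stdint.h>

// Interleave per-channel buffers into one feed frame.
// chunksize: samples per channel; nch: number of channels.
// out must hold chunksize * nch samples; chans[c] holds chunksize samples of channel c.
static void interleave_channels(int16_t *out, const int16_t *const *chans,
                                int chunksize, int nch)
{
    for (int i = 0; i < chunksize; i++) {
        for (int c = 0; c < nch; c++) {
            out[i * nch + c] = chans[c][i];
        }
    }
}
```

For a two-mic plus reference setup, each output frame position then carries the samples in channel order (mic0, mic1, reference, ...), which is the "channel interleaving" format the ``feed`` warning refers to.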
Step 4: Fetch the processing results
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The output audio of the AFE is single-channel data:

* Speech recognition scenario: with WakeNet enabled, the output is single-channel data containing the target speech.
* Voice communication scenario: the output is single-channel data with a higher signal-to-noise ratio.

Fetch the processed single-channel audio and the related status information:

.. code-block:: c

    afe_fetch_result_t *result = fetch(afe_data);
    int16_t *processed_audio = result->data;
    vad_state_t vad_state = result->vad_state;
    wakenet_state_t wakeup_state = result->wakeup_state;

    // If a VAD cache exists, prepend it to processed_audio to avoid losing data
    if (result->vad_cache_size > 0) {
        int16_t *vad_cache = result->vad_cache;
    }

Enabling wake word detection (WakeNet)
--------------------------------------

During AFE audio front-end processing, you can choose whether to enable :doc:`WakeNet <../wake_word_engine/README>` for wake word detection.

When other operations are needed after wake-up, such as offline or online speech recognition, WakeNet can be paused to reduce the CPU load: call :cpp:func:`disable_wakenet()` to enter bypass mode. Once the follow-up task has finished, call :cpp:func:`enable_wakenet()` to enable WakeNet again.

.. only:: esp32

    The ESP32 chip supports only one wake word and does not support wake word switching.

.. only:: esp32s3

    The ESP32-S3 chip supports wake word switching. After AFE initialization, the wake word can be switched with :cpp:func:`set_wakenet()`. For example, ``set_wakenet(afe_data, "wn9_hilexin")`` switches to the "Hi Lexin" wake word. For details on how to configure multiple wake words, see :doc:`Model loading <../flash_model/README>`.

Enabling acoustic echo cancellation (AEC)
-----------------------------------------

AEC is used in the same way as WakeNet; it can be stopped or restarted as needed:

- Stop AEC: ``afe->disable_aec(afe_data);``
- Start AEC: ``afe->enable_aec(afe_data);``
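The pause/resume pattern described above for WakeNet (and likewise for AEC) amounts to a simple state toggle around the wake event. The handlers below are illustrative stand-ins for the points where an application would call ``disable_wakenet()`` and ``enable_wakenet()``; they are not part of the esp-sr API:

```c
#include <stdbool.h>

// Tracks whether WakeNet is running while a post-wake task
// (e.g. offline or online speech recognition) is in progress.
typedef struct {
    bool wakenet_enabled;
} afe_state_t;

// On wake word: pause WakeNet (enter bypass mode) to free CPU for recognition.
static void on_wake_word(afe_state_t *s)
{
    s->wakenet_enabled = false;
}

// When the follow-up task finishes: re-enable WakeNet.
static void on_task_done(afe_state_t *s)
{
    s->wakenet_enabled = true;
}
```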
Programming guide
-----------------
Define the afe_handle
^^^^^^^^^^^^^^^^^^^^^

First set up the ``afe_handle`` function handle; all subsequent AFE interfaces are called through it:

- Speech recognition

  ::

      esp_afe_sr_iface_t *afe_handle = &ESP_AFE_SR_HANDLE;

- Voice communication

  ::

      esp_afe_sr_iface_t *afe_handle = &ESP_AFE_VC_HANDLE;
Configure the AFE
^^^^^^^^^^^^^^^^^

::

    afe_config_t afe_config = AFE_CONFIG_DEFAULT();

Each algorithm module in ``afe_config`` can be enabled and tuned individually:

::

    #define AFE_CONFIG_DEFAULT() { \
        .aec_init = true,                       /* enable AEC */ \
        .se_init = true,                        /* enable BSS/NS */ \
        .vad_init = true,                       /* enable VAD (speech recognition scenario only) */ \
        .wakenet_init = true,                   /* enable wake word detection */ \
        .voice_communication_init = false,      /* enable voice communication (cannot be enabled together with wakenet_init) */ \
        .voice_communication_agc_init = false,  /* enable AGC in voice communication */ \
        .voice_communication_agc_gain = 15,     /* AGC gain, in dB */ \
        .vad_mode = VAD_MODE_3,                 /* VAD operating mode; larger is more aggressive */ \
        .wakenet_model_name = NULL,             /* wake model, see the description below */ \
        .wakenet_mode = DET_MODE_2CH_90,        /* detection mode, chosen by the number of mic channels */ \
        .afe_mode = SR_MODE_LOW_COST,           /* AFE operating mode: SR_MODE_LOW_COST or SR_MODE_HIGH_PERF */ \
        .afe_perferred_core = 0,                /* CPU core running the internal BSS/NS/MISO algorithms */ \
        .afe_perferred_priority = 5,            /* task priority of the internal BSS/NS/MISO algorithms */ \
        .afe_ringbuf_size = 50,                 /* internal ringbuf size */ \
        .memory_alloc_mode = AFE_MEMORY_ALLOC_MORE_PSRAM, /* memory allocation mode, see the description below */ \
        .agc_mode = AFE_MN_PEAK_AGC_MODE_2,     /* linear audio amplification level, see the description below */ \
        .pcm_config.total_ch_num = 3,           /* total number of audio channels */ \
        .pcm_config.mic_num = 2,                /* number of microphone channels */ \
        .pcm_config.ref_num = 1,                /* number of reference channels */ \
        .afe_ns_mode = NS_MODE_SSP,             /* NS mode: NS_MODE_SSP is a signal processing algorithm, NS_MODE_NET is a network-based algorithm */ \
        .afe_ns_model_name = "nsnet1",          /* NS model name, "nsnet1" (default) or "nsnet2" */ \
    }
* :cpp:member:`wakenet_model_name`: the wake model. Defaults to NULL in :cpp:type:`AFE_CONFIG_DEFAULT()`. Note:

  * After selecting a wake model via ``idf.py menuconfig``, set this field to the specific model name (as a string) before calling :cpp:member:`create_from_config`. For details on wake models, see :doc:`Model loading <../flash_model/README>`.
  * :cpp:func:`esp_srmodel_filter()` can be used to obtain the model name. However, if multiple models were selected to coexist in ``idf.py menuconfig``, this function returns one of them at random.

* :cpp:member:`afe_mode`: the AFE operating mode:

  .. list::

      :esp32s3: - :cpp:enumerator:`SR_MODE_LOW_COST`: quantized version, uses fewer resources.
      - :cpp:enumerator:`SR_MODE_HIGH_PERF`: non-quantized version, uses more resources.

  See :cpp:enumerator:`afe_sr_mode_t` for details.

* :cpp:member:`memory_alloc_mode`: the memory allocation mode:

  - :cpp:enumerator:`AFE_MEMORY_ALLOC_MORE_INTERNAL`: allocate more from internal RAM
  - :cpp:enumerator:`AFE_MEMORY_ALLOC_INTERNAL_PSRAM_BALANCE`: allocate partly from internal RAM
  - :cpp:enumerator:`AFE_MEMORY_ALLOC_MORE_PSRAM`: allocate more from external PSRAM

* :cpp:member:`agc_mode`: the linear amplification level of the output audio. Note that this applies only to the speech recognition scenario, and only takes effect when wake word detection is enabled. Four values are available:

  - :cpp:enumerator:`AFE_MN_PEAK_AGC_MODE_1`: linearly amplify the audio fed to MultiNet, peaking at -5 dB.
  - :cpp:enumerator:`AFE_MN_PEAK_AGC_MODE_2`: linearly amplify the audio fed to MultiNet, peaking at -4 dB.
  - :cpp:enumerator:`AFE_MN_PEAK_AGC_MODE_3`: linearly amplify the audio fed to MultiNet, peaking at -3 dB.
  - :cpp:enumerator:`AFE_MN_PEAK_NO_AGC`: no linear amplification.

* :cpp:member:`pcm_config`: configured according to the audio fed via :cpp:func:`feed`. Three members must be set:

  - :cpp:member:`total_ch_num`: total number of audio channels
  - :cpp:member:`mic_num`: number of microphone channels
  - :cpp:member:`ref_num`: number of reference channels

  There are constraints on these values; see :ref:`input-audio-1`.
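The channel constraints on :cpp:member:`pcm_config` can be expressed as a small validation helper. This is an illustrative sketch, not part of the esp-sr API; it encodes the rules that the total equals mic plus reference channels, the reference count is 0 or 1, and the supported combinations use one or two microphones:

```c
#include <stdbool.h>

// Check the channel constraints: total_ch_num == mic_num + ref_num,
// ref_num must be 0 or 1 (AEC supports only a single reference channel),
// and mic_num must be 1 or 2.
static bool pcm_config_is_valid(int total_ch_num, int mic_num, int ref_num)
{
    if (ref_num != 0 && ref_num != 1) {
        return false;
    }
    if (mic_num != 1 && mic_num != 2) {
        return false;
    }
    return total_ch_num == mic_num + ref_num;
}
```

For example, the default configuration (``total_ch_num = 3``, ``mic_num = 2``, ``ref_num = 1``) passes this check, while ``total_ch_num = 3``, ``mic_num = 2``, ``ref_num = 0`` does not.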
Create afe_data
^^^^^^^^^^^^^^^

Create the AFE data handle from the configuration completed in the previous step, using :cpp:func:`esp_afe_sr_iface_op_create_from_config_t`:

::

    /**
     * @brief Function to initialize an AFE_SR instance
     *
     * @param afe_config The config of AFE_SR
     * @returns Handle to the AFE_SR data
     */
    typedef esp_afe_sr_data_t* (*esp_afe_sr_iface_op_create_from_config_t)(afe_config_t *afe_config);
Feed audio data
^^^^^^^^^^^^^^^

After the AFE is initialized, use :cpp:func:`feed` to input audio data into the AFE module for processing. For the input audio format, see :ref:`input-audio-1`.

::

    /**
     * @brief Feed samples of an audio stream to the AFE_SR
     *
     * @warning The input data should be arranged in channel-interleaved format.
     *          The last channel is the reference signal, if reference data is present.
     *
     * @param afe The AFE_SR object to query
     *
     * @param in The input microphone signal, only signed 16-bit @ 16 kHz is supported. The frame size can be queried
     *           with `get_feed_chunksize`.
     * @return The size of input
     */
    typedef int (*esp_afe_sr_iface_op_feed_t)(esp_afe_sr_data_t *afe, const int16_t* in);
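The size of the buffer passed to :cpp:func:`feed` follows directly from the chunk size and channel count; a minimal sketch of the arithmetic (the helper itself is illustrative, not an esp-sr function):

```c
#include <stddef.h>
#include <stdint.h>

// Bytes needed for one feed() frame:
// samples-per-channel * number-of-channels * bytes-per-sample (int16_t).
static size_t feed_buffer_bytes(int feed_chunksize, int total_ch_num)
{
    return (size_t)feed_chunksize * (size_t)total_ch_num * sizeof(int16_t);
}
```

For instance, with a chunk size of 512 samples and 3 channels (two mics plus one reference), each frame occupies 512 * 3 * 2 = 3072 bytes.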
Get the number of audio channels
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Use :cpp:func:`get_total_channel_num()` to get the total number of channels that must be passed to :cpp:func:`feed()`. Its return value equals ``pcm_config.mic_num + pcm_config.ref_num`` as configured in :cpp:func:`AFE_CONFIG_DEFAULT()`.

::

    /**
     * @brief Get the total channel number that has been configured
     *
     * @param afe The AFE_SR object to query
     * @return The amount of total channels
     */
    typedef int (*esp_afe_sr_iface_op_get_total_channel_num_t)(esp_afe_sr_data_t *afe);
Fetch audio data
^^^^^^^^^^^^^^^^

Call :cpp:func:`fetch` to obtain the processed single-channel audio data and related information.

The number of samples returned by :cpp:func:`fetch` (sample type ``int16``) can be queried with :cpp:func:`get_fetch_chunksize`.

::

    /**
     * @brief Get the amount of each channel samples per frame that need to be passed to the function
     *
     * Every speech enhancement AFE_SR processes a certain number of samples at the same time. This function
     * can be used to query that amount. Note that the returned amount is in 16-bit samples, not in bytes.
     *
     * @param afe The AFE_SR object to query
     * @return The amount of samples to feed the fetch function
     */
    typedef int (*esp_afe_sr_iface_op_get_samp_chunksize_t)(esp_afe_sr_data_t *afe);

The declaration of :cpp:func:`fetch` is:

::

    /**
     * @brief Fetch enhanced samples of an audio stream from the AFE_SR
     *
     * @warning The output is single-channel data, no matter how many channels the input has.
     *
     * @param afe The AFE_SR object to query
     * @return The result of output, please refer to the definition of `afe_fetch_result_t`. (The frame size of the output audio can be queried with `get_fetch_chunksize`.)
     */
    typedef afe_fetch_result_t* (*esp_afe_sr_iface_op_fetch_t)(esp_afe_sr_data_t *afe);

The return value is a pointer to the following structure:

::

    /**
     * @brief The result of the fetch function
     */
    typedef struct afe_fetch_result_t
    {
        int16_t *data;              // the audio data
        int data_size;              // the size of data, in bytes
        int wakeup_state;           // the value is wakenet_state_t
        int wake_word_index;        // if a wake word is detected, stores the wake word index, starting from 1
        int vad_state;              // the value is afe_vad_state_t
        int trigger_channel_id;     // the channel index of the output
        int wake_word_length;       // the length of the wake word, in number of samples
        int ret_value;              // the return state of the fetch function
        void* reserved;             // reserved for future use
    } afe_fetch_result_t;
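Two unit conversions are useful when consuming ``afe_fetch_result_t``: ``data_size`` is in bytes (2 bytes per ``int16_t`` sample), while ``wake_word_length`` is in samples at the 16 kHz input rate (16 samples per millisecond). A sketch of the arithmetic, as illustrative helpers rather than esp-sr functions:

```c
#include <stdint.h>

// At 16 kHz, 16 samples correspond to 1 ms of audio.
static int samples_to_ms(int samples)
{
    return samples / 16;
}

// data_size is in bytes; each int16_t sample occupies 2 bytes.
static int data_size_to_samples(int data_size_bytes)
{
    return data_size_bytes / (int)sizeof(int16_t);
}
```

So a ``wake_word_length`` of 16000 samples corresponds to a one-second wake word, and a ``data_size`` of 1024 bytes holds 512 output samples.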
Resource consumption
--------------------

For the resource consumption of the AFE, see :doc:`Resource consumption <../benchmark/README>`.
.. only:: esp32s3
    +-----------------+-----------------+-----------------+-----------------+
    | Algorithm Type  | RAM             | Average CPU     | Frame Length    |
    |                 |                 | loading (with   |                 |
    |                 |                 | 2 cores)        |                 |
    +=================+=================+=================+=================+
    | AEC(LOW_COST)   | 152.3 KB        | 8%              | 32 ms           |
    +-----------------+-----------------+-----------------+-----------------+
    | AEC(HIGH_PERF)  | 166 KB          | 11%             | 32 ms           |
    +-----------------+-----------------+-----------------+-----------------+
    | BSS(LOW_COST)   | 198.7 KB        | 6%              | 64 ms           |
    +-----------------+-----------------+-----------------+-----------------+
    | BSS(HIGH_PERF)  | 215.5 KB        | 7%              | 64 ms           |
    +-----------------+-----------------+-----------------+-----------------+
    | NS(NS_MODE_SSP) | 27 KB           | 5%              | 10 ms           |
    +-----------------+-----------------+-----------------+-----------------+
    | NS(nsnet1)      | 885 KB          | 25%             | 16 ms           |
    +-----------------+-----------------+-----------------+-----------------+
    | NS(nsnet2)      | 375 KB          | 12%             | 32 ms           |
    +-----------------+-----------------+-----------------+-----------------+
    | MISO            | 56 KB           | 8%              | 16 ms           |
    +-----------------+-----------------+-----------------+-----------------+
    | AFE Layer       | 227 KB          |                 |                 |
    +-----------------+-----------------+-----------------+-----------------+

    .. list-table:: AFE configurations and algorithm pipelines
       :widths: 25 75
       :header-rows: 1

       * - Config
         - Pipeline
       * - MR, SR, LOW_COST
         - ``|AEC(SR_LOW_COST)| -> |VAD(vadnet1_medium)| -> |WakeNet(wn9_hilexin,)|``
       * - MR, SR, HIGH_PERF
         - ``|AEC(SR_HIGH_PERF)| -> |VAD(vadnet1_medium)| -> |WakeNet(wn9_hilexin,)|``
       * - MR, VC, LOW_COST
         - ``|AEC(VOIP_LOW_COST)| -> |NS(nsnet2)| -> |VAD(vadnet1_medium)|``
       * - MR, VC, HIGH_PERF
         - ``|AEC(VOIP_HIGH_PERF)| -> |NS(nsnet2)| -> |VAD(vadnet1_medium)|``
       * - MMNR, SR, LOW_COST
         - ``|AEC(SR_LOW_COST)| -> |SE(BSS)| -> |VAD(vadnet1_medium)| -> |WakeNet(wn9_hilexin,)|``
       * - MMNR, SR, HIGH_PERF
         - ``|AEC(SR_HIGH_PERF)| -> |SE(BSS)| -> |VAD(vadnet1_medium)| -> |WakeNet(wn9_hilexin,)|``

    .. note::

        - **MR:** one microphone channel and one playback channel
        - **MMNR:** two microphone channels and one playback channel
        - **Models:** nsnet2, vadnet1_medium, wn9_hilexin

    .. list-table:: AFE configurations and performance
       :widths: 25 15 15 20 20
       :header-rows: 1

       * - Config
         - Internal RAM (KB)
         - PSRAM (KB)
         - Feed CPU usage (1 core, %)
         - Fetch CPU usage (1 core, %)
       * - MR, SR, LOW_COST
         - 72.3
         - 732.7
         - 8.4
         - 15.0
       * - MR, SR, HIGH_PERF
         - 78.0
         - 734.7
         - 9.4
         - 14.9
       * - MR, VC, LOW_COST
         - 50.3
         - 821.4
         - 60.0
         - 8.2
       * - MR, VC, HIGH_PERF
         - 93.7
         - 824.0
         - 64.0
         - 8.2
       * - MMNR, SR, LOW_COST
         - 76.6
         - 1173.9
         - 36.6
         - 30.0
       * - MMNR, SR, HIGH_PERF
         - 99.0
         - 1173.7
         - 38.8
         - 30.0
.. only:: esp32p4
    +-----------------+-----------------+-----------------+-----------------+
    | Algorithm Type  | RAM             | Average CPU     | Frame Length    |
    |                 |                 | loading (with   |                 |
    |                 |                 | 2 cores)        |                 |
    +=================+=================+=================+=================+
    | AEC(LOW_COST)   | 152.3 KB        | 6%              | 32 ms           |
    +-----------------+-----------------+-----------------+-----------------+
    | BSS(LOW_COST)   | 198.7 KB        | 3%              | 64 ms           |
    +-----------------+-----------------+-----------------+-----------------+
    | NS              | 27 KB           | 3%              | 10 ms           |
    +-----------------+-----------------+-----------------+-----------------+
    | MISO            | 56 KB           | 8%              | 16 ms           |
    +-----------------+-----------------+-----------------+-----------------+
    | AFE Layer       | 227 KB          |                 |                 |
    +-----------------+-----------------+-----------------+-----------------+

    .. list-table:: AFE configurations and algorithm pipelines
       :widths: 25 75
       :header-rows: 1

       * - Config
         - Pipeline
       * - MR, SR, LOW_COST
         - ``|AEC(SR_LOW_COST)| -> |VAD(vadnet1_medium)| -> |WakeNet(wn9_hilexin,)|``
       * - MR, SR, HIGH_PERF
         - ``|AEC(SR_HIGH_PERF)| -> |VAD(vadnet1_medium)| -> |WakeNet(wn9_hilexin,)|``
       * - MR, VC, LOW_COST
         - ``|AEC(VOIP_LOW_COST)| -> |NS(nsnet2)| -> |VAD(vadnet1_medium)|``
       * - MR, VC, HIGH_PERF
         - ``|AEC(VOIP_HIGH_PERF)| -> |NS(nsnet2)| -> |VAD(vadnet1_medium)|``
       * - MMNR, SR, LOW_COST
         - ``|AEC(SR_LOW_COST)| -> |SE(BSS)| -> |VAD(vadnet1_medium)| -> |WakeNet(wn9_hilexin,)|``
       * - MMNR, SR, HIGH_PERF
         - ``|AEC(SR_HIGH_PERF)| -> |SE(BSS)| -> |VAD(vadnet1_medium)| -> |WakeNet(wn9_hilexin,)|``

    .. note::

        - **MR:** one microphone channel and one playback channel
        - **MMNR:** two microphone channels and one playback channel
        - **Models:** nsnet2, vadnet1_medium, wn9_hilexin

    .. list-table:: AFE configurations and performance
       :widths: 25 15 15 20 20
       :header-rows: 1

       * - Config
         - Internal RAM (KB)
         - PSRAM (KB)
         - Feed CPU usage (1 core, %)
         - Fetch CPU usage (1 core, %)
       * - MR, SR, LOW_COST
         - 73.6
         - 733.2
         - 10.6
         - 11.2
       * - MR, SR, HIGH_PERF
         - 73.3
         - 733.2
         - 10.6
         - 11.2
       * - MR, VC, LOW_COST
         - 74.4
         - 821.3
         - 40.2
         - 5.7
       * - MR, VC, HIGH_PERF
         - 116.7
         - 823.9
         - 42.4
         - 5.7
       * - MMNR, SR, LOW_COST
         - 78.0
         - 1173.0
         - 28.2
         - 24.8
       * - MMNR, SR, HIGH_PERF
         - 78.0
         - 1173.0
         - 28.2
         - 24.8
WakeNet
-------
@@ -1,4 +1,4 @@
-version: "2.0.0~1-rc.1"
+version: "2.0.0~1-rc.2"
 description: esp_sr provides basic algorithms for Speech Recognition applications
 url: https://github.com/espressif/esp-sr
 dependencies:
@@ -110,6 +110,8 @@ typedef struct {
     int vad_min_speech_ms;          // The minimum duration of speech in ms. It should be bigger than 32 ms, default: 128 ms
     int vad_min_noise_ms;           // The minimum duration of noise or silence in ms. It should be bigger than 64 ms, default:
                                     // 1000 ms
+    int vad_delay_ms;               // The delay of the first speech frame in ms, default: 128 ms
+                                    // If you find vad cache can not cover all speech, please increase this value.
     bool vad_mute_playback;         // If true, the playback will be muted for vad detection. default: false
     bool vad_enable_channel_trigger; // If true, the vad will be used to choose the channel id. default: false
@@ -141,12 +141,12 @@ typedef int (*esp_afe_sr_iface_op_reset_buffer_t)(esp_afe_sr_data_t *afe);
 typedef int (*esp_afe_sr_iface_op_set_wakenet_t)(esp_afe_sr_data_t *afe, char *model_name);
 
 /**
- * @brief Enable VAD algorithm.
+ * @brief Reset one function/module/algorithm.
  *
  * @param afe The AFE_SR object to query
- * @return -1: fail, 0: disabled, 1: enabled
+ * @return -1: fail, 1: success
  */
-typedef int (*esp_afe_sr_iface_op_enable_vad_t)(esp_afe_sr_data_t *afe);
+typedef int (*esp_afe_sr_iface_op_reset_op_t)(esp_afe_sr_data_t *afe);
 
 /**
  * @brief Disable one function/module/algorithm.
@@ -204,6 +204,7 @@ typedef struct {
     esp_afe_sr_iface_op_enable_func_t enable_se;
     esp_afe_sr_iface_op_disable_func_t disable_vad;
     esp_afe_sr_iface_op_enable_func_t enable_vad;
+    esp_afe_sr_iface_op_reset_op_t reset_vad;
     esp_afe_sr_iface_op_disable_func_t disable_ns;
     esp_afe_sr_iface_op_enable_func_t enable_ns;
     esp_afe_sr_iface_op_disable_func_t disable_agc;
@@ -110,7 +110,7 @@ vad_handle_t vad_create(vad_mode_t vad_mode);
  * - NULL: Create failed
  * - Others: The instance of VAD
  */
-vad_handle_t vad_create_with_param(vad_mode_t vad_mode, int sample_rate, int one_frame_ms, int min_speech_len, int min_noise_len);
+vad_handle_t vad_create_with_param(vad_mode_t vad_mode, int sample_rate, int one_frame_ms, int min_speech_ms, int min_noise_ms);
 
 /**
  * @brief Feed samples of an audio stream to the VAD and check if there is someone speaking.
@@ -138,6 +138,13 @@ vad_state_t vad_process(vad_handle_t handle, int16_t *data, int sample_rate_hz,
  */
 vad_state_t vad_process_with_trigger(vad_handle_t handle, int16_t *data);
 
+/**
+ * @brief Reset trigger state as Silence
+ *
+ * @param handle The instance of VAD.
+ */
+void vad_reset_trigger(vad_handle_t handle);
+
 /**
  * @brief Free the VAD instance
  *