Mirror of https://github.com/espressif/esp-sr.git, synced 2025-09-15 15:28:44 +08:00

Merge branch 'docs/update_afe_doc' into 'master'

Review / update / translate documents in the docs folder

See merge request speech-recognition-framework/esp-sr!9

This commit is contained in: commit bc59cbcad7
.gitignore (vendored), 9 lines changed

@@ -6,7 +6,6 @@ include/sdkconfig.h
build/
sdkconfig.old
sdkconfig
<<<<<<< HEAD
.DS_Store

*.pyc
@@ -21,12 +20,8 @@ docs/*/xml/
docs/*/xml_in/
docs/*/man/
docs/doxygen_sqlite3.db
_build/*

# Downloaded font files
docs/_static/DejaVuSans.ttf
docs/_static/NotoSansSC-Regular.otf
=======
model/target/*
.vscode
docs/_build/*
>>>>>>> 0981bc8425d6cace35ebb73789265a1c2e14dc92
docs/_static/NotoSansSC-Regular.otf
@@ -37,7 +37,6 @@ build_esp_sr_html:
  script:
    - cd $DOCS_DIR
    - ./check_lang_folder_sync.sh
    - ./check_doc_chars.py
    - build-docs --skip-reqs-check -l $DOCLANG -t $DOCTGT
    - echo "ESP-SR documentation preview available at $CI_JOB_URL/artifacts/file/docs/_build/$DOCLANG/$DOCTGT/html/index.html"
  parallel:
@@ -59,8 +58,7 @@ build_esp_sr_pdf:
  script:
    - cd $DOCS_DIR
    - ./check_lang_folder_sync.sh
    - ./check_doc_chars.py
    - build-docs --skip-reqs-check -l $DOCLANG -t $DOCTGT
    - build-docs -bs latex -l $DOCLANG -t $DOCTGT
  parallel:
    matrix:
      - DOCLANG: ["en", "zh_CN"]
README.md, 37 lines changed

@@ -1,41 +1,44 @@

# esp_sr

# ESP-SR Speech Recognition Framework

Espressif esp_sr provides basic algorithms for **Speech Recognition** applications. Now, this framework has four modules:

[Documentation](https://docs.espressif.com/projects/esp-sr/zh_CN/latest/esp32/index.html)

* The wake word detection model [WakeNet](docs/wake_word_engine/README.md)
* The speech command recognition model [MultiNet](docs/speech_command_recognition/README.md)
* Audio Front-End [AFE](docs/audio_front_end/README.md)
* The text to speech model [esp-tts](esp-tts/README.md)

Espressif [ESP-SR](https://github.com/espressif/esp-sr) helps users build AI speech solutions based on ESP32 or ESP32-S3 chips.

Overview
--------

The ESP-SR framework includes the following modules:

* Audio Front-end AFE
* Wake Word Engine WakeNet
* Speech Command Word Recognition MultiNet
* Speech Synthesis (only supports the Chinese language)

These algorithms are provided in the form of a component, so they can be integrated into your projects with minimum effort.

ESP32-S3 is recommended: it supports AI instructions and larger, high-speed octal SPI PSRAM. New algorithms will no longer support the ESP32 chip.

## Wake Word Engine

Espressif wake word engine [WakeNet](docs/wake_word_engine/README.md) is specially designed to provide a high-performance and low-memory-footprint wake word detection algorithm, which enables devices to always listen for wake words such as "Alexa", "Hi Lexin", and "Hi ESP". You can refer to [Model loading method](./docs/flash_model/README.md) to build your project.

Espressif wake word engine **WakeNet** is specially designed to provide a high-performance and low-memory-footprint wake word detection algorithm, which enables devices to always listen for wake words such as "Alexa", "Hi Lexin", and "Hi ESP". You can refer to **Model loading method** to build your project.

Currently, Espressif not only provides the official wake words "Hi Lexin" and "Hi ESP" to the public for free, but also supports customized wake words. For details on how to customize your own wake words, please see [Espressif Speech Wake Words Customization Process](docs/wake_word_engine/ESP_Wake_Words_Customization.md).

- [WakeNet Performance](docs/benchmark/README.md)

Currently, Espressif not only provides the official wake words "Hi Lexin" and "Hi ESP" to the public for free, but also supports customized wake words. For details on how to customize your own wake words, please see **Espressif Speech Wake Words Customization Process**.

## Speech Command Recognition

Espressif's speech command recognition model [MultiNet](docs/speech_command_recognition/README.md) is specially designed to provide a flexible offline speech command recognition model. With this model, you can easily add your own speech commands, eliminating the need to train the model again. You can refer to [Model loading method](./docs/flash_model/README.md) to build your project.

Espressif's speech command recognition model **MultiNet** is specially designed to provide a flexible offline speech command recognition model. With this model, you can easily add your own speech commands, eliminating the need to train the model again. You can refer to **Model loading method** to build your project.

Currently, Espressif **MultiNet** supports up to 200 Chinese or English speech commands, such as "打开空调" (Turn on the air conditioner) and "打开卧室灯" (Turn on the bedroom light).

- [MultiNet Performance](docs/benchmark/README.md)

## Audio Front End

Espressif Audio Front-End [AFE](docs/audio_front_end/README.md) integrates AEC (Acoustic Echo Cancellation), VAD (Voice Activity Detection), BSS (Blind Source Separation), and NS (Noise Suppression).

Espressif Audio Front-End **AFE** integrates AEC (Acoustic Echo Cancellation), VAD (Voice Activity Detection), BSS (Blind Source Separation), and NS (Noise Suppression).

Our two-mic Audio Front-End (AFE) has been qualified as a "Software Audio Front-End Solution" for [Amazon Alexa Built-in devices](https://developer.amazon.com/en-US/alexa/solution-providers/dev-kits#software-audio-front-end-dev-kits).

- [Audio Front-End Performance](docs/benchmark/README.md)

**In order to achieve optimal performance:**

* Please refer to the software design of [esp-skainet](https://github.com/espressif/esp-skainet).
@@ -24,4 +24,8 @@ project_slug = 'esp-sr'
versions_url = '_static/docs_version.js'

# Final PDF filename will contain target and version
pdf_file_prefix = u'esp-sr'

# Add tracking ID for Google Analytics
google_analytics_id = 'UA-132861133-1'
@@ -3,67 +3,73 @@ Espressif Microphone Design Guidelines

:link_to_translation:`zh_CN:[中文]`

This document provides microphone design guidelines and suggestions for the ESP32-S3 series of audio development boards.

This document provides microphone design guidelines and suggestions for the {IDF_TARGET_NAME} series of audio development boards.

Electrical Performance
----------------------

Microphone Electrical Performance Requirement
---------------------------------------------

#. Type: omnidirectional MEMS microphone
#. Sensitivity

   - Under 1 Pa sound pressure, it should be no less than -38 dBV for analog microphones, and -26 dB for digital microphones.
   - The tolerance should be controlled within ±2 dB, and within ±1 dB for microphone arrays.

#. Signal-to-noise ratio (SNR)

   - No less than 62 dB. Higher than 64 dB is recommended.
   - Frequency response fluctuates within ±3 dB from 50 Hz to 16 kHz.
   - PSRR should be larger than 55 dB for MEMS microphones.

- Type: omnidirectional MEMS microphone
- Sensitivity

  - Under 1 Pa sound pressure, the sensitivity should be no less than -38 dBV for analog microphones and -26 dB for digital microphones.
  - Tolerance should be within ±2 dB for a single microphone, and within ±1 dB for microphone arrays.

- Signal-to-noise ratio (SNR)

  - SNR: No less than 62 dB. Higher than 64 dB is recommended.
  - Frequency response should only fluctuate within ±3 dB from 50 Hz to 16 kHz.
  - PSRR should be larger than 55 dB for microphones.

Structure Design
----------------

Microphone Structure Design Suggestion
--------------------------------------

#. The aperture or width of the microphone hole is recommended to be greater than 1 mm, the pickup pipe should be as short as possible, and the cavity should be as small as possible, to ensure that the resonance frequency of the microphone and structural components is above 9 kHz.
#. The ratio of the depth to the diameter of the pickup hole should be less than 4:1, and the thickness of the shell is recommended to be 1 mm. If the shell is too thick, the opening area must be increased.
#. The microphone hole must be protected by an anti-dust mesh.
#. A silicone sleeve or foam must be added between the microphone and the device shell for sealing and shock absorption, and an interference fit design is required to ensure the tightness of the microphone.
#. The microphone hole cannot be blocked. A bottom microphone hole needs additional structure to prevent it from being blocked by the desktop.
#. The microphone should be placed far away from the speaker and other objects that can produce noise or vibration, and should be isolated and buffered from the speaker sound cavity by rubber pads.

- The aperture or width of the microphone hole is recommended to be greater than 1 mm, the pickup pipe should be as short as possible, and the cavity should be as small as possible, all to ensure that the resonance frequency of the microphone and structural components is above 9 kHz.
- The ratio of the depth to the diameter of the pickup hole should be less than 2:1, and the thickness of the shell is recommended to be 1 mm. Increase the hole size of the microphone if the shell is too thick.
- The microphone hole must be protected by an anti-dust mesh.
- A silicone sleeve or foam must be added between the microphone and the device shell for sealing and damping, and an interference fit design is required to ensure the leakproofness of the microphone.
- The microphone hole cannot be covered. A bottom microphone must keep some clearance from surfaces such as a desktop; it is therefore suggested to design some legs for the product to provide such clearance.
- The microphone should be placed far away from the speaker and other objects that can produce noise or vibration, and should be isolated and buffered from the speaker sound cavity by rubber pads.

Microphone Array Design
-----------------------

Microphone Array Design Suggestion
----------------------------------

#. Type: omnidirectional MEMS microphone. Use the same models from the same manufacturer for the array. Mixing different microphones is not recommended.
#. The sensitivity difference among microphones in the array should be within 3 dB.
#. The phase difference among the microphones in the array should be controlled within 10°.
#. It is recommended to keep the structural design of each microphone in the array the same to ensure consistency.
#. Two-microphone solution: the distance between the microphones should be 4 ~ 6.5 cm, the axis connecting them should be parallel to the horizontal line, and the center of the two microphones should be horizontally as close as possible to the center of the product.
#. Three-microphone solution: the microphones are equally spaced and distributed in a perfect circle at an angle of 120 degrees from each other, and the spacing should be 4 ~ 6.5 cm.

Customers can design two or three microphones in an array:

Microphone Structure Tightness
------------------------------

- Two-microphone solution: the distance between the microphones should be 4 ~ 6.5 cm, the axis connecting them should be parallel to the horizontal line, and the center of the two microphones should be horizontally as close as possible to the center of the product.
- Three-microphone solution: the microphones are equally spaced and distributed in a perfect circle at an angle of 120° from each other, and the spacing should be 4 ~ 6.5 cm.

Use plasticine or other materials to seal the microphone pickup hole and compare how much the signals collected by the microphone decrease before and after the seal. 25 dB is qualified, and 30 dB is recommended. Below are the test procedures.

There are some limitations when selecting microphones for the same array:

- Type: omnidirectional MEMS microphone. Use the same microphone models from the same manufacturer for the array. It is not recommended to use different microphone models in the same array.
- The sensitivity difference of all the microphones in the same array should be within 3 dB.
- The phase difference of all the microphones in the same array should be within 10°.
- It is recommended to use the same structural design for all the microphones in the same array to ensure consistency.

Microphone Leakproofness Suggestion
-----------------------------------

Use plasticine or similar materials to seal the microphone pickup hole and compare how much the signals collected by the microphone decrease before and after the seal. 25 dB is qualified, and 30 dB is recommended. Below are the test procedures:

#. Play white noise at 0.5 meters above the microphone, and keep the volume at the microphone at 90 dB.
#. Use the microphone array to record for more than 10 s, and store the recording as recording file A.
#. Use plasticine or similar materials to block the microphone pickup hole, record for more than 10 s, and store it as recording file B.
#. Compare the frequency spectrum of the two files and make sure that the overall attenuation in the 100 Hz ~ 8 kHz frequency band is more than 25 dB.

Echo Reference Signal Design
----------------------------

Echo Reference Signal Design Suggestion
---------------------------------------

#. It is recommended that the echo reference signal be taken as close to the speaker side as possible, and recovered from the DAC post-stage and PA pre-stage.
#. When the speaker volume is at its maximum, the echo reference signal input to the microphone should not have saturation distortion. At the maximum volume, the speaker amplifier output THD is less than 10% at 100 Hz, less than 6% at 200 Hz, and less than 3% above 350 Hz.
#. When the speaker volume is at its maximum, the sound pressure picked up by the microphone does not exceed 102 dB @ 1 kHz.
#. The echo reference signal voltage must not exceed the maximum allowed input voltage of the ADC. If it is too high, an attenuation circuit should be added.
#. A low-pass filter should be added when introducing the reference echo signal from the output of a Class D power amplifier. The cutoff frequency of the filter is recommended to be more than 22 kHz.
#. When the volume is played at the maximum, the recovered signal peak value should be -3 to -5 dB.

- It is recommended that the echo reference signal be taken as close to the speaker side as possible, and recovered from the DAC post-stage and PA pre-stage.
- When the speaker volume is at its maximum, the echo reference signal input to the microphone should not have saturation distortion. At the maximum volume, the speaker amplifier output Total Harmonic Distortion (THD) is less than 10% at 100 Hz, less than 6% at 200 Hz, and less than 3% above 350 Hz.
- When the speaker volume is at its maximum, the sound pressure picked up by the microphone does not exceed 102 dB @ 1 kHz.
- The echo reference signal voltage must not exceed the maximum allowed input voltage of the ADC. If it is too high, an attenuation circuit should be added.
- A low-pass filter should be added when introducing the reference echo signal from the output of a Class D power amplifier. The cutoff frequency of the filter is recommended to be more than 22 kHz.
- When the volume is played at the maximum, the recovered signal peak value should be -3 to -5 dB.

Microphone Array Consistency
----------------------------

Microphone Array Consistency Verification
-----------------------------------------

It is required that the difference between the sampled signals of each microphone is less than 3 dB. Below are the test procedures.

It is required that the difference between the sampled signals of each microphone in the same array is less than 3 dB. Below are the test procedures.

#. Play white noise at 0.5 meters above the microphone, and keep the volume at the microphone at 90 dB.
#. Use the microphone array to record for more than 10 s, and check whether the recording amplitude and audio sampling rate of each microphone are consistent.
@@ -3,102 +3,128 @@ Audio Front-end Framework

:link_to_translation:`zh_CN:[中文]`

Espressif Audio Front-end (AFE) algorithm framework is independently developed by Espressif's AI Lab. Based on ESP32 series chips, the framework can provide high-quality and stable audio data.

Overview
--------

Summary
-------

Any voice-enabled product needs to perform well in a noisy environment, and audio front-end (AFE) algorithms play an important role in building a sensitive voice-user interface (VUI). Espressif's AI Lab has created a set of audio front-end algorithms that offer this functionality. Customers can use these algorithms with Espressif's powerful {IDF_TARGET_NAME} series of chips in order to build high-performance, yet low-cost, products with a voice-user interface.

Espressif AFE provides the most convenient way to do audio front-end processing on ESP32 series chips. The Espressif AFE framework stably delivers high-quality audio data for further wake-up or speech recognition.

.. list-table::
    :widths: 25 75
    :header-rows: 1

    * - Name
      - Description
    * - AEC (Acoustic Echo Cancellation)
      - Supports maximum two-mic processing, which can effectively remove the echo in the mic input signal, and help with further speech recognition.
    * - NS (Noise Suppression)
      - Supports single-channel processing and can suppress the non-human noise in single-channel audio, especially stationary noise.
    * - BSS (Blind Source Separation)
      - Supports dual-channel processing, which can well separate the target sound source from the rest of the interference sound, so as to extract the useful audio signal and ensure the quality of the subsequent speech.
    * - MISO (Multi Input Single Output)
      - Supports dual-channel input and single-channel output. It is used to select one channel of audio output with a high signal-to-noise ratio when WakeNet is not enabled in the dual-mic scenario.
    * - VAD (Voice Activity Detection)
      - Supports real-time output of the voice activity state of the current frame.
    * - AGC (Automatic Gain Control)
      - Dynamically adjusts the amplitude of the output audio: it amplifies the output amplitude when a weak signal is input; when the input signal reaches a certain strength, the output amplitude is compressed.
    * - WakeNet
      - A wake word engine built upon neural networks, specially designed for low-power embedded MCUs.

Espressif AFE is divided into two sets of algorithms:
#. One for speech recognition scenarios
#. One for voice communication scenarios

Both are shown below.

Usage Scenarios
---------------

- Speech recognition scenarios

This section introduces two typical usage scenarios of the Espressif AFE framework.

Speech Recognition
^^^^^^^^^^^^^^^^^^

Workflow
""""""""

.. figure:: ../../_static/AFE_SR_overview.png
    :alt: overview

- Voice communication scenarios

.. figure:: ../../_static/AFE_VOIP_overview.png
    :alt: overview

The data flow of Espressif AFE is also divided into two scenarios, shown below:

- Speech recognition scenarios

Data Flow
"""""""""

.. figure:: ../../_static/AFE_SR_workflow.png
    :alt: overview

The workflow is as follows:

#. Use :cpp:func:`ESP_AFE_SR_HANDLE` to create and initialize AFE. Note that :cpp:member:`voice_communication_init` must be configured as false.
#. Use :cpp:func:`feed` to input audio data. The AEC algorithm is performed inside :cpp:func:`feed` first.
#. The BSS/NS algorithms are then performed in the internal AFE task.
#. Use :cpp:func:`fetch` to obtain the processed single-channel audio data and related information. Note that VAD processing and wake word detection are performed inside :cpp:func:`fetch`. The specific behavior depends on the configuration of the ``afe_config_t`` structure.

#. Use **ESP_AFE_SR_HANDLE** to create and initialize AFE (``voice_communication_init`` needs to be configured as false)
#. AFE feed: input audio data; AEC runs in the feed function
#. Internal: BSS/NS algorithm processing is carried out
#. AFE fetch: return the audio data and related information after processing. VAD processing and wake word detection are carried out inside fetch. The specific behavior depends on the configuration of the ``afe_config_t`` structure.

Voice Communication
^^^^^^^^^^^^^^^^^^^

.. note::
    ``wakenet_init`` and ``voice_communication_init`` cannot be configured to true at the same time.

Workflow
""""""""

- Voice communication scenarios

.. figure:: ../../_static/AFE_VOIP_overview.png
    :alt: overview

Data Flow
"""""""""

.. figure:: ../../_static/AFE_VOIP_workflow.png
    :alt: overview

#. Use :cpp:func:`ESP_AFE_VC_HANDLE` to create and initialize AFE. Note that :cpp:member:`voice_communication_init` must be configured as true.
#. Use :cpp:func:`feed` to input audio data. The AEC algorithm is performed inside :cpp:func:`feed` first.
#. The BSS/NS algorithms are then performed in the internal AFE task. An additional MISO algorithm is performed for a dual-mic setup.
#. Use :cpp:func:`fetch` to obtain the processed single-channel audio data and related information. AGC processing is carried out, and the specific gain depends on the configuration of the :cpp:type:`afe_config_t` structure. For a dual-mic setup, NS processing is carried out before AGC.

The workflow is as follows:

#. Use **ESP_AFE_VC_HANDLE** to create and initialize AFE (``voice_communication_init`` needs to be configured as true)
#. AFE feed: input audio data; AEC runs in the feed function
#. Internal: BSS/NS algorithm processing is carried out. For a dual-mic setup, MISO processing is carried out afterwards.
#. AFE fetch: return the audio data and related information after processing. AGC processing is carried out, and the specific gain depends on the configuration of the ``afe_config_t`` structure. For a dual-mic setup, NS processing is carried out before AGC.

.. note::
    ``wakenet_init`` and ``voice_communication_init`` cannot be configured to true at the same time.

.. note::
    ``afe->feed()`` and ``afe->fetch()`` are visible to users, while the internal BSS/NS/MISO task is invisible to users.

    * AEC runs in the ``afe->feed()`` function; if aec_init is configured as false, BSS/NS run in the ``afe->feed()`` function instead.
    * BSS/NS/MISO run as an independent task inside AFE.
    * The results of VAD/WakeNet and the processed audio data are obtained by the ``afe->fetch()`` function.

.. note::
    #. :cpp:member:`wakenet_init` and :cpp:member:`voice_communication_init` in :cpp:type:`afe_config_t` cannot be configured to true at the same time.
    #. :cpp:func:`feed` and :cpp:func:`fetch` are visible to users, while other AFE internal tasks such as BSS/NS/MISO are not visible to users.
    #. The AEC algorithm is performed in :cpp:func:`feed`.
    #. When :cpp:member:`aec_init` is configured to false, the BSS/NS algorithms are performed in :cpp:func:`feed`.
Select AFE Handle
-----------------

Espressif AFE supports both single-mic and dual-mic scenarios, and the algorithm modules can be flexibly configured. The internal task of a single-mic application is processed by NS, and the internal task of a dual-mic application is processed by BSS. If the dual-mic scenario is configured for voice communication (i.e., ``wakenet_init = false, voice_communication_init = true``), an additional MISO internal task is added.

Espressif AFE supports both single-mic and dual-mic setups, and allows flexible combinations of algorithms.

For the acquisition of the AFE handle, there is a slight difference between the speech recognition scenario and the voice communication scenario:

* Single mic

  * The internal task is processed by the NS algorithm

* Dual mic

  * The internal task is processed by the BSS algorithm
  * An additional internal task runs the MISO algorithm in the voice communication scenario (i.e., :cpp:member:`wakenet_init` = false and :cpp:member:`voice_communication_init` = true)

To obtain the AFE handle, use the commands below:

* Speech recognition

  ::

      esp_afe_sr_iface_t *afe_handle = &ESP_AFE_SR_HANDLE;

* Voice communication

  ::

      esp_afe_sr_iface_t *afe_handle = &ESP_AFE_VC_HANDLE;
.. _input-audio-1:

Input Audio Data
----------------

The AFE supports two kinds of scenarios: single mic and dual mic. The number of channels can be configured according to the audio fed to ``afe->feed()``, by modifying the ``pcm_config`` configuration in the macro ``AFE_CONFIG_DEFAULT()``. The supported configuration combinations are listed below.

Currently, the Espressif AFE framework supports both single-mic and dual-mic setups. Users can configure the number of channels based on the input audio (:cpp:func:`esp_afe_sr_iface_op_feed_t`).

.. note::
    It must hold that ``total_ch_num = mic_num + ref_num``.

To be specific, users can configure the :cpp:member:`pcm_config` in :cpp:func:`AFE_CONFIG_DEFAULT()`:

* :cpp:member:`total_ch_num` : total number of channels
* :cpp:member:`mic_num` : number of mic channels
* :cpp:member:`ref_num` : number of REF channels

When configuring, note the following requirements:

1. :cpp:member:`total_ch_num` = :cpp:member:`mic_num` + :cpp:member:`ref_num`
2. :cpp:member:`ref_num` = 0 or :cpp:member:`ref_num` = 1 (this is because AEC only supports up to one reference channel for now)

The supported configurations are:

@@ -107,317 +133,275 @@ The AFE supports two kinds of scenarios: single MIC and dual MIC. The number of

::

    total_ch_num=2, mic_num=2, ref_num=0
    total_ch_num=3, mic_num=2, ref_num=1

.. note::
    total_ch_num: the number of total channels; mic_num: the number of microphone channels; ref_num: the number of reference channels.

    At present, AEC only supports one reference channel, so ref_num can only be 0 or 1.
- AFE single mic

  - Input audio data format: 16 kHz, 16 bit, two channels (one is mic data, the other is reference data). If AEC is not required and the audio does not contain reference data, the input data can have only one channel of mic data, and ref_num needs to be set to 0.
  - The input data frame length varies according to the algorithm modules configured by the user. Users can use ``afe->get_feed_chunksize()`` to get the number of sampling points (the data type of a sampling point is int16).

AFE Single Mic
^^^^^^^^^^^^^^

- Input audio data format: 16 kHz, 16 bit, two channels (one is mic data, the other is REF data). Note that if AEC is not required, then there is no need for reference data. In this case, users can configure only one channel of mic data, and ref_num can be set to 0.
- The input data frame length varies with the algorithm modules configured by the user. Users can use :cpp:func:`get_feed_chunksize` to get the number of sampling points (the data type of a sampling point is int16).

The input data is arranged as follows:

.. figure:: ../../_static/AFE_mode_0.png
    :alt: input data of single mic
    :height: 0.7in

- AFE dual mic

  - Input audio data format: 16 kHz, 16 bit, three channels (two are mic data, the other is reference data). If AEC is not required and the audio does not contain reference data, the input data can have only two channels of mic data, and ref_num needs to be set to 0.
  - The input data frame length varies according to the algorithm modules configured by the user. Users can use ``afe->get_feed_chunksize()`` to get the number of sampling points (the data type of a sampling point is int16).

AFE Dual Mic
^^^^^^^^^^^^

- Input audio data format: 16 kHz, 16 bit, three channels (two are mic data, the other is REF data). Note that if AEC is not required, then there is no need for reference data. In this case, users can configure only two channels of mic data, and ref_num can be set to 0.
- The input data frame length varies with the algorithm modules configured by the user. Users can use :cpp:func:`get_feed_chunksize` to obtain the data size required (i.e., :cpp:func:`get_feed_chunksize` * :cpp:member:`total_ch_num` * sizeof(short)).

The input data is arranged as follows:

.. figure:: ../../_static/AFE_mode_other.png
    :alt: input data of dual mic
    :height: 0.75in

.. note::
    The converted data size is ``afe->get_feed_chunksize * channel number * sizeof(short)``.
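As a concrete illustration of the size formula and the channel-interleaved layout above, here is a minimal C sketch; it only computes the buffer size and the index of each sample, and it assumes ``afe_handle`` and ``afe_data`` have already been created as described later in this document.

::

    // Minimal sketch: size and layout of the feed buffer (assumes afe_handle / afe_data already exist).
    int chunksize = afe_handle->get_feed_chunksize(afe_data);     // samples per channel per frame
    int total_ch  = afe_handle->get_total_channel_num(afe_data);  // mic_num + ref_num, e.g. 2 + 1 = 3

    // Size in bytes, matching the note above: chunksize * channel number * sizeof(short)
    size_t feed_buf_bytes = chunksize * total_ch * sizeof(int16_t);

    // Channel-interleaved layout: sample i of channel c is stored at buf[i * total_ch + c];
    // the reference channel (if any) is always the last channel.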
AEC Introduction
~~~~~~~~~~~~~~~~

The AEC (Acoustic Echo Cancellation) algorithm supports maximum two-mic processing, which can effectively remove the echo in the mic input signal and help with further speech recognition.

NS (Noise Suppression)
~~~~~~~~~~~~~~~~~~~~~~

The NS algorithm supports single-channel processing and can suppress non-human noise in single-channel audio, especially stationary noise.

BSS (Blind Source Separation)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The BSS algorithm supports dual-channel processing, which can well separate the target sound source from the rest of the interference sound, so as to extract the useful audio signal and ensure the quality of the subsequent speech.

MISO (Multi Input Single Output)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The MISO algorithm supports dual-channel input and single-channel output. It is used to select one channel of audio output with a high signal-to-noise ratio when WakeNet is not enabled in the dual-mic scenario.

VAD (Voice Activity Detection)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The VAD algorithm supports real-time output of the voice activity state of the current frame.

AGC (Automatic Gain Control)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

AGC dynamically adjusts the amplitude of the output audio: it amplifies the output amplitude when a weak signal is input; when the input signal reaches a certain strength, the output amplitude is compressed.

WakeNet or Bypass
~~~~~~~~~~~~~~~~~

Users can choose whether to detect wake words in AFE. When ``afe->disable_wakenet(afe_data)`` is called, AFE enters bypass mode and WakeNet no longer runs.
Output Audio
------------

The output audio of AFE is single-channel data. In the speech recognition scenario, AFE outputs single-channel data with human voice while WakeNet is enabled. In the voice communication scenario, single-channel data with a higher signal-to-noise ratio is output.

The output audio of AFE is single-channel data.

- In the speech recognition scenario, AFE outputs single-channel data with human voice when WakeNet is enabled.
- In the voice communication scenario, AFE outputs single-channel data with a higher signal-to-noise ratio.

Quick Start
-----------

Define afe_handle
~~~~~~~~~~~~~~~~~

``afe_handle`` is the function handle through which the user calls the AFE interfaces. Therefore, the first step is to obtain ``afe_handle``.

- Speech recognition

  ::

      esp_afe_sr_iface_t *afe_handle = &ESP_AFE_SR_HANDLE;

- Voice communication

  ::

      esp_afe_sr_iface_t *afe_handle = &ESP_AFE_VC_HANDLE;

Enable Wake Word Engine WakeNet
-------------------------------

When performing AFE audio front-end processing, the user can choose whether to enable the wake word engine :doc:`WakeNet <../wake_word_engine/README>` to allow waking up the chip via wake words.

Users can disable WakeNet to reduce CPU resource consumption and perform other operations after wake-up, such as offline or online speech recognition. To do so, users can call :cpp:func:`disable_wakenet()` to enter bypass mode.

Users can also call :cpp:func:`enable_wakenet()` to enable WakeNet later whenever needed.

.. only:: esp32

    ESP32 only supports one wake word. Users cannot switch between different wake words.

.. only:: esp32s3

    ESP32-S3 allows users to switch among different wake words. After the initialization of AFE, ESP32-S3 allows users to change wake words by calling :cpp:func:`set_wakenet()`. For example, use ``set_wakenet(afe_data, "wn9_hilexin")`` to use "Hi Lexin" as the wake word. For details on how to configure more than one wake word, see Section :doc:`flash_model <../flash_model/README>`.
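The following is a minimal sketch of this on/off pattern, assuming ``afe_handle`` and ``afe_data`` were obtained as described in this document; the follow-up recognition step is only indicated by a comment.

::

    // Minimal sketch: bypass WakeNet while running follow-up recognition, then listen again.
    void on_wake_word_detected(esp_afe_sr_iface_t *afe_handle, esp_afe_sr_data_t *afe_data)
    {
        afe_handle->disable_wakenet(afe_data);   // enter bypass mode, WakeNet stops running

        // ... run offline or online speech recognition on the audio returned by fetch() ...

        afe_handle->enable_wakenet(afe_data);    // listen for the wake word again
    }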
Configure AFE
~~~~~~~~~~~~~

Get the default configuration of AFE:

::

    afe_config_t afe_config = AFE_CONFIG_DEFAULT();

Users can adjust the switches of each algorithm module and its corresponding parameters in ``afe_config``:

::

    #define AFE_CONFIG_DEFAULT() { \
        .aec_init = true, \
        .se_init = true, \
        .vad_init = true, \
        .wakenet_init = true, \
        .voice_communication_init = false, \
        .voice_communication_agc_init = false, \
        .voice_communication_agc_gain = 15, \
        .vad_mode = VAD_MODE_3, \
        .wakenet_model_name = NULL, \
        .wakenet_mode = DET_MODE_2CH_90, \
        .afe_mode = SR_MODE_LOW_COST, \
        .afe_perferred_core = 0, \
        .afe_perferred_priority = 5, \
        .afe_ringbuf_size = 50, \
        .memory_alloc_mode = AFE_MEMORY_ALLOC_MORE_PSRAM, \
        .agc_mode = AFE_MN_PEAK_AGC_MODE_2, \
        .pcm_config.total_ch_num = 3, \
        .pcm_config.mic_num = 2, \
        .pcm_config.ref_num = 1, \
    }
- aec_init: Whether the AEC algorithm is enabled.

- se_init: Whether the BSS/NS algorithm is enabled.

- vad_init: Whether the VAD algorithm is enabled (it can only be used in speech recognition scenarios).

- wakenet_init: Whether the wake word algorithm is enabled.

- voice_communication_init: Whether voice communication is enabled. It cannot be enabled together with wakenet_init.

- voice_communication_agc_init: Whether AGC is enabled in voice communication.

- voice_communication_agc_gain: The gain of AGC (unit: dB).

- vad_mode: The VAD operating mode. The larger the value, the more aggressive VAD is.

- wakenet_model_name: Its default value is NULL in the macro ``AFE_CONFIG_DEFAULT()``. First, choose the WakeNet model through ``idf.py menuconfig``. Then assign the specific model name to this field before calling ``afe_handle->create_from_config``. The value type is string. Please refer to :doc:`flash_model <../flash_model/README>`.

  .. note::
      In the example, ``esp_srmodel_filter()`` is used to get wakenet_model_name. If you let multiple WakeNet models coexist through menuconfig, this function will return one of the model names randomly.

- wakenet_mode: The WakeNet mode. It indicates the number of wake-up channels and should be chosen according to the number of mic channels.

- afe_mode: Espressif AFE supports two working modes: SR_MODE_LOW_COST and SR_MODE_HIGH_PERF. See the afe_sr_mode_t enumeration for details.

  - SR_MODE_LOW_COST: The quantized version, which occupies fewer resources.

  - SR_MODE_HIGH_PERF: The non-quantized version, which occupies more resources.

  **ESP32 only supports SR_MODE_HIGH_PERF, while ESP32-S3 supports both modes.**

- afe_perferred_core: The CPU core on which the internal BSS/NS/MISO task of AFE will run.

- afe_perferred_priority: The priority of the internal BSS/NS/MISO task.

- afe_ringbuf_size: The size of the internal ringbuf.

- memory_alloc_mode: Memory allocation mode. Three values can be configured:

  - AFE_MEMORY_ALLOC_MORE_INTERNAL: More memory is allocated from internal RAM.

  - AFE_MEMORY_ALLOC_INTERNAL_PSRAM_BALANCE: Part of the memory is allocated from internal RAM and part from PSRAM.

  - AFE_MEMORY_ALLOC_MORE_PSRAM: Most of the memory is allocated from external PSRAM.

- agc_mode: Configuration for the linear audio amplification used in speech recognition. It only takes effect when wakenet_init is enabled. Four values can be configured:

  - AFE_MN_PEAK_AGC_MODE_1: Linearly amplify the audio fed to MultiNet; the peak value is -5 dB.

  - AFE_MN_PEAK_AGC_MODE_2: Linearly amplify the audio fed to MultiNet; the peak value is -4 dB.

  - AFE_MN_PEAK_AGC_MODE_3: Linearly amplify the audio fed to MultiNet; the peak value is -3 dB.

  - AFE_MN_PEAK_NO_AGC: No amplification.

- pcm_config: Configure according to the audio fed by ``afe->feed()``. This structure has three member variables:

  - total_ch_num: Total number of audio channels, total_ch_num = mic_num + ref_num.

  - mic_num: The number of microphone channels. It can only be set to 1 or 2.

  - ref_num: The number of reference channels. It can only be set to 0 or 1.
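As an example of adjusting these fields, the sketch below configures a single-mic setup without a reference channel and picks a WakeNet model name. It is illustrative only: the model-loading helpers ``esp_srmodel_init()``, ``esp_srmodel_filter()`` and ``ESP_WN_PREFIX``, as well as the ``DET_MODE_90`` enumerator, are assumed from the esp-sr model API described in :doc:`flash_model <../flash_model/README>`, so check the headers of your esp-sr version.

::

    #include "esp_afe_sr_iface.h"
    #include "esp_afe_sr_models.h"
    // Assumption: esp_srmodel_init()/esp_srmodel_filter() and ESP_WN_PREFIX come from the
    // esp-sr model-loading API described in the flash_model section; names may differ by version.
    #include "model_path.h"

    void configure_afe_example(void)
    {
        afe_config_t afe_config = AFE_CONFIG_DEFAULT();

        // Example adjustments for a single-mic board without a reference (loopback) channel:
        afe_config.aec_init = false;                 // no reference data, so AEC cannot run
        afe_config.pcm_config.total_ch_num = 1;
        afe_config.pcm_config.mic_num = 1;
        afe_config.pcm_config.ref_num = 0;
        afe_config.wakenet_mode = DET_MODE_90;       // single-channel detection mode (assumed enumerator)

        // Pick a WakeNet model name from the models selected in menuconfig.
        srmodel_list_t *models = esp_srmodel_init("model");
        afe_config.wakenet_model_name = esp_srmodel_filter(models, ESP_WN_PREFIX, NULL);

        // ... then pass &afe_config to afe_handle->create_from_config(), as described in the next step.
    }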
Create afe_data
~~~~~~~~~~~~~~~

The user uses the ``afe_handle->create_from_config(&afe_config)`` function to obtain the data handle, which will be used internally by AFE. The parameter passed in is the configuration obtained in the previous step.

::

    /**
     * @brief Function to initialize a AFE_SR instance
     *
     * @param afe_config        The config of AFE_SR
     * @returns Handle to the AFE_SR data
     */
    typedef esp_afe_sr_data_t* (*esp_afe_sr_iface_op_create_from_config_t)(afe_config_t *afe_config);
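For example, a short usage sketch of the call described above, using the handle and configuration from the previous steps:

::

    afe_config_t afe_config = AFE_CONFIG_DEFAULT();
    esp_afe_sr_iface_t *afe_handle = &ESP_AFE_SR_HANDLE;

    // Create the AFE instance from the configuration prepared above.
    esp_afe_sr_data_t *afe_data = afe_handle->create_from_config(&afe_config);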
Feed Audio Data
~~~~~~~~~~~~~~~

After initializing AFE, users need to input audio data into AFE through the ``afe_handle->feed()`` function for processing.

The input audio size and layout format can be found in the step **Input Audio Data**.

::

    /**
     * @brief Feed samples of an audio stream to the AFE_SR
     *
     * @Warning  The input data should be arranged in the format of channel interleaving.
     *           The last channel is reference signal if it has reference data.
     *
     * @param afe   The AFE_SR object to query
     *
     * @param in    The input microphone signal, only support signed 16-bit @ 16 KHZ. The frame size can be queried by the
     *              `get_feed_chunksize`.
     * @return      The size of input
     */
    typedef int (*esp_afe_sr_iface_op_feed_t)(esp_afe_sr_data_t *afe, const int16_t* in);
Get the number of audio channels:

The ``afe_handle->get_total_channel_num()`` function provides the number of channels that need to be fed into the ``afe_handle->feed()`` function. Its return value is equal to ``pcm_config.mic_num + pcm_config.ref_num`` configured in ``AFE_CONFIG_DEFAULT()``.

::

    /**
     * @brief Get the total channel number which be config
     *
     * @param afe   The AFE_SR object to query
     * @return      The amount of total channels
     */
    typedef int (*esp_afe_sr_iface_op_get_total_channel_num_t)(esp_afe_sr_data_t *afe);
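Putting the two calls above together, a feed-loop sketch might look as follows; ``read_audio_frame()`` is a hypothetical placeholder for the user's I2S/codec capture code, the include path assumes the esp-sr component, and error handling is omitted.

::

    #include <stdlib.h>
    #include <stdint.h>
    #include <stdbool.h>
    #include "esp_afe_sr_iface.h"

    // Hypothetical placeholder: fills `buf` with `samples_per_ch * total_ch` interleaved int16 samples.
    extern void read_audio_frame(int16_t *buf, int samples_per_ch, int total_ch);

    void feed_task_example(esp_afe_sr_iface_t *afe_handle, esp_afe_sr_data_t *afe_data)
    {
        int chunksize = afe_handle->get_feed_chunksize(afe_data);
        int total_ch  = afe_handle->get_total_channel_num(afe_data);
        int16_t *buf  = malloc(chunksize * total_ch * sizeof(int16_t));

        while (true) {
            read_audio_frame(buf, chunksize, total_ch);   // capture one channel-interleaved frame
            afe_handle->feed(afe_data, buf);              // hand it to AFE (AEC runs inside feed)
        }
    }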
Fetch Audio Data
~~~~~~~~~~~~~~~~

Users can get the processed single-channel audio and related information through the ``afe_handle->fetch()`` function.

The number of data sampling points of fetch (the data type of a sampling point is int16) can be obtained by ``afe_handle->get_fetch_chunksize``.

::

    /**
     * @brief Get the amount of each channel samples per frame that need to be passed to the function
     *
     * Every speech enhancement AFE_SR processes a certain number of samples at the same time. This function
     * can be used to query that amount. Note that the returned amount is in 16-bit samples, not in bytes.
     *
     * @param afe   The AFE_SR object to query
     * @return      The amount of samples to feed the fetch function
     */
    typedef int (*esp_afe_sr_iface_op_get_samp_chunksize_t)(esp_afe_sr_data_t *afe);

The declaration of ``afe_handle->fetch()`` is as follows:

::

    /**
     * @brief fetch enhanced samples of an audio stream from the AFE_SR
     *
     * @Warning  The output is single channel data, no matter how many channels the input is.
     *
     * @param afe   The AFE_SR object to query
     * @return      The result of output, please refer to the definition of `afe_fetch_result_t`. (The frame size of output audio can be queried by the `get_fetch_chunksize`.)
     */
    typedef afe_fetch_result_t* (*esp_afe_sr_iface_op_fetch_t)(esp_afe_sr_data_t *afe);

Its return value is a pointer to a structure, which is defined as follows:

::

    /**
     * @brief The result of fetch function
     */
    typedef struct afe_fetch_result_t
    {
        int16_t *data;              // the data of audio.
        int data_size;              // the size of data. The unit is byte.
        int wakeup_state;           // the value is wakenet_state_t
        int wake_word_index;        // if the wake word is detected. It will store the wake word index which start from 1.
        int vad_state;              // the value is afe_vad_state_t
        int trigger_channel_id;     // the channel index of output
        int wake_word_length;       // the length of wake word. It's unit is the number of samples.
        int ret_value;              // the return state of fetch function
        void* reserved;             // reserved for future use
    } afe_fetch_result_t;
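Below is a fetch-loop sketch built only on the structure above. The comparisons against ``ESP_OK`` and ``WAKENET_DETECTED`` assume the constant/enumerator names used by ESP-IDF and esp-sr, so verify them against the actual headers of your version.

::

    #include <stddef.h>
    #include <stdbool.h>
    #include "esp_err.h"            // ESP_OK
    #include "esp_afe_sr_iface.h"   // AFE interface from the esp-sr component

    void detect_task_example(esp_afe_sr_iface_t *afe_handle, esp_afe_sr_data_t *afe_data)
    {
        while (true) {
            afe_fetch_result_t *res = afe_handle->fetch(afe_data);
            if (res == NULL || res->ret_value != ESP_OK) {
                continue;                               // nothing usable this round
            }

            // res->data holds res->data_size bytes of processed, single-channel, 16-bit audio.

            // Assumption: WAKENET_DETECTED is the "detected" value of wakenet_state_t;
            // check the esp-sr WakeNet header for the exact enumerator names.
            if (res->wakeup_state == WAKENET_DETECTED) {
                int index  = res->wake_word_index;      // which wake word was detected (starting from 1)
                int length = res->wake_word_length;     // wake word length in samples
                (void)index; (void)length;
                afe_handle->disable_wakenet(afe_data);  // optional: bypass WakeNet during follow-up recognition
            }
        }
    }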
Usage of WakeNet
~~~~~~~~~~~~~~~~

When users need to perform other operations after wake-up, such as offline or online speech recognition, they can pause the operation of WakeNet to reduce CPU resource consumption.

Users can call ``afe_handle->disable_wakenet(afe_data)`` to stop WakeNet, or call ``afe_handle->enable_wakenet(afe_data)`` to enable WakeNet again.

In addition, the ESP32-S3 chip supports switching between wake words (note: the ESP32 chip only supports one wake word and does not support switching). After AFE initialization, the ESP32-S3 can switch the wake word by calling ``afe_handle->set_wakenet()``. For example, ``afe_handle->set_wakenet(afe_data, "wn9_hilexin")`` switches to "Hi Lexin". For how to configure multiple wake words, please refer to :doc:`flash_model <../flash_model/README>`.
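For example, a short sketch of switching wake words at run time on ESP32-S3; ``"wn9_alexa"`` is a hypothetical model name used only for illustration, so use the models you actually configured in menuconfig.

::

    // Switch the active wake word at run time (ESP32-S3 only).
    afe_handle->set_wakenet(afe_data, "wn9_hilexin");   // "Hi Lexin", as quoted above

    // ... later, switch to another model that was configured alongside it.
    afe_handle->set_wakenet(afe_data, "wn9_alexa");     // hypothetical model name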
Usage of AEC
~~~~~~~~~~~~

Enable Acoustic Echo Cancellation (AEC)
---------------------------------------

The usage of AEC is similar to that of WakeNet. Users can disable or enable AEC according to their requirements.

- Disable AEC

  ``afe->disable_aec(afe_data);``

- Enable AEC

  ``afe->enable_aec(afe_data);``
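A small sketch of this toggle, for example tied to the playback state; ``afe`` and ``afe_data`` are assumed to come from the earlier steps.

::

    #include <stdbool.h>
    #include "esp_afe_sr_iface.h"

    // Enable AEC only while the loudspeaker is playing; disable it otherwise.
    void on_playback_state_changed(esp_afe_sr_iface_t *afe, esp_afe_sr_data_t *afe_data, bool playing)
    {
        if (playing) {
            afe->enable_aec(afe_data);
        } else {
            afe->disable_aec(afe_data);
        }
    }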
.. only:: html

Programming Procedures
----------------------

Define afe_handle
^^^^^^^^^^^^^^^^^

``afe_handle`` is the function handle through which the user calls the AFE interfaces. Therefore, the first step is to obtain ``afe_handle``.

- Speech recognition

  ::

      esp_afe_sr_iface_t *afe_handle = &ESP_AFE_SR_HANDLE;

- Voice communication

  ::

      esp_afe_sr_iface_t *afe_handle = &ESP_AFE_VC_HANDLE;

Configure AFE
^^^^^^^^^^^^^

Get the default configuration of AFE:

::

    afe_config_t afe_config = AFE_CONFIG_DEFAULT();

Users can further configure the corresponding parameters in ``afe_config``:

::

    #define AFE_CONFIG_DEFAULT() { \
        // Configures whether or not to enable AEC
        .aec_init = true, \
        // Configures whether or not to enable BSS/NS
        .se_init = true, \
        // Configures whether or not to enable VAD (only for speech recognition)
        .vad_init = true, \
        // Configures whether or not to enable WakeNet
        .wakenet_init = true, \
        // Configures whether or not to enable voice communication (cannot be enabled when wakenet_init is also enabled)
        .voice_communication_init = false, \
        // Configures whether or not to enable AGC for voice communication
        .voice_communication_agc_init = false, \
        // Configures the AGC gain (unit: dB)
        .voice_communication_agc_gain = 15, \
        // Configures the VAD mode (the larger the number is, the more aggressive VAD is)
        .vad_mode = VAD_MODE_3, \
        // Configures the wake model. See details below.
        .wakenet_model_name = NULL, \
        // Configures the wake mode (corresponding to wake-up channels; this should be configured based on the number of mic channels)
        .wakenet_mode = DET_MODE_2CH_90, \
        // Configures the AFE mode (SR_MODE_LOW_COST or SR_MODE_HIGH_PERF)
        .afe_mode = SR_MODE_LOW_COST, \
        // Configures the CPU core on which the internal BSS/NS/MISO algorithm of AFE will run
        .afe_perferred_core = 0, \
        // Configures the priority of the BSS/NS/MISO algorithm task
        .afe_perferred_priority = 5, \
        // Configures the internal ringbuf size
        .afe_ringbuf_size = 50, \
        // Configures the memory allocation mode. See details below.
        .memory_alloc_mode = AFE_MEMORY_ALLOC_MORE_PSRAM, \
        // Configures the linear audio amplification level. See details below.
        .agc_mode = AFE_MN_PEAK_AGC_MODE_2, \
        // Configures the total number of audio channels
        .pcm_config.total_ch_num = 3, \
        // Configures the number of microphone channels
        .pcm_config.mic_num = 2, \
        // Configures the number of reference channels
        .pcm_config.ref_num = 1, \
    }

* :cpp:member:`wakenet_model_name` : configures the wake model. The default value in :cpp:type:`AFE_CONFIG_DEFAULT()` is NULL. Note:

  * After selecting the wake model via ``idf.py menuconfig``, please set :cpp:member:`wakenet_model_name` to the configured wake model (type string) before calling :cpp:func:`create_from_config`. For more information about wake models, go to Section :doc:`flash_model <../flash_model/README>` .
  * :cpp:func:`esp_srmodel_filter()` can be used to obtain the model name. However, if more than one model is configured via ``idf.py menuconfig``, this function returns any one of the configured models randomly.

* :cpp:member:`afe_mode` : configures the AFE mode.

  .. list::

      :esp32s3: - :cpp:enumerator:`SR_MODE_LOW_COST` : quantized, which uses fewer resources
      - :cpp:enumerator:`SR_MODE_HIGH_PERF` : unquantized, which uses more resources

  For details, see :cpp:enumerator:`afe_sr_mode_t` .

* :cpp:member:`memory_alloc_mode` : configures how the memory is allocated

  - :cpp:enumerator:`AFE_MEMORY_ALLOC_MORE_INTERNAL` : allocate most memory from internal RAM
  - :cpp:enumerator:`AFE_MEMORY_ALLOC_INTERNAL_PSRAM_BALANCE` : allocate some memory from internal RAM
  - :cpp:enumerator:`AFE_MEMORY_ALLOC_MORE_PSRAM` : allocate most memory from external PSRAM

* :cpp:member:`agc_mode` : configures the peak AGC mode. Note that this parameter is only for speech recognition scenarios, and is only valid when WakeNet is enabled:

  - :cpp:enumerator:`AFE_MN_PEAK_AGC_MODE_1` : feed linearly amplified audio signals to MultiNet, peak is -5 dB.
  - :cpp:enumerator:`AFE_MN_PEAK_AGC_MODE_2` : feed linearly amplified audio signals to MultiNet, peak is -4 dB.
  - :cpp:enumerator:`AFE_MN_PEAK_AGC_MODE_3` : feed linearly amplified audio signals to MultiNet, peak is -3 dB.
  - :cpp:enumerator:`AFE_MN_PEAK_NO_AGC` : feed original audio signals to MultiNet.

* :cpp:member:`pcm_config` : configures the audio signals fed through :cpp:func:`feed` :

  * :cpp:member:`total_ch_num` : total number of channels
  * :cpp:member:`mic_num` : number of mic channels
  * :cpp:member:`ref_num` : number of REF channels

  There are some limitations when configuring these parameters. For details, see Section :ref:`input-audio-1` .
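As an illustration of these constraints, the sketch below configures a single-mic voice-communication setup (WakeNet off, voice communication on); the specific values are only an example.

::

    afe_config_t afe_config = AFE_CONFIG_DEFAULT();

    // Voice communication scenario: WakeNet off, voice communication on
    // (the two flags must not be true at the same time).
    afe_config.wakenet_init = false;
    afe_config.voice_communication_init = true;
    afe_config.voice_communication_agc_init = true;   // enable AGC for the communication stream
    afe_config.voice_communication_agc_gain = 15;     // dB, as in the default above

    // Single mic with one reference channel: total_ch_num = mic_num + ref_num.
    afe_config.pcm_config.total_ch_num = 2;
    afe_config.pcm_config.mic_num = 1;
    afe_config.pcm_config.ref_num = 1;

    esp_afe_sr_iface_t *afe_handle = &ESP_AFE_VC_HANDLE;
    esp_afe_sr_data_t *afe_data = afe_handle->create_from_config(&afe_config);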
Create afe_data
^^^^^^^^^^^^^^^

The user uses the :cpp:func:`esp_afe_sr_iface_op_create_from_config_t` function to create the data handle based on the parameters configured in the previous steps.

::

    /**
     * @brief Function to initialize a AFE_SR instance
     *
     * @param afe_config        The config of AFE_SR
     * @returns Handle to the AFE_SR data
     */
    typedef esp_afe_sr_data_t* (*esp_afe_sr_iface_op_create_from_config_t)(afe_config_t *afe_config);

Feed Audio Data
^^^^^^^^^^^^^^^

After initializing AFE, users need to input audio data into AFE through the :cpp:func:`feed` function for processing. The format of the input audio data can be found in Section :ref:`input-audio-1` .

::

    /**
     * @brief Feed samples of an audio stream to the AFE_SR
     *
     * @Warning  The input data should be arranged in the format of channel interleaving.
     *           The last channel is reference signal if it has reference data.
     *
     * @param afe   The AFE_SR object to query
     *
     * @param in    The input microphone signal, only support signed 16-bit @ 16 KHZ. The frame size can be queried by the
     *              `get_feed_chunksize`.
     * @return      The size of input
     */
    typedef int (*esp_afe_sr_iface_op_feed_t)(esp_afe_sr_data_t *afe, const int16_t* in);

Get the number of audio channels
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The :cpp:func:`get_total_channel_num()` function provides the number of channels that need to be fed into the :cpp:func:`feed()` function. Its return value is equal to ``pcm_config.mic_num + pcm_config.ref_num`` configured in :cpp:func:`AFE_CONFIG_DEFAULT()`.

::

    /**
     * @brief Get the total channel number which be config
     *
     * @param afe   The AFE_SR object to query
     * @return      The amount of total channels
     */
    typedef int (*esp_afe_sr_iface_op_get_total_channel_num_t)(esp_afe_sr_data_t *afe);

Fetch Audio Data
^^^^^^^^^^^^^^^^

Users can get the processed single-channel audio and related information through the :cpp:func:`fetch` function.

The number of data sampling points of :cpp:func:`fetch` (the data type of a sampling point is ``int16``) can be obtained by :cpp:func:`get_fetch_chunksize`.

::

    /**
     * @brief Get the amount of each channel samples per frame that need to be passed to the function
     *
     * Every speech enhancement AFE_SR processes a certain number of samples at the same time. This function
     * can be used to query that amount. Note that the returned amount is in 16-bit samples, not in bytes.
     *
     * @param afe   The AFE_SR object to query
     * @return      The amount of samples to feed the fetch function
     */
    typedef int (*esp_afe_sr_iface_op_get_samp_chunksize_t)(esp_afe_sr_data_t *afe);

The declaration of :cpp:func:`fetch` is as follows:

::

    /**
     * @brief fetch enhanced samples of an audio stream from the AFE_SR
     *
     * @Warning  The output is single channel data, no matter how many channels the input is.
     *
     * @param afe   The AFE_SR object to query
     * @return      The result of output, please refer to the definition of `afe_fetch_result_t`. (The frame size of output audio can be queried by the `get_fetch_chunksize`.)
     */
    typedef afe_fetch_result_t* (*esp_afe_sr_iface_op_fetch_t)(esp_afe_sr_data_t *afe);

Its return value is a pointer to a structure, which is defined as follows:

::

    /**
     * @brief The result of fetch function
     */
    typedef struct afe_fetch_result_t
    {
        int16_t *data;              // the data of audio.
        int data_size;              // the size of data. The unit is byte.
        int wakeup_state;           // the value is wakenet_state_t
        int wake_word_index;        // if the wake word is detected. It will store the wake word index which start from 1.
        int vad_state;              // the value is afe_vad_state_t
        int trigger_channel_id;     // the channel index of output
        int wake_word_length;       // the length of wake word. It's unit is the number of samples.
        int ret_value;              // the return state of fetch function
        void* reserved;             // reserved for future use
    } afe_fetch_result_t;
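To tie the steps of this section together, here is a minimal two-task sketch for an ESP-IDF/FreeRTOS environment. It is an illustrative assumption rather than a reference implementation: ``board_get_feed_data()`` is a hypothetical placeholder for the board's audio capture code, and ``wakenet_model_name`` should be set as described above before creating the instance.

::

    #include <stdlib.h>
    #include "freertos/FreeRTOS.h"
    #include "freertos/task.h"
    #include "esp_afe_sr_iface.h"
    #include "esp_afe_sr_models.h"

    static esp_afe_sr_iface_t *afe_handle = &ESP_AFE_SR_HANDLE;
    static esp_afe_sr_data_t  *afe_data;

    // Hypothetical placeholder for the board's audio capture (I2S/codec) code.
    extern void board_get_feed_data(int16_t *buf, int samples_per_ch, int total_ch);

    static void feed_task(void *arg)
    {
        int chunksize = afe_handle->get_feed_chunksize(afe_data);
        int total_ch  = afe_handle->get_total_channel_num(afe_data);
        int16_t *buf  = malloc(chunksize * total_ch * sizeof(int16_t));
        while (1) {
            board_get_feed_data(buf, chunksize, total_ch);  // channel-interleaved, 16 kHz, 16-bit
            afe_handle->feed(afe_data, buf);
        }
    }

    static void detect_task(void *arg)
    {
        while (1) {
            afe_fetch_result_t *res = afe_handle->fetch(afe_data);
            if (res && res->data) {
                // res->data: processed single-channel audio;
                // res->wakeup_state / res->vad_state: detection information.
            }
        }
    }

    void app_main(void)
    {
        afe_config_t afe_config = AFE_CONFIG_DEFAULT();
        // Set afe_config.wakenet_model_name as described above before creating the instance.
        afe_data = afe_handle->create_from_config(&afe_config);
        xTaskCreate(feed_task,   "afe_feed",   4096, NULL, 5, NULL);
        xTaskCreate(detect_task, "afe_detect", 4096, NULL, 5, NULL);
    }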
docs/en/audio_front_end/index.rst (new file), 8 lines

@@ -0,0 +1,8 @@
AFE Audio Front-end
===================

.. toctree::
    :maxdepth: 1

    AFE Introduction <README>
    Espressif Microphone Design Guidelines <Espressif_Microphone_Design_Guidelines>
@@ -6,93 +6,96 @@ Resource Consumption

AFE
---

Resource Occupancy(ESP32)
~~~~~~~~~~~~~~~~~~~~~~~~~

Resource Occupancy
~~~~~~~~~~~~~~~~~~

.. only:: esp32

    +-----------------+-----------+-------------------+--------------+
    | Algorithm Type  | RAM       | Average CPU       | Frame Length |
    |                 |           | loading (computed |              |
    |                 |           | with 2 cores)     |              |
    +=================+===========+===================+==============+
    | AEC(HIGH_PERF)  | 114 KB    | 11%               | 32 ms        |
    +-----------------+-----------+-------------------+--------------+
    | NS              | 27 KB     | 5%                | 10 ms        |
    +-----------------+-----------+-------------------+--------------+
    | AFE Layer       | 73 KB     |                   |              |
    +-----------------+-----------+-------------------+--------------+

Resource Occupancy(ESP32S3)
~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. only:: esp32s3

    +-----------------+-----------+-------------------+--------------+
    | Algorithm Type  | RAM       | Average CPU       | Frame Length |
    |                 |           | loading (computed |              |
    |                 |           | with 2 cores)     |              |
    +=================+===========+===================+==============+
    | AEC(LOW_COST)   | 152.3 KB  | 8%                | 32 ms        |
    +-----------------+-----------+-------------------+--------------+
    | AEC(HIGH_PERF)  | 166 KB    | 11%               | 32 ms        |
    +-----------------+-----------+-------------------+--------------+
    | BSS(LOW_COST)   | 198.7 KB  | 6%                | 64 ms        |
    +-----------------+-----------+-------------------+--------------+
    | BSS(HIGH_PERF)  | 215.5 KB  | 7%                | 64 ms        |
    +-----------------+-----------+-------------------+--------------+
    | NS              | 27 KB     | 5%                | 10 ms        |
    +-----------------+-----------+-------------------+--------------+
    | MISO            | 56 KB     | 8%                | 16 ms        |
    +-----------------+-----------+-------------------+--------------+
    | AFE Layer       | 227 KB    |                   |              |
    +-----------------+-----------+-------------------+--------------+

WakeNet
-------

.. _resource-occupancyesp32-1:

Resource Occupancy(ESP32)
~~~~~~~~~~~~~~~~~~~~~~~~~

Resource Occupancy
~~~~~~~~~~~~~~~~~~
+-------------+-------------+-------------+-------------+-------------+
|
||||
| Model Type | Parameter | RAM | Average | Frame |
|
||||
| | Num | | Running | Length |
|
||||
| | | | Time per | |
|
||||
| | | | Frame | |
|
||||
+=============+=============+=============+=============+=============+
|
||||
| Quantised | 41 K | 15 KB | 5.5 ms | 30 ms |
|
||||
| WakeNet5 | | | | |
|
||||
+-------------+-------------+-------------+-------------+-------------+
|
||||
| Quantised | 165 K | 20 KB | 10.5 ms | 30 ms |
|
||||
| WakeNet5X2 | | | | |
|
||||
+-------------+-------------+-------------+-------------+-------------+
|
||||
| Quantised | 371 K | 24 KB | 18 ms | 30 ms |
|
||||
| WakeNet5X3 | | | | |
|
||||
+-------------+-------------+-------------+-------------+-------------+
|
||||
.. only:: esp32
|
||||
|
||||
+-------------+-------------+-------------+-------------+-------------+
|
||||
| Model Type | Parameter | RAM | Average | Frame |
|
||||
| | Num | | Running | Length |
|
||||
| | | | Time per | |
|
||||
| | | | Frame | |
|
||||
+=============+=============+=============+=============+=============+
|
||||
| Quantised | 41 K | 15 KB | 5.5 ms | 30 ms |
|
||||
| WakeNet5 | | | | |
|
||||
+-------------+-------------+-------------+-------------+-------------+
|
||||
| Quantised | 165 K | 20 KB | 10.5 ms | 30 ms |
|
||||
| WakeNet5X2 | | | | |
|
||||
+-------------+-------------+-------------+-------------+-------------+
|
||||
| Quantised | 371 K | 24 KB | 18 ms | 30 ms |
|
||||
| WakeNet5X3 | | | | |
|
||||
+-------------+-------------+-------------+-------------+-------------+
|
||||
|
||||
.. _resource-occupancyesp32s3-1:
|
||||
|
||||
Resource Occupancy(ESP32S3)
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
.. only:: esp32s3
|
||||
|
||||
+----------------+-------+---------+----------------+--------------+
|
||||
| Model Type | RAM | PSRAM | Average | Frame Length |
|
||||
| | | | Running Time | |
|
||||
| | | | per Frame | |
|
||||
+================+=======+=========+================+==============+
|
||||
| Quantised | 50 KB | 1640 KB | 10.0 ms | 32 ms |
|
||||
| WakeNet8 @ 2 | | | | |
|
||||
| channel | | | | |
|
||||
+----------------+-------+---------+----------------+--------------+
|
||||
| Quantised | 16 KB | 324 KB | 3.0 ms | 32 ms |
|
||||
| WakeNet9 @ 2 | | | | |
|
||||
| channel | | | | |
|
||||
+----------------+-------+---------+----------------+--------------+
|
||||
| Quantised | 20 KB | 347 KB | 4.3 ms | 32 ms |
|
||||
| WakeNet9 @ 3 | | | | |
|
||||
| channel | | | | |
|
||||
+----------------+-------+---------+----------------+--------------+
|
||||
+----------------+-------+---------+----------------+--------------+
|
||||
| Model Type | RAM | PSRAM | Average | Frame Length |
|
||||
| | | | Running Time | |
|
||||
| | | | per Frame | |
|
||||
+================+=======+=========+================+==============+
|
||||
| Quantised | 50 KB | 1640 KB | 10.0 ms | 32 ms |
|
||||
| WakeNet8 @ 2 | | | | |
|
||||
| channel | | | | |
|
||||
+----------------+-------+---------+----------------+--------------+
|
||||
| Quantised | 16 KB | 324 KB | 3.0 ms | 32 ms |
|
||||
| WakeNet9 @ 2 | | | | |
|
||||
| channel | | | | |
|
||||
+----------------+-------+---------+----------------+--------------+
|
||||
| Quantised | 20 KB | 347 KB | 4.3 ms | 32 ms |
|
||||
| WakeNet9 @ 3 | | | | |
|
||||
| channel | | | | |
|
||||
+----------------+-------+---------+----------------+--------------+
|
||||
|
||||
Performance
|
||||
~~~~~~~~~~~
|
||||
Performance Test
|
||||
~~~~~~~~~~~~~~~~
|
||||
|
||||
+-------------+-------------+-------------+-------------+-------------+
|
||||
| Distance | Quiet | Stationary | Speech | AEC |
|
||||
@ -105,49 +108,52 @@ Performance
|
||||
| 3 m | 98% | 96% | 94% | 94% |
|
||||
+-------------+-------------+-------------+-------------+-------------+
|
||||
|
||||
False triggering rate: 1 time in 12 hours
|
||||
False triggering rate: once in 12 hours
|
||||
|
||||
**Note**: We use the ESP32-S3-Korvo V4.0 development board and the WakeNet9(Alexa) model in our test.
|
||||
.. note::
|
||||
|
||||
In this test, we used the ESP32-S3-Korvo V4.0 development board and the WakeNet9 (Alexa) model.
|
||||
|
||||
MultiNet
|
||||
--------
|
||||
|
||||
.. _resource-occupancyesp32-2:
|
||||
|
||||
Resource Occupancy(ESP32)
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
Resource Occupancy
|
||||
~~~~~~~~~~~~~~~~~~
|
||||
|
||||
+-------------+-------------+-------------+-------------+-------------+
|
||||
| Model Type | Internal | PSRAM | Average | Frame |
|
||||
| | RAM | | Running | Length |
|
||||
| | | | Time per | |
|
||||
| | | | Frame | |
|
||||
+=============+=============+=============+=============+=============+
|
||||
| MultiNet 2 | 13.3 KB | 9KB | 38 ms | 30 ms |
|
||||
+-------------+-------------+-------------+-------------+-------------+
|
||||
.. only:: esp32
|
||||
|
||||
+-------------+-------------+-------------+-------------+-------------+
|
||||
| Model Type | Internal | PSRAM | Average | Frame |
|
||||
| | RAM | | Running | Length |
|
||||
| | | | Time per | |
|
||||
| | | | Frame | |
|
||||
+=============+=============+=============+=============+=============+
|
||||
| MultiNet 2 | 13.3 KB | 9KB | 38 ms | 30 ms |
|
||||
+-------------+-------------+-------------+-------------+-------------+
|
||||
|
||||
.. _resource-occupancyesp32s3-2:
|
||||
|
||||
Resource Occupancy(ESP32S3)
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
.. only:: esp32s3
|
||||
|
||||
+-------------+-------------+-------------+-------------+-------------+
|
||||
| Model Type | Internal | PSRAM | Average | Frame |
|
||||
| | RAM | | Running | Length |
|
||||
| | | | Time per | |
|
||||
| | | | Frame | |
|
||||
+=============+=============+=============+=============+=============+
|
||||
| MultiNet 4 | 16.8KB | 1866 KB | 18 ms | 32 ms |
|
||||
+-------------+-------------+-------------+-------------+-------------+
|
||||
| MultiNet 4 | 10.5 KB | 1009 KB | 11 ms | 32 ms |
|
||||
| Q8 | | | | |
|
||||
+-------------+-------------+-------------+-------------+-------------+
|
||||
| MultiNet 5 | 16 KB | 2310 KB | 12 ms | 32 ms |
|
||||
| Q8 | | | | |
|
||||
+-------------+-------------+-------------+-------------+-------------+
|
||||
+-------------+-------------+-------------+-------------+-------------+
|
||||
| Model Type | Internal | PSRAM | Average | Frame |
|
||||
| | RAM | | Running | Length |
|
||||
| | | | Time per | |
|
||||
| | | | Frame | |
|
||||
+=============+=============+=============+=============+=============+
|
||||
| MultiNet 4 | 16.8KB | 1866 KB | 18 ms | 32 ms |
|
||||
+-------------+-------------+-------------+-------------+-------------+
|
||||
| MultiNet 4 | 10.5 KB | 1009 KB | 11 ms | 32 ms |
|
||||
| Q8 | | | | |
|
||||
+-------------+-------------+-------------+-------------+-------------+
|
||||
| MultiNet 5 | 16 KB | 2310 KB | 12 ms | 32 ms |
|
||||
| Q8 | | | | |
|
||||
+-------------+-------------+-------------+-------------+-------------+
|
||||
|
||||
Performance with AFE
|
||||
~~~~~~~~~~~~~~~~~~~~
|
||||
Performance Test
|
||||
~~~~~~~~~~~~~~~~
|
||||
|
||||
+-----------+-----------+-----------+-----------+-----------+
|
||||
| Model | Distance | Quiet | S | Speech |
|
||||
@ -161,4 +167,4 @@ Performance with AFE
|
||||
+-----------+-----------+-----------+-----------+-----------+
|
||||
| MultiNet | 3 m | 94% | 92% | 91% |
|
||||
| 4 Q8 | | | | |
|
||||
+-----------+-----------+-----------+-----------+-----------+
|
||||
+-----------+-----------+-----------+-----------+-----------+
|
||||
@ -1,122 +1,142 @@
|
||||
Model Loading Method
|
||||
====================
|
||||
Flash Model
|
||||
===========
|
||||
|
||||
:link_to_translation:`zh_CN:[中文]`
|
||||
|
||||
In esp-sr, both WakeNet and MultiNet will use a large amount of model data, and the model data is located in *ESP-SR_PATH/model/*. Currently esp-sr supports the following model loading methods:
|
||||
ESP-SR's WakeNet and MultiNet both use a lot of model data (which can be found in :project:`model`). Currently, ESP-SR supports the following methods to flash models:
|
||||
|
||||
ESP32:
|
||||
.. only:: esp32
|
||||
|
||||
- Load directly from Flash
|
||||
ESP32: Load directly from Flash
|
||||
|
||||
ESP32S3:
|
||||
.. only:: esp32s3
|
||||
|
||||
- Load from Flash spiffs partition
|
||||
- Load from external SDCard
|
||||
ESP32-S3:
|
||||
|
||||
So that on ESP32S3 you can:
|
||||
- Load directly from SPI Flash File System (SPIFFS)
|
||||
- Load from external SD card
|
||||
|
||||
- Greatly reduce the size of the user application APP BIN
|
||||
- Supports the selection of up to two wake words
|
||||
- Support online switching of Chinese and English Speech Command Recognition
|
||||
- Convenient for users to perform OTA
|
||||
- Supports reading and changing models from SD card, which is more convenient and can reduce the size of module Flash used in the project
|
||||
- When the user is developing the code, when the modification does not involve the model, it can avoid flashing the model data every time, greatly reducing the flashing time and improving the development efficiency
|
||||
On ESP32-S3, loading models this way brings the following benefits:
|
||||
|
||||
Model Configuration Introduction
|
||||
--------------------------------
|
||||
- Greatly reduce the size of the user application APP BIN
|
||||
- Supports the selection of up to two wake words
|
||||
- Support online switching of Chinese and English Speech Command Recognition
|
||||
- Convenient for users to perform OTA
|
||||
- Supports reading and changing models from SD card, which is more convenient and can reduce the size of module Flash used in the project
|
||||
- During development, if a code change does not involve the model, the model data does not need to be flashed again, which greatly reduces flashing time and improves development efficiency
|
||||
|
||||
Run *idf.py menuconfig* navigate to *ESP Speech Recognition*:
|
||||
Configuration
|
||||
-------------
|
||||
|
||||
Run ``idf.py menuconfig`` to navigate to ``ESP Speech Recognition``:
|
||||
|
||||
.. figure:: ../../_static/model-1.png
|
||||
:alt: overview
|
||||
|
||||
overview
|
||||
|
||||
Model Data Path
|
||||
~~~~~~~~~~~~~~~
|
||||
.. only:: esp32s3
|
||||
|
||||
This option is only available on ESP32S3. It indicates the storage location of the model data. It supports the choice of ``spiffs partition`` or ``SD Card``.
|
||||
Model Data Path
|
||||
~~~~~~~~~~~~~~~
|
||||
|
||||
- *spiffs partition* means that the model data is stored in the Flash spiffs partition, and the model data will be loaded from the Flash spiffs partition
|
||||
- ``SD Card`` means that the model data is stored in the SD card, and the model data will be loaded from the SD Card
|
||||
This option indicates the storage location of the model data: ``spiffs partition`` or ``SD Card``.
|
||||
|
||||
- ``spiffs partition`` means that the model data is stored in the SPIFFS partition, and the model data will be loaded from the SPIFFS partition
|
||||
- ``SD Card`` means that the model data is stored in the SD card, and the model data will be loaded from the SD card
|
||||
|
||||
Use AFE
|
||||
~~~~~~~
|
||||
|
||||
This option needs to be turned on. Users do not need to modify it. Please keep the default configuration.
|
||||
This option is enabled by default. Users do not need to modify it. Please keep the default configuration.
|
||||
|
||||
Use Wakenet
|
||||
Use WakeNet
|
||||
~~~~~~~~~~~
|
||||
|
||||
This option is turned on by default. When the user only uses ``AEC`` or ``BSS``, etc., and does not need to run ``WakeNet`` or ``MultiNet``, please turn off this option, which will reduce the size of the project firmware.
|
||||
This option is enabled by default. When the user only uses ``AEC`` or ``BSS``, etc., and does not need ``WakeNet`` or ``MultiNet``, please disable this option, which reduces the size of the project firmware.
|
||||
|
||||
- Select wake words by menuconfig, ``ESP Speech Recognition -> Select wake words``. The model name of wake word in parentheses is used to initialize wakenet handle. |select wake wake|
|
||||
- If you want to select multiple wake words, please select ``Load Multiple Wake Words`` ( **Note this option only supports ESP32S3**) |multi wake wake| Then you can select multiple wake words at the same time |image1|
|
||||
Select wake words via ``menuconfig`` by navigating to ``ESP Speech Recognition`` > ``Select wake words``. The model name of the wake word in parentheses must be used to initialize the WakeNet handle.
|
||||
|
||||
|select wake wake|
|
||||
|
||||
If you want to select multiple wake words, please select ``Load Multiple Wake Words``
|
||||
|
||||
|multi wake wake|
|
||||
|
||||
Then you can select multiple wake words at the same time:
|
||||
|
||||
|image1|
|
||||
|
||||
.. only:: esp32
|
||||
|
||||
.. note::
|
||||
ESP32 doesn't support multiple wake words.
|
||||
|
||||
.. only:: esp32s3
|
||||
|
||||
.. note::
|
||||
ESP32-S3 supports multiple wake words. Users can select more than one wake word according to the hardware flash size.
|
||||
|
||||
For more details, please refer to :doc:`WakeNet <../wake_word_engine/README>` .
|
||||
|
||||
Use Multinet
|
||||
~~~~~~~~~~~~
|
||||
|
||||
This option is turned on by default. When users only use WakeNet or other algorithm modules, please turn off this option, which will reduce the size of the project firmware in some cases.
|
||||
This option is enabled by default. When users only use WakeNet or other algorithm modules, please disable this option, which reduces the size of the project firmware in some cases.
|
||||
|
||||
ESP32 chip only supports Chinese Speech Commands Recognition.
|
||||
Chinese Speech Commands Model
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
ESP32S3 supports Chinese and English Speech Commands Recognition, and supports Chinese and English recognition model switching.
|
||||
.. only:: esp32
|
||||
|
||||
- Chinese Speech Commands Model
|
||||
ESP32 only supports command words in Chinese:
|
||||
|
||||
Chinese Speech Commands Recognition model selection.
|
||||
- None
|
||||
- Chinese single recognition (MultiNet2)
|
||||
|
||||
ESP32 supports:
|
||||
.. only:: esp32s3
|
||||
|
||||
- None
|
||||
- chinese single recognition (MultiNet2)
|
||||
ESP32-S3 supports command words in both Chinese and English:
|
||||
|
||||
ESP32S3 supports:
|
||||
- None
|
||||
- Chinese single recognition (MultiNet4.5)
|
||||
- Chinese single recognition (MultiNet4.5 quantized with 8-bit)
|
||||
- English Speech Commands Model
|
||||
|
||||
- None
|
||||
The user needs to add Chinese Speech Command words to this item when ``Chinese Speech Commands Model`` is not ``None``.
|
||||
|
||||
- chinese single recognition (MultiNet4.5)
|
||||
.. only:: esp32s3
|
||||
|
||||
- chinese single recognition (MultiNet4.5 quantized with 8-bit)
|
||||
English Speech Commands Model
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
- English Speech Commands Model
|
||||
ESP32-S3 supports command words in both Chinese and English, and allows users to switch between these two languages.
|
||||
|
||||
English Speech Commands Recognition model selection.
|
||||
- None
|
||||
- English recognition (MultiNet5 quantized with 8-bit, depends on WakeNet8)
|
||||
- Add Chinese speech commands
|
||||
|
||||
This option does not support ESP32.
|
||||
The user needs to add English Speech Command words to this item when ``English Speech Commands Model`` is not ``None``.
|
||||
|
||||
ESP32S3 Supports:
|
||||
|
||||
- None
|
||||
|
||||
- english recognition (MultiNet5 quantized with 8-bit, depends on WakeNet8)
|
||||
|
||||
- Add Chinese speech commands
|
||||
|
||||
The user needs to add Chinese Speech Command words to this item when ``Chinese Speech Commands Model`` is not ``None``.
|
||||
|
||||
- Add English speech commands
|
||||
|
||||
The user needs to add English Speech Command words to this item when ``English Speech Commands Model`` is not ``None``.
|
||||
|
||||
For more details, please refer to :doc:`MultiNet <../speech_command_recognition/README>` .
|
||||
For more details, please refer to Section :doc:`MultiNet <../speech_command_recognition/README>` .
|
||||
|
||||
How To Use
|
||||
----------
|
||||
|
||||
Here is an introduction to the code implementation of model data loading in the project. If you want to get more details, please refer to the esp-skainet examples.
|
||||
After the above-mentioned configuration, users can initialize and start using the models following the examples described in the `ESP-Skainet <https://github.com/espressif/esp-skainet>`_ repo.
|
||||
|
||||
ESP32
|
||||
~~~~~
|
||||
Here, we only introduce the code implementation, which can also be found in `model_path.c <../src/model_path.c>`_ .
|
||||
|
||||
| When the user uses ESP32, since it only supports loading the model data directly from the Flash, the model data in the code will automatically read the required data from the Flash according to the address.
|
||||
| Now The ESP32S3 API is compatible with ESP32. You can refer to the ESP32S3 method to load and initialize the model.
|
||||
.. only:: esp32
|
||||
|
||||
ESP32S3
|
||||
~~~~~~~
|
||||
ESP32 can only load model data from flash. Therefore, the code will automatically read the required model data from flash according to its address. Note that the ESP32 and ESP32-S3 APIs are compatible.
|
||||
|
||||
.. only:: esp32s3
|
||||
|
||||
ESP32-S3 can load model data from SPIFFS or SD card.
|
||||
|
||||
Load Model Data from SPIFFS
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
#. Write a partition table:
|
||||
|
||||
@ -124,41 +144,75 @@ ESP32S3
|
||||
|
||||
model, data, spiffs, , SIZE,
|
||||
|
||||
Among them, ``SIZE`` can refer to the recommended size when the user uses ``idf.py build`` to compile, for example:
|
||||
Among them, ``SIZE`` can refer to the recommended size when the user uses ``idf.py build`` to compile, for example: ``Recommended model partition size: 500K``
|
||||
|
||||
::
|
||||
|
||||
Recommended model partition size: 500K
|
||||
|
||||
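For reference, a complete ``partitions.csv`` containing such a model partition might look like the following; the other entries, offsets, and sizes are only an example and must be adapted to the actual project:

::

    # Name,   Type, SubType, Offset,  Size,  Flags
    nvs,      data, nvs,     0x9000,  0x6000,
    phy_init, data, phy,     0xf000,  0x1000,
    factory,  app,  factory, 0x10000, 2M,
    model,    data, spiffs,  ,        500K,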
After completing the above configuration, the project will automatically generate ``model.bin`` after the project is compiled, and flash it to the spiffs partition.
|
||||
|
||||
#. Initialize the spiffs partition User can use ``esp_srmodel_init()`` API to initialize spiffs and return all loaded models.
|
||||
#. Initialize the SPIFFS partition: Users can use the ``esp_srmodel_init()`` API to initialize SPIFFS and return all loaded models.
|
||||
|
||||
- base_path: The model storage ``base_path`` is ``srmodel`` and cannot be changed
|
||||
- partition_label: The partition label of the model is ``model``, which needs to be consistent with the ``Name`` in the above partition table
|
||||
|
||||
**Note: After the user changes the model, be sure to run ``idf.py clean`` before compiling again.**
|
||||
After completing the above configuration, the project will automatically generate ``model.bin`` after the project is compiled, and flash it to the SPIFFS partition.
|
||||
|
||||
.. _esp32s3-1:
|
||||
.. only:: esp32s3
|
||||
|
||||
ESP32S3
|
||||
-------
|
||||
Load Model Data from SD Card
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
::
|
||||
When configured to load model data from ``SD Card``, users need to:
|
||||
|
||||
- Manually load model data from SD card
|
||||
After the above-mentioned configuration, users can compile the code, and copy the files in ``model/target`` to the root directory of the SD card.
|
||||
|
||||
- Customized path
|
||||
Users can also use a customized path by configuring :cpp:func:`get_model_base_path()` in ``model/model_path.c``.
|
||||
|
||||
.. only:: html
|
||||
|
||||
For example, users can configure the customized path to the ``espmodel`` in the SD card:
|
||||
|
||||
::
|
||||
|
||||
char *get_model_base_path(void)
|
||||
{
|
||||
#if defined CONFIG_MODEL_IN_SDCARD
|
||||
return "sdcard/espmodel";
|
||||
#elif defined CONFIG_MODEL_IN_SPIFFS
|
||||
return "srmodel";
|
||||
#else
|
||||
return NULL;
|
||||
#endif
|
||||
}
|
||||
|
||||
- Initialize SD card
|
||||
Users must initialize the SD card so that the chip can access it. Users of `ESP-Skainet <https://github.com/espressif/esp-skainet>`_ can call ``esp_sdcard_init("/sdcard", num);`` to initialize the SD card on any supported board. Otherwise, users need to write the initialization code themselves.
|
||||
After the above-mentioned steps, users can flash the project.
|
||||
|
||||
|
||||
.. |select wake wake| image:: ../../_static/wn_menu1.png
|
||||
.. |multi wake wake| image:: ../../_static/wn_menu2.png
|
||||
.. |image1| image:: ../../_static/wn_menu3.png
|
||||
|
||||
|
||||
.. only:: html
|
||||
|
||||
Model initialization and Usage
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
::
|
||||
|
||||
//
|
||||
// step1: initialize spiffs and return models in spiffs
|
||||
// step1: initialize SPIFFS and return models in SPIFFS
|
||||
//
|
||||
srmodel_list_t *models = esp_srmodel_init("model");
|
||||
|
||||
//
|
||||
// step2: select the specific model by keywords
|
||||
//
|
||||
char *wn_name = esp_srmodel_filter(models, ESP_WN_PREFIX, NULL); // select wakenet model
|
||||
char *nm_name = esp_srmodel_filter(models, ESP_MN_PREFIX, NULL); // select multinet model
|
||||
char *alexa_wn_name = esp_srmodel_filter(models, ESP_WN_PREFIX, "alexa"); // select wakenet with "alexa" wake word.
|
||||
char *en_mn_name = esp_srmodel_filter(models, ESP_MN_PREFIX, ESP_MN_ENGLISH); // select english multinet model
|
||||
char *cn_mn_name = esp_srmodel_filter(models, ESP_MN_PREFIX, ESP_MN_CHINESE); // select chinese multinet model
|
||||
char *wn_name = esp_srmodel_filter(models, ESP_WN_PREFIX, NULL); // select WakeNet model
|
||||
char *mn_name = esp_srmodel_filter(models, ESP_MN_PREFIX, NULL); // select MultiNet model
|
||||
char *alexa_wn_name = esp_srmodel_filter(models, ESP_WN_PREFIX, "alexa"); // select WakeNet with "alexa" wake word.
|
||||
char *en_mn_name = esp_srmodel_filter(models, ESP_MN_PREFIX, ESP_MN_ENGLISH); // select english MultiNet model
|
||||
char *cn_mn_name = esp_srmodel_filter(models, ESP_MN_PREFIX, ESP_MN_CHINESE); // select chinese MultiNet model
|
||||
|
||||
// It also works if you use the model name directly in your code.
|
||||
char *my_wn_name = "wn9_hilexin";
|
||||
@ -174,7 +228,3 @@ ESP32S3
|
||||
|
||||
esp_mn_iface_t *multinet = esp_mn_handle_from_name(mn_name);
|
||||
model_iface_data_t *mn_model_data = multinet->create(mn_name, 6000);
|
||||
|
||||
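For completeness, the WakeNet handle can be obtained from the selected model name in the same way. The following lines are only a sketch based on the parallel ``esp_wn_*`` interface; the detection mode constant is an example and should match the actual model and microphone channel number:

::

    esp_wn_iface_t *wakenet = esp_wn_handle_from_name(wn_name);
    // DET_MODE_2CH_90 is illustrative; pick the mode matching your model and channel layout
    model_iface_data_t *wn_model_data = wakenet->create(wn_name, DET_MODE_2CH_90);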
.. |select wake wake| image:: ../../_static/wn_menu1.png
|
||||
.. |multi wake wake| image:: ../../_static/wn_menu2.png
|
||||
.. |image1| image:: ../../_static/wn_menu3.png
|
||||
|
||||
46
docs/en/getting_started/readme.rst
Normal file
46
docs/en/getting_started/readme.rst
Normal file
@ -0,0 +1,46 @@
|
||||
Getting Started
|
||||
================
|
||||
|
||||
:link_to_translation:`zh_CN:[中文]`
|
||||
|
||||
Espressif `ESP-SR <https://github.com/espressif/esp-sr>`__ helps you build AI voice solutions based on ESP32 or ESP32-S3 chips. This document introduces the algorithms and models in ESP-SR via some simple examples.
|
||||
|
||||
Overview
|
||||
--------
|
||||
|
||||
ESP-SR includes the following modules:
|
||||
|
||||
* :doc:`Audio Front-end AFE <../audio_front_end/README>`
|
||||
* :doc:`Wake Word Engine WakeNet <../wake_word_engine/README>`
|
||||
* :doc:`Speech Command Word Recognition MultiNet <../speech_command_recognition/README>`
|
||||
* Speech Synthesis (only supports Chinese language)
|
||||
|
||||
What You Need
|
||||
-------------
|
||||
|
||||
Hardware
|
||||
~~~~~~~~
|
||||
|
||||
.. list::
|
||||
|
||||
:esp32s3: - an audio development board. Recommendation: ESP32-S3-Korvo-1 or ESP32-S3-Korvo-2
|
||||
:esp32: - an audio development board. Recommendation: ESP32-Korvo
|
||||
- USB 2.0 cable (USB A / micro USB B)
|
||||
- PC (Linux)
|
||||
|
||||
.. note::
|
||||
Some development boards currently have the Type C interface. Make sure you use the proper cable to connect the board!
|
||||
|
||||
Software
|
||||
~~~~~~~~
|
||||
|
||||
* Download `ESP-SKAINET <https://github.com/espressif/esp-skainet>`__, which also downloads ESP-SR as a component.
|
||||
* Install the ESP-IDF version recommended in ESP-SKAINET. For detailed steps, please see Section `Getting Started <https://docs.espressif.com/projects/esp-idf/en/latest/esp32s3/get-started/index.html>`__ in `ESP-IDF Programming Guide <https://docs.espressif.com/projects/esp-idf/en/latest/esp32s3/index.html>`__.
|
||||
|
||||
|
||||
Compile an Example
|
||||
------------------
|
||||
|
||||
* Navigate to `ESP-SKAINET/examples/en_speech_commands_recognition <https://github.com/espressif/esp-skainet/tree/master/examples/en_speech_commands_recognition>`__ .
|
||||
* Compile and run an example following the instructions.
|
||||
* The example only supports commands in English. Users can wake up the chip by using the wake word "Hi ESP". Note that the chip stops listening for commands if the user wakes up the chip and does not give any command for some time. In this case, just wake up the chip again by saying the wake word. A typical build-and-flash sequence is sketched below.
|
||||
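The commands below are a sketch of a typical build-and-flash sequence; the serial port, the ESP32-S3 target, and the assumption that the example directory is a self-contained IDF project depend on your host and board setup:

::

    cd esp-skainet/examples/en_speech_commands_recognition
    idf.py set-target esp32s3
    idf.py menuconfig                      # optional: adjust wake word and speech commands
    idf.py -p /dev/ttyUSB0 flash monitor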
16
docs/en/glossary/glossary.rst
Normal file
16
docs/en/glossary/glossary.rst
Normal file
@ -0,0 +1,16 @@
|
||||
Glossary
|
||||
========
|
||||
|
||||
:link_to_translation:`zh_CN:[中文]`
|
||||
|
||||
General Terms
|
||||
-------------
|
||||
|
||||
ESP-SR reuses most of its terms in `Espressif Audio Development Framework <https://espressif-docs.readthedocs-hosted.com/projects/esp-adf/en/latest/get-started/index.html>`_. See details in `ADF English-Chinese Glossary <https://espressif-docs.readthedocs-hosted.com/projects/esp-adf/en/latest/english-chinese-glossary.html>`_ .
|
||||
|
||||
Unique Terms
|
||||
------------
|
||||
|
||||
ESP-SR's unique terms are listed below.
|
||||
|
||||
Voice-User Interface (VUI) 语音用户界面
|
||||
@ -2,30 +2,22 @@ ESP-SR User Guide
|
||||
=================
|
||||
:link_to_translation:`zh_CN:[中文]`
|
||||
|
||||
This document introduces Espressif's `ESP-SR <https://github.com/espressif/esp-sr>`__ AI voice solution based on the ESP32 series chips. From front-end audio processing to voice command word recognition, and from hardware design suggestions to performance testing methods, it is a comprehensive introduction to Espressif's systematic work on AI speech, and provides a strong reference for users building AIoT applications on Espressif ESP32 series chips and development boards.
|
||||
|
||||
The Espressif AFE algorithm has passed the Software Audio Front-End certification for Amazon Alexa built-in devices. The wake-up module built into the AFE algorithm provides local voice wake-up and supports wake word customization. Espressif's voice command word recognition model supports up to 200 English and Chinese command words, which can be modified at runtime, bringing great flexibility to the application.
|
||||
|
||||
Based on years of hardware design and development experience, Espressif can provide a voice development board review service for customers, and will be happy to test and tune the development board for customers to achieve the optimal performance of the algorithm. Customers can also conduct an in-depth evaluation of the development board and the whole product according to the test methods and self-test results provided by Espressif.
|
||||
|
||||
.. only:: html
|
||||
|
||||
**This document only contains the ESP-AT usage** for the chip. For other chips, please select your target chip from the drop-down menu at the top left of the page.
|
||||
**This document only contains the ESP-SR usage** for {IDF_TARGET_NAME}. For other chips, please select your target chip from the drop-down menu at the top left of the page.
|
||||
|
||||
.. only:: latex
|
||||
|
||||
**This document contains ESP-AT usage** for the chip only.
|
||||
**This document contains ESP-SR usage** for {IDF_TARGET_NAME} only.
|
||||
|
||||
.. toctree::
|
||||
:hidden:
|
||||
|
||||
AFE acoustic front-end algorithm <audio_front_end/README>
|
||||
Wake word model <wake_word_engine/README>
|
||||
Customized wake words <wake_word_engine/ESP_Wake_Words_Customization>
|
||||
Speech commands <speech_command_recognition/README>
|
||||
Model loading method <flash_model/README>
|
||||
Microphone Design Guidelines <audio_front_end/Espressif_Microphone_Design_Guidelines>
|
||||
Test Reports <test_report/README>
|
||||
Performance Testing <benchmark/README>
|
||||
|
||||
Translated with www.DeepL.com/Translator (free version)
|
||||
Getting Started <getting_started/readme>
|
||||
Audio Front-end (AFE) <audio_front_end/index>
|
||||
Wake Word WakeNet <wake_word_engine/index>
|
||||
Speech Command Word MultiNet <speech_command_recognition/README>
|
||||
Flash Model <flash_model/README>
|
||||
Resource Overhead <benchmark/README>
|
||||
Test Report <test_report/README>
|
||||
Glossary <glossary/glossary>
|
||||
@ -1,31 +1,29 @@
|
||||
MultiNet Introduction
|
||||
=====================
|
||||
Command Word
|
||||
============
|
||||
|
||||
:link_to_translation:`zh_CN:[中文]`
|
||||
|
||||
MultiNet is a lightweight model designed to realize speech commands
|
||||
recognition offline on ESP32 series. Now, up to 200 speech commands,
|
||||
including customized commands, are supported.
|
||||
MultiNet Command Word Recognition Model
|
||||
---------------------------------------
|
||||
|
||||
* Support Chinese and English speech commands recognition (esp32s3 is required for English speech commands recognition)
|
||||
* Support user-defined commands
|
||||
* Support adding / deleting / modifying commands during operation
|
||||
* Up to 200 commands are supported
|
||||
* It supports single recognition and continuous recognition
|
||||
* Lightweight and low resource consumption
|
||||
* Low delay, within 500ms
|
||||
* Support online Chinese and English model switching (esp32s3 only)
|
||||
* The model is partitioned separately to support users to apply OTA
|
||||
MultiNet is a lightweight model designed to recognize multiple speech command words offline based on {IDF_TARGET_NAME}. Currently, up to 200 speech commands, including customized commands, are supported.
|
||||
|
||||
Overview
|
||||
-----------
|
||||
.. list::
|
||||
|
||||
The MultiNet input is the audio processed by the audio-front-end
|
||||
algorithm (AFE), with the format of 16KHz, 16bit and mono. By
|
||||
recognizing the audio, you can correspond to the corresponding Chinese
|
||||
characters or English words.
|
||||
:esp32s3: - Support Chinese and English speech commands recognition
|
||||
:esp32: - Support Chinese speech commands recognition
|
||||
- Support user-defined commands
|
||||
- Support adding / deleting / modifying commands during operation
|
||||
- Up to 200 commands are supported
|
||||
- It supports single recognition and continuous recognition
|
||||
- Lightweight and low resource consumption
|
||||
- Low delay, within 500ms
|
||||
:esp32s3: - Support online Chinese and English model switching (esp32s3 only)
|
||||
- The model is partitioned separately to support users to apply OTA
|
||||
|
||||
The following table shows the model support of Espressif SoCs:
|
||||
The MultiNet input is the audio processed by the audio front-end algorithm (AFE), in the format of 16 KHz, 16 bit, mono. Speech commands are then recognized from this audio.
|
||||
|
||||
The following table shows the models supported by Espressif SoCs:
|
||||
|
||||
+---------+-----------+-------------+---------------+-------------+
|
||||
| Chip | ESP32 | ESP32S3 |
|
||||
@ -37,76 +35,78 @@ The following table shows the model support of Espressif SoCs:
|
||||
| English | | | | √ |
|
||||
+---------+-----------+-------------+---------------+-------------+
|
||||
|
||||
For details on flash models, see Section :doc:`flash model <../flash_model/README>` .
|
||||
|
||||
.. note::
|
||||
Note: the model ending with Q8 represents the 8bit version of the model, means more lightweight.
|
||||
Models ending with Q8 represent the 8-bit version of the model, which is more lightweight.
|
||||
|
||||
Commands Recognition Process
|
||||
-------------------------------
|
||||
----------------------------
|
||||
|
||||
Please see the flow diagram below:
|
||||
Please see the flow diagram for commands recognition below:
|
||||
|
||||
.. figure:: ../../_static/multinet_workflow.png
|
||||
:alt: speech_command-recognition-system
|
||||
|
||||
speech_command-recognition-system
|
||||
|
||||
User Guide
|
||||
-------------
|
||||
.. _command-requirements:
|
||||
|
||||
Requirements of speech commands
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
Requirements of Speech Commands
|
||||
-------------------------------
|
||||
|
||||
- The recommended length of Chinese is generally 4-6 Chinese characters. Too short leads to high false recognition rate and too long is inconvenient for users to remember
|
||||
- The recommended length of English is generally 4-6 words
|
||||
- Mixed Chinese and English is not supported in command words
|
||||
- Currently, up to 200 command words are supported
|
||||
- The command word cannot contain Arabic numerals and special characters
|
||||
- Avoid common command words like "hello"
|
||||
- The greater the pronunciation difference of each Chinese character / word in the command word, the better the performance
|
||||
Currently, MultiNet supports up to **200** commands. There are some limitation when designing speech commands:
|
||||
|
||||
Speech commands customization method
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
- Chinese
|
||||
|
||||
* Support a variety of speech commands customization methods
|
||||
* Support dynamic addition / deletion / modification of speech commands
|
||||
|
||||
Format of Speech commands
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
Speech commands string need to meet specific formats, as follows:
|
||||
|
||||
- Chinese
|
||||
|
||||
Chinese speech commands need to use Chinese Pinyin, and there should be a space between the Pinyin spelling of each word.
|
||||
|
||||
In addition, we also provide corresponding tools for users to convert Chinese characters into pinyin. See details:
|
||||
Use Pinyin for Chinese speech commands, with a space between the Pinyin of each character. For example, the Chinese speech command for turning on the air conditioner is "da kai kong tiao"; the Chinese speech command for turning on the green light is "da kai lv se deng".
|
||||
|
||||
- English
|
||||
|
||||
English speech commands need to be represented by specific phonetic symbols. The phonetic symbols of each word are separated by spaces, such as "turn on the light", which needs to be written as "TkN nN jc LiT".
|
||||
Use phonetic symbols for English speech commands, with a space between the symbols of each word. For example, the English speech command for turning on the light is "TkN nN jc LiT". Users can use the tool provided by us to do the conversion. To find this tool, go to :project_file:`tool/multinet_g2p.py` .
|
||||
|
||||
|
||||
We provide specific conversion rules and tools. For details, please refer to the English G2P `tool <../../tool/multinet_g2p.py>`_ .
|
||||
Suggestions on Customizing Speech Commands
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
Set speech commands offline
|
||||
When customizing speech command words, please pay attention to the following suggestions:
|
||||
|
||||
.. list::
|
||||
|
||||
- The recommended length of Chinese speech commands is generally 4-6 Chinese characters. Too short leads to high false recognition rate and too long is inconvenient for users to remember
|
||||
:esp32s3: - The recommended length of English speech commands is generally 4-6 words
|
||||
- Mixed Chinese and English is not supported in command words
|
||||
- The command word cannot contain Arabic numerals and special characters
|
||||
- Avoid common command words like "hello"
|
||||
- The greater the pronunciation difference of each Chinese character / word in the command words, the better the performance
|
||||
|
||||
Speech Commands Customization Methods
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
Multinet supports flexible methods to customize speech commands. Users can do it either online or offline and can also add/delete/modify speech commands dynamically.
|
||||
|
||||
.. only:: latex
|
||||
|
||||
.. figure:: ../../_static/QR_multinet_g2p.png
|
||||
:alt: menuconfig_add_speech_commands
|
||||
|
||||
Customize Speech Commands Offline
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
Multinet supports flexible speech commands setting methods. No matter which way users set speech commands (code / network / file), they only need to call the corresponding API.
|
||||
There are two methods for users to customize speech commands offline:
|
||||
|
||||
Here we provide two methods of adding speech commands:
|
||||
- Via ``menuconfig``
|
||||
|
||||
- Use ``menuconfig``
|
||||
|
||||
Users can refer to the example in ESP-Skainet, users can define their own speech commands by ``idf.py menuconfig -> ESP Speech Recognition-> Add Chinese speech commands/Add English speech commands``.
|
||||
1. Navigate to ``idf.py menuconfig`` > ``ESP Speech Recognition`` > ``Add Chinese speech commands/Add English speech commands`` to add speech commands. For details, please refer to the example in ESP-Skainet.
|
||||
|
||||
.. figure:: ../../_static/menuconfig_add_speech_commands.png
|
||||
:alt: menuconfig_add_speech_commands
|
||||
|
||||
menuconfig_add_speech_commands
|
||||
|
||||
Please note that a single ``Command ID`` can support multiple phrases. For example, "da kai kong tiao" and "kai kong tiao" have the same meaning, you can write them in the entry corresponding to the same command ID, and separate the adjacent entries with the English character "," without spaces before and after ",".
|
||||
Please note that a single ``Command ID`` can correspond to more than one commands. For example, "da kai kong tiao" and "kai kong tiao" have the same meaning. Therefore, users can assign the same command id to these two commands and separate them with "," (no space required before and after).
|
||||
|
||||
Then call the following API:
|
||||
2. Call the following API:
|
||||
|
||||
::
|
||||
|
||||
@ -125,85 +125,82 @@ Here we provide two methods of adding speech commands:
|
||||
*/
|
||||
esp_err_t esp_mn_commands_update_from_sdkconfig(esp_mn_iface_t *multinet, const model_iface_data_t *model_data);
|
||||
|
||||
- Add speech commands in the code
|
||||
- Via modifying code
|
||||
|
||||
Users can refer to example in ESP-Skainet for this method of adding speech commands.
|
||||
Users directly customize the speech commands in the code and pass these commands to the MultiNet. In the actual user scenarios, users can pass these commands via various interfaces including network / UART / SPI. For details, see the example described in ESP-Skainet.
|
||||
|
||||
In this method, users directly set the speech command words in the code and transmits them to multinet. In the actual development and products, the user can transmit the required speech commands through various possible ways such as network / UART / SPI and change the speech commands.
|
||||
|
||||
Set speech commands online
|
||||
Customize Speech Commands Online
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
MultiNet supports online dynamic addition / deletion / modification of speech commands during operation, without changing models or adjusting parameters. For details, please refer to the example in ESP-Skainet.
|
||||
MultiNet allows users to add/delete/modify speech commands dynamically during operation, without the need to change models or modify parameters. For details, see the example described in ESP-Skainet.
|
||||
|
||||
Please refer to `esp_mn_speech_commands <../../src/esp_mn_speech_commands.c>`_ for details of APIs:
|
||||
Run speech commands recognition
|
||||
----------------------------------
|
||||
For detailed description of APIs, please refer to :project_file:`src/esp_mn_speech_commands.c` .
|
||||
|
||||
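As an illustration only, dynamically replacing commands could look like the sketch below. The function names follow ``esp_mn_speech_commands.c``, but the exact signatures should be verified against the ESP-SR version in use:

::

    esp_mn_commands_alloc();                          // assumed: create/reset the command list
    esp_mn_commands_add(1, "da kai kong tiao");       // command id 1: turn on the air conditioner
    esp_mn_commands_add(1, "kai kong tiao");          // same id, alternative phrase
    esp_mn_commands_add(2, "guan bi kong tiao");      // command id 2: turn off the air conditioner
    esp_mn_commands_update(multinet, mn_model_data);  // assumed: apply the changes to the running MultiNet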
Speech commands recognition needs to be run together with the audio front-end (AFE) in esp-sr (WakeNet needs to be enabled in AFE). For the use of AFE, please refer to the document:
|
||||
Use MultiNet
|
||||
------------
|
||||
|
||||
:doc:`AFE Introduction and Use <../audio_front_end/README>`
|
||||
MultiNet speech command recognition must be used together with the audio front-end (AFE) in ESP-SR, with WakeNet enabled in the AFE. For details, see Section :doc:`AFE Introduction and Use <../audio_front_end/README>` .
|
||||
|
||||
MultiNet Initialization
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
After configuring AFE, users can follow the steps below to configure and run MultiNet.
|
||||
|
||||
- Initialize multinet model
|
||||
Initialize MultiNet
|
||||
~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
- Set speech commands
|
||||
- Load and initialize MultiNet. For details, see Section :doc:`flash_model <../flash_model/README>`
|
||||
|
||||
Please refer #3.
|
||||
- Customize speech commands. For details, see Section :ref:`command-requirements`
|
||||
|
||||
Run MultiNet
|
||||
~~~~~~~~~~~~~
|
||||
|
||||
When users uses AFE and enables wakenet, then can use MultiNet. And
|
||||
there are the following requirements:
|
||||
Users can start MultiNet after enabling AFE and WakeNet, but must pay attention to the following limitations:
|
||||
|
||||
* The frame length of MultiNet is equal to the AFE fetch frame length
|
||||
* The audio format supported is 16KHz, 16bit, mono. The data obtained by AFE fetch is also in this format
|
||||
* The frame length of MultiNet must be equal to the AFE fetch frame length
|
||||
* The audio format supported is 16 KHz, 16 bit, mono. The data obtained by AFE fetch is also in this format
|
||||
|
||||
- Get the frame length that needs to be passed into MultiNet
|
||||
- Get the length of frame that needs to pass to MultiNet
|
||||
|
||||
::
|
||||
|
||||
int mu_chunksize = multinet->get_samp_chunksize(model_data);
|
||||
|
||||
- MultiNet detect
|
||||
``mu_chunksize`` is the number of ``int16_t`` (``short``) samples in each frame passed to MultiNet. This size is exactly the same as the number of data points per frame obtained from AFE fetch.
|
||||
|
||||
We send the data from AFE fetch to the following API:
|
||||
- Start the speech recognition
|
||||
|
||||
We send the data from AFE ``fetch`` to the following API:
|
||||
|
||||
::
|
||||
|
||||
esp_mn_state_t mn_state = multinet->detect(model_data, buff);
|
||||
|
||||
The lengthof ``buff`` is ``mu_chunksize * sizeof(int16_t)``.
|
||||
The length of ``buff`` is ``mu_chunksize * sizeof(int16_t)``.
|
||||
|
||||
The detect result of MultiNet
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
MultiNet Output
|
||||
~~~~~~~~~~~~~~~
|
||||
|
||||
Speech commands recognition supports two basic modes:
|
||||
|
||||
* Single recognition
|
||||
* Continuous recognition
|
||||
|
||||
Speech command recognition must be used with WakeNet. After wake-up, MultiNet detection can be run.
|
||||
Speech command recognition must be used with WakeNet. After wake-up, MultiNet detection can start.
|
||||
|
||||
When the MultiNet is running, it will return the recognition status of the current frame in real time ``mn_state``, which is currently divided into the following identification states:
|
||||
After running, MultiNet returns the recognition state of the current frame in real time via ``mn_state``, which can be one of the following states:
|
||||
|
||||
- ESP_MN_STATE_DETECTING
|
||||
|
||||
This status indicates that the MultiNet is detecting but target
|
||||
speech command word has not been recognized.
|
||||
Indicates that the MultiNet is detecting but the target speech command word has not been recognized.
|
||||
|
||||
- ESP_MN_STATE_DETECTED
|
||||
|
||||
This status indicates that the target speech command has been recognized. At this time, the user can call ``get_results`` interface obtains the identification results.
|
||||
Indicates that the target speech command has been recognized. At this time, the user can call ``get_results`` interface to obtain the recognition results.
|
||||
|
||||
::
|
||||
|
||||
esp_mn_results_t *mn_result = multinet->get_results(model_data);
|
||||
|
||||
The information identifying the result is stored in the return value of the ``get_result`` API, the data type of the return value is as follows:
|
||||
The recognition result is stored in the return value of the ``get_results`` API in the following format:
|
||||
|
||||
::
|
||||
|
||||
@ -214,27 +211,29 @@ When the MultiNet is running, it will return the recognition status of the curre
|
||||
float prob[ESP_MN_RESULT_MAX_NUM]; // The list of probability.
|
||||
} esp_mn_results_t;
|
||||
|
||||
where,
|
||||
|
||||
- ``state`` is the recognition status of the current frame
|
||||
- ``num`` means the number of recognized commands, ``num`` <= 5, up to 5 possible results are returned
|
||||
- ``phrase_id`` means the Phrase ID of speech commands
|
||||
- ``prob`` meaNS the recognition probability of the recognized entries, which is arranged from large to small
|
||||
- ``prob`` means the recognition probability of the recognized entries, which is arranged from large to small
|
||||
|
||||
Users can use ``phrase_id[0]`` and ``prob[0]`` to get the recognition result with the highest probability.
|
||||
|
||||
- ESP_MN_STATE_TIMEOUT
|
||||
|
||||
This status means that the speech commands has not been detected for a long time and will exit automatically Wait for the next wake-up.
|
||||
Indicates that no speech command has been detected for a long time; MultiNet will exit automatically and wait to be woken up again.
|
||||
|
||||
* Therefore:
|
||||
* Exit the speech recognition when the return status is ``ESP_MN_STATE_DETECTED``, it is single recognition mode;
|
||||
* Exit the speech recognition when the return status is ``ESP_MN_STATE_TIMEOUT``, it is continuous recognition mode;
|
||||
Therefore:
|
||||
* Single recognition mode: exit the speech recognition when the return status is ``ESP_MN_STATE_DETECTED``
|
||||
* Continuous recognition: exit the speech recognition when the return status is ``ESP_MN_STATE_TIMEOUT``
|
||||
|
||||
Other configurations
|
||||
-----------------------
|
||||
|
||||
Threshold setting
|
||||
~~~~~~~~~~~~~~~~~~~~~
|
||||
~~~~~~~~~~~~~~~~~
|
||||
|
||||
::
|
||||
|
||||
This function is still under development.
|
||||
More functions are still under development.
|
||||
|
||||
@ -1,4 +1,239 @@
|
||||
Test Methods and Test Reports
|
||||
==============================
|
||||
Test Method and Test Report
|
||||
===========================
|
||||
|
||||
:link_to_translation:`zh_CN:[中文]`
|
||||
:link_to_translation:`zh_CN:[中文]`
|
||||
|
||||
To ensure the DUT performance, some tests can be performed to verify the following parameters:
|
||||
|
||||
- Wake-up rate
|
||||
- Speech recognition rate
|
||||
- False wake-up rate
|
||||
- Response Accuracy Rate Under Playback
|
||||
- Response time
|
||||
|
||||
Test Room Requirement
|
||||
---------------------
|
||||
|
||||
These tests must be performed in a proper test room. The requirements for this test room include:
|
||||
|
||||
* **Size**
|
||||
|
||||
* Area: no smaller than 4 m * 3.2 m
|
||||
* Height: no lower than 2.3 m
|
||||
|
||||
* **Setup**
|
||||
|
||||
* The floor should be equipped with carpet, the ceiling should be equipped with common acoustic damping materials, and the wall should have 1 to 2 walls with curtains to prevent strong reflection.
|
||||
* Room reverberation time (RT60) within the range of [125, 8k] shall be within 0.2 - 0.7 seconds.
|
||||
* Do not use anechoic chamber.
|
||||
|
||||
* **Background noise**: must < 35 dBA, best < 30 dBA
|
||||
|
||||
* **Temperature and humidity**: 20±10°C, 50%±20%
|
||||
|
||||
* **Placement of DUT, external noise and voice**:
|
||||
|
||||
* Place the DUT, external noise and voice according the actual use scenario of your DUT.
|
||||
|
||||
.. note::
|
||||
The RT60, background noise, and the placement of DUT, external noise and voice should be kept the same in all tests.
|
||||
|
||||
Test Case Design
|
||||
----------------
|
||||
|
||||
When designing test cases, it's suggested to factor in **some or all of the following parameters** based on the actual use scenarios of the product. For example,
|
||||
|
||||
- Different types of noises
|
||||
- White noise
|
||||
- Human noise
|
||||
- Music
|
||||
- News
|
||||
- . . . . . .
|
||||
- Test cases with multiple noise sources can also be added when necessary
|
||||
- Different noise levels
|
||||
- < 35 dBA
|
||||
- 45 dBA
|
||||
- 55 dBA
|
||||
- 65 dBA
|
||||
- Different voice levels
|
||||
- 54 dBA
|
||||
- 59 dBA
|
||||
- 64 dBA
|
||||
- Different SNR
|
||||
- 9 dBA
|
||||
- 4 dBA
|
||||
- -1 dBA
|
||||
|
||||
Espressif Test and Result
|
||||
-------------------------
|
||||
|
||||
In all the tests described in this section, the placement of DUT, external noise and voice can be seen in the diagrams below.
|
||||
|
||||
.. figure:: ../../_static/test_reference_position2.png
|
||||
:align: center
|
||||
:alt: overview
|
||||
|
||||
.. figure:: ../../_static/test_reference_position1.png
|
||||
:align: center
|
||||
:alt: overview
|
||||
|
||||
As seen in the diagrams above, place
|
||||
|
||||
- The DUT 0.75 meters above the ground.
|
||||
- The voice 3 meters away from the DUT and 1.5 meters above the ground.
|
||||
- The external noise 45° apart from the voice, 2 meters away from the DUT and 1.2 meters above the ground.
|
||||
- The sound pressure meter right above the DUT by 0.75 meters.
|
||||
|
||||
Wake-up Rate Test
|
||||
~~~~~~~~~~~~~~~~~
|
||||
|
||||
**Wake-up rate**: the probability that the DUT correctly wakes up to a wake word.
|
||||
|
||||
**Espressif's Wake-up Rate Test and Result**
|
||||
|
||||
.. list-table::
|
||||
:widths: 10 25 15 15 20 15
|
||||
:header-rows: 1
|
||||
|
||||
* - Test Case
|
||||
- Noise Type
|
||||
- Noise Decibel
|
||||
- Voice Decibel
|
||||
- SNR
|
||||
- Wake-up Rate
|
||||
* - 1
|
||||
- /
|
||||
- /
|
||||
- 59 dBA
|
||||
- /
|
||||
- 99%
|
||||
* - 2
|
||||
- White noise
|
||||
- 55 dBA
|
||||
- 59 dBA
|
||||
- >= 4 dBA
|
||||
- 99%
|
||||
* - 3
|
||||
- Human noise
|
||||
- 55 dBA
|
||||
- 59 dBA
|
||||
- >= 4 dBA
|
||||
- 99%
|
||||
|
||||
Speech Recognition Rate Test
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
**Speech recognition rate**: the probability that the DUT correctly recognizes the established command words when it is in the speech recognition state.
|
||||
|
||||
**Espressif's Speech Recognition Rate Test and Result**
|
||||
|
||||
.. list-table::
|
||||
:widths: 10 25 15 15 20 15
|
||||
:header-rows: 1
|
||||
|
||||
* - Test Case
|
||||
- Noise Type
|
||||
- Noise Decibel
|
||||
- Voice Decibel
|
||||
- SNR
|
||||
- Speech Recognition Rate
|
||||
* - 1
|
||||
- /
|
||||
- /
|
||||
- 59 dBA
|
||||
- /
|
||||
- 91.5%
|
||||
* - 2
|
||||
- White noise
|
||||
- 55 dBA
|
||||
- 59 dBA
|
||||
- >= 4 dBA
|
||||
- 78.25%
|
||||
* - 3
|
||||
- Human noise
|
||||
- 55 dBA
|
||||
- 59 dBA
|
||||
- >= 4 dBA
|
||||
- 82.77%
|
||||
|
||||
False Wake-up Rate Test
|
||||
~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
**False wake-up rate**: the probability that the DUT incorrectly wakes up to a random word (that is not a wake word).
|
||||
|
||||
**Espressif's False Wake-up Rate Test and Result**
|
||||
|
||||
.. list-table::
|
||||
:widths: 20 20 20 20 20
|
||||
:header-rows: 1
|
||||
|
||||
* - Test Case
|
||||
- Noise Type
|
||||
- Noise Decibel
|
||||
- Test Duration
|
||||
- Number of False Wake-up
|
||||
* - 1
|
||||
- Music
|
||||
- 55 dBA
|
||||
- 12 hours
|
||||
- 1 time
|
||||
* - 2
|
||||
- News
|
||||
- 55 dBA
|
||||
- 12 hours
|
||||
- 1 time
|
||||
|
||||
Response Accuracy Rate Under Playback
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
**Interrupting wake-up rate**: the probability that the DUT correctly responds to a wake word or a command word while playing sounds, such as music or TTS. This test is required for products with the AEC feature.
|
||||
|
||||
**Espressif's Interrupting Wake-up Rate Test and Result**
|
||||
|
||||
.. list-table::
|
||||
:widths: 15 15 15 20 15 15
|
||||
:header-rows: 1
|
||||
|
||||
* - Test Case
|
||||
- Noise Type
|
||||
- Noise / Voice Decibel
|
||||
- SNR
|
||||
- Wake-up Rate
|
||||
- Speech Recognition Rate
|
||||
* - 1
|
||||
- Music
|
||||
- 69 dBA / 59 dBA
|
||||
- >= 10 dBA
|
||||
- 100%
|
||||
- 96%
|
||||
* - 2
|
||||
- TTS
|
||||
- 69 dBA / 59 dBA
|
||||
- >= 10 dBA
|
||||
- 100%
|
||||
- 96%
|
||||
|
||||
Response Time Test
|
||||
~~~~~~~~~~~~~~~~~~
|
||||
|
||||
**Response time**: the time required for the DUT to respond to a command word. It is measured as the duration between the end of the command word and the moment the DUT starts playing sound (see the diagram below).
|
||||
|
||||
.. figure:: ../../_static/test_response_time.png
|
||||
:align: center
|
||||
:alt: overview
|
||||
|
||||
**Espressif's Response Time Test and Result**
|
||||
|
||||
.. list-table::
|
||||
:widths: 25 25 25 25
|
||||
:header-rows: 1
|
||||
|
||||
|
||||
* - Test Case
|
||||
- Noise / Voice Decibel
|
||||
- SNR
|
||||
- Response Time
|
||||
* - 1
|
||||
- NA / 59 dBA
|
||||
- /
|
||||
- < 500 ms
|
||||
@ -3,79 +3,96 @@ Espressif Speech Wake-up Solution Customization Process
|
||||
|
||||
:link_to_translation:`zh_CN:[中文]`
|
||||
|
||||
Speech Wake Word Customization Process
|
||||
---------------------------------------
|
||||
Wake Word Customization Process
|
||||
-------------------------------
|
||||
|
||||
Espressif provides users with the offline wake word customization service, which allows users to use both publicly available wake words (such as "Hi Lexin", "Alexa", and "hi,ESP") and customized wake words.
|
||||
Espressif provides users with the **wake word customization** service:
|
||||
|
||||
#. If you want to use publicly available wake words for commercial use
|
||||
#. Espressif has already opened some wake words for customers' commercial use, such as "Hi Lexin" or "Nihao Xiaoxin".
|
||||
|
||||
- Please check the wake words provided in `esp-sr <https://github.com/espressif/esp-sr>`__;
|
||||
- We will continue to provide more and more wake words that are free for commercial use.
|
||||
- For a complete list, see Table :ref:`Publicly Available Wake Words Provided by Espressif <esp-open-wake-word>` .
|
||||
- Espressif also plans to provide more wake words that are free for commercial use soon.
|
||||
|
||||
#. If you want to use custom wake words, we can also provide the offline
|
||||
wake word customization service.
|
||||
#. Offline wake word customization can also be provided by Espressif:
|
||||
|
||||
- If you provide a training corpus
|
||||
- Training corpus provided by customer
|
||||
|
||||
- It must consist of at least 20,000 qualified corpus entries (see the section below for detailed requirements);
|
||||
- It will take two to three weeks for Espressif to train and optimize the corpus after the hardware design meets our requirement;
|
||||
- It will be delivered in a static library of wake word;
|
||||
- Espressif will charge training fees based on the scale of your production.
|
||||
- Customers must provide at least 20,000 qualified corpus entries. See detailed requirements in Section :ref:`corpus-requirement` .
|
||||
- It usually takes two to three weeks for Espressif to train and optimize the received corpus.
|
||||
- A fee will be charged for training and optimizing the corpus.
|
||||
|
||||
- Otherwise
|
||||
- Training corpus provided by Espressif
|
||||
|
||||
- Espressif will collect and provide all the training corpus;
|
||||
- Espressif will deliver a static library file of successfully trained wake word to you, but won't share the corpus;
|
||||
- It will take around three weeks to collect and train the corpus;
|
||||
- Espressif will charge training fees (corpus collecting fees included) based on the scale of your production.
|
||||
- Espressif provides all the corpus required for training.
|
||||
- The time required to collect corpus needs to be discussed separately. After the corpus is ready, it usually takes two to three weeks for Espressif to train and optimize the received corpus.
|
||||
- A fee will be charged for training and optimizing the corpus. A separate fee will be charged for collecting the corpus.
|
||||
|
||||
- The above time is subject to change depending on the project.
|
||||
- The actual fee and time for your customization depend on the **number of wake words you need** and the **scale of your mass production**. For details, please contact our `sales person <sales@espressif.com>`_ .
|
||||
|
||||
- Espressif will only charge a one-time customization fee depending on the number of wake words you customize and the scale of your production, and will not charge license fees for the quantity and time of use. Please email us at `sales@espressif.com <sales@espressif.com>`__ for details of the fee.
|
||||
#. About Espressif wake word engine WakeNet:
|
||||
|
||||
#. If you want to use offline command words
|
||||
- Currently, up to 5 wake words are supported by each WakeNet model.
|
||||
- A wake word usually consists of 3 to 6 syllables, such as "Hi Lexin", "xiaoaitongxue", and "nihaotianmao".
|
||||
- More than one WakeNet model can be used together. However, more resources will be consumed when more models are used.
|
||||
- For more details, see Section :doc:`WakeNet Wake Word Model <README>` .
|
||||
|
||||
- Please set them by yourself referring to `esp-sr <https://github.com/espressif/esp-sr>`__ algorithm. They do not need additional customization.
|
||||
- Similar to speech wake words, the effect of command words is also related to hardware designs, so please refer to *Espressif MIC Design Guidelines*.
|
||||
.. _corpus-requirement:
|
||||
|
||||
Requirements on Corpus
|
||||
--------------------------
|
||||
|
||||
As mentioned above, you can provide your own training corpus for Espressif. Below are the requirements.
|
||||
As mentioned above, customers can provide Espressif with a training corpus collected by themselves or purchased from a third party. However, the corpus must meet the following requirements:
|
||||
|
||||
#. Audio file format
|
||||
- Audio file format
|
||||
|
||||
- Sample rate: 16 kHz
|
||||
- Encoding: 16-bit signed int
|
||||
- Channel: mono
|
||||
- Format: WAV
|
||||
- Sample rate: 16 kHz
|
||||
- Encoding: 16-bit signed int
|
||||
- Channel: mono
|
||||
- Format: WAV
|
||||
|
||||
#. Sampling environment
|
||||
#. Sampling requirement
|
||||
|
||||
- Room with an ambient noise lower than 30 dB and reverberation less than 0.3 s, or a professional audio room (recommended).
|
||||
- Number of samples: more than 500 people, including men and women of all ages and at least 100 children.
|
||||
- Sampling environment: a quiet room (< 40 dB). It is recommended to use a professional audio room.
|
||||
- Recording device: high-fidelity microphone.
|
||||
- How to sample:
|
||||
- At 1 meter away from the microphone: each person speaks the wake word out loud 15 times (5 times at a fast speed, 5 at a normal speed, and 5 at a slow speed).
|
||||
- At 3 meters away from the microphone: each person speaks the wake word out loud 15 times (5 times at a fast speed, 5 at a normal speed, and 5 at a slow speed).
|
||||
- File name: it is recommended to name the samples according to the age, gender, and quantity of the collected samples, such as ``female_age_fast_id.wav``. Or you can use a separate file to present such information.
|
||||
|
||||
- Recording device: high-fidelity microphone.
|
||||
Hardware Design and Test
|
||||
------------------------
|
||||
|
||||
- The whole product is strongly recommended.
|
||||
- The development board of your product also works when there is no cavity structure.
|
||||
The voice wake-up performance heavily depends on the hardware design and cavity structure. Therefore, please pay special attention to the following requirements:
|
||||
|
||||
- Record in 16 kHz, and don't use **resampling**.
|
||||
#. Hardware Design
|
||||
|
||||
- At the recording site, pay attention to the impact of reverberation interference in a closed environment.
|
||||
- Collect samples with multiple recording devices at the same time (recommended).
|
||||
- Speaker designs: customers can make their own designs by modifying the reference designs (schematic/PCB) provided by Espressif. Espressif can also review customers' speaker designs to help avoid common design issues.
|
||||
|
||||
- For example, position the devices at 1 m and 3 m away.
|
||||
- So more samples are collected within the same amount of time and with the same number of participants.
|
||||
- Cavity structure: cavity should be designed by acoustic specialists. Espressif does not provide ID design reference. Customers can refer to other mainstream speaker cavity designs on the market, such as Tmall Genie, Xiaodu Smart Speaker, and Google Smart Speaker, etc.
|
||||
|
||||
#. Sample distribution
|
||||
#. Customers can perform the following tests to verify the hardware designs. It is suggested to perform these tests in a professional audio room; customers can adjust them based on their actual testing environment.
|
||||
|
||||
- Sample size: 500. Males and females should be close to 1:1.
|
||||
- The number of children under 12 years old involved varies from product to product, but the percentage should be no less than 15%.
|
||||
- If there are requirements for certain languages or dialects, special corpus samples need to be provided.
|
||||
- It is recommended to name the samples according to the age, gender, and quantity of the collected samples, such as HiLeXin_male_B_014.wav, and ABCD represents different age groups.
|
||||
- Recording test to verify the gain and distortion of mic and codec
|
||||
|
||||
Hardware Design Guidelines
|
||||
---------------------------
|
||||
- Play the sample (90 dB, 0.1 meter away from the mic), and adjust the gain to ensure that the recording is not saturated.
|
||||
- Use a sweep file of 0~20 kHz, and record at a sampling rate of 16 kHz. The recording should not show obvious frequency aliasing.
|
||||
- Record 100 samples, and feed them to an open cloud voice recognition API. A certain recognition rate must be reached.
|
||||
|
||||
#. Please refer to *Espressif MIC Design Guidelines*.
|
||||
- Playback test to verify the distortion of power amplifier (PA) and speaker
|
||||
|
||||
- Test PA power @ 1% Total Harmonic Distortion (THD)
|
||||
|
||||
- Speech algorithms test to verify the AEC, BFM and NS models
|
||||
|
||||
- Adjust the delays of the reference signals based on the different requirements of different AEC algorithms.
|
||||
- Test the product based on the actual use scenario. For example, play ``85DB-90DB Dreamer.wav`` (a song) and record.
|
||||
- Analyze the processed signals to evaluate the performance of AEC, BFM, NS, etc.
|
||||
|
||||
- DSP performance test to identify the correct DSP parameter and minimize the nonlinear distortion in the DSP algorithm
|
||||
|
||||
- Noise Suppression
|
||||
- Acoustic Echo Cancellation
|
||||
- Speech Enhancement
|
||||
|
||||
#. Customers can also **send** 1 or 2 pieces of hardware to Espressif and ask us to optimize the product for better wakeup performance.
|
||||
|
||||
@ -1,14 +1,14 @@
|
||||
wakeNet
|
||||
========
|
||||
WakeNet Wake Word Model
|
||||
=======================
|
||||
|
||||
:link_to_translation:`zh_CN:[中文]`
|
||||
|
||||
wakeNet, which is a wake word engine built upon neural network, is specially designed for low-power embedded MCUs. Now, the wakeNet model supports up to 5 wake words.
|
||||
WakeNet is a wake word engine built upon a neural network for low-power embedded MCUs. Currently, WakeNet supports up to 5 wake words.
|
||||
|
||||
Overview
|
||||
--------
|
||||
|
||||
Please see the flow diagram of wakeNet below:
|
||||
Please see the flow diagram of WakeNet below:
|
||||
|
||||
.. figure:: ../../_static/wakenet_workflow.png
|
||||
:alt: overview
|
||||
@ -21,30 +21,35 @@ Please see the flow diagram of wakeNet below:
|
||||
|
||||
</center>
|
||||
|
||||
- speech features:
|
||||
- Speech Feature
|
||||
We use the `MFCC <https://en.wikipedia.org/wiki/Mel-frequency_cepstrum>`__ method to extract the speech spectrum features. The input audio has a sample rate of 16 kHz, is mono, and is encoded as signed 16-bit. Each frame has a window width and step size of 30 ms.
|
||||
|
||||
We use the `MFCC <https://en.wikipedia.org/wiki/Mel-frequency_cepstrum>`__ method to extract speech spectrum features. The sampling rate of the input audio file is 16KHz, mono, and the encoding mode is signed 16-bit. The window width and step size of each frame are 30ms.
|
||||
We use `MFCC <https://en.wikipedia.org/wiki/Mel-frequency_cepstrum>`__ method to extract the speech spectrum features. The input audio file has a sample rate of 16KHz, mono, and is encoded as signed 16-bit. each frame has a window width and step size of 30ms.
|
||||
.. only:: latex
|
||||
|
||||
- Speech Feature:
|
||||
|
||||
The wakeNet uses `MFCC <https://en.wikipedia.org/wiki/Mel-frequency_cepstrum>`__ to obtain the features of the input audio clip (16 KHz, 16 bit, single track). The window width and step width of each frame of the audio clip are both 30 ms.
|
||||
|
||||
- Neural Network:
|
||||
.. figure:: ../../_static/QR_MFCC.png
|
||||
:alt: overview
|
||||
|
||||
- Neural Network
|
||||
Currently, the neural network structure has been updated to the ninth edition:
|
||||
|
||||
- wakeNet1,wakeNet2,wakeNet3,wakeNet4,wakeNet6,wakeNet7 had been out of use.
|
||||
- wakeNet5 only support ESP32 chip.
|
||||
- wakeNet8,wakeNet9 only support ESP32S3 chip, which are built upon the `Dilated Convolution <https://arxiv.org/pdf/1609.03499.pdf>`__ structure.
|
||||
- WakeNet1, WakeNet2, WakeNet3, WakeNet4, WakeNet6, and WakeNet7 are no longer in use.
|
||||
- WakeNet5 only supports the ESP32 chip.
|
||||
- WakeNet8 and WakeNet9, which are built upon the `Dilated Convolution <https://arxiv.org/pdf/1609.03499.pdf>`__ structure, only support the ESP32-S3 chip.
|
||||
|
||||
.. note:: text
|
||||
The network structure of wakeNet5,wakeNet5X2 and wakeNet5X3 is same, but the parameter of wakeNetX2 and wakeNetX3 is more than wakeNet5. Please refer to `Performance Test <#performance-test>`__ for details.
|
||||
.. only:: latex
|
||||
|
||||
.. figure:: ../../_static/QR_Dilated_Convolution.png
|
||||
:alt: overview
|
||||
|
||||
The network structure of WakeNet5, WakeNet5X2 and WakeNet5X3 is the same, but WakeNet5X2 and WakeNet5X3 have more parameters than WakeNet5. Please refer to :doc:`Resource Consumption <../benchmark/README>` for details.
|
||||
|
||||
- Keyword Triggering Method:
|
||||
|
||||
For a continuous audio stream, we average the recognition results (M) over several frames to generate a smoothed prediction, which improves the accuracy of keyword triggering. A trigger command is sent only when the M value is larger than the set threshold.
|
||||
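A minimal sketch of this smoothing idea is shown below. It is for illustration only and is not the WakeNet implementation; the window size, the threshold, and the per-frame probability input are placeholder assumptions.

::

    #include <stdbool.h>

    // Illustrative smoothing sketch (not the WakeNet implementation).
    #define SMOOTH_WINDOW  8        // number of frames to average (placeholder)
    #define TRIGGER_THRESH 0.9f     // trigger threshold (placeholder)

    static float history[SMOOTH_WINDOW];
    static int   idx;

    bool keyword_trigger(float frame_prob)
    {
        history[idx] = frame_prob;
        idx = (idx + 1) % SMOOTH_WINDOW;

        float m = 0.0f;                       // M: averaged recognition result
        for (int i = 0; i < SMOOTH_WINDOW; i++) {
            m += history[i];
        }
        m /= SMOOTH_WINDOW;

        return m > TRIGGER_THRESH;            // trigger only above the threshold
    }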
|
||||
The wake words supported by Espressif chips are listed below:
|
||||
|
||||
.. _esp-open-wake-word:
|
||||
|
||||
+-----------------+-----------+-------------+-------------+-----------+-----------+-----------+-----------+
|
||||
| Chip | ESP32 | ESP32S3 |
|
||||
+=================+===========+=============+=============+===========+===========+===========+===========+
|
||||
@ -67,36 +72,33 @@ Please see the flow diagram of wakeNet below:
|
||||
| Customized word | | | | | | | √ |
|
||||
+-----------------+-----------+-------------+-------------+-----------+-----------+-----------+-----------+
|
||||
|
||||
Use wakeNet
|
||||
Use WakeNet
|
||||
-----------
|
||||
|
||||
- How to select the wakeNet model
|
||||
- Select WakeNet model
|
||||
|
||||
Please refer to :doc:`flash model <../flash_model/README>` .
|
||||
To select WakeNet model, please refer to Section :doc:`flash model <../flash_model/README>` .
|
||||
|
||||
- How to run wakeNet
|
||||
To customize wake words, please refer to Section :doc:`Espressif Speech Wake-up Solution Customization Process <ESP_Wake_Words_Customization>`
|
||||
|
||||
wakeNet is currently included in the :doc:`AFE <../audio_front_end/README>`, which is running by default, and returns the detect results through the AFE fetch interface.
|
||||
- Run WakeNet
|
||||
|
||||
If users do not want to initialize WakeNet, please use:
|
||||
WakeNet is currently included in the :doc:`AFE <../audio_front_end/README>`, which is enabled by default, and returns the detection results through the AFE fetch interface.
|
||||
|
||||
If users do not need WakeNet, please use:
|
||||
|
||||
::
|
||||
|
||||
afe_config.wakenet_init = false;
|
||||
|
||||
If users want to close/open WakeNet temporarily, please use:
|
||||
If users want to enable/disable WakeNet temporarily, please use:
|
||||
|
||||
::
|
||||
|
||||
afe_handle->disable_wakenet(afe_data)
|
||||
afe_handle->enable_wakenet(afe_data)
|
||||
|
||||
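As a rough sketch of how the detection result is read back after AFE processing (for illustration only; field and constant names such as ``wakeup_state`` and ``WAKENET_DETECTED`` should be checked against the AFE headers of your esp-sr version):

::

    afe_fetch_result_t *res = afe_handle->fetch(afe_data);
    if (res && res->wakeup_state == WAKENET_DETECTED) {
        // wake word detected
    }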
Performance Test
|
||||
----------------
|
||||
Resource Consumption
|
||||
--------------------
|
||||
|
||||
Please refer to :doc:`Performance Test <../benchmark/README>` .
|
||||
|
||||
Wake Word Customization
|
||||
-----------------------
|
||||
|
||||
For details on how to customize your wake words, please see :doc:`Espressif Speech Wake Word Customization Process <ESP_Wake_Words_Customization>` .
|
||||
Please refer to :doc:`Resource Consumption <../benchmark/README>` .
|
||||
8
docs/en/wake_word_engine/index.rst
Normal file
8
docs/en/wake_word_engine/index.rst
Normal file
@ -0,0 +1,8 @@
|
||||
Wake Word
|
||||
=========
|
||||
|
||||
.. toctree::
|
||||
:maxdepth: 1
|
||||
|
||||
Introduction to WakeNet Wake Word Module <README>
|
||||
Espressif Wake Word Customization <ESP_Wake_Words_Customization>
|
||||
@ -3,45 +3,51 @@
|
||||
|
||||
:link_to_translation:`en:[English]`
|
||||
|
||||
本指南基于乐鑫的 ESP32-S3 系列语音开发板,对于整机 Mic 设计要求如下:
|
||||
本指南基于乐鑫的 {IDF_TARGET_NAME} 系列语音开发板,对于整机 mic 设计做出要求。
|
||||
|
||||
麦克风电器性能推荐
|
||||
------------------
|
||||
|
||||
#. 麦克类型:全向型 MEMS 麦克风
|
||||
#. 灵敏度
|
||||
- 麦克类型:全向型 MEMS 麦克风
|
||||
- 灵敏度
|
||||
|
||||
- 1 Pa 声压下模拟麦灵敏度不低于 -38 dBV,数字麦灵敏度要求不低于 -26 dB。
|
||||
- 公差控制在 ± 2dB,对于麦克阵列推荐采用 ±1 dB 公差。
|
||||
- 公差控制在 ±2 dB,对于麦克阵列推荐采用 ±1 dB 公差。
|
||||
|
||||
#. 信噪比(SNR)
|
||||
- 信噪比(SNR)
|
||||
|
||||
- 信噪比:信噪比不低于 62 dB,推荐 > 64 dB
|
||||
- 频率响应在 50 Hz~16 kHz 范围内的波动在 ±3 dB 之内
|
||||
- 麦克风(MEMS MIC)的 PSRR 应大于 55 dB
|
||||
- 麦克风的 PSRR 应大于 55 dB
|
||||
|
||||
结构设计建议
|
||||
------------
|
||||
麦克风结构设计建议
|
||||
------------------
|
||||
|
||||
#. 麦克孔孔径或宽度推荐大于 1 mm,拾音管道尽量短,腔体尽可能小,保证麦克和结构组件配合的谐振频率在 9KHz 以上。
|
||||
#. 拾音孔深度和直径比小于 2:1,壳体厚度推荐 1 mm,如果壳体过厚,需增大开孔面积。
|
||||
#. 麦克孔上需通过防尘网进行保护。
|
||||
#. 麦克风与设备外壳之间必须加硅胶套或泡棉等进行密封和防震,需进行过盈配合设计,以保证麦克的密封性。
|
||||
#. 麦克孔不能被遮挡,底部拾音的麦克孔需结构上增加凸起,避免麦克孔被桌面等遮挡。
|
||||
#. 麦克需远离喇叭等会产生噪音或振动的物体摆放,且与喇叭音腔之间通过橡胶垫等隔离缓冲。
|
||||
- 麦克孔孔径或宽度推荐大于 1 mm,拾音管道尽量短,腔体尽可能小,保证麦克和结构组件配合的谐振频率在 9 KHz 以上。
|
||||
- 拾音孔深度和直径比小于 2:1,壳体厚度推荐 1 mm,如果壳体过厚,需增大开孔面积。
|
||||
- 麦克孔上需通过防尘网进行保护。
|
||||
- 麦克风与设备外壳之间必须加硅胶套或泡棉等进行密封和防震,需进行过盈配合设计,以保证麦克的密封性。
|
||||
- 麦克孔不能被遮挡,底部拾音的麦克风需结构上增加底部凸起,保证麦克风与桌面等平面有一定间隙。
|
||||
- 麦克需远离喇叭等会产生噪音或振动的物体摆放,且与喇叭音腔之间通过橡胶垫等隔离缓冲。
|
||||
|
||||
麦克阵列设计推荐
|
||||
麦克阵列设计建议
|
||||
----------------
|
||||
|
||||
#. 麦克类型:全向型硅麦,推荐同一个阵列内的麦克应使用同一厂家的同一型号,不建议混用。
|
||||
#. 麦克阵列中各麦克灵敏度差异在 3 dB 之内。
|
||||
#. 相位差:多麦克阵列中麦克之间的相位差控制在 10° 以内。
|
||||
#. 麦克阵列中各麦克的结构设计,推荐采用相同的设计,以保证结构设计的一致性。
|
||||
#. 2 MIC 方案:麦克间距要求 4~6.5 cm,连接两个麦克风的轴线应平行于水平线,且两个麦克的中心尽量靠近产品水平方向的中心。
|
||||
#. 3 MIC 方案:3 个麦克风等间距并且成正圆分布(夹⻆互成 120 度),间距要求 4~6.5 cm。
|
||||
客户可采用双麦克或三麦克方案:
|
||||
|
||||
麦克风结构密封性
|
||||
----------------
|
||||
- 双麦克方案:2 个麦克风之间间距保持 4~6.5 cm,连接 2 个麦克风的轴线应平行于水平线,且 2 个麦克的中心尽量靠近产品水平方向的中心。
|
||||
- 三麦克方案:3 个麦克风等间距并且成正圆分布(夹⻆互成 120 度),间距要求 4~6.5 cm。
|
||||
|
||||
在选择阵列中的麦克风时,有如下注意事项:
|
||||
|
||||
- 麦克类型:全向型硅麦,推荐同一个阵列内的麦克应使用同一厂家的同一型号,不建议混用。
|
||||
- 灵敏度:麦克阵列中各麦克灵敏度差异在 3 dB 之内。
|
||||
- 相位差:多麦克阵列中麦克之间的相位差控制在 10° 以内。
|
||||
- 麦克阵列中各麦克的结构设计,推荐采用相同的设计,以保证结构设计的一致性。
|
||||
|
||||
|
||||
麦克风结构密封性建议
|
||||
--------------------
|
||||
|
||||
用橡皮泥等材料封堵麦克拾音孔,密封前后麦克风采集信号的幅度衰减 25 dB 为合格,推荐 30 dB。测试方法:
|
||||
|
||||
@ -50,15 +56,15 @@
|
||||
#. 用橡皮泥等材料封堵麦克拾音孔,使用麦克风阵列录制 10s 以上,存储为录音文件 B。
|
||||
#. 对比两个文件的频谱,需保证 100 Hz~8 kHz 频段内整体衰减 25 dB 以上。
|
||||
|
||||
回声参考信号设计
|
||||
----------------
|
||||
回声参考信号设计建议
|
||||
--------------------
|
||||
|
||||
#. 回声参考信号推荐尽量靠近喇叭侧,推荐从 DA 后级 PA 前级回采。
|
||||
#. 扬声器音量最大时,输入到麦克的回声参考信号不能有饱和失真,最大音量下喇叭功放输出 THD 满足 100 Hz 小于 10%,200 Hz 小于 6%,350 Hz 以上频率,小于 3%。
|
||||
#. 扬声器音量最大时,麦克处拾音的声压不超过 102 dB (1KHz)。
|
||||
#. 回声参考信号电压不超过 ADC 的最大允许输入电压,电压过高需增加衰减电路。
|
||||
#. 从 D 类功放输出引参考回声信号需增加低通滤波器,滤波器的截止频率推荐 > 22 KHz。
|
||||
#. 音量最大播放时,回采信号峰值 -3 到 -5 dB。
|
||||
- 回声参考信号推荐尽量靠近喇叭侧,推荐从 DA 后级 PA 前级回采。
|
||||
- 扬声器音量最大时,输入到麦克的回声参考信号不能有饱和失真,最大音量下喇叭功放输出 THD 满足 100 Hz 小于 10%,200 Hz 小于 6%,350 Hz 以上频率,小于 3%。
|
||||
- 扬声器音量最大时,麦克处拾音的声压不超过 102 dB (1KHz)。
|
||||
- 回声参考信号电压不超过 ADC 的最大允许输入电压,电压过高需增加衰减电路。
|
||||
- 从 D 类功放输出引参考回声信号需增加低通滤波器,滤波器的截止频率推荐 > 22 KHz。
|
||||
- 音量最大播放时,回采信号峰值 -3 到 -5 dB。
|
||||
|
||||
麦克风阵列一致性验证
|
||||
--------------------
|
||||
@ -66,4 +72,4 @@
|
||||
要求各个麦克风采样信号幅度相差小于 3 dB,测试方法:
|
||||
|
||||
#. 麦克风正上方 0.5 米处,播放白噪声,麦克风处音量 90 dB。
|
||||
#. 使用麦克风阵列录制 10s 以上,查看各 MIC 录音幅度和音频采样率是否一致。
|
||||
#. 使用麦克风阵列录制 10s 以上,查看各 mic 录音幅度和音频采样率是否一致。
|
||||
@ -1,20 +1,45 @@
|
||||
Audio Front-end 框架
|
||||
AFE 声学前端算法框架
|
||||
====================
|
||||
|
||||
:link_to_translation:`en:[English]`
|
||||
|
||||
乐鑫 Audio Front-end (AFE) 算法框架由乐鑫 AI 实验室自主开发。该框架基于 ESP32 系列芯片,能够提供高质量并且稳定的音频数据。
|
||||
|
||||
概述
|
||||
----
|
||||
|
||||
乐鑫 AFE 框架基于乐鑫的 ESP32 系列芯片,可以以最便捷的方式进行语音前端处理。使用乐鑫 AFE 框架,您可以获取高质量且稳定的音频数据,从而更加方便地构建唤醒或语音识别等应用。
|
||||
智能语音设备需要在远场噪声环境中,仍具备出色的语音交互性能,声学前端 (Audio Front-End, AFE) 算法在构建此类语音用户界面 (Voice-User Interface, VUI) 时至关重要。乐鑫 AI 实验室自主研发了一套乐鑫 AFE 算法框架,可基于功能强大的 {IDF_TARGET_NAME} 系列芯片进行声学前端处理,使用户获得高质量且稳定的音频数据,从而构建性能卓越且高性价比的智能语音产品。
|
||||
|
||||
乐鑫 AFE 主要有两个使用场景。
|
||||
.. list-table::
|
||||
:widths: 25 75
|
||||
:header-rows: 1
|
||||
|
||||
* - 名称
|
||||
- 简介
|
||||
* - AEC (Acoustic Echo Cancellation)
|
||||
- 回声消除算法,最多支持双麦处理,能够有效的去除 mic 输入信号中的自身播放声音,从而可以在自身播放音乐的情况下很好的完成语音识别。
|
||||
* - NS (Noise Suppression)
|
||||
- 噪声抑制算法,支持单通道处理,能够对单通道音频中的非人声噪声进行抑制,尤其针对稳态噪声,具有很好的抑制效果。
|
||||
* - BSS (Blind Source Separation)
|
||||
- 盲信号分离算法,支持双通道处理,能够很好的将目标声源和其余干扰音进行盲源分离,从而提取出有用音频信号,保证了后级语音的质量。
|
||||
* - MISO (Multi Input Single Output)
|
||||
- 多输入单输出算法,支持双通道输入,单通道输出。用于在双麦场景,没有唤醒使能的情况下,选择信噪比高的一路音频输出。
|
||||
* - VAD (Voice Activity Detection)
|
||||
- 语音活动检测算法,支持实时输出当前帧的语音活动状态。
|
||||
* - AGC (Automatic Gain Control)
|
||||
- 自动增益控制算法,可以动态调整输出音频的幅值,当弱信号输入时,放大输出幅度;当输入信号达到一定强度时,压缩输出幅度。
|
||||
* - WakeNet
|
||||
- 基于神经网络的唤醒词模型,专为低功耗嵌入式 MCU 设计。
|
||||
|
||||
使用场景
|
||||
--------
|
||||
|
||||
本节将介绍乐鑫 AFE 框架的两个典型使用场景。
|
||||
|
||||
语音识别场景
|
||||
^^^^^^^^^^^^
|
||||
|
||||
工作流程
|
||||
""""""""
|
||||
|
||||
.. figure:: ../../_static/AFE_SR_overview.png
|
||||
:alt: overview
|
||||
|
||||
@ -23,16 +48,18 @@ Audio Front-end 框架
|
||||
.. figure:: ../../_static/AFE_SR_workflow.png
|
||||
:alt: overview
|
||||
|
||||
工作流程
|
||||
""""""""
|
||||
|
||||
#. 使用 **ESP_AFE_SR_HANDLE**,进行 AFE 的创建和初始化(注意, ``voice_communication_init`` 需配置为 false)
|
||||
#. AFE feed,输入音频数据,feed 内部会先进行 AEC 算法处理
|
||||
#. 内部: 进行 BSS/NS 算法处理
|
||||
#. AFE fetch,返回处理过的单通道音频数据和相关信息, fetch 内部会进行 VAD 处理,以及唤醒词的检测,具体行为取决于用户对 ``afe_config_t`` 结构体的配置。(注意: ``wakenet_init`` 和 ``voice_communication_init`` 不可同时配置为 true)。
|
||||
#. 使用 :cpp:func:`ESP_AFE_SR_HANDLE`,创建并初始化 AFE。注意, :cpp:member:`voice_communication_init` 需配置为 false。
|
||||
#. 使用 :cpp:func:`feed`,输入音频数据。feed 内部会先进行 AEC 算法处理
|
||||
#. Feed 内部进行 BSS/NS 算法处理
|
||||
#. 使用 :cpp:func:`fetch`,获得经过处理过的单通道音频数据及相关信息。这里,fetch 内部可以进行 VAD 处理并检测唤醒词等动作,具体可通过 :cpp:type:`afe_config_t` 结构体配置。
|
||||
|
||||
语音通话场景
|
||||
^^^^^^^^^^^^
|
||||
|
||||
工作流程
|
||||
""""""""
|
||||
|
||||
.. figure:: ../../_static/AFE_VOIP_overview.png
|
||||
:alt: overview
|
||||
|
||||
@ -41,30 +68,28 @@ Audio Front-end 框架
|
||||
.. figure:: ../../_static/AFE_VOIP_workflow.png
|
||||
:alt: overview
|
||||
|
||||
工作流程
|
||||
""""""""
|
||||
#. 使用 **ESP_AFE_VC_HANDLE** ,进行 AFE 的创建和初始化 (``voice_communication_init`` 需配置为 true )
|
||||
#. AFE feed,输入音频数据,feed 内部会先进行 AEC 算法处理
|
||||
#. 内部: 首先进行 BSS/NS 算法处理;若为双麦,随后还会进行 MISO 算法处理;
|
||||
#. AFE fetch,返回处理过的单通道音频数据和相关信息。其中会进行 AGC 非线性放大,具体增益值取决于用户对 ``afe_config_t`` 结构体的配置;若为双麦,在 AGC 之前还会进行降噪处理。(注: ``wakenet_init`` 和 ``voice_communication_init`` 不可同时配置为 true)
|
||||
|
||||
#. 使用 :cpp:func:`ESP_AFE_VC_HANDLE`,创建并初始化 AFE。注意, :cpp:member:`voice_communication_init` 需配置为 true。
|
||||
#. 使用 :cpp:func:`feed`,输入音频数据。feed 内部会先进行 AEC 算法处理
|
||||
#. Feed 内部进行 BSS/NS 算法处理。若为双麦,还将额外进行 MISO 算法处理。
|
||||
#. 使用 :cpp:func:`fetch`,获得经过处理过的单通道音频数据及相关信息。这里,可对输出数据进行 AGC 非线性放大,具体增益值可通过 :cpp:type:`afe_config_t` 结构体配置。注意,若为双麦,则在进行 AGC 非线性放大前还会进行降噪处理。
|
||||
|
||||
.. note::
|
||||
``afe->feed()`` 和 ``afe->fetch()`` 对用户可见, ``Internal BSS/NS/MISO Task`` 对用户不可见。
|
||||
|
||||
* AEC 在 ``afe->feed()`` 函数中运行;若 ``aec_init`` 配置为 false 状态,BSS/NS 将会在 ``afe->feed()`` 函数中运行。
|
||||
* BSS/NS/MISO 作为 AFE 内部独立 Task 进行处理。
|
||||
* VAD/WakeNet 的结果,以及处理后的单通道音频,通过 ``afe->fetch()`` 函数获取。
|
||||
#. :cpp:type:`afe_config_t` 结构体中的 :cpp:member:`wakenet_init` 和 :cpp:member:`voice_communication_init` 不可同时配置为 true。
|
||||
#. :cpp:func:`feed` 和 :cpp:func:`fetch` 对用户可见,其他内部 BSS/NS/MISO 算法处理为 AFE 的内部独立任务,对用户不可见。
|
||||
#. AEC 算法处理在 :cpp:func:`feed` 中进行。
|
||||
#. 当 :cpp:member:`aec_init` 配置为 false,BSS/NS 算法处理在 :cpp:func:`feed` 中进行。
|
||||
|
||||
选择 AFE Handle
|
||||
---------------
|
||||
|
||||
目前 AFE 支持单麦和双麦配置,并且可对算法模块进行灵活配置。
|
||||
目前,乐鑫 AFE 框架支持单麦和双麦配置,并允许对算法模块进行灵活配置。
|
||||
|
||||
* 单麦配置:
|
||||
* 内部 Task 由 NS 处理
|
||||
* 内部 Task 由 NS 算法模块处理
|
||||
* 双麦配置:
|
||||
* 双麦场景的内部 Task 由 BSS 处理
|
||||
* 此外,如用于语音通话场景(即 ``wakenet_init=false, voice_communication_init=true``),则会再增加一个 MISO 的内部 Task。
|
||||
* 内部 Task 由 BSS 算法模块处理
|
||||
* 此外,如用于语音通话场景(即 :cpp:member:`wakenet_init` = false 且 :cpp:member:`voice_communication_init` = true),则会再增加一个内部 Task 由 MISO 处理。
|
||||
|
||||
获取 AFE handle 的命令如下:
|
||||
|
||||
@ -80,19 +105,26 @@ Audio Front-end 框架
|
||||
|
||||
esp_afe_sr_iface_t *afe_handle = &ESP_AFE_VC_HANDLE;
|
||||
|
||||
.. _input-audio-1:
|
||||
|
||||
输入音频
|
||||
--------
|
||||
|
||||
目前 AFE 支持单麦和双麦配置,可根据 ``afe->feed()`` 的音频,配置相应的音频通道数。
|
||||
目前,乐鑫 AFE 框架支持单麦和双麦配置,可根据 :cpp:func:`esp_afe_sr_iface_op_feed_t` 的输入音频情况,配置所需的音频通道数。
|
||||
|
||||
修改方式:在宏 ``AFE_CONFIG_DEFAULT()`` 中对 ``pcm_config`` 结构体成员进行配置修改。在配置时有如下要求:
|
||||
具体方式为:
|
||||
配置 :cpp:func:`AFE_CONFIG_DEFAULT()` 中的 :cpp:member:`pcm_config` 结构体成员:
|
||||
|
||||
1. ``total_ch_num = mic_num + ref_num``
|
||||
2. ``ref_num = 0`` 或 ``ref_num = 1`` (由于目前 AEC 仅只支持单回路)
|
||||
* :cpp:member:`total_ch_num`:总通道数
|
||||
* :cpp:member:`mic_num`:麦克风通道数
|
||||
* :cpp:member:`ref_num`:参考回路通道数
|
||||
|
||||
几种支持的配置组合如下:
|
||||
注意,在配置时有如下要求:
|
||||
|
||||
几种支持的配置组合如下:
|
||||
1. :cpp:member:`total_ch_num` = :cpp:member:`mic_num` + :cpp:member:`ref_num`
|
||||
2. :cpp:member:`ref_num` = 0 或 :cpp:member:`ref_num` = 1 (由于目前 AEC 仅只支持单回路)
|
||||
|
||||
在上述要求下,几种支持的配置组合如下:
|
||||
|
||||
::
|
||||
|
||||
@ -101,80 +133,75 @@ Audio Front-end 框架
|
||||
total_ch_num=2, mic_num=2, ref_num=0
|
||||
total_ch_num=3, mic_num=2, ref_num=1
|
||||
|
||||
其中,
|
||||
* ``total_ch_num``:总通道数
|
||||
* ``mic_num``:麦克风通道数
|
||||
* ``ref_num``:参考回路通道数
|
||||
|
||||
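例如,双麦加一路参考回路的一种配置示意如下(仅为示意草图,具体成员名请以实际头文件为准):

::

    afe_config_t afe_config = AFE_CONFIG_DEFAULT();
    afe_config.pcm_config.total_ch_num = 3;    // 总通道数 = mic_num + ref_num
    afe_config.pcm_config.mic_num = 2;         // 麦克风通道数
    afe_config.pcm_config.ref_num = 1;         // 参考回路通道数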
|
||||
AFE 单麦场景
|
||||
AFE 单麦配置
|
||||
^^^^^^^^^^^^
|
||||
* 输入音频格式为 16 KHz,16 bit,双通道(1 个通道为 mic 数据,另一个通道为参考回路)。注意,若不需要 AEC,则可只包含 1 个通道 mic 数据,而无需包含参考回路(即 ``ref_num = 0``)。
|
||||
* 输入数据帧长,会根据用户配置的算法模块不同而有差异, 用户可以使用 ``afe->get_feed_chunksize`` 来获取需要的采样点数目(采样点数据类型为 int16)
|
||||
* 输入音频的 **格式** 为 16 KHz、16 bit、双通道(其中 1 个通道为 mic 数据,另 1 个通道为参考回路)。注意,若不需要 AEC 功能,则可只包含 1 个通道输入 mic 数据,而无需配置参考回路(即可配置 :cpp:member:`ref_num` = 0)。
|
||||
* 根据用户配置的算法模块不同,输入音频的 **帧长** 将有所差异,具体可通过 :cpp:func:`get_feed_chunksize` 来获取需要的采样点数目(采样点数据类型为 ``int16``)。
|
||||
|
||||
数据排布示意如下:
|
||||
|
||||
.. figure:: ../../_static/AFE_mode_0.png
|
||||
:alt: input data of single MIC
|
||||
:alt: input data of single mic
|
||||
:height: 1.2in
|
||||
|
||||
AFE 双麦场景
|
||||
AFE 双麦配置
|
||||
^^^^^^^^^^^^
|
||||
* 输入音频格式为 16 KHz,16 bit,三通道(2 个通道为 mic 数据,另一个通道为参考回路)。注意,若不需要 AEC,则可只包含 2 个通道 mic 数据,而无需包含参考回路(即 ``ref_num = 0``)。
|
||||
* 输入数据帧长,会根据用户配置的算法模块不同而有差异,用户可以使用 ``afe->get_feed_chunksize`` 来获取需要填充的数据量
|
||||
* 输入音频格式为 16 KHz、16 bit、三通道(其中 2 个通道为 mic 数据,另 1 个通道为参考回路)。注意,若不需要 AEC 功能,则可只包含 2 个通道 mic 数据,而无需配置参考回路(即可配置 :cpp:member:`ref_num` = 0)。
|
||||
* 根据用户配置的算法模块不同,输入音频的 **帧长** 将有所差异,具体可通过 :cpp:func:`get_feed_chunksize` 来获取需要填充的数据量(即 :cpp:func:`get_feed_chunksize` * :cpp:member:`total_ch_num` * sizeof(short))。
|
||||
|
||||
数据排布示意如下:
|
||||
|
||||
.. figure:: ../../_static/AFE_mode_other.png
|
||||
:alt: input data of dual MIC
|
||||
:alt: input data of dual mic
|
||||
:height: 0.75in
|
||||
|
||||
这里,数据量 = ``afe->get_feed_chunksize * 通道数 * sizeof(short)``
|
||||
|
||||
AEC 简介
|
||||
""""""""
|
||||
AEC (Acoustic Echo Cancellation) 算法最多支持双麦处理,能够有效的去除 mic 输入信号中的自身播放声音,从而可以在自身播放音乐的情况下很好的完成语音识别。
|
||||
|
||||
NS 简介
|
||||
"""""""
|
||||
NS (Noise Suppression) 算法支持单通道处理,能够对单通道音频中的非人声噪声进行抑制,尤其针对稳态噪声,具有很好的抑制效果。
|
||||
|
||||
BSS 简介
|
||||
""""""""
|
||||
BSS (Blind Source Separation) 算法支持双通道处理,能够很好的将目标声源和其余干扰音进行盲源分离,从而提取出有用音频信号,保证了后级语音的质量。
|
||||
|
||||
MISO 简介
|
||||
"""""""""
|
||||
MISO (Multi Input Single Output) 算法支持双通道输入,单通道输出。用于在双麦场景,没有唤醒使能的情况下,选择信噪比高的一路音频输出。
|
||||
|
||||
VAD 简介
|
||||
""""""""
|
||||
VAD (Voice Activity Detection) 算法支持实时输出当前帧的语音活动状态。
|
||||
|
||||
AGC 简介
|
||||
""""""""
|
||||
AGC (Automatic Gain Control) 动态调整输出音频的幅值,当弱信号输入时,放大输出幅度;当输入信号达到一定强度时,压缩输出幅度。
|
||||
|
||||
WakeNet 或 Bypass 简介
|
||||
""""""""""""""""""""""
|
||||
用户可以选择是否在 AFE 中进行唤醒词的识别。当用户调用 ``afe->disable_wakenet(afe_data)`` 后,则进入 Bypass 模式,此时 AFE 模块不会进行唤醒词的识别。
|
||||
|
||||
输出音频
|
||||
--------
|
||||
|
||||
AFE 的输出音频为单通道数据。
|
||||
AFE 的输出音频为单通道数据:
|
||||
* 语音识别场景:在 WakeNet 开启的情况下,输出有目标人声的单通道数据
|
||||
* 语音通话场景:输出信噪比更高的单通道数据
|
||||
|
||||
|
||||
使能唤醒词识别 WakeNet
|
||||
----------------------
|
||||
|
||||
在进行 AFE 声学前端处理时,用户可选择是否使能 :doc:`WakeNet <../wake_word_engine/README>` 进行唤醒词识别。
|
||||
|
||||
当用户在唤醒后需要进行其他操作,比如离线或在线语音识别,这时候可以暂停 WakeNet 的运行,从而减轻 CPU 的资源消耗。此时,仅需调用 :cpp:func:`disable_wakenet()`,进入 Bypass 模式。
|
||||
|
||||
当后续应用结束后又可以调用 :cpp:func:`enable_wakenet()` 再次使能 WakeNet。
|
||||
|
||||
.. only:: esp32
|
||||
|
||||
ESP32 芯片只支持一个唤醒词,不支持唤醒词切换。
|
||||
|
||||
.. only:: esp32s3
|
||||
|
||||
ESP32-S3 芯片支持唤醒词切换。在 AFE 初始化完成后,ESP32-S3 芯片可允许用户通过 :cpp:func:`set_wakenet()` 函数切换唤醒词。例如, ``set_wakenet(afe_data, “wn9_hilexin”)`` 切换到 “Hi Lexin” 唤醒词。有关如何配置多个唤醒词的详细介绍,请见 :doc:`模型加载 <../flash_model/README>`。
|
||||
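以下为暂停/恢复 WakeNet 的示意片段(仅为示意草图,唤醒词模型名称为示例值):

::

    // 唤醒后暂停 WakeNet,进入 Bypass 模式,进行命令词识别等其他操作
    afe_handle->disable_wakenet(afe_data);

    // 其他操作结束后,重新使能 WakeNet
    afe_handle->enable_wakenet(afe_data);

    // 切换唤醒词(仅 ESP32-S3 支持,模型名称为示例)
    afe_handle->set_wakenet(afe_data, "wn9_hilexin");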
|
||||
使能回声消除算法 AEC
|
||||
--------------------
|
||||
|
||||
AEC 的使用和 WakeNet 相似,用户可以根据自己的需求来停止或开启 AEC。
|
||||
|
||||
- 停止 AEC
|
||||
|
||||
``afe->disable_aec(afe_data);``
|
||||
|
||||
- 开启 AEC
|
||||
|
||||
``afe->enable_aec(afe_data);``
|
||||
|
||||
.. only:: html
|
||||
|
||||
快速开始
|
||||
编程指南
|
||||
--------
|
||||
|
||||
定义 afe_handle
|
||||
^^^^^^^^^^^^^^^
|
||||
定义 afe_handle 函数句柄
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
``afe_handle`` 是用户后续调用 afe 接口的函数句柄。所以第一步需先获得 ``afe_handle``。
|
||||
首先配置 ``afe_handle`` 函数句柄,后续才可以调用 afe 接口。具体配置方式如下:
|
||||
|
||||
- 语音识别
|
||||
|
||||
@ -191,7 +218,7 @@ AFE 的输出音频为单通道数据。
|
||||
配置 afe
|
||||
^^^^^^^^
|
||||
|
||||
获取 afe 的配置:
|
||||
配置 afe:
|
||||
|
||||
::
|
||||
|
||||
@ -208,7 +235,7 @@ AFE 的输出音频为单通道数据。
|
||||
.se_init = true, \
|
||||
// 配置是否使能 VAD(仅用于语音识别场景)
|
||||
.vad_init = true, \
|
||||
// 配置是否使能唤
|
||||
// 配置是否使能唤醒
|
||||
.wakenet_init = true, \
|
||||
// 配置是否使能语音通话(不可与 wakenet_init 同时使能)
|
||||
.voice_communication_init = false, \
|
||||
@ -218,7 +245,7 @@ AFE 的输出音频为单通道数据。
|
||||
.voice_communication_agc_gain = 15, \
|
||||
// 配置 VAD 检测的操作模式,越大越激进
|
||||
.vad_mode = VAD_MODE_3, \
|
||||
//
|
||||
// 配置唤醒模型,详见下方描述
|
||||
.wakenet_model_name = NULL, \
|
||||
// 配置唤醒模式(对应为多少通道的唤醒,根据mic通道的数量选择)
|
||||
.wakenet_mode = DET_MODE_2CH_90, \
|
||||
@ -242,43 +269,42 @@ AFE 的输出音频为单通道数据。
|
||||
.pcm_config.ref_num = 1, \
|
||||
}
|
||||
|
||||
- wakenet_model_name: 宏 ``AFE_CONFIG_DEFAULT()`` 中该值默认为 NULL。使用 ``idf.py menuconfig`` 选择了相应的唤醒模型后,在调用 ``afe_handle->create_from_config`` 之前,需给该处赋值具体的模型名字,类型为字符串形式。唤醒模型的具体说明,详见::doc:`flash_model <../flash_model/README>` (注意:示例代码中,使用了 ``esp_srmodel_filter()`` 获取模型名字,若 ``menuconfig`` 中选择了多个模型共存,该函数将会随机返回一个模型名字)
|
||||
- afe_mode: 乐鑫 AFE 目前支持 2 种工作模式,分别为: ``SR_MODE_LOW_COS`` 和 ``SR_MODE_HIGH_PERF``。详细可见 ``afe_sr_mode_t`` 枚举。
|
||||
- SR_MODE_LOW_COST: 量化版本,占用资源较少。
|
||||
- SR_MODE_HIGH_PERF: 非量化版本,占用资源较多。
|
||||
|
||||
.. note::
|
||||
ESP32 芯片,只支持 ``SR_MODE_HIGH_PERF`` 模式;ESP32-S3 芯片,两种模式均支持。
|
||||
* :cpp:member:`wakenet_model_name` :配置唤醒模型。宏 :cpp:type:`AFE_CONFIG_DEFAULT()` 中该值默认为 NULL。注意:
|
||||
* 在使用 ``idf.py menuconfig`` 选择了相应的唤醒模型后,在调用 :cpp:member:`create_from_config` 之前,需要将此值配置为具体的模型名称,类型为字符串形式。有关唤醒模型的详细介绍,请见::doc:`模型加载 <../flash_model/README>` 。
|
||||
* :cpp:func:`esp_srmodel_filter()` 可用于获取模型名称。但若 ``idf.py menuconfig`` 中选择了多模型共存,则该函数将会随机返回一个模型名称。
|
||||
|
||||
- memory_alloc_mode: 内存分配的模式。可配置三个值:
|
||||
- AFE_MEMORY_ALLOC_MORE_INTERNAL:更多的从内部ram分配。
|
||||
* :cpp:member:`afe_mode` :配置 AFE 工作模式:
|
||||
|
||||
- AFE_MEMORY_ALLOC_INTERNAL_PSRAM_BALANCE:部分从内部ram分配。
|
||||
.. list::
|
||||
|
||||
- AFE_MEMORY_ALLOC_MORE_PSRAM:绝大部分从外部psram分配
|
||||
:esp32s3: - :cpp:enumerator:`SR_MODE_LOW_COST` :量化版本,占用资源较少。
|
||||
- :cpp:enumerator:`SR_MODE_HIGH_PERF` :非量化版本,占用资源较多。
|
||||
|
||||
- agc_mode: 将音频线性放大的 level 配置,该配置在语音识别场景下起作用,并且在唤醒使能时才生效。可配置四个值:
|
||||
详情可见 :cpp:enumerator:`afe_sr_mode_t` 。
|
||||
|
||||
- AFE_MN_PEAK_AGC_MODE_1:线性放大喂给后续multinet的音频,峰值处为 -5 dB。
|
||||
* :cpp:member:`memory_alloc_mode` :配置内存分配的模式:
|
||||
- :cpp:enumerator:`AFE_MEMORY_ALLOC_MORE_INTERNAL` :更多从内部 ram 分配
|
||||
- :cpp:enumerator:`AFE_MEMORY_ALLOC_INTERNAL_PSRAM_BALANCE` :部分从内部 ram 分配
|
||||
- :cpp:enumerator:`AFE_MEMORY_ALLOC_MORE_PSRAM` :更多从外部 psram 分配
|
||||
|
||||
- AFE_MN_PEAK_AGC_MODE_2:线性放大喂给后续multinet的音频,峰值处为 -4 dB。
|
||||
- :cpp:member:`agc_mode` :配置音频线性放大的 level。注意,该配置仅适用语音识别场景下,且在唤醒使能时才生效。可配置四个值:
|
||||
- :cpp:enumerator:`AFE_MN_PEAK_AGC_MODE_1` :线性放大喂给后续 MultiNet 的音频,峰值处为 -5 dB。
|
||||
- :cpp:enumerator:`AFE_MN_PEAK_AGC_MODE_2` :线性放大喂给后续 MultiNet 的音频,峰值处为 -4 dB。
|
||||
- :cpp:enumerator:`AFE_MN_PEAK_AGC_MODE_3` :线性放大喂给后续 MultiNet 的音频,峰值处为 -3 dB。
|
||||
- :cpp:enumerator:`AFE_MN_PEAK_NO_AGC` :不做线性放大
|
||||
|
||||
- AFE_MN_PEAK_AGC_MODE_3:线性放大喂给后续multinet的音频,峰值处为 -3 dB。
|
||||
- :cpp:member:`pcm_config` :根据 :cpp:func:`feed` 喂入的音频结构进行配置,该结构体有三个成员变量需要配置:
|
||||
- :cpp:member:`total_ch_num` :音频的总通道数
|
||||
- :cpp:member:`mic_num` :音频的麦克风通道数
|
||||
- :cpp:member:`ref_num` :音频的参考回路通道数
|
||||
|
||||
- AFE_MN_PEAK_NO_AGC:不做线性放大
|
||||
|
||||
- pcm_config: 根据 ``afe->feed()`` 喂入的音频结构进行配置,该结构体有三个成员变量需要配置:
|
||||
|
||||
- total_ch_num:音频总的通道数, ``total_ch_num = mic_num + ref_num``。
|
||||
|
||||
- mic_num: 音频的麦克风通道数。目前仅支持配置为 1 或 2。
|
||||
|
||||
- ref_num: 音频的参考回路通道数,目前仅支持配置为 0 或 1。
|
||||
在配置时有一定注意事项,详见 :ref:`input-audio-1`。
|
||||
|
||||
创建 afe_data
|
||||
"""""""""""""
|
||||
^^^^^^^^^^^^^
|
||||
|
||||
用户使用 ``afe_handle->create_from_config(&afe_config)`` 函数来获得数据句柄,这将会在afe内部使用,传入的参数即为上面第2步中获得的配置。
|
||||
使用上一步配置好的 afe_config 创建数据句柄 ``afe_data``,使用的函数为 :cpp:func:`esp_afe_sr_iface_op_create_from_config_t` 。
|
||||
|
||||
::
|
||||
|
||||
@ -291,11 +317,9 @@ AFE 的输出音频为单通道数据。
|
||||
typedef esp_afe_sr_data_t* (*esp_afe_sr_iface_op_create_from_config_t)(afe_config_t *afe_config);
|
||||
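下面给出一个配置并创建 ``afe_data`` 的示意片段(仅为示意草图,其中模型名称 ``wn9_hilexin`` 为假设值,请以工程中实际选择的模型为准):

::

    afe_config_t afe_config = AFE_CONFIG_DEFAULT();
    afe_config.wakenet_model_name = "wn9_hilexin";    // 假设的唤醒模型名称
    esp_afe_sr_data_t *afe_data = afe_handle->create_from_config(&afe_config);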
|
||||
feed 音频数据
|
||||
"""""""""""""
|
||||
^^^^^^^^^^^^^
|
||||
|
||||
在初始化 AFE 完成后,用户需要将音频数据使用 ``afe_handle->feed()`` 函数输入到 AFE 中进行处理。
|
||||
|
||||
输入的音频大小和排布格式可以参考 **输入音频** 这一步骤。
|
||||
在初始化 AFE 完成后,使用 :cpp:func:`feed` 函数,将音频数据输入到 AFE 模块中进行处理。输入音频的格式详见 :ref:`input-audio-1` 。
|
||||
|
||||
::
|
||||
|
||||
@ -314,9 +338,9 @@ AFE 的输出音频为单通道数据。
|
||||
typedef int (*esp_afe_sr_iface_op_feed_t)(esp_afe_sr_data_t *afe, const int16_t* in);
|
||||
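以下为调用 ``feed`` 的示意片段(仅为示意草图,其中 ``get_i2s_audio()`` 为假设的用户音频采集函数)。实际工程中,``feed`` 和 ``fetch`` 通常放在两个独立的任务中运行:

::

    int feed_chunksize = afe_handle->get_feed_chunksize(afe_data);
    int total_ch_num = afe_handle->get_total_channel_num(afe_data);
    int16_t *feed_buf = malloc(feed_chunksize * total_ch_num * sizeof(int16_t));

    while (true) {
        // 假设的采集函数:从 I2S 读取 feed_chunksize * total_ch_num 个采样点
        get_i2s_audio(feed_buf, feed_chunksize * total_ch_num);
        afe_handle->feed(afe_data, feed_buf);
    }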
|
||||
获取音频通道数
|
||||
""""""""""""""
|
||||
^^^^^^^^^^^^^^
|
||||
|
||||
使用 ``afe_handle->get_total_channel_num()`` 函数可以获取需要传入 ``afe_handle->feed()`` 函数的总数据通道数。其返回值等于AFE_CONFIG_DEFAULT()中配置的 ``pcm_config.mic_num + pcm_config.ref_num``
|
||||
使用 :cpp:func:`get_total_channel_num()` 函数获取需要传入 :cpp:func:`feed()` 函数的音频总通道数,其返回值等于 :cpp:func:`AFE_CONFIG_DEFAULT()` 中配置的 ``pcm_config.mic_num + pcm_config.ref_num`` 。
|
||||
|
||||
::
|
||||
|
||||
@ -329,11 +353,11 @@ AFE 的输出音频为单通道数据。
|
||||
typedef int (*esp_afe_sr_iface_op_get_total_channel_num_t)(esp_afe_sr_data_t *afe);
|
||||
|
||||
fetch 音频数据
|
||||
"""""""""""""""
|
||||
^^^^^^^^^^^^^^
|
||||
|
||||
用户调用 ``afe_handle->fetch()`` 函数可以获取处理完成的单通道音频以及相关处理信息。
|
||||
用户调用 :cpp:func:`fetch` 函数,获取经过处理过的单通道音频数据及相关信息。
|
||||
|
||||
fetch 的数据采样点数目(采样点数据类型为 int16)可以通过 ``afe_handle->get_fetch_chunksize`` 获取。
|
||||
:cpp:func:`fetch` 的数据采样点数目(采样点数据类型为 ``int16``)可以通过 :cpp:func:`get_fetch_chunksize` 获取。
|
||||
|
||||
::
|
||||
|
||||
@ -348,7 +372,7 @@ AFE 的输出音频为单通道数据。
|
||||
*/
|
||||
typedef int (*esp_afe_sr_iface_op_get_samp_chunksize_t)(esp_afe_sr_data_t *afe);
|
||||
|
||||
``afe_handle->fetch()`` 的函数声明如下:
|
||||
:cpp:func:`fetch` 的函数声明如下:
|
||||
|
||||
::
|
||||
|
||||
@ -381,25 +405,3 @@ AFE 的输出音频为单通道数据。
|
||||
int ret_value; // the return state of fetch function
|
||||
void* reserved; // reserved for future use
|
||||
} afe_fetch_result_t;
|
||||
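以下为调用 ``fetch`` 并检查返回结果的示意片段(仅为示意草图,字段名称及取值请以上述结构体定义和头文件为准):

::

    afe_fetch_result_t *result = afe_handle->fetch(afe_data);
    if (result && result->wakeup_state == WAKENET_DETECTED) {
        // 检测到唤醒词,可在此处启动命令词识别等后续处理
    }
    // result->data 为处理后的单通道音频,长度为 result->data_size 字节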
|
||||
WakeNet 使用
|
||||
""""""""""""
|
||||
|
||||
当用户在唤醒后需要进行其他操作,比如离线或在线语音识别,这时候可以暂停 WakeNet 的运行,从而减轻 CPU 的资源消耗。
|
||||
|
||||
用户可以调用 ``afe_handle->disable_wakenet(afe_data)`` 来停止 WakeNet。当后续应用结束后又可以调用 ``afe_handle->enable_wakenet(afe_data)`` 来开启 WakeNet。
|
||||
|
||||
另外,ESP32-S3 芯片还可支持唤醒词切换。(ESP32 芯片只支持一个唤醒词,不支持切换)。在初始化 AFE 完成后,ESP32-S3 芯片可通过 ``set_wakenet()`` 函数切换唤醒词。例如, ``afe_handle->set_wakenet(afe_data, “wn9_hilexin”)`` 切换到 “Hi Lexin” 唤醒词。具体如何配置多个唤醒词,详见: :doc:`flash_model <../flash_model/README>`。
|
||||
|
||||
AEC 使用
|
||||
--------
|
||||
|
||||
AEC 的使用和 WakeNet 相似,用户可以根据自己的需求来停止或开启 AEC。
|
||||
|
||||
- 停止 AEC
|
||||
|
||||
``afe->disable_aec(afe_data);``
|
||||
|
||||
- 开启 AEC
|
||||
|
||||
``afe->enable_aec(afe_data);``
|
||||
|
||||
8
docs/zh_CN/audio_front_end/index.rst
Normal file
8
docs/zh_CN/audio_front_end/index.rst
Normal file
@ -0,0 +1,8 @@
|
||||
AFE 声学前端
|
||||
============
|
||||
|
||||
.. toctree::
|
||||
:maxdepth: 1
|
||||
|
||||
AFE 声学前端模型简介 <README>
|
||||
乐鑫麦克风设计指南 <Espressif_Microphone_Design_Guidelines>
|
||||
@ -6,93 +6,96 @@
|
||||
AFE
|
||||
---
|
||||
|
||||
资源占用(ESP32)
|
||||
~~~~~~~~~~~~~~~
|
||||
资源占用
|
||||
~~~~~~~~
|
||||
|
||||
+-----------------+-----------------+-----------------+-----------------+
|
||||
| algorithm Type | RAM | Average cpu | Frame Length |
|
||||
| | | loading(compute | |
|
||||
| | | with 2 cores) | |
|
||||
+=================+=================+=================+=================+
|
||||
| AEC(HIGH_PERF) | 114 KB | 11% | 32 ms |
|
||||
+-----------------+-----------------+-----------------+-----------------+
|
||||
| NS | 27 KB | 5% | 10 ms |
|
||||
+-----------------+-----------------+-----------------+-----------------+
|
||||
| AFE Layer | 73 KB | | |
|
||||
+-----------------+-----------------+-----------------+-----------------+
|
||||
.. only:: esp32
|
||||
|
||||
资源占用(ESP32S3)
|
||||
~~~~~~~~~~~~~~~~~
|
||||
+-----------------+-----------------+-----------------+-----------------+
|
||||
| algorithm Type | RAM | Average cpu | Frame Length |
|
||||
| | | loading(compute | |
|
||||
| | | with 2 cores) | |
|
||||
+=================+=================+=================+=================+
|
||||
| AEC(HIGH_PERF) | 114 KB | 11% | 32 ms |
|
||||
+-----------------+-----------------+-----------------+-----------------+
|
||||
| NS | 27 KB | 5% | 10 ms |
|
||||
+-----------------+-----------------+-----------------+-----------------+
|
||||
| AFE Layer | 73 KB | | |
|
||||
+-----------------+-----------------+-----------------+-----------------+
|
||||
|
||||
.. only:: esp32s3
|
||||
|
||||
+-----------------+-----------------+-----------------+-----------------+
|
||||
| algorithm Type | RAM | Average cpu | Frame Length |
|
||||
| | | loading(compute | |
|
||||
| | | with 2 cores) | |
|
||||
+=================+=================+=================+=================+
|
||||
| AEC(LOW_COST) | 152.3 KB | 8% | 32 ms |
|
||||
+-----------------+-----------------+-----------------+-----------------+
|
||||
| AEC(HIGH_PERF) | 166 KB | 11% | 32 ms |
|
||||
+-----------------+-----------------+-----------------+-----------------+
|
||||
| BSS(LOW_COST) | 198.7 KB | 6% | 64 ms |
|
||||
+-----------------+-----------------+-----------------+-----------------+
|
||||
| BSS(HIGH_PERF) | 215.5 KB | 7% | 64 ms |
|
||||
+-----------------+-----------------+-----------------+-----------------+
|
||||
| NS | 27 KB | 5% | 10 ms |
|
||||
+-----------------+-----------------+-----------------+-----------------+
|
||||
| MISO | 56 KB | 8% | 16 ms |
|
||||
+-----------------+-----------------+-----------------+-----------------+
|
||||
| AFE Layer | 227 KB | | |
|
||||
+-----------------+-----------------+-----------------+-----------------+
|
||||
|
||||
+-----------------+-----------------+-----------------+-----------------+
|
||||
| algorithm Type | RAM | Average cpu | Frame Length |
|
||||
| | | loading(compute | |
|
||||
| | | with 2 cores) | |
|
||||
+=================+=================+=================+=================+
|
||||
| AEC(LOW_COST) | 152.3 KB | 8% | 32 ms |
|
||||
+-----------------+-----------------+-----------------+-----------------+
|
||||
| AEC(HIGH_PERF) | 166 KB | 11% | 32 ms |
|
||||
+-----------------+-----------------+-----------------+-----------------+
|
||||
| BSS(LOW_COST) | 198.7 KB | 6% | 64 ms |
|
||||
+-----------------+-----------------+-----------------+-----------------+
|
||||
| BSS(HIGH_PERF) | 215.5 KB | 7% | 64 ms |
|
||||
+-----------------+-----------------+-----------------+-----------------+
|
||||
| NS | 27 KB | 5% | 10 ms |
|
||||
+-----------------+-----------------+-----------------+-----------------+
|
||||
| MISO | 56 KB | 8% | 16 ms |
|
||||
+-----------------+-----------------+-----------------+-----------------+
|
||||
| AFE Layer | 227 KB | | |
|
||||
+-----------------+-----------------+-----------------+-----------------+
|
||||
|
||||
WakeNet
|
||||
-------
|
||||
|
||||
.. _resource-occupancyesp32-1:
|
||||
|
||||
资源占用(ESP32)
|
||||
~~~~~~~~~~~~~~~
|
||||
资源占用
|
||||
~~~~~~~~
|
||||
|
||||
+-------------+-------------+-------------+-------------+-------------+
|
||||
| Model Type | Parameter | RAM | Average | Frame |
|
||||
| | Num | | Running | Length |
|
||||
| | | | Time per | |
|
||||
| | | | Frame | |
|
||||
+=============+=============+=============+=============+=============+
|
||||
| Quantised | 41 K | 15 KB | 5.5 ms | 30 ms |
|
||||
| WakeNet5 | | | | |
|
||||
+-------------+-------------+-------------+-------------+-------------+
|
||||
| Quantised | 165 K | 20 KB | 10.5 ms | 30 ms |
|
||||
| WakeNet5X2 | | | | |
|
||||
+-------------+-------------+-------------+-------------+-------------+
|
||||
| Quantised | 371 K | 24 KB | 18 ms | 30 ms |
|
||||
| WakeNet5X3 | | | | |
|
||||
+-------------+-------------+-------------+-------------+-------------+
|
||||
.. only:: esp32
|
||||
|
||||
+-------------+-------------+-------------+-------------+-------------+
|
||||
| Model Type | Parameter | RAM | Average | Frame |
|
||||
| | Num | | Running | Length |
|
||||
| | | | Time per | |
|
||||
| | | | Frame | |
|
||||
+=============+=============+=============+=============+=============+
|
||||
| Quantised | 41 K | 15 KB | 5.5 ms | 30 ms |
|
||||
| WakeNet5 | | | | |
|
||||
+-------------+-------------+-------------+-------------+-------------+
|
||||
| Quantised | 165 K | 20 KB | 10.5 ms | 30 ms |
|
||||
| WakeNet5X2 | | | | |
|
||||
+-------------+-------------+-------------+-------------+-------------+
|
||||
| Quantised | 371 K | 24 KB | 18 ms | 30 ms |
|
||||
| WakeNet5X3 | | | | |
|
||||
+-------------+-------------+-------------+-------------+-------------+
|
||||
|
||||
.. _resource-occupancyesp32s3-1:
|
||||
|
||||
资源占用(ESP32S3)
|
||||
~~~~~~~~~~~~~~~~~
|
||||
.. only:: esp32s3
|
||||
|
||||
+----------------+-------+---------+----------------+--------------+
|
||||
| Model Type | RAM | PSRAM | Average | Frame Length |
|
||||
| | | | Running Time | |
|
||||
| | | | per Frame | |
|
||||
+================+=======+=========+================+==============+
|
||||
| Quantised | 50 KB | 1640 KB | 10.0 ms | 32 ms |
|
||||
| WakeNet8 @ 2 | | | | |
|
||||
| channel | | | | |
|
||||
+----------------+-------+---------+----------------+--------------+
|
||||
| Quantised | 16 KB | 324 KB | 3.0 ms | 32 ms |
|
||||
| WakeNet9 @ 2 | | | | |
|
||||
| channel | | | | |
|
||||
+----------------+-------+---------+----------------+--------------+
|
||||
| Quantised | 20 KB | 347 KB | 4.3 ms | 32 ms |
|
||||
| WakeNet9 @ 3 | | | | |
|
||||
| channel | | | | |
|
||||
+----------------+-------+---------+----------------+--------------+
|
||||
+----------------+-------+---------+----------------+--------------+
|
||||
| Model Type | RAM | PSRAM | Average | Frame Length |
|
||||
| | | | Running Time | |
|
||||
| | | | per Frame | |
|
||||
+================+=======+=========+================+==============+
|
||||
| Quantised | 50 KB | 1640 KB | 10.0 ms | 32 ms |
|
||||
| WakeNet8 @ 2 | | | | |
|
||||
| channel | | | | |
|
||||
+----------------+-------+---------+----------------+--------------+
|
||||
| Quantised | 16 KB | 324 KB | 3.0 ms | 32 ms |
|
||||
| WakeNet9 @ 2 | | | | |
|
||||
| channel | | | | |
|
||||
+----------------+-------+---------+----------------+--------------+
|
||||
| Quantised | 20 KB | 347 KB | 4.3 ms | 32 ms |
|
||||
| WakeNet9 @ 3 | | | | |
|
||||
| channel | | | | |
|
||||
+----------------+-------+---------+----------------+--------------+
|
||||
|
||||
性能测试
|
||||
~~~~~~~~~
|
||||
~~~~~~~~
|
||||
|
||||
+-------------+-------------+-------------+-------------+-------------+
|
||||
| Distance | Quiet | Stationary | Speech | AEC |
|
||||
@ -105,49 +108,52 @@ WakeNet
|
||||
| 3 m | 98% | 96% | 94% | 94% |
|
||||
+-------------+-------------+-------------+-------------+-------------+
|
||||
|
||||
误触发率:12小时1次
|
||||
误触发率:12 小时 1 次
|
||||
|
||||
**Note**: 我们在测试中使用了 ESP32-S3-Korvo V4.0 开发板和 WakeNet9(Alexa) 模型。
|
||||
.. note::
|
||||
|
||||
我们在测试中使用了 ESP32-S3-Korvo V4.0 开发板和 WakeNet9(Alexa) 模型。
|
||||
|
||||
MultiNet
|
||||
--------
|
||||
|
||||
.. _resource-occupancyesp32-2:
|
||||
|
||||
资源占用(ESP32)
|
||||
~~~~~~~~~~~~~~~
|
||||
资源占用
|
||||
~~~~~~~~
|
||||
|
||||
+-------------+-------------+-------------+-------------+-------------+
|
||||
| Model Type | Internal | PSRAM | Average | Frame |
|
||||
| | RAM | | Running | Length |
|
||||
| | | | Time per | |
|
||||
| | | | Frame | |
|
||||
+=============+=============+=============+=============+=============+
|
||||
| MultiNet 2 | 13.3 KB | 9KB | 38 ms | 30 ms |
|
||||
+-------------+-------------+-------------+-------------+-------------+
|
||||
.. only:: esp32
|
||||
|
||||
+-------------+-------------+-------------+-------------+-------------+
|
||||
| Model Type | Internal | PSRAM | Average | Frame |
|
||||
| | RAM | | Running | Length |
|
||||
| | | | Time per | |
|
||||
| | | | Frame | |
|
||||
+=============+=============+=============+=============+=============+
|
||||
| MultiNet 2 | 13.3 KB | 9KB | 38 ms | 30 ms |
|
||||
+-------------+-------------+-------------+-------------+-------------+
|
||||
|
||||
.. _resource-occupancyesp32s3-2:
|
||||
|
||||
资源占用(ESP32S3)
|
||||
~~~~~~~~~~~~~~~~~
|
||||
.. only:: esp32s3
|
||||
|
||||
+-------------+-------------+-------------+-------------+-------------+
|
||||
| Model Type | Internal | PSRAM | Average | Frame |
|
||||
| | RAM | | Running | Length |
|
||||
| | | | Time per | |
|
||||
| | | | Frame | |
|
||||
+=============+=============+=============+=============+=============+
|
||||
| MultiNet 4 | 16.8KB | 1866 KB | 18 ms | 32 ms |
|
||||
+-------------+-------------+-------------+-------------+-------------+
|
||||
| MultiNet 4 | 10.5 KB | 1009 KB | 11 ms | 32 ms |
|
||||
| Q8 | | | | |
|
||||
+-------------+-------------+-------------+-------------+-------------+
|
||||
| MultiNet 5 | 16 KB | 2310 KB | 12 ms | 32 ms |
|
||||
| Q8 | | | | |
|
||||
+-------------+-------------+-------------+-------------+-------------+
|
||||
+-------------+-------------+-------------+-------------+-------------+
|
||||
| Model Type | Internal | PSRAM | Average | Frame |
|
||||
| | RAM | | Running | Length |
|
||||
| | | | Time per | |
|
||||
| | | | Frame | |
|
||||
+=============+=============+=============+=============+=============+
|
||||
| MultiNet 4 | 16.8KB | 1866 KB | 18 ms | 32 ms |
|
||||
+-------------+-------------+-------------+-------------+-------------+
|
||||
| MultiNet 4 | 10.5 KB | 1009 KB | 11 ms | 32 ms |
|
||||
| Q8 | | | | |
|
||||
+-------------+-------------+-------------+-------------+-------------+
|
||||
| MultiNet 5 | 16 KB | 2310 KB | 12 ms | 32 ms |
|
||||
| Q8 | | | | |
|
||||
+-------------+-------------+-------------+-------------+-------------+
|
||||
|
||||
AFE 的性能
|
||||
~~~~~~~~~~
|
||||
性能测试
|
||||
~~~~~~~~
|
||||
|
||||
+-----------+-----------+-----------+-----------+-----------+
|
||||
| Model | Distance | Quiet | S | Speech |
|
||||
|
||||
@ -1,30 +1,32 @@
|
||||
模型加载方式
|
||||
============
|
||||
模型加载
|
||||
========
|
||||
|
||||
:link_to_translation:`en:[English]`
|
||||
|
||||
在 esp-sr 中,WakeNet 和 MultiNet 均会使用到大量的模型数据,模型数据位于 ``ESP-SR_PATH/model/`` 中。 目前 esp-sr 支持以下模型加载方式:
|
||||
ESP-SR 的 WakeNet 和 MultiNet 均会使用大量的模型数据,模型数据位于 :project:`model` 中。目前 ESP-SR 支持以下模型加载方式:
|
||||
|
||||
ESP32:
|
||||
.. only:: esp32
|
||||
|
||||
- 从 Flash 中直接加载
|
||||
ESP32:从 Flash 中直接加载
|
||||
|
||||
ESP32S3:
|
||||
.. only:: esp32s3
|
||||
|
||||
- 从 Flash spiffs 分区加载
|
||||
- 从外部 SDCard 加载
|
||||
ESP32-S3:
|
||||
|
||||
从而在 ESP32S3 上可以:
|
||||
- 从 SPI 闪存文件系统 (SPIFFS) 分区加载
|
||||
- 从外部 SD 卡加载
|
||||
|
||||
- 大大减小用户应用 APP BIN 的大小
|
||||
- 支持选择最多两个唤醒词
|
||||
- 支持中文和英文命令词识别在线切换
|
||||
- 方便用户进行 OTA
|
||||
- 支持从 SD 卡读取和更换模型,更加便捷且可以缩减项目使用的模组 Flash 大小
|
||||
- 当用户进行开发时,当修改不涉及模型时,可以避免每次烧录模型数据,大大缩减烧录时间,提高开发效率
|
||||
因此具有以下优势:
|
||||
|
||||
模型配置介绍
|
||||
------------
|
||||
- 大大减小用户应用 APP BIN 的大小
|
||||
- 支持选择最多两个唤醒词
|
||||
- 支持中文和英文命令词识别在线切换
|
||||
- 方便用户进行 OTA
|
||||
- 支持从 SD 卡读取和更换模型,更加便捷且可以缩减项目使用的模组 Flash 大小
|
||||
- 当用户进行开发时,当修改不涉及模型时,可以避免每次烧录模型数据,大大缩减烧录时间,提高开发效率
|
||||
|
||||
配置方法
|
||||
--------
|
||||
|
||||
运行 ``idf.py menuconfig`` 进入 ``ESP Speech Recognition``:
|
||||
|
||||
@ -33,167 +35,184 @@ ESP32S3:
|
||||
|
||||
overview
|
||||
|
||||
Model Data Path
|
||||
~~~~~~~~~~~~~~~
|
||||
.. only:: esp32s3
|
||||
|
||||
该选项只在 ESP32S3 上可用,表示模型数据的存储位置,支持选择 ``spiffs partition`` 或 ``SD Card`` 。
|
||||
Model Data Path
|
||||
~~~~~~~~~~~~~~~
|
||||
|
||||
- ``spiffs partition`` 表示模型数据存储在 Flash spiffs 分区中,模型数据将会从 Flash spiffs 分区中加载
|
||||
- ``SD Card`` 表示模型数据存储在 SD 卡中,模型数据将会从 SD Card 中加载
|
||||
该选项表示模型数据的存储位置,支持选择 ``spiffs partition`` 或 ``SD Card`` 。
|
||||
|
||||
Use AFE
|
||||
~~~~~~~
|
||||
- ``spiffs partition`` 表示模型数据存储在 SPIFFS 分区中,模型数据将会从 SPIFFS 分区中加载
|
||||
- ``SD Card`` 表示模型数据存储在 SD 卡中,模型数据将会从 SD 卡中加载
|
||||
|
||||
该选项需要打开,用户无须修改,请保持默认配置。
|
||||
使用 AFE
|
||||
~~~~~~~~
|
||||
|
||||
Use Wakenet
|
||||
~~~~~~~~~~~~
|
||||
此选项需要打开,用户无须修改,请保持默认配置。
|
||||
|
||||
使用 WakeNet
|
||||
~~~~~~~~~~~~~
|
||||
|
||||
此选项默认打开。当用户只使用 AEC 或者 BSS 等,而无须运行 WakeNet 或 MultiNet 时,请关闭此选项,这将会减小工程固件的大小。
|
||||
|
||||
根据 ``menuconfig`` 列表选择唤醒词模型, ``ESP Speech Recognition`` > ``Select wake words``。括号中为唤醒词模型的名字,在代码中初始化 WakeNet 时需写入对应的名字。
|
||||
|
||||
* 此选项默认打开,当用户只使用 AEC 或者 BSS 等,无须运行 WakeNet 或 MultiNet 时,请关闭次选项,将会减小工程固件的大小。根据menuconfig列表选择唤醒词模型, ``ESP Speech Recognition -> Select wake words``. 括号中为唤醒词模型的名字,你需要在代码用名字切换,初始化wakenet.
|
||||
|select wake wake|
|
||||
* 如果想加载多个唤醒词,以便在代码中进行唤醒词的切换,首选选择’Load Multiple Wake Words’
|
||||
|
||||
如果想加载多个唤醒词,以便在代码中进行唤醒词的切换,首选选择 ``Load Multiple Wake Words``
|
||||
|
||||
|multi wake wake|
|
||||
* 然后按照列表选择多个唤醒词:
|
||||
|
||||
然后按照列表选择多个唤醒词:
|
||||
|
||||
|image1|
|
||||
|
||||
**注:多唤醒词选项只支持 ESP32S3,具体根据客户硬件flash容量,选择合适数量的唤醒词。**
|
||||
.. only:: esp32
|
||||
|
||||
.. note::
|
||||
ESP32 不支持多唤醒词选项。
|
||||
|
||||
.. only:: esp32s3
|
||||
|
||||
.. note::
|
||||
ESP32-S3 支持多唤醒词选项。用户可根据具体硬件 flash 容量,选择合适数量的唤醒词。
|
||||
|
||||
更多细节请参考 :doc:`WakeNet <../wake_word_engine/README>` 。
|
||||
|
||||
Use Multinet
|
||||
~~~~~~~~~~~~~
|
||||
使用 MultiNet
|
||||
~~~~~~~~~~~~~~
|
||||
|
||||
此选项默认打开。当用户只使用 WakeNet 或者其他算法模块时,请关闭此选项,将会在一些情况下减小工程固件的大小。
|
||||
|
||||
ESP32 芯片只支持中文命令词识别。ESP32S3 支持中文和英文命令词识别,且支持中英文识别模型切换。
|
||||
中文命令词识别模型 (Chinese Speech Commands Model)
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
- Chinese Speech Commands Model
|
||||
.. only:: esp32
|
||||
|
||||
中文命令词识别模型选择。
|
||||
ESP32 芯片只支持中文命令词识别:
|
||||
|
||||
ESP32 支持:
|
||||
- None
|
||||
- Chinese single recognition (MultiNet2)
|
||||
|
||||
- None
|
||||
- chinese single recognition (MultiNet2)
|
||||
.. only:: esp32s3
|
||||
|
||||
ESP32S3 支持:
|
||||
ESP32-S3 支持中文和英文命令词识别,且支持中英文识别模型切换。
|
||||
|
||||
- None
|
||||
- None
|
||||
- Chinese single recognition (MultiNet4.5)
|
||||
- Chinese single recognition (MultiNet4.5 quantized with 8-bit)
|
||||
- English Speech Commands Model
|
||||
|
||||
- chinese single recognition (MultiNet4.5)
|
||||
当用户在 ``Chinese Speech Commands Model`` 中选择非 ``None`` 时,需要在该项处添加中文命令词。
|
||||
|
||||
- chinese single recognition (MultiNet4.5 quantized with 8-bit)
|
||||
.. only:: esp32s3
|
||||
|
||||
- English Speech Commands Model
|
||||
英文命令词识别模型 (English Speech Commands Model)
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
英文命令词识别模型选择。
|
||||
ESP32-S3 支持中文和英文命令词识别,且支持中英文识别模型切换。
|
||||
|
||||
该选项不支持 ESP32。
|
||||
- None
|
||||
- English recognition (MultiNet5 quantized with 8-bit, depends on WakeNet8)
|
||||
- Add Chinese speech commands
|
||||
|
||||
ESP32S3 支持:
|
||||
|
||||
- None
|
||||
|
||||
- english recognition (MultiNet5 quantized with 8-bit, depends on WakeNet8)
|
||||
|
||||
- Add Chinese speech commands
|
||||
|
||||
当用户在 ``Chinese Speech Commands Model`` 中选择非 ``None`` 时,需要在该项处添加中文命令词。
|
||||
|
||||
- Add English speech commands
|
||||
|
||||
当用户在 ``English Speech Commands Model`` 中选择非 ``None`` 时,需要在该项处添加中文命令词。
|
||||
当用户在 ``English Speech Commands Model`` 中选择非 ``None`` 时,需要在该项处添加英文命令词。
|
||||
|
||||
用户按照需求自定义添加命令词,具体请参考 :doc:`MultiNet <../speech_command_recognition/README>` 。
|
||||
|
||||
模型使用
|
||||
---------
|
||||
|
||||
当用户完成以上的配置选择后,应用层请参考 esp-skainet 进行初始化和使用。这里介绍一下模型数据加载在用户工程中的代码实现。 也可以参考代码 `model_path.c <../src/model_path.c>`_ 。
|
||||
当用户完成以上的配置选择后,可参考 `ESP-Skainet <https://github.com/espressif/esp-skainet>`_ 应用层仓库中的介绍,进行初始化和使用。
|
||||
|
||||
使用 ESP32
|
||||
~~~~~~~~~~
|
||||
这里主要介绍模型加载在用户工程中的代码实现,用户也可直接参考代码 `model_path.c <../src/model_path.c>`_ 。
|
||||
|
||||
当用户使用 ESP32 时,由于只支持从 Flash 中直接加载模型数据,因此代码中模型数据会自动按照地址从 Flash 中读取所需数据。 为了和ESP32S3进行兼容,代码中模型的初始化方法是和ESP32S3相同的,可参考下面ESP32S3的模型加载API.
|
||||
.. only:: esp32
|
||||
|
||||
使用 ESP32S3
|
||||
~~~~~~~~~~~~~
|
||||
ESP32 仅支持从 Flash 中直接加载模型数据,因此代码中模型数据会自动按照地址从 Flash 中读取所需数据。为了和 ESP32-S3 进行兼容,ESP32 代码中模型的初始化方法与 ESP32-S3 相同。
|
||||
|
||||
模型数据存储在 SPIFFS
|
||||
^^^^^^^^^^^^^^^^^^^^^
|
||||
.. only:: esp32s3
|
||||
|
||||
- 编写分区表:
|
||||
ESP32-S3 支持从 Flash SPIFFS 或 SD 卡中直接加载模型数据,下方将分别介绍。
|
||||
|
||||
模型数据存储在 Flash SPIFFS
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
#. 编写分区表:
|
||||
|
||||
::
|
||||
|
||||
model, data, spiffs, , SIZE,
|
||||
|
||||
其中 SIZE 可以参考在用户使用 ‘idf.py build’ 编译时的推荐大小,例如:
|
||||
其中 SIZE 可以参考在用户使用 ``idf.py build`` 编译时的推荐大小,例如: ``Recommended model partition size: 500K`` 。
|
||||
|
||||
::
|
||||
#. 初始化 SPIFFS 分区:用户可以直接调用提供的 ``esp_srmodel_init()`` API 来初始化 SPIFFS,并返回 SPIFFS 中的模型。
|
||||
|
||||
Recommended model partition size: 500K
|
||||
- base_path:模型的存储 ``base_path`` 为 ``srmodel``,不可更改
|
||||
- partition_label:模型的分区 label 为 ``model`` ,需要和上述分区表中的 ``Name`` 保持一致
|
||||
|
||||
- 初始化 spiffs 分区 **调用提供的 API** :用户可以直接调用
|
||||
``esp_srmodel_init()`` API 来初始化 spiffs,并返回spiffs中的模型。
|
||||
完成上述配置后,模型会在工程编译完成后自动生成 ``model.bin`` ,并在用户调用 ``idf.py flash`` 时烧写到 SPIFFS 分区。
|
||||
|
||||
- base_path:模型的存储 ``base_path`` 为 ``srmodel`` ,不可更改
|
||||
- partition_label:模型的分区 label 为 ``model`` ,需要和 上述分区表中的 ``Name`` 保持一致
|
||||
.. only:: esp32s3
|
||||
|
||||
完成上述配置后,模型会在工程编译完成后自动生成 ``model.bin`` ,并在用户调用 ``idf.py flash`` 时烧写到 spiffs 分区。
|
||||
模型数据存储在 SD 卡
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
模型存储在 SD Card
|
||||
^^^^^^^^^^^^^^^^^^
|
||||
当用户配置模型数据存储位置是 ``SD Card`` 时,用户需要:
|
||||
|
||||
当用户配置 #1.2 模型数据存储位置是 ``SD Card`` 时,用户需要:
|
||||
- 手动移动模型数据至 SD 卡中
|
||||
用户完成以上配置后,可以先进行编译,编译完成后将 ``model/target`` 目录下的文件拷贝至 SD 卡的根目录。
|
||||
|
||||
- 手动移动模型数据
|
||||
- 自定义路径
|
||||
如果用户想将模型放置于指定文件夹,可以自己修改位于 ``model/model_path.c`` 中的 :cpp:func:`get_model_base_path()` 函数。
|
||||
|
||||
将模型移动到 SDCard 中,用户完成以上配置后,可以先进行编译,编译完成后将 ``ESP-SR_PATH/model/target/`` 目录下的文件拷贝至 SD 卡的根目录。
|
||||
.. only:: html
|
||||
|
||||
- 自定义路径 如果用户想将模型放置于指定文件夹,可以自己修改 ``get_model_base_path()`` 函数,位于 ``ESP-SR_PATH/model/model_path.c``。 比如,指定文件夹为 SD 卡目录中的 ``espmodel``, 则可以修改该函数为:
|
||||
比如,如需指定文件夹为 SD 卡目录中的 ``espmodel``, 则可以修改该函数为:
|
||||
|
||||
.. only:: html
|
||||
::
|
||||
|
||||
::
|
||||
char *get_model_base_path(void)
|
||||
{
|
||||
#if defined CONFIG_MODEL_IN_SDCARD
|
||||
return "sdcard/espmodel";
|
||||
#elif defined CONFIG_MODEL_IN_SPIFFS
|
||||
return "srmodel";
|
||||
#else
|
||||
return NULL;
|
||||
#endif
|
||||
}
|
||||
|
||||
char *get_model_base_path(void)
|
||||
{
|
||||
#if defined CONFIG_MODEL_IN_SDCARD
|
||||
return "sdcard/espmodel";
|
||||
#elif defined CONFIG_MODEL_IN_SPIFFS
|
||||
return "srmodel";
|
||||
#else
|
||||
return NULL;
|
||||
#endif
|
||||
}
|
||||
- 初始化 SD 卡
|
||||
用户需要初始化 SD 卡,使系统能够加载 SD 卡。如果用户使用 `ESP-Skainet <https://github.com/espressif/esp-skainet>`_ ,可以直接调用 ``esp_sdcard_init("/sdcard", num);`` 来初始化其支持的开发板的 SD 卡。否则,需要自己编写初始化程序。
|
||||
完成以上操作后,便可以进行工程的烧录。
|
||||
|
||||
- 初始化 SD 卡
|
||||
|
||||
用户需要初始化 SD 卡,来使系统能够记载 SD 卡,如果用户使用 esp-skainet,可以直接调用 ``esp_sdcard_init("/sdcard", num);`` 来初始化其支持开发板的 SD 卡。否则,需要自己编写。
|
||||
.. |select wake wake| image:: ../../_static/wn_menu1.png
|
||||
.. |multi wake wake| image:: ../../_static/wn_menu2.png
|
||||
.. |image1| image:: ../../_static/wn_menu3.png
|
||||
|
||||
完成以上操作后,便可以进行工程的烧录。
|
||||
|
||||
.. only:: html
|
||||
|
||||
代码中模型初始化与使用
|
||||
^^^^^^^^^^^^^^^^^^^^^^
|
||||
~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
::
|
||||
|
||||
//
|
||||
// step1: initialize spiffs and return models in spiffs
|
||||
// step1: initialize SPIFFS and return models in SPIFFS
|
||||
//
|
||||
srmodel_list_t *models = esp_srmodel_init();
|
||||
|
||||
//
|
||||
// step2: select the specific model by keywords
|
||||
//
|
||||
char *wn_name = esp_srmodel_filter(models, ESP_WN_PREFIX, NULL); // select wakenet model
|
||||
char *nm_name = esp_srmodel_filter(models, ESP_MN_PREFIX, NULL); // select multinet model
|
||||
char *alexa_wn_name = esp_srmodel_filter(models, ESP_WN_PREFIX, "alexa"); // select wakenet with "alexa" wake word.
|
||||
char *en_mn_name = esp_srmodel_filter(models, ESP_MN_PREFIX, ESP_MN_ENGLISH); // select english multinet model
|
||||
char *cn_mn_name = esp_srmodel_filter(models, ESP_MN_PREFIX, ESP_MN_CHINESE); // select english multinet model
|
||||
char *wn_name = esp_srmodel_filter(models, ESP_WN_PREFIX, NULL); // select WakeNet model
|
||||
char *nm_name = esp_srmodel_filter(models, ESP_MN_PREFIX, NULL); // select MultiNet model
|
||||
char *alexa_wn_name = esp_srmodel_filter(models, ESP_WN_PREFIX, "alexa"); // select WakeNet with "alexa" wake word.
|
||||
char *en_mn_name = esp_srmodel_filter(models, ESP_MN_PREFIX, ESP_MN_ENGLISH); // select english MultiNet model
|
||||
char *cn_mn_name = esp_srmodel_filter(models, ESP_MN_PREFIX, ESP_MN_CHINESE); // select chinese MultiNet model
|
||||
|
||||
// It also works if you use the model name directly in your code.
|
||||
char *my_wn_name = "wn9_hilexin";
|
||||
@ -209,7 +228,3 @@ ESP32S3 支持:
|
||||
|
||||
esp_mn_iface_t *multinet = esp_mn_handle_from_name(mn_name);
|
||||
model_iface_data_t *mn_model_data = multinet->create(mn_name, 6000);
|
||||
|
||||
.. |select wake wake| image:: ../../_static/wn_menu1.png
|
||||
.. |multi wake wake| image:: ../../_static/wn_menu2.png
|
||||
.. |image1| image:: ../../_static/wn_menu3.png
|
||||
|
||||
46
docs/zh_CN/getting_started/readme.rst
Normal file
46
docs/zh_CN/getting_started/readme.rst
Normal file
@ -0,0 +1,46 @@
|
||||
入门指南
|
||||
========
|
||||
|
||||
:link_to_translation:`en:[English]`
|
||||
|
||||
乐鑫 `ESP-SR <https://github.com/espressif/esp-sr>`__ 可以帮助用户基于 ESP32 系列芯片或 ESP32-S3 系列芯片,搭建 AI 语音解决方案。本文档将通过一些简单的示例,展示如何使用 ESP-SR 中的算法和模型。
|
||||
|
||||
概述
|
||||
----
|
||||
|
||||
ESP-SR 支持以下模块:
|
||||
|
||||
* :doc:`声学前端算法 AFE <../audio_front_end/README>`
|
||||
* :doc:`唤醒词检测 WakeNet <../wake_word_engine/README>`
|
||||
* :doc:`命令词识别 MultiNet <../speech_command_recognition/README>`
|
||||
* 语音合成(目前只支持中文)
|
||||
|
||||
准备工作
|
||||
--------
|
||||
|
||||
必备硬件
|
||||
~~~~~~~~
|
||||
|
||||
.. list::
|
||||
|
||||
:esp32s3: - 一款音频开发板,推荐使用 ESP32-S3-Korvo-1 或者 ESP32-S3-Korvo-2
|
||||
:esp32: - 一款音频开发板,推荐使用 ESP32-Korvo
|
||||
- USB 2.0 数据线(标准 A 型转 Micro-B 型)
|
||||
- 电脑(Linux)
|
||||
|
||||
.. note::
|
||||
目前一些开发板使用的是 USB Type C 接口。请确保使用合适的数据线连接开发板!
|
||||
|
||||
必备软件
|
||||
~~~~~~~~
|
||||
|
||||
* 下载 `ESP-SKAINET <https://github.com/espressif/esp-skainet>`__,ESP-SR 将作为 ESP-SKAINET 的组件被一起下载。
|
||||
* 配置安装 ESP-IDF,推荐使用 ESP-SKAINET 中包含的版本。安装方法请参考 `ESP-IDF 编程指南 <https://docs.espressif.com/projects/esp-idf/zh_CN/latest/esp32s3/index.html>`__ 中的 `快速入门 <https://docs.espressif.com/projects/esp-idf/zh_CN/latest/esp32s3/get-started/index.html>`__ 小节。
|
||||
|
||||
|
||||
编译运行一个示例
|
||||
----------------
|
||||
|
||||
* 进入 `ESP-SKAINET/examples/cn_speech_commands_recognition <https://github.com/espressif/esp-skainet/tree/master/examples/cn_speech_commands_recognition>`__ 目录。
|
||||
* 参考该示例目录下的配置和编译说明,运行该示例。
|
||||
* 该示例为中文命令指令识别示例,通过说唤醒词(Hi,乐鑫),触发语音指令识别。注意,当一段时间没有语音指令后,语音指令识别功能将关闭,并等待下一次唤醒词触发。
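通常的编译、烧录与运行流程大致如下(以 ESP32-S3 为例,具体步骤请以示例目录下的说明为准):

::

    idf.py set-target esp32s3
    idf.py menuconfig
    idf.py flash monitor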
|
||||
16
docs/zh_CN/glossary/glossary.rst
Normal file
16
docs/zh_CN/glossary/glossary.rst
Normal file
@ -0,0 +1,16 @@
|
||||
术语表
|
||||
======
|
||||
|
||||
:link_to_translation:`en:[English]`
|
||||
|
||||
通用术语
|
||||
--------
|
||||
|
||||
ESP-SR 仓库中的大多数术语均与 `乐鑫 ADF 音频应用开发框架 <https://espressif-docs.readthedocs-hosted.com/projects/esp-adf/zh_CN/latest/get-started/index.html>`_ 共用,具体请访问 `ADF 中英术语库 <https://espressif-docs.readthedocs-hosted.com/projects/esp-adf/zh_CN/latest/english-chinese-glossary.html>`_。
|
||||
|
||||
特别术语
|
||||
--------
|
||||
|
||||
ESP-SR 仓库独有术语,请见下方。
|
||||
|
||||
语音用户界面 (Voice-User Interface, VUI)
|
||||
@ -3,54 +3,22 @@ ESP-SR 用户指南
|
||||
|
||||
:link_to_translation:`en:[English]`
|
||||
|
||||
乐鑫 `ESP-SR <https://github.com/espressif/esp-sr>`__ 可以帮助用户基于 ESP32 系列芯片或 ESP32-S3 系列芯片,搭建 AI 语音解决方案。本文档将通过一个简单的示例,展示如何使用 ESP-SR 中的算法和模型。
|
||||
.. only:: html
|
||||
|
||||
概述
|
||||
----
|
||||
**本文档仅包含针对 {IDF_TARGET_NAME} 芯片的 ESP-SR 使用**。如需了解其他芯片,请在页面左上方的下拉菜单中选择您的目标芯片。
|
||||
|
||||
ESP-SR 支持以下模块:
|
||||
|
||||
* :doc:`声学前端算法 <audio_front_end/README>`
|
||||
* :doc:`唤醒词检测 <wake_word_engine/README>`
|
||||
* :doc:`命令词识别 <speech_command_recognition/README>`
|
||||
* 语音合成(目前只支持中文)
|
||||
|
||||
准备工作
|
||||
--------
|
||||
|
||||
必备硬件
|
||||
~~~~~~~~
|
||||
|
||||
* 一款音频开发版,推荐使用 ESP32-S3-Korvo-1 或者 ESP32-S3-Korvo-2
|
||||
* USB 2.0 数据线(标准 A 型转 Micro-B 型)
|
||||
* 电脑(Linux)
|
||||
|
||||
.. note::
|
||||
目前一些开发板使用的是 USB Type C 接口。请确保使用合适的数据线连接开发板!
|
||||
|
||||
必备软件
|
||||
~~~~~~~~
|
||||
|
||||
* 下载 `ESP-SKAINET <https://github.com/espressif/esp-skainet>`__,ESP-SR 将作为 ESP-SKAINET 的组件被一起下载。
|
||||
* 配置安装 ESP-IDF,推荐使用 ESP-SKAINET 中包含的版本。安装方法请参考 `ESP-IDF 编程指南 <https://docs.espressif.com/projects/esp-idf/zh_CN/latest/esp32s3/index.html>`__ 中的 `快速入门 <https://docs.espressif.com/projects/esp-idf/zh_CN/latest/esp32s3/get-started/index.html>`__ 小节。
|
||||
|
||||
|
||||
编译运行一个示例
|
||||
----------------
|
||||
|
||||
* 进入 `ESP-SKAINET/examples/cn_speech_commands_recognition <https://github.com/espressif/esp-skainet/tree/master/examples/cn_speech_commands_recognition>`__ 目录。
|
||||
* 参考该示例目录下的配置和编译说明,运行该示例。
|
||||
* 该示例为中文命令指令识别示例,通过说唤醒词(Hi,乐鑫),触发语音指令识别。注意,当一段时间没有语音指令后,语音指令识别功能将关闭,并等待下一次唤醒词触发。
|
||||
.. only:: latex
|
||||
|
||||
**本文档仅包含针对 {IDF_TARGET_NAME} 芯片的 ESP-SR 使用**。
|
||||
|
||||
.. toctree::
|
||||
:hidden:
|
||||
:hidden:
|
||||
|
||||
AFE 声学前端算法 <audio_front_end/README>
|
||||
唤醒词模型 <wake_word_engine/README>
|
||||
定制化唤醒词 <wake_word_engine/ESP_Wake_Words_Customization>
|
||||
语音指令 <speech_command_recognition/README>
|
||||
模型加载方式 <flash_model/README>
|
||||
麦克风设计指南 <audio_front_end/Espressif_Microphone_Design_Guidelines>
|
||||
入门指南 <getting_started/readme>
|
||||
AFE 声学前端算法 <audio_front_end/index>
|
||||
语音唤醒 WakeNet <wake_word_engine/index>
|
||||
语音指令 MultiNet <speech_command_recognition/README>
|
||||
模型加载 <flash_model/README>
|
||||
资源消耗 <benchmark/README>
|
||||
测试报告 <test_report/README>
|
||||
性能测试 <benchmark/README>
|
||||
术语表 <glossary/glossary>
|
||||
@ -1,24 +1,27 @@
|
||||
MultiNet 介绍
|
||||
=============
|
||||
命令词
|
||||
======
|
||||
|
||||
:link_to_translation:`en:[English]`
|
||||
|
||||
MultiNet 是为了在 ESP32 系列上离线实现多命令词识别而设计的轻量化模型,目前支持 200 个以内的自定义命令词识别。
|
||||
MultiNet 命令词识别模型
|
||||
----------------------------
|
||||
|
||||
* 支持中文和英文命令词识别(英文命令词识别需使用 ESP32S3)
|
||||
* 支持用户自定义命令词
|
||||
* 支持运行过程中 增加/删除/修改 命令词语
|
||||
* 最多支持 200 个命令词
|
||||
* 支持单次识别和连续识别两种模式
|
||||
* 轻量化,低资源消耗
|
||||
* 低延时,延时500ms内
|
||||
* 支持在线中英文模型切换(仅 ESP32S3)
|
||||
* 模型单独分区,支持用户应用 OTA
|
||||
MultiNet 是为了在 {IDF_TARGET_NAME} 系列上离线实现多命令词识别而设计的轻量化模型,目前支持 200 个以内的自定义命令词识别。
|
||||
|
||||
概述
|
||||
-------
|
||||
.. list::
|
||||
|
||||
MultiNet 输入为经过前端语音算法(AFE)处理过的音频,格式为 16KHz,16bit,单声道。通过对音频进行识别,则可以对应到相应的汉字或单词。
|
||||
:esp32s3: - 支持中文和英文命令词识别
|
||||
:esp32: - 支持中文命令词识别
|
||||
- 支持用户自定义命令词
|
||||
- 支持运行过程中 增加/删除/修改 命令词语
|
||||
- 最多支持 200 个命令词
|
||||
- 支持单次识别和连续识别两种模式
|
||||
- 轻量化,低资源消耗
|
||||
- 低延时,延时在 500 ms 以内
|
||||
:esp32s3: - 支持在线中英文模型切换
|
||||
- 模型单独分区,支持用户应用 OTA
|
||||
|
||||
MultiNet 输入为经过前端语音算法(AFE)处理过的音频(格式为 16 KHz,16 bit,单声道)。通过对音频进行识别,则可以对应到相应的汉字或单词。
|
||||
|
||||
以下表格展示在不同芯片上的模型支持:
|
||||
|
||||
@ -32,48 +35,27 @@ MultiNet 输入为经过前端语音算法(AFE)处理过的音频,格式
|
||||
| English | | | | √ |
|
||||
+---------+-----------+-------------+---------------+-------------+
|
||||
|
||||
用户选择不同的模型的方法请参考 :doc:`flash model <../flash_model/README>` 。
|
||||
用户选择不同的模型的方法请参考 :doc:`模型加载 <../flash_model/README>` 。
|
||||
|
||||
**注:其中以 ``Q8`` 结尾的模型代表模型的 8bit 版本,表明该模型更加轻量化。**
|
||||
.. note::
|
||||
其中以 ``Q8`` 结尾的模型代表模型的 8 bit 版本,表明该模型更加轻量化。
|
||||
|
||||
命令词识别原理
|
||||
-----------------
|
||||
|
||||
可以参考以下命令词识别原理:
|
||||
命令词识别原理如下图所示:
|
||||
|
||||
.. figure:: ../../_static/multinet_workflow.png
|
||||
:alt: speech_command-recognition-system
|
||||
|
||||
speech_command-recognition-system
|
||||
|
||||
使用指南
|
||||
--------
|
||||
.. _command-requirements:
|
||||
|
||||
命令词设计要求
|
||||
~~~~~~~~~~~~~~~
|
||||
----------------
|
||||
|
||||
- 中文推荐长度一般为 4-6 个汉字,过短导致误识别率高,过长不方便用户记忆
|
||||
- 英文推荐长度一般为 4-6 个单词
|
||||
- 命令词中不支持中英文混合
|
||||
- 目前最多支持 **200** 条命令词
|
||||
- 命令词中不能含有阿拉伯数字和特殊字符
|
||||
- 命令词避免使用常用语
|
||||
- 命令词中每个汉字/单词的发音相差越大越好
|
||||
|
||||
命令词自定义方法
|
||||
~~~~~~~~~~~~~~~~
|
||||
|
||||
* 支持多种命令词自定义方法
|
||||
* 支持随时动态增加/删除/修改命令词
|
||||
|
||||
MultiNet 对命令词自定义方法没有限制,用户可以通过任意方式(在线/离线)等将所需的命令词按照相应的格式,组成链表发给 MultiNet 即可。
|
||||
|
||||
我们针对不同客户提供不同的 example 来展示一些命令词的自定义方法,大体分为以下两种。
|
||||
|
||||
命令词格式
|
||||
^^^^^^^^^^
|
||||
|
||||
命令词需要满足特定的格式,具体如下:
|
||||
目前,MultiNet 最多支持 **200** 条命令词。命令词需要满足特定的格式,具体如下:
|
||||
|
||||
- 中文
|
||||
|
||||
@ -81,9 +63,27 @@ MultiNet 对命令词自定义方法没有限制,用户可以通过任意方
|
||||
|
||||
- 英文
|
||||
|
||||
英文命令词需要使用特定音标表示,每个单词的音标间用空格隔开,比如“turn on the light”,需要写成“TkN nN jc LiT”。
|
||||
英文命令词需要使用特定音标表示,每个单词的音标间用空格隔开,比如“turn on the light”,需要写成“TkN nN jc LiT”。具体可使用我们提供的工具进行转换,详细可见: :project_file:`tool/multinet_g2p.py` 。
|
||||
|
||||
此外,我们也提供相应的工具,供用户将汉字转换为拼音,详细可见: `英文转音素工具 <../../tool/multinet_g2p.py>` 。
|
||||
|
||||
自定义要求
|
||||
~~~~~~~~~~~
|
||||
|
||||
在设计命令词时有如下要求和建议:
|
||||
|
||||
.. list::
|
||||
|
||||
- 中文推荐长度一般为 4-6 个汉字,过短导致误识别率高,过长不方便用户记忆
|
||||
:esp32s3: - 英文推荐长度一般为 4-6 个单词
|
||||
- 不支持中英文混合
|
||||
- 不能含有阿拉伯数字和特殊字符
|
||||
- 应避免使用常用语
|
||||
- 命令词中每个汉字/单词的发音相差越大越好
|
||||
|
||||
自定义方法
|
||||
~~~~~~~~~~~
|
||||
|
||||
MultiNet 支持多种且灵活的命令词设置方式,可通过在线或离线方法设置命令词,还允许随时动态增加/删除/修改命令词。
|
||||
|
||||
.. only:: latex
|
||||
|
||||
@ -93,22 +93,20 @@ MultiNet 对命令词自定义方法没有限制,用户可以通过任意方
|
||||
离线设置命令词
|
||||
^^^^^^^^^^^^^^^
|
||||
|
||||
MultiNet 支持多种且灵活的命令词设置方式,用户无论通过那种方式编写命令词(代码/网络/文件),只需调用相应的 API 即可。
|
||||
MultiNet 支持两种离线设置命令词的方法:
|
||||
|
||||
在这里我们提供两种常见的命令词添加方法。
|
||||
- 通过 ``menuconfig``
|
||||
|
||||
- 编写 ``menuconfig`` 进行添加
|
||||
|
||||
可以参考 ESP-Skainet 中 example 通过 ``idf.py menuconfig -> ESP Speech Recognition-> Add Chinese speech commands/Add English speech commands`` 添加命令词。
|
||||
1. ``idf.py menuconfig`` > ``ESP Speech Recognition`` > ``Add Chinese speech commands/Add English speech commands``,添加命令词。具体也可参考 ESP-Skainet 中的 example。
|
||||
|
||||
.. figure:: ../../_static/menuconfig_add_speech_commands.png
|
||||
:alt: menuconfig_add_speech_commands
|
||||
|
||||
menuconfig_add_speech_commands
|
||||
|
||||
请注意单个 Command ID 可以支持多个短语,比如“打开空调”和“开空调”表示的意义相同,则可以将其写在同一个 Command ID 对应的词条中,用英文字符“,”隔开相邻词条(“,”前后无需空格)。
|
||||
注意,单个 Command ID 可以支持多个短语,比如“打开空调”和“开空调”表示的意义相同,则可以将其写在同一个 Command ID 对应的词条中,用英文字符“,”隔开相邻词条(“,”前后无需空格)。
|
||||
|
||||
然后通过在代码里调用以下 API 即可:
|
||||
2. 在代码里调用以下 API:
|
||||
|
||||
::
|
||||
|
||||
@ -127,43 +125,38 @@ MultiNet 支持多种且灵活的命令词设置方式,用户无论通过那
|
||||
*/
|
||||
esp_err_t esp_mn_commands_update_from_sdkconfig(esp_mn_iface_t *multinet, const model_iface_data_t *model_data);
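调用示意如下(假设 ``multinet`` 与 ``model_data`` 已按下文 “MultiNet 初始化” 完成创建,仅为参考写法):

::

    // 示意:将 menuconfig (sdkconfig) 中配置的命令词更新到 MultiNet
    esp_err_t err = esp_mn_commands_update_from_sdkconfig(multinet, model_data);
    if (err != ESP_OK) {
        printf("Failed to update speech commands from sdkconfig\n");
    }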
|
||||
|
||||
- 通过自己创建命令词进行添加
|
||||
- 通过修改代码
|
||||
|
||||
可以参考 ESP-Skainet 中 example 了解这种添加命令词的方法。
|
||||
|
||||
该方法中,用户直接在代码中编写命令词,并传给 MultiNet,在实际开发和产品中,用户可以通过网络/UART/SPI等多种可能的方式传递所需的命令词并随时更换命令词。
|
||||
该方法中,用户直接在代码中编写命令词,并传给 MultiNet。在实际产品开发和使用中,用户可以通过网络/UART/SPI 等多种接口,传递所需的命令词并随时更换命令词。详情可参考 ESP-Skainet 中的 example。
|
||||
|
||||
在线设置命令词
|
||||
^^^^^^^^^^^^^^
|
||||
|
||||
MultiNet 支持在运行过程中在线动态添加/删除/修改命令词,该过程无须更换模型和调整参数。具体可以参考 ESP-Skainet 中 example。
|
||||
MultiNet 还支持在运行过程中,在线动态设置命令词(添加/删除/修改),且整个过程无须更换模型或调整参数。详情可参考 ESP-Skainet 中 example。
|
||||
|
||||
具体 API 说明请参考 `esp_mn_speech_commands <../../src/esp_mn_speech_commands.c>`_。
|
||||
具体 API 说明请参考 :project_file:`src/esp_mn_speech_commands.c` 。
|
||||
|
||||
运行命令词识别
|
||||
--------------
|
||||
MultiNet 的使用
|
||||
----------------
|
||||
|
||||
命令词识别需要和 ESP-SR 中的声学算法模块(AFE)(AFE中需使能唤醒(WakeNet))一起运行。关于 AFE 的使用,请参考文档:
|
||||
MultiNet 命令词识别需要和 ESP-SR 中的 AFE 声学算法模块一起运行(此外,AFE 运行还需要使能 WakeNet 功能,具体请参考 :doc:`AFE 介绍及使用 <../audio_front_end/README>` )。
|
||||
|
||||
:doc:`AFE 介绍及使用 <../audio_front_end/README>`
|
||||
|
||||
当用户配置完成 AFE 后,请按照以下步骤配置和运行 MultiNet:
|
||||
当用户配置完成 AFE 后,请按照以下步骤配置和运行 MultiNet。
|
||||
|
||||
MultiNet 初始化
|
||||
~~~~~~~~~~~~~~~
|
||||
|
||||
- 模型加载与初始化
|
||||
请参考 :doc:`flash_model <../flash_model/README>`
|
||||
- 模型加载与初始化,请参考 :doc:`模型加载 <../flash_model/README>`
|
||||
|
||||
- 设置命令词 请参考上文 #3。
|
||||
- 设置命令词,请参考 :ref:`command-requirements`
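完成上述两步后,一个典型的初始化流程示意如下(``6000`` 为 ``create`` 的超时参数示例值,具体含义请以头文件注释为准):

::

    // 示意:加载模型并根据模型名创建 MultiNet 实例
    srmodel_list_t *models = esp_srmodel_init();
    char *mn_name = esp_srmodel_filter(models, ESP_MN_PREFIX, NULL);
    esp_mn_iface_t *multinet = esp_mn_handle_from_name(mn_name);
    model_iface_data_t *mn_model_data = multinet->create(mn_name, 6000);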
|
||||
|
||||
MultiNet 运行
|
||||
~~~~~~~~~~~~~
|
||||
|
||||
当用户开启 AFE 且使能 WakeNet 后,则可以运行 MultiNet。且有以下几点要求:
|
||||
当用户开启 AFE 且使能 WakeNet 后,则可以运行 MultiNet。但需要注意以下几点要求:
|
||||
|
||||
* 传入帧长和 AFE fetch 帧长长度相等
|
||||
* 支持音频格式为 16KHz,16bit,单通道。AFE fetch 拿到的数据也为这个格式
|
||||
* 传入帧长和 AFE fetch 帧长长度相等
|
||||
* 支持音频格式为 16 KHz,16 bit,单通道。AFE fetch 拿到的数据也为这个格式
|
||||
|
||||
- 确定需要传入 MultiNet 的帧长
|
||||
|
||||
@ -173,7 +166,7 @@ MultiNet 运行
|
||||
|
||||
``mu_chunksize`` 是需要传入 MultiNet 的每帧音频的 ``short`` 型点数,这个大小和 AFE 中 fetch 的每帧数据点数完全一致。
|
||||
|
||||
- MultiNet detect
|
||||
- MultiNet 识别
|
||||
|
||||
我们将 AFE 实时 ``fetch`` 到的数据送入以下 API:
|
||||
|
||||
@ -181,20 +174,19 @@ MultiNet 运行
|
||||
|
||||
esp_mn_state_t mn_state = multinet->detect(model_data, buff);
|
||||
|
||||
``buff`` 的长度为 ``mu_chunksize * sizeof(int16_t)``。
|
||||
``buff`` 的长度为 ``mu_chunksize * sizeof(int16_t)``。
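将上述要求串起来,一个简化的识别循环示意如下(假设每帧 AFE fetch 的输出已拷贝至 ``buff``,``get_samp_chunksize`` 接口是否可用请以头文件为准;各返回状态的含义见下节):

::

    // 示意:按帧将 AFE 输出送入 MultiNet,并根据返回状态处理
    int mu_chunksize = multinet->get_samp_chunksize(mn_model_data);  // 每帧 short 型点数(假设该接口可用)
    int16_t *buff = malloc(mu_chunksize * sizeof(int16_t));
    while (1) {
        // ...此处将 AFE fetch 得到的一帧数据拷贝到 buff...
        esp_mn_state_t mn_state = multinet->detect(mn_model_data, buff);
        if (mn_state == ESP_MN_STATE_DETECTED) {
            break;   // 识别到命令词:读取识别结果后退出,即单次识别模式
        } else if (mn_state == ESP_MN_STATE_TIMEOUT) {
            break;   // 超时未识别到命令词:退出并等待下次唤醒
        }
        // ESP_MN_STATE_DETECTING:继续送入下一帧
    }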
|
||||
|
||||
MultiNet 识别结果
|
||||
~~~~~~~~~~~~~~~~~
|
||||
|
||||
命令词识别支持两种基本模式:
|
||||
MultiNet 命令词识别支持两种基本模式:
|
||||
|
||||
* 单次识别
|
||||
* 连续识别
|
||||
* 单次识别
|
||||
* 连续识别
|
||||
|
||||
命令词识别必须和唤醒搭配使用,当唤醒后可以运行命令词的检测。
|
||||
|
||||
命令词模型在运行时,会实时返回当前帧的识别状态
|
||||
``mn_state`` ,目前分为以下几种识别状态:
|
||||
命令词模型在运行时,会实时返回当前帧的识别状态 ``mn_state`` ,目前分为以下几种识别状态:
|
||||
|
||||
- ESP_MN_STATE_DETECTING
|
||||
|
||||
@ -219,7 +211,9 @@ MultiNet 识别结果
|
||||
float prob[ESP_MN_RESULT_MAX_NUM]; // The list of probability.
|
||||
} esp_mn_results_t;
|
||||
|
||||
- 其中 ``state`` 为当前识别的状态
|
||||
其中,
|
||||
|
||||
- ``state`` 为当前识别的状态
|
||||
- ``num`` 表示识别到的词条数目, ``num`` <= 5,即最多返回 5 个候选结果
|
||||
- ``phrase_id`` 表示识别到的词条对应的 Phrase ID
|
||||
- ``prob`` 表示识别到的词条识别概率,从大到小依次排列
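例如,当 ``detect`` 返回 ``ESP_MN_STATE_DETECTED`` 时,可按如下方式读取识别结果(示意代码,此处假设存在 ``multinet->get_results()`` 接口,字段名以上文结构体为准):

::

    // 示意:读取并打印全部候选结果
    esp_mn_results_t *mn_result = multinet->get_results(mn_model_data);
    for (int i = 0; i < mn_result->num; i++) {
        printf("TOP %d: phrase_id = %d, prob = %f\n", i + 1, mn_result->phrase_id[i], mn_result->prob[i]);
    }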
|
||||
@ -230,9 +224,9 @@ MultiNet 识别结果
|
||||
|
||||
该状态表示长时间未检测到命令词,自动退出。等待下次唤醒。
|
||||
|
||||
| 因此:
|
||||
| 当命令词识别返回状态为 ``ESP_MN_STATE_DETECTED`` 时退出命令词识别,则为单次识别模式;
|
||||
| 当命令词识别返回状态为 ``ESP_MN_STATE_TIMEOUT`` 时退出命令词识别,则为连续识别模式;
|
||||
因此:
|
||||
当命令词识别返回状态为 ``ESP_MN_STATE_DETECTED`` 时退出命令词识别,则为单次识别模式;
|
||||
当命令词识别返回状态为 ``ESP_MN_STATE_TIMEOUT`` 时退出命令词识别,则为连续识别模式;
|
||||
|
||||
其他配置和使用
|
||||
--------------
|
||||
@ -240,4 +234,4 @@ MultiNet 识别结果
|
||||
阈值设置
|
||||
~~~~~~~~
|
||||
|
||||
该功能仍在开发中.
|
||||
该功能仍在开发中。
|
||||
|
||||
@ -3,191 +3,237 @@
|
||||
|
||||
:link_to_translation:`en:[English]`
|
||||
|
||||
测试场景
|
||||
~~~~~~~~
|
||||
为了保证 DUT 的性能,可通过测试确定 DUT 在以下方面的表现:
|
||||
|
||||
* 房间大小
|
||||
- 唤醒率
|
||||
- 识别率
|
||||
- 误唤醒率
|
||||
- 唤醒打断率
|
||||
- 响应时间
|
||||
|
||||
* 地面大小: 至少 4M*3.2M
|
||||
测试场景要求
|
||||
------------
|
||||
|
||||
* 高度至少: 2.30M
|
||||
以上测试需在合适的测试房间中进行,测试房间应满足如下要求:
|
||||
|
||||
* 房间装饰
|
||||
* **房间大小**
|
||||
|
||||
* 地板需配有地毯,在天花板上配备一些通常在办公室中常见的声学阻尼。在1到2面墙上挂有窗帘,防止强反射。
|
||||
* 面积:不小于 4 m * 3.2 m
|
||||
* 高度:不低于 2.3 m
|
||||
|
||||
* 房间混响(RT601)在[125, 8k]范围内,要满足0.2-0.7s的要求。
|
||||
* **房间装饰**
|
||||
|
||||
* 地板需配有地毯,天花板需配备常见的声学阻尼材料,并在 1 到 2 面墙上挂有窗帘,以防止强反射。
|
||||
* 房间混响 (RT60) 在 [125, 8k] Hz 范围内,要满足 0.2-0.7 s 的要求。
|
||||
* 不要使用消音室。
|
||||
|
||||
* 环境底噪要求:应该 < 35dBA,最好是 < 30dBA。
|
||||
* **环境底噪**:必须 < 35 dBA,最好 < 30 dBA
|
||||
|
||||
* 温度和湿度要求:70+-20 华氏度,相对湿度为50%+-20%。
|
||||
* **温度和湿度要求**:20±10°C,相对湿度为 50%±20%。
|
||||
|
||||
* 设备位置
|
||||
* **DUT、外部噪声、人声的放置**:
|
||||
|
||||
* 根据产品可能的实际使用方式,确定设备在性能测试时摆放的位置,比如设 备高度、离墙的距离、离地面的距离、角度等。
|
||||
* DUT、外部噪声、人声的具体位置和相对位置应根据 DUT 的实际应用场景进行安排。
|
||||
|
||||
* 外噪的角度、距离、高度和分贝
|
||||
.. note::
|
||||
RT60、环境底噪和 DUT、外部噪声、人声的放置应在所有测试中保持不变。
|
||||
|
||||
* 外噪到设备麦克的角度、距离,外噪距离地面的高度, 在设备麦克处测量到的外噪分贝值。
|
||||
测试案例设计
|
||||
------------
|
||||
|
||||
* 人声的角度、高度、距离和分贝
|
||||
在设计测试案例时,建议按照产品的实际应用场景,考虑 **以下部分或全部参数**:
|
||||
|
||||
* 性能测试时播放的测试语音集称为人声。人声到设备 麦克的角度、距离,人声距离地面的高度,在设备麦克处测量到的人声分贝值。
|
||||
- 不同噪声源
|
||||
- 白噪声
|
||||
- 人类噪声
|
||||
- 音乐
|
||||
- 新闻
|
||||
- 等等
|
||||
- 有必要时还可以增加多噪声源的测试场景。
|
||||
- 不同噪声分贝
|
||||
- < 35 dBA
|
||||
- 45 dBA
|
||||
- 55 dBA
|
||||
- 65 dBA
|
||||
- 不同人声分贝
|
||||
- 54 dBA
|
||||
- 59 dBA
|
||||
- 64 dBA
|
||||
- 不同 SNR
|
||||
- 9 dB
|
||||
- 4 dB
|
||||
- -1 dB
|
||||
|
||||
在不同的测试场景中,RT60、房间底噪、设备的位置是三个通用因素,在这些因素被确定之后,将被运用到不同的测试场景中。
|
||||
乐鑫测试与结果
|
||||
--------------
|
||||
|
||||
唤醒率测试
|
||||
~~~~~~~~~~
|
||||
|
||||
唤醒率测试是指当设备处于待唤醒状态时被唤醒成功的概率。
|
||||
|
||||
除通用因素外,通常唤醒率测试还需要确定的因素如表 1 所示。可以根据产品定位设计噪声和人声相对设备同向或者不同向的测试场景,或者多噪声源的测试场景,以及不同的 SNR 场景。
|
||||
|
||||
+--------------+----------+----------+----------+----------+----------+----------+------+
|
||||
| 测试场景编号 | 外噪距离 | 外噪角度 | 外噪分贝 | 人声距离 | 人声角度 | 人声分贝 | SNR |
|
||||
+==============+==========+==========+==========+==========+==========+==========+======+
|
||||
| 1 | / | / | <35dBA | 3m | 90° | 54dBA | / |
|
||||
+--------------+----------+----------+----------+----------+----------+----------+------+
|
||||
| 2 | 2m | 45° | 45dBA | 3m | 90° | 54dBA | 9dB |
|
||||
+--------------+----------+----------+----------+----------+----------+----------+------+
|
||||
| 3 | 2m | 45° | 55dBA | 3m | 90° | 59dBA | 4dB |
|
||||
+--------------+----------+----------+----------+----------+----------+----------+------+
|
||||
| 4 | 2m | 45° | 65dBA | 3m | 90° | 64dBA | -1dB |
|
||||
+--------------+----------+----------+----------+----------+----------+----------+------+
|
||||
|
||||
.. figure:: ../../_static/test_reference_position1.png
|
||||
:align: center
|
||||
:alt: overview
|
||||
|
||||
描述已自动生成在唤醒测试场景下,建议人工嘴(声音源)位于语音模块麦克风正前方,水平直线距离3米,人工嘴(声音源)距离地面1.5米。语音模块(ESP32-S3)和声压计位于同一垂直方向,声压计在语音模块(ESP32-S3)正上方75厘米处。噪声源在斜45度方向,距地高度1.2米,距离语音模块(ESP32-S3)2米。
|
||||
在本章节描述的所有测试中,DUT、外部噪声、人声的具体位置和相对位置如下所述。
|
||||
|
||||
.. figure:: ../../_static/test_reference_position2.png
|
||||
:align: center
|
||||
:alt: overview
|
||||
|
||||
识别测试
|
||||
~~~~~~~~
|
||||
.. figure:: ../../_static/test_reference_position1.png
|
||||
:align: center
|
||||
:alt: overview
|
||||
|
||||
识别率测试是指当设备处于识别状态时成功识别词表里包含的命令词的概率。
|
||||
具体来说,
|
||||
|
||||
除通用因素外,通常识别率测试还需要确定的因素如下表所示。同唤醒率测试一样,识别率测试也可以根据产品定位去设计多样的测试场景。
|
||||
- DUT 距离地面高度 0.75 米
|
||||
- 人工嘴(人声源)距离地面高度 1.5 米,距离 DUT 直线水平距离 3 米
|
||||
- 噪声源在相对人工嘴 45° 角处,距离地面高度 1.2 米,距离 DUT 直线水平距离 2 米
|
||||
- 声压计位于 DUT 正上方 0.75 米
|
||||
|
||||
+--------------+----------+----------+----------+----------+----------+----------+------+
|
||||
| 测试场景编号 | 外噪距离 | 外噪角度 | 外噪分贝 | 人声距离 | 人声角度 | 人声分贝 | SNR |
|
||||
+==============+==========+==========+==========+==========+==========+==========+======+
|
||||
| 1 | / | / | <35dBA | 3m | 90° | 54dBA | / |
|
||||
+--------------+----------+----------+----------+----------+----------+----------+------+
|
||||
| 2 | 2m | 45° | 45dBA | 3m | 90° | 54dBA | 9dB |
|
||||
+--------------+----------+----------+----------+----------+----------+----------+------+
|
||||
| 3 | 2m | 45° | 55dBA | 3m | 90° | 59dBA | 4dB |
|
||||
+--------------+----------+----------+----------+----------+----------+----------+------+
|
||||
| 4 | 2m | 45° | 65dBA | 3m | 90° | 64dBA | -1dB |
|
||||
+--------------+----------+----------+----------+----------+----------+----------+------+
|
||||
|
||||
误唤醒测试
|
||||
唤醒率测试
|
||||
~~~~~~~~~~
|
||||
|
||||
误唤醒率测试是指设备在产品定义的应用场景下被非唤醒词成功唤醒的概率。需要根据产品定义的应用场景中,设备可能处于的环境来设计误唤醒的测试场景,比如在家居应用场景 中,设备可能处于安静、外噪、设备自噪等环境。
|
||||
**唤醒率**:指当设备处于待唤醒状态时被唤醒成功的概率。
|
||||
|
||||
除通用因素外,通常误唤醒率测试还需要确定的因素如下表所示。误唤醒率一般采用的衡量单位为次/小时。
|
||||
**乐鑫唤醒率测试和结果**
|
||||
|
||||
+--------------+----------+----------+----------+----------+----------+
|
||||
| 测试场景编号 | 噪声类型 | 噪声距离 | 噪声角度 | 噪声分贝 | 测试时长 |
|
||||
+==============+==========+==========+==========+==========+==========+
|
||||
| 1 | 安静 | / | / | <35dBA | 24小时 |
|
||||
+--------------+----------+----------+----------+----------+----------+
|
||||
| 2 | 白噪声 | 2m | 45° | 65dBA | 24小时 |
|
||||
+--------------+----------+----------+----------+----------+----------+
|
||||
| 3 | 新闻 | 2m | 45° | 65dBA | 24小时 |
|
||||
+--------------+----------+----------+----------+----------+----------+
|
||||
| 4 | 酒吧 | 2m | 45° | <65dBA | 24小时 |
|
||||
+--------------+----------+----------+----------+----------+----------+
|
||||
.. list-table::
|
||||
:widths: 10 25 15 15 20 15
|
||||
:header-rows: 1
|
||||
|
||||
* - 测试案例
|
||||
- 噪声类型
|
||||
- 噪声分贝
|
||||
- 人声分贝
|
||||
- SNR
|
||||
- 唤醒率
|
||||
* - 1
|
||||
- /
|
||||
- /
|
||||
- 59 dBA
|
||||
- /
|
||||
- 99%
|
||||
* - 2
|
||||
- 白噪声
|
||||
- 55 dBA
|
||||
- 59 dBA
|
||||
- ≥4 dBA
|
||||
- 99%
|
||||
* - 3
|
||||
- 人声噪声
|
||||
- 55 dBA
|
||||
- 59 dBA
|
||||
- ≥4 dBA
|
||||
- 99%
|
||||
|
||||
语音识别率测试
|
||||
~~~~~~~~~~~~~~
|
||||
|
||||
**语音识别率**:指当设备处于识别状态时,成功识别词表里包含的命令词的概率。
|
||||
|
||||
**乐鑫语音识别率测试和结果**
|
||||
|
||||
.. list-table::
|
||||
:widths: 10 25 15 15 20 15
|
||||
:header-rows: 1
|
||||
|
||||
* - 测试案例
|
||||
- 噪声类型
|
||||
- 噪声分贝
|
||||
- 人声分贝
|
||||
- SNR
|
||||
- 语音识别率
|
||||
* - 1
|
||||
- /
|
||||
- /
|
||||
- 59 dBA
|
||||
- /
|
||||
- 91.5%
|
||||
* - 2
|
||||
- 白噪声
|
||||
- 55 dBA
|
||||
- 59 dBA
|
||||
- ≥4 dBA
|
||||
- 78.25%
|
||||
* - 3
|
||||
- 人声噪声
|
||||
- 55 dBA
|
||||
- 59 dBA
|
||||
- ≥4 dBA
|
||||
- 82.77%
|
||||
|
||||
误唤醒率测试
|
||||
~~~~~~~~~~~~
|
||||
|
||||
**误唤醒率**:指设备在产品定义的应用场景下被非唤醒词成功唤醒的概率。
|
||||
|
||||
**乐鑫误唤醒率测试和结果**
|
||||
|
||||
.. list-table::
|
||||
:widths: 20 20 20 20 20
|
||||
:header-rows: 1
|
||||
|
||||
* - 测试案例
|
||||
- 噪声类型
|
||||
- 噪声分贝
|
||||
- 测试时长
|
||||
- 误唤醒次数
|
||||
* - 1
|
||||
- 音乐
|
||||
- 55 dBA
|
||||
- 12 小时
|
||||
- 1 次
|
||||
* - 2
|
||||
- 新闻
|
||||
- 55 dBA
|
||||
- 12 小时
|
||||
- 1 次
|
||||
|
||||
唤醒打断率测试
|
||||
~~~~~~~~~~~~~~
|
||||
|
||||
对于有 AEC 功能的产品,通常还需要测试唤醒打断率。唤醒打断率是指设备有自噪时, 即有 TTS3 播报或播放音频时,被唤醒成功的概率。
|
||||
**唤醒打断率**:指设备有自噪时,即有 TTS 播报或播放音频时,被唤醒成功的概率。对于有 AEC 功能的产品,需要进行该测试。
|
||||
|
||||
除通用因素外,通常唤醒打断率测试还需要确定的因素如下表所示。
|
||||
**乐鑫唤醒打断率测试和结果**
|
||||
|
||||
+--------------+--------------+----------+----------+----------+----------+
|
||||
| 测试场景编号 | 设备自噪类型 | 噪声分贝 | 人声距离 | 人声角度 | 人声分贝 |
|
||||
+==============+==============+==========+==========+==========+==========+
|
||||
| 1 | 音乐 | 65dB | 3米 | 90° | 64dB |
|
||||
+--------------+--------------+----------+----------+----------+----------+
|
||||
| 2 | TTS | 65dB | 3米 | 90° | 64dB |
|
||||
+--------------+--------------+----------+----------+----------+----------+
|
||||
.. list-table::
|
||||
:widths: 15 15 15 20 15 15
|
||||
:header-rows: 1
|
||||
|
||||
* - 测试案例
|
||||
- 噪声类型
|
||||
- 噪声 / 人声分贝
|
||||
- SNR
|
||||
- 唤醒率
|
||||
- 语音识别率
|
||||
* - 1
|
||||
- 音乐
|
||||
- 69 dBA / 59 dBA
|
||||
- ≥10 dBA
|
||||
- 100%
|
||||
- 96%
|
||||
* - 2
|
||||
- TTS
|
||||
- 69 dBA / 59 dBA
|
||||
- ≥10 dBA
|
||||
- 100%
|
||||
- 96%
|
||||
|
||||
响应时间测试
|
||||
~~~~~~~~~~~~
|
||||
搭建好测试环境,打开语音录制工具,播放测试机,播报完毕后,利用语音录制工具计算出语音指令与播报之间的时间间隔即为响应时间。
|
||||
|
||||
步骤:
|
||||
|
||||
#. 利用人工嘴播放测试集。
|
||||
|
||||
#. 记录测试数据。
|
||||
|
||||
#. 计算相应时间。
|
||||
|
||||
乐鑫语音测试结果
|
||||
~~~~~~~~~~~~~~~~
|
||||
|
||||
唤醒率测试
|
||||
-----------
|
||||
|
||||
+----------------+------------+-------------+-----------+-----------+-----------+--------+--------+
|
||||
| 测试项 | 环境噪声 | 噪声指标 | 信噪比SNR | 角度 | 距离 | 唤醒率 | 识别率 |
|
||||
+================+============+=============+===========+===========+===========+========+========+
|
||||
| 本地唤醒率测试 | 安静 | 人声:59dBA | NA | 人声:90° | 人声:3米 | 99% | 91.5% |
|
||||
| | | | | | | | |
|
||||
| | | 噪声:NA | | 噪声:45° | 噪声:2米 | | |
|
||||
| +------------+-------------+-----------+ | +--------+--------+
|
||||
| | 白噪声 | 人声:59dBA | ≥4dBA | | | 99% | 78.25% |
|
||||
| | | | | | | | |
|
||||
| | | 噪声:55dBA | | | | | |
|
||||
| +------------+-------------+-----------+ | +--------+--------+
|
||||
| | 人声类噪声 | 人声:59dBA | ≥4dBA | | | 99% | 82.77% |
|
||||
| | | | | | | | |
|
||||
| | | 噪声:55dBA | | | | | |
|
||||
+----------------+------------+-------------+-----------+-----------+-----------+--------+--------+
|
||||
|
||||
误唤醒测试
|
||||
-----------
|
||||
|
||||
+------------+----------+-------------+----------+------------+
|
||||
| 测试项 | 环境噪声 | 噪声指标 | 测试时间 | 误唤醒次数 |
|
||||
+============+==========+=============+==========+============+
|
||||
| 误唤醒测试 | 音乐 | 噪声:55dBA | 12h | 1 |
|
||||
| +----------+-------------+----------+------------+
|
||||
| | 新闻 | 噪声:55dBA | 12h | 1 |
|
||||
+------------+----------+-------------+----------+------------+
|
||||
|
||||
唤醒打断率测试
|
||||
--------------
|
||||
|
||||
+----------------+----------+---------------+-----------+--------+--------------+
|
||||
| 测试项 | 环境噪声 | 噪声指标 | 信噪比SNR | 唤醒率 | 命令词识别率 |
|
||||
+================+==========+===============+===========+========+==============+
|
||||
| 唤醒打断率测试 | 音乐 | 人声59dBA | ≥ 10dBA | 100% | 96% |
|
||||
| | | 噪声69dBA | | | |
|
||||
| +----------+---------------+-----------+--------+--------------+
|
||||
| | TTS | 人声:59dBA | ≥ 10dBA | 100% | 96% |
|
||||
| | | 噪声:69dBA | | | |
|
||||
+----------------+----------+---------------+-----------+--------+--------------+
|
||||
|
||||
响应时间测试
|
||||
------------
|
||||
|
||||
+--------------+----------+---------------+------------+----------+
|
||||
| 测试项 | 环境噪声 | 噪声指标 | 信噪比 SNR | 响应时间 |
|
||||
+==============+==========+===============+============+==========+
|
||||
| 响应时间测试 | 安静 | 人声:59dBA | NA | <500 ms |
|
||||
| | | 噪声:NA | | |
|
||||
+--------------+----------+---------------+------------+----------+
|
||||
**响应时间**:代表 DUT 响应语音命令的时间,具体测量为语音指令与播报之间的时间间隔(见下图)。
|
||||
|
||||
.. figure:: ../../_static/test_response_time.png
|
||||
:align: center
|
||||
:alt: overview
|
||||
:alt: overview
|
||||
|
||||
**乐鑫响应时间测试和结果**
|
||||
|
||||
.. list-table::
|
||||
:widths: 25 25 25 25
|
||||
:header-rows: 1
|
||||
|
||||
|
||||
* - 测试案例
|
||||
- 噪声 / 人声分贝
|
||||
- SNR
|
||||
- 响应时间
|
||||
* - 1
|
||||
- NA / 59 dBA
|
||||
- /
|
||||
- < 500 ms
|
||||
@ -3,93 +3,96 @@
|
||||
|
||||
:link_to_translation:`en:[English]`
|
||||
|
||||
离线唤醒词定制服务
|
||||
-------------------
|
||||
唤醒词定制服务
|
||||
--------------
|
||||
|
||||
乐鑫提供 离线语音唤醒词 定制服务,详情如下:
|
||||
乐鑫提供 **语音唤醒词定制** 服务,详情如下:
|
||||
|
||||
#. “HI乐鑫”,“你好小鑫” 等官方公开的唤醒词,客户可直接商用
|
||||
#. “HI乐鑫”,“你好小鑫” 等官方开放的唤醒词,客户可直接商用
|
||||
|
||||
- 如 ADF/ASR Demo 提供的离线命令词,同时乐鑫会逐渐开放更多的商用 Free 关键词
|
||||
- 完整列表可见 :ref:`乐鑫免费商用唤醒词 <esp-open-wake-word>`
|
||||
- 同时,乐鑫会逐渐开放更多可免费商用的唤醒词
|
||||
|
||||
#. 除官方开放的唤醒词,可接受客户定制服务,分如下两种情况
|
||||
#. 除官方开放的唤醒词,乐鑫还可为客户提供 **唤醒词定制服务**,主要分如下两种情况:
|
||||
|
||||
- 如果客户提供 唤醒词语料
|
||||
- 客户提供唤醒词语料
|
||||
|
||||
- 需要提供大于 2 万条合格的语料(语料需求见下文)
|
||||
- 语料提供给乐鑫后,需要 2~3 周进行模型训练及调优
|
||||
- 根据量级收取少量模型定制费用
|
||||
- 需要提供大于 2 万条合格的语料,具体语料需求见 :ref:`corpus-requirement`
|
||||
- 语料提供给乐鑫后,需要 2~3 周进行模型训练及调优
|
||||
- 根据量级收取少量模型定制费用
|
||||
|
||||
- 如果客户不提供唤醒词语料
|
||||
- 如果客户不提供唤醒词语料
|
||||
|
||||
- 所有训练语料由乐鑫采集提供
|
||||
- 语料提供给乐鑫后,需要 2~3 周进行模型训练及调优
|
||||
- 根据量级收取少量模型定制费用(语料采集费用另收)
|
||||
- 所有训练语料由乐鑫采集提供
|
||||
- 乐鑫需要一定时间收集语料,具体需要分别讨论;语料准备好后,需要 2~3 周进行模型训练及调优
|
||||
- 根据量级收取少量模型定制费用,语料采集费用另算
|
||||
|
||||
- 费用收取具体定价和定制时间,烦请邮件至 sales@espressif.com 协议商定
|
||||
|
||||
- 收费取决于 **唤醒词定制的数量** 以及 **产品量产数量**
|
||||
- 定制的具体时间和费用取决于 **唤醒词定制的数量** 以及 **产品量产数量**,详情请联系 `乐鑫销售人员 <mailto:sales@espressif.com>`_ 。
|
||||
|
||||
#. 对于乐鑫唤醒词模型:
|
||||
|
||||
- 目前单个模型最多支持5个及以内的唤醒词识别
|
||||
- 每个唤醒词通常由 3-6 音节组成,比如“hi乐鑫”,“Alexa”,“小爱同学”,“你好天猫”等
|
||||
- 可多个唤醒模型一起使用,具体需根据客户应用的资源消耗确定
|
||||
- 目前单个模型最多支持 5 个及以内的唤醒词识别
|
||||
- 每个唤醒词通常由 3-6 音节组成,比如 “hi乐鑫”,“Alexa”,“小爱同学”,“你好天猫”等
|
||||
- 可多个唤醒模型一起使用,具体需根据客户应用的资源消耗确定
|
||||
- 更多详情,请见 :doc:`WakeNet 唤醒词模型 <README>`
|
||||
|
||||
.. _corpus-requirement:
|
||||
|
||||
训练语料要求
|
||||
------------
|
||||
~~~~~~~~~~~~
|
||||
|
||||
客户可自备训练语料或向第三方采购,对于语料有以下要求
|
||||
客户可自备训练语料或向第三方采购,对于语料有以下要求:
|
||||
|
||||
- 语料音频格式要求
|
||||
- 语料音频格式要求
|
||||
|
||||
- 采样率(sample rate):16 KHz
|
||||
- 编码(encoding):16-bit signed int
|
||||
- 通道数(channel):mono
|
||||
- 格式:wav
|
||||
- 采样率(sample rate):16 KHz
|
||||
- 编码(encoding):16-bit signed int
|
||||
- 通道数(channel):mono
|
||||
- 格式:wav
|
||||
|
||||
- 语料采集要求
|
||||
- 语料采集要求
|
||||
|
||||
- 采样人数:最好样本可以大于 500 人,其中男女,年龄分布均衡,儿童不小于 100 人
|
||||
- 采样环境:环境噪声低(< 40 dB),建议在语音室等专业环境下录制
|
||||
- 录制场景:距离麦克风 1 m 处每人录制 15 遍,其中 5 遍快语速,5 遍正常语速,5 遍慢语速;距离麦克风 3 m 处每人录制 15 遍,其中 5 遍快语速,5 遍正常语速,5 遍慢语速
|
||||
- 录制设备:高保真麦克风
|
||||
- 样本命名需体现样本信息:如 female_age_fast_id.wav 或有单独表格记录每个样本的年龄,性别等信息
|
||||
- 采样人数:样本最好大于 500 人,男女比例、年龄分布均衡,儿童不少于 100 人
|
||||
- 采样环境:环境噪声低(< 40 dB),建议在语音室等专业环境下录制
|
||||
- 录制设备:高保真麦克风
|
||||
- 录制场景:
|
||||
- 距离麦克风 1 m 处每人录制 15 遍,其中 5 遍快语速,5 遍正常语速,5 遍慢语速;
|
||||
- 距离麦克风 3 m 处每人录制 15 遍,其中 5 遍快语速,5 遍正常语速,5 遍慢语速
|
||||
- 样本命名需体现样本信息:如 ``female_age_fast_id.wav`` 或有单独表格记录每个样本的年龄,性别等信息
|
||||
|
||||
硬件设计与测试
|
||||
--------------
|
||||
硬件设计与测试服务
|
||||
------------------
|
||||
|
||||
语音唤醒效果与硬件设计以及腔体结构有很大关系,为确保硬件设备设计合理,请认真阅读以下内容
|
||||
语音唤醒效果与硬件设计以及腔体结构有很大关系。因此,请认真阅读以下内容:
|
||||
|
||||
#. 硬件设计要求
|
||||
|
||||
- 对于各类语音音箱类设计,乐鑫可提供 原理图/PCB 等设计参考,客户可以根据自身具体需求设计修改,设计完毕后,乐鑫可提供Review服务,避免常见设计问题。
|
||||
- 各类语音音箱类设计:乐鑫可提供 **原理图/PCB** 等设计参考,客户可以根据自身具体需求设计修改,设计完毕后,乐鑫还可提供审阅服务,避免常见设计问题。
|
||||
|
||||
- 腔体结构,最好有专门的声学人员参与设计,乐鑫不提供 ID 设计类的参考,客户可以市场上主流音箱设计为参考
|
||||
- 例如:天猫精灵、小度音箱、谷歌音箱等
|
||||
- 腔体结构:建议有专门的声学人员参与设计,乐鑫不提供 ID 设计类参考,客户可参考市面上的主流音箱腔体设计,例如天猫精灵、小度音箱、谷歌音箱等。
|
||||
|
||||
#. 硬件设计好后,客户可通过以下简单测试,验证硬件设计效果(下列测试都是基于语音室环境,客户可以根据自身测试环境做调整)
|
||||
|
||||
- 录音测试,验证 MIC、codec 录音增益以及失真情况
|
||||
- 录音测试,验证 mic、codec 录音增益以及失真情况
|
||||
|
||||
- 音源 90 dB,距离 0.1 m 播放样本,调节增益,保证录音样本不饱和
|
||||
- 使用扫频文件(0~20 KHz),使用 16 KHz 采样率录音,音频不会出现明显频率混叠
|
||||
- 使用扫频文件 (0~20 KHz),使用 16 KHz 采样率录音,音频不会出现明显频率混叠
|
||||
- 录制 100 个语音样本,使用公开的云端语音识别端口识别,识别率达到指定标准
|
||||
|
||||
- 播音测试,验证 功率放大器(PA)、喇叭的失真情况
|
||||
- 播音测试,验证功率放大器 (PA)、喇叭的失真情况
|
||||
|
||||
- 测试PA功率 @1% 总谐波失真(THD)
|
||||
- 测试 PA 功率 @1% 总谐波失真 (THD)
|
||||
|
||||
- 语音算法测试,验证 AEC、BFM、NS 效果
|
||||
|
||||
- 首先需要注意下参考信号延时,不同的 AEC 算法有不同的要求
|
||||
- 以实际产品场景为测试指标,例如 MIC 播放 85DB-90DB 大梦想家.wav ,设备回采
|
||||
- 保存回声参考信号、回声消除后的信号分析,对比查看 AEC、NS、BFM 等效果
|
||||
- 以实际产品场景为测试指标,例如 mic 播放 ``85DB-90DB 大梦想家.wav``,设备回采
|
||||
- 保存回声参考信号、回声消除后的信号分析,对比查看 AEC、BFM、NS 等效果
|
||||
|
||||
- DSP性能测试,验证DSP参数是否合适,同时尽可能减少DSP算法中的非线性失真
|
||||
- DSP 性能测试,验证 DSP 参数是否合适,同时尽可能减少 DSP 算法中的非线性失真
|
||||
|
||||
- 降噪(Noise suppression)算法性能测试
|
||||
- 回声消除(Acoustic Echo Cancellation)算法性能测试
|
||||
- 语音增强(Speech Enhancement)算法性能测试
|
||||
- 降噪 (Noise Suppression) 算法性能测试
|
||||
- 回声消除 (Acoustic Echo Cancellation) 算法性能测试
|
||||
- 语音增强 (Speech Enhancement) 算法性能测试
|
||||
|
||||
#. 硬件设计完毕后, **可寄送** 1-2 台硬件至乐鑫,乐鑫会基于客户整机做唤醒词性能调优
|
||||
#. 硬件设计完毕后, **可寄送** 1-2 台硬件至乐鑫,乐鑫会基于客户整机做唤醒词性能调优。
|
||||
@ -1,14 +1,14 @@
|
||||
WakeNet
|
||||
========
|
||||
WakeNet 唤醒词模型
|
||||
===================
|
||||
|
||||
:link_to_translation:`en:[English]`
|
||||
|
||||
WakeNet是一个基于神经网络,为低功耗嵌入式MCU设计的的唤醒词模型,目前支持5个以内的唤醒词识别。
|
||||
WakeNet 是一个基于神经网络,为低功耗嵌入式 MCU 设计的唤醒词模型,目前支持 5 个以内的唤醒词识别。
|
||||
|
||||
Overview
|
||||
--------
|
||||
概述
|
||||
----
|
||||
|
||||
WakeNet的流程图如下:
|
||||
WakeNet 的流程图如下:
|
||||
|
||||
.. figure:: ../../_static/wakenet_workflow.png
|
||||
:alt: overview
|
||||
@ -21,33 +21,35 @@ WakeNet的流程图如下:
|
||||
|
||||
</center>
|
||||
|
||||
- Speech Features:
|
||||
我们使用 `MFCC <https://en.wikipedia.org/wiki/Mel-frequency_cepstrum>`__ 方法提取语音频谱特征。输入的音频文件采样率为16KHz,单声道,编码方式为signed 16-bit。每帧窗宽和步长均为30ms。
|
||||
- 语音特征 (Speech Feature)
|
||||
我们使用 `MFCC <https://en.wikipedia.org/wiki/Mel-frequency_cepstrum>`__ 方法提取语音频谱特征。输入的音频文件采样率为 16 KHz,单声道,编码方式为 signed 16-bit。每帧窗宽和步长均为 30 ms。
|
||||
|
||||
.. only:: latex
|
||||
|
||||
.. figure:: ../../_static/QR_MFCC.png
|
||||
:alt: overview
|
||||
|
||||
- Neural Network:
|
||||
神经网络结构已经更新到第9版,其中:
|
||||
- 神经网络 (Neural Network)
|
||||
神经网络结构已经更新到第 9 版,其中:
|
||||
|
||||
- wakeNet1,wakeNet2,wakeNet3,wakeNet4已经停止使用。
|
||||
- wakeNet5应用于ESP32芯片。
|
||||
- wakeNet8和wakeNet9应用于ESP32S3芯片,模型基于 `Dilated Convolution <https://arxiv.org/pdf/1609.03499.pdf>`__ 结构。
|
||||
- WakeNet1、WakeNet2、WakeNet3、WakeNet4、WakeNet6 和 WakeNet7 已经停止使用。
|
||||
- WakeNet5 应用于 ESP32 芯片。
|
||||
- WakeNet8 和 WakeNet9 应用于 ESP32-S3 芯片,模型基于 `Dilated Convolution <https://arxiv.org/pdf/1609.03499.pdf>`__ 结构。
|
||||
|
||||
.. only:: latex
|
||||
|
||||
.. figure:: ../../_static/QR_Dilated_Convolution.png
|
||||
:alt: overview
|
||||
|
||||
注意,WakeNet5,WakeNet5X2 和 WakeNet5X3 的网络结构一致,但是 WakeNet5X2 和 WakeNet5X3 的参数比 WakeNet5 要多。请参考 `性能测试 <#性能测试>`__ 来获取更多细节。
|
||||
注意,WakeNet5、WakeNet5X2 和 WakeNet5X3 的网络结构一致,但是 WakeNet5X2 和 WakeNet5X3 的参数比 WakeNet5 要多。请参考 :doc:`资源消耗 <../benchmark/README>` 来获取更多细节。
|
||||
|
||||
- Keyword Trigger Method:
|
||||
对连续的音频流,为准确判断关键词的触发,我们通过计算若干帧内识别结果的平均值M,来判断触发。当M大于大于指定阈值,发出触发的命令。
|
||||
- Keyword Trigger Method
|
||||
对连续的音频流,为准确判断关键词的触发,我们通过计算若干帧内识别结果的平均值 M,来判断是否触发。当 M 大于指定阈值,则发出触发的命令。
|
||||
|
||||
以下表格展示在不同芯片上的模型支持:
|
||||
|
||||
.. _esp-open-wake-word:
|
||||
|
||||
+-----------------+-----------+-------------+-------------+-----------+-----------+-----------+-----------+
|
||||
| Chip | ESP32 | ESP32S3 |
|
||||
+=================+===========+=============+=============+===========+===========+===========+===========+
|
||||
@ -70,8 +72,8 @@ WakeNet的流程图如下:
|
||||
| Customized word | | | | | | | √ |
|
||||
+-----------------+-----------+-------------+-------------+-----------+-----------+-----------+-----------+
|
||||
|
||||
WakeNet使用
|
||||
-----------
|
||||
WakeNet 的使用
|
||||
---------------
|
||||
|
||||
- WakeNet 模型选择
|
||||
|
||||
@ -79,29 +81,24 @@ WakeNet使用
|
||||
|
||||
自定义的唤醒词,请参考 :doc:`乐鑫语音唤醒词定制流程 <ESP_Wake_Words_Customization>` 。
|
||||
|
||||
- WakeNet 运行
|
||||
- WakeNet 模型运行
|
||||
|
||||
WakeNet 目前包含在语音前端算法 :doc:`AFE <../audio_front_end/README>` 中,默认为运行状态,并将识别结果通过 AFE fetch 接口返回。
|
||||
|
||||
如果用户不需要初始化 WakeNet,请在 AFE 配置时选择:
|
||||
如果用户无需使用 WakeNet 唤醒,请在 AFE 配置时选择:
|
||||
|
||||
::
|
||||
|
||||
afe_config.wakenet_init = False.
|
||||
|
||||
如果用户想临时关闭/打开 WakeNet, 请在运行过程中调用:
|
||||
如果用户想临时关闭/打开 WakeNet, 请在运行过程中调用:
|
||||
|
||||
::
|
||||
|
||||
afe_handle->disable_wakenet(afe_data)
|
||||
afe_handle->enable_wakenet(afe_data)
|
||||
|
||||
性能测试
|
||||
资源消耗
|
||||
--------
|
||||
|
||||
具体请参考 :doc:`Performance Test <../benchmark/README>` 。
|
||||
|
||||
唤醒词定制
|
||||
----------
|
||||
|
||||
如果需要定制唤醒词,请参考 :doc:`乐鑫语音唤醒词定制流程 <ESP_Wake_Words_Customization>` 。
|
||||
具体请参考 :doc:`资源消耗 <../benchmark/README>` 。
|
||||
8
docs/zh_CN/wake_word_engine/index.rst
Normal file
8
docs/zh_CN/wake_word_engine/index.rst
Normal file
@ -0,0 +1,8 @@
|
||||
唤醒词
|
||||
======
|
||||
|
||||
.. toctree::
|
||||
:maxdepth: 1
|
||||
|
||||
WakeNet 唤醒词模型简介 <README>
|
||||
唤醒词定制服务 <ESP_Wake_Words_Customization>
|
||||