mirror of https://github.com/espressif/esp-sr.git
synced 2025-09-15 15:28:44 +08:00

Merge branch 'update_doc' into model_compress

This commit is contained in: commit 75444c4ddd

README.md (21 lines changed)
@@ -1,29 +1,30 @@
# esp_sr

Espressif esp_sr provides basic algorithms for **Speech Recognition** applications. Now, this framework has three modules:

Espressif esp_sr provides basic algorithms for **Speech Recognition** applications. Now, this framework has four modules:

* The wake word detection model [WakeNet](wake_word_engine/README.md)
* The speech command recognition model [MultiNet](speech_command_recognition/README.md)
* Acoustic algorithms: MASE (Mic Array Speech Enhancement), AEC (Acoustic Echo Cancellation), VAD (Voice Activity Detection), AGC (Automatic Gain Control), NS (Noise Suppression)

* The wake word detection model [WakeNet](docs/wake_word_engine/README.md)
* The speech command recognition model [MultiNet](docs/speech_command_recognition/README.md)
* Audio Front-End [AFE](docs/audio_front_end/README.md)
* The text-to-speech model [esp-tts](esp-tts/README.md)

These algorithms are provided in the form of a component, so they can be integrated into your projects with minimum effort.

## Wake Word Engine

Espressif wake word engine [WakeNet](wake_word_engine/README.md) is specially designed to provide a high-performance, low memory footprint wake word detection algorithm, which enables devices to always listen for wake words, such as “Alexa”, “天猫精灵” (Tian Mao Jing Ling) and “小爱同学” (Xiao Ai Tong Xue).

Espressif wake word engine [WakeNet](docs/wake_word_engine/README.md) is specially designed to provide a high-performance, low memory footprint wake word detection algorithm, which enables devices to always listen for wake words, such as “Alexa”, “天猫精灵” (Tian Mao Jing Ling) and “小爱同学” (Xiao Ai Tong Xue).

Currently, Espressif has not only provided the official wake word "Hi, Lexin" to the public for free, but also allows customized wake words. For details on how to customize your own wake words, please see [Espressif Speech Wake Words Customization Process](wake_word_engine/ESP_Wake_Words_Customization.md).

Currently, Espressif has not only provided the official wake words "Hi, Lexin" and "Hi, ESP" to the public for free, but also allows customized wake words. For details on how to customize your own wake words, please see [Espressif Speech Wake Words Customization Process](docs/wake_word_engine/ESP_Wake_Words_Customization.md).

## Speech Command Recognition

Espressif's speech command recognition model [MultiNet](speech_command_recognition/README.md) is specially designed to provide a flexible off-line speech command recognition model. With this model, you can easily add your own speech commands, eliminating the need to train the model again.

Espressif's speech command recognition model [MultiNet](docs/speech_command_recognition/README.md) is specially designed to provide a flexible off-line speech command recognition model. With this model, you can easily add your own speech commands, eliminating the need to train the model again.

Currently, Espressif **MultiNet** supports up to 100 Chinese or English speech commands, such as “打开空调” (Turn on the air conditioner) and “打开卧室灯” (Turn on the bedroom light).

Currently, Espressif **MultiNet** supports up to 200 Chinese or English speech commands, such as “打开空调” (Turn on the air conditioner) and “打开卧室灯” (Turn on the bedroom light).

## Audio Front End

Espressif Audio Front-End [AFE](audio_front_end/README.md) integrates AEC (Acoustic Echo Cancellation), VAD (Voice Activity Detection), BSS (Blind Source Separation) and NS (Noise Suppression).

Espressif Audio Front-End [AFE](docs/audio_front_end/README.md) integrates AEC (Acoustic Echo Cancellation), VAD (Voice Activity Detection), MASE (Mic Array Speech Enhancement) and NS (Noise Suppression).

Our two-mic Audio Front-End (AFE) has been qualified as a “Software Audio Front-End Solution” for [Amazon Alexa Built-in devices](https://developer.amazon.com/en-US/alexa/solution-providers/dev-kits#software-audio-front-end-dev-kits).
@@ -0,0 +1,55 @@
# Espressif Microphone Design Guidelines

> This document provides microphone design guidelines and suggestions for the ESP32-S3 series of audio development boards.

### Electrical Performance

1. Type: omnidirectional MEMS microphone
2. Sensitivity
   - Under 1 Pa sound pressure, it should be no less than -38 dBV for analog microphones, and -26 dBFS for digital microphones.
   - The tolerance should be controlled within ±2 dB, and within ±1 dB for microphone arrays.
3. Signal-to-noise ratio (SNR)
   - No less than 62 dB; higher than 64 dB is recommended.
   - The frequency response should fluctuate within ±3 dB from 50 Hz to 16 kHz.
   - PSRR should be larger than 55 dB for MEMS microphones.

---
### Structure Design

1. The aperture or width of the microphone hole is recommended to be greater than 1 mm; the pickup pipe should be as short as possible and the cavity as small as possible, to ensure that the resonance frequency of the microphone and structural components is above 9 kHz.
2. The ratio of the depth to the diameter of the pickup hole should be less than 4:1, and the thickness of the shell is recommended to be 1 mm. If the shell is too thick, the opening area must be increased.
3. The microphone hole must be protected by an anti-dust mesh.
4. A silicone sleeve or foam must be added between the microphone and the device shell for sealing and shockproofing, and an interference-fit design is required to ensure the tightness of the microphone.
5. The microphone hole cannot be blocked. The bottom microphone hole needs to be raised in the structure to prevent it from being blocked by the desktop.
6. The microphone should be placed far away from the speaker and other objects that can produce noise or vibration, and should be isolated and buffered from the speaker sound cavity by rubber pads.

---
### Microphone Array Design

1. Type: omnidirectional MEMS microphone. Use the same model from the same manufacturer for the whole array; mixing different microphones is not recommended.
2. The sensitivity difference among microphones in the array should be within 3 dB.
3. The phase difference among the microphones in the array should be controlled within 10°.
4. It is recommended to keep the structural design of each microphone in the array the same to ensure consistency.
5. Two-microphone solution: the distance between the microphones should be 4 ~ 6.5 cm, the axis connecting them should be parallel to the horizontal line, and the center of the two microphones should be horizontally as close as possible to the center of the product.
6. Three-microphone solution: the microphones are equally spaced on a perfect circle, 120 degrees apart, and the spacing should be 4 ~ 6.5 cm.

---
### Microphone Structure Tightness

Use plasticine or another sealing material to block the microphone pickup hole, and compare how much the signal collected by the microphone decreases before and after the seal. 25 dB is qualified, and 30 dB is recommended. Below are the test procedures.

1. Play white noise at 0.5 meters above the microphone, and keep the volume at the microphone at 90 dB.
2. Use the microphone array to record for more than 10 s, and store it as recording file A.
3. Use plasticine or another material to block the microphone pickup hole, record for more than 10 s, and store it as recording file B.
4. Compare the frequency spectrum of the two files and make sure that the overall attenuation in the 100 Hz ~ 8 kHz frequency band is more than 25 dB.

---
### Echo Reference Signal Design

1. It is recommended that the echo reference signal be taken as close to the speaker side as possible, recovered between the DAC post-stage and the PA pre-stage.
2. When the speaker volume is at its maximum, the echo reference signal input to the microphone should not have saturation distortion. At the maximum volume, the speaker amplifier output THD should be less than 10% at 100 Hz, less than 6% at 200 Hz, and less than 3% above 350 Hz.
3. When the speaker volume is at its maximum, the sound pressure picked up by the microphone should not exceed 102 dB @ 1 kHz.
4. The echo reference signal voltage must not exceed the maximum allowed input voltage of the ADC. If it is too high, an attenuation circuit should be added.
5. A low-pass filter should be added when introducing the reference echo signal from the output of a Class-D power amplifier. The cutoff frequency of the filter is recommended to be more than 22 kHz.
6. When the volume is played at the maximum, the recovered signal peak value should be -3 to -5 dB.

---
### Microphone Array Consistency

It is required that the level difference between the sampled signals of the microphones is less than 3 dB. Below are the test procedures.

1. Play white noise at 0.5 meters above the microphone, and keep the volume at the microphone at 90 dB.
2. Use the microphone array to record for more than 10 s, and check whether the recording amplitude and audio sampling rate of each microphone are consistent.
BIN docs/img/add_speech_ch.png (new file, 25 KiB)
BIN docs/img/add_speech_en.png (new file, 43 KiB)
BIN docs/img/support.png (new file, 2.1 KiB)
@@ -1,6 +1,6 @@
# MultiNet Introduction [[中文]](./README_cn.md)
# MultiNet Introduction

MultiNet is a lightweight model specially designed based on [CRNN](https://arxiv.org/pdf/1703.05390.pdf) and [CTC](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.75.6306&rep=rep1&type=pdf) for the implementation of multi-command recognition with ESP32. Now, up to 100 speech commands, including customized commands, are supported.

MultiNet is a lightweight model specially designed based on [CRNN](https://arxiv.org/pdf/1703.05390.pdf) and [CTC](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.75.6306&rep=rep1&type=pdf) for the implementation of multi-command recognition. Now, up to 200 speech commands, including customized commands, are supported.

## Overview

@@ -22,49 +22,6 @@ Please see the flow diagram below:

## User Guide

### User-defined Command

Currently, users can define their own speech commands by using the command `make menuconfig`. You can refer to the method of adding speech commands in `menuconfig -> ESP Speech Recognition -> Add speech commands`; there are already 20 Chinese commands and 7 English commands pre-stored in sdkconfig.

**Chinese**

|Command ID|Command|Command ID|Command|Command ID|Command|Command ID|Command|
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
|0|打开空调 (Turn on the air conditioner)|5|降低一度 (Decrease by one degree)|10|除湿模式 (Dehumidifying mode)|15|播放歌曲 (Play a song)|
|1|关闭空调 (Turn off the air conditioner)|6|制热模式 (Heating mode)|11|健康模式 (Healthy mode)|16|暂停播放 (Pause playing)|
|2|增大风速 (Give me more wind)|7|制冷模式 (Cooling mode)|12|睡眠模式 (Sleep mode)|17|定时一小时 (Set timer to 1 hour)|
|3|减少风速 (Give me less wind)|8|送风模式 (Ventilating mode)|13|打开蓝牙 (Enable the Bluetooth)|18|打开电灯 (Turn on the light)|
|4|升高一度 (Increase by one degree)|9|节能模式 (Power-saving mode)|14|关闭蓝牙 (Disable the Bluetooth)|19|关闭电灯 (Turn off the light)|

**English**

|Command ID|Command|Command ID|Command|
|:---:|:---:|:---:|:---:|
|0|turn on the light|4|red mode|
|1|turn off the light|5|blue mode|
|2|lighting mode|6|yellow mode|
|3|reading mode|||

MultiNet supports user-defined commands. You can add your own commands to MultiNet. Note that a newly added command must be assigned a command ID before it can be recognized by MultiNet.

### Add Speech Command

Now, the MultiNet model predefines some speech commands. Users can also define their own speech commands and the number of speech command IDs in `menuconfig -> Component config -> ESP Speech Recognition -> Add speech commands` and `The number of speech commands`.

##### Chinese Speech Command Recognition

The speech commands should be provided in Pinyin with spaces in between. For example, the command “打开空调”, which means to turn on the air conditioner, should be provided as "da kai kong tiao".

##### English Speech Command Recognition

The speech commands should be provided in specific phonetic symbols with spaces in between. Please use the `general_label_EN/general_label_en.py` script in the tools directory of the esp-skainet root directory to generate the phonetic symbols corresponding to the command words. For details, please refer to [the phonetic symbol generation method](https://github.com/espressif/esp-skainet/tree/master/tools/general_label_EN/README.md).

**Note:**

- One speech command ID can correspond to multiple speech command phrases;
- Up to 100 speech command IDs or speech command phrases, including customized commands, are supported;
- The corresponding multiple phrases for one command ID need to be separated by ','.

### Basic Configuration

Define the following two variables before using the command recognition model:

@@ -79,6 +36,47 @@ Define the following two variables before using the command recognition model:

`model_iface_data_t *model_data = multinet->create(&MULTINET_COEFF, 6000);`

### Modify Speech Commands

For Chinese MultiNet, we use Pinyin without tones as units.
For English MultiNet, we use the international phonetic alphabet as units. [multinet_g2p.py](../../tool/multinet_g2p.py) is used to convert English phrases into phonemes which can be recognized by MultiNet.
Now, MultiNet supports two methods to modify speech commands.

1. menuconfig (before compilation)

   Users can define their own speech commands via `idf.py menuconfig -> ESP Speech Recognition -> Add speech commands`.

   Chinese predefined commands:

   

   English predefined commands:

   

2. reset API (after compilation)

   Users can also modify speech commands in the code.

   ```
   // Chinese
   char err_id[200];
   char *ch_commands_str = "da kai dian deng,kai dian deng;guan bi dian deng,guan dian deng;guan deng;";
   multinet->reset(model_data, ch_commands_str, err_id);

   // English
   char *en_commands_en = "TfL Mm c qbK;Sgl c Sel;TkN nN jc LiT;TkN eF jc LiT";
   multinet->reset(model_data, en_commands_en, err_id);
   ```

**Note:**

- One speech command ID can correspond to multiple speech command phrases;
- Up to 200 speech command IDs or speech command phrases, including customized commands, are supported;
- Different command IDs need to be separated by ';'. The corresponding multiple phrases for one command ID need to be separated by ','.
- `err_id` returns the phrases whose spelling does not meet the requirements.

### API Reference

#### Header

@@ -159,6 +157,19 @@ Define the following two variables before using the command recognition model:

* The command ID, if a matching command is found.
* -1, if no matching command is found.

- `typedef void (*esp_mn_iface_op_reset_t)(model_iface_data_t *model, char *command_str, char *err_phrase_id);`

  **Definition**

  Reset the speech commands.

  **Parameters**

  model: Model object to reset.
  command_str: The speech commands string. ';' is used to separate commands with different command IDs. ',' is used to separate different phrases for the same command ID.
  err_phrase_id: Returns the phrases with incorrect spelling.

- `typedef void (*esp_mn_iface_op_destroy_t)(model_iface_data_t *model);`
@@ -1,79 +1,64 @@
# Espressif Speech Wake Word Customization Process [[中文]](./乐鑫语音唤醒词定制流程.md)
# Espressif Speech Wake-up Solution Customization Process

---

#### Offline Wake Word Customization
#### 1.1 Speech Wake Word Customization Process

Espressif provides users with the offline wake word customization service, which allows users to use both publicly available wake words (such as "Hi Lexin", "Alexa", and "Espressif") and customized wake words.

Espressif provides users with the **Off-line Wake Word Customization** service, which allows users to use both publicly available wake words (such as "Hi Lexin", "ni hao xiao xin", "ni hao xiao zhi" and "Hi Jeson") and customized wake words.

1. If you want to use publicly available wake words for commercial use
   - Please check the wake words provided in [esp-sr](https://github.com/espressif/esp-sr);
   - Please check the wake words provided in the ADF/ASR demos;
   - We will continue to provide more and more wake words that are free for commercial use.

2. If you want to use your own wake words, we can also provide the **Off-line Wake Word Customization** service.
   - You should be able to provide a training corpus meeting the requirements described in **Requirements on Corpus** below.
   - We need two to three weeks for training and optimization.
   - A service fee will be charged by Espressif in this case.

2. If you want to use custom wake words, we can also provide the offline wake word customization service.
   - If you provide a training corpus
     - It must consist of at least 20,000 qualified corpus entries (see the section below for detailed requirements);
     - It will take two to three weeks for Espressif to train and optimize the model after the hardware design meets our requirements;
     - The result will be delivered as a static wake word library;
     - Espressif will charge training fees based on the scale of your production.

   - Otherwise
     - Espressif will collect and provide all the training corpus (the corpus itself won't be shared);
     - Espressif will deliver a static library file of the successfully trained wake word to you;
     - It will take around three weeks to collect the corpus and train the model;
     - Espressif will charge training fees (corpus collection fees included) based on the scale of your production.

   - The above time is subject to change depending on the project.

   - Espressif will only charge a one-time customization fee depending on the number of wake words you customize and the scale of your production, and will not charge license fees based on the quantity and duration of use. Please email us at [sales@espressif.com](mailto:sales@espressif.com) for details of the fee.

3. About the Espressif wake word model
   - Now, a single wake word model can recognize up to five wake words.
   - Normally, each wake word contains three to six syllables, such as "Hi Le xin" (3 syllables), “Alexa” (3 syllables), "小爱同学" (4 syllables).
   - Several wake words can be used in combination based on your actual requirements.

3. If you want to use offline command words
   - Please set them up yourself by referring to the [esp-sr](https://github.com/espressif/esp-sr/tree/c5896943ea278195968c93c8b3466c720e641ebc/speech_command_recognition) algorithm. They do not need additional customization.
   - Similar to speech wake words, the performance of command words is also related to the hardware design, so please refer to *Espressif MIC Design Guidelines*.

#### Requirements on Corpus Texts

--------
#### 2.1 Requirements on Corpus

You can provide us your training corpus by preparing it yourself or purchasing one from a third-party service provider. However, please make sure your corpus meets the following requirements.

- Audio file format
  - Sample rate: 16 kHz
  - Encoding method: 16-bit signed int
  - Channel type: mono
  - File format: wav

- Sampling
  - Sample size: no less than 500 people, among which,
    - The number of males and females should be similar;
    - The number of people in different age groups should be similar;
    - The number of children should be larger than 100 (if children are among your target users).
  - Environment:
    - It's advised to collect your samples with a Hi-Fi microphone in a professional audio room, with ambient noise lower than 40 dB.
  - Each participant should
    - Position himself/herself at a distance of one meter from the microphone, and repeat the wake word 15 times (5 times fast, 10 times normal);
    - Position himself/herself at a distance of three meters from the microphone, and repeat the wake word 15 times (5 times fast, 10 times normal);
  - The naming of a sample file should reflect the sex, age, and speech speed of the speaker. An example of naming a sample file is `female_age_fast_id.wav`. Or you can provide a separate form to record this information.

#### Hardware Design and Test

1. The performance of wake word detection is heavily impacted by the hardware design and cavity structure. Therefore, please go through the following requirements on hardware design.

   - Hardware design: We provide reference design files for smart speakers, including schematic diagrams and PCB designs. Please refer to these files when designing your own speaker. It's advised that you send your designs to Espressif for review to avoid the most common design issues.

   - Cavity structure: We don't provide reference designs for cavity structures. Therefore, it's advised to involve acoustic professionals during the design and take reference from other mainstream speakers on the market, such as TmallGenie (天猫精灵), Baidu speaker (小度音箱) and Google speaker (谷歌音箱).

2. You can evaluate the performance of your design by performing the following tests. Note that all the tests below are designed to be performed in an audio room. Please make adjustments according to your actual situation.

   - Record test to evaluate the gain and distortion of the MIC and codec.
     - Play audio samples (90 dB, 0.1 meter away from the MIC), and make sure the recorded sample is not saturated by adjusting the gain of the MIC.
     - Play a frequency sweep file (0 ~ 20 kHz), and record it using a sample rate of 16 kHz. No prominent aliasing should be observed.
     - Use a publicly released cloud speech recognition API to recognize 100 audio samples. The recognition rate should meet a certain standard.

   - Playing test to verify the distortion of the PA and speaker by measuring:
     - PA power @ 1% THD.

   - Test the performance of the DSP, and verify that the DSP parameters are configured correctly, while minimizing the non-linear distortion in the DSP arithmetic.
     - Test the performance of the **Noise Suppression** algorithm
     - Test the performance of the **Acoustic Echo Cancellation** algorithm
     - Test the performance of the **Speech Enhancement** algorithm

3. After your hardware design is done, it's advised to **send** one or two units of your hardware to Espressif, so we can optimize wake word detection performance at the whole-product level.
As mentioned above, you can provide your own training corpus to Espressif. Below are the requirements.

1. Audio file format
   - Sample rate: 16 kHz
   - Encoding: 16-bit signed int
   - Channel: mono
   - Format: WAV

2. Sampling environment
   - A room with ambient noise lower than 30 dB and reverberation less than 0.3 s, or a professional audio room (recommended).
   - Recording device: high-fidelity microphone.
     - Recording on the whole product is strongly recommended.
     - The development board of your product also works when there is no cavity structure yet.
   - Record at 16 kHz, and don't use **resampling**.
   - At the recording site, pay attention to the impact of reverberation interference in a closed environment.
   - Collect samples with multiple recording devices at the same time (recommended).
     - For example, position the devices at 1 m and 3 m away.
     - This way, more samples are collected in the same amount of time with the same participants.

3. Sample distribution
   - Sample size: 500 people. The ratio of males to females should be close to 1:1.
   - The number of children under 12 years old involved varies from product to product, but the percentage should be no less than 15%.
   - If there are requirements for certain languages or dialects, special corpus samples need to be provided.
   - It is recommended to name the samples according to the age group, gender, and index of the collected samples, such as HiLeXin\_male\_B\_014.wav, where A, B, C and D represent different age groups.

#### 2.2 Hardware Design Guidelines

1. Please refer to *Espressif MIC Design Guidelines*.
@@ -1,4 +1,4 @@
# WakeNet [[中文]](./README_cn.md)
# WakeNet

WakeNet, which is a wake word engine built upon neural networks, is specially designed for low-power embedded MCUs. Now, the WakeNet model supports up to 5 wake words.

@@ -16,8 +16,8 @@ Please see the flow diagram of WakeNet below:

- Neural Network:
  Now, the neural network structure has been updated to the sixth edition, among which,
  - WakeNet1 and WakeNet2 have been taken out of use.
  - WakeNet3 and WakeNet4 are built upon the [CRNN](https://arxiv.org/abs/1703.05390) structure.
  - WakeNet5 (WakeNet5X2, WakeNet5X3) and WakeNet6 are built upon the [Dilated Convolution](https://arxiv.org/pdf/1609.03499.pdf) structure.
  - WakeNet3 and WakeNet4 have been taken out of use.
  - WakeNet5 (WakeNet5X2, WakeNet5X3), WakeNet7 and WakeNet8 are built upon the [Dilated Convolution](https://arxiv.org/pdf/1609.03499.pdf) structure.

  Note that the network structure of WakeNet5, WakeNet5X2 and WakeNet5X3 is the same, but WakeNet5X2 and WakeNet5X3 have more parameters than WakeNet5. Please refer to [Resource Occupancy](#performance-test) for details.
@@ -54,8 +54,12 @@ Please see the flow diagram of WakeNet below:

```
typedef enum {
    DET_MODE_90 = 0,     // Normal, response accuracy rate about 90%
    DET_MODE_95 = 1,     // Aggressive, response accuracy rate about 95%
    DET_MODE_2CH_90 = 2, // 2-channel detection, normal mode
    DET_MODE_2CH_95 = 3, // 2-channel detection, aggressive mode
    DET_MODE_3CH_90 = 4, // 3-channel detection, normal mode
    DET_MODE_3CH_95 = 5, // 3-channel detection, aggressive mode
} det_mode_t;
```
@@ -71,25 +75,27 @@ Please see the flow diagram of WakeNet below:

|Model Type|Parameter Num|RAM|Average Running Time per Frame|Frame Length|
|:---:|:---:|:---:|:---:|:---:|
|Quantised WakeNet3|26 K|20 KB|29 ms|90 ms|
|Quantised WakeNet4|53 K|22 KB|48 ms|90 ms|
|Quantised WakeNet5|41 K|15 KB|5.5 ms|30 ms|
|Quantised WakeNet5X2|165 K|20 KB|10.5 ms|30 ms|
|Quantised WakeNet5X3|371 K|24 KB|18 ms|30 ms|
|Quantised WakeNet6|378 K|45 KB|4 ms (task1) + 25 ms (task2)|30 ms|

**Note**: Quantised WakeNet6 is split into two tasks: task1 is used to calculate speech features, and task2 is used to run the neural network model.

### 2. Resource Occupancy (ESP32-S3)

|Model Type|Parameter Num|RAM|Average Running Time per Frame|Frame Length|
|:---:|:---:|:---:|:---:|:---:|
|Quantised WakeNet7_2CH|810 K|45 KB|10 ms|32 ms|
|Quantised WakeNet8_2CH|821 K|50 KB|10 ms|32 ms|
### 2. Performance

|Distance|Quiet|Stationary Noise (SNR = 5 dB)|Speech Noise (SNR = 5 dB)|AEC Interruption (-10 dB)|
|:---:|:---:|:---:|:---:|:---:|
|1 m|95%|88%|85%|89%|
|3 m|90%|80%|75%|80%|

|Distance|Quiet|Stationary Noise (SNR = 4 dB)|Speech Noise (SNR = 4 dB)|AEC Interruption (-10 dB)|
|:---:|:---:|:---:|:---:|:---:|
|1 m|98%|96%|94%|96%|
|3 m|98%|94%|92%|94%|

False triggering rate: 1 time in 12 hours

**Note**: We use the ESP32-LyraT-Mini development board and the WakeNet5X2 (hilexin) model in our test. The performance is limited because the ESP32-LyraT-Mini only has one microphone. We expect better recognition performance when more microphones are involved in the test.

**Note**: We use the ESP32-S3-Korvo V4.0 development board and the WakeNet8 (Alexa) model in our test.

## Wake Word Customization
@@ -74,7 +74,7 @@ typedef void (*esp_mn_iface_op_destroy_t)(model_iface_data_t *model);

 * @brief Reset the speech commands recognition model
 *
 */
typedef void (*esp_mn_iface_op_reset_t)(model_iface_data_t *model_data, char *command_str, char *err_phrase_id);
typedef void (*esp_mn_iface_op_reset_t)(model_iface_data_t *model, char *command_str, char *err_phrase_id);

typedef struct {