WakeNet
WakeNet is a wake word engine built upon neural networks and specially designed for low-power embedded MCUs. The WakeNet model currently supports up to 5 wake words.
Overview
Please see the flow diagram of WakeNet below:
- Speech Feature:
  WakeNet uses MFCC to extract the features of the input audio clip (16 kHz, 16-bit, mono). The window width and the step width of each frame of the audio clip are both 30 ms.
- Neural Network:
  The neural network structure has been updated to its sixth edition, among which:
  - WakeNet1 and WakeNet2 are no longer in use.
  - WakeNet3 and WakeNet4 are no longer in use.
  - WakeNet5 (WakeNet5X2, WakeNet5X3), WakeNet7, and WakeNet8 are built upon the Dilated Convolution structure. Note that WakeNet5, WakeNet5X2, and WakeNet5X3 share the same network structure, but WakeNet5X2 and WakeNet5X3 have more parameters than WakeNet5. Please refer to Resource Occupancy for details.
- Keyword Triggering Method:
  For a continuous audio stream, WakeNet averages the recognition results (M) over several frames to produce a smoothed prediction, which improves the accuracy of keyword triggering. A trigger command is sent only when M is larger than the set threshold.
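The triggering method above can be illustrated with a small moving-average filter. This is only a sketch under stated assumptions, not WakeNet's actual implementation: the window length `SMOOTH_WINDOW`, the `smoother_t` type, and both function names are hypothetical, and the per-frame score is assumed to be a value between 0 and 1.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SMOOTH_WINDOW 4  /* assumed number of frames to average; not a WakeNet constant */

/* Hypothetical state: ring buffer holding the latest per-frame scores. */
typedef struct {
    float scores[SMOOTH_WINDOW];
    size_t next;   /* slot that the next score overwrites */
    size_t count;  /* number of valid scores, at most SMOOTH_WINDOW */
    float sum;     /* running sum of the valid scores */
} smoother_t;

void smoother_init(smoother_t *s)
{
    assert(s != NULL);
    for (size_t i = 0; i < SMOOTH_WINDOW; i++) {
        s->scores[i] = 0.0f;
    }
    s->next = 0;
    s->count = 0;
    s->sum = 0.0f;
}

/* Add one frame score and return true only when the window is full
 * and the averaged result M is larger than the threshold. */
bool smoother_push(smoother_t *s, float score, float threshold)
{
    assert(s != NULL);
    if (s->count == SMOOTH_WINDOW) {
        s->sum -= s->scores[s->next];  /* drop the oldest score */
    } else {
        s->count++;
    }
    s->scores[s->next] = score;
    s->sum += score;
    s->next = (s->next + 1) % SMOOTH_WINDOW;

    float m = s->sum / (float)s->count;  /* smoothed prediction M */
    return s->count == SMOOTH_WINDOW && m > threshold;
}
```

With this sketch, a single high-scoring frame surrounded by low scores does not trigger, because the average over the window stays below the threshold. Only a run of consistently high scores does.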
The following table shows the model support of Espressif SoCs:
API Introduction
- How to select the WakeNet model
  - Go to `make menuconfig`, navigate to `Component config` >> `ESP Speech Recognition` >> `Wake word engine`. See below:
  - WakeNet6 is divided into two tasks: task1 is used to calculate speech recognition, and task2 is used to calculate the neural network model. The ESP32 core that runs task2 can be selected via `Component config` >> `ESP Speech Recognition` >> `ESP32 core to run WakeNet6`.
- How to select the wake words
  Go to `make menuconfig`, and navigate to `Component config` >> `ESP Speech Recognition` >> `Wake words list`. See below:
  Note that the `customized word` option only supports WakeNet5 and WakeNet6. WakeNet3 and WakeNet4 are only compatible with earlier versions. If you want to use your own wake words, please overwrite the existing models in the `wake_word_engine` directory with your own wake word model.
- How to set the triggering threshold
  - The triggering threshold, in the range (0, 0.9999), can be set to adjust the accuracy of the wake word model. If a model supports more than one wake word, the threshold can be configured separately for each wake word.
  - The smaller the triggering threshold is, the higher the risk of false triggering is (and vice versa). Please configure the threshold according to your application.
  - The wake word engine predefines two thresholds for each wake word during initialization. See below:
    ```c
    typedef enum {
        DET_MODE_90 = 0,      // Normal
        DET_MODE_95 = 1,      // Aggressive
        DET_MODE_2CH_90 = 2,  // 2 Channel detection, Normal mode
        DET_MODE_2CH_95 = 3,  // 2 Channel detection, Aggressive mode
        DET_MODE_3CH_90 = 4,  // 3 Channel detection, Normal mode
        DET_MODE_3CH_95 = 5,  // 3 Channel detection, Aggressive mode
    } det_mode_t;
    ```
  - Use the `set_det_threshold()` function to configure the thresholds for different wake words after initialization.
- How to get the sampling rate and frame size
  - Use `get_samp_rate` to get the sampling rate of the audio stream to be recognized.
  - Use `get_samp_chunksize` to get the number of sampling points in each frame. The encoding of audio data is `signed 16-bit int`.
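Once the sampling rate and the chunk size are known, the audio buffer for one frame can be sized from them. The sketch below is illustrative only: `frame_bytes` and `frame_duration_ms` are hypothetical helpers, and the values returned by `get_samp_rate` and `get_samp_chunksize` are assumed to be passed in as plain integers, since their exact signatures are not shown here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes needed to hold one frame of signed 16-bit samples. */
size_t frame_bytes(int chunk_samples)
{
    assert(chunk_samples > 0);
    return (size_t)chunk_samples * sizeof(int16_t);
}

/* Duration of one frame in milliseconds, rounded down. */
int frame_duration_ms(int chunk_samples, int sample_rate_hz)
{
    assert(chunk_samples > 0 && sample_rate_hz > 0);
    return (int)(((int64_t)chunk_samples * 1000) / sample_rate_hz);
}
```

For example, assuming a 16 kHz stream and a 30 ms frame, the chunk would hold 480 samples, so `frame_bytes(480)` gives 960 bytes and `frame_duration_ms(480, 16000)` gives 30. The real chunk size should always be taken from the model rather than hard-coded.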
Performance Test
1. Resource Occupancy (ESP32)
| Model Type | Parameter Num | RAM | Average Running Time per Frame | Frame Length |
|---|---|---|---|---|
| Quantised WakeNet5 | 41 K | 15 KB | 5.5 ms | 30 ms |
| Quantised WakeNet5X2 | 165 K | 20 KB | 10.5 ms | 30 ms |
| Quantised WakeNet5X3 | 371 K | 24 KB | 18 ms | 30 ms |
2. Resource Occupancy (ESP32-S3)
| Model Type | Parameter Num | RAM | Average Running Time per Frame | Frame Length |
|---|---|---|---|---|
| Quantised WakeNet7_2CH | 810 K | 45 KB | 10 ms | 32 ms |
| Quantised WakeNet8_2CH | 821 K | 50 KB | 10 ms | 32 ms |
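One way to read the two resource tables above is to divide the average running time per frame by the frame length, which gives a rough share of one core's time spent on the model. The helpers below are a hypothetical sketch of that arithmetic. They assume a single core and ignore any other work the application does.

```c
#include <assert.h>

/* Share of one core's time used by the model:
 * average running time per frame divided by frame length. */
double cpu_load_fraction(double run_ms_per_frame, double frame_ms)
{
    assert(run_ms_per_frame >= 0.0 && frame_ms > 0.0);
    return run_ms_per_frame / frame_ms;
}

/* Returns 1 when a frame is processed before the next one arrives, else 0. */
int runs_in_real_time(double run_ms_per_frame, double frame_ms)
{
    return cpu_load_fraction(run_ms_per_frame, frame_ms) < 1.0;
}
```

Using the table values, WakeNet5 works out to 5.5 / 30, or about 18% of one core. WakeNet5X3 is 18 / 30, or 60%, and both ESP32-S3 models are 10 / 32, or about 31%.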
3. Performance
| Distance | Quiet | Stationary Noise (SNR = 4 dB) | Speech Noise (SNR = 4 dB) | AEC Interruption (-10 dB) |
|---|---|---|---|---|
| 1 m | 98% | 96% | 94% | 96% |
| 3 m | 98% | 94% | 92% | 94% |
False triggering rate: 1 time in 12 hours
Note: We used the ESP32-S3-Korvo V4.0 development board and the WakeNet8 (Alexa) model in our test.
Wake Word Customization
For details on how to customize your wake words, please see Espressif Speech Wake Word Customization Process.
