doc: Update docs

Wang Wang Wang 2022-02-25 16:45:12 +08:00
parent f4f7b7f9bd
commit 1841f9151c
17 changed files with 901 additions and 639 deletions


@ -11,7 +11,7 @@ These algorithms are provided in the form of a component, so they can be integra
## Wake Word Engine
Espressif wake word engine [WakeNet](docs/wake_word_engine/README.md) is specially designed to provide a high-performance, low memory footprint wake word detection algorithm, which enables devices to always listen for wake words such as “Alexa”, “天猫精灵” (Tian Mao Jing Ling) and “小爱同学” (Xiao Ai Tong Xue). You can refer to [Model loading method](./docs/flash_model/README.md) to build your project.
Espressif wake word engine [WakeNet](docs/wake_word_engine/README.md) is specially designed to provide a high-performance, low memory footprint wake word detection algorithm, which enables devices to always listen for wake words such as “Alexa”, “Hi,Lexin” and “Hi,ESP”. You can refer to [Model loading method](./docs/flash_model/README.md) to build your project.
Currently, Espressif not only provides the official wake words "Hi,Lexin" and "Hi,ESP" to the public for free, but also allows customized wake words. For details on how to customize your own wake words, please see [Espressif Speech Wake Words Customization Process](docs/wake_word_engine/ESP_Wake_Words_Customization.md).
@ -21,10 +21,9 @@ Espressif's speech command recognition model [MultiNet](docs/speech_command_reco
Currently, Espressif **MultiNet** supports up to 200 Chinese or English speech commands, such as “打开空调” (Turn on the air conditioner) and “打开卧室灯” (Turn on the bedroom light).
## Audio Front End
Espressif Audio Front-End [AFE](docs/audio_front_end/README.md) integrates AEC (Acoustic Echo Cancellation), VAD (Voice Activity Detection), MASE (Mic Array Speech Enhancement) and NS (Noise Suppression).
Espressif Audio Front-End [AFE](docs/audio_front_end/README.md) integrates AEC (Acoustic Echo Cancellation), VAD (Voice Activity Detection), BSS (Blind Source Separation) and NS (Noise Suppression).
Our two-mic Audio Front-End (AFE) has been qualified as a “Software Audio Front-End Solution” for [Amazon Alexa Built-in devices](https://developer.amazon.com/en-US/alexa/solution-providers/dev-kits#software-audio-front-end-dev-kits).


@ -16,7 +16,7 @@ VAD takes an audio stream as input, and outputs the prediction that a frame of t
#### Function
- `vad_handle_t vad_create(vad_mode_t vad_mode, int sample_rate_hz, int one_frame_ms)`
- `vad_handle_t vad_create(vad_mode_t vad_mode)`
**Definition**
@ -24,15 +24,13 @@ VAD takes an audio stream as input, and outputs the prediction that a frame of t
**Parameter**
- vad_mode: operating mode of VAD, integer ranging from 1 to 5, larger value indicates more aggressive VAD.
- sample_rate_hz: audio sampling rate in Hz.
- one_frame_ms: frame length in ms.
- vad_mode: operating mode of VAD, VAD_MODE_0 to VAD_MODE_4, larger value indicates more aggressive VAD.
**Return**
Handle to VAD.
- `vad_state_t vad_process(vad_handle_t inst, int16_t *data)`
- `vad_state_t vad_process(vad_handle_t inst, int16_t *data, int sample_rate_hz, int one_frame_ms);`
**Definition**
@ -42,10 +40,13 @@ VAD takes an audio stream as input, and outputs the prediction that a frame of t
- inst: VAD handle.
- data: buffer to save both input and output audio stream.
- sample_rate_hz: the sampling frequency in Hz; supported values are 32000, 16000, and 8000 (default: 16000).
- one_frame_ms: the frame length in ms; supported values are 10, 20, and 30 (default: 30).
**Return**
- VAD_SILENCE if no voice
- VAD_SPEECH if voice is detected
- `void vad_destroy(vad_handle_t inst)`
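A minimal usage sketch of the VAD interface above. The header name and the `get_audio_frame()` capture routine are assumptions for illustration; buffer sizing follows the 16000 Hz / 30 ms defaults documented above:
```
#include <stdbool.h>
#include <stdint.h>
#include "esp_vad.h"   // header name assumed; use the actual VAD header of your esp-sr version

extern bool get_audio_frame(int16_t *frame);   // placeholder audio source

void vad_example(void)
{
    // Mode 3 is fairly aggressive; valid modes are VAD_MODE_0 .. VAD_MODE_4
    vad_handle_t vad_inst = vad_create(VAD_MODE_3);

    int16_t frame[16000 / 1000 * 30];          // one 30 ms frame of 16 kHz mono audio
    while (get_audio_frame(frame)) {
        vad_state_t state = vad_process(vad_inst, frame, 16000, 30);
        if (state == VAD_SPEECH) {
            // voice detected in this frame
        }
    }
    vad_destroy(vad_inst);
}
```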
@ -67,26 +68,26 @@ AGC keeps the volume of audio signal at a stable level to avoid the situation th
- `void *esp_agc_open(int agc_mode, int sample_rate)`
**Definition**
Initialization of AGC handle.
**Parameter**
- agc_mode: operating mode of AGC, 3 to enable AGC and 0 to disable it.
- sample_rate: sampling rate of audio signal.
**Return**
- AGC handle.
- `int esp_agc_process(void *agc_handle, short *in_pcm, short *out_pcm, int frame_size, int sample_rate)`
**Definition**
Processing of AGC for one frame.
**Parameter**
- agc_handle: AGC handle.
- in_pcm: input audio stream.
@ -94,7 +95,7 @@ AGC keeps the volume of audio signal at a stable level to avoid the situation th
- frame_size: signal frame length in ms.
- sample_rate: signal sampling rate in Hz.
**Return**
Returns 0 if AGC processing succeeds and -1 if it fails; -2 and -3 indicate invalid sample_rate and frame_size input, respectively.
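A short sketch of the AGC call sequence described above, assuming one 10 ms frame at 16 kHz is processed at a time (the capture routine is a placeholder):
```
#include <stdbool.h>

extern bool get_audio_frame(short *pcm);   // placeholder audio source

void agc_example(void)
{
    short in_pcm[160], out_pcm[160];       // 10 ms at 16 kHz
    void *agc = esp_agc_open(3, 16000);    // mode 3 enables AGC

    while (get_audio_frame(in_pcm)) {
        // frame_size is given in ms, per the parameter description above
        int ret = esp_agc_process(agc, in_pcm, out_pcm, 10, 16000);
        if (ret != 0) {
            break;                         // -1/-2/-3: failure or invalid sample_rate/frame_size
        }
        // out_pcm now holds the level-stabilized audio
    }
}
```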
@ -180,15 +181,16 @@ Single-channel speech enhancement. If multiple mics are available with the board
### API Reference
- `ns_handle_t ns_create(int frame_length_ms)`
- `ns_handle_t ns_pro_create(int frame_length, int mode)`
**Definition**
Initialization of NS handle.
Creates an instance of the more powerful noise suppression algorithm.
**Parameter**
- frame_length_ms: audio frame length in ms.
- mode: 0: Mild, 1: Medium, 2: Aggressive
**Return**
@ -214,55 +216,4 @@ Single-channel speech enhancement. If multiple mics are available with the board
**Parameter**
- inst: the NS handle to be destroyed.
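A sketch of the NS life cycle implied by this excerpt. The per-frame processing call and the destroy signature are cut off in the excerpt, so `ns_process()` and `ns_destroy()` are assumed names here:
```
#include <stdint.h>

void ns_example(int16_t *in_frame, int16_t *out_frame)
{
    ns_handle_t ns_inst = ns_pro_create(10, 2);   // 10 ms frames, mode 2 (Aggressive)

    // ns_process()/ns_destroy() are assumed companion calls of ns_pro_create()
    ns_process(ns_inst, in_frame, out_frame);

    ns_destroy(ns_inst);
}
```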
## MASE
### Overview
Multi-channel speech enhancement. Currently, 2-mic linear array and 3-mic circular array are supported.
### API Reference
- `mase_handle_t mase_create(int sample_rate, int frame_size, int array_type, float mic_distance, int operating_mode, int filter_strength)`
**Definition**
Initialization of MASE handle.
**Parameter**
- sample_rate: signal sampling rate in Hz.
- frame_size: signal frame length in ms.
- array_type: 0 for 2-mic linear array and 1 for 3-mic circular array.
- mic_distance: distance between two microphones in mm.
- operating_mode: 0 for normal mode and 1 for wake-up enhanced mode.
- filter_strength: strength of the mic-array speech enhancement, must be 0, 1, 2 or 3 (smaller number indicates better performance and larger hardware cost).
- aec_on: true to enable and false to disable AEC.
- filter_length: the length of adaptive filter in AEC.
**Return**
Handle to MASE.
- `void mase_process(mase_handle_t st, int16_t *in, int16_t *dsp_out)`
**Definition**
Processing of MASE for one frame.
**Parameter**
- st: MASE handle.
- in: input multi-channel audio stream.
- dsp_out: output single-channel audio stream.
- `void mase_destory(mase_handle_t st)`
**Definition**
Destruction of a MASE handle.
**Parameter**
- inst: the MASE handle to be destroyed.
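A sketch of the MASE call sequence using the six-parameter signature shown above. The frame length, mic distance, and the interleaved input layout are illustrative assumptions:
```
#include <stdint.h>

#define FRAME_MS       16                    // assumed frame length in ms
#define SAMPLES_PER_CH (16000 / 1000 * FRAME_MS)

void mase_example(void)
{
    // 2-mic linear array (array_type = 0), 65 mm spacing, normal mode, strength 2
    mase_handle_t st = mase_create(16000, FRAME_MS, 0, 65.0f, 0, 2);

    int16_t in[2 * SAMPLES_PER_CH];          // multi-channel input (interleaving assumed)
    int16_t out[SAMPLES_PER_CH];             // single-channel enhanced output
    // ... fill `in` with one frame of 2-mic audio ...
    mase_process(st, in, out);

    mase_destory(st);                        // spelling as documented above
}
```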


@ -20,33 +20,40 @@ The workflow of Espressif AFE can be divided into four parts:
- AFE creation and initialization
- AFE feed: input audio data; AEC runs first inside the feed function
- Internal BSS/NS algorithms
- AFE fetch: returns the processed audio data and a return value. VAD runs inside fetch; if you configure WakeNet as enabled, WakeNet also performs wake word detection
- Internal: in the wake-word recognition / single-mic noise suppression scenario, the BSS/NS algorithms run; in the multi-microphone noise suppression scenario, the BSS/MISO algorithms run.
- AFE fetch: returns the processed audio data and a return value. In the wake-word recognition scenario, VAD processing and wake word detection run inside fetch, with the exact behavior depending on the `afe_config_t` configuration; in the multi-microphone noise suppression scenario, noise suppression runs. (Note: `wakenet_init` and `voice_communication_init` cannot both be configured as true)
**Note:** `afe->feed()` and `afe->fetch()` are visible to users, while `internal BSS/NS task` is invisible to users.
**Note:** `afe->feed()` and `afe->fetch()` are visible to users, while `internal BSS/NS/MISO task` is invisible to users.
> AEC runs in `afe->feed()` function;
> BSS is an independent task in AFE;
> The results of VAD and WakeNet are obtained by `afe->fetch()` function.
> AEC runs in `afe->feed()` function; If aec_init is configured as false, BSS/NS will run in the afe->feed() function.
> BSS/NS/MISO is an independent task in AFE;
> The results of VAD/WakeNet and the audio data after processing are obtained by `afe->fetch()` function.
### Select AFE handle
Espressif AFE supports both single-MIC and dual-MIC scenarios. The internal task of single-MIC applications runs NS, and the internal task of dual-MIC applications runs BSS.
Espressif AFE supports both single-MIC and dual-MIC scenarios, and the algorithm modules can be configured flexibly. The internal task of single-MIC applications runs NS, and the internal task of dual-MIC applications runs BSS. If the dual-MIC scenario is configured for voice communication noise suppression (i.e. `wakenet_init=false, voice_communication_init=true`), an additional MISO internal task is added.
- Single MIC
- Get AFE handle
esp_afe_sr_iface_t *afe_handle = &esp_afe_sr_1mic;
- Dual MIC
esp_afe_sr_iface_t *afe_handle = &esp_afe_sr_2mic;
esp_afe_sr_iface_t *afe_handle = &ESP_AFE_HANDLE;
### Input Audio data
The AFE supports two kinds of scenarios: single MIC and dual MIC. The number of channels can be configured to match the audio passed to `afe->feed()`. To modify it, change the `pcm_config` member in the `AFE_CONFIG_DEFAULT()` macro. The following combinations are supported (note: `total_ch_num = mic_num + ref_num` must hold):
> total_ch_num=1, mic_num=1, ref_num=0
> total_ch_num=2, mic_num=1, ref_num=1
> total_ch_num=2, mic_num=2, ref_num=0
> total_ch_num=3, mic_num=2, ref_num=1
(Note: total_ch_num: the number of total channels, mic_num: the number of microphone channels, ref_num: the number of reference channels)
At present, AEC supports only one reference channel, so ref_num can only be 0 or 1.
- AFE single MIC
- Input audio data format: 16KHz, 16bit, two channels (one is mic data, another is reference data)
- The data frame length is 32ms. Users can use `afe->get_feed_chunksize()` to get the number of sampling points needed (the data type of sampling points is int16).
- Input audio data format: 16KHz, 16bit, two channels (one is mic data, the other is reference data); if AEC is not required and the audio contains no reference data, the input may contain only one channel of MIC data, with ref_num set to 0.
- The input data frame length varies with the algorithm modules configured by the user. Users can use `afe->get_feed_chunksize()` to get the number of sample points (the sample data type is int16).
The input data is arranged as follows:
@ -54,8 +61,8 @@ Espressif AFE supports both single MIC and dual MIC scenarios. The internal task
- AFE dual MIC
- Input audio data format: 16KHz, 16bit, three channels (two are mic data, another is reference data)
- The data frame length is 32ms. Users can use `afe->get_feed_chunksize()` to get the number of sampling points needed (the data type of sampling points is int16).
- Input audio data format: 16KHz, 16bit, three channels (two are mic data, the other is reference data); if AEC is not required and the audio contains no reference data, the input may contain only two channels of MIC data, with ref_num set to 0.
- The input data frame length varies with the algorithm modules configured by the user. Users can use `afe->get_feed_chunksize()` to get the number of sample points (the sample data type is int16).
The input data is arranged as follows:
@ -75,6 +82,10 @@ NS algorithm supports single-channel processing and can suppress the non-human n
The BSS algorithm supports dual-channel processing; it separates the target sound source well from the remaining interference, extracting the useful audio signal and ensuring the quality of the subsequent speech.
### MISO (Multi Input Single Output)
The MISO algorithm takes dual-channel input and produces single-channel output. In the dual-MIC scenario with WakeNet disabled, it selects the audio channel with the higher signal-to-noise ratio for output.
### VAD (Voice Activity Detection)
VAD algorithm supports real-time output of the voice activity state of the current frame.
@ -93,15 +104,9 @@ The output audio of AFE is single-channel data. When WakeNet is enabled, AFE wil
### 1. Define afe_handle
`afe_handle` is the function handle through which the user calls the AFE interfaces. Users need to select the `afe_handle` corresponding to the single-MIC or dual-MIC application.
`afe_handle` is the function handle through which the user calls the AFE interfaces. Therefore, the first step is to obtain `afe_handle`.
- Single MIC
esp_afe_sr_iface_t *afe_handle = &esp_afe_sr_1mic;
- Dual MIC
esp_afe_sr_iface_t *afe_handle = &esp_afe_sr_2mic;
esp_afe_sr_iface_t *afe_handle = &ESP_AFE_HANDLE;
### 2. Configure AFE
@ -117,16 +122,20 @@ Users can adjust the switch of each algorithm module and its corresponding param
.se_init = true, \
.vad_init = true, \
.wakenet_init = true, \
.vad_mode = 3, \
.wakenet_model = &WAKENET_MODEL, \
.wakenet_coeff = &WAKENET_COEFF, \
.voice_communication_init = false, \
.vad_mode = VAD_MODE_3, \
.wakenet_model = (esp_wn_iface_t *)&WAKENET_MODEL, \
.wakenet_coeff = (void *)&WAKENET_COEFF, \
.wakenet_mode = DET_MODE_2CH_90, \
.afe_mode = SR_MODE_HIGH_PERF, \
.afe_mode = SR_MODE_LOW_COST, \
.afe_perferred_core = 0, \
.afe_perferred_priority = 5, \
.afe_ringbuf_size = 50, \
.alloc_from_psram = 1, \
.agc_mode = 2, \
.memory_alloc_mode = AFE_MEMORY_ALLOC_MORE_PSRAM, \
.agc_mode = AFE_MN_PEAK_AGC_MODE_2, \
.pcm_config.total_ch_num = 3, \
.pcm_config.mic_num = 2, \
.pcm_config.ref_num = 1, \
}
```
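As a sketch of how these fields might be adjusted (values are illustrative; per the notes below, `wakenet_init` and `voice_communication_init` must not both be true):
```
afe_config_t afe_config = AFE_CONFIG_DEFAULT();
afe_config.wakenet_init = false;              // e.g. a voice-communication use case
afe_config.voice_communication_init = true;   // must not be true together with wakenet_init
afe_config.pcm_config.total_ch_num = 2;       // 2 mics, no reference channel
afe_config.pcm_config.mic_num = 2;
afe_config.pcm_config.ref_num = 0;            // total_ch_num = mic_num + ref_num
```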
@ -138,9 +147,11 @@ Users can adjust the switch of each algorithm module and its corresponding param
- wakenet_init: Whether the wake word algorithm is enabled.
- vad_mode: The VAD operating mode; a higher mode means more aggressive VAD.
- voice_communication_init: Whether voice communication noise reduction is enabled. It cannot be enabled together with wakenet_init.
- wakenet_model/wakenet_coeff/wakenet_mode: Use `make menuconfig` to choose the WakeNet model. Please refer to [WakeNet](https://github.com/espressif/esp-sr/tree/b9504e35485b60524977a8df9ff448ca89cd9d62/wake_word_engine)
- vad_mode: The VAD operating mode; the higher the mode, the more aggressive the VAD.
- wakenet_model/wakenet_coeff/wakenet_mode: Use `idf.py menuconfig` to choose the WakeNet model. Please refer to [WakeNet](../wake_word_engine/README.md)
- afe_mode: Espressif AFE supports two working modes: SR_MODE_LOW_COST, SR_MODE_HIGH_PERF. See the afe_sr_mode_t enumeration for details.
@ -151,19 +162,37 @@ Users can adjust the switch of each algorithm module and its corresponding param
**ESP32 only supports SR_MODE_HIGH_PERF, while ESP32S3 supports both modes.**
- afe_perferred_core: Which CPU core the internal AFE BSS/NS algorithm runs on.
- afe_perferred_core: Which CPU core the internal AFE BSS/NS/MISO algorithm runs on.
- afe_ringbuf_size: The configuration of the internal ringbuf size.
- afe_perferred_priority: The running priority of the BSS/NS/MISO algorithm task.
- alloc_from_psram: Whether to allocate memory from external PSRAM first. Three values can be configured:
- afe_ringbuf_size: Configuration of the internal ringbuf size.
- 0: Allocated from internal RAM.
- memory_alloc_mode: The memory allocation mode. Three values can be configured:
- AFE_MEMORY_ALLOC_MORE_INTERNAL: More memory is allocated from internal RAM.
- 1: Part of the memory is allocated from external PSRAM.
- AFE_MEMORY_ALLOC_INTERNAL_PSRAM_BALANCE: Allocation is balanced between internal RAM and external PSRAM.
- 2: Most of the memory is allocated from external PSRAM.
- AFE_MEMORY_ALLOC_MORE_PSRAM: Most of the memory is allocated from external PSRAM.
- agc_mode: Configuration for linear audio amplification.
- agc_mode: Configuration for linear audio amplification. Four values can be configured:
- AFE_MN_PEAK_AGC_MODE_1: Linearly amplify the audio that will be fed to MultiNet; the peak value is -5 dB.
- AFE_MN_PEAK_AGC_MODE_2: Linearly amplify the audio that will be fed to MultiNet; the peak value is -4 dB.
- AFE_MN_PEAK_AGC_MODE_3: Linearly amplify the audio that will be fed to MultiNet; the peak value is -3 dB.
- AFE_MN_PEAK_NO_AGC: No amplification.
- pcm_config: Configure according to the audio fed by `afe->feed()`. This structure has three member variables to configure:
- total_ch_num: Total number of audio channels; total_ch_num = mic_num + ref_num.
- mic_num: The number of microphone channels. It can only be set to 1 or 2.
- ref_num: The number of reference channels. It can only be set to 0 or 1.
### 3. Create afe_data
@ -181,7 +210,7 @@ typedef esp_afe_sr_data_t* (*esp_afe_sr_iface_op_create_from_config_t)(afe_confi
### 4. feed audio data
After initializing AFE and WakeNet, users need to input audio data into AFE via the `afe_handle->feed()` function for processing.
After initializing AFE, users need to input audio data into AFE via the `afe_handle->feed()` function for processing.
The input audio size and layout format can refer to the step **Input Audio data**.
@ -189,11 +218,13 @@ The input audio size and layout format can refer to the step **Input Audio data*
/**
* @brief Feed samples of an audio stream to the AFE_SR
*
* @Warning The input data should be arranged in the format of channel interleaving.
* The last channel is reference signal if it has reference data.
*
* @param afe The AFE_SR data handle
* @param afe The AFE_SR object to query
*
* @param in The input microphone signal, only support signed 16-bit @ 16 KHZ. The frame size can be queried by the
* `get_samp_chunksize`. The channel number can be queried `get_channel_num`.
* `get_feed_chunksize`.
* @return The size of input
*/
typedef int (*esp_afe_sr_iface_op_feed_t)(esp_afe_sr_data_t *afe, const int16_t* in);
@ -202,16 +233,16 @@ typedef int (*esp_afe_sr_iface_op_feed_t)(esp_afe_sr_data_t *afe, const int16_t*
Get the number of audio channels:
The `afe_handle->get_channel_num()` function returns the number of MIC data channels that need to be passed to the `afe_handle->feed()` function (without the reference channel).
The `afe_handle->get_total_channel_num()` function returns the total number of channels that need to be passed to the `afe_handle->feed()` function. Its return value equals `pcm_config.mic_num + pcm_config.ref_num` in AFE_CONFIG_DEFAULT().
```
/**
* @brief Get the channel number of samples that need to be passed to the fetch function
* @brief Get the total number of channels configured
*
* @param afe The AFE_SR object to query
* @return The amount of channel number
* @param afe The AFE_SR object to query
* @return The total number of channels
*/
typedef int (*esp_afe_sr_iface_op_get_channel_num_t)(esp_afe_sr_data_t *afe);
typedef int (*esp_afe_sr_iface_op_get_total_channel_num_t)(esp_afe_sr_data_t *afe);
```
### 5. fetch audio data
@ -235,10 +266,11 @@ typedef int (*esp_afe_sr_iface_op_get_samp_chunksize_t)(esp_afe_sr_data_t *afe);
Please pay attention to the return value of `afe_handle->fetch()`:
- AFE_FETCH_CHANNEL_VERIFIED: audio channel confirmation (this value is not returned for single-microphone wake-up)
- AFE_FETCH_NOISE: noise detected
- AFE_FETCH_SPEECH: speech detected
- AFE_FETCH_WWE_DETECTED: wake word detected
- AFE_FETCH_ERROR: got empty data, please try again.
- AFE_FETCH_CHANNEL_VERIFIED: audio channel confirmation (this value is not returned when using single-mic WakeNet.)
- AFE_FETCH_NOISE: noise detected.
- AFE_FETCH_SPEECH: speech detected.
- AFE_FETCH_WWE_DETECTED: wake word detected.
- ...
```
@ -248,7 +280,7 @@ Please pay attention to the return value of `afe_handle->fetch()`:
* @Warning The output is single channel data, no matter how many channels the input is.
*
* @param afe The AFE_SR object to query
* @param out The output enhanced signal. The frame size can be queried by the `get_samp_chunksize`.
* @param out The output enhanced signal. The frame size can be queried by the `get_fetch_chunksize`.
* @return The state of output, please refer to the definition of `afe_fetch_mode_t`
*/
typedef afe_fetch_mode_t (*esp_afe_sr_iface_op_fetch_t)(esp_afe_sr_data_t *afe, int16_t* out);
@ -256,10 +288,12 @@ typedef afe_fetch_mode_t (*esp_afe_sr_iface_op_fetch_t)(esp_afe_sr_data_t *afe,
### 6. Usage of WakeNet
When users need to perform other operations after wake-up, such as offline or online speech recognition, they can pause the operation of WakeNet to reduce CPU resource consumption.
Users can call `afe_handle->disable_wakenet(afe_data)` to stop WakeNet, or call `afe_handle->enable_wakenet(afe_data)` to enable WakeNet again.
In addition, the ESP32S3 chip supports switching between two wake words. (Note: the ESP32 chip supports only one wake word and does not support switching.) After AFE initialization, the ESP32S3 can switch to the second wake word by calling `afe_handle->set_wakenet(afe_data, SECOND_WAKE_WORD)`. For how to configure two wake words, please refer to [flash_model](../flash_model/README.md)
### 7. Usage of AEC
The usage of AEC is similar to that of WakeNet. Users can disable or enable AEC according to requirements.
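Putting steps 1-7 together, a minimal feed/fetch loop might look like the sketch below. `get_interleaved_audio()` is a placeholder for the user's capture code (e.g. I2S), and error handling is omitted:
```
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

extern void get_interleaved_audio(int16_t *buff);      // placeholder capture, channel-interleaved

void afe_example(void)
{
    esp_afe_sr_iface_t *afe_handle = &ESP_AFE_HANDLE;
    afe_config_t afe_config = AFE_CONFIG_DEFAULT();
    esp_afe_sr_data_t *afe_data = afe_handle->create_from_config(&afe_config);

    int feed_chunk = afe_handle->get_feed_chunksize(afe_data);
    int total_ch   = afe_handle->get_total_channel_num(afe_data);
    int16_t *feed_buff  = malloc(feed_chunk * total_ch * sizeof(int16_t));
    int16_t *fetch_buff = malloc(afe_handle->get_fetch_chunksize(afe_data) * sizeof(int16_t));

    while (true) {
        get_interleaved_audio(feed_buff);
        afe_handle->feed(afe_data, feed_buff);
        afe_fetch_mode_t mode = afe_handle->fetch(afe_data, fetch_buff);
        if (mode == AFE_FETCH_WWE_DETECTED) {
            afe_handle->disable_wakenet(afe_data);     // pause WakeNet during follow-up recognition
            // ... run speech command recognition on fetch_buff, then:
            // afe_handle->enable_wakenet(afe_data);
        }
    }
}
```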


@ -20,33 +20,40 @@
- AFE creation and initialization
- AFE feed: input audio data; AEC runs first inside the feed function
- Internal: AFE BSS/NS algorithm processing
- AFE fetch: returns the processed audio data and a return value; VAD processing runs inside fetch, and if the user sets WakeNet to enabled, wake word detection is also performed
- Internal: in the wake-word recognition / single-mic noise suppression scenario, BSS/NS algorithm processing is performed; in the multi-mic noise suppression scenario, BSS/MISO algorithm processing is performed
- AFE fetch: returns the processed audio data and a return value; in the wake-word recognition scenario, VAD processing and wake word detection run inside fetch, with the exact behavior depending on how the user configures the `afe_config_t` structure; in the multi-mic noise suppression scenario, noise suppression is performed. (Note: `wakenet_init` and `voice_communication_init` cannot both be configured as true)
`afe->feed()` and `afe->fetch()` are visible to the user, while the `Internal BSS/NS Task` is not visible to the user.
`afe->feed()` and `afe->fetch()` are visible to the user, while the `Internal BSS/NS/MISO Task` is not visible to the user.
> AEC runs in the afe->feed() function;
> BSS/NS is processed as an independent task inside AFE;
> The results of VAD and WakeNet are obtained through the afe->fetch() function.
> AEC runs in the afe->feed() function; if aec_init is configured as false, BSS/NS will run in the afe->feed() function instead.
> BSS/NS/MISO is processed as an independent task inside AFE;
> The results of VAD and WakeNet, as well as the processed single-channel audio, are obtained through the afe->fetch() function.
### Select the AFE handle
Currently AFE supports single-mic and dual-mic scenarios; in the single-mic scenario the internal task runs NS processing, and in the dual-mic scenario it runs BSS processing.
Currently AFE supports single-mic and dual-mic scenarios, and the algorithm modules can be configured flexibly. In the single-mic scenario the internal task runs NS processing, and in the dual-mic scenario it runs BSS processing; if the dual-mic scenario is configured for voice communication noise suppression (i.e. `wakenet_init=false, voice_communication_init=true`), an additional MISO internal task is added.
- Single mic
- Get the AFE handle
esp_afe_sr_iface_t *afe_handle = &esp_afe_sr_1mic;
- Dual mic
esp_afe_sr_iface_t *afe_handle = &esp_afe_sr_2mic;
esp_afe_sr_iface_t *afe_handle = &ESP_AFE_HANDLE;
### Input audio
Currently AFE supports single-mic and dual-mic scenarios; the number of audio channels can be configured according to the audio passed to `afe->feed()`. To modify it, change the `pcm_config` structure members in the `AFE_CONFIG_DEFAULT()` macro. The following combinations are supported (note: `total_ch_num = mic_num + ref_num` must always hold):
> total_ch_num=1, mic_num=1, ref_num=0
> total_ch_num=2, mic_num=1, ref_num=1
> total_ch_num=2, mic_num=2, ref_num=0
> total_ch_num=3, mic_num=2, ref_num=1
(Note: total_ch_num: total number of channels; mic_num: number of microphone channels; ref_num: number of reference loop channels)
For AEC, only a single reference loop is currently supported, so the value of ref_num can only be 0 or 1
- AFE single-mic scenario
- Input audio format: 16KHz, 16bit, two channels (one channel of mic data, the other the reference loop)
- The frame length is 32ms; users can use `afe->get_feed_chunksize` to get the number of sample points required (the sample data type is int16)
- Input audio format: 16KHz, 16bit, two channels (one channel of mic data, the other the reference loop); if AEC is not needed and the audio contains no reference loop, the input may contain only one channel of mic data, with ref_num set to 0.
- The input frame length varies with the algorithm modules configured by the user; users can use `afe->get_feed_chunksize` to get the number of sample points required (the sample data type is int16)
The input data is arranged as follows:
@ -54,8 +61,8 @@
- AFE dual-mic scenario
- Input audio format: 16KHz, 16bit, three channels
- The frame length is 32ms; users can use `afe->get_feed_chunksize` to get the amount of data to fill
- Input audio format: 16KHz, 16bit, three channels; if AEC is not needed and the audio contains no reference loop, the input may contain only two channels of mic data, with ref_num set to 0.
- The input frame length varies with the algorithm modules configured by the user; users can use `afe->get_feed_chunksize` to get the amount of data to fill
The input data is arranged as follows:
@ -75,6 +82,10 @@ The NS (Noise Suppression) algorithm supports single-channel processing and can suppress non-human noise in single-channel audio
The BSS (Blind Source Separation) algorithm supports dual-channel processing; it separates the target sound source well from the remaining interference, extracting the useful audio signal and guaranteeing the quality of the subsequent speech.
### MISO Introduction
The MISO (Multi Input Single Output) algorithm takes dual-channel input and produces single-channel output. In the dual-mic scenario with no wake word enabled, it selects the audio channel with the higher signal-to-noise ratio for output.
### VAD Introduction
The VAD (Voice Activity Detection) algorithm reports the voice activity state of the current frame in real time.
@ -93,15 +104,9 @@ The output audio of AFE is single-channel data; with WakeNet enabled, AFE
### 1. Define afe_handle
`afe_handle` is the function handle through which the user later calls the AFE interfaces. The user needs to select the `afe_handle` corresponding to the single-mic or dual-mic scenario.
`afe_handle` is the function handle through which the user later calls the AFE interfaces. So the first step is to obtain `afe_handle`.
Single-mic scenario:
esp_afe_sr_iface_t *afe_handle = &esp_afe_sr_1mic;
Dual-mic scenario:
esp_afe_sr_iface_t *afe_handle = &esp_afe_sr_2mic;
esp_afe_sr_iface_t *afe_handle = &ESP_AFE_HANDLE;
### 2. Configure AFE
@ -117,16 +122,20 @@ The output audio of AFE is single-channel data; with WakeNet enabled, AFE
.se_init = true, \
.vad_init = true, \
.wakenet_init = true, \
.vad_mode = 3, \
.wakenet_model = &WAKENET_MODEL, \
.wakenet_coeff = &WAKENET_COEFF, \
.voice_communication_init = false, \
.vad_mode = VAD_MODE_3, \
.wakenet_model = (esp_wn_iface_t *)&WAKENET_MODEL, \
.wakenet_coeff = (void *)&WAKENET_COEFF, \
.wakenet_mode = DET_MODE_2CH_90, \
.afe_mode = SR_MODE_HIGH_PERF, \
.afe_mode = SR_MODE_LOW_COST, \
.afe_perferred_core = 0, \
.afe_perferred_priority = 5, \
.afe_ringbuf_size = 50, \
.alloc_from_psram = 1, \
.agc_mode = 2, \
.memory_alloc_mode = AFE_MEMORY_ALLOC_MORE_PSRAM, \
.agc_mode = AFE_MN_PEAK_AGC_MODE_2, \
.pcm_config.total_ch_num = 3, \
.pcm_config.mic_num = 2, \
.pcm_config.ref_num = 1, \
}
```
@ -138,9 +147,11 @@ The output audio of AFE is single-channel data; with WakeNet enabled, AFE
- wakenet_init: whether wake word detection is enabled.
- voice_communication_init: whether voice communication noise suppression is enabled. It cannot be enabled together with wakenet_init.
- vad_mode: the VAD operating mode; the higher the value, the more aggressive the VAD.
- wakenet_model/wakenet_coeff/wakenet_mode: use `make menuconfig` to select the corresponding wake model; for details see: [WakeNet](https://github.com/espressif/esp-sr/tree/b9504e35485b60524977a8df9ff448ca89cd9d62/wake_word_engine)
- wakenet_model/wakenet_coeff/wakenet_mode: use `idf.py menuconfig` to select the corresponding wake model; for details see: [WakeNet](../wake_word_engine/README_cn.md)
- afe_mode: Espressif AFE currently supports two working modes, SR_MODE_LOW_COST and SR_MODE_HIGH_PERF; see the afe_sr_mode_t enumeration for details.
@ -151,19 +162,37 @@ The output audio of AFE is single-channel data; with WakeNet enabled, AFE
**The ESP32 chip only supports SR_MODE_HIGH_PERF;
the ESP32S3 chip supports both modes**
- afe_perferred_core: which CPU core the internal AFE BSS/NS algorithm runs on.
- afe_perferred_core: which CPU core the internal AFE BSS/NS/MISO algorithm runs on.
- afe_perferred_priority: the task priority of the internal BSS/NS/MISO algorithm task.
- afe_ringbuf_size: configuration of the internal ringbuf size.
- alloc_from_psram: whether to preferentially allocate memory from external PSRAM. Three values can be configured:
- memory_alloc_mode: the memory allocation mode. Three values can be configured:
- 0: allocate from internal RAM.
- AFE_MEMORY_ALLOC_MORE_INTERNAL: allocate more from internal RAM.
- 1: allocate part of the memory from external PSRAM.
- AFE_MEMORY_ALLOC_INTERNAL_PSRAM_BALANCE: balance allocation between internal RAM and external PSRAM.
- 2: allocate most of the memory from external PSRAM.
- AFE_MEMORY_ALLOC_MORE_PSRAM: allocate most of the memory from external PSRAM.
- agc_mode: the level configuration for linearly amplifying the audio ([0,3]; 0 means no amplification).
- agc_mode: the level configuration for linearly amplifying the audio. Four values can be configured:
- AFE_MN_PEAK_AGC_MODE_1: linearly amplify the audio fed to the downstream MultiNet, with the peak at -5dB.
- AFE_MN_PEAK_AGC_MODE_2: linearly amplify the audio fed to the downstream MultiNet, with the peak at -4dB.
- AFE_MN_PEAK_AGC_MODE_3: linearly amplify the audio fed to the downstream MultiNet, with the peak at -3dB.
- AFE_MN_PEAK_NO_AGC: no linear amplification.
- pcm_config: configured according to the structure of the audio fed by `afe->feed()`; this structure has three member variables to configure:
- total_ch_num: total number of audio channels; total_ch_num = mic_num + ref_num.
- mic_num: number of microphone channels; currently only 1 or 2 is supported.
- ref_num: number of reference loop channels; currently only 0 or 1 is supported.
### 3. Create afe_data
@ -182,7 +211,7 @@ typedef esp_afe_sr_data_t* (*esp_afe_sr_iface_op_create_from_config_t)(afe_confi
### 4. Feed audio data
After AFE and WakeNet initialization is complete, the user needs to feed audio data into AFE for processing using the `afe_handle->feed()` function.
After AFE initialization is complete, the user needs to feed audio data into AFE for processing using the `afe_handle->feed()` function.
The size and layout of the input audio can be found in the **Input audio** step.
@ -190,11 +219,13 @@ typedef esp_afe_sr_data_t* (*esp_afe_sr_iface_op_create_from_config_t)(afe_confi
/**
* @brief Feed samples of an audio stream to the AFE_SR
*
* @Warning The input data should be arranged in the format of channel interleaving.
* The last channel is reference signal if it has reference data.
*
* @param afe The AFE_SR data handle
* @param afe The AFE_SR object to query
*
* @param in The input microphone signal, only support signed 16-bit @ 16 KHZ. The frame size can be queried by the
* `get_samp_chunksize`. The channel number can be queried `get_channel_num`.
* `get_feed_chunksize`.
* @return The size of input
*/
typedef int (*esp_afe_sr_iface_op_feed_t)(esp_afe_sr_data_t *afe, const int16_t* in);
@ -203,16 +234,16 @@ typedef int (*esp_afe_sr_iface_op_feed_t)(esp_afe_sr_data_t *afe, const int16_t*
Get the number of audio channels:
The `afe_handle->get_channel_num()` function returns the number of mic data channels that need to be passed to the `afe_handle->feed()` function (excluding the reference loop channel).
The `afe_handle->get_total_channel_num()` function returns the total number of data channels that need to be passed to the `afe_handle->feed()` function. Its return value equals the `pcm_config.mic_num + pcm_config.ref_num` configured in AFE_CONFIG_DEFAULT().
```
/**
* @brief Get the channel number of samples that need to be passed to the fetch function
* @brief Get the total number of channels configured
*
* @param afe The AFE_SR object to query
* @return The amount of channel number
* @param afe The AFE_SR object to query
* @return The total number of channels
*/
typedef int (*esp_afe_sr_iface_op_get_channel_num_t)(esp_afe_sr_data_t *afe);
typedef int (*esp_afe_sr_iface_op_get_total_channel_num_t)(esp_afe_sr_data_t *afe);
```
### 5. Fetch audio data
@ -236,6 +267,7 @@ typedef int (*esp_afe_sr_iface_op_get_samp_chunksize_t)(esp_afe_sr_data_t *afe);
The user should pay attention to the return value of `afe_handle->fetch()`:
- AFE_FETCH_ERROR: got empty data, please try fetching again
- AFE_FETCH_CHANNEL_VERIFIED: audio channel confirmed (not returned for single-mic wake-up)
- AFE_FETCH_NOISE: noise detected
- AFE_FETCH_SPEECH: speech detected
@ -249,7 +281,7 @@ typedef int (*esp_afe_sr_iface_op_get_samp_chunksize_t)(esp_afe_sr_data_t *afe);
* @Warning The output is single channel data, no matter how many channels the input is.
*
* @param afe The AFE_SR object to query
* @param out The output enhanced signal. The frame size can be queried by the `get_samp_chunksize`.
* @param out The output enhanced signal. The frame size can be queried by the `get_fetch_chunksize`.
* @return The state of output, please refer to the definition of `afe_fetch_mode_t`
*/
typedef afe_fetch_mode_t (*esp_afe_sr_iface_op_fetch_t)(esp_afe_sr_data_t *afe, int16_t* out);
@ -261,6 +293,8 @@ typedef afe_fetch_mode_t (*esp_afe_sr_iface_op_fetch_t)(esp_afe_sr_data_t *afe,
Users can call `afe_handle->disable_wakenet(afe_data)` to stop WakeNet, and call `afe_handle->enable_wakenet(afe_data)` to enable WakeNet again after the follow-up application finishes.
In addition, the ESP32S3 chip supports switching between two wake words. (Note: the ESP32 chip supports only one wake word and does not support switching.) After AFE initialization is complete, the ESP32S3 chip can switch to the second wake word via `afe_handle->set_wakenet(afe_data, SECOND_WAKE_WORD)`. For how to configure two wake words, see: [flash_model](../flash_model/README_CN.md)
### 7. Using AEC
The usage of AEC is similar to that of WakeNet; users can stop or start AEC according to their own needs.


@ -15,6 +15,8 @@ ESP32S3:
So on ESP32S3 you can:
- Greatly reduce the size of the user application APP BIN
- Select up to two wake words
- Switch between Chinese and English speech command recognition online
- Perform OTA conveniently
- Read and replace models from an SD card, which is more convenient and can reduce the module flash size used by the project
- Avoid re-flashing the model data on every build when changes do not involve the model, greatly reducing flashing time and improving development efficiency
@ -25,65 +27,86 @@ Run `idf.py menuconfig` navigate to `ESP Speech Recognition`:
![overview](../img/model-1.png)
### 1.1 Net to use acceleration
This option configures the acceleration mode of the model; users do not need to modify it, please keep the default configuration.
### 1.2 model data path
### 1.1 model data path
This option is only available on ESP32S3. It indicates the storage location of the model data. It supports the choice of `spiffs partition` or `SD Card`.
- `spiffs partition` means that the model data is stored in the Flash spiffs partition, and the model data will be loaded from the Flash spiffs partition
- `SD Card` means that the model data is stored in the SD card, and the model data will be loaded from the SD Card
### 1.2 use afe
This option needs to be turned on. Users do not need to modify it. Please keep the default configuration.
### 1.3 use wakenet
This option is turned on by default. When the user only uses `AEC` or `BSS`, etc., and does not need to run `WakeNet` or `MultiNet`, please turn off this option, which will reduce the size of the project firmware in some cases.
- Wake word engine
Wake word model engine selection.
- First Wake word
Select the first wake word. Please select the corresponding wake word in the options as needed.
ESP32 supports:
- WakeNet 5 (quantized with 16-bit)
ESP32S3 supports:
- WakeNet 7 (quantized with 16-bit)
- WakeNet 7 (quantized with 8-bit)
- WakeNet 8 (quantized with 16-bit)
- Wake word name
Wake word selection; the wake words supported by each wake engine differ.
- Second Wake word
For the second wake word, please select the corresponding wake word in the options as needed.
**Note: this option only supports ESP32S3; ESP32S3 supports selecting up to two wake words and switching between them in code.**
For more details, please refer to [WakeNet](../wake_word_engine/README.md).
### 1.4 use multinet
This option is turned on by default. When users only use WakeNet or other algorithm modules, please turn off this option, which will reduce the size of the project firmware in some cases.
- language
ESP32 only supports Chinese Speech Commands Recognition.
Speech commands recognition language selection: ESP32 only supports Chinese; ESP32S3 supports Chinese or English.
- speech commands recognition model
ESP32S3 supports Chinese and English Speech Commands Recognition, and supports switching between the Chinese and English recognition models.
- Chinese Speech Commands Model
Chinese Speech Commands Recognition model selection.
ESP32 supports:
- chinese single recognition (MultiNet2)
ESP32S3 supports:
- chinese single recognition (MultiNet3)
- chinese continuous recognition (MultiNet3)
- chinese single recognition (MultiNet4)
- Add speech commands
Users add speech commands according to their needs.
ESP32 supports:
- None
- chinese single recognition (MultiNet2)
ESP32S3 supports:
- None
- chinese single recognition (MultiNet4.5)
- chinese single recognition (MultiNet4.5 quantized with 8-bit)
- English Speech Commands Model
English Speech Commands Recognition model selection.
This option does not support ESP32.
ESP32S3 supports:
- None
- english recognition (MultiNet5 quantized with 8-bit, depends on WakeNet8)
- Add Chinese speech commands
The user needs to add Chinese Speech Command words to this item when `Chinese Speech Commands Model` is not `None`.
- Add English speech commands
The user needs to add English Speech Command words to this item when `English Speech Commands Model` is not `None`.
For more details, please refer to [MultiNet](../speech_command_recognition/README.md) .


@ -15,6 +15,8 @@ ESP32S3
So on ESP32S3 you can:
- Greatly reduce the size of the user application APP BIN
- Select up to two wake words
- Switch between Chinese and English command word recognition online
- Perform OTA conveniently
- Read and replace models from an SD card, which is more convenient and can reduce the module flash size used by the project
- Avoid re-flashing the model data on every build when changes do not involve the model, greatly reducing flashing time and improving development efficiency
@ -25,63 +27,85 @@ ESP32S3
![overview](../img/model-1.png)
### 1.1 Net to use acceleration
This option configures the acceleration method of the model; users do not need to modify it, please keep the default configuration.
### 1.2 model data path
### 1.1 model data path
This option is only available on ESP32S3. It indicates where the model data is stored, supporting `spiffs partition` or `SD Card`.
- `spiffs partition` means the model data is stored in the flash spiffs partition and will be loaded from it
- `SD Card` means the model data is stored on the SD card and will be loaded from it
### 1.2 use afe
This option needs to be enabled; users do not need to modify it, please keep the default configuration.
### 1.3 use wakenet
This option is on by default. When you only use AEC or BSS etc., and do not need to run WakeNet or MultiNet, please turn this option off, which will reduce the size of the project firmware in some cases.
- Wake word engine
Wake word model engine selection.
- First Wake word
Selection of the first wake word; please select the wake word you need from the options.
ESP32 supports:
- WakeNet 5 (quantized with 16-bit)
- WakeNet 5 (quantized with 16-bit) series
ESP32S3 supports:
- WakeNet 7 (quantized with 16-bit)
- WakeNet 7 (quantized with 8-bit)
- WakeNet 8 (quantized with 16-bit)
- WakeNet 7 (quantized with 16-bit) series
- WakeNet 7 (quantized with 8-bit) series
- WakeNet 8 (quantized with 16-bit) series
- Wake word name
- Second Wake word
Wake word selection; the wake words supported by each wake engine differ, and users can choose freely.
For the second wake word, please select the wake word you need from the options.
**Note: this option only supports ESP32S3; that is, ESP32S3 supports selecting up to two wake words and switching between them in code.**
For more details, please refer to [WakeNet](../wake_word_engine/README.md).
### 1.4 use multinet
This option is on by default. When you only use WakeNet or other algorithm modules, please turn this option off, which will reduce the size of the project firmware in some cases.
- language
ESP32 only supports Chinese command word recognition. ESP32S3 supports Chinese and English command word recognition, with switching between the Chinese and English recognition models.
Command word recognition language selection: ESP32 only supports Chinese; ESP32S3 supports Chinese or English.
- Chinese Speech Commands Model
Chinese command word recognition model selection.
- speech commands recognition model
Command word recognition model selection.
ESP32 supports:
- None
- chinese single recognition (MultiNet2)
ESP32S3 supports:
- chinese single recognition (MultiNet3)
- chinese continuous recognition (MultiNet3)
- chinese single recognition (MultiNet4)
- None
- chinese single recognition (MultiNet4.5)
- chinese single recognition (MultiNet4.5 quantized with 8-bit)
- Add speech commands
- English Speech Commands Model
English command word recognition model selection.
This option does not support ESP32.
ESP32S3 supports:
- None
- english recognition (MultiNet5 quantized with 8-bit, depends on WakeNet8)
- Add Chinese speech commands
When a non-`None` option is selected in `Chinese Speech Commands Model`, Chinese command words need to be added here.
- Add English speech commands
When a non-`None` option is selected in `English Speech Commands Model`, English command words need to be added here.
Users add custom command words according to their needs; for details, please refer to [MultiNet](../speech_command_recognition/README.md).
@ -149,6 +173,6 @@ ESP32S3
- Initialize the SD card
Users need to initialize the SD card so that the system can mount it. If you use esp-skainet, you can directly call `sd_card_mount("/sdcard")` to initialize the SD card on its supported boards; otherwise, you need to write the initialization yourself.
Users need to initialize the SD card so that the system can mount it. If you use esp-skainet, you can directly call `esp_sdcard_init("/sdcard", num);` to initialize the SD card on its supported boards; otherwise, you need to write the initialization yourself.
After completing the above operations, you can flash the project.

BIN
docs/img/AFE_overview.png Normal file → Executable file (binary image changed, 47 KiB → 51 KiB)

BIN
docs/img/AFE_workflow.png Normal file → Executable file (binary image changed, 28 KiB → 33 KiB)

BIN
(four additional binary image files changed: 5.5 KiB → 5.9 KiB, 13 KiB → 17 KiB, new file of 57 KiB, 32 KiB → 43 KiB)

docs/performance_test/README.md Executable file

@ -0,0 +1,74 @@
# Performance Test
## 1. AFE
### 1.1 Resource Occupancy (ESP32)
|Algorithm Type|RAM|Average CPU loading (computed with 2 cores)|Frame Length|
|:---:|:---:|:---:|:---:|
|AEC (HIGH_PERF)|114 KB|11%|32 ms|
|NS|27 KB|5%|10 ms|
|AFE Layer|73 KB| | |
### 1.2 Resource Occupancy (ESP32S3)
|Algorithm Type|RAM|Average CPU loading (computed with 2 cores)|Frame Length|
|:---:|:---:|:---:|:---:|
|AEC (LOW_COST)|152.3 KB|8%|32 ms|
|AEC (HIGH_PERF)|166 KB|11%|32 ms|
|BSS (LOW_COST)|198.7 KB|6%|64 ms|
|BSS (HIGH_PERF)|215.5 KB|7%|64 ms|
|NS|27 KB|5%|10 ms|
|MISO|56 KB|8%|16 ms|
|AFE Layer|227 KB| | |
## 2. WakeNet
### 2.1 Resource Occupancy (ESP32)
|Model Type|Parameter Num|RAM|Average Running Time per Frame| Frame Length|
|:---:|:---:|:---:|:---:|:---:|
|Quantised WakeNet5|41 K|15 KB|5.5 ms|30 ms|
|Quantised WakeNet5X2|165 K|20 KB|10.5 ms|30 ms|
|Quantised WakeNet5X3|371 K|24 KB|18 ms|30 ms|
### 2.2 Resource Occupancy (ESP32S3)
|Model Type|Parameter Num|RAM|Average Running Time per Frame| Frame Length|
|:---:|:---:|:---:|:---:|:---:|
|Quantised WakeNet7_2CH|810 K|45 KB|10 ms|32 ms|
|Quantised WakeNet8_2CH|821 K|50 KB|10 ms|32 ms|
### 2.3 Performance
|Distance| Quiet | Stationary Noise (SNR = 4 dB)| Speech Noise (SNR = 4 dB)| AEC Interruption (-10 dB)|
|:---:|:---:|:---:|:---:|:---:|
|1 m|98%|96%|94%|96%|
|3 m|98%|94%|92%|94%|
False triggering rate: 1 time in 12 hours
**Note**: We use the ESP32-S3-Korvo V4.0 development board and the WakeNet8(Alexa) model in our test.
## 3. MultiNet
### 3.1 Resource Occupancy (ESP32)
|Model Type|Internal RAM|PSRAM|Average Running Time per Frame| Frame Length|
|:---:|:---:|:---:|:---:|:---:|
|MultiNet 2|13.3 KB|9KB|38 ms|30 ms|
### 3.2 Resource Occupancy (ESP32S3)
|Model Type|Internal RAM|PSRAM|Average Running Time per Frame| Frame Length|
|:---:|:---:|:---:|:---:|:---:|
|MultiNet 4.5|16.8KB|1866 KB|18 ms|32 ms|
|MultiNet 4.5 Q8|10.5 KB|1009 KB|11 ms|32 ms|
|MultiNet 5 Q8|||||
### 3.3 Performance with AFE
|Model Type|Distance| Quiet | Stationary Noise (SNR = 4 dB)| Speech Noise (SNR = 4 dB)|
|:---:|:---:|:---:|:---:|:---:|
|MultiNet 4.5|3 m|98%|93%|92%|
|MultiNet 4.5 Q8|3 m|94%|92%|91%|


@ -1,184 +1,297 @@
# MultiNet Introduction
MultiNet is a lightweight model specially designed based on [CRNN](https://arxiv.org/pdf/1703.05390.pdf) and [CTC](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.75.6306&rep=rep1&type=pdf) for the implementation of multi-command recognition. Now, up to 200 speech commands, including customized commands, are supported.
MultiNet is a lightweight model designed to realize speech commands recognition offline on ESP32 series. Now, up to 200 speech commands, including customized commands, are supported.
## Overview
> Support Chinese and English speech commands recognition (ESP32S3 is required for English speech commands recognition)
> Support user-defined commands
> Support adding / deleting / modifying commands during operation
> Up to 200 commands are supported
> It supports single recognition and continuous recognition
> Lightweight and low resource consumption
> Low delay, within 500ms
> Support online switching between Chinese and English models (ESP32S3 only)
> The model is partitioned separately to support users to apply OTA
MultiNet uses the **MFCC features** of an audio clip as input, and the **phonemes** (Chinese or English) as output. By comparing the output phonemes, the relevant Chinese or English command is identified.
## 1. Overview
The MultiNet input is the audio processed by the audio front-end algorithm (AFE), in 16KHz, 16bit, mono format. By recognizing the audio, MultiNet maps it to the corresponding Chinese characters or English words.
The following table shows the model support of Espressif SoCs:
![multinet_model](../img/MultiNet_model.png)
## Commands Recognition Process
Note: models ending with Q8 are the 8-bit versions of the model, which are more lightweight.
1. Add customized commands to the speech command queue.
2. Prepare an audio clip of 30 ms (16 KHz, 16 bit, mono).
3. Input this audio to the MFCC model and get its **MFCC features**.
4. Input the obtained **MFCC features** to MultiNet and get the output **phoneme**.
5. Input the obtained **phoneme** to the Language model and get the output.
6. Compare the output against the existing speech commands one by one, and output the Command ID of the matching command (if any).
## 2. Commands Recognition Process
Please see the flow diagram below:
![speech_command-recognition-system](../img/multinet_workflow.png)
## User Guide
## 3. User Guide
### Basic Configuration
### 3.1 Requirements of speech commands
Define the following two variables before using the command recognition model:
- The recommended length of a Chinese command is generally 4-6 characters; too short leads to a high false recognition rate, and too long is hard for users to remember
- The recommended length of English is generally 4-6 words
- Mixed Chinese and English is not supported in command words
- Currently, up to 200 command words are supported
- The command word cannot contain Arabic numerals and special characters
- Avoid common command words like "hello"
- The greater the pronunciation difference of each Chinese character / word in the command word, the better the performance
1. Model version
The model version has been configured in `menuconfig` to facilitate your development. Please configure in `menuconfig` and add the following line in your code:
`static const esp_mn_iface_t *multinet = &MULTINET_MODEL;`
2. Model parameter
The language supported and the effectiveness of the model is determined by model parameters. Now only commands in Chinese are supported. Please configure the `MULTINET_COEFF` option in `menuconfig` and add the following line in your code to generate the model handle. The 6000 is the audio length for speech recognition, in ms, the range of sample_length is 0~6000.
`model_iface_data_t *model_data = multinet->create(&MULTINET_COEFF, 6000);`
### 3.2 Speech commands customization method
### Modify Speech Commands
> Support a variety of speech commands customization methods
> Support dynamic addition / deletion / modification of speech commands
For Chinese MultiNet, we use Pinyin without tones as units.
For English MultiNet, we use the international phonetic alphabet as units. [multinet_g2p.py](../../tool/multinet_g2p.py) is used to convert English phrases into phonemes that can be recognized by MultiNet.
Now, MultiNet supports two methods to modify speech commands.
#### 3.2.1 Format of Speech commands
#### 1.menuconfig (before compilation)
Speech command strings need to meet a specific format, as follows:
Users can define their own speech commands by `idf.py menuconfig -> ESP Speech Recognition -> add speech commands`
- Chinese
**Chinese predefined commands:**
![add_speech_commands_ch](../img/add_speech_ch.png)
**English predefined commands:**
![add_speech_commands_en](../img/add_speech_en.png)
#### 2.reset API (on the fly)
Users can also modify speech commands in the code.
```
// Chinese
char err_id[200];
char *ch_commands_str = "da kai dian deng,kai dian deng;guan bi dian deng,guan dian deng;guan deng;";
multinet->reset(model_data, ch_commands_str, err_id);
// English
char *en_commands_en = "TfL Mm c qbK;Sgl c Sel;TkN nN jc LiT;TkN eF jc LiT";
multinet->reset(model_data, en_commands_en, err_id);
```
**Note:**
- One speech command ID can correspond to multiple speech command phrases;
- Up to 200 speech command IDs or phrases, including customized commands, are supported;
- Different Command IDs need to be separated by ';'. The corresponding multiple phrases for one Command ID need to be separated by ','.
- `err_id` returns the spellings that do not meet the requirements.
### API Reference
#### Header
- esp_mn_iface.h
- esp_mn_models.h
#### Function
- `typedef model_iface_data_t* (*esp_mn_iface_op_create_t)(const model_coeff_getter_t *coeff, int sample_length);`
**Definition**
Easy function type to initialize a model instance with a coefficient.
**Parameter**
* coeff: The coefficient for speech commands recognition.
* sample_length: Audio length for speech recognition, in ms. The range of sample_length is 0~6000.
**Return**
Handle to the model data.
- `typedef int (*esp_mn_iface_op_get_samp_chunksize_t)(model_iface_data_t *model);`
**Definition**
Callback function type to fetch the amount of samples that need to be passed to the detection function. Every speech recognition model processes a certain number of samples at the same time. This function can be used to query the amount. Note that the returned amount is in 16-bit samples, not in bytes.
**Parameter**
model: The model object to query.
**Return**
The amount of samples to feed the detection function.
Chinese speech commands need to use Chinese Pinyin, with a space between the Pinyin of each character. For example, "打开空调" should be written as "da kai kong tiao", and "打开绿色灯" as "da kai lv se deng".
In addition, we also provide tools for users to convert Chinese characters into Pinyin.
- English
English speech commands need to be represented by specific phonetic symbols, with the symbols of each word separated by spaces; for example, "turn on the light" needs to be written as "TkN nN jc LiT".
**We provide specific conversion rules and tools. For details, please refer to the English G2P [tool](../tool/multinet_g2p.py).**
- `typedef int (*esp_mn_iface_op_get_samp_chunknum_t)(model_iface_data_t *model);`
**Definition**
Callback function type to fetch the number of frames recognized by the speech command.
**Parameter**
model: The model object to query.
**Return**
The number of frames recognized by the speech command.
- `typedef int (*esp_mn_iface_op_get_samp_rate_t)(model_iface_data_t *model);`
**Definition**
Get the sample rate of the samples to feed to the detection function.
**Parameter**
model: The model object to query.
**Return**
The sample rate, in Hz.
- `typedef float* (*esp_mn_iface_op_detect_t)(model_iface_data_t *model, int16_t *samples);`
**Return**
The command ID, if a matching command is found; -1, if no matching command is found.
- `typedef void (*esp_mn_iface_op_reset_t)(model_iface_data_t *model, char *command_str, char *err_phrase_id);`
**Definition**
Reset the speech commands.
**Parameters**
model: Model object.
command_str: The speech commands string; ';' separates commands with different command IDs, and ',' separates different phrases for the same command ID.
err_phrase_id: Returns the incorrect spellings.
- `typedef void (*esp_mn_iface_op_destroy_t)(model_iface_data_t *model);`
**Definition**
Destroy a voiceprint recognition model.
**Parameters**
model: Model object to destroy.
#### 3.2.2 Set speech commands offline
MultiNet supports flexible speech command setting methods: no matter how users set the speech commands (code / network / file), they only need to call the corresponding API.
Here we provide two methods of adding speech commands:
- Use `menuconfig`
Users can refer to the example in ESP-Skainet: users can define their own speech commands via `idf.py menuconfig -> ESP Speech Recognition -> Add Chinese speech commands/Add English speech commands`.
![menuconfig_add_speech_commands](../img/menuconfig_add_speech_commands.png)
Please note that a single `Command ID` can correspond to multiple phrases. For example, "da kai kong tiao" and "kai kong tiao" have the same meaning; you can write them in the entry for the same command ID, separating adjacent entries with the character "," and with no spaces before or after the ",".
Then call the following API:
```
/**
* @brief Update the speech commands of MultiNet by menuconfig
*
* @param multinet The multinet handle
*
* @param model_data The model object to query
*
* @return
* - ESP_OK Success
* - ESP_ERR_INVALID_STATE Fail
*/
esp_err_t esp_mn_commands_update_from_sdkconfig(esp_mn_iface_t *multinet, const model_iface_data_t *model_data);
```
- Add speech commands in the code
Users can refer to the example in ESP-Skainet for this method of adding speech commands.
With this method, users set the speech command words directly in the code and pass them to MultiNet. In actual development and in products, users can deliver and change the required speech commands in any feasible way, such as over network / UART / SPI.
#### 3.2.3 Set speech commands online
MultiNet supports online dynamic addition / deletion / modification of speech commands during operation, without changing models or adjusting parameters. For details, please refer to the example in ESP-Skainet.
Users only need to call the following APIs:
```
/**
* @brief Initialize the Speech Commands link of MultiNet
*
* @return
* - ESP_OK Success
* - ESP_ERR_NO_MEM No memory
* - ESP_ERR_INVALID_STATE The Speech Commands link has been initialized
*/
esp_err_t esp_mn_commands_init(void);
/**
* @brief Add one speech commands with phoneme and command ID
*
* @param command_id The command ID
*
* @param phoneme_string The phoneme string of the speech commands
*
* @return
* - ESP_OK Success
* - ESP_ERR_INVALID_STATE Fail
*/
esp_err_t esp_mn_commands_add(int command_id, char *phoneme_string);
/**
* @brief Modify one speech commands with new phoneme
*
* @param old_phoneme_string The old phoneme string of the speech commands
*
* @param new_phoneme_string The new phoneme string of the speech commands
*
* @return
* - ESP_OK Success
* - ESP_ERR_INVALID_STATE Fail
*/
esp_err_t esp_mn_commands_modify(char *old_phoneme_string, char *new_phoneme_string);
/**
* @brief Remove one speech commands by phoneme
*
* @param phoneme_string The phoneme string of the speech commands
*
* @return
* - ESP_OK Success
* - ESP_ERR_INVALID_STATE Fail
*/
esp_err_t esp_mn_commands_remove(char *phoneme_string);
/**
* @brief Update the speech commands of MultiNet, must be used after [add/remove/modify] the speech commands
*
* @param multinet The multinet handle
*
* @param model_data The model object to query
*
* @return
* - ESP_OK Success
* - ESP_ERR_INVALID_STATE Fail
*/
esp_err_t esp_mn_commands_update(const esp_mn_iface_t *multinet, const model_iface_data_t *model_data);
```
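A short sketch combining the APIs above at runtime; the command IDs and phoneme strings are illustrative, and `multinet`/`model_data` are the handles created in section 4.1:
```
esp_mn_commands_init();                                    // initialize the speech commands link
esp_mn_commands_add(0, "da kai dian deng");                // ID 0: "turn on the light"
esp_mn_commands_add(1, "guan bi dian deng");               // ID 1: "turn off the light"
esp_mn_commands_modify("guan bi dian deng", "guan deng");  // replace a phrase
esp_mn_commands_remove("da kai dian deng");                // delete a phrase
esp_mn_commands_update(multinet, model_data);              // apply the changes to the running model
```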
## 4. Run speech commands recognition
Speech command recognition needs to run together with the audio front end (AFE) in esp-sr (WakeNet must be enabled in AFE). For the use of AFE, please refer to the document:
[AFE Introduction and Use](../audio_front_end/README_CN.md)
### 4.1 MultiNet Initialization
Before using MultiNet, you need to define the following variables:
- Model version
Users need to declare the following model version in the code; it can be used directly in the following way without modification.
```
const esp_mn_iface_t *multinet = &MULTINET_MODEL;
```
- Model handle
The user needs to use the `create` interface to generate the model handle `model_data` for subsequent operations.
```
model_iface_data_t *model_data = multinet->create(&MULTINET_COEFF, time_out_time_ms);
```
- MULTINET_COEFF: the model parameters; can be passed in directly without modification
- time_out_time_ms: the timeout, in ms, after which MultiNet exits when no speech command is detected; it can be customized, and the recommended range is [5000, 10000]
- Set speech commands
Please refer to Section 3.
### 4.2 Run MultiNet
When users use AFE with WakeNet enabled, they can then use MultiNet, with the following requirements:
> The frame length of MultiNet is equal to the AFE fetch frame length
> The audio format supported is 16KHz, 16bit, mono. The data obtained by AFE fetch is also in this format
- Get the frame length that needs to be passed into MultiNet
```
int mu_chunksize = multinet->get_samp_chunksize(model_data);
```
- MultiNet detect
We send the data obtained from AFE fetch to the following API:
```
esp_mn_state_t mn_state = multinet->detect(model_data, buff);
```
The length of `buff` is `mu_chunksize * sizeof(int16_t)`.
### 4.3 The detect result of MultiNet
Speech commands recognition supports two basic modes:
> Single recognition
> Continuous recognition
Speech command recognition must be used together with WakeNet: after wake-up, MultiNet detection can run.
While running, MultiNet returns the recognition state of the current frame, `mn_state`, in real time; it is currently one of the following states:
- ESP_MN_STATE_DETECTING
This state indicates that MultiNet is detecting but the target speech command has not yet been recognized.
- ESP_MN_STATE_DETECTED
This state indicates that the target speech command has been recognized. The user can now call the `get_results` interface to obtain the recognition results.
```
esp_mn_results_t *mn_result = multinet->get_results(model_data);
```
The recognition result is stored in the return value of the `get_results` API, with the following data type:
```
typedef struct{
    esp_mn_state_t state;
    int num;                              // The number of phrases in the list, num <= 5. When num = 0, no phrase is recognized.
    int phrase_id[ESP_MN_RESULT_MAX_NUM]; // The list of phrase IDs.
    float prob[ESP_MN_RESULT_MAX_NUM];    // The list of probabilities.
} esp_mn_results_t;
```
- `state` is the recognition state of the current frame
- `num` is the number of recognized commands; `num` <= 5, i.e. up to 5 possible results are returned
- `phrase_id` is the phrase ID of the speech command
- `prob` is the recognition probability of the recognized entries, sorted from largest to smallest
Users can use `phrase_id[0]` and `prob[0]` to get the recognition result with the highest probability.
- ESP_MN_STATE_TIMEOUT
This state indicates that no speech command has been detected for a long time; MultiNet exits automatically and waits for the next wake-up.
Therefore:
if you exit speech recognition when the return state is `ESP_MN_STATE_DETECTED`, you get single recognition mode;
if you exit speech recognition when the return state is `ESP_MN_STATE_TIMEOUT`, you get continuous recognition mode.
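Putting sections 4.1-4.3 together, a minimal detection loop might look like this sketch; the AFE setup is as in the AFE document, and `buff` is assumed to hold `mu_chunksize` samples fetched from AFE:
```
while (true) {
    afe_handle->fetch(afe_data, buff);                    // single-channel 16KHz/16bit data
    esp_mn_state_t mn_state = multinet->detect(model_data, buff);
    if (mn_state == ESP_MN_STATE_DETECTED) {
        esp_mn_results_t *mn_result = multinet->get_results(model_data);
        if (mn_result->num > 0) {
            int command = mn_result->phrase_id[0];        // most probable command
            // ... act on `command` ...
        }
        break;                                            // single recognition mode
    } else if (mn_state == ESP_MN_STATE_TIMEOUT) {
        break;                                            // no command heard; wait for next wake-up
    }
}
```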
## 5. Other configurations
### 5.1 Threshold setting
MultiNet supports setting and getting the threshold of each speech command, which can help users optimize recognition performance.
- Get the threshold of one speech commands
```
multinet->get_command_det_threshold(model_data, phrase_id);
```
- Set the threshold of one speech commands
When setting the threshold, users are advised to get the threshold first and increase or decrease it appropriately on the basis of the original threshold.
The threshold range is (0, 1).
```
multinet->set_command_det_threshold(model_data, phrase_id, threshold);
```
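For example, a sketch of nudging the threshold of phrase 0 (the float return type of the getter is an assumption; the threshold must stay in (0, 1)):
```
float th = multinet->get_command_det_threshold(model_data, 0);   // read the current threshold
multinet->set_command_det_threshold(model_data, 0, th + 0.05f);  // make phrase 0 slightly stricter
```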


@ -1,209 +1,307 @@
# MultiNet Introduction [[English]](./README.md)
MultiNet is a lightweight model designed on the [CRNN](https://arxiv.org/pdf/1703.05390.pdf) network and [CTC](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.75.6306&rep=rep1&type=pdf) to implement multi-command recognition on ESP32; it currently supports recognition of up to 100 custom command words.
MultiNet is a lightweight model designed to implement multi-command recognition offline on the ESP32 series; it currently supports recognition of up to 200 custom command words.
## Overview
> Supports Chinese and English command word recognition (English requires ESP32S3)
> Supports user-defined command words
> Supports adding / deleting / modifying command words at runtime
> Supports up to 200 command words
> Supports both single recognition and continuous recognition modes
> Lightweight, with low resource consumption
> Low latency, within 500ms
> Supports online switching between Chinese and English models (ESP32S3 only)
> The model sits in a separate partition, supporting OTA by the user
MultiNet takes the **MFCC** features of the audio as input, and outputs Chinese/English "phoneme" classes. By combining the output phonemes, the corresponding Chinese characters or words are obtained.
## 1. Overview
The MultiNet input is audio processed by the audio front end (AFE), in 16KHz, 16bit, mono format. By recognizing the audio, it maps to the corresponding Chinese characters or English words.
The following table shows the model support on different chips:
![multinet_model](../img/MultiNet_model.png)
Please refer to [flash model](../flash_model/README_CN.md) for how to select among the different models.

**Note: models whose names end with `Q8` are the 8-bit quantized versions of the corresponding models, which are more lightweight.**
## 2. The principle of speech command recognition

The principle of speech command recognition is illustrated below:
![speech_command-recognition-system](../img/multinet_workflow.png)
## 3. User Guide

### 3.1 Requirements on speech command design

- Chinese commands should generally be 4-6 characters long; commands that are too short raise the false recognition rate, and commands that are too long are hard for users to remember
- English commands should generally be 4-6 words long
- Mixing Chinese and English within one command is not supported
- Up to **200** commands are currently supported
- Commands must not contain Arabic numerals or special characters
- Avoid using common conversational phrases as commands
- The more the pronunciations of the characters/words in a command differ from each other, the better
### 3.2 How to customize speech commands

> Multiple ways of customizing commands are supported

> Commands can be added/deleted/modified dynamically at any time

MultiNet places no restriction on how commands are customized: users can assemble the desired commands, in the required format, into a list by any means (online or offline) and pass it to MultiNet.

We provide different examples for different customers to demonstrate command customization, which fall broadly into the two categories below.
#### 3.2.1 Speech command format

MultiNet supports customized speech commands: users can add the commands they want to MultiNet. Note that each newly added command needs a corresponding Command ID, so that MultiNet can report that ID after recognition.

The commands must follow a specific format, as described below:

- Chinese

Chinese commands must be written in Hanyu Pinyin, with one space between the pinyin of adjacent characters. For example, "打开空调" (turn on the air conditioner) should be written as "da kai kong tiao", and "打开绿色灯" (turn on the green light) as "da kai lv se deng".

**We also provide a tool for converting Chinese characters into pinyin; for details, see:**

- English

English commands must be written in a specific phoneme notation, with the phonemes of adjacent words separated by spaces. For example, "turn on the light" must be written as "TkN nN jc LiT".

**We provide the conversion rules and a tool; for details, please refer to the [English G2P tool](../tool/multinet_g2p.py).**
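For instance, a small set of commands and the strings passed to MultiNet might look like this (the IDs and phrases are illustrative; use the tools above to generate the pinyin/phoneme strings):

```
// Command ID 0: 打开电灯 (turn on the light)   -> "da kai dian deng"
// Command ID 1: 关闭电灯 (turn off the light)  -> "guan bi dian deng"
// Command ID 2: "turn on the light" (English)  -> "TkN nN jc LiT" (from multinet_g2p.py)
```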
#### 3.2.2 Setting commands offline

MultiNet supports multiple, flexible ways of setting commands: however the commands are authored (code / network / file), users only need to call the corresponding API.

Here we present two common ways of adding commands.

- Adding commands via `menuconfig`

Refer to the examples in ESP-Skainet: speech commands can be added through `idf.py menuconfig -> ESP Speech Recognition -> Add Chinese speech commands/Add English speech commands`.
![menuconfig_add_speech_commands](../img/menuconfig_add_speech_commands.png)
Note that a single Command ID can support multiple phrases. For example, "打开空调" (turn on the air conditioner) and "开空调" (switch the air conditioner on) mean the same thing, so they can be written in the entry of the same Command ID, with adjacent phrases separated by the English character "," (no spaces are needed around the ",").

Then simply call the following API in your code:
```
/**
* @brief Update the speech commands of MultiNet by menuconfig
*
* @param multinet The multinet handle
*
* @param model_data The model object to query
*
* @return
* - ESP_OK Success
* - ESP_ERR_INVALID_STATE Fail
*/
esp_err_t esp_mn_commands_update_from_sdkconfig(esp_mn_iface_t *multinet, const model_iface_data_t *model_data);
```
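As a minimal usage sketch (assuming `multinet` and `model_data` were created as shown in section 4.1; the cast matches the non-const parameter type in the prototype above):

```
// Sketch: load the speech commands defined in menuconfig into the running model.
esp_err_t err = esp_mn_commands_update_from_sdkconfig((esp_mn_iface_t *)multinet, model_data);
if (err != ESP_OK) {
    printf("failed to update speech commands from sdkconfig\n");
}
```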
- Adding commands created by yourself

Refer to the examples in ESP-Skainet to see this way of adding commands.

With this method, users write the commands directly in code and pass them to MultiNet. In actual development and in products, the commands can be delivered by any feasible means (network / UART / SPI, etc.) and replaced at any time.

#### 3.2.3 Setting commands online

MultiNet supports dynamically adding/deleting/modifying speech commands online while it is running, without replacing the model or adjusting its parameters. For details, refer to the examples in ESP-Skainet.

Users only need to call the following APIs:
```
/**
* @brief Initialize the Speech Commands link of MultiNet
*
* @return
* - ESP_OK Success
* - ESP_ERR_NO_MEM No memory
* - ESP_ERR_INVALID_STATE The Speech Commands link has been initialized
*/
esp_err_t esp_mn_commands_init(void);
/**
* @brief Add one speech command with phoneme string and command ID
*
* @param command_id The command ID
*
* @param phoneme_string The phoneme string of the speech commands
*
* @return
* - ESP_OK Success
* - ESP_ERR_INVALID_STATE Fail
*/
esp_err_t esp_mn_commands_add(int command_id, char *phoneme_string);
/**
* @brief Modify one speech command with a new phoneme string
*
* @param old_phoneme_string The old phoneme string of the speech commands
*
* @param new_phoneme_string The new phoneme string of the speech commands
*
* @return
* - ESP_OK Success
* - ESP_ERR_INVALID_STATE Fail
*/
esp_err_t esp_mn_commands_modify(char *old_phoneme_string, char *new_phoneme_string);
/**
* @brief Remove one speech command by phoneme string
*
* @param phoneme_string The phoneme string of the speech commands
*
* @return
* - ESP_OK Success
* - ESP_ERR_INVALID_STATE Fail
*/
esp_err_t esp_mn_commands_remove(char *phoneme_string);
/**
* @brief Update the speech commands of MultiNet, must be used after [add/remove/modify] the speech commands
*
* @param multinet The multinet handle
*
* @param model_data The model object to query
*
* @return
* - ESP_OK Success
* - ESP_ERR_INVALID_STATE Fail
*/
esp_err_t esp_mn_commands_update(const esp_mn_iface_t *multinet, const model_iface_data_t *model_data);
```
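A minimal sketch of editing the command set at run time, assuming `multinet` and `model_data` were created as in section 4.1 (the pinyin strings are illustrative):

```
// Sketch: build and activate a command set dynamically.
esp_mn_commands_init();                                 // initialize the command list
esp_mn_commands_add(0, "da kai kong tiao");             // ID 0: turn on the air conditioner
esp_mn_commands_add(1, "guan bi kong tiao");            // ID 1: turn off the air conditioner
esp_mn_commands_modify("guan bi kong tiao", "qing guan bi kong tiao");  // reword ID 1
esp_mn_commands_remove("da kai kong tiao");             // drop ID 0
esp_mn_commands_update(multinet, model_data);           // changes take effect only after update
```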
## 4. Running speech command recognition

Speech command recognition must run together with the acoustic algorithm module (AFE) in ESP-SR, with wake-up (WakeNet) enabled in the AFE. For how to use the AFE, please refer to:

[AFE Introduction and Usage](../audio_front_end/README_CN.md)

After configuring the AFE, please configure and run MultiNet with the following steps:

### 4.1 MultiNet initialization

- Model version declaration

The model version is pre-selected in `menuconfig`; after making the selection, add the following code to your project:
```
const esp_mn_iface_t *multinet = &MULTINET_MODEL;
```
- Create the model handle

Users need to call the `create` interface to create the model handle `model_data` for subsequent operations:
```
model_iface_data_t *model_data = multinet->create(&MULTINET_COEFF, time_out_time_ms);
```
- MULTINET_COEFF: the model coefficient; users do not need to change it and can fill it in as-is
- time_out_time_ms: the time, in `ms`, after which MultiNet exits when no speech command is detected; customizable, with a recommended range of [5000, 10000]
- Set the speech commands

Please refer to section 3 above.
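Putting the steps of this section together, a minimal sketch (the 6000 ms timeout is just an example value):

```
// Sketch: declare the model version, then create the model handle.
const esp_mn_iface_t *multinet = &MULTINET_MODEL;                          // version selected in menuconfig
model_iface_data_t *model_data = multinet->create(&MULTINET_COEFF, 6000);  // 6000 ms timeout, example value
// ... then set the speech commands as described in section 3 before detection
```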
### 4.2 Running MultiNet

Once the AFE is started with WakeNet enabled, MultiNet can run. The following requirements apply:

> The frame length passed in must be equal to the frame length of AFE fetch

> The supported audio format is 16 kHz, 16-bit, mono; the data obtained from AFE fetch is in this format as well

- Determine the frame length to pass to MultiNet
```
int mu_chunksize = multinet->get_samp_chunksize(model_data);
```
`mu_chunksize` is the number of `short`-type samples per frame that must be passed to MultiNet; it is exactly the same as the number of data points per frame fetched by the AFE.
- MultiNet detect

We feed the data fetched by the AFE in real time to the following API:
```
esp_mn_state_t mn_state = multinet->detect(model_data, buff);
```
The length of `buff` is `mu_chunksize * sizeof(int16_t)`.
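A minimal sketch of this plumbing (`afe_handle` and `afe_data` come from the AFE setup, and the `fetch` call shown here is schematic; see the AFE document for the exact interface):

```
// Sketch: feed AFE output frames straight into MultiNet.
int mu_chunksize = multinet->get_samp_chunksize(model_data);
int16_t *buff = malloc(mu_chunksize * sizeof(int16_t));
while (1) {
    afe_handle->fetch(afe_data, buff);                 // one frame: 16 kHz, 16-bit, mono (schematic call)
    esp_mn_state_t mn_state = multinet->detect(model_data, buff);
    if (mn_state != ESP_MN_STATE_DETECTING) {
        break;                                         // detected or timed out; see section 4.3
    }
}
```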
### 4.3 MultiNet recognition results

Speech command recognition supports two basic modes:

> Single recognition

> Continuous recognition

Speech command recognition must be used together with wake-up; after wake-up, command detection can run.

While the command recognition model is running, it returns the recognition state of the current frame, `mn_state`, in real time. The possible states are as follows:

- ESP_MN_STATE_DETECTING

This state indicates that recognition is in progress and the target command has not been recognized yet.

- ESP_MN_STATE_DETECTED

This state indicates that the target command has been recognized; the user can now call the `get_results` interface to obtain the recognition results.
```
esp_mn_results_t *mn_result = multinet->get_results(model_data);
```
The recognition results are stored in the return value of the `get_results` API. The data type of the return value is as follows:
```
typedef struct{
esp_mn_state_t state;
int num; // The number of phrases in the list, num <= 5. When num = 0, no phrase is recognized.
int phrase_id[ESP_MN_RESULT_MAX_NUM]; // The list of phrase id.
float prob[ESP_MN_RESULT_MAX_NUM]; // The list of probability.
} esp_mn_results_t;
```
- `state` is the recognition state of the current frame
- `num` is the number of recognized phrases; `num` <= 5, i.e. up to 5 candidate results are returned
- `phrase_id` is the phrase ID of each recognized entry
- `prob` is the recognition probability of each recognized entry, sorted in descending order

Users can use `phrase_id[0]` and `prob[0]` to get the recognition result with the highest probability.

- ESP_MN_STATE_TIMEOUT

This state means that no command has been detected for a long time; MultiNet exits automatically and waits for the next wake-up.

Therefore:

If the application exits command recognition once the returned state is `ESP_MN_STATE_DETECTED`, it runs in single recognition mode;

If the application exits command recognition only when the returned state is `ESP_MN_STATE_TIMEOUT`, it runs in continuous recognition mode.
## 5. Other configuration and usage

### 5.1 Threshold setting

MultiNet supports setting and getting the threshold of each speech command, which can help users tune recognition performance.

- Get the threshold of one command
```
multinet->get_command_det_threshold(model_data, phrase_id);
```
- Set the threshold of one command

When setting a threshold, users are advised to get the current threshold first and then increase or decrease it moderately from that baseline.

The `threshold` range is (0, 1).
```
multinet->set_command_det_threshold(model_data, phrase_id, threshold);
```
View File
@ -28,77 +28,25 @@ The following table shows the model support of Espressif SoCs:
![wakent_model](../img/WakeNet_model.png)
## Use WakeNet

- How to select the WakeNet model

Please refer to [Flash model](../flash_model/README.md).
- How to run WakeNet
WakeNet is currently included in the [AFE](../audio_front_end/README.md); it runs by default and returns the detection results through the AFE fetch interface.

If users want to disable WakeNet, please use:
```
afe_config.wakenet_init = false;
```
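For context, a minimal sketch of where this flag lives, assuming the AFE is created from a default configuration as in the esp-sr AFE examples (`AFE_CONFIG_DEFAULT()` and `create_from_config()` are taken from those examples and may differ between versions):

```
// Sketch: create the AFE with WakeNet disabled.
afe_config_t afe_config = AFE_CONFIG_DEFAULT();   // default AFE configuration (per the AFE examples)
afe_config.wakenet_init = false;                  // do not run WakeNet inside the AFE
esp_afe_sr_data_t *afe_data = afe_handle->create_from_config(&afe_config);
```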
## Performance Test
Please refer to [Performance_test](../performance_test/README.md).
## Wake Word Customization
View File
@ -17,7 +17,7 @@ The WakeNet flow chart is as follows:
The neural network structure has been updated to its 6th generation, where:

- WakeNet1 and WakeNet2 are no longer in use.

- WakeNet3 and WakeNet4 are based on the [CRNN](https://arxiv.org/abs/1703.05390) structure.

- WakeNet5 (WakeNet5X2, WakeNet5X3), WakeNet7 and WakeNet8 are based on the [Dilated Convolution](https://arxiv.org/pdf/1609.03499.pdf) structure.

Note: WakeNet5, WakeNet5X2 and WakeNet5X3 share the same network structure, but WakeNet5X2 and WakeNet5X3 have more parameters than WakeNet5. Please refer to [Performance Test](#性能测试) for more details.
@ -28,65 +28,29 @@ The WakeNet flow chart is as follows:
![wakent_model](../img/WakeNet_model.png)
## Use WakeNet

- WakeNet model selection

For WakeNet model selection, please refer to [Flash model](../flash_model/README_CN.md).
For customized wake words, please refer to the [Espressif Speech Wake Words Customization Process](乐鑫语音唤醒词定制流程.md).
- Run WakeNet

WakeNet is currently included in the audio front end [AFE](../audio_front_end/README_CN.md); it runs by default and returns the detection results through the AFE fetch interface.

If users need to turn off WakeNet, select the following in the AFE configuration:

```
afe_config.wakenet_init = false;
```

WakeNet then stops running.
## Performance Test
For details, please refer to [Performance_test](../performance_test/README.md).
## Wake Word Customization