MultiNet Introduction
MultiNet is a lightweight model designed to perform offline speech command recognition on the ESP32 series. Up to 200 speech commands, including customized commands, are currently supported.
- Supports recognition of Chinese and English speech commands (ESP32-S3 is required for English speech commands)
- Supports user-defined commands
- Supports adding / deleting / modifying commands at runtime
- Supports up to 200 commands
- Supports both single recognition and continuous recognition
- Lightweight and low resource consumption
- Low latency, within 500 ms
- Supports online switching between Chinese and English models (ESP32-S3 only)
- The model is placed in a separate partition so that users can apply OTA updates
1. Overview
MultiNet takes as input the audio processed by the audio front-end (AFE) algorithm, in 16 kHz, 16-bit, mono format. MultiNet recognizes this audio and maps it to the corresponding Chinese characters or English words.
The following table shows the model support of Espressif SoCs:
Note: a model name ending with Q8 denotes the 8-bit quantized version of the model, which is more lightweight.
2. Commands Recognition Process
Please see the flow diagram below:
3. User Guide
3.1 Requirements for speech commands
- The recommended length of a Chinese command is 4-6 Chinese characters: commands that are too short lead to a high false recognition rate, and commands that are too long are inconvenient for users to remember
- The recommended length of an English command is 4-6 words
- Mixing Chinese and English within one command is not supported
- Currently, up to 200 commands are supported
- A command cannot contain Arabic numerals or special characters
- Avoid common words such as "hello"
- The greater the pronunciation difference between the characters / words within a command, the better the performance
3.2 Speech command customization methods
MultiNet supports a variety of speech command customization methods, including dynamic addition / deletion / modification of speech commands.
3.2.1 Format of Speech commands
Speech command strings need to follow specific formats, as described below:
- Chinese

Chinese speech commands need to be written in Chinese Pinyin, with a space between the Pinyin of each character. For example, "打开空调" (turn on the air conditioner) should be written as "da kai kong tiao", and "打开绿色灯" (turn on the green light) should be written as "da kai lv se deng".

In addition, we provide corresponding tools for users to convert Chinese characters into Pinyin; see the corresponding tool for details.
- English

English speech commands need to be represented by specific phonetic symbols, with the symbols of each word separated by spaces. For example, "turn on the light" needs to be written as "TkN nN jc LiT".

We provide the corresponding conversion rules and tools; for details, please refer to the English G2P tool.
3.2.2 Set speech commands offline
MultiNet supports flexible ways of setting speech commands. No matter how users set the speech commands (code / network / file), they only need to call the corresponding API.
Here we provide two methods of adding speech commands:
- Use menuconfig

Users can refer to the example in ESP-Skainet and define their own speech commands via idf.py menuconfig -> ESP Speech Recognition -> Add Chinese speech commands / Add English speech commands.

Please note that a single Command ID can support multiple phrases. For example, "da kai kong tiao" and "kai kong tiao" have the same meaning, so both can be written in the entry corresponding to the same command ID, with adjacent entries separated by the English character "," (no spaces before or after ",").

Then call the following API:

```c
/**
 * @brief Update the speech commands of MultiNet from menuconfig
 *
 * @param multinet   The MultiNet handle
 * @param model_data The model object to query
 *
 * @return
 *     - ESP_OK                 Success
 *     - ESP_ERR_INVALID_STATE  Fail
 */
esp_err_t esp_mn_commands_update_from_sdkconfig(esp_mn_iface_t *multinet, const model_iface_data_t *model_data);
```
- Add speech commands in code

Users can refer to the example in ESP-Skainet for this method of adding speech commands.

In this method, users set the speech commands directly in code and pass them to MultiNet. In actual development and products, users can receive the required speech commands through various channels such as network / UART / SPI and change the speech commands accordingly, as shown in the sketch below.
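The snippet below is a minimal sketch of this method, based on the ESP-Skainet examples; the esp_mn_commands_clear / esp_mn_commands_add / esp_mn_commands_update names come from the esp_mn_speech_commands API in esp-sr, and the phrases and IDs are placeholders:

```c
#include "esp_mn_speech_commands.h"

// Minimal sketch: replace the current command set in code.
// Based on the ESP-Skainet examples; verify the esp_mn_commands_*
// signatures against the esp-sr version in use.
void set_speech_commands(void)
{
    esp_mn_commands_clear();                      // remove all existing commands
    esp_mn_commands_add(1, "da kai kong tiao");   // ID 1: "turn on the air conditioner"
    esp_mn_commands_add(1, "kai kong tiao");      // same ID, alternative phrase
    esp_mn_commands_add(2, "guan bi kong tiao");  // ID 2: "turn off the air conditioner"
    esp_mn_commands_update();                     // apply the changes to MultiNet
}
```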
3.2.3 Set speech commands online
MultiNet supports online dynamic addition / deletion / modification of speech commands during operation, without changing models or adjusting parameters. For details, please refer to the example in ESP-Skainet.
Please refer to the esp_mn_speech_commands APIs in esp-sr for details.
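As an illustration, a hedged sketch of runtime maintenance using the same assumed esp_mn_commands_* API as above; the phrases are placeholders:

```c
#include "esp_mn_speech_commands.h"

// Sketch of online command maintenance during operation, assuming the
// esp_mn_commands_* API from esp_mn_speech_commands.h in esp-sr.
void adjust_speech_commands(void)
{
    esp_mn_commands_add(3, "da kai lv se deng");                    // add a command at runtime
    esp_mn_commands_modify("guan bi kong tiao", "guan kong tiao");  // replace an existing phrase
    esp_mn_commands_remove("kai kong tiao");                        // delete a phrase
    esp_mn_commands_update();                                       // changes take effect only after update
}
```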
4. Run speech commands recognition
Speech command recognition needs to run together with the audio front end (AFE) in esp-sr, with WakeNet enabled in the AFE. For the use of AFE, please refer to the AFE documentation.
4.1 MultiNet Initialization
- Initialize the MultiNet model
- Set the speech commands

Please refer to Section 3 for setting the speech commands; a minimal initialization sketch follows.
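The sketch below shows one way to perform the initialization, based on the ESP-Skainet examples; the esp_srmodel_* helpers, the "model" partition label, and the 6000 ms timeout are assumptions to check against the esp-sr version in use:

```c
#include "esp_mn_iface.h"
#include "esp_mn_models.h"
#include "esp_mn_speech_commands.h"
#include "model_path.h"

// Minimal initialization sketch based on the ESP-Skainet examples.
void multinet_init(void)
{
    // Load the model list from the "model" partition (assumed label).
    srmodel_list_t *models = esp_srmodel_init("model");

    // Pick a Chinese MultiNet model and create its handle.
    char *mn_name = esp_srmodel_filter(models, ESP_MN_PREFIX, ESP_MN_CHINESE);
    esp_mn_iface_t *multinet = esp_mn_handle_from_name(mn_name);
    model_iface_data_t *model_data = multinet->create(mn_name, 6000);  // 6000 ms timeout

    // Then set the speech commands, e.g. from menuconfig:
    esp_mn_commands_update_from_sdkconfig(multinet, model_data);
}
```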
4.2 Run MultiNet
MultiNet can be used once AFE is running with WakeNet enabled. The following requirements apply:
- The frame length of MultiNet is equal to the AFE fetch frame length
- The supported audio format is 16 kHz, 16-bit, mono; the data obtained by AFE fetch is already in this format
- Get the frame length that needs to be passed to MultiNet

```c
int mu_chunksize = multinet->get_samp_chunksize(model_data);
```

- MultiNet detect

Send the data obtained from AFE fetch to the following API (the combined loop is sketched after this list):

```c
esp_mn_state_t mn_state = multinet->detect(model_data, buff);
```

The length of buff is mu_chunksize * sizeof(int16_t).
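Putting fetch and detect together, a minimal sketch of the per-frame loop after wake-up; the afe_handle / afe_data parameters are assumed to come from the AFE initialization described in the AFE documentation:

```c
#include "esp_afe_sr_iface.h"
#include "esp_err.h"
#include "esp_mn_iface.h"

// Sketch of the detect loop that runs after wake-up. afe_handle and
// afe_data are assumed to come from the AFE setup (see the AFE docs).
void detect_loop(esp_afe_sr_iface_t *afe_handle, esp_afe_sr_data_t *afe_data,
                 esp_mn_iface_t *multinet, model_iface_data_t *model_data)
{
    while (1) {
        afe_fetch_result_t *res = afe_handle->fetch(afe_data);  // one AFE frame
        if (res == NULL || res->ret_value == ESP_FAIL) {
            continue;
        }
        // res->data holds one frame of 16 kHz, 16-bit mono audio.
        esp_mn_state_t mn_state = multinet->detect(model_data, res->data);
        if (mn_state != ESP_MN_STATE_DETECTING) {
            break;  // a command was detected or detection timed out; see 4.3
        }
    }
}
```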
4.3 The detection results of MultiNet
Speech commands recognition supports two basic modes:
- Single recognition
- Continuous recognition

Speech command recognition must be used together with WakeNet; MultiNet detection runs after wake-up.
While MultiNet is running, it returns the recognition state of the current frame in real time via mn_state. The following states are currently defined:
- ESP_MN_STATE_DETECTING

This state indicates that MultiNet is detecting but no target speech command has been recognized yet.

- ESP_MN_STATE_DETECTED

This state indicates that a target speech command has been recognized. At this point, the user can call the get_results interface to obtain the recognition results:

```c
esp_mn_results_t *mn_result = multinet->get_results(model_data);
```

The recognition results are stored in the return value of the get_results API, whose data type is as follows:

```c
typedef struct {
    esp_mn_state_t state;
    int num;                              // The number of phrases in the list, num <= 5. When num = 0, no phrase is recognized.
    int phrase_id[ESP_MN_RESULT_MAX_NUM]; // The list of phrase IDs.
    float prob[ESP_MN_RESULT_MAX_NUM];    // The list of probabilities.
} esp_mn_results_t;
```

- state is the recognition state of the current frame
- num is the number of recognized commands; num <= 5, i.e. up to 5 candidate results are returned
- phrase_id is the phrase ID of each recognized speech command
- prob is the recognition probability of each recognized entry, sorted from largest to smallest

Users can use phrase_id[0] and prob[0] to get the recognition result with the highest probability.

- ESP_MN_STATE_TIMEOUT

This state means that no speech command has been detected for a long time; MultiNet exits automatically and waits for the next wake-up.
Therefore:
- If the application exits speech recognition when ESP_MN_STATE_DETECTED is returned, it runs in single recognition mode;
- If it exits speech recognition only when ESP_MN_STATE_TIMEOUT is returned, it runs in continuous recognition mode.
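For illustration, a hedged sketch of handling mn_state in continuous recognition mode; handle_command() is a hypothetical application callback, and multinet->clean is assumed to exist in the esp_mn_iface_t of the esp-sr version in use:

```c
#include "esp_mn_iface.h"

// Hypothetical application callback for a recognized command.
void handle_command(int phrase_id, float prob);

// Sketch of per-frame result handling in continuous recognition mode.
static void on_mn_state(esp_mn_iface_t *multinet, model_iface_data_t *model_data,
                        esp_mn_state_t mn_state)
{
    if (mn_state == ESP_MN_STATE_DETECTED) {
        esp_mn_results_t *mn_result = multinet->get_results(model_data);
        if (mn_result->num > 0) {
            // Take the candidate with the highest probability.
            handle_command(mn_result->phrase_id[0], mn_result->prob[0]);
        }
        // Continuous mode: keep detecting until timeout instead of exiting here.
    } else if (mn_state == ESP_MN_STATE_TIMEOUT) {
        multinet->clean(model_data);  // assumed API: reset internal state, then wait for the next wake-up
    }
}
```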
5. Other configurations
5.1 Threshold setting
This function is still under development.


