This commit is contained in:
liying@espressif.com 2022-11-23 15:10:25 +08:00 committed by LiJunu
commit fa9cfb1849
16 changed files with 253 additions and 751 deletions

6
.gitignore vendored

@ -6,6 +6,7 @@ include/sdkconfig.h
build/
sdkconfig.old
sdkconfig
.DS_Store
*.pyc
@ -24,3 +25,8 @@ docs/doxygen_sqlite3.db
# Downloaded font files
docs/_static/DejaVuSans.ttf
docs/_static/NotoSansSC-Regular.otf
model/target/*
.vscode
docs/_build/*

BIN docs/_static/QR_Dilated_Convolution.png (new binary file, 6.1 KiB; not shown)

BIN docs/_static/QR_MFCC.png (new binary file, 6.5 KiB; not shown)

BIN docs/_static/QR_multinet_g2p.png (new binary file, 6.7 KiB; not shown)


@ -23,7 +23,6 @@ ESP32_DOCS = ['audio_front_end/README.rst',
'wake_word_engine/README.rst',
'wake_word_engine/ESP_Wake_Words_Customization.rst',
'speech_command_recognition/README.rst',
'acoustic_algorithm/README.rst',
'flash_model/README.rst',
'audio_front_end/Espressif_Microphone_Design_Guidelines.rst',
'test_report/README.rst',


@ -1,27 +0,0 @@
#!/bin/bash
# Convert every Markdown file under $1/$2 to reStructuredText using pandoc.
function convert_md2rst(){
    for file in "$1/$2"/*
    do
        filename="$(basename -- "$file")"
        fname="${filename%.*}"
        echo "converting $fname"
        pandoc "$1/$2/$filename" -f markdown -t rst -s -o "$1/$2/${fname}.rst"
    done
}
convert_md2rst en acoustic_algorithm
convert_md2rst en audio_front_end
convert_md2rst en flash_model
convert_md2rst en performance_test
convert_md2rst en speech_command_recognition
convert_md2rst en wake_word_engine
convert_md2rst zh_cn acoustic_algorithm
convert_md2rst zh_cn audio_front_end
convert_md2rst zh_cn flash_model
convert_md2rst zh_cn performance_test
convert_md2rst zh_cn speech_command_recognition
convert_md2rst zh_cn wake_word_engine


@ -1,248 +0,0 @@
Acoustic Algorithm Introduction
===============================
:link_to_translation:`zh_CN:[中文]`
Acoustic algorithms provided in esp-sr include voice activity detection (VAD), adaptive gain control (AGC), acoustic echo cancellation (AEC), noise suppression (NS), and mic-array speech enhancement (MASE). VAD, AGC, AEC, and NS are supported on both single-mic and multi-mic development boards; MASE is supported on multi-mic boards only.
VAD
---
Overview
~~~~~~~~
VAD takes an audio stream as input and outputs a prediction of whether each frame of the stream contains speech.
API Reference
~~~~~~~~~~~~~
Header
^^^^^^
- esp_vad.h
Function
^^^^^^^^
- ``vad_handle_t vad_create(vad_mode_t vad_mode)``
**Definition**
Initialization of VAD handle.
**Parameter**
- vad_mode: operating mode of VAD, from VAD_MODE_0 to VAD_MODE_4; a larger value indicates more aggressive VAD.
**Return**
Handle to VAD.
- ``vad_state_t vad_process(vad_handle_t inst, int16_t *data, int sample_rate_hz, int one_frame_ms);``
**Definition**
Processing of VAD for one frame.
**Parameter**
- inst: VAD handle.
- data: buffer holding both the input and output audio stream.
- sample_rate_hz: the sampling frequency in Hz; can be 32000, 16000, or 8000; default: 16000.
- one_frame_ms: the length of one audio frame in ms; can be 10, 20, or 30; default: 30.
**Return**
- VAD_SILENCE if no voice
- VAD_SPEECH if voice is detected
- ``void vad_destroy(vad_handle_t inst)``
**Definition**
Destruction of a VAD handle.
**Parameter**
- inst: the VAD handle to be destroyed.
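Taken together, the three VAD calls above can be sketched as follows. This is a minimal sketch, assuming 16 kHz audio and 30 ms frames; ``get_one_frame()`` is a placeholder for your own audio-capture routine and is not part of esp-sr:

```c
#include <stdbool.h>
#include <stdint.h>
#include "esp_vad.h"

#define SAMPLE_RATE_HZ 16000
#define FRAME_MS       30
// 16 kHz * 30 ms = 480 samples per frame
static int16_t frame[SAMPLE_RATE_HZ / 1000 * FRAME_MS];

// Placeholder: fills one frame from your audio source (not part of esp-sr).
extern bool get_one_frame(int16_t *buf);

void vad_loop(void)
{
    vad_handle_t vad = vad_create(VAD_MODE_3);  // fairly aggressive detection
    while (get_one_frame(frame)) {
        vad_state_t state = vad_process(vad, frame, SAMPLE_RATE_HZ, FRAME_MS);
        if (state == VAD_SPEECH) {
            // voice detected in this frame: hand it to the recognizer, etc.
        }
    }
    vad_destroy(vad);
}
```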
AGC
---
.. _overview-1:
Overview
~~~~~~~~
AGC keeps the volume of the audio signal at a stable level, avoiding a signal so loud that it gets clipped or so quiet that it fails to trigger the speech recognizer.
.. _api-reference-1:
API Reference
~~~~~~~~~~~~~
- ``void *esp_agc_open(int agc_mode, int sample_rate)``
**Definition**
Initialization of AGC handle.
**Parameter**
- agc_mode: operating mode of AGC, 3 to enable AGC and 0 to disable it.
- sample_rate: sampling rate of audio signal.
**Return**
- AGC handle.
- ``int esp_agc_process(void *agc_handle, short *in_pcm, short *out_pcm, int frame_size, int sample_rate)``
**Definition**
Processing of AGC for one frame.
**Parameter**
- agc_handle: AGC handle.
- in_pcm: input audio stream.
- out_pcm: output audio stream.
- frame_size: signal frame length in ms.
- sample_rate: signal sampling rate in Hz.
**Return**
Returns 0 if AGC processing succeeds and -1 if it fails; -2 and -3 indicate invalid sample_rate and frame_size input, respectively.
- ``void esp_agc_clse(void *agc_handle)``
**Definition**
Destruction of an AGC handle.
**Parameter**
- agc_handle: the AGC handle to be destroyed.
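A corresponding AGC sketch follows; the header name ``esp_agc.h`` is an assumption, and the 16 kHz / 10 ms frame sizing is illustrative:

```c
#include "esp_agc.h"   // assumed header name

#define SAMPLE_RATE 16000
#define FRAME_MS    10
// 16 kHz * 10 ms = 160 samples per frame
static short in_pcm[160], out_pcm[160];

void agc_example(void)
{
    void *agc = esp_agc_open(3, SAMPLE_RATE);   // mode 3: AGC enabled
    // ... fill in_pcm with one captured frame, then:
    int ret = esp_agc_process(agc, in_pcm, out_pcm, FRAME_MS, SAMPLE_RATE);
    if (ret < 0) {
        // -1: failure; -2: invalid sample_rate; -3: invalid frame_size
    }
    esp_agc_clse(agc);   // note: "clse" is the API's actual spelling
}
```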
AEC
---
.. _overview-2:
Overview
~~~~~~~~
AEC suppresses echo of the sound played by the speaker of the board.
.. _api-reference-2:
API Reference
~~~~~~~~~~~~~
- ``aec_handle_t aec_create(int sample_rate, int frame_length, int filter_length)``
**Definition**
Initialization of AEC handle.
**Parameter**
- sample_rate: audio signal sampling rate.
- frame_length: audio frame length in ms.
- filter_length: the length of adaptive filter in AEC.
**Return**
Handle to AEC.
- ``aec_create_t aec_create_multimic(int sample_rate, int frame_length, int filter_length, int nch)``
**Definition**
Initialization of AEC handle.
**Parameter**
- sample_rate: audio signal sampling rate.
- frame_length: audio frame length in ms.
- filter_length: the length of adaptive filter in AEC.
- nch: number of channels of the signal to be processed.
**Return**
Handle to AEC.
- ``void aec_process(aec_handle_t inst, int16_t *indata, int16_t *refdata, int16_t *outdata)``
**Definition**
Processing of AEC for one frame.
**Parameter**
- inst: AEC handle.
- indata: input audio stream, which can be single- or multi-channel, depending on the channel number defined at initialization.
- refdata: reference signal to be cancelled from the input.
- outdata: output audio stream, the number of channels is the same as indata.
- ``void aec_destroy(aec_handle_t inst)``
**Definition**
Destruction of an AEC handle.
**Parameter**
- inst: the AEC handle to be destroyed.
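A hedged sketch of the single-mic AEC flow described above; the header name ``esp_aec.h`` and the frame/filter lengths are illustrative assumptions, not prescriptions:

```c
#include "esp_aec.h"   // assumed header name

// mic:   microphone capture containing the echo
// ref:   signal sent to the speaker (reference to be cancelled)
// clean: output with the echo suppressed (same channel count as mic)
void aec_example(int16_t *mic, int16_t *ref, int16_t *clean)
{
    // 16 kHz, 16 ms frames, 1600-tap adaptive filter (illustrative values)
    aec_handle_t aec = aec_create(16000, 16, 1600);
    aec_process(aec, mic, ref, clean);
    aec_destroy(aec);
}
```

For multi-channel capture, ``aec_create_multimic()`` adds the ``nch`` argument; the processing and destruction calls are the same.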
NS
--
.. _overview-3:
Overview
~~~~~~~~
Single-channel speech enhancement. If the board has multiple mics, MASE is recommended for noise suppression.
.. _api-reference-3:
API Reference
~~~~~~~~~~~~~
- ``ns_handle_t ns_pro_create(int frame_length, int mode)``
**Definition**
Creates an instance of the more powerful noise suppression algorithm.
**Parameter**
- frame_length: audio frame length in ms.
- mode: 0: Mild, 1: Medium, 2: Aggressive
**Return**
Handle to NS.
- ``void ns_process(ns_handle_t inst, int16_t *indata, int16_t *outdata)``
**Definition**
Processing of NS for one frame.
**Parameter**
- inst: NS handle.
- indata: input audio stream.
- outdata: output audio stream.
- ``void ns_destroy(ns_handle_t inst)``
**Definition**
Destruction of an NS handle.
**Parameter**
- inst: the NS handle to be destroyed.
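And a matching NS sketch (header name ``esp_ns.h`` assumed; mode 2 selects the aggressive setting listed above):

```c
#include "esp_ns.h"   // assumed header name

void ns_example(int16_t *in, int16_t *out)
{
    ns_handle_t ns = ns_pro_create(30, 2);  // 30 ms frames, mode 2 (aggressive)
    ns_process(ns, in, out);                // enhance one frame
    ns_destroy(ns);
}
```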


@ -21,7 +21,6 @@ Based on years of hardware design and development experience, Espressif can provide
Wake word model <wake_word_engine/README>
Customized wake words <wake_word_engine/ESP_Wake_Words_Customization>
Speech commands <speech_command_recognition/README>
Acoustic algorithm introduction <acoustic_algorithm/README>
Model loading method <flash_model/README>
Microphone Design Guidelines <audio_front_end/Espressif_Microphone_Design_Guidelines>
Test Reports <test_report/README>


@ -1,248 +0,0 @@
Acoustic Algorithm Introduction
===============================
:link_to_translation:`en:[English]`
Acoustic algorithms provided in esp-sr include voice activity detection (VAD), adaptive gain control (AGC), acoustic echo cancellation (AEC), noise suppression (NS), and mic-array speech enhancement (MASE). VAD, AGC, AEC, and NS are supported on both single-mic and multi-mic development boards; MASE is supported on multi-mic boards only.
VAD
---
Overview
~~~~~~~~
VAD takes an audio stream as input and outputs a prediction of whether each frame of the stream contains speech.
API Reference
~~~~~~~~~~~~~
Header
^^^^^^
- esp_vad.h
Function
^^^^^^^^
- ``vad_handle_t vad_create(vad_mode_t vad_mode)``
**Definition**
Initialization of a VAD handle.
**Parameter**
- vad_mode: operating mode of VAD, from VAD_MODE_0 to VAD_MODE_4; a larger value indicates more aggressive VAD.
**Return**
Handle to VAD.
- ``vad_state_t vad_process(vad_handle_t inst, int16_t *data, int sample_rate_hz, int one_frame_ms);``
**Definition**
Processing of VAD for one frame.
**Parameter**
- inst: VAD handle.
- data: buffer holding both the input and output audio stream.
- sample_rate_hz: the sampling frequency in Hz; can be 32000, 16000, or 8000; default: 16000.
- one_frame_ms: the length of one audio frame in ms; can be 10, 20, or 30; default: 30.
**Return**
- VAD_SILENCE if no voice
- VAD_SPEECH if voice is detected
- ``void vad_destroy(vad_handle_t inst)``
**Definition**
Destruction of a VAD handle.
**Parameter**
- inst: the VAD handle to be destroyed.
AGC
---
.. _overview-1:
Overview
~~~~~~~~
AGC keeps the volume of the audio signal at a stable level, avoiding a signal so loud that it gets clipped or so quiet that it fails to trigger the speech recognizer.
.. _api-reference-1:
API Reference
~~~~~~~~~~~~~
- ``void *esp_agc_open(int agc_mode, int sample_rate)``
**Definition**
Initialization of an AGC handle.
**Parameter**
- agc_mode: operating mode of AGC, 3 to enable AGC and 0 to disable it.
- sample_rate: sampling rate of the audio signal.
**Return**
- AGC handle.
- ``int esp_agc_process(void *agc_handle, short *in_pcm, short *out_pcm, int frame_size, int sample_rate)``
**Definition**
Processing of AGC for one frame.
**Parameter**
- agc_handle: AGC handle.
- in_pcm: input audio stream.
- out_pcm: output audio stream.
- frame_size: signal frame length in ms.
- sample_rate: signal sampling rate in Hz.
**Return**
- Returns 0 if AGC processing succeeds and -1 if it fails; -2 and -3 indicate invalid sample_rate and frame_size input, respectively.
- ``void esp_agc_clse(void *agc_handle)``
**Definition**
Destruction of an AGC handle.
**Parameter**
- agc_handle: the AGC handle to be destroyed.
AEC
---
.. _overview-2:
Overview
~~~~~~~~
AEC suppresses the echo of the sound played by the speaker on the board.
.. _api-reference-2:
API Reference
~~~~~~~~~~~~~
- ``aec_handle_t aec_create(int sample_rate, int frame_length, int filter_length)``
**Definition**
Initialization of an AEC handle.
**Parameter**
- sample_rate: audio signal sampling rate.
- frame_length: audio frame length in ms.
- filter_length: the length of the adaptive filter in AEC.
**Return**
Handle to AEC.
- ``aec_create_t aec_create_multimic(int sample_rate, int frame_length, int filter_length, int nch)``
**Definition**
Initialization of an AEC handle.
**Parameter**
- sample_rate: audio signal sampling rate.
- frame_length: audio frame length in ms.
- filter_length: the length of the adaptive filter in AEC.
- nch: number of channels of the signal to be processed.
**Return**
Handle to AEC.
- ``void aec_process(aec_handle_t inst, int16_t *indata, int16_t *refdata, int16_t *outdata)``
**Definition**
Processing of AEC for one frame.
**Parameter**
- inst: AEC handle.
- indata: input audio stream, which can be single- or multi-channel, depending on the channel number defined at initialization.
- refdata: reference signal to be cancelled from the input.
- outdata: output audio stream; the number of channels is the same as indata.
- ``void aec_destroy(aec_handle_t inst)``
**Definition**
Destruction of an AEC handle.
**Parameter**
- inst: the AEC handle to be destroyed.
NS
--
.. _overview-3:
Overview
~~~~~~~~
Single-channel speech enhancement. If the board has multiple mics, MASE is recommended for noise suppression.
.. _api-reference-3:
API Reference
~~~~~~~~~~~~~
- ``ns_handle_t ns_pro_create(int frame_length, int mode)``
**Definition**
Creates an instance of the more powerful noise suppression algorithm.
**Parameter**
- frame_length: audio frame length in ms.
- mode: 0: mild, 1: medium, 2: aggressive
**Return**
Handle to NS.
- ``void ns_process(ns_handle_t inst, int16_t *indata, int16_t *outdata)``
**Definition**
Processing of NS for one frame.
**Parameter**
- inst: NS handle.
- indata: input audio stream.
- outdata: output audio stream.
- ``void ns_destroy(ns_handle_t inst)``
**Definition**
Destruction of an NS handle.
**Parameter**
- inst: the NS handle to be destroyed.


@ -164,6 +164,8 @@ WakeNet or Bypass 简介
The audio output by AFE is single-channel data. In speech recognition scenarios, with WakeNet enabled, AFE outputs single-channel data containing the target human voice. In voice communication scenarios, it outputs single-channel data with a higher signal-to-noise ratio.
.. only:: html
Quick Start
-----------


@ -153,6 +153,8 @@ ESP32S3 支持:
- Custom path: if you want to place the model in a specific folder, modify the ``get_model_base_path()`` function in ``ESP-SR_PATH/model/model_path.c``. For example, to use the ``espmodel`` folder in the SD card directory, the function can be changed to:
.. only:: html
::
char *get_model_base_path(void)
@ -172,6 +174,8 @@ ESP32S3 支持:
After completing the steps above, you can flash the project.
.. only:: html
Model initialization and usage in code
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^


@ -24,7 +24,6 @@ ESP-SR 用户指南
Wake word model <wake_word_engine/README>
Customized wake words <wake_word_engine/ESP_Wake_Words_Customization>
Speech commands <speech_command_recognition/README>
Acoustic algorithm introduction <acoustic_algorithm/README>
Model loading method <flash_model/README>
Microphone Design Guidelines <audio_front_end/Espressif_Microphone_Design_Guidelines>
Test Reports <test_report/README>


@ -85,6 +85,11 @@ MultiNet 对命令词自定义方法没有限制,用户可以通过任意方
**We also provide a corresponding tool for converting Chinese characters to pinyin; for details see:** `English-to-phoneme tool <../../tool/multinet_g2p.py>`__
.. only:: latex
.. figure:: ../../_static/QR_multinet_g2p.png
:alt: menuconfig_add_speech_commands
Setting speech commands offline
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^


@ -138,18 +138,21 @@
Wake-up Rate Test
-----------------
+-------------------------+-------------------+----------------+--------+-------------+-------------+--------------+------------------+
| Test item               | Ambient noise     | Noise level    | SNR    | Angle       | Distance    | Wake-up rate | Recognition rate |
+=========================+===================+================+========+=============+=============+==============+==================+
| Local wake-up rate test | Quiet             | Speech: 59 dBA | NA     | Speech: 90° | Speech: 3 m | 99%          | 91.5%            |
|                         |                   |                |        |             |             |              |                  |
|                         |                   | Noise: NA      |        | Noise: 45°  | Noise: 2 m  |              |                  |
|                         +-------------------+----------------+--------+             |             +--------------+------------------+
|                         | White noise       | Speech: 59 dBA | ≥4 dBA |             |             | 99%          | 78.25%           |
|                         |                   |                |        |             |             |              |                  |
|                         |                   | Noise: 55 dBA  |        |             |             |              |                  |
|                         +-------------------+----------------+--------+             |             +--------------+------------------+
|                         | Speech-like noise | Speech: 59 dBA | ≥4 dBA |             |             | 99%          | 82.77%           |
|                         |                   |                |        |             |             |              |                  |
|                         |                   | Noise: 55 dBA  |        |             |             |              |                  |
+-------------------------+-------------------+----------------+--------+-------------+-------------+--------------+------------------+
False Wake-up Test
------------------
@ -168,11 +171,11 @@
+------------------------+---------------+----------------+-----------+--------------+--------------------------+
| Test item              | Ambient noise | Noise level    | SNR       | Wake-up rate | Command recognition rate |
+========================+===============+================+===========+==============+==========================+
| Wake-up interrupt test | Music         | Speech: 59 dBA | ≥ -10 dBA | 100%         | 96%                      |
|                        |               | Noise: 69 dBA  |           |              |                          |
|                        +---------------+----------------+-----------+--------------+--------------------------+
|                        | TTS           | Speech: 59 dBA | ≥ -10 dBA | 100%         | 96%                      |
|                        |               | Noise: 69 dBA  |           |              |                          |
+------------------------+---------------+----------------+-----------+--------------+--------------------------+
Response Time Test
@ -181,8 +184,8 @@
+--------------------+---------------+----------------+-----+---------------+
| Test item          | Ambient noise | Noise level    | SNR | Response time |
+====================+===============+================+=====+===============+
| Response time test | Quiet         | Speech: 59 dBA | NA  | <500 ms       |
|                    |               | Noise: NA      |     |               |
+--------------------+---------------+----------------+-----+---------------+
.. figure:: ../../_static/test_response_time.png


@ -24,6 +24,11 @@ WakeNet的流程图如下
- Speech Features
We use the `MFCC <https://en.wikipedia.org/wiki/Mel-frequency_cepstrum>`__ method to extract speech spectral features. The input audio has a 16 kHz sample rate, is mono, and is encoded as signed 16-bit. The window width and step of each frame are both 30 ms.
.. only:: latex
.. figure:: ../../_static/QR_MFCC.png
:alt: overview
- Neural Network
The neural network architecture has been updated to its 9th version, where:
@ -31,6 +36,11 @@ WakeNet的流程图如下
- wakeNet5 is used on the ESP32 chip.
- wakeNet8 and wakeNet9 are used on the ESP32S3 chip; these models are based on the `Dilated Convolution <https://arxiv.org/pdf/1609.03499.pdf>`__ structure.
.. only:: latex
.. figure:: ../../_static/QR_Dilated_Convolution.png
:alt: overview
Note: WakeNet5, WakeNet5X2, and WakeNet5X3 share the same network structure, but WakeNet5X2 and WakeNet5X3 have more parameters than WakeNet5. Please refer to `Performance Test <#性能测试>`__ for more details.
- Keyword Trigger Method
@ -71,9 +81,7 @@ WakeNet使用
- Running WakeNet
WakeNet is currently included in the audio front-end algorithm `AFE <../audio_front_end/README_CN.md>`__; it runs by default and returns detection results through the AFE fetch interface.
If you do not need to initialize WakeNet, select the following in the AFE configuration: