(简体中文|English)

# WebSocket/gRPC Communication Protocol

This document describes the communication protocol for the FunASR software package, which covers offline file transcription (deployment document) and real-time speech recognition (deployment document).

## Offline File Transcription

### Sending Data from Client to Server

#### Message Format

Configuration parameters and meta information are in JSON format, while audio data is in bytes.

#### Initial Communication

The message (which needs to be serialized in JSON) is:

```json
{"mode": "offline", "wav_name": "wav_name", "wav_format": "pcm", "is_speaking": true, "hotwords": "阿里巴巴 达摩院 阿里云", "itn": true}
```

Parameter explanation:

- `mode`: `offline`, indicating the inference mode for offline file transcription.
- `wav_name`: the name of the audio file to be transcribed.
- `wav_format`: the audio/video file extension, such as pcm, mp3, mp4, etc.
- `is_speaking`: `false` indicates the end of a sentence, such as a VAD segmentation point or the end of a WAV file.
- `audio_fs`: the audio sampling rate; this parameter must be added when the input audio is in PCM format.
- `hotwords`: if the AM is a hotword model, hotword data is sent to the server as a single string, with a space (" ") as the separator between hotwords, for example "阿里巴巴 达摩院 阿里云".
- `itn`: whether to apply inverse text normalization (ITN); `true` (the default) enables it, `false` disables it.
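
As a concrete illustration, here is a minimal Python sketch of this handshake using the third-party `websockets` library; the endpoint `ws://localhost:10095`, the file name, and the use of plain (non-TLS) WebSocket are assumptions for the example, not part of the protocol.

```python
# Minimal offline handshake sketch. Assumes a FunASR WebSocket server
# listening on ws://localhost:10095 (host, port, and TLS settings vary
# by deployment).
import asyncio
import json

import websockets

async def main():
    async with websockets.connect("ws://localhost:10095") as ws:
        config = {
            "mode": "offline",
            "wav_name": "demo.wav",   # any identifier for this request
            "wav_format": "pcm",
            "is_speaking": True,      # serialized as JSON true
            "audio_fs": 16000,        # required for raw PCM input
            "hotwords": "阿里巴巴 达摩院 阿里云",
            "itn": True,
        }
        await ws.send(json.dumps(config, ensure_ascii=False))

asyncio.run(main())
```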

#### Sending Audio Data

For PCM format, send the raw audio bytes directly. For other audio formats, send the file header together with the audio bytes. Multiple sampling rates and audio/video formats are supported.

#### Sending End of Audio Flag

After sending the audio data, an end-of-audio flag needs to be sent (serialized in JSON):

```json
{"is_speaking": false}
```

### Sending Data from Server to Client

#### Sending Recognition Results

The message (serialized in JSON) is:

```json
{"mode": "offline", "wav_name": "wav_name", "text": "asr outputs", "is_final": true, "timestamp": "[[100,200], [200,500]]"}
```

Parameter explanation:

- `mode`: `offline`, indicating the inference mode for offline file transcription.
- `wav_name`: the name of the audio file that was transcribed.
- `text`: the text output of speech recognition.
- `is_final`: indicates the end of recognition.
- `timestamp`: returned only if the AM is a timestamp model; a string of [start, end] pairs in milliseconds, in the format "[[100,200], [200,500]]".
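
A corresponding receive loop might look like the following sketch; the field access mirrors the message format above.

```python
# Read messages until the final offline result arrives.
import json

async def receive_result(ws):
    async for message in ws:
        result = json.loads(message)
        print(result["wav_name"], result["text"])
        if "timestamp" in result:      # only for timestamp models
            print("timestamps:", result["timestamp"])
        if result.get("is_final"):
            break
```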

## Real-time Speech Recognition

### System Architecture Diagram

### Sending Data from Client to Server

#### Message Format

Configuration parameters and meta information are in JSON format, while audio data is in bytes.

#### Initial Communication

The message (which needs to be serialized in JSON) is:

```json
{"mode": "2pass", "wav_name": "wav_name", "is_speaking": true, "wav_format": "pcm", "chunk_size": [5,10,5], "hotwords": "阿里巴巴 达摩院 阿里云", "itn": true}
```

Parameter explanation:

- `mode`: `offline` indicates single-sentence recognition; `online` indicates real-time speech recognition; `2pass` indicates real-time recognition with offline-model correction at sentence endings.
- `wav_name`: the name of the audio file to be transcribed.
- `wav_format`: the audio/video file extension, such as pcm, mp3, mp4, etc. (Note: only PCM audio streams are supported in version 1.0.)
- `is_speaking`: `false` indicates the end of a sentence, such as a VAD segmentation point or the end of a WAV file.
- `chunk_size`: the latency configuration of the streaming model; `[5,10,5]` means the current chunk is 600 ms of audio, with 300 ms of look-back and 300 ms of look-ahead.
- `audio_fs`: the audio sampling rate; this parameter must be added when the input audio is in PCM format.
- `hotwords`: if the AM is a hotword model, hotword data is sent to the server as a single string, with a space (" ") as the separator between hotwords, for example "阿里巴巴 达摩院 阿里云".
- `itn`: whether to apply inverse text normalization (ITN); `true` (the default) enables it, `false` disables it.
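
The handshake mirrors the offline case; a hedged sketch follows, where the `wav_name` identifier and sampling rate are placeholders.

```python
# Open a 2pass session. chunk_size [5,10,5] matches the 600 ms chunk /
# 300 ms look-back and look-ahead configuration described above.
import json

async def start_2pass(ws):
    await ws.send(json.dumps({
        "mode": "2pass",
        "wav_name": "mic_stream",   # placeholder identifier
        "is_speaking": True,
        "wav_format": "pcm",
        "chunk_size": [5, 10, 5],
        "audio_fs": 16000,
        "hotwords": "阿里巴巴 达摩院 阿里云",
        "itn": True,
    }, ensure_ascii=False))
```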

#### Sending Audio Data

Send the raw audio bytes directly, with any header information removed. Supported audio sampling rates are 8000 Hz (which must be declared as `audio_fs` in the initial message) and 16000 Hz.

#### Sending End of Audio Flag

After sending the audio data, an end-of-audio flag needs to be sent (serialized in JSON):

```json
{"is_speaking": false}
```

### Sending Data from Server to Client

#### Sending Recognition Results

The message (serialized in JSON) is:

```json
{"mode": "2pass-online", "wav_name": "wav_name", "text": "asr outputs", "is_final": true, "timestamp": "[[100,200], [200,500]]"}
```

Parameter explanation:

- `mode`: the inference mode of this result: `2pass-online` for real-time (partial) recognition results, `2pass-offline` for two-pass corrected recognition results.
- `wav_name`: the name of the audio file to be transcribed.
- `text`: the text output of speech recognition.
- `is_final`: indicates the end of recognition.
- `timestamp`: returned only if the AM is a timestamp model; a string of [start, end] pairs in milliseconds, in the format "[[100,200], [200,500]]".
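
To close the loop, a hedged sketch of a client-side consumer that separates the two result streams:

```python
# Print partial hypotheses as they stream in, and the corrected result
# at each sentence end.
import json

async def consume_results(ws):
    async for message in ws:
        result = json.loads(message)
        if result["mode"] == "2pass-online":
            print("partial:", result["text"])
        elif result["mode"] == "2pass-offline":
            print("final:", result["text"])
        if result.get("is_final"):
            break
```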