# WebSocket/gRPC Communication Protocol

This is the communication protocol for the FunASR software package, covering both offline file transcription (see the deployment document) and real-time speech recognition (see the deployment document).
## Offline File Transcription

### Sending Data from Client to Server

#### Message Format

Configuration parameters and meta information are sent in JSON format, while audio data is sent as raw bytes.
#### Initial Communication

The message (which needs to be serialized as JSON) is:

```json
{"mode": "offline", "wav_name": "wav_name", "wav_format": "pcm", "is_speaking": true, "hotwords": "阿里巴巴 达摩院 阿里云", "itn": true}
```
Parameter explanation:

- `mode`: `offline`, indicating the offline file transcription inference mode
- `wav_name`: the name of the audio file to be transcribed
- `wav_format`: the audio/video file extension, such as pcm, mp3, mp4, etc.
- `is_speaking`: `false` indicates the end of a sentence, such as a VAD segmentation point or the end of a WAV file
- `audio_fs`: the audio sampling rate; required when the input audio is in PCM format
- `hotwords`: if the AM is a hotword model, hotword data needs to be sent to the server as a string, with a space as the separator between hotwords, e.g. "阿里巴巴 达摩院 阿里云"
- `itn`: whether to apply inverse text normalization (ITN); `true` (the default) enables it, `false` disables it
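As an illustration, here is a minimal sketch of building this message in Python; the file name and parameter values are placeholders, not fixed by the protocol:

```python
import json

# Initial configuration message for offline transcription.
# json.dumps serializes Python's True/False as JSON true/false.
init_message = json.dumps({
    "mode": "offline",
    "wav_name": "demo.wav",     # placeholder file name
    "wav_format": "wav",
    "is_speaking": True,
    "audio_fs": 16000,          # only required for raw PCM input
    "hotwords": "阿里巴巴 达摩院 阿里云",
    "itn": True,
})
```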
#### Sending Audio Data

For PCM, send the raw audio data directly. For other audio/video formats, send the header information together with the bytes data. Multiple sampling rates and audio/video formats are supported.
#### Sending the End-of-Audio Flag

After the audio data has been sent, an end-of-audio flag needs to be sent (serialized as JSON):

```json
{"is_speaking": false}
```
### Sending Data from Server to Client

#### Sending Recognition Results

The message (serialized as JSON) is:

```json
{"mode": "offline", "wav_name": "wav_name", "text": "asr outputs", "is_final": true, "timestamp": "[[100,200], [200,500]]"}
```
Parameter explanation:

- `mode`: `offline`, indicating the offline file transcription inference mode
- `wav_name`: the name of the audio file that was transcribed
- `text`: the text output of speech recognition
- `is_final`: indicates the end of recognition
- `timestamp`: if the AM is a timestamp model, this field is returned, containing the timestamps as a JSON-encoded string such as "[[100,200], [200,500]]"
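A sketch of parsing this result on the client side; note that `timestamp` arrives as a JSON-encoded string inside the JSON message, so it needs a second decoding step:

```python
import json

def handle_offline_result(raw: str) -> None:
    """Parse one recognition-result message (sketch)."""
    msg = json.loads(raw)
    print(f"{msg['wav_name']}: {msg['text']}")
    # "timestamp" is itself a JSON-encoded string, so it takes a
    # second json.loads to become a list of [start, end] pairs.
    if "timestamp" in msg:
        for start_ms, end_ms in json.loads(msg["timestamp"]):
            print(f"  {start_ms} -> {end_ms} ms")
```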
## Real-time Speech Recognition

### System Architecture Diagram

### Sending Data from Client to Server

#### Message Format

Configuration parameters and meta information are sent in JSON format, while audio data is sent as raw bytes.
#### Initial Communication

The message (which needs to be serialized as JSON) is:

```json
{"mode": "2pass", "wav_name": "wav_name", "is_speaking": true, "wav_format": "pcm", "chunk_size": [5,10,5], "hotwords": "阿里巴巴 达摩院 阿里云", "itn": true}
```
Parameter explanation:

- `mode`: `offline` indicates the single-sentence recognition inference mode; `online` indicates real-time speech recognition; `2pass` indicates real-time recognition with offline-model correction at sentence endings
- `wav_name`: the name of the audio file to be transcribed
- `wav_format`: the audio/video file extension, such as pcm, mp3, mp4, etc. (note: only PCM audio streams are supported in version 1.0)
- `is_speaking`: `false` indicates the end of a sentence, such as a VAD segmentation point or the end of a WAV file
- `chunk_size`: the latency configuration of the streaming model; `[5,10,5]` means the current audio chunk is 600 ms long, with 300 ms of look-ahead and 300 ms of look-back context (see the worked example below)
- `audio_fs`: the audio sampling rate; required when the input audio is in PCM format
- `hotwords`: if the AM is a hotword model, hotword data needs to be sent to the server as a string, with a space as the separator between hotwords, e.g. "阿里巴巴 达摩院 阿里云"
- `itn`: whether to apply inverse text normalization (ITN); `true` (the default) enables it, `false` disables it
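A small worked example of the `chunk_size` arithmetic, assuming 16 kHz, 16-bit mono PCM; the 60 ms per-unit duration follows from 10 units corresponding to 600 ms as stated above:

```python
# chunk_size = [5, 10, 5]: each unit is 60 ms, so the middle value
# (10) gives a 600 ms audio chunk, and the two 5s give 300 ms of
# look-back and look-ahead context respectively.
chunk_size = [5, 10, 5]
chunk_ms = chunk_size[1] * 60                        # 600 ms
sample_rate = 16000                                  # 16 kHz PCM
bytes_per_sample = 2                                 # 16-bit samples
chunk_bytes = sample_rate * chunk_ms // 1000 * bytes_per_sample
print(chunk_bytes)  # 19200 bytes of PCM per 600 ms chunk
```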
#### Sending Audio Data

Send the audio data directly, stripping any header information and sending only the raw bytes. Supported sampling rates are 8000 Hz (which must be specified via `audio_fs` in the initial message) and 16000 Hz.
#### Sending the End-of-Audio Flag

After the audio data has been sent, an end-of-audio flag needs to be sent (serialized as JSON):

```json
{"is_speaking": false}
```
### Sending Data from Server to Client

#### Sending Recognition Results

The message (serialized as JSON) is:

```json
{"mode": "2pass-online", "wav_name": "wav_name", "text": "asr outputs", "is_final": true, "timestamp": "[[100,200], [200,500]]"}
```
Parameter explanation:

- `mode`: the inference mode of this result: `2pass-online` for real-time recognition results, `2pass-offline` for two-pass corrected recognition results
- `wav_name`: the name of the audio file to be transcribed
- `text`: the text output of speech recognition
- `is_final`: indicates the end of recognition
- `timestamp`: if the AM is a timestamp model, this field is returned, containing the timestamps as a JSON-encoded string such as "[[100,200], [200,500]]"
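A sketch of dispatching the two result types on the client side; the display convention here (overwrite streaming hypotheses, commit corrected sentences) is a common choice, not part of the protocol:

```python
import json

partial: dict[str, str] = {}  # current streaming hypothesis per wav_name

def handle_2pass_result(raw: str) -> None:
    """Dispatch one server message by its mode field (sketch)."""
    msg = json.loads(raw)
    if msg["mode"] == "2pass-online":
        # Streaming hypothesis: shown immediately, replaced later.
        partial[msg["wav_name"]] = msg["text"]
        print("partial:", msg["text"])
    elif msg["mode"] == "2pass-offline":
        # Offline-corrected result for the finished sentence.
        print("final:  ", msg["text"])
        partial.pop(msg["wav_name"], None)
```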
