([简体中文](./websocket_protocol_zh.md)|English)

# WebSocket/gRPC Communication Protocol

This is the communication protocol for the FunASR software package, covering offline file transcription ([deployment document](./SDK_tutorial.md)) and real-time speech recognition ([deployment document](./SDK_tutorial_online.md)).

## Offline File Transcription

### Sending Data from Client to Server

#### Message Format

Configuration parameters and meta information are in JSON format; audio data is sent as raw bytes.

#### Initial Communication

The message (serialized as JSON) is:

```text
{"mode": "offline", "wav_name": "wav_name", "wav_format": "pcm", "is_speaking": true, "hotwords": "阿里巴巴 达摩院 阿里云"}
```

Parameter explanation:

```text
`mode`: `offline`, indicating the inference mode for offline file transcription
`wav_name`: the name of the audio file to be transcribed
`wav_format`: the audio/video file extension, such as pcm, mp3, mp4, etc.
`is_speaking`: false indicates the end of a sentence, such as a VAD segmentation point or the end of a WAV file
`audio_fs`: the audio sampling rate; required when the input audio is in PCM format
`hotwords`: if the acoustic model is a hotword model, hotword data is sent to the server as a string, with a space (" ") as the separator between hotwords, e.g. "阿里巴巴 达摩院 阿里云"
```

#### Sending Audio Data

For PCM format, send the audio data directly. For other audio/video formats, send the header information together with the audio/video bytes. Multiple sampling rates and audio/video formats are supported.
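The offline flow above can be sketched as a small client. This is a minimal illustration, not the official client: the server URI, the chunk size, and the helper names (`build_init_message`, `build_end_message`, `transcribe`) are all illustrative assumptions, and it presumes the third-party `websockets` package is installed.

```python
# Sketch of an offline-transcription client for the FunASR WebSocket protocol.
# The URI, chunk size, and helper names are illustrative; only the message
# shapes follow the protocol described in this document.
import json

def build_init_message(wav_name, wav_format="pcm",
                       hotwords="阿里巴巴 达摩院 阿里云", audio_fs=16000):
    """First JSON message: configuration and meta information."""
    msg = {
        "mode": "offline",
        "wav_name": wav_name,
        "wav_format": wav_format,
        "is_speaking": True,
        "hotwords": hotwords,
    }
    if wav_format == "pcm":
        msg["audio_fs"] = audio_fs  # sampling rate is only needed for raw PCM
    return json.dumps(msg, ensure_ascii=False)

def build_end_message():
    """End-of-audio flag, sent after the last audio bytes."""
    return json.dumps({"is_speaking": False})

async def transcribe(uri, wav_name, audio_bytes, chunk_size=102400):
    """Send config, audio bytes, and the end flag; return the parsed result."""
    import websockets  # third-party dependency, assumed installed
    async with websockets.connect(uri) as ws:
        await ws.send(build_init_message(wav_name))
        # Audio is sent as raw bytes: PCM needs no header, while other
        # formats are sent with their file header included.
        for i in range(0, len(audio_bytes), chunk_size):
            await ws.send(audio_bytes[i:i + chunk_size])
        await ws.send(build_end_message())
        return json.loads(await ws.recv())
```

Note that `json.dumps` serializes the Python booleans to JSON `true`/`false`, matching the wire format the server expects.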
#### Sending End of Audio Flag

After the audio data has been sent, an end-of-audio flag must be sent (serialized as JSON):

```text
{"is_speaking": false}
```

### Sending Data from Server to Client

#### Sending Recognition Results

The message (serialized as JSON) is:

```text
{"mode": "offline", "wav_name": "wav_name", "text": "asr outputs", "is_final": true, "timestamp": "[[100,200], [200,500]]"}
```

Parameter explanation:

```text
`mode`: `offline`, indicating the inference mode for offline file transcription
`wav_name`: the name of the audio file to be transcribed
`text`: the text output of speech recognition
`is_final`: indicates the end of recognition
`timestamp`: if the acoustic model is a timestamp model, this field is returned, giving timestamps in the format "[[100,200], [200,500]]"
```

## Real-time Speech Recognition

### System Architecture Diagram
