游雁 2024-07-04 23:50:24 +08:00
parent 41aef981fe
commit 4264e5dc52
21 changed files with 1887 additions and 161 deletions

188
.gitignore vendored

@ -1,162 +1,28 @@
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class
# C extensions
*.so
# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST
# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec
# Installer logs
pip-log.txt
pip-delete-this-directory.txt
# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
cover/
# Translations
*.mo
*.pot
# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal
# Flask stuff:
instance/
.webassets-cache
# Scrapy stuff:
.scrapy
# Sphinx documentation
docs/_build/
# PyBuilder
.pybuilder/
target/
# Jupyter Notebook
.idea
./__pycache__/
*/__pycache__/
*/*/__pycache__/
*/*/*/__pycache__/
.DS_Store
init_model/
*.tar.gz
test_local/
RapidASR
export/*
*.pyc
.eggs
MaaS-lib
.gitignore
.egg*
dist
build
funasr.egg-info
docs/_build
modelscope
samples
.ipynb_checkpoints
# IPython
profile_default/
ipython_config.py
# pyenv
# For a library or package, you might want to ignore these files since the code is
# intended to run in multiple environments; otherwise, check them in:
# .python-version
# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock
# poetry
# Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
# This is especially recommended for binary packages to ensure reproducibility, and is more
# commonly ignored for libraries.
# https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
#poetry.lock
# pdm
# Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
#pdm.lock
# pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
# in version control.
# https://pdm.fming.dev/latest/usage/project/#working-with-version-control
.pdm.toml
.pdm-python
.pdm-build/
# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
__pypackages__/
# Celery stuff
celerybeat-schedule
celerybeat.pid
# SageMath parsed files
*.sage.py
# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/
# Spyder project settings
.spyderproject
.spyproject
# Rope project settings
.ropeproject
# mkdocs documentation
/site
# mypy
.mypy_cache/
.dmypy.json
dmypy.json
# Pyre type checker
.pyre/
# pytype static type analyzer
.pytype/
# Cython debug symbols
cython_debug/
# PyCharm
# JetBrains specific template is maintained in a separate JetBrains.gitignore that can
# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
# and can be added to the global gitignore or merged into this file. For a more nuclear
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
#.idea/
outputs*
emotion2vec*
GPT-SoVITS*
modelscope_models
examples/aishell/llm_asr_nar/*

218
README.md Normal file

@ -0,0 +1,218 @@
([简体中文](./README_zh.md)|English)
# Introduction
SenseVoice is a speech foundation model with multiple speech understanding capabilities, including automatic speech recognition (ASR), spoken language identification (LID), speech emotion recognition (SER), and audio event detection (AED).
<img src="image/sensevoice2.png">
[//]: # (<div align="center"><img src="image/sensevoice.png" width="700"/> </div>)
<div align="center">
<h4>
<a href="https://www.modelscope.cn/studios/iic/SenseVoice"> Online Demo </a>
<a href="# "> Homepage </a>
<a href="#What's News"> What's News </a>
<a href="#Benchmarks"> Benchmarks </a>
<a href="#Install"> Install </a>
<a href="#Usage"> Usage </a>
<a href="#Community"> Community </a>
</h4>
Model Zoo:
[modelscope](https://www.modelscope.cn/models/iic/SenseVoiceSmall), [huggingface]()
</div>
<a name="Highligts"></a>
# Highligts 🎯
**SenseVoice** focuses on high-accuracy multilingual speech recognition, speech emotion recognition, and audio event detection.
- **Multilingual Speech Recognition:** Trained on over 400,000 hours of data and supporting more than 50 languages, SenseVoice delivers recognition performance that surpasses the Whisper model.
- **Rich Transcription:**
  - Possesses excellent emotion recognition capabilities, matching and surpassing the current best emotion recognition models on test data.
  - Offers audio event detection capabilities, supporting the detection of common human-computer interaction events such as background music (BGM), applause, laughter, crying, coughing, and sneezing.
- **Efficient Inference:** The SenseVoice-Small model utilizes a non-autoregressive end-to-end framework, leading to exceptionally low inference latency. It requires only 70ms to process 10 seconds of audio, which is 15 times faster than Whisper-Large.
- **Convenient Finetuning:** Provides convenient finetuning scripts and strategies, allowing users to easily address long-tail sample issues according to their business scenarios.
- **Service Deployment:** Offers a service deployment pipeline supporting multi-concurrent requests, with client-side languages including Python, C++, HTML, Java, and C#, among others.
<a name="What's News"></a>
# What's News 🔥
- 2024/7 The SenseVoice-Small voice understanding model is open-sourced, providing support for multilingual speech recognition, speech emotion recognition, and acoustic event detection capabilities in Mandarin, Cantonese, English, Japanese, and Korean.
<a name="Benchmarks"></a>
# Benchmarks 📝
## Multilingual Speech Recognition
We compared the multilingual speech recognition performance of SenseVoice and Whisper on open-source benchmark datasets, including AISHELL-1, AISHELL-2, WenetSpeech, LibriSpeech, and Common Voice. In terms of Chinese and Cantonese recognition, the SenseVoice-Small model has clear advantages.
<div align="center">
<img src="image/asr_results.png" width="1000" />
</div>
## Speech Emotion Recognition
Due to the current lack of widely-used benchmarks and methods for speech emotion recognition, we conducted evaluations across various metrics on multiple test sets and performed a comprehensive comparison with numerous results from recent benchmarks. The selected test sets encompass data in both Chinese and English, and include multiple styles such as performances, films, and natural conversations. Without finetuning on the target data, SenseVoice was able to achieve and exceed the performance of the current best speech emotion recognition models.
<div align="center">
<img src="image/ser_table.png" width="1000" />
</div>
Furthermore, we compared multiple open-source speech emotion recognition models on the test sets, and the results indicate that the SenseVoice-Large model achieved the best performance on nearly all datasets, while the SenseVoice-Small model also surpassed other open-source models on the majority of the datasets.
<div align="center">
<img src="image/ser_figure.png" width="500" />
</div>
## Audio Event Detection
Although trained exclusively on speech data, SenseVoice can still function as a standalone event detection model. We compared its performance on the environmental sound classification dataset ESC-50 against the widely used industry models BEATs and PANN. SenseVoice achieved commendable results on these tasks; however, due to limitations in training data and methodology, its event classification performance still lags behind specialized AED models.
<div align="center">
<img src="image/aed_figure.png" width="500" />
</div>
## Computational Efficiency
The SenseVoice-Small model uses a non-autoregressive end-to-end architecture, resulting in extremely low inference latency. With a parameter count similar to the Whisper-Small model, it infers more than 5 times faster than Whisper-Small and 15 times faster than Whisper-Large.
<div align="center">
<img src="image/inference.png" width="1000" />
</div>
# Requirements
```shell
pip install -r requirements.txt
```
<a name="Usage"></a>
# Usage
## Inference
### Method 1
```python
from model import SenseVoiceSmall
model_dir = "iic/SenseVoiceSmall"
m, kwargs = SenseVoiceSmall.from_pretrained(model=model_dir)
res = m.inference(
data_in="https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/asr_example_zh.wav",
language="zh",
text_norm="woitn",
**kwargs,
)
print(res)
```
### Method 2
```python
from funasr import AutoModel
model_dir = "iic/SenseVoiceSmall"
input_file = (
"https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/asr_example_zh.wav"
)
model = AutoModel(model=model_dir,
vad_model="fsmn-vad",
vad_kwargs={"max_single_segment_time": 30000},
trust_remote_code=True, device="cuda:0")
res = model.generate(
input=input_file,
cache={},
language="zh",
text_norm="woitn",
batch_size_s=0,
)
print(res)
```
The funasr version integrates the VAD (Voice Activity Detection) model and supports audio input of arbitrary duration, with `batch_size_s` specified in seconds.
If all inputs are short audio clips and batched inference is needed to speed things up, the VAD model can be removed and `batch_size` set accordingly.
```python
model = AutoModel(model=model_dir, trust_remote_code=True, device="cuda:0")
res = model.generate(
input=input_file,
cache={},
language="zh",
text_norm="woitn",
batch_size=64,
)
```
For more usage details, please refer to the [docs](https://github.com/modelscope/FunASR/blob/main/docs/tutorial/README.md)
### Export and Test
```python
# pip3 install -U funasr-onnx
from funasr_onnx import SenseVoiceSmall
model_dir = "iic/SenseVoiceCTC"
model = SenseVoiceSmall(model_dir, batch_size=1, quantize=True)
wav_path = [f'~/.cache/modelscope/hub/{model_dir}/example/asr_example.wav']
result = model(wav_path)
print(result)
```
## Service
To be done.
## Finetune
### Requirements
```shell
git clone https://github.com/alibaba/FunASR.git && cd FunASR
pip3 install -e ./
```
### Data preparation
Data format examples:
```text
{"key": "YOU0000008470_S0000238_punc_itn", "text_language": "<|en|>", "emo_target": "<|NEUTRAL|>", "event_target": "<|Speech|>", "with_or_wo_itn": "<|withitn|>", "target": "Including legal due diligence, subscription agreement, negotiation.", "source": "/cpfs01/shared/Group-speech/beinian.lzr/data/industrial_data/english_all/audio/YOU0000008470_S0000238.wav", "target_len": 7, "source_len": 140}
{"key": "AUD0000001556_S0007580", "text_language": "<|en|>", "emo_target": "<|NEUTRAL|>", "event_target": "<|Speech|>", "with_or_wo_itn": "<|woitn|>", "target": "there is a tendency to identify the self or take interest in what one has got used to", "source": "/cpfs01/shared/Group-speech/beinian.lzr/data/industrial_data/english_all/audio/AUD0000001556_S0007580.wav", "target_len": 18, "source_len": 360}
```
For the full file, refer to `data/train_example.jsonl`.
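As a minimal sketch (field names follow the examples above; the key, path, and length values are hypothetical placeholders), one line of the training manifest can be assembled like this:
```python
import json

# One manifest entry; "target_len" is the token count and "source_len" a rough
# audio-length field, mirroring the examples above.
entry = {
    "key": "utt_0001",                  # hypothetical utterance id
    "text_language": "<|en|>",
    "emo_target": "<|NEUTRAL|>",
    "event_target": "<|Speech|>",
    "with_or_wo_itn": "<|woitn|>",
    "target": "hello world",
    "source": "/path/to/utt_0001.wav",  # hypothetical audio path
    "target_len": 2,
    "source_len": 40,
}

with open("data/train_example.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(entry, ensure_ascii=False) + "\n")
```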
### Finetune
Make sure to set `train_tool` in `finetune.sh` to the absolute path of `funasr/bin/train_ds.py` in the FunASR installation you set up earlier; a sketch for locating this path follows the command below.
```shell
bash finetune.sh
```
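If you are unsure where `train_ds.py` lives, this quick sketch (assuming `funasr` is importable from the environment installed above) prints the path to copy into `finetune.sh`:
```python
import os
import funasr

# Absolute path of funasr/bin/train_ds.py inside the installed FunASR package.
print(os.path.join(os.path.dirname(funasr.__file__), "bin", "train_ds.py"))
```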
## WebUI
```shell
python webui.py
```
<div align="center"><img src="image/webui.png" width="700"/> </div>
<a name="Community"></a>
# Community

219
README_zh.md Normal file

@ -0,0 +1,219 @@
# SenseVoice
(Simplified Chinese | [English](./README.md))
SenseVoice is an audio foundation model with audio understanding capabilities, including automatic speech recognition (ASR), spoken language identification (LID), speech emotion recognition (SER), and acoustic event classification (AEC) or acoustic event detection (AED). This project introduces the SenseVoice model, presents benchmarks on several task test sets, and describes the environment setup and inference steps needed to try the model.
<div align="center">
<img src="image/sensevoice2.png">
[//]: # (<div align="center"><img src="image/sensevoice2.png" width="700"/> </div>)
<h4>
<a href="https://www.modelscope.cn/studios/iic/SenseVoice"> 在线体验 </a>
<a href="#What's New"> 文档主页 </a>
<a href="#核心功能"> 核心功能 </a>
</h4>
<h4>
<a href="#On Going"> 最新动态 </a>
<a href="#Benchmark"> Benchmark </a>
<a href="#环境安装"> 环境安装 </a>
<a href="#用法教程"> 用法教程 </a>
<a href="#联系我们"> 联系我们 </a>
</h4>
模型仓库:中国大陆用户推荐 [modelscope](https://www.modelscope.cn/models/iic/SenseVoiceSmall),海外用户推荐 [huggingface]()
</div>
<a name="核心功能"></a>
# Highlights 🎯
**SenseVoice** focuses on high-accuracy multilingual speech recognition, speech emotion recognition, and audio event detection.
- **Multilingual recognition:** Trained on over 400,000 hours of data and supporting more than 50 languages, with recognition accuracy surpassing the Whisper model.
- **Rich transcription:**
  - Excellent emotion recognition, matching and surpassing the current best emotion recognition models on test data.
  - Audio event detection, covering common human-computer interaction events such as background music, applause, laughter, crying, coughing, and sneezing.
- **Efficient inference:** The SenseVoice-Small model uses a non-autoregressive end-to-end framework with extremely low inference latency: 10 s of audio takes only 70 ms to process, 15 times faster than Whisper-Large.
- **Convenient finetuning:** Convenient finetuning scripts and strategies make it easy to fix long-tail sample issues for specific business scenarios.
- **Service deployment:** A complete service deployment pipeline supports multi-concurrent requests, with client-side languages including Python, C++, HTML, Java, and C#.
<a name="最新动态"></a>
# What's New 🔥
- 2024/7: The SenseVoice-Small multilingual audio understanding model is open-sourced, supporting multilingual speech recognition, emotion recognition, and event detection for Mandarin, Cantonese, English, Japanese, and Korean.
<a name="Benchmarks"></a>
# Benchmarks 📝
## Multilingual Speech Recognition
We compared the multilingual speech recognition performance and inference efficiency of SenseVoice and Whisper on open-source benchmark datasets, including AISHELL-1, AISHELL-2, WenetSpeech, LibriSpeech, and Common Voice. For Chinese and Cantonese recognition, the SenseVoice-Small model has clear advantages.
<div align="center">
<img src="image/asr_results.png" width="1000" />
</div>
## Speech Emotion Recognition
Since widely used benchmarks and methods for speech emotion recognition are still lacking, we evaluated multiple metrics on several test sets and made a comprehensive comparison with many results from recent benchmarks. The selected test sets cover both Chinese and English and include multiple styles such as acted performances, film and TV drama, and natural conversation. Without finetuning on the target data, SenseVoice matches and surpasses the current best speech emotion recognition models on the test data.
<div align="center">
<img src="image/ser_table.png" width="1000" />
</div>
同时我们还在测试集上对多个开源情感识别模型进行对比结果表明SenseVoice-Large模型可以在几乎所有数据上都达到了最佳效果而SenseVoice-Small模型同样可以在多数数据集上取得超越其他开源模型的效果。
<div align="center">
<img src="image/ser_figure.png" width="500" />
</div>
## Audio Event Detection
Although trained only on speech data, SenseVoice can still be used as a standalone event detection model. On the environmental sound classification dataset ESC-50, we compared it with the widely used BEATs and PANN models. SenseVoice achieves good results on these tasks, but limited by its training data and methodology, its event classification performance still lags behind specialized AED models.
<div align="center">
<img src="image/aed_figure.png" width="500" />
</div>
## Inference Efficiency
The SenseVoice-Small model uses a non-autoregressive end-to-end architecture with extremely low inference latency. With a parameter count comparable to the Whisper-Small model, it is 5 times faster than Whisper-Small and 15 times faster than Whisper-Large. Moreover, its inference time does not increase noticeably as the audio duration grows.
<div align="center">
<img src="image/inference.png" width="1000" />
</div>
<a name="环境安装"></a>
# Install Dependencies 🐍
```shell
pip install -r requirements.txt
```
<a name="用法教程"></a>
# Usage 🛠️
## Inference
### Direct inference
```python
from model import SenseVoiceSmall
model_dir = "iic/SenseVoiceSmall"
m, kwargs = SenseVoiceSmall.from_pretrained(model=model_dir)
res = m.inference(
data_in="https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/asr_example_zh.wav",
language="auto", # "zn", "en", "yue", "ja", "ko", "nospeech"
use_itn=False,
**kwargs,
)
print(res)
```
### Inference with funasr
```python
from funasr import AutoModel
model_dir = "iic/SenseVoiceSmall"
input_file = (
"https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/asr_example_zh.wav"
)
model = AutoModel(model=model_dir,
vad_model="fsmn-vad",
vad_kwargs={"max_single_segment_time": 30000},
trust_remote_code=True, device="cuda:0")
res = model.generate(
input=input_file,
cache={},
language="auto", # "zn", "en", "yue", "ja", "ko", "nospeech"
use_itn=False,
batch_size_s=0,
)
print(res)
```
The funasr version integrates the VAD (Voice Activity Detection) model and supports audio input of arbitrary duration, with `batch_size_s` specified in seconds.
If all inputs are short audio clips and batched inference is needed to speed things up, the VAD model can be removed and `batch_size` set accordingly:
```python
model = AutoModel(model=model_dir, trust_remote_code=True, device="cuda:0")
res = model.generate(
input=input_file,
cache={},
language="auto", # "zn", "en", "yue", "ja", "ko", "nospeech"
use_itn=False,
batch_size=64,
)
```
For more detailed usage, please refer to the [docs](https://github.com/modelscope/FunASR/blob/main/docs/tutorial/README.md)
## Service Deployment
To be done.
### Export and Test
```python
# pip3 install -U funasr-onnx
from funasr_onnx import SenseVoiceSmall
model_dir = "iic/SenseVoiceSmall"
model = SenseVoiceSmall(model_dir, batch_size=1, quantize=True)
wav_path = [f'~/.cache/modelscope/hub/{model_dir}/example/asr_example.wav']
result = model(wav_path)
print(result)
```
### Deployment
To be done.
## Finetuning
### Install the training environment
```shell
git clone https://github.com/alibaba/FunASR.git && cd FunASR
pip3 install -e ./
```
### Data preparation
The data format must include the following fields:
```text
{"key": "YOU0000008470_S0000238_punc_itn", "text_language": "<|en|>", "emo_target": "<|NEUTRAL|>", "event_target": "<|Speech|>", "with_or_wo_itn": "<|withitn|>", "target": "Including legal due diligence, subscription agreement, negotiation.", "source": "/cpfs01/shared/Group-speech/beinian.lzr/data/industrial_data/english_all/audio/YOU0000008470_S0000238.wav", "target_len": 7, "source_len": 140}
{"key": "AUD0000001556_S0007580", "text_language": "<|en|>", "emo_target": "<|NEUTRAL|>", "event_target": "<|Speech|>", "with_or_wo_itn": "<|woitn|>", "target": "there is a tendency to identify the self or take interest in what one has got used to", "source": "/cpfs01/shared/Group-speech/beinian.lzr/data/industrial_data/english_all/audio/AUD0000001556_S0007580.wav", "target_len": 18, "source_len": 360}
```
For the full file, refer to `data/train_example.jsonl`.
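As a quick sanity check (a sketch only; the required fields are taken from the examples above), you can verify that every line of the manifest parses and carries the expected keys:
```python
import json

REQUIRED = {"key", "text_language", "emo_target", "event_target",
            "with_or_wo_itn", "target", "source", "target_len", "source_len"}

with open("data/train_example.jsonl", encoding="utf-8") as f:
    for ln, line in enumerate(f, 1):
        entry = json.loads(line)
        missing = REQUIRED - entry.keys()
        assert not missing, f"line {ln} is missing fields: {missing}"
print("manifest looks OK")
```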
### Start training
Make sure to set `train_tool` in `finetune.sh` to the absolute path of `funasr/bin/train_ds.py` in the FunASR installation you set up above.
```shell
bash finetune.sh
```
## WebUI
```shell
python webui.py
```
<div align="center"><img src="image/webui.png" width="700"/> </div>
# Contact Us

10
data/train_example.jsonl Normal file

@ -0,0 +1,10 @@
{"key": "YOU0000008470_S0000238_punc_itn", "text_language": "<|en|>", "emo_target": "<|NEUTRAL|>", "event_target": "<|Speech|>", "with_or_wo_itn": "<|withitn|>", "target": "Including legal due diligence, subscription agreement, negotiation.", "source": "/cpfs01/shared/Group-speech/beinian.lzr/data/industrial_data/english_all/audio/YOU0000008470_S0000238.wav", "target_len": 7, "source_len": 140}
{"key": "AUD0000001556_S0007580", "text_language": "<|en|>", "emo_target": "<|NEUTRAL|>", "event_target": "<|Speech|>", "with_or_wo_itn": "<|woitn|>", "target": "there is a tendency to identify the self or take interest in what one has got used to", "source": "/cpfs01/shared/Group-speech/beinian.lzr/data/industrial_data/english_all/audio/AUD0000001556_S0007580.wav", "target_len": 18, "source_len": 360}
{"key": "19208207_HJwKrcFJ8_o_segment720", "text_language": "<|en|>", "emo_target": "<|EMO_UNKNOWN|>", "event_target": "<|Speech|>", "with_or_wo_itn": "<|woitn|>", "target": "fourth foul up top now to austin three for leonard and in and out no good rebounded by murray looking for some help and he almost throws it out of bounds", "source": "/cpfs_speech/data/shared/Group-speech/beinian.lzr/data/multilingual/lcd_data/english/production/20240222/wer_0_5/taskid_19208207/wav/19208207_HJwKrcFJ8_o_segment720.wav", "target_len": 31, "source_len": 620}
{"key": "wav000_0872_bb6f4a79bb9f49249083465445b1cafa_01c38131ed3a4c609e8270a09170525d", "text_language": "<|zh|>", "emo_target": "<|NEUTRAL|>", "event_target": "<|Speech|>", "with_or_wo_itn": "<|woitn|>", "target": "案件受理费减半收取计一千六百三十一元赵会龙已预交由赵会龙负担一百零七元", "source": "/cpfs01/shared/Group-speech/beinian.lzr/data/industrial_data/16k_common/audio/wav000_0872_bb6f4a79bb9f49249083465445b1cafa_01c38131ed3a4c609e8270a09170525d.wav", "target_len": 35, "source_len": 700}
{"key": "Speaker0244_iPhone_s0_227", "text_language": "<|ko|>", "emo_target": "<|NEUTRAL|>", "event_target": "<|Speech|>", "with_or_wo_itn": "<|woitn|>", "target": "저녁 다 해결합니다", "source": "/cpfs_speech/data/shared/Group-speech/beinian.lzr/data/multilingual/korean/audio/Speaker0244_iPhone_s0_227.wav", "target_len": 10, "source_len": 200}
{"key": "data2sim_speed_part1_channel0_CHANNEL0_SPEAKER0948_SESSION1_009481629_speed12", "text_language": "<|en|>", "emo_target": "<|NEUTRAL|>", "event_target": "<|Speech|>", "with_or_wo_itn": "<|woitn|>", "target": "the money was entrust to him in february this year before he resign in june according to the documents", "source": "/cpfs01/shared/Group-speech/beinian.lzr/data/industrial_data/english_all/audio/data2sim_speed_part1_channel0_CHANNEL0_SPEAKER0948_SESSION1_009481629_speed12.wav", "target_len": 19, "source_len": 380}
{"key": "SPEAKER0272_SESSION1_002721613_punc_itn", "text_language": "<|en|>", "emo_target": "<|EMO_UNKNOWN|>", "event_target": "<|Speech|>", "with_or_wo_itn": "<|withitn|>", "target": "Current proposals don't go far enough.", "source": "/cpfs01/shared/Group-speech/beinian.lzr/data/industrial_data/english_all/audio/SPEAKER0272_SESSION1_002721613.wav", "target_len": 6, "source_len": 120}
{"key": "wav004_0490_04c4f9cb2cb347a2a156c5cad1a903aa", "text_language": "<|zh|>", "emo_target": "<|NEUTRAL|>", "event_target": "<|Speech|>", "with_or_wo_itn": "<|woitn|>", "target": "小凳子", "source": "/cpfs01/shared/Group-speech/beinian.lzr/data/industrial_data/16k_common/audio/wav004_0490_04c4f9cb2cb347a2a156c5cad1a903aa.wav", "target_len": 3, "source_len": 60}
{"key": "18874657_MSnA4nfDC7Q_segment680", "text_language": "<|en|>", "emo_target": "<|NEUTRAL|>", "event_target": "<|Speech|>", "with_or_wo_itn": "<|woitn|>", "target": "and anything that you would like to know please just put it in the email", "source": "/cpfs_speech/data/shared/Group-speech/beinian.lzr/data/multilingual/lcd_data/english/production/20240105/wer_0_5/taskid_18874657/wav/18874657_MSnA4nfDC7Q_segment680.wav", "target_len": 15, "source_len": 300}
{"key": "POD0000007250_S0000518", "text_language": "<|en|>", "emo_target": "<|EMO_UNKNOWN|>", "event_target": "<|Speech|>", "with_or_wo_itn": "<|woitn|>", "target": "i use netflix but and that's not an app", "source": "/cpfs01/shared/Group-speech/beinian.lzr/data/industrial_data/english_all/audio/POD0000007250_S0000518.wav", "target_len": 9, "source_len": 180}

10
data/val_example.jsonl Normal file

@ -0,0 +1,10 @@
{"key": "datasim_speed_Speaker0129_winPhone_s0_149_speed-10_punc_itn", "text_language": "<|ko|>", "emo_target": "<|NEUTRAL|>", "event_target": "<|Speech|>", "with_or_wo_itn": "<|withitn|>", "target": "이거 올리고 리베옹과 안 좋은 사이가 되었다는.", "source": "/cpfs_speech/data/shared/Group-speech/beinian.lzr/data/multilingual/korean/audio/datasim_speed_Speaker0129_winPhone_s0_149_speed-10.wav", "target_len": 26, "source_len": 520}
{"key": "data2sim_noise_rir_new_Speaker0048_winPhone_s0_102_punc_itn", "text_language": "<|ko|>", "emo_target": "<|NEUTRAL|>", "event_target": "<|Speech|>", "with_or_wo_itn": "<|withitn|>", "target": "개통 대리점이랑 얘길 해봐야 합니다.", "source": "/cpfs_speech/data/shared/Group-speech/beinian.lzr/data/multilingual/korean/audio/data2sim_noise_rir_new_Speaker0048_winPhone_s0_102.wav", "target_len": 20, "source_len": 400}
{"key": "wav005_0655_1225906248786196892_punc_itn", "text_language": "<|yue|>", "emo_target": "<|EMO_UNKNOWN|>", "event_target": "<|Speech|>", "with_or_wo_itn": "<|withitn|>", "target": "万科租售中心。", "source": "/cpfs01/shared/Group-speech/beinian.lzr/data/multilingual/cantonese/audio/wav005_0655_1225906248786196892.wav", "target_len": 7, "source_len": 140}
{"key": "datasim_speed_Speaker0732_S0_Android_533_speed-10", "text_language": "<|ko|>", "emo_target": "<|NEUTRAL|>", "event_target": "<|Speech|>", "with_or_wo_itn": "<|woitn|>", "target": "댁 공개 석상에서 에이즈 환자랑 껴안고 뭐 키스하고", "source": "/cpfs_speech/data/shared/Group-speech/beinian.lzr/data/multilingual/korean/audio/datasim_speed_Speaker0732_S0_Android_533_speed-10.wav", "target_len": 28, "source_len": 560}
{"key": "wav001_0437_lATPJxDjwp5xwcnOQp9lTs51zRmz", "text_language": "<|zh|>", "emo_target": "<|NEUTRAL|>", "event_target": "<|Speech|>", "with_or_wo_itn": "<|woitn|>", "target": "郑超啊你到我这儿来一下", "source": "/cpfs01/shared/Group-speech/beinian.lzr/data/industrial_data/16k_common/audio/wav001_0437_lATPJxDjwp5xwcnOQp9lTs51zRmz.wav", "target_len": 11, "source_len": 220}
{"key": "wav010_0212_Speaker0045_iOS_s0_088_punc_itn", "text_language": "<|en|>", "emo_target": "<|EMO_UNKNOWN|>", "event_target": "<|Speech|>", "with_or_wo_itn": "<|withitn|>", "target": "In dark moments, I speak with her.", "source": "/cpfs01/shared/Group-speech/beinian.lzr/data/industrial_data/english_all/audio/wav010_0212_Speaker0045_iOS_s0_088.wav", "target_len": 7, "source_len": 140}
{"key": "18934860_GSD17-Sz1vw_segment107", "text_language": "<|en|>", "emo_target": "<|NEUTRAL|>", "event_target": "<|Speech|>", "with_or_wo_itn": "<|woitn|>", "target": "some states also include suspected domestic violence within mandatory reporting laws", "source": "/cpfs_speech/data/shared/Group-speech/beinian.lzr/data/multilingual/lcd_data/english/production/20240114/wer_0_5/taskid_18934860/wav/18934860_GSD17-Sz1vw_segment107.wav", "target_len": 11, "source_len": 220}
{"key": "wav003_0734_XSG5MY6Eym.mp4_160", "text_language": "<|zh|>", "emo_target": "<|NEUTRAL|>", "event_target": "<|Speech|>", "with_or_wo_itn": "<|woitn|>", "target": "四万多四万多", "source": "/cpfs01/shared/Group-speech/beinian.lzr/data/industrial_data/dialect/audio/wav003_0734_XSG5MY6Eym.mp4_160.wav", "target_len": 6, "source_len": 120}
{"key": "19208463_4_dWQ34YNU4_segment311", "text_language": "<|en|>", "emo_target": "<|EMO_UNKNOWN|>", "event_target": "<|Speech|>", "with_or_wo_itn": "<|woitn|>", "target": "i said well you have to see that movie history of the world", "source": "/cpfs_speech/data/shared/Group-speech/beinian.lzr/data/multilingual/lcd_data/english/production/20240222/wer_0_5/taskid_19208463/wav/19208463_4_dWQ34YNU4_segment311.wav", "target_len": 13, "source_len": 260}
{"key": "wav005_0682_lATPJv8gRtSVeIDOUGfqsc5DF8yn_13_punc_itn", "text_language": "<|zh|>", "emo_target": "<|NEUTRAL|>", "event_target": "<|Speech|>", "with_or_wo_itn": "<|withitn|>", "target": "这也是我们呃。", "source": "/cpfs01/shared/Group-speech/beinian.lzr/data/industrial_data/16k_common/audio/wav005_0682_lATPJv8gRtSVeIDOUGfqsc5DF8yn_13.wav", "target_len": 7, "source_len": 140}

33
deepspeed_conf/ds_stage1.json Normal file

@ -0,0 +1,33 @@
{
"train_micro_batch_size_per_gpu": 1,
"gradient_accumulation_steps": 1,
"steps_per_print": 100,
"gradient_clipping": 5,
"fp16": {
"enabled": false,
"auto_cast": false,
"loss_scale": 0,
"initial_scale_power": 16,
"loss_scale_window": 1000,
"hysteresis": 2,
"consecutive_hysteresis": false,
"min_loss_scale": 1
},
"bf16": {
"enabled": true
},
"zero_force_ds_cpu_optimizer": false,
"zero_optimization": {
"stage": 1,
"offload_optimizer": {
"device": "none",
"pin_memory": true
},
"allgather_partitions": true,
"allgather_bucket_size": 5e8,
"overlap_comm": true,
"reduce_scatter": true,
"reduce_bucket_size": 5e8,
"contiguous_gradients" : true
}
}

19
demo.py Normal file

@ -0,0 +1,19 @@
#!/usr/bin/env python3
# -*- encoding: utf-8 -*-
# Copyright FunASR (https://github.com/alibaba-damo-academy/FunASR). All Rights Reserved.
# MIT License (https://opensource.org/licenses/MIT)
from model import SenseVoiceSmall
model_dir = "iic/SenseVoiceSmall"
m, kwargs = SenseVoiceSmall.from_pretrained(model=model_dir)
res = m.inference(
data_in="https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/asr_example_zh.wav",
language="auto", # "zn", "en", "yue", "ja", "ko", "nospeech"
use_itn=False,
**kwargs,
)
print(res)

26
demo_funasr.py Normal file

@ -0,0 +1,26 @@
#!/usr/bin/env python3
# -*- encoding: utf-8 -*-
# Copyright FunASR (https://github.com/alibaba-damo-academy/FunASR). All Rights Reserved.
# MIT License (https://opensource.org/licenses/MIT)
import sys
from funasr import AutoModel
model_dir = "iic/SenseVoiceSmall"
input_file = (
"https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/asr_example_zh.wav"
)
model = AutoModel(
model=model_dir,
trust_remote_code=True,
)
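# Note: no VAD model is configured here, so this demo is best suited to short
# clips; see the README for the fsmn-vad + batch_size_s pipeline for long audio.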
res = model.generate(
input=input_file,
cache={},
language="auto", # "zn", "en", "yue", "ja", "ko", "nospeech"
use_itn=False,
)
print(res)

101
export_meta.py Normal file

@ -0,0 +1,101 @@
#!/usr/bin/env python3
# -*- encoding: utf-8 -*-
# Copyright FunASR (https://github.com/alibaba-damo-academy/FunASR). All Rights Reserved.
# MIT License (https://opensource.org/licenses/MIT)
import types
import torch
import torch.nn as nn
from funasr.register import tables
def export_rebuild_model(model, **kwargs):
model.device = kwargs.get("device")
is_onnx = kwargs.get("type", "onnx") == "onnx"
# encoder_class = tables.encoder_classes.get(kwargs["encoder"] + "Export")
# model.encoder = encoder_class(model.encoder, onnx=is_onnx)
from funasr.utils.torch_function import sequence_mask
model.make_pad_mask = sequence_mask(kwargs["max_seq_len"], flip=False)
model.forward = types.MethodType(export_forward, model)
model.export_dummy_inputs = types.MethodType(export_dummy_inputs, model)
model.export_input_names = types.MethodType(export_input_names, model)
model.export_output_names = types.MethodType(export_output_names, model)
model.export_dynamic_axes = types.MethodType(export_dynamic_axes, model)
model.export_name = types.MethodType(export_name, model)
model.export_name = "model"
return model
def export_forward(
self,
speech: torch.Tensor,
speech_lengths: torch.Tensor,
language: torch.Tensor,
textnorm: torch.Tensor,
**kwargs,
):
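# Mirror SenseVoiceSmall.encode(): prepend the text-norm query plus the
# language / event / emotion query embeddings (4 extra frames in total)
# before running the encoder, then return frame-level CTC log-probabilities.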
speech = speech.to(device=kwargs["device"])
speech_lengths = speech_lengths.to(device=kwargs["device"])
language_query = self.embed(language).to(speech.device)
textnorm_query = self.embed(textnorm).to(speech.device)
speech = torch.cat((textnorm_query, speech), dim=1)
speech_lengths += 1
event_emo_query = self.embed(torch.LongTensor([[1, 2]]).to(speech.device)).repeat(
speech.size(0), 1, 1
)
input_query = torch.cat((language_query, event_emo_query), dim=1)
speech = torch.cat((input_query, speech), dim=1)
speech_lengths += 3
# Encoder
encoder_out, encoder_out_lens = self.encoder(speech, speech_lengths)
if isinstance(encoder_out, tuple):
encoder_out = encoder_out[0]
# Compute frame-level CTC log-probabilities from the encoder output
ctc_logits = self.ctc.log_softmax(encoder_out)
return ctc_logits, encoder_out_lens
def export_dummy_inputs(self):
speech = torch.randn(2, 30, 560)
speech_lengths = torch.tensor([6, 30], dtype=torch.int32)
language = torch.tensor([0, 0], dtype=torch.int32)
textnorm = torch.tensor([15, 15], dtype=torch.int32)
return (speech, speech_lengths, language, textnorm)
def export_input_names(self):
return ["speech", "speech_lengths", "language", "textnorm"]
def export_output_names(self):
return ["ctc_logits", "encoder_out_lens"]
def export_dynamic_axes(self):
return {
"speech": {0: "batch_size", 1: "feats_length"},
"speech_lengths": {
0: "batch_size",
},
"logits": {0: "batch_size", 1: "logits_length"},
}
def export_name(
self,
):
return "model.onnx"

70
finetune.sh Normal file

@ -0,0 +1,70 @@
# Copyright FunASR (https://github.com/alibaba-damo-academy/FunASR). All Rights Reserved.
# MIT License (https://opensource.org/licenses/MIT)
workspace=`pwd`
# which gpu to train or finetune
export CUDA_VISIBLE_DEVICES="0,1"
gpu_num=$(echo $CUDA_VISIBLE_DEVICES | awk -F "," '{print NF}')
# model_name from model_hub, or model_dir in local path
## option 1, download model automatically
model_name_or_model_dir="/Users/zhifu/Downloads/modelscope_models/SenseVoiceCTC"
## option 2, download model by git
#local_path_root=${workspace}/modelscope_models
#mkdir -p ${local_path_root}/${model_name_or_model_dir}
#git clone https://www.modelscope.cn/${model_name_or_model_dir}.git ${local_path_root}/${model_name_or_model_dir}
#model_name_or_model_dir=${local_path_root}/${model_name_or_model_dir}
# data dir, which contains: train.json, val.json
train_data=${workspace}/data/train_example.jsonl
val_data=${workspace}/data/val_example.jsonl
# exp output dir
output_dir="./outputs"
log_file="${output_dir}/log.txt"
deepspeed_config=${workspace}/deepspeed_conf/ds_stage1.json
mkdir -p ${output_dir}
echo "log_file: ${log_file}"
DISTRIBUTED_ARGS="
--nnodes ${WORLD_SIZE:-1} \
--nproc_per_node $gpu_num \
--node_rank ${RANK:-0} \
--master_addr ${MASTER_ADDR:-127.0.0.1} \
--master_port ${MASTER_PORT:-26669}
"
echo $DISTRIBUTED_ARGS
# funasr trainer path
train_tool=`dirname $(which funasr)`/train_ds.py
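# If this does not resolve to funasr/bin/train_ds.py, set train_tool to the
# absolute path of funasr/bin/train_ds.py in your FunASR checkout (see README).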
torchrun $DISTRIBUTED_ARGS \
${train_tool} \
++model="${model_name_or_model_dir}" \
++trust_remote_code=true \
++train_data_set_list="${train_data}" \
++valid_data_set_list="${val_data}" \
++dataset_conf.data_split_num=1 \
++dataset_conf.batch_sampler="BatchSampler" \
++dataset_conf.batch_size=6000 \
++dataset_conf.sort_size=1024 \
++dataset_conf.batch_type="token" \
++dataset_conf.num_workers=4 \
++train_conf.max_epoch=50 \
++train_conf.log_interval=1 \
++train_conf.resume=true \
++train_conf.validate_interval=2000 \
++train_conf.save_checkpoint_interval=2000 \
++train_conf.keep_nbest_models=20 \
++train_conf.avg_nbest_model=10 \
++train_conf.use_deepspeed=false \
++train_conf.deepspeed_config=${deepspeed_config} \
++optim_conf.lr=0.0002 \
++output_dir="${output_dir}" &> ${log_file}

BIN
image/aed_figure.png Normal file

Binary file not shown (116 KiB).

BIN
image/asr_results.png Normal file

Binary file not shown (238 KiB).

BIN
image/inference.png Normal file

Binary file not shown (470 KiB).

BIN
image/sensevoice.png Normal file

Binary file not shown (880 KiB).

BIN
image/sensevoice2.png Normal file

Binary file not shown (2.0 MiB).

BIN
image/ser_figure.png Normal file

Binary file not shown (194 KiB).

BIN
image/ser_table.png Normal file

Binary file not shown (318 KiB).

BIN
image/webui.png Normal file

Binary file not shown (1.6 MiB).

898
model.py Normal file

@ -0,0 +1,898 @@
from typing import Iterable, Optional
import types
import time
import numpy as np
import torch
import torch.nn.functional as F
from torch import Tensor
from torch import nn
from torch.cuda.amp import autocast
from funasr.metrics.compute_acc import compute_accuracy, th_accuracy
from funasr.losses.label_smoothing_loss import LabelSmoothingLoss
from funasr.train_utils.device_funcs import force_gatherable
from funasr.utils.load_utils import load_audio_text_image_video, extract_fbank
from funasr.utils.datadir_writer import DatadirWriter
from funasr.models.ctc.ctc import CTC
from funasr.register import tables
from funasr.models.paraformer.search import Hypothesis
class SinusoidalPositionEncoder(torch.nn.Module):
""" """
def __init__(self, d_model=80, dropout_rate=0.1):
super().__init__()
def encode(
self, positions: torch.Tensor = None, depth: int = None, dtype: torch.dtype = torch.float32
):
batch_size = positions.size(0)
positions = positions.type(dtype)
device = positions.device
log_timescale_increment = torch.log(torch.tensor([10000], dtype=dtype, device=device)) / (
depth / 2 - 1
)
inv_timescales = torch.exp(
torch.arange(depth / 2, device=device).type(dtype) * (-log_timescale_increment)
)
inv_timescales = torch.reshape(inv_timescales, [batch_size, -1])
scaled_time = torch.reshape(positions, [1, -1, 1]) * torch.reshape(
inv_timescales, [1, 1, -1]
)
encoding = torch.cat([torch.sin(scaled_time), torch.cos(scaled_time)], dim=2)
return encoding.type(dtype)
def forward(self, x):
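# Add fixed (non-learned) sinusoidal position encodings, computed on the fly
# for the current sequence length and feature dimension.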
batch_size, timesteps, input_dim = x.size()
positions = torch.arange(1, timesteps + 1, device=x.device)[None, :]
position_encoding = self.encode(positions, input_dim, x.dtype).to(x.device)
return x + position_encoding
class PositionwiseFeedForward(torch.nn.Module):
"""Positionwise feed forward layer.
Args:
idim (int): Input dimension.
hidden_units (int): The number of hidden units.
dropout_rate (float): Dropout rate.
"""
def __init__(self, idim, hidden_units, dropout_rate, activation=torch.nn.ReLU()):
"""Construct an PositionwiseFeedForward object."""
super(PositionwiseFeedForward, self).__init__()
self.w_1 = torch.nn.Linear(idim, hidden_units)
self.w_2 = torch.nn.Linear(hidden_units, idim)
self.dropout = torch.nn.Dropout(dropout_rate)
self.activation = activation
def forward(self, x):
"""Forward function."""
return self.w_2(self.dropout(self.activation(self.w_1(x))))
class MultiHeadedAttentionSANM(nn.Module):
"""Multi-Head Attention layer.
Args:
n_head (int): The number of heads.
n_feat (int): The number of features.
dropout_rate (float): Dropout rate.
"""
def __init__(
self,
n_head,
in_feat,
n_feat,
dropout_rate,
kernel_size,
sanm_shfit=0,
lora_list=None,
lora_rank=8,
lora_alpha=16,
lora_dropout=0.1,
):
"""Construct an MultiHeadedAttention object."""
super().__init__()
assert n_feat % n_head == 0
# We assume d_v always equals d_k
self.d_k = n_feat // n_head
self.h = n_head
# self.linear_q = nn.Linear(n_feat, n_feat)
# self.linear_k = nn.Linear(n_feat, n_feat)
# self.linear_v = nn.Linear(n_feat, n_feat)
self.linear_out = nn.Linear(n_feat, n_feat)
self.linear_q_k_v = nn.Linear(in_feat, n_feat * 3)
self.attn = None
self.dropout = nn.Dropout(p=dropout_rate)
self.fsmn_block = nn.Conv1d(
n_feat, n_feat, kernel_size, stride=1, padding=0, groups=n_feat, bias=False
)
# padding
left_padding = (kernel_size - 1) // 2
if sanm_shfit > 0:
left_padding = left_padding + sanm_shfit
right_padding = kernel_size - 1 - left_padding
self.pad_fn = nn.ConstantPad1d((left_padding, right_padding), 0.0)
def forward_fsmn(self, inputs, mask, mask_shfit_chunk=None):
b, t, d = inputs.size()
if mask is not None:
mask = torch.reshape(mask, (b, -1, 1))
if mask_shfit_chunk is not None:
mask = mask * mask_shfit_chunk
inputs = inputs * mask
x = inputs.transpose(1, 2)
x = self.pad_fn(x)
x = self.fsmn_block(x)
x = x.transpose(1, 2)
x += inputs
x = self.dropout(x)
if mask is not None:
x = x * mask
return x
def forward_qkv(self, x):
"""Transform query, key and value.
Args:
query (torch.Tensor): Query tensor (#batch, time1, size).
key (torch.Tensor): Key tensor (#batch, time2, size).
value (torch.Tensor): Value tensor (#batch, time2, size).
Returns:
torch.Tensor: Transformed query tensor (#batch, n_head, time1, d_k).
torch.Tensor: Transformed key tensor (#batch, n_head, time2, d_k).
torch.Tensor: Transformed value tensor (#batch, n_head, time2, d_k).
"""
b, t, d = x.size()
q_k_v = self.linear_q_k_v(x)
q, k, v = torch.split(q_k_v, int(self.h * self.d_k), dim=-1)
q_h = torch.reshape(q, (b, t, self.h, self.d_k)).transpose(
1, 2
) # (batch, head, time1, d_k)
k_h = torch.reshape(k, (b, t, self.h, self.d_k)).transpose(
1, 2
) # (batch, head, time2, d_k)
v_h = torch.reshape(v, (b, t, self.h, self.d_k)).transpose(
1, 2
) # (batch, head, time2, d_k)
return q_h, k_h, v_h, v
def forward_attention(self, value, scores, mask, mask_att_chunk_encoder=None):
"""Compute attention context vector.
Args:
value (torch.Tensor): Transformed value (#batch, n_head, time2, d_k).
scores (torch.Tensor): Attention score (#batch, n_head, time1, time2).
mask (torch.Tensor): Mask (#batch, 1, time2) or (#batch, time1, time2).
Returns:
torch.Tensor: Transformed value (#batch, time1, d_model)
weighted by the attention score (#batch, time1, time2).
"""
n_batch = value.size(0)
if mask is not None:
if mask_att_chunk_encoder is not None:
mask = mask * mask_att_chunk_encoder
mask = mask.unsqueeze(1).eq(0) # (batch, 1, *, time2)
min_value = -float(
"inf"
) # float(numpy.finfo(torch.tensor(0, dtype=scores.dtype).numpy().dtype).min)
scores = scores.masked_fill(mask, min_value)
self.attn = torch.softmax(scores, dim=-1).masked_fill(
mask, 0.0
) # (batch, head, time1, time2)
else:
self.attn = torch.softmax(scores, dim=-1) # (batch, head, time1, time2)
p_attn = self.dropout(self.attn)
x = torch.matmul(p_attn, value) # (batch, head, time1, d_k)
x = (
x.transpose(1, 2).contiguous().view(n_batch, -1, self.h * self.d_k)
) # (batch, time1, d_model)
return self.linear_out(x) # (batch, time1, d_model)
def forward(self, x, mask, mask_shfit_chunk=None, mask_att_chunk_encoder=None):
"""Compute scaled dot product attention.
Args:
query (torch.Tensor): Query tensor (#batch, time1, size).
key (torch.Tensor): Key tensor (#batch, time2, size).
value (torch.Tensor): Value tensor (#batch, time2, size).
mask (torch.Tensor): Mask tensor (#batch, 1, time2) or
(#batch, time1, time2).
Returns:
torch.Tensor: Output tensor (#batch, time1, d_model).
"""
q_h, k_h, v_h, v = self.forward_qkv(x)
fsmn_memory = self.forward_fsmn(v, mask, mask_shfit_chunk)
q_h = q_h * self.d_k ** (-0.5)
scores = torch.matmul(q_h, k_h.transpose(-2, -1))
att_outs = self.forward_attention(v_h, scores, mask, mask_att_chunk_encoder)
return att_outs + fsmn_memory
def forward_chunk(self, x, cache=None, chunk_size=None, look_back=0):
"""Compute scaled dot product attention.
Args:
query (torch.Tensor): Query tensor (#batch, time1, size).
key (torch.Tensor): Key tensor (#batch, time2, size).
value (torch.Tensor): Value tensor (#batch, time2, size).
mask (torch.Tensor): Mask tensor (#batch, 1, time2) or
(#batch, time1, time2).
Returns:
torch.Tensor: Output tensor (#batch, time1, d_model).
"""
q_h, k_h, v_h, v = self.forward_qkv(x)
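# Streaming chunk attention: concatenate cached keys/values from previous
# chunks with the current chunk, keeping at most `look_back` chunks of history.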
if chunk_size is not None and look_back > 0 or look_back == -1:
if cache is not None:
k_h_stride = k_h[:, :, : -(chunk_size[2]), :]
v_h_stride = v_h[:, :, : -(chunk_size[2]), :]
k_h = torch.cat((cache["k"], k_h), dim=2)
v_h = torch.cat((cache["v"], v_h), dim=2)
cache["k"] = torch.cat((cache["k"], k_h_stride), dim=2)
cache["v"] = torch.cat((cache["v"], v_h_stride), dim=2)
if look_back != -1:
cache["k"] = cache["k"][:, :, -(look_back * chunk_size[1]) :, :]
cache["v"] = cache["v"][:, :, -(look_back * chunk_size[1]) :, :]
else:
cache_tmp = {
"k": k_h[:, :, : -(chunk_size[2]), :],
"v": v_h[:, :, : -(chunk_size[2]), :],
}
cache = cache_tmp
fsmn_memory = self.forward_fsmn(v, None)
q_h = q_h * self.d_k ** (-0.5)
scores = torch.matmul(q_h, k_h.transpose(-2, -1))
att_outs = self.forward_attention(v_h, scores, None)
return att_outs + fsmn_memory, cache
class LayerNorm(nn.LayerNorm):
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
def forward(self, input):
output = F.layer_norm(
input.float(),
self.normalized_shape,
self.weight.float() if self.weight is not None else None,
self.bias.float() if self.bias is not None else None,
self.eps,
)
return output.type_as(input)
def sequence_mask(lengths, maxlen=None, dtype=torch.float32, device=None):
if maxlen is None:
maxlen = lengths.max()
row_vector = torch.arange(0, maxlen, 1).to(lengths.device)
matrix = torch.unsqueeze(lengths, dim=-1)
mask = row_vector < matrix
mask = mask.detach()
return mask.type(dtype).to(device) if device is not None else mask.type(dtype)
class EncoderLayerSANM(nn.Module):
def __init__(
self,
in_size,
size,
self_attn,
feed_forward,
dropout_rate,
normalize_before=True,
concat_after=False,
stochastic_depth_rate=0.0,
):
"""Construct an EncoderLayer object."""
super(EncoderLayerSANM, self).__init__()
self.self_attn = self_attn
self.feed_forward = feed_forward
self.norm1 = LayerNorm(in_size)
self.norm2 = LayerNorm(size)
self.dropout = nn.Dropout(dropout_rate)
self.in_size = in_size
self.size = size
self.normalize_before = normalize_before
self.concat_after = concat_after
if self.concat_after:
self.concat_linear = nn.Linear(size + size, size)
self.stochastic_depth_rate = stochastic_depth_rate
self.dropout_rate = dropout_rate
def forward(self, x, mask, cache=None, mask_shfit_chunk=None, mask_att_chunk_encoder=None):
"""Compute encoded features.
Args:
x_input (torch.Tensor): Input tensor (#batch, time, size).
mask (torch.Tensor): Mask tensor for the input (#batch, time).
cache (torch.Tensor): Cache tensor of the input (#batch, time - 1, size).
Returns:
torch.Tensor: Output tensor (#batch, time, size).
torch.Tensor: Mask tensor (#batch, time).
"""
skip_layer = False
# with stochastic depth, residual connection `x + f(x)` becomes
# `x <- x + 1 / (1 - p) * f(x)` at training time.
stoch_layer_coeff = 1.0
if self.training and self.stochastic_depth_rate > 0:
skip_layer = torch.rand(1).item() < self.stochastic_depth_rate
stoch_layer_coeff = 1.0 / (1 - self.stochastic_depth_rate)
if skip_layer:
if cache is not None:
x = torch.cat([cache, x], dim=1)
return x, mask
residual = x
if self.normalize_before:
x = self.norm1(x)
if self.concat_after:
x_concat = torch.cat(
(
x,
self.self_attn(
x,
mask,
mask_shfit_chunk=mask_shfit_chunk,
mask_att_chunk_encoder=mask_att_chunk_encoder,
),
),
dim=-1,
)
if self.in_size == self.size:
x = residual + stoch_layer_coeff * self.concat_linear(x_concat)
else:
x = stoch_layer_coeff * self.concat_linear(x_concat)
else:
if self.in_size == self.size:
x = residual + stoch_layer_coeff * self.dropout(
self.self_attn(
x,
mask,
mask_shfit_chunk=mask_shfit_chunk,
mask_att_chunk_encoder=mask_att_chunk_encoder,
)
)
else:
x = stoch_layer_coeff * self.dropout(
self.self_attn(
x,
mask,
mask_shfit_chunk=mask_shfit_chunk,
mask_att_chunk_encoder=mask_att_chunk_encoder,
)
)
if not self.normalize_before:
x = self.norm1(x)
residual = x
if self.normalize_before:
x = self.norm2(x)
x = residual + stoch_layer_coeff * self.dropout(self.feed_forward(x))
if not self.normalize_before:
x = self.norm2(x)
return x, mask, cache, mask_shfit_chunk, mask_att_chunk_encoder
def forward_chunk(self, x, cache=None, chunk_size=None, look_back=0):
"""Compute encoded features.
Args:
x_input (torch.Tensor): Input tensor (#batch, time, size).
mask (torch.Tensor): Mask tensor for the input (#batch, time).
cache (torch.Tensor): Cache tensor of the input (#batch, time - 1, size).
Returns:
torch.Tensor: Output tensor (#batch, time, size).
torch.Tensor: Mask tensor (#batch, time).
"""
residual = x
if self.normalize_before:
x = self.norm1(x)
if self.in_size == self.size:
attn, cache = self.self_attn.forward_chunk(x, cache, chunk_size, look_back)
x = residual + attn
else:
x, cache = self.self_attn.forward_chunk(x, cache, chunk_size, look_back)
if not self.normalize_before:
x = self.norm1(x)
residual = x
if self.normalize_before:
x = self.norm2(x)
x = residual + self.feed_forward(x)
if not self.normalize_before:
x = self.norm2(x)
return x, cache
@tables.register("encoder_classes", "SenseVoiceEncoderSmall")
class SenseVoiceEncoderSmall(nn.Module):
"""
Author: Speech Lab of DAMO Academy, Alibaba Group
SCAMA: Streaming chunk-aware multihead attention for online end-to-end speech recognition
https://arxiv.org/abs/2006.01713
"""
def __init__(
self,
input_size: int,
output_size: int = 256,
attention_heads: int = 4,
linear_units: int = 2048,
num_blocks: int = 6,
tp_blocks: int = 0,
dropout_rate: float = 0.1,
positional_dropout_rate: float = 0.1,
attention_dropout_rate: float = 0.0,
stochastic_depth_rate: float = 0.0,
input_layer: Optional[str] = "conv2d",
pos_enc_class=SinusoidalPositionEncoder,
normalize_before: bool = True,
concat_after: bool = False,
positionwise_layer_type: str = "linear",
positionwise_conv_kernel_size: int = 1,
padding_idx: int = -1,
kernel_size: int = 11,
sanm_shfit: int = 0,
selfattention_layer_type: str = "sanm",
**kwargs,
):
super().__init__()
self._output_size = output_size
self.embed = SinusoidalPositionEncoder()
self.normalize_before = normalize_before
positionwise_layer = PositionwiseFeedForward
positionwise_layer_args = (
output_size,
linear_units,
dropout_rate,
)
encoder_selfattn_layer = MultiHeadedAttentionSANM
encoder_selfattn_layer_args0 = (
attention_heads,
input_size,
output_size,
attention_dropout_rate,
kernel_size,
sanm_shfit,
)
encoder_selfattn_layer_args = (
attention_heads,
output_size,
output_size,
attention_dropout_rate,
kernel_size,
sanm_shfit,
)
self.encoders0 = nn.ModuleList(
[
EncoderLayerSANM(
input_size,
output_size,
encoder_selfattn_layer(*encoder_selfattn_layer_args0),
positionwise_layer(*positionwise_layer_args),
dropout_rate,
)
for i in range(1)
]
)
self.encoders = nn.ModuleList(
[
EncoderLayerSANM(
output_size,
output_size,
encoder_selfattn_layer(*encoder_selfattn_layer_args),
positionwise_layer(*positionwise_layer_args),
dropout_rate,
)
for i in range(num_blocks - 1)
]
)
self.tp_encoders = nn.ModuleList(
[
EncoderLayerSANM(
output_size,
output_size,
encoder_selfattn_layer(*encoder_selfattn_layer_args),
positionwise_layer(*positionwise_layer_args),
dropout_rate,
)
for i in range(tp_blocks)
]
)
self.after_norm = LayerNorm(output_size)
self.tp_norm = LayerNorm(output_size)
def output_size(self) -> int:
return self._output_size
def forward(
self,
xs_pad: torch.Tensor,
ilens: torch.Tensor,
):
"""Embed positions in tensor."""
masks = sequence_mask(ilens, device=ilens.device)[:, None, :]
xs_pad *= self.output_size() ** 0.5
xs_pad = self.embed(xs_pad)
# forward encoder1
for layer_idx, encoder_layer in enumerate(self.encoders0):
encoder_outs = encoder_layer(xs_pad, masks)
xs_pad, masks = encoder_outs[0], encoder_outs[1]
for layer_idx, encoder_layer in enumerate(self.encoders):
encoder_outs = encoder_layer(xs_pad, masks)
xs_pad, masks = encoder_outs[0], encoder_outs[1]
xs_pad = self.after_norm(xs_pad)
# forward encoder2
olens = masks.squeeze(1).sum(1).int()
for layer_idx, encoder_layer in enumerate(self.tp_encoders):
encoder_outs = encoder_layer(xs_pad, masks)
xs_pad, masks = encoder_outs[0], encoder_outs[1]
xs_pad = self.tp_norm(xs_pad)
return xs_pad, olens
@tables.register("model_classes", "SenseVoiceSmall")
class SenseVoiceSmall(nn.Module):
"""CTC-attention hybrid Encoder-Decoder model"""
def __init__(
self,
specaug: str = None,
specaug_conf: dict = None,
normalize: str = None,
normalize_conf: dict = None,
encoder: str = None,
encoder_conf: dict = None,
ctc_conf: dict = None,
input_size: int = 80,
vocab_size: int = -1,
ignore_id: int = -1,
blank_id: int = 0,
sos: int = 1,
eos: int = 2,
length_normalized_loss: bool = False,
**kwargs,
):
super().__init__()
if specaug is not None:
specaug_class = tables.specaug_classes.get(specaug)
specaug = specaug_class(**specaug_conf)
if normalize is not None:
normalize_class = tables.normalize_classes.get(normalize)
normalize = normalize_class(**normalize_conf)
encoder_class = tables.encoder_classes.get(encoder)
encoder = encoder_class(input_size=input_size, **encoder_conf)
encoder_output_size = encoder.output_size()
if ctc_conf is None:
ctc_conf = {}
ctc = CTC(odim=vocab_size, encoder_output_size=encoder_output_size, **ctc_conf)
self.blank_id = blank_id
self.sos = sos if sos is not None else vocab_size - 1
self.eos = eos if eos is not None else vocab_size - 1
self.vocab_size = vocab_size
self.ignore_id = ignore_id
self.specaug = specaug
self.normalize = normalize
self.encoder = encoder
self.error_calculator = None
self.ctc = ctc
self.length_normalized_loss = length_normalized_loss
self.encoder_output_size = encoder_output_size
self.lid_dict = {"auto": 0, "zh": 3, "en": 4, "yue": 7, "ja": 11, "ko": 12, "nospeech": 13}
self.lid_int_dict = {24884: 3, 24885: 4, 24888: 7, 24892: 11, 24896: 12, 24992: 13}
self.textnorm_dict = {"withitn": 14, "woitn": 15}
self.textnorm_int_dict = {25016: 14, 25017: 15}
self.embed = torch.nn.Embedding(7 + len(self.lid_dict) + len(self.textnorm_dict), input_size)
self.criterion_att = LabelSmoothingLoss(
size=self.vocab_size,
padding_idx=self.ignore_id,
smoothing=kwargs.get("lsm_weight", 0.0),
normalize_length=self.length_normalized_loss,
)
@staticmethod
def from_pretrained(model:str=None, **kwargs):
from funasr import AutoModel
model, kwargs = AutoModel.build_model(model=model, trust_remote_code=True, **kwargs)
return model, kwargs
def forward(
self,
speech: torch.Tensor,
speech_lengths: torch.Tensor,
text: torch.Tensor,
text_lengths: torch.Tensor,
**kwargs,
):
"""Encoder + Decoder + Calc loss
Args:
speech: (Batch, Length, ...)
speech_lengths: (Batch, )
text: (Batch, Length)
text_lengths: (Batch,)
"""
# import pdb;
# pdb.set_trace()
if len(text_lengths.size()) > 1:
text_lengths = text_lengths[:, 0]
if len(speech_lengths.size()) > 1:
speech_lengths = speech_lengths[:, 0]
batch_size = speech.shape[0]
# 1. Encoder
encoder_out, encoder_out_lens = self.encode(speech, speech_lengths, text)
loss_ctc, cer_ctc = None, None
loss_rich, acc_rich = None, None
stats = dict()
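# The first 4 target tokens (and the corresponding 4 encoder frames) carry the
# language, event, emotion, and text-norm labels: CTC loss is computed on the
# remaining speech positions, and a CE ("rich") loss on the 4 prefix positions.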
loss_ctc, cer_ctc = self._calc_ctc_loss(
encoder_out[:, 4:, :], encoder_out_lens - 4, text[:, 4:], text_lengths - 4
)
loss_rich, acc_rich = self._calc_rich_ce_loss(
encoder_out[:, :4, :], text[:, :4]
)
loss = loss_ctc
# Collect total loss stats
stats["loss"] = torch.clone(loss.detach()) if loss_ctc is not None else None
stats["loss_rich"] = torch.clone(loss_rich.detach()) if loss_rich is not None else None
stats["acc_rich"] = acc_rich
# force_gatherable: to-device and to-tensor if scalar for DataParallel
if self.length_normalized_loss:
batch_size = int((text_lengths + 1).sum())
loss, stats, weight = force_gatherable((loss, stats, batch_size), loss.device)
return loss, stats, weight
def encode(
self,
speech: torch.Tensor,
speech_lengths: torch.Tensor,
text: torch.Tensor,
**kwargs,
):
"""Frontend + Encoder. Note that this method is used by asr_inference.py
Args:
speech: (Batch, Length, ...)
speech_lengths: (Batch, )
ind: int
"""
# Data augmentation
if self.specaug is not None and self.training:
speech, speech_lengths = self.specaug(speech, speech_lengths)
# Normalization for feature: e.g. Global-CMVN, Utterance-CMVN
if self.normalize is not None:
speech, speech_lengths = self.normalize(speech, speech_lengths)
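# Prepend prompt embeddings: a language-id query (randomly replaced by the
# "auto"/0 id about 20% of the time during training), event and emotion
# placeholder queries, and a text-normalization style query, which together
# add 4 frames in front of the acoustic features.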
lids = torch.LongTensor([[self.lid_int_dict[int(lid)] if torch.rand(1) > 0.2 and int(lid) in self.lid_int_dict else 0 ] for lid in text[:, 0]]).to(speech.device)
language_query = self.embed(lids)
styles = torch.LongTensor([[self.textnorm_int_dict[int(style)]] for style in text[:, 3]]).to(speech.device)
style_query = self.embed(styles)
speech = torch.cat((style_query, speech), dim=1)
speech_lengths += 1
event_emo_query = self.embed(torch.LongTensor([[1, 2]]).to(speech.device)).repeat(speech.size(0), 1, 1)
input_query = torch.cat((language_query, event_emo_query), dim=1)
speech = torch.cat((input_query, speech), dim=1)
speech_lengths += 3
encoder_out, encoder_out_lens = self.encoder(speech, speech_lengths)
return encoder_out, encoder_out_lens
def _calc_ctc_loss(
self,
encoder_out: torch.Tensor,
encoder_out_lens: torch.Tensor,
ys_pad: torch.Tensor,
ys_pad_lens: torch.Tensor,
):
# Calc CTC loss
loss_ctc = self.ctc(encoder_out, encoder_out_lens, ys_pad, ys_pad_lens)
# Calc CER using CTC
cer_ctc = None
if not self.training and self.error_calculator is not None:
ys_hat = self.ctc.argmax(encoder_out).data
cer_ctc = self.error_calculator(ys_hat.cpu(), ys_pad.cpu(), is_ctc=True)
return loss_ctc, cer_ctc
def _calc_rich_ce_loss(
self,
encoder_out: torch.Tensor,
ys_pad: torch.Tensor,
):
decoder_out = self.ctc.ctc_lo(encoder_out)
# 2. Compute attention loss
loss_rich = self.criterion_att(decoder_out, ys_pad.contiguous())
acc_rich = th_accuracy(
decoder_out.view(-1, self.vocab_size),
ys_pad.contiguous(),
ignore_label=self.ignore_id,
)
return loss_rich, acc_rich
def inference(
self,
data_in,
data_lengths=None,
key: list = ["wav_file_tmp_name"],
tokenizer=None,
frontend=None,
**kwargs,
):
meta_data = {}
if (
isinstance(data_in, torch.Tensor) and kwargs.get("data_type", "sound") == "fbank"
): # fbank
speech, speech_lengths = data_in, data_lengths
if len(speech.shape) < 3:
speech = speech[None, :, :]
if speech_lengths is None:
speech_lengths = speech.shape[1]
else:
# extract fbank feats
time1 = time.perf_counter()
audio_sample_list = load_audio_text_image_video(
data_in,
fs=frontend.fs,
audio_fs=kwargs.get("fs", 16000),
data_type=kwargs.get("data_type", "sound"),
tokenizer=tokenizer,
)
time2 = time.perf_counter()
meta_data["load_data"] = f"{time2 - time1:0.3f}"
speech, speech_lengths = extract_fbank(
audio_sample_list, data_type=kwargs.get("data_type", "sound"), frontend=frontend
)
time3 = time.perf_counter()
meta_data["extract_feat"] = f"{time3 - time2:0.3f}"
meta_data["batch_data_time"] = (
speech_lengths.sum().item() * frontend.frame_shift * frontend.lfr_n / 1000
)
speech = speech.to(device=kwargs["device"])
speech_lengths = speech_lengths.to(device=kwargs["device"])
language = kwargs.get("language", "auto")
language_query = self.embed(
torch.LongTensor(
[[self.lid_dict[language] if language in self.lid_dict else 0]]
).to(speech.device)
).repeat(speech.size(0), 1, 1)
use_itn = kwargs.get("use_itn", False)
textnorm = kwargs.get("text_norm", None)
if textnorm is None:
textnorm = "withitn" if use_itn else "woitn"
textnorm_query = self.embed(
torch.LongTensor([[self.textnorm_dict[textnorm]]]).to(speech.device)
).repeat(speech.size(0), 1, 1)
speech = torch.cat((textnorm_query, speech), dim=1)
speech_lengths += 1
event_emo_query = self.embed(torch.LongTensor([[1, 2]]).to(speech.device)).repeat(
speech.size(0), 1, 1
)
input_query = torch.cat((language_query, event_emo_query), dim=1)
speech = torch.cat((input_query, speech), dim=1)
speech_lengths += 3
# Encoder
encoder_out, encoder_out_lens = self.encoder(speech, speech_lengths)
if isinstance(encoder_out, tuple):
encoder_out = encoder_out[0]
# Compute frame-level CTC log-probabilities from the encoder output
ctc_logits = self.ctc.log_softmax(encoder_out)
results = []
b, n, d = encoder_out.size()
if isinstance(key[0], (list, tuple)):
key = key[0]
if len(key) < b:
key = key * b
for i in range(b):
x = ctc_logits[i, : encoder_out_lens[i].item(), :]
yseq = x.argmax(dim=-1)
yseq = torch.unique_consecutive(yseq, dim=-1)
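# Greedy CTC decoding: per-frame argmax, collapse consecutive repeats (above),
# then drop blank tokens below before converting ids back to text.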
ibest_writer = None
if kwargs.get("output_dir") is not None:
if not hasattr(self, "writer"):
self.writer = DatadirWriter(kwargs.get("output_dir"))
ibest_writer = self.writer[f"1best_recog"]
mask = yseq != self.blank_id
token_int = yseq[mask].tolist()
# Change integer-ids to tokens
text = tokenizer.decode(token_int)
result_i = {"key": key[i], "text": text}
results.append(result_i)
if ibest_writer is not None:
ibest_writer["text"][key[i]] = text
return results, meta_data
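# export() defers to export_rebuild_model, which appears to rebuild the module
# graph in an export-friendly form; max_seq_len (default 512) bounds the
# sequence length assumed by the exported model.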
def export(self, **kwargs):
from .export_meta import export_rebuild_model
if "max_seq_len" not in kwargs:
kwargs["max_seq_len"] = 512
models = export_rebuild_model(model=self, **kwargs)
return models

7
requirements.txt Normal file
View File

@ -0,0 +1,7 @@
python>=3.8
torch>=1.13
torchaudio
modelscope
huggingface
huggingface_hub
funasr

249
webui.py Normal file
View File

@ -0,0 +1,249 @@
# coding=utf-8
import os
import librosa
import base64
import io
import gradio as gr
import re
import numpy as np
import torch
import torchaudio
from funasr import AutoModel
model = "iic/SenseVoiceSmall"
model = AutoModel(model=model,
vad_model="iic/speech_fsmn_vad_zh-cn-16k-common-pytorch",
vad_kwargs={"max_single_segment_time": 30000},
trust_remote_code=True,
)
emo_dict = {
"<|HAPPY|>": "😊",
"<|SAD|>": "😔",
"<|ANGRY|>": "😡",
"<|NEUTRAL|>": "",
"<|FEARFUL|>": "😰",
"<|DISGUSTED|>": "🤢",
"<|SURPRISED|>": "😮",
}
event_dict = {
"<|BGM|>": "🎼",
"<|Speech|>": "",
"<|Applause|>": "👏",
"<|Laughter|>": "😀",
"<|Cry|>": "😭",
"<|Sneeze|>": "🤧",
"<|Breath|>": "",
"<|Cough|>": "🤧",
}
emoji_dict = {
"<|nospeech|><|Event_UNK|>": "",
"<|zh|>": "",
"<|en|>": "",
"<|yue|>": "",
"<|ja|>": "",
"<|ko|>": "",
"<|nospeech|>": "",
"<|HAPPY|>": "😊",
"<|SAD|>": "😔",
"<|ANGRY|>": "😡",
"<|NEUTRAL|>": "",
"<|BGM|>": "🎼",
"<|Speech|>": "",
"<|Applause|>": "👏",
"<|Laughter|>": "😀",
"<|FEARFUL|>": "😰",
"<|DISGUSTED|>": "🤢",
"<|SURPRISED|>": "😮",
"<|Cry|>": "😭",
"<|EMO_UNKNOWN|>": "",
"<|Sneeze|>": "🤧",
"<|Breath|>": "",
"<|Cough|>": "😷",
"<|Sing|>": "",
"<|Speech_Noise|>": "",
"<|withitn|>": "",
"<|woitn|>": "",
"<|GBG|>": "",
"<|Event_UNK|>": "",
}
lang_dict = {
"<|zh|>": "<|lang|>",
"<|en|>": "<|lang|>",
"<|yue|>": "<|lang|>",
"<|ja|>": "<|lang|>",
"<|ko|>": "<|lang|>",
"<|nospeech|>": "<|lang|>",
}
emo_set = {"😊", "😔", "😡", "😰", "🤢", "😮"}
event_set = {"🎼", "👏", "😀", "😭", "🤧", "😷",}
def format_str(s):
for sptk in emoji_dict:
s = s.replace(sptk, emoji_dict[sptk])
return s
def format_str_v2(s):
sptk_dict = {}
for sptk in emoji_dict:
sptk_dict[sptk] = s.count(sptk)
s = s.replace(sptk, "")
emo = "<|NEUTRAL|>"
for e in emo_dict:
if sptk_dict[e] > sptk_dict[emo]:
emo = e
for e in event_dict:
if sptk_dict[e] > 0:
s = event_dict[e] + s
s = s + emo_dict[emo]
for emoji in emo_set.union(event_set):
s = s.replace(" " + emoji, emoji)
s = s.replace(emoji + " ", emoji)
return s.strip()
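# format_str_v2: count every special token, keep the most frequent emotion,
# prefix one emoji per detected event, append the emotion emoji, and strip the
# spaces left around emojis after token removal.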
def format_str_v3(s):
def get_emo(s):
return s[-1] if s[-1] in emo_set else None
def get_event(s):
return s[0] if s[0] in event_set else None
s = s.replace("<|nospeech|><|Event_UNK|>", "")
for lang in lang_dict:
s = s.replace(lang, "<|lang|>")
s_list = [format_str_v2(s_i).strip(" ") for s_i in s.split("<|lang|>")]
new_s = " " + s_list[0]
cur_ent_event = get_event(new_s)
for i in range(1, len(s_list)):
if len(s_list[i]) == 0:
continue
if get_event(s_list[i]) == cur_ent_event and get_event(s_list[i]) is not None:
s_list[i] = s_list[i][1:]
#else:
cur_ent_event = get_event(s_list[i])
if get_emo(s_list[i]) is not None and get_emo(s_list[i]) == get_emo(new_s):
new_s = new_s[:-1]
new_s += s_list[i].strip().lstrip()
new_s = new_s.replace("The.", " ")
return new_s.strip()
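# format_str_v3: split the transcript on language tags, clean each segment with
# format_str_v2, then merge the segments while dropping an event or emotion
# emoji that merely repeats the one from the neighbouring segment.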
def model_inference(input_wav, language, fs=16000):
# task_abbr = {"Speech Recognition": "ASR", "Rich Text Transcription": ("ASR", "AED", "SER")}
language_abbr = {"auto": "auto", "zh": "zh", "en": "en", "yue": "yue", "ja": "ja", "ko": "ko",
"nospeech": "nospeech"}
# task = "Speech Recognition" if task is None else task
language = "auto" if len(language) < 1 else language
selected_language = language_abbr[language]
# selected_task = task_abbr.get(task)
# print(f"input_wav: {type(input_wav)}, {input_wav[1].shape}, {input_wav}")
if isinstance(input_wav, tuple):
fs, input_wav = input_wav
input_wav = input_wav.astype(np.float32) / np.iinfo(np.int16).max
if len(input_wav.shape) > 1:
input_wav = input_wav.mean(-1)
if fs != 16000:
print(f"audio_fs: {fs}")
resampler = torchaudio.transforms.Resample(fs, 16000)
input_wav_t = torch.from_numpy(input_wav).to(torch.float32)
input_wav = resampler(input_wav_t[None, :])[0, :].numpy()
merge_vad = True #False if selected_task == "ASR" else True
print(f"language: {language}, merge_vad: {merge_vad}")
text = model.generate(input=input_wav,
cache={},
language=language,
use_itn=True,
batch_size_s=0, merge_vad=merge_vad)
print(text)
text = text[0]["text"]
text = format_str_v3(text)
print(text)
return text
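# Preprocessing above: int16 PCM from Gradio is scaled to float32 in [-1, 1],
# multi-channel audio is down-mixed to mono, and anything not at 16 kHz is
# resampled before calling model.generate with VAD-based segment merging.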
audio_examples = [
["example/zh.mp3", "zh"],
["example/yue.mp3", "yue"],
["example/en.mp3", "en"],
["example/ja.mp3", "ja"],
["example/ko.mp3", "ko"],
["example/emo_1.wav", "auto"],
["example/emo_2.wav", "auto"],
["example/emo_3.wav", "auto"],
#["example/emo_4.wav", "auto"],
#["example/event_1.wav", "auto"],
#["example/event_2.wav", "auto"],
#["example/event_3.wav", "auto"],
["example/rich_1.wav", "auto"],
["example/rich_2.wav", "auto"],
#["example/rich_3.wav", "auto"],
["example/longwav_1.wav", "auto"],
["example/longwav_2.wav", "auto"],
["example/longwav_3.wav", "auto"],
#["example/longwav_4.wav", "auto"],
]
html_content = """
<div>
<h2 style="font-size: 22px;margin-left: 0px;">Voice Understanding Model: SenseVoice-Small</h2>
<p style="font-size: 18px;margin-left: 20px;">SenseVoice-Small is an encoder-only speech foundation model designed for rapid voice understanding. It encompasses a variety of features including automatic speech recognition (ASR), spoken language identification (LID), speech emotion recognition (SER), and acoustic event detection (AED). SenseVoice-Small supports multilingual recognition for Chinese, English, Cantonese, Japanese, and Korean. Additionally, it offers exceptionally low inference latency, performing 7 times faster than Whisper-small and 17 times faster than Whisper-large.</p>
<h2 style="font-size: 22px;margin-left: 0px;">Usage</h2> <p style="font-size: 18px;margin-left: 20px;">Upload an audio file or input through a microphone, then select the task and language. the audio is transcribed into corresponding text along with associated emotions (😊 happy, 😡 angry/exicting, 😔 sad) and types of sound events (😀 laughter, 🎼 music, 👏 applause, 🤧 cough&sneeze, 😭 cry). The event labels are placed in the front of the text and the emotion are in the back of the text.</p>
<p style="font-size: 18px;margin-left: 20px;">Recommended audio input duration is below 30 seconds. For audio longer than 30 seconds, local deployment is recommended.</p>
<h2 style="font-size: 22px;margin-left: 0px;">Repo</h2>
<p style="font-size: 18px;margin-left: 20px;"><a href="https://github.com/FunAudioLLM/SenseVoice" target="_blank">SenseVoice</a>: multilingual speech understanding model</p>
<p style="font-size: 18px;margin-left: 20px;"><a href="https://github.com/modelscope/FunASR" target="_blank">FunASR</a>: fundamental speech recognition toolkit</p>
<p style="font-size: 18px;margin-left: 20px;"><a href="https://github.com/modelscope/CosyVoice" target="_blank">CosyVoice</a>: high-quality multilingual TTS model</p>
</div>
"""
def launch():
with gr.Blocks(theme=gr.themes.Soft()) as demo:
# gr.Markdown(description)
gr.HTML(html_content)
with gr.Row():
with gr.Column():
audio_inputs = gr.Audio(label="Upload audio or use the microphone")
with gr.Accordion("Configuration"):
# task_inputs = gr.Radio(choices=["Speech Recognition", "Rich Text Transcription"],
# value="Speech Recognition", label="Task")
language_inputs = gr.Dropdown(choices=["auto", "zh", "en", "yue", "ja", "ko", "nospeech"],
value="auto",
label="Language")
fn_button = gr.Button("Start", variant="primary")
text_outputs = gr.Textbox(label="Results")
gr.Examples(examples=audio_examples, inputs=[audio_inputs, language_inputs], examples_per_page=20)
fn_button.click(model_inference, inputs=[audio_inputs, language_inputs], outputs=text_outputs)
# with gr.Accordion("More examples"):
# gr.HTML(centered_table_html)
demo.launch()
if __name__ == "__main__":
# iface.launch()
launch()
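# A minimal sketch of calling the same model without the Gradio UI; the helper
# name and defaults are illustrative only and are not used by the web app.
def transcribe_file(wav_path: str, language: str = "auto") -> str:
    # librosa loads and resamples to 16 kHz mono, matching model_inference above.
    audio, _ = librosa.load(wav_path, sr=16000, mono=True)
    res = model.generate(input=audio,
                         cache={},
                         language=language,
                         use_itn=True,
                         batch_size_s=0,
                         merge_vad=True)
    return format_str_v3(res[0]["text"])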