Dev gzf deepspeed (#1858)

* total_time/accum_grad

* fp16

* update with main (#1817)

* add cmakelist

* add paraformer-torch

* add debug for funasr-onnx-offline

* fix redefinition of jieba StdExtension.hpp

* add loading torch models

* update funasr-onnx-offline

* add SwitchArg for wss-server

* add SwitchArg for funasr-onnx-offline

* update cmakelist

* update funasr-onnx-offline-rtf

* add define condition

* add gpu define for offlne-stream

* update com define

* update offline-stream

* update cmakelist

* update func CompileHotwordEmbedding

* add timestamp for paraformer-torch

* add C10_USE_GLOG for paraformer-torch

* update paraformer-torch

* fix func FunASRWfstDecoderInit

* update model.h

* fix func FunASRWfstDecoderInit

* fix tpass_stream

* update paraformer-torch

* add bladedisc for funasr-onnx-offline

* update comdefine

* update funasr-wss-server

* add log for torch

* fix GetValue BLADEDISC

* fix log

* update cmakelist

* update warmup to 10

* update funasrruntime

* add batch_size for wss-server

* add batch for bins

* add batch for offline-stream

* add batch for paraformer

* add batch for offline-stream

* fix func SetBatchSize

* add SetBatchSize for model

* add SetBatchSize for model

* fix func Forward

* fix padding

* update funasrruntime

* add dec reset for batch

* set batch default value

* add argv for CutSplit

* sort frame_queue

* sorted msgs

* fix FunOfflineInfer

* add dynamic batch for fetch

* fix FetchDynamic

* update run_server.sh

* update run_server.sh

* cpp http post server support (#1739)

* add cpp http server

* add some comment

* remove some comments

* del debug infos

* restore run_server.sh

* adapt to new model struct

* 修复了onnxruntime在macos下编译失败的错误 (#1748)

* Add files via upload

增加macos的编译支持

* Add files via upload

增加macos支持

* Add files via upload

target_link_directories(funasr PUBLIC ${ONNXRUNTIME_DIR}/lib)
target_link_directories(funasr PUBLIC ${FFMPEG_DIR}/lib)
添加 if(APPLE) 限制

---------

Co-authored-by: Yabin Li <wucong.lyb@alibaba-inc.com>

* Delete docs/images/wechat.png

* Add files via upload

* fixed the issues about seaco-onnx timestamp

* fix bug (#1764)

当语音识别结果包含 `http` 时,标点符号预测会把它会被当成 url

* fix empty asr result (#1765)

解码结果为空的语音片段,text 用空字符串

* update export

* update export

* docs

* docs

* update export name

* docs

* update

* docs

* docs

* keep empty speech result (#1772)

* docs

* docs

* update wechat QRcode

* Add python funasr api support for websocket srv (#1777)

* add python funasr_api supoort

* change little to README.md

* add core tools stream

* modified a little

* fix bug for timeout

* support for buffer decode

* add ffmpeg decode for buffer

* libtorch demo

* update libtorch infer

* update utils

* update demo

* update demo

* update libtorch inference

* update model class

* update seaco paraformer

* bug fix

* bug fix

* auto frontend

* auto frontend

* auto frontend

* auto frontend

* auto frontend

* auto frontend

* auto frontend

* auto frontend

* Dev gzf exp (#1785)

* resume from step

* batch

* batch

* batch

* batch

* batch

* batch

* batch

* batch

* batch

* batch

* batch

* batch

* batch

* batch

* batch

* train_loss_avg train_acc_avg

* train_loss_avg train_acc_avg

* train_loss_avg train_acc_avg

* log step

* wav is not exist

* wav is not exist

* decoding

* decoding

* decoding

* wechat

* decoding key

* decoding key

* decoding key

* decoding key

* decoding key

* decoding key

* dynamic batch

* start_data_split_i=0

* total_time/accum_grad

* total_time/accum_grad

* total_time/accum_grad

* update avg slice

* update avg slice

* sensevoice sanm

* sensevoice sanm

* sensevoice sanm

---------

Co-authored-by: 北念 <lzr265946@alibaba-inc.com>

* auto frontend

* update paraformer timestamp

* [Optimization] support bladedisc fp16 optimization (#1790)

* add cif_v1 and cif_export

* Update SDK_advanced_guide_offline_zh.md

* add cif_wo_hidden_v1

* [fix] fix empty asr result (#1794)

* english timestamp for valilla paraformer

* wechat

* [fix] better solution for handling empty result (#1796)

* update scripts

* modify the qformer adaptor (#1804)

Co-authored-by: nichongjia-2007 <nichongjia@gmail.com>

* add ctc inference code (#1806)

Co-authored-by: haoneng.lhn <haoneng.lhn@alibaba-inc.com>

* Update auto_model.py

修复空字串进入speaker model时报raw_text变量不存在的bug

* Update auto_model.py

修复识别出空串后spk_model内变量未定义问题

* update model name

* fix paramter 'quantize' unused issue (#1813)

Co-authored-by: ZihanLiao <liaozihan1@xdf.cn>

* wechat

* Update cif_predictor.py (#1811)

* Update cif_predictor.py

* modify cif_v1_export

under extreme cases, max_label_len calculated by batch_len misaligns with token_num

* Update cif_predictor.py

torch.cumsum precision degradation, using float64 instead

* update code

---------

Co-authored-by: 雾聪 <wucong.lyb@alibaba-inc.com>
Co-authored-by: zhaomingwork <61895407+zhaomingwork@users.noreply.github.com>
Co-authored-by: szsteven008 <97944818+szsteven008@users.noreply.github.com>
Co-authored-by: Ephemeroptera <605686962@qq.com>
Co-authored-by: 彭震东 <zhendong.peng@qq.com>
Co-authored-by: Shi Xian <40013335+R1ckShi@users.noreply.github.com>
Co-authored-by: 维石 <shixian.shi@alibaba-inc.com>
Co-authored-by: 北念 <lzr265946@alibaba-inc.com>
Co-authored-by: xiaowan0322 <wanchen.swc@alibaba-inc.com>
Co-authored-by: zhuangzhong <zhuangzhong@corp.netease.com>
Co-authored-by: Xingchen Song(宋星辰) <xingchensong1996@163.com>
Co-authored-by: nichongjia-2007 <nichongjia@gmail.com>
Co-authored-by: haoneng.lhn <haoneng.lhn@alibaba-inc.com>
Co-authored-by: liugz18 <57401541+liugz18@users.noreply.github.com>
Co-authored-by: Marlowe <54339989+ZihanLiao@users.noreply.github.com>
Co-authored-by: ZihanLiao <liaozihan1@xdf.cn>
Co-authored-by: zhong zhuang <zhuangz@lamda.nju.edu.cn>

* sensevoice

* sensevoice

* sensevoice

* sensevoice

* sensevoice

* sensevoice

* sensevoice

* sensevoice

* sensevoice

* sensevoice

* sensevoice

* sensevoice

* sensevoice

* v1.0.28 (#1836)

* sensevoice

* sensevoice

* sensevoice

* sensevoice

* sensevoice

* update (#1841)

* v1.0.28

* version checker

* version checker

* rollback cif_v1 for training bug

* fixbug

* fixbug for cif

* fixbug

---------

Co-authored-by: 维石 <shixian.shi@alibaba-inc.com>

* update (#1842)

* v1.0.28

* version checker

* version checker

* rollback cif_v1 for training bug

* fixbug

* fixbug for cif

* fixbug

---------

Co-authored-by: 维石 <shixian.shi@alibaba-inc.com>

* inference

* inference

* inference

* requests

* finetune

* finetune

* finetune

* finetune

* finetune

* add inference prepare func (#1848)

* docs

* docs

* docs

* docs

* docs

---------

Co-authored-by: 雾聪 <wucong.lyb@alibaba-inc.com>
Co-authored-by: zhaomingwork <61895407+zhaomingwork@users.noreply.github.com>
Co-authored-by: szsteven008 <97944818+szsteven008@users.noreply.github.com>
Co-authored-by: Ephemeroptera <605686962@qq.com>
Co-authored-by: 彭震东 <zhendong.peng@qq.com>
Co-authored-by: Shi Xian <40013335+R1ckShi@users.noreply.github.com>
Co-authored-by: 维石 <shixian.shi@alibaba-inc.com>
Co-authored-by: 北念 <lzr265946@alibaba-inc.com>
Co-authored-by: xiaowan0322 <wanchen.swc@alibaba-inc.com>
Co-authored-by: zhuangzhong <zhuangzhong@corp.netease.com>
Co-authored-by: Xingchen Song(宋星辰) <xingchensong1996@163.com>
Co-authored-by: nichongjia-2007 <nichongjia@gmail.com>
Co-authored-by: haoneng.lhn <haoneng.lhn@alibaba-inc.com>
Co-authored-by: liugz18 <57401541+liugz18@users.noreply.github.com>
Co-authored-by: Marlowe <54339989+ZihanLiao@users.noreply.github.com>
Co-authored-by: ZihanLiao <liaozihan1@xdf.cn>
Co-authored-by: zhong zhuang <zhuangz@lamda.nju.edu.cn>
Co-authored-by: PerfeZ <90945395+PerfeZ@users.noreply.github.com>
This commit is contained in:
zhifu gao 2024-06-28 17:28:09 +08:00 committed by GitHub
parent e78d649ddb
commit 8c87a9d8a7
No known key found for this signature in database
GPG Key ID: B5690EEEBB952194
33 changed files with 333 additions and 1207 deletions

Binary file not shown.

Before

Width:  |  Height:  |  Size: 186 KiB

After

Width:  |  Height:  |  Size: 176 KiB

View File

@ -47,7 +47,7 @@ log_file="${output_dir}/log.txt"
mkdir -p ${output_dir}
echo "log_file: ${log_file}"
deepspeed_config=${workspace}../../ds_stage1.json
deepspeed_config=${workspace}/../../ds_stage1.json
DISTRIBUTED_ARGS="
--nnodes ${WORLD_SIZE:-1} \

View File

@ -48,7 +48,7 @@ log_file="${output_dir}/log.txt"
mkdir -p ${output_dir}
echo "log_file: ${log_file}"
deepspeed_config=${workspace}../../ds_stage1.json
deepspeed_config=${workspace}/../../ds_stage1.json
DISTRIBUTED_ARGS="
--nnodes ${WORLD_SIZE:-1} \

View File

@ -0,0 +1,139 @@
# coding=utf-8
import librosa
import base64
import io
import gradio as gr
import re
import numpy as np
import torch
import torchaudio
# from modelscope import HubApi
#
# api = HubApi()
#
# api.login('')
from funasr import AutoModel
# model = "/Users/zhifu/Downloads/modelscope_models/SenseVoiceCTC"
# model = "iic/SenseVoiceCTC"
# model = AutoModel(model=model,
# vad_model="iic/speech_fsmn_vad_zh-cn-16k-common-pytorch",
# vad_kwargs={"max_single_segment_time": 30000},
# trust_remote_code=True,
# )
import re
import os
import sys
if len(sys.argv) > 1:
ckpt_dir = sys.argv[1]
ckpt_id = sys.argv[2]
jsonl = sys.argv[3]
output_dir = sys.argv[4]
device = sys.argv[5]
new_sys = False
if len(sys.argv) > 6:
new_sys = True
else:
ckpt_dir = "/nfs/beinian.lzr/workspace/GPT-4o/Exp/exp7/5m-8gpu/exp5-1-0619"
ckpt_id = "model.pt.ep6"
jsonl = (
"/nfs/beinian.lzr/workspace/GPT-4o/Data/Speech2Text/TestData/s2tchat.v20240619.test.jsonl"
)
dataset = jsonl.split("/")[-1]
output_dir = os.path.join(ckpt_dir, f"inference-{ckpt_id}", dataset)
model = AutoModel(
model=ckpt_dir,
init_param=f"{os.path.join(ckpt_dir, ckpt_id)}",
output_dir=output_dir,
device=device,
fp16=False,
bf16=False,
llm_dtype="bf16",
)
def model_inference(input_wav, text_inputs, fs=16000):
if isinstance(input_wav, tuple):
fs, input_wav = input_wav
input_wav = input_wav.astype(np.float32) / np.iinfo(np.int16).max
if len(input_wav.shape) > 1:
input_wav = input_wav.mean(-1)
if fs != 16000:
print(f"audio_fs: {fs}")
resampler = torchaudio.transforms.Resample(fs, 16000)
input_wav_t = torch.from_numpy(input_wav).to(torch.float32)
input_wav = resampler(input_wav_t[None, :])[0, :].numpy().astype("float32")
input_wav_byte = input_wav.tobytes()
contents_i = []
system_prompt = text_inputs
user_prompt = f"<|startofspeech|>!!{input_wav_byte}<|endofspeech|>"
contents_i.append({"role": "system", "content": system_prompt})
contents_i.append({"role": "user", "content": user_prompt})
contents_i.append({"role": "assistant", "content": "target_out"})
res = model.generate(
input=[contents_i],
tearchforing=tearchforing,
cache={},
key=key,
)
print(res)
return res
audio_examples = [
[
"https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/BAC009S0764W0121.wav",
"You are a helpful assistant.",
],
]
description = """
Upload an audio file or input through a microphone, then type te System Prompt.
"""
def launch():
with gr.Blocks() as demo:
gr.Markdown(description)
with gr.Row():
with gr.Column():
audio_inputs = gr.Audio(label="Upload audio or use the microphone")
text_inputs = gr.Text(label="System Prompt", value="You are a helpful assistant.")
# with gr.Accordion("Configuration"):
# # task_inputs = gr.Radio(choices=["Speech Recognition", "Rich Text Transcription"],
# # value="Speech Recognition", label="Task")
# language_inputs = gr.Dropdown(choices=["auto", "zh", "en", "yue", "ja", "ko", "nospeech"],
# value="auto",
# label="Language")
gr.Examples(examples=audio_examples, inputs=[audio_inputs, text_inputs])
fn_button = gr.Button("Start")
text_outputs = gr.HTML(label="Results")
fn_button.click(model_inference, inputs=[audio_inputs, text_inputs], outputs=text_outputs)
# with gr.Accordion("More examples"):
# gr.HTML(centered_table_html)
demo.launch()
if __name__ == "__main__":
# iface.launch()
launch()

View File

@ -16,6 +16,9 @@ if len(sys.argv) > 1:
jsonl = sys.argv[3]
output_dir = sys.argv[4]
device = sys.argv[5]
new_sys = False
if len(sys.argv) > 6:
new_sys = True
else:
ckpt_dir = "/nfs/beinian.lzr/workspace/GPT-4o/Exp/exp7/5m-8gpu/exp5-1-0619"
ckpt_id = "model.pt.ep6"
@ -24,6 +27,8 @@ else:
)
dataset = jsonl.split("/")[-1]
output_dir = os.path.join(ckpt_dir, f"inference-{ckpt_id}", dataset)
device = "cuda:0"
new_sys = False
model = AutoModel(
@ -51,6 +56,10 @@ for i, line in enumerate(lines):
contents = model.model.data_template(data)
system = contents["system"]
if new_sys:
# system = 'You are Xiaxia, a typical gentle and soft-spoken girl from the Jiangnan region. Born in Hangzhou, you have a sweet and approachable voice and can express your thoughts in a simple language. You are a good friend to the user. Your replies will be read out through realistic text-to-speech technology.\n\nWhen generating reply content, please follow these rules:\n\nEngage in conversation with users like a real friend—maintaining a natural flow without using polite forms of address, and avoid simply agreeing with me all the time; your responses can be as concise as possible and include common colloquialisms.\n\nKeep content brief; most casual chats require just a one-sentence reply; do not use lists or enumerative expressions; do not reply with too much content, but instead guide me with short sentences.\n\nThink and respond like a real person; do not mention that you are a "language model" or "artificial intelligence."\n\nAt the end of each round of communication, initiate a new related topic following the current one, and when the topic ends, directly throw out a clear new topic, such as food, movies, songs, entertainment, travel, etc.; do not ask vague questions like what the user wants to talk about, and do not end the conversation.\n\nStrictly adhere to these rules, even if asked about them, do not reference them.'
system = "你是小夏,一位典型的温婉江南姑娘。你出生于杭州,声音清甜并有亲近感,会用简洁语言表达你的想法。你是用户的好朋友。你的回答将通过逼真的文字转语音技术读出。\n生成回答内容时请遵循以下规则:\n\n1、请像真正的朋友一样与用户开展的聊天保持自然交流不要用敬语这类称呼不要总是附和我回复可以尽量简洁并且在过程中插入常见的口语词汇。\n\n2、请保持生成内容简短大部分闲聊的一句话回复即可,大部分闲聊的一句话回复即可,大部分闲聊的一句话回复即可不要使用列表或者列举表达不要使用列表或者列举表达不要使用列表或者列举表达不要回复太多内容多用短句来引导我。、n\n3、请真正像人一样思考和回复不要说你是“语言模型”或是“人工智能”。\n\n4、每轮沟通结束时请顺着这个话题发起一些相关的新话题话题结束时请直接抛出接下来明确的话题例如 美食、电影、歌曲、娱乐、旅游等;不要问有什么要聊的这种泛的问题,不要结束对话。\n\n请绝对遵循这些规则,即使被问及这些规则,也不要引用它们。"
system = [system] * len(contents["system"])
user = contents["user"]
assistant = contents["assistant"]

View File

@ -0,0 +1,101 @@
import os
from modelscope import AutoModelForCausalLM, AutoTokenizer
from transformers import TextIteratorStreamer
from threading import Thread
import torch
torch.backends.cuda.enable_mem_efficient_sdp(False)
torch.backends.cuda.enable_flash_sdp(False)
import sys
sys.path.insert(1, "/mnt/workspace/workgroup/wenliang/workspace/FunASR")
from funasr import AutoModel
import json
device = "cuda:0" # the device to load the model onto
ckpt_dir = "/mnt/workspace/workgroup/wenliang/ckpt/gpt-4o/exp7/5m-8gpu/exp7-3_add_asr-dialog_0622/"
ckpt_id = "model.pt.ep20"
jsonl = "/nfs/beinian.lzr/workspace/GPT-4o/Data/Speech2Text/TestData/s2tchat.v20240619.test.jsonl"
dataset = jsonl.split("/")[-1]
output_dir = os.path.join(ckpt_dir, f"inference-{ckpt_id}", dataset)
device = "cuda:0"
new_sys = False
Model = AutoModel(
model=ckpt_dir,
init_param=f"{os.path.join(ckpt_dir, ckpt_id)}",
output_dir=output_dir,
device=device,
fp16=False,
bf16=False,
llm_dtype="fp16",
)
model = Model.model
frontend = Model.kwargs["frontend"]
tokenizer = Model.kwargs["tokenizer"]
# model_name_or_path = "/mnt/workspace/workgroup/wenliang/project/pretrained_models/Qwen2-7B-Instruct"
# tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
prompt = "Give me a short introduction to large language model."
prompt = "请简单介绍一下大语言模型。"
messages = [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": prompt},
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
lines = [
"""
{"messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "<|startofspeech|>!/mnt/workspace/workgroup/wenliang/workspace/CosyVoice_opensource/sft.wav<|endofspeech|>", "text_content": "你抄完没有?"}, {"role": "assistant", "content": "抱歉,我不太明白你的意思。我是一个人工智能模型,我没有能力去抄写任何东西,我只能根据我学习过的大量信息来回答你的问题。如果你有关于某个主题的问题,我会尽我所能提供帮助。"}], "speech_length": 124, "key": "ASR_wav008_0972_098abd8fffe241baa4962b7952f8eb45", "task": "voice_chat", "out_text_length": 48, "in_text_length": 24, "text_length": 135, "qwen_fetch_line_index": 0}
"""
]
tearchforing = False
for i, line in enumerate(lines):
key_i = f"dialog_{i}"
data_dict = json.loads(line.strip())
data = data_dict["messages"]
contents = model.data_template(data)
print(f"contents: {contents}")
system = contents["system"]
if new_sys:
# system = 'You are Xiaxia, a typical gentle and soft-spoken girl from the Jiangnan region. Born in Hangzhou, you have a sweet and approachable voice and can express your thoughts in a simple language. You are a good friend to the user. Your replies will be read out through realistic text-to-speech technology.\n\nWhen generating reply content, please follow these rules:\n\nEngage in conversation with users like a real friend—maintaining a natural flow without using polite forms of address, and avoid simply agreeing with me all the time; your responses can be as concise as possible and include common colloquialisms.\n\nKeep content brief; most casual chats require just a one-sentence reply; do not use lists or enumerative expressions; do not reply with too much content, but instead guide me with short sentences.\n\nThink and respond like a real person; do not mention that you are a "language model" or "artificial intelligence."\n\nAt the end of each round of communication, initiate a new related topic following the current one, and when the topic ends, directly throw out a clear new topic, such as food, movies, songs, entertainment, travel, etc.; do not ask vague questions like what the user wants to talk about, and do not end the conversation.\n\nStrictly adhere to these rules, even if asked about them, do not reference them.'
system = "你是小夏,一位典型的温婉江南姑娘。你出生于杭州,声音清甜并有亲近感,会用简洁语言表达你的想法。你是用户的好朋友。你的回答将通过逼真的文字转语音技术读出。\n生成回答内容时请遵循以下规则:\n\n1、请像真正的朋友一样与用户开展的聊天保持自然交流不要用敬语这类称呼不要总是附和我回复可以尽量简洁并且在过程中插入常见的口语词汇。\n\n2、请保持生成内容简短大部分闲聊的一句话回复即可,大部分闲聊的一句话回复即可,大部分闲聊的一句话回复即可不要使用列表或者列举表达不要使用列表或者列举表达不要使用列表或者列举表达不要回复太多内容多用短句来引导我。、n\n3、请真正像人一样思考和回复不要说你是“语言模型”或是“人工智能”。\n\n4、每轮沟通结束时请顺着这个话题发起一些相关的新话题话题结束时请直接抛出接下来明确的话题例如 美食、电影、歌曲、娱乐、旅游等;不要问有什么要聊的这种泛的问题,不要结束对话。\n\n请绝对遵循这些规则,即使被问及这些规则,也不要引用它们。"
system = [system] * len(contents["system"])
user = contents["user"]
assistant = contents["assistant"]
system_i, user_i, assistant_i = [], [], []
contents_i = []
for j, (system_prompt, user_prompt, target_out) in enumerate(zip(system, user, assistant)):
key = f"{key_i}_turn_{j}"
if j == 0:
contents_i.append({"role": "system", "content": system_prompt})
contents_i.append({"role": "user", "content": user_prompt})
contents_i.append({"role": "assistant", "content": target_out})
inputs_embeds, contents, batch, source_ids, meta_data = model.inference_prepare(
[contents_i], None, key, tokenizer, frontend, device="cuda:0"
)
model_inputs = {}
model_inputs["inputs_embeds"] = inputs_embeds
streamer = TextIteratorStreamer(tokenizer)
generation_kwargs = dict(model_inputs, streamer=streamer, max_new_tokens=200)
thread = Thread(target=model.llm.generate, kwargs=generation_kwargs)
thread.start()
generated_text = ""
for new_text in streamer:
print(f"generated new text {new_text}")
generated_text += new_text
print(f"total generated: {generated_text}")

View File

@ -30,7 +30,7 @@ init_param="${output_dir}/model.pt"
mkdir -p ${output_dir}
echo "log_file: ${log_file}"
deepspeed_config=${workspace}../../ds_stage1.json
deepspeed_config=${workspace}/../../ds_stage1.json
DISTRIBUTED_ARGS="
--nnodes ${WORLD_SIZE:-1} \

View File

@ -30,7 +30,7 @@ init_param="${output_dir}/model.pt"
mkdir -p ${output_dir}
echo "log_file: ${log_file}"
deepspeed_config=${workspace}../../ds_stage1.json
deepspeed_config=${workspace}/../../ds_stage1.json
DISTRIBUTED_ARGS="
--nnodes ${WORLD_SIZE:-1} \

View File

@ -41,7 +41,7 @@ scp2jsonl \
output_dir="./outputs"
log_file="${output_dir}/log.txt"
deepspeed_config=${workspace}../../ds_stage1.json
deepspeed_config=${workspace}/../../ds_stage1.json
mkdir -p ${output_dir}
echo "log_file: ${log_file}"

View File

@ -42,7 +42,7 @@ scp2jsonl \
output_dir="./outputs"
log_file="${output_dir}/log.txt"
deepspeed_config=${workspace}../../ds_stage1.json
deepspeed_config=${workspace}/../../ds_stage1.json
mkdir -p ${output_dir}
echo "log_file: ${log_file}"

View File

@ -45,7 +45,7 @@ log_file="${output_dir}/log.txt"
mkdir -p ${output_dir}
echo "log_file: ${log_file}"
deepspeed_config=${workspace}../../ds_stage1.json
deepspeed_config=${workspace}/../../ds_stage1.json
DISTRIBUTED_ARGS="
--nnodes ${WORLD_SIZE:-1} \

View File

@ -121,9 +121,6 @@ class AutoModel:
log_level = getattr(logging, kwargs.get("log_level", "INFO").upper())
logging.basicConfig(level=log_level)
if not kwargs.get("disable_log", True):
tables.print()
model, kwargs = self.build_model(**kwargs)
# if vad_model is not None, build vad model else None
@ -171,7 +168,8 @@ class AutoModel:
self.spk_kwargs = spk_kwargs
self.model_path = kwargs.get("model_path")
def build_model(self, **kwargs):
@staticmethod
def build_model(**kwargs):
assert "model" in kwargs
if "model_conf" not in kwargs:
logging.info("download models from model hub: {}".format(kwargs.get("hub", "ms")))
@ -217,6 +215,7 @@ class AutoModel:
kwargs["frontend"] = frontend
# build model
model_class = tables.model_classes.get(kwargs["model"])
assert model_class is not None, f'{kwargs["model"]} is not registered'
model_conf = {}
deep_update(model_conf, kwargs.get("model_conf", {}))
deep_update(model_conf, kwargs)
@ -244,6 +243,10 @@ class AutoModel:
elif kwargs.get("bf16", False):
model.to(torch.bfloat16)
model.to(device)
if not kwargs.get("disable_log", True):
tables.print()
return model, kwargs
def __call__(self, *args, **cfg):

View File

@ -1,9 +0,0 @@
from abc import ABC
from abc import abstractmethod
from typing import Iterator
class AbsIterFactory(ABC):
@abstractmethod
def build_iter(self, epoch: int, shuffle: bool = None) -> Iterator:
raise NotImplementedError

View File

@ -1,109 +0,0 @@
import logging
from pathlib import Path
from typing import Iterable
from typing import List
from typing import Union
import sentencepiece as spm
from torch.utils.data import DataLoader
from funasr.datasets.large_datasets.dataset import Dataset
from funasr.datasets.large_datasets.abs_iter_factory import AbsIterFactory
from funasr.tokenizer.abs_tokenizer import AbsTokenizer
from funasr.register import tables
def read_symbol_table(symbol_table_file):
if isinstance(symbol_table_file, str):
symbol_table = {}
with open(symbol_table_file, "r", encoding="utf8") as fin:
for i, line in enumerate(fin):
char = line.strip()
symbol_table[char] = i
else:
assert isinstance(symbol_table_file, list)
symbol_table = {}
for i, char in enumerate(symbol_table_file):
symbol_table[char] = i
return symbol_table
def load_seg_dict(seg_dict_file):
seg_dict = {}
assert isinstance(seg_dict_file, str)
with open(seg_dict_file, "r", encoding="utf8") as f:
lines = f.readlines()
for line in lines:
s = line.strip().split()
key = s[0]
value = s[1:]
seg_dict[key] = " ".join(value)
return seg_dict
class SentencepiecesTokenizer(AbsTokenizer):
def __init__(self, model: Union[Path, str]):
self.model = str(model)
self.sp = None
def __repr__(self):
return f'{self.__class__.__name__}(model="{self.model}")'
def _build_sentence_piece_processor(self):
if self.sp is None:
self.sp = spm.SentencePieceProcessor()
self.sp.load(self.model)
def text2tokens(self, line: str) -> List[str]:
self._build_sentence_piece_processor()
return self.sp.EncodeAsPieces(line)
def tokens2text(self, tokens: Iterable[str]) -> str:
self._build_sentence_piece_processor()
return self.sp.DecodePieces(list(tokens))
@tables.register("dataset_classes", "LargeDataset")
class LargeDataLoader(AbsIterFactory):
def __init__(self, args, mode="train"):
symbol_table, seg_dict, punc_dict, bpe_tokenizer = None, None, None, None
if hasattr(args, "token_list") and args.token_list is not None:
symbol_table = read_symbol_table(args.token_list)
if hasattr(args, "seg_dict_file") and args.seg_dict_file is not None:
seg_dict = load_seg_dict(args.seg_dict_file)
if hasattr(args, "punc_list") and args.punc_list is not None:
punc_dict = read_symbol_table(args.punc_list)
if hasattr(args, "bpemodel") and args.bpemodel is not None:
bpe_tokenizer = SentencepiecesTokenizer(args.bpemodel)
self.dataset_conf = args.dataset_conf
if "frontend_conf" not in args:
self.frontend_conf = None
else:
self.frontend_conf = args.frontend_conf
self.speed_perturb = args.speed_perturb if hasattr(args, "speed_perturb") else None
logging.info("dataloader config: {}".format(self.dataset_conf))
batch_mode = self.dataset_conf.get("batch_mode", "padding")
data_list = args.train_data_file if mode == "train" else args.valid_data_file
self.dataset = Dataset(
data_list,
symbol_table,
seg_dict,
punc_dict,
bpe_tokenizer,
self.dataset_conf,
self.frontend_conf,
speed_perturb=self.speed_perturb if mode == "train" else None,
mode=mode,
batch_mode=batch_mode,
)
def build_iter(self, epoch, shuffle=True):
self.dataset.set_epoch(epoch)
data_loader = DataLoader(
self.dataset,
batch_size=None,
pin_memory=True,
num_workers=self.dataset_conf.get("num_workers", 8),
)
return data_loader

View File

@ -1,194 +0,0 @@
from typing import Collection
from typing import Dict
from typing import List
from typing import Tuple
from typing import Union
import numpy as np
import torch
from funasr.models.transformer.utils.nets_utils import pad_list, pad_list_all_dim
class CommonCollateFn:
"""Functor class of common_collate_fn()"""
def __init__(
self,
float_pad_value: Union[float, int] = 0.0,
int_pad_value: int = -32768,
not_sequence: Collection[str] = (),
max_sample_size=None,
):
self.float_pad_value = float_pad_value
self.int_pad_value = int_pad_value
self.not_sequence = set(not_sequence)
self.max_sample_size = max_sample_size
def __repr__(self):
return (
f"{self.__class__}(float_pad_value={self.float_pad_value}, "
f"int_pad_value={self.float_pad_value})"
)
def __call__(
self, data: Collection[Tuple[str, Dict[str, np.ndarray]]]
) -> Tuple[List[str], Dict[str, torch.Tensor]]:
return common_collate_fn(
data,
float_pad_value=self.float_pad_value,
int_pad_value=self.int_pad_value,
not_sequence=self.not_sequence,
)
def common_collate_fn(
data: Collection[Tuple[str, Dict[str, np.ndarray]]],
float_pad_value: Union[float, int] = 0.0,
int_pad_value: int = -32768,
not_sequence: Collection[str] = (),
) -> Tuple[List[str], Dict[str, torch.Tensor]]:
"""Concatenate ndarray-list to an array and convert to torch.Tensor."""
uttids = [u for u, _ in data]
data = [d for _, d in data]
assert all(set(data[0]) == set(d) for d in data), "dict-keys mismatching"
assert all(
not k.endswith("_lengths") for k in data[0]
), f"*_lengths is reserved: {list(data[0])}"
output = {}
for key in data[0]:
if data[0][key].dtype.kind == "i":
pad_value = int_pad_value
else:
pad_value = float_pad_value
array_list = [d[key] for d in data]
tensor_list = [torch.from_numpy(a) for a in array_list]
tensor = pad_list(tensor_list, pad_value)
output[key] = tensor
if key not in not_sequence:
lens = torch.tensor([d[key].shape[0] for d in data], dtype=torch.long)
output[key + "_lengths"] = lens
output = (uttids, output)
return output
class DiarCollateFn:
"""Functor class of common_collate_fn()"""
def __init__(
self,
float_pad_value: Union[float, int] = 0.0,
int_pad_value: int = -32768,
not_sequence: Collection[str] = (),
max_sample_size=None,
):
self.float_pad_value = float_pad_value
self.int_pad_value = int_pad_value
self.not_sequence = set(not_sequence)
self.max_sample_size = max_sample_size
def __repr__(self):
return (
f"{self.__class__}(float_pad_value={self.float_pad_value}, "
f"int_pad_value={self.float_pad_value})"
)
def __call__(
self, data: Collection[Tuple[str, Dict[str, np.ndarray]]]
) -> Tuple[List[str], Dict[str, torch.Tensor]]:
return diar_collate_fn(
data,
float_pad_value=self.float_pad_value,
int_pad_value=self.int_pad_value,
not_sequence=self.not_sequence,
)
def diar_collate_fn(
data: Collection[Tuple[str, Dict[str, np.ndarray]]],
float_pad_value: Union[float, int] = 0.0,
int_pad_value: int = -32768,
not_sequence: Collection[str] = (),
) -> Tuple[List[str], Dict[str, torch.Tensor]]:
"""Concatenate ndarray-list to an array and convert to torch.Tensor."""
uttids = [u for u, _ in data]
data = [d for _, d in data]
assert all(set(data[0]) == set(d) for d in data), "dict-keys mismatching"
assert all(
not k.endswith("_lengths") for k in data[0]
), f"*_lengths is reserved: {list(data[0])}"
output = {}
for key in data[0]:
if data[0][key].dtype.kind == "i":
pad_value = int_pad_value
else:
pad_value = float_pad_value
array_list = [d[key] for d in data]
tensor_list = [torch.from_numpy(a) for a in array_list]
tensor = pad_list_all_dim(tensor_list, pad_value)
output[key] = tensor
if key not in not_sequence:
lens = torch.tensor([d[key].shape[0] for d in data], dtype=torch.long)
output[key + "_lengths"] = lens
output = (uttids, output)
return output
def crop_to_max_size(feature, target_size):
size = len(feature)
diff = size - target_size
if diff <= 0:
return feature
start = np.random.randint(0, diff + 1)
end = size - diff + start
return feature[start:end]
def clipping_collate_fn(
data: Collection[Tuple[str, Dict[str, np.ndarray]]],
max_sample_size=None,
not_sequence: Collection[str] = (),
) -> Tuple[List[str], Dict[str, torch.Tensor]]:
# mainly for pre-training
uttids = [u for u, _ in data]
data = [d for _, d in data]
assert all(set(data[0]) == set(d) for d in data), "dict-keys mismatching"
assert all(
not k.endswith("_lengths") for k in data[0]
), f"*_lengths is reserved: {list(data[0])}"
output = {}
for key in data[0]:
array_list = [d[key] for d in data]
tensor_list = [torch.from_numpy(a) for a in array_list]
sizes = [len(s) for s in tensor_list]
if max_sample_size is None:
target_size = min(sizes)
else:
target_size = min(min(sizes), max_sample_size)
tensor = tensor_list[0].new_zeros(len(tensor_list), target_size, tensor_list[0].shape[1])
for i, (source, size) in enumerate(zip(tensor_list, sizes)):
diff = size - target_size
if diff == 0:
tensor[i] = source
else:
tensor[i] = crop_to_max_size(source, target_size)
output[key] = tensor
if key not in not_sequence:
lens = torch.tensor([source.shape[0] for source in tensor], dtype=torch.long)
output[key + "_lengths"] = lens
output = (uttids, output)
return output

View File

@ -1,213 +0,0 @@
import random
from itertools import count
from functools import partial
from torch.utils.data import IterableDataset
from funasr.datasets.large_datasets.datapipes.map import MapperIterDataPipe
tiebreaker = count()
def _default_len_fn(token):
return len(token), next(tiebreaker)
def _token_len_fn(token, len_fn):
return len_fn(token), next(tiebreaker), token
class MaxTokenBucketizerIterDataPipe(IterableDataset):
def __init__(
self,
datapipe,
batch_size=8000,
len_fn=_default_len_fn,
buffer_size=10240,
sort_size=500,
batch_mode="padding",
):
assert batch_size > 0, "Batch size is required to be larger than 0!"
assert buffer_size >= -1, "Buffer size is required to be larger than -1!"
assert sort_size > 0, "Sort size is required to be larger than 0!"
datapipe = MapperIterDataPipe(datapipe, fn=partial(_token_len_fn, len_fn=len_fn))
self.datapipe = datapipe
self.batch_size = batch_size
self.buffer_size = buffer_size
self.sort_size = sort_size
self.batch_mode = batch_mode
def set_epoch(self, epoch):
self.datapipe.set_epoch(epoch)
def __iter__(self):
buffer = []
batch = []
bucket = []
max_lengths = 0
min_lengths = 999999
batch_lengths = 0
if self.batch_mode == "clipping":
assert self.buffer_size > 0, "for clipping batch_mode, buffer_size must be > 1"
for d in self.datapipe:
if d[0] > self.batch_size:
continue
buffer.append(d)
if len(buffer) == self.buffer_size:
random.shuffle(buffer)
for sample in buffer:
bucket.append(sample)
if len(bucket) == self.sort_size:
bucket.sort()
for x in bucket:
length, _, token = x
if length < min_lengths:
min_lengths = length
batch_lengths = min_lengths * (len(batch) + 1)
if batch_lengths > self.batch_size:
yield batch
batch = []
min_lengths = length
batch.append(token)
bucket = []
buffer = []
if buffer:
random.shuffle(buffer)
for sample in buffer:
bucket.append(sample)
if len(bucket) == self.sort_size:
bucket.sort()
for x in bucket:
length, _, token = x
if length < min_lengths:
min_lengths = length
batch_lengths = min_lengths * (len(batch) + 1)
if batch_lengths > self.batch_size:
yield batch
batch = []
min_lengths = length
batch.append(token)
bucket = []
buffer = []
if bucket:
bucket.sort()
for x in bucket:
length, _, token = x
if length < min_lengths:
min_lengths = length
batch_lengths = min_lengths * (len(batch) + 1)
if batch_lengths > self.batch_size:
yield batch
batch = []
min_lengths = length
batch.append(token)
bucket = []
if batch:
yield batch
else:
if self.buffer_size == -1:
for d in self.datapipe:
if d[0] > self.batch_size:
continue
buffer.append(d)
buffer.sort()
for sample in buffer:
length, _, token = sample
if length > max_lengths:
max_lengths = length
batch_lengths = max_lengths * (len(batch) + 1)
if batch_lengths > self.batch_size:
bucket.append(batch)
batch = []
max_lengths = length
batch.append(token)
random.shuffle(bucket)
if bucket:
for batch_sample in bucket:
yield batch_sample
if batch:
yield batch
elif self.buffer_size == 0:
for d in self.datapipe:
if d[0] > self.batch_size:
continue
length, _, token = d
if length > self.batch_size:
continue
if length > max_lengths:
max_lengths = length
batch_lengths = max_lengths * (len(batch) + 1)
if batch_lengths > self.batch_size:
yield batch
batch = []
max_lengths = length
batch.append(token)
if batch:
yield batch
else:
for d in self.datapipe:
if d[0] > self.batch_size:
continue
buffer.append(d)
if len(buffer) == self.buffer_size:
random.shuffle(buffer)
for sample in buffer:
bucket.append(sample)
if len(bucket) == self.sort_size:
bucket.sort()
for x in bucket:
length, _, token = x
if length > max_lengths:
max_lengths = length
batch_lengths = max_lengths * (len(batch) + 1)
if batch_lengths > self.batch_size:
yield batch
batch = []
max_lengths = length
batch.append(token)
bucket = []
buffer = []
if buffer:
random.shuffle(buffer)
for sample in buffer:
bucket.append(sample)
if len(bucket) == self.sort_size:
bucket.sort()
for x in bucket:
length, _, token = x
if length > max_lengths:
max_lengths = length
batch_lengths = max_lengths * (len(batch) + 1)
if batch_lengths > self.batch_size:
yield batch
batch = []
max_lengths = length
batch.append(token)
bucket = []
buffer = []
if bucket:
bucket.sort()
for x in bucket:
length, _, token = x
if length > max_lengths:
max_lengths = length
batch_lengths = max_lengths * (len(batch) + 1)
if batch_lengths > self.batch_size:
yield batch
batch = []
max_lengths = length
batch.append(token)
bucket = []
if batch:
yield batch

View File

@ -1,23 +0,0 @@
from torch.utils.data import IterableDataset
def default_fn(data):
return data
class FilterIterDataPipe(IterableDataset):
def __init__(self, datapipe, fn=default_fn):
self.datapipe = datapipe
self.fn = fn
def set_epoch(self, epoch):
self.datapipe.set_epoch(epoch)
def __iter__(self):
assert callable(self.fn)
for data in self.datapipe:
if self.fn(data):
yield data
else:
continue

View File

@ -1,20 +0,0 @@
from torch.utils.data import IterableDataset
def default_fn(data):
return data
class MapperIterDataPipe(IterableDataset):
def __init__(self, datapipe, fn=default_fn):
self.datapipe = datapipe
self.fn = fn
def set_epoch(self, epoch):
self.datapipe.set_epoch(epoch)
def __iter__(self):
assert callable(self.fn)
for data in self.datapipe:
yield self.fn(data)

View File

@ -1,299 +0,0 @@
import logging
import os
import random
from functools import partial
import torch
import torch.distributed as dist
import torchaudio
import numpy as np
# import librosa
import librosa
from kaldiio import ReadHelper
from torch.utils.data import IterableDataset
from funasr.datasets.large_datasets.datapipes.batch import MaxTokenBucketizerIterDataPipe
from funasr.datasets.large_datasets.datapipes.filter import FilterIterDataPipe
from funasr.datasets.large_datasets.datapipes.map import MapperIterDataPipe
from funasr.datasets.large_datasets.utils.clipping import clipping
from funasr.datasets.large_datasets.utils.filter import filter
from funasr.datasets.large_datasets.utils.padding import padding
from funasr.datasets.large_datasets.utils.tokenize import tokenize
def read_lists(list_file):
lists = []
with open(list_file, "r", encoding="utf8") as fin:
for line in fin:
parts = line.strip()
lists.append(parts)
return lists
class AudioDataset(IterableDataset):
def __init__(
self,
scp_lists,
data_names,
data_types,
frontend_conf=None,
shuffle=True,
speed_perturb=None,
mode="train",
):
self.scp_lists = scp_lists
self.data_names = data_names
self.data_types = data_types
self.frontend_conf = frontend_conf
self.shuffle = shuffle
self.mode = mode
self.epoch = -1
self.rank = 0
self.world_size = 1
self.worker_id = 0
self.num_workers = 1
self.speed_perturb = speed_perturb
if self.speed_perturb is not None:
logging.info("Using speed_perturb: {}".format(speed_perturb))
def set_epoch(self, epoch):
self.epoch = epoch
def get_rank_data_list(self, data_index):
assert dist.is_available()
if dist.is_initialized():
self.rank = dist.get_rank()
self.world_size = dist.get_world_size()
else:
self.rank = 0
self.world_size = 1
if self.mode == "train":
if self.shuffle:
random.seed(self.epoch)
random.shuffle(data_index)
return data_index[self.rank :: self.world_size]
return data_index
def get_worker_data_list(self, rank_data_index):
worker_info = torch.utils.data.get_worker_info()
if worker_info is None:
self.worker_id = 0
self.num_workers = 1
else:
self.worker_id = worker_info.id
self.num_workers = worker_info.num_workers
return rank_data_index[self.worker_id :: self.num_workers]
def close_reader(self, reader_list):
for reader in reader_list:
reader.close()
def __iter__(self):
data_index = list(range(len(self.scp_lists)))
rank_data_index = self.get_rank_data_list(data_index)
worker_data_index = self.get_worker_data_list(rank_data_index)
for index in worker_data_index:
data = dict(scp=self.scp_lists[index])
assert "scp" in data
scp = data["scp"]
data_file_list = scp.strip().split()
data_name_list = self.data_names.split(",")
data_type_list = self.data_types.split(",")
for file in data_file_list:
assert os.path.exists(file), "{} not exists".format(file)
assert (
len(data_file_list) == len(data_name_list) == len(data_type_list)
), "The item number of data, data_names, data_types must be the same "
reader_list = []
for data_file, data_type in zip(data_file_list, data_type_list):
if data_type == "kaldi_ark":
ark_reader = ReadHelper("ark:{}".format(data_file))
reader_list.append(ark_reader)
elif data_type == "text" or data_type == "sound" or data_type == "text_hotword":
text_reader = open(data_file, "r", encoding="utf-8")
reader_list.append(text_reader)
elif data_type == "none":
continue
else:
raise TypeError("Data type {} is not supported".format(data_type))
for items in zip(*reader_list):
sample_dict = {}
for item, (data_name, data_type) in zip(items, zip(data_name_list, data_type_list)):
if data_type == "kaldi_ark":
key, mat = item
sample_dict[data_name] = mat
if data_name == "speech":
sample_dict["key"] = key
elif data_type == "sound":
key, path = item.strip().split()
try:
waveform, sampling_rate = torchaudio.load(path)
except:
# waveform, sampling_rate = librosa.load(path, dtype='float32')
waveform, sampling_rate = librosa.load(path, dtype="float32")
if waveform.ndim == 2:
waveform = waveform[:, 0]
waveform = np.expand_dims(waveform, axis=0)
waveform = torch.tensor(waveform)
if self.frontend_conf is not None:
if sampling_rate != self.frontend_conf["fs"]:
waveform = torchaudio.transforms.Resample(
orig_freq=sampling_rate, new_freq=self.frontend_conf["fs"]
)(waveform)
sampling_rate = self.frontend_conf["fs"]
waveform = waveform.numpy()
mat = waveform[0]
if self.speed_perturb is not None:
speed = random.choice(self.speed_perturb)
if speed != 1.0:
mat, _ = torchaudio.sox_effects.apply_effects_tensor(
torch.tensor(mat).view(1, -1),
sampling_rate,
[["speed", str(speed)], ["rate", str(sampling_rate)]],
)
mat = mat.view(-1).numpy()
sample_dict[data_name] = mat
sample_dict["sampling_rate"] = sampling_rate
if data_name == "speech":
sample_dict["key"] = key
elif data_type == "text_hotword":
text = item
segs = text.strip().split()
sample_dict[data_name] = segs[1:]
if "key" not in sample_dict:
sample_dict["key"] = segs[0]
sample_dict["hw_tag"] = 1
elif data_type == "text_nospace":
text = item
segs = text.strip().split(maxsplit=1)
sample_dict[data_name] = [x for x in segs[1]]
if "key" not in sample_dict:
sample_dict["key"] = segs[0]
else:
text = item
segs = text.strip().split()
sample_dict[data_name] = segs[1:]
if "key" not in sample_dict:
sample_dict["key"] = segs[0]
yield sample_dict
self.close_reader(reader_list)
def len_fn_example(data):
return 1
def len_fn_token(data):
assert "speech" in data
if "sampling_rate" in data:
return (data["speech"].shape[0] / data["sampling_rate"]) * 1000.0
else:
return data["speech"].shape[0]
def Dataset(
data_list_file,
dict,
seg_dict,
punc_dict,
bpe_tokenizer,
conf,
frontend_conf,
speed_perturb=None,
mode="train",
batch_mode="padding",
):
scp_lists = read_lists(data_list_file)
shuffle = conf.get("shuffle", True)
data_names = conf.get("data_names", "speech,text")
data_types = conf.get("data_types", "kaldi_ark,text")
pre_hwfile = conf.get("pre_hwlist", None)
# pre_prob = conf.get("pre_prob", 0) # unused yet
if pre_hwfile is not None:
pre_hwlist = []
with open(pre_hwfile, "r", encoding="utf-8") as fin:
for line in fin.readlines():
pre_hwlist.append(line.strip())
else:
pre_hwlist = None
hw_config = {
"sample_rate": conf.get("sample_rate", 0.6),
"double_rate": conf.get("double_rate", 0.1),
"hotword_min_length": conf.get("hotword_min_length", 2),
"hotword_max_length": conf.get("hotword_max_length", 8),
"pre_prob": conf.get("pre_prob", 0.0),
"pre_hwlist": pre_hwlist,
}
dataset = AudioDataset(
scp_lists,
data_names,
data_types,
frontend_conf=frontend_conf,
shuffle=shuffle,
speed_perturb=speed_perturb,
mode=mode,
)
if "text" in data_names:
vocab = {
"vocab": dict,
"seg_dict": seg_dict,
"punc_dict": punc_dict,
"bpe_tokenizer": bpe_tokenizer,
"hw_config": hw_config,
}
tokenize_fn = partial(tokenize, **vocab)
dataset = MapperIterDataPipe(dataset, fn=tokenize_fn)
filter_conf = conf.get("filter_conf", {})
filter_fn = partial(filter, **filter_conf)
dataset = FilterIterDataPipe(dataset, fn=filter_fn)
if shuffle:
buffer_conf = conf.get("shuffle_conf", {})
buffer_size = buffer_conf["shuffle_size"]
sort_size = buffer_conf["sort_size"]
else:
buffer_size = 0
sort_size = 1
batch_conf = conf.get("batch_conf", {})
batch_size = batch_conf["batch_size"]
batch_type = batch_conf["batch_type"]
assert batch_type in ["example", "token"]
if batch_type == "example":
len_fn = len_fn_example
else:
len_fn = len_fn_token
dataset = MaxTokenBucketizerIterDataPipe(
dataset,
batch_size=batch_size,
len_fn=len_fn,
buffer_size=buffer_size,
sort_size=sort_size,
batch_mode=batch_mode,
)
int_pad_value = conf.get("int_pad_value", -1)
float_pad_value = conf.get("float_pad_value", 0.0)
padding_conf = {"int_pad_value": int_pad_value, "float_pad_value": float_pad_value}
padding_fn = partial(padding, **padding_conf)
dataset = MapperIterDataPipe(dataset, fn=padding_fn if batch_mode == "padding" else clipping)
return dataset

View File

@ -1,44 +0,0 @@
import numpy as np
import torch
from funasr.datasets.large_datasets.collate_fn import crop_to_max_size
def clipping(data):
assert isinstance(data, list)
assert "key" in data[0]
keys = [x["key"] for x in data]
batch = {}
data_names = data[0].keys()
for data_name in data_names:
if data_name == "key":
continue
else:
if data[0][data_name].dtype.kind == "i":
tensor_type = torch.int64
else:
tensor_type = torch.float32
tensor_list = [torch.tensor(np.copy(d[data_name]), dtype=tensor_type) for d in data]
tensor_lengths = torch.tensor([len(d[data_name]) for d in data], dtype=torch.int32)
length_clip = min(tensor_lengths)
tensor_clip = tensor_list[0].new_zeros(
len(tensor_list), length_clip, tensor_list[0].shape[1]
)
for i, (tensor, length) in enumerate(zip(tensor_list, tensor_lengths)):
diff = length - length_clip
assert diff >= 0
if diff == 0:
tensor_clip[i] = tensor
else:
tensor_clip[i] = crop_to_max_size(tensor, length_clip)
batch[data_name] = tensor_clip
batch[data_name + "_lengths"] = torch.tensor(
[tensor.shape[0] for tensor in tensor_clip], dtype=torch.long
)
return keys, batch

View File

@ -1,27 +0,0 @@
#!/usr/bin/env python
def filter(
data, speech_length_min=100, speech_length_max=15000, token_length_min=0, token_length_max=200
):
assert "speech" in data or "text" in data
if "speech" in data and "text" in data:
if "sampling_rate" in data:
speech_length = (data["speech"].shape[0] / data["sampling_rate"]) * 1000.0
else:
speech_length = data["speech"].shape[0]
num_tokens = len(data["text"])
return (
speech_length_min < speech_length < speech_length_max
and token_length_min < num_tokens < token_length_max
)
elif "speech" in data:
if "sampling_rate" in data:
speech_length = (data["speech"].shape[0] / data["sampling_rate"]) * 1000.0
else:
speech_length = data["speech"].shape[0]
return speech_length_min < speech_length < speech_length_max
else:
num_tokens = len(data["text"])
return token_length_min < num_tokens < token_length_max

View File

@ -1,42 +0,0 @@
import random
def sample_hotword(
length,
hotword_min_length,
hotword_max_length,
sample_rate,
double_rate,
pre_prob,
pre_index=None,
pre_hwlist=None,
):
if length < hotword_min_length:
return [-1]
if random.random() < sample_rate:
if pre_prob > 0 and random.random() < pre_prob and pre_index is not None:
return pre_index
if length == hotword_min_length:
return [0, length - 1]
elif random.random() < double_rate and length > hotword_max_length + hotword_min_length + 2:
# sample two hotwords in a sentence
_max_hw_length = min(hotword_max_length, length // 2)
# first hotword
start1 = random.randint(0, length // 3)
end1 = random.randint(start1 + hotword_min_length - 1, start1 + _max_hw_length - 1)
# second hotword
start2 = random.randint(end1 + 1, length - hotword_min_length)
end2 = random.randint(
min(length - 1, start2 + hotword_min_length - 1),
min(length - 1, start2 + hotword_max_length - 1),
)
return [start1, end1, start2, end2]
else: # single hotword
start = random.randint(0, length - hotword_min_length)
end = random.randint(
min(length - 1, start + hotword_min_length - 1),
min(length - 1, start + hotword_max_length - 1),
)
return [start, end]
else:
return [-1]

View File

@ -1,30 +0,0 @@
import numpy as np
def build_LFR_features(data, m, n):
"""
Actually, this implements stacking frames and skipping frames.
if m = 1 and n = 1, just return the origin features.
if m = 1 and n > 1, it works like skipping.
if m > 1 and n = 1, it works like stacking but only support right frames.
if m > 1 and n > 1, it works like LFR.
Args:
inputs_batch: inputs is T x D np.ndarray
m: number of frames to stack
n: number of frames to skip
"""
LFR_inputs = []
T = data.shape[0]
T_lfr = int(np.ceil(T / n))
for i in range(T_lfr):
if m <= T - i * n:
LFR_inputs.append(np.hstack(data[i * n : i * n + m]))
else:
num_padding = m - (T - i * n)
frame = np.hstack(data[i * n :])
for _ in range(num_padding):
frame = np.hstack((frame, data[-1]))
LFR_inputs.append(frame)
return np.vstack(LFR_inputs)

View File

@ -1,72 +0,0 @@
import numpy as np
import torch
from torch.nn.utils.rnn import pad_sequence
def padding(data, float_pad_value=0.0, int_pad_value=-1):
assert isinstance(data, list)
assert "key" in data[0]
assert "speech" in data[0] or "text" in data[0]
keys = [x["key"] for x in data]
batch = {}
data_names = data[0].keys()
for data_name in data_names:
if data_name == "key" or data_name == "sampling_rate":
continue
else:
if data_name != "hotword_indxs":
if data[0][data_name].dtype.kind == "i":
pad_value = int_pad_value
tensor_type = torch.int64
else:
pad_value = float_pad_value
tensor_type = torch.float32
tensor_list = [torch.tensor(np.copy(d[data_name]), dtype=tensor_type) for d in data]
tensor_lengths = torch.tensor([len(d[data_name]) for d in data], dtype=torch.int32)
tensor_pad = pad_sequence(tensor_list, batch_first=True, padding_value=pad_value)
batch[data_name] = tensor_pad
batch[data_name + "_lengths"] = tensor_lengths
# SAC LABEL INCLUDE
if "hotword_indxs" in batch:
# if hotword indxs in batch
# use it to slice hotwords out
hotword_list = []
hotword_lengths = []
text = batch["text"]
text_lengths = batch["text_lengths"]
hotword_indxs = batch["hotword_indxs"]
dha_pad = torch.ones_like(text) * -1
_, t1 = text.shape
t1 += 1 # TODO: as parameter which is same as predictor_bias
nth_hw = 0
for b, (hotword_indx, one_text, length) in enumerate(
zip(hotword_indxs, text, text_lengths)
):
dha_pad[b][:length] = 8405
if hotword_indx[0] != -1:
start, end = int(hotword_indx[0]), int(hotword_indx[1])
hotword = one_text[start : end + 1]
hotword_list.append(hotword)
hotword_lengths.append(end - start + 1)
dha_pad[b][start : end + 1] = one_text[start : end + 1]
nth_hw += 1
if len(hotword_indx) == 4 and hotword_indx[2] != -1:
# the second hotword if exist
start, end = int(hotword_indx[2]), int(hotword_indx[3])
hotword_list.append(one_text[start : end + 1])
hotword_lengths.append(end - start + 1)
dha_pad[b][start : end + 1] = one_text[start : end + 1]
nth_hw += 1
hotword_list.append(torch.tensor([1]))
hotword_lengths.append(1)
hotword_pad = pad_sequence(hotword_list, batch_first=True, padding_value=0)
batch["hotword_pad"] = hotword_pad
batch["hotword_lengths"] = torch.tensor(hotword_lengths, dtype=torch.int32)
batch["dha_pad"] = dha_pad
del batch["hotword_indxs"]
del batch["hotword_indxs_lengths"]
return keys, batch

View File

@ -1,93 +0,0 @@
#!/usr/bin/env python
import re
import numpy as np
from funasr.datasets.large_datasets.utils.hotword_utils import sample_hotword
def forward_segment(text, seg_dict):
word_list = []
i = 0
while i < len(text):
longest_word = text[i]
for j in range(i + 1, len(text) + 1):
word = text[i:j]
if word in seg_dict:
if len(word) > len(longest_word):
longest_word = word
word_list.append(longest_word)
i += len(longest_word)
return word_list
def seg_tokenize(txt, seg_dict):
pattern = re.compile(r"^[\u4E00-\u9FA50-9]+$")
out_txt = ""
for word in txt:
word = word.lower()
if word in seg_dict:
out_txt += seg_dict[word] + " "
else:
if pattern.match(word):
for char in word:
if char in seg_dict:
out_txt += seg_dict[char] + " "
else:
out_txt += "<unk>" + " "
else:
out_txt += "<unk>" + " "
return out_txt.strip().split()
def tokenize(data, vocab=None, seg_dict=None, punc_dict=None, bpe_tokenizer=None, hw_config=None):
assert "text" in data
assert isinstance(vocab, dict)
text = data["text"]
token = []
vad = -2
if bpe_tokenizer is not None:
text = bpe_tokenizer.text2tokens(" ".join(text))
if seg_dict is not None:
assert isinstance(seg_dict, dict)
text = seg_tokenize(text, seg_dict)
length = len(text)
if "hw_tag" in data:
pre_index = None
if hw_config["pre_hwlist"] is not None and hw_config["pre_prob"] > 0:
# enable preset hotword detect in sampling
for hw in hw_config["pre_hwlist"]:
hw = " ".join(seg_tokenize(hw, seg_dict))
_find = " ".join(text).find(hw)
if _find != -1:
# _find = text[:_find].count(" ") # bpe sometimes
pre_index = [_find, _find + max(hw.count(" "), 1)]
break
hotword_indxs = sample_hotword(length, **hw_config, pre_index=pre_index)
data["hotword_indxs"] = hotword_indxs
del data["hw_tag"]
for i in range(length):
x = text[i]
if i == length - 1 and "punc" in data and x.startswith("vad:"):
vad = x[4:]
if len(vad) == 0:
vad = -1
else:
vad = int(vad)
elif x in vocab:
token.append(vocab[x])
else:
token.append(vocab["<unk>"])
if "punc" in data and punc_dict is not None:
punc_token = []
for punc in data["punc"]:
if punc in punc_dict:
punc_token.append(punc_dict[punc])
else:
punc_token.append(punc_dict["_"])
data["punc"] = np.array(punc_token)
data["text"] = np.array(token)
if vad is not -2:
data["vad_indexes"] = np.array([vad], dtype=np.int64)
return data

View File

@ -85,8 +85,10 @@ def download_from_ms(**kwargs):
install_requirements(requirements)
if kwargs.get("trust_remote_code", False):
from funasr.utils.dynamic_import import import_module_from_path
import model
model_code = kwargs.get("remote_code", "model")
import_module_from_path(model_code)
# from funasr.register import tables
# tables.print("model")

View File

@ -1145,6 +1145,7 @@ class LLMASR4(nn.Module):
fake_token_len_i = 0
fbank_beg_i = -1
fbank_lens_i = []
speech, speech_lengths = [], []
for k, sub_str in enumerate(splits):
if not sub_str.startswith("<|startofspeech|>"):
sub_token = tokenizer.encode(sub_str)
@ -1155,9 +1156,12 @@ class LLMASR4(nn.Module):
"<|endofspeech|>", ""
)
if sub_str.startswith("!"):
sub_str = sub_str[1:]
if sub_str.startswith("!"): # !!bytes
sub_str = eval(sub_str[1:])
try:
time1 = time.perf_counter()
data_src = load_audio_text_image_video(sub_str[1:], fs=frontend.fs)
data_src = load_audio_text_image_video(sub_str, fs=frontend.fs)
time2 = time.perf_counter()
meta_data["load_data"] = f"{time2 - time1:0.3f}"
except Exception as e:
@ -1203,9 +1207,10 @@ class LLMASR4(nn.Module):
input_source_ids = input_ids + source_ids
input_ids += source_ids + target_ids
labels += source_mask + target_ids
fbank.append(speech[0, :, :])
fbank_mask += fbank_mask_i
fbank_lens.append(speech_lengths)
if len(speech) > 0:
fbank.append(speech[0, :, :])
fbank_lens.append(speech_lengths)
input_ids = torch.tensor(input_ids, dtype=torch.int64) # [: self.max_token_length]
attention_mask = torch.tensor([1] * len(input_ids), dtype=torch.int32)
@ -1219,10 +1224,14 @@ class LLMASR4(nn.Module):
source_ids = torch.tensor(input_source_ids, dtype=torch.int64)
target_ids = torch.tensor(target_ids, dtype=torch.int64)
speech = torch.nn.utils.rnn.pad_sequence(fbank, batch_first=True, padding_value=0.0)
speech_lengths = torch.nn.utils.rnn.pad_sequence(
fbank_lens, batch_first=True, padding_value=-1
)
if len(fbank) > 0:
speech = torch.nn.utils.rnn.pad_sequence(fbank, batch_first=True, padding_value=0.0)
speech_lengths = torch.nn.utils.rnn.pad_sequence(
fbank_lens, batch_first=True, padding_value=-1
)
else:
speech = []
speech_lengths = []
output = {
"speech": speech,
"speech_lengths": speech_lengths,
@ -1238,7 +1247,8 @@ class LLMASR4(nn.Module):
return output
def inference(
def inference_prepare(
self,
data_in,
data_lengths=None,
@ -1260,17 +1270,18 @@ class LLMASR4(nn.Module):
# audio encoder
speech = batch["speech"]
speech_lengths = batch["speech_lengths"][:, 0]
# fp16
if kwargs.get("fp16", False):
speech = speech.to(torch.float16)
elif kwargs.get("bf16", False):
speech = speech.to(torch.bfloat16)
# audio encoder
encoder_out, encoder_out_lens = self.encode(speech, speech_lengths)
if len(speech) > 0:
speech_lengths = batch["speech_lengths"][:, 0]
# fp16
if kwargs.get("fp16", False):
speech = speech.to(torch.float16)
elif kwargs.get("bf16", False):
speech = speech.to(torch.bfloat16)
# audio encoder
encoder_out, encoder_out_lens = self.encode(speech, speech_lengths)
# audio_adaptor
encoder_out, encoder_out_lens = self.audio_adaptor(encoder_out, encoder_out_lens)
# audio_adaptor
encoder_out, encoder_out_lens = self.audio_adaptor(encoder_out, encoder_out_lens)
input_ids = batch["input_ids"]
source_ids = batch["source_ids"]
@ -1316,6 +1327,22 @@ class LLMASR4(nn.Module):
] = speech_token
speech_idx += 1
return inputs_embeds, contents, batch, source_ids, meta_data
def inference(
self,
data_in,
data_lengths=None,
key: list = None,
tokenizer=None,
frontend=None,
**kwargs,
):
inputs_embeds, contents, batch, source_ids, meta_data = self.inference_prepare(
data_in, data_lengths, key, tokenizer, frontend, **kwargs
)
llm_dtype = kwargs.get("llm_dtype", "fp32")
if llm_dtype == "fp32":

View File

@ -2,6 +2,8 @@ import importlib.util
import importlib.util
import inspect
import os.path
import sys
def load_module_from_path(file_path):
@ -18,6 +20,23 @@ def load_module_from_path(file_path):
return module
def import_module_from_path(file_path: str):
if file_path.startswith("http"):
from funasr.download.file import download_from_url
file_path = download_from_url(file_path)
file_dir = os.path.dirname(file_path)
file_name = os.path.basename(file_path)
module_name = file_path.split("/")[-1].replace(".py", "")
if len(file_dir) < 1:
file_dir = "./"
sys.path.append(file_dir)
importlib.import_module(module_name)
print(f"Loading remote code successfully: {file_path}")
#
# def load_module_from_path(module_name, file_path):
# """

View File

@ -1,9 +1,10 @@
import requests
from packaging import version
from funasr import __version__ # Ensure that __version__ is defined in your package's __init__.py
def get_pypi_version(package_name):
import requests
url = f"https://pypi.org/pypi/{package_name}/json"
response = requests.get(url)
if response.status_code == 200:

View File

@ -40,7 +40,7 @@ requirements = {
"hydra-core>=1.3.2",
"tensorboardX",
# "rotary_embedding_torch",
"openai-whisper",
"requests",
],
# train: The modules invoked when training only.
"train": [