update with main (#1817)

* add cmakelist

* add paraformer-torch

* add debug for funasr-onnx-offline

* fix redefinition of jieba StdExtension.hpp

* add loading torch models

* update funasr-onnx-offline

* add SwitchArg for wss-server

* add SwitchArg for funasr-onnx-offline

* update cmakelist

* update funasr-onnx-offline-rtf

* add define condition

* add gpu define for offline-stream

* update com define

* update offline-stream

* update cmakelist

* update func CompileHotwordEmbedding

* add timestamp for paraformer-torch

* add C10_USE_GLOG for paraformer-torch

* update paraformer-torch

* fix func FunASRWfstDecoderInit

* update model.h

* fix func FunASRWfstDecoderInit

* fix tpass_stream

* update paraformer-torch

* add bladedisc for funasr-onnx-offline

* update comdefine

* update funasr-wss-server

* add log for torch

* fix GetValue BLADEDISC

* fix log

* update cmakelist

* update warmup to 10

* update funasrruntime

* add batch_size for wss-server

* add batch for bins

* add batch for offline-stream

* add batch for paraformer

* add batch for offline-stream

* fix func SetBatchSize

* add SetBatchSize for model

* add SetBatchSize for model

* fix func Forward

* fix padding

* update funasrruntime

* add dec reset for batch

* set batch default value

* add argv for CutSplit

* sort frame_queue

* sorted msgs

* fix FunOfflineInfer

* add dynamic batch for fetch

* fix FetchDynamic

* update run_server.sh

* update run_server.sh

* cpp http post server support (#1739)

* add cpp http server

* add some comments

* remove some comments

* remove debug info

* restore run_server.sh

* adapt to new model struct

* Fix onnxruntime build failure on macOS (#1748)

* Add files via upload

Add macOS build support

* Add files via upload

Add macOS support

* Add files via upload

target_link_directories(funasr PUBLIC ${ONNXRUNTIME_DIR}/lib)
target_link_directories(funasr PUBLIC ${FFMPEG_DIR}/lib)
Wrap with an if(APPLE) guard

---------

Co-authored-by: Yabin Li <wucong.lyb@alibaba-inc.com>

* Delete docs/images/wechat.png

* Add files via upload

* fixed the issues with seaco-onnx timestamps

* fix bug (#1764)

When the ASR result contains `http`, the punctuation model mistakes it for a URL

* fix empty asr result (#1765)

For speech segments whose decoding result is empty, use an empty string for text

* update export

* update export

* docs

* docs

* update export name

* docs

* update

* docs

* docs

* keep empty speech result (#1772)

* docs

* docs

* update wechat QRcode

* Add python funasr api support for websocket srv (#1777)

* add python funasr_api support

* minor changes to README.md

* add core tools stream

* minor modifications

* fix bug for timeout

* support for buffer decode

* add ffmpeg decode for buffer

* libtorch demo

* update libtorch infer

* update utils

* update demo

* update demo

* update libtorch inference

* update model class

* update seaco paraformer

* bug fix

* bug fix

* auto frontend

* auto frontend

* auto frontend

* auto frontend

* auto frontend

* auto frontend

* auto frontend

* auto frontend

* Dev gzf exp (#1785)

* resume from step

* batch

* batch

* batch

* batch

* batch

* batch

* batch

* batch

* batch

* batch

* batch

* batch

* batch

* batch

* batch

* train_loss_avg train_acc_avg

* train_loss_avg train_acc_avg

* train_loss_avg train_acc_avg

* log step

* wav does not exist

* wav does not exist

* decoding

* decoding

* decoding

* wechat

* decoding key

* decoding key

* decoding key

* decoding key

* decoding key

* decoding key

* dynamic batch

* start_data_split_i=0

* total_time/accum_grad

* total_time/accum_grad

* total_time/accum_grad

* update avg slice

* update avg slice

* sensevoice sanm

* sensevoice sanm

* sensevoice sanm

---------

Co-authored-by: 北念 <lzr265946@alibaba-inc.com>

* auto frontend

* update paraformer timestamp

* [Optimization] support bladedisc fp16 optimization (#1790)

* add cif_v1 and cif_export

* Update SDK_advanced_guide_offline_zh.md

* add cif_wo_hidden_v1

* [fix] fix empty asr result (#1794)

* english timestamp for vanilla paraformer

* wechat

* [fix] better solution for handling empty result (#1796)

* update scripts

* modify the qformer adaptor (#1804)

Co-authored-by: nichongjia-2007 <nichongjia@gmail.com>

* add ctc inference code (#1806)

Co-authored-by: haoneng.lhn <haoneng.lhn@alibaba-inc.com>

* Update auto_model.py

Fix the bug where an empty string entering the speaker model raised an undefined raw_text variable error

* Update auto_model.py

Fix undefined variables in spk_model after an empty string is recognized

* update model name

* fix parameter 'quantize' unused issue (#1813)

Co-authored-by: ZihanLiao <liaozihan1@xdf.cn>

* wechat

* Update cif_predictor.py (#1811)

* Update cif_predictor.py

* modify cif_v1_export

in extreme cases, max_label_len calculated from batch_len misaligns with token_num

* Update cif_predictor.py

torch.cumsum precision degradation, using float64 instead
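
As context for this change, a minimal standalone sketch (not part of the commit) of why float64 helps: float32 cumsum accumulates rounding error over long sequences, which can shift the integer boundaries CIF derives from the running sum.

import torch

alphas = torch.full((60000,), 0.1)  # hypothetical long sequence of CIF alphas
sum32 = torch.cumsum(alphas, dim=0)[-1]
sum64 = torch.cumsum(alphas.double(), dim=0)[-1]
print(sum32.item(), sum64.item())  # the float32 total drifts from 6000.0; float64 stays close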

* update code

---------

Co-authored-by: 雾聪 <wucong.lyb@alibaba-inc.com>
Co-authored-by: zhaomingwork <61895407+zhaomingwork@users.noreply.github.com>
Co-authored-by: szsteven008 <97944818+szsteven008@users.noreply.github.com>
Co-authored-by: Ephemeroptera <605686962@qq.com>
Co-authored-by: 彭震东 <zhendong.peng@qq.com>
Co-authored-by: Shi Xian <40013335+R1ckShi@users.noreply.github.com>
Co-authored-by: 维石 <shixian.shi@alibaba-inc.com>
Co-authored-by: 北念 <lzr265946@alibaba-inc.com>
Co-authored-by: xiaowan0322 <wanchen.swc@alibaba-inc.com>
Co-authored-by: zhuangzhong <zhuangzhong@corp.netease.com>
Co-authored-by: Xingchen Song(宋星辰) <xingchensong1996@163.com>
Co-authored-by: nichongjia-2007 <nichongjia@gmail.com>
Co-authored-by: haoneng.lhn <haoneng.lhn@alibaba-inc.com>
Co-authored-by: liugz18 <57401541+liugz18@users.noreply.github.com>
Co-authored-by: Marlowe <54339989+ZihanLiao@users.noreply.github.com>
Co-authored-by: ZihanLiao <liaozihan1@xdf.cn>
Co-authored-by: zhong zhuang <zhuangz@lamda.nju.edu.cn>
zhifu gao 2024-06-19 10:27:21 +08:00 committed by GitHub
parent de0b35b378
commit ad99b262eb
39 changed files with 514 additions and 168 deletions

Binary file not shown (image: 187 KiB before, 186 KiB after).

View File

@@ -12,17 +12,17 @@ model = AutoModel(
device="cpu",
)
-res = model.export(type="onnx", quantize=False)
+res = model.export(type="torchscripts", quantize=False)
print(res)
-# method2, inference from local path
-from funasr import AutoModel
+# # method2, inference from local path
+# from funasr import AutoModel
-model = AutoModel(
-model="/Users/zhifu/.cache/modelscope/hub/iic/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch",
-device="cpu",
-)
+# model = AutoModel(
+# model="/Users/zhifu/.cache/modelscope/hub/iic/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch",
+# device="cpu",
+# )
-res = model.export(type="onnx", quantize=False)
-print(res)
+# res = model.export(type="onnx", quantize=False)
+# print(res)

View File

@@ -6,6 +6,7 @@
import sys
from funasr import AutoModel
model_dir = "/Users/zhifu/Downloads/modelscope_models/ctc_model"
input_file = (
"https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/asr_example_zh.wav"

View File

@@ -10,19 +10,20 @@
from funasr import AutoModel
model = AutoModel(
-model="iic/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch",
+model="iic/speech_paraformer-large-contextual_asr_nat-zh-cn-16k-common-vocab8404",
)
-res = model.export(type="onnx", quantize=False)
+res = model.export(type="torchscripts", quantize=False)
+# res = model.export(type="bladedisc", input=f"{model.model_path}/example/asr_example.wav")
print(res)
-# method2, inference from local path
-from funasr import AutoModel
+# # method2, inference from local path
+# from funasr import AutoModel
-model = AutoModel(
-model="/Users/zhifu/.cache/modelscope/hub/iic/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch"
-)
+# model = AutoModel(
+# model="/Users/zhifu/.cache/modelscope/hub/iic/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch"
+# )
-res = model.export(type="onnx", quantize=False)
-print(res)
+# res = model.export(type="onnx", quantize=False)
+# print(res)

View File

@@ -324,7 +324,7 @@ class AutoModel:
input, input_len=input_len, model=self.vad_model, kwargs=self.vad_kwargs, **cfg
)
end_vad = time.time()
# FIX(gcf): concat the vad clips for sense vocie model for better aed
if kwargs.get("merge_vad", False):
for i in range(len(res)):
@@ -467,23 +467,20 @@
else:
result[k] += restored_data[j][k]
-if not len(result["text"].strip()):
-continue
return_raw_text = kwargs.get("return_raw_text", False)
# step.3 compute punc model
+raw_text = None
if self.punc_model is not None:
-deep_update(self.punc_kwargs, cfg)
-punc_res = self.inference(
-result["text"], model=self.punc_model, kwargs=self.punc_kwargs, **cfg
-)
-raw_text = copy.copy(result["text"])
-if return_raw_text:
-result["raw_text"] = raw_text
-result["text"] = punc_res[0]["text"]
-else:
-raw_text = None
+if not len(result["text"].strip()):
+if return_raw_text:
+result["raw_text"] = ""
+else:
+deep_update(self.punc_kwargs, cfg)
+punc_res = self.inference(
+result["text"], model=self.punc_model, kwargs=self.punc_kwargs, **cfg
+)
+raw_text = copy.copy(result["text"])
+if return_raw_text:
+result["raw_text"] = raw_text
+result["text"] = punc_res[0]["text"]
# speaker embedding cluster after resorted
if self.spk_model is not None and kwargs.get("return_spk_res", True):
@@ -605,12 +602,6 @@ class AutoModel:
)
with torch.no_grad():
-if type == "onnx":
-export_dir = export_utils.export_onnx(model=model, data_in=data_list, **kwargs)
-else:
-export_dir = export_utils.export_torchscripts(
-model=model, data_in=data_list, **kwargs
-)
+export_dir = export_utils.export(model=model, data_in=data_list, **kwargs)
return export_dir

View File

@@ -64,8 +64,6 @@ class AudioLLMNARDataset(torch.utils.data.Dataset):
def __getitem__(self, index):
item = self.index_ds[index]
-# import pdb;
-# pdb.set_trace()
source = item["source"]
data_src = load_audio_text_image_video(source, fs=self.fs)
if self.preprocessor_speech:

View File

@@ -66,8 +66,6 @@ class AudioLLMQwenAudioDataset(torch.utils.data.Dataset):
def __getitem__(self, index):
item = self.index_ds[index]
-# import pdb;
-# pdb.set_trace()
source = item["source"]
data_src = load_audio_text_image_video(source, fs=self.fs)
if self.preprocessor_speech:

View File

@@ -66,8 +66,6 @@ class AudioLLMVicunaDataset(torch.utils.data.Dataset):
def __getitem__(self, index):
item = self.index_ds[index]
-# import pdb;
-# pdb.set_trace()
source = item["source"]
data_src = load_audio_text_image_video(source, fs=self.fs)
if self.preprocessor_speech:

View File

@@ -72,8 +72,6 @@ class SenseVoiceDataset(torch.utils.data.Dataset):
return len(self.index_ds)
def __getitem__(self, index):
-# import pdb;
-# pdb.set_trace()
output = None
for idx in range(self.retry):

View File

@@ -235,7 +235,6 @@ class MultiChannelFrontend(nn.Module):
self, input: torch.Tensor, input_lengths: torch.Tensor
) -> Tuple[torch.Tensor, torch.Tensor]:
# 1. Domain-conversion: e.g. Stft: time -> time-freq
-# import pdb;pdb.set_trace()
if self.stft is not None:
input_stft, feats_lens = self._compute_stft(input, input_lengths)
else:

View File

@@ -198,7 +198,7 @@ class CifPredictorV3(torch.nn.Module):
output2 = self.upsample_cnn(_output)
output2 = output2.transpose(1, 2)
output2, _ = self.self_attn(output2, mask)
-# import pdb; pdb.set_trace()
alphas2 = torch.sigmoid(self.cif_output2(output2))
alphas2 = torch.nn.functional.relu(alphas2 * self.smooth_factor2 - self.noise_threshold2)
# repeat the mask in T demension to match the upsampled length

View File

@@ -29,7 +29,8 @@ def export_rebuild_model(model, **kwargs):
model.export_input_names = types.MethodType(export_input_names, model)
model.export_output_names = types.MethodType(export_output_names, model)
model.export_dynamic_axes = types.MethodType(export_dynamic_axes, model)
-model.export_name = types.MethodType(export_name, model)
+model.export_name = "model"
return model

View File

@@ -424,7 +424,6 @@ class ContextualParaformerDecoderExport(torch.nn.Module):
# contextual_mask = myutils.sequence_mask(contextual_length, device=memory.device)[:, None, :]
contextual_mask = self.make_pad_mask(contextual_length)
contextual_mask, _ = self.prepare_mask(contextual_mask)
-# import pdb; pdb.set_trace()
contextual_mask = contextual_mask.transpose(2, 1).unsqueeze(1)
cx, tgt_mask, _, _, _ = self.bias_decoder(
x_self_attn, tgt_mask, bias_embed, memory_mask=contextual_mask

View File

@@ -16,6 +16,21 @@ class ContextualEmbedderExport2(ContextualEmbedderExport):
self.embedding = model.bias_embed
model.bias_encoder.batch_first = False
self.bias_encoder = model.bias_encoder
def export_dummy_inputs(self):
hotword = torch.tensor(
[
[10, 11, 12, 13, 14, 10, 11, 12, 13, 14],
[100, 101, 0, 0, 0, 0, 0, 0, 0, 0],
[1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[10, 11, 12, 13, 14, 10, 11, 12, 13, 14],
[100, 101, 0, 0, 0, 0, 0, 0, 0, 0],
[1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
],
dtype=torch.int32,
)
# hotword_length = torch.tensor([10, 2, 1], dtype=torch.int32)
return (hotword)
def export_rebuild_model(model, **kwargs):
@@ -59,7 +74,9 @@ def export_rebuild_model(model, **kwargs):
backbone_model.export_dynamic_axes = types.MethodType(
export_backbone_dynamic_axes, backbone_model
)
-backbone_model.export_name = types.MethodType(export_backbone_name, backbone_model)
+embedder_model.export_name = "model_eb"
+backbone_model.export_name = "model"
return backbone_model, embedder_model

View File

@@ -23,8 +23,6 @@ from funasr.utils import postprocess_utils
from funasr.utils.datadir_writer import DatadirWriter
from funasr.register import tables
-import pdb
@tables.register("model_classes", "LCBNet")
class LCBNet(nn.Module):

View File

@@ -168,8 +168,6 @@ class LLMASR(nn.Module):
text: (Batch, Length)
text_lengths: (Batch,)
"""
-# import pdb;
-# pdb.set_trace()
if len(text_lengths.size()) > 1:
text_lengths = text_lengths[:, 0]
if len(speech_lengths.size()) > 1:

View File

@@ -166,8 +166,6 @@ class LLMASRNAR(nn.Module):
text: (Batch, Length)
text_lengths: (Batch,)
"""
-# import pdb;
-# pdb.set_trace()
if len(text_lengths.size()) > 1:
text_lengths = text_lengths[:, 0]
if len(speech_lengths.size()) > 1:

View File

@@ -34,7 +34,6 @@ from funasr.models.transformer.utils.subsampling import Conv2dSubsampling8
from funasr.models.transformer.utils.subsampling import TooShortUttError
from funasr.models.transformer.utils.subsampling import check_short_utt
from funasr.models.encoder.abs_encoder import AbsEncoder
-import pdb
import math
@@ -363,7 +362,6 @@ class MFCCAEncoder(AbsEncoder):
t_leng = xs_pad.size(1)
d_dim = xs_pad.size(2)
xs_pad = xs_pad.reshape(-1, channel_size, t_leng, d_dim)
-# pdb.set_trace()
if channel_size < 8:
repeat_num = math.ceil(8 / channel_size)
xs_pad = xs_pad.repeat(1, repeat_num, 1, 1)[:, 0:8, :, :]

View File

@@ -494,6 +494,8 @@ class CifPredictorV2Export(torch.nn.Module):
token_num_floor = torch.floor(token_num)
return hidden, alphas, token_num_floor
+@torch.jit.script
def cif_v1_export(hidden, alphas, threshold: float):
device = hidden.device
@@ -516,9 +518,7 @@ def cif_v1_export(hidden, alphas, threshold: float):
fires[fire_idxs] = 1
fires = fires + prefix_sum - prefix_sum_floor
-prefix_sum_hidden = torch.cumsum(
-alphas.unsqueeze(-1).tile((1, 1, hidden_size)) * hidden, dim=1
-)
+prefix_sum_hidden = torch.cumsum(alphas.unsqueeze(-1).tile((1, 1, hidden_size)) * hidden, dim=1)
frames = prefix_sum_hidden[fire_idxs]
shift_frames = torch.roll(frames, 1, dims=0)
@@ -530,9 +530,7 @@ def cif_v1_export(hidden, alphas, threshold: float):
shift_frames[shift_batch_idxs] = 0
remains = fires - torch.floor(fires)
-remain_frames = (
-remains[fire_idxs].unsqueeze(-1).tile((1, hidden_size)) * hidden[fire_idxs]
-)
+remain_frames = remains[fire_idxs].unsqueeze(-1).tile((1, hidden_size)) * hidden[fire_idxs]
shift_remain_frames = torch.roll(remain_frames, 1, dims=0)
shift_remain_frames[shift_batch_idxs] = 0
@@ -541,14 +539,13 @@ def cif_v1_export(hidden, alphas, threshold: float):
max_label_len = batch_len.max()
-frame_fires = torch.zeros(
-batch_size, max_label_len, hidden_size, dtype=dtype, device=device
-)
+frame_fires = torch.zeros(batch_size, max_label_len, hidden_size, dtype=dtype, device=device)
indices = torch.arange(max_label_len, device=device).expand(batch_size, -1)
frame_fires_idxs = indices < batch_len.unsqueeze(1)
frame_fires[frame_fires_idxs] = frames
return frame_fires, fires
+@torch.jit.script
def cif_export(hidden, alphas, threshold: float):
batch_size, len_time, hidden_size = hidden.size()
@@ -692,11 +689,8 @@ def cif_v1(hidden, alphas, threshold):
device = hidden.device
dtype = hidden.dtype
batch_size, len_time, hidden_size = hidden.size()
-frames = torch.zeros(batch_size, len_time, hidden_size,
-dtype=dtype, device=device)
-prefix_sum_hidden = torch.cumsum(
-alphas.unsqueeze(-1).tile((1, 1, hidden_size)) * hidden, dim=1
-)
+frames = torch.zeros(batch_size, len_time, hidden_size, dtype=dtype, device=device)
+prefix_sum_hidden = torch.cumsum(alphas.unsqueeze(-1).tile((1, 1, hidden_size)) * hidden, dim=1)
frames = prefix_sum_hidden[fire_idxs]
shift_frames = torch.roll(frames, 1, dims=0)
@@ -708,10 +702,7 @@ def cif_v1(hidden, alphas, threshold):
shift_frames[shift_batch_idxs] = 0
remains = fires - torch.floor(fires)
-remain_frames = (
-remains[fire_idxs].unsqueeze(-1).tile((1,
-hidden_size)) * hidden[fire_idxs]
-)
+remain_frames = remains[fire_idxs].unsqueeze(-1).tile((1, hidden_size)) * hidden[fire_idxs]
shift_remain_frames = torch.roll(remain_frames, 1, dims=0)
shift_remain_frames[shift_batch_idxs] = 0
@@ -720,9 +711,7 @@ def cif_v1(hidden, alphas, threshold):
max_label_len = batch_len.max()
-frame_fires = torch.zeros(
-batch_size, max_label_len, hidden_size, dtype=dtype, device=device
-)
+frame_fires = torch.zeros(batch_size, max_label_len, hidden_size, dtype=dtype, device=device)
indices = torch.arange(max_label_len, device=device).expand(batch_size, -1)
frame_fires_idxs = indices < batch_len.unsqueeze(1)
frame_fires[frame_fires_idxs] = frames

View File

@@ -31,6 +31,7 @@ def export_rebuild_model(model, **kwargs):
model.export_dynamic_axes = types.MethodType(export_dynamic_axes, model)
model.export_name = types.MethodType(export_name, model)
+model.export_name = 'model'
return model

View File

@@ -50,8 +50,6 @@ class ParaformerStreaming(Paraformer):
super().__init__(*args, **kwargs)
-# import pdb;
-# pdb.set_trace()
self.sampling_ratio = kwargs.get("sampling_ratio", 0.2)
self.scama_mask = None
@@ -83,8 +81,6 @@ class ParaformerStreaming(Paraformer):
text: (Batch, Length)
text_lengths: (Batch,)
"""
-# import pdb;
-# pdb.set_trace()
decoding_ind = kwargs.get("decoding_ind")
if len(text_lengths.size()) > 1:
text_lengths = text_lengths[:, 0]

View File

@@ -780,7 +780,7 @@ class MultiHeadedAttentionCrossAttExport(nn.Module):
return q, k, v
def forward_attention(self, value, scores, mask, ret_attn):
-scores = scores + mask
+scores = scores + mask.to(scores.device)
self.attn = torch.softmax(scores, dim=-1)
context_layer = torch.matmul(self.attn, value) # (batch, head, time1, d_k)

View File

@@ -109,7 +109,9 @@ def export_rebuild_model(model, **kwargs):
backbone_model.export_dynamic_axes = types.MethodType(
export_backbone_dynamic_axes, backbone_model
)
-backbone_model.export_name = types.MethodType(export_backbone_name, backbone_model)
+embedder_model.export_name = "model_eb"
+backbone_model.export_name = "model"
return backbone_model, embedder_model
@@ -198,6 +200,3 @@ def export_backbone_dynamic_axes(self):
"us_cif_peak": {0: "batch_size", 1: "alphas_length"},
}
-def export_backbone_name(self):
-return "model.onnx"

View File

@@ -74,8 +74,6 @@ class SenseVoice(nn.Module):
):
target_mask = kwargs.get("target_mask", None)
-# import pdb;
-# pdb.set_trace()
if len(text_lengths.size()) > 1:
text_lengths = text_lengths[:, 0]
if len(speech_lengths.size()) > 1:
@@ -304,8 +302,6 @@ class SenseVoiceRWKV(nn.Module):
):
target_mask = kwargs.get("target_mask", None)
-# import pdb;
-# pdb.set_trace()
if len(text_lengths.size()) > 1:
text_lengths = text_lengths[:, 0]
if len(speech_lengths.size()) > 1:
@@ -649,8 +645,6 @@ class SenseVoiceFSMN(nn.Module):
):
target_mask = kwargs.get("target_mask", None)
-# import pdb;
-# pdb.set_trace()
if len(text_lengths.size()) > 1:
text_lengths = text_lengths[:, 0]
if len(speech_lengths.size()) > 1:
@@ -1054,8 +1048,6 @@ class SenseVoiceSANM(nn.Module):
):
target_mask = kwargs.get("target_mask", None)
-# import pdb;
-# pdb.set_trace()
if len(text_lengths.size()) > 1:
text_lengths = text_lengths[:, 0]
if len(speech_lengths.size()) > 1:
if len(speech_lengths.size()) > 1:
@@ -1594,15 +1586,25 @@ class SenseVoiceSANMCTC(nn.Module):
language = kwargs.get("language", None)
if language is not None:
-language_query = self.embed(torch.LongTensor([[self.lid_dict[language] if language in self.lid_dict else 0]]).to(speech.device)).repeat(speech.size(0), 1, 1)
+language_query = self.embed(
+torch.LongTensor(
+[[self.lid_dict[language] if language in self.lid_dict else 0]]
+).to(speech.device)
+).repeat(speech.size(0), 1, 1)
else:
-language_query = self.embed(torch.LongTensor([[0]]).to(speech.device)).repeat(speech.size(0), 1, 1)
+language_query = self.embed(torch.LongTensor([[0]]).to(speech.device)).repeat(
+speech.size(0), 1, 1
+)
textnorm = kwargs.get("text_norm", "wotextnorm")
-textnorm_query = self.embed(torch.LongTensor([[self.textnorm_dict[textnorm]]]).to(speech.device)).repeat(speech.size(0), 1, 1)
+textnorm_query = self.embed(
+torch.LongTensor([[self.textnorm_dict[textnorm]]]).to(speech.device)
+).repeat(speech.size(0), 1, 1)
speech = torch.cat((textnorm_query, speech), dim=1)
speech_lengths += 1
-event_emo_query = self.embed(torch.LongTensor([[1, 2]]).to(speech.device)).repeat(speech.size(0), 1, 1)
+event_emo_query = self.embed(torch.LongTensor([[1, 2]]).to(speech.device)).repeat(
+speech.size(0), 1, 1
+)
input_query = torch.cat((language_query, event_emo_query), dim=1)
speech = torch.cat((input_query, speech), dim=1)
speech_lengths += 3

View File

@@ -145,8 +145,6 @@ class Transformer(nn.Module):
text: (Batch, Length)
text_lengths: (Batch,)
"""
-# import pdb;
-# pdb.set_trace()
if len(text_lengths.size()) > 1:
text_lengths = text_lengths[:, 0]
if len(speech_lengths.size()) > 1:
if len(speech_lengths.size()) > 1:

View File

@@ -7,7 +7,10 @@ import torch
import torch.nn.functional as F
from torch import Tensor
from torch import nn
+import whisper
+# import whisper_timestamped as whisper
from funasr.utils.load_utils import load_audio_text_image_video, extract_fbank
from funasr.register import tables
@@ -108,8 +111,10 @@ class WhisperWarp(nn.Module):
# decode the audio
options = whisper.DecodingOptions(**kwargs.get("DecodingOptions", {}))
-result = whisper.decode(self.model, speech, options)
+result = whisper.decode(self.model, speech, language='english')
+# result = whisper.transcribe(self.model, speech)
results = []
result_i = {"key": key[0], "text": result.text}

View File

@@ -140,8 +140,6 @@ class OpenAIWhisperModel(nn.Module):
text: (Batch, Length)
text_lengths: (Batch,)
"""
-# import pdb;
-# pdb.set_trace()
if len(text_lengths.size()) > 1:
text_lengths = text_lengths[:, 0]
if len(speech_lengths.size()) > 1:
if len(speech_lengths.size()) > 1:

View File

@@ -1,8 +1,14 @@
import os
import torch
+import functools
+try:
+import torch_blade
+except Exception as e:
+print(f"failed to load torch_blade: {e}")
-def export_onnx(model, data_in=None, quantize: bool = False, opset_version: int = 14, **kwargs):
+def export(model, data_in=None, quantize: bool = False, opset_version: int = 14, type='onnx', **kwargs):
model_scripts = model.export(**kwargs)
export_dir = kwargs.get("output_dir", os.path.dirname(kwargs.get("init_param")))
os.makedirs(export_dir, exist_ok=True)
@@ -11,14 +17,32 @@ def export_onnx(model, data_in=None, quantize: bool = False, opset_version: int
model_scripts = (model_scripts,)
for m in model_scripts:
m.eval()
-_onnx(
-m,
-data_in=data_in,
-quantize=quantize,
-opset_version=opset_version,
-export_dir=export_dir,
-**kwargs
-)
+if type == 'onnx':
+_onnx(
+m,
+data_in=data_in,
+quantize=quantize,
+opset_version=opset_version,
+export_dir=export_dir,
+**kwargs
+)
+elif type == 'torchscripts':
+device = 'cuda' if torch.cuda.is_available() else 'cpu'
+print("Exporting torchscripts on device {}".format(device))
+_torchscripts(
+m,
+path=export_dir,
+device=device
+)
+elif type == "bladedisc":
+assert (
+torch.cuda.is_available()
+), "Currently bladedisc optimization for FunASR only supports GPU"
+# bladedisc only optimizes encoder/decoder modules
+if hasattr(m, "encoder") and hasattr(m, "decoder"):
+_bladedisc_opt_for_encdec(m, path=export_dir, enable_fp16=True)
+else:
+_torchscripts(m, path=export_dir, device="cuda")
print("output dir: {}".format(export_dir))
return export_dir
@@ -37,7 +61,7 @@ def _onnx(
verbose = kwargs.get("verbose", False)
-export_name = model.export_name() if hasattr(model, "export_name") else "model.onnx"
+export_name = model.export_name + '.onnx'
model_path = os.path.join(export_dir, export_name)
torch.onnx.export(
model,
@@ -70,3 +94,106 @@ def _onnx(
weight_type=QuantType.QUInt8,
nodes_to_exclude=nodes_to_exclude,
)
def _torchscripts(model, path, device='cuda'):
dummy_input = model.export_dummy_inputs()
if device == 'cuda':
model = model.cuda()
if isinstance(dummy_input, torch.Tensor):
dummy_input = dummy_input.cuda()
else:
dummy_input = tuple([i.cuda() for i in dummy_input])
model_script = torch.jit.trace(model, dummy_input)
model_script.save(os.path.join(path, f'{model.export_name}.torchscripts'))
def _bladedisc_opt(model, model_inputs, enable_fp16=True):
model = model.eval()
torch_config = torch_blade.config.Config()
torch_config.enable_fp16 = enable_fp16
with torch.no_grad(), torch_config:
opt_model = torch_blade.optimize(
model,
allow_tracing=True,
model_inputs=model_inputs,
)
return opt_model
def _rescale_input_hook(m, x, scale):
if len(x) > 1:
return (x[0] / scale, *x[1:])
else:
return (x[0] / scale,)
def _rescale_output_hook(m, x, y, scale):
if isinstance(y, tuple):
return (y[0] / scale, *y[1:])
else:
return y / scale
def _rescale_encoder_model(model, input_data):
# Calculate absmax
absmax = torch.tensor(0).cuda()
def stat_input_hook(m, x, y):
val = x[0] if isinstance(x, tuple) else x
absmax.copy_(torch.max(absmax, val.detach().abs().max()))
encoders = model.encoder.model.encoders
hooks = [m.register_forward_hook(stat_input_hook) for m in encoders]
model = model.cuda()
model(*input_data)
for h in hooks:
h.remove()
# Rescale encoder modules
fp16_scale = int(2 * absmax // 65536)
print(f"rescale encoder modules with factor={fp16_scale}")
model.encoder.model.encoders0.register_forward_pre_hook(
functools.partial(_rescale_input_hook, scale=fp16_scale),
)
for name, m in model.encoder.model.named_modules():
if name.endswith("self_attn"):
m.register_forward_hook(
functools.partial(_rescale_output_hook, scale=fp16_scale)
)
if name.endswith("feed_forward.w_2"):
state_dict = {k: v / fp16_scale for k, v in m.state_dict().items()}
m.load_state_dict(state_dict)
def _bladedisc_opt_for_encdec(model, path, enable_fp16):
# Get input data
# TODO: better to use real data
input_data = model.export_dummy_inputs()
if isinstance(input_data, torch.Tensor):
input_data = input_data.cuda()
else:
input_data = tuple([i.cuda() for i in input_data])
# Get input data for decoder module
decoder_inputs = list()
def get_input_hook(m, x):
decoder_inputs.extend(list(x))
hook = model.decoder.register_forward_pre_hook(get_input_hook)
model = model.cuda()
model(*input_data)
hook.remove()
# Prevent FP16 overflow
if enable_fp16:
_rescale_encoder_model(model, input_data)
# Export and optimize encoder/decoder modules
model.encoder = _bladedisc_opt(model.encoder, input_data[:2])
model.decoder = _bladedisc_opt(model.decoder, tuple(decoder_inputs))
model_script = torch.jit.trace(model, input_data)
model_script.save(os.path.join(path, f"{model.export_name}_blade.torchscripts"))
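
As a usage sketch of the dispatch added above (the model name is borrowed from the demo files in this commit, not from this hunk):

from funasr import AutoModel

model = AutoModel(model="iic/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch")
# type routes to _onnx, _torchscripts, or the bladedisc branch above
res = model.export(type="torchscripts", quantize=False)
print(res)  # the export directory returned by export()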

View File

@@ -1,17 +0,0 @@
from funasr_torch import Paraformer
model_dir = (
"/nfs/zhifu.gzf/export/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch"
)
model = Paraformer(model_dir, batch_size=1) # cpu
# model = Paraformer(model_dir, batch_size=1, device_id=0) # gpu
# when using paraformer-large-vad-punc model, you can set plot_timestamp_to="./xx.png" to get figure of alignment besides timestamps
# model = Paraformer(model_dir, batch_size=1, plot_timestamp_to="test.png")
wav_path = "YourPath/xx.wav"
result = model(wav_path)
print(result)

View File

@@ -0,0 +1,13 @@
import torch
from pathlib import Path
from funasr_torch.paraformer_bin import ContextualParaformer
model_dir = "iic/speech_paraformer-large-contextual_asr_nat-zh-cn-16k-common-vocab8404"
device_id = 0 if torch.cuda.is_available() else -1
model = ContextualParaformer(model_dir, batch_size=1, device_id=device_id) # gpu
wav_path = "{}/.cache/modelscope/hub/{}/example/asr_example.wav".format(Path.home(), model_dir)
hotwords = "你的热词 魔搭"
result = model(wav_path, hotwords)
print(result)

View File

@@ -0,0 +1,11 @@
from pathlib import Path
from funasr_torch.paraformer_bin import Paraformer
model_dir = "iic/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch"
model = Paraformer(model_dir, batch_size=1) # cpu
# model = Paraformer(model_dir, batch_size=1, device_id=0) # gpu
wav_path = "{}/.cache/modelscope/hub/{}/example/asr_example.wav".format(Path.home(), model_dir)
result = model(wav_path)
print(result)

View File

@@ -0,0 +1,13 @@
import torch
from pathlib import Path
from funasr_torch.paraformer_bin import SeacoParaformer
model_dir = "iic/speech_seaco_paraformer_large_asr_nat-zh-cn-16k-common-vocab8404-pytorch"
device_id = 0 if torch.cuda.is_available() else -1
model = SeacoParaformer(model_dir, batch_size=1, device_id=device_id) # gpu
wav_path = "{}/.cache/modelscope/hub/{}/example/asr_example.wav".format(Path.home(), model_dir)
hotwords = "你的热词 魔搭"
result = model(wav_path, hotwords)
print(result)

View File

@@ -1,23 +1,29 @@
# -*- encoding: utf-8 -*-
+import json
+import copy
+import torch
import os.path
+import librosa
+import numpy as np
from pathlib import Path
from typing import List, Union, Tuple
-import copy
-import librosa
-import numpy as np
-from .utils.utils import CharTokenizer, Hypothesis, TokenIDConverter, get_logger, read_yaml
-from .utils.postprocess_utils import sentence_postprocess
+from .utils.utils import pad_list
from .utils.frontend import WavFrontend
from .utils.timestamp_utils import time_stamp_lfr6_onnx
+from .utils.postprocess_utils import sentence_postprocess
+from .utils.utils import CharTokenizer, Hypothesis, TokenIDConverter, get_logger, read_yaml
logging = get_logger()
-import torch
class Paraformer:
"""
Author: Speech Lab of DAMO Academy, Alibaba Group
Paraformer: Fast and Accurate Parallel Transformer for Non-autoregressive End-to-End Speech Recognition
https://arxiv.org/abs/2206.08317
"""
def __init__(
self,
model_dir: Union[str, Path] = None,
@@ -25,20 +31,42 @@ class Paraformer:
device_id: Union[str, int] = "-1",
plot_timestamp_to: str = "",
quantize: bool = False,
-intra_op_num_threads: int = 1,
+cache_dir: str = None,
**kwargs,
):
if not Path(model_dir).exists():
-raise FileNotFoundError(f"{model_dir} does not exist.")
+try:
+from modelscope.hub.snapshot_download import snapshot_download
+except:
+raise "You are exporting model from modelscope, please install modelscope and try it again. To install modelscope, you could:\n" "\npip3 install -U modelscope\n" "For the users in China, you could install with the command:\n" "\npip3 install -U modelscope -i https://mirror.sjtu.edu.cn/pypi/web/simple"
+try:
+model_dir = snapshot_download(model_dir, cache_dir=cache_dir)
+except:
+raise "model_dir must be model_name in modelscope or local path downloaded from modelscope, but is {}".format(
+model_dir
+)
+model_file = os.path.join(model_dir, "model.torchscripts")
+if quantize:
+model_file = os.path.join(model_dir, "model_quant.torchscripts")
+if not os.path.exists(model_file):
+print(".torchscripts does not exist, begin to export torchscripts")
+try:
+from funasr import AutoModel
+except:
+raise "You are exporting onnx, please install funasr and try it again. To install funasr, you could:\n" "\npip3 install -U funasr\n" "For the users in China, you could install with the command:\n" "\npip3 install -U funasr -i https://mirror.sjtu.edu.cn/pypi/web/simple"
+model = AutoModel(model=model_dir)
+model_dir = model.export(type="torchscript", quantize=quantize, **kwargs)
config_file = os.path.join(model_dir, "config.yaml")
cmvn_file = os.path.join(model_dir, "am.mvn")
config = read_yaml(config_file)
+token_list = os.path.join(model_dir, "tokens.json")
+with open(token_list, "r", encoding="utf-8") as f:
+token_list = json.load(f)
-self.converter = TokenIDConverter(config["token_list"])
+self.converter = TokenIDConverter(token_list)
self.tokenizer = CharTokenizer()
self.frontend = WavFrontend(cmvn_file=cmvn_file, **config["frontend_conf"])
self.ort_infer = torch.jit.load(model_file)
@@ -49,6 +77,10 @@ class Paraformer:
self.pred_bias = config["model_conf"]["predictor_bias"]
else:
self.pred_bias = 0
+if "lang" in config:
+self.language = config["lang"]
+else:
+self.language = None
def __call__(self, wav_content: Union[str, np.ndarray, List[str]], **kwargs) -> List:
waveform_list = self.load_data(wav_content, self.frontend.opts.frame_opts.samp_freq)
@@ -203,3 +235,186 @@ class Paraformer:
token = token[: valid_token_num - self.pred_bias]
# texts = sentence_postprocess(token)
return token
class ContextualParaformer(Paraformer):
"""
Author: Speech Lab of DAMO Academy, Alibaba Group
Paraformer: Fast and Accurate Parallel Transformer for Non-autoregressive End-to-End Speech Recognition
https://arxiv.org/abs/2206.08317
"""
def __init__(
self,
model_dir: Union[str, Path] = None,
batch_size: int = 1,
device_id: Union[str, int] = "-1",
plot_timestamp_to: str = "",
quantize: bool = False,
cache_dir: str = None,
**kwargs,
):
if not Path(model_dir).exists():
try:
from modelscope.hub.snapshot_download import snapshot_download
except:
raise "You are exporting model from modelscope, please install modelscope and try it again. To install modelscope, you could:\n" "\npip3 install -U modelscope\n" "For the users in China, you could install with the command:\n" "\npip3 install -U modelscope -i https://mirror.sjtu.edu.cn/pypi/web/simple"
try:
model_dir = snapshot_download(model_dir, cache_dir=cache_dir)
except:
raise "model_dir must be model_name in modelscope or local path downloaded from modelscope, but is {}".format(
model_dir
)
if quantize:
model_bb_file = os.path.join(model_dir, "model_bb_quant.torchscripts")
model_eb_file = os.path.join(model_dir, "model_eb_quant.torchscripts")
else:
model_bb_file = os.path.join(model_dir, "model_bb.torchscripts")
model_eb_file = os.path.join(model_dir, "model_eb.torchscripts")
if not (os.path.exists(model_eb_file) and os.path.exists(model_bb_file)):
print(".onnx does not exist, begin to export onnx")
try:
from funasr import AutoModel
except:
raise "You are exporting onnx, please install funasr and try it again. To install funasr, you could:\n" "\npip3 install -U funasr\n" "For the users in China, you could install with the command:\n" "\npip3 install -U funasr -i https://mirror.sjtu.edu.cn/pypi/web/simple"
model = AutoModel(model=model_dir)
model_dir = model.export(type="torchscripts", quantize=quantize, **kwargs)
config_file = os.path.join(model_dir, "config.yaml")
cmvn_file = os.path.join(model_dir, "am.mvn")
config = read_yaml(config_file)
token_list = os.path.join(model_dir, "tokens.json")
with open(token_list, "r", encoding="utf-8") as f:
token_list = json.load(f)
# revert token_list into vocab dict
self.vocab = {}
for i, token in enumerate(token_list):
self.vocab[token] = i
self.converter = TokenIDConverter(token_list)
self.tokenizer = CharTokenizer()
self.frontend = WavFrontend(cmvn_file=cmvn_file, **config["frontend_conf"])
self.ort_infer_bb = torch.jit.load(model_bb_file)
self.ort_infer_eb = torch.jit.load(model_eb_file)
self.device_id = device_id
self.batch_size = batch_size
self.plot_timestamp_to = plot_timestamp_to
if "predictor_bias" in config["model_conf"].keys():
self.pred_bias = config["model_conf"]["predictor_bias"]
else:
self.pred_bias = 0
def __call__(
self, wav_content: Union[str, np.ndarray, List[str]], hotwords: str, **kwargs
) -> List:
# make hotword list
hotwords, hotwords_length = self.proc_hotword(hotwords)
if int(self.device_id) != -1:
bias_embed = self.eb_infer(hotwords.cuda())
else:
bias_embed = self.eb_infer(hotwords)
# index from bias_embed
bias_embed = torch.transpose(bias_embed, 0, 1)
_ind = np.arange(0, len(hotwords)).tolist()
bias_embed = bias_embed[_ind, hotwords_length.tolist()]
waveform_list = self.load_data(wav_content, self.frontend.opts.frame_opts.samp_freq)
waveform_nums = len(waveform_list)
asr_res = []
for beg_idx in range(0, waveform_nums, self.batch_size):
end_idx = min(waveform_nums, beg_idx + self.batch_size)
feats, feats_len = self.extract_feat(waveform_list[beg_idx:end_idx])
bias_embed = torch.unsqueeze(bias_embed, 0).repeat(feats.shape[0], 1, 1)
try:
with torch.no_grad():
if int(self.device_id) == -1:
outputs = self.bb_infer(feats, feats_len, bias_embed)
am_scores, valid_token_lens = outputs[0], outputs[1]
else:
outputs = self.bb_infer(feats.cuda(), feats_len.cuda(), bias_embed.cuda())
am_scores, valid_token_lens = outputs[0].cpu(), outputs[1].cpu()
except:
# logging.warning(traceback.format_exc())
logging.warning("input wav is silence or noise")
preds = [""]
else:
preds = self.decode(am_scores, valid_token_lens)
for pred in preds:
pred = sentence_postprocess(pred)
asr_res.append({"preds": pred})
return asr_res
def proc_hotword(self, hotwords):
hotwords = hotwords.split(" ")
hotwords_length = [len(i) - 1 for i in hotwords]
hotwords_length.append(0)
hotwords_length = np.array(hotwords_length)
# hotwords.append('<s>')
def word_map(word):
hotwords = []
for c in word:
if c not in self.vocab.keys():
hotwords.append(8403)
logging.warning(
"oov character {} found in hotword {}, replaced by <unk>".format(c, word)
)
else:
hotwords.append(self.vocab[c])
return np.array(hotwords)
hotword_int = [word_map(i) for i in hotwords]
hotword_int.append(np.array([1]))
hotwords = pad_list(hotword_int, pad_value=0, max_len=10)
return torch.tensor(hotwords), hotwords_length
def bb_infer(
self, feats, feats_len, bias_embed
):
outputs = self.ort_infer_bb(feats, feats_len, bias_embed)
return outputs
def eb_infer(self, hotwords):
outputs = self.ort_infer_eb(hotwords.long())
return outputs
def decode(self, am_scores: np.ndarray, token_nums: int) -> List[str]:
return [
self.decode_one(am_score, token_num)
for am_score, token_num in zip(am_scores, token_nums)
]
def decode_one(self, am_score: np.ndarray, valid_token_num: int) -> List[str]:
yseq = am_score.argmax(axis=-1)
score = am_score.max(axis=-1)
score = np.sum(score, axis=-1)
# pad with mask tokens to ensure compatibility with sos/eos tokens
# asr_model.sos:1 asr_model.eos:2
yseq = np.array([1] + yseq.tolist() + [2])
hyp = Hypothesis(yseq=yseq, score=score)
# remove sos/eos and get results
last_pos = -1
token_int = hyp.yseq[1:last_pos].tolist()
# remove blank symbol id, which is assumed to be 0
token_int = list(filter(lambda x: x not in (0, 2), token_int))
# Change integer-ids to tokens
token = self.converter.ids2tokens(token_int)
token = token[: valid_token_num - self.pred_bias]
# texts = sentence_postprocess(token)
return token
class SeacoParaformer(ContextualParaformer):
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
# no difference with contextual_paraformer in method of calling onnx models

View File

@@ -7,7 +7,7 @@ def time_stamp_lfr6_onnx(us_cif_peak, char_list, begin_time=0.0, total_offset=-1
START_END_THRESHOLD = 5
MAX_TOKEN_DURATION = 30
TIME_RATE = 10.0 * 6 / 1000 / 3 # 3 times upsampled
-cif_peak = us_cif_peak.reshape(-1)
+cif_peak = us_cif_peak.reshape(-1).cpu()
num_frames = cif_peak.shape[-1]
if char_list[-1] == "</s>":
char_list = char_list[:-1]
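
For context on TIME_RATE above: each acoustic frame covers 10 ms, LFR stacks 6 frames, and the CIF peaks are 3x upsampled, so one upsampled frame spans 0.02 s. A sketch (the peak index is hypothetical, not from the diff):

TIME_RATE = 10.0 * 6 / 1000 / 3  # 0.02 seconds per upsampled frame
peak_index = 150                 # hypothetical CIF peak position
print(peak_index * TIME_RATE)    # 3.0 seconds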

View File

@@ -1,21 +1,25 @@
# -*- encoding: utf-8 -*-
import functools
import yaml
import logging
import pickle
import functools
import numpy as np
from pathlib import Path
from typing import Any, Dict, Iterable, List, NamedTuple, Set, Tuple, Union
import numpy as np
import yaml
import warnings
root_dir = Path(__file__).resolve().parent
logger_initialized = {}
def pad_list(xs, pad_value, max_len=None):
n_batch = len(xs)
if max_len is None:
max_len = max(x.size(0) for x in xs)
# pad = xs[0].new(n_batch, max_len, *xs[0].size()[1:]).fill_(pad_value)
# numpy format
pad = (np.zeros((n_batch, max_len)) + pad_value).astype(np.int32)
for i in range(n_batch):
pad[i, : xs[i].shape[0]] = xs[i]
return pad
class TokenIDConverter:
def __init__(
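
A small usage sketch for the pad_list helper added in this file (the id arrays are hypothetical hotword token ids; max_len=10 matches how proc_hotword calls it):

import numpy as np

xs = [np.array([10, 11, 12]), np.array([100, 101]), np.array([1])]
padded = pad_list(xs, pad_value=0, max_len=10)  # assumes the pad_list defined above is in scope
print(padded.shape)  # (3, 10): each row keeps its ids, right-padded with zeros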

View File

@@ -62,7 +62,7 @@ class Paraformer:
if quantize:
model_file = os.path.join(model_dir, "model_quant.onnx")
if not os.path.exists(model_file):
-print(".onnx is not exist, begin to export onnx")
+print(".onnx does not exist, begin to export onnx")
try:
from funasr import AutoModel
except:
@@ -285,7 +285,7 @@ class ContextualParaformer(Paraformer):
model_eb_file = os.path.join(model_dir, "model_eb.onnx")
if not (os.path.exists(model_eb_file) and os.path.exists(model_bb_file)):
-print(".onnx is not exist, begin to export onnx")
+print(".onnx does not exist, begin to export onnx")
try:
from funasr import AutoModel
except:
@@ -331,7 +331,6 @@ class ContextualParaformer(Paraformer):
# ) -> List:
# make hotword list
hotwords, hotwords_length = self.proc_hotword(hotwords)
-# import pdb; pdb.set_trace()
[bias_embed] = self.eb_infer(hotwords, hotwords_length)
# index from bias_embed
bias_embed = bias_embed.transpose(1, 0, 2)
@@ -411,10 +410,10 @@ class ContextualParaformer(Paraformer):
return np.array(hotwords)
hotword_int = [word_map(i) for i in hotwords]
-# import pdb; pdb.set_trace()
hotword_int.append(np.array([1]))
hotwords = pad_list(hotword_int, pad_value=0, max_len=10)
-# import pdb; pdb.set_trace()
return hotwords, hotwords_length
def bb_infer(

View File

@@ -54,7 +54,7 @@ class Paraformer:
encoder_model_file = os.path.join(model_dir, "model_quant.onnx")
decoder_model_file = os.path.join(model_dir, "decoder_quant.onnx")
if not os.path.exists(encoder_model_file) or not os.path.exists(decoder_model_file):
-print(".onnx is not exist, begin to export onnx")
+print(".onnx does not exist, begin to export onnx")
try:
from funasr import AutoModel
except:

View File

@@ -52,7 +52,7 @@ class CT_Transformer:
if quantize:
model_file = os.path.join(model_dir, "model_quant.onnx")
if not os.path.exists(model_file):
-print(".onnx is not exist, begin to export onnx")
+print(".onnx does not exist, begin to export onnx")
try:
from funasr import AutoModel
except:

View File

@@ -52,7 +52,7 @@ class Fsmn_vad:
if quantize:
model_file = os.path.join(model_dir, "model_quant.onnx")
if not os.path.exists(model_file):
-print(".onnx is not exist, begin to export onnx")
+print(".onnx does not exist, begin to export onnx")
try:
from funasr import AutoModel
except:
@@ -221,7 +221,7 @@ class Fsmn_vad_online:
if quantize:
model_file = os.path.join(model_dir, "model_quant.onnx")
if not os.path.exists(model_file):
-print(".onnx is not exist, begin to export onnx")
+print(".onnx does not exist, begin to export onnx")
try:
from funasr import AutoModel
except: