# Speaker Diarization
**Note**: The modelscope pipeline supports inference and fine-tuning for all the models in the model zoo. Here we take the SOND model, with xvector_sv as the speaker-embedding model, as an example to demonstrate the usage.
## Inference with pipeline
### Quick start
```python
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

# initialize the diarization pipeline
inference_diar_pipeline = pipeline(
    mode="sond_demo",
    num_workers=0,
    task=Tasks.speaker_diarization,
    diar_model_config="sond.yaml",
    model="damo/speech_diarization_sond-zh-cn-alimeeting-16k-n16k4-pytorch",
    model_revision="v1.0.5",
    sv_model="damo/speech_xvector_sv-zh-cn-cnceleb-16k-spk3465-pytorch",
    sv_model_revision="v1.2.2",
)

# input: a list of audio in which the first item is the recording in which to
# detect speakers, and the following wav files are used to extract speaker embeddings
audio_list = [
    "https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_data/speaker_diarization/record.wav",
    "https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_data/speaker_diarization/spk1.wav",
    "https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_data/speaker_diarization/spk2.wav",
    "https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_data/speaker_diarization/spk3.wav",
    "https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_data/speaker_diarization/spk4.wav",
]

results = inference_diar_pipeline(audio_in=audio_list)
print(results)
```
## API reference
### Define pipeline
- `task`: `Tasks.speaker_diarization`
- `model`: model name in the model zoo, or a model path on local disk
- `ngpu`: `1` (default), decode on GPU; with `ngpu=0`, decode on CPU
- `output_dir`: `None` (default), the output path for results, if set
- `batch_size`: `1` (default), batch size for decoding
- `smooth_size`: `83` (default), the window size used for smoothing
- `dur_threshold`: `10` (default), segments shorter than 100 ms will be dropped
- `out_format`: `"vad"` (default), the output format, chosen from `["vad", "rttm"]`:
  - vad format: `spk1: [1.0, 3.0], [5.0, 8.0]`
  - rttm format: `SPEAKER test1 0 1.00 2.00 <NA> <NA> spk1 <NA>` and `SPEAKER test1 0 5.00 3.00 <NA> <NA> spk1 <NA>`
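The sketch below constructs the pipeline with several of these options set explicitly. It is a minimal illustration, assuming these keyword arguments are accepted at pipeline construction as the list above describes; the model IDs are the same as in the quick start, and the output directory is a placeholder.

```python
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

# A minimal sketch, assuming the options listed above are passed as keyword
# arguments when the pipeline is constructed.
inference_diar_pipeline = pipeline(
    mode="sond_demo",
    num_workers=0,
    task=Tasks.speaker_diarization,
    diar_model_config="sond.yaml",
    model="damo/speech_diarization_sond-zh-cn-alimeeting-16k-n16k4-pytorch",
    sv_model="damo/speech_xvector_sv-zh-cn-cnceleb-16k-spk3465-pytorch",
    sv_model_revision="v1.2.2",
    ngpu=0,                   # decode on CPU
    output_dir="./results",   # write result files here (illustrative path)
    batch_size=1,
    smooth_size=83,
    dur_threshold=10,         # drop segments shorter than 100 ms
    out_format="rttm",        # emit RTTM lines instead of the vad format
)
```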
### Infer pipeline for speaker embedding extraction
`audio_in`: the input to process, which could be one of:
- a list of URLs, e.g. waveform files hosted on a website
- a list of local file paths, e.g. `path/to/a.wav`
- a tuple `("wav.scp,speech,sound", "profile.scp,profile,kaldi_ark")`: a script file of waveform files and a script file of speaker profiles (extracted with the model). The profile.ark file contains speaker embeddings in a Kaldi-like style; please refer to README.md for more details. See the sketch after the examples below.

wav.scp:
```
test1 path/to/enroll1.wav
test2 path/to/enroll2.wav
```

profile.scp:
```
test1 path/to/profile.ark:11
test2 path/to/profile.ark:234
```
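As a concrete illustration of the script-file mode, the sketch below writes the two script files from the example above and passes them as a tuple. The paths and ark offsets are placeholders and must match a real profile.ark; `inference_diar_pipeline` is the pipeline built in the quick start.

```python
# A sketch of the script-file input mode; paths and ark offsets below are
# placeholders copied from the example above and must match a real ark file.
with open("wav.scp", "w") as f:
    f.write("test1 path/to/enroll1.wav\n")
    f.write("test2 path/to/enroll2.wav\n")

# Each profile.scp key must pair with the wav.scp entry of the same key.
with open("profile.scp", "w") as f:
    f.write("test1 path/to/profile.ark:11\n")
    f.write("test2 path/to/profile.ark:234\n")

results = inference_diar_pipeline(
    audio_in=("wav.scp,speech,sound", "profile.scp,profile,kaldi_ark")
)
print(results)
```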
### Inference with your data
For a single input, we recommend the "list of local file paths" mode. For multiple inputs, we recommend the last mode, with pre-organized wav.scp and profile.scp files.
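For instance, a single local recording with its enrollment wavs (all file names below are placeholders):

```python
# Local-file-path mode: the first item is the recording to diarize, the rest
# provide speaker embeddings (all paths are placeholders).
audio_list = [
    "path/to/record.wav",
    "path/to/spk1.wav",
    "path/to/spk2.wav",
]
results = inference_diar_pipeline(audio_in=audio_list)
```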
### Inference with multiple threads on CPU
We recommend the last mode: split wav.scp and profile.scp into parts, then run inference on each part. Please refer to README.md to find a similar process.
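A sketch of the splitting step, assuming wav.scp and profile.scp are plain text files that list the same keys in the same order; each resulting pair of parts can then be decoded by a separate process:

```python
# A sketch: round-robin split wav.scp and profile.scp into nj aligned parts.
# Both files are assumed to list the same keys in the same order, so the
# i-th parts stay paired.
nj = 4
for name in ("wav.scp", "profile.scp"):
    with open(name) as f:
        lines = f.readlines()
    for i in range(nj):
        with open(f"{name}.{i}", "w") as part:
            part.writelines(lines[i::nj])
```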
### Inference with multiple GPUs
Similar to CPU inference, split wav.scp and profile.scp and run one process per part, setting `ngpu=1` so each process decodes on a GPU. In addition, use `CUDA_VISIBLE_DEVICES=0` (or another device index) to pin each process to a specific GPU. Please refer to README.md to find a similar process.
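A sketch of launching one decoding process per GPU; `infer_split.py` is a hypothetical driver script that builds the pipeline with `ngpu=1` and decodes the split part it is given:

```python
import os
import subprocess

# A sketch: one decoding process per GPU. `infer_split.py` is a hypothetical
# driver that constructs the pipeline with ngpu=1 and decodes one split part.
procs = []
for i in range(4):  # assuming 4 visible GPUs
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(i))
    procs.append(subprocess.Popen(
        ["python", "infer_split.py", f"wav.scp.{i}", f"profile.scp.{i}"],
        env=env,
    ))
for p in procs:
    p.wait()
```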