# Speaker Diarization

> **Note**:
> The modelscope pipeline supports all the models in [model zoo](https://alibaba-damo-academy.github.io/FunASR/en/modelscope_models.html#pretrained-models-on-modelscope) for inference and fine-tuning. Here we take the xvector_sv model as an example to demonstrate the usage.

## Inference with pipeline

### Quick start

```python
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

# initialize the pipeline
inference_diar_pipeline = pipeline(
    mode="sond_demo",
    num_workers=0,
    task=Tasks.speaker_diarization,
    diar_model_config="sond.yaml",
    model='damo/speech_diarization_sond-zh-cn-alimeeting-16k-n16k4-pytorch',
    reversion="v1.0.5",
    sv_model="damo/speech_xvector_sv-zh-cn-cnceleb-16k-spk3465-pytorch",
    sv_model_revision="v1.2.2",
)

# input: a list of audio files in which the first item is the speech recording to diarize,
# and the following wav files are used to extract speaker embeddings.
audio_list = [
    "https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_data/speaker_diarization/record.wav",
    "https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_data/speaker_diarization/spk1.wav",
    "https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_data/speaker_diarization/spk2.wav",
    "https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_data/speaker_diarization/spk3.wav",
    "https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_data/speaker_diarization/spk4.wav",
]

results = inference_diar_pipeline(audio_in=audio_list)
print(results)
```

#### API-reference

##### Define pipeline
- `task`: `Tasks.speaker_diarization`
- `model`: model name in [model zoo](https://alibaba-damo-academy.github.io/FunASR/en/modelscope_models.html#pretrained-models-on-modelscope), or model path on local disk
- `ngpu`: `1` (Default), decode on GPU; if `ngpu=0`, decode on CPU
- `output_dir`: `None` (Default), the output directory for results, if set
- `batch_size`: `1` (Default), batch size when decoding
- `smooth_size`: `83` (Default), the window size used to smooth the diarization outputs
- `dur_threshold`: `10` (Default), segments shorter than 100 ms will be dropped
- `out_format`: `vad` (Default), the output format, choices `["vad", "rttm"]`
  - vad format: `spk1: [1.0, 3.0], [5.0, 8.0]`
  - rttm format: `SPEAKER test1 0 1.00 2.00 spk1` and `SPEAKER test1 0 5.00 3.00 spk1`

##### Infer pipeline
- `audio_in`: the input to process, which could be:
  - a list of URLs, e.g. waveform files hosted on a website
  - a list of local file paths, e.g. `path/to/a.wav`
  - `("wav.scp,speech,sound", "profile.scp,profile,kaldi_ark")`: a script file of waveform files and another script file of speaker profiles (extracted with the [model](https://www.modelscope.cn/models/damo/speech_xvector_sv-zh-cn-cnceleb-16k-spk3465-pytorch/summary))

```text
wav.scp
test1 path/to/enroll1.wav
test2 path/to/enroll2.wav

profile.scp
test1 path/to/profile.ark:11
test2 path/to/profile.ark:234
```

The profile.ark file stores speaker embeddings in a Kaldi-style archive. Please refer to [README.md](../../speaker_verification/TEMPLATE/README.md) for more details.

### Inference with your data
For a single input, we recommend the "list of local file paths" mode. For multiple inputs, we recommend the wav.scp/profile.scp mode with pre-organized wav.scp and profile.scp files.

### Inference with multi-threads on CPU
We recommend the wav.scp/profile.scp mode here as well: split wav.scp and profile.scp into several parts, then run inference on each part in a separate job, as sketched below.
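The following is a minimal sketch of this workflow, not part of the FunASR/ModelScope distribution. It assumes wav.scp and profile.scp have been prepared as described above and list the same utterance IDs in the same order; the helper `split_scp`, the number of jobs `nj`, the part file names, and the output directory are illustrative names for this example. The pipeline arguments mirror the quick start, with `ngpu=0` for CPU decoding.

```python
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks


def split_scp(scp_path, num_parts):
    """Split a Kaldi-style scp file into <scp_path>.0 ... <scp_path>.<num_parts - 1>."""
    with open(scp_path) as f:
        lines = f.readlines()
    for i in range(num_parts):
        with open(f"{scp_path}.{i}", "w") as fo:
            fo.writelines(lines[i::num_parts])


nj = 4  # number of parallel CPU jobs (illustrative)
split_scp("wav.scp", nj)
split_scp("profile.scp", nj)

# Decode one split part on CPU; in practice each part runs in its own job.
part = 0
inference_diar_pipeline = pipeline(
    mode="sond_demo",
    num_workers=0,
    task=Tasks.speaker_diarization,
    diar_model_config="sond.yaml",
    model='damo/speech_diarization_sond-zh-cn-alimeeting-16k-n16k4-pytorch',
    reversion="v1.0.5",
    sv_model="damo/speech_xvector_sv-zh-cn-cnceleb-16k-spk3465-pytorch",
    sv_model_revision="v1.2.2",
    ngpu=0,                              # decode on CPU
    output_dir=f"./results_part{part}",  # hypothetical output directory
)
results = inference_diar_pipeline(
    audio_in=(f"wav.scp.{part},speech,sound", f"profile.scp.{part},profile,kaldi_ark")
)
print(results)
```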
Please refer to [README.md](../../speaker_verification/TEMPLATE/README.md) to find a similar process.

### Inference with multi GPU
Similar to the multi-threaded CPU case, split the inputs into parts, but set `ngpu=1` so that decoding runs on GPU, and use `CUDA_VISIBLE_DEVICES=0` to specify which GPU device each job uses. Please refer to [README.md](../../speaker_verification/TEMPLATE/README.md) to find a similar process.
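Below is a minimal sketch of decoding one split part on GPU 0. It reuses the hypothetical split files from the CPU sketch above; only the `CUDA_VISIBLE_DEVICES` setting and `ngpu=1` differ, and the script name in the comment is illustrative.

```python
import os

# Expose only GPU 0 to this job; set this before the pipeline (and CUDA) is initialized.
# Equivalently, launch the job as: CUDA_VISIBLE_DEVICES=0 python infer_part0.py
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

# Same construction as the CPU sketch above, but with ngpu=1 to decode on the visible GPU.
inference_diar_pipeline = pipeline(
    mode="sond_demo",
    num_workers=0,
    task=Tasks.speaker_diarization,
    diar_model_config="sond.yaml",
    model='damo/speech_diarization_sond-zh-cn-alimeeting-16k-n16k4-pytorch',
    reversion="v1.0.5",
    sv_model="damo/speech_xvector_sv-zh-cn-cnceleb-16k-spk3465-pytorch",
    sv_model_revision="v1.2.2",
    ngpu=1,
)
results = inference_diar_pipeline(
    audio_in=("wav.scp.0,speech,sound", "profile.scp.0,profile,kaldi_ark")
)
print(results)
```

The remaining parts would be launched the same way with different `CUDA_VISIBLE_DEVICES` values, one job per GPU.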