asr_eval.models

A collection of ASR model wrappers in unified formats.

class asr_eval.models.base.interfaces.Segmenter[source]

Bases: ABC

An abstract model that segments long-form audio into chunks containing speech.

Any parameters, such as max segment size, should go into a class constructor.

class asr_eval.models.base.interfaces.Transcriber[source]

Bases: ABC

An abstract transcriber (audio -> text) to evaluate on any dataset.

abstractmethod transcribe(waveform)[source]

Transcribes a float32 waveform, typically normalized from -1 to 1.

Return type:

str

Parameters:

waveform (ndarray[tuple[int, ...], dtype[floating[Any]]])
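A concrete Transcriber only needs to implement transcribe(). The sketch below mirrors the interface standalone (it does not import asr_eval, and a plain Sequence[float] stands in for the float32 numpy waveform); the dummy model is hypothetical:

```python
from abc import ABC, abstractmethod
from typing import Sequence

# Standalone mirror of the Transcriber contract described above.
class Transcriber(ABC):
    @abstractmethod
    def transcribe(self, waveform: Sequence[float]) -> str:
        """Transcribe a float32 waveform, typically normalized to [-1, 1]."""

class ConstantTranscriber(Transcriber):
    # Toy model: real subclasses would run inference on `waveform`.
    def transcribe(self, waveform: Sequence[float]) -> str:
        return "hello world"

print(ConstantTranscriber().transcribe([0.0] * 16_000))  # prints "hello world"
```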

class asr_eval.models.base.interfaces.TimedTranscriber[source]

Bases: Transcriber

An abstract timed transcriber (audio -> timed text chunks) to evaluate on any dataset.

Overrides the transcribe() method by concatenating the timed text chunks with spaces. Subclasses may customize this.

abstractmethod timed_transcribe(waveform)[source]

Transcribes a float32 waveform, typically normalized from -1 to 1, into a list of texts with timings. The texts are typically concatenated with spaces, so leading or trailing spaces in each chunk are not required.

Return type:

list[TimedText]

Parameters:

waveform (ndarray[tuple[int, ...], dtype[floating[Any]]])

transcribe(waveform)[source]

Transcribes a float32 waveform, typically normalized from -1 to 1.

Return type:

str

Parameters:

waveform (ndarray[tuple[int, ...], dtype[floating[Any]]])
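The relation between timed_transcribe() and the default transcribe() can be sketched standalone as follows (the TimedText fields text/start/end shown here are assumptions about that type):

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass

@dataclass
class TimedText:
    # Assumed fields: a text chunk plus start/end times in seconds.
    text: str
    start: float
    end: float

class TimedTranscriber(ABC):
    @abstractmethod
    def timed_transcribe(self, waveform) -> list[TimedText]: ...

    def transcribe(self, waveform) -> str:
        # Default behaviour: join the chunk texts with spaces.
        return " ".join(c.text for c in self.timed_transcribe(waveform))

class TwoChunkDummy(TimedTranscriber):
    # Toy model returning two fixed chunks.
    def timed_transcribe(self, waveform) -> list[TimedText]:
        return [TimedText("hello", 0.0, 0.9), TimedText("world", 1.1, 2.0)]

print(TwoChunkDummy().transcribe([0.0] * 16_000))  # prints "hello world"
```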

class asr_eval.models.base.interfaces.CTC[source]

Bases: Transcriber

An abstract CTC model that converts audio into log probabilities for each time frame.

Implementations may override additional methods for actions such as decoding.

abstractmethod ctc_log_probs(waveforms)[source]

Calculates log probabilities for each time frame, given a float32 waveform, typically normalized from -1 to 1. The exponentiated log probabilities should sum to 1 for each time frame.

Typically obtained from logits via torch.nn.functional.log_softmax. Note that the returned value should be a numpy array, not a torch tensor.

Return type:

list[ndarray[tuple[int, ...], dtype[floating[Any]]]]

Parameters:

waveforms (list[ndarray[tuple[int, ...], dtype[floating[Any]]]])

abstract property blank_id: int

An index in vocabulary for <blank> CTC token.

abstract property tick_size: float

A time interval in seconds between consecutive time frames in the log probs matrix.

abstract property vocab: tuple[str, ...]

Returns a vocabulary: a character (usually a single letter) or character sequence for each vocabulary index, or an empty string for the blank token.

Note that this does not fully support Whisper-style BPE encoding: each single token should correspond to a valid unicode string.

transcribe(waveform)[source]

Transcribes a float32 waveform, typically normalized from -1 to 1.

Return type:

str

Parameters:

waveform (ndarray[tuple[int, ...], dtype[floating[Any]]])
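The blank_id / vocab / tick_size contract is enough to implement greedy CTC decoding. A minimal standalone sketch (not the library's decoder; plain lists stand in for the numpy log-prob matrix):

```python
import math

def greedy_ctc_decode(log_probs, vocab, blank_id):
    # Per frame, take the argmax token; collapse repeats; drop blanks.
    out, prev = [], None
    for frame in log_probs:
        tok = max(range(len(frame)), key=lambda i: frame[i])
        if tok != blank_id and tok != prev:
            out.append(vocab[tok])
        prev = tok
    return "".join(out)

vocab = ("", "a", "b")  # index 0 is the blank token (empty string)
log_probs = [
    [math.log(0.1), math.log(0.8), math.log(0.1)],  # "a"
    [math.log(0.1), math.log(0.8), math.log(0.1)],  # repeated "a" collapses
    [math.log(0.8), math.log(0.1), math.log(0.1)],  # blank
    [math.log(0.1), math.log(0.1), math.log(0.8)],  # "b"
]
print(greedy_ctc_decode(log_probs, vocab, blank_id=0))  # prints "ab"
```

With tick_size, frame index i maps to time i * tick_size seconds.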

class asr_eval.models.base.interfaces.ContextualTranscriber[source]

Bases: Transcriber

An abstract transcriber able to accept the previous transcription as context.

abstractmethod contextual_transcribe(waveform, prev_transcription='')[source]

Transcribes a float32 waveform, typically normalized from -1 to 1. The prev_transcription is the transcription of all the audio preceding the current waveform.

Return type:

str

Parameters:
  • waveform (ndarray[tuple[int, ...], dtype[floating[Any]]])

  • prev_transcription (str)

transcribe(waveform)[source]

Transcribes a float32 waveform, typically normalized from -1 to 1.

Return type:

str

Parameters:

waveform (ndarray[tuple[int, ...], dtype[floating[Any]]])
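A standalone sketch of this contract follows; the transcribe() default shown (delegating with an empty context) is an assumption, not the library's code:

```python
from abc import ABC, abstractmethod

# Standalone sketch of the ContextualTranscriber contract described above.
class ContextualTranscriber(ABC):
    @abstractmethod
    def contextual_transcribe(self, waveform, prev_transcription: str = "") -> str: ...

    def transcribe(self, waveform) -> str:
        # With no history, contextual transcription reduces to plain transcription.
        return self.contextual_transcribe(waveform, prev_transcription="")

class EchoContext(ContextualTranscriber):
    # Toy model that only shows how the context would be consumed.
    def contextual_transcribe(self, waveform, prev_transcription: str = "") -> str:
        return "hello" + (f" (context: {prev_transcription})" if prev_transcription else "")

print(EchoContext().contextual_transcribe([0.0] * 100, "previous text"))
```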

class asr_eval.models.base.longform.LongformVAD(shortform_model, segmenter, min_sec=30)[source]

Bases: TimedTranscriber

A longform transcriber wrapper for any shortform model.

A longform transcriber is one able to transcribe long audios. The concrete threshold between “long” and “short” audio may be specific to the provided shortform_model.

The current wrapper uses the provided segmenter to split the audio into chunks, then applies the shortform model to each chunk independently. If the shortform model is a TimedTranscriber, it concatenates the resulting lists for all chunks, correcting the timestamps to be relative to the whole audio.

Example

>>> # requires `pip install pyannote.audio>=4` for `PyannoteSegmenter`
>>> from asr_eval.models.base.longform import LongformVAD
>>> from asr_eval.models.pyannote_vad import PyannoteSegmenter
>>> from asr_eval.models.wav2vec2_wrapper import Wav2vec2Wrapper
>>> LongformVAD(
...     Wav2vec2Wrapper('facebook/wav2vec2-base-960h'),
...     PyannoteSegmenter()
... )

See also: LongformCTC.

Parameters:
  • shortform_model (Transcriber)

  • segmenter (Segmenter)

  • min_sec (float)

timed_transcribe(waveform)[source]

Transcribes a float32 waveform, typically normalized from -1 to 1, into a list of texts with timings. The texts are typically concatenated with spaces, so leading or trailing spaces in each chunk are not required.

Return type:

list[TimedText]

Parameters:

waveform (ndarray[tuple[int, ...], dtype[floating[Any]]])
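The timestamp correction can be sketched with a hypothetical helper (chunks as (text, start, end) tuples; segment_start is the chunk's offset within the full audio):

```python
def shift_timings(chunks, segment_start):
    # Shift chunk-local (start, end) times by the segment's global offset.
    return [(text, start + segment_start, end + segment_start)
            for text, start, end in chunks]

local = [("hello", 0.0, 1.2), ("world", 1.5, 2.0)]
print(shift_timings(local, 30.0))  # [('hello', 30.0, 31.2), ('world', 31.5, 32.0)]
```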

class asr_eval.models.base.longform.LongformCTC(shortform_model, segment_length=30, segment_shift=10, averaging_weights='beta', pbar_callback=None)[source]

Bases: CTC

A wrapper to apply a shortform CTC model to a longform audio.

A longform transcriber is one able to transcribe long audios. The current wrapper segments the audio uniformly with overlaps, then averages the log probs over all segments. By default it averages with beta-distributed weights (averaging_weights='beta'), because a model may be less certain at the edges of a segment.

See also: LongformVAD.

Parameters:
  • shortform_model (CTC)

  • segment_length (float)

  • segment_shift (float)

  • averaging_weights (Literal['beta', 'uniform', 'blank_sep'])

  • pbar_callback (Callable[[int, int], None] | None)

ctc_log_probs(waveforms)[source]

Calculates log probabilities for each time frame, given a float32 waveform, typically normalized from -1 to 1. The exponentiated log probabilities should sum to 1 for each time frame.

Typically obtained from logits via torch.nn.functional.log_softmax. Note that the returned value should be a numpy array, not a torch tensor.

Return type:

list[ndarray[tuple[int, ...], dtype[floating[Any]]]]

Parameters:

waveforms (list[ndarray[tuple[int, ...], dtype[floating[Any]]]])

property blank_id: int

An index in vocabulary for <blank> CTC token.

property tick_size: float

A time interval in seconds between consecutive time frames in the log probs matrix.

property vocab: tuple[str, ...]

Returns a vocabulary: a character (usually a single letter) or character sequence for each vocabulary index, or an empty string for the blank token.

Note that this does not fully support Whisper-style BPE encoding: each single token should correspond to a valid unicode string.
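The overlap-averaging idea can be sketched standalone as follows. This is a simplified model under stated assumptions: probabilities (not log probs) are averaged per frame, and the beta edge-downweighting is approximated by a sine bump:

```python
import math

def edge_weights(n):
    # Small near the segment edges, peaking in the middle (stand-in for beta weights).
    return [math.sin(math.pi * (i + 0.5) / n) for i in range(n)]

def average_overlaps(segments):
    # segments: list of (start_frame, probs); probs is a list of per-frame
    # probability vectors. Returns the weighted per-frame average.
    total_len = max(start + len(probs) for start, probs in segments)
    vocab_size = len(segments[0][1][0])
    acc = [[0.0] * vocab_size for _ in range(total_len)]
    wsum = [0.0] * total_len
    for start, probs in segments:
        w = edge_weights(len(probs))
        for i, frame in enumerate(probs):
            for k in range(vocab_size):
                acc[start + i][k] += w[i] * frame[k]
            wsum[start + i] += w[i]
    return [[v / wsum[t] for v in frame] for t, frame in enumerate(acc)]

# Two 4-frame segments overlapping on frames 2-3; each averaged frame
# remains a valid probability distribution.
avg = average_overlaps([(0, [[0.5, 0.5]] * 4), (2, [[0.9, 0.1]] * 4)])
print(len(avg))  # 6 frames
```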

class asr_eval.models.base.longform.ContextualLongformVAD(shortform_model, segmenter, pass_history=True, max_history_words=100, min_sec=30)[source]

Bases: TimedTranscriber

A wrapper similar to LongformVAD, but for each chunk it passes the previously transcribed text, up to max_history_words words, as context when transcribing the next chunk.

Requires the shortform model to be a ContextualTranscriber.

Parameters:
  • shortform_model (ContextualTranscriber)

  • segmenter (Segmenter)

  • pass_history (bool)

  • max_history_words (int)

  • min_sec (float)

timed_transcribe(waveform)[source]

Transcribes a float32 waveform, typically normalized from -1 to 1, into a list of texts with timings. The texts are typically concatenated with spaces, so leading or trailing spaces in each chunk are not required.

Return type:

list[TimedText]

Parameters:

waveform (ndarray[tuple[int, ...], dtype[floating[Any]]])
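The history limit can be sketched with a hypothetical helper (an assumption about how max_history_words is applied: keep only the last N words):

```python
def truncate_history(text, max_history_words=100):
    # Keep only the trailing `max_history_words` words as context.
    words = text.split()
    return " ".join(words[-max_history_words:])

print(truncate_history("one two three four five", max_history_words=3))  # "three four five"
```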

class asr_eval.models.base.openai_wrapper.APITranscriber(model_name='mistralai/Voxtral-Mini-3B-2507', client='run_local_server', language=None, prompt=None, chunking_strategy='omit', temperature=0.7, format='flac', local_server_verbose=False)[source]

Bases: Transcriber

A connector to the OpenAI API for audio LLMs. Runs via client.audio.transcriptions.create. This class wraps api_transcribe() to implement the Transcriber interface. See the api_transcribe() docstring for the chunking_strategy and temperature params.

This class can also auto-start a local VLLM server. To do this, subclass this class and define vllm_run_args(). See VoxtralWrapper as an example.

Example with starting VLLM manually:

  1. Start a local VLLM server

vllm serve mistralai/Voxtral-Mini-3B-2507 --tokenizer_mode mistral \
    --config_format mistral --load_format mistral \
    --tensor-parallel-size 1 --tool-call-parser mistral  \
    --enable-auto-tool-choice --gpu-memory-utilization 0.75
  2. Run the code

from openai import OpenAI
from asr_eval.models.base.openai_wrapper import APITranscriber

transcriber = APITranscriber(
    model_name='mistralai/Voxtral-Mini-3B-2507',
    client=OpenAI(api_key='EMPTY', base_url='http://localhost:8000/v1'),
    language='ru',
)

waveform = <load your audio sample>
transcriber.transcribe(waveform)
Parameters:
  • model_name (str)

  • client (OpenAI | Literal['run_local_server'])

  • language (str | LanguageAlpha2 | None)

  • prompt (str | None)

  • chunking_strategy (Literal['auto', 'omit'] | ChunkingStrategyVadConfig | Omit)

  • temperature (float)

  • format (str)

  • local_server_verbose (bool)

transcribe(waveform)[source]

Transcribes a float32 waveform, typically normalized from -1 to 1.

Return type:

str

Parameters:

waveform (ndarray[tuple[int, ...], dtype[floating[Any]]])

asr_eval.models.base.openai_wrapper.api_transcribe(client, waveform, model_name, language=None, prompt=None, chunking_strategy='omit', temperature=0.7, format='flac')[source]

A connector to the OpenAI API for audio LLMs. Runs via client.audio.transcriptions.create. See the full usage example in APITranscriber.

Sends a message with audio and language to transcribe. The default temperature is 0.7; this value is taken from mistral_common’s BaseCompletionRequest.

Return type:

tuple[str, list[Logprob] | None]

Returns:

A transcription and logprobs (optional, if returned by the model). According to openai.types.audio.transcription.Transcription docstring, logprobs are returned only with the models gpt-4o-transcribe and gpt-4o-mini-transcribe.

Parameters:
  • client (OpenAI)

  • waveform (ndarray[tuple[int, ...], dtype[floating[Any]]])

  • model_name (str)

  • language (str | LanguageAlpha2 | None)

  • prompt (str | None)

  • chunking_strategy (Literal['auto', 'omit'] | ~openai.types.audio.transcription_create_params.ChunkingStrategyVadConfig | ~openai.Omit)

  • temperature (float)

  • format (str)

By default chunking_strategy is unset, and the audio is transcribed as a single block, according to the client.audio.transcriptions.create docstring.

Voxtral seems to ignore both chunking_strategy and a request to return logprobs, according to VLLM server logs.

format is FLAC by default; FLAC is losslessly compressed audio, so it should be smaller than WAV.

Raises:
  • openai.APIConnectionError – If cannot connect to the API

  • openai.NotFoundError – If cannot find the specified model_name

  • InternalServerError – In some cases (happened with VseGPT)

exception asr_eval.models.base.openai_wrapper.ContentFilterException[source]

Bases: RuntimeError

An API model refused to generate due to the content policy.

class asr_eval.models.ast_wrapper.AudioSpectrogramTransformer(model_path='MIT/ast-finetuned-audioset-10-10-0.4593', device='cuda')[source]

An AudioSpectrogramTransformer (AST) able to recognize sound types.

Requires transformers package.

Parameters:
  • model_path (str)

  • device (str)

class asr_eval.models.flamingo_wrapper.FlamingoWrapper(lang='ru')[source]

Bases: Transcriber

A Flamingo transcriber. Currently not working; TODO: fix.

Installation: see Installation page.

Authors: Dmitry Ezhov & Oleg Sedukhin

Parameters:

lang (Literal['en', 'ru'])

transcribe(waveform)[source]

Transcribes a float32 waveform, typically normalized from -1 to 1.

Return type:

str

Parameters:

waveform (ndarray[tuple[int, ...], dtype[floating[Any]]])

class asr_eval.models.gemma_wrapper.Gemma3nWrapper(lang='en', domain_text='')[source]

Bases: ContextualTranscriber

A Gemma3n transcriber. Currently too slow; TODO: fix.

If domain_text is specified, it is added to the prompt with a note that it “may be related”.

Installation: see Installation page.

Authors: Timur Rafikov & Oleg Sedukhin

Parameters:
  • lang (Literal['en', 'ru'])

  • domain_text (str)

contextual_transcribe(waveform, prev_transcription='')[source]

Transcribes a float32 waveform, typically normalized from -1 to 1. The prev_transcription is the transcription of all the audio preceding the current waveform.

Return type:

str

Parameters:
  • waveform (ndarray[tuple[int, ...], dtype[floating[Any]]])

  • prev_transcription (str)

class asr_eval.models.gigaam_wrapper.GigaAMShortformBase[source]

Bases: Transcriber, ABC

An abstract class for GigaAM model, either CTC or RNNT.

Implementations: GigaAMShortformCTC, GigaAMShortformRNNT.

transcribe(waveform)[source]

Transcribes a float32 waveform, typically normalized from -1 to 1.

Return type:

str

Parameters:

waveform (ndarray[tuple[int, ...], dtype[floating[Any]]])

class asr_eval.models.gigaam_wrapper.GigaAMShortformRNNT(version, device='cuda', fp16=False)[source]

Bases: GigaAMShortformBase

A GigaAM RNNT model. Supports different versions (see version parameter): “v2”, “v3”, “v3_e2e”.

Installation: see Installation page.

Parameters:
  • version (Literal['v2', 'v3', 'v3_e2e'])

  • device (str | torch.device)

  • fp16 (bool)

class asr_eval.models.gigaam_wrapper.GigaAMShortformCTC(version, device='cuda', fp16=False)[source]

Bases: GigaAMShortformBase, CTC

A GigaAM CTC model. Supports different versions (see version parameter): “v2”, “v3”, “v3_e2e”.

Installation: see Installation page.

Parameters:
  • version (Literal['v2', 'v3', 'v3_e2e'])

  • device (str | torch.device)

  • fp16 (bool)

transcribe(waveform)[source]

Transcribes a float32 waveform, typically normalized from -1 to 1.

Return type:

str

Parameters:

waveform (ndarray[tuple[int, ...], dtype[floating[Any]]])

property blank_id: int

An index in vocabulary for <blank> CTC token.

property tick_size: float

A time interval in seconds between consecutive time frames in the log probs matrix.

property vocab: tuple[str, ...]

Returns a vocabulary: a character (usually a single letter) or character sequence for each vocabulary index, or an empty string for the blank token.

Note that this does not fully support Whisper-style BPE encoding: each single token should correspond to a valid unicode string.

ctc_log_probs(waveforms)[source]

Calculates log probabilities for each time frame, given a float32 waveform, typically normalized from -1 to 1. The exponentiated log probabilities should sum to 1 for each time frame.

Typically obtained from logits via torch.nn.functional.log_softmax. Note that the returned value should be a numpy array, not a torch tensor.

Return type:

list[ndarray[tuple[int, ...], dtype[floating[Any]]]]

Parameters:

waveforms (list[ndarray[tuple[int, ...], dtype[floating[Any]]]])

class asr_eval.models.legacy_pisets_wrapper.LegacyPisetsWrapper(repo_dir, min_segment_size=1, max_segment_size=20, use_vad=False, whisper_ckpt='bond005/whisper-large-v3-ru-podlodka')[source]

Bases: TimedTranscriber

A Pisets transcriber from https://github.com/bond005/pisets

Commit hash e095ae626bbd18bb4490b9745d0acc34006c4eb8

Requires manually cloning the repo into repo_dir before instantiating.

Parameters:
  • repo_dir (str | Path)

  • min_segment_size (int)

  • max_segment_size (int)

  • use_vad (bool)

  • whisper_ckpt (str)

timed_transcribe(waveform)[source]

Transcribes a float32 waveform, typically normalized from -1 to 1, into a list of texts with timings. The texts are typically concatenated with spaces, so leading or trailing spaces in each chunk are not required.

Return type:

list[TimedText]

Parameters:

waveform (ndarray[tuple[int, ...], dtype[floating[Any]]])

class asr_eval.models.nemo_wrapper.NvidiaNemoWrapper(model_name, inference_kwargs=None, verbose=False, dtype='float32', amp=False, beam_size=None)[source]

Bases: Transcriber

A Nvidia NEMO wrapper

Installation: see Installation page.

Some of the available models (many more are available):

  1. “nvidia/canary-1b-v2”

    • NOTE: Specify the language, for example: inference_kwargs={'source_lang': 'ru', 'target_lang': 'ru'}

    • NOTE: in Nemo, beam_size=1 by default

  2. “nvidia/parakeet-tdt-0.6b-v3”

    • NOTE: Supports torch.float16 or torch.bfloat16 only with amp=True

    • NOTE: in Nemo, beam_size=2 by default

  3. “nvidia/stt_ru_fastconformer_hybrid_large_pc”

    • NOTE: in Nemo, beam_size=2 by default

Dtypes:
  • for amp=True, available dtypes are torch.float16, torch.bfloat16

  • for amp=False, available dtypes are torch.float16, torch.bfloat16, torch.float32

Notes:

This wrapper is built using the following docs and examples: https://github.com/NVIDIA-NeMo/NeMo/blob/main/examples/asr/transcribe_speech.py and https://docs.nvidia.com/nemo-framework/user-guide/25.02/nemotoolkit/asr/api.html#nemo.collections.asr.parts.mixins.transcription.TranscriptionMixin

The NEMO wrapper seems not to perform internal VAD; it raises OOM on too long audios. From the EncDecMultiTaskModel docstrings: “recommended length per file is between 5 and 25 seconds, but it is possible to pass a few hours long file if enough GPU memory is available”.

The .transcribe() method of NEMO’s TranscriptionMixin allows passing timestamps=True. It raises an error for Canary, but returns timestamps for Parakeet and FastConformer. However, the output timestamps require postprocessing that is not currently implemented.

Some of the models should support the CTC interface and/or LM integration, but this is not implemented in asr_eval currently.

To get the full list of available models, run:

from nemo.collections.asr.models import ASRModel
print(ASRModel.list_available_models())
Parameters:
  • model_name (str)

  • inference_kwargs (dict[str, str] | None)

  • verbose (bool)

  • dtype (torch.dtype | str)

  • amp (bool)

  • beam_size (int | None)

transcribe(waveform)[source]

Transcribes a float32 waveform, typically normalized from -1 to 1.

Return type:

str

Parameters:

waveform (ndarray[tuple[int, ...], dtype[floating[Any]]])

class asr_eval.models.pyannote_diarization.PyannoteDiarizationWrapper(verbose=False, preset='pyannote/speaker-diarization-community-1', segmentation=None, embedding=None)[source]

A wrapper for Pyannote diarization.

Requires pyannote>=4.0.0. To use it, you first need to accept the conditions at https://huggingface.co/pyannote/speaker-diarization-community-1 , then set your token in the HF_TOKEN environment variable.

Parameters:
  • verbose (bool)

  • preset (str)

  • segmentation (str | None)

  • embedding (str | None)

class asr_eval.models.pyannote_vad.PyannoteSegmenter(min_duration=15, max_duration=22, strict_limit_duration=30.0, new_chunk_threshold=0.2, lower_limit_duration=0.1)[source]

Bases: Segmenter

A VAD-based audio segmenter built on Pyannote. With default params it is equivalent to gigaam.vad_utils.segment_audio.

Requires pyannote>=4.0.0. Based on https://github.com/salute-developers/GigaAM/blob/main/gigaam/vad_utils.py . This segmenter does NOT require gigaam package to be installed, because all the required functions are copied from the gigaam package. The model is cached in PYANNOTE_CACHE dir, by default: ~/.cache/torch/pyannote.

Parameters:
  • min_duration (float)

  • max_duration (float)

  • strict_limit_duration (float)

  • new_chunk_threshold (float)

  • lower_limit_duration (float)

class asr_eval.models.qwen2_audio_wrapper.Qwen2AudioWrapper(domain_text='')[source]

Bases: ContextualTranscriber

A wrapper for Qwen2-Audio transcriber.

Currently produces bad output; TODO: fix.

If domain_text is specified, it is added to the prompt with a note that it “may be related”.

Installation: see Installation page.

Authors: Muharyam Baviev & Oleg Sedukhin

Parameters:

domain_text (str)

contextual_transcribe(waveform, prev_transcription='')[source]

Transcribes a float32 waveform, typically normalized from -1 to 1. The prev_transcription is the transcription of all the audio preceding the current waveform.

Return type:

str

Parameters:
  • waveform (ndarray[tuple[int, ...], dtype[floating[Any]]])

  • prev_transcription (str)

class asr_eval.models.qwen_audio_wrapper.QwenAudioWrapper(language='en', audio_lang_unknown=False)[source]

Bases: Transcriber

A wrapper for Qwen-Audio v1 (NOTE: not v2!). Experimental, may not work.

Requires transformers package.

Parameters:
  • language (QWEN_AUDIO_LANGUAGES)

  • audio_lang_unknown (bool)

transcribe(waveform)[source]

Transcribes a float32 waveform, typically normalized from -1 to 1.

Return type:

str

Parameters:

waveform (ndarray[tuple[int, ...], dtype[floating[Any]]])

class asr_eval.models.salute_wrapper.SaluteWrapper(api_key, format='flac', language='en-US')[source]

Bases: TimedTranscriber

A wrapper for SaluteSpeech API transcriber.

You need to pass an api_key: https://developers.sber.ru/docs/ru/salutespeech/quick-start/integration-individuals

Raises:

salute_speech.exceptions.SberSpeechError – on API errors

Parameters:
  • api_key (str)

  • format (str)

  • language (str)

Installation: see Installation page.

timed_transcribe(waveform)[source]

Transcribes a float32 waveform, typically normalized from -1 to 1, into a list of texts with timings. The texts are typically concatenated with spaces, so leading or trailing spaces in each chunk are not required.

Return type:

list[TimedText]

Parameters:

waveform (ndarray[tuple[int, ...], dtype[floating[Any]]])

class asr_eval.models.speechbrain_wrapper.SpeechbrainStreaming(model_name='speechbrain/asr-streaming-conformer-gigaspeech', sampling_rate=16_000)[source]

Bases: StreamingASR

A speechbrain streaming model asr-streaming-conformer-gigaspeech.

Adapted from the Gradio example here: https://huggingface.co/speechbrain/asr-streaming-conformer-librispeech

Installation: see Installation page.

Parameters:
  • model_name (Literal['speechbrain/asr-streaming-conformer-gigaspeech'])

  • sampling_rate (int)

_run()[source]

A background thread that processes input chunks and emits output chunks.

It is started with start_thread() and should live forever, usually with a while True loop. To get the next input chunk, use self.input_buffer.get() or get_with_rechunking() (both methods block until the next chunk is available). To emit a new output chunk, use self.output_buffer.put() (non-blocking).

For example, if 16_000 floats/sec are streamed and an external sender sends chunks of size 1600 ten times per second, but your model wants 1-second chunks, call self.input_buffer.get_with_rechunking(size=16_000). This will block until 10 chunks are accumulated for any ID and return the result.

Normally, on stop_thread() an Exit exception is raised when the _run method tries to read from the input buffer. It causes an exit from _run and is handled in the wrapping method _run_and_send_exit. So the Exit exception should not be handled in _run.
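The rechunking behaviour can be modeled standalone (the real get_with_rechunking blocks on a queue and tracks stream IDs; this hypothetical Rechunker models only the buffering):

```python
class Rechunker:
    def __init__(self):
        self.buf = []

    def put(self, chunk):
        # Non-blocking: append incoming samples to the buffer.
        self.buf.extend(chunk)

    def get_with_rechunking(self, size):
        # The real method blocks until `size` samples are available;
        # here we just return None if they are not.
        if len(self.buf) < size:
            return None
        out, self.buf = self.buf[:size], self.buf[size:]
        return out

r = Rechunker()
for _ in range(10):          # an external sender: 10 chunks of 1600 samples
    r.put([0.0] * 1600)
print(len(r.get_with_rechunking(16_000)))  # prints 16000
```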

property audio_type: Literal['float']

The required input audio format. Together with sampling_rate property, forms a specification of input audio.

See also convert_audio_format() for details about formats.

class asr_eval.models.t_one_wrapper.TOneStreaming[source]

Bases: StreamingASR

A streaming wrapper for T-One model.

Installation: see Installation page.

_run()[source]

A background thread that processes input chunks and emits output chunks.

It is started with start_thread() and should live forever, usually with a while True loop. To get the next input chunk, use self.input_buffer.get() or get_with_rechunking() (both methods block until the next chunk is available). To emit a new output chunk, use self.output_buffer.put() (non-blocking).

For example, if 16_000 floats/sec are streamed and an external sender sends chunks of size 1600 ten times per second, but your model wants 1-second chunks, call self.input_buffer.get_with_rechunking(size=16_000). This will block until 10 chunks are accumulated for any ID and return the result.

Normally, on stop_thread() an Exit exception is raised when the _run method tries to read from the input buffer. It causes an exit from _run and is handled in the wrapping method _run_and_send_exit. So the Exit exception should not be handled in _run.

property audio_type: Literal['int']

The required input audio format. Together with sampling_rate property, forms a specification of input audio.

See also convert_audio_format() for details about formats.

class asr_eval.models.t_one_wrapper.TOneWrapper[source]

Bases: Transcriber

A non-streaming wrapper for T-One model.

Installation: see Installation page.

transcribe(waveform)[source]

Transcribes a float32 waveform, typically normalized from -1 to 1.

Return type:

str

Parameters:

waveform (ndarray[tuple[int, ...], dtype[floating[Any]]])

class asr_eval.models.vikhr_wrapper.VikhrBorealisWrapper[source]

Bases: Transcriber

A Vikhr Borealis wrapper.

Loading a model takes a long time, around 2 min.

Installation: see Installation page.

transcribe(waveform)[source]

Transcribes a float32 waveform, typically normalized from -1 to 1.

Return type:

str

Parameters:

waveform (ndarray[tuple[int, ...], dtype[floating[Any]]])

class asr_eval.models.vosk54_wrapper.VoskV54(device='cuda')[source]

Bases: Transcriber

A wrapper for Vosk 0.54 model.

Installation: see Installation page.

Parameters:

device (str | torch.device)

transcribe(waveform)[source]

Transcribes a float32 waveform, typically normalized from -1 to 1.

Return type:

str

Parameters:

waveform (ndarray[tuple[int, ...], dtype[floating[Any]]])

class asr_eval.models.vosk_streaming_wrapper.VoskStreaming(model_name='vosk-model-small-en-us-0.15', sampling_rate=16_000, chunk_length_sec=None)[source]

Bases: StreamingASR

A wrapper for Vosk streaming model.

Installation: see Installation page.

Parameters:
  • model_name (str)

  • sampling_rate (int)

  • chunk_length_sec (float | None)

_run()[source]

A background thread that processes input chunks and emits output chunks.

It is started with start_thread() and should live forever, usually with a while True loop. To get the next input chunk, use self.input_buffer.get() or get_with_rechunking() (both methods block until the next chunk is available). To emit a new output chunk, use self.output_buffer.put() (non-blocking).

For example, if 16_000 floats/sec are streamed and an external sender sends chunks of size 1600 ten times per second, but your model wants 1-second chunks, call self.input_buffer.get_with_rechunking(size=16_000). This will block until 10 chunks are accumulated for any ID and return the result.

Normally, on stop_thread() an Exit exception is raised when the _run method tries to read from the input buffer. It causes an exit from _run and is handled in the wrapping method _run_and_send_exit. So the Exit exception should not be handled in _run.

property audio_type: Literal['bytes']

The required input audio format. Together with sampling_rate property, forms a specification of input audio.

See also convert_audio_format() for details about formats.

class asr_eval.models.voxtral_wrapper.VoxtralWrapper(model_name='mistralai/Voxtral-Mini-3B-2507', client='run_local_server', language=None, temperature=0.7, local_server_verbose=False, format='flac')[source]

Bases: APITranscriber

A wrapper to call Voxtral via OpenAI API.

Installation: see Installation page.

Example

>>> voxtral = VoxtralWrapper('mistralai/Voxtral-Mini-3B-2507')
>>> text = voxtral.transcribe(speech_sample(repeats=2))
>>> print(text)
>>> voxtral.stop_vllm_server()

See the VLLM source code in vllm.model_executor.models.voxtral.

According to VoxtralEncoderModel.prepare_inputs_for_conv, the Voxtral pipeline splits a long audio into non-overlapping chunks, then processes each chunk via Whisper and concatenates the outputs. So, the LLM sees the whole long audio at once.

According to vllm.model_executor.models.voxtral.get_generation_prompt, Voxtral uses the encode_transcription method of the mistral_common.tokens.tokenizers.instruct.InstructTokenizerV7 tokenizer. It starts from <bos>, adds the audio, adds an f”lang:{request.language}” substring and a special token [TRANSCRIBE].

Thus, there is a problem with using domain words in Voxtral, since such a prompt does not support user instructions. There may be solutions, but this feature is not implemented in this wrapper yet.

Authors: Vasily Kudryavtsev & Oleg Sedukhin

Parameters:
  • model_name (str)

  • client (OpenAI | Literal['run_local_server'])

  • language (str | LanguageAlpha2 | None)

  • temperature (float)

  • local_server_verbose (bool)

  • format (str)

class asr_eval.models.wav2vec2_wrapper.Wav2vec2Wrapper(model_name='facebook/wav2vec2-base-960h')[source]

Bases: CTC

A wrapper for wav2vec2 Hugging Face models.

Requires transformers package.

Note

This does not support Wav2Vec2ProcessorWithLM. This wrapper follows the CTC interface: it returns log probs only. If you need an LM, you may use CTCDecoderWithLM.

Parameters:

model_name (str)

ctc_log_probs(waveforms)[source]

Calculates log probabilities for each time frame, given a float32 waveform, typically normalized from -1 to 1. The exponentiated log probabilities should sum to 1 for each time frame.

Typically obtained from logits via torch.nn.functional.log_softmax. Note that the returned value should be a numpy array, not a torch tensor.

Return type:

list[ndarray[tuple[int, ...], dtype[floating[Any]]]]

Parameters:

waveforms (list[ndarray[tuple[int, ...], dtype[floating[Any]]]])

property blank_id: int

An index in vocabulary for <blank> CTC token.

property tick_size: float

A time interval in seconds between consecutive time frames in the log probs matrix.

property vocab: tuple[str, ...]

Returns a vocabulary: a character (usually a single letter) or character sequence for each vocabulary index, or an empty string for the blank token.

Note that this does not fully support Whisper-style BPE encoding: each single token should correspond to a valid unicode string.

class asr_eval.models.whisper_faster_wrapper.FasterWhisperLongformWrapper(device='auto', checkpoint='large-v3-turbo', segmenter='internal', custom_segmenter_min_sec=30, allow_merging_segments=True, dtype='float16')[source]

Bases: TimedTranscriber

Faster-whisper wrapper for longform transcription.

Parameters:
  • checkpoint (str) – A checkpoint in CTranslate2 format. See the full list of available checkpoints in the faster_whisper.transcribe.WhisperModel docstring. Examples: “medium”, “large-v3”, “distil-large-v3”, “large-v3-turbo”. It can also be a path to a local checkpoint. To use a custom checkpoint, it must first be converted to CTranslate2 format; see for example: https://huggingface.co/deepdml/faster-whisper-large-v3-turbo-ct2#conversion-details

  • segmenter (Union[Literal['internal', 'shortform'], Segmenter]) – a segmentation method for longform transcription. If “internal”, faster-whisper will use the https://github.com/snakers4/silero-vad model internally when segments=None in transcribe_internal(); otherwise it will use the passed segments. If a Segmenter instance, the specified segmenter is used when segments=None in transcribe_internal(); otherwise the passed segments are used. If “shortform”, the whole audio is used as a single segment when segments=None in transcribe_internal(); otherwise the passed segments are used. All segments should be shorter than 30 sec.

  • custom_segmenter_min_sec (float) – Used only if segmenter is an instance of Segmenter. If the audio is shorter than this value, the segmenter is not called and the whole audio is used as a single segment.

  • allow_merging_segments (bool) – If True, faster-whisper may internally merge several segments into one. If custom segments are passed into transcribe_internal(), or obtained from a custom segmenter passed as the segmenter argument, the length of the returned list may then differ from len(segments). Setting this to False disables the behaviour.

  • device (Literal['cuda', 'cpu', 'auto'])

  • dtype (Literal['float16', 'float32', 'bfloat16', 'int8_float16', 'int8_bfloat16', 'int8_float32', 'int8'])
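The custom_segmenter_min_sec gate amounts to a simple duration check before delegating to the segmenter. A minimal sketch with hypothetical names (plain (start, end) tuples instead of the real AudioSegment type, assuming a 16 kHz sample rate):

```python
import numpy as np

def choose_segments(waveform: np.ndarray, segmenter, min_sec: float = 30.0, sr: int = 16_000):
    """Use the whole audio as one segment when it is shorter than min_sec."""
    duration = len(waveform) / sr
    if duration < min_sec:
        return [(0.0, duration)]   # single whole-audio segment; segmenter is not called
    return segmenter(waveform)     # delegate to the custom Segmenter

short = np.zeros(16_000 * 10, dtype=np.float32)  # 10 s of silence
print(choose_segments(short, segmenter=None))    # -> [(0.0, 10.0)]
```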

Example - 7 input segments get merged into 5 output segments (the default faster-whisper behaviour):

waveform: FLOATS = librosa.load('tests/testdata/long.mp3', sr=16_000)[0] # type: ignore
segments = [AudioSegment(0, 16), AudioSegment(18, 34), AudioSegment(35, 52), AudioSegment(73, 90),
        AudioSegment(91, 103), AudioSegment(103, 119), AudioSegment(120, 132)]
model = FasterWhisperLongformWrapper(segmenter='shortform')
outputs = model.transcribe_internal(waveform, segments=segments)
print([(round(seg.start), round(seg.end)) for seg in outputs])

Output: [(0, 16), (18, 34), (35, 52), (73, 103), (103, 132)]

Example - disable merging:

model = FasterWhisperLongformWrapper(segmenter='shortform', allow_merging_segments=False)
outputs = model.transcribe_internal(waveform, segments=segments)
print([(round(seg.start), round(seg.end)) for seg in outputs])

Output: [(0, 8), (8, 16), (18, 34), (35, 52), (73, 90), (91, 103), (103, 119), (120, 132)]

Example - output segments may be shorter than the corresponding input segments:

dataset = get_dataset('multivariant-v1-200')
segmenter = PyannoteSegmenter()
model = FasterWhisperLongformWrapper(segmenter='shortform', allow_merging_segments=False)
waveform = cast(FLOATS, dataset[1]['audio']['array'])
segments = segmenter(waveform)
print([(round(seg.start_time), round(seg.end_time)) for seg in segments])
outputs = model.transcribe_internal(waveform, segments=segments)
print([(round(seg.start), round(seg.end)) for seg in outputs])

Output: [(1, 23), (23, 33), (35, 57), (57, 69), (76, 77), (78, 100), (100, 122), (122, 126), (128, 133)]
Output: [(1, 23), (23, 32), (35, 57), (66, 69), (76, 77), (90, 91), (100, 122), (122, 126), (128, 133)]

NOTE: For some reason, it subtracts 0.5 sec from the original segments.

NOTE: If batch_size=1 in transcribe_internal(), and segmenter != 'internal', the wrapper will call faster_whisper.WhisperModel (instead of faster_whisper.BatchedInferencePipeline) for each input segment, and will then postprocess the outputs, shifting all output timestamps by the input segment’s start time.
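That per-segment timestamp shift is a plain offset into the full-audio time axis. A minimal sketch, using a hypothetical stand-in for the asr_eval TimedText type rather than the actual class:

```python
from dataclasses import dataclass

@dataclass
class TimedText:  # hypothetical stand-in, not the real asr_eval TimedText
    start: float
    end: float
    text: str

def shift_timestamps(chunks: list[TimedText], segment_start: float) -> list[TimedText]:
    """Shift segment-relative output timings by the segment's start time."""
    return [TimedText(c.start + segment_start, c.end + segment_start, c.text) for c in chunks]

# faster-whisper returns timings relative to the segment; for a segment starting
# at 18.0 s they are offset back to absolute positions in the full audio.
local = [TimedText(0.0, 4.2, "hello"), TimedText(4.5, 7.9, "world")]
print(shift_timestamps(local, 18.0))  # starts become 18.0 and 22.5
```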

NOTE: If it says “Unable to load any of {libcudnn_ops.so.9.1.0, …}” - then run

pip install -U nvidia-cuda-runtime-cu12 nvidia-cudnn-cu12
sudo find / -name "libcudnn_ops.so*" 2>/dev/null

And add the directory containing this file to LD_LIBRARY_PATH, for example:

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$PWD/venv/lib/python3.12/site-packages/nvidia/cudnn/lib
timed_transcribe(waveform, segments=None)[source]

Transcribes a float32 waveform, typically normalized from -1 to 1, into a list of texts with timings. Typically the texts are to be concatenated via space, so leading or trailing spaces in each chunk are not required.

Return type:

list[TimedText]

Parameters:
  • waveform (ndarray[tuple[int, ...], dtype[floating[Any]]])

  • segments (list[AudioSegment] | None)

class asr_eval.models.whisper_wrapper.WhisperLongformWrapper(model_name='openai/whisper-large-v3', preproc_name=None, lang=None, condition_on_prev_tokens=False, temperature=0, dtype='float32')[source]

Bases: Transcriber

A wrapper for Whisper.

If the audio is long, the wrapper internally performs a longform transcription, passing the previously transcribed words at each step.

Since the transcription history is used internally in WhisperForConditionalGeneration.generate, this class does not implement a ContextualTranscriber interface.

Parameters:
  • model_name (str)

  • preproc_name (str | None)

  • lang (Literal['russian', 'english'] | None)

  • condition_on_prev_tokens (bool)

  • temperature (float)

  • dtype (str)

transcribe(waveform)[source]

Transcribes a float32 waveform, typically normalized from -1 to 1.

Return type:

str

Parameters:

waveform (ndarray[tuple[int, ...], dtype[floating[Any]]])

class asr_eval.models.yandex_speechkit_wrapper.YandexSpeechKitWrapper(api_key, model='general', language='ru-RU', audio_processing='Full', normalize=False)[source]

Bases: TimedTranscriber

A wrapper for Yandex SpeechKit transcriber.

Docs: https://yandex.cloud/ru/docs/speechkit/stt/models

To obtain an API key, create a service account and an API key as described here: https://yandex.cloud/ru/docs/speechkit/quickstart/stt-quickstart-v2

SpeechKit provides timings for each word, as well as raw and normalized text. It seems to normalize text for language=’ru-RU’ but not for language=’auto’.

Example raw:

[седьмого [0.399, 1.060], восьмого [1.120, 1.780], мая [1.860, 2.399], в [2.520, 2.580],
пуэрто [2.639, 3.340], рико [3.419, 3.899], прошел [4.110, 4.680], шестнадцатый [4.839, 5.839],
этап [5.890, 6.299], формулы [6.470, 7.170], один [7.259, 7.740], с [7.859, 7.890],
фондом [8.040, 8.780], сто [8.950, 9.320], тысяч [9.429, 9.690], долларов [9.900, 10.700],
победителем [11.559, 12.346], стал [12.420, 12.733],

Example normalized:

7 8 Мая в Пуэрто Рико прошел 16 этап Формулы 1 с Фондом 10.00000000000% $-победителем стал

As you can see, normalization introduces some errors, and it is sometimes hard to align raw and normalized text.

If normalize=True and normalized text is returned by the API:

  1. transcribe() returns the full normalized text.

  2. timed_transcribe() returns a list of normalized utterances if available, otherwise the full text.

Otherwise:

  1. transcribe() returns the full unnormalized text.

  2. timed_transcribe() returns a list of unnormalized single words.
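In the unnormalized case, timed_transcribe() yields one word per chunk, and the TimedTranscriber base class joins the chunk texts with spaces to form the transcribe() output. A minimal sketch of that join, using plain (start, end, text) tuples (the real code uses TimedText) and word timings taken from the raw example above:

```python
# Word-level chunks as (start, end, text) tuples; the real code uses TimedText.
chunks = [(0.399, 1.060, "седьмого"), (1.120, 1.780, "восьмого"), (1.860, 2.399, "мая")]
full_text = " ".join(text for _, _, text in chunks)
print(full_text)  # -> "седьмого восьмого мая"
```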

Authors: Dmitry Ezhov & Oleg Sedukhin

Parameters:
  • api_key (str)

  • model (Literal['general', 'general:rc', 'general:deprecated'])

  • language (Literal['auto', 'ru-RU', 'en-US'] | str)

  • audio_processing (Literal['Full', 'Stream'])

  • normalize (bool)

timed_transcribe(waveform)[source]

Transcribes a float32 waveform, typically normalized from -1 to 1, into a list of texts with timings. Typically the texts are to be concatenated via space, so leading or trailing spaces in each chunk are not required.

Return type:

list[TimedText]

Parameters:

waveform (ndarray[tuple[int, ...], dtype[floating[Any]]])