asr_eval.models¶
A collection of ASR model wrappers in unified formats.
- class asr_eval.models.base.interfaces.Segmenter[source]¶
Bases: ABC

An abstract model that segments long-form audio into chunks containing speech.
Any parameters, such as the maximum segment size, should go into the class constructor.
- class asr_eval.models.base.interfaces.Transcriber[source]¶
Bases: ABC

An abstract transcriber (audio -> text) to evaluate on any dataset.
- class asr_eval.models.base.interfaces.TimedTranscriber[source]¶
Bases: Transcriber

An abstract timed transcriber (audio -> timed text chunks) to evaluate on any dataset.
Overrides the transcribe() method by concatenating the text chunks with spaces. Subclasses may customize this.
- abstractmethod timed_transcribe(waveform)[source]¶
Transcribes a float32 waveform, typically normalized from -1 to 1, into a list of texts with timings. Typically the texts are to be concatenated via space, so leading or trailing spaces in each chunk are not required.
- Return type:
list[TimedText]
- Parameters:
waveform (ndarray[tuple[int, ...], dtype[floating[Any]]])
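The timed-transcription contract can be sketched with stand-in types. The TimedText fields and the space-joining default below are assumptions based on the docstring, not the actual asr_eval definitions:

```python
from dataclasses import dataclass

@dataclass
class TimedText:
    """Stand-in for asr_eval's TimedText: a text chunk with start/end times in seconds."""
    start: float
    end: float
    text: str

def join_timed_chunks(chunks: list[TimedText]) -> str:
    """Mimics the default transcribe() behaviour described above:
    concatenate chunk texts with a single space."""
    return ' '.join(chunk.text for chunk in chunks)

chunks = [TimedText(0.0, 2.1, 'hello'), TimedText(2.5, 4.0, 'world')]
print(join_timed_chunks(chunks))  # hello world
```

Because the default transcribe() joins with spaces, chunk texts need no leading or trailing whitespace of their own.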
- class asr_eval.models.base.interfaces.CTC[source]¶
Bases: Transcriber

An abstract CTC model that converts audio into log probabilities for each time frame.
Implementations override a list of additional methods for actions such as decoding.
- abstractmethod ctc_log_probs(waveforms)[source]¶
Calculates log probabilities for each time frame, given a float32 waveform, typically normalized from -1 to 1. The exponents of the log probabilities should sum to 1 for each time frame.
Typically obtained from logits via
torch.nn.functional.log_softmax. Note that the returned value should be a numpy array, not a torch tensor.
- Return type:
list[ndarray[tuple[int, ...], dtype[floating[Any]]]]
- Parameters:
waveforms (list[ndarray[tuple[int, ...], dtype[floating[Any]]]])
- abstract property blank_id: int¶
An index in the vocabulary for the <blank> CTC token.
- abstract property tick_size: float¶
A time interval in seconds between consecutive time frames in the log probs matrix.
- abstract property vocab: tuple[str, ...]¶
A character (usually a single letter) or character sequence for each vocabulary index, or an empty string for the blank token.
Note that this does not fully support Whisper-style BPE encoding: each single token should correspond to a valid unicode string.
- class asr_eval.models.base.interfaces.ContextualTranscriber[source]¶
Bases: Transcriber

An abstract transcriber able to accept the previous transcription as context.
- abstractmethod contextual_transcribe(waveform, prev_transcription='')[source]¶
Transcribes a float32 waveform, typically normalized from -1 to 1. The prev_transcription represents the transcription of all the audio preceding the current waveform.
- Return type:
str
- Parameters:
waveform (ndarray[tuple[int, ...], dtype[floating[Any]]])
prev_transcription (str)
- class asr_eval.models.base.longform.LongformVAD(shortform_model, segmenter, min_sec=30)[source]¶
Bases: TimedTranscriber

A longform transcriber wrapper for any shortform model.
A longform transcriber is one able to transcribe long audios. The concrete threshold between "long" and "short" audio may be specific to the provided shortform_model.
This wrapper uses the provided segmenter to split the audio into chunks, then applies the shortform model to each chunk independently. If the shortform model is a TimedTranscriber, it concatenates the resulting lists for all chunks, while correcting the timestamps to be relative to the whole audio.
Example

>>> # requires `pip install pyannote.audio>=4` for `PyannoteSegmenter`
>>> from asr_eval.models.base.longform import LongformVAD
>>> from asr_eval.models.pyannote_vad import PyannoteSegmenter
>>> from asr_eval.models.wav2vec2_wrapper import Wav2vec2Wrapper
>>> LongformVAD(
...     Wav2vec2Wrapper('facebook/wav2vec2-base-960h'),
...     PyannoteSegmenter()
... )
See also:
LongformCTC.
- Parameters:
shortform_model (Transcriber)
segmenter (Segmenter)
min_sec (float)
- timed_transcribe(waveform)[source]¶
Transcribes a float32 waveform, typically normalized from -1 to 1, into a list of texts with timings. Typically the texts are to be concatenated via space, so leading or trailing spaces in each chunk are not required.
- Return type:
list[TimedText]
- Parameters:
waveform (ndarray[tuple[int, ...], dtype[floating[Any]]])
- class asr_eval.models.base.longform.LongformCTC(shortform_model, segment_length=30, segment_shift=10, averaging_weights='beta', pbar_callback=None)[source]¶
Bases: CTC

A wrapper to apply a shortform CTC model to longform audio.
A longform transcriber is one able to transcribe long audios. This wrapper segments the audio uniformly with overlaps, then averages the log probs over all segments. By default it averages with beta-distributed weights (averaging_weights='beta'), because a model may be less certain near the edges of a segment.
See also:
LongformVAD.
- Parameters:
shortform_model (CTC)
segment_length (float)
segment_shift (float)
averaging_weights (Literal['beta', 'uniform', 'blank_sep'])
pbar_callback (Callable[[int, int], None] | None)
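The overlap-averaging idea can be sketched in pure Python. The beta shape parameters (alpha=beta=2) and the exact weighting scheme below are assumptions for illustration; the actual weights used by LongformCTC are not documented here:

```python
# Sketch of averaging overlapping segment outputs with beta-shaped weights:
# frames near a segment's edges get a low weight, frames in the middle a high one.
def beta_weight(pos: float, alpha: float = 2.0, beta: float = 2.0) -> float:
    """Unnormalized Beta(alpha, beta) density at relative position pos in (0, 1)."""
    return pos ** (alpha - 1) * (1 - pos) ** (beta - 1)

def average_overlap(values_a, values_b, overlap):
    """Average the last `overlap` frames of segment A with the first `overlap`
    frames of segment B, weighting each frame by its beta weight within its segment."""
    merged = []
    n_a, n_b = len(values_a), len(values_b)
    for i in range(overlap):
        pos_a = (n_a - overlap + i + 0.5) / n_a  # near the end of A -> low weight
        pos_b = (i + 0.5) / n_b                  # near the start of B -> low weight
        w_a, w_b = beta_weight(pos_a), beta_weight(pos_b)
        merged.append((w_a * values_a[n_a - overlap + i] + w_b * values_b[i]) / (w_a + w_b))
    return merged

# Two 4-frame segments with a 2-frame overlap: the merged frames move smoothly
# from segment A's value (1.0) toward segment B's value (3.0).
merged = average_overlap([1.0, 1.0, 1.0, 1.0], [3.0, 3.0, 3.0, 3.0], overlap=2)
print(merged)
```

With averaging_weights='uniform' the two weights would simply be equal for every overlapping frame.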
- ctc_log_probs(waveforms)[source]¶
Calculates log probabilities for each time frame, given a float32 waveform, typically normalized from -1 to 1. The exponents of the log probabilities should sum to 1 for each time frame.
Typically obtained from logits via
torch.nn.functional.log_softmax. Note that the returned value should be a numpy array, not a torch tensor.
- Return type:
list[ndarray[tuple[int, ...], dtype[floating[Any]]]]
- Parameters:
waveforms (list[ndarray[tuple[int, ...], dtype[floating[Any]]]])
- property blank_id: int¶
An index in the vocabulary for the <blank> CTC token.
- property tick_size: float¶
A time interval in seconds between consecutive time frames in the log probs matrix.
- property vocab: tuple[str, ...]¶
A character (usually a single letter) or character sequence for each vocabulary index, or an empty string for the blank token.
Note that this does not fully support Whisper-style BPE encoding: each single token should correspond to a valid unicode string.
- class asr_eval.models.base.longform.ContextualLongformVAD(shortform_model, segmenter, pass_history=True, max_history_words=100, min_sec=30)[source]¶
Bases: TimedTranscriber

A wrapper similar to LongformVAD, but for each chunk it passes the previously transcribed text, up to max_history_words, as context when transcribing the next chunk.
Requires the shortform model to be a ContextualTranscriber.
- Parameters:
ContextualTranscriber.- Parameters:
shortform_model (ContextualTranscriber)
segmenter (Segmenter)
pass_history (bool)
max_history_words (int | None)
min_sec (float)
- timed_transcribe(waveform)[source]¶
Transcribes a float32 waveform, typically normalized from -1 to 1, into a list of texts with timings. Typically the texts are to be concatenated via space, so leading or trailing spaces in each chunk are not required.
- Return type:
list[TimedText]
- Parameters:
waveform (ndarray[tuple[int, ...], dtype[floating[Any]]])
- class asr_eval.models.base.openai_wrapper.APITranscriber(model_name='mistralai/Voxtral-Mini-3B-2507', client='run_local_server', language=None, prompt=None, chunking_strategy='omit', temperature=0.7, format='flac', local_server_verbose=False)[source]¶
Bases: Transcriber

A connector to the OpenAI API for audio LLMs. Runs via client.audio.transcriptions.create. This class wraps api_transcribe() to implement the Transcriber interface. See the api_transcribe() docstring for the chunking_strategy and temperature params.
This class can also auto-start a local VLLM server. To do this, subclass this class and define vllm_run_args(). See VoxtralWrapper as an example.
Example with starting VLLM manually:
Start a local VLLM server

vllm serve mistralai/Voxtral-Mini-3B-2507 --tokenizer_mode mistral \
    --config_format mistral --load_format mistral \
    --tensor-parallel-size 1 --tool-call-parser mistral \
    --enable-auto-tool-choice --gpu-memory-utilization 0.75

Run the code

from openai import OpenAI
from asr_eval.models.base.openai_wrapper import APITranscriber

transcriber = APITranscriber(
    OpenAI(api_key='EMPTY', base_url='http://localhost:8000/v1'),
    model_name='mistralai/Voxtral-Mini-3B-2507',
    language='ru',
)
waveform = <load your audio sample>
transcriber.transcribe(waveform)
- Parameters:
model_name (str)
client (OpenAI | Literal['run_local_server'])
language (str | LanguageAlpha2 | None)
prompt (str | None)
chunking_strategy (Literal['auto', 'omit'] | ChunkingStrategyVadConfig | Omit)
temperature (float)
format (str)
local_server_verbose (bool)
- asr_eval.models.base.openai_wrapper.api_transcribe(client, waveform, model_name, language=None, prompt=None, chunking_strategy='omit', temperature=0.7, format='flac')[source]¶
A connector to the OpenAI API for audio LLMs. Runs via
client.audio.transcriptions.create. See the full usage example in APITranscriber.
Sends a message with audio and language to transcribe. The default temperature is 0.7; this value is taken from mistral_common's BaseCompletionRequest.
- Return type:
tuple[str, list[Logprob] | None]
- Returns:
A transcription and logprobs (optional, if returned by the model). According to the openai.types.audio.transcription.Transcription docstring, logprobs are returned only with the models gpt-4o-transcribe and gpt-4o-mini-transcribe.
- Parameters:
client (OpenAI)
waveform (ndarray[tuple[int, ...], dtype[floating[Any]]])
model_name (str)
language (str | LanguageAlpha2 | None)
prompt (str | None)
chunking_strategy (Literal['auto', 'omit'] | ~openai.types.audio.transcription_create_params.ChunkingStrategyVadConfig | ~openai.Omit)
temperature (float)
format (str)
By default chunking_strategy is unset, and the audio is transcribed as a single block, according to the client.audio.transcriptions.create docstring.
Voxtral seems to ignore both chunking_strategy and a request to return logprobs, according to VLLM server logs.
format is FLAC by default; this is effectively a compressed (lossless) wav and should be smaller than wav.
- Raises:
openai.APIConnectionError – If cannot connect to the API
openai.NotFoundError – If cannot find the specified model_name
InternalServerError – In some cases (happened with VseGPT)
- exception asr_eval.models.base.openai_wrapper.ContentFilterException[source]¶
Bases: RuntimeError

An API model refused to generate due to the content policy.
- class asr_eval.models.ast_wrapper.AudioSpectrogramTransformer(model_path='MIT/ast-finetuned-audioset-10-10-0.4593', device='cuda')[source]¶
An AudioSpectrogramTransformer (AST) able to recognize sound types.
Requires the transformers package.
- Parameters:
model_path (str)
device (str)
- class asr_eval.models.flamingo_wrapper.FlamingoWrapper(lang='ru')[source]¶
Bases: Transcriber

A Flamingo transcriber. Not working anymore, TODO fix.
Installation: see Installation page.
Authors: Dmitry Ezhov & Oleg Sedukhin
- Parameters:
lang (Literal['en', 'ru'])
- class asr_eval.models.gemma_wrapper.Gemma3nWrapper(lang='en', domain_text='')[source]¶
Bases: ContextualTranscriber

A Gemma3n transcriber. Too slow currently, TODO fix.
If domain_text is specified, it is added into prompt with a note “may be related”.
Installation: see Installation page.
Authors: Timur Rafikov & Oleg Sedukhin
- Parameters:
lang (Literal['en', 'ru'])
domain_text (str)
- contextual_transcribe(waveform, prev_transcription='')[source]¶
Transcribes a float32 waveform, typically normalized from -1 to 1. The prev_transcription represents the transcription of all the audio preceding the current waveform.
- Return type:
str
- Parameters:
waveform (ndarray[tuple[int, ...], dtype[floating[Any]]])
prev_transcription (str)
- class asr_eval.models.gigaam_wrapper.GigaAMShortformBase[source]¶
Bases: Transcriber, ABC

An abstract class for GigaAM models, either CTC or RNNT.
- Implementations:
- class asr_eval.models.gigaam_wrapper.GigaAMShortformRNNT(version, device='cuda', fp16=False)[source]¶
Bases: GigaAMShortformBase

A GigaAM RNNT model. Supports different versions (see the version parameter): "v2", "v3", "v3_e2e".
Installation: see Installation page.
- Parameters:
version (Literal['v2', 'v3', 'v3_e2e'])
device (str | torch.device)
fp16 (bool)
- class asr_eval.models.gigaam_wrapper.GigaAMShortformCTC(version, device='cuda', fp16=False)[source]¶
Bases: GigaAMShortformBase, CTC

A GigaAM CTC model. Supports different versions (see the version parameter): "v2", "v3", "v3_e2e".
Installation: see Installation page.
- Parameters:
version (Literal['v2', 'v3', 'v3_e2e'])
device (str | torch.device)
fp16 (bool)
- transcribe(waveform)[source]¶
Transcribes a float32 waveform, typically normalized from -1 to 1.
- Return type:
str
- Parameters:
waveform (ndarray[tuple[int, ...], dtype[floating[Any]]])
- property blank_id: int¶
An index in the vocabulary for the <blank> CTC token.
- property tick_size: float¶
A time interval in seconds between consecutive time frames in the log probs matrix.
- property vocab: tuple[str, ...]¶
A character (usually a single letter) or character sequence for each vocabulary index, or an empty string for the blank token.
Note that this does not fully support Whisper-style BPE encoding: each single token should correspond to a valid unicode string.
- ctc_log_probs(waveforms)[source]¶
Calculates log probabilities for each time frame, given a float32 waveform, typically normalized from -1 to 1. The exponents of the log probabilities should sum to 1 for each time frame.
Typically obtained from logits via
torch.nn.functional.log_softmax. Note that the returned value should be a numpy array, not a torch tensor.
- Return type:
list[ndarray[tuple[int, ...], dtype[floating[Any]]]]
- Parameters:
waveforms (list[ndarray[tuple[int, ...], dtype[floating[Any]]]])
- class asr_eval.models.legacy_pisets_wrapper.LegacyPisetsWrapper(repo_dir, min_segment_size=1, max_segment_size=20, use_vad=False, whisper_ckpt='bond005/whisper-large-v3-ru-podlodka')[source]¶
Bases: TimedTranscriber

A Pisets transcriber from https://github.com/bond005/pisets
Commit hash e095ae626bbd18bb4490b9745d0acc34006c4eb8
Requires a manual cloning into the repo_dir before instantiating.
- Parameters:
repo_dir (str | Path)
min_segment_size (int)
max_segment_size (int)
use_vad (bool)
whisper_ckpt (str)
- timed_transcribe(waveform)[source]¶
Transcribes a float32 waveform, typically normalized from -1 to 1, into a list of texts with timings. Typically the texts are to be concatenated via space, so leading or trailing spaces in each chunk are not required.
- Return type:
list[TimedText]
- Parameters:
waveform (ndarray[tuple[int, ...], dtype[floating[Any]]])
- class asr_eval.models.nemo_wrapper.NvidiaNemoWrapper(model_name, inference_kwargs=None, verbose=False, dtype='float32', amp=False, beam_size=None)[source]¶
Bases: Transcriber

An NVIDIA NeMo wrapper.
Installation: see Installation page.
Some of the available models (many more are available):

1. "nvidia/canary-1b-v2"
NOTE: Specify the language, for example: inference_kwargs={'source_lang': 'ru', 'target_lang': 'ru'}
NOTE: in NeMo, beam_size=1 by default

2. "nvidia/parakeet-tdt-0.6b-v3"
NOTE: Supports torch.float16 or torch.bfloat16 only with amp=True
NOTE: in NeMo, beam_size=2 by default

3. "nvidia/stt_ru_fastconformer_hybrid_large_pc"
NOTE: in NeMo, beam_size=2 by default
- Dtypes:
for amp=True, available dtypes are torch.float16, torch.bfloat16
for amp=False, available dtypes are torch.float16, torch.bfloat16, torch.float32
Notes:
This wrapper is built using the following docs and examples:
https://github.com/NVIDIA-NeMo/NeMo/blob/main/examples/asr/transcribe_speech.py
https://docs.nvidia.com/nemo-framework/user-guide/25.02/nemotoolkit/asr/api.html#nemo.collections.asr.parts.mixins.transcription.TranscriptionMixin
The NeMo wrapper seems not to perform internal VAD; it raises OOM on too long audios. From the EncDecMultiTaskModel docstrings: "recommended length per file is between 5 and 25 seconds, but it is possible to pass a few hours long file if enough GPU memory is available".
The .transcribe() method of NeMo's TranscriptionMixin allows passing timestamps=True. It raises an error for Canary, but returns timestamps for Parakeet and FastConformer. However, the output timestamps require postprocessing that is not implemented currently.
Some of the models should support the CTC interface and/or LM integration, but this is not implemented in asr_eval currently.
To get the full list of available models, run:
from nemo.collections.asr.models import ASRModel

print(ASRModel.list_available_models())
- Parameters:
model_name (str)
inference_kwargs (dict[str, str] | None)
verbose (bool)
dtype (torch.dtype | str)
amp (bool)
beam_size (int | None)
- class asr_eval.models.pyannote_diarization.PyannoteDiarizationWrapper(verbose=False, preset='pyannote/speaker-diarization-community-1', segmentation=None, embedding=None)[source]¶
A wrapper for Pyannote diarization.
Requires pyannote>=4.0.0. To use, you first need to accept the conditions at https://huggingface.co/pyannote/speaker-diarization-community-1 , then set your HF_TOKEN environment variable.
- Parameters:
verbose (bool)
preset (str)
segmentation (str | None)
embedding (str | None)
- class asr_eval.models.pyannote_vad.PyannoteSegmenter(min_duration=15, max_duration=22, strict_limit_duration=30.0, new_chunk_threshold=0.2, lower_limit_duration=0.1)[source]¶
Bases: Segmenter

A VAD-based audio segmenter based on Pyannote. With default params it is equivalent to gigaam.vad_utils.segment_audio.
Requires pyannote>=4.0.0. Based on https://github.com/salute-developers/GigaAM/blob/main/gigaam/vad_utils.py . This segmenter does NOT require the gigaam package to be installed, because all the required functions are copied from it. The model is cached in the PYANNOTE_CACHE dir, by default: ~/.cache/torch/pyannote.
- Parameters:
min_duration (float)
max_duration (float)
strict_limit_duration (float)
new_chunk_threshold (float)
lower_limit_duration (float)
- class asr_eval.models.qwen2_audio_wrapper.Qwen2AudioWrapper(domain_text='')[source]¶
Bases: ContextualTranscriber

A wrapper for the Qwen2-Audio transcriber.
Produces bad output, TODO fix.
If domain_text is specified, it is added into prompt with a note “may be related”.
Installation: see Installation page.
Authors: Muharyam Baviev & Oleg Sedukhin
- Parameters:
domain_text (str)
- contextual_transcribe(waveform, prev_transcription='')[source]¶
Transcribes a float32 waveform, typically normalized from -1 to 1. The prev_transcription represents the transcription of all the audio preceding the current waveform.
- Return type:
str
- Parameters:
waveform (ndarray[tuple[int, ...], dtype[floating[Any]]])
prev_transcription (str)
- class asr_eval.models.qwen_audio_wrapper.QwenAudioWrapper(language='en', audio_lang_unknown=False)[source]¶
Bases: Transcriber

A wrapper for Qwen-Audio v1 (NOTE: not v2!). Experimental, may not work.
Requires the transformers package.
- Parameters:
language (QWEN_AUDIO_LANGUAGES)
audio_lang_unknown (bool)
- class asr_eval.models.salute_wrapper.SaluteWrapper(api_key, format='flac', language='en-US')[source]¶
Bases: TimedTranscriber

A wrapper for the SaluteSpeech API transcriber.
You need to pass api_key: https://developers.sber.ru/docs/ru/salutespeech/quick-start/integration-individuals
- Raises:
salute_speech.exceptions.SberSpeechError – on API errors
- Parameters:
api_key (str)
format (str)
language (str)
Installation: see Installation page.
- timed_transcribe(waveform)[source]¶
Transcribes a float32 waveform, typically normalized from -1 to 1, into a list of texts with timings. Typically the texts are to be concatenated via space, so leading or trailing spaces in each chunk are not required.
- Return type:
list[TimedText]
- Parameters:
waveform (ndarray[tuple[int, ...], dtype[floating[Any]]])
- class asr_eval.models.speechbrain_wrapper.SpeechbrainStreaming(model_name='speechbrain/asr-streaming-conformer-gigaspeech', sampling_rate=16_000)[source]¶
Bases: StreamingASR

A speechbrain streaming model asr-streaming-conformer-gigaspeech.
Adapted from the Gradio example here: https://huggingface.co/speechbrain/asr-streaming-conformer-librispeech
Installation: see Installation page.
- Parameters:
model_name (Literal['speechbrain/asr-streaming-conformer-gigaspeech'])
sampling_rate (int)
- _run()[source]¶
A background thread that processes input chunks and emits output chunks.
It is started with start_thread() and should live forever, usually with a while True loop. To get the next input chunk, use self.input_buffer.get() or get_with_rechunking() (both methods block until the next chunk is available). To emit a new output chunk, use self.output_buffer.put() (non-blocking).
For example, if 16_000 floats/sec are streamed, and an external sender sends chunks of size 1600 10 times per second, but your model wants 1s chunks, call self.input_buffer.get_with_rechunking(size=16_000). This will block until 10 chunks are accumulated for any ID and return the result.
Normally, on stop_thread() an Exit exception is raised when the _run method tries to read from the input buffer. This causes an exit from _run and is handled in the wrapping method _run_and_send_exit. So, the Exit exception should not be handled in _run.
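The _run() loop can be sketched with plain standard-library queues. This is a stand-in: the real classes use asr_eval's input/output buffers and the Exit exception, which a 'STOP' sentinel replaces here:

```python
import queue
import threading

def run_loop(input_buffer: queue.Queue, output_buffer: queue.Queue) -> None:
    """A minimal stand-in for _run(): read chunks, process, emit, forever."""
    while True:  # lives forever, like _run()
        chunk = input_buffer.get()  # blocks until the next chunk is available
        if chunk == 'STOP':         # stand-in for the Exit exception on stop_thread()
            break
        # a real model would transcribe the audio chunk here
        output_buffer.put(chunk.upper())  # emit an output chunk (non-blocking)

inp: queue.Queue = queue.Queue()
out: queue.Queue = queue.Queue()
thread = threading.Thread(target=run_loop, args=(inp, out), daemon=True)
thread.start()
for text in ['hello', 'world']:
    inp.put(text)
inp.put('STOP')
thread.join()
results = [out.get(), out.get()]
print(results)  # ['HELLO', 'WORLD']
```

Rechunking (get_with_rechunking) would sit between the get() and the processing step, accumulating small chunks until the model's preferred chunk size is reached.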
- property audio_type: Literal['float']¶
The required input audio format. Together with the sampling_rate property, it forms a specification of the input audio.
See also convert_audio_format() for details about formats.
- class asr_eval.models.t_one_wrapper.TOneStreaming[source]¶
Bases: StreamingASR

A streaming wrapper for the T-One model.
Installation: see Installation page.
- _run()[source]¶
A background thread that processes input chunks and emits output chunks.
It is started with start_thread() and should live forever, usually with a while True loop. To get the next input chunk, use self.input_buffer.get() or get_with_rechunking() (both methods block until the next chunk is available). To emit a new output chunk, use self.output_buffer.put() (non-blocking).
For example, if 16_000 floats/sec are streamed, and an external sender sends chunks of size 1600 10 times per second, but your model wants 1s chunks, call self.input_buffer.get_with_rechunking(size=16_000). This will block until 10 chunks are accumulated for any ID and return the result.
Normally, on stop_thread() an Exit exception is raised when the _run method tries to read from the input buffer. This causes an exit from _run and is handled in the wrapping method _run_and_send_exit. So, the Exit exception should not be handled in _run.
- property audio_type: Literal['int']¶
The required input audio format. Together with the sampling_rate property, it forms a specification of the input audio.
See also convert_audio_format() for details about formats.
- class asr_eval.models.t_one_wrapper.TOneWrapper[source]¶
Bases: Transcriber

A non-streaming wrapper for the T-One model.
Installation: see Installation page.
- class asr_eval.models.vikhr_wrapper.VikhrBorealisWrapper[source]¶
Bases: Transcriber

A Vikhr Borealis wrapper.
Loading the model takes a long time, around 2 min.
Installation: see Installation page.
- class asr_eval.models.vosk54_wrapper.VoskV54(device='cuda')[source]¶
Bases: Transcriber

A wrapper for the Vosk 0.54 model.
Installation: see Installation page.
- Parameters:
device (str | torch.device)
- class asr_eval.models.vosk_streaming_wrapper.VoskStreaming(model_name='vosk-model-small-en-us-0.15', sampling_rate=16_000, chunk_length_sec=None)[source]¶
Bases: StreamingASR

A wrapper for the Vosk streaming model.
Installation: see Installation page.
- Parameters:
model_name (str)
sampling_rate (int)
chunk_length_sec (float | None)
- _run()[source]¶
A background thread that processes input chunks and emits output chunks.
It is started with start_thread() and should live forever, usually with a while True loop. To get the next input chunk, use self.input_buffer.get() or get_with_rechunking() (both methods block until the next chunk is available). To emit a new output chunk, use self.output_buffer.put() (non-blocking).
For example, if 16_000 floats/sec are streamed, and an external sender sends chunks of size 1600 10 times per second, but your model wants 1s chunks, call self.input_buffer.get_with_rechunking(size=16_000). This will block until 10 chunks are accumulated for any ID and return the result.
Normally, on stop_thread() an Exit exception is raised when the _run method tries to read from the input buffer. This causes an exit from _run and is handled in the wrapping method _run_and_send_exit. So, the Exit exception should not be handled in _run.
- property audio_type: Literal['bytes']¶
The required input audio format. Together with the sampling_rate property, it forms a specification of the input audio.
See also convert_audio_format() for details about formats.
- class asr_eval.models.voxtral_wrapper.VoxtralWrapper(model_name='mistralai/Voxtral-Mini-3B-2507', client='run_local_server', language=None, temperature=0.7, local_server_verbose=False, format='flac')[source]¶
Bases: APITranscriber

A wrapper to call Voxtral via the OpenAI API.
Installation: see Installation page.
Example
>>> voxtral = VoxtralWrapper('mistralai/Voxtral-Mini-3B-2507')
>>> text = voxtral.transcribe(speech_sample(repeats=2))
>>> print(text)
>>> voxtral.stop_vllm_server()
See the VLLM source code in vllm.model_executor.models.voxtral.
According to VoxtralEncoderModel.prepare_inputs_for_conv, the Voxtral pipeline splits a long audio into non-overlapping chunks, then processes each chunk via Whisper and concatenates the outputs. So, the LLM sees the whole long audio at once.
According to vllm.model_executor.models.voxtral.get_generation_prompt, Voxtral uses the encode_transcription method of the mistral_common.tokens.tokenizers.instruct.InstructTokenizerV7 tokenizer. It starts from <bos>, adds audio, adds an f"lang:{request.language}" substring and a special token [TRANSCRIBE].
Thus, there is a problem with using domain words in Voxtral, since such a prompt does not support user instructions. There may be solutions, but this feature is not implemented in this wrapper yet.
Authors: Vasily Kudryavtsev & Oleg Sedukhin
- Parameters:
model_name (str)
client (OpenAI | Literal['run_local_server'])
language (str | LanguageAlpha2 | None)
temperature (float)
local_server_verbose (bool)
format (str)
- class asr_eval.models.wav2vec2_wrapper.Wav2vec2Wrapper(model_name='facebook/wav2vec2-base-960h')[source]¶
Bases: CTC

A wrapper for wav2vec2 Hugging Face models.
Requires the transformers package.
Note
This does not support Wav2Vec2ProcessorWithLM. This wrapper is in CTC format: it returns log probs only. If you need an LM, you may use CTCDecoderWithLM.
- Parameters:
model_name (str)
- ctc_log_probs(waveforms)[source]¶
Calculates log probabilities for each time frame, given a float32 waveform, typically normalized from -1 to 1. The exponents of the log probabilities should sum to 1 for each time frame.
Typically obtained from logits via
torch.nn.functional.log_softmax. Note that the returned value should be a numpy array, not a torch tensor.
- Return type:
list[ndarray[tuple[int, ...], dtype[floating[Any]]]]
- Parameters:
waveforms (list[ndarray[tuple[int, ...], dtype[floating[Any]]]])
- property blank_id: int¶
An index in the vocabulary for the <blank> CTC token.
- property tick_size: float¶
A time interval in seconds between consecutive time frames in the log probs matrix.
- property vocab: tuple[str, ...]¶
A character (usually a single letter) or character sequence for each vocabulary index, or an empty string for the blank token.
Note that this does not fully support Whisper-style BPE encoding: each single token should correspond to a valid unicode string.
- class asr_eval.models.whisper_faster_wrapper.FasterWhisperLongformWrapper(device='auto', checkpoint='large-v3-turbo', segmenter='internal', custom_segmenter_min_sec=30, allow_merging_segments=True, dtype='float16')[source]¶
Bases: TimedTranscriber

A faster-whisper wrapper for longform transcription.
- Parameters:
checkpoint (str) – A checkpoint in CTranslate2 format. See the full list of available checkpoints in the faster_whisper.transcribe.WhisperModel docstring. Examples: "medium", "large-v3", "distil-large-v3", "large-v3-turbo". It can also be a path to a local checkpoint. A custom checkpoint needs to be converted to CTranslate2 format first, for example: https://huggingface.co/deepdml/faster-whisper-large-v3-turbo-ct2#conversion-details
segmenter (Union[Literal['internal', 'shortform'], Segmenter]) – A segmentation method for longform transcription. If "internal", faster-whisper will use the https://github.com/snakers4/silero-vad model internally if segments=None in transcribe_internal(), otherwise it will use the passed segments. If a Segmenter, it will use the specified segmenter if segments=None in transcribe_internal(), otherwise it will use the passed segments. If "shortform", then if segments=None in transcribe_internal(), it will use the whole audio as a single segment, otherwise it will use the passed segments. All the segments should be shorter than 30 sec.
custom_segmenter_min_sec (float) – Used if segmenter is an instance of Segmenter. If the audio is shorter than the specified value, the segmenter will not be called and the whole audio will be used as a single segment.
custom_segmenter_allow_merging – If True, faster-whisper may internally merge several segments into one. If custom segments are passed into transcribe_internal(), or obtained by a custom segmenter passed as the segmenter argument, then the length of the returned list may differ from len(segments). Setting to False disables this behaviour.
device (Literal['cuda', 'cpu', 'auto'])
allow_merging_segments (bool)
dtype (Literal['float16', 'float32', 'bfloat16', 'int8_float16', 'int8_bfloat16', 'int8_float32', 'int8'])
Example - 7 input segments get merged into 5 output segments (faster-whisper behaviour by default):
waveform: FLOATS = librosa.load('tests/testdata/long.mp3', sr=16_000)[0]  # type: ignore
segments = [AudioSegment(0, 16), AudioSegment(18, 34), AudioSegment(35, 52),
            AudioSegment(73, 90), AudioSegment(91, 103), AudioSegment(103, 119),
            AudioSegment(120, 132)]
model = FasterWhisperLongformWrapper(segmenter='shortform')
outputs = model.transcribe_internal(waveform, segments=segments)
print([(round(seg.start), round(seg.end)) for seg in outputs])

Output: [(0, 16), (18, 34), (35, 52), (73, 103), (103, 132)]
Example - disable merging:
model = FasterWhisperLongformWrapper(segmenter='shortform', allow_merging_segments=False)
outputs = model.transcribe_internal(waveform, segments=segments)
print([(round(seg.start), round(seg.end)) for seg in outputs])

Output: [(0, 8), (8, 16), (18, 34), (35, 52), (73, 90), (91, 103), (103, 119), (120, 132)]
Example - output segments may be shorter than the corresponding input segments:
dataset = get_dataset('multivariant-v1-200')
segmenter = PyannoteSegmenter()
model = FasterWhisperLongformWrapper(segmenter='shortform', allow_merging_segments=False)
waveform = cast(FLOATS, dataset[1]['audio']['array'])
segments = segmenter(waveform)
print([(round(seg.start_time), round(seg.end_time)) for seg in segments])
outputs = model.transcribe_internal(waveform, segments=segments)
print([(round(seg.start), round(seg.end)) for seg in outputs])

Output: [(1, 23), (23, 33), (35, 57), (57, 69), (76, 77), (78, 100), (100, 122), (122, 126), (128, 133)]
Output: [(1, 23), (23, 32), (35, 57), (66, 69), (76, 77), (90, 91), (100, 122), (122, 126), (128, 133)]
NOTE: For some reason, it subtracts 0.5 sec from the original segments.
NOTE: If batch_size=1 in transcribe_internal(), and segmenter != 'internal', it will call faster_whisper.WhisperModel (instead of faster_whisper.BatchedInferencePipeline) for each input segment, and then will postprocess the outputs to shift all the output timestamps by the input segment's start time.
NOTE: If it says "Unable to load any of {libcudnn_ops.so.9.1.0, …}" - then run

pip install -U nvidia-cuda-runtime-cu12 nvidia-cudnn-cu12
sudo find / -name "libcudnn_ops.so*" 2>/dev/null
And add the directory containing this file to LD_LIBRARY_PATH, for example:
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$PWD/venv/lib/python3.12/site-packages/nvidia/cudnn/lib
- timed_transcribe(waveform, segments=None)[source]¶
Transcribes a float32 waveform, typically normalized from -1 to 1, into a list of texts with timings. Typically the texts are to be concatenated via space, so leading or trailing spaces in each chunk are not required.
- Return type:
list[TimedText]
- Parameters:
waveform (ndarray[tuple[int, ...], dtype[floating[Any]]])
segments (list[AudioSegment] | None)
- class asr_eval.models.whisper_wrapper.WhisperLongformWrapper(model_name='openai/whisper-large-v3', preproc_name=None, lang=None, condition_on_prev_tokens=False, temperature=0, dtype='float32')[source]¶
Bases: Transcriber

A wrapper for Whisper.
If the audio is long, it internally performs a longform transcription and passes the previously transcribed words each time.
Since the transcription history is used internally in WhisperForConditionalGeneration.generate, this class does not implement the ContextualTranscriber interface.
- Parameters:
model_name (str)
preproc_name (str | None)
lang (Literal['russian', 'english'] | None)
condition_on_prev_tokens (bool)
temperature (float)
dtype (str)
- class asr_eval.models.yandex_speechkit_wrapper.YandexSpeechKitWrapper(api_key, model='general', language='ru-RU', audio_processing='Full', normalize=False)[source]¶
Bases: TimedTranscriber

A wrapper for the Yandex SpeechKit transcriber.
Docs: https://yandex.cloud/ru/docs/speechkit/stt/models
To obtain an API key, create a service account and an API key, as described here: https://yandex.cloud/ru/docs/speechkit/quickstart/stt-quickstart-v2
SpeechKit provides timings for each word, plus raw and normalized text. It seems to normalize text for language='ru-RU' but not for language='auto'.
Example raw:
[седьмого [0.399, 1.060], восьмого [1.120, 1.780], мая [1.860, 2.399], в [2.520, 2.580], пуэрто [2.639, 3.340], рико [3.419, 3.899], прошел [4.110, 4.680], шестнадцатый [4.839, 5.839], этап [5.890, 6.299], формулы [6.470, 7.170], один [7.259, 7.740], с [7.859, 7.890], фондом [8.040, 8.780], сто [8.950, 9.320], тысяч [9.429, 9.690], долларов [9.900, 10.700], победителем [11.559, 12.346], стал [12.420, 12.733],
Example normalized:
7 8 Мая в Пуэрто Рико прошел 16 этап Формулы 1 с Фондом 10.00000000000% $-победителем стал
As you can see, normalization introduces some errors, and it is sometimes hard to align raw and normalized text.
If normalize=True and normalized text is returned by the API:
1. transcribe() returns the full normalized text.
2. timed_transcribe() returns a list of normalized utterances if available, otherwise the full text.
Otherwise:
1. transcribe() returns the full unnormalized text.
2. timed_transcribe() returns a list of unnormalized single words.
Authors: Dmitry Ezhov & Oleg Sedukhin
- Parameters:
api_key (str)
model (Literal['general', 'general:rc', 'general:deprecated'])
language (Literal['auto', 'ru-RU', 'en-US'] | str)
audio_processing (Literal['Full', 'Stream'])
normalize (bool)
- timed_transcribe(waveform)[source]¶
Transcribes a float32 waveform, typically normalized from -1 to 1, into a list of texts with timings. Typically the texts are to be concatenated via space, so leading or trailing spaces in each chunk are not required.
- Return type:
list[TimedText]
- Parameters:
waveform (ndarray[tuple[int, ...], dtype[floating[Any]]])