asr_eval.correction

An experimental collection of transcription postprocessing tools.

class asr_eval.correction.interfaces.TranscriptionCorrector[source]

Bases: ABC

An abstract postprocessor capable of correcting ASR transcriptions.

asr_eval.correction.bow_corpus.prepare_domain_specific_bag_of_words_corpus(corpus, pattern='\\\\w+', lemmatize='add', wordfreq_threshold=2, wordfreq_lang='ru', pbar=False)[source]

Extracts words from a domain-specific corpus or dictionary.

For each word, adds its lemmatized form or replaces the word with it, depending on lemmatize. Filters out words that are too frequent, based on wordfreq_threshold.

Return type:

set[str]

Parameters:
  • corpus (str)

  • pattern (str)

  • lemmatize (Literal['add', 'replace', 'no'])

  • wordfreq_threshold (float | None)

  • wordfreq_lang (str)

  • pbar (bool)
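
The extraction, lemmatization and frequency filtering described above can be sketched roughly as follows. This is a minimal illustration, not the library's implementation: TOY_FREQ, toy_lemmatize and build_bag_of_words are hypothetical names, the frequency scale is arbitrary, and the rule "drop words whose frequency exceeds the threshold" is an assumption about how wordfreq_threshold is applied.

```python
import re

# Hypothetical frequency table (assumption) standing in for a real
# word-frequency source; larger values mean more common words.
TOY_FREQ = {"the": 7.0, "and": 7.0, "patient": 4.5, "received": 4.8}


def toy_lemmatize(word):
    """Placeholder lemmatizer (assumption): strips a trailing 's'."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


def build_bag_of_words(corpus, pattern=r"\w+", lemmatize="add", freq_threshold=2.0):
    words = set()
    for token in re.findall(pattern, corpus):
        lemma = toy_lemmatize(token)
        if lemmatize == "add":
            words.update({token, lemma})  # keep both the surface form and the lemma
        elif lemmatize == "replace":
            words.add(lemma)              # keep only the lemma
        else:                             # "no": keep the surface form only
            words.add(token)
    if freq_threshold is None:
        return words
    # Assumed rule: words more frequent than the threshold are too common to keep.
    return {w for w in words if TOY_FREQ.get(w, 0.0) <= freq_threshold}


bag = build_bag_of_words("the patient received stents and anticoagulants")
```

With lemmatize="add" the result keeps both "stents" and "stent", while lemmatize="replace" keeps only "stent".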

class asr_eval.correction.comparator_wordfreq.RuWordFreqComparator(base_model, additional_model)[source]

Bases: Transcriber

A composite transcriber that obtains predictions from two models, where the first is generally the better one. A word predicted by the first model is replaced with the second model's word whenever the second model's word is more frequent in the language.

Currently works for Russian.

Parameters:
  • base_model

  • additional_model

transcribe(waveform)[source]

Transcribes a float32 waveform, typically normalized from -1 to 1.

Return type:

str

Parameters:

waveform (ndarray[tuple[int, ...], dtype[floating[Any]]])
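
The word-replacement idea can be sketched on two already-transcribed strings. This is only an illustration under strong assumptions: TOY_FREQ and merge_by_frequency are hypothetical, words are aligned purely by position, and the sketch falls back to the base text when the word counts differ. How the actual class aligns words and looks up frequencies is not stated here.

```python
# Hypothetical frequency table (assumption); a real system would query a
# word-frequency resource for the target language.
TOY_FREQ = {"привет": 5.0, "мир": 5.5}


def merge_by_frequency(base_text, additional_text):
    base_words = base_text.split()
    extra_words = additional_text.split()
    # Simplifying assumption: positional alignment only makes sense when
    # both transcriptions contain the same number of words.
    if len(base_words) != len(extra_words):
        return base_text
    merged = []
    for base_word, extra_word in zip(base_words, extra_words):
        base_freq = TOY_FREQ.get(base_word, 0.0)
        extra_freq = TOY_FREQ.get(extra_word, 0.0)
        # Prefer the second model's word only if it is strictly more frequent.
        merged.append(extra_word if extra_freq > base_freq else base_word)
    return " ".join(merged)


merged_text = merge_by_frequency("превет мир", "привет мир")
```

Here the unknown word "превет" from the base model is replaced by the more frequent "привет" from the additional model.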

class asr_eval.correction.corrector_langchain.CorrectorLangchain(api_key, base_url='https://api.vsegpt.ru/v1', model_name='openai/gpt-4o', temperature=0.3, verbose=False, domain_specific_texts=None, use_web_search=False)[source]

Bases: TranscriptionCorrector

An agent that corrects a transcription, optionally with DuckDuckGo search.

Currently works for Russian.

Currently requires the langchain_openai and duckduckgo_search packages.

Author: Timur Rafikov; Updated by: Oleg Sedukhin

Parameters:
  • api_key (str)

  • base_url (str)

  • model_name (str)

  • temperature (float)

  • verbose (bool)

  • domain_specific_texts (str | None)

  • use_web_search (bool)

class asr_eval.correction.corrector_levenshtein.WordCorrection(start, end, correction)[source]

A suggestion to replace text[start:end] with correction.

Parameters:
  • start (int)

  • end (int)

  • correction (str)

asr_eval.correction.corrector_levenshtein.apply_corrections(text, corrections)[source]

Applies a list of non-overlapping WordCorrection objects to the text.

Return type:

str

Parameters:
  • text

  • corrections

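
One common way to apply non-overlapping span replacements is to work from right to left, so that offsets of earlier spans stay valid after each edit. The sketch below shows that technique; Correction and apply_all are hypothetical stand-ins that mirror the documented fields (start, end, correction), and the library's own implementation may differ.

```python
from dataclasses import dataclass


@dataclass
class Correction:
    """Hypothetical stand-in mirroring the documented WordCorrection fields."""
    start: int
    end: int
    correction: str


def apply_all(text, corrections):
    # Process spans from right to left: replacing a later span never shifts
    # the character offsets of spans that start earlier in the text.
    for c in sorted(corrections, key=lambda c: c.start, reverse=True):
        text = text[:c.start] + c.correction + text[c.end:]
    return text


fixed = apply_all("helo wrld", [Correction(0, 4, "hello"), Correction(5, 9, "world")])
```

Because the spans are assumed not to overlap, the order in which the corrections are passed does not matter.
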
class asr_eval.correction.corrector_levenshtein.CorrectorLevenshtein(domain_specific_bag_of_words, freq_threshold=1, distance_thresholds=<factory>)[source]

Bases: TranscriptionCorrector

Finds rare words in the transcription, searches the domain_specific_bag_of_words corpus for similar words and, if a match is found, replaces the rare word with it, inflected to match the original.

Currently works for Russian.

Author: Yana Fitkovskaja; Updated by: Oleg Sedukhin

Parameters:
  • domain_specific_bag_of_words (list[str])

  • freq_threshold (float)

  • distance_thresholds (list[float])
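
The similarity search can be illustrated with a plain Levenshtein distance. This sketch covers only the "find a close domain word" step; rare-word detection and inflection are omitted. The names levenshtein and closest_domain_word and the single max_distance cutoff are assumptions for illustration, and the meaning of the documented distance_thresholds list is not modeled.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute or match
            ))
        previous = current
    return previous[-1]


def closest_domain_word(word, bag_of_words, max_distance=2):
    """Return the nearest domain word within max_distance, or None."""
    best = min(bag_of_words, key=lambda cand: levenshtein(word, cand), default=None)
    if best is not None and levenshtein(word, best) <= max_distance:
        return best
    return None


suggestion = closest_domain_word("stenr", ["stent", "anticoagulant"])
```

A fixed cutoff is the simplest choice; scaling the allowed distance with word length is a common refinement.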

class asr_eval.correction.corrector_wikirag.WikiArticle(title, text, url)[source]

A Wikipedia page for RAG purposes.

Parameters:
  • title (str)

  • text (str)

  • url (str)

class asr_eval.correction.corrector_wikirag.WikiRAGSuggestions(original_text, detected_topic, query_terms, suggested_terms, term_scores)[source]

A list of suggestions returned by WikipediaTermRetriever.

Work in progress.

Parameters:
  • original_text (str)

  • detected_topic (str)

  • query_terms (list[str])

  • suggested_terms (list[str])

  • term_scores (list[float])

class asr_eval.correction.corrector_wikirag.WikipediaTermRetriever(lang='ru', candidate_topics=TOPICS, score_threshold=0.7, verbose=False)[source]

Bases: TranscriptionCorrector

A term retriever capable of correcting transcriptions.

Work in progress.

Author: Timur Rafikov; Updated by: Oleg Sedukhin

Parameters:
  • lang (str)

  • candidate_topics (list[str])

  • score_threshold (float)

  • verbose (bool)

detect_topic(text)[source]

Detects the topic using zero-shot classification.

Return type:

str

Parameters:

text (str)

get_category_articles(category_name, max_articles=500)[source]

Recursively loads the articles of a category.

Return type:

list[WikiArticle]

Parameters:
  • category_name (str)

  • max_articles (int)

text_to_terms(text)[source]

Tokenizes and cleans the text.

Return type:

list[str]

Parameters:

text (str)

build_term_index(articles)[source]

Builds a semantic index of terms.

Return type:

dict[str, ndarray[tuple[int, ...], dtype[floating[Any]]]]

Parameters:

articles (list[WikiArticle])

find_similar_terms(query_terms, term_index, top_k=5, similarity_threshold=0.7)[source]

Finds semantically similar terms, taking possible errors into account.

Return type:

dict[str, list[tuple[str, float]]]

Parameters:
  • query_terms (list[str])

  • term_index (dict[str, ndarray[tuple[int, ...], dtype[floating[Any]]]])

  • top_k (int)

  • similarity_threshold (float)
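
A threshold-and-top-k lookup over a term index can be sketched with cosine similarity. This is an assumption-laden illustration: the similarity measure used by the class is not stated, plain Python lists stand in for the ndarray vectors, and the sketch takes an already-embedded query vector, whereas the documented method takes query_terms as strings. The names cosine, top_similar and toy_index are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def top_similar(query_vec, term_index, top_k=5, similarity_threshold=0.7):
    scored = [(term, cosine(query_vec, vec)) for term, vec in term_index.items()]
    # Keep only terms at or above the threshold, best matches first.
    kept = [(term, score) for term, score in scored if score >= similarity_threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:top_k]


# Toy two-dimensional "embeddings" (assumption) in place of real model outputs.
toy_index = {"stent": [1.0, 0.0], "stenosis": [0.8, 0.6], "piano": [0.0, 1.0]}
matches = top_similar([1.0, 0.0], toy_index, top_k=2)
```

In this toy index, "piano" is orthogonal to the query and falls below the threshold, so only "stent" and "stenosis" are returned.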

process_query(asr_text, top_terms=10)[source]

Runs the full query-processing cycle.

Parameters:
  • asr_text (str)

  • top_terms (int)