asr_eval.correction¶
An experimental collection of transcription postprocessing tools.
- class asr_eval.correction.interfaces.TranscriptionCorrector[source]¶
Bases: ABC
An abstract postprocessor capable of correcting ASR transcriptions.
- asr_eval.correction.bow_corpus.prepare_domain_specific_bag_of_words_corpus(corpus, pattern='\\\\w+', lemmatize='add', wordfreq_threshold=2, wordfreq_lang='ru', pbar=False)[source]¶
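Extracts words from a domain-specific corpus or dictionary.
Depending on lemmatize, the lemmatized form of each word is added alongside it ('add'), substituted for it ('replace'), or not used ('no'). Words that are too frequent according to wordfreq_threshold are filtered out.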
- Return type:
set[str]
- Parameters:
corpus (str)
pattern (str)
lemmatize (Literal['add', 'replace', 'no'])
wordfreq_threshold (float | None)
wordfreq_lang (str)
pbar (bool)
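As a rough illustration of the extraction and filtering steps described above, the following sketch tokenizes a corpus with a regular expression and drops words that a frequency table marks as common. It is a simplified stand-in, not the library's implementation: TOY_FREQ, toy_frequency and toy_bag_of_words are hypothetical names, the frequency values are made up in place of a real word-frequency resource, the exact comparison against the threshold is an assumption, and lemmatization is omitted.

```python
from __future__ import annotations

import re

# Hypothetical toy frequency table (higher = more common); values are made up.
TOY_FREQ = {"the": 7.5, "patient": 4.8, "has": 6.5, "and": 7.3}


def toy_frequency(word: str) -> float:
    """Return the toy frequency of a word, or 0.0 if the word is unknown."""
    return TOY_FREQ.get(word, 0.0)


def toy_bag_of_words(
    corpus: str,
    pattern: str = r"\w+",
    threshold: float | None = 2,
) -> set[str]:
    """Extract lowercased words and keep only those that are not too frequent."""
    words = {w.lower() for w in re.findall(pattern, corpus)}
    if threshold is None:
        # No frequency filtering requested.
        return words
    # Assumption: a word is "too frequent" when its frequency exceeds the threshold.
    return {w for w in words if toy_frequency(w) <= threshold}


corpus = "The patient has tachycardia and bradycardia"
rare_words = toy_bag_of_words(corpus)
print(sorted(rare_words))
```

With this toy table only the two domain terms survive, because every other word has a frequency above the threshold.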
- class asr_eval.correction.comparator_wordfreq.RuWordFreqComparator(base_model, additional_model)[source]¶
Bases: Transcriber
A composite transcriber that obtains predictions from two models (the first of which is generally better) and replaces a word predicted by the first model with the word predicted by the second model when the second model's word is more frequent in the language.
Currently works for the Russian language.
- Parameters:
base_model (Transcriber)
additional_model (Transcriber)
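The replacement rule can be sketched as follows. This is a minimal illustration under strong assumptions, not the class's actual logic: it assumes both transcriptions contain the same number of words and compares them position by position, whereas a real implementation would presumably need to align the two outputs first. TOY_FREQ and pick_more_frequent are hypothetical names, and the frequency values are made-up English toy data.

```python
# Hypothetical toy frequency table; values are made up for illustration.
TOY_FREQ = {"hello": 5.0, "helo": 1.0, "world": 5.5}


def pick_more_frequent(base_words, additional_words):
    """Keep each base word unless the word at the same position is more frequent."""
    if len(base_words) != len(additional_words):
        # Without an alignment step the sketch cannot compare words; keep the base output.
        return list(base_words)
    merged = []
    for base, alt in zip(base_words, additional_words):
        if TOY_FREQ.get(alt, 0.0) > TOY_FREQ.get(base, 0.0):
            merged.append(alt)
        else:
            merged.append(base)
    return merged


merged = pick_more_frequent(["helo", "world"], ["hello", "wold"])
print(" ".join(merged))
```

Here the misspelled base word is replaced by its more frequent counterpart, while the base model's second word is kept because the alternative is unknown to the table.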
- class asr_eval.correction.corrector_langchain.CorrectorLangchain(api_key, base_url='https://api.vsegpt.ru/v1', model_name='openai/gpt-4o', temperature=0.3, verbose=False, domain_specific_texts=None, use_web_search=False)[source]¶
Bases: TranscriptionCorrector
An agent that corrects a transcription, optionally with DuckDuckGo search.
Currently works for the Russian language.
Currently requires the langchain_openai and duckduckgo_search packages.
Author: Timur Rafikov; Updated by: Oleg Sedukhin
- Parameters:
api_key (str)
base_url (str)
model_name (str)
temperature (float)
verbose (bool)
domain_specific_texts (str | None)
use_web_search (bool)
- class asr_eval.correction.corrector_levenshtein.WordCorrection(start, end, correction)[source]¶
A suggestion to replace text[start:end] with correction.
- Parameters:
start (int)
end (int)
correction (str)
- asr_eval.correction.corrector_levenshtein.apply_corrections(text, corrections)[source]¶
Applies a list of non-overlapping WordCorrection objects to the text.
- Return type:
str
- Parameters:
text (str)
corrections (list[WordCorrection])
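One way to apply span-based corrections is sketched below. It is an assumption about how such a function could work, not the library's code: ToyWordCorrection and toy_apply_corrections are hypothetical stand-ins that mirror the documented fields (start, end, correction). Processing the corrections from right to left keeps the remaining offsets valid even when a replacement changes the text length.

```python
from dataclasses import dataclass


@dataclass
class ToyWordCorrection:
    """Hypothetical stand-in: replace text[start:end] with correction."""
    start: int
    end: int
    correction: str


def toy_apply_corrections(text, corrections):
    """Apply non-overlapping corrections, starting from the rightmost span."""
    for c in sorted(corrections, key=lambda c: c.start, reverse=True):
        text = text[:c.start] + c.correction + text[c.end:]
    return text


text = "the pacient has tahicardia"
corrections = [
    ToyWordCorrection(4, 11, "patient"),        # spans "pacient"
    ToyWordCorrection(16, 26, "tachycardia"),   # spans "tahicardia"
]
fixed = toy_apply_corrections(text, corrections)
print(fixed)
```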
- class asr_eval.correction.corrector_levenshtein.CorrectorLevenshtein(domain_specific_bag_of_words, freq_threshold=1, distance_thresholds=<factory>)[source]¶
Bases: TranscriptionCorrector
Finds rare words in the transcription, searches for similar words in the domain_specific_bag_of_words corpus, replaces each rare word if a match is found, and inflects the replacement accordingly.
Currently works for the Russian language.
Author: Yana Fitkovskaja; Updated by: Oleg Sedukhin
- Parameters:
domain_specific_bag_of_words (list[str])
freq_threshold (float)
distance_thresholds (list[float])
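The similarity search at the core of this approach can be sketched with a plain Levenshtein (edit) distance. The snippet is illustrative only: levenshtein and closest_domain_word are hypothetical helpers, a single max_distance value stands in for the documented distance_thresholds list, and both the rare-word detection and the inflection step are omitted.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def closest_domain_word(word, domain_words, max_distance=2):
    """Return the nearest domain word within max_distance edits, or None."""
    best, best_dist = None, max_distance + 1
    for candidate in sorted(domain_words):  # sorted for deterministic tie-breaking
        dist = levenshtein(word, candidate)
        if dist < best_dist:
            best, best_dist = candidate, dist
    return best


domain = {"tachycardia", "bradycardia"}
suggestion = closest_domain_word("tahicardia", domain)
print(suggestion)
```

A word with no sufficiently close domain term yields None, in which case the transcription would be left unchanged.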
- class asr_eval.correction.corrector_wikirag.WikiArticle(title, text, url)[source]¶
A Wikipedia page for RAG purposes.
- Parameters:
title (str)
text (str)
url (str)
- class asr_eval.correction.corrector_wikirag.WikiRAGSuggestions(original_text, detected_topic, query_terms, suggested_terms, term_scores)[source]¶
A list of suggestions returned by WikipediaTermRetriever.
Work in progress.
- Parameters:
original_text (str)
detected_topic (str)
query_terms (list[str])
suggested_terms (list[str])
term_scores (list[float])
- class asr_eval.correction.corrector_wikirag.WikipediaTermRetriever(lang='ru', candidate_topics=TOPICS, score_threshold=0.7, verbose=False)[source]¶
Bases: TranscriptionCorrector
A term retriever capable of correcting transcriptions.
Work in progress.
Author: Timur Rafikov; Updated by: Oleg Sedukhin
- Parameters:
lang (str)
candidate_topics (list[str])
score_threshold (float)
verbose (bool)
- detect_topic(text)[source]¶
Detects the topic using zero-shot classification.
- Return type:
str
- Parameters:
text (str)
- get_category_articles(category_name, max_articles=500)[source]¶
Recursively loads the articles of a category.
- Return type:
list[WikiArticle]
- Parameters:
category_name (str)
max_articles (int)
- text_to_terms(text)[source]¶
Tokenizes and cleans the text.
- Return type:
list[str]
- Parameters:
text (str)
- build_term_index(articles)[source]¶
Builds a semantic index of terms.
- Return type:
dict[str, ndarray[tuple[int, ...], dtype[floating[Any]]]]
- Parameters:
articles (list[WikiArticle])
- find_similar_terms(query_terms, term_index, top_k=5, similarity_threshold=0.7)[source]¶
Finds semantically similar terms, taking possible errors into account.
- Return type:
dict[str, list[tuple[str, float]]]
- Parameters:
query_terms (list[str])
term_index (dict[str, ndarray[tuple[int, ...], dtype[floating[Any]]]])
top_k (int)
similarity_threshold (float)
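A term lookup of this kind can be sketched with cosine similarity over embedding vectors. The following is a dependency-free illustration under stated assumptions, not the method's implementation: cosine and toy_find_similar_terms are hypothetical names, vectors are plain lists of floats instead of NumPy arrays, the two-dimensional embeddings are made up, and the query terms are assumed to be embedded already (the documented method takes query_terms as strings).

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def toy_find_similar_terms(query_vectors, term_index, top_k=5, similarity_threshold=0.7):
    """For each query, return up to top_k (term, score) pairs above the threshold."""
    results = {}
    for query, qvec in query_vectors.items():
        scored = [(term, cosine(qvec, tvec)) for term, tvec in term_index.items()]
        scored = [(term, score) for term, score in scored if score >= similarity_threshold]
        scored.sort(key=lambda pair: pair[1], reverse=True)  # best match first
        results[query] = scored[:top_k]
    return results


# Made-up 2-D embeddings, for illustration only.
term_index = {
    "tachycardia": [1.0, 0.0],
    "bradycardia": [0.8, 0.6],
    "keyboard": [0.0, 1.0],
}
query_vectors = {"tahicardia": [0.9, 0.1]}
similar = toy_find_similar_terms(query_vectors, term_index, top_k=2)
print([term for term, _ in similar["tahicardia"]])
```

The result has the same shape as the documented return type: a dictionary mapping each query term to a list of (term, score) pairs, with terms below the similarity threshold dropped.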