asr_eval.normalizing

Utilities for text normalization.

class asr_eval.normalizing.filler_words.RuFillerRemover[source]

Removes Russian filler words.
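
The idea of filler removal can be illustrated with a minimal sketch. It is not the RuFillerRemover implementation: the filler list and the remove_fillers helper below are hypothetical and exist only for illustration, on the assumption that fillers are matched as whole words.

```python
import re

# Hypothetical filler list for illustration only; the set actually used by
# RuFillerRemover is not documented on this page.
FILLERS = ("ну", "эм", "ээ")

# Match fillers only as whole words, so that "ну" inside "нужно" is kept.
_FILLER_RE = re.compile(
    r"\b(?:" + "|".join(map(re.escape, FILLERS)) + r")\b",
    flags=re.IGNORECASE,
)


def remove_fillers(text: str) -> str:
    """Drop standalone filler words and collapse the leftover whitespace."""
    without_fillers = _FILLER_RE.sub(" ", text)
    return " ".join(without_fillers.split())


print(remove_fillers("ну я эм пошёл домой"))  # я пошёл домой
```

Matching on word boundaries rather than substrings is the important design choice here, because many Russian fillers also occur as prefixes of ordinary words.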

class asr_eval.normalizing.silero.RuSileroNormalizer(model_path=CACHE_DIR / 'silero_normalizer/jit_s2s.pt', device='auto')[source]

Converts numbers into words and performs various other normalization steps for evaluating WER on Russian text.

Compared with the original version, this one handles a rare exception that would otherwise cause an infinite loop. The normalizer is based on a neural network, so caching is recommended.

TODO: release the model required for RuSileroNormalizer.

Example

>>> from asr_eval.normalizing.silero import RuSileroNormalizer
>>> from asr_eval.utils.cacheable import DiskCacheable
>>> normalizer = RuSileroNormalizer()
>>> normalizer = DiskCacheable(normalizer, cache_path='silero_normalizer_cache.db')
>>> print(normalizer('С 12.01.1943 г. площадь сельсовета — 1785,5 га.'))
С двенадцатого января тысяча девятьсот сорок третьего года
площадь сельсовета — тысяча семьсот восемьдесят пять целых
и пять десятых гектара

The code is adapted from https://github.com/snakers4/russian_stt_text_normalization

The model is taken from https://t.me/silero_speech/6056 (TODO: upload the model to HF).
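
To show why caching helps with a slow, model-based normalizer, here is a minimal sketch of a disk-backed memoizing wrapper. It is not how DiskCacheable is implemented: the CachedNormalizer class, its SQLite schema, and the slow_normalizer stand-in are all hypothetical.

```python
import sqlite3
from typing import Callable


class CachedNormalizer:
    """Hypothetical wrapper that memoizes a str -> str callable in SQLite."""

    def __init__(
        self,
        normalizer: Callable[[str], str],
        cache_path: str = ":memory:",  # pass a file path to persist results
    ):
        self._normalizer = normalizer
        self._db = sqlite3.connect(cache_path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS cache "
            "(input TEXT PRIMARY KEY, output TEXT)"
        )

    def __call__(self, text: str) -> str:
        row = self._db.execute(
            "SELECT output FROM cache WHERE input = ?", (text,)
        ).fetchone()
        if row is not None:
            return row[0]  # cache hit: skip the expensive call
        result = self._normalizer(text)
        self._db.execute("INSERT INTO cache VALUES (?, ?)", (text, result))
        self._db.commit()
        return result


calls = []


def slow_normalizer(text: str) -> str:
    """Stand-in for a neural normalizer; records every real invocation."""
    calls.append(text)
    return text.lower()


cached = CachedNormalizer(slow_normalizer)
cached("ABC")
cached("ABC")
print(len(calls))  # 1: the second call was served from the cache
```

In WER evaluation the same reference and hypothesis strings tend to be normalized repeatedly across runs, so keying the cache on the exact input text is usually enough.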

Parameters:
  • model_path (str | Path)

  • device (str | torch.device | Literal['auto'])

class asr_eval.normalizing.translit.TranslitNormalizer(word_mapping=RU_WORD_MAPPING)[source]

Normalizes several transliterated Russian words using predefined rules.

Example

>>> from asr_eval.normalizing.translit import TranslitNormalizer
>>> print(TranslitNormalizer()('В Facebook и твиттере'))
В фейсбук и твиттер

Parameters:
  • word_mapping (dict[str, str])
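
A word-mapping normalizer of this kind can be sketched in a few lines. This is an illustration under assumptions, not the TranslitNormalizer implementation: the WORD_MAPPING contents and the normalize_translit helper are hypothetical, and the sketch does exact whole-word lookup only, so it would not handle inflected forms such as "твиттере" from the example above.

```python
import re

# Hypothetical mapping for illustration; the real RU_WORD_MAPPING contents
# are not shown on this page.
WORD_MAPPING = {"facebook": "фейсбук", "twitter": "твиттер"}


def normalize_translit(text: str, word_mapping: dict[str, str]) -> str:
    """Replace each whole word found in word_mapping, ignoring case."""

    def replace(match: re.Match) -> str:
        word = match.group(0)
        # Words that are not in the mapping are returned unchanged.
        return word_mapping.get(word.lower(), word)

    return re.sub(r"\w+", replace, text)


print(normalize_translit("В Facebook и Twitter", WORD_MAPPING))
# В фейсбук и твиттер
```

Normalizing both spellings to one form before scoring keeps a transcript that writes "Facebook" from being penalized against a reference that writes "фейсбук".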