asr_eval.linguistics

Linguistic utilities.

asr_eval.linguistics.word_freq(word, lang='ru')[source]

Get the word's frequency for the specified language, as computed by wordfreq.zipf_frequency. Note that wordfreq does not lemmatize words before calculating frequency.

If the word argument contains several words, their frequencies are combined using the formula 1 / f = 1 / f1 + 1 / f2 + … (the default behaviour in wordfreq).

Examples for ‘ru’:

word_freq('трофонопсис') == 0
word_freq('трубочник') == 1.06
word_freq('трещотка') == 2.05
word_freq('барсук') == 3.01
word_freq('железный') == 4.02
word_freq('девушка') == 5.08
word_freq('до') == 6.38

See the list of available languages in wordfreq.available_languages(wordlist='large').

Return type:

float

Parameters:
  • word (str)

  • lang (str)
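The combination formula above works on raw frequencies, while Zipf values are on a log scale. A minimal sketch of how the harmonic combination translates to the Zipf scale (combine_zipf is a hypothetical helper, not part of asr_eval; actual wordfreq output for phrases may differ slightly):

```python
import math

def combine_zipf(*zipfs: float) -> float:
    """Combine per-word Zipf frequencies for a multi-word phrase.

    Raw frequencies are combined harmonically, 1/f = 1/f1 + 1/f2 + ...,
    and a Zipf value z corresponds to a raw frequency f = 10 ** (z - 9).
    """
    raw = [10 ** (z - 9) for z in zipfs]
    combined = 1 / sum(1 / f for f in raw)
    return math.log10(combined) + 9

# Two words of equal frequency halve the combined raw frequency,
# which lowers the Zipf value by log10(2) ≈ 0.30.
print(round(combine_zipf(5.0, 5.0), 2))  # 4.7
```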

asr_eval.linguistics.lemmatize_ru(word)[source]

Lemmatizes a Russian word using Mystem. We prefer it over pymorphy2 because it appears to make fewer errors.

Leaves non-Russian words unchanged.

TODO: maybe Mystem would lemmatize better if the whole sentence were passed?

Raises:

ValueError – If Mystem finds zero words or more than one word in the word argument.

Return type:

str

Parameters:

word (str)

asr_eval.linguistics.try_inflect_ru(word, original_word)[source]

Tries to inflect a lemmatized Russian word using pymorphy2 to match the form of original_word.

Useful for restoring a word's form after correcting a misspelled word. Also returns a status: ‘ok’, ‘ok_manually’, or ‘fail’ (see the code for details).

Return type:

tuple[str, Literal['ok', 'ok_manually', 'fail']]

Parameters:
  • word (str)

  • original_word (str)

Examples

>>> try_inflect_ru('мемас', 'мэмасы')
('мемасы', 'ok')
>>> try_inflect_ru('антиген', 'онтегенам')
('антигенам', 'ok')

Author: Yana Fitkovskaja; Updated by: Oleg Sedukhin

asr_eval.linguistics.split_text_into_sentences(text, language='russian', max_symbols=None, merge_smaller_than=None)[source]

Split the text into sentences using nltk.

If a sentence has more than max_symbols symbols, it is split further at spaces so that each part has no more than max_symbols symbols. If a single word has more than max_symbols symbols, it is kept as is (no truncation or splitting of the word into parts).

If merge_smaller_than is specified, sentences shorter than that value are merged with adjacent sentences where possible, without exceeding max_symbols.

Return type:

list[str]

Parameters:
  • text (str)

  • language (Literal['russian', 'english'])

  • max_symbols (int | None)

  • merge_smaller_than (int | None)
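The merging behaviour can be sketched as a greedy pass over the tokenized sentences. This is a hypothetical illustration of the documented contract, not the actual implementation (merge_short_sentences is an assumed helper name):

```python
def merge_short_sentences(
    sentences: list[str],
    merge_smaller_than: int,
    max_symbols: int,
) -> list[str]:
    """Greedily merge a sentence into the previous one when the previous
    one is shorter than merge_smaller_than and the merged result still
    fits within max_symbols."""
    merged: list[str] = []
    for sent in sentences:
        if (
            merged
            and len(merged[-1]) < merge_smaller_than
            and len(merged[-1]) + 1 + len(sent) <= max_symbols
        ):
            merged[-1] = merged[-1] + " " + sent
        else:
            merged.append(sent)
    return merged

print(merge_short_sentences(["Hi.", "Yes.", "A very long sentence here."], 10, 20))
```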

asr_eval.linguistics.split_text_by_space(text, max_symbols)[source]

Split the text into parts at space symbol(s) so that each part has no more than max_symbols symbols. If a single word has more than max_symbols symbols, it is kept as is (no truncation or splitting of the word into parts).

Return type:

list[str]

Parameters:
  • text (str)

  • max_symbols (int)
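The documented behaviour amounts to greedily packing words into parts. A minimal sketch under that reading (split_by_space is a hypothetical stand-in; the actual function may differ in edge cases):

```python
def split_by_space(text: str, max_symbols: int) -> list[str]:
    """Pack whitespace-separated words into parts of at most max_symbols
    characters. A single word longer than max_symbols becomes its own
    part, unchanged."""
    parts: list[str] = []
    current = ""
    for word in text.split():
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= max_symbols:
            current = candidate
        else:
            if current:
                parts.append(current)
            current = word  # may exceed max_symbols if the word itself is too long
    if current:
        parts.append(current)
    return parts

print(split_by_space("a bb ccc dddd", 6))  # ['a bb', 'ccc', 'dddd']
```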