asr_eval.linguistics
Linguistic utilities.
- asr_eval.linguistics.word_freq(word, lang='ru')
Get the frequency of a word in the specified language, according to wordfreq.zipf_frequency. Note that wordfreq does not lemmatize words before calculating the frequency.
If the word argument contains several words, their frequencies are combined using the formula 1 / f = 1 / f1 + 1 / f2 + … (the default behaviour in wordfreq).
Examples for ‘ru’:
word_freq('трофонопсис') == 0
word_freq('трубочник') == 1.06
word_freq('трещотка') == 2.05
word_freq('барсук') == 3.01
word_freq('железный') == 4.02
word_freq('девушка') == 5.08
word_freq('до') == 6.38
See the list of available languages in wordfreq.available_languages(wordlist='large').
- Return type:
float
- Parameters:
word (str)
lang (str)
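The combination formula above can be sketched in plain Python. This is an illustration only, not the wordfreq implementation: the helper names `combine_frequencies` and `to_zipf` are hypothetical, and the sketch assumes that the formula is applied to raw frequencies (proportions of all words) and that the Zipf value is log10 of the frequency per billion words, rounded to two decimals.

```python
import math


def combine_frequencies(freqs):
    """Combine per-token frequencies via 1 / f = 1 / f1 + 1 / f2 + ..."""
    if not freqs or any(f == 0 for f in freqs):
        # Assumption: a token with zero frequency makes the whole phrase zero.
        return 0.0
    return 1.0 / sum(1.0 / f for f in freqs)


def to_zipf(freq):
    """Convert a proportion to the Zipf scale (log10 of occurrences per billion words)."""
    if freq <= 0:
        return 0.0
    return round(math.log10(freq) + 9, 2)
```

Under these assumptions, two tokens that each have frequency 1e-6 (Zipf 3.0) combine to 5e-7, so `to_zipf(combine_frequencies([1e-6, 1e-6]))` gives 2.7: a multi-word input always scores lower than its rarest word.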
- asr_eval.linguistics.lemmatize_ru(word)
Lemmatizes a Russian word using Mystem. We prefer it over pymorphy2 because it seems to make fewer errors.
Leaves non-Russian words unchanged.
TODO: maybe Mystem would lemmatize better if the whole sentence is passed?
- Raises:
ValueError – If Mystem finds zero or more than one word in the word argument.
- Return type:
str
- Parameters:
word (str)
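The one-word contract described above can be sketched as a thin wrapper around any lemmatizer. The names `lemmatize_one`, `toy_analyze` and `TOY_LEMMAS` are hypothetical, and the toy analyzer only stands in for the Mystem call; how asr_eval actually invokes Mystem is not shown here.

```python
from typing import Callable, List


def lemmatize_one(word: str, analyze: Callable[[str], List[str]]) -> str:
    """Return the single lemma for `word`, or raise ValueError.

    `analyze` is a hypothetical stand-in for the Mystem call: it takes a
    string and returns one lemma per word it detects.
    """
    lemmas = analyze(word)
    if len(lemmas) != 1:
        raise ValueError(
            f"expected exactly one word in {word!r}, found {len(lemmas)}"
        )
    return lemmas[0]


# Toy analyzer used only for illustration; a real one would call Mystem.
TOY_LEMMAS = {"барсуки": "барсук", "девушки": "девушка"}


def toy_analyze(text: str) -> List[str]:
    # Unknown (for example non-Russian) words are returned unchanged.
    return [TOY_LEMMAS.get(token, token) for token in text.split()]
```

With the toy analyzer, a single known word returns its lemma, an unknown word comes back unchanged, and an empty string or a two-word string raises ValueError.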
- asr_eval.linguistics.try_inflect_ru(word, original_word)
Tries to inflect a lemmatized Russian word using pymorphy2 so that it takes the same form as original_word. Useful for restoring a word form after correcting a misspelled word. Also returns a status: ‘ok’, ‘ok_manually’, ‘fail’ (see the code for details).
- Return type:
tuple[str, Literal['ok', 'ok_manually', 'fail']]
- Parameters:
word (str)
original_word (str)
Examples
>>> try_inflect_ru('мемас', 'мэмасы')
('мемасы', 'ok')
>>> try_inflect_ru('антиген', 'онтегенам')
('антигенам', 'ok')
Author: Yana Fitkovskaja; Updated by: Oleg Sedukhin
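A caller will usually want to branch on the returned status. The sketch below shows one possible policy: keep the lemma when inflection fails. `restore_form` and `stub_inflect` are hypothetical names; the stub only mimics the documented return shape, and what the real function returns as the word on ‘fail’ is an assumption here.

```python
from typing import Callable, Literal, Tuple

Status = Literal["ok", "ok_manually", "fail"]


def restore_form(
    lemma: str,
    original_word: str,
    inflect: Callable[[str, str], Tuple[str, Status]],
) -> str:
    """Inflect `lemma` like `original_word`; fall back to the lemma on 'fail'."""
    inflected, status = inflect(lemma, original_word)
    return lemma if status == "fail" else inflected


def stub_inflect(word: str, original_word: str) -> Tuple[str, Status]:
    # Stub with the same signature as try_inflect_ru, for illustration only.
    known = {("антиген", "онтегенам"): "антигенам"}
    if (word, original_word) in known:
        return known[(word, original_word)], "ok"
    # Assumption: on failure the word is returned unchanged.
    return word, "fail"
```

In real use, try_inflect_ru itself would be passed as the `inflect` argument.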
- asr_eval.linguistics.split_text_into_sentences(text, language='russian', max_symbols=None, merge_smaller_than=None)
Split the text into sentences using nltk.
If a sentence has more than max_symbols symbols, it is split further by space symbols so that each part has no more than max_symbols symbols. If a single word has more than max_symbols symbols, it is kept as is (no truncation or dividing the word into parts).
If merge_smaller_than is specified, tries to merge sentences shorter than the specified value, without exceeding max_symbols.
- Return type:
list[str]
- Parameters:
text (str)
language (Literal['russian', 'english'])
max_symbols (int | None)
merge_smaller_than (int | None)
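The merging step can be sketched as a greedy pass over sentences that have already been split. This is one possible reading of the description, not the library's code: `merge_short_sentences` is a hypothetical name, and the sketch assumes that a sentence is joined to the previous chunk with a single space when either of the two is shorter than `merge_smaller_than` and the result still fits within `max_symbols`.

```python
def merge_short_sentences(sentences, merge_smaller_than, max_symbols=None):
    """Greedily merge neighbouring sentences (assumed strategy)."""
    merged = []
    for sentence in sentences:
        if merged:
            candidate = merged[-1] + " " + sentence
            # Merge only if one of the two neighbours is "short".
            is_short = (
                len(merged[-1]) < merge_smaller_than
                or len(sentence) < merge_smaller_than
            )
            # Never exceed max_symbols when a limit is given.
            fits = max_symbols is None or len(candidate) <= max_symbols
            if is_short and fits:
                merged[-1] = candidate
                continue
        merged.append(sentence)
    return merged
```

For example, with `merge_smaller_than=5` and `max_symbols=20`, the sentences "Hi." and "How are you?" are merged into "Hi. How are you?", while a following long sentence stays separate. With `max_symbols=10` nothing is merged, because the joined text would exceed the limit.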
- asr_eval.linguistics.split_text_by_space(text, max_symbols)
Split text into parts by space symbols so that each part has no more than max_symbols symbols. If a single word has more than max_symbols symbols, it is kept as is (no truncation or dividing the word into parts).
- Return type:
list[str]
- Parameters:
text (str)
max_symbols (int)
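The behaviour described above matches a simple greedy packing of words, sketched below. `split_by_space` is a hypothetical name, and the sketch assumes that runs of whitespace are collapsed to single spaces; the actual function may treat whitespace differently.

```python
def split_by_space(text, max_symbols):
    """Greedily pack words into parts of at most max_symbols symbols."""
    parts = []
    current = ""
    for word in text.split():
        if not current:
            # First word of a part is always accepted, even if too long.
            current = word
        elif len(current) + 1 + len(word) <= max_symbols:
            current += " " + word
        else:
            parts.append(current)
            current = word
    if current:
        parts.append(current)
    return parts
```

For instance, `split_by_space("one two three four", 7)` returns `["one two", "three", "four"]`, and a word longer than the limit ends up as a part of its own rather than being cut.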