Python modules for creating custom dictionaries and Anki decks for learning foreign languages

Objectives:

Given a list of words or a text in a specific language (the 'Source' language), prepare materials for memorizing the meanings of those words in the 'Target' language, including usage examples and media files for those examples. The final output is a set of Anki decks.

What technologies are being used?

Python morphological analyzer and lemmatizer for Turkish, used for lemmatization and frequency analysis: zeyrek
OpenAI API for translation and for generating usage examples: OpenAI
LangChain for formatting input prompts and parsing LLM output: https://www.langchain.com/
Microsoft Azure Text-to-Speech API for producing audio: MS Azure Text-to-Speech
genanki, a library for generating Anki decks: genanki
Anki applications (mobile, desktop, and web): Anki

Data sources:

Wiktionary:Frequency lists/40K Turkish Subtitles: a ready-made frequency list, but not complete enough; for example, I could not find the words 'nar' and 'cami' in it.
Kaggle Turkish Wikipedia Dataset: a huge Parquet file of Wikipedia articles, which I used as the corpus for building my own frequency list. The frequency and frequency rank shown on the Anki cards are based on this text, so they can differ significantly from everyday spoken language.
Yeni İstanbul Uluslararası Öğrenciler İçin Türkçe A1: a Turkish A1 course book, used to create the input list of words to study.
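
As a rough illustration of how a frequency list with ranks could be derived from corpus text, here is a minimal sketch. It is not the repository's create_frequency_list; the names build_frequency_list, turkish_lower, and WORD_RE are hypothetical, and the tokenizer (runs of Unicode letters) is an assumption. Turkish needs its own lowercasing step because Python's str.lower() maps "I" to "i" rather than "ı".

```python
import re
from collections import Counter

# Assumed tokenizer: a word is a run of Unicode letters (keeps ı, ş, ğ, ö, ü, ç).
WORD_RE = re.compile(r"[^\W\d_]+")


def turkish_lower(text):
    """Lowercase with Turkish rules: I -> ı and İ -> i, then the default mapping."""
    return text.replace("I", "ı").replace("İ", "i").lower()


def build_frequency_list(texts):
    """Return (word, count, rank) tuples, most frequent first.

    Ties are broken alphabetically so that the ranking is deterministic.
    """
    counts = Counter()
    for text in texts:
        counts.update(WORD_RE.findall(turkish_lower(text)))
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(word, count, rank) for rank, (word, count) in enumerate(ranked, start=1)]


if __name__ == "__main__":
    for row in build_frequency_list(["Nar ve cami", "nar güzel"]):
        print(row)
```

Because the counts come from encyclopedia text, a word's rank here says how common it is in written Wikipedia Turkish, not in conversation.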

Currently the sequence of pipeline steps looks like this:

create_frequency_list(cfg.INPUT_CORPUS_FILE, cfg.FREQ_LST_FILE_PATH): reads the corpus texts and creates the frequency list
lemmatize_frequency_list_io(cfg.FREQ_LST_FILE_PATH, cfg.FREQ_LST_LM_FILE_PATH)(): lemmatizes the words from the frequency list
group_by_lemma_io(ifp=cfg.FREQ_LST_LM_FILE_PATH, ofp=cfg.FREQ_LST_GR_FILE_PATH)(): groups the words by lemma (the main grammatical form)
attach_frequencies_io(cfg.INPUT_WORDS_LIST_FILE, cfg.FREQ_LST_GR_FILE_PATH, cfg.WORDS_AND_FREQ_LIST_FILE)(): joins the frequency list to the input list of words
request_and_parse_by_chunks_io(inp=cfg.WORDS_AND_FREQ_LIST_FILE, outp = )(): calls OpenAI to translate the list of words and to prepare usage examples
generate_audio_batch_from_file(cfg.OUTPUT_FILE_NAME, cfg.DIR_AUDIO_FILES): calls the Text-to-Speech API to produce .mp3 files for the usage examples from the previous step
create-anki-deck.generate_deck(): creates the Anki deck for studying the translations and usage examples. Note: for the deck to include multimedia, the media files must be in the same directory from which the main Python file is launched.
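
The grouping, joining, and chunking steps above can be sketched with plain Python data structures. This is only an illustration under assumptions: the function bodies are not taken from the repository, TOY_LEMMAS is a hypothetical stand-in for a real lemmatizer such as zeyrek (it does not reproduce zeyrek's API), and the chunk size is arbitrary.

```python
from collections import defaultdict


def group_by_lemma(word_counts, lemmatize):
    """Sum the counts of all surface forms that share a lemma."""
    totals = defaultdict(int)
    for word, count in word_counts.items():
        totals[lemmatize(word)] += count
    return dict(totals)


def attach_frequencies(words, lemma_counts):
    """Join study words to lemma counts.

    Words missing from the corpus get count 0 and rank None.
    """
    ranked = sorted(lemma_counts, key=lambda lemma: (-lemma_counts[lemma], lemma))
    rank_of = {lemma: i for i, lemma in enumerate(ranked, start=1)}
    return [(w, lemma_counts.get(w, 0), rank_of.get(w)) for w in words]


def chunks(items, size):
    """Yield consecutive slices so that each LLM request covers only a few words."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Hypothetical toy lemmatizer; a real pipeline would call a morphological analyzer.
TOY_LEMMAS = {"evler": "ev", "evde": "ev", "camiler": "cami"}


def toy_lemmatize(word):
    return TOY_LEMMAS.get(word, word)


if __name__ == "__main__":
    counts = {"ev": 3, "evler": 2, "evde": 1, "camiler": 4, "nar": 1}
    by_lemma = group_by_lemma(counts, toy_lemmatize)
    rows = attach_frequencies(["cami", "nar", "kitap"], by_lemma)
    for batch in chunks(rows, 2):
        print(batch)
```

Grouping before joining matters for an agglutinative language like Turkish: the dictionary form of a word may be rare in the corpus while its inflected forms are common.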

The root executor for the sequence above is the launcher module.
The persistence_guy module contains decorators that wrap functions with file input and output. The pipelines module chains these decorators together with the main functions.
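
The calls in the sequence above have the shape step_io(input_path, output_path)(), which suggests a decorator that binds file paths first and runs later. The following is a guess at that pattern, not the actual persistence_guy code: with_csv_io and uppercase_first_column are hypothetical names, and CSV is an assumed file format.

```python
import csv
import functools


def with_csv_io(func):
    """Turn a rows -> rows function into one that reads and writes CSV files.

    The wrapped function is called in two stages: first with the file paths,
    then with no arguments to actually run the step.
    """
    @functools.wraps(func)
    def bind(ifp, ofp):
        def run():
            with open(ifp, newline="", encoding="utf-8") as f:
                rows = list(csv.reader(f))
            result = func(rows)
            with open(ofp, "w", newline="", encoding="utf-8") as f:
                csv.writer(f).writerows(result)
            return ofp
        return run
    return bind


@with_csv_io
def uppercase_first_column(rows):
    """Toy transformation that stands in for a real pipeline step."""
    return [[row[0].upper(), *row[1:]] for row in rows]
```

With this shape, uppercase_first_column("in.csv", "out.csv") only builds the step, and the trailing () executes it. Keeping the core function free of file handling also makes it easy to test on in-memory rows.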

Resulting Anki decks

Anki decks contain:

words in a source language (Turkish in my case),
their translations into the target language (English),
usage examples for these words in both languages (the root of the word is underlined in the source language, and the whole word in the target language),
audio for the examples in the source language,
frequency metrics for these words, computed from a corpus of texts.
You can easily configure the code to produce similar decks for any pair of languages. The decks can be used in the Anki desktop, mobile, or web applications.
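
To make the card contents concrete, here is a small sketch of how the fields of one note might be assembled. It is an assumption about layout, not the repository's generate_deck: underline and build_fields are hypothetical helpers and the field order is invented. The [sound:...] tag is the standard Anki syntax for playing a media file, and a list of strings like this is what genanki accepts as the fields of a note.

```python
import html


def underline(sentence, fragment):
    """Escape the sentence for HTML and wrap the first occurrence of fragment in <u> tags."""
    head, found, tail = sentence.partition(fragment)
    if not found:
        return html.escape(sentence)
    return f"{html.escape(head)}<u>{html.escape(found)}</u>{html.escape(tail)}"


def build_fields(word, translation, example_src, root,
                 example_tgt, target_word, audio_file, rank):
    """Return the note fields as strings, in an assumed (hypothetical) order."""
    return [
        word,
        translation,
        underline(example_src, root),         # root underlined in the source sentence
        underline(example_tgt, target_word),  # whole word underlined in the translation
        f"[sound:{audio_file}]",              # Anki plays this file when the card is shown
        str(rank),
    ]


if __name__ == "__main__":
    print(build_fields(
        "kitap", "book",
        "Evde kitap okuyorum.", "kitap",
        "I am reading a book at home.", "book",
        "kitap_1.mp3", 120,
    ))
```

Only the file name goes into the sound tag; the audio file itself has to be packaged with the deck as a media file.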
