- Given a list of words or a text in a specific language (the 'Source' language), prepare materials for memorizing the meanings of those words in a 'Target' language, including usage examples and media files for those examples. The final output is a set of Anki decks.
- Python morphological analyzer and lemmatizer for Turkish, used for lemmatization and frequency analysis: zeyrek
- Translation and preparation of usage examples: OpenAI
- LangChain, for formatting prompts and output for LLMs: https://www.langchain.com/
- Microsoft Azure Text-to-Speech API: MS Azure Text-to-Speech
- A library for generating Anki decks: genanki
- Anki applications (mobile, desktop, and web): Anki
- Wiktionary:Frequency lists/40K Turkish Subtitles: a ready-made frequency list, but not complete enough. For example, I could not find the words 'nar' and 'cami' in it.
- Kaggle Turkish Wikipedia Dataset: a huge Parquet file of Wikipedia articles, which I used as the corpus to build my own frequency list. The frequency and frequency rank on the Anki cards are based on this text, so they can differ significantly from everyday spoken language.
- Yeni İstanbul Uluslararası Öğrenciler İçin Türkçe A1: Turkish language, level A1; used to create the input list of words to study.
- create_frequency_list(cfg.INPUT_CORPUS_FILE, cfg.FREQ_LST_FILE_PATH): reads the corpus texts and creates the frequency list
- lemmatize_frequency_list_io(cfg.FREQ_LST_FILE_PATH, cfg.FREQ_LST_LM_FILE_PATH)(): lemmatizes the words in the frequency list
- group_by_lemma_io(ifp=cfg.FREQ_LST_LM_FILE_PATH, ofp=cfg.FREQ_LST_GR_FILE_PATH)(): groups the entries by lemma (the main grammatical form)
- attach_frequencies_io(cfg.INPUT_WORDS_LIST_FILE, cfg.FREQ_LST_GR_FILE_PATH, cfg.WORDS_AND_FREQ_LIST_FILE)(): joins the frequency list to the input list of words
- request_and_parse_by_chunks_io(inp=cfg.WORDS_AND_FREQ_LIST_FILE, outp = )(): calls OpenAI to translate the list of words and to prepare usage examples
- generate_audio_batch_from_file(cfg.OUTPUT_FILE_NAME, cfg.DIR_AUDIO_FILES): calls the Text-to-Speech API to produce .mp3 files for the usage examples from the previous step
- create-anki-deck.generate_deck(): creates the Anki deck for studying the translations and usage examples. Note: for the deck to include multimedia, the media files must be in the same directory from which the main Python file is launched.
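The frequency steps (counting, grouping by lemma, joining to the input words) can be sketched in plain Python. This is a minimal illustration, not the project's code: the function names, the in-memory data shapes, and the toy lemma dictionary that stands in for zeyrek are all assumptions.

```python
import re
from collections import Counter


def build_frequency_list(texts):
    """Count lowercase word tokens across an iterable of texts."""
    counts = Counter()
    for text in texts:
        # [^\W\d_]+ matches runs of Unicode letters, so 'ı', 'ü', 'ş' are kept.
        # Note: str.lower() is not locale-aware (Turkish 'I' becomes 'i', not 'ı').
        counts.update(re.findall(r"[^\W\d_]+", text.lower()))
    return counts


def group_by_lemma(counts, lemmatize):
    """Sum the counts of all word forms that share a lemma."""
    grouped = Counter()
    for word, n in counts.items():
        grouped[lemmatize(word)] += n
    return grouped


def attach_frequencies(words, grouped):
    """Return (word, frequency, rank) rows; unseen words get frequency 0 and rank None."""
    ranks = {
        lemma: rank
        for rank, (lemma, _) in enumerate(grouped.most_common(), start=1)
    }
    return [(w, grouped.get(w, 0), ranks.get(w)) for w in words]


# Hypothetical stand-in for a real lemmatizer such as zeyrek.
TOY_LEMMAS = {"narlar": "nar", "camiye": "cami"}

corpus = ["Nar aldım. Narlar tatlı.", "Camiye gittim, cami büyük."]
counts = build_frequency_list(corpus)
grouped = group_by_lemma(counts, lambda w: TOY_LEMMAS.get(w, w))
rows = attach_frequencies(["nar", "cami", "kitap"], grouped)
```

Grouping before the join means that an input word such as 'cami' is credited with the occurrences of its inflected forms ('camiye'), which matters for a heavily inflected language like Turkish.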
The root executor for the sequence above is the launcher module.
The persistence_guy module contains decorators that read input data from files and write output data to files.
The pipelines module chains the decorators from persistence_guy together with the main functions.
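The way a persistence decorator can wrap a main function might look roughly like this. It is a sketch under assumptions: the decorator name `with_file_io`, the JSON file format, and the example function are hypothetical. Only the call shape `some_step_io(ifp, ofp)()` is taken from the step list above.

```python
import functools
import json
import tempfile
from pathlib import Path


def with_file_io(func):
    """Wrap func(data) -> result so that it reads its input from a file
    and writes its result to another file.

    The wrapped name is called as step(ifp, ofp)(): the first call binds
    the paths, the second call runs the step.
    """
    @functools.wraps(func)
    def bind_paths(ifp, ofp):
        def run():
            data = json.loads(Path(ifp).read_text(encoding="utf-8"))
            result = func(data)
            Path(ofp).write_text(
                json.dumps(result, ensure_ascii=False), encoding="utf-8"
            )
            return result
        return run
    return bind_paths


@with_file_io
def sort_words_io(words):
    """Pure logic: deduplicate and sort a list of words."""
    return sorted(set(words))


# Usage: run the step against temporary files.
with tempfile.TemporaryDirectory() as tmp:
    ifp = Path(tmp) / "words.json"
    ofp = Path(tmp) / "sorted.json"
    ifp.write_text(json.dumps(["nar", "cami", "nar"]), encoding="utf-8")
    result = sort_words_io(ifp, ofp)()
    saved = json.loads(ofp.read_text(encoding="utf-8"))
```

Keeping file access in the decorator leaves the main functions free of I/O, so they can be tested on in-memory data.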
Anki decks contain:
- words in a source language (Turkish in my case),
- their translations into the target language (English),
- usage examples of these words in both languages (the root of the word is underlined in the source language, the whole word in the target language),
- audio for the examples in the source language,
- frequency metrics for these words, based on a corpus of texts.
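As an illustration of the underlining mentioned in the list above: Anki renders field content as HTML, so a root can be marked with `<u>` tags. The helper below is a hypothetical sketch, not the way this project produces the markup. It assumes the root appears literally at the start of a word in the example sentence.

```python
import re


def underline_root(sentence, root):
    """Wrap the first word-initial occurrence of root in <u>...</u>.

    The match is case-sensitive and literal. If the root does not occur
    as written (for example after consonant mutation, 'kitap' -> 'kitabı'),
    the sentence is returned unchanged.
    """
    pattern = r"\b" + re.escape(root)
    return re.sub(
        pattern,
        lambda m: "<u>" + m.group(0) + "</u>",
        sentence,
        count=1,
    )


marked = underline_root("Bugün camiye gittim.", "cami")
unchanged = underline_root("Kitabı okudum.", "kitap")
```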
You can easily configure the code to make similar decks for any pair of languages. The Anki decks can be used in the desktop, mobile, or web application.