ML Research and DB Design for a Language Learning App

oro13/language-app-ml


Totlahtol, an app for learning languages

Enhanced with Machine Learning Features

In collaboration with

(This repository contains research and development for ML and data pipeline. The code base is hosted here)


Our app, Totlahtol, is named for the word ‘Languages,’ in Nahuatl: the Aztec language once widely spoken on this continent and in Central America.

My friend in El Paso began working on this application over a year ago. Now I’m helping him enhance the prototype with advanced Machine Learning components, including natural language processing of user-generated content and a recommender of lessons based on user interests.

Tech Stack Tools:

Front End • React JS, Flask

Back End • SQLAlchemy, Keras/TensorFlow, Python, NumPy, Pandas

The app (prototype) in action:


A user logs in, adds a lesson, and gets lessons specifically curated to their interests

Do we need another language app?

While there are many Language Apps available, Totlahtol stands out by offering:

  • User Generated Lessons,
  • Recommendation of Content specific to User interests and activity,
  • A seamless, interactive user interface to immerse users in the target language.

Whether you’re the type of polyglot who speaks Spanish and French or the kind who speaks Python and Javascript, feel free to reach out to learn more.


ML components

User Generated Lessons and NLP Topic Modeling

When a user uploads a lesson, the app models and embeds the lesson's word tokens and latent topics to understand its content (through NLP, LDA, word embeddings, and a neural network).

Doing so allows the app to group similar lessons together on the fly, enabling user-specific recommendations based on activity and lesson preferences (through matrix factorization and a deep neural network).


Why NLP?

Topic modeling, and checking for duplicate lessons (hashing tokens)
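One way the duplicate check by hashing tokens could look is sketched below. This is an illustration, not the project's code: `tokenize`, `lesson_fingerprint`, and `DuplicateChecker` are hypothetical names, and the normalization rules (lowercasing, dropping punctuation) are assumptions. It uses only the Python standard library.

```python
import hashlib
import re
import unicodedata


def tokenize(text):
    """Normalize to NFC, lowercase, and split into word tokens (Unicode-aware)."""
    normalized = unicodedata.normalize("NFC", text).lower()
    return re.findall(r"\w+", normalized)


def lesson_fingerprint(text):
    """Hash the normalized token sequence so trivially different copies collide."""
    tokens = tokenize(text)
    return hashlib.sha256(" ".join(tokens).encode("utf-8")).hexdigest()


class DuplicateChecker:
    """Remembers fingerprints of lessons seen so far (hypothetical helper)."""

    def __init__(self):
        self._seen = {}  # fingerprint -> id of the first lesson with that text

    def add(self, lesson_id, text):
        """Return the id of an earlier identical lesson, or None if the lesson is new."""
        key = lesson_fingerprint(text)
        if key in self._seen:
            return self._seen[key]
        self._seen[key] = lesson_id
        return None
```

An exact hash like this only catches lessons whose token sequences match after normalization. Catching near-duplicates (a reworded sentence, an added paragraph) would need a different technique and is outside this sketch.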

Prototype: LDA

Production: lda2Vec, word2vec, multilingual embeddings, Deep Neural Network, consider Rust HuggingFace tokenizers for speed


Embedding Space using TensorBoard Projector


Topic Modeling with Embedding, to show how similar lessons can be grouped for specific user interests.
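To make the grouping idea concrete, here is a minimal sketch that ranks lessons by cosine similarity of their embedding vectors. The lesson names and the three-dimensional vectors are made up for illustration; real embeddings would come from the topic model and have many more dimensions. `cosine_similarity` and `most_similar` are hypothetical helpers.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(lesson_id, embeddings, top_n=2):
    """Rank the other lessons by similarity to `lesson_id`, most similar first."""
    query = embeddings[lesson_id]
    scored = [
        (other, cosine_similarity(query, vec))
        for other, vec in embeddings.items()
        if other != lesson_id
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Made-up topic vectors; the axes stand for (food, travel, sports).
embeddings = {
    "ordering_tacos": [0.9, 0.1, 0.0],
    "market_vocabulary": [0.8, 0.2, 0.0],
    "airport_phrases": [0.1, 0.9, 0.0],
    "soccer_terms": [0.0, 0.1, 0.9],
}
neighbors = most_similar("ordering_tacos", embeddings)
```

With these toy vectors, the closest neighbor of the taco lesson is the other food lesson, which is the kind of grouping the embedding projection is meant to show.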

Tag the Lessons with specific topics, to generate more signal for the recommender.

Why Recommenders?

Prototype: Sparse Matrix Factorization

Pros: quick, reliable when signal is reliable (enough user activity)

Cons: bad with limited data on new users (cold start), inputs restricted to User and Items matrix
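For readers unfamiliar with matrix factorization, the sketch below learns a small vector per user and per lesson so that their dot product approximates the observed rating. It is a plain-Python illustration trained with stochastic gradient descent, not the prototype's implementation; the function names, hyperparameters, and toy ratings are all assumptions.

```python
import random


def factorize(ratings, n_factors=2, epochs=200, lr=0.05, reg=0.02, seed=0):
    """Learn user and lesson factor vectors from (user, lesson, rating) triples via SGD."""
    rng = random.Random(seed)
    users = sorted({user for user, _, _ in ratings})
    lessons = sorted({lesson for _, lesson, _ in ratings})
    # Small random starting vectors for every user (P) and lesson (Q).
    P = {u: [rng.uniform(-0.1, 0.1) for _ in range(n_factors)] for u in users}
    Q = {i: [rng.uniform(-0.1, 0.1) for _ in range(n_factors)] for i in lessons}
    for _ in range(epochs):
        for user, lesson, rating in ratings:
            pred = sum(p * q for p, q in zip(P[user], Q[lesson]))
            err = rating - pred
            for k in range(n_factors):
                p, q = P[user][k], Q[lesson][k]
                # Gradient step on squared error with L2 regularization.
                P[user][k] += lr * (err * q - reg * p)
                Q[lesson][k] += lr * (err * p - reg * q)
    return P, Q


def predict(P, Q, user, lesson):
    """Predicted rating: dot product of the user and lesson factor vectors."""
    return sum(p * q for p, q in zip(P[user], Q[lesson]))


# Toy data: ana and ben like food lessons, cy and dee like sports lessons.
# Two ratings (ana/tennis, cy/market) are deliberately left out.
ratings = [
    ("ana", "tacos", 1), ("ana", "market", 1), ("ana", "soccer", -1),
    ("ben", "tacos", 1), ("ben", "market", 1), ("ben", "soccer", -1), ("ben", "tennis", -1),
    ("cy", "tacos", -1), ("cy", "soccer", 1), ("cy", "tennis", 1),
    ("dee", "tacos", -1), ("dee", "market", -1), ("dee", "soccer", 1), ("dee", "tennis", 1),
]
P, Q = factorize(ratings)
```

The cold-start weakness is visible in the code: a user or lesson that never appears in `ratings` has no entry in `P` or `Q`, so `predict` cannot score it at all.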

Production: Deep Neural Network


The Data Pipeline, in Detail

My research has centered on the app's most important use case: uploading a lesson and recommending it to users whose activity implies it would be relevant to them.

In general, a Recommender Needs these Three Steps:

  1. item candidate generation
  2. user specific scoring of items
  3. reranking, or sorting the items based on relevance to the user
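The three steps above can be sketched as three small functions. This is only a toy illustration: the scorer here is a stand-in based on topic overlap rather than a trained model, and the data layout (a `user` dict with an `interests` set, lessons mapped to topic sets) is an assumption made for the example.

```python
def generate_candidates(user, lessons, seen):
    """Step 1: keep unseen lessons that share at least one topic with the user's interests."""
    return [
        lesson_id
        for lesson_id, topics in lessons.items()
        if lesson_id not in seen and topics & user["interests"]
    ]


def score(user, lesson_topics):
    """Step 2: stand-in scorer, the fraction of the lesson's topics the user cares about."""
    return len(lesson_topics & user["interests"]) / len(lesson_topics)


def recommend(user, lessons, seen, top_n=3):
    """Step 3: rerank candidates by score, highest first (ties broken by lesson id)."""
    candidates = generate_candidates(user, lessons, seen)
    scored = [(lesson_id, score(user, lessons[lesson_id])) for lesson_id in candidates]
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:top_n]


# Made-up lessons and topics for illustration.
lessons = {
    "ordering_tacos": {"food"},
    "market_day": {"food", "travel"},
    "airport_phrases": {"travel"},
    "soccer_terms": {"sports"},
    "stadium_snacks": {"food", "sports"},
}
user = {"interests": {"food", "travel"}}
feed = recommend(user, lessons, seen={"ordering_tacos"})
```

Keeping candidate generation separate from scoring means an expensive scorer only runs on a short list of lessons instead of the whole catalog.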

Our model has the additional step of processing lessons for their latent topics, to give more signal for the recommender. Here’s how:

  1. A user uploads a lesson

  2. NLP processes the text and discerns the lesson topics

  3. The lesson-specific word and topic embeddings are made available to the recommender model


TensorBoard diagram of Word2Vec Embedding Process.

  4. Similar lessons are grouped together by topic

  5. User ratings and lesson activity are made available to the recommender model


Users give thumbs up (1) or thumbs down (-1) ratings as explicit feedback for the recommender.

  6. The recommender takes these, along with other features about the users and lessons, as input


The layers of the TensorFlow Neural Network Recommender

  7. The recommender, a combination of deep neural network and matrix factorization, returns the probable ratings for lessons each user has not seen yet

  8. These predicted ratings are sorted to find the highest ratings

  9. When a user opens their feed, these lessons are suggested to them first
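The bookkeeping around the model (collecting thumbs ratings, finding what each user has not seen, and ordering predictions for the feed) might look like the sketch below. It assumes that a rated lesson counts as "seen" and that feedback arrives as `(user, lesson, vote)` events; both are assumptions for illustration, and all function names are hypothetical. The prediction step itself is left to the recommender model.

```python
from collections import defaultdict


def build_ratings(events):
    """Collapse (user, lesson, vote) events into {user: {lesson: rating}}; the latest vote wins."""
    ratings = defaultdict(dict)
    for user, lesson, vote in events:
        if vote not in (1, -1):
            raise ValueError(f"vote must be 1 or -1, got {vote!r}")
        ratings[user][lesson] = vote
    return dict(ratings)


def unseen_lessons(ratings, all_lessons):
    """For each user, list the lessons they have not rated; these are the ones to predict."""
    return {
        user: sorted(set(all_lessons) - set(rated))
        for user, rated in ratings.items()
    }


def rank_feed(predicted):
    """Order lesson ids by predicted rating, highest first."""
    return sorted(predicted, key=predicted.get, reverse=True)
```

For example, `rank_feed({"market": 0.2, "soccer": 0.9})` puts `"soccer"` ahead of `"market"`, which is the order the feed would show them in.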

Takeaways:

Learned a number of libraries, such as Keras/TensorFlow, and became more familiar with creating custom functions in Pandas and NumPy, and with text processing and NLP in Gensim and NLTK.

Faced the Challenge of working remotely with the software engineer, my friend, in a different time zone, and had to iteratively adjust the app to implement changes.

Got experience working with machine learning in a production web development environment, acting as the domain expert who recommends best practices for performance and scalability. At nearly every step I had to weigh the trade-off between a fast working prototype and the best available solution for the task. Sometimes shipping the prototype is the clear priority, but some best practices shouldn’t be compromised. I found that out the hard way when, late in the project, I decided to reimplement many features using Keras/TensorFlow to pursue state-of-the-art recommendation like that seen on YouTube and Facebook.

If interested in knowing more about the application, whether you’re the type of polyglot who speaks Spanish and French or the kind who speaks Python and Javascript, feel free to reach out! We intend to keep working on the app until we have a deliverable prototype.
