FELI GENTLE, Machine Learning Engineer
In collaboration with
ARMAND VILLAVERDE, Data Engineer, Full Stack Dev
(This repository contains research and development for the ML components and data pipeline. The code base is hosted here)
Our app, Totlahtol, is named for the Nahuatl word for ‘languages.’ Nahuatl is the Aztec language, once widely spoken on this continent and in Central America.
My friend in El Paso began working on this application over a year ago. Now I’m helping him enhance the prototype with advanced Machine Learning components, including natural language processing of user-generated content and a recommender that suggests lessons based on user interests.
Tech Stack Tools:
Back End • SQLAlchemy • Keras/Tensorflow • Python • Numpy • Pandas
The app (prototype) in action:
A user logs in, adds a lesson, and gets lessons specifically curated to their interests
While there are many Language Apps available, Totlahtol stands out by offering:
- User Generated Lessons,
- Recommendation of Content specific to User interests and activity,
- A seamless, interactive user interface to immerse users in the target language.
Whether you’re the type of polyglot who speaks Spanish and French or the kind who speaks Python and Javascript, feel free to reach out to learn more.
When a user uploads a lesson: model and embed the lesson's word tokens and latent topics to understand its content (through NLP, LDA, word embeddings, and a neural network).
Doing so allows the app to group similar lessons together on the fly, enabling user-specific recommendations based on activity and lesson preferences (through matrix factorization and a deep neural network).
Topic modeling, and checking for duplicate lessons (hashing tokens)
Prototype: LDA
Production: lda2Vec, word2vec, multilingual embeddings, Deep Neural Network, consider Rust HuggingFace tokenizers for speed
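The duplicate check by hashing tokens could look something like the sketch below. This is only an illustration under my own assumptions, not the code in the repo: the function names (`tokenize`, `lesson_fingerprint`, `is_duplicate`), the regex tokenizer, and the choice of SHA-256 are all hypothetical.

```python
import hashlib
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens (Unicode-aware)."""
    return re.findall(r"\w+", text.lower())


def lesson_fingerprint(text):
    """Hash the token sequence so punctuation, case, and spacing do not matter."""
    joined = " ".join(tokenize(text))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def is_duplicate(text, seen_fingerprints):
    """Return True if an equivalent lesson was already stored; otherwise record it."""
    fingerprint = lesson_fingerprint(text)
    if fingerprint in seen_fingerprints:
        return True
    seen_fingerprints.add(fingerprint)
    return False


# Hypothetical uploads: the second differs only in case, spacing, and punctuation.
seen = set()
first = is_duplicate("Hola, ¿cómo estás?", seen)
second = is_duplicate("HOLA   cómo estás!", seen)
```

Hashing the normalized tokens only catches lessons whose word sequences match exactly. Near-duplicates with a few changed words would need a fuzzier comparison, which is outside this sketch.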
Embedding Space using TensorBoard Projector
Topic Modeling with Embedding, to show how similar lessons can be grouped for specific user interests.
Tag the Lessons with specific topics, to generate more signal for the recommender.
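To make the grouping idea concrete, here is a minimal sketch of finding similar lessons by cosine similarity between lesson vectors. The vectors are made-up topic weights rather than output from our models, and `most_similar_lessons` is a hypothetical helper name.

```python
import numpy as np


def most_similar_lessons(lesson_vectors, lesson_index, top_n=2):
    """Rank the other lessons by cosine similarity to the chosen lesson."""
    norms = np.linalg.norm(lesson_vectors, axis=1, keepdims=True)
    unit = lesson_vectors / np.clip(norms, 1e-12, None)  # avoid dividing by zero
    similarity = unit @ unit[lesson_index]
    similarity[lesson_index] = -np.inf  # never return the lesson itself
    return np.argsort(-similarity)[:top_n].tolist()


# Hypothetical topic vectors: rows are lessons, columns are topic weights.
lesson_vectors = np.array([
    [0.9, 0.1, 0.0],  # lesson 0: mostly topic A
    [0.8, 0.2, 0.0],  # lesson 1: mostly topic A
    [0.0, 0.1, 0.9],  # lesson 2: mostly topic C
    [0.1, 0.9, 0.0],  # lesson 3: mostly topic B
])
neighbors = most_similar_lessons(lesson_vectors, lesson_index=0)
```

Here lesson 1 comes back as the closest neighbor of lesson 0 because both load on the same topic. The same lookup would work on word or document embeddings, assuming every lesson is mapped to a vector of the same length.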
Prototype: Sparse Matrix Factorization
Pros: quick, reliable when signal is reliable (enough user activity)
Cons: bad with limited data on new users (cold start), inputs restricted to User and Items matrix
Production: Deep Neural Network
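For reference, the prototype's matrix factorization idea can be sketched in a few lines of NumPy. This is a simplified version trained with stochastic gradient descent, not the repo's implementation: the `factorize` function, its hyperparameters, and the sample ratings are assumptions made for illustration.

```python
import numpy as np


def factorize(ratings, n_factors=2, n_epochs=200, lr=0.05, reg=0.01, seed=0):
    """Fit user and lesson factors to observed (user, lesson, rating) triples."""
    rng = np.random.default_rng(seed)
    n_users = max(u for u, _, _ in ratings) + 1
    n_items = max(i for _, i, _ in ratings) + 1
    user_f = rng.normal(scale=0.1, size=(n_users, n_factors))
    item_f = rng.normal(scale=0.1, size=(n_items, n_factors))
    for _ in range(n_epochs):
        for u, i, r in ratings:
            error = r - user_f[u] @ item_f[i]
            # Compute both steps before applying them so each uses the old factors.
            user_step = lr * (error * item_f[i] - reg * user_f[u])
            item_step = lr * (error * user_f[u] - reg * item_f[i])
            user_f[u] += user_step
            item_f[i] += item_step
    return user_f, item_f


# Hypothetical feedback: (user id, lesson id, rating), 1 = thumbs up, -1 = thumbs down.
ratings = [(0, 0, 1), (0, 1, -1), (1, 0, 1), (1, 2, 1), (2, 1, 1), (2, 2, -1)]
user_f, item_f = factorize(ratings)
predicted = user_f @ item_f.T  # predicted rating for every user and lesson pair
```

The cold start problem shows up directly here: a user with no entries in `ratings` never has their factors updated, so their predictions stay near the random initialization. The inputs are also limited to the user and lesson ids, which is part of why the production plan moves to a deep neural network.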
My research has centered on the app's most important use case: uploading a lesson and recommending it to users whose activity implies it would be relevant to them.
In general, a Recommender Needs these Three Steps:
- item candidate generation
- user specific scoring of items
- reranking, or sorting the items based on relevance to the user
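The three steps above can be sketched as three small functions. Everything in this example is hypothetical: the lesson names, the topic tags, and especially the scoring rule, which uses simple topic overlap as a stand-in for a trained model's predicted rating.

```python
def generate_candidates(user_seen, lessons):
    """Step 1: keep only lessons the user has not seen yet."""
    return [lesson_id for lesson_id in lessons if lesson_id not in user_seen]


def score_candidates(user_topics, candidates, lessons):
    """Step 2: score each candidate by overlap between user interests and lesson topics."""
    return {
        lesson_id: len(user_topics & lessons[lesson_id])
        for lesson_id in candidates
    }


def rerank(scores, top_n=3):
    """Step 3: sort candidates from most to least relevant (ties broken by name)."""
    ranked = sorted(scores, key=lambda lesson_id: (-scores[lesson_id], lesson_id))
    return ranked[:top_n]


# Hypothetical lessons tagged with topics, plus one user's interests and history.
lessons = {
    "greetings": {"conversation", "travel"},
    "food": {"cooking", "travel"},
    "verbs": {"grammar"},
    "markets": {"travel", "cooking", "conversation"},
}
user_topics = {"travel", "cooking"}
user_seen = {"food"}

candidates = generate_candidates(user_seen, lessons)
scores = score_candidates(user_topics, candidates, lessons)
feed = rerank(scores)
```

Keeping the stages separate means the scoring step can be swapped for a stronger model later without touching candidate generation or reranking.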
Our model has the additional step of processing each lesson for its latent topics, to give more signal for the recommender. Here’s how:
- A user uploads a lesson
- NLP for processing the text and discerning the lesson topics
- The lesson-specific word and topic embeddings are available for the recommender model
TensorBoard diagram of Word2Vec Embedding Process.
- Common lessons are grouped together by topic
- User ratings and lesson activity are made available for the recommender model
Users give a thumbs up (1) or thumbs down (-1) rating as explicit feedback for the recommender.
- The recommender takes these ratings, along with other features about the users and lessons, as input
The layers of the Tensorflow Neural Network Recommender
- The recommender, a combination of deep neural network and matrix factorization, returns the probable ratings for lessons each user has not seen yet
- These predicted ratings are sorted to find the highest ratings
- When a user opens their feed, these lessons are suggested to them first
Takeaways:
Learned a number of libraries, such as Keras/TensorFlow, and became more familiar with creating custom functions in Pandas and NumPy, and with text processing and NLP in Gensim and NLTK.
Faced the challenge of working remotely with the software engineer, my friend, in a different time zone, and had to adjust the app iteratively to implement changes.
Got experience working with machine learning in a production web development environment, acting as the domain expert who recommends best practices for performance and scalability. At nearly every step, had to weigh the trade-off between a fast working prototype and the best available solution for a given task. Sometimes getting the prototype working is the clear priority, but some best practices shouldn’t be compromised. Found that out the hard way when, late in the project, I decided to reimplement many features in Keras/TensorFlow to achieve state-of-the-art recommendations like those seen on YouTube and Facebook.
If you're interested in learning more about the application, whether you’re the type of polyglot who speaks Spanish and French or the kind who speaks Python and Javascript, feel free to reach out! We intend to keep working on the app until we have a deliverable prototype.