The task in this competition is to create a library that detects the programming or markup language of a code snippet.
My solution is inspired by this article about a char-level CNN for programming language classification.
Steps:
- First of all, I generated a dataset of ~190k labeled code snippets
The histogram below shows the number of code snippets for each language (the numbers on the X axis match the descriptions in the common.py file)
- Then I created a model and trained it on ~150k examples for 10 epochs
- After training, the model reached 79% accuracy on the validation dataset (~37k samples)
- Finally, I created a Telegram bot to test my model in real life
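
A char-level CNN takes each snippet as a fixed-length sequence of character indices rather than as words or tokens. The sketch below illustrates one common way to do that encoding. It is not the preprocessing used in this repository: the alphabet (printable ASCII), the reserved padding and unknown indices, the `MAX_LEN` value and the `encode_snippet` name are all assumptions made for illustration.

```python
import string

# Assumed alphabet: the 100 printable ASCII characters.
# Index 0 is reserved for padding, index 1 for characters outside the alphabet.
ALPHABET = string.printable
PAD, UNK = 0, 1
CHAR_TO_INDEX = {ch: i + 2 for i, ch in enumerate(ALPHABET)}
MAX_LEN = 1024  # assumed fixed input length, not taken from the project


def encode_snippet(code, max_len=MAX_LEN):
    """Map a code snippet to a list of exactly max_len character indices."""
    # Truncate long snippets, then look up each character (UNK if unseen).
    indices = [CHAR_TO_INDEX.get(ch, UNK) for ch in code[:max_len]]
    # Right-pad short snippets so every input has the same length.
    return indices + [PAD] * (max_len - len(indices))
```

Sequences like these would typically go through an embedding layer followed by 1D convolutions. With this alphabet the embedding would need `len(ALPHABET) + 2` entries.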