Recurrent neural network to split code snippets from text.

vmarkovtsev/CodeNeuron

Code Neuron

A recurrent neural network that detects code blocks in text. It runs on TensorFlow and is trained in two stages.

The first stage pre-trains a character-level RNN with two branches, one reading the characters before the target and one reading the characters after it:

CharRNN Architecture

my code :  FooBar
------> x <------
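
To make the diagram concrete, here is a minimal sketch of how the two input windows for one position could be cut out of a string. The function name `context_windows`, the `length` parameter, and the choice to reverse the right-hand window so that both branches read toward the target character are illustrative assumptions, not taken from the repository's training code.

```python
def context_windows(text, pos, length):
    """Return (before, after) context for the character at text[pos].

    `before` is read left to right and ends next to the target.
    `after` is reversed so that it also ends next to the target,
    mirroring the two arrows in the diagram (an assumption).
    """
    before = text[max(0, pos - length):pos]
    after = text[pos + 1:pos + 1 + length][::-1]
    return before, after


# The target is "d" (index 3); each branch sees two characters.
print(context_windows("abcdefg", 3, 2))  # ('bc', 'fe')
```

Near the start or end of the text the windows simply come out shorter; a real pipeline would probably pad them to a fixed length.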

We assign the recurrent branches to different GPUs to train faster. I set 512 LSTM neurons and reach 89% validation accuracy over the 200 most frequent character classes:

CharRNN Validation
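
Restricting the output to the most frequent characters can be sketched as follows. This is only an illustration: the helper names `build_char_classes` and `encode`, and the extra "unknown" class for rare characters, are assumptions and may differ from how the repository handles them.

```python
from collections import Counter


def build_char_classes(corpus, num_classes=200):
    """Map the `num_classes` most frequent characters to ids 0..num_classes-1."""
    counts = Counter(corpus)
    # Sort by descending frequency, then by character so ties are deterministic.
    ranked = sorted(counts, key=lambda ch: (-counts[ch], ch))
    return {ch: idx for idx, ch in enumerate(ranked[:num_classes])}


def encode(text, classes):
    """Encode text as class ids; rare characters share one extra id (assumption)."""
    unknown = len(classes)
    return [classes.get(ch, unknown) for ch in text]


classes = build_char_classes("aaabbc", num_classes=2)
print(classes)                 # {'a': 0, 'b': 1}
print(encode("abc", classes))  # [0, 1, 2]
```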

The second stage trains the same network, but with a different dense layer that predicts only 3 classes: code block begins, code block ends, and no-op. The prediction scheme changes too: we now look at two adjacent characters and decide whether there is a code boundary between them.

Code Neuron Validation

It is much faster to train and it reaches ~99.2% validation accuracy.
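
The three-class targets of the second stage can be illustrated with a small labelling sketch. It assumes the training text marks code with `<code>` and `</code>` tags, as in the sample output of this README; the class ids, the function name `boundary_labels`, and the extra labels for the gaps at the very start and end of the text are illustrative choices, not the repository's actual encoding.

```python
NOOP, CODE_BEGINS, CODE_ENDS = 0, 1, 2  # hypothetical class ids
OPEN_TAG, CLOSE_TAG = "<code>", "</code>"


def boundary_labels(marked):
    """Strip the tags and label every gap between characters.

    Returns (plain_text, labels) where labels[i] describes the gap just
    before plain_text[i]; the last entry is the gap after the final char.
    """
    chars = []
    labels = [NOOP]  # gap before the first character
    i = 0
    while i < len(marked):
        if marked.startswith(OPEN_TAG, i):
            labels[-1] = CODE_BEGINS
            i += len(OPEN_TAG)
        elif marked.startswith(CLOSE_TAG, i):
            labels[-1] = CODE_ENDS
            i += len(CLOSE_TAG)
        else:
            chars.append(marked[i])
            labels.append(NOOP)  # gap after this character
            i += 1
    return "".join(chars), labels


text, labels = boundary_labels("ab<code>cd</code>ef")
print(text)    # abcdef
print(labels)  # [0, 0, 1, 0, 2, 0, 0]
```

Most gaps are no-ops, so the classes are heavily imbalanced in this formulation.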

Training set

StackSample questions and answers, each processed with the following pipeline (run it once for Answers.csv.zip and once for Questions.csv.zip):

unzip -p Answers(Questions).csv.zip | ./dataset | sed -r -e '/^$/d' -e '/\x03/ {N; s/\x03\s*\n/\x03/g}' | gzip >> Dataset.txt.gz

Baked model

model_LSTM_600_0.9924.pb reaches 99.2% accuracy on validation. The model is in TensorFlow "GraphDef" protobuf format.

Pretraining was performed with 20% validation on the first 8000000 bytes of the uncompressed questions. Training was performed with 20% validation and 90% negative samples on the first 256000000 bytes of the uncompressed questions. I was too lazy to wait a week for it to train on the whole dataset, so you are encouraged to experiment.

Try to run it:

cat sample.txt | python3 run_model.py -m model_LSTM_600_0.9924.pb

You should see:

Here is my Python code, it is awesome and easy to read:
<code>def main():
    print("Hello, world!")
</code>Please say what you think about it. Mad skills. Here is another one,
<code>func main() {
  println("Hello, world!")
}
</code>As you see, I know Go too. Some more text to provide enough context.
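
Output like the above can be produced from per-gap predictions with a few lines of post-processing. The sketch below is not the code from run_model.py; the class ids, the function name `insert_code_tags`, and the convention that `labels[i]` refers to the gap just before `text[i]` are assumptions made for illustration.

```python
NOOP, CODE_BEGINS, CODE_ENDS = 0, 1, 2  # hypothetical class ids


def insert_code_tags(text, labels):
    """Insert <code> / </code> wherever a boundary class was predicted.

    labels must contain len(text) + 1 entries: one for the gap before
    each character plus one for the gap after the last character.
    """
    if len(labels) != len(text) + 1:
        raise ValueError("expected one label per gap")
    tags = {CODE_BEGINS: "<code>", CODE_ENDS: "</code>"}
    pieces = []
    for i, label in enumerate(labels):
        pieces.append(tags.get(label, ""))  # no-op gaps add nothing
        if i < len(text):
            pieces.append(text[i])
    return "".join(pieces)


print(insert_code_tags("abcdef", [0, 0, 1, 0, 2, 0, 0]))  # ab<code>cd</code>ef
```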

Visualize the trained model:

python3 model2tb.py --model-dir model_LSTM_600_0.9924.pb --log-dir tb_logs
tensorboard --logdir=tb_logs

Go inference

go get gopkg.in/vmarkovtsev/CodeNeuron.v1/...
cat sample.txt | $(go env GOPATH)/bin/codetect

API:

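package main

import (
  "fmt"
  "io/ioutil"

  codetect "gopkg.in/vmarkovtsev/CodeNeuron.v1"
)

func main() {
  session, _ := codetect.OpenSession() // errors are ignored here for brevity
  textBytes, _ := ioutil.ReadFile("test.txt")
  result, _ := codetect.Run(string(textBytes), session)
  fmt.Println(result) // use the result so that the program compiles
}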

Updating the model

go-bindata -nomemcopy -nometadata -pkg assets -o assets/bindata.go  model.pb

License

MIT, see LICENSE.
