Pre-train model for Javascript #9

Open · Alfusainey opened this issue Feb 25, 2018 · 5 comments

@tuvistavie is there an existing pre-trained model for JavaScript that one can use out of the box? I have three datasets, for training, testing and validation, and I want to generate embeddings for each dataset.

danhper (Owner) commented Feb 25, 2018

Unfortunately I do not have pretrained embeddings for JavaScript.
Note that you should normally train a single embedding on a large corpus and use it for training, test and validation, rather than train a different embedding per dataset.

You can either use your data directly, use the fetcher module to collect some data to generate the embedding, or use an existing dataset such as the one available here:
https://learnbigcode.github.io/datasets/
This tool should be compatible with the format of the ASTs in the above link.

Please let me know if you run into any trouble.

Alfusainey (Author) commented Feb 25, 2018

@tuvistavie thank you. I am using this dataset: https://www.srl.inf.ethz.ch/js150.php

Probably my approach is not intuitive, but what I want is to train, validate and test a vanilla LSTM model for auto-completing JavaScript code (as a pet project). So I first want to convert the input source code into an embedding, which becomes the input to the LSTM model. It is the LSTM model that I need to train, validate and test, and I am wondering how a single embedding would help.

In summary, what I want is to preprocess the train, test and validation sets into a format (some vector) that I can pass as input to the LSTM model, i.e. source code -> ASTs -> skipgram data -> embeddings -> LSTM model.

Any thoughts on whether that makes sense?

danhper (Owner) commented Feb 26, 2018

What you are trying to do makes sense, but the skipgram usage is, I think, a little off. This should rather be a two-step process:

1. Train the embeddings: code -> ASTs -> skipgram data -> embeddings
2. Train the LSTM: code -> ASTs -> embeddings learned in 1 -> LSTM

Note that to be able to use the embeddings with any framework, bigcode-embeddings has a command to export them to npy format. The command usually looks something like this:

```
bigcode-embeddings export output/embeddings.bin-19724004 -o embeddings.npy
```
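If you want to sanity-check the exported file, you can load it with NumPy. A minimal sketch, assuming the export produces a 2-D array with one row per vocabulary token:

```python
import numpy as np

# Load the matrix written by `bigcode-embeddings export`.
# Assumed shape: (vocab_size, embedding_dim).
embeddings = np.load("embeddings.npy")
print(embeddings.shape)
```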

You can choose to fine-tune the embeddings when training your LSTM, but you should normally only train the embeddings once.
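To illustrate that choice (this is just a sketch, not part of bigcode-tools; the framework and layer sizes are my own assumptions), a Keras model for step 2 could initialize its embedding layer from the exported matrix and freeze or unfreeze it:

```python
import numpy as np
from tensorflow import keras

pretrained = np.load("embeddings.npy")  # exported in step 1
vocab_size, embedding_dim = pretrained.shape

model = keras.Sequential([
    # Start from the pretrained embeddings; trainable=False keeps them
    # frozen, trainable=True fine-tunes them along with the LSTM.
    keras.layers.Embedding(
        vocab_size,
        embedding_dim,
        embeddings_initializer=keras.initializers.Constant(pretrained),
        trainable=False,
    ),
    keras.layers.LSTM(128),  # hypothetical size
    # Predict the next token in the sequence for auto-completion.
    keras.layers.Dense(vocab_size, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```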

Does this make sense?

Alfusainey (Author)

Thank you @tuvistavie, yes it makes sense. I was wondering: should I use all my data (train + test + val) when training the embedding?

In step 2, should I use the JSON serialization of the AST together with the learned embeddings as input to the LSTM? i.e. code -> (ASTs, learned embeddings) -> LSTM. I am just a bit confused about moving from ASTs -> embeddings learned in 1 (since we already have the embeddings from step 1).

danhper (Owner) commented Mar 1, 2018

Sorry for the delay.

> I was wondering if I should use all my data

I suggest you use only your training data when learning the embeddings, to be sure that what you learn generalizes well enough.

> In step 2, should I use the JSON serialization of the AST together with the learned embeddings as input to the LSTM?

Yes, you should use the vocabulary and embeddings from step 1.
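To make the "ASTs -> embeddings learned in 1" step concrete: the ASTs are flattened into token sequences and mapped to vocabulary indices, and the embedding lookup then happens inside the model via those indices. A rough sketch, assuming the step-1 vocabulary is available as a token-to-index dict (all names here are hypothetical):

```python
# Hypothetical token-to-index vocabulary produced in step 1.
vocab = {"<unk>": 0, "Program": 1, "FunctionDeclaration": 2, "Identifier": 3}

def encode(ast_tokens):
    """Map AST node tokens to vocabulary indices; unknown tokens map to <unk>."""
    return [vocab.get(token, vocab["<unk>"]) for token in ast_tokens]

# These index sequences are what the LSTM's embedding layer consumes.
print(encode(["Program", "FunctionDeclaration", "Identifier", "ArrowFunction"]))
# -> [1, 2, 3, 0]
```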
