Pre-train model for Javascript #9

Open · Alfusainey opened this issue Feb 25, 2018 · 5 comments

@tuvistavie is there an existing pre-trained model for JavaScript that one can use out of the box? I have three datasets, for training, testing and validation, and I want to generate embeddings for each dataset.

danhper (Owner) commented Feb 25, 2018

Unfortunately I do not have pretrained embeddings for JavaScript.
Note that you should normally train a single embedding on a large corpus and use it for training, test and validation, rather than train a different embedding per dataset.

You can either use your data directly, use the fetcher module to collect some data to generate the embedding, or use an existing dataset such as the one available here:
https://learnbigcode.github.io/datasets/
This tool should be compatible with the format of the ASTs in the above link.

Please let me know if you run into any trouble.

Alfusainey (Author) commented Feb 25, 2018

@tuvistavie thank you. I am using this dataset: https://www.srl.inf.ethz.ch/js150.php

Probably my approach is not intuitive, but what I want is to train, validate and test a vanilla LSTM model for auto-completing JavaScript code (as a pet project). So I first want to convert the input source code into an embedding, which becomes the input to the LSTM model. It is the LSTM model that I need to train, validate and test, and I am wondering how a single embedding would help.

In summary, what I want is to preprocess the train, test and validation sets into a format (some vector) that I can pass as input to the LSTM model, i.e. source code -> ASTs -> skipgram data -> embeddings -> LSTM model.

Any thoughts on whether that makes sense?

danhper (Owner) commented Feb 26, 2018

What you are trying to do makes sense, but the skipgram usage is, I think, a little off. This should rather be a two-step process:

1. Train the embeddings: code -> ASTs -> skipgram data -> embeddings
2. Train the LSTM: code -> ASTs -> embeddings learned in 1 -> LSTM

Note that to be able to use the embeddings with any framework, bigcode-embeddings has a command to export them to npy format. The command usually looks something like this:

```
bigcode-embeddings export output/embeddings.bin-19724004 -o embeddings.npy
```
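If you want to sanity-check the exported file, you can load it with NumPy. A minimal sketch, assuming the export produces a 2-D array with one row per vocabulary token:

```python
import numpy as np

# Load the matrix written by `bigcode-embeddings export`.
# Assumed shape: (vocab_size, embedding_dim).
embeddings = np.load("embeddings.npy")
print(embeddings.shape)
```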

You can choose to fine-tune the embeddings when training your LSTM, but you should normally only train the embeddings once.
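To illustrate that choice (this is just a sketch, not part of bigcode-tools; the framework and layer sizes are my own assumptions), a Keras model for step 2 could initialize its embedding layer from the exported matrix and freeze or unfreeze it:

```python
import numpy as np
from tensorflow import keras

pretrained = np.load("embeddings.npy")  # exported in step 1
vocab_size, embedding_dim = pretrained.shape

model = keras.Sequential([
    # Start from the pretrained embeddings; trainable=False keeps them
    # frozen, trainable=True fine-tunes them along with the LSTM.
    keras.layers.Embedding(
        vocab_size,
        embedding_dim,
        embeddings_initializer=keras.initializers.Constant(pretrained),
        trainable=False,
    ),
    keras.layers.LSTM(128),  # hypothetical size
    # Predict the next token in the sequence for auto-completion.
    keras.layers.Dense(vocab_size, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```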

Does this make sense?

Alfusainey (Author)

Thank you @tuvistavie, yes it makes sense. I was wondering: should I use all my data (train + test + val) when training the embedding?

In step 2, should I use the JSON serialization of the AST together with the learned embeddings as input to the LSTM? i.e. code -> (ASTs, learned embeddings) -> LSTM. I am just a bit confused about moving from ASTs -> embeddings learned in 1 (since we already have the embeddings from step 1).

danhper (Owner) commented Mar 1, 2018

Sorry for the delay.

> I was wondering if I should use all my data

I suggest you use only your training data when learning the embeddings, to be sure that what you learn generalizes well enough.

> In step 2, should I use the JSON serialization of the AST together with the learned embeddings as input to the LSTM?

Yes, you should use the vocabulary and embeddings from step 1.
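To make the "ASTs -> embeddings learned in 1" step concrete: the ASTs are flattened into token sequences and mapped to vocabulary indices, and the embedding lookup then happens inside the model via those indices. A rough sketch, assuming the step-1 vocabulary is available as a token-to-index dict (all names here are hypothetical):

```python
# Hypothetical token-to-index vocabulary produced in step 1.
vocab = {"<unk>": 0, "Program": 1, "FunctionDeclaration": 2, "Identifier": 3}

def encode(ast_tokens):
    """Map AST node tokens to vocabulary indices; unknown tokens map to <unk>."""
    return [vocab.get(token, vocab["<unk>"]) for token in ast_tokens]

# These index sequences are what the LSTM's embedding layer consumes.
print(encode(["Program", "FunctionDeclaration", "Identifier", "ArrowFunction"]))
# -> [1, 2, 3, 0]
```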
