
Where is the dataset for training? #7

Closed
arijit1410 opened this issue Mar 13, 2019 · 15 comments

Comments

@arijit1410

No description provided.

@RandolphVI
Owner

Hi, @arijit1410

Sorry, I cannot share the whole dataset (for certain reasons).
But you can find the data format in data/data_sample.json.
Either convert your dataset into the same format as the sample,
or modify the data preprocessing code in data_helpers.py.

Hope this helps!

@arijit1410
Author

Thanks for helping out quickly. One more question: what purpose does the 'content.txt' file have?

@RandolphVI
Owner

@arijit1410

content.txt is used by the gensim package to build the word2vec embeddings for your corpus.
Each line in content.txt is one sentence (before segmentation).
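
A minimal sketch of that step, assuming the file lives at data/content.txt and using illustrative hyperparameters (not the repo's exact settings):

```python
# Build word2vec embeddings from content.txt with gensim.
# Path and hyperparameters are assumptions for illustration.
from gensim.models import word2vec

# LineSentence yields one whitespace-tokenized sentence per line of the file.
sentences = word2vec.LineSentence('data/content.txt')

# Note: older gensim (3.x) uses `size=` instead of `vector_size=`.
model = word2vec.Word2Vec(sentences, vector_size=100, min_count=1)
model.save('data/word2vec_100.model')
```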

@arijit1410
Author

Oh, so we don't require it if we are loading pre-trained word embeddings?

@RandolphVI
Owner

@arijit1410

Yes.

@akash418

@RandolphVI can you please give us some insight into the sample data format in data_sample.json? What exactly are "features_content", "labels_index", and "labels_num"? An early reply would be greatly appreciated.

@RandolphVI
Owner

@akash418

For instance, you have two sentences:

  1. I like apple. (The labels of this data record are: apple, fruit)
  2. I like orange. (The labels of this data record are: orange, fruit)

Now, you have to:

  1. Tokenize each sentence into its segmentation (and delete the stopwords).
  2. Index all labels: {apple: 0, orange: 1, fruit: 2}
  3. Count the number of labels for each sentence.

So the data.json would be:

{"testid": 1, "features_content": ["like", "apple"], "labels_index": [0, 2], "labels_num": 2}
{"testid": 2, "features_content": ["like", "orange"], "labels_index": [1, 2], "labels_num": 2}
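
If it helps, here is a small sketch of that conversion, assuming whitespace tokenization and a tiny stopword list (both illustrative; the repo's actual preprocessing lives in data_helpers.py):

```python
# Convert the two example sentences into the data.json line format above.
# Tokenizer, stopword list, and output file name are illustrative assumptions.
import json

stopwords = {'i'}
label_index = {'apple': 0, 'orange': 1, 'fruit': 2}

records = [
    ('I like apple', ['apple', 'fruit']),
    ('I like orange', ['orange', 'fruit']),
]

with open('data.json', 'w') as f:
    for testid, (sentence, labels) in enumerate(records, start=1):
        tokens = [w for w in sentence.lower().split() if w not in stopwords]
        f.write(json.dumps({
            'testid': testid,
            'features_content': tokens,
            'labels_index': [label_index[label] for label in labels],
            'labels_num': len(labels),
        }) + '\n')
```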

Hope this helps!

@akash418

One more thing needs clarification, @RandolphVI: can you please give information about the directory in which we should place train.json, test.json, and validation.json? I was able to train on the dataset successfully, but testing it is giving several issues.

@RandolphVI
Owner

@akash418

Like this:

  • data
    • train.json
    • test.json
    • validation.json
  • utils
  • CNN
    • train_cnn.py
    • text_cnn.py
    • test_cnn.py
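
A quick, illustrative sanity check (not part of the repo) that the splits sit under data/ relative to the repo root, matching the layout above:

```python
# Verify the JSON splits exist under data/, per the layout above.
import os

for split in ('train.json', 'test.json', 'validation.json'):
    path = os.path.join('data', split)
    print(path, 'found' if os.path.exists(path) else 'MISSING')
```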

@akash418

akash418 commented Mar 20, 2019

Thanks for the early reply @RandolphVI. I was able to train on my dataset successfully, but when I tested on the test set, the predictions file contained the same label set for each and every data point, and this was consistent across all the models; because of that I am getting the same precision, recall, and F score for all the models. @RandolphVI, what do you think could be the reason for this? I have attached the
predictions.txt
file for reference.

@RandolphVI
Owner

@akash418

Sorry for replying so late.

Did you figure it out?
First, check whether the precision, recall, and F score are changing while training the model.
If these metrics are changing on the validation data, the cause of your issue is probably located in test.py.
If these metrics are not changing on the validation data, something is probably already going wrong in the training step.
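
One hedged way to quantify the collapse from the predictions file (the one-JSON-record-per-line format and the 'predict_labels' field name are assumptions; adjust them to the actual output of test_cnn.py):

```python
# Count distinct predicted label sets in predictions.txt.
# A single dominant set suggests the model has collapsed to one output.
# File format and field name are assumptions; adapt to the real output.
import json
from collections import Counter

counts = Counter()
with open('predictions.txt') as f:
    for line in f:
        record = json.loads(line)
        counts[tuple(record['predict_labels'])] += 1

print(counts.most_common(5))
```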

@akash418

@RandolphVI To be honest, it's not changing much during the training phase. I changed some parameters, like the threshold, to see whether there was an issue in the label-prediction logic, but that seems to be working fine. This means the issue could be in the training phase itself. Changing models leads to changes in the evaluation metrics, but every model is still predicting the same set of labels for each and every data point in the test set. That seems quite strange, don't you think, @RandolphVI?

@Emmanuelgiwu

@RandolphVI, when running test_cnn.py, it prompts:
“Please input the model file you want to test, it should be like(1490175368): ”
I wonder if we need to create a model file first?

@RandolphVI
Owner

@Emmanuelgiwu
You need to train a model first (using the training code, e.g. train_cnn.py).
Then you can use the test code (e.g. test_cnn.py) to test the model you created before.
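
An illustrative run order (the timestamp-named checkpoint directory is an assumption inferred from the prompt above, not confirmed in this thread):

```python
# Illustrative end-to-end order, invoking the repo scripts from Python.
# The timestamped checkpoint directory name is an assumption.
import subprocess

subprocess.run(['python', 'CNN/train_cnn.py'], check=True)  # saves a model under a timestamped dir
subprocess.run(['python', 'CNN/test_cnn.py'], check=True)   # enter that timestamp when prompted
```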

@tamaghnadutta

tamaghnadutta commented Aug 8, 2019

@RandolphVI I seem to be facing the same issue as @akash418. The model predicts the same set of labels for every test data point, every time. Even during eval cycles, all the metrics except ROC-AUC are quite bad. This leads me to think that the issue is in the training cycle itself, as @akash418 mentioned. Have you faced this issue in your runs?

@akash418 Were you able to work around this?
