
Where is the dataset for training? #7

Closed
arijit1410 opened this issue Mar 13, 2019 · 15 comments

Comments

@arijit1410

No description provided.

@RandolphVI
Owner

Hi, @arijit1410

Sorry, I cannot share the whole dataset (for certain reasons).
But you can find the data format in data/data_sample.json.
Either convert your dataset into the same format as the sample,
or modify the data preprocessing code in data_helpers.py.

Hope this helps!

@arijit1410
Author

Thanks for helping out quickly. One more question: what purpose does the 'content.txt' file have?

@RandolphVI
Owner

@arijit1410

content.txt is used by the gensim package to build the word2vec embeddings for your corpus.
Each line in content.txt is one sentence (before segmentation).
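
A minimal sketch of that step, assuming the file lives at data/content.txt and using illustrative hyperparameters (not the repo's exact settings):

```python
# Build word2vec embeddings from content.txt with gensim.
# Path and hyperparameters are assumptions for illustration.
from gensim.models import word2vec

# LineSentence yields one whitespace-tokenized sentence per line of the file.
sentences = word2vec.LineSentence('data/content.txt')

# Note: older gensim (3.x) uses `size=` instead of `vector_size=`.
model = word2vec.Word2Vec(sentences, vector_size=100, min_count=1)
model.save('data/word2vec_100.model')
```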

@arijit1410
Author

Oh, so we don't require it if we are loading pre-trained word embeddings?

@RandolphVI
Owner

@arijit1410

Yes.

@akash418

@RandolphVI can you please give us some insight into the sample data format in data_sample.json? What exactly are "features_content", "labels_index", and "labels_num"? An early reply would be greatly appreciated.

@RandolphVI
Owner

@akash418

For instance, you have two sentences:

  1. I like apple. (The labels of this data record are: apple, fruit)
  2. I like orange. (The labels of this data record are: orange, fruit)

Now, you have to:

  1. Tokenize each sentence into its segmentation (and delete the stopwords).
  2. Index all labels: {apple: 0, orange: 1, fruit: 2}
  3. Count the number of labels for each sentence.

So the data.json would be:

{"testid": 1, "features_content": ["like", "apple"], "labels_index": [0, 2], "labels_num": 2}
{"testid": 2, "features_content": ["like", "orange"], "labels_index": [1, 2], "labels_num": 2}
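
If it helps, here is a small sketch of that conversion, assuming whitespace tokenization and a tiny stopword list (both illustrative; the repo's actual preprocessing lives in data_helpers.py):

```python
# Convert the two example sentences into the data.json line format above.
# Tokenizer, stopword list, and output file name are illustrative assumptions.
import json

stopwords = {'i'}
label_index = {'apple': 0, 'orange': 1, 'fruit': 2}

records = [
    ('I like apple', ['apple', 'fruit']),
    ('I like orange', ['orange', 'fruit']),
]

with open('data.json', 'w') as f:
    for testid, (sentence, labels) in enumerate(records, start=1):
        tokens = [w for w in sentence.lower().split() if w not in stopwords]
        f.write(json.dumps({
            'testid': testid,
            'features_content': tokens,
            'labels_index': [label_index[label] for label in labels],
            'labels_num': len(labels),
        }) + '\n')
```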

Hope this helps!

@akash418

One more thing needs clarification, @RandolphVI: can you please give information about the directory in which we should place train.json, test.json, and validation.json? I was able to train on the dataset successfully, but testing it is giving several issues.

@RandolphVI
Owner

@akash418

Like this:

  • data
    • train.json
    • test.json
    • validation.json
  • utils
  • CNN
    • train_cnn.py
    • text_cnn.py
    • test_cnn.py
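
A quick, illustrative sanity check (not part of the repo) that the splits sit under data/ relative to the repo root, matching the layout above:

```python
# Verify the JSON splits exist under data/, per the layout above.
import os

for split in ('train.json', 'test.json', 'validation.json'):
    path = os.path.join('data', split)
    print(path, 'found' if os.path.exists(path) else 'MISSING')
```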

@akash418

akash418 commented Mar 20, 2019

Thanks for the early reply @RandolphVI. I was able to train on my dataset successfully, but when I tested on the test set, the predictions file contained the same label set for each and every data point, and this was consistent across all the models; because of that I am getting the same precision, recall, and F score for all the models. @RandolphVI, what do you think could be the reason for this? I have attached the
predictions.txt
file for reference.

@RandolphVI
Owner

@akash418

Sorry for replying so late.

Did you figure it out?
First, check whether the precision, recall, and F score are changing while training the model.
If these metrics are changing on the validation data, the cause of your issue is probably located in test.py.
If these metrics are not changing on the validation data, something is probably already going wrong in the training step.
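
One hedged way to quantify the collapse from the predictions file (the one-JSON-record-per-line format and the 'predict_labels' field name are assumptions; adjust them to the actual output of test_cnn.py):

```python
# Count distinct predicted label sets in predictions.txt.
# A single dominant set suggests the model has collapsed to one output.
# File format and field name are assumptions; adapt to the real output.
import json
from collections import Counter

counts = Counter()
with open('predictions.txt') as f:
    for line in f:
        record = json.loads(line)
        counts[tuple(record['predict_labels'])] += 1

print(counts.most_common(5))
```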

@akash418

@RandolphVI To be honest, it's not changing much during the training phase. I changed some parameters, like the threshold, to see whether there was an issue in the label-prediction logic, but that seems to be working fine. This means the issue could be in the training phase itself. Changing models leads to changes in the evaluation metrics, but every model is still predicting the same set of labels for each and every data point in the test set. That seems quite strange, don't you think, @RandolphVI?

@Emmanuelgiwu

@RandolphVI, when running test_cnn.py, it prompts:
“Please input the model file you want to test, it should be like(1490175368): ”
I wonder if we need to create a model file first?

@RandolphVI
Owner

@Emmanuelgiwu
You need to train a model first (using the training code, e.g. train_cnn.py).
Then you can use the test code (e.g. test_cnn.py) to test the model you created before.
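
An illustrative run order (the timestamp-named checkpoint directory is an assumption inferred from the prompt above, not confirmed in this thread):

```python
# Illustrative end-to-end order, invoking the repo scripts from Python.
# The timestamped checkpoint directory name is an assumption.
import subprocess

subprocess.run(['python', 'CNN/train_cnn.py'], check=True)  # saves a model under a timestamped dir
subprocess.run(['python', 'CNN/test_cnn.py'], check=True)   # enter that timestamp when prompted
```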

@tamaghnadutta

tamaghnadutta commented Aug 8, 2019

@RandolphVI I seem to be facing the same issue as @akash418. The model predicts the same set of labels for every test data point, every time. Even during eval cycles, all the metrics except ROC-AUC are quite bad. This leads me to think that the issue is in the training cycle itself, as @akash418 mentioned. Have you faced this issue in your runs?

@akash418 Were you able to work around this?
