feat: Imdb sentiment dataset reader #962
Conversation
You should definitely try conversational English BERT to check whether that classifier will be better or not.
(force-pushed from ce850fd to 7448af3)
@dilyararimovna Conversational BERT seems to be slightly better: ~93.5% accuracy vs ~92.5% accuracy for the original cased English BERT with the same hyperparameters. However, I trained them for only about 3 epochs, so maybe I'll try training them for longer.
It's time to get rid of the necessity to wrap a single class into a list of classes.
Thank you!
My opinion here is that we should focus on the sentiment reader in this ticket. Regarding the labels problem, I would suggest to make …
@grwlf Then you should revert …
@grwlf Are you planning to release pre-trained classification models?
Do you mean models for this sentiment task? Not sure about that. I think we could either release the model or provide instructions on how to fine-tune an existing BERT. What do you think would be better?
Most of our configs come with pretrained models, so it would be preferable in this case as well.
OK, we will train the model.
So, how do we upload the weights to …
@sgrechanik-h, …
Could you please check this Google Drive folder? It should contain the sentiment_imdb_conv_bert_v0.tar.gz and sentiment_imdb_bert_v0.tar.gz checkpoints. https://drive.google.com/drive/folders/1fXAR02g-drJKOHhSGND6TLNkhzErz9PE?usp=sharing
data_path = Path(data_path)

if url is None:
    url = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"
Hidden logic: why should the file be available at this link? Please remove it.
Please suggest how to fix.
Now the path must be specified directly in the config. Add "url" to "dataset_reader": {..}
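For illustration, a rough sketch of that suggestion. DeepPavlov configs are JSON files; a Python dict of the same shape is shown here, and the registered reader name and data_path value are assumptions, not the PR's actual names:

config = {
    "dataset_reader": {
        "class_name": "imdb_reader",  # assumed registered name
        "data_path": "~/.deeppavlov/downloads/aclImdb",  # assumed location
        "url": "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz",
    },
    # ... the rest of the pipeline config ...
}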
I'm not convinced: many dataset readers in DeepPavlov have their URLs hardcoded, and forcing the user to provide the URL in the config file will harm usability. (However, almost all of these hardcoded URLs lead to files.deeppavlov.ai/*, so the IMDb dataset should probably be uploaded to files.deeppavlov.ai as well.)
@yoptar What do you think about this? Maybe I was wrong.
@@ -192,6 +192,11 @@ def download_decompress(url: str, download_path: [Path, str], extract_paths=None
    extracted = extracted_path.exists()
    if not extracted and not arch_file_path.exists():
        simple_download(url, arch_file_path)
    else:
It seems to me that the following logic is implemented here (maybe I'm wrong):

if extracted_path.exists():
    extracted = True
    log.info(f'Found cached and extracted {url} in {extracted_path}')
elif arch_file_path.exists():
    log.info(f'Found cached {url} in {arch_file_path}')
else:
    simple_download(url, arch_file_path)
Yes, this is much more readable, thanks.
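For context, a sketch of how the suggested checks could sit alongside the extracted flag from the original diff (the surrounding function body is assumed; only the log messages and variable names come from the discussion above):

if extracted_path.exists():
    extracted = True
    log.info(f'Found cached and extracted {url} in {extracted_path}')
else:
    extracted = False
    if arch_file_path.exists():
        log.info(f'Found cached {url} in {arch_file_path}')
    else:
        simple_download(url, arch_file_path)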
download_decompress(url, data_path)
mark_done(data_path)

alternative_data_path = data_path / "aclImdb"
OK, so you check data_path, download the file, and then try "alternative_data_path"? And if it exists, you don't use the original data_path anywhere, or am I misunderstanding?
Maybe it's a naming problem and you should use "extracted_data_path" for this variable.
The problem I'm trying to solve here is that the archive contains the aclImdb folder where all the data files are located; however, data_path itself probably ends with aclImdb, so the data files are probably somewhere in ~/.deeppavlov/downloads/aclImdb/aclImdb/. This may create some confusion, and if data_path is provided manually, the user may set it to ~/.deeppavlov/downloads/aclImdb/aclImdb/ instead of the correct ~/.deeppavlov/downloads/aclImdb. This logic forgives that mistake.
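A minimal sketch of the forgiving lookup described above (the helper name is hypothetical; only the alternative_data_path check comes from the diff):

from pathlib import Path

def resolve_imdb_root(data_path: Path) -> Path:
    """Hypothetical helper: tolerate data_path pointing either at the
    download dir or at the extracted aclImdb folder itself."""
    alternative_data_path = data_path / "aclImdb"
    if alternative_data_path.exists():
        # normal case: the archive unpacked into data_path/aclImdb/
        return alternative_data_path
    # forgiving case: the user already pointed data_path at .../aclImdb
    return data_path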
OK, it's not an explicit solution, but it works.
    for Computational Linguistics (ACL 2011).
    """

    @overrides
unnecessary decorator
Don't we need it to, I don't know, protect us from misspelling the word read, or from future changes of the function's name in the base class?
IMHO, it's a redundant solution. We all write code and monitor its consistency; using a little-known third-party library and operations with metaclasses can shoot us in the foot in the future.
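For context, a minimal sketch of what the overrides package buys (the class names here are illustrative stand-ins, not DeepPavlov's actual classes):

from overrides import overrides  # the third-party package discussed above

class DatasetReader:
    def read(self, data_path: str) -> dict:
        raise NotImplementedError

class ImdbReader(DatasetReader):
    @overrides
    def read(self, data_path: str) -> dict:  # fine: DatasetReader defines read()
        return {}

# A misspelled name would fail as soon as the class is defined:
#
# class TypoReader(DatasetReader):
#     @overrides
#     def raed(self, data_path: str) -> dict:  # error: no raed() in any base class
#         return {}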
* feat: Imdb sentiment dataset reader
* fix: download_decompress: report found cached files
* Use accuracy instead of set_accuracy
* fix: Proba2Labels: return single label for max_proba=True
* Config for conversational BERT for imdb
* Imdb: produce singleton lists of labels instead of labels
* Don't convert data_path to Path twice (Co-Authored-By: Aleksei Lymar <[email protected]>)
* imdb: always use utf-8 (Co-Authored-By: Aleksei Lymar <[email protected]>)

Co-authored-by: Aleksei Lymar <[email protected]>
This PR implements a dataset reader for the IMDb sentiment classification dataset. It also includes a JSON configuration for BERT (English, cased), which is mostly the same as the configuration for rusentiment except for the max sequence length and batch size (which I set to values that avoid out-of-memory errors on my hardware).
This PR also includes a fix for the sets_accuracy metric, which should now work correctly for string labels (i.e. wrap them into sets instead of converting them to sets). I also added reporting of cached files in download_decompress.
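For illustration, the pitfall with string labels (the metric's internals are not shown; this is just the core set-wrapping difference):

label = "neg"
set(label)  # {'n', 'e', 'g'} -- iterating the string splits it into characters
{label}     # {'neg'}         -- a singleton set containing the whole label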