Multilabel Classification Dataset Loading #6

Closed
angrymeir opened this issue May 10, 2020 · 4 comments
Labels
enhancement New feature or request

Comments

@angrymeir
Collaborator

angrymeir commented May 10, 2020

Hey @sergioburdisso,

For multilabel classification, the file structure described in the topic categorization tutorial is not efficient, since a text with multiple labels has to be stored in multiple files.
My current approach is to write the text to one file linewise and the respective labels to another file, also linewise.

# Writing Data
dataset = {"Text 1": ["label1", "label2"],
           "Text 2": ["label2", "label3"],
           "Text 3": ["label1"]}

# Open both files once (append mode) and write one line per text
with open('text.txt', 'a') as text_file, open('labels.txt', 'a') as label_file:
  for text, labels in dataset.items():
    text_file.write(text + '\n')
    label_file.write(';'.join(labels) + '\n')

The result is the following:

# cat text.txt
Text 1
Text 2
Text 3

# cat labels.txt
label1;label2
label2;label3
label1

It would be great if util.Dataset.load_from_files could be adjusted to also support this!
But I'm also open to other suggestions on how to tackle that problem :)
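
For reference, reading the two files back could look roughly like the sketch below. It is plain Python, not PySS3 code: `read_parallel_files` and its `sep_label` parameter are hypothetical names, and the sketch assumes one document per line and that an empty label line means "no labels".

```python
from pathlib import Path


def read_parallel_files(text_path, labels_path, sep_label=";"):
    """Hypothetical helper: pair line i of the text file with line i of the labels file."""
    texts = Path(text_path).read_text(encoding="utf-8").splitlines()
    label_lines = Path(labels_path).read_text(encoding="utf-8").splitlines()
    if len(texts) != len(label_lines):
        raise ValueError("text and label files must have the same number of lines")
    # Assumption: an empty label line means the document has no labels.
    labels = [line.split(sep_label) if line else [] for line in label_lines]
    return texts, labels
```

Called as `read_parallel_files("text.txt", "labels.txt")`, this returns the list of texts and a parallel list of label lists.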

Thanks for your hard work!

@sergioburdisso
Owner

sergioburdisso commented May 11, 2020

Hi @angrymeir!
First of all, thanks for your interest in this project. Yeah, I agree that the file structure in the topic categorization tutorial is not well suited to multilabel classification; it follows the classic single-label dataset structure.
I don't have much previous experience working with multilabel classification. That's one of the main reasons I didn't implement full support for it in the first place. Fortunately, now is the time to implement it.

What do you think about having two separate classes for loading datasets from disk? One for "standard" single-label datasets (Dataset) and another for multilabel ones (MultiLabelDataset). For instance, for loading a dataset, we could use MultiLabelDataset.load_from_files.

Do you think we should provide support for another format/structure too?

For instance, having a file holding document name and category label pairs, like so:

doc1 label1
doc1 label2
doc2 label2
doc2 label3
doc3 label1
...

And a folder containing the actual documents. In this case, we should let the user specify the file holding these pairs (and the separator/delimiter used: tab? comma? etc.) as well as the path to the folder containing the actual documents.
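
A minimal sketch of how such a pair file could be parsed, in plain Python. `read_label_pairs` is a hypothetical name, and the sketch assumes each document name in the pair file is also the name of a file inside the documents folder; `sep=None` splits on any whitespace.

```python
from collections import defaultdict
from pathlib import Path


def read_label_pairs(pairs_path, docs_dir, sep=None):
    """Hypothetical helper: group "doc label" pairs by document and load each document."""
    doc_labels = defaultdict(list)
    for line in Path(pairs_path).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Split only once, so the label itself may contain the separator.
        doc_name, label = line.split(sep, 1)
        doc_labels[doc_name].append(label.strip())
    # Assumption: each document name is a file name inside docs_dir.
    texts = [(Path(docs_dir) / name).read_text(encoding="utf-8") for name in doc_labels]
    return texts, list(doc_labels.values())
```

Documents come back in the order in which they first appear in the pair file.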

The same should apply to your approach. The user should be able to provide the separator for the labels in labels.txt file, which in your case is a semicolon (;).

What do you think the load_from_files arguments should be? What do you think about this approach:

x_train, y_train = MultiLabelDataset.load_from_files(docs="a file or folder", labels="a file", sep=";")

If docs is a folder, the label file should have the format I described above; if it is a file, it should have your structure.
By default, the sep argument would be "\s" (whitespace) if docs is a folder and ";" if it is a file (or should it be a comma, like in a CSV?).
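
The default-separator rule could be sketched like this. `resolve_label_sep` is a hypothetical helper, not part of PySS3, and `None` stands in for "any whitespace", which is how Python's `str.split` treats it.

```python
import os


def resolve_label_sep(docs, sep=None):
    """Hypothetical helper: pick the default label separator from the kind of `docs` path."""
    if sep is not None:
        return sep  # an explicit separator always wins
    # Folder of documents: "doc label" pairs split on whitespace (None in str.split).
    # Single file of documents: labels on each line joined with ";".
    return None if os.path.isdir(docs) else ";"
```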

Can you recommend any particular dataset to work with while implementing full multilabel support? This dataset will also be the one used for the tutorial introducing multilabel support, similar to the ones that are already available. I'm currently using a Kaggle dataset for toxic comment classification.

@sergioburdisso sergioburdisso added the enhancement New feature or request label May 11, 2020
@sergioburdisso
Owner

I just realized we would need two sep arguments, one for the separator used between labels and one for the separator used between documents. Otherwise, a document containing newlines would be split into several documents, so it is better to let the user specify which separator/delimiter marks where each document begins and ends (although it could be '\n' by default). Something like:

x_train, y_train = MultiLabelDataset.load_from_files(
    docs="the file or folder where the documents are",
    labels="the file containing the labels",
    sep_label="the separator used for labels e.g. ;",
    sep_doc="the separator used for documents e.g. \n"
)
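
To illustrate why a document separator matters, here is a small sketch in plain Python. `split_documents` is a hypothetical helper; only the `sep_doc` name is taken from the proposal above.

```python
def split_documents(raw_text, sep_doc="\n"):
    """Hypothetical helper: split the content of a documents file into documents."""
    docs = raw_text.split(sep_doc)
    # A trailing separator leaves an empty last chunk; drop it.
    if docs and docs[-1] == "":
        docs.pop()
    return docs


raw = "line one\nline two\n---\nsecond doc"
# The default '\n' breaks the first document apart.
print(split_documents(raw))
# A custom separator keeps multi-line documents whole.
print(split_documents(raw, sep_doc="\n---\n"))
```

With the default separator, the two-line document ends up as separate entries (and the `---` marker becomes an entry of its own); with `"\n---\n"` it stays in one piece.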

What do you think about that?

@angrymeir
Collaborator Author

Hey @sergioburdisso,

MultiLabelDataset.load_from_files vs Dataset.load_multilabel_from_file
I think, for consistency, the decision whether to use a different class (MultiLabelDataset) or an additional method (e.g. load_multilabel_from_file) in the Dataset class depends on how multilabel data should be treated in this project in general.
Would you also create a different class for multilabel evaluation or rather add the functionality to the existing class?

Format/Structure
Assuming, that catA corresponds to a combination of labels like:

toxic=-1, severe_toxic=0, obscene=-1, threat=1, insult=-1, identity_hate=1

This would imply that there were 3^6 possible categories (in the toxic comment dataset) which seems just not feasible to annotate...
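
To make that count concrete, here is a quick check, assuming each of the six labels can take one of the three values -1, 0 or 1:

```python
from itertools import product

# The six labels of the toxic comment dataset.
LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]
VALUES = (-1, 0, 1)  # assumption: three possible values per label

# Every combination of values is one distinct "category".
combinations = list(product(VALUES, repeat=len(LABELS)))
print(len(combinations))  # 3 ** 6 = 729
```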
Would a combination of both approaches make sense?
Meaning having one file either containing the text or the link to the documents and another file that contains the labels as described in my initial suggestion?

Giving the user the option to specify both delimiters makes absolute sense! I also agree with the default parameters.

Dataset
We're currently working with a parsed version of SemEval 2016 Task 5. I can provide you with the dataset if you would like.
The challenge with this dataset is that the number of labels for a given text ranges from 0 to 8.

@sergioburdisso
Owner

sergioburdisso commented May 12, 2020

😊 Following your suggestion, I've added a method called "load_from_files_multilabel" to carry out this task, supporting both dataset structures/formats. I've decided to put "multilabel" at the end so that, as with classify and classify_multilabel, any method XXX related to multilabel has "_multilabel" as a suffix. This way it will be easier for users to remember (and more consistent).

By catA I meant the label for category A, I'll edit my message to clarify this point (and to match my example with yours).

Now, following your example, you should be able to load your dataset simply with:

x_data, y_data = Dataset.load_from_files_multilabel(
    "path/to/text.txt",
    "path/to/labels.txt"
)

In case you need a different separator for labels, for instance, using commas, you could use the sep_label argument as follows:

x_data, y_data = Dataset.load_from_files_multilabel(
    "path/to/text.txt", "path/to/labels.txt",
    sep_label=","
)

And, finally, in case you need to use a document separator other than '\n', for instance, "\n---\n" you can use the sep_doc argument as follows:

x_data, y_data = Dataset.load_from_files_multilabel(
    "path/to/text.txt", "path/to/labels.txt",
    sep_doc="\n---\n"
)

More details are given in the API documentation. 👍

Dataset
SemEval 2016 Task 5 sounds cool, feel free to send me the dataset, probably it'll be much better for a tutorial and a Live Demo than the one that I'm using now (toxic comments 💩).

sergioburdisso added a commit that referenced this issue May 12, 2020
A new method (``load_from_files_multilabel``) was added to the
``Dataset`` class to load multilabel datasets from disk. More details
about this new method can be found in the API documentation
(https://pyss3.rtfd.io/en/latest/api/index.html#pyss3.util.Dataset.load_from_files_multilabel).

Resolves: #6
sergioburdisso added a commit that referenced this issue May 20, 2020
The dataset is a subset of the CMU Movie Summary Corpus
(http://www.cs.cmu.edu/~ark/personas/) with 32985 summaries and only 10
movie genres. The dataset is structured according to #6, i.e., there are
two files, one for the labels and another for the movie plot summaries.
sergioburdisso added a commit that referenced this issue May 24, 2020
PySS3 now fully supports multi-label classification! :)

- The ``load_from_files_multilabel()`` function was added to the
  ``Dataset`` class (7ece7ce, resolved #6)

- The ``Evaluation`` class now supports multi-label classification (#5)
  - Add multi-label support to ``train()/fit()`` (4d00476)
  - Add multi-label support to ``Evaluation.test()`` (0a897dd)
  - Add multi-label support to ``show_best and get_best()`` (ef2419b)
  - Add multi-label support to ``kfold_cross_validation()`` (aacd3a0)
  - Add multi-label support to ``grid_search()`` (925156d, 79f1e9d)
  - Add multi-label support to the 3D Evaluation Plot (42bbc65)

- The Live Test tool now supports multi-label classification as well
  (15657ee, b617bb7, resolved #9)

- Category names are no longer case-insensitive (4ec009a, resolved #8)