Skip to content

GoloMarcos/OCTCMG

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

42 Commits
 
 
 
 
 
 
 
 

Repository files navigation

OCTCMG

One-Class Text Collections From Marcos Gôlo

Citing:

If you use any part of this code or the datasets in your research, please cite it using the following BibTex entry

@article{ref:Golo2021,
  title={Learning to sense from events via semantic variational autoencoder},
  author={Gôlo, Marcos Paulo Silva and Rossi, Rafael Geraldeli and Marcacini, Ricardo Marcondes},
  journal={Plos one},
  volume={16},
  number={12},
  pages={e0260701},
  year={2021},
  publisher={Public Library of Science San Francisco, CA USA}
}

Informations

We collect these one-class text collections. We built these one-class text collections by the news titles from the project Global Database of Events, Language, and Tone (GDELT). We chose the topics related to each textual collection based on the International Press Telecommunications Council (IPTC) taxonomy. You can find the GDELT project at https://www.gdeltproject.org/ and all IPTC topics at https://www.iptc.org/std/NewsCodes/treeview/mediatopic/mediatopic-en-GB.html. The collection of the news was made by checking if the news contained the topic in its title. We did not consider topics that returned less than 4000 news items. Furthermore, We define the limit of 6000 news items for all text collections. Finally, we obtained 183 one-class text collections.

How to Use

!pip install git+https://github.com/GoloMarcos/OCTCMG/

from OneClassTextCollectionsLibrary import datasets

datasets_dictionary = datasets.load()

Parameters from load()

  • path_files: path of collections
  • only_download: a boolean parameter to download the collections but not load
  • dataset: a parameter to load only one collection, informing your name

The Dataframe of each text collection return (datasets_dictionary[base name]):

  • News Id
  • News Date
  • News Title
  • News Embeddings from BERT
  • News Embeddings from DistilBERT
  • News Embeddings from DistilBERT Multilingual
  • News Embeddings from RoBERTa
  • Class (interest or outlier)
    • the outliers are texts from the 182 other collections
    • the length of outliers are similar to 10% of the length of the texts of the class of interest

We obtain the embeddings with the library sentence_tranformers

  • BERT model: bert-large-nli-stsb-mean-tokens
    • Devlin, Jacob, et al. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 2019.
  • DistilBERT model: distilbert-base-nli-stsb-mean-tokens
    • Sanh, Victor, et al. "DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter." arXiv preprint arXiv:1910.01108 (2019).
  • RoBERTa model: roberta-large-nli-stsb-mean-tokens
    • Liu, Yinhan, et al. "Roberta: A robustly optimized bert pretraining approach." arXiv preprint arXiv:1907.11692 (2019).
  • DistilBERT Multilingual model: distiluse-base-multilingual-cased
    • Sanh, Victor, et al. "DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter." arXiv preprint arXiv:1910.01108 (2019).

One-Class Text Collections Topics

  • Restaurant | Opera | Employee | Logistics | Imports | Flood | Tariff | Court | Children | Series | Metal | Ethics
  • War | Investments | School | Revolution | Society | People | Justice | Therapy | Marathon | Adults | Theft
  • University | Gender | Assault | Traffic | Retail | Medicare | Stocks | Massacre | Water | Rivers | Lifestyle
  • Welfare | Media | Impeachment | Illness | Fishing | Wedding | Fraud | Teachers | Pope | Fashion | Fire | Terrorism
  • Environment | Currency | Television | Social-media | Boxing | Architecture | Birthday | Theatre | Fiction | Politics
  • Drought | Arrest | Charity | Mosque | Police | Disaster | Sailing | Tobacco | Transfer | Education | Cancer | Students
  • Diet | Pension |Cinema | Advertising | Pandemic | Retirement | Radio | Judge | Newspaper | Medicine | Bonds
  • Animal | Coal | Surveillance | Ceremony | Management | Loans | Suicide | Democracy | Golf | Medicaid | Drama
  • Arson | Trend | Immigration | Labour | Automotive | Festival | Prison | Investigation | Exports | Tourism
  • Petrol | Language | Party | Prices | Parliament | Bullying | Poverty | Banking | Weather | Dance | Election | Holiday
  • Crime | Vaccines | Health | Recession | Employment | Music | College | Wildfire | Celebrity | Beverage | Mortgage
  • Abortion | Halloween | Bribery | Securities | Software | LGBTQ | Government | Economy | Shareholder | Witness
  • Mining | Transport | Discrimination | Divorce | Toy | Agriculture | Defence | Hospital | Epidemic | Farms | Earnings
  • Christmas | Family | Culture | Racism | Surgery | Privacy | Adoption | Kidnapping | Marriage | Grocery
  • Rugby | Unemployment | Parks | Voting | Bar | Constitution | Unions | Design | Church | Law | Musical
  • Insurance | Anniversary | Sport | Game | Homicide | Bankruptcy | Inflation | Easter | Nature | Cafe | Hunting
  • Volleyball | Hurricane | Regulations | Plant | Earthquake | Painting | Consumers

About

Library to use One-class text collections

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages