Protoformer: Embedding Prototypes for Transformers

Published in: Advances in Knowledge Discovery and Data Mining: 26th Pacific-Asia Conference, PAKDD 2022

Overview

Transformers have been widely applied in text classification, but real-world data often contain anomalies and noisy labels that challenge their performance. Protoformer is a novel self-learning framework for Transformers that leverages problematic samples to improve text classification. The framework features a selection mechanism for embedding samples, allowing efficient extraction and utilization of anomaly prototypes and difficult class prototypes.

Key Features

Anomaly Detection: Automatically detects and adjusts noisy labels to make the model more robust to complex datasets.
Prototype Selection: Utilizes a selection mechanism for embedding samples, focusing on both anomaly and difficult class prototypes.
Improved Performance: Demonstrates enhanced performance on datasets with diverse textual structures (e.g., Twitter, IMDB, ArXiv).
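
The prototype selection idea above can be illustrated with a minimal sketch. This is not the selection mechanism from the paper or from src. It assumes a simple centroid-based heuristic: samples with a small margin between their own class centroid and the nearest other centroid are treated as difficult class prototypes, and samples far from every centroid are treated as anomaly prototypes. The function names, the margin rule, and the toy 2-D points standing in for Transformer embeddings are all hypothetical.

```python
from collections import defaultdict
from math import dist


def class_centroids(embeddings, labels):
    """Mean embedding per class label."""
    groups = defaultdict(list)
    for vec, label in zip(embeddings, labels):
        groups[label].append(vec)
    return {
        label: [sum(col) / len(vecs) for col in zip(*vecs)]
        for label, vecs in groups.items()
    }


def select_prototypes(embeddings, labels, k=1):
    """Return (difficult, anomaly) lists of sample indices, k each.

    Hypothetical heuristic; requires at least two distinct classes.
    """
    centroids = class_centroids(embeddings, labels)
    margins, nearest = [], []
    for vec, label in zip(embeddings, labels):
        own = dist(vec, centroids[label])
        other = min(dist(vec, c) for l, c in centroids.items() if l != label)
        margins.append(other - own)      # small margin: close to a class boundary
        nearest.append(min(own, other))  # large value: far from every centroid
    order = range(len(embeddings))
    difficult = sorted(order, key=lambda i: margins[i])[:k]
    anomaly = sorted(order, key=lambda i: -nearest[i])[:k]
    return difficult, anomaly


# Toy 2-D "embeddings": two clusters, one borderline point, one outlier.
points = [(0, 0), (0, 2), (2, 0), (2, 2), (4, 1),
          (10, 0), (10, 2), (8, 0), (8, 2), (9, 11)]
labels = [0] * 5 + [1] * 5
difficult, anomaly = select_prototypes(points, labels)
print(difficult, anomaly)  # [4] [9]
```

In this toy data, the point (4, 1) sits closest to the boundary between the two clusters and (9, 11) lies far from both centroids, so they are picked as the difficult and anomaly candidates respectively.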

Contributions

Novel Framework: Leverages harder-to-classify and anomaly samples, providing a solution for classifying complex datasets from the Internet.
Label Adjustment Procedure: Robust to noise, making the framework suitable for noisy Internet data and promoting a more robust Transformer model.
Empirical Evaluation: Evaluated on multiple datasets with both clean and noisy labels, showing significant performance improvements.
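
As a rough illustration of what a label adjustment step could look like, the sketch below relabels a sample only when the prototype of a different class matches its embedding with high cosine similarity. It is an assumption-laden example, not the procedure from the paper: the adjust_labels function, the one-prototype-per-class dictionary, and the 0.9 threshold are all hypothetical choices.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def adjust_labels(embeddings, labels, prototypes, threshold=0.9):
    """Return a new label list.

    prototypes maps each class label to one prototype vector (hypothetical
    layout). A label is replaced only when the best-matching prototype
    belongs to another class and its similarity reaches the threshold.
    """
    adjusted = []
    for vec, label in zip(embeddings, labels):
        best_label, best_sim = max(
            ((l, cosine(vec, p)) for l, p in prototypes.items()),
            key=lambda pair: pair[1],
        )
        if best_label != label and best_sim >= threshold:
            adjusted.append(best_label)  # confident disagreement: relabel
        else:
            adjusted.append(label)       # otherwise keep the given label
    return adjusted


prototypes = {"pos": (1.0, 0.0), "neg": (0.0, 1.0)}
samples = [(0.9, 0.1), (0.1, 0.9), (0.6, 0.5)]
noisy = ["neg", "neg", "neg"]
print(adjust_labels(samples, noisy, prototypes))  # ['pos', 'neg', 'neg']
```

The threshold keeps ambiguous samples such as (0.6, 0.5) untouched, so only confident disagreements change a label.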

Getting Started

All source code used to generate the results and figures in the paper is in the src folder. The data used in this study is provided in the data folder, and the result figures are in the visualization folder. See the README.md file in each directory for a full description.

Datasets

Protoformer has been tested on the following datasets:

Twitter Data: Historical tweets for text classification tasks.
IMDB Reviews: Movie reviews dataset for sentiment analysis.
ArXiv Papers: Abstracts and titles of scientific papers for classification tasks.

Installation

• You can download a copy of all the files in this repository by cloning the repo:

git clone https://github.com/ashfarhangi/Protoformer.git

• Install the required packages:

pip install -r requirements.txt

• Run model.py

• (Optional) Enter your Twitter API key. To use the Twitter data properly, you need your own official Twitter API credentials. Replace API_KEY with your own key, as shown below:

const API_KEY = 'ENTER YOUR API';
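
If you would rather not hard-code the key, one option is to read it from an environment variable in Python. The sketch below is only a suggestion and is not part of the repository: the load_twitter_api_key helper and the TWITTER_API_KEY variable name are assumptions.

```python
import os


def load_twitter_api_key(env=os.environ, name="TWITTER_API_KEY"):
    """Return the API key stored under `name`, or raise if it is missing.

    Both the helper and the variable name are hypothetical.
    """
    key = env.get(name, "").strip()
    # Reject an unset variable and the untouched placeholder value.
    if not key or key == "ENTER YOUR API":
        raise RuntimeError(f"Set the {name} environment variable to your own key.")
    return key


# Usage (after `export TWITTER_API_KEY=...` in your shell):
# API_KEY = load_twitter_api_key()
```

Keeping the key out of the source files also avoids committing it to the repository by accident.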

Prerequisites

Python
git
pip

License

All source code is made available under a GPL-3.0 license. You can freely use and modify the code, without warranty, as long as you provide attribution to the authors (e.g., citation). See LICENSE.md for the full license text.

Citation

@inproceedings{farhangiprotformer,
  title={Protoformer: Embedding Prototypes for Transformers},
  author={Farhangi, Ashkan and Sui, Ning and Hua, Nan and Bai, Haiyan and Huang, Arthur and Guo, Zhishan},
  booktitle={Advances in Knowledge Discovery and Data Mining: 26th Pacific-Asia Conference, PAKDD 2022},
  year={2022},
  organization={PAKDD}
}
