This project is an optional challenge from the mandatory course "Machine Learning for Natural Language Understanding" of my NLP master's degree program at Trier University in the winter semester 2022/23.
The task was to train a clickbait filter that classifies articles as clickbait based on their headlines. I was free to decide how to prepare the data and which ML model to use for classification.
The challenge was considered passed if the model performed better than the professor's baseline (a simple classifier; F1 ≈ 0.89).
The data consists of two files: a text file with clickbait headlines and one with headlines from regular news sources. The held-out dataset is organized the same way.
I am not allowed to publish the training and validation datasets, since they are the property of the Computerlinguistik und Digital Humanities Department of the University of Trier.
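Since the datasets themselves cannot be published, here is a minimal sketch of how two such one-headline-per-line files can be turned into a labeled dataset. The file names are placeholders, not the actual file names used in the project:

```python
from pathlib import Path

def load_headlines(clickbait_path: str, news_path: str):
    """Read one headline per line from each file and attach binary labels:
    1 = clickbait, 0 = regular news."""
    clickbait = Path(clickbait_path).read_text(encoding="utf-8").splitlines()
    news = Path(news_path).read_text(encoding="utf-8").splitlines()
    texts = clickbait + news
    labels = [1] * len(clickbait) + [0] * len(news)
    return texts, labels
```

The texts and labels can then be tokenized and wrapped in a PyTorch `Dataset` for training.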
I implemented an LSTM model (Raschka, 2022, p. 499) with dropout layers using the PyTorch library (`./utils/models.py`). It achieved a solid result on the validation set: F1-score = 96.2% (`./notebooks/validation_and_examples.ipynb`), which could, however, easily be surpassed by a Transformer architecture.
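The general shape of such a model can be sketched as follows. This is an illustrative architecture, not the exact one in `./utils/models.py`; all hyperparameters here are assumed values:

```python
import torch
import torch.nn as nn

class LSTMClassifier(nn.Module):
    """Binary headline classifier: embedding -> LSTM -> dropout -> linear.
    Hyperparameters are illustrative, not the project's actual values."""
    def __init__(self, vocab_size: int, embed_dim: int = 64,
                 hidden_dim: int = 128, dropout: float = 0.5):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(hidden_dim, 1)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        embedded = self.embedding(token_ids)        # (batch, seq, embed)
        _, (hidden, _) = self.lstm(embedded)        # hidden: (1, batch, hidden)
        logits = self.fc(self.dropout(hidden[-1]))  # (batch, 1)
        return logits.squeeze(1)                    # raw logits per headline
```

The raw logits would be passed through a sigmoid (or trained with `nn.BCEWithLogitsLoss`) to obtain clickbait probabilities.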
```shell
git clone https://github.com/bourgeois-radical/clickbait-detection.git
```
The `ClickbaitClassifier` class (`./utils/showing_results.py`) provides a dunder method that classifies any English sentence you pass to it.
Feel free to try the classifier out in the "Showing model results" section (`./notebooks/validation_and_examples.ipynb`). But don't forget to move `vocab.pkl` (click to download) to the `./data` folder and `model_with_dropouts` (click to download) to the `./notebooks` folder beforehand.
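The callable-classifier pattern behind this can be illustrated with a self-contained toy. This stand-in uses hand-picked cue phrases purely for demonstration; the real `ClickbaitClassifier` loads `vocab.pkl` and the trained LSTM instead:

```python
class ToyClickbaitClassifier:
    """Toy stand-in showing the callable-object pattern: instances are
    invoked like functions via the __call__ dunder method."""
    CLICKBAIT_CUES = ("you won't believe", "top 10", "this one trick")

    def __call__(self, sentence: str) -> str:
        text = sentence.lower()
        is_clickbait = any(cue in text for cue in self.CLICKBAIT_CUES)
        return "clickbait" if is_clickbait else "news"
```

Usage mirrors the notebook: instantiate once, then call the instance directly on each headline string.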
Aggarwal, C. (2022). *Machine Learning for Text* (2nd ed.). Springer.
Raschka, S., Liu, Y., & Mirjalili, V. (2022). *Machine Learning with PyTorch and Scikit-Learn*. Packt.