Assignment 4: Text Classification using Finetuned Transformers

Repository Overview

Description
Repository Tree
Usage
Modified Usage
Results
Discussion

Description

This repository includes the solution by Anton Drasbæk Schiønning (202008161) to assignment 4 in the course "Language Analytics" at Aarhus University.

It provides a framework for classifying the emotions expressed in headlines from the Fake News Dataset using a Hugging Face pipeline. The dataset consists of over 7000 news headlines, article texts and corresponding labels (real/fake). The Hugging Face model used for the classification is j-hartmann/emotion-english-distilroberta-base, a version of the distilled RoBERTa model fine-tuned for emotion classification.
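As a rough illustration of the classification step, the sketch below labels headlines with their top predicted emotion through the transformers pipeline API. It is a minimal sketch, not the code in classify.py: the helper names load_classifier and classify_headlines are hypothetical, and it assumes that transformers and a backend such as PyTorch are installed.

```python
from typing import Callable, Dict, List

MODEL_ID = "j-hartmann/emotion-english-distilroberta-base"


def load_classifier(model_id: str = MODEL_ID) -> Callable:
    """Build a Hugging Face text-classification pipeline.

    The model weights are downloaded the first time this is called.
    """
    # Imported lazily so the helper below can be used without transformers.
    from transformers import pipeline

    return pipeline("text-classification", model=model_id)


def classify_headlines(headlines: List[str], classifier: Callable) -> List[Dict]:
    """Return one record per headline with the top emotion and its score.

    `classifier` is any callable that maps a list of texts to a list of
    {"label": ..., "score": ...} dicts, which is what the pipeline returns
    with its default settings.
    """
    predictions = classifier(headlines)
    return [
        {"title": title, "emotion": pred["label"], "score": pred["score"]}
        for title, pred in zip(headlines, predictions)
    ]
```

Calling classify_headlines(titles, load_classifier()) on a list of headline strings would yield records that could be written out next to the original labels. Passing the classifier in as an argument keeps the helper easy to try out with a stand-in function before downloading the model.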

Visualizations are also made to provide an overview of the classifications.

Repository Tree

.
├── README.md
├── assign_desc.md                                                  
├── data
│   ├── classified_titles_emotion-english-distilroberta-base.csv   <---- headlines with classifications and score
│   └── fake_or_real_news.csv                                      <---- original dataset for real/fake news
├── out
│   └── results_emotion-english-distilroberta-base    <---- example results
│       ├── classification_overview.csv                   <---- overview of all classifications         
│       ├── emotion_distribution.png                      <---- distribution of all emotions
│       └── emotions_by_label.png                         <---- share of emotions by headline type
├── requirements.txt
├── run.sh
├── setup.sh
└── src
    ├── classify.py                                   <---- script for running classifications                                         
    └── visualize.py                                  <---- script for creating visualizations/outputs

Usage

The analysis only assumes that you have Python 3 installed and have cloned this GitHub repository. Once that is done, you can run the full analysis with the shell script:

bash run.sh

This will achieve the following:

Create and activate a virtual environment
Install requirements to that environment
Classify emotions in all headlines (classify.py) using j-hartmann/emotion-english-distilroberta-base
Create and save visualizations of the classifications (visualize.py)
Deactivate the environment

The results are saved to the out directory in a subfolder named after the model used. The subfolder contains three files:

classification_overview.csv: CSV file with an overview of how many headlines were classified as each emotion. It also splits the counts by real and fake headlines.
emotion_distribution.png: Bar chart showing the distribution of emotions identified across all headlines.
emotions_by_label.png: Pie charts showing the distribution of emotions for real and fake headlines, presented side by side for easy comparison.

Examples of these three files are also seen under Results.
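The counting behind an overview file like classification_overview.csv can be sketched with the standard library alone. This is an assumption-laden illustration rather than the logic in visualize.py: the function names are hypothetical, the label spellings "REAL" and "FAKE" are assumed, and the column names are taken from the table under Results.

```python
import csv
from collections import Counter
from typing import Dict, Iterable, List, Tuple

COLUMNS = ["Predicted Emotion", "All Headlines", "Real Only", "Fake Only"]


def build_overview(rows: Iterable[Tuple[str, str]]) -> List[Dict[str, object]]:
    """Count predicted emotions overall and split by headline label.

    rows: (label, emotion) pairs; label is assumed to be "REAL" or "FAKE".
    """
    total, real, fake = Counter(), Counter(), Counter()
    for label, emotion in rows:
        total[emotion] += 1
        if label.upper() == "REAL":
            real[emotion] += 1
        elif label.upper() == "FAKE":
            fake[emotion] += 1
    # One output row per emotion, sorted alphabetically.
    return [
        {
            "Predicted Emotion": emotion,
            "All Headlines": total[emotion],
            "Real Only": real[emotion],
            "Fake Only": fake[emotion],
        }
        for emotion in sorted(total)
    ]


def write_overview(overview: List[Dict[str, object]], path: str) -> None:
    """Write the overview rows to a CSV file with a header line."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=COLUMNS)
        writer.writeheader()
        writer.writerows(overview)
```

A Counter returns 0 for emotions it has never seen, so an emotion that only occurs in fake headlines still gets a "Real Only" value of 0 instead of raising an error.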

Modified Usage

If you wish to use a different model for the emotion classifications, the repository also allows running a modified analysis. Firstly, run the setup bash script to create an environment and install requirements:

bash setup.sh

Run classifications

By default, classify.py uses j-hartmann/emotion-english-distilroberta-base for classifications. However, any other pretrained text classification model from Hugging Face can be used. Please note that you should select a model specific to emotion classification if you wish to maintain the scope of the analysis.

Once you have selected a model, run the classification as follows:

# uses distilbert-base-uncased-go-emotions-student for classification
python src/classify.py -m "joeddav/distilbert-base-uncased-go-emotions-student"

You can find the data with classifications in data/classified_titles_{SELECTED_MODEL_NAME}.csv.

Create Visualizations

Visualizations for a classification file can be created by running visualize.py. Again, you must specify which model was used for classifying the data so that the visualization reads the right data file:

python src/visualize.py -m "joeddav/distilbert-base-uncased-go-emotions-student"

From this, you will get a folder named out/results_{SELECTED_MODEL_NAME} which contains the three files mentioned earlier.
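The naming scheme for the data file and the results folder can be sketched as below. This is a guess at the convention based on the file names in the repository tree, where only the part of the model identifier after the slash appears. The function names and the long option --model are hypothetical; only the -m flag is confirmed by the commands above.

```python
import argparse
from pathlib import Path

DEFAULT_MODEL = "j-hartmann/emotion-english-distilroberta-base"


def model_name(model_id: str) -> str:
    """Drop the user/organisation prefix: 'user/model' -> 'model'."""
    return model_id.split("/")[-1]


def classified_path(model_id: str, data_dir: str = "data") -> Path:
    """Path of the CSV holding the classified headlines for a model."""
    return Path(data_dir) / f"classified_titles_{model_name(model_id)}.csv"


def results_dir(model_id: str, out_dir: str = "out") -> Path:
    """Folder that holds the overview CSV and the two plots for a model."""
    return Path(out_dir) / f"results_{model_name(model_id)}"


def parse_args(argv=None) -> argparse.Namespace:
    """Parse the model option; '--model' is an assumed long form of '-m'."""
    parser = argparse.ArgumentParser(description="Select the emotion model.")
    parser.add_argument("-m", "--model", default=DEFAULT_MODEL,
                        help="Hugging Face model identifier")
    return parser.parse_args(argv)
```

Deriving both paths from the same model identifier is what lets the visualization step locate the output of the classification step when the same -m value is passed to both scripts.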

PLEASE NOTE: The visualizations are designed to look neat for classifications that use 7 emotions. If your model has more or fewer, the visualizations may not look as neat. Regardless, classification_overview.csv for your classifications should still provide the needed overview.

Results

Below are the results for running the classification with the default model, which can be found in the directory out/results_emotion-english-distilroberta-base.

Table: Classification Overview

Predicted Emotion	All Headlines	Real Only	Fake Only
Anger	795	383	412
Disgust	434	186	248
Fear	1076	555	521
Joy	155	63	92
Neutral	3180	1649	1531
Sadness	487	245	242

Plot: Emotion Distribution

Plot: Emotions by Label

Discussion

Overall, the pie charts above reveal that the distribution of emotions in headlines is strikingly similar across the real and fake news. For both label types, the most common emotion by far is neutral with a 52% share for real headlines and 48% for fake ones. Also, joy is the rarest emotion in both the real and fake headlines. Perhaps the most noticeable discrepancy is that 7.8% of fake headlines are classified as disgust whereas it is just 5.9% for real ones.
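The percentage shares compared above come from dividing each emotion count by the total number of headlines with that label. A minimal sketch of that arithmetic follows. The function names are hypothetical and the counts in the usage note are made-up toy numbers, not the repository's results.

```python
from typing import Dict, Tuple


def emotion_shares(counts: Dict[str, int]) -> Dict[str, float]:
    """Convert per-emotion headline counts into percentage shares."""
    total = sum(counts.values())
    if total == 0:
        return {emotion: 0.0 for emotion in counts}
    return {emotion: round(100 * n / total, 1) for emotion, n in counts.items()}


def largest_gap(real: Dict[str, int], fake: Dict[str, int]) -> Tuple[str, float]:
    """Return the emotion whose share differs most between the two labels.

    The gap is measured in percentage points. Ties resolve to the
    alphabetically first emotion.
    """
    real_shares, fake_shares = emotion_shares(real), emotion_shares(fake)
    emotions = sorted(set(real_shares) | set(fake_shares))
    gaps = {
        e: abs(real_shares.get(e, 0.0) - fake_shares.get(e, 0.0))
        for e in emotions
    }
    emotion = max(emotions, key=lambda e: gaps[e])
    return emotion, round(gaps[emotion], 1)
```

With toy counts real = {"neutral": 60, "disgust": 10, "joy": 30} and fake = {"neutral": 50, "disgust": 30, "joy": 20}, largest_gap(real, fake) picks out disgust with a gap of 20 percentage points. The counts passed in must cover every emotion class the model predicts, otherwise the denominators and therefore the shares are off.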

The main takeaway remains that, according to this analysis, the fake headlines are extremely similar to the real ones when it comes to the primary emotion displayed. This implies that the emotion in a headline is not a good indicator of whether the headline is real. Still, it should be emphasized that these results are based only on the classifications by j-hartmann/emotion-english-distilroberta-base, and a different classification model may have produced different results. If you are interested, you can explore this by following Modified Usage.
