
Unintended Bias in toxicity classification

Udacity Machine Learning Engineer Capstone Project

Project from the Kaggle competition: Jigsaw unintended bias in toxicity classification

Project Overview

Natural Language Processing is a complex field that is hypothesised to belong to the set of AI-complete problems, meaning that the difficulty of these computational problems is equivalent to that of solving the central artificial intelligence problem: making computers as intelligent as people. With over 90% of all data ever generated having been produced in the last two years, and with a large proportion of it being human-generated unstructured text, there is an ever-increasing need to advance the field of Natural Language Processing.

A recent UK Government proposal to regulate social media companies over harmful content, including "substantial" fines and the power to block services that do not follow the rules, is one example of the regulatory need to better manage the content that users generate.

Other initiatives, such as Riot Games' work on predicting and reforming toxic player behaviour during games, are further examples of the effort to understand user-generated content and moderate toxic material.

However, as highlighted by the Kaggle competition Jigsaw unintended bias in toxicity classification, existing models suffer from unintended bias: they may predict a high likelihood of toxicity for content containing certain words (e.g. "gay") even when those comments are not actually toxic (such as "I am a gay woman"), leaving machine-only classification models still sub-standard.

Having tools that can flag toxic content without suffering from unintended bias is of paramount importance to preserving the Internet's fairness and freedom of speech.

Project Report

Download the Project-Report.pdf

Acquiring the data

Download the data from https://www.kaggle.com/c/12500/download-all, unzip it and place the files in the /input folder.
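
The snippet below is a minimal sketch of the unpacking step, assuming the competition archive has already been downloaded to the repository root; the archive filename used here is an assumption and should be adjusted to match the file you actually downloaded.

 # Minimal sketch: unpack the downloaded Kaggle archive into the /input folder.
 # NOTE: the archive filename is an assumption -- rename it to match your download.
 import zipfile
 from pathlib import Path

 ARCHIVE = Path("jigsaw-unintended-bias-in-toxicity-classification.zip")  # assumed name
 INPUT_DIR = Path("input")

 INPUT_DIR.mkdir(exist_ok=True)
 with zipfile.ZipFile(ARCHIVE) as zf:
     zf.extractall(INPUT_DIR)  # extracts the competition CSV files

 print("Extracted", len(list(INPUT_DIR.iterdir())), "files into", INPUT_DIR)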

Python package requirements

 torch
 keras
 sklearn
 numpy
 pandas
 nltk
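
As a quick sanity check (an illustrative sketch, not part of the original project), the following verifies that the required packages are importable; note that sklearn is published on PyPI as scikit-learn.

 # Sanity check: confirm every required package can be imported.
 import importlib

 for pkg in ("torch", "keras", "sklearn", "numpy", "pandas", "nltk"):
     importlib.import_module(pkg)
     print(pkg, "OK")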

Python entry file

 /notebooks/Main.py
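
The file is an exported notebook script and can be run with a standard Python interpreter, for example with python notebooks/Main.py from the repository root; depending on how the relative path to the /input folder is resolved, the working directory may need adjusting.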
