WebCred | Data Day Grind Hackathon

Our Devpost submission can be found here: https://devpost.com/software/webcred-o6q8fh

Our Inspiration

The internet is a universe in itself, with vast amounts of data, networks, and resources. With this tremendous global facility comes a need for user accountability. As internet usage increases, the number of people with malicious intent naturally increases too. It is therefore extremely important to keep every internet user safe, especially the more vulnerable. For example, one of the most prevalent ways that people are tricked into giving away financial or personal details is through fake job listings.

How it works

General. On the main page, the user is prompted to enter a URL for a news article, job listing, or any general website. We then fetch the page with an HTTP request and use Beautiful Soup to parse it and extract the relevant details. These details are passed to our back end through Flask, where three natural language processing neural networks extract various text features and present them to the user.
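The text-extraction step can be sketched roughly as follows. This is a minimal illustration, not the project's code: it uses the standard library's `html.parser` as a dependency-free stand-in for Beautiful Soup, and the names `VisibleTextParser` and `extract_text` are hypothetical. It assumes the page's HTML has already been fetched as a string.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_text(html):
    """Return the visible text of an HTML document as one string."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


page = "<html><body><h1>Hiring now</h1><p>Work from <b>home</b></p></body></html>"
print(extract_text(page))
```

With Beautiful Soup itself, the same result is usually obtained by parsing the page and calling `get_text()` on the parsed document.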

How the Natural Language Processing Works. Across the three NLP models, we used 125,000+ strings to train and validate the networks. Each string is tokenized (every word is mapped to a unique integer), padded or truncated to a common length, and passed into a recurrent neural network for training. After training, the model is exported and used for future predictions.
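Tokenization and padding can be illustrated with a small pure-Python sketch. The function names, the reserved ids, the whitespace splitting, and the choice to pad and truncate at the end of the sequence are all assumptions made for illustration; the project's actual preprocessing may differ.

```python
from collections import Counter

PAD_ID = 0  # reserved for padding positions
OOV_ID = 1  # reserved for words not seen when the vocabulary was built


def fit_vocabulary(texts, max_words=10000):
    """Map the most frequent words to integer ids, starting at 2."""
    counts = Counter(word for text in texts for word in text.lower().split())
    most_common = counts.most_common(max_words)
    return {word: index + 2 for index, (word, _) in enumerate(most_common)}


def tokenize(text, vocabulary):
    """Replace each word with its id, or OOV_ID if the word is unknown."""
    return [vocabulary.get(word, OOV_ID) for word in text.lower().split()]


def pad(sequence, length):
    """Truncate or right-pad a sequence so it has exactly `length` ids."""
    trimmed = sequence[:length]
    return trimmed + [PAD_ID] * (length - len(trimmed))


vocabulary = fit_vocabulary(["work from home", "work today"])
print(pad(tokenize("work from mars", vocabulary), 5))
```

Reserving separate ids for padding and unknown words keeps both distinguishable from real vocabulary entries.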

When Flask passes in a string, the same tokenization and padding are applied. The padded sequence is then passed to the trained model, which makes a prediction. The prediction and its associated confidence are then returned to the user through Flask.
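As a rough sketch of the last step, the snippet below turns a model output into a label and a confidence value. It assumes each model ends in a single sigmoid unit that outputs a probability between 0 and 1, which the description above does not confirm; `describe_prediction` and its parameters are hypothetical names.

```python
def describe_prediction(probability, positive_label, negative_label, threshold=0.5):
    """Convert a sigmoid-style probability into a label and a confidence."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if probability >= threshold:
        return {"label": positive_label, "confidence": probability}
    # Confidence in the negative class is the complement of the output.
    return {"label": negative_label, "confidence": 1.0 - probability}


print(describe_prediction(0.9, "fake", "real"))
print(describe_prediction(0.2, "fake", "real"))
```

A dictionary like this is straightforward to serialize as JSON when returning the result from a Flask view.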

We acquired our data from the famous IMDB dataset for sentiment analysis, the “Employment Scam Aegean Dataset” from the Laboratory of Information & Communication Systems Security at the University of the Aegean for fake job listing detection, and the “Fake and real news dataset” by Clément Bisaillon on Kaggle for fake news detection.

(See Python Natural Language Processing Flowchart)

Challenges we ran into

Our initial NLP model used a simple dense neural network (DNN) following the embedding layer. With this technique we observed around 70-80% accuracy on our validation data. Although this level of detection is statistically significant, it still leaves a relatively high chance of a false prediction. We reasoned that this was because word order was not a factor in the network's predictions. To solve this issue, we implemented a recurrent neural network (RNN) with bidirectional LSTM (Long Short-Term Memory) layers so that every word in a sentence can influence how the others are interpreted. After this modification, our validation accuracy increased to 96%+, a significant improvement over our simple DNN.
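The word-order argument can be made concrete with a toy example. It is not an LSTM and not the project's model: it assumes the embeddings were averaged before the dense layers (the text does not say how they were combined), and the scalar "embeddings", the function names, and the `decay` parameter are invented for illustration. Averaging gives the same value for any ordering of the words, while even a very simple recurrent update does not.

```python
import math

# Hypothetical one-dimensional embeddings, chosen only for illustration.
EMBEDDINGS = {"dog": 1.0, "bites": -0.5, "man": 0.25}


def mean_pool(tokens, embeddings):
    """Average the embeddings; the result ignores word order."""
    return sum(embeddings[token] for token in tokens) / len(tokens)


def recurrent_state(tokens, embeddings, decay=0.5):
    """Fold the tokens into a running state; the result depends on order."""
    state = 0.0
    for token in tokens:
        state = math.tanh(decay * state + embeddings[token])
    return state


forward = ["dog", "bites", "man"]
backward = ["man", "bites", "dog"]

print(mean_pool(forward, EMBEDDINGS) == mean_pool(backward, EMBEDDINGS))
print(recurrent_state(forward, EMBEDDINGS), recurrent_state(backward, EMBEDDINGS))
```

A bidirectional recurrent layer runs this kind of update in both directions, so each position sees context from the words before and after it.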

We wanted to help visitors visually understand our models, but we didn’t know how to represent such an abstract concept. We were eventually able to integrate the TensorBoard Embedding Projector, which visualizes the data by mapping the labels to points in vector space.

What's next for WebCred

We want to expand our site to handle a broader range of data sources so that users can depend on it. We’d also like to add a classification system (using graphs, charts, and lists) that tells users what percentage of a source is credible. We believe this feature can improve overall awareness of different types of online sources and help users decide whether to keep using their preferred websites.



Our Website: Home | Landing Page

Job Listing Page

News Article Page

Website Page

Sample Results Screen

Python Natural Language Processing Flowchart
