BenedictYoung/Webpage_Classification

Webpage Classification

1. Introduction

This project uses a dataset of webpages collected from the computer science departments of various universities. The goal is to train classifiers that predict the type of a webpage from its text.

Because the dataset contains some errors and its files are saved in various encodings, I cleaned the data to improve its quality and usability; this included deleting 2 unrecognized pages. This project is based on the cleaned data, which can be found in ./webkb/.

If you want to access the original dataset, you may visit: link.
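The cleaning scripts themselves are not shown here, so the following is only a minimal sketch of one common way to read files saved in mixed encodings: try a few candidate encodings in order and fall back to a lossy decode. The function name `read_text_lenient` and the list of candidate encodings are assumptions for illustration, not part of this repository.

```python
from pathlib import Path

# Candidate encodings, tried in order. This list is an assumption;
# the encodings actually present in the dataset are not documented here.
CANDIDATE_ENCODINGS = ("utf-8", "cp1252", "latin-1")


def read_text_lenient(path, encodings=CANDIDATE_ENCODINGS):
    """Return the file's text, decoded with the first encoding that works."""
    raw = Path(path).read_bytes()
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue  # Try the next candidate.
    # Last resort: keep the readable parts and replace undecodable bytes.
    return raw.decode("utf-8", errors="replace")
```

A file that fails every candidate is still returned with replacement characters, which makes such pages easy to find and inspect by hand before deciding whether to drop them.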

2. Dataset

Every webpage is labeled with one of the following 7 target categories. The table gives page counts for the cleaned and original data:

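| Category | Cleaned | Original |
| --- | --- | --- |
| student | 1641 | 1641 |
| faculty | 1124 | 1124 |
| staff | 136 | 137 |
| department | 182 | 182 |
| course | 930 | 930 |
| project | 504 | 504 |
| other | 3763 | 3764 |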

The data is also divided by university. The table gives page counts for the cleaned and original data:

| University | Cleaned | Original |
| --- | --- | --- |
| Cornell | 867 | 867 |
| Texas | 827 | 827 |
| Washington | 1204 | 1205 |
| Wisconsin | 1263 | 1263 |
| Miscellaneous | 4119 | 4120 |
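Counts like those above can be recomputed from the files on disk. The sketch below assumes the cleaned data is laid out as `webkb/<category>/<university>/<page>`; that layout is an assumption and may not match this repository exactly. The helper name `count_pages` is hypothetical.

```python
from collections import Counter
from pathlib import Path


def count_pages(root):
    """Count pages per category and per university.

    Assumes the layout <root>/<category>/<university>/<page file>.
    """
    root = Path(root)
    by_category = Counter()
    by_university = Counter()
    for page in root.glob("*/*/*"):
        if not page.is_file():
            continue  # Ignore any nested directories.
        category, university, _name = page.relative_to(root).parts
        by_category[category] += 1
        by_university[university] += 1
    return by_category, by_university
```

Calling `count_pages("webkb")` returns two `Counter` objects whose totals should agree with each other, which is a quick sanity check after cleaning.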

3. Methodology

Instead of treating the HTML as structured data, I treat each webpage as plain text. I therefore applied a pre-trained BERT model to this task. After task-specific fine-tuning, the method achieves about 94% prediction accuracy.
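Treating a page as plain text means discarding its markup before the text reaches the model. The following is a minimal sketch of that step using only the Python standard library; it is not the preprocessing in this repository, and the names `html_to_text` and `_TextExtractor` are hypothetical. It drops tags, skips the contents of `script` and `style` elements, and collapses whitespace.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping script and style contents."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join the chunks, then collapse all runs of whitespace.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(html):
    """Return the visible text of an HTML string as a single line."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

The resulting string is what would then be tokenized and truncated to the model's maximum input length before fine-tuning.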

4. Description

| File name | Description |
| --- | --- |
| webkb/ | cleaned dataset |
| pretrained/ | model configuration |
| get_data.py | reads data from 'webkb/' |
| functions.py | helper functions |
| bert.py | main code |
| poster.pdf | project poster |
