dataset-random

A dataset of random pages with manually marked up semantic blocks.

Overview

This dataset was developed as part of my Master's thesis ("A Quantitative Comparison of Semantic Web Page Segmentation Algorithms"), which can be found here.

It contains a set of random web pages that were downloaded using wget. The downloads include all static resources such as images, CSS files, and JavaScript files, so that the pages can be rendered offline as they appear online. The links were rewritten to point to the local resources. Each page is available in three versions: the basic HTML as obtained by a single GET request to the URL; a serialized version of the DOM after all external resources were loaded; and a version of the DOM page in which semantic blocks were marked up manually by a number of volunteers.

How to use this dataset

The file mapping.txt contains a mapping from the original URL of a downloaded page to its local file path. E.g.:

"https://www.ilse.nl/" : "/opt/dataset-random/www.ilse.nl/www.ilse.nl/index.html",

The file path prefix "/opt/dataset-random" is constant, while the relative part "/www.ilse.nl/www.ilse.nl/index.html" gives the location of the index.html file. The directory containing this file always holds four files:

index.html.orig
index.html
index.dom.html
index.blocks.html
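The mapping format and file layout described above can be read with a few lines of Python. This is a minimal sketch, not part of the dataset: it assumes every entry of mapping.txt sits on its own line in the form shown in the example, and that the mapped file is always named index.html. The function names and the checkout location "/data/dataset-random" are hypothetical.

```python
import re
from pathlib import PurePosixPath

# Constant prefix used inside mapping.txt (taken from the README).
PREFIX = "/opt/dataset-random"

# Assumed entry shape: "URL" : "PATH",  (one entry per line)
ENTRY = re.compile(r'^\s*"(?P<url>[^"]+)"\s*:\s*"(?P<path>[^"]+)"\s*,?\s*$')


def parse_mapping(lines):
    """Return a dict from original URL to the path stored in mapping.txt."""
    mapping = {}
    for line in lines:
        match = ENTRY.match(line)
        if match:  # silently skip lines that are not entries
            mapping[match.group("url")] = match.group("path")
    return mapping


def local_variants(mapped_path, root):
    """Re-root a mapped path under `root` and list the four sibling files."""
    relative = PurePosixPath(mapped_path).relative_to(PREFIX)
    directory = (PurePosixPath(root) / relative).parent
    return {
        "orig": str(directory / "index.html.orig"),
        "html": str(directory / "index.html"),
        "dom": str(directory / "index.dom.html"),
        "blocks": str(directory / "index.blocks.html"),
    }


if __name__ == "__main__":
    sample = ['"https://www.ilse.nl/" : '
              '"/opt/dataset-random/www.ilse.nl/www.ilse.nl/index.html",']
    mapping = parse_mapping(sample)
    # "/data/dataset-random" is a hypothetical local checkout location.
    for name, path in local_variants(mapping["https://www.ilse.nl/"],
                                     "/data/dataset-random").items():
        print(name, path)
```

To read the real file, pass an open file handle to parse_mapping, since iterating over a handle yields lines.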

index.html.orig is the unchanged original file (obtained by a single GET request).

index.html is the original file where only the links have been made absolute and rewritten to match the local file structure (the repo contains all static resources as well).

index.dom.html is the HTML after the DOM was rendered (with rewritten links).

index.blocks.html is like index.dom.html, but it additionally contains the manually marked-up blocks. A block is marked by the HTML attribute data-block=1 or data-block=2, depending on whether it is a top-level block or a sub-level block. Each block also has a type, indicated by an attribute such as data-block-type="Header".
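The block annotations can be collected with Python's standard html.parser module. The following is a minimal sketch that records only the tag name, block level, and block type of every annotated element. The class and function names are hypothetical, and the sample markup is made up for illustration; only the attribute names and the "Header" type come from the description above.

```python
from html.parser import HTMLParser


class BlockCollector(HTMLParser):
    """Collects every element that carries a data-block attribute."""

    def __init__(self):
        super().__init__()
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if "data-block" in attributes:
            self.blocks.append({
                "tag": tag,
                # "1" = top-level block, "2" = sub-level block
                "level": attributes["data-block"],
                # None if the element has no data-block-type attribute
                "type": attributes.get("data-block-type"),
            })


def find_blocks(html):
    """Return the annotated blocks of an index.blocks.html document."""
    collector = BlockCollector()
    collector.feed(html)
    collector.close()
    return collector.blocks


if __name__ == "__main__":
    # Made-up sample markup, not taken from the dataset.
    sample = ('<div data-block=1 data-block-type="Header">'
              '<p data-block=2 data-block-type="Header">Welcome</p>'
              '</div>')
    for block in find_blocks(sample):
        print(block)
```

For a real page, read the contents of an index.blocks.html file and pass the resulting string to find_blocks. A full HTML library would be needed to recover the text or subtree of each block; this sketch only lists the annotations.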

License

This dataset is in the Public Domain, but attribution is encouraged.

