
haavala/commoncrawl-pagerank


URL extraction and PageRank MapReduce jobs on the CommonCrawl (https://commoncrawl.org) dataset.

Test URL extraction locally:

chmod +x url_parsing/crawl_mapper.rb
chmod +x url_parsing/crawl_reducer.rb

cat sample_input | ./url_parsing/crawl_mapper.rb | ./url_parsing/crawl_reducer.rb > result

Running Amazon Elastic MapReduce jobs

  • Input location: s3://aws-publicdatasets/common-crawl/parse-output/segment/1346823846150/metadata-*
  • Output location: s3://../output
  • Mapper: s3://.../crawl_mapper.rb
  • Reducer: s3://.../crawl_reducer.rb
  • Extra args: -inputformat SequenceFileAsTextInputFormat
  • Custom Bootstrap action: s3://.../setup.sh

Commoncrawl Metadata

Each metadata file is about 16 MB. To get the list of valid segments:

aws get aws-publicdatasets/common-crawl/parse-output/valid_segments.txt

URL extraction

Information about https://www.neti.ee/cgi-bin/serverid:

{
  "url": "https://www.neti.ee/cgi-bin/serverid",
  "arcFileParition": 584,
  "arcSourceSegmentId": 1346823846150,
  "arcFileDate": 1346832062332,
  "compressedSize": 524596,
  "arcFileOffset": 8186060
}
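
A record like the one above could be derived from a single line of streaming input roughly as follows. This is a minimal sketch, not the repository's crawl_mapper.rb. It assumes each input line is a URL and a JSON metadata blob separated by a tab, and that the location fields sit at the top level of that JSON. The helper name `extract_location`, the sample line, and its `ignored` field are hypothetical.

```ruby
require 'json'

# Fields that locate a page inside an ARC file, spelled as in the record
# above (including the dataset's own "arcFileParition" spelling).
LOCATION_FIELDS = %w[
  arcFileParition arcSourceSegmentId arcFileDate compressedSize arcFileOffset
].freeze

# Hypothetical helper: turn one "url<TAB>metadata_json" line into a compact
# record, or return nil when the line is malformed.
def extract_location(line)
  url, json = line.chomp.split("\t", 2)
  return nil if url.nil? || json.nil?

  metadata = JSON.parse(json)
  return nil unless metadata.is_a?(Hash)

  record = { 'url' => url }
  LOCATION_FIELDS.each do |field|
    record[field] = metadata[field] if metadata.key?(field)
  end
  record
rescue JSON::ParserError
  nil
end

# Made-up sample line; the "ignored" field only shows that other keys are dropped.
sample = "https://www.neti.ee/cgi-bin/serverid\t" \
         '{"arcFileParition":584,"arcSourceSegmentId":1346823846150,' \
         '"arcFileDate":1346832062332,"compressedSize":524596,' \
         '"arcFileOffset":8186060,"ignored":true}'

puts JSON.pretty_generate(extract_location(sample))
```

In an actual streaming mapper the same helper would be applied to each line read from STDIN, printing one record per valid line.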

PageRank calculation

chmod +x page_rank/page_rank_mapper.rb
chmod +x page_rank/page_rank_reducer.rb

cat sample_input | ./page_rank/page_rank_mapper.rb | sort | ./page_rank/page_rank_reducer.rb > result
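
For orientation, one PageRank iteration in the `mapper | sort | reducer` style can be sketched as below. This is an illustration under stated assumptions, not the contents of page_rank_mapper.rb or page_rank_reducer.rb. The input format (`page<TAB>rank<TAB>comma-separated outlinks`), the `links:` marker, the function names, and the damping factor of 0.85 are all assumptions made for the example.

```ruby
DAMPING = 0.85 # conventional damping factor; an assumption, not taken from the repository

# Map step: split a page's rank evenly among its outlinks and re-emit the
# link list so the reduce step can carry the graph into the next iteration.
def pagerank_map(line)
  page, rank, links = line.chomp.split("\t", 3)
  outlinks = links.to_s.split(',')
  emitted = [[page, "links:#{outlinks.join(',')}"]]
  outlinks.each do |target|
    emitted << [target, (rank.to_f / outlinks.size).to_s]
  end
  emitted
end

# Reduce step: for one page, sum the incoming contributions and apply damping.
def pagerank_reduce(page, values)
  links = ''
  total = 0.0
  values.each do |value|
    if value.start_with?('links:')
      links = value.sub('links:', '')
    else
      total += value.to_f
    end
  end
  new_rank = (1 - DAMPING) + DAMPING * total
  "#{page}\t#{new_rank}\t#{links}"
end

# Simulate the map, sort, and reduce stages in memory on a made-up three-page graph.
input = ["a\t1.0\tb,c", "b\t1.0\tc", "c\t1.0\ta"]
pairs = input.flat_map { |line| pagerank_map(line) }
output = pairs.group_by(&:first).sort.map do |page, kvs|
  pagerank_reduce(page, kvs.map(&:last))
end
puts output
```

The output lines have the same shape as the input, so the job can be run repeatedly until the ranks settle. Pages without outlinks are not treated specially here, so their rank is not redistributed.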
