A Web Crawler Example in Node.js and Puppeteer for Code4Lib BC 2021

schuyberg/c4lbc-crawler

Puppeteer Web Crawler Example - Code4Lib BC 2021

Example scripts for crawling web pages using Node.js, puppeteer, and puppeteer-cluster. Built for Code4Lib BC, 2021.

If a page mentions the word 'repository,' you have to drink more coffee!

Note: This is a simplified example with known issues that should be ironed out before you try to do anything serious with it. For example, it does nothing special to handle query strings or anchors, and it does a poor job of checking whether it has already crawled a particular URL. As a result, drink counts are probably maximized.
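One way to address the issues in the note is to reduce every discovered link to a canonical form before checking whether it has been seen. The sketch below is not taken from this repository's crawler.js; the names `normalizeUrl`, `shouldCrawl`, and `visited` are hypothetical. It assumes that anchors can always be ignored, that query parameter order does not matter, and that a trailing slash does not change the page. Those assumptions hold for many sites but not all. It uses only the `URL` class built into Node.js.

```javascript
// Hypothetical helpers for illustration; not part of the repository's scripts.

// Canonical URLs that have already been queued or crawled.
const visited = new Set();

// Reduce a link to a canonical key so that trivially different URLs compare equal.
function normalizeUrl(rawUrl, baseUrl) {
  // Resolve relative links against the page they were found on.
  const url = new URL(rawUrl, baseUrl);

  // Assumption: anchors (#fragment) never identify a different page.
  url.hash = '';

  // Assumption: ?a=1&b=2 and ?b=2&a=1 return the same page.
  url.searchParams.sort();

  // Assumption: /path and /path/ are the same page (the root "/" is left alone).
  if (url.pathname.length > 1 && url.pathname.endsWith('/')) {
    url.pathname = url.pathname.slice(0, -1);
  }

  return url.href;
}

// Return true the first time a URL is seen, and false for repeats,
// unparseable links, and non-web schemes such as mailto:.
function shouldCrawl(rawUrl, baseUrl) {
  let key;
  try {
    key = normalizeUrl(rawUrl, baseUrl);
  } catch {
    return false; // new URL() throws on invalid input
  }
  if (!key.startsWith('http://') && !key.startsWith('https://')) {
    return false;
  }
  if (visited.has(key)) {
    return false;
  }
  visited.add(key);
  return true;
}
```

In a crawler, a check like `shouldCrawl(href, currentPageUrl)` would run on each link before it is queued. If a site serves different content depending on query strings or trailing slashes, the matching normalization step should be dropped.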

Usage:

  1. Install node dependencies:

npm install

  2. Run script(s) with node:

node ./crawler.js

Documentation:

Puppeteer

Documentation: https://pptr.dev/

GitHub (with examples): https://github.com/puppeteer/puppeteer

Puppeteer-cluster

GitHub: https://github.com/thomasdondorf/puppeteer-cluster
