Puppeteer Web Crawler Example - Code4Lib BC 2021

Example scripts for crawling web pages using Node.js, puppeteer, and puppeteer-cluster. Built for Code4Lib BC, 2021.

If a page mentions the word 'repository,' you have to drink more coffee!

Note: This is a simplified example and still has some issues that should be ironed out before you try to do anything serious with it. For example, it does nothing to handle query strings ~~or anchors~~ differently, and does a poor job of checking whether it has already crawled a particular URL. As a result, the same page can be visited more than once, so drink counts are probably maximized.
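One possible way to tighten the "already crawled" check is to reduce each URL to a canonical key before comparing. The sketch below is not taken from crawler.js; `normalizeUrl`, `markVisited`, and `visited` are hypothetical names, and treating `/path` and `/path/` as the same page is an assumption that will not hold for every site. It uses only the `URL` class built into Node.js.

```javascript
// Hypothetical helper (not part of crawler.js): reduce a URL to a canonical key.
function normalizeUrl(rawUrl, baseUrl) {
  // Resolves relative links against baseUrl; throws on invalid input.
  const url = new URL(rawUrl, baseUrl);
  url.hash = '';           // drop the #anchor
  url.searchParams.sort(); // make ?b=2&a=1 and ?a=1&b=2 compare equal
  // Assumption: a trailing slash does not change which page is served.
  if (url.pathname.length > 1 && url.pathname.endsWith('/')) {
    url.pathname = url.pathname.slice(0, -1);
  }
  return url.href;
}

// Canonical keys of every URL seen so far.
const visited = new Set();

// Returns true the first time a URL is seen, false on repeats.
function markVisited(rawUrl, baseUrl) {
  const key = normalizeUrl(rawUrl, baseUrl);
  if (visited.has(key)) return false;
  visited.add(key);
  return true;
}
```

A crawler could call `markVisited(link, currentPageUrl)` for each link it finds and queue only the links for which it returns true. Whether query strings should be kept, sorted, or dropped entirely depends on the site being crawled.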

Usage:

Install node dependencies:

npm install

Run script(s) with node:

node ./crawler.js

Documentation:

Puppeteer

Documentation: https://pptr.dev/

GitHub (with examples): https://github.com/puppeteer/puppeteer

Puppeteer-cluster

GitHub: https://github.com/thomasdondorf/puppeteer-cluster
