Example scripts for crawling web pages using Node.js, puppeteer, and puppeteer-cluster. Built for Code4Lib BC, 2021.
If a page mentions the word 'repository,' you have to drink more coffee!
Note: This is a simplified example with known issues that should be ironed out before you try to do anything serious with it. For example, it does nothing special to handle query strings or anchors, and it does a poor job of checking whether it has already crawled a particular URL. As a result, the same page may be counted more than once, so drink counts are probably maximized.
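One possible way to tighten the "already crawled" check is to normalize each URL before recording it. The sketch below is not part of the original scripts: `normalizeUrl`, `shouldCrawl`, and the `visited` set are hypothetical names, and it assumes that anchors and query parameter order never change the page content, which may not hold for every site.

```javascript
// Hypothetical helpers; not taken from crawler.js.

// Reduce a URL to a canonical form so trivially different links compare equal.
function normalizeUrl(rawUrl) {
  const url = new URL(rawUrl);
  url.hash = '';            // assumption: anchors point at the same page
  url.searchParams.sort();  // assumption: query parameter order is irrelevant
  return url.href;
}

// Normalized URLs that have already been queued or crawled.
const visited = new Set();

// Returns true the first time a (normalized) URL is seen, false afterwards.
function shouldCrawl(rawUrl) {
  const key = normalizeUrl(rawUrl);
  if (visited.has(key)) return false;
  visited.add(key);
  return true;
}
```

Calling `shouldCrawl(url)` before queueing a link would skip pages that differ only by anchor or by the order of their query parameters.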
- Install node dependencies:
npm install
- Run script(s) with node:
node ./crawler.js
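For orientation, a crawl of this kind might be structured roughly as follows. This is a minimal sketch, not the contents of crawler.js: `countMentions` and `crawl` are hypothetical names, the `maxConcurrency` value is an assumption, and the sketch only visits the URLs it is given rather than following links.

```javascript
// Illustrative sketch only; not the contents of crawler.js.

// Count case-insensitive occurrences of a word in a block of text.
// Assumes `word` contains plain letters only (no regex special characters).
function countMentions(text, word = 'repository') {
  const matches = text.match(new RegExp(word, 'gi'));
  return matches ? matches.length : 0;
}

// Visit each URL with a small pool of browser workers and tally mentions.
async function crawl(urls) {
  // Required lazily so countMentions can be used without a browser installed.
  const { Cluster } = require('puppeteer-cluster');
  const cluster = await Cluster.launch({
    concurrency: Cluster.CONCURRENCY_CONTEXT,
    maxConcurrency: 2, // assumed value; tune for your machine
  });
  let drinks = 0;

  // The task runs once per queued URL, each in its own page.
  await cluster.task(async ({ page, data: url }) => {
    await page.goto(url);
    const text = await page.evaluate(() => document.body.innerText);
    drinks += countMentions(text);
  });

  for (const url of urls) cluster.queue(url);
  await cluster.idle();  // wait until the queue is empty
  await cluster.close();
  return drinks;
}
```

With the dependencies installed, something like `crawl(['https://example.org']).then(console.log)` would print the total number of mentions across the given pages.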
Puppeteer documentation: https://pptr.dev/
Puppeteer GitHub (with examples): https://github.com/puppeteer/puppeteer
puppeteer-cluster GitHub: https://github.com/thomasdondorf/puppeteer-cluster