A simple web scraper that takes a snapshot of a target website

kwler/harvest-webscraper

Harvest: Web Scraper

A simple web scraper that takes a snapshot of a target website. The keyword is "simple": the scraper navigates the site, takes in as much data as it can, and stores the result in multiple formats, but it never performs data extraction or processing. That step happens further down the line, in a different project. Keeping the two apart protects the scraper from site restructurings that would otherwise break data extraction.
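
To make the "snapshot only, no extraction" idea concrete, here is a minimal sketch of what a stored snapshot could look like. The `Snapshot` interface, its field names, and `buildSnapshot` are hypothetical and are not taken from this project; the point is only that the record holds raw capture output and nothing derived from it.

```typescript
// Hypothetical shape of a stored snapshot. Field names are illustrative
// assumptions, not this project's actual schema.
interface Snapshot {
  url: string;               // page that was visited
  capturedAt: string;        // ISO 8601 timestamp of the capture
  html: string;              // raw, unparsed page source
  screenshotBase64?: string; // optional screenshot, already base64-encoded
}

// Build a snapshot from raw capture output. Note what is absent:
// no selectors, no parsing, no extracted fields.
function buildSnapshot(
  url: string,
  html: string,
  capturedAt: Date,
  screenshotBase64?: string
): Snapshot {
  const snapshot: Snapshot = {
    url,
    capturedAt: capturedAt.toISOString(),
    html,
  };
  if (screenshotBase64 !== undefined) {
    snapshot.screenshotBase64 = screenshotBase64;
  }
  return snapshot;
}

// The HTML is stored verbatim; a downstream project decides what to extract.
const exampleSnapshot = buildSnapshot(
  "https://example.com/",
  "<html><body><h1>Hello</h1></body></html>",
  new Date("2020-01-01T00:00:00Z")
);
console.log(JSON.stringify(exampleSnapshot));
```

Because the record carries no site-specific structure, a redesign of the target site changes only the contents of `html`, not the snapshot format.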

Features

  • wait for "orders" from HTTP
  • wait for "orders" from PubSub
  • navigate websites
  • take a screenshot
  • store the HTML contents
  • write results to HTTP response
  • write results to PubSub
  • write results to Cloud Storage
  • perform other commands aside from basic navigation
  • security
  • DoS mitigation
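
The list above mentions "orders" arriving over HTTP or PubSub and results going to several destinations. As a rough sketch of how one validation step could serve both entry points, the example below parses an already-decoded JSON payload into a typed order. The `Order` shape, the `outputs` and `screenshot` fields, and `parseOrder` are all assumptions made for illustration; the project's real order format is not documented here.

```typescript
// Hypothetical order format: every field name below is an assumption.
type OutputTarget = "http" | "pubsub" | "storage";

interface Order {
  url: string;             // page to visit
  screenshot: boolean;     // whether to capture a screenshot
  outputs: OutputTarget[]; // where to write the result
}

const OUTPUT_TARGETS: OutputTarget[] = ["http", "pubsub", "storage"];

function isOutputTarget(value: unknown): value is OutputTarget {
  return (
    typeof value === "string" &&
    OUTPUT_TARGETS.indexOf(value as OutputTarget) !== -1
  );
}

// Validate an untrusted payload (an HTTP body or a decoded PubSub message)
// and return a typed order, or throw with a descriptive message.
function parseOrder(payload: unknown): Order {
  if (typeof payload !== "object" || payload === null) {
    throw new Error("order must be a JSON object");
  }
  const raw = payload as Record<string, unknown>;

  const url = raw["url"];
  if (typeof url !== "string" || !/^https?:\/\//.test(url)) {
    throw new Error("order.url must be an http(s) URL");
  }

  // Default to answering on the HTTP response when no outputs are given.
  const rawOutputs = raw["outputs"];
  const requested: unknown[] = Array.isArray(rawOutputs) ? rawOutputs : ["http"];

  const outputs: OutputTarget[] = [];
  for (const item of requested) {
    if (!isOutputTarget(item)) {
      throw new Error("unknown output target: " + String(item));
    }
    outputs.push(item);
  }

  return { url, screenshot: raw["screenshot"] === true, outputs };
}
```

Rejecting malformed orders before a browser is launched keeps invalid requests cheap, which is one small piece of what the "security" and "DoS mitigation" items might involve.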

Developer "Quality-of-Life" Features

  • continuous integration
  • TypeScript
  • unit tests
  • unit test mocks
  • integration tests running on local emulator
  • environment variables
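
For the "environment variables" item, one possible approach is to read and validate configuration once, in a single function, so that unit tests can pass in a plain object instead of the real environment. In the sketch below, `FIREBASE_PROJECT` is the variable name used by the deploy command in this README; whether the functions read it at runtime is an assumption, and `HARVEST_TIMEOUT_MS`, `HarvestConfig`, and `loadConfig` are hypothetical names introduced only for this example.

```typescript
// Hypothetical configuration shape; not taken from the project.
interface HarvestConfig {
  projectId: string;
  timeoutMs: number;
}

// Same shape as Node's process.env, so tests can supply a plain object.
type Env = Record<string, string | undefined>;

const DEFAULT_TIMEOUT_MS = 30000;

function loadConfig(env: Env): HarvestConfig {
  const projectId = env["FIREBASE_PROJECT"];
  if (projectId === undefined || projectId === "") {
    throw new Error("FIREBASE_PROJECT is not set");
  }

  // HARVEST_TIMEOUT_MS is an assumed, optional setting with a default.
  const rawTimeout = env["HARVEST_TIMEOUT_MS"];
  const timeoutMs =
    rawTimeout === undefined ? DEFAULT_TIMEOUT_MS : Number(rawTimeout);
  if (!isFinite(timeoutMs) || timeoutMs <= 0) {
    throw new Error("HARVEST_TIMEOUT_MS must be a positive number");
  }

  return { projectId, timeoutMs };
}

// In production code this would be called as loadConfig(process.env).
```

Passing the environment in as an argument is a design choice that fits the "unit test mocks" item: a test can exercise missing or malformed values without touching global state.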

Developer Notes

  • install the GCloud/Firebase CLI and set up an account
  • initial setup
npm install -g firebase-tools
npm install --prefix ./functions
sudo npm install -g typescript

Unit Test

npm test --prefix ./functions

Deploy

firebase deploy --token $FIREBASE_TOKEN --project $FIREBASE_PROJECT --only functions

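ERROR: Failed to launch chrome!

If Chrome fails to launch with this error on Debian or Ubuntu, install the shared libraries it depends on: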

sudo apt-get install \
gconf-service \
libasound2 \
libatk1.0-0 \
libatk-bridge2.0-0 \
libc6 \
libcairo2 \
libcups2 \
libdbus-1-3 \
libexpat1 \
libfontconfig1 \
libgcc1 \
libgconf-2-4 \
libgdk-pixbuf2.0-0 \
libglib2.0-0 \
libgtk-3-0 \
libnspr4 \
libpango-1.0-0 \
libpangocairo-1.0-0 \
libstdc++6 \
libx11-6 \
libx11-xcb1 \
libxcb1 \
libxcomposite1 \
libxcursor1 \
libxdamage1 \
libxext6 \
libxfixes3 \
libxi6 \
libxrandr2 \
libxrender1 \
libxss1 \
libxtst6 \
ca-certificates \
fonts-liberation \
libappindicator1 \
libnss3 \
lsb-release \
xdg-utils \
wget

Notes for IntelliJ Users

  • Use the Windows Subsystem for Linux (WSL), install Node.js there, and select it under "Settings > Languages and Frameworks > Node.JS and NPM > Node Interpreter: Ubuntu"
  • Set the language version under "Settings > Languages and Frameworks > JavaScript > JavaScript Language Version"
