Website Scrape-And-Deploy

This script scrapes all files of a website that you host locally. You can then use the AWS CLI to upload the resulting static files to an S3 bucket.

Configuration

Rename scrapy.cfg.example to scrapy.cfg.
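On a Unix-like shell this can be done from the repository root, for example:

mv scrapy.cfg.example scrapy.cfg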

Execute

First crawl all the pages:

scrapy crawl web -a root_url=https://www.data-blogger.com/ -a output_path=/media/sf_Ubuntu/website-scrape-and-deploy/output/ -a exclude=/oembed/

Here, the root URL is passed via root_url. The sitemap.xml, robots.txt and index.html files are crawled automatically. The output_path argument specifies where the crawled files are stored, and the optional exclude argument skips matching URLs (here: URLs containing /oembed/).
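In generic form the invocation looks as follows; the angle-bracket placeholders are not real values and should be replaced with your own site, output directory and exclusion pattern:

scrapy crawl web -a root_url=<root-url> -a output_path=<output-directory> -a exclude=<exclude-pattern>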

Then, upload the crawled files to an AWS S3 bucket configured for static website hosting:

aws s3 cp /media/sf_Ubuntu/website-scrape-and-deploy/output/ s3://www.data-blogger.com --recursive
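As an alternative (not part of the original workflow, just a common AWS CLI option), aws s3 sync uploads only the files that changed and can delete remote files that no longer exist locally:

aws s3 sync /media/sf_Ubuntu/website-scrape-and-deploy/output/ s3://www.data-blogger.com --delete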

And last but not least, if you serve the bucket through CloudFront, you can invalidate its cache:

aws cloudfront create-invalidation --distribution-id EQNYJUCRR9HHL --paths "/*"
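If you do not know your distribution ID, you can look it up with the AWS CLI; the --query expression below is just one way to narrow the output:

aws cloudfront list-distributions --query "DistributionList.Items[].{Id:Id, Domain:DomainName}" --output table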

In summary, the following one-liner generates the static website pages, uploads them to AWS S3 and invalidates the CloudFront cache:

scrapy crawl web -a root_url=https://www.data-blogger.com/ -a output_path=/media/sf_Ubuntu/website-scrape-and-deploy/output/ -a exclude=/oembed/ && aws s3 cp /media/sf_Ubuntu/website-scrape-and-deploy/output/ s3://www.data-blogger.com --recursive && aws cloudfront create-invalidation --distribution-id EQNYJUCRR9HHL --paths "/*"
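For repeated deployments, the same three steps can be wrapped in a small shell script (a sketch using the example values from above; replace the variables with your own site, output directory, bucket and distribution ID):

#!/usr/bin/env bash
# deploy.sh - scrape the locally hosted site and deploy it to S3/CloudFront
set -euo pipefail

ROOT_URL="https://www.data-blogger.com/"                         # site to crawl
OUTPUT_PATH="/media/sf_Ubuntu/website-scrape-and-deploy/output/" # local output directory
BUCKET="s3://www.data-blogger.com"                               # S3 static website bucket
DISTRIBUTION_ID="EQNYJUCRR9HHL"                                  # CloudFront distribution

# Crawl the pages, upload the result and invalidate the CloudFront cache
scrapy crawl web -a root_url="$ROOT_URL" -a output_path="$OUTPUT_PATH" -a exclude=/oembed/
aws s3 cp "$OUTPUT_PATH" "$BUCKET" --recursive
aws cloudfront create-invalidation --distribution-id "$DISTRIBUTION_ID" --paths "/*"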
