crawler

Crawl book data from a book website with scrapy & splash

Installation

  • Install local packages

    $ pip3 install -r requirements.txt

    There are two packages to install: scrapy and scrapy-splash. Also make sure Splash is installed and running on port 8050 (the port used in the Docker command and SPLASH_URL below). Check out the Splash installation guide: Installation splash linux+docker

    Run Splash with Docker

    $ docker run -it -p 8050:8050 --rm scrapinghub/splash
  • Bootstrap a scrapy project

    $ scrapy startproject crawler
  • Update settings.py for scrapy-splash

    SPLASH_URL = 'http://127.0.0.1:8050'
    DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
    HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
    COOKIES_ENABLED = True 
    SPLASH_COOKIES_DEBUG = False
    SPIDER_MIDDLEWARES = {
        'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
    }
    DOWNLOADER_MIDDLEWARES = {
        'scrapy_splash.SplashCookiesMiddleware': 723,
        'scrapy_splash.SplashMiddleware': 725,
        'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
        'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': 400,
    }
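
With these settings in place, a spider can yield SplashRequest objects instead of plain scrapy.Request so that pages are rendered by Splash before parsing. Below is a minimal sketch of such a spider; the spider name, start URL, and CSS selectors are illustrative placeholders, not part of this project.

    # crawler/spiders/books.py -- minimal scrapy-splash spider sketch
    import scrapy
    from scrapy_splash import SplashRequest


    class BooksSpider(scrapy.Spider):
        name = 'books'
        # Placeholder URL; replace with the actual book website
        start_urls = ['http://example.com/books']

        def start_requests(self):
            for url in self.start_urls:
                # Render the page through Splash so JavaScript-generated
                # content is present in the response
                yield SplashRequest(url, self.parse, args={'wait': 1.0})

        def parse(self, response):
            # Placeholder selectors; inspect the target site's markup
            for book in response.css('.book-item'):
                yield {
                    'title': book.css('.title::text').get(),
                    'price': book.css('.price::text').get(),
                }

Assuming the spider is named books, it can then be run from the project directory with, for example, scrapy crawl books -o books.json to write the scraped items to a JSON file.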