Migrate scrapy to headless-chrome? #118

Open
fbuchinger opened this issue Apr 25, 2017 · 6 comments
Comments

@fbuchinger

A few weeks ago, the Chromium project announced headless Chromium as a new, clean way to open websites in a non-UI server context.

The announcement had quite an impact on the headless-browser scene and led to the resignation of the PhantomJS maintainer.

Since the current WebKit engine of Splash dates back to 2013, I wanted to know whether there are any plans to port Splash to headless Chrome.

@bufrr

bufrr commented May 27, 2017

will take a lot of work, i guess

@kmike
Member

kmike commented Jun 27, 2017

WebKit has been upgraded to a much more recent version in Splash master (~mid-2016 Safari), and will be upgraded further (to WebKit trunk) in the future, thanks to https://github.com/annulen/webkit. You can use the scrapinghub/splash:master Docker image to try the changes, or wait for the Splash 3.0 release.

Switching to Headless Chromium would be a huge change indeed. We don't have the engineering resources to make this switch in the near future. Also, it may be easier to create a separate Scrapy + Headless Chromium integration project.

Switching to Headless Chromium has both advantages and disadvantages; it seems there are more advantages. But some Splash features can't be implemented in Headless Chromium AFAIK - e.g. per-request proxy options are impossible if I'm not mistaken - this feature is nice to have e.g. for Crawlera integration, to avoid using Crawlera for static resources.
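For readers unfamiliar with the per-request proxy feature mentioned above, here is a minimal sketch of how it looks from the client side. It assumes a Splash instance listening on `localhost:8050` and uses the `proxy` argument of the `render.html` endpoint; the helper name `splash_render_url` and the proxy address are hypothetical and only serve as illustration.

```python
from urllib.parse import urlencode


def splash_render_url(page_url, proxy=None, splash_base="http://localhost:8050"):
    """Build a Splash render.html URL, optionally routing this one request via a proxy.

    splash_base is an assumption (a locally running Splash instance).
    """
    params = {"url": page_url}
    if proxy is not None:
        # The proxy is chosen per request, so other requests can go out directly.
        params["proxy"] = proxy
    return f"{splash_base}/render.html?{urlencode(params)}"


# Hypothetical proxy address, for illustration only.
proxied = splash_render_url("http://example.com/page", proxy="http://proxy.example:8010")
direct = splash_render_url("http://example.com/static/style.css")
```

Because the proxy is just another request argument, a crawler can decide per URL whether the request should go through the proxy at all.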

@fbuchinger
Author

Thanks! Will try the master container to see if I can get around my scraping issues.

@fbuchinger
Author

I got the following error when pulling the master Docker image:

$ docker pull scrapinghub/splash:master
master: Pulling from scrapinghub/splash
75c416ea735c: Pulling fs layer
c6ff40b6d658: Pulling fs layer
a7050fc1f338: Pulling fs layer
f0ffb5cf6ba9: Waiting
be232718519c: Waiting
02e48393bcae: Waiting
a699b90bbc99: Waiting
41da8db2bf8f: Waiting
ba57071e497d: Waiting
55c87f8bb02f: Waiting
error pulling image configuration: Get https://dseasb33srnrn.cloudfront.net/registry-v2/docker/registry/v2/blobs/sha256/b3/b3f69a08d665f155a61dad4b436c4112f758036e2e5a1d4f97658707829b0d48/data?Expires=1498738729&Signature=BUj4fCBuoG2MDqovD89-hQ4UarCvnxIKG7qce0gkS6TC67GLSSR6fw2E1R7anC1iCyiaiA44tIniU0mtA1~HAVhlHjC73iQc3Zj45ZStlPdSpOutmc4YEsOum33hbxG1Hox53J0CYatrXkOsHyzLqgyKXeU45QVab-Q7Kt2lVrE_&Key-Pair-Id=APKAJECH5M7VWIS5YZ6Q: read tcp 10.0.2.15:46376->13.32.28.215:443: read: connection reset by peer

@kmike
Member

kmike commented Jun 29, 2017

Could you try it again? It looks like a temporary issue - either a dockerhub issue, or a network issue.
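If the failure is indeed transient, simply retrying the pull a few times is usually enough. Below is a minimal sketch of such a retry loop; the function name `pull_with_retry`, the attempt count and the delays are assumptions chosen for illustration, and the only external command used is `docker pull`.

```python
import subprocess
import time


def pull_with_retry(image, attempts=3, delay=5.0, run=subprocess.run, sleep=time.sleep):
    """Run `docker pull <image>`, retrying on failure with a growing delay.

    `run` and `sleep` are injectable so the loop can be exercised without Docker.
    Returns True once a pull succeeds, False if every attempt fails.
    """
    for attempt in range(1, attempts + 1):
        result = run(["docker", "pull", image])
        if result.returncode == 0:
            return True
        if attempt < attempts:
            # Wait a little longer after each failed attempt.
            sleep(delay * attempt)
    return False


# Example call (not executed here):
# pull_with_retry("scrapinghub/splash:master")
```

A persistent failure after several attempts would point to a local network or registry configuration problem rather than a temporary outage.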

@fbuchinger
Author

We've now successfully tested Splash 3.0 and are really impressed: the execution time of our scraping jobs (running layoutstats.js on ~120 URLs) dropped from approx. 75 minutes to just 25 minutes :-) Taking screenshots also seems to work more reliably now. Big kudos to you and the guys behind the "Chromium 2016" port!
