Migrate scrapy to headless-chrome? #118

fbuchinger · 2017-04-25T16:06:29Z

A few weeks ago, the Chromium project announced headless Chromium as a new, clean way to open websites in a non-UI server context.

The announcement had quite an impact on the headless-browser scene and resulted in the resignation of the PhantomJS maintainer.

Since the current WebKit engine of Splash dates back to 2013, I wanted to ask: are there any plans to port Splash to headless Chrome?

bufrr · 2017-05-27T02:50:33Z

Will take a lot of work, I guess.

kmike · 2017-06-27T20:58:43Z

WebKit is upgraded to a much more recent version in Splash master (~mid-2016 Safari), and will be upgraded further (to WebKit trunk) in the future, thanks to https://github.com/annulen/webkit. You can use the scrapinghub/splash:master Docker image to try the changes, or wait for the Splash 3.0 release.

Switching to Headless Chromium would be a huge change indeed. We don't have the engineering resources to make this switch in the near future. Also, it may be easier to create a separate Scrapy + Headless Chromium integration project.

Switching to Headless Chromium has both advantages and disadvantages; it seems there are more advantages. But some Splash features can't be implemented in Headless Chromium AFAIK. For example, per-request proxy options are impossible, if I'm not mistaken. This feature is nice to have for Crawlera integration, e.g. to avoid using Crawlera for static resources.

fbuchinger · 2017-06-28T10:49:59Z

Thanks! Will try the master container to see if I can get around my scraping issues.

fbuchinger · 2017-06-29T12:01:22Z

Got the following error when trying to pull the master Docker image:

$ docker pull scrapinghub/splash:master
master: Pulling from scrapinghub/splash
75c416ea735c: Pulling fs layer
c6ff40b6d658: Pulling fs layer
a7050fc1f338: Pulling fs layer
f0ffb5cf6ba9: Waiting
be232718519c: Waiting
02e48393bcae: Waiting
a699b90bbc99: Waiting
41da8db2bf8f: Waiting
ba57071e497d: Waiting
55c87f8bb02f: Waiting
error pulling image configuration: Get https://dseasb33srnrn.cloudfront.net/registry-v2/docker/registry/v2/blobs/sha256/b3/b3f69a08d665f155a61dad4b436c4112f758036e2e5a1d4f97658707829b0d48/data?Expires=1498738729&Signature=BUj4fCBuoG2MDqovD89-hQ4UarCvnxIKG7qce0gkS6TC67GLSSR6fw2E1R7anC1iCyiaiA44tIniU0mtA1~HAVhlHjC73iQc3Zj45ZStlPdSpOutmc4YEsOum33hbxG1Hox53J0CYatrXkOsHyzLqgyKXeU45QVab-Q7Kt2lVrE_&Key-Pair-Id=APKAJECH5M7VWIS5YZ6Q: read tcp 10.0.2.15:46376->13.32.28.215:443: read: connection reset by peer

kmike · 2017-06-29T12:29:19Z

Could you try it again? It looks like a temporary issue - either a dockerhub issue, or a network issue.
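If the failure really is transient, retrying the pull a few times with a short pause is usually enough. A minimal sketch of that idea, assuming the `docker` CLI is on the PATH; the helper name `pull_with_retry` and its `attempts`/`delay` parameters are made up for illustration and are not part of Docker or Splash:

```python
import subprocess
import time


def pull_with_retry(image, attempts=3, delay=5.0,
                    run=subprocess.run, sleep=time.sleep):
    """Run `docker pull IMAGE`, retrying when the command exits non-zero.

    `run` and `sleep` are injectable only so the helper can be exercised
    without Docker; by default the real subprocess.run/time.sleep are used.
    Returns True once a pull succeeds, False after the last failed attempt.
    """
    for attempt in range(1, attempts + 1):
        result = run(["docker", "pull", image])
        if result.returncode == 0:
            return True
        if attempt < attempts:
            # Wait a little longer after each failure (linear backoff).
            sleep(delay * attempt)
    return False


if __name__ == "__main__":
    ok = pull_with_retry("scrapinghub/splash:master")
    print("pulled" if ok else "giving up")
```

This only covers failures that show up as a non-zero exit code from `docker pull`, such as the connection reset above; it does not try to tell network errors apart from a wrong image name.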

fbuchinger · 2017-07-16T19:52:35Z

We've now successfully tested Splash 3.0 and are really impressed: the execution time of our scraping jobs (running layoutstats.js on ~120 URLs) dropped from approx. 75 minutes to just 25 minutes :-) Taking screenshots also seems to work more reliably now. Big kudos to you and the guys behind the "Chromium 2016" port!

Gallaecio added the enhancement label Nov 21, 2019
