
Middleware settings for scrapy-splash with scrapy-cluster: SplashRequest does not work #101

Open
hustshawn opened this issue Jan 24, 2017 · 18 comments

Comments

@hustshawn

hustshawn commented Jan 24, 2017

In a single-node Scrapy project, the settings below, as your documentation indicates, work well.

# ====== Splash settings ======
SPLASH_URL = 'http://localhost:8050'
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
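
For reference, the spider issues Splash requests in the usual way (a minimal sketch; the spider name, target URL and callback are placeholders, not taken from my project):

import scrapy
from scrapy_splash import SplashRequest


class ExampleSpider(scrapy.Spider):
    name = 'example'

    def start_requests(self):
        # with the single-node settings above this reaches Splash and the
        # rendered HTML comes back in the response
        yield SplashRequest('http://example.com', callback=self.parse,
                            endpoint='render.html', args={'wait': 2})

    def parse(self, response):
        self.logger.info('rendered %d bytes', len(response.body))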

However, when I integrate with scrapy-cluster using the settings below, requests made with SplashRequest never seem to reach Splash, so Splash does not respond. Splash itself works fine when I access it directly with a URL constructed for the render.html endpoint.

SPIDER_MIDDLEWARES = {
    # disable built-in DepthMiddleware, since we do our own
    # depth management per crawl request
    'scrapy.spidermiddlewares.depth.DepthMiddleware': None,
    'crawling.meta_passthrough_middleware.MetaPassthroughMiddleware': 100,
    'crawling.redis_stats_middleware.RedisStatsMiddleware': 105,
    # the original priority 100 conflicts with MetaPassthroughMiddleware, so changed to 101
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 101,
}

DOWNLOADER_MIDDLEWARES = {
    # Handle timeout retries with the redis scheduler and logger
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
    'crawling.redis_retry_middleware.RedisRetryMiddleware': 510,
    # exceptions processed in reverse order
    'crawling.log_retry_middleware.LogRetryMiddleware': 520,
    # custom cookies to not persist across crawl requests
    'scrapy.downloadermiddlewares.cookies.CookiesMiddleware': None,
    # 'crawling.custom_cookies.CustomCookiesMiddleware': 700,
    # Scrapy-splash DOWNLOADER_MIDDLEWARES
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

# Scrapy-splash settings
SPLASH_URL = 'scrapy_splash:8050'
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'

Does anyone know what's going wrong with these settings?

@kmike
Member

kmike commented Jan 24, 2017

I think it could be related to the dupefilter used by crawling.distributed_scheduler.DistributedScheduler: this dupefilter uses the request_fingerprint function, which doesn't work correctly for Splash requests. The default fingerprint doesn't take request.meta values into account, while requests to Splash may differ only in request.meta until they are rewritten by the downloader middleware.
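
To make the mismatch concrete, here is a minimal sketch (it assumes scrapy-splash's splash_request_fingerprint is importable from scrapy_splash.dupefilter):

from scrapy.utils.request import request_fingerprint
from scrapy_splash import SplashRequest
from scrapy_splash.dupefilter import splash_request_fingerprint

# two requests that differ only in the Splash args stored in request.meta
r1 = SplashRequest('http://example.com', args={'wait': 0.5})
r2 = SplashRequest('http://example.com', args={'wait': 5.0})

# the stock fingerprint ignores request.meta, so a meta-unaware dupefilter
# (such as the one behind scrapy-cluster's DistributedScheduler) sees duplicates
assert request_fingerprint(r1) == request_fingerprint(r2)

# the Splash-aware fingerprint folds meta['splash'] in, so the requests differ
assert splash_request_fingerprint(r1) != splash_request_fingerprint(r2)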

@rksaxena

Facing the same issue.

@kmike
Member

kmike commented Feb 11, 2017

See also: istresearch/scrapy-cluster#94.
I'm not sure how it can be solved in scrapy-splash itself.

@wenxzhen

So scrapy-splash can't work with scrapy-cluster at the moment?

@kmike
Member

kmike commented Mar 27, 2017

Yes, it can't. Currently one has to fork and fix scrapy-cluster to make them work together.
An alternative is to use the Splash HTTP API directly, as shown at https://github.com/scrapy-plugins/scrapy-splash#why-not-use-the-splash-http-api-directly; I'm not completely sure, but it would likely work with scrapy-cluster.
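
A rough sketch of that direct approach with a plain scrapy.Request (the Splash address and spider are placeholders; since nothing Splash-specific is stored in request.meta, scrapy-cluster's stock fingerprinting should not get in the way):

import json

import scrapy

SPLASH_RENDER = 'http://localhost:8050/render.html'  # placeholder address


class DirectSplashSpider(scrapy.Spider):
    name = 'direct_splash'

    def start_requests(self):
        payload = {'url': 'http://example.com', 'wait': 2, 'timeout': 10}
        yield scrapy.Request(
            SPLASH_RENDER,
            method='POST',
            body=json.dumps(payload),
            headers={'Content-Type': 'application/json'},
            callback=self.parse,
        )

    def parse(self, response):
        # response.body is the HTML rendered by Splash
        self.logger.info('rendered %d bytes', len(response.body))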

@wenxzhen

Thanks to @kmike

Do you happen to know where the problem is?

@kmike
Member

kmike commented Mar 27, 2017

@wenxzhen I'm not a scrapy-cluster user myself, but the results of a quick look are in this comment: #101 (comment)

@wenxzhen

Thanks to @kmike.
After some investigation, I found that Python does not make it easy to serialize and deserialize class instances. Therefore, I turned to another approach:

  1. add a downloader middleware to populate some "splash" meta in the original Scrapy request
  2. in the Scrapy core downloader, when the "splash" meta is present, replace the Scrapy request with a new Request whose URL calls the Splash HTTP API directly

Now it works

@hustshawn
Author

@wenxzhen Could you please share some of the core code with us, or send a PR to this repo?

@wenxzhen

@hustshawn The basic idea is not to use the scrapy-splash components, but to make use of the functionality of scrapy-cluster + Scrapy.

The following is mainly a PoC, without optimization.

  1. We need to reuse the feeding capability of scrapy-cluster, so I add an extra "attrs" field to the JSON request:

python kafka_monitor.py feed '{"url": "https://www.test.com", "appid":"testapp", "crawlid":"09876abc", "useragent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36", "attrs": {"splash": "1"}, "spiderid": "test"}'

the "splash": 1 is to tell that the reuqest needs to go to Splash with Http API directly

  2. Add a downloader middleware to scrapy-cluster; in its process_request, if we detect the splash attribute, insert the necessary meta (fragment below; it assumes json, urljoin and Scrapy's Headers are imported and that the splash_* attributes are configured on the middleware):
        splash_meta = request.meta[self.splash_meta_name]

        args = splash_meta.setdefault('args', {})
        splash_url = urljoin(self.splash_base_url, self.default_endpoint)
        args.setdefault('splash_url', splash_url)

        # only the POST API to Splash is supported for now
        args.setdefault('http_method', 'POST')

        body = json.dumps({"url": request.meta['url'], "wait": 5, "timeout": 10}, sort_keys=True)
        args.setdefault('body', body)

        headers = Headers({'Content-Type': 'application/json'})
        args.setdefault('headers', headers)
  3. When the request arrives at the Scrapy downloader, in HTTP11DownloadHandler's download_request we need to replace the request:
    def download_request(self, request, spider):
        """Return a deferred for the HTTP download"""
        agent = ScrapyAgent(contextFactory=self._contextFactory, pool=self._pool,
                            maxsize=getattr(spider, 'download_maxsize', self._default_maxsize),
                            warnsize=getattr(spider, 'download_warnsize', self._default_warnsize))

        if "splash" in request.meta:
            # this is a Splash forward request: swap in the Splash endpoint,
            # method, body and headers prepared by the middleware above
            splash_args = request.meta['splash']['args']
            new_splash_request = request.replace(
                url=splash_args['splash_url'],
                method=splash_args['http_method'],
                body=splash_args['body'],
                headers=splash_args['headers'],
                priority=request.priority,
            )
            return agent.download_request(new_splash_request)
        else:
            return agent.download_request(request)
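
If you would rather not patch Scrapy's own handler, the same override can live in a subclass that Scrapy loads via the DOWNLOAD_HANDLERS setting; the module path and class name below are only an illustration:

# settings.py: register a hypothetical HTTP11DownloadHandler subclass that
# contains the download_request override above, instead of editing Scrapy core
DOWNLOAD_HANDLERS = {
    'http': 'crawling.splash_forward_handler.SplashForwardDownloadHandler',
    'https': 'crawling.splash_forward_handler.SplashForwardDownloadHandler',
}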

@hustshawn
Author

Got your idea. Thanks a lot. @wenxzhen

@Dgadavin

Dgadavin commented Apr 7, 2017

Could you please open a PR with this code? Parsing JS is a really useful feature.

@wenxzhen

We need to ask @kmike whether the 'basic' solution is acceptable or not. If yes, we can start the PR work.

@DreadfulDeveloper

@wenxzhen did you create a custom download handler to implement your solution, or did you modify HTTP11DownloadHandler directly?

@wenxzhen

I had to do both, as I also need to bypass the proxy to reach Splash.

@LazerJesus

@wenxzhen did you solve it? I also need to use a proxy together with Splash.

@wenxzhen

@FinnFrotscher check the code snippets above; I hope they help.

@Gallaecio
Contributor

It seems like scrapy/scrapy#900 could be a good first step towards fixing this.
