Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

set HttpCompressionMiddleware priority to 810 #1895

Open
kmike opened this issue Mar 30, 2016 · 3 comments
Open

set HttpCompressionMiddleware priority to 810 #1895

kmike opened this issue Mar 30, 2016 · 3 comments

Comments

@kmike
Copy link
Member

kmike commented Mar 30, 2016

Currently DOWNLOADER_MIDDLEWARE options looks like this:

DOWNLOADER_MIDDLEWARES_BASE = {
    # Engine side
    'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware': 100,
    'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware': 300,
    'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware': 350,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': 400,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 500,
    'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware': 550,
    'scrapy.downloadermiddlewares.ajaxcrawl.AjaxCrawlMiddleware': 560,
    'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware': 580,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 590,
    'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': 600,
    'scrapy.downloadermiddlewares.cookies.CookiesMiddleware': 700,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 750,
    'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware': 830,
    'scrapy.downloadermiddlewares.stats.DownloaderStats': 850,
    'scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware': 900,
    # Downloader side
}

I encountered a case where this order causes problems: I want to add custom headers to response based on its contents (load cookies from JSON response body and handle them using standard CookiesMiddleware). But because CookiesMiddleware.process_response is called before HttpCompressionMiddleware.process_response this is not possible with a default middleware order - body is not uncompressed yet.

This change will move HttpCompressionMiddleware after RedirectMiddleware (i.e. responses will be decompressed prior to redirect handling); it means Scrapy can do a bit more work for redirect responses. I don't think this is a problem because 301, 302, 303, 307 responses shouldn't have bodies.

@redapple
Copy link
Contributor

redapple commented Apr 4, 2016

@kmike , did you consider moving scrapy.downloadermiddlewares.cookies.CookiesMiddleware upwards towards the engine?
e.g.

DOWNLOADER_MIDDLEWARES_BASE = {
    # Engine side
    'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware': 100,
    'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware': 300,
    'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware': 350,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': 400,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 500,
    'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware': 550,
    'scrapy.downloadermiddlewares.ajaxcrawl.AjaxCrawlMiddleware': 560,
    'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware': 580,
    'scrapy.downloadermiddlewares.cookies.CookiesMiddleware': 585,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 590,
    'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': 600,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 750,
    'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware': 830,
    'scrapy.downloadermiddlewares.stats.DownloaderStats': 850,
    'scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware': 900,
    # Downloader side
}

@kmike
Copy link
Member Author

kmike commented Apr 26, 2016

@redapple it turns out this is not really related to CookiesMiddleware; scrapy-splash doesn't use it to handle cookies in the end, and handles cookies using a custom .

scrapy-splash uses original JSON response.body to set response.status, response.headers, response.url and a new response.body. Any other downloader middleware which wants to return a new response based on current response body must be put closer to Engine side than HttpCompressionMiddleware. Currently it means that RedirectMiddleware and CookiesMiddleware won't process such adjusted responses.

@redapple
Copy link
Contributor

As full-responses from transport layer are sent to the first middleware -- downloader-side, it does makes sense to have an uncompressed response to process soon enough.
If responses came in chunks to middlewares, it would make sense to have header-checking middleware act early (e.g. RedirectMiddleware), but that's not the case.

HttpCompressionMiddleware is very similar to ChunkedTransferMiddleware so using similar order does make sense.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants