
More clarity with encodings of SplashRequest #173

Closed

Conversation

whalebot-helmsman
Contributor

@whalebot-helmsman commented Apr 5, 2018

The problem is that the middleware treats any body as UTF-8 text, which is not always the case. This PR adds detection of the body encoding using the standard Scrapy mechanism.

As a side effect, this also fixes the following: we copy the headers of the base response, so after processing by the middleware a SplashResponse can end up with something like "Content-Type: text/html; charset=cp1251". That is wrong, because the body of a SplashResponse is always UTF-8 text.

This is handled for both cases: when we get "body" and when we get "html" from the Lua script.
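The detection the description refers to can be sketched with a stdlib-only approximation of what Scrapy/w3lib do (the real middleware relies on TextResponse.encoding, which does more than this; this is only an illustration):

```python
import re

def detect_encoding(headers, body):
    """Stdlib-only sketch of declared-encoding detection: prefer the
    charset from the Content-Type header, then an HTML meta declaration,
    then fall back to utf-8.  Scrapy's TextResponse.encoding also tries
    statistical detection, which is omitted here."""
    m = re.search(r"charset=([\w-]+)", headers.get("Content-Type", ""), re.I)
    if m:
        return m.group(1).lower()
    m = re.search(rb"""<meta[^>]+charset=["']?([\w-]+)""", body[:4096], re.I)
    if m:
        return m.group(1).decode("ascii").lower()
    return "utf-8"

body = "<html><body>Привет</body></html>".encode("cp1251")
enc = detect_encoding({"Content-Type": "text/html; charset=cp1251"}, body)
assert body.decode(enc) == "<html><body>Привет</body></html>"
```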

```diff
@@ -286,6 +286,32 @@ def test_magic_response():
     if c.name == 'spam':
         assert c.value == 'ham'

 resp_data = {
     'html': '<html><body>Hello</body></html>',
```
Contributor


I think it would be nice to check that a non-ASCII response body is correctly decoded.

Contributor Author


In the case of "html", it is the duty of the Lua script to handle encoding:

```python
self._cached_ubody = self._body.decode(TextResponse(url=self.url,
                                                    headers=self.headers,
                                                    body=self._body).encoding)
content_type = self.headers.get(b"Content-Type", b"text/plain; charset=utf-8").split(b";")[0]
```
Contributor


I think charset is not the only thing that can follow the ";" separator in Content-Type; one example is "Content-Type: multipart/form-data; boundary=something". So I think it would be better to replace only the charset=X part.
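For illustration, the stdlib's email.message.Message can parse Content-Type parameters without discarding them; this is a hedged alternative to split(b";")[0], not something the PR itself uses:

```python
from email.message import Message

# Content-Type can carry parameters other than charset, e.g. boundary.
msg = Message()
msg["Content-Type"] = "multipart/form-data; boundary=something"
mime_type = msg.get_content_type()    # bare media type, parameters dropped
boundary = msg.get_param("boundary")  # individual parameter, preserved

msg2 = Message()
msg2["Content-Type"] = "text/html; charset=cp1251"
charset = msg2.get_param("charset")
```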

Contributor


A more general question: as I understand it, the problem is that when Splash renders a non-UTF-8 page, the body is encoded using the original encoding, right? Then another option to solve this would be to preserve both the headers and the body as-is, as is done without Splash. Is that option worse than decoding the body and manipulating the headers?

Contributor Author

@whalebot-helmsman Apr 5, 2018


Rendering of non-UTF-8 pages works fine: the browser handles the encoding, and we always get UTF-8-encoded text. That is why we need to strictly set charset=utf-8 in the "html" case.
The problem arises when we do not render a non-UTF-8 page in the browser and instead use splash:http_get.

would be to preserve both the headers and body as is

I don't understand how that could be achieved without changing the current (user-expected) behaviour.

To handle this type of request in a special way in a spider, you would have to copy-paste all the corresponding code or monkey-patch the existing code. On the other hand, these changes do not affect current behaviour and add strictness to the way encoding is handled in the middleware.
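To make the failure mode concrete, here is an illustrative stdlib example (not scrapy-splash code): bytes fetched verbatim from a cp1251 page cannot simply be decoded as UTF-8.

```python
# A cp1251-encoded body, as splash:http_get would return it verbatim.
raw = "привет".encode("cp1251")

# Decoding it as UTF-8, as the middleware previously assumed, fails here
# (for other inputs it could instead silently produce mojibake).
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
```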

@codecov

codecov bot commented Apr 5, 2018

Codecov Report

Merging #173 into master will increase coverage by 0.2%.
The diff coverage is 100%.


@@            Coverage Diff            @@
##           master     #173     +/-   ##
=========================================
+ Coverage   92.67%   92.88%   +0.2%     
=========================================
  Files           9        9             
  Lines         587      604     +17     
  Branches      118      121      +3     
=========================================
+ Hits          544      561     +17     
  Misses         21       21             
  Partials       22       22
Impacted Files Coverage Δ
scrapy_splash/response.py 94.91% <100%> (+0.85%) ⬆️

Last update e40ca4f...04feea6.

@whalebot-helmsman
Contributor Author
whalebot-helmsman commented Apr 5, 2018

Closed and reopened the request to trigger a Travis build; Travis had network problems during the last run.

@whalebot-helmsman
Contributor Author

Added a detailed description of the proposed changes.

@lopuhin
Contributor

lopuhin commented Apr 6, 2018

Thanks for the detailed description, I think I get it now. As far as I understand, this fix will simplify handling of scripts which return a splash:http_get result, where the body has its original encoding, which should match the encoding declared in the headers.

I'm a little worried that this could break someone's script, but so far I don't have a realistic case in mind. The original headers are changed, and an encoding declaration is added even if there was no such declaration; I can see that this could possibly break something, but I'm not sure.

Another question is why we decode the body into UTF-8 in the first place. Why can't we just keep it binary and leave the headers as-is? @kmike do you know?

Added this for both cases - when we get "body" and "html" from lua script.

@whalebot-helmsman I don't see how this changes the "html" handling. I don't think we should change it, but maybe I'm missing something?

@whalebot-helmsman
Contributor Author

For the "html" case, only the headers are changed.

@lopuhin
Contributor

lopuhin commented Apr 6, 2018

Re _replace_charset, I wonder if it could be simplified with a simple re.sub, and maybe we don't need to add an encoding declaration if there was none in the first place, but I'd like to get feedback from @kmike first.
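A re.sub-based sketch of this idea (a hypothetical helper; the PR's actual _replace_charset may differ), which also leaves the header untouched when no charset was declared:

```python
import re

def replace_charset(content_type: bytes, charset: bytes = b"utf-8") -> bytes:
    # Swap only an existing charset=... parameter; keep other parameters
    # (e.g. boundary=...) and headers without a declaration intact.
    new_value, n = re.subn(rb"charset=[^;\s]+", b"charset=" + charset, content_type)
    return new_value if n else content_type

assert replace_charset(b"text/html; charset=cp1251") == b"text/html; charset=utf-8"
```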

@kmike
Member

kmike commented Apr 6, 2018

Hey! I think the PR fixes a real issue: we shouldn't decode 'body' using utf-8; we should use Scrapy's encoding detection features instead.

why can't we just keep it in binary and leave headers as is

Do you mean that we can leave ._cached_ubody creation to standard Scrapy features (lazily decode it in the .text attribute), and just make sure the headers and body are set properly for such detection? That makes sense to me. I don't understand why we have to replace the Content-Type header.

@kmike
Member

kmike commented Apr 6, 2018

It would also be good to have an integration test for this (using a script with splash:http_get), instead of low-level tests which check middleware behavior. These middleware tests are

  • hard to maintain,
  • unreadable: it is hard to understand if the behavior is correct, and
  • unreliable - they don't emulate all components, and don't check concrete use cases, which is important, as there is a lot of interactions between different components.

It is my fault that we have a lot of them; integration tests were added only recently. The tests in this PR are not as bad as some of the other existing tests, but I'd trust an integration test much more.

@kmike
Member

kmike commented Apr 6, 2018

This PR is also a good chance to clarify "html" vs "body" in the README: if html is returned, there is no encoding detection (meta declarations and Content-Type headers are ignored, since html is already unicode; response.body is always UTF-8, though I'm not sure that is always great), while with "body" we assume that meta tags and HTTP headers match the content.
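The distinction could be illustrated in the README roughly like this (hypothetical data shapes for illustration, not actual scrapy-splash objects):

```python
# "html": Splash's browser has already decoded the page, so the script
# returns unicode text; it is re-encoded as UTF-8 for response.body.
rendered = {"html": "<html><body>привет</body></html>"}
assert isinstance(rendered["html"], str)

# "body" (e.g. from splash:http_get): raw bytes in the original
# encoding, which must match the headers/meta declarations.
fetched = {
    "body": "привет".encode("cp1251"),
    "headers": {"Content-Type": "text/html; charset=cp1251"},
}
assert fetched["body"].decode("cp1251") == "привет"
```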

@lopuhin
Contributor

lopuhin commented Apr 6, 2018

why can't we just keep it in binary and leave headers as is

Do you mean that we can leave ._cached_ubody creation to standard Scrapy features (lazy-decode it in .text attribute), and just make sure headers and body are set properly for such detection? It makes sense to me. I don't understand why do we have to replace Content-Type header.

Yes. If this is possible for scrapy-splash, it seems that it would allow removing both the header replacement and the encoding-detection call.

@whalebot-helmsman
Contributor Author

Yes - if this is possible for scrapy-splash, it seems that this would allow to remove header replacement and calling encoding detection.

At the same time, the encoding of every SplashJsonResponse is set to utf-8.
