
More clarity with encodings of SplashRequest #173

Closed

Conversation

whalebot-helmsman
Contributor

@whalebot-helmsman commented Apr 5, 2018

The problem is that the middleware treats any body as UTF-8 text, which is not always the case. This PR adds detection of the body encoding using the standard Scrapy mechanism.

As a side effect, this also fixes the following: we copy the headers of the base response, so after processing by the middleware a SplashResponse can end up with something like "Content-Type: text/html; charset=cp1251". That is wrong, because the body of a SplashResponse is always UTF-8 text.

This is handled for both cases: when we get "body" and when we get "html" from the Lua script.
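The detection the description refers to can be sketched with a stdlib-only approximation of what Scrapy/w3lib do (the real middleware relies on TextResponse.encoding, which does more than this; this is only an illustration):

```python
import re

def detect_encoding(headers, body):
    """Stdlib-only sketch of declared-encoding detection: prefer the
    charset from the Content-Type header, then an HTML meta declaration,
    then fall back to utf-8.  Scrapy's TextResponse.encoding also tries
    statistical detection, which is omitted here."""
    m = re.search(r"charset=([\w-]+)", headers.get("Content-Type", ""), re.I)
    if m:
        return m.group(1).lower()
    m = re.search(rb"""<meta[^>]+charset=["']?([\w-]+)""", body[:4096], re.I)
    if m:
        return m.group(1).decode("ascii").lower()
    return "utf-8"

body = "<html><body>Привет</body></html>".encode("cp1251")
enc = detect_encoding({"Content-Type": "text/html; charset=cp1251"}, body)
assert body.decode(enc) == "<html><body>Привет</body></html>"
```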

```diff
@@ -286,6 +286,32 @@ def test_magic_response():
     if c.name == 'spam':
         assert c.value == 'ham'

 resp_data = {
     'html': '<html><body>Hello</body></html>',
```
Contributor


I think it would be nice to check that a non-ASCII response body is correctly decoded.

Contributor Author


In the case of "html", it is the duty of the Lua script to handle encoding:

```python
self._cached_ubody = self._body.decode(TextResponse(url=self.url,
                                                    headers=self.headers,
                                                    body=self._body).encoding)
content_type = self.headers.get(b"Content-Type", b"text/plain; charset=utf-8").split(b";")[0]
```
Contributor


I think charset is not the only thing that can follow the ";" separator in Content-Type; one example is "Content-Type: multipart/form-data; boundary=something". So I think it would be better to replace only the charset=X part.
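For illustration, the stdlib's email.message.Message can parse Content-Type parameters without discarding them; this is a hedged alternative to split(b";")[0], not something the PR itself uses:

```python
from email.message import Message

# Content-Type can carry parameters other than charset, e.g. boundary.
msg = Message()
msg["Content-Type"] = "multipart/form-data; boundary=something"
mime_type = msg.get_content_type()    # bare media type, parameters dropped
boundary = msg.get_param("boundary")  # individual parameter, preserved

msg2 = Message()
msg2["Content-Type"] = "text/html; charset=cp1251"
charset = msg2.get_param("charset")
```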

Contributor


A more general question: as I understand it, the problem is that when Splash renders a non-UTF-8 page, the body is encoded using the original encoding, right? Then another option to solve this would be to preserve both the headers and the body as-is, as is done without Splash. Is that option worse than decoding the body and manipulating the headers?

Contributor Author

@whalebot-helmsman Apr 5, 2018


Rendering of non-UTF-8 pages works fine: the browser handles the encoding, and we always get UTF-8-encoded text. That is why we need to strictly set charset=utf-8 in the "html" case.
The problem arises when we do not render a non-UTF-8 page in the browser and instead use splash:http_get.

would be to preserve both the headers and body as is

I don't understand how that could be achieved without changing the current (user-expected) behaviour.

To handle this type of request in a special way in a spider, you would have to copy-paste all the corresponding code or monkey-patch the existing code. On the other hand, these changes do not affect current behaviour and add strictness to the way encoding is handled in the middleware.
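To make the failure mode concrete, here is an illustrative stdlib example (not scrapy-splash code): bytes fetched verbatim from a cp1251 page cannot simply be decoded as UTF-8.

```python
# A cp1251-encoded body, as splash:http_get would return it verbatim.
raw = "привет".encode("cp1251")

# Decoding it as UTF-8, as the middleware previously assumed, fails here
# (for other inputs it could instead silently produce mojibake).
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
```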

@codecov

codecov bot commented Apr 5, 2018

Codecov Report

Merging #173 into master will increase coverage by 0.2%.
The diff coverage is 100%.


@@            Coverage Diff            @@
##           master     #173     +/-   ##
=========================================
+ Coverage   92.67%   92.88%   +0.2%     
=========================================
  Files           9        9             
  Lines         587      604     +17     
  Branches      118      121      +3     
=========================================
+ Hits          544      561     +17     
  Misses         21       21             
  Partials       22       22
Impacted Files Coverage Δ
scrapy_splash/response.py 94.91% <100%> (+0.85%) ⬆️

Last update e40ca4f...04feea6.

@whalebot-helmsman
Contributor Author
whalebot-helmsman commented Apr 5, 2018

Closed and reopened the request to trigger a Travis build; Travis had network problems during the last run.

@whalebot-helmsman
Contributor Author

Added a detailed description of the proposed changes.

@lopuhin
Contributor

lopuhin commented Apr 6, 2018

Thanks for the detailed description, I think I get it now. As far as I understand, this fix will simplify handling of scripts which return a splash:http_get result, where the body has its original encoding, which should match the encoding declared in the headers.

I'm a little worried that this could break someone's script, but so far I don't have a realistic case in mind. The original headers are changed, and an encoding declaration is added even if there was no such declaration; I can see that this could possibly break something, but I'm not sure.

Another question is why we decode the body into UTF-8 in the first place. Why can't we just keep it binary and leave the headers as-is? @kmike do you know?

Added this for both cases - when we get "body" and "html" from lua script.

@whalebot-helmsman I don't see how this changes the "html" handling. I don't think we should change it, but maybe I'm missing something?

@whalebot-helmsman
Contributor Author

For the "html" case, only the headers are changed.

@lopuhin
Contributor

lopuhin commented Apr 6, 2018

Re _replace_charset, I wonder if it could be simplified with a simple re.sub, and maybe we don't need to add an encoding declaration if there was none in the first place, but I'd like to get feedback from @kmike first.
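A re.sub-based sketch of this idea (a hypothetical helper; the PR's actual _replace_charset may differ), which also leaves the header untouched when no charset was declared:

```python
import re

def replace_charset(content_type: bytes, charset: bytes = b"utf-8") -> bytes:
    # Swap only an existing charset=... parameter; keep other parameters
    # (e.g. boundary=...) and headers without a declaration intact.
    new_value, n = re.subn(rb"charset=[^;\s]+", b"charset=" + charset, content_type)
    return new_value if n else content_type

assert replace_charset(b"text/html; charset=cp1251") == b"text/html; charset=utf-8"
```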

@kmike
Member

kmike commented Apr 6, 2018

Hey! I think the PR fixes a real issue: we shouldn't decode 'body' using utf-8; we should use Scrapy's encoding detection features instead.

why can't we just keep it in binary and leave headers as is

Do you mean that we can leave ._cached_ubody creation to standard Scrapy features (lazily decode it in the .text attribute), and just make sure the headers and body are set properly for such detection? That makes sense to me. I don't understand why we have to replace the Content-Type header.

@kmike
Member

kmike commented Apr 6, 2018

It would also be good to have an integration test for this (using a script with splash:http_get), instead of low-level tests which check middleware behavior. These middleware tests are

  • hard to maintain,
  • unreadable: it is hard to understand if the behavior is correct, and
  • unreliable - they don't emulate all components, and don't check concrete use cases, which is important, as there is a lot of interactions between different components.

It is my fault that we have a lot of them; integration tests were added only recently. The tests in this PR are not as bad as some of the other existing tests, but I'd trust an integration test much more.

@kmike
Member

kmike commented Apr 6, 2018

This PR is also a good chance to clarify "html" vs "body" in the README: if html is returned, there is no encoding detection (meta declarations and Content-Type headers are ignored, since html is already unicode; response.body is always UTF-8, though I'm not sure that is always great), while with "body" we assume that meta tags and HTTP headers match the content.
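The distinction could be illustrated in the README roughly like this (hypothetical data shapes for illustration, not actual scrapy-splash objects):

```python
# "html": Splash's browser has already decoded the page, so the script
# returns unicode text; it is re-encoded as UTF-8 for response.body.
rendered = {"html": "<html><body>привет</body></html>"}
assert isinstance(rendered["html"], str)

# "body" (e.g. from splash:http_get): raw bytes in the original
# encoding, which must match the headers/meta declarations.
fetched = {
    "body": "привет".encode("cp1251"),
    "headers": {"Content-Type": "text/html; charset=cp1251"},
}
assert fetched["body"].decode("cp1251") == "привет"
```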

@lopuhin
Contributor

lopuhin commented Apr 6, 2018

why can't we just keep it in binary and leave headers as is

Do you mean that we can leave ._cached_ubody creation to standard Scrapy features (lazy-decode it in .text attribute), and just make sure headers and body are set properly for such detection? It makes sense to me. I don't understand why do we have to replace Content-Type header.

Yes. If this is possible for scrapy-splash, it seems that it would allow removing both the header replacement and the encoding-detection call.

@whalebot-helmsman
Contributor Author

Yes - if this is possible for scrapy-splash, it seems that this would allow to remove header replacement and calling encoding detection.

At the same time, the encoding of every SplashJsonResponse is set to utf-8.
