-
Notifications
You must be signed in to change notification settings - Fork 451
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
More clarity with encodings of SplashRequest #173
Conversation
tests/test_middleware.py
Outdated
@@ -286,6 +286,32 @@ def test_magic_response(): | |||
if c.name == 'spam': | |||
assert c.value == 'ham' | |||
|
|||
resp_data = { | |||
'html': '<html><body>Hello</body></html>', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be nice to check that non-ascii response body is correctly decoded
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In case of "html" it is a duty of lua script to handle encoding
scrapy_splash/response.py
Outdated
self._cached_ubody = self._body.decode(TextResponse(url=self.url, | ||
headers=self.headers, | ||
body=self._body).encoding) | ||
content_type = self.headers.get(b"Content-Type", b"text/plain; charset=utf-8").split(b";")[0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think charset is not the only thing that can follow the ;
separator in content-type, one example is Content-Type: multipart/form-data; boundary=something
, so I think it would be better to replace only the charset=X
part.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A more general question - as I understand, the problem is that when splash renders a non-utf8 page, the body is encoded using original encoding, right? Then another option to solve this problem would be to preserve both the headers and body as is, like done without splash - is this option worse than decoding body and manipulating headers?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rendering of non-utf8 pages works fine. Browser handles encoding. We always get utf-encoded text. That's why we need to strictly set charset=utf-8
in case of html.
Problem is when we do not render non-utf8 page at browser and use splash:http_get
.
would be to preserve both the headers and body as is
I don't understand how this can be achieved not changing current(expected by users) behaviour?
To add special way to handle this type of requests in spider you need to copy-paste all corresponding code or monkey patch existent one. From other point of view, this changes does not affect current behaviour and add strictness to way encoding handled in middleware.
Codecov Report
@@ Coverage Diff @@
## master #173 +/- ##
=========================================
+ Coverage 92.67% 92.88% +0.2%
=========================================
Files 9 9
Lines 587 604 +17
Branches 118 121 +3
=========================================
+ Hits 544 561 +17
Misses 21 21
Partials 22 22
Continue to review full report at Codecov.
|
Close/open request to trigger travis build. They have network problems during last run |
Added detailed description for proposed changes |
Thanks for detailed description, I think I'm getting it now. As far as I understand, this fix will simplify handling of scripts which return I'm a little worried if this could break someone's script... but so far I don't have a realistic case in mind. Original headers are changed, and encoding declaration is added even if there was no such declaration - I see that this could possibly break something, but I'm not sure. Another question is why do we decode body into utf-8 in the first place, why can't we just keep it in binary and leave headers as is. @kmike do you know?
@whalebot-helmsman I don't see how this changes "html" handling - I don't think we should change it, but maybe I'm missing something? |
For html case only headers are changed |
Re |
Hey! I think the PR fixes a real issue: we shouldn't decode 'body' using utf-8, we should use Scrapy's encoding detection features instead.
Do you mean that we can leave |
It'd be also good to have an integration test for it (use a script with splash:http_get), instead of low-level tests which check middleware behavior. These middleware tests are
It is my fault we have a lot of them, integration tests were added only recently. Tests in this PR are not as bad as some of other existing tests, but I'd trust an integration test much more. |
This PR is also a good chance to clarify "html" vs "body" in README: if html is returned, there is no encoding detection (meta declarations and Content-Type headers are ignored, as html is already unicode; response.body is always utf-8 - though I'm not sure it is always great), while with "body" we assume that meta tags and http headers match the content. |
Yes - if this is possible for scrapy-splash, it seems that this would allow to remove header replacement and calling encoding detection. |
Encoding for every SplashJsonResponse is set to |
04feea6
to
e40ca4f
Compare
Problem is middleware treat any body as utf-8 text, this is not always a case. Added detection of body encoding using standard scrapy mechanism.
By the way fixes:
We copy headers of base response and we can get something like this "Content-Type":"text/html;charset=cp1251" after processing by middleware for SplashResponse. It is wrong as body of SplashResponse is always a utf-8 text.
Added this for both cases - when we get "body" and "html" from lua script.