Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow spider attr #15

Open
wants to merge 7 commits into
base: master
Choose a base branch
from
Open
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Prev Previous commit
Next Next commit
[middleware] return HtmlResponse to spider
  • Loading branch information
pawelmhm committed Apr 3, 2015
commit 9efd145d0b7c6b7d5e743948ecbb124aa2b9def5
18 changes: 15 additions & 3 deletions scrapyjs/middleware.py
Original file line number Diff line number Diff line change
Expand Up @@ -6,6 +6,7 @@
from scrapy.exceptions import NotConfigured

from scrapy import log
from scrapy.http.response.html import HtmlResponse

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like unused import

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point, this is fixed now

from scrapy.http.headers import Headers


Expand Down Expand Up @@ -107,20 +108,31 @@ def process_request(self, request, spider):
# are not respected.
headers=Headers({'Content-Type': 'application/json'}),
)

self.crawler.stats.inc_value('splash/%s/request_count' % endpoint)
return req_rep

def process_response(self, request, response, spider):
splash_options = request.meta.get("_splash_processed")
splash_options = self.get_splash_options(request, spider)
if splash_options:
endpoint = splash_options['endpoint']
self.crawler.stats.inc_value(
'splash/%s/response_count/%s' % (endpoint, response.status)
)

response = self.html_response(response, request)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we shouldn't do it. #12 is a better way; basic splash requests should be as barebone as possible.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also think that by default response should be returned without changes, but if we will have option to enable html response and use scrapyjs transparently i.e. getting rendered html without handling this in callback - that would be very, very convenient, e.g. you can use response.css(), response.xpath() and response.body right away - just like if request was made to plain html page.

I don't know if you already spent some time working on #12 and description doesn't provide much information. How it would work? You say it would be like FormRequest, but this has nothing to do with responses. Can you share a bit of your thoughts? One thing that came to mind is SplashRequest and SplashHtmlRequest classes where SplashRequest will return plain response from splash and SplashHtmlRequest will return rendered html along with headers and cookies returned by target site. But I'm not sure if my insight is correct.

If you insist - I think this part can be removed, we can override scrapyjs and add this inside the project. @pawelmhm ?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also think that by default response should be returned without changes, but if we will have option to enable html response and use scrapyjs transparently i.e. getting rendered html without handling this in callback

sure we can make it optional and just make it conditional on existence of some key either in meta or in spider attributes. What do you think @kmike ?

IMO rendering HTML will be very common thing that people will do so without handling this in base middleware most users will have to write code to generate html response themselves.

return response

def html_response(self, response, request):
"""Give user nice HTML response he probably
expects.
"""
data = json.loads(response.body)
html = data.get("html")

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe html = data.pop('html', None)? Is there any reason to keep html in meta in this case?

also it might be good idea to have 'splash': {} option to enable/disable this behavior, like 'splash': {'html_response': True}. or 'transparent'. or any better name.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes having option to disable-enable this behavior sounds like a good idea

Is there any reason to keep html in meta in this case?

I think not.

maybe html = data.pop('html', None)?

yes sounds good

if not html:
return response

return HtmlResponse(data["url"], body=html, encoding='utf8',
status=response.status, request=request)

def _set_download_slot(self, request, meta, slot_policy):
if slot_policy == SlotPolicy.PER_DOMAIN:
# Use the same download slot to (sort of) respect download
Expand Down