Allow spider attr #15

Open · wants to merge 7 commits into master
Changes from 5 commits
24 changes: 17 additions & 7 deletions scrapyjs/middleware.py
@@ -6,6 +6,7 @@
from scrapy.exceptions import NotConfigured

from scrapy import log
from scrapy.http.response.html import HtmlResponse

looks like unused import

Author:

good point, this is fixed now

from scrapy.http.headers import Headers


@@ -32,6 +33,14 @@ def __init__(self, crawler, splash_base_url, slot_policy):
self.splash_base_url = splash_base_url
self.slot_policy = slot_policy

def get_splash_options(self, request, spider):
if request.meta.get("dont_proxy"):

after sleeping on it I think it's a bad idea to reuse the 'dont_proxy' key here, because it's unexpected for developers who use the splash and crawlera middlewares together in a project - they would expect it to work only with the crawlera mw, so they can enable crawlera for the spider (for all requests) and, for some specific requests, disable crawlera using 'dont_proxy' while adding 'splash' arguments.

I think explicit 'dont_splash' would be better here.

Author:

> I think explicit 'dont_splash' would be better here.

I'm not sure. But it seems like this is just a question of terminology: "dont_proxy" to my ears sounds like a general signal - 'don't use any kind of proxy mechanism for this request' - so I thought we could reuse it here. On the other hand, "splash" is not used as a proxy here, so reusing the key may be confusing; also, dont_splash would be more specific and explicit. So maybe dont_splash would be better.
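For reference, a minimal sketch of the explicit opt-out being discussed; the 'dont_splash' key is only the reviewer's suggestion and is not part of this changeset:

```python
def get_splash_options(self, request, spider):
    # hypothetical explicit opt-out for this middleware only, so it cannot
    # be confused with crawlera's 'dont_proxy' key
    if request.meta.get("dont_splash"):
        return

    spider_options = getattr(spider, "splash", {})
    request_options = request.meta.get("splash")
    return request_options or spider_options
```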

return

spider_options = getattr(spider, "splash", {})
request_options = request.meta.get("splash")
return request_options or spider_options

@classmethod
def from_crawler(cls, crawler):
splash_base_url = crawler.settings.get('SPLASH_URL', cls.default_splash_url)
@@ -43,24 +52,26 @@ def from_crawler(cls, crawler):
return cls(crawler, splash_base_url, slot_policy)

def process_request(self, request, spider):
splash_options = request.meta.get('splash')
splash_options = self.get_splash_options(request, spider)
if not splash_options:
return

@pawelmhm why did you change this? This key replacement looked good enough.

Author:

This is related to https://github.com/scrapinghub/scrapyjs/pull/15/files#r28138976 (avoiding generating multiple splash requests). If splash options are in the request meta as the content of the splash key, they can be deleted, so when the request is processed by the middleware a second time, process_request will return None. But if splash options are set per spider, you can't delete them. This means the first check on line 55 will always be True when the options come from the spider, so we need another check, and I needed to redefine the _splash_processed key, changing its content from a dictionary of splash options to a Boolean flag. (edited, updated comment link)
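To make the constraint concrete, here is a simplified illustration (not code from this PR; the spider class and option values are made up) of why per-spider options force a separate flag: the spider attribute cannot be popped from request.meta, so a plain truthiness check would match on every pass.

```python
class FakeSpider(object):
    # spider-level options live on the spider object, not in request.meta
    splash = {"args": {"wait": 0.5}}

meta = {}  # no per-request 'splash' key to delete

def options_for(meta, spider):
    # same precedence as get_splash_options(): request meta first, spider attribute second
    return meta.get("splash") or getattr(spider, "splash", {})

assert options_for(meta, FakeSpider)   # first pass: options found, request gets rewritten
assert options_for(meta, FakeSpider)   # second pass: still truthy, nothing to delete
meta["_splash_processed"] = True       # hence the explicit boolean marker in the PR
assert meta.get("_splash_processed")   # second pass can now bail out early
```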

you added the check if request.meta.get("_splash_processed"): return above - isn't that enough?

  1. request processed the first time, 'splash' moved to '_splash_processed'
  2. response processed the first time - treat '_splash_processed' as the dict of splash options
  3. request processed a second time (for any reason, like a retry) - '_splash_processed' is already in meta, so the request will be ignored because of the check request.meta.get("_splash_processed")

but after your explanation I think the way it's implemented in this PR is better, because someone might expect to have the 'splash' key in response.meta, and with the current implementation it would be missing. @kmike are you okay with that?

Member:

It turns out I applied similar changes here: 95f0079.

Member:

But maybe your approach is better


elif request.meta.get("_splash_processed"):
return

is this for retries?

Author:

this is because if you return a new request from a middleware it will go through the whole middleware chain again, so we need to prevent an infinite loop
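In other words, a generic sketch of the mechanism (not this project's code; the middleware name, URL and meta key are illustrative only): any downloader middleware that returns a replacement Request will see that Request again on the next pass through the chain, so it needs a marker to know its work is already done.

```python
class RewritingMiddleware(object):
    """Illustrative only: rewrites each request once, lets it through on re-entry."""

    def process_request(self, request, spider):
        if request.meta.get('_already_rewritten'):
            return  # second pass: already rewritten, let it continue downstream

        meta = dict(request.meta, _already_rewritten=True)
        # Returning a Request here makes Scrapy run it through the whole
        # middleware chain again, which is why the flag above is needed.
        return request.replace(url='http://render.example.com/render.html', meta=meta)
```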

got it.

note - I think it would be a bit safer to use if here, not elif, as both statements (#15 (diff) and #15 (diff)) lead to a return - it wouldn't change the logic but would be safer for future code changes - who knows how this code will change in the future

Author:

sounds good
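For clarity, the restructuring being agreed on is roughly this (a sketch of the suggestion, not the merged code):

```python
def process_request(self, request, spider):
    splash_options = self.get_splash_options(request, spider)
    if not splash_options:
        return

    # a separate `if` rather than `elif`: both guards return anyway, and
    # keeping them independent is safer if either body changes later
    if request.meta.get("_splash_processed"):
        return

    # ... Splash rewriting continues as in the diff below ...
```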


if request.method != 'GET':
log.msg("Currently only GET requests are supported by SplashMiddleware; %s "
"will be handled without Splash" % request, logging.WARNING)
return request

meta = request.meta
del meta['splash']
meta['_splash_processed'] = splash_options

slot_policy = splash_options.get('slot_policy', self.slot_policy)
self._set_download_slot(request, meta, slot_policy)

args = splash_options.setdefault('args', {})
args.setdefault('url', request.url)
args['url'] = request.url

@kmike are you okay with this change? What was the use case for .setdefault() here? Does it make sense to create a request with one url and specify another url in meta?

Member:

Maybe I was thinking of allowing the user not to use the 'url' argument at all. This argument is not required for Splash scripts - e.g. you can render a chunk of HTML using splash:set_content. But the meta syntax doesn't make much sense in this case; something along the lines of #12 could be a better fit.
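As an aside, here is a hedged example of the use case being described: an 'execute' request whose Lua script builds the page with splash:set_content, so the 'url' argument carries no real meaning. The script and HTML are illustrative only; the SplashRequest-without-url pattern mirrors test_splash_request_no_url below.

```python
from scrapyjs.request import SplashRequest

lua_source = """
function main(splash)
    splash:set_content("<html><body><h1>hello</h1></body></html>")
    return {html = splash:html()}
end
"""

req = SplashRequest(meta={'splash': {
    'endpoint': 'execute',
    'args': {'lua_source': lua_source},
}})
```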


body = json.dumps(args, ensure_ascii=False)

if 'timeout' in args:
@@ -86,6 +97,7 @@ def process_request(self, request, spider):
endpoint = splash_options.setdefault('endpoint', self.default_endpoint)
splash_base_url = splash_options.get('splash_url', self.splash_base_url)
splash_url = urljoin(splash_base_url, endpoint)
meta['_splash_processed'] = True

req_rep = request.replace(
url=splash_url,
@@ -96,18 +108,16 @@
# are not respected.
headers=Headers({'Content-Type': 'application/json'}),
)

self.crawler.stats.inc_value('splash/%s/request_count' % endpoint)
return req_rep

def process_response(self, request, response, spider):
splash_options = request.meta.get("_splash_processed")
splash_options = self.get_splash_options(request, spider)
if splash_options:
endpoint = splash_options['endpoint']
self.crawler.stats.inc_value(
'splash/%s/response_count/%s' % (endpoint, response.status)
)

return response

def _set_download_slot(self, request, meta, slot_policy):
275 changes: 150 additions & 125 deletions tests/test_middleware.py
@@ -2,154 +2,179 @@
from __future__ import absolute_import
import copy
import json
from twisted.trial import unittest

import scrapy
from scrapy.core.engine import ExecutionEngine
from scrapy.utils.test import get_crawler
from scrapy.utils.httpobj import urlparse_cached

import scrapyjs
from scrapyjs.middleware import SplashMiddleware
from scrapyjs.request import SplashRequest


def _get_mw():
crawler = get_crawler(settings_dict={
'DOWNLOAD_HANDLERS': {'s3': None}, # for faster test running
})
if not hasattr(crawler, 'logformatter'):
crawler.logformatter = None
crawler.engine = ExecutionEngine(crawler, lambda _: None)
# spider = crawler._create_spider("foo")
return SplashMiddleware.from_crawler(crawler)


def test_nosplash():
mw = _get_mw()
req = scrapy.Request("http:https://example.com")
old_meta = copy.deepcopy(req.meta)
assert mw.process_request(req, None) is None
assert old_meta == req.meta


def test_splash_request():
mw = _get_mw()
req = SplashRequest("http:https://example.com?foo=bar&url=1&wait=100")

req2 = mw.process_request(req, None)
assert req2 is not None
assert req2 is not req
assert req2.url == "http:https://127.0.0.1:8050/render.html"
assert req2.headers == {'Content-Type': ['application/json']}
assert req2.method == 'POST'

expected_body = {'url': req.url}
expected_body.update(SplashRequest.default_splash_meta['args'])
assert json.loads(req2.body) == expected_body


def test_splash_request_no_url():
mw = _get_mw()
lua_source = "function main(splash) return {result='ok'} end"
req1 = SplashRequest(meta={'splash': {
'args': {'lua_source': lua_source},
'endpoint': 'execute',
}})
req = mw.process_request(req1, None)
assert req.url == 'http:https://127.0.0.1:8050/execute'
assert json.loads(req.body) == {
'url': 'about:blank',
'lua_source': lua_source
}


def test_override_splash_url():
mw = _get_mw()
req1 = scrapy.Request("http:https://example.com", meta={
'splash': {
'endpoint': 'render.png',
'splash_url': 'http:https://splash.example.com'
}
})
req = mw.process_request(req1, None)
assert req.url == 'http:https://splash.example.com/render.png'
assert json.loads(req.body) == {'url': req1.url}


def test_float_wait_arg():
mw = _get_mw()
req1 = scrapy.Request("http:https://example.com", meta={
'splash': {
'endpoint': 'render.html',
'args': {'wait': 0.5}
}
})
req = mw.process_request(req1, None)
assert json.loads(req.body) == {'url': req1.url, 'wait': 0.5}

def test_slot_policy_single_slot():
    mw = _get_mw()
    meta = {'splash': {
        'slot_policy': scrapyjs.SlotPolicy.SINGLE_SLOT
    }}

    req1 = scrapy.Request("http:https://example.com/path?key=value", meta=meta)
    req1 = mw.process_request(req1, None)

    req2 = scrapy.Request("http:https://fooexample.com/path?key=value", meta=meta)
    req2 = mw.process_request(req2, None)

    assert req1.meta.get('download_slot')
    assert req1.meta['download_slot'] == req2.meta['download_slot']


def test_slot_policy_per_domain():
    mw = _get_mw()
    meta = {'splash': {
        'slot_policy': scrapyjs.SlotPolicy.PER_DOMAIN
    }}

    req1 = scrapy.Request("http:https://example.com/path?key=value", meta=meta)
    req1 = mw.process_request(req1, None)

    req2 = scrapy.Request("http:https://example.com/path2", meta=meta)
    req2 = mw.process_request(req2, None)

    req3 = scrapy.Request("http:https://fooexample.com/path?key=value", meta=meta)
    req3 = mw.process_request(req3, None)

    assert req1.meta.get('download_slot')
    assert req3.meta.get('download_slot')

    assert req1.meta['download_slot'] == req2.meta['download_slot']
    assert req1.meta['download_slot'] != req3.meta['download_slot']


def test_slot_policy_scrapy_default():
    mw = _get_mw()
    req = scrapy.Request("http:https://example.com", meta = {'splash': {
        'slot_policy': scrapyjs.SlotPolicy.SCRAPY_DEFAULT
    }})
    req = mw.process_request(req, None)
    assert 'download_slot' not in req.meta


def test_adjust_timeout():
    mw = _get_mw()
    req1 = scrapy.Request("http:https://example.com", meta = {
        'splash': {'args': {'timeout': 60, 'html': 1}},

        # download_timeout is always present,
        # it is set by DownloadTimeoutMiddleware
        'download_timeout': 30,
    })
    req1 = mw.process_request(req1, None)
    assert req1.meta['download_timeout'] > 60

    req2 = scrapy.Request("http:https://example.com", meta = {
        'splash': {'args': {'html': 1}},
        'download_timeout': 30,
    })
    req2 = mw.process_request(req2, None)
    assert req2.meta['download_timeout'] == 30
class MockedSlot(object):

    def __init__(self, delay=0.0):
        self.delay = delay


class MockedDownloader(object):

    def __init__(self):
        self.slots = {}

    def _get_slot_key(self, request, spider):
        if 'download_slot' in request.meta:
            return request.meta['download_slot']

        key = urlparse_cached(request).hostname or ''
        return key


class MockedEngine(object):
    downloader = MockedDownloader()


class MiddlewareTest(unittest.TestCase):

    def setUp(self):
        self.crawler = get_crawler(settings_dict={
            'DOWNLOAD_HANDLERS': {'s3': None}, # for faster test running
        })
        if not hasattr(self.crawler, 'logformatter'):
            self.crawler.logformatter = None
        self.crawler.engine = MockedEngine()
        self.mw = SplashMiddleware.from_crawler(self.crawler)

    def test_nosplash(self):
        req = scrapy.Request("http:https://example.com")
        old_meta = copy.deepcopy(req.meta)
        assert self.mw.process_request(req, None) is None
        assert old_meta == req.meta

    def test_splash_request(self):
        req = SplashRequest("http:https://example.com?foo=bar&url=1&wait=100")

        req2 = self.mw.process_request(req, None)
        assert req2 is not None
        assert req2 is not req
        assert req2.url == "http:https://127.0.0.1:8050/render.html"
        assert req2.headers == {'Content-Type': ['application/json']}
        assert req2.method == 'POST'

        expected_body = {'url': req.url}
        expected_body.update(SplashRequest.default_splash_meta['args'])
        assert json.loads(req2.body) == expected_body

    def test_splash_request_no_url(self):
        lua_source = "function main(splash) return {result='ok'} end"
        req1 = SplashRequest(meta={'splash': {
            'args': {'lua_source': lua_source},
            'endpoint': 'execute',
        }})
        req = self.mw.process_request(req1, None)
        assert req.url == 'http:https://127.0.0.1:8050/execute'
        assert json.loads(req.body) == {
            'url': 'about:blank',
            'lua_source': lua_source
        }

def test_override_splash_url(self):
req1 = scrapy.Request("http:https://example.com", meta={
'splash': {
'endpoint': 'render.png',
'splash_url': 'http:https://splash.example.com'
}
})
req = self.mw.process_request(req1, None)
assert req.url == 'http:https://splash.example.com/render.png'
assert json.loads(req.body) == {'url': req1.url}

def test_float_wait_arg(self):
req1 = scrapy.Request("http:https://example.com", meta={
'splash': {
'endpoint': 'render.html',
'args': {'wait': 0.5}
}
})
req = self.mw.process_request(req1, None)
assert json.loads(req.body) == {'url': req1.url, 'wait': 0.5}

def test_slot_policy_single_slot(self):
meta = {'splash': {
'slot_policy': scrapyjs.SlotPolicy.SINGLE_SLOT
}}

req1 = scrapy.Request("http:https://example.com/path?key=value", meta=meta)
req1 = self.mw.process_request(req1, None)

req2 = scrapy.Request("http:https://fooexample.com/path?key=value", meta=meta)
req2 = self.mw.process_request(req2, None)

assert req1.meta.get('download_slot')
assert req1.meta['download_slot'] == req2.meta['download_slot']

def test_slot_policy_per_domain(self):
meta = {'splash': {
'slot_policy': scrapyjs.SlotPolicy.PER_DOMAIN
}}

req1 = scrapy.Request("http:https://example.com/path?key=value", meta=meta)
req1 = self.mw.process_request(req1, None)

req2 = scrapy.Request("http:https://example.com/path2", meta=meta)
req2 = self.mw.process_request(req2, None)

req3 = scrapy.Request("http:https://fooexample.com/path?key=value", meta=meta)
req3 = self.mw.process_request(req3, None)

assert req1.meta.get('download_slot')
assert req3.meta.get('download_slot')

assert req1.meta['download_slot'] == req2.meta['download_slot']
assert req1.meta['download_slot'] != req3.meta['download_slot']

def test_slot_policy_scrapy_default(self):
req = scrapy.Request("http:https://example.com", meta = {'splash': {
'slot_policy': scrapyjs.SlotPolicy.SCRAPY_DEFAULT
}})
req = self.mw.process_request(req, None)
assert 'download_slot' not in req.meta

def test_adjust_timeout(self):
req1 = scrapy.Request("http:https://example.com", meta = {
'splash': {'args': {'timeout': 60, 'html': 1}},

# download_timeout is always present,
# it is set by DownloadTimeoutMiddleware
'download_timeout': 30,
})
req1 = self.mw.process_request(req1, None)
assert req1.meta['download_timeout'] > 60

req2 = scrapy.Request("http:https://example.com", meta = {
'splash': {'args': {'html': 1}},
'download_timeout': 30,
})
req2 = self.mw.process_request(req2, None)
assert req2.meta['download_timeout'] == 30

def test_spider_attribute(self):
req_url = "http:https://scrapy.org"
req1 = scrapy.Request(req_url)

spider = self.crawler._create_spider("foo")
spider.splash = {"args": {"images": 0}}

req1 = self.mw.process_request(req1, spider)
self.assertIn("_splash_processed", req1.meta)
self.assertIn("render.json", req1.url)
self.assertIn("url", json.loads(req1.body))
self.assertEqual(json.loads(req1.body).get("url"), req_url)
self.assertIn("images", json.loads(req1.body))

        # blank spider attribute - middleware disabled
spider.splash = {}
req2 = self.mw.process_request(req1, spider)
self.assertIsNone(req2)
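
Taken together, this is roughly how the new spider-level attribute is meant to be used; the spider name, URLs and argument values below are illustrative, and per-request meta still wins over the spider attribute (get_splash_options returns request_options or spider_options).

```python
import scrapy


class ExampleSpider(scrapy.Spider):
    # illustrative spider; the `splash` class attribute is the feature this PR adds
    name = "example"
    start_urls = ["http://example.com"]

    # applied by SplashMiddleware to every request from this spider
    splash = {"args": {"wait": 0.5, "images": 0}}

    def parse(self, response):
        # per-request options take precedence over the spider attribute
        yield scrapy.Request(
            "http://example.com/details",
            meta={"splash": {"endpoint": "render.png"}},
            callback=self.parse_details,
        )

    def parse_details(self, response):
        pass
```

Setting `splash = {}` on the spider (or omitting the attribute) leaves the middleware disabled for requests that carry no 'splash' meta, as test_spider_attribute above checks.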