scrapy-splash recursive crawl using CrawlSpider not working #92
Comments
I also have this issue. With a normal request it follows the rules and follow=True works; using Splash, it only visits the first URL. |
Has anyone found a solution? |
I have not, unfortunately.
|
I have the same problem, any solution? |
Negative. |
+1 over here. Encountering the same issue as described by @wattsin. |
I also hit the same issue today and found that CrawlSpider does a response type check in its _requests_to_follow function:

def _requests_to_follow(self, response):
    if not isinstance(response, HtmlResponse):
        return
    ...

However, the responses generated by Splash are SplashTextResponse or SplashJsonResponse, so that check means a Splash response never yields any requests to follow (see the sketch just below). |
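The fix that later comments converge on is to override _requests_to_follow and widen that type check to accept the Splash response classes. A minimal sketch (class name illustrative, method body mirroring the pre-Scrapy-1.7 code reproduced further down this thread):

from scrapy.http import HtmlResponse
from scrapy.spiders import CrawlSpider
from scrapy_splash import SplashJsonResponse, SplashTextResponse

class MySplashCrawlSpider(CrawlSpider):
    name = 'my_splash_crawler'  # illustrative; rules and start_urls omitted

    def _requests_to_follow(self, response):
        # accept Splash-rendered responses in addition to plain HtmlResponse
        if not isinstance(response, (HtmlResponse, SplashJsonResponse, SplashTextResponse)):
            return
        seen = set()
        for n, rule in enumerate(self._rules):
            links = [lnk for lnk in rule.link_extractor.extract_links(response)
                     if lnk not in seen]
            if links and rule.process_links:
                links = rule.process_links(links)
            for link in links:
                seen.add(link)
                r = self._build_request(n, link)
                yield rule.process_request(r)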
+1 |
+1 |
I tried to debug my spider with PyCharm and set a breakpoint. What worked for me is to add this to the callback parsing function:
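The snippet from that comment was not preserved in this rendering. Purely as an illustration (all names hypothetical, not the original poster's code), a callback-level workaround of this kind usually means extracting links yourself and yielding SplashRequests from the parse callback:

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy_splash import SplashRequest

class WorkaroundSpider(scrapy.Spider):  # hypothetical example
    name = 'workaround_example'
    start_urls = ['https://example.com']

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, callback=self.parse_page, args={'wait': 1})

    def parse_page(self, response):
        # ...extract items from the rendered page here...
        # then follow links manually instead of relying on CrawlSpider rules,
        # so every followed request also goes through Splash
        for link in LinkExtractor().extract_links(response):
            yield SplashRequest(link.url, callback=self.parse_page, args={'wait': 1})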
|
+1, any update on this issue? |
@hieu-n I used the code you pasted here, but changed the SplashRequest to a plain Request since I need to use the headers. It doesn't work: the spider still only crawls the first depth of content. Any suggestions would be appreciated. |
@NingLu I haven't touched Scrapy for a while. In your case, what I would do is set a few breakpoints and step through your code and Scrapy's code. Good luck! |
+1 any updates here? |
Hello everyone! Hope this helps! |
Having the same issue. I have overridden it, but as soon as I start using Splash by adding the following code to my spider, it does not get called. |
Hi, I have found a workaround which works for me: |
@VictorXunS this is not working for me, could you share all your CrawlSpider code? |
Also had problems combining CrawlSpider with SplashRequest and Crawlera. Overriding the _requests_to_follow function by removing the whole isinstance check worked for me (the override is quoted in full in the reply below). Thanks @dijadev and @hieu-n for the suggestions. |
I am not an expert, but Scrapy has its own duplicate filter, doesn't it? (You use not seen.) See https://doc.scrapy.org/en/latest/topics/link-extractors.html: class scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor has a unique (boolean) parameter that controls whether duplicate filtering should be applied to extracted links.
Quoting @Nick-Verdegem's comment above, here is the full override that removes the isinstance check:
def _requests_to_follow(self, response):
    seen = set()
    for n, rule in enumerate(self._rules):
        links = [lnk for lnk in rule.link_extractor.extract_links(response)
                 if lnk not in seen]
        if links and rule.process_links:
            links = rule.process_links(links)
        for link in links:
            seen.add(link)
            r = self._build_request(n, link)
            yield rule.process_request(r)

def _build_request(self, rule, link):
    r = Request(url=link.url, callback=self._response_downloaded)
    r.meta.update(rule=rule, link_text=link.text)
    return r
|
Hi @Nick-Verdegem thank you for sharing. |
So I encountered this issue and solved it by overriding the type check as suggested. But you also have to avoid using SplashRequest inside your process_request method to create the new Splash requests: the scrapy.Request returned from the _requests_to_follow method carries attributes in its meta, such as the index of the rule that generated it, which the spider relies on for its logic. So don't generate a completely different request with SplashRequest in your wrapper; just add splash to the meta of the already-built request, and reference the wrapper from your Rule via process_request (a sketch follows this comment). |
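The two snippets referenced in that comment were stripped from this rendering. A minimal sketch of the approach it describes (the use_splash name is illustrative, and the one-argument signature matches pre-1.7 Scrapy; see the note about Scrapy 1.7.0 further down):

# inside your CrawlSpider subclass: keep the Request that CrawlSpider built
# (its meta already carries the rule index) and only attach Splash settings
# to it, instead of constructing a brand-new SplashRequest
def use_splash(self, request):
    request.meta['splash'] = {
        'endpoint': 'render.html',
        'args': {'wait': 1},
    }
    return request

# ...and reference it from the rule:
# Rule(LinkExtractor(...), callback='parse_item', process_request='use_splash', follow=True)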
I use scrapy-splash with scrapy-redis, and RedisCrawlSpider runs once start_requests, _requests_to_follow and _build_request are rewritten:

def start_requests(self):
    for url in self.start_urls:
        yield SplashRequest(url=url, callback=self.parse_m, endpoint='execute', dont_filter=True, args={
            'url': url, 'wait': 5, 'lua_source': default_script
        })

def _requests_to_follow(self, response):
    if not isinstance(response, (HtmlResponse, SplashJsonResponse, SplashTextResponse)):
        return
    seen = set()
    for n, rule in enumerate(self._rules):
        links = [lnk for lnk in rule.link_extractor.extract_links(response)
                 if lnk not in seen]
        if links and rule.process_links:
            links = rule.process_links(links)
        for link in links:
            seen.add(link)
            r = self._build_request(n, link)
            yield rule.process_request(r)

def _build_request(self, rule, link):
    # the 'meta' parameter is required here!
    r = SplashRequest(url=link.url, callback=self._response_downloaded, meta={'rule': rule, 'link_text': link.text},
                      args={'wait': 5, 'url': link.url, 'lua_source': default_script})
    # this duplicate meta update can probably be removed, since meta is passed above
    r.meta.update(rule=rule, link_text=link.text)
    return r

Some parameters will need to be adjusted for your own spider. |
@MontaLabidi Your solution worked for me. This is how my code looks:

class MySuperCrawler(CrawlSpider):
name = 'mysupercrawler'
allowed_domains = ['example.com']
start_urls = ['https://www.example.com']
rules = (
Rule(LxmlLinkExtractor(
restrict_xpaths='//div/a'),
follow=True
),
Rule(LxmlLinkExtractor(
restrict_xpaths='//div[@class="pages"]/li/a'),
process_request="use_splash",
follow=True
),
Rule(LxmlLinkExtractor(
restrict_xpaths='//a[@class="product"]'),
callback='parse_item',
process_request="use_splash"
)
)
def _requests_to_follow(self, response):
if not isinstance(
response,
(HtmlResponse, SplashJsonResponse, SplashTextResponse)):
return
seen = set()
for n, rule in enumerate(self._rules):
links = [lnk for lnk in rule.link_extractor.extract_links(response)
if lnk not in seen]
if links and rule.process_links:
links = rule.process_links(links)
for link in links:
seen.add(link)
r = self._build_request(n, link)
yield rule.process_request(r)
def use_splash(self, request):
request.meta.update(splash={
'args': {
'wait': 1,
},
'endpoint': 'render.html',
})
return request
def parse_item(self, response):
        pass

This works perfectly for me. |
@sp-philippe-oger could you please show the whole file? In my case the CrawlSpider won't call the redefined _requests_to_follow and as a consequence still stops after the first page... |
@digitaldust pretty much the whole code is there. Not sure what is missing for you to make it work. |
@sp-philippe-oger don't worry, I actually realized my problem is with the LinkExtractor, not the scrapy/splash combo... thanks! |
Anyone get this to work while running a Lua script for each pagination? |
@nciefeiniu |
I use Python 3, but there's an error: _identity_process_request() missing 1 required positional argument. Is there something wrong? |
Since Scrapy 1.7.0, the process_request callback of a Rule also receives the response that originated the request, so the callable has to accept both request and response. |
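In practice that means the one-argument wrappers shown in earlier comments need an extra parameter on newer Scrapy versions; a small sketch:

# inside the CrawlSpider subclass; on Scrapy >= 1.7 the Rule's process_request
# callable also receives the originating response, so accept both arguments
def use_splash(self, request, response):
    request.meta['splash'] = {
        'endpoint': 'render.html',
        'args': {'wait': 1},
    }
    return request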
If someone runs into the same problem needing to use Splash in a CrawlSpider (with Rule and LinkExtractor) BOTH for parse_item and the initial start_requests, e.g. to bypass Cloudflare, here is my solution:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy_splash import SplashRequest, SplashJsonResponse, SplashTextResponse
from scrapy.http import HtmlResponse
class Abc(scrapy.Item):
name = scrapy.Field()
class AbcSpider(CrawlSpider):
name = "abc"
allowed_domains = ['abc.de']
start_urls = ['https://www.abc.com/xyz']
    rules = (Rule(LinkExtractor(restrict_xpaths='//h2[@class="abc"]'), callback='parse_item', process_request="use_splash"),)
def start_requests(self):
for url in self.start_urls:
yield SplashRequest(url, args={'wait': 15}, meta={'real_url': url})
def use_splash(self, request):
request.meta['splash'] = {
'endpoint':'render.html',
'args':{
'wait': 15,
}
}
return request
def _requests_to_follow(self, response):
if not isinstance(
response,
(HtmlResponse, SplashJsonResponse, SplashTextResponse)):
return
seen = set()
for n, rule in enumerate(self._rules):
links = [lnk for lnk in rule.link_extractor.extract_links(response)
if lnk not in seen]
if links and rule.process_links:
links = rule.process_links(links)
for link in links:
seen.add(link)
r = self._build_request(n, link)
yield rule.process_request(r)
def parse_item(self, response):
item = Abc()
item['name'] = response.xpath('//div[@class="abc-name"]/h1/text()').get()
return item |
It does not work; it throws an error: use_splash() missing 1 required positional argument: 'response'. |
@vishKurama Which Scrapy version are you using? Can you share a minimal, reproducible example? |
I had this problem too. Just use |
I am facing a similar problem and the solutions listed here aren't working for me, unless I've missed something!

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy_splash import SplashRequest
import logging
class MainSpider(CrawlSpider):
name = 'main'
allowed_domains = ['www.somesite.com']
script = '''
function main(splash, args)
splash.private_mode_enabled = false
my_user_agent = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36'
headers = {
['User-Agent'] = my_user_agent,
['Accept-Language'] = 'en-GB,en-US;q=0.9,en;q=0.8',
['Referer'] = 'https://www.google.com'
}
splash:set_custom_headers(headers)
url = args.url
assert(splash:go(url))
assert(splash:wait(2))
-- username input
username_input = assert(splash:select('#username'))
username_input:focus()
username_input:send_text('myusername')
assert(splash:wait(0.3))
-- password input
password_input = assert(splash:select('#password'))
password_input:focus()
password_input:send_text('mysecurepass')
assert(splash:wait(0.3))
-- the login button
login_btn = assert(splash:select('#login_btn'))
login_btn:mouse_click()
assert(splash:wait(4))
return splash:html()
end
'''
rules = (
Rule(LinkExtractor(restrict_xpaths="(//div[@id='sidebar']/ul/li)[7]/a"), callback='parse_item', follow=True, process_request='use_splash'),
)
def start_requests(self):
yield SplashRequest(url = 'https://www.somesite.com/login', callback = self.post_login, endpoint = 'execute', args = {
'lua_source': self.script
})
def use_splash(self, request):
request.meta.update(splash={
'args': {
'wait': 1,
},
'endpoint': 'render.html',
})
return request
def _requests_to_follow(self, response):
if not isinstance(response, (HtmlResponse, SplashJsonResponse, SplashTextResponse)):
return
seen = set()
for n, rule in enumerate(self._rules):
links = [lnk for lnk in rule.link_extractor.extract_links(response) if lnk not in seen]
if links and rule.process_links:
links = rule.process_links(links)
for link in links:
seen.add(link)
r = self._build_request(n, link)
yield rule.process_request(r)
def post_login(self, response):
logging.info('hey from post login!')
with open('post_login_response.txt', 'w') as f:
f.write(response.text)
f.close()
def parse_item(self, response):
logging.info('hey from parse_item!')
with open('post_search_response.txt', 'w') as f:
f.write(response.text)
            f.close()

The |
Following is a working crawler for scraping books.toscrape.com:

import scrapy
from scrapy.http import HtmlResponse
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy_splash import SplashRequest, SplashTextResponse, SplashJsonResponse
class FictionBookScrapper(CrawlSpider):
_WAIT = 0.1
name = "fiction_book_scrapper"
allowed_domains = ['books.toscrape.com']
start_urls = ["https://books.toscrape.com/catalogue/category/books_1/index.html"]
le_book_details = LinkExtractor(restrict_css=("h3 > a",))
rule_book_details = Rule(le_book_details, callback='parse_request', follow=False, process_request='use_splash')
le_next_page = LinkExtractor(restrict_css='.next > a')
rule_next_page = Rule(le_next_page, follow=True, process_request='use_splash')
rules = (
rule_book_details,
rule_next_page,
)
def start_requests(self):
for url in self.start_urls:
yield SplashRequest(url, args={'wait': self._WAIT}, meta={'real_url': url})
def use_splash(self, request, response):
request.meta['splash'] = {
'endpoint': 'render.html',
'args': {
'wait': self._WAIT
}
}
return request
def _requests_to_follow(self, response):
if not isinstance(response, (HtmlResponse, SplashTextResponse, SplashJsonResponse)):
return
seen = set()
for rule_index, rule in enumerate(self._rules):
links = [
lnk
for lnk in rule.link_extractor.extract_links(response)
if lnk not in seen
]
for link in rule.process_links(links):
seen.add(link)
request = self._build_request(rule_index, link)
yield rule.process_request(request, response)
def parse_request(self, response: scrapy.http.Response):
self.logger.info(f'Page status code = {response.status}, url= {response.url}')
yield {
'Title': response.css('h1 ::text').get(),
'Link': response.url,
'Description': response.xpath('//*[@id="content_inner"]/article/p/text()').get()
}
|
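As an aside, the working example above assumes the usual scrapy-splash wiring from the project README is already present in settings.py, roughly as below (adjust SPLASH_URL to your own Splash instance):

# settings.py -- standard scrapy-splash configuration
SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'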
Hi!
I have integrated scrapy-splash in my CrawlSpider via process_request in the rules, like this:
The problem is that the crawl only renders the URLs at the first depth.
I also wonder how I can get the response even with a bad HTTP status code or a redirected response.
Thanks in advance.
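The snippet referenced above was stripped from this rendering. Judging from the rest of the thread, the described integration (wrapping each rule-generated request in a SplashRequest from process_request) would look roughly like this hypothetical sketch, which is exactly the pattern that only renders the first depth:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy_splash import SplashRequest

class ExampleSpider(CrawlSpider):  # hypothetical reconstruction, not the reporter's actual code
    name = 'example'
    start_urls = ['https://example.com']

    rules = (
        Rule(LinkExtractor(), callback='parse_item', follow=True,
             process_request='wrap_in_splash'),
    )

    def wrap_in_splash(self, request):
        # the fresh SplashRequest drops the rule metadata of the original
        # request, and its Splash response type is later rejected by
        # CrawlSpider._requests_to_follow, so crawling stops after one depth
        return SplashRequest(url=request.url, callback=request.callback,
                             args={'wait': 1})

    def parse_item(self, response):
        pass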