
scrapy-splash recursive crawl using CrawlSpider not working #92

Open
dijadev opened this issue Nov 10, 2016 · 36 comments
dijadev commented Nov 10, 2016

Hi !

I have integrated scrapy-splash into my CrawlSpider via the process_request callback of my rules, like this:

def process_request(self, request):
    request.meta['splash'] = {
        'args': {
            # set rendering arguments here
            'html': 1,
        }
    }
    return request

The problem is that the crawl only renders the URLs at the first depth.
I also wonder how I can get the response even for a bad HTTP status code or a redirected response.

Thanks in advance,
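On the second question (getting the response even for a bad HTTP status code or a redirect): that part is controlled by standard Scrapy machinery rather than by scrapy-splash. A minimal sketch using Scrapy's documented meta keys; the URL and status codes are placeholders:

import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'

    def start_requests(self):
        yield scrapy.Request(
            'https://example.com/',
            callback=self.parse,
            meta={
                # pass non-200 responses (here 404/500) through to the callback
                'handle_httpstatus_list': [404, 500],
                # keep the 3xx response itself instead of following the redirect
                'dont_redirect': True,
            },
        )

    def parse(self, response):
        self.logger.info('got %s with status %s', response.url, response.status)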

dijadev changed the title from "splash crawls only urls in depth 1 and handle only 200 http responses" to "scrapy-splash recursive crawl using CrawlSpider not working" on Nov 14, 2016

wattsin commented Jan 23, 2017

I also have this issue.

NORMAL REQUEST - it follows the rules (follow=True):

yield scrapy.Request(url, callback=self.parse, dont_filter=True, errback=self.errback_httpbin)

USING SPLASH - it only visits the first URL:

yield scrapy.Request(url, callback=self.parse, dont_filter=True, errback=self.errback_httpbin, meta={
    'splash': {
        'endpoint': 'render.html',
        'args': {'wait': 0.5}
    }
})


dijadev commented Jan 27, 2017

Has anyone found a solution?


wattsin commented Jan 27, 2017 via email


amirj commented Feb 13, 2017

I have the same problem, any solution?


wattsin commented Feb 14, 2017

Negative.

@brianherbert

+1 over here. Encountering the same issue as described by @wattsin.


dwj1324 commented Jun 8, 2017

I also ran into the same issue today and found that CrawlSpider does a response type check in its _requests_to_follow function:

def _requests_to_follow(self, response):
    if not isinstance(response, HtmlResponse):
        return
    ...

However, responses generated by Splash are SplashTextResponse or SplashJsonResponse objects, so that check means a Splash response never produces any requests to follow.


komuher commented Jul 21, 2017

+1


ghost commented Aug 9, 2017

+1


hieu-n commented Aug 31, 2017

@dwj1324

I tried to debug my spider with PyCharm and set a breakpoint at if not isinstance(response, HtmlResponse):. That code was never reached when SplashRequest was used instead of scrapy.Request.

What worked for me is to add this to the callback parsing function:

def parse_item(self, response):
    """Parse response into item also create new requests."""

    page = RescrapItem()
    ...
    yield page

    if isinstance(response, (HtmlResponse, SplashTextResponse)):
        seen = set()
        for n, rule in enumerate(self._rules):
            links = [lnk for lnk in rule.link_extractor.extract_links(response)
                     if lnk not in seen]
            if links and rule.process_links:
                links = rule.process_links(links)
            for link in links:
                seen.add(link)
                r = SplashRequest(url=link.url, callback=self._response_downloaded, 
                                              args=SPLASH_RENDER_ARGS)
                r.meta.update(rule=rule, link_text=link.text)
                yield rule.process_request(r)


NingLu commented Oct 24, 2017

+1, any update for this issue?


NingLu commented Oct 24, 2017

@hieu-n I used the code you pasted here and changed SplashRequest to Request since I need to use the headers, but it doesn't work; the spider still only crawls the first-depth content. Any suggestion would be appreciated.

[screenshot of crawl log omitted]


hieu-n commented Oct 25, 2017

@NingLu I haven't touched scrapy for a while. In your case, what I would do is to set a few breakpoints and step through your code and the scrapy's code. Good luck!


Goles commented Jan 16, 2018

+1 any updates here?


dijadev commented Jan 16, 2018

Hello everyone !
As @dwj1324 said, CrawlSpider does a response type check in its _requests_to_follow function.
So I've just overridden this function so that SplashJsonResponse(s) are no longer skipped:
[the overridden function was posted as a screenshot; a reconstruction follows below]

hope this helps !
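Since the override above was only shared as an image, here is a reconstruction pieced together from the versions posted later in this thread (it assumes a Scrapy version where rule.process_request takes a single argument; on Scrapy 1.7+ see the note further down about the extra response parameter):

from scrapy.http import HtmlResponse
from scrapy_splash import SplashJsonResponse, SplashTextResponse

def _requests_to_follow(self, response):
    # accept Splash responses as well as plain HtmlResponse
    if not isinstance(response, (HtmlResponse, SplashJsonResponse, SplashTextResponse)):
        return
    seen = set()
    for n, rule in enumerate(self._rules):
        links = [lnk for lnk in rule.link_extractor.extract_links(response)
                 if lnk not in seen]
        if links and rule.process_links:
            links = rule.process_links(links)
        for link in links:
            seen.add(link)
            r = self._build_request(n, link)
            yield rule.process_request(r)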


tf42src commented Feb 6, 2018

Having the same issue. Have overridden _requests_to_follow as stated by @dwj1324 and @dijadev.

As soon as I start using splash by adding the following code to my spider:

def start_requests(self):
    for url in self.start_urls:
        print('->', url)
        yield SplashRequest(url, self.parse_item, args={'wait': 0.5})

it does not call _requests_to_follow anymore. Scrapy follows links again when I comment that code back out.
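A possible explanation, judging from the working examples later in this thread (e.g. the ones by @janwendt and @InzamamAnwar): passing self.parse_item as the SplashRequest callback bypasses CrawlSpider.parse, which is the method that ends up calling _requests_to_follow. Leaving the callback at its default keeps the rule machinery in play; a minimal sketch:

def start_requests(self):
    # no explicit callback: the rendered response goes to CrawlSpider.parse,
    # which applies the rules (and therefore calls _requests_to_follow)
    for url in self.start_urls:
        yield SplashRequest(url, args={'wait': 0.5}, meta={'real_url': url})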

@VictorXunS

Hi, I have found a workaround which works for me.
Instead of using a plain Scrapy request:
yield scrapy.Request(page_url, self.parse_page)
simply prefix the URL with Splash's render.html endpoint:
yield scrapy.Request("http://localhost:8050/render.html?url=" + page_url, self.parse_page)
The localhost port may depend on how you built the Splash Docker container.
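One caveat with this workaround: if page_url carries its own query string, it should be percent-encoded before being passed as the url parameter of render.html. A minimal sketch, assuming Splash is listening on localhost:8050:

from urllib.parse import urlencode

SPLASH_RENDER = 'http://localhost:8050/render.html'

def splash_url(page_url, wait=0.5):
    # encode the target URL (and any extra render arguments) as query parameters
    return SPLASH_RENDER + '?' + urlencode({'url': page_url, 'wait': wait})

# usage inside a spider:
#     yield scrapy.Request(splash_url(page_url), self.parse_page)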


reg3x commented Jan 7, 2019

(quoting @VictorXunS's workaround above)

@VictorXunS this is not working for me, could you share all your CrawlSpider code?


victor-papa commented Feb 18, 2019

Also had problems combining CrawlSpider with SplashRequests and Crawlera. Overriding the _requests_to_follow function and removing the whole isinstance check worked for me. Thanks @dijadev and @hieu-n for the suggestions.

def _requests_to_follow(self, response):
    seen = set()
    for n, rule in enumerate(self._rules):
        links = [lnk for lnk in rule.link_extractor.extract_links(response) if lnk not in seen]
        if links and rule.process_links:
            links = rule.process_links(links)
        for link in links:
            seen.add(link)
            r = self._build_request(n, link)
            yield rule.process_request(r)

def _build_request(self, rule, link):
    r = Request(url=link.url, callback=self._response_downloaded)
    r.meta.update(rule=rule, link_text=link.text)
    return r


JavierRuano commented Feb 18, 2019 via email


XamHans commented Feb 19, 2019

Hi @Nick-Verdegem, thank you for sharing.
My CrawlSpider is still not working with your solution; do you use start_requests?


MontaLabidi commented Mar 2, 2019

So I encountered this issue and solved it by overriding the type check as suggested:

def _requests_to_follow(self, response):
    if not isinstance(response, (HtmlResponse, SplashTextResponse)):
        return
    ...

but you also have to avoid using SplashRequest in your process_request method to create the new Splash request. Just add splash to your scrapy.Request's meta instead, because the scrapy.Request returned by _requests_to_follow carries attributes in its meta (such as the index of the rule that generated it) which CrawlSpider relies on for its logic. So don't generate a totally different request with SplashRequest in your request wrapper; just add splash to the already-built request, like so:

def use_splash(self, request):
    request.meta.update(splash={
        'args': {
            'wait': 1,
        },
        'endpoint': 'render.html',
    })
    return request

Then add it to your Rule:
process_request="use_splash"
_requests_to_follow will apply process_request to every built request; that's what worked for my CrawlSpider.
Hope that helps!


nciefeiniu commented Mar 6, 2019

I use scrapy-splash together with scrapy-redis.

RedisCrawlSpider can work, but you need to rewrite the following methods:

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url=url, callback=self.parse_m, endpoint='execute', dont_filter=True, args={
                'url': url, 'wait': 5, 'lua_source': default_script
            })

    def _requests_to_follow(self, response):
        if not isinstance(response, (HtmlResponse, SplashJsonResponse, SplashTextResponse)):
            return
        seen = set()
        for n, rule in enumerate(self._rules):
            links = [lnk for lnk in rule.link_extractor.extract_links(response)
                     if lnk not in seen]
            if links and rule.process_links:
                links = rule.process_links(links)
            for link in links:
                seen.add(link)
                r = self._build_request(n, link)
                yield rule.process_request(r)

    def _build_request(self, rule, link):
        # parameter 'meta' is required !!!!!
        r = SplashRequest(url=link.url, callback=self._response_downloaded, meta={'rule': rule, 'link_text': link.text},
                          args={'wait': 5, 'url': link.url, 'lua_source': default_script})
        # Maybe you can delete it here.
        r.meta.update(rule=rule, link_text=link.text)
        return r

Some parameters need to be adapted to your own setup.

@sp-philippe-oger

@MontaLabidi Your solution worked for me.

This is how my code looks:

class MySuperCrawler(CrawlSpider):
    name = 'mysupercrawler'
    allowed_domains = ['example.com']
    start_urls = ['https://www.example.com']

    rules = (
        Rule(LxmlLinkExtractor(
            restrict_xpaths='//div/a'),
            follow=True
        ),
        Rule(LxmlLinkExtractor(
            restrict_xpaths='//div[@class="pages"]/li/a'),
            process_request="use_splash",
            follow=True
        ),
        Rule(LxmlLinkExtractor(
            restrict_xpaths='//a[@class="product"]'),
            callback='parse_item',
            process_request="use_splash"
        )
    )

    def _requests_to_follow(self, response):
        if not isinstance(
                response,
                (HtmlResponse, SplashJsonResponse, SplashTextResponse)):
            return
        seen = set()
        for n, rule in enumerate(self._rules):
            links = [lnk for lnk in rule.link_extractor.extract_links(response)
                     if lnk not in seen]
            if links and rule.process_links:
                links = rule.process_links(links)
            for link in links:
                seen.add(link)
                r = self._build_request(n, link)
                yield rule.process_request(r)

    def use_splash(self, request):
        request.meta.update(splash={
            'args': {
                'wait': 1,
            },
            'endpoint': 'render.html',
        })
        return request

    def parse_item(self, response):
        pass

This works perfectly for me.

@digitaldust

@sp-philippe-oger could you please show the whole file? In my case the crawl spider won't call the redefined _requests_to_follow and as a consequence still stops after the first page...

@sp-philippe-oger

@digitaldust pretty much the whole code is there. Not sure what is missing for you to make it work.

@digitaldust

@sp-philippe-oger don't worry, I actually realized my problem is with the LinkExtractor, not the scrapy/splash combo... thanks!


MSDuncan82 commented Oct 1, 2019

Anyone get this to work while running a Lua script for each pagination?

@davisbra

@nciefeiniu
Hi... would you please give more information about integrating scrapy-redis with Splash? I mean, how do you send your URLs from Redis to Splash?

@zhaicongrong

(quoting @sp-philippe-oger's CrawlSpider example above)

I use Python 3, but there's an error: _identity_process_request() missing 1 required positional argument. Is there something wrong?

@Gallaecio

Since Scrapy 1.7.0, the process_request callback also receives a response parameter, so you need to change def use_splash(self, request): to def use_splash(self, request, response):
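In other words, a minimal sketch of the two changes needed on Scrapy 1.7+ (the second one applies if you also override _requests_to_follow as shown earlier in this thread):

# Rule's process_request callback now receives (request, response)
def use_splash(self, request, response):
    request.meta['splash'] = {
        'endpoint': 'render.html',
        'args': {'wait': 1},
    }
    return request

# and the last line of the overridden _requests_to_follow becomes:
#     yield rule.process_request(r, response)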


janwendt commented Aug 25, 2020

If someone runs into the same problem of needing to use Splash in a CrawlSpider (with Rule and LinkExtractor) BOTH for parse_item and the initial start_requests, e.g. to bypass Cloudflare, this is my solution:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy_splash import SplashRequest, SplashJsonResponse, SplashTextResponse
from scrapy.http import HtmlResponse

class Abc(scrapy.Item):
    name = scrapy.Field()

class AbcSpider(CrawlSpider):
    name = "abc"
    allowed_domains = ['abc.de']
    start_urls = ['https://www.abc.com/xyz']

    rules = (Rule(LinkExtractor(restrict_xpaths='//h2[@class="abc"]'), callback='parse_item', process_request="use_splash"),)

    def start_requests(self):        
        for url in self.start_urls:
            yield SplashRequest(url, args={'wait': 15}, meta={'real_url': url})

    def use_splash(self, request):
        request.meta['splash'] = {
                'endpoint':'render.html',
                'args':{
                    'wait': 15,
                    }
                }
        return request

    def _requests_to_follow(self, response):
        if not isinstance(
                response,
                (HtmlResponse, SplashJsonResponse, SplashTextResponse)):
            return
        seen = set()
        for n, rule in enumerate(self._rules):
            links = [lnk for lnk in rule.link_extractor.extract_links(response)
                     if lnk not in seen]
            if links and rule.process_links:
                links = rule.process_links(links)
            for link in links:
                seen.add(link)
                r = self._build_request(n, link)
                yield rule.process_request(r)

    def parse_item(self, response):
        item = Abc()
        item['name'] = response.xpath('//div[@class="abc-name"]/h1/text()').get()
        return item

@vishalmry

Since Scrapy 1.7.0, the process_request callback also receives a response parameter, so you need to change def use_splash(self, request): to def use_splash(self, request, response):

It does not work; it throws an error: use_splash() missing 1 required positional argument: 'response'

@Gallaecio

@vishKurama Which Scrapy version are you using? Can you share a minimal, reproducible example?

@gingergenius

Since Scrapy 1.7.0, the process_request callback also receives a response parameter, so you need to change def use_splash(self, request): to def use_splash(self, request, response):

It does not work, throws an error use_splash() is missing 1 required positional argument: 'response'

I had this problem too. Just use yield rule.process_request(r, response) in the last line of the overridden _requests_to_follow method.

@JwanKhalaf

I am facing a similar problem and the solutions listed here aren't working for me, unless I've missed something!

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy_splash import SplashRequest
import logging

class MainSpider(CrawlSpider):
    name = 'main'
    allowed_domains = ['www.somesite.com']

    script = '''
    function main(splash, args)
      splash.private_mode_enabled = false

      my_user_agent = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36'

      headers = {
        ['User-Agent'] = my_user_agent,
        ['Accept-Language'] = 'en-GB,en-US;q=0.9,en;q=0.8',
        ['Referer'] = 'https://www.google.com'
      }

      splash:set_custom_headers(headers)

      url = args.url

      assert(splash:go(url))

      assert(splash:wait(2))

      -- username input
      username_input = assert(splash:select('#username'))
      username_input:focus()
      username_input:send_text('myusername')
      assert(splash:wait(0.3))

      -- password input
      password_input = assert(splash:select('#password'))
      password_input:focus()
      password_input:send_text('mysecurepass')
      assert(splash:wait(0.3))

      -- the login button
      login_btn = assert(splash:select('#login_btn'))
      login_btn:mouse_click()
      assert(splash:wait(4))

      return splash:html()
    end
    '''

    rules = (
        Rule(LinkExtractor(restrict_xpaths="(//div[@id='sidebar']/ul/li)[7]/a"), callback='parse_item', follow=True, process_request='use_splash'),
    )

    def start_requests(self):
        yield SplashRequest(url = 'https://www.somesite.com/login', callback = self.post_login, endpoint = 'execute', args = {
            'lua_source': self.script
        })

    def use_splash(self, request):
        request.meta.update(splash={
            'args': {
                'wait': 1,
            },
            'endpoint': 'render.html',
        })

        return request

    def _requests_to_follow(self, response):
        if not isinstance(response, (HtmlResponse, SplashJsonResponse, SplashTextResponse)):
            return

        seen = set()

        for n, rule in enumerate(self._rules):
            links = [lnk for lnk in rule.link_extractor.extract_links(response) if lnk not in seen]

            if links and rule.process_links:
                links = rule.process_links(links)

            for link in links:
                seen.add(link)
                r = self._build_request(n, link)

                yield rule.process_request(r)

    def post_login(self, response):
       logging.info('hey from post login!')

       with open('post_login_response.txt', 'w') as f:
           f.write(response.text)
           f.close()

    def parse_item(self, response):
        logging.info('hey from parse_item!')

        with open('post_search_response.txt', 'w') as f:
            f.write(response.text)
            f.close()

The parse_item function is never hit: in the logs I never see 'hey from parse_item!', but I do see 'hey from post login!'. I'm not sure what I'm missing.

@InzamamAnwar

The following is a working crawler for scraping https://books.toscrape.com, tested with Scrapy version 2.9.0. For installing and configuring Splash, follow the README.

import scrapy
from scrapy.http import HtmlResponse
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy_splash import SplashRequest, SplashTextResponse, SplashJsonResponse



class FictionBookScrapper(CrawlSpider):
    _WAIT = 0.1

    name = "fiction_book_scrapper"
    allowed_domains = ['books.toscrape.com']
    start_urls = ["https://books.toscrape.com/catalogue/category/books_1/index.html"]

    le_book_details = LinkExtractor(restrict_css=("h3 > a",))
    rule_book_details = Rule(le_book_details, callback='parse_request', follow=False, process_request='use_splash')

    le_next_page = LinkExtractor(restrict_css='.next > a')
    rule_next_page = Rule(le_next_page, follow=True, process_request='use_splash')

    rules = (
        rule_book_details,
        rule_next_page,
    )

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, args={'wait': self._WAIT}, meta={'real_url': url})

    def use_splash(self, request, response):
        request.meta['splash'] = {
            'endpoint': 'render.html',
            'args': {
                'wait': self._WAIT
            }
        }
        return request

    def _requests_to_follow(self, response):
        if not isinstance(response, (HtmlResponse, SplashTextResponse, SplashJsonResponse)):
            return
        seen = set()
        for rule_index, rule in enumerate(self._rules):
            links = [
                lnk
                for lnk in rule.link_extractor.extract_links(response)
                if lnk not in seen
            ]
            for link in rule.process_links(links):
                seen.add(link)
                request = self._build_request(rule_index, link)
                yield rule.process_request(request, response)

    def parse_request(self, response: scrapy.http.Response):
        self.logger.info(f'Page status code = {response.status}, url= {response.url}')

        yield {
            'Title': response.css('h1 ::text').get(),
            'Link': response.url,
            'Description': response.xpath('//*[@id="content_inner"]/article/p/text()').get()
        }
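For completeness, the project settings that scrapy-splash requires (per its README) look roughly like this; SPLASH_URL is whatever address your Splash container listens on:

# settings.py (adjust SPLASH_URL to your own setup)
SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'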
