
start_requests can return items #6417

Merged: 16 commits into scrapy:master, Aug 26, 2024
Conversation

@GeorgeA92 (Contributor)

Resolves #5289, based on the code sample from #5289 (comment).

I used this code sample to test the change locally.

script.py

import scrapy
from scrapy.crawler import CrawlerProcess

class QuotesToScrapeSpider(scrapy.Spider):
    name = "quotes_to_scrape"  # a spider must have a name, or Spider.__init__ raises ValueError

    def start_requests(self):
        yield scrapy.Request(url="https://quotes.toscrape.com/", callback=self.parse)
        yield {
            "quote": ["It is possible to return items on start_requests. But I don't understand why it's needed?"],
            "author": ["Georgiy"],
            "tags": ["scrapy"],
        }

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                "quote": quote.css("span.text::text").getall(),
                "author": quote.css("small.author::text").getall(),
                "tags": quote.css("a.tag::text").getall(),
            }

if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(QuotesToScrapeSpider)
    process.start()

At the moment I haven't found a test where the output type of start_requests is checked.

@Gallaecio (Member) commented Jun 28, 2024

At the moment I haven't found a test where the output type of start_requests is checked.

There are a few if you search the tests for start_requests, e.g. here or here, but they may not be straightforward to adapt.

Regardless of whether there are existing tests, I think at the very minimum we need a test that yields an item in a crawl with all built-in spider middlewares enabled (I did not check whether there are built-in ones that are not enabled by default), just to be sure that none of them fail by expecting the output of start_requests to be requests, e.g. by assuming that the items in the result iterable have some attribute or method they may not have.

Maybe it could be a test that runs a crawl and verifies with caplog that there are no error messages. We should have quite a few tests like that around, probably something like this, although there may be better examples in the Scrapy tests.
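
For illustration, a minimal sketch of such a test in the style of Scrapy's Twisted-based tests (the spider and test names here are hypothetical, and this is not necessarily the test that ended up in the PR):

import scrapy
from twisted.internet import defer
from twisted.trial import unittest
from scrapy.utils.test import get_crawler

class ItemsFromStartRequestsSpider(scrapy.Spider):
    name = "items_from_start_requests"

    def start_requests(self):
        # yield an item directly, without any Request
        yield {"field": "value"}

class StartRequestsItemsTest(unittest.TestCase):
    @defer.inlineCallbacks
    def test_item_from_start_requests_logs_no_errors(self):
        # get_crawler uses the default settings, so all spider
        # middlewares enabled by default are active
        crawler = get_crawler(ItemsFromStartRequestsSpider)
        yield crawler.crawl()
        stats = crawler.stats.get_stats()
        # the item should be scraped and no errors logged
        self.assertEqual(stats.get("item_scraped_count"), 1)
        self.assertNotIn("log_count/ERROR", stats)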

@GeorgeA92 (Contributor, Author)

There are a few if you search the tests for start_requests, e.g. here or here, but they may not be straightforward to adapt.

Test cases covering items yielded from start_requests have been added.

codecov bot commented Aug 20, 2024

Codecov Report

Attention: Patch coverage is 85.71429% with 1 line in your changes missing coverage. Please review.

Project coverage is 87.60%. Comparing base (41e15e9) to head (e63bcaa).
Report is 21 commits behind head on master.

Files                   Patch %   Lines
scrapy/core/engine.py   85.71%    0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6417      +/-   ##
==========================================
+ Coverage   84.65%   87.60%   +2.94%     
==========================================
  Files         162      162              
  Lines       12045    12049       +4     
  Branches     1917     1921       +4     
==========================================
+ Hits        10197    10555     +358     
+ Misses       1550     1181     -369     
- Partials      298      313      +15     
Files                   Coverage Δ
scrapy/core/engine.py   87.00% <85.71%> (-0.09%) ⬇️

... and 26 files with indirect coverage changes

@wRAR (Member) commented Aug 22, 2024

Nice, what's missing here?

@Gallaecio (Member) commented Aug 23, 2024

I think we need to deprecate response in item signals and change it to src: Any. But doing that in a backward-compatible way does not seem trivial.

Alternatively, I wonder whether, instead of using a string with the import path of the start_requests method, we could set src/response to None in those cases, so that keeping response makes sense and in these cases there is simply no response. And to make sure the logs still look nice, if response is None, we would set src to the start_requests import path in LogFormatter.scraped instead.
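
For illustration only, a sketch of what that fallback could look like in a LogFormatter subclass (hypothetical code, not the change that was merged):

from scrapy import logformatter

class StartRequestsAwareLogFormatter(logformatter.LogFormatter):
    def scraped(self, item, response, spider):
        entry = super().scraped(item, response, spider)
        if response is None:
            # hypothetical fallback: use the import path of start_requests
            # as the source, so "Scraped from %(src)s" still reads well
            cls = type(spider)
            entry["args"]["src"] = f"{cls.__module__}.{cls.__name__}.start_requests"
        return entry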

(A review thread on scrapy/core/scraper.py was marked outdated and resolved.)
@Gallaecio (Member) left a comment

@GeorgeA92 Please confirm you are also OK with the current code before we merge.

@GeorgeA92 (Contributor, Author)

@Gallaecio

@GeorgeA92 Please confirm you are also OK with the current code before we merge.

I am OK with the current code.

@Gallaecio Gallaecio merged commit 6ce0342 into scrapy:master Aug 26, 2024
26 checks passed
@icaca commented Sep 20, 2024

I tried to use peewee to read some data in start_requests, and then used a for loop to yield items (no HTTP request was made; the database was updated directly). I found that there was a 5-second delay on each iteration. I printed logs at the entry and exit of my MySQLPipeline and confirmed that the database operation takes milliseconds. I tried setting DOWNLOAD_DELAY and AUTOTHROTTLE_START_DELAY in the settings file, but it didn't help. Has anyone encountered this situation?
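
For reference, a minimal sketch of the pattern being described, assuming a peewee model (the database, model, and field names here are hypothetical):

import scrapy
from peewee import CharField, Model, SqliteDatabase

db = SqliteDatabase("quotes.db")  # hypothetical local database

class Quote(Model):
    text = CharField()
    author = CharField()

    class Meta:
        database = db

class DbItemsSpider(scrapy.Spider):
    name = "db_items"

    def start_requests(self):
        # rows are read with peewee and yielded directly as items;
        # no HTTP request is made at all
        for row in Quote.select():
            yield {"quote": row.text, "author": row.author}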

@GeorgeA92 GeorgeA92 deleted the start_requests_items branch September 20, 2024 09:29
Successfully merging this pull request may close: Better error (or support?) when yielding items from start requests (#5289)
4 participants