
start_requests can return items #6417

Merged: 16 commits into scrapy:master, Aug 26, 2024
Conversation

@GeorgeA92 (Contributor)

Resolves #5289, based on the code sample from #5289 (comment).

I used this code sample to test the change locally.

script.py

import scrapy
from scrapy.crawler import CrawlerProcess

class QuotesToScrapeSpider(scrapy.Spider):
    name = "quotes_to_scrape"  # a spider must have a name, or Spider.__init__ raises ValueError

    def start_requests(self):
        yield scrapy.Request(url="https://quotes.toscrape.com/", callback=self.parse)
        yield {
            "quote": ["It is possible to return items on start_requests. But I don't understand why it's needed?"],
            "author": ["Georgiy"],
            "tags": ["scrapy"],
        }

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                "quote": quote.css("span.text::text").getall(),
                "author": quote.css("small.author::text").getall(),
                "tags": quote.css("a.tag::text").getall(),
            }

if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(QuotesToScrapeSpider)
    process.start()

At the moment I haven't found a test where the output type of start_requests is checked.

@Gallaecio (Member) commented Jun 28, 2024

At the moment I haven't found a test where the output type of start_requests is checked.

There are a few if you search the tests for start_requests, e.g. here or here, but they may not be straightforward to adapt.

Regardless of whether there are existing tests, I think at the very minimum we need a test that yields an item in a crawl with all built-in spider middlewares enabled (I did not check whether there are built-in ones that are not enabled by default), just to be sure that none of them fail by expecting the output of start_requests to be requests, e.g. by assuming that the items in the result iterable have some attribute or method they may not have.

Maybe it could be a test that runs a crawl and verifies with caplog that there are no error messages. We should have quite a few tests like that around, probably something like this, although there may be better examples in the Scrapy tests.
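
For illustration, a minimal sketch of such a test in the style of Scrapy's Twisted-based tests (the spider and test names here are hypothetical, and this is not necessarily the test that ended up in the PR):

import scrapy
from twisted.internet import defer
from twisted.trial import unittest
from scrapy.utils.test import get_crawler

class ItemsFromStartRequestsSpider(scrapy.Spider):
    name = "items_from_start_requests"

    def start_requests(self):
        # yield an item directly, without any Request
        yield {"field": "value"}

class StartRequestsItemsTest(unittest.TestCase):
    @defer.inlineCallbacks
    def test_item_from_start_requests_logs_no_errors(self):
        # get_crawler uses the default settings, so all spider
        # middlewares enabled by default are active
        crawler = get_crawler(ItemsFromStartRequestsSpider)
        yield crawler.crawl()
        stats = crawler.stats.get_stats()
        # the item should be scraped and no errors logged
        self.assertEqual(stats.get("item_scraped_count"), 1)
        self.assertNotIn("log_count/ERROR", stats)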

@GeorgeA92 (Contributor, Author)

There are a few if you search the tests for start_requests, e.g. here or here, but they may not be straightforward to adapt.

Test cases covering items yielded from start_requests have been added.

codecov bot commented Aug 20, 2024

Codecov Report

Attention: Patch coverage is 85.71429% with 1 line in your changes missing coverage. Please review.

Project coverage is 87.60%. Comparing base (41e15e9) to head (e63bcaa).
Report is 21 commits behind head on master.

Files                   Patch %   Lines
scrapy/core/engine.py   85.71%    0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6417      +/-   ##
==========================================
+ Coverage   84.65%   87.60%   +2.94%     
==========================================
  Files         162      162              
  Lines       12045    12049       +4     
  Branches     1917     1921       +4     
==========================================
+ Hits        10197    10555     +358     
+ Misses       1550     1181     -369     
- Partials      298      313      +15     
Files                   Coverage Δ
scrapy/core/engine.py   87.00% <85.71%> (-0.09%) ⬇️

... and 26 files with indirect coverage changes

@wRAR (Member) commented Aug 22, 2024

Nice, what's missing here?

@Gallaecio (Member) commented Aug 23, 2024

I think we need to deprecate response in item signals and change it to src: Any. But doing that in a backward-compatible way does not seem trivial.

Alternatively, I wonder whether, instead of using a string with the import path of the start_requests method, we could set src/response to None in those cases, so that keeping response makes sense and in these cases there is simply no response. And to make sure the logs still look nice, if response is None, we would set src to the start_requests import path in LogFormatter.scraped instead.
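
For illustration only, a sketch of what that fallback could look like in a LogFormatter subclass (hypothetical code, not the change that was merged):

from scrapy import logformatter

class StartRequestsAwareLogFormatter(logformatter.LogFormatter):
    def scraped(self, item, response, spider):
        entry = super().scraped(item, response, spider)
        if response is None:
            # hypothetical fallback: use the import path of start_requests
            # as the source, so "Scraped from %(src)s" still reads well
            cls = type(spider)
            entry["args"]["src"] = f"{cls.__module__}.{cls.__name__}.start_requests"
        return entry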

(A review thread on scrapy/core/scraper.py was marked outdated and resolved.)
@Gallaecio (Member) left a comment

@GeorgeA92 Please confirm you are also OK with the current code before we merge.

@GeorgeA92 (Contributor, Author)

@Gallaecio

@GeorgeA92 Please confirm you are also OK with the current code before we merge.

I am OK with the current code.

@Gallaecio Gallaecio merged commit 6ce0342 into scrapy:master Aug 26, 2024
26 checks passed
@icaca commented Sep 20, 2024

I tried to use peewee to read some data in start_requests, and then used a for loop to yield items (no HTTP request was made; the database was updated directly). I found that there was a 5-second delay on each iteration. I printed logs at the entry and exit of my MySQLPipeline and confirmed that the database operation takes milliseconds. I tried setting DOWNLOAD_DELAY and AUTOTHROTTLE_START_DELAY in the settings file, but it didn't help. Has anyone encountered this situation?
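
For reference, a minimal sketch of the pattern being described, assuming a peewee model (the database, model, and field names here are hypothetical):

import scrapy
from peewee import CharField, Model, SqliteDatabase

db = SqliteDatabase("quotes.db")  # hypothetical local database

class Quote(Model):
    text = CharField()
    author = CharField()

    class Meta:
        database = db

class DbItemsSpider(scrapy.Spider):
    name = "db_items"

    def start_requests(self):
        # rows are read with peewee and yielded directly as items;
        # no HTTP request is made at all
        for row in Quote.select():
            yield {"quote": row.text, "author": row.author}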

@GeorgeA92 GeorgeA92 deleted the start_requests_items branch September 20, 2024 09:29
Successfully merging this pull request may close: Better error (or support?) when yielding items from start requests (#5289)
4 participants