start_requests can return items #6417
Conversation
There are a few if you search the tests. Regardless of whether there are existing tests or not, I think at the very minimum we need a test that yields an item in a crawl with all built-in spider middlewares enabled (I did not check whether there are built-in ones that are not enabled by default), just to be sure that they do not fail by expecting the output of start_requests to be requests, e.g. by assuming that the items in the result iterable have some attribute or method they may not have. Maybe it could be a test that runs a crawl and verifies that there are no error messages.
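A plain-Python sketch of the failure mode described above (all names here are hypothetical stand-ins, not Scrapy's actual classes): a middleware that iterates the output of start_requests must not assume every element is a request.

```python
class FakeRequest:
    """Stand-in for scrapy.Request (hypothetical, for illustration only)."""
    def __init__(self, url):
        self.url = url

def start_requests():
    # start_requests may now yield items (here: a plain dict) as well as requests
    yield FakeRequest("https://example.com")
    yield {"name": "an item, not a request"}

def naive_middleware(results):
    # BUGGY: assumes every element has a .url attribute;
    # raises AttributeError when it reaches the item dict
    for r in results:
        print(r.url)
        yield r

def safe_middleware(results):
    # Correct: only touch request-specific attributes on actual requests
    for r in results:
        if isinstance(r, FakeRequest):
            r.url = r.url.rstrip("/")
        yield r

for element in safe_middleware(start_requests()):
    print(type(element).__name__)
```

The point of the proposed test is exactly this distinction: a crawl through all built-in middlewares should behave like `safe_middleware`, passing items through untouched rather than failing like `naive_middleware`.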
Codecov Report
Attention: Patch coverage is
Additional details and impacted files

@@            Coverage Diff             @@
##           master    #6417      +/-   ##
==========================================
+ Coverage   84.65%   87.60%    +2.94%
==========================================
  Files         162      162
  Lines       12045    12049       +4
  Branches     1917     1921       +4
==========================================
+ Hits        10197    10555     +358
+ Misses       1550     1181     -369
- Partials      298      313      +15
Nice, what's missing here?
I think we need to deprecate … Alternatively, I wonder if, instead of using a string with the import path of the start_requests method, we could set src/response to …
… the src later on
@GeorgeA92 Please confirm you are also OK with the current code before we merge.
I am OK with the current code.
I tried to use peewee to read some data in start_requests, and then used a for loop to yield items (no HTTP request was made; the database was updated directly). I found that there was a 5-second delay each time. I printed logs at the entry and exit of MySQLPipeline and confirmed that the database operation took milliseconds. I tried setting DOWNLOAD_DELAY and AUTOTHROTTLE_START_DELAY in the settings file, but it didn't work. I wonder if anyone has encountered this situation.
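For reference, a plain-Python reduction of the pattern described above (the original uses peewee and MySQL; this sketch substitutes stdlib sqlite3 and hypothetical table/column names so it is self-contained): items are built directly from database rows, with no HTTP request involved, so any per-item delay would have to come from the crawler machinery rather than from this code.

```python
import sqlite3

def load_items(conn):
    # Yield one item dict per row; no network request is made here,
    # and each yield should take only as long as the row fetch itself.
    for row in conn.execute("SELECT id, name FROM products"):
        yield {"id": row[0], "name": row[1]}

# Hypothetical in-memory database standing in for the real MySQL source
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO products VALUES (?, ?)",
                 [(1, "widget"), (2, "gadget")])

items = list(load_items(conn))
print(items)
```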
Resolves #5289.
Based on the code sample from #5289 (comment).
I used this code sample to test this locally.
script.py