This repository has been archived by the owner on Dec 28, 2023. It is now read-only.

Retry #18

Open · elacuesta wants to merge 16 commits into master

Conversation
Conversation

@elacuesta (Member) commented on Apr 10, 2021

Closes #12

Tasks:

  • Document settings
  • Read settings in middleware
  • Deprecate CRAWLERA_FETCH_RAISE_ON_ERROR in favour of a more general CRAWLERA_FETCH_ON_ERROR setting (with possible values warn(ing), raise, retry; see the sketch after this list)
  • Implement retry
  • Tests
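
A minimal sketch of how the new setting could look in a project's settings.py, assuming the OnError values mentioned later in this thread (OnError.Warn, OnError.Retry, plus a Raise value) are importable from the crawlera_fetch package; the import path and member names are assumptions:

```python
# settings.py (sketch): replaces the deprecated boolean
# CRAWLERA_FETCH_RAISE_ON_ERROR with the more general setting.
# Assumes OnError is exposed by the crawlera_fetch package.
from crawlera_fetch import OnError

CRAWLERA_FETCH_ON_ERROR = OnError.Retry  # or OnError.Warn / OnError.Raise
```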

@elacuesta added the enhancement label on Apr 10, 2021
@codecov bot commented on Apr 10, 2021

Codecov Report

Merging #18 (8e87867) into master (d715b08) will decrease coverage by 3.10%.
The diff coverage is 90.52%.


@@             Coverage Diff             @@
##            master      #18      +/-   ##
===========================================
- Coverage   100.00%   96.89%   -3.11%     
===========================================
  Files            3        4       +1     
  Lines          206      290      +84     
===========================================
+ Hits           206      281      +75     
- Misses           0        9       +9     
Impacted Files                  Coverage Δ
crawlera_fetch/_utils.py         76.66% <76.66%> (ø)
crawlera_fetch/middleware.py     99.15% <96.87%> (-0.85%) ⬇️
crawlera_fetch/__init__.py      100.00% <100.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update d715b08...8e87867.

@elacuesta changed the title from Retry to [WIP] Retry on Apr 10, 2021
@PyExplorer commented:

Hi @elacuesta, thank you for starting the implementation.
Please let me know when I can test it with the real spiders.

@@ -76,6 +87,21 @@ Crawlera middleware won't be able to handle them.
Default values to be sent to the Crawlera Fetch API. For instance, set to `{"device": "mobile"}`
to render all requests with a mobile profile.

* `CRAWLERA_FETCH_SHOULD_RETRY` (type `Optional[Union[Callable, str]]`, default `None`)

@pawelmhm commented:
I would use True as the default, because in most cases you want to retry. Every API can fail, uncork can fail, and the spider should retry.

The Scrapy 2.5 requirement might be limiting for some users: we don't support this stack in Scrapy Cloud at Zyte at the moment, so this would have to wait for a stack release and would force all uncork users to migrate to 2.5. Is there some way to make it compatible with all Scrapy versions, not just 2.5?

@elacuesta (Member, Author) replied:

Hey @pawelmhm:

> I would use True as default because most of the cases you want to retry. Every API can fail, uncork can fail, spider should retry.

CRAWLERA_FETCH_SHOULD_RETRY receives a callable (or the name of a callable within the spider) to be used to determine if a request should be retried. Perhaps it could be named differently; I'm open to suggestions. CRAWLERA_FETCH_ON_ERROR is the setting that determines what to do with errors. I made OnError.Warn the default just to keep backward compatibility, but perhaps OnError.Retry would be a better default.
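
A minimal sketch of such a callable defined on a spider; the exact arguments the middleware passes are not spelled out in this thread, so the (response, request, spider) signature below is an assumption:

```python
import scrapy

class MySpider(scrapy.Spider):
    name = "example"
    custom_settings = {
        # The setting accepts a callable or, as here, the name of a
        # callable defined within the spider.
        "CRAWLERA_FETCH_SHOULD_RETRY": "should_retry",
    }

    def should_retry(self, response, request, spider):
        # Hypothetical signature, not confirmed in this thread.
        # Retry when the Fetch API answered but reported a server-side error.
        return response.status in (500, 503)
```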

> Requirement of 2.5 might be limiting for some users, we don't support this stack in Scrapy Cloud in Zyte at the moment so this would have to wait for release of stack and would force all uncork users to migrate to 2.5. Is there some way to make it compatible with all scrapy not just 2.5?

AFAIK, you should be able to use 2.5 with a previous stack by updating the requirements file. The 2.5 requirement is because of scrapy/scrapy#4902. I wanted to avoid code duplication, but I guess I can just use the upstream function if available and fall back to copying the implementation, as in the sketch below.
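
A minimal sketch of that fallback, assuming the helper in question is get_retry_request from scrapy/scrapy#4902 (added in Scrapy 2.5); the fallback body is illustrative, not the actual vendored implementation:

```python
try:
    # Available upstream since Scrapy 2.5 (scrapy/scrapy#4902)
    from scrapy.downloadermiddlewares.retry import get_retry_request
except ImportError:
    import logging

    logger = logging.getLogger(__name__)

    def get_retry_request(request, *, spider, reason="unspecified", max_retry_times=None):
        # Simplified stand-in for older Scrapy versions; a real vendored
        # copy would mirror the upstream implementation more closely.
        settings = spider.crawler.settings
        if max_retry_times is None:
            max_retry_times = request.meta.get("max_retry_times", settings.getint("RETRY_TIMES"))
        retry_times = request.meta.get("retry_times", 0) + 1
        if retry_times <= max_retry_times:
            new_request = request.copy()
            new_request.meta["retry_times"] = retry_times
            new_request.dont_filter = True
            return new_request
        logger.error("Gave up retrying %(request)s (failed %(retries)d times): %(reason)s",
                     {"request": request, "retries": retry_times, "reason": reason})
        return None
```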

@pawelmhm commented on Apr 12, 2021:

Thanks for the explanation. Is there any scenario where you don't need retries? In my experience it is very rare not to want to retry internal server errors, timeouts, or bans.

> AFAIK, you should be able to use 2.5 with a previous stack, by updating the requirements file. The 2.5 requirement is because of

I think some projects are using old Scrapy versions, and they would have to update Scrapy to a recent release, which is extra work and extra effort for them. If they are stuck on an old version like 1.6 or 1.7, updating to 2.5 might not be straightforward.

But my main point, after thinking about this, is: why do we actually need a custom retry? Why can't we handle this in the retry middleware by default, like all other HTTP error codes? There are two use cases mentioned by Taras in the GitHub issue, but I'm not convinced by them, and after talking with a developer of the Fetch API I hear they plan to change the behavior to return 500 and 503 HTTP status codes instead of a 200 HTTP status code with an error code in the response body.
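
For reference, a sketch of the stock Scrapy configuration that would cover such server-side status codes with no custom retry logic (the values shown are Scrapy's defaults):

```python
# settings.py: Scrapy's built-in RetryMiddleware configuration. If the
# Fetch API returned real 500/503 statuses, the built-in middleware
# would retry them with no custom code.
RETRY_ENABLED = True
RETRY_TIMES = 2  # retries in addition to the first attempt
RETRY_HTTP_CODES = [500, 502, 503, 504, 522, 524, 408, 429]
```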

@elacuesta changed the title from [WIP] Retry to Retry on Apr 22, 2021
@elacuesta marked this pull request as ready for review on April 22, 2021 01:20
Resolved review threads: .coveragerc (outdated), crawlera_fetch/middleware.py
@elacuesta (Member, Author) commented:

@akshayphilar do you have any insights on this? Mostly about "change behavior to return 500 and 503 HTTP status codes instead of 200 HTTP status code with error code in response body" from #18 (comment). Bottom line: should we provide a client-side retry mechanism, or should we rely on the server to return status codes that Scrapy retries by default?

Labels: enhancement (New feature or request)
Projects: None yet
Development: Successfully merging this pull request may close issue: Add retry mechanism into middleware
4 participants