update anti-bot techniques
pigivinci committed Jun 10, 2023
1 parent fec5a13 commit b9a8299
Showing 5 changed files with 16 additions and 9 deletions.
10 changes: 6 additions & 4 deletions Pages/Antibot/Cloudflare.md
@@ -1,17 +1,17 @@
# Cloudflare Bot Management

## What is Cloudflare Bot Management?
[Akamai Bot Manager ](https://www.akamai.com/products/bot-manager "Akamai") detect bots using device fingerprinting bot signatures and ip checks.
[Cloudflare Bot Management](https://www.cloudflare.com/products/bot-management/ "Cloudflare") detects bots using device fingerprinting, bot signatures, and IP checks.

## Our View on Cloudflare Bot Management

### How to Identify Cloudflare Bot Management
Use [Wappalyzer Chrome Extension](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/0386528f99a1209a538f6d042e859cd9933011c8/Pages/Tools/Wappalyzer.md)

### Recommended approach to Cloudflare Bot Management
**BEST CHOICE**: Depends from the configuration of the single website, but [Playwright](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/main/Pages/Tools/Playwright.md) + [Stealth](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/main/Pages/Tools/Playwright_stealth.md) are usually enough for scraping.
**BEST CHOICE**: Each website can be configured with a different degree of protection. The best approach is to use [Playwright](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/main/Pages/Tools/Playwright.md) with a privacy-focused browser like Brave or an anti-detect browser like Gologin.

A good solution, still to be tested by our side, is to find the IP address of the web server of the target website and then scrape from there.
A promising approach, which we have not yet tested, is to find the IP address of the target website's web server and scrape it directly. An updated version of the solution techniques with code can be found on [The Web Scraping Club](https://substack.thewebscraping.club "The Web Scraping Club").
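
As a minimal sketch of the Playwright plus Brave combination, the snippet below points Playwright's Chromium driver at a locally installed Brave binary. The `BRAVE_PATH` value and the `brave_launch_options` and `fetch_html` helpers are assumptions made for illustration and are not part of this repository. The binary location differs by operating system, and whether this is enough depends on how the target website has configured its protection.

```python
# Assumed install location of Brave on Linux; adjust for your system.
BRAVE_PATH = "/usr/bin/brave-browser"


def brave_launch_options(executable_path: str = BRAVE_PATH, headless: bool = False) -> dict:
    """Build keyword arguments for Playwright's chromium.launch()."""
    return {"executable_path": executable_path, "headless": headless}


def fetch_html(url: str) -> str:
    """Open `url` in Brave through Playwright and return the rendered HTML."""
    # Imported lazily so the helper above works without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        # Brave is Chromium-based, so it is driven through the chromium launcher.
        browser = p.chromium.launch(**brave_launch_options())
        try:
            page = browser.new_page()
            page.goto(url, wait_until="domcontentloaded")
            return page.content()
        finally:
            browser.close()
```

Running headed (`headless=False`) is a choice made here for the sketch, not a requirement stated by the source.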

### Reference and interesting links
[Official web page](https://www.cloudflare.com/en-gb/products/bot-management/)
@@ -22,4 +22,6 @@ A good solution, still to be tested by our side, is to find the IP address of th

[Firefox appears to be flagged as suspicious from Cloudflare](https://brianlovin.com/hn/31459258)

[High level description](https://www.zenrows.com/blog/bypass-cloudflare#what-is-cloudflare-bot-management)
[High level description](https://www.zenrows.com/blog/bypass-cloudflare#what-is-cloudflare-bot-management)

[List of articles on The Web Scraping Club](https://substack.thewebscraping.club/t/cloudflare)
4 changes: 2 additions & 2 deletions Pages/Antibot/Datadome.md
@@ -9,9 +9,9 @@
Use [Wappalyzer Chrome Extension](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/0386528f99a1209a538f6d042e859cd9933011c8/Pages/Tools/Wappalyzer.md)

### Recommended approach to Datadome
**BEST CHOICE**: [Playwright](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/main/Pages/Tools/Playwright.md) + [Stealth](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/main/Pages/Tools/Playwright_stealth.md) are usually enough for scraping.

**BEST CHOICE**: [Playwright](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/main/Pages/Tools/Playwright.md) + Brave browser is a good solution. An updated version of the solution techniques with code can be found on [The Web Scraping Club](https://substack.thewebscraping.club "The Web Scraping Club").
### Reference and interesting links
[Official web page](https://datadome.co/)
[Tests made with online tools](https://blog.vanila.io/how-strong-is-the-datadome-5e9ff211384e)
[List of articles on The Web Scraping Club](https://substack.thewebscraping.club/t/datadome)

3 changes: 2 additions & 1 deletion Pages/Antibot/Kasada.md
@@ -10,8 +10,9 @@ Unluckily [Wappalyzer Chrome Extension](https://github.com/reanalytics-databouti
The first request to the website returns a 429 error (visible only in the Network tab of the browser's developer tools), followed by a redirect to the same page, which then loads properly. The response to this second request includes additional headers such as "x-kpsdk-ct".
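
The identification pattern described here (a 429 first, then a working response carrying "x-kpsdk-ct") can be expressed as a small check. This is only a heuristic sketch: the function name `looks_like_kasada` and its inputs are hypothetical, and the caller is assumed to have collected the status codes and final response headers, for example from a browser session.

```python
from typing import Mapping, Sequence

# Header named in the description above.
KASADA_HEADER = "x-kpsdk-ct"


def looks_like_kasada(status_codes: Sequence[int], final_headers: Mapping[str, str]) -> bool:
    """Return True if a 429 came first and the final response has the Kasada header.

    status_codes: HTTP statuses in the order they were observed for the page.
    final_headers: response headers of the last (working) response.
    """
    saw_429_first = len(status_codes) >= 2 and status_codes[0] == 429
    # Header names are case-insensitive, so compare in lowercase.
    has_header = any(name.lower() == KASADA_HEADER for name in final_headers)
    return saw_429_first and has_header
```

A match suggests Kasada but does not prove it, since other systems can also answer with a 429.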

### Recommended approach to Kasada
**BEST CHOICE**: at the moment, the best approach is a [Playwright](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/main/Pages/Tools/Playwright.md) + [Stealth](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/main/Pages/Tools/Playwright_stealth.md) but result can depend from the hardware where the scraper is executed.
**BEST CHOICE**: At the moment, the best approach is [Playwright](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/main/Pages/Tools/Playwright.md) using Firefox with the right flags. An updated version of the solution techniques with code can be found on [The Web Scraping Club](https://substack.thewebscraping.club "The Web Scraping Club").

### Reference and interesting links
[Official web page](https://www.kasada.io/)
[List of articles on The Web Scraping Club](https://substack.thewebscraping.club/t/kasada)

5 changes: 4 additions & 1 deletion Pages/Antibot/PerimeterX.md
@@ -12,9 +12,12 @@ Use [Wappalyzer Chrome Extension](https://github.com/reanalytics-databoutique/we
### Recommended approach to PerimeterX
After the scraper has visited some pages, a challenge like the one in the picture can be triggered, blocking execution. A full browser is needed to avoid triggering the captcha, together with some random mouse movements and pauses before moving to another page.
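
The random mouse movements and pauses mentioned here could look roughly like the sketch below. It assumes a Playwright sync-API `page` object and uses only its `page.mouse.move` and `page.wait_for_timeout` calls. The helper names, viewport size, and timing ranges are illustrative assumptions, not values tested against PerimeterX.

```python
import random
from typing import List, Optional, Tuple


def random_mouse_path(width: int, height: int, points: int = 5,
                      rng: Optional[random.Random] = None) -> List[Tuple[int, int]]:
    """Pick `points` random coordinates inside a width x height viewport."""
    rng = rng or random.Random()
    return [(rng.randint(0, width - 1), rng.randint(0, height - 1)) for _ in range(points)]


def humanize(page, width: int = 1280, height: int = 720,
             min_pause_ms: int = 800, max_pause_ms: int = 2500,
             rng: Optional[random.Random] = None) -> None:
    """Move the mouse along a random path, then pause before the next navigation."""
    rng = rng or random.Random()
    for x, y in random_mouse_path(width, height, rng=rng):
        # `steps` makes Playwright emit intermediate mouse-move events.
        page.mouse.move(x, y, steps=rng.randint(5, 20))
        page.wait_for_timeout(rng.randint(100, 400))
    # Final pause (assumed range) before moving to another page.
    page.wait_for_timeout(rng.randint(min_pause_ms, max_pause_ms))
```

A scraper would call `humanize(page)` after each page load and before the next `page.goto(...)`.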

**BEST CHOICE**: [Playwright](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/main/Pages/Tools/Playwright.md) + [Stealth](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/main/Pages/Tools/Playwright_stealth.md)
**BEST CHOICE**: [Playwright](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/main/Pages/Tools/Playwright.md) + Firefox

An updated version of the solution techniques with code can be found on [The Web Scraping Club](https://substack.thewebscraping.club "The Web Scraping Club").

### Reference and interesting links
[Official web page](https://www.perimeterx.com/products/bot-defender)
[How Perimeterx works](https://www.trickster.dev/post/how-does-perimeterx-bot-defender-work/)
[List of articles on The Web Scraping Club](https://substack.thewebscraping.club/t/perimeterx)

3 changes: 2 additions & 1 deletion Pages/Antibot/Shape.md
@@ -9,8 +9,9 @@
[Wappalyzer Chrome Extension](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/0386528f99a1209a538f6d042e859cd9933011c8/Pages/Tools/Wappalyzer.md) doesn't seem to recognize it. We've noticed that certain websites protected by Shape stop working when opened in a browser in incognito mode with the developer tools open. Once the developer tools are closed, they work again.

### Recommended approach to Shape Bot Defence
**BEST CHOICE**: at the moment, the best approach is a [Playwright](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/main/Pages/Tools/Playwright.md) + [Stealth](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/main/Pages/Tools/Playwright_stealth.md) but cannot be enough. The scraper should mimic a plausible user interaction with the website, we'll share an example soon.
**BEST CHOICE**: At the moment, the best approach is [Playwright](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/main/Pages/Tools/Playwright.md) + Firefox with the right options. An updated version of the solution techniques with code can be found on [The Web Scraping Club](https://substack.thewebscraping.club "The Web Scraping Club").

### Reference and interesting links
[Shape Bot Defence](https://www.f5.com/cloud/products/bot-defense)
[List of articles on The Web Scraping Club](https://substack.thewebscraping.club/t/shape)
