Commit 8159422 ("new home page"), pigivinci, Jul 6, 2023
# Web scraping from 0 to hero
During the past several years at [Re Analytics](https://re-analytics.com/ "Re Analytics website") we've spent a lot of time refining best practices for web scraping, to make it scalable and efficient to maintain.
It's a cat-and-mouse game: you always need to stay up to date on the latest developments but, at the same time, the information you need is scattered across the net.
For this reason, we started centralizing all the information we collected and the best practices we developed, to build a point of reference for the Python web scraping community.
Originally named "Web Scraping Open Project", this repository aims to build a common knowledge base among web scraping practitioners, interesting for both rookies and experts in the field.
Anyone can submit content if it adds value to the project.
Of course, we won't accept AI-generated content or salesy, sponsored material; some sections are dedicated to commercial tools, but they're based on user experience rather than marketing.

## Why this repository?
Web scraping is becoming harder and more expensive, with anti-bot solutions growing more aggressive and often requiring commercial tools to bypass. At the same time, the need for web data is growing exponentially, following the post-Covid-19 acceleration in digitalization. On top of this, AI models will need more and more data for training, and their main source is usually the web (just ask [Reddit](https://techcrunch.com/2023/07/04/reddit-braces-for-life-after-api-changes/ "Reddit API controversy") and [Twitter](https://business.twitter.com/en/blog/update-on-twitters-limited-usage.html "Twitter anti-scraping measures")).
So while the challenges are increasing, so are the opportunities for developers who want to embark on a career as a web data engineer.
In this repository we're building a silo of all the sparse and fragmented content around the web and sharing our experience with tools, languages, and best practices, to create a great basecamp for those starting now and a source of inspiration for experts looking for new tools and solutions.

## Who am I?
I'm [Pierluigi Vinciguerra](https://www.linkedin.com/in/pierluigivinciguerra/), co-founder and CTO at [Databoutique.com](https://www.databoutique.com), and I've been working in web scraping for more than 10 years.
I've always felt the need to centralize in one place the information about web scraping that is scattered around the web. At first I took some notes for myself; in 2022 I decided to share them with everyone by starting a free Substack called [The Web Scraping Club](https://substack.thewebscraping.club/), quite successful considering the niche I'm writing for, even if it's only my voice that is heard. With this repository, I want to create a chorus of web scraping experts sharing their experiences and ideas, so that the whole industry can benefit from it.

## How does this repository work?
This repository aims to be a central hub of information about web scraping, so to keep it readable and ordered this page is used as a table of contents, with links to all the topics covered.
Topics can be added by anyone, as long as they are relevant and add value to the repository.
I tend to keep the pages short (about 400-500 words max) and link to external pages when longer content is needed, but that's not a rule.
You can write an excerpt of a longer blog post on these pages and then link to the full article.
Feel free to contribute: sharing each other's knowledge boosts the value of this repository for everyone.

## Why Use Some Best Practices?
Our goal is to scrape as many sites as we can, so we've always looked for these key elements of a successful large-scale web scraping project. At the moment they are focused on scraping e-commerce websites, because that's what we've done for years, but we're open to integrating best practices from other industries.
- **Resilient execution**: we want the code to be as low maintenance as possible.
- **Faster maintenance**: we work smarter when we find standard solutions and don't have to decipher creative one-offs every time.
- **Regulatory compliance**: web scraping is a serious matter, and we need to know exactly which tools are used.
The following practices are always evolving, so feel free to suggest yours.
Content not allowed:
- **Out of scope content**
- **Promotional content**
- **Referral codes**
- **AI-generated content**

### 1. Preliminary Study
The table of contents below will be updated regularly as new topics come to mind. If an item doesn't link to an article, that page doesn't exist yet, so feel free to add one.

#### 1.1. Technology Stack
Perform a technology stack evaluation of the target website using the [Wappalyzer Chrome Extension](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/0386528f99a1209a538f6d042e859cd9933011c8/Pages/Tools/Wappalyzer.md), paying particular attention to the "Security" block.
When a technology is detected under the "Security" section, check whether this repository has a specific solution for it.
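As a quick complement to the extension, response headers often leak which anti-bot or CDN vendor sits in front of a site. A minimal sketch; the header names below are indicative signatures we've come across, not an exhaustive or authoritative list:

```python
# Map each vendor to header names that hint at its presence.
# These signatures are assumptions: vendors change them over time.
VENDOR_HINTS = {
    "akamai": ["x-akamai-transformed", "akamai-grn"],
    "cloudflare": ["cf-ray", "cf-cache-status"],
    "datadome": ["x-datadome"],
}

def detect_vendors(headers):
    """Return the vendors whose signature headers appear in `headers`."""
    lowered = {k.lower() for k in headers}
    return sorted(
        vendor
        for vendor, hints in VENDOR_HINTS.items()
        if any(hint in lowered for hint in hints)
    )

# Example with headers captured from a hypothetical response:
print(detect_vendors({"CF-RAY": "7d2...", "Content-Type": "text/html"}))
```

In practice you'd feed it the headers of a real response (e.g. `resp.headers` from `requests`) before deciding which scraping approach to use.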
#### 1.2. API Search
Does the website have internal or public APIs for fetching the price/product data? If so, this is the best scenario available, and we should use them to gather data.
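Once you spot such an endpoint in the browser's Network tab, calling it directly is straightforward. A sketch using only the standard library; the endpoint, the `page` parameter, and the `items` key are hypothetical and vary from site to site:

```python
import json
from urllib.request import Request, urlopen

# Hypothetical internal endpoint, as found in the browser's Network tab.
API_URL = "https://example.com/api/products"

def fetch_products(page, fetch=None):
    """Fetch one page of products as a list of dicts.

    `fetch` can be injected (useful for testing); by default it does a
    plain stdlib HTTP GET.
    """
    url = f"{API_URL}?page={page}"
    if fetch is None:
        def fetch(u):
            req = Request(u, headers={"User-Agent": "Mozilla/5.0"})
            with urlopen(req, timeout=10) as resp:
                return resp.read().decode("utf-8")
    payload = json.loads(fetch(url))
    return payload.get("items", [])
```

Injecting `fetch` keeps the parsing logic testable without hitting the network.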
#### 1.3. JSON in HTML Search
Sometimes websites embed JSON in their HTML, even when there's no API. Extracting it, rather than parsing the markup, makes the scraper more stable.
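A minimal sketch of extracting `application/ld+json` blocks, a common place for product data, with a regex plus `json.loads`; real pages may instead ship a `window.__INITIAL_STATE__`-style object, which needs a similar but site-specific pattern:

```python
import json
import re

# Capture the body of every <script type="application/ld+json"> block.
LDJSON_RE = re.compile(
    r'<script[^>]*type="application/ld\+json"[^>]*>(.*?)</script>',
    re.DOTALL,
)

def extract_ldjson(html):
    """Return every parseable ld+json payload found in `html`."""
    out = []
    for match in LDJSON_RE.finditer(html):
        try:
            out.append(json.loads(match.group(1)))
        except json.JSONDecodeError:
            continue  # skip malformed or truncated blocks
    return out
```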
#### 1.4. Pagination
How does the website handle the pagination of the product catalogue? Internal services that return only the HTML of the catalogue are preferable to loading the full page code.
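Whatever endpoint is chosen, the crawl loop is the same. A small sketch where `get_page` stands in for your site-specific call returning the items of one catalogue page:

```python
def iterate_catalogue(get_page, max_pages=1000):
    """Yield every item across all catalogue pages.

    `get_page(page_number)` is assumed to return a list of items,
    empty once the catalogue is exhausted; `max_pages` is a safety cap.
    """
    for page in range(1, max_pages + 1):
        items = get_page(page)
        if not items:  # empty page: catalogue exhausted
            break
        yield from items
```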
### 2. Code Best Practices
#### 2.1. JSON
Use JSON if available (in the HTML of the page or from an API): it's less prone to changes.
#### 2.2. XPaths
Use XPath rather than CSS selectors to get clearer code.
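For illustration, a sketch assuming the `lxml` library; the comment shows the CSS selector the XPath replaces:

```python
from lxml import html

doc = html.fromstring("""
<ul class="catalogue">
  <li class="product"><span class="name">Shoe</span></li>
  <li class="product"><span class="name">Bag</span></li>
</ul>
""")

# Equivalent CSS selector: "li.product span.name".
# The XPath spells out the structure and attributes it depends on,
# which makes it easier to audit when the site changes.
names = doc.xpath('//li[@class="product"]/span[@class="name"]/text()')
print(names)
```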
#### 2.3. No Formatting Rules in Numeric Fields
Don't add rules for cleaning prices or numeric fields: formats vary across countries and are not standardized, so leave this task to the post-scraping phases in the databases.
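To make the split concrete: this kind of locale-aware normalization belongs downstream, in the database layer, not in the spider; the separator arguments are assumptions you'd set per country:

```python
def normalize_price(raw, decimal_sep=".", thousands_sep=","):
    """Turn a scraped price string into a float, given known separators.

    The spider stores `raw` as-is; this runs in post-scraping phases
    where the country (and thus the separators) is known.
    """
    # Keep only digits and the two separator characters.
    digits = "".join(
        c for c in raw if c.isdigit() or c in (decimal_sep, thousands_sep)
    )
    digits = digits.replace(thousands_sep, "").replace(decimal_sep, ".")
    return float(digits)
```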
#### 2.4. Product List Page Wins Over Single Product Page
Load as few pages as you can. Check whether all the fields you need are available from the product catalogue pages, and avoid entering the single product page when possible.
#### 2.5. IP Rotation
One of the most basic actions a target website can take against web scraping is banning IPs that make too many requests in a certain timeframe. Given that scraping must not interfere with the website's functionality and operations, if this is happening to your scrapers you might consider splitting their execution across several machines or routing requests via proxies.
Nowadays there are plenty of proxy vendors on the market, with proxies for every need; we go in depth [in this section](https://github.com/reanalytics-databoutique/webscraping-open-project/blob/main/Pages/Services/Proxies.md).
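A minimal round-robin rotation sketch; the proxy URLs are placeholders, since in practice they come from your vendor:

```python
from itertools import cycle

# Placeholder proxy endpoints; a real list comes from a proxy vendor.
PROXIES = [
    "http://proxy1.example.com:8000",
    "http://proxy2.example.com:8000",
    "http://proxy3.example.com:8000",
]

_pool = cycle(PROXIES)

def next_proxy():
    """Return the next proxy in round-robin order."""
    return next(_pool)

# With the `requests` library, each request would then use:
#   p = next_proxy()
#   requests.get(url, proxies={"http": p, "https": p}, timeout=10)
```

Round-robin is the simplest policy; vendors and frameworks also offer weighted or health-aware rotation when some proxies get banned faster than others.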
## Table of content

### 1. Before scraping a website
#### 1.1 Is scraping that website legal?
- Reading the terms and conditions of the website
- Is the data I want to scrape copyrighted or subject to privacy laws?
- Do I really need to scrape that website, or can I buy pre-scraped data?
#### 1.2 Preliminary website study
- Does the website have an API (internal or exposed)?
- Does it have some JSON inside the HTML?
### 2. Best practices
- Use JSON instead of HTML, if possible
- Selectors
- Data formatting
- Reducing the number of requests
### 3. Tools
#### 3.1. Headless Python scrapers
- [Scrapy](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/main/Pages/Tools/Scrapy.md)

### 4. Common anti-bot software & techniques
#### 4.1. Anti-bot Software
- [Akamai](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/main/Pages/Antibot/Akamai.md)
- [Cloudflare](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/main/Pages/Antibot/Cloudflare.md)
- [Datadome](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/main/Pages/Antibot/Datadome.md)
- [PerimeterX](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/main/Pages/Antibot/PerimeterX.md)
- [Kasada](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/main/Pages/Antibot/Kasada.md)
- [F5 Shape Security](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/main/Pages/Antibot/Shape.md)
- Forter
- Riskified
#### 4.2. Anti-bot Techniques
- [Passive fingerprinting](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/main/Pages/Antibot/Passivefingerprint.md) including:
- [TCP/IP Fingerprint](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/main/Pages/Antibot/TcpFingerprint.md)
- [TLS fingerprint](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/main/Pages/Antibot/TLSFingerprint.md)
- [HTTP Fingerprint](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/main/Pages/Antibot/HttpFingerprint.md)
- [Browser Fingerprinting](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/main/Pages/Antibot/Browserfingerprint.md) techniques including:
- [Canvas Fingerprinting](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/main/Pages/Antibot/Canvasfingerprint.md)
- [WebGL Fingerprinting](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/main/Pages/Antibot/Webglfingerprint.md)
- [Device Fingerprinting](https://github.com/reanalytics-databoutique/webscraping-open-doc/blob/main/Pages/Antibot/Devicefingerprint.md)

### 5. Test websites for your scraper
Here's a list of websites where you can test your scraper and find out how many checks it passes.
- [https://bot.incolumitas.com/](https://bot.incolumitas.com/): one of the most complete sets of tests for your scrapers
- [https://pixelscan.net/](https://pixelscan.net/): checks your IP and your machine
- [https://bot.sannysoft.com/](https://bot.sannysoft.com/): another great list of tests
- [https://abrahamjuliot.github.io/creepjs/](https://abrahamjuliot.github.io/creepjs/): a set of fingerprinting tests
- [https://fingerprintjs.com/products/bot-detection/](https://fingerprintjs.com/products/bot-detection/): page about BotD, a JavaScript bot-detection library included in Cloudflare, where you can also test your configuration

### 6. How to make money with web scraping
- Freelancing
- Sell your scrapers with Apify
- Sell your data on Databoutique.com

