Skip to content

Commit

Permalink
proxies added
Browse files Browse the repository at this point in the history
  • Loading branch information
Pierluigi Vinciguerra authored and Pierluigi Vinciguerra committed May 30, 2022
1 parent f96b770 commit 3365607
Show file tree
Hide file tree
Showing 5 changed files with 36 additions and 0 deletions.
Binary file added .DS_Store
Binary file not shown.
Binary file modified Pages/.DS_Store
Binary file not shown.
Binary file added Pages/Services/.DS_Store
Binary file not shown.
33 changes: 33 additions & 0 deletions Pages/Services/Proxies.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,33 @@
# Proxies

## What is a proxy?
Generally speaking, a proxy is a server that stays between a client and the target server.
![Proxy image from Wikipedia](https://en.wikipedia.org/wiki/Proxy_server#/media/File:Proxy_concept_en.svg)
They are tipically used in web scraping project for the following reasons:
- Ip rotation, a single scraper execution connects to multiple proxies, so the target website doesn't see that requests are coming from a single source
- Avoid geoblocking, some websites have different behaviours in different geographical areas.

## What kind of proxies are available?
Proxies can be divided in groups by their anonimity level or by the device they are running on.
According the anonimity level we have:
- **Transparent Proxies**: the entering request is not altered and the request made by the proxy specified the original Ip. No information about the original request or user is hidden.
- **Anonymous Proxies**: an anonymous proxy does not reveal your IP address but it does reveal that you are using a proxy server.
- **High Anoniymous Proxies (a.k.a. Elite Proxies)**: Elite proxy servers hide both your IP address and the fact that you are using a proxy server at all. These are the most advanced proxies that offer the most security.
For a more in depth analysis, please check [this useful article](https://proxyscrape.com/blog/proxy-anonymity-levels)

As said before, proxies can be divided also by devise they are running on.
- **Datacenter proxies** are typically running on servers and if your target server uses Canvas Fingerprint and blocks the views from these machines, they can be blocked.
- **Residential proxies** instead are running from devices outside datacenters (typically studios, offices and so on). They're tipically less reliable but if they're run on desktops, they can be used to avoid fingerprinting.
- **Mobile proxies** if the device is a mobile phone\tablet.

## Where to find proxies ready to use?
There are websites that offer a list of free proxies ready to use. They are free but not so reliable, so the list can be checked frequently and probably you'll discard a lot of entries because they are not working:
- [https://free-proxy-list.net/](https://free-proxy-list.net/)
- [https://spys.one/en/](https://spys.one/en/)
- [https://www.freeproxylists.net/](https://www.freeproxylists.net/)

For professional pursposes there are several IP providers that offer their services like:
- [BrightData](https://brightdata.com/lp/proxy-network)
- [Oxylabs](https://oxylabs.io/)
- [Zyte](https://www.zyte.com/smart-proxy-manager/)

3 changes: 3 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -31,6 +31,9 @@ Use Xpaths, not css selectors for getting a clearer code.
Don't insert rules for cleaning prices or numeric fields: formats change over different countries and are not standards, let's keep this task to post scraping phases in the DBs.
#### 2.4. Product List Page wins on Single Product Page
Load the fewer pages you can. Try to see if the fields you need are all available from product catalogue pages and try avoiding enter the single product page.
#### 2.5. Ip rotation
One of the most basic actions that a target website can take against web scraping is to ban IPs that make too many requests in a certain timeframe. Given that the web scraping activity must not interfere with the website functionality and operations, if this is happening to your scrapers, you might consider splitting its execution from several machines or route it via proxies.
Nowadays there are plenty of proxy vendors on the market and also proxies for every need, we'll go in-depth in this section.

### 3. Tools
#### 3.1. Headless python scrapers
Expand Down

0 comments on commit 3365607

Please sign in to comment.