added chapter 1 pages
pigivinci committed Jul 6, 2023
1 parent 4f073cc commit 219c3ba
Showing 4 changed files with 27 additions and 1 deletion.
4 changes: 4 additions & 0 deletions Pages/1.Before Scraping/Buy or Make.md
@@ -0,0 +1,4 @@
# Do I really need to scrape that website, or can I buy pre-scraped data?
Before scraping the data yourself, check whether the data you need is already available on one of the following data marketplaces:
- [Databoutique.com](https://www.databoutique.com/): the first data marketplace designed for web-scraped data, launching in the coming weeks
- [AWS Data Exchange](https://aws.amazon.com/data-exchange/): a data marketplace with 3,500+ datasets
12 changes: 12 additions & 0 deletions Pages/1.Before Scraping/Privacy and copyright.md
@@ -0,0 +1,12 @@
# Is the data I want to scrape subject to privacy laws or copyright?
**This is not legal advice; please consult a lawyer if you're in doubt.**
Another aspect to consider before starting a web scraping project is the kind of data we're retrieving.
## Personal data or PII
Under GDPR, scraping an EU resident's personal data without a lawful basis, such as the person's explicit consent, is illegal, and this alone should be enough to deter you from gathering personal data. It is very difficult to know before scraping whether a person whose data will be collected is an EU resident, and in any case similar rules exist in other countries, which makes scraping personal data too risky to be worthwhile.

This great article by [Zyte](https://www.zyte.com/blog/web-scraping-gdpr-compliance-guide/#:~:text=Scraping%20sensitive%20data%20means%20that,you%20should%20avoid%20scraping%20it.) explains how to stay compliant with GDPR, which is EU legislation; other jurisdictions have their own rules.

## Copyrighted Data
[Unless you're OpenAI](https://www.theguardian.com/books/2023/jul/05/authors-file-a-lawsuit-against-openai-for-unlawfully-ingesting-their-books), you cannot scrape copyrighted material and hope to win a case in court.
So limit your operations to data that is publicly available and factual, not created by someone who can claim it as their own. This also means not scraping and storing pictures taken by professional photographers, which covers not only artistic pictures but also pictures made for fashion websites.

10 changes: 10 additions & 0 deletions Pages/1.Before Scraping/Reading Terms.md
@@ -0,0 +1,10 @@
# Reading the Terms of Service of the website
**This is not legal advice; please consult a lawyer if you're in doubt.**
This great article by [Apify](https://blog.apify.com/enforceability-of-terms-of-use/) summarizes the different types of Terms of Use and when they are enforceable.
Basically, it depends on whether the user, or the scraper, took an active step to accept them.
- Browsewrap: the TOS are placed somewhere on the website, but the user doesn't need to take any action. Not enforceable in most cases, since the user may never have seen them.
- Clickwrap: the TOS must be accepted by the user with a click. Generally enforceable, since the user actively accepted them, and any breach of the TOS could be punished.
- Scrollwrap: similar to clickwrap, but the user also needs to scroll to the end of the TOS before accepting. Enforceable in most cases too.
- Sign-in-wrap: you need to log in, and somewhere in the UX you accept the TOS. May be enforceable, depending on how easy the TOS are to see in the UX.

Generally speaking, it's better not to scrape websites that require a login and an active acceptance of TOS that ban scraping, also because the Apify study refers to US legislation, and the outcomes of cases may differ in other countries.
2 changes: 1 addition & 1 deletion README.md
@@ -31,7 +31,7 @@ The table of content below will be updated regularly as soon as some new topics

### 1.Before scraping a website
#### 1.1 Is scraping that website legal?
-- Reading terms and conditions of the website
+- Reading terms of services of the website
- Is the data I want to scrape compliant with privacy laws or copyrighted?
- Do I really need to scrape that website or I can buy pre-scraped data?
#### 1.2 Preliminary website study