Showing 4 changed files with 27 additions and 1 deletion.
# Do I really need to scrape that website, or can I buy pre-scraped data?
Before scraping the data yourself, check whether it is already available on one of the following data marketplaces:
- [Databoutique.com](https://www.databoutique.com/): the first data marketplace designed for web scraped data, launching in the coming weeks
- [AWS Data Exchange](https://aws.amazon.com/data-exchange/): a data marketplace with 3500+ datasets
# Is the data I want to scrape subject to privacy laws or copyright?
**This is not legal advice; please consult a lawyer if you're in doubt.**
Another aspect to consider before starting a web scraping project is the kind of data we're retrieving.
## Personal data or PII
Under GDPR, it is illegal to scrape an EU resident's personal data unless you have a lawful basis, such as the person's explicit consent, and this alone should be enough to make you stop gathering any personal data. It's very difficult to know before scraping whether the person whose data you're about to collect is an EU resident, and in any case similar rules exist in other countries, which makes scraping personal data impractical.
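
One way to act on this in a scraper is to drop personal fields before anything is stored. The snippet below is only a minimal sketch: the field names in `PERSONAL_FIELDS` and the sample record are hypothetical, and a name-based denylist will not catch personal data hidden in free text, so it does not make a project GDPR-compliant on its own.

```python
# Hypothetical field names: adjust them to whatever your scraper extracts.
PERSONAL_FIELDS = {"name", "email", "phone", "address", "date_of_birth"}


def strip_personal_fields(record: dict) -> dict:
    """Return a copy of the record without the fields listed as personal data."""
    return {
        key: value
        for key, value in record.items()
        if key.lower() not in PERSONAL_FIELDS
    }


# Made-up example record, as a scraper might produce it.
scraped = {"product": "Sneakers", "price": 79.9, "Email": "jane@example.com"}
clean = strip_personal_fields(scraped)
print(clean)
```

Filtering at extraction time, rather than after storage, means the personal data never reaches your database in the first place.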

This great article by [Zyte](https://www.zyte.com/blog/web-scraping-gdpr-compliance-guide/#:~:text=Scraping%20sensitive%20data%20means%20that,you%20should%20avoid%20scraping%20it.) explains how to stay compliant with GDPR, which applies only to the personal data of people in the EU.

## Copyrighted Data
[Unless you're OpenAI](https://www.theguardian.com/books/2023/jul/05/authors-file-a-lawsuit-against-openai-for-unlawfully-ingesting-their-books), you cannot scrape copyrighted material and hope to win a case in court.
So limit your scraping to data that is publicly available and factual, not created by someone who can claim it as their own. This also means you should not scrape and store pictures taken by professional photographers: not only artistic pictures, but also pictures made for fashion websites.
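
As a rough illustration of the advice on pictures, a scraper can skip image assets before downloading anything. This is a sketch under assumptions: the extension list and the example URLs are made up, and a file extension says nothing about the copyright status of a resource, so treat it as a convenience filter rather than a legal check.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Assumed list of image extensions to skip; extend it as needed.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".webp", ".svg"}


def is_image_url(url: str) -> bool:
    """Return True when the URL path ends with a known image extension."""
    path = urlparse(url).path
    return PurePosixPath(path).suffix.lower() in IMAGE_EXTENSIONS


# Hypothetical URLs collected from a product page.
urls = [
    "https://shop.example.com/products/123",
    "https://shop.example.com/img/look-01.JPG?width=800",
]
to_fetch = [url for url in urls if not is_image_url(url)]
print(to_fetch)
```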
# Reading the website's Terms of Service
**This is not legal advice; please consult a lawyer if you're in doubt.**
This great article by [Apify](https://blog.apify.com/enforceability-of-terms-of-use/) summarizes the different types of Terms of Use and when they are enforceable.
Basically, it depends on whether the user, or the scraper, actively did something to accept them.
- Browsewrap: the TOS is placed somewhere on the website, but the user doesn't need to take any action. Not enforceable in most cases, since the user may never have seen it.
- Clickwrap: the TOS needs to be accepted by the user with a click. Generally enforceable, since the user actively accepted it, and any breach of the TOS could be punished.
- Scrollwrap: similar to clickwrap, but the user also needs to scroll to the end of the TOS before accepting. Enforceable in most cases too.
- Sign-in-wrap: you need to log in, and somewhere in the UX you accept the TOS. It could be enforceable, depending on the UX and on how easy the TOS is to see.

Generally speaking, it's better not to scrape websites that require a login and an active acceptance of a TOS that bans scraping, also because the Apify study refers to US legislation, and the results of court cases may differ in other countries.
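
The rule of thumb above can be written down as a small pre-flight check. This is a sketch only: `TosType`, `ENFORCEABILITY`, and `should_avoid_scraping` are hypothetical names, the enforceability notes simply paraphrase the list above, and nothing here replaces reading the TOS or asking a lawyer.

```python
from enum import Enum


class TosType(Enum):
    BROWSEWRAP = "browsewrap"
    CLICKWRAP = "clickwrap"
    SCROLLWRAP = "scrollwrap"
    SIGN_IN_WRAP = "sign-in-wrap"


# Notes paraphrased from the list above (based on US cases, per the Apify article).
ENFORCEABILITY = {
    TosType.BROWSEWRAP: "not enforceable in most cases",
    TosType.CLICKWRAP: "generally enforceable",
    TosType.SCROLLWRAP: "enforceable in most cases",
    TosType.SIGN_IN_WRAP: "depends on how visible the TOS is in the UX",
}


def should_avoid_scraping(
    requires_login: bool, actively_accepted: bool, tos_bans_scraping: bool
) -> bool:
    """Apply the rule of thumb: avoid sites where all three conditions hold."""
    return requires_login and actively_accepted and tos_bans_scraping


# Example: a site behind a login whose clickwrap TOS bans scraping.
print(ENFORCEABILITY[TosType.CLICKWRAP])
print(should_avoid_scraping(True, True, True))
```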