🚀 Feature: Block or make it harder to scrape data #5060

Open
2 tasks done
DankMemeGuy opened this issue Jan 29, 2023 · 8 comments
Labels
product / databases Fixes and upgrades for the Appwrite Database.

Comments

@DankMemeGuy

DankMemeGuy commented Jan 29, 2023

🔖 Feature description

When using Appwrite as the backend with client-side rendering (CSR), anyone can open Chrome's inspect tools, look at the Network tab, and view both the API links and the JSON payloads. This makes data scraping easy. I cannot find a way to prevent this right now other than to use server-side rendering (SSR).

Rate limiting is not sufficient, as the user could use proxies and possibly alter the API links themselves to get more data per JSON payload than intended.
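The "more data per payload than intended" part can at least be bounded on the server. Below is a minimal sketch, assuming a hypothetical `limit` query parameter and a hypothetical cap `MAX_PAGE_SIZE`; neither name is taken from Appwrite's API, and the point is only that the server, not the URL, decides the page size.

```python
from urllib.parse import parse_qs, urlparse

MAX_PAGE_SIZE = 25  # hypothetical server-side cap, not an Appwrite setting


def effective_limit(url: str, default: int = 25) -> int:
    """Return the page size the server should honor for this request URL."""
    query = parse_qs(urlparse(url).query)
    # "limit" is an assumed parameter name used only for illustration.
    raw = query.get("limit", [str(default)])[0]
    try:
        requested = int(raw)
    except ValueError:
        requested = default
    # Clamp to [1, MAX_PAGE_SIZE] no matter what the client asked for.
    return max(1, min(requested, MAX_PAGE_SIZE))
```

With this in place, editing the link to ask for `limit=5000` still yields at most `MAX_PAGE_SIZE` items per response, although it does nothing against a scraper that simply pages through the results.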

However, on the Appwrite development side, there should be a feature that encrypts the JSON payloads returned from the API links. The encryption key could be stored in a minified JS file on the client side that does the decryption. Sure, this isn't bulletproof, but at least it would disincentivize rookie scrapers from pillaging the website.

Another solution could just be to obfuscate the JSON payloads as they are sent from the server. Example: https://tools.pixelpoly.co/obfuscator
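To make the obfuscation idea concrete, here is a minimal sketch, assuming a shared key that ships with the client code. It uses XOR plus Base64, which is obfuscation and not real encryption; the function names are hypothetical and this is not how Appwrite or the linked tool works.

```python
import base64
import json
from itertools import cycle


def xor_bytes(data: bytes, key: bytes) -> bytes:
    """XOR every byte of data with the repeating key."""
    return bytes(b ^ k for b, k in zip(data, cycle(key)))


def obfuscate(payload: dict, key: bytes) -> str:
    """Server side: turn a JSON payload into an opaque Base64 string."""
    raw = json.dumps(payload, separators=(",", ":")).encode("utf-8")
    return base64.b64encode(xor_bytes(raw, key)).decode("ascii")


def deobfuscate(blob: str, key: bytes) -> dict:
    """Client side: reverse the transformation with the same key."""
    raw = xor_bytes(base64.b64decode(blob), key)
    return json.loads(raw.decode("utf-8"))
```

The response in the Network tab is then no longer readable JSON, but anyone who extracts the key from the client bundle can call `deobfuscate` just as the page does.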

Whether you use that solution or find another, at the end of the day this is a real issue. The problem with forcing user registration to protect the payloads is that not every platform requires an account for its free tier. For example, say you offer a view mode; think of reddit.com. You can view the website without an account, so you could find the API links and start scraping without one (their website is built with SSR, so that isn't an issue there, but it is an example of the kind of website that could be targeted if it did use CSR).

🎤 Pitch

This would allow the client side (CSR) to perform actions against the Appwrite backend while still revealing the API links, but it would prevent users from reading the payloads. It would add a barrier: anyone who intends to scrape the platform would have to work harder to find the decryption key in the minified/obfuscated JS.

Possibly the encryption key could be rotated regularly or automatically, so that any scraper that does reverse engineer it would need to repeat the process frequently.
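One way the rotation could work is to derive the key from a server-held secret and the current time window. This is a rough sketch under that assumption; `ROTATION_SECONDS`, `key_for_time`, and the master secret are hypothetical names, not an existing Appwrite feature.

```python
import hashlib
import hmac

ROTATION_SECONDS = 3600  # hypothetical rotation window (one hour)


def key_for_time(master_secret: bytes, unix_time: int) -> bytes:
    """Derive the payload key for the time window containing unix_time."""
    window = unix_time // ROTATION_SECONDS
    # HMAC-SHA256 over the window index gives a fresh 32-byte key per window.
    return hmac.new(master_secret, str(window).encode("ascii"), hashlib.sha256).digest()
```

The server would still have to hand the current key to every legitimate browser, so a scraper that has automated fetching the key once can keep doing so after each rotation.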

👀 Have you spent some time to check if this issue has been raised before?

  • I checked and didn't find a similar issue

🏢 Have you read the Code of Conduct?

@stnguyen90
Contributor

@DankMemeGuy, thanks for creating this issue! 🙏🏼 How about we focus this issue on preventing scraping rather than "encrypting API calls client-side/server-side"?

@DankMemeGuy
Author

DankMemeGuy commented Jan 31, 2023

> @DankMemeGuy, thanks for creating this issue! 🙏🏼 How about we focus this issue on preventing scraping rather than "encrypting API calls client-side/server-side"?

Encrypting the API calls is just one possible solution to the problem of scraping.

I'm trying to look into it, but I think obfuscating the payloads would solve the problem of people scraping.

Rate-limiting the API links (whether server side or client side) doesn't solve the issue, since proxies can easily bypass that, and a scraper can still collect everything just by adding a wait command. Even obfuscating the client-side code (the website's or app's JS files, for example) doesn't solve the issue, since they could still open the Network tab, get the API links, and bam, they can harvest the data from there.
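The reason proxies and waits defeat rate limiting is that the limit is counted per client address and per time window. Here is a minimal sketch of a fixed-window limiter, assuming the client is identified by IP; the class and its parameters are hypothetical and do not describe Appwrite's own rate limiter.

```python
from collections import defaultdict


class FixedWindowLimiter:
    """Allow at most max_requests per client IP in each time window."""

    def __init__(self, max_requests: int, window_seconds: int) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        # Maps (client_ip, window index) -> number of requests seen so far.
        self._counts = defaultdict(int)

    def allow(self, client_ip: str, now: float) -> bool:
        window = int(now // self.window_seconds)
        key = (client_ip, window)
        if self._counts[key] >= self.max_requests:
            return False
        self._counts[key] += 1
        return True
```

A request from a second IP gets its own counter, and the same IP gets a fresh counter in the next window, which is exactly the proxy and wait-command bypass described above.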

The root fix would have to come from the server side, in the way data is actually pumped to the client. If the API responses arrive as visible JSON payloads, they can easily be found with basic inspect tools and then scraped by using the requests module in Python to just pump that URL.

Appwrite has an anonymous registration option, but even if you automatically create an anonymous account for each visitor, a scraper could still open a new instance of the website (Chrome incognito, for example, or a proxy), pass along the new cookie, and bam, they can pump the data again even if they were rate limited or banned previously. That gives them an effectively unlimited way of pumping data.

The issue at hand is: 1, allowing users to view the platform without a registered account, and 2, preventing, or at least highly inconveniencing, scrapers via an API URL.

(Rate limiting will stop primitive scrapers, such as Selenium and macro-based scrapers.)

@stnguyen90
Contributor

@DankMemeGuy

> The issue at hand is: 1, allowing users to view the platform without a registered account, and 2, preventing, or at least highly inconveniencing, scrapers via an API URL.

Typically, when addressing feature requests, it's important to understand the "why" and the use case to come up with the best approach to address the root cause/problem.

Now that we've understood the "why," we'll see how many others need this while we brainstorm the best approach.

stnguyen90 changed the title from "🚀 Feature: encrypting API calls client-side/server-side" to "🚀 Feature: Block or make it harder to scrape data" on Jan 31, 2023
stnguyen90 added the "product / databases" label (Fixes and upgrades for the Appwrite Database.) on Jan 31, 2023
@stnguyen90
Contributor

It would be great if Appwrite could help to prevent or mitigate something like this from happening.

@DankMemeGuy
Author

> It would be great if Appwrite could help to prevent or mitigate something like this from happening.

That is EXACTLY the kind of issue I want to prevent! Wow. That story is insane.

@sanny-io

I agree there should be more protection, but I don't think encryption/obfuscation is the answer. It simply would not be worth the effort of the Appwrite team.

@DankMemeGuy
Author

> I agree there should be more protection, but I don't think encryption/obfuscation is the answer. It simply would not be worth the effort of the Appwrite team.

There are only so many other options. If you use client-side rendering, your API links will be revealed, and if you don't require user registration for some or all features, you face the possibility of mass scraping. Just saying "use server-side rendering" isn't a solution, and there aren't many other options that can shield the API on a CSR layout.


3 participants