This repository has been archived by the owner on May 17, 2024. It is now read-only.

Unblock Algolia domains #206

Closed
asurak opened this issue Jul 27, 2020 · 3 comments
Labels
false-positive Legit site being blocked by accident

Comments

asurak commented Jul 27, 2020

First, thank you for your work! I'm a happy user of your list myself in my Pi-Hole ;)

I've found a few domains on your list that impact the day-to-day work of my team, and I'm eager to get your feedback on how to avoid that in the future. Full disclosure: I work at Algolia, and privacy and security are a big part of my work there.

The domains in question are:

  • analytics.algolia.com - Analytics service used from backend
  • analytics.de.algolia.com - Analytics service used from backend
  • analytics.us.algolia.com - Analytics service used from backend
  • recommendation.algolia.com - Recommendation service used from backend
  • insights.algolia.io - Insights service used from backend and frontend
  • telemetry.algolia.com - Performance data collection used from frontend

On a very high level, we provide a hosted search platform to our customers. The primary product is our search API, which receives search requests and produces search responses. The ecosystem is accompanied by various other services like Analytics, Recommendation, Insights, and Telemetry. Outside of our community projects (like DocSearch), we don't run services where visitors of the website would be our users. Thanks to this, we have no interest in user tracking, fingerprinting, or otherwise following our users around the web. Considering that Algolia is used on tens of thousands of websites around the world, it would be pretty terrifying to track all the users (and I would have a very hard time sleeping at night). For our systems to work, we're more than comfortable with anonymous identifiers like a DB id, UUID and similar, which are provided to us by our customers, not by us to the customers. (Interestingly, some of them expect us to do it for them, and we then explain that we're not doing it and won't be doing it.)

Considering the reach that Algolia has, we're also strict on not mixing data between customers. Whenever a service is provided to a customer, it's for them, their data and their results. Our customers don't get access to data of other customers and cannot benefit from the data itself. Which means that if you visit two different websites using Algolia, we won't be able to say that you're the same user and the website won't know (from us) that you're the same user.

Some of the services do indeed look scary. Our Analytics carries the same name as the all-time-largest tracker, Google Analytics, while the only thing our Analytics does is provide an API on top of the application logs, helping our customers identify what their users are searching for. The reason there are multiple domains is data locality, which allows us to process data end-to-end in the EU or the US without the data crossing the border. Analytics details

The other scary service is Recommendation, which looks like an ad-targeting platform where we tell you what you're looking for based on your overall preferences. That is true (if it's enabled), except that the dataset considered is only the website you're on. This makes it more similar to "search history" and "autocomplete" than to "I know what your next search query will be because of your last 30 minutes of web activity". Personalization details

Our Insights service then primarily powers search relevancy feedback and A/B testing. Since search is a hard and unsolved problem, this feedback loop helps our customers improve the relevancy of their search results. All that still without tracking across the web. Insights details

The last one on the list is Telemetry. Here we definitely failed in communication and explanation during our first attempts to better understand end-user experience. Since we operate hundreds of endpoints around the world across many data-centres and providers, we wanted to find a way to get insights into the real performance, not just what we see on our side. With the rebuild of the system, we've created a transparency page for our Telemetry project where we explain what we store, what we do with the data and why. We still don't store tracking cookies and we truncate the IPs, since we still have no use for such precise data and /24 is enough. Thanks to this data we're already helping our providers better understand their network performance and improve peering, thus helping all their customers. I’m aware that we’re getting into dangerous waters here and that’s exactly why we’re trying to make it as privacy preserving as possible.
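To illustrate what truncating an IP to /24 means in practice, here is a minimal Python sketch. It is not Algolia's actual implementation; the function name `truncate_ipv4` and the sample addresses (taken from the documentation-reserved ranges) are assumptions made for the example. It keeps the first 24 bits of an IPv4 address and zeroes the last octet, so up to 256 neighbouring addresses collapse into one value.

```python
import ipaddress


def truncate_ipv4(address: str, prefix_len: int = 24) -> str:
    """Zero the host bits of an IPv4 address, keeping only the first prefix_len bits.

    Hypothetical helper for illustration only.
    """
    # strict=False lets ip_network accept an address with host bits set
    # and mask them off instead of raising ValueError.
    network = ipaddress.ip_network(f"{address}/{prefix_len}", strict=False)
    return str(network.network_address)


if __name__ == "__main__":
    print(truncate_ipv4("203.0.113.77"))  # prints 203.0.113.0
```

After truncation, the stored value identifies a network block rather than a single host, which is still enough to reason about provider and peering performance.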

All in all, I wanted to bring this to your attention. I’m very open to feedback about what we can do better on our side because as I mentioned, we have no desire to track people around the Internet. If possible, it would be great to remove the Algolia domains from the list as they're neither serving ads nor tracking.

@lightswitch05 lightswitch05 added the false-positive Legit site being blocked by accident label Jul 27, 2020
@lightswitch05
Owner

Hey @asurak, I appreciate the detailed ticket. I also greatly appreciate the transparency about the fact that you work at Algolia. Not all companies/employees that find themselves on my list have such a thoughtful response.

Alright... for basically any ticket that gets opened here, I try to first provide some details about why I added the domains. Your ticket is a bit different, but I'm a man of routine.

  • 4e9d955 : 82 new domains. - Sep 27, 2018
    • analytics.algolia.com
    • analytics.de.algolia.com
    • recommendation.algolia.com
    • analytics.us.algolia.com
  • e2abc80 : New ad and tracking domains. - Sep 9, 2018
    • insights.algolia.io
  • 6d93dae : Some hulu tracking hosts and a few random new ones - Nov 26, 2018
    • telemetry.algolia.com

As is unfortunately typical with my older entries, my commit messages are absolute trash and provide no comments as to my motivation. Now, I try to provide links and request URLs for all new entries.


Alright, on to my thought process for managing these lists. Note, this is all generic and is not making claims about what Algolia does or does not do. It's also my opinion, which is a poor source, but it's what I've got for managing this list. Things are not always clear cut, there is a LOT of gray.

Ads & Tracking. I wanted to make two lists, one for Ads, and one for Tracking. Well, I found out real fast that there is no Ad blocking without also blocking tracking. Ads and tracking are not two categories, but a single one. You can have tracking without ads, but it is extremely rare to have Ads without tracking. After all, how do you prove to the Ad buyer that the Ad is worth the money if not by providing tracking details? Anonymous tracking invites Ad fraud, which is a whole thing on its own, but touches on an important detail: not all tracking is bad.

  • Error tracking is extremely important to developers. Reproducing errors is hard, and user reports are terrible: 'this thing is broken' is ~useless.
  • A/B testing is not inherently bad. It can be great for finding patterns to make the user experience better and to find gaps in the UI. Or maybe to do a gradual rollout of a major redesign to catch errors early. This also does not work without the above tracking.
  • Recommendations are not always terrible either. query: "thang", recommendation: "thing"

So there are these great tools for engineering quality products. And then there are these same great tools for maximizing profits - which is the only true goal of a company.

  • Error tracking is great for detecting when Ads are being blocked and for finding the optimal way to bypass ad blocking. Or, for companies that use random domain names to avoid lists, for knowing when it's time to get a new domain.
  • A/B tests, when looked at from a profit point of view, are great for optimizing dark patterns. Who cares about the user's best experience? We need to funnel people into conversions.
  • Recommendations that end up just being affiliate marketing.

So where to draw the line? What's acceptable tracking? What are acceptable recommendations? I get a lot of false-positive requests. Sometimes, I agree and end up removing the domain. More often, I disagree, but understand that I'm probably the outlier, and so I move it into the aggressive list. Other times, I disagree and make no change.

There are a lot of tools and services that do no tracking or advertising themselves, but enable their clients to do tracking and advertising. I think this pattern is becoming more and more popular, as it puts the GDPR requirements on the customer instead of the company. The service will support collection of the data, and will store that data, but it's on the client to choose whether to send the PII and when to delete that PII.

There is a LOT of gray when it comes to managing these lists. As I told a NewRelic employee, the only thing I have when curating the lists is my own moral compass on what to block. When it is unclear, I often fall back on my "opt-in" preference. A really unpopular decision I made in #161 is one where I'm breaking Xbox Live achievements and telling users that if it is something they want, then they need to 'opt in' to it by whitelisting it on their end. I think these domains are the same. I understand that you as a company might not be engaging in these shady practices, but it does not mean that you are not enabling your clients to do it. I also understand that the data can be really helpful in some cases, but blocking that tracking is a core feature of my list, and so I believe it is correctly categorized.
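The "opt-in" idea above can be sketched in a few lines of Python. This is only an illustration of the concept, not how Pi-hole or this list implements whitelisting: the function name `apply_allowlist` and the hosts-style `0.0.0.0 domain` line format are assumptions made for the example. Everything on the blocklist stays blocked unless the user has explicitly listed the domain locally.

```python
from typing import Iterable, List


def apply_allowlist(blocklist_lines: Iterable[str], allowlist: Iterable[str]) -> List[str]:
    """Return the blocklist lines whose domain is not in the user's allowlist.

    Hypothetical helper; assumes hosts-style lines such as "0.0.0.0 example.com".
    """
    allowed = {domain.strip().lower() for domain in allowlist}
    kept = []
    for line in blocklist_lines:
        # Ignore anything after '#', then split into address and domain fields.
        fields = line.split("#", 1)[0].split()
        if not fields:
            # Blank line or pure comment: keep it untouched.
            kept.append(line)
            continue
        # The domain is the second field, or the only field in a plain domain list.
        domain = (fields[1] if len(fields) > 1 else fields[0]).lower()
        if domain not in allowed:
            kept.append(line)
    return kept


if __name__ == "__main__":
    blocklist = [
        "# sample entries",
        "0.0.0.0 telemetry.algolia.com",
        "0.0.0.0 ads.example.com",
    ]
    for entry in apply_allowlist(blocklist, ["telemetry.algolia.com"]):
        print(entry)
```

The default stays "blocked"; the user's local allowlist is the only thing that lets a domain through.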


Again, I appreciate the detailed and thoughtful request, but I do not believe these domains are false-positives, and so I won't be removing them from my list. I hope you do not take it personally. I have nothing against Algolia as a company and have received quality search results from your product.

@asurak
Author

asurak commented Jul 27, 2020

Thank you for the detailed answer, I really appreciate it. I don't take it personally at all, since I also hate tracking and advertisements. When I joined Algolia back in 2014, I made it clear that the day we start injecting advertisements into search results or tracking people around the web, I'll quit. Still there :)

I understand your moral compass, and it actually got me thinking about the potential shady use-cases that some of our customers might be using the APIs for, despite those APIs having a single purpose: improving the relevance of the search results. Ironically, while trying to give businesses an alternative to Google/Amazon/Facebook tools and Google-controlled traffic, we ended up blocked along with them :-D

We'll definitely sit down with the team and brainstorm how the APIs can be misused. For now, I'm failing to find ways they can be misused for any of the mentioned shady use-cases, except for affiliate marketing in the form of "people also buy..." or "similar items" recommendations. Since those don't require direct connectivity to the end-user and can be easily proxied without losing any functionality, we might be in a difficult fight. I really hope that our customers are putting more effort into optimising search results and the overall quality of their data than into finding sneaky ways to misuse the API (for which I hope we're already giving them a hard time).

Thank you for your consideration. Do you mind if I use your response internally to underline the importance of keeping our privacy-focused approach? (even if we didn't get removed from the list)

@lightswitch05
Owner

lightswitch05 commented Jul 27, 2020

Feel free to use whatever you like, this is a public forum and I greatly appreciate feedback and opinions on my list - positive or negative - as long as people remain respectful.

Thanks for providing transparency in how your data is used, it sounds reasonable. I think if I were to make an exception here, there are some other domains I would also need to remove, and I don't really want to become the moral police. I spend way too much time on this project as it is. Can it be used for ads? Can it be used for tracking? Then it gets added 😄
