Google is the only search engine that works on Reddit now, thanks to AI deal

tbeseda · 2024-07-24T16:09:18

popcalc · 2024-07-24T14:29:56

  # Welcome to Reddit's robots.txt
  # Reddit believes in an open internet, but not the misuse of public content.
  # See https://support.reddithelp.com/hc/en-us/articles/26410290525844-Public-Content-Policy Reddit's Public Content Policy for access and use restrictions to Reddit content.
  # See https://www.reddit.com/r/reddit4researchers/ for details on how Reddit continues to support research and non-commercial use.
  # policy: https://support.reddithelp.com/hc/en-us/articles/26410290525844-Public-Content-Policy

  User-agent: *
  Disallow: /

Source: https://www.reddit.com/robots.txt
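
For anyone wondering how a well-behaved crawler would read those two rules, here is a minimal sketch using Python's standard `urllib.robotparser`. The agent name `ExampleBot` and the sample URL are assumptions for illustration only. The rule body is the one quoted above.

```python
from urllib.robotparser import RobotFileParser

# The two directives quoted above, minus the comment lines.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def is_allowed(robots_body: str, user_agent: str, url: str) -> bool:
    """Return True if robots_body permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_body.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # "ExampleBot" is a made-up agent name; the URL is just a sample path.
    for agent in ("ExampleBot", "Googlebot"):
        print(agent, is_allowed(ROBOTS_TXT, agent, "https://www.reddit.com/r/programming/"))
```

With a wildcard `Disallow: /`, every agent is refused for every path. This only describes what a compliant client would conclude; nothing in the file enforces it.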

sunaookami · 2024-07-24T15:58:49

They serve a different robots.txt to Google: https://merj.com/blog/investigating-reddits-robots-txt-cloak...

You can see it here: https://search.google.com/test/rich-results/result?id=_mYogl... (click on "View Tested Page")
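
A rough way to check for this kind of difference yourself, assuming you have already captured two robots.txt bodies by whatever means: parse both and list the URLs where the verdict differs. `CRAWLER_BODY` below is a hypothetical placeholder, not the file Reddit actually serves to Google, and the URLs are samples.

```python
from urllib.robotparser import RobotFileParser


def can_fetch(robots_body: str, user_agent: str, url: str) -> bool:
    """Evaluate one robots.txt body for one agent and URL."""
    parser = RobotFileParser()
    parser.parse(robots_body.splitlines())
    return parser.can_fetch(user_agent, url)


def differing_urls(body_a: str, body_b: str, user_agent: str, urls: list[str]) -> list[str]:
    """Return the URLs for which the two bodies give different answers."""
    return [
        url for url in urls
        if can_fetch(body_a, user_agent, url) != can_fetch(body_b, user_agent, url)
    ]


# Body quoted upthread.
PUBLIC_BODY = "User-agent: *\nDisallow: /\n"
# Hypothetical placeholder for a more permissive body (NOT the real file).
CRAWLER_BODY = "User-agent: *\nDisallow: /login\nAllow: /\n"

SAMPLE_URLS = [
    "https://www.reddit.com/r/test/",
    "https://www.reddit.com/login",
]

if __name__ == "__main__":
    print(differing_urls(PUBLIC_BODY, CRAWLER_BODY, "ExampleBot", SAMPLE_URLS))
```

Note that `urllib.robotparser` applies the first matching rule in file order, which is why the placeholder lists `Disallow: /login` before `Allow: /`.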

dogleash · 2024-07-24T16:09:18

> # Reddit believes in an open internet, but not the misuse of public content.

Calling it "public" content in the very act of exercising their ownership over it. The balls on whoever wrote that.

shit_game · 2024-07-24T23:44:39

Their license/EULA clearly states that Reddit has perpetual whatever to content posted on Reddit, but relying solely on DMCA for "stolen" content _yet again_ feels like a terrible way to deal with non-original content. Part of me hopes that Reddit gets hit with some new precedent-setting lawsuits regarding non-original content that requires useful attribution, but I doubt that will ever happen.

account42 · 2024-07-25T11:39:35

An EULA does not change the morality of the situation anyway. They are a leech profiting off users generating content who are now upset about not getting a cut from third-parties also profiting from said user generated content.

cyanydeez · 2024-07-25T20:32:36

Also, EULAs can't violate laws, the same way you can't sign away your first born child.

nsonha · 2024-07-26T04:21:20

> a leech profiting off users

can also say all humans are leeches benefiting off free software (creators) and complaining about their worthless chitchat, barely usable because of its basic semantics, being "stolen".

j5155 · 2024-07-26T05:11:56

1. That’s kind of the point of free software, is it not? If you don’t want people “leeching” off of your software for free, don’t make it free. 2. Reddit is an amazing source of coding information and general Q&A on an extremely wide variety of topics. I would not characterize all of it as chitchat.

pas · 2024-07-24T17:57:57

it's even worse. it's not theirs (it's the users'), they are merely hosting it and using it (ToS gives them a fancy irrevocable license I guess).

so they can do whatever they want with it and the actual owners/authors have no chance to really influence Reddit at all to make it crawlable. (the GDPR-like data takeout is nice, but ... completely useless in these cases where the value is in the composition and aggregation with other users' content.)

deepfriedbits · 2024-07-24T18:19:57

On top of that, a sizable chunk of Reddit content is ripped from elsewhere, whether videos, images, etc.

throwaway290 · 2024-07-24T22:04:40

actually owners/authors like me would not want our stuff crawlable because that gives up our ownership.

When I am answering some random dude on reddit with a problem I want that dude to read my solution. I don't want this to be crawled and forever stored (probably deanonymized) or enshrined in a dozen commercial LLMs. There is substack for that stuff.

RealityVoid · 2024-07-26T02:34:43

I'll be honest... I don't care about this thing. I view public posts as something that belongs to the public domain and if that is searchable, all the better, other people can reach my post. And if that is what LLMs are trained on, also, great, I hope it will be useful to some people. What I _do_ care about is an even playing field and access for _all_ LLMs to be trained on that data.

In the end, I view potential AGIs as a common consciousness brainchild.

throwaway290 · 2024-07-26T09:06:18

OK. I see "AGI" as a buzzword and current ML trend as exploitation of creative works that profits big tech. They will always have bigger computers and get more out of what we create increasing wealth gap, unless we stop it

pas · 2024-07-26T10:13:34

Social support (or we can even call it "[corporate] social responsibility") is a political question/problem/thing.

Endless iterations of the discussion about how copyright should/shouldn't be are meaningless without considering the larger social context.

The vast majority of creators (artists!?) are not compensated at all, and the users' (content consumers'? the public's?) attention is already 100% fully saturated.

So GenAI churning out more kitsch hardly changes that. (It got popular when it was novel, and ... that was it for now. And it became one more brush in the ever-growing workshop.)

And if/when a company creates a product out of it, then that product needs to be scrutinized and we should consider the ethical, social, and political problems. (Because ... politics is a blunt tool whereas ethics is as infinitely nuanced as we make it, whereas actual social considerations ought to be pragmatic and fair.)

And that's the problem. Users benefit from cheaper access to customized content. (In other words, arguably it's a net positive thing that they can - for example - ask some GenAI tool to make them a nice picture for their friend's birthday.) So what's the cost of this? Does it make some people jobless? Is that good or bad? (Is it good that plant breeding programs, fertilizers, tractors, and irrigation systems made a lot of agricultural workers jobless? Well, in some sense yes, as it allows a few people to feed many, freeing up time for others to become doctors and artists. In some sense bad, because our current socioeconomic system does not provide real social support - despite enormous redistribution of GDP. [Because of brutal inefficiencies in the allocation of that surplus. And that's again a political problem.])

... and of course here the status quo bias leads to "technological progress chipping away at inefficiencies" in a capitalist reality translates to "even more things get commoditized", and that in turn in the current shitty socioeconomic system equals "lots of externalities are not priced in, and lots of people are forced to drastically change their lives to adjust to new prices" (ie. price of their labor and/or products going down, so they need to change jobs, yet they barely get any support for doing so) ... and of course ethically it's bad that most people virtually uncritically accept and enjoy the results of progress (new products and services) without giving a fuck about the costs.

throwaway290 · 2024-07-26T10:37:18

> Users benefit from cheaper access to customized content. (In other words, arguably it's a net positive thing that they can - for example - ask some GenAI tool to make them a nice picture for their friend's birthday.)

very arguably yeah. we evolve to where we push a button and get a birthday present. the logical conclusion is a system that does not even require to push a button. this is the opposite of the original point (spend time and effort making something to show you care). so to me it is not benefit, it is harm.

maybe you picked a bad example. but the examples of legit benefits of this tech, they don't turn into harms if you really analyze it and go to the root, are hard to find in consumer land.

pas · 2024-07-27T06:54:00

I like this example because I tried making a greeting card "by hand" and it of course sucked ass. (As I have zero idea how to use Photoshop despite religiously reading the tutorials 20 years ago on a community site :D And, of course, I used PhotoPea.)

So by spending the same 20 minutes fiddling with some kind of graphics if the outcome becomes 10x better then I see that as an advantage. (That said I haven't tried this.)

Basically this is "exactly" that kind of capability that we've already seen in "artist paints kid's drawing" [0], just commoditized. And this is where I (as a user) would be looking for the style transfer. Because I saw something famous/trendy/fancy/aesthetically-pleasing and now want to imitate it as a gag for this hypothetical birthday card. (Sure, copyright already doesn't care if I download someone's famous photo, and cut my friend's head out from a photo I made, and put the head part on the copyrighted photo. Yet it was a straightforward derivative work. Just it's not commercial, damages are none, or even negative, ie. the artist benefits from that famous exposure! :D)

> legit benefits of this tech

I wanted to try to find an example that could apply to a lot of users. Because as awesome as using GenAI to generate Lean proofs (and then add a feedback loop by actually running the generated code through Lean) to solve Math Olympiad problems [1], it's not really an everyday thing.

> we evolve to where we push a button and get a birthday present. the logical conclusion is a system that does not even require to push a button.

Well, maybe! In no time we'll board the Axios and just chill. [2] But at the same time crafts and DIY and experiences (tourism, festivals, concerts) are ridiculously popular. (And it has its own problems. [3])

[0] https://www.youtube.com/watch?v=dB-Q0eNsUaQ KID'S ART Redrawn by a PROFESSIONAL ARTIST! - Ep.6

[1] https://news.ycombinator.com/item?id=41069829

[2] https://www.thelist.com/img/gallery/things-only-adults-notic...

[3] https://edition.cnn.com/2024/07/08/travel/barcelona-tourism-...

throwaway290 · 2024-07-27T12:20:34

> Well, maybe! In no time we'll board the Axios and just chill.

This was sarcasm. That system would never happen. If birthday card requires no effort it is worse than nothing, literal noise.

> So by spending the same 20 minutes fiddling with some kind of graphics if the outcome becomes 10x better then I see that as an advantage. (That said I haven't tried this.)

Feels like you misunderstood. It is not "10x better" or "advantage" because literally the easier it was for you the less it is valuable.

pas · 2024-07-25T10:48:13

have you heard about DMs?

throwaway290 · 2024-07-25T11:49:35

That's why more and more exchanges are moving to DMs and closed communities. Because we don't plan on stuff we intend for participants in the discussion to be harvested but it more and more is. If reddit breaks that trend I only welcome it.

visarga · 2024-07-24T18:58:36

> the GDPR like data takeout is nice

Is there a way to export my history? How?

pas · 2024-07-24T19:05:40

https://www.reddit.com/settings/data-request

(and there's some help article for it that I didn't read, but google found this first https://support.reddithelp.com/hc/en-us/articles/36004304835... that's how I got to the link)

visarga · 2024-07-25T03:28:07

Thank you, it worked

Terr_ · 2024-07-25T03:01:45

I tried it many months back when they glitch-killed my decade-plus account. (Yes, I'm still bitter over the kafkaesque injustice.)

Anyway, you basically submit a request and then later they will email you a link to a zip file that contains a few dozen CSV files with unescaped newlines. One for all the comments you made, one for up/down-votes, one for blocked users, etc.
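
If anyone wants to load such an export, here is a minimal sketch with Python's `csv` module, assuming the multi-line fields are at least wrapped in double quotes (if they are not quoted at all, this will not recover them). The column names `id` and `body` and the sample data are made up for illustration. I don't know the real export schema.

```python
import csv
import io

# Made-up sample: the second field of row 1 contains a raw newline inside quotes.
SAMPLE = 'id,body\n1,"first line\nsecond line"\n2,"single line"\n'


def read_rows(text: str) -> list[dict[str, str]]:
    """Parse CSV text, keeping newlines that appear inside quoted fields."""
    # newline="" leaves line endings untouched so the csv module can
    # decide which newlines end a record and which belong to a field.
    return list(csv.DictReader(io.StringIO(text, newline="")))


if __name__ == "__main__":
    for row in read_rows(SAMPLE):
        print(repr(row["id"]), repr(row["body"]))
```

The same `newline=""` argument applies when opening a real file with `open(path, newline="")`. Splitting the file on line breaks before parsing is what breaks multi-line comments.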

Khelavaster · 2024-07-25T17:58:49

The Fake News police should shut down this sort of messaging

will0 · 2024-07-24T15:10:58

Looks like it changed a month ago:

https://old.reddit.com/r/redditdev/comments/1doc3pt/updating...

immibis · 2024-07-24T15:41:04

Nobody who wants to be successful obeys robots.txt. And I do mean nobody.

chippiewill · 2024-07-24T15:49:09

They changed it to disallow so that scrapers can't just claim the robots.txt gave them permission.

tedivm · 2024-07-24T18:25:05

According to the US court systems the robots.txt file is meaningless. If they respond with a 200 status code giving you the access then you can legally scrape it all you want. If they require that you log in then you have to follow the terms you agree to when creating an account. Public means public though, and if Reddit doesn't want to make the content private (put it behind a login) then we can scrape away.

Note that scraping, regardless of the level of permission, doesn't mean you can do anything you want with the content. Copyright still applies. But you can scrape it, and if your use falls under Fair Use or another caveat to the copyright laws then you can go ahead and do it without needing any permission from the authors.

sssilver · 2024-07-25T10:11:43

Fascinating. Where can one learn more about this?

neongreen · 2024-07-25T15:25:26

I liked the chapter on DMCA from the 5-volume E-Commerce & Internet Law. It was super detailed.

I haven’t read volume 1, but apparently half of it is about data scraping, and I expect it to be similarly detailed. So if I were you, that’s where I’d start.

Another option is looking for “robots.txt” at Google Scholar and trying various keywords like “legality”, “scraping”, “case law”, etc.

datavirtue · 2024-07-25T14:04:01

The internet.

deprecative · 2024-07-25T14:38:04

If you have nothing constructive to say why say anything?

datavirtue · 2024-07-26T21:19:14

That was my answer, FFS.

toomuchtodo · 2024-07-24T16:06:07

Independent scrapers can launder the data between Reddit and AI consumers. The only folks this hurts is users seeking info via search engines and folks willing to kowtow to rules that are potentially low impact to evade. Next steps would be (from an adversarial perspective) browser extensions that stream back data for ingestion similar to Recap for Pacer [1].

[1] https://free.law/recap/faq

(full disclosure: assisting someone pursuing regulatory action against reddit in the EU for a separate issue from scraping, it's a valuable resource, but the folks who own and control it are meh)

whycome · 2024-07-25T02:10:24

Scrapings laundering. Do we have a term for this?

throwaway4pp24 · 2024-07-25T04:49:33

Yes, right in the law - "fair use"

account42 · 2024-07-25T11:47:01

Even more basic, it's free speech. The data itself is public domain so your free speech is not restricted and you don't need fair use exemptions for those restrictions. Only the access through the official system is restricted.

latexr · 2024-07-25T10:30:48

That’s a weird statement to be absolutist about. The majority of individuals and companies who want to be successful do not do so by scraping websites, thus have no reason to disobey robots.txt. Most people in the world, ambitious or not, wouldn’t even understand what your sentence refers to.

wsve · 2024-07-25T22:49:53

OP is obviously talking about people whose area of research/development/product would involve web scraping... This feels like being purposefully obtuse

maxnevermind · 2024-07-25T00:55:06

Hasn't NYT tried to sue OpenAI for ignoring robots.txt? Or do you mean it's impossible to prove and/or it's still more profitable to just ignore robots.txt?

JohnFen · 2024-07-24T19:53:55

Sadly true. That's why I gave up on robots.txt years ago and started blocking crawlers outright in .htaccess

Of course, that became unsustainable so now I have everything behind a login wall.
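
For readers who haven't done this: blocking by user agent boils down to a substring match on the `User-Agent` header. A minimal sketch of that logic follows, written in Python rather than as Apache config. The token list is an illustrative assumption, not anyone's actual blocklist.

```python
# Illustrative tokens only; a real list would be much longer and need upkeep.
BLOCKED_TOKENS = ("gptbot", "ccbot")


def is_blocked(user_agent: str) -> bool:
    """Return True if the User-Agent header contains a blocked token."""
    ua = user_agent.lower()
    return any(token in ua for token in BLOCKED_TOKENS)


if __name__ == "__main__":
    print(is_blocked("Mozilla/5.0 (compatible; GPTBot/1.0)"))
    print(is_blocked("Mozilla/5.0 (X11; Linux x86_64) Firefox/128.0"))
```

A client can send any `User-Agent` string it likes, so a check like this only stops bots that identify themselves.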

Zuiii · 2024-07-25T05:06:30

> We believe in something that we will now proceed to violate.

I will never take a statement given by a company that blatantly lies like this at face value going forward. What a bunch of clowns.

raverbashing · 2024-07-24T17:18:08

With the amount of crap in Reddit, cleaning it must be a very non-trivial problem. (I mean, it never is, but in the case of Reddit it's probably extra complicated)

arnaudsm · 2024-07-24T17:41:25

I understand the AI context, but this is dangerously anticompetitive for other search engines.

This is a dangerous precedent for the internet. Business conglomerates have been controlling most of the web, but refusing basic interoperability is even worse.

zooq_ai · 2024-07-24T18:26:01

There is nothing preventing search companies paying the same $60 Million to license content.

If reddit had an exclusive agreement, it would be anti-competitive.

This is classic HN anti-Google tirade (and downvoting facts, logic and concepts of free market)

not_wyoming · 2024-07-24T19:08:20

> There is nothing preventing search companies paying the same $60 Million to license content.

Yes, actually, there is - having $60m to throw around.

"Barriers to entry often cause or aid the existence of monopolies and oligopolies" [0]. Monopolies and oligopolies are definitionally the opposite of free market forces. This is quite literally Econ 101.

[0] - https://en.wikipedia.org/wiki/Barriers_to_entry

GuB-42 · 2024-07-24T22:33:38

Microsoft can throw around $60M, and Bing is used by most of the "alternative" search engines.

It doesn't solve the problem, but if money is the only thing preventing search engines from accessing Reddit, then what goes for Google also goes for Microsoft.

superb_dev · 2024-07-25T03:31:29

Cool, so the top two players are now the only players and everyone else is now trapped feeding off of them

account42 · 2024-07-25T11:54:02

> Microsoft can throw around $60M, and Bing is used by most of the "alternative" search engines.

That's a symptom of the issue, not a solution. Bing is used because having your own crawler is infeasible, partially because you will be literally blocked in many cases.

saghm · 2024-07-24T19:41:08

Not to mention the fact that if this became commonplace, other websites might start charging as well

try_the_bass · 2024-07-24T20:14:36

And yet "free market forces" are often the reason why monopolies and oligopolies arise...?

Monopolies are entirely consistent with free market economics. After all, if there's clearly a "best product" for a particular niche, it's entirely rational (free market actor) behavior for everyone to use the same product, leading to its monopoly in that market segment.

I don't understand why people think this isn't/won't be/shouldn't be a common result of "free market forces".

Newlaptop · 2024-07-25T05:05:53

> Monopolies are entirely consistent with free market economics

Not in the least. Literally in the first semester, Economics 101 type class that any business/economics/etc student would take, it would be covered clearly that monopolies are violations of free markets.

A free market isn't a euphemism for anarchy or "no rules", it's a specific economic term. The things it is free of include artificial price floors or ceilings, barriers to entry, anti-competitive practices, etc. In other words, monopolies, oligopolies, cartels, monopsonies, etc are all violations of a free market. You do not have a free market if there is a monopoly supplier.

try_the_bass · 2024-07-27T05:19:48

Economics 101 also assumes perfect information symmetry, perfect competition, and spherical cows. In other words, it only vaguely models reality... Good enough to teach the basic concepts, but without diving in the nuance that makes the basic models break down.

If one competitor is far enough ahead of the rest, they can maintain that lead given that they can extract sufficient momentum from their early mover advantage. If they keep this up long enough, competitors never reach the scale to sufficiently prevent them from becoming a monopoly (at least over their local market segment).

None of this requires anti-competitive behavior; simply good execution on the part of the leader.

Unless, of course, you're suggesting that "free markets" also involve government intervention to suppress their lead in the market...

animal_spirits · 2024-07-25T06:05:19

And many times the sources of these monopolies come from special privileges given out by governmental authorities, such as the U.S. FDA

account42 · 2024-07-25T12:42:42

Or even more basic government-enforced restrictions like IP laws. If you want a survival of the fittest anarchy economy then those won't exist either. Neither will legal protections against espionage or circumvention of whatever technical means you come up to try and get all that back.

not_wyoming · 2024-07-24T21:55:39

> Monopolies are entirely consistent with free market economics.

This is a fair critique. I'm approaching this from an admittedly American perspective in which "free market" colloquially implies competition - but I recognize that competition is not inherently a free market concept.

Good callout!

zooq_ai · 2024-07-24T20:00:15

Having a Monopoly != Anti-Competitive.

Having Barriers to Entry != Anti-competitive

Yes, large players have advantages of Economies of Scale.

Just because you can't run an Airline because you don't have money to buy an Airplane isn't anti-competitive.

Today Microsoft, Apple, OpenAI, Google, Amazon all can afford those piddly $60m to license from reddit.

Not Anti-competitive at all.

But saddened by how much corporate-hate by HNers destroys their credibility in debating these things.

Go ahead downvote

not_wyoming · 2024-07-24T21:52:25

If you check citations, you'd find the sentence preceding my excerpt on barriers to entry:

"Because barriers to entry protect incumbent firms and restrict competition in a market, they can contribute to distortionary prices and are therefore most important when discussing antitrust policy."

Antitrust policy then links to a page on competition law: "Competition law is the field of law that promotes or seeks to maintain market competition by regulating anti-competitive conduct by companies." [0]

So yes, I'd downvote you if I could, but HN doesn't allow downvotes - which is honestly pretty fitting in the context of this conversation.

[0] - https://en.wikipedia.org/wiki/Competition_law

nativeit · 2024-07-25T05:23:53

I took care of the down vote for you. I dunno when the privilege is earned, but keep up the quality comments and you’ll be able to downvote incurious ideologues like this in no time!

zooq_ai · 2024-07-25T17:12:53

HN Downvotes on non-engineering things are my badge of honor. It has rewarded me very well financially

zooq_ai · 2024-07-24T22:06:31

Once again buying an Airplane and starting an Airline business has probably the highest barrier to entry. Yet the Airline industry is the most competitive.

not_wyoming · 2024-07-24T22:40:37

The air travel industry has also seen some of the most significant government regulation in the form of blocking mergers (ie monopolistic, anticompetitive behavior) - meaning that competition in the airline space is due to regulation, not free market dynamics alone.

I’m happy to continue this debate if you’d like to start supporting your posts with citations but probably won’t engage further unless you do. Have a great day!

stoperaticless · 2024-07-25T06:00:34

Hm. Chip fabrication?

animal_spirits · 2024-07-25T13:42:28

Btw you will be allowed to downvote once you hit a certain karma threshold

pluc · 2024-07-24T18:28:04

Paying 60 million to every site you want to index is also a bad precedent to set. Why can Reddit get paid and XYZ can't?

zooq_ai · 2024-07-24T18:30:48

Anyone can ask for a licensing deal. I'm sure NY Times, Condé Nast all have licensing deals. Mr. Beast signed a deal with Amazon. Joe Rogan with Spotify. Why is it hard to understand?

Even HN can get a licensing deal if they want to.

If you are producing content, you have every right to do what you want to with the content.

SlackingOff123 · 2024-07-24T19:07:04

Reddit is not producing any content; its users are.

zooq_ai · 2024-07-24T19:33:50

Not the point. If users don't like it they can go somewhere else to post.

For practical purposes, reddit can do whatever they want with users' posts. It's right there in the TOS

renewiltord · 2024-07-24T21:54:11

Users sign a deal to give Reddit the content.

spixy · 2024-07-24T18:37:34

maybe Reddit has more value than XYZ?

onlyrealcuzzo · 2024-07-24T14:53:50

This is an interesting development.

How many other sites might have leverage to charge to be indexed?

I don't want to live in a world where you have to use X search engine to get answers from Y site - but this seems like the beginning of that world.

From an efficiency perspective - it's obviously better for websites to just lease their data to search engines than for both sides to pay for tons of bandwidth and compute to get that data onto search engines.

Realistically, there are only 2 search engines now.

This seems very bad for Kagi - but possibly could lead to the old, cool, hobbyist & un-monetized web being reinvented?

ColinHayhurst · 2024-07-24T15:08:22

Kagi uses at least Google and Mojeek

edit:

> Realistically, there are only 2 search engines now.

https://seirdy.one/posts/2021/03/10/search-engines-with-own-...

WarOnPrivacy · 2024-07-24T16:21:11

> Realistically, there are only 2 search engines now.

From the article:

     Many alternatives to GBY [Google, Bing, and Yandex] exist, but almost none of them have their own results;

This seems to assert that ~0 other search providers do any crawling at all. Ever. Are we sure that's the case?

   (they could crawl but never ever return those results == more odd).

ColinHayhurst · 2024-07-24T16:31:17

It's a very long article so understandable that you did not read on and learn about other search engines crawling beyond GBY. Still there are indeed very few that are crawling at web scale, and internationally. We are at 8 billion pages and totally independent [0], hence expressing our concerns to 404 media after being blanked by Reddit.

[0] https://www.mojeek.com/about/why-mojeek

WarOnPrivacy · 2024-07-24T17:02:06

> did not read on and learn about other search engines crawling beyond GBY. Still there are indeed very few that are crawling at web scale, and internationally

That's helpful clarification.

In criticism of the article, you might agree that

none of them have their own results

is a fairly absolute statement. It signals: Final word on the matter; no nuance to follow.

topaz0 · 2024-07-24T17:19:44

Omitting the "almost" from "almost none" makes it sound disingenuously more absolute than it actually is.

WarOnPrivacy · 2024-07-26T22:31:38

> Omitting the "almost" from "almost none" makes it sound disingenuously more absolute than it actually is.

Except that isn't what happened. Within the context, 'almost none' referred to size of the group, not to the amount of crawling.

I was discussing how much crawling is being done - outside of the 'almost none' sized group.

The 'almost none' sized group has 'their own results', so they do crawl. Based on that, the rest do not. Ergo, they do not crawl.

A search engine that never crawls seems non-intuitive.

shadowgovt · 2024-07-24T17:53:41

I mean, I didn't read on because it's paid.

I'm not taking their reporting without compensation, but that also means I didn't have the whole story. Such is life in this era of the internet.

culi · 2024-07-24T18:01:17

I believe Brave Search is also starting their own index. There are some tiny independent indexes too:

https://www.crawlson.com/ https://search.marginalia.nu/ https://wiby.me/ https://searchmysite.net/

MichaelZuo · 2024-07-24T16:59:09

Bing provides far fewer verbatim results for pretty much all search queries that I've tested.

And Yandex isn't much better for non cyrillic search, Baidu is only for the Chinese web effectively.

And all other search engines either don't even attempt to do full web crawls anymore and/or buy from one of the four above.

So realistically there's just one search engine for the full web that actually does the work.

dev1ycan · 2024-07-24T17:51:19

Brave has their own search engine, yandex I only use for reverse image search, baidu's interface is really clean and feels like old school google... but I don't speak chinese so I can't use it.

I hope that one day they get a western version

MichaelZuo · 2024-07-24T19:30:53

Brave doesn't have its own index of the full web, and it's even less useful than Yandex. And very likely buys some of it, according to what I've heard. So it falls into the last category.

em-bee · 2024-07-24T22:19:35

if that is true then they are lying on their site where they claim: "Brave Search operates from a fully independent search index"

do you have any reference for your claim?

i use brave search and find it very useful. very rarely there is something i can't find, and when i run into that other search engines are not much better.

MichaelZuo · 2024-07-25T08:31:18

Notice that it doesn't say "Brave Search solely operates from..." or "only operates from"?

Instead the wording leaves wiggle room for the possibility of using multiple.

em-bee · 2024-07-25T17:38:11

it would still be lying by omission at least

hn_go_brrrrr · 2024-07-25T07:13:25

Given the piles of spammy shit on Google these days, I'm wondering if "doesn't have its own index of the full web" is actually a competitive advantage.

WarOnPrivacy · 2024-07-24T17:06:45

> And Yandex isn't much better for non cyrillic search,

I like Yandex when I'm rabbit-holing after obscure musicians/music. I routinely have a better experience than I do with DDG or Kagi or Goog.

freediver · 2024-07-24T23:27:46

Kagi also uses Yandex index so it would be unusual that something exists in Yandex and not in Kagi results.

MichaelZuo · 2024-07-24T17:33:59

It's also vastly better for finding livejournal blogs.

darreninthenet · 2024-07-24T22:27:20

I believe Kagi has its own crawler as well and it merges all the results and does whatever Kagi does behind the scenes to show the mix

tdeck · 2024-07-25T05:00:51

Aside: Does anyone know how the GBY term became a thing and why it includes Yandex but not Baidu?

Yawrehto · 2024-07-24T18:16:37

Doesn't it list three major ones, Google, Bing, and Yandex, plus Mojeek and a few other small ones? That's a bit more than two.

McDyver · 2024-07-24T16:30:52

That seems like the business model for streaming. You subscribe to X provider to watch Y series. So, as for streaming, I suppose a pirate bay search engine will come up

toomuchtodo · 2024-07-24T16:45:24

Pirate Bay is probably not the most optimal analogy, more like Anna's Archive imho [1], individually offered by web property scrape runs compressed into a package, maybe served by torrents like this Academic Torrents site example [2].

Scraper engine->validation/processing/cleanup->object storage->index + torrent serving is a rough pipeline sketch.

[1] https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu... ("HN Search: annas archive")

[2] https://academictorrents.com/details/9c263fc85366c1ef8f5bb9d... ("AcademicTorrents: Reddit comments/submissions 2005-06 to 2023-12 [2.52TB]")

gtirloni · 2024-07-24T19:17:21

> but this seems like the beginning of that world.

It's not the beginning, it's mere continuation.

Walled gardens have existed since the AOL days. They deteriorate over time but it doesn't prevent companies from trying (each time, in bigger attempts).

aAaaArrRgH · 2024-07-25T05:09:20

> but possibly could lead to the old, cool, hobbyist & un-monetized web being reinvented?

It still exists. It just isn't that popular.

splwjs · 2024-07-24T16:42:51

idk man i bet you five bucks and a handshake it's just going to play out like the existing startup grift.

There's an established player with institutional protections, then a scrappy upstart takes a bunch of VC money, converts it into runway, gives away the product for free, gradually replaces and becomes the standard, then puts out an s-1 document saying "we don't make money and we never have, want to invest?" and then they start to enjoy all the institutional protections. Or they don't. Either way you pay yourself handsomely from the runway money so who cares.

The upstart gets indexed and has an API, the established player doesn't.

The upstart is more easily found and modular but the institutional player can refuse to be indexed to own their data and they can block their API to prevent ai slop from getting in and dominating their content.

StrauXX · 2024-07-24T16:58:37

IANAL but as far as I understand the current legal status (in the US) a change in robots.txt or terms and conditions is not binding for web scrapers since the data is publicly accessible. Neither does displaying a banner "By using this site you accept our terms and conditions" change anything about that. The only thing that can make these kinds of terms binding is if the data is only accessible after proactively accepting terms, for instance by restricting the website until one has created an account. LinkedIn lost a case against a startup scraping and indexing their data because of that a few years ago.

qingcharles · 2024-07-24T18:52:11

At the federal level; but states have their own laws. For instance, it can get you 5 years in prison in Illinois to violate a web site ToS.

https://www.ilga.gov/legislation/ilcs/ilcs4.asp?DocName=0720...

redcobra762 · 2024-07-24T19:03:04

Has anyone ever successfully been prosecuted for violating this statute?

qingcharles · 2024-07-25T00:03:42

I don't know. That data is really hard to put together.

quink · 2024-07-25T00:27:17

That’s a blink and you miss it joke of the highest calibre - well done!

jpalomaki · 2024-07-24T18:13:26

Quite sure they are also enforcing these with some technical measures to limit scraping.

renlo · 2024-07-24T18:44:03

As was LinkedIn, who was forced to stop rate limiting / IP-banning scrapers for public pages.

altdataseller · 2024-07-25T08:13:03

Many other websites enforce rate limiting or IP banning for public pages (using products like Cloudflare). Why is this not legal?

therealdrag0 · 2024-07-25T04:18:27

Really? That seems strange.

wtf242 · 2024-07-24T16:40:07

This problem is only going to get worse. For my thegreatestbooks.org site I used to just get indexed/scraped by Google and Bing. Now it's like 50+ AI bots scraping my entire site just so they can train an LLM to answer questions my site answers without having a user ever visit my site. I just checked Cloudflare and in the past 24 hours I've had 1.2 million bot/automated requests.

sct202 · 2024-07-24T16:46:17

There's a new setting in Cloudflare to block AI/scraper bots. https://blog.cloudflare.com/declaring-your-aindependence-blo...

graeme · 2024-07-25T04:54:58

Anyone have any experience with this? Is there nothing but upside in blocking these bots?

account42 · 2024-07-25T13:44:06

Considering it's Buttflare, enabling it probably also means blocking random users. But of course that's not Buttflare's problem because it's not enabled by default.

jedberg · 2024-07-24T14:58:42

They changed robots.txt a month or so ago. For the first 19 years of life, reddit had a very permissive robots.txt. We allowed all by default and then only restricted certain poorly behaved agents (and Bender's Shiny Metal Ass(tm))

But I can understand why they made the change they did. The data was being abused.

My guess is that this was an oversight -- that they will do an audit and reopen it for search engines after those engines agree not to use the data for training, because let's face it, reddit is a for profit business and they have to protect their income streams.
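To make the difference between the two policies concrete, here is a small sketch using Python's standard `urllib.robotparser`. The "permissive" file is an illustrative reconstruction of the allow-by-default pattern described above, not Reddit's actual old robots.txt, and `examplebadbot` is a made-up agent name; the "restrictive" rules are the two lines quoted from the current file at the top of the thread.

```python
from urllib.robotparser import RobotFileParser

# Assumed shape of an allow-by-default policy: one named agent is blocked,
# everyone else is allowed (an empty Disallow means "nothing is disallowed").
PERMISSIVE = """\
User-agent: examplebadbot
Disallow: /

User-agent: *
Disallow:
"""

# The rules quoted from the current robots.txt: every agent is blocked.
RESTRICTIVE = """\
User-agent: *
Disallow: /
"""


def can_fetch(robots_txt, agent, url):
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)
```

Under the first file a crawler such as MojeekBot is allowed while the named agent is not; under the second, every crawler that honours robots.txt is shut out. Note this only models what the file says, not any server-side blocking.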

Closi · 2024-07-24T15:50:11

> But I can understand why they made the change they did. The data was being abused.

Depends how you see it - if you see it as 'their' data (legally true) or if you see it as user content (how their users would likely see it).

If you see it as 'user content', they are actually selling the data to be abused by one company, rather than stopping it being abused at all.

From a commercial 'let's sell user data and make a profit' perspective I get it, although it does seem short-sighted to decide to effectively de-list yourself from alternative search engines (guess they just got enough cash to make it worth their while).

Ajedi32 · 2024-07-24T16:03:11

> if you see it as 'their' data (legally true)

Is that actually true? Reddit may indeed have a license to use that data (derived from their ToS), but I very much doubt they actually own the copyright to it. If I write a comment on Reddit, then copy-paste it somewhere else, can Reddit sue me for copyright infringement?

jedberg · 2024-07-24T16:41:32

They own a non-exclusive worldwide right to it. You own the copyright, they have a license to use it however they see fit.

account42 · 2024-07-25T13:22:02

> They own a non-exclusive worldwide right to it.

They demand that right. That doesn't mean they actually have a right to use the content in ways that are not directly required for the operation of the website or that are otherwise surprising to the average user. Putting something in the TOS doesn't always make it a valid contract.

passwordoops · 2024-07-24T16:06:44

Enough cash or enough data on hand to show the majority of traffic comes from the search monopoly

ColinHayhurst · 2024-07-24T15:07:22

Person extensively quoted in the article here. They are welcome to reach out. But not a single person from any level did that, nor replied to my polite requests to explain and engage. We first contacted them in early June and by 13th June, I had escalated to Steve Huffman @spez.

toomuchtodo · 2024-07-24T17:48:17

An acquaintance investigating Reddit's moderation mechanization inquired how a major subreddit was moderated after an Associated Press post was auto removed by automod. They were banned from said sub. They inquired why they were banned, and they shared they would share any responses with a journalism org (to be transparent where any replies would be going, because they are going to a journalism org). They were muted by mods for 28 days and were "told off" in a very poor manner (per the screenshots I've seen) by the anonymous mod who replied to them. They were then banned from Reddit for 3 days after an appeal for "harassment"; when they requested more info about what was considered harassment, they were ignored. Ergo, inquiring as to how the mods of a major sub are automodding non-biased journalism sources (the AP, in this case) without any transparency appears to be considered harassment by Reddit. The interaction was submitted to the FTC through their complaint system to contribute towards their existing antitrust investigation of Reddit.

Shared because it is unlikely Reddit responds except when required by law, so I recommend engaging regulators (FTC, and DOJ at the bare minimum) and legislators (primarily those focused on Section 230 reforms) whenever possible with regards to this entity. They're the only folks worth escalating to, as Reddit's incentives are to gate content, keep ad buyers happy, and keep the user base in check while they struggle to break even, sharing as little information publicly as possible along the way [1] [2].

[1] https://www.bloomberg.com/news/articles/2024-05-09/reddit-la... | https://archive.today/wQuKM

[2] https://www.sec.gov/edgar/browse/?CIK=1713445

account42 · 2024-07-25T13:38:33

> non-biased journalism sources

No such thing, and definitely not the AP.

JohnMakin · 2024-07-24T15:00:44

One (in this case, 2) company's incentive for profit should not take priority over the usability/well being of the internet as a whole, ever, and is exactly why we are where we are now. This is an absolutely terrible precedent.

BeetleB · 2024-07-24T15:53:28

I know people will hate to hear this, but Reddit is not important to the well-being of the Internet.

TeaBrain · 2024-07-24T16:03:39

I think it's the other way around, in that people don't like to hear how Reddit has become important due to the death of independent forums and the degree to which information has become concentrated on the site.

BeetleB · 2024-07-24T19:00:17

The death of independent forums has been greatly exaggerated.

Of all the forums I used to be active in, many are still active. The ones that died did so because the community died (i.e. they did not shift to Reddit and the like).

Reddit is great simply because it allowed anyone to create a community. No need to get a LAMP stack and deal with security vulnerabilities in your forum SW.

These days you have Lemmy and its ilk. Much higher barrier than the old LAMP stack, but also much superior to it. I do hope it takes off.

latexr · 2024-07-25T10:44:16

Independent forums, like RSS, are not dead. I use both every day.

account42 · 2024-07-25T13:26:45

GP was not claiming that it is.

jedberg · 2024-07-24T15:02:39

I agree with you in theory, but in practice someone has to pay for all this magic.

JohnMakin · 2024-07-24T15:08:42

This is a false dichotomy. You can have services, and not have them devolve into complete unusability in the name of profit. This isn’t sustainable either. The myopic pursuit of short term gains at the expense of the product will collapse at some point in the future, no matter how much you believe in this weird frog-boil internet we’ve inherited now.

twelve40 · 2024-07-24T16:11:41

Complete unusability is when ai tools clone the content and people stop visiting the original service and participating. I'll leave it up to them to defend blocking duck duck go for example, but blocking "AI" bots for an online community is a matter of survival at this point.

talldayo · 2024-07-24T16:51:56

Alternatively, it's because the base platform has also devolved into unusability. Both Reddit and Twitter are in a position where their info is easily scraped, and their community is barely worth the advertising/paid-premium experience they demand from you. As both platforms continue to decline in quality, you might not even need to replace the original service. Both businesses appear intent on getting replaced.

talldayo · 2024-07-24T16:09:09

> The myopic pursuit of short term gains at the expense of the product will collapse at some point in the future,

The myopic pursuit of short-term gains is the only playbook that works. Long-term business strategy is a gamble, and today's businesses have all learned that they'd rather make hay when the sun is shining than be remembered as a good business.

Twitter tried a long-term playbook to reverse their unprofitable sinkhole of a website. That ended up with them being undervalued and sold to the highest bidder.

latexr · 2024-07-25T10:53:44

> Twitter tried a long-term playbook to reverse their unprofitable sinkhole of a website.

From what I recall reading at the time, Twitter was finally becoming profitable before the sale (last two years? It’s hard to find a source now since every story since is about some shit show or other post sale).

> That ended up with them being undervalued and sold to the highest bidder.

You make it seem they were in dire straits and had to be sold for scraps, but that’s far from the case. They sold for more than their valuation to the only bidder because they understood what a good deal it was for them. They forced the buyer to not back out, after all.

account42 · 2024-07-25T13:28:15

People were paying for forums before Reddit came along.

ToucanLoucan · 2024-07-24T15:14:42

We did. As in we, the Internet, existed for a long time without anyone making money and we paid for the privilege. Websites were built and hosted at owner's expense, for years, with no expectation that they be financially rewarded. Sure some would run donation drives, or work with sponsors relevant to the community in question, but a whole ton, mine included, just cost me a lot of money over many years.

Those websites were definitely technically inferior, as the march of progress is unavoidable, but web hosting is cheaper than it's ever been. A VPS that utterly blows away what mine was capable of in 2007 for nearly a hundred a month can now be had for about $10 per month. Yet everyone wants these monolith platforms, but even that wouldn't be the worst thing ever, except that every one of these platforms has a backend to support that we in the Old Internet never did: a C-suite's worth of executives and millions of shareholders, who for some reason have decided that reddit can't exist unless reddit makes them reams and reams of money.

I'd be very, very interested to see how much of, even what's probably the most massive one of all, Facebook, is non-essential busywork that could easily be shut down tomorrow with no adverse effects to the platform. Firstly the entire executive class, just, they don't do shit to make Facebook the product. In fact I'd argue their decisions almost universally have made it worse as a product very consistently for its entire lifetime. Then, all the marketing people. There's just no goddamn reason to advertise Facebook (or reddit for that matter); the brand is so ubiquitous that if you actually found someone who'd never heard of it, I'd give you a large chunk of money. Add to that, if Facebook was doing a good job of being what it ostensibly is, then people immediately become the best advertising, because people want to hang with people in these digital spaces. Then get rid of the people working to make Facebook addictive with dark patterns. Then get rid of the entire targeted ad division, because it's gross and inhumane. Pare the company down to engineers who build the product, and if anything, expand the moderation team so they can actually ensure the safety of the platform, and of course the IT staff to back them. Now what does Facebook cost to operate?

As far as I'm concerned, this pearl-clutching about "well websites have to make money" is grossly, grossly overstated. Websites don't cost that much to run. A ton of money is being siphoned off by the MBA parasites playing games in Excel all day. A ton more is being wasted developing features that advertisers want and users hate. A ton more is being funneled into making products artificially addictive to vulnerable people, to exploit them, so let's just not do that. And of course, leadership, rewarding themselves with generous compensation packages they aren't even remotely able to justify. Now what does your website cost to maintain? Surely not nothing, and for websites of substantial size, it will still be high, but I'm willing to bet it's a hell, hell, hell of a lot less than it was before.

kjkjadksj · 2024-07-24T15:27:50

Part of the issue is that it isn’t just the web, but the inevitable american corporate shareholder model. Even businesses could be mom and pop ified and made way more popular overnight: quit raising prices and cutting corners and it would actually stand for itself like a massive $7 burrito. However the expectation is that shareholders get returns. Costs must be cut. Prices must be raised. Margins must be improved. It doesn’t matter if this eats the business alive, as shareholders are sufficiently leveraged. The whole system is incentivized to select for inferior quality and taking all the available money on the table.

ToucanLoucan · 2024-07-24T15:42:58

My rant above and your response reminded me of all those tons of MMO games out there that are ancient, with a tiny playerbase, that remain profitable nonetheless simply because if you have a product that people like using, putting it into maintenance mode and doing the bare minimum to keep it running is a perfectly valid business strategy. The companies that buy these service games and run them effectively just buy completed money printers and keep them operating. It's not going to make anyone rich probably, but it's a perfectly valid and profitable way to go about things.

The silicon valley "grow at all costs, always evolve and innovate forever" model is so detached from the reality of most businesses in my experience.

Suppafly · 2024-07-24T16:15:45

>The companies that buy these service games and run them effectively just buy completed money printers and keep them operating.

I hadn't really thought about that topic in that way before. Really explains why some of those older MMOs have no desire to really make any improvements, the owners are happy to just keep them powered up and collect a check but have no incentive to invest in making them better.

ToucanLoucan · 2024-07-24T16:24:22

I think the notion that sometimes things are just "done" is incredibly undervalued in our industry. Frankly I wish a ton of games I play would STOP updating.

Suppafly · 2024-07-24T19:58:47

>I think the notion that sometimes things are just "done" is incredibly undervalued in our industry.

I agree, but also the flip side is that things rapidly switch from 'done and working' to 'dead' pretty quickly if no one is willing to do minor maintenance.

isoprophlex · 2024-07-24T15:54:57

In biology, you'd call that a cancer. But to people praising the gospel of VC money, it's something desirable...

u8080 · 2024-07-24T16:52:16

Yeah, like Rockstar with GTA V Online.

lotsofpulp · 2024-07-24T17:48:37

>Websites don't cost that much to run.

Popular websites that allow user content to be uploaded or linked do cost that much to run, due to content moderation.

There might be a small (relatively) forum here and there that a few good moderators are willing to slave away at keeping clean, but you will never see a website that allows user content with as many users as Reddit/Youtube/Instagram/etc be cheap.

Although, due to AI, the cost to spam the small forums might be so small that even they might come into the crosshairs.

megaman821 · 2024-07-24T18:04:55

Although it is quite surprising that mainly text websites (Reddit, Twitter) are hard to run sustainably but video and image websites (YouTube, Instagram, TikTok) can because it is easier to sell ads against them.

account42 · 2024-07-25T13:36:03

> Popular websites that allow user content to be uploaded or linked do cost that much to run, due to content moderation.

Reddit outsourced most of its moderation to unpaid volunteers.

lotsofpulp · 2024-07-25T14:11:33

I am referring to moderation of child sexual abuse material and other legally problematic content. I assume volunteers do not handle that.

ToucanLoucan · 2024-07-25T14:56:39

I don't see how that would fall to different people in reddit's case. I'm sure reddit has some moderators on staff but the vast, vast majority of their moderation happens on the proverbial front lines, which is basically all volunteers. I would hope there's a dedicated abuse team at Reddit that are actually paid people whom the volunteers can kick the truly sick shit to so it can be properly dealt with, but given the corporate culture Reddit has shown over the years, I also wouldn't be awfully surprised if it's JUST down to the volunteers either.

meiraleal · 2024-07-24T15:33:50

how can we keep paying the ever-growing profits of multi-trillion dollar companies? This is insane.

jsnell · 2024-07-24T15:54:13

Reddit is 100x from being a trillion-dollar company, and is not profitable.

meiraleal · 2024-07-24T17:49:16

Reddit offers no magic, it's just a forum. Google used to do some magic decades ago and still profits from it.

ColinHayhurst · 2024-07-24T15:17:13

The blocks for MojeekBot, a Cloudflare-verified and respectful bot for 20 years, started before the robots.txt file changes. We first noticed in early June.

We thought it was an oversight too at first. It usually is. Large publishers have blocked us when they have not considered the details, but then reinstated us when we got in touch and explained.

ekidd · 2024-07-24T16:00:33

I personally feel that this kind of "exclusive search only by Google deal" should result in an anti-trust case against Google. This is the kind of abuse of monopoly power that caused anti-trust laws to be passed in the 1890s.

eddd-ddde · 2024-07-24T18:51:56

if i create a vacuum cleaner and decide to only sell it at Walmart you can't get mad at me for not wanting to sell it at costco

you can always buy a competitor's or make your own vacuum cleaner if you hate buying at Walmart

maybe what you are really mad about is Reddit monopolising content

ekidd · 2024-07-24T21:05:40

Usually, to trigger any kind of anti-trust law, you need to have massive market share. In this case, for example, Reddit almost certainly hasn't committed any antitrust violations, because they're a relatively minor player in their market.

Similarly, if you start a vacuum cleaner company, you can make whatever exclusive deals you want. But if you control 80% of the market for vacuum cleaners, then you might need to be more careful about leveraging your market share in unfair ways.

If a company is part of a robust, competitive market (like Reddit), it's usually wiser to let customers vote with their wallets, and leave the government out of it. If a company becomes massively dominant (like Google or TicketMaster), and if it starts pushing exclusive contracts, it's much harder for customers to switch away.

ffgjgf1 · 2024-07-25T06:37:56

Unless you’re deemed to be an unfair monopoly and abuse your position to harm consumers interests.

You don’t even need >90% market share for that to be the case. E.g. Standard Oil only controlled 64% of the US market at most, and it was still broken up.

fredgrott · 2024-07-24T15:04:00

the article quotes reddit policy change: Reddit considers search and ads commercial activities and thus subject to robots.txt block and exclusion.

account42 · 2024-07-25T13:18:53

Ah so when reddit uses user content for monetization it's ok but when others do it then it isn't? Reddit may want that double standard but I think the only thing they are going to achieve with this stunt is more people ignoring robots.txt.

EasyMark · 2024-07-25T04:13:10

How was it being abused? You still clicked on the information and saw the reddit ads. Now they won't get any of that from "rival" sites to google. I guess they figured the 60 million was more than that ad revenue. Seems greedy but I don't think it's illegal like others are suggesting.

ykonstant · 2024-07-24T15:21:18

It's ironic, because Reddit is the only search engine that works on Google now thanks to shittening.

maxwell · 2024-07-24T16:07:45

They're both running on fumes at this point.

riiii · 2024-07-24T22:13:28

Also sniffing them.

QVVRP4nYz · 2024-07-25T09:21:38

For years reddit's built-in search was broken (or at least broken) and people were forced to use 3rd parties like google, so we came full circle.

daft_pink · 2024-07-24T16:01:05

I don’t understand how this isn’t anti-competitive behavior. It seems like reddit has to offer this deal with similar terms to google’s competitors.

talldayo · 2024-07-24T16:06:25

They do offer that deal to others; a big news story was when OpenAI bought Reddit's data they were selling: https://openai.com/index/openai-and-reddit-partnership/

dathinab · 2024-07-24T16:22:06

yep, but for things which are "only" search engines it's not a viable offer. Only if you expect "big AI business value" from it does it make sense, maybe.

eddd-ddde · 2024-07-24T18:54:09

I don't see how this tracks at all. Companies can decide to only sell their products with some retailer if they want. You can't force them to make deals with other companies.

gtirloni · 2024-07-24T19:29:29

You certainly can in monopoly situations (which apparently this isn't the case).

Suppafly · 2024-07-24T16:06:42

Most business deals are anti-competitive in some way. What makes you think this specifically rises to the level where they'd legally have to offer similar terms to competitors?

daft_pink · 2024-07-25T13:29:34

I’m not sure. Maybe the angle is that Google is anti-competitive by signing an agreement that limits information to its rivals.

Being forced into using google services, because they are paying information companies to deal only with them seems like a disaster for the web.

carlosjobim · 2024-07-24T17:18:52

Why in the world would they have to do that? There are thousands of exclusive business-to-business deals being signed into action every second of the day.

dathinab · 2024-07-24T16:19:01

Worse, it doesn't even really "work" anymore, given how most searches are flooded with garbage SEO results and paid advertisements "basically" looking like search results (most times more garbage, not the results you are looking for; in the cases where it isn't, it quite often is along the lines of "Google's algorithm blackmailing companies to buy ads for users who want to find them through Google but wouldn't without ads").

I wonder if this might affect reddit, as in slowly kill its user base, especially when it comes to users providing (and often also looking for) high quality content, because who of such users would want to use google search?

john-radio · 2024-07-24T16:26:42

> Worse, it doesn't even really "work" anymore, given how most searches are flooded with garbage SEO results and paid advertisements "basically" looking like search results ...

I don't understand what you're saying. That's exactly why people append `site:reddit.com` to their searches in the first place, because those search results typically aren't like that.

wwweston · 2024-07-24T17:47:39

Or at least, reddit posts and comments that are content messaging / marketing (human or AI) fit in better with earnest and natural posts, so that they're more effective.

lmeyerov · 2024-07-24T18:02:05

FWIW, we inquired to the reddit sales team about paying for data sometime last year, as we do similar elsewhere for use cases like helping emergency responders, and even though they were launching the program and asking for customers... no email back. Nor on our second and I think third attempt.

I'm not sure what to make of that.

morkalork · 2024-07-24T18:20:00

How much were you willing to pay? Still, rude of them not to even discuss the issue. Every time I've gone to buy data, if I'm too small of a fish, vendors have always been happy to refer me to a reseller.

heisenbit · 2024-07-24T19:18:51

Certainly rude but also possibly legally problematic. If they were judged to be in a dominant position in a market and were found making deals with exclusivity then it can get expensive.

It all depends of course on what the market is. If one looks at reddit not as a whole but as a collection of niches then one could imho find niches where reddit has a dominant knowledge position.

lmeyerov · 2024-07-24T19:08:47

We do 4-6 figures/yr for providers which is normal in our world

An enterprise sales team with only 1 customer happens (e.g., Mozilla's search bar), but... That's surprising here, and scary as a sustainable & scalable business. Ignoring 5-6 figure/yr inquiries says a lot to me. In contrast, we did that same-day with Twitter without talking to anyone.

numbers · 2024-07-24T16:59:29

"Information is power. But like all power, there are those who want to keep it for themselves. The world’s entire scientific and cultural heritage, published over centuries in books and journals, is increasingly being digitized and locked up by a handful of private corporations." - Aaron Swartz (2008)

1vuio0pswjnm7 · 2024-07-25T01:24:10

"If you use Bing, DuckDuckGo, Mojeek, Qwant or any other alternative search engine that doesn't rely on Google's indexing and search Reddit by using "site:reddit.com," you will not see any results from the last week."

The veracity of this statement is questionable.

I found at least four web search engines not using Google's index that produced results from the last week.

Example: Recent eruption at Yellowstone Black Diamond Pool

https://www.ecosia.org/search?method=index&q=site:reddit.com...

https://search.brave.com/search?q=reddit.com+black+diamond+p...

https://api.yep.com/fs/2/search?client=web&gl=all&no_correct...

   POST /sp/search HTTP/1.0
   host: www.startpage.com
   content-length: 74
   content-type: application/x-www-form-urlencoded
   query=site:reddit.com black diamond pool&abp=-1&t=&lui=english&sc=&cat=web

At least for this example, I got the same desired result using Reddit site search.

https://old.reddit.com/search/?q=black+diamond+pool

If anyone has some good examples of search queries that I can test showing why a search engine must be used, please share.
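For anyone who wants to repeat the Startpage test from a script, here is a sketch that builds the same form-encoded POST with Python's standard library. The field names are copied from the raw request above; the `https://` scheme and the helper name `build_startpage_request` are assumptions, and whether Startpage accepts scripted requests is not something this shows, so the request is only constructed, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_startpage_request(query):
    """Build (but do not send) the form POST shown in the raw request above."""
    fields = {
        "query": query,
        "abp": "-1",
        "t": "",
        "lui": "english",
        "sc": "",
        "cat": "web",
    }
    # urlencode percent-encodes the colon and turns spaces into '+',
    # so the body differs byte-for-byte from the hand-written one.
    body = urlencode(fields).encode("ascii")
    return Request(
        "https://www.startpage.com/sp/search",  # scheme is an assumption
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )
```

Passing the result to `urllib.request.urlopen` would actually send it; the Content-Length header is filled in automatically at that point.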

1vuio0pswjnm7 · 2024-07-27T03:10:35

"Its search engine uses Microsoft's Bing's technology, with whom it has a long-term arrangement."

https://www.bbc.com/news/business-53922786

lopis · 2024-07-26T07:58:29

Ecosia does use Google or Bing, which you can select in the settings.

r_singh · 2024-07-24T15:18:48

I wonder how Aaron Swartz would react to this

geodel · 2024-07-24T15:23:54

My guess is he'd freak out once he'd hear that lawyers, law enforcement may get involved on this issue.

voisin · 2024-07-24T15:30:23

Makes sense that Google did this deal since their search quality tanked and they became a de facto front end UI for Reddit.

NoMoreNicksLeft · 2024-07-24T15:46:00

Up until 2016 (I think, +/- 1 year), if you could remember 3 uncommon words in a comment, you could find any reddit post instantly on Google. I'd want to follow up on a thread from weeks ago, and it was magic. Number one result. Then one day that just stopped working, and even adding site:*.reddit.com didn't fix it. At the time, I think, I didn't realize that it was mostly Google's fault, I thought maybe Reddit had changed their infrastructure so that it couldn't be crawled properly.

Google hasn't been a search engine in a long while, it's just an advertisement engine now.

dev1ycan · 2024-07-24T17:54:00

it's so bad it's crazy, you can legit not find stuff on the internet anymore, it's the same with youtube, I search something and get like 20 or so results and then everything else is hidden.

it started when youtube removed the ability to search for videos older than 5 years, if I had to guess? cost saving, have every old video in cheaper storage... but it sort of fragments youtube, every couple of years you only get newer content.

stuffoverflow · 2024-07-25T02:53:19

One day I was searching live videos of a local band from youtube and when sorting by upload date the oldest video was from 2010. I knew for a fact there had to be older videos so I got a youtube API key and searched via the API, ended up finding multiple videos starting from 2006. Learned that youtube is full of videos that are basically impossible to find with the regular search.
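A sketch of what such an API query might look like, assuming the YouTube Data API v3 `search.list` endpoint with its `order` and `publishedBefore` parameters. The comment doesn't say which call was actually used, so treat this as one plausible way to do it; `build_search_url` is a hypothetical helper and the API key is a placeholder.

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = "https://www.googleapis.com/youtube/v3/search"


def build_search_url(api_key, query, published_before, max_results=50):
    """Build a search.list URL for videos published before an RFC 3339 timestamp."""
    params = {
        "part": "snippet",
        "type": "video",
        "q": query,
        "order": "date",                      # newest first within the window
        "publishedBefore": published_before,  # e.g. "2010-01-01T00:00:00Z"
        "maxResults": max_results,            # the API caps this at 50 per page
        "key": api_key,
    }
    return SEARCH_ENDPOINT + "?" + urlencode(params)
```

Fetching the URL returns JSON; stepping `published_before` back year by year is one way to walk past the cutoff the regular search UI seems to impose.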

dev1ycan · 2024-07-25T03:56:55

One of my pastimes is looking up anime opening reactions for fun/to hear people listen to bands I enjoy that sometimes do anime ops. Searching that is so scary: you see the same 4 people, and every month or so you get 1 added or 1 removed. You can't tell me there's not more than 4 people on the internet that upload opening reactions... when anime is a billion+ dollar industry. If you know particular channels you can see daily uploads on plenty of channels that are simply search banned for whatever reason by the algorithm.

But yeah, the most outrageous one is older videos, I do believe the reason is that they are using some long term cloud storage that is cheaper for older videos so they removed the ability to search by date.

Additionally, I don't believe the API fully fixes it, because Bing has a wrapper for youtube and searches do not really vary

lopis · 2024-07-26T09:50:46

Youtube will show me 3 results, tops, that are relevant to my query, then start showing irrelevant recommendations.

LegitShady · 2024-07-24T21:57:37

"we noticed that since our search results had gotten so bad nobody can use them to find the things they want, people just kept adding "reddit" to search terms anyways, so we figured we might as well make it official and exclusive"

mutatio · 2024-07-24T16:07:26

It's funny in the context of Google's past motto of "don't be evil". I feel the right thing for Google here would have been to decline any deal regarding exclusivity, then Reddit wouldn't have pulled the trigger with its robots.txt update. The entire manoeuvre required both parties.

peddling-brink · 2024-07-24T16:11:44

Google should abandon its mission to “organize the world’s information” because doing so requires spending money for valuable data, and others might not want to spend that money?

roughly · 2024-07-24T15:44:53

Boy, the LLMs have really been an apocalypse moment for the web, haven’t they? Between hoovering up and monetizing every bit of content they can without any attribution or compensation and the absolute flood of mediocre generated content, they’ve really done in the last straggling remains of the open internet.

It’s not like everyone wasn’t already pulling the same grift, but quantity really does have a quality all its own.

imglorp · 2024-07-24T18:24:03

Of course, we have to be careful not to villainize a neutral tech. Instead let's call it what it is: unchecked capitalism and monopolistic behaviors.

Capitalism seems to work OK for the common good until you remove all the protections. LLMs provide a de facto monopoly for the owner, which must already be a near monopoly: they take vast resources to train, and only a giant corp can afford to buy all the content and provision enough resources to train one.

LLMs did not enshittify what's left of the internet; greed did.

latexr · 2024-07-25T11:16:25

On the one hand, you’re absolutely right. But on the other hand it’s not like it matters in practice. Isn’t most technology technically neutral? But it’s also made to be used by people, who can do so beneficially or detrimentally. Criticising a technology is a shorthand for criticising how it’s used.

synicalx · 2024-07-25T00:18:58

> Of course, we have to be careful not to villainize a neutral tech

This is a very good point IMO. If we're going to chastise LLMs we may as well give servers, switches, routers, fiber-optic cable, and silicon a bollocking as well, since that's ultimately what's facilitating all this.

latexr · 2024-07-25T11:24:33

No, those are not comparable. If someone criticises the electric chair, it’s not reasonable to defend it with “if we’re going to chastise the electric chair we may as well give wood, metal, chairs, and electricity a bollocking since that's ultimately what's facilitating all this”.

Things are more than the sum of their parts. If you have a ton of beneficial things which can be cobbled together into one bad detrimental thing, the existence of the latter does not remove the benefits of the former.

neilv · 2024-07-24T18:29:45

I'm concerned about this in multiple ways, but I could also see some positive fallout from it, if it sets precedents that help protect 'content' owners from AI goldrush companies just taking everything.

gtirloni · 2024-07-24T19:31:55

AI companies are the least of our worries in the Reddit situation. The fact that Reddit has full control of user-generated data gives them freedom to do as they please. I think this is the crux of today's issue.

AI companies like Google, Microsoft and OpenAI have deep pockets to 'unprotect' themselves from anything. The barrier to entry is for small AI companies and those aren't really making an impact currently.

lifestyleguru · 2024-07-24T14:46:02

I deeply regret every minute spent on and kilobyte of text contributed to reddit.

Ylpertnodi · 2024-07-24T14:50:37

I don't. There's nothing around that is similar...with the same traction. The various 'verses are variations on cat pics. I'm still looking, though.