I just had a look through the "new" queue with showdead on, and out of the most recent 200 submissions, I counted 22 that were not dead. I wasn't seeing false positives, either. A very large fraction of those dead submissions seem to come from one very specific blogspot domain (this time they all seem to be an identical URL, even, except for a TLD swap in some cases).
Does HN not just implement blacklists for URL submissions from certain domains / matching a regex pattern? I get that the showdead option is there so people can vouch for stuff but that would/should realistically never happen in this case. Can't more obvious spam just be deleted directly?
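To make concrete what I mean by a domain/regex blacklist, here is a minimal sketch. The pattern list and the `is_blocked` helper are made up for illustration (including the `spam-example` hostname); I have no idea what HN actually runs on its side. It matches on hostname only, so a TLD swap would still be caught.

```python
import re
from urllib.parse import urlsplit

# Hypothetical patterns for illustration only; not HN's real rules.
# The trailing [a-z.]+ tolerates TLD swaps (.com, .co.uk, ...).
BLOCKED_HOST_PATTERNS = [
    re.compile(r"(^|\.)spam-example\.blogspot\.[a-z.]+$"),
]

def is_blocked(url: str) -> bool:
    """Return True if the URL's hostname matches any blocked pattern."""
    # hostname is None for URLs without a scheme/netloc; treat that as "".
    host = (urlsplit(url).hostname or "").lower()
    return any(p.search(host) for p in BLOCKED_HOST_PATTERNS)
```

Something like this would run at submission time, before the item ever reaches the "new" queue.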
Also, how did HN become such a target for this? I would think that the audience here is generally savvy enough to avoid scams, and that having things linked here is not as beneficial for SEO as many other sites with UGC.
The (1) model sucks (AUC-ROC maybe 0.6) and the (2) model is better (AUC maybe 0.7), but the (3) model got an AUC pushing 0.98, which seemed unreasonably high.
My mental model of "[dead]" was that it happens to articles that get popular but are about politics or some other bad subject. What I found, though, is that HN gets bursts of spam like the one you're experiencing. With the system I had, (i) the same headline would show up [dead] a large number of times and (ii) the same headline would show up in the train, eval, and test data sets, so of course the system got an unreasonably high score for [dead]. That's how I learned that HN gets these spam waves.
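One way to avoid that kind of leakage is to assign splits by headline rather than by row, so every copy of a repeated headline lands in the same split. This is a sketch of the idea, not what my system did at the time; `normalize`, `assign_split`, and the 10%/10% fractions are assumptions made up for the example.

```python
import hashlib

def normalize(title: str) -> str:
    """Lowercase and collapse whitespace so near-identical headlines group together."""
    return " ".join(title.lower().split())

def assign_split(title: str, eval_frac: float = 0.1, test_frac: float = 0.1) -> str:
    """Deterministically map a headline to 'train', 'eval', or 'test'.

    The split depends only on the normalized headline, so duplicates
    can never straddle two splits.
    """
    digest = hashlib.sha256(normalize(title).encode("utf-8")).digest()
    # Turn the first 8 bytes of the hash into a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    if bucket < test_frac:
        return "test"
    if bucket < test_frac + eval_frac:
        return "eval"
    return "train"
```

Deduplicating the headlines outright before splitting would also work; hashing just keeps the assignment stable if more data is added later.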