
Sure, if you can do that, why not? I was also searching for a good explanation of I/Q data when I found this. I posted it because I found the page interesting, though I did not dive into the details. I'm more interested in the conceptual explanation than in the hands-on details.

If you could add another (potentially better) explanation, that could be beneficial for future learners (not necessarily limited to the HN community).

I hope this post won't be overlooked.

The piece makes a strong case that the coronavirus's spreading pattern is highly skewed. If true, this would explain several strange observations, e.g. why in some families nobody else gets infected even though one of the family members is.

And importantly, this article contains some great science education with real-world examples for key probability theory concepts.

I have a similar experience, though I stayed on DDG. It's a bit funny, but I've built a habit of using !s. This means I use DDG to reach StartPage, which in turn uses Google.

There's an anecdote about Jim Gray: he was once asked by a company to help troubleshoot a performance issue in their database (this was in the 80s or 90s, I guess). Surprisingly, he did not start by looking at any code, but went directly to the server room and listened carefully for a while. Then he said what the problem was. (IIRC it was an issue with wrong/missing indexes.) They went to look at the code, and it turned out he was right. Everyone was astonished and wondered how he did it. He was "merely" listening to the sound of the spinning disks while the problematic queries were running.

Whether these stories are true or false, I love them.

Ok, I found the reference for this story [1]! It turns out I messed up some details, but the core of the story is true. (It was not a company but Alex Szalay [2] at JHU, and it was not an indexing but a layout issue.)

Jim asked about our "20 queries," his incisive way of learning about an application, as a deceptively simple way to jump-start a dialogue between him (a database expert) and me (an astronomer or any scientist). Jim said, "Give me your 20 most important questions you would like to ask of your data system and I will design the system for you." It was amazing to watch how well this simple heuristic approach, combined with Jim's imagination, worked to produce quick results.

Jim then came to Baltimore to look over our computer room and within 30 seconds declared, with a grin, we had the wrong database layout. My colleagues and I were stunned. Jim explained later that he listened to the sounds the machines were making as they operated; the disks rattled too much, telling him there was too much random disk access. We began mapping SDSS database hardware requirements, projecting that in order to achieve acceptable performance with a 1TB data set we would need a GB/sec sequential read speed from the disks, translating to about 20 servers at the time. Jim was a firm believer in using "bricks," or the cheapest, simplest building blocks money could buy. We started experimenting with low-level disk IO on our inexpensive Dell servers, and our disks were soon much quieter and performing more efficiently.

[1] https://cacm.acm.org/magazines/2008/11/549-jim-gray-astronom...

[2] https://en.wikipedia.org/wiki/Alex_Szalay

Another classic you’ve probably already read but linking it in case you haven’t:


I can't resist posting this in response to "I want my MTV":


Yep. That was intentionally put at song start :)

I can't access the article, but there was a recent ransomware attack on Canon [1]. I wonder if this could be related to it.

[1] https://news.ycombinator.com/item?id=24185734

I think they claim that this was a coding error and not a ransomware case.

The timing is convenient though.

It seems there's a surge in ransomware attacks, or at least there are more cases in the news. See e.g. Garmin [1] or CWT [2], both hacked recently.

[1] https://news.ycombinator.com/item?id=23926289

[2] https://news.ycombinator.com/item?id=24013580

There's a related anecdote about John von Neumann: he used to joke that he had superpowers and could easily tell truly random and pseudo-random sequences apart. He asked people to sit down in another room, generate a 0/1 sequence via coin flips, and record it. Then they had to write down another sequence from their head, trying to mimic randomness as closely as possible. When people finally showed the two sequences to him, von Neumann could instantly declare which one was which.

People were amazed.

The trick he used was based on the "burstiness" rule you describe: a long enough random sequence will likely contain a long homogeneous block, while humans tend to avoid long streaks of the same digit because they do not feel random enough.

So all he did was check at a glance which of the two sequences contained the longest homogeneous block, and pick that one as the sequence generated via the coin flips.
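For anyone who wants to play with the trick, here is a minimal Python sketch of it. The "human" model is purely my assumption for illustration (fair flips, but a run is never allowed to exceed a fixed length), and the function names `humanlike_sequence` and `guess_coin_flips` are hypothetical, not taken from any source.

```python
import random


def longest_run(bits):
    """Length of the longest block of identical consecutive symbols."""
    best = current = 0
    previous = None
    for b in bits:
        # Extend the current block, or start a new block of length 1.
        current = current + 1 if b == previous else 1
        best = max(best, current)
        previous = b
    return best


def humanlike_sequence(n, max_run, rng):
    """Assumed model of a person: fair flips, but no run longer than max_run."""
    bits = []
    run = 0  # length of the block that currently ends the sequence
    for _ in range(n):
        b = rng.randint(0, 1)
        if bits and b == bits[-1] and run >= max_run:
            b = 1 - b  # "that streak looks too long", so break it
        run = run + 1 if bits and b == bits[-1] else 1
        bits.append(b)
    return bits


def guess_coin_flips(seq_a, seq_b):
    """Guess which sequence is the real coin: the one with the longer run."""
    return "a" if longest_run(seq_a) >= longest_run(seq_b) else "b"


if __name__ == "__main__":
    rng = random.Random(0)
    coin = [rng.randint(0, 1) for _ in range(200)]
    human = humanlike_sequence(200, max_run=3, rng=rng)
    print(longest_run(coin), longest_run(human), guess_coin_flips(coin, human))
```

The guess can of course be wrong for short sequences. The heuristic only becomes reliable once the sequences are long enough for real coin flips to produce streaks a person would not write down.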

That's a cool anecdote :-) I wouldn't say it uses concentration of measure exactly, but I see how it is related. The anecdote is about asymptotic properties of random sequences, and concentration of measure is about the same too. In this case, I think you can show that homogeneous blocks of length log(n) - log log(n) occur at least with constant probability as n gets large. In other words, the length of homogeneous blocks is basically guaranteed to grow with n. I suppose a human trying to generate a random sequence will prevent homogeneous blocks above a certain constant length from appearing regardless of the length of the sequence, which would make distinguishing the sequences for large n quite easy!
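The growth with n is easy to check empirically. Below is a rough Python sketch that averages the longest homogeneous block over a number of simulated fair-coin sequences and prints it next to log2(n) for comparison. The trial count and the seed are arbitrary choices of mine, and the helper name `average_longest_run` is hypothetical.

```python
import math
import random


def longest_run(bits):
    """Length of the longest block of identical consecutive symbols."""
    best = current = 0
    previous = None
    for b in bits:
        current = current + 1 if b == previous else 1
        best = max(best, current)
        previous = b
    return best


def average_longest_run(n, trials, rng):
    """Mean longest-run length over `trials` fair-coin sequences of length n."""
    total = 0
    for _ in range(trials):
        flips = [rng.getrandbits(1) for _ in range(n)]
        total += longest_run(flips)
    return total / trials


if __name__ == "__main__":
    rng = random.Random(42)  # arbitrary seed, only for repeatability
    for n in (100, 1_000, 10_000):
        avg = average_longest_run(n, trials=50, rng=rng)
        print(f"n={n:>6}  log2(n)={math.log2(n):5.1f}  avg longest run={avg:5.1f}")
```

If the intuition above is right, the average should creep upward as n grows, roughly tracking log2(n), whereas a person who caps streaks at some fixed length would stay flat.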

I think there is also a quite strong connection in this anecdote to the information-theoretic notion of entropy, which takes us all the way back to the idea of entropy as in the article :-) Information-theoretically, the entropy of a long random sequence concentrates as well (it concentrates around the entropy of the underlying random variable). The implication is that with high probability, a sampled long random sequence will have an entropy close to a specific value.

Human intuition actually is somewhat correct in the anecdote, though! The longer the homogeneous substring, the less entropy the sequence has, and the less likely it is to appear (as a limiting example, the sequence of all 0s or all 1s is extremely ordered, but extremely unlikely to appear). I think where it breaks down is that there are sequences with relatively long homogeneous substrings with entropy close to the specific values (in the sense that the length is e.g. log(n) - log log(n) as in the calculation before), where the human intuition of the entropy of the sequence is based on local factors (have I generated 'too many' 0s in a row?) and leads us astray.

The original PNAS publication can be found at [0]. I'm copying (most of) the abstract here, since I think it's very nicely written:

"Plato envisioned Earth’s building blocks as cubes, a shape rarely found in nature. The solar system is littered, however, with distorted polyhedra—shards of rock and ice produced by ubiquitous fragmentation. We apply the theory of convex mosaics to show that the average geometry of natural two-dimensional (2D) fragments, from mud cracks to Earth’s tectonic plates, has two attractors: “Platonic” quadrangles and “Voronoi” hexagons. In three dimensions (3D), the Platonic attractor is dominant: Remarkably, the average shape of natural rock fragments is cuboid. When viewed through the lens of convex mosaics, natural fragments are indeed geometric shadows of Plato’s forms. Simulations show that generic binary breakup drives all mosaics toward the Platonic attractor, explaining the ubiquity of cuboid averages."

[0] https://www.pnas.org/content/early/2020/07/16/2001037117
