
> Sometimes, the networks instead find what the researchers call the “pizza” algorithm. This approach imagines a pizza divided into slices and numbered in order. To add two numbers, imagine drawing arrows from the center of the pizza to the numbers in question, then calculating the line that bisects the angle formed by the first two arrows. This line passes through the middle of some slice of the pizza: The number of the slice is the sum of the two numbers.

I feel like I'm being really dumb here, but it's not obvious to me that this works. Since this is a pizza analogy, take, for example, 1 + 1 mod 8 = ?. I don't see how the algorithm as set out in the paper can properly be described this way.

All of the following sums produce bisectors on the same line through the origin:

8+2 1+1 2+8 3+7 4+6 5+5 6+4 7+3

And the set of lines through the origin is double-covered by the output embeddings, which means there are twice as many slices in the output embedding as in the input embedding.

So it's not that the 1+1 bisector points at 1 and so the sum is 1. It points at 2 because of the double cover.

This double cover is easier to see if you examine step 2.2.

When a = b, the leading |cos| term is 1 and the vector is roughly (cos(2a), sin(2a)). Now let a = b go from 0 to 2π: the argument 2a varies from 0 to 4π, and you'll see the embedding loop around the circle twice.

The outside of the circle is labelled twice, with 12 at both top and bottom. The line through the origin picks out this labelling, not the input embedding.
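To make the same-line claim concrete, here is a quick plain-Python check using the idealized pizza angles (slice k sitting at angle 2πk/p), not the network's actual learned embeddings:

```python
import math

p = 8  # modulus, matching the 1 + 1 mod 8 example above

def bisector_line(a, b):
    """Direction of the bisector of the arrows to a and b, reduced mod pi.

    Reducing mod pi identifies opposite directions, i.e. it gives the
    *line* through the origin rather than a single arrow.
    """
    angle = (2 * math.pi * a / p + 2 * math.pi * b / p) / 2
    return round(angle % math.pi, 9)

# every pair below sums to 2 (mod 8)
pairs = [(8, 2), (1, 1), (2, 8), (3, 7), (4, 6), (5, 5), (6, 4), (7, 3)]
lines = {bisector_line(a, b) for a, b in pairs}
print(len(lines))  # 1 -- all eight bisectors lie on one line through the origin
```

The bisector arrows themselves point in two opposite directions (some at π/4, some at 5π/4), which is exactly why the output side needs the double cover to pick the right end of the line.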

Yep. Computers are just binary-state machines. Everything you can do with a computer, you could do with water channels and pulleys and gates (if you could assemble trillions and trillions of them). No one would ask whether the water-channel system 'groks' things, but because they are miniaturized and (to 99.9999% of people) esoteric in their actual workings, people treat them like magic.

Clarke was right, but sadly it doesn't take alien technology for people to start anthropomorphizing computers or thinking that they have magically transcended mere (though obviously very complexly arranged) logic gates; it just takes a GUI that spits out algorithmically generated pictures and text.

> No one would ask whether the water-channel system 'groks' things

I would. It's very common to describe the flow of electricity as similar to the flow of water. If it's electricity in my brain that allows me to understand, why couldn't there be an analogous system involving water which also understands?

Any substrate which supports the necessary logical operations ought to be sufficient. To believe otherwise seems needlessly anthropocentric.

> Any substrate which supports the necessary logical operations ought to be sufficient. To believe otherwise seems needlessly anthropocentric.

This makes sense. Anything that performs logic using electricity is conscious and comparable to the human brain. This is obviously true because there is no word or concept for anthropomorphizing.

I thought the article did a very good job of explaining why the term 'grokking' was used to describe this emergent behavior, and how it fit Heinlein's original definition. I'm curious which part of their explanation you feel is incorrect.

> Automatic testing revealed this unexpected accuracy to the rest of the team, and they soon realized that the network had found clever ways of arranging the numbers a and b. Internally, the network represents the numbers in some high-dimensional space, but when the researchers projected these numbers down to 2D space and mapped them, the numbers formed a circle.

So the article mentions "regularization" as the secret ingredient for getting to a generalized solution, but they don't explain it. Does someone know what that is? Or is it an industrial secret of OpenAI?

Regularization as a concept is taught in introductory ML classes. A simple example is called L2 regularization: you include in your loss function the sum of squares of the parameters (times some constant k). This causes the parameter values to compete between being good at modeling the training data and satisfying this constraint--which (hopefully!) reduces overfitting.
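For the curious, the L2 idea fits in a few lines of plain Python: a toy one-parameter model fit by gradient descent, with the k * (sum of squared parameters) term folded into the gradient. Data and constants here are made up for illustration.

```python
# Toy L2 regularization: fit y ~ w*x by gradient descent, with the penalty
# k * w**2 added to the squared-error loss.  Data and constants are made up.
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]  # (x, y) pairs, roughly y = 2x
k = 0.1    # regularization constant
lr = 0.01  # learning rate
w = 0.0

for _ in range(2000):
    # d/dw of  sum((w*x - y)**2) + k * w**2
    grad = sum(2 * (w * x - y) * x for x, y in data) + 2 * k * w
    w -= lr * grad

print(w)  # slightly smaller than the unregularized least-squares solution
```

The penalty pulls w toward zero, so the fit trades a bit of training accuracy for smaller parameters, which is the whole anti-overfitting bet.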

The specific regularization techniques that any one model is trained with may not be publicly revealed, but OAI hardly deserves credit for the concept.

NNs (indeed, all statistical fitting algorithms) have no relevant properties here: the properties derive just from the structure of the dataset. Here (https://arxiv.org/abs/2201.02177) NNs are trained on a 'complete world' problem, i.e., modular arithmetic, where the whole outcome space is trivial in size and abstract, with complete information.

Why should NNs eventually find a representation of this tiny, abstract, fully representable outcome space after an arbitrary amount of training time? Well, they will do so, eventually, if this outcome space can be fully represented by sequences of conditional probabilities.
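To see just how "complete" this world is: for addition mod 97 (the task in the linked paper), every fact in the domain can be enumerated outright, something like this sketch:

```python
import random

# The whole "world" for addition mod 97: every question the network could
# ever be asked, enumerated outright.
p = 97
dataset = [((a, b), (a + b) % p) for a in range(p) for b in range(p)]
print(len(dataset))  # 9409 facts in total

# the grokking setup trains on a fraction of these and tests on the rest
random.seed(0)
random.shuffle(dataset)
train = dataset[: len(dataset) // 2]
held_out = dataset[len(dataset) // 2:]
```

There is no long tail and no noise; the only question is whether the network interpolates the held-out half from the structure of the training half.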

There is nothing more to this 'discovery' than that some trivial abstract mathematical spaces can be represented as conditional probability structures. Is this even a discovery?

One has to imagine this deception is perpetrated because the peddlers of such systems want to pass off the structure of the problem as a property of NNs in general, and thereby say, "well, if you train NNs on face shapes, phrenology becomes possible!", i.e., as a way of whitewashing their broken, half-baked generative AI systems, whose problem domain isn't arithmetic mod 97.

I think you’re being unfair. I say that as someone who has done his share of charlatan hunting.

They explicitly list their contributions in the paper. They’re not saying they did something they didn’t. It’s not like that bogus "rat brain flies plane" paper that was doing something simple under the hood and then dressing it up as a world changing discovery in order to gain funding. They are doing something simple and studying it carefully. This is grade A science as far as I’m concerned, every bit as good as studying the motion of the stars and trying to find patterns.

It did not occur to me that NNs might be able to extrapolate a tiny training dataset into complete solutions. ML scientists are taught from an early age that you need to have a large dataset to make any progress. I don’t know whether this paper counts as a "discovery", but it was certainly a fun read, which was nice enough for me.

In some sense this paper is a proof of the idea that NNs can extrapolate, not merely memorize. This is in contrast to recent work where researchers have been claiming otherwise.

The weight decay study was also a nice touch. It’s no discovery to say that weight decay helps, but it’s a reminder to use it. I haven’t. Adam has always been reliable for me, and now I wonder if it was a mistake to shy away from weight decay. (We wanted to keep our training pipeline as simple as possible, in case any part of it might be causing problems. And lots of unexpected parts did.)
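In case it helps anyone: decoupled weight decay (the AdamW idea) is just a multiplicative shrink applied to the weights at every step, separate from the gradient update. A minimal sketch, with made-up constants:

```python
# Decoupled weight decay: shrink the weight directly each step instead of
# folding an L2 term into the gradient (the "W" in AdamW).
lr, wd = 0.01, 0.1   # learning rate and weight-decay strength (illustrative)
w = 5.0              # a single weight
grad = 0.0           # pretend the loss gradient is zero: decay acts alone

for _ in range(100):
    w = w * (1 - lr * wd) - lr * grad

print(w)  # the weight shrinks geometrically toward zero
```

In PyTorch this is one argument away from plain Adam (the weight_decay parameter of AdamW), so it costs essentially nothing to try in an existing pipeline.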

Now, I haven’t read the article submitted here, only the paper you linked. Maybe they’re claiming something more than the paper. But if so, then that is a (very real) problem with scientific journalism, and not necessarily the scientists themselves. It depends how much the scientists are leading or misleading the reporters. It’s important to separate the criticism of the work from the reporting around the work.

I’d also be curious if you have any citations for your claim that if an outcome space can be represented as a sequence of conditional probabilities, then NNs are guaranteed to find a solution after some unknown amount of training time. This is a surprising thing to me.

[1] https://news.ycombinator.com/item?id=40019217

[2] "From Words to Numbers: Your Large Language Model Is Secretly A Capable Regressor When Given In-Context Examples" https://arxiv.org/abs/2404.07544v1
