zlacker

[parent] [thread] 7 comments
1. runarb+(OP)[view] [source] 2025-12-06 05:43:57
Interpolation and generalization are two completely different constructs. Interpolation is when you have two data points and make a best guess at where a hypothetical third point between them should fall. Generalization is when you have a distribution which describes a particular sample, and you apply it, with some transformation (e.g. a margin of error, a confidence interval, a p-value), to a population the sample is representative of.
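
To make that concrete, something like this (toy numbers, made up on the spot):

    import numpy as np
    from scipy import stats

    # Interpolation: two known points, best guess at a third between them.
    y_guess = np.interp(5.0, [0.0, 10.0], [2.0, 6.0])  # -> 4.0

    # Generalization: a sample statistic, carried over to the population
    # with a margin of error (here a 95% confidence interval).
    sample = np.random.default_rng(0).normal(loc=50, scale=5, size=30)
    half_width = stats.t.ppf(0.975, df=len(sample) - 1) * stats.sem(sample)
    print(y_guess, (sample.mean() - half_width, sample.mean() + half_width))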

Interpolation is a much narrower construct than generalization. LLMs are fundamentally much closer to curve fitting (where interpolation is king) than they are to hypothesis testing (where samples are used to describe populations), though they certainly do something akin to the latter too.

The bias I am talking about is not a bias in the training data, but bias in the curve fitting, probably because of maladjusted weights, parameters, etc. And since there are billions of them, I am very skeptical they can all be adjusted correctly.

replies(1): >>adastr+f4
2. adastr+f4[view] [source] 2025-12-06 06:56:09
>>runarb+(OP)
I assumed you were speaking by analogy, as LLMs do not work by interpolation, or anything resembling that. Diffusion models, maybe you can make that argument. But GPT-derived inference is fundamentally different. It works via model building and next token prediction, which is not interpolative.

As for bias, I don’t see the distinction you are making. Biases in the training data produce biases in the weights. That’s where the biases come from: over-fitting (or sometimes, correct fitting) of the training data. You don’t end up with biases at random.
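
A toy version of that pipeline, with invented numbers: over-sample one group and the single fitted "weight" (here just a mean) inherits the skew.

    import numpy as np

    rng = np.random.default_rng(1)
    # Population: two groups in equal proportion, means 0 and 10,
    # so the true population mean is 5. The training sample
    # over-represents group A nine to one.
    sample = np.concatenate([rng.normal(0, 1, 900), rng.normal(10, 1, 100)])
    # The fitted parameter lands near 1.0, not 5.0: the data's skew
    # is now the model's skew.
    print(sample.mean())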

replies(2): >>runarb+V5 >>IsTom+8f
◧◩
3. runarb+V5[view] [source] [discussion] 2025-12-06 07:25:06
>>adastr+f4
What I meant was that what LLMs are doing is very similar to curve fitting, so I think it is not wrong to call it interpolation (curve fitting is a type of interpolation, but not all interpolation is curve fitting).

As for bias, sampling bias is only one of many types of bias. I mean, the UNIX program yes(1) has a bias towards outputting the string y despite not sampling any data. You can very easily and deliberately program a bias into anything you like. I am writing a kanji learning program using SRS and I deliberately bias new cards towards the end of the review queue to help users with long review queues empty them quicker. There is no data which causes that bias; you just program it in there.
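
Roughly like this (a hypothetical sketch, not my actual code):

    # Deliberate, data-free bias: push new cards to the back of the queue
    # so users with long backlogs clear their due reviews first.
    def order_queue(cards):
        # sorted() is stable: due cards keep their original relative order
        return sorted(cards, key=lambda card: card["is_new"])

    queue = [{"kanji": "水", "is_new": True}, {"kanji": "火", "is_new": False}]
    print(order_queue(queue))  # 火 (due) first, 水 (new) last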

I don’t know enough about diffusion models to know how biases can arise, but with unsupervised learning (even though sampling bias is indeed very common) you can get a bias because you are using wrong or maladjusted parameters, too many parameters, etc. Even the way your data interacts during training can cause a bias; heck, even by random chance one of your parameters can hit an unfortunate local maximum, yielding a maladjusted weight, which may cause bias in your output.

replies(1): >>adastr+Ji
◧◩
4. IsTom+8f[view] [source] [discussion] 2025-12-06 09:42:45
>>adastr+f4
> It works via model building and next token prediction, which is not interpolative.

I'm not particularly well-versed in LLMs, but isn't there a step in there somewhere (latent space?) where you effectively interpolate in some high-dimensional space?

replies(1): >>adastr+ci
◧◩◪
5. adastr+ci[view] [source] [discussion] 2025-12-06 10:19:37
>>IsTom+8f
Not interpolation, no. It is more like the N-gram autocomplete your phone used to use to make typing and autocorrect suggestions. Attention is not N-gram, but you can kinda think of it as being a sparsely compressed N-gram where N=256k or whatever the context window size is. It’s not technically accurate, but it will get your intuition closer than thinking of it as interpolation.
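
The phone version, for intuition (toy bigram, N=2):

    from collections import Counter, defaultdict

    # Count which word follows which in a tiny corpus.
    corpus = "the cat sat on the mat the cat ran".split()
    following = defaultdict(Counter)
    for prev, nxt in zip(corpus, corpus[1:]):
        following[prev][nxt] += 1

    # Autocomplete: suggest the most frequent continuation of "the".
    print(following["the"].most_common(1))  # [('cat', 2)]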

The LLM uses attention and some other tricks (attention, it turns out, is not all you need) to build a probabilistic model of what the next token will be, which is then sampled. This is much more powerful than interpolation.
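
The sampling step at the end is the simple part; all the power is in the model that produces the scores. Something like (logits invented):

    import numpy as np

    # Pretend the model just scored four candidate next tokens.
    tokens = ["dog", "cat", "ran", "sat"]
    logits = np.array([2.0, 1.5, 0.3, 0.1])

    # Softmax turns scores into a probability distribution,
    # which is then sampled rather than interpolated between.
    probs = np.exp(logits) / np.exp(logits).sum()
    print(np.random.default_rng().choice(tokens, p=probs))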

◧◩◪
6. adastr+Ji[view] [source] [discussion] 2025-12-06 10:28:52
>>runarb+V5
Training is kinda like curve fitting, but inference is not. The inference algorithm is random sampling from a next-token probability distribution.

It’s a subtle distinction, but I think an important one in this case, because if it were interpolation then genuine creativity would not be possible. But the attention mechanism results in model building in latent space, which then affects the next-token distribution.

replies(1): >>runarb+cY
◧◩◪◨
7. runarb+cY[view] [source] [discussion] 2025-12-06 16:58:29
>>adastr+Ji
I’ve seen both opinions on this in the philosophy of statistics. Some would say that machine learning inference is something other than curve fitting, but others (and I subscribe to this) believe it is all curve fitting. I actually don’t think which camp is right is that important, but I do like it when philosophers ponder these things.

My reason for subscribing to the latter camp is that when you have a distribution and you fit things according to that distribution (even when the fitting is stochastic, and even when the distribution lives in billions of dimensions), you are doing curve fitting.

I think the one extreme would be a random walk, which is obviously not curve fitting, but if you draw from any distribution other than the uniform distribution, say the normal distribution, you are fitting that distribution (actually, I take that back: the original random walk is fitting the uniform distribution).

Note I am talking about inference, not training. Training can be done using all sorts of algorithms; some include priors (distributions) and compute posteriors (also distributions), and those would be curve fitting. I think the popular stochastic gradient descent does something like this, so it would be curve fitting, but the older evolutionary algorithms just random-walk it and are not fitting any curve (except the uniform distribution). What matters to me is that the training arrives at a distribution, which is described by a weight matrix, and what inference is doing is fitting to that distribution (i.e. the curve).
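
To put numbers on "fitting to that distribution" (a stand-in distribution instead of a real weight matrix):

    import numpy as np

    rng = np.random.default_rng(2)
    # Training "arrived at" this distribution over three outcomes.
    learned = np.array([0.6, 0.3, 0.1])
    # Stochastic inference: draw from it many times...
    draws = rng.choice(3, size=10_000, p=learned)
    # ...and the empirical frequencies hug the learned curve.
    print(np.bincount(draws) / 10_000)  # ~ [0.6, 0.3, 0.1]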

replies(1): >>adastr+kv1
◧◩◪◨⬒
8. adastr+kv1[view] [source] [discussion] 2025-12-06 21:41:28
>>runarb+cY
I get the argument that pulling from a distribution is a form of curve fitting. But unless I am misunderstanding, the claim is that it is curve fitting / interpolation between the training datapoints. The probability distribution generated at inference is not based on the training data, though. It is a transform of the context through the trained weights, which is not the same thing. It is the application of a function to context. That function is (initially) constrained to reproduce the training data when presented with a portion of that data as context. But that does not mean that all outputs are mere interpolations between training datapoints.

Except in the most technical sense that any function constrained to meet certain input-output values is an interpolation. But that is not the smooth interpolation that seems to be implied here.
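
In code terms, the point is something like this (shapes and weights invented):

    import numpy as np

    rng = np.random.default_rng(3)
    vocab, d = 50, 8                     # invented sizes
    W_out = rng.normal(size=(vocab, d))  # stand-in for frozen trained weights

    def next_token_probs(hidden):
        # The distribution is a transform of the context's hidden state
        # through the weights; no training example is consulted here.
        logits = W_out @ hidden
        logits -= logits.max()           # numerical stability
        return np.exp(logits) / np.exp(logits).sum()

    hidden = rng.normal(size=d)          # stand-in for the encoded context
    print(next_token_probs(hidden).sum())  # 1.0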
