zlacker

A statistical approach to model evaluations

submitted by RobinH+(OP) on 2024-11-23 12:37:09 | 66 points 49 comments
[view article] [source] [go to bottom]

NOTE: showing posts with links only | [show all posts]
◧◩
3. godels+5Ke[view] [source] [discussion] 2024-11-29 21:26:36
>>fnordp+Are
As an ML researcher who started in physics (this seems common among physics/math-turned-ML people, Evan included), I cannot tell you how bad it is... One year at CVPR, when diffusion models hit the scene, I was asking people what their covariance was (I had overestimated the model complexity), and the most common answer I got was "how do I calculate that?" People do not understand things like what "pdf" means. People at top schools! I've been told I'm "gatekeeping" for saying that you should learn math (what I actually say is "you don't need math to build good models, but you do need it to understand why they're wrong"). Not that you need to, but that you should. (I guess this explains why Mission Impossible Language Models won best paper...)

I swear, the big reason models are black boxes is that we _want_ them to be. There's a clear sentiment against people doing theory, and the results show it. I remember not too long ago Yi Tay (under @agihippo, but main is @YiTayML) said "fuck theorists". I guess it's not a surprise DeepMind recently hired him after that "get good" stuff.

Also, I'd like to point out that the author uses "we" but the paper has only one author. So may I suggest adding their cat as a coauthor? [0]

[0] https://en.wikipedia.org/wiki/F._D._C._Willard

◧◩◪
6. mturmo+sVe[view] [source] [discussion] 2024-11-29 23:11:30
>>godels+5Ke
The front matter in Vladimir Vapnik’s book “Statistical Learning Theory” (first edition published 1995) has this quote:

*

During the last few years at various computer science conferences, I heard reiteration of the following claim:

“Complex theories do not work; simple algorithms do.”

One of the goals of this book is to show that, at least in the problems of statistical inference, this is not true. I would like to demonstrate that in the area of science a good old principle is valid:

“Nothing is more practical than a good theory.”

*

It’s on page xii of the front matter at: https://link.springer.com/content/pdf/bfm:978-1-4757-3264-1/...

Vladimir was a friend during this time, and I think about this quote a lot with regards to ML tinkering.

◧◩◪◨
12. godels+Jpf[view] [source] [discussion] 2024-11-30 05:47:11
>>throwa+Tef

  > when the alternative is politely (better yet - excitedly) answering the question asked? You kind of are.
Gatekeeping is controlling access, not to be confused with hurdles. I'm more than happy to have more people in "the party." No one is being excluded in a way that isn't also true for any other field. You unfortunately need some level of expertise to be able to follow discussions between experts. But am I stopping you from getting that expertise? No, in fact I'm very happy to lend a hand! Those aren't gates, they're hurdles. You don't need a specific PhD or to go to a good school or anything; it's about the knowledge. If you need a helping hand to get over them, ask, because others may not know you're struggling, or may not be able to tell whether you're struggling fruitlessly or struggling as part of the process of improving.

But yes, hurdles exist and they are not bad. I sure as hell don't want someone who can't do calculus designing rocket engines. And you probably don't want a rocket engineer performing surgery on you. Call them what you will, but it's not a bouncer at the door telling you you're not "pretty enough", which is what "gatekeeping" generally refers to.

  > Talk is cheap.
Sure, but we actually know a lot more about the inner workings of networks than most people realize. They aren't transparent, but that doesn't mean they are completely opaque either.

But I have no idea how to respond to this comment. What I said was fairly broad and this response is broader. Are you asking for "proof"? Of what? Interpretability? Isn't the article proof of it to some degree? Or Golden Gate Claude?

  >  Did you not just tell people to "learn math" rather than help them to understand it yourself? 
No? I think you misunderstand. Mind you, this is Hacker News. Would you like some books for reference? A roadmap? If you have suggestions for how I should phrase my venting differently, I'm all ears. But it would feel out of left field to just drop a bunch of random books, and it takes a lot of words to explain how all these things connect. I've written many "walls of text" here and frankly, anything longer than a paragraph often gets skipped over. It's fine, it's HN after all.

  > you're both being about as toxic
Are you aware of the things I'm referencing? It seems like you are not. Given that, I think you should reserve your judgement and accusations until you know more about the context.

So I will add more context to clarify my complaints, for anyone interested. I specifically called out Mission Impossible Language Models[0], so what's that about? I suggest reading the paper. The authors create a continuum of impossible languages of increasing difficulty, the hardest being a random word ordering. The claim is that LLMs don't learn impossible languages as well as natural languages. It's fairly easy to understand the error in this work. They use perplexity, which is built on "surprisal": you condition on the previous words and calculate how likely the next word is. But perplexity doesn't tell you that the model didn't learn the language, or didn't learn it efficiently. The metric isn't going to work for a one-to-one comparison with a structured language, because there is naturally more entropy in the impossible language: there are more words that are equally likely to come next. It's comparing coin flips to dice throws.
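
To put a number on the coin-flips-versus-dice point, here is a tiny sketch (plain Python; the counts are made up for illustration): even a model that has learned a language perfectly has a higher perplexity floor when more continuations are equally valid.

  import math

  def perplexity_floor(k):
      # With k equally likely valid continuations, a *perfect* model assigns
      # p = 1/k to each, so the lowest achievable perplexity is exp(log k) = k.
      return math.exp(-math.log(1.0 / k))

  print(perplexity_floor(2))   # "coin flip": structured language, ~2 valid next words -> 2.0
  print(perplexity_floor(24))  # "dice throw": shuffled language, many valid next words -> 24.0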

Let's use an example: our target sentence will be "My name is godelski." In a random-shuffle language there are 4! (24) ways to represent that sentence that are all __equivalent__. That's the key part. In natural language, all I can think of is 2 ("Godelski, my name is" being a highly unlikely alternative). So in natural language, if we have "My name" and are predicting the next word, "is" is pretty likely. But in the random language "is" is just as likely as "<name>". This isn't proof that the language wasn't learned, just that the language isn't structured: "My name is godelski" and "My name godelski is" are equivalent sentences in a random-ordering language. It actually gets even harder, because the tokenization was trained on natural word order; if you look at Table 1 you'll see how messy this gets (notice that "bookshelf" becomes the tokens " books" (space intentional), "he", "lf"). The picture gets clearer when you look at how they prepared the data: the text isn't reshuffled each time the model sees it, it is shuffled once and the model is trained on that. That is not the same as the random language, and unless you're really lucky, certain patterns will be more common than others, which just makes things harder for the model. The dataloader should shuffle sentences, which would teach the model to ignore those patterns. You should also measure perplexity against all valid predictions, not a single one. That last one is a killer for me.
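
A rough sketch of the two fixes I have in mind (hypothetical helpers, not the paper's code): reshuffle at load time instead of once up front, and score against every valid continuation rather than the single ordering that happened to land in the training file.

  import random

  def shuffled_view(sentence_tokens):
      # Draw a fresh permutation every time the model sees the sentence,
      # instead of shuffling once and baking one arbitrary ordering into the data.
      tokens = list(sentence_tokens)
      random.shuffle(tokens)
      return tokens

  def valid_next_tokens(prefix, sentence_tokens):
      # In a random-order language, any token not yet used is an equally valid
      # continuation, so perplexity should be scored against all of them.
      remaining = list(sentence_tokens)
      for tok in prefix:
          remaining.remove(tok)
      return set(remaining)

  sentence = ["My", "name", "is", "godelski"]
  print(shuffled_view(sentence))                      # e.g. ['is', 'My', 'godelski', 'name']
  print(valid_next_tokens(["My", "name"], sentence))  # {'is', 'godelski'}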

Side note:

  > fear of being judged:
You're always going to be judged. Stand up for yourself and what you believe in. Don't be afraid of being wrong either. Truth is, you're always wrong. It's a matter of degree. The problem isn't being wrong, it is not being able to change your mind. Even if things get heated between people, there typically isn't ill will between them if they believe the other person is capable of changing their mind.

Clearly, you have judged me quite harshly.

[0] https://arxiv.org/abs/2401.06416

◧◩◪◨
13. godels+Mqf[view] [source] [discussion] 2024-11-30 06:05:27
>>mturmo+sVe
I haven't had a chance to read that, but that quote suggests I should (especially considering the author and the editors).

I often refer to "Elephant Fitting" w.r.t. these systems. I suspect you understand this, but I think most people take it to be just about overfitting. The real problem isn't the number of parameters, it's that parameters need to be justified, as explained by Dyson here[0]. Vladimir's quote really reminds me of this. Fermi was likewise stressing the importance of theory.
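
For anyone who hasn't seen the Fermi point in action, a throwaway illustration (NumPy, toy data I made up): with as many free parameters as data points you can match anything exactly, which is exactly why a good fit by itself justifies nothing.

  import numpy as np

  rng = np.random.default_rng(0)
  x = np.linspace(0.0, 1.0, 5)
  y = rng.normal(size=5)  # pure noise; there is nothing here to explain

  # A degree-4 polynomial has 5 free parameters, so it passes through all
  # 5 points exactly -- a "perfect" fit with zero justification behind it.
  coeffs = np.polyfit(x, y, deg=4)
  print(np.allclose(np.polyval(coeffs, x), y))  # True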

I think it is a profound quote, and you were (are?) lucky to have that friendship. I do think abstraction is at the heart of intelligence. François Chollet discusses it a lot, and he's far from alone; it seems to be well agreed upon in the neuroscience and cognitive science communities. I think this is essential to understand on our path toward developing intelligent systems, because there are plenty of problems that need to be solved for which there is no algorithmic procedure and no explicit density function: intractable problems, doubly intractable problems, and worse. Maybe we're just too dumb, but it's clear there are plateaus where luck is needed to advance. I do not believe our current machines would be capable of closing that gap.

[0] https://www.youtube.com/watch?v=hV41QEKiMlM

◧◩◪◨
16. aspenm+Cyf[view] [source] [discussion] 2024-11-30 08:34:41
>>abhgh+Rrf
Is a preprint of your paper available?

I looked at your blog a bit and was able to find this, which may be it?

> Learning Interpretable Models Using Uncertainty Oracles

https://arxiv.org/abs/1906.06852

https://doi.org/10.48550/arXiv.1906.06852

◧◩◪◨⬒
17. abhgh+6zf[view] [source] [discussion] 2024-11-30 08:44:12
>>aspenm+Cyf
Yes, that's the one: https://arxiv.org/pdf/1906.06852
◧◩◪◨⬒⬓⬔
19. abhgh+fAf[view] [source] [discussion] 2024-11-30 09:00:35
>>aspenm+nzf
Yes, it did come up during my defense, but it was deemed not to be a concern since I had one prior paper [1] (the original one in this line of work; the paper I linked above was an improvement over it), and my advisor (a co-author on both papers) vouched for the quality of the work.

Thank you for pointing out the typo - will fix it!

[1] https://www.frontiersin.org/journals/artificial-intelligence...

◧◩◪◨⬒
31. abhgh+seh[view] [source] [discussion] 2024-12-01 07:17:37
>>joshjo+Nch
Thanks, yes, a lot of good ideas in ML seem to be slowly vanishing from the collective awareness. I have nothing against the current spate of methodologies, which are empirically great - and if one needs proof, I am a "happy customer" at my day job, which is mostly DL and a lot of LLMs - but it seems we are buying into a world where it is one versus the other. And it need not be. Great ideas are great ideas irrespective of age, and there is value in preserving them.

Anyway, since this thread surprisingly evoked a mini-discussion on Dirichlet Processes (DP), if someone needs an intro, I have tried to balance math and intuition in a description in my thesis: Section 2.2 in [1].

[1] https://drive.google.com/file/d/1zf_MIWyLY7nxEr5UioUQ7KhOQ1_...

EDIT: I looked at the description and I confess it still has a lot of math (since it is part of a thesis). I will probably translate it into something friendlier and put it on my blog.
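
In the meantime, a bare-bones stick-breaking sketch (NumPy, truncated for illustration; definitely not the thesis treatment) in case it helps build intuition for where DP mixture weights come from:

  import numpy as np

  def stick_breaking_weights(alpha, truncation, seed=0):
      # Break a unit-length stick: each beta_k ~ Beta(1, alpha) takes a fraction
      # of whatever stick remains, yielding the DP's (truncated) mixture weights.
      rng = np.random.default_rng(seed)
      betas = rng.beta(1.0, alpha, size=truncation)
      remaining = np.concatenate(([1.0], np.cumprod(1.0 - betas)[:-1]))
      return betas * remaining

  w = stick_breaking_weights(alpha=2.0, truncation=20)
  print(w.round(3), w.sum())  # weights decay; the sum approaches 1 as truncation grows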

◧◩◪◨⬒⬓⬔⧯
37. foobar+wdi[view] [source] [discussion] 2024-12-01 20:44:59
>>canjob+KPg
As I said in another comment, the only relevant synthetic languages that would refute Chomsky's claim are the ones we have human experiments for, specifically those of Moro.

I believe the relevant papers are referenced on page 4 here (Tettamanti et al., 2002; Musso et al., 2003; Moro, 2016):

https://acesin.letras.ufrj.br/wp-content/uploads/2024/02/Mor...

[go to top]