A statistical approach to model evaluations

>>RobinH+(OP)
This does feel a bit like an undergrad introduction to statistical analysis, and it's surprising anyone felt the need to explain these things. But I also suspect most AI people out there nowadays have limited math skills, so maybe it’s helpful?

>>fnordp+Are
As an ML researcher who started in physics (this seems common among physics/math people turned ML people, Evan included), I cannot tell you how bad it is... One year at CVPR, when diffusion models hit the scene, I was asking people what their covariance was (I had overestimated the model complexity), and the most common answer I got was "how do I calculate that?" People do not understand things like what "pdf" means. People at top schools! I've been told I'm "gatekeeping" for saying that you should learn math (I say "you don't need math to build good models, but you do to understand why they're wrong"). Not that you need to, but should. (I guess this explains why Mission Impossible Language Models won best paper...)

I swear, the big reason models are black boxes are because we _want_ them to be. There's a clear sentiment against people doing theory, and the results show. I remember not too long ago Yi Tay (under @agihippo but main is @YiTayML) said "fuck theorists". I guess it's not a surprise Deep Mind recently hired him after that "get good" stuff.

Also, I'd like to point out, the author uses "we" but the paper only has one author on it. So may I suggest adding their cat as a coauthor? [0]

[0] https://en.wikipedia.org/wiki/F._D._C._Willard

>>godels+5Ke
As someone who had questions about some of what you said and feels legitimately scared to ask what you meant out of fear of being judged:

> I've been told I'm "gatekeeping"

I mean...when the alternative is politely (better yet - excitedly) answering the question asked? You kind of are.

> I swear, the big reason models are black boxes are because we _want_ them to be.

Talk is cheap.

> I guess it's not a surprise Deep Mind recently hired him after that "get good" stuff.

I agree "fuck theorists" is in no way constructive. But, Deep Mind has objectively helped move the field forward. And your criticism of "get good" stuff? Did you not just tell people to "learn math" rather than help them to understand it yourself? That's the _exact_ meaning of the phrase "get good" on the internet. At best you're both being about as toxic (at least from your own description).

>>throwa+Tef

  > when the alternative is politely (better yet - excitedly) answering the question asked? You kind of are.

Gatekeeping is controlling access. Not to be confused with hurdles. I'm more than happy to have more people in "the party." No one is being excluded in a way that isn't also true of any other field. You unfortunately need some level of expertise to be able to understand discussions between experts. But am I stopping you from getting that expertise? No, in fact I'm very happy to lend a hand! Those aren't gates, they're hurdles. You don't need a specific PhD or to go to a good school or anything. It's about the knowledge. If you need a helping hand to get over, ask, because others may not know you need one, or may not know whether you're struggling fruitlessly or struggling as part of the process of improving.

But yes, hurdles exist and they are not bad. I sure as hell don't want someone that can't do calculus designing rocket engines. And you probably don't want a rocket engineer performing surgery on you. Call them what you will, but it's not a bouncer at the door telling you you're not "pretty enough", which is what gatekeeping is generally used to refer to.

  > Talk is cheap.

Sure, but we actually know a lot more about the inner workings of networks than most people realize. Sure, they aren't transparent, but that doesn't mean they are completely opaque either.

But I have no idea how to respond to this comment. What I said was fairly broad and this response is broader. Are you asking for "proof"? Of what? Interpretability? Is not the article proof of it to some degree? Or Golden Gate Claude?

  >  Did you not just tell people to "learn math" rather than help them to understand it yourself?

No? I think you misunderstand. Mind you, this is hacker news. Would you like some books for reference? A roadmap? If you have suggestions for how I should phrase my venting differently, I'm all ears. But it feels like it would be out of left field to just drop a bunch of random books, and it would take a lot of words to explain how all these things connect. I've written many "walls of text" here and frankly, anything longer than a paragraph often gets skipped over. It's fine, it's HN after all.

  > you're both being about as toxic

Are you aware of the things I'm referencing? It seems like you are not. Given that, I think you should reserve your judgement and accusations until you know more about the context. (e.g.

So I will add more context to clarify my complaints, for anyone interested. I specifically called out Mission Impossible Language Models[0], so what's that about? I suggest reading the paper. The authors create a continuum of difficulty across impossible languages, the hardest being a random word ordering. The claim is that LLMs can't learn impossible languages as well as natural languages. It's fairly easy to understand the error in this work. They use perplexity, which is the exponentiated average "surprisal": you condition on the previous words and score how likely the model thought the actual next word was. But a higher perplexity doesn't tell you that the model didn't learn the language, or even that it learned it less efficiently. The metric isn't going to work for a one-to-one comparison with a structured language, because there is naturally more entropy in the impossible language. Frankly, there are more words that are equally likely to come next. It's comparing coin flips to dice throws.
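To make the coin-versus-dice point concrete, here is a minimal sketch. It assumes an idealized model that has learned a uniform k-way source perfectly; the function names are made up for illustration and nothing here comes from the paper's code.

```python
import math

def perplexity(probs_of_observed):
    """exp of the mean negative log-probability the model gave to what actually occurred."""
    nll = [-math.log(p) for p in probs_of_observed]
    return math.exp(sum(nll) / len(nll))

def ideal_perplexity_uniform(k, n_tokens=1000):
    # Assumption: the model is perfect, so it assigns exactly 1/k
    # to every observed outcome of a uniform k-way source.
    return perplexity([1.0 / k] * n_tokens)

coin = ideal_perplexity_uniform(2)  # fair coin
die = ideal_perplexity_uniform(6)   # fair six-sided die

print(f"perfect model on coin flips: perplexity {coin:.3f}")
print(f"perfect model on dice rolls: perplexity {die:.3f}")
```

Both models are as good as a model can possibly be, yet the dice model scores a higher perplexity purely because its source has more entropy. That is why a raw perplexity gap between two languages with different entropy says little on its own about which one was learned better.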

Let's use an example: our target sentence will be "My name is godelski." In a random shuffle language we have 4! (24) ways to represent that sentence that are all __equivalent__. That's the key part. In natural language, all I can think of is 2 (with "Godelski, my name is" as a highly unlikely alternative). So in natural language, if we have "My name" and are predicting the next word, "is" is pretty likely. But in the random language "is" is just as likely as "<name>". This isn't proof that the language isn't learned; it just shows that the language isn't structured. "My name is godelski" and "My name godelski is" are equivalent sentences in a random ordering language. But actually, this gets even harder because the tokenizer was trained on natural word order. If you look at Table 1 you'll see how this gets messy (notice that "bookshelf" is the tokens " books" (space intentional), "he", "lf"). The picture gets clearer when you look at how they prepared the data: it isn't shuffled each time the model gets the data; it is shuffled once and then the model is trained on that. This is not the same as the random language, and unless you're really lucky, certain patterns are going to be more common than others, which just makes it more difficult for the model. The dataloader should reshuffle the sentences each time they are served, which will teach the model to ignore those patterns. You should also measure perplexity against all valid predictions, not a single one. This one is a killer for me.
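The "My name is godelski" example can be sketched in a few lines. This assumes an idealized random-shuffle language where every ordering of the four words is equally likely, and an idealized model that knows this; the helper names are hypothetical and this is not the paper's setup or a trained LLM.

```python
import math
from itertools import permutations

sentence = ["My", "name", "is", "godelski"]
orderings = set(permutations(sentence))  # every equivalent shuffle of the sentence

def ideal_next_token_probs(prefix, bag):
    """Best possible prediction in a uniform shuffle language:
    every word not yet used is equally likely to come next."""
    remaining = list(bag)
    for word in prefix:
        remaining.remove(word)
    return {word: 1.0 / len(remaining) for word in remaining}

def sequence_log_prob(ordering, bag):
    """Log-probability the ideal model assigns to one specific ordering."""
    return sum(
        math.log(ideal_next_token_probs(ordering[:i], bag)[word])
        for i, word in enumerate(ordering)
    )

# Scored against a single reference ordering, even the ideal model looks "surprised".
log_prob = sequence_log_prob(sentence, sentence)
ppl_single = math.exp(-log_prob / len(sentence))

# After "My name", both remaining words are equally valid continuations.
probs = ideal_next_token_probs(["My", "name"], sentence)
mass_on_valid = sum(probs.values())

print(len(orderings), round(ppl_single, 3), probs, mass_on_valid)
```

Against one fixed ordering, the ideal model's per-word perplexity is 24 ** (1/4), roughly 2.21, even though it cannot do any better. If you instead ask how much probability it puts on any valid next word, the answer is all of it. That is the gap between scoring a single target and scoring all valid predictions.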

Side note:

  > fear of being judged:

You're always going to be judged. Stand up for yourself and what you believe in. Don't be afraid of being wrong either. Truth is, you're always wrong. It's a matter of degree. The problem isn't being wrong, it is not being able to change your mind. Even if things get heated between people, there typically isn't ill will between them if they believe the other person is capable of changing their mind.

Clearly, you have judged me quite harshly.

[0] https://arxiv.org/abs/2401.06416

>>godels+Jpf
Doesn't seem like any further discussion will be worthwhile or constructive for me. Sorry, hope you understand.

>>throwa+mVi
That's okay. I'm not offended. And I hope you know I have no hard feelings. Despite disagreeing, I do respect your opinions. These are hard problems to solve and I think there are no perfect solutions.
