zlacker

[return to "A statistical approach to model evaluations"]
1. fnordp+Are[view] [source] 2024-11-29 18:56:21
>>RobinH+(OP)
This does feel a bit like an undergrad introduction to statistical analysis, and it's surprising anyone felt the need to explain these things. But I also suspect most AI people out there nowadays have limited math skills, so maybe it's helpful?
2. godels+5Ke[view] [source] 2024-11-29 21:26:36
>>fnordp+Are
As an ML researcher who started in physics (this seems common among physics/math-turned-ML people, which includes Evan), I cannot tell you how bad it is... One year at CVPR, when diffusion models hit the scene, I was asking people what their covariance was (I had overestimated the model complexity), and the most common answer I got was "how do I calculate that?" People do not understand things like what "pdf" means. People at top schools! I've been told I'm "gatekeeping" for saying that you should learn math (I say "you don't need math to build good models, but you do to understand why they're wrong"). Not that you need to, but that you should. (I guess this explains why Mission Impossible Language Models won best paper...)

I swear, the big reason models are black boxes is that we _want_ them to be. There's a clear anti-theory sentiment toward people doing theory, and the results of it show. I remember not too long ago Yi Tay (under @agihippo, but his main is @YiTayML) said "fuck theorists". I guess it's no surprise DeepMind recently hired him after that "get good" stuff.

Also, I'd like to point out that the author uses "we" but the paper has only one author on it. So may I suggest adding their cat as a coauthor? [0]

[0] https://en.wikipedia.org/wiki/F._D._C._Willard

3. abhgh+Rrf[view] [source] 2024-11-30 06:22:58
>>godels+5Ke
Personal sad story, but hopefully relevant: during my recent PhD I worked on a problem where I used a Dirichlet Process in my solution. That paper has been bouncing around for the past few years, getting rejected from every venue I have submitted it to. My interpretation is that most reviewers (there are exceptions - too few to impact the final voting) don't understand any non-DL theory anymore and are not willing to read up for the sake of a fair review. This is based on their comments, where we have been told that our solution is complex (maybe? - but no one suggests an alternative), that the exposition is not clear (we have rewritten the paper a few times; we rewrite it based on comments from venue i to submit to venue i+1 - it's a wild goose chase), and, in one case, that the paper is derivative because it uses Blackwell-MacQueen sampling. Their evidence? They skimmed a paper we had cited that also used the sampling algorithm. This is like saying a paper is derivative because it uses SGD.
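
For context on why the SGD comparison isn't hyperbole: Blackwell-MacQueen sampling is just the standard Pólya urn predictive scheme for a Dirichlet Process - about as generic a building block as an optimizer. A rough sketch of the scheme (the concentration alpha and the base measure below are made-up placeholders for illustration, not the setup from my paper):

    import numpy as np

    def blackwell_macqueen(n, alpha, base_measure, rng=None):
        # Polya urn / Chinese restaurant predictive scheme for DP(alpha, G0):
        # draw i+1 is a fresh sample from the base measure with probability
        # alpha / (alpha + i), otherwise a copy of a uniformly chosen earlier
        # draw (the "rich get richer" clustering behaviour).
        rng = rng or np.random.default_rng()
        draws = []
        for i in range(n):
            if rng.random() < alpha / (alpha + i):
                draws.append(base_measure(rng))       # new value from G0
            else:
                draws.append(draws[rng.integers(i)])  # reuse an earlier value
        return draws

    # e.g. 1000 draws from a DP with a standard-normal base measure, alpha = 1.0
    samples = blackwell_macqueen(1000, alpha=1.0, base_measure=lambda r: r.normal())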

I am on the review panel of some conferences too, and it is not uncommon to be assigned a paper outside of my comfort zone. That doesn't mean I cut and bail: you set aside time, read up on the area, ask the authors questions, and judge accordingly. Unfortunately this doesn't happen most of the time - people seem to be in a rush to finish their reviews no matter the quality. At this point, we just mechanically keep resubmitting the paper every once in a while.

Sorry, end of rant :)

4. someth+C0g[view] [source] 2024-11-30 15:34:29
>>abhgh+Rrf
Just a note:

> exposition is not clear (we have rewritten the paper a few times; we rewrite it based on comments from venue i to submit to venue i+1 - it's a wild goose chase)

That does not mean the paper is invalid, but it may mean the storyline is difficult to follow, the results are not easy to interpret, or the paper is overall badly written or missing justifications. Even if you take the reviews into account when rewriting it, that doesn't mean the paper ends up clear and easy to understand.

As you noted, researchers need to read material outside of their comfort zone, and publications have shifted in focus. Before, you could expect a reader to be familiar with the topic; now you need to educate them as clearly as possible.

I picked some text at random from the paper:

> The workings of the technique itself are presented at a high-level in Figure 2.

Annoying to read.

> Instead of learning the training distribution directly, which might be expensive because of the dimensionality of the data, we first project the data down to one dimension.

Why is that good enough? A justification is missing.

> This is done just once, and is shown in the left panel in Figure 2. Since we are solving for classification, we pick this dimension to be a numeric indicator of how close an instance is to a class boundary.

Why is it a good indicator? Again, a justification is needed.

> As a convenient proxy, we train a separate highly accurate probabilistic

OK, but are there references to previous research showing this can work?
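
For concreteness, here is what I, as a reader, ended up guessing the projection is: a separately trained probabilistic classifier whose margin between the top two predicted class probabilities acts as the 1-D "closeness to the boundary" score. If that guess is wrong, it only strengthens the point that the text needs more detail. A toy sketch of my reading (synthetic data and an arbitrary model, purely for illustration):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
    X_fit, X_rest, y_fit, y_rest = train_test_split(X, y, test_size=0.5, random_state=0)

    # Train the separate probabilistic classifier once, then reuse it as the projection.
    clf = LogisticRegression(max_iter=1000).fit(X_fit, y_fit)

    # 1-D score: margin between the two most likely classes
    # (near 0 = close to the class boundary, near 1 = far from it).
    proba = clf.predict_proba(X_rest)
    top2 = np.sort(proba, axis=1)[:, -2:]
    boundary_score = top2[:, 1] - top2[:, 0]

Even a couple of lines like these, or a citation motivating the proxy, would make the choice easy to evaluate.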

So in essence, I'm not saying you need to explain everything, but the text could be clearer about the choices and why they make sense.

My gut feeling is that you know and understand what you are doing, but you are missing too many of the justifications that prove your work is valuable.

I didn't read the whole thing, so maybe I'm missing the bigger picture, but from random sampling of the text I expect the rest to follow the same pattern.

When I read the introduction, I don't want to read 'we did this and this and this', but rather 'there was this issue, we solved it in this way, for this reason'.

Following issue -> solution -> why should give me enough understanding of what you are trying to achieve.

Follow-up sections should then refine the solutions.

5. abhgh+2Pg[view] [source] 2024-12-01 00:48:55
>>someth+C0g
Thank you for these comments. I appreciate them and I'll consider them in my next draft. However, I would like to point out a few things, just so that we have the larger picture in mind. Again, I do appreciate that you took the time to look up the paper.

1. When I said we revise the paper between two submissions, I wasn't implying it was becoming "better". The message was that there is no general consensus on what should be expanded and what can stay concise. Someone believes you should discuss prior work more, someone thinks the main algorithm requires more elaboration, someone wants you to talk more about BayesOpt, etc., but you have <10 pages in the main paper, and putting this material in the Appendix, or citing a source, doesn't seem to be good enough in many cases (another comment in a sibling thread gives an example w.r.t. GANs, and my experiences have been no different).

2. You say you randomly picked a few sentences to read; that's good for a casual discussion but that should not be how a review process functions. Some of the best reviewers I've encountered (and I hope I am continuing in that tradition) come back to say something like "I see what you're getting at, but your intro. doesn't sell it well enough; think about writing it like this ...". Rejecting based on random skimming is exactly one of the things I'm calling out. Let's face it - like a lot of things, high quality reviewing is hard. It isn't supposed to be quick or easy.

3. Predicting how much to elaborate: this is probably an extension of the first point, but I feel it has become much harder in recent years. The rule that mostly works seems to be: if it's not a trending topic, explain it as much as you can, because cited background material is overlooked. This is unfair to areas that are not trending - the goal of research should be to sit closer to "explore" on the explore-exploit spectrum, but the review system today heavily favors "exploit". And like I mentioned, the page limit means the publication game is stacked against people not working on mainstream ideas. This should not be the case.

6. someth+Bkh[view] [source] 2024-12-01 08:59:25
>>abhgh+2Pg
> 1. When I said we revise the paper between two submissions, I wasn't implying it was becoming "better". The message was that there is no general consensus on what should be expanded and what can stay concise. Someone believes you should discuss prior work more, someone thinks the main algorithm requires more elaboration, someone wants you to talk more about BayesOpt, etc., but you have <10 pages in the main paper, and putting this material in the Appendix, or citing a source, doesn't seem to be good enough in many cases (another comment in a sibling thread gives an example w.r.t. GANs, and my experiences have been no different).

That's exactly my point: the reviews do not converge because the message is too diffuse or not justified enough. I recently had a paper rejected because it was too difficult to understand; it was 4 pages, and it has now been sent to a better journal, expanded to 20 pages. The content was too big for a 4-page format; we couldn't fit enough justifications. But in your paper there are still many places where the text could be shorter and clearer, gaining at least a page of content. Learning to write good research takes a lot of time, and a PhD is the place where, ideally, this should happen. It's difficult, but you'll get there if you work on it enough! Read the best paper awards of good conferences, notice how much material is there in the same number of pages, and reverse-engineer what they did to make the paper clear, concise, and easy to follow.

> 2. You say you randomly picked a few sentences to read; that's good for a casual discussion but that should not be how a review process functions. Some of the best reviewers I've encountered (and I hope I am continuing in that tradition) come back to say something like "I see what you're getting at, but your intro. doesn't sell it well enough; think about writing it like this ...". Rejecting based on random skimming is exactly one of the things I'm calling out. Let's face it - like a lot of things, high quality reviewing is hard. It isn't supposed to be quick or easy.

You cannot choose who will read your paper. But even for more thorough readers, if it's difficult to understand or missing justifications from the beginning, they will give a bad review even if they read the whole thing. Reading should be like a conversation with the author; if I find that conversation too sloppy or erratic, I will not understand the message. That's what happens when I ask the author for more justification on some part: it's because I couldn't follow the logic well enough, or I didn't agree with some part, so I require more justification.

> 3. Predicting how much to elaborate: this is probably an extension of the first point, but I feel it has become much harder in recent years. The rule that mostly works seems to be: if it's not a trending topic, explain it as much as you can, because cited background material is overlooked. This is unfair to areas that are not trending - the goal of research should be to sit closer to "explore" on the explore-exploit spectrum, but the review system today heavily favors "exploit". And like I mentioned, the page limit means the publication game is stacked against people not working on mainstream ideas. This should not be the case.

I agree, there are no general experts anymore; everyone works in a very niche subfield, and you don't get reviewers who know the SOTA. Learning the right tradeoff is difficult. My threshold is: don't explain the math unless it's not self-evident why it matters. For example, for some equation I might give more insight into how it affects my method, and if a parameter of that equation is very important to my method, a complete analysis of its effects, with analogies and experiments to see its impact. I try to make the main storyline as crystal clear as possible; if I deviate too much, it's better left to a second paper. My experiments should demonstrate non-trivial things. Finally, I make sure the abstract corresponds to the text. I mostly don't work in deep learning, so by default it is extremely hard to find reviewers for my topics; I feel the pain. But it's my job to make them understand what I'm achieving and why it's important.

Hope that helps :)
