zlacker

3 words: Bayesian Comment Filter. Just does the opposite of what the spam filter does. Use the corpus of great comments from the past to find great comments of the present.

I'm only half joking. Fundamentally, the thread is about a filtering system.

replies(3): >>pg+w >>alextp+Ca >>waterh+ub

>>b_emer+(OP)
That could work, actually. Instead of focusing on discouraging bad comments, maybe the answer is to promote good ones. Plus I have some code I could use for that.

replies(2): >>colins+a6 >>erik+B7

>>pg+w
But it raises the question, do you have a way to single out the corpus of good comments from the past? If past up-votes were reliable I don't think this thread would exist.

replies(2): >>dwwoel+q6 >>pg+Lf

>>colins+a6
He could use up-votes from a trusted subset of users.

>>pg+w
Have you considered Reddit's 'best' ranking for comments?

http://blog.reddit.com/2009/10/reddits-new-comment-sorting-s...

""" Most of the time, you won't notice that there's anything obviously different (it doesn't affect threading or anything -- don't worry!), but it should improve the quality of the top comments immensely. """

>>b_emer+(OP)
The problem is that word features are not really that good predictors of quality.

I have done some research on this (unpublished), and I got a really good performance on predicting hacker news votes by just counting how many new words (not stopwords, not very-high-frequency words) a comment was adding to a thread. Just using a few variations on this theme predicted better than word counts or bigram features.

Fundamentally, though, I disagree with machine learning- based approaches as they can only _reinforce_ present behavior, and we'd like to shape voting behavior.

replies(1): >>crassh+jk

>>b_emer+(OP)
Perhaps you could train a Bayes classifier on users' voting patterns (beginning with a corpus of good and bad comments) and use that to a) decide how to weight users' votes and b) classify further comments as good or bad. (This makes no attempt to classify comments by their content--that'd be done in the construction of the original corpus.)

Cf. http://news.ycombinator.com/item?id=2404283, http://news.ycombinator.com/item?id=2404459

>>colins+a6
I could train it on stuff from "the good old days."

replies(1): >>crassh+Yg

>>pg+Lf
What if I get a bunch of good comments from the good old days, train a Bayesian filter on them, and then make a comment bot with bias in my favour?

Perhaps you could give everyone a comment bot, like a green/red bar that says whether the comment looks like low quality or high quality as you're typing it. A lot of people might edit the comment to make it better, or simply delete the comment (you could design UI to encourage this ... eg RBM's can highlight which words look like they're causing the problem, or offer a Kill Comment "X" when the comment is far into the red).

You could also train the bayesian filter on (graphwise) voting patterns, rather than on comments as bags of words.

>>alextp+Ca
alextp, what if you use ML on voting patterns instead of comment words?

(also what if the ML only provided feedback while one is typing the comment?)

replies(1): >>alextp+dy

>>crassh+jk
By voting patterns, what do you mean? Features like how many times the user has voted in this thread/hour/day? How many times the comment has been voted, the replies have been voted, and the story has been voted? I didn't try those; maybe they'd work, but I have no intuition saying why they should.

Using ML to provide feedback is a bad idea. Most ML techniques latch on to surface features of the text rather than the deeper structure, so it'd just make it really easy for people to reword their mean comments ("this is just stupid" becomes "What an incoherent piece of gobblydegook" or something like this, which might make things funnier but I doubt it would help).

replies(1): >>crassh+eK

>>alextp+dy
By voting patterns I meant

1) who votes on good comments 2) who votes on whose comments 3) who votes a lot / a little.

But mostly (1).

I take your point about ML being superficial. But if it's being used at all, shouldn't the users be informed about what the robo-brain thinks of them?

Your excellent example of a rewording might fool a lot of humans too (see pg's article another commenter linked to ... Ctrl+F "DH4").