zlacker

Governance of Superintelligence

submitted by davidb+(OP) on 2023-05-22 17:34:56 | 93 points 172 comments
[view article] [source]

NOTE: showing posts with links only
◧◩
9. schaef+o5[view] [source] [discussion] 2023-05-22 18:08:42
>>genera+H4
> - Poverty and homelessness running rampant? Check.

Compared to what, exactly? Because over the last 50 years, there have been dramatic improvements[1].

[1]: https://www.brookings.edu/research/the-evolution-of-global-p...

It's true - there's room to do better. So, so much better. But discarding the progress of the last 50 years is so unbelievably counter-productive.

◧◩
14. ch4s3+m6[view] [source] [discussion] 2023-05-22 18:13:55
>>genera+H4
The poverty rate in the US is around 11% today vs. ~24% in 1960, and the poverty rate for children has dropped even further. There's also been about a 6% decrease in homelessness in the US over the past decade[1].

[1] https://www.security.org/resources/homeless-statistics/

◧◩
24. Animat+N7[view] [source] [discussion] 2023-05-22 18:20:02
>>causi+O6
> I wonder if AI will ever scale down to personal hardware

Happened last week. Download here.[1]

[1] https://github.com/nomic-ai/gpt4all

41. progbi+Ma[view] [source] 2023-05-22 18:32:12
>>davidb+(OP)
As I said last week on the lobbying article [1], if you don't like how "Open"AI is trying to build a regulatory capture moat, cancel your subscription.

Yes, the open models are worse, but they're getting better. There will be plenty of high-quality commercial alternatives.

[1] https://news.ycombinator.com/item?id=35967864

◧◩
49. samwil+ec[view] [source] [discussion] 2023-05-22 18:39:08
>>lsy+c9
Exactly. I also struggle to take seriously the "security" concerns of an organisation that releases this product with plugins and no proper way to restrict them or what they can do. The prompt injections are just ridiculous and show a complete lack of thought going into the design [0].

"move fast and break things"

It very much feels like they are trying to build a legislative moat, blocking out competitors and even open source projects. Ridiculous.

I don't fear what this technology does to us, I fear what we do to each other because of it. This is just the start.

0: https://twitter.com/wunderwuzzi23/status/1659411665853779971

> Let ChatGPT visit a website and have your email stolen.

> Plugins, Prompt Injection and Cross Plug-in Request Forgery.

◧◩◪
50. medlaz+hc[view] [source] [discussion] 2023-05-22 18:39:20
>>schaef+o5
The usual myth. Counterpoint: https://www.marxist.com/world-poverty-capitalism-s-crime-aga...
56. ddp26+Yc[view] [source] 2023-05-22 18:43:12
>>davidb+(OP)
Setting aside value judgments on whether this is a good idea or not, it's odd how nonspecific this blog post is.

Metaculus has some probabilities [1] of what kind of regulation might actually happen by ~2024-2026, e.g. requiring disclosure of human/non-human, restricting APIs, reporting on large training runs, etc.

[1] https://www.metaculus.com/project/ai-policy/

64. jacoop+5e[view] [source] 2023-05-22 18:48:05
>>davidb+(OP)
FWIW, Altman did request that the U.S. government not put a legal burden on open source and startup AI.

https://twitter.com/exteriorpower/status/1659069336227819520

◧◩◪
67. wabore+Ye[view] [source] [discussion] 2023-05-22 18:52:20
>>schaef+o5
By your own link, the statistics have reversed: in 2019-2020 alone, an additional 8 million people fell into extreme poverty. Going by UN metrics, the "dramatic improvements" are actually stabilizing, and we're struggling to break past the ~8% mark. Moving the poverty line from $1.90 to $2.15 sent the rate from 8.4% to 9.3%[1]. In that same document, the UN had to adjust its goal of getting extreme poverty under 3% by 2030.

How does this not justify what the above person stated, that poverty is running rampant? More than 600 million people are still in extreme poverty. A record 100 million are displaced due to conflict in their countries. So I have to ask: what exactly is unbelievably counter-productive here? I would argue that placating ourselves is.

[1:14] https://social.desa.un.org/sites/default/files/inline-files/...

◧◩◪◨
90. toomuc+pm[view] [source] [discussion] 2023-05-22 19:37:48
>>Anthon+Vk
> The premise of AGI isn't that it can do something better than people, it's that it can do everything at least as well. Which is clearly still not the case.

I imagine an important concern is the learning & improvement velocity. Humans get old, tired, etc.; GPUs do not. It isn't the case now, but it is fuzzy how fast we could collectively get there. Break problem domains out into modules, send them off to the silicon dojos until your models exceed human capabilities, and then roll them up. You can pick from ChatGPT plugins; why wouldn't an LLM hypervisor/orchestrator do the same?

https://waitbutwhy.com/2015/01/artificial-intelligence-revol...

https://waitbutwhy.com/2015/01/artificial-intelligence-revol...

◧◩◪◨⬒
101. mxkopy+zu[view] [source] [discussion] 2023-05-22 20:21:37
>>ch4s3+Og
> It's actually completely incomprehensible to me that you would suggest otherwise. It seems like a totally disconnected comment.

I think you are the one who's disconnected. Ask your average crackhead on the block if they're happy, and then compare the answer to your average college dropout stocking groceries. People who haven't seen both sides tend to think happiness follows Maslow's hierarchy of needs or is a linear function of material wealth - it's not. It seems like a joke, but this post https://www.reddit.com/r/drugscirclejerk/comments/8iyp0c/i_f... describes exactly what I mean. I genuinely believe some homeless people are happier than some working-class people.

Case in point: you just spouted more metrics at me that have to do with the well-being of the economy, not the well-being of the average person. I don't care about your numbers, because time and again they have been gamed. We should consider the idea that if we can take steps forward, we can also take steps backward.

And while we're at it, I should ask - have you ever had to deal with a dead-end job with subpar pay? Were you ever forced to work in an abusive environment? If so, then you can agree with me that it's a terrible state to be in - definitely not the same as being homeless, but still terrible.

And if not, then why are you talking about things you don't know about? Do you really think economic metrics are a viable substitute for this lack of knowledge?

◧◩◪◨⬒
105. 182716+ov[view] [source] [discussion] 2023-05-22 20:25:35
>>idopms+7j
It might not require specialized hardware in the future, as GPUs get more powerful and we have other techniques such as LoRA for fine-tuning the models. We might also see a distributed training [1] effort harnessing thousands of gamer GPUs worldwide, all of it powered by open source software. There could also be advances in the training software that make it vastly more efficient.

1. https://arxiv.org/pdf/2301.11913.pdf
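
For illustration, a minimal sketch of what LoRA fine-tuning looks like with the Hugging Face peft library - the model name and hyperparameters are placeholder assumptions, not anything from the paper above:

    # Minimal LoRA sketch with Hugging Face transformers + peft.
    # Model name and hyperparameters are illustrative placeholders.
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import LoraConfig, get_peft_model

    base = "EleutherAI/gpt-neo-1.3B"  # any small open causal LM
    tokenizer = AutoTokenizer.from_pretrained(base)
    model = AutoModelForCausalLM.from_pretrained(base)

    # Freeze the base weights and train only low-rank adapters injected
    # into the attention projections - a tiny fraction of the parameters.
    config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                        target_modules=["q_proj", "v_proj"],
                        task_type="CAUSAL_LM")
    model = get_peft_model(model, config)
    model.print_trainable_parameters()  # typically well under 1% trainable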

◧◩◪◨⬒⬓
129. famous+qO[view] [source] [discussion] 2023-05-22 22:23:07
>>Anthon+nK
>It's not better at reasoning. It's barely even capable of it

You are wrong, and there are many papers that show otherwise.

Algorithmic, Causal, Inference, Analogical

LLMs reason just fine

https://arxiv.org/abs/2212.09196

https://arxiv.org/abs/2305.00050

https://arxiv.org/abs/2204.02329

https://arxiv.org/abs/2211.09066

◧◩◪◨⬒⬓⬔
135. godels+8V[view] [source] [discussion] 2023-05-22 23:09:10
>>famous+qO
I definitely don't buy these papers at face value. I say this as an ML researcher btw.

You'll often see these works discussing zero-shot performance, but many of these tasks are not actually zero-shot, or even a known n-shot. Take a good example: Imagen[0] claims zero-shot MS-COCO performance but trains on LAION. COCO classes exist in LAION, and there are similar texts. Explore COCO[1] and run clip retrieval[2] against LAION. The example given is the first sample from COCO's aircraft category, and you'll find almost identical images and captions with many of the same keywords. This isn't zero-shot.

Why does this matter? Dataset contamination[3] in the evaluation process. You can't conclude that a model has learned something if it had access to the evaluation data. Test sets have always been a proxy for generalization and MUST be recognized as proxies.

This gets really difficult with LLMs, where all we know is that they've scraped a large swath of the internet, including GitHub and Reddit. I show some explicit examples and explanations with code generation here [4]. From there you might even see how difficult it is to generate novel test sets that aren't actually contaminated, which is my complaint about HumanEval. I show that we can find dupes or near-dupes on GitHub despite these being "hand written."

As for your sources: all of them use GPT, and we don't know what data it does and doesn't have. But we do know it was trained on Reddit and GitHub. That should be enough to tell you that certain things like physics and coding problems[5] are spoiled. If you look at the datasets used for evaluation in the works you listed, I think you'll find good reason to believe those are spoiled too. (Other datasets are spoiled as well, and there's a lot of experimentation demonstrating that the causal reasoning isn't as good as the performance suggests.)

Now mind you, this doesn't mean that LMs can't do causal reasoning. They definitely can, including causal discovery[6]. But all of this tells us that it is fucking hard to evaluate models, and even harder when we don't know what they were trained on. Maybe we need to be a bit more nuanced and stop claiming things so confidently. There's a lot of people trying to sell snake oil right now. These are very powerful tools that are going to change the world, but they are complex and people don't know much about them. We saw many snake oil salesmen at the birth of the internet too. That didn't mean the internet wasn't important or wasn't going to change the course of humanity. It just meant that people were profiting off of the confusion and complexity.
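
To make the contamination point concrete, here's a rough sketch of the kind of n-gram overlap check you'd run against a test set - the file paths and the 0.5 threshold are assumptions for illustration, not anything from these papers:

    # Rough contamination check: flag test items whose word n-grams
    # overlap heavily with the training corpus. Paths and the 0.5
    # threshold are illustrative assumptions.
    def ngrams(text, n=8):
        toks = text.lower().split()
        return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

    def overlap(item, train_index, n=8):
        grams = ngrams(item, n)
        return sum(g in train_index for g in grams) / len(grams) if grams else 0.0

    # Index every n-gram in the (huge) training dump, then score each
    # evaluation item; anything above the threshold is suspect.
    train_index = set()
    with open("train_corpus.txt") as f:
        for line in f:
            train_index |= ngrams(line)

    suspect = [t for t in open("test_set.txt") if overlap(t, train_index) > 0.5]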

[0] https://arxiv.org/abs/2205.11487

[1] https://cocodataset.org/#explore

[2] https://rom1504.github.io/clip-retrieval/?back=https%3A%2F%2....

[3] https://twitter.com/alon_jacovi/status/1659212730300268544

[4] https://news.ycombinator.com/item?id=35806152

[5] https://twitter.com/random_walker/status/1637929631037927424

[6] https://arxiv.org/abs/2011.02268

◧◩◪◨⬒⬓⬔⧯
138. famous+fX[view] [source] [discussion] 2023-05-22 23:25:33
>>godels+8V
I don't think you took more than a passing glance, if any, at those papers.

What you describe is impossible with these 3.

https://arxiv.org/abs/2212.09196 - a new evaluation set introduced with the paper, modelled after tests that previously only had visual equivalents. Contamination is literally impossible.

https://arxiv.org/abs/2204.02329 - the effect of explanations on questions introduced with the paper. Dataset concerns make no sense.

https://arxiv.org/abs/2211.09066 - a new prompting method introduced to improve algorithmic calculations. Dataset concerns make no sense.

The causal paper is the only one where worries about dataset contamination make any sense at all.

◧◩◪◨⬒⬓⬔⧯
143. famous+RZ[view] [source] [discussion] 2023-05-22 23:49:02
>>Anthon+YV
>Actual reasoning, or reconstruction of existing texts containing similar reasoning?

The papers were linked in another comment. Three of them don't even involve testing on an existing dataset, so yeah, actual.

For the world model papers:

https://arxiv.org/abs/2210.13382

https://arxiv.org/abs/2305.11169

>Lack of access to cameras or vehicle controls isn't why it can't drive a car.

It would be best to wait until what you say can be evaluated. That is your hunch, not fact.

>The existence of numerous ChatGPT jailbreaks is evidence to the contrary.

No, it's not. People fall for social engineering and do what you ask. If you think people can't be easily derailed, boy, do I have a bridge to sell you.

>Many people are of below average intelligence, or give up when something is hard but not impossible.

OK, but that doesn't help your point, and many above-average people don't reach expert level either. If you want to rationalize all that as "gave up when it wasn't impossible," go ahead, lol, but reality paints a very different picture.

>If you have one machine that will make one attempt to solve a problem a day and succeeds 90% of the time and another that will make a billion attempts to solve a problem a second and succeeds 10% of the time, which one has solved more problems by the end of the week?

"Problems" aren't made equal. Practically speaking, it's very unlikely the billion per second thinker is solving any of the caliber of problems the one attempt per day is solving. Solving more "problems" does not make you a super intelligence.

◧◩◪◨
145. famous+w01[view] [source] [discussion] 2023-05-22 23:53:40
>>davegu+zX
I've had GPT-4 respond "I don't know" to questions before.

And the base model was excellently calibrated: https://openai.com/research/gpt-4

"Interestingly, the base pre-trained model is highly calibrated (its predicted confidence in an answer generally matches the probability of being correct). However, through our current post-training process, the calibration is reduced."

Next ?

◧◩◪◨⬒⬓⬔⧯▣
152. godels+Hg1[view] [source] [discussion] 2023-05-23 02:45:51
>>famous+fX
> I don't think you took more than a passing glance, if any at those papers.

I'll assume in good faith but let's try to keep this in mind both ways.

> What you describe is impossible with these 3.

Definitely possible. I did not write my comment as a paper, but I did provide plenty of evidence. I specifically ask that you pay close attention to my HumanEval comment and click that link; I am much more specific there about how a "novel" dataset may not actually be novel. This is a complicated topic and we must connect many dots, so care is needed. You have no reason to trust my claim that I am an ML researcher, but I assure you that this is what I do. I have a special place in my heart for evaluation metrics and for understanding their limitations. This is actually key: if you don't understand the limits of a metric, you don't understand your work. If you don't understand the limits of your datasets and how they could be hacked, you don't understand your work.

=== Webb et al ===

Let's see what they are using to evaluate:

> To answer this question, we evaluated the language model GPT-3 on a range of zero-shot analogy tasks, and performed direct comparisons with human behavior. These tasks included a novel text-based matrix reasoning task based on Raven’s Progressive Matrices, a visual analogy problem set commonly viewed as one of the best measures of fluid intelligence

Okay, so they created a new dataset. Great, but do we have the same issues as with HumanEval? Raven's Progressive Matrices were introduced in 1938 (the referenced paper), and you'll also find many existing code sets for them on GitHub that are almost a decade old - even ML ones that are >7 years old. We can also find them on Blogspot, WordPress, and Wikipedia, which are the top three domains in Common Crawl (used for GPT-3)[0]. This automatically disqualifies this claim from the paper:

> Strikingly, we found that GPT-3 performed as well or better than college students in most conditions, __despite receiving no direct training on this task.__

It may be technically correct, since there is no "direct" training, but it is clear that the model was trained on these types of problems. And that's not the only work they did:

> GPT-3 also displayed strong zero-shot performance on letter string analogies, four-term verbal analogies, and identification of analogies between stories.

I think we can see that these are obviously going to be in the training data as well: GPT-3 had access to examples, similar questions, and even in-depth breakdowns of why the answers are correct.

Contamination isn't "literally impossible"; it's trivially demonstrated. This exactly matches my complaint about HumanEval.

=== Lampinen et al ===

We need only look at the example on the second page.

Task instruction:

> Answer these questions by identifying whether the second sentence is an appropriate paraphrase of the first, metaphorical sentence.

Answer explanation:

> Explanation: David’s eyes were not literally daggers, it is a metaphor used to imply that David was glaring fiercely at Paul.

You just have to ask yourself whether this prompt and answer could plausibly appear anywhere in Common Crawl. I think we know there are many Blogspot posts with questions similar to the SAT and IQ tests, which is what this experiment resembles.

=== Conclusion ===

You have strong critiques of my response but little to back them up. I'll reiterate, because it was in my initial response: you are not performing zero-shot testing when the test set includes data similar to what the model was trained on. That's not what zero-shot is. I wrote more about this a few months back[1], and it may be worth reading. What would change my opinion is not a claim that the dataset did not exist prior to the crawl, but evidence that the model was not trained on data significantly similar to the test set. This is, again, my original complaint about HumanEval, and these papers do nothing to address it.

I'll go even further. I'd encourage you to look at this paper[2], where data isn't just exactly de-duplicated but near-de-duplicated, and performance increases as a result. But I'm not going to explain everything to you. I will tell you that you need to look at Figures 4, 6, 7, A3, ESPECIALLY A4, A5, and A6 VERY carefully. Think about how these results can be explained and their relationship to random pruning. I'll also say that their ImageNet results ARE NOT zero-shot (for the reasons given previously).

But we're coming back to the same TLDR: evaluating models is a hard and already noisy process, and models that have scraped a significant portion of the internet are substantially harder to evaluate. If you can provide strong evidence that there isn't contamination, then I'll take these works more seriously. This is the point you are not addressing: you have to back up the claims, not just state them. In the meantime, I have strong evidence that these, and many other, datasets are contaminated. That even includes many causal datasets that you have not listed but that were used in other works. Essentially: if a test set is on GitHub, it is contaminated. Again, see HumanEval and the specific response that I linked. You can't just say "wrong," drop some sources, and leave it at that. That's not how academic conversations happen.
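
To illustrate what near de-duplication means in practice, here's a sketch of an embedding-based near-duplicate check - the sentence-transformers model and the 0.9 cosine threshold are my assumptions, not the paper's actual setup:

    # Flag test items that are near-duplicates of training items via
    # embedding cosine similarity. Model and threshold are assumptions.
    from sentence_transformers import SentenceTransformer
    import numpy as np

    model = SentenceTransformer("all-MiniLM-L6-v2")

    def near_duplicates(test_items, train_items, threshold=0.9):
        # Normalized embeddings make the dot product a cosine similarity.
        test_emb = model.encode(test_items, normalize_embeddings=True)
        train_emb = model.encode(train_items, normalize_embeddings=True)
        sims = test_emb @ train_emb.T
        return [(test_items[i], train_items[j], float(sims[i, j]))
                for i, j in np.argwhere(sims > threshold)]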

[0] https://commoncrawl.github.io/cc-crawl-statistics/plots/doma...

[1] https://news.ycombinator.com/item?id=35489811

[2] https://arxiv.org/abs/2303.09540

◧◩◪◨⬒⬓⬔⧯▣
155. godels+Qn1[view] [source] [discussion] 2023-05-23 03:56:06
>>famous+RZ
> The papers were linked in another comment.

For anyone following along, they are in my sibling comment; the papers are linked here[0]. The exact same conversation is happening there, but with sources.

> 3 of them don't even have anything to do with a existing dataset testing

I specifically address this claim and bring strong evidence for why you should doubt it, especially with that exact wording. The short of it is that when you scrape the entire internet for your training data, you have a lot of overlap, and you can't confidently call these evaluations "zero-shot." All experiments in the linked works use datasets that are not significantly different from data found in the training set. For those that are "hand written," see my complaints (linked) about HumanEval.

[0] https://news.ycombinator.com/item?id=36037440

◧◩◪◨⬒⬓
168. cubefo+gD3[view] [source] [discussion] 2023-05-23 18:02:54
>>0xbadc+sq3
Well, the forecasting community estimates general intelligence (incl. robotics and passing a 2-hour Turing test) by 2031.

https://www.metaculus.com/questions/5121/date-of-artificial-...

I don't know what would change your mind.
