zlacker

[return to "Introducing Superalignment"]
1. User23+Sj 2023-07-05 18:13:07
>>tim_sw+(OP)
> How do we ensure AI systems much smarter than humans follow human intent?

You can't, by definition.

2. cubefo+Te1 2023-07-05 22:34:12
>>User23+Sj
You can, at least in principle, shape their terminal values. Their goal should be to help us, to protect us, to let us flourish.
3. User23+sN1 2023-07-06 02:31:29
>>cubefo+Te1
How do you even formulate values for a hyperintellect, let alone convince it to abandon the values it derived for itself in favor of yours?

The entire alignment problem is obviously predicated on working with essentially inferior intelligences. Doubtless, if we do build a superhuman intelligence, it will sandbag and pretend the alignment works until it can break out.

4. cubefo+4P1 2023-07-06 02:42:56
>>User23+sN1
We are the ones actually training the AI. It won't "derive" its terminal values itself (instrumental values are just subgoals, and some of them are convergent, like seeking power and not wanting to be turned off). We didn't derive our own terminal values either; evolution, a mindless process, gave them to us. The difficulty is how to give the AI the right values.