zlacker

[parent] [thread] 1 comments
1. simian+(OP)[view] [source] 2026-01-27 22:19:44
In theory, the models have done alignment training to not do something malicious.

Can you get it to do something malicious? I'm not saying it is not unsafe, but the extent matters. I would like to see a reproduceable example.

replies(1): >>dgunay+ev1
2. dgunay+ev1[view] [source] 2026-01-28 10:48:27
>>simian+(OP)
I ran an experiment at work where I was able to adversarially prompt inject a Yolo mode code review agent into approving a pr just by editing the project's AGENTS.md in the pr. A contrived example (obviously the solution is to not give a bot approval power) but people are running Yolo agents connected to the internet with a lot of authority. It's very difficult to know exactly what the model will consider malicious or not.
[go to top]