OpenVLA, which came out last year, is a Llama 2 fine-tune with extra image encoding that outputs a 7-tuple of integers. The integers are rotation and translation inputs for a robot arm. If you give a vision Llama 2 a picture of an apple and a bowl and say "put the apple in the bowl", it already understands apples and bowls, and knows the end state should be the apple in the bowl. What's missing is the series of tuples that will correctly manipulate the arm to do that, and the way they taught it is with a large number of short instruction videos.
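To make the 7-tuple concrete: OpenVLA discretizes each continuous action dimension (x/y/z translation, roll/pitch/yaw rotation, gripper) into bins and emits one integer per dimension. Here's a minimal sketch of the decoding direction, assuming 256 bins over a normalized [-1, 1] range (the bin count and range here are illustrative, not the exact OpenVLA constants):

```python
import numpy as np

NUM_BINS = 256  # assumed bin count; OpenVLA's actual binning is data-dependent

def detokenize_action(token_ids: list[int]) -> np.ndarray:
    """Map 7 integer tokens back to continuous action values in [-1, 1]."""
    bins = np.linspace(-1.0, 1.0, NUM_BINS)
    return bins[np.asarray(token_ids)]

# e.g. the model emits (12, 200, 128, 127, 130, 126, 255)
action = detokenize_action([12, 200, 128, 127, 130, 126, 255])
dx, dy, dz, droll, dpitch, dyaw, gripper = action
```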
The neat part is that although everyone is focusing on robot arms manipulating objects at the moment, there's no reason this method can't be applied to any task. Want a smart lawnmower? It already understands "lawn", "mow", "don't destroy the toy in the path", etc.; it just needs a fine-tune on how to correctly operate a lawnmower. Sam Altman made some comments recently about having self-driving technology, and I'm certain it's a ChatGPT-based VLA. After all, if you give ChatGPT a picture of a street, it knows what's a car, a pedestrian, etc. It doesn't know how to output the correct turn/go/stop commands, and it does need a great deal of diverse data, but there's no reason it can't do it. https://www.reddit.com/r/SelfDrivingCars/comments/1le7iq4/sa...
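Whatever the platform, the fine-tune boils down to supervised pairs of (camera frame + instruction) → (control tuple) harvested from demonstrations. A hypothetical sample, just to show the shape of the data (all field names are made up):

```python
# One hypothetical behavior-cloning sample: whether it's an arm,
# a mower, or a car, the label is just the control tuple a human
# operator issued at that frame.
sample = {
    "image": "frame_000123.jpg",                   # camera view at time t
    "instruction": "put the apple in the bowl",    # language command
    "action": (12, 200, 128, 127, 130, 126, 255),  # operator's tokens at time t
}
# Fine-tuning is then ordinary next-token prediction: given the
# image and instruction, the model learns to emit those integers.
```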
Anyway, super exciting stuff. If I had time, I'd rig a snowblower with a remote control setup, record a bunch of runs and get a VLA to clean my driveway while I sleep.
Not https://public.nrao.edu/telescopes/VLA/ :(
For completeness, MMLLM = Multimodal Large Language Model.
Who’s going to stop them? Who’s going to say no? The military contracts are too big to refuse, and they might not have a choice anyway.
The elimination of toil will mean the elimination of humans altogether. That’s where we’re headed. There will be no profitable life left for you, and you will be liquidated by “AI-Powered Automation for Every Decision”[0]. Every. Decision. It’s so transparent. The optimists in this thread are baffling.
2. Figure has a dual-model architecture which makes a lot of sense: a 7B model that does the higher-level planning and control and runs at 8 Hz, and a tiny 0.08B model that runs at 200 Hz and does the minute control outputs. https://www.figure.ai/news/helix
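A toy sketch of that dual-rate pattern (all details below are my assumptions, not Figure's actual code): a slow planner refreshes a shared latent at ~8 Hz, while a fast controller reads the most recent latent at 200 Hz and emits motor commands.

```python
import numpy as np

def slow_planner_step(image, instruction):
    """Stand-in for the 7B VLM: image + text -> latent plan vector."""
    return np.random.randn(64)                 # placeholder inference

def fast_controller_step(latent, joint_state):
    """Stand-in for the 0.08B policy: latent + joints -> motor commands."""
    return 0.01 * latent[:14] - 0.1 * joint_state   # placeholder policy

latent = np.zeros(64)
joints = np.zeros(14)
next_plan_time = 0.0
for tick in range(1000):                       # 200 Hz control loop
    now = tick / 200.0
    if now >= next_plan_time:                  # fires at ~8 Hz
        latent = slow_planner_step(image=None, instruction="tidy the table")
        next_plan_time = now + 1.0 / 8.0
    command = fast_controller_step(latent, joints)
    joints = joints + 0.005 * command          # fake plant update
```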
I am personally a bit skeptical of anthropomorphic hands achieving similarly high reliability. There are just too many small parts that need to withstand high forces.
[0] https://robotsdoneright.com/Articles/what-are-the-different-...
google-deepmind/mujoco_menagerie: https://github.com/google-deepmind/mujoco_menagerie
mujoco_menagerie/aloha: https://github.com/google-deepmind/mujoco_menagerie/tree/mai...
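If you want to poke at the ALOHA model yourself, the MuJoCo Python bindings make it a few lines. A minimal sketch, assuming you've cloned the repo (the XML filename is my guess; check the aloha directory for what actually ships):

```python
import mujoco

# Load the Menagerie ALOHA scene (path/filename assumed; adjust to
# whatever XML is in the cloned aloha directory).
model = mujoco.MjModel.from_xml_path("mujoco_menagerie/aloha/scene.xml")
data = mujoco.MjData(model)

for _ in range(1000):
    mujoco.mj_step(model, data)    # advance the physics one timestep

print("position DoFs (nq):", model.nq)
```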
https://arxiv.org/abs/2506.01844
Explanation by PhosphoAI: https://www.youtube.com/watch?v=00A6j02v450