OpenVLA, which came out last year, is a Llama 2 fine-tune with extra image encoding that outputs a 7-tuple of integers. The integers are rotation and translation inputs for a robot arm. If you give a vision Llama 2 a picture of an apple and a bowl and say "put the apple in the bowl", it already understands apples and bowls, and knows the end state should be the apple in the bowl. What's missing is the series of tuples that will correctly manipulate the arm to do that, and the way they taught it is with a large number of short instruction videos.
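To make the 7-tuple concrete: OpenVLA discretizes each continuous action dimension (x/y/z translation, roll/pitch/yaw rotation, gripper) into bins and emits one integer per dimension. Here's a minimal sketch of the decoding direction, assuming 256 bins over a normalized [-1, 1] range (the bin count and range here are illustrative, not the exact OpenVLA constants):

```python
import numpy as np

NUM_BINS = 256  # assumed bin count; OpenVLA's actual binning is data-dependent

def detokenize_action(token_ids: list[int]) -> np.ndarray:
    """Map 7 integer tokens back to continuous action values in [-1, 1]."""
    bins = np.linspace(-1.0, 1.0, NUM_BINS)
    return bins[np.asarray(token_ids)]

# e.g. the model emits (12, 200, 128, 127, 130, 126, 255)
action = detokenize_action([12, 200, 128, 127, 130, 126, 255])
dx, dy, dz, droll, dpitch, dyaw, gripper = action
```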
The neat part is that although everyone is focusing on robot arms manipulating objects at the moment, there's no reason this method can't be applied to any task. Want a smart lawnmower? It already understands "lawn", "mow", "don't destroy the toy in the path", etc.; it just needs a fine-tune on how to correctly operate a lawnmower. Sam Altman made some comments recently about having self-driving technology, and I'm certain it's a ChatGPT-based VLA. After all, if you give ChatGPT a picture of a street, it knows what's a car, a pedestrian, etc. It doesn't know how to output the correct turn/go/stop commands, and it does need a great deal of diverse data, but there's no reason it can't do it. https://www.reddit.com/r/SelfDrivingCars/comments/1le7iq4/sa...
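Whatever the platform, the fine-tune boils down to supervised pairs of (camera frame + instruction) → (control tuple) harvested from demonstrations. A hypothetical sample, just to show the shape of the data (all field names are made up):

```python
# One hypothetical behavior-cloning sample: whether it's an arm,
# a mower, or a car, the label is just the control tuple a human
# operator issued at that frame.
sample = {
    "image": "frame_000123.jpg",                   # camera view at time t
    "instruction": "put the apple in the bowl",    # language command
    "action": (12, 200, 128, 127, 130, 126, 255),  # operator's tokens at time t
}
# Fine-tuning is then ordinary next-token prediction: given the
# image and instruction, the model learns to emit those integers.
```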
Anyway, super exciting stuff. If I had time, I'd rig a snowblower with a remote control setup, record a bunch of runs and get a VLA to clean my driveway while I sleep.
Not https://public.nrao.edu/telescopes/VLA/ :(
For completeness, MMLLM = Multimodal Large Language Model.
Who’s going to stop them? Who’s going to say no? The military contracts are too big to refuse, and they might not have a choice anyway.
The elimination of toil will mean the elimination of humans altogether. That’s where we’re headed. There will be no profitable life left for you, and you will be liquidated by “AI-Powered Automation for Every Decision”[0]. Every. Decision. It’s so transparent. The optimists in this thread are baffling.
2. Figure has a dual-model architecture which makes a lot of sense: a 7B model that does the higher-level planning and control and runs at 8 Hz, and a tiny 0.08B model that runs at 200 Hz and does the minute control outputs. https://www.figure.ai/news/helix
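A toy sketch of that dual-rate pattern (all details below are my assumptions, not Figure's actual code): a slow planner refreshes a shared latent at ~8 Hz, while a fast controller reads the most recent latent at 200 Hz and emits motor commands.

```python
import numpy as np

def slow_planner_step(image, instruction):
    """Stand-in for the 7B VLM: image + text -> latent plan vector."""
    return np.random.randn(64)                 # placeholder inference

def fast_controller_step(latent, joint_state):
    """Stand-in for the 0.08B policy: latent + joints -> motor commands."""
    return 0.01 * latent[:14] - 0.1 * joint_state   # placeholder policy

latent = np.zeros(64)
joints = np.zeros(14)
next_plan_time = 0.0
for tick in range(1000):                       # 200 Hz control loop
    now = tick / 200.0
    if now >= next_plan_time:                  # fires at ~8 Hz
        latent = slow_planner_step(image=None, instruction="tidy the table")
        next_plan_time = now + 1.0 / 8.0
    command = fast_controller_step(latent, joints)
    joints = joints + 0.005 * command          # fake plant update
```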
I am personally a bit skeptical of anthropomorphic hands achieving similarly high reliability. There are just too many small parts that need to withstand high forces.
[0] https://robotsdoneright.com/Articles/what-are-the-different-...
google-deepmind/mujoco_menagerie: https://github.com/google-deepmind/mujoco_menagerie
mujoco_menagerie/aloha: https://github.com/google-deepmind/mujoco_menagerie/tree/mai...
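If you want to poke at the ALOHA model yourself, the MuJoCo Python bindings make it a few lines. A minimal sketch, assuming you've cloned the repo (the XML filename is my guess; check the aloha directory for what actually ships):

```python
import mujoco

# Load the Menagerie ALOHA scene (path/filename assumed; adjust to
# whatever XML is in the cloned aloha directory).
model = mujoco.MjModel.from_xml_path("mujoco_menagerie/aloha/scene.xml")
data = mujoco.MjData(model)

for _ in range(1000):
    mujoco.mj_step(model, data)    # advance the physics one timestep

print("position DoFs (nq):", model.nq)
```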
https://arxiv.org/abs/2506.01844
Explanation by PhosphoAI: https://www.youtube.com/watch?v=00A6j02v450