Why humanoid robots are learning everyday tasks faster than expected

Last September, roboticist Benjie Holson published the “Humanoid Olympics”: a series of increasingly difficult tests for humanoid robots, each of which he demonstrated himself while wearing a silver bodysuit. The challenges, such as opening a door with a round doorknob, started out easy, at least for a human, and progressed to “gold medal” tasks such as properly buttoning and hanging a men’s shirt and using a key to open a door.
Holson’s point was that the most difficult tasks are not the most dazzling ones. While other competitions feature robots playing sports and dancing, he argued that the robots we really want are the ones that can do laundry and cook meals.
He expected the challenges to take years to complete. Instead, in a matter of months, the robotics company Physical Intelligence completed 11 of the 15 challenges, from bronze to gold, with a robot that washed windows, spread peanut butter and used a dog poop bag.
Scientific American spoke with Holson about why vision-only, camera-based systems exceeded his expectations and how close we are to a truly useful machine. Since then, he has launched a new, more difficult series of challenges.
[An edited transcript of the interview follows.]
You designed these challenges to be difficult. Were you surprised by how quickly the results came?
It was so much faster than I expected. When I chose the challenges, I was trying to calibrate them so that some bronze challenges would be done in the first month or two, then silver and gold in the next six months, and the harder ones might take a year or a year and a half. Getting almost all of them done in the first three months is crazy.
What made this possible?
I started from the observation that we have systems that look impressive within a fairly narrow set of tasks: vision only, no contact sensing, simple manipulators, not incredibly precise. That limits what you can be good at. I tried to think of tasks that would force us to step outside of that set. It turns out that I greatly underestimated what is possible with simple, vision-only manipulators.
When I visited Physical Intelligence, I learned that they have no force sensing. They do all of this 100 percent based on vision. The key-insertion task, the peanut butter spreading – I thought those would require some force input. But apparently you just throw more video demonstrations at it, and it works.
How exactly do you train a robot to do this without coding it line by line?
Everything is learned through demonstration. Someone teleoperates the robot through the task hundreds of times, a model is trained on those demonstrations, and then the robot can perform the task on its own.
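The pipeline Holson describes – teleoperated demonstrations distilled into a policy – is essentially what the literature calls behavior cloning. A minimal sketch, assuming a toy linear policy over fake "camera feature" vectors (the names and dimensions are illustrative, not Physical Intelligence's actual stack):

```python
import numpy as np

# Behavior cloning in miniature: fit a policy that maps observations
# (a fake 4-D "camera feature" vector) to actions (a 2-D motor command)
# using only (observation, action) pairs logged during teleoperation.

rng = np.random.default_rng(0)

# Pretend the teleoperator's behavior is linear: action = obs @ W_true.
W_true = rng.normal(size=(4, 2))

# "Hundreds of demonstrations": observations and the operator's responses,
# with a little noise standing in for human inconsistency.
obs = rng.normal(size=(500, 4))
actions = obs @ W_true + 0.01 * rng.normal(size=(500, 2))

# Training: least squares here, a stand-in for gradient descent on a network.
W_learned, *_ = np.linalg.lstsq(obs, actions, rcond=None)

# Deployment: the robot now maps a new observation to an action on its own.
new_obs = rng.normal(size=4)
predicted_action = new_obs @ W_learned
```

The toy version makes the key property visible: nothing about the task is coded line by line; the policy is recovered entirely from logged demonstrations.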
There has been a lot of debate about whether large language models (LLMs) are useful for robots. Are they?
I was quite doubtful about the usefulness of LLMs in robotics. The problem they could solve two or three years ago was high-level planning: “If I want to make tea, what are the steps?” But ordering the steps is the easy part. Picking up the teapot and filling it is the hard part.
But then people started building vision-action models using the same transformer architecture [as that used in LLMs]. You can use transformers for text in, text out; for image in, text out; but also for image in, robot action out.
What’s interesting is that they start from models pretrained on text, images, maybe video. Before you even start training on your specific task, the model already knows what a teapot is, what water is, and that you might want to fill a teapot with water. So when you train on your task, it doesn’t have to start from “Let me figure out what geometry is.” It can start from “I see, we’re moving the teapot” – and it’s crazy that this works.
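The recipe Holson sketches – reuse a pretrained backbone and train only a small action head on robot demonstrations – can be caricatured in a few lines. Here a fixed random projection stands in for the frozen vision-language transformer, and the dimensions are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Frozen "pretrained" backbone: fixed weights that are never updated.
BACKBONE_W = rng.normal(size=(8, 16))

def backbone(image_vec):
    """Stand-in for a pretrained vision transformer: fixed projection + nonlinearity."""
    return np.tanh(image_vec @ BACKBONE_W)

# Robot demonstrations: raw "images" (8-D vectors) paired with 3-D actions.
images = rng.normal(size=(300, 8))
features = backbone(images)
actions = features @ rng.normal(size=(16, 3))  # synthetic action targets

# Fine-tuning amounts to fitting only the lightweight head on frozen features.
head, *_ = np.linalg.lstsq(features, actions, rcond=None)

# Image in, robot action out.
robot_action = backbone(rng.normal(size=8)) @ head
```

The point of the caricature: the expensive general knowledge lives in the frozen backbone, so the task-specific training only has to learn the small mapping from features to motor commands.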
How did the “Olympic” tasks come to you?
So it was part challenge and part prediction. I’ve been trying to think of the next set of things that we can’t do now that someone will be able to do soon.
Humans rely on touch for tasks such as finding keys in a pocket. How do roboticists get around that?
This is a very good question that we don’t know the answer to yet. Touch sensing is much worse off: more expensive, more delicate and far behind cameras, which we have been refining for a long time.
The big question is: are cameras enough? Sunday Robotics and Physical Intelligence [which completed the bronze-medal task of rolling matched socks] are betting that putting a camera on the wrist, very close to the fingers, lets you effectively see forces by watching how everything deforms. When the robot grabs something, it sees the rubber on its fingers deflect and the object shift, and it infers the forces. When spreading peanut butter on bread, the robot watches the knife bend downward and press into the bread and judges how hard it is pushing. This works much better than expected.
What about safety?
The energy required to stay balanced is often quite high. If a robot starts to fall, it takes a very fast, hard acceleration to get a leg out in front in time. The system has to put a lot of energy into the world – and that’s what’s dangerous.
I’m a big fan of centaur robots: a wheeled base with arms and a head. For safety reasons, that’s a much easier way to get there quickly. If a humanoid loses power, it falls over. The general plan seems to be to make a robot so incredibly valuable that we, as a society, create a new safety class for it, as we have for bikes and cars. They are dangerous, but so valuable that we tolerate the risk.
Have these results changed your timeline?
I thought domestic robots were at least 15 years away. Now I think it could be as few as six. The difference is that I thought it would take much longer before a robot doing something useful in a human space, even in demo form, was plausible.
But roboticists have found time and time again that there’s a long way from “it worked in the lab, and I got a video” to “I can sell a product.” Waymo had cars on the road in 2009; I couldn’t hail one myself until 2024. Reliability takes a long time to develop.
What is the biggest bottleneck remaining?
Reliability and safety. What Physical Intelligence has shown is incredibly impressive, but if you put the robot at a different table, with different lighting and a different sock, it might not work. Each step toward generalization appears to require an order of magnitude more data, turning days of data collection into weeks or months.



