Gemini's 0 to 1, Embodied AI's 10 to 100

The reason for this title is quite interesting. Not long ago we interviewed a candidate who presented a highly idealized financial projection model for an embodied AI company. In his forecast, dexterous hands would still account for over 60% of the cost of a humanoid robot through 2028. So I asked him which matters more for a humanoid robot, the brain or the cerebellum. He replied: the dexterous hands.

I admit that dexterous hands are very important, and the force-feedback models have yet to achieve a significant breakthrough. However, the reason such a high-level closed-door meeting is discussing embodied AI today is likely that the attraction lies more in the intelligence.

To be honest, I haven't followed the progress of embodied AI, especially the robotics part, for some time. More than two years ago, when we internally recommended Unitree Robotics, we were indeed full of passion. Even at this time last year, when communicating with major institutions, my view was that embodied AI still depends on Chinese manufacturing.

However, from our own perception, the speed of progress in embodied AI over the past year has been somewhat lower than expected. This is another major reason this title popped into my head. Of course, I need to explain the title: the first half is easy to understand; the second half, regarding the "10 to 100" phase of embodied AI, doesn't mean that embodied AI has already reached that stage. Rather, it means we might be putting too much energy and discussion into the 10 to 100 phase, while in reality, we may not have truly achieved the "0 to 1" breakthrough. Or rather, even if we reach 1, the importance, difficulty, and value of moving from 1 to 10 might be much greater than moving from 10 to 100.

That is the explanation and basic conclusion of my title. Now I will elaborate. First, a question: Why was ChatGPT successful three years ago, and why could Google's Gemini-3, just released two days ago, surpass it?

In my view, the answer is the same: they mastered the "1 to 10" phase.

Why do I say this? Currently, the generative AI represented by ChatGPT, Gemini, etc., is built on the Transformer architecture. This architecture was the "0 to 1" breakthrough when Google published the "Attention Is All You Need" paper in 2017. The five years from 2017 to the release of ChatGPT in 2022, and the eight years to now, have been focused on model development centered on scaling—more data, larger data centers, and more computing power. The most important work when OpenAI released ChatGPT was actually Reinforcement Learning from Human Feedback (RLHF). In plain terms, this means hiring a bunch of so-called human experts for data labeling, including teaching the model how to speak like a human, answer various questions, solve math problems, write code, identify risky intent, and so on. Doing a good job in various data labeling tasks is constant optimization within the 1 to 10 phase, making the model look human. Whether you think it's an infant, an elementary student, or a PhD, it's about continuous optimization from 1 to 10 to make the model's capabilities appear stronger and stronger. Google initially neglected this. Both the DeepMind and Google Brain teams were a bit too idealized, thinking they could find reinforcement learning methods where the model could train and optimize itself, gaining confidence from the successes of AlphaGo and AlphaZero. However, the early release of ChatGPT disrupted the rhythm of all research teams in the AI field. Under immense pressure, Google underwent more than half a year of adjustment, eventually focusing all its energy on the 1 to 10 optimization. Through Gemini 1.0, 1.5, 2.0, 2.5, to version 3 two days ago—about two years of effort—they achieved a reversal.
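For readers who want to see what RLHF boils down to mechanically, here is a minimal sketch, purely my own illustration rather than OpenAI's actual pipeline, of the preference-modeling step: human labelers pick the better of two answers, and a reward model is trained so the preferred answer scores higher; that learned reward then guides further fine-tuning of the base model.

```python
# Minimal sketch of the preference-modeling step behind RLHF.
# Illustrative assumption only: real systems use full language models, not a linear layer.
import torch
import torch.nn as nn

class TinyRewardModel(nn.Module):
    """Maps a pooled text embedding to a scalar reward."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        return self.score(emb).squeeze(-1)

reward_model = TinyRewardModel()
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

# Stand-in embeddings for the answer a labeler preferred vs. the one they rejected.
chosen = torch.randn(8, 64)
rejected = torch.randn(8, 64)

# Pairwise (Bradley-Terry) loss: push the chosen answer's score above the rejected one's.
loss = -torch.nn.functional.logsigmoid(
    reward_model(chosen) - reward_model(rejected)
).mean()
loss.backward()
optimizer.step()
```

All of the "1 to 10" labeling work described above ultimately feeds pipelines of roughly this shape: the quality and coverage of the human preference data determine how human the model ends up looking.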

Of course, the so-called 1 to 10 involves a massive amount of work, but the core is doing the data well. Today, even as we talk about "Agents," the essence remains data: how to abstract human operations into data that a model can understand. The same holds for multi-modality, audio, images, video, and so on, where the core is still data. It's not about piling up scattered data, but about aligning the various types of data into a whole, so that the model can reason through more steps and execute more steps: working for longer stretches, recognizing more languages, not just understanding images but generating them, and not just generating a single static picture but many frames that form a video, or even, in the future, a multi-hour movie.

Two examples illustrate the value here: one is a US-listed company called Innodata, which helps Google with data and has seen rapid revenue growth over the past two years. Another example is Scale AI, which Meta reportedly valued at over $10 billion; they also do data. Why did Meta's Llama model suddenly fall behind after version 3, unable to crack multi-modality? It was still a data problem. Whether they will catch up after recent maneuvers is something we won't speculate on.

Regarding the AI model part, there is a second issue: we see a bunch of AI application companies that gain massive momentum only to fall silent shortly after. The reason? Unclear positioning. They seem to have entered the wrong field in the wrong way. Simply put, they try to solve a 1 to 10 problem with a small sample of data and then attempt to sell it to customers who want to go from 10 to 100. If the customer has the ability to go from 10 to 100, they certainly have the ability to handle that small-sample data better themselves by just using a large model. Conversely, if the customer lacks the ability to go from 10 to 100, the rational choice for the application company is not to help them get there: if the application company really had that capability, it should simply take over the customer's business itself.

Of course, this description isn't complete, but it outlines the division of labor. Today's theme isn't just about models; I'm reporting this framework for your guidance and then, from my personal perspective, looking at what lessons it holds for embodied AI.

So, back to the topic: Embodied AI. Although both contain the word "intelligence," embodied AI differs from Large Language Models or generative AI in two very important ways, at least in terms of implementation. First is the execution phase. Since LLMs currently run entirely in the digital world, the final stage—what we call the 10 to 100 part—actually doesn't lack data and tools. Various software systems and operational processes have already solved the 10 to 100 part. LLMs aren't meant to replace this part but to use it better. When a large amount of 1 to 10 data on how humans work appears, adding an Agent mechanism allows the model to immediately use the 10 to 100 data and tools to automate work. Embodied AI is different; it truly hopes to physically mimic and replace humans in the execution phase. I will discuss this in detail. The second major difference is that generative AI has already achieved the 0 to 1 breakthrough (the Transformer). Predicting the next word is the "1," and subsequent work makes that prediction better. Embodied AI, however, has not yet broken through the "1." We haven't found the key to this intelligence yet. A few years ago, we thought LLMs were the key; later, it seemed spatial intelligence and physical AI breakthroughs were needed. In short, the 0 to 1 hasn't happened yet.

However, this doesn't seem to hinder the vigorous development of embodied AI.

Why? Because we humans already have a general conceptual framework for it. The goals are specific: for example, smart driving means driving like a human—comfortable, safe, and accident-free. For humanoid robots, it means doing housework, working on assembly lines, or other specific tasks. In computer system terms, this becomes a loosely coupled system. Regardless of how advanced the intelligence becomes, the terminal execution phase is certain. The input side will use cameras, microphones, and possibly LiDAR; the execution side needs a series of components—let's use dexterous hands as a stand-in for all of them.

Therefore, using the LLM framework as a reference, the 10 to 100 part of embodied AI indeed needs to be reconstructed, and because the KPI goals are clear, it becomes the easier part to implement. Of course, this isn't easy either, and things like dexterous hands or walking and running actually belong to this 10 to 100 part.

To some extent, although our definition and imaginative framework for embodied AI may be complete (defined by human capabilities), the future remains blurry. The certainty is far lower than that of digital-world LLMs. Much of the work we do in the execution phase now is preparation for "intelligence," and much of it isn't even aimed at traditional industrial scenarios, which today's conventional robots already handle at very low cost. Embodied AI, specifically intelligent robots, is intended for open-world scenarios and non-preset tasks. Returning to the initial question of whether the brain or the cerebellum is more important: in my view, the brain is far more important. With a brain, you can define the entire process. With a brain, many processes might change. For instance, many home appliances might only need a networked command to operate, rather than requiring a robot like the recently released 1X Neo to spend minutes struggling to close a dishwasher.

The critical variable affecting such solutions is when this "intelligence" emerges. I believe this intelligence will arrive roughly in sync with the "AGI" claimed by major model companies, or even require "AGI" as support. Regarding AGI, most people currently align with DeepMind's Demis Hassabis's prediction of five to ten years—assuming it can actually be built. OpenAI's optimistic prediction is around 2028, with the most optimistic being 2027. There is even an "AI2027" website detailing the trajectory until then; it's quite interesting for those interested.

Let's be neutral and say 2030. That's five years away, and the technical shifts during this period will be massive, especially in embodied AI. It becomes a matter of perspective. Many of our peers with mechanical engineering backgrounds might feel that dexterous hands and lead screw motors are vital. If you ask them, they might disagree with much of what I've said, viewing those parts as belonging to the 1 to 10 phase. But I come from a software and data background and have spent nearly twenty years in finance. I believe model and algorithm systems can change the logic of execution. From a financial perspective, I maintain that for a major industry, hardware has few barriers; costs will only go down, and gross margins will only shrink. In fields with high technical volatility, there is even the risk of extreme rapid depreciation due to fast iteration.

So, in the final part, I will report my viewpoint: the core of embodied AI is intelligence, followed by embodiment. It is a system of coordination between brain planning and physical execution. Under the current framework, any part that can be described with specific technical indicators is not difficult; what we lack are the parts that cannot be described with specific indicators, namely open-ended general tasks in an open world. Specifically, there is a massive lack of data: not just physical environment data (room layout, obstacles, dimensions), but data covering the entire path from planning to execution. Take cars as an example: car data is relatively easy to collect because, with or without smart driving, the car is a specific actuator, and the driver's operations are commands that the vehicle itself can record. Thus, the more cars collecting data, the better the smart-driving model becomes. Robots are different. They are meant to replace human execution, but who records the data of human execution? In fact, even though I've said the 0 to 1 for embodied AI hasn't appeared, if we had massive amounts of execution data, we might achieve acceptable results even with current domestic open-source LLMs. A clear example is grasping objects. Tesla demonstrated this: grasping different objects requires different forces and angles. Grasping an egg is different from grasping a piece of iron; grasping clothes is different again, and different fabrics vary further. This is just one very narrow scenario; there are far too many such scenarios in the open world.
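To make "data covering the entire path from planning to execution" a bit more concrete, here is a purely hypothetical record layout; the field names and structure are my own assumptions for illustration, not any company's actual data format.

```python
# A purely hypothetical schema for one planning-to-execution sample.
# All field names and units are illustrative assumptions, not any vendor's format.
from dataclasses import dataclass
from typing import List

@dataclass
class ExecutionStep:
    timestamp_s: float            # time within the episode, seconds
    joint_positions: List[float]  # commanded joint angles, radians
    gripper_force_n: float        # applied gripper force, newtons
    wrist_pose: List[float]       # end-effector pose (x, y, z + quaternion)

@dataclass
class EpisodeRecord:
    instruction: str              # natural-language task description
    scene_description: str        # room layout, obstacles, object properties
    camera_frames: List[str]      # ids of synchronized camera frames
    plan: List[str]               # sub-goals produced by the "brain"
    steps: List[ExecutionStep]    # low-level commands actually executed
    success: bool                 # whether the task was completed

# One grasping episode: an egg needs far less force than a block of iron.
egg_grasp = EpisodeRecord(
    instruction="pick up the egg without breaking it",
    scene_description="kitchen counter, egg carton open, bowl to the right",
    camera_frames=["frame_0001", "frame_0002"],
    plan=["locate egg", "approach from above", "close gripper gently", "lift"],
    steps=[ExecutionStep(0.0, [0.1, -0.4, 0.7], 1.5, [0.3, 0.1, 0.2, 0, 0, 0, 1])],
    success=True,
)
```

The point is that the instruction, the scene, the plan, and the low-level commands (including the gripper force that distinguishes an egg from a piece of iron) all have to be captured and aligned in a single record, and today almost nobody is systematically recording that.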

The question then is: How much data is needed, and how does the scaling law work here? It's still unclear. Even with digital twin technology, the efficiency of data acquisition remains low. I even have a preliminary thought: companies like Tesla might produce the first tens or hundreds of thousands of robots solely to collect data for training models. I'm almost afraid to calculate the cost of that.
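For comparison, the data question at least has an empirical answer on the LLM side. The compute-optimal ("Chinchilla") fit reported by Hoffmann et al. (2022) takes roughly the following form, where N is the number of model parameters and D the number of training tokens; whether any analogous law holds for robot execution data is, as noted above, still unknown.

```latex
% Parametric loss fit from Hoffmann et al. (2022), cited only as the LLM-side reference point.
% N = model parameters, D = training tokens; E, A, B are fitted constants; lower L is better.
L(N, D) \approx E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}},
\qquad \alpha \approx 0.34,\; \beta \approx 0.28
```

One practical consequence of that fit was the rule of thumb that training data should scale roughly in proportion to parameters, on the order of twenty tokens per parameter. Embodied AI has no comparable rule of thumb yet, which is precisely the uncertainty I am pointing at.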

This likely means that robot model companies might have a higher entry barrier than current LLM companies. There is a mountain of 1 to 10 work to be done. If so, when we talk about investment opportunities, we should look at the path of LLMs. While everyone talks about NVIDIA, the truly meaningful investment opportunities currently come from companies providing supporting services for model training—whether selling hardware, building data centers, or providing data services.

Furthermore, robot model training might differ significantly from LLM training. Once an LLM is trained, it is open to users, and latecomers can use various technical means to extract enough data from public models to lower their own training costs, a practice commonly known as distillation. Robot models, however, are likely impossible to distill. This means the cost for every player might be similar, making the value of data services even greater. In the 1 to 10 phase, integrated solutions from software to hardware to data acquisition become particularly important and represent core competitiveness and irreplaceability. This isn't just about adding sensors to a dexterous hand; it requires control chips, communication chips, and even storage chips with a certain amount of computing power. And that's just one part. Others, like cameras and microphones, have data acquisition and processing logic that is far more complex in an embodied environment than in a non-embodied one, with higher computational overhead. Naturally, the value of integrating hardware, software, and services will rise accordingly.
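For those less familiar with it, here is a minimal sketch of what distillation means technically, in its generic textbook form. This is an illustration only; in practice latecomers often work from sampled outputs of a public model rather than full probability distributions. The idea is that a smaller student model is trained to match a teacher model's output distribution.

```python
# Minimal sketch of knowledge distillation (generic textbook form, for illustration only).
# A student is trained to match a teacher's softened output distribution;
# this only works when the teacher's outputs can be queried at scale.
import torch
import torch.nn.functional as F

vocab, dim, temperature = 1000, 64, 2.0
teacher = torch.nn.Linear(dim, vocab)   # stand-in for a large, frozen public model
student = torch.nn.Linear(dim, vocab)   # the smaller model being trained
optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)

x = torch.randn(16, dim)                # stand-in inputs (prompts, contexts, ...)
with torch.no_grad():
    teacher_probs = F.softmax(teacher(x) / temperature, dim=-1)

student_log_probs = F.log_softmax(student(x) / temperature, dim=-1)
# Distillation loss: KL divergence between teacher and student distributions,
# scaled by temperature^2 as in the standard recipe.
loss = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * temperature**2
loss.backward()
optimizer.step()
```

The reason this works for LLMs is that the "teacher" sits behind a public interface that can be queried cheaply and at scale. There is no equivalent public teacher whose physical execution data can be pulled out this way, which is why the data moat around robot models, and the value of the services that build it, looks so much stronger.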
