World's Top Researcher on AI, LLMs, and Robot Intelligence
Sergey Levine, co-founder of Physical Intelligence, explains the company's bet on building general-purpose robotic foundation models that can control any physical robot to perform any task, rather than narrow, single-purpose machines. In contrast to the Boston Dynamics approach of spectacular demos, Levine argues that true progress requires building systems that generalize across embodiments and tasks, learning from diverse data sources the way language models learn from internet text. The conversation covers why this "harder" approach may actually be easier in the long run, the role of multimodal LLMs in giving robots common sense, and what it would take for robotics to experience the same Cambrian explosion that personal computers sparked in computing.
Key takeaways
- General-purpose foundation models are more efficient than specialized single-task robots because they leverage data from many sources and learn transferable physical understanding, much as general language models outperformed domain-specific translation and sentiment-analysis systems.
- Robots can now be coached and improved through language alone: if a robot fails at a task, you can label its experience with semantic commands (e.g., "pick up the plate") rather than collecting new teleoperation data, shifting the bottleneck from low-level motor control to mid-level reasoning.
- Moravec's Paradox explains why intuitive physical tasks (picking up objects, changing diapers) are harder for robots than math problems: humans are evolutionarily primed for physical interaction, so we underestimate the engineering challenge. Machine learning inverts this once training data is available.
- The key innovation isn't hardware or demos; it's generality of improvement, meaning systems that can be enhanced autonomously from their own experience rather than through manual engineering, which unlocks rapid iteration and adaptation to new embodiments and tasks.
- Companies preparing for robotics should focus on understanding what kind of data is needed (not just collecting any video) and whether their deployment model relies more on teleoperation demonstrations or autonomous learning; the technology roadmap looks very different depending on this choice.
- Multimodal LLMs don't directly control robots, but they enable robots to reason about novel situations using web-scale knowledge when plugged in via chain-of-thought prompting, addressing the long-standing problem of how robots handle edge cases and uncommon scenarios.
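The relabeling idea in the second takeaway can be sketched in a few lines. This is a hypothetical schema (the `Episode` class, field names, and `relabel` function are illustrative assumptions, not Physical Intelligence's actual API): a recorded trajectory that failed at its original task is tagged after the fact with a language command describing what the robot actually did, turning the episode into useful training data without new teleoperation.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Episode:
    """A recorded robot trajectory. Hypothetical schema for illustration."""
    observations: List[str]  # e.g. camera frames (stand-ins here)
    actions: List[str]       # low-level motor commands (stand-ins here)
    instruction: Optional[str] = None  # semantic label, assigned in hindsight


def relabel(episode: Episode, achieved_outcome: str) -> Episode:
    """Hindsight relabeling: treat what the robot actually did as if it had
    been commanded in language, so a 'failed' episode still trains the model."""
    return Episode(episode.observations, episode.actions,
                   instruction=achieved_outcome)


# A robot asked to "set the table" instead just grasps a stray plate.
# The low-level experience is still valuable once relabeled:
raw = Episode(observations=["frame_0", "frame_1"], actions=["reach", "grasp"])
labeled = relabel(raw, "pick up the plate")
print(labeled.instruction)  # → pick up the plate
```

The design point is that the expensive part (collecting motor-control data) is reused as-is, and only the cheap part (a language label) changes, which is what shifts the bottleneck from teleoperation to mid-level semantic supervision.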