
Why Scale Will Not Solve AGI | Vishal Misra - The a16z Show

4 products mentioned · Host: a16z
Topics: large language models, artificial general intelligence, Bayesian inference, causal modeling, continual learning, deep learning architecture, machine learning theory

Vishal Misra discusses his mathematical framework for understanding how large language models work, arguing that LLMs perform Bayesian inference rather than simple pattern matching. Through a series of papers introducing the concept of a "Bayesian wind tunnel," Misra demonstrates that current architectures cannot achieve true AGI without two critical additions: continual learning with plasticity and the ability to build causal models rather than relying solely on correlations.

Key takeaways
  • LLMs can be modeled as a massive sparse matrix whose rows are prompts and whose columns are possible next tokens, with each row holding a probability distribution over the next token; in-context learning then becomes Bayesian posterior updating over that matrix (a toy sketch of this posterior-update view follows the list).
  • Transformers perform precise Bayesian inference when trained on tasks where memorization is impossible, matching the analytically computed posterior to within 10^-3 bits, which shows the mechanism is architectural rather than a byproduct of memorized data.
  • Current deep learning operates in the Shannon-entropy world (learning correlations) rather than the Kolmogorov-complexity world (finding shortest programs), which explains why LLMs cannot independently discover new scientific frameworks like Einstein's theory of relativity (the entropy-versus-compression sketch after this list illustrates the contrast).
  • Scale alone will not solve AGI; instead, two fundamental capabilities are needed: plasticity through continual learning (humans retain learning across time while frozen LLM weights reset each session) and causal modeling enabling simulation and intervention, not just prediction.
  • Human brains perform both Bayesian inference and causal simulation, allowing real-time learning and the ability to mentally model interventions, whereas current LLMs can only approximate correlations within their trained manifold without generating entirely new representations.
  • Misra's Bayesian wind tunnel approach (testing architectures on tasks where the true posterior can be calculated analytically) provides a rigorous methodology for measuring whether models genuinely perform Bayesian reasoning rather than memorization; a minimal wind-tunnel-style check is sketched after this list.
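
The posterior-update view in the first takeaway can be made concrete in a few lines of code. The sketch below is illustrative only, not Misra's actual construction: the two-token alphabet, the hypothesis set, and all probabilities are invented for the example. Each hypothesis is a candidate next-token distribution; conditioning on the prompt reweights the hypotheses by Bayes' rule, so the predicted next-token distribution (one "row" of the matrix) is a posterior-weighted mixture.

```python
import numpy as np

# Hypothetical two-token alphabet and candidate generating rules
# (all values invented for illustration).
TOKENS = ["a", "b"]
HYPOTHESES = {
    "mostly_a": {"a": 0.9, "b": 0.1},
    "mostly_b": {"a": 0.1, "b": 0.9},
    "uniform":  {"a": 0.5, "b": 0.5},
}

def posterior(context):
    """Bayes' rule: P(h | context) is proportional to P(h) * prod_t P(token_t | h)."""
    names = list(HYPOTHESES)
    weights = np.full(len(names), 1.0 / len(names))  # uniform prior over rules
    for tok in context:
        weights *= np.array([HYPOTHESES[h][tok] for h in names])
    return dict(zip(names, weights / weights.sum()))

def predictive(context):
    """One 'matrix row': posterior-weighted mixture over next tokens."""
    post = posterior(context)
    return {t: sum(w * HYPOTHESES[h][t] for h, w in post.items()) for t in TOKENS}

print(predictive([]))               # prior predictive: 0.5 / 0.5
print(predictive(["a", "a", "a"]))  # posterior sharpens toward "mostly_a"
```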
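The wind-tunnel methodology can likewise be sketched on a task with a closed-form answer. In the toy below, a coin-flip task has an analytic Beta-Bernoulli posterior predictive, so any model's next-token distribution can be scored against it exactly, in bits, via KL divergence. Here `toy_model` is a stand-in estimator (Laplace add-one smoothing), not a trained transformer; this shows the scoring machinery rather than the papers' experiments.

```python
import math

def true_predictive(context, alpha=1.0, beta=1.0):
    """Analytic Beta-Bernoulli posterior predictive:
    P(next = 1 | context) = (alpha + k) / (alpha + beta + n)."""
    n, k = len(context), sum(context)
    p1 = (alpha + k) / (alpha + beta + n)
    return {0: 1.0 - p1, 1: p1}

def toy_model(context):
    """Stand-in 'model' (Laplace add-one smoothing), NOT a transformer."""
    n, k = len(context), sum(context)
    p1 = (k + 1) / (n + 2)
    return {0: 1.0 - p1, 1: p1}

def kl_bits(p, q):
    """KL(p || q) in bits: gap between the model and the true posterior."""
    return sum(p[t] * math.log2(p[t] / q[t]) for t in p if p[t] > 0)

context = [1, 1, 0, 1]  # observed coin flips fed in as the prompt
print(kl_bits(true_predictive(context), toy_model(context)))  # 0.0
```

With a uniform Beta(1, 1) prior the Laplace rule coincides with the exact posterior predictive, so the toy gap is zero bits; a real test substitutes a trained model and asks whether the gap stays near the 10^-3-bit level quoted above.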
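Finally, the Shannon-versus-Kolmogorov contrast can be illustrated with two bit strings that have the same symbol statistics but very different shortest descriptions. Compressed size is used below as a crude, computable proxy for Kolmogorov complexity (which is itself uncomputable); the strings and lengths are arbitrary choices for the example.

```python
import math, random, zlib
from collections import Counter

def unigram_entropy_bits(s):
    """Empirical per-symbol Shannon entropy of a string."""
    counts, n = Counter(s), len(s)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

patterned = "01" * 5000                                   # shortest program: repeat "01"
random.seed(0)
noisy = "".join(random.choice("01") for _ in range(10000))

for name, s in [("patterned", patterned), ("random", noisy)]:
    print(f"{name}: {unigram_entropy_bits(s):.3f} bits/symbol, "
          f"{len(zlib.compress(s.encode()))} bytes compressed")
# Correlation-level statistics look identical (~1 bit/symbol for both); only
# the search for a short program (the Kolmogorov view) exposes the patterned
# string's structure, which shows up here as a tiny compressed size.
```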

Recommendations (1)

GPT-3

"5 years ago when GPT-3 was first released, I got early access to it and I started playing with it and I was trying to solve a problem related to querying a cricket database."

Vishal Misra · ▶ 1:04

Mentioned (3)

Claude
"Anthropic makes great products. Claude Code is fantastic."
ESPN
"we deployed this in production at ESPN in September '21" ▶ 1:55
ChatGPT
"GPT-4, for instance, ChatGPT, the first version, had a context window of 8,000 tokens" ▶ 6:27