Latent Space: The AI Engineer Podcast

Available Episodes

5 of 167

World Models & General Intuition: Khosla's largest bet since LLMs & OpenAI
From building Medal into a 12M-user game clipping platform with 3.8B highlight moments to turning down a reported $500M offer from OpenAI (https://www.theinformation.com/articles/openai-offered-pay-500-million-startup-videogame-data) and raising a $134M seed from Khosla (https://techcrunch.com/2025/10/16/general-intuition-lands-134m-seed-to-teach-agents-spatial-reasoning-using-video-game-clips/) to spin out General Intuition, Pim is betting that world models trained on peak human gameplay are the next frontier after LLMs. We sat down with Pim to dig into why game highlights are “episodic memory for simulation” (and how Medal’s privacy-first action labels became a world-model goldmine https://medal.tv/blog/posts/enabling-state-of-the-art-security-and-protections-on-medals-new-apm-and-controller-overlay-features), what it takes to build fully vision-based agents that just see frames and output actions in real time, how General Intuition transfers from games to real-world video and then into robotics, why world models and LLMs are complementary rather than rivals, what founders with proprietary datasets should know before selling or licensing to labs, and his bet that spatial-temporal foundation models will power 80% of future atoms-to-atoms interactions in both simulation and the real world. We discuss: How Medal’s 3.8B action-labeled highlight clips became a privacy-preserving goldmine for world models Building fully vision-based agents that only see frames and output actions yet play like (and sometimes better than) humans Transferring from arcade-style games to realistic games to real-world video using the same perception–action recipe Why world models need actions, memory, and partial observability (smoke, occlusion, camera shake) vs. “just” pretty video generation Distilling giant policies into tiny real-time models that still navigate, hide, and peek corners like real players Pim’s path from RuneScape private servers, Tourette’s, and reverse engineering to leading a frontier world-model lab How data-rich founders should think about valuing their datasets, negotiating with big labs, and deciding when to go independent GI’s first customers: replacing brittle behavior trees in games, engines, and controller-based robots with a “frames in, actions out” API Using Medal clips as “episodic memory of simulation” to move from imitation learning to RL via world models and negative events The 2030 vision: spatial–temporal foundation models that power the majority of atoms-to-atoms interactions in simulation and the real world — Pim X: https://x.com/PimDeWitte LinkedIn: https://www.linkedin.com/in/pimdw/ Where to find Latent Space X: https://x.com/latentspacepod Substack: https://www.latent.space/ Chapters 00:00:00 Introduction and Medal's Gaming Data Advantage 00:02:08 Exclusive Demo: Vision-Based Gaming Agents 00:06:17 Action Prediction and Real-World Video Transfer 00:08:41 World Models: Interactive Video Generation 00:13:42 From Runescape to AI: Pim's Founder Journey 00:16:45 The Research Foundations: Diamond, Genie, and SEMA 00:33:03 Vinod Khosla's Largest Seed Bet Since OpenAI 00:35:04 Data Moats and Why GI Stayed Independent 00:38:42 Self-Teaching AI Fundamentals: The Francois Fleuret Course 00:40:28 Defining World Models vs Video Generation 00:41:52 Why Simulation Complexity Favors World Models 00:43:30 World Labs, Yann LeCun, and the Spatial Intelligence Race 00:50:08 Business Model: APIs, Agents, and Game Developer Partnerships 00:58:57 From Imitation Learning to RL: Making Clips Playable 01:00:15 Open Research, Academic Partnerships, and Hiring 01:02:09 2030 Vision: 80 Percent of Atoms-to-Atoms AI Interactions
--------
--------
After LLMs: Spatial Intelligence and World Models — Fei-Fei Li & Justin Johnson, World Labs
Fei-Fei Li and Justin Johnson are cofounders of World Labs, who have recently launched Marble (https://marble.worldlabs.ai/), a new kind of generative “world model” that can create editable 3D environments from text, images, and other spatial inputs. Marble lets creators generate persistent 3D worlds, precisely control cameras, and interactively edit scenes, making it a powerful tool for games, film, VR, robotics simulation, and more. In this episode, Fei-Fei and Justin share how their journey from ImageNet and Stanford research led to World Labs, why spatial intelligence is the next frontier after LLMs, and how world models could change how machines see, understand, and build in 3D. We discuss: The massive compute scaling from AlexNet to today and why world models and spatial data are the most compelling way to “soak up” modern GPU clusters compared to language alone. What Marble actually is: a generative model of 3D worlds that turns text and images into editable scenes using Gaussian splats, supports precise camera control and recording, and runs interactively on phones, laptops, and VR headsets. Fei-fei’s essay (https://drfeifei.substack.com/p/from-words-to-worlds-spatial-intelligence) on spatial intelligence as a distinct form of intelligence from language: from picking up a mug to inferring the 3D structure of DNA, and why language is a lossy, low-bandwidth channel for describing the rich 3D/4D world we live in. Whether current models “understand” physics or just fit patterns: the gap between predicting orbits and discovering F=ma, and how attaching physical properties to splats and distilling physics engines into neural networks could lead to genuine causal reasoning. The changing role of academia in AI, why Fei-Fei worries more about under-resourced universities than “open vs closed,” and how initiatives like national AI compute clouds and open benchmarks can rebalance the ecosystem. Why transformers are fundamentally set models, not sequence models, and how that perspective opens up new architectures for world models, especially as hardware shifts from single GPUs to massive distributed clusters. Real use cases for Marble today: previsualization and VFX, game environments, virtual production, interior and architectural design (including kitchen remodels), and generating synthetic simulation worlds for training embodied agents and robots. How spatial intelligence and language intelligence will work together in multimodal systems, and why the goal isn’t to throw away LLMs but to complement them with rich, embodied models of the world. Fei-Fei and Justin’s long-term vision for spatial intelligence: from creative tools for artists and game devs to broader applications in science, medicine, and real-world decision-making. — Fei-Fei Li X: https://x.com/drfeifei LinkedIn: https://www.linkedin.com/in/fei-fei-li-4541247 Justin Johnson X: https://x.com/jcjohnss LinkedIn: https://www.linkedin.com/in/justin-johnson-41b43664 Where to find Latent Space X: https://x.com/latentspacepod Substack: https://www.latent.space/ Chapters 00:00:00 Introduction and the Fei-Fei Li & Justin Johnson Partnership 00:02:00 From ImageNet to World Models: The Evolution of Computer Vision 00:12:42 Dense Captioning and Early Vision-Language Work 00:19:57 Spatial Intelligence: Beyond Language Models 00:28:46 Introducing Marble: World Labs' First Spatial Intelligence Model 00:33:21 Gaussian Splats and the Technical Architecture of Marble 00:22:10 Physics, Dynamics, and the Future of World Models 00:41:09 Multimodality and the Interplay of Language and Space 00:37:37 Use Cases: From Creative Industries to Robotics and Embodied AI 00:56:58 Hiring, Research Directions, and the Future of World Labs
--------
--------
After LLMs: Spatial Intelligence and World Models — Fei-Fei Li & Justin Johnson, World Labs
Fei-Fei Li and Justin Johnson are cofounders of World Labs, who have recently launched Marble (https://marble.worldlabs.ai/), a new kind of generative “world model” that can create editable 3D environments from text, images, and other spatial inputs. Marble lets creators generate persistent 3D worlds, precisely control cameras, and interactively edit scenes, making it a powerful tool for games, film, VR, robotics simulation, and more. In this episode, Fei-Fei and Justin share how their journey from ImageNet and Stanford research led to World Labs, why spatial intelligence is the next frontier after LLMs, and how world models could change how machines see, understand, and build in 3D. We discuss: The massive compute scaling from AlexNet to today and why world models and spatial data are the most compelling way to “soak up” modern GPU clusters compared to language alone. What Marble actually is: a generative model of 3D worlds that turns text and images into editable scenes using Gaussian splats, supports precise camera control and recording, and runs interactively on phones, laptops, and VR headsets. Fei-fei’s essay (https://drfeifei.substack.com/p/from-words-to-worlds-spatial-intelligence) on spatial intelligence as a distinct form of intelligence from language: from picking up a mug to inferring the 3D structure of DNA, and why language is a lossy, low-bandwidth channel for describing the rich 3D/4D world we live in. Whether current models “understand” physics or just fit patterns: the gap between predicting orbits and discovering F=ma, and how attaching physical properties to splats and distilling physics engines into neural networks could lead to genuine causal reasoning. The changing role of academia in AI, why Fei-Fei worries more about under-resourced universities than “open vs closed,” and how initiatives like national AI compute clouds and open benchmarks can rebalance the ecosystem. Why transformers are fundamentally set models, not sequence models, and how that perspective opens up new architectures for world models, especially as hardware shifts from single GPUs to massive distributed clusters. Real use cases for Marble today: previsualization and VFX, game environments, virtual production, interior and architectural design (including kitchen remodels), and generating synthetic simulation worlds for training embodied agents and robots. How spatial intelligence and language intelligence will work together in multimodal systems, and why the goal isn’t to throw away LLMs but to complement them with rich, embodied models of the world. Fei-Fei and Justin’s long-term vision for spatial intelligence: from creative tools for artists and game devs to broader applications in science, medicine, and real-world decision-making. — Fei-Fei Li X: https://x.com/drfeifei LinkedIn: https://www.linkedin.com/in/fei-fei-li-4541247 Justin Johnson X: https://x.com/jcjohnss LinkedIn: https://www.linkedin.com/in/justin-johnson-41b43664 Where to find Latent Space X: https://x.com/latentspacepod Substack: https://www.latent.space/ Chapters 00:00:00 Introduction and the Fei-Fei Li & Justin Johnson Partnership 00:02:00 From ImageNet to World Models: The Evolution of Computer Vision 00:12:33 Dense Captioning and Early Vision-Language Work 00:19:39 Spatial Intelligence: Beyond Language Models 00:28:49 Introducing Marble: World Labs' First Spatial Intelligence Model 00:33:24 Gaussian Splats and the Technical Architecture of Marble 00:35:50 Physics, Dynamics, and the Future of World Models 00:44:00 Multimodality and the Interplay of Language and Space 00:37:37 Use Cases: From Creative Industries to Robotics and Embodied AI 00:57:03 Hiring, Research Directions, and the Future of World Labs
--------
--------
⚡️ 10x AI Engineers with $1m Salaries — Alex Lieberman & Arman Hezarkhani, Tenex
Alex Lieberman and Arman Hezarkani, co-founders of Tenex, reveal how they're revolutionizing software consulting by compensating AI engineers for output rather than hours—enabling some engineers to earn over $1 million annually while delivering 10x productivity gains. Their company represents a fundamental rethinking of knowledge work compensation in the age of AI agents, where traditional hourly billing models perversely incentivize slower work even as AI tools enable unprecedented speed. The Genesis: From 90% Downsizing to 10x Output The story behind 10X begins with Arman's previous company, Parthian, where he was forced to downsize his engineering team by 90%. Rather than collapse, Arman re-architected the entire product and engineering process to be AI-first—and discovered that production-ready software output increased 10x despite the massive headcount reduction. This counterintuitive result exposed a fundamental misalignment: engineers compensated by the hour are disincentivized from leveraging AI to work faster, even when the technology enables dramatic productivity gains. Alex, who had invested in Parthian, initially didn't believe the numbers until Arman walked him through why LLMs have made such a profound impact specifically on engineering as knowledge work. The Economic Model: Story Points Over Hours 10X's core innovation is compensating engineers based on story points—units of completed, quality output—rather than hours worked. This creates direct economic incentives for engineers to adopt every new AI tool, optimize their workflows, and maximize throughput. The company expects multiple engineers to earn over $1 million in cash compensation next year purely from story point earnings. To prevent gaming the system, they hire for two profiles: engineers who are "long-term selfish" (understanding that inflating story points will destroy client relationships) and those who genuinely love writing code and working with smart people. They also employ technical strategists incentivized on client retention (NRR) who serve as the final quality gate before any engineering plan reaches a client. Impressive Builds: From Retail AI to App Store Hits The results speak for themselves. In one project, 10X built a computer vision system for retail cameras that provides heat maps, queue detection, shelf stocking analysis, and theft detection—creating early prototypes in just two weeks for work that previously took quarters. They built Snapback Sports' mobile trivia app in one month, which hit 20th globally on the App Store. In a sales context, an engineer spent four hours building a working prototype of a fitness influencer's AI health coach app after the prospect initially said no—immediately moving 10X to the top of their vendor list. These examples demonstrate how AI-enabled speed fundamentally changes sales motions and product development timelines. The Interview Process: Unreasonably Difficult Take-Homes Despite concerns that AI would make take-home assessments obsolete, 10X still uses them—but makes them "unreasonably difficult." About 50% of candidates don't even respond, but those who complete the challenge demonstrate the caliber needed. The interview process is remarkably short: two calls before the take-home, review, then one or two final meetings—completable in as little as a week. A signature question: "If you had infinite resources to build an AI that could replace either of us on this call, what would be the first major bottleneck?" The sophisticated answer isn't just "model intelligence" or "context length"—it's controlling entropy, the accumulating error rate that derails autonomous agents over time. The Limiting Factor: Human Capital, Not Technology Despite being an AI-first company, 10X's primary constraint is human capital—finding and hiring enough exceptional engineers fast enough, then matching them with the right processes to maintain delivery quality as they scale. The company has ambitions beyond consulting to build their own technology, but for the foreseeable future, recruiting remains the bottleneck. This reveals an important insight about the AI era: even as technology enables unprecedented leverage, the constraint shifts to finding people who can harness that leverage effectively. Chapters 00:00:00 Introduction and Meeting the 10X Co-founders 00:01:29 The 10X Moment: From Hourly Billing to Output-Based Compensation 00:04:44 The Economic Model Behind 10X 00:05:42 Story Points and Measuring Engineering Output 00:08:41 Impressive Client Projects and Rapid Prototyping 00:12:22 The 10X Tech Stack: TypeScript and High Structure 00:13:21 AI Coding Tools: The Daily Evolution 00:15:05 Human Capital as the Limiting Factor 00:16:02 The Unreasonably Difficult Interview Process 00:17:14 Entropy and Context Engineering: The Future of AI Agents 00:23:28 The MCP Debate and AI Industry Sociology 00:26:01 Consulting, Digital Transformation, and Conference Insights
--------
--------
Anthropic, Glean & OpenRouter: How AI Moats Are Built with Deedy Das of Menlo Ventures
Deedy Das, Partner at Menlo Ventures, returns to Latent Space to discuss his journey from Glean to venture capital, the explosive rise of Anthropic, and how AI is reshaping enterprise software and coding. From investing in Anthropic early on when they had no revenue to managing the $100M Ontology Fund, Das shares insider perspectives on the fastest-growing software company in history and what's next for AI infrastructure, research investing, and the future of engineering. We cover Glean’s rise from “boring” enterprise search to a $7B AI-native company, Anthropic's meteoric rise, the strategic decisions behind products like Claude Code, and why market share in enterprise AI is shifting dramatically. Das explains his investment thesis on research companies like Goodfire, Prime Intellect, and OpenRouter and how the Anthology Fund is quietly seeding the next wave of AI infra, research, and devtools. Chapters 00:00:00 Introduction and Deedy's Return to Latent Space 00:01:20 Glean's Journey: From Boring Enterprise Search to $7B Valuation 00:15:37 Anthropic's Meteoric Rise and Market Share Dynamics 00:17:50 Claude Artifacts and Product Innovation 00:41:20 The Anthology Fund: Investing in the Anthropic Ecosystem 00:48:01 Goodfire and Mechanistic Interpretability 00:51:25 Prime Intellect and Distributed AI Training 00:53:40 OpenRouter: Building the AI Model Gateway 01:13:36 The Stargate Project and Infrastructure Arms Race 01:18:14 The Future of Software Engineering and AI Coding
--------
--------

More Business podcasts

Trending Business podcasts

About Latent Space: The AI Engineer Podcast

The podcast by and for AI Engineers! In 2024, over 2 million readers and listeners came to Latent Space to hear about news, papers and interviews in Software 3.0. We cover Foundation Models changing every domain in Code Generation, Multimodality, AI Agents, GPU Infra and more, directly from the founders, builders, and thinkers involved in pushing the cutting edge. Striving to give you both the definitive take on the Current Thing down to the first introduction to the tech you'll be using in the next 3 months! We break news and exclusive interviews from OpenAI, Anthropic, Gemini, Meta (Soumith Chintala), Sierra (Bret Taylor), tiny (George Hotz), Databricks/MosaicML (Jon Frankle), Modular (Chris Lattner), Answer.ai (Jeremy Howard), et al. Full show notes always on https://latent.space

Podcast website

Business Technology Entrepreneurship

Listen to Latent Space: The AI Engineer Podcast, The Diary Of A CEO with Steven Bartlett and many other podcasts from around the world with the radio.net app

Get the free radio.net app

Stations and podcasts to bookmark
Stream via Wi-Fi or Bluetooth
Supports Carplay & Android Auto
Many other app features

Open app

Get the free radio.net app

Stations and podcasts to bookmark
Stream via Wi-Fi or Bluetooth
Supports Carplay & Android Auto
Many other app features

Latent Space: The AI Engineer Podcast

Scan code,
download the app,
start listening.