
Vanishing Gradients

Hugo Bowne-Anderson

75 episodes


    Next Level AI Evals for 2026

    2026/04/23 | 53 mins.
    There are a lot of reasons why we should do AI evals. For many companies, doing AI evals is the way to build the feedback loop into the product development lifecycle. So it is like your compass: we’re using AI evals as a compass to guide product development and product iteration. And many times we need evals to function as the pass-or-fail gate in release decisions. Whether this product is good enough for release or good enough for an experiment, evals are used for that too.
    Stella Wenxing Liu, Head of Applied Science at ASU, and Eddie Landesberg, Staff Data Scientist at Google, join Hugo to talk about why AI evaluation is evolving from “vibe checks” into a rigorous, multi-disciplinary science and how causal inference will take AI evals to the next level in 2026.

    Vanishing Gradients is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.

    They Discuss:
    * Team-Centric AI Evals, integrating product managers, data scientists, and SMEs under a “benevolent dictator” (or not!) to ensure comprehensive and effective evaluation;
    * Custom Evaluation Metrics, moving beyond generic vendor metrics to analyze raw data and identify specific failure modes, avoiding generic product outcomes;
    * AI as Policy Evaluation, framing AI evaluation as a causal inference problem to estimate counterfactual performance of new “policies” (prompts, models) and predict online A/B test outcomes;
    * Clear Product Constraints, defining what an AI product should not do with strict guardrails to prevent misuse, control costs, and avoid brand dilution;
    * Calibrated LLM Judges, statistically aligning LLM-as-a-judge with human experts using causal inference to ensure valid proxies for human welfare and business objectives;
    * Essential Data Curiosity, fostering a culture of manual data inspection to build intuition before relying on automated error analysis or agents, ensuring effective system design;
    * Statistical AI Evaluation, shifting from unit-test thinking to non-deterministic distributions, using confidence intervals and power analysis to discern genuine improvements from statistical noise (see the sketch after this list);
    * Proactive Regulatory Compliance, developing rigorous, defensible internal evaluation standards now to gain a competitive advantage as vague AI regulations move towards enforced compliance;
    * Human-Centric Benchmarking, grounding AI systems in human judgment and user values, moving beyond automated scores to build resilient and differentiated AI.
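    To make the statistical-evaluation point concrete, here is a minimal sketch of comparing two prompt variants with a bootstrap confidence interval on the difference in pass rates. This is our illustration, not code from the episode; the variant names and pass/fail data are hypothetical.

    ```python
    import random

    def bootstrap_diff_ci(scores_a, scores_b, n_boot=10_000, alpha=0.05):
        """Bootstrap CI for the difference in mean eval scores (B minus A)."""
        diffs = []
        for _ in range(n_boot):
            sample_a = random.choices(scores_a, k=len(scores_a))
            sample_b = random.choices(scores_b, k=len(scores_b))
            diffs.append(sum(sample_b) / len(sample_b) - sum(sample_a) / len(sample_a))
        diffs.sort()
        return diffs[int(n_boot * alpha / 2)], diffs[int(n_boot * (1 - alpha / 2))]

    # Hypothetical binary pass/fail results for a baseline and a candidate prompt.
    baseline = [1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1]
    candidate = [1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1]

    lo, hi = bootstrap_diff_ci(baseline, candidate)
    print(f"95% CI for pass-rate lift: [{lo:.2f}, {hi:.2f}]")
    if lo > 0:
        print("Improvement is distinguishable from noise at this sample size.")
    else:
        print("CI includes zero: run more eval cases before trusting the lift.")
    ```

    With only 20 cases per variant the interval will usually include zero, which is exactly the shift from unit-test thinking to distributions: small eval sets cannot certify small improvements.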
    You can also find the full episode on Spotify, Apple Podcasts, and YouTube.
    You can also interact directly with the transcript here in NotebookLM. If you do, let us know what you find in the comments!

    👉 Stella has just started teaching a cohort of her AI Evals and Analytics Playbook course this week. She’s kindly giving listeners of Vanishing Gradients 30% off with this link.👈
    Our flagship course Building AI Applications just wrapped its final cohort but we’re cooking up something new. If you want to be first to hear about it (and help shape what we build), drop your thoughts here.

    LINKS
    * Stella Wenxing Liu on LinkedIn
    * Eddie Landesberg on LinkedIn
    * Stella’s AI Evals & Analytics Playbook course on Maven (30% community discount)
    * CJE (Causal Judge Evaluation) package by Eddie
    * Trillion Dollar Coach
    * Goodhart’s Law
    * Upcoming Events on Luma
    * Vanishing Gradients on YouTube
    * Watch the podcast video on YouTube
    How You Can Support Vanishing Gradients
    Vanishing Gradients is a podcast, workshop series, blog, and newsletter focused on what you can build with AI right now. Over 70 episodes with expert practitioners from Google DeepMind, Netflix, Stanford, and elsewhere. Hundreds of hours of free, hands-on workshops. All independent, all free.
    If you want to help keep it going:
    * Become a paid subscriber, from $8/month
    * Share this with a builder who’d find it useful
    * Subscribe to our YouTube channel.

    Thanks for reading Vanishing Gradients! This post is public so feel free to share it.



    Get full access to Vanishing Gradients at hugobowne.substack.com/subscribe

    Privacy Theater Is Not Privacy Engineering: What It Actually Takes to Ship Safe AI

    2026/04/15 | 1h 6 mins.
    Katharine Jarmul, Privacy in ML/AI Expert & Author of Practical Data Privacy, joins Hugo to unpack why most AI privacy advice is theater, and what technical privacy actually looks like when you’re shipping LLMs, agents, and multimodal systems into the real world.
    In this episode, we dig into how to build defensible systems in an era of AI agents and multimodal models: why system prompts (and your entire agent harness!) should be considered public by default, and why “privacy observability” is as critical as data observability for anyone building with LLMs today. Multimodal is what changes the threat model: identifiers hide in images, audio, and metadata, not just text, and the old anonymization playbook doesn’t cover it.

    We Discuss:
    * No Convenience Tax, you don’t have to trade privacy for utility: high-utility AI products can be privacy-preserving through technical controls like privacy routing and input sanitization;
    * Public Prompts and Harnesses: assume any instruction or secret in a system prompt or agent harness will be exfiltrated; don’t put sensitive info there in the first place;
    * Privacy Observability, tag and track data flows so information is used only for its original intended purpose: catch design flaws before they become legal problems;
    * Technical Privacy, implement mathematical and statistical constraints directly into ML systems and data flows so privacy is measurable and enforceable, not aspirational;
    * Tiered Guardrails, a three-layer approach: deterministic filters for hard rules, algorithmic models for nuanced classification, and internal alignment training for behavioral baselines (a minimal sketch follows this list);
    * Federated Learning Is Not Privacy, model updates in FL leak sensitive data on their own: you must layer differential privacy or encrypted computation on top, or the underlying data can be reverse-engineered;
    * Anonymization Spectrum, navigate the “grayscale” of privacy in multimodal AI, balancing data utility and individual risk as identifiers hide in non-obvious places;
    * Privacy Champions, embed privacy accountability directly into development by training and incentivizing engineers inside product teams;
    * Red Teaming as Ritual, your goal is to attack yourself: practice thinking like an attacker, and turn privacy testing into an organization-wide creative ritual rather than a siloed security task.
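    To make the tiered-guardrails idea concrete, here is a minimal sketch of the first two tiers. This is our illustration, not Katharine’s code; the patterns, risky terms, and threshold are hypothetical, and in practice tier 2 would be a trained model such as Microsoft Presidio or Llama Guard (both linked below).

    ```python
    import re

    # Tier 1: deterministic filters for hard rules (hypothetical patterns).
    HARD_BLOCK_PATTERNS = [
        re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),   # US-SSN-shaped identifier
        re.compile(r"\b(?:\d[ -]?){13,16}\b"),  # credit-card-shaped number
    ]

    def tier1_deterministic(text: str) -> bool:
        """True if a hard rule fires; such inputs never reach the model."""
        return any(p.search(text) for p in HARD_BLOCK_PATTERNS)

    def tier2_classifier(text: str) -> float:
        """Risk score in [0, 1]; a stub standing in for a trained classifier."""
        risky_terms = ("password", "api key", "medical record")
        hits = sum(term in text.lower() for term in risky_terms)
        return min(1.0, hits / 2)

    def guard(text: str, threshold: float = 0.5) -> str:
        if tier1_deterministic(text):
            return "BLOCK (hard rule)"
        if tier2_classifier(text) >= threshold:
            return "REVIEW (classifier flagged)"
        # Tier 3, alignment training, sets the behavioral baseline inside
        # the model itself; it is not part of the harness code.
        return "ALLOW"

    print(guard("my ssn is 123-45-6789"))            # BLOCK (hard rule)
    print(guard("here is my password and api key"))  # REVIEW (classifier flagged)
    print(guard("what is the capital of France?"))   # ALLOW
    ```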
    You can also find the full episode on Spotify, Apple Podcasts, and YouTube.
    You can also interact directly with the transcript here in NotebookLM. If you do, let us know what you find in the comments!
    👉 Katharine is teaching her next cohort of Practical AI Privacy starting April 20. She’s kindly giving readers of Vanishing Gradients 10% off. Use this link. I’ll be taking it myself, so I hope to see you there!👈
    LINKS
    * Practical AI Privacy course on Maven (10% off with code build-with-privacy)
    * Katharine Jarmul on LinkedIn
    * Probably Private — Katharine’s website & newsletter
    * Practical Data Privacy (Katharine’s book)
    * Let’s Build an AI Privacy Router — Lightning Lesson
    * Practical AI Privacy: Agents & Local LLMs (newsletter issue)
    * A Deep Dive into Memorization in Deep Learning (kjamistan blog)
    * Microsoft Presidio
    * Llama Guard 3 8B on Hugging Face
    * Nicholas Carlini
    * From Magic to Malware: How OpenClaws Agent Skills Become an Attack Surface (1Password)
    * Owning Ethics (Metcalf, Moss, boyd — Data & Society)
    * Hugo on guardrails in LLM applications
    * Upcoming Events on Luma
    * Vanishing Gradients on YouTube
    * Watch the podcast video on YouTube




    LLM Architecture in 2026: What You Need to Know with Sebastian Raschka

    2026/04/13 | 1h 18 mins.
    If you take a model release as an anchor point, let’s say Nemotron 3 or Qwen 3.5, you can go in both directions: you can either plug them into an agent and play around with that, or you can ask: okay, what does the model look like under the hood? What are the ingredients? What type of attention mechanism do they use? What current research techniques could make that even better in the next generation of models? What can we swap out, basically? And I’m interested in both of these!

    Sebastian Raschka, Independent AI Researcher and author of Build a Large Language Model from Scratch, joins Hugo to talk about what’s changed in AI architecture, from post-training to hybrid models, and why understanding what’s under the hood matters more than ever for developers building in the agentic era. Sebastian’s upcoming book, Build a Reasoning Model from Scratch, is currently available for pre-order on Amazon and in early access on Manning!
    Vanishing Gradients is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.

    We Discuss:
    * Ed Tech for Agents: should we design educational content specifically for agentic systems, or is there a better approach?
    * Inference Scaling is the new frontier, driving “gold-level” performance during generation via parallel sampling and internal meta-judges;
    * Hybrid Architectures from Qwen 3.5 and Nemotron 3 scale almost linearly, making long-context agentic workflows significantly more affordable and performant;
    * Multi-head Latent Attention (MLA), developed by DeepSeek, wins the KV cache war by drastically reducing memory overhead without performance hits (see the back-of-envelope sketch after this list);
    * Agent Harnesses need to be continuously simplified as frontier models are post-trained on agent trajectories; teams that don’t strip back their scaffolding risk the harness getting in the way of a more capable model;
    * “AI Psychosis”: the cognitive load of supervising self-supervising agents, and why we’re all conducting an orchestra we were never trained to conduct;
    * Sebastian’s AI Stack: a surprisingly simple setup (Mac mini, Codex, Ollama) with a ~20-item QA checklist, delegating the boring work to preserve energy for creative development;
    * Fine-tuning is now an economic decision, optimizing costs and latency for high-volume tasks where long system prompts outweigh a one-time training run;
    * Process Reward Models (PRMs) are the next frontier, verifying intermediate reasoning steps to solve “hallucination in the middle” for complex math and code tasks;
    * “Implementation Does Not Lie”: Sebastian’s layer-by-layer verification philosophy, comparing from-scratch builds against HuggingFace references to catch details invisible in papers;
    * Architecture Details dictate inference stack choices; nuances like RMSNorm stability or RoPE flavors are critical for optimal performance and troubleshooting;
    * The Distillation Loop drives open-weight parity, enabling specialized, “frontier-class” models by “pre-digesting” frontier outputs without multi-million dollar training risks.
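    To put rough numbers on the MLA bullet above, here is a back-of-envelope sketch. The dimensions are hypothetical round numbers chosen for illustration, not the configuration of any released model.

    ```python
    def mha_kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_val=2):
        # Standard attention caches K and V: two vectors of kv_heads * head_dim
        # values per token, per layer, at 2 bytes each in fp16/bf16.
        return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_val

    def mla_cache_bytes(layers, latent_dim, seq_len, bytes_per_val=2):
        # MLA caches one compressed latent vector per token per layer instead;
        # latent_dim=512 is illustrative, not DeepSeek's exact figure.
        return layers * latent_dim * seq_len * bytes_per_val

    # Hypothetical long-context setup: 60 layers, 48 KV heads, 128-dim heads,
    # 128k-token context.
    mha = mha_kv_cache_bytes(layers=60, kv_heads=48, head_dim=128, seq_len=128_000)
    mla = mla_cache_bytes(layers=60, latent_dim=512, seq_len=128_000)

    print(f"MHA KV cache:     {mha / 1e9:.0f} GB")  # 189 GB
    print(f"MLA latent cache: {mla / 1e9:.0f} GB")  # 8 GB
    ```

    The ratio, not the absolute numbers, is the point: compressing per-token state is what makes long-context agentic workflows affordable.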
    You can also find the full episode on Spotify, Apple Podcasts, and YouTube.
    You can also interact directly with the transcript here in NotebookLM. If you do, let us know what you find in the comments!
    Links and Resources
    * Build a Reasoning Model (From Scratch): Sebastian’s new book, currently available for pre-order on Amazon and in early access on Manning. You’ll learn how reasoning LLMs actually work by starting with a pre-trained base LLM and adding reasoning capabilities step by step in code. A hands-on follow-up to Build a Large Language Model from Scratch.
    * LLM Architecture Gallery: Sebastian’s collection of architecture figures and fact sheets from his blog posts, updated with each major model release. A go-to visual reference for comparing what’s changed under the hood across model generations.
    * Sebastian Raschka on LinkedIn
    * Sebastian’s website
    * Ahead of AI (Sebastian’s Substack)
    * Build a Large Language Model from Scratch
    * PinchBench: OpenClaw Benchmark Leaderboard
    * DeepSeekMath-V2: Towards Self-Verifiable Mathematical Reasoning
    * Gated Delta Networks: Improving Mamba2 with Delta Rule (ICLR 2025)
    * DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
    * Hugging Face Model Hub
    * Upcoming Events on Luma
    * Vanishing Gradients on YouTube
    A Bit More on Agent Harnesses
    * Components of A Coding Agent by Sebastian
    * How To Build An Agent that Builds its own Harness by Hugo and Ivan Leo (Google DeepMind, ex-Manus)
    * Build Your Own Deep Research Agent with Hugo & Ivan Leo (Google DeepMind, ex-Manus): In this livestream, you’ll learn how to build a production-grade agent harness from scratch in pure Python;
    * AI Agent Harness, 3 Principles for Context Engineering, and the Bitter Lesson Revisited with Lance Martin (Anthropic), Duncan Gilchrist (Delphina), and Hugo
    * The Post-Coding Era: What Happens When AI Writes the System? with Nicholas Moy (Google DeepMind), Duncan Gilchrist (Delphina), and Hugo
    * What is an Agent Harness? from What 300+ Engineers from Netflix, Amazon, and Instacart Asked About AI Engineering.




    Episode 72: Why Agents Solve the Wrong Problem (and What Data Scientists Do Instead)

    2026/03/20 | 1h 33 mins.
    I often see what I would consider to be b******t evals, especially in data, like “write this dumb SQL.” Almost every one of these dumb SQL questions that I’ve seen in benchmarks is either obviously easy or overwhelmingly adversarial. They just don’t feel valuable as a data scientist; it’s something that you would probably never ask a real data scientist to do. So I went out of my way to create real ones. Let me read one to you.
    Bryan Bischof, Head of AI at Theory Ventures, joins Hugo to talk about what happened when 150 people spent six hours using AI agents to answer real data science questions across SQL tables, log files, and 750,000 PDFs.
    They Discuss:
    * Failure Funnels, pinpoint where agent reasoning breaks down using causal-chain binary evaluations instead of vague 1-5 scales (a minimal sketch follows this list);
    * Median Score: 23 out of 65, what happened when world-class engineers turned agents loose on real data work, and why general-purpose coding agents with human prodding beat fancy frameworks;
    * Zero-Cost Submissions Kill Trust, without a penalty for wrong answers, agents hill-climb to correct submissions through brute force instead of building confidence;
    * Data Science is “Zooming”, moving beyond binary decisions to iterative problem framing, refining “does our inventory suck?” into a tractable hypothesis;
    * MCP as Semantic Layer, model your organization’s proprietary knowledge once and distribute it to whatever LLM interface your team prefers;
    * The Subagent vs. Tool Debate, a distinction that adds cognitive load without hiding complexity;
    * Self-Orchestration Gap, agents don’t yet realize they should trigger specialized extraction frameworks like DocETL instead of reading 750K PDFs one by one;
    * The Future of Evals, from vibe checks to objective functions and continuous user feedback that lets systems converge on reliability.
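    To illustrate the failure-funnel idea from the first bullet, here is a minimal sketch of causal-chain binary evals. This is our illustration, not Bryan’s harness; the step names and traces are hypothetical.

    ```python
    from collections import Counter

    # An ordered causal chain of binary checks: the first failing step is
    # where the agent's reasoning broke down.
    FUNNEL_STEPS = [
        "found_relevant_table",
        "wrote_valid_sql",
        "query_returned_data",
        "interpretation_correct",
    ]

    def first_failure(trace: dict) -> str:
        for step in FUNNEL_STEPS:
            if not trace.get(step, False):
                return step
        return "success"

    # Three hypothetical agent runs on the same task.
    traces = [
        {"found_relevant_table": True, "wrote_valid_sql": True,
         "query_returned_data": True, "interpretation_correct": False},
        {"found_relevant_table": True, "wrote_valid_sql": False},
        {"found_relevant_table": True, "wrote_valid_sql": True,
         "query_returned_data": True, "interpretation_correct": True},
    ]

    print(Counter(first_failure(t) for t in traces))
    # Counter({'interpretation_correct': 1, 'wrote_valid_sql': 1, 'success': 1})
    ```

    Aggregating first failures across many runs tells you which stage to fix, which a single 1-5 quality score cannot.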
    You can also find the full episode on Spotify, Apple Podcasts, and YouTube.
    You can also interact directly with the transcript here in NotebookLM. If you do, let us know what you find in the comments!
    👉 Want to learn more about Building AI-Powered Software? Check out our Building AI Applications course. It’s a live cohort with hands-on exercises and office hours. Our final cohort has started, but registration is still open, and all sessions are recorded, so don’t worry about having missed any. Here is a 25% discount code for readers. 👈
    LINKS
    * Bryan Bischof on Twitter/X
    * Bryan Bischof on LinkedIn
    * Theory Ventures
    * The Hunt for a Trustworthy Data Agent (blog post)
    * America’s Next Top Modeler GitHub repo
    * Hamel’s evals FAQ: How do I evaluate agentic workflows?
    * DocETL
    * LLM Judges and AI Agents at Scale (Hugo’s podcast with Shreya Shankar)
    * When Your Metrics Are Lying (Cimo Labs)
    * Lessons from a Year of Building with LLMs (livestream on YouTube)
    * Bryan Bischof: The Map is Not the Territory (YouTube)
    * Upcoming Events on Luma
    * Vanishing Gradients on YouTube
    * Watch the podcast video on YouTube


    Episode 71: Durable Agents - How to Build AI Systems That Survive a Crash with Samuel Colvin

    2026/02/18 | 51 mins.
    Our thesis is that AI is still just engineering… those people who tell us, for fun and profit, that somehow AI is so, so profound, so new, so different from anything that’s gone before that it somehow eclipses the need for good engineering practice are wrong. We still need that good engineering practice, and for the most part, most things are not new. But there are some things that have become more important with AI. One of those is durability.
    Samuel Colvin, Creator of Pydantic AI, joins Hugo to talk about applying battle-tested software engineering principles to build durable and reliable AI agents.
    They Discuss:
    * Production agents require engineering-grade reliability: Unlike messy coding agents, production agents need high constraint, reliability, and the ability to perform hundreds of tasks without drifting into unusual behavior;
    * Agents are the new “quantum” of AI software: Modern architecture uses discrete “agentlets”: small, specialized building blocks stitched together for sub-tasks within larger, durable systems;
    * Stop building “chocolate teapot” execution frameworks: Ditch rudimentary snapshotting; use battle-tested durable execution engines like Temporal for robust retry logic and state management;
    * AI observability will be a native feature: In five years, AI observability will be integrated, with token counts and prompt traces becoming standard features of all observability platforms;
    * Split agents into deterministic workflows and stochastic activities: Ensure true durability by isolating deterministic workflow logic from stochastic activities (IO, LLM calls) to cache results and prevent redundant model calls (see the toy sketch after this list);
    * Type safety is essential for enterprise agents: Sacrificing type safety for flexible graphs leads to unmaintainable software; professional AI engineering demands strict type definitions for parallel node execution and state recovery;
    * Standardize on OpenTelemetry for portability: Use OpenTelemetry (OTel) to ensure agent traces and logs are portable, preventing vendor lock-in and integrating seamlessly into existing enterprise monitoring.
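    Here is a toy sketch of that workflow/activity split (illustrating the principle only, not Temporal’s or Pydantic AI’s actual APIs). Stochastic activity results are memoized by input, so replaying the deterministic workflow after a crash never re-invokes the model.

    ```python
    import hashlib

    activity_log: dict[str, str] = {}  # in production: a durable store

    def run_activity(name: str, payload: str, fn) -> str:
        """Run a stochastic activity once; replays read the recorded result."""
        key = hashlib.sha256(f"{name}:{payload}".encode()).hexdigest()
        if key not in activity_log:
            activity_log[key] = fn(payload)  # only called on first execution
        return activity_log[key]

    def call_llm(prompt: str) -> str:
        # Placeholder for a real (stochastic, billable) model call.
        return f"summary of: {prompt}"

    def workflow(doc: str) -> str:
        # Deterministic logic only: every side effect goes through
        # run_activity, so this function can be safely re-run from the top.
        summary = run_activity("summarize", doc, call_llm)
        verdict = run_activity("review", summary, call_llm)
        return verdict

    print(workflow("quarterly report"))
    # Simulate crash-and-replay: cached results are reused, no new model calls.
    print(workflow("quarterly report"))
    ```

    Durable execution engines like Temporal provide this property for real: the workflow is replayed from history while recorded activity results are read back instead of re-executed.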
    You can also find the full episode on Spotify, Apple Podcasts, and YouTube.
    You can also interact directly with the transcript here in NotebookLM. If you do, let us know what you find in the comments!

    LINKS
    * Samuel Colvin on LinkedIn
    * Pydantic
    * Pydantic Stack Demo repo
    * Deep research example code
    * Temporal
    * DBOS (a Postgres-based alternative to Temporal)
    * Upcoming Events on Luma
    * Vanishing Gradients on YouTube
    * Watch the podcast video on YouTube
    👉Want to learn more about Building AI-Powered Software? Check out our Building AI Applications course. It’s a live cohort with hands-on exercises and office hours. Our final cohort starts March 10, 2026. Here is a 25% discount code for listeners.👈
    https://maven.com/hugo-stefan/building-ai-apps-ds-and-swe-from-first-principles?promoCode=vgfs




About Vanishing Gradients

A podcast for people who build with AI. Long-format conversations with people shaping the field about agents, evals, multimodal systems, data infrastructure, and the tools behind them. Guests include Jeremy Howard (fast.ai), Hamel Husain (Parlance Labs), Shreya Shankar (UC Berkeley), Wes McKinney (creator of pandas), Samuel Colvin (Pydantic) and more. hugobowne.substack.com
Podcast website
