

ThursdAI - Jan 8 - Vera Rubin's 5x Jump, Ralph Wiggum Goes Viral, GPT Health Launches & XAI Raises $20B Mid-Controversy
2026/1/08 | 1h 46 mins.
Hey folks, Alex here from Weights & Biases, with your weekly AI update (and the first live show of this year!) For the first time, we had a co-host of the show also be a guest on the show: Ryan Carson (from Amp) went supernova viral this week with an X article (1.5M views) about Ralph Wiggum (yeah, from The Simpsons), and he broke down that agentic coding technique at the end of the show. LDJ and Nisten helped cover NVIDIA's incredible announcements during CES with their upcoming Vera Rubin platform (4-5x improvements), and we all got excited about AI medicine with ChatGPT officially going into health! Plus a bunch of open source news. Let's get into this:

Open Source: The "Small" Models Are Winning

We often talk about the massive frontier models, but this week, open source came largely from unexpected places and focused on efficiency, agents, and specific domains.

Solar Open 100B: A Data Masterclass

Upstage released Solar Open 100B, and it's a beast. It's a 102B parameter Mixture-of-Experts (MoE) model, but thanks to MoE magic, it only uses about 12B active parameters during inference. This means it punches far above its weight while still running fast.

What I really appreciated here wasn't just the weights, but the transparency. They released a technical report detailing their "Data Factory" approach. They trained on nearly 20 trillion tokens, with a huge chunk being synthetic. They also used a dynamic curriculum that adjusted the difficulty and the ratio of synthetic data as training progressed. This transparency is what pushes the whole open source community forward.

Technically, it hits 88.2 on MMLU and competes with top-tier models, especially in Korean language tasks. You can grab it on Hugging Face.

MiroThinker 1.5: The DeepSeek Moment for Agents?

We also saw MiroThinker 1.5, a 30B parameter model that is challenging the notion that you need massive scale to be smart. It uses something they call "Interactive Scaling."

Wolfram broke this down for us: this agent forms hypotheses, searches for evidence, and then iteratively revises its answers in a time-sensitive sandbox. It effectively "thinks" before answering. The result? It beats trillion-parameter models on search benchmarks like BrowseComp. It's significantly cheaper to run, too. This feels like the year where smaller models + clever harnesses (harnesses are the software wrapping the model) will outperform raw scale.

Liquid AI LFM 2.5: Running on Toasters (Almost)

We love Liquid AI and they are great friends of the show. They announced LFM 2.5 at CES with AMD, and these are tiny ~1B parameter models designed to run on-device. We're talking about running capable AI on your laptop, your phone, or edge devices (or the Reachy Mini bot that I showed off during the show! I gotta try and run LFM on him!)

Probably the coolest part is the audio model. Usually, talking to an AI involves a pipeline: Speech-to-Text (ASR) -> LLM -> Text-to-Speech (TTS). Liquid's model is end-to-end. It hears audio and speaks audio directly. We watched a demo from Maxime Labonne where the model was doing real-time interaction, interleaving text and audio. It's incredibly fast and efficient.
While it might not write a symphony for you, for on-device tasks like summarization or quick interactions, this is the future.

NousCoder-14B and Zhipu AI IPO

A quick shoutout to our friends at Nous Research, who released NousCoder-14B, an open-source competitive programming model that achieved a 7% jump on LiveCodeBench accuracy in just four days of RL training on 48 NVIDIA B200 GPUs. The model was trained on 24,000 verifiable problems, and the lead researcher Joe Li noted it achieved in 4 days what took him 2 years as a teenager competing in programming contests. The full RL stack is open-sourced on GitHub, and Nous published a great WandB results page as well!

And in historic news, Zhipu AI (Z.ai)—the folks behind the GLM series—became the world's first major LLM company to IPO, raising $558 million on the Hong Kong Stock Exchange. Their GLM-4.7 currently ranks #1 among open-source and domestic models on both Artificial Analysis and LM Arena. Congrats to them!

Big Companies & APIs

NVIDIA CES: Vera Rubin Changes Everything

LDJ brought the heat on this one, covering Jensen's CES keynote that unveiled the Vera Rubin platform, and the numbers are almost hard to believe. We're talking about a complete redesign across six chips: the Rubin GPU delivering 50 petaFLOPS of AI inference (5x Blackwell), the Vera CPU with 88 custom Olympus ARM cores, NVLink 6, ConnectX-9 SuperNIC, BlueField-4 DPU, and Spectrum-6 Ethernet.

Let me put this in perspective using LDJ's breakdown: if you look at FP8 performance, the jump from Hopper to Blackwell was about 5x. The jump from Blackwell to Vera Rubin is over 3x again—but here's the kicker—while only adding about 200 watts of power draw. That's an insane efficiency improvement.

The real-world implications Jensen shared: training a 10 trillion parameter mixture-of-experts model now requires 75% fewer GPUs compared to Blackwell. Inference token costs drop roughly 10x—a 1MW cluster goes from 1 million to 10 million tokens per second at the same power. HBM4 memory delivers 22 TB/s bandwidth with 288GB capacity, exceeding NVIDIA's own 2024 projections by nearly 70%.

As Ryan noted, when people say there's an AI bubble, this is why it's hilarious. Jensen keeps saying the need for inference is unbelievable and only going up exponentially. We all see this. I can't get enough inference—I want to spin up 10 Ralphs running concurrently! The NVL72 rack-scale system achieves 3.6 exaFLOPS inference with 20.7TB total HBM, and it's already shipping. Runway 4.5 is already running on the new platform, having ported their model from Hopper to Vera Rubin NVL72 in a single day.

NVIDIA also recently acqui-hired Groq (with a Q) in a ~$20 billion deal, bringing in-house the inference chip expertise of the guy who created Google's TPUs.

Nemotron Speech ASR & The Speed of Voice (X, HF, Blog)

NVIDIA also dropped Nemotron Speech ASR. This is a 600M parameter model that offers streaming transcription with 24ms latency.

We showed a demo from our friend Kwindla Kramer at Daily. He was talking to an AI, and the response was virtually instant. The pipeline is: Nemotron (hearing) -> Llama/Nemotron Nano (thinking) -> Magpie TTS (speaking). The total latency is under 500ms. It feels like magic. Instant voice agents are going to be everywhere this year.

XAI Raises $20B While Grok Causes Problems (Again)

So here's the thing about covering anything Elon-related: it's impossible to separate signal from noise, because there's an army of fans who hype everything and an army of critics who hate everything.
But let me try to be objective here.

xAI raised another massive Series E of $20 billion at a $230 billion valuation, with NVIDIA and Cisco as strategic investors. The speed of their infrastructure buildout is genuinely incredible. Grok's voice mode is impressive. I use Grok for research and it's really good, notable for its unprecedented access to X!

But. This raise happened in the middle of a controversy where Grok's image model was being used to "put bikinis" on anyone in reply threads, including—and this is where I draw a hard line—minors. As Nisten pointed out on the show, it's not even hard to implement guardrails. You just put a 2B VL model in front and ask "is there a minor in this picture?" But people tested it, asked Grok not to use the feature, and it did it anyway. And yeah, putting a bikini on Claude is funny, but basic moderation is lacking! The response of "we'll prosecute illegal users" is stupid when there's no moderation built into the product. There's an enormous difference between Photoshop technically being able to do something after hours of work, and a feature that generates edited images in one second as the first comment to a celebrity, then gets amplified by the platform's algorithm to millions of people. One is a tool. The other is a product with amplification mechanics. Products need guardrails. I don't often link to CNN (in fact this is the first time) but they have a great writeup about the whole incident here, which apparently includes the resignation of a few trust and safety folks and Elon's pushback on guardrails. Crazy.

That said, Grok 5 is in training and xAI continues to ship impressive technology. I just wish they'd put the same engineering effort into safety as they do into capabilities!

OpenAI Launches GPT Health

This one's exciting. OpenAI's CEO of Applications, Fidji Simo, announced ChatGPT Health, a privacy-first space for personalized health conversations that can connect to electronic health records, Apple Health, Function Health, Peloton, and MyFitnessPal.

Here's why this matters: health already represents about 5% of all ChatGPT messages globally and touches 25% of weekly active users—often outside clinic hours or in underserved areas. People are already using these models for health advice constantly.

Nisten, who has worked on AI doctors since the GPT-3 days and even published papers on on-device medical AI, gave us some perspective: the models have been fantastic for health stuff for two years now. The key insight is that medical data seems like a lot, but there are really only about 2,000 prescription drugs and 2,000 diseases (10,000 if you count rare ones). That's nothing for an LLM. The models excel at pattern recognition across this relatively contained dataset.

The integration with Function Health is particularly interesting to me. Function does 160+ lab tests, but many doctors won't interpret them because they didn't order them. ChatGPT could help bridge that gap, telling you "hey, this biomarker looks off, you should discuss this with your doctor." The bad news is that this is just a waitlist for now. You can add yourself to the waitlist here; we'll keep monitoring the situation and let you know when it opens up.

Doctronic: AI Prescribing Without Physician Oversight

Speaking of healthcare, Doctronic launched a pilot in Utah where AI can autonomously renew prescriptions for chronic conditions without any physician in the loop. The system covers about 190 routine medications (excluding controlled substances) at just $4 per renewal.
Trial data showed 99.2% concordance with physician treatment plans, and they've secured pioneering malpractice insurance that treats the AI like a clinician.

Nisten made the case that it's ethically wrong to delay this kind of automation when ER wait times keep increasing and doctors are overworked. The open source models are already excellent at medical tasks. Governments should be buying GPUs rather than creating administrative roadblocks. Strong, strong agree here!

Google Brings Gmail into the Gemini Era (X)

Breaking news from the day of our show: Google announced Gmail's biggest AI transformation since its 2004 launch, powered by Gemini 3. This brings AI Overviews that summarize email threads, natural language queries ("Who gave me a plumber quote last year?"), Help Me Write, contextual Suggested Replies matching your writing style, and the upcoming AI Inbox that filters noise to surface VIPs and urgent items.

For 3 billion Gmail users, this is huge. I'm very excited to test it—though not live on the show, because I don't want you reading my emails.

This week's buzz - covering Weights & Biases updates

Not covered on the show, but a great update on stuff from WandB: Chris Van Pelt (@vanpelt), one of the 3 co-founders, released a great project I wanted to tell you about! For coders, this is an app that allows you to run multiple Claude Codes on free GitHub sandboxes, so you can code (or Ralph) and control everything away from home! GitHub gives personal users 120 free Codespaces hours/month, and Catnip automatically shuts down inactive instances, so you can code for quite a while with Catnip! It's fully open source on GitHub and you can download the app here.

Interview: Ryan Carson - What the hell is Ralph Wiggum?

Okay, let's talk about the character everyone is seeing on their timeline: Ralph Wiggum. My co-host Ryan Carson went viral this week with an article about this technique, and I had to have him break it down.

Ralph isn't a new model; it's a technique for running agents in a loop to perform autonomous coding. The core idea is deceptively simple: Ralph is a bash script that runs an AI coding agent in a loop until a certain condition is met. But why is it blowing up? Normally when you use a coding agent like Cursor, Claude Code, or AMP, you need to be in the loop. You approve changes, look at code, and fix things when the agent hits walls or runs out of context. Ralph solves this by letting the agent run autonomously while you sleep.

Here's how it works: First, you write a Product Requirements Doc (PRD) by talking to your agent for a few minutes about what you want to build. Then you convert that PRD into a JSON file containing atomic user stories with clear acceptance criteria. Each user story is small enough for the agent to complete in one focused thread.

The Ralph script then loops: it picks the first incomplete user story, the agent writes code to implement it, tests against the acceptance criteria, commits the changes, marks the story as complete, writes what it learned to a shared "agents.md" file, and loops to the next story. That compound learning step is crucial—without it, the agent would keep making the same mistakes.

What makes this work is the pre-work. As Ryan put it, "no real work is done one-shot." This is how software engineering has always worked—you break big problems into smaller problems into user stories and solve them incrementally. The innovation is letting AI agents work through that queue autonomously while you sleep! Ryan's excellent (and viral) X article is here!
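To make the loop concrete, here's a minimal Python sketch of a Ralph-style driver. The real Ralph is a bash script, and the `agent` CLI and the prd.json schema below are my assumptions for illustration, not Ryan's exact setup:

```python
import json
import subprocess
from pathlib import Path

PRD = Path("prd.json")     # list of atomic user stories with acceptance criteria
NOTES = Path("agents.md")  # shared learnings, re-read by the agent every iteration

while True:
    stories = json.loads(PRD.read_text())
    story = next((s for s in stories if not s.get("done")), None)
    if story is None:
        break  # every story is complete; Ralph is finished

    prompt = (
        f"Implement this user story: {story['title']}\n"
        f"Acceptance criteria: {story['criteria']}\n"
        f"Read {NOTES} first and apply past learnings. When done, commit your "
        f"changes and append anything new you learned to {NOTES}."
    )
    # One fresh agent thread per story, so context never overflows.
    subprocess.run(["agent", "run", prompt], check=False)

    # Only mark the story done if its acceptance tests actually pass.
    if subprocess.run(["pytest", story.get("tests", "tests/")]).returncode == 0:
        story["done"] = True
        PRD.write_text(json.dumps(stories, indent=2))
```

The two load-bearing ideas are visible even in this toy version: every iteration starts with a fresh context, and the agents.md file is the only memory that survives between loops.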
Vision & Video

LTX-2 Goes Fully Open Source (HF, Paper)

Lightricks finally open-sourced LTX-2, marking a major milestone as the first fully open audio-video generation model. This isn't just "we released the weights" open—it's complete model weights (13B and 2B variants), distilled versions, controllable LoRAs, a full multimodal trainer, benchmarks, and evaluation scripts. It's a video model aiming to be the open-source Sora, with support for audio and lip-sync.

The model generates synchronized audio and video in a single DiT-based architecture—motion, dialogue, ambience, and music flow simultaneously. Native 4K at up to 50 FPS, with audio, up to 10 seconds. And there's also a distilled version (thanks Pruna AI!) hosted on Replicate.

ComfyUI provided day-0 native support, and community testing shows an A6000 generating 1280x720 at 120 frames in 50 seconds. This is near Sora-level quality that you can fine-tune on your own data for custom styles and voices in about an hour.

What a way to start 2026. From chips that are 5x faster to AI doctors prescribing meds in Utah, the pace is only accelerating. If anyone tells you we're in an AI bubble, just show them what we covered today. Even if the models stopped improving tomorrow, techniques like "Ralph" prove we have years of work ahead of us just figuring out how to use the intelligence we already have.

Thank you for being a ThursdAI subscriber. See you next week!

As always, here's the show notes and TL;DR links:

* Hosts & Guests
* Alex Volkov - AI Evangelist & Weights & Biases (@altryne)
* Co-Hosts - @WolframRvnwlf, @nisten, @ldjconfirmed
* Special Guest - Ryan Carson (@ryancarson) breaking down the Ralph Wiggum technique.

* Open Source LLMs
* Solar Open 100B - Upstage's 102B MoE model. Trained on 19.7T tokens with a heavy focus on "data factory" synthetic data and high-performance Korean reasoning (X, HF, Tech Report).
* MiroThinker 1.5 - A 30B parameter search agent that uses "Interactive Scaling" to beat trillion-parameter models on search benchmarks like BrowseComp (X, HF, GitHub).
* Liquid AI LFM 2.5 - A family of 1B models designed for edge devices. Features a revolutionary end-to-end audio model that skips the ASR-LLM-TTS pipeline (X, HF).
* NousCoder-14B - Competitive coding model from Nous Research that saw a 7% LiveCodeBench accuracy jump in just 4 days of RL (X, WandB Dashboard).
* Zhipu AI IPO - The makers of GLM became the first major LLM firm to go public on the HKEX, raising $558M (Announcement).

* Big Co LLMs & APIs
* NVIDIA Vera Rubin - Jensen Huang's CES reveal of the next-gen platform. Delivers 5x Blackwell inference performance and 75% fewer GPUs needed for MoE training (Blog).
* OpenAI ChatGPT Health - A privacy-first vertical for EHR and fitness data integration (Waitlist).
* Google Gmail Era - Gemini 3 integration into Gmail for 3 billion users, featuring AI Overviews and natural language inbox search (Blog).
* XAI $20B Raise - Elon's xAI raises Series E at a $230B valuation, even as Grok faces heat over bikini-gate and safety guardrails (CNN Report).
* Doctronic - The first US pilot in Utah for autonomous AI prescription renewals without a physician in the loop (Web).
* Alexa+ Web - Amazon brings the "Smart Alexa" experience to browser-based chat (Announcement).

* Autonomous Coding & Tools
* Ralph Wiggum - The agentic loop technique for autonomous coding using small, atomic user stories. Ryan Carson's breakdown of why this is the death of "vibe coding" (Viral X Article).
* Catnip by W&B - Chris Van Pelt's open-source iOS app to run Claude Code anywhere via GitHub Codespaces (App Store, GitHub).

* Vision & Video
* LTX-2 - Lightricks open-sources the first truly open audio-video generation model with synchronized output and full training code (GitHub, Replicate Demo).
* Avatar Forcing - KAIST's framework for real-time interactive talking heads with ~500ms latency (Arxiv).
* Qwen Edit 2512 - Optimized by PrunaAI to generate high-res realistic images in under 7 seconds (Replicate).

* Voice & Audio
* Nemotron Speech ASR - NVIDIA's 600M parameter streaming model with sub-100ms stable latency for massive-scale voice agents (HF).

ThursdAI - Jan 1 2026 - Will Brown Interview + Nvidia buys Groq, Meta buys Manus, Qwen Image 2512 & Alex's New Year greetings
2026/1/01 | 29 mins.
Hey all, happy new year! This is Alex, writing to you at the very fresh start of this year. It's 2026 already, can you believe it? There was no live stream today; I figured the co-hosts deserve a break, and honestly it was a very slow week. Even the Chinese labs, who don't really celebrate X-mas and New Year's, didn't come out with a banger AFAIK.

Though I thought it was an incredible opportunity to finally post the Will Brown interview I recorded in November during the AI Engineer conference. Will is a researcher at Prime Intellect (big fans of WandB btw!) and is well known on X as a hot-takes ML person, often going viral for tons of memes! Will is the creator and maintainer of the Verifiers library (GitHub), and his talk at AI Engineer was all about RL Environments (what they are, you can hear in the interview—I asked him!)

TL;DR last week of 2025 in AI

Besides this, my job here is to keep you up to date, and honestly this was very easy this week, as… almost nothing has happened. But here we go:

Meta buys Manus

The year ended with 2 huge acquisitions / acqui-hires. First we got the news from Alex Wang that Meta has bought Manus.ai, an agentic AI startup we covered back in March, for an undisclosed amount (folks claim $2-3B). The most interesting thing here is that Manus is a Chinese company, and this deal requires very specific severance from Chinese operations.

Jensen goes on a New Year's spending spree, Nvidia buys Groq (not GROK) for $20B

Groq, which we covered often here, and who are great friends, is going to NVIDIA in a… very interesting acqui-hire: a "non-binding license" + most of Groq's top employees apparently are going to NVIDIA. Jonathan Ross, the CEO of Groq, was the co-creator of the TPU chips at Google before founding Groq, so this seems like a very strategic acqui-hire for NVIDIA! Congrats to our friends from Groq on this amazing news for the new year!

Tencent open-sources HY-MT1.5 translation models with 1.8B edge-deployable and 7B cloud variants supporting 33 languages (X, HF, HF, GitHub)

It seems that everyone is trying to dethrone Whisper, and this latest attempt from Tencent is an interesting one: 1.8B and 7B translation models with very interesting stats.

Alibaba's Qwen-Image-2512 drops on New Year's Eve as strongest open-source text-to-image model, topping AI Arena with photorealistic humans and sharper textures (X, HF, Arxiv)

Our friends at Tongyi decided to give us a New Year's present in the form of an updated Qwen-Image, with much improved realism.

That's it folks, this was a quick one. Hopefully you all had an amazing new year celebration and are gearing up for an eventful and crazy 2026. I wish you all happiness, excitement, and energy to keep up with everything in the new year, and we'll make sure that we're here to keep you up to date as always!

P.S. - I got a little news of my own yesterday, not related to AI. She said yes 🎉

🔥 Someone Trained an LLM in Space This Year (And 50 Other Things You Missed) - ThursdAI yearly recap is here!
2025/12/25 | 1h 49 mins.
Ho Ho Ho, Alex here! (a real human writing these words, this needs to be said in 2025) Merry Christmas (to those who celebrate) and welcome to the very special yearly ThursdAI recap! This was an intense year in the world of AI, and after 51 weekly episodes (this is episode 52!) we have the ultimate record of all the major and most important AI releases of this year! So instead of bringing you a weekly update (it's been a slow week so far, most AI labs are taking a well-deserved break, and the Chinese AI labs haven't yet surprised anyone), I'm dropping a comprehensive yearly AI review! Quarter by quarter, month by month, both in written form and as a pod/video!

Why do this? Who even needs this? Isn't most of it obsolete? I have asked myself this exact question while prepping for the show (it was quite a lot of prep, even with Opus's help). I eventually landed on: hey, if nothing else, this will serve as a record of the insane year of AI progress we all witnessed. Can you imagine that the term Vibe Coding is less than 1 year old? That Claude Code was released at the start of THIS year? We hedonically adapt to new AI goodies so quickly, and I figured this will serve as a point-in-time check we can come back to and feel the acceleration!

With that, let's dive in - P.S. the content below is mostly authored by my co-author for this, Opus 4.5 high, which at the end of 2025 I find the best creative writer, with the best long-context coherence, that can imitate my voice and tone (hey, I'm also on a break! 🎅)

"Open source AI has never been as hot as this quarter. We're accelerating as f*ck, and it's only just beginning—hold on to your butts." — Alex Volkov, ThursdAI Q1 2025

🏆 The Big Picture — 2025 - The Year the AI Agents Became Real

Looking back at 51 episodes and 12 months of relentless AI progress, several mega-themes emerged:

1. 🧠 Reasoning Models Changed Everything
From DeepSeek R1 in January to GPT-5.2 in December, reasoning became the defining capability. Models now think for hours, call tools mid-thought, and score perfect marks on math olympiads.

2. 🤖 2025 Was Actually the Year of Agents
We said it in January, and it came true. Claude Code launched the CLI revolution, MCP became the universal protocol, and by December we had ChatGPT Apps, Atlas browser, and AgentKit.

3. 🇨🇳 Chinese Labs Dominated Open Source
DeepSeek, Qwen, MiniMax, Kimi, ByteDance — despite chip restrictions, Chinese labs released the best open weights models all year. Qwen 3, Kimi K2, DeepSeek V3.2 were defining releases.

4. 🎬 We Crossed the Uncanny Valley
VEO3's native audio, Suno V5's indistinguishable music, Sora 2's social platform — 2025 was the year AI-generated media became indistinguishable from human-created content.

5. 💰 The Investment Scale Became Absurd
$500B Stargate, $1.4T compute obligations, $183B valuations, $100-300M researcher packages, LLMs training in space. The numbers stopped making sense.

6. 🏆 Google Made a Comeback
After years of "catching up," Google delivered Gemini 3, Antigravity, Nano Banana Pro, VEO3, and took the #1 spot (briefly). Don't bet against Google.

Q1 2025 — The Quarter That Changed Everything

DeepSeek R1 crashed NVIDIA's stock, reasoning models went mainstream, and Chinese labs took over open source. The quarter that proved AI isn't slowing down—it's just getting started.
Key Themes:
* 🧠 Reasoning models went mainstream (DeepSeek R1, o1, QwQ)
* 🇨🇳 Chinese labs dominated open source (DeepSeek, Alibaba, MiniMax, ByteDance)
* 🤖 2025 declared "The Year of Agents" (OpenAI Operator, MCP won)
* 🖼️ Image generation revolution (GPT-4o native image gen, Ghibli-mania)
* 💰 Massive infrastructure investment (Project Stargate $500B)

January — DeepSeek Shakes the World
(Jan 02 | Jan 10 | Jan 17 | Jan 24 | Jan 30)

The earthquake that shattered the AI bubble. DeepSeek R1 dropped on January 23rd and became the most impactful open source release ever:
* Crashed NVIDIA stock 17% — $560B loss, largest single-company monetary loss in history
* Hit #1 on the iOS App Store
* Cost allegedly only $5.5M to train (sparking massive debate)
* Matched OpenAI's o1 on reasoning benchmarks at 50x cheaper pricing
* The 1.5B model beat GPT-4o and Claude 3.5 Sonnet on math benchmarks 🤯

"My mom knows about DeepSeek—your grandma probably knows about it, too" — Alex Volkov

Also this month:
* OpenAI Operator — First agentic ChatGPT (browser control, booking, ordering)
* Project Stargate — $500B AI infrastructure (Manhattan Project for AI)
* NVIDIA Project Digits — $3,000 desktop that runs 200B parameter models
* Kokoro TTS — 82M param model hit #1 on TTS Arena, Apache 2, runs in browser
* MiniMax-01 — 4M context window from Hailuo
* Gemini Flash Thinking — 1M token context with thinking traces

February — Reasoning Mania & The Birth of Vibe Coding
(Feb 07 | Feb 13 | Feb 20 | Feb 28)

The month that redefined how we work with AI.

OpenAI Deep Research (Feb 6) — An agentic research tool that scored 26.6% on Humanity's Last Exam (vs 10% for o1/R1). Dr. Derya Unutmaz called it "a phenomenal 25-page patent application that would've cost $10,000+."

Claude 3.7 Sonnet & Claude Code (Feb 24-27) — Anthropic's coding beast hit 70% on SWE-Bench with 8x more output (64K tokens). Claude Code launched as Anthropic's agentic coding tool — marking the start of the CLI agent revolution.

"Claude Code is just exactly in the right stack, right around the right location... You can do anything you want with a computer through the terminal." — Yam Peleg

GPT-4.5 (Orion) (Feb 27) — OpenAI's largest model ever (rumored 10T+ parameters). 62.5% on SimpleQA, foundation for future reasoning models.

Grok 3 (Feb 20) — xAI enters the arena with 1M token context and "free until GPUs melt."

Andrej Karpathy coins "Vibe Coding" (Feb 2) — The 5.2M view tweet that captured a paradigm shift: developers describe what they want, AI handles implementation.

OpenAI Roadmap Revelation (Feb 13) — Sam Altman announced GPT-4.5 will be the last non-chain-of-thought model. GPT-5 will unify everything.

March — Google's Revenge & The Ghibli Explosion
(Mar 06 | Mar 13 | Mar 20 | Mar 27)

Gemini 2.5 Pro Takes #1 (Mar 27) — Google reclaimed the LLM crown with AIME jumping nearly 20 points, 1M context, "thinking" integrated into the core model.

GPT-4o Native Image Gen — Ghibli-mania (Mar 27) — The internet lost its collective mind and turned everything into Studio Ghibli. Auto-regressive image gen with perfect text rendering, incredible prompt adherence.

"The internet lost its collective mind and turned everything into Studio Ghibli" — Alex Volkov

MCP Won (Mar 27) — OpenAI officially adopted Anthropic's Model Context Protocol. No VHS vs Betamax situation. Tools work across Claude AND GPT.

DeepSeek V3 685B — AIME jumped from 39.6% → 59.4%, MIT licensed, best non-reasoning open model.

ThursdAI Turns 2! (Mar 13) — Two years since the first episode about GPT-4.
Open Source Highlights:
* Gemma 3 (1B-27B) — 128K context, multimodal, 140+ languages, single GPU
* QwQ-32B — Qwen's reasoning model matches R1, runs on Mac
* Mistral Small 3.1 — 24B, beats Gemma 3, Apache 2
* Qwen2.5-Omni-7B — End-to-end multimodal with speech output

Q2 2025 — The Quarter That Shattered Reality

VEO3 crossed the uncanny valley, Claude 4 arrived with 80% SWE-bench, and Qwen 3 proved open source can match frontier models. The quarter we stopped being able to tell what's real.

Key Themes:
* 🎬 Video AI crossed the uncanny valley (VEO3 with native audio)
* 🧠 Tool-using reasoning models emerged (o3 calling tools mid-thought)
* 🇨🇳 Open source matched frontier (Qwen 3, Claude 4)
* 📺 Google I/O delivered everything
* 💸 AI's economic impact accelerated ($300B valuations, 80% price drops)

April — Tool-Using Reasoners & Llama Chaos
(Apr 03 | Apr 10 | Apr 17 | Apr 24)

OpenAI o3 & o4-mini (Apr 17) — The most important reasoning upgrade ever. For the first time, o-series models can use tools during reasoning: web search, Python, image gen. Chain 600+ consecutive tool calls. Manipulate images mid-thought.

"This is almost AGI territory — agents that reason while wielding tools" — Alex Volkov

GPT-4.1 Family (Apr 14) — 1 million token context across all models. Near-perfect recall. GPT-4.5 deprecated.

Meta Llama 4 (Apr 5) — Scout (17B active/109B total) & Maverick (17B active/400B total). LMArena drama (tested model ≠ released model). Community criticism. Behemoth teased but never released.

Gemini 2.5 Flash (Apr 17) — Set "thinking budget" per API call. Ultra-cheap at $0.15/$0.60 per 1M tokens.

ThursdAI 100th Episode! 🎉

May — VEO3 Crosses the Uncanny Valley & Claude 4 Arrives
(May 01 | May 09 | May 16 | May 23 | May 29)

VEO3 — The Undisputed Star of Google I/O (May 20) — Native multimodal audio generation (speech, SFX, music synced perfectly). Perfect lip-sync. Characters understand who's speaking. Spawned the viral "Prompt Theory" phenomenon.

"VEO3 isn't just video generation — it's a world simulator. We crossed the uncanny valley this quarter." — Alex Volkov

Claude 4 Opus & Sonnet — Live Drop During ThursdAI! (May 22) — Anthropic crashed the party mid-show. First models to cross 80% on SWE-bench. Handles 6-7 hour human tasks. Hybrid reasoning + instant response modes.

Qwen 3 (May 1) — The most comprehensive open source release ever: 8 models, Apache 2.0. Runtime /think toggle for chain-of-thought. 4B dense beats Qwen 2.5-72B on multiple benchmarks. 36T training tokens, 119 languages.

"The 30B MoE is 'Sonnet 3.5 at home' — 100+ tokens/sec on MacBooks" — Nisten

Google I/O Avalanche:
* Gemini 2.5 Pro Deep Think (84% MMMU)
* Jules (free async coding agent)
* Project Mariner (browser control via API)
* Gemini Ultra tier ($250/mo)

June — The New Normal
(Jun 06 | Jun 13 | Jun 20 | Jun 26)

o3 Price Drop 90% (Jun 12) — From $40/$10 → $8/$2 per million tokens. o3-pro launched at 87% cheaper than o1-pro.

Meta's $15B Scale AI Power Play (Jun 12) — 49% stake in Scale AI. Alex Wang leads new "Superintelligence team" at Meta. Seven-to-nine-figure comp packages for researchers.

MiniMax M1 — Reasoning MoE That Beats R1 (Jun 19) — 456B total / 45B active parameters. Full weights on Hugging Face.

Gemini CLI (Jun 26) — Google's open source terminal agent brings Gemini 2.5 Pro to your command line.

Flux Kontext — SOTA image editing with character consistency.

Q3 2025 — The Quarter of GPT-5 & Trillion-Parameter Open Source

GPT-5 arrived after 32 months. Open source hit trillion-parameter scale. World models became playable. Chinese labs continued their dominance.
Key Themes:
* 👑 GPT-5 Era began (unified reasoning + chat)
* 🇨🇳 Open source hit trillion-scale (Kimi K2, Qwen3-Coder)
* 🌍 World models became playable (Google Genie-3)
* 🎥 Video reached "can't tell" quality
* 💰 Unprecedented investment ($100B pledges, $183B valuations)

July — Trillion-Parameter Open Source Arrives
(Jul 03 | Jul 11 | Jul 17 | Jul 24)

Kimi K2 — The Trillion-Parameter King (Jul 17) — Moonshot dropped a 1 trillion parameter MoE model: 65.8% on SWE-bench Verified (beating Claude Sonnet without reasoning), 32B active parameters, 128K context, Modified MIT license.

"This isn't just another model release. This is 'Sonnet at home' if you have the hardware." — Alex Volkov

Grok-4 & Grok Heavy (Jul 10) — 50% on Humanity's Last Exam with tools. 100% on AIME25. xAI finally became a serious contender.

ChatGPT Agent (Odyssey) (Jul 17) — Unified agentic AI: browser + terminal + research. 41.6% on HLE (double o3).

Chinese Open Source Explosion:
* Baidu ERNIE 4.5 (10 models, Apache 2.0)
* Tencent Hunyuan-A13B (80B MoE, 256K context)
* Huawei Pangu Pro (trained entirely on Ascend NPUs — no Nvidia!)
* Qwen3-Coder-480B (69.6% SWE-bench)

August — GPT-5 Month
(Aug 01 | Aug 07 | Aug 15 | Aug 21)

GPT-5 Launch (Aug 7) — 32 months after GPT-4:
* 400K context window
* $1.25/$10 per million tokens (Opus is $15/$75)
* Unified thinking + chat model
* Router-based architecture (initially buggy)
* Free tier access for back-to-school

"32 months since GPT-4 release, 32 months of ThursdAI" — Alex Volkov

GPT-OSS (Aug 5) — OpenAI goes Apache 2.0 open source for the first time since GPT-2: 120B and 20B models, configurable reasoning, full chain-of-thought access.

Google Genie-3 (Aug 7) — DeepMind's world model generates fully interactive 3D environments: real-time at 24fps, memory/consistency breakthrough, walk/fly/control in generated worlds.

DeepSeek V3.1 Hybrid (Aug 21) — Matches/beats R1 with fewer thinking tokens. 66% SWE-bench Verified. Tool calls inside thinking. MIT licensed.

September — Shiptember Delivers
(Sep 05 | Sep 12 | Sep 19 | Sep 26)

GPT-5-Codex (Sep 18) — Works 7+ hours independently. 93% fewer tokens on simple tasks. Reviews the majority of OpenAI's own PRs. Perfect 12/12 on 2025 ICPC.

Meta Connect 25 (Sep 18) — AI glasses with built-in display, neural band wristband, live translation with subtitles, $799, shipping immediately.

Qwen-mas Strikes Again (Sep 26):
* Qwen3-VL-235B (vision reasoner, 1M context for video)
* Qwen3-Omni-30B (end-to-end omni-modal)
* Qwen-Max (over 1T parameters, roadmap to 100M token context)

NVIDIA $100B pledge to OpenAI — "Biggest infrastructure project in history"

Suno V5 — The music generation model where we officially can't tell anymore.

"I can no longer tell which music is AI and which is human. This is it. We've passed the Rubicon." — Alex Volkov

Q4 2025 — The Quarter of Agents, Gemini's Crown & The Reasoning Wars

The densest quarter in AI history. Google took the throne with Gemini 3, OpenAI fired back with GPT-5.2, and agents became real products. Someone trained an LLM in space.
Key Themes:
* 🚀 Reasoning wars peaked (Gemini 3 → GPT-5.2 → DeepSeek gold medals)
* 🤖 Agents became products (Atlas, AgentKit, ChatGPT Apps)
* 👑 Google's comeback (Gemini 3, Antigravity, Nano Banana)
* 🏃 ASI race accelerated ($1.4T compute, 2028 autonomous researchers)
* 🎬 Sora 2 launched AI-native social media

October — Sora Changes Social Media Forever
(Oct 03 | Oct 10 | Oct 17 | Oct 24 | Oct 30)

Sora 2 — AI Social Media is Born (Oct 2):
* Shot to #3 on the iOS App Store within days
* Cameos: upload your face, star in any video
* Sam Altman shared his Cameo publicly, becoming the internet's most meme-able person
* All content is AI-generated — no uploads, only creations

"This is the first social media with UGC where content can ONLY be generated" — Alex Volkov

OpenAI Dev Day (Oct 9):
* ChatGPT Apps for 800M+ weekly active users
* AgentKit: drag-and-drop agent builder
* GPT-5-Pro in API
* Sam revealed $1.4 trillion in compute obligations

AI Makes Novel Cancer Discovery (Oct 16) — A 27B Gemma-based model generated a novel hypothesis about cancer cells, validated in a wet lab. First confirmed case of AI creating genuinely new scientific knowledge.

Claude Sonnet 4.5 — 61.4% OSWorld (computer use)

Claude Haiku 4.5 — 73.3% SWE-Bench, lightning fast

November — The Week That Changed Everything
(Nov 07 | Nov 13 | Nov 20 | Nov 27)

THE MOST INSANE WEEK IN AI HISTORY. In a single span of ~10 days:
* Grok 4.1 — #1 LMArena (briefly)
* Gemini 3 Pro — Took the throne with 45.14% on ARC-AGI-2 (Deep Think)
* GPT-5.1-Codex-Max — 24+ hour autonomous coding
* Nano Banana Pro — 4K image generation with perfect text rendering
* Meta SAM 3 & SAM 3D — Open-vocabulary segmentation
* Claude Opus 4.5 — 80.9% SWE-Bench Verified, beats GPT-5.1

"This week almost broke me as a person whose full-time job is to cover and follow AI releases." — Alex Volkov

Gemini 3 Pro + Deep Think (Nov 20) — Google finally took the LLM throne: 45.14% on ARC-AGI-2, roughly double previous SOTA.

Google Antigravity IDE (Nov 20) — Free agent-first VS Code fork with browser integration, multiple parallel agents.

Nano Banana Pro (Nov 20) — Native 4K resolution with "thinking" traces, perfect text rendering.

Claude Opus 4.5 (Nov 27) — 80.9% SWE-Bench Verified. $5/$25 per MTok (1/3 the previous cost). "Effort" parameter for reasoning control.

"Opus 4.5 is unbelievable. You can ship a full feature on a mature code base in one day, always. It's just mind blowing." — Ryan Carson

1X NEO (Oct 30) — First consumer humanoid robot, pre-orders at $20,000, delivery early 2026.

December — GPT-5.2 Fires Back
(Dec 02 | Dec 05 | Dec 12 | Dec 19)

GPT-5.2 — OpenAI's Answer to Gemini 3 (Dec 11) — Dropped live during ThursdAI:
* 90.5% on ARC-AGI-1 (Pro X-High configuration)
* 54%+ on ARC-AGI-2 — reclaiming the frontier from Gemini 3
* 100% on AIME 2025 — perfect math olympiad score
* 70% on GDPval (up from 47% in Sept!)
* Reports of models thinking for 1-3 hours on hard problems

DeepSeek V3.2 & V3.2-Speciale — Gold Medal Reasoning (Dec 4):
* 96% on AIME (vs 94% for GPT-5 High)
* Gold medals on IMO (35/42), CMO, ICPC (10/12), IOI (492/600)
* $0.28/million tokens on OpenRouter

MCP Donated to Linux Foundation (Dec 11) — Agentic AI Foundation launched under the Linux Foundation. MCP, AGENTS.md, and goose donated to vendor-neutral governance.

Mistral 3 Returns to Apache 2.0 (Dec 4) — Mistral Large 3 (675B MoE), Ministral 3 (vision, edge-optimized).

Starcloud: LLM Training in Space (Dec 11) — An H100 satellite trained nanoGPT on Shakespeare. SSH into an H100… in space… with a US flag in the corner.
"Peak 2025 energy — the era of weird infra ideas has begun." — Karpathy reacts

Gemini 3 Flash (Dec 18) — Fastest frontier model, pairs with Gemini 3 Pro for speed vs depth tradeoffs.

🙏 Thank You

This has been an incredible year of ThursdAI. 51 episodes, countless releases, and a community that keeps showing up every week to make sense of the madness together.

Huge thanks to our amazing co-hosts and friends of the pod:
* Alex Volkov — AI Evangelist, Weights & Biases (@altryne)
* Wolfram Ravenwolf (@WolframRvnwlf)
* Yam Peleg (@yampeleg)
* Nisten Tahiraj (@nisten)
* LDJ (@ldjconfirmed)
* Ryan Carson (@ryancarson)
* Kwindla Hultman Kramer — CEO of Daily (@kwindla)

And to everyone who tunes in — whether you're listening on your commute, doing dishes, or just trying to keep up with the insanity — thank you. You make this possible.

📢 Stay Connected
* 🎧 Subscribe: thursdai.news
* 🐦 Follow Alex: @altryne
* 💻 This recap is open source: github.com/altryne/thursdAI_yearly_recap

"We're living through the early days of a technological revolution, and we get to be part of it. That's something to be genuinely thankful for." — Alex Volkov

Happy Holidays, and see you in 2026! 🚀

The best is yet to come. Hold on to your butts.

📆 ThursdAI - Dec 18 - Gemini 3 Flash, Grok Voice, ChatGPT Appstore, Image 1.5 & GPT 5.2 Codex, Meta Sam Audio & more AI news
2025/12/19 | 39 mins.
Hey folks 👋 Alex here, dressed as 🎅 for our pre-X-mas episode!

We're wrapping up 2025, and the AI labs decided they absolutely could NOT let the year end quietly. This week was an absolute banger—we had Gemini 3 Flash dropping with frontier intelligence at flash prices, OpenAI firing off GPT 5.2 Codex as breaking news DURING our show, ChatGPT Images 1.5, Nvidia going all-in on open source with Nemotron 3 Nano, and the voice AI space heating up with Grok Voice and Chatterbox Turbo. Oh, and Google dropped FunctionGemma for all your toaster-to-fridge communication needs (yes, really).

Today's show was over three and a half hours long because we tried to cover both this week AND the entire year of 2025 (that yearly recap is coming next week—it's a banger, we went month by month and you'll really feel the acceleration). For now, let's dive into just the insanity that was THIS week.

00:00 Introduction and Overview
00:39 Weekly AI News Highlights
01:40 Open Source AI Developments
01:44 Nvidia's Nemotron Series
09:09 Google's Gemini 3 Flash
19:26 OpenAI's GPT Image 1.5
20:33 Infographic and GPT Image 1.5 Discussion
20:53 Nano Banana vs GPT Image 1.5
21:23 Testing and Comparisons of Image Models
23:39 Voice and Audio Innovations
24:22 Grok Voice and Tesla Integration
26:01 Open Source Robotics and Voice Agents
29:44 Meta's SAM Audio Release
32:14 Breaking News: Google Function Gemma
33:23 Weights & Biases Announcement
35:19 Breaking News: OpenAI Codex 5.2 Max

Big Companies LLM updates

Google's Gemini 3 Flash: The High-Speed Intelligence King

If we had to title 2025, as Ryan Carson mentioned on the show, it might just be "The Year of Google's Comeback." Remember at the start of the year when we were asking "Where is Google?" Well, they are here. Everywhere.

This week they launched Gemini 3 Flash, and it is rightfully turning heads. This is a frontier-class model—meaning it boasts Pro-level intelligence—but it runs at Flash-level speeds and, most importantly, Flash-level pricing. We are talking $0.50 per 1 million input tokens. That is not a typo. The price-to-intelligence ratio here is simply off the charts.

I've been using Gemini 2.5 Flash in production for a while because it was good enough, but Gemini 3 Flash is a different beast. It scores 71 on the Artificial Analysis Intelligence Index (a 13-point jump from the previous Flash), and it achieves 78% on SWE-bench Verified. That actually beats the bigger Gemini 3 Pro on some agentic coding tasks!

What impressed me most, and something Kwindla pointed out, is the tool calling. Previous Gemini models sometimes struggled with complex tool use compared to OpenAI, but Gemini 3 Flash can handle up to 100 simultaneous function calls. It's fast, it's smart, and it's integrated immediately across the entire Google stack—Workspace, Android, Chrome. Google isn't just releasing models anymore; they are deploying them instantly to billions of users.

For anyone building agents, this combination of speed, low latency, and a 1 million token context window (at this price!) makes it the new default workhorse.

Google's FunctionGemma Open Source release

We also got a smaller, quirkier release from Google: FunctionGemma. This is a tiny 270M parameter model. Yes, millions, not billions.

It's purpose-built for function calling on edge devices. It requires only 500MB of RAM, meaning it can run on your phone, in your browser, or even on a Raspberry Pi.
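If "function calling" feels abstract, here's the basic shape of what an on-device runtime does with a model like this — a minimal sketch where the JSON format and tool names are illustrative, not FunctionGemma's documented output format:

```python
import json

# Hypothetical on-device dispatch: the model emits a structured call,
# the runtime looks it up in a registry and executes it locally.
def dispatch(model_output: str, registry: dict):
    call = json.loads(model_output)
    return registry[call["name"]](**call["arguments"])

registry = {"set_alarm": lambda time, label: f"Alarm set for {time} ({label})"}

# In real life this JSON comes from the model; here we hardcode it.
print(dispatch(
    '{"name": "set_alarm", "arguments": {"time": "06:30", "label": "flight"}}',
    registry,
))
```

The whole point of a 270M model is that mapping "Wake me at 6:30 for the flight" onto that JSON can happen in 500MB of RAM, with no cloud round-trip.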
As Nisten joked on the show, this is finally the model that lets your toaster talk to your fridge.

Is it going to write a novel? No. But after fine-tuning, it jumped from 58% to 85% accuracy on mobile action tasks. This represents a future where privacy-first agents live entirely on your device, handling your calendar and apps without ever pinging a cloud server.

OpenAI Image 1.5, GPT 5.2 Codex and the ChatGPT App Store

OpenAI had a busy week, starting with the release of GPT Image 1.5. It's available now in ChatGPT and the API. The headline here is speed and control—it's 4x faster than the previous model and 20% cheaper. It also tops the LMSYS Image Arena leaderboards.

However, I have to give a balanced take here. We've been spoiled recently by Google's "Nano Banana Pro" image generation (which powers Gemini). When we looked at side-by-side comparisons, especially with typography and infographic generation, Gemini often looked sharper and more coherent. This is what we call "hedonic adaptation"—GPT Image 1.5 is great, but the bar has moved so fast that it doesn't feel like the quantum leap DALL-E 3 was back in the day. Still, for production workflows where you need to edit specific parts of an image without ruining the rest, this is a massive upgrade.

🚨 BREAKING: GPT 5.2 Codex

Just as we were nearing the end of the show, OpenAI decided to drop some breaking news: GPT 5.2 Codex.

This is a specialized model optimized specifically for agentic coding, terminal workflows, and cybersecurity. We quickly pulled up the benchmarks live, and they look significant. It hits 56.4% on SWE-Bench Pro and a massive 64% on Terminal-Bench 2.0.

It supports up to 400K token inputs with native context compaction, meaning it's designed for those long, complex coding sessions where you're debugging an entire repository. The coolest (and scariest?) stat: a security researcher used this model to find three previously unknown vulnerabilities in React in just one week.

OpenAI is positioning this for "professional software engineering," and the benchmarks suggest a 30% improvement in token efficiency over the standard GPT 5.2. We are definitely going to be putting this through its paces in our own evaluations soon.

ChatGPT ... the App Store!

Also today (OpenAI is really throwing everything they have at the end-of-year release party), OpenAI unveiled how their App Store is going to look and opened the submission forms so you can submit your own apps!

Reminder: ChatGPT apps are powered by MCP and were announced during DevDay. They let companies build a full UI experience right inside ChatGPT, and given OpenAI's almost 900M weekly active users, this is a big deal! Do you have an app you'd like in there? Let me know in the comments!

Open Source AI

🔥 Nvidia Nemotron 3 Nano: The Most Important Open Source Release of the Week (X, HF)

I think the most important release of this week in open source was Nvidia Nemotron 3 Nano, and it was pretty much everywhere. Nemotron is a series of models from Nvidia that's been pushing efficiency updates, finetune innovations, pruning, and distillations—all the stuff Nvidia does incredibly well.

Nemotron 3 Nano is a 30 billion parameter model with only 3 billion active parameters, using a hybrid Mamba-MoE architecture. This is huge. The model achieves 1.5 to 3.3x faster inference than competing models like Qwen 3 while maintaining competitive accuracy on H200 GPUs.

But the specs aren't even the most exciting part. NVIDIA didn't just dump the weights over the wall.
They released the datasets—all 25 trillion tokens of pre-training and post-training data. They released the recipes. They released the technical reports. This is what "Open AI" should actually look like.

What's next? Nemotron 3 Super at 120B parameters (4x Nano) and Nemotron 3 Ultra at 480B parameters (16x Nano) are coming in the next few months, featuring their innovative Latent Mixture of Experts architecture.

Check out the release on HuggingFace.

Other Open Source Highlights

LDJ brought up BOLMO from Allen AI—the first byte-level model that actually reaches parity with similar-size models using regular tokenization. This is really exciting because it could open up new possibilities for spelling accuracy, precise code editing, and potentially better omnimodality, since ultimately everything is bytes—images, audio, everything.

Wolfram highlighted OLMO 3.1, also from Allen AI, which is multimodal with video input in three sizes (4B, 7B, 8B). The interesting feature here is that you can give it a video, ask something like "how many times does a ball hit the crown?" and it'll not only give you the answer but mark the precise coordinates on the video frames where it happens. Very cool for tracking objects throughout a video!

Mistral OCR 3 (X)

Mistral also dropped Mistral OCR 3 this week—their next-generation document intelligence model, achieving a 74% win rate over OCR 2 across challenging document types. We're talking forms, low-quality scans, handwritten text, complex tables, and multilingual documents.

The pricing is aggressive at just $2 per 1,000 pages (or $1 with the Batch API discount), and it outperforms enterprise solutions like AWS Textract, Azure Doc AI, and Google DocSeek. Available via API and their new Document AI Playground.

🐝 This Week's Buzz: Wolfram Joins Weights & Biases!

I am so, so hyped to announce this. Our very own co-host and evaluation wizard, Wolfram Ravenwolf, is officially joining the Weights & Biases / CoreWeave family as an AI Evangelist and "AIvaluator" starting in January!

Wolfram has been the backbone of the "vibe checks" and deep-dive evals on this show for a long time. Now, he'll be doing it full-time, building out benchmarks for the community and helping all of us make sense of this flood of models. Expect ThursdAI to get even more data-driven in 2026. Match made in heaven! And if you're as excited as we are, give Weave a try—it's free to get started!

Voice & Audio: Faster, Cheaper, Better

If 2025 was the year of the LLM comeback, the end of 2025 is the era of Voice AI commoditization. It is getting so cheap and so fast.

Grok Voice Agent API (X)

xAI launched their Grok Voice Agent API, and the pricing is aggressive: $0.05 per minute flat rate. That significantly undercuts OpenAI and others. But the real killer feature here is the integration.

If you drive a Tesla, this is what powers the voice command when you hold down the button. It has native access to vehicle controls, but for developers, it has native tool calling for Real-time X Search. This means your voice agent can have up-to-the-minute knowledge about the world, something purely pre-trained models struggle with. It ranks #1 on Big Bench Audio, and with that pricing, we're going to see voice ubiquity very soon.

Kwindla had great insights here: it feels like they optimized for the Tesla use case, where it's a question and an answer. You can see this because Big Bench Audio is a hard audio Q&A benchmark, but not multi-turn.
So it's super exciting, but it's not necessarily what we'll use for multi-turn conversational voice agents yet.

Here's what's really interesting: the entire voice stack was built in-house with custom VAD, tokenizer, and audio models for end-to-end optimization. Tesla was a critical design partner—Grok now powers millions of Tesla vehicles. If you're building AI voice agents, will you give the Grok Voice SDK a try?

Resemble AI's Chatterbox Turbo (X, HF, GitHub, Blog)

For the open-source heads, Resemble AI dropped a bombshell with Chatterbox Turbo. This is a 350M parameter open-source TTS model that is beating proprietary giants like ElevenLabs in blind tests.

It allows for zero-shot voice cloning from just 5 seconds of audio and supports paralinguistic tags—meaning you can type [laugh] or [sigh] and the model actually acts it out naturally. Plus, it has built-in watermarking for safety. It's MIT licensed, so you can run this yourself. The fact that an open model is winning on quality against the paid APIs is a huge moment for the community.

Meta SAM Audio

Finally, Meta extended their "Segment Anything" magic to audio with SAM Audio. You know how you can click an object in an image to select it? Now you can do that with sound.

With SAM Audio, you can isolate just the sound of a train from a messy audio track, or pick out a specific instrument from a song. You can prompt it with text ("guitar"), visual clicks on a video, or time stamps. It's incredible for creators and audio engineers, effectively automating what used to be painful manual editing.

Wrapping Up

What a week to close out 2025. Google proved once again that they're the gorilla that's learned to dance—Gemini 3 Flash delivering frontier intelligence at flash prices is going to change how people build AI applications. Nvidia showed that the most valuable company in the world is all-in on open source. OpenAI fired off GPT 5.2 Codex just to make sure we don't forget about them. And the voice AI space is heating up with options that would have seemed impossible just a year ago.

Look out for the full 2025 yearly recap episode coming next week—it's a banger. We went month by month through every major AI release and talked about what we thought were the best overall. You'll really feel the acceleration from that one.

Happy holidays, folks!
And as always, thanks for being part of the ThursdAI community.

TL;DR and Show Notes

Hosts and Guests
* Alex Volkov - AI Evangelist & Weights & Biases (@altryne)
* Co-hosts: @WolframRvnwlf, @yampeleg, @nisten, @ldjconfirmed, @ryancarson
* Special Guest: @kwindla - CEO of Daily

Open Source LLMs
* NVIDIA Nemotron 3 Nano - 30B-3A hybrid Mamba-MoE model (X, HF, HF FP8)
* FunctionGemma - 270M parameter function calling model (X, Blog, Docs)
* Mistral OCR 3 - Document intelligence model with 74% win rate over v2 (X, Blog, Console)
* BOLMO from Allen AI - First byte-level model reaching parity with regular tokenization (X)
* OLMO 3.1 from Allen AI - Multimodal with video input (4B, 7B, 8B sizes) (X)

Big CO LLMs + APIs
* Google Gemini 3 Flash - Frontier intelligence at $0.50/1M input tokens, 78% SWE-bench Verified (X, Announcement)
* OpenAI GPT Image 1.5 - 4x faster, 20% cheaper, #1 on LMSYS Image Arena (X)
* OpenAI GPT 5.2 Codex - 56.4% SWE-Bench Pro, 64% Terminal-Bench 2.0, 400K context (X, Blog)
* ChatGPT App Store - MCP-powered apps, submissions now open (X)

This Week's Buzz
* 🐝 Wolfram joins Weights & Biases / CoreWeave as AI Evangelist and AIvaluator!
* Try Weave for AI evaluations

Voice & Audio
* xAI Grok Voice Agent API - #1 Big Bench Audio (92.3%), $0.05/min flat rate, powers Tesla vehicles (X)
* Resemble AI Chatterbox Turbo - MIT-licensed 350M TTS, beats ElevenLabs in blind tests (X, HF, GitHub, Blog)
* Meta SAM Audio - Audio source separation with text/visual/temporal prompts (X, HF, GitHub)

Show Links
* Full 2025 Yearly Recap - Coming next week!

📆 ThursdAI - Dec 11 - GPT 5.2 is HERE! Plus, LLMs in Space, MCP donated, Devstral surprises and more AI news!
2025/12/12 | 1h 37 mins.
Hey everyone, December started strong and does NOT want to slow down!? OpenAI showed us their response to the Code Red and it's GPT 5.2, which doesn't feel like a .1 upgrade! We got it literally as breaking news at the end of the show, and oh boy! The new king of LLMs is here. GPT, then Gemini, then Opus and now GPT again... Who else feels like we're on a trippy AI rollercoaster? Just me? 🫨

I'm writing this newsletter from a fresh "traveling podcaster" setup in SF (huge shoutout to the Chroma team for the studio hospitality). P.S. - Next week we're doing a year recap episode (52nd episode of the year, what is my life), but today is about the highest-signal stuff that happened this week.

Alright. No more foreplay. Let's dive in. Please subscribe.

🔥 The main event: OpenAI launches GPT‑5.2 (and it's… a lot)

We started the episode with "garlic in the air" rumors (OpenAI holiday launches always have that Christmas panic energy), and then… boom: GPT‑5.2 actually drops while we're live.

What makes this release feel significant isn't "one benchmark went up." It's that OpenAI is clearly optimizing for the things that have become the frontier in 2025: long-horizon reasoning, agentic coding loops, long-context reliability, and lower hallucination rates when browsing/tooling is involved.

5.2 Instant, Thinking and Pro in ChatGPT and in the API

OpenAI shipped multiple variants, and even within those there are "levels" (medium/high/extra-high) that effectively change how much compute the model is allowed to burn. At the extreme end, you're basically running parallel thoughts and selecting winners. That's powerful, but also… very expensive.

It's very clearly aimed at the agentic world: coding agents that run in loops, tool-using research agents, and "do the whole task end-to-end" workflows where spending extra tokens is still cheaper than spending an engineer day.

Benchmarks

I'm not going to pretend benchmarks tell the full story (they never do), but the shape of improvements matters. GPT‑5.2 shows huge strength on reasoning + structured work.

It hits 90.5% on ARC‑AGI‑1 in the Pro X‑High configuration, and 54%+ on ARC‑AGI‑2 depending on the setting. For context, ARC‑AGI‑2 is the one where everyone learns humility again.

On math/science, this thing is flexing. We saw 100% on AIME 2025, and strong performance on FrontierMath tiers (with the usual "Tier 4 is where dreams go to die" vibe still intact). GPQA Diamond is up in the 90s too, which is basically "PhD trivia mode."

But honestly the most practically interesting one for me is GDPval (knowledge-work tasks: slides, spreadsheets, planning, analysis). GPT‑5.2 lands around 70%, which is a massive jump vs earlier generations. This is the category that translates directly into "is this model useful at my job." This is a bench that OpenAI launched only in September, and back then Opus 4.1 scored a "measly" 47%! Talk about acceleration!

Long context: MRCR is the sleeper highlight

On MRCR (multi-needle long-context retrieval), GPT‑5.2 holds up absurdly well even into 128k and beyond. The graph OpenAI shared shows GPT‑5.1 falling off a cliff as context grows, while GPT‑5.2 stays high much deeper into long contexts.

If you've ever built a real system (RAG, agent memory, doc analysis) you know this pain: long context is easy to offer, hard to use well. If GPT‑5.2 actually delivers this in production, it's a meaningful shift.
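For intuition, here's a toy multi-needle probe in the spirit of MRCR (my own sketch, not OpenAI's benchmark): bury several facts at random positions in a long filler context, and only count the model as passing if it recovers all of them.

```python
import random

def build_haystack(needles: dict[str, str], filler_sentences: int = 12_000) -> str:
    """Bury each needle sentence at a random position in a long filler context."""
    filler = ["The sky was a uniform grey that day."] * filler_sentences
    for key, value in needles.items():
        filler.insert(random.randrange(len(filler)), f"The {key} is {value}.")
    return " ".join(filler)

needles = {"magic word": "papaya", "launch code": "7-2-9", "meeting room": "B41"}
context = build_haystack(needles)
question = "What are the magic word, the launch code, and the meeting room?"

# Send context + question to the model under test, then score recall:
# answer = call_model(context + "\n\n" + question)   # placeholder client call
# recall = sum(v in answer for v in needles.values()) / len(needles)
```

As the filler grows past 128k tokens, weaker models start dropping needles—that's the cliff GPT‑5.1 shows on OpenAI's graph and GPT‑5.2 apparently avoids.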
Benchmarks

I'm not going to pretend benchmarks tell the full story (they never do), but the shape of improvements matters. GPT‑5.2 shows huge strength on reasoning + structured work.

It hits 90.5% on ARC‑AGI‑1 in the Pro X‑High configuration, and 54%+ on ARC‑AGI‑2 depending on the setting. For context, ARC‑AGI‑2 is the one where everyone learns humility again.

On math/science, this thing is flexing. We saw 100% on AIME 2025, and strong performance on FrontierMath tiers (with the usual "Tier 4 is where dreams go to die" vibe still intact). GPQA Diamond is up in the 90s too, which is basically "PhD trivia mode."

But honestly the most practically interesting one for me is GDPval (knowledge-work tasks: slides, spreadsheets, planning, analysis). GPT‑5.2 lands around 70%, which is a massive jump vs earlier generations. This is the category that translates directly into "is this model useful at my job." This is a bench that OpenAI launched only in September, and back then Opus 4.1 scored a "measly" 47%! Talk about acceleration!

Long context: MRCR is the sleeper highlight

On MRCR (multi-needle long-context retrieval), GPT‑5.2 holds up absurdly well even into 128k and beyond. The graph OpenAI shared shows GPT‑5.1 falling off a cliff as context grows, while GPT‑5.2 stays high much deeper into long contexts.

If you've ever built a real system (RAG, agent memory, doc analysis) you know this pain: long context is easy to offer, hard to use well. If GPT‑5.2 actually delivers this in production, it's a meaningful shift.

Hallucinations: down (especially with browsing)

One thing we called out on the show is that a bunch of user complaints in 2025 have basically collapsed into one phrase: "it hallucinates." Even people who don't know what a benchmark is can feel when a model confidently lies.

OpenAI's system card shows lower rates of major incorrect claims compared to GPT‑5.1, and lower "incorrect claims" overall when browsing is enabled. That's exactly the direction they needed.

Real-world vibes

We did the traditional "vibe tests" mid-show: generate a flashy landing page, do a weird engineering prompt, try some coding inside Cursor/Codex.

Early testers broadly agree on the shape of the improvement. GPT‑5.2 is much stronger in reasoning, math, long‑context tasks, visual understanding, and multimodal workflows, with multiple reports of it successfully thinking for one to three hours on hard problems. Enterprise users like Box report faster execution and higher accuracy on real knowledge‑worker tasks, while researchers note that GPT‑5.2 Pro consistently outperforms the standard "Thinking" variant. The tradeoffs are also clear: creative writing still slightly favors Claude Opus, and the highest reasoning tiers can be slow and expensive. But as a general‑purpose reasoning model, GPT‑5.2 is now the strongest publicly available option.

AI in space: Starcloud trains an LLM on an H100 in orbit

This story is peak 2025.

Starcloud put an NVIDIA H100 on a satellite, trained Andrej Karpathy's nanoGPT on Shakespeare, and ran inference on Gemma. There's a viral screenshot vibe here that's impossible to ignore: SSH into an H100… in space… with a US flag in the corner. It's engineered excitement, and I'm absolutely here for it.

But we actually had a real debate on the show: is "GPUs in space" just sci‑fi marketing, or does it make economic sense?

Nisten made a compelling argument that power is the real bottleneck, not compute, and that big satellites already operate in the ~20kW range. If you can generate that power reliably with solar in orbit, the economics start looking less insane than you'd think. LDJ added the long-term land/power convergence argument: Earth land and grid power get scarcer/more regulated, while launch costs trend down—eventually the curves may cross.

I played "voice of realism" for a minute: what happens when GPUs fail? It's hard enough to swap a GPU in a datacenter; now imagine doing it in orbit. Cooling and heat dissipation become a different engineering problem too (radiators instead of fans). Networking is nontrivial. But also: we are clearly entering the era where people will try weird infra ideas because AI demand is pulling the whole economy.

Big Company: MCP gets donated, OpenRouter drops a report on AI

Agentic AI Foundation Lands at the Linux Foundation

This one made me genuinely happy.

Block, Anthropic, and OpenAI came together to launch the Agentic AI Foundation under the Linux Foundation, donating key projects like MCP, AGENTS.md, and goose. This is exactly how standards should happen: vendor‑neutral, boring governance, lots of stakeholders.

It's not flashy work, but it's the kind of thing that actually lets ecosystems grow without fragmenting. BTW, I was recording my podcast while Latent.Space were recording theirs in the same office, and they have a banger episode upcoming about this very topic! All I'll say is Alessio Fanelli introduced me to David Soria Parra from MCP 👀 Watch out for that episode on Latent.Space dropping soon!
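For anyone who hasn't touched MCP yet, here's roughly what a minimal server looks like with the official Python SDK's FastMCP helper; the server name and the tool itself are made up for illustration:

```python
# Minimal sketch of an MCP server using the official Python SDK's FastMCP
# helper (the SDK now lives under the Agentic AI Foundation's governance).
# The server name and tool are invented examples, not from the show.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("thursdai-demo")

@mcp.tool()
def word_count(text: str) -> int:
    """Count whitespace-separated words in a string."""
    return len(text.split())

if __name__ == "__main__":
    mcp.run()  # defaults to stdio transport, so any MCP client can attach
```

The point of the donation is that this interface (tool schemas, transports, the handshake) now sits under neutral governance, so a tiny server like this can, in principle, serve any MCP-capable client without vendor lock-in.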
OpenRouter's "State of AI": 100 Trillion Tokens of Reality

OpenRouter and a16z dropped a massive report analyzing over 100 trillion tokens of real‑world usage. A few things stood out:

* Reasoning tokens now dominate. Around 60% of all tokens since early 2025 are reasoning tokens, well past the 50% mark. Remember when we went from "LLMs can't do math" to reasoning models? That happened in about a year.
* Programming exploded. From 11% of usage in early 2025 to over 50% recently. Claude holds 60% of the coding market (at least… on OpenRouter).
* Open source hit 30% market share, led by Chinese labs: DeepSeek (14T tokens), Qwen (5.59T), Meta LLaMA (3.96T).
* Context lengths grew massively. Average prompt length went from 1.5k to 6k+ tokens (4x growth), completions from 133 to 400 tokens (3x).
* The "Glass Slipper" effect. When users find a model that fits their use case, they stay loyal. Foundational early-user cohorts retain around 40% at month 5. Claude 4 Sonnet still had 50% retention after three months.
* Geography shift. Asia doubled to 31% of usage (China key), while North America is at 47%.

Yam made a good point that we should be careful interpreting these graphs—they're biased toward people trying new models, not necessarily steady usage. But the trends are clear: agentic, reasoning, and coding are the dominant use cases.

Open Source Is Not Slowing Down (If Anything, It's Accelerating)

One of the strongest themes this week was just how fast open source is closing the gap — and in some areas, outright leading. We're not talking about toy demos anymore. We're talking about serious models, trained from scratch, hitting benchmarks that were frontier‑only not that long ago.

Essential AI's Rnj‑1: A Real Frontier 8B Model

This one deserves real attention. Essential AI — led by Ashish Vaswani, yes, that Ashish from the original Transformers paper — released Rnj‑1, a pair of 8B open‑weight models trained fully from scratch. No distillation. No "just a fine‑tune." This is a proper pretrain.

What stood out to me isn't just the benchmarks (though those are wild), but the philosophy. Rnj‑1 is intentionally focused on pretraining quality: data curation, code execution simulation, STEM reasoning, and agentic behaviors emerging during pretraining instead of being bolted on later with massive RL pipelines.

In practice, that shows up in places like SWE‑bench Verified, where Rnj‑1 lands in the same ballpark as much larger closed models, and in math and STEM tasks where it punches way above its size. And remember: this is an 8B model you can actually run locally, quantize aggressively, and deploy without legal gymnastics thanks to its Apache 2.0 license.

Mistral Devstral 2 + Vibe: Open Coding Goes Hard

Mistral followed up last week's momentum with Devstral 2 and Mistral Vibe! The headline numbers: the 123B Devstral 2 model lands right at the top of open‑weight coding benchmarks, nearly matching Claude 3.5 Sonnet on SWE‑bench Verified. But what really excited the panel was the 24B Devstral Small 2, which hits high‑60s SWE‑bench scores while being runnable on consumer hardware.

This is the kind of model you can realistically run locally as a coding agent, without shipping your entire codebase off to someone else's servers. Pair that with Mistral Vibe, their open‑source CLI agent, and you suddenly have a credible, fully open alternative to things like Claude Code, Codex, or Gemini CLI.

We talked a lot about why this matters. Some teams can't send code to closed APIs. Others just don't want to pay per‑token forever. And some folks — myself included — just like knowing what's actually running under the hood. Devstral 2 checks all those boxes.

🐝 This week's Buzz (W&B): Trace OpenRouter traffic into Weave with zero code

We did a quick "Buzz" segment on a feature that I think a lot of builders will love: OpenRouter launched Broadcast, which can stream traces to observability vendors. One of those destinations is W&B Weave.

The magic here is: if you're using a tool that already talks to OpenRouter, you can get tracing into Weave without instrumenting your code. That's especially useful when instrumentation is hard (certain agent frameworks, black-box tooling, restricted environments, etc.).

If you want to set it up: OpenRouter Broadcast settings.
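The "zero code" part is the whole point: Broadcast is configured once in OpenRouter's dashboard, and your application code stays a completely ordinary OpenRouter call. Roughly like this, with OpenRouter's OpenAI-compatible endpoint (the model id here is just an example I picked, not one from the show):

```python
# A plain OpenRouter call via its OpenAI-compatible API. No tracing imports,
# decorators, or wrappers: once Broadcast is pointed at W&B Weave in the
# OpenRouter settings, calls like this appear in Weave as traces on their own.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

resp = client.chat.completions.create(
    model="mistralai/devstral-small",  # example model id, unverified
    messages=[{"role": "user", "content": "Summarize this diff for a PR description: ..."}],
)
print(resp.choices[0].message.content)
```

That's why this matters for black-box tooling: any agent framework that already speaks to OpenRouter inherits the tracing for free.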
Vision Models Are Getting Practical (and Weirdly Competitive)

Vision‑language models quietly had a massive week.

Jina‑VLM: Small, Multilingual, and Very Good at Docs

Jina released a 2.4B VLM that's absolutely dialed in on document understanding, multilingual VQA, and OCR‑heavy tasks. This is exactly the kind of model you'd want for PDFs, charts, scans, and messy real‑world docs — and it's small enough to deploy without sweating too much.

Z.ai GLM‑4.6V: Long Context, Tool Calling, Serious Agent Potential

Z.ai's GLM‑4.6V impressed us with its 128K context, native tool calling from vision inputs, and strong performance on benchmarks like MathVista and WebVoyager. It's one of the clearest examples yet of a VLM that's actually built for agentic workflows, not just answering questions about images.

That said, I did run my unofficial "bee counting test" on it… and yeah, Gemini still wins there 😅

Perceptron Isaac 0.2: Tiny Models, Serious Perception

Perceptron's Isaac 0.2 (1B and 2B variants) showed something I really like seeing: structured outputs, focus tools, and reliability in very small models. Watching a 2B model correctly identify, count, and point to objects in an image is still wild to me.

These are the kinds of models that make physical AI, robotics, and edge deployments actually feasible.

🧰 Tools: Cursor goes visual, and Google Stitch keeps getting scarier (in a good way)

Cursor: direct visual editing inside the codebase

Cursor shipped a new feature that lets you visually manipulate UI elements—click/drag/resize—directly in the editor. We lumped this under "tools" because it's not just a nicety; it's the next step in "IDE as design surface."

Cursor is also iterating fast on debugging workflows. The meta trend: IDEs are turning into agent platforms, not text editors.

Stitch by Google: Gemini 3 Pro as default, plus clickable prototypes

I showed Stitch on the show because it's one of the clearest examples of "distribution beats raw capability."

Stitch (Google's product born from the Galileo AI acquisition) is doing Shipmas updates and now defaults to "Thinking with Gemini 3 Pro." It can generate complex UIs, export them, and even stitch multiple screens into prototypes. The killer workflow is exporting directly into AI Studio / agent tooling so you can go from UI idea → code → repo without playing copy-paste Olympics.

Site: https://stitch.withgoogle.com
🎬 Disney invests $1B into OpenAI (and Sora gets Disney characters)

This is the corporate story that made me do a double take.

Disney—arguably the most IP-protective company on Earth—is investing $1B into OpenAI and enabling use of Disney characters in Sora. That's huge. It signals the beginning of a more explicit "licensed synthetic media" era, where major IP holders decide which model vendors get official access.

It also raises the obvious question: does Disney now go harder against other model providers that generate Disney-like content without permission?

We talked about how weird the timing is too, given Disney has also been sending legal pressure in the broader space. The next year of AI video is going to be shaped as much by licensing and distribution as by model quality.

Closing thoughts: the intelligence explosion is loud, messy, and accelerating

This episode had everything: open-source models catching up fast, foundation-level standardization around agents, a usage report that shows what developers actually do with LLMs, voice models getting dramatically better, and OpenAI shipping what looks like a serious "we're not losing" answer to Gemini 3.

And yes: we're also apparently putting GPUs in space.

Next week's episode is our year recap, and—of course—we now have to update it because GPT‑5.2 decided to show up like the final boss.

If you missed any part of the show, check out the chapters in the podcast feed and jump around. See you next week.

TL;DR + Show Notes (links for everything)

Hosts
* Alex Volkov — AI Evangelist @ Weights & Biases: @altryne. I host ThursdAI and spend an unhealthy amount of time trying to keep up with this firehose of releases.
* Co-hosts — @WolframRvnwlf, @yampeleg, @nisten, @ldjconfirmed. Each of them brings a different "lens" (agents, infra, evaluation, open source, tooling), and it's why the show works.

Open Source LLMs
* Essential AI — RNJ‑1 (8B base + instruct): tweet, blog, HF instruct, HF base. This is a from-scratch open pretrain led by Ashish Vaswani, and it's one of the most important "Western open model" signals we've seen in a while.
* Mistral — Devstral 2 + Devstral Small 2 + Mistral Vibe: tweet, Devstral Small 2 HF, Devstral 2 HF, news, mistral-vibe GitHub. Devstral is open coding SOTA territory, and Vibe is Mistral's swing at the CLI agent layer.

AI in Space
* Starcloud trains and runs an LLM in orbit on an H100: Philip Johnston, Adi Oltean, CNBC, Karpathy reaction. A satellite H100 trained nanoGPT on Shakespeare and ran Gemma inference, igniting a real debate about power, cooling, repairability, and future orbital compute economics.

Putnam Math Competition
* Nous Research — Nomos 1 (Putnam scoring run): tweet, HF, GitHub harness, Hillclimb. This is a strong open-weight math reasoning model plus an open harness, and it shows how orchestration matters as much as raw weights.
* Axiom — AxiomProver formal Lean proofs on Putnam: tweet, repo. Formal proofs are the "no excuses" version of math reasoning, and this is a serious milestone even if you argue about exact framing.

Big Company LLMs + APIs
* OpenAI — GPT‑5.2 release: Alex tweet, OpenAI announcement, ARC Prize verification, Sam Altman tweet. GPT‑5.2 brings major jumps in reasoning, long context, and agentic workflows, and it's clearly positioned as an answer to the Gemini 3 era.
* OpenRouter x a16z — State of AI report (100T+ tokens): tweet, landing page, PDF. The report highlights the dominance of programming/agents, the rise of reasoning tokens, and real-world usage patterns that explain why everyone is shipping agent harnesses.
* Agentic AI Foundation under Linux Foundation (AAIF): Goose tweet, Block blog, aaif.io, Linux Foundation tweet. MCP + AGENTS.md + Goose moving into vendor-neutral governance is huge for interoperability and long-term ecosystem stability.
* Disney invests $1B into OpenAI / Sora characters: (covered on the show as a major IP + distribution moment). This is an early signal of licensed synthetic media becoming a first-class business line rather than a legal gray zone.

This week's Buzz (W&B)
* OpenRouter Broadcast → W&B Weave tracing: Broadcast settings. You can trace OpenRouter-based traffic into Weave with minimal setup, which is especially useful when you can't (or don't want to) instrument code directly.

Vision & Video
* Jina — jina‑VLM (2.4B): tweet, arXiv, HF, blog. A compact multilingual VLM optimized for doc understanding and VQA.
* Z.ai — GLM‑4.6V + Flash: tweet, HF collection, GLM‑4.6V, Flash, blog. Strong open VLMs with tool calling and long context, even if my bee counting test still humbled it.
* Perceptron — Isaac 0.2 (1B/2B): tweet, HF 2B, HF 1B, blog, demo. The Focus/zoom tooling and structured outputs point toward "VLMs as reliable perception modules," not just chatty describers.

Voice & Audio
* Google DeepMind — Gemini 2.5 TTS (Flash + Pro): AI Studio tweet, GoogleAI devs tweet, blog, AI Studio speech playground. The key upgrades are control and consistency (emotion, pacing, multi-speaker) across many languages.
* OpenBMB — VoxCPM 1.5: tweet, HF, GitHub. Open TTS keeps getting better, and this release is especially interesting for fine-tuning and voice cloning workflows.

Tools
* Cursor — direct visual editing (new UI workflow): (covered on the show as a major step toward "IDE as design surface"). Cursor continues to push the agentic IDE category into new territory.
* Stitch by Google — Shipmas updates + Gemini 3 Pro "Thinking" + Prototypes: tweet 1, tweet 2, site, plus background articles: TechCrunch launch, acquisition detail. Stitch is turning prompt-to-UI into a full prototype-to-code pipeline with real export paths.


