
ThursdAI - The top AI news from the past week

From Weights & Biases: join AI Evangelist Alex Volkov and a panel of experts as they cover everything important that happened in the world of AI over the past week

Available Episodes (5 of 134)
  • 📆 ThursdAI - Dec 11 - GPT 5.2 is HERE! Plus, LLMs in Space, MCP donated, Devstral surprises and more AI news!
    Hey everyone, December started strong and does NOT want to slow down! OpenAI showed us their response to the Code Red and it’s GPT 5.2, which doesn’t feel like a .1 upgrade! We got it literally as breaking news at the end of the show, and oh boy! The new king of LLMs is here. GPT, then Gemini, then Opus and now GPT again... Who else feels like we’re on a trippy AI rollercoaster? Just me? 🫨 I’m writing this newsletter from a fresh “traveling podcaster” setup in SF (huge shoutout to the Chroma team for the studio hospitality). P.S - Next week we’re doing a year recap episode (52nd episode of the year, what is my life), but today is about the highest-signal stuff that happened this week. Alright. No more foreplay. Let’s dive in. Please subscribe.
🔥 The main event: OpenAI launches GPT‑5.2 (and it’s… a lot)
We started the episode with “garlic in the air” rumors (OpenAI holiday launches always have that Christmas panic energy), and then… boom: GPT‑5.2 actually drops while we’re live. What makes this release feel significant isn’t “one benchmark went up.” It’s that OpenAI is clearly optimizing for the things that have become the frontier in 2025: long-horizon reasoning, agentic coding loops, long-context reliability, and lower hallucination rates when browsing/tooling is involved.
5.2 Instant, Thinking and Pro in ChatGPT and in the API
OpenAI shipped multiple variants, and even within those there are “levels” (medium/high/extra-high) that effectively change how much compute the model is allowed to burn. At the extreme end, you’re basically running parallel thoughts and selecting winners. That’s powerful, but also… very expensive. It’s very clearly aimed at the agentic world: coding agents that run in loops, tool-using research agents, and “do the whole task end-to-end” workflows where spending extra tokens is still cheaper than spending an engineer day.
Benchmarks
I’m not going to pretend benchmarks tell the full story (they never do), but the shape of improvements matters. GPT‑5.2 shows huge strength on reasoning + structured work. It hits 90.5% on ARC‑AGI‑1 in the Pro X‑High configuration, and 54%+ on ARC‑AGI‑2 depending on the setting. For context, ARC‑AGI‑2 is the one where everyone learns humility again. On math/science, this thing is flexing. We saw 100% on AIME 2025, and strong performance on FrontierMath tiers (with the usual “Tier 4 is where dreams go to die” vibe still intact). GPQA Diamond is up in the 90s too, which is basically “PhD trivia mode.” But honestly the most practically interesting one for me is GDPval (knowledge-work tasks: slides, spreadsheets, planning, analysis). GPT‑5.2 lands around 70%, which is a massive jump vs earlier generations. This is the category that translates directly into “is this model useful at my job.” GDPval is a benchmark OpenAI launched only in September, and back then Opus 4.1 scored a “measly” 47%! Talk about acceleration!
Long context: MRCR is the sleeper highlight
On MRCR (multi-needle long-context retrieval), GPT‑5.2 holds up absurdly well even into 128k and beyond. The graph OpenAI shared shows GPT‑5.1 falling off a cliff as context grows, while GPT‑5.2 stays high much deeper into long contexts. If you’ve ever built a real system (RAG, agent memory, doc analysis) you know this pain: long context is easy to offer, hard to use well. If GPT‑5.2 actually delivers this in production, it’s a meaningful shift.
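To make the multi-needle idea concrete, here is a rough, self-contained sketch of what such a retrieval check does (this is not OpenAI’s MRCR harness; the model slug and the “code word” needles are invented for illustration):

```python
# Bury several "needles" in a long distractor document, then check whether the
# model can recall all of them. Requires the openai package and an API key.
import random
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

needles = {"alpha": "pomegranate", "beta": "asteroid", "gamma": "lighthouse"}

# Build a long haystack of filler sentences with the needles scattered through it.
sentences = ("The committee met again and postponed the decision. " * 2000).split(". ")
for name, value in needles.items():
    sentences.insert(random.randrange(len(sentences)), f"Note: code word {name} is {value}")
haystack = ". ".join(sentences)

question = "What are the values of code words alpha, beta and gamma?"
resp = client.chat.completions.create(
    model="gpt-5.2",  # placeholder slug; swap in any long-context model you have access to
    messages=[{"role": "user", "content": haystack + "\n\n" + question}],
)
answer = resp.choices[0].message.content
print({name: (value in answer) for name, value in needles.items()})  # which needles were found
```

The real benchmark is far more rigorous, but the failure mode it measures is the same one you hit in RAG and agent-memory systems: the model accepts the long prompt yet quietly drops facts buried in the middle of it.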
Hallucinations: down (especially with browsing)
One thing we called out on the show is that a bunch of user complaints in 2025 have basically collapsed into one phrase: “it hallucinates.” Even people who don’t know what a benchmark is can feel when a model confidently lies. OpenAI’s system card shows lower rates of major incorrect claims compared to GPT‑5.1, and lower “incorrect claims” overall when browsing is enabled. That’s exactly the direction they needed.
Real-world vibes
We did the traditional “vibe tests” mid-show: generate a flashy landing page, do a weird engineering prompt, try some coding inside Cursor/Codex. Early testers broadly agree on the shape of the improvement. GPT‑5.2 is much stronger in reasoning, math, long‑context tasks, visual understanding, and multimodal workflows, with multiple reports of it successfully thinking for one to three hours on hard problems. Enterprise users like Box report faster execution and higher accuracy on real knowledge‑worker tasks, while researchers note that GPT‑5.2 Pro consistently outperforms the standard “Thinking” variant. The tradeoffs are also clear: creative writing still slightly favors Claude Opus, and the highest reasoning tiers can be slow and expensive. But as a general‑purpose reasoning model, GPT‑5.2 is now the strongest publicly available option.
AI in space: Starcloud trains an LLM on an H100 in orbit
This story is peak 2025. Starcloud put an NVIDIA H100 on a satellite, trained Andrej Karpathy’s nanoGPT on Shakespeare, and ran inference on Gemma. There’s a viral screenshot vibe here that’s impossible to ignore: SSH into an H100… in space… with a US flag in the corner. It’s engineered excitement, and I’m absolutely here for it. But we actually had a real debate on the show: is “GPUs in space” just sci‑fi marketing, or does it make economic sense? Nisten made a compelling argument that power is the real bottleneck, not compute, and that big satellites already operate in the ~20kW range. If you can generate that power reliably with solar in orbit, the economics start looking less insane than you’d think. LDJ added the long-term land/power convergence argument: Earth land and grid power get scarcer/more regulated, while launch costs trend down—eventually the curves may cross. I played “voice of realism” for a minute: what happens when GPUs fail? It’s hard enough to swap a GPU in a datacenter, now imagine doing it in orbit. Cooling and heat dissipation become a different engineering problem too (radiators instead of fans). Networking is nontrivial. But also: we are clearly entering the era where people will try weird infra ideas because AI demand is pulling the whole economy.
Big Company: MCP gets donated, OpenRouter drops a report on AI
Agentic AI Foundation Lands at the Linux Foundation
This one made me genuinely happy. Block, Anthropic, and OpenAI came together to launch the Agentic AI Foundation under the Linux Foundation, donating key projects like MCP, AGENTS.md, and goose. This is exactly how standards should happen: vendor‑neutral, boring governance, lots of stakeholders. It’s not flashy work, but it’s the kind of thing that actually lets ecosystems grow without fragmenting. BTW, I was recording my podcast while Latent.Space were recording theirs in the same office, and they have a banger episode upcoming about this very topic! All I’ll say is Alessio Fanelli introduced me to David Soria Parra from MCP 👀 Watch out for that episode on Latent.Space dropping soon!
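Since MCP is the headline donation here, a minimal sketch of what an MCP server actually is may help make it concrete. This uses the official Python SDK’s FastMCP helper; the tool itself is a toy example, not anything from the show:

```python
# A toy MCP server: the protocol standardizes how agents discover and call tools
# like this one. Requires the official `mcp` Python SDK.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("thursdai-demo")

@mcp.tool()
def word_count(text: str) -> int:
    """Count the words in a piece of text."""
    return len(text.split())

if __name__ == "__main__":
    # Speaks MCP over stdio, so any MCP-capable client (Claude, goose, etc.) can call word_count.
    mcp.run()
```

Because the interface is standardized, the same little server works with any MCP-capable client, which is exactly why vendor-neutral governance of the spec matters.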
OpenRouter’s “State of AI”: 100 Trillion Tokens of RealityOpenRouter and a16z dropped a massive report analyzing over 100 trillion tokens of real‑world usage. A few things stood out:Reasoning tokens now dominate. Above 50%, around 60% of all tokens since early 2025 are reasoning tokens. Remember when we went from “LLMs can’t do math” to reasoning models? That happened in about a year.Programming exploded. From 11% of usage early 2025 to over 50% recently. Claude holds 60% of the coding market. (at least.. on Open Router)Open source hit 30% market share, led by Chinese labs: DeepSeek (14T tokens), Qwen (5.59T), Meta LLaMA (3.96T).Context lengths grew massively. Average prompt length went from 1.5k to 6k+ tokens (4x growth), completions from 133 to 400 tokens (3x).The “Glass Slipper” effect. When users find a model that fits their use case, they stay loyal. Foundational early-user cohorts retain around 40% at month 5. Claude 4 Sonnet still had 50% retention after three months.Geography shift. Asia doubled to 31% of usage (China key), while North America is at 47%.Yam made a good point that we should be careful interpreting these graphs—they’re biased toward people trying new models, not necessarily steady usage. But the trends are clear: agentic, reasoning, and coding are the dominant use cases.Open Source Is Not Slowing Down (If Anything, It’s Accelerating)One of the strongest themes this week was just how fast open source is closing the gap — and in some areas, outright leading. We’re not talking about toy demos anymore. We’re talking about serious models, trained from scratch, hitting benchmarks that were frontier‑only not that long ago.Essential AI’s Rnj‑1: A Real Frontier 8B ModelThis one deserves real attention. Essential AI — led by Ashish Vaswani, yes Ashish from the original Transformers paper — released Rnj‑1, a pair of 8B open‑weight models trained fully from scratch. No distillation. No “just a fine‑tune.” This is a proper pretrain.What stood out to me isn’t just the benchmarks (though those are wild), but the philosophy. Rnj‑1 is intentionally focused on pretraining quality: data curation, code execution simulation, STEM reasoning, and agentic behaviors emerging during pretraining instead of being bolted on later with massive RL pipelines.In practice, that shows up in places like SWE‑bench Verified, where Rnj‑1 lands in the same ballpark as much larger closed models, and in math and STEM tasks where it punches way above its size. And remember: this is an 8B model you can actually run locally, quantize aggressively, and deploy without legal gymnastics thanks to its Apache 2.0 license.Mistral Devstral 2 + Vibe: Open Coding Goes HardMistral followed up last week’s momentum with Devstral 2, and Mistral Vibe! The headline numbers are: the 123B Devstral 2 model lands right at the top of open‑weight coding benchmarks, nearly matching Claude 3.5 Sonnet on SWE‑bench Verified. But what really excited the panel was the 24B Devstral Small 2, which hits high‑60s SWE‑bench scores while being runnable on consumer hardware.This is the kind of model you can realistically run locally as a coding agent, without shipping your entire codebase off to someone else’s servers. Pair that with Mistral Vibe, their open‑source CLI agent, and you suddenly have a credible, fully open alternative to things like Claude Code, Codex, or Gemini CLI.We talked a lot about why this matters. Some teams can’t send code to closed APIs. Others just don’t want to pay per‑token forever. 
And some folks — myself included — just like knowing what’s actually running under the hood. Devstral 2 checks all those boxes.🐝 This week’s Buzz (W&B): Trace OpenRouter traffic into Weave with zero codeWe did a quick “Buzz” segment on a feature that I think a lot of builders will love:OpenRouter launched Broadcast, which can stream traces to observability vendors. One of those destinations is W&B Weave.The magic here is: if you’re using a tool that already talks to OpenRouter, you can get tracing into Weave without instrumenting your code. That’s especially useful when instrumentation is hard (certain agent frameworks, black-box tooling, restricted environments, etc.).If you want to set it up: OpenRouter Broadcast settings.Vision Models Are Getting Practical (and Weirdly Competitive)Vision‑language models quietly had a massive week.Jina‑VLM: Small, Multilingual, and Very Good at DocsJina released a 2.4B VLM that’s absolutely dialed in on document understanding, multilingual VQA, and OCR‑heavy tasks. This is exactly the kind of model you’d want for PDFs, charts, scans, and messy real‑world docs — and it’s small enough to deploy without sweating too much.Z.ai GLM‑4.6V: Long Context, Tool Calling, Serious Agent PotentialZ.ai’s GLM‑4.6V impressed us with its 128K context, native tool calling from vision inputs, and strong performance on benchmarks like MathVista and WebVoyager. It’s one of the clearest examples yet of a VLM that’s actually built for agentic workflows, not just answering questions about images.That said, I did run my unofficial “bee counting test” on it… and yeah, Gemini still wins there 😅Perceptron Isaac 0.2: Tiny Models, Serious PerceptionPerceptron’s Isaac 0.2 (1B and 2B variants) showed something I really like seeing: structured outputs, focus tools, and reliability in very small models. Watching a 2B model correctly identify, count, and point to objects in an image is still wild to me.These are the kinds of models that make physical AI, robotics, and edge deployments actually feasible.🧰 Tools: Cursor goes visual, and Google Stitch keeps getting scarier (in a good way)Cursor: direct visual editing inside the codebaseCursor shipped a new feature that lets you visually manipulate UI elements—click/drag/resize—directly in the editor. We lumped this under “tools” because it’s not just a nicety; it’s the next step in “IDE as design surface.”Cursor is also iterating fast on debugging workflows. The meta trend: IDEs are turning into agent platforms, not text editors.Stitch by Google: Gemini 3 Pro as default, plus clickable prototypesI showed Stitch on the show because it’s one of the clearest examples of “distribution beats raw capability.”Stitch (Google’s product born from the Galileo AI acquisition) is doing Shipmas updates and now defaults to “Thinking with Gemini 3 Pro.” It can generate complex UIs, export them, and even stitch multiple screens into prototypes. The killer workflow is exporting directly into AI Studio / agent tooling so you can go from UI idea → code → repo without playing copy-paste Olympics.Site: https://stitch.withgoogle.com🎬 Disney invests $1B into OpenAI (and Sora gets Disney characters)This is the corporate story that made me do a double take.Disney—arguably the most IP-protective company on Earth—is investing $1B into OpenAI and enabling use of Disney characters in Sora. That’s huge. 
It signals the beginning of a more explicit “licensed synthetic media” era, where major IP holders decide which model vendors get official access.It also raises the obvious question: does Disney now go harder against other model providers that generate Disney-like content without permission?We talked about how weird the timing is too, given Disney has also been sending legal pressure in the broader space. The next year of AI video is going to be shaped as much by licensing and distribution as by model quality.Closing thoughts: the intelligence explosion is loud, messy, and acceleratingThis episode had everything: open-source models catching up fast, foundation-level standardization around agents, a usage report that shows what developers actually do with LLMs, voice models getting dramatically better, and OpenAI shipping what looks like a serious “we’re not losing” answer to Gemini 3.And yes: we’re also apparently putting GPUs in space.Next week’s episode is our year recap, and—of course—we now have to update it because GPT‑5.2 decided to show up like the final boss.If you missed any part of the show, check out the chapters in the podcast feed and jump around. See you next week.TL;DR + Show Notes (links for everything)Hosts* Alex Volkov — AI Evangelist @ Weights & Biases: @altryne. I host ThursdAI and spend an unhealthy amount of time trying to keep up with this firehose of releases.* Co-hosts — @WolframRvnwlf, @yampeleg, @nisten, @ldjconfirmed. Each of them brings a different “lens” (agents, infra, evaluation, open source, tooling), and it’s why the show works.Open Source LLMs* Essential AI — RNJ‑1 (8B base + instruct): tweet, blog, HF instruct, HF base. This is a from-scratch open pretrain led by Ashish Vaswani, and it’s one of the most important “Western open model” signals we’ve seen in a while.* Mistral — Devstral 2 + Devstral Small 2 + Mistral Vibe: tweet, Devstral Small 2 HF, Devstral 2 HF, news, mistral-vibe GitHub. Devstral is open coding SOTA territory, and Vibe is Mistral’s swing at the CLI agent layer.AI in Space* Starcloud trains and runs an LLM in orbit on an H100: Philip Johnston, Adi Oltean, CNBC, Karpathy reaction. A satellite H100 trained nanoGPT on Shakespeare and ran Gemma inference, igniting a real debate about power, cooling, repairability, and future orbital compute economics.Putnam Math Competition* Nous Research — Nomos 1 (Putnam scoring run): tweet, HF, GitHub harness, Hillclimb. This is a strong open-weight math reasoning model plus an open harness, and it shows how orchestration matters as much as raw weights.* Axiom — AxiomProver formal Lean proofs on Putnam: tweet, repo. Formal proofs are the “no excuses” version of math reasoning, and this is a serious milestone even if you argue about exact framing.Big Company LLMs + APIs* OpenAI — GPT‑5.2 release: Alex tweet, OpenAI announcement, ARC Prize verification, Sam Altman tweet. GPT‑5.2 brings major jumps in reasoning, long context, and agentic workflows, and it’s clearly positioned as an answer to the Gemini 3 era.* OpenRouter x a16z — State of AI report (100T+ tokens): tweet, landing page, PDF. The report highlights the dominance of programming/agents, the rise of reasoning tokens, and real-world usage patterns that explain why everyone is shipping agent harnesses.* Agentic AI Foundation under Linux Foundation (AAIF): Goose tweet, Block blog, aaif.io, Linux Foundation tweet. 
MCP + AGENTS.md + Goose moving into vendor-neutral governance is huge for interoperability and long-term ecosystem stability.* Disney invests $1B into OpenAI / Sora characters: (covered on the show as a major IP + distribution moment). This is an early signal of licensed synthetic media becoming a first-class business line rather than a legal gray zone.This week’s Buzz (W&B)* OpenRouter Broadcast → W&B Weave tracing: Broadcast settings. You can trace OpenRouter-based traffic into Weave with minimal setup, which is especially useful when you can’t (or don’t want to) instrument code directly.Vision & Video* Jina — jina‑VLM (2.4B): tweet, arXiv, HF, blog. A compact multilingual VLM optimized for doc understanding and VQA.* Z.ai — GLM‑4.6V + Flash: tweet, HF collection, GLM‑4.6V, Flash, blog. Strong open VLMs with tool calling and long context, even if my bee counting test still humbled it.* Perceptron — Isaac 0.2 (1B/2B): tweet, HF 2B, HF 1B, blog, demo. The Focus/zoom tooling and structured outputs point toward “VLMs as reliable perception modules,” not just chatty describers.Voice & Audio* Google DeepMind — Gemini 2.5 TTS (Flash + Pro): AI Studio tweet, GoogleAI devs tweet, blog, AI Studio speech playground. The key upgrades are control and consistency (emotion, pacing, multi-speaker) across many languages.* OpenBMB — VoxCPM 1.5: tweet, HF, GitHub. Open TTS keeps getting better, and this release is especially interesting for fine-tuning and voice cloning workflows.Tools* Cursor — direct visual editing (new UI workflow): (covered on the show as a major step toward “IDE as design surface”). Cursor continues to push the agentic IDE category into new territory.* Stitch by Google — Shipmas updates + Gemini 3 Pro “Thinking” + Prototypes: tweet 1, tweet 2, site, plus background articles: TechCrunch launch, acquisition detail. Stitch is turning prompt-to-UI into a full prototype-to-code pipeline with real export paths. This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe
    --------  
    1:37:01
  • 📆 ThursdAI - Dec 4, 2025 - DeepSeek V3.2 Goes Gold Medal, Mistral Returns to Apache 2.0, OpenAI Hits Code Red, and US-Trained MOEs Are Back!
    Hey yall, Alex here 🫡 Welcome to the first ThursdAI of December! Snow is falling in Colorado, and AI releases are falling even harder. This week was genuinely one of those “drink from the firehose” weeks where every time I refreshed my timeline, another massive release had dropped.We kicked off the show asking our co-hosts for their top AI pick of the week, and the answers were all over the map: Wolfram was excited about Mistral’s return to Apache 2.0, Yam couldn’t stop talking about Claude Opus 4.5 after a full week of using it, and Nisten came out of left field with an AWQ quantization of Prime Intellect’s model that apparently runs incredibly fast on a single GPU. As for me? I’m torn between Opus 4.5 (which literally fixed bugs that Gemini 3 created in my code) and DeepSeek’s gold-medal winning reasoning model.Speaking of which, let’s dive into what happened this week, starting with the open source stuff that’s been absolutely cooking. Open Source LLMsDeepSeek V3.2: The Whale Returns with Gold MedalsThe whale is back, folks! DeepSeek released two major updates this week: V3.2 and V3.2-Speciale. And these aren’t incremental improvements—we’re talking about an open reasoning-first model that’s rivaling GPT-5 and Gemini 3 Pro with actual gold medal Olympiad wins.Here’s what makes this release absolutely wild: DeepSeek V3.2-Speciale is achieving 96% on AIME versus 94% for GPT-5 High. It’s getting gold medals on IMO (35/42), CMO, ICPC (10/12), and IOI (492/600). This is a 685 billion parameter MOE model with MIT license, and it literally broke the benchmark graph on HMMT 2025—the score was so high it went outside the chart boundaries. That’s how you DeepSeek, basically.But it’s not just about reasoning. The regular V3.2 (not Speciale) is absolutely crushing it on agentic benchmarks: 73.1% on SWE-Bench Verified, first open model over 35% on Tool Decathlon, and 80.3% on τ²-bench. It’s now the second most intelligent open weights model and ranks ahead of Grok 4 and Claude Sonnet 4.5 on Artificial Analysis.The price is what really makes this insane: 28 cents per million tokens on OpenRouter. That’s absolutely ridiculous for this level of performance. They’ve also introduced DeepSeek Sparse Attention (DSA) which gives you 2-3x cheaper 128K inference without performance loss. LDJ pointed out on the show that he appreciates how transparent they’re being about not quite matching Gemini 3’s efficiency on reasoning tokens, but it’s open source and incredibly cheap.One thing to note: V3.2-Speciale doesn’t support tool calling. As Wolfram pointed out from the model card, it’s “designed exclusively for deep reasoning tasks.” So if you need agentic capabilities, stick with the regular V3.2.Check out the full release on Hugging Face or read the announcement.Mistral 3: Europe’s Favorite AI Lab Returns to Apache 2.0Mistral is back, and they’re back with fully open Apache 2.0 licenses across the board! This is huge news for the open source community. They released two major things this week: Mistral Large 3 and the Ministral 3 family of small models.Mistral Large 3 is a 675 billion parameter MOE with 41 billion active parameters and a quarter million (256K) context window, trained on 3,000 H200 GPUs. There’s been some debate about this model’s performance, and I want to address the elephant in the room: some folks saw a screenshot showing Mistral Large 3 very far down on Artificial Analysis and started dunking on it. 
But here’s the key context that Merve from Hugging Face pointed out—this is the only non-reasoning model on that chart besides GPT 5.1. When you compare it to other instruction-tuned (non-reasoning) models, it’s actually performing quite well, sitting at #6 among open models on LMSys Arena.Nisten checked LM Arena and confirmed that on coding specifically, Mistral Large 3 is scoring as one of the best open source coding models available. Yam made an important point that we should compare Mistral to other open source players like Qwen and DeepSeek rather than to closed models—and in that context, this is a solid release.But the real stars of this release are the Ministral 3 small models: 3B, 8B, and 14B, all with vision capabilities. These are edge-optimized, multimodal, and the 3B actually runs completely in the browser with WebGPU using transformers.js. The 14B reasoning variant achieves 85% on AIME 2025, which is state-of-the-art for its size class. Wolfram confirmed that the multilingual performance is excellent, particularly for German.There’s been some discussion about whether Mistral Large 3 is a DeepSeek finetune given the architectural similarities, but Mistral claims these are fully trained models. As Nisten noted, even if they used similar architecture (which is Apache 2.0 licensed), there’s nothing wrong with that—it’s an excellent architecture that works. Lucas Atkins later confirmed on the show that “Mistral Large looks fantastic... it is DeepSeek through and through architecture wise. But Kimi also does that—DeepSeek is the GOAT. Training MOEs is not as easy as just import deepseak and train.”Check out Mistral Large 3 and Ministral 3 on Hugging Face.Arcee Trinity: US-Trained MOEs Are BackWe had Lucas Atkins, CTO of Arcee AI, join us on the show to talk about their new Trinity family of models, and this conversation was packed with insights about what it takes to train MOEs from scratch in the US.Trinity is a family of open-weight MOEs fully trained end-to-end on American infrastructure with 10 trillion curated tokens from Datology.ai. They released Trinity-Mini (26B total, 3B active) and Trinity-Nano-Preview (6B total, 1B active), with Trinity-Large (420B parameters, 13B active) coming in mid-January 2026.The benchmarks are impressive: Trinity-Mini hits 84.95% on MMLU (0-shot), 92.1% on Math-500, and 65% on GPQA Diamond. But what really caught my attention was the inference speed—Nano generates at 143 tokens per second on llama.cpp, and Mini hits 157 t/s on consumer GPUs. They’ve even demonstrated it running on an iPhone via MLX Swift.I asked Lucas why it matters where models come from, and his answer was nuanced: for individual developers, it doesn’t really matter—use the best model for your task. But for Fortune 500 companies, compliance and legal teams are getting increasingly particular about where models were trained and hosted. This is slowing down enterprise AI adoption, and Trinity aims to solve that.Lucas shared a fascinating insight about why they decided to do full pretraining instead of just post-training on other people’s checkpoints: “We at Arcee were relying on other companies releasing capable open weight models... 
I didn’t like the idea of the foundation of our business being reliant on another company releasing models.” He also dropped some alpha about Trinity-Large: they’re going with 13B active parameters instead of 32B because going sparser actually gave them much faster throughput on Blackwell GPUs.The conversation about MOEs being cheaper for RL was particularly interesting. Lucas explained that because MOEs are so inference-efficient, you can do way more rollouts during reinforcement learning, which means more RL benefit per compute dollar. This is likely why we’re seeing labs like MiniMax go from their original 456B/45B-active model to a leaner 220B/10B-active model—they can get more gains in post-training by being able to do more steps.Check out Trinity-Mini and Trinity-Nano-Preview on Hugging Face, or read The Trinity Manifesto.OpenAI Code Red: Panic at the Disco (and Garlic?)It was ChatGPT’s 3rd birthday this week (Nov 30th), but the party vibes seem… stressful. Reports came out that Sam Altman has declared a “Code Red” at OpenAI.Why? Gemini 3.The user numbers don’t lie. ChatGPT apparently saw a 6% drop in daily active users following the Gemini 3 launch. Google’s integration is just too good, and their free tier is compelling.In response, OpenAI has supposedly paused “side projects” (ads, shopping bots) to focus purely on model intelligence and speed. Rumors point to a secret model codenamed “Garlic”—a leaner, more efficient model that beats Gemini 3 and Claude Opus 4.5 on coding reasoning, targeting a release in early 2026 (or maybe sooner if they want to save Christmas).Wolfram and Yam nailed the sentiment here: Integration wins. Wolfram’s family uses Gemini because it’s right there on the Pixel, controlling the lights and calendar. OpenAI needs to catch up not just on IQ, but on being helpful in the moment.Post the live show, OpenAI also finally added GPT 5.1 Codex Max we covered 2 weeks ago to their API and it’s now available in Cursor, for free, until Dec 11! Amazon Nova 2: Enterprise Push with Serious Agentic ChopsAmazon came back swinging with Nova 2, and the jump on Artificial Analysis is genuinely impressive—from around 30% to 61% on their index. That’s a massive improvement.The family includes Nova 2 Lite (7x cheaper, 5x faster than Nova Premier), Nova 2 Pro (93% on τ²-Bench Telecom, 70% on SWE-Bench Verified), Nova 2 Sonic (speech-to-speech with 1.39s time-to-first-audio), and Nova 2 Omni (unified text/image/video/speech with 1M token context window—you can upload 90 minutes of video!).Gemini 3 Deep Think ModeGoogle launched Gemini 3 Deep Think mode exclusively for AI Ultra subscribers, and it’s hitting some wild benchmarks: 45.1% on ARC-AGI-2 (a 2x SOTA leap using code execution), 41% on Humanity’s Last Exam, and 93.8% on GPQA Diamond. This builds on their Gemini 2.5 variants that earned gold medals at IMO and ICPC World Finals. The parallel reasoning approach explores multiple hypotheses simultaneously, but it’s compute-heavy—limited to 10 prompts per day at $77 per ARC-AGI-2 task.This Week’s Buzz: Mid-Training Evals are Here!A huge update from us at Weights & Biases this week: We launched LLM Evaluation Jobs. (Docs)If you are training models or finetuning, you usually wait until the end to run your expensive benchmarks. Now, directly inside W&B, you can trigger evaluations on mid-training checkpoints.It integrates with Inspect Evals (over 100+ public benchmarks). 
You just point it to your checkpoint or an API endpoint (even OpenRouter!), select the evals (MMLU-Pro, GPQA, etc.), and we spin up the managed GPUs to run it. You get a real-time leaderboard of your runs vs. the field.Also, a shoutout to users of Neptune.ai—congrats on the acquisition by OpenAI, but since the service is shutting down, we have built a migration script to help you move your history over to W&B seamlessly. We aren’t going anywhere!Video & Vision: Physics, Audio, and SpeedThe multimodal space was absolutely crowded this week.Runway Gen 4.5 (”Whisper Thunder”)Runway revealed that the mysterious “Whisper Thunder” model topping the leaderboards is actually Gen 4.5. The key differentiator? Physics and Multi-step adherence. It doesn’t have that “diffusion wobble” anymore. We watched a promo video where the shot changes every 3-4 seconds, and while it’s beautiful, it shows we still haven’t cracked super long consistent takes yet. But for 8-second clips? It’s apparently the new SOTA.Kling 2.6: Do you hear that?Kling hit back with Video 2.6, and the killer feature is Native Audio. I generated a clip of two people arguing, and the lip sync was perfect. Not “dubbed over” perfect, but actively generated with the video. It handles multi-character dialogue, singing, and SFX. It’s huge for creators.Kling was on a roll this week, releasing not one, but two Video Models (O1 Video is an omni modal one that takes Text, Images and Audio as inputs) and O1 Image and Kling Avatar 2.0 are also great updates! (Find all their releases on X)P-Image: Sub-Second Generation at Half a CentLast week we talked about ByteDance’s Z-Image, which was super cool and super cheap. Well, this week Pruna AI came out with P-Image, which is even faster and cheaper: image generation under one second for $0.005, and editing under one second for $0.01.I built a Chrome extension this week (completely rewritten by Opus 4.5, by the way—more on that in a second) that lets me play with these new image models inside the Infinite Craft game. When I tested P-Image Turbo against Z-Image, I was genuinely impressed by the quality at that speed. If you want quick iterations before moving to something like Nano Banana Pro for final 4K output, these sub-second models are perfect.The extension is available on GitHub if you want to try it—you just need to add your Replicate or Fal API keys.SeeDream 4.5: ByteDance Levels UpByteDance also launched SeeDream 4.5 in open beta, with major improvements in detail fidelity, spatial reasoning, and multi-image reference fusion (up to 10 inputs for consistent storyboards). The text rendering is much sharper, and it supports multilingual typography including Japanese. Early tests show it competing well with Nano Banana Pro in prompt adherence and logic.🎤 Voice & AudioMicrosoft VibeVoice-Realtime-0.5BIn a surprise drop, Microsoft open-sourced VibeVoice-Realtime-0.5B, a compact TTS model optimized for real-time applications. It delivers initial audible output in just 300 milliseconds while generating up to 10 minutes of speech. The community immediately started creating mirrors because, well, Microsoft has a history of releasing things on Hugging Face and then having legal pull them down. Get it while it’s hot!Use Cases: Code, Cursors, and “Antigravity”We wrapped up with some killer practical tips:* Opus 4.5 is a beast: As I mentioned, using Opus inside Cursor’s “Ask” mode is currently the supreme coding experience. It debugs logic flaws that Gemini misses completely. 
I also used Opus as a prompt engineer for my infographics, and it absolutely demolished GPT at creating the specific layouts I needed* Google’s Secret: Nisten dropped a bomb at the end of the show—Opus 4.5 is available for free inside Google’s Antigravity (and Colab)! If you want to try the model that’s beating GPT-5 without paying, go check Antigravity now before they patch it or run out of compute.* Microsoft VibeVoice: A surprise drop of a 0.5B speech model on HuggingFace that does real-time TTS (300ms latency). It was briefly questionable if it would stay up, but mirrors are already everywhere.That’s a wrap for this week, folks. Next week is probably going to be our final episode of the year, so we’ll be doing recaps and looking at our predictions from last year. Should be fun to see how wrong we were about everything!Thank you for tuning in. If you missed the live stream, subscribe to our Substack, YouTube, and wherever you get your podcasts. See you next Thursday!TL;DR and Show NotesHosts and Guests* Alex Volkov - AI Evangelist & Weights & Biases (@altryne)* Co Hosts - @WolframRvnwlf, @yampeleg, @nisten, @ldjconfirmed* Guest - Lucas Atkins (@latkins) - CTO Arcee AIOpen Source LLMs* DeepSeek V3.2 and V3.2-Speciale - Gold medal olympiad wins, MIT license (X, HF V3.2, HF Speciale, Announcement)* Mistral 3 family - Large 3 and Ministral 3, Apache 2.0 (X, Blog, HF Large, HF Ministral)* Arcee Trinity - US-trained MOE family (X, HF Mini, HF Nano, Blog)* Hermes 4.3 - Decentralized training, SOTA RefusalBench (X, HF)Big CO LLMs + APIs* OpenAI Code Red - ChatGPT 3rd birthday, Garlic model in development (The Information)* Amazon Nova 2 - Lite, Pro, Sonic, and Omni models (X, Blog)* Gemini 3 Deep Think - 45.1% ARC-AGI-2 (X, Blog)* Cursor + GPT-5.1-Codex-Max - Free until Dec 11 (X, Blog)This Week’s Buzz* WandB LLM Evaluation Jobs - Evaluate any OpenAI-compatible API (X, Announcement)Vision & Video* Runway Gen-4.5 - #1 on text-to-video leaderboard, 1,247 Elo (X)* Kling VIDEO 2.6 - First native audio generation (X)* Kling O1 Image - Image generation (X)Voice & Audio* Microsoft VibeVoice-Realtime-0.5B - 300ms latency TTS (X, HF)AI Art & Diffusion* Pruna P-Image - Sub-second generation at $0.005 (X, Blog, Demo)* SeeDream 4.5 - Multi-reference fusion, text rendering (X) This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe
    --------  
    1:33:54
  • ThursdAI Special: Google's New Anti-Gravity IDE, Gemini 3 & Nano Banana Pro Explained (ft. Kevin Hou, Ammaar Reshi & Kat Kampf)
    Hey, Alex here. I recorded these conversations just in front of the AI Engineer auditorium, back to back, right after these great folks gave their talks, and at the peak of the most epic AI week we’ve seen since I started recording ThursdAI. This is less our traditional live recording and more a real podcast-y conversation with great folks, inspired by Latent.Space. I hope you enjoy this format as much as I’ve enjoyed recording and editing it.
AntiGravity with Kevin
Kevin Hou and team just launched Antigravity, Google’s brand new agentic IDE based on VS Code, and Kevin (a second-timer on ThursdAI) was awesome enough to hop on and talk about some of the product decisions they made and what makes Antigravity special, and he highlighted Artifacts as a completely new primitive.
Gemini 3 in AI Studio
If you aren’t using Google’s AI Studio (ai.dev) then you’re missing out! We talk about AI Studio all the time on the show, and I’m a daily user! I generate most of my images with Nano Banana Pro in there, and most of my Gemini conversations happen there as well! Ammaar and Kat were so fun to talk to, as they covered the newly shipped “build mode” which allows you to vibe code full apps and experiences inside AI Studio, and we also covered Gemini 3’s features, multimodal understanding, and UI capabilities. These folks gave a LOT of Gemini 3 demos, so they know everything there is to know about this model’s capabilities! I tried new things with this one: multi-camera angles, conversations with great folks. If you found this content valuable, please subscribe :)
Topics Covered:
* Inside Google’s new “AntiGravity” IDE
* How the “Agent Manager” changes coding workflows
* Gemini 3’s new multimodal capabilities
* The power of “Artifacts” and dynamic memory
* Deep dive into AI Studio updates & Vibe Coding
* Generating 4K assets with Nano Banana Pro
Timestamps for your viewing convenience:
00:00 - Introduction and Overview
01:13 - Conversation with Kevin Hou: Anti-Gravity IDE
01:58 - Gemini 3 and Nano Banana Pro Launch Insights
03:06 - Innovations in Anti-Gravity IDE
06:56 - Artifacts and Dynamic Memory
09:48 - Agent Manager and Multimodal Capabilities
11:32 - Chrome Integration and Future Prospects
20:11 - Conversation with Ammaar and Kat: AI Studio Team
21:21 - Introduction to AI Studio
21:51 - What is AI Studio?
22:52 - Ease of Use and User Feedback
24:06 - Live Demos and Launch Week
26:00 - Design Innovations in AI Studio
30:54 - Generative UIs and Vibe Coding
33:53 - Nano Banana Pro and Image Generation
39:45 - Voice Interaction and Future Roadmap
44:41 - Conclusion and Final Thoughts
Looking forward to seeing you on Thursday 🫡 P.S - I’ve recorded one more conversation during AI Engineer and will be posting it soon: same format, very interesting person, keep an eye out! This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe
    --------  
    46:04
  • 🦃 ThursdAI - Thanksgiving special 25’ - Claude 4.5, Flux 2 & Z-image vs 🍌, MCP gets Apps + New DeepSeek!?
    Hey y’all, happy Thanksgiving to everyone who celebrates, and thank you for being a subscriber; I truly appreciate each and every one of you! Just wrapped up the third (1, 2) Thanksgiving special episode of ThursdAI. Can you believe November is almost over? We had another banger week in AI, with a full feast of AI releases: Anthropic dropped the long-awaited Opus 4.5, which quickly became the top coding LLM, DeepSeek resurfaced with a math model, BFL and Tongyi both tried to take on Nano Banana, and Microsoft dropped a 7B computer-use model in open source, plus Intellect 3 from Prime Intellect! With so much news to cover, we also had an interview with Ido Sal & Liad Yosef (their second time on the show!) about MCP-Apps, the new standard they are spearheading together with Anthropic, OpenAI & more! Exciting episode, let’s get into it! (P.S - I started generating infographics, so the show became much more visual, LMK if you like them)
ThursdAI - I put in a lot of work on a weekly basis to bring you the live show, podcast and a sourced newsletter! Please subscribe if you find this content valuable!
Anthropic’s Opus 4.5: The “Premier Intelligence” Returns (Blog)
Folks, Anthropic absolutely cooked. After Sonnet and Haiku had their time in the sun, the big brother is finally back. Opus 4.5 launched this week, and it is reclaiming the throne for coding and complex agentic tasks. First off, the specs are monstrous. It hits 80.9% on SWE-bench Verified, topping GPT-5.1 (77.9%) and Gemini 3 Pro (76.2%). But the real kicker? The price! It is now $5 per million input tokens and $25 per million output—literally one-third the cost of the previous Opus. Yam, our resident coding wizard, put it best during the show: “Opus knows a lot of tiny details about the stack that you didn’t even know you wanted... It feels like it can go forever.” Unlike Sonnet, which sometimes spirals or loses context on extremely long tasks, Opus 4.5 maintains coherence deep into the conversation. Anthropic also introduced a new “Effort” parameter, allowing you to control how hard the model thinks (similar to o1 reasoning tokens). Set it to high, and you get massive performance gains; set it to medium, and you get Sonnet-level performance at a fraction of the token cost. Plus, they’ve added Tool Search (cutting enormous token overhead for agents with many tools) and Programmatic Tool Calling, which effectively lets Opus write and execute code loops to manage data. If you are doing heavy software engineering or complex automations, Opus 4.5 is the new daily driver.
📱 The Agentic Web: MCP Apps & MCP-UI Standard
Speaking of MCP updates, can you believe it’s been exactly one year since the Model Context Protocol (MCP) launched? We’ve been “MCP-pilled” for a while, but this week, the ecosystem took a massive leap forward. We brought back our friends Ido and Liad, the creators of MCP-UI, to discuss huge news: MCP-UI has been officially standardized as MCP Apps. This is a joint effort adopted by both Anthropic and OpenAI. Why does this matter? Until now, when an LLM used a tool (like Spotify or Zillow), the output was just text. It lost the brand identity and the user experience. With MCP Apps, agents can now render full, interactive HTML interfaces directly inside the chat! Ido and Liad explained that they worked hard to avoid an “iOS vs. Android” fragmentation war. 
Instead of every lab building their own proprietary app format, we now have a unified standard for the “Agentic Web.” This is how AI stops being a chatbot and starts being an operating system.Check out the standard at mcpui.dev.🦃 The Open Source Thanksgiving FeastWhile the big labs were busy, the open-source community decided to drop enough papers and weights to feed us for a month.Prime Intellect unveils INTELLECT-3, a 106B MoE (X, HF, Blog, Try It)Prime Intellect releases INTELLECT-3, a 106B parameter Mixture-of-Experts model (12B active params) based on GLM-4.5-Air, achieving state-of-the-art performance for its size—including ~90% on AIME 2024/2025 math contests, 69% on LiveCodeBench v6 coding, 74% on GPQA-Diamond reasoning, and 74% on MMLU-Pro—outpacing larger models like DeepSeek-R1. Trained over two months on 512 H200 GPUs using their fully open-sourced end-to-end stack (PRIME-RL async trainer, Verifiers & Environments Hub, Prime Sandboxes), it’s now hosted on Hugging Face, OpenRouter, Parasail, and Nebius, empowering any team to scale frontier RL without big-lab resources. Especially notable is their very detailed release blog, covering how a lab that previously trained 32B, finetunes a monster 106B MoE model! Tencent’s HunyuanOCR: Small but Mighty (X, HF, Github, Blog)Tencent released HunyuanOCR, a 1 billion parameter model that is absolutely crushing benchmarks. It scored 860 on OCRBench, beating massive models like Qwen3-VL-72B. It’s an end-to-end model, meaning no separate detection and recognition steps. Great for parsing PDFs, docs, and even video subtitles. It’s heavily restricted (no EU/UK usage), but technically impressive.Microsoft’s Fara-7B: On-Device Computer UseMicrosoft quietly dropped Fara-7B, a model fine-tuned from Qwen 2.5, specifically designed for computer use agentic tasks. It hits 73.5% on WebVoyager, beating OpenAI’s preview models, all while running locally on-device. This is the dream of a local agent that can browse the web for you, click buttons, and book flights without sending screenshots to the cloud.DeepSeek-Math-V2: open-weights IMO-gold math LLM (X, HF)DeepSeek-Math-V2 is a 685B-parameter, Apache-2.0 licensed, open-weights mathematical reasoning model claiming gold-medal performance on IMO 2025 and CMO 2024, plus a near-perfect 118/120 on Putnam 2024. Nisten did note some limitations—specifically that the context window can get choked up on extremely long, complex proofs—but having an open-weight model of this caliber is a gift to researchers everywhere.🐝 This Week’s Buzz: Serverless LoRA InferenceA huge update from us at Weights & Biases! We know fine-tuning is powerful, but serving those fine-tunes can be a pain and expensive. We just launched Serverless LoRA Inference.This means you can upload your small LoRA adapters (which you can train cheaply) to W&B Artifacts, and we will serve them instantly on CoreWeave GPUs on top of a base model. No cold starts, no dedicated expensive massive GPU instances for just one adapter.I showed a demo of a “Mocking SpongeBob” model I trained in 25 minutes. It just adds that SaRcAsTiC tExT style to the Qwen 2.5 base model. You pass the adapter ID in the API call, and boom—customized intelligence instantly. You can get more details HERE and get started with your own LORA in this nice notebook the team made! 🎨 Visuals: Image & Video Generation ExplosionFlux.2: The Multi-Reference Image Creator from BFL (X, HF, Blog)Black Forest Labs released Flux.2, a series of models including a 32B Flux 2[DEV]. 
The killer feature here is Multi-Reference Editing. You can feed it up to 10 reference images to maintain character consistency, style, or specific objects. It also outputs native 4-megapixel images.Honestly, the launch timing was rough, coming right after Google’s Nano Banana Pro and alongside Z-Image, but for precise, high-res editing, this is a serious tool.Tongyi drops Z-Image Turbo: 6B single-stream DiT lands sub‑second, 8‑step text‑to‑image (GitHub, Hugging Face)Alibaba’s Tongyi Lab released Z-Image Turbo, a 6B parameter model that generates images in sub-second time on H800s (and super fast on consumer cards).I built a demo to show just how fast this is. You know that “Infinite Craft“ game? I hooked it up to Z-Image Turbo so that every time you combine elements (like Pirate + Ghost), it instantly generates the image for “Ghost Pirate.” It changes the game completely when generation is this cheap and fast.HunyuanVideo 1.5 – open video gets very realTencent also shipped HunyuanVideo 1.5, which they market as “the strongest open‑source video generation model.” For once, the tagline isn’t entirely hype.Under the hood it’s an 8.3B‑parameter Diffusion Transformer (DiT) model with a 3D causal VAE in front. The VAE compresses videos aggressively in both space and time, and the DiT backbone models that latent sequence.The important bits for you and me:* It generates 5–10 second clips at 480p/720p with good motion coherence and physics.* With FP16 or FP8 configs you can run it on a single consumer GPU with around 14GB VRAM.* There’s a built‑in path to upsample to 1080p for “real” quality.LTX Studio Retake: Photoshop for Video (X, Try It)For the video creators, LTX Studio launched Retake. This isn’t just “regenerate video.” This allows you to select a specific 2-second segment of a video, change the dialogue (keeping the voice!), change the emotion, or edit the action, all for like $0.10. It blends it perfectly back into the original clip. We are effectively getting a “Director Mode” for AI video where you can fix mistakes without rolling the dice on a whole new generation.A secret new model on the Arena called Whisper Thunder beats even Veo 3?This was a surprise of the week, while new video models get released often, Veo 3 has been the top one for a while, and now we’re getting a reshuffling of the video giants! But... we don’t yet know who this video model is from! You can sometimes get its generations at the Artificial Analysis video arena here, and the outputs look quite awesome! Thanksgiving reflections from the ThursdAI teamAs we wrapped up the show, Wolfram suggested we take a moment to think about what we’re thankful for in AI, and I think that’s a perfect note to end on.Wolfram put it well: he’s thankful for everyone contributing to this wonderful community—the people releasing models, creating open source tools, writing tutorials, sharing knowledge. It’s not just about the money; it’s about the love of learning and building together.Yam highlighted something I think is crucial: we’ve reached a point where there’s no real competition between open source and closed source anymore. Everything is moving forward together. Even if you think nobody’s looking at that random code you posted somewhere, chances are someone found it and used it to accelerate their own work. 
    That collective effort is what’s driving this incredible pace of progress. For me, I want to thank Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan Gomez, Łukasz Kaiser, and Ilya Polosukhin for the 2017 paper “Attention Is All You Need.” Half joking! But without the seminal Attention Is All You Need paper, none of this AI was possible. But mostly I want to thank all of you—the audience, the co-hosts, the guests—for making ThursdAI what it is. If you go back and watch our 2024 Thanksgiving episode, or the one from 2023, you’ll be shocked at how far we’ve come. Tools that seemed magical a year ago are now just... normal. That’s hedonic adaptation at work, but it’s also a reminder to stay humble and appreciate just how incredible this moment in history really is. We’re living through the early days of a technological revolution, and we get to document it, experiment with it, and help shape where it goes. That’s something to be genuinely thankful for.
TL;DR and Show Notes
* Hosts and Guests
* Alex Volkov - AI Evangelist @ Weights & Biases (@altryne)
* Co-Hosts - @WolframRvnwlf @yampeleg @nisten @ldjconfirmed
* Guests: @idosal1 @liadyosef - MCP-UI/MCP Apps
* Big CO LLMs + APIs
* Anthropic launches Claude Opus 4.5 - world’s top model for coding, agents, and tool use (X, Announcement, Blog)
* OpenAI Integrates ChatGPT Voice Mode Directly into Chats (X)
* Open Source LLMs
* Prime Intellect - INTELLECT-3 106B MoE (X, HF, Blog, Try It)
* Tencent - HunyuanOCR 1B SOTA OCR model (X, HF, Github, Blog)
* Microsoft - Fara-7B on-device computer-use agent (X, Blog, HF, Github)
* DeepSeek - Math-V2 IMO-gold math LLM (HF)
* Interview: MCP Apps
* MCP-UI standardized as MCP Apps by Anthropic and OpenAI (X, Blog, Announcement)
* Vision & Video
* Tencent - HunyuanVideo 1.5 lightweight DiT open video model (X, GitHub, HF)
* LTX Studio - Retake AI video editing tool (X, Try It)
* Whisper Thunder - mystery #1 ranked video model on arena
* AI Art & Diffusion
* Black Forest Labs - FLUX.2 32B multi-reference image model (X, HF, Blog)
* Tongyi - Z-Image Turbo sub-second 6B image gen (GitHub, HF)
* This Week’s Buzz
* W&B launches Serverless LoRA Inference on CoreWeave (X, Blog, Notebook)
This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe
    --------  
    1:21:18
  • 📆 ThursdAI - the week that changed the AI landscape forever - Gemini 3, GPT codex max, Grok 4.1 & fast, SAM3 and Nano Banana Pro
    Hey everyone, Alex here 👋I’m writing this one from a noisy hallway at the AI Engineer conference in New York, still riding the high (and the sleep deprivation) from what might be the craziest week we’ve ever had in AI.In the span of a few days:Google dropped Gemini 3 Pro, a new Deep Think mode, generative UIs, and a free agent-first IDE called Antigravity.xAI shipped Grok 4.1, then followed it up with Grok 4.1 Fast plus an Agent Tools API.OpenAI answered with GPT‑5.1‑Codex‑Max, a long‑horizon coding monster that can work for more than a day, and quietly upgraded ChatGPT Pro to GPT‑5.1 Pro.Meta looked at all of that and said “cool, we’ll just segment literally everything and turn photos into 3D objects” with SAM 3 and SAM 3D.Robotics folks dropped a home robot trained with almost no robot data.And Google, just to flex, capped Thursday with Nano Banana Pro, a 4K image model and a provenance system while we were already live! For the first time in a while it doesn’t just feel like “new models came out.” It feels like the future actually clicked forward a notch.This is why ThursdAI exists. Weeks like this are basically impossible to follow if you have a day job, so my co‑hosts and I do the no‑sleep version so you don’t have to. Plus, being at AI Engineer makes it easy to get super high quality guests so this week we had 3 folks join us, Swyx from Cognition/Latent Space, Thor from DeepMind (on his 3rd day) and Dominik from OpenAI! Alright, deep breath. Let’s untangle the week.TL;DR If you only skim one section, make it this one (links in the end):* Google* Gemini 3 Pro: 1M‑token multimodal model, huge reasoning gains - new LLM king* ARC‑AGI‑2: 31.11% (Pro), 45.14% (Deep Think) – enormous jumps* Antigravity IDE: free, Gemini‑powered VS Code fork with agents, plans, walkthroughs, and browser control* Nano Banana Pro: 4K image generation with perfect text + SynthID provenance; dynamic “generative UIs” in Gemini* xAI* Grok 4.1: big post‑training upgrade – #1 on human‑preference leaderboards, much better EQ & creative writing, fewer hallucinations* Grok 4.1 Fast + Agent Tools API: 2M context, SOTA tool‑calling & agent benchmarks (Berkeley FC, T²‑Bench, research evals), aggressive pricing and tight X + web integration* OpenAI* GPT‑5.1‑Codex‑Max: “frontier agentic coding” model built for 24h+ software tasks with native compaction for million‑token sessions; big gains on SWE‑Bench, SWE‑Lancer, TerminalBench 2* GPT‑5.1 Pro: new “research‑grade” ChatGPT mode that will happily think for minutes on a single query* Meta* SAM 3: open‑vocabulary segmentation + tracking across images and video (with text & exemplar prompts)* SAM 3D: single‑image → 3D objects & human bodies; surprisingly high‑quality 3D from one photo* Robotics* Sunday Robotics – ACT‑1 & Memo: home robot foundation model trained from a $200 skill glove instead of $20K teleop rigs; long‑horizon household tasks with solid zero‑shot generalization* Developer Tools* Antigravity and Marimo’s VS Code / Cursor extension both push toward agentic, reactive dev workflowsLive from AI Engineer New York: Coding Agents Take Center StageWe recorded this week’s show on location at the AI Engineer Summit in New York, inside a beautiful podcast studio the team set up right on the expo floor. 
Huge shout out to Swyx, Ben, and the whole AI Engineer crew for that — last time I was balancing a mic on a hotel nightstand, this time I had broadcast‑grade audio while a robot dog tried to steal the show behind us.This year’s summit theme is very on‑the‑nose for this week: coding agents.Everywhere you look, there’s a company building an “agent lab” on top of foundation models. Amp, Cognition, Cursor, CodeRabbit, Jules, Google Labs, all the open‑source folks, and even the enterprise players like Capital One and Bloomberg are here, trying to figure out what it means to have real software engineers that are partly human and partly model.Swyx framed it nicely when he said that if you take “vertical AI” seriously enough, you eventually end up building an agent lab. Lawyers, healthcare, finance, developer tools — they all converge on “agents that can reason and code.”The big labs heard that theme loud and clear. Almost every major release this week is about agents, tools, and long‑horizon workflows, not just chat answers.Google Goes All In: Gemini 3 Pro, Antigravity, and the Agent RevolutionLet’s start with Google because, after years of everyone asking “where’s Google?” in the AI race, they showed up this week with multiple bombshells that had even the skeptics impressed.Gemini 3 Pro: Multimodal Intelligence That Actually DeliversGoogle finally released Gemini 3 Pro, and the numbers are genuinely impressive. We’re talking about a 1 million token context window, massive benchmark improvements, and a model that’s finally competing at the very top of the intelligence charts. Thor from DeepMind joined us on the show (literally on day 3 of his new job!) and you could feel the excitement.The headline numbers: Gemini 3 Pro with Deep Think mode achieved 45.14% on ARC-AGI-2—that’s roughly double the previous state-of-the-art on some splits. For context, ARC-AGI has been one of those benchmarks that really tests genuine reasoning and abstraction, not just memorization. The standard Gemini 3 Pro hits 31.11% on the same benchmark, both scores are absolutely out of this world in Arc! On GPQA Diamond, Gemini 3 Pro jumped about 10 points compared to prior models. We’re seeing roughly 81% on MMLU-Pro, and the coding performance is where things get really interesting—Gemini 3 Pro is scoring around 56% on SciCode, representing significant improvements in actual software engineering tasks.But here’s what made Ryan from Amp switch their default model to Gemini 3 Pro immediately: the real-world usability. Ryan told us on the show that they’d never switched default models before, not even when GPT-5 came out, but Gemini 3 Pro was so noticeably better that they made it the default on Tuesday. Of course, they hit rate limits almost immediately (Google had to scale up fast!), but those have since been resolved.Antigravity: Google’s Agent-First IDEThen Google dropped Antigravity, and honestly, this might be the most interesting part of the whole release. It’s a free IDE (yes, free!) that’s basically a fork of VS Code, but reimagined around agents rather than human-first coding.The key innovation here is something they call the “Agent Manager”—think of it like an inbox for your coding agents. 
The browser integration is crazy too. Antigravity can control Chrome directly, take screenshots and videos of your app, and then use those visuals to debug and iterate. It's using Gemini 3 Pro for the heavy coding, and even Nano Banana for generating images and assets. The whole thing feels like it's from a couple years in the future.

Wolfram on the show called out how good Gemini 3 is for creative writing too—it's now his main model, replacing GPT-4.5 for German language tasks. The model just "gets" the intention behind your prompts rather than following them literally, which makes for much more natural interactions.

Nano Banana Pro: 4K Image Generation With Thinking

And because Google apparently wasn't done announcing things, they also dropped Nano Banana Pro on Thursday morning—literally breaking news during our live show. This is their image generation model that now supports 4K resolution and includes "thinking" traces before generating.

I tested it live by having it generate an infographic about all the week's AI news (which you can see at the top), and the results were wild. Perfect text across the entire image (no garbled letters!), proper logos for all the major labs, and compositional understanding that felt way more sophisticated than typical image models. The file it generated was 8 megabytes—an actual 4K image with stunning detail.

What's particularly clever is that Nano Banana Pro is really Gemini 3 Pro doing the thinking and planning, then handing off to Nano Banana for the actual image generation. So you get multimodal reasoning about your request, then production-quality output. You can even upload reference images—up to 14 of them—and it'll blend elements while maintaining consistency.

Oh, and every image is watermarked with SynthID (Google's invisible watermarking tech) and includes C2PA metadata, so you can verify provenance. This matters as AI-generated content becomes more prevalent.

Generative UIs: The Future of Interfaces

One more thing Google showed off: generative UIs in the Gemini app. Wolfram demoed this for us, and it's genuinely impressive. Instead of just text responses, Gemini can generate full interactive mini-apps on the fly—complete dashboards, data visualizations, interactive widgets—all vibe-coded in real time.

He asked for "four panels of the top AI news from last week" and Gemini built an entire news dashboard with tabs, live market data (including accurate pre-market NVIDIA stats!), model comparisons, and clickable sections. It pulled real information, verified facts, and presented everything in a polished UI that you could interact with immediately.

This isn't just a demo—it's rolling out in Gemini now. The implication is huge: we're moving from static responses to dynamic, contextual interfaces generated just-in-time for your specific need.
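The polished version lives inside the Gemini app, but the underlying pattern (ask the model for a self-contained mini-app, then render whatever comes back) is easy to approximate yourself. Here's a rough sketch using the google-genai Python SDK; the model id is a placeholder and this is my approximation of the idea, not Google's implementation.

```python
import webbrowser
from pathlib import Path

from google import genai  # pip install google-genai

# Rough approximation of the "generative UI" pattern: ask the model for a
# single self-contained HTML mini-app and render it locally. This is NOT
# how the Gemini app does it internally; the model id below is a placeholder.
client = genai.Client()  # reads the API key from the environment

prompt = (
    "Build a single self-contained HTML file (inline CSS and JS, no external "
    "dependencies) showing four panels of the top AI news from last week, "
    "with tabs and a simple model-comparison table."
)

response = client.models.generate_content(
    model="gemini-3-pro-preview",  # placeholder model id
    contents=prompt,
)

# Write the generated mini-app to disk and open it in the default browser.
out = Path("dashboard.html")
out.write_text(response.text, encoding="utf-8")
webbrowser.open(out.resolve().as_uri())
```

In practice you'd also strip any markdown code fences from the response before writing it out, and re-prompt if the HTML doesn't render, but the "vibe-code a UI just-in-time" loop really is about that short.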
xAI Strikes Back: Grok 4.1 and the Agent Tools API

Not to be outdone, xAI released Grok 4.1 at the start of the week, briefly claimed the #1 spot on LMArena (at 1483 Elo, now 2nd to Gemini 3), and then followed up with Grok 4.1 Fast and a full Agent Tools API.

Grok 4.1: Emotional Intelligence Meets Raw Performance

Grok 4.1 brought some really interesting improvements. Beyond the benchmark numbers (a 64% win rate over the previous Grok in blind tests), what stood out was the emotional intelligence. On EQ-Bench3, Grok 4.1 Thinking scored 1586 Elo, beating every other model including Gemini, GPT-5, and Claude.

The creative writing scores jumped by roughly 600 Elo points compared to earlier versions. And perhaps most importantly for practical use, hallucination rates dropped from around 12% to 4%—that's roughly a 3x improvement in reliability on real user queries.

xAI's approach here was clever: they used "frontier agentic reasoning models as reward models" during RL training, which let them optimize for subjective qualities like humor, empathy, and conversational style without just scaling up model size.

Grok 4.1 Fast: The Agent Platform Play

Then came Grok 4.1 Fast, released just yesterday, and this is where things get really interesting for developers. It's got a 2 million token context window (compared to Gemini 3's 1 million) and was specifically trained for agentic, tool-calling workflows.

The benchmark performance is impressive: 93-100% on τ²-Bench Telecom (customer support simulation), ~72% on Berkeley Function Calling v4 (top of the leaderboard), and strong scores across research and browsing tasks. But here's the kicker: the pricing is aggressive.

At $0.20 per million input tokens and $0.50 per million output tokens, Grok 4.1 Fast is dramatically cheaper than GPT-5 and Claude while matching or exceeding their agentic performance. For the first two weeks, it's completely free via the xAI API and OpenRouter, which is smart—get developers hooked on your agent platform.

The Agent Tools API gives Grok native access to X search, web browsing, code execution, and document retrieval. This tight integration with X is a genuine advantage—where else can you get real-time access to breaking news, sentiment, and conversation? Yam tested it on the show and confirmed that Grok will search Reddit too, which other models often refuse to do. I've used both of these models this week in my N8N research agent and I gotta say, 4.1 Fast is a MASSIVE improvement!

OpenAI's Endurance Play: GPT-5.1-Codex-Max and Pro

OpenAI clearly saw Google and xAI making moves and decided they weren't going to let this week belong to anyone else. They dropped two significant releases: GPT-5.1-Codex-Max and an update to GPT-5.1 Pro.

GPT-5.1-Codex-Max: Coding That Never Stops

This is the headline: GPT-5.1-Codex-Max can work autonomously for over 24 hours. Not 24 minutes, not 24 queries—24 actual hours on a single software engineering task. I talked to someone from OpenAI at the conference who told me internal checkpoints ran for nearly a week on and off.

How is this even possible? The secret is something OpenAI calls "compaction"—a native mechanism trained into the model that lets it prune and compress its working session history while preserving the important context. Think of it like the model taking notes on itself, discarding tool-calling noise and keeping only the critical design decisions and state.
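Codex-Max does this natively, inside the model. For those of us wiring up agents on ordinary chat APIs, the closest approximation is a summarize-and-replace loop: once the transcript blows past a budget, ask the model to write a compact summary of the important state and swap it in for the old history. A hand-rolled sketch of that idea (my approximation, not OpenAI's mechanism; `chat` is a stand-in for whatever completion call you actually use):

```python
from typing import Callable

Message = dict[str, str]  # {"role": ..., "content": ...}

COMPACTION_PROMPT = (
    "Summarize this coding session so a fresh agent can continue. Keep key "
    "design decisions, file paths touched, open TODOs, and current test "
    "status. Drop raw tool output and dead ends."
)

def compact_history(
    messages: list[Message],
    chat: Callable[[list[Message]], str],  # stand-in for your chat-completion call
    budget_chars: int = 60_000,
    keep_recent: int = 6,
) -> list[Message]:
    """Prune-and-compress: replace old turns with a model-written summary."""
    total = sum(len(m["content"]) for m in messages)
    if total <= budget_chars:
        return messages  # still under budget, keep the full transcript

    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    # Ask the model to take notes on its own session so far.
    summary = chat(old + [{"role": "user", "content": COMPACTION_PROMPT}])
    return [{"role": "system", "content": f"Session summary:\n{summary}"}] + recent
```

The knobs that matter are the budget and how many recent turns you keep verbatim; everything older gets squashed into the summary.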
The performance numbers back this up:

* SOTA 77.9% on SWE-Bench Verified (up from 73.7%)
* SOTA 79.9% on SWE-Lancer IC SWE (up from 66.3%)
* 58.1% on TerminalBench 2.0 (up from 52.8%)

And crucially, in medium reasoning mode, it uses 30% fewer thinking tokens while achieving better results. There's also an "Extra High" reasoning mode for when you truly don't care about latency and just want maximum capability.

Yam, one of our co-hosts who's been testing extensively, said you can feel the difference immediately. The model just "gets it" faster, powers through complex problems, and the earlier version's quirk of ignoring your questions and just starting to code is fixed—now it actually responds and collaborates.

Dominik from OpenAI joined us on the show and confirmed that compaction was trained natively into the model using RL, similar to how Claude was trained natively for MCP. This means the model doesn't waste reasoning tokens on maintaining context—it just knows how to do it efficiently.

GPT-5.1 Pro: Research-Grade Intelligence & ChatGPT Joins Your Group Chat

Then there's GPT-5.1 Pro, which is less about coding and more about deep, research-level reasoning. This is the model that can run for 10-17 minutes on a single query, thinking through complex problems with the kind of depth that previously required human experts.

OpenAI also quietly rolled out group chats—basically, you can now have multiple people in a ChatGPT conversation together, all talking to the model simultaneously. Useful for planning trips, brainstorming with teams, or working through problems collaboratively. If agent mode works in group chats (we haven't confirmed yet), that could get really interesting.

Meta Drops SAM 3 & SAM 3D - Image and Video Segmentation Models Powered by Natural Language

Phew, ok, big lab releases now done... oh wait, not yet! Because Meta decided to also make a dent this week with SAM 3 and SAM 3D, both of which are crazy. I'll just add their video release here instead of going on and on!

This Week's Buzz from Weights & Biases

It's been a busy week at Weights & Biases as well! We are proud Gold Sponsors of the AI Engineer conference here in NYC. If you're at the event, please stop by our booth—we're even giving away a $4,000 robodog!

This week, I want to highlight a fantastic update from Marimo, the reactive Python notebook company we acquired. Marimo just shipped a native VS Code and Cursor extension. This brings Marimo's reactive, Git-friendly notebooks directly into your favorite editors.

Crucially, it integrates deeply with uv for lightning-fast package installs and reproducible environments. If you import a package you don't have, the extension prompts you to install it and records the dependency in the script metadata. This bridges the gap between experimental notebooks and production-ready code, and it's a huge boost for AI-native development workflows. (Blog, GitHub)
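If you haven't seen that "script metadata" before: marimo notebooks are plain Python files, and the dependency record is the standard PEP 723 inline block that uv understands. Here's a hand-written example of what such a notebook file looks like (not output captured from the extension, and the sample data is made up):

```python
# /// script
# requires-python = ">=3.11"
# dependencies = [
#     "marimo",
#     "polars",
# ]
# ///

import marimo

app = marimo.App()


@app.cell
def _():
    import polars as pl
    return (pl,)


@app.cell
def _(pl):
    # Cell output: a tiny table comparing context windows from this week's releases.
    df = pl.DataFrame(
        {"model": ["Gemini 3 Pro", "Grok 4.1 Fast"], "context_tokens": [1_000_000, 2_000_000]}
    )
    df
    return (df,)


if __name__ == "__main__":
    app.run()
```

Run it with `uv run notebook.py` (or open it in the editor extension) and uv resolves the listed packages into an isolated, reproducible environment, so the notebook carries its dependencies with it.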
The Future Arrived Early

Phew... if you read all the way to this point, can you leave a ⚡ emoji in the comments? I was writing this and it... is a lot! I was wondering who would even read all the way till here! This week we felt the acceleration! 🔥 I can barely breathe, I need a nap!

A huge thank you to our guests—Ryan, Swyx, Thor, and Dominik—for navigating the chaos with us live on stage, and to the AI Engineer team for hosting us.

We'll be back next week to cover whatever the AI world throws at us next. Stay tuned, because at this rate, AGI might be here by Christmas.

TL;DR - show notes and links

Hosts and Co‑hosts

* Alex Volkov – AI Evangelist at Weights & Biases / CoreWeave, host of ThursdAI (X)
* Co‑hosts – Wolfram Ravenwolf (X), Yam Peleg (X), LDJ (X)

Guests

* Swyx – Founder of AI Engineer World's Fair and Summit, now at Cognition (Latent.Space, X)
* Ryan Carson – Amp (X)
* Thor Schaeff – Google DeepMind, Gemini API and AI Studio (X)
* Dominik Kundel – Developer Experience at OpenAI (X)

Open Source LLMs

* Allen Institute Olmo 3 - 7B/32B fully open reasoning suite with end-to-end training transparency (X, Blog)

Big CO LLMs + APIs

* Google Gemini 3 Pro - 1M-token, multimodal, agentic model with Generative UIs (X, X, X)
* Google Antigravity - Agent-first IDE powered by Gemini 3 Pro (Blog, X)
* xAI Grok 4.1 and Grok 4.1 Thinking - big gains in coding, EQ, creativity, and honesty (X, Blog)
* xAI Grok 4.1 Fast and Agent Tools API - 2M-token context, state-of-the-art tool-calling (X)
* OpenAI GPT-5.1-Codex-Max - long-horizon agentic coding model for 24-hour+ software tasks (X, X)
* OpenAI GPT-5.1 Pro - research-grade reasoning model in ChatGPT Pro
* Microsoft, NVIDIA, and Anthropic partnership - to scale Claude on Azure with massive GPU investments (Announcement, NVIDIA, Microsoft Blog)

This Week's Buzz

* Marimo ships native VS Code & Cursor extension with reactive notebooks and uv-powered environments (X, Blog, GitHub)

Vision & Video & 3D

* Meta SAM 3 & SAM 3D - promptable segmentation, tracking, and single-image 3D reconstruction (X, Blog, GitHub)

AI Art & Diffusion

* Google Nano Banana Pro and SynthID verification - 4K image generation with provenance (Blog)

Show Notes and other Links

* AI Engineer Summit NYC - Live from the conference
* Full livestream available on YouTube
* ThursdAI - Nov 20, 2025

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe