ThursdAI - the week that changed the AI landscape forever - Gemini 3, GPT-5.1-Codex-Max, Grok 4.1 & Fast, SAM 3 and Nano Banana Pro
Hey everyone, Alex here! I'm writing this one from a noisy hallway at the AI Engineer conference in New York, still riding the high (and the sleep deprivation) from what might be the craziest week we've ever had in AI. In the span of a few days:

* Google dropped Gemini 3 Pro, a new Deep Think mode, generative UIs, and a free agent-first IDE called Antigravity.
* xAI shipped Grok 4.1, then followed it up with Grok 4.1 Fast plus an Agent Tools API.
* OpenAI answered with GPT-5.1-Codex-Max, a long-horizon coding monster that can work for more than a day, and quietly upgraded ChatGPT Pro to GPT-5.1 Pro.
* Meta looked at all of that and said "cool, we'll just segment literally everything and turn photos into 3D objects" with SAM 3 and SAM 3D.
* Robotics folks dropped a home robot trained with almost no robot data.
* And Google, just to flex, capped Thursday with Nano Banana Pro, a 4K image model and a provenance system, while we were already live!

For the first time in a while it doesn't just feel like "new models came out." It feels like the future actually clicked forward a notch.

This is why ThursdAI exists. Weeks like this are basically impossible to follow if you have a day job, so my co-hosts and I do the no-sleep version so you don't have to. Plus, being at AI Engineer makes it easy to get super-high-quality guests, so this week we had three folks join us: Swyx from Cognition/Latent Space, Thor from DeepMind (on his third day), and Dominik from OpenAI!

Alright, deep breath.
Let's untangle the week.

TL;DR

If you only skim one section, make it this one (links at the end):

* Google
  * Gemini 3 Pro: 1M-token multimodal model, huge reasoning gains - the new LLM king
  * ARC-AGI-2: 31.11% (Pro), 45.14% (Deep Think) - enormous jumps
  * Antigravity IDE: free, Gemini-powered VS Code fork with agents, plans, walkthroughs, and browser control
  * Nano Banana Pro: 4K image generation with perfect text + SynthID provenance; dynamic "generative UIs" in Gemini
* xAI
  * Grok 4.1: big post-training upgrade - #1 on human-preference leaderboards, much better EQ and creative writing, fewer hallucinations
  * Grok 4.1 Fast + Agent Tools API: 2M context, SOTA tool-calling and agent benchmarks (Berkeley FC, τ²-Bench, research evals), aggressive pricing and tight X + web integration
* OpenAI
  * GPT-5.1-Codex-Max: "frontier agentic coding" model built for 24h+ software tasks, with native compaction for million-token sessions; big gains on SWE-Bench, SWE-Lancer, TerminalBench 2
  * GPT-5.1 Pro: new "research-grade" ChatGPT mode that will happily think for minutes on a single query
* Meta
  * SAM 3: open-vocabulary segmentation + tracking across images and video (with text and exemplar prompts)
  * SAM 3D: single image → 3D objects and human bodies; surprisingly high-quality 3D from one photo
* Robotics
  * Sunday Robotics - ACT-1 & Memo: home robot foundation model trained from a $200 skill glove instead of $20K teleop rigs; long-horizon household tasks with solid zero-shot generalization
* Developer Tools
  * Antigravity and Marimo's VS Code / Cursor extension both push toward agentic, reactive dev workflows

Live from AI Engineer New York: Coding Agents Take Center Stage

We recorded this week's show on location at the AI Engineer Summit in New York, inside a beautiful podcast studio the team set up right on the expo floor.
Huge shout-out to Swyx, Ben, and the whole AI Engineer crew for that - last time I was balancing a mic on a hotel nightstand, this time I had broadcast-grade audio while a robot dog tried to steal the show behind us.

This year's summit theme is very on-the-nose for this week: coding agents. Everywhere you look, there's a company building an "agent lab" on top of foundation models. Amp, Cognition, Cursor, CodeRabbit, Jules, Google Labs, all the open-source folks, and even enterprise players like Capital One and Bloomberg are here, trying to figure out what it means to have real software engineers that are partly human and partly model.

Swyx framed it nicely when he said that if you take "vertical AI" seriously enough, you eventually end up building an agent lab. Lawyers, healthcare, finance, developer tools - they all converge on "agents that can reason and code."

The big labs heard that theme loud and clear. Almost every major release this week is about agents, tools, and long-horizon workflows, not just chat answers.

Google Goes All In: Gemini 3 Pro, Antigravity, and the Agent Revolution

Let's start with Google because, after years of everyone asking "where's Google?" in the AI race, they showed up this week with multiple bombshells that had even the skeptics impressed.

Gemini 3 Pro: Multimodal Intelligence That Actually Delivers

Google finally released Gemini 3 Pro, and the numbers are genuinely impressive. We're talking about a 1-million-token context window, massive benchmark improvements, and a model that's finally competing at the very top of the intelligence charts. Thor from DeepMind joined us on the show (literally on day 3 of his new job!) and you could feel the excitement.

The headline numbers: Gemini 3 Pro with Deep Think mode achieved 45.14% on ARC-AGI-2 - roughly double the previous state of the art on some splits. For context, ARC-AGI has been one of those benchmarks that really tests genuine reasoning and abstraction, not just memorization.
The standard Gemini 3 Pro hits 31.11% on the same benchmark; both scores are absolutely out of this world on ARC! On GPQA Diamond, Gemini 3 Pro jumped about 10 points compared to prior models. We're seeing roughly 81% on MMLU-Pro, and the coding performance is where things get really interesting: Gemini 3 Pro is scoring around 56% on SciCode, a significant improvement on actual software engineering tasks.

But here's what made Ryan from Amp switch their default model to Gemini 3 Pro immediately: the real-world usability. Ryan told us on the show that they'd never switched default models before, not even when GPT-5 came out, but Gemini 3 Pro was so noticeably better that they made it the default on Tuesday. Of course, they hit rate limits almost immediately (Google had to scale up fast!), but those have since been resolved.

Antigravity: Google's Agent-First IDE

Then Google dropped Antigravity, and honestly, this might be the most interesting part of the whole release. It's a free IDE (yes, free!) that's basically a fork of VS Code, but reimagined around agents rather than human-first coding.

The key innovation here is something they call the "Agent Manager" - think of it like an inbox for your coding agents. Instead of thinking in folders and files, you're managing conversations with agents that can run in parallel, handle long-running tasks, and report back when they need your input.

I got early access and spent time playing with it, and here's what blew my mind: you can have multiple agents working on different parts of your codebase simultaneously. One agent fixing bugs, another researching documentation, a third refactoring your CSS - all at once, all coordinated through this manager interface.

The browser integration is crazy too. Antigravity can control Chrome directly, take screenshots and videos of your app, and then use those visuals to debug and iterate. It's using Gemini 3 Pro for the heavy coding, and even Nano Banana for generating images and assets.
The whole thing feels like it's from a couple of years in the future.

Wolfram on the show called out how good Gemini 3 is for creative writing too - it's now his main model, replacing GPT-4.5 for German-language tasks. The model just "gets" the intention behind your prompts rather than following them literally, which makes for much more natural interactions.

Nano Banana Pro: 4K Image Generation With Thinking

And because Google apparently wasn't done announcing things, they also dropped Nano Banana Pro on Thursday morning - literally breaking news during our live show. This is their image generation model, which now supports 4K resolution and includes "thinking" traces before generating.

I tested it live by having it generate an infographic about all the week's AI news (which you can see at the top), and the results were wild. Perfect text across the entire image (no garbled letters!), proper logos for all the major labs, and compositional understanding that felt way more sophisticated than typical image models. The file it generated was 8 megabytes - an actual 4K image with stunning detail.

What's particularly clever is that Nano Banana Pro is really Gemini 3 Pro doing the thinking and planning, then handing off to Nano Banana for the actual image generation. So you get multimodal reasoning about your request, then production-quality output. You can even upload reference images - up to 14 of them - and it'll blend elements while maintaining consistency.

Oh, and every image is watermarked with SynthID (Google's invisible watermarking tech) and includes C2PA metadata, so you can verify provenance. This matters as AI-generated content becomes more prevalent.

Generative UIs: The Future of Interfaces

One more thing Google showed off: generative UIs in the Gemini app. Wolfram demoed this for us, and it's genuinely impressive.
Instead of just text responses, Gemini can generate full interactive mini-apps on the fly - complete dashboards, data visualizations, interactive widgets - all vibe-coded in real time.

He asked for "four panels of the top AI news from last week" and Gemini built an entire news dashboard with tabs, live market data (including accurate pre-market NVIDIA stats!), model comparisons, and clickable sections. It pulled real information, verified facts, and presented everything in a polished UI you could interact with immediately.

This isn't just a demo - it's rolling out in Gemini now. The implication is huge: we're moving from static responses to dynamic, contextual interfaces generated just in time for your specific need.

xAI Strikes Back: Grok 4.1 and the Agent Tools API

Not to be outdone, xAI released Grok 4.1 at the start of the week, briefly claimed the #1 spot on LMArena (at 1483 Elo, now 2nd to Gemini 3), and then followed up with Grok 4.1 Fast and a full Agent Tools API.

Grok 4.1: Emotional Intelligence Meets Raw Performance

Grok 4.1 brought some really interesting improvements. Beyond the benchmark numbers (a 64% win rate over the previous Grok in blind tests), what stood out was the emotional intelligence. On EQ-Bench3, Grok 4.1 Thinking scored 1586 Elo, beating every other model including Gemini, GPT-5, and Claude.

The creative writing scores jumped by roughly 600 Elo points compared to earlier versions. And perhaps most importantly for practical use, hallucination rates dropped from around 12% to 4% - roughly a 3x improvement in reliability on real user queries.

xAI's approach here was clever: they used "frontier agentic reasoning models as reward models" during RL training, which let them optimize for subjective qualities like humor, empathy, and conversational style without just scaling up model size.

Grok 4.1 Fast: The Agent Platform Play

Then came Grok 4.1 Fast, released just yesterday, and this is where things get really interesting for developers.
It's got a 2-million-token context window (compared to Gemini 3's 1 million) and was specifically trained for agentic, tool-calling workflows.

The benchmark performance is impressive: 93-100% on τ²-Bench Telecom (a customer-support simulation), ~72% on Berkeley Function Calling v4 (top of the leaderboard), and strong scores across research and browsing tasks. But here's the kicker: the pricing is aggressive.

At $0.20 per million input tokens and $0.50 per million output tokens, Grok 4.1 Fast is dramatically cheaper than GPT-5 and Claude while matching or exceeding their agentic performance. For the first two weeks, it's completely free via the xAI API and OpenRouter, which is smart - get developers hooked on your agent platform.

The Agent Tools API gives Grok native access to X search, web browsing, code execution, and document retrieval. This tight integration with X is a genuine advantage - where else can you get real-time access to breaking news, sentiment, and conversation? Yam tested it on the show and confirmed that Grok will search Reddit too, which other models often refuse to do. I've used both these models this week in my n8n research agent and I gotta say, 4.1 Fast is a MASSIVE improvement!

OpenAI's Endurance Play: GPT-5.1-Codex-Max and Pro

OpenAI clearly saw Google and xAI making moves and decided they weren't going to let this week belong to anyone else. They dropped two significant releases: GPT-5.1-Codex-Max and an update to GPT-5.1 Pro.

GPT-5.1-Codex-Max: Coding That Never Stops

This is the headline: GPT-5.1-Codex-Max can work autonomously for over 24 hours. Not 24 minutes, not 24 queries - 24 actual hours on a single software engineering task. I talked to someone from OpenAI at the conference who told me internal checkpoints ran for nearly a week, on and off.

How is this even possible?
The secret is something OpenAI calls "compaction" - a native mechanism trained into the model that lets it prune and compress its working session history while preserving the important context. Think of it like the model taking notes on itself, discarding tool-calling noise and keeping only the critical design decisions and state.

The performance numbers back this up:

* SOTA 77.9% on SWE-Bench Verified (up from 73.7%)
* SOTA 79.9% on SWE-Lancer IC SWE (up from 66.3%)
* 58.1% on TerminalBench 2.0 (up from 52.8%)

And crucially, in medium reasoning mode, it uses 30% fewer thinking tokens while achieving better results. There's also an "Extra High" reasoning mode for when you truly don't care about latency and just want maximum capability.

Yam, one of our co-hosts who's been testing extensively, said you can feel the difference immediately. The model just "gets it" faster, powers through complex problems, and the earlier version's quirk of ignoring your questions and just starting to code is fixed - now it actually responds and collaborates.

Dominik from OpenAI joined us on the show and confirmed that compaction was trained natively into the model using RL, similar to how Claude trained natively for MCP. This means the model doesn't waste reasoning tokens on maintaining context - it just knows how to do it efficiently.

GPT-5.1 Pro: Research-Grade Intelligence & ChatGPT Joins Your Group Chat

Then there's GPT-5.1 Pro, which is less about coding and more about deep, research-level reasoning. This is the model that can run for 10-17 minutes on a single query, thinking through complex problems with the kind of depth that previously required human experts.

OpenAI also quietly rolled out group chats - basically, you can now have multiple people in a ChatGPT conversation together, all talking to the model simultaneously. Useful for planning trips, brainstorming with teams, or working through problems collaboratively.
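A quick aside, circling back to Codex-Max for a moment: the compaction described above happens natively inside the model, but you can get a feel for the idea with a toy client-side analogy. This is strictly my sketch, not OpenAI's mechanism; the message shapes and the crude 4-characters-per-token counter are made up for illustration.

```python
def compact(messages, budget_tokens, count=lambda m: len(m["content"]) // 4):
    """Toy analogy of session compaction: keep the system prompt, keep the
    most recent turns verbatim, and stub out older tool output."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    used = sum(count(m) for m in system)
    kept = []
    # Walk backwards so the newest context survives first.
    for m in reversed(rest):
        if used + count(m) <= budget_tokens:
            kept.append(m)
            used += count(m)
        elif m["role"] == "tool":
            # Tool output is usually the noisiest part of a long session:
            # drop the payload, keep a stub recording that the call happened.
            stub = {"role": "tool", "content": "[output compacted]"}
            if used + count(stub) <= budget_tokens:
                kept.append(stub)
                used += count(stub)
        # Older non-tool turns that no longer fit are dropped entirely.
    return system + list(reversed(kept))


history = [
    {"role": "system", "content": "You are a coding agent."},
    {"role": "user", "content": "Fix the bug in parser.py"},
    {"role": "tool", "content": "x" * 400},  # e.g. a huge build log
    {"role": "assistant", "content": "Patched the tokenizer."},
    {"role": "user", "content": "Now add tests."},
]
compacted = compact(history, budget_tokens=30)
# The giant tool log shrinks to a stub; everything else survives.
```

The interesting part of Codex-Max is that this bookkeeping is learned rather than bolted on in client code, which is why it doesn't burn reasoning tokens maintaining its own context.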
If agent mode works in group chats (we haven't confirmed yet), that could get really interesting.

Meta Drops SAM 3 & SAM 3D: Image and Video Segmentation Powered by Natural Language

Phew, OK, big-lab releases now done... oh, wait, not yet! Because Meta decided to also make a dent this week with SAM 3 and SAM 3D, which are both crazy. I'll just add their video release here instead of going on and on!

This Week's Buzz from Weights & Biases

It's been a busy week at Weights & Biases as well! We are proud Gold Sponsors of the AI Engineer conference here in NYC. If you're at the event, please stop by our booth - we're even giving away a $4,000 robodog!

This week, I want to highlight a fantastic update from Marimo, the reactive Python notebook company we acquired. Marimo just shipped a native VS Code and Cursor extension. This brings Marimo's reactive, Git-friendly notebooks directly into your favorite editors.

Crucially, it integrates deeply with uv for lightning-fast package installs and reproducible environments. If you import a package you don't have, the extension prompts you to install it and records the dependency in the script metadata. This bridges the gap between experimental notebooks and production-ready code, and it's a huge boost for AI-native development workflows. (Blog, GitHub)

The Future Arrived Early

Phew... if you read all the way to this point, can you leave a ⚡ emoji in the comments? I was writing this and it... is a lot! I was wondering who would even read all the way till here! This week we felt the acceleration! 🔥 I can barely breathe, I need a nap!

A huge thank you to our guests - Ryan, Swyx, Thor, and Dominik - for navigating the chaos with us live on stage, and to the AI Engineer team for hosting us. We'll be back next week to cover whatever the AI world throws at us next.
Stay tuned, because at this rate, AGI might be here by Christmas.

TL;DR - Show Notes and Links

Hosts and Co-hosts

* Alex Volkov - AI Evangelist at Weights & Biases / CoreWeave, host of ThursdAI (X)
* Co-hosts: Wolfram Ravenwolf (X), Yam Peleg (X), LDJ (X)

Guests

* Swyx - Founder of AI Engineer World's Fair and Summit, now at Cognition (Latent.Space, X)
* Ryan Carson - Amp (X)
* Thor Schaeff - Google DeepMind, Gemini API and AI Studio (X)
* Dominik Kundel - Developer Experience at OpenAI (X)

Open Source LLMs

* Allen Institute Olmo 3 - 7B/32B fully open reasoning suite with end-to-end training transparency (X, Blog)

Big CO LLMs + APIs

* Google Gemini 3 Pro - 1M-token, multimodal, agentic model with Generative UIs (X, X, X)
* Google Antigravity - agent-first IDE powered by Gemini 3 Pro (Blog, X)
* xAI Grok 4.1 and Grok 4.1 Thinking - big gains in coding, EQ, creativity, and honesty (X, Blog)
* xAI Grok 4.1 Fast and Agent Tools API - 2M-token context, state-of-the-art tool-calling (X)
* OpenAI GPT-5.1-Codex-Max - long-horizon agentic coding model for 24-hour+ software tasks (X, X)
* OpenAI GPT-5.1 Pro - research-grade reasoning model in ChatGPT Pro
* Microsoft, NVIDIA, and Anthropic partnership - to scale Claude on Azure with massive GPU investments (Announcement, NVIDIA, Microsoft Blog)

This Week's Buzz

* Marimo ships native VS Code & Cursor extension with reactive notebooks and uv-powered environments (X, Blog, GitHub)

Vision & Video & 3D

* Meta SAM 3 & SAM 3D - promptable segmentation, tracking, and single-image 3D reconstruction (X, Blog, GitHub)

AI Art & Diffusion

* Google Nano Banana Pro and SynthID verification - 4K image generation with provenance (Blog)

Show Notes and Other Links

* AI Engineer Summit NYC - live from the conference
* Full livestream available on YouTube
* ThursdAI - Nov 20, 2025

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe