
ThursdAI - The top AI news from the past week

From Weights & Biases, Join AI Evangelist Alex Volkov and a panel of experts to cover everything important that happened in the world of AI from the past week

Available Episodes

5 of 130
  • 📆 ThursdAI - the week that changed the AI landscape forever - Gemini 3, GPT codex max, Grok 4.1 & fast, SAM3 and Nano Banana Pro
Hey everyone, Alex here 👋 I'm writing this one from a noisy hallway at the AI Engineer conference in New York, still riding the high (and the sleep deprivation) from what might be the craziest week we've ever had in AI.

In the span of a few days:

* Google dropped Gemini 3 Pro, a new Deep Think mode, generative UIs, and a free agent-first IDE called Antigravity.
* xAI shipped Grok 4.1, then followed it up with Grok 4.1 Fast plus an Agent Tools API.
* OpenAI answered with GPT-5.1-Codex-Max, a long-horizon coding monster that can work for more than a day, and quietly upgraded ChatGPT Pro to GPT-5.1 Pro.
* Meta looked at all of that and said "cool, we'll just segment literally everything and turn photos into 3D objects" with SAM 3 and SAM 3D.
* Robotics folks dropped a home robot trained with almost no robot data.
* And Google, just to flex, capped Thursday with Nano Banana Pro, a 4K image model and a provenance system, while we were already live!

For the first time in a while it doesn't just feel like "new models came out." It feels like the future actually clicked forward a notch.

This is why ThursdAI exists. Weeks like this are basically impossible to follow if you have a day job, so my co-hosts and I do the no-sleep version so you don't have to. Plus, being at AI Engineer makes it easy to get super high quality guests, so this week we had 3 folks join us: Swyx from Cognition/Latent Space, Thor from DeepMind (on his 3rd day) and Dominik from OpenAI! Alright, deep breath. Let's untangle the week.

TL;DR: If you only skim one section, make it this one (links in the end):

* Google
* Gemini 3 Pro: 1M-token multimodal model, huge reasoning gains - new LLM king
* ARC-AGI-2: 31.11% (Pro), 45.14% (Deep Think) – enormous jumps
* Antigravity IDE: free, Gemini-powered VS Code fork with agents, plans, walkthroughs, and browser control
* Nano Banana Pro: 4K image generation with perfect text + SynthID provenance; dynamic "generative UIs" in Gemini
* xAI
* Grok 4.1: big post-training upgrade – #1 on human-preference leaderboards, much better EQ & creative writing, fewer hallucinations
* Grok 4.1 Fast + Agent Tools API: 2M context, SOTA tool-calling & agent benchmarks (Berkeley FC, T²-Bench, research evals), aggressive pricing and tight X + web integration
* OpenAI
* GPT-5.1-Codex-Max: "frontier agentic coding" model built for 24h+ software tasks with native compaction for million-token sessions; big gains on SWE-Bench, SWE-Lancer, TerminalBench 2
* GPT-5.1 Pro: new "research-grade" ChatGPT mode that will happily think for minutes on a single query
* Meta
* SAM 3: open-vocabulary segmentation + tracking across images and video (with text & exemplar prompts)
* SAM 3D: single-image → 3D objects & human bodies; surprisingly high-quality 3D from one photo
* Robotics
* Sunday Robotics – ACT-1 & Memo: home robot foundation model trained from a $200 skill glove instead of $20K teleop rigs; long-horizon household tasks with solid zero-shot generalization
* Developer Tools
* Antigravity and Marimo's VS Code / Cursor extension both push toward agentic, reactive dev workflows

Live from AI Engineer New York: Coding Agents Take Center Stage

We recorded this week's show on location at the AI Engineer Summit in New York, inside a beautiful podcast studio the team set up right on the expo floor.
Huge shout out to Swyx, Ben, and the whole AI Engineer crew for that — last time I was balancing a mic on a hotel nightstand, this time I had broadcast-grade audio while a robot dog tried to steal the show behind us.

This year's summit theme is very on-the-nose for this week: coding agents. Everywhere you look, there's a company building an "agent lab" on top of foundation models. Amp, Cognition, Cursor, CodeRabbit, Jules, Google Labs, all the open-source folks, and even the enterprise players like Capital One and Bloomberg are here, trying to figure out what it means to have real software engineers that are partly human and partly model.

Swyx framed it nicely when he said that if you take "vertical AI" seriously enough, you eventually end up building an agent lab. Lawyers, healthcare, finance, developer tools — they all converge on "agents that can reason and code."

The big labs heard that theme loud and clear. Almost every major release this week is about agents, tools, and long-horizon workflows, not just chat answers.

Google Goes All In: Gemini 3 Pro, Antigravity, and the Agent Revolution

Let's start with Google because, after years of everyone asking "where's Google?" in the AI race, they showed up this week with multiple bombshells that had even the skeptics impressed.

Gemini 3 Pro: Multimodal Intelligence That Actually Delivers

Google finally released Gemini 3 Pro, and the numbers are genuinely impressive. We're talking about a 1 million token context window, massive benchmark improvements, and a model that's finally competing at the very top of the intelligence charts. Thor from DeepMind joined us on the show (literally on day 3 of his new job!) and you could feel the excitement.

The headline numbers: Gemini 3 Pro with Deep Think mode achieved 45.14% on ARC-AGI-2—that's roughly double the previous state-of-the-art on some splits. For context, ARC-AGI has been one of those benchmarks that really tests genuine reasoning and abstraction, not just memorization. The standard Gemini 3 Pro hits 31.11% on the same benchmark; both scores are absolutely out of this world on ARC! On GPQA Diamond, Gemini 3 Pro jumped about 10 points compared to prior models. We're seeing roughly 81% on MMLU-Pro, and the coding performance is where things get really interesting—Gemini 3 Pro is scoring around 56% on SciCode, representing significant improvements in actual software engineering tasks.

But here's what made Ryan from Amp switch their default model to Gemini 3 Pro immediately: the real-world usability. Ryan told us on the show that they'd never switched default models before, not even when GPT-5 came out, but Gemini 3 Pro was so noticeably better that they made it the default on Tuesday. Of course, they hit rate limits almost immediately (Google had to scale up fast!), but those have since been resolved.

Antigravity: Google's Agent-First IDE

Then Google dropped Antigravity, and honestly, this might be the most interesting part of the whole release. It's a free IDE (yes, free!) that's basically a fork of VS Code, but reimagined around agents rather than human-first coding.

The key innovation here is something they call the "Agent Manager"—think of it like an inbox for your coding agents.
Instead of thinking in folders and files, you're managing conversations with agents that can run in parallel, handle long-running tasks, and report back when they need your input.

I got early access and spent time playing with it, and here's what blew my mind: you can have multiple agents working on different parts of your codebase simultaneously. One agent fixing bugs, another researching documentation, a third refactoring your CSS—all at once, all coordinated through this manager interface.

The browser integration is crazy too. Antigravity can control Chrome directly, take screenshots and videos of your app, and then use those visuals to debug and iterate. It's using Gemini 3 Pro for the heavy coding, and even Nano Banana for generating images and assets. The whole thing feels like it's from a couple years in the future.

Wolfram on the show called out how good Gemini 3 is for creative writing too—it's now his main model, replacing GPT-4.5 for German language tasks. The model just "gets" the intention behind your prompts rather than following them literally, which makes for much more natural interactions.

Nano Banana Pro: 4K Image Generation With Thinking

And because Google apparently wasn't done announcing things, they also dropped Nano Banana Pro on Thursday morning—literally breaking news during our live show. This is their image generation model that now supports 4K resolution and includes "thinking" traces before generating.

I tested it live by having it generate an infographic about all the week's AI news (which you can see on the top), and the results were wild. Perfect text across the entire image (no garbled letters!), proper logos for all the major labs, and compositional understanding that felt way more sophisticated than typical image models. The file it generated was 8 megabytes—an actual 4K image with stunning detail.

What's particularly clever is that Nano Banana Pro is really Gemini 3 Pro doing the thinking and planning, then handing off to Nano Banana for the actual image generation. So you get multimodal reasoning about your request, then production-quality output. You can even upload reference images—up to 14 of them—and it'll blend elements while maintaining consistency.

Oh, and every image is watermarked with SynthID (Google's invisible watermarking tech) and includes C2PA metadata, so you can verify provenance. This matters as AI-generated content becomes more prevalent.

Generative UIs: The Future of Interfaces

One more thing Google showed off: generative UIs in the Gemini app. Wolfram demoed this for us, and it's genuinely impressive. Instead of just text responses, Gemini can generate full interactive mini-apps on the fly—complete dashboards, data visualizations, interactive widgets—all vibe-coded in real time.

He asked for "four panels of the top AI news from last week" and Gemini built an entire news dashboard with tabs, live market data (including accurate pre-market NVIDIA stats!), model comparisons, and clickable sections. It pulled real information, verified facts, and presented everything in a polished UI that you could interact with immediately.

This isn't just a demo—it's rolling out in Gemini now. The implication is huge: we're moving from static responses to dynamic, contextual interfaces generated just-in-time for your specific need.
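If you want to poke at Gemini 3 Pro yourself rather than wait for it to show up in your favorite tools, the Gemini API is the quickest path. Here is a minimal sketch using the google-genai Python SDK; the model id "gemini-3-pro-preview" and the GEMINI_API_KEY environment variable are assumptions on my part, so check AI Studio for the exact names.

```python
# Minimal sketch of calling Gemini from Python with the google-genai SDK.
# Assumptions: the google-genai package is installed, GEMINI_API_KEY (or
# GOOGLE_API_KEY) is set, and "gemini-3-pro-preview" is the model id your
# account can access -- verify the exact id in AI Studio.
from google import genai

client = genai.Client()  # picks up the API key from the environment

response = client.models.generate_content(
    model="gemini-3-pro-preview",
    contents="Summarize this week's AI news in three bullet points.",
)
print(response.text)
```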
xAI Strikes Back: Grok 4.1 and the Agent Tools API

Not to be outdone, xAI released Grok 4.1 at the start of the week, briefly claimed the #1 spot on LMArena (at 1483 Elo, now second to Gemini 3), and then followed up with Grok 4.1 Fast and a full Agent Tools API.

Grok 4.1: Emotional Intelligence Meets Raw Performance

Grok 4.1 brought some really interesting improvements. Beyond the benchmark numbers (64% win rate over the previous Grok in blind tests), what stood out was the emotional intelligence. On EQ-Bench3, Grok 4.1 Thinking scored 1586 Elo, beating every other model including Gemini, GPT-5, and Claude.

The creative writing scores jumped by roughly 600 Elo points compared to earlier versions. And perhaps most importantly for practical use, hallucination rates dropped from around 12% to 4%—that's roughly a 3x improvement in reliability on real user queries.

xAI's approach here was clever: they used "frontier agentic reasoning models as reward models" during RL training, which let them optimize for subjective qualities like humor, empathy, and conversational style without just scaling up model size.

Grok 4.1 Fast: The Agent Platform Play

Then came Grok 4.1 Fast, released just yesterday, and this is where things get really interesting for developers. It's got a 2 million token context window (compared to Gemini 3's 1 million) and was specifically trained for agentic, tool-calling workflows.

The benchmark performance is impressive: 93-100% on τ²-Bench Telecom (customer support simulation), ~72% on Berkeley Function Calling v4 (top of the leaderboard), and strong scores across research and browsing tasks. But here's the kicker: the pricing is aggressive.

At $0.20 per million input tokens and $0.50 per million output tokens, Grok 4.1 Fast is dramatically cheaper than GPT-5 and Claude while matching or exceeding their agentic performance. For the first two weeks, it's completely free via the xAI API and OpenRouter, which is smart—get developers hooked on your agent platform.

The Agent Tools API gives Grok native access to X search, web browsing, code execution, and document retrieval. This tight integration with X is a genuine advantage—where else can you get real-time access to breaking news, sentiment, and conversation? Yam tested it on the show and confirmed that Grok will search Reddit too, which other models often refuse to do. I've used both these models this week in my N8N research agent and I gotta say, 4.1 Fast is a MASSIVE improvement!
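If you want to wire Grok 4.1 Fast into your own agents, the xAI API speaks the OpenAI chat-completions dialect, so the standard openai client works with a different base URL. A minimal sketch, assuming an XAI_API_KEY environment variable and "grok-4.1-fast" as the model id (check the xAI docs for the exact name):

```python
# Minimal sketch: calling Grok 4.1 Fast through xAI's OpenAI-compatible API.
# Assumptions: XAI_API_KEY is set and "grok-4.1-fast" is the served model id.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.x.ai/v1",      # xAI's OpenAI-compatible endpoint
    api_key=os.environ["XAI_API_KEY"],
)

response = client.chat.completions.create(
    model="grok-4.1-fast",               # assumed model id, verify in the xAI docs
    messages=[
        {"role": "system", "content": "You are a concise research assistant."},
        {"role": "user", "content": "What are people on X saying about Gemini 3?"},
    ],
)
print(response.choices[0].message.content)
```

At the quoted $0.20/$0.50 per million tokens, a typical agent turn with about 10K input and 1K output tokens works out to roughly $0.0025, which is why the pricing matters so much for long tool-calling loops.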
OpenAI's Endurance Play: GPT-5.1-Codex-Max and Pro

OpenAI clearly saw Google and xAI making moves and decided they weren't going to let this week belong to anyone else. They dropped two significant releases: GPT-5.1-Codex-Max and an update to GPT-5.1 Pro.

GPT-5.1-Codex-Max: Coding That Never Stops

This is the headline: GPT-5.1-Codex-Max can work autonomously for over 24 hours. Not 24 minutes, not 24 queries—24 actual hours on a single software engineering task. I talked to someone from OpenAI at the conference who told me internal checkpoints ran for nearly a week on and off.

How is this even possible? The secret is something OpenAI calls "compaction"—a native mechanism trained into the model that lets it prune and compress its working session history while preserving the important context. Think of it like the model taking notes on itself, discarding tool-calling noise and keeping only the critical design decisions and state.

The performance numbers back this up:

* SOTA 77.9% on SWE-Bench Verified (up from 73.7%)
* SOTA 79.9% on SWE-Lancer IC SWE (up from 66.3%)
* 58.1% on TerminalBench 2.0 (up from 52.8%)

And crucially, in medium reasoning mode, it uses 30% fewer thinking tokens while achieving better results. There's also an "Extra High" reasoning mode for when you truly don't care about latency and just want maximum capability.

Yam, one of our co-hosts who's been testing extensively, said you can feel the difference immediately. The model just "gets it" faster, powers through complex problems, and the earlier version's quirk of ignoring your questions and just starting to code is fixed—now it actually responds and collaborates.

Dominik from OpenAI joined us on the show and confirmed that compaction was trained natively into the model using RL, similar to how Claude trained natively for MCP. This means the model doesn't waste reasoning tokens on maintaining context—it just knows how to do it efficiently.
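OpenAI hasn't published the internals, so treat the details as theirs alone, but the general shape of context compaction is easy to sketch: when the transcript approaches the window limit, fold the old turns into compact notes and keep only the recent tail verbatim. A toy illustration of that pattern (not OpenAI's implementation; the helpers here are hypothetical stand-ins):

```python
# Toy illustration of context "compaction" for a long-running agent session.
# This is NOT OpenAI's implementation; it only shows the general pattern of
# folding old turns into running notes so the live context stays bounded.

MAX_LIVE_TOKENS = 8_000     # budget for verbatim history (illustrative number)
KEEP_RECENT_TURNS = 20      # always keep the most recent turns verbatim


def estimate_tokens(messages):
    # Crude stand-in for a real tokenizer: roughly 4 characters per token.
    return sum(len(m["content"]) for m in messages) // 4


def summarize_turns(messages):
    # Stand-in for a model call that writes "notes to self": key decisions,
    # current state, open TODOs, while dropping raw tool-call noise.
    notes = [m["content"][:120] for m in messages if m["role"] != "tool"]
    return "Session notes so far:\n- " + "\n- ".join(notes)


def compact(history):
    """Return a bounded history: [summary note] + recent turns kept verbatim."""
    if estimate_tokens(history) <= MAX_LIVE_TOKENS:
        return history
    old, recent = history[:-KEEP_RECENT_TURNS], history[-KEEP_RECENT_TURNS:]
    summary = {"role": "system", "content": summarize_turns(old)}
    return [summary] + recent
```

The interesting claim is that Codex-Max learned this behavior natively through RL rather than relying on the harness to do it, which is why it doesn't burn reasoning tokens on bookkeeping.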
GPT-5.1 Pro: Research-Grade Intelligence & ChatGPT Joins Your Group Chat

Then there's GPT-5.1 Pro, which is less about coding and more about deep, research-level reasoning. This is the model that can run for 10-17 minutes on a single query, thinking through complex problems with the kind of depth that previously required human experts.

OpenAI also quietly rolled out group chats—basically, you can now have multiple people in a ChatGPT conversation together, all talking to the model simultaneously. Useful for planning trips, brainstorming with teams, or working through problems collaboratively. If agent mode works in group chats (we haven't confirmed yet), that could get really interesting.

Meta drops SAM3 & SAM3D - image and video segmentation models powered by natural language

Phew ok, big lab releases now done, oh.. wait, not yet! Because Meta decided to also make a dent this week with SAM 3 and SAM 3D, which are both crazy. I'll just add their video release here instead of going on and on!

This Week's Buzz from Weights & Biases

It's been a busy week at Weights & Biases as well! We are proud Gold Sponsors of the AI Engineer conference here in NYC. If you're at the event, please stop by our booth—we're even giving away a $4,000 robodog!

This week, I want to highlight a fantastic update from Marimo, the reactive Python notebook company we acquired. Marimo just shipped a native VS Code and Cursor extension. This brings Marimo's reactive, Git-friendly notebooks directly into your favorite editors.

Crucially, it integrates deeply with uv for lightning-fast package installs and reproducible environments. If you import a package you don't have, the extension prompts you to install it and records the dependency in the script metadata. This bridges the gap between experimental notebooks and production-ready code, and it's a huge boost for AI-native development workflows. (Blog, GitHub)

The Future Arrived Early

Phew... if you read all the way until this point, can you leave a ⚡ emoji in the comments? I was writing this and it.. is a lot! I was wondering who would even read all the way till here! This week we felt the acceleration! 🔥 I can barely breathe, I need a nap! A huge thank you to our guests—Ryan, Swyx, Thor, and Dominik—for navigating the chaos with us live on stage, and to the AI Engineer team for hosting us.

We'll be back next week to cover whatever the AI world throws at us next. Stay tuned, because at this rate, AGI might be here by Christmas.

TL;DR - show notes and links

Hosts and Co-hosts
* Alex Volkov – AI Evangelist at Weights & Biases / CoreWeave, host of ThursdAI (X)
* Co-hosts - Wolfram Ravenwolf (X), Yam Peleg (X), LDJ (X)

Guests
* Swyx – Founder of AI Engineer World's Fair and Summit, now at Cognition (Latent.Space, X)
* Ryan Carson – Amp (X)
* Thor Schaeff – Google DeepMind, Gemini API and AI Studio (X)
* Dominik Kundel – Developer Experience at OpenAI (X)

Open Source LLMs
* Allen Institute Olmo 3 - 7B/32B fully open reasoning suite with end-to-end training transparency (X, Blog)

Big CO LLMs + APIs
* Google Gemini 3 Pro - 1M-token, multimodal, agentic model with Generative UIs (X, X, X)
* Google Antigravity - Agent-first IDE powered by Gemini 3 Pro (Blog, X)
* xAI Grok 4.1 and Grok 4.1 Thinking - big gains in Coding, EQ, creativity, and honesty (X, Blog)
* xAI Grok 4.1 Fast and Agent Tools API - 2M-token context, state-of-the-art tool-calling (X)
* OpenAI GPT-5.1-Codex-Max - long-horizon agentic coding model for 24-hour+ software tasks (X, X)
* OpenAI GPT-5.1 Pro - research-grade reasoning model in ChatGPT Pro
* Microsoft, NVIDIA, and Anthropic partnership - to scale Claude on Azure with massive GPU investments (Announcement, NVIDIA, Microsoft Blog)

This week's Buzz
* Marimo ships native VS Code & Cursor extension with reactive notebooks and uv-powered environments (X, Blog, GitHub)

Vision & Video & 3D
* Meta SAM 3 & SAM 3D - promptable segmentation, tracking, and single-image 3D reconstruction (X, Blog, GitHub)

AI Art & Diffusion
* Google Nano Banana Pro and SynthID verification - 4K image generation with provenance (Blog)

Show Notes and other Links
* AI Engineer Summit NYC - Live from the conference
* Full livestream available on YouTube
* ThursdAI - Nov 20, 2025

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe
    --------  
    1:29:13
  • GPT‑5.1’s New Brain, Grok’s 2M Context, Omnilingual ASR, and a Terminal UI That Sparks Joy
Hey, this is Alex! We're finally so back! Tons of open source releases, OpenAI updating GPT, and a few breakthroughs in audio as well make this a very dense week! Today on the show, we covered the newly released GPT 5.1 update, a few open source releases like Terminal Bench and Project AELLA (renamed OSSAS), and Baidu's Ernie 4.5 VL that shows impressive visual understanding! Also, chatted with Paul from 11Labs and Dima Duev from the wandb SDK team, who brought us a delicious demo of LEET, our new TUI for wandb! Tons of news coverage, let's dive in 👇 (as always links and show notes in the end)

Open Source AI

Let's jump directly into Open Source, as this week has seen some impressive open models from big companies.

Terminal-Bench 2.0 - a harder, highly-verified coding and terminal benchmark (X, Blog, Leaderboard)

We opened with Terminal-Bench 2.0 plus its new harness, Harbor, because this is the kind of benchmark we've all been asking for. Terminal-Bench focuses on agentic coding in a real shell. Version 2.0 is a hard set of 89 terminal tasks, each one painstakingly vetted by humans and LLMs to make sure it's solvable and realistic. Think "I checked out master and broke my personal site, help untangle the git mess" or "implement GPT-2 code golf with the fewest characters." On the new leaderboard, top agents like Warp's agentic console and Codex CLI + GPT-5 sit around fifty percent success. That number is exactly what excites me: we're nowhere near saturation. When everyone is in the 90-something range, tiny 0.1 improvements are basically noise. When the best models are at fifty percent, a five-point jump really means something.

A huge part of our conversation focused on reproducibility. We've seen other benchmarks like OSWorld turn out to be unreliable, with different task sets and non-reproducible results making scores incomparable. Terminal-Bench addresses this with Harbor, a harness designed to run sandboxed, containerized agent rollouts at scale in a consistent environment. This means results are actually comparable. It's a ton of work to build an entire evaluation ecosystem like this, and with over a thousand contributors on their Discord, it's a fantastic example of a healthy, community-driven effort. This is one to watch!

Baidu's ERNIE-4.5-VL "Thinking": a 3B visual reasoner that punches way up (X, HF, GitHub)

Next up, Baidu dropped a really interesting model, ERNIE-4.5-VL-28B-A3B-Thinking. This is a compact, 3B active-parameter multimodal reasoning model focused on vision, and it's much better than you'd expect for its size. Baidu's own charts show it competing with much larger closed models like Gemini-2.5-Pro and GPT-5-High on a bunch of visual benchmarks like ChartQA and DocVQA.

During the show, I dropped a fairly complex chart into the demo, and ERNIE-4.5-VL gave me a clean textual summary almost instantly—it read the chart more cleanly than I could. The model is built to "think with images," using dynamic zooming and spatial grounding to analyze fine details. It's released under an Apache-2.0 license, making it a serious candidate for edge devices, education, and any product where you need a cheap but powerful visual brain.

Open Source Quick Hits: OSSAS, VibeThinker, and Holo Two

We also covered a few other key open-source releases. Project AELLA was quickly rebranded to OSSAS (Open Source Summaries At Scale), an initiative to make scientific literature machine-readable. They've released 100k paper summaries, two fine-tuned models for the task, and a 3D visualizer.
It's a niche but powerful tool if you're working with massive amounts of research. (X, HF)

WeiboAI (from the Chinese social media company) released VibeThinker-1.5B, a tiny 1.5B-parameter reasoning model that is making bold claims about beating the 671B DeepSeek R1 on math benchmarks. We discussed the high probability of benchmark contamination, especially on tests like AIME24, but even with that caveat, getting strong chain-of-thought math out of a 1.5B model is impressive and useful for resource-constrained applications. (X, HF, Arxiv)

Finally, we had some breaking news mid-show: H Company released Holo Two, their next-gen multimodal agent for controlling desktops, websites, and mobile apps. It's a fine-tune of Qwen3-VL and comes in 4B and 8B Apache-2.0 licensed versions, pushing the open agent ecosystem forward. (X, Blog, HF)

ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.

Big Companies & APIs

GPT-5.1: Instant vs Thinking, and a new personality bar

The biggest headline of the week was OpenAI shipping GPT-5.1, and this was a hot topic of debate on the show. The update introduces two modes: "Instant" for fast, low-compute answers, and "Thinking" for deeper reasoning on hard problems. OpenAI claims Instant mode uses 57% fewer tokens on easy tasks, while Thinking mode dedicates 71% more compute to difficult ones. This adaptive approach is a smart evolution.

The release also adds a personality dropdown with options like Professional, Friendly, Quirky, and Cynical, aiming for a more "warm" and customizable experience. Yam and I felt this was a step in the right direction, as GPT-5 could often feel a bit cold and uncommunicative. However, Wolfram had a more disappointing experience, finding that GPT-5.1 performed significantly worse on his German grammar and typography tasks compared to GPT-4 or Claude Sonnet 4.5. It's a reminder that "upgrades" can be subjective and task-dependent.

Since the show was recorded, GPT-5.1 has also been released in the API, and they have published a prompting guide and some evals, with some significant jumps across SWE-bench Verified and GPQA Diamond! We'll be testing this model out all week. The highlight for this model is the creative writing: it was made public that this model was being tested on OpenRouter as Polaris-alpha, and that one tops the EQ-Bench creative writing benchmarks, beating Sonnet 4.5 and Gemini!

Grok-4 Fast: 2M context and a native X superpower

Grok-4 Fast from xAI apparently quietly got a substantial upgrade to a 2M-token context window, but the most interesting part is its unique integration with X. The API version has access to internal tools for semantic search over tweets, retrieving top quote tweets, and understanding embedded images and videos. I've started using it as a research agent in my show prep, and it feels like having a research assistant living inside X's backend—something you simply can't replicate with public tools.

I still have my gripes about their "stealth upgrade" versioning strategy, which makes rigorous evaluation difficult, but as a practical tool, Grok-4 Fast is incredibly powerful. It's also surprisingly fast and cost-effective, holding its own against other top models on benchmarks while offering a superpower that no one else has.

Google SIMA 2: Embodied Agents in Virtual Worlds

Google's big contribution this week was SIMA 2, DeepMind's latest embodied agent for 3D virtual worlds.
SIMA lives inside real games like No Man's Sky and Goat Simulator, seeing the screen and controlling the game via keyboard and mouse, using Gemini as its reasoning brain. Demos showed it following complex, sketch-based instructions, like finding an object that looks like a drawing of a spaceship and jumping on top of it.

When you combine this with Genie 3—Google's world model that can generate playable environments from a single image—you see the bigger picture: agents that learn physics, navigation, and common sense by playing in millions of synthetic worlds. We're not there yet, but the pieces are clearly being assembled. We also touched on the latest Gemini Live voice upgrade, which users are reporting feels much more natural and responsive.

More Big Company News: Qwen Deep Research, Code Arena, and Cursor

We also briefly covered Qwen's new Deep Research feature, which offers an OpenAI-style research agent inside their ecosystem. LMSYS launched Code Arena, a fantastic live evaluation platform where models build real web apps agentically, with humans voting on the results. And in the world of funding, the AI-native code editor Cursor raised a staggering $2.3 billion, a clear sign that AI is becoming the default way developers interact with code.

This Week's Buzz: W&B LEET – a terminal UI that sparks joy

For this week's buzz, I brought on Dima Duev from our SDK team at Weights & Biases to show off a side project that has everyone at the company excited: LEET, the Lightweight Experiment Exploration Tool. Imagine you're training on an air-gapped HPC cluster, living entirely in your terminal. How do you monitor your runs? With LEET.

You run your training script in W&B offline mode, and in another terminal, you type wandb beta leet. Your terminal instantly turns into a full TUI dashboard with live metric plots, system stats, and run configs. You can zoom into spikes in your loss curve, filter metrics, and see everything updating in real time, all without a browser or internet connection. It's one of those tools that just sparks joy. It ships with the latest wandb SDK (v0.23.0+), so just upgrade and give it a try!
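To make that concrete, here is roughly what the offline side looks like; a minimal sketch assuming wandb 0.23.0+ is installed, with made-up project and metric names:

```python
# train.py -- log metrics locally with W&B in offline mode (no network needed).
# Assumes wandb >= 0.23.0; the project and metric names are just placeholders.
import math
import wandb

run = wandb.init(project="leet-demo", mode="offline")  # writes to ./wandb locally

for step in range(1000):
    loss = 2.0 * math.exp(-step / 200) + 0.01 * (step % 7)  # fake training curve
    run.log({"loss": loss, "step": step})

run.finish()
```

While that runs (or after it finishes), open a second terminal in the same directory and run wandb beta leet to get the live TUI view described above.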
Voice & Audio: Scribe v2 Realtime and Omnilingual ASR

ElevenLabs Scribe v2 Realtime: ASR built for agents (X, Announcement, Demo)

We've talked a lot on this show about ElevenLabs as "the place you go to make your AI talk." This week, they came for the other half of the conversation. Paul Asjes from ElevenLabs joined us to walk through Scribe v2 Realtime, their new low-latency speech-to-text model. If you're building a voice agent, you need ears, a brain, and a mouth. ElevenLabs already nailed the mouth, and now they've built some seriously good ears.

Scribe v2 Realtime is designed to run at around 150 milliseconds median latency, across more than ninety languages. Watching Paul's live demo, it felt comfortably real-time. When he switched from English to Dutch mid-sentence, the system just followed along without missing a beat. Community benchmarks and our own impressions show it holding its own or beating competitors like Whisper and Deepgram in noisy, accented, and multi-speaker scenarios. It's also context-aware enough to handle code, initialisms, and numbers correctly, which is critical for real-world agents. This is a production-ready ASR for anyone building live voice experiences.

Meta drops Omnilingual ASR: 1,600+ languages, many for the first time, plus a bunch of open source models (X, Blog, Announcement, HF)

On the other end of the spectrum, Meta released something that's less about ultra-low latency and more about sheer linguistic coverage: Omnilingual ASR. This is a family of models and a dataset designed to support speech recognition for more than 1,600 languages, including about 500 that have never had any ASR support before. That alone is a massive contribution.

Technically, it uses a wav2vec 2.0 backbone scaled up to 7B parameters with both CTC and LLM-style decoders. The LLM-like architecture allows for in-context learning, so communities can add support for new languages with only a handful of examples. They're also releasing the Omnilingual ASR Corpus with data for 350 underserved languages. The models and code are Apache-2.0, making this a huge step forward for more inclusive speech tech.

AI Art, Diffusion & 3D

Qwen Image Edit + Multi-Angle LoRA: moving the camera after the fact (X, HF, Fal)

This one was pure fun. A new set of LoRAs for Qwen Image Edit adds direct camera control to still images. A Hugging Face demo lets you upload a photo and use sliders to rotate the camera up to 90 degrees, tilt from a bird's-eye to a worm's-eye view, and adjust the lens. We played with it live on the show with a portrait of Wolfram and a photo of my cat, generating different angles and then interpolating them into a short "fly-around" video. It's incredibly cool and preserves details surprisingly well, feeling like you have a virtual camera inside a 2D picture.

NVIDIA ChronoEdit-14B Upscaler LoRA (X, HF)

Finally, NVIDIA released an upscaler LoRA based on their ChronoEdit-14B model and merged the pipeline into Hugging Face Diffusers. ChronoEdit reframes image editing as a temporal reasoning task, like generating a tiny video. This makes it good for maintaining consistency in edits and upscales. It's a heavy model, requiring ~34GB of VRAM, and for aggressive upscaling, specialized tools might still be better. But for moderate upscales where temporal coherence matters, it's a very interesting new tool in the toolbox.

Phew, we made it through this dense week! Looking to next week, I'll be recording the show live from the AI Engineer CODE summit in NY, and we'll likely see a few good releases from the big G? Maybe? Finally?
As always, if this was helpful, please subscribe to ThursdAI and share it with 2 friends, see you next week 🫡

TL;DR and Show Notes

Hosts and Guests
* Alex Volkov - AI Evangelist & Weights & Biases (@altryne)
* Co-Hosts - @WolframRvnwlf, @yampeleg, @ldjconfirmed
* Guest: Dima Duev - wandb SDK team
* Guest: Paul Asjes - ElevenLabs (@paul_asjes)

Open Source LLMs
* Terminal-Bench 2.0 and Harbor launch (X, Blog, Docs, Announcement)
* Baidu releases ERNIE-4.5-VL-28B-A3B-Thinking (X, HF, GitHub, Blog, Platform)
* Project AELLA (OSSAS): 100K LLM-generated paper summaries (X, HF)
* WeiboAI's VibeThinker-1.5B (X, HF, Arxiv, Announcement)
* Code Arena — live, agentic coding evaluations (X, Blog, Announcement)

Big CO LLMs + APIs
* Grok 4 Fast, Grok Imagine and Nano Banana v1/v2 (X, X, X, X)
* OpenAI launches GPT-5.1 (X, X)

This week's Buzz
* W&B LEET — an open-source Terminal UI (TUI) to monitor runs (X, Blog)

Voice & Audio
* ElevenLabs launches Scribe v2 Realtime (X, Blog, Docs)
* Meta releases Omnilingual ASR for 1,600+ languages (X, Blog, Paper, HF Dataset, HF Demo, GitHub)
* Gemini Live conversational upgrade (X)

AI Art & Diffusion & 3D
* Qwen Image Edit + Multi-Angle LoRA for camera control (X, HF, Fal)
* NVIDIA releases ChronoEdit-14B Upscaler LoRA (X, HF, Docs)

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe
    --------  
    1:10:20
  • 📆 ThursdAI - Nov 6, 2025 - Kimi’s 1T Thinking Model Shakes Up Open Source, Apple Bets $1B on Gemini for Siri, and Amazon vs. Perplexity!
Hey, Alex here! Quick note: while preparing for this week, I posted on X that I don't remember such a quiet week in AI since I started doing ThursdAI regularly, but then 45 min before the show started, Kimi dropped a SOTA oss reasoning model, turning a quiet week into an absolute banger. Besides Kimi, we covered the updated MCP thinking from Anthropic, and had Kenton Varda from Cloudflare as a guest to talk about Code Mode, chatted about Windsurf and Cursor's latest updates, and covered OpenAI's insane deals. Also, because it was a quiet week, I figured I'd use the opportunity to create an AI powered automation, and used N8N for that, and shared it on the stream, so if you're interested in automating with AI with relatively low code, this episode is for you. Let's dive in

ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.

Kimi K2 Thinking is Here and It's a 1 Trillion Parameter Beast! (X, HF, Tech Blog)

Let's start with the news that got everyone's energy levels skyrocketing right as we went live. Moonshot AI dropped Kimi K2 Thinking, an open-source, 1 trillion-parameter Mixture-of-Experts (MoE) model, and it's an absolute monster.

This isn't just a numbers game; Kimi K2 Thinking is designed from the ground up to be a powerful agent, with just around 32 billion active parameters during inference, a massive 256,000 token context window, and an insane tool-calling capacity. They're claiming it can handle 200-300 sequential tool calls without any human intervention. The benchmarks are just as wild. On Humanity's Last Exam (HLE), they're reporting a score of 44.9%, beating out both GPT-5 and Claude 4.5 Thinking. While it doesn't quite top the charts on SWE-bench Verified, it's holding its own against the biggest closed-source models out there. Seeing an open-source model compete at this level is incredibly exciting.

During the show, we saw some truly mind-blowing demos, from a beautiful interactive visualization of gradient descent to a simulation of a virus attacking cells, all generated by the model. The model's reasoning traces, which are exposed through the API, also seem qualitatively different from other models, showing a deep and thoughtful process. My co-hosts and I were blown away. The weights and a very detailed technical report are available on Hugging Face, so you can dive in and see for yourself. Shout out to the entire Moonshot AI team for this incredible release!
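If you want to kick the tires without downloading a terabyte of weights, the easiest route is an OpenAI-compatible host such as OpenRouter. A minimal sketch; the model slug and the OPENROUTER_API_KEY variable are my assumptions, so verify the exact id on the OpenRouter model page:

```python
# Minimal sketch: calling Kimi K2 Thinking through OpenRouter's
# OpenAI-compatible API. The model slug below is an assumption --
# verify it on openrouter.ai before using.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

response = client.chat.completions.create(
    model="moonshotai/kimi-k2-thinking",
    messages=[{"role": "user", "content": "Plan the tool calls needed to research this week's AI news."}],
)
print(response.choices[0].message.content)
```

This OpenAI-compatible setup is also what made it trivial to swap Kimi K2 into my n8n automation later in the show.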
Other open source updates from this week

* HuggingFace released an open source "Smol Training Playbook" on training LLMs; it's a 200+ page interactive beast with visualizations and deep dives into pretraining, datasets, post-training and more! (HF)
* Ai2 launches OlmoEarth — foundation models + open, end-to-end platform for fast, high-resolution Earth intelligence (X, Blog)
* LongCat-Flash-Omni — open-source omni-modal system with millisecond E2E spoken interaction, 128K context and a 560B ScMoE backbone (X, HF, Announcement)

Big Tech's Big Moves: Apple, Amazon, and OpenAI

The big companies were making waves this week, starting with a blockbuster deal that might finally make Siri smart. Apple will reportedly pay Google around $1 billion per year to license a custom 1.2 trillion-parameter version of Gemini to power a revamped Siri.

This is a massive move. The Gemini model will run on Apple's Private Cloud Compute, keeping user data walled off from Google, and will handle Siri's complex summarizer and planner functions. After years of waiting for Apple to make a significant move in GenAI, it seems they're outsourcing the heavy lifting for now while they work to catch up with their own in-house models. As a user, I don't really care who builds the model, as long as Siri stops being dumb!

In more dramatic news, Perplexity revealed that Amazon sent them a legal threat to block their Comet AI assistant from shopping on Amazon.com. This infuriated me. My browser is my browser, and I should be able to use whatever tools I want to interact with the web. Perplexity took a strong stand with their blog post, "Bullying is Not Innovation," arguing that user agents are distinct from scrapers and act on behalf of the user with their own credentials. An AI assistant is just that—an assistant. It shouldn't matter if I ask my wife or my AI to buy something for me on Amazon. This feels like a move by Amazon to protect its ad revenue at the expense of user choice and innovation, and I have to give major props to Perplexity for being so transparent and fighting back.

Finally, OpenAI continues its quest for infinite compute, announcing a multi-year strategic partnership with AWS. This comes on top of massive deals with NVIDIA, Microsoft, Oracle, and others, bringing their total commitment to compute into the trillions of dollars. It's getting to a point where OpenAI seems "too big to fail," as any hiccup could have serious repercussions for the entire tech economy, which is now heavily propped up by AI investment. Sam clarified in a recent post on X that OpenAI doesn't want to be too big to fail, and that the recent miscommunications around the US government backstopping OpenAI's infrastructure bailouts were taken out of context. 🤔

Coding with AI: The Evolution of MCP and New Dev Tools

This week, we kicked off a new segment on the show: Coding with AI! We essentially realized that we talk about AI coding a LOT and decided to add a dedicated corner for it. And we started with a fascinating development in the world of agentic tooling. Anthropic published a blog post arguing that the standard way of using the Model Context Protocol (MCP) — by loading full tool definitions into the context window — is inefficient.

Their solution? Have LLMs write code to interact with tools instead. This approach can slash token usage by over 98% in some cases. This idea sounded familiar, and that's because Cloudflare had already explored it with a feature called "Code Mode." We were lucky enough to have Kenton Varda, one of the authors of the Code Mode post and head of engineering for Cloudflare Workers, join us to discuss this shift.

Kenton explained that LLMs are trained on vast amounts of code, making it a more "native language" for them than the artificial construct of tool calls. By generating code, agents can chain multiple tool calls together, process intermediate results, and operate much more efficiently without sending everything back through the neural network. While MCP still provides crucial standardization for discovering and authorizing tools, this "code execution" pattern seems to be the way forward for building more powerful and scalable agents.
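To see why this saves so many tokens, compare the two styles: with classic tool calling, every tool's JSON schema sits in the context and every intermediate result round-trips through the model, while in the code-execution style the model writes one small script that chains the calls and only the final answer comes back. A toy sketch of the idea (the tool functions are hypothetical stand-ins, not Cloudflare's or Anthropic's actual harness):

```python
# Toy sketch of "code mode": instead of emitting one JSON tool call per step,
# the model writes a small script that chains the tools locally, so the
# intermediate results never re-enter its context window.
# The tool functions below are hypothetical stand-ins, not a real MCP server.

def search_tweets(query: str) -> list[dict]:
    # Stand-in for an MCP tool call; returns canned data for the sketch.
    return [{"text": f"sample tweet about {query} #{i}"} for i in range(100)]


def summarize(texts: list[str]) -> str:
    # Stand-in for a summarization tool.
    return f"Summary of {len(texts)} posts."


# In "code mode", the agent's output is a script like this, which the harness
# runs in a sandbox; only the returned string goes back to the model.
def agent_task() -> str:
    tweets = search_tweets("Kimi K2 Thinking reactions")
    top_texts = [t["text"] for t in tweets[:20]]   # filtering happens in code
    return summarize(top_texts)


if __name__ == "__main__":
    print(agent_task())
```

The large token savings Anthropic reports come from exactly this effect: the schemas and the bulky intermediate payloads never re-enter the context window.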
Windsurf's CodeMaps and Cursor multi-agent executions

In other coding with AI news, Windsurf has pushed an incredible feature called CodeMaps. They use their SWE-1 model to (quickly) generate Codemaps that explain a codebase to you, in a visual way: what starts where and goes where. It's really useful for understanding a new codebase or re-understanding one you forgot about already! You can even chat with Codemaps to see if your overall system's design is solid. A great addition that I'm sure will help many folks adopt Windsurf!

And Cursor, another popular AI-native IDE, released a super-performant in-IDE browser and a wild multi-agent feature that queries multiple LLMs in parallel and then synthesizes their answers.

This Week's Tutorial

I finally got around to building some serious automations for ThursdAI, and folks, N8N has been a game-changer. What used to take me 30+ minutes of manual work now happens automatically in the background.

Here's what I built: a Telegram bot that takes Twitter/X links, fetches the tweets and all linked content, uses AI agents to extract and summarize the information, and then posts it to our announcement channel and my notes app. The coolest part? I built this whole thing in about 4 hours with the help of Atlas browser and GPT-5 literally telling me what to do at each step.

During the show, we even live-tested swapping out GPT-4o-mini for Kimi K2 - it took literally 30 seconds to connect via OpenRouter. I went through my nodes and explained how this all works on the show, so if you've wanted to learn about n8n, check it out starting around 01:13:00. If you want to see how my automation turned out, it will be posting all my links to the new Telegram channel t.me/thursdai_news (expect it to be messy at first as I'm testing out the automation).

Robotics - Xpeng's "Iron" humanoid: big vibes, few specs

Another week, another humanoid robot that is supposedly "coming" in 2026! A humanoid from Xpeng went viral this week, marketed as "the most human-like" robot with soft skin, bionic muscles, customizable sexes (yes, really, they have a woman humanoid), something called a VLT brain, and a 2026 production goal. Here's what we didn't get: a spec sheet. No DOF, speed, payload, compute TOPS, battery capacity, runtime, or safety pathway. No pricing, manufacturing strategy, or clear target markets. In other words: lots of sizzle, no steak.

Apparently, there were folks thinking Xpeng pulled an Elon and put a human in a robot suit, prompting the CEO to do the "we'll cut a part of the soft skin to expose the robot underneath so you don't think we're lying" stunt. Which, I agree, was very effective. But if Xpeng is serious, the next thing we'll see should be a crisp engineering document: joints, actuation, sensors, compute, and a locomotion/manipulation demo with independent measurements. Until then, treat this as a branding salvo and a reminder that the humanoid category is still sorting itself into "industrial payload first" versus "human likeness first" approaches.

Voice & Audio

Maya-1: open-source voice design from natural language

We highlighted Maya-1, a 3B Llama-backboned TTS system designed to generate voices from natural language descriptions. Instead of picking from a menu, you describe the voice—age, accent, affect—and Maya conjures it. It supports real-time streaming and over twenty "emotion tags." The quality is compelling for its size and the Apache 2 license will make a lot of builders happy.
There's a growing middle class of TTS: tiny but expressive, good enough for in-app narrators, prototyping, and even stylized content when you don't want the constraints of commercial voice marketplaces.

Inworld TTS: a new leader on independent rankings

We also listened to Inworld's latest, which currently tops the Artificial Analysis TTS leaderboard. It's not open source, but the combo of expressivity, speed (sub-250 ms), and multilingual support puts it firmly in the "commercially viable at scale" tier alongside the usual suspects. If you need SaaS TTS today and care about emotional range, add this to your shortlist. Pricing on their site targets availability rather than hobbyist tinkering, but the quality argues for itself.

Whew! For a week that started slow, it certainly ended with a bang. It just goes to show you can never count AI out. We're seeing open source continue to push the boundaries, big tech making landscape-defining moves, and agentic AI becoming more powerful and accessible every day.

As always, thanks for tuning in. If you're going to be at the AI.engineer conference in New York, please hit me up—I'd love to meet you.

TL;DR and Show Notes + Links

Hosts and Guests
* Alex Volkov - AI Evangelist & Weights & Biases (@altryne)
* Co-Hosts - @WolframRvnwlf @yampeleg @nisten
* Kenton Varda @ Cloudflare (@KentonVarda)

Open Source LLMs
* Smol Training Playbook — a 200+ page, end-to-end guide to reliably pretrain and operate LLMs (X, Announcement)
* Ai2 launches OlmoEarth — foundation models + open, end-to-end platform for fast, high-resolution Earth intelligence (X, Blog)
* Moonshot AI releases Kimi K2 Thinking — an open-source 1T-parameter MoE agent with 256K context and huge tool-calling capacity (X, HF, Blog, Arxiv)
* LongCat Flash Omni - 560B (27B active) omni model (text, audio, video input)

Big CO LLMs + APIs
* Apple will pay roughly $1B/year to license a custom 1.2 trillion-parameter Google Gemini model to power a revamped Siri (X, Announcement)
* Perplexity says Amazon issued a legal threat to block Comet AI assistants from shopping on Amazon (X, Blog)
* AWS announces multi-year strategic infrastructure partnership with OpenAI to power ChatGPT inference, training, and agentic AI (X)

Robotics
* Xpeng unveils 'Iron' humanoid claiming 'most human-like' design with soft skin, bionic muscles, VLT brain and a 2026 production plan (X)

Coding with AI
* Anthropic shows how running MCP-connected tools as code slashes token use and scales agents (X, Blog)
* Windsurf Codemaps — AI-annotated, navigable maps of your codebase powered by SWE-1.5 (Fast) and Sonnet 4.5 (Smart) (X, Announcement)
* Conversation with Kenton Varda (@KentonVarda) from Cloudflare about MCP and Code Mode
* Cursor added in-IDE browser - very performant!

Audio & Video
* Maya-1 - open source voice generation model
* Inworld TTS - new #1 on the Artificial Analysis benchmark

Tools & Gadgets
* Sandbar launches Stream — a voice-first personal assistant — and Stream Ring, a wearable 'mouse for voice', available for preorder (X, Blog)

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe
    --------  
    1:32:45
  • ThursdAI - Oct 30 - From ASI in a Decade to Home Humanoids: MiniMax M2's Speed Demon, OpenAI's Bold Roadmap, and 2026 Robot Revolution
Hey, it's Alex! Happy Halloween friends! I'm excited to bring you this week's (spooky) AI updates! We started the show today with MiniMax M2, the current top open source LLM, with an interview with their head of eng, Skyler Miao, continued to dive into OpenAI's completed restructuring into a non-profit and a PBC, including a deep dive into a live stream Sam Altman had, with a ton of spicy details, and finally chatted with Arjun Desai from Cartesia, following the release of Sonic 3, a sub-49ms voice model! So, 2 interviews + tons of news, let's dive in! (as always, show notes in the end)

Hey, if you like this content, it would mean a lot if you subscribe as a paid subscriber.

Open Source AI

MiniMax M2: open-source agentic model at 8% of Claude's price, 2× speed (X, Hugging Face)

We kicked off our open-source segment with a banger of an announcement and a special guest. The new king of open-source LLMs is here, and it's called MiniMax M2. We were lucky enough to have Skyler Miao, Head of Engineering at MiniMax, join us live to break it all down.

M2 is an agentic model built for code and complex workflows, and its performance is just staggering. It's already ranked in the top 5 globally on the Artificial Analysis benchmark, right behind giants like OpenAI and Anthropic. But here's the crazy part: it delivers nearly twice the speed of Claude 3.5 Sonnet at just 8% of the price. This is basically Sonnet-level performance, at home, in open source.

Skyler explained that their team saw an "impossible triangle" in the market between performance, cost, and speed—you could only ever get two. Their goal with M2 was to build a model that could solve this, and they absolutely nailed it. It's a 200B parameter Mixture-of-Experts (MoE) model, but with only 10B active parameters per inference, making it incredibly efficient.

One key insight Skyler shared was about getting the best performance. M2 supports multiple APIs, but to really unlock its reasoning power, you need to use an API that passes the model's "thinking" tokens back to it on the next turn, like the Anthropic API. Many open-source tools don't support this yet, so it's something to watch out for.

Huge congrats to the MiniMax team on this open-weights (MIT licensed) release; you can find the model on HF! MiniMax had quite a week, with 3 additional releases: MiniMax Speech 2.6, an update to their video model Hailuo 2.3, and, just after the show, a Music 2.0 model as well! Congrats on the shipping, folks!

OpenAI drops gpt-oss-safeguard - first open-weight safety reasoning models for classification (X, HF)

OpenAI is back on the open weights bandwagon with gpt-oss-safeguard, a finetune release of their previously open-weighted gpt-oss models. These models were trained exclusively to help companies build safeguarding policies to make sure their apps remain safe! With gpt-oss-safeguard 20B and 120B, OpenAI is achieving near parity with their internal safety models, and as Nisten said on the show, if anyone knows about censorship and safety, it's OpenAI! The highlight of this release is that, unlike traditional pre-trained classifiers, these models allow for updates to policy via natural language!

These models will be great for businesses that want to safeguard their products in production, and I will advocate to bring these models to W&B Inference soon!
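The policy-as-natural-language part is what makes this different from a classic moderation classifier: the policy is just text you can edit. Here is a minimal sketch of how you might run it against a copy of the model you serve yourself; the vLLM command, the model id, and the policy wording are all my assumptions rather than OpenAI's reference setup:

```python
# Minimal sketch: policy-as-prompt moderation with an open-weight safeguard model.
# Assumes you've served the model yourself with an OpenAI-compatible server, e.g.:
#   vllm serve openai/gpt-oss-safeguard-20b --port 8000
# The model id, port, and policy text are assumptions for illustration.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

POLICY = """You are a content safety classifier.
Policy: flag any message that asks for help evading payment for software.
Respond with exactly one label: ALLOW or FLAG, followed by a one-line reason."""

def classify(message: str) -> str:
    response = client.chat.completions.create(
        model="openai/gpt-oss-safeguard-20b",
        messages=[
            {"role": "system", "content": POLICY},
            {"role": "user", "content": message},
        ],
    )
    return response.choices[0].message.content

print(classify("Where can I find a cracked license key for Photoshop?"))
```

Tightening or loosening the rules becomes a prompt edit instead of a retraining run, which is exactly the selling point.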
A Humanoid Robot in Your Home by 2026? 1X NEO announcement (X, Order page, Keynote)

Things got really spooky when we started talking about robotics. The company 1X, which has been on our radar for a while, officially launched pre-orders for NEO, the world's first consumer humanoid robot designed for your home. And yes, you can order one right now for $20,000, with deliveries expected in early 2026.

The internet went crazy over this announcement, with folks posting receipts of getting one, other folks stoking the uncanny-valley fears that sci-fi has built into many people over the years, of the robot uprising, and others talking about the privacy concerns of having a human tele-operate this robot in your house to do chores. It can handle chores like cleaning and laundry, and for more complex tasks that it hasn't learned yet, it uses a teleoperation system where a human "1X Expert" can pilot the robot remotely to perform the task. This is how it collects the data to learn to do these tasks autonomously in your specific home environment.

The whole release is very interesting, from the "soft and quiet" approach 1X is taking, making their robot a 66lbs short king draped in a knit sweater, to the $20K price point (effectively at a loss, given how much just the hands cost), to the human-teleoperation addition that makes sure the robot learns about your unique house layout. The conversation on the show was fascinating. We talked about all the potential use cases, from having it water your plants and look after your pets while you're on vacation to providing remote assistance for elderly relatives. Of course, there are real privacy concerns with having a telepresence device in your home, but 1X says these sessions are scheduled by you and have strict no-go zones.

Here's my prediction: by next Halloween, we'll see videos of these NEO robots dressed up in costumes, helping out at parties. The future is officially here. Will you be getting one? If not this one, when do you think you'll get one?

OpenAI's Grand Plan: From Recapitalization to ASI

This was by far the biggest update about the world of AI for me this week! Sam Altman was joined by Jakub Pachocki, chief scientist, and Wojciech Zaremba, a co-founder, on a live stream to share an update about their corporate structure, plans for the future, and ASI (Artificial Superintelligence) goals.

First, the company now has a new structure: a non-profit OpenAI Foundation governs the for-profit OpenAI Group. The foundation starts with about 26% equity and has a mission to use AI for public good, including an initial $25 billion commitment to curing diseases and building an "AI Resilience" ecosystem.

But the real bombshells were about their research timeline. Chief Scientist Jakub Pachocki stated that they believe deep learning systems are less than a decade away from superintelligence (ASI). He said that at this point, AGI isn't even the right goal anymore. To get there, they're planning to have an "AI research intern" by September 2026 and a fully autonomous AI researcher comparable to their human experts by March 2028. This is insane if you think about it. As Yam mentioned, OpenAI is already shipping at an insane speed, releasing models and products—Sora, Atlas, Pulse, the ChatGPT app store—and this is with humans, assisted by AI. And here they are talking about complete and fully autonomous researchers, infinitely more scalable than humans, in the next 2 years. The outcomes of this are hard to imagine and are honestly mindblowing.

To power all this innovation, Sam revealed they have over $1.4 trillion in obligations for compute (over 30 GW). And said even that's not enough.
Their aspiration is to build a "compute factory" capable of standing up one gigawatt of new compute per week, and he hinted they may need to "rethink their robotics strategy" to build the data centers fast enough. Does this mean OpenAI humanoid robots building factories? 🤔 Plus, don't forget, Sam is one of the investors in Helion Energy, working on power solutions like fusion, and the above graphic has an Energy block that Sam said they will give an update on later (that's also what he told me during Dev Day when I asked him about it). Super exciting and honestly mind-blowing stuff: gigawatts per week, fully autonomous researchers—the world is going to look way different in a few years!

The Agent Labs Race: Cursor 2.0 vs. Cognition's SWE-1.5 (X, Blog)

This week also saw a major showdown in the agentic coding space. On the very same day, both Cursor and Cognition launched major updates and their own new models, signaling a new era where agent labs are training their own specialized AI.

First up, Cursor 2.0 was released with a completely redesigned multi-agent interface and their new model, Composer. Composer is claimed to be four times faster than comparable models, and the new UI is built around managing a fleet of agents that can work in parallel on your codebase. It's a clear shift from being just an IDE to a full-fledged agent platform. Look, the UI even looks like ChatGPT, with no code in sight (until you switch to IDE mode).

Their Composer model is also very interesting and got a lot of folks excited, though the evaluations they shared raised some questions, as did the fact that they didn't disclose whether it's a finetune of a Chinese model (it likely is). Regardless, folks are saying that it's a very good model that's also VERY fast!

Cognition's own coding model - SWE-1.5 (Blog, X, Windsurf)

Then, just hours later, Cognition punched right back with SWE-1.5, their new frontier agent model that now powers Windsurf. The headline here is pure speed. Powered by Cerebras, SWE-1.5 hits a blistering 950 tokens per second—13 times faster than Sonnet 4.5—while achieving near-SOTA performance on SWE-Bench Pro. They've achieved this through a co-designed stack where the agent harness, inference system, and model were all built together and optimized with end-to-end reinforcement learning in real coding environments.

This competition is fantastic news for all of us. We're seeing specialized, highly-performant models being developed outside of the big labs, putting more power back in the hands of developers.

This Week's Buzz

Just a few quick updates from the world of Weights & Biases and our parent company, CoreWeave.

First, big news! CoreWeave announced the acquisition of Marimo, the company behind the popular open-source, reactive notebook for Python. This is another exciting step in building out the essential cloud for AI, adding powerful development tools to the stack alongside best-in-class GPU infrastructure and MLOps with Weights & Biases. Welcome to the Marimo team!

Also, Fully Connected is coming to London next week! It's our premier conference, and we'll have speakers from Mistral, Google, LlamaIndex, and more. If you're in Europe, please come join us. DM me if you need tickets!

And if you're in New York from November 19-22, come say hi at the AI Engineer Code Summit. We're sponsoring and will have a big booth.
It's always a great place to meet folks from this community.

Video & Voice: The Multimodal Explosion

The world of video and voice AI was on fire this week.

The absolute highlight was Odyssey ML V2, a new real-time interactive AI video platform. This thing is not like other video models that take minutes to generate a clip. With Odyssey, you type a prompt, and a video starts streaming instantly. Then you can edit it live. We did a demo on the show where we prompted "army of robots in a starship corridor" and then typed "turn these robots into fluffy covered cat robots," and the video changed in real time. It's mind-blowing. This is a glimpse into the future of user-driven, playable media.

On the more traditional video front, Sora is now invite-free in the US and Japan, and they launched Character Cameos. You can now upload photos of your pets or objects (like your kid's carved pumpkin!) and turn them into characters that you and others can use in videos. I, of course, immediately made a cameo of my cat, Sonia.

Voice and Audio - Cartesia launches Sonic 3, a sub-50ms AI speech model

In the world of voice, we had Arjun Desai from Cartesia join us to talk about Sonic-3, their new real-time TTS engine. Backed by a new $100M funding round, Sonic-3 is built on state space models (not Transformers) and can achieve insane speeds: we're talking under 50ms latency. But it's not just fast; it's also incredibly expressive. It can laugh, emote, and speak 42 languages with natural code-switching. I used their Pro Voice cloning feature to create an AI version of myself, and the results were scarily good. We even had my AI clone host a segment of the show (see it yourself here, powered by Argil and Sonic 3): this is... AI Alex.

Wrapping Up This Spooky Week 🎃

As I sit here in my Halloween costume reflecting on this week, I can't help but feel we're at an inflection point. We have:

* Open source models competing with the best proprietary ones
* Humanoid robots becoming consumer products
* ASI timelines measured in single-digit years
* Real-time interactive AI across all modalities

And yet, nothing about this scares me. If anything, I'm more excited than ever about what we're building together. Yes, the pace is insane. Yes, keeping up with everything is becoming nearly impossible (and it's literally my job!). But we're living through the most transformative period in human history, and we get to be part of it.

To everyone building, experimenting, and pushing boundaries - keep going. To everyone worried about what's coming - join us in shaping it responsibly. And to everyone who celebrated Halloween today - I hope your costume was as epic as the AI developments we covered! 👻

Until next week, this is Alex signing off.
Remember to subscribe, give us five stars, and I'll see you next ThursdAI!

TL;DR - All Topics Covered

ThursdAI - Oct 30 - Halloween Special 👻

* Hosts and Guests
* Alex Volkov - AI Evangelist & Weights & Biases (@altryne)
* Co-Hosts - @WolframRvnwlf @yampeleg @nisten @ldjconfirmed @ryancarson
* Guest: Skyler Miao - Head of Engineering, MiniMax (@SkylerMiao7)
* Guest: Arjun Desai - Co-Founder, Cartesia (@jundesai)
* Open Source LLMs
* MiniMax M2: Open-source agentic model at 8% of Claude's price, 2× speed (X, Hugging Face)
* OpenAI GPT-OSS-Safeguard: First open-weight safety reasoning models (X, HF)
* IBM Granite 4.0 Nano: Ultra-efficient tiny models for edge deployment (X, Artificial Analysis)
* Ming-flash-omni Preview: Sparse MoE omni-modal model (X, HuggingFace)
* Kimi Linear: 48B parameter model with 1M context (HF)
* Robotics
* 1X NEO: First consumer humanoid robot, $20k, delivery 2026 (X, Order page, Keynote)
* Big Companies & APIs
* OpenAI Restructuring: ASI within 10 years, AI researcher by 2028 (X)
* Cursor 2.0 & Composer: 4x faster coding, new model (X, Blog)
* Cognition SWE-1.5: 950 tok/s, 40% SWE-bench Pro (Blog, X, Windsurf)
* Perplexity Email Assistant: Privacy-first AI inbox management (X, Assistant Site)
* This Week's Buzz
* Fully Connected London - fullyconnected.com
* AI Engineer Code Summit NYC - Nov 19-22
* CoreWeave acquires Marimo notebooks (X)
* Vision & Video
* Odyssey ML V2: Real-time interactive AI video (X, Experience)
* Sora: Now invite-free + Character Cameos feature (X, Sonia Cameo)
* Hailuo 2.3: Cinema-grade video generation (X)
* Voice & Audio
* MiniMax Speech 2.6 (X, MiniMax, API Docs)
* Cartesia Sonic 3: Real-time TTS with emotion & laughter, $100M funding (X, Website, Docs)
* Tools
* Pokee: Agentic workflow builder (X)
* Pomelli: Google's AI marketing agent (X, Labs)

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe
    --------  
    1:37:29
  • 📆 ThursdAI - Oct 23: The AI Browser Wars Begin, DeepSeek's OCR Mind-Trick & The Race to Real-Time Video
Hey everyone, Alex here! Welcome... to the browser war II - the AI edition! This week we chatted in depth about ChatGPT's new Atlas agentic browser, and the additional agentic powers Microsoft added to Edge with Copilot Mode (though it didn't work for me). This was also a crazy OCR week, with more than 4 OCR models releasing, and the crown one is DeepSeek OCR, which turned the whole industry on its head (more later). Quite a few video updates as well, with real-time lipsync from Decart and a new update from LTX with 4K native video generation; it's been a busy AI week for sure! Additionally, I had the pleasure of talking about AI browsing agents with Paul from Browserbase and real-time video with Kwindla Kramer from Pipecat/Daily, so make sure to tune in for those interviews. Buckle up, let's dive in!

Open Source: OCR is Not What You Think It Is (X, HF, Paper)

The most important and frankly mind-bending release this week came from DeepSeek. They dropped DeepSeek-OCR, and let me tell you, this is NOT just another OCR model. The co-hosts were buzzing about this, and once I dug in, I understood why. This isn't just about reading text from an image; it's a revolutionary approach to context compression.

We think that DeepSeek needed this as an internal tool, so we're really grateful to them for open sourcing it, because they did something crazy here. They are essentially turning text into a visual representation, compressing it, and then using a tiny vision decoder to read it back with incredible accuracy. We're talking about a compression ratio of up to 10x with 97% decoding accuracy. Even at 20x compression they are achieving 60% decoding accuracy! My head exploded live on the show when I read that. This is like the middle-out compression algorithm joke from Silicon Valley, but it's real. As Yam pointed out, this suggests our current methods of text tokenization are far from optimal.

With only 3B total parameters and ~570M active, they are taking a direct stab at long-context inefficiency: imagine taking 1M tokens, encoding them into 100K visual tokens, and then feeding those into a model. Since the model is tiny, it's very cheap to run. For example, alphaXiv claimed they OCR'd all of the papers on arXiv with this model for $1,000, a task that would have cost $7,500 using Mistral OCR, and per the paper, a single H100 GPU can scan up to 200K pages with DeepSeek OCR! 🤯 Really innovative stuff!

OCR and VLM models had quite a week, with multiple models releasing besides DeepSeek OCR: Liquid's LFM2-VL-3B (X, HF), the newly updated 2B and 32B versions of Qwen3-VL (X, Hugging Face), and AI2's olmOCR 2-7B (X, HF). The Qwen models are particularly interesting, as the 2B model is a generic VLM (that can also do OCR) and comes close to last week's 4B and 8B siblings, while the newly updated 32B model even outperforms GPT-5 mini and Claude 4 Sonnet!
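Back to DeepSeek-OCR for a second: to make those claimed compression and cost numbers concrete, here's a quick back-of-the-envelope sketch in plain Python. It only restates the figures quoted above; the arXiv size, the average page count, and the per-day reading of the 200K-pages figure are my own rough assumptions, not from the episode or the paper.

```python
# Rough math on DeepSeek-OCR's claimed "optical compression", using the figures quoted above.

text_tokens = 1_000_000                     # a long prompt, measured in ordinary text tokens
compression_ratio = 10                      # ~10x compression at ~97% decoding accuracy (claimed)
vision_tokens = text_tokens // compression_ratio
print(f"{text_tokens:,} text tokens -> ~{vision_tokens:,} vision tokens")   # ~100,000

# Cost comparison quoted on the show for OCR-ing all of arXiv:
deepseek_cost, mistral_cost = 1_000, 7_500  # USD, alphaXiv's claim vs. Mistral OCR
print(f"~{mistral_cost / deepseek_cost:.1f}x cheaper")                      # ~7.5x

# Throughput: the episode quotes ~200K pages on a single H100; assuming that's per day
# (my assumption), with ~2.4M arXiv papers at ~15 pages each (also my assumptions):
pages_per_day_per_h100 = 200_000
arxiv_papers, avg_pages = 2_400_000, 15
h100_days = arxiv_papers * avg_pages / pages_per_day_per_h100
print(f"~{h100_days:.0f} H100-days to OCR all of arXiv")                    # ~180
```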
The Browser Wars are BACK: OpenAI & Microsoft Go Agentic

Look, I may be aging myself here, but I remember, as a young frontend dev, having to install 5 browsers at once to test them out: Chrome, Internet Explorer, Firefox, Opera, etc. That was then, and now I have Dia, Comet, and the newly released Atlas, and, yeah, today I even installed Microsoft Edge to test their AI features! It seems like the AI boom brought with it a newly possible reason for folks to try and take a bite out of Chrome (whose agentic features have long been rumored under Project Mariner but are nowhere to be found or shipped yet).

OpenAI's ChatGPT Atlas: The Browser Reimagined (X, Download)

OpenAI is proving that, beyond just models, they are a product powerhouse, stepping into categories like shopping (with a Shopify integration), app stores (with ChatGPT apps), social (with Sora 2), and now... browsers! This week they launched Atlas, their browser that is tightly integrated into ChatGPT, and it's a big release! I'll split my review into two parts: the browser features and the agentic part.

A fresh take on a Chromium-based browser

The tight integration with ChatGPT is everywhere in this browser, from the new tab that looks like the basic ChatGPT interface (one line of text) to the sidebar on the left that... is the ChatGPT web sidebar with all your chats, projects, custom GPTs, etc. The integration doesn't stop there, as you have to sign in to your ChatGPT account to even use this browser (available only to macOS users on the Pro, Plus, and Nano tiers). The browser has a few neat tricks, like a special tool that lets you search your browsing history with natural language; something like "what were those shoes I was looking at a few days ago" will find you the tabs you browsed for shoes. A special and cool feature is called, confusingly, "Cursor," wherein you can select text and then click the little OpenAI logo that pops up, allowing you to ask ChatGPT for changes to that selected text (fix typos, spruce up your writing, etc.). It's surprisingly convenient for rewriting tweets or any kind of document editing.

ChatGPT Atlas also stores memories about your browsing patterns, in addition to the ChatGPT memories it stores about you from chats, helping even more by knowing which software you prefer to use, which websites you prefer to order food from, and so on. This, IMO, is one of the biggest unlocks for folks inside the ChatGPT ecosystem, as much of a standard person's preferences can be gleaned from their browser usage and patterns.

Lastly, the "Ask ChatGPT" sidepane on the right (which can be opened with cmd+.) is really great for chatting with a webpage or going down search rabbit holes. It receives the context of the webpage you're looking at by default (only one page so far; competitors let you add additional tabs with @, which is supposedly coming to ChatGPT soon), and you can ask ChatGPT anything about it.

Agentic SOTA? Not so fast

The most important "change" to how browsers work in Atlas, IMO, is the agentic mode. This isn't new; we remember when ChatGPT launched their Operator agent back in January of this year (our coverage) and then renamed it Agent Mode and integrated it into ChatGPT itself back in July. So web-browsing agents are not entirely new. What's novel here, though, is the integration into your browser and the ability for Atlas to use your logged-in sessions and cookies to pretend to be you! This... can be quite scary for some, as prompt injection attacks are getting more popular (wherein malicious actors add hidden instructions to their website that get the agent to do something you don't like), but it's also very exciting, as the agent can do much, much more without getting blocked by providers who could previously just block Agent Mode because it ran on OpenAI's servers!
Until today, there were two main agentic browsers in the mix: Perplexity's Comet (where you can choose which model runs the agent) and Atlas. Comet seems to be doing a little better on some things in my tests, but not by much. I have the same agentic task (go to X.com, find my bookmarks, open all the links, summarize per my specific format) that I've been running for a while now, and Comet outdid Atlas this week on that task.

Who needs agentic browsing?

For some reason, most of the demos for agentic browsing show the same boring-ish examples: book some flights, fill a grocery shopping cart. I've tried new and different things this week, for example letting Atlas choose and order food for me (since ChatGPT knows my pescatarian preferences, it's better than Comet for personal stuff), and one of the longest tasks I've had an agent do yet: I asked it to complete a compliance training I had to take at work! Mind you, this is a very complex task even for regular people, as these compliance websites are built to not be messed with. They have video players that stop if you switch focus to another tab, interactive quizzes and games, drag-and-drop interfaces, and audio buttons to make sure you really are taking the test. I can happily report that after 5 hours, and a few stops along the way (where I had to convince the agent to keep going), it completed this very hard task! (And now I have to take the course myself again to actually be compliant 😅; it will probably take me 2 hours to do manually.)

This experiment made me think: who needs agentic browsing features, and for what? Well, for tasks that require a lot of manual steps to do the same thing over and over again, an agentic browser is going to make a lot of people's browsing a lot easier. Things like reviewing kids' schedules across multiple websites, collecting data and reformatting it, etc.

Scary security implications

Atlas could only finish my compliance task while being logged in as me, and ChatGPT Atlas gives you all-or-nothing control. You can run your agent with full access to your logged-in websites (think Gmail, etc.) or you can essentially give it an incognito mode. This, again, is due to the risk of prompt injections in malicious websites becoming more and more prevalent. In a rare post detailing how they are thinking about this, OpenAI's Chief Information Security Officer offered a deep dive into their attempts to mitigate this issue (Simon Willison had a great breakdown of that information here), but that's likely not enough, so definitely be aware when you're running agent mode (which right now needs to be explicitly turned on by selecting Agent).

This Week's Buzz - Weights & Biases // CoreWeave

Weights & Biases (now proudly part of CoreWeave) had some exciting updates. Our Fully Connected conference series is hitting Tokyo on October 30-31 and London on November 4-5, perfect for ML practitioners and AI engineers. If you're in the area, join us for talks, networking, and deep dives into the latest. Register at Fullyconnected.com; DM me if you need a hook-up!

We also collaborated with Meta and Stanford on Torch Forge, a new PyTorch-native library for scalable RL post-training and agent development. It's built for massive GPU runs (we provided 520 H100s!), competing with Ray via tools like the Monarch scheduler.
If you're training on clusters, check out the blog; it's a big deal for efficient multi-GPU workflows.

Microsoft goes after OpenAI with Edge Copilot Mode (X)

In a pretty surprising move, Microsoft announced today their take on the agentic browser war, with a bunch of enhancements to Copilot (their umbrella term for AI assistance across Microsoft 365, the browser, Bing search, etc.). Think... Clippy for the AI age (they even brought Clippy back as an easter egg). The short version is, Edge is getting more powerful with new agentic features (which I enabled and couldn't get to work no matter how much I tried, so I can't tell you how they compare to Atlas/Comet), and they have a voice mode that lets you talk to your browser, with Edge having a sense of what's on the actual page! Of course, this being Microsoft, marketing aside, when I asked Copilot if it has access to other tabs (like the marketing video claims) it said it doesn't, the agentic mode didn't work, and I'm very unlikely to be testing it further! But hey, if you use the Copilot app on your phone and click the new Mico avatar like 25 times, it will turn into Clippy, so... yay?

Claude Code on the Web, Claude on Desktop upgraded (X, Anthropic)

Anthropic also made waves by bringing Claude Code to the web. Now you can delegate software tasks to Claude through a web interface with GitHub integration. Nisten was particularly excited about being able to manage his coding projects from his phone. It runs tasks in a secure sandbox, can handle multiple repos, and automatically creates pull requests. It's another powerful coding agent becoming more accessible to developers everywhere. They have also made changes to the desktop Claude app, allowing it to see the context of your screen with screenshots and file sharing, plus a new voice mode that lets you talk to Claude (which is unfortunately mapped to the tab button, with no ability to remap).

Browser Automation and Delegated Authentication with Browserbase (X, Director.ai, Stagehand)

While OpenAI and Microsoft are building chat into the browser, what about bringing the browser into our chat-based agents? We had Paul Klein, the founder of Browserbase, join us to talk about this exact topic. His company is tackling one of the biggest hurdles for AI agents: authentication.

Paul and his team launched Director 2.0, a platform that lets you build web automation with natural language prompts. But the real innovation here is their integration with 1Password. Instead of giving an agent the "master keys" to all your logged-in sessions like Atlas does, Browserbase allows for delegated, per-site authentication. When an agent running in the cloud needs to log into a site on your behalf, you get a prompt on your local machine to approve it. This is a much safer, more granular way to give agents the access they need. As Paul said, you shouldn't give an AI the master keys to your house; you should give it permission to enter one room at a time. It's a brilliant paradigm for secure agentic workflows, and I really like this piecemeal approach to authentication for browser agents. I wish Atlas had something like this for the incognito mode! Director 2.0 itself is like V0 for web automation: you give it a prompt, it performs the task, and then it gives you a repeatable script you can deploy. It's a way to create robust automations without needing to be a developer, and it's already being used to automate thousands of hours of manual work.
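To make the "one room at a time" idea concrete, here's a minimal sketch of that kind of delegated, per-site approval flow. Every name in it is a hypothetical stub I made up for illustration; this is not the Browserbase or 1Password API, just the shape of the flow Paul described.

```python
# Hypothetical sketch of delegated, per-site authentication for a cloud browser agent.
# None of these names are real Browserbase or 1Password APIs; they are illustrative stubs.

from dataclasses import dataclass

@dataclass
class CredentialGrant:
    site: str
    approved: bool

def request_local_approval(site: str) -> bool:
    """Stand-in for the approval prompt the user would see on their local machine."""
    answer = input(f"Cloud agent wants to log into {site}. Approve? [y/N] ")
    return answer.strip().lower() == "y"

def get_scoped_credential(site: str) -> CredentialGrant:
    """Release a credential for exactly one site, and only after explicit approval."""
    return CredentialGrant(site=site, approved=request_local_approval(site))

def agent_task(sites_needed: list[str]) -> None:
    for site in sites_needed:
        grant = get_scoped_credential(site)
        if grant.approved:
            print(f"Agent proceeds on {site} with a single-site credential.")
        else:
            print(f"Agent skips {site}; it never gets blanket access to other sessions.")

if __name__ == "__main__":
    agent_task(["calendar.example.com", "mail.example.com"])
```

The point of the sketch is the scoping: the agent never receives a master session, only one approved credential per site, and every request surfaces as a local prompt you can decline.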
Video & Audio: The Race to Real-Time

The world of generative media is moving at lightning speed, with a clear trajectory towards real-time, interactive experiences.

Decart's Real-Time Lip Sync API (X)

We had Kwindla Kramer, one of the world's leading experts in real-time audio, join us to break down a phenomenal release from Decart AI: a real-time lip-sync API. This isn't the pre-rendered, slightly-off lip-sync we're used to. This is a pipeline of models working together to generate perfectly synchronized lip movements for an avatar in real time. Kwindla explained the tech stack: it captures your audio via WebRTC, sends it to Whisper for transcription, gets a response from an LLM like Grok, generates a voice with ElevenLabs, and then Decart's model modifies the avatar's video frames to match the new audio, all with sub-two-second latency. This is how we get to truly interactive, believable AI characters. Kwindla even built a quick demo, though it didn't seem to work in the morning (probably GPU issues), so we just played the demo videos.
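To picture the latency budget of a pipeline like that, here's a minimal asyncio sketch. The stage functions are hypothetical stubs with made-up timings; they are not the actual Whisper, Grok, ElevenLabs, or Decart APIs, just placeholders for the stages Kwindla described.

```python
import asyncio, time

# Hypothetical stubs standing in for each stage of the pipeline described above.
# The sleep durations are illustrative guesses, not measured numbers from the show.
async def transcribe(audio: bytes) -> str:          # speech-to-text stage (e.g. Whisper)
    await asyncio.sleep(0.3); return "hello there"
async def llm_reply(text: str) -> str:              # LLM response stage (e.g. Grok)
    await asyncio.sleep(0.6); return f"reply to: {text}"
async def synthesize(text: str) -> bytes:           # TTS stage (e.g. ElevenLabs)
    await asyncio.sleep(0.4); return b"audio"
async def lipsync_frames(audio: bytes) -> bytes:    # Decart-style frame modification
    await asyncio.sleep(0.3); return b"video"

async def turn(audio_in: bytes) -> bytes:
    """One conversational turn: microphone audio in -> lip-synced avatar video out."""
    start = time.perf_counter()
    text = await transcribe(audio_in)
    reply = await llm_reply(text)
    speech = await synthesize(reply)
    video = await lipsync_frames(speech)
    print(f"end-to-end latency: {time.perf_counter() - start:.2f}s")  # needs to stay under ~2s
    return video

asyncio.run(turn(b"mic-audio"))
```

In a real system the stages would be streamed and overlapped rather than run strictly in sequence, which is how you claw back most of that two-second budget.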
LTX-2 and Sora's Pet Cameos

The trend towards high-fidelity, real-time generation continued with a breaking-news release from Lightricks: LTX-2. This is an open-source (weights coming this fall!) engine that can generate native 4K video with synchronized audio. It's fast, efficient, and set to be a powerful open alternative to closed models like Sora. And it's native 4K, no upscaling!

Speaking of Sora, they announced that character cameos are getting an upgrade. Soon you'll be able to turn anything (your pet, a coffee cup, or even a sunny-side-up egg) into an animated, talking character. I'm really looking forward to this new Sora update and will let you know my impressions when it drops (soon, according to Bill from OpenAI).

What a week folks! What A WEEK! 😅 My head is still spinning! From browsers that can do our work for us to OCR that redefines context, we're seeing foundational shifts across the board. The tools are getting more powerful, more accessible, and more integrated into our daily workflows. The future is being built right now, and we get to watch it happen week by week.

Thank you for being a ThursdAI subscriber. As always, here are the show notes with all the links and details from this week's whirlwind of AI news.

* Hosts and Guests
* Alex Volkov - AI Evangelist & Weights & Biases (@altryne)
* Co-Hosts - @yampeleg @nisten @ldjconfirmed
* Paul Klein @pk_iv - Browserbase
* Kwindla Kramer @kwindla - Pipecat & Daily
* Open Source LLMs
* DeepSeek-OCR: Efficient Vision-Text Compression for Massive Contexts (X, HF, Paper)
* Liquid AI LFM2-VL-3B: Tiny Multilingual Vision-Language Model (X, HF)
* PokeeResearch-7B: Open-source SOTA Deep Research Agent (X, HF, Web, ArXiv, GitHub)
* Qwen3-VL 2B & 32B: compact STEM-tuned multimodal powerhouses (X, Hugging Face)
* Big CO LLMs + APIs
* OpenAI announces Atlas - its agentic AI browser (X, Download)
* Security Implications, Injection + note from CISO
* Claude Code on the Web: Cloud Coding with Secure Sandboxing (X, Anthropic)
* Meta bans 1-800-ChatGPT on WhatsApp
* Microsoft agentic addition to Copilot Mode in Edge (X)
* Gemini AI Studio launches "Vibe Coding" (X, AI Studio Build)
* This Week's Buzz
* Fully Connected comes to Tokyo (Oct 30-31) and London (Nov 4-5)! (register at Fullyconnected.com)
* Vision & Video
* Sora is about to get pet cameos
* Krea open-sources a 14-billion-parameter real-time video model (X, HF)
* Reve's unannounced video mode!? 1080p + sound
* LTX-2: open-source 4K audio+video generation engine from Lightricks (X, Website, GitHub)
* Voice & Audio
* Decart Lip Sync API: Real-Time Avatar Lip Movement (X)
* Tools
* Browserbase launches Director 2.0: prompt-powered web automation (X, Director.ai, Stagehand)

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe
    --------  
    1:35:16
