Powered by RND
PodcastsNewsThursdAI - The top AI news from the past week

ThursdAI - The top AI news from the past week

From Weights & Biases, Join AI Evangelist Alex Volkov and a panel of experts to cover everything important that happened in the world of AI from the past week
ThursdAI - The top AI news from the past week
Latest episode

Available Episodes

5 of 128
  • 📆 ThursdAI - Nov 6, 2025 - Kimi’s 1T Thinking Model Shakes Up Open Source, Apple Bets $1B on Gemini for Siri, and Amazon vs. Perplexity!
    Hey, Alex here! Quick note, while preparing for this week, I posted on X that I don’t remember such a quiet week in AI since I started doing ThursdAI regularly, but then 45 min before the show started, Kimi dropped a SOTA oss reasoning model, turning a quiet week into an absolute banger. Besides Kimi, we covered the updated MCP thinking from Anthropic, and had Kenton Varda from cloudflare as a guest to talk about Code Mode, chatted about Windsurf and Cursor latest updates and covered OpenAI’s insane deals. Also, because it was a quiet week, I figured I’d use the opportunity to create an AI powered automation, and used N8N for that, and shared it on the stream, so if you’re interested in automating with AI with relatively low code, this episode is for you. Let’s dive inThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.Kimi K2 Thinking is Here and It’s a 1 Trillion Parameter Beast! (X, HF, Tech Blog)Let’s start with the news that got everyone’s energy levels skyrocketing right as we went live. Moonshot AI dropped Kimi K2 Thinking, an open-source, 1 trillion-parameter Mixture-of-Experts (MoE) model, and it’s an absolute monster.This isn’t just a numbers game; Kimi K2 Thinking is designed from the ground up to be a powerful agent. With just around 32 billion active parameters during inference, a massive 256,000 token context window, and an insane tool-calling capacity. They’re claiming it can handle 200-300 sequential tool calls without any human intervention. The benchmarks are just as wild. On the Humanities Last Exam (HLE), they’re reporting a score of 44.9%, beating out both GPT-5 and Claude 4.5 Thinking. While it doesn’t quite top the charts on SWE-bench verified, it’s holding its own against the biggest closed-source models out there. Seeing an open-source model compete at this level is incredibly exciting.During the show, we saw some truly mind-blowing demos, from a beautiful interactive visualization of gradient descent to a simulation of a virus attacking cells, all generated by the model. The model’s reasoning traces, which are exposed through the API, also seem qualitatively different from other models, showing a deep and thoughtful process. My co-hosts and I were blown away. The weights and a very detailed technical report are available on Hugging Face, so you can dive in and see for yourself. Shout out to the entire Moonshot AI team for this incredible release!Other open source updates from this week* HuggingFace released an open source “Smol Training Playbook” on training LLMs, it’s a 200+ interactive beast with visualizations, deep dives into pretraining, dataset, postraining and more! (HF)* Ai2 launches OlmoEarth — foundation models + open, end-to-end platform for fast, high-resolution Earth intelligence (X, Blog)* LongCat-Flash-Omni — open-source omni-modal system with millisecond E2E spoken interaction, 128K context and a 560B ScMoE backbone (X, HF, Announcement)Big Tech’s Big Moves: Apple, Amazon, and OpenAIThe big companies were making waves this week, starting with a blockbuster deal that might finally make Siri smart. Apple is reportedly will be paying Google around $1 billion per year to license a custom 1.2 trillion-parameter version of Gemini to power a revamped Siri.This is a massive move. The Gemini model will run on Apple’s Private Cloud Compute, keeping user data walled off from Google, and will handle Siri’s complex summarizer and planner functions. After years of waiting for Apple to make a significant move in GenAI, it seems they’re outsourcing the heavy lifting for now while they work to catch up with their own in-house models. As a user, I don’t really care who builds the model, as long as Siri stops being dumb!In more dramatic news, Perplexity revealed that Amazon sent them a legal threat to block their Comet AI assistant from shopping on Amazon.com. This infuriated me. My browser is my browser, and I should be able to use whatever tools I want to interact with the web. Perplexity took a strong stand with their blog post, “Bullying is Not Innovation,” arguing that user agents are distinct from scrapers and act on behalf of the user with their own credentials. An AI assistant is just that—an assistant. It shouldn’t matter if I ask my wife or my AI to buy something for me on Amazon. This feels like a move by Amazon to protect its ad revenue at the expense of user choice and innovation, and I have to give major props to Perplexity for being so transparent and fighting back.Finally, OpenAI continues its quest for infinite compute, announcing a multi-year strategic partnership with AWS. This comes on top of massive deals with NVIDIA, Microsoft, Oracle, and others, bringing their total commitment to compute into the trillions of dollars. It’s getting to a point where OpenAI seems “too big to fail,” as any hiccup could have serious repercussions for the entire tech economy, which is now heavily propped up by AI investment. Sam has clarified that they don’t think OpenAI wants to be too big to fail in a recent post on X, and that the recent miscommunications around the US government backstopping OpenAI’s infrastructure bailouts were taken out of context. 🤔 Coding with AI: The Evolution of MCP and New Dev ToolsThis week, we kicked off a new segment on the show: Coding with AI! Essentially realizing that we talk about AI coding a LOT, and decided to add a dedicated corner to it! And we started with a fascinating development in the world of agentic tooling. Anthropic published a blog post arguing that the standard way of using the Model Context Protocol (MCP) — by loading full tool definitions into the context window — is inefficient.Their solution? Have LLMs write code to interact with tools instead. This approach can slash token usage by over 98% in some cases. This idea sounded familiar, and that’s because Cloudflare had already explored it with a feature called “Code Mode.” We were lucky enough to have Kenton Varda, one of the authors of the Code Mode post and head of engineering for Cloudflare Workers, join us to discuss this shift.Kenton explained that LLMs are trained on vast amounts of code, making it a more “native language” for them than the artificial construct of tool calls. By generating code, agents can chain multiple tool calls together, process intermediate results, and operate much more efficiently without sending everything back through the neural network. While MCP still provides crucial standardization for discovering and authorizing tools, this “code execution” pattern seems to be the way forward for building more powerful and scalable agents.Windsurfs CodeMaps and Cursor multi agent executionsIn other coding with AI news, Windsurf has pushed an incredible feature, called CodeMaps. They will use their SWE-1 model to (quickly) generate Codemaps that will expalins a code-base to you, in a visual way. What starts where and goes where. It’s really useful to understand a new codebase or re-understand one you forgot about already! You can even chat with codemaps, to see if your overall system’s design is solid! Great addition that I’m sure will help many folks adopt Windsurf! And Cursor, another popular AI-native IDE, released a super-performant in-IDE browser and a wild multi-agent feature that queries multiple LLMs in parallel and then synthesizes their answers.This Week’s TutorialI finally got around to building some serious automations for ThursdAI, and folks, N8N has been a game-changer. What used to take me 30+ minutes of manual work now happens automatically in the background.Here’s what I built: A Telegram bot that takes Twitter/X links, fetches the tweets and all linked content, uses AI agents to extract and summarize the information, and then posts it to our announcement channel and my notes app. The coolest part? I built this whole thing in about 4 hours with the help of Atlas browser and GPT-5 literally telling me what to do at each step.During the show, we even live-tested swapping out GPT-4o-mini for Kimi K2 - took literally 30 seconds to connect via OpenRouter. I went through my node and explains how this all works on the show, so if you’ve wanted to learn about n8n, check it out starting around 01:13:00. If you want to see how my automation turned out, it will be posting all my links to the new telegram channel t.me/thursdai_news (expect it to be messy at first as I’m testing out the automation) Robotics - Xpeng’s “Iron” humanoid: big vibes, few specsAnother week, another humanoid robot that is supposedly “coming” in 2026! A humanoid from Xpeng went viral this week, marketed as “the most human‑like” robot with soft skin, bionic muscles, customizable sexes (yes, really, they have a woman humanoid), something called a VLT brain, and a 2026 production goal. Here’s what we didn’t get: a spec sheet. No DOF, speed, payload, compute TOPS, battery capacity, runtime, or safety pathway. No pricing, manufacturing strategy, or clear target markets. In other words: lots of sizzle, no steak.Apparently, there was folks thinking Xpend pulled an Elon and put a human in a robot suit, making the CEO do the “we’ll cut a part of the soft skin to expose the robot underneath so you don’t think we’re lying” stunt. Which I agree, was very effective. But, If Xpeng is serious, the next thing we’ll see should be a crisp engineering document: joints, actuation, sensors, compute, and a locomotion/manipulation demo with independent measurements. Until then, treat this as a branding salvo and a reminder that the humanoid category is still sorting itself into “industrial payload first” versus “human likeness first” approaches. Voice & AudioMaya‑1: open‑source voice design from natural languageWe highlighted Maya‑1, a 3B Llama‑backboned TTS system designed to generate voices from natural language descriptions. Instead of picking from a menu, you describe the voice—age, accent, affect—and Maya conjures it. It supports real‑time streaming and over twenty “emotion tags.” The quality is compelling for its size and the Apache 2 license will make a lot of builders happy. There’s a growing middle class of TTS: tiny but expressive, good enough for in‑app narrators, prototyping, and even stylized content when you don’t want the constraints of commercial voice marketplaces.Inworld TTS: a new leader on independent rankingsWe also listened to Inworld’s latest, which currently tops the Artificial Analysis TTS leaderboard. It’s not open source, but the combo of expressivity, speed (sub‑250 ms), and multilingual support puts it firmly in the “commercially viable at scale” tier alongside the usual suspects. If you need SaaS TTS today and care about emotional range, add this to your shortlist. Pricing on their site targets availability rather than hobbyist tinkering, but the quality argues for itself.Whew! For a week that started slow, it certainly ended with a bang. It just goes to show you can never count AI out. We’re seeing open source continue to push the boundaries, big tech making landscape-defining moves, and agentic AI becoming more powerful and accessible every day.As always, thanks for tuning in. If you’re going to be at the AI.engineer conference in New York, please hit me up—I’d love to meet you.TL;DR and Show Notes + Links* Hosts and Guests* Alex Volkov - AI Evangelist & Weights & Biases (@altryne)* Co Hosts - @WolframRvnwlf @yampeleg @nisten* Kenton Varda @ Cloudflare (@KentonVarda)* Open Source LLMs* Smol Training Playbook — a 200+ page, end-to-end guide to reliably pretrain and operate LLMs (X, Announcement)* Ai2 launches OlmoEarth — foundation models + open, end-to-end platform for fast, high-resolution Earth intelligence (X, Blog)* Moonshot AI releases Kimi K2 Thinking — an open-source 1T-parameter MoE agent with 256K context and huge tool-calling capacity (X, HF, Blog, Arxiv)* LongCat flash Omni - 560B (27A) omni model (text, audio, video input)* Big CO LLMs + APIs* Apple will pay roughly $1B/year to license a custom 1.2 trillion‑parameter Google Gemini model to power a revamped Siri (X, Announcement)* Perplexity says Amazon issued a legal threat to block Comet AI assistants from shopping on Amazon (X, Blog)* AWS announces multi-year strategic infrastructure partnership with OpenAI to power ChatGPT inference, training, and agentic AI (X)* Robotics* Xpeng unveils ‘Iron’ humanoid claiming ‘most human-like’ design with soft skin, bionic muscles, VLT brain and a 2026 production plan (X)* Coding with AI* Anthropic shows how running MCP-connected tools as code slashes token use and scales agents (X, Blog)* Windsurf Codemaps — AI‑annotated, navigable maps of your codebase powered by SWE-1.5 (Fast) and Sonnet 4.5 (Smart) (X, Announcement)* Conversation with Kenton Varda (@KentonVarda) from Cloudflare about MCP and Code Mode* Cursor added in IDE browser - very performant!* Audio & Video* Maya-1 - Open source voice generation model.* Inworld TTS - new #1 on artifical analysis benchmark.* Tools & Gadgets* Sandbar launches Stream — a voice-first personal assistant — and Stream Ring, a wearable ‘mouse for voice’, available for preorder (X, Blog) This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe
    --------  
    1:32:45
  • ThursdAI - Oct 30 - From ASI in a Decade to Home Humanoids: MiniMax M2's Speed Demon, OpenAI's Bold Roadmap, and 2026 Robot Revolution
    Hey, it’s Alex! Happy Halloween friends! I’m excited to bring you this weeks (spooky) AI updates! We started the show today with MiniMax M2, the currently top Open Source LLM, with an interview with their head of eng, Skyler Miao, continued to dive into OpenAIs completed restructuring into a non-profit and a PBC, including a deep dive into a live stream Sam Altman had, with a ton of spicy details, and finally chatted with Arjun Desai from Cartesia, following a release of Sonic 3, a sub 49ms voice model! So, 2 interviews + tons of news, let’s dive in! (as always, show notes in the end)Hey, if you like this content, it would mean a lot if you subscribe as a paid subscriber.Open Source AIMiniMax M2: open-source agentic model at 8% of Claude’s price, 2× speed (X, Hugging Face )We kicked off our open-source segment with a banger of an announcement and a special guest. The new king of open-source LLMs is here, and it’s called MiniMax M2. We were lucky enough to have Skyler Miao, Head of Engineering at Minimax, join us live to break it all down.M2 is an agentic model built for code and complex workflows, and its performance is just staggering. It’s already ranked in the top 5 globally on the Artificial Analysis benchmark, right behind giants like OpenAI and Anthropic. But here’s the crazy part: it delivers nearly twice the speed of Claude 3.5 Sonnet at just 8% of the price. This is basically Sonnet-level performance, at home, in open source.Skylar explained that their team saw an “impossible triangle” in the market between performance, cost, and speed—you could only ever get two. Their goal with M2 was to build a model that could solve this, and they absolutely nailed it. It’s a 200B parameter Mixture-of-Experts (MoE) model, but with only 10B active parameters per inference, making it incredibly efficient.One key insight Skylar shared was about getting the best performance. M2 supports multiple APIs, but to really unlock its reasoning power, you need to use an API that passes the model’s “thinking” tokens back to it on the next turn, like the Anthropic API. Many open-source tools don’t support this yet, so it’s something to watch out for.Huge congrats to the MiniMax team on this Open Weights (MIT licensed) release, you can find the model on HF! MiniMax had quite a week, with 3 additional releases, MiniMax speech 2.6, an update to their video model Hailuo 2.3 and just after the show, they released a music 2.0 model as well! Congrats on the shipping folks! OpenAI drops gpt-oss-safeguard - first open-weight safety reasoning models for classification ( X, HF )OpenAI is back on the open weights bandwagon, with a finetune release of their previously open weighted gpt-oss models, with gpt-oss-safeguard. These models were trained exclusively to help companies build safeguarding policies to make sure their apps remains safe! With gpt-oss-safeguards 20B and 120B, OpenAI is achieving near parity with their internal safety models, and as Nisten said on the show, if anyone knows about censorship and safety, it’s OpenAI! The highlight of this release is, unlike traditional pre-trained classifiers, these models allow for updates to policy via natural language!These models will be great for businesses that want to safeguard their products in production, and I will advocate to bring these models to W&B Inference soon! A Humanoid Robot in Your Home by 2026? 1X NEO announcement ( X, Order page, Keynote )Things got really spooky when we started talking about robotics. The company 1X, which has been on our radar for a while, officially launched pre-orders for NEO, the world’s first consumer humanoid robot designed for your home. And yes, you can order one right now for $20,000, with deliveries expected in early 2026.The internet went crazy over this announcement, with folks posting receipts of getting one, other folks stoking the uncanny valley fears that Sci-fi has built into many people over the years, of the Robot uprising and talking about the privacy concerns of having a human tele-operate this Robot in your house to do chores. It can handle chores like cleaning and laundry, and for more complex tasks that it hasn’t learned yet, it uses a teleoperation system where a human “1X Expert” can pilot the robot remotely to perform the task. This is how it collects the data to learn to do these tasks autonomously in your specific home environment.The whole release is very interesting, from the “soft and quiet” approach 1X is taking, making their robot a 66lbs short king, draped in a knit sweater, to the $20K price point (effectively at loss given how much just the hands cost), the teleoperated by humans addition, to make sure the Robot learns about your unique house layout. The conversation on the show was fascinating. We talked about all the potential use cases, from having it water your plants and look after your pets while you’re on vacation to providing remote assistance for elderly relatives. Of course, there are real privacy concerns with having a telepresence device in your home, but 1X says these sessions are scheduled by you and have strict no-go zones.Here’s my prediction: by next Halloween, we’ll see videos of these NEO robots dressed up in costumes, helping out at parties. The future is officially here. Will you be getting one? If not this one, when will you think you’ll get one? OpenAI’s Grand Plan: From Recapitalization to ASIThis was by far the biggest update about the world of AI for me this week! Sam Altman was joined by Jakub Pachocki, chief scientist and Wojciech Zaremba, a co-founder, on a live stream to share an update about their corporate structure, plans for the future, and ASI goals (Artificial Superintelligence) First, the company now has a new structure: a non-profit OpenAI Foundation governs the for-profit OpenAI Group. The foundation starts with about 26% equity and has a mission to use AI for public good, including an initial $25 billion commitment to curing diseases and building an “AI Resilience” ecosystem.But the real bombshells were about their research timeline. Chief Scientist Jakub Pachocki stated that they believe deep learning systems are less than a decade away from superintelligence (ASI). He said that at this point, AGI isn’t even the right goal anymore. To get there, they’re planning to have an “AI research intern” by September 2026 and a fully autonomous AI researcher comparable to their human experts by March 2028. This is insane if you think about it. As Yam mentioned, OpenAI is already shipping at an insane speed, releasing Models and Products, Sora, Atlas, Pulse, ChatGPT app store, and this is with humans, assisted by AI. And here, they are talking about complete and fully autonomous researchers, that will be infinitely more scalable than humans, in the next 2 years. The outcomes of this are hard to imagine and are honestly mindblowing. To power all this innovation, Sam revealed they have over $1.4 trillion in obligations for compute (over 30 GW). And said even that’s not enough. Their aspiration is to build a “compute factory” capable of standing up one gigawatt of new compute per week, and he hinted they may need to “rethink their robotics strategy” to build the data centers fast enough. Does this mean OpenAI humanoid robots building factories? 🤔 Plus, don’t forget, Sam is one of the investors in Helion energy, working on power solutions like Fusion, and the above graphic has an Energy block that Sam said they will give an update on later (that’s also what he told me during Dev Day when I asked him about it). Super exciting and honestly mind-blowing stuff, Gigawats per week, fully autonomous researchers, the world is going to look way different in a few years! The Agent Labs Race: Cursor 2.0 vs. Cognition’s SWE-1.5 (X, Blog)This week also saw a major showdown in the agentic coding space. On the very same day, both Cursor and Cognition launched major updates and their own new models, signaling a new era where agent labs are training their own specialized AI.First up, Cursor 2.0 was released with a completely redesigned multi-agent interface and their new model, Composer. Composer is claimed to be four times faster than comparable models, and the new UI is built around managing a fleet of agents that can work in parallel on your codebase. It’s a clear shift from being just an IDE to a full-fledged agent platform. Look, the UI even looks like ChatGPT and no code in sight (until you switch to IDE mode) Their Composer model is also very interesting, and got a lot of folks excited, but the evaluations they shared, and the fact that they didn’t disclose if that’s a finetune of a chinese model (it likely is). Regardless, folks are saying that it’s a very good model that’s also VERY fast! Cognition own coding model - SWE 1.5 ( Blog, X, Windsurf )Then, just hours later, Cognition punched right back with SWE-1.5, their new frontier agent model that now powers Windsurf. The headline here is pure speed. Powered by Cerebras, SWE-1.5 hits a blistering 950 tokens per second—13 times faster than Sonnet 4.5—while achieving near-SOTA performance on SWE-Bench Pro. They’ve achieved this through a co-designed stack where the agent harness, inference system, and model were all built together and optimized with end-to-end reinforcement learning in real coding environments.This competition is fantastic news for all of us. We’re seeing specialized, highly-performant models being developed outside of the big labs, putting more power back in the hands of developers.This Week’s BuzzJust a few quick updates from the world of Weights & Biases and our parent company, CoreWeave.First, big news! CoreWeave announced the acquisition of Marimo, the company behind the popular open-source, reactive notebook for Python. This is another exciting step in building out the essential cloud for AI, adding powerful development tools to the stack alongside best-in-class GPU infrastructure and MLOps with Weights & Biases. Welcome to the Marimo team!Also, Fully Connected is coming to London next week! It’s our premier conference, and we’ll have speakers from Mistral, Google, LlamaIndex, and more. If you’re in Europe, please come join us. DM me if you need tickets!And if you’re in New York from November 19-22, come say hi at the AI Engineer Code Summit. We’re sponsoring and will have a big booth. It’s always a great place to meet folks from this community.Video & Voice: The Multimodal ExplosionThe world of video and voice AI was on fire this week.The absolute highlight was Odyssey ML V2, a new real-time interactive AI video platform. This thing is not like other video models that take minutes to generate a clip. With Odyssey, you type a prompt, and a video starts streaming instantly. Then, you can edit it live. We did a demo on the show where we prompted “army of robots in a starship corridor” and then typed “turn these robots into fluffy covered cat robots,” and the video changed in real time. It’s mind-blowing. This is a glimpse into the future of user-driven, playable media.On the more traditional video front, Sora is now invite-free in the US and Japan, and they launched Character Cameos. You can now upload photos of your pets or objects (like your kid’s carved pumpkin!) and turn them into characters that you and others can use in videos. I, of course, immediately made a cameo of my cat, Sonia.Voice and Audio - Cartesia launches Sonic 3, sub 50ms AI speech modelIn the world of voice, we had Arjun Desai from Cartesia join us to talk about Sonic-3, their new real-time TTS engine. Backed by a new $100M funding round, Sonic-3 is built on State Space Models (not Transformers) and can achieve insane speeds—we’re talking under 50ms latency. But it’s not just fast; it’s also incredibly expressive. It can laugh, emote, and speak 42 languages with natural code-switching. I used their Pro Voice cloning feature to create an AI version of myself, and the results were scarily good. We even had my AI clone host a segment of the show, see it yourself here, powered by Argil and Sonic 3, this is... AI AlexWrapping Up This Spooky Week 🎃As I sit here in my Halloween costume reflecting on this week, I can’t help but feel we’re at an inflection point. We have:* Open source models competing with the best proprietary ones* Humanoid robots becoming consumer products* ASI timelines measured in single-digit years* Real-time interactive AI across all modalitiesAnd yet, nothing about this scares me. If anything, I’m more excited than ever about what we’re building together. Yes, the pace is insane. Yes, keeping up with everything is becoming nearly impossible (and it’s literally my job!). But we’re living through the most transformative period in human history, and we get to be part of it.To everyone building, experimenting, and pushing boundaries - keep going. To everyone worried about what’s coming - join us in shaping it responsibly. And to everyone who celebrated Halloween today - I hope your costume was as epic as the AI developments we covered! 👻Until next week, this is Alex signing off. Remember to subscribe, give us five stars, and I’ll see you next ThursdAI!TL;DR - All Topics CoveredThursdAI - Oct 30 - Halloween Special 👻* Hosts and Guests* Alex Volkov - AI Evangelist & Weights & Biases (@altryne)* Co Hosts - @WolframRvnwlf @yampeleg @nisten @ldjconfirmed @ryancarson* Guest: Skyler Miao - Head of Engineering, MiniMax (@SkylerMiao7)* Guest: Arjun Desai - CoFounder Cartesia (@jundesai)* Open Source LLMs* MiniMax M2: Open-source agentic model at 8% of Claude’s price, 2× speed (X, Hugging Face)* OpenAI GPT-OSS-Safeguard: First open-weight safety reasoning models (X, HF)* IBM Granite 4.0 Nano: Ultra-efficient tiny models for edge deployment (X, Artificial Analysis)* Ming-flash-omni Preview: Sparse MoE omni-modal model (X, HuggingFace)* Kimi Linear: 48B parameter model with 1M context (HF)* Robotics* 1X NEO: First consumer humanoid robot, $20k, delivery 2026 (X, Order page, Keynote)* Big Companies & APIs* OpenAI Restructuring: ASI within 10 years, AI researcher by 2028 (X)* Cursor 2.0 & Composer: 4x faster coding, new model (X, Blog)* Cognition SWE-1.5: 950 tok/s, 40% SWE-bench Pro (Blog, X, Windsurf)* Perplexity Email Assistant: Privacy-first AI inbox management (X, Assistant Site)* This Week’s Buzz* Fully Connected London - fullyconnected.com* AI Engineer Code Summit NYC - Nov 19-22* CoreWeave acquires Marmo notebooks (X)* Vision & Video* Odyssey ML V2: Real-time interactive AI video (X, Experience)* Sora: Now invite-free + Character Cameos feature (X, Sonia Cameo)* Hailuo 2.3: Cinema-grade video generation (X)* Voice & Audio* MiniMax Speech 2.6: X, MiniMax, API Docs)* Cartesia Sonic 3: Real-time TTS with emotion & laughter, $100M funding (X, Website, Docs)* Tools* Pokee: Agentic workflow builder (X)* Pomelli: Google’s AI marketing agent (X, Labs) This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe
    --------  
    1:37:29
  • 📆 ThursdAI - Oct 23: The AI Browser Wars Begin, DeepSeek's OCR Mind-Trick & The Race to Real-Time Video
    Hey everyone, Alex here! Welcome... to the browser war II - the AI edition! This week we chatted in depth about ChatGPT’s new Atlas agentic browser, and the additional agentic powers Microsoft added to Edge with Copilot Mode (tho it didn’t work for me) Also this week was a kind of crazy OCR week, with more than 4 OCR models releasing, and the crown one is DeepSeek OCR, that turned the whole industry on it’s head (more later) Quite a few video updates as well, with real time lipsync from Decart, and a new update from LTX with 4k native video generation, it’s been a busy AI week for sure! Additionally, I’ve had the pleasure to talk about AI Browsing agents with Paul from BrowserBase and real time video with Kwindla Kramer from Pipecat/Daily, so make sure to tune in for those interviews, buckle up, let’s dive in! Thanks for reading ThursdAI - Recaps of the most high signal AI weekly spaces! This post is public so feel free to share it.Open Source: OCR is Not What You Think It Is (X, HF, Paper)The most important and frankly mind-bending release this week came from DeepSeek. They dropped DeepSeek-OCR, and let me tell you, this is NOT just another OCR model. The cohost were buzzing about this, and once I dug in, I understood why. This isn’t just about reading text from an image; it’s a revolutionary approach to context compression.We think that DeepSeek needed this as an internal tool, so we’re really grateful to them for open sourcing this, as they did something crazy here. They are essentially turning text into a visual representation, compressing it, and then using a tiny vision decoder to read it back with incredible accuracy. We’re talking about a compression ratio of up to 10x with 97% decoding accuracy. Even at 20x compression they are achieving 60% decoding accuracy! My head exploded live on the show when I read that. This is like the middle-out compression algorithm joke from Silicon Valley, but it’s real. As Yam pointed out, this suggests our current methods of text tokenization are far from optimal.With only 3B and ~570M active parameters, they are taking a direct stab at long context inefficiency, imagine taking 1M tokens, encoding them into 100K visual tokens, and then feeding those into a model. Since the model is tiny, it’s very cheap to run, for example, alphaXiv claimed they have OCRd’ all of the papers on ArXiv with this model for $1000, a task that would have cost $7500 using MistalOCR - as per their paper, with DeepSeek OCR, on a single H100 GPU, its possible to scan up to 200K pages! 🤯 Really innovative stuff! OCR and VLM models had quite a week, with multiple models besides DeepSeek OCR releasing, models like Liquids LFM2-VL-3B (X, HF), and the newly updated 2B and 32B of Qwen3-VL (X, Hugging Face), and AI2’s olmo-ocr 2-7B (X, HF). The Qwen models are particularly interesting, as the 2B model is a generic VLM (can also do OCR) and is close to previous weeks 4B and 8B brothers, and the newly updated 32B model outperforms GPT-5 mini and Claud 4 sonnet even! The Browser Wars are BACK: OpenAI & Microsoft Go AgenticLook, I may be aging myself here, but I remember, as a young frontend dev, having to install 5 browers at once to test them out, Chrome, Internet Explorer, Firefox, Opera etc’. That was then, and now, I have Dia, Comet, and the newly released Atlas, and, yeah, today I even installed Microsoft Edge to test their AI features! It seems like the AI boom brought with it a newly possible reason for folks to try and take a bite out of Chrome (who’s agentic features are long rumored with project mariner but are nowhere to be found/shipped yet) OpenAI’s ChatGPT Atlas: The Browser Reimagined (X, Download)OpenAI is proving that besides just models, they are a product powerhouse, stepping into categories like Shopping (with a shopify integration), app stores (with ChatGPT apps), social (with Sora2) and now... browsers! This week, they have launched their tightly integrated into ChatGPT browser called Atlas, and it’s a big release! I’ll split my review here to 2 parts, the browser features part and the agentic part. New fresh take on a chromium based browserThe tight integration into ChatGPT is everywhere in this browser, from the new tab that looks like the basic ChatGPT interaface, one line of text, to the sidebar on the left that... is the ChatGPT web sidebar with all your chats, projects, custom GPTs etc. The integration doesn’t stop there, as you have to sign in to your ChatGPT account to even use this browser (available only to MacOS users, and Pro, Plus and Nano tiers). The browser has a few neat tricks, like a special tool that allows you to search your browsing history with natural language, a-la “what were those shoes I was looking at a few days ago” will find your the tabs you browsed for shoes. A special and cool feature is called, confusingly “Cursor”, wherein you can select a text, and then click the little OpenAI logo that pops up, allowing you to ask ChatGPT for changes to that selected text (like fix typos, spruce up your writing etc). It’s surprisingly convenient to rewrite tweets or for any type of document editing. ChatGPT Atlas also stores memories about your browsing patterns, which will be additional to the ChatGPT memories it stores about you from chats, helping even more by knowing your browsing patterns, which software you prefer to use, which websites you prefer to order food from etc. This IMO is one of the hugest unlocks for folks inside the ChatGPT ecosystem, as much of a stanard persons peferences can be gleaned from their browser usage and patterns.Lastly, the “Ask ChatGPT” sidepane on the right (which can be opened with cmd+.) is really great for chatting with a webpage, or going down search rabbit holes. It receives the context of the webpage you’re looking at by default (only 1 page so far, competitors allow you to add additional tabs with @, (which is supposedly coming to ChatGPT soon) and ask... ChatGPT anything about this. Agentic SOTA? not so fastThe most important “change” to how browsers work in Atlas imo is the agentic mode. This isn’t new, we remember when ChatGPT launched thier Operator Agent back in January of this year (our coverage) and then renamed it Agent Mode and integrated into ChatGPT itself back in July. So, web browsing agents are not entirely new, what’s novel here though, is the integration into your browser, and the ability for the Atlas browser to use your logged in sessions and cookies, to pretend to be you! This... can be quite scary for some, as prompt injection attacks are getting more popular (where-in malicious a******s add hidden instructions to their website that will get the agent to do something you don’t like) but it’s also very exciting, as the agent can do much much more, without getting blocked by providers who could previously just block Agent Mode as it ran on OpenAI servers! Until today, there were 2 main Agentic browsers in the mix, Perplexity’s Comet (where you can choose which model runs the agent) and Atlas. Comet seems to be doing a little bit better on some stuff on my tests, but not by much. I have the same agentic task (go to X.com, find my bookmarks, open all links, summarize per my specific format) that I’ve been running for a while now, and Comet outdid Atlas this week on that task.Who needs agentic browsing? For some reason, most of the demos for agentic browsing are showing the same, boring-ish examples. Book some flights, collect a grocery shopping cart. I’ve tried new and different things this week, for example, letting Atlas choose and order food for me (as ChatGPT knows my pescatarian preferences, it’s better than Comet for personal stuff), and one of the longest task I’ve had an agent do yet, I asked it to complete a Compliance training I had to take at work! Mind you, this is a very complex task, even for regular people, as these compliance websites are built to not be messed with. They have video players that stop if you switch focus to some other tab, they have interactive quizes and games, drag and drop interfaces, audio buttons, to make sure you really are taking the test. I can happily report, that after 5 hours, and a few stops along the way (where I had to convince the agent to keep going), it completed this very hard task! (and now I have to take this course myself again to actualy be compliant 😅 it will probably take me 2 hours to do manually) This experiment made me think, who needs the agentic browsing features and for what? Well, for tasks that require a lot of manual steps to do the same thing over and over again, agentic browser is going to make a lot of peoples browsing a lot easier. Things like kids schedules reviewing in multiple websites, collecitng data and formatting it differently etc. Scary security implications Atlas could only finish my compliance task while being logged in as me, and ChatGPT Atlas gives a all or nothing control. You can run your agent with full access to your logged in websites (think Gmail etc) or you can essentially give it an incognito mode. This, again, due to the risk of promp injections in malicious websites being more and more prevalent. In a rare post detailing how they are thinking about this, OpenAI Chief Information Security officer offered a deep dive into their attempts to mitigate this issue (Simon Willison had a great breakdown of that information here) but that’s likely not enough, so definitely be aware when you’re running agent mode (which needs to be explicitly turned on right now by selecting Agent) This Weeks Buzz - Weights & Biases // CoreweaveWeights & Biases (now proudly part of CoreWeave) had some exciting updates. Our Fully Connected conference series is hitting Tokyo on October 30-31 and London on November 4-5—perfect for ML practitioners and AI engineers. If you’re in the area, join us for talks, networking, and deep dives into the latest. Register at Fullyconnected.com—DM me if you need a hook-up!We also collaborated with Meta and Stanford on Torch Forge, a new PyTorch-native library for scalable RL post-training and agent development. It’s built for massive GPU runs (we provided 520 H100s!), competing with Ray via tools like Monarch scheduler. If you’re training on clusters, check the blog —it’s a big deal for efficient multi-GPU workflows.Microsoft goes after OpenAI with Edge copilot mode (X)In a pretty surprising move, Microsoft announced today their take on the agentic browser war, with a bunch of enhancements to Copilot (their overall word for their AI assistance across Microsoft 360, Browser, Bing search etc), Think.. clippy, for the AI age (they even brought clippy back as an easter egg) The short version is, Edge is getting more powerful with custom agentic features (which I enabled and couldn’t get to work no matter how much I tried, so I can’t tell you how they compare to Atlas/Comet), and they have a voice mode that allows you to talk to your browser, with Edge having a sense of what’s on the actual page! Of course, this being Microsoft, marketing aside and features aside, when I asked Copilot if it has access to other tabs (like the marketing video claims) it said it doesn’t have access, agentic mode didn’t work, and I’m very unlikely to be testing it further! But hey, if you use Copilot app on your mobile phone, and click the new Mico avatar like 25 times it will turn into Clippy, so.. yay? Claude Code on the Web, Claude on Desktop upgraded (X, Anthropic)Anthropic also made waves by bringing Claude Code to the web. Now you can delegate software tasks to Claude through a web interface with GitHub integration. Nisten was particularly excited about being able to manage his coding projects from his phone. It runs tasks in a secure sandbox, can handle multiple repos, and automatically create pull requests. It’s another powerful coding agent becoming more accessible to developers everywhere. They have also made changes to the desktop Claude app, allowing it to see the context of your screen with screenshots, and file sharing, and even a new voice mode that allows you to talk to Claude (which is unfortunately mapped to the tab button, without the ability to remap) Browser Automation and Delegated Authentication with Browserbase (X, Director.ai, Stagehand)While OpenAI and Microsoft are building chat into the browser, what about bringing the browser into our chat-based agents? We had Paul Klein, the founder of Browserbase, join us to talk about this exact topic. His company is tackling one of the biggest hurdles for AI agents: authentication.Paul and his team launched Director 2.0, a platform that lets you build web automation with natural language prompts. But the real innovation here is their integration with 1Password. Instead of giving an agent the “master keys” to all your logged-in sessions like Atlas does, Browserbase allows for delegated, per-site authentication. When an agent running in the cloud needs to log into a site on your behalf, you get a prompt on your local machine to approve it. This is a much safer, more granular way to give agents the access they need. As Paul said, you shouldn’t let an AI the master keys into your house; you should give it permission to enter one room at a time. It’s a brilliant paradigm for secure agentic workflows and I really like this approach of a piece-meal authentication for browser agents. I wish Atlas has something like this for the incognito mode! Director 2.0 itself is like V0 for web automation—you give it a prompt, it performs the task, and then it gives you a repeatable script you can deploy. It’s a way to create robust automations without needing to be a developer, and it’s already being used to automate thousands of hours of manual work. Video & Audio: The Race to Real-TimeThe world of generative media is moving at lightning speed, with a clear trajectory towards real-time, interactive experiences.Decart’s Real-Time Lip Sync API (X)We had Kwindla Kramer, one of the worlds leading experts in real-time audio, join us to break down a phenomenal release from Decart AI: a real-time lip-sync API. This isn’t the pre-rendered, slightly-off lip-sync we’re used to. This is a pipeline of models working together to generate perfectly synchronized lip movements for an avatar in real-time.Kwindla explained the tech stack: it captures your audio via WebRTC, sends it to Whisper for transcription, gets a response from an LLM like Grok, generates a voice with ElevenLabs, and then Decart’s model modifies the avatar’s video frames to match the new audio, all with a sub-two-second latency. This is how we get to truly interactive, believable AI characters. Kwindla even built a quick demo, though it didn’t seem to work the in the morning, probably GPU issues, so we just played the demo videos. LTX-2 and Sora’s Pet CameosThe trend towards high-fidelity, real-time generation continued with a breaking news release from Lightricks: LTX-2. This is an open-source (weights coming this fall!) engine that can generate native 4K video with synchronized audio. It’s fast, efficient, and is set to be a powerful open alternative to closed models like Sora. And it’s a native 4K, no upscaling! Speaking of Sora, they announced that character cameos are getting an upgrade. Soon, you’ll be able to turn anything—your pet, a coffee cup, or even a sunny-side-up egg—into an animated, talking character. I’m really looking forward for this new Sora update and will let you know my impressions when it drops (soon, according to Bill from OpenAI) What a week folks! What A WEEK! 😅 My head is still spinning! From browsers that can do our work for us to OCR that redefines context, we’re seeing foundational shifts across the board. The tools are getting more powerful, more accessible, and more integrated into our daily workflows. The future is being built right now, and we get to watch it happen week by week.Thank you for being a ThursdAI subscriber. As always, here are the show notes with all the links and details from this week’s whirlwind of AI news.* Hosts and Guests* Alex Volkov - AI Evangelist & Weights & Biases (@altryne)* Co Hosts - @yampeleg @nisten @ldjconfirmed* Paul Kelin @pk_iv - Browser Base* Kwindla Kramer @kwindla - Pipecat & Daily* Open Source LLMs* DeepSeek-OCR: Efficient Vision-Text Compression for Massive Contexts (X, HF, Paper)* Liquid AI LFM2-VL-3B: Tiny Multilingual Vision-Language Model (X, HF)* PokeeResearch-7B: Open-source SOTA Deep Research Agent (X, HF, Web, ArXiv, GitHub)* Qwen3-VL 2B & 32B: compact STEM-tuned multimodal powerhouses (X, Hugging Face)* Big CO LLMs + APIs* OpenAI announces Atlas - its agentic AI browser (X, Download)* Security Implications, Injection + note from CISO* Claude Code on the Web: Cloud Coding with Secure Sandboxing (X, Anthropic)* Meta bans 1‑800‑ChatGPT on WhatsApp* Microsoft agentic addition to Copilot Mode in Edge (X)* Gemini AI Studio launches “Vibe Coding” (X, AI Studio Build)* This weeks Buzz* Fully connected comes to Tokyo (Oct 30-31) and London (Nov 4-5) ! (register at Fullyconnected.com)* Vision & Video* Sora is about to get pet cameos* Krea open‑sources a 14‑billion‑parameter real‑time video model (X, HF)* Reve’s unannounced video mode!? 1080p + sound* LTX-2: open-source 4K audio+video generation engine from Lightricks (X, Website, GitHub)* Voice & Audio* Decart Lip Sync API: Real-Time Avatar Lip Movement (X)* Tools* Browserbase launches Director 2.0: prompt-powered web automation (X, Director.ai, Stagehand) This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe
    --------  
    1:35:16
  • 📆 ThursdAI - Oct 16 - VEO3.1, Haiku 4.5, ChatGPT adult mode, Claude Skills, NVIDIA DGX spark, Wordlabs RTFM & more AI news
    Hey folks, Alex here. Can you believe it’s already the middle of October? This week’s show was a special one, not just because of the mind-blowing news, but because we set a new ThursdAI record with four incredible interviews back-to-back!We had Jessica Gallegos from Google DeepMind walking us through the cinematic new features in VEO 3.1. Then we dove deep into the world of Reinforcement Learning with my new colleague Kyle Corbitt from OpenPipe. We got the scoop on Amp’s wild new ad-supported free tier from CEO Quinn Slack. And just as we were wrapping up, Swyx ( from Latent.Space , now with Cognition!) jumped on to break the news about their blazingly fast SWE-grep models. But the biggest story? An AI model from Google and Yale made a novel scientific discovery about cancer cells that was then validated in a lab. This is it, folks. This is the “let’s f*****g go” moment we’ve been waiting for. So buckle up, because this week was an absolute monster. Let’s dive in!ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.Open Source: An AI Model Just Made a Real-World Cancer DiscoveryWe always start with open source, but this week felt different. This week, open source AI stepped out of the benchmarks and into the biology lab.Our friends at Qwen kicked things off with new 3B and 8B parameter versions of their Qwen3-VL vision model. It’s always great to see powerful models shrink down to sizes that can run on-device. What’s wild is that these small models are outperforming last generation’s giants, like the 72B Qwen2.5-VL, on a whole suite of benchmarks. The 8B model scores a 33.9 on OS World, which is incredible for an on-device agent that can actually see and click things on your screen. For comparison, that’s getting close to what we saw from Sonnet 3.7 just a few months ago. The pace is just relentless.But then, Google dropped a bombshell. A 27-billion parameter Gemma-based model they developed with Yale, called C2S-Scale, generated a completely novel hypothesis about how cancer cells behave. This wasn’t a summary of existing research; it was a new idea, something no human scientist had documented before. And here’s the kicker: researchers then took that hypothesis into a wet lab, tested it on living cells, and proved it was true.This is a monumental deal. For years, AI skeptics like Gary Marcus have said that LLMs are just stochastic parrots, that they can’t create genuinely new knowledge. This feels like the first, powerful counter-argument. Friend of the pod, Dr. Derya Unutmaz, has been on the show before saying AI is going to solve cancer, and this is the first real sign that he might be right. The researchers noted this was an “emergent capability of scale,” proving once again that as these models get bigger and are trained on more complex data—in this case, turning single-cell RNA sequences into “sentences” for the model to learn from—they unlock completely new abilities. This is AI as a true scientific collaborator. Absolutely incredible.Big Companies & APIsThe big companies weren’t sleeping this week, either. The agentic AI race is heating up, and we’re seeing huge updates across the board.Claude Haiku 4.5: Fast, Cheap Model Rivals Sonnet 4 Accuracy (X, Official blog, X)First up, Anthropic released Claude Haiku 4.5, and it is a beast. It’s a fast, cheap model that’s punching way above its weight. On the SWE-bench verified benchmark for coding, it hit 73.3%, putting it right up there with giants like GPT-5 Codex, but at a fraction of the cost and twice the speed of previous Claude models. Nisten has already been putting it through its paces and loves it for agentic workflows because it just follows instructions without getting opinionated. It seems like Anthropic has specifically tuned this one to be a workhorse for agents, and it absolutely delivers. The thing to note also is the very impressive jump in OSWorld (50.7%), which is a computer use benchmark, and at this price and speed ($1/$5 MTok input/output) is going to make computer agents much more streamlined and speedy! ChatGPT will loose restrictions; age-gating enables “adult mode” with new personality features coming (X) Sam Altman set X on fire with a thread announcing that ChatGPT will start loosening its restrictions. They’re planning to roll out an “adult mode” in December for age-verified users, potentially allowing for things like erotica. More importantly, they’re bringing back more customizable personalities, trying to recapture some of the magic of GPT-4.0 that so many people missed. It feels like they’re finally ready to treat adults like adults, letting us opt-in to R-rated conversations while keeping strong guardrails for minors. This is a welcome change, and we’ve been advocating for this for a while, and it’s a notable change from the XAI approach I covered last week. Opt in for adults with verification while taking precautions vs engagement bait in the form of a flirty animated waifu with engagement mechanics. Microsoft is making every windows 11 an AI PC with copilot voice input and agentic powers (Blog,X)And in breaking news from this morning, Microsoft announced that every Windows 11 machine is becoming an AI PC. They’re building a new Copilot agent directly into the OS that can take over and complete tasks for you. The really clever part? It runs in a secure, sandboxed desktop environment that you can watch and interact with. This solves a huge problem with agents that take over your mouse and keyboard, locking you out of your own computer. Now, you can give the agent a task and let it run in the background while you keep working. This is going to put agentic AI in front of hundreds of millions of users, and it’s a massive step towards making AI a true collaborator at the OS level.NVIDIA DGX - the tiny personal supercomputer at $4K (X, LMSYS Blog)NVIDIA finally delivered their promised AI Supercomputer, and while the excitement was in the air with Jensen hand delivering the DGX Spark to OpenAI and Elon (recreating that historical picture when Jensen hand delivered a signed DGX workstation while Elon was still affiliated with OpenAI). The workstation was sold out almost immediately. Folks from LMSys did a great deep dive into specs, all the while, folks on our feeds are saying that if you want to get the maximum possible open source LLMs inference speed, this machine is probably overpriced, compared to what you can get with an M3 Ultra Macbook with 128GB of RAM or the RTX 5090 GPU which can get you similar if not better speeds at significantly lower price points. Anthropic’s “Claude Skills”: Your AI Agent Finally Gets a Playbook (Blog)Just when we thought the week couldn’t get any more packed, Anthropic dropped “Claude Skills,” a huge upgrade that lets you give your agent custom instructions and workflows. Think of them as expertise folders you can create for specific tasks. For example, you can teach Claude your personal coding style, how to format reports for your company, or even give it a script to follow for complex data analysis.The best part is that Claude automatically detects which “Skill” is needed for a given task, so you don’t have to manually load them. This is a massive step towards making agents more reliable and personalized, moving beyond just a single custom instruction and into a library of repeatable, expert processes. It’s available now for all paid users, and it’s a feature I’ve been waiting for. Our friend Simon Willison things skills may be a bigger deal than MCPs! 🎬 Vision & Video: Veo 3.1, Sora Gets Longer, and Real-Time WorldsThe AI video space is exploding. We started with an amazing interview with Jessica Gallegos, a Senior Product Manager at Google DeepMind, all about the new Veo 3.1. This is a significant 0.1 update, not a whole new model, but the new features are game-changers for creators.The audio quality is way better, and they’ve massively improved video extensions. The model now conditions on the last second of a clip—including the audio. This means if you extend a video of someone talking, they keep talking in the same voice! This is huge, saving creators from complex lip-syncing and dubbing workflows. They also added object insertion and removal, which works on both generated and real-world video. Jessica shared an incredible story about working with director Darren Aronofsky to insert a virtual baby into a live-action film shot, something that’s ethically and practically very difficult to do on a real set. These are professional-grade tools that are becoming accessible to everyone. Definitely worth listening to the whole interview with Jessica, starting at 00:25:44I’ve played with the new VEO in Google Flow, and while I was somewhat (still) disappointed with the UI itself (it froze sometimes and didn’t play). I wasn’t able to upload my own videos to use the insert/remove features Jessica mentioned yet, but saw examples online and they looked great! Ingredients were also improved with VEO 3.1, where you can add up to 3 references, and they will be included in your video but not as first frame, the model will use them to condition the vidoe generation. Jessica clarified that if you upload sound, as in, your voice, it won’t show up in the model as your voice yet, but maybe they will add this in the future (at least this was my feedback to her). SORA 2 extends video gen to 15s for all, 25 seconds to pro users with a new storyboard Not to be outdone, OpenAI pushed a bit of an update for Sora. All users can now generate up to 15-second clips (up from 8-10), and Pro users can go up to 25 seconds using a new storyboard feature. I’ve been playing with it, and while the new scene-based workflow is powerful, I’ve noticed the quality can start to degrade significantly in the final seconds of a longer generation (posted my experiments here) as you can see. The last few shot so the cowboy don’t have any action, and the face is a blurry mess. Worldlabs RTFM: Real-Time Frame Model renders 3D worlds at interactive speeds on a single H100 ( X, Blog, Demo )And just when we thought we’d seen it all, World Labs dropped a breaking news release: RTFM, the Real-Time Frame Model. This is a generative world model that renders interactive, 3D-consistent worlds on the fly, all on a single H100 GPU. Instead of pre-generating a 3D environment, it’s a “learned renderer” that streams pixels as you move. We played with the demo live on the show, and it’s mind-blowing. The object permanence is impressive; you can turn 360 degrees and the scene stays perfectly coherent. It feels like walking around inside a simulation being generated just for you.This Week’s Buzz: RL Made Easy with Serverless RL + interview with Kyle CorbittIt was a huge week for us at Weights & Biases and CoreWeave. I was thrilled to finally have my new colleague Kyle Corbitt, founder of OpenPipe, back on the show to talk all things Reinforcement Learning (RL).RL is the technique behind the massive performance gains we’re seeing in models for tasks like coding and math. At a high level, it lets a model try things, and then you “reward” it for good outcomes and penalize it for bad ones, allowing it to learn strategies that are better than what was in its original training data. The problem is, it’s incredibly complex and expensive to set up the infrastructure. You have to juggle an inference stack for generating the “rollouts” and a separate training stack for updating the model weights.This is the problem Kyle and his team have solved with Serverless RL, which we just launched and we covered last week. It’s a new offering that lets you run RL jobs without managing any servers or GPUs. The whole thing is powered by the CoreWeave stack, with tracing and evaluation beautifully visualized in Weave.We also launched a new model from the OpenPipe team on our inference service: a fine-tune-friendly “instruct” version of Qwen3 14B. The team is not just building amazing products, they’re also contributing great open-source models. It’s awesome to be working with them.🛠️ Tools & Agents: Free Agents & Lightning-Fast Code SearchThe agentic coding space saw two massive announcements this week, and we had the representatives of both companies on the show to discuss them!First, Quinn Slack from Amp announced that they’re launching a completely free, ad-supported tier. I’ll be honest, my first reaction was, “Ads? In my coding agent? Eww.” But the more I thought about it, the more it made sense. My AI subscriptions are stacking up, and this model makes powerful agentic coding accessible to students and developers who can’t afford another $20/month. The ads are contextual to your codebase (think Baseten or Axiom), and they’re powered by a rotating mix of models using surplus capacity from providers. It’s a bold and fascinating business model.This move was met with generally positive responses, though folks from a competing agent, claim that Amp is serving Grok-4-fast which XAI is giving out for free anyway? We’ll see how this shakes up. Cognition announces SWE-grep: RL-powered multi-turn context retriever for agentic code search (Blog, X, Playground, Windsurf)Then, just as we were about to sign off, friend of the pod Swyx (now from Cognition) dropped in with breaking news about SWE-grep. It’s a new, RL-tuned sub-agent for their Windsurf editor that makes code retrieval and context gathering ridiculously fast. We’re talking over 2,800 tokens per second. (yes, they are using Cerebras under the hood) The key insight from Swyx is that their model was trained for natively parallel tool calling, running up to eight searches on a codebase simultaneously. This speeds up the “read” phase of an agent’s workflow—which is 60-70% of the work—by 3-5x. It’s all about keeping the developer in a state of flow, and this is a huge leap forward in making agent interactions feel instantaneous. Swyx also dropped a hint that the next thing that comes is CodeMaps and they will make these retrievers look trivial! This was one for the books, folks. An AI making a novel cancer discovery, video models taking huge leaps, and the agentic coding space is on fire. The pace of innovation is just breathtaking. Thank you for being a ThursdAI subscriber, and as always, here’s the TL:DR and show notes for everything that happened in AI this week.TL;DR and Show Notes* Hosts and Guests* Alex Volkov - AI Evangelist & Weights & Biases (@altryne)* Co Hosts - @WolframRvnwlf @yampeleg @nisten @ldjconfirmed* Jessica Gallegos, Sr. Product Manager, Google DeepMind* Kyle Corbitt (@corbtt) - OpenPipe//W&B* Quinn Slack (@sqs) - Amp* Swyx (@swyx) - Cognition* Open Source LLMs* KAIST KROMo - bilingual Korean/English 10B (HF, Paper)* Qwen3-VL 3B and 8B (X post, HF)* Google’s C2S-Scale 27B: AI Model Validates Cancer Hypothesis in Living Cells (X, Blog, Paper)* Big CO LLMs + APIs* Claude Haiku 4.5: Fast, Cheap Model Rivals Sonnet 4 Accuracy (X, Official blog)* ChatGPT will loose restrictions; age-gating enables “adult mode” with new personality features coming (X)* OpenAI updates memory management - no more “memory full” (X, FAQ)* Microsoft is making every windows 11 an AI PC with copilot voice input (X)* Claude Skills: Custom instructions for AI agents now live (X, Anthropic News, YouTube Demo)* Hardware* NVIDIA DGX Spark: desktop personal supercomputer for AI prototyping and local inference (LMSYS Blog)* Apple announces M5 chip with double AI performance (Apple Newsroom)* OpenAI and Broadcom set to deploy 10 gigawatts of custom AI accelerators (Official announcement)* This weeks Buzz* New model - OpenPipe Qwen3 14B instruct (link)* Interview with Kyle Corbitt - RL, Serverless RL* W&B Fully Connected London & Tokyo in 20 days - SIGN UP* Vision & Video* Veo 3.1: Google’s Next-Gen Video Model Launches with Cinematic Audio (Developers Blog)* Sora up to 15s and pro now up to 25s generation with a new storyboard feature* Baidu’s MuseStreamer has >20 second generations (X)* AI Art & Diffusion & 3D* Worldlabs RTFM: Real-Time Frame Model renders 3D worlds at interactive speeds on a single H100 (Blog, Demo)* DiT360: SOTA Panoramic Image Generation with Hybrid Training (Project page, GitHub)* Riverflow 1 tops the image‑editing leaderboard (Sourceful blog)* Tools* Amp launches a Free tier - powered by ads and surplus model capacity (Website)* Cognition SWE-grep: RL-powered multi-turn context retriever for agentic code search (Blog, Playground) This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe
    --------  
    1:34:38
  • 📆 Oct 9, 2025 — Dev Day’s Agent Era, Samsung’s 7M TRM Shock, Ling‑1T at 1T, Grok Video goes NSFW, and Serverless RL arrives
    Hey everyone, Alex here 👋We’re deep in the post-reality era now. Between Sora2, the latest waves of video models, and “is-that-person-real” cameos, it’s getting genuinely hard to trust what we see. Case in point: I recorded a short clip with (the real) Sam Altman this week and a bunch of friends thought I faked it with Sora-style tooling. Someone even added a fake Sora watermark just to mess with people. Welcome to 2025.This week’s episode and this write-up focus on a few big arcs we’re all living through at once: OpenAI’s Dev Day and the beginning of the agent-app platform inside ChatGPT, a bizarre and exciting split-screen in model scaling where a 7M recursive model from Samsung is suddenly competitive on reasoning puzzles while inclusionAI is shipping a trillion-parameter mixture-of-reasoners, and Grok’s image-to-video now does audio and pushes the line on… taste. We also dove into practical evals for coding agents with Eric Provencher from Repo Prompt, and I’ve got big news from my day job world: W&B + CoreWeave launched Serverless RL, so training and deploying RL agents at scale is now one API call away.Let’s get into it.OpenAI’s 3rd Dev Day - Live Coverage + exclusive interviewsThis is the third Dev Day that I got to attend in person, covering this for ThursdAI (2023, 2024), and this one was the best by far! The production quality of their events rises every year, and this year they’ve opened up the conference to >1500 people, had 3 main launches and a lot of ways to interact with the OpenAI folks! I’ve also gotten an exclusive chance to sit in on a fireside chat with Sam Altman and Greg Brokman (snippets of which I’ve included in the podcast, starting 01:15:00 and I got to ask Sam a few questions after that as well. Event Ambiance and VibesOpenAI folks outdid themselves with this event, the live demos were quite incredible, the location (Fort Mason), Food and just the whole thing was on point. The event concluded with a 1x1 Sam and Jony Ive chat that I hope will be published on YT sometime, because it was very insightful. By far the best reason to go to this event in person is meeting folks and networking, both OpenAI employees, and AI Engineers who use their products. It’s 1 day a year, when OpenAI makes all their employees who attend, Developer Experience folks, as you can and are encouraged to, interact with them, ask questions, give feedback and it’s honestly great! I really enjoy meeting folks at this event and consider this to be a very high signal network, and was honored to have quite a few ThursdAI listeners among the participants and OpenAI folk! If you’re reading this, thank you for your patronage 🫡 Launches and ShipsOpenAI also shipped, and shipped a LOT! Sam was up on Keynote with 3 main pillars, which we’ll break down 1 by 1. ChatGPT Apps, AgentKit (+ agent builder) and Codex/New APIsCodex & New APIsCodex has gotten General Availability, but we’ve been using it all this time so we don’t really care, what we do care about is the new slack integration and the new Codex SDK, which means you can now directly inject Codex agency into your app. This flew a bit over people’s heads, but Romain Huet, VP of DevEx at OpenAI demoed on stage how his mobile app now has a Codex tab, where he can ask Codex to make changes to the app at runtime! It was quite crazy! ChatGPT Apps + AppsSDKThis was maybe the most visual and most surprising release, since they’ve tried to be an appstore before (plugins, customGPTs). But this time, it seems like, based on top of MCP, ChatGPT is going to become a full blown Appstore for 800+ million weekly active ChatGPT users as well. Some of the examples they have showed included Spotify and Zillow, where just by typing in “Spotify” in chatGPT, you will have an interactive app with it’s own UI, right inside of ChatGPT. So you could ask it to create a playlist for you based on your history, or ask Zillow to find homes in an area under a certain $$ amount.The most impressive thing, is that those are only launch partners, everyone can (technically) build a ChatGPT app with their AppsSDK that’s built on top of... the MCP (model context protocol) spec! The main question remains about discoverability, this is where Plugins and CustomGPTs (previous attempts to create apps within ChatGPT have failed), and when I asked him about it, Sam basically said “we’ll iterate and get it right” (starting 01:17:00). So it remains to be seen if folks really need their ChatGPT as yet another Appstore. AgentKit, AgentBuilder and ChatKit2025 is the year of agents, and besides launching quite a few of their own, OpenAI will not let you, build and host smart agents that can use tools, on their platform. Supposedly, with AgentBuilder, building agents is just dragging a few nodes around, prompting and connecting them. They had a great demo on stage where with less than 8 minutes, they’ve build an agent to interact with the DevDay website.It’s also great to see how greatly does OpenAI adapt the MCP spec, as this too, is powered by MCP, as in, any external connection you want to give your agent, must happen with an MCP server. Agents for the masses is maybe not quite there yetIn reality though, things are not so easy. Agents require more than just a nice drag & drop interface, they require knowledge, iteration, constant evaluation (which they’ve also added, kudos!) and eventually, customized agents need code. I spent an hour trying it out yesterday, building an agent to search the ThursdAI archives. The experience was a mixed bag. The AI-native features are incredibly cool. For instance, you can just describe the JSON schema you want as an output, and it generates it for you. The widget builder is also impressive, allowing you to create custom UI components for your agent’s responses.However, I also ran into the harsh realities of agent building. My agent’s web browsing tool failed because Substack seems to be blocking OpenAI’s crawlers, forcing me to fall back on the old-school RAG approach of uploading our entire archive to a vector store. And while the built-in evaluation and tracing tools are a great idea, they were buggy and failed to help me debug the error. It’s a powerful tool, but it also highlights that building a reliable agent is an iterative, often frustrating process that a nice UI alone can’t solve. It’s not just about the infrastructure; it’s about wrestling with a stochastic machine until it behaves.But to get started with something simple, they have definitely pushed the envelope on what is possible without coding. OpenAI also dropped a few key API updates:* GPT-5-Pro is now available via API. It’s incredibly powerful but also incredibly expensive. As Eric mentioned, you’re not going to be running agentic loops with it, but it’s perfect for a high-stakes initial planning step where you need an “expert opinion.”* SORA 2 is also in the API, allowing developers to integrate their state-of-the-art video generation model into their own apps. The API can access the 15-second “Pro” model but doesn’t support the “Cameo” feature for now.* Realtime-mini is a game-changer for voice AI. It’s a new, ultra-fast speech-to-speech model that’s 80% cheaper than the original Realtime API. This massive price drop removes one of the biggest barriers to building truly conversational, low-latency voice agents.My Chat with Sam & Greg - On Power, Responsibility, and EnergyAfter the announcements, I’ve got to sit in a fireside chat with Sam Altman and Greg Brockman and ask some questions. Here’s what stood out:When I asked about the energy requirements for their massive compute plans (remember the $500B Stargate deal?), Sam said they’d have announcements about Helion (his fusion investment) soon but weren’t ready to talk about it. Then someone from Semi Analysis told me most power will come from... generator trucks. 15-megawatt generator trucks that just drive up to data centers. We’re literally going to power AGI with diesel trucks!On responsibility, when I brought up the glazing incident and asked how they deal with being in the lives of 800+ million people weekly, Sam’s response was sobering: “This is not the excitement of ‘oh we’re building something important.’ This is just the stress of the responsibility... The fact that 10% of the world is talking to kind of one brain is a strange thing and there’s a lot of responsibility.”Greg added something profound: “AI is far more surprising than I anticipated... The deep nuance of how these problems contact reality is something that I think no one had anticipated.”This Week’s Buzz: RL X-mas came early with Serverless RL! (X, Blog)Big news from our side of the world! About a month ago, the incredible OpenPipe team joined us at Weights & Biases and CoreWeave. They are absolute wizards when it comes to fine-tuning and Reinforcement Learning (RL), and they wasted no time combining their expertise with CoreWeave’s massive infrastructure.This week, they launched Serverless RL, a managed reinforcement learning service that completely abstracts away the infrastructure nightmare that usually comes with RL. It automatically scales your training and inference compute, integrates with W&B Inference for instant deployment, and simplifies the creation of reward functions and verifiers. RL is what turns a good model into a great model for a specific task, often with surprisingly little data. This new service massively lowers the barrier to entry, and I’m so excited to see what people build with it. We’ll be doing a deeper dive on this soon but please check out the Colab Notebook to get a taste of what AutoRL is like! Open SourceWhile OpenAI was holding its big event, the open-source community was busy dropping bombshells of its own.Samsung’s TRM: Is This 7M Parameter Model... Magic? (X, Blog, arXiv)This was the release that had everyone’s jaws on the floor. A single researcher from the Samsung AI Lab in Montreal released a paper on a Tiny Recursive Model (TRM). Get this: it’s a 7 MILLION parameter model that is outperforming giants like DeepSeek-R1 and Gemini 2.5 Pro on complex reasoning benchmarks like ARC-AGI. I had to read that twice. 7 million, not billion.How is this possible? Instead of relying on brute-force scale, TRM uses a recursive process. It generates a first draft of an answer, then repeatedly critiques and refines its own logic in a hidden “scratchpad” up to 16 times. As Yam pointed out, the paper is incredibly insightful, and it’s a groundbreaking piece of work from a single author, which is almost unheard of these days. Eric made a great point that because it’s so small, it opens the door for hobbyists and solo researchers to experiment with cutting-edge architectures on their home GPUs. This feels like a completely new direction for AI, and it’s incredibly exciting.inclusionAI’s Ling-1T: Enter the Trillion Parameter Club (X, HF, Try it)On the complete opposite end of the scale (about 3 OOM away), we have Ling-1Tfrom inclusionAI. This is a 1 TRILLION parameter Mixture-of-Experts (MoE) model. The key here is efficiency; while it has a trillion total parameters, it only uses about 37 billion active parameters per token.The benchmarks are wild, showing it beating models like GPT-5-Main (in non-thinking mode) and Gemini 2.5 on a range of reasoning tasks. They claim to match Gemini’s performance using about half the compute. Of course, with any new model posting huge scores, there’s always the question of whether it was trained on the public test sets, but the results are undeniably impressive. It’s another example of the push towards maintaining top-tier performance while drastically reducing the computational cost of inference.More Open Source Goodness: Microsoft, AI21, and IBMIt didn’t stop there.* Microsoft released UserLM-8B, a fascinating Llama 3 finetune trained not to be an assistant, but to simulate the user in a conversation. As Yam explained from his own experience, this is a super useful technique for generating high-quality, multi-turn synthetic data to train more robust chatbot agents. (X, HF)* Our friends at AI21 Labs are back with Jamba Reasoning 3B, a tiny but mighty 3-billion-parameter model. It uses a hybrid SSM-Transformer architecture, which makes it incredibly fast for its size, making it a great option for local inference on a laptop.* IBM also released their Granite family of models, which also use a hybrid design. Their big focus is on enterprise-grade governance and trust, and it’s the first open model family to get an ISO certification for AI management systems.Big Company Moves: Grok Imagine Levels Up... And Leans InFinally, let’s talk about the latest update to Grok Imagine. They’ve rolled out video generation with synchronized sound, and it’s fast—often faster than Sora. The quality has significantly improved, and it’s a powerful tool.However, I have to talk about the other side of this. Grok is positioning itself as the “uncensored” alternative, and they are leaning into that hard. Their video generator has a “spicy” mode that explicitly generates 18+ content. But the thing that truly disturbed me was a new feature with their animated character, “Annie.” It’s a gamified engagement mechanic where you “make your connection better” by talking to her every day to unlock special rewards, like new outfits.To be blunt, this is disgusting. We talk a lot on this show about the immense responsibility that comes with building these powerful AIs. I know from my conversations with folks at OpenAI and other labs that they think deeply about safety, preventing misuse, and the psychological impact these systems can have. This feature from Grok is the polar opposite. It leans into the worst fears about AI creating addictive, para-social relationships. It’s exploitative, and frankly, the team behind it should reconsider their choices IMO. All righty, this is mostly the new for this week, it’s been a very busy week, and if you’d like to see our live coverage + DevDay keynote + interviews I’ve had with Simon Willison , Greg Kamradt, Jeffrey Huber, Allesio from Latent.Space, Matthew Berman and more impactful folks, our livestream can be found here: I’m incredibly humbled and privileged to keep being invited to the Dev Day, and looking forward to cover more events, with exclusive interviews, on the ground reporting and insights. Please subscribe if you like this content to continue. TL;DR and Show Notes* Show Notes & Guests* Alex Volkov - AI Evangelist & Weights & Biases (@altryne)* Co-Hosts - @WolframRvnwlf, @yampeleg, @nisten, @ldjconfirmed* Guest: Kyle Corbitt - OpenPipe / CoreWeave (@corbtt)* Guest: Eric Provencher - Repo Prompt (@pvncher)* OpenAI Dev Day* OpenAI AgentKit All-in-One Agent Builder (X, OpenAI)* ChatGPT Apps & New APIs (GPT-5-pro, SORA, realtime-mini)* Open Source LLMs* Microsoft UserLM-8B Model Released (X, HF)* Samsung Tiny Recursive Model (TRM) (X, Blog, arXiv)* AI21 Labs releases Jamba Reasoning 3B (X, HF)* inclusionAI debuts Ling-1T: Trillion-Scale Efficient Reasoner (X, HF, Try it)* IBM Granite Models* Evals* Repo Bench by Repo Prompt (Web)* Big CO LLMs + APIs* Qwen 3 Omni & Realtime Models* Google DeepMind unveils Gemini 2.5 Computer-Use model (X, Blog)* Google Gemini Flash 2.5 (new)* Grok Imagine updated with video and sound* This weeks Buzz* OpenPipe (part of Coreweave,W&B) launch Serverless RL (X, Blog, Notebook)* Vision & Video* Ovi: Open Source Video & Synchronized Audio Generation (X, HF)* Voice & Audio* GPT-realtime-mini: OpenAI’s ultra-fast speech-to-speech model API (OpenAI Blog, TechCrunch)* AI Art & Diffusion & 3D* Bagel.com: Paris – Decentralized Diffusion Model (X, HF, Blogpost) This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe
    --------  
    1:41:29

More News podcasts

About ThursdAI - The top AI news from the past week

Every ThursdAI, Alex Volkov hosts a panel of experts, ai engineers, data scientists and prompt spellcasters on twitter spaces, as we discuss everything major and important that happened in the world of AI for the past week. Topics include LLMs, Open source, New capabilities, OpenAI, competitors in AI space, new LLM models, AI art and diffusion aspects and much more. sub.thursdai.news
Podcast website

Listen to ThursdAI - The top AI news from the past week, The Daily and many other podcasts from around the world with the radio.net app

Get the free radio.net app

  • Stations and podcasts to bookmark
  • Stream via Wi-Fi or Bluetooth
  • Supports Carplay & Android Auto
  • Many other app features
Social
v7.23.11 | © 2007-2025 radio.de GmbH
Generated: 11/9/2025 - 11:04:37 AM