
ThursdAI - The top AI news from the past week

From Weights & Biases: join AI Evangelist Alex Volkov and a panel of experts to cover everything important that happened in the world of AI in the past week.

Available Episodes (5 of 120)
  • 📆 ThursdAI - Sep 11 - SeeDream 4, Lucy 14B, ChatGPT gets MCP, OpenAI $300B deal with Oracle, Qwen Next A3B & more AI news
Hey everyone, Alex here, thanks for being a subscriber! Let's get you caught up on this week's most important AI news! The main thing you need to know this week is likely the incredible image model that ByteDance released, which overshoots the (incredible) nano 🍌 image model from the last 2 weeks. ByteDance really outdid themselves on this one! But also: a video model with super fast generation, an OpenAI rumor that made Larry Ellison the richest man alive, ChatGPT getting MCP powers (under a flag you can enable) and much more! This week we covered a lot of visual stuff, so while the podcast format is good enough, it's really worth tuning in to the video recording to really enjoy the full show.

ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.

AI Art and Diffusion

It's rare for me to start the newsletter not on Open Source AI news, but hey, at least this way you know that I'm writing it and not some AI, right? 😉

ByteDance Seedream 4 - 4K SOTA image generation and editing model with up to 6 reference images (Fal, Replicate)

The level of detail on ByteDance's new model really made all the hosts on ThursdAI stop and go... huh? Is this AI? ByteDance really outdid themselves with this image model: it not only generates images, it's also a fully functional natural-language image editing model. It's a diffusion transformer, able to generate 2K and 4K images, fast (under 5 seconds?), while accepting up to 6 reference images for the generation. This is going to be incredible for all kinds of purposes: AI art, marketing, etc. The prompt adherence is quite incredible, and text is also crisp and sharp at those 2K/4K resolutions. We created this image live on the show with it (using a prompt extended by another model).

I then provided my black and white headshot and the above image and asked it to replace me with a cartoon character, and it did, super quick, and even got my bomber jacket and the W&B logo on it in there! Notably, nothing else was changed in the image, showing just how incredible this one is for image editing. If you want enhanced realism, our friend FoFr from Replicate reminded us that using IMG_3984.CR2 in the prompt will make the model show images that are closer to reality, even if they depict some incredibly unrealistic things, like a pack of lions forming his nickname.

Additional uses for this model are just getting discovered, and one user already noted that given this model outputs 4K resolution, it can be used as a creative upscaler for other models' outputs. Just shove your photo from another AI into Seedream and ask for an upscale. Just beware that creative upscalers change some amount of detail in the generated picture.

Decart AI Lucy 14B Redefines Video Generation Speeds!

If Seedream blew my mind with images, Decart's Lucy 14B absolutely shattered my expectations for video generation speed. We're talking about generating 5-second videos from images in 6.5 seconds. That's almost faster than watching the video itself! This video model is not open source yet (despite them adding 14B to the name), but its smaller 5B brother was open sourced. The speed-to-quality ratio is really insane here, and while Lucy will not generate or animate text or faces that well, it does produce some decent imagery, but SUPER fast. This is really great for iteration, as AI video is like a roulette machine: you have to generate a lot of tries to see a good result. This, paired with Seedream (which is also really fast), is a game changer in the AI art world! So stoked to see what folks will be creating with these!
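If you'd rather script Seedream than poke at a playground, both Fal and Replicate expose it over simple APIs. A minimal sketch with the fal_client Python package; the model slug and payload keys here are my assumptions based on Fal's usual conventions, so check the Seedream 4 page on fal.ai for the exact ones:

```python
# pip install fal-client, and set FAL_KEY in your environment.
import fal_client

# Assumed slug and argument names -- verify against the model page on fal.ai.
result = fal_client.subscribe(
    "fal-ai/bytedance/seedream/v4/text-to-image",
    arguments={
        "prompt": "macro photo of a bee on a sunflower, crisp signage text 'ThursdAI'",
        "image_size": {"width": 2048, "height": 2048},
    },
)
print(result["images"][0]["url"])
```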
Bonus Round: Decart's Real-Time Minecraft Mod for Oasis 2 (X)

The same team behind Lucy also dropped Oasis 2.0, a Minecraft mod that generates game environments in real-time using diffusion models. I got to play around with it live, and watching Minecraft transform into different themed worlds as I moved through them was surreal. Want a steampunk village? Just type it in. Futuristic city? Done. The frame rate stayed impressively smooth, and the visual coherence as I moved through the world was remarkable. It's like having an AI art director that can completely reskin your game environment on demand. And while the current quality remains low-res, if you consider where Stable Diffusion 1.4 was 3 years ago, and where Seedream 4 is now, and do the same extrapolation for Oasis, in 2-3 years we'll be reskinning whole games on the fly and every pixel will be generated (like Jensen loves to say!)

OpenAI adds full MCP to ChatGPT (under a flag)

This is huge, folks. I've been waiting for this for a while, and finally, OpenAI quietly added full MCP (Model Context Protocol) support to ChatGPT via a hidden "developer mode."

How to Enable MCP in ChatGPT

Here's the quick setup I showed during the stream:
* Go to ChatGPT settings → Connectors
* Scroll down to find "Developer Mode" and enable it
* Add MCP servers (I used Rube.ai from Composio)
* Use GPT-4o in developer mode to access your connectors

During the show, I literally had ChatGPT pull Nisten's last five tweets using the Twitter MCP connector. It worked flawlessly (though Nisten was a bit concerned about what tweets it might surface 😂). The implications are massive - you can now connect ChatGPT to GitHub, databases, your local files, or chain multiple tools together for complex workflows. As Wolfram pointed out though, watch your context usage - each MCP connector eats into that 200K limit.

Big Moves: Investments and Infrastructure

Speaking of OpenAI, let's talk money, because the stakes are getting astronomical. OpenAI reportedly has a $300 billion (!) deal with Oracle for compute infrastructure over five years, starting in 2027. That's not a typo - $60 billion per year for compute. Larry Ellison just became the world's richest person, and Oracle's stock shot up 40% on the news in just a few days! This has got to be one of the biggest compute deals the world has ever heard of! The scale is hard to comprehend. We're talking about potentially millions of H100 GPUs worth of compute power. When you consider that most AI companies are still figuring out how to profitably deploy thousands of GPUs, this deal represents infrastructure investment at a completely different magnitude.

Meanwhile, Mistral just became Europe's newest decacorn, valued at $13.8 billion after receiving $1.3 billion from ASML. For context, ASML makes the lithography machines that TSMC uses to manufacture chips for Nvidia. They're literally at the beginning of the AI chip supply chain, and now they're investing heavily in Europe's answer to OpenAI. Wolfram made a great point - we're seeing the emergence of three major AI poles: American companies (OpenAI, Anthropic), Chinese labs (Qwen, Kimi, Ernie), and now European players like Mistral.
Each is developing distinct approaches and capabilities, and the competition is driving incredible innovation.

Anthropic's Mea Culpa and Code Interpreter

After weeks of users complaining about Claude's degraded performance, Anthropic finally admitted there were bugs affecting both Claude Opus and Sonnet. Nisten, who tracks these things closely, speculated that the issues might be related to running different quantization schemes on different hardware during peak usage times. We already reported last week that they admitted that "something was affecting intelligence", but this week they said they pinpointed (and fixed) 2 bugs related to inference! They also launched a code interpreter feature that lets Claude create and edit files directly. It's essentially their answer to ChatGPT's code interpreter - giving Claude its own computer to work with. The demo showed it creating Excel files, PDFs, and documents with complex calculations. Having watched Claude struggle with file operations for months, this is a welcome addition.

🐝 This Week's Buzz: GLM 4.5 on W&B and We're on OpenRouter!

Over at Weights & Biases, we've got some exciting updates for you. First, we've added Zhipu AI's GLM 4.5 to W&B Inference! This 300B+ parameter model is an absolute beast for coding and tool use, ranking among the top open models on benchmarks like SWE-bench. We've heard from so many of you, including Nisten, about how great this model is, so we're thrilled to host it. You can try it out now and get $2 in free credits to start (there's a quick code sketch just before the wrap-up below). And for all you developers out there, you can use a proxy like LiteLLM to run GLM 4.5 from our inference endpoint inside Anthropic's Claude Code if you're looking for a powerful and cheap alternative! Second, we're now on OpenRouter! You can find several of our hosted models, like gpt-oss and DeepSeek Coder, on the platform. If you're already using OpenRouter to manage your model calls, you can now easily route traffic to our high-performance inference stack.

Open Source Continues to Shine

Open source LLMs took a bit of a break this week, but there were still interesting models! Baidu released ERNIE-4.5, a very efficient 21B parameter "thinking" MoE that only uses 3B active parameters per token. From the UAE, MBZUAI released K2-Think, a finetune of Qwen 2.5 that's showing some seriously impressive math scores. And Moonshot AI updated Kimi K2, doubling its context window to 256K and further improving its already excellent tool use and writing capabilities. Tencent released an update to HunyuanImage 2.1, which is a bit slow, but also generates 2K images and is decent at text.

Qwen drops Qwen3-Next-80B-A3B (X, HF)

In breaking news after the show (we were expecting this on the show itself), the Alibaba folks dropped a much more streamlined version of the next Qwen: 80B parameters with only 3B active! They call this an "Ultra Sparse MoE", and it beats Qwen3-32B in perf and rivals Qwen3-235B in reasoning and long context. This is quite unprecedented, as getting models this sparse to work well takes a lot of effort and skill, but the Qwen folks delivered!

Tools

We wrapped with a quick shoutout to EBSynth, a nifty video editor that lets you draw or add elements to one frame and extrapolates to the rest - like Photoshop for motion. It's browser-based and fun for VFX tweaks; check the demo video on X. Simple but powerful for quick video hacks. Speaking of Photoshop, it was confirmed that Nano Banana is going to be embedded into Photoshop for image editing!
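As promised above: a minimal sketch for hitting GLM 4.5 on W&B Inference from Python. The endpoint is OpenAI-compatible, so the usual client works; the base URL and model id below are my assumptions, so verify both against the W&B Inference docs:

```python
# pip install openai -- W&B Inference speaks the OpenAI chat-completions protocol.
from openai import OpenAI

# Assumed base URL and model id; check the W&B Inference docs and model catalog.
client = OpenAI(
    base_url="https://api.inference.wandb.ai/v1",
    api_key="<your W&B API key>",
)

resp = client.chat.completions.create(
    model="zai-org/GLM-4.5",
    messages=[{"role": "user", "content": "Write a Python function that merges two sorted lists."}],
)
print(resp.choices[0].message.content)
```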
Summary & TL;DR

What a week - Seedream and Lucy alone have me rethinking how fast AI can iterate creatively, while MCP in ChatGPT feels like the dawn of truly accessible agents. With open source keeping pace and big deals fueling the fire, AI's multimodal future is accelerating. Thanks for tuning in, folks; if you missed the live vibes, catch the podcast or hit sub.thursdai.news for all the links. See you next Thursday - what blew your mind this week? Drop a comment and share with a friend, it's the best way to support this endeavor!

TL;DR of all topics covered:

AI Models & APIs:
* ChatGPT adds full MCP support - Developer mode unlocks tool connectors for 400M+ users (Setup Guide)
* Seedream 4.0 - ByteDance's unified image generation/editing model creates 4K images in ~1.8 seconds (X, Try it)
* Lucy 14B - Decart's lightning-fast video model generates 5-second clips in 6.5 seconds (Demo, Page)
* Claude bug fixes - Anthropic admits to performance issues and releases code interpreter (Blog)
* Sonoma Dusk & Sky - Mystery models on OpenRouter with 2M context, rumored to be Grok (OpenRouter)

This Week's W&B Buzz:
* OpenRouter integration - Serving models to broader developer community (Try us)
* GLM 4.5 - 350B parameter coding model added to inference (X, Try It)
* W&B inference in Claude Code with LiteLLM (Olaf's Guide)

Open Source Releases:
* ERNIE 4.5 - Baidu open-sources 21B parameter thinking model with 3B active parameters (X, HF)
* K2-Think - MBZUAI's Qwen 2.5 fine-tune with strong math performance (X)
* Kimi K2 update - Doubled context to 256K, improved tool use (X)
* HunyuanImage 2.1 - Tencent's 17B parameter open-source 2K image model (X, HF)
* Qwen3-Next-80B-A3B - Alibaba's next frontier MoE with 3B active params (X, HF)

Voice & Audio:
* Qwen3-ASR-Flash - 11-language speech recognition with singing support (X)
* Stable Audio 2.5 - Enterprise audio generator creating 3-minute tracks (X, Blog, Try It)
* ElevenLabs Voice Remixing - Modify cloned voices for age, gender, accent (X)

Business & Investment:
* OpenAI-Oracle deal - $300B infrastructure agreement over 5 years
* Mistral funding - $1.3B investment from ASML at $13.8B valuation (Blog)

Tools:
* Oasis 2.0 - Real-time Minecraft world generation mod from Decart (Try It)
* EBSynth - Video editing tool for frame-by-frame manipulation (X)

Hosts:
* Alex Volkov (@altryne)
* Wolfram RavenWlf (@WolframRvnwlf)
* Yam Peleg (@yampeleg)
* Nisten (@nisten)
* LDJ (@ldjconfirmed)

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe
    --------  
    1:34:28
  • 📆 ThursdAI - Sep 4 - Codex Rises, Anthropic Raises $13B, Nous plays poker, Apple speeds up VLMs & more AI news
Woohoo, hey y'all, Alex here. I'm back from the desert (pic at the end) and what a great feeling it is to be back in the studio to talk about everything that happened in AI! It's been a pretty full week (or two) in AI, with the coding agent space heating up, Grok entering the ring and taking over free tokens, Codex 10xing usage, and Anthropic... well, we'll get to Anthropic. Today on the show we had Roger and Bhavesh from Nous Research cover the awesome Hermes 4 release and the new PokerBots benchmark, then we had a returning favorite, Kwindla Hultman Kramer, to talk about the GA of RealTime voice from OpenAI. Plus we got some massive funding news, some drama with model quality on Claude Code, and some very exciting news right here from CoreWeave acquiring OpenPipe! 👏 So grab your beverage of choice, settle in (or skip to the part that interests you) and let's take a look at the last week (or two) in AI!

ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.

Open Source: Soulful Models and Poker-Playing Agents

This week did not disappoint when it comes to open source! Our friends at Nous Research released the 14B version of Hermes 4, after releasing the 405B and 70B versions last week. This company continues to excel in finetuning models for powerful, and sometimes just plain weird (in a good way), use cases.

Nous Hermes 4 (14B, 70B, 405B) and the Quest for a "Model Soul" (X, HF)

Roger and Bhavesh from Nous came to announce the release of the smaller (14B) version of Hermes 4, and cover last week's releases of the larger 70B and 405B brothers. The Hermes series of finetunes has always been on our radar, as unique data mixes turned them into uncensored, valuable and creative models and unlocked a bunch of new use cases. But the wildest part? They told us they intentionally stopped training the model not when reasoning benchmarks plateaued, but when they felt it started to "lose its model soul." They monitor the entropy and chaos in the model's chain-of-thought, and when it became too sterile and predictable, they hit the brakes to preserve that creative spark. This focus on qualities beyond raw benchmark scores is why Hermes 4 is showing some really interesting generalization, performing exceptionally well on benchmarks like EQBench3, which tests emotional and interpersonal understanding. It's a model that's primed for RL not just in math and code, but in creative writing, role-play, and deeper, more "awaken" conversations. It's a soulful model that's just fun to talk to.

Nous Husky Hold'em Bench: Can Your LLM Win at Poker? (Bench)

As if a soulful model wasn't enough, the Nous team also dropped one of the most creative new evals I've seen in a while: Husky Hold'em Bench. We had Bhavesh, one of its creators, join the show to explain. This isn't a benchmark where the LLM plays poker directly. Instead, the LLM has to write a Python poker bot from scratch, under time and memory constraints, which then competes against bots written by other LLMs in a high-stakes tournament. Very interesting approach, and we love creative benchmarking here at ThursdAI! This is a brilliant way to test for true strategic reasoning and planning, not just pattern matching. It's an "evergreen" benchmark that gets harder as the models get better.
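To make the shape of the task concrete, here's a hypothetical sketch of the kind of artifact a model has to produce: a self-contained Python bot that receives a game state and returns an action within the time and memory budget. The interface (function name, state fields, action format) is my invention for illustration; the real harness defines its own contract.

```python
import random

# Hypothetical bot interface -- the real Husky Hold'em harness defines its own.
def act(state: dict) -> dict:
    """Given a hand state, return an action within the time/memory budget."""
    hole = state["hole_cards"]   # e.g. ["As", "Kd"]
    to_call = state["to_call"]   # chips needed to stay in the hand
    pot = state["pot"]

    # Toy strategy: raise big pairs, call when it's cheap, otherwise mostly fold.
    ranks = [card[0] for card in hole]
    if ranks[0] == ranks[1] and ranks[0] in "AKQJT":
        return {"action": "raise", "amount": max(2 * to_call, pot // 2)}
    if to_call <= pot // 10 or random.random() < 0.1:
        return {"action": "call"}
    return {"action": "fold"}
```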
Early results are fascinating: Claude 4 Sonnet and Opus are currently leading the pack, but Hermes 4 is the top open-source model.

More Open Source Goodness

The hits just kept on coming this week. Tencent open-sourced Hunyuan-MT-7B, a translation model that swept the WMT2025 competition and rivals GPT-4.1 on some benchmarks. Having a small, powerful, specialized model like this is huge for anyone doing large-scale data translation for training or needing fast on-device capabilities. From Switzerland, we got Apertus-8B and 70B, a set of fully open (Apache 2.0 license, open data, open training recipes!) multilingual models trained on a massive 15 trillion tokens across 1,800 languages. It's fantastic to see this level of transparency and contribution from European institutions. And Alibaba's Tongyi Lab released WebWatcher, a powerful multimodal research agent that can plan steps, use a suite of tools (web search, OCR, code interpreter), and is setting new state-of-the-art results on tough visual-language benchmarks, often beating models like GPT-4o and Gemini. All links are in the TL;DR at the end.

BREAKING NEWS: Google Drops Embedding Gemma 308M (X, HF, Try It)

Just as we were live on the show, news broke from our friends at Google. They've released Embedding Gemma, a new family of open-source embedding models. This is a big deal because they are tiny—the smallest is only 300M parameters and takes just 200MB to run—but they are topping the MTEB leaderboard for models under 500M parameters. For anyone building RAG pipelines, especially for on-device or mobile-first applications, having a small, fast, SOTA embedding model like this is a game-changer. It's so optimized for on-device running that it can run fully in your browser on WebGPU, with a great example from our friend Xenova highlighted on the release blog!

Big Companies, Big Money, and Big Problems

It was a rollercoaster week for the big labs, with massive fundraising, major product releases, and a bit of a reality check on the reliability of their services.

OpenAI's GPT Real-Time Goes GA and Gets an Upgraded Brain (X, Docs)

We had the perfect guest to break down OpenAI's latest voice offering: Kwindla Kramer, founder of Daily and maintainer of the open-source PipeCat framework. OpenAI has officially taken its Realtime API to General Availability (GA), centered around the new gpt-realtime model. Kwindla explained that this is a true speech-to-speech model, not a pipeline of separate speech-to-text, LLM, and text-to-speech models. This reduces latency and preserves more nuance and prosody. The GA release comes with huge upgrades, including support for remote MCP servers, the ability to process image inputs during a conversation, and—critically for enterprise—native SIP integration for connecting directly to phone systems. However, Kwindla also gave us a dose of reality. While this is the future, for many high-stakes enterprise use cases, the multi-model pipeline approach is still more reliable. Observability is a major issue with the single-model black box; it's hard to know exactly what the model "heard." And in terms of raw instruction-following and accuracy, a specialized pipeline can still outperform the speech-to-speech model. It's a classic jagged frontier: for the lowest latency and most natural vibe, GPT Real-Time is amazing. For mission-critical reliability, the old way might still be the right way for now.

ChatGPT Has Branching!
Just as I was about to finish writing this up, ChatGPT announced a new feature, and this one I had to tell you about! Finally you can branch chats in their interface, which has been a highly requested feature! Branching seems to be live on the chat interface, and honestly, tiny but important UI changes like these are how OpenAI remains the best chat experience!

The Money Printer Goes Brrrr: Anthropic's $13B Raise

Let's talk about the money. Anthropic announced it has raised an absolutely staggering $13 billion in a Series F round, valuing the company at $183 billion. Their revenue growth is just off the charts, jumping from a run rate of around $1 billion at the start of the year to over $5 billion by August. This growth is heavily driven by enterprise adoption and the massive success of Claude Code. It's clear that the AI gold rush is far from over, and investors are betting big on the major players. In related news, OpenAI is also reportedly raising $10 billion at a valuation of around $500 billion, primarily to allow employees to sell shares—a huge moment for the folks who have been building there for years.

Oops... Did We Nerf Your AI? Anthropic's Apology

While Anthropic was celebrating its fundraise, it was also dealing with a self-inflicted wound. After days of users on X and other forums complaining that Claude Opus felt "dumber," the company finally issued a statement admitting that yes, for about three days, the model's quality was degraded due to a change in their infrastructure stack. Honestly, this is not okay. We're at a point where hundreds of thousands of developers and businesses rely on these models as critical tools. To have the quality of that tool change under your feet without any warning is a huge problem. It messes with people's ability to do their jobs and trust the platform. While it was likely an honest mistake in pursuit of efficiency, it highlights a fundamental issue with closed, proprietary models. You're at the mercy of the provider. It's a powerful argument for the stability and control that comes with open-source and self-hosted models. These companies need to realize that they are no longer just providing experimental toys; they're providing essential infrastructure, and that comes with a responsibility for stability and transparency.

This Week's Buzz: CoreWeave Acquires OpenPipe! 🎉

Super exciting news from the Weights & Biases and CoreWeave family - we've acquired OpenPipe! Kyle and David Corbitt and their team are joining us to help build out the complete AI infrastructure stack from metal to model. OpenPipe has been doing incredible work on SFT and RL workflows with their open-source ART framework. As Yam showed during the show, they demonstrated you can train a model to SOTA performance on deep research tasks for just $300 in a few hours - and it's all automated! The system can generate synthetic data, apply RLHF, and evaluate against any benchmark you specify. This fits perfectly into our vision at CoreWeave - bare metal infrastructure, training and observability with Weights & Biases, fine-tuning and RL with OpenPipe's tools, evaluation with Weave, and inference to serve it all. We're building the complete platform, and I couldn't be more excited!

Vision & Speed: Apple's FastVLM (HF)

Just before Apple's event next week, they dropped FastVLM - a speed-first vision model that's 85x faster on time-to-first-token than comparable models.
They released it in three sizes (7B, 1.5B, and 0.5B), all optimized for on-device use. The demo that blew my mind was real-time video captioning running in WebGPU. HF CEO Clem showed it processing Apple's keynote video with maybe 250ms latency - the captions were describing what was happening almost in real-time. When someone complained it wasn't accurate because it described "an older man in headphones" when showing an F1 car, Clem pointed out that was actually the previous frame showing Tim Cook - the model was just slightly behind!

Tools Showdown: Codex vs Claude Code

To wrap up, we dove into the heated debate between Codex and Claude Code. Sam Altman reported that Codex usage is up 10x in the past two weeks (!) and improvements are coming. Yam gave us a live demo, and while Claude Code failed to even start up during the show (highlighting why people are switching), Codex with GPT-5 was smooth as butter. The key advantages? Codex authenticates with your OpenAI account (no API key juggling), it has MCP support, and perhaps most importantly - it's not just a CLI tool. You can use it for PR reviews on GitHub, as a cloud-based agent, and integrated into Cursor and Windsurf. Though as Yam pointed out, OpenAI really needs to stop calling everything "Codex" - there are like five different products with that name now! 😅 If you tried Codex (the CLI!) when it was released and gave up, give it a try now; it's significantly upgraded!

Ok, phew, what a great episode we had. If you're only reading, I strongly recommend checking out the live recording or the edited podcast, and of course, if this newsletter is helpful to you, the best thing you can do to support it is to subscribe and share with friends 👏

P.S. - Just came back after my first Burning Man. It was a challenging, all-consuming experience, where I truly disconnected for the first time (first ThursdAI in over 2 years where I didn't know what's going on with AI). It was really fun but I'm happy to be back! See you next week!
TL;DR and Show Notes

* Hosts and Guests
* Alex Volkov - AI Evangelist & Weights & Biases (@altryne)
* Co-hosts - @WolframRvnwlf @yampeleg @nisten @ldjconfirmed
* Guests - Roger Jin - @rogershijin & Bhavesh Kumar @bha_ku21
* Kwindla Kramer - @kwindla

* Open Source LLMs
* Nous Hermes 4 — 14B launches: compact hybrid reasoning model with tool calling for local and cloud use (X, HF, Tech Report)
* Tencent open-sources Hunyuan-MT-7B translation model after sweeping WMT2025 (X, HF)
* Nous - Husky Hold'em Bench launches as an open-source pokerbot eval for LLM strategic play (X, Bench)
* WebWatcher: Alibaba's Tongyi Lab open-sources a vision-language deep research agent that sets new SOTA (X, HF)
* Apertus-8B and 70B launch as Switzerland's fully open, multilingual LLMs trained on 15T tokens across 1,800+ languages (X, HF)
* Google releases Embedding Gemma - 300M param SOTA embeddings model for RAG ([Breaking News])

* Big CO LLMs + APIs
* Mistral adds 20+ MCP-powered connectors and controllable Memories to Le Chat for enterprise workflows (X, Blog)
* Anthropic raises $13B Series F at a $183B post-money valuation (X, Blog)
* OpenAI fundraises $10B at ~$500B valuation - buyback for employees
* OpenAI ships gpt-realtime and takes Realtime API to GA with remote MCP tools, image input, and SIP phone calling (X)
* OpenAI releases projects for free users with larger file uploads and project-only memory controls
* OpenAI acquires Statsig & Alex for $1.1B+ to strengthen applications team
* Grok Code 1 - now taking 50% of coding traffic on OpenRouter
* Codex usage up 10x in 2 weeks per Sam Altman, with improvements coming
* Anthropic admits to Claude Opus quality degradation for 3 days due to infrastructure changes

* This Week's Buzz
* CoreWeave buys OpenPipe! 🎉 (Blog)

* Vision & Video
* Apple's FastVLM-7B lands with speed-first vision encoder—85x faster TTFT vs peers (X, HF)

* AI Art & Diffusion & 3D
* Nano Banana (Imagen 3) continues to dominate as Google's best image model (ai.studio/banana)

* Tools
* Codex vs Claude Code discussion → Codex now significantly better with GPT-5 engine, GitHub PR reviews, and cloud agents

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe
    --------  
    1:38:00
  • 📆 ThursdAI - Aug 21 - DeepSeek V3.1’s hybrid upset, ByteDance’s 512K Seed-OSS, Nano Banana wizardry, Agents.md standardizes agents, and more AI
Hey everyone, Alex here 👋

This week looked quiet… until about 15 hours before we went live. Then the floodgates opened: DeepSeek dropped a hybrid V3.1 that beats their own R1 with fewer thinking tokens, ByteDance quietly shipped a 36B Apache-2.0 long-context family with a "thinking budget" knob, NVIDIA pushed a faster mixed-architecture 9B with open training data, and a stealth image editor dubbed "Nano Banana" started doing mind-bending scene edits that feel like a new tier of 3D-aware control. On the big-co side, a mystery "Sonic" model appeared in Cursor and Cline (spoiler: the function call paths say a lot), and OpenAI introduced Agents.md to stop the config-file explosion in agentic dev tools. We also got a new open desktop-agent RL framework that 4x'd OSWorld SOTA, an IBM + NASA model for solar weather, and Qwen's fully open 20B image editor that's shockingly capable and runnable on your own GPU. Our show today was one of the shortest yet, as I had to drop early to prepare for Burning Man 🔥🕺 Speaking of which, Wolfram and the team will host the next episode! Ok, let's dive in!

DeepSeek V3.1: a faster hybrid that thinks less, scores more (X, HF)

DeepSeek does this thing where they let a base artifact "leak" onto Hugging Face, and the rumor mill goes into overdrive. Then, hours before we went live, the full V3.1 model card and an instruct variant dropped. The headline: it's a hybrid reasoner that combines the strengths of their V3 (fast, non-thinking) and R1 (deep, RL-trained thinking), and on many tasks it hits R1-level scores with fewer thinking tokens. In human terms: you get similar or better quality, faster. A few things I want to call out from the release and early testing:

* Hybrid reasoning mode done right. The model can plan with thinking tokens and then switch to non-thinking execution, so you don't have to orchestrate two separate models. This alone simplifies agent frameworks: plan with thinking on, execute with thinking off (see the sketch after this list).
* Thinking efficiency is real. DeepSeek shows curves where V3.1 reaches or surpasses R1 with significantly fewer thinking tokens. On AIME'25, for example, R1 clocks 87.5% with ~22k thinking tokens; V3.1 hits ~88.4% with ~15k. On GPQA Diamond, V3.1 basically matches R1 with roughly half the thinking budget.
* Tool-use and search-agent improvements. V3.1 puts tool calls inside the thinking process, instead of doing a monologue and only then calling tools. That's the pattern you want for multi-turn research agents that iteratively query the web or your internal search.
* Long-context training was scaled up hard. DeepSeek says they increased the 32K extension phase to ~630B tokens, and the 128K phase to ~209B tokens. That's a big bet on long-context quality at train time, not just inference-time RoPE tricks. The config shows a max position in the 160K range, with folks consistently running it in the 128K class.
* Benchmarks show the coding and terminal agent work got a big push. TerminalBench jumps from a painful 5.7 (R1) to 31 with V3.1. Codeforces ratings are up. On SWE-bench Verified (non-thinking), V3.1 posts 66 vs R1's ~44. And you feel it: it's faster to "get to it" without noodling forever.
* API parity you'll actually use. Their API now supports the Anthropic-style interface as well, which means a bunch of editor integrations "just work" with minimal glue. If you're in a Claude-first workflow, you won't have to rewire the world to try V3.1.
* License and availability. This release is MIT-licensed, and you can grab the base model on Hugging Face.
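Here's what that plan-with-thinking, execute-without pattern can look like in practice. A minimal sketch against an OpenAI-compatible endpoint; the deepseek-reasoner / deepseek-chat model names match DeepSeek's own API at the time of writing, but treat them as assumptions if you're serving the weights elsewhere:

```python
from openai import OpenAI

client = OpenAI(base_url="https://api.deepseek.com", api_key="<your key>")

task = "Refactor this repo's config loading into a single typed module."

# 1) Plan with thinking on (the reasoning-mode endpoint).
plan = client.chat.completions.create(
    model="deepseek-reasoner",
    messages=[{"role": "user", "content": f"Write a short step-by-step plan: {task}"}],
).choices[0].message.content

# 2) Execute the plan with thinking off (the fast non-reasoning endpoint).
result = client.chat.completions.create(
    model="deepseek-chat",
    messages=[
        {"role": "system", "content": "Execute the plan exactly; no extra commentary."},
        {"role": "user", "content": plan},
    ],
).choices[0].message.content
print(result)
```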
If you prefer hosted, keep an eye on our inference—we're working to get V3.1 live so you can benchmark without burning your weekend assembling a serving stack. Hugging Face: https://huggingface.co/deepseek-ai/DeepSeek-V3.1-Base

Quick personal note: I'm seeing a lot of small, pragmatic improvements add up here. If you're building agents, the hybrid mode plus tighter tool integration is a gift. DeepSeek V3.1 is going to be deployed to the W&B Inference service soon! Take a look here to see when it's ready: wandb.me/inference

ByteDance Seed-OSS 36B: Apache-2.0, 512K context, and a "thinking budget" knob (X, HF, Github)

I didn't see much chatter about this one, which is a shame because this seems like a serious release. ByteDance's Seed team open-sourced a trio of 36B dense models—two Base variants (with and without synthetic data) and an Instruct model—under Apache-2.0, trained on 12T tokens and built for long-context and agentic use. The context window is a native half-million tokens, and they include a "thinking budget" control you can set in 512-token increments so you can trade depth for speed (a hypothetical sketch of what that can look like in code follows below). They report strong general performance, long-context RULER scores, and solid code/math numbers for a sub-40B model, with the Instruct variant posting very competitive MMLU/MMLU-Pro and LiveCodeBench results. The architecture is a straightforward dense stack (not MoE), and the models ship with Transformers/vLLM support and 4/8-bit quantization ready to go. If you've been hunting for a commercial-friendly, long-context 30-something‑B with an explicit reasoning-control dial, this should be on your shortlist. A neat detail for the training nerds: two Base releases—one trained with synthetic data, one without—make for a rare apples-to-apples study in how synthetic data shapes base capability. Also worth noting: they previously shipped a Seed-Prover specialized for Lean; it looks like the team is interested in tight domain models as well as generalists.

NVIDIA Nemotron Nano 9B V2: mixed architecture, open data, and long-context throughput (X, Blog, HF, Dataset, Try It)

NVIDIA shipped a fully open release of Nemotron Nano 9B V2—base, base-before-alignment/pruning, and a realigned reasoning model—and, crucially, they published most of the pretraining dataset details (~6.6T tokens across premium web, math, code, and SFT). That level of data transparency is rare and makes this a great base for fine-tuners who want reproducibility. Under the hood, this is a mixed Mamba+Transformer architecture. NVIDIA is claiming up to 6x higher throughput versus a pure-Transformer peer (they compare to Qwen3-8B) and specifically highlight that they pruned a 12B down to 9B while preserving quality. They also note a single A10 can handle 128K context after compression and distillation passes, which is the kind of practical systems work that matters when you're running fleets. A couple of caveats. The license is the NVIDIA Open Model License—not Apache-2.0—so read it; it includes restrictions around illegal surveillance and safety bypasses and has revocation clauses. Personally, I appreciate the data openness and the long-context engineering; as always, just make sure the license fits your use case. If you're into longer-context math/coding with small models, the numbers (AIME'25, RULER-128K, GPQA) are impressive for 9B. And if you fine-tune: the availability of both pruned and pre-pruned bases plus the dataset recipe is a rare treat.
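Circling back to Seed-OSS's thinking-budget knob, as promised: conceptually it's a generation-time dial you move in 512-token steps. A hypothetical sketch via Transformers; the model id comes from the release, but the thinking_budget kwarg name is my assumption of how the chat template exposes it, so check the HF model card for the real plumbing.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ByteDance-Seed/Seed-OSS-36B-Instruct"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user", "content": "Plan a 3-step refactor of a legacy parser."}]
inputs = tok.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    thinking_budget=1024,  # assumed kwarg: the budget moves in 512-token increments
).to(model.device)
out = model.generate(inputs, max_new_tokens=2048)
print(tok.decode(out[0][inputs.shape[-1]:]))
```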
Cohere's Command-A Reasoning: dense, multilingual, and research-only licensing (X, Blog, HF)

Cohere dropped a new reasoning model focused on enterprise deployment patterns. It's a dense 111B model, supports a 256K context, and includes very strong multilingual coverage (23 languages is what they called out). What caught my eye: on the BFCL (Berkeley Function-Calling Leaderboard) they show 70%—above DeepSeek R1's ~63% and GPT-OSS's ~61%—and they plot the now-familiar test-time compute curves where more thinking tokens yield higher scores. This release uses Cohere's non-commercial research license; if you want commercial usage you'll need to go through them. That said, for teams who need privately deployable, on-prem reasoning and can work under a research license for prototyping, it's a serious option. A meta observation from the show: there's accumulating evidence that more active parameters help multi-hop tool-use chains compared to very sparse MoE at similar effective capacity. This model nudges in that direction.

Desktop agents leap: ComputerRL hits 48% on OSWorld (Paper)

A new framework dubbed ComputerRL, from Z.ai and folks at Tsinghua University, unified API calls with GUI actions and scaled RL across fleets of virtual desktops, posting a 48.1% success rate on OSWorld versus ~12% for earlier open models. The training system spins up thousands of qemu-in-docker VMs via gRPC; the learning loop alternates RL with supervised fine-tuning and uses a clean step-level binary reward to simplify credit assignment. If you care about practical desktop automation across Ubuntu/Windows/macOS, this is a big jump.

IBM + NASA's Surya: open model for solar weather (HF)

Scientists get some love: IBM and NASA open-sourced Surya, a transformer trained on nine years of multi-instrument observations (nearly 200 TB) to forecast solar dynamics and space weather—the stuff that can knock satellites and power grids sideways. It's on Hugging Face, it's actually runnable, and it's a fantastic example of open models delivering real-world scientific utility.

Smaller but notable: InternLM and OpenCUA, plus Intel's quants

Two quick flags from the "worth your time" pile. InternLM shipped S1 Mini, an 8B vision+language model (ViT on top) that's multimodal and lightweight; if you need on-device omni-ish behavior on a laptop or tablet, give it a look. And OpenCUA 32B (Qwen-based) is a specialized computer-usage agent with strong scores; if you're building automations that need native OS control, it's worth benchmarking. Also, if you're running 4-bit: the Intel quantization work is excellent right now. Their 4-bit quants have been extremely high precision in my testing, especially for large coders and reasoners like DeepSeek V3.1. It's an easy win if you're trying to squeeze a 30B+ onto a workstation without hemorrhaging quality.

Big-co updates and platform shifts

Sonic appears in Cursor and Cline

If you open Cursor or fire up Cline, you may see a new "Sonic" model toggle. It's labeled as a reasoning model and, when you poke the function-calling internals, the call paths include "xai/…" strings. Folks report it's fast and solid for coding.
No official docs yet, but I'd be surprised if this isn't Grok Code in pre-release clothes.

Agents.md: one file to rule your agents

Agentic dev stacks have multiplied config files like gremlins: Cursor's rules.json, Windsurf's prompts, MCP server manifests, tool schemas, install scripts… and every tool wants a different filename and format. OpenAI's Agents.md is a strong attempt at standardization. It's just Markdown at the repo root that says, "here's how to set up, build, test, and run this project," plus any agent-specific caveats. Tools then auto-detect and follow your instructions instead of guessing. It's already supported by OpenAI Codex, Amp, Jules, Cursor, RooCode, and more, with tens of thousands of public repos adopting the pattern. In monorepos, the nearest Agents.md wins, so you can override at the package level. And human chat instructions still override the file's guidance, which is the right default.
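That nearest-file rule is the one bit worth picturing in code. A tiny hypothetical resolver, walking up from the file being edited to the repo root (the function and its behavior are my illustration of the rule as described, not any tool's actual implementation):

```python
from pathlib import Path

def nearest_agents_md(start: Path, repo_root: Path) -> Path | None:
    """Walk from the edited file's directory up to the repo root; closest AGENTS.md wins."""
    for directory in [start, *start.parents]:
        candidate = directory / "AGENTS.md"
        if candidate.is_file():
            return candidate
        if directory == repo_root:
            break
    return None

# In a monorepo, packages/api/AGENTS.md overrides the root AGENTS.md
# for anything under packages/api/.
print(nearest_agents_md(Path("packages/api/src"), Path(".")))
```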
GPT‑5 context truncation in the web UI (reports)

There's been a wave of reports that GPT‑5 in the web UI is truncating long prompts even when you're under the documented context limit. The folks at Repo Prompt reproduced this multiple times and got confirmation from OpenAI that it's a bug (not a deliberate nerf). If you saw GPT‑5 suddenly forget the bottom half of your carefully structured system prompt in the web app, this likely explains it. The API doesn't seem affected. Fingers crossed for a quick fix—GPT‑5 is still the best model I've used for 300k‑token "read the whole repo and propose a plan" tasks.

Image and 3D: Nano Banana and Qwen's open image editor

Nano Banana: 3D-consistent scene editing from thin air

A stealth model nicknamed "Nano Banana" surfaced in a web demo and started doing the kind of edits you'd normally expect from a 3D suite with a modeler at the controls. Take two photos—your living room and a product shot—and it composites the object into the scene with shockingly consistent lighting and geometry. Ask for a 3D mesh "five inches off the skin," and the mesh really does offset. Ask for a new camera angle on a single still, and it renders the scene from above with plausible structure. People have been calling it a game-changer and, for once, it doesn't feel like hyperbole. There's a strong whiff of 3D world modeling under the hood—some volumetric representation or neural field that enables true view synthesis—and Logan Kilpatrick posted a banana emoji that set speculation on fire. We'll see where it lands and under what license, but for now the demo has been doing the rounds and it is… wow. If you're wondering where to try it: LMArena is currently the only way to give it a try, but it's supposedly dropping soon!

Qwen Image Edit (20B): fully open and already practical (X, HF)

Qwen shipped a 20B image-editing model layered on their existing vision stack, and it's properly open (permissive license) with strong bilingual text editing in Chinese and English. It handles high-level semantic edits (pose adjustments, rotations, style/IP creation) and low-level touch-ups (add/remove/insert). You can swap objects, expand aspect ratios, keep character identity consistent across panels, and do clean style transfer. It runs locally if you've got a decent GPU. What I appreciate here is the precision. Product placement tasks like "put this book in this person's hand at this angle," or "make the shoes match the dress" come out with the kind of control that used to require hand masking and a dozen passes. And yes, the capybara mascot is back in the release materials, which made my day! 👏

If Nano Banana is the closed-world demo of what's "beyond SOTA," Qwen Image Edit is the open tool you can actually ship with today.

This week's buzz from Weights & Biases

Two quick updates from our side. First, we're working to bring DeepSeek V3.1 to our inference as fast as we can so you can run serious benchmarks without fussing over serving stacks. Keep an eye on our channels; we'll shout when it's live and we'll have some credits for early feedback. Second, our cofounder Chris Van Pelt released Catnip, a containerized multi‑agent coding workspace that runs multiple Claude Code sessions (or other agents) in isolated sandboxes, each with its own context and notification stream. If you've been juggling parallel coding agents that step on each other's toes, this is catnip indeed. Catnip GitHub: https://github.com/wandb/catnip

Closing thoughts

A year ago, "thinking tokens" weren't even a curiosity; we only got the first whiff of "reasoning" back in September, and now we're watching hybrid models that do more with less thinking, tool calls woven inside the reasoning loop, and long-context training regimes scaled up by an order of magnitude. The agent stack is maturing fast—desktop RL is finally clearing real tasks; editor ecosystems are converging on a single config file; and even the stealth drops are clearly building toward world-model‑aware editing and control. If you only try two things this week: run DeepSeek V3.1 in both modes (planning with thinking on, execution with thinking off) and throw a complex multi-step tool workflow at it; then take Qwen Image Edit for a spin on a real product-placement or character-consistency task. You'll feel the future in your hands. I'm off to the desert next week for a bit (no internet where I'm going), but Wolfram and the crew will keep the ship on course. If you're at Burning Man, DM me—would love to say hi out there. As always, thank you for tuning in and nerding out with us every week.

— Alex

TL;DR and show notes - ThursdAI, Aug 21, 2025

TL;DR of all topics covered:

* Hosts and Guests
* Alex Volkov - AI Evangelist & Weights & Biases
* Co-hosts - @WolframRvnwlf, @yampeleg, @nisten, @ldjconfirmed

* Open Source LLMs // Papers
* ByteDance Seed-OSS - 36B open-source LLM family (X, HF, GitHub)
* DeepSeek V3.1 - Updated hybrid model (HF)
* Cohere CMD-A Reasoning - (X, Blog, HF)
* Z.ai/Tsinghua ComputerRL - Framework for desktop agents (X, Paper, Benchmark)
* IBM & NASA Surya - Solar weather prediction (HF)
* NVIDIA Nemotron Nano 9B V2 - (X, Blog, HF, Dataset, Try It)
* Alibaba Quark Med

* Big CO LLMs + APIs
* Sonic stealth model - likely Grok Code
* OpenAI agents.md - unified agent files (agents.md)
* GPT-5 web UI truncation (a bug, not a nerf)

* AI Art & Diffusion & 3D
* Nano Banana - Image model (rumored Google)
* Qwen-Image-Edit - 20B image editing (X, HF)

* This Week's Buzz
* Catnip - Containerized AI agent runner (GitHub)

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe
    --------  
    1:06:24
  • 📆 ThursdAI - Aug 14 - A week with GPT5, OSS world models, VLMs in OSS, Tiny Gemma & more AI news
Hey everyone, Alex here 👋

Last week, I tried to test GPT-5 and got really surprisingly bad results, but it turns out, as you'll see below, it's partly because they had a bug in the router, and partly because... well, the router itself! See below for an introduction written by GPT-5; it's actually not bad?

Last week was a whirlwind. We live‑streamed GPT‑5's "birthday," ran long, and then promptly spent the next seven days poking every corner of the new router‑driven universe. This week looked quieter on the surface, but it actually delivered a ton: two open‑source world models you can drive in real time, a lean vision‑language model built for edge devices, a 4B local search assistant that tops Perplexity Pro on SimpleQA, a base model "extraction" from GPT‑OSS that reverses alignment, fresh memory features landing across the big labs, and a practical prompting guide to unlock GPT‑5's reasoning reliably. We also had Alan Dao join to talk about Jan‑v1 and what it takes to train a small model that consistently finds the right answers on the open web—locally.

Not bad eh? Much better than last time 👏 Ok, let's dive in; a lot to talk about in this "chill" AI week (show notes at the end as always): first open source, then GPT-5 reactions, and then... world models!

00:00 Introduction and Welcome
00:33 Host Introductions and Health Updates
01:26 Recap of Last Week's AI News
01:46 Discussion on GPT-5 and Prompt Techniques
03:03 World Models and Genie 3
03:28 Interview with Alan Dao from Jan
04:59 Open Source AI Releases
06:55 Big Companies and APIs
10:14 New Features and Tools
14:09 Liquid Vision Language Model
26:18 Focusing on the Task at Hand
26:18 Reinforcement Learning and Reward Functions
26:35 Offline AI and Privacy
27:13 Web Retrieval and API Integration
30:34 Breaking News: New AI Models
30:41 Google's New Model: Gemma 3
33:53 Meta's DINOv3: Advancements in Computer Vision
38:50 Open Source Model Updates
45:56 Weights & Biases: New Features and Updates
51:32 GPT-5: A Week in Review
55:12 Community Outcry Over AI Model Changes
56:06 OpenAI's Response to User Feedback
56:38 Emotional Attachment to AI Models
57:52 GPT-5's Performance in Coding and Writing
59:55 Challenges with GPT-5's Custom Instructions
01:01:45 New Prompting Techniques for GPT-5
01:04:10 Evaluating GPT-5's Reasoning Capabilities
01:20:01 Open Source World Models and Video Generation
01:27:54 Conclusion and Future Expectations

Open Source AI

We've had quite a lot of open source this week on the show, including breaking news from the Gemma team!

Liquid AI drops LFM2-VL (X, blog, HF)

Let's kick things off with our friends at Liquid AI, who released LFM2-VL - their new vision-language models coming in at a tiny 440M and 1.6B parameters. The Liquid folks continue to surprise with speedy, mobile-device-ready models that run 2x faster vs top VLM peers. With a native 512x512 resolution (it breaks any larger image into 512px smart tiles) and an OCRBench score of 74, this tiny model beats SmolVLM2 while being half the size.
We've chatted with Maxime from Liquid about LFM2 back in July, and it's great to see they are making these multimodal as well, with the same efficiency gains!

Zhipu (z.ai) unleashes GLM-4.5V - 106B VLM (X, Hugging Face)

In another "previous good model that now has eyes" release, the fine folks from Zhipu continued training their recently released (and excellent) GLM-4.5-Air with a vision encoder, resulting in probably one of the top vision models in open source! It's an MoE with only 12B active parameters (106B total), gets SOTA across 42 public vision-language benches, and has a "thinking mode" that reasons about what it sees. Given that GLM-4.5-Air is a really strong model, this is de facto the best visual intelligence in open source, able to rebuild websites from a picture, for example, and identify statues and locations!

Jan V1 - a tiny (4B) local search assistant Qwen finetune (X, Hugging Face)

This release got a lot of attention, as the folks at Menlo Research (Alan Dao, who came to chat with us about Jan on the pod today) released an Apache 2.0 finetune of Qwen3-4B-Thinking that's focused on SimpleQA. They showed that their tiny model is beating Perplexity Pro on SimpleQA. Alan told us on the pod that Jan (the open source Jan app) was born to be an open source alternative to searching with local models! The trick is, you have to enable some source of search data (Exa, Serper, Tavily) via MCP and then enable tools in Jan, and then... you have a tiny, completely local Perplexity clone with a 4B model!

Google drops Gemma 3 270M (blog)

In some #breakingNews, Google open sourced a tiny (270M parameter), "good at instruction following" Gemma variant. This joins models like SmolLM and LFM2 in the "smol models" arena; being only 300MB, you can run this... on a toaster. This one apparently also fine-tunes very well while being very energy efficient!

Big Companies (AKA OpenAI corner this past 2 weeks)

Ok ok, we're finally here, a week with GPT-5! After watching the live stream and getting access to GPT-5, my first reactions were not great. Apparently, so were other people's, and many folks outcried and complained about the model, some even yelling "AGI is cancelled". What apparently happened (and it's since been fixed by OpenAI) is that GPT-5 wasn't just a model that launched, it was a "smart" router between a few models, and not only did they have a routing bug, the basic GPT-5 model, the one without thinking, is... not great. But the thinking GPT-5, the one that the router refused to send me to, is really good (as confirmed independently by multiple evals at this point). For one, it's the most accurate function calling model on OpenRouter. It's also one of the best on the new FormulaOne benchmark that was just launched.

You're prompting it wrong!

Apparently, not only is GPT-5 more intelligent, it's also significantly more "surgical" in instruction following, and so, for many folks, just dropping GPT-5 into their tools or prompts didn't just "work", as this model, more than before, is sensitive to conflicting things in the prompt. OpenAI has released a guide for prompting the model, mostly aimed at developers (as users shouldn't have to learn to prompt as models get more intelligent), plus they also released a prompt optimizer! Just dump your long and complex prompts in there, and you'll get an updated prompt with explanations of why they changed what they changed!

Model Picker (and legacy models) are back!

So, OpenAI tried and super quickly reversed course on removing the "model picker".
At first, it was only GPT-5 there, but many people complained about the abrupt removal of 4o, their... favorite model. At first, OpenAI added back the models via a hidden setting, and later, they added 4o back for everyone by default, while increasing the reasoning quota to 3,000 messages per week! Generally, my thoughts are: if you've tried GPT-5 and weren't impressed, give it another go! (Especially now that it's connected to Gmail in chats!)

Other notable Big Company updates

In other news, Claude has extended the context window of Sonnet to 1M in the API, and apparently both Claude and Gemini have been adding memory features! Grok video has been catching up and is now free for a while to all users.

This Week's Buzz: Weave DX improvements

Quick update from my day job at Weights & Biases - we've rolled out some quality-of-life improvements to Weave, our LLM observability platform. We now have a unified assets tab where you can manage all your prompts, models, and datasets with full versioning support. Prompts are version tracked, so if you use that GPT-5 prompt optimizer, we'll store all the previous revisions for ya! The coolest addition? Threads! Perfect for tracking agent executions or grouping related API calls. You just add a thread_id to your traces and Weave handles the rest (see the sketch below). If you're building AI applications and not tracking everything, you're flying blind - give Weave a try at wandb.me/weave!
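For a feel of the threads workflow, here's a minimal sketch. The @weave.op decorator and weave.init are the standard Weave tracing API; the weave.thread(...) context manager is my assumption of how the new threads feature is spelled, so check the Weave docs for the exact API.

```python
import weave

weave.init("my-team/my-agent")  # hypothetical project name

@weave.op
def agent_step(prompt: str) -> str:
    # Call your model here; Weave records inputs and outputs automatically.
    return f"response to: {prompt}"

# Assumption: a thread context that stamps every traced call with a thread_id,
# so related agent steps get grouped together in the Weave UI.
with weave.thread("support-ticket-1234"):
    agent_step("triage the ticket")
    agent_step("draft a reply")
```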
World models are getting... open sourced!

I still think that Google's Genie 3 release from last week was maybe the more important one, though we didn't really get to play with it yet! And while getting excited by world models, I was thinking that it's going to take a while for open source to catch up. But this week, not one but two world models were open sourced, making me think that we'll get to generated worlds quicker than I expected, and the race has begun!

Skywork's Matrix-Game 2.0 (project, HF)

Matrix-Game 2.0 is an auto-regressive diffusion model, trained on 1,200 hours of Unreal Engine and GTA-5 environments, that runs at 25 frames per second! It works via an "action injection module" that embeds mouse/keyboard inputs into the generation, enabling frame-level controls.

Hunyuan open-sources GameCraft for real-time, high-dynamic game video generation (X, Hugging Face)

Two world models (well, game models) in the same week? Tencent (who had Hunyuan Video before) have trained a game engine on top of their excellent HY-Video and have shown the same kinds of examples: building a full world based on a few images. Their pipeline trained on 1M gameplay clips from AAA titles, and they also map W/A/S/D and mouse signals into continuous camera/action embeddings, allowing for control and angle creation. The cool thing? A quantized 13B version supposedly can run on an RTX 4090! Funnily, they already had Matrix-Game (the one that came out a few days before) benchmarked and beaten in today's release!

Genie 3 is not messing around

While all the open source is impressive, I was… absolutely blown away by this video from an artist who got the Genie 3 team to extend a video of his. Just look at the collision of the plane with the sphere: out of nowhere, Genie 3 adds a shadow, then collision mechanics, the plane bouncing off, and even the jet trails subside and then resume! It really, really is crazy to imagine that no prompting was given and the model just... knew how to do this!

Phew, that was a lot! Much more, as always, on the actual show. Despite it being a "quiet" week by summer-of-2025 standards, there was a LOT of open source released, and the GPT-5 situation shows that even OpenAI can stumble on new tech releases! Keep the feedback coming - find me on Twitter/X at @altryne or email the show. And remember, if you want to catch all the technical details and demos, the video version on YouTube has everything the podcast can't show. Until next week, keep tinkering, keep questioning, and keep pushing the boundaries of what's possible with AI! See you next ThursdAI 🚀

TL;DR - All Topics Covered

* Hosts and Guests
* Alex Volkov - AI Evangelist @ Weights & Biases (@altryne)
* Co-hosts: @WolframRvnwlf, @yampeleg, @nisten, @ldjconfirmed
* Guest: Alan Dao from Jan (@jandotai)

* Open Source LLMs
* Liquid AI LFM2-VL: 440M & 1.6B vision-language models with 2x GPU speedup (Blog, HF)
* Jan V1: 4B parameter search assistant beating Perplexity Pro on SimpleQA (X, HF)
* GPT-OSS Base: Reverse-engineered base model from Jack Morris (X Thread)
* Gemma 3: Google's 270M parameter model with strong instruction following (HF)
* Meta Dino v3: Self-supervised vision model for segmentation (Blog)

* Big Companies & APIs
* Mistral Medium 3.1: New model on Mistral platform
* Gemini & Claude: Added memory/personalization features
* GPT-5 Updates: Router fixes, model selector returned, prompting guide released (Guide, Optimizer)
* Claude Sonnet 4: 1M token context window (Announcement)

* This Week's Buzz
* Weave updates: Unified assets tab and threads for agent tracking
* New features for LLM observability and evaluation

* Vision & Video
* Hunyuan Large Vision: 1B vision encoder + 389B MoE language model (Project)
* GLM-4.5V: 106B open source VLM from Zhipu AI (X, HF)

* World Models
* Matrix-Game 2.0: Real-time interactive world model, 25fps generation (Project, HF)
* Hunyuan GameCraft: Game video generation with physics understanding (X, HF)

* Tools
* Grok: Now includes video generation for all users
* Jan Desktop App: Local AI with MCP support and search capabilities

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe
    --------  
    1:29:41
  • 📅 ThursdAI - GPT5 is here
Hey folks 👋 Alex here, writing to you from a makeshift recording studio in an Eastern European hookah bar, where I spent the last 7 hours. Why, you ask? Well, when GPT-5 drops the same week as OpenAI's long-awaited OSS models, plus Google shipping a perfect-memory world model (Genie 3) and tons of other AI drops, I just couldn't stay away from the stream.

Vacation or not, ThursdAI is keeping you up to date (for 32 months straight, which is also the time since the original GPT-4 release that gave this show its name!)

So, what did we have today on the stream? We started as usual, talking about the AI releases of the week, as if OpenAI dropping Apache 2 licensed OSS models at 120B and 20B is "usual". We then covered incredible releases like Google's world model Genie 3 (more on this next week!) and Qwen-Image + a few small Qwens. We then were VERY excited to tune in and watch the (very long) announcement stream from OpenAI, in which they spent an hour telling us about GPT-5.

This was our longest stream by far (3.5 hours, 1hr of which was just the OpenAI live stream) and I'm putting this here mostly unedited, but chapters are up, so feel free to skip to the parts that interest you most.

00:00 Introduction and Special Guests
00:56 Twitter Space and Live Streaming Plans
02:12 Open Source AI Models Overview
03:44 Qwen and Other New AI Models
08:59 Community Interaction and Comments
10:01 Technical Deep Dive into AI Models
25:06 OpenAI's New Releases and Benchmarks
38:49 Expectations and Use Cases for AI Models
40:03 Tool Use vs. Deep Knowledge in AI
41:02 Evaluating GPT OSS and OpenAI Critique
42:29 Historical and Medical Knowledge in AI
51:16 Opus 4.1 and Coding Models
55:38 Google's Genie 3: A New World Model
01:00:43 Kitten TTS: A Lightweight Text-to-Speech Model
01:02:07 11 Labs' Music Generation AI
01:08:51 OpenAI's GPT-5 Launch Event
01:24:33 Building a French Learning Web App
01:26:22 Exploring the Web App Features
01:29:19 Introducing Enhanced Voice Features
01:30:02 Voice Model Demonstrations
01:32:32 Personalizing ChatGPT
01:33:23 Memory and Scheduling Features
01:35:06 Safety and Training Enhancements
01:39:17 Health Applications of GPT-5
01:45:07 Coding with GPT-5
01:46:57 Advanced Coding Capabilities
01:52:59 Real-World Coding Demonstrations
02:10:26 Enterprise Applications of GPT-5
02:11:49 Amgen's Use of GPT-5 in Drug Design
02:12:09 BBVA's Financial Analysis with GPT-5
02:12:33 Healthcare Applications of GPT-5
02:12:52 Government Adoption of GPT-5
02:13:22 Pricing and Availability of GPT-5
02:13:51 Closing Remarks by Chief Scientist Jakub
02:16:03 Live Reactions and Discussions
02:16:41 Technical Demonstrations and Comparisons
02:33:53 Healthcare and Scientific Advancements with GPT-5
02:47:09 Final Thoughts and Wrap-Up

My first reactions to GPT-5

Look, I gotta keep it real with you: my first gut reaction was, hey, I'm on vacation, I don't have time to edit and write the newsletter (EU timezone), so let's see how ChatGPT-5 handles this task. After all, OpenAI has removed all other models from the dropdown; it's all GPT-5 now. (Pricing from the incredible writeup by Simon Willison, available here.)

And to tell you the truth, I was really disappointed! GPT-5 seems to be incredible at coding benchmarks; with a 400K-token context window and aggressive pricing (just $1.25/$10 per million tokens, compared to Opus at $15/$75), this model, per the many friends who got to test it early, is a beast at coding!
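To make that pricing gap concrete, here's a quick back-of-the-envelope comparison; the token counts below are a made-up workload, with the per-million-token prices quoted above:

```python
# Back-of-the-envelope API cost comparison (prices in $ per 1M tokens,
# as quoted above; the token counts are a hypothetical workload).
PRICES = {
    "GPT-5":    {"input": 1.25,  "output": 10.00},
    "Opus 4.1": {"input": 15.00, "output": 75.00},
}

input_tokens, output_tokens = 5_000_000, 1_000_000  # assumed workload

for model, p in PRICES.items():
    cost = (input_tokens / 1e6) * p["input"] + (output_tokens / 1e6) * p["output"]
    print(f"{model}: ${cost:,.2f}")

# GPT-5:    $16.25
# Opus 4.1: $150.00
```

On this made-up workload, GPT-5 comes out roughly 9x cheaper.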
It readily beats Opus on affordability per token, and it switches from more thinking to less thinking when it needs to; it definitely seems like a great improvement for coding and agentic tasks.

But for my very much honed prompt of "hey, help me with ThursdAI drafts, here's previous drafts that I wrote myself, mimic my tone", it failed... spectacularly!

Here's just a funny example, after I replied that it did a bad job:

It literally wrote "I'm Alex, I build the mind, not the vibe" 🤦‍♂️ What... the actual...

For comparison, here's o3, with the same prompt, producing a fairly true-to-tone draft:

High-taste testers' take on GPT-5

But hey, I have tons of previous speakers in our group chats, and many of them who got early access (I didn't... OpenAI, I can be trusted lol) rave about this model. They are saying this is a huge jump in intelligence.

Folks like Dr. Derya Unutmaz, who jumped on the live show and described how GPT-5 does incredible things with fewer hallucinations; folks like Swyx from Latent.Space, who had early access and even got invited to the OpenAI office to give first reactions; and Pietro Schirano, who also showed up in an OpenAI video.

So definitely, definitely check out their vibes, as we all try to wrap our heads around this new intelligence king we got!

Other GPT-5 updates

OpenAI definitely cooked, don't get me wrong. With this model plugging into everything else in their platform, like memory, voice (which was upgraded and now works in custom GPTs, yay!), canvas, and study mode, this will definitely be an upgrade for many folks using the models.

They have now also opened access to GPT-5 to free users, just in time for schools to reopen, including a very interesting Quiz mode (which just showed up for me without my asking for it) and a connection to Gmail; all of those will now work with GPT-5.

It now has 400K context, way fewer hallucinations, and fewer refusals too, and the developer upgrades, like a new verbosity setting and a new "minimal" reasoning setting, are all very welcome!

OpenAI finally launches gpt-oss (120B / 20B) Apache 2 licensed models (model card, HF)

It was really funny: on the stream, Nisten talked about the open source models OpenAI dropped and said "when we covered it last week", when it was just two days ago! It really does feel like this world is moving really fast.

OpenAI's long-promised open source models are here, and they got a fairly mixed bag of reviews. Many folks are celebrating that the western world is now back in the game, releasing incredible local models with an open license!

Though, after the initial excitement, the vibes are split on these models. Folks are saying that maybe they were trained on only synthetic data, because, like Phi, they seem to be very good at benchmarks and at the specific tasks they were optimized for (code, math) but really bad at creative writing (Sam Paech from EQBench was not impressed). They are also not multilingual, though OpenAI did release a cookbook on finetuning with HuggingFace!

Overall, these models are trained for agentic workflows - supporting function calling, web search, Python execution, configurable reasoning effort, and full raw chain-of-thought access, which we will never get from GPT-5.

I particularly love the new approach where reasoning effort can be set directly via the system prompt: just add "Reasoning: high" to the system prompt and the model will reason for way longer!
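Here's a minimal sketch of that trick, assuming you're serving gpt-oss behind an OpenAI-compatible endpoint (the base URL and model id below are placeholders for whatever your local server - Ollama, LM Studio, vLLM, etc. - actually exposes):

```python
# Minimal sketch: setting gpt-oss reasoning effort via the system prompt.
# Assumes an OpenAI-compatible local server; the base_url and model id
# are placeholders - adjust them to your own setup.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")

response = client.chat.completions.create(
    model="gpt-oss-20b",  # placeholder model id
    messages=[
        # The reasoning directive rides along in the system prompt:
        {"role": "system", "content": "You are a helpful assistant.\nReasoning: high"},
        {"role": "user", "content": "How many r's are in 'strawberry'?"},
    ],
)
print(response.choices[0].message.content)
```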
Can't wait to get back, bench these, and share with you. Overall, the fine-tuning and open source community is split for now, but it's been only a few days, so we'll keep you up to date on how well these models land. Regardless, this was a historic week for OpenAI!

Speaking of open models, did you have a chance to try our W&B Inference? The team worked hard to bring these new models to you in record time and at incredible pricing (just $0.05 for 20B and $0.15 for 120B!); these models are definitely worth a try! Plus, if you comment "OSS Power" on our announcement post, we'll likely give you a few credits to try it out and let us know what you think!

World models' "holy crap" moment - Google Genie 3

The other very important release this week was... not a release at all, but an announcement from DeepMind: Genie 3. This world model takes a single image or text prompt and creates a fully interactive, controllable 3D environment that runs in real time at 24fps - an environment you as a user can control, walk (or fly) around in, and move the camera through. It's really mindblowing stuff.

We've covered world models like Mirage on previous episodes, but what Google released is a MAJOR step up in coherency, temporal consistency, and overall quality!

The key breakthrough here is consistency and memory. In one demo, a user could "paint" a virtual wall, turn away, and when they turned back, the paint was still there. This is a massive step towards generalist agents that can train, plan, and reason in entirely simulated worlds, with huge implications for robotics and gaming.

We're hoping to have the Genie 3 team on the show next week to dive even deeper into this incredible technology!

Other AI news this week

This week, the "other" news could have filled a full show two years ago. Qwen kept its third straight week of releases going with 2 new tiny models + a new diffusion model called Qwen-Image (Blog, HF).

Anthropic decided to pre-empt the GPT-5 release and upgraded Opus 4, giving us Opus 4.1 with a slight bump in specs.

ElevenLabs released a music API called ElevenMusic, which sounds very, very good (this on top of last week's Riffusion + Producer.ai news, which I'm still raving about).

Also in voice and audio, a SUPER TINY TTS model called KittenTTS was released: with just 15M parameters and a 25MB model file, it's surprisingly decent at generating voice (X).

And to cap it off with breaking news, the Cursor team, who showed up on the OpenAI stream today (marking quite the change in direction from OpenAI's previous friendship with Windsurf), dropped their own CLI version of Cursor, reminiscent of Claude Code!

PHEW, wow, ok, this was a LOT to process. Not only did we tune in for the full GPT-5 release, we did a live stream when gpt-oss dropped as well.

On a personal note, I was very humbled when Sam Altman said it was 32 months since the GPT-4 release, because it means this was 32 months of ThursdAI. As many of you know, we started live streaming on March 13, 2023, when GPT-4 was released.

I'm very proud of the incredible community we've built (50K views total across all streams this week!), the incredible co-hosts I have, who step up when I'm on vacation, and the awesome guests we have on the show, keeping you up to date every week!

So, a little favor to ask: if you find our content valuable or entertaining, the best way to support this pod is to upgrade to a paid sub, and share ThursdAI with a friend or two! 👏 See you next week 🫡

This is a public episode.
If you'd like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe
    --------  
    2:56:19

About ThursdAI - The top AI news from the past week

Every ThursdAI, Alex Volkov hosts a panel of experts, AI engineers, data scientists, and prompt spellcasters on Twitter Spaces, as they discuss everything major and important that happened in the world of AI over the past week. Topics include LLMs, open source, new capabilities, OpenAI, competitors in the AI space, new LLM models, AI art and diffusion, and much more. sub.thursdai.news