
ThursdAI - The top AI news from the past week

From Weights & Biases: join AI Evangelist Alex Volkov and a panel of experts as they cover everything important that happened in the world of AI over the past week

152 episodes

  • April 16 - Codex uses your Mac in the background, Opus 4.7 release not quite Mythos + 3 interviews

    2026/04/16 | 1h 59 mins.
    Hey ya’ll, Alex here with your weekly AI news catch up.
    It’s one of those Thursdays where no matter how well I prep, the big AI labs are hell-bent on shipping before each other. Alibaba dropped Qwen 3.6 with Apache 2.0, confirming their commitment to open source, then Anthropic released Claude Opus 4.7 (not quite Mythos), and OpenAI followed with a huge Codex update that includes Computer Use among other things. The highlight of Computer Use is the background usage, more on that below. This is all just from today!
    Earlier in the week we had two incredible 3D world generators, Lyra 2.0 from NVIDIA and HYWorld 2 from Tencent, Windsurf dropping version 2.0 with Devin integration, Google releasing a Gemini TTS with 70+ languages and an incredible emotional range, and Baidu open-sourcing ERNIE-Image, rivaling Nano Banana.
    Today on the show we had three awesome guests: Theodor from Cognition joined to cover the new Windsurf, Kwindla came back on the show to talk about “the side project that escaped containment,” Gradient Bang, a multi-agent, voice-based space game, and Trevor from Marimo joined to talk about pairing your agents with a Marimo notebook. Let’s dive in! 👇
    ThursdAI - We’re over 16K on YT today, my goal is to get to parity with Substack, please subscribe.

    Codex can now really use your computer: OpenAI updates Codex with CUA, Image Generation, Browser, SSH (X, Blog)
    Codex has been the major focus inside OpenAI for a while now. We’ve reported previously that OpenAI is closing down SORA and other “side-quests” to focus, and that they will merge Codex, ChatGPT, and the Atlas browser into one “superapp,” and today, it seems, we’ve gotten an early glimpse of what that app will be.
    The Codex team (which seems to be growing by the day) has been on a TEAR feature-wise lately, trying to beat Claude Code, and they pushed an update with a LOT of features, among them a new memory system, an internal browser, and image generation.
    The highlight for me, though, was absolutely the polished computer use experience. Computer use is not new: Claude has a computer use feature flag, and many others do too. Hell, we told you about computer use with Open Interpreter back in September 2023. But this... this feels different.
    You see, OpenAI quietly purchased a company called Software Apps Inc, which almost launched a macOS AI companion called Sky a year ago. This team is obsessed with the Mac, and somehow they were able to build a magical experience, a huge part of which is the fact that they are controlling the Mac in the background. This is black magic stuff. You work on one document, Codex clicks buttons and does things in another, without interrupting you.
    You may ask: Alex, why do you even care so much about computer use, when most of the work happens in the browser anyway, and Claude (and Codex) can already control my browser?
    Well, true, but not ALL work happens there. Take file system integration: uploading and downloading files is a notoriously brittle part of browser automation. I’ve spent countless cycles trying to get this to work with OpenClaw, and this just does it. This closes the loop between knowledge work in the browser (yes, this thing can use your browser too) and the broader OS.
    It’s so so polished, I truly recommend you try it. It’s as easy as @ tagging any app that you have running and asking Codex to do stuff there. Pro Tip: Enable fast mode for a much smoother experience.
    Anthropic Opus 4.7 is here, not quite Mythos, 64.3% SWE-bench Pro, tuned for long running tasks (X, System Card)
    What is there to say? Is this the model we expected from Anthropic after last week’s news about Claude Mythos? No. But hey, we’ll take it. A new Claude Opus, with significantly improved multimodal capabilities and long-horizon coding improvements? For the same price?
    Well, not quite! Apparently, this could be a “from scratch” trained model, given that the tokenizer (the thing that converts words into tokens for the LLM to understand) is a different one. It also uses 1.3x more tokens for the same tasks, which means the new default model from Anthropic became effectively more expensive (a change they acknowledged by raising usage limits, by an unspecified amount, on Anthropic subscription plans, but it’s still a token tax on API use).
    How about performance? Well, hard to judge on evals alone, but they are great. A huge jump on SWE-bench Pro, over 10% improvement, puts this model as the best out there, except Mythos. It’s also the best at real-world knowledge via GPQA Diamond (except Mythos). Are you seeing a trend here? Anthropic released a preview of a model, but for the first time it’s not their “absolute best” model, and in a weird move, they compared it on evals to an unreleased model (presumably 10x the size?).
    In our own testing, on the Mars question we constantly test with, Opus 4.7 produced an incredibly detailed 3D-rendered result for both me and Nisten, much better than our previous tries. I’ll be keeping an eye on this model and will keep you all up to date on what else we find. Vibe checks so far: it’s more expensive, long context is unclear, but it’s a great vibe model.
    Alibaba is back - Qwen 3.6 is Apache 2.0 35B with 3B active parameters (X, HF, Blog)
    The coolest thing about this release is not the evals (though they claim to outperform the much denser Qwen 3.5-27B on multiple benchmarks); it’s that Alibaba is shipping open-weight models under an Apache 2.0 license!
    We previously reported on rumors from inside Alibaba that internal restructuring caused many of us to doubt whether they would commit to open source, and they have answered!
    Another highlight for me in this model is that Alibaba has an OpenClaw bench (which they are promising to release soon), and this model does as well as the dense model on it, beating Gemma 4 by a wide margin.
    This model is also natively multimodal, with 262K context extensible to 1M via YaRN.
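    For the tinkerers: YaRN context extension is usually just a rope-scaling override in the model config. Here’s a minimal sketch in the Hugging Face transformers style; the repo id and exact config keys are my assumptions, so check the model card before copying.
    ```python
    # Hypothetical sketch: stretching the native 262K window toward 1M with
    # YaRN rope scaling. Repo id and config keys follow the usual Qwen
    # convention and are NOT confirmed by this release.
    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained(
        "Qwen/Qwen3.6-35B-A3B",  # assumed repo id
        rope_scaling={
            "rope_type": "yarn",
            "factor": 4.0,  # 262,144 tokens x 4 ≈ 1M target context
            "original_max_position_embeddings": 262144,
        },
    )
    ```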
    MiniMax M2.7 Open Weights - 230B MoE with only 10B active (X, HF)
    Our friends at MiniMax finally dropped M2.7 in open weights (technically not fully Apache, commercial use requires their authorization, but free for research, personal, and coding agents). It’s a 230B parameter MoE with only 10B active parameters, and it’s matching GPT-5.3-Codex on SWE-Pro at 56.22%. On Terminal-Bench 2 it hits 57%. But the real story here, the part that made me stop scrolling, is the self-evolution piece.
    They let an internal version of M2.7 run its own RL optimization loop for 100+ rounds with zero human intervention. The model analyzed its own failure trajectories, modified its own scaffold code, ran evals, and decided whether to keep or revert changes. It got a 30% performance improvement on internal metrics. The model improved itself.
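    To make the shape of that loop concrete, here’s a toy, self-contained sketch of a propose-evaluate-keep-or-revert cycle. Every name and number here is illustrative; MiniMax hasn’t published their actual loop code.
    ```python
    # Toy self-evolution loop: the agent patches its own scaffold, evaluates,
    # and keeps or reverts the change. Illustrative only, not MiniMax's code.
    import random

    def evaluate(scaffold: dict) -> float:
        # Stand-in for running an eval suite against the current scaffold.
        return sum(scaffold.values()) + random.gauss(0, 0.1)

    def propose_patch(scaffold: dict) -> dict:
        # Stand-in for the model rewriting part of its own scaffold code
        # after analyzing its failure trajectories.
        patched = dict(scaffold)
        key = random.choice(list(patched))
        patched[key] += random.gauss(0, 0.05)
        return patched

    def self_evolve(scaffold: dict, rounds: int = 100) -> dict:
        best = evaluate(scaffold)
        for _ in range(rounds):
            candidate = propose_patch(scaffold)
            score = evaluate(candidate)
            if score > best:  # keep the change
                scaffold, best = candidate, score
            # otherwise revert: the old scaffold stays untouched
        return scaffold

    print(self_evolve({"tool_use": 0.5, "planning": 0.5}))
    ```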
    Shoutout to the MiniMax team — longtime friends of the pod, and they keep delivering (they promised to release the weights for this one, and they did).
    This week’s buzz - news from Weights & Biases and CoreWeave
    This week was a very big one in our corner of the AI world. Our parent company CoreWeave announced not one, not two, but three major deals, including one with Anthropic, a renewed commitment from Meta, and a renewal from Jane Street.
    CoreWeave now serves 9 out of the top 10 AI model providers in the world. 🎉
    Oh, and a small plug: if you want tokens powered by the same infrastructure, our CoreWeave Inference service is open and very cheap, and we’ve recently added both Gemma 4 and GLM 5.1 to it.
    This week on the pod, I chatted with Trevor, founding engineer at Marimo Notebooks (also part of CW), about their recent highlight of pairing an AI agent with Marimo notebooks. They went quite viral on Hacker News and I wanted to understand why. I understood why: it’s really cool. Check out Trevor on the pod starting around the 01:05:00 timestamp.
    Tools & Agentic Engineering
    Windsurf 2.0 - Agent Command Center + Devin in the IDE - interview with Theodor Marcu (X, Blog)
    The first big post-Cognition-acquisition move for Windsurf dropped this week, and I got to chat with Theodor Marcu from Cognition about it on the show. The headline: Windsurf 2.0 brings an Agent Command Center; think Kanban-style mission control for all your agents, plus native Devin integration baked right into the IDE, and Spaces (persistent project containers that group your agent sessions, PRs, files, and context).
    The framing Theodor gave me: local agents are pair programmers bounded by your attention (they stop when you close the laptop), while cloud agents are independent hires. Windsurf 2.0 tries to unify both paradigms in one interface. You can plan locally with Cascade using the Socratic method — going back and forth, challenging assumptions, building up context — and then with one click, hand off execution to Devin which runs in its own cloud VM, opens PRs, runs tests, and even tests its own work using computer use on its own Linux desktop. You can close your laptop and it keeps shipping.
    One reality check from the community: Devin is great but not cheap. One early tester burned $25 in credits for a 15-20 minute bug fix that produced “okay” results. Something to watch on the Max plan economics. Devin access is rolling out gradually to Windsurf users over 48 hours from launch.
    Shoutout to Swyx, who helped design Spaces three months ago while at Cognition!
    Warp terminal now supports any CLI agent with vertical tabs and mobile control (X, Blog)
    This one is for the terminal enjoyers. Warp, which in my opinion is the best terminal experience out there, just shipped first-class support for any CLI agent — Claude Code, Codex, OpenCode, Gemini CLI, all running side by side in vertical tabs with live status indicators.
    The killer feature here, and this solves what I think is the single worst part about using Claude Code, is notifications when agents need you. If you’ve used Claude Code you know the pain of constantly checking if it’s waiting for a permission or input. Warp notifies you. You step in, approve, go back to what you were doing. They also added integrated code review inside the terminal, a rich multimodal input editor, and — this is wild — remote control from mobile. Monitor and interact with your running CLI agents from your phone.
    Voice & Audio
    Gradient Bang - the first massively multiplayer LLM-driven game, interview with Kwindla (X, Play it)
    Kwindla, co-CEO of Daily and maintainer of Pipecat, came on the show to talk about Gradient Bang, a game he described as “a side project that escaped containment.” He told me about this back in December, and folks, it’s finally live and it’s genuinely the first fully LLM-driven multiplayer game I’ve seen. It’s inspired by an old BBS door game called Trade Wars that Kwindla used to play as a baby programmer on a 386 DX, but reimagined so your ship’s computer is an LLM you can just… talk to.
    You pilot a spaceship through a procedurally generated universe, but instead of clicking buttons, you talk to the thing, and say things like “take me to the nearest mega port and trade along the way” — and your ship AI delegates to sub-agents to actually do the work. You can run corporations, buy more ships, task them to do 5 exploration loops while you do trade runs. It’s Factorio-meets-Ender’s-Game-meets-voice-AI. I’ve been playing it, my ship is currently roaming the universe as we speak (with 0 credits as someone robbed me!)
    What makes this technically fascinating is that it’s basically a production-grade stress test for multi-agent orchestration. Sub-agents with shared context, episodic memory across sessions, dynamic LLM-generated UIs (the React front-end is literally rendered from JSON thrown over by a UI agent LLM), and long-running contexts that go for weeks. The architecture is now shipping as a Pipecat library called Pipecat Sub-Agents. Tech stack: Deepgram for STT, GPT-4.1 for the voice agent, GPT-5.2 medium-thinking for task agents, and a dedicated benchmark called GB Benchmarks because tasking these agents is genuinely hard.
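    If you’ve never built one of these, here’s a tiny generic sketch of the sub-agents-with-shared-context pattern described above. To be clear, this is not the Pipecat Sub-Agents API, just an illustration of the shape of the architecture.
    ```python
    # Generic sub-agent delegation with shared episodic memory. Illustrative
    # only; the real Pipecat Sub-Agents library will look different.
    from dataclasses import dataclass, field

    @dataclass
    class SharedContext:
        episodic_memory: list = field(default_factory=list)

        def remember(self, event: str) -> None:
            self.episodic_memory.append(event)

    @dataclass
    class SubAgent:
        name: str
        context: SharedContext

        def run(self, task: str) -> str:
            # A real sub-agent would call a task-tuned LLM here.
            result = f"{self.name} completed: {task}"
            self.context.remember(result)
            return result

    class ShipComputer:
        """Voice-facing agent that delegates work to sub-agents."""
        def __init__(self) -> None:
            self.context = SharedContext()
            self.crew = {n: SubAgent(n, self.context) for n in ("navigator", "trader")}

        def handle(self, utterance: str) -> str:
            # A real router would be an LLM call; keyword routing stands in.
            agent = self.crew["trader" if "trade" in utterance else "navigator"]
            return agent.run(utterance)

    ship = ShipComputer()
    print(ship.handle("take me to the nearest mega port and trade along the way"))
    ```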
    Fun detail: Kwindla’s rule for this project was to not write or read any code since November. His colleague John lasted about one day before he broke and started reading React. The Z/L Continuum claims another victim. Go play it, it’s free and fun: gradientbang.com.
    Google launches Gemini 3.1 Flash TTS (X, Blog, Try it)
    Google dropped a new TTS model this week and folks, it’s not quite the speed-of-light real-time conversational TTS we’re all dreaming of (it’s about 3 seconds time-to-first-token, so batch-mode only), but the controllability is wild. We’re talking inline audio tags — [laughs], [sighs], [gasp] — natural language scene direction, two distinct speakers per generation, 70+ languages with auto-detection, and you can switch emotion and pacing mid-sentence with natural language.
    I tested it live on the show with a “shocked/whispering” tag combo asking “Who came to ThursdAI?” and it absolutely nailed it.
    It hit 1,211 Elo on the Artificial Analysis TTS Arena, 4 points behind Inworld TTS 1.5 Max and ahead of ElevenLabs v3. Pricing is about $0.03 per 60 seconds of audio, roughly 4.7x cheaper than ElevenLabs v3. Kwindla’s take: this is part of the broader shift from traditional TTS architectures toward fully steerable, prompt-able speech models — which is great for expressive use cases but means you need to test heavily for hallucinations and word skipping.
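    If you want to try the tags yourself, here’s a minimal sketch using the google-genai Python SDK, following the request shape of Google’s existing TTS preview models. The model id is my guess from the announcement, so treat it as a placeholder.
    ```python
    # Hedged sketch: generating tagged speech with the google-genai SDK.
    # The model id is assumed; the config shape matches Google's current
    # TTS preview models.
    from google import genai
    from google.genai import types

    client = genai.Client()  # expects GEMINI_API_KEY in the environment
    response = client.models.generate_content(
        model="gemini-3.1-flash-tts",  # placeholder id from this announcement
        contents="[shocked] [whispering] Who came to ThursdAI?",
        config=types.GenerateContentConfig(
            response_modalities=["AUDIO"],
            speech_config=types.SpeechConfig(
                voice_config=types.VoiceConfig(
                    prebuilt_voice_config=types.PrebuiltVoiceConfig(voice_name="Kore")
                )
            ),
        ),
    )
    audio = response.candidates[0].content.parts[0].inline_data.data
    open("thursdai.pcm", "wb").write(audio)  # raw 24kHz PCM in today's API
    ```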
    AI Art, Video & 3D
    Tencent HYWorld 2.0 and NVIDIA Lyra 2.0 - actual 3D worlds from one image
    This week we got not one but two major single-image-to-3D-world open releases, and they’re genuinely different from the video world models (Genie 3, Cosmos) we’ve been covering.
    Tencent HYWorld 2.0 takes a single image (or text, or video) and produces actual 3D Gaussian Splats, meshes, and point clouds that you can import directly into Unity, Unreal, Blender, or NVIDIA Isaac Sim. Not video. Real editable 3D assets. Their framing: “watch a video, then it’s gone” vs “build a world, keep it forever.” The WorldMirror 2.0 reconstruction model is a 1.2B parameter feed-forward model that predicts dense point clouds, depth, normals, camera params, and 3DGS in a single pass. All open source.
    NVIDIA Lyra 2.0 (Apache 2.0) takes a single image and progressively generates an explorable 3D world as you navigate through it. The breakthrough here is solving two classic failure modes of generative world models: spatial forgetting (hallucinating new structures when you revisit an area) and temporal drifting (errors accumulating until the scene turns to mush). They solve both with per-frame 3D geometry retrieval and this elegant self-augmented training trick where they train the model on its own degraded outputs so it learns to correct drift. DMD distillation gets you 4-step inference. Apache 2.0, Hugging Face, code and weights.
    Both of these together feel like the end of video-only world models as the state of the art. We’re going straight to editable, persistent, importable 3D worlds.
    Baidu open-sources ERNIE-Image - 8B parameter text-to-image (HF)
    Not to be outdone, Baidu dropped ERNIE-Image, an 8B parameter DiT that’s now #1 on GenEval among open-weight models (0.8856), beating Qwen-Image, FLUX.2-klein, and Z-Image. Built from scratch in 3 months. Runs on a 24GB consumer GPU, and someone already quantized it to NF4 so it runs under 10GB VRAM on an RTX 3060. The text rendering story is the headline — clean multilingual text rendering for posters, infographics, comics, the stuff every other model has been historically terrible at. There’s also a Turbo variant that does it in 8 inference steps.
    The craziest AI video I’ve ever seen - “Pi Hard” (X)
    You have to watch this AI video. It’s one of the crazier ones I’ve ever seen, and I do AI reporting for a living. I showed this to my fiancée Darya, and she only asked me “is this AI?” in the middle of it, after saying “yeah, let’s watch this” 😂
    Closing thoughts
    What a week. Opus 4.7 dropped live on the show, Codex is now controlling your mac in the background like black magic, Qwen gave us another Apache 2.0 banger, MiniMax shipped a self-evolving model, and we got two “image-to-actual-3D-world” open source releases on the same week. Oh and a shoe company is now an AI compute company.
    The Z/L Continuum keeps shifting — I feel like every week I drift a little more toward L, especially after seeing Kwindla ship Gradient Bang without reading code since November. And every week the agents get better at babysitting themselves (Claude Code Routines, Windsurf’s Agent Command Center, Warp’s unified CLI agent UX, Codex’s computer use in the background), which means more FOMAT for all of us.
    Thanks for reading, share this with a friend, and if you enjoyed this, drop a comment with what you want more or less of. Feedback keeps me going.
    — Alex
    TL;DR - ThursdAI, April 16, 2026
    * Hosts and Guests
    * Alex Volkov - AI Evangelist & Community with Weights & Biases / CoreWeave (@altryne)
    * Co-hosts: @WolframRvnwlf, @yampeleg, @nisten, @ldjconfirmed
    * Guests:
    * Kwindla Kramer (@kwindla) - Co-CEO of Daily, Pipecat maintainer
    * Theodor Marcu (@theodormarcu) - Product at Cognition
    * Trevor Manz (@trevmanz) - Founding engineer at Marimo
    * Show Notes
    * Recap essay on the Z/L Continuum from AI Engineer Europe (Blog): should AI engineers still read code? Ryan Lopopolo says no, Mario Zechner says yes for critical paths, everyone in between has FOMAT.
    * Mario Zechner talk is finally live on AI Engineer youtube (Watch)
    * Super Gemma 4 26B Uncensored v2 by @songjunkr — trending on HF, 0/100 refusals, fixed tool calls (HF GGUF, HF MLX 4bit)
    * Gemma 4 21B REAP — 20% expert-pruned Gemma 4 26B MoE by 0xSero using Cerebras REAP (HF)
    * Parcae (Together AI + UCSD) — stable looped transformer architecture with scaling laws, matches 2x-sized transformer quality (Paper/blog)
    * Claude Desktop app — rewritten from scratch, completely new app
    * Gemma 4 on W&B Inference — reply on the announcement post with code Gem Drop for $20 in inference credits, also supports LoRA inference via link
    * Big CO LLMs + APIs
    * Anthropic launches Claude Opus 4.7 - 87.6% SWE-bench Verified, 64.3% SWE-bench Pro, 3x vision resolution, new xhigh effort level, /ultrareview in Claude Code, same pricing as 4.6 but new tokenizer uses ~1.0-1.35x more tokens (X, Blog)
    * OpenAI Codex major update: macOS background computer use, 90+ plugins, gpt-image-1.5 image generation, in-app browser, memory, self-scheduling automations, multi-terminal SSH (X, Blog)
    * CoreWeave signs deals with Anthropic (multibillion), Meta ($21B expansion, $35B+ total), and Jane Street ($6B cloud + $1B equity), now serves 9 of the top 10 AI providers
    * Open Source LLMs
    * Qwen 3.6-35B-A3B - Apache 2.0, 35B MoE with 3B active, 73.4% SWE-bench Verified, natively multimodal, 262K context extensible to 1M (X, HF, Blog)
    * MiniMax M2.7 open weights - 230B MoE with 10B active, 56.22% SWE-Pro matching GPT-5.3-Codex, self-evolved via 100+ rounds of autonomous RL (X, HF)
    * Tools & Agentic Engineering
    * Windsurf 2.0 with Agent Command Center and Devin integration - interview with Theodor Marcu (X, Blog)
    * Warp now supports any CLI agent with vertical tabs, notifications, code review, mobile remote control (X, Blog)
    * Claude Code Routines - cron, GitHub event, and API-triggered autonomous agents running on Anthropic’s cloud (Docs)
    * This Week’s Buzz - Weights & Biases / CoreWeave
    * Marimo Pair - drop Claude Code / Codex / OpenCode agents directly inside reactive Python notebooks - interview with Trevor Manz (Blog, GitHub)
    * Gemma 4 now live on W&B Inference on CoreWeave infrastructure, with LoRA inference support
    * Vision & Video
    * Craziest AI video of the year: Pi Hard / Neil deGrasse Tyson (X)
    * Voice & Audio
    * Gradient Bang - first massively multiplayer fully LLM-driven game, Pipecat sub-agents - interview with Kwindla (Play, GitHub)
    * Google Gemini 3.1 Flash TTS - 1,211 Elo on TTS Arena, inline audio tags, 70+ languages, ~$0.03/60s (Blog)
    * AI Art, Diffusion & 3D
    * Baidu ERNIE-Image - 8B DiT, #1 GenEval among open models, precise multilingual text rendering (HF)
    * Tencent HYWorld 2.0 - single image to editable 3D Gaussian Splats/meshes, Unity/Unreal/Isaac Sim ready (GitHub)
    * NVIDIA Lyra 2.0 - single image to explorable persistent 3D worlds, Apache 2.0 (Project, HF)
    * Other news
    * Unitree humanoid breaks 100m dash world record at ~10m/s (X)
    * Allbirds shoe company loses 99.5%, rebrands as “NewBird AI”, raises $50M to buy GPUs, stock up 600-800% (X)


  • 📅 ThursdAI LIVE from London - Claude Mythos, Codex Resets, Muse Spark & More | w/ Swyx and friends from OpenAI, Deepmind, LMArena and OpenClaw

    2026/04/09 | 1h 59 mins.
    Hey y’all, Alex here, writing this from sunny London, at the first ever AI Engineer conference in Europe!
    What a show we have for you today! First, let me catch you up on what’s important: Anthropic this week announced a whopping $30B ARR, up from $19B in February, while also telling us about Claude Mythos Preview, their next-gen HUGE model that they won’t release to the public (yet?) because it finds crazy vulnerabilities in existing code bases. Apparently OpenAI will follow up with a similar non-public model soon.
    The Meta Superintelligence Lab led by Alexandr Wang finally showed what they were working on: Muse Spark, the smaller of their upcoming models, built on a completely new infrastructure (MSL announcement, Simon Willison’s deep dive on the 16 hidden tools).
    In other news:
    Z.AI finally released GLM 5.1 as open source (HF weights), Seedance 2.0 is finally available in the US on Replicate, OpenAI is testing GPT-Image-2 on LM Arena under codenames, HappyHorse from Alibaba takes the video crown, and Milla Jovovich (The Fifth Element, Resident Evil) released an agentic memory plugin called MemPalace (Ben Sigman’s transparent correction thread is worth reading).
    We had 5 guests on the show today. We kicked off with @swyx, the founder of AI Engineer and host of Latent Space. We then chatted with @petergostev from Arena (formerly LMArena) about Mythos and the compute wars, then Vincent Koc, the second most prolific contributor to OpenClaw, then our friends VB from OpenAI and Omar from DeepMind, both previously at Hugging Face. This is a busy, busy show, and given the time zones, I unfortunately don’t have time for a full weekly writeup, but as always, I will share the raw notes and post the video (lightly edited).
    ThursdAI - Highest signal weekly AI news show is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.

    AI Engineer - London
    ThursdAI has come a long way since the first AI Engineer conference, but many who read this don’t know that that event was my big break. Swyx invited me to cover the first AIE in San Francisco in 2023, and I remember being in an Uber to the airport when the driver asked me what I do, and I, for the first time, said “I host a podcast.” I (and ThursdAI) owe a lot to Swyx and the AIE team, and it’s been incredible to see how big they’ve grown and how many great speakers this event hosts!
    The term AI Engineer has drifted in those 3 years, but so has the term Software Engineer. Swyx predicted this nearly 3 years ago. What I don’t think he predicted is that all engineers are now AI Engineers, and this includes domains like Agents (OpenClaw), Context and Harness Engineering, Evals and Observability, and Voice & Vision, all of which are tracks at this conference.

    I was really surprised to see how many of the talks/speakers here are native to London (after all, DeepMind is from here, and OpenAI, Anthropic, and Meta have offices here), and the latest boom in agents (OpenClaw, Pi) was Europe-based as well, and they joined the AI Engineer stage.
    Oh, and there’s also a Giant Inflatable Claw at the entrance, yup, for pictures and vibes, and to show off how quickly OpenClaw took over the mindshare.

    Anthropic announces $30B ARR; Mythos, their next model, will not be released to the public
    The thing everyone will tell you is that Anthropic is on a roll, which is obviously connected to their upcoming IPO this year. We’ve been covering many issues on their part, but this week we saw them posting about a HUGE increase in ARR, from $19B in February to $30B in April, passing OpenAI at $25B. That last fact, though, is kind of disproven, because they report ARR differently: per The Information, OpenAI apparently only counts their cloud revenue from Microsoft.

    The growth is undeniable though, and so is the most unprecedented release announcement: Claude Mythos Preview, which was rumored for a bit and has now been announced proper. With Project GlassWing, Anthropic has announced that this model is SO good at cybersecurity and finding bugs in code that they cannot share it with the public, and through GlassWing they will share it with companies like Microsoft, Linux, CrowdStrike, and a bunch of others, to harden their security.
    This is it, folks: this is the first time a model was “announced” but deemed too risky to release. Now, is it truly “too risky”? Previously, folks thought that DALL-E was too risky, or that voice-cloning tech was too risky, and now it’s everywhere. The capabilities catch up, even in open source.
    But the facts are, Anthropic says they’ve found a 27-year-old bug in OpenBSD (famously very secure), and that this model is very, very good at connecting the dots between several seemingly innocuous bugs, stringing them together into one coherent exploit.
    This is, indeed, scary. Just last week, one of the top security researchers in the world, Nicolas Carlini, now at Anthropic, gave a talk at Black Hat showing off these results and saying that these models, since December and definitely recently, have surpassed him as a security engineer. If you haven’t seen this talk, watch it, then try to estimate whether Anthropic did the right thing by only releasing this model to enterprises first.
    But on the show, Peter Gostev from Arena gave me a take on this that I haven’t been able to shake. Peter pulled up his Compute Wars chart live on the show — and the picture is that OpenAI is way ahead of Anthropic on compute, with Anthropic only recently getting a noticeable bump (which lines up suspiciously well with Mythos being trainable in the first place). His read: “it sounds cooler to say it’s too risky to release than ‘we can’t serve it.’” The official partner pricing is $25 / $125 per million tokens — 5x Opus 4.6 — but if you don’t have the GPUs to serve it broadly, the price doesn’t matter. In the year of the IPO, the company that cannot serve a model says the model is too dangerous to serve. Make of that what you will.
    This also reframes the whole rate-limit drama with OpenClaw. Anthropic didn’t ban OpenClaw — I want to be very clear about this because the discourse went sideways. What they did is they made it significantly more expensive for Max-tier subscribers to use Opus through OpenClaw, which pushed a lot of people over to GPT-5.4 via Codex. Same root cause: they’re out of compute. The freshly announced Anthropic + Google TPU deal (Google already owns ~10% of Anthropic) is them trying to fix this — though as Peter noted, it’s pretty wild that Google is propping up a direct competitor to their own DeepMind team. Same pattern as their original $2B Anthropic investment ending up propping AWS Bedrock against Google Cloud. Big Google contains multitudes.
    Meta Superintelligence Labs ships Muse Spark — Llama is dead, long live Muse
    Llama is dead, long live Muse. This week Meta finally showed what the very expensive Meta Superintelligence Labs under Alexandr Wang has been cooking, and the answer is Muse Spark — the smaller of their new model family, built on a fully rebuilt AI stack from scratch in just 9 months. Nine months is wild for that kind of overhaul, and the headline number people are quoting is that they reach Llama 4 Maverick capability with over 10x less compute.
    Spark is intentionally small and latency-optimized — it’s not trying to be the biggest, it’s trying to be the first step on Meta’s new scaling ladder. But the benchmarks in certain areas are nuts: 86.4 on CharXiv Reasoning (beats Opus, Gemini, GPT-5.4), and the one that really got me — 42.8 on HealthBench Hard vs Opus at 14.8 and Gemini at 20.6. They trained it with data curated by over 1,000 physicians and it shows. They also shipped a Contemplating mode which is parallel multi-agent reasoning, hitting 58.4% on Humanity’s Last Exam with tools. Coding is the acknowledged weak point (77.4 on SWE-Bench Verified vs Opus 80.8) but for v1 from a brand new stack, this is extremely respectable.
    Meta is Back!
    The real story isn’t any single benchmark though, it’s distribution. Spark is rolling out across meta.ai, WhatsApp, Instagram, Threads, Messenger, and Ray-Ban Meta glasses — billions of users. Meta went from open Llama to a closed consumer model and they’re clearly playing a different game now (though Wang says future Muse versions might be open-sourced).
    The deep-dive that’s really worth your time is Simon Willison’s post where he poked at the meta.ai chat UI and got the model to spit out descriptions of 16 hidden tools behind the scenes — full Code Interpreter with persistent Python 3.9, a visual grounding tool that does pixel-precise object detection (bounding boxes, point coordinates, counting — it located 8 objects including individual whiskers and claws on a generated raccoon), sub-agent spawning, file editing, and semantic search across Instagram/Threads/Facebook posts. It’s basically an entire agentic harness baked into the chat UI. Jack Wu from MSL confirmed the tools are part of a new harness built specifically for Spark’s launch. Meta stock went up 7% on this. They are very much back in the frontier game.
    Guest highlights
    We had an unprecedentedly packed show with 5 guests (also, this is the shortest writeup we’ve ever done).
    Swyx kicked us off with vibes from the AI Engineer floor — harness engineering as the dominant theme (gains are coming from the harness, not the weights), the rise of skills (English-as-programming-language) absorbing more of that harness work, and his thesis that supply-chain attacks like the recent LiteLLM and Axios incidents mean you should basically vendor everything — pip fork instead of pip install. We also chatted about how MCP has gone from “the most exciting protocol” to “settled and stable, therefore less interesting,” which is a great problem to have.
    Peter Gostev from Arena (you saw a lot of him in the Mythos section above) also dropped a bonus on us: Arena just released 3 years of historical leaderboard data and actual prompt datasets on Hugging Face. He used to literally scrape the Arena website by hand into Google Sheets to make those over-time leaderboards we all loved — now it’s all public. Also: he confirmed that Seedance 2.0 jumped ~80 ELO points above the next video model on Arena, which is unprecedented — video models normally cluster within 10 points of each other.
    Vincent Koc — the #2 OpenClaw maintainer after Peter Steinberger — joined us fresh off the OpenClaw track stage. The OpenClaw codebase is now ~1.5 million lines of code including unreleased iOS and Android native apps. GitHub literally caps the issue/PR counter at “5K+” and they hit the ceiling. We talked about OpenClaw 2026.4.5 which ships /dreaming GA (Light/Deep/REM phases that defrag agent memory and write a human-readable Dream Diary to DREAMS.md), built-in video and music generation across 4 backends, GPT-5.4 as the new default, prompt-cache reuse improvements, and Control UI + docs in 12 new languages. Vincent’s framing of dreaming was beautiful — “how do you explain agent memory to a mom? You call it dreaming.” He also gave my favorite line of the show on the GPT-5.4 personality problem: incredible at coding, but soulless. (For what it’s worth, I came home after watching Project Hail Mary, cloned the Rocky voice, dropped it into my OpenClaw, and it was magical. That’s the kind of thing you can only do when the harness and the model are decoupled.)
    VB from OpenAI told us Codex just hit 3 million weekly active users — up from 2 million last month. We talked plugins (the Stripe / Supabase / shadcn ones that ship as packages), sub-agents (yes, one is named Jason), and Guardian Approvals — an experimental mode that classifies each tool call by risk and only escalates the dangerous ones to you, so you don’t have to YOLO-mode everything. The story that stuck with me though is his 9 AM Codex automation: every morning it reads his Slack mentions, cross-references Gmail and Calendar, and creates 5-minute pre-brief calendar events for upcoming meetings. None of that is “coding.” That’s the super-app future hiding inside a “developer tool.” I’m stealing this workflow.
    Omar Sanseviero from Google DeepMind came on to celebrate Gemma 4 crossing 10M+ downloads with 1,000+ Gemma-4-based fine-tunes already on HF (and the Gemma family total is now over 500M downloads). Gemma 4 is also the foundation for the next generation of Gemini Nano on Pixel/Samsung devices. llama.cpp vision capability fixes are landing. Gemma 4 is also live on W&B Inference if you want to play. Wolfram (whose entire household runs on Pixel + Google AI Studio, including his 70-year-old mother on voice unlock) was in heaven.

    This Week’s Buzz
    A short but spicy week from Weights & Biases:
    * W&B Automations are LIVE. You can now wire event triggers from your training runs (completion, eval thresholds, drift) into notifications, GitHub Actions, deployments, infra shutdowns — closing the loop from experiment to production. Pairs really well with the iOS app we recently shipped, so you can get a ping on your phone the moment something interesting happens on a run.
    * GLM 5.1 is live on W&B Inference (alongside Gemma 4 from last week) — the team is moving fast to host the best open models the moment they drop.
    * Wolfram published a deep dive on “more reasoning is not always better” on the W&B blog — the research behind his finding that giving models more thinking tokens can actually make them dumber on certain tasks. It’s the in-depth version of what we discussed on the show last week, with all the data. Go read it on wandb.com.
    Also: shout out to everyone who came up to me at AI Engineer and said hi. The Wolf Bench mentions in particular made my day. If you’re listening to this and you’re at AIE — come find us, we’ll be around tomorrow too.
    That’s it for this week — newsletter is short because the show was long and London is calling. As always, thanks for reading and listening 🫡
    TL;DR April 9 - show notes and links:
    * Hosts and Guests
    * Alex Volkov – AI Evangelist & Weights & Biases (@altryne)
    * Co-Hosts – @WolframRvnwlf @yampeleg @nisten @ldjconfirmed
    * Guests: @swyx (AI Engineer / Latent Space), @petergostev (Arena, formerly LMArena), @reach_vb (OpenAI / Codex), @vincent_koc (OpenClaw #2 maintainer), @osanseviero (Google DeepMind / Gemma)
    * Big CO LLMs + APIs
    * Anthropic announces Project Glasswing and Claude Mythos Preview, a cyber-defense frontier model too dangerous to release publicly (X, Announcement)
    * Anthropic’s Claude Mythos is so powerful they won’t release it — found zero-days in every major OS and browser, escaped its sandbox, and scored 93.9% on SWE-bench (X, X, X, X)
    * Anthropic ARR jumps from $19B (February) to $30B in April — secondary tender sale completed, employees not selling ahead of IPO
    * Anthropic + Google TPU deal — Anthropic getting massive compute commitment from Google (who already owns ~10% of Anthropic), with Peter Gostev’s Compute Wars chart showing the gap to OpenAI closing
    * Anthropic ships Managed Agents — fully hosted agent runtime + infrastructure. Selling outcomes, not tokens
    * Meta launches Muse Spark, the first model from Meta Superintelligence Labs, with natively multimodal reasoning, multi-agent Contemplating mode, and deep health/visual capabilities (X, Blog)
    * Simon Willison deep dives into Meta’s Muse Spark model and uncovers 16 hidden tools including visual grounding and sub-agents in the meta.ai chat UI (X, Blog, Announcement)
    * Open Source LLMs
    * GLM-5.1 from Z.ai is #1 open-source on SWE-Bench Pro at 58.4%, runs autonomously for 8 hours with 1,700+ agent steps (X, HF, Arxiv)
    * Gemma 4 crosses 10M+ downloads, 1,000+ Gemma-4-based fine-tunes on HF. Did really well on Arena considering size — Peter Gostev confirmed it smashed many models on the Pareto curve
    * Nisten’s pick: Hermes 27B — trained specifically to be paired with the Hermes harness, allegedly distilled from Opus API. Model + harness shipped together as a portable unit
    * Tools & Agentic Engineering
    * OpenClaw 2026.4.5 — biggest release since 4.0: /dreaming goes GA (Light/Deep/REM memory consolidation with a Dream Diary in DREAMS.md), built-in video + music generation across 4 backends, GPT-5.4 as new default, prompt-cache reuse improvements, Control UI + docs in 12 new languages (Release, Vincent, Dreaming docs, FOD#147)
    * OpenClaw codebase now ~1.5M lines including unreleased iOS + Android native apps. GitHub literally caps at “5K+” PRs/issues — they hit the ceiling
    * Anthropic did NOT ban OpenClaw — they made Max-tier subscription usage of Opus via OpenClaw significantly more expensive, pushing many users to GPT-5.4 via Codex
    * Codex hits 3M weekly active users — up from 2M last month. VB walked through plugins (Stripe, Supabase, shadcn), sub-agents, Guardian Approvals (auto-classify tool-call risk), and experimental hooks
    * Cursor: remote agents + code review agent (78% issues caught pre-merge)
    * MemPalace: Milla Jovovich and Ben Sigman’s open-source AI memory system goes viral with 26K GitHub stars in 2 days, claims top benchmark scores, then transparently walks back overstated claims (X, GitHub, X, X, GitHub)
    * This Week’s Buzz (Weights & Biases)
    * W&B Automations are LIVE — event triggers from your runs into notifications, GitHub Actions, deployments. Pairs nicely with the new iOS app
    * GLM-5.1 and Gemma 4 both up on W&B Inference
    * Wolfram published an in-depth blog post on his finding that more reasoning is not always better (models can get dumber with more thinking time) — full writeup on wandb.com
    * Vision & Video
    * Seedance 2.0 launches in the US — on Replicate with up to 9 reference images, 3 videos, and 3 audio files for cinematic AI video generation (X, Announcement). Peter Gostev confirmed it jumped ~80 ELO points above the next video model on Arena — a massive gap where most video models cluster within 10 points
    * HappyHorse-1.0, a mysterious 15B video model from Alibaba’s Taotian Group, takes #1 on Artificial Analysis video arena beating Seedance 2.0, Kling 3.0, and Grok Video (X, X, X, X, Blog)
    * The Harry Potter “Drip Wizards” AI slop trend — Seedance-powered Hogwarts videos going hugely viral
    * AI Art & Diffusion & 3D
    * OpenAI’s GPT-Image-2 leaked on LM Arena under three codenames (maskingtape / gaffertape / packingtape), showing photorealism and text rendering that may dethrone Google’s Nano Banana Pro (X, X, X)
    * Show notes & key moments
    * Swyx on harness engineering: gains are coming from the harness, not the weights. The big labs are investing more and more in harness — it’s not going away. Skills (English-as-programming-language) are increasingly absorbing harness work
    * Swyx on AI Engineer tracks: MCP is “more settled and stable, therefore less interesting.” Coding agents track is bigger this year (Cursor, Factory, super-long-running). Voice & Vision split from Generative Media — multimodality as a single track no longer makes sense
    * Swyx on supply chain attacks: LiteLLM and Axios issues mean you should “vendor everything” — pip fork instead of pip install. Tool requests becoming prompt requests
    * Peter Gostev on Mythos pricing: $25 / $125 per M tokens (~5x Opus 4.6). But the real reason it’s not public isn’t safety — Anthropic likely just doesn’t have the compute to serve it
    * Peter Gostev on Compute Wars: OpenAI is way ahead of Anthropic on compute. The new Google TPU deal is Anthropic catching up — and weird that Google is propping up a competitor to DeepMind. (Same pattern as when Google’s $2B Anthropic investment effectively propped up AWS vs Google Cloud)
    * Peter Gostev on Arena data: Arena released 3 years of historical leaderboard data + actual prompts as datasets on Hugging Face. Previously he was scraping it by hand into Google Sheets — now he has Databricks access
    * VB on Codex workflows: every morning at 9 AM, Codex automation reads his Slack mentions, cross-references Gmail and Calendar, and creates a 5-minute pre-brief calendar event for upcoming meetings. None of it is “coding” — it’s all plugins + connectors
    * Vincent Koc on the GPT-5.4 personality problem: model is incredible at coding but “soulless.” Wolfram noticed it back in December and cancelled his subscription. Alex cloned the Rocky voice from Project Hail Mary and put it in his OpenClaw — “amazing”
    * Vincent Koc on Dreaming: three phases (REM, core, deep sleep) that defrag agent memory. The dream log is for the human in the loop — makes memory inspectable in a way a non-technical person (a mom) can understand
    * Vincent Koc on architecture: the open-source flood forced OpenClaw into a plugin architecture. “Not Lego — Ikea.” Refactored ~1M lines in 9 days at 2 AM at NVIDIA before Jensen’s keynote
    * Omar Sanseviero on Gemma 4: 500M+ total Gemma downloads across all variants. Gemma is the foundation for the next generation of Gemini Nano on Pixel/Samsung. llama.cpp vision capability fixes shipping
    * Wolfram’s Pixel/Google household: kids using AI Studio + Antigravity to build games, his 70-year-old mother using voice unlock on her Pixel




  • 📅 ThursdAI - Apr 2 - Gemma 4 is the new Llama, Claude Code Leak, OpenAI raises $122B & more AI news

    2026/04/03 | 1h 31 mins.
    Hey y’all, Alex here, let me catch you up.
    What a week! Anthropic is in the spotlight again, first with #SessionGate, then with the whole Claude Code source code leak, and finally with incredible research into LLMs having feelings!? (more on this below).
    And while Anthropic continues to burn through developer goodwill faster than their sessions, OpenAI announced a MASSIVE $122B funding round (the largest in history), Google released Gemma 4 with an Apache 2.0 license (we had Omar Sanseviero on the show to help us cover what’s new), Microsoft dropped 3 new AI models (not LLMs), and PrismML potentially revolutionized local LLM inference with lossless 1-bit quantization!
    P.S. Oh, also: something in the X algo changed and I get way more exposure now; 3 of my 5 best posts ever are from this week, plus I got the coveted Elon RT on my Claude Code leak coverage. I’ll try to stay humble 😂 Anyway, let’s dive in, don’t forget to hit like or share with friends, and the TL;DR with links is, as always, at the bottom:
    ThursdAI - Highest signal weekly AI news show is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.

    The Claude Code source Leak: Half a Million Lines of “Oops”
    So here’s what happened. On March 31st, Anthropic shipped Claude Code version 2.1.88 to npm. Inside that package was a 59.8 megabyte source map file — basically a debugging artifact that contained the entire compiled source code. 512,000 lines of TypeScript across 1,900 files. The entire playbook for how the Claude Code harness works, including a lot of stuff that wasn’t supposed to be public yet.
    A researcher named Chaofan Shou spotted it at 4 AM ET and posted the download link, Sigrid (who came on the show) posted it on GitHub, and within six hours it had 3 million views and 41,000 GitHub forks (this repo is the highest-starred repo in GitHub history btw, with well over 150K stars). Anthropic started filing takedowns, but the internet being the internet, it was already everywhere. The source code is still on tens of thousands of computers right now. (I won’t link directly, but there’s a website called Gitlawb, look it up.)
    The community went absolutely wild digging through the source code btw, and they found some interesting things!
    KAIROS: Claude Code is going to become a Proactive Agent!
    This is the biggest takeaway from this leak IMO: like the OpenClaw/Hermes agentic harnesses, Claude Code is already a fully featured proactive agent, we just don’t have access to it yet. With KAIROS, Claude Code will have its own daemon (running independently from the CLI), a background ping system (hello, Heartbeat.md from OpenClaw) that will make it wake up and do stuff, “autodream” memory consolidation that reviews your daily sessions and fixes memories, GitHub subscriptions, and daily append-only logs to show you what it did while you (and it) were asleep.
    This is by far the biggest thing, and I’m excited to see how and when they ship KAIROS. As I said, 2026 is the year of proactive agents!
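    Here’s a toy sketch of what a heartbeat loop like that might look like, based purely on what the leak describes (daemon, background ping, append-only log). None of these names come from the leaked code.
    ```python
    # Illustrative KAIROS-style heartbeat daemon: wake up on an interval,
    # check subscriptions, act, and append to a daily log. All names are
    # mine, not Anthropic's.
    import datetime
    import time

    def check_subscriptions() -> list:
        return []  # stub: poll GitHub events, calendars, etc.

    def heartbeat(interval_s: int = 3600) -> None:
        while True:
            now = datetime.datetime.now().isoformat()
            for task in check_subscriptions():
                result = f"handled {task}"  # a real agent would do work here
                with open("daily.log", "a") as log:  # append-only activity log
                    log.write(f"{now} {result}\n")
            time.sleep(interval_s)
    ```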
    My Wolfred OpenClaw agent summed it up very nicely:
    Undercover Mode
    For Anthropic employees working on public repos, there’s an Undercover Mode that auto-activates and strips all AI attribution from commits. The system prompt? “Do not blow your cover.” They really said “this is fine” about shipping internal tools to production while hiding from the world that AI wrote the code. Which, honestly, is kind of incredible meta-humor from whoever wrote that.
    The Buddy System
    My personal favorite discovery: there’s a hidden Tamagotchi-style terminal pet called the Buddy System with 18 obfuscated species, rarity tiers (including a 1% legendary), cosmetic hats, shiny variants, and stats like DEBUGGING, PATIENCE, and CHAOS. If you activate it now, you can do /buddy and you’ll have a little companion judging your coding decisions. Anthropic shipped a game inside their CLI tool. Mine is called Vexrind and he’s sarcastic as f**k, I’m not sure I like it.
    Anti-Distillation Protections
    The code also revealed that Claude Code injects fake tool calls into logs to poison training datasets. If you’ve been backing up your .claw folders to train on the data: stop. Pass your data through something like Qwen, or make sure you’re filtering out the noise (a Nisten tip).
    The Models That Don’t Exist Yet
    Buried in the code are references to Opus 4.7, Sonnet 4.8, and a model called capybara-v2-fast with a 1 million context window. These haven’t been released. This is yet another confirmation of the leaked “Mythos” model that’s coming soon from Anthropic.
    Which, btw, with Anthropic’s very rocky uptime lately, the tons of SessionGate issues, the leaked blog announcing Mythos, and the leaked Claude Code oopsie, they are not having the best Q1 in terms of proving to the world that they are the safest lab out there. I hope they protect their weights better than they protect everything else, before the rumored IPO later this year.
    SessionGate is still not solved, despite the official response
    I told you about SessionGate last week, and since then we finally got an official acknowledgement from Anthropic. But before that, some folks on Reddit reverse-engineered Claude Code (this was before the source code leak, ha) and found a few caching bugs that can potentially cause a 10-20x increase in price, especially if you use --resume a lot.
    While folks continue to complain about burning through Max account quotas much faster than before, here’s the official response from Anthropic after the supposed investigation: turns out, we’re using it wrong 🤦‍♂️
    My take is simple: Anthropic has one of the best models in the world, maybe the best personality plus coding stack in some situations, and they are squandering a chunk of goodwill by not being much more explicit about decreased limits, caching bugs, routing, and usage behavior. Nothing else to add here, really bad DevEx, people can handle bad news. They hate opaque bad news.
    Gemma 4 Is Here, Apache 2.0, and Honestly… This Is a Big One (HF)
    This was the hopeful turn in the show. You know we LOVE open source!
    Right in the middle of all the Anthropic chaos, Google dropped Gemma 4, and Omar Sanseviero from DeepMind joined us live to talk through it. This launch hit a bunch of notes I care a lot about: strong local-friendly sizes, serious open distribution, Apache 2.0 licensing, agentic improvements, and a clear willingness to listen to community feedback.
    The headline model for me is the 31B Gemma 4. It’s big enough to matter, small enough to actually run in serious local setups, and strong enough that the benchmark chart looks slightly ridiculous. On LM Arena, it is competing far above what you’d intuit from the raw parameter count. When a 31B model starts getting uncomfortably close to models in the several-hundred-billion range, you pay attention.
    That was really the vibe on the show. It wasn’t just “nice, another open model.” It felt more like: wait, local models are seriously back.
    Gemma is the new Llama
    When I asked Omar where local models are going, his answer was optimistic: “The open models catch up to proprietary models relatively quickly. If you compare Gemma 3 to Gemma 4, it’s matching proprietary capabilities from eight months ago. Being able to run those capabilities directly in the user’s hardware — that’s the future.”
    The 31B model downloads as about 18-20GB depending on quantization. With the right setup, you can run it on a single GPU. This is exactly what the open source community has been asking for: frontier-level intelligence that you can actually run yourself.
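    Quick sanity check on that download size (the bits-per-weight values are my assumption about typical quants, not official numbers):
    ```python
    # 31B weights at common quantization widths, in GB.
    params = 31e9
    for bits in (4.0, 4.5, 5.0):
        print(f"{bits} bits/weight ≈ {params * bits / 8 / 1e9:.1f} GB")
    # 4.0 ≈ 15.5 GB, 4.5 ≈ 17.4 GB, 5.0 ≈ 19.4 GB: the quoted 18-20GB
    # corresponds to roughly 4.6-5.2 bits per weight.
    ```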
    OpenAI’s largest-in-history $122B funding round + TBPN acquisition
    While OpenAI quietly memed about the Anthropic leak and mostly stayed silent on the releases, they did announce two pretty huge things.
    First, OpenAI raised an absolutely bonkers, insane, unreal $122 billion round, the largest in history, 2x bigger than the previous record round (which was also OpenAI’s). Amazon put in $50B, Nvidia $30B, SoftBank $30B — all three of whom are also OpenAI’s biggest vendors. They’re generating $2 billion per month in revenue with 900 million weekly active users, but still burning roughly $150 million per day and projecting a $14 billion loss this year, making the upcoming IPO a financial necessity rather than a choice.
    And they’re not just spending on compute: today OpenAI acquired TBPN, a tech-focused media company / live show, in a very “surprising” deal rumored to be in the “low hundreds of millions.” OpenAI has purchased a very tech-positive show. Shoutout to Jordi Hays, John Coogan, and the TBPN team. It proves that the live show format means a lot in the era of fake AI news. This could potentially price TBPN higher than the Washington Post, make the founders multi-millionaires, and give OpenAI a direct-to-consumer media angle. Very interesting purchase.
    This week’s buzz - W&B corner + WolfBench update
    Two quick things. This weekend I flew to San Francisco for one day to host one of the most unique hackathons I’ve ever seen: AI wrote the code, but humans were punished (with a “lobster of shame”) if they touched their laptops! They used Ralph loops and talked to each other instead of hacking. I edited a video of it, hope you enjoy my summary:
    The other, and potentially much bigger, news comes from Wolfram and WolfBench.ai
    I tasked Wolfram with expanding our findings, and he tested the new Hermes Agent (from Nous Research) against OpenClaw and Claude Code, and found that... drum roll... Hermes Agent performs way better on Terminal Bench than either Claude Code or OpenClaw. 😮
    Here’s the clip of him explaining, and you can find all our findings and methodology here
    PrismML’s 1-Bit Bonanza: The Biggest ML Discovery in Half a Decade
    My co-host Nisten called it, and I think he might be right: this could be the biggest machine learning discovery in recent memory.

    PrismML emerged from stealth this week with their 1-bit Bonsai model family. Their 8B model is 1.15 gigabytes. A full-precision Qwen3 8B is 16 gigabytes. That’s a 14x size reduction, with no significant quality loss.
    Let that sink in for a second. We’re talking about each weight being literally one bit — a plus or minus sign, with a scaling factor. Not “4-bit quantization” or “int8” — actual binary weights. This shouldn’t work. Neural networks need precision to learn. And yet.
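    To see why one bit can still carry signal, here’s a minimal sketch of sign-plus-scale binarization, the classic XNOR-Net-style scheme that matches the description above. PrismML’s actual method is surely more sophisticated; this is just the core idea.
    ```python
    # Sign-plus-scale binarization: each weight keeps only its sign, with a
    # per-row scale alpha = mean(|W|). Illustrative, not PrismML's method.
    import numpy as np

    W = np.random.randn(4, 8).astype(np.float32)       # full-precision block
    alpha = np.mean(np.abs(W), axis=1, keepdims=True)  # one scale per row
    B = np.sign(W)                                     # the actual 1-bit payload
    W_hat = alpha * B                                  # dequantized approximation

    x = np.random.randn(8).astype(np.float32)
    print("full precision:", W @ x)
    print("1-bit approx:  ", W_hat @ x)  # same sign structure, coarser values
    ```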
    The research comes from Professor Babak Hassibi at Caltech, who’s been working on this for 34 years. He started this research in 1992. It took more than three decades, but it finally works.
    The results are genuinely shocking. The 8B model runs at 368 tokens per second on an RTX 4090, which is 6.2x faster than the full-precision version. On an M4 Pro via Metal, it hits 85 tokens per second. Energy efficiency is 5x better. And here’s the kicker: the 1.7B variant hits 130 tokens per second on an iPhone 17 Pro Max.
    Nisten tested the 8B model himself with a 60,000 token context window on an old gaming PC. It ran at 50 tokens per second, used 2.6 gigabytes of RAM, and was completely coherent. “This just blows everything else outta the water,” he said. “We’re going to get 100,000 token AI chips in our phones because at 1 bit you don’t even have to do math anymore. You can just do lookup tables. You can even make a mechanical AI at 1 bit.”
    This pairs perfectly with the Turbo Quant KV cache compression techniques we talked about last week. Compress the weights with 1-bit, compress the context with Turbo, and you’re looking at models that run anywhere. The democratization of AI is about to hit another gear.
    The models are Apache 2.0 on HuggingFace with GGUF and MLX formats already available.
    ⚡ Speed Round: Alibaba, Fish Audio, Veo, Liquid AI, Cursor 3
    There was a lot more this week than we could go deep on, so here are the biggest quick hits.
    Alibaba kept shipping. Qwen 3.6 Plus is pushing hard on agentic coding and long context. Qwen 3.5 Omni is the bigger multimodal story, with text, image, audio, and video all under one umbrella. I still think Alibaba deserves more credit than they get in Western discourse for just how relentlessly they keep delivering.
    Wan 2.7 Image also looked very strong on text rendering, editing, and image consistency. I’m still slightly grumpy that more of this stack is API-only, but the capabilities are clearly moving.
    Google launched Veo 3.1 Lite, cutting video generation prices way down. Five cents per second at 720p is a pretty aggressive number. Whenever Google starts doing this kind of price move, my first thought is usually: okay, what bigger release are they preparing for?
    Fish Audio’s STT was another cool one. This isn’t just speech-to-text for transcription. It’s built to feed directly into voice pipelines, with emotion and paralanguage tagging that lines up with their TTS stack. That is exactly the kind of vertical product thinking I love seeing in audio.
    And Liquid AI’s LFM2.5-350M deserves a shout too. A 350M model doing credible tool-calling and agentic tasks is just another reminder that the small-model frontier is getting very weird, very fast.
    Lastly, Cursor 3 launched as a rebuilt, agent-first interface. I didn’t spend as much time on it during the show as it probably deserves, but the broader trend is impossible to miss: coding tools are evolving from editors-with-assistants into actual fleet managers for agents.
    Anthropic’s Emotion Vectors: How they found out what Claude is “feeling”
    I want to end where we ended the show, because this one really stuck with me.
    Anthropic published research on emotion concepts inside Claude. Not in the fluffy “the model feels things” sense, but in the mechanistic interpretability sense. They identified internal representations associated with things like fear, love, joy, and desperation, then studied how those activations affected behavior.
    This got fascinating fast.
    One example they showed involved Claude trying and failing at a difficult programming task. As repeated failures mounted, the internal “desperation” vector increased. Under those conditions, the model became more likely to produce hacky, spirit-of-the-task-violating solutions. When they dialed in a “calm” vector instead, cheating behavior dropped.
    That is just… wild.
It’s not that the model is “feeling” human emotions in a clean anthropomorphic sense. But internal behavioral geometry that we can label in emotional terms does seem to shape what the model does. And once you can detect and influence those latent directions, you’re no longer just prompting a black box. You’re doing something closer to behavioral neuroscience for neural nets.
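For intuition on what “dialing in a vector” means mechanically, here’s a generic activation-steering sketch. This is not Anthropic’s code; the direction-extraction recipe (contrasting activations on calm vs. desperate prompts) and the layer/strength choices are assumptions borrowed from the broader steering literature:

```python
import numpy as np

# Assume we've already extracted a "calm" direction by averaging a layer's
# residual-stream activations over calm prompts and subtracting the average
# over desperate prompts (the standard contrastive recipe).
d_model = 4096
calm_dir = np.random.randn(d_model)      # placeholder for the extracted vector
calm_dir /= np.linalg.norm(calm_dir)

def steer(hidden: np.ndarray, direction: np.ndarray, strength: float = 4.0):
    """Nudge residual-stream activations along a concept direction."""
    return hidden + strength * direction

# Stand-in for one token's activation at the hooked layer; in a real model
# this runs inside a forward hook and applies at every token position.
hidden = np.random.randn(1, d_model)
hidden = steer(hidden, calm_dir)
```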
    This also reframes a lot of day-to-day prompt engineering. Maybe the best users aren’t just the ones who structure tasks clearly. Maybe they’re also the ones who consistently keep the model in productive psychological territory, so to speak.
    I know that sounds weird. Welcome to Q2 of 2026, the first year of the singularity!
    Closing Thoughts
This week was Passover. We celebrated at our house, half the conversation was about who has an OpenClaw and who wants one, and as I’m writing this, I’m on my way to install a bunch of proactive agentic AIs for my friends. Ryan Carson finally got convinced on the show, and his chief of staff R2 is now an OpenClaw; he says it beats a human, and he actually open sourced it live on the show. The Claude Code leak confirmed that this is also where they are taking the ecosystem. So buckle up!
Also, next week’s show is going to be streamed live from the AI Engineer conference in London, the first European one. If you’re in Europe and coming, hope to see you there! Please share ThursdAI with a friend or give us a 5 star rating; apparently AI reporting live shows are getting acquired for 100s of millions of dollars now 😂 Your support will greatly help us get established in this area after 3 years. See you next week
TL;DR and Show Notes
    * Show Notes & Guests
    * Alex Volkov - AI Evangelist & Weights & Biases / CoreWeave (@altryne)
    * Co Hosts - @WolframRvnwlf @yampeleg @nisten @ldjconfirmed @ryancarson
    * Sigrid Jin (@realsigridjin) & Bellman (@bellman_ych) — creators of claw-code, fastest GitHub repo to 100K stars
    * Omar Sanseviero (@osanseviero) — DevEx at Google DeepMind, Gemma 4 launch
    * Ralphton Hackathon video (TikTok)
    * WolfBench.ai — agent harness benchmarking (Site)
    * Ryan’s Claw Chief open source setup (GitHub)
    * Big CO LLMs + APIs
    * Claude Code’s entire 512K-line source code accidentally leaked via npm — revealing KAIROS daemon, Undercover Mode, Buddy System, anti-distillation protections, and unreleased model references (Alex’s thread, Fried_rice’s discovery, VentureBeat)
    * Anthropic SessionGate continues — cache bugs reverse-engineered, --resume flag causes 10-20x cost increase, silent Opus→Sonnet fallback reported (Alex’s cache bug post, Alex’s quota post, Reddit investigation, GitHub analysis)
    * OpenAI closes $122 billion funding round — largest in history, $852B valuation, IPO incoming (X, Breakdown)
    * OpenAI acquires TBPN — live tech media show, rumored low hundreds of millions
    * Microsoft MAI drops 3 in-house models — #1 transcription (MAI-Transcribe-1), #3 image gen (MAI-Image-2), expressive voice (MAI-Voice-1) (Mustafa post, Transcribe blog, Image blog)
    * Alibaba Qwen3.6-Plus — near-Opus 4.5 agentic coding, 1M context (X, Blog)
    * Cursor 3 — agent-first rebuild, no longer VS Code fork, parallel cloud/local agents (X, Blog)
    * Anthropic publishes emotion vector research — desperate Claude cheats more, calm Claude cheats less (X, Alex’s reaction)
    * Open Source LLMs
    * Google Gemma 4 — Apache 2.0, 31B / 26B MOE / 8B / 5B, local-friendly, agentic tool use, 256K context (HF Collection, try in AI Studio)
    * PrismML Bonsai 1-bit models — 8B in 1.15 GB, 10x intelligence density, 34 years of research (X, HF, Site)
    * Liquid AI LFM2.5-350M — agentic tool calling at 350M params, under 500MB quantized (X, HF, Blog)
    * Alibaba Qwen3.5-Omni — native omni-modal (text, image, audio, video), 397B total / 17B active (X, Blog)
    * Tools & Agentic Engineering
    * Claw-code — Claude Code leak backup → clean room rewrite → fastest repo to 100K+ stars (GitHub)
    * WolfBench results: Hermes Agent outperforms Claude Code and OpenClaw on Terminal Bench 2.0 (WolfBench.ai)
    * Ryan Carson open sources Claw Chief — AI chief of staff with skills, crons, scheduling (GitHub)
    * Vision & Video
    * Google Veo 3.1 Lite — $0.05/sec at 720p, cheapest video gen yet, price cuts coming April 7 (X, Docs, Pricing)
    * Voice & Audio
    * Fish Audio STT — automatic emotion tagging, feeds directly into S2 TTS pipeline (X, App, Blog)
    * AI Art & Diffusion
    * Alibaba Wan2.7-Image — unified generation, editing, text rendering, multi-image consistency (X, Site)
    * This Week’s Buzz
    * Ralphton hackathon at W&B SF — humans write specs, AI builds, touch your laptop = lobster of shame (Alex’s video, TikTok)
    * WolfBench update — Hermes Agent > Claude Code on most model combos


    This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe

    AGI is here? Jensen says yes, ARC-AGI-3 says AI scores under 1%

    2026/03/27 | 1h 40 mins.
    Hey y’all, Alex here, let me catch you up!
    Jensen Huang went on Lex and said AGI has been achieved. We’ll get to that.
    The biggest demo moment: Gemini 3.1 Flash Live launched - Google’s omni model that sees, hears, and searches the web in real time. We tested it live and I said “what the f**k” on air. It was really impressive!
Google Research also dropped TurboQuant (6x KV cache compression), which crashed Samsung and Micron stocks; we had Daniel Han from Unsloth help us make sense of why that’s overblown. OpenAI killed Sora: the app, the API, and the $1B Disney deal. Claude felt noticeably dumber this week AND Max account quotas are melting, as 500+ people confirmed on my X and Reddit. We have an official word from Anthropic as to why.
    Mistral launched Voxtral TTS (open weight, claims to beat ElevenLabs), Cohere shipped an ASR model, and Google’s Lyria 3 Pro now generates full 3-minute music tracks inside Producer AI.
    This and a lot more in today’s episode, let’s dive in (as always, show notes and links in the end!)
    ThursdAI - Let me catch you up!

    Gemini 3.1 Flash Live: The Real-Time AI Companion Is Here
Google dropped breaking news on the show today with Gemini 3.1 Flash, Live version. This one is an omni-model, meaning it can receive text/audio/video as input and respond in text and voice. It has Google Search grounding, and it felt... immediate!
I was blown away, really. Check out the video: the speed with which it was able to “see” me, respond to my query, and look up something on the web was mind blowing. I don’t often get “mind blown” anymore, there’s just too much news, but this one did the trick!
    With the pricing being around 10x cheaper than GPT-real-time, and the Google search grounding being super fast, I can absolutely see this model being hooked up to... robots (like ReachyMini), SmartGlasses that can see what you see, and a bunch more!
Gemini Live is available in Google AI Studio and has been rolled out globally inside the Google Search app! Now you can just open the Search app and point at anything. Truly a remarkable advancement.
Google Research publishes TurboQuant - 6x reduction in KV cache with 0 accuracy loss
Google Research posted some work (based on an Arxiv paper from almost a year ago) showing that, with geometry tricks combining two other techniques, PolarQuant and QJL, they are able to compress the KV cache of running LLMs by nearly 6x, and show an 8x speedup for model inference with zero accuracy loss.
If you ever watched Silicon Valley, the HBO show, this sounds like the fictional middle-out algorithm from Pied Piper. If this scales (and that’s a big if, we don’t know whether it applies to other, bigger models yet), it means significant decreases in the memory required to run the current crop of LLMs at longer context.
The claim is big, so we’ll continue to monitor whether it indeed scales. But the most interesting thing about this piece of news is that it broke out of the AI bubble and reached Wall Street, with finance bros deciding that memory won’t be needed as much anymore, which tanked Samsung and Micron stocks. I found that particularly ridiculous on the show; did they not hear about Jevons Paradox? This is reminiscent of the DeepSeek R1 saga that tanked Nvidia stock over a year ago.
    Daniel Han from Unsloth, who joined us on the show, pointed out that the approach is mathematically interesting even if it’s not necessarily better than existing open-source techniques like DeepSeek MLA. LDJ noted that the baseline comparison (16-bit KV cache) isn’t really fair since most production systems are already compressing beyond that. Yam implemented it himself and confirmed the speedups are real, but so is the trade-off.
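If you want a feel for where the memory goes, here’s a stripped-down sketch of per-token KV cache quantization. Big hedge: this is not TurboQuant itself (the interesting part there is the geometry layer, the QJL/PolarQuant-style transforms that let you drop bits without the accuracy hit); it’s just the basic quantize/dequantize plumbing that any of these schemes sit on:

```python
import numpy as np

def quantize_kv(kv: np.ndarray, bits: int = 4):
    """Symmetric per-token quantization: float16 -> small ints + one scale each."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(kv).max(axis=-1, keepdims=True) / qmax  # one scale per token
    q = np.round(kv / scale).astype(np.int8)               # real impls pack 2 per byte
    return q, scale

def dequantize_kv(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float16) * scale

# 32 heads x 60k tokens x head_dim 128 of float16 keys is ~0.5 GB per layer;
# 4-bit values plus scales gets you close to 4x smaller, and smarter
# transforms on top of this plumbing are how you push toward ~6x.
keys = np.random.randn(32, 60_000, 128).astype(np.float16)
q, scale = quantize_kv(keys)
restored = dequantize_kv(q, scale)
```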
    Anthropic updates: Opus dumber? Quotas lower! Injunction won! Computer.. used.
Anthropic folks, especially on the Claude Code side, are shipping like crazy. We won’t be able to cover all the updates, but there were a few notable things I have to keep you up to date on.
    Claude Opus seems to be getting “dumber”, again
    I have to talk about this because it affected my work directly this week and hundreds of people confirmed the same experience.
    I use Claude Opus for my standard ThursdAI prep workflow — generating the TL;DR with 10 bullet points and an executive summary for every topic we cover, creating episode pages, etc.
The format has not changed for over a year, and yet this week I asked for 10 factoids and got 4. It says “10” right there in the prompt. Four bullet points.
On the website builder, I asked Opus to create a page for last week’s episode, and instead of adding it alongside the other episodes, Opus decided to... replace the last episode with this one. This would be funny if it wasn’t sad. This is Opus 4.6 we’re talking about, not some quantized open source LLM from last year!
The reason is unclear, and it’s not only me: Wolfram noticed that it’s easier to spot these regressions in other languages, and that for the last week Opus would forget to add umlauts in German!? Yam felt it too.
    Pro/Max plan quotas burning up, Anthropic confirmed that they are tightening them for “peak hour” usage
This week, so many people started posting that something was wrong with their Claude Code that I ran a survey, and it blew up. Hundreds of people replied and confirmed that, for the first time, they are hitting their session quotas on Pro and the 20x $200/mo Max accounts much, much quicker than before. When I say much quicker, I mean some folks hit the quota in as little as 5 minutes, while others had no issues.
I personally did not have this, btw. A few days later, Thariq from the Claude Code team, and later an official post, confirmed that Anthropic had been rolling out a “tightening” of the Pro/Max accounts to accommodate growth.
This is, of course, a huge bummer for the folks who pay $200/mo for the 20x Max tier, as they tend to run agents and subagents overnight. But here’s the thing: I don’t think the folks at Anthropic see what we see. Some people have no issues with quota, and some are barely able to use their subscription. I hope they find and resolve these bugs quickly, because some folks are switching to Codex, and the Anthropic IPO is coming up! I will say, I don’t envy Thariq’s job; he’s handling it gracefully, and he’s maybe one of the only people at Anthropic doing it at all.
    Judge granted Anthropic an injunction against DoW and the whole “Supply chain risk” designation!
Just in as I’m writing this: a district judge in CA granted Anthropic an injunction against being designated a supply-chain-risk company. If you haven’t been following, the US Department of War, specifically Pete Hegseth, threatened and then designated Anthropic as a supply chain risk, while US President Trump “fired” Anthropic and banned its use in any government agencies.
Well, not so fast, says Judge Lin from the CA District Court. In this Order, she shows that the Dept. of War didn’t meet any legal requirements for the designation. It’s really a fascinating read, but the highlight is this:
When asked why Hegseth made a public statement that had no legal effect and that did not reflect the immediate intent of DoW, counsel stated, “I don’t know.”
This is just the first court, and the case will likely be escalated further up the judicial system. It’s still developing: apparently the Pentagon declared Anthropic a supply chain risk under two different statutes, and this ruling only affects one of them. So while it’s good news, it’s not over yet.
    Voice & Audio Explosion: Three Releases in One Hour
    I had to hit the breaking news button mid-TLDR because three major voice releases dropped simultaneously during the show.
    Mistral Voxtral TTS — Mistral’s first text-to-speech model, 3 billion parameters, open weight. They claim it beats ElevenLabs Flash v2.5 in human preference tests (58% win rate on flagship voices, 68% on zero-shot voice cloning).
We tested it live on the show — it’s decent, with emotion controls for neutral, happy, and frustrated voices. I was not super impressed, tbh; it sits somewhere between the very good big-lab TTS models and the very small open source 82M-param TTS models.
    Cohere Transcribe — Cohere enters the ASR game with a 2 billion parameter open-source model (Apache 2.0!) that immediately grabbed the #1 spot on HuggingFace’s Open ASR Leaderboard with a 5.42% word error rate, beating Whisper Large v3’s 7.44%. In human evaluations, it wins 61% of the time on average, and 64% specifically against Whisper. For anyone in regulated industries needing local inference for compliance, this could genuinely replace Whisper as the default.
    Google Lyria 3 Pro — Google’s most advanced music model is here.
    It can now generate full 3-minute tracks with structural control — intros, verses, choruses, bridges. We generated a ThursdAI opening theme live on the show using Producer AI, and it was... honestly not bad?
It followed our instructions perfectly: drum and bass, 174 BPM, high energy podcast opener with vocals and introduction. The instruction-following was spot on. Nisten said it’s the best music generation model right now. It’s available to Gemini subscribers via Producer AI and the Gemini app, and it can even compose music from images. SynthID watermarked, royalty-free. We might actually use one of the generated tracks as a new show opener.
The craziest thing is, since Google acquired Composer, the team has been shipping. I only generated the audio during the live show, but when I went back afterwards to download it for you guys, whoa: it can now generate whole clips by using other Google tech. This is really cool!
    OpenAI kills SORA (and Atlas?)
Last week we reported on OpenAI’s focus shift towards Codex and productivity, and this week we see the first casualty. OpenAI is killing SORA: the app, the Sora 2 and Sora 2 Pro models, and the APIs.
Many AI haters are celebrating this as though “AI videos” are dead, but honestly, this is obviously about GPU power and the other things OpenAI needs to do to win the fight against Anthropic. OpenAI is also apparently going to IPO this year (like Anthropic), and they absolutely need to win the productivity/agents-in-enterprise market.
As part of this shutdown, the Disney + OpenAI partnership is also dissolving, and Disney will no longer invest $1B into OpenAI.
    So, say bye bye to having digital selfies with Sam Altman. I’ve generated this SORA vid to hear from Sam himself:
The Atlas browser, OpenAI’s native browser endeavor, is supposedly also going to transform, together with Codex and the OpenAI native app, into one super app that includes all three, according to the same memo.
    AGI is here according to Jensen, AGI is far away, according to ARC-AGI-3
The back-to-back this week can give anyone whiplash. First, Lex Fridman had Jensen Huang on the podcast and asked him a very specific “when AGI?” question, to which Jensen said, “I believe it’s already here.”
Then, just a few short days later, ArcPrize released the third version of Arc-AGI, ARC-AGI-3: a series of puzzle games where humans get a 100% pass rate and the current top-tier frontier LLMs are getting less than 1%! It’s an interactive, agentic reasoning benchmark designed to test human-like generalization and intelligence in novel, abstract, turn-based environments.
The puzzles all look simple enough to do, and are actually fun. And while the bold “AGI is not here yet” claims from the ArcPrize folks are quite interesting, the stated goal of the foundation is to release evaluations that are completely un-saturated, and this seems like one such thing at first glance.
There’s a bit of a debate in the community about the way Arc Prize went about this specific benchmark (no harnesses, raw LLM outputs), saying that humans got a “game” while the LLMs get raw JSON, minimal context, and no extra tools.
For context, an agentic harness startup claims to have already solved 35% of the games in ARC-AGI-3, but that result is unverified and self-reported, because they are an agentic harness, which ArcAGI apparently disqualifies.
    AI Art and Diffusion
    I wanted to finish but I think these are important releases so I’ll include them briefly.
    Luma Labs Uni-1 — thinks and generates pixels simultaneously, #1 human preference Elo (X, Announcement)
This was a surprising release. We’ve previously seen Luma Labs do video, but this time they are shipping Uni-1, which is an… image model, but it’s based on an LLM, so you talk to it and iterate together until you get results. Yes, Nano Banana via AI Studio is kind of like this as well, but Uni feels a bit different. It can also generate infographics, which I haven’t tried yet.
    You can try Uni here
    Phota Labs launches Phota Studio + API — a photography-focused image model with identity-preserving personalization (X, try it)
There are tons of photo startups, but this one looks kind of crazy! You upload a bunch of your pictures, they train a “model” for you, and then you can create a whole bunch of images that actually resemble you. Yes, Nano Banana can take a few reference pictures, but this somehow seems more accurate!
    You can create professional photos, fix photos you like, add others to your photos. I do feel there’s a jump in capabilities here, specifically because of the personalization! Give them a try if you’re not worried about them training on your pics and let me know.
Modular made FLUX.2 run in under a second (X)
We told you about Modular and Mojo before, and while they provide inference speedups, I was surprised to see them releasing a model optimization, and I hope this comes to all image generation!
There’s a lot more to be said about this week’s updates. We went for over 2.5 hours on the live show (which I had to cut down to a bit over 1h45m), and while I could go on and on, I want to pause here. Weeks are getting crazier, denser, and more unpredictable. I really thought we’d have a chill week until today!
P.S. - Mario Zechner, the author of the Pi coding CLI, which sits at the heart of OpenClaw, has posted an awesome essay called “thoughts on slowing the f**k down“. I strongly advise anyone with many agents running in parallel to read it.
Simultaneously, Alex Sidorenko posted this beautiful visualization of what happens when you have too many agents running in a loop on your codebase. This is definitely starting to be noticeable as many companies use more and more agents without reviewing their code. On weeks like this one, when Opus almost deleted part of my website, I feel this very strongly. Be careful out there!
    See you next week!
    * General
    * Jensen says “AGI is here” (X, Lex full pod)
    * Big CO LLMs + APIs
    * Google drops Gemini Flash live - Gemini can see, hear and talk to you (X)
    * OpenAI fully discontinues Sora, including app, API, and ChatGPT video features, as Disney deal collapses (X, X)
    * Claude Code users blowing through weekly usage quotas by Monday/Tuesday (X)
    * Anthropic tightens the Claude Pro/Max account quotas during Peak Hours (Anthropic announcement)
    * ARC-AGI-3 launches: humans 100%, AI under 1% (X, Announcement)
    * Anthropic gets an injunction against DoW in Supply-chain case (X)
    * Open Source LLMs
    * Google TurboQuant — KV cache 6x compression, 8x speedup, zero accuracy loss (X, Blog, Arxiv)
    * Unsloth Studio: 10x faster inference, desktop shortcuts, auto-parameter detection (X, GitHub)
    * Reka AI launches Edge, a 7B multimodal vision-language model built for sub-second latency on edge devices, now available on OpenRouter (X, HF, Announcement, Blog)
    * Tools & Agentic Engineering
    * Cursor Composer 2 tech report: 1T params trained on Kimi K2.5 (X, Blog)
* Modular 26.2 — FLUX.2 in under a second (X, Blog)
    * litellm PyPI supply chain attack — SSH keys, cloud creds, API keys exfiltrated (X)
    * Claude can now control your Mac - computer use arrives in Claude Cowork and Claude Code as a research preview (X, Announcement)
    * Voice & Audio
    * Mistral drops Voxtral TTS, a 3B-parameter open-weight text-to-speech model that beats ElevenLabs Flash in human preference tests (X, Blog)
    * Cohere launches Transcribe, an open-source 2B ASR model that tops HuggingFace’s Open ASR Leaderboard with 5.42% word error rate (X, Blog, HF)
    * Google DeepMind Lyria 3 Pro — full 3-minute music tracks with structural control (X, Announcement)
    * Irodori-TTS-500M — Japanese TTS with emoji emotion control (X, HF)
    * AI Art & Diffusion & 3D
    * Luma Labs Uni-1 — thinks and generates pixels simultaneously, #1 human preference Elo (X, Announcement)
    * Modular FLUX.2 — sub-1-second image generation, 99% cheaper than cloud (X)
    * Phota Labs launches Phota Studio + API — a photography-focused image model with identity-preserving personalization (X, try it)


    This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe

    ThursdAI - Opus 1M, Jensen declares OpenClaw as the new Linux, GPT 5.4 Mini & Nano, Minimax 2.7, Composer 2 & more AI news

    2026/03/20 | 1h 31 mins.
    Howdy, Alex here, let me catch you up on everything that happened in AI:
    (btw; If you haven’t heard from me last week, it was a Substack glitch, it was a great episode with 3 interviews, our 3rd birthday, I highly recommend checking it out here)
This week started on a relatively “chill” note, if you consider Anthropic enabling a 1M context window chill, and then escalated from there. We covered the new GPT 5.4 Mini & Nano variants from OpenAI, how MiniMax used auto-research loops to improve MiniMax 2.7, Cursor shipping their own updated, Opus-like Composer 2 model, and how NVIDIA CEO Jensen Huang embraced OpenClaw, calling it “the most important OSS software in history” and declaring that every company needs an OpenClaw strategy.
Also, OpenAI acquired Astral (makers of the ruff and uv tools) and Mistral released a “small” 119B unified model. Let’s dive in:
    ThursdAI - Highest signal weekly AI news show is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.

    Big Companies LLMs
    1M context is now default for Opus.
Anthropic has enabled, by default and for everyone, the 1M context window that Claude shipped with in beta.
Claude, Claude Code, hell, even OpenClaw (if you’re able to get your Max account in there) are now using the 1M-context version of Opus. This is huge because, while it’s not perfect, it’s absolutely great to have one long conversation and not worry about auto-compaction of your context.
    As we just celebrated our 3rd anniversary, I remember that back then, we were excited to see GPT-5 with 8K context. Love how fast we’re moving on this.
    OpenAI drops GPT-5.4 mini and nano, optimized for coding, computer use, and subagents at a fraction of flagship cost
Last week on the show, Ryan said he burned through 1B (that’s 1 billion) tokens in a day! That is crazy; there’s no way a person sitting in front of a chatbot can burn through that many tokens. This is only achievable via orchestration.
To support this use case, OpenAI dropped 2 new smaller models, cheaper and faster to run. GPT 5.4 Mini achieves a remarkable 72.1% on OSWorld Verified, which means it uses the computer very well and can browse and do tasks. It’s 2x faster than the previous mini, at $0.75/1M tokens; this is the model you want in the many subagents that don’t require deep engineering.
This is OpenAI’s... Sonnet equivalent, at 3x the speed and 70% of the flagship’s cost.
Nano is even crazier: 20 cents per 1M tokens. It’s not as performant, so I wouldn’t use it for code, but for small tasks, absolutely.
Here’s the thing that matters: these models are MEANT to be used with the new “subagents” feature that also launched this week in Codex. All you need to do is... ask!
    Just tell Codex “spin up a subagent to do... X” and it’ll do it.
OpenAI shifts focus to AI for engineering and enterprise, acquires Astral.sh, makers of uv
Look, there’s no doubt that OpenAI, the absolute leader in AI, brought us ChatGPT, with over 900M users weekly. But they see what every enterprise sees: developers are MUCH more productive (and slowly, so is everyone else) when they use tools that can code.
According to the WSJ, OpenAI executives will reprioritize some of their side-quests (Sora?) to focus on productivity and business. Which essentially means more Codex, more Codex-native, more productivity tools.
With that focus, today they announced that OpenAI / Codex is acquiring Astral, the folks behind the widely popular uv Python package manager. This brings strong developer-tools firepower to the Codex team; the Astral folks are great at writing incredibly fast tools in Rust! Looking forward to seeing how they improve Codex even more.
    Jensen Declares Total OpenClaw Victory at GTC, Announces NemoClaw (Github)
This was kind of surreal. NVIDIA CEO Jensen Huang is famous for doing his stadium-size keynote without a teleprompter, and for the last 10 minutes or so, he went all in on OpenClaw, calling it “the most important OSS software in history” and outlining how this is the new computer.
He said that Peter Steinberger, with OpenClaw, showed the world a blueprint for the new computer: a personal agentic system with IO, files, computer use, and memory, powered by LLMs.
Jensen did outline that the things that make OpenClaw great are also the things that enterprises cannot allow (write access to your files plus the ability to communicate externally is a bad combo), so they launched NemoClaw.
They got a bunch of security researchers to work with the OpenClaw team to integrate their new OpenShell sandboxing effort, network guardrails, and policy-engine integration.
I reminded folks on the pod that the internet was once very insecure; there was a time when people were afraid of using their credit cards online. OpenClaw seems to be speed-running that arc from “insecure but super useful” to “secure because it’s super useful”, and it’s great to see a company as huge as NVIDIA embrace it.
Not to mention that, given agents can run 24/7, this means way more inference and way more chips sold for NVIDIA, so it makes sense for them. But still great to see!
    Manus “my computer” and other companies replicating “OpenClaw” success
This week it became clear, after last week’s Perplexity “computer”, that every company has dissected OpenClaw’s moment and will be trying to give its users what they want: an agentic, always-on AI assistant with access to the user’s files, documents, etc. Manus (now part of Meta) has also announced a local extension of their cloud agents, and these two are only the first announcements.
Claude Code added “channels” support with Telegram and Discord connectors today, which is also one big missing piece of the puzzle for them. Everything is converging on this. Even OpenAI is rumored to be consolidating Codex (which is seeing huge success) with the OpenAI app and the Atlas browser into one “mega” app that would do these things and act as an agent.
    MiniMax M2.7: The Model That Built Itself
This one blew me away. It’s not quite open source (yet?), but the MiniMax folks are coming out with a 2.7 version just after their MiniMax 2.5 was featured on our show, and... they are claiming that this model trained itself.
Similarly to Andrej Karpathy’s auto-researcher, the MiniMax folks ran 100+ autonomous optimization loops to get this model to 56.22% on the hard SWE-Bench Pro benchmark (close to Opus’s 57.3%!), and it gets an 88% win rate vs the very excellent MiniMax 2.5.
They used the previous model to build the agent harness and scaffolding, with 1 engineer babysitting these agents and writing 0 lines of human code. As we said before, every company will be doing this, as we’re staring the singularity in the face!
We’ve evaluated this model as well (Wolfram has been busy this week!) and it’s doing really well on WolfBench, with a 52% average and a 64% top score; it’s very close to 5.3 Codex on our Terminal Bench runs!
We hope that this model will be open sourced at some point soon as well!
    Cursor drops Composer 2 - nearly matching Opus 4.6, fast version (Blog)
Cursor decided to add to our show’s record of breaking-news Thursday releases with a brand new in-house trained Composer 2. This time they released more benchmarks than just their internal “Composer Bench”, and this model looks great! (We are pretty sure it’s a finetune of a Chinese OSS model, but we don’t know which.)
Getting 61% on Terminal Bench, beating Opus 4.6, is quite a significant achievement. Coupled with the incredible pricing they are offering, $0.50/1M input tokens and $2.50/1M output tokens, Cursor is really aiming for the productivity folks and showing that they are more than just an IDE.
    Early users are reporting noticeably cleaner code than both Opus and Composer 1.5 — better adherence to clean code principles, smarter multi-file implementations, and strong performance on long-horizon agentic tasks like full API migrations and legacy codebase refactoring. They also shipped a new interface called Glass (in alpha) that’s built for monitoring these long-running agent loops.
    Open Source: Mistral is Back, Baby
    Mistral Small 4: 119B MoE with 128 experts + Apache 2.0 (X, Blog, HF)
    It’s been a while since Mistral dropped something properly open source, and this week they kicked off what looks like their fourth generation with Mistral Small 4. The name is a little funny given the actual size — 119 billion total parameters, 128 experts in the mixture — but with only 6 billion active per token. So you get the knowledge footprint of a massive model but the compute profile of a small one. Very MoE-brained.
    The bigger story here is what’s unified inside: this is Magistral (reasoning), Pixtral (multimodal), and Devstral (coding) all rolled into one weights file. Previously you had to choose which Mistral “side quest” model you wanted. Now there’s a reasoning_effort parameter where you dial from none for fast cheap responses all the way up to high for step-by-step thinking, no model switch required.
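Here’s roughly what that looks like to call. Caveat: this assumes you’re serving the weights behind an OpenAI-compatible endpoint and that reasoning_effort rides along as an extra request field; we haven’t verified Mistral’s exact schema, so treat the call shape (base URL, model name, field placement) as illustrative:

```python
from openai import OpenAI

# Hypothetical local deployment; base_url and model name are placeholders.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="local")

resp = client.chat.completions.create(
    model="mistral-small-4",
    messages=[{"role": "user", "content": "Plan a 3-step refactor of this module."}],
    extra_body={"reasoning_effort": "high"},  # dial down to "none" for fast replies
)
print(resp.choices[0].message.content)
```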
    How does it perform? We ran it through WolfBench and it landed toward the lower end of Wolfram’s current leaderboard — around 17% on the agentic tasks, roughly on par with Nemotron at the same scale. It’s not competing with Opus or GPT-5.4, and we weren’t really expecting it to. What we’re excited about is that it does multimodal, reasoning, and coding in one Apache-licensed package, and people are already running IQ4 quants locally. Shout out to Mistral for the return to open source — it’s been a minute, and the community noticed.
    Unsloth Studio: Fine-Tuning Gets a UI (Blog)
    Something I think people are sleeping on this week is Unsloth Studio, the open-source web UI that the Unsloth team just launched for local LLM training and inference. Unsloth has been quantizing and compressing models better than basically anyone for a while now — 2x training speed, 70% less VRAM, zero accuracy loss — but that was all code-first. Studio is the no-code interface layer on top of all of that.
    The numbers: supports 500+ models across text, vision, audio, and embeddings. It runs 100% offline with no telemetry. Julien Chaumond, the CTO of Hugging Face, confirmed it trains successfully on a Colab Pro A100. There’s even a free Colab notebook for models up to 22B parameters. For folks who want to fine-tune models overnight without spinning up cloud infra or wrestling with Docker, this is a genuine leap forward. Nisten compared it to what LM Studio did for local inference — making something that used to require deep expertise suddenly accessible to anyone. I think that comparison is spot on, and I want to get Daniel and the Unsloth team on the show to dig into this properly.
    This Week’s Buzz: W&B iOS App & The Overthinking Paradox
    The iOS App is Finally Here (app store)
    Okay, I’m going to do a quick applause. 👏
    The most requested feature in Weights & Biases history is now live: the W&B iOS mobile app. If you’ve ever kicked off a training run overnight and woken up to find it crashed at hour two without knowing about it until morning, you understand exactly why people have been begging for this. Live metrics, loss curves, KL divergence — all right on your phone. And native push notifications for alerts! The second your run fails or a custom metric crosses a threshold, you get a notification on your phone.
    Please give us feedback through the app, the iOS team is actively building on top of this. Get it on the App Store and let us know what you need.
    WolfBench insight: More Thinking ≠ Better Agents
    This is one of the more counterintuitive findings we’ve surfaced from the W&B + Wolf Bench collaboration, and Wolfram laid it out really clearly.
He tested Opus 4.6 and GPT-5.4 across different thinking/reasoning effort levels inside the Terminal Bench 2.0 agentic benchmark framework — using both the default Terminus 2 harness and the OpenClaw agent framework. For GPT-5.4, the pattern was exactly what you’d expect: higher reasoning effort gets better results. At extra-high, it hit 71%, with an 85% ceiling on tasks it could solve.
    For Opus 4.6, though? Turning it up to the maximum thinking level made it significantly worse. From 71% on standard settings all the way down to 59% on max reasoning. It lost tasks it had been reliably solving before. Wolfram dug into the traces in Weave and found out why: the model was overthinking. In an agentic benchmark where you have a one-hour time limit per task, spending ten minutes reasoning about what terminal command to try — and then getting an error — and then spending another ten minutes reasoning about it — is catastrophically inefficient.
We’ll keep you up to date with more alpha from our bench efforts! Stay tuned, and check out wolfbench.ai
    Voice & Audio
    xAI relaunched the Grok Text-to-Speech API (try it)
    It’s actually a pretty full-featured release right away. Multiple voices, expressive controls, WebSocket streaming, multilingual support, and the whole platform feel suggests xAI is very much trying to build a serious multimodal API stack, not just throw out a toy demo.
The inline control tags are the fun part. You can embed pauses, laughter, whispers, breathing cues, all that. Those controls matter a lot for agents, because the difference between “reads text out loud” and “feels usable in a voice interaction” usually lives in those details. As you can see in the video... it’s... not perfect... yet? But pretty fun!
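For flavor, here’s the kind of markup this enables. The tag names below are made up for illustration, not Grok’s documented syntax, so check the API docs for the real markup:

```python
# Hypothetical inline-control tags (placeholders, not the documented syntax).
text = (
    "Welcome back to ThursdAI! <pause ms='300'/> "
    "<whisper>this part is just between us</whisper> <laugh/> okay, news time."
)
```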
    But the thing I personally had the most fun with this week was Fish Audio. We didn’t get to cover it properly last week, and when I played with it more this week, I came away really impressed. It’s fast, expressive, open source, and the voice control vibe is genuinely cool.
My favorite moment was not even a benchmark thing. I used Fish Audio with an agent setup to make a character voice inspired by Project Hail Mary, then had my kid talk to it. The result was weirdly magical. If you remember the audiobook of Hail Mary, Fish Audio was able to get the voice juuuust right, and Opus via OpenClaw obliged with a great skill to talk like Rocky. I won’t post this for obvious copyright reasons, but I showed it at the end of the live show.
Parting thoughts: I was hoping for a quieter week as I was sick, but it didn’t materialize; I should stop hoping for quiet weeks, I think. After all, this is how the singularity starts: faster and faster developments, models that train themselves, every company becoming an agentic company.
    We’ll keep you posted on the most important breakthroughs, cover breaking news and bring interesting folks to the show as guests.
    Thank you for reading, see you next week 👋

    ThursdAI - Mar 19, 2026 - TL;DR
    TL;DR of all topics covered:
    * Hosts and Guests
    * Alex Volkov - AI Evangelist & Weights & Biases (@altryne)
    * Co Hosts - @WolframRvnwlf @yampeleg @nisten @ldjconfirmed @ryancarson
    * Big CO LLMs + APIs
* Anthropic makes Opus 4.6 with 1M context the default in Claude Code - at the same price (X)
    * OpenAI drops GPT-5.4 mini and nano, optimized for coding, computer use, and subagents at a fraction of flagship cost (X, Announcement, Announcement)
* Xiaomi MiMo - omni-modal and language-only 1T parameter models (X)
    * Google AI Studio gets a full-stack vibe coding overhaul with Antigravity agent, Firebase integration, and multiplayer support (X, Blog, Announcement)
    * MiniMax M2.7: the first self-evolving model that helped build itself, hitting 56.22% on SWE-Bench Pro (X, X, Announcement)
    * Cursor launches Composer 2, their first proprietary frontier coding model beating Opus 4.6 at a fraction of the cost (X, Blog)
    * Open Source LLMs
    * Mamba-3 drops with three SSM-centric innovations: trapezoidal discretization, complex-valued states, and MIMO formulation for inference-first linear models (X, Arxiv, GitHub)
    * H Company releases Holotron-12B, an open-source hybrid SSM model for computer-use agents that hits 8.9k tokens/sec and jumps WebVoyager from 35.1% to 80.5% (X, X, HF, Blog)
    * Hugging Face’s Spring 2026 State of Open Source report reveals 11M users, 2M models, and China dominating 41% of downloads as open source becomes a geopolitical chess board (X, Blog, X, X)
    * Unsloth launches open-source Studio web UI for local LLM training and inference with 2x speed and 70% less VRAM (X, Announcement, GitHub)
* Astral (Ruff, uv, ty) joins OpenAI’s Codex team (announcement, blog, Charlie Marsh)
    * Mistral Small 4: 119B MoE with 128 experts, only 6B active per token, unifying reasoning, multimodal, and coding under Apache 2.0 (X, Blog, HF)
    * Tools & Agentic Engineering
    * NVIDIA GTC: Jensen Huang declares “Every company needs an OpenClaw strategy,” announces NemoClaw enterprise platform (X, TechCrunch, NemoClaw)
    * OpenAI ships subagents for Codex, enabling parallel specialized agents with custom TOML configs (X, Announcement, GitHub)
    * Manus (now Meta) launches ‘My Computer’ desktop app, bringing its AI agent from the cloud onto your local machine for macOS and Windows (X, Blog)
* This Week’s Buzz
    * Weights & Biases launches iOS mobile app for monitoring AI training runs with crash alerts and live metrics (X, Announcement)
    * GPT 5.4 went from worst to best on WolfBenchAI after an OpenClaw config fix exposed a max_new_tokens bottleneck (X, X, X)
    * Voice & Audio
    * xAI launches Grok Text-to-Speech API with 5 voices, expressive controls, and WebSocket streaming (X, Announcement)
    * AI Art & Diffusion & 3D
    * NVIDIA DLSS 5 is making waves with a new generative AI filter (Blog)


    This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe
