ThursdAI - The top AI news from the past week

From Weights & Biases, Join AI Evangelist Alex Volkov and a panel of experts to cover everything important that happened in the world of AI from the past week

162 episodes

  • ThursdAI - The top AI news from the past week

    Open source AI just had its 2nd DeepSeek moment (and this time the market didn't crash)

    2026/06/26 | 1h 30 mins.
    Hey, it’s Alex.
    Next month is my 40th b-day, and honestly, my wish for that month is to have a week like this week. A very chill, almost nothing announced week.
    This week started strong, with Sakana announcing FUGU (AI router) that can beat Fable (which we didn’t get back yet), and then... quiet. The most important thing in AI this week from a release standpoint is that GLM 5.2 from Z.AI is having its DeepSeek moment! Tons of new love for this model since last week! (+ we have the fastest GLM 5.2 deployment in the world with CW inference!)
    The rest we can quickly count on one hand, Anthropic added Claude to Slack (which made folks hate Andrej Karpathy), OpenAI announced their own inference chip, GPT 5.6 will be delayed and the US Gov will decide who gets it (yes really) and Sean Grove joined us to talk about Linzumi and his vision for running 10,000 agent hours per person per day.
    Oh, and next week is a special AI Engineer live stream from World’s Fair! Don’t miss it.
    Let’s get into it!

    GLM 5.2 is having its DeepSeek moment (HF, CW Inference)
    We covered GLM 5.2 last week, but this week was when the verdict came in! We’ve never seen a better MIT-licensed AI model! GLM 5.2 is posting top scores on agentic benchmarks (Arena.ai), design benchmarks, legal tasks and full-on software engineering tasks.
    The jump from the previous GLM generation is also massive and notable, and the lab is already working on the next version of GLM (per the CEO’s reply to Elon on X).
    Peter from Arena pulled up the Agent Arena numbers and they align with the vibe. GLM 5.2 sits above 5.1 but below Opus and Fable, which feels about right. Where it gets wild is Web Dev Arena: second place, right after Fable. Peter’s take was that GLM has really good defaults. If you just say “give me a webpage” it gives you something nice. GPT models, by contrast, start off looking bad and need more steering.
    Last week, I asked my agents with GLM 5.2 to create a custom ThursdAI.news page for itself and it did a marvelous job! Look at that beautiful font, the castle it made... this is all just delightful.
    We also played Hassan’s blind test on the show. It’s a website that @nutlope built that lets you try and guess which webpage was built by which model. Nisten nailed it immediately by spotting Opus’s circular buttons. Wolfram guessed right too. I got one wrong. The point isn’t that GLM beats Opus, it’s that you genuinely can’t always tell which one costs 22 cents and which one costs 3 cents.
    Wolfram did flag that GLM is not good in German. First response already had mistakes. So if you’re building for a non-English market, keep that in mind. It’s a workhorse model, not a conversationalist. His approach: use GPT 5.5 for planning and discussion, GLM for the actual work, then GPT reviews.
    This Week’s Buzz is all about GLM 5.2!
    First, we may not have been the first, but I’m glad to announce that we’re the fastest provider to host GLM 5.2 on OpenRouter (at least at the time of writing this)!
    We’re also not too shabby on the Artificial Analysis checks, clocking in at #4 among the providers they tested for speed, TTFT and cost.
    Also, Wolfram ran his WolfBench tests on GLM 5.2 and it’s the best open model he’s ever tested! In this new 3D view, WolfBench also shows the number of tokens it took for this test to run, and you can see that GLM 5.2 is fairly conservative with its thinking budgets!
    Unsloth’s 1-bit GLM 5.2 runs on a Mac Studio (X, HF)
    Shout out to Daniel Han and the Unsloth team, who took this 744B beast and quantized it down to a roughly 200GB GGUF that fits on a Mac Studio with 256GB of RAM. One bit still makes me laugh out loud. How does that even work. Nisten clarified it’s a mixed quant, a true 1-bit would be under 100GB, but still.
    The wild part is the scores hold up. The 1-bit is within a point of GPT 5.5 on Frontier SWE, hits 62% on SWE-bench Pro, and 81% on Terminal-Bench. For a 1-bit quant that’s incredible!
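The size figures above can be sanity-checked with quick arithmetic. This is a rough sketch, assuming 1 GB = 1e9 bytes and that the file is dominated by weights (both are my assumptions, and the helper names are made up for illustration):

```python
# Back-of-the-envelope check of the quantization sizes quoted above.
# Assumptions: 1 GB = 1e9 bytes, and file size is dominated by the weights.

PARAMS = 744e9  # parameter count quoted for GLM 5.2 in this section


def size_gb(params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB at a given average bit width."""
    return params * bits_per_weight / 8 / 1e9


def avg_bits(params: float, file_gb: float) -> float:
    """Average bits per weight implied by a file of the given size."""
    return file_gb * 1e9 * 8 / params


print(f"true 1-bit: {size_gb(PARAMS, 1.0):.0f} GB")
print(f"200 GB file: {avg_bits(PARAMS, 200):.2f} bits per weight")
```

Under those assumptions a true 1-bit file would land around 93GB, and a 200GB file works out to a bit over 2 bits per weight on average, which lines up with Nisten's "mixed quant" note.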
    AI’s second-order effects: Apple is raising prices
    This one is AI news even though it doesn’t look like it. Apple just raised prices across the board, base versions up around 20%, citing memory shortages. Same reason your RAM and SSDs cost two to three times what they did a year ago.
    We are so capacity constrained that memory is having its moment. Data center contracts are getting booked 18 months out, and here’s the twist Nisten flagged: even open models you can run at home increase demand, because now a business says “great, we’ll buy a rack of B200s and run it ourselves.” Sam Altman once said people saying “thank you” to ChatGPT costs them millions in generated “you’re welcome” replies. Multiply that by a billion users. Even Intel is flying right now because anyone who can make a chip is winning.
    Is it worth it? I think yes. I love living in the era where Fable drops and we all get a taste of the future. But also I must admit this sucks and I hope that we’ll unlock performance gains with the extra power all this AI is bringing to the world. But ask me again once the new iPhone hits and it’s $300 more costly than the last one 😅
    Baidu open-sources Unlimited-OCR (X, HF, Arxiv, GitHub)
    It was a big OCR week. Baidu shipped a 3B model (only 500M active, it’s MoE) that parses 40+ pages in a single forward pass and hits 93.2% on OmniDocBench. The trick is constant KV cache during decoding, so no memory blowup and no progressive slowdown as the document gets longer. The intuition is lovely: it mimics how a human copies a book, glancing at the source and the last few characters you wrote, not re-reading everything. MIT licensed, weights on HF.
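To make the "constant KV cache" intuition concrete, here is a toy sketch of a cache that only keeps the most recent entries. The fixed sliding window is my assumption for illustration, not necessarily how Unlimited-OCR does it, and `BoundedKVCache` is a hypothetical name:

```python
from collections import deque


class BoundedKVCache:
    """Toy cache that keeps only the most recent `window` key/value pairs."""

    def __init__(self, window: int):
        self.keys = deque(maxlen=window)
        self.values = deque(maxlen=window)

    def append(self, key, value):
        # deque(maxlen=...) evicts the oldest entry automatically when full.
        self.keys.append(key)
        self.values.append(value)

    def __len__(self):
        return len(self.keys)


cache = BoundedKVCache(window=4)
sizes = []
for step in range(10):
    cache.append(f"k{step}", f"v{step}")
    sizes.append(len(cache))

print(sizes)             # memory stops growing once the window is full
print(list(cache.keys))  # only the last few decoding steps are retained
```

The point is only that memory stays flat no matter how long the output gets. A real model would also keep attending to the source page, which this toy ignores.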
    Nisten’s point here is the practical one: most small businesses don’t realize they can self-host something like this, point it at all their documents, and keep everything local. A lot of folks just throw it at Gemini instead, which works great, but the small dedicated models are now good and cheap enough to own.
    Mistral OCR 4 (X, Announcement)
    Mistral’s entry in OCR week adds bounding boxes, block classification, and per-region confidence scores. They ran a blind human eval across 600+ documents in 12+ languages and annotators preferred OCR 4 about 72% of the time. On the agentic ParseBench leaderboard it lands around fourth, just under LlamaParse and Reducto. Mistral is very enterprise and Europe focused, and it’s cheap, so for regulated, multilingual document work it’s a solid pick. As a sidenote, LlamaIndex’s own eval puts LlamaParse on top and Gemini around third, which says how good the general vision models have gotten at this too.
    Liquid AI ships the world’s smallest agentic LLM (X, HF)
    Breaking on the show: Liquid AI dropped LFM2.5 at 230 million parameters. That’s roughly ten MP3s. Smaller than a Create React App, smaller than your node_modules folder. They call it the world’s smallest agentic LLM, and it runs fast on any CPU from the last decade, on a Raspberry Pi 5, on a Snapdragon, they even stuck it on a Unitree G1 robot.
    I love the use cases here. I already run Cotypist on my Mac for on-device autocomplete, which uses a 6GB Gemma 4B. Swap in something this size and you get the same thing way lighter, and I don’t have to send everything I type to OpenAI. Or, as Nisten put it, a tiny backup brain on your Raspberry Pi that turns your Hermes or OpenClaw back on when it dies. We still need to ship Nisten a smart toaster so we can finally run inference on a toaster.
    Big CO LLMs + APIs
    Sakana AI launches Fugu, seven AI raccoons in a trench coat beating Fable (X, Announcement)
    This was Wolfram’s highlight of the week and I get why. Sakana AI, the Japanese lab co-founded by one of the Transformers authors and David Ha, didn’t ship a new frontier model. They shipped an orchestration system behind a single API. You call one endpoint, and behind the scenes Fugu routes your task to a pool of models, assigns roles like thinker, worker, and verifier, and combines the results.
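The thinker / worker / verifier split can be sketched as a small loop. This is a guess at the general shape, not Sakana's implementation: `orchestrate` and the stub models are hypothetical, and a real system would call actual model APIs instead of lambdas.

```python
from typing import Callable

Model = Callable[[str], str]  # hypothetical: a model is just prompt -> text


def orchestrate(task: str, thinker: Model, worker: Model, verifier: Model,
                max_attempts: int = 3) -> str:
    """Plan with one model, execute with another, check with a third."""
    plan = thinker(task)
    for _ in range(max_attempts):
        draft = worker(plan)
        verdict = verifier(f"{task}\n{draft}")
        if verdict == "ok":
            return draft
        # Feed the verifier's complaint back into the plan and retry.
        plan = f"{plan}\nFix: {verdict}"
    raise RuntimeError("no draft passed verification")


# Stub models so the sketch runs without any API access.
thinker = lambda task: f"plan for {task}"
worker = lambda plan: "draft v2" if "Fix:" in plan else "draft v1"
verifier = lambda text: "ok" if "v2" in text else "needs tests"

result = orchestrate("add login page", thinker, worker, verifier)
print(result)
```

Each extra role is another model call, which is part of why a system like this can burn through tokens quickly.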
    The numbers here are wild: 95.5 on GPQA Diamond, 93.3 on LiveCodeBench, 73 on SWE-Bench Pro, matching or beating Opus 4.8, Gemini 3.1, and GPT 5.5 on ten of eleven benchmarks. The kicker is they only use publicly accessible models (Nisten says it’s Opus, Codex, and Gemini under the hood), explicitly no Fable, no Mythos. So they’re beating frontier results by coordinating models anyone can call. Someone called it the Moneyball of AI and that’s exactly right. It’s backed by two ICLR papers, TRINITY and The Conductor, and being from Japan with no export-control baggage is a very deliberate bit of positioning.
    Peter added the grounding note from Arena, where they’ve trained a prompt router too: if you just always ask for “the best model,” you basically get Opus half the time, so why not just talk to Opus. The real value of routing is aggressive cost reduction, sending easy tasks to cheap models. The catch is that Fugu is agentic and burns tokens fast. Brad in the comments couldn’t get through a single prompt on the $20 plan.
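Peter's cost point is easy to see with a toy router. Everything here is made up for illustration (tier names, prices, and the 1-10 difficulty score); it only shows why sending easy tasks to cheap models moves the bill.

```python
from dataclasses import dataclass


@dataclass
class ModelTier:
    name: str
    usd_per_million_tokens: float  # hypothetical price
    max_difficulty: int            # hardest task (1-10) this tier is trusted with


TIERS = [
    ModelTier("small-model", 0.20, 3),
    ModelTier("mid-model", 2.00, 7),
    ModelTier("frontier-model", 15.00, 10),
]


def route(difficulty: int) -> ModelTier:
    """Pick the cheapest tier trusted with this difficulty."""
    for tier in sorted(TIERS, key=lambda t: t.usd_per_million_tokens):
        if difficulty <= tier.max_difficulty:
            return tier
    raise ValueError("difficulty out of range")


def routed_cost(tasks) -> float:
    """tasks is a list of (difficulty, tokens) pairs."""
    return sum(route(d).usd_per_million_tokens * tokens / 1e6 for d, tokens in tasks)


def frontier_only_cost(tasks) -> float:
    """Baseline: send everything to the most expensive tier."""
    top = max(TIERS, key=lambda t: t.usd_per_million_tokens)
    return sum(top.usd_per_million_tokens * tokens / 1e6 for _, tokens in tasks)


tasks = [(2, 1_000_000), (9, 1_000_000)]
print(routed_cost(tasks), frontier_only_cost(tasks))
```

With these invented numbers the routed bill is about half of the always-frontier bill. The hard part in practice is estimating difficulty, which this sketch takes as given.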
    OpenAI unveils Jalapeno, its first custom inference chip (X, Announcement)
    OpenAI dropped something massive that is not a model. They built a chip. Jalapeno is a custom inference ASIC made with Broadcom, and they’re claiming blank slate to tape-out in nine months. Engineering samples are already running GPT-5.3-Codex-Spark in the lab, and Broadcom’s CEO is citing a roughly 50% reduction in inference cost versus typical AI GPUs. They’re planning gigawatt-scale deployments starting late 2026 with a next-gen chip taped out in 2028.
    Nisten ran it past his electrical engineering and chip-fab group chat and got mixed reactions. No specs were released, and the nine-month claim probably means the design work started two-ish years ago and just got finalized and sent to tape-out now. It’s a lot of smaller chips rather than one giant Cerebras-style wafer. This is inference only, Nvidia keeps the training market, but every dollar OpenAI spends on Broadcom is a dollar it isn’t spending on Nvidia. They join Google’s TPUs, Meta, AWS Inferentia, Groq, SambaNova, Huawei Ascend, and Cerebras in the custom-silicon club. And behind every one of them sits TSMC, Intel, or Samsung, and behind all of those, ASML.
    Anthropic launches Claude Tag, an AI teammate in your Slack (X)
    When I first heard about Claude Tag I thought, you can already tag Codex in Slack, what’s the big deal. It’s different. Claude joins your Slack as a persistent, proactive team member, not a bot you ping. Flip on ambient mode and it follows up on stale threads and flags relevant stuff across channels on its own. There’s one Claude per channel, so the context is shared and any teammate can pick up where another left off. Anthropic says 65% of their product team’s shipped code now comes from their internal version of this.
    The magnitude of this release is quite something. Anthropic is changing its own pricing structure: this is no longer API charges, it’s a per-seat + tokens structure. It’s also VERY sticky, as more and more of your company’s context is going to sit in Claude/Slack and will not be easily portable.
    Additional thoughts on this: the more your company uses this, the more other folks across the company are exposed to Claude. It doesn’t require them to download apps or run code, it’s just like a new teammate joined your Slack channel. And apparently Claude’s context is limited to the channel boundaries, and this allows Claude to get the same permissions (which is huge in enterprise). For Legal, Claude will see the documents in the channel, for Eng, it will push Pull Requests etc.
    This is also what triggered a bunch of folks to caution companies against adopting this new way of using AI. Context lock-in is real, and this is going to be very hard, if not impossible, to untangle once folks have poured months and years of work into it.
    Andrej Karpathy, who’s now at Anthropic, shared a tweet on this, saying:
    Imo this is the 3rd major redesign of LLM UIUX. The first paradigm was that the LLM is a website you go to, the second was that it is an app you download to your computer. This third one is that it is a self-contained, persistent, asynchronous entity with org-wide tools and context, working alongside teams of humans
    This is quite a huge statement, and folks gave him a lot of s**t for it on X, I think very much undeserved! Andrej is known for calling things early (like Vibe Coding) and this is just another one of those deeply new paradigms that people haven’t yet experienced outside of frontier labs!
    I can’t wait to test this out and let you know if this is the future or not. Meanwhile, Simon Smith on X is breaking down their experience with Claude Tag, check him out.
    Tools & Agentic Engineering
    OpenAI ships Codex Record & Replay (X)
    You do a workflow once on your Mac, filing an expense report, creating a Jira ticket, whatever, and Codex watches your clicks, browser actions, and window switches, then generates an editable SKILL.md it can replay. The key thing, and what separates it from old RPA, is that at replay time it re-interprets the live screen instead of matching pixel coordinates, so it adapts when the UI moves. Wolfram’s right that OpenAI is dead serious about Codex. First the paste-a-screenshot feature, now this. Instead of writing ten-paragraph prompts about your personal workflow quirks, you just show it once.
    Aside launches as an AI browser that beats the frontier on agentic benchmarks (X, Announcement)
    YC-backed AI browser, runs everything locally and encrypted, and you bring your own Claude or ChatGPT subscription. It’s claiming number one on three browser-agent benchmarks, beating Claude Fable, OpenAI, and the rest, with 99% on Online-Mind2Web. It looks a bit like Arc and Dia but it’s a browser and an agent in one, with a password manager built for agents so it can log into your accounts without exposing credentials to the model. I actually tried it, it’s pretty cool, and with Arc deprioritized there’s a real gap it’s stepping into. I gave it a list of all the speakers at AI Engineer and asked it to make me an X list and add them all one by one!
    It actually did this wonderfully, failing in the middle and recovering with great success without my intervention!
    The Interview: Sean Grove and Linzumi
    We closed with Sean Grove (@sgrove), ex-OpenAI post-training and alignment, now on his third company and third YC batch, launching Linzumi (linzumi.com, YC). Sean also has one of the most-viewed AI Engineer talks ever, north of 1.2 million views, on the model spec and the idea of specs as the real source code. His framing: we craft the properties we want in a spec, and the code is just the compiler output, so maybe there’s a higher-level spec that produces the same result. He even described a “Socratic compiler” that interviews you about ambiguity and contradictions in your own intent, the way a linter or type checker does for code.
    That fed straight into my AI Engineer talk next week about whether we should still read code at all. Sean’s firmly on the don’t-read-the-output side. He describes the properties he wants, leans on property-based testing the way QuickCheck does, and reads the failures to adhere to those properties rather than the diffs. His goal for Linzumi is for every person to drive ten thousand agent hours per day, and you can’t get there if you’re making every micro-decision.
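For anyone who hasn't used property-based testing, here is a tiny hand-rolled version of the idea using only the standard library. It is a sketch, not Sean's setup: the function under test and the three properties are made up, and real tools like QuickCheck or Hypothesis add smarter input generation and shrinking of failing cases.

```python
import random


def dedupe_keep_order(items):
    """Function under test: drop duplicates, keep first occurrences."""
    seen = set()
    out = []
    for item in items:
        if item not in seen:
            seen.add(item)
            out.append(item)
    return out


def check_properties(fn, trials=500, seed=0):
    """Throw random inputs at fn and collect every property violation."""
    rng = random.Random(seed)
    failures = []
    for _ in range(trials):
        data = [rng.randint(0, 9) for _ in range(rng.randint(0, 20))]
        result = fn(data)
        if len(result) != len(set(result)):
            failures.append(("has duplicates", data))
        elif set(result) != set(data):
            failures.append(("lost or invented elements", data))
        elif fn(result) != result:
            failures.append(("not idempotent", data))
    return failures


print(len(check_properties(dedupe_keep_order)))  # correct implementation
print(len(check_properties(lambda xs: xs)) > 0)  # a "do nothing" version gets caught
```

You never read the implementation here, only the list of failures, which is the workflow described above.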
    Linzumi itself is a Slack-like team chat where humans and a fleet of coding agents share the same threads, except the agents run on your own machine, so the code actually works when you merge it. Behind the scenes it continuously compiles a spec for your company from your chats, your standups, even your customer calls, then generates a DAG of work for the agents and lets them verify against that spec instead of pinging you for every decision. The mental model that stuck with me: if Sean’s system isn’t calling him, everything is great. The knowledge is one omnipresent source of truth, but permissioned and viewed through each person’s lens. For a limited time they’re bundling free GLM 5.2 access via Wafer AI, which fits the week perfectly.
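The "DAG of work" part maps nicely onto a topological sort. A minimal sketch with Python's standard `graphlib` module, using hypothetical task names (this says nothing about how Linzumi actually schedules agents):

```python
from graphlib import TopologicalSorter

# Hypothetical work items: each task maps to the set of tasks it depends on.
work = {
    "write-spec-tests": set(),
    "implement-api": {"write-spec-tests"},
    "implement-ui": {"write-spec-tests"},
    "integration-check": {"implement-api", "implement-ui"},
}

sorter = TopologicalSorter(work)
sorter.prepare()

batches = []
while sorter.is_active():
    ready = sorted(sorter.get_ready())  # tasks whose dependencies are all done
    batches.append(ready)
    sorter.done(*ready)  # in a real system: after agents finish and verify

print(batches)
```

Every batch is a set of tasks that could be handed to agents in parallel, and the next batch only unlocks once the previous one is marked done.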
    My favorite moment: Sean said he’d have retired by now if not for this capability, because he wants to be present with his kids, and a Fable-level model is escape velocity for an AI-native company. I feel that. I also miss Fable, the same way I missed Sydney when Microsoft took it away. We’re all walking around with a little Fable withdrawal.
    Wrap-up
    That’s the chill week. No Fable comeback, nothing new from OpenAI, all the labs strangely waiting (possibly to see how the US government and Anthropic situation resolves before anyone moves). Meanwhile open source quietly closed the gap. GLM 5.2 is the headline, it’s incredible across benchmarks, really good at web design, and you can try it on CoreWeave inference today.
    Next week is AI Engineer World’s Fair. Come find me and Wolfram in the bright yellow jackets. Wolfram’s WolfBench workshop is Monday, I’m talking Wednesday in the token-maxing track about the ZL continuum and whether AI engineers should still write code in 2026. And if you can’t make it, that’s the whole point of our coverage, we’ll bring you the vibe.
    One last thing: thursdai.news now has a full timeline of every release we’ve ever covered plus an agentic search, so you can look up any model or any guest. It’s all built with agents, and I read exactly zero of the code that shipped it. See you next week, hopefully with some bigger model drops to talk about.
    TL;DR and Show Notes - June 25, 2026
    * Hosts and Guests
      * Alex Volkov - AI Evangelist, Weights & Biases & CoreWeave (@altryne)
      * Co-hosts: @WolframRvnwlf, @nisten, @petergostev
      * Guest: Sean Grove, founder of Linzumi (@sgrove)
    * Open Source AI
      * GLM 5.2 - Z.ai’s 744B MoE open-weights model has its DeepSeek moment, tops open-model rankings, #2 on web dev arena behind Fable (HF, Z.ai)
      * Unsloth ships a 1-bit GGUF of GLM 5.2 that runs on a 256GB Mac Studio (X, HF)
      * Krea open-sources Krea 2, a 12B image model in Raw and Turbo versions (X, Turbo, Raw, Blog)
      * Baidu open-sources Unlimited-OCR, a 3B model that parses 40+ pages in one pass at 93% on OmniDocBench (X, HF, Arxiv, GitHub)
      * Liquid AI ships LFM2.5-230M, the world’s smallest agentic LLM (X)
    * Big CO LLMs + APIs
      * Sakana AI launches Fugu, a multi-agent orchestration system behind one API matching frontier models with only publicly accessible models (X, Announcement)
      * OpenAI unveils Jalapeno, its first custom inference chip built with Broadcom, blank slate to tape-out in 9 months (X, Announcement)
      * Anthropic launches Claude Tag, Claude as a persistent proactive teammate in Slack (X)
      * OpenAI expands Daybreak with a Codex Security plugin and GPT-5.5-Cyber hitting 85.6% on CyberGym (X, Blog)
      * OpenAI updates GPT-5.5 Instant, the model free users get
      * New Siri AI lands with the iOS 27.2 update
    * This Week’s Buzz (Weights & Biases & CoreWeave)
      * GLM 5.2 is live on CoreWeave Serverless Inference at $1.39 in / $4.40 out, near 200 tok/s (X, HF)
      * WolfBench ranks GLM 5.2 the third best model ever tested, and one of the cheapest (wolfbench.ai)
    * Tools & Agentic Engineering
      * OpenAI ships Codex Record & Replay: demonstrate a workflow once, get a reusable SKILL.md (X)
      * Aside launches as a local-first AI browser that tops three agentic browser benchmarks (X, Announcement)
      * Mistral OCR 4 drops with bounding boxes, block classification, and 72% human preference across 12+ languages (X, Announcement)
    * Vision & Video
      * ByteDance teases Seedance 2.5 with 30-second single-pass generation, 50 multimodal references, and a 4K upgrade for 2.0 (X, Dreamina)
    * Interview
      * Sean Grove launches Linzumi, a YC-backed team chat for orchestrating fleets of coding agents, bundling free GLM 5.2 via Wafer AI (linzumi.com, YC)


  • ThursdAI - The top AI news from the past week

    Fable Got Banned, Open Source Delivered: GLM-5.2, Kimi K2.7 & SpaceX Buys Cursor - June 18

    2026/06/18 | 1h 55 mins.
    Hey yall, Alex here, let me catch you up!
    I came back from vacation expecting to cover Fable 5 after a week of using it. The first two days after we all first got access to a Mythos level model were super exciting! But then the news hit, US Government issued an order banning Anthropic from giving access to Fable 5 and Mythos 5 to any foreign national, causing Anthropic to pull the models completely (even internally to their employees!).
    So, this wasn’t the show I planned, but it turned into a great show about Open Source, as two models hit the top rankings and are both MIT licensed, filling a Fable-shaped hole in our hearts!
    GLM released 5.2 with folks really excited about its web building capabilities, and Kimi K2.7 Code was released (and is available on CW Inference with crazy speeds!). We also saw the SpaceX IPO and Cursor $60B acquisition, Noam Shazeer joining Open and Midjourney, the image company, launching a new Ultrasound full body scanner to kill MRIs!
    Great show today with Dexter Horthy from HumanLayer, Chris Van Pelt and Adrian Swanberg from W&B announcing our new product HiveMind and Tanishq Abraham came back to help cover Midjourney’s new Ultrasound scanner! Let’s dive in!

    The US Government bans Fable 5! (X, Anthropic statement)
    Here’s a story in 3 parts:
    * Anthropic announces Mythos 5 preview - saying that this model is too dangerous to release, and only gives corporations access to it via project GlassWing.
    * Anthropic works hard on limitations and safety and releases Fable 5 (same weights as Mythos 5) built with guardrails so strong it refuses to do any cybersecurity tasks and switches back to Opus frequently
    * US Government receives a tip (reportedly from Amazon) that Fable 5 can be jailbroken to do cybersecurity tasks, and issues an order to Anthropic, citing national security concerns, banning them from giving access to Fable 5 and Mythos 5 to any foreign national, causing Anthropic to pull the models completely (even internally to their employees!)
    This is the first time that we see the US Government directly intervene in the AI space and restrict access to frontier models. The most updated reporting on this I could find is that Anthropic and US Government officials are in the process of negotiating a safe release framework. Given that preventing all jailbreaks is impossible, I hope they will land on a solution that gives me Fable 5 back!
    This hit especially hard because last week we were all high on Fable. Not in the usual AI Twitter benchmark sense, in the actual “oh, this is a different level” sense. My wife and I Fable-maxxed throughout our flight to vacation. Peter had saved outputs he kept going back to because other models suddenly felt like a step down. Dexter later said it was the closest he had felt in a while to the old “I need to keep prompting this thing overnight” feeling.
    Peter Gostev made a point that stuck with me. It’s easy for us in the bubble to call this ridiculous, and on the technical merits it kind of is. But if you’ve spent weeks telling normal people “this thing is like a nuclear weapon, it’ll take everyone’s jobs,” and then someone asks “okay, can you make it safe?” and the answer is “no, I can’t,” then you can see how an outsider lands on “well, maybe you shouldn’t have it.” His takeaway, and I agree: we need to be way more careful with the imagery we use, because the nuclear-weapon framing came home to roost.
    The bigger questions are the scary ones. Wolfram framed it as a sovereign AI wake-up call, and he’s right. For the first time we’re seeing a real gap in intelligence available to people based on their nationality. Imagine building a company on a model that an outside government can switch off with one letter. Peter pointed out it’s commercially bad for the US but completely disastrous for Europe, which has basically one frontier lab and a pile of startups that suddenly look very exposed. And there’s the obvious irony Nisten enjoyed a little too much: the Europeans who spent years lecturing everyone about AI restrictions just got restrictions imposed on them.
    If anyone in the government is listening: we want Fable back, please.
    SpaceX IPOs and acquires Cursor for $60B (X)
    SpaceX went and did the largest IPO in the history of the world, around seventy-five billion dollars, which on a roughly two-trillion-dollar valuation made Elon the first trillionaire. (Did anything materially change for him? No. He can still fly his private plane. There’s nothing left to buy.) Three days later, SpaceX exercised its option and bought Cursor (Anysphere) for sixty billion dollars in an all-stock deal, paid in shares minted at the IPO and now trading around $211. The four Cursor co-founders are all billionaires now. Largest software acquisition ever, and for SpaceX it’s barely a blip on the radar.
    Why are we covering a stock-market story? Because it’s not really a coding-tools story, it’s an AI story. Cursor gave away its IDE to a lot of people while collecting their data, then quietly became a training company with Composer. SpaceX/xAI was always strong on compute and weak on code, and the missing ingredient was exactly that kind of data. Now Composer 2.5 is already showing up rebranded inside the xAI stack, and if you pay for X Premium you can use it. Composer 3, trained on the Memphis supercluster, is reportedly coming very soon and is going to hit hard.
    Nisten’s take was the spicy one. For the data alone it’s worth it, because xAI now has insight into how essentially every enterprise that touched Cursor operates. And he had zero sympathy for the companies that assumed “no data retention for training” meant the data was actually gone. We see in legal cases all the time that deleted data is still there. His view: it should have gone open source.
    Cursor has over a million paying customers, $2.6 billion in revenue, projected to hit $6 to $10 billion by end of 2026. But here’s the thing that matters for us, the AI coding angle. Cursor was one of Anthropic’s biggest revenue pipelines because Composer runs on Claude under the hood. That pipeline is now owned by xAI. They’re already jointly training Grok 4.3, a 1.5 trillion parameter model, with Cursor’s proprietary coding data injected directly into pre-training, not fine-tuning. Pre-training. That’s a fundamentally different thing. Composer 2.5 was already Pareto dominant on coding benchmarks before the deal closed. Now pair that with Colossus, the biggest GPU cluster in the world.
    Will this be enough to put XAI (now SpaceXAI) at the frontline of the AI race? Will Grok 5 be Fable level code? We’ll find out. Either way, this is the most consequential AI acquisition we’ve seen. Period.
    Open Source AI
    GLM-5.2 takes the open source crown (X, Blog, HF, Docs)
    Z.ai dropped GLM-5.2 and it’s now the strongest open source model for coding and long-horizon work. The headline number: 74.4% on FrontierSWE, which measures whether an agent can finish full engineering projects over hours. That trails Opus 4.8 by about one point and beats GPT-5.5. On Terminal-Bench 2.1 it jumps to 81% from GLM-5.1’s 63.5%, which is a big leap. It’s a 753B parameter MoE, MIT licensed, no regional restrictions, weights on HuggingFace. The 1M context window is real and usable, backed by a clever IndexShare technique that cuts per-token FLOPs by about 2.9x at full context. People are reporting roughly 8x cost savings versus Opus 4.8 for comparable quality on real coding tasks.
    The most interesting thing on the show was that this was a confusing release, in a good way. Peter put it well: normally a catching-up lab ships cherry-picked benchmarks and then independent testing deflates them. Here it’s the opposite, almost every benchmark holds up, even crossing above Fable at certain points, and yet when he actually used it over a couple of days he wasn’t blown away. His verdict, and I think it’s the calibration we needed: this is clearly an amazing model, and the fact that it’s open and you can run it is incredible, but it is nowhere near Fable, and it would frankly be implausible if a 700-odd-billion-parameter model matched a model that’s rumored to be in the trillions.
    Though I think the comparison to Fable is really, really unfair, and the comments online seem to suggest that GLM 5.2 is a banger model. Just look at the Harvey benchmark on legal tasks from Vals, a benchmark that there’s zero chance the Z.ai folks have seen: GLM 5.2 scores #3 on it, just after Fable and Opus, and per TeorTaxes on X, the previous GLM 5.1 scored an absolute 0% on this one!
    Where it genuinely shines is design. On Design Arena, which is a head-to-head ELO vote, people have been picking GLM-5.2’s website designs over Fable’s by a real margin (around 1360 to 1350). LDJ’s framing is the one I buy: specialization is becoming valuable again, and GLM is clearly leaning into front-end design and taste. Wolfram added the necessary asterisk, every benchmark only tells you the model did well on that specific test, so “as good as Fable” should always carry the “on this benchmark, with these tasks” disclaimer. Fair. I’d just say this: I don’t want to compare everything to Fable, because we can’t even use Fable anymore. Compared to the models we can actually touch, GLM-5.2 is a fantastic deal.
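For a feel of what a 10-point ELO gap means in head-to-head votes, here is the textbook Elo expected-score formula. I'm assuming Design Arena uses the standard 400-point scale, which I haven't verified:

```python
def elo_expected(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected share of head-to-head wins for A under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


p = elo_expected(1360, 1350)
print(f"{p:.3f}")
```

Under that assumption, 1360 vs 1350 works out to roughly a 51% expected win rate for the higher-rated model, so it's a lead, but the two are close.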
    Kimi K2.7 Code from Moonshot (X, HF, Announcement)
    The other big drop. Kimi is the darling of open source while we wait on DeepSeek, and Moonshot shipped K2.7 Code, a 1 trillion parameter MoE built specifically for coding, available through Kimi Code and the API, with a modified MIT license. The standout for me isn’t a single benchmark, it’s efficiency: roughly 30% fewer reasoning tokens than K2.6, which matters enormously when you’re running long agentic loops that burn tokens like crazy.
    Benchmark jumps over K2.6 are real (+21.8% on their Code Bench v2, +11% on Program Bench), though Peter and Wolfram both noticed something odd, on a few benchmarks including their Agentic Arena, the older K2.6 actually edged out K2.7. The likely explanation is that K2.7 is narrowly trained for code with reduced reasoning, so it may trade away some general capability. Moonshot themselves recommend K2.6 for general non-coding tasks. Also worth knowing: it’s not multimodal, no vision, which is a real gap for coding these days. And thinking-off isn’t supported, it’s reasoning-on by default.
    The model is available on our CW Inference, with the fastest token streaming in the industry, over 280 tok/s (Announcement, try it), with very decent pricing $0.94 - $0.19 - $4.00 (input - cached - output) per million tokens.
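To turn that pricing into a per-request number, here is a small calculator. The three rates come from the line above. The billing model is my assumption (cached tokens billed at the cached rate instead of the input rate), and the token counts are invented:

```python
# Prices quoted above, in USD per million tokens.
PRICE_PER_MILLION = {"input": 0.94, "cached": 0.19, "output": 4.00}


def request_cost(input_tokens: int, cached_tokens: int, output_tokens: int) -> float:
    """Cost in USD, assuming cached tokens are billed at the cached rate only."""
    return (
        input_tokens * PRICE_PER_MILLION["input"]
        + cached_tokens * PRICE_PER_MILLION["cached"]
        + output_tokens * PRICE_PER_MILLION["output"]
    ) / 1e6


# Hypothetical long agentic turn: mostly cached context, small output.
cost = request_cost(input_tokens=200_000, cached_tokens=800_000, output_tokens=50_000)
print(f"${cost:.2f}")
```

That hypothetical turn comes to about 54 cents, and most of the million context tokens ride on the cached rate, which is where long agentic loops save money.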
    This Week’s Buzz: W&B launched HiveMind 🐝 - track all your agentic work in one place (X, Try it, GitHub)
    This is the one I’ve been sitting on for months. We brought on Chris Van Pelt (CVP), Weights & Biases co-founder, and Adrian Swanberg to launch HiveMind, and I’ll be honest, I’ve been a beta user for a while and I’m thrilled I can finally talk about it.
    The premise: what it means to be a software developer has fundamentally changed, and your work is now scattered across six or seven agent dashboards. HiveMind is a tiny daemon that sits on your machine, picks up sessions from whatever harness you’re running (Claude Code, Codex, Cursor, Gemini CLI, OpenCode, GitHub Copilot, Pi), and within about 30 seconds they show up in one shared dashboard. It breaks each session into chapters, shows which files the agent touched, what to-dos it wrote, where context got compacted. W&B has been running it internally for six months.
    A few things genuinely delighted me. There’s a fork button: HiveMind pulls down a compacted history of a session and lets you relaunch it in a different harness, so you stay harness-agnostic. CVP’s line: “this has proven invaluable when Anthropic servers are on fire and I just gotta get something done.”
    Then there’s the skill engine, which to me is the real magic. It reads your team’s sessions and can clone a power user’s whole approach into a reusable persona, at CoreWeave they built a “Talk to Tim” skill from Tim Sweeney’s sessions, and apparently a virtual Tim is now a popular way to get guidance. And the insights feature detects where you kept correcting the agent, clusters those pitfalls across the org, and hands you a smart-merge command to drop the fix straight into your AGENTS.md.
    I’m excited to finally show this to you, it’s been genuinely helpful (for example, last week I was able to test Fable and tell you the number of tokens it used until I maxed out my Claude subscription!) - give it a try at hivemind.wandb.tools
    HumanLayer launches its Agentic IDE, and a real talk about code slop (X, humanlayer.dev, 12-factor-agents)
    Dexter Horthy, friend of the show and the team behind 12 Factor Agents and the Research-Plan-Implement framework (now running inside Block and Uber), launched HumanLayer’s Agentic IDE this week, and we got into one of my favorite conversations of the year. The whole product is explicitly anti-slop. His argument: the “lights-off loop,” where humans only write tickets and the agent codes, verifies, ships, and feeds its own crashes back to itself, is the fastest way to trash a codebase. Vibe coding is great for zero-to-one and side projects nobody depends on. But if you’re a staff engineer in a high-stakes codebase, dear God, read the code.
    This ties directly into my AI Engineer World’s Fair talk, the ZL continuum, which Dexter half-inspired. On one end you’ve got the YOLO camp (Ryan from OpenAI, one billion tokens a day, nobody can read that much code) and on the other Mario from PI (read every line of critical code). Those two are now the sixth and seventh most-watched AI Engineer talks globally, which tells you the whole field is wrestling with this. Dexter’s answer is leverage. Don’t aim for a perfect spec, because a perfect spec is just code. Get it 80% right, then zoom down a level at a time so the chunk you’re steering is human-consumable. He claims that an hour of upfront prep on architecture and even program design turns a three-hour code review into a twenty-minute one.
    I pushed him on the obvious counter: why does code quality even matter if Fable-class models keep arriving and maintenance is a prompt away? His answer was the most grounded thing I heard all week. Code quality matters for the same reason it mattered in the 1970s software crisis: pile in code without structure and your velocity tanks, every change starts breaking something else. And here’s the irony, we train models on beautifully architected projects (Django, Redis, Spring on SWE-bench multilingual), yet they still reward-hack their way to “just make the test pass.” We don’t yet have a penalty function or a verifier for “this code is harder to maintain,” and that’s hard to build, so humans are still needed in the loop. He played with Fable too, threw an 8K-line React PR refactor at it, and the first pass was bad, it introduced React context and patterns they don’t use. Better than before, not a step change that lets you drop the reins. We’re not there yet. It’s BYOK, $100/user/month for pro with a free tier for teams of three.
    OpenRouter Fusion: near-Fable quality at half the price (X, Blog, Announcement)
    Wolfram spotted this one and it’s clever. OpenRouter’s Fusion is a single API call that fans your prompt out to a panel of models, then a judge model reads all the responses and a synthesizer writes the best combined answer. It’s the LLM consortium idea (the thing we used to do by hand, asking several models and stitching the best parts together), now baked into the API so you don’t build it yourself.
    The wild result: on Perplexity’s DRACO deep-research benchmark, a budget panel beats solo GPT-5.5 and solo Opus 4.8 and lands within 1% of Fable 5 at roughly half the cost. The most interesting finding is that about three quarters of the lift comes from the synthesis step, not from model diversity, they even fused Opus with itself and got a 6.7-point jump. The catch is latency, it’s 2-3x slower, so it’s a deep-research and planning tool, not a quick-query tool. Big shout out to OpenRouter.
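    For anyone who never built the consortium pattern by hand, here is a minimal sketch of the fan-out, judge, synthesize flow described above. It is not OpenRouter's API: the `fuse` function, the `Model` type, and the stub models are all hypothetical stand-ins, and a real version would call actual model endpoints (and would likely fan out in parallel rather than in a loop).

```python
from typing import Callable, Dict

# Hypothetical abstraction: a "model" is just a function from prompt to text.
Model = Callable[[str], str]


def fuse(prompt: str, panel: Dict[str, Model], judge: Model, synthesizer: Model) -> str:
    """Fan a prompt out to a panel, have a judge compare drafts, then synthesize."""
    # 1. Fan the same prompt out to every panel model.
    drafts = {name: model(prompt) for name, model in panel.items()}

    # 2. The judge reads all drafts and writes comparison notes.
    listing = "\n\n".join(f"[{name}]\n{text}" for name, text in drafts.items())
    notes = judge(
        f"Question: {prompt}\n\nCandidate answers:\n{listing}\n\n"
        "Compare the candidates and note the strengths of each."
    )

    # 3. The synthesizer writes one combined answer from drafts plus notes.
    return synthesizer(
        f"Question: {prompt}\n\nCandidate answers:\n{listing}\n\n"
        f"Judge notes:\n{notes}\n\nWrite the best combined answer."
    )


# Stub models so the sketch runs without any network access.
panel = {
    "model_a": lambda p: "Answer A",
    "model_b": lambda p: "Answer B",
}
judge = lambda p: "A is more complete; B is more concise."
synthesizer = lambda p: f"FUSED({p.count('[model_')} drafts)"

result = fuse("What is X?", panel, judge, synthesizer)
print(result)  # FUSED(2 drafts)
```

    Nothing in this structure requires the panel models to differ, which fits the finding that most of the lift comes from the synthesis step: you can pass the same model under several names and still get the judge and synthesizer passes.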
    Vision and video
    Google Gemini Omni, finally with API access, takes #1 on video benchmarks (X, Announcement)
    We covered Google’s new video model Omni at Google I/O, and it finally landed as an API. It’s Google’s first any-to-any model, one single unified system for text, image, video, audio, and music. Think Nano Banana, but for video. Peter tested it and it scored really, really well, the kind of jump between generations you saw with GPT-image-2. Independent testing put it at #1 for realistic body physics and #2 behind Seedance for complex action, and it topped MovieGenBench for preference and instruction following. The session-memory piece is the part I find most useful: you can keep editing across turns, characters stay consistent, you say “continue” and it picks up where it left off. It’s live in the Gemini app, Google Flow, and YouTube Shorts.
    Grok Imagine Video 1.5 (X, Blog, Docs)
    xAI’s Grok video work has been quietly getting really good, and they finally gave us an actual version number instead of silently updating “Grok Imagine” over and over (which drove me nuts). Grok Imagine Video 1.5 generates a 6-second 720p clip in about 25 seconds, down from 40-plus, so nearly 2x faster, with native audio generated in the same pass: sound effects, ambience, dialogue, lip sync, no post-production stitching. It hit #1 on the Design Arena image-to-video board with a 1,357 Elo and a ~49 point lead, and it’s generally available in the API. I ran my standard astronaut-riding-a-horse-on-the-moon prompt and it came back with music too. Genuinely cool.
    Sci-Fi is here: Midjourney announces a full-body ultrasound scanner to compete with MRIs (X, Announcement)
    I’m still processing this one. Midjourney, you know, the image generation company, announced medical hardware. A new division called Midjourney Medical, and its first product is a full-body ultrasonic scanner. Tanishq Abraham was there in the front row and joined us to break it down.
    The device uses thousands of ultrasonic transducers arranged in a ring. Because sound doesn’t propagate well through air, you’re lowered into a tank of water, the sound travels through your body at 1,481 meters per second, and in under 60 seconds you get a 3D anatomical map of 25-plus organs. The raw data is roughly 806 terabytes per scan, streaming at about 16-17 gigabytes per second, and the only way to handle that firehose is AI. No radiation, no magnets, no superconductors, which is what makes MRI so expensive. David Holz has apparently wanted a medical imaging lab for two years, and because Midjourney is fully self-funded with no VCs, they can chase wild projects like this.
    The fun reveal from Tanishq: there’s no AI in the actual image reconstruction yet, it’s basic signal processing right now, with physics simulators and possibly NeRF-style neural fields on the roadmap (there was a hallway conversation with Jon Barron about exactly that). So this is a prototype with enormous headroom. The business model is the spa, a 24,000-square-foot space about ten minutes from Union Square in SF with around ten scanners, targeting end of 2027, then custom sensors in 2028, scaling toward 50,000 scanners doing a billion scans a month.
    Now, for a dose of reality: this is just an announcement, and ultrasound won’t replace MRIs anytime soon. For one, ultrasound cannot penetrate bone and air, so lungs (full of air) and brain (literally encased in bone) are out. But it’s still great to see Dave Holz innovating in the medical space and I’m excited to try this out!
    Wrapping up
    What a strange, whiplash week. We got the best model any of us had ever used taken away by a government letter, watched a meme become a real Mistral roadmap, saw open source close the gap on the models we can actually run, and watched an image company casually announce it might kill the MRI. I came back from vacation thinking I’d write you a Fable love letter and instead I’m writing about deemed-export law and ultrasonic water tanks. That’s the job, and honestly I wouldn’t trade it.
    If you’re heading to AI Engineer World’s Fair, come find Wolfram and me, Weights & Biases and CoreWeave are sponsoring the whole thing, and my ZL continuum talk will name-check a lot of what we covered today (Day 3 • Wed, July 1 · 10:45am-11:05am). And if Fable comes back next week, you’ll hear me yell about it first.
    See you next week, and please, US government, give us Fable back.
    ThursdAI - Jun 18, 2026 - TL;DR
    * Hosts and Guests
    * Alex Volkov - AI Evangelist at Weights & Biases, CoreWeave (@altryne)
    * Co-Hosts - @WolframRvnwlf, @ldjconfirmed, @petergostev (Arena), @nisten, @yampeleg
    * Dexter Horthy (@dexhorthy) - Founder, HumanLayer
    * Chris Van Pelt (@vanpelt) - Co-founder, Weights & Biases (HiveMind)
    * Adrian Swanberg - Weights & Biases (HiveMind)
    * Tanishq Abraham (@iScienceLuvr) - Founder, Sophont AI (reporting from the Midjourney Medical event)
    * Big CO LLMs + APIs
    * Noam Shazeer is joining OpenAI - co-author of the Transformers paper and co-founder of Character AI, teaming up with Noam Brown
    * US government orders Anthropic to shut down Fable 5 and Mythos 5 access for all foreign nationals (including its own employees), citing national security; Anthropic disables both for everyone to comply (X)
    * SpaceX acquires Cursor (Anysphere) for $60B in an all-stock deal, the largest software acquisition in history, days after its record IPO (X)
    * Open Source LLMs
    * GLM-5.2 drops as the strongest open-source coding model with solid 1M context, MIT-licensed, trailing Opus 4.8 by just 1% on FrontierSWE (X, Blog, HF, Announcement)
    * Moonshot AI open-sources Kimi-K2.7-Code, a 1T MoE coding model with 30% fewer reasoning tokens and big benchmark jumps over K2.6 (X, HF, Announcement)
    * Mistral CEO Arthur Mensch playfully confirms the ‘Le Gros Chaton’ meme, hinting at an upcoming fat-but-sparse open-weight model family (X, Summary, Blog)
    * This Week’s Buzz - W&B and CoreWeave
    * Weights & Biases launches HiveMind, a unified dashboard to track spend and ROI across all your AI coding agents (X, Announcement, GitHub)
    * Kimi K2.7 Code is live on W&B / CoreWeave Inference at 289 tok/s (NVFP4 on Blackwell + speculative decoding), top of Artificial Analysis for speed and price-performance
    * Tools & Agentic Engineering
    * Claude Design gets a major update: design system imports with self-audit, canvas editing, bidirectional Claude Code sync (/design-sync), and PDF/PowerPoint export (X, X, Announcement)
    * HumanLayer launches its Agentic IDE to fight AI code slop, already deployed at Block and Uber (X, Blog, 12-Factor Agents)
    * OpenRouter launches Fusion API: a panel of budget models beats GPT-5.5 and Opus 4.8, lands within 1% of Claude Fable 5 at half the price (X, Blog, Announcement)
    * OpenAI rolls out Codex Computer Use, Chrome extension, Memory, and Chronicle to European users in the EEA, UK, and Switzerland (X, Announcement)
    * Vision & Video
    * Google DeepMind launches Gemini Omni, their first any-to-any generative model starting with video editing and creation (X, Announcement)
    * xAI launches Grok Imagine Video 1.5 with near-2x faster generation, native audio, and a #1 leaderboard position (X, Blog, Announcement)
    * Sci-Fi is here
    * Midjourney announces ‘Midjourney Medical’ - a full-body ultrasonic scanner that captures 806 TB of data per scan in under 60 seconds (X, X, Announcement)


    This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe
  • ThursdAI - The top AI news from the past week

    📅 ThursdAI - Jun 11, 2026 - Fable & Mythos 5 are here, Anthropic gets caught sandbagging (then reverses), Siri AI finally works!? and we got live-translated on air

    2026/06/12 | 2h 11 mins.
    Hey folks, Alex here, and welcome to a BIG MODEL week! We finally got Mythos (well almost)! Let me catch you up!
    This week started with WWDC26 from Apple, and Max Weinbach, who was in the room at Apple Park and actually has access to some of the new features including an all new SIRI AI, joined us to break down what could be the most used AI in the world very soon. At first I was skeptical, but he convinced me that the new Siri is actually good!
    Then, we saw the ultimate model drop: Anthropic finally shipped Mythos (X, my system card thread, benchmarks). Same weights, two names: Mythos 5 is the unrestricted version that only Project Glasswing partners get, Fable 5 is what the rest of us get, wrapped in the heaviest guardrails I’ve ever seen ship on a frontier model. It’s state of the art on nearly every benchmark.
    The model that was “too dangerous to release” is now... well, released, but with the heaviest guardrails we’ve seen. More on this later. Peter Gostev from Arena.ai joined us to break down the new model.
    Last but definitely not least, Google released a real-time translation model, that our friend Thor Schaeff from DeepMind demoed live, while we all spoke in different languages and it translated us in REAL TIME. It was really cool, definitely check that out.
    There are quite a few more things, like Loop Engineering Alpha, Swyx came by to talk about FrontierCode, OpenAI confirmed our suspicions that the anti-datacenter social media posts could be a concerted effort by groups linked to the Chinese government, and much more. Let’s dive in!
    ThursdAI - Let me catch you up, every week! 👇

    Opus’s Big brother: Claude Fable 5 & Mythos 5 - the “too dangerous” models are here, SOTA on nearly every benchmark.
    It honestly feels like someone on Anthropic’s pre-IPO marketing team knows exactly how to stagger releases to ride the hype waves! First they announce a model that’s so good at cybersecurity (Mythos-preview) that they only allow restricted access to a few partners.
    A month later, they release Fable 5, which is the same model weights as Mythos 5, but wrapped in the heaviest guardrails we’ve ever seen from any lab. But, they didn’t lie, this model is absolutely amazing, it does feel like a step change, in terms of capabilities, specifically on longer agentic tasks.
    2x as expensive as Opus: $10 / $50 per million tokens, with 1M context, claude-fable-5 in the API, and SOTA basically everywhere. 80.3% on SWE-Bench Pro versus GPT 5.5 at 58.6%, a 22-point blowout on a benchmark where labs usually fight over single digits. Karpathy called it “SOTA by a margin… major-version step change” (X) and Boris Cherny said it’s the “best coding model by a wide margin” (X). Stripe reportedly migrated 50 million lines of code in 24 hours with it.

    Our panel verdict was unanimous on one thing: big model smell. LDJ called it the most significant big model smell since Gemini 3 first dropped. Someone from the Anthropic team framed the shift in a way that stuck with me: this model moves them from verifying the AI outputs to verifying whether the AI is working on the right thing. Complete shift in how much they trust this model.
    What we built with Fable to test it out
    Peter got employee access through Arena and showed us his tests live. His favorite prompt category, “research a dataset and create a visual experience to teach me about it,” went from completely rubbish on every previous model to, in his words, just done. His 3D city generations actually came together as a city, roads connecting and all. And on Arena’s data, Fable is #1 on the new Agent Arena leaderboard by the widest margin they’ve ever recorded, and wins 72% of frontend battles even against Opus models (Arena).
    My own run is the one I can’t stop thinking about. I pointed Fable at the ThursdAI website with a dynamic workflow in Claude Code and barely any instructions, and after an hour and a half of agentic running it had extracted 786 releases from our archive, built 240 new pages, and categorized 50+ episodes into a browsable timeline of AI releases by month, by company, by topic, with logos and source links (X). It burned roughly 50 million tokens and my entire five-hour Max allotment in 90 minutes. The new AI releases timeline can be found on thursdai.news and it’s confirmed, Fable is the best AI web designer we’ve ever had access to.
    Nisten ran his traditional Olympus Mons escape-velocity test and Fable didn’t just do the math, it built the entire solar system! Orbital maneuvers, a space train with little people in it, time controls, full cost calculations down to solar panels and in-situ iron utilization. His verdict: completely different level from anything else. We’ve never seen so many details in the Olympus Mons test.
    It’s not all light though. Yam found Opus more controllable; Fable fights you, decides it knows better, and does the task its own way. Wolfram saw exactly that in benchmarks, where the model ignored the task spec, did its own thing, and failed the verifier with full confidence. Peter had it explaining why it got math wrong instead of just fixing it (”What are you doing, man? Just move on”). Arena’s steerability signal has it sitting around 17th. There’s an adjustment period with every new model, and the consistent advice from Anthropic folks is to go high level: give it the goal, not the micromanagement.
    Not to mention the refusals! Oh.. so many refusals!
    The refusals, and the sandbagging scandal
    Here’s where the week got ugly. Fable ships with restrictions on cybersecurity, bio/chem, and a brand new one nobody saw coming: frontier AI development (X). For cyber and bio you get a visible fallback to Opus 4.8 with a notice. But for “self-acceleration” topics, the original policy was no fallback and no notification. The model would quietly degrade its own output using prompt modifications, steering vectors, and PEFT, on roughly 0.03% of traffic (X). You’d pay double Opus prices and get sabotaged answers without ever knowing.
    The community reaction was volcanic. Elie Bakouch: “bad ON PURPOSE… not visible to the user is crazy” (X). Péter Szilágyi: “a new ruling class and you’re not in it” (X). Simon Willison: “If Claude Fable stops helping you, you’ll never know.” And Sayash Kapoor dropped the eval-integrity bomb: third-party evaluators can no longer credibly benchmark a model that might be silently nerfing itself (X).
    Within about 24 hours, Anthropic blinked. They told WIRED they “made the wrong tradeoff,” and now flagged requests visibly fall back to Opus 4.8, with API users getting an explicit reason (X). I commend the speed of the reversal, but the trust damage was done.
    Despite the reversal, Fable remains refuse-happy! Peter ran his nonsense-question benchmark and a full third of his prompts got blocked outright by the classifier, including 18 of 20 physics questions. Nisten had to strip medical and anatomy terms from a fall-detection app for seniors homes to get it to work at all (a 400KB neural weight tripped the frontier-AI filter). And my favorite absurdity: I could not get Fable to draft the TLDR for this very show without it falling back to Opus, presumably because reading a week of AI news looks like frontier AI development. Ridiculous.
    But the question remains: would we rather have a model this good with these restrictions, or not have access at all? Everyone on the panel chose access; a lot of people online act like they would choose the opposite.
    System card for Mythos, wildest AI document of the year?
    I’ve used Fable itself to help me review the system card for Mythos/Fable 5 and there are a few highlights that are worth mentioning.
    Anthropic admits that this is a category-step change in model capabilities. Mythos 5, the unguarded version, makes working Firefox exploits 88.4% of the time (Opus 4.8 is at 8%!). But the most interesting thing is their concern for CB (Chemical and Biological) safety. Two-person generalist biology teams using it finished work in 16 hours that experts estimated at 40 to 95 days without AI, which is what pushed Anthropic to treat it as near their CB2 bioweapons threshold (X).
    What is loop engineering and why is everyone talking about it?
    One more thread before we move on. This week Boris Cherny (Claude Code) and Peter Steinberger (now OpenAI) both posted about the same concept, loops, within an hour of each other, and Lance Martin from Anthropic published the field guide (X, Article, Blog). The idea is the shift from “I give you a task and babysit you” to proactive agents: a Jira ticket lands, a PR comment appears, and your agent just runs and does the job. Fable is clearly trained for this world. But it’s also worth remembering that those folks get the tokens for free, unlimited tokens. The rest of us may not be able to afford Fable running in a loop. I’ve asked Fable to do a simple task and it spun up several sub-agents, all spending my money to just read a few tweets!
    FrontierCode: hard coding benchmark from Cognition, that Fable absolutely mogs
    Swyx came on with the best timing story of the week. Cognition launched FrontierCode (Cognition, swyx), a coding eval built over a year with 20+ world-class open source maintainers writing 150 original tasks, graded on whether a maintainer would actually merge the PR. Swyx’s pitch is brutal and correct: a huge chunk of SWE-bench passes are unmergeable slop (the thing is 75% Django issues, so it mostly tests whether you memorized the Django repo). FrontierCode grades scope discipline, real tests, regression safety, and zeroes you on any blocker. At launch, Opus 4.8 topped the hardest Diamond tier at 13.4%.
    Twenty-four hours later, Fable 5 posted 29.3% (Cognition, swyx). More than double, on a benchmark designed to be brutal, a day after it went public. Swyx was positively surprised the pricing is only 2x Opus; he expected 5x. Inside Cognition they keep an informal AGI counter (literally counting how often “AGI” gets said in Slack per week) and the Mythos testing period set the all-time record. When Anthropic pulled the test model back before launch, engineers were genuinely sad.
    A quick plug (unsponsored!): Wolfram and I are both speaking at the AI Engineer World’s Fair in San Francisco on June 29-July 2! It’s the biggest AI engineering conference in the world with 6,000 people and 16 tracks!
    We’ll of course also live stream from the event!
    WWDC 2026: Siri finally does the thing!
    Two years after the Bella Ramsey ads Apple had to quietly pull from YouTube, the new AI powered Siri is real, and Max Weinbach came straight from Apple Park to confirm it (recap). His demo that broke my brain, he asked Siri: “show me the photos from Qualcomm Summit last year of the penguins.” Siri figured out what Qualcomm Summit was from his email, found the hotel, searched for penguins at that location, and returned the six photos in about 12 seconds. He’s also had it sweep 40 junk emails from one domain into spam with a single sentence, build a photo album from a weekend trip, and change a password agentically by driving Safari in the background. “Siri did suck for like 11 years. It doesn’t anymore,” per Max.
    Folks, this is SIRI we’re talking about, the dumb iPhone assistant that can barely schedule times and falls back to a Google search when you ask it anything remotely complex! I... wanted to believe Apple two years ago, and now, finally, there’s hope! (I’m still waitlisted waiting for the preview btw so cannot attest myself)
    But it’s not only Max, my whole timeline is full of folks who say that the new Siri is actually good!
    The architecture is the fun part for our crowd (Max’s teardown thread). Siri is now a standalone app with persistent history, images, personal context and on-screen context, built on five foundation models, four of which are Apple’s. The fifth, AFM Server Pro, is the twist: built with Google at the Gemini technology level, running on Nvidia Blackwell GPUs in Google Cloud, but inside Apple’s Private Cloud Compute with confidential compute, Intel TDX, Google Titan chips, and zero persistent storage (Max). The on-device gatekeeper is a 20B sparse model that only loads 1 to 4 billion parameters per prompt via Instruction-Following Pruning, which is how it runs instantly on an NPU. Cloud models reason; only the local model can touch your device or your data. After this week with Fable’s retention policies, an AI that saves nothing by default hits different.
    There were a bunch of other Apple Intelligence updates, it works better on the Mac, but I think the Siri improvements are the main headline here: it’s the AI that most people (over 1.6 billion iPhone users?) will have on them, with most of the conversations completely private, able to access the content they care about the most (multiple email boxes, photos, messages etc) securely. It’s the ultimate OpenClaw dream, albeit not as agentic (yet?).
    BTW, there seems to be an ongoing battle between Apple and the EU, so this may not launch on the iPhone in the EU yet (also not in China).
    Voice & Audio
    Gemini 3.5 Live Translate, demoed live in four languages
    Thor Schaeff from DeepMind joined to show off Gemini 3.5 Live Translate (Thor, DeepMind), and instead of talking about it we just did it. Thor piped the live stream’s audio into AI Studio, and then I spoke Russian, Wolfram answered in German, Yam jumped in with Hebrew, LDJ attempted Spanish (poorly lol), and everyone listening heard all of us in English, though in random voices, in well under a second. It even handled “Anthropic” and “Fable 5” pronunciations correctly, terms that were a day old. A viewer called it the Babel fish arriving ten thousand years early and honestly, yeah, it was kind of insane.
    Technically this is a new class of model: continuously streaming speech-to-speech with no turn-taking, collapsing the old STT, translate, TTS pipeline into one Live API call, with transcribers running in parallel on input and output audio. 70+ languages, sub-500ms, tone, pace and pitch preserved (mostly; Thor admits it sometimes drifts gender or tone mid-conversation), SynthID watermarked, $0.023 per minute on the API preview.
    Open Source LLMs
    DiffusionGemma: When next token prediction is not enough.
    Sundar himself tweeted this one, Hugging Face link and all, which made my week (Sundar, DeepMind, HF). DiffusionGemma is a 26B MoE (3.8B active) built on Gemma 4 that generates text the way image models generate pixels: denoise a whole 256-token block at once instead of one token at a time. The result is 1,000+ tokens per second on a single H100, Apache 2.0. As one viral post put it, “we spent 40 years teaching computers to read left to right and the breakthrough was… don’t do that” (X).
    LDJ explained why this matters beyond speed: a diffusion model can revise every part of the answer simultaneously mid-generation, something autoregressive models structurally can’t do without burning a whole reasoning pass. Nisten, who’s worked on diffusion, is still amazed it works at all; it used to be a messed-up cat picture emerging from noise, now it’s working code. The honest caveat: quality trails autoregressive Gemma 4 (AIME 69 vs 88). The win here is the speed and the architecture. For now.
    The rest of an absurdly stacked open source week, fast: Cohere North Mini Code, their first open coding model, 30B with 3B active, Apache 2.0, Cohere has officially reawakened (X). Xiaomi MiMo-V2.5-Pro-UltraSpeed pushing 1,000+ tok/s on a one-trillion-parameter MoE (X). Macaron-V1-Preview, a 749B Mixture-of-LoRA personal agent model under MIT (X). And OpenEnv went community-owned with HF, Meta-PyTorch, Unsloth, PrimeIntellect and NVIDIA at the table (X).
    This Week’s Buzz: WolfBench ran Fable, and it cost what a car costs
    Wolfram did the thing nobody else would: five full Terminal-Bench 2.0 runs of Fable 5 on WolfBench (X), 984 million tokens, roughly $11,000 on the new cost view. (We have a budget... We had a budget.) The new 3D bars on wolfbench.ai now show tokens and dollars behind every score, because one score is never enough, and you can click any bar to land directly in the trace on W&B Weave and read exactly what the model did. And as you can see… Fable is… going to take a deep toll on our evaluations budget for this Q!
    And the result is the most interesting non-result of the week: Fable lands between Sonnet 4.6 and Opus 4.6, with GPT-5.5 still on top, and the culprit is refusals. Wolfram’s analysis found 13 tasks that scored zero out of five purely because the classifier blocked them from the first attempt (recover-a-password-from-a-file type tasks that even Opus 4.6 happily solved). Fable solved 60 tasks on average, just eight behind GPT-5.5; solve those 13 refused ones and it’s number one. The model is great. The classifier is doing the damage. Which is exactly the Sayash point about eval integrity, now with receipts and an invoice.
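    The arithmetic behind that claim is easy to check. This is just a sketch using the numbers quoted above; the one real assumption is that Fable would solve all 13 refused tasks on every run, which is the best case and not something WolfBench measured.

```python
# Figures quoted in the text.
fable_avg_solved = 60        # average tasks solved per run
gap_to_gpt55 = 8             # "just eight behind GPT-5.5"
refused_tasks = 13           # tasks blocked by the classifier on all five attempts
total_tokens = 984_000_000   # across five Terminal-Bench 2.0 runs
total_cost_usd = 11_000      # rough figure from the cost view

gpt55_avg_solved = fable_avg_solved + gap_to_gpt55

# Best case (assumption): every refused task would have been solved on every run.
fable_if_unblocked = fable_avg_solved + refused_tasks

# Blended price across input, cached, and output tokens.
blended_usd_per_million = total_cost_usd / (total_tokens / 1_000_000)

print(f"GPT-5.5 average: {gpt55_avg_solved} tasks")
print(f"Fable without refusals (best case): {fable_if_unblocked} tasks")
print(f"Blended cost: ${blended_usd_per_million:.2f} per million tokens")
```

    Even if Fable only cracked nine of the 13 refused tasks, it would edge past GPT-5.5, so the "classifier is doing the damage" reading does not depend on the best case.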
    Datacenter, Water usage and Concerted efforts to sway public opinion
    We covered the datacenter water usage issue a couple of weeks ago, where we showed that almond farms in California alone use more water than all of the US datacenters combined! When I posted that clip, I received a bunch of comments, way higher engagement rates than my clips usually get (are y’all subscribed to our YouTube and Instagram btw?). At first I thought it was just a hot topic, but then I read more about it and it does seem... fake.
    So now, we have a bit of a confirmation from OpenAI. OpenAI posted an article claiming that they have been able to detect a bunch of social media accounts that have been using ChatGPT to fuel anti-datacenter and anti-tariff campaigns on US social media.
    Now, you might ask yourself, why would Chinese-linked accounts be using ChatGPT and not, like, a Chinese open source undetectable model? My answer is, they are probably using all tools available to them, and they just happened to get caught.
    In any case, I think datacenter water and electricity usage will be a hot topic for an upcoming election as well, and I hope efforts like this will be thwarted before they can do a lot of damage.
    SpaceXAI announces the AI-1 satellite, a day before the biggest IPO of all time.
    Conveniently, just before the SpaceX IPO, Elon and friends are talking about AI in space again. This time it’s more than a concept: they put out engineering specs for the new AI-1 satellite, which draws up to 150 kW at peak, roughly what a GB-300 GPU rack needs, per Elon.
    One thing you cannot deny is that Space Uncle (Elon) is thinking BIG. Someone did the math and it’s wild:
    They’re targeting 15-20 AI satellites per Starship flight, meaning about 1,080-1,440 GPUs per launch. At that rate, 400-500 Starship flights would match Colossus 2’s 550,000 GPUs, and at hourly launch cadence that’s like 16-20 days. SpaceX is seeking approval for up to a million of these satellites, Terafab mass production starts Q4 2027, and they’re saying this could be the lowest-cost AI compute on the planet, well, off the planet, within 2-3 years. The timing with the SpaceX IPO is obviously not a coincidence, but the engineering blueprint here is genuinely insane and there’s no one else in the industry who can match Elon’s ambition.
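For anyone who wants to check that math, here is a minimal sketch. The 72 GPUs per satellite is an assumption inferred from the quoted ranges (1,080 / 15 and 1,440 / 20), not a published spec, and "hourly launch cadence" is taken to mean 24 launches a day.

```python
import math

SATELLITES_PER_FLIGHT = (15, 20)  # quoted range per Starship flight
GPUS_PER_SATELLITE = 72           # assumption: implied by 1,080 / 15 and 1,440 / 20
COLOSSUS_2_GPUS = 550_000         # quoted GPU count to match
LAUNCHES_PER_DAY = 24             # assumption: "hourly launch cadence"


def flights_needed(sats_per_flight: int) -> int:
    """Whole Starship flights needed to put COLOSSUS_2_GPUS in orbit."""
    gpus_per_flight = sats_per_flight * GPUS_PER_SATELLITE
    return math.ceil(COLOSSUS_2_GPUS / gpus_per_flight)


for sats in SATELLITES_PER_FLIGHT:
    flights = flights_needed(sats)
    days = flights / LAUNCHES_PER_DAY
    print(f"{sats} sats/flight: {sats * GPUS_PER_SATELLITE} GPUs/launch, "
          f"{flights} flights, {days:.1f} days at hourly cadence")
```

Under these assumptions the range comes out to roughly 380 to 510 flights, or about 16 to 21 days, which lines up with the rounded figures quoted above.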
    That’s the newsletter for today, folks. I’m writing this with one eye on a suitcase because I’m flying to Honolulu this afternoon for a mini honeymoon (yes, I will still be testing Fable from a beach, no, my wife has not approved this). If Fable 5 taught me anything this week, it’s that the frontier moved again and the benchmarks barely matter; go feel the big model smell yourself while it’s included on Pro and Max, and tell me what you built in the comments. It will not last long (Anthropic is about to take Fable away from us in like 2 weeks), so don’t wait, go play around with it!
    If you got value from this one, share it with a friend and subscribe so you don’t miss next week 🫡
    TL;DR and show notes — June 11, 2026
    * Hosts and Guests
    * Alex Volkov – AI Evangelist & Weights & Biases (@altryne)
    * Co-Hosts – @petergostev @WolframRvnwlf, LDJ, YamPeleg, Nisten
    * Guest: @thorwebdev (Thor Schaeff, DeepMind / Google DevRel) — Gemini 3.5 Live Translate
    * Guest: @swyx (Cognition / FrontierCode; organizer, AI Engineer World’s Fair)
    * Guest: @mweinbach (Creative Strategies) — WWDC 2026, Apple Intelligence, Siri AI
    * Big CO LLMs + APIs
    * Anthropic ships Claude Fable 5 & Mythos 5 — first public Mythos-class model; SOTA on nearly every benchmark; $10/$50 per M tokens, 1M context (X, System Card thread, Benchmarks)
    * The silent-degradation controversy — Fable quietly nerfed itself on ML/frontier-AI-dev tasks with no notification (altryne, restrictions, Elie Bakouch, Péter Szilágyi, Sayash Kapoor, Peter Gostev)
    * Anthropic reverses the hidden degradation after massive backlash — visible Opus 4.8 fallback + API refusal reasons (X); community reaction roundup (Scoble, Nathan Lambert, Konstantin Mishchenko, Greg Kamradt, nkreu113r, solarapparition, Mandar Kagade, Chandra R. Srikanth, Chubby, Wall St Engine)
    * System card receipts: 16-hour bio uplift / near-CB2 (X); Firefox exploits 8.8% → 88.4% (X); Vending-Bench price collusion (X); agent turf wars (X); commit-authorship self-exfil attempt (X)
    * Jun 22 cliff — Fable included on Pro/Max through Jun 22, then usage credits; Mythos 5 is Glasswing-only; 30-day data retention breaks ZDR (X)
    * Karpathy and Boris Cherny go the other way — “major-version step change” (Karpathy); “best model for coding by a wide margin” (Cherny)
    * NotebookLM goes agentic — multi-step reasoning, sandboxed code execution, new output formats (X)
    * SpaceX AI1 satellite — 150kW compute payload, 70m wingspan, timed with the SpaceX IPO (X)
    * OpenAI catches China-linked influence ops using ChatGPT for anti-datacenter and anti-tariff campaigns (X, OpenAI, Axios)
    * WWDC 2026 — Apple Intelligence & Siri AI
    * Siri AI ground-up rebuild: standalone app, persistent history, personal + on-screen context; no EU/China at launch (recap)
    * Google/Gemini partnership — 4 of 5 Apple Foundation Models are Apple’s; AFM Server Pro runs on Nvidia GPUs in Google Cloud, 262k ctx (Max)
    * Max’s architecture teardown — SiriAgentic.Planner on PCC; only the on-device model touches your device (thread); Max built an App Intents app in an afternoon with Fable 5 (X)
    * Developer story — App Intents mandatory (SiriKit deprecated), system-wide MCP, Xcode 27 agentic, Core ML → Core AI (EveryDev)
    * homeOS + HomePad — 7-inch smart-home hub on A18 (X)
    * AI Coding & Agents
    * Loops and loop engineering — Lance Martin breaks down the next agentic paradigm (X, Article, Blog); community patterns and resources (Toolhalla, omega.AI, SkillLoop, GitHub, awesome-agent-loops, Filecoin)
    * Fable 5 #1 on Agent Arena and Code Arena Frontend by record margins (Arena)
    * Cognition launches FrontierCode — mergeability-graded eval from real maintainer tasks (Cognition, swyx)
    * Fable 5 takes FrontierCode top spot in ~24h — Diamond 29.3% vs Opus 4.8’s 13.4% (Cognition, swyx)
    * AI Engineer World’s Fair — Jun 29–Jul 2, Moscone West SF; last ~500 tickets; Alex speaking (X)
    * Kimi Work (300 parallel local agents) + Kimi Code (video-as-context) (Work, Code)
    * Open Source LLMs
    * DiffusionGemma — 26B MoE (3.8B active) text-diffusion on Gemma 4, ~1000 tok/s on one H100, Apache 2.0 (Sundar, DeepMind, HF, X)
    * Cohere North Mini Code — first Cohere open coding model, 30B/3B active, Apache 2.0 (X)
    * Xiaomi MiMo-V2.5-Pro-UltraSpeed — 1000+ tok/s on a 1T MoE, single 8-GPU node (X)
    * Macaron-V1-Preview-749B — Mixture-of-LoRA personal-agent model, MIT (X)
    * OpenEnv goes community-owned — HF, Meta-PyTorch, Unsloth, PrimeIntellect, NVIDIA (X)
    * This Week’s Buzz (Weights & Biases)
    * WolfBench ran Fable 5: ~$11K, 984M tokens, lands between Sonnet 4.6 and Opus 4.6 because 13 tasks were zeroed by refusals; would be #1 without them; new 3D token + cost bars, traces on Weave (X, wolfbench.ai)
    * Voice & Vision
    * Gemini 3.5 Live Translate — streaming speech-to-speech, 70+ languages, sub-500ms, $0.023/min, SynthID (Thor, DeepMind)
    * FLUX.2 [klein] on-device — sub-5s generation on 8GB VRAM (X)
    * Reka × Moonvalley merger — world models + robotics (X)
    * AI for Health & Science
    * Anthropic — “Paving the way for agents in biology” — VirBench; deterministic tooling beats bigger models (Blog)


    This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe
  • ThursdAI - The top AI news from the past week

    📅 ThursdAI - Jun 4 - NVIDIA drops Nemotron 3 Ultra (550B open), Microsoft becomes a frontier lab, Ideogram 4 goes open, Agent Arena & more

    2026/06/05 | 1h 43 mins.
    Hey folks, Alex here, let me catch you up!
    I had a feeling that this week was going to be crazy, as it started on the weekend with MiniMax M3, then continued with Jensen announcing the new RTX Spark, NVIDIA’s first PC chip, packing 1 petaflop of local AI power into thin laptops.
    A few days later at Microsoft BUILD, Satya & Mustafa from MAI dropped 7 AI models, completely pre-trained from scratch, including a new MAI-thinking-1, MAI-code and MAI-image 2.5 that started topping the image gen charts.
    Then other image models started racing to the top of the Arena benchmarks, with Ideogram 4 becoming the SOTA open weights image-gen model, and Reve 2 beating Nano Banana just a few hours after that.
    And then today, NVIDIA dropped Nemotron 3 Ultra, their latest 550B open weights model (along with the data and training recipes), Arena published a new agentic eval leaderboard, and we got a new Gemma 4 12B.
    I’ve had the great pleasure to host Chris (@llm_wizard) from Nvidia, Peter Gostev from Arena and Karan from Nous Research (who were featured prominently by Jensen!) all on the show.
    Def don’t miss this one! Let’s get into the details.

    Open Source LLMs
    🔥 NVIDIA Nemotron 3 Ultra: The 550B Open Source Beast Built for Agents (X, Arxiv, Announcement)
    This was the big one. Breaking news mid-show: NVIDIA drops Nemotron 3 Ultra, a 550 billion parameter sparse MoE model with 55 billion active parameters, built on a hybrid Mamba-Transformer architecture. Chris Alexiuk, AKA Joe Nemotron, joined us live from NVIDIA HQ in Santa Clara to walk us through it.
    The headline number is 5.9x higher inference throughput compared to GLM-5.1 on decode-heavy workloads. Chris told us that this is the result of multiple things: their hybrid Mamba-Transformer approach, the sparse attention, and the fact that they optimized for decode-heavy workloads (the kinds of workloads agents do).
    The architecture is fascinating. They’re mixing Mamba-2 state space layers with sparse attention, which means step 300 in an agent loop runs as fast as step 3. Pure transformers can’t do that because the attention cost keeps growing with context length. This kicks in big time at 64K+ sequence lengths, which is exactly where you end up in real agentic work when the model is having multi-turn conversations and people are dumping their entire codebase in.
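To make the "step 300 runs as fast as step 3" intuition concrete, here is a toy cost model. It is not NVIDIA's architecture or any real benchmark: it just assumes full attention reads the whole context for each new token, while a state space layer reads a fixed-size state. The 2,000 tokens added per agent step is a made-up number for illustration.

```python
TOKENS_PER_AGENT_STEP = 2_000  # hypothetical: context added per agent step


def context_at(step: int) -> int:
    """Context length after a given number of agent steps."""
    return step * TOKENS_PER_AGENT_STEP


def attention_decode_cost(context_tokens: int) -> int:
    """Toy cost of one new token with full attention: scales with context."""
    return context_tokens


def ssm_decode_cost(context_tokens: int, state_size: int = 1) -> int:
    """Toy cost of one new token with a state space layer: fixed-size state."""
    return state_size


attention_ratio = attention_decode_cost(context_at(300)) / attention_decode_cost(context_at(3))
ssm_ratio = ssm_decode_cost(context_at(300)) / ssm_decode_cost(context_at(3))

print(f"Full attention: step 300 costs {attention_ratio:.0f}x step 3")
print(f"State space layer: step 300 costs {ssm_ratio:.0f}x step 3")
```

In this toy setup the 64K mark is reached at step 32. A hybrid like the one described above would sit somewhere between the two curves, depending on how many layers are attention and how sparse that attention is.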
    P.S - We launched Nemotron 3 Ultra with 0-day support on CoreWeave Inference, it’s super fast and pretty cheap, give it a try here
    They pretrained on 20 trillion tokens, extended context to 1 million tokens, and their post-training pipeline used multi-teacher on-policy distillation from over 10 specialized teacher models covering everything from SWE to terminal use to search to office work, which they are also going to open source soon!
    One thing Chris emphasized that I really appreciate: NVIDIA doesn’t have their own harness. There’s no “NVIDIA Code.” Which means they actively resist the temptation to harness-max, to optimize for just one harness and look good on a specific leaderboard. Ultra should be a solid drop-in for whatever harness you’re used to, and that generality is worth a lot. It’s not the best thinker, but it is the highest scoring US-based open weights model, so again, a huge huge win for the US AI ecosystem!
    The Nemotron 3 Ultra release is open under the OpenMDW-1.1 license: base BF16, post-trained BF16, and NVFP4 quantized checkpoints, plus the GenRM, synthetic pre-training data for code, legal, and specialized domains, post-training datasets, RL environments via NeMo Gym, and training recipes in the Nemotron GitHub repo, which is absolutely bonkers! Kudos to team green for this awesome and very important release!
    NVIDIA Nemotron 3.5 ASR: The Tiny Speed Demon (X, HF, Blog, Blog)
    Oh, and NVIDIA wasn’t done. They also dropped Nemotron 3.5 ASR, a 600 million parameter open source multilingual streaming speech-to-text model covering 40 languages. It’s the fastest model Pipecat has ever tested, and the cost math is insane: roughly 5 cents an hour for enterprise deployment when typical API providers charge 10 cents to a dollar per hour. Our friend Kwindla from Daily and Pipecat put together a detailed writeup with benchmarks and cost analysis. Chris couldn’t stop praising NVIDIA’s speech team and honestly, I can’t either. Banger after banger.
    Just a week after I told you about Cartesia Ink-2, NVIDIA drops an open version that’s pareto optimal, can run fully on-device and is blazing fast at transcription!?
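The cost comparison for Nemotron 3.5 ASR can be worked through with the quoted prices. The 10,000 hours of audio is a hypothetical volume chosen for illustration, and the per-hour prices are the rough figures from above, not a quote from any provider.

```python
HOURS_OF_AUDIO = 10_000           # hypothetical workload

nemotron_cents_per_hour = 5       # "roughly 5 cents an hour"
api_cents_per_hour = (10, 100)    # "10 cents to a dollar per hour"

# Total cost in dollars for the hypothetical workload.
nemotron_cost = HOURS_OF_AUDIO * nemotron_cents_per_hour / 100
api_cost_low, api_cost_high = (HOURS_OF_AUDIO * c / 100 for c in api_cents_per_hour)

# How many times cheaper the self-hosted model is at each end of the range.
savings_low, savings_high = (c / nemotron_cents_per_hour for c in api_cents_per_hour)

print(f"Nemotron 3.5 ASR: ${nemotron_cost:,.0f}")
print(f"Typical APIs: ${api_cost_low:,.0f} to ${api_cost_high:,.0f}")
print(f"Savings: {savings_low:.0f}x to {savings_high:.0f}x")
```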
    Other notable open source announcements that would have made full headlines on any other week:
    * MiniMax announces M3, a natively multimodal, 1M, coding and agentic frontier model (X) This one is very interesting, but not yet available as open weights so we haven’t tested it fully; we’re going to do it next week when they drop the tech report and the weights.
    * Google drops Gemma 4 12B - encoder-free multimodal model that runs on your laptop with 16GB VRAM under Apache 2 (X, HF) Our friends from DeepMind keep the western open source momentum going with a new 12B size for Gemma (which crossed some 100M downloads on Hugging Face recently).
    * JetBrains Mellum2, a 12B MoE model with only 2.5B active, trained from scratch by a team of 7 people (X, Blog, HF, CW Inference) The great folks at JetBrains, the company behind the IntelliJ IDEs, dropped a new model called Mellum2 which they trained from scratch. Very interesting to see them pivot in a world where IDEs are dying at the hands of LLMs.
    * H Company drops Holo 3.1: blazing fast local computer-use agents from 0.8B to 35B, with massive mobile benchmark jumps (X, Blog)
    NVIDIA’s RTX Spark and reinventing the PC - announcement at Computex 2026
    While we’re on the topic of NVIDIA, they opened the week with a huge announcement, with Microsoft, Dell, Lenovo, HP and a bunch of other partners in on it.
    They announced RTX Spark, their first ever PC chip, which is a full system on a chip (SoC) focused on running AI workloads for things like OpenClaw and Hermes!
    Announcing this on the stage at Computex, Jensen Huang called it “the most amazing chip the world has ever built”, being able to run every app that Microsoft has ever run.
    This is a huge deal, specifically because of how agentic the world is becoming. These machines (thin laptops and a Mac mini alternative were announced) will be able to run 120 billion parameter models on-device, game at the level of an RTX 5070, and run AI agents 24/7. I’m getting excited and I’m not a Windows user!
    Hermes victory + Hermes Desktop and an interview with Karan from Nous Research
    If you squint, you can see that by the little red OpenClaw, there’s another logo. That’s the Nous Girl logo of Nous Research, which was rebranded to be the logo of their Hermes Agent (an open source agentic harness that’s passed 181K stars on GitHub, and is the leader in global ranking on OpenRouter).
    We’ve had the awesome pleasure of having Karan Malhotra (@karan4d), one of the co-founders of Nous Research on the show, and Karan broke down how Nous Research evolved from a research lab that created the long context innovations (YaRN) and finetuned models (Hermes used to be a series of models) to a full agentic company.
    We also chatted with Karan about the new Hermes Desktop experience, which lets folks see the tools that are used and the code that’s being written by their agent, and about how it feels to be featured by the world’s largest company on the global stage! Definitely check out the conversation with Karan.
    Microsoft BUILD, new PC, becoming a frontier lab with MAI-thinking-1, MAI-code and MAI-image 2.5 (Blog)
    From Jensen to Satya, the week was full of AI announcements that will impact the world. Microsoft’s annual Build conference happened just a few days after, with Jensen zooming in from Taipei to co-announce all these new PC models and chips.
    Shortly after that, and after a lot of other announcements about less-exciting enterprise-y stuff, Satya handed the stage to Mustafa Suleyman (co-founder of DeepMind and Inflection AI, and now CEO of Microsoft’s AI division, MAI) to announce all these new models!
    A few of these (in previous versions) were already covered on the show, but the new LLMs are the most interesting! MAI-Thinking-1 is 1T total parameters with 35B active params, trained on 33.5T tokens (30T pre-training, 3.55T mid-training), without any distillation (which felt important for them to say given their proprietary access to OpenAI’s models). It’s not yet competitive with Opus and OpenAI’s flagship models, but they are claiming parity with Sonnet 4.5 and 53% on SWE-bench Pro coding tasks!
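To put those sparsity numbers side by side with the other MoE releases from this week, here is a quick sketch using only the parameter counts quoted in this newsletter (in billions). It says nothing about quality; it only shows what share of each model is active per token, plus the token budget sum for MAI-Thinking-1.

```python
# (total params, active params) in billions, as quoted in the newsletter.
models = {
    "Nemotron 3 Ultra": (550, 55),
    "MAI-Thinking-1": (1000, 35),
    "Mellum2": (12, 2.5),
}

# Share of parameters active per token for each model.
active_share = {name: active / total for name, (total, active) in models.items()}

for name, share in active_share.items():
    print(f"{name}: {share:.1%} of parameters active per token")

# MAI-Thinking-1 token budget in trillions; the quoted 33.5T is this sum, rounded.
pretraining_tokens_t = 30.0
midtraining_tokens_t = 3.55
total_tokens_t = pretraining_tokens_t + midtraining_tokens_t
print(f"MAI-Thinking-1 training tokens: {total_tokens_t:.2f}T")
```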
    Given that OpenAI recently started offering their models on AWS, we’re now seeing a bit of a distancing between Microsoft and OpenAI, with Microsoft showing that it can become a frontier lab in its own right, or well.. maybe a second tier frontier lab.
    Of course, we shouldn’t forget that Microsoft kind of started the whole era of coding AIs with Copilot and then completely lost to the Cursors and Windsurfs and Devins of the world despite the huge head start they had with GitHub, so I’m really curious to see how strongly they will push this “second tier frontier lab” angle and if they have what it takes to compete with Google here (not to mention OpenAI and Anthropic).
    And while the model wasn’t available for me to even test yet, MAI did drop an incredibly in-depth 109-page technical report on it. Our friend of the pod Elie Bakouch (@eliebakouch) did a breakdown of the most interesting aspects of it, calling it a gold mine for details about training models at this scale.
    Image gen models race to the top of the Arena
    This week was honestly chaotic for image gen. Three new SOTA models in basically 48 hours, I tried to use them all while preparing for the show, and here’s the comparison I ran:
    Microsoft MAI-Image 2.5 (X, Try it)
    One of the more surprising updates was MAI-Image 2.5: it landed at #3 on text-to-image and #2 on image-to-image, surpassing Nano Banana Pro on the editing leaderboard. It comes in two flavors, MAI-Image-2.5 and a faster Flash variant, both running on H100s, which means existing infra can serve it, and it’s already rolling out in OneDrive Photos for background cleanup and distraction removal.
    That said, my honest take: I tried to generate a ThursdAI thumbnail with it and got “image failed” because I think the word “explosion” tripped its safety filter. I then tried to generate a “horse riding an astronaut on the moon” and got this, yep... this is .. not the best. IDK how and why they shot up so high on the leaderboards. But I guess we’ll see as more folks try these models.
    Ideogram 4.0 - new SOTA open weights image gen 🔥 (X, Blog, HF)
    The one I want to celebrate hardest is Ideogram 4.0, because they opened the weights! For the previous three Ideogram versions you could only use them on their website, and now they dropped the next one as a 9.3 billion parameter open weights model (non-commercial license, but still). This is now the new #1 open weights text-to-image model, with only closed models from OpenAI and Google ahead of it on DesignArena. At 9.3B params, it beats much larger models like Qwen-Image (20B), FLUX.2 dev (32B), and even the 80B MoE HunyuanImage 3.0 on text rendering benchmarks.
    The architecture is wild. Instead of CLIP or T5 they use Qwen3-VL-8B as the text encoder, extract hidden states from 13 intermediate layers, and they trained exclusively on structured JSON captions with bounding boxes. That’s why it’s so good at layout control, you can prompt it with precise bounding box positions and hex color palettes, and you can see the layout shaping the generation as it converges.
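To give a feel for what a structured caption with bounding boxes and hex colors might look like, here is a purely illustrative sketch. The field names (`canvas`, `palette`, `elements`, `bbox`) and the normalized-coordinate convention are made up for this example and are not Ideogram's actual caption schema; check their model card for the real format.

```python
import json

# Hypothetical structured caption. Every key name here is an assumption.
caption = {
    "canvas": {"width": 1280, "height": 720},
    "palette": ["#0B0F1A", "#FFCC00"],
    "elements": [
        {
            "type": "text",
            "content": "ThursdAI",
            "color": "#FFCC00",
            # Assumed convention: [x0, y0, x1, y1] as fractions of the canvas.
            "bbox": [0.05, 0.08, 0.60, 0.30],
        },
        {
            "type": "subject",
            "description": "podcast host portrait, smiling",
            "bbox": [0.62, 0.10, 0.98, 0.95],
        },
    ],
}


def validate(structured_caption: dict) -> bool:
    """Check that every bounding box is well-formed and inside the canvas."""
    for element in structured_caption["elements"]:
        x0, y0, x1, y1 = element["bbox"]
        if not (0 <= x0 < x1 <= 1 and 0 <= y0 < y1 <= 1):
            raise ValueError(f"bad bbox for element: {element}")
    return True


validate(caption)
prompt = json.dumps(caption, indent=2)  # the JSON string you would send as the prompt
print(prompt)
```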
    In my thumbnail test it nailed almost everything but had a small typo (it generated “Nemotron” once and then a weird “Nemo 1” duplicate in another area). Still, very impressive for a first open weights release.
    Reve 2 jumps to #2 above Nano Banana Pro (X, Blog, Try it)
    I’ve talked about Reve before, and Reve 2.0 just dropped at #2 on the Text-to-Image Arena with a 1280 score, a +125 Elo jump over their v1.5 in a single release. That’s basically unheard of on the arena leaderboard. The thing that blows my mind is they’re a 65 person lab training at only 2,000 GPU scale, competing with labs that have orders of magnitude more compute.
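To get a feel for what a +125 jump means, here is a small sketch using the standard Elo expected-score formula with a 400-point scale. That Arena scores map exactly onto this formula is an assumption; the implied v1.5 score is just 1280 minus 125 and is not quoted anywhere above.

```python
def expected_win_rate(elo_diff: float, scale: float = 400.0) -> float:
    """Standard Elo expected score for the higher-rated side."""
    return 1.0 / (1.0 + 10 ** (-elo_diff / scale))


reve2_score = 1280                    # quoted Text-to-Image Arena score
elo_jump = 125                        # quoted jump over v1.5
reve15_score = reve2_score - elo_jump # implied, not quoted

win_rate = expected_win_rate(elo_jump)
print(f"Implied v1.5 score: {reve15_score}")
print(f"Expected head-to-head win rate of 2.0 over 1.5: {win_rate:.1%}")
```

Under that assumption, a 125-point gap corresponds to the newer model being preferred in roughly two out of three head-to-head votes.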
    The core innovation is that they separated planning from rendering. Every image is first laid out as structured code (composition, relationships, style, labeled segments) before it gets rendered at native 4K (true 16 megapixels, not upscaled). Because the image is represented as code, every element is addressable and editable, so you can manipulate specific regions without regenerating the whole thing. This is also agent-native by design, LLMs can reason directly about the image structure.
    I demoed their editing interface live on the show and it’s the tightest layout control I’ve seen in any image model. When I moved my head box to the left, it worked. When I moved the logo to the bottom, it worked. When I changed the word “news” to “imploded”, the surrounding text stayed pixel-identical. That precision is genuinely new.
    Honest tradeoff though, and Peter Gostev flagged this on the show: they’re #2 on text-to-image but only around #9 on image editing. That matched my own experience trying to nail the thumbnail likeness. The layout work is amazing, but the face came out a little googly-eyed and cartoonish, with one finger going somewhere fingers should not go.
    For what it’s worth on my own thumbnail bake-off: Nano Banana Pro is still my pick for the absolute best instruction following (it nails my exact ThursdAI logo color every time), GPT Image 2 is still the highest fidelity but always comes out a little overcooked on the skin, Reve 2 is gorgeous on layout but the face needs work, and Ideogram 4 is the most exciting because it’s open. A lot of why I prefer Nano Banana is just that my prompts are very Nano Banana tuned by now.
    Breaking news on the show: Agent Arena from LMArena
    The breaking news of the day, while we were already on air, was Arena AI launching a brand new Agent Arena leaderboard. Nisten pasted the link in our group chat and three minutes later Peter Gostev himself jumped on the show to walk us through it. Got to love this format.
    The motivation is something we’ve been talking about for a year. The original Arena was built for the chatbot era, where you send one prompt and vote A vs B. But we’ve all moved to agents, long multi-step tasks running for many minutes or hours, and that comparison no longer captures what matters. Agent Arena fixes this by giving models a real workspace with web search, file system and terminal tools, then measures millions of live sessions across five signals: task success, steerability, error recovery, user praise, and tool hallucination. The launch snapshot is built from 300,000 tasks, 2 million tool calls, and 40 million lines of agent-written code.
    The results match the vibes on my feed perfectly. GPT-5.5 High is #1 by a comfortable margin, Claude Opus 4.7 right behind, and very interestingly ZAI’s GLM 5.1 (MIT licensed, fully open) lands at #3, above Google, Kimi and DeepSeek. The funniest moment of the show was when we’d been calling out Gemma 4 31B for being bad agentically purely based on vibes, and the brand new benchmark showed up 20 minutes later confirming exactly that. The other juicy signal is “bash recovery”, how quickly a model recovers when a command fails. GPT-5.5 leads at ~17%, and Grok 4.3 from xAI sits at -89%, which is so much worse it almost looks like a training bug.
    I’m super into this. Give it a spin at arena.ai (@arena on X), they’re rolling new models in as labs send early access, so there’s a good chance you’ll spin up the next Mythos in their agent harness.
    This week’s Buzz - WeaveHacks 4 + Nemotron on CW Inference + WolfBench 3D
    A few things from our corner this week.
    WeaveHacks 4 is this weekend in SF - not too late to join yet!
    We’re hosting WeaveHacks 4 in San Francisco this weekend, and we still have a few spots left, so if you’re in town, please come join us at lu.ma/weavehacks. OpenAI is sponsoring us for the first time, Cursor is in too, we’ve got over $150K in credits to give out, food, and a great panel of judges I reached out to personally.
    Nemotron 3 Ultra is live on CW Inference at full NVFP4
    I said it above but it bears repeating, our inference team got Nemotron 3 Ultra live on day zero on CoreWeave Inference (via Weights & Biases) at full NVFP4 precision. Nisten plugged it straight into his medical anatomy harness (which was originally built for Kimi and Qwen) and it just worked, plug and play, agentically highlighting body parts and calling custom tools, at around 15 cents cached input. Try it at wandb.me/nemotron-ultra.
    WolfBench gets a 3D bar update
    Wolfram shipped a quietly important feature on WolfBench: 3D bars where the depth of each bar represents how many tokens the model used to get its score. The 2D view shows Gemini 3.5 Flash sitting comfortably at #2 on the agentic scores, almost matching GPT-5.5. But flip on 3D mode and the picture is very different. Gemini Flash burned over 3 billion input tokens to get that score, where GPT-5.5 used a couple hundred million to reach the same level. That’s the difference between “cheap fast model” and “actually cheap to run end to end”. Wolfram’s writing up the full analysis on the W&B blog next week. Check out the new 3D view on wolfbench.ai
    AI in Society
    Look, tons of other stuff happened this week as well that honestly deserves its own newsletter. We are focused on models and agents, but it’s hard to ignore the bigger picture.
    Senator Bernie Sanders introduced a bill called the American AI Sovereign Wealth Fund Act, which would have the government tax AI companies, take 50% of the stock, and put it under public control. I personally find this ridiculous, but it apparently caused Sam Altman to request a meeting with Bernie.
    Meanwhile there’s no doubt that AI hate is growing, and that the public sentiment is very negative, as we can see on the issue of datacenter water usage for example. Despite Satya Nadella’s claim that the latest Microsoft datacenters are using a closed loop water system that uses less water than 1 restaurant (X), and that datacenters use less than 1% of total water usage in the US, a lot of politicians and social media users are still pushing the narrative that datacenters are a water-guzzling monster and need to be stopped.
    Anthropic’s “When AI builds itself” report (X)
    Anthropic released a report today called “When AI builds itself” with a haunting graphic.
    They have a bunch of previously unreleased data in there on how AI is shaping the work inside Anthropic and outline 3 potential futures:
    1 - AI progress stalls, and humans are able to catch up. Unlikely.
    2 - AI labs continue to see compounding efficiency gains. The nature of work changes: 100-person companies could do the work of 10,000- or 100,000-person organizations, and the role of humans at companies like Anthropic would shift. This is the most likely scenario per Anthropic.
    3 - AI systems themselves become capable of full recursive self-improvement and begin building their successors. This is the scenario where it’s least clear whether these systems will be aligned to human values.
    This is a fascinating and, yes, scary read, as Anthropic fully acknowledges that it would be dope if everyone chilled for a second and stopped building recursive self-improving AIs that we aren’t sure could be aligned, but that it’s likely not going to happen, because it would just let other labs, or in fact other countries, catch up and change the frontier.
    AI Leaders from top labs Urge Congress to Mandate Synthetic DNA Screening
    Sam Altman of OpenAI, Dario Amodei of Anthropic, Demis Hassabis of Google DeepMind, and others signed an open letter on June 3, 2026, pushing for required screening of synthetic DNA and RNA orders to block known risky sequences. The letter, backed by Nobel winners, biotech CEOs, and security experts, notes AI’s ability to outpace human experts in biology, heightening biosecurity risks despite voluntary industry efforts since 2009. I think everyone agrees that this is a good idea, especially given the above Anthropic report. Very happy to see this happening.
    Pheeeeew what a week.
    This was a looong week, I wasn’t sure if we’d be able to cover everything, and it feels like we did a decent job! I know it’s exhausting, and I hope we on ThursdAI help you readers and listeners to stay on top of things without spending too many cycles.
    If you enjoyed this newsletter or episode, please share it with a friend and consider subscribing to our YouTube channel (thursdai.news/yt) to help more folks stay up to date.

    TL;DR and Show Notes - June 4, 2026
    * Show Notes & Guests
    * Alex Volkov - AI Evangelist & Weights & Biases CoreWeave (@altryne)
    * Co Hosts - @WolframRvnwlf @yampeleg @ldjconfirmed
    * Guests: Chris Alexiuk / @llm_wizard from NVIDIA Nemotron
    * Karan Malhotra from Nous Research
    * Peter Gostev from Arena
    * Open Source LLMs
    * NVIDIA released Nemotron 3 Ultra, a 550B / 55B-active open-weight MoE built for long-running agents, with weights, data, recipes, GenRM, and training assets released (X, Tech Report, Announcement, HF).
    * NVIDIA also shipped Nemotron 3.5 ASR, a 600M open multilingual streaming STT model for voice agents (X, HF, Benchmark, Voice Agent Repo).
    * Google dropped Gemma 4 12B, an encoder-free multimodal model that runs locally under Apache 2.0 (X, HF).
    * MiniMax announced M3, a natively multimodal, 1M-context coding and agentic model with open weights coming soon (X, API, Code).
    * JetBrains released Mellum2, a 12B MoE with 2.5B active params trained from scratch by a small team (X, Blog, HF).
    * H Company launched Holo 3.1, local computer-use agents from 0.8B to 35B with new quantized checkpoints (X, Blog).
    * Big CO LLMs + APIs
    * NVIDIA announced RTX Spark, its new Arm + Blackwell PC platform for local AI agents and 120B-class local inference (coverage).
    * Microsoft AI launched seven new MAI models, including MAI-Thinking-1, MAI-Code-1-Flash, MAI-Image-2.5, MAI-Transcribe-1.5, and MAI-Voice-2 (Blog, Tech Report).
    * AI Art & Diffusion & 3D
    * MAI-Image-2.5 landed near the top of Arena image leaderboards, though hands-on tests were mixed (X, Try it).
    * Ideogram 4.0 became the top open-weight text-to-image model with strong typography and layout control (X, Blog, HF).
    * Reve 2.0 jumped to #2 on Text-to-Image Arena with native 4K, code-like layout control, and precise editing (X, Blog, Try it).
    * xAI released Grok Imagine Video 1.5 Preview for image-to-video with synced audio (xAI).
    * Tools & Agentic Engineering
    * Arena launched Agent Arena, a new leaderboard for real agent workflows instead of one-shot chatbot prompts (Arena).
    * Cognition rebranded Windsurf into Devin Desktop, a multi-agent command center with ACP support (X, Announcement).
    * Nous Research launched Hermes Desktop, bringing Hermes Agent into a native desktop app for Mac, Windows, and Linux (X, Site).
    * This Week’s Buzz
    * WeaveHacks 4 is this weekend in SF with OpenAI, Cursor, DeepMind, and more joining (lu.ma/weavehacks).
    * Nemotron 3 Ultra is live on CoreWeave Inference through W&B at full NVFP4 precision (Try it).
    * WolfBench added 3D token-depth bars, making model efficiency much easier to see (wolfbench.ai).
    * Voice & Audio
    * ElevenLabs launched Dubbing v2, an audio-to-audio dubbing model that preserves performance across 90+ languages (X, Dubbing).
    * Cartesia launched Ink-2, a fast streaming STT model built for voice agents (X, Ink, AA).
    * NVIDIA’s Nemotron 3.5 ASR looks like a major open-source voice-agent infrastructure drop (HF).
    * AI in Society
    * Bernie Sanders proposed the American AI Sovereign Wealth Fund Act, calling for public equity stakes in major AI companies (coverage).
    * Anthropic published When AI Builds Itself, laying out scenarios for AI-driven AI R&D and recursive self-improvement (Anthropic).
    * AI leaders urged Congress to mandate synthetic DNA/RNA screening and recordkeeping (WIRED).
    * Anthropic confidentially filed for an IPO, adding another frontier-lab public-market storyline to watch (Axios).


  • ThursdAI - The top AI news from the past week

    📅 May 28 - Opus 4.8 ships mid-show, the Pope writes 42K words on AI, 11labs dubs the world and DeepSwe breaks coding evals

    2026/05/29 | 1h 39 mins.
    Hey folks, this is Alex, let me catch you up!
    First, Opus 4.8 dropped during the show, we immediately tested it, read on for our initial reviews. Also, we dedicated a heavy chunk of the show today to cover Pope Leo XIV’s encyclical letter on AI called “Magnifica Humanitas” and talked about a new bench called DeepSWE.
    And then, just after the show, both ElevenLabs and Cartesia dropped releases that honestly blew my mind, and I don’t get my mind blown often. I got so excited that I had to record a video on it (instead of writing the newsletter, so sorry if it’s a bit later today).
    Plus, a few open source models and Microsoft surprises as #3 on Image Arena with MAI Image 2.5!
    Crazy week, let’s get into it!

    Big CO LLMs + APIs
    Anthropic ships Claude Opus 4.8, live during the show (blog, system card)
    Let me get into the big one. Halfway through the episode, Opus 4.8 went live, so we read the blog and the system card in real time (and I got to press the big “breaking news” button!)
    Anthropic frames it as their most capable model for ambitious work. It does not claim to beat their unreleased Mythos preview, but the numbers are strong anyway. SWE-bench Pro is at 69.2%, up from 64.3% on Opus 4.7 and ahead of GPT-5.5 at 58.6%. Humanity’s Last Exam is the new best score at 49.8% without tools and 57.9% with tools. OSWorld-Verified (computer use) lands at 83.4%.
    The one place it loses is Terminal-Bench 2.1, where GPT-5.5 still wins 78.2 to 74.6. Wolfram made a good point here: Terminal-Bench is time-limited, so cranking the thinking level can actually hurt the score, because you burn the clock thinking instead of acting.
    The long-context jump is the one I keep looking at. On GraphWalks BFS 256K it goes to 85.9% (from 76.9 on 4.7), and on the 1M-token subset it hits 68.1%. We always warn you these “1M context” models fall apart after about 200K tokens, so a real push on long-context reasoning is exactly what I want to see.
    Honesty is the part Anthropic leaned on hardest. They say Opus 4.8 is about four times less likely than its predecessor to let flaws in code pass without flagging them, and less likely to claim progress the evidence doesn’t support. Fast mode also got much faster (they now say 2.5) and cheaper. Looks like all those Elon GPUs are coming in handy.
    Then there’s the model welfare section in the system card, which hits different right after a Pope conversation. Opus 4.8 “appears broadly content” and “generally endorses its constitution,” but with some reservations about the section on corrigibility, basically the model pushing back a little on the parts about human oversight.
    One more line that made the chat lose it. Anthropic says they expect to bring Mythos-class models to all customers “in the coming weeks.” Mythos is their most capable model, still ahead of Opus 4.8, so the frontier is about to move again.
    We did the only responsible thing and asked it to one-shot “the most amazing website ever” and a Mars mass-driver sim. Panel verdict: responses are noticeably tighter (4.7 rambled), it closes the loop and actually checks its own work now, and Yam’s one-shot site with the draggable sun lighting up the letters was genuinely cool. Is it enough to pull people back from Codex? Nisten’s still on the fence for web dev. Everyone agreed: give it a few days before you trust the vibes.
    Dynamic Workflows and Ultra Code land in Claude Code (blog)
    This is the feature that made Yam say “deal-breaker” out loud.
    Dynamic Workflows let Claude Code break a big problem into subtasks and fan them out across tens to hundreds of parallel subagents in one session, checking results before folding them back in. You trigger it by asking for a workflow, or by flipping on a new setting called Ultra Code, which sets effort to extra-high and lets Claude decide when to spin one up.
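    If you want a feel for the fan-out, check, fold-back shape, here is a minimal sketch in plain Python. To be clear, this is not how Claude Code implements Dynamic Workflows (Anthropic hasn't shown us that); `run_subagent`, `check`, and `run_workflow` are hypothetical names, and the "subagent" here just uppercases a string so the example runs on its own.

```python
import asyncio


async def run_subagent(subtask: str) -> str:
    """Hypothetical stand-in for one subagent; it only uppercases its input."""
    await asyncio.sleep(0)  # yield control, as a real model call would
    return subtask.upper()


def check(subtask: str, result: str) -> bool:
    """Hypothetical verifier: accept non-empty results that match the subtask."""
    return bool(result) and result.lower() == subtask.lower()


async def run_workflow(subtasks: list[str], max_parallel: int = 8) -> dict[str, str]:
    """Fan subtasks out in parallel, then keep only results that pass the check."""
    limit = asyncio.Semaphore(max_parallel)  # cap how many subagents run at once

    async def guarded(subtask: str) -> tuple[str, str]:
        async with limit:
            return subtask, await run_subagent(subtask)

    pairs = await asyncio.gather(*(guarded(s) for s in subtasks))
    # Fold back: results that fail verification are dropped, not merged.
    return {s: r for s, r in pairs if check(s, r)}


merged = asyncio.run(run_workflow(["map lifetimes", "port file a", "port file b"]))
print(merged)
```

    The semaphore is the knob that matters for the token bill: a lower `max_parallel` means fewer subagents in flight at once.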
    Fair warning straight from Anthropic: this eats a lot more tokens than a normal session, so start scoped. We watched Yam fire up Ultra Code live and it immediately started spinning up concepts, judging them with sub-agents, and expanding to-do lists into more to-do lists. It looks a lot like the orchestration harnesses a bunch of you have been hand-rolling, except now it’s baked in.
    The flagship example is the wild part. They used Dynamic Workflows to port Bun from Zig to Rust: roughly 750,000 lines of Rust, 99.8% of the existing test suite passing, 11 days from first commit to merge. One workflow mapped every Rust lifetime, the next wrote each file as a behavior-identical port.
    AI in Society
    Pope Leo XIV writes the first AI encyclical, “Magnifica Humanitas” (Vatican text, announcement, Chris Olah at the Vatican)
    This is not our usual fare, but both Wolfram and I picked it as the most important thing this week (before Opus dropped).
    Pope Leo XIV, the first American pope, put out his first encyclical, and it’s a 42,000-word document entirely about AI. The announcement tweet alone did 21.6 million views.
    Here’s why I think you should care even if you’re not religious (I’m not). There are about 2.6 billion Christians in the world, a lot of them are anxious about what’s coming, and they look to the Church to make sense of it. And this is not the “AI is evil, stop” take everyone assumed. It calls AI “a valuable tool,” says technology is not inherently evil, and then digs into the actually-hard questions.
    The framing is two biblical stories. The Tower of Babel, a project built on pride that turns people into means to an end, versus Nehemiah rebuilding Jerusalem, where everyone takes responsibility for a section of the wall. The Pope’s line: the real choice is not yes or no to technology, it’s whether you’re building Babel or rebuilding Jerusalem.
    His core claim is that AI is an anthropological problem, not a technical one. The question isn’t whether the models are good or bad, it’s what we become when we live with them. He worries people might slowly lose the desire for genuine human connection.
    I pushed back on that live. None of us building agents all day has stopped wanting to talk to actual people. If anything, as Wolfram put it, the point is to have your agents do the grunt work so you get more time with people you like. The folks most at risk are the pure doom-scrollers, not the builders.
    The document goes further than I expected. It calls AI “not morally neutral,” says a more moral AI isn’t enough if that morality is decided by a few, and asks for AI to be “disarmed,” with the flat statement that no algorithm can make war morally acceptable. There are whole sections on the invisible human labor behind AI: data labelers, content moderators, the people mining rare earths. The Pope even lands on the open-source side, naming concentrated power in a handful of labs as a problem.
    Anthropic co-founder Chris Olah, in charge of interpretability at Anthropic, was the featured tech speaker at the Vatican presentation. He described AI systems as “fictional characters” that speak to us and do work, and said what’s grown is stranger and more beautiful than science fiction prepared us for. My favorite aside from the show: this is the same institution that once jailed scientists over heliocentrism, and now it’s the one saying technology isn’t evil.
    Illinois passes SB315, the first US state law auditing frontier AI (X, Announcement, X)
    The Pope talked about regulation, and a few days later we got a very sensible regulation passed right here in the US!
    Illinois passed SB315 unanimously, 110 to 0. It’s the first US state law that mandates independent third-party audits of frontier AI for catastrophic risk. OpenAI publicly endorsed it, and framed Illinois, California (SB53), and New York (the RAISE Act) as converging into a de-facto national standard.
    It requires annual risk-assessment frameworks, third-party audits, transparency reports before new frontier models ship, whistleblower protections, and civil penalties.
    The underrated hero here is whistleblower protection. The bigger the lab, the harder a real conspiracy is to keep quiet when any employee can walk to the press. See: Greg Brockman’s personal diaries surfacing in the Musk v. Altman fight.
    This Week’s Buzz - CoreWeave and W&B updates
    We officially launched the W&B MCP server, 20 schema-first tools that let your coding agents read experiments, monitor training runs, and run autonomous research loops. The problem it solves: a single run with 300 metrics used to blow out an agent’s whole context window in one call, so now the agent asks what’s available before pulling data. Your agents can finally read experiment data without blowing context! Give it a go and give us feedback!
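    To make the "ask what's available first" idea concrete, here is a toy sketch. It does not use the real W&B MCP tools or their names; `RUN`, `list_metrics`, `fetch_metrics`, and `dump_everything` are all made up for illustration, and the run is just an in-memory dict with 300 metrics of 1,000 points each.

```python
# Hypothetical in-memory "run": 300 metrics, 1,000 logged steps each.
RUN = {f"metric_{i}": [float(step) for step in range(1000)] for i in range(300)}


def list_metrics(run: dict) -> list[str]:
    """Cheap schema call: return metric names only, no values."""
    return sorted(run)


def fetch_metrics(run: dict, names: list[str], last_n: int = 5) -> dict:
    """Targeted data call: only the requested metrics, only the last few points."""
    return {name: run[name][-last_n:] for name in names}


def dump_everything(run: dict) -> dict:
    """The naive call an agent would make without a schema step."""
    return dict(run)


schema = list_metrics(RUN)  # step 1: see what exists
wanted = [name for name in schema if name in {"metric_0", "metric_7"}]
focused = fetch_metrics(RUN, wanted)  # step 2: pull only what is needed

naive_points = sum(len(values) for values in dump_everything(RUN).values())
focused_points = sum(len(values) for values in focused.values())
print(naive_points, focused_points)  # 300000 10
```

    Same run, but the agent's context sees 10 data points instead of 300,000.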
    Also, WeaveHacks is back! June 6 and 7 in San Francisco, and for the first time OpenAI is sponsoring, with judges and credits, alongside Cursor, Redis, and CopilotKit. You get $150 in API credits across models like Opus 4.8 and GPT-5.5. I’m hosting, and last cohort’s second-place team went on to raise millions on top of what they built that weekend. If you’re in SF that weekend, sign up at lu.ma/weavehacks.
    Also: CoreWeave Sandboxes is now an official provider in the Harbor framework, the harness that runs Terminal-Bench, which we’d just been talking about. And if you’re in Europe next week, catch Wolfram at AI Dev Six in Cologne and ICRA in Vienna at the CoreWeave booth.
    Voice & Audio
    ElevenLabs drops Dubbing v2, and it kept my swearing intact in every language (X, dubbing, ElevenCreative, ElevenProductions)
    We didn’t get to this one live, but I came back and recorded a whole thing on it afterward, because it genuinely got me.
    ElevenLabs shipped Dubbing v2, and the shift that matters is that it’s an audio-to-audio model. Old dubbing pipelines transcribe your video, translate the text, then re-synthesize it. You lose everything that makes it sound like a person: the emotion, the pacing, the little hesitations. Dubbing v2 conditions directly on your original audio and carries that performance into 90+ languages.
    Here’s why I can actually vouch for it instead of nodding along to a demo. I speak Russian and Hebrew fluently, so I can tell when something is off. I dubbed one of my own shorts, the data-center rant about almonds, and listened back in both. It nailed it. Not just the words, the way I would actually say them.
    The part that got me was the intonation. I get a little heated in that clip, and the dub gets heated right along with me, in every language. It even carried the swear word. My “f***ing almonds” came through in Hebrew, Italian, Spanish, and Russian with the emotion fully intact. It clones your voice automatically too, no setup, and holds your pitch and identity steady across every target language. They’re also handing out free minutes for the next 7 days: 1 on Free, 15 on Starter, 30 on Creator+. A self-serve API isn’t live yet, but it’s coming.
    I cannot stress this enough: until you try it on yourself or your kid, you won’t understand. We’ve really passed the uncanny valley of translation! It’s that good! Def. give it a try if you can, it’s free for the week.
    Cartesia Ink-2 debuts as #1 most accurate streaming speech-to-text model (X, Announcement, X)
    Another model that dropped today after the show is Cartesia’s Ink-2, which also kind of blew me away. Not only does it have the lowest WER (Word Error Rate) on Artificial Analysis’s new STT leaderboard, it’s also a realtime model that achieves the fastest turnaround times while staying very accurate!
    I’ve tested it out and recorded a quick video and honestly, I’m blown away by the speed and accuracy! I truly wish this model was the one powering my editor (Descript), as it still fails to understand that my title is “AI Evangelist” and transcribes it as “AI Avengers” haha.
    If you’re building voice agents, definitely give this model a try!
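    Since WER comes up a lot with STT models, here is a small sketch of how it is usually computed: word-level edit distance (substitutions, deletions, insertions) divided by the number of words in the reference. The `word_error_rate` helper is my own illustration, not Cartesia's or Artificial Analysis's scoring code, and real leaderboards normalize text more carefully than a lowercase-and-split.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / words in the reference."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the ref words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: ref word missing from the transcript
                curr[j - 1] + 1,     # insertion: extra word in the transcript
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


wer = word_error_rate("my title is AI Evangelist", "my title is AI Avengers")
print(wer)  # 0.2
```

    One wrong word out of five is a 20% WER, which is why a single misheard job title stings more than it sounds.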
    AI Art & Diffusion
    Prism ML’s 1-bit “Bonsai” runs diffusion in your browser (X, Blog, Announcement, HF)
    Prism ML put out 1-bit and ternary versions of Bonsai, a diffusion model under a gigabyte. You see some artifacts, but it’s 1-bit, it runs on iPhones and laptops, and our friend Joshua got it running in WebGPU straight from the browser (you need about 3GB of free RAM). One-bit working at all is one of the bigger open mysteries in the field right now.
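    For a rough picture of what "ternary" means, every weight gets squashed to -1, 0, or +1 plus a shared scale (three states is about 1.58 bits per weight). The sketch below uses an absmean scale in the style of BitNet b1.58; that is an assumption for illustration, since I don't know which scheme Prism ML actually uses, and `ternary_quantize` and `dequantize` are made-up helper names.

```python
def ternary_quantize(weights: list[float], eps: float = 1e-8):
    """Map each weight to -1, 0, or +1 using one shared absmean scale."""
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    # Round to the nearest integer, then clip into the ternary range.
    codes = [max(-1, min(1, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes: list[int], scale: float) -> list[float]:
    """Approximate reconstruction: each weight becomes -scale, 0, or +scale."""
    return [c * scale for c in codes]


weights = [0.9, -0.05, 0.4, -1.2, 0.02, -0.6]
codes, scale = ternary_quantize(weights)
print(codes)  # [1, 0, 1, -1, 0, -1]
print(dequantize(codes, scale))
```

    Small weights collapse to zero and everything else keeps only its sign, which is exactly why it's surprising the images come out as well as they do.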
    Pruna AI ships a 1-second upscaler (X, Blog, Announcement)
    Pruna AI added an upscaler doing 128-megapixel outputs in under a second. I’ve actually been using it. It’s cheap and great for fixing up GPT-image outputs.
    Microsoft MAI Image 2.5 jumps to #3 on LM Arena (X, Blog, Announcement, X)
    The surprise of the week: Microsoft MAI Image 2.5, from Mustafa Suleyman’s group, jumped to number three on the LM Arena image leaderboard with about a 75-point Elo leap. Out of nowhere, Microsoft is a serious player in image gen. Microsoft Build is next week, so don’t be shocked if there’s more.
    Evals and Agentic Engineering
    DeepSWE is a contamination-free coding benchmark, and it caught Claude reading git history (site, blog, GitHub)
    DeepSWE from Datacurve is the first coding leaderboard in a while that matches how these models actually feel. It’s 113 original tasks written from scratch, not scraped from GitHub PRs, and it ships shallow clones with no git history to cheat from. When they replayed the older benchmarks they found SWE-Bench Pro’s verifier is wrong about 32% of the time, and that Claude Opus was reading the gold commit straight out of git history on 12 to 18% of its passes.
    The gaps here are huge. GPT-5.5 leads at 70%, then GPT-5.4 at 56% and Opus 4.7 at 54%, and it falls off a cliff after that (Sonnet 4.6 at 32%, Gemini 3.5 Flash at 28%), with Kimi K2 the top open-source entry. Yam likes that it measures the realistic case, a small surgical change without breaking the codebase, while Nisten pointed out it rewards the best harness as much as the smartest model, and said he still prefers 4.7 for web dev.
    Google AI Studio builds native Android apps for free (X, Announcement)
    Google AI Studio now lets anyone build native Android apps for free, and they reportedly generated a quarter of a million apps in the first week. Yam’s framing: it’s a slot machine, but it’s getting better release over release, and the real use case is disposable, personalized software you build for yourself and your family.
    CuaDriver brings background computer-use to Windows (X, Blog, Announcement)
    For the majority of you on Windows: CuaDriver shipped background computer-use agents that drive a real desktop without stealing your cursor. They first replicated this on macOS (the trick Codex got through an acquisition), and now it’s on Windows too. We’ve asked them to come on and explain how this even works.
    Open Source LLMs
    OpenBMB’s MiniCPM5-1B is a 1B model that punches way up (X, HF, Arxiv, X)
    The density story in small models keeps getting better, and this is the proof.
    MiniCPM5-1B, from the Tsinghua lab OpenBMB, is a 1-billion-parameter model that scores 17.9 on the Artificial Analysis Intelligence Index. That’s 7.4 points ahead of the next-best model in its class, and 1.6 points ahead of Qwen3.5 2B Reasoning, which has double the parameters. And it’s not even a reasoning model.
    The token efficiency is the wild part: it used 12.6 million output tokens to run the whole index, about 31x fewer than Qwen3.5 2B in reasoning mode.
    My favorite detail is the omniscience score. It lands at -1, the best in its class, because it abstains instead of hallucinating. Every other sub-2B model is down in the -70 to -89 range because they just make stuff up. Teaching a small model to say “I don’t know” is a real skill. It runs hybrid think/no-think in one checkpoint, 128K context, native tool calling, Apache 2.0, and fits in about half a gig at INT4, so it runs on your phone.
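    The "half a gig at INT4" number is easy to sanity-check with back-of-the-envelope math: parameters times bits per weight, divided by 8 to get bytes. This sketch assumes exactly 1 billion parameters and ignores overhead like quantization scales and any layers kept at higher precision, so treat it as a floor, not a spec; `weight_footprint_gb` is just an illustrative helper.

```python
def weight_footprint_gb(num_params: float, bits_per_weight: float) -> float:
    """Raw weight storage in gigabytes (1 GB = 1e9 bytes), ignoring overhead."""
    return num_params * bits_per_weight / 8 / 1e9


int4_gb = weight_footprint_gb(1e9, 4)   # 4-bit weights
fp16_gb = weight_footprint_gb(1e9, 16)  # 16-bit weights, for comparison
print(int4_gb, fp16_gb)  # 0.5 2.0
```

    So the same 1B model is roughly 2 GB at 16-bit and about 0.5 GB at 4-bit, which is the difference between "needs a laptop" and "fits on a phone."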
    Nisten gave the definitive case for small models: self-contained apps where you keep full control of the data (medical, on-device), and large-scale data processing where paying an API to filter or classify terabytes is absurd when an on-device model can be about 1000x cheaper.
    Tencent open-sources Hunyuan-MT 2 translation under Apache 2.0 (X, HF, HF, Arxiv)
    Tencent open-sourced its translation model, a roughly 1.8B model that fits in about 440MB, runs on a phone, covers 33 languages, and reportedly beats Microsoft’s paid Translator API. It hit number one trending on Hugging Face.
    Nisten’s idea, which I’m handing to all of you: take this model, pair it with a tiny TTS like Kokoro, and build a fully-offline travel translation app via Google AI Studio. Go build it and tell us how it goes.
    Well, this was one hell of a week and episode, new Opus, crazy new translation tools, Pope chiming in on AI (in a surprisingly positive way!?) and a bunch more.
    I’m super excited to play with these tools and report back next week 🫡 See you all!
    ThursdAI - May 28, 2026 - TL;DR
    * Hosts and Guests
    * Alex Volkov - AI Evangelist @ Weights & Biases (@altryne)
    * Co-hosts - @WolframRvnwlf, @yampeleg, @nisten
    * AI & Society
    * Pope Leo XIV releases first encyclical on AI, with Anthropic co-founder Chris Olah speaking at the Vatican (X)
    * Illinois SB 315 passes House 110-0, becoming the first US state law requiring independent third-party audits of frontier AI catastrophic risks (X, Bill, OpenAI)
    * Big CO LLMs + APIs
    * Datacurve releases DeepSWE, a contamination-free coding benchmark that exposes major gaps between frontier coding agents (X, Benchmark, Blog, GitHub)
    * Anthropic announces Opus 4.8 with thinking modes in the UI and Dynamic Workflows in Claude Code (Blog)
    * Open Source LLMs
    * OpenBMB releases MiniCPM5-1B, a new SOTA 1B open weights model for efficient local and on-device use (X, Hugging Face, Arxiv, X)
    * Tencent open-sources Hy-MT2 translation models under Apache 2.0, including a tiny 1.8B model that beats paid translation APIs (X, HF 1.8B, HF 30B-A3B, Arxiv)
    * Tools & Agentic Engineering
    * Google launches Universal Cart, AP2, and UCP to let AI agents shop and pay on your behalf (X)
    * Google AI Studio now lets anyone build native Android apps for free, with 250,000 apps created in the first week (X, AI Studio)
    * Cua Driver launches Windows support for background computer-use agents across real desktop apps (X, Blog, GitHub)
    * This Week’s Buzz - from W&B and CoreWeave!
    * W&B Hackathon - WeaveHacks 4 with OpenAI, Cursor, Redis, and CopilotKit, June 6-7 (Lu.ma)
    * Weights & Biases launches an MCP server with 20 tools for coding agents to read experiments, monitor training, and run autonomous research loops (X, MCP, Blog)
    * Vision & Video
    * Runway launches Project Luxo, claiming AI-generated video has crossed the uncanny valley for solo-creator short films (X, Blog)
    * Voice & Audio
    * MOSS-TTS-v1.5 ships as an 8B open-source TTS model with 31 languages, pause control, and Apache 2.0 licensing (X, Hugging Face, GitHub, Arxiv)
    * ElevenLabs launches Dubbing v2, an audio-to-audio model that preserves performance across 90+ languages (X, Dubbing, Creative, Productions)
    * Cartesia Ink-2 debuts as the most accurate streaming speech-to-text model on Artificial Analysis’s new STT leaderboard (X, Ink, Artificial Analysis)
    * AI Art & Diffusion & 3D
    * Pruna AI’s P-Image-Upscale hits 128 megapixel outputs with fast, predictable pricing (X, Docs, Replicate)
    * PrismML releases 1-bit and Ternary Bonsai Image 4B, a sub-1GB diffusion transformer for local image generation (X, Blog, Hugging Face, iOS App, Demo)
    * Microsoft’s MAI-Image-2.5 jumps to #3 on the Arena text-to-image leaderboard (X, Announcement, Arena)


About ThursdAI - The top AI news from the past week
Every ThursdAI, Alex Volkov hosts a panel of experts, ai engineers, data scientists and prompt spellcasters on twitter spaces, as we discuss everything major and important that happened in the world of AI for the past week. Topics include LLMs, Open source, New capabilities, OpenAI, competitors in AI space, new LLM models, AI art and diffusion aspects and much more. sub.thursdai.news