
ThursdAI - The top AI news from the past week

From Weights & Biases: join AI Evangelist Alex Volkov and a panel of experts covering everything important that happened in the world of AI over the past week
Latest episode

147 episodes


    🎂 ThursdAI — 3rd BirthdAI: Singularity Updates Begin with Auto Researcher, Uploaded Brains, OpenClaw Mania & NVIDIA's $26B Bet on Open Source

    2026/03/13 | 1h 38 mins.
    Hey, Alex here 👋 Today was a special episode, as ThursdAI turns 3 🎉
    We've been on air weekly since Pi Day, March 14th, 2023. I won't go too nostalgic, but I'll just mention: back then, GPT-4 had just launched with an 8K context window, could barely code, tool calls weren't a thing, and it was expensive and slow. And yet we all felt it: it had begun!
    Fast forward to today. This week we covered Andrej Karpathy's mini-singularity moment with AutoResearcher, a whole fruit fly brain uploaded to a simulation, and China's OpenClaw embrace, complete with thousand-person lines to install the agent. I actually created a new corner on ThursdAI, called Singularity Updates, to cover the "out of distribution", mind-expanding things happening around AI (or being enabled by AI).
    Also this week, we had three interviews: Chris from Nvidia came to talk to us about Nemotron 3 Super and NVIDIA's $26B commitment to open source; Dotta (anon), whose Paperclip agent orchestration project reached 20K GitHub stars in a single week; and Matt, who created the /last30days research skill. Plus a whole bunch of other AI news! Let's dive in.
    ThursdAI - Highest signal weekly AI news show is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.

    Singularity updates - new segment
    Andrej Karpathy open sources Mini Singularity with Auto Researcher (X)
    If there's one highlight this week in the world of AI, it's this. Andrej, who previously led the Autopilot program at Tesla and co-founded OpenAI, is now out there, in the open, just... doing stuff like inventing a completely autonomous ML research agent.
    Andrej posted to his almost 2M followers that he open-sourced AutoResearch, a way to instruct a coding agent to run experiments against a specific task, test a hypothesis, discard what isn't working, and keep going in a loop until... forever, basically. In his case, it was optimizing the training speed of GPT-2. He went to sleep and woke up to 83 completed experiments, with 20 novel improvements that stack on top of each other to speed up model training by 11%, reducing the training time from 2.02 hours to 1.8 hours.
    The thing is, this code was already hand-crafted and fine-tuned, and still, AI agents running in a loop were able to discover new and novel ways to optimize it.
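The loop itself is conceptually simple. Here's a minimal, purely illustrative Python sketch of the pattern (the `propose`/`evaluate` stand-ins are my own hypothetical toys, not Karpathy's actual code): propose an experiment, benchmark it, keep only changes that improve on everything kept so far, and repeat.

```python
import random

def autoresearch_loop(baseline_score, propose, evaluate, budget=83):
    """Illustrative sketch of the AutoResearch pattern: propose an
    experiment, evaluate it, keep only improvements that stack."""
    best = baseline_score
    kept = []                               # accumulated stacking improvements
    for _ in range(budget):
        experiment = propose(kept)          # agent drafts a change
        score = evaluate(experiment, kept)  # run the benchmark
        if score > best:                    # discard what's not working
            best = score
            kept.append(experiment)
    return best, kept

# Toy stand-ins: random "experiments" whose gains simply add up.
random.seed(0)
propose = lambda kept: random.uniform(-0.02, 0.02)
evaluate = lambda exp, kept: 1.0 + sum(kept) + exp
best, kept = autoresearch_loop(1.0, propose, evaluate)
```

The key property is that each kept change is validated against the stack of all previous changes, which is why the 20 improvements in Karpathy's run compound rather than conflict.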
    Folks, this is how the singularity starts. Imagine that all major labs are now training their models in a recursive way: the models get better, and get better at training better models! Reminder: OpenAI chief scientist Jakub predicted back in October that OpenAI would have an AI with junior-level research ability by September of this year, and it seems that... we're moving quicker than that!
    Practical uses of autoresearch
    This technique isn't just for ML tasks, either. Shopify CEO Tobi got super excited about this concept and, just as I'm writing this, posted that he set an AutoResearch loop on Liquid, Shopify's 20-year-old templating engine, with the task of improving efficiency. His loop achieved a whopping 51% reduction in render time, without any regressions in the test suite. This is just bonkers. This is a 20-year-old template engine used in production every day, and some LLM running in a loop just made it 2x faster to render, just because Karpathy showed it the way.
    I'm absolutely blown away by this. It isn't a model release like we usually cover on the pod, but it's still a significant "unhobbling" moment that is possible with the current coding agents and models. Expect everything to become very weird from here on out!
    Simulated fruit fly brains - uploaded into a simulator
    In another completely bonkers update that I can barely believe I'm sending, a company called Eon Systems posted that they have achieved a breakthrough in brain simulation: they were able to upload a whole fruit fly brain connectome, 140K neurons and 50+ million synapses, into a simulation environment.
    They have... uploaded a fly, and are observing 91% behavioral accuracy. I will write this again: they have uploaded a fly's brain into a simulation, for Christ's sake!
    This isn't just an "SF startup" either; the board of advisors is stacked with folks like George Church from Harvard, father of modern genome sequencing; Stephen Wolfram, who needs no introduction but is one of the top mathematicians in the world, and whose thesis is "brains are programs"; Anders Sandberg from Oxford; and Stephen Larson, who apparently already uploaded a worm's brain and connected it to Lego robots. These folks are gung-ho on making sure that at some point, human brains will be able to get uploaded, to survive the coming AI foom.
    The main discussion points on X were around the fact that there was no machine learning here: no LLMs, no attention mechanisms, no training. The behaviors of that fly were all a result of uploading a full connectome of neurons. This positions the connectome (the complete diagram of a brain, with its neurons and connections) as an analogue to a pre-trained LLM network for biological intelligence.
    I encourage everyone reading this to watch Pantheon on Netflix to understand why this is of massive importance. Combined with the above AutoResearch, things are going to go very fast here. The next step is uploading a mouse brain, which has 500x the neurons and 2,000x the synapses, but given the speed at which AI is improving, that's NOT out of the realm of possibility in the next few years!
    OpenClaw Mania Sweeps China: Thousand-Person Lines & Government Subsidies, Grandmas raising a “red lobster”
    They’re calling it “raising a red lobster” (养小龙虾). That’s the phrase that swept Chinese social media for what is, at its core, installing an open source GitHub project on your laptop. Grandmas are doing it. Mac Minis are sold out. A cottage industry of paid installers popped up overnight on Xiaohongshu, charging up to $100 for an in-person setup. And yes, there are now also people charging to uninstall it.
    On March 6th, roughly a thousand people lined up outside Tencent’s Shenzhen HQ for free OpenClaw installation. Appointment slots ran out within an hour. People brought NAS drives, MacBooks, mini PCs. Tencent engineers set up folding tables and just... started installing OpenClaw for strangers. I have pictures. I’m not making this up.
    All five major Chinese cloud providers jumped in simultaneously: Tencent Cloud, Alibaba Cloud, ByteDance Volcano Engine, JD.com Cloud, and Baidu Intelligent Cloud, each racing to offer one-click OpenClaw deployment. Why? Follow the money. Per HelloChinaTech, ByteDance, Alibaba, and Tencent spent roughly $60B combined on AI infrastructure. Chatbots don’t burn enough tokens to justify that spend. But a single OpenClaw instance runs 24/7 and consumes 10-100x more tokens per day than a chatbot user. Every install is round-the-clock API revenue. The cheaper the models get, the more people run agents, the more infra gets sold. Self-reinforcing loop.
    Local governments are pouring fuel on the fire. Shenzhen’s Longgang district is offering up to 2M yuan ($290K) per project. Hefei and Wuxi are going up to 10M yuan ($1.4M), plus free computing, office space, and accommodation for “one-person companies.” Meanwhile, China’s central cybersecurity agency issued TWO warnings, banning banks and state agencies from installing OpenClaw. So local governments are subsidizing it while the central authority is trying to pump the brakes. Peak 2026.
    Nearly half of all 142,000+ publicly tracked OpenClaw instances are now in China. OpenClaw is the most-starred GitHub repo in history, surpassing Linux's 30-year record in just 100 days. Device makers are piling on too — Xiaomi announced "miclaw" for smartphones, MiniMax built MaxClaw, Moonshot AI built a hosted version around Kimi.
    Now, Ryan was honest on the show and I want to echo that honesty here: OpenClaw is still hard to get working. There are many failure states. It’s not “install and go to the beach.” Wolfram compared it to Linux in the late ‘90s — painful to set up, but if you push through, you can see the future behind the friction. This is real technology with real limitations, and a lot of disappointed folks in China are watching tokens burn with no actual work getting done.
    But here’s the thing I keep coming back to. The memetic velocity of OpenClaw is unlike anything I’ve seen in tech. It’s not just a tool, it’s a concept that penetrated the cultural resistance to AI. People who are scared of terminals, people who’ve never touched GitHub — they’re standing in line for this. I broke through that resistance with my own fiancée. She’s now running two OpenClaws. Not enough for her. She needs another one.
    Every major US lab is watching this closely. OpenAI brought Peter Steinberger on staff. Perplexity just announced they’re building a local agent for Mac. Anthropic has Claude Cowork. This is where all of computing is headed — always-on, autonomous, personal AI that actually does things for you. OpenClaw is the first front door, not the final destination. But what a front door it is.

    Open Source: Nvidia Goes All In with Nemotron 3 Super 120B (X, Blog, HF)
    We had Chris Alexiuk from Nvidia join us — a friend from a dinner Nisten and I hosted in Toronto. Chris is basically “NeMo” embodied, sitting at the intersection of product and research, and he gave us the full breakdown on what might be the most complete open-source model release we’ve seen from a major lab.
    Here are the numbers: 120B total parameters, 12B active during inference (it’s a Mixture of Experts), 1 million token context window, and a hybrid Mamba-Transformer architecture they call “Hybrid Mamba Latent MoE with Multi-Token Prediction.” It’s hitting 450 tokens per second on the Terminal Bench leaderboard — faster than any other model on there. Modal is reporting over 50% faster token generation compared to other top open models.
    What Chris was emphatic about — and I want to highlight this — is that “most open” is a real designation here. They released the model checkpoint in three precisions (BF16, FP8, NVFP4), the base checkpoint before post-training, the SFT training data, and in a move that genuinely surprised people, pre-training data and a full end-to-end training recipe. You can, in theory, reproduce their training run. That’s rare. That’s a real commitment to open source.
    There’s also a huge piece of news in the background here: there’s a confirmed report that Nvidia will spend $26 billion over the next five years building the world’s best open source models. Jensen presumably has GTC remarks incoming on this. America is genuinely back in the open source AI race, and it’s Nvidia leading the charge. Chris has been in the open source world since the Hugging Face early days and said it feels genuine inside the company — not a PR exercise. And I tend to believe him. Now, all eyes are on GTC next week!
    I ran Nemotron 3 Super with my own OpenClaw instance yesterday via W&B inference and it's genuinely fast and capable. At $0.20/M input tokens and $0.80/M output tokens on W&B inference, it's not going to replace Opus for your hardest tasks — but for running an always-on agent that needs to be cost-efficient? It's an incredible option. More on that in this week's buzz section below.
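To make the cost-efficiency argument concrete, here's some back-of-the-envelope arithmetic at the quoted $0.20/M input and $0.80/M output pricing. The daily token volumes below are purely my own assumptions about a token-hungry always-on agent, not measured numbers:

```python
def daily_cost(input_tokens_m, output_tokens_m, in_price=0.20, out_price=0.80):
    """Dollars per day at per-1M-token prices. Defaults are the quoted
    W&B inference pricing for Nemotron 3 Super; volumes are assumptions."""
    return input_tokens_m * in_price + output_tokens_m * out_price

# Assume an always-on agent churns through 50M input / 5M output tokens a day.
cost = daily_cost(50, 5)  # 50 * 0.20 + 5 * 0.80 = 14.0 dollars/day
```

At hypothetical volumes like these, a 10x-pricier frontier model would turn a ~$14/day agent into a ~$140/day one, which is the whole argument for cheap open models in the heartbeat role.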
    Tools & Agentic Engineering
    Paperclip: Zero Human Companies, Now Open Source (Github)
    We had the anonymous Dotta on the show — the first anonymous AI-video-avatar guest to join ThursdAI — to talk about Paperclip, an open source agent orchestration framework that hit 20,000 GitHub stars in its first week. The premise is simple and audacious: build zero-human companies.
    Now, this may sound familiar, as we had Ben from Polsia on just two weeks ago with a similar concept, but Paperclip is an open-source project you can run right now on your own.
    The core "thing" that got me excited about Paperclip is that you can "hire" your own existing OpenClaw agents, or Cursor, or Codex, or whatever else, to play roles in this autonomous company. The premise is simple: you're the board of directors, you hire an AI agent CEO, and it then asks you whether it needs to "hire" more AI agents to do tasks autonomously. These tasks all live inside the Paperclip interface, and you or your agents can open them.
    The core concept of the whole system is the heartbeat: each agent receives its own instructions on what to do every time it is "woken up" by a timer. This is what drives the "autonomous" part of the whole thing, but it's also what eats the tokens up. Even when there's no work to be done, agents still burn tokens asking "is there work to be done?"
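To make the heartbeat pattern (and its token cost) concrete, here's a minimal hypothetical sketch. The function, agent structure, and token accounting are my own illustration, not Paperclip's actual API:

```python
def heartbeat(agents, task_queue, ticks=3):
    """Illustrative heartbeat loop: on every tick, each agent wakes with a
    blank slate, is re-fed its standing instructions, and checks for work --
    paying the context cost whether or not work exists."""
    tokens_burned = 0
    completed = []
    for _ in range(ticks):
        for agent in agents:
            # Crude proxy for context cost: instructions re-fed on every wake-up.
            tokens_burned += len(agent["instructions"].split())
            if task_queue:
                # Work found: "do" it (here, just record who took which task).
                completed.append((agent["name"], task_queue.pop(0)))
            # No work: the "is there work to be done?" check still cost tokens.
    return tokens_burned, completed

agents = [{"name": "ceo", "instructions": "review tasks and delegate"},
          {"name": "dev", "instructions": "implement assigned tickets"}]
burned, done = heartbeat(agents, ["ship landing page"])
```

Note that `burned` grows linearly with ticks and agents even when the queue is empty, which is exactly why an idle Paperclip company still costs real money.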
    Dotta gave us a great metaphor, asking if we'd seen the movie Memento, where the protagonist has lost his memory and wakes up every time with a blank slate, having to reconstruct his memories. AI agents are like the Memento man, and Paperclip is an attempt to give those agents the whole context so they can keep working on your tasks productively. Dotta told us the future of Paperclip is the ability to "fork" entire companies: structures that will actually run and do things on your behalf. Looking forward to that future, but for now I'll be turning off my Paperclip interface, as it's costing me real money without the need.
    Symphony: Agents Writing Their Own Jira Tickets
    We mentioned Symphony last week, and I texted Ryan the link before the show, and voila, of course, he set it up and went viral, yet again! We’re so lucky to have Ryan on the show to tell us from first hand experience what it’s like to run this thing.
    Symphony was open sourced by OpenAI last week, and it’s basically an instruction manual for how to run agents autonomously via Linear ticketing system. (Github)
    The highlight for Ryan: the whole system keeps running and creating pull requests while he's asleep, and at some point he noticed a ticket he didn't create. One of the agents had found a bug and created a very detailed ticket for him to approve.
    I’m just happy that I can keep even my co-hosts up to date hehe
    This week's buzz - we've got skills and Nemotrons!
    Look, we told you about Skills at the start of the year; since then, via OpenClaw, Hermes Agent, and Claude Code, they've exploded in popularity. One downside of skills is that it's very easy to make a bad one! So we're answering the challenge and publishing the official wandb skill 🎉
    Installing it is super simple: npx skills add wandb/skills, and voila, your agents are now officially "I know kung fu" pilled with the best Weights & Biases practices, for both Weave and Models 👏 Please give us feedback on Github if you've used the skills! (Github)
    Also, we've partnered with Nvidia to support the best US open source model on day 0, and we have Nemotron 3 Super on our inference service, for all to use at $0.20/1Mtok! It's super easy to set up with something like Hermes Agent or OpenClaw, and it runs really, really fast! Check it out here.
    Is it going to perform like Opus 4.6? No. But are you going to run Opus 4.6 at 20 cents per million? Also no.
    Gemini drops SOTA embeddings and gets dethroned 2 days later, live on the show
    This always happens, but I didn’t expect this to happen in a fairly niche segment of the AI world... multimodal embeddings!
    Gemini posted an update earlier this week with Gemini Embeddings 2.0, a SOTA embedding model that unifies image, text, video, and audio embeddings under one roof!
    Then, just as we launched the show, friend of the pod Benjamin Clavie dropped me a DM, basically saying that his company Mixbread was about to deploy an embedding model that would beat Gemini Embedding 2 on almost every benchmark on that table, and then... they did!
    The most notable (and absolutely crazy) jump in this comparison is on the LIMIT benchmark, where they achieved a 98% score vs Gemini's... 6.9 percent. I didn't believe this at first, but I asked Ben to explain the findings, and he did. Congrats to these folks for moving the search space forward every two days!
    Grok 4.20 in the API for $2/1Mtok
    Elon Musk's xAI finally released Grok 4.20 in the API. Look, I said what I said about xAI models: they are great for research and for factuality, but they aren't beating the major labs. The latest round of firings at xAI doesn't help either. This model wasn't "released" in any traditional sense: there are no benchmarks, no evals, and everyone who got access evaluated it and found it's no better than GLM5 on many benchmarks. So it does make sense to release it quietly.
    It is very fast though, and again, for research and for X access, it’s an absolute beast, so I’ll be trying this out!
    Parting thoughts and a small reflection: for the past 3 years, we've had a front-row seat to the singularity shaping up. Two and a half years ago, I went all in and decided to pivot into podcasting full time. In those years, ThursdAI became known; we've had guests from nearly all major AI labs (including Chinese ones, of which I'm particularly proud), I got to meet executives, ask leaders questions about where this is all going, and most of all, share this journey with all of you, candidly. We rarely do hype on the show, we don't speculate, and we try to keep a positive outlook on the whole thing and counter doomerism, as there's too much of that out there.
    I am very glad this resonates, and I continue to be thankful for your attention! If you want to give us any kind of birthday present, subscribe or give us a 5-star review on Apple Podcasts or Spotify; it'll greatly help other folks discover us.
    See you next week,
    Alex 🫡
    ThursdAI - Mar 12, 2026 - TL;DR
    * Hosts and Guests
    * Alex Volkov - AI Evangelist & Weights & Biases (@altryne)
    * Co Hosts - @WolframRvnwlf @yampeleg @nisten @ldjconfirmed @ryancarson
    * Chris Alexiuk from Nvidia (@llm_wizard) - Nemotron
    * @dotta - creator of Paperclip.ing AI agent orchestration framework
    * Matt Van Horn - @mvanhorn - creator of @slashlast30days
    * Singularity updates
    * Andrej Karpathy’s autoresearch achieves 11% speedup on GPT-2 training through autonomous AI agent experimentation (X, GitHub, GitHub)
    * Eon Systems uploads first complete fruit fly brain to a physics-simulated body, achieving 91% behavioral accuracy (X, Announcement, Announcement)
    * OpenClaw mania sweeps China as all five major cloud providers race to support it (HelloChinaTech, Reuters, SCMP, MIT Tech Review)
    * Big CO LLMs + APIs
    * xAI quietly releases Grok 4.20 API with massive 2M token context window and multi-agent capabilities (X, Blog)
    * Google launches Gemini Embedding 2, the first natively multimodal embedding model supporting text, images, video, audio, and PDFs in a unified vector space (X, Announcement)
    * Open Source LLMs
    * NVIDIA launches Nemotron 3 Super: 120B open MoE model with 1M context window designed for agentic AI at 5x higher throughput (X, Announcement)
    * MiroMind releases MiroThinker-1.7 and H1 - open-source research agents with 256K context, 300 tool calls, achieving SOTA on deep research benchmarks (X, HF, HF, HF)
    * Covenant-72B: World’s largest permissionless decentralized LLM pre-training achieves 72B parameters on Bittensor with 146x gradient compression (X, Arxiv, HF, HF)
    * Tools & Agentic Engineering
    * ACP is the open standard that lets any AI coding agent plug into any editor — and this week Cursor officially joined the registry, meaning you can now run Cursor’s agent inside JetBrains IDEs (JetBrains blog, Cursor blog)
    * This week's Buzz
    * W&B launches official Agent Skills for coding agents, turning experiment dashboards into terminal queries (X, Announcement, Announcement)
    * Video
    * LTX-2.3 — Lightricks open-source video model (GitHub, HF, Blog)
    * Voice & Audio
    * Fish Audio launches S2: Open-source TTS with sub-150ms latency and absurdly controllable emotion (X, HF, Blog, Announcement)
    * Show notes and links
    * Paperclip.ing by Dotta (@dotta) - Github
    * Last30days skill by Matt Van Horn Github
    * Agency Agents repo Github
    * OpenAI Symphony (Github)
    * Mixbread Embeddings (X)


    This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe

    ThursdAI - Mar 5 - OpenAI's GPT-5.4 Solves a 20-Year Math Problem, Anthropic Gets Designated a Supply Chain Risk, Qwen Drama Unfolds

    2026/03/06 | 1h 36 mins.
    Hey folks, Alex here, let me catch you up!
    The most important news of the week came today, mid-show: OpenAI dropped GPT 5.4 Thinking (and 5.4 Pro), their latest flagship general model, less autistic than Codex 5.3, with 1M context, /fast mode, and the ability to steer it mid-reasoning. We tested it live on the show; it's really a beast.
    Also, since last week, Anthropic said no to Department of War’s ultimatum and it looks like they are being designated as supply chain risk, OpenAI swooped in to sign a deal with DoW and the internet went ballistic (Dario also had some .. choice words in a leaked memo!)
    On the open source front, the internet lost its damn mind when friend of the pod Junyang Lin announced his departure from Qwen in a tweet, causing an uproar and prompting the CEO of Alibaba to intervene.
    Wolfram presented our new in-house wolfbench.ai and a lot more!
    P.S. - We acknowledge the war in Iran and wish for a quick resolution and the safety of civilians on both sides. Yam had to run to the shelter multiple times during the show.

    OpenAI drops GPT 5.4 Thinking and 5.4 Pro - heavy weight frontier models with 1M context, /fast mode, SOTA on many evals
    OpenAI actually opened this week with another model drop, GPT 5.3-instant, which... we can honestly skip; it was fairly insignificant, besides noting that this is the model most free users use. It is supposedly "less cringe" (actual words OpenAI used). We all wondered when 5.4 would drop, and OpenAI once again proved that we named the show after the right day. Of course it dropped on a ThursdAI.
    GPT 5.4 Thinking is OpenAI's latest "general" model, which can still code, yes (they folded most of Codex 5.3's coding breakthroughs in here), but it also shows an incredible 83% on GDPval (12% over Codex), 47% on FrontierMath, and an incredible ability to use computers and browsers, with 82% on BrowseComp, beating Claude 4.6 at lower prices than Sonnet!
    GPT 5.4 is also... quite significantly improved at frontend design? This landing page was created by GPT 5.4 (inside the Codex app, newly available on Windows) in a few minutes, clearly showing significant improvements in style.
    I also built it to compare prices. All three flagship models are trying to catch up to Gemini's 1M context window, and it's important to note that GPT 5.4, even at double the price past the 272K-token cutoff, is still... cheaper than Opus 4.6. OpenAI is really going for broke here, specifically as many enterprises are adopting Anthropic at a faster and faster pace (it was reported that Anthropic is approaching $19B ARR this month, doubling from $8B just a few months ago!)
    Frontier math wiz
    The highlight of the 5.4 feedback came from Polish mathematician Bartosz Naskręcki (@nasqret on X), who said GPT-5.4 solved a research-level FrontierMath problem he had been working on for roughly 20 years. He called it his "personal singularity," and as overused as that word has become, I get why he said it. I told you about this last week: we're on the cusp.
    Coding efficiency
    There are tons of metrics in this release, but I want to highlight this one: at first glance it may seem that on SWE-bench Pro this model is merely on par with the previous SOTA, GPT 5.3 Codex, but those dots are thinking efforts. At medium thinking effort, GPT 5.4 matches 5.3 at hard effort! This is quite remarkable, as lower thinking efforts use fewer tokens, which ultimately makes them cheaper and faster!
    Fast mode arrives at OpenAI as well
    I think this one is a direct "this worked for Anthropic, let's steal it": OpenAI enabled a /fast mode that... burns tokens at 2x the rate and prioritizes your requests at 1.5x the speed, essentially getting you responses faster (which was one of the main complaints about GPT 5.3 Codex). I can't wait to bring fast mode to OpenClaw with 5.4, which will absolutely come, as OpenClaw is part of OpenAI now.
    There's also a really under-appreciated feature here that I think other labs are going to copy quickly: mid-thought steering. OpenAI now lets you interrupt the model while it's thinking and redirect it in real time, in ChatGPT and on iOS. This is a godsend if you're like me: you send a prompt, see the model going down the wrong path in its thinking... and want to just... steer it without stopping!
    Anthropic is now designated as supply-chain risk by DoW
    Last week I left you with a cliffhanger: Anthropic had received an ultimatum from the Department of War (previously the Department of Defense) to remove their two remaining restrictions on Claude — no autonomous kill chain without human intervention, and no surveillance of US citizens. Anthropic's response? "We cannot in good conscience accede to their request."
    So much has happened since then: US President Trump said "I fired Anthropic," referring to his Truth Social post demanding intelligence agencies drop the use of Claude (which apparently was used in the war with Iran regardless); Sam Altman announced that OpenAI has agreed to the DoW's terms and will provide OpenAI models, causing a lot of people to cancel their OpenAI subscriptions, and later apologized for the "rushed rollout"; and Dario Amodei posted a very contentious internal memo that leaked, in which he called out the presidency, Sam Altman and his motives, and Palantir and their "safety theater," for which he later apologized.
    Honestly, this whole thing is giving me whiplash trying to follow, but here are the facts: Anthropic is now the first US company in history to be designated a "supply chain risk," which means no government agency can use Claude, and neither can any company that contracts with the DoW.
    Anthropic says this is illegal and will challenge it in court, while reporting $19B in annual recurring revenue, nearly doubling in the last three months and closely approaching OpenAI's $25B.
    Look, did I want to report on this stuff when I decided to cover AI? No... I wanted to tell you about cool models and capabilities. But the world is changing, and it's important to know that the US government now understands that AI is inevitable. I think this is just the first of many clashes between tech and government we'll see, and we'll keep reporting on both. (But let me know in the comments if you'd prefer just model releases.)
    OpenAI’s GPT-5.3 Instant Gets Less Cringe, Google’s Flash-Lite Gets Faster (X, Announcement)
    We also got two fast-model updates this week that are worth calling out, because these are the models that often end up powering real product flows behind the scenes. As I wrote above, OpenAI's instant model is nothing to really mention, but it's worth noting that OpenAI seems to have an answer for every Gemini release.
    Gemini released Gemini Flash-Lite this week, which boasts an incredible 363 tokens/s while doing math at a very good level, with 1M context and great scores compared to instant/fast models like Anthropic's Haiku. Folks called out that this model is more expensive than the previous 2.5 Flash-Lite, but with 86.9% on GPQA Diamond (beating GPT-5 mini) and 76.8% on MMMU-Pro multimodal reasoning, it is definitely worth a look for many agentic, super-fast-response use cases!
    For example, the heartbeat responses in OpenClaw.
    Qwen 3.5 Small Models & The Departure of Junyang Lin (X, HF, HF, HF)
    Alibaba’s Qwen team continued releasing their Qwen 3.5 family, this time with Qwen 3.5 Small, a series of models at 0.8B, 2B, 4B, and 9B parameters with native multimodal capabilities. The flagship 9B model is beating GPT-OSS-120B on multiple benchmarks, scoring 82.5 on MMLU-Pro and 81.7 on GPQA Diamond. These models can handle video, documents, and images natively, support up to 201 languages, and can process up to 262K tokens of context. And.. they are great! They are trending on HF right now.
    What's also trending: Qwen's tech lead, friend of the pod Junyang Lin, posted a cryptic tweet that went viral with over 6M views. There were a lot of discussions about why he and other Qwen leads are stepping away, and what's going to happen with the future of open source. The full picture seems to be that there are a lot of internal tensions and politics, with Junyang being one of the youngest P10 leaders in the Alibaba org.
    The Chinese website 36Kr (kind of like a Chinese TechCrunch) reported that this matter went all the way up to the Alibaba CEO, who is now co-leading the Qwen team, and that this resignation was related to an internal dispute over resource allocation and team consolidation, not a firing.
    I'm sure Junyang is going to land somewhere incredible. I just wanted to highlight how much he did for the open source community, pushing Qwen relentlessly and supporting and working with a lot of inference providers (and almost becoming a co-host of ThursdAI, with 9 appearances!)
    StepFun releases Step 3.5 Flash Base (X, HF, HF, Announcement, Arxiv)
    Speaking of open source, StepFun just broke through the noise with a new model: a 196B-parameter sparse Mixture of Experts that activates just 11B parameters when run. It has some great benchmarks, but the main thing is this: they are releasing the pretrained base weights, a midtrain checkpoint optimized for code and agents, the complete SteptronOSS training framework, AND promising to release their SFT data soon, all under Apache 2.0!
    Technically the model looks strong too, with multi-token prediction, 74.4% on SWE-bench Verified (though, as we told you last week, that benchmark is... no longer trusted), and a full Apache 2.0 license!
    This Week’s Buzz: presenting Wolfbench.ai
    I'm so excited about this week's "This Week's Buzz": Wolfram has been hard at work preparing and presenting a new framework to test out these models, and he named it wolfbench.ai.
    Wolfbench is our attempt to compare how the same model performs via different agentic harnesses like ClaudeCode, OpenClaw and Terminalbench’s own Terminus.
    You can check out the website at wolfbench.ai, but the short of it is: a single number doesn't tell the full story.
    Wolf Bench breaks it into a four-metric framework: the average score across runs, the best single run, the ceiling (how many tasks the model solves at least once across all runs), and the floor (how many tasks it solves consistently across every single run). That last one is what I find most illuminating. Opus 4.6 might solve 88% of Terminal Bench tasks on average, but only about 55% of tasks every single time. Reliability matters enormously for agents, and benchmarks almost never surface this.
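The four metrics are easy to compute from a runs-by-tasks grid of pass/fail results. Here's my own illustrative sketch of that framework, not Wolf Bench's actual code:

```python
def wolfbench_metrics(runs):
    """Compute the four-metric framework from a list of runs, where each
    run is a list of booleans (task solved or not). Illustrative only."""
    n_tasks = len(runs[0])
    per_run = [sum(r) / n_tasks for r in runs]
    average = sum(per_run) / len(per_run)  # mean score across runs
    best = max(per_run)                    # best single run
    # Ceiling: fraction of tasks solved at least once across all runs.
    ceiling = sum(any(r[t] for r in runs) for t in range(n_tasks)) / n_tasks
    # Floor: fraction of tasks solved in every single run (reliability).
    floor = sum(all(r[t] for r in runs) for t in range(n_tasks)) / n_tasks
    return {"average": average, "best": best, "ceiling": ceiling, "floor": floor}

# Three runs over four tasks: task 0 always solved, task 3 never.
runs = [[True, True, False, False],
        [True, False, True, False],
        [True, True, True, False]]
m = wolfbench_metrics(runs)
```

Even in this tiny example, the gap between ceiling (0.75) and floor (0.25) shows how a model can "know how" to solve most tasks while reliably solving far fewer, which is exactly the Opus 4.6 pattern described above.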
    If you want to run your own evals with the same config, reach out to Wolfram—he’s open to community contributions. Wolfram has also already kicked off a Wolf Bench run on GPT-5.4 since we tested it live today, so stay tuned for those results.
    There’s quite a few more releases we didn’t have time to get into on the show given the GPT 5.4 drop, you’ll find all those links in the show notes!
Next week will mark 3 years since I started talking about AI on the internet and created ThursdAI (it was March 14th, 2023, the same day GPT-4 launched), and we’ll have a little celebration. I do hope you join us live 🔥
    As a birthday present, you may choose to share ThursdAI with a friend or two, or rate us in your podcast player of choice! See you next week,
    Alex 🫡
    ThursdAI - Mar 05, 2026 - TL;DR
    TL;DR of all topics covered:
    * Hosts and Guests
    * Alex Volkov - AI Evangelist & Weights & Biases (@altryne)
    * Co Hosts - @WolframRvnwlf @yampeleg @nisten @ldjconfirmed @ryancarson
    * Big CO LLMs + APIs
    * OpenAI launches GPT-5.4 Thinking and Pro (X, X, X, X)
    * Anthropic, Dept of War and OpenAI walk into a bar
    * Alibaba Qwen departures: Friend of the pod JunyangLin and Binyuan Hui both depart Qwen (X)
    * OpenAI Rolls Out GPT-5.3 Instant (X)
    * Google launches Gemini 3.1 Flash-Lite (X, Announcement)
    * Evals and Benchmarks
    * MarinLab shows degradation in Opus 4.6 (X)
    * B******t Bench from Peter Gostev (X)
    * Open Source LLMs
    * StepFun releases Step 3.5 Flash Base models (X, HF, HF, Announcement, Arxiv)
    * Alibaba releases Qwen 3.5 Small Model Series (X, HF, HF, HF)
    * Yuan 3.0 Ultra (X, Blog, HF)
    * Tools & Agentic Engineering
    * Cognition: SWE-1.6 preview (X, Blog)
    * OpenAI launches Codex app on Windows (X)
    * Google released Google Workspace CLI (X)
    * OpenAI released Symphony (Github)
    * This week’s Buzz
    * Early preview of Wolf Bench (wolfbench.ai) from W&B
    * AI Art & Diffusion & 3D
    * Black Forest Labs introduces Self-Flow (X, Announcement)


    This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe

    📅 ThursdAI - Feb 26 - The Pentagon wants War Claude, every benchmark collapsed, and a solo founder hit $700K ARR with AI agents

    2026/02/27 | 1h 50 mins.
    Hey, it’s Alex, let me tell you why I think this week is an inflection point.
Just this week: everyone is launching autonomous agents or features inspired by OpenClaw (Devin 2.2, Cursor, Claude Cowork, Microsoft, Perplexity and Nous all announced theirs); METR and ARC-AGI 2 and 3 benchmarks are getting saturated; one-person companies are nearing $1M ARR within months of operation by running AI agents 24/7 (we chatted with one of them on the show today, live as he broke the $700K ARR barrier); and the US Department of War gives Anthropic an ultimatum to remove nearly all restrictions on Claude for war, and Anthropic says NO.
I’ve been covering AI every week for 3 years, and this week feels different. So if we are nearing the singularity, let me at least keep you up to date 😅
    Today on the show, we covered most of the news in the first hour + breaking news from Google, Nano Banana 2 is here, and then had 3 interviews back to back. Ben Broca with Polsia, Nader Dabit with Cognition and Philip Kiely with BaseTen. Don’t miss those conversations starting at 1 hour in.
    Thanks for reading ThursdAI - Highest signal weekly AI news show! This post is public so feel free to share it.

    Anthropic vs Department of War
Earlier this week, the US “Department of War” invited Dario Amodei, CEO of Anthropic, to a meeting wherein Anthropic was given an ultimatum: “Remove the restrictions on Claude or Anthropic will be designated as a ‘supply chain risk’ company”, and the DoD will potentially go as far as using the Defense Production Act to force Anthropic to... comply.
The two restrictions that Anthropic has in place for their models are: no use for domestic surveillance of American citizens, and NO fully autonomous lethal-weapons decisions given to Claude. For context, Claude is the only model that’s deployed on AWS top secret GovCloud and is used through Palantir’s AI platform.
As I’m writing this, Anthropic issued a statement from Dario saying they will not budge on this and will not comply. I fully commend Dario and Anthropic for this very strong backbone, but I fear that this matter is far from over, and we’ll be watching for the government’s response.
    EDIT: Apparently the DoD is pressuring Google and OpenAI to agree to the stipulations and employees from both companies are signing this petition https://notdivided.org/ to protest against dividing the major AI labs on this topic.
    Anthropic and OpenAI vs upcoming Deepseek
It’s baffling just how many balls are in the air for Anthropic, as just this week they also publicly named 3 Chinese AI makers in “Distillation Attacks”, claiming that they broke the Terms of Service to generate over 16M conversations with Claude to improve their own models, while using proxy networks to avoid detection. This marks the first time a major AI company has publicly attributed distillation attacks to specific entities by name.
The most telling thing to me is not the distillation itself, given that Anthropic just recently settled one of the largest copyright payouts in U.S. history, paying authors about $3,000 per book that was bought, trained on and destroyed by Anthropic to make Claude better.
    No, the most telling thing here is the fact that Anthropic chose to put DeepSeek on top of the accusation list with merely 140K conversations, where the other labs created millions.
This, plus OpenAI’s formal memo to Congress about a similar matter, shows that the US labs are trying to prepare for DeepSeek’s new model to drop by saying “every innovation they have, they stole from us”. Apparently DeepSeek V4 is nearly here, it’s potentially multimodal, it has allegedly been trained on Nvidia chips somewhere in Mongolia despite the export restrictions, and it’s about to SLAP!
    Benchmark? What benchmarks?
    How will we know that we’re approaching the singularity? Will there be signs? Well, this week it seems that the signs are here.
First, Agentica claimed that they solved all publicly available “hard for AI” tasks of the upcoming ARC-AGI 3, then Confluence Labs announced that they got an unprecedented 97.9% on ARC-AGI-2, and finally METR published their results on long-horizon tasks, which measure AI’s capability to solve tasks that take humans a certain number of hours to do. And that graph is going parabolic, with Claude Opus 4.6 able to solve 14.6-hour tasks (doubling every 49 days) with a 50% success rate.
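To see why the METR graph feels parabolic, here’s a back-of-envelope extrapolation in Python. It assumes the trend stays a smooth exponential (a big assumption; METR’s own numbers are noisier), starting from the 14.6-hour / doubling-every-49-days figures above.

```python
import math

H0 = 14.6        # hours: Opus 4.6's 50%-success task horizon per METR
DOUBLING = 49.0  # days per doubling, per METR's reported trend

def horizon(days_from_now: float) -> float:
    """Projected 50%-success task horizon (hours), assuming the trend holds."""
    return H0 * 2 ** (days_from_now / DOUBLING)

def days_until(target_hours: float) -> float:
    """Days until the projected horizon reaches target_hours."""
    return DOUBLING * math.log2(target_hours / H0)

print(f"horizon in 49 days: {horizon(49):.1f}h")      # one doubling -> 29.2h
print(f"a 40h work week in ~{days_until(40):.0f} days")
```

Under these assumptions, tasks the length of a full human work week land roughly 2.5 months out, which is exactly the kind of number that makes people reach for the word “parabolic”.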
    Why is this important? Well, this is just the benchmarks telling the story that everyone else in the industry is seeing, that approximately since December of 2025, and definitely fueled by early Feb drop of Opus 4.6 and Codex 5.3, something major shifted. Developers no longer write code, but ship 10x more features.
This became such a talking point that Swyx of Latent.Space coined it at https://wtfhappened2025.com/, where he collects evidence of a Schelling point: something that happened in December and, I think, continued throughout February.
Speaking of benchmarks no longer being valid, OpenAI published that the divergence between SWE-bench Verified gains and real-life performance is so vast that they will no longer be using SWE-bench Verified, and will be switching to SWE-bench Pro for evaluations.
    Everyone’s Autonomous agents (and subagents) are here
Look, with over 250K GitHub stars and OpenAI getting Peter Steinberger on board, it’s clear now: OpenClaw made a huge dent in how people think about autonomous agents (and subagents!)
It may be a “moment in time” where model capabilities became “just good enough” to run agents async for long stretches, but the big labs noticed the OpenClaw excitement and are shipping like never before to make sure their users don’t switch over!
Perplexity launched “Computer”, which has scheduled tasks in a compute environment and can complete long-lasting projects end to end; Cursor pivoted from IDE-only to running agents in the cloud with their own environments; Claude Code added memory and Remote Control, while Claude Cowork added scheduled tasks; our friends from Nous shipped Hermes Agent; and even Microsoft wants to bring this to their customers in Copilot. The most interesting of these is the new Devin from Cognition.
I’ve gotten access, and chatted with Nader Dabit on the show about how Devin was the “OG” async coding agent, but now that model capabilities are here, Devin can do so much more. PR reviews with devinreview.com can complete the loop between coding, fixing and testing something end to end. They have an integrated environment with a scrubber so you can roll back and see what the agent did, scheduled tasks, and video showing you how the agent tested your website.
I’ve used it to fix bugs in ThursdAI.news and it found a few that Claude Code didn’t even know about! You can try out Devin (free for a week?) here
This week’s buzz - W&B updates
I’m happy this week, because we finally launched both of the 2.5 open source models that have been making the news lately.
    Kimi 2.5 and MiniMax M2.5 are both live on our inference service, at very very decent prices!
    Check them both out here and let me know if you need some credits.
    From the show this week, most hosts agree that Kimi 2.5 is the best open source alternative to Opus inside OpenClaw, just give your agent the WANDB_API_KEY and ask it to set itself up with the new model!
    Surfing the singularity with Ben Broca and Polsia, hitting $700K ARR since December
    I’ve reached out to Ben and asked him to join the show this week because alongside OpenClaw blowing up since December, his Polsia startup, which builds and scales entire companies with AI agents running 24x7 has hit an unprecedented $700K ARR milestone after just a few months. We actually saw him break the $700K ARR on the show live 🎉 But get this, he’s the only employee, everything is done with AIs. He’s using Polsia to scale Polsia.
Polsia lets anyone add an existing company or create a whole new one, and then a team of agents will spin up a marketing team, a GTM motion, a research arm, and you and Polsia can work together to make this company a reality. Does this actually work? IDK, the whole thing is new; I’m trying out a few things and will let you know in a few weeks if any of it worked.
But it’s definitely blowing up: Ben showed us that over the last 24 hours, over 770 companies launched on Polsia; he’s hitting nearly $1M ARR with people paying $50/mo for him to run inference and marketing campaigns for them, and he just added Meta ads.
This ARR chart, the live dashboard, and Ben doing all of this solo underline the whole “Singularity is near” thing for me! It’s impossible to imagine something like this working even... 5 months ago, and now we just accept it as... sure, yeah, one person can manage AIs that manage checks notes over 700 companies.
    What’s clever about Polsia’s architecture is the cross-company learning system: when an agent learns something useful (like “subject lines with emojis get better open rates”), that learning gets anonymized and generalized into a shared memory file that benefits every company on the platform. The more companies running on Polsia, the smarter every agent gets — like a platform effect but for agent intelligence.
    AI Art, Video & Audio
    Seedance 2.0 is finally “here”
This week has not been quiet in the multimodality world either. Seedance 2.0 from ByteDance was delayed via the API partners (it was supposed to launch Feb 24) due to copyright concerns, but apparently they dropped it inside CapCut, ByteDance’s video editing software! It’s really good, though what makes it absolutely incredible IMO is the video transfer, and you can’t really do that in CapCut, so we keep waiting for the “full model”
    Nano Banana 2 - Pro level intelligence, with Flash speed and pricing (Blog)
Google dropped a breaking news item before the show started today and announced Nano Banana 2, which is supposed to be as good as Nano Banana Pro (which is incredible) but faster. It wasn’t really faster for me, as I got early access thanks to the DeepMind team, but apparently that’s just rollout pains. The quality, though, is nearly matching Nano Banana Pro!
It can do the same super high quality text rendering, comes with a few new ratios to create ultra long images (4:1 and 1:4) and a new small 512 resolution for extra cheap generation. The additional thing is that Image Search is now integrated into the model, allowing it to look something up before generating. That didn’t work so well for me, though: I tried to get it to look up images of Mike Intrator and Dario Amodei, and it kept showing me random people who look nothing like them, despite the thinking traces showing the search did happen.
Speaking of pricing, this model is around 30-50% cheaper than NBP, which is great given the added speed! It’s available on AI.dev and Gemini, go give it a try!
    Open source AI
This week in open source, our friends from Qwen came back with a set of 3 models; the medium one is a hybrid architecture with only 3B active parameters that beats their 235B flagship Qwen3 from before! It’s really good at longer context, especially given the hybrid attention, similar to Jamba which we covered before. (X, HF, HF, HF, Blog)
Additionally, Liquid AI released their largest LFM, a 24B model (X, HF, Blog) that is also deployable on consumer laptops.
One note on AI tools: LM Studio, our favorite way of running these models on your hardware, has launched LMLink, powered by Tailscale, which lets you run local inference on one device and stream tokens to any other device on your network securely! You can use this to run your OpenClaw with Qwen Medium, for example, for a completely off-the-grid OpenClaw!
    Check it out here: https://lmstudio.ai/link
I really didn’t want to sound hype-y, but this week things are moving so fast that I wasn’t sure how it would be possible to talk about all this, covering the news while also having 3 interviews. I think we’ve done a good job, but I am honestly getting to a point where I have to do deep prioritization of what content is the most important in my eyes. I hope you guys enjoy my prioritization, and do leave comments about what you’d like to see more, or less, of! I am hungry for feedback!
If you enjoyed this week’s newsletter, check out the whole edited video and share it with a friend or two! See you next week!
    ThursdAI - Join us as we surf the AI singularity together

    Here’s the TL;DR and show notes:
    ThursdAI - Feb 26, 2026 - TL;DR
    * Hosts and Guests
    * Alex Volkov - AI Evangelist & Weights & Biases (@altryne)
    * Co Hosts - @WolframRvnwlf @yampeleg @nisten @ldjconfirmed @ryancarson
    * Ben Cera (@bencera_) - Founder Polsia
    * Nader Dabit (@dabit3) - Growth at Cognition
    * Philip Kiely (@philipkiely) - Devrel Base10, Author Inference Engineering
    * ThursdAI new website: https://thursdai.news
    * Big CO LLMs + APIs
    * Anthropic vs Chinese OSS - Accuses DeepSeek, Minimax, ZAI at distillation attacks (Blog)
    * Pentagon Issues an ultimatum to Anthropic: Give military unfettered Claude access by Friday or face Defense Production Act - Anthropic says NO (Blog)
    * OpenAI releases GPT-5.3-Codex, their most capable agentic coding model, to all developers via the Responses API (X, Announcement)
    * Open Source LLMs
    * Alibaba: Qwen 3.5 Medium - 35B model with only 3B active parameters outperforms their previous 235B flagship (X, HF, HF, HF, Blog)
    * Liquid AI releases LFM2-24B-A2B: A 24B MoE model with only 2.3B active parameters that runs on consumer laptops (X, HF, Blog)
    * Perplexity launches ppxl-embed - SOTA embedding models (Blog, HF, API) by our friend Bo Wang
    * Evals & Benchmarks
    * METR’s Time Horizon Benchmark Goes Vertical: Claude Opus 4.6 Achieves ~14.5 Hour Task Completion (X, Blog)
    * Confluence Labs emerges from stealth with 97.9% SOTA on ARC-AGI-2 benchmark (X, GitHub)
    * OpenAI Retires SWE-bench Verified, (X, Blog, X)
    * Agentica claiming to solve all public ArcAGI 3 (X)
    * Tools & Agentic Engineering
    * Happy 1 year Birthday Claude Code!
    * Devin AI 2.2 - autonomous agent with computer use, browser, and the ability to self-verify and self-fix its own work - interview with Nader Dabit (X)
    * LMStudio launches LMLink - use your local models from everywhere with TailScale! (try it)
    * Claude Code introduces Remote Control: Control your local coding sessions from your phone or any device (X, Docs) and memory (X)
    * Claude Cowork and Codex both now have automations (Cron Jobs) to do tasks for you (Cowork)
    * Cursor launches cloud agents that onboard to codebases, run in isolated VMs, and deliver video demos of completed PRs (X)
    * Nous research agent (X)
    * Perplexity Computer (blog)
    * Microsoft Copilot tasks (blog)
    * This week’s Buzz - Weights & Biases update
    * W&B adds MiniMax 2.5 and Kimi K2.5 on our Inference Service (LINK)
    * Interviews mention links
    * Ben Broca - polsia.com/live Polsia Dashboard
    * Nader Dabit - on seeing the future (blog)
    * Philip Kiely - Inference Engineering book (Book)
    * Vision & Video
    * Seedance 2.0 finally available in Capcut in US. API release apparently held back due to copyright issues (X)
    * Voice & Audio
    * OpenAI releases gpt-audio-1.5 and gpt-realtime-1.5 models with major improvements in speech-to-speech AI capabilities (X, Announcement)
    * AI Art & Diffusion & 3D
    * Google DeepMind launches Nano Banana 2 (X, Announcement)
    * Quiver solves SVG with Arrow 1.0 (X)
    * Others
    * Taalas AI - 15,000 tokens per second demo (chatjimmy.ai/)



    📅 ThursdAI - Feb 19 - Gemini 3.1 Pro Drops LIVE, Sonnet 4.6 Closes Gap, OpenClaw Goes to OpenAI

    2026/02/20 | 1h 31 mins.
    Hey, it’s Alex, let me catch you up!
Since last week, OpenAI convinced OpenClaw founder Peter Steinberger to join them, while keeping OpenClaw.. well... open. Anthropic dropped Sonnet 4.6, which nearly outperforms the previous Opus and is much cheaper; Qwen released 3.5 on Chinese New Year’s Eve while DeepSeek was silent; and Elon and the xAI folks deployed Grok 4.20 without any benchmarks, and it’s four ~500B models in a trenchcoat?
Also, Anthropic’s updated rules state that it’s breaking ToS to use their plans for anything except Claude Code & the Claude SDK (and then they clarified that it’s OK? we’re not sure)
    Then Google decided to drop their Gemini 3.1 Pro preview right at the start of our show, and it’s very nearly the best LLM folks can use right now (though it didn’t pass Nisten’s vibe checks)
    Also, Google released Lyria 3 for music gen (though only 30 seconds?) and our own Ryan Carson blew up on X again with over 1M views for his Code Factory article, Wolfram did a deep dive into Terminal Bench and .. we have a brand new website:
    https://thursdai.news 🎉
    Great week all in all, let’s dive in!
    ThursdAI - Subscribe to never feel like you’re behind. Share with your friends if you’re already subscribed!

    Big Companies & API updates
    Google releases Gemini 3.1 Pro with 77.1% on ARC-AGI-2 (X, Blog, Announcement)
In a release that surprised no one, Google decided to drop their latest update to the Gemini models, and it’s quite a big update too! We’ve now seen all major labs ship big model updates in the first two months of 2026. With 77.1% on ARC-AGI-2 and 80.6% on SWE-bench Verified, Gemini is not complete SOTA across the board, but it’s damn near close.
The kicker is, it’s VERY competitive on pricing, with 1M context at $2/$12 per million tokens. And if you look at the trajectory, it’s really notable how quickly we’re moving, with this model being 82% better on abstract reasoning than the 3 Pro released just a few months ago!
    The 1 Million Context Discrepancy, who’s better at long context?
    The most fascinating catch of the live broadcast came from LDJ, who has an eagle eye for evaluation tables. He immediately noticed something weird in Google’s reported benchmarks regarding long-context recall. On the MRCR v2 8-needle benchmark (which tests retrieval quality deep inside a massive context window), Google’s table showed Gemini 3.1 Pro getting a 26% recall score at 1 million tokens. Curiously, they marked Claude Opus 4.6 as “not supported” in that exact tier.
    LDJ quickly pulled up the actual receipts: Opus 4.6 at a 1-million context window gets a staggering 76% recall score. That is a massive discrepancy! It was addressed by a member of DeepMind on X in a response to me, saying that Anthropic used an internal model for evaluating this (with receipts he pulled from the Anthropic model card)
    Live Vibe-Coding Test for Gemini 3.1 Pro
    We couldn’t just stare at numbers, so Nisten immediately fired up AI Studio for a live vibe check. He threw our standard “build a mars driver simulation game” prompt at the new Gemini.
    The speed was absolutely breathtaking. The model generated the entire single-file HTML/JS codebase in about 20 seconds. However, when he booted it up, the result was a bit mixed. The first run actually failed to render entirely. A quick refresh got a version working, and it rendered a neat little orbital launch UI, but it completely lacked the deep physics trajectories and working simulation elements that models like OpenAI’s Codex 5.3 or Claude Opus 4.6 managed to output on the exact same prompt last week. As Nisten put it, “It’s not bad at all, but I’m not impressed compared to what Opus and Codex did. They had a fully working one with trajectories, and this one I’m just stuck.”
    It’s a great reminder that raw benchmarks aren’t everything. A lot of this comes down to the harness—the specific set of system prompts and sandboxes that the labs use to wrap their models.
    Anthropic launches Claude Sonnet 4.6, with 1M token context and near-Opus intelligence at Sonnet pricing
The above Gemini release comes just a few days after Anthropic shipped an update to the middle child of their lineup, Sonnet 4.6. With much-improved computer use skills and an updated beta mode for 1M tokens, it achieves 79.6% on the SWE-bench Verified eval, showing good coding performance while maintaining those “Anthropic-trained model” vibes that many people seem to prefer.
Apparently, in blind testing inside Claude Code, folks preferred this new model’s outputs to the latest Opus 4.5 around ~60% of the time, while preferring it over the previous Sonnet 70% of the time.
    With $3/$15 per million tokens pricing, it’s cheaper than Opus, but is still more expensive than the flagship Gemini model, while being quite behind.
    Vibing with Sonnet 4.6
I’ve tested out Sonnet 4.6 inside my OpenClaw harness for a few days, and it was decent. It did annoy me a bit more than Opus by misunderstanding what I ask of it, but it definitely has the same “emotional tone” as Opus. Compared to Codex 5.3, it’s much nicer to talk to. IDK what kind of Anthropic magic they put in there, but if you’re on a budget, Sonnet is definitely the way to go when interacting with agents (and you can get it to orchestrate as many Codex instances as you want if you don’t like how it writes code)
    For Devs: Auto prompt caching and Web Search updates
One nice update Anthropic also dropped for developers is automatic prompt caching, which leads to an almost 90% decrease in token pricing (Blog), alongside a new and improved Web Search for everyone else that can now use tools
    Grok 4.20 - 4 groks in a trenchcoat?
In a very weird release, Grok has been updated with the long-hyped Grok 4.20. Elon has been promising this version for a while (since late last year, in fact) and this “release” definitely felt underwhelming. There were no evaluations, no comparisons to other labs’ models, no charts (heck, not even a blog post on X.ai).
    What we do know, is that Grok 4.20 (and Grok 4.20 Heavy) use multiple agents (4 for Grok, 16 for Heavy) to do a LOT of research and combine their answers somehow. This is apparently what the other labs use for their ultra expensive models (GPT Pro and Gemini DeepThink) but Grok is showing it in the UI, and gives these agents... names and personalities.
    Elon has confirmed also that what’s deployed right now is ~500B “small” base version, and that bigger versions are coming, in one of the rarest confirmations about model size from the big labs.
    Vibe checking this new grok, it’s really fast at research across X and the web, but I don’t really see it as a daily driver for anyone who converses with LLMs all the time. Supposedly they are planning to keep teaching this model and get it “improved week over week” so I’ll keep you up to date with major changes here.
    Open Source AI
It seems that all the Chinese OSS labs were shipping before the Chinese New Year, with Qwen being the last of them, dropping the updated Qwen 3.5.
    Alibaba’s Qwen3.5 397B-A17B: First open-weight native multimodal MoE model (X, HF)
    Qwen decided to go for Sparse MoE architecture with this release, with a high number of experts (512) and only 17B active parameters.
    It’s natively multi-modal with a hybrid architecture, able to understand images/text, while being comparable to GPT 5.2 and Opus 4.5 on benches including agentic tasks.
Benchmarks aside, the release page of Qwen models is a good sniff test of where these model labs are going: they have multimodality in there, but they also feature an example of how to use this model within OpenClaw, which doesn’t necessarily show off any specific capabilities, but shows that the Chinese labs are focusing on agentic behavior, tool use and, most of all, pricing!
    This model is also available as Qwen 3.5 Max with 1M token window (as opposed to the 256K native one on the OSS side) on their API.
    Agentic Coding world - The Clawfather is joining OpenAI, Anthropic loses dev mindshare
This was a heck of a surprise to many folks: Peter Steinberger announced that he’s joining OpenAI, while OpenClaw (which now sits at >200K stars on GitHub and is adopted by nearly every Chinese lab) is going to become an open source foundation.
OpenAI has also confirmed that it’s absolutely OK to use your ChatGPT Plus/Pro subscription inside OpenClaw, and it’s really a heck of a thing to see how quickly Peter jumped from relative anonymity (after scaling and selling PSPDFKit) into the spotlight. Apparently Mark Zuckerberg reached out directly, as did Sam Altman, and Peter decided to go with OpenAI despite Zuck offering more money, due to “culture”
This whole ClawdBot/OpenClaw debacle also shines a very interesting and negative light on Anthropic, who recently changed their ToS to highlight that their subscription can only be used for Claude Code and nothing else. This scared a lot of folks who used their Max subscription to run their Claws 24/7. Additionally, Ryan echoed how the community feels about the lack of DevEx/DevRel support from Anthropic in a viral post.
    However, it does not seem like Anthropic cares? Their revenue is going exponential (much of it due to Claude Code)
    Very interestingly, I went to a local Claude Code meetup here in Denver, and the folks there are.. a bit behind the “bubble” on X. Many of them didn’t even try Codex 5.3 or OpenClaw, they are maximizing their time with Claude Code like there’s no tomorrow. It has really shown me that the alpha keeps changing really fast, and many folks don’t have the time to catch up!
    P.S - this is why ThursdAI exists, and I’m happy to deliver the latest news to ya.
    This Week’s Buzz from Weights & Biases
    Our very own Wolfram Ravenwolf took over the Buzz corner this week to school us on the absolute chaos that is AI benchmarking. With his new role at W&B, he’s been stress-testing all the latest models on Terminal Bench 2.0.
    Why Terminal Bench? Because if you are building autonomous agents, multiple-choice tests like MMLU are basically useless now. You need to know if an agent can actually interact with an environment. Terminal Bench asks the agent to perform 89 real-world tasks inside a sandboxed Linux container—like building a Linux kernel or cracking a password-protected archive.
    Wolfram highlighted some fascinating nuances that marketing slides never show you. For example, did you know that on some agentic tasks, turning off the model’s “thinking/reasoning” mode actually results in a higher score? Why? Because overthinking generates so many internal tokens that it fills the context window faster, causing the model to hit its limits and fail harder than a standard zero-shot model! Furthermore, comparing benchmarks between labs is incredibly difficult because changing the benchmark’s allowed runtime from 1 hour to 2 hours drastically raises the ceiling of what models can achieve.
    He also shared a great win: while evaluating GLM-5 for our W&B inference endpoints, he got an abysmal 5% score. By pulling up the Weave trace data, Wolfram immediately spotted that the harness was injecting brain-dead Python syntax errors into the environment. He reported it, engineering fixed it in minutes, and the score shot up to its true state-of-the-art level. This is exactly why you need powerful tracing and evaluation tools when dealing with these black boxes! So y’know... check out Weave!
    Vision & BCI
    Zyphra’s ZUNA: Thought-to-Text Gets Real (X, Blog, GitHub)
    LDJ flagged this as his must-not-miss: Zyphra released ZUNA, a 380M parameter open-source BCI (Brain-Computer Interface) foundation model. It takes EEG signals from your brain and reconstructs clinical-grade brain signals from sparse, noisy data. People are literally calling it “thought to text” hahaha.
    At 380M parameters, it could potentially run in real-time on a consumer GPU. Trained on 2 million channel-hours of EEG data from 208 datasets. The wild part: it can upgrade cheap $500 consumer EEG headsets to high-resolution signal quality without retraining, something many folks are posting about and are excited to test out! Non Invasive BCI is the dream!
    Nisten was genuinely excited, noting it’s probably the best effort in this field and it’s fully Apache 2.0. Will probably need personalized training per person, but the potential is real: wear a headset, look at a screen, fire up your agents with your thoughts. Not there yet, but this feels like the actual beginning.
    Tools & Agentic Coding (The End of “Vibe Coding”) - Ryan Carson’s Code Factory & The “One-Shot Myth”
    This one is for developers, but in modern times, everyone can become a developer so if you’re not one, at least skim this.
    We spent a big chunk of the show today geeking out over agentic workflows. Ryan Carson went incredibly viral on X again this week with a phenomenal deep-dive on establishing a “Code Factory.” If you are still just chatting with models and manually copying code back into your IDE, you are doing it wrong.
    Ryan’s methodology (heavily inspired by a recent OpenAI paper on harness engineering) treats your AI agents like a massive team of junior engineers. You don’t just ask them for code and ship it. You should build a rigid, machine-enforced loop.
    Here is the flow:
    * The coding agent (Codex, OpenClaw, etc.) writes the code.
    * The GitHub repository enforces risk-aware checks. If a core system file or route is touched, the PR is automatically flagged as high risk.
    * A secondary code review agent (like Greptile) kicks off and analyzes the PR.
    * CI/CD GitHub Actions run automated tests, including browser testing.
    * If a test fails, or the review agent leaves a comment, a remediation agent is automatically triggered to fix the issue and loop back.
    * The loop spins continuously until you get a flawless, green PR.
    As Ryan pointed out, we used to hate this stuff as human engineers. Waiting for CI to pass made you want to pull your hair out. But agents have infinite time and infinite patience. You force them to grind against the machine-enforced contract (YAML/JSON gates) until they get it right. It takes a week to set up properly, and you have to aggressively fight “document drift” to make sure your AI doesn’t forget the architecture, but once it’s humming, you have unprecedented leverage.
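To make the loop above concrete, here is a minimal sketch of it in Python. All the names here are illustrative placeholders (this is not Ryan's actual tooling, nor a real API): the point is the shape of the machine-enforced contract, where risky paths flag the PR and a remediation agent grinds until everything is green.

```python
# Hypothetical sketch of a "Code Factory" loop. Names are illustrative, not a
# real API: the idea is a machine-enforced gate the agent must grind against.

RISKY_PATHS = ("core/", "routes/")  # touching these auto-flags the PR as high risk

def is_high_risk(changed_files):
    """Risk-aware check: does the PR touch a core system file or route?"""
    return any(f.startswith(RISKY_PATHS) for f in changed_files)

def code_factory_loop(pr, run_checks, review_agent, remediation_agent, max_rounds=10):
    """Run CI + review, hand any failures to a remediation agent, and repeat
    until the PR comes back flawless and green (or we escalate to a human)."""
    for _ in range(max_rounds):
        failures = run_checks(pr)                   # CI/CD gates, browser tests, etc.
        comments = review_agent(pr)                 # secondary code-review agent
        if not failures and not comments:
            return True                             # flawless, green PR
        remediation_agent(pr, failures + comments)  # agent fixes, then loop back
    return False                                    # still red: a human takes over
```

Humans hate waiting on a loop like this; agents, with their infinite patience, just keep spinning it until the gate opens.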
My Hard Truth: One-Shot is a Myth
I completely agree with Ryan btw! Over the weekend, my OpenClaw agent kindly informed me that the hosting provider for the old ThursdAI website was shutting down. I needed a new website immediately.
I decided to practice what we preach and talk to my ClawdBot to build the entire thing. It was an incredible process. I used Opus 4.6 to mock up 3 designs based on other podcast sites. Then, I deployed a swarm of sub-agents to download and read the raw text transcripts of all 152 past episodes of our show. Their job was to extract the names of every single guest (over 160 guests, including 15 from Google alone!) to build a dynamic guest directory, with a dedicated SEO page and dynamic OpenGraph tag for every single one of them. The site also got a native podcast player with synced sections, episode pages with guests highlighted, and much more. It would have taken me months to write the code for this myself.
    Was it magical? Yes. But was it one-shot? Absolutely not.
    The amount of back-and-forth conversation, steering, and correction I had to provide to keep the CSS coherent across pages was exhausting. I set up an automation to work while I slept, and I would wake up every morning to a completely different, sometimes broken website.
    Yam Peleg chimed in with the quote of the week: “It’s not a question of whether a model can mess up your code, it’s just a matter of when. Because it is a little bit random all the time. Humans don’t mistakenly delete the entire computer. Models can mistakenly, without even realizing, delete the entire computer, and a minute later their context is compacted and they don’t even remember doing it.”
This is why you must have gates. This is also why I don’t think engineers are going to be replaced by AI completely. Engineers who don’t use AI? Yup. But if you embrace these tools and learn to work with them, you won’t have an issue getting a job! You need that human taste-maker in the loop to finish the last 5%, and you need strict CI/CD gates to stop the AI from accidentally burning down your production database.
    Voice & Audio
    Google DeepMind launches Lyria 3 (try it)
    Google wasn’t just dropping reasoning models this week; DeepMind officially launched Lyria 3, their most advanced AI music generation model, integrating it directly into the Gemini App.
    Lyria 3 generates 30-second high-fidelity tracks with custom lyrics, realistic vocals across 8 different languages, and granular controls over tempo and instrumentation. You can even provide an image and it’ll generate a soundtrack (short one) for that image.
    While it is currently limited to 30-second tracks (which makes it hard to compare to the full-length song structures of Suno or Udio), early testers are raving that the actual audio fidelity and prompt adherence of Lyria 3 is far superior. All tracks are invisibly watermarked with Google’s SynthID to ensure provenance, and it automatically generates cover art using Nano Banana. I tried to generate a jingle
    That’s a wrap for this weeks episode folks, what an exclirating week! ( Yes I know it’s a typo, but how else would you know that I’m human?)
    Please go check out our brand new website (and tell me if anything smells off there, it’s definitely not perfect!), click around the guests directory and the episodes pages (the last 3 have pages, I didn’t yet backfill the rest) and let me know what you think!
    See you all next week!
    -Alex
    ThursdAI - Feb 19, 2026 - TL;DR
    TL;DR of all topics covered:
    * Hosts and Guests
    * Alex Volkov - AI Evangelist & Weights & Biases (@altryne)
    * Co Hosts - @WolframRvnwlf @yampeleg @nisten @ldjconfirmed @ryancarson
    * 🔥 New website: thursdai.news with all our past guests and episodes
    * Open Source LLMs
    * Alibaba releases Qwen3.5-397B-A17B: First open-weight native multimodal MoE model with 8.6-19x faster inference than Qwen3-Max (X, HF)
    * Cohere Labs releases Tiny Aya, a 3.35B multilingual model family supporting 70+ languages that runs locally on phones (X, HF, HF)
    * Big CO LLMs + APIs
    * OpenClaw founder joins OpenAI
    * Google releases Gemini 3.1 Pro with 2.5x better abstract reasoning and improved coding/agentic capabilities (X, Blog, Announcement)
    * Anthropic launches Claude Sonnet 4.6, its most capable Sonnet model ever, with 1M token context and near-Opus intelligence at Sonnet pricing (X, Blog, Announcement)
    * ByteDance releases Seed 2.0 - a frontier multimodal LLM family with Pro, Lite, Mini, and Code variants that rivals GPT-5.2 and Claude Opus 4.5 at 73-84% lower pricing (X, blog, HF)
    * Anthropic changes the rules on Max use, OpenAI confirms it’s 100% fine.
    * Grok 4.20 - finally released, a mix of 4 agents
    * This weeks Buzz
    * Wolfram deep dives into Terminal Bench
    * We’ve launched Kimi K2.5 on our inference service (Link)
    * Vision & Video
    * Zyphra releases ZUNA, a 380M-parameter open-source BCI foundation model for EEG that reconstructs clinical-grade brain signals from sparse, noisy data (X, Blog, GitHub)
    * Voice & Audio
    * Google DeepMind launches Lyria 3, its most advanced AI music generation model, now available in the Gemini App (X, Announcement)
    * Tools & Agentic Coding
    * Ryan is viral once again with CodeFactory! (X)
* Ryan uses Agentation.dev for front-end development, closing the loop on components
    * Dreamer launches beta: A full-stack platform for building and discovering agentic apps with no-code AI (X, Announcement)


    This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe
  • ThursdAI - The top AI news from the past week

    📆 Open source just pulled up to Opus 4.6 — at 1/20th the price

    2026/02/13 | 1h 28 mins.
    Hey dear subscriber, Alex here from W&B, let me catch you up!
This week started with Anthropic releasing /fast mode for Opus 4.6, continued with ByteDance’s reality-shattering video model called Seedance 2.0, and then the open weights folks pulled up!
Z.ai released GLM-5, a 744B top-ranking coder beast, and then today MiniMax dropped a heavily RL’d MiniMax M2.5, showing 80.2% on SWE-bench, nearly beating Opus 4.6! I interviewed Lou from Z.AI and Olive from MiniMax on the show today back to back btw, very interesting conversations, starting after the TL;DR!
    So while the OpenSource models were catching up to frontier, OpenAI and Google both dropped breaking news (again, during the show), with Gemini 3 Deep Think shattering the ArcAGI 2 (84.6%) and Humanity’s Last Exam (48% w/o tools)... Just an absolute beast of a model update, and OpenAI launched their Cerebras collaboration, with GPT 5.3 Codex Spark, supposedly running at over 1000 tokens per second (but not as smart)
    Also, crazy week for us at W&B as we scrambled to host GLM-5 at day of release, and are working on dropping Kimi K2.5 and MiniMax both on our inference service! As always, all show notes in the end, let’s DIVE IN!
    ThursdAI - AI is speeding up, don’t get left behind! Sub and I’ll keep you up to date with a weekly catch up

    Open Source LLMs
    Z.ai launches GLM-5 - #1 open-weights coder with 744B parameters (X, HF, W&B inference)
    The breakaway open-source model of the week is undeniably GLM-5 from Z.ai (formerly known to many of us as Zhipu AI). We were honored to have Lou, the Head of DevRel at Z.ai, join us live on the show at 1:00 AM Shanghai time to break down this monster of a release.
GLM-5 is massive, not something you run at home (hey, that’s what W&B inference is for!), but it’s absolutely a model worth thinking about if your company has on-prem requirements and can’t share code with OpenAI or Anthropic.
They jumped from 355B in GLM-4.5 and expanded their pre-training data to a whopping 28.5T tokens to get these results. But Lou explained that it’s not only about data: they adopted DeepSeek’s sparse attention (DSA) to help preserve deep reasoning over long contexts (this one has 200K).
    Lou summed up the generational leap from version 4.5 to 5 perfectly in four words: “Bigger, faster, better, and cheaper.” I dunno about faster, this may be one of those models that you hand off more difficult tasks to, but definitely cheaper, with $1 input/$3.20 output per 1M tokens on W&B!
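At those quoted W&B prices, the per-task math is easy to sanity-check. A rough sketch (the token counts below are made-up example numbers, not measurements):

```python
# Rough cost math at the quoted W&B Inference prices for GLM-5:
# $1 per 1M input tokens, $3.20 per 1M output tokens.
INPUT_PER_M, OUTPUT_PER_M = 1.00, 3.20

def task_cost(input_tokens, output_tokens):
    """Dollar cost of one request at per-million-token pricing."""
    return input_tokens / 1e6 * INPUT_PER_M + output_tokens / 1e6 * OUTPUT_PER_M

# e.g. a hefty agentic coding task: 100K tokens in, 20K tokens out
print(round(task_cost(100_000, 20_000), 3))  # → 0.164, about 16 cents
```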
While the evaluations are ongoing, the one interesting tidbit from Artificial Analysis was that this model scores the lowest on their hallucination rate bench!
Think about this for a second: this model is neck-and-neck with Opus 4.5, and if Anthropic hadn’t released Opus 4.6 just last week, this would be an open weights model that rivals Opus, one of the best models the Western foundation labs, with all their investments, have out there. Absolutely insane times.
    MiniMax drops M2.5 - 80.2% on SWE-bench verified with just 10B active parameters (X, Blog)
    Just as we wrapped up our conversation with Lou, MiniMax dropped their release (though not weights yet, we’re waiting ⏰) and then Olive Song, a senior RL researcher on the team, joined the pod, and she was an absolute wealth of knowledge!
Olive shared that they achieved an unbelievable 80.2% on SWE-Bench Verified. Digest this for a second: a 10B active parameter open-source model is directly trading blows with Claude Opus 4.6 (80.8%) on one of the hardest real-world software engineering benchmarks we currently have. While being (Alex checks notes)... 20X cheaper and much faster to run? Apparently their fast version gets up to 100 tokens/s.
Olive shared the “not so secret” sauce behind this punch-above-its-weight performance. The massive leap in intelligence comes entirely from their highly decoupled Reinforcement Learning framework called “Forge.” They heavily optimized not just for correct answers, but for end-to-end task completion time. In the era of bloated reasoning models that spit out ten thousand “thinking” tokens before writing a line of code, MiniMax trained their model across thousands of diverse environments to use fewer tools, think more efficiently, and execute plans faster. As Olive noted, less time waiting and fewer tools called means less money spent by the user (as confirmed by @swyx at the Windsurf leaderboard, developers often prefer fast but good-enough models).
    I really enjoyed the interview with Olive, really recommend you listen to the whole conversation starting at 00:26:15. Kudos MiniMax on the release (and I’ll keep you updated when we add this model to our inference service)
    Big Labs and breaking news
There’s a reason the show is called ThursdAI, and today that reason is clearer than ever: AI’s biggest updates happen on a Thursday, often live during the show. This happened 2 times last week and 3 times today, first with MiniMax and then with both Google and OpenAI!
    Google previews Gemini 3 Deep Think, top reasoning intelligence SOTA Arc AGI 2 at 84% & SOTA HLE 48.4% (X , Blog)
    I literally went 🤯 when Yam brought this breaking news. 84% on the ARC-AGI-2 benchmark. For context, the highest score prior to this was 68% from Opus 4.6 just last week. A jump from 68 to 84 on one of the hardest reasoning benchmarks we have is mind-bending. It also scored a 48.4% on Humanity’s Last Exam without any tools.
Only available to Gemini Ultra subscribers (not in the API yet?), this model seems to be the current leader in reasoning about hard problems and is not meant for day-to-day chat users like you and me (though I did use it, and it’s pretty good at writing!)
    They posted Gold-medal performance on 2025 Physics and Chemistry Olympiads, and an insane 3455 ELO rating at CodeForces, placing it within the top 10 best competitive programmers. We’re just all moving so fast I’m worried about whiplash! But hey, this is why we’re here, we stay up to date so you don’t have to.
    OpenAI & Anthropic fast modes
Not 20 minutes had passed since the above news when OpenAI announced a new model that works only for Pro tier members (I’m starting to notice a pattern here 😡): GPT 5.3 Codex Spark.
You may be confused: didn’t we just get GPT 5.3 Codex last week? Well yeah, but this one, this one is its little and super speedy brother, hosted via the Cerebras partnership they announced a while ago, which means this coding model absolutely slaps at over 1000t/s.
Yes, over 1K tokens per second can be generated with this one, though there are limits. It’s not as smart, it’s text only, it has 128K context, but still, for MANY subagents, this model is an absolute beast. It won’t refactor your whole codebase in one shot, but it’ll generate and iterate on it, very very quick!
OpenAI also previously updated Deep Research with the GPT 5.2 series of models, and we can all say bye bye to the “older” models, like 5, o3 and most importantly GPT 4o, which got a LOT of people upset (enough that they have a hashtag going, #keep4o)!
Anthropic also announced their fast mode (using /fast) in Claude Code btw on Saturday, and that one is absolutely out of reach for many users: at $225/1M output tokens, this model will just burn through your wallet. Unlike the Spark version, this seems to be the full Opus 4.6 just... running on some dedicated hardware? I thought this was a rebranded Sonnet 5 at first but Anthropic folks confirmed that it wasn’t.
    Vision & Video
    ByteDance’s Seedance 2.0 Shatters Reality (and nobody in the US can use it)
    I told the panel during the show: my brain is fundamentally broken after watching the outputs from ByteDance’s new Seedance 2.0 model. If your social feed isn’t already flooded with these videos, it will be so very soon (supposedly the API launches Feb 14 on Valentines Day)
We’ve seen good video models before. Sora blew our minds and then Sora 2, Veo is (still) great, Kling was fantastic. But Seedance 2.0 is an entirely different paradigm. It is a unified multimodal audio-video joint generation architecture. What does that mean? It means you can simultaneously input up to 9 reference images, 3 video clips, 3 audio clips, and text instructions all at once to generate a 15-second cinematic short film. Its character consistency is beyond what we’ve seen before, and the physics are razor sharp (just looking at the examples folks are posting, it’s clear it’s on another level).
    I think very soon though, this model will be restricted, but for now, it’s really going viral due to the same strategy Sora did, folks are re-imagining famous movie and TV shows endings, doing insane mashups, and much more! Many of these are going viral over the wall in China.
    The level of director-like control is unprecedented. But the absolute craziest part is the sound and physics. Seedance 2.0 natively generates dual-channel stereo audio with ASMR-level Foley detail. If you generate a video of a guy taking a pizza out of a brick oven, you hear the exact scratch of the metal spatula, the crackle of the fire, the thud of the pizza box, and the rustling of the cardboard as he closes it. All perfectly synced to the visuals.
Seedance 2 feels like “borrowed realism”. Previous models had only images and their training to base their generations on. Seedance 2 accepts up to 3 video references in addition to images and sounds.
    This is why some of the videos feel like a new jump in visual capabilities. I have a hunch that ByteDance will try and clamp down on copyrighted content before releasing this model publicly, but for now the results are very very entertaining and I can’t help but wonder, who is the first creator that will just..remake the ending of GOT last season!?
    Trying this out is hard right now, especially in the US, but there’s a free way to test it out with a VPN, go to doubao.com/chat when connected from a VPN and select Seedream 4.5 but ask for “create a video please” in your prompt!
    AI Art & Diffusion: Alibaba’s Qwen-Image-2.0 (X, Blog)
The Qwen team over at Alibaba has been on an absolute tear lately, and this week they dropped Qwen-Image-2.0. In an era where everyone is scaling models up to massive sizes, Alibaba actually shrank this model from 20B parameters down to just 7B parameters, while massively improving performance (though they didn’t drop the weights yet, they are coming).
    Despite the small size, it natively outputs 2K (2048x2048) resolution images, giving you photorealistic skin, fabric, and snow textures without needing a secondary upscaler. But the real superpower of Qwen-Image-2.0 is its text rendering, it supports massive 1,000-token prompts and renders multilingual text (English and Chinese) flawlessly.
It’s currently #3 globally on AI Arena for text-to-image (behind only Gemini-3-Pro-Image and GPT Image 1.5) and #2 for image editing. My results with it were not the best; I tried to generate this week’s thumbnails with it and... they turned out meh at best?
In fact, my results were so bad compared to their launch blog that I’m unsure they are serving me the “new” model 🤔 Judge for yourself: the above infographic was created with Nano Banana Pro, and this one, same prompt, with Qwen Image on their website:
    But you can test it for free at chat.qwen.ai right now, and they’ve promised open-source weights after the Chinese New Year!
    🛠️ Tools & Orchestration: Entire Checkpoints & WebMCP
    With all these incredibly smart, fast models, the tooling ecosystem is desperately trying to keep up. Two massive developments happened this week that will change how we build with AI, moving us firmly away from hacky scripts and into robust, agent-native development.
    Entire Raises $60M Seed for OSS Agent Workflows
    Agent orchestration is the hottest problem in tech right now, and a new company called Entire just raised a record-breaking $60 Million seed round (at a $300M valuation—reportedly the largest seed ever for developer tools) to solve it. Founded by former GitHub CEO Thomas Dohmke, Entire is building the “GitHub for the AI agent era.”
    Their first open-source release is a CLI tool called Checkpoints.
    Checkpoints integrates via Git hooks and automatically captures entire agent sessions—transcripts, prompts, files modified, token usage, and tool calls—and stores them as versioned Git data on a separate branch (entire/checkpoints/v1). It creates a universal semantic layer for agent tracing. If your Claude Code or Gemini CLI agent goes off the rails, Checkpoints allows you to seamlessly rewind to a specific state in the agent’s session.
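To make that concrete, here is a rough sketch of what one versioned session record might contain, based purely on the description above. This is a hypothetical schema, not Entire's actual on-branch format; every field name here is an assumption.

```python
import json
from dataclasses import dataclass, field, asdict

# Hypothetical shape of a Checkpoints-style session record (illustrative only;
# Entire's real format on the entire/checkpoints/v1 branch may differ). Each
# record captures one agent session so it can be inspected or rewound later.

@dataclass
class SessionCheckpoint:
    session_id: str
    prompt: str
    transcript: list = field(default_factory=list)   # agent/user turns
    files_modified: list = field(default_factory=list)
    token_usage: dict = field(default_factory=dict)  # e.g. {"input": n, "output": n}
    tool_calls: list = field(default_factory=list)

    def serialize(self) -> str:
        """What would get committed as versioned data on a separate branch."""
        return json.dumps(asdict(self), indent=2)

cp = SessionCheckpoint(
    session_id="abc123",
    prompt="fix the failing login test",
    files_modified=["auth/login.py"],
    token_usage={"input": 42_000, "output": 6_500},
)
restored = json.loads(cp.serialize())  # round-trips cleanly for later rewinding
```

Because the records live on their own branch, "rewinding" an off-the-rails agent is just checking out an earlier record and restoring the files it lists.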
    We also have to shout out our own Ryan Carson, who shipped his open-source project AntFarm this week to help orchestrate these agents on top of Open-Claw!
    Chrome 146 Introduces WebMCP
    Finally, an absolutely massive foundational shift is happening on the web. Chrome 146 Canary is shipping an early preview of WebMCP.
    We have been talking about web-browsing agents for a while, and the biggest bottleneck has always been brittle DOM scraping, guessing CSS selectors, and simulating clicks via Puppeteer or Playwright. It wastes an immense amount of tokens and breaks constantly. Chrome 146 is fundamentally changing this by introducing a native browser API.
Co-authored by Google and Microsoft under the W3C Web Machine Learning Community Group, WebMCP allows websites to declaratively expose structured tools directly to AI agents using JSON schemas via navigator.modelContext. You can even do this declaratively through HTML form annotations using tool-name and tool-description attributes. No backend MCP server is required.
I don’t KNOW if this is going to be big or not, but it definitely smells like it, because even the best agentic AI assistants struggle with browsing the web; with their constrained context windows, they can’t just go by raw HTML content and screenshots! Let’s see if this helps agents browse the web!
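For intuition, here's the general shape of a declaratively exposed tool and how an agent call would be checked against it. This is a language-agnostic sketch in Python, NOT the actual navigator.modelContext JavaScript API; the tool name and fields below are invented for illustration.

```python
# Sketch of a WebMCP-style declared tool: a name, a description, and a JSON
# schema for its inputs. Agents call the tool instead of scraping the DOM.
# (Illustrative only; not the real navigator.modelContext API surface.)

checkout_tool = {
    "name": "add_to_cart",
    "description": "Add a product to the shopping cart",
    "inputSchema": {
        "type": "object",
        "properties": {
            "product_id": {"type": "string"},
            "quantity": {"type": "integer"},
        },
        "required": ["product_id"],
    },
}

def validate_call(tool, args):
    """Minimal check that an agent's tool call supplies the required fields."""
    required = tool["inputSchema"].get("required", [])
    return all(k in args for k in required)

ok = validate_call(checkout_tool, {"product_id": "sku-42", "quantity": 2})
bad = validate_call(checkout_tool, {"quantity": 2})  # missing product_id
```

The win over Puppeteer-style clicking is that the contract is explicit: no guessed CSS selectors, no wasted screenshot tokens, just a schema both sides agree on.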
All right, that about sums it up I think for this week, it was an absolute banger of a week! The one thing I didn’t cover as a news item but mentioned last week is that many folks report being overly tired, barely able to go to sleep while their agentic things are running, and all of us are trying to get to the bottom of how to work with these new agentic coding tools.
    Steve Yegge noticed the same and called it “the AI vampire“ while Matt Shumer went ultraviral (80M+ views) on his article about “something big is coming“ which terrified a lot of folks. What’s true for sure, is that we’re going through an inflection point in humanity, and I believe that staying up to date is essential as we go through it, even if some of it seems scary or “too fast”.
    This is why ThursdAI exists, I first and foremost wanted this for ME to stay up to date, and after that to share this with all of you. Having recently hit a few milestones for ThursdAI, all I can say is thanks for sharing, reading, listening and tuning in from week to week 🫡
    ThursdAI - Feb 12, 2026 - TL;DR
    TL;DR of all topics covered:
    * Hosts and Guests
    * Alex Volkov - AI Evangelist & Weights & Biases (@altryne)
* Co Hosts - @WolframRvnwlf @yampeleg @nisten @ldjconfirmed @ryancarson
    * Lou from Z.AI (@louszbd)
    * Olive Song - Lead RL at Minimax @olive_jy_song
    * Open Source LLMs
    * Z.ai launches GLM-5: 744B parameter MoE model achieving #1 open-source ranking for agentic coding with 77.8% SWE-bench Verified (X, HF, Wandb)
    * MiniMax M2.5 drops official benchmarks showing SOTA coding performance at 20x cheaper than competitors (X)
    * Big CO LLMs + APIs
    * XAI cofounders quit/let go after X restructuring (X, TechCrunch)
    * Anthropic releases Claude Opus 4.6 sabotage risk report, preemptively meeting ASL-4 safety standards for autonomous AI R&D (X, Blog)
    * OpenAI upgrades Deep Research to GPT-5.2 with app integrations, site-specific searches, and real-time collaboration (X, Blog)
    * Gemini 3 Deep Think SOTA on Arc AGI 2, HLE (X)
    * OpenAI releases GPT 5.3 Codex spark, backed by Cerebras with over 1000tok/sec (X)
    * This weeks Buzz
    * W&B Inference launch of Kimi K2.5 and GLM 5 🔥 (X, Inference)
    * Get $50 of credits to our inference service HERE (X)
    * Vision & Video
    * ByteDance Seedance 2.0 launches with unified multimodal audio-video generation supporting 9 images, 3 videos, 3 audio clips simultaneously (X, Blog, Announcement)
    * AI Art & Diffusion & 3D
    * Alibaba launches Qwen-Image-2.0: A 7B parameter image generation model with native 2K resolution and superior text rendering (X, Announcement)
    * Tools & Links
    * Entire raises $60M seed to build open-source developer platform for AI agent workflows with first OSS release ‘Checkpoints’ (X, GitHub, Blog)
    * Chrome 146 introduces WebMCP: A native browser API enabling AI agents to directly interact with web services (X)
    * RyanCarson AntFarm - Agent Coordination (X)
    * Steve Yegge’s “The AI Vampire” (X)
    * Matt Shumer’s “something big is happening” (X)


About ThursdAI - The top AI news from the past week

Every ThursdAI, Alex Volkov hosts a panel of experts, ai engineers, data scientists and prompt spellcasters on twitter spaces, as we discuss everything major and important that happened in the world of AI for the past week. Topics include LLMs, Open source, New capabilities, OpenAI, competitors in AI space, new LLM models, AI art and diffusion aspects and much more. sub.thursdai.news