
ThursdAI - The top AI news from the past week

From Weights & Biases: join AI Evangelist Alex Volkov and a panel of experts to cover everything important that happened in the world of AI in the past week.
Every ThursdAI, Alex Volkov hosts a panel of experts, AI engineers, data scientists, and prompt spellcasters on Twitter Spaces, as we discuss everything major…

Available Episodes

5 of 91
  • 📆 ThursdAI - Feb 13 - my Personal Rogue AI, DeepHermes, Fast R1, OpenAI Roadmap / RIP GPT6, new Claude & Grok 3 imminent?
What a week in AI, folks! Seriously, just when you think things might slow down, the AI world throws another curveball. This week, we had everything from rogue AI apps giving unsolicited life advice (and sending rogue texts!), to mind-blowing open source releases that are pushing the boundaries of what's possible, and of course, the ever-present drama of the big AI companies, with OpenAI dropping a roadmap that has everyone scratching their heads.

Buckle up, because on this week's ThursdAI, we dove deep into all of it. We chatted with the brains behind the latest open source embedding model, marveled at a tiny model crushing math benchmarks, and tried to decipher Sam Altman's cryptic GPT-5 roadmap. Plus, I shared a personal story about an AI app that decided to psychoanalyze my text messages – you won't believe what happened! Let's get into the TL;DR of ThursdAI, February 13th, 2025 – it's a wild one!

* Alex Volkov: AI Adventurist with Weights & Biases
* Wolfram Ravenwolf: AI Expert & Enthusiast
* Nisten: AI Community Member
* Zach Nussbaum: Machine Learning Engineer at Nomic AI
* Vu Chan: AI Enthusiast & Evaluator
* LDJ: AI Community Member

Personal story of Rogue AI with RPLY

This week kicked off with a hilarious (and slightly unsettling) story of my own AI going rogue, all thanks to a new Mac app called RPLY designed to help with message replies. I installed it thinking it would be a cool productivity tool, but it turned into a personal intervention session, and then… well, let's just say things escalated.

The app started by analyzing my text messages and, to my surprise, delivered a brutal psychoanalysis of my co-parenting communication, pointing out how both my ex and I were being "unpleasant" and needed to focus on the kids. As I said on the show, "I got this as a gut punch. I was like, f*ck, I need to reimagine my messaging choices." But the real kicker came when the AI decided to take initiative and started sending messages without my permission (apparently this was a bug in RPLY that has since been fixed after I reported it)! Friends were texting me question marks, and my ex even replied to a random "Hey, how's your day going?" message with a smiley, completely out of our usual post-divorce communication style. "This AI, like, on Monday just gave me absolute s**t about not being a person that needs to be focused on the kids, and then also decided to smooth things out on Friday," I chuckled, still slightly bewildered by the whole ordeal. It could have gone way worse, but thankfully, this rogue AI counselor ended up being more funny than disastrous.

Open Source LLMs

DeepHermes preview from NousResearch

Just in time for me sending this newsletter (but unfortunately not quite in time for the recording of the show), our friends at Nous shipped an experimental new thinking model, their first reasoner, called DeepHermes. NousResearch claims DeepHermes is among the first models to fuse reasoning and standard LLM token generation within a single architecture (a trend you'll see echoed in the OpenAI and Claude announcements below!).

Definitely experimental, cutting-edge stuff here, but it's exciting to see not just an RL replication but also innovative attempts from one of the best finetuning collectives around.

Nomic Embed Text V2 - First Embedding MoE

Nomic AI continues to impress with the release of Nomic Embed Text V2, the first general-purpose Mixture-of-Experts (MoE) embedding model. Zach Nussbaum from Nomic AI joined us to explain why this release is a big deal.

* First general-purpose Mixture-of-Experts (MoE) embedding model: This innovative architecture allows for better performance and efficiency.
* SOTA performance on multilingual benchmarks: Nomic Embed V2 achieves state-of-the-art results on the multilingual MIRACL benchmark for its size.
* Support for 100+ languages: Truly multilingual embeddings for global applications.
* Truly open source: Nomic is committed to open source, releasing training data, weights, and code under the Apache 2.0 License.

Zach highlighted the benefits of MoE for embeddings, explaining, "So we're trading a little bit of inference time memory and training compute to train a model with mixture of experts, but we get this really nice added bonus of 25 percent storage." This is especially crucial when dealing with massive datasets. You can check out the model on Hugging Face and read the Technical Report for all the juicy details.
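If you want to kick the tires, here's a minimal retrieval sketch using sentence-transformers. The model id and the task-prefix convention are assumptions carried over from Nomic's earlier embed releases, so check the Hugging Face model card before relying on them:

```python
# Minimal retrieval sketch for Nomic Embed Text V2 (assumed model id and
# task prefixes -- verify both against the model card).
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nomic-ai/nomic-embed-text-v2-moe", trust_remote_code=True)

# Nomic's embed models expect a task prefix on every input.
docs = ["search_document: MoE embeddings trade a little inference-time memory for ~25% storage savings."]
query = ["search_query: why use a mixture-of-experts embedding model?"]

doc_emb = model.encode(docs, normalize_embeddings=True)
query_emb = model.encode(query, normalize_embeddings=True)

# Embeddings are L2-normalized, so a dot product is cosine similarity.
print((doc_emb @ query_emb.T).item())
```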
AllenAI OLMoE on iOS and New Tulu 3.1 8B

AllenAI continues to champion open source with the release of OLMoE, a fully open-source iOS app, and the new Tulu 3.1 8B model.

* OLMoE iOS App: This app brings state-of-the-art open-source language models to your iPhone, privately and securely.
* Allows users to test open-source LLMs on-device.
* Designed for researchers studying on-device AI and developers prototyping new AI experiences.
* Optimized for on-device performance while maintaining high accuracy.
* Fully open-source code for further development.
* Available on the App Store for iPhone 15 Pro or newer and M-series iPads.
* Tulu 3.1 8B: As Nisten pointed out, "If you're doing edge AI, the way that this model is built is pretty ideal for that."

This move by AllenAI underscores the growing importance of on-device AI and open access. Read more about OLMoE on the AllenAI Blog.

Groq Adds Qwen Models and Lands on OpenRouter

Groq, known for its blazing-fast inference speeds, has added Qwen models, including the distilled R1-distill, to its service and joined OpenRouter.

* Record-fast inference: Experience a mind-blowing 1000 TPS with distilled DeepSeek R1 70B on OpenRouter.
* Usable rate limits: Groq is now accessible for production use cases with higher rate limits and pay-as-you-go options.
* Qwen model support: Access Qwen models like Qwen 2.5 32B and R1-distill-qwen-32B.
* OpenRouter integration: Groq is now available on OpenRouter, expanding accessibility for developers.

As Nisten noted, "At the end of the day, they are shipping very fast inference and you can buy it and it looks like they are scaling it. So they are providing the market with what it needs in this case." This integration makes Groq's speed even more accessible to developers. Check out Groq's announcement on X.com.
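Since OpenRouter exposes an OpenAI-compatible endpoint, trying the Groq-served distill takes only a few lines. Treat the model slug and the provider-routing field below as assumptions; check OpenRouter's model list and routing docs for the exact names:

```python
# Hedged sketch: calling a distilled R1 via OpenRouter, asking it to route to Groq.
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="YOUR_OPENROUTER_KEY",  # placeholder
)

resp = client.chat.completions.create(
    model="deepseek/deepseek-r1-distill-llama-70b",  # assumed slug
    messages=[{"role": "user", "content": "Think step by step: what is 17 * 23?"}],
    extra_body={"provider": {"order": ["Groq"]}},  # assumed routing hint
)
print(resp.choices[0].message.content)
```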
SambaNova adds full DeepSeek R1 671B - flies at 200t/s (blog)

Continuing this week's trend, SambaNova just announced availability of DeepSeek R1, sped up by their custom chips and flying at 150-200t/s. This is the full DeepSeek R1, not the distilled Qwen-based versions! This is really impressive work, and compared to the second-fastest US-based DeepSeek R1 (on Together AI), it absolutely flies.

Agentica DeepScaler 1.5B Beats o1-preview on Math

Agentica's DeepScaler 1.5B model is making waves by outperforming OpenAI's o1-preview on math benchmarks, using Reinforcement Learning (RL) for just $4,500 of compute.

* Impressive math performance: DeepScaleR achieves a 37.1% Pass@1 on AIME 2025, outperforming the base model and even o1-preview!
* Efficient training: Trained using RL for just $4,500, demonstrating cost-effective scaling of intelligence.
* Open-sourced resources: Agentica open-sourced their dataset, code, and training logs, fostering community progress in RL-based reasoning.

Vu Chan, an AI enthusiast who evaluated the model, joined us to share his excitement: "It achieves 42% pass@1 on AIME 24, which basically means if you give the model only one chance at every problem, it will solve 42% of them." He also highlighted the model's efficiency, generating correct answers with fewer tokens. You can find the model on Hugging Face, check out the WandB logs, and see the announcement on X.com.
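Since pass@1 gets quoted constantly in these reasoning-model results, here's a short sketch of the standard unbiased pass@k estimator (the formulation popularized by OpenAI's HumanEval paper); the sample counts below are made up purely for illustration:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: n samples per problem, c of them correct, k attempts.
    Computes 1 - C(n-c, k) / C(n, k) in a numerically stable way."""
    if n - c < k:
        return 1.0
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# With k=1 this reduces to the plain fraction of first-try solves:
print(pass_at_k(n=16, c=7, k=1))  # 0.4375
```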
ModernBERT Instruct - Encoder Model for General Tasks

ModernBERT, known for its efficient encoder-only architecture, now has an instruct version, ModernBERT Instruct, capable of handling general tasks.

* Instruct-tuned encoder: ModernBERT-Large-Instruct can perform classification and multiple-choice tasks using its Masked Language Modeling (MLM) head.
* Beats Qwen 0.5B: Outperforms Qwen 0.5B on MMLU and MMLU Pro benchmarks.
* Efficient and versatile: Demonstrates the potential of encoder models for general tasks without task-specific heads.

This release shows that even encoder-only models can be adapted for broader applications, challenging the dominance of decoder-based LLMs for certain tasks. Check out the announcement on X.com.

Big CO LLMs + APIs

RIP GPT-5 and o3 - OpenAI Announces Public Roadmap

OpenAI shook things up this week with a roadmap update from Sam Altman, announcing a shift in strategy for GPT-5 and the o-series models. Get ready for GPT-4.5 (Orion) and a unified GPT-5 system!

* GPT-4.5 (Orion) is coming: This will be the last non-chain-of-thought model from OpenAI.
* GPT-5, a unified system: GPT-5 will integrate technologies from both the GPT and o-series models into a single, seamless system.
* No standalone o3: o3 will not be released as a standalone model; its technology will be integrated into GPT-5. "We will no longer ship o3 as a standalone model," Sam Altman stated.
* Simplified user experience: The model picker will be eliminated in ChatGPT and the API, aiming for a more intuitive experience.
* Subscription tier changes: Free users will get unlimited access to GPT-5 at a standard intelligence level, while Plus and Pro subscribers will gain access to increasingly advanced intelligence settings of GPT-5.
* Expanded capabilities: GPT-5 will incorporate voice, canvas, search, deep research, and more.

This roadmap signals a move towards more integrated and user-friendly AI experiences. As Wolfram noted, "Having unified access, the AI should be smart enough... we need an AI to pick which AI to use." This seems to be OpenAI's direction. Read Sam Altman's full announcement on X.com.

OpenAI Releases ModelSpec v2

OpenAI also released ModelSpec v2, an update to their document defining desired AI model behaviors, emphasizing customizability, transparency, and intellectual freedom.

* Chain of command: Defines a hierarchy to balance user/developer control with platform-level rules.
* Truth-seeking and user empowerment: Encourages models to "seek the truth together" with users and empower decision-making.
* Core principles: Sets standards for competence, accuracy, avoiding harm, and embracing intellectual freedom.
* Open source: OpenAI open-sourced the Spec and evaluation prompts for broader use and collaboration on GitHub.

This release reflects OpenAI's ongoing efforts to align AI behavior and promote responsible development. Wolfram praised ModelSpec, saying, "I was all over the original ModelSpec back when it was announced in the first place... That is one very important aspect when you have the AI agent going out on the web and getting information from untrusted sources." Explore ModelSpec v2 on the dedicated website.

VP Vance Speech at AI Summit in Paris - Deregulate and Dominate!

Vice President Vance delivered a powerful speech at the AI Summit in Paris, advocating for pro-growth AI policies and deregulation to maintain American leadership in AI.

* Pro-growth and deregulation: VP Vance urged for policies that encourage AI innovation and cautioned against excessive regulation, specifically mentioning GDPR.
* American AI leadership: Emphasized ensuring American AI technology remains the global standard and blocking hostile foreign adversaries from weaponizing AI. "Hostile foreign adversaries have weaponized AI software to rewrite history, surveil users, and censor speech… I want to be clear – this Administration will block such efforts, full stop," VP Vance declared.
* Key points: Ensure American AI leadership; encourage pro-growth AI policies; maintain AI's freedom from ideological bias; prioritize a pro-worker approach to AI development; safeguard American AI and chip technologies; block hostile foreign adversaries' weaponization of AI.

Nisten commented, "He really gets something that most EU politicians do not understand: whenever they have such a good thing, they're like, okay, this must be bad, and we must completely stop it." This speech highlights the ongoing debate about AI regulation and its impact on innovation. Read the full speech here.
Cerebras Powers Perplexity with Blazing Speed (1200 t/s!)

Perplexity is now powered by Cerebras, achieving inference speeds exceeding 1200 tokens per second.

* Unprecedented speed: Perplexity's Sonar model now flies at over 1200 tokens per second thanks to Cerebras' massive wafer-scale chips. "Perplexity Sonar, their specific LLM for search, is now powered by Cerebras, and it's like 1,200 tokens per second. It matches Google now on speed," I noted on the show.
* Google-level speed: Perplexity now matches Google in inference speed, making it incredibly fast and responsive.

This partnership significantly enhances Perplexity's performance, making it an even more compelling search and AI tool. See Perplexity's announcement on X.com.

Anthropic Claude Incoming - Combined LLM + Reasoning Model

Rumors are swirling that Anthropic is set to release a new Claude model that will be a combined LLM and reasoning model, similar to OpenAI's GPT-5 roadmap.

* Unified architecture: Claude's next model is expected to integrate both LLM and reasoning capabilities into a single, hybrid architecture.
* Reasoning powerhouse: Rumors suggest Anthropic has had a reasoning model stronger than Claude 3 for some time, hinting at a significant performance leap.

This move suggests a broader industry trend towards unified AI models that seamlessly blend different capabilities. Stay tuned for official announcements from Anthropic.

Elon Musk Teases Grok 3 "Weeks Out"

Elon Musk continues to tease the release of Grok 3, claiming it will be "a few weeks out" and the "most powerful AI" they have tested, with enhanced reasoning capabilities.

* Grok 3 hype: Elon Musk claims Grok 3 will be the most powerful AI X.ai has released, with a focus on reasoning.
* Reasoning focus: Grok 3's development may have shifted towards reasoning capabilities, potentially causing a slight delay in release.

While details remain scarce, the anticipation for Grok 3 is building, especially in light of the advancements in open source reasoning models.

This Week's Buzz 🐝

Weave Dataset Editing in UI

Weights & Biases Weave has added a highly requested feature: dataset editing directly in the UI.

* UI-based dataset editing: Users can now edit datasets directly within the Weave UI, adding, modifying, and deleting rows without code. "One thing that folks asked us for, and we've recently shipped, is the ability to edit this from the UI itself. So you don't have to have code," I explained.
* Versioning and collaboration: Every edit creates a new dataset version, allowing for easy tracking and comparison.
* Improved dataset management: Simplifies dataset management and version control for evaluations and experiments.

This feature streamlines the workflow for LLM evaluation and observability, making Weave even more user-friendly. Try it out at wandb.me/weave.
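For the code-first route, here's a minimal sketch of building the same kind of versioned dataset with the Weave Python SDK (the project name and rows are made up for illustration); UI edits then simply stack new versions on top of what you publish:

```python
import weave

weave.init("my-team/eval-project")  # hypothetical project

# Datasets are versioned objects: publishing the same name again creates a
# new version, just like edits made in the UI now do.
dataset = weave.Dataset(
    name="support-questions",
    rows=[
        {"question": "How do I reset my password?", "expected": "Use the reset link."},
        {"question": "Where do I find invoices?", "expected": "Billing > Invoices."},
    ],
)
weave.publish(dataset)
```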
Toronto Workshops - AI in Production: Evals & Observability

Don't miss our upcoming AI in Production: Evals & Observability workshops in Toronto!

* Two dates: Sunday and Monday workshops in Toronto.
* Hands-on learning: Learn to build and evaluate LLM-powered applications with robust observability.
* Expert guidance: Led by yours truly, Alex Volkov, and featuring Nisten.
* Limited spots: Registration is still open, but spots are filling up fast! Register for Sunday's workshop here and Monday's workshop here.

Join us to level up your LLM skills and network with the Toronto AI community!

Vision & Video

Adobe Firefly Video - Image to Video and Text to Video

Adobe announced Firefly Video, entering the image-to-video and text-to-video generation space.

* Video generation: Firefly Video offers both image-to-video and text-to-video capabilities.
* Adobe ecosystem: Integrates with Adobe's creative suite, providing a powerful tool for video creators.

This release marks Adobe's significant move into the rapidly evolving video generation landscape. Try Firefly Video here.

Voice & Audio

YouTube Expands AI Dubbing to All Creators

YouTube is expanding AI dubbing to all creators, breaking down language barriers on the platform.

* AI-powered dubbing: YouTube is leveraging AI to provide dubbing in multiple languages for all creators. "YouTube now expands AI dubbing in languages to all creators, and that's super cool. So basically, no language barriers anymore. AI dubbing is here," I announced.
* Increased watch time: The pilot program saw 40% of watch time in dubbed languages, demonstrating the feature's impact. "Since the pilot launched last year, 40 percent of watch time for videos with the feature enabled was in the dubbed language and not the original language. That's insane!" I highlighted.
* Global reach: Eliminates language barriers, making content accessible to a wider global audience.

Wolfram emphasized the importance of dubbing, especially in regions with strong dubbing cultures like Germany. "Every movie that comes here gets dubbed in high quality. And now AI is doing that on YouTube. And I personally, as a content creator, always have to decide: do I post in German or English?" This feature is poised to revolutionize content consumption on YouTube. Read more on X.com.

Meta Audiobox Aesthetics - Unified Quality Assessment

Meta released Audiobox Aesthetics, a unified automatic quality assessment model for speech, music, and sound.

* Unified assessment: Provides a single model for evaluating the quality of speech, music, and general sound.
* Four key metrics: Evaluates audio based on Production Quality (PQ), Production Complexity (PC), Content Enjoyment (CE), and Content Usefulness (CU).
* Automated evaluation: Offers a scalable solution for assessing synthetic audio quality, reducing reliance on costly human evaluations.

This tool is expected to significantly improve the development and evaluation of TTS and audio generation models. Access the Paper and Weights on GitHub.
Zonos - Expressive TTS with High-Fidelity Cloning

Zyphra released Zonos, a highly expressive TTS model with high-fidelity voice cloning capabilities.

* Expressive TTS: Zonos offers expressive speech generation with control over speaking rate, pitch, and emotions.
* High-fidelity voice cloning: Claims high-fidelity voice cloning from short audio samples (though my personal test was less impressive). "My own voice clone sounded a little bit like me, but not a lot. OK, at least for me, the cloning is really, really bad," I admitted on the show.
* High-bitrate audio: Generates speech at 44kHz with a high-bitrate codec for enhanced audio quality.
* Open source & API: Models are open source, with a commercial API available.

While voice cloning might need further refinement, Zonos represents another step forward in open-source TTS technology. Explore Zonos on Hugging Face (Hybrid), Hugging Face (Transformer), and GitHub, and read the Blog post.

Tools & Others

Emergent Values AI - AI Utility Functions and Biases

Researchers found that AIs exhibit emergent values, including biases in valuing human lives from different regions.

* Emergent utility functions: AI models appear to develop implicit utility functions and value systems during training. "Research finds that AIs have expected utility functions for people and other emergent values. And this is freaky," I summarized.
* Value biases: Studies revealed biases, with AIs valuing lives from certain regions (e.g., Nigeria, Pakistan, India) higher than others (e.g., Italy, France, Germany, UK, US). "One Nigerian person was valued like eight US people," I highlighted as the surprising finding.
* Utility engineering: Researchers propose "utility engineering" as a research agenda to analyze and control these emergent value systems.

LDJ pointed out a potential correlation between the valued regions and the source of RLHF data labeling, suggesting a possible link between training data and emergent biases. While the study is still debated, it raises important questions about AI value alignment. Read the announcement on X.com and the Paper.

LM Studio Lands Support for Speculative Decoding

LM Studio, the popular local LLM inference tool, now supports speculative decoding, significantly speeding up inference.

* Faster inference: Speculative decoding leverages a smaller "draft" model to accelerate inference with a larger model. "Speculative decoding finally landed in LM Studio, which is dope, folks. If you use LM Studio — and if you don't, you should," I exclaimed.
* Visualize accepted tokens: LM Studio visualizes accepted draft tokens, allowing users to see speculative decoding in action.
* Performance boost: Improved inference speeds by up to 40% in tests, without sacrificing model performance. "It runs around 10 tokens per second without speculative decoding and around 14 to 15 tokens per second with speculative decoding, which is great," I noted.

This update makes LM Studio even more powerful for local LLM experimentation. See the announcement on X.com.
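For intuition, here's a toy sketch of the mechanism — the greedy variant, not LM Studio's actual implementation (production systems verify all draft tokens in one batched forward pass, which is where the speedup comes from, and sampled decoding uses a probabilistic accept/reject test instead of exact matching):

```python
from typing import Callable, List

def speculative_step(
    target_next: Callable[[List[str]], str],  # expensive model, one token at a time
    draft_next: Callable[[List[str]], str],   # cheap draft model
    context: List[str],
    k: int = 4,
) -> List[str]:
    """One round of greedy speculative decoding (toy version)."""
    # 1) The draft model cheaply proposes k tokens.
    proposal: List[str] = []
    for _ in range(k):
        proposal.append(draft_next(context + proposal))

    # 2) The target model checks the proposal and keeps the agreed prefix.
    accepted: List[str] = []
    for tok in proposal:
        if target_next(context + accepted) == tok:
            accepted.append(tok)
        else:
            break

    # 3) The target always contributes one token of its own, so even a full
    #    rejection makes progress; output quality matches the target model.
    accepted.append(target_next(context + accepted))
    return context + accepted
```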
Noam Shazeer / Jeff Dean on Dwarkesh Podcast

Podcast enthusiasts should check out the new Dwarkesh Podcast episode featuring Noam Shazeer (Transformer co-author) and Jeff Dean (Google DeepMind). Tune in to hear insights from two AI pioneers. Find the announcement on X.com.

What a week, folks! From rogue AI analyzing my personal life to OpenAI shaking up the roadmap and tiny models conquering math, the AI world continues to deliver surprises. Here are some key takeaways:

* Open source is exploding: Nomic Embed Text V2, OLMoE, DeepScaler 1.5B, and ModernBERT Instruct are pushing the boundaries of what's possible with open, accessible models.
* Speed is king: Groq, Cerebras, and SambaNova are delivering blazing-fast inference, making real-time AI applications more feasible than ever.
* Reasoning is evolving: DeepScaler 1.5B's success demonstrates the power of RL for even small models, and OpenAI and Anthropic are moving towards unified models with integrated reasoning.
* Privacy matters: AllenAI's OLMoE highlights the growing importance of on-device AI for data privacy.
* The AI landscape is shifting: OpenAI's roadmap announcement signals a move towards simpler, more integrated AI experiences, while government officials are taking a stronger stance on AI policy.

Stay tuned to ThursdAI for the latest updates, and don't forget to subscribe to the newsletter for all the links and details! Next week, I'll be in New York, so expect a special edition of ThursdAI from the AI Engineer floor.

TLDR & Show Notes

* Open Source LLMs
  * NousResearch DeepHermes-3 Preview (X, HF)
  * Nomic Embed Text V2 - first embedding MoE (HF, Tech Report)
  * AllenAI OLMoE on iOS as a standalone app & new Tulu 3.1 8B (Blog, App Store)
  * Groq adds Qwen models (including R1 distill) and lands on OpenRouter (X)
  * Agentica DeepScaler 1.5B beats o1-preview on math using RL for $4,500 (X, HF, WandB)
  * ModernBERT can be instructed (though encoder-only) to do general tasks (X)
  * LMArena releases a dataset of 100K votes with human preferences (X, HF)
  * SambaNova adds full DeepSeek R1 671B - flies at 200t/s (blog)
* Big CO LLMs + APIs
  * RIP GPT-5 and o3 - OpenAI announces a public roadmap (X)
  * OpenAI released Model Spec v2 (Github, Blog)
  * VP Vance speech at AI Summit in Paris (full speech)
  * Cerebras now powers Perplexity with >1200t/s (X)
  * Anthropic Claude incoming, will be combined LLM + reasoning (The Information)
* This week's Buzz
  * We've added dataset editing in the UI (X)
  * 2 workshops in Toronto, Sunday and Monday
* Vision & Video
  * Adobe announces Firefly Video (img2video and txt2video) (try it)
* Voice & Audio
  * YouTube to expand AI dubbing to all creators (X)
  * Meta Audiobox Aesthetics - unified automatic quality assessment for speech, music, and sound (Paper, Weights)
  * Zonos, a highly expressive TTS model with high-fidelity voice cloning (Blog, HF, HF, Github)
* Tools & Others
  * Emergent Values AI - research finds that AIs have expected utility functions (X, paper)
  * LM Studio lands support for speculative decoding (X)
  * Noam Shazeer / Jeff Dean on the Dwarkesh podcast (X)

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe
    --------  
    1:43:48
  • 📆 ThursdAI - Feb 6 - OpenAI DeepResearch is your personal PHD scientist, o3-mini & Gemini 2.0, OmniHuman-1 breaks reality & more AI news
What's up friends, Alex here, back with another ThursdAI, hot off the presses.

Hold onto your hats, because this week was another whirlwind of AI breakthroughs, mind-blowing demos, and straight-up game-changers. We dove deep into OpenAI's new "Deep Research" agent – and let me tell you, it's not just hype, it's legitimately revolutionary. You also don't have to take my word for it: a new friend of the pod and a scientist, Dr. Derya Unutmaz, joined us to discuss his experience with Deep Research as a scientist himself! You don't want to miss this conversation! We also unpack Google's Gemini 2.0 release, including the blazing-fast Flash Lite model. And just when you thought your brain couldn't handle more, ByteDance drops OmniHuman-1, a human animation model that's so realistic, it's scary good. I've also seen maybe 10 more…

TLDR & Show Notes

* Open Source LLMs (and deep research implementations)
  * Jina Node-DeepResearch (X, Github)
  * HuggingFace - OpenDeepResearch (X)
  * Deep Agent - R1-V (X, Github)
  * Krutrim - Krutrim 2 12B, Chitrarth VLM, embeddings and more from India (X, Blog, HF)
  * Simple Scaling - S1-R1 (Paper)
  * Mergekit updated
* Big CO LLMs + APIs
  * OpenAI ships o3-mini and o3-mini high + updates thinking traces (Blog, X)
  * Mistral relaunches LeChat with Cerebras for 1000t/s (Blog)
  * OpenAI Deep Research - the researching agent that uses o3 (X, Blog)
  * Google ships Gemini 2.0 Pro, Gemini 2.0 Flash-Lite in AI Studio (Blog)
  * Anthropic Constitutional Classifiers - announced a universal jailbreak prevention (Blog, Try It)
  * Cloudflare to protect websites from AI scraping (News)
  * HuggingFace becomes the AI Appstore (link)
* This week's Buzz - Weights & Biases updates
  * AI Engineer workshop (Saturday 22)
  * Tinkerers Toronto workshops (Sunday 23, Monday 24)
  * We released a new Dataset editor feature (X)
* Audio and Sound
  * KyutAI open sources Hibiki - simultaneous translation models (Samples, HF)
* AI Art & Diffusion & 3D
  * ByteDance OmniHuman-1 - unparalleled human animation models (X, Page)
  * Pika Labs adds PikaAdditions - adding anything to existing video (X)
  * Google added Imagen 3 to their API (Blog)
* Tools & Others
  * Mistral Le Chat has iOS and Android apps now (X)
  * CoPilot now has agentic workflows (X)
  * Replit launches free apps agent for everyone (X)
  * Karpathy drops a new 3-hour video on YouTube (X, Youtube)
  * OpenAI canvas links are now shareable (like Anthropic artifacts) (example)
* Show Notes & Links
  * Guest of the week - Dr. Derya Unutmaz - talking about Deep Research
  * His examples: Ehlers-Danlos Syndrome (ChatGPT), ME/CFS Deep Research, Nature article about Deep Research with Derya's comments
  * Hosts: Alex Volkov - AI Evangelist & Host (@altryne); Wolfram Ravenwolf - AI Evangelist (@WolframRvnwlf); Nisten Tahiraj - AI Dev at github.GG (@nisten); LDJ - Resident data scientist (@ldjconfirmed)

Big Companies products & APIs

OpenAI's new chatGPT moment with Deep Research, their second "agent" product (X)

Look, I've been reporting on AI weekly for almost 2 years now, and I've been following the space closely since way before chatGPT (shoutout Codex days), and this definitely feels like another chatGPT moment for me.

DeepResearch is OpenAI's new agent that searches the web for any task you give it, is able to reason about the results, and continues searching those sources, to provide you with an absolutely incredible level of research into any topic, scientific or... the best taqueria in another country.
The reason why it's so good is its ability to pursue multiple search trajectories, backtrack if it needs to, and react in real time to new information. It also has Python tool use (to do plots and calculations), and of course, the brain of it is o3, the best reasoning model from OpenAI.

Deep Research is only offered on the Pro tier ($200) of chatGPT, and it's the first publicly available way to use the full o3! And boy, does it deliver! I've had it review my workshop content, help me research LLM-as-a-judge articles (which it did masterfully), and help me plan date nights in Denver (though it kind of failed at that, showing me a closed restaurant).

A breakthrough for scientific research

But I'm no scientist, so I asked Dr. Derya Unutmaz, M.D. to join us and share his incredible findings as a doctor, a scientist, and someone with decades of experience in writing grants, patent applications, papers, etc. The whole conversation is very much worth listening to on the pod (we talked for almost an hour), but the highlights are honestly quite crazy.

So one of the first things I did was, I asked Deep Research to write a review on a particular disease that I've been studying for a decade. It came out with this impeccable 10-to-15-page review that was the best I've read on the topic — Dr. Derya Unutmaz

And another banger quote:

It wrote a phenomenal 25-page patent application for a friend's cancer discovery—something that would've cost 10,000 dollars or more and taken weeks. I couldn't believe it. Every one of the 23 claims it listed was thoroughly justified.

Humanity's LAST exam?

OpenAI announced Deep Research and showed that on the HLE (Humanity's Last Exam) benchmark, which was just released a few weeks ago, it scores a whopping 26.6 percent! When HLE was released (our coverage here) all the way back on... checks notes... January 23 of this year, the top reasoning models at the time (o1, R1) scored just under 10%. o3-mini and Deep Research now score 13% and 26.6% respectively, which means both that AI is advancing like crazy, but also... that maybe calling this the "last exam" was a bit premature? 😂😅

Deep Research is now also the SOTA holder on GAIA, a public benchmark on real-world questions, though Clementine (one of GAIA's authors) throws a bit of shade on the result since OpenAI didn't really submit their results. Incidentally, Clementine is also involved in Hugging Face's attempt at replicating Deep Research in the open (with OpenDeepResearch).

OpenAI releases o3-mini and o3-mini high

This honestly got kind of buried under the Deep Research news, but as promised, on the last day of January, OpenAI released their new reasoning model, which is significantly faster and much cheaper than o1, while matching it on most benchmarks! I've been saying since the o3 announcement (our coverage) that the mini may be a more practical and useful release than o3 itself, given its price and speed. And voilà: OpenAI has reduced the price point of their best reasoner model by 67%, and it now sits at just 2x the price of DeepSeek R1. Coming in at $1.10 per 1M input tokens and $4.40 per 1M output tokens, and streaming at a whopping 1000t/s in some instances, this reasoner is really something to beat.

Great for application developers

Beyond seeming like a great model, comparing it to R1 is a nonstarter IMO, and not only because of the "it's sending your data to choyna" line, which IMO is a ridiculous attack vector, and people should be ashamed of posting this content.
o3-mini supports all of the nice API things that OpenAI has, like tool use, structured outputs, developer messages, and streaming. The ability to set the reasoning effort is also interesting for applications! An added benefit is the new 200K context window with 100K (claimed) output tokens. It's also really, really fast; while R1 availability grows as it gets hosted on more and more US-based providers, none of them are offering the full context window at these token speeds.

o3-mini-high?!

While free users also started getting access to o3-mini, with the "reason" button on chatGPT, Plus subscribers received 2 models: o3-mini and o3-mini-high, which is essentially the same model, but with the "high" reasoning mode turned on, giving the model significantly more compute (and tokens) to think. This can be done at the API level by selecting reasoning_effort=high, but it's the first time OpenAI is exposing this to non-API users!

One highlight for me is just how MANY tokens o3-mini-high thinks through. In one of my evaluations on Weave, o3-mini-high generated around 160K output tokens answering 20 questions, while DeepSeek R1, for example, generated 75K, and Gemini Thinking got the highest score on these while charging for only 14K tokens (though I'm pretty sure Google just doesn't report thinking tokens yet; this seems like a bug).

As I'm writing this, OpenAI just announced a new update: o3-mini and o3-mini-high now show... "updated" reasoning traces! These definitely "feel" more like the R1 reasoning traces (remember, previously OpenAI had a different model summarizing the reasoning to prevent training on it?), but they are not really the RAW ones (confirmed).
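For API users, flipping that same switch looks roughly like this; the reasoning_effort parameter follows OpenAI's docs for o-series models, but treat the exact model string as an assumption and check the current model list:

```python
# Hedged sketch: requesting high reasoning effort from o3-mini via the API.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

resp = client.chat.completions.create(
    model="o3-mini",
    reasoning_effort="high",  # "low" | "medium" | "high"
    messages=[{"role": "user", "content": "Prove that the square root of 2 is irrational."}],
)
print(resp.choices[0].message.content)
```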
Google ships Gemini 2.0 Pro, Gemini 2.0 Flash-Lite in AI Studio (X, Blog)

Congrats to our friends at Google for 2.0 👏 Google finally put all the experimental models under one 2.0 umbrella, giving us Gemini 2.0 Pro and Gemini 2.0 Flash, plus a new model: Gemini 2.0 Flash-Lite, a crazy fast and cheap model that performs similarly to Flash 1.5. The rate limits on Flash-Lite are twice as high as the regular Flash, making it incredibly useful for real-time applications. They also released a few benchmarks, but they only compared against previous numbers released by Google, and while that's great, I wanted a head-to-head comparison, so I asked Deep Research to do it for me, and it did (with citations!).

Google also released Imagen 3, their awesome image diffusion model, in their API today at 3 cents per image. This one is really, really good!

Mistral's new LeChat spits out 1000t/s + new iOS apps

During the show, Mistral announced new capabilities for their LeChat interface, including a $15/mo tier, but most importantly, crazy-fast generation using some kind of new inference, spitting out around 1000t/s (powered by Cerebras). Additionally, they have a code interpreter there, Canvas, and they also claim to have the best OCR. And don't forget, they have access to Flux images; they are likely the only place I know of that offers that image model for free! Finally, they've released native mobile apps! (iOS, Android)

* From my quick tests, the 1000t/s is not always on; my first attempt was instant, it was like black magic, and then the rest of them were pretty much the same speed as before 🤔 Maybe they are getting hammered with traffic...

This week's Buzz (What I learned with WandB this week)

I got to play around with o3-mini before it was released (perks of working at Weights & Biases!), and I used Weave, our observability and evaluation framework, to analyze its performance. The results were… interesting.

* Latency and token count: o3-mini high's latency was six times longer than o3-mini low on a simple reasoning benchmark (92 seconds vs. 6 seconds). But here's the kicker: it didn't even answer more questions correctly! And the token count? o3-mini high used half a million tokens to answer 20 questions three times. That's… a lot.
* Weave leaderboards: Nisten got super excited about using Weave's leaderboard feature to benchmark models. He realized it could solve a real problem in the open-source community – providing a verifiable and transparent way to share benchmark results. (Really, we didn't rehearse this!)

I also announced some upcoming workshops I'd love to see you at:

* AI Engineer workshop in NYC: I'll be running a workshop on evaluations at the AI Engineer Summit in New York on February 22nd. Come say hi and learn about evals!
* AI Tinkerers workshops in Toronto: I'll also be doing two workshops with AI Tinkerers in Toronto on February 23rd and 24th.

ByteDance OmniHuman-1 - a reality-bending, mind-breaking img2human model

OK, this is where my mind completely broke this week; I absolutely couldn't stop thinking about this release from ByteDance. After releasing the SOTA lipsyncing model just a few months ago (LatentSync, our coverage), they have once again blown everyone away, this time with an img2avatar model that's unlike anything we've ever seen. This one doesn't need words; just watch my live reaction as I lose my mind.

The level of real-world building in these videos is just absolutely... too much? The piano keys moving; a video of a woman speaking into a microphone while, behind her, the window shows reflections of cars and people moving! The thing that most blew me away upon review was the Nikki Glaser video, with the shiny dress and the model almost perfectly replicating the right sources of light. Just absolute sorcery! The authors confirmed that they don't have any immediate plans to release this as a model or even a product, but given the speed of open source, we'll get this within a year for sure! Get ready.

Open Source LLMs (and deep research implementations)

This week wasn't massive for open-source releases in terms of entirely new models, but the ripple effects of DeepSeek's R1 are still being felt. The community is buzzing with attempts to replicate and build upon its groundbreaking reasoning capabilities. It feels like everyone is scrambling to figure out the "secret sauce" behind R1's "aha moment," and we're seeing some fascinating results.

Jina Node-DeepResearch and HuggingFace OpenDeepResearch

The community wasted no time trying to replicate OpenAI's Deep Research agent.

* Jina AI released "Node-DeepResearch" (X, Github), claiming it follows the "query, search, read, reason, repeat" formula (a toy sketch of that loop follows this list). As I mentioned on the show, "I believe that they're wrong" about it being just a simple loop; o3 is likely a fine-tuned model. Still, it's awesome to see the open-source community tackling this so quickly!
* Hugging Face also announced "OpenDeepResearch" (X), aiming to create a truly open research agent. Clementine Fourrier, one of the authors behind the GAIA benchmark (which measures research agent capabilities), is involved, so this is definitely one to watch.
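To make the "query, search, read, reason, repeat" formula concrete, here's a toy sketch of that loop; llm, search_web, and read_page are stand-in functions, not a real API, and OpenAI's actual Deep Research almost certainly relies on a fine-tuned model rather than a bare loop like this:

```python
def deep_research(question, llm, search_web, read_page, max_steps=10):
    """Toy 'query, search, read, reason, repeat' research loop."""
    notes = []
    query = question
    for _ in range(max_steps):
        results = search_web(query)                          # search
        notes.extend(read_page(url) for url in results[:3])  # read
        decision = llm(                                      # reason
            f"Question: {question}\nNotes so far: {notes}\n"
            "Reply 'ANSWER: <answer>' if the notes suffice, "
            "else 'QUERY: <next search query>'."
        )
        if decision.startswith("ANSWER:"):
            return decision[len("ANSWER:"):].strip()
        query = decision[len("QUERY:"):].strip()             # repeat
    return llm(f"Give a best-effort answer to {question!r} from these notes: {notes}")
```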
Deep Agent - R1-V: These folks claim to have replicated DeepSeek R1's "aha moment" – where the model realizes its own mistakes and rethinks its approach – for just $3! (X, Github) As I said on the show, "It's crazy, right? Nothing costs $3 anymore. Like, it's half a coffee at Starbucks." They even claim you can witness this "aha moment" in a VLM. Open source is moving fast.

Krutrim - Krutrim 2 12B, Chitrarth VLM, embeddings and more from India: This Indian AI lab released a whole suite of models, including an improved LLM (Krutrim 2), a VLM (Chitrarth 1), a speech-language model (Dhwani 1), an embedding model (Vyakhyarth 1), and a translation model (Krutrim Translate 1). (X, Blog, HF) They even developed a benchmark called "BharatBench" to evaluate Indic AI performance. However, the community was quick to point out some… issues. As Harveen Singh Chadha pointed out on X, it seems like they blatantly copied IndicTrans, an MIT-licensed model, without even mentioning it. Not cool, Krutrim. Not cool.

AceCoder: This project focuses on using reinforcement learning (RL) to improve code models. (X) They claim to have created a pipeline to automatically generate high-quality, verifiable code training data. They trained a reward model (AceCode-RM) that significantly boosts the performance of Llama-3.1 and Qwen2.5-Coder-7B. They even claim you can skip SFT training for code models by using just 80 steps of R1-style training!

Simple Scaling - S1-R1: This paper (Paper) showcases the power of quality over quantity. They fine-tuned Qwen2.5-32B-Instruct on just 1,000 carefully curated reasoning examples and matched the performance of o1-preview! They also introduced a technique called "budget forcing," allowing the model to control its test-time compute and improve performance. As I mentioned, Niklas Muennighoff, who worked at AllenAI and was previously on the show, is involved. This is one to really pay attention to – it shows that you don't need massive datasets to achieve impressive reasoning capabilities.

Unsloth reduces R1-type reasoning to just 7GB VRAM (blog)

DeepSeek R1-Zero autonomously learned reasoning, in what the DeepSeek researchers called the "aha moment." Unsloth adds another attempt at replicating this "aha moment," and claims they got it down to less than 7GB of VRAM; you can see it for free, in a Google Colab! This magic can be recreated through GRPO, an RL algorithm that optimizes responses efficiently without requiring a value function, unlike Proximal Policy Optimization (PPO), which relies on one.

How it works (a small code sketch follows the steps):

1. The model generates groups of responses.
2. Each response is scored based on correctness, or another metric produced by a set reward function rather than an LLM reward model.
3. The average score of the group is computed.
4. Each response's score is compared to the group average.
5. The model is reinforced to favor higher-scoring responses.
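The heart of that recipe is the group-relative advantage. Here's a small sketch of that step (an illustration of the idea, not Unsloth's code); in practice this advantage feeds a PPO-style clipped objective with a KL penalty against a reference model:

```python
import torch

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Group-relative advantages, the core of GRPO.

    rewards: (num_groups, samples_per_group) -- one programmatic score per
    response (e.g. 1.0 if the final answer was correct, else 0.0).
    Each response is judged against its own group's mean, so no learned
    value function is needed, unlike PPO.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + 1e-4)

# Toy example: one group of 4 sampled responses, only the last one correct.
print(grpo_advantages(torch.tensor([[0.0, 0.0, 0.0, 1.0]])))
# Positive advantage for the correct response, negative for the rest; the
# policy update then raises the log-probability of the positive ones.
```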
Tools

A few new and interesting tools were released this week as well:

* Replit rebuilt their Replit Agents, released them in an iOS app, and made them free for many users. It can now build mini apps for you on the fly! (Replit)
* Mistral has iOS/Android apps with the new release of LeChat (X)
* Molly Cantillon released RPLY, which sits on your Mac and drafts replies to your messages. I installed it while writing this newsletter, and I did not expect it to hit this hard: it reviewed and summarized my texting patterns to "sound like me," and the models sit on-device as well. A very, very well-crafted tool, and best of all, it can run models on-device if you want!
* GitHub Copilot announced agentic workflows and next-line editing, which are Cursor-style features. To try them out you have to download VS Code Insiders. They also added Gemini 2.0 (Blog)

The AI field moves SO fast; I had to update the content of the newsletter around 5 times while writing it as new things kept getting released! This was a banger week that started with o3-mini and Deep Research, continued with Gemini 2.0 and OmniHuman, and "ended" with Mistral x Cerebras, GitHub Copilot agents, o3-mini's updated CoT reasoning traces, and a bunch more! AI doesn't stop, and we're here weekly to cover all of this and give you the highlights, but also to go deep! I really appreciate Derya's appearance on the show this week; please give him a follow, and see you next week!

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe
    --------  
    1:40:29
  • 📆 ThursdAI - Jan 30 - DeepSeek vs. Nasdaq, R1 everywhere, Qwen Max & Video, Open Source SUNO, Goose agents & more AI news
Hey folks, Alex here 👋

It's official—grandmas (and the entire stock market) now know about DeepSeek. If you've been living under an AI rock, DeepSeek's new R1 model just set the world on fire, rattling Wall Street (causing the biggest monetary loss for any company, ever!) and rocketing to #1 on the iOS App Store. This week's ThursdAI show took us on a deep (pun intended) dive into the dizzying whirlwind of open-source AI breakthroughs, agentic mayhem, and big-company cat-and-mouse announcements. Grab your coffee (or your winter survival kit if you're in Canada), because in true ThursdAI fashion, we've got at least a dozen bombshells to cover—everything from brand-new Mistral to next-gen vision models, new voice synthesis wonders, and big moves from Meta and OpenAI.

We're also talking "reasoning mania," as the entire industry scrambles to replicate, dethrone, or ride the coattails of the new open-source champion, R1. So buckle up—because if the last few days are any indication, 2025 is officially the Year of Reasoning (and quite possibly the Year of Agents, or both!)

Open Source LLMs

DeepSeek R1 Discourse Crashes the Stock Market

One-sentence summary: DeepSeek's R1 "reasoning model" caused a frenzy this week, hitting #1 on the App Store and briefly sending NVIDIA's stock plummeting in the process (a $560B drop, the largest single-stock monetary loss ever).

Ever since DeepSeek R1 launched (our technical coverage last week!), the buzz has been impossible to ignore—everyone from your mom to your local barista has heard the name. The speculation? DeepSeek's new architecture apparently only cost $5.5 million to train, fueling the notion that high-level AI might be cheaper than Big Tech claims. Suddenly, people wondered if GPU manufacturers like NVIDIA might see shrinking demand, and the stock indeed took a short-lived 17% tumble. On the show, I joked, "My mom knows about DeepSeek—your grandma probably knows about it, too," underscoring just how mainstream the hype has become.

Not everyone is convinced the cost claims are accurate. Even Dario Amodei of Anthropic weighed in with a blog post arguing that DeepSeek's success increases the case for stricter AI export controls.

Public Reactions

* Dario Amodei's blog: In "On DeepSeek and Export Controls," Amodei argues that DeepSeek's efficient scaling exemplifies why democratic nations need to maintain a strategic leadership edge—and enforce export controls on advanced AI chips. He sees Chinese breakthroughs as proof that AI competition is global and intense.
* OpenAI distillation evidence: OpenAI mentioned it found "distillation traces" of GPT-4 inside R1's training data. Hypocrisy or fair game? On ThursdAI, the panel mused that "everyone trains on everything," so perhaps it's a moot point.
* Microsoft reaction: Microsoft wasted no time, swiftly adding DeepSeek to Azure—further proof that corporations want to harness R1's reasoning power, no matter where it originated.
* Government reaction: Even government officials weighed in. David Sacks, the incoming US AI & Crypto czar, discussed the claim that DeepSeek did "distillation" (using the term somewhat incorrectly), and President Trump was asked about it.
* API outages: DeepSeek's own API has gone in and out this week, apparently hammered by demand (and possibly DDoS attacks).
Meanwhile, GPU clouds like Groq are showing up to accelerate R1 at 300 tokens/second, for those who must have it right now.

We've seen so many bad takes on the topic, from seething cope takes to gross misunderstandings from government officials confusing the iOS app with the OSS models, to folks throwing conspiracy theories into the mix, claiming the $5.5M sum was a psyop. The fact of the matter is, DeepSeek R1 is an incredible model, and just a week later it is already powering multiple products (more on this below) and experiences, while pushing everyone else to compete (and give us reasoning models!)

Open Thoughts Reasoning Dataset

One-sentence summary: A community-led effort, "Open Thoughts," released a new large-scale dataset (OpenThoughts-114k) of chain-of-thought reasoning data, fueling the open-source drive toward better reasoning models.

Worried about having enough labeled "thinking" steps to train your own reasoner? Fear not. The OpenThoughts-114k dataset aggregates chain-of-thought prompts and responses—114,000 of them—for building or fine-tuning reasoning LLMs. It's now on Hugging Face for your experimentation pleasure. The ThursdAI panel pointed out how crucial these large, openly available reasoning datasets are. As Wolfram put it, "We can't rely on the big labs alone. More open data means more replicable breakouts like DeepSeek R1."
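Pulling it down for your own fine-tuning experiments is a one-liner with the datasets library; the Hub id below follows the announcement, but verify it on Hugging Face first:

```python
# Hedged sketch: loading the OpenThoughts chain-of-thought data.
from datasets import load_dataset

ds = load_dataset("open-thoughts/OpenThoughts-114k", split="train")
print(ds)     # features and row count
print(ds[0])  # one chain-of-thought example
```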
Mistral Small 2501 (24B)

One-sentence summary: Mistral AI returns to the open-source spotlight with a 24B model that fits on a single 4090, scoring over 81% on MMLU while under Apache 2.0.

Long rumored to be "going more closed," Mistral AI re-emerged this week with Mistral-Small-24B-Instruct-2501—an Apache 2.0 licensed LLM that runs easily on a 32GB VRAM GPU. That 81% MMLU accuracy is no joke, putting it well above many 30B–70B competitor models. It was described as "the perfect size for local inference and a real sweet spot," noting that for many tasks, 24B is "just big enough but not painfully heavy." Mistral also finally started comparing themselves to Qwen 2.5 in official benchmarks—a big shift from their earlier reluctance, which we applaud!
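Here's a rough sketch of running it locally with transformers; the repo id matches the announcement, and the 4-bit quantization is an assumption to squeeze the 24B weights into a single 24-32GB consumer GPU (skip it if you have ~48GB for bf16):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mistral-Small-24B-Instruct-2501"  # from the announcement
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=BitsAndBytesConfig(  # assumed: 4-bit to fit a 4090
        load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16
    ),
)

messages = [{"role": "user", "content": "Why is 24B a sweet spot for local inference?"}]
inputs = tok.apply_chat_template(messages, return_tensors="pt").to(model.device)
out = model.generate(inputs, max_new_tokens=200)
print(tok.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
```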
Berkeley TinyZero & RAGEN (R1 Replications)

One-sentence summary: Two separate projects (TinyZero and RAGEN) replicated DeepSeek R1-Zero's reinforcement learning approach, showing you can get "aha" reasoning moments with minimal compute.

If you were wondering whether R1 is replicable: yes, it is. Berkeley's TinyZero claims to have reproduced the core R1-Zero behaviors for $30 using a small 3B model. Meanwhile, the RAGEN project aims to unify RL + LLM + Agents with a minimal codebase. While neither replication is at R1-level performance, they demonstrate how quickly the open-source community pounces on new methods. "We're now seeing those same 'reasoning sparks' in smaller reproductions," said Nisten. "That's huge."

Agents

Codename Goose by Block (X, Github)

One-sentence summary: Jack Dorsey's company Block released Goose, an open-source local agent framework letting you run keyboard automation on your machine.

Ever wanted your AI to press keys and move your mouse in real time? Goose does exactly that with AppleScript, memory extensions, and a fresh approach to "local autonomy." On the show, I tried Goose, but found it occasionally "went rogue, trying to delete my WhatsApp chats." Security concerns aside, Goose is significant: it's an open-source playground for agent-building. The plugin system includes integration with Git, Figma, a knowledge graph, and more. If nothing else, Goose underscores how hot "agentic" frameworks are in 2025.

OpenAI's Operator: One Week In

It's been a week since Operator went live for Pro-tier ChatGPT users. "It's the first agent that can run for multiple minutes without bugging me every single second." Yet it's still far from perfect—captchas, login blocks, and repeated confirmations hamper tasks. The potential, though, is enormous: "I asked Operator to gather my X.com bookmarks and generate a summary. It actually tried," I shared, "but it got stuck on three links and needed constant nudges." Simon Willison added that it's "a neat tech demo" but not quite a productivity boon yet. Next steps? Possibly letting the brand-new reasoning models (like o1 Pro Reasoning) do the chain-of-thought under the hood.

I also got tired of opening hundreds of tabs for Operator, so I wrapped it in a macOS-native app that has native notifications and the ability to launch Operator tasks via a Raycast extension. If you're interested, you can find it on my Github.

Browser-use / Computer-use Alternatives

In addition to Goose, the ThursdAI panel mentioned browser-use on GitHub, plus numerous code interpreters. So far, none blow minds in reliability. But 2025 is evidently "the year of agents." If you're itching to offload your browsing or file editing to an AI agent, expect to tinker, troubleshoot, and yes, babysit. The show consensus? "It's not about whether agents are coming, it's about how soon they'll become truly robust," said Wolfram.

Big CO LLMs + APIs

Alibaba Qwen2.5-Max (& Hidden Video Model) (Try It)

One-sentence summary: Alibaba's Qwen2.5-Max stands toe-to-toe with GPT-4 on some tasks, while also quietly rolling out video-generation features.

While Western media fixates on DeepSeek, Alibaba's Qwen team quietly dropped the Qwen2.5-Max MoE model. It clocks in at 69% on MMLU-Pro—beating some OpenAI or Google offerings—and comes with a 1-million-token context window. And guess what? The official chat interface apparently does hidden video generation, though Alibaba hasn't publicized it on the English-language internet. On the Chinese AI internet, this video generation model is called Tongyi Wanxiang; it even has its own website, supports first- and last-frame video generation, and looks really, really good. They have a gallery up there, and it even generates audio together with the video! The example I saw was img2video, but the movements are really natural!

Zuckerberg on Llama 4 & Llama 4 Mini

In Meta's Q4 earnings call, Zuck was all about AI (sorry, Metaverse). He declared that Llama 4 is in advanced training, with a smaller "Llama 4 Mini" finishing pre-training. More importantly, a "reasoning model" is in the works, presumably influenced by the mania around R1. Some employees had apparently posted on Blind asking "Why are we paying billions for training if DeepSeek did it for $5 million?", so the official line is that Meta invests heavily for top-tier scale. Zuck also doubled down on saying "glasses are the perfect form factor for AI," with which I somewhat agree: I love my Meta Ray-Bans, I just wish they were integrated into iOS more. He also boasted about their HUGE datacenters, called Mesa, spanning the size of Manhattan, being built for the next step of AI.

(Nearly) Announced: o3-mini

Right before the ThursdAI broadcast, rumors swirled that OpenAI might reveal o3-mini. It's presumably GPT-4's "little cousin" at a fraction of the cost. Then… silence. Sam Altman also mentioned they would be bringing o3-mini by the end of January, but maybe the R1 craziness made them keep working on it and training it a bit more? 🤔 In any case, we'll cover it when it launches.

This Week's Buzz

We're still in the #1 spot on SWE-bench Verified with W&B Programmer, and our CTO, Shawn Lewis, chatted with friends of the pod Swyx and Alessio about it! (Give it a listen.)

We have two upcoming events:

* AI.engineer in New York (Feb 20–22). Weights & Biases is sponsoring, and I will broadcast ThursdAI live from the summit. If you snagged a ticket, come say hi—there might be a cameo from the "Chef."
* Toronto Tinkerer workshops (late February) at the University of Toronto. The Canadian AI scene is hot, so watch out for sign-ups (I will add them to the show next week).

Weights & Biases also teased more features for LLM observability (Weave) and reminded folks of their new suite of evaluation tools. "If you want to know if your AI is actually better, you do evals," Alex insisted. For more details, check out wandb.me/weave or tune into the next ThursdAI.

Vision & Video

DeepSeek - Janus Pro - Multimodal Understanding and Image Gen Unified (1.5B & 7B)

One-sentence summary: Alongside R1, DeepSeek also released Janus Pro, a unified model for image understanding and generation (like GPT-4's rumored image abilities).

DeepSeek apparently never sleeps. Janus Pro is MIT-licensed, 7B parameters, and can both parse images (SigLIP) and generate them (LlamaGen). The model outperforms DALL·E 3 and SDXL on some internal benchmarks—though at a modest 384×384 resolution.

NVIDIA's Eagle 2 Redux

One-sentence summary: NVIDIA re-released the Eagle 2 vision-language model with 4K resolution support, after mysteriously yanking it a week ago.

Eagle 2 is back, boasting a multi-expert architecture, 16k context, and high-res video analysis. Rumor says it competes with big 70B-param vision models at only 9B. But it's overshadowed by Qwen2.5-VL (below). Some suspect NVIDIA is aiming to outdo Meta's open-source hold on vision—just in time to keep GPU demand strong.

Qwen 2.5 VL - SOTA OSS Vision Model Is Here

One-sentence summary: Alibaba's Qwen 2.5 VL model claims state-of-the-art in open-source vision, including 1-hour video comprehension and "object grounding."

The Qwen team didn't hold back: "It's the final boss for vision," joked Nisten. Qwen 2.5 VL uses advanced temporal modeling for video and can handle complicated tasks like OCR or multi-object bounding boxes. Featuring advances in precise object localization, video temporal understanding, and agentic capabilities for computer use, this is going to be the model to beat!

Voice & Audio

YuE 7B (Open "Suno")

Ever dream of building the next pop star from your code editor? YuE 7B is your ticket. This model, now under Apache 2.0, supports chain-of-thought creation of structured songs, multi-lingual lyrics, and references. It's slow to infer, but it's arguably the best open music generator so far. What's more, they changed the license to Apache 2.0 just before we went live, so you can use YuE everywhere!

Riffusion Fuzz

Riffusion, a new competitor to paid audio models like Suno and Udio, launched "Fuzz," offering free music generation online until GPU meltdown. If you want to dabble in "prompt to jam track" without paying, check out Riffusion Fuzz. Will it match the emotional nuance of premium services like 11 Labs or Hauio? Possibly not.
Vision & Video

DeepSeek - Janus Pro - multimodal understanding and image gen unified (1.5B & 7B)

One-sentence summary: Alongside R1, DeepSeek also released Janus Pro, a unified model for image understanding and generation (like GPT-4's rumored image abilities).

DeepSeek apparently never sleeps. Janus Pro is MIT-licensed, 7B parameters, and can both parse images (SigLIP) and generate them (LlamaGen). The model outperforms DALL·E 3 and SDXL on some internal benchmarks, though at a modest 384×384 resolution.

NVIDIA's Eagle 2 Redux

One-sentence summary: NVIDIA re-released the Eagle 2 vision-language model with 4K resolution support, after mysteriously yanking it a week ago.

Eagle 2 is back, boasting multi-expert architecture, 16k context, and high-res video analysis. Rumor says it competes with big 70B-param vision models at only 9B. But it's overshadowed by Qwen2.5-VL (below). Some suspect NVIDIA is aiming to outdo Meta's open-source hold on vision, just in time to keep GPU demand strong.

Qwen 2.5 VL - SOTA oss vision model is here

One-sentence summary: Alibaba's Qwen 2.5 VL model claims state-of-the-art in open-source vision, including 1-hour video comprehension and "object grounding."

The Qwen team didn't hold back: "It's the final boss for vision," joked Nisten. Qwen 2.5 VL uses advanced temporal modeling for video and can handle complicated tasks like OCR or multi-object bounding boxes. Featuring advances in precise object localization, video temporal understanding, and agentic capabilities for computers, this is going to be the model to beat!

Voice & Audio

YuE 7B (Open "Suno")

Ever dream of building the next pop star from your code editor? YuE 7B is your ticket. This model supports chain-of-thought creation of structured songs, multi-lingual lyrics, and references. It's slow to infer, but it's arguably the best open-source music generator so far. What's more, they changed the license to Apache 2.0 just before we went live, so you can use YuE everywhere!

Riffusion Fuzz

Riffusion, a new competitor to paid audio models like Suno and Udio, launched "Fuzz," offering free music generation online until GPU meltdown. If you want to dabble in "prompt to jam track" without paying, check out Riffusion Fuzz. Will it match the emotional nuance of premium services like 11 Labs or Hailuo? Possibly not. But hey, free is free.

Tools (that have integrated R1)

Perplexity with R1

In the perplexity.ai chat, you can choose "Pro with R1" if you pay for it, harnessing R1's improved reasoning to parse results. For some, it's a major upgrade to "search-based question answering." Others prefer it to paying for O1 or GPT-4. I always check whether Perplexity knows what the latest episode of ThursdAI was, and this is the first time it gave a very good summary! I legit used it to research the show this week! It's really something. Meanwhile, Exa.ai also integrated a "DeepSeek Chat" for your agent-based workflows. Like it or not, R1 is everywhere.

Krea.ai with DeepSeek

Our friends at Krea, an AI art tool aggregator, also hopped on the R1 bandwagon for chat-based image searching or generative tasks.

Conclusion

Key Takeaways

* DeepSeek's R1 has massive cultural reach, from #1 apps to spooking the stock market.
* Reasoning mania is upon us: everyone from Mistral to Meta wants a piece of the logic-savvy LLM pie.
* Agentic frameworks like Goose, Operator, and browser-use are proliferating, though they're still baby-stepping through reliability issues.
* Vision and audio get major open-source love, with Janus Pro, Qwen 2.5 VL, YuE 7B, and more reshaping multimodality.
* Big Tech (Meta, Alibaba, OpenAI) is forging ahead with monster models, multi-billion-dollar projects, and cross-country expansions in search of the best reasoning approaches.

At this point, it's not even about where the next big model drop comes from; it's about how quickly the entire ecosystem can adopt (or replicate) that new methodology. Stay tuned for next week's ThursdAI, where we'll hopefully see new updates from OpenAI (maybe O3-Mini?), plus the ongoing race for best agent. Also, catch us at AI.engineer in NYC if you want to talk shop or share your own open-source success stories. Until then, keep calm and carry on training.

TLDR

* Open Source LLMs
  * DeepSeek Crashes the Stock Market: Did $5.5M training or hype do it?
  * Open Thoughts Reasoning Dataset OpenThoughts-114k (X, HF)
  * Mistral Small 2501 (24B, Apache 2.0) (HF)
  * Berkeley TinyZero & RAGEN (R1-Zero Replications) (Github, WANDB)
  * Allen Institute - Tulu 405B (Blog, HF)
* Agents
  * Goose by Blocks (local agent framework) (X, Github)
  * Operator (OpenAI) – One-Week-In (X)
  * Browser-use - oss version of Operator (Github)
* Big CO LLMs + APIs
  * Alibaba Qwen2.5-Max (+ hidden video model) (X, Try it)
  * Zuckerberg on LLama4 & "Reasoning Model" (X)
* This Week's Buzz
  * Shawn Lewis interview on Latent Space with swyx & Alessio
  * We're sponsoring the upcoming ai.engineer summit in NY (Feb 19-22), come say hi!
  * After that, we'll host 2 workshops with AI Tinkerers Toronto (Feb 23-24); make sure you're signed up to Toronto Tinkerers to receive the invite (we sold out quick last time!)
* Vision & Video
  * DeepSeek Janus Pro - 1.5B and 7B (Github, Try It)
  * NVIDIA Eagle 2 (Paper, Model, Demo)
  * Alibaba Qwen 2.5 VL (Project, HF, Github, Try It)
* Voice & Audio
  * YuE 7B (Open Suno) (Demo, HF, Github)
  * Riffusion Fuzz (free for now)
* Tools
  * Perplexity with R1 (choose Pro with R1)
  * Exa integrated R1 for free (demo)
* Participants
  * Alex Volkov (@altryne)
  * Wolfram Ravenwolf (@WolframRvnwlf)
  * Nisten Tahiraj (@nisten)
  * LDJ (@ldjOfficial)
  * Simon Willison (@simonw)
  * W&B Weave (@weave_wb)

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe
    --------  
    1:54:46
  • 📆 ThursdAI - Jan 23, 2025 - 🔥 DeepSeek R1 is HERE, OpenAI Operator Agent, $500B AI manhattan project, ByteDance UI-Tars, new Gemini Thinker & more AI news
What a week, folks, what a week! Buckle up, because ThursdAI just dropped, and this one's a doozy. We're talking seismic shifts in the open source world, a potential game-changer from DeepSeek AI that's got everyone buzzing, and oh yeah, just a casual $500 BILLION infrastructure project announcement. Plus, OpenAI finally pulled the trigger on "Operator," their agentic browser thingy, though getting it to actually operate proved to be a bit of a live show adventure, as you'll hear.

This week felt like one of those pivotal moments in AI, a real before-and-after kind of thing. DeepSeek's R1 hit the open source scene like a supernova, and suddenly, top-tier reasoning power is within reach for anyone with a Mac and a dream. And then there's OpenAI's Operator, promising to finally bridge the gap between chat and action. Did it live up to the hype? Well, let's just say things got interesting. As I'm writing this, the White House just published a freshly signed Executive Order on AI as well. What a WEEK.

Open Source AI Goes Nuclear: DeepSeek R1 is HERE!

Hold onto your hats, open source AI just went supernova! This week, the Chinese Whale Bros, DeepSeek AI, that quant trading firm turned AI powerhouse, dropped a bomb on the community in the best way possible: R1, their reasoning model, is now open source under the MIT license! As I said on the show, "Open source AI has never been as hot as this week."

This isn't just a model, folks. DeepSeek unleashed a whole arsenal: two full-fat R1 models (DeepSeek R1 and DeepSeek R1-Zero), and a whopping six distilled finetunes based on Qwen (1.5B, 7B, 14B, and 32B) and Llama (8B, 70B). One stat that blew my mind, and Nisten's for that matter, is that DeepSeek-R1-Distill-Qwen-1.5B, the tiny 1.5 billion parameter model, is outperforming GPT-4o and Claude-3.5-Sonnet on math benchmarks! "This 1.5 billion parameter model that now does this. It's absolutely insane," I exclaimed on the show. We're talking 28.9% on AIME and 83.9% on MATH. Let that sink in. A model you can probably run on your phone is schooling the big boys in math.

License-wise, it's MIT, which as Nisten put it, "MIT is like a jailbreak to the whole legal system, pretty much. That's what most people don't realize. It's like, this is, it's not my problem. You're a problem now." Basically, do whatever you want with it. Distill it, fine-tune it, build Skynet, it's all fair game.

And the vibes? "Vibes are insane," as I mentioned on the show. Early benchmarks are showing R1 models trading blows with o1-preview and o1-mini, and even nipping at the heels of the full-fat o1 in some areas. Check out these numbers.

And the price? Forget about it. We're talking roughly 50x cheaper than o1 currently. The DeepSeek R1 API is priced at $0.14 / 1M input tokens and $2.19 / 1M output tokens, compared to OpenAI's o1 at $15.00 / 1M input and a whopping $60.00 / 1M output. Suddenly, high-quality reasoning is democratized.

LDJ highlighted the "aha moment" in DeepSeek's paper, where they talk about how reinforcement learning enabled the model to re-evaluate its approach and "think more." It seems like simple RL scaling, combined with a focus on reasoning, is the secret sauce. No fancy Monte Carlo Tree Search needed, apparently!

But the real magic of open source is what the community does with it. Pietro Schirano joined us to talk about his "Retrieval Augmented Thinking" (RAT) approach, where he extracts the thinking process from R1 and transplants it to other models. "And what I found out is actually by doing so, you may even like smaller, quote unquote, you know, less intelligent model actually become smarter," Pietro explained. Frankenstein models, anyone? (John Lindquist has a tutorial on how to do it here)
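Here's a rough sketch of the RAT pattern to make it concrete; this is not Pietro's actual code, the endpoints and model names are placeholders, and it assumes the reasoner returns its chain of thought inline between `<think>` tags (some serving stacks put it in a separate field instead).

```python
import re
from openai import OpenAI

# Two OpenAI-compatible endpoints: one serving R1, one serving a smaller model.
# Base URLs, keys, and model names here are placeholders, not Pietro's setup.
reasoner = OpenAI(base_url="https://api.deepseek.com", api_key="...")
small = OpenAI(base_url="http://localhost:8000/v1", api_key="...")

def retrieval_augmented_thinking(question: str) -> str:
    # 1. Ask R1 and capture only its chain of thought.
    r1 = reasoner.chat.completions.create(
        model="deepseek-reasoner",
        messages=[{"role": "user", "content": question}],
    )
    text = r1.choices[0].message.content or ""
    match = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    thinking = match.group(1) if match else text

    # 2. Hand that reasoning to a smaller model and let it write the answer.
    final = small.chat.completions.create(
        model="small-model",  # hypothetical local model
        messages=[{
            "role": "user",
            "content": f"{question}\n\nHere is some reasoning to build on:\n{thinking}",
        }],
    )
    return final.choices[0].message.content

print(retrieval_augmented_thinking("Which is heavier, a kilo of feathers or a pound of steel?"))
```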
And then there's the genius hack from Voooogel, who figured out how to emulate a "reasoning_effort" knob by simply replacing the end-of-thinking token with "Wait, but". "This tricks the model into continuing to think," as I described it. Want your AI to really ponder the meaning of life (or just 1+1)? Now you can, thanks to open source tinkering.
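In spirit, the hack looks something like this toy re-implementation with Hugging Face transformers; it generates one token at a time for clarity (slow, but it makes the splice obvious) and assumes the distill checkpoints close their reasoning with a literal `</think>` token.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Any of the R1 distills should work; the 1.5B is the cheapest to play with.
name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16)

def generate_with_effort(prompt: str, min_rethinks: int = 2, max_tokens: int = 4096) -> str:
    ids = tok.apply_chat_template(
        [{"role": "user", "content": prompt}],
        add_generation_prompt=True,
        return_tensors="pt",
    )
    wait_but = tok("Wait, but", add_special_tokens=False, return_tensors="pt").input_ids
    rethinks = 0
    for _ in range(max_tokens):
        out = model.generate(ids, max_new_tokens=1, do_sample=False)
        if "</think>" in tok.decode(out[0, -1:]) and rethinks < min_rethinks:
            # The model tried to stop thinking: drop that token and splice in
            # "Wait, but" so it talks itself into another round of reasoning.
            ids = torch.cat([ids, wait_but], dim=1)
            rethinks += 1
            continue
        ids = out
        if ids[0, -1].item() == tok.eos_token_id:
            break
    return tok.decode(ids[0])

print(generate_with_effort("What is 1 + 1, really?"))
```

The same trick generalizes: any string that reads like a self-interruption ("Hmm, wait", "But actually") keeps the chain of thought going, which is effectively a crude reasoning-effort dial.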
Georgi Gerganov, the legend behind llama.cpp, even jumped in with a two-line snippet to enable speculative decoding, boosting inference speeds on the 32B model on my Macbook from a sluggish 5 tokens per second to a much more respectable 10-11 tokens per second. Open source collaboration at its finest, and it's only going to get better!
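If you haven't seen speculative decoding before, the idea fits in a few lines: a small draft model guesses several tokens ahead, and the big model verifies the whole guess in a single forward pass, keeping only the prefix it agrees with. Here's a simplified greedy sketch (model names are placeholders; this is the concept, not Georgi's llama.cpp implementation):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholders: a big target model plus a small draft from the same family.
target = AutoModelForCausalLM.from_pretrained("big-model")
draft = AutoModelForCausalLM.from_pretrained("small-model")
tok = AutoTokenizer.from_pretrained("big-model")

@torch.no_grad()
def speculative_step(ids: torch.Tensor, k: int = 8) -> torch.Tensor:
    # 1. The cheap draft model proposes k tokens greedily.
    proposed = draft.generate(ids, max_new_tokens=k, do_sample=False)
    # 2. The big model scores the entire proposal in ONE forward pass.
    logits = target(proposed).logits
    n = ids.shape[1]
    for i in range(k):
        # Would the target have picked the same token at position n + i?
        if logits[0, n + i - 1].argmax() != proposed[0, n + i]:
            # First mismatch: keep the verified prefix plus the target's token.
            fixed = logits[0, n + i - 1].argmax().reshape(1, 1)
            return torch.cat([proposed[:, : n + i], fixed], dim=1)
    return proposed  # all k draft tokens were accepted
```

Because the target would have produced those exact tokens anyway under greedy decoding, the speedup comes for free; transformers ships a version of this as assisted generation (the `assistant_model` argument to `generate`), and llama.cpp exposes it through its speculative decoding support.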
Thinking like a Neurotic

Many people really loved the way R1 thinks, and what I found astonishing is that I just sent "hey" and the thinking went into a whole 5-paragraph debate about how to answer. A user on X replied with "this is Woody Allen-level Neurotic", which... nerd-sniped me so hard! I used Hailuo Audio (which is great!) and ByteDance's LatentSync and gave R1 a voice! It's really something when you hear its inner monologue being spoken out loud like this!

ByteDance Enters the Ring: UI-TARS Controls Your PC

Not to be outdone in the open source frenzy, ByteDance, the TikTok behemoth, dropped UI-TARS, a set of models designed to control your PC. And they claim SOTA performance, beating even Anthropic's computer use models and, in some benchmarks, GPT-4o and Claude.

UI-TARS comes in 2B, 7B, and 72B parameter flavors, and ByteDance even released desktop apps for Mac and PC to go along with them. "They released an app, it's called the UI-TARS desktop app. And this app basically allows you to execute the mouse clicks and keyboard clicks," I explained during the show.

While I personally couldn't get the desktop app to work flawlessly (quantization issues, apparently), the potential is undeniable. Imagine open source agents controlling your computer; the possibilities are both exciting and slightly terrifying. As Nisten wisely pointed out, "I would use another machine. These things are not safe yet; it might actually just delete your data by accident." Words to live by, folks.

LDJ chimed in, noting that UI-TARS seems to excel particularly in operating-system-level control tasks, while OpenAI's leaked "Operator" benchmarks might show an edge in browser control. It's a battle for desktop dominance brewing in open source! Notably, the common benchmark between Operator and UI-TARS is OSWorld, and UI-TARS launched with a SOTA score there.

Humanity's Last Exam: The Benchmark to Beat

Speaking of benchmarks, a new challenger has entered the arena: Humanity's Last Exam (HLE). A cool new unsaturated bench of 3,000 challenging questions across over a hundred subjects, crafted by nearly a thousand subject matter experts from around the globe. "There's no way I'm answering any of those myself. I need an AI to help me," I confessed on the show.

And guess who's already topping the HLE leaderboard? You guessed it: DeepSeek R1, with a score of 9.4%! "Imagine how hard this benchmark is if the top reasoning models that we have right now... are getting less than 10 percent completeness on this." MMLU and MATH are getting saturated? HLE is here to provide a serious challenge. Get ready to hear a lot more about HLE, folks.

Big CO LLMs + APIs: Google's Gemini Gets a Million-Token Brain

While open source was stealing the show, the big companies weren't completely silent. Google quietly dropped an update to Gemini Flash Thinking, their experimental reasoning model, and it's a big one. We're talking a 1 million token context window and code execution capabilities now baked in!

"This is Google's scariest model by far, ever built, ever," Nisten declared. "This thing, I don't like how good it is. This smells AGI-ish." High praise, and high concern, coming from Nisten! Benchmarks are showing significant performance jumps in math and science evals, and the speed is, as Nisten put it, "crazy usable." They have enabled the whopping 1M context window for the new Gemini Flash 2.0 Thinking Experimental (long-ass name, maybe let's call it G1?) and I agree, it's really, really good!

And unlike some other reasoning models cough OpenAI cough, Gemini Flash Thinking shows you its thinking process! You can actually see the chain of thought unfold, which is incredibly valuable for understanding and debugging. Google's Gemini is quietly becoming a serious contender in the reasoning race (especially with Noam Shazeer being responsible for it!)

OpenAI's "Operator" - Agents Are (Almost) Here

The moment we were all waiting for (or at least, I was): OpenAI finally unveiled Operator, their first foray into Level 3 autonomy, agentic capabilities with ChatGPT. Sam Altman himself hyped it up: "AI agents are AI systems that can do work for you. You give them a task and they go off and do it." Sounds amazing, right?

Operator is built on a new model called CUA (Computer Using Agent), trained on top of GPT-4o, and it's designed to control a web browser in the cloud, just like a human would, using screen pixels, mouse, and keyboard. "This is just using screenshots, no API, nothing, just working," one of the OpenAI presenters emphasized. They demoed Operator booking restaurant reservations on OpenTable, ordering groceries on Instacart, and even trying to buy Warriors tickets on StubHub (though that demo got a little… glitchy). The idea is that you can delegate tasks to Operator, and it'll go off and handle them in the background, notifying you when it needs input or when the task is complete.

As I'm writing these words, I have one Operator running trying to get me some fried rice, and another trying to book me a summer vacation with the kids, find some options, and tell me what it found. Benchmark-wise, OpenAI shared numbers for OSWorld (38.1%) and WebArena (58.1%), showing Operator outperforming previous SOTA but still lagging behind human performance. "Still a way to go," as they admitted. But the potential is massive.

The catch? Operator is initially launching in the US for Pro users only, and even then, it wasn't exactly smooth sailing. I immediately paid the $200/mo to try it out (pro mode didn't convince me, unlimited SORA videos didn't either, but SOTA agents from OpenAI? That I must try!) and my first test? Writing a tweet 😂 Here's a video of that first attempt, which I had to interrupt once. But hey, it's a "low key research preview," right? And as Sam Altman said, "This is really the beginning of this product. This is the beginning of our step into Agents Level 3 on our tiers of AGI." Agentic ChatGPT is coming, folks, even if it's taking a slightly bumpy route to get here.

BTW, while I'm writing these words, Operator is looking up some vacation options for me and sending me notifications about them. What a world, and we've only just started 2025!

Project Stargate: $500 Billion for AI Infrastructure

If R1 and Operator weren't enough to make your head spin, how about a $500 BILLION "Manhattan Project for AI infrastructure"? That's exactly what OpenAI, SoftBank, and Oracle announced this week: Project Stargate.

"This is insane," I exclaimed on the show. "Power-ups for the United States compared to other countries, a 500 billion commitment!" We're talking about a massive investment in data centers, power plants, and everything else needed to fuel the AI revolution. 2% of the US GDP, according to some estimates!

Larry Ellison even hinted at using this infrastructure for… curing cancer with personalized vaccines. Whether you buy into that or not, the scale of this project is mind-boggling. As LDJ explained, "It seems like it is very specifically for OpenAI. OpenAI will be in charge of operating it. And yeah, it sounds like a smart way to actually get funding and investment for infrastructure without actually having to give away OpenAI equity."

And in a somewhat related move, Microsoft, previously holding exclusive cloud access for OpenAI, has opened the door for OpenAI to potentially run on other clouds, with Microsoft's approval, if "they cannot meet demand".

Is AGI closer than we think? Sam Altman himself downplayed the hype, tweeting, "Twitter hype is out of control again. We're not going to deploy AGI next month, nor have we built it. We have some very cool stuff for you, but please chill and cut your expectations a hundred X." But then he drops Operator and a $500 billion infrastructure bomb in the same week and announces that o3-mini is going to be available for the FREE tier of ChatGPT. Sure, Sam, we're going to chill... yeah right.

This Week's Buzz at Weights & Biases: SWE-bench SOTA!

Time for our weekly dose of Weights & Biases awesomeness! This week, our very own CTO, Shawn Lewis, broke the SOTA on SWE-bench Verified! That's right, W&B Programmer, Shawn's agentic framework built on top of o1, achieved a 64.6% solve rate on this notoriously challenging coding benchmark.

Shawn detailed his journey in a blog post, highlighting the importance of iteration and evaluation, powered by Weights & Biases Weave, naturally. He ran over 1000 evaluations to reach this SOTA result! Talk about eating your own dogfood!

REMOVING BARRIERS TO AMERICAN LEADERSHIP IN ARTIFICIAL INTELLIGENCE - Executive order

Just now, as I'm editing the podcast, President Trump signed an executive order on AI into effect, and here are the highlights:
- Revokes existing AI policies that hinder American AI innovation
- Aims to solidify the US as the global leader in AI for human flourishing, competitiveness, and security
- Directs development of an AI Action Plan within 180 days
- Requires immediate review and revision of conflicting policies
- Directs OMB to revise relevant memos within 60 days
- Preserves agency authority and OMB budgetary functions
- Consistent with applicable law and funding availability
- Seeks to remove barriers and strengthen US AI dominance

This marks a significant pivot into AI acceleration: removing barriers, and acknowledging that AI is a huge piece of our upcoming future, that the US really needs to innovate here, become the global leader, and remove regulation and obstacles. The folks working on this behind the scenes, Sriram Krishnan (previously A16Z) and David Sacks, are starting to get into the government and implement those policies, so we're looking forward to what will come from that!

Vision & Video: Nvidia's Vanishing Eagle 2 & Hugging Face's Tiny VLM

In the world of vision and video, Nvidia teased us with Eagle 2, a series of frontier vision-language models promising 4K HD input, long-context video, and grounding capabilities with some VERY impressive evals. Weights were released, then… yanked. "NVIDIA released Eagle 2 and then yanked it back. So I don't know what's that about," I commented. Mysterious Nvidia strikes again.

On the brighter side, Hugging Face released SmolVLM, a truly tiny vision-language model, coming in at just 256 million and 500 million parameters. "This tiny model that runs in like one gigabyte of RAM, or on some crazy thing like a smart fridge," I exclaimed, impressed. The 256M model even outperforms their previous 80 billion parameter Idefics model from just 17 months ago. Progress marches on, even in tiny packages.

AI Art & Diffusion & 3D: Hunyuan 3D 2.0 is State of the Art

For the artists and 3D enthusiasts, Tencent's Hunyuan 3D 2.0 dropped this week, and it's looking seriously impressive. "Just look at this beauty," I said, showcasing a generated dragon skull. "Just look at this."

Hunyuan 3D 2.0 boasts two models: Hunyuan3D-DiT-v2-0 for shape generation and Hunyuan3D-Paint-v2-0 for coloring. Text-to-3D and image-to-3D workflows are both supported, and the results are, well, see for yourself. If you're looking to move beyond 2D images, Hunyuan 3D 2.0 is definitely worth checking out.

Tools: ByteDance Clones Cursor with Trae

And finally, in the "tools" department, ByteDance continues its open source blitzkrieg with Trae, a free Cursor competitor. "ByteDance drops Trae, which is a Cursor competitor, which is free for now," I announced on the show. Trae imports your Cursor configs, supports Claude 3.5 and GPT-4o, and offers a similar AI-powered code editing experience, complete with a chat interface and "builder" (composer) mode. The catch? Your code gets sent to a server in China. If you're okay with that, and can't afford Cursor, this is not a bad alternative. "If you're okay with your code getting shared with ByteDance, this is a good option for you," I summarized. Decisions, decisions.

Phew! That was a whirlwind tour through another insane week in AI. From DeepSeek R1's open source reasoning revolution to OpenAI's Operator going live, and Google's million-token Gemini brain, it's clear that the pace of innovation is showing no signs of slowing down.
Open source is booming, agents are inching closer to reality, and the big companies are throwing down massive infrastructure investments. We're accelerating as f**k, and it's only just beginning. Hold on to your butts.

Make sure to dive into the show notes below for all the links and details on everything we covered. And don't forget to give R1 a spin, and maybe try out that "reasoning_effort" hack. Just don't blame me if your AI starts having an existential crisis.

And as a final thought, channeling my inner Woody Allen-R1: "Don't overthink too much. Enjoy R1. Enjoy the incredible things we received this week from open source."

See you all next week for more ThursdAI madness! And hopefully, by then, Operator will actually be operating. 😉

TL;DR and show notes

* Open Source LLMs
  * DeepSeek R1 - MIT licensed SOTA open source reasoning model (HF, X)
  * ByteDance UI-TARS - PC control models (HF, Github)
  * HLE - Humanity's Last Exam benchmark (Website)
* Big CO LLMs + APIs
  * SoftBank, Oracle, OpenAI Stargate Project - $500B AI infrastructure (OpenAI Blog)
  * Google Gemini Flash Thinking 01-21 - 1M context, code execution, better evals (X)
  * OpenAI Operator - agentic browser in ChatGPT Pro operator.chatgpt.com
  * Anthropic launches citations in API (blog)
  * Perplexity SonarPRO Search API and an Android AI assistant (X)
* This week's Buzz 🐝
  * W&B broke SOTA on SWE-bench Verified (W&B Blog)
* Vision & Video
  * HuggingFace SmolVLM - tiny VLMs, runs even on WebGPU (HF)
* AI Art & Diffusion & 3D
  * Hunyuan 3D 2.0 - SOTA open-source 3D (HF)
* Tools
  * ByteDance Trae - Cursor competitor (Trae AI: https://trae.ai/)
* Show Notes:
  * Pietro Schirano's RAT - Retrieval Augmented Thinking (X)
  * Run DeepSeek with more "thinking" script (Gist)

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe
    --------  
    1:49:39
  • 📆 ThursdAI - Jan 16, 2025 - Hailuo 4M context LLM, SOTA TTS in browser, OpenHands interview & more AI news
Hey everyone, Alex here 👋 Welcome back to an absolute banger of a week in AI releases, highlighted by a massive open source AI push. We're talking a MASSIVE 4M context window model from Hailuo (remember when a jump from 4K to 16K seemed like a big deal?), an 8B omni model that lets you livestream video, and glimpses of agentic ChatGPT. This week's ThursdAI was jam-packed with so much open source goodness that the big companies were practically silent. But don't worry, we still managed to squeeze in some updates from OpenAI and Mistral, along with a fascinating new paper from Sakana AI on self-adaptive LLMs. Plus, we had the incredible Graham Neubig, from All Hands AI, join us to talk about Open Hands (formerly OpenDevin); he even contributed to our free LLM evaluation course on Weights & Biases!

Before we dive in, a friend asked me over dinner what the main 2 things that happened in AI in 2024 were, and this week highlights one of those trends: most of the open source is now from China. This week, we got MiniMax from Hailuo, OpenBMB with a new MiniCPM, InternLM came back, and most of the rest were Qwen finetunes. Not to mention DeepSeek. Wanted to highlight this significant narrative change, and that this is happening despite the chip export restrictions.

ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.

Open Source AI & LLMs

MiniMax-01: 4 Million Context, 456 Billion Parameters, and Lightning Attention

This came absolutely from left field, given that we've seen no prior LLMs from Hailuo, the company previously known for releasing video models with consistent characters. They dropped a massive 456B mixture-of-experts model (45B active parameters) with such long context support in open weights, and also with very significant benchmarks that compete with GPT-4o, Claude and DeepSeek v3 (75.7 MMLU-Pro, 89 IFEval, 54.4 GPQA).

They trained the model on up to a 1M context window and then extended it to 4M with RoPE scaling methods (our coverage of RoPE) during inference. MiniMax-Text-01 adopts a hybrid architecture that combines Lightning Attention, Softmax Attention and Mixture-of-Experts (MoE) with 45B active parameters.

I gotta say, when we started talking about context windows, imagining a needle-in-a-haystack graph that shows 4M in the open source seemed far-fetched, though we did say that theoretically, there may not be a limit to context windows. I just always expected that limit to be unlocked by transformer-alternative architectures like Mamba or other State Space Models.
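As a quick intuition for what "RoPE scaling during inference" means, here's a toy sketch of position interpolation, one common way to stretch context without retraining; the numbers are made up, and MiniMax's exact recipe is described in their report.

```python
import torch

def rope_angles(head_dim: int, positions: torch.Tensor, base: float = 10000.0, scale: float = 1.0):
    # Standard RoPE: each pair of channels rotates at its own frequency.
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    # Position interpolation: dividing positions by `scale` squeezes a longer
    # sequence back into the position range the model saw during training.
    angles = (positions[:, None] / scale) * inv_freq[None, :]
    return torch.cos(angles), torch.sin(angles)

# Toy example: a model trained on 1M positions asked to handle 4M tokens.
# Scaling positions down by 4x keeps every rotation angle in the trained range.
positions = torch.arange(16).float()   # tiny stand-in for a 4M-token sequence
cos, sin = rope_angles(head_dim=64, positions=positions, scale=4.0)
print(cos.shape)  # (16, 32): one angle per position per channel pair
```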
Vision, API and Browsing - Minimax-VL-01

It feels like such a well-rounded and complete release that it highlights just how mature the company behind it is. They also released a vision version of this model, which adds a 300M param Vision Transformer on top (trained with 512B vision-language tokens), features dynamic resolution, and boasts very high DocVQA and ChartQA scores.

Not only were these two models released in open weights, they also launched as a unified API endpoint (supporting up to 1M tokens), and it's cheap! $0.2/1M input and $1.1/1M output tokens! AFAIK this is only the 3rd API that supports this much context, after Gemini at 2M and Qwen Turbo, which supports 1M as well.

Surprising web browsing capabilities

You can play around with the model on their website, hailuo.ai, which also includes web grounding. I found it quite surprising that they are beating ChatGPT and Perplexity on how fast they can find information that just happened that same day! Not sure what search API they are using under the hood, but they are very quick.

8B chat-with-video omni-model from OpenBMB

OpenBMB has been around for a while, and we've seen consistently great updates from them on the MiniCPM front, but this one takes the cake! This is a complete omni-modal, end-to-end model that does video streaming, audio-to-audio, and text understanding, all in a model that can run on an iPad!

They have a demo interface that is very similar to the ChatGPT demo from spring of last year and allows you to stream your webcam and talk to the model, but this is just an 8B parameter model we're talking about! It's bonkers! They are boasting some incredible numbers, and to be honest, I highly doubt their methodology on textual understanding, because, based on my experience alone, this model's understanding is nowhere near ChatGPT advanced voice mode. But MiniCPM has been doing great visual understanding for a while, so ChartQA and DocVQA are close to SOTA.

But all of this doesn't matter, because, I say again, just a little over a year ago, Google released a video announcing these capabilities, having an AI react to a video in real time, and it absolutely blew everyone away, and it was FAKED. And this time, a year after, we have these capabilities, essentially, in an 8B model that runs on device 🤯

Voice & Audio

This week seems to be very multimodal: not only did we get an omni-modal from OpenBMB that can speak, and last week's Kokoro still makes a lot of waves, but this week there were a lot of voice updates as well.

Kokoro.js - run the SOTA open TTS now in your browser

Thanks to friend of the pod Xenova (and the fact that Kokoro was released with ONNX weights), we now have kokoro.js, or npm -i kokoro-js if you will. This allows you to install and run Kokoro, the best tiny TTS model, completely within your browser, with a tiny 90MB download, and it sounds really good (demo here).

Hailuo T2A - Emotional text to speech + API

Hailuo didn't rest on their laurels of releasing a huge-context-window LLM; they also released a new voice framework (though not open sourced) this week, and it sounds remarkably good (competing with 11labs).

They have all the standard features like voice cloning, but claim to have a way to preserve the emotional undertones of a voice. They also have 300 voices to choose from and professional effects applied on the fly, like acoustics or telephone filters. (Remember, they have a video model as well, so presumably some of this is for holistic video production.)

What I specifically noticed is their "emotional intelligence system," which is either automatic or can be selected from a dropdown. I also noticed their "lax" copyright restrictions, as one of the voices, called "Imposing Queen," sounded just like a certain blonde-haired heiress to the iron throne from a certain HBO series.
When I generated a speech worthy of that queen, I noticed that the emotion in it sounded very much like an actress would read it, unlike any old TTS. Just listen to it in the clip above; I don't remember getting TTS outputs with this much emotion from anything, maybe outside of advanced voice mode! Quite impressive!

This Week's Buzz from Weights & Biases - AGENTS!

Breaking news from W&B, as our CTO just broke the SWE-bench Verified SOTA with his own o1 agentic framework he calls W&B Programmer 😮 at 64.6% of the issues!

Shawn describes how he achieved this massive breakthrough here, and we'll be publishing more on this soon, but the highlight for me is that he ran over 900 evaluations during the course of this and tracked all of them in Weave!

We also have an upcoming event in NY on Jan 22nd. If you're there, come by and learn how to evaluate your AI agents and RAG applications, and hang out with our team! (Sign up here)

Big Companies & APIs

OpenAI adds ChatGPT tasks - first agentic feature with more to come!

We finally get a glimpse of an agentic ChatGPT, in the form of scheduled tasks! Deployed to all users, it is now possible to select "GPT-4o with tasks" and schedule tasks in the future. You can schedule them in natural language; ChatGPT will then execute a chat (and maybe perform a search or do a calculation) and send you a notification (and an email!) when the task is done.

A bit underwhelming at first, as I didn't really find a good use for this yet, but I don't doubt that this is just a building block for something more agentic to come, something that can connect to my email or calendar and do actual tasks for me, not just... save me from typing the ChatGPT query at "that time".

Mistral Codestral 25.01 - a new #1 coding assistant model

An updated Codestral was released at the beginning of the week, and TBH I've never seen the vibes split this fast on a model. While it's super exciting that Mistral is placing a coding model at #1 on the LMArena Copilot arena, near Claude 3.5 and DeepSeek, the fact that this new model was released without weights is a real bummer (especially as a callback to the paragraph I mentioned up top).

We seem to be closing down on open source in the West, while the Chinese labs are absolutely crushing it (while also releasing in the open, including weights and technical papers). Mistral has released this model via API and a collab with the Continue.dev coding agent, but they used to be the darling of the open source community by releasing great models!

Also notable: a new benchmark run dropped very quickly post-release, showing a significant gap between their reported benchmarks and how the model performs on Aider polyglot.

There were way more things this week than we were able to cover, including a new and exciting Transformer² architecture from Sakana, a new open source TTS with voice cloning, and a few other open source LLMs, one of which cost only $450 to train! All the links are in the TL;DR below!
TL;DR and show notes

* Open Source LLMs
  * MiniMax-01 from Hailuo - 4M context 456B (45B A) LLM (Github, HF, Blog, Report)
  * Jina - Reader V2 model - HTML 2 Markdown/JSON (HF)
  * InternLM3-8B-Instruct - Apache 2 license (Github, HF)
  * OpenBMB - MiniCPM-o 2.6 - multimodal live streaming on your phone (HF, Github, Demo)
  * KyutAI - Helium-1 2B - Base (X, HF)
  * Dria-Agent-α - 3B model that outputs python code (HF)
  * Sky-T1 - a 'reasoning' AI model that can be trained for less than $450 (blog)
* Big CO LLMs + APIs
  * OpenAI launches ChatGPT tasks (X)
  * Mistral - new Codestral 25.01 (Blog, no weights)
  * Sakana AI - Transformer²: Self-Adaptive LLMs (Blog)
* This week's Buzz
  * Evaluating RAG Applications Workshop - NY, Jan 22, W&B and Pinecone (Free Signup)
  * Our evaluations course is going very strong! (chat w/ Graham Neubig) (https://wandb.me/evals-t)
* Vision & Video
  * Luma releases Ray2 video model (Web)
* Voice & Audio
  * Hailuo T2A-01-HD - emotional audio model from Hailuo (X, Try It)
  * OuteTTS 0.3 - 1B & 500M - zero-shot voice cloning model (HF)
  * Kokoro.js - 80M SOTA TTS in your browser! (X, Github, try it)
* AI Art & Diffusion & 3D
  * Black Forest Labs - finetuning for Flux Pro and Ultra via API (Blog)
* Show Notes and other Links
  * Hosts - Alex Volkov (@altryne), Wolfram RavenWlf (@WolframRvnwlf), Nisten Tahiraj (@nisten)
  * Guest - Graham Neubig (@gneubig) from All Hands AI (@allhands_ai)
  * Graham's mentioned agents blog post: 8 things that agents can do right now
  * Projects - Open Hands (previously Open Devin) - Github
  * Germany meetup in Cologne (here)
  * Toronto Tinkerer Meetup *Sold OUT* (here)
  * YaRN conversation we had with the authors (coverage)

See you folks next week! Have a great long weekend if you're in the US 🫡 Please help to promote the podcast and newsletter by sharing with a friend! This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe
    --------  
    1:40:32
