
Interconnects

Nathan Lambert

139 episodes

  • Interconnects

    Dean Ball on open models and government control

    2026/03/06 | 35 mins.
    Watching history unfold between Anthropic and the Department of War (DoW), it has been obvious to me that this could be a major turning point in perspectives on open models, but one that’ll take years to be obvious. As AI becomes more powerful, governments and other power structures will grapple with their roles relative to the companies building it. Some in open models frame this as “not your weights, not your brain,” but it points to a much bigger problem once governments realize this.
    If AI is the most powerful technology, why would any global entity let a single U.S. company (or government) control their relationship to it?
    I got Dean W. Ball of the great Hyperdimensional newsletter onto the SAIL Media weekly Substack live to discuss this. In the end, we agree that the recent actions by the DoW — especially the designation of Anthropic as a supply chain risk (which Dean and I both vehemently disagree with) — point to open models being the stable 5-10 year equilibrium for power centers.
    The main threads of this discussion:
    * Why do open models avoid some of the power struggles we’ve seen play out last week?
    * How do we bridge short term headwinds for open models towards long-term strength?
    * The general balance of capabilities between open and closed models.
    Personally, I feel the need to build open models more than ever and am happy to see more constituencies wake up to it. What I don’t know is how to fund and organize that. Commoditizing one’s complements is a valid strategy, but it starts to break down when AI models cost closer to a trillion dollars than a hundred million. With open models being very hard to monetize, there’s a bumpy road ahead in figuring out who builds these models in the face of real business growth elsewhere in the AI stack.
    Enjoy and please share any feedback you have on this tricky topic!
    Listen on Apple Podcasts, Spotify, and wherever you get your podcasts. For other Interconnects interviews, go here.
    Chapters
    * 00:00 Intro: is the Anthropic supply chain risk good or bad for open models?
    * 04:03 Funding open models and the widening frontier gap
    * 12:33 Sovereign AI and global demand for alternatives
    * 20:55 Open model ecosystem: Qwen, usability, and short-term outlook
    * 28:20 Government power, nationalization risk, and financializing compute
    Transcript
    00:00:00 Nathan Lambert: Okay. We are live and people will start joining. I’m very happy to catch up with Dean. As we were setting this up, the news broke that the official supply chain risk designation was filed. This is not a live reaction to that. If we get any really, really interesting news, we’ll talk about it. I think one of the undercurrents of this week, where everything happened, touches on open models, but there’s not an obvious angle. I’ll frame this to Dean to start. There’s two sides of open models. One is the kind of cliche, “not your weights, not your brain,” where somebody could take it away if it’s not an open model, which people are boosting like, “Oh, Anthropic’s gonna take away their intelligence.” But the other side is people worried about open models existing that the Department of War can just take and use for any purpose it wants. And I feel like both of these are a little cliche. The core question is: is this type of event, where more control and more multi-party interest is coming toward AI, gonna be good or bad for the open weight model ecosystem?
    00:01:12 Dean Ball: My guess is that in the long run, this is probably profoundly good for open weight AI. And like the whole reason I got in, like, so I became interested in frontier AI governance. I did something totally different with my time before. I wrote about different kinds of policy and studied different kinds of policy. And the reason I got into this was because it immediately occurred to me that the government was gonna... I was like, okay, let’s assume we’re building super intelligence soon or whatever, like very advanced AI that seems like really important and powerful. That’s gonna be something that I depend on, like for my day-to-day life. I’m gonna need it for all kinds of things. It’s gonna profoundly implicate my freedom of expression as an American and my exercise of my liberty and all that. And yet it’s also gonna profoundly implicate national security. And so the government’s gonna have its hands all over it, and they also might not like me using it because I might use it, and others might use it to challenge the status quo in various ways, to challenge the existing power structures which the government is a part of. So we have a political problem on our hands here, in my view.
    00:02:36 Dean Ball: It immediately occurred to me that we’re gonna have this huge problem: this is gonna be a conflict because this is something that’s gonna enormously implicate American speech and liberty, and also it’s gonna have legitimate national security issues, and also the government’s gonna want it because of bad power-seeking reasons. And so that’s always a part of the picture. And my view was this is just a fight that’s gonna play out over the coming decades, and I wanna be a part of this fight. But number two, in that fight, you have to have an insurance policy, and open weight is the insurance policy. Open weight is the way we can always say, yes, we can build the open ecosystem. We can do that. And so I think in the fullness of time, this is gonna be beneficial, but the problem is there’s a lot of coordination and economic problems that have to be solved here. It’s not just a matter of hoping that Google and Meta or whomever else, or the Chinese companies, out of the goodness of their hearts continue to open-source things. That’s not scalable. There has to be a reason to do it. So what are the institutional dynamics of open weight gonna look like in the long term? I don’t really know, but it feels deeply under-theorized.
    00:04:03 Nathan Lambert: I think the thing is that it’s hard to fund. I mean, we saw Qwen had their turmoil this week, which is timely, and I’m not that surprised because the stakes for these companies are so high, and they’re all trying to make sure their companies win. And people will say, “Oh, Meta should commoditize their complements and release open models.” But no one’s ever commoditized their complements with something that costs a trillion dollars to make. That’s a line item. Is Apple gonna commoditize... Apple commoditizing their complement would be them spending just as much on CapEx as all the other tech companies, hundreds of billions of dollars, but they’re choosing not to. And I agree that long term it should be better, but if we never bridge that gap, does it actually materialize? The crank of these models getting better and better is still being turned. GPT 5.4 released today, excited to try it.
    00:05:02 Nathan Lambert: But like, where does it go? Like, what I’m working on is totally falling behind the frontier. We’re the foundation of research, but it’s like I see it already slipping.
    00:05:13 Dean Ball: So I kinda think, yeah, I mean, look, I think it’s gonna get bad in the short term, it’s gonna be bleak, right? There’s just no doubt about that in my view. Because we’re in this period, like I think the pace of frontier progress is gonna continue. My own view is that, like, just ‘cause I peer in and use the open weight Chinese models on a fairly regular basis, and I kinda just feel as though the gap has widened between the US frontier and the open frontier. Unfortunately, it’s so sad that US frontier and open frontier are increasingly distinct things. But I do feel as though that probably is true. And that’s probably gonna continue because in the next, like, in the early stages of a new technology, you would expect for the vertically integrated players to be the ones who do the best. And over time, the modular players can win, and part of that is ‘cause eventually you do get to good enough, right? Like, eventually, I think most people think the iPhone is good enough now. There was a time when every year the iPhone upgrade was like, “Oh my God, this is so much better.” Intelligence is maybe different, but maybe not for a lot of things.
    00:06:37 Nathan Lambert: Well, like, there’s no iPhone that you can buy from anyone. Nothing you can buy from anyone but Apple is nearly as good. That’s the concern. It’s like, is it gonna be Anthropic that like, yeah, it stopped getting better, but you can’t rebuild it. Like, you can’t make the open source version.
    00:06:51 Nathan Lambert: I also think I had a later question, which is like, the weights are so much less of a concern for me. So like, somebody dropping a two-trillion-parameter model that’s open weights and way better than anything else that somebody has built and released in the open, it almost doesn’t matter if you don’t understand the harness and the tools and the setup you need to make it into a Claude-like system. Like, you need what, eighty nodes of H100s that cost a hundred thousand dollars a day to run and expertise to make it a system. It’s like the shifting away from weights is also happening. I don’t think it’s happening in this open versus closed ecosystem at the surface level of the discussion. So that’s why I’m just like, I don’t know if it’s gonna exist. The thing that I could see happening is that open weights models are niche, and they help these Claude-like models, but there’s not an alternative in that universe. So it’s like, is the government capable of actually making this alternative exist? I don’t know. Like, I don’t know if you can Manhattan Project this, and I wouldn’t advocate for it.
    00:07:53 Dean Ball: I actually think about it from the opposite perspective, because I think that what happens if the government follows through on what they’ve threatened with Anthropic, which is to make it so that basically any military contractor cannot have any commercial relations with Anthropic, which means NVIDIA can’t sell GPUs to them for anything. Amazon can’t sell cloud services to them. Amazon and NVIDIA also can’t be invested in them, by the way, if you take any commercial relations at its face value. Now, that’s not a power the government actually has, but nonetheless, if this harassment campaign continues, I think what it probably does... You know, I spend a lot of time in international policy, dealing, talking to foreign governments and civil society in foreign countries, and they already have major trust issues with respect to the US closed source models because they think the US government is gonna come in and disable the models. Like, the American president will get mad at Brazil, say, and in addition to putting tariffs or sanctions, the US president will say, “Yeah, we’re also gonna turn off all your public services that are dependent upon American closed source models.” Right? So people view that as this profound threat, and people are legitimately scared of that in other countries.
    00:10:00 Dean Ball: I think this turns that fear up another meaningful degree, and probably not incorrectly, by the way, probably rightfully so. And so I kinda look at this and I think, well, now a lot of American companies might also have that concern, and so you certainly have a demand side of people who are gonna be like, “I get this. It is a risk to use anything where I have a commercial relationship. ‘Cause once I have a commercial relationship, the government can regulate that. Can I find some way of getting out of it?” I think there’s gonna be demand for that. Whether or not that demand produces supply, I think will depend on... It might just not be possible, that’s true. But I think you’ve never had a more favorable demand picture, and I suspect that on the margin, this probably will favor open in the longer run.
    00:10:44 Nathan Lambert: Yeah. So there’s a few ways that I think about this. I have this thing, like the ATOM Project and all this other stuff I do, and it’s like, how do I meaningfully advocate for this? I work at AI2, and AI2 has budgets on the order of a hundred million dollars and can train decent models. But if I wanted to redo an AI2, my method for getting that type of money is mostly gonna be befriending a billionaire. And it seems like a philanthropy dice roll in the near term is a way to get it. But then maybe it really is some long slog of a multi-industry consortium that takes a couple years to get off the ground, and slowly all these Netflixes and five hundred billion dollar smaller companies are gonna give millions of dollars to have somebody else do it, because they can’t get the billion dollars themselves, but they know they need it to exist.
    00:11:31 Dean Ball: And sovereign wealth funds. Right. Sovereign wealth funds everywhere can do that, right? There’s trillions of dollars in sovereign wealth. There’s pension funds, public employee pension funds. A lot of people can chip into this and it’s possible. This is like, Yann LeCun thinks this is the inevitable outcome. He thinks that the future is gonna be that some sort of global consortium gets together and builds this, because no one country is gonna be able to own it, because it’s gonna be too important. I’ve always kinda doubted that, and I’ve always thought that that outcome is probably a bad outcome for the world, honestly.
    00:12:06 Nathan Lambert: That’s a bad outcome for how good the AI is.
    00:12:09 Dean Ball: That’s correct. It’s a socialist outcome, you know? It’s not communism, but it is democratic socialism, and I’m not a democratic socialist, so I’m not a super big fan of that. But at the same time, I have to be honest that I kinda think that this probably does increase the odds of that precise outcome coming to bear.
    00:12:33 Nathan Lambert: I think something that comes sooner is that a lot of these super wealthy countries are gonna realize they can have real... Like, they can do some sort of sovereign AI and make some sort of noise, particularly starting with open models. I think there’s the Institute for Foundation Models, which is based on the UAE university system. Like, that’s--
    00:12:53 Dean Ball: That’s very UAE-coded, yeah.
    00:12:55 Nathan Lambert: They’ve been playing that for years, and they can keep doing this. Their models are gonna be pretty good, and I think there’s gonna be more people that do this. There’s the SWISS initiative in EU, which is on one hand doing a good job, on the other hand plagued by the most obvious European limitations of talent cycling and consortium life. I think these things are gonna become more of a thing in the next year, but I don’t know exactly how they impact the... They don’t impact the frontier of AI, but maybe they’re just like how the geopolitics and power of AI evolves. And I for some reason feel like open models need to be the thing that they’re gonna do because if they have a closed model that’s not as good, it doesn’t really give them any sort of power. But I don’t have a good enough world view for what that actually does, and if there’s more EU models, if India actually has their act together and trains a solid model. I don’t know what that does, but I feel like it’s probably gonna happen.
    00:13:54 Dean Ball: Yeah. I mean, it’s really super interesting ‘cause I think the other thing-- that will be inherently... I mean, it will be a Linux compared to a macOS, you know? It will not be as good of an experience for people. But then it becomes strange. Like, I don’t think macOS is as appealing of a thing if it’s viewed to be owned by the US government, right? And in fact, part of the reason I think that Apple is able to make its case quite credibly to consumers and businesses is they have resisted US government pressure to turn things over before. People might remember about a decade ago, there was this shooter in San Bernardino, California, and the FBI tried to force Apple to release iPhone data, and Apple said, “No, we’re not gonna expose this information.” Now, I think the FBI eventually just hacked it anyway, but that’s a separate issue. It’s a matter of principle here.
    00:15:01 Dean Ball: So yeah, I think it’s an interesting question: do we expect for the gap between the open frontier and the American closed frontier to widen in the near future, especially just because of how much compute they’re gonna have?
    00:15:30 Nathan Lambert: A hundred percent. And data and talent. Like, a hundred percent. It’s happening.
    00:15:34 Dean Ball: Data, talent. And it’s compounding, right? I mean, this has always been my view. And how much, I’m not sure, but I think it could be quite significant because these things are compounding benefits. And so if you expect them to just continue compounding, then all of a sudden it gets pretty bleak pretty quickly, would be my fear.
    00:16:00 Nathan Lambert: One of the... I mean, what’s your take on this? Why has it not compounded so much faster? I feel like these three companies are spending, I don’t know, 10X what the Chinese labs are spending, and you only get a little bit better model. I believed so wholeheartedly that Claude and ChatGPT and all these models are much better, and I expect them to become better by an increasing margin, but it’s still confusing why they’re not already more ahead.
    00:16:29 Dean Ball: I go back and forth on this. Sometimes I think they are that ahead, and it’s just difficult to show up in benchmarks for the obvious reasons that benchmarks get chased. And like, I do feel that with the coding agents and with certain use cases, I do just feel like, wow, the American frontier is just way ahead, profoundly ahead of the Chinese frontier there. But there’s a lot of other things where you do kinda saturate how good you can be. I suspect that a very large fraction of AI usage is essentially glorified Google search. Even though I don’t think AI is glorified Google search, I suspect that a lot of what people use it for is that, at the consumer level. And it isn’t obvious to me how much better you can get at things like that. But my guess would be that over the next five years, I would guess the American labs really take off, in part because of compute, data, internal deployments for recursive self-improvement style stuff. And also, it’s amazing how we talk about that as just a normal thing now.
    00:18:05 Nathan Lambert: I think there will be a ceiling on it. They’re gonna get a ton of improvement-- The gains are insane. Personally, at my job, I’ve mostly been a research manager, just chasing s**t down to get a model out the door. But now I can take on hard engineering tasks because I’m like, “Okay, might as well do this at the same time.” Going from zero to a hundred software engineers at anyone’s fingertips is worth a lot in terms of exploration. But the next step, from a hundred to ten thousand, is the kind of thing people can mess up. Still, that’s a huge gain.
    00:18:37 Dean Ball: I kind of agree. I think there’ll be a sigmoid there too. But then the other thing that will happen is, like, what I sort of wonder is will the AI companies, will the current model vendors, will they eventually become more like true infrastructure companies where what they actually do is they have models that design their own chips and models that design their own data centers and models that design their own successors. And so it’s this hugely vertically integrated thing, and what you’re really getting access to is not just the model itself, but you’re getting access to this highly optimized hardware, physical world infrastructure. And again, that’s kind of already the case, but does that become even more the case? And then that’s truly insurmountable for any open player. That’s definitionally insurmountable for an open player, and that becomes scary too. But again, this is why I’ve always felt so good about the position of the US closed source labs. This is why I’ve always been pretty bullish on them and have my concerns about open.
    00:20:07 Dean Ball: But to the extent the US government makes it impossible to trust closed source models, you do provide an advantage to open there. You’re giving a shot in the arm. If you like open source, you should hope that the supply chain risk designation against Anthropic is quite broad.
    00:20:09 Nathan Lambert: It’s a rough thing to hope for.
    00:20:09 Dean Ball: I mean, you shouldn’t actually hope for it, but I just mean, like, if that’s the only thing you care about in the world is open source, then--
    00:20:17 Nathan Lambert: I would say that anyone who only cares about open source probably is not thinking through any of these principles. It just gets really bad if you only have-- AI is not gonna be a meaningful lift to the economy, nor sustainable, if everything is open. If models are truly commoditized, things look kind of rough out there.
    00:20:36 Dean Ball: I think a world where models get commoditized is a really bleak world too, actually. And yeah, this is why I’m very worried about what the US government is doing. But I think that it helps on the margin, though. It probably helps on the margin in terms of waking people up. That still is my view.
    00:20:55 Nathan Lambert: I am a little surprised by the Qwen stuff, but I think there’s-- At some point, I knew there was gonna be a year where a lot of the open model efforts just died because they’re too expensive and too similar. But at the same time, having a lot of efforts that are somewhat similar but exploring a lot of the minor permutations in modeling space, to figure out what works for people who use open models, is actually quite good. I’m very bearish on the Reflection-style approach, which is: build a lab, build an incredible model, drop it, make bank selling it on-prem. Because on-prem as a business model is not that distinct from having a closed model. You could sell a closed model on-prem with the right IP controls. But the person who actually wins open is the one trying a whole bunch of tiny different things, understanding what is actually a meaningful differentiator in private data, in certain deployments and whatever, and then really iterating on that with a community. And that’s why I was like, Qwen is the closest to doing this by being so close to the community, and it’s so distinct from what a lot of the other labs are betting on.
    00:22:05 Nathan Lambert: But I see the pressure going away and kind of reducing diversity onto standards, because standards also make inference more efficient. Using open models is really rough. I think some of the best open models have really had rough launches. I think GPT-OSS had a horrible launch in terms of usability and is now one of the most popular models of all time. Qwen 3.5, it’s like researchers I work with are like, “Oh, let’s see if we can do some basic RL baselines on it,” and all the software stack is kinda broken. It takes a few weeks to get it going. And this is ‘cause all the models change differently, and closed labs just have such an advantage there ‘cause they should conceivably ship things on day one that work. I mean, don’t talk about Claude’s runtime, but that’s fine.
    00:22:42 Dean Ball: And don’t talk about the GPT-5 auto router either. But yeah, no, totally. I think that’s right.
    00:22:53 Dean Ball: I think, in the fullness of time, I’m bullish on open source in the long run, fairly bearish in the next five years. The next five years are gonna matter quite a bit. And there is a lot of cope in both the open source world and also... I don’t really hear it so much in the open source world. I think the open source world is actually more honest about this. But where the cope is so bad is in global civil society discourse. I was in India for the AI Impact Summit recently, and they are just smoking the copium, being like, “We are gonna do everything on subfrontier open source models, and we’re just gonna diffuse those, and that’s all we’re gonna need in our economy.” And if you’re India, I just think that’s really not the bet you wanna make. I understand these are resource-constrained countries. They have a lot of acute constraints that they face, but nonetheless, I think that’s probably not a good bet.
    00:24:05 Nathan Lambert: Well, even if those long-tail models work, it’ll be like how manufacturing has worked, where Apple has put hundreds of billions of dollars into the manufacturing ecosystem in China to get absolutely fine margins and scale. These things are gonna be used so much that that fine margin is actually gonna matter a lot, and it is not cheap to get that fine margin. You can’t just YOLO a DeepSeek V3 and spend five million dollars in compute and be done. It’s still gonna be expensive for a long time.
    00:24:34 Dean Ball: Yeah, it requires-- I think the Chinese approach, in the long run, if China’s gonna continue its strategy and they want to be competitive with the American frontier, they’re gonna have to fully socialize that, I think. I don’t think DeepSeek alone is gonna be able to do this, and I don’t think even Alibaba alone is gonna be able to do this. I think they’re going to need some sort of collective effort. Especially because of the export controls, the American export controls. They’re gonna have to centralize compute. They’re gonna have to centralize all these things, and talent and data and all that.
    00:25:17 Nathan Lambert: I don’t see it happening. Like, maybe someone gets officially AGI pilled, and I don’t know that much about China. But the things I know about China, it seems like that would be a big lift, and it would take a lot of time to actually do it. Like, all the companies would have to give up their biggest... All the cloud companies are like tech companies making a lot of money. They would be like, “We have to give up what?”
    00:25:42 Dean Ball: No, it would be a tough sell. Obviously, if the Chinese government decides they want to do it, they absolutely will. But in total, it will be a tough sell. My experience having had diplomatic engagements of many sorts with Chinese government-- and a lot of Chinese tech policy is actually not directly set by the government. It’s actually more kind of civil society, academia and civil society adjacent to government. Had a lot of conversations with folks like that, and they’re definitely... It’s largely not a very AGI-pilled crew. I think AGI-pilled-ness probably has a rough correlation with GDP per capita, and I think China is about where you would expect based on their GDP per capita, maybe a little bit ahead, but not very so. But if they ever do get AGI pilled, that’s the kind of thing that they could consider, but then that’s still a pretty extraordinary outcome because the Chinese government would have to be willing to make these things and then give it away. And I kinda just don’t think they will.
    00:27:11 Nathan Lambert: Yeah. I mean, all the politics of control with how everybody thinks AI is so powerful are pointing to very value-destructive actions economically in order to achieve the end state that people determine to be right. It’s like supporting open source to the extent that you can to avoid situations like Anthropic being labeled a supply chain risk and having interactions like that totally decimating runway of AI productivity. Like, if the companies are really gonna commit to open source for other things, then they’re gonna lose money. And I see this in-- China’s economy would be taking a gigantic hit doing this. And that’s kind of a common theme of what we’re talking about is that the interface of AI in an economic fashion is gonna make the next few years really weird.
    00:28:06 Dean Ball: I hope so.
    00:28:09 Nathan Lambert: I think things are gonna be weird, but I haven’t spent a ton of time thinking about how that interacts with political institutions. I thought about socially weird a lot, but I haven’t thought about power weird a lot.
    00:28:20 Dean Ball: Oh, power weird is what I worry about all the time. What I worry about the most is I think it’s plausible that what we’re seeing... I’ve always had this concern. I have this dual problem of-- maybe I’m talking out of both sides of my mouth. Maybe that’s the critique, and it’s a fair critique. But I routinely complain about how people in government aren’t really... They pretend to take AI seriously, but they don’t take it that seriously. And they don’t really own the implications of near-term advanced AI and all that. I think we basically have transformative AI right now, but they don’t own that, because it’s annoying, it’s difficult, it’s conceptually challenging.
    00:29:08 Dean Ball: But the flip side of that is that if people do start to take it very seriously, there’s the risk that they sort of lash out, that they get scared, and they lash out and do things that are rash, in a rush. And that actually creates very, very bad, much worse outcomes than you otherwise might have gotten. I think that’s a very fair risk, and I think it’s possible that you might see things like that happen within the U.S. I don’t think this particular incident with Anthropic is quite an example of that. But it’s possible that you do see that in the coming years, and that is in and of itself a pretty scary outcome because if the U.S. government decides that they want to nationalize the frontier labs, I think it could be one of the most tyrannical things we ever see happen in this country.
    00:30:16 Nathan Lambert: Yeah. It’s like, I don’t know how to reply to this. I think things are... It’s serious times and I see so many... It feels like such a Sisyphean task to make more open models exist, but all the broader trends seem to point to that being a more stable equilibrium in a lot of ways. Like, good enough open models and keeping up with what we all feel happening in the closed model land.
    00:30:50 Nathan Lambert: So I don’t know. I stay motivated, but I feel increasingly lost in terms of achieving it.
    00:30:56 Dean Ball: I don’t think you should be. I think, look, I suspect the US government will not actually do it, and the best thing about America is that our general sort of-- I don’t wanna say incompetence, but the general sort of chaos of American institutions and decentralized confusingness of it all, it can often be quite frustrating, and it can sometimes be a detriment, but it can also be really great because we tend to not execute and follow through on our very worst ideas. And so I don’t think we’re going to do that. It doesn’t feel very American to do it. I worry about it because I worry about these rash reactions, and that’s why I fight as heavily as I do on things like this, despite not insignificant cost to me to do it, politically speaking. But that’s totally worth it because I care about this. I think everything, I think that will probably be fine. But yeah, I do agree. It’s a major risk. It’s a major risk, and it’s a weird world to think about, I’ll tell you that much.
    00:32:16 Nathan Lambert: Yeah. I don’t have a lot more to add. I’m sure we’ll continue this discussion. I think it warrants the space of it ‘cause that’s the... It’s one of the longer term things, but it’s not in the news cycle whatsoever, at least the open model angle. There’s just so many layers. People have to talk. Like, send feedback, people listening. I’ll even send this out as a podcast as well and just like, what do people think? How do we get to the places we want to get to?
    00:32:46 Dean Ball: Well, one thing I’m particularly interested in is one of the items in the Trump administration action plan, which I worked on, for those who don’t have that context. It’s this idea of financializing compute: creating a financial market, basically a commodities market for compute. In the same way that you can buy electricity futures and electricity on the spot market at wholesale, could you do something like that for compute? That could really profoundly change the dynamics and the economics of AI production. It’s not gonna turn them over, it doesn’t flip them on their head, but it changes it quite meaningfully. And I’m very excited by that prospect.
    00:33:48 Dean Ball: And that’s the kind of thing that I would be increasingly doing if this sort of interference of government into the frontier continues. What I suspect I’ll do is start developing some of those ideas which I developed earlier. I’m only one person. If those things start to seem relevant again, I totally will. Because anything to make it easier to produce AI for people that don’t have trillions of dollars will be extremely important.
    00:34:38 Nathan Lambert: Yeah. I think that... I don’t know. I’m happy to leave it there.
    00:34:43 Dean Ball: Cool.
    00:34:45 Nathan Lambert: I can let you get on your trip. It’s good to catch up. I’m early in the process of potentially coming to DC in a few months, so I will let you know if I do.
    00:34:52 Dean Ball: Oh, please do. It’d be great to see you. We can record an episode of my podcast live.
    00:34:58 Nathan Lambert: Sounds good. Okay. Thanks everybody for listening.
    00:35:03 Dean Ball: Talk to y’all later. Bye.


    This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe

    Olmo Hybrid and future LLM architectures

    2026/03/05 | 11 mins.
    So-called hybrid architectures are far from new in open-weight models these days. We now have the recent Qwen 3.5 (previewed by Qwen3-Next), Kimi Linear last fall (a smaller release than their flagship Kimi K2 models), Nvidia’s Nemotron 3 Nano (with the bigger models expected to drop soon), IBM Granite 4, and other less notable models. This is one of those times when a research trend looks like it’s getting adopted everywhere at once (maybe the Muon optimizer too, soon?).
    To tell this story, we need to go back a few years to December 2023, when Mamba and Striped Hyena were taking the world by storm — asking the question: Do we need full attention in our models? These early models fizzled out, partially for the same reasons they’re hard today — tricky implementations, open-source tool problems, more headaches in training — but also because the models fell over a bit when scaled up. The hybrid models of the day weren’t quite good enough yet.
    These models are called hybrid because they mix these new recurrent neural network (RNN) modules with the traditional attention that made the transformer famous. They all work best with this mix of modules. The RNN layers keep part of the computation compressed in a hidden state to be used for the next token in the prediction — a summary of all information that came before — an idea with an extremely long historical lineage in deep learning, e.g. back to the LSTM. This setup avoids the quadratic compute cost of attention (i.e. it avoids the KV cache of the attention operator, which grows by one entry per token), and can even assist in solving new problems.
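    The memory trade-off described above can be sketched in a few lines. The snippet below is plain (unnormalized) linear attention, a deliberate simplification: real GDN layers add gating and delta-rule state updates, and all names and shapes here are illustrative rather than Olmo Hybrid's actual implementation.

```python
import numpy as np

def recurrent_linear_attention(q, k, v):
    """Process tokens one at a time with a fixed-size state matrix.

    q, k, v: (T, d) arrays. The state S stays (d, d) no matter how long
    the sequence gets, unlike an attention KV cache that grows by one
    entry per token.
    """
    T, d = q.shape
    S = np.zeros((d, d))  # compressed summary of everything seen so far
    out = np.zeros((T, d))
    for t in range(T):
        S = S + np.outer(k[t], v[t])  # fold token t into the hidden state
        out[t] = S.T @ q[t]           # read the state with the current query
    return out

def causal_linear_attention(q, k, v):
    """Mathematically equivalent parallel form: a causal mask and no
    softmax, materializing the full (T, T) score matrix like attention."""
    scores = q @ k.T
    mask = np.tril(np.ones_like(scores))
    return (scores * mask) @ v
```

    The two forms compute identical outputs; the recurrent one simply never stores more than a d x d state, which is why these layers are attractive for long contexts.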
    The models listed to start this article use a mix of RNN approaches: some (Qwen and Kimi) use a newer idea called Gated DeltaNet (GDN), and some still use Mamba layers (Granite and Nemotron). The Olmo Hybrid model we’re releasing today also falls on the GDN side, based on careful experimentation and on theory that GDN is capable of learning features that attention or Mamba layers cannot.
    Introducing Olmo Hybrid and its pretraining efficiency
    Olmo Hybrid is a 7B base model, with three experimental post-trained checkpoints released, starting with an Instruct model; a reasoning model is coming soon. It is the best open artifact for studying hybrid models, as it is almost identical to our Olmo 3 7B model from last fall, just with a change in architecture. With the model, we are releasing a paper with substantial theory on why hybrid models can be better than standard transformers. This is a long paper that I’m still personally working through, but it’s excellent.
    You can read the paper here and poke around with the checkpoints here. This is an incredible, long-term research project led by Will Merrill. He did a great job.
    To understand the context of why hybrid models can be a strict upgrade on transformers, let me begin with a longer excerpt from the paper’s introduction, emphasis mine:
    Past theoretical work has shown that attention and recurrence have complementary strengths (Merrill et al., 2024; Grazzi et al., 2025), so mixing them is a natural way to construct an architecture with the benefits of both primitives. We further derive novel theoretical results showing that hybrid models are even more powerful than the sum of their parts: there are formal problems related to code evaluation that neither transformers nor GDN can express on their own, but which hybrid models can represent theoretically and learn empirically. But this greater expressivity does not immediately imply that hybrid models should be better LMs: thus, we run fully controlled scaling studies comparing hybrid models vs. transformers, showing rigorously that hybrid models’ expressivity translates to better token efficiency, in agreement with our observations from the Olmo Hybrid pretraining run. Finally, we provide a theoretical explanation for why increasing an architecture’s expressive power should improve language model scaling rooted in the multi-task nature of the language modeling objective.
    Taken together, our results suggest that hybrid models dominate transformers, both theoretically, in their balance of expressivity and parallelism, and empirically, in terms of benchmark performance and long-context abilities. We believe these findings position hybrid models for wider adoption and call on the research community to pursue further architecture research.
    Essentially, we show and argue a few things:
    * Hybrid models are more expressive. They can learn more types of functions. An intuition for why this is good: in deep learning we want to make the model class as flexible as possible and let the optimizer do the work, rather than imposing constraints on the learner. Sounds a lot like the Bitter Lesson.
    * Why does expressive power help with efficiency? This is where things are more nuanced. We argue that more expressive models will have better scaling laws, following the quantization model of neural scaling.
    All of this theory work is a great way to go deeper, and frankly I have a lot more to learn on it, but the crucial part is that we transition from theory to clear experiments that back it up. In particular, the scaling laws for designing this model were studied carefully to decide on the final hybrid architecture. The final performance is very sensitive to exactly which RNN block is used and in what quantity.
    In scaling experiments, the results showed that for Olmo, the hybrid GDN (3:1 ratio of layers) > pure GDN (all RNN layers) > standard transformer (all attention) > hybrid Mamba2 > pure Mamba2. The crucial point was that these gaps held when scaling to more parameters and compute. A visual summary of the different types of architectures studied is below.
    In terms of this specific model, the pretraining gains were giant! Relative to Olmo 3 dense, it represents roughly a 2X gain in training efficiency. When you look at evaluation performance for pretraining, there was also substantial improvement, particularly after long-context extension (the final 2 rows of Table 2 in the paper, highlighted below).
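    To make the winning configuration concrete, here is one way a 3:1 interleave of recurrent and attention layers could be laid out. The exact placement Olmo Hybrid uses is specified in the paper; this function is purely illustrative.

```python
def hybrid_layer_pattern(n_layers, rnn_per_attention=3):
    """Build a layer list where every (rnn_per_attention + 1)-th layer
    is full attention and the rest are GDN-style recurrent layers."""
    period = rnn_per_attention + 1
    return ["attention" if (i + 1) % period == 0 else "gdn"
            for i in range(n_layers)]
```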
    The journey to post-training Olmo Hybrid
    Most of the experience in post-training Olmo models has been climbing up a steep curve in base model capabilities with minor tweaks to architecture. Our recipes from Tulu 2, Tulu 3, and the Olmo 3 reasoning work (building substantially on OpenThoughts 3) all worked in a fairly straightforward, off the shelf manner. Olmo Hybrid is our first experience in post-training a substantially different architecture, and the results were mixed.
    1. Benchmark performance
    Following the Olmo 3 recipe, we got some substantial wins (knowledge) and some substantial losses (extended reasoning) relative to the dense model. Altogether, these still represent a very strong fully open model — just that the pretraining gains didn’t translate as obviously. The results are below.
    The exact reason why this happens is a research question. Our best guess is that the Olmo Hybrid base model is simply a sufficiently different student model, where most of our post-training data at early stages is learned from stronger “teacher” models (a recap of this method, called distillation, appeared recently in Interconnects).
    There is a lot of other research ongoing in the community around what makes a strong teacher model — generally, the best overall model is not the best teacher. In other words, training on data outputted from the model with the best evaluation scores today is unlikely to unlock the ceiling in performance for your new base model. A second factor, which is even less explored, is how different base models likely need different teachers to learn from. This is why Olmo Hybrid could perform very differently: its behavior is downstream of an architecture-driven change in learning, while the pretraining data is almost identical.
    There’s A LOT more work to dig into here, some empirical work in generating better data and other work in understanding how different training stages fit together. I am confident this Olmo Hybrid base model is solid and more performance can be extracted, but it takes more careful work adapting existing datasets.
    2. Open-source tooling
    The frank reality of new architectures for open models is that the open-source software tooling support is horrific. There are the paper cuts that people are familiar with, e.g. random errors in popular libraries (as people experienced with GPT-OSS) that slow adoption, but there are also deeper problems.
    A large part of the potential benefit of hybrid models is the reduction in memory usage for long-context generation, which is crucial for reinforcement learning and agentic tasks. It should be a huge win for post-training! This, unfortunately, is far from the case, and will likely take another 3-6 months to get right for this batch of GDN models.
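    A rough sketch of why the long-context memory win is so large: an attention KV cache grows linearly with sequence length, while a linear-RNN layer carries a fixed-size state. All dimensions below are made up for illustration and are not Olmo Hybrid's actual configuration.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per=2):
    """Standard attention KV cache: a K and a V entry for every layer,
    head, and token -- linear in sequence length."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per

def recurrent_state_bytes(n_rnn_layers, d_k, d_v, bytes_per=2):
    """A linear-RNN layer keeps one fixed d_k x d_v state matrix,
    independent of how many tokens have been generated."""
    return n_rnn_layers * d_k * d_v * bytes_per
```

    At agentic context lengths the cache term dominates, which is why actually realizing this saving in inference engines matters so much for RL.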
    The core problem is that the open-source inference tools, e.g. vLLM, rely on far less developed kernels (and other internals) compared to standard transformers. This comes with two challenges — throughput slowdowns and numerical issues. Numerical issues can be combated with a variety of inference flags. Quoting the paper again:
    The two key flags in VLLM we needed to get maximum performance with the post-training model were --disable-cascade-attn, which disables cascade attention (an optimization for shared prompt prefixes), and --enforce-eager, which turns off CUDA graphs. These two flags have been used in our RL setup dating back to Olmo 3, but are new additions to evaluations. Scores for the released models drop precipitously without them. We also evaluated our final models with the hybrid model cache in the richer FP32 datatype, to improve stability via --mamba_ssm_cache_dtype following NVIDIA.
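    For reference, the quoted flags combine into a launch command along these lines. The model name is a placeholder and the flag spellings follow the excerpt above; treat this as a sketch, not a verified invocation.

```shell
# Sketch only: illustrative model name, flags as quoted from the paper.
vllm serve allenai/Olmo-3-Hybrid-7B-Instruct \
  --disable-cascade-attn \
  --enforce-eager \
  --mamba_ssm_cache_dtype float32
```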
    Essentially, we used these to make sure the model was numerically stable. The downside is that the inference throughput plummets, so the potential gains in compute efficiency are erased. A comparison of numbers is below.
    Effectively, the 7B hybrid model today takes more compute to train with RL than our 7B dense model (that doesn’t even have a common memory saving technique, GQA). The total compute estimate from the table at different context lengths is below (more visuals in the slides from my recent CMU talk).
    The good news is that these are solvable problems — and improving the tooling could even improve benchmark numbers — but it’s going to take a good bit of time and hard work in the OSS community.
    This leads to my final question. If I’m optimistic about the open ecosystem evolving to support these models with ease, motivated by the better fundamental scaling of the architectures and a large cluster of leading open model builders already using it, are closed models like GPT and Claude built like this?
    To be clear, this answer is a total guess (which I don’t normally do), but with the evidence I have I’d put the chance that one of the 3 frontier models is an RNN at around a coin flip. I’ll let you know if I learn for sure either way. If the scaling advantages hold at frontier scale, the economic case becomes hard to ignore, but they could already have architectures that are efficient like RNNs, but with even more benefits.
    I’m going to follow up this post with more architecture discussions, particularly on why Mixture of Experts (MoE) models are a major headache to post-train, so make sure to subscribe if that sounds interesting to you!
    Thanks to Will Merrill and Finbarr Timbers for some discussions that helped inform this post.



    How much does distillation really matter for Chinese LLMs?

    2026/02/24 | 11 mins.
    Distillation has been one of the most frequent topics of discussion in the broader US-China and technological diffusion story for AI. Distillation is a term with many definitions — the colloquial one today is using a stronger AI model’s outputs to teach a weaker model. The word itself is derived from a more technical and specific definition of knowledge distillation (Hinton, Vinyals, & Dean 2015), which involves a specific way of learning to match the probability distribution of a teacher model.
    The distillation of today is better described generally as synthetic data. You take outputs from a stronger model, usually via an API, and you train your model to predict those. The technical form of knowledge distillation is not actually possible from API models because they don’t expose the right information to the user.
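    The difference is easy to see in loss terms. Hinton-style knowledge distillation matches the teacher's full output distribution, which requires the teacher's logits; API-era "distillation" is just cross-entropy on the tokens the teacher happened to emit. The functions below are a minimal sketch of that contrast, with made-up logits.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(student_logits, teacher_logits, temperature=2.0):
    """Knowledge distillation (Hinton et al., 2015): KL divergence between
    temperature-softened teacher and student distributions. Needs the
    teacher's logits, which commercial APIs generally do not expose."""
    p_t = softmax(teacher_logits / temperature)
    p_s = softmax(student_logits / temperature)
    return float(np.sum(p_t * (np.log(p_t) - np.log(p_s))))

def synthetic_data_loss(student_logits, teacher_sampled_token):
    """API-style 'distillation': plain cross-entropy on the one token the
    teacher actually produced, i.e. supervised fine-tuning on synthetic
    data."""
    p_s = softmax(student_logits)
    return float(-np.log(p_s[teacher_sampled_token]))
```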
    Synthetic data is arguably the single most useful method that an AI researcher today uses to improve the models on a day to day basis. Yes, architecture is crucial, some data still needs exclusively human inputs, and new ideas like reinforcement learning with verifiable rewards at scale can transform the industry, but so much of the day to day life in improving models today is figuring out how to properly capture and scale up synthetic data.
    To flesh out the point from the start of this piece, the argument has repeatedly been that the leading Chinese labs are using distillation for their models to steal capabilities from the best American API-based counterparts. The most prominent case to date was surrounding the release of DeepSeek R1 — where OpenAI accused DeepSeek of stealing their reasoning traces by jailbreaking the API (they’re not exposed by default — for context, a reasoning trace is a colloquial term of art referring to the internal reasoning process, such as what open-weight reasoning models expose to the user). Fear of distillation is also likely why Gemini quickly flipped from exposing the reasoning traces to users to hiding them. There was even very prominent, early reasoning research that built on Gemini!
    This all leads us to today’s news, where Anthropic named and directly accused a series of Chinese labs of elaborate distillation campaigns on their Claude models. This is a complex issue. In this post we unpack a series of questions, beginning with the impact and ending with politics. The core question is: how much of a performance benefit do Chinese labs get from distilling from American models?
    Interconnects AI is a reader-supported publication. Consider becoming a subscriber.

    To start, let’s review what Anthropic shared. From the blog post, emphasis mine:
    We have identified industrial-scale campaigns by three AI laboratories—DeepSeek, Moonshot, and MiniMax—to illicitly extract Claude’s capabilities to improve their own models. These labs generated over 16 million exchanges with Claude through approximately 24,000 fraudulent accounts, in violation of our terms of service and regional access restrictions.
    These labs used a technique called “distillation,” which involves training a less capable model on the outputs of a stronger one. Distillation is a widely used and legitimate training method. For example, frontier AI labs routinely distill their own models to create smaller, cheaper versions for their customers. But distillation can also be used for illicit purposes: competitors can use it to acquire powerful capabilities from other labs in a fraction of the time, and at a fraction of the cost, that it would take to develop them independently.
    Much like the models themselves, the benefits of distillation are very jagged. For some capabilities, particularly if you don’t have a full training pipeline setup for it, quickly distilling some data from the leading frontier model in that area can yield massive performance boosts. This can definitely help the lab distilling from the API catch up much more quickly than they otherwise would. Most distillation is rather benign, using many tokens of an LLM to help process and refine existing data — putting a lot of compute into getting a few, high quality training tokens out. This sort of raw data processing work can be done on many different APIs, but one tends to be best.
    When we go into what Anthropic says the three Chinese LLM builders actually used the Claude API for — as an aside, Anthropic didn’t confirm that the attack was done through the API, the chat app, or Claude Code — the actual impact of the operations is very mixed. It’s hard to know how much untracked usage these labs deployed for other projects (or other American models).
    To start, Anthropic puts DeepSeek first in their blog post because they’re the household name in the US for Chinese AI. The extent of their use is actually quite small, showing how this post is more about the big picture than the details:
    DeepSeek
    Scale: Over 150,000 exchanges
    The operation targeted:
    * Reasoning capabilities across diverse tasks
    * Rubric-based grading tasks that made Claude function as a reward model for reinforcement learning
    * Creating censorship-safe alternatives to policy sensitive queries
    In the scale of training a language model, 150K samples barely scratches the surface; it looks more like a substantive experiment than a core data pipeline. It looks like they were experimenting with some rubrics, which could’ve been for an online RL run, but that’s extremely unlikely with how distributed the access was, and then some minor stuff on completions for sensitive queries. This usage of Anthropic’s API will have a negligible impact on DeepSeek’s long-rumored V4 model (or whichever model the data here contributed to). This was also very likely a small team at DeepSeek, unknown to much of the broader training organization.
    The other two labs, Moonshot AI (makers of the Kimi models) and MiniMax reflected much broader usage.
    Moonshot AI
    Scale: Over 3.4 million exchanges
    The operation targeted:
    * Agentic reasoning and tool use
    * Coding and data analysis
    * Computer-use agent development
    * Computer vision
    MiniMax
    Scale: Over 13 million exchanges
    The operation targeted:
    * Agentic coding
    * Tool use and orchestration
    The role of distillation is constantly changing. Distilling from Claude today for its agentic behavior is much more valuable than distilling from past versions of Claude was. Claude Opus 4.6 has a well-rounded agentic navigation that none of the other models quite match. Why not try training on some of the model outputs to see if your model absorbs it? Over the next few months, that’ll be less differentiated. It’s sort of like how all the models are way better at math today than most people need — there are plenty of places to distill from.
    Estimates will vary, but if each exchange averaged 10-25K tokens, the total across these two labs, mostly from MiniMax, would be 150-400 billion tokens. This is a substantial amount, which could meaningfully improve a model’s post-training. For example, in Olmo 3 we had an SFT dataset of 20 billion tokens that could be built like this, and increasing it by 10X would be very reasonable.
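    The arithmetic behind that rough range, under the stated assumption of 10-25K tokens per exchange (the exact endpoints land slightly above the rounded figures in the text):

```python
# Moonshot (3.4M) plus MiniMax (13M) exchanges at an assumed
# 10-25K tokens per exchange.
exchanges = 3.4e6 + 13e6
low_tokens = exchanges * 10_000
high_tokens = exchanges * 25_000
print(f"~{low_tokens / 1e9:.0f}B to ~{high_tokens / 1e9:.0f}B tokens")
# For scale, the Olmo 3 SFT mix mentioned above was about 20B tokens.
print(f"up to ~{high_tokens / 20e9:.1f}x an Olmo-3-sized SFT dataset")
```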
    These numbers are just scratching the surface of total synthetic data generation across APIs hosted by US companies. At the same time, quantity is a pretty crude way to measure impact. Just taking the outputs from Claude and figuring out how to add them to your model pipeline isn’t easy. The research community has seen many cases where taking outputs from a certain teacher model unexpectedly makes the student worse — subtle interactions between the data make it variable and tricky to do this type of distillation. It’s fundamentally a research problem.
    This is what I’m sure the Chinese labs are innovating at. There’s an argument that Chinese frontier labs are substantially more efficient than their Western counterparts — this is misleading.
    The labs operate under different constraints. The Chinese labs are likely slightly more efficient out of necessity, being lower on resources, but overall the picture of talent access is very similar. The Chinese labs also approach benchmarks differently, making it appear that they’re a bit closer than they really are (and at times as if they’re surpassing the frontier). This is needed to get momentum and brand recognition in the AI market.
    The Chinese labs likely innovate greatly on distilling from leading API models, due to their restricted access to GPUs. GPUs could be used to construct synthetic data, but for organizations with more funding than they can spend on research compute (being supply limited), using API-based models is one of the few other options for effectively getting more compute. It’s way easier to figure out getting access to “banned” API models than it is to smuggle tens of thousands of physical GPUs and get them set up.
    It’s not only the Chinese labs that operate like this. Synthetic data from a model you don’t own is all arguably distillation. Distillation is a shortcut to more compute for anyone. It’s also a far less risky cost, as having a big cluster for research requires a very large financial commitment, where APIs are pay-as-you-go. For example, in Olmo 3 we used millions of GPU hours on the Frontier supercomputer and Azure credits through NAIRR for synthetic data. We didn’t have the equivalent in GPUs (or really the cash, thank you research credits!).
    Altogether, it’s very fair for Anthropic to be concerned about this. I still wouldn’t say it is a crucial factor in these Chinese labs’ post-training capabilities, especially not one that’ll show up as an easy-to-measure time gap between a Chinese model and the American model it distilled from, a la the US-China performance lag.
    If we take a step back, there was even a time when Claude Sonnet was the flagship model ahead of Opus (I think this was with Sonnet 3.5); much of this comes from it being well distilled internally from Opus checkpoints. Fast iteration and high-quality data can go very far, letting student models surpass the teacher. Frontier labs use this to their advantage by having internal-only models for generating synthetic data, but saying that Chinese models could never pass the US frontier due to data distillation is like saying that Claude Sonnet could never beat Opus. It's unlikely, and it depends a lot on release times, but with AI models making dramatic progress, weirder things like this have already literally happened.
    The biggest factor unaddressed here is how distillation from stronger teacher models is harder in an era when reinforcement learning at scale is needed to train the best models. You can spend compute carefully crafting and filtering prompts, but you still need to train the model yourself with substantial, on-policy inference — generation is the majority of the compute cost for RL and it can’t be generations from another model. For this reason, I expected this story to die down a bit. It’s clear from their open research that Chinese labs have excellent RL infrastructure, despite the compute shortages.
    The reason I expected it to fade is that not being allowed to distill models for “competitive purposes” has violated the terms of service for API models for quite some time. Academics and open model builders in the US used to greatly worry about and debate this (and I’ve written about it multiple times in 2022 and 2023). Only later in 2024 did that worry die down in the community (and no action has been taken against any smaller model builders).
    This action from Anthropic represents another step ratcheting up AI geopolitical tension. Kneecapping model distillation will be far harder than restricting the shipments of physical goods like GPUs. In many ways, fully restricting distillation through distributed access methods seems almost impossible, and restricting GPU sales would be far more impactful.
    Anthropic and the AI industry should choose their battles. When API endpoints are available for the best models, other entities will use them to train variants of said models. This is a natural evolution of AI models. If AI models are so precious that distillation is an extreme risk, then the models will be restricted to first-party products. Anthropic has a choice to do this with their latest models. The market for API-based model alternatives may be so competitive that some companies go down this path — likely in part due to Chinese models undercutting on price — but an API is a fundamental offering that no leading lab will risk walking back from anytime soon.



    Opus 4.6, Codex 5.3, and the post-benchmark era

    2026/02/09 | 8 mins.
    Last Thursday, February 5th, both OpenAI and Anthropic unveiled the next iterations of their models designed as coding assistants, GPT-5.3-Codex and Claude Opus 4.6, respectively. Ahead of this, Anthropic had a firm grasp of the mindshare as everyone collectively grappled with the new world of agents, primarily driven by a step change in performance from Claude Code with Opus 4.5. This post doesn’t unpack how software is changing forever, how Moltbook is showcasing the future, how ML research is accelerating, and the many broader implications, but rather how to assess, live with, and prepare for new models. The fine margins between Opus 4.6 and Codex 5.3 will be felt in many model versions this year, with Opus ahead in this matchup on usability.
    Going into these releases I’d been using Claude Code extensively as a general computer agent, with some software engineering and a lot of data analysis, automation, etc. I had dabbled with Codex 5.2 (usually on xhigh, maximum thinking effort), but found it didn’t quite work for me across my broad, horizontal set of tasks.
    For the last few days, I’ve been using both of the models much more evenly. I mean this as a great compliment, but Codex 5.3 feels much more Claude-like: it’s much faster in its feedback and much more capable in a broad suite of tasks from git to data analysis (previous versions of Codex, including up to 5.2, regularly failed basic git operations like creating a fresh branch). Codex 5.3 takes a very important step towards Claude’s territory by having better product-market fit. This is a very important move for OpenAI, and between the two models, Codex 5.3 feels far more different from its predecessors.
    OpenAI’s latest GPT, with this context, keeps an edge as a better coding model. It’s hard to describe this general statement precisely, and a lot of it is based on reading others’ work, but it seems to be a bit better at finding bugs and fixing things in codebases, such as the minimal algorithmic examples for my RLHF Book. In my experience, this is a minor edge, and the community thinks that this is most apparent in complex situations (i.e. not most vibe-coded apps).
    As users become better at supervising these new agents, having the best top-end ability in software understanding and creation could become a meaningful edge for Codex 5.3, but it is not an obvious advantage today. Many of my most trusted friends in the AI space swear by Codex because it can be just this tiny bit better. I haven’t been able to unlock it.
    Switching from Opus 4.6 to Codex 5.3 feels like I need to babysit the model in terms of more detailed descriptions when doing somewhat mundane tasks like “clean up this branch and push the PR.” I can trust Claude to understand the context of the fix and generally get it right, where Codex can skip files, put stuff in weird places, etc.
    Both of these releases feel like the companies pushing for capabilities and speed of execution in the models, but at the cost of some ease of use. I’ve found both Opus 4.6 and Codex 5.3 ignoring an instruction if I queue up multiple things to do — they’re really best when given well-scoped, clear problems (especially Codex). Claude Code’s harness has a terrible bug that makes subagents brick the terminal, where new messages say you must compact or clear, but compaction fails.
    Despite the massive step by Codex, they still have a large gap to close to Claude on the product side. Opus 4.6 is another step in the right direction, where Claude Code feels like a great experience. It’s approachable, it tends to work in the wide range of tasks I throw at it, and this’ll help them gain much broader adoption than Codex. If I’m going to recommend a coding agent to an audience who has limited-to-no software experience, it’s certainly going to be Claude. At a time when agents are just emerging into general use, this is a massive advantage, both in mindshare and feedback in terms of usage data.
    In the meantime, there’s no cut-and-dried guideline on which agent to use for a given use-case; you need to use multiple models all the time and keep up with the skill that is managing agents.
    Assessing models in 2026
    There have been many hints through 2025 that we were heading toward an AI world where benchmarks associated with model releases no longer convey meaningful signal to users. Back in the time of the GPT-4 or Gemini 2.5 Pro releases, the benchmark deltas could be easily felt within the chatbot form factor of the day — models were more reliable, could do more tasks, etc. This continued through models like OpenAI’s o3. During this phase of AI’s buildout, roughly from 2023 to 2025, we were assembling the core functionality of modern language models: tool-use, extended reasoning, basic scaling, etc. The gains were obvious.
    It should be clear with the releases of both Opus 4.6 and Codex 5.3 that benchmark-based release reactions barely matter. For this release, I barely looked at the evaluation scores. I saw that Opus 4.6 had slightly better search scores and that Codex 5.3 used far fewer tokens per answer, but neither of these was going to convince me they were much better models.
    Each of the AI laboratories, and the media ecosystems covering them, has made this transition away from standard evaluations at its own pace. The most telling example is the Gemini 3 Pro release in November of 2025. The collective vibe was that Google was back in the lead. Kevin Roose, the self-proclaimed “AGI-pilled” NYTimes reporter in SF, said:
    There's sort of this feeling that Google, which kind of struggled in AI for a couple of years there — they had the launch of Bard and the first versions of Gemini, which had some issues — and I think they were seen as sort of catching up to the state of the art. And now the question is: is this them taking their crown back?
    We don’t need to dwell on the depths of Gemini’s current crisis, but they have effectively no impact at the frontier of coding agents, the area that feels most likely to see dramatic strides in performance — and, dare I say, even to meet many commonly accepted definitions of AGI that center on the notion of a “remote worker.” The timeline has left them behind just two months after their coronation, revealing Gemini 3 as a false king.
    On the other end of the spectrum is Anthropic. With Anthropic’s release of Claude 4 in May of 2025, I was skeptical of their bet on code — I was distracted by the glitz of OpenAI and Gemini trading blows with announcements like models achieving IMO Gold medals in mathematics or other evaluation breakthroughs.
    Anthropic deserves serious credit for the focus of its vision. They were likely not the only AI lab to note the coming role of agents, but they were by far the first to shift their messaging and prioritization towards this. In my post in June of 2025, a month after Claude 4 was released, I was coming around to them being right to deprioritize standard benchmarks:
    This is a different path for the industry and will take a different form of messaging than we’re used to. More releases are going to look like Anthropic’s Claude 4, where the benchmark gains are minor and the real world gains are a big step. There are plenty of more implications for policy, evaluation, and transparency that come with this. It is going to take much more nuance to understand if the pace of progress is continuing, especially as critics of AI are going to seize the opportunity of evaluations flatlining to say that AI is no longer working.
    This leaves me reflecting on the role of Interconnects’ model reviews in 2026. 2025 was characterized by many dramatic, day-of model release blog posts, with the entry of many new Chinese open model builders, OpenAI’s first open language model since GPT-2, and of course the infinitely hyped GPT-5. These timely release posts still have great value, centering the conversation around the current snapshot of a company vis-a-vis the broader industry, but if models remain similar, they’ll do little to disentangle the complexity of mapping the current frontier of AI.
    In order to serve my role as an independent voice tracking the frontier models, I need to keep providing regular updates on how I’m using models, why, and why not. Over time, the industry is going to develop better ways of articulating the differences in agentic models. For the next few months, maybe even years, I expect the pace of progress in agentic capabilities to be so fast and uneven that consistent testing and clear articulation will be the only way to monitor it.


    This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe
  • Interconnects

    Why Nvidia builds open models with Bryan Catanzaro

    2026/02/04 | 1h 7 mins.
    One of the big stories of 2025 for me was how Nvidia massively stepped up their open model program — more releases, higher quality models, joining a small handful of companies releasing datasets, etc. In this interview, I sat down with one of the 3 VPs leading an effort of 500+ technical staff, Bryan Catanzaro, to discuss:
    * Their very impressive Nemotron 3 Nano model released in Dec. 2025, and the bigger Super and Ultra variants coming soon,
    * Why Nvidia’s business clearly benefits from them building open models,
    * How the Nemotron team culture was crafted in pursuit of better models,
    * Megatron-LM and the current state of open-source training software,
    * Career reflections and paths into AI research,
    * And other topics.
    The biggest takeaway I had from this interview is how Nvidia understands its unique role as a company that can both build open language models and directly capture the value from building them, giving it a uniquely sustainable advantage.
    Bryan has a beautiful analogy for open models this early in AI’s development, and how they are a process of creating “potential energy” for AI’s future applications.
    I hope you enjoy it!
    Guest: Bryan Catanzaro, VP Applied Deep Learning Research (ADLR), NVIDIA. X: @ctnzr, LinkedIn, Google Scholar.
    Listen on Apple Podcasts, Spotify, YouTube, and wherever you get your podcasts. For other Interconnects interviews, go here.
    Nemotron Model Timeline
    2019–2022 — Foundational Work
    * Megatron-LM (model parallelism framework that has become very popular again recently; alternatives: DeepSpeed, PyTorch FSDP).
    * NeMo Framework (NVIDIA’s end-to-end LLM stack: training recipes, data pipelines, evaluation, deployment).
    Nov 2023 — Nemotron-3 8B: Enterprise-ready NeMo models. Models: base, chat-sft, chat-rlhf, collection. Blog.
    Feb 2024 — Nemotron-4 15B: Multilingual LLM trained to 8T tokens. Paper.
    Jun 2024 — Nemotron-4 340B: Major open release detailing their synthetic data pipeline. Paper, blog. Models: Instruct, Reward.
    Jul–Sep 2024 — Minitron / Nemotron-Mini: First of their pruned models, pruned from 15B. Minitron-4B (base model), Nemotron-Mini-4B-Instruct. Paper, code.
    Oct 2024 — Llama-3.1-Nemotron-70B: Strong post-training on Llama 3.1 70B. Model, collection. Key dataset — HelpSteer2, paper.
    Mar–Jun 2025 — Nemotron-H: First hybrid Mamba-Transformer models for inference efficiency. Paper, research page, blog. Models: 8B, 47B, 4B-128K.
    May 2025 — Llama-Nemotron: Efficient reasoning models built on top of Llama (still!). Paper.
    Sep 2025 — Nemotron Nano 2: 9B hybrid for reasoning, continuing to improve in performance. 12B base on 20T tokens (FP8 training) pruned to 9B for post-training. Report, V2 collection.
    Nov 2025 — Nemotron Nano V2 VL: 12B VLM. Report.
    Dec 2025 — Nemotron 3: Nano/Super/Ultra family, hybrid MoE, up to 1M context. Super/Ultra H1 2026. Nano: 25T tokens, 31.6B total / ~3.2B active, releases recipes + code + datasets. Papers: White Paper, Technical Report. Models: Nano-30B-BF16, Base, FP8.
    Nemotron’s Recent Datasets
    NVIDIA began releasing substantially more data in 2025, including pretraining datasets — making them one of the few organizations releasing high-quality pretraining data at scale (which comes with non-negligible legal risk).
    Pretraining Data
    Collection — CC-v2, CC-v2.1, CC-Code-v1, Code-v2, Specialized-v1, CC-Math-v1. Math paper: arXiv:2508.15096.
    Post-Training Data
    Core post-training dumps (SFT/RL blends):
    * Llama Nemotron Post-Training v1.1 (Apr 2025)
    * Nemotron Post-Training v1 (Jul 2025)
    * Nemotron Post-Training v2 (Aug 2025)
    2025 reasoning/code SFT corpora:
    * OpenMathReasoning (Apr 2025)
    * OpenCodeReasoning (Apr 2025), OpenCodeReasoning-2 (May 2025)
    * AceReason-1.1-SFT (Jun 2025)
    * Nemotron-Math-HumanReasoning (Jun 2025), Nemotron-PrismMath (Apr 2025)
    NeMo Gym RLVR datasets: Collection
    Nemotron v3 post-training (Dec 2025): Collection
    HelpSteer (human feedback/preference):
    * HelpSteer (Nov 2023)
    * HelpSteer2 (Jun 2024)
    * HelpSteer3 (Mar 2025)
    And others, not linked here.
    Chapters
    * 00:00:00 Intro & Why NVIDIA Releases Open Models
    * 00:05:17 Nemotron’s two jobs: systems R&D + ecosystem support
    * 00:15:23 Releasing datasets, not just models
    * 00:22:25 Organizing 500+ people with “invitation, not control”
    * 00:37:29 Scaling Nemotron & The Evolution of Megatron
    * 00:48:26 Career Reflections: From SVMs to DLSS
    * 00:54:12 Lessons from the Baidu Silicon Valley AI Lab
    * 00:57:25 Building an Applied Research Lab with Jensen Huang
    * 01:00:44 Advice for Researchers & Predictions for 2026
    Transcript
    00:00:06 Nathan Lambert: Okay. Hey, Bryan. I’m very excited to talk about Nemotron. I think low-key, one of the biggest evolving stories in twenty-five of open models, outside the obvious things in China that everybody talks about, that gets a ton of attention. So th- thanks for coming on the pod.
    00:00:22 Bryan Catanzaro: Oh, yeah, it’s my honor.
    00:00:23 Nathan Lambert: So I wanted to start, and some of these questions are honestly fulfilling my curiosity as a fan. As like, why does NVIDIA, at a basic level, release Nemotron as open models?
    00:00:39 Bryan Catanzaro: Well, we know that it’s an opportunity for NVIDIA to grow our market whenever AI grows, and we know that having access to open AI models is really important for a lot of developers and researchers that are trying to push AI forward. you know, we were really excited by efforts from some other companies around the industry to push openly developed AI forward. You know, Meta did some amazing work, obviously, with Llama and you know OpenAI released GPT OSS, which was exciting. And the Allen Institute, of course, has been, you know, really leading the charge for research, open research and, you know, also things like the Marin Project and OpenAthena. You know, like there’s, there’s a bunch of things that we’re always excited to see develop.
    And, you know, as we think about where AI is gonna go, you know, NVIDIA believes that AI is a form of infrastructure. it’s.. AI is a very useful technology when it’s applied, but on its own you know, it’s kind of a foundation and infrastructure. We think that technology generally works better when there’s openness to the infrastructure so that people can build things in different ways. You know, you think about the way that the internet transformed every aspect of the world economy is pretty profound, and we’re not done yet.
    But the way that, for example, retail uses the internet is different from the way that healthcare uses the internet. And the fact that you know, different sectors of the economy were able to figure out how to incorporate the internet into the beating heart of their businesses in different ways was possible because the internet was built on open technologies that, you know, allowed people to try different things. And we think AI is gonna evolve in a similar way, that organizations across every sector of the world economy are gonna find new and surprising and fun, and important things to do with AI, and they’ll be able to do that better if they have the ability to customize AI and incorporate it directly into the work that they do. and so -- and by the way, this is not to detract from any of the you know, more closed approaches to AI, you know, the APIs that we see from a number of leading labs that, you know, are just extraordinary and have amazing capabilities. We’re excited about those, too.
    You know, NVIDIA loves to support AI in all of its manifestations, but we feel like right now the sort of closed approaches to deploying AI are doing pretty well but we, you know, could use some more energy in the openly developed AI ecosystem, and so that’s why we’ve been putting more effort into it this past year.
    00:03:42 Nathan Lambert: Yeah. So I’m definitely gonna dig into this a lot ‘cause I have seen this. We’re sitting here recording in January twenty-six, which is in the midst of the rollout of these Nemotron three models. There’s the-- I think the Nano has released in the fall, which was probably one of the biggest splashes the org has made, and everybody’s eagerly awaiting these super and ultra-larger variants.
    And it’s like how far are you, how far are you willing to push this Nemotron platform? Like, is it just depending on the users and the uptake and the ecosystem? Like, like, what is the-- is there a North Star in this? Or you hear a lot of.. if you listen to a lot of other open labs, they’re like: “We want to build open AGI,” which is like, I don’t necessarily think grounded, but there’s like a very unifying vision.
    Is there something that you try to set the tone for it that goes through the organization? I mean, Ai2, it’s like-
    00:04:31 Bryan Catanzaro: You know, my North-
    00:04:32 Nathan Lambert: .. academics is so-
    00:04:34 Bryan Catanzaro: For Nemotron.
    00:04:36 Nathan Lambert: Okay, go ahead.
    00:04:37 Bryan Catanzaro: Oh, sorry. Go ahead.
    00:04:39 Nathan Lambert: I was just, like, gonna compare to, like, Ai2, where we can have such a-- like, we have a very specific vision, being so open that it’s like, I think, like, research is so needed, and there’s so little recipes to build on, like, with really credible research. So there’s, like, a research infrastructure, and then when you have something like Llama, it was, like, built on Zuckerberg’s vision, and he changed his mind, which I actually thought his vision was ex- was excellent, the way he articulated the need for open models, and it kind of faded. So it’s like, is there a way to set a vision for an org that, like, permeates every- everyone and is really compelling and exciting?
    00:05:17 Bryan Catanzaro: Right. Well, we built Nemotron for two main reasons. The first is because we need to for our main product line. So what I mean by that?
    Well, accelerated computing, what NVIDIA does, we build fast computers, right? But the point of building fast computers is to help people do new things. and actually every fast computer is also a slow computer. you know, the observation that it would be nice if computers were faster and could do more things isn’t new. that’s been around since the beginning of computing. So what makes accelerated computing different from standard computing is that we’re prioritizing, you know, we’re focusing, we’re deciding we’re gonna accelerate this workload. This other workload, which is like ninety-nine percent of all of the workloads, we’re gonna let somebody else do that, right?
    So, like, you do not buy NVIDIA systems to do any general purpose computation. You buy them for a purpose, right? Which is these days, all about AI. But when you think about the workload, the compute workloads involved in AI there’s a, there’s a lot of diversity and there’s a lot of really important -.. parameters, hyperparameters, or algorithmic approaches that all have enormous imp- impacts on the systems that we need to build for AI.
    So things like numeric precision MoE architecture, which of course, influence net-- it influences network design. you know, we’re dreaming about sparsity. We, you know, we’ve had, we’ve had sparse neural network acceleration in the GPU since Ampere. I don’t think that it’s being used enough. you know, so how do we, how do we figure out how to use that? These, these sorts of things have an enormous impact on the future of NVIDIA’s main product line, and we have to understand the answers to those questions deeply ourselves in order to know what we’re going to build.
    We can’t just go to our customers and do a survey and say, “Hey “ you know, Meta, for example, since we were just talking about them, “what would you like to see in a future product line from NVIDIA?” Of course, Meta’s always trying to help us as much as they can, but there’s limits to what they can tell us because, you know a lot of the information that influences the design of these systems, it’s very expensive to derive, and so therefore, it’s, it’s very closely held. And so we need to be able to understand these questions very deeply in order to understand what kind of systems to build, in order to understand what we’re accelerating in AI and what we’re not gonna worry about. and so that’s kind of the first job for Nemotron models, is to make it possible for NVIDIA to continue to exist as a company. And I think it’s important that the community knows that because that’s the reason why NVIDIA is making the investments in Nemotron, is because we believe it’s essential for the future of our company. and so this isn’t-- and although as much, as much as it feels good to say, you know, NVIDIA believes in open openly developed AI because you know, we’re so charitable, but actually, that’s not the case. This is actually a business decision-
    00:08:34 Nathan Lambert: It’s smart
    00:08:34 Bryan Catanzaro: .. like, for NVIDIA, our business needs us to know about AI very deeply. And and so, you know, the amount of investment that is justified to carry on NVIDIA’s ongoing business, I think, is large. and so that’s that’s job number one for Nemotron. Now job number two for Nemotron is to support the ecosystem more broadly outside of NVIDIA. and, you know, NVIDIA has a special position in the AI landscape. of all of the big AI companies I think we’re the one that works with the most other companies. We support every company small and large, AI native company to old established enterprise.
    We work with hyperscalers, we work with tiny little startups, we work with countries around the world. so we have this unique position and I think also a uni- unique responsibility and al- maybe also a unique opportunity, that whenever AI is able to grow in any sort of direction, in any capability, then you know, that’s an opportunity for us to grow our business. Obviously, it’s not automatic, right? you know, the AI market is diverse, and it’s getting more diverse, and it should be, ‘cause it’s the most important market in the history of humanity. So so we acknowledge that, and at the same time, we know that it’s in our interest to develop the AI ecosystem. The more people that are building, inventing, and deploying AI, the more opportunity that we have as a company.
    So that’s job number two for Nemotron.
    00:10:17 Nathan Lambert: Yeah. I really appreciate you saying it so directly ‘cause it’s like we’ve worked.. We- I launched this thing, the ATOM Project, last summer, which is trying to get more investment in US open models, and it’s like the only company that has an obvious business model for open models is something like NVIDIA, where you need to make sure that the open models and the research ecosystem plays nicely on CUDA, because then you’re gonna be able to be one-- You’re so many steps closer to research that’s happening. If not, like, if it like- There’s such an advantage to have research happen mostly on GPUs relative to AMD or anything like this, so.
    00:10:49 Bryan Catanzaro: Well, you know, we are-- we’re, we’re not thinking about how to prevent competition. You know, we welcome competition. There’s lots of competition. There should be more competition in this space, but we are very self-interested in staying engaged with the community.
    You know, it’s very important. You know, CUDA not many people remember this because it happened so long ago, but you know, CUDA started out with a lot of outreach from NVIDIA to the academic and industrial community saying, “Hey, we have this new way of doing computing. we’d love to see what you can do with it.” In fact, you know, I started using CUDA in 2006 when I was a grad student at Berkeley because David Kirk, who was the chief scientist of NVIDIA at the time, came over to Berkeley and said, “Hey we just released this new GPU, and it has this new programming model called CUDA. You should give it a try.” And I was-- at the time, I was working on machine learning on FPGAs, and I had been working on this one particular piece of support vector machine training on the FPGA, and I decided to take that little piece and write it in CUDA, and it took me like fifteen minutes, and then I ran it, and it was like two hundred times faster than my single-threaded CPU code, and I was like: “Whoa, that was way easier than what I was doing before. I’m just gonna go do that,” right?
    So, like, my own personal involvement with CUDA and NVIDIA came about because of this outreach that NVIDIA conducted right from the beginning of CUDA. you know, of course, that led to a lot of great things for NVIDIA, including AlexNet, which was another academic project, you know, where Alex Krizhevsky and Ilya Sutskever were thinking about: “How do we train larger neural networks on more data? we’re gonna go write a bunch of GPU code that uses the GPU in a, in a kinda new and clever way, so that we can train a better image classification model.” And, you know, that had such astonishing results, it kicked off the deep learning era for the whole community. and again, not something that-.. could have been done top-down. That was a, that was a very much a result of NVIDIA supporting open development and re- research in parallel computing and artificial intelligence. And so we remember that, and we’re thinking about in twenty-six, what does it look like to help, you know, the Alex Krizhevsky of the future, who’s, who’s a grad student in a lab somewhere, invent the next technology that changes the world? It seems really difficult to do that without something like Nemotron or, or the other openly developed AI projects out there. yeah, I also wanna say in regards to this Nemotron is not trying to be the only project out there.
    We’re part of the community. We love other people doing great work in openly developed AI. We learn from things that other people do and you know, so we’re, we’re trying to support the community because it’s in our interest, but we you know, we’re very happy to see other people contributing as well.
    00:13:57 Nathan Lambert: Yeah, I mean, I can transition into something I wanted to ask about is like, I see multiple ways, twenty-five Nemotron mat-- in, I don’t wanna use the word maturing ‘cause I wanna ask you about how it feels in the org, but just like the output reached levels that were more noticed by the community and people building with models. And there’s a lot of ways that can happen, but one of them is like, in my niche community, I’ve been using Nemotron datasets a lot. Like we-- when we redo our post-training recipe, one of the only people we look at is like, okay, NVIDIA, Nemotron has released a lot of high-quality, openly licensed post-training data. this year, you also started releasing some pre-training data, which at Ai2 got a lot of notice. Like, what is that? is that like a distinct shift within Nemotron?
    Is that something that you’ve wanted to do for a while and finally just did? But it’s ‘cause it’s like-- it is just like a zero to one moment where releasing pre-training data comes with legal risk for any company, but so few people do it, where on my side of the world, it’s like pretty easy to normally say what the best pre-training dataset is, and it had, for a long time, oscillated between like Hugging Face, AI2, DCLM, and there was like literally only two or three options. So in terms of fundamental research, like I think that’s a big step from an org to support the community and take on some risk. So if you have any story you can tell and or just say like, I appreciate it, that’s, that’s all.. that’s all I got.
    00:15:23 Bryan Catanzaro: Well, yeah. I mean, so I think it’d be great if more people could understand that Nemotron is not just a model, right? Like, what we’re trying to do with Nemotron is to support openly developed AI, because, again, that’s our big opportunity, right? Now, there’s a lot of organizations that are incentivized to build a model, and the model is maybe the thing that runs their business, right?
    But at NVIDIA, the model is not the thing that runs our business, it’s the systems. So when we’re thinking about how do we support the ecosystem, it’s clear to us that the ecosystem needs more than just a model. There’s a lot of models out there already, you know? And of course, we want Nemotron to be awesome, but you know, if Nemotron can convince other people to work on AI because of a dataset or a technique, you know, we’re, we’re trying to be very open with all of the things we learn, you know, including..
    I mean, we do a lot of expensive experiments in order to figure out how to do blending for our datasets or to figure out, you know, optimize our settings and, you know, these sorts of things. we’re very happy for other people to pick that up and run with it if it’s useful to them, you know. And so that makes Nemotron a different kind of AI effort. Of course, there is a model component, and that’s a tangible thing, and it’s, it’s easy to focus on that, but we see Nemotron as you know, an effort that includes models, but also includes datasets, techniques, all of all of the research that goes into Nemotron. And again we’re a unique kind of AI organization because of the way that we work with AI companies around the industry and because of the way that our business works, we can afford to be more open with some of these things than maybe some other organizations could be.
    Now to your question about, like, does it take some courage in order to be open? Yeah, absolutely it does. and you know, I think there’s been-- one of the things that’s happened in twenty-five is that there’s been an evolving understanding within NVIDIA about the benefits of openness, and that has really enabled the company to make some investments that perhaps it was a little gun-shy to make in the past. And so that’s really encouraging for me. it’s something that I’ve you know, advocated for a while, and so it’s, it’s great to see the company kind of lining up behind it. I also, you know, to your point about like twenty-five being a, a year where Nemotron really made some strides, I want to say thank you for noticing that, and then maybe tell you a little bit about how that happened, because I think it’s instructive for me about how I think the work is gonna go forward in the future.
    So you know, NVIDIA is a very decentralized company with a lot of volunteers. You know, everybody that works at NVIDIA is a volunteer. And what do I mean by that? Well, I mean, look, the industry is moving quick.
    You know, people can always move from one job to the next. So the way that we think about the work that we do is like, it’s very decentralized, it’s very much let smart people figure out what they should be doing and then kind of self-organize. Now one of the challenges of self-organization in a field that’s moving quickly is that sometimes a whole bunch of people decide to-.. do similar kind of overlapping things but aren’t really coordinated. and that’s okay at the beginning because, you know in a place like NVIDIA, it’s just great to have some energy. It, it took us a while, I think, as a company to figure out that Nemotron was better together.
    That rather than having, like, this group has a, has a model and that group has a dataset, and like, you know, then we end up publishing papers that kind of you know don’t really acknowledge each other and aren’t really coordinated. And then, of course along with that, we need to have k times the GPUs, where k is the number of independent efforts. we realized that, you know building AI, you really do need to figure out how to collaborate. the AI efforts that are built from teams of people focused on the overall effort succeeding rather than their own particular piece of the project succeeding, those are the ones that, you know, really change the world. And, you know, of course, NVIDIA works that way for the systems that we build, right? So, like, the people working on the memory controller on the GPU know that they also have to work with the people working on the SM that does the math, right?
    Like, you can’t, you can’t make a GPU where it’s just like, “Well, we’ve got an awesome memory controller,” if the math doesn’t work, right? It all has to, has to kinda work together. And so that coordination, I think in the field of AI, it took us a little bit longer to do maybe than you could imagine that it could have. and I think that slowed the progress for Nemotron. so I give a lot of credit to the Nemotron team for realizing over the past, I don’t know, year and a half or so, that it was really time to join up and build one thing and make it awesome, and deeply understand that the success of the Nemotron project was more important than the success of any individual piece of that project. And the reason why I’m telling you all of this is because I think that’s actually true more broadly than just inside NVIDIA, and I think it’s, it’s difficult. you know, researchers like those of us with PhDs, for example, we are taught how to be independent, you know, and how to, how to build up our Google Scholar profile, and there’s, like, an incentive to go ahead and focus on that.
    And a lot of successful academics and people researchers you know, they manage to push that pretty far and get some pretty amazing results. But, you know, I do believe that in 2020- in the 2020s you know, that the best research is done as part of a larger team. so how do we figure out how to work together? You know, how do we figure out how to put the success of the team first? That is a thing that is challenging to do but if we can achieve it, I think yield significant results.
    And, you know, to the extent that we made progress in that part of the organization, I think we also saw progress in the technology. and that’s.. That gives me great hope for 2026 for Nemotron because the way the team is working together, I think is you know, pretty extraordinary. There’s just an enormous number of brilliant people that have decided that they’re gonna volunteer to make Nemotron awesome, and we’re, we’re starting to see some pretty great things come together.
    00:22:25 Nathan Lambert: I agree with everything you said. Do you have any advice for making the orgs come together? I think we’ve seen big-- Wait, I’ve seen two class-- there’s two classes of AI companies right now. One is startup, does everything, and you have a model in six months, but you’re building from zero, and you have-- you p-- everybody agrees when they start that they do this. And then you have Google’s famous long-winded reorgs, which they actually eventually got right. Like, they got it very right with what’s going on with Gemini and Google DeepMind-.. right now. And it’s like, do you have any advice on doing this? I think, like, I’m, at Ai2, also advocating for this, but it’s very hard. I think personally-
    00:22:58 Bryan Catanzaro: It’s-
    00:22:58 Nathan Lambert: .. it’s like, I mean, I’m, I’m a special case ‘cause I’m also visible, where it’s e-- very easy for me to turn internet activity into, like, reputation points because of algorithms and size. But it’s very hard to do bottom-up technical work and get all of this and get all the culture alignment. So do you have any advice on actually, like, what works in this domain?
    00:23:20 Bryan Catanzaro: You know what’s worked for us is invitation and not control. so you know, one way that, like, for a while I kinda wanted to try to implement was, like, nobody gets to publish any papers in AI unless they’re clearly part of Nemotron. So this is kind of a top-down, like, we’re gonna make you do it, right? I came to the realization that which we never implemented this, by the way, but I came to realization that this was a bad idea because it would just breed resentment, and, you know, NVIDIA is a company of volunteers. Everybody here is a volunteer.
    So what we need to do is create the conditions by which it makes sense for people to volunteer to be part of Nemotron. And so the way that we went about doing that first of all it involved like, some top-level agreements between me and some of the other leaders of Nemotron, for example, Jonathan Cohen and Kari Briski. I work very closely with the two of them. And you know, that hadn’t always been the case.
    We had all kind of come to this place independently, but we realized that Nemotron is better together, all three of us, and then we started telling our teams: “We really think Nemotron is gonna be better together.” That top-down alignment, I think, was really helpful. Again, we weren’t telling people exactly what to do, but we were sending a constant message: “Nemotron’s better together.” And then we built some structures that facilitated collaboration. In the past, decisions in the Nemotron project tended to be made in kind of an opaque way, and the reason for that is just that it’s hard to tell everybody about the middle of the sausage-making process. It’s messy and difficult, and so, you know, it’s natural.
    Like, researchers, we’re used to doing this, right? It’s a fait accompli: “Here’s my ICML paper.” The fact that you spent two years failing at that task before you finally succeeded, and then you tied a bow around it and gave it to the ICML committee, you don’t really talk about that, right? So it’s difficult for researchers to be open about the middle of the process of research.
    There’s a lot of failure, and it’s hard for people to feel like they’re not looking amazing. But what we decided to do is structure the project into about twenty different areas. Each of them has a clear leader, what we call a pilot in command.
    The job of the pilot in command is to land the airplane. You just want the airplane to land, okay? If you’re landing an airplane, there might be multiple pilots on board, but only one of them is gonna land the airplane at any time, right? Because it would be chaos if two of them tried to land at the same time; people would die.
    So this is not a committee structure; it is a delineated responsibility structure. The purpose of the pilot in command for each of these sections is to gather together all the best ideas, help the group of people interested in working on that space come up with data-driven answers to what we should do and what technical decisions we should make, and then document that in a way that other people can review.
    The thing that’s been really great about that is that it is inviting to people. When they see the group of volunteers working on an area of Nemotron and they want to contribute, it’s much clearer how they could go about doing that, and it’s also clearer what the group needs, because these meetings are being held in the open. We actually have a website where all of the ideas are submitted. They each get a unique identifier, and then they get engaged with: the PIC is trying to understand what the implications are, what kinds of experiments need to be run in order to prove or disprove the idea, and how we do what I call integration studies.
    Integration studies are so key for bringing researchers together, and they’re so opposite of what we are taught when we’re learning how to do ablations as a graduate student. Rather than isolating the particular contribution of one idea, integration studies are about putting a hundred ideas together and seeing if they’re better than what we had before. Doing that in a structured way, and in an open way internally, has made it possible for more people to volunteer, and that has generally raised the rigor of the experiments and also, I think, the outcome of the work.
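    Catanzaro’s ablation-versus-integration distinction can be sketched in a few lines of code. This is purely illustrative: the idea names, effect sizes, and the stand-in `evaluate` function are all invented, not anything from the actual Nemotron pipeline.

    ```python
    import random

    random.seed(0)

    def evaluate(recipe):
        """Stand-in for a real training-and-eval run: returns a benchmark score.
        Here each enabled idea contributes a small, noisy gain."""
        base = 50.0
        return base + sum(gain for _, gain in recipe) + random.gauss(0, 0.1)

    # Hypothetical candidate ideas with (in practice unknown) effect sizes.
    ideas = [("data-mix-v2", 1.2), ("lr-schedule-b", 0.4), ("new-tokenizer", -0.3)]

    # Ablation: isolate one idea's contribution by removing it from the full recipe.
    full_score = evaluate(ideas)
    for i, (name, _) in enumerate(ideas):
        ablated = ideas[:i] + ideas[i + 1:]
        print(f"ablation of {name}: delta = {full_score - evaluate(ablated):+.2f}")

    # Integration study: compare the whole candidate bundle against the baseline.
    baseline = evaluate([])  # the current production recipe, with no new ideas
    print("integrate the bundle?", full_score > baseline)
    ```

    The point of the sketch is the last two lines: an integration study asks only whether the hundred-idea bundle beats what is currently shipping, regardless of which individual ideas carried it.
    
    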
    00:28:15 Nathan Lambert: Yeah, this is great. I think that over the last few years, there’s been more consensus on things that work for research. We also do integration tests very regularly, of: is this feature gonna land for the model?
    It’s a nice mirror to ablations, and research is changing so much. There’s a lot of turmoil in the academic research community, and it’s nice to have tangible practices that are a little bit different when you’re doing these large-scale projects. You still need to do ablations, but then an idea needs to survive an additional test in order to land in the model.
    So it’s an additional type of work that needs to be done, and I just like to have words to describe what is actually happening. On the Nemotron-3 Nano front, I do a lot of analysis just looking at basic adoption metrics, and we created what we call a relative adoption metric, which is essentially looking at downloads over time for models. It’s easy to know which models released a while ago have a ton of downloads; the point is to look at the trajectory of downloads changing over time. This is kind of an aside, but Nemotron Nano 3 was, in the thirty-B size range, on track to be one of the top ten models downloaded of all time.
    The point that I bring this up, other than to just flatter you, is: do you think last-mile adoption takes a substantial amount of work beyond making a very functional model? Do you need to change the recipe, put a lot of focus on evaluation, and adjust over time so that you actually get people to really use the model, rather than just, “Oh, the benchmarks are good, look at NVIDIA flying high”?
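    The “relative adoption” idea Lambert describes, looking at the trajectory of downloads rather than lifetime totals, can be sketched roughly like this. All model names and numbers are made up, and the real metric may well be computed differently:

    ```python
    # Compare download trajectories over time rather than lifetime totals, so
    # older models don't look dominant simply for having been out longer.
    weekly_downloads = {
        "older-model": [900, 850, 800, 760, 700],    # big totals, declining
        "newer-model": [100, 250, 600, 1400, 3000],  # small totals, accelerating
    }

    def relative_adoption(series):
        """Average week-over-week growth ratio: >1 means adoption is accelerating."""
        ratios = [later / earlier
                  for earlier, later in zip(series, series[1:]) if earlier > 0]
        return sum(ratios) / len(ratios)

    for name, series in weekly_downloads.items():
        print(f"{name}: total={sum(series)}, "
              f"relative adoption={relative_adoption(series):.2f}")
    ```

    Under this toy definition the older model has the larger cumulative total but a ratio below one, while the newer model’s ratio above one flags the trajectory Lambert is pointing at.
    
    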
    00:30:03 Bryan Catanzaro: Right. Yeah, wow, it has taken the whole company coming together to make Nano V3 have more of an impact than the models we released before, and there are so many different aspects to that. Obviously there are a lot of technical aspects where, frankly, I think we have more work to do: making sure that on day zero, when we release something, all the best quantizations are out there, that the speed on all of the important inference frameworks is there, that it runs flawlessly on all of the edge devices we care about, that the install experience is great. This kind of work is extraordinarily important because it’s a crowded world.
    There are so many different things that people could choose to work with, and any amount of friction that gets in the way of people even evaluating something you do is gonna blunt the results, no matter how good that technology is. I don’t think we’re amazing at this yet, so it’s something I anticipate we’re gonna see a lot more investment in as more people at NVIDIA from all over the company, from marketing, from developer relations, from software engineering, come together in support of this effort. So yeah, it does take an enormous amount of work.
    Something that I’m particularly interested in is how we engage with the community in a new way to make future Nemotron models even stronger. If the only things we were to optimize for with a Nemotron model were the kind of academic benchmarks that are highly cited, it’s likely the model wouldn’t be general enough to really be useful. What we’re trying to build is a technology that other people can extend and deploy, and that means we need other ways of understanding the strength of a model besides a handful of academic benchmarks.
    I think we have a lot of room to grow here. I’m hoping over time that we develop the muscle of being able to engage with the community and learn from them. Like: “Okay, this particular thing I tried to do with Nemotron didn’t work. It did this other thing I wasn’t expecting; it was wrong.” Well, that can become feedback that is used to make the next version better.
    I think we’ve got a lot of work to do in that regard.
    00:33:10 Nathan Lambert: Do you think there’s any magic to it? I’m blown away by how successful OpenAI’s two open-source models are. Yes, they’re obviously the number one name brand in AI, but on the same metric where I see you guys overperforming what I would expect, where I’m like, “Wow, great job, NVIDIA,” they’re totally off the charts, on track to beat Llama’s most-downloaded numbers ever with these two GPT OSS models.
    And even on release, they had hiccups where people were pretty negative on it. But for whatever reason, people figured it out, and it just clicked, and then the company said so little about it. Meta put so much effort into Llama being adopted, and you obviously are putting a lot of effort into this.
    So I’m just wondering: did OpenAI crack the code, or is there sometimes a bit of luck?
    00:33:59 Bryan Catanzaro: Well, I don’t think about OpenAI as a lucky company. I think of them as a visionary company that works incredibly hard, and I think their success is well deserved. I love the GPT OSS models; they’re definitely an inspiration for us here at Nemotron. OpenAI also has some other ways of engaging with the community just because of the large number of people that use their services, and that helps them learn what people are trying to do with AI, which they can then address when they’re building models. Obviously, people talk about that as a flywheel, and I think that’s really interesting and really important.
    NVIDIA is never going to have the same kind of flywheel as OpenAI does. We’re not trying to build a service like ChatGPT. What we’re trying to do is help the ecosystem be strong and enduring. We think it’s important for there to be this openly developed AI ecosystem, and we’re also trying to build our next generation of systems, so we have our own reasons for doing this. But we’re never going to have the same exact user base or flywheel that OpenAI does.
    On the other hand, we are able to work with institutions around the world in our own way, which I think offers us different opportunities and hopefully helps us make things that are useful, too.
    00:35:38 Nathan Lambert: Yeah, this makes me realize I’m having a lot of conversations on this. There are many open model efforts, especially among people that are fully open, and the question is how we better coordinate, especially at the smaller scale, like AI2 and Hugging Face. They’re not big teams.
    How do we make sure we’re not doing the exact same data project at the same time? And I wonder if there are opportunities for open companies here; LM Arena, for example, has historically released a lot of user data to help close this what-are-people-using-models-for flywheel. But it’s very hard to build cross-organizational model-improvement pipelines. I think model improvement becomes pretty vertical: somebody at NVIDIA gets the feedback, and the model gets better.
    So that’s something I would like to see this year, but I don’t have ideas for doing it well.
    00:36:28 Bryan Catanzaro: Yeah. At NVIDIA, we have a tradition of working really closely with organizations that use our technology. We have teams of engineers whose job is to enable success for our customers. In fact, there are sometimes more people at NVIDIA that care about the success of people outside of NVIDIA than people that care about the success of things inside NVIDIA. So sometimes I’m like: “Hey, could we use a little bit of that energy to support Nemotron?” And the answer is yes, and NVIDIA is doing that. But I think as Nemotron matures, we’re gonna find that the organizations that work with NVIDIA to make Nemotron awesome for their business, for their use case, are gonna have a say in how Nemotron evolves, and hopefully that helps Nemotron address their needs.
    00:37:29 Nathan Lambert: .. Yeah, a basic question: how many employees does it take to build all the different versions of Nemotron? I haven’t brought this up, but you also have other great types of models. I think our open model analyst, Florian, is obsessed with the Parakeet model, because he’s much faster at speaking than typing.
    So there’s a lot of other-- I don’t have the full list of other NVIDIA models off the top of my head, but you are releasing a lot of varieties of models. So there’s more context to my original question: I think about language models because I expect AI’s progress to continue to go very fast, so I focus on them as the engine. But how many people are putting this kind of movement into place?
    00:38:16 Bryan Catanzaro: Yeah. Well, it’s hard to know exactly, and as I said, NVIDIA is a company of volunteers. Also, these days, things are changing. The Parakeet team, which is an excellent team, by the way, I would say a year ago wouldn’t have really considered themselves so much part of the core Nemotron effort, but these days they absolutely are, for the obvious reason that LLMs these days need to be able to consume all sorts of data, right?
    Including audio data. And so as the characteristics and capabilities of Nemotron models expand, obviously the number of people contributing is gonna expand. I’d say right now there are about five hundred people working pretty much full-time on Nemotron technologies in different ways. This is everything from numerics and quantization recipes to speech recognition, image understanding, pre-training, post-training, RL systems, inference software. There’s a whole bunch of different dimensions, right?
    So I’d say it’s about five hundred people. But we’re also having our Nemotron all-hands meeting this week, and I took a look to see how many people were invited to that all-hands meeting, and it was about two thousand. Those are people around the company that are interested in working with Nemotron and either expanding its capabilities or helping its adoption. So I think the number is somewhere in between, and it’s hopefully gonna keep growing as Nemotron matures.
    00:40:07 Nathan Lambert: Yeah, that’s one of the greatest attestations to what you’re saying: if the interest inside the company is four times as big as the group doing the work, you’re gonna keep scaling up, it seems. People are gonna find ways to help. One of the other things I’m interested in: five hundred sounds like a lot of people, but with how many things you have going on, it also seems very few. I’m transitioning to thinking about the long-standing open-source software that you’ve had, NeMo and, I think, Megatron. They’ve been around for a long time; Megatron has gone through many eras. I have a note here.
    This software has been around since, like, twenty nineteen in some form. And it-
    00:40:51 Bryan Catanzaro: Publicly. We had our first public release in twenty nineteen, but we started earlier.
    00:40:56 Nathan Lambert: Something I’ve found: when I started doing language models at Hugging Face (I was a late bloomer, and we’ll transition to some career talk in a few minutes), Megatron had a bad rap of being very hard to use. But now, three years later, I hear from anyone that’s founding a new language modeling startup: “Just use Megatron.” Do you pick up on things like this? Is it just, like, random-
    00:41:22 Bryan Catanzaro: Well, we-
    00:41:22 Nathan Lambert: .. but it’s like-
    00:41:22 Bryan Catanzaro: We work hard on it. We’re trying really hard to make Megatron easier to use. It’s difficult; Megatron is a complicated piece of technology. When we originally started Megatron, the point was to show the community that you could make state-of-the-art large transformer language models with NVIDIA.
    I don’t know if you recall, but there were assertions by some other companies back in twenty seventeen, when the transformer was invented, that they could only be made without NVIDIA. In fact, there were statements to that effect on official blog posts, which I think got redacted later on. But it was important for NVIDIA to show up and say, “We love language models. We love transformers. Let’s see what we could do. If we partitioned the work properly across lots of GPUs with an amazing interconnect, what kinds of models could we train?” That’s where the Megatron project started.
    I actually came up with the name Megatron, one of my proudest moments, I suppose. I was thinking: this is a really big transformer. What’s the biggest and baddest transformer? Oh, it’s Megatron.
    So that’s where the name came from. But if you think about it, that had nothing to do with usability, right? I wasn’t thinking about how to make a platform that’s really easy for other people to use. I was just trying to show the world that NVIDIA systems could be awesome for transformers. That was my goal.
    Over the years, it has evolved. We have a lot more people trying to use Megatron. We got a lot of complaints about how hard it was to use, and then we did a lot of work to improve the software engineering around Megatron. These days, Megatron software engineering is actually shared between about four different teams at NVIDIA, and we have to coordinate that work very closely.
    That has also not been easy. There have been times when people wanted to fork Megatron, and then there were times when we had to bring it back together: look, I know forking things is always tempting, but better together. It’s better for all of us to keep working together. And so I feel like Megatron, and especially Megatron Core, which is a subset of Megatron that’s especially protected and gets more of our software engineering attention, has gotten dramatically better since we started paying more attention to it as a company. Are we done yet? No, there’s a lot, a lot more work.
    00:43:52 Nathan Lambert: A basic question: Megatron, or Megatron Core, is what Nemotron is trained on, right? And it’s also something that many of the hottest AI startups are training their models on. I would guess that there’s nothing else that does that. So could you summarize why it’s so hard?
    00:44:11 Bryan Catanzaro: Well, there are a lot of other great frameworks out there; Megatron’s not the only one, and we’re happy about that. NVIDIA doesn’t need to control the space. What we do wanna do is make sure we’re putting our products forward in the best light, and it’s a challenging problem.
    We’ve got so many things going on with precision and networking; those questions, the software, are so complicated. These days we’re pre-training our Nemotron-3 Super and Ultra models using FP4, which is a thing that hasn’t been done publicly, anyway, and something we’re pretty excited about because our GPUs have really awesome FP4 throughput. But obviously, the numerical challenge of trying to train a state-of-the-art language model using four bits is non-trivial. All of that work has to go into Megatron and into Transformer Engine, which is another open-source project that Megatron relies on. Coordinating all of that, making sure that we can actually deliver the benefits of NVIDIA systems to people trying to make state-of-the-art models, is really important to us.
    And of the five hundred or so people working on Nemotron, a pretty good fraction of them are working on these kinds of systems issues, right? Because NVIDIA, at its core, is a systems company, and Nemotron’s first job really is about systems, so we care deeply about that.
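    For a rough sense of why four-bit training is numerically hard, here is a toy sketch of FP4 (E2M1) fake-quantization with a per-block scale. The actual Megatron and Transformer Engine recipes are far more involved (scaling granularity, rounding modes, which tensors stay in higher precision); this only shows how coarse an eight-magnitude grid is:

    ```python
    # E2M1's representable non-negative magnitudes are a fixed eight-value grid.
    FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

    def quantize_fp4(values):
        """Scale a block so its max magnitude maps to 6.0, snap each value to
        the nearest FP4 grid point, then rescale. Returns the dequantized
        approximation, which is all the next layer ever sees."""
        amax = max(abs(v) for v in values) or 1.0
        scale = amax / 6.0
        out = []
        for v in values:
            mag = min(FP4_GRID, key=lambda g: abs(abs(v) / scale - g))
            out.append(mag * scale * (1 if v >= 0 else -1))
        return out

    weights = [0.013, -0.402, 0.255, 0.871, -0.049]
    print(quantize_fp4(weights))  # only eight magnitude levels per block
    ```

    Every value in a block gets rounded onto those eight magnitudes, so small gradient signals can vanish entirely unless the scaling and accumulation strategy around the quantizer is very carefully designed.
    
    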
    00:45:51 Nathan Lambert: Yeah. From my perspective, I was at Hugging Face before AI2, and Hugging Face is, like, the best company at doing public work. Then I switched to AI2, where we’re focused the most on the output artifact. It’s such a different type of work, going from building a tool that’s good for training models to building a tool that’s good for everybody else and whatever the heck their use case is.
    00:46:13 Bryan Catanzaro: It’s different.
    00:46:13 Nathan Lambert: So I think-
    00:46:13 Bryan Catanzaro: Yeah. Different work.
    00:46:14 Nathan Lambert: To do both is like.. I’m happy that AI2’s repos aren’t that popular in terms-
    00:46:21 Bryan Catanzaro: Oh,
    00:46:21 Nathan Lambert: .. of open-source adoption, because we can’t handle it. We just can’t. It’s so hard, because it ends up being researchers that are supporting it, and we don’t have the ability to scale the organizational structure. So that’s a very fun turnaround for me, to think of all these things happening at once.
    00:46:39 Bryan Catanzaro: Yeah. Well, thanks for noticing we’re putting effort in. I would say Megatron is still not nearly as user-friendly as the Hugging Face libraries. Hugging Face libraries are legendary, and I admire the work they’ve done to make the community so productive. People are able to get so much research done thanks to the work Hugging Face has put into their libraries. So my hat’s off to them as well.
    00:47:06 Nathan Lambert: Yeah. One of my hot takes, you don’t have to reply, is that Hugging Face and NVIDIA have been very good partners.
    00:47:10 Bryan Catanzaro: Oh, absolutely.
    00:47:10 Nathan Lambert: And it’s like bringing that Hugging Face culture to the NVIDIA stuff would be so good. It’s just so hard, so I don’t know how that would work, but-
    00:47:17 Bryan Catanzaro: We’re trying, you know, and it is challenging. NVIDIA is always a company that’s gonna prioritize speed, like hardware speed, above really anything else, because that’s who we are. I am always trying to make the case that developer speed is important too, right? There are different ways of thinking about speed. And it is definitely the case that a lot of NVIDIA’s software is so cumbersome to use that people can’t get the actual hardware speed as fast as it should be, because they just give up.
    They just don’t even figure out how to use it. So I think NVIDIA’s making strides there. I think the company is understanding more deeply how important developer experience is, and I hope we continue to push that, so that the benefits of all the systems technology NVIDIA works so hard on can be more widely used. But at the same time, there is gonna be a tension between those things. It’s not gonna go away, and to a certain extent, I think that’s just life on planet Earth.
    00:48:26 Nathan Lambert: It is. I think you’re doing a good job, and I’m gonna kind of shift gears in this interview. In becoming a person that works in language models again, I’ve seen your name more and more times.
    I was like, “Bryan Catanzaro, where have I seen this?” And then I went and did the research on the Berkeley PhD. It says that in April of 2021 you gave a Berkeley EECS Colloquium titled “Applications of Deep Learning and Graphics, Conversational AI, and Systems Design.” I’m not even gonna posit that I actually went, but that’s definitely where I remember the name from in grad school. We both have backgrounds that aren’t traditionally in AI and ended up working in language models. What have you learned from your path through NVIDIA about what people should be thinking about with AI or open models today?
    This could be career reflections or technical reflections. There are actually a lot of people that come from all over the STEM field to work in AI, so giving it-
    00:49:29 Bryan Catanzaro: Sure
    00:49:29 Nathan Lambert: .. space to think about is-
    00:49:31 Bryan Catanzaro: .. useful, even if it’s just: it was the big problem, and I wanted to go solve it. Well, I think I’ve had a lot of opportunity and a lot of luck in my career. In hindsight, it seems like an extraordinarily lucky thing that I did my first internship at NVIDIA in 2008, building machine learning models on the GPU, and nobody else there was really doing that. And I was like, “Hey, we should have more people doing machine learning on the GPU.
    I think this could be an opportunity.” It took a few years for me to make any headway. NVIDIA didn’t really wanna listen to me; I was a brand-new PhD in the research organization, which is very independent but sometimes struggles to change the way the bigger company thinks about things.
    And yet, I just had this conviction. I was following my heart about what I think is gonna be important, what could really change the world. And that has been, I think, the thread that has taken me through my whole career: I’m constantly trying to refine my beliefs about what matters and then hold to them. I don’t know how helpful it is to say that, but I feel like sometimes people tend to follow whatever the thing is that people are talking about on Twitter.
    And I’ve done a lot of unpopular things during my career because I believed in them, you know? I published my first paper in 2008, at ICML, on training support vector machines on the GPU. The conference was in Helsinki, and at dinner we were all telling each other what we were doing, and I said: Yeah, I wanna help people train bigger models on bigger data sets with GPUs. And I had a couple of people just say, “Well, why are you here at ICML? That just doesn’t really feel like a good thing for us.” In 2008, ICML was mainly about new mathematical frameworks for thinking about data, and maybe, if you trained a model at all, you would train one on your laptop.
    That was the state of machine learning in 2008. So for somebody to come in and say, “I think I want to focus on parallel computing, new kinds of hardware for machine learning, programming frameworks for machine learning, so that more people can try inventing new models on complicated machines with a lot more compute throughput on bigger data sets,” that was an unpopular thing. At least it felt very unpopular. I felt very marginalized at the time by the community.
    But I believed in it, you know? I had this sense of where technology was going. I knew that traditional computing was running out of steam.
    I had done a few internships at Intel, where I was trying to help them make processors that ran at, like, ten gigahertz back in 2001, and it was clear that they were running into a wall. And I was thinking: Okay, if the compute hardware is gonna have to be different, more restricted, not so general-purpose in order to get speed, what kinds of applications are gonna have an infinite need for more computing?
    And I thought: well, machine learning and AI, that could really change the world if it ever actually worked. But back then, it kinda worked inside of Google; outside of Google, it kind of didn’t. So I had these signals that it was possible, but it was hard. It was a little weird. It was a little niche.
    I was a little bit caught in between different fields: the systems people didn’t think I was systems enough, and the machine learning people didn’t think I was machine learning enough. But I believed in what I was doing, and I found a way to keep following that belief. Ultimately it was very rewarding when all of a sudden NVIDIA decided, “Hey, deep learning is changing the world. What do we know about deep learning?” And then it was: Oh, well, Bryan’s been doing that for several years, and he’s written some libraries that we could turn into a product.
    Let’s go do that. And so that all happened really quickly after many years of nothing happening, you know? That was obviously an amazing opportunity for me.
    Another thing that was important to me: I left NVIDIA in 2014 to go work at the Silicon Valley AI Lab at Baidu with a group of really talented people, including Andrew Ng and Dario Amodei and Awni Hannun and Adam Coates. This was a really once-in-a-lifetime opportunity for me to learn some things that would have been hard to learn on my own. I felt at the time that although I had this great opportunity to help NVIDIA become an AI company, and I was doing that, and I was succeeding at it back in 2013 and 2014, I also really wanted to learn from a broader community of people applying machine learning and AI to solve really important business problems. Going to work at Baidu gave me that chance. I was there for a couple of years and learned a ton, and I’m very grateful to the team there, especially to Andrew Ng, who encouraged me to join him. And then, you know, I ran into the limits of what I could do in California working for a Chinese company.
    I was thinking about what I should do next, and Jensen asked me to come back and build an applied research lab at NVIDIA in 2016. I wasn’t sure if that was a good idea. I thought NVIDIA had already grown so much, you know.
    In the years from twenty fourteen to twenty sixteen, NVIDIA actually grew a lot. These days you look back at it and think: it was still really tiny. But back then, I was like: I don’t know, maybe NVIDIA’s already tapped out. I don’t know if you recall, but in twenty sixteen there were already, like, ten different companies making GPU competitors, right? The TPU had already been out for a while, and it wasn’t clear that NVIDIA was gonna become as large as it has.
    But I believed in the opportunity. I believed in the people. One of the things I loved about NVIDIA was that it’s a very stable organization. Jensen has been running it since he founded it in nineteen ninety-three. My boss, Jonah Alben, who’s an absolutely extraordinary person, has been here for quite a long time, almost since the very beginning of NVIDIA. And a lot of the leadership at NVIDIA, they love the work.
    Their heart is in the work. Jensen and Jonah and many other leaders at NVIDIA don’t need to be doing this, right? They have earned the right to go sit on a beach and drink mai tais all day, but their heart is in the work, and they work incredibly hard. I feel like if there were an Olympics for email, Jensen would get the gold medal.
    It’s unfathomable to me how much information he’s able to process. It’s a skill he’s built up over a long time running this company, but it’s also a reflection of his commitment to the work. And I felt like I wanted to work at a place with this very stable organization that loves the work and really wants to change the world. Why does Jensen get up in the morning? Because this is his chance to do something meaningful.
    I thought, associating with these people, I could do worse; I thought I could learn from this as well. So I came to NVIDIA, and back then it was really hard to explain to people why I was trying to build an AI lab inside of NVIDIA. At the time, NVIDIA wasn’t doing very much AI, so I had to develop a vision for it and then explain it to people. That’s ended up being a really good idea for me as well.
    The lab, I think, has really helped NVIDIA. Megatron has really shown the industry how valuable NVIDIA systems can be for language modeling, which is awesome. And I’m continuing to push DLSS forward; I’m very excited about making graphics more efficient with AI. These days, fifteen out of every sixteen pixels a gamer sees are rendered by AI models that my team developed, and that makes the GPU ten times more power efficient.
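The fifteen-of-sixteen figure is consistent with simple arithmetic. As a back-of-the-envelope sketch (the specific ratios below are an illustration in the spirit of DLSS super resolution plus multi-frame generation, not numbers stated in the conversation):

```python
# Illustrative sketch of the "15 of 16 pixels" claim.
# Assumed ratios (not from the conversation):
#   - super resolution: the GPU renders 1/4 of each frame's pixels
#     (half resolution in each dimension); AI upscales the rest
#   - frame generation: 1 of every 4 displayed frames is rendered;
#     the other 3 are AI-generated

rendered_pixel_fraction = 1 / 4   # per rendered frame
rendered_frame_fraction = 1 / 4   # of displayed frames

rendered = rendered_pixel_fraction * rendered_frame_fraction
ai_generated = 1 - rendered

print(rendered)       # 0.0625 -> 1 pixel in 16 is rendered
print(ai_generated)   # 0.9375 -> 15 of 16 pixels come from AI models
```

Under those assumptions, only one displayed pixel in sixteen is conventionally rendered, matching the ratio Bryan cites.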
    This is a really exciting thing for me to be involved with, something I’ve dreamed about for years. So that’s the kind of thing that continues to push me forward: I have strong beliefs about what I think is possible and where I think technology’s going, and I’m willing to do things that are weird and unpopular, basically following my convictions. And I’m very much always thinking about the people I’m working with, the tribe. I think tribes matter enormously.
    Back when I was a grad student, I was working on programming models for machine learning, and I joined the Python tribe. There were other people in the Scala tribe, and the work of the people trying to make programming models for machine learning in the Scala tribe around 2010, although a lot of it was technically excellent, didn’t matter to the community as much as the work of the people in the Python tribe. It kind of sucks sometimes that the world is tribal like this, but it’s just the case.
    The people you work with, the community you work with, have a big impact on the problems you think about and then on the impact your work has. So I think a lot about the people and the tribes I’m collaborating with or that I’m part of. That’s kind of been the thread that has carried me through my career.
    00:59:56 Nathan Lambert: Yeah, thanks for sharing this full arc. You’ve said things that I tell people, but in different language. The first one, about the early days: there can be space in between fields, where two fields each have their own way of describing things, but both are probably incomplete, and there can be space there. That’s a lot of what I was doing transitioning from novel robots to model-based RL, where I didn’t sit in the actual AI lab, but I started doing AI with my electrical engineering friends. And the second thing, which I’d wholeheartedly recommend to people, is to choose your work based on the people, and on people who are sincerely in it for what they want to do, and a lot of-
    01:00:41 Bryan Catanzaro: And follow your beliefs. Think about it: what do you believe in? It’s okay to change your mind, but figure out what it is that you believe in.
    Ask yourself every day: Do I still believe in that? If I do, what next? You know. If I don’t, well, what do I believe in?
    That’s been really important to me. I think too many people end up just following trends, and that’s not usually helpful, because by the time something is a trend, it’s too late. If you want to change the world, you need to be ahead of the trends. And I don’t think trends in computing are just fashion.
    I think there’s truth that drives those trends. Not always, but often. There’s kind of an inevitable force of gravity. It can just be really hard to parse out the noise, figure out what is the truth that’s going to push the industry forward, and work out how you can push with it.
    You know, if you can join with that, you can accomplish great things.
    01:01:36 Nathan Lambert: Yeah, I agree. In building language models, you want to build the model that the community wants in six months. If you’re building a model to compete with the models that are already out, you’re not going to keep up. Figuring out what the right thing is to build in open language models six months from now, and where you need to try to steer things, is one of the hardest problems I think about. So to close: any predictions for where you see open models going? If we come back at the end of 2026, is there anything you think will be far more obvious than it is today, or any bets you want to make? I think that’s a good place to wrap.
    01:02:18 Bryan Catanzaro: Well, predictions are always hard, and I don’t feel like I’m very good at making them. But I do feel like I’m good at identifying what I believe in, and what I believe in right now is that compute remains one of the fundamental challenges behind AI. It has been that way for a very long time, and I think it continues to be. As we find new ways to apply compute to AI, we discover new forms of scaling laws that help AI become more useful and therefore more widespread.
    So I’m gonna keep thinking about compute. I continue to believe that the way to think about AI is not just in terms of absolute intelligence, but rather intelligence per second. There’s some sort of normalization in there that relates to how fast a model can think and how fast a model can be trained or post-trained. Models that incorporate this compute-acceleration characteristic, thinking about intelligence per unit time, are going to end up winning, because they end up getting trained on more data, they end up getting post-trained with more cycles, and they end up with more iterations during thinking when they’re deployed. And of course, if they happen to fit the hardware really well, whatever hardware that is, that can have a pretty non-trivial effect on the intelligence as well.
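One way to make the intelligence-per-second framing concrete is a toy best-of-n comparison: under a fixed wall-clock budget, a weaker but much faster model can afford more attempts and come out ahead. All of the numbers and the scoring rule below are invented for illustration; they are not from the conversation.

```python
# Toy "intelligence per second" comparison. All numbers are invented.
# Model A: stronger per attempt, but slow. Model B: weaker, but fast.
# With best-of-n sampling inside a fixed time budget, the faster model
# fits in more attempts, so its chance of at least one success can win.

def p_success_in_budget(p_single: float, seconds_per_attempt: float,
                        budget_s: float) -> float:
    """Probability of at least one success among the attempts that fit."""
    n = int(budget_s // seconds_per_attempt)  # attempts that fit in budget
    return 1 - (1 - p_single) ** n

budget = 60.0  # one minute of wall-clock "thinking"

slow_strong = p_success_in_budget(0.50, 30.0, budget)  # 2 attempts at 50%
fast_weak = p_success_in_budget(0.20, 3.0, budget)     # 20 attempts at 20%

print(round(slow_strong, 3))  # 0.75
print(round(fast_weak, 3))    # 0.988
```

The faster, per-attempt-weaker model wins under this (invented) normalization, which is the spirit of judging models by intelligence per unit time rather than absolute capability per attempt.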
    So that’s something I really believe in. I also really believe in AI as infrastructure. There are different ways of thinking about AI. Some people believe AI is more like the singularity: once AGI has been declared, the whole world is different forever and all humans have lost their jobs. There are a lot of things about AI that people believe that I personally don’t.
    I believe, first of all, that intelligence is very multifaceted and not easy to pin down; as soon as we try to pin it down, we find there are many more forms of intelligence that aren’t covered. For example, a model that achieves gold-medal status on the International Math Olympiad is an extraordinary achievement, but it doesn’t put me out of a job, right? I’m not actually solving math problems all day, even though the ability to solve math problems is clearly very useful. It’s also the case that intelligence is kind of like potential energy, not kinetic energy, right?
    In order to transform intelligence into kinetic energy, it needs a platform; it needs to be applied in the proper way. And that is why I believe in open models and openly developed and deployed intelligence. Every company, every organization, has secrets that only they know. They have special data, special ways of thinking about their problems, their customers, and their solutions, and they’re going to know how to apply AI better than anyone else.
    And so AI as infrastructure that transforms companies, turbocharges them, and allows them to take the things they know and multiply their impact: that’s something I believe in more than AI as an event that, one day, when it happens, makes everyone obsolete. I just don’t believe in that. I often joke that if, for example, the CEO were to retire at some point and we needed to find a replacement, handing out an IQ test or asking who has the highest SAT score would not be a very good way of finding one. Intelligence is just far too complex for that. So these are my beliefs; you can disagree with me about anything I just said, and I’m not offended by that.
    I have a lot of friends who do. But I’m asking myself: if I believe that intelligence has these characteristics, and that AI is going to change the world by turbocharging institutions that exist and also creating new applications that we haven’t even dreamed of yet, rather than by replacing all humans, then how do I go about building that? That’s kind of the direction I’m on right now.
    01:07:00 Nathan Lambert: Yeah, I love it. I agree that we’re entering an interesting era where open models are taking so many different shapes and sizes, with so many different strengths and trade-offs, that there can start to be interesting interplay as an ecosystem, with so many different things going on. And I like your idea of potential energy: you have to build things without really knowing what the goal is. You have to build up the energy, in a way, and just try to build these good models. So I appreciate it, and-
    01:07:30 Bryan Catanzaro: Yeah, and then let people apply it. Let it-- let them make the kinetic energy happen.
    01:07:35 Nathan Lambert: I agree. Thanks for coming on.
    01:07:37 Bryan Catanzaro: Thanks so much for inviting me. It’s been a great conversation.


    This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe
