
Data Engineering Podcast

Tobias Macey
Latest episode

507 episodes

  • Data Engineering Podcast

    Beyond the PDF: Rowan Cockett on Reproducible, Composable Science

    2026/03/22 | 42 mins.
    Summary
    In this episode Rowan Cockett, co-founder and CEO of CurveNote and co-founder of the Continuous Science Foundation, talks about building data systems that make scientific research reproducible, reusable, and easier to communicate. He digs into the sociotechnical roots of the reproducibility crisis - from data integrity and access to entrenched publishing incentives and PDF-bound workflows. He explores open standards and tools like Jupyter, Jupyter Book, and the push toward cloud-optimized formats (e.g., Zarr), along with graceful degradation strategies that keep interactive research usable over time. Rowan details how CurveNote enables interactive, reproducible articles that spin up compute on demand while delegating large dataset storage to specialized partners, and how community efforts like the Continuous Science Foundation and initiatives with Creative Commons aim to fix credit, licensing, and attribution. He also discusses the Open Exchange Architecture (OXA) initiative to establish a modular, computational standard for sharing science, the momentum in computational biosciences and neuroscience, and why true progress hinges on interoperability and composability across data, code, and narrative.
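The cloud-optimized formats mentioned in the summary (e.g., Zarr) share one core idea: instead of a monolithic file, an array is split into independently addressable chunks so a reader fetches only the region it needs. The sketch below illustrates that access pattern with a plain dict standing in for an object store; the names (`write_chunks`, `read_region`) and chunk size are illustrative assumptions, not the Zarr API.

```python
# Illustrative sketch of chunked, range-readable storage, the idea behind
# cloud-optimized formats like Zarr. A dict stands in for an object store;
# in a real system each chunk fetch would be one HTTP range request.

CHUNK = 4  # elements per chunk (illustrative)

def write_chunks(values):
    """Split a flat list into a dict of chunk-key -> chunk."""
    return {i // CHUNK: values[i:i + CHUNK] for i in range(0, len(values), CHUNK)}

def read_region(store, start, stop):
    """Read a slice by fetching only the chunks overlapping [start, stop)."""
    out = []
    for key in range(start // CHUNK, (stop - 1) // CHUNK + 1):
        chunk = store[key]  # only the needed chunks are touched
        base = key * CHUNK
        out.extend(chunk[max(0, start - base):stop - base])
    return out

store = write_chunks(list(range(20)))
print(read_region(store, 6, 10))  # touches chunks 1 and 2 only -> [6, 7, 8, 9]
```

The same pattern is what lets a browser-based reproducible article stream a small window of a large dataset instead of downloading the whole file first.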
    Announcements
    Hello and welcome to the Data Engineering Podcast, the show about modern data management
    If you lead a data team, you know this pain: Every department needs dashboards, reports, custom views, and they all come to you. So you're either the bottleneck slowing everyone down, or you're spending all your time building one-off tools instead of doing actual data work. Retool gives you a way to break that cycle. Their platform lets people build custom apps on your company data—while keeping it all secure. Type a prompt like 'Build me a self-service reporting tool that lets teams query customer metrics from Databricks'—and they get a production-ready app with the permissions and governance built in. They can self-serve, and you get your time back. It's data democratization without the chaos. Check out Retool at dataengineeringpodcast.com/retool today and see how other data teams are scaling self-service. Because let's be honest—we all need to Retool how we handle data requests.
    Your host is Tobias Macey and today I'm interviewing Rowan Cockett about building data systems that make scientific research easier to reproduce

    Interview

    Introduction
    How did you get involved in the area of data management?
    Can you describe what your interest is in reproducibility of scientific research?
    What role does data play in the set of challenges that plague reproducibility of published research?
    What are some of the notable changes in scientific processes and data systems that have contributed to the current reproducibility crisis?
    Beyond technological shortcomings, what are the processes that lead to problematic experiment/research design, and how does that complicate the work of other teams trying to build on the experimental findings?
    How does a monolithic approach change the types of research that would be possible with more modular/composable experimentation and research?
    Focusing now on the data-oriented aspects of research, what are the habits of research teams that lead to friction and waste in storing, processing, publishing, and ultimately consuming the information that supports the research findings?
    What are the elements of the work that you are doing at the Continuous Science Foundation and Curvenote to break the status quo?
    Are there any areas of study that are more susceptible to friction and siloing of their data?
    What does a typical engagement with a research group look like as you try to improve the accessibility of their work?
    What are the most interesting, innovative, or unexpected ways that you have seen research data (re-)used?
    What are the most interesting, unexpected, or challenging lessons that you have learned while working on reproducibility of scientific research?
    What are the next set of challenges that you are focused on addressing in the research/reproducibility space?

    Contact Info

    LinkedIn

    Parting Question

    From your perspective, what is the biggest gap in the tooling or technology for data management today?

    Closing Announcements

    Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
    Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
    If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.

    Links

    Continuous Science Foundation
    Curvenote
    Zenodo
    Dryad
    HDF5
    Iceberg
    Zarr
    MyST Markdown
    Jupyter Notebook
    arXiv
    Journal of Open Source Software (JOSS)
    Data Carpentry
    Software Carpentry
    openRxiv
    bioRxiv
    medRxiv
    FORCE11
    Jupyter Book
    Open Exchange Architecture (OXA)

    The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
  • Data Engineering Podcast

    Beyond Prompts: Practical Paths to Self‑Improving AI

    2026/03/16 | 1h 1 min.
    Summary
    In this episode Raj Shukla, CTO of SymphonyAI, explores what it really takes to build self‑improving AI systems that work in production. Raj unpacks how agentic systems interact with real-world environments, the feedback loops that enable continuous learning, and why intelligent memory layers often provide the most practical middle ground between prompt tweaks and full Reinforcement Learning. He discusses the architecture needed around models - data ingestion, sensors, action layers, sandboxes, RBAC, and agent lifecycle management - to reach enterprise-grade reliability, as well as the policy alignment steps required for regulated domains like financial crime. Raj shares hard-won lessons on tool use evolution (from bespoke tools to filesystem and Unix primitives), dynamic code-writing subagents, model version brittleness, and how organizations can standardize process and entity graphs to accelerate time-to-value. He also dives into pitfalls such as policy gaps and tribal knowledge, strategies for staged rollouts and monitoring, and where small models and cost optimization make sense. Raj closes with a vision for bringing RL-style improvement to enterprises without requiring a research team - letting businesses own the reasoning and memory layers that truly differentiate their AI systems.
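The summary describes intelligent memory layers as a practical middle ground between prompt tweaks and full reinforcement learning: the agent carries forward what it has learned without retraining a model. The minimal sketch below shows one way that split between session and permanent memory could look; the class and method names (`MemoryLayer`, `promote`) are hypothetical, not SymphonyAI's architecture.

```python
# Minimal sketch of a session-vs-permanent memory split for an agent.
# Hypothetical names; not tied to any particular framework.

class MemoryLayer:
    def __init__(self):
        self.session = []    # scratch facts, cleared per conversation
        self.permanent = []  # durable knowledge, survives across sessions

    def remember(self, fact):
        self.session.append(fact)

    def promote(self, predicate):
        """Move session facts matching predicate into long-term memory."""
        keep, move = [], []
        for fact in self.session:
            (move if predicate(fact) else keep).append(fact)
        self.permanent.extend(move)
        self.session = keep

    def end_session(self):
        self.session = []

mem = MemoryLayer()
mem.remember("user prefers CSV exports")
mem.remember("weather is rainy today")
mem.promote(lambda f: "prefers" in f)  # keep stable preferences long-term
mem.end_session()
print(mem.permanent)  # ['user prefers CSV exports']
```

The promotion predicate is where the "self-improving" part lives: in production it would be a learned or policy-governed judgment about which observations deserve durable storage, not a substring match.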

    Announcements
    Hello and welcome to the Data Engineering Podcast, the show about modern data management
    If you lead a data team, you know this pain: Every department needs dashboards, reports, custom views, and they all come to you. So you're either the bottleneck slowing everyone down, or you're spending all your time building one-off tools instead of doing actual data work. Retool gives you a way to break that cycle. Their platform lets people build custom apps on your company data—while keeping it all secure. Type a prompt like 'Build me a self-service reporting tool that lets teams query customer metrics from Databricks'—and they get a production-ready app with the permissions and governance built in. They can self-serve, and you get your time back. It's data democratization without the chaos. Check out Retool at dataengineeringpodcast.com/retool today and see how other data teams are scaling self-service. Because let's be honest—we all need to Retool how we handle data requests.
    Your host is Tobias Macey, and today I’m interviewing Raj Shukla about building self-improving AI systems — and how they enable AI scalability in real production environments.

    Interview

    Introduction
    How did you get involved in AI/ML?
    Can you start by outlining what actually improves over time in a self-improving AI system? How is that different from simply improving a model or an agent?

    How would you differentiate between an agent/agentic system vs. a self-improving system?
    One of the components that is becoming common in agentic architectures is a "memory" layer. What are some of the ways that it contributes to a self-improvement feedback loop? In what ways are memory layers insufficient for a generalized self-improvement capability?

    For engineering and technology leaders, what are the key architectural and operational steps you recommend to build AI that can move from pilots into scalable, production systems?
    One of the perennial challenges for technology leaders is how to build AI systems that scale over time.
    How has AI changed the way you think about long-term advantage?
    How do self-improvement feedback loops contribute to AI scalability in real systems?
    What are some of the other key elements necessary to build a truly evolutionary AI system?
    What are the hidden costs of building these AI systems that teams should know before starting? I'm talking specifically about enterprises that are deploying AI into their internal mission-critical workflows.
    What are the most interesting, innovative, or unexpected ways that you have seen self-improving AI systems implemented?
    What are the most interesting, unexpected, or challenging lessons that you have learned while working on evolutionary AI systems?
    What are some of the ways that you anticipate agentic architectures and frameworks evolving to be more capable of self-improvement?

    Contact Info

    LinkedIn

    Closing Announcements

    Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
    Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
    If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.

    Parting Question

    From your perspective, what are the biggest gaps in tooling, technology, or training for AI systems today?

    Links

    SymphonyAI
    Reinforcement Learning
    Agentic Memory
    In-Context Learning
    Context Engineering
    Few-Shot Learning
    OpenClaw
    Deep Research Agent
    RAG == Retrieval Augmented Generation
    Agentic Search
    Google Gemma Models
    Ollama

    The intro and outro music is from Hitman's Lovesong feat. Paola Graziano by The Freak Fandango Orchestra/CC BY-SA 3.0
  • Data Engineering Podcast

    Orion at Gravity: Trustworthy AI Analysts for the Enterprise

    2026/03/08 | 1h 5 mins.
    Summary
    In this episode of the Data Engineering Podcast, Lucas Thelosen and Drew Gilson, co-founders of Gravity, discuss their vision for agentic analytics in the enterprise, enabled by semantic layers and broader context engineering. They share their journey from Looker and Google to building Orion, an AI analyst that combines data semantics with rich business context to deliver trustworthy and actionable insights. Lucas and Drew explain how Orion uses governed, role-specific "custom agents" to drive analysis, recommendations, and proactive preparation for meetings, while maintaining accuracy, lineage transparency, and human-in-the-loop feedback. The conversation covers evolving views on semantic layers, agent memory, retrieval, and operating across messy data, multiple warehouses, and external context like documents and weather. They emphasize the importance of trust, governance, and the path to AI coworkers that act as reliable colleagues. Lucas and Drew also share field stories from public companies where Orion has surfaced board-level issues, accelerated executive prep with last-minute research, and revealed how BI investments are actually used, highlighting a shift from static dashboards to dynamic, dialog-driven decisions. They stress the need for accessible (non-proprietary) models, managing context and technical debt over time, and focusing on business actions - not just metrics - to unlock real ROI.
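The conversation's central claim is that a semantic layer makes an AI analyst trustworthy by letting it request governed metrics by name instead of improvising SQL. The toy below shows that compilation step; the metric registry, table, and column names are all invented for illustration and are not Orion's implementation.

```python
# Toy semantic layer: governed metric definitions compiled to SQL, so an
# agent asks for "active_customers" rather than writing ad-hoc queries.
# All metric/table/column names here are made up for illustration.

METRICS = {
    "active_customers": {
        "sql": "COUNT(DISTINCT customer_id)",
        "table": "orders",
        "filters": ["status = 'active'"],
    },
}

def compile_metric(name, extra_filters=()):
    """Render a governed metric, with optional caller-supplied filters."""
    m = METRICS[name]
    where = " AND ".join([*m["filters"], *extra_filters]) or "TRUE"
    return f"SELECT {m['sql']} FROM {m['table']} WHERE {where}"

print(compile_metric("active_customers", ["region = 'EU'"]))
# SELECT COUNT(DISTINCT customer_id) FROM orders WHERE status = 'active' AND region = 'EU'
```

Because the definition (and its mandatory filters) lives in one governed place, every agent and dashboard that asks for the metric gets the same answer, which is the lineage-and-trust argument the episode makes.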

    Announcements
    Hello and welcome to the Data Engineering Podcast, the show about modern data management
    If you lead a data team, you know this pain: Every department needs dashboards, reports, custom views, and they all come to you. So you're either the bottleneck slowing everyone down, or you're spending all your time building one-off tools instead of doing actual data work. Retool gives you a way to break that cycle. Their platform lets people build custom apps on your company data—while keeping it all secure. Type a prompt like 'Build me a self-service reporting tool that lets teams query customer metrics from Databricks'—and they get a production-ready app with the permissions and governance built in. They can self-serve, and you get your time back. It's data democratization without the chaos. Check out Retool at dataengineeringpodcast.com/retool today and see how other data teams are scaling self-service. Because let's be honest—we all need to Retool how we handle data requests.
    Your host is Tobias Macey and today I'm interviewing Lucas Thelosen and Drew Gilson about the application of semantic layers to context engineering for agentic analytics

    Interview

    Introduction
    How did you get involved in the area of data management?
    Can you start by digging into the practical elements of what is involved in the creation and maintenance of a "semantic layer"?
    How does the semantic layer relate to and differ from the physical schema of a data warehouse?
    In generative AI and agentic systems the latest term of art is "context engineering". How does a semantic layer factor into the context management for an agentic analyst?
    What are some of the ways that LLMs/agents can help to populate the semantic layer?
    What are the cases where you want to guard against hallucinations by keeping a human in the loop?
    Beyond a physical semantic layer, what are the other elements of context that you rely on for guiding the activities of your agents?
    What are some utilities that you have found helpful for bootstrapping the structural guidelines for an existing warehouse environment?
    What are the most interesting, innovative, or unexpected ways that you have seen Orion used?
    What are the most interesting, unexpected, or challenging lessons that you have learned while working on Orion?
    When is Orion the wrong choice?
    What do you have planned for the future of Orion?

    Contact Info

    Lucas
    LinkedIn

    Drew
    LinkedIn

    Parting Question

    From your perspective, what is the biggest gap in the tooling or technology for data management today?

    Closing Announcements

    Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
    Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
    If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.

    Links

    Gravity
    Orion
    Looker
    Semantic Layer
    dbt
    LookML
    Tableau
    OpenClaw
    Pareto Distribution

    The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
  • Data Engineering Podcast

    From Models to Momentum: Uniting Architects and Engineers with ER/Studio

    2026/03/02 | 45 mins.
    Summary
    In this episode of the Data Engineering Podcast, Jamie Knowles (Product Director) and Ryan Hirsch (Product Marketing Manager) discuss the importance of enterprise data modeling with ER/Studio. They highlight how clear, shared semantic models are a foundational discipline for modern data engineering, preventing semantic drift, speeding up delivery, and reducing rework. Jamie explains that ER/Studio helps teams define logical models that translate into physical designs and code across warehouses and analytics platforms, while maintaining traceability and governance. The conversation also touches on how AI increases the tolerance for ambiguity, but doesn't fix unclear definitions - it amplifies them. Jamie and Ryan describe ER/Studio's integrations with governance tools, collaboration features like TeamServer, reverse engineering, and metadata bridges, as well as new AI-assisted modeling capabilities. They emphasize that most data problems are meaning problems, and investing in architecture and a semantic backbone can make engineering faster, governance simpler, and analytics more reliable.
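The logical-to-physical translation Jamie describes (a shared semantic model generating physical designs and code) can be sketched as a simple type-mapped DDL generator. The entity definition and type mapping below are invented for illustration; ER/Studio's own model formats and target-platform mappings are far richer.

```python
# Illustrative logical-to-physical translation: a logical entity with
# platform-neutral attribute kinds is rendered as physical DDL.
# The kind->type mapping is a made-up example, not ER/Studio's.

LOGICAL_TO_SQL = {
    "identifier": "BIGINT",
    "text": "VARCHAR(255)",
    "money": "NUMERIC(12,2)",
}

def to_ddl(entity, attributes):
    """Render a logical entity as CREATE TABLE DDL."""
    cols = ",\n  ".join(f"{name} {LOGICAL_TO_SQL[kind]}" for name, kind in attributes)
    return f"CREATE TABLE {entity} (\n  {cols}\n);"

print(to_ddl("customer", [("customer_id", "identifier"),
                          ("name", "text"),
                          ("lifetime_value", "money")]))
```

The point of keeping the logical layer authoritative is that swapping the mapping table retargets the same model to a different warehouse, which is what prevents the "semantic drift" the episode warns about.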

    Announcements
    Hello and welcome to the Data Engineering Podcast, the show about modern data management
    If you lead a data team, you know this pain: Every department needs dashboards, reports, custom views, and they all come to you. So you're either the bottleneck slowing everyone down, or you're spending all your time building one-off tools instead of doing actual data work. Retool gives you a way to break that cycle. Their platform lets people build custom apps on your company data—while keeping it all secure. Type a prompt like 'Build me a self-service reporting tool that lets teams query customer metrics from Databricks'—and they get a production-ready app with the permissions and governance built in. They can self-serve, and you get your time back. It's data democratization without the chaos. Check out Retool at dataengineeringpodcast.com/retool today and see how other data teams are scaling self-service. Because let's be honest—we all need to Retool how we handle data requests.
    Your host is Tobias Macey and today I'm interviewing Jamie Knowles and Ryan Hirsch about ER/Studio and the foundational role of enterprise data modeling in modern data engineering.

    Interview

    Introduction
    How did you get involved in the area of data management?
    Can you describe what ER/Studio is and the story behind it?
    How has it evolved to handle the shift from traditional on-prem databases to modern, complex, and highly regulated enterprise environments?
    How do you define "Enterprise Data Architecture" today, and how does it differ from just managing a collection of pipelines in a modern data stack?
    In your view, what are the distinct responsibilities of a Data Architect versus a Data Engineer, and where is the critical overlap where they typically succeed or fail together?
    From what you see in the field, how often are the technical struggles of data engineering teams—like tool sprawl or "broken" pipelines—actually just "data meaning" problems in disguise?
    What is a logical data model, and why do you advocate for framing these as "knowledge models" rather than just technical diagrams?
    What are the long-term consequences, such as "semantic drift" or the erosion of trust, when organizations skip logical modeling to go straight to physical implementation and pipelines?
    What is the intersection of data modeling and data governance?
    What are the elements of integration between ER/Studio and governance platforms that reduce friction and time to delivery?
    For the engineers who worry that architecture and modeling slow down development, how does having a central design authority actually help teams scale and reduce downstream rework?
    What does a typical workflow look like across data architecture and data engineering for individuals and teams who are using ER/Studio as a core part of their modeling?
    What are the most interesting, innovative, or unexpected ways that you have seen ER/Studio used? (Context: specifically regarding grounding AI initiatives or defining enterprise ontologies.)
    What are the most interesting, unexpected, or challenging lessons that you have learned while working on ER/Studio?
    When is ER/Studio the wrong choice for a data team or a specific project?
    What do you have planned for the future of ER/Studio, particularly regarding AI and the "design-time" foundation of the data stack?

    Contact Info

    Jamie
    LinkedIn
    Ryan
    LinkedIn

    Parting Question

    From your perspective, what is the biggest gap in the tooling or technology for data management today?

    Closing Announcements

    Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
    Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
    If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.

    Links

    Idera
    Wherescape
    ER/Studio
    Entity-Relation Diagram (ERD)
    Business Keys
    Medallion Architecture
    RDF == Resource Description Framework
    Collibra
    Martin Fowler
    DB2

    The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
  • Data Engineering Podcast

    From Data Models to Mind Models: Designing AI Memory at Scale

    2026/02/22 | 57 mins.
    Summary
    In this episode of the Data Engineering Podcast, Vasilije "Vas" Markovich, founder of Cognee, discusses building agentic memory, a crucial aspect of artificial intelligence that enables systems to learn, adapt, and retain knowledge over time. He explains the concept of agentic memory, highlighting the importance of distinguishing between permanent and session memory, graph+vector layers, latency trade-offs, and multi-tenant isolation to ensure safe knowledge sharing or protection. The conversation covers practical considerations such as storage choices (Redis, Qdrant, LanceDB, Neo4j), metadata design, temporal relevance and decay, and emerging research areas like trace-based scoring and reinforcement learning for improving retrieval. Vas shares real-world examples of agentic memory in action, including applications in pharma hypothesis discovery, logistics control towers, and cybersecurity feeds, as well as scenarios where simpler approaches may suffice. He also offers guidance on when to add memory, pitfalls to avoid (naive summarization, uncontrolled fine-tuning), human-in-the-loop realities, and Cognee's future plans: revamped session/long-term stores, decision-trace research, and richer time and transformation mechanisms. Additionally, Vas touches on policy guardrails for agent actions and the potential for more efficient "pseudo-languages" for multi-agent collaboration.
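The temporal-relevance-and-decay idea mentioned in the summary can be made concrete with a small scoring function: older memories score lower, so retrieval prefers recent, relevant facts. The exponential half-life form below is an illustrative assumption, not Cognee's actual scoring.

```python
# Toy temporal-decay scoring for memory retrieval: combine a relevance
# score with exponential time decay so stale facts rank lower.
# The half-life form and the example memories are illustrative.
import math

def score(relevance, age_days, half_life_days=30.0):
    """Relevance in [0, 1], halved every half_life_days of age."""
    decay = 0.5 ** (age_days / half_life_days)
    return relevance * decay

memories = [
    ("uses Snowflake for analytics", 0.9, 2),    # (fact, relevance, age in days)
    ("old office address", 0.9, 300),
]
ranked = sorted(memories, key=lambda m: score(m[1], m[2]), reverse=True)
print(ranked[0][0])  # the recent memory wins despite equal relevance
```

In practice the decay rate would itself be tuned per memory type (preferences decay slowly, ephemeral session facts quickly), which is one reason the episode distinguishes permanent from session memory.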

    Announcements
    Hello and welcome to the Data Engineering Podcast, the show about modern data management
    If you lead a data team, you know this pain: Every department needs dashboards, reports, custom views, and they all come to you. So you're either the bottleneck slowing everyone down, or you're spending all your time building one-off tools instead of doing actual data work. Retool gives you a way to break that cycle. Their platform lets people build custom apps on your company data—while keeping it all secure. Type a prompt like 'Build me a self-service reporting tool that lets teams query customer metrics from Databricks'—and they get a production-ready app with the permissions and governance built in. They can self-serve, and you get your time back. It's data democratization without the chaos. Check out Retool at dataengineeringpodcast.com/retool today and see how other data teams are scaling self-service. Because let's be honest—we all need to Retool how we handle data requests.
    Your host is Tobias Macey and today I'm interviewing Vasilije Markovic about agentic memory architectures and applications

    Interview

    Introduction
    How did you get involved in the area of data management?
    Can you start by giving an overview of the different elements of "memory" in an agentic context?
    Storage and retrieval mechanisms
    How to model memories
    How does that change as you go from short-term to long-term?
    Managing scope and retrieval triggers
    What are some of the useful triggers in an agent architecture to identify whether/when/what to create a new memory?
    How do things change as you try to build a shared corpus of memory across agents?
    What are the most interesting, innovative, or unexpected ways that you have seen agentic memory used?
    What are the most interesting, unexpected, or challenging lessons that you have learned while working on Cognee?
    When is a dedicated memory layer the wrong choice?
    What do you have planned for the future of Cognee?

    Contact Info

    LinkedIn

    Parting Question

    From your perspective, what is the biggest gap in the tooling or technology for data management today?

    Closing Announcements

    Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
    Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
    If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.

    Links

    Cognee
    AI Engineering Podcast Episode
    Kimball Memory
    Cognitive Science
    Context Window
    RAG == Retrieval Augmented Generation
    Memory Types
    Redis Vector Store
    Qdrant
    Vector on Edge
    Milvus
    LanceDB
    KuzuDB
    Neo4j
    Mem0
    Zep Graphiti
    A2A (Agent-to-Agent) Protocol
    Snowplow
    Reinforcement Learning
    Model Finetuning
    OpenClaw

    The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA


About Data Engineering Podcast

This show goes behind the scenes for the tools, techniques, and difficulties associated with the discipline of data engineering. Databases, workflows, automation, and data manipulation are just some of the topics that you will find here.
Podcast website

