
Data Engineering Podcast

Tobias Macey
Latest episode

507 episodes

  • Data Engineering Podcast

    Beyond the PDF: Rowan Cockett on Reproducible, Composable Science

    2026/03/22 | 42 mins.
    Summary
    In this episode Rowan Cockett, co-founder and CEO of CurveNote and co-founder of the Continuous Science Foundation, talks about building data systems that make scientific research reproducible, reusable, and easier to communicate. He digs into the sociotechnical roots of the reproducibility crisis - from data integrity and access to entrenched publishing incentives and PDF-bound workflows. He explores open standards and tools like Jupyter, Jupyter Book, and the push toward cloud-optimized formats (e.g., Zarr), along with graceful degradation strategies that keep interactive research usable over time. Rowan details how CurveNote enables interactive, reproducible articles that spin up compute on demand while delegating large dataset storage to specialized partners, and how community efforts like the Continuous Science Foundation and initiatives with Creative Commons aim to fix credit, licensing, and attribution. He also discusses the Open Exchange Architecture (OXA) initiative to establish a modular, computational standard for sharing science, the momentum in computational biosciences and neuroscience, and why true progress hinges on interoperability and composability across data, code, and narrative.
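The cloud-optimized formats mentioned in the summary (e.g., Zarr) share one core idea: instead of a monolithic file, an array is split into independently addressable chunks so a reader fetches only the region it needs. The sketch below illustrates that access pattern with a plain dict standing in for an object store; the names (`write_chunks`, `read_region`) and chunk size are illustrative assumptions, not the Zarr API.

```python
# Illustrative sketch of chunked, range-readable storage, the idea behind
# cloud-optimized formats like Zarr. A dict stands in for an object store;
# in a real system each chunk fetch would be one HTTP range request.

CHUNK = 4  # elements per chunk (illustrative)

def write_chunks(values):
    """Split a flat list into a dict of chunk-key -> chunk."""
    return {i // CHUNK: values[i:i + CHUNK] for i in range(0, len(values), CHUNK)}

def read_region(store, start, stop):
    """Read a slice by fetching only the chunks overlapping [start, stop)."""
    out = []
    for key in range(start // CHUNK, (stop - 1) // CHUNK + 1):
        chunk = store[key]  # only the needed chunks are touched
        base = key * CHUNK
        out.extend(chunk[max(0, start - base):stop - base])
    return out

store = write_chunks(list(range(20)))
print(read_region(store, 6, 10))  # touches chunks 1 and 2 only -> [6, 7, 8, 9]
```

The same pattern is what lets a browser-based reproducible article stream a small window of a large dataset instead of downloading the whole file first.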
    Announcements
    Hello and welcome to the Data Engineering Podcast, the show about modern data management
    If you lead a data team, you know this pain: Every department needs dashboards, reports, custom views, and they all come to you. So you're either the bottleneck slowing everyone down, or you're spending all your time building one-off tools instead of doing actual data work. Retool gives you a way to break that cycle. Their platform lets people build custom apps on your company data—while keeping it all secure. Type a prompt like 'Build me a self-service reporting tool that lets teams query customer metrics from Databricks'—and they get a production-ready app with the permissions and governance built in. They can self-serve, and you get your time back. It's data democratization without the chaos. Check out Retool at dataengineeringpodcast.com/retool today and see how other data teams are scaling self-service. Because let's be honest—we all need to Retool how we handle data requests.
    Your host is Tobias Macey and today I'm interviewing Rowan Cockett about building data systems that make scientific research easier to reproduce

    Interview

    Introduction
    How did you get involved in the area of data management?
    Can you describe what your interest is in reproducibility of scientific research?
    What role does data play in the set of challenges that plague reproducibility of published research?
    What are some of the notable changes in scientific processes and data systems that have contributed to the current reproducibility crisis?
    Beyond technological shortcomings, what are the processes that lead to problematic experiment/research design, and how does that complicate the work of other teams trying to build on the experimental findings?
    How does a monolithic approach change the types of research that would be possible with more modular/composable experimentation and research?
    Focusing now on the data-oriented aspects of research, what are the habits of research teams that lead to friction and waste in storing, processing, publishing, and ultimately consuming the information that supports the research findings?
    What are the elements of the work that you are doing at the Continuous Science Foundation and Curvenote to break the status quo?
    Are there any areas of study that are more susceptible to friction and siloing of their data?
    What does a typical engagement with a research group look like as you try to improve the accessibility of their work?
    What are the most interesting, innovative, or unexpected ways that you have seen research data (re-)used?
    What are the most interesting, unexpected, or challenging lessons that you have learned while working on reproducibility of scientific research?
    What are the next set of challenges that you are focused on addressing in the research/reproducibility space?

    Contact Info

    LinkedIn

    Parting Question

    From your perspective, what is the biggest gap in the tooling or technology for data management today?

    Closing Announcements

    Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
    Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
    If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.

    Links

    Continuous Science Foundation
    Curvenote
    Zenodo
    Dryad
    HDF5
    Iceberg
    Zarr
    MyST Markdown
    Jupyter Notebook
    arXiv
    Journal of Open Source Software (JOSS)
    Data Carpentry
    Software Carpentry
    openRxiv
    bioRxiv
    medRxiv
    FORCE11
    Jupyter Book
    Open Exchange Architecture (OXA)

    The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
  • Data Engineering Podcast

    Beyond Prompts: Practical Paths to Self‑Improving AI

    2026/03/16 | 1h 1 min.
    Summary
    In this episode Raj Shukla, CTO of SymphonyAI, explores what it really takes to build self‑improving AI systems that work in production. Raj unpacks how agentic systems interact with real-world environments, the feedback loops that enable continuous learning, and why intelligent memory layers often provide the most practical middle ground between prompt tweaks and full Reinforcement Learning. He discusses the architecture needed around models - data ingestion, sensors, action layers, sandboxes, RBAC, and agent lifecycle management - to reach enterprise-grade reliability, as well as the policy alignment steps required for regulated domains like financial crime. Raj shares hard-won lessons on tool use evolution (from bespoke tools to filesystem and Unix primitives), dynamic code-writing subagents, model version brittleness, and how organizations can standardize process and entity graphs to accelerate time-to-value. He also dives into pitfalls such as policy gaps and tribal knowledge, strategies for staged rollouts and monitoring, and where small models and cost optimization make sense. Raj closes with a vision for bringing RL-style improvement to enterprises without requiring a research team - letting businesses own the reasoning and memory layers that truly differentiate their AI systems.
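The summary describes intelligent memory layers as a practical middle ground between prompt tweaks and full reinforcement learning: the agent carries forward what it has learned without retraining a model. The minimal sketch below shows one way that split between session and permanent memory could look; the class and method names (`MemoryLayer`, `promote`) are hypothetical, not SymphonyAI's architecture.

```python
# Minimal sketch of a session-vs-permanent memory split for an agent.
# Hypothetical names; not tied to any particular framework.

class MemoryLayer:
    def __init__(self):
        self.session = []    # scratch facts, cleared per conversation
        self.permanent = []  # durable knowledge, survives across sessions

    def remember(self, fact):
        self.session.append(fact)

    def promote(self, predicate):
        """Move session facts matching predicate into long-term memory."""
        keep, move = [], []
        for fact in self.session:
            (move if predicate(fact) else keep).append(fact)
        self.permanent.extend(move)
        self.session = keep

    def end_session(self):
        self.session = []

mem = MemoryLayer()
mem.remember("user prefers CSV exports")
mem.remember("weather is rainy today")
mem.promote(lambda f: "prefers" in f)  # keep stable preferences long-term
mem.end_session()
print(mem.permanent)  # ['user prefers CSV exports']
```

The promotion predicate is where the "self-improving" part lives: in production it would be a learned or policy-governed judgment about which observations deserve durable storage, not a substring match.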

    Announcements
    Hello and welcome to the Data Engineering Podcast, the show about modern data management
    If you lead a data team, you know this pain: Every department needs dashboards, reports, custom views, and they all come to you. So you're either the bottleneck slowing everyone down, or you're spending all your time building one-off tools instead of doing actual data work. Retool gives you a way to break that cycle. Their platform lets people build custom apps on your company data—while keeping it all secure. Type a prompt like 'Build me a self-service reporting tool that lets teams query customer metrics from Databricks'—and they get a production-ready app with the permissions and governance built in. They can self-serve, and you get your time back. It's data democratization without the chaos. Check out Retool at dataengineeringpodcast.com/retool today and see how other data teams are scaling self-service. Because let's be honest—we all need to Retool how we handle data requests.
    Your host is Tobias Macey, and today I’m interviewing Raj Shukla about building self-improving AI systems — and how they enable AI scalability in real production environments.

    Interview

    Introduction
    How did you get involved in AI/ML?
    Can you start by outlining what actually improves over time in a self-improving AI system? How is that different from simply improving a model or an agent?

    How would you differentiate between an agent/agentic system vs. a self-improving system?
    One of the components that is becoming common in agentic architectures is a "memory" layer. What are some of the ways that it contributes to a self-improvement feedback loop? In what ways are memory layers insufficient for a generalized self-improvement capability?

    For engineering and technology leaders, what are the key architectural and operational steps you recommend to build AI that can move from pilots into scalable, production systems?
    One of the perennial challenges for technology leaders is how to build AI systems that scale over time.
    How has AI changed the way you think about long-term advantage?
    How do self-improvement feedback loops contribute to AI scalability in real systems?
    What are some of the other key elements necessary to build a truly evolutionary AI system?
    What are the hidden costs of building these AI systems that teams should know before starting? I'm talking specifically about enterprises that are deploying AI into their internal mission-critical workflows.
    What are the most interesting, innovative, or unexpected ways that you have seen self-improving AI systems implemented?
    What are the most interesting, unexpected, or challenging lessons that you have learned while working on evolutionary AI systems?
    What are some of the ways that you anticipate agentic architectures and frameworks evolving to be more capable of self-improvement?

    Contact Info

    LinkedIn

    Closing Announcements

    Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
    Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
    If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.

    Parting Question

    From your perspective, what are the biggest gaps in tooling, technology, or training for AI systems today?

    Links

    SymphonyAI
    Reinforcement Learning
    Agentic Memory
    In-Context Learning
    Context Engineering
    Few-Shot Learning
    OpenClaw
    Deep Research Agent
    RAG == Retrieval Augmented Generation
    Agentic Search
    Google Gemma Models
    Ollama

    The intro and outro music is from Hitman's Lovesong feat. Paola Graziano by The Freak Fandango Orchestra/CC BY-SA 3.0
  • Data Engineering Podcast

    Orion at Gravity: Trustworthy AI Analysts for the Enterprise

    2026/03/08 | 1h 5 mins.
    Summary
    In this episode of the Data Engineering Podcast, Lucas Thelosen and Drew Gilson, co-founders of Gravity, discuss their vision for agentic analytics in the enterprise, enabled by semantic layers and broader context engineering. They share their journey from Looker and Google to building Orion, an AI analyst that combines data semantics with rich business context to deliver trustworthy and actionable insights. Lucas and Drew explain how Orion uses governed, role-specific "custom agents" to drive analysis, recommendations, and proactive preparation for meetings, while maintaining accuracy, lineage transparency, and human-in-the-loop feedback. The conversation covers evolving views on semantic layers, agent memory, retrieval, and operating across messy data, multiple warehouses, and external context like documents and weather. They emphasize the importance of trust, governance, and the path to AI coworkers that act as reliable colleagues. Lucas and Drew also share field stories from public companies where Orion has surfaced board-level issues, accelerated executive prep with last-minute research, and revealed how BI investments are actually used, highlighting a shift from static dashboards to dynamic, dialog-driven decisions. They stress the need for accessible (non-proprietary) models, managing context and technical debt over time, and focusing on business actions - not just metrics - to unlock real ROI.
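The conversation's central claim is that a semantic layer makes an AI analyst trustworthy by letting it request governed metrics by name instead of improvising SQL. The toy below shows that compilation step; the metric registry, table, and column names are all invented for illustration and are not Orion's implementation.

```python
# Toy semantic layer: governed metric definitions compiled to SQL, so an
# agent asks for "active_customers" rather than writing ad-hoc queries.
# All metric/table/column names here are made up for illustration.

METRICS = {
    "active_customers": {
        "sql": "COUNT(DISTINCT customer_id)",
        "table": "orders",
        "filters": ["status = 'active'"],
    },
}

def compile_metric(name, extra_filters=()):
    """Render a governed metric, with optional caller-supplied filters."""
    m = METRICS[name]
    where = " AND ".join([*m["filters"], *extra_filters]) or "TRUE"
    return f"SELECT {m['sql']} FROM {m['table']} WHERE {where}"

print(compile_metric("active_customers", ["region = 'EU'"]))
# SELECT COUNT(DISTINCT customer_id) FROM orders WHERE status = 'active' AND region = 'EU'
```

Because the definition (and its mandatory filters) lives in one governed place, every agent and dashboard that asks for the metric gets the same answer, which is the lineage-and-trust argument the episode makes.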

    Announcements
    Hello and welcome to the Data Engineering Podcast, the show about modern data management
    If you lead a data team, you know this pain: Every department needs dashboards, reports, custom views, and they all come to you. So you're either the bottleneck slowing everyone down, or you're spending all your time building one-off tools instead of doing actual data work. Retool gives you a way to break that cycle. Their platform lets people build custom apps on your company data—while keeping it all secure. Type a prompt like 'Build me a self-service reporting tool that lets teams query customer metrics from Databricks'—and they get a production-ready app with the permissions and governance built in. They can self-serve, and you get your time back. It's data democratization without the chaos. Check out Retool at dataengineeringpodcast.com/retool today and see how other data teams are scaling self-service. Because let's be honest—we all need to Retool how we handle data requests.
    Your host is Tobias Macey and today I'm interviewing Lucas Thelosen and Drew Gilson about the application of semantic layers to context engineering for agentic analytics

    Interview

    Introduction
    How did you get involved in the area of data management?
    Can you start by digging into the practical elements of what is involved in the creation and maintenance of a "semantic layer"?
    How does the semantic layer relate to and differ from the physical schema of a data warehouse?
    In generative AI and agentic systems the latest term of art is "context engineering". How does a semantic layer factor into the context management for an agentic analyst?
    What are some of the ways that LLMs/agents can help to populate the semantic layer?
    What are the cases where you want to guard against hallucinations by keeping a human in the loop?
    Beyond a physical semantic layer, what are the other elements of context that you rely on for guiding the activities of your agents?
    What are some utilities that you have found helpful for bootstrapping the structural guidelines for an existing warehouse environment?
    What are the most interesting, innovative, or unexpected ways that you have seen Orion used?
    What are the most interesting, unexpected, or challenging lessons that you have learned while working on Orion?
    When is Orion the wrong choice?
    What do you have planned for the future of Orion?

    Contact Info

    Lucas
    LinkedIn

    Drew
    LinkedIn

    Parting Question

    From your perspective, what is the biggest gap in the tooling or technology for data management today?

    Closing Announcements

    Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
    Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
    If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.

    Links

    Gravity
    Orion
    Looker
    Semantic Layer
    dbt
    LookML
    Tableau
    OpenClaw
    Pareto Distribution

    The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
  • Data Engineering Podcast

    From Models to Momentum: Uniting Architects and Engineers with ER/Studio

    2026/03/02 | 45 mins.
    Summary
    In this episode of the Data Engineering Podcast, Jamie Knowles (Product Director) and Ryan Hirsch (Product Marketing Manager) discuss the importance of enterprise data modeling with ER/Studio. They highlight how clear, shared semantic models are a foundational discipline for modern data engineering, preventing semantic drift, speeding up delivery, and reducing rework. Jamie explains that ER/Studio helps teams define logical models that translate into physical designs and code across warehouses and analytics platforms, while maintaining traceability and governance. The conversation also touches on how AI increases the tolerance for ambiguity, but doesn't fix unclear definitions - it amplifies them. Jamie and Ryan describe ER/Studio's integrations with governance tools, collaboration features like TeamServer, reverse engineering, and metadata bridges, as well as new AI-assisted modeling capabilities. They emphasize that most data problems are meaning problems, and investing in architecture and a semantic backbone can make engineering faster, governance simpler, and analytics more reliable.
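The logical-to-physical translation Jamie describes (a shared semantic model generating physical designs and code) can be sketched as a simple type-mapped DDL generator. The entity definition and type mapping below are invented for illustration; ER/Studio's own model formats and target-platform mappings are far richer.

```python
# Illustrative logical-to-physical translation: a logical entity with
# platform-neutral attribute kinds is rendered as physical DDL.
# The kind->type mapping is a made-up example, not ER/Studio's.

LOGICAL_TO_SQL = {
    "identifier": "BIGINT",
    "text": "VARCHAR(255)",
    "money": "NUMERIC(12,2)",
}

def to_ddl(entity, attributes):
    """Render a logical entity as CREATE TABLE DDL."""
    cols = ",\n  ".join(f"{name} {LOGICAL_TO_SQL[kind]}" for name, kind in attributes)
    return f"CREATE TABLE {entity} (\n  {cols}\n);"

print(to_ddl("customer", [("customer_id", "identifier"),
                          ("name", "text"),
                          ("lifetime_value", "money")]))
```

The point of keeping the logical layer authoritative is that swapping the mapping table retargets the same model to a different warehouse, which is what prevents the "semantic drift" the episode warns about.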

    Announcements
    Hello and welcome to the Data Engineering Podcast, the show about modern data management
    If you lead a data team, you know this pain: Every department needs dashboards, reports, custom views, and they all come to you. So you're either the bottleneck slowing everyone down, or you're spending all your time building one-off tools instead of doing actual data work. Retool gives you a way to break that cycle. Their platform lets people build custom apps on your company data—while keeping it all secure. Type a prompt like 'Build me a self-service reporting tool that lets teams query customer metrics from Databricks'—and they get a production-ready app with the permissions and governance built in. They can self-serve, and you get your time back. It's data democratization without the chaos. Check out Retool at dataengineeringpodcast.com/retool today and see how other data teams are scaling self-service. Because let's be honest—we all need to Retool how we handle data requests.
    Your host is Tobias Macey and today I'm interviewing Jamie Knowles and Ryan Hirsch about ER/Studio and the foundational role of enterprise data modeling in modern data engineering.

    Interview

    Introduction
    How did you get involved in the area of data management?
    Can you describe what ER/Studio is and the story behind it?
    How has it evolved to handle the shift from traditional on-prem databases to modern, complex, and highly regulated enterprise environments?
    How do you define "Enterprise Data Architecture" today, and how does it differ from just managing a collection of pipelines in a modern data stack?
    In your view, what are the distinct responsibilities of a Data Architect versus a Data Engineer, and where is the critical overlap where they typically succeed or fail together?
    From what you see in the field, how often are the technical struggles of data engineering teams—like tool sprawl or "broken" pipelines—actually just "data meaning" problems in disguise?
    What is a logical data model, and why do you advocate for framing these as "knowledge models" rather than just technical diagrams?
    What are the long-term consequences, such as "semantic drift" or the erosion of trust, when organizations skip logical modeling to go straight to physical implementation and pipelines?
    What is the intersection of data modeling and data governance?
    What are the elements of integration between ER/Studio and governance platforms that reduce friction and time to delivery?
    For the engineers who worry that architecture and modeling slow down development, how does having a central design authority actually help teams scale and reduce downstream rework?
    What does a typical workflow look like across data architecture and data engineering for individuals and teams who are using ER/Studio as a core part of their modeling?
    What are the most interesting, innovative, or unexpected ways that you have seen ER/Studio used? (Context: specifically regarding grounding AI initiatives or defining enterprise ontologies.)
    What are the most interesting, unexpected, or challenging lessons that you have learned while working on ER/Studio?
    When is ER/Studio the wrong choice for a data team or a specific project?
    What do you have planned for the future of ER/Studio, particularly regarding AI and the "design-time" foundation of the data stack?

    Contact Info

    Jamie
    LinkedIn
    Ryan
    LinkedIn

    Parting Question

    From your perspective, what is the biggest gap in the tooling or technology for data management today?

    Closing Announcements

    Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
    Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
    If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.

    Links

    Idera
    Wherescape
    ER/Studio
    Entity-Relation Diagram (ERD)
    Business Keys
    Medallion Architecture
    RDF == Resource Description Framework
    Collibra
    Martin Fowler
    DB2

    The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
  • Data Engineering Podcast

    From Data Models to Mind Models: Designing AI Memory at Scale

    2026/02/22 | 57 mins.
    Summary
    In this episode of the Data Engineering Podcast, Vasilije "Vas" Markovich, founder of Cognee, discusses building agentic memory, a crucial aspect of artificial intelligence that enables systems to learn, adapt, and retain knowledge over time. He explains the concept of agentic memory, highlighting the importance of distinguishing between permanent and session memory, graph+vector layers, latency trade-offs, and multi-tenant isolation to ensure safe knowledge sharing or protection. The conversation covers practical considerations such as storage choices (Redis, Qdrant, LanceDB, Neo4j), metadata design, temporal relevance and decay, and emerging research areas like trace-based scoring and reinforcement learning for improving retrieval. Vas shares real-world examples of agentic memory in action, including applications in pharma hypothesis discovery, logistics control towers, and cybersecurity feeds, as well as scenarios where simpler approaches may suffice. He also offers guidance on when to add memory, pitfalls to avoid (naive summarization, uncontrolled fine-tuning), human-in-the-loop realities, and Cognee's future plans: revamped session/long-term stores, decision-trace research, and richer time and transformation mechanisms. Additionally, Vas touches on policy guardrails for agent actions and the potential for more efficient "pseudo-languages" for multi-agent collaboration.
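The temporal-relevance-and-decay idea mentioned in the summary can be made concrete with a small scoring function: older memories score lower, so retrieval prefers recent, relevant facts. The exponential half-life form below is an illustrative assumption, not Cognee's actual scoring.

```python
# Toy temporal-decay scoring for memory retrieval: combine a relevance
# score with exponential time decay so stale facts rank lower.
# The half-life form and the example memories are illustrative.
import math

def score(relevance, age_days, half_life_days=30.0):
    """Relevance in [0, 1], halved every half_life_days of age."""
    decay = 0.5 ** (age_days / half_life_days)
    return relevance * decay

memories = [
    ("uses Snowflake for analytics", 0.9, 2),    # (fact, relevance, age in days)
    ("old office address", 0.9, 300),
]
ranked = sorted(memories, key=lambda m: score(m[1], m[2]), reverse=True)
print(ranked[0][0])  # the recent memory wins despite equal relevance
```

In practice the decay rate would itself be tuned per memory type (preferences decay slowly, ephemeral session facts quickly), which is one reason the episode distinguishes permanent from session memory.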

    Announcements
    Hello and welcome to the Data Engineering Podcast, the show about modern data management
    If you lead a data team, you know this pain: Every department needs dashboards, reports, custom views, and they all come to you. So you're either the bottleneck slowing everyone down, or you're spending all your time building one-off tools instead of doing actual data work. Retool gives you a way to break that cycle. Their platform lets people build custom apps on your company data—while keeping it all secure. Type a prompt like 'Build me a self-service reporting tool that lets teams query customer metrics from Databricks'—and they get a production-ready app with the permissions and governance built in. They can self-serve, and you get your time back. It's data democratization without the chaos. Check out Retool at dataengineeringpodcast.com/retool today and see how other data teams are scaling self-service. Because let's be honest—we all need to Retool how we handle data requests.
    Your host is Tobias Macey and today I'm interviewing Vasilije Markovic about agentic memory architectures and applications

    Interview

    Introduction
    How did you get involved in the area of data management?
    Can you start by giving an overview of the different elements of "memory" in an agentic context?
    Storage and retrieval mechanisms
    How to model memories
    How does that change as you go from short-term to long-term?
    Managing scope and retrieval triggers
    What are some of the useful triggers in an agent architecture to identify whether/when/what to create a new memory?
    How do things change as you try to build a shared corpus of memory across agents?
    What are the most interesting, innovative, or unexpected ways that you have seen agentic memory used?
    What are the most interesting, unexpected, or challenging lessons that you have learned while working on Cognee?
    When is a dedicated memory layer the wrong choice?
    What do you have planned for the future of Cognee?

    Contact Info

    LinkedIn

    Parting Question

    From your perspective, what is the biggest gap in the tooling or technology for data management today?

    Closing Announcements

    Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
    Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
    If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.

    Links

    Cognee
    AI Engineering Podcast Episode
    Kimball Memory
    Cognitive Science
    Context Window
    RAG == Retrieval Augmented Generation
    Memory Types
    Redis Vector Store
    Qdrant
    Vector on Edge
    Milvus
    LanceDB
    KuzuDB
    Neo4j
    Mem0
    Zep Graphiti
    A2A (Agent-to-Agent) Protocol
    Snowplow
    Reinforcement Learning
    Model Finetuning
    OpenClaw

    The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA


About Data Engineering Podcast

This show goes behind the scenes for the tools, techniques, and difficulties associated with the discipline of data engineering. Databases, workflows, automation, and data manipulation are just some of the topics that you will find here.
Podcast website

