Episode 15
RAG-Based Agentic Memory: Code Perspective (Chapter 17)
Unlock how Retrieval-Augmented Generation (RAG) enables AI agents to remember, learn, and personalize over time. In this episode, we explore Chapter 17 of Keith Bourne’s "Unlocking Data with Generative AI and RAG," focusing on implementing agentic memory with the CoALA framework. From episodic and semantic memory distinctions to real-world engineering trade-offs, this discussion is packed with practical insights for AI/ML engineers and infrastructure experts.
In this episode:
- Understand the difference between episodic and semantic memory and their roles in AI agents
- Explore how vector databases like ChromaDB power fast, scalable memory retrieval
- Dive into the architecture and code walkthrough using CoALA, LangChain, LangGraph, and OpenAI APIs
- Discuss engineering challenges including validation, latency, and system complexity
- Hear from author Keith Bourne on the foundational importance of agentic memory
- Review real-world applications and open problems shaping the future of memory-augmented AI
Key tools and technologies mentioned:
- CoALA framework
- LangChain & LangGraph
- ChromaDB vector database
- OpenAI API (embeddings and LLMs)
- python-dotenv
- Pydantic models
Timestamps:
0:00 - Introduction & Episode Overview
2:30 - The Concept of Agentic Memory: Episodic vs Semantic
6:00 - Vector Databases and Retrieval-Augmented Generation (RAG)
9:30 - Coding Agentic Memory: Frameworks and Workflow
13:00 - Engineering Trade-offs and Validation Challenges
16:00 - Real-World Applications and Use Cases
18:30 - Open Problems and Future Directions
20:00 - Closing Thoughts and Resources
Resources:
- "Unlocking Data with Generative AI and RAG" by Keith Bourne - Search for 'Keith Bourne' on Amazon and grab the 2nd edition
- Visit Memriq AI at https://Memriq.ai for more AI engineering deep dives and resources
Transcript
MEMRIQ INFERENCE DIGEST - ENGINEERING EDITION
Episode: RAG-Based Agentic Memory in Code: Chapter 17 Deep Dive with Keith Bourne
MORGAN:Welcome to the Memriq Inference Digest - Engineering Edition! I’m Morgan, and alongside Casey, we’re here to unpack some of the deepest engineering topics in AI and machine learning today. This show is brought to you by Memriq AI, a content studio crafting tools and resources tailored for AI practitioners. Check them out at Memriq.ai if you want more expert insights.
CASEY:Today we’re diving into a fascinating topic from Chapter 17 of ‘Unlocking Data with Generative AI and RAG’ by Keith Bourne — specifically, RAG-based agentic memory implementations in code using the CoALA framework. This episode is for engineers, scientists, and infrastructure experts who want a detailed, hands-on look at how to build AI agents that remember and learn over time.
MORGAN:If you want to go way deeper, the book has detailed diagrams, thorough explanations, and extensive code labs walking you through the entire build process step by step. Just search for Keith Bourne on Amazon and grab the 2nd edition.
CASEY:We’re also thrilled to have Keith Bourne himself joining us today. Keith’s going to share insider insights, behind-the-scenes thinking, and real-world experience to enrich our technical journey.
MORGAN:So, buckle up, because today we’re covering everything from memory architectures to vector databases, modular agent frameworks, and how you stitch it all together in production. Keith will be with us throughout, so stay tuned for his expert perspectives.
JORDAN:Imagine an AI agent that not only remembers what you said two sessions ago but can also integrate new facts about you—like your favorite coffee blend or your allergy to peanuts—and keep that knowledge alive and accurate over time. What surprised me, digging into Keith’s work, was how the combination of episodic memory—storing entire conversation sessions—and semantic memory—extracting and structuring facts—creates a persistent, context-aware AI that feels truly agentic.
MORGAN:That’s wild. So, the agent isn’t just stateless text generation anymore—it’s actually learning and adapting?
CASEY:But how scalable is that? Storing all past conversations plus structured facts sounds heavy.
JORDAN:That’s where vector databases like ChromaDB come in—they store embeddings for fast similarity search, enabling instant retrieval of relevant past interactions or facts. Combined with the CoALA framework’s modular approach, you get an architecture that separates memory types and orchestrates workflows efficiently.
MORGAN:So it’s modular, persistent, and fast—like having a brain with different compartments for episodic and semantic knowledge.
CASEY:And you can tune what kind of memory your agent prioritizes, right? That’s a powerful lever for building smarter, personalized AI.
JORDAN:Exactly. This architecture flips the script on how we traditionally build stateless LLM applications. It’s a game changer for any AI requiring long-term context.
CASEY:At its core, this episode covers RAG-based agentic memory systems that combine episodic memory—recordings of past interactions—and semantic memory—structured facts extracted from conversations—using CoALA, LangChain, LangGraph, and ChromaDB.
MORGAN:We’re looking at a stateful agent architecture where memories are stored as vector embeddings, enabling retrieval-augmented generation.
CASEY:Tools in the stack include the OpenAI API for embeddings and LLMs, python-dotenv for secure config, and Pydantic models for structuring semantic facts.
MORGAN:If you remember nothing else—this architecture transforms stateless LLMs into agents that truly remember, learn, and personalize over time.
JORDAN:Before recent advances, deploying AI agents with persistent memory was a huge engineering challenge. LLMs have amazing language ability but are inherently stateless—they forget everything once a session ends, limiting personalization and continuity.
CASEY:Right, and you could try cramming longer histories into the context window, but that quickly hits token limits and latency issues.
JORDAN:The breakthrough came with retrieval-augmented generation or RAG—embedding past data into vector spaces and retrieving relevant chunks dynamically to condition the LLM’s output. Combine that with scalable vector databases like ChromaDB, and you can store vast amounts of episodic and semantic memories efficiently.
MORGAN:Frameworks like CoALA and LangChain have matured to orchestrate these modular memory components, making implementation practical at scale instead of a research curiosity.
JORDAN:Companies deploying virtual assistants, customer support bots, and adaptive tutors are adopting these architectures to create AI that maintains long-term context and personalized knowledge across sessions.
CASEY:The key here is the interplay between efficient storage, fast similarity search, and workflow orchestration — all critical to serving real-time production loads.
JORDAN:Exactly. It’s not just academic; it’s solving real engineering pain points in AI today.
TAYLOR:At its core, agentic memory decomposes into two main components: episodic memory, which stores the sequence of past interactions including timestamps; and semantic memory, which extracts structured factual triples—subject, predicate, object—from those interactions with confidence scores.
MORGAN:So episodic memory preserves context continuity—like a diary of conversations—while semantic memory builds a factual knowledge base for reasoning and personalization.
TAYLOR:Exactly. Retrieval-augmented generation stitches these memories together by embedding them into vector spaces, then retrieving relevant memories to augment prompt construction dynamically.
CASEY:How does this differ from previous approaches like just passing chat history or static knowledge bases?
TAYLOR:The key difference is modularity and persistence. Instead of blowing up prompt size with entire histories, you fetch only the most relevant episodic or semantic vectors at runtime. Plus, CoALA’s modular cognitive architecture cleanly separates these memories and orchestrates them with state graphs and workflow nodes.
MORGAN:Keith, as the author, what made you prioritize this concept so early in the book? Why agentic memory?
KEITH:Great question, Morgan. Agentic memory is foundational because it unlocks true personalization and long-term context, which are critical pain points in AI today. Early on, I realized that without distinguishing episodic from semantic memory and architecting them as modular, persistent components, agents remain brittle and short-lived. The book dives deep into these ideas because they form the connective tissue for building practical, scalable AI assistants.
TAYLOR:That modularity also aids maintainability and extensibility—two crucial engineering concerns.
TAYLOR:Let’s break down the trade-offs between episodic and semantic memory implementations, focusing on tools like CoALA, ChromaDB, and LangChain.
CASEY:Episodic memory stores entire conversation transcripts as Documents with timestamps, making it excellent for context continuity and temporal reasoning. But it can get bulky quickly.
TAYLOR:Right. You index these Documents in ChromaDB using vector embeddings generated via OpenAI’s API, then query them with similarity search. The trade-off is storage size and retrieval latency as episodes accumulate.
CASEY:Semantic memory, on the other hand, extracts structured facts from conversations via LLM prompts, with the outputs validated against Pydantic schemas. These facts—like “User prefers black coffee”—are stored separately with confidence scores. It’s more compact and supports reasoning, but it loses the richness of full dialogues.
TAYLOR:Plus, semantic memory depends heavily on the accuracy of the fact extraction step. Noise or hallucinations can degrade knowledge quality.
CASEY:So if you need rich conversational recall, episodic memory is better. If you want a distilled, factual knowledge base, semantic memory fits.
TAYLOR:The best engineering approach combines both—episodic for dialogue continuity, semantic for knowledge grounding. CoALA’s modular nodes and workflows facilitate that hybrid architecture seamlessly.
CASEY:Procedural memory, which encodes workflows or action sequences, is mentioned but remains an open challenge.
TAYLOR:Exactly. Procedural memory adds complexity but could enable agents to learn and optimize multi-step tasks—something to watch for future innovations.
ALEX:Let’s get technical and walk through how an agentic memory system is coded using CoALA, LangChain, LangGraph, ChromaDB, and the OpenAI API, with python-dotenv and Pydantic supporting configuration and validation.
ALEX:First, the base agent is built on LangChain, which orchestrates LLM calls and prompt chaining. LangGraph adds stateful workflow management, enabling state transitions and conditional execution based on memory retrieval results.
ALEX:Episodic memory stores conversation transcripts as LangChain Document objects, tagged with metadata including timestamps and conversation IDs. These Documents are embedded into dense vectors using OpenAI embeddings and persisted in ChromaDB, a vector database optimized for fast similarity search and filtering.
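To make that concrete, here is a minimal sketch of what episodic storage along those lines could look like, assuming the langchain-chroma and langchain-openai packages, a local persistent Chroma collection, and OpenAI's text-embedding-3-small model; names like episodic_store and save_episode are illustrative, not the book's exact code.

```python
# Minimal sketch of episodic memory storage (illustrative, not the book's exact code).
from datetime import datetime, timezone

from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma

# Embeddings plus a persistent Chroma collection dedicated to episodic memories.
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
episodic_store = Chroma(
    collection_name="episodic_memory",
    embedding_function=embeddings,
    persist_directory="./memory_db",
)

def save_episode(transcript: str, conversation_id: str) -> None:
    """Persist one conversation transcript as an episodic memory Document."""
    doc = Document(
        page_content=transcript,
        metadata={
            "conversation_id": conversation_id,
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "memory_type": "episodic",
        },
    )
    episodic_store.add_documents([doc])
```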
ALEX:When new input arrives, the agent queries ChromaDB for the top-k (usually 2–5) most similar episodic memories, filtered by recent timestamps or conversation context. These retrieved transcripts are fed into prompt templates to condition GPT-4.1-mini for response generation at temperature=0 to ensure deterministic outputs.
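A rough sketch of that retrieval step, reusing the hypothetical episodic_store from the previous snippet; the filter key and the value of k here are illustrative choices, not prescribed by the book.

```python
# Sketch: retrieve the top-k most relevant episodic memories for a new user input.
user_input = "Can you book the same coffee order as last time?"

# Similarity search over episodic memories, filtered to the current conversation thread.
retrieved = episodic_store.similarity_search(
    user_input,
    k=3,  # top-k is typically small (2-5) to keep the prompt compact
    filter={"conversation_id": "user-42"},
)

# Concatenate the retrieved transcripts into a context block for the prompt template.
episodic_context = "\n\n".join(doc.page_content for doc in retrieved)
```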
ALEX:Simultaneously, semantic memory extraction happens by parsing conversations through GPT prompts designed to surface subject-predicate-object triples. These triples are validated against Pydantic models to ensure schema integrity and assigned confidence scores between 0 and 1 indicating extraction certainty.
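Here is one way that extraction-and-validation step might be sketched, using LangChain's with_structured_output helper; the schema fields, prompt wording, confidence threshold, and exact model string are assumptions for illustration, not the book's code.

```python
# Sketch: semantic fact extraction with Pydantic validation (illustrative schema).
from pydantic import BaseModel, Field
from langchain_openai import ChatOpenAI

class SemanticFact(BaseModel):
    """One subject-predicate-object triple extracted from a conversation."""
    subject: str = Field(description="Entity the fact is about, e.g. 'User'")
    predicate: str = Field(description="Relation, e.g. 'prefers'")
    object: str = Field(description="Value, e.g. 'black coffee'")
    confidence: float = Field(ge=0.0, le=1.0, description="Extraction certainty, 0 to 1")

class FactList(BaseModel):
    facts: list[SemanticFact]

# temperature=0 keeps the extraction step as repeatable as possible.
llm = ChatOpenAI(model="gpt-4.1-mini", temperature=0)
extractor = llm.with_structured_output(FactList)

result = extractor.invoke(
    "Extract subject-predicate-object facts about the user from this conversation, "
    "with a confidence score for each:\n\nUser: I take my coffee black, no sugar."
)

# Keep only facts above an illustrative confidence threshold before storing them.
valid_facts = [f for f in result.facts if f.confidence >= 0.7]
```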
ALEX:Validated semantic facts are also embedded and stored in ChromaDB, separately indexed and queried alongside episodic memories. When responding, the agent can retrieve both episodic dialogues and semantic facts, integrating them into the prompt for richer, personalized outputs.
ALEX:Memory storage is handled as discrete workflow nodes in LangGraph, so after every interaction, newly generated episodic Documents and semantic facts get persisted asynchronously to the vector store, ensuring the agent gains persistent knowledge over time.
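A minimal sketch of how that persistence step could be wired as LangGraph nodes, reusing the hypothetical save_episode helper from above; the state keys, node names, and two-node graph are illustrative, and the full workflow described in the book also includes retrieval and generation nodes.

```python
# Sketch: memory persistence wired as LangGraph workflow nodes.
from typing import TypedDict
from langgraph.graph import StateGraph, START, END

class AgentState(TypedDict):
    user_input: str
    response: str
    conversation_id: str

def store_episodic(state: AgentState) -> AgentState:
    # Persist the latest exchange as an episodic Document (save_episode from the earlier sketch).
    transcript = f"User: {state['user_input']}\nAgent: {state['response']}"
    save_episode(transcript, state["conversation_id"])
    return state

def store_semantic(state: AgentState) -> AgentState:
    # The fact-extraction sketch from earlier would run here and persist validated triples.
    ...
    return state

builder = StateGraph(AgentState)
builder.add_node("store_episodic", store_episodic)
builder.add_node("store_semantic", store_semantic)
builder.add_edge(START, "store_episodic")
builder.add_edge("store_episodic", "store_semantic")
builder.add_edge("store_semantic", END)
memory_workflow = builder.compile()
```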
KEITH:The main takeaway is the power of modularity and separation of concerns. By decoupling episodic and semantic memories and building them as composable nodes in a stateful workflow, engineers can iteratively improve components, plug in new memory types, and manage complexity effectively. The architecture mirrors cognitive processes and lets you scale horizontally with vector DBs and modern orchestration frameworks. It’s a blueprint for building long-term context-aware agents in production.
ALEX:That modularity also enables experimentation—swap vector stores, tune retrieval parameters, or enhance fact extraction without overhauling the entire system.
CASEY:And the use of typed Pydantic models adds a layer of robustness rarely discussed but crucial in production, especially when dealing with noisy LLM outputs.
ALEX:Exactly. It’s engineering rigor applied to cutting-edge AI.
ALEX:Now, why bother? The results are impressive. Agents using this RAG-based agentic memory approach recall relevant past conversations with high precision, keeping dialogue coherent across sessions.
ALEX:Semantic memory lets the agent personalize responses using extracted facts—like reminding you about your nut allergy or preferred meeting times—making interactions feel human and trustworthy.
ALEX:The system creates a continuous learning loop where every interaction enriches the memory stores, improving recall and relevance metrics over time.
MORGAN:That’s a huge win for user experience—no more “I don’t remember that” moments from AI.
ALEX:However, the downside is added system complexity and latency overhead—retrieval and embedding steps add tens to hundreds of milliseconds per query, depending on vector store size and API call roundtrip times.
CASEY:So there’s a trade-off: richer context and personalization at the cost of infrastructure and response time complexity.
ALEX:Exactly. But with proper engineering—caching, indexing strategies, and parallel calls—you can keep latency within acceptable bounds for many real-time applications.
TAYLOR:The book’s benchmarks showed recall relevance consistently above 90% for top-k episodic memories and semantic fact extraction confidence scores averaging around 0.85, which is solid for production.
ALEX:Those are solid numbers validating the approach’s effectiveness while signaling areas for further optimization.
CASEY:I want to pull back and ask, what can go wrong? The book is refreshingly honest about limitations, and as an engineer, I’m worried about complexity and noise.
MORGAN:Yeah, storing all this memory sounds heavy — how do you manage vector store bloat and retrieval latency?
CASEY:Also, semantic fact extraction relies on LLM accuracy. Hallucinations or false facts can pollute the knowledge base, causing agent misbehavior.
KEITH:The most common pitfall is neglecting to validate and curate extracted knowledge rigorously. Without schema validation or confidence filtering, you end up with garbage in, garbage out. Another issue is ignoring infrastructure constraints—vector stores can balloon if not pruned or sharded properly, impacting latency and cost. Privacy is another big concern—storing sensitive user data creates persistent risk and compliance challenges if it isn’t handled securely.
CASEY:So memory augmentation isn’t a silver bullet; it requires mature engineering discipline.
KEITH:Exactly. The book covers these challenges candidly and suggests mitigation strategies like confidence thresholds, periodic data pruning, encryption, and operational monitoring.
MORGAN:It’s great to see that balance — enthusiasm for new tech plus a clear-eyed view of engineering realities.
SAM:Let’s talk applications. Memory-enabled agents are already deployed in multiple domains.
SAM:In customer service, bots use episodic memory to recall past support tickets, reducing customer frustration and speeding resolution by surfacing relevant prior conversations.
SAM:Educational tech leverages semantic memory to track student knowledge gaps and preferences, tailoring tutoring sessions dynamically.
SAM:Personal assistants combine episodic and semantic memories to manage meetings, preferences, and evolving user contexts, making them feel genuinely helpful.
MORGAN:So this isn’t just theory—it’s powering real products that require long-term context and personalization.
SAM:Exactly. Any AI requiring continuous learning and adaptive behavior benefits from this architecture: healthcare assistants, legal research bots, even complex simulation agents.
CASEY:It’s fascinating to see such diverse use cases grounded in the same core memory concepts.
SAM:And the modularity of frameworks like CoALA lets teams customize memory strategies per domain needs without reinventing the wheel.
SAM:Here’s a scenario: building a customer support agent that must recall past tickets (episodic memory) and have accurate product knowledge (semantic memory). Let’s debate three approaches.
TAYLOR:Approach A is episodic memory only—fast retrieval of past conversations but no structured facts. Efficiency wins here, but the agent can’t reason about product details.
CASEY:The downside is limited knowledge grounding—answers might be vague or inconsistent if the product changes.
ALEX:Approach B uses semantic memory only—rich, structured knowledge about products but no temporal conversation context. The agent can give factual answers but feels disjointed in dialogue.
MORGAN:That risks losing user engagement since the bot can’t remember prior complaints or resolutions.
SAM:Approach C combines both episodic and semantic memories using CoALA and ChromaDB—giving comprehensive context and knowledge.
TAYLOR:The trade-off is infrastructure complexity and higher storage requirements. Retrieval latency can increase, and engineering overhead grows.
CASEY:But the payoff is a better user experience with personalized, coherent, and knowledgeable interactions.
SAM:So choose episodic-only if you need speed and simple continuity, semantic-only if factual accuracy is paramount, or combine both for the most powerful but complex solution.
MORGAN:It’s an engineering decision balancing cost, latency, relevance, and complexity.
SAM:For engineers building these systems, start with LangChain for LLM orchestration and prompt chaining, keeping temperature at zero for deterministic outputs.
SAM:LangGraph helps define stateful workflows and manage agent states, enabling modular execution of memory retrieval and storage nodes.
SAM:Use ChromaDB for scalable vector embedding storage, taking advantage of metadata filtering to target episodic or semantic memories.
SAM:Secure your OpenAI API keys and other credentials with python-dotenv and environment variables to avoid leaks.
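For example, a minimal python-dotenv setup might look like this, assuming a local .env file containing OPENAI_API_KEY:

```python
# Sketch: load credentials from a .env file instead of hard-coding them.
import os
from dotenv import load_dotenv

load_dotenv()  # reads key=value pairs from a local .env file into the environment
openai_api_key = os.getenv("OPENAI_API_KEY")
if not openai_api_key:
    raise RuntimeError("OPENAI_API_KEY is not set; add it to your .env file")
```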
SAM:Define semantic fact schemas with Pydantic to validate and structure data before storage—this step reduces downstream errors and hallucinations.
SAM:Implement memory storage and retrieval as modular nodes in your workflow. That makes your system extensible and easier to maintain.
SAM:Don’t forget prompt templates that dynamically incorporate retrieved episodic and semantic contexts to enrich agent responses meaningfully.
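As an illustration, a prompt template that folds both memory types into generation could be sketched like this; the template wording, variable names, and model string are assumptions rather than the book's exact code.

```python
# Sketch: a prompt template that weaves retrieved memories into response generation.
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

prompt = ChatPromptTemplate.from_messages([
    ("system",
     "You are a helpful assistant with long-term memory.\n"
     "Relevant past conversations:\n{episodic_context}\n\n"
     "Known facts about the user:\n{semantic_context}"),
    ("human", "{user_input}"),
])

llm = ChatOpenAI(model="gpt-4.1-mini", temperature=0)
chain = prompt | llm

# In practice the two context values come from episodic and semantic retrieval.
reply = chain.invoke({
    "episodic_context": "User previously ordered a large black coffee on 2024-05-02.",
    "semantic_context": "User prefers black coffee (confidence 0.92).",
    "user_input": "Order me my usual.",
})
print(reply.content)
```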
CASEY:Also, monitor retrieval latency and vector store size regularly to avoid surprises.
MORGAN:Great practical advice — build iteratively, validate aggressively, and automate your workflows for resilience.
MORGAN:Quick plug — if today’s deep dive sparked your curiosity, Keith Bourne’s ‘Unlocking Data with Generative AI and RAG’ 2nd edition goes way beyond what we covered, with detailed diagrams, thorough explanations, and hands-on code labs that let you build these memory-augmented agents yourself. Search Keith Bourne on Amazon and grab your copy. You won’t regret it.
MORGAN:Memriq AI is an AI consultancy and content studio building tools and resources for AI practitioners.
CASEY:This podcast is produced by Memriq AI to help engineers and leaders stay current with the rapidly evolving AI landscape.
MORGAN:Head to Memriq.ai for more AI deep-dives, practical guides, and cutting-edge research breakdowns.
SAM:The field still has fascinating open problems.
SAM:Implementing procedural memory—allowing agents to learn, optimize, and execute multi-step workflows—is a frontier challenge. LangMem is one promising future framework here.
SAM:Scaling vector stores efficiently as memory datasets grow huge, without sacrificing retrieval latency, is an ongoing engineering hurdle.
SAM:Semantic fact extraction still struggles with hallucinations and incomplete knowledge, requiring better accuracy and validation methods.
SAM:Privacy, security, and compliance loom large when storing persistent user data in memory systems, especially across jurisdictions.
SAM:Finally, integrating multi-modal memories—like visual or auditory alongside textual episodic and semantic memories—is an exciting but complex frontier.
MORGAN:These challenges mean this field is ripe for innovation and impact.
MORGAN:My takeaway is that combining episodic and semantic memories via RAG architectures fundamentally changes what AI agents can do — from stateless chatbots to lifelong learners.
CASEY:I remind engineers to respect the complexity this introduces — validation, pruning, and security are non-negotiable for production success.
JORDAN:For me, the power of modular frameworks like CoALA can’t be overstated — they turn an overwhelming design into manageable components.
TAYLOR:Architecturally, the distinction between memory types clarifies trade-offs and helps teams design precisely for their use cases.
ALEX:I’m excited about how rigorous schemas and vector databases can bring engineering discipline to AI’s language chaos.
SAM:Practical deployments already prove this approach’s value across industries — it’s not just theory anymore.
KEITH:As the author, the one thing I hope you take away is that agentic memory is foundational to building AI that truly learns and adapts. The code labs in the book are there to help you turn theory into real, working systems — go build something amazing.
MORGAN:Keith, thanks for giving us the inside scoop today.
KEITH:My pleasure — and I hope this inspires you to dig into the book and build something amazing.
CASEY:Keep critical eyes on implementation details, but don’t let complexity scare you off — the potential here is huge.
MORGAN:We covered the key concepts today, but remember the book goes much deeper — detailed diagrams, thorough explanations, and hands-on code labs that let you build this stuff yourself. Search Keith Bourne on Amazon and grab the 2nd edition of Unlocking Data with Generative AI and RAG.
Thanks for listening, see you next time!
