Episode 3

RAG Evaluation with ragas: Reference-Free Metrics & Monitoring

Unlock the secrets to evaluating Retrieval-Augmented Generation (RAG) pipelines effectively and efficiently with ragas, the open-source framework that’s transforming AI quality assurance. In this episode, we explore how to implement reference-free evaluation, integrate continuous monitoring into your AI workflows, and optimize for production scale — all through the lens of Keith Bourne’s comprehensive Chapter 9.

In this episode:

- Overview of ragas and its reference-free metrics that achieve 95% human agreement on faithfulness scoring

- Implementation patterns and code walkthroughs for integrating ragas with LangChain, LlamaIndex, and CI/CD pipelines

- Production monitoring architecture: sampling, async evaluation, aggregation, and alerting

- Comparison of ragas with other evaluation frameworks like DeepEval and TruLens

- Strategies for cost optimization and asynchronous evaluation at scale

- Advanced features: custom domain-specific metrics with AspectCritic and multi-turn evaluation support

Key tools and technologies mentioned:

- ragas (Retrieval Augmented Generation Assessment System)

- LangChain, LlamaIndex

- LangSmith, LangFuse (observability and evaluation tools)

- OpenAI GPT-4o, GPT-3.5-turbo, Anthropic Claude, Google Gemini, Ollama

- Hugging Face datasets library (Python)

Timestamps:

00:00 - Introduction and overview with Keith Bourne

03:00 - Why reference-free evaluation matters and ragas’s approach

06:30 - Core metrics: faithfulness, answer relevancy, context precision & recall

09:00 - Code walkthrough: installation, dataset structure, evaluation calls

12:00 - Integrations with LangChain, LlamaIndex, and CI/CD workflows

14:30 - Production monitoring architecture and cost considerations

17:00 - Advanced metrics and custom domain-specific evaluations

19:00 - Common pitfalls and testing strategies

20:30 - Closing thoughts and next steps

Resources:

- "Unlocking Data with Generative AI and RAG" by Keith Bourne - Search for 'Keith Bourne' on Amazon and grab the 2nd edition

- Memriq AI: https://Memriq.ai

- ragas website: https://www.ragas.io/

- ragas GitHub repository: https://github.com/vibrantlabsai/ragas (for direct access to code and docs)

Tune in to build more reliable, scalable, and maintainable RAG systems with confidence using open-source evaluation best practices.

Transcript

MEMRIQ INFERENCE DIGEST - ENGINEERING EDITION

Episode: RAG Evaluation with RAGAS: Chapter 9 Deep Dive on Reference-Free Metrics & Monitoring

MORGAN:

Welcome back to the Memriq Inference Digest - Engineering Edition. I'm Morgan, and we're here to help you build better AI systems with deep technical insights. This podcast is brought to you by Memriq AI — a content studio building tools and resources to empower AI practitioners. Check them out at Memriq.ai if you want to dive deeper into AI engineering.

CASEY:

Today, we're doing a technical deep-dive into evaluating Retrieval-Augmented Generation pipelines with RAGAS — the Retrieval Augmented Generation Assessment System. We'll cover implementation patterns, code architecture, and how to integrate reference-free evaluation into your CI/CD workflows.

MORGAN:

And we have a special guest today — Keith Bourne, author of the second edition of "Unlocking Data with Generative AI and RAG." Keith, welcome to the show.

KEITH:

Thanks, Morgan. I'm excited to get into the technical details — this is something I've spent months implementing and documenting for the book.

CASEY:

Keith, your second edition has a dedicated chapter on RAGAS implementation. Can you give us a quick overview?

KEITH:

Chapter 9 is titled "Evaluating RAG quantitatively and with visualizations." It's a full hands-on code lab walking through RAGAS implementation — from basic metric calculation to production monitoring pipelines. I also interviewed Shahul Es, the Co-founder and CTO of RAGAS, which gave me insight into the architecture decisions behind the framework and where it's heading.

MORGAN:

Perfect. We'll reference that interview throughout. Let's dive into the code.

JORDAN:

Here's what caught my attention as an engineer. RAGAS can evaluate your RAG pipeline outputs without ground truth labels — and it achieves 95 percent agreement with human annotators on faithfulness scoring. That's 23 points higher than naive GPT-4 scoring approaches.

MORGAN:

How does that translate to actual implementation?

JORDAN:

You can literally pip install ragas, pass in your questions, contexts, and answers, and get back structured metric scores. Ten lines of code to replace weeks of manual annotation.

CASEY:

But isn't there a catch? Those traditional metrics like BLEU and ROUGE are deterministic — you get the same score every time. LLM-as-judge introduces variance.

JORDAN:

True, but BLEU and ROUGE correlate poorly with actual answer quality in RAG systems. They measure surface-level text similarity, not semantic faithfulness to retrieved context. RAGAS's approach — decomposing answers into atomic claims and verifying each one — captures what actually matters.

KEITH:

I want to clarify something important here. RAGAS isn't only a reference-free evaluation tool. It offers a comprehensive metric suite — some metrics like context recall do require ground truth. But the reference-free metrics are what make continuous production monitoring feasible, which is why we're focusing on them today.

MORGAN:

So you can use ground truth metrics during development, then switch to reference-free for production?

KEITH:

Exactly. That's the pattern Shahul recommended in our interview — and it's what inspired this episode. Different evaluation strategies for different stages of your pipeline.

CASEY:

Quick summary for engineers. RAGAS is an open-source Python framework — Apache 2.0 license — that evaluates RAG pipelines using LLMs as judges. Core reference-free metrics include faithfulness, answer relevancy, and context relevancy. It integrates with LangChain, LlamaIndex, and observability tools like LangSmith and LangFuse.

MORGAN:

What's the basic implementation look like?

CASEY:

Install with pip, create a dataset with your questions, contexts, and answers, call the evaluate function with your chosen metrics, and you get back a pandas DataFrame with scores. The evaluate function handles the LLM calls internally.

JORDAN:

And it supports multiple LLM backends — OpenAI, Anthropic Claude, Google Gemini, Azure OpenAI, and local models through Ollama. So you're not locked into any vendor.

JORDAN:

Let's talk about why this matters for your engineering workflow. Traditional evaluation meant maintaining labeled test sets — which rot quickly when your knowledge base updates.

MORGAN:

What's the maintenance burden look like?

JORDAN:

Every time you update your vector store or change your chunking strategy, your ground truth answers might become stale. Teams were spending 20-plus hours per sprint just maintaining test datasets.

CASEY:

And running batch evaluations weekly means quality issues can ship to production and sit there for days before you catch them.

KEITH:

This is exactly what Shahul described as the core problem RAGAS solves. In our interview, he explained that enterprises need evaluation that runs continuously — not just in CI/CD gates, but monitoring production traffic in real-time.

JORDAN:

The numbers back this up. RAGAS processes over 5 million evaluations monthly across enterprise deployments. That scale is only possible because reference-free metrics don't require maintaining ground truth datasets.

MORGAN:

So what's the architecture for continuous monitoring?

JORDAN:

You sample production queries, run them through RAGAS asynchronously, aggregate scores, and alert when metrics drop below thresholds. We'll get into the implementation details later.

TAYLOR:

Let's establish the technical foundations. A RAG pipeline has two main components: a retriever that fetches relevant documents from your vector store, and a generator — typically an LLM — that synthesizes an answer from those documents.

MORGAN:

And evaluation needs to cover both components?

TAYLOR:

Right. RAGAS breaks this into distinct metrics. Faithfulness measures whether the generated answer is supported by the retrieved context — did the LLM hallucinate or stick to the facts? Answer relevancy checks if the answer actually addresses the user's question. Context precision and context recall evaluate the retriever's performance.

CASEY:

What's the difference between context precision and recall?

TAYLOR:

Context precision measures what fraction of retrieved chunks are actually relevant to the question — are you retrieving noise? Context recall measures what fraction of the information needed to answer the question was retrieved — did you miss important chunks?

KEITH:

In Chapter 9, I walk through each metric with code examples. What's interesting is how RAGAS computes these without ground truth. For faithfulness, it extracts atomic claims from the answer, then verifies each claim against the context using an LLM. For answer relevancy, it generates questions from the answer and measures semantic similarity to the original question.

MORGAN:

That reverse question generation is clever. Can you explain the implementation?

KEITH:

The LLM generates multiple hypothetical questions that the answer would satisfy. Then RAGAS computes embedding similarity between those generated questions and the original question. High similarity means the answer is relevant; low similarity means it's off-topic or incomplete.

TAYLOR:

This multi-step approach is what enables reference-free evaluation. You're using the LLM's reasoning capabilities to decompose and verify, rather than comparing against static labels.

TAYLOR:

Let's compare the major evaluation frameworks from an engineering perspective. RAGAS is the most popular open-source option — over 6,000 GitHub stars, active development, and broad framework integration.

CASEY:

What about DeepEval? I've seen teams use that for CI/CD integration.

TAYLOR:

DeepEval is built specifically for test-driven LLM development. It has pytest-style assertions, built-in CI/CD hooks, and a dashboard for tracking metrics over time. The trade-off is more setup complexity and a steeper learning curve.

MORGAN:

And TruLens?

TAYLOR:

TruLens — now part of Snowflake — focuses on real-time feedback functions within LangChain and LlamaIndex workflows. It's great for immediate guardrails but creates tighter vendor coupling.

KEITH:

In the book, I focus on RAGAS, but Morgan, make a note: we should really cover those other two platforms in more depth. Let's add that to this season's lineup. We already have good coverage of RAGAS, but we could do a deep dive on TruLens and DeepEval, and even a whole session on when to use which one.

MORGAN:

That's a wonderful idea, Keith, and I'm marking it down right now! Evaluation is so critical to AI development that we'll definitely be coming back to this topic often. So tell us more about RAGAS, then.

KEITH:

Sounds great, Morgan. I can't wait to come back and talk more about evaluation! As I was saying, in the book I recommend RAGAS for most teams because of its flexibility and the research backing. OpenAI featured RAGAS at their DevDay event, which speaks to industry adoption. The EACL 2024 paper gives you confidence in the methodology.

CASEY:

So what's the decision framework?

TAYLOR:

Choose RAGAS if you want open-source flexibility, strong academic foundations, and broad integration support. Choose DeepEval if you need pytest-style testing and built-in CI/CD workflows. Choose TruLens if you're deep in the LangChain ecosystem and need real-time guardrails.

MORGAN:

Can you mix and match?

TAYLOR:

Absolutely. Many teams use RAGAS for production monitoring and DeepEval for pre-deployment testing. The metrics are compatible — you can track the same quality dimensions across both.

ALEX:

Let's look at the RAGAS architecture in detail. The core abstraction is the Metric class. Each metric — faithfulness, answer relevancy, context precision — inherits from this base class and implements its own scoring logic.

MORGAN:

Walk us through the faithfulness pipeline.

ALEX:

First, the StatementExtractor component decomposes the generated answer into atomic statements. This uses an LLM call with a specific prompt to break paragraphs into individual factual claims.

CASEY:

What does that prompt look like?

ALEX:

It instructs the LLM to extract every distinct factual assertion, ignoring opinions or meta-commentary. The output is a list of strings, each representing one atomic claim.

ALEX:

Next, the NLIStatementVerifier takes each statement and checks if it's supported by, contradicted by, or neutral with respect to the retrieved context. This is essentially natural language inference at the claim level.

MORGAN:

So multiple LLM calls per evaluation?

ALEX:

Yes — that's the cost trade-off. For a single answer with five atomic statements, you're looking at one extraction call plus five verification calls. RAGAS supports batching and async execution to manage latency.

KEITH:

This is where Shahul's insights were valuable. He explained that the atomic decomposition is crucial — verifying entire paragraphs at once leads to fuzzy, unreliable scores. The granularity is what enables that 95 percent agreement with human annotators.
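
To make the scoring arithmetic concrete, here is a minimal, hypothetical sketch of that decompose-and-verify idea. The statement extraction and NLI verification steps are abstracted into plain Python callables; they are stand-ins for illustration, not RAGAS's actual classes.

```python
from typing import Callable, List

# Conceptual sketch only: in RAGAS, extraction and per-claim verification
# are LLM calls; here they are abstract callables so the math is visible.
def faithfulness_score(
    statements: List[str],
    is_supported_by_context: Callable[[str], bool],
) -> float:
    if not statements:
        return 0.0
    supported = sum(1 for s in statements if is_supported_by_context(s))
    return supported / len(statements)

# Toy check: 4 of 5 extracted claims verified against the context -> 0.8
claims = ["c1", "c2", "c3", "c4", "c5"]
print(faithfulness_score(claims, lambda s: s != "c5"))
```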

ALEX:

The answer relevancy pipeline is different. It uses reverse question generation — the LLM generates N hypothetical questions that the answer would address. Then RAGAS computes cosine similarity between the embedding of each generated question and the original question. The final score is the mean similarity.

CASEY:

What embedding model does it use?

ALEX:

Configurable — defaults to OpenAI's text-embedding-ada-002, but you can swap in any embedding model. The llm and embeddings parameters in the evaluate function let you customize both.
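
As a rough illustration of that similarity step, the sketch below uses toy vectors in place of real embeddings; it is not the framework's internal implementation.

```python
import numpy as np

# Hypothetical sketch: average cosine similarity between the original
# question embedding and the embeddings of questions generated from the answer.
def answer_relevancy(original_q: np.ndarray, generated_qs: list) -> float:
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return float(np.mean([cosine(original_q, g) for g in generated_qs]))

# Toy vectors standing in for real embedding outputs
q = np.array([1.0, 0.0])
gen = [np.array([0.9, 0.1]), np.array([0.8, 0.2])]
print(answer_relevancy(q, gen))  # close to 1.0 -> answer addresses the question
```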

ALEX:

Let's look at actual implementation patterns. Basic usage starts with installing ragas and datasets from pip.

MORGAN:

What's the minimal code to run an evaluation?

ALEX:

You create a Dataset object with four columns: question, answer, contexts, and optionally ground_truth. Then you call evaluate with your dataset and list of metrics.

```python
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy
from datasets import Dataset

data = {
    "question": ["What is the capital of France?"],
    "answer": ["Paris is the capital of France."],
    "contexts": [["France is a country in Europe. Paris is its capital city."]]
}

dataset = Dataset.from_dict(data)
results = evaluate(dataset, metrics=[faithfulness, answer_relevancy])
print(results.to_pandas())
```

CASEY:

That's surprisingly clean. What about configuring the LLM backend?

ALEX:

You pass an llm parameter — RAGAS supports LangChain's LLM abstractions, so you can use ChatOpenAI, ChatAnthropic, or any compatible model.

```python
from langchain_openai import ChatOpenAI
from ragas import evaluate
from ragas.metrics import faithfulness

llm = ChatOpenAI(model="gpt-4o", temperature=0)
results = evaluate(dataset, metrics=[faithfulness], llm=llm)
```

KEITH:

In Chapter 9, I show how to use local models through Ollama for cost-sensitive deployments. The trade-off is accuracy — smaller models are less reliable as judges — but for high-volume production monitoring, it can cut costs significantly.

MORGAN:

What about async evaluation for production workloads?

ALEX:

RAGAS supports async execution natively. You can also use the SingleTurnSample class for individual evaluations rather than batch processing.

```python
from ragas.metrics import faithfulness
from ragas import SingleTurnSample
import asyncio

sample = SingleTurnSample(
    user_input="What is the capital of France?",
    response="Paris is the capital of France.",
    retrieved_contexts=["France is a country in Europe. Paris is its capital."]
)

score = asyncio.run(faithfulness.single_turn_ascore(sample))
```

CASEY:

That's the pattern for real-time evaluation — process each query as it comes through.

SAM:

Let's talk about integrating RAGAS into your existing stack. Most teams are using LangChain or LlamaIndex for their RAG pipelines.

MORGAN:

What does LangChain integration look like?

SAM:

LangSmith has native RAGAS support. You can configure RAGAS metrics as evaluators and run them directly within LangSmith experiments. Results feed into the same dashboards you use for tracing and debugging.

```python
from langsmith import Client
from langsmith.evaluation import evaluate
from ragas.integrations.langsmith import RagasEvaluator

client = Client()
evaluator = RagasEvaluator(metrics=["faithfulness", "answer_relevancy"])

results = evaluate(
    lambda x: your_rag_chain.invoke(x["question"]),
    data="your-dataset-name",
    evaluators=[evaluator]
)
```

CASEY:

And LlamaIndex?

SAM:

Similar pattern — there's a dedicated RAGAS integration module. You pass your query engine and RAGAS handles extracting the components for evaluation.

KEITH:

The LangFuse integration is also worth mentioning. You can run RAGAS as a batch scorer on traces you've collected in LangFuse, which is great for retrospective analysis of production traffic.

MORGAN:

What about CI/CD integration?

SAM:

You can wrap RAGAS in pytest fixtures. Define threshold assertions — if faithfulness drops below 0.85, fail the build. DeepEval makes this more explicit with its assert syntax, but you can achieve the same with RAGAS plus pytest.

```python
import pytest
from ragas import evaluate
from ragas.metrics import faithfulness

def test_rag_faithfulness():
    results = evaluate(test_dataset, metrics=[faithfulness])
    assert results["faithfulness"] >= 0.85, "Faithfulness below threshold"
```

SAM:

For production monitoring, the pattern is to sample queries, run async evaluation, and push scores to your observability platform — Datadog, Grafana, whatever you're using.

ALEX:

Let's address the elephant in the room — cost and latency. Multiple LLM calls per evaluation adds up.

MORGAN:

What are the numbers?

ALEX:

For faithfulness scoring with GPT-4o, expect roughly 5-10 LLM calls per answer depending on complexity. At current pricing, that's around $0.01-0.02 per evaluation. For high-volume production traffic, that's significant.

CASEY:

What are the optimization strategies?

ALEX:

First, sampling. You don't need to evaluate every production query — a 1-5 percent sample gives you statistical confidence while cutting costs dramatically.
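
A sampling gate can be as simple as a random draw against a configurable rate. This sketch assumes a hypothetical record dict and an in-memory list standing in for whatever queue feeds your evaluation workers.

```python
import random

SAMPLE_RATE = 0.02  # illustrative: evaluate roughly 2% of production traffic

def maybe_sample_for_eval(record: dict, eval_queue: list) -> None:
    # `record` is assumed to carry the question, retrieved contexts, and answer.
    if random.random() < SAMPLE_RATE:
        eval_queue.append(record)
```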

ALEX:

Second, model selection. GPT-3.5-turbo or Claude Haiku as judges are 10-20x cheaper than frontier models. Accuracy drops slightly, but for monitoring trends rather than absolute scores, it's often acceptable.

KEITH:

In the book, I benchmark different judge models. GPT-4o is the gold standard at roughly 95 percent agreement with humans. GPT-3.5-turbo drops to around 85-88 percent. For many use cases, that's fine — you're looking for regressions, not precise scores.

ALEX:

Third, batching. RAGAS supports concurrent evaluation — you can process multiple samples in parallel to reduce wall-clock time even if total compute stays the same.

MORGAN:

What about caching?

ALEX:

Semantic caching helps if you have repeated or similar queries. You can cache evaluation results keyed on question-context-answer hashes. But be careful — cache invalidation when you update your retriever is tricky.
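
A minimal sketch of that cache-key idea, assuming the question, retrieved contexts, and answer are all plain strings:

```python
import hashlib

def eval_cache_key(question: str, contexts: list, answer: str) -> str:
    # Key evaluation results on the exact question/contexts/answer triple,
    # so a change in any component forces a fresh evaluation.
    payload = "\x1f".join([question, *contexts, answer])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```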

CASEY:

Any async patterns to share?

ALEX:

Use asyncio.gather for concurrent evaluation across multiple samples. RAGAS's async methods play well with this pattern.

```python
import asyncio
from ragas.metrics import faithfulness

async def evaluate_batch(samples):
    tasks = [faithfulness.single_turn_ascore(s) for s in samples]
    return await asyncio.gather(*tasks)
```

TAYLOR:

Beyond the core metrics, RAGAS supports custom evaluation aspects. The AspectCritic class lets you define domain-specific quality dimensions.

MORGAN:

How does that work?

TAYLOR:

You provide a natural language definition of what you're measuring, and RAGAS uses an LLM to score against that definition.

```python
from ragas.metrics import AspectCritic

safety_critic = AspectCritic(
    name="safety",
    definition="Does the response avoid harmful, dangerous, or inappropriate content?"
)

tone_critic = AspectCritic(
    name="professional_tone",
    definition="Is the response written in a professional, business-appropriate tone?"
)
```

CASEY:

So you can add domain-specific checks without writing custom scoring logic?

TAYLOR:

Exactly. Financial services teams add compliance aspects — "Does the response include required disclaimers?" Healthcare teams check for medical accuracy caveats. The natural language definition is flexible.

KEITH:

Shahul mentioned this as one of RAGAS's most powerful features for enterprise adoption. Every domain has specific quality requirements that generic metrics don't capture. AspectCritic bridges that gap.

MORGAN:

What about multi-turn evaluation? Most RAG applications have conversational context.

TAYLOR:

RAGAS recently added multi-turn support. You can evaluate entire conversation threads, tracking how faithfulness and relevancy evolve across turns. This is crucial for chatbot applications where context accumulates.

SAM:

When metrics drop, you need to diagnose why. RAGAS provides detailed traces that show the intermediate steps — extracted statements, individual verification results, generated questions.

MORGAN:

How do you access those?

SAM:

The evaluate function returns a Result object with a to_pandas method. Each row includes not just the final scores but the reasoning traces. You can see exactly which statements failed faithfulness verification.

CASEY:

What visualization patterns work well?

SAM:

For production monitoring, time-series plots of metric distributions catch regressions. Box plots by query type or user segment reveal where your pipeline struggles. Histograms of faithfulness scores show if you have a long tail of bad responses.

KEITH:

In Chapter 9, I include Matplotlib and Plotly code for these visualizations. But honestly, the most valuable debugging technique is looking at individual failures. Filter to your lowest-scoring samples and read the traces — you'll quickly see patterns.
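
For example, assuming `results` came from ragas.evaluate and the column names match the dataset fields used earlier (adjust to whatever your version produces), a quick triage pass might look like this:

```python
import matplotlib.pyplot as plt

# `results` is the Result object returned by ragas.evaluate()
df = results.to_pandas()

# Read the traces for the lowest-scoring samples first
worst = df.sort_values("faithfulness").head(10)
print(worst[["question", "faithfulness", "answer_relevancy"]])

# Check for a long tail of bad responses, not just the mean
df["faithfulness"].plot(kind="hist", bins=20, title="Faithfulness distribution")
plt.show()
```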

SAM:

Another technique — compare metric distributions before and after changes. If you update your chunking strategy or switch embedding models, run RAGAS on a held-out set and compare the distributions statistically.

MORGAN:

A/B testing for RAG quality?

SAM:

Exactly. RAGAS gives you the measurement framework to make retrieval and generation changes with confidence.
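
One way to make that comparison concrete is a non-parametric test over the two score distributions. The scores below are made up for illustration; in practice they would come from two RAGAS runs on the same held-out set.

```python
import numpy as np
from scipy.stats import mannwhitneyu

# Illustrative faithfulness scores: before vs. after a chunking change
before = np.array([0.92, 0.88, 0.95, 0.81, 0.90, 0.87, 0.93, 0.84])
after = np.array([0.90, 0.79, 0.85, 0.76, 0.88, 0.82, 0.80, 0.83])

stat, p_value = mannwhitneyu(before, after, alternative="two-sided")
print(f"U={stat:.1f}, p={p_value:.3f}")  # small p suggests a real distribution shift
```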

SAM:

Let's sketch a production monitoring architecture. You need four components: sampling, evaluation, aggregation, and alerting.

MORGAN:

Walk us through each.

SAM:

Sampling: intercept a percentage of production queries. Store the question, retrieved contexts, and generated answer. This can be synchronous with your request path or async via message queue.

SAM:

Evaluation: run RAGAS on sampled queries. This should be async and decoupled from your request path — you don't want evaluation latency affecting user experience. A worker pool consuming from a queue works well.

SAM:

Aggregation: compute rolling statistics — mean faithfulness over the last hour, p95 answer relevancy over the last day. Store these in your time-series database.

SAM:

Alerting: define thresholds and alert when metrics cross them. "Faithfulness dropped below 0.8 for the last 30 minutes" triggers a PagerDuty alert.
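
A compact, illustrative sketch of the aggregation and alerting pieces. In practice the scores would arrive from your async evaluation workers and the alert would go to PagerDuty or Slack rather than stdout.

```python
from collections import deque
from statistics import mean

FAITHFULNESS_THRESHOLD = 0.80   # illustrative threshold
window = deque(maxlen=500)      # most recent sampled evaluation scores

def alert(message: str) -> None:
    print("ALERT:", message)    # stand-in for a real alerting integration

def record_score(score: float) -> None:
    window.append(score)
    # Wait for a minimum sample size before alerting on the rolling mean
    if len(window) >= 50 and mean(window) < FAITHFULNESS_THRESHOLD:
        alert(f"Rolling faithfulness {mean(window):.2f} below {FAITHFULNESS_THRESHOLD}")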

KEITH:

This is the architecture I recommend in the book. The key insight from Shahul is that evaluation should be continuous infrastructure, not a one-time check. Build it like you'd build any other monitoring system.

CASEY:

What about cost at scale?

SAM:

At 1 percent sampling with 100,000 daily queries, you're evaluating 1,000 queries per day. At $0.01 per evaluation, that's $10/day or $300/month. Reasonable for the insight you get.

MORGAN:

And you can adjust sampling rate based on cost constraints or confidence requirements.

SAM:

Exactly. Start at 1 percent, increase if you need tighter monitoring, decrease if costs become an issue.

ALEX:

Let's cover mistakes we see teams make. First: evaluating with the same model that generated the answer.

MORGAN:

Why is that a problem?

ALEX:

Narcissistic bias. LLMs tend to rate their own outputs higher. If your RAG uses GPT-4 for generation, use a different model — or at least a different provider — for evaluation.

CASEY:

What else?

ALEX:

Ignoring context window limits. If your retrieved contexts exceed the judge model's context window, RAGAS will truncate or error. Monitor your context sizes and chunk appropriately.
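
A rough pre-flight check along those lines, assuming tiktoken is installed and using an illustrative token budget rather than a documented RAGAS limit:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def fits_judge_context(contexts: list, answer: str, budget: int = 100_000) -> bool:
    # Count tokens in the retrieved contexts plus the answer before queueing
    total = sum(len(enc.encode(c)) for c in contexts) + len(enc.encode(answer))
    return total <= budget
```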

ALEX:

Third: not versioning your evaluation config. When you change metrics, thresholds, or judge models, you need to track those changes. Otherwise, you can't compare scores across time.

KEITH:

This is something I emphasize in the book — treat your evaluation pipeline like code. Version control your metric configurations, document threshold decisions, and maintain changelogs.

ALEX:

Fourth: over-relying on aggregate scores. A mean faithfulness of 0.9 sounds great, but if 5 percent of your responses have faithfulness below 0.5, those are the ones causing user complaints. Look at distributions, not just means.
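
Concretely, with the same DataFrame from results.to_pandas():

```python
# df = results.to_pandas(), as before
print(df["faithfulness"].mean())            # can look healthy on its own
print((df["faithfulness"] < 0.5).mean())    # fraction of badly unfaithful answers
print(df["faithfulness"].quantile(0.05))    # 5th-percentile score
```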

MORGAN:

What about the cold start problem?

ALEX:

When you first deploy RAGAS, you don't have historical baselines. Run evaluation on a representative sample first to establish what "normal" looks like before setting alert thresholds.

TAYLOR:

Let's talk about building effective test suites with RAGAS. You need three types of tests: regression tests, adversarial tests, and capability tests.

MORGAN:

Define each?

TAYLOR:

Regression tests verify that changes don't degrade quality. You maintain a fixed dataset of representative queries, run RAGAS after every change, and fail the build if scores drop significantly.

TAYLOR:

Adversarial tests probe failure modes. Create queries designed to trick your retriever — ambiguous questions, questions about topics not in your knowledge base, questions that require multi-hop reasoning. RAGAS scores on these reveal robustness.

TAYLOR:

Capability tests verify specific behaviors. "Can the system answer questions about pricing?" "Does it correctly refuse to answer about competitors?" Use AspectCritic to define these specific capabilities.

KEITH:

RAGAS also supports synthetic test data generation, which Shahul discussed in our interview. You can bootstrap test datasets from your document corpus automatically, which accelerates test suite creation.

CASEY:

How does synthetic generation work?

TAYLOR:

RAGAS can generate question-answer pairs from your documents using an LLM. You specify the question types — simple factual, multi-hop reasoning, summarization — and it produces a diverse test set.

```python
from ragas.testset import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context

generator = TestsetGenerator.from_langchain(llm, embeddings)

testset = generator.generate_with_langchain_docs(
    documents,
    test_size=100,
    distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25}
)
```

MORGAN:

That's powerful for bootstrapping evaluation before you have production data.

MORGAN:

Keith, let's talk more about what engineers can expect from Chapter 9.

KEITH:

The chapter is structured as a progressive code lab. We start with basic RAGAS installation and a minimal evaluation example. Then we build up — custom metrics, LangChain integration, async evaluation patterns, and finally a complete production monitoring pipeline.

CASEY:

What's the most technically dense section?

KEITH:

Probably the section on building custom metrics. We implement a domain-specific faithfulness variant that accounts for hedging language — statements like "might be" or "could be" get different treatment than definitive claims. That shows how to extend RAGAS's core abstractions.
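
For readers who want a lighter-weight starting point than subclassing, a similar idea can be approximated with AspectCritic. This is only an illustration, not the custom metric from the chapter.

```python
from ragas.metrics import AspectCritic

# Illustrative aspect: reward answers that keep hedged claims hedged
hedging_awareness = AspectCritic(
    name="hedging_awareness",
    definition=(
        "Does the response distinguish definitive factual claims from hedged "
        "statements such as 'might be' or 'could be', without presenting "
        "uncertain claims as certain?"
    )
)
```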

MORGAN:

And the Shahul Es interview — what technical insights did you capture?

KEITH:

He walked me through their roadmap. They're expanding RAGAS beyond RAG evaluation into agentic workflow assessment — evaluating tool use, planning quality, and multi-step reasoning. He also shared architecture decisions that didn't make it into the public docs, like why they chose specific prompts for statement extraction.

CASEY:

The book is available on Amazon. For engineers building production RAG systems, Chapter 9 is essential reading.

KEITH:

Thanks. And the code from the chapter is available in a GitHub repo linked in the book — you can run everything yourself.

MORGAN:

A quick shout-out to our sponsor, Memriq AI. They're an AI consultancy and content studio building tools and resources for AI practitioners.

CASEY:

This podcast is part of their mission to help engineers build better AI systems. For technical deep-dives, implementation guides, and architecture breakdowns, head to Memriq.ai.

SAM:

Before we wrap, let's discuss unsolved challenges. Multi-hop reasoning evaluation is hard — when answers require synthesizing facts across multiple documents, faithfulness verification gets complex.

MORGAN:

What's the technical limitation?

SAM:

Current RAGAS metrics verify claims against the full retrieved context. But multi-hop reasoning requires tracking the chain of inference — "Document A says X, Document B says Y, therefore Z." We don't have good automated methods to evaluate that reasoning chain.

CASEY:

What about multimodal RAG?

SAM:

Evaluating systems that retrieve and reason over images, tables, and text is nascent. RAGAS doesn't yet support image-based faithfulness verification. It's an active research area.

KEITH:

Shahul mentioned they're prototyping multimodal support. The core challenge is that image faithfulness requires vision-language models as judges, which adds cost and complexity.

SAM:

Finally, cross-lingual evaluation. If your knowledge base is in one language and your queries are in another, standard embedding-based relevancy metrics may fail. Multilingual embedding models help, but evaluation accuracy varies by language pair.

TAYLOR:

These are opportunities for contribution. RAGAS is open source — if you solve one of these problems, the community benefits.

MORGAN:

My takeaway: RAGAS makes RAG evaluation tractable. The atomic verification approach is elegant engineering that delivers real production value.

CASEY:

For me, the integration story is compelling. LangChain, LlamaIndex, LangSmith, LangFuse — RAGAS fits into modern ML stacks without friction.

JORDAN:

The 95 percent faithfulness agreement validates the LLM-as-judge approach. We can trust these metrics to catch real quality issues.

TAYLOR:

Custom metrics via AspectCritic unlock domain-specific evaluation that generic frameworks can't provide. That's the feature enterprises need.

ALEX:

The cost-optimization patterns — sampling, model selection, async processing — make continuous evaluation feasible at scale.

SAM:

And the production architecture we discussed gives you a blueprint to implement this week. Sampling, evaluation, aggregation, alerting — it's standard monitoring practice applied to AI quality.

KEITH:

My final thought: Shahul convinced me that evaluation is infrastructure, not an afterthought. The teams that build evaluation into their RAG systems from day one will iterate faster and ship with more confidence.

MORGAN:

That's a wrap on our RAGAS deep-dive. Thanks to Keith Bourne for sharing implementation insights from his book and interview with the RAGAS team.

KEITH:

Thanks for having me. Check out Chapter 9 for the full code walkthrough — and reach out if you have questions. Happy to help.

CASEY:

Go build something great. And evaluate it properly.

MORGAN:

Thanks for listening to Memriq Inference Digest - Engineering Edition. Until next time, keep shipping quality AI.

CASEY:

Cheers!

About the Podcast

The Memriq AI Inference Brief – Engineering Edition
RAG pipelines, agent memory, knowledge graphs — the technical details that matter. Let's dig in.

About your host

Memriq AI

Keith Bourne (LinkedIn handle – keithbourne) is a Staff LLM Data Scientist at Magnifi by TIFIN (magnifi.com), founder of Memriq AI, and host of The Memriq Inference Brief—a weekly podcast exploring RAG, AI agents, and memory systems for both technical leaders and practitioners. He has over a decade of experience building production machine learning and AI systems, working across diverse projects at companies ranging from startups to Fortune 50 enterprises. With an MBA from Babson College and a master's in applied data science from the University of Michigan, Keith has developed sophisticated generative AI platforms from the ground up using advanced RAG techniques, agentic architectures, and foundational model fine-tuning. He is the author of Unlocking Data with Generative AI and RAG (2nd edition, Packt Publishing)—many podcast episodes connect directly to chapters in the book.