RAG Retrieval Utilities¶

This module provides utilities for optimizing RAG retrieval results, including chunk merging and context formatting.

Overview¶

After retrieving chunks from a vector store, raw results often need optimization:

- Adjacent chunks from the same section should be merged for coherence
- Context should be formatted intelligently for LLM consumption
- Headings should be included dynamically based on content size

Quick Start¶

from dataknobs_bots.knowledge.retrieval import (
    ChunkMerger,
    ContextFormatter,
    MergerConfig,
    FormatterConfig,
)

# Create merger and formatter
merger = ChunkMerger(MergerConfig(max_merged_size=2000))
formatter = ContextFormatter(FormatterConfig(include_source=True))

# After retrieval (kb is an existing knowledge base; await requires an async context)
results = await kb.query("How do I configure auth?", k=10)

# Merge adjacent chunks
merged = merger.merge(results)

# Format for LLM context
context = formatter.format_merged(merged)
wrapped = formatter.wrap_for_prompt(context)

Chunk Merging¶

The Problem¶

Without merging, retrieval returns fragmented context:

[0.92] Chunk 5: "...continued from above. The second step is..."
[0.88] Chunk 3: "The first step is to..."
[0.85] Chunk 7: "Finally, the third step..."

The LLM sees disconnected snippets in the wrong order.

The Solution¶

ChunkMerger groups chunks by their heading path and source, then combines them:

[0.88] Authentication > Setup
The first step is to...

...continued from above. The second step is...

Finally, the third step...
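The grouping step above can be sketched in plain Python. This is a minimal illustration, not the library's implementation. It assumes each result is a dict with "text", "similarity", and "metadata" keys, and that the metadata carries hypothetical "source", "headings", and "chunk_index" entries; the real metadata keys may differ.

```python
from collections import defaultdict
from typing import Any


def group_adjacent_chunks(results: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Group results by (source, heading path) and join each group in document order."""
    groups: dict[tuple[str, tuple[str, ...]], list[dict[str, Any]]] = defaultdict(list)
    for result in results:
        meta = result["metadata"]
        # "source", "headings" and "chunk_index" are assumed metadata keys.
        groups[(meta["source"], tuple(meta["headings"]))].append(result)

    merged = []
    for (source, headings), chunks in groups.items():
        # Restore document order inside the group.
        chunks.sort(key=lambda c: c["metadata"]["chunk_index"])
        merged.append({
            "source": source,
            "heading_display": " > ".join(headings),
            "text": "\n\n".join(c["text"] for c in chunks),
            "avg_similarity": sum(c["similarity"] for c in chunks) / len(chunks),
        })
    # Most relevant groups first.
    merged.sort(key=lambda m: m["avg_similarity"], reverse=True)
    return merged


def _meta(index: int) -> dict[str, Any]:
    # Helper for the demo data only.
    return {"source": "docs/auth.md", "headings": ["Authentication", "Setup"], "chunk_index": index}


results = [
    {"text": "...continued from above. The second step is...", "similarity": 0.92, "metadata": _meta(5)},
    {"text": "The first step is to...", "similarity": 0.88, "metadata": _meta(3)},
    {"text": "Finally, the third step...", "similarity": 0.85, "metadata": _meta(7)},
]

merged = group_adjacent_chunks(results)
for group in merged:
    print(f"[{group['avg_similarity']:.2f}] {group['heading_display']}")
    print(group["text"])
```

Sorting by a chunk index inside each group is what puts "The first step" back ahead of the higher-scoring second step.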

MergerConfig¶

from dataknobs_bots.knowledge.retrieval import MergerConfig

config = MergerConfig(
    max_merged_size=2000,   # Maximum merged chunk size in chars
    preserve_order=True      # Maintain document order within groups
)
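How ChunkMerger enforces max_merged_size is not spelled out here. One plausible reading is greedy packing: keep appending chunks to the current merged piece until the next one would push it past the limit. The helper below, pack_by_size, is a hypothetical name used only for this sketch.

```python
def pack_by_size(texts: list[str], max_merged_size: int = 2000,
                 separator: str = "\n\n") -> list[str]:
    """Greedily join consecutive texts so each piece stays within max_merged_size chars.

    A single text longer than the limit is kept whole rather than split.
    """
    packed: list[str] = []
    current = ""
    for text in texts:
        candidate = text if not current else current + separator + text
        if current and len(candidate) > max_merged_size:
            # Adding this text would exceed the limit: close the current piece.
            packed.append(current)
            current = text
        else:
            current = candidate
    if current:
        packed.append(current)
    return packed


pieces = pack_by_size(["a" * 900, "b" * 900, "c" * 900], max_merged_size=2000)
print([len(piece) for piece in pieces])  # [1802, 900]
```

The first two texts fit together (900 + 2 + 900 characters); the third starts a new piece.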

ChunkMerger¶

from dataknobs_bots.knowledge.retrieval import ChunkMerger, MergerConfig

merger = ChunkMerger(MergerConfig(max_merged_size=2000))

# Merge search results (kb is an existing knowledge base; run inside an async function)
results = await kb.query("authentication setup", k=10)
merged = merger.merge(results)

for chunk in merged:
    print(f"[{chunk.avg_similarity:.2f}] {chunk.heading_display}")
    print(f"Merged {len(chunk.chunks)} chunks")
    print(chunk.text)
    print()

MergedChunk Fields¶

merged_chunk.text           # Combined text content
merged_chunk.source         # Source file path
merged_chunk.heading_path   # ["Section", "Subsection"]
merged_chunk.heading_display # "Section > Subsection"
merged_chunk.chunks         # List of original chunks
merged_chunk.avg_similarity # Average similarity score
merged_chunk.content_length # Total content length
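Several of these fields can be derived from the original chunks. The dataclass below is a hypothetical stand-in (MergedChunkSketch is not part of the library) showing one way the derived values could relate to each other. It assumes chunks are joined with a blank line and that content_length counts the joined text; the real MergedChunk may compute these differently.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class MergedChunkSketch:
    """Hypothetical stand-in for MergedChunk with derived fields as properties."""

    source: str
    heading_path: list[str]
    chunks: list[dict[str, Any]] = field(default_factory=list)

    @property
    def text(self) -> str:
        # Assumed separator: a blank line between original chunks.
        return "\n\n".join(chunk["text"] for chunk in self.chunks)

    @property
    def heading_display(self) -> str:
        return " > ".join(self.heading_path)

    @property
    def avg_similarity(self) -> float:
        if not self.chunks:
            return 0.0
        return sum(chunk["similarity"] for chunk in self.chunks) / len(self.chunks)

    @property
    def content_length(self) -> int:
        return len(self.text)


sketch = MergedChunkSketch(
    source="docs/auth.md",
    heading_path=["Section", "Subsection"],
    chunks=[
        {"text": "abc", "similarity": 0.9},
        {"text": "de", "similarity": 0.7},
    ],
)
print(sketch.heading_display)
print(sketch.content_length)
```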

Converting Back to Results¶

# Convert merged chunks to standard result format
result_list = merger.to_result_list(merged)

for result in result_list:
    print(result["text"])
    print(result["similarity"])
    print(result["metadata"]["merged_count"])
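As a rough picture of what this conversion involves, the sketch below flattens one merged group into a result-shaped dict. The "text", "similarity", and "merged_count" keys come from the example above; the input dict layout and the "source" and "headings" metadata keys are assumptions made for illustration.

```python
from typing import Any


def merged_to_result(merged: dict[str, Any]) -> dict[str, Any]:
    """Flatten one merged group (a plain dict in this sketch) into a result dict."""
    return {
        "text": merged["text"],
        "similarity": merged["avg_similarity"],
        "metadata": {
            "source": merged["source"],          # assumed key
            "headings": merged["heading_path"],  # assumed key
            "merged_count": len(merged["chunks"]),
        },
    }


example = {
    "text": "The first step is to...\n\nFinally, the third step...",
    "avg_similarity": 0.865,
    "source": "docs/auth.md",
    "heading_path": ["Authentication", "Setup"],
    "chunks": [{"text": "The first step is to..."}, {"text": "Finally, the third step..."}],
}
result = merged_to_result(example)
print(result["metadata"]["merged_count"])  # 2
```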

Context Formatting¶

The Problem¶

Context window tokens are expensive. Including full heading paths for every chunk wastes tokens:

[1] Getting Started > Installation > System Requirements > Dependencies > Python
Just the content about Python requirements.

The Solution¶

ContextFormatter applies dynamic heading inclusion based on content size:

- Small chunks (< 200 chars): Full heading path (need context)
- Medium chunks (200 to < 800 chars): Last 2 heading levels
- Large chunks (>= 800 chars): No headings (self-contained)
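The size rule above can be expressed as a small function. This is a sketch of the rule as described, not the formatter's actual code; select_headings is a hypothetical name, and treating the thresholds as exclusive upper bounds follows the "below this" wording in the config comments.

```python
def select_headings(heading_path: list[str], content_length: int,
                    small_chunk_threshold: int = 200,
                    medium_chunk_threshold: int = 800) -> list[str]:
    """Pick how much of the heading path to show for a chunk of the given size."""
    if content_length < small_chunk_threshold:
        # Small chunk: needs the full path for context.
        return list(heading_path)
    if content_length < medium_chunk_threshold:
        # Medium chunk: the last two levels are enough.
        return heading_path[-2:]
    # Large chunk: assumed to be self-contained.
    return []


path = ["Getting Started", "Installation", "System Requirements", "Dependencies", "Python"]
print(" > ".join(select_headings(path, 50)))    # full path
print(" > ".join(select_headings(path, 500)))   # Dependencies > Python
print(select_headings(path, 1200))              # []
```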

FormatterConfig¶

from dataknobs_bots.knowledge.retrieval import FormatterConfig

config = FormatterConfig(
    small_chunk_threshold=200,   # Full headings below this
    medium_chunk_threshold=800,  # Partial headings below this
    include_scores=False,        # Show similarity scores
    include_source=True,         # Show source file
    group_by_source=False,       # Group chunks by file
)

ContextFormatter¶

from dataknobs_bots.knowledge.retrieval import ContextFormatter, FormatterConfig

formatter = ContextFormatter(FormatterConfig(
    small_chunk_threshold=200,
    include_scores=True,
    include_source=True
))

# Format standard results
context = formatter.format(results)

# Format merged chunks
context = formatter.format_merged(merged_chunks)

# Wrap in tags for insertion into the prompt
wrapped = formatter.wrap_for_prompt(context, tag="knowledge_base")

Output Example¶

<knowledge_base>
[1] [0.92] Authentication > OAuth 2.0
Configure OAuth by setting the client ID and secret in your
environment variables. The callback URL should point to your
application's auth endpoint.
(Source: docs/auth.md)

---

[2] [0.88] Getting Started
First, install the package using pip. Then configure your
API keys in the environment or config file.
(Source: docs/quickstart.md)
</knowledge_base>
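To make the layout of this output concrete, here is a small stand-alone sketch that produces the same shape: a numbered entry with optional score, heading line, text, optional source line, a "---" separator, and a wrapping tag. The function names mirror the ones on this page but are local re-implementations for illustration, and the "headings" and "source" metadata keys are assumptions.

```python
from typing import Any


def format_entry(index: int, result: dict[str, Any],
                 include_scores: bool = True, include_source: bool = True) -> str:
    """Render one result as '[n] [score] Heading > Path', text, and source line."""
    meta = result["metadata"]
    prefix = f"[{index}]"
    if include_scores:
        prefix += f" [{result['similarity']:.2f}]"
    heading = " > ".join(meta.get("headings", []))
    lines = [f"{prefix} {heading}".rstrip(), result["text"]]
    if include_source:
        lines.append(f"(Source: {meta['source']})")
    return "\n".join(lines)


def format_results(results: list[dict[str, Any]], **options: bool) -> str:
    """Join numbered entries with a horizontal-rule separator."""
    entries = [format_entry(i, r, **options) for i, r in enumerate(results, start=1)]
    return "\n\n---\n\n".join(entries)


def wrap_for_prompt(context: str, tag: str = "knowledge_base") -> str:
    """Wrap the context in an XML-style tag."""
    return f"<{tag}>\n{context}\n</{tag}>"


demo_results = [
    {"text": "Configure OAuth by setting the client ID and secret.", "similarity": 0.92,
     "metadata": {"source": "docs/auth.md", "headings": ["Authentication", "OAuth 2.0"]}},
    {"text": "First, install the package using pip.", "similarity": 0.88,
     "metadata": {"source": "docs/quickstart.md", "headings": ["Getting Started"]}},
]
wrapped = wrap_for_prompt(format_results(demo_results))
print(wrapped)
```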

Grouping by Source¶

formatter = ContextFormatter(FormatterConfig(group_by_source=True))
context = formatter.format(results)

Output:

## Source: docs/auth.md

[1] Authentication > OAuth 2.0
Content here...

[2] Authentication > API Keys
More content...

---

## Source: docs/setup.md

[3] Installation
Setup content...
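The grouped layout above can be reproduced with a few lines of plain Python. This sketch assumes results carry hypothetical "source" and "headings" metadata keys and that entry numbering continues across sources, as the sample output suggests; it is not the formatter's actual implementation.

```python
from typing import Any


def format_grouped_by_source(results: list[dict[str, Any]]) -> str:
    """Group results under a '## Source: ...' header per file, numbering continuously."""
    by_source: dict[str, list[dict[str, Any]]] = {}
    for result in results:
        # dicts keep insertion order, so sources appear in first-seen order.
        by_source.setdefault(result["metadata"]["source"], []).append(result)

    sections = []
    index = 1
    for source, group in by_source.items():
        blocks = [f"## Source: {source}"]
        for result in group:
            heading = " > ".join(result["metadata"]["headings"])
            blocks.append(f"[{index}] {heading}\n{result['text']}")
            index += 1
        sections.append("\n\n".join(blocks))
    return "\n\n---\n\n".join(sections)


grouped_demo = [
    {"text": "Content here...",
     "metadata": {"source": "docs/auth.md", "headings": ["Authentication", "OAuth 2.0"]}},
    {"text": "Setup content...",
     "metadata": {"source": "docs/setup.md", "headings": ["Installation"]}},
    {"text": "More content...",
     "metadata": {"source": "docs/auth.md", "headings": ["Authentication", "API Keys"]}},
]
grouped = format_grouped_by_source(grouped_demo)
print(grouped)
```

Note that the third result is pulled up next to the first because they share a source, so the entry numbers follow the grouped order rather than the retrieval order.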

Complete Integration¶

Basic RAG Pipeline¶

from dataknobs_bots.knowledge.retrieval import (
    ChunkMerger,
    ContextFormatter,
    MergerConfig,
    FormatterConfig,
)

class RAGKnowledgeBase:
    def __init__(self, vector_store):
        self.vector_store = vector_store
        self.merger = ChunkMerger(MergerConfig(max_merged_size=2000))
        self.formatter = ContextFormatter(FormatterConfig(
            include_scores=False,
            include_source=True
        ))

    async def query(self, query: str, k: int = 10):
        # Search vector store
        results = await self.vector_store.search(query, k=k)

        # Merge adjacent chunks
        merged = self.merger.merge(results)

        return merged

    def format_context(self, merged_chunks, wrap=True):
        context = self.formatter.format_merged(merged_chunks)
        if wrap:
            return self.formatter.wrap_for_prompt(context)
        return context

In DynaBot Chat¶

class EnhancedBot(DynaBot):
    async def _build_message_with_context(self, message, rag_query=None):
        if self.knowledge_base:
            query = rag_query or message
            # query() returns merged chunks, which format_context() expects
            merged = await self.knowledge_base.query(query, k=10)
            context = self.knowledge_base.format_context(merged)

            return f"{context}\n\nUser: {message}"

        return message

API Reference¶

MergerConfig¶

@dataclass
class MergerConfig:
    max_merged_size: int = 2000   # Maximum merged chunk size
    preserve_order: bool = True    # Maintain document order

MergedChunk¶

@dataclass
class MergedChunk:
    text: str                      # Combined text
    source: str                    # Source file
    heading_path: list[str]        # Heading hierarchy
    heading_display: str           # Formatted display
    chunks: list[dict[str, Any]]   # Original chunks
    avg_similarity: float          # Average similarity
    content_length: int            # Total length

ChunkMerger¶

class ChunkMerger:
    def __init__(self, config: MergerConfig | None = None):
        """Initialize with optional configuration."""

    def merge(self, results: list[dict[str, Any]]) -> list[MergedChunk]:
        """Merge search results by shared heading path."""

    def to_result_list(self, merged: list[MergedChunk]) -> list[dict[str, Any]]:
        """Convert merged chunks back to result format."""

FormatterConfig¶

@dataclass
class FormatterConfig:
    small_chunk_threshold: int = 200    # Full headings below this
    medium_chunk_threshold: int = 800   # Partial headings below this
    include_scores: bool = False        # Show similarity scores
    include_source: bool = True         # Show source file
    group_by_source: bool = False       # Group by source file

ContextFormatter¶

class ContextFormatter:
    def __init__(self, config: FormatterConfig | None = None):
        """Initialize with optional configuration."""

    def format(self, results: list[dict[str, Any]]) -> str:
        """Format search results for LLM context."""

    def format_merged(self, merged_chunks: list[MergedChunk]) -> str:
        """Format merged chunks for LLM context."""

    def wrap_for_prompt(self, context: str, tag: str = "knowledge_base") -> str:
        """Wrap context in XML tags for prompt injection."""

Related¶

- Query Processing - Query transformation and expansion
- User Guide - Complete bot usage guide
