Markdown Chunking for RAG Applications¶
This package provides comprehensive utilities for splitting and chunking markdown documents into semantically meaningful chunks suitable for RAG (Retrieval-Augmented Generation) applications.
Overview¶
The markdown chunking system consists of three main components:
- **Parser** (`md_parser.py`): Converts markdown text into a Tree structure that preserves heading hierarchy
- **Chunker** (`md_chunker.py`): Generates chunks from the Tree structure with configurable parameters
- **Streaming Processor** (`md_streaming.py`): Handles large documents with memory management
Key Features¶
- Preserves heading hierarchy: Each chunk maintains the full path of headings from root to its content
- Flexible heading inclusion: Include headings in text, metadata, both, or neither
- Multiple output formats: Markdown, plain text, or structured JSON
- Configurable chunk sizing: Control maximum chunk size with smart boundary detection
- Streaming support: Process large documents without loading entirely into memory
- Tree-based structure: Uses the `Tree` data structure from `dataknobs_structures`
Installation¶
The markdown chunking utilities are part of the `dataknobs-xization` package.
Quick Start¶
Basic Usage¶
```python
from dataknobs_xization import parse_markdown, chunk_markdown_tree

# Parse markdown into a tree structure
markdown_text = """
# Introduction
This is the introduction.
## Background
Some background information.
"""
tree = parse_markdown(markdown_text)

# Generate chunks
chunks = chunk_markdown_tree(tree, max_chunk_size=500)

for chunk in chunks:
    print(f"Chunk {chunk.metadata.chunk_index}:")
    print(f"Headings: {chunk.metadata.get_heading_path()}")
    print(f"Text: {chunk.text}")
    print()
```
Streaming Large Files¶
```python
from dataknobs_xization import stream_markdown_file

# Process a large file incrementally
for chunk in stream_markdown_file("large_document.md", max_chunk_size=1000):
    # Process each chunk as it becomes available
    store_in_database(chunk.text, chunk.metadata)
```
Architecture¶
Tree Structure¶
The parser builds a Tree where:

- The root node represents the document
- Heading nodes are parents to their sub-headings and body text
- Body text nodes are leaf nodes containing the actual content
Example structure:

```
ROOT
├─ H1: Introduction
│   ├─ Body: "This is the introduction."
│   └─ H2: Background
│       └─ Body: "Some background information."
```
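The hierarchy above can be sketched with a minimal stack-based builder. This is an illustrative toy, not the package's actual parser: a stack of open headings determines where each new heading or body line attaches.

```python
# Toy sketch of heading-hierarchy parsing (not the package's implementation).
class Node:
    def __init__(self, text, level):
        self.text, self.level, self.children = text, level, []

def build_tree(lines):
    root = Node("ROOT", 0)
    stack = [root]  # open headings, outermost first
    for line in lines:
        if line.startswith("#"):
            level = len(line) - len(line.lstrip("#"))
            node = Node(line.lstrip("# ").rstrip(), level)
            # Close headings at the same or deeper level, then attach
            while len(stack) > 1 and stack[-1].level >= level:
                stack.pop()
            stack[-1].children.append(node)
            stack.append(node)
        elif line.strip():
            # Body text attaches to the innermost open heading
            stack[-1].children.append(Node(line.strip(), 0))
    return root

tree = build_tree(["# Introduction", "This is the introduction.",
                   "## Background", "Some background information."])
```

Running this on the example reproduces the structure shown in the diagram: `Background` becomes a child of `Introduction`, with each body line under its nearest heading.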
MarkdownNode Data Structure¶
Each tree node contains a `MarkdownNode` with:

- `text`: The text content
- `level`: Heading level (1-6) or 0 for body text
- `node_type`: `'heading'` or `'body'`
- `line_number`: Source line number (for debugging)
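As a rough illustration, the fields above could be modeled as a dataclass. This is a hypothetical mirror of the described structure; the `is_heading()`/`is_body()` helpers follow the usage shown in later examples, but the package's actual class may differ in detail.

```python
# Hypothetical mirror of the MarkdownNode fields described above.
from dataclasses import dataclass

@dataclass
class MarkdownNode:
    text: str
    level: int        # 1-6 for headings, 0 for body text
    node_type: str    # 'heading' or 'body'
    line_number: int  # source line number, for debugging

    def is_heading(self) -> bool:
        return self.node_type == "heading"

    def is_body(self) -> bool:
        return self.node_type == "body"

node = MarkdownNode("Introduction", 1, "heading", 2)
```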
Chunk Data Structure¶
Each chunk contains:
- `text`: The chunk text (with or without headings, based on configuration)
- `metadata`: A `ChunkMetadata` object with:
    - `headings`: List of heading texts from root to chunk
    - `heading_levels`: Corresponding heading levels
    - `line_number`: Starting line number
    - `chunk_index`: Sequential index
    - `chunk_size`: Size in characters
    - `custom`: Dictionary for custom metadata
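A hedged sketch of the metadata fields listed above: the field names come from this document, while `get_heading_path` (used in the Quick Start example) is assumed here to join the heading texts with a separator; the actual separator and class internals may differ.

```python
# Illustrative sketch of ChunkMetadata (field names from the docs;
# get_heading_path behavior is an assumption).
from dataclasses import dataclass, field

@dataclass
class ChunkMetadata:
    headings: list        # heading texts from root to chunk
    heading_levels: list  # corresponding heading levels
    line_number: int      # starting line number
    chunk_index: int      # sequential index
    chunk_size: int       # size in characters
    custom: dict = field(default_factory=dict)

    def get_heading_path(self, sep: str = " > ") -> str:
        # Join the root-to-chunk heading texts into one context string
        return sep.join(self.headings)

meta = ChunkMetadata(["Introduction", "Background"], [1, 2],
                     line_number=4, chunk_index=0, chunk_size=28)
```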
API Reference¶
Parsing¶
parse_markdown(source, max_line_length=None, preserve_empty_lines=False)¶
Parse markdown content into a tree structure.
Parameters:
- `source`: Markdown content as a string, file object, or line iterator
- `max_line_length`: Maximum length for body text lines (`None` for unlimited)
- `preserve_empty_lines`: Whether to preserve empty lines
Returns: Tree with root node containing the document structure
Example:
```python
from dataknobs_xization import parse_markdown

tree = parse_markdown("# Title\nBody text.", max_line_length=100)
```
MarkdownParser¶
Class for parsing markdown with configurable options.
```python
from dataknobs_xization import MarkdownParser

parser = MarkdownParser(max_line_length=1000, preserve_empty_lines=False)
tree = parser.parse(markdown_content)
```
Chunking¶
chunk_markdown_tree(tree, max_chunk_size=1000, heading_inclusion=HeadingInclusion.BOTH, chunk_format=ChunkFormat.MARKDOWN, combine_under_heading=True)¶
Generate chunks from a markdown tree.
Parameters:
- `tree`: Tree structure built from markdown
- `max_chunk_size`: Maximum size of chunk text in characters
- `heading_inclusion`: How to include headings (`BOTH`, `IN_TEXT`, `IN_METADATA`, `NONE`)
- `chunk_format`: Output format (`MARKDOWN`, `PLAIN`, `DICT`)
- `combine_under_heading`: Whether to combine body text under the same heading
Returns: List of Chunk objects
Example:
```python
from dataknobs_xization import chunk_markdown_tree, HeadingInclusion, ChunkFormat

chunks = chunk_markdown_tree(
    tree,
    max_chunk_size=500,
    heading_inclusion=HeadingInclusion.BOTH,
    chunk_format=ChunkFormat.MARKDOWN
)
```
MarkdownChunker¶
Class for chunking with configurable parameters.
```python
from dataknobs_xization import MarkdownChunker, HeadingInclusion

chunker = MarkdownChunker(
    max_chunk_size=1000,
    heading_inclusion=HeadingInclusion.BOTH
)
chunks = list(chunker.chunk(tree))
```
Text Splitting Strategy¶
When body text under a single heading exceeds `max_chunk_size`, the chunker splits it using a priority-based boundary detection strategy. Within each window of `max_chunk_size` characters, it searches backward for the best available break point:

- Paragraph boundary (`\n\n`): highest priority; preserves paragraph-level coherence
- Sentence boundary (`.`, `!`, or `?` followed by a space or newline): keeps sentences intact
- Word boundary (space): avoids splitting mid-word
- Hard cut at `max_chunk_size`: only when no boundary exists in the window

The backward search (last occurrence within the window) ensures each chunk is as large as possible while still ending at a clean boundary. Atomic constructs (code blocks, tables, lists) are never split, even if they exceed `max_chunk_size`.
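The priority-based backward search can be sketched in a few lines. This is assumed behavior based on the description above, not the package's exact code, and it omits the atomic-construct handling:

```python
# Sketch of priority-based boundary detection for text splitting.
import re

def find_break(window: str) -> int:
    """Return the index to cut `window` at, preferring clean boundaries."""
    # 1. Paragraph boundary: last blank line in the window
    pos = window.rfind("\n\n")
    if pos > 0:
        return pos + 2
    # 2. Sentence boundary: last punctuation followed by space/newline
    matches = list(re.finditer(r"[.!?][ \n]", window))
    if matches:
        return matches[-1].end()
    # 3. Word boundary: last space
    pos = window.rfind(" ")
    if pos > 0:
        return pos + 1
    # 4. Hard cut: no boundary found in the window
    return len(window)

def split_text(text: str, max_chunk_size: int) -> list:
    chunks = []
    while len(text) > max_chunk_size:
        cut = find_break(text[:max_chunk_size])
        chunks.append(text[:cut].rstrip())
        text = text[cut:].lstrip()
    if text:
        chunks.append(text)
    return chunks
```

Because each step searches for the *last* boundary inside the window, chunks grow as close to `max_chunk_size` as the text allows before falling back to a lower-priority break.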
Streaming¶
stream_markdown_file(file_path, max_chunk_size=1000, heading_inclusion=HeadingInclusion.BOTH, chunk_format=ChunkFormat.MARKDOWN)¶
Stream chunks from a markdown file.
Parameters:
- `file_path`: Path to the markdown file
- `max_chunk_size`: Maximum size of chunk text
- `heading_inclusion`: How to include headings
- `chunk_format`: Output format
Yields: Chunk objects
Example:
```python
from dataknobs_xization import stream_markdown_file

for chunk in stream_markdown_file("document.md", max_chunk_size=500):
    print(chunk.to_dict())
```
stream_markdown_string(content, ...)¶
Stream chunks from a markdown string.
```python
from dataknobs_xization import stream_markdown_string

markdown = "# Title\nBody text."
for chunk in stream_markdown_string(markdown):
    print(chunk.text)
```
StreamingMarkdownProcessor and AdaptiveStreamingProcessor¶
Classes for streaming processing with memory management.
```python
from dataknobs_xization import AdaptiveStreamingProcessor

processor = AdaptiveStreamingProcessor(
    max_chunk_size=1000,
    memory_limit_nodes=10000,
    adaptive_threshold=0.8
)
chunks = list(processor.process_file("large_document.md"))
```
Command-Line Interface¶
The package includes a CLI for testing and demonstration.
Commands¶
info - Show document information¶
Output:

```
Document Information
==================================================
Total nodes: 49
Heading nodes: 24
Body text nodes: 24
Tree depth: 2

Heading levels:
  Level 1: 3
  Level 2: 7
  ...
```
chunk - Chunk a markdown document¶
```shell
# Basic chunking
uv run python packages/xization/scripts/md_cli.py chunk document.md

# With custom parameters
uv run python packages/xization/scripts/md_cli.py chunk document.md \
    --max-size 500 \
    --headings both \
    --show-metadata

# Output as JSON
uv run python packages/xization/scripts/md_cli.py chunk document.md \
    --output-format json \
    --output chunks.json
```
parse - Parse and show tree structure¶
CLI Options¶
chunk command:
- `--max-size`: Maximum chunk size in characters (default: 1000)
- `--headings`: How to include headings: `both`, `text`, `metadata`, `none` (default: `both`)
- `--output-format`: Output format: `markdown`, `plain`, `json` (default: `markdown`)
- `--output`: Output file (default: stdout)
- `--show-metadata`: Show chunk metadata
- `--separator`: Separator between chunks (default: `\n---\n`)
Configuration Options¶
HeadingInclusion¶
Controls how headings are included in chunks:
- `HeadingInclusion.BOTH`: Include in both text and metadata (default)
- `HeadingInclusion.IN_TEXT`: Include only in chunk text
- `HeadingInclusion.IN_METADATA`: Include only in metadata
- `HeadingInclusion.NONE`: Don't include headings
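A rough sketch of what the four modes imply for chunk assembly. String tags stand in for the enum values, and heading levels are simplified to list position; this is not the package's assembly code:

```python
# Toy illustration of heading-inclusion modes ("both", "text",
# "metadata", "none" stand in for the HeadingInclusion enum values).
def assemble(headings, body, mode):
    text = body
    meta = {}
    if mode in ("both", "text"):
        # Prepend the root-to-chunk heading path as markdown headings
        path = "\n".join(f"{'#' * (i + 1)} {h}" for i, h in enumerate(headings))
        text = path + "\n" + body
    if mode in ("both", "metadata"):
        # Record the heading path in metadata instead of (or as well as) text
        meta["headings"] = list(headings)
    return text, meta

text, meta = assemble(["Intro"], "Body.", "metadata")
```

The `"metadata"` mode keeps the chunk text clean while preserving the heading context for filtering or ranking, which is the pattern used in the Contextual Retrieval example below.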
ChunkFormat¶
Controls output format:
- `ChunkFormat.MARKDOWN`: Include headings as markdown (e.g., `# Title`)
- `ChunkFormat.PLAIN`: Plain text without markdown formatting
- `ChunkFormat.DICT`: Return as a dictionary
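A toy illustration of the three formats, simplified to a single level-1 heading. String tags stand in for the enum values; the package's actual renderer is not shown here:

```python
# Toy illustration of output formats ("markdown", "plain", "dict"
# stand in for the ChunkFormat enum values).
def render(heading, body, fmt):
    if fmt == "markdown":
        return f"# {heading}\n{body}"   # heading kept as markdown
    if fmt == "plain":
        return f"{heading}\n{body}"     # heading text without markup
    return {"heading": heading, "text": body}  # structured dict
```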
Use Cases¶
RAG Vector Store Loading¶
```python
from dataknobs_xization import stream_markdown_file
import chromadb

client = chromadb.Client()
collection = client.create_collection("documents")

for chunk in stream_markdown_file("documentation.md", max_chunk_size=500):
    collection.add(
        documents=[chunk.text],
        metadatas=[chunk.metadata.to_dict()],
        ids=[f"chunk_{chunk.metadata.chunk_index}"]
    )
```
Contextual Retrieval¶
```python
from dataknobs_xization import parse_markdown, chunk_markdown_tree, HeadingInclusion

tree = parse_markdown(document_text)

# Include headings in metadata for context, but not in text
chunks = chunk_markdown_tree(
    tree,
    heading_inclusion=HeadingInclusion.IN_METADATA,
    max_chunk_size=500
)

for chunk in chunks:
    # Heading context available in metadata for filtering/ranking
    context_path = chunk.metadata.get_heading_path()
    # Body text is clean, without heading markup
    clean_text = chunk.text
```
Document Analysis¶
```python
from dataknobs_xization import parse_markdown

tree = parse_markdown(document)

# Find all sections about a specific topic
relevant_sections = tree.find_nodes(
    lambda n: n.data.is_heading() and "security" in n.data.text.lower()
)

# Get body text under each relevant section
for section in relevant_sections:
    body_nodes = section.find_nodes(lambda n: n.data.is_body())
    for body in body_nodes:
        print(body.data.text)
```
Memory Management¶
For large documents, use the `AdaptiveStreamingProcessor`:

```python
from dataknobs_xization import AdaptiveStreamingProcessor

processor = AdaptiveStreamingProcessor(
    max_chunk_size=1000,
    memory_limit_nodes=10000,  # Maximum nodes to keep in memory
    adaptive_threshold=0.8     # Trigger chunking at 80% of the limit
)

# Process a very large file
for chunk in processor.process_file("massive_documentation.md"):
    # Chunks are yielded as they become ready;
    # memory usage remains bounded
    process_chunk(chunk)
```
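The adaptive idea can be illustrated with a toy generator that flushes its buffer once it reaches a fraction of the node limit. This is an assumption about the behavior implied by `memory_limit_nodes` and `adaptive_threshold`, not the processor's actual code:

```python
# Toy sketch of threshold-triggered flushing for bounded memory use.
def adaptive_stream(nodes, memory_limit_nodes=4, adaptive_threshold=0.5):
    buffer = []
    for node in nodes:
        buffer.append(node)
        # Flush early, at threshold * limit, so memory never hits the cap
        if len(buffer) >= memory_limit_nodes * adaptive_threshold:
            yield " ".join(buffer)  # emit buffered nodes as one chunk
            buffer.clear()
    if buffer:
        yield " ".join(buffer)      # emit any remaining nodes
```

Flushing at a fraction of the limit (rather than exactly at it) leaves headroom for nodes that arrive while a chunk is being assembled, which is presumably why the threshold is configurable.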
Testing¶
Run the test suite:
```shell
cd packages/xization
uv run pytest tests/test_md_parser.py tests/test_md_chunker.py tests/test_md_streaming.py -v
```
Examples¶
See `example_document.md` for a sample markdown file demonstrating various heading levels and structures.
Integration with Other Dataknobs Packages¶
The markdown chunking utilities integrate seamlessly with other dataknobs packages:
- Uses `Tree` from `dataknobs-structures` to represent the document hierarchy
- Can be combined with `dataknobs-utils` for additional text processing
- Chunk metadata can be stored in `dataknobs-kv` stores
- Compatible with `dataknobs-data` for data pipeline integration
Future Enhancements¶
Potential areas for future development:
- Support for additional markdown features (tables, code blocks, etc.)
- Semantic chunking based on content similarity
- Support for other document formats (HTML, reStructuredText, etc.)
- Chunk boundary optimization using NLP techniques
- Parallel processing for very large document sets
Contributing¶
Contributions are welcome! Please ensure:
- All tests pass
- New features include tests
- Code follows the existing style
- Documentation is updated
License¶
Part of the dataknobs project. See main repository for license information.