Content Transformation¶

The Content Transformation module provides tools for converting structured data formats (JSON, YAML, CSV, HTML) into well-formatted markdown suitable for RAG ingestion and chunking.

Overview¶

When building RAG (Retrieval-Augmented Generation) systems, you often need to ingest structured data like JSON configuration files, YAML documentation, CSV datasets, or HTML documents. The ContentTransformer class converts these formats into markdown with appropriate heading hierarchies, enabling the markdown chunker to create semantic boundaries around logical content units.

Quick Start¶

from dataknobs_xization import ContentTransformer, json_to_markdown

# Simple conversion
data = {"name": "My Item", "description": "A description"}
markdown = json_to_markdown(data)

# Or use the transformer class
transformer = ContentTransformer()
markdown = transformer.transform_json(data)

ContentTransformer Class¶

Initialization¶

from dataknobs_xization import ContentTransformer

transformer = ContentTransformer(
    base_heading_level=2,           # Starting heading level (default: 2)
    include_field_labels=True,      # Bold field names in output (default: True)
    code_block_fields=["example"],  # Fields to render as code blocks
    list_fields=["steps", "items"]  # Fields to render as bullet lists
)

Default Field Handling¶

By default, the transformer treats certain field names specially:

Code Block Fields (rendered as fenced code blocks): - example - code - snippet

List Fields (rendered as bullet lists): - items - steps - objectives - symptoms - solutions

You can customize these lists during initialization.

Generic Transformation¶

Without a schema, the transformer uses intelligent defaults:

data = {
    "name": "Chain of Thought",
    "description": "Step by step reasoning technique",
    "steps": ["Break down problem", "Show work", "Conclude"],
    "example": "Let's think step by step..."
}

result = transformer.transform_json(data)

Output:

## Chain of Thought

**Description**: Step by step reasoning technique

### Steps

- Break down problem
- Show work
- Conclude

### Example

Let's think step by step...

---

Title Detection¶

The transformer automatically detects title fields in this order: 1. name 2. title 3. id 4. key

Nested Structures¶

Nested dictionaries become subsections:

data = {
    "name": "Config",
    "database": {
        "host": "localhost",
        "port": 5432
    }
}

Produces:

## Config

### Database

**Host**: localhost

**Port**: 5432

Custom Schemas¶

For specialized formatting, register custom schemas:

transformer.register_schema("pattern", {
    "title_field": "name",
    "description_field": "description",
    "sections": [
        {"field": "use_case", "heading": "When to Use"},
        {"field": "example", "heading": "Example", "format": "code", "language": "python"},
        {"field": "variations", "heading": "Variations", "format": "list"}
    ],
    "metadata_fields": ["category", "difficulty"]
})

Schema Definition¶

Key	Description
`title_field`	Field to use as the main heading (required)
`description_field`	Field for intro text without a heading
`sections`	List of section definitions
`metadata_fields`	Fields to render as bold key-value pairs

Section Formats¶

Format	Description
`text` (default)	Plain text paragraph
`code`	Fenced code block (optionally with `language`)
`list`	Bullet list
`subsections`	Nested key-value pairs or list of items

Example Usage¶

patterns = [
    {
        "name": "Chain of Thought",
        "description": "A prompting technique for complex reasoning",
        "use_case": "Use for multi-step problems requiring logical reasoning",
        "example": "Let's think step by step:\n1. First, ...\n2. Then, ...",
        "variations": ["Zero-shot CoT", "Manual CoT", "Self-consistency"],
        "category": "reasoning",
        "difficulty": "intermediate"
    }
]

markdown = transformer.transform_json(patterns, schema="pattern")

Output:

## Chain of Thought

**Category**: reasoning
**Difficulty**: intermediate

A prompting technique for complex reasoning

### When to Use

Use for multi-step problems requiring logical reasoning

### Example

```python
Let's think step by step:
1. First, ...
2. Then, ...

Variations¶

Zero-shot CoT
Manual CoT
Self-consistency

## YAML Transformation

```python
# From YAML string
yaml_content = """
name: My Config
settings:
  timeout: 30
  retries: 3
"""
markdown = transformer.transform_yaml(yaml_content)

# From YAML file
markdown = transformer.transform_yaml("config.yaml")

# With schema
markdown = transformer.transform_yaml("config.yaml", schema="config")

CSV Transformation¶

# From CSV string
csv_content = "name,value,description\nItem1,100,First item\nItem2,200,Second item"
markdown = transformer.transform_csv(csv_content)

# From CSV file
markdown = transformer.transform_csv("data.csv")

# With custom title field
markdown = transformer.transform_csv("data.csv", title_field="name")

# With document title
markdown = transformer.transform_csv("data.csv", title="My Dataset")

Each row becomes a section with the first column (or title_field) as the heading.

HTML Transformation¶

HTML content (including IETF RFC markup) can be converted to markdown. See the dedicated HTML Conversion guide for full details.

# Via ContentTransformer
markdown = transformer.transform(html_string, format="html")
markdown = transformer.transform_html(html_string, title="My Page")

# Via convenience function
from dataknobs_xization import html_to_markdown
markdown = html_to_markdown("<h1>Title</h1><p>Content.</p>")

Convenience Functions¶

For quick one-off conversions:

from dataknobs_xization import json_to_markdown, yaml_to_markdown, csv_to_markdown, html_to_markdown

# JSON to markdown
md = json_to_markdown(data, title="Document Title", base_heading_level=2)

# YAML to markdown
md = yaml_to_markdown("config.yaml", title="Configuration")

# CSV to markdown
md = csv_to_markdown("data.csv", title="Data", title_field="name")

# HTML to markdown
md = html_to_markdown("<h1>Title</h1><p>Content.</p>", title="Document")

Integration with RAG¶

With RAGKnowledgeBase¶

The RAGKnowledgeBase class provides direct methods for loading JSON, YAML, and CSV:

from dataknobs_bots.knowledge import RAGKnowledgeBase
from dataknobs_xization import ContentTransformer

# Create knowledge base
kb = RAGKnowledgeBase(...)

# Create transformer with custom schema
transformer = ContentTransformer()
transformer.register_schema("pattern", {...})

# Load JSON directly
await kb.load_json_document(
    "patterns.json",
    schema="pattern",
    transformer=transformer,
    metadata={"source": "patterns"}
)

# Load YAML
await kb.load_yaml_document("config.yaml", metadata={"type": "config"})

# Load CSV
await kb.load_csv_document("data.csv", title_field="name")

Why Convert to Markdown?¶

Semantic Chunking: Markdown headings create natural boundaries for chunks
Hierarchy Preservation: Nested structures become heading hierarchies
RAG Optimization: Chunks maintain context through heading paths
Consistent Processing: All content goes through the same chunking pipeline

API Reference¶

ContentTransformer¶

class ContentTransformer:
    def __init__(
        self,
        base_heading_level: int = 2,
        include_field_labels: bool = True,
        code_block_fields: list[str] | None = None,
        list_fields: list[str] | None = None,
    )

    def register_schema(self, name: str, schema: dict[str, Any]) -> None

    def transform(
        self,
        content: Any,
        format: str = "json",
        schema: str | None = None,
        title: str | None = None,
    ) -> str

    def transform_html(
        self,
        content: str | Path,
        title: str | None = None,
    ) -> str

    def transform_json(
        self,
        data: dict[str, Any] | list[Any],
        schema: str | None = None,
        title: str | None = None,
    ) -> str

    def transform_yaml(
        self,
        content: str | Path,
        schema: str | None = None,
        title: str | None = None,
    ) -> str

    def transform_csv(
        self,
        content: str | Path,
        title: str | None = None,
        title_field: str | None = None,
    ) -> str

Convenience Functions¶

def json_to_markdown(
    data: dict[str, Any] | list[Any],
    title: str | None = None,
    base_heading_level: int = 2,
) -> str

def yaml_to_markdown(
    content: str | Path,
    title: str | None = None,
    base_heading_level: int = 2,
) -> str

def csv_to_markdown(
    content: str | Path,
    title: str | None = None,
    title_field: str | None = None,
    base_heading_level: int = 2,
) -> str

def html_to_markdown(
    content: str | Path,
    title: str | None = None,
    base_heading_level: int = 1,
    include_links: bool = True,
    frontmatter: dict[str, Any] | None = None,
) -> str