dataknobs-xization Complete API Reference

Complete auto-generated API documentation from source code docstrings.

💡 Also see:
- Curated Guide: hand-crafted tutorials and examples
- Package Overview: introduction and getting started
- Source Code: view on GitHub


dataknobs_xization

Text normalization and tokenization tools.

Modules:

Name Description
annotations

Text annotation data structures and interfaces.

authorities

Authority-based annotation processing and field grouping.

content_transformer

Content transformation utilities for converting various formats to markdown.

html

HTML to markdown conversion utilities.

ingestion

Knowledge base ingestion module.

json

JSON chunking utilities for RAG applications.

lexicon

Lexical matching and token alignment for text processing.

markdown

Markdown chunking utilities for RAG applications.

masking_tokenizer

Character-level text feature extraction and tokenization.

normalize

Text normalization utilities and regular expressions.

Classes:

Name Description
ContentTransformer

Transform structured content into markdown for RAG ingestion.

HTMLConverter

Convert HTML content to well-structured markdown.

HTMLConverterConfig

Configuration for HTML to markdown conversion.

AdaptiveStreamingProcessor

Streaming processor that adapts to memory constraints.

Chunk

A chunk of text with associated metadata.

ChunkFormat

Output format for chunk text.

ChunkMetadata

Metadata for a document chunk.

ChunkQualityConfig

Configuration for chunk quality filtering.

ChunkQualityFilter

Filter for identifying and removing low-quality chunks.

EnrichedChunkData

Data for a chunk enriched with heading context.

HeadingInclusion

Strategy for including headings in chunks.

MarkdownChunker

Chunker for generating chunks from markdown tree structures.

MarkdownNode

Data container for markdown tree nodes.

MarkdownParser

Parser for converting markdown text into a tree structure.

StreamingMarkdownProcessor

Streaming processor for incremental markdown chunking.

CharacterFeatures

Class representing features of text as a dataframe with each character

TextFeatures

Extracts text-specific character features for tokenization.

JSONChunk

A chunk generated from JSON data.

JSONChunkConfig

Configuration for JSON chunking.

JSONChunker

Chunker for generating chunks from JSON data with preserved metadata.

DirectoryProcessor

Process documents from a directory for knowledge base ingestion.

FilePatternConfig

Configuration for a specific file pattern.

IngestionConfigError

Error related to ingestion configuration.

KnowledgeBaseConfig

Configuration for knowledge base ingestion from a directory.

ProcessedDocument

A processed document ready for embedding and storage.

Functions:

Name Description
csv_to_markdown

Convert CSV content to markdown.

json_to_markdown

Convert JSON data to markdown.

yaml_to_markdown

Convert YAML content to markdown.

html_to_markdown

Convert HTML content to markdown.

build_enriched_text

Build text for embedding with relevant heading context.

chunk_markdown_tree

Generate chunks from a markdown tree.

format_heading_display

Format a heading path for display.

get_dynamic_heading_display

Get heading display based on content length.

is_multiword

Check if a heading contains multiple words.

parse_markdown

Parse markdown content into a tree structure.

stream_markdown_file

Stream chunks from a markdown file.

stream_markdown_string

Stream chunks from a markdown string.

process_directory

Convenience function to process a directory.

Classes

ContentTransformer

ContentTransformer(
    base_heading_level: int = 2,
    include_field_labels: bool = True,
    code_block_fields: list[str] | None = None,
    list_fields: list[str] | None = None,
)

Transform structured content into markdown for RAG ingestion.

This class converts various data formats (JSON, YAML, CSV, HTML) into well-structured markdown that can be parsed by MarkdownParser and chunked by MarkdownChunker.

The transformer creates markdown with appropriate heading hierarchy so that the chunker can create semantic boundaries around logical content units.

Attributes:

Name Type Description
schemas dict[str, dict[str, Any]]

Dictionary of registered custom schemas

config dict[str, dict[str, Any]]

Transformer configuration options

Initialize the content transformer.

Parameters:

Name Type Description Default
base_heading_level int

Starting heading level for top-level items (default: 2)

2
include_field_labels bool

Whether to bold field names in output (default: True)

True
code_block_fields list[str] | None

Field names that should be rendered as code blocks

None
list_fields list[str] | None

Field names that should be rendered as bullet lists

None

Methods:

Name Description
register_schema

Register a custom schema for specialized content conversion.

transform

Transform content to markdown.

transform_json

Transform JSON data to markdown.

transform_yaml

Transform YAML content to markdown.

transform_csv

Transform CSV content to markdown.

transform_html

Transform HTML content to markdown.

Source code in packages/xization/src/dataknobs_xization/content_transformer.py
def __init__(
    self,
    base_heading_level: int = 2,
    include_field_labels: bool = True,
    code_block_fields: list[str] | None = None,
    list_fields: list[str] | None = None,
):
    """Initialize the content transformer.

    Args:
        base_heading_level: Starting heading level for top-level items (default: 2)
        include_field_labels: Whether to bold field names in output (default: True)
        code_block_fields: Field names that should be rendered as code blocks
        list_fields: Field names that should be rendered as bullet lists
    """
    self.base_heading_level = base_heading_level
    self.include_field_labels = include_field_labels
    self.code_block_fields = set(code_block_fields or ["example", "code", "snippet"])
    self.list_fields = set(list_fields or ["items", "steps", "objectives", "symptoms", "solutions"])
    self.schemas: dict[str, dict[str, Any]] = {}
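The constructor above resolves its list arguments with the `set(value or defaults)` idiom. The following is a minimal standalone sketch of that idiom, not the package's code: `resolve_fields` and `DEFAULT_CODE_BLOCK_FIELDS` are hypothetical names introduced for illustration. One consequence worth noting is that an empty list is falsy, so passing `[]` appears to select the defaults rather than an empty set.

```python
from __future__ import annotations

# Standalone sketch of the `set(value or defaults)` idiom used in __init__.
# `resolve_fields` and `DEFAULT_CODE_BLOCK_FIELDS` are hypothetical names,
# not part of dataknobs_xization.
DEFAULT_CODE_BLOCK_FIELDS = ["example", "code", "snippet"]


def resolve_fields(value: list[str] | None, defaults: list[str]) -> set[str]:
    """Return `value` as a set, falling back to `defaults` when it is falsy."""
    return set(value or defaults)


custom = resolve_fields(["query"], DEFAULT_CODE_BLOCK_FIELDS)
from_none = resolve_fields(None, DEFAULT_CODE_BLOCK_FIELDS)
from_empty = resolve_fields([], DEFAULT_CODE_BLOCK_FIELDS)  # [] is falsy too

print(sorted(custom))
print(from_none == from_empty)
```

If no field should be rendered as a code block, a caller would therefore need a non-empty list that matches no real field name, at least under the logic shown in this listing.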
Functions
register_schema
register_schema(name: str, schema: dict[str, Any]) -> None

Register a custom schema for specialized content conversion.

Schemas define how to map JSON fields to markdown structure.

Parameters:

Name Type Description Default
name str

Schema identifier

required
schema dict[str, Any]

Schema definition with the following structure:
- title_field: Field to use as the main heading (required)
- description_field: Field for intro text (optional)
- sections: List of section definitions, each with:
    - field: Source field name
    - heading: Section heading text
    - format: "text", "code", "list", or "subsections" (default: "text")
    - language: Code block language (for format="code")
- metadata_fields: Fields to render as key-value metadata

required
Example

>>> transformer.register_schema("pattern", {
...     "title_field": "name",
...     "description_field": "description",
...     "sections": [
...         {"field": "use_case", "heading": "When to Use"},
...         {"field": "example", "heading": "Example", "format": "code"}
...     ],
...     "metadata_fields": ["category", "difficulty"]
... })

Source code in packages/xization/src/dataknobs_xization/content_transformer.py
def register_schema(self, name: str, schema: dict[str, Any]) -> None:
    """Register a custom schema for specialized content conversion.

    Schemas define how to map JSON fields to markdown structure.

    Args:
        name: Schema identifier
        schema: Schema definition with the following structure:
            - title_field: Field to use as the main heading (required)
            - description_field: Field for intro text (optional)
            - sections: List of section definitions, each with:
                - field: Source field name
                - heading: Section heading text
                - format: "text", "code", "list", or "subsections" (default: "text")
                - language: Code block language (for format="code")
            - metadata_fields: Fields to render as key-value metadata

    Example:
        >>> transformer.register_schema("pattern", {
        ...     "title_field": "name",
        ...     "description_field": "description",
        ...     "sections": [
        ...         {"field": "use_case", "heading": "When to Use"},
        ...         {"field": "example", "heading": "Example", "format": "code"}
        ...     ],
        ...     "metadata_fields": ["category", "difficulty"]
        ... })
    """
    self.schemas[name] = schema
    logger.debug(f"Registered schema: {name}")
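As the listing above shows, `register_schema` stores the dictionary without inspecting it. The sketch below is one way a caller might check a schema against the documented structure before registering it. `check_schema` and `ALLOWED_FORMATS` are hypothetical names, and the sketch assumes every section needs both `field` and `heading`, which the docstring implies but does not state.

```python
from __future__ import annotations

from typing import Any

# Format names taken from the register_schema docstring.
ALLOWED_FORMATS = {"text", "code", "list", "subsections"}


def check_schema(schema: dict[str, Any]) -> list[str]:
    """Hypothetical pre-registration check; returns a list of problems found."""
    problems: list[str] = []
    if "title_field" not in schema:
        problems.append("missing required key: title_field")
    for i, section in enumerate(schema.get("sections", [])):
        # Assumption: each section needs both a source field and a heading.
        for key in ("field", "heading"):
            if key not in section:
                problems.append(f"sections[{i}] missing key: {key}")
        fmt = section.get("format", "text")  # "text" is the documented default
        if fmt not in ALLOWED_FORMATS:
            problems.append(f"sections[{i}] has unknown format: {fmt!r}")
    return problems


pattern_schema = {
    "title_field": "name",
    "description_field": "description",
    "sections": [
        {"field": "use_case", "heading": "When to Use"},
        {"field": "example", "heading": "Example", "format": "code"},
    ],
    "metadata_fields": ["category", "difficulty"],
}
bad_schema = {"sections": [{"field": "x", "heading": "X", "format": "table"}]}

print(check_schema(pattern_schema))
print(check_schema(bad_schema))
```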
transform
transform(
    content: Any,
    format: str = "json",
    schema: str | None = None,
    title: str | None = None,
) -> str

Transform content to markdown.

Parameters:

Name Type Description Default
content Any

Content to transform (dict, list, string, or file path)

required
format str

Content format - "json", "yaml", "csv", or "html"

'json'
schema str | None

Optional schema name for custom conversion (applies to "json" and "yaml" formats only; ignored for "csv" and "html")

None
title str | None

Optional document title

None

Returns:

Type Description
str

Markdown formatted string

Raises:

Type Description
ValueError

If format is not supported

Source code in packages/xization/src/dataknobs_xization/content_transformer.py
def transform(
    self,
    content: Any,
    format: str = "json",
    schema: str | None = None,
    title: str | None = None,
) -> str:
    """Transform content to markdown.

    Args:
        content: Content to transform (dict, list, string, or file path)
        format: Content format - "json", "yaml", "csv", or "html"
        schema: Optional schema name for custom conversion (applies to
            "json" and "yaml" formats only; ignored for "csv" and "html")
        title: Optional document title

    Returns:
        Markdown formatted string

    Raises:
        ValueError: If format is not supported
    """
    if format == "json":
        if isinstance(content, (str, Path)):
            with open(content, encoding="utf-8") as f:
                data = json.load(f)
        else:
            data = content
        return self.transform_json(data, schema=schema, title=title)
    elif format == "yaml":
        return self.transform_yaml(content, schema=schema, title=title)
    elif format == "csv":
        if schema is not None:
            logger.warning("schema parameter is ignored for CSV format")
        return self.transform_csv(content, title=title)
    elif format == "html":
        if schema is not None:
            logger.warning("schema parameter is ignored for HTML format")
        return self.transform_html(content, title=title)
    else:
        raise ValueError(
            f"Unsupported format: {format}. Use 'json', 'yaml', 'csv', or 'html'."
        )
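In the listing above, the "json" branch passes any `str` or `Path` to `open()`, so a string is treated as a file path rather than as JSON text. The standalone sketch below mirrors only that branch to make the behaviour visible. `load_json_input` is a hypothetical helper, not a function of the package.

```python
from __future__ import annotations

import json
import tempfile
from pathlib import Path
from typing import Any


def load_json_input(content: Any) -> Any:
    """Mirror the "json" branch of transform(): str/Path is opened as a file."""
    if isinstance(content, (str, Path)):
        with open(content, encoding="utf-8") as f:
            return json.load(f)
    return content  # dicts and lists pass through unchanged


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "doc.json"
    path.write_text(json.dumps({"name": "retry"}), encoding="utf-8")
    from_path = load_json_input(path)
    from_str_path = load_json_input(str(path))

from_dict = load_json_input({"name": "retry"})

# A raw JSON string is interpreted as a file name, so open() fails.
raw_json = '{"name": "retry"}'
try:
    load_json_input(raw_json)
    raw_string_failed = False
except OSError:
    raw_string_failed = True

# Parsing the string first yields data the branch accepts as-is.
from_parsed = load_json_input(json.loads(raw_json))
```

Under this reading, callers holding JSON text would parse it with `json.loads` and pass the resulting object.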
transform_json
transform_json(
    data: dict[str, Any] | list[Any],
    schema: str | None = None,
    title: str | None = None,
) -> str

Transform JSON data to markdown.

Parameters:

Name Type Description Default
data dict[str, Any] | list[Any]

JSON data (dict or list)

required
schema str | None

Optional schema name for custom conversion

None
title str | None

Optional document title

None

Returns:

Type Description
str

Markdown formatted string

Source code in packages/xization/src/dataknobs_xization/content_transformer.py
def transform_json(
    self,
    data: dict[str, Any] | list[Any],
    schema: str | None = None,
    title: str | None = None,
) -> str:
    """Transform JSON data to markdown.

    Args:
        data: JSON data (dict or list)
        schema: Optional schema name for custom conversion
        title: Optional document title

    Returns:
        Markdown formatted string
    """
    lines: list[str] = []

    # Add document title if provided
    if title:
        lines.extend([f"# {title}", ""])

    # Use custom schema if specified
    if schema and schema in self.schemas:
        return self._transform_with_schema(data, self.schemas[schema], title)

    # Generic transformation
    if isinstance(data, list):
        for item in data:
            if isinstance(item, dict):
                lines.extend(self._transform_dict_generic(item, self.base_heading_level))
                lines.extend(["---", ""])
            else:
                lines.append(f"- {item}")
                lines.append("")
    elif isinstance(data, dict):
        lines.extend(self._transform_dict_generic(data, self.base_heading_level))
    else:
        lines.append(str(data))

    return "\n".join(lines)
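The generic list branch above can be sketched in isolation: dictionary items become sections followed by a `---` rule, and any other item becomes a bullet. The private `_transform_dict_generic` is not shown on this page, so `render_dict_stub` below is a hypothetical stand-in whose output format is an assumption. Only the surrounding loop follows the listing.

```python
from __future__ import annotations

from typing import Any


def render_dict_stub(item: dict[str, Any], level: int) -> list[str]:
    """Hypothetical stand-in for the private _transform_dict_generic."""
    lines: list[str] = []
    for key, value in item.items():
        lines.extend([f"{'#' * level} {key}", "", str(value), ""])
    return lines


def render_list(data: list[Any], base_heading_level: int = 2) -> str:
    """Follow the list branch of transform_json shown above."""
    lines: list[str] = []
    for item in data:
        if isinstance(item, dict):
            lines.extend(render_dict_stub(item, base_heading_level))
            lines.extend(["---", ""])  # rule separating dict items
        else:
            lines.append(f"- {item}")  # scalars become bullets
            lines.append("")
    return "\n".join(lines)


markdown = render_list([{"name": "retry"}, "plain item"])
print(markdown)
```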
transform_yaml
transform_yaml(
    content: str | Path, schema: str | None = None, title: str | None = None
) -> str

Transform YAML content to markdown.

Parameters:

Name Type Description Default
content str | Path

YAML string or file path

required
schema str | None

Optional schema name for custom conversion

None
title str | None

Optional document title

None

Returns:

Type Description
str

Markdown formatted string

Raises:

Type Description
ImportError

If PyYAML is not installed

Source code in packages/xization/src/dataknobs_xization/content_transformer.py
def transform_yaml(
    self,
    content: str | Path,
    schema: str | None = None,
    title: str | None = None,
) -> str:
    """Transform YAML content to markdown.

    Args:
        content: YAML string or file path
        schema: Optional schema name for custom conversion
        title: Optional document title

    Returns:
        Markdown formatted string

    Raises:
        ImportError: If PyYAML is not installed
    """
    try:
        import yaml
    except ImportError:
        raise ImportError("PyYAML is required for YAML transformation. Install with: pip install pyyaml") from None

    if isinstance(content, (str, Path)) and Path(content).exists():
        with open(content, encoding="utf-8") as f:
            data = yaml.safe_load(f)
    else:
        data = yaml.safe_load(content)

    return self.transform_json(data, schema=schema, title=title)
transform_csv
transform_csv(
    content: str | Path,
    title: str | None = None,
    title_field: str | None = None,
) -> str

Transform CSV content to markdown.

Each row becomes a section with the first column (or title_field) as heading.

Parameters:

Name Type Description Default
content str | Path

CSV string or file path

required
title str | None

Optional document title

None
title_field str | None

Column to use as section title (default: first column)

None

Returns:

Type Description
str

Markdown formatted string

Source code in packages/xization/src/dataknobs_xization/content_transformer.py
def transform_csv(
    self,
    content: str | Path,
    title: str | None = None,
    title_field: str | None = None,
) -> str:
    """Transform CSV content to markdown.

    Each row becomes a section with the first column (or title_field) as heading.

    Args:
        content: CSV string or file path
        title: Optional document title
        title_field: Column to use as section title (default: first column)

    Returns:
        Markdown formatted string
    """
    lines: list[str] = []

    if title:
        lines.extend([f"# {title}", ""])

    # Read CSV
    if isinstance(content, Path) or (isinstance(content, str) and Path(content).exists()):
        with open(content, encoding="utf-8") as f:
            reader = csv.DictReader(f)
            rows = list(reader)
    else:
        reader = csv.DictReader(io.StringIO(content))
        rows = list(reader)

    if not rows:
        return "\n".join(lines)

    # Determine title field
    fieldnames = list(rows[0].keys())
    if title_field and title_field in fieldnames:
        title_col = title_field
    else:
        title_col = fieldnames[0]

    # Transform each row
    for row in rows:
        row_title = row.get(title_col, "Untitled")
        lines.append(f"{'#' * self.base_heading_level} {row_title}")
        lines.append("")

        for field, value in row.items():
            if field == title_col or not value:
                continue

            if self.include_field_labels:
                lines.append(f"**{self._format_field_name(field)}**: {value}")
            else:
                lines.append(value)
            lines.append("")

        lines.extend(["---", ""])

    return "\n".join(lines)
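The row-to-section logic above can be reproduced with the standard library alone. The sketch below handles only in-memory CSV text with field labels enabled and uses the first column as the heading. `_format_field_name` is private and not shown on this page, so `format_field_name` is a hypothetical stand-in whose behaviour is assumed.

```python
from __future__ import annotations

import csv
import io


def format_field_name(field: str) -> str:
    """Hypothetical stand-in for the private _format_field_name."""
    return field.replace("_", " ").title()


def csv_rows_to_markdown(text: str, base_heading_level: int = 2) -> str:
    """Turn each CSV row into a section headed by its first column."""
    rows = list(csv.DictReader(io.StringIO(text)))
    if not rows:
        return ""
    title_col = list(rows[0].keys())[0]
    lines: list[str] = []
    for row in rows:
        heading = row.get(title_col, "Untitled")
        lines.extend([f"{'#' * base_heading_level} {heading}", ""])
        for field, value in row.items():
            if field == title_col or not value:
                continue  # skip the heading column and empty cells
            lines.extend([f"**{format_field_name(field)}**: {value}", ""])
        lines.extend(["---", ""])  # rule after every row
    return "\n".join(lines)


sample = "name,use_case,notes\nretry,Transient failures,\n"
markdown = csv_rows_to_markdown(sample)
print(markdown)
```

The empty `notes` cell produces no output line, matching the `not value` check in the listing.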
transform_html
transform_html(content: str | Path, title: str | None = None) -> str

Transform HTML content to markdown.

Supports standard HTML with semantic tags and IETF RFC markup. Auto-detects the document format and applies appropriate conversion.

Parameters:

Name Type Description Default
content str | Path

HTML string or file path

required
title str | None

Optional document title

None

Returns:

Type Description
str

Markdown formatted string

Source code in packages/xization/src/dataknobs_xization/content_transformer.py
def transform_html(
    self,
    content: str | Path,
    title: str | None = None,
) -> str:
    """Transform HTML content to markdown.

    Supports standard HTML with semantic tags and IETF RFC markup.
    Auto-detects the document format and applies appropriate conversion.

    Args:
        content: HTML string or file path
        title: Optional document title

    Returns:
        Markdown formatted string
    """
    from dataknobs_xization.html import HTMLConverter

    converter = HTMLConverter(base_heading_level=self.base_heading_level)
    return converter.convert(content, title=title)

HTMLConverter

HTMLConverter(config: HTMLConverterConfig | None = None, **kwargs: Any)

Convert HTML content to well-structured markdown.

Supports standard HTML with semantic tags (h1-h6, p, ul, ol, table, pre, etc.) and auto-detects IETF RFC markup format (pre-formatted text with span-based headings).

The converter produces markdown compatible with MarkdownParser and MarkdownChunker for downstream RAG ingestion.

Example

>>> converter = HTMLConverter()
>>> md = converter.convert("<h1>Overview</h1><p>Details here.</p>")
>>> print(md)
# Overview

Details here.

Initialize the converter.

Parameters:

Name Type Description Default
config HTMLConverterConfig | None

Converter configuration. If None, uses defaults.

None
**kwargs Any

Override individual config fields (e.g., base_heading_level=2).

{}

Methods:

Name Description
convert

Convert HTML content to markdown.

Source code in packages/xization/src/dataknobs_xization/html/html_converter.py
def __init__(self, config: HTMLConverterConfig | None = None, **kwargs: Any):
    """Initialize the converter.

    Args:
        config: Converter configuration. If None, uses defaults.
        **kwargs: Override individual config fields (e.g., base_heading_level=2).
    """
    if config is not None:
        self.config = config
    else:
        self.config = HTMLConverterConfig(**kwargs)
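The constructor above uses keyword arguments only when no `config` object is passed. If both are supplied, the keyword arguments appear to be ignored without a warning. The sketch below reproduces that branch with stand-in classes. `ConfigStub` and `ConverterStub` are hypothetical and carry only two of the real configuration fields.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any


@dataclass
class ConfigStub:
    """Hypothetical reduced stand-in for HTMLConverterConfig."""

    base_heading_level: int = 1
    include_links: bool = True


class ConverterStub:
    """Hypothetical stand-in reproducing the __init__ branch shown above."""

    def __init__(self, config: ConfigStub | None = None, **kwargs: Any):
        if config is not None:
            self.config = config  # kwargs are not consulted on this path
        else:
            self.config = ConfigStub(**kwargs)


from_kwargs = ConverterStub(base_heading_level=2)
from_both = ConverterStub(ConfigStub(base_heading_level=3), base_heading_level=2)

print(from_kwargs.config.base_heading_level)
print(from_both.config.base_heading_level)
```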
Functions
convert
convert(content: str | Path, title: str | None = None) -> str

Convert HTML content to markdown.

Auto-detects whether the document is standard HTML or IETF RFC markup and applies the appropriate conversion strategy.

Note

Each call uses internal state for reference-style link collection. Do not call convert() concurrently on the same instance from multiple threads. Create separate HTMLConverter instances for concurrent use, or use the html_to_markdown() convenience function which creates a fresh instance per call.
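The note above recommends one converter instance per concurrent call. A minimal sketch of that pattern with a thread pool follows. `ConverterStub` is a hypothetical stand-in so the example runs without the package installed, and its `upper()` call is only a placeholder for real conversion work.

```python
from __future__ import annotations

from concurrent.futures import ThreadPoolExecutor


class ConverterStub:
    """Hypothetical stand-in that keeps per-call state on the instance."""

    def convert(self, content: str) -> str:
        self._link_references = []  # reset on every call, as in the source
        self._link_references.append(content)
        return content.upper()  # placeholder for the real conversion


def convert_one(html: str) -> str:
    # A fresh instance per task means no state is shared between threads.
    return ConverterStub().convert(html)


documents = ["<p>a</p>", "<p>b</p>", "<p>c</p>"]
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(convert_one, documents))

print(results)
```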

Parameters:

Name Type Description Default
content str | Path

HTML string or path to an HTML file.

required
title str | None

Optional document title. If provided, prepended as a top-level heading. For RFC documents, extracted automatically if not provided.

None

Returns:

Type Description
str

Well-structured markdown string.

Source code in packages/xization/src/dataknobs_xization/html/html_converter.py
def convert(self, content: str | Path, title: str | None = None) -> str:
    """Convert HTML content to markdown.

    Auto-detects whether the document is standard HTML or IETF RFC markup
    and applies the appropriate conversion strategy.

    Note:
        Each call uses internal state for reference-style link collection.
        Do not call ``convert()`` concurrently on the same instance from
        multiple threads. Create separate ``HTMLConverter`` instances for
        concurrent use, or use the ``html_to_markdown()`` convenience
        function which creates a fresh instance per call.

    Args:
        content: HTML string or path to an HTML file.
        title: Optional document title. If provided, prepended as a top-level
            heading. For RFC documents, extracted automatically if not provided.

    Returns:
        Well-structured markdown string.
    """
    if isinstance(content, Path):
        content = content.read_text(encoding="utf-8")
    elif not isinstance(content, str):
        raise TypeError(f"content must be str or Path, got {type(content).__name__}")

    # Per-conversion state for reference-style link collection.
    self._link_references: list[tuple[str, str]] = []

    soup = BeautifulSoup(content, "html.parser")

    # Strip unwanted elements before detection or conversion.
    self._strip_elements(soup)

    # Detect format and dispatch.
    if self._is_rfc_markup(soup):
        logger.debug("Detected IETF RFC markup format")
        result = self._convert_rfc(soup, title)
    else:
        logger.debug("Using standard HTML conversion")
        result = self._convert_standard(soup, title)

    # Append reference-style link definitions.
    if self._link_references:
        result = result.rstrip("\n") + "\n\n"
        for idx, (_text, href) in enumerate(self._link_references, 1):
            result += f"[{idx}]: {href}\n"

    # Prepend frontmatter if configured.
    if self.config.frontmatter:
        fm = self._render_frontmatter(self.config.frontmatter)
        result = fm + "\n" + result

    return self._normalize_whitespace(result)
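The reference-link step in the listing above can be tried in isolation. The function below copies that step into a hypothetical helper, `append_link_references`. The `[text][n]` form used in the sample body is an assumption about how the converter writes references inline, since that part of the code is not shown here.

```python
from __future__ import annotations


def append_link_references(result: str, refs: list[tuple[str, str]]) -> str:
    """Append numbered link definitions, following the step in convert()."""
    if refs:
        result = result.rstrip("\n") + "\n\n"
        for idx, (_text, href) in enumerate(refs, 1):
            result += f"[{idx}]: {href}\n"
    return result


# Assumed inline form; the real converter's inline output is not shown above.
body = "See the [spec][1] and the [guide][2].\n"
refs = [
    ("spec", "https://example.com/spec"),
    ("guide", "https://example.com/guide"),
]
output = append_link_references(body, refs)
print(output)
```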

HTMLConverterConfig dataclass

HTMLConverterConfig(
    base_heading_level: int = 1,
    include_links: bool = True,
    strip_nav: bool = True,
    strip_scripts: bool = True,
    preserve_code_blocks: bool = True,
    link_style: Literal["inline", "reference", "text"] = "inline",
    strip_images: bool = False,
    wrap_width: int = 0,
    frontmatter: dict[str, Any] | None = None,
)

Configuration for HTML to markdown conversion.

Attributes:

Name Type Description
base_heading_level int

Minimum heading level in output (1 = #, 2 = ##, etc.)

include_links bool

Whether to preserve hyperlinks as markdown links.

strip_nav bool

Remove

strip_scripts bool

Remove