dataknobs-xization Complete API Reference

Complete auto-generated API documentation from source code docstrings.

💡 Also see:

- Curated Guide: hand-crafted tutorials and examples
- Package Overview: introduction and getting started
- Source Code: view on GitHub

dataknobs_xization

Text normalization and tokenization tools.

Modules:

Name	Description
`annotations`	Text annotation data structures and interfaces.
`authorities`	Authority-based annotation processing and field grouping.
`content_transformer`	Content transformation utilities for converting various formats to markdown.
`html`	HTML to markdown conversion utilities.
`ingestion`	Knowledge base ingestion module.
`json`	JSON chunking utilities for RAG applications.
`lexicon`	Lexical matching and token alignment for text processing.
`markdown`	Markdown chunking utilities for RAG applications.
`masking_tokenizer`	Character-level text feature extraction and tokenization.
`normalize`	Text normalization utilities and regular expressions.

Classes:

Name	Description
`ContentTransformer`	Transform structured content into markdown for RAG ingestion.
`HTMLConverter`	Convert HTML content to well-structured markdown.
`HTMLConverterConfig`	Configuration for HTML to markdown conversion.
`AdaptiveStreamingProcessor`	Streaming processor that adapts to memory constraints.
`Chunk`	A chunk of text with associated metadata.
`ChunkFormat`	Output format for chunk text.
`ChunkMetadata`	Metadata for a document chunk.
`ChunkQualityConfig`	Configuration for chunk quality filtering.
`ChunkQualityFilter`	Filter for identifying and removing low-quality chunks.
`EnrichedChunkData`	Data for a chunk enriched with heading context.
`HeadingInclusion`	Strategy for including headings in chunks.
`MarkdownChunker`	Chunker for generating chunks from markdown tree structures.
`MarkdownNode`	Data container for markdown tree nodes.
`MarkdownParser`	Parser for converting markdown text into a tree structure.
`StreamingMarkdownProcessor`	Streaming processor for incremental markdown chunking.
`CharacterFeatures`	Class representing features of text as a dataframe with each character
`TextFeatures`	Extracts text-specific character features for tokenization.
`JSONChunk`	A chunk generated from JSON data.
`JSONChunkConfig`	Configuration for JSON chunking.
`JSONChunker`	Chunker for generating chunks from JSON data with preserved metadata.
`DirectoryProcessor`	Process documents from a directory for knowledge base ingestion.
`FilePatternConfig`	Configuration for a specific file pattern.
`IngestionConfigError`	Error related to ingestion configuration.
`KnowledgeBaseConfig`	Configuration for knowledge base ingestion from a directory.
`ProcessedDocument`	A processed document ready for embedding and storage.

Functions:

Name	Description
`csv_to_markdown`	Convert CSV content to markdown.
`json_to_markdown`	Convert JSON data to markdown.
`yaml_to_markdown`	Convert YAML content to markdown.
`html_to_markdown`	Convert HTML content to markdown.
`build_enriched_text`	Build text for embedding with relevant heading context.
`chunk_markdown_tree`	Generate chunks from a markdown tree.
`format_heading_display`	Format a heading path for display.
`get_dynamic_heading_display`	Get heading display based on content length.
`is_multiword`	Check if a heading contains multiple words.
`parse_markdown`	Parse markdown content into a tree structure.
`stream_markdown_file`	Stream chunks from a markdown file.
`stream_markdown_string`	Stream chunks from a markdown string.
`process_directory`	Convenience function to process a directory.

Classes

ContentTransformer

ContentTransformer(
    base_heading_level: int = 2,
    include_field_labels: bool = True,
    code_block_fields: list[str] | None = None,
    list_fields: list[str] | None = None,
)

Transform structured content into markdown for RAG ingestion.

This class converts various data formats (JSON, YAML, CSV, HTML) into well-structured markdown that can be parsed by MarkdownParser and chunked by MarkdownChunker.

The transformer creates markdown with appropriate heading hierarchy so that the chunker can create semantic boundaries around logical content units.

Attributes:

Name	Type	Description
`schemas`	`dict[str, dict[str, Any]]`	Dictionary of registered custom schemas
`config`	`dict[str, dict[str, Any]]`	Transformer configuration options

Initialize the content transformer.

Parameters:

Name	Type	Description	Default
`base_heading_level`	`int`	Starting heading level for top-level items (default: 2)	`2`
`include_field_labels`	`bool`	Whether to bold field names in output (default: True)	`True`
`code_block_fields`	`list[str] \| None`	Field names that should be rendered as code blocks	`None`
`list_fields`	`list[str] \| None`	Field names that should be rendered as bullet lists	`None`

Methods:

Name	Description
`register_schema`	Register a custom schema for specialized content conversion.
`transform`	Transform content to markdown.
`transform_json`	Transform JSON data to markdown.
`transform_yaml`	Transform YAML content to markdown.
`transform_csv`	Transform CSV content to markdown.
`transform_html`	Transform HTML content to markdown.

Source code in packages/xization/src/dataknobs_xization/content_transformer.py

def __init__(
    self,
    base_heading_level: int = 2,
    include_field_labels: bool = True,
    code_block_fields: list[str] | None = None,
    list_fields: list[str] | None = None,
):
    """Initialize the content transformer.

    Args:
        base_heading_level: Starting heading level for top-level items (default: 2)
        include_field_labels: Whether to bold field names in output (default: True)
        code_block_fields: Field names that should be rendered as code blocks
        list_fields: Field names that should be rendered as bullet lists
    """
    self.base_heading_level = base_heading_level
    self.include_field_labels = include_field_labels
    self.code_block_fields = set(code_block_fields or ["example", "code", "snippet"])
    self.list_fields = set(list_fields or ["items", "steps", "objectives", "symptoms", "solutions"])
    self.schemas: dict[str, dict[str, Any]] = {}
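
One consequence of the `set(... or [...])` pattern in the constructor above is that an empty list is falsy and so falls back to the defaults, and that a custom list replaces the defaults rather than extending them. The sketch below reproduces only that expression; `resolve_field_set` and `DEFAULT_CODE_FIELDS` are hypothetical names used for illustration and are not part of the package.

```python
def resolve_field_set(fields, defaults):
    """Same expression as in the constructor listing: set(fields or defaults)."""
    return set(fields or defaults)


# Default code-block field names, copied from the constructor listing.
DEFAULT_CODE_FIELDS = ["example", "code", "snippet"]

unset = resolve_field_set(None, DEFAULT_CODE_FIELDS)      # defaults apply
empty = resolve_field_set([], DEFAULT_CODE_FIELDS)        # defaults apply as well
custom = resolve_field_set(["sql"], DEFAULT_CODE_FIELDS)  # replaces the defaults

print(sorted(unset))
print(sorted(empty))
print(sorted(custom))
```

If the defaults should be kept alongside custom names, passing the combined list explicitly looks like the safest reading of this code.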

Functions

register_schema

register_schema(name: str, schema: dict[str, Any]) -> None

Register a custom schema for specialized content conversion.

Schemas define how to map JSON fields to markdown structure.

Parameters:

Name	Type	Description	Default
`name`	`str`	Schema identifier	required
`schema`	`dict[str, Any]`	Schema definition with the following keys: `title_field`: field to use as the main heading (required); `description_field`: field for intro text (optional); `sections`: list of section definitions, each with `field` (source field name), `heading` (section heading text), `format` ("text", "code", "list", or "subsections"; default: "text"), and `language` (code block language, for format="code"); `metadata_fields`: fields to render as key-value metadata.	required

Example

transformer.register_schema("pattern", {
    "title_field": "name",
    "description_field": "description",
    "sections": [
        {"field": "use_case", "heading": "When to Use"},
        {"field": "example", "heading": "Example", "format": "code"}
    ],
    "metadata_fields": ["category", "difficulty"]
})

Source code in packages/xization/src/dataknobs_xization/content_transformer.py

def register_schema(self, name: str, schema: dict[str, Any]) -> None:
    """Register a custom schema for specialized content conversion.

    Schemas define how to map JSON fields to markdown structure.

    Args:
        name: Schema identifier
        schema: Schema definition with the following structure:
            - title_field: Field to use as the main heading (required)
            - description_field: Field for intro text (optional)
            - sections: List of section definitions, each with:
                - field: Source field name
                - heading: Section heading text
                - format: "text", "code", "list", or "subsections" (default: "text")
                - language: Code block language (for format="code")
            - metadata_fields: Fields to render as key-value metadata

    Example:
        >>> transformer.register_schema("pattern", {
        ...     "title_field": "name",
        ...     "description_field": "description",
        ...     "sections": [
        ...         {"field": "use_case", "heading": "When to Use"},
        ...         {"field": "example", "heading": "Example", "format": "code"}
        ...     ],
        ...     "metadata_fields": ["category", "difficulty"]
        ... })
    """
    self.schemas[name] = schema
    logger.debug(f"Registered schema: {name}")
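
To make the schema keys described above more concrete, here is a minimal standalone sketch of how such a schema could drive rendering. `render_with_schema` is a hypothetical helper written for illustration; it is not the library's private `_transform_with_schema`, whose source is not shown on this page, and the heading levels and metadata layout are assumptions. Only the "text" and "code" section formats are sketched.

```python
from typing import Any


def render_with_schema(record: dict[str, Any], schema: dict[str, Any]) -> str:
    """Hypothetical renderer illustrating the documented schema keys."""
    # title_field is required and becomes the main heading (level is assumed).
    lines: list[str] = [f"## {record[schema['title_field']]}", ""]

    # Optional intro text.
    description = record.get(schema.get("description_field", ""))
    if description:
        lines.extend([str(description), ""])

    # Metadata rendered as key-value lines (layout is an assumption).
    for field in schema.get("metadata_fields", []):
        if field in record:
            lines.append(f"**{field}**: {record[field]}")
    if schema.get("metadata_fields"):
        lines.append("")

    # Each section gets its own sub-heading; missing fields are skipped.
    for section in schema.get("sections", []):
        value = record.get(section["field"])
        if value is None:
            continue
        lines.extend([f"### {section['heading']}", ""])
        if section.get("format", "text") == "code":
            fence = "`" * 3
            lines.extend([f"{fence}{section.get('language', '')}", str(value), fence, ""])
        else:
            lines.extend([str(value), ""])

    return "\n".join(lines)


schema = {
    "title_field": "name",
    "description_field": "description",
    "sections": [
        {"field": "use_case", "heading": "When to Use"},
        {"field": "example", "heading": "Example", "format": "code"},
    ],
    "metadata_fields": ["category", "difficulty"],
}
record = {
    "name": "Retry",
    "description": "Repeat a failed call.",
    "use_case": "Transient errors.",
    "example": "retry(call, times=3)",
    "category": "resilience",
}
markdown = render_with_schema(record, schema)
print(markdown)
```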

transform

transform(
    content: Any,
    format: str = "json",
    schema: str | None = None,
    title: str | None = None,
) -> str

Transform content to markdown.

Parameters:

Name	Type	Description	Default
`content`	`Any`	Content to transform (dict, list, string, or file path)	required
`format`	`str`	Content format - "json", "yaml", "csv", or "html"	`'json'`
`schema`	`str \| None`	Optional schema name for custom conversion (applies to "json" and "yaml" formats only; ignored for "csv" and "html")	`None`
`title`	`str \| None`	Optional document title	`None`

Returns:

Type	Description
`str`	Markdown formatted string

Raises:

Type	Description
`ValueError`	If format is not supported

Source code in packages/xization/src/dataknobs_xization/content_transformer.py

def transform(
    self,
    content: Any,
    format: str = "json",
    schema: str | None = None,
    title: str | None = None,
) -> str:
    """Transform content to markdown.

    Args:
        content: Content to transform (dict, list, string, or file path)
        format: Content format - "json", "yaml", "csv", or "html"
        schema: Optional schema name for custom conversion (applies to
            "json" and "yaml" formats only; ignored for "csv" and "html")
        title: Optional document title

    Returns:
        Markdown formatted string

    Raises:
        ValueError: If format is not supported
    """
    if format == "json":
        if isinstance(content, (str, Path)):
            with open(content, encoding="utf-8") as f:
                data = json.load(f)
        else:
            data = content
        return self.transform_json(data, schema=schema, title=title)
    elif format == "yaml":
        return self.transform_yaml(content, schema=schema, title=title)
    elif format == "csv":
        if schema is not None:
            logger.warning("schema parameter is ignored for CSV format")
        return self.transform_csv(content, title=title)
    elif format == "html":
        if schema is not None:
            logger.warning("schema parameter is ignored for HTML format")
        return self.transform_html(content, title=title)
    else:
        raise ValueError(
            f"Unsupported format: {format}. Use 'json', 'yaml', 'csv', or 'html'."
        )
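
In the `"json"` branch of the listing above, a `str` or `Path` is always opened as a file, so a raw JSON string apparently needs to be parsed by the caller before it is passed in. The sketch below copies only that branch into a standalone function to show the difference; `load_json_like_transform` is a hypothetical name and the sample data is made up.

```python
import json
import tempfile
from pathlib import Path
from typing import Any


def load_json_like_transform(content: Any) -> Any:
    """Copy of the "json" branch in the listing: str/Path means a file path."""
    if isinstance(content, (str, Path)):
        with open(content, encoding="utf-8") as f:
            return json.load(f)
    return content


raw = '{"name": "Retry", "category": "resilience"}'

# Already-parsed data passes through unchanged.
from_data = load_json_like_transform(json.loads(raw))

# A Path, or a str naming an existing file, is read from disk.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "pattern.json"
    path.write_text(raw, encoding="utf-8")
    from_file = load_json_like_transform(path)
    from_str_path = load_json_like_transform(str(path))

# A raw JSON string is treated as a file name and fails to open.
try:
    load_json_like_transform(raw)
    raw_string_failed = False
except OSError:
    raw_string_failed = True

print(from_data, raw_string_failed)
```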

transform_json

transform_json(
    data: dict[str, Any] | list[Any],
    schema: str | None = None,
    title: str | None = None,
) -> str

Transform JSON data to markdown.

Parameters:

Name	Type	Description	Default
`data`	`dict[str, Any] \| list[Any]`	JSON data (dict or list)	required
`schema`	`str \| None`	Optional schema name for custom conversion	`None`
`title`	`str \| None`	Optional document title	`None`

Returns:

Type	Description
`str`	Markdown formatted string

Source code in packages/xization/src/dataknobs_xization/content_transformer.py

def transform_json(
    self,
    data: dict[str, Any] | list[Any],
    schema: str | None = None,
    title: str | None = None,
) -> str:
    """Transform JSON data to markdown.

    Args:
        data: JSON data (dict or list)
        schema: Optional schema name for custom conversion
        title: Optional document title

    Returns:
        Markdown formatted string
    """
    lines: list[str] = []

    # Add document title if provided
    if title:
        lines.extend([f"# {title}", ""])

    # Use custom schema if specified
    if schema and schema in self.schemas:
        return self._transform_with_schema(data, self.schemas[schema], title)

    # Generic transformation
    if isinstance(data, list):
        for item in data:
            if isinstance(item, dict):
                lines.extend(self._transform_dict_generic(item, self.base_heading_level))
                lines.extend(["---", ""])
            else:
                lines.append(f"- {item}")
                lines.append("")
    elif isinstance(data, dict):
        lines.extend(self._transform_dict_generic(data, self.base_heading_level))
    else:
        lines.append(str(data))

    return "\n".join(lines)
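
The generic branch above can be followed with a small standalone sketch. It mirrors the list and scalar handling from the listing, but `render_dict_stub` is a hypothetical stand-in for the private `_transform_dict_generic` helper, whose output format is not shown on this page.

```python
from __future__ import annotations

from typing import Any


def render_dict_stub(item: dict[str, Any], level: int) -> list[str]:
    """Hypothetical stand-in for _transform_dict_generic (real output may differ)."""
    lines = [f"**{key}**: {value}" for key, value in item.items()]
    return lines + [""]


def transform_json_sketch(data: Any, title: str | None = None, level: int = 2) -> str:
    lines: list[str] = []
    if title:
        lines.extend([f"# {title}", ""])
    if isinstance(data, list):
        for item in data:
            if isinstance(item, dict):
                lines.extend(render_dict_stub(item, level))
                lines.extend(["---", ""])  # horizontal rule after each dict item
            else:
                lines.extend([f"- {item}", ""])  # scalars become bullets
    elif isinstance(data, dict):
        lines.extend(render_dict_stub(data, level))
    else:
        lines.append(str(data))
    return "\n".join(lines)


result = transform_json_sketch(["alpha", {"k": "v"}, 3], title="Mixed")
print(result)
```

Because every scalar bullet is followed by a blank line, a list of scalars comes out as a loose markdown list rather than a tight one.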

transform_yaml

transform_yaml(
    content: str | Path, schema: str | None = None, title: str | None = None
) -> str

Transform YAML content to markdown.

Parameters:

Name	Type	Description	Default
`content`	`str \| Path`	YAML string or file path	required
`schema`	`str \| None`	Optional schema name for custom conversion	`None`
`title`	`str \| None`	Optional document title	`None`

Returns:

Type	Description
`str`	Markdown formatted string

Raises:

Type	Description
`ImportError`	If PyYAML is not installed

Source code in packages/xization/src/dataknobs_xization/content_transformer.py

def transform_yaml(
    self,
    content: str | Path,
    schema: str | None = None,
    title: str | None = None,
) -> str:
    """Transform YAML content to markdown.

    Args:
        content: YAML string or file path
        schema: Optional schema name for custom conversion
        title: Optional document title

    Returns:
        Markdown formatted string

    Raises:
        ImportError: If PyYAML is not installed
    """
    try:
        import yaml
    except ImportError:
        raise ImportError("PyYAML is required for YAML transformation. Install with: pip install pyyaml") from None

    if isinstance(content, (str, Path)) and Path(content).exists():
        with open(content, encoding="utf-8") as f:
            data = yaml.safe_load(f)
    else:
        data = yaml.safe_load(content)

    return self.transform_json(data, schema=schema, title=title)
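
The listing above decides between "file path" and "inline YAML" by checking whether the string names an existing file. On some platforms and Python versions, a filesystem check on a very long string may raise `OSError` instead of returning `False`, so callers with large inline documents may want to guard for that. The sketch below shows one defensive way to make the same decision using only the standard library; `read_file_or_inline` is a hypothetical helper, not part of the package, and it returns text rather than parsed YAML.

```python
import tempfile
from pathlib import Path


def read_file_or_inline(content) -> str:
    """Return file text if content names an existing file, else the content itself."""
    try:
        is_file = Path(content).is_file()
    except (OSError, ValueError):
        # Overlong or otherwise invalid "paths" are treated as inline content.
        is_file = False
    if is_file:
        return Path(content).read_text(encoding="utf-8")
    return str(content)


inline = "name: Retry\nsteps:\n  - one\n"
too_long = "key: " + "x" * 5000

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "doc.yaml"
    path.write_text(inline, encoding="utf-8")
    from_path = read_file_or_inline(path)
    from_str_path = read_file_or_inline(str(path))

from_inline = read_file_or_inline(inline)
from_long = read_file_or_inline(too_long)
print(from_path == from_inline)
```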

transform_csv

transform_csv(
    content: str | Path,
    title: str | None = None,
    title_field: str | None = None,
) -> str

Transform CSV content to markdown.

Each row becomes a section with the first column (or title_field) as heading.

Parameters:

Name	Type	Description	Default
`content`	`str \| Path`	CSV string or file path	required
`title`	`str \| None`	Optional document title	`None`
`title_field`	`str \| None`	Column to use as section title (default: first column)	`None`

Returns:

Type	Description
`str`	Markdown formatted string

Source code in packages/xization/src/dataknobs_xization/content_transformer.py

def transform_csv(
    self,
    content: str | Path,
    title: str | None = None,
    title_field: str | None = None,
) -> str:
    """Transform CSV content to markdown.

    Each row becomes a section with the first column (or title_field) as heading.

    Args:
        content: CSV string or file path
        title: Optional document title
        title_field: Column to use as section title (default: first column)

    Returns:
        Markdown formatted string
    """
    lines: list[str] = []

    if title:
        lines.extend([f"# {title}", ""])

    # Read CSV
    if isinstance(content, Path) or (isinstance(content, str) and Path(content).exists()):
        with open(content, encoding="utf-8") as f:
            reader = csv.DictReader(f)
            rows = list(reader)
    else:
        reader = csv.DictReader(io.StringIO(content))
        rows = list(reader)

    if not rows:
        return "\n".join(lines)

    # Determine title field
    fieldnames = list(rows[0].keys())
    if title_field and title_field in fieldnames:
        title_col = title_field
    else:
        title_col = fieldnames[0]

    # Transform each row
    for row in rows:
        row_title = row.get(title_col, "Untitled")
        lines.append(f"{'#' * self.base_heading_level} {row_title}")
        lines.append("")

        for field, value in row.items():
            if field == title_col or not value:
                continue

            if self.include_field_labels:
                lines.append(f"**{self._format_field_name(field)}**: {value}")
            else:
                lines.append(value)
            lines.append("")

        lines.extend(["---", ""])

    return "\n".join(lines)
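
As an illustration of the row-to-section mapping above, the following standalone sketch reimplements the loop with the standard `csv` module. `format_field_name` is a hypothetical stand-in for the private `_format_field_name` helper (its real behavior is not shown on this page), and the sample data is made up.

```python
import csv
import io


def format_field_name(field: str) -> str:
    """Hypothetical stand-in for _format_field_name: 'use_case' -> 'Use Case'."""
    return field.replace("_", " ").title()


def csv_rows_to_markdown(text: str, base_heading_level: int = 2) -> str:
    rows = list(csv.DictReader(io.StringIO(text)))
    if not rows:
        return ""
    title_col = list(rows[0].keys())[0]  # first column is the section title
    lines = []
    for row in rows:
        lines.extend([f"{'#' * base_heading_level} {row.get(title_col, 'Untitled')}", ""])
        for field, value in row.items():
            if field == title_col or not value:
                continue  # skip the heading column and empty cells
            lines.extend([f"**{format_field_name(field)}**: {value}", ""])
        lines.extend(["---", ""])  # horizontal rule closes each row's section
    return "\n".join(lines)


csv_text = "name,use_case,notes\nRetry,Transient errors,\nCache,Slow lookups,Check TTL\n"
markdown = csv_rows_to_markdown(csv_text)
print(markdown)
```

Note that the empty `notes` cell in the first row produces no line at all, matching the `not value` check in the listing.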

transform_html

transform_html(content: str | Path, title: str | None = None) -> str

Transform HTML content to markdown.

Supports standard HTML with semantic tags and IETF RFC markup. Auto-detects the document format and applies appropriate conversion.

Parameters:

Name	Type	Description	Default
`content`	`str \| Path`	HTML string or file path	required
`title`	`str \| None`	Optional document title	`None`

Returns:

Type	Description
`str`	Markdown formatted string

Source code in packages/xization/src/dataknobs_xization/content_transformer.py

def transform_html(
    self,
    content: str | Path,
    title: str | None = None,
) -> str:
    """Transform HTML content to markdown.

    Supports standard HTML with semantic tags and IETF RFC markup.
    Auto-detects the document format and applies appropriate conversion.

    Args:
        content: HTML string or file path
        title: Optional document title

    Returns:
        Markdown formatted string
    """
    from dataknobs_xization.html import HTMLConverter

    converter = HTMLConverter(base_heading_level=self.base_heading_level)
    return converter.convert(content, title=title)

HTMLConverter

HTMLConverter(config: HTMLConverterConfig | None = None, **kwargs: Any)

Convert HTML content to well-structured markdown.

Supports standard HTML with semantic tags (h1-h6, p, ul, ol, table, pre, etc.) and auto-detects IETF RFC markup format (pre-formatted text with span-based headings).

The converter produces markdown compatible with MarkdownParser and MarkdownChunker for downstream RAG ingestion.

Example

converter = HTMLConverter()
md = converter.convert("<h1>Overview</h1><p>Details here.</p>")
print(md)

# Overview

Details here.

Initialize the converter.

Parameters:

Name	Type	Description	Default
`config`	`HTMLConverterConfig \| None`	Converter configuration. If None, uses defaults.	`None`
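`**kwargs`	`Any`	Individual config fields (e.g., base_heading_level=2) used to build the configuration when config is None. Per the source below, they are ignored when config is provided.	`{}`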

Methods:

Name	Description
`convert`	Convert HTML content to markdown.

Source code in packages/xization/src/dataknobs_xization/html/html_converter.py

def __init__(self, config: HTMLConverterConfig | None = None, **kwargs: Any):
    """Initialize the converter.

    Args:
        config: Converter configuration. If None, uses defaults.
        **kwargs: Override individual config fields (e.g., base_heading_level=2).
    """
    if config is not None:
        self.config = config
    else:
        self.config = HTMLConverterConfig(**kwargs)

Functions

convert

convert(content: str | Path, title: str | None = None) -> str

Convert HTML content to markdown.

Auto-detects whether the document is standard HTML or IETF RFC markup and applies the appropriate conversion strategy.

Note

Each call uses internal state for reference-style link collection. Do not call convert() concurrently on the same instance from multiple threads. Create separate HTMLConverter instances for concurrent use, or use the html_to_markdown() convenience function which creates a fresh instance per call.

Parameters:

Name	Type	Description	Default
`content`	`str \| Path`	HTML string or path to an HTML file.	required
`title`	`str \| None`	Optional document title. If provided, prepended as a top-level heading. For RFC documents, extracted automatically if not provided.	`None`

Returns:

Type	Description
`str`	Well-structured markdown string.

Source code in packages/xization/src/dataknobs_xization/html/html_converter.py

def convert(self, content: str | Path, title: str | None = None) -> str:
    """Convert HTML content to markdown.

    Auto-detects whether the document is standard HTML or IETF RFC markup
    and applies the appropriate conversion strategy.

    Note:
        Each call uses internal state for reference-style link collection.
        Do not call ``convert()`` concurrently on the same instance from
        multiple threads. Create separate ``HTMLConverter`` instances for
        concurrent use, or use the ``html_to_markdown()`` convenience
        function which creates a fresh instance per call.

    Args:
        content: HTML string or path to an HTML file.
        title: Optional document title. If provided, prepended as a top-level
            heading. For RFC documents, extracted automatically if not provided.

    Returns:
        Well-structured markdown string.
    """
    if isinstance(content, Path):
        content = content.read_text(encoding="utf-8")
    elif not isinstance(content, str):
        raise TypeError(f"content must be str or Path, got {type(content).__name__}")

    # Per-conversion state for reference-style link collection.
    self._link_references: list[tuple[str, str]] = []

    soup = BeautifulSoup(content, "html.parser")

    # Strip unwanted elements before detection or conversion.
    self._strip_elements(soup)

    # Detect format and dispatch.
    if self._is_rfc_markup(soup):
        logger.debug("Detected IETF RFC markup format")
        result = self._convert_rfc(soup, title)
    else:
        logger.debug("Using standard HTML conversion")
        result = self._convert_standard(soup, title)

    # Append reference-style link definitions.
    if self._link_references:
        result = result.rstrip("\n") + "\n\n"
        for idx, (_text, href) in enumerate(self._link_references, 1):
            result += f"[{idx}]: {href}\n"

    # Prepend frontmatter if configured.
    if self.config.frontmatter:
        fm = self._render_frontmatter(self.config.frontmatter)
        result = fm + "\n" + result

    return self._normalize_whitespace(result)
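
The reference-definition step near the end of the listing above can be tried in isolation. The sketch below copies that step into a standalone function; `append_link_references` is a hypothetical name, and the `[text][n]` form used in the sample body is an assumption about how the converter writes reference-style links, since that part of the source is not shown here.

```python
from __future__ import annotations


def append_link_references(result: str, references: list[tuple[str, str]]) -> str:
    """Copy of the reference-definition step from the convert() listing."""
    if references:
        # Collapse trailing newlines, then leave exactly one blank line.
        result = result.rstrip("\n") + "\n\n"
        for idx, (_text, href) in enumerate(references, 1):
            result += f"[{idx}]: {href}\n"
    return result


body = "See the [spec][1] and the [guide][2].\n\n\n"
refs = [
    ("spec", "https://example.com/spec"),
    ("guide", "https://example.com/guide"),
]
output = append_link_references(body, refs)
print(output)
```

Definitions are numbered from 1 in collection order, which is why the collected list has to be reset at the start of every `convert()` call and why a single instance should not be shared across threads.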

HTMLConverterConfig `dataclass`

HTMLConverterConfig(
    base_heading_level: int = 1,
    include_links: bool = True,
    strip_nav: bool = True,
    strip_scripts: bool = True,
    preserve_code_blocks: bool = True,
    link_style: Literal["inline", "reference", "text"] = "inline",
    strip_images: bool = False,
    wrap_width: int = 0,
    frontmatter: dict[str, Any] | None = None,
)

Configuration for HTML to markdown conversion.

Attributes:

Name	Type	Description
`base_heading_level`	`int`	Minimum heading level in output (1 = #, 2 = ##, etc.)
`include_links`	`bool`	Whether to preserve hyperlinks as markdown links.
`strip_nav`	`bool`	Remove navigation-related elements.
`strip_scripts`	`bool`	Remove script elements.