dataknobs-xization Complete API Reference
Complete auto-generated API documentation from source code docstrings.
💡 Also see:
- Curated Guide: hand-crafted tutorials and examples
- Package Overview: introduction and getting started
- Source Code: view on GitHub
dataknobs_xization
Text normalization and tokenization tools.
Modules:
| Name | Description |
|---|---|
| annotations | Text annotation data structures and interfaces. |
| authorities | Authority-based annotation processing and field grouping. |
| content_transformer | Content transformation utilities for converting various formats to markdown. |
| html | HTML to markdown conversion utilities. |
| ingestion | Knowledge base ingestion module. |
| json | JSON chunking utilities for RAG applications. |
| lexicon | Lexical matching and token alignment for text processing. |
| markdown | Markdown chunking utilities for RAG applications. |
| masking_tokenizer | Character-level text feature extraction and tokenization. |
| normalize | Text normalization utilities and regular expressions. |
Classes:
| Name | Description |
|---|---|
| ContentTransformer | Transform structured content into markdown for RAG ingestion. |
| HTMLConverter | Convert HTML content to well-structured markdown. |
| HTMLConverterConfig | Configuration for HTML to markdown conversion. |
| AdaptiveStreamingProcessor | Streaming processor that adapts to memory constraints. |
| Chunk | A chunk of text with associated metadata. |
| ChunkFormat | Output format for chunk text. |
| ChunkMetadata | Metadata for a document chunk. |
| ChunkQualityConfig | Configuration for chunk quality filtering. |
| ChunkQualityFilter | Filter for identifying and removing low-quality chunks. |
| EnrichedChunkData | Data for a chunk enriched with heading context. |
| HeadingInclusion | Strategy for including headings in chunks. |
| MarkdownChunker | Chunker for generating chunks from markdown tree structures. |
| MarkdownNode | Data container for markdown tree nodes. |
| MarkdownParser | Parser for converting markdown text into a tree structure. |
| StreamingMarkdownProcessor | Streaming processor for incremental markdown chunking. |
| CharacterFeatures | Class representing features of text as a dataframe with each character |
| TextFeatures | Extracts text-specific character features for tokenization. |
| JSONChunk | A chunk generated from JSON data. |
| JSONChunkConfig | Configuration for JSON chunking. |
| JSONChunker | Chunker for generating chunks from JSON data with preserved metadata. |
| DirectoryProcessor | Process documents from a directory for knowledge base ingestion. |
| FilePatternConfig | Configuration for a specific file pattern. |
| IngestionConfigError | Error related to ingestion configuration. |
| KnowledgeBaseConfig | Configuration for knowledge base ingestion from a directory. |
| ProcessedDocument | A processed document ready for embedding and storage. |
Functions:
| Name | Description |
|---|---|
| csv_to_markdown | Convert CSV content to markdown. |
| json_to_markdown | Convert JSON data to markdown. |
| yaml_to_markdown | Convert YAML content to markdown. |
| html_to_markdown | Convert HTML content to markdown. |
| build_enriched_text | Build text for embedding with relevant heading context. |
| chunk_markdown_tree | Generate chunks from a markdown tree. |
| format_heading_display | Format a heading path for display. |
| get_dynamic_heading_display | Get heading display based on content length. |
| is_multiword | Check if a heading contains multiple words. |
| parse_markdown | Parse markdown content into a tree structure. |
| stream_markdown_file | Stream chunks from a markdown file. |
| stream_markdown_string | Stream chunks from a markdown string. |
| process_directory | Convenience function to process a directory. |
Classes
ContentTransformer
ContentTransformer(
base_heading_level: int = 2,
include_field_labels: bool = True,
code_block_fields: list[str] | None = None,
list_fields: list[str] | None = None,
)
Transform structured content into markdown for RAG ingestion.
This class converts various data formats (JSON, YAML, CSV, HTML) into well-structured markdown that can be parsed by MarkdownParser and chunked by MarkdownChunker.
The transformer creates markdown with appropriate heading hierarchy so that the chunker can create semantic boundaries around logical content units.
Attributes:
| Name | Type | Description |
|---|---|---|
| schemas | dict[str, dict[str, Any]] | Dictionary of registered custom schemas |
| config | dict[str, dict[str, Any]] | Transformer configuration options |
Initialize the content transformer.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| base_heading_level | int | Starting heading level for top-level items (default: 2) | 2 |
| include_field_labels | bool | Whether to bold field names in output (default: True) | True |
| code_block_fields | list[str] \| None | Field names that should be rendered as code blocks | None |
| list_fields | list[str] \| None | Field names that should be rendered as bullet lists | None |
Methods:
| Name | Description |
|---|---|
| register_schema | Register a custom schema for specialized content conversion. |
| transform | Transform content to markdown. |
| transform_json | Transform JSON data to markdown. |
| transform_yaml | Transform YAML content to markdown. |
| transform_csv | Transform CSV content to markdown. |
| transform_html | Transform HTML content to markdown. |
Source code in packages/xization/src/dataknobs_xization/content_transformer.py
Functions
register_schema
Register a custom schema for specialized content conversion.
Schemas define how to map JSON fields to markdown structure.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| name | str | Schema identifier | required |
| schema | dict[str, Any] | Schema definition with the following top-level keys: title_field (required; field to use as the main heading), description_field (optional; field for intro text), sections (list of section definitions), and metadata_fields (fields to render as key-value metadata). Each section definition has: field (source field name), heading (section heading text), format ("text", "code", "list", or "subsections"; default: "text"), and language (code block language, for format="code"). | required |
Example
    >>> transformer.register_schema("pattern", {
    ...     "title_field": "name",
    ...     "description_field": "description",
    ...     "sections": [
    ...         {"field": "use_case", "heading": "When to Use"},
    ...         {"field": "example", "heading": "Example", "format": "code"}
    ...     ],
    ...     "metadata_fields": ["category", "difficulty"]
    ... })
Source code in packages/xization/src/dataknobs_xization/content_transformer.py
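To make the schema structure concrete, here is a minimal, self-contained sketch of how a schema of the documented shape could drive the rendering of one JSON record. The function `render_with_schema` is hypothetical and is not part of dataknobs_xization; the placement of metadata before the sections and the bold key-value layout are assumptions made for illustration, and the "subsections" format is not handled.

```python
from typing import Any


def render_with_schema(record: dict[str, Any], schema: dict[str, Any], level: int = 2) -> str:
    """Hypothetical stand-in: render one record with a schema of the documented shape."""
    # title_field is the only required key; it becomes the main heading.
    lines = [f"{'#' * level} {record[schema['title_field']]}", ""]

    # Optional intro text taken from description_field.
    desc_field = schema.get("description_field")
    if desc_field and record.get(desc_field):
        lines += [str(record[desc_field]), ""]

    # Key-value metadata (layout is an assumption of this sketch).
    meta = [f"**{name}**: {record[name]}"
            for name in schema.get("metadata_fields", []) if name in record]
    if meta:
        lines += meta + [""]

    # One sub-heading per section whose source field is present.
    for section in schema.get("sections", []):
        value = record.get(section["field"])
        if value is None:
            continue
        lines += [f"{'#' * (level + 1)} {section['heading']}", ""]
        fmt = section.get("format", "text")
        if fmt == "code":
            fence = "`" * 3
            lines += [f"{fence}{section.get('language', '')}", str(value), fence, ""]
        elif fmt == "list":
            lines += [f"- {item}" for item in value] + [""]
        else:
            # "text" and anything this sketch does not handle fall back to plain text.
            lines += [str(value), ""]
    return "\n".join(lines).rstrip() + "\n"
```

Because each record gets its own heading and each section a sub-heading one level deeper, a downstream chunker has clear boundaries to split on, which is the stated purpose of the transformer.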
transform
transform(
content: Any,
format: str = "json",
schema: str | None = None,
title: str | None = None,
) -> str
Transform content to markdown.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| content | Any | Content to transform (dict, list, string, or file path) | required |
| format | str | Content format - "json", "yaml", "csv", or "html" | 'json' |
| schema | str \| None | Optional schema name for custom conversion (applies to "json" and "yaml" formats only; ignored for "csv" and "html") | None |
| title | str \| None | Optional document title | None |

Returns:
| Type | Description |
|---|---|
| str | Markdown formatted string |

Raises:
| Type | Description |
|---|---|
| ValueError | If format is not supported |
Source code in packages/xization/src/dataknobs_xization/content_transformer.py
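The format-based dispatch and the ValueError for unsupported formats can be illustrated with a small stand-alone sketch. The helper `dispatch_transform` and the handler mapping are hypothetical names introduced here; they show one common way to implement such a dispatcher and are not taken from the library source.

```python
from typing import Any, Callable


def dispatch_transform(
    content: Any,
    format: str,  # mirrors the documented parameter name; shadows the builtin locally
    handlers: dict[str, Callable[[Any], str]],
) -> str:
    """Hypothetical dispatcher: pick a handler by format name or raise ValueError."""
    try:
        handler = handlers[format]
    except KeyError:
        supported = ", ".join(sorted(handlers))
        raise ValueError(
            f"Unsupported format: {format!r} (expected one of: {supported})"
        ) from None
    return handler(content)
```

Callers that accept a format string from user input or from a file extension may want to catch ValueError and report the list of supported formats.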
transform_json
transform_json(
data: dict[str, Any] | list[Any],
schema: str | None = None,
title: str | None = None,
) -> str
Transform JSON data to markdown.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | dict[str, Any] \| list[Any] | JSON data (dict or list) | required |
| schema | str \| None | Optional schema name for custom conversion | None |
| title | str \| None | Optional document title | None |

Returns:
| Type | Description |
|---|---|
| str | Markdown formatted string |
Source code in packages/xization/src/dataknobs_xization/content_transformer.py
transform_yaml
Transform YAML content to markdown.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| content | str \| Path | YAML string or file path | required |
| schema | str \| None | Optional schema name for custom conversion | None |
| title | str \| None | Optional document title | None |

Returns:
| Type | Description |
|---|---|
| str | Markdown formatted string |

Raises:
| Type | Description |
|---|---|
| ImportError | If PyYAML is not installed |
Source code in packages/xization/src/dataknobs_xization/content_transformer.py
transform_csv
transform_csv(
content: str | Path,
title: str | None = None,
title_field: str | None = None,
) -> str
Transform CSV content to markdown.
Each row becomes a section with the first column (or title_field) as heading.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| content | str \| Path | CSV string or file path | required |
| title | str \| None | Optional document title | None |
| title_field | str \| None | Column to use as section title (default: first column) | None |

Returns:
| Type | Description |
|---|---|
| str | Markdown formatted string |
Source code in packages/xization/src/dataknobs_xization/content_transformer.py
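The row-to-section mapping described above can be sketched with the standard library alone. The function `csv_rows_to_sections` below is a hypothetical illustration, not the library's implementation; the heading level and the bold "column: value" layout for the remaining columns are assumptions.

```python
import csv
import io
from typing import Optional


def csv_rows_to_sections(
    text: str,
    title: Optional[str] = None,
    title_field: Optional[str] = None,
    level: int = 2,
) -> str:
    """Hypothetical sketch: one markdown section per CSV row."""
    reader = csv.DictReader(io.StringIO(text))
    columns = reader.fieldnames or []
    if not columns:
        return ""
    # Fall back to the first column when no title_field is given.
    heading_col = title_field or columns[0]
    lines = [f"# {title}", ""] if title else []
    for row in reader:
        lines += [f"{'#' * level} {row[heading_col]}", ""]
        # Remaining columns become labeled lines under the row's heading.
        for col in columns:
            if col != heading_col:
                lines.append(f"**{col}**: {row[col]}")
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"
```

For example, `csv_rows_to_sections("id,name\n1,Ada\n", title_field="name")` yields a single section headed by the `name` value, with `id` listed beneath it.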
transform_html
Transform HTML content to markdown.
Supports standard HTML with semantic tags and IETF RFC markup. Auto-detects the document format and applies appropriate conversion.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| content | str \| Path | HTML string or file path | required |
| title | str \| None | Optional document title | None |

Returns:
| Type | Description |
|---|---|
| str | Markdown formatted string |
Source code in packages/xization/src/dataknobs_xization/content_transformer.py
HTMLConverter
Convert HTML content to well-structured markdown.
Supports standard HTML with semantic tags (h1-h6, p, ul, ol, table, pre, etc.) and auto-detects IETF RFC markup format (pre-formatted text with span-based headings).
The converter produces markdown compatible with MarkdownParser and MarkdownChunker for downstream RAG ingestion.
Example
    >>> converter = HTMLConverter()
    >>> md = converter.convert("<h1>Overview</h1><p>Details here.</p>")
    >>> print(md)
    # Overview
Initialize the converter.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| config | HTMLConverterConfig \| None | Converter configuration. If None, uses defaults. | None |
| **kwargs | Any | Override individual config fields (e.g., base_heading_level=2). | {} |

Methods:
| Name | Description |
|---|---|
| convert | Convert HTML content to markdown. |
Source code in packages/xization/src/dataknobs_xization/html/html_converter.py
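The constructor accepts either a full config object or keyword overrides for individual fields. One common way to implement that pattern with dataclasses is sketched below; `DemoConfig` and `DemoConverter` are hypothetical stand-ins with only two fields, and the use of `dataclasses.replace` is an assumption about the approach, not a description of the library's code.

```python
from dataclasses import dataclass, fields, replace
from typing import Any, Optional


@dataclass
class DemoConfig:
    """Hypothetical stand-in for a converter config; not the library's class."""
    base_heading_level: int = 1
    include_links: bool = True


class DemoConverter:
    """Hypothetical converter that merges keyword overrides into its config."""

    def __init__(self, config: Optional[DemoConfig] = None, **kwargs: Any) -> None:
        base = config or DemoConfig()
        # Reject misspelled field names early with a readable message.
        unknown = set(kwargs) - {f.name for f in fields(DemoConfig)}
        if unknown:
            raise TypeError(f"Unknown config fields: {sorted(unknown)}")
        # replace() returns a copy, so the caller's config object is not mutated.
        self.config = replace(base, **kwargs)
```

Copying rather than mutating means one config object can be shared by several converters that each override a different field.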
Functions
convert
Convert HTML content to markdown.
Auto-detects whether the document is standard HTML or IETF RFC markup and applies the appropriate conversion strategy.
Note
Each call uses internal state for reference-style link collection.
Do not call convert() concurrently on the same instance from
multiple threads. Create separate HTMLConverter instances for
concurrent use, or use the html_to_markdown() convenience
function which creates a fresh instance per call.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| content | str \| Path | HTML string or path to an HTML file. | required |
| title | str \| None | Optional document title. If provided, prepended as a top-level heading. For RFC documents, extracted automatically if not provided. | None |

Returns:
| Type | Description |
|---|---|
| str | Well-structured markdown string. |
Source code in packages/xization/src/dataknobs_xization/html/html_converter.py
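The threading note above suggests one instance per concurrent caller. A minimal sketch of that pattern follows, using `threading.local` so each worker thread lazily creates its own converter. `StatefulConverter` is a hypothetical stand-in that keeps per-call state in an attribute; substitute the real converter class in practice.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class StatefulConverter:
    """Hypothetical stand-in whose convert() keeps per-call state on the instance."""

    def __init__(self) -> None:
        self._links = []

    def convert(self, text: str) -> str:
        self._links = []  # state is reset at the start of each call
        self._links.extend(w for w in text.split() if w.startswith("http"))
        return f"{text} [{len(self._links)} link(s)]"


_local = threading.local()


def convert_in_thread(text: str) -> str:
    # Lazily create one converter per worker thread; instances are never shared.
    converter = getattr(_local, "converter", None)
    if converter is None:
        converter = _local.converter = StatefulConverter()
    return converter.convert(text)


def convert_all(docs):
    # pool.map preserves input order in its results.
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(convert_in_thread, docs))
```

Reusing one instance per thread avoids both the shared-state hazard and the cost of constructing a converter for every document; creating a fresh instance per call, as the convenience function is documented to do, is the simpler alternative.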
HTMLConverterConfig dataclass
HTMLConverterConfig(
base_heading_level: int = 1,
include_links: bool = True,
strip_nav: bool = True,
strip_scripts: bool = True,
preserve_code_blocks: bool = True,
link_style: Literal["inline", "reference", "text"] = "inline",
strip_images: bool = False,
wrap_width: int = 0,
frontmatter: dict[str, Any] | None = None,
)
Configuration for HTML to markdown conversion.
Attributes:
| Name | Type | Description |
|---|---|---|
| base_heading_level | int | Minimum heading level in output (1 = #, 2 = ##, etc.) |
| include_links | bool | Whether to preserve hyperlinks as markdown links. |
| strip_nav | bool | Remove |
| strip_scripts | bool | Remove |