Dataknobs Xization¶
The dataknobs-xization package provides text processing, normalization, and tokenization tools for AI knowledge bases.
💡 Quick Links:
- Complete API Reference - Full API documentation
- Source Code - View on GitHub
- API Index - All packages
Installation¶
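Install the package with pip (or your preferred Python package manager); this assumes the package is published under the same name as the repository:

```shell
pip install dataknobs-xization
```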
Overview¶
The Xization package specializes in text preprocessing and includes:
- Markdown Chunking: Parse and chunk markdown documents for RAG applications
- Preserves heading hierarchy and semantic structure
- Supports code blocks, tables, lists, and other markdown constructs
- Streaming support for large documents
- Flexible configuration for chunk size and heading inclusion
- JSON Chunking: Chunk JSON and JSONL documents for RAG applications
- Template-based or field-selection text generation
- Streaming support for large JSONL files
- GZIP compression support
- HTML Conversion: Convert HTML documents to markdown for RAG ingestion
- Standard HTML with semantic tags (headings, lists, tables, code blocks)
- IETF RFC markup auto-detection and conversion
- Configurable link handling, element stripping, and frontmatter
- Knowledge Base Ingestion: Batch process directories of documents
- Pattern-based file selection with exclusions
- Per-file type configuration
- Config file support (JSON/YAML)
- Tokenization: Advanced character-based and feature-driven tokenization
- Normalization: Text normalization with camelCase expansion and symbol handling
- Masking: Character-level masking and feature extraction
- Authorities: Text authority and lexicon management
- Annotations: Text annotation utilities
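The template-based text generation behind JSON/JSONL chunking amounts to filling a format string from each record. The sketch below illustrates that idea independently of the package API; the function name `chunk_jsonl` is illustrative, not the package's, and the real module adds streaming of large files, GZIP support, and chunk metadata:

```python
import io
import json

def chunk_jsonl(stream, template):
    """Render each JSONL record to text via a format-string template.

    Simplified sketch of template-based text generation; not the
    package's actual API.
    """
    for line in stream:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # str.format_map pulls named fields straight from the record
        yield template.format_map(record)

jsonl = io.StringIO(
    '{"title": "Intro", "body": "Welcome."}\n'
    '{"title": "Usage", "body": "Run the CLI."}\n'
)
chunks = list(chunk_jsonl(jsonl, "{title}: {body}"))
print(chunks)  # ['Intro: Welcome.', 'Usage: Run the CLI.']
```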
Package Structure¶
dataknobs-xization/
├── src/
│ └── dataknobs_xization/
│ ├── __init__.py
│ ├── html/
│ │ ├── html_converter.py # HTML to markdown conversion
│ │ └── __init__.py
│ ├── markdown/
│ │ ├── md_parser.py # Markdown parsing to Tree structure
│ │ ├── md_chunker.py # Chunking with configurable parameters
│ │ └── md_streaming.py # Streaming for large documents
│ ├── annotations.py
│ ├── authorities.py
│ ├── lexicon.py
│ ├── masking_tokenizer.py
│ └── normalize.py
├── scripts/
│ └── md_cli.py # CLI for markdown chunking
├── examples/
│ └── markdown/
│ ├── example_usage.py # Comprehensive examples
│ └── example_document.md # Sample markdown files
└── tests/
Quick Start¶
Markdown Chunking for RAG¶
Parse markdown into tree structure¶
# Import path assumed from the package layout shown below
from dataknobs_xization.markdown import (
    parse_markdown,
    chunk_markdown_tree,
    HeadingInclusion,
)

markdown_text = """
# User Guide
## Installation
Install the package using pip or uv.
## Quick Start
Here's how to get started with the library.
### Example Code
```python
import example
```
"""
tree = parse_markdown(markdown_text)

# Generate chunks for RAG with headings in metadata
chunks = chunk_markdown_tree(
    tree,
    max_chunk_size=500,
    heading_inclusion=HeadingInclusion.IN_METADATA
)

# Use chunks in vector store
for chunk in chunks:
    heading_context = chunk.metadata.get_heading_path()
    print(f"Context: {heading_context}")
    print(f"Text: {chunk.text[:100]}...")
    # store_in_vector_db(chunk.text, chunk.metadata.to_dict())
See the Markdown Chunking guide for complete documentation.
Text Normalization¶
from dataknobs_xization import normalize

# Basic normalization
text = "CamelCaseText with SYMBOLS!"
normalized = normalize.basic_normalization_fn(
    text,
    lowercase=True,
    expand_camelcase=True,
    drop_non_embedded_symbols=True
)
print(normalized)  # "camel case text with symbols"

# Get lexical variations
variations = normalize.get_lexical_variations(
    "XML-HTTP-Request",
    expand_camelcase=True,
    do_hyphen_expansion=True,
    do_hyphen_split=True
)
print(variations)  # {"XML-HTTP-Request", "XML HTTP Request", "XML", "HTTP", "Request", ...}
Character-Based Tokenization¶
from dataknobs_xization import TextFeatures
from dataknobs_structures import Text, TextMetaData

# Create document
metadata = TextMetaData(text_id=1, text_label="sample")
doc = Text("Hello CamelCase123 world!", metadata)

# Extract features
features = TextFeatures(
    doc,
    split_camelcase=True,
    mark_alpha=True,
    mark_digit=True
)

# Get tokens
tokens = features.get_tokens()
for token in tokens:
    print(f"Token: '{token.token_text}' at position {token.token_pos}")
Masking and Features¶
from dataknobs_xization import CharacterFeatures, TextFeatures

# Create text features with masking
text = "Hello123World"
features = TextFeatures(
    text,
    split_camelcase=True,
    mark_upper=True,
    mark_lower=True,
    mark_digit=True
)

# Access character dataframe
cdf = features.cdf
print(cdf.head())  # Shows character-level features

# Tokenize with normalization
def normalize_token(token_text):
    return token_text.lower()

tokens = features.get_tokens(normalize_fn=normalize_token)
for token in tokens:
    print(f"Original: '{token.token_text}', Normalized: '{token.norm_text}'")
Core Classes¶
TextFeatures¶
Character-level feature extraction with tokenization support.
from dataknobs_xization import TextFeatures

features = TextFeatures(
    doctext="ProcessHTMLData",
    split_camelcase=True,  # Split on camelCase boundaries
    mark_alpha=True,       # Mark alphabetic characters
    mark_digit=True,       # Mark numeric characters
    mark_upper=True,       # Mark uppercase characters
    mark_lower=True        # Mark lowercase characters
)

# Get character features as DataFrame
df = features.cdf
print(df.columns)  # Shows feature columns

# Extract tokens
tokens = features.get_tokens()
print([token.token_text for token in tokens])  # ["Process", "HTML", "Data"]
CharacterFeatures (Abstract Base)¶
Base class for character-level text analysis.
from dataknobs_xization import CharacterFeatures
# TextFeatures inherits from CharacterFeatures
# Provides common interface for character-based processing
Token Classes¶
Individual token representation with position and feature information.
# Tokens are returned by TextFeatures.get_tokens()
for token in tokens:
    print(f"Text: {token.token_text}")
    print(f"Position: {token.start_pos}-{token.end_pos}")
    print(f"Length: {token.len}")
    print(f"Normalized: {token.norm_text}")
Normalization Functions¶
Basic Normalization¶
from dataknobs_xization.normalize import basic_normalization_fn

text = "XMLHttpRequest"
normalized = basic_normalization_fn(
    text,
    lowercase=True,
    expand_camelcase=True,
    squash_whitespace=True
)
print(normalized)  # "xml http request"
Camel Case Expansion¶
from dataknobs_xization.normalize import expand_camelcase_fn
text = "parseXMLData"
expanded = expand_camelcase_fn(text)
print(expanded) # "parse XML Data"
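The boundary logic behind this expansion can be approximated with two regex passes: one split between a lowercase character and an uppercase letter, and one between a trailing acronym letter and a following capitalized word. This is a self-contained sketch, not the library's implementation:

```python
import re

def expand_camelcase(text):
    # lower-to-upper boundary: "parseXML" -> "parse XML"
    text = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", text)
    # acronym-to-word boundary: "XMLData" -> "XML Data"
    text = re.sub(r"(?<=[A-Z])(?=[A-Z][a-z])", " ", text)
    return text

print(expand_camelcase("parseXMLData"))    # parse XML Data
print(expand_camelcase("XMLHttpRequest"))  # XML Http Request
```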
Symbol Handling¶
from dataknobs_xization.normalize import (
    drop_non_embedded_symbols_fn,
    drop_embedded_symbols_fn
)

# Remove symbols at word boundaries
text = "!hello@world#"
clean = drop_non_embedded_symbols_fn(text)
print(clean)  # "hello@world"

# Remove symbols within words
text = "hello@world"
clean = drop_embedded_symbols_fn(text)
print(clean)  # "helloworld"
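The distinction between the two functions can be sketched with plain string operations: non-embedded symbols sit at the edges of whitespace-delimited tokens, while embedded symbols sit between alphanumeric characters. The helpers below are illustrative stand-ins, not the library code:

```python
import re

SYMBOLS = "!@#$%^&*()_+-=[]{};:'\",.<>/?\\|`~"

def drop_non_embedded(text):
    # strip symbols only from the edges of each whitespace token
    return " ".join(tok.strip(SYMBOLS) for tok in text.split())

def drop_embedded(text):
    # remove symbol runs that sit between alphanumeric characters
    return re.sub(r"(?<=\w)[^\w\s]+(?=\w)", "", text)

print(drop_non_embedded("!hello@world#"))  # hello@world
print(drop_embedded("hello@world"))        # helloworld
```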
Lexical Variations¶
from dataknobs_xization.normalize import get_lexical_variations

variations = get_lexical_variations(
    "co-worker",
    include_self=True,
    expand_camelcase=False,
    do_hyphen_expansion=True,
    do_hyphen_split=True
)
print(variations)  # {"co-worker", "co worker", "coworker", "co", "worker"}
Advanced Features¶
Hyphen and Slash Expansion¶
from dataknobs_xization.normalize import get_hyphen_slash_expansions_fn

# Generate variations for hyphenated/slashed terms
variations = get_hyphen_slash_expansions_fn(
    "client-server",
    subs=["-", " ", ""],
    add_self=True,
    do_split=True
)
print(variations)  # {"client-server", "client server", "clientserver", "client", "server"}
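Conceptually, the expansion substitutes each candidate separator for the hyphen and optionally adds the individual parts. A minimal sketch of that logic (parameter names mirror the call above; this is not the library implementation):

```python
def hyphen_expansions(text, subs=("-", " ", ""), add_self=True, do_split=True):
    """Generate separator-substituted variations of a hyphenated term."""
    variations = set()
    if add_self:
        variations.add(text)
    parts = text.split("-")
    for sub in subs:
        # rejoin the parts with each candidate separator
        variations.add(sub.join(parts))
    if do_split:
        # also include each part on its own
        variations.update(parts)
    return variations

print(sorted(hyphen_expansions("client-server")))
# ['client', 'client server', 'client-server', 'clientserver', 'server']
```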
Number to English Conversion¶
from dataknobs_xization.normalize import int_to_en, year_variations_fn
# Convert integers to English
print(int_to_en(42)) # "forty two"
print(int_to_en(1999)) # "one thousand nine hundred and ninety nine"
# Generate year variations
years = year_variations_fn(1999)
print(years) # {"1999", "one thousand nine hundred and ninety nine", "nineteen ninety nine", ...}
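The spoken-year form ("nineteen ninety nine") comes from reading the year as two two-digit numbers. A minimal sketch of that step follows; it is not the library's `int_to_en`, and it ignores "oh" years like 2005:

```python
def two_digit_to_en(n):
    """English words for 0-99."""
    ones = ["zero", "one", "two", "three", "four", "five", "six", "seven",
            "eight", "nine", "ten", "eleven", "twelve", "thirteen",
            "fourteen", "fifteen", "sixteen", "seventeen", "eighteen",
            "nineteen"]
    tens = ["", "", "twenty", "thirty", "forty", "fifty", "sixty",
            "seventy", "eighty", "ninety"]
    if n < 20:
        return ones[n]
    return tens[n // 10] + ("" if n % 10 == 0 else " " + ones[n % 10])

def spoken_year(year):
    # read the year as two two-digit halves: 1999 -> 19 / 99
    high, low = divmod(year, 100)
    if low == 0:
        return two_digit_to_en(high) + " hundred"
    return two_digit_to_en(high) + " " + two_digit_to_en(low)

print(spoken_year(1999))  # nineteen ninety nine
print(spoken_year(2024))  # twenty twenty four
```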
Smart Quote Handling¶
from dataknobs_xization.normalize import replace_smart_quotes_fn

text = "\u201cHello World\u201d with \u2018smart quotes\u2019"
cleaned = replace_smart_quotes_fn(text)
print(cleaned)  # "\"Hello World\" with 'smart quotes'"
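The replacement itself is a straightforward character mapping; a self-contained equivalent using `str.translate` (not the library code):

```python
# Map curly quote code points to their ASCII equivalents
SMART_QUOTES = str.maketrans({
    "\u201c": '"',  # left double quote
    "\u201d": '"',  # right double quote
    "\u2018": "'",  # left single quote
    "\u2019": "'",  # right single quote
})

text = "\u201cHello World\u201d with \u2018smart quotes\u2019"
print(text.translate(SMART_QUOTES))  # "Hello World" with 'smart quotes'
```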
Integration Examples¶
With Document Structures¶
from dataknobs_structures import Text, TextMetaData
from dataknobs_xization import TextFeatures, normalize

# Create document
metadata = TextMetaData(text_id="doc_001", text_label="article")
doc = Text("parseXMLHttpRequest", metadata)

# Normalize content
normalized_text = normalize.basic_normalization_fn(
    doc.text,
    expand_camelcase=True,
    lowercase=True
)

# Update document
doc._text = normalized_text  # Note: direct access for example

# Extract features
features = TextFeatures(doc, split_camelcase=True)
tokens = features.get_tokens()
With JSON Processing¶
from dataknobs_utils import json_utils
from dataknobs_xization import normalize

# Process JSON text fields
def normalize_json_text(item, path):
    if isinstance(item, str) and len(item) > 0:
        normalized = normalize.basic_normalization_fn(item)
        print(f"Path: {path}, Normalized: {normalized}")

# Apply to JSON stream
json_utils.stream_json_data("data.json", normalize_json_text)
Tokenization Patterns¶
CamelCase Tokenization¶
from dataknobs_xization import TextFeatures
# Split on camelCase boundaries
text = "XMLHttpRequest"
features = TextFeatures(text, split_camelcase=True)
tokens = features.get_tokens()
print([t.token_text for t in tokens]) # ["XML", "Http", "Request"]
Symbol-Aware Tokenization¶
# Handle embedded symbols differently
text = "client-server@domain.com"
features = TextFeatures(text, split_camelcase=False)
tokens = features.get_tokens()
# Tokens respect symbol boundaries
print([t.token_text for t in tokens]) # ["client", "server", "domain", "com"]
Custom Normalization¶
from dataknobs_xization import TextFeatures, normalize

def custom_normalize(text):
    """Custom normalization function"""
    # Apply multiple normalization steps
    text = normalize.expand_camelcase_fn(text)
    text = normalize.basic_normalization_fn(
        text,
        lowercase=True,
        drop_embedded_symbols=True,
        squash_whitespace=True
    )
    return text

# Use with tokenizer
features = TextFeatures("parseXML-Data")
tokens = features.get_tokens(normalize_fn=custom_normalize)
Error Handling¶
from dataknobs_xization import TextFeatures, normalize

try:
    # Handle empty or None text
    text = ""
    if not text:
        text = "default text"
    features = TextFeatures(text)
    tokens = features.get_tokens()
except Exception as e:
    print(f"Tokenization error: {e}")

try:
    # Handle normalization errors
    result = normalize.basic_normalization_fn(None)
except (TypeError, AttributeError) as e:
    print(f"Normalization error: {e}")
    result = ""
Performance Tips¶
- Reuse TextFeatures: Create once, tokenize multiple times with different normalization
- Batch Processing: Process similar texts together
- Selective Features: Only enable needed character features
- Normalization Caching: Cache normalization results for repeated text
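The last tip can be implemented with `functools.lru_cache` when the normalization call is pure (same input, same output). The sketch below uses a stand-in normalizer; in application code, wrap whichever `normalize` call you actually use:

```python
from functools import lru_cache

@lru_cache(maxsize=4096)
def cached_normalize(text):
    # Stand-in for normalize.basic_normalization_fn; swap in the real
    # call in application code. Arguments must be hashable to cache.
    return " ".join(text.lower().split())

for raw in ["Hello  World", "Hello  World", "Other"]:
    cached_normalize(raw)

info = cached_normalize.cache_info()
print(info.hits, info.misses)  # 1 2
```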
Configuration Options¶
Character Features¶
features = TextFeatures(
    text,
    split_camelcase=True,  # Enable camelCase splitting
    mark_alpha=True,       # Mark alphabetic characters
    mark_digit=True,       # Mark numeric characters
    mark_upper=True,       # Mark uppercase characters
    mark_lower=True,       # Mark lowercase characters
    emoji_data=None        # Optional emoji processing
)
Normalization Options¶
normalized = normalize.basic_normalization_fn(
    text,
    lowercase=True,                   # Convert to lowercase
    expand_camelcase=True,            # Expand camelCase
    simplify_quote_chars=True,        # Replace smart quotes
    drop_non_embedded_symbols=False,  # Remove boundary symbols
    spacify_embedded_symbols=False,   # Replace embedded symbols with spaces
    drop_embedded_symbols=False,      # Remove embedded symbols
    squash_whitespace=False,          # Collapse whitespace
    do_all=False                      # Apply all options
)
API Reference¶
For complete API documentation, see the Xization API Reference.
Module Documentation¶
- HTML Conversion - Convert HTML documents to markdown
- Markdown Chunking - Parse and chunk markdown for RAG
- JSON Chunking - Chunk JSON and JSONL documents
- Content Transformation - Convert JSON, YAML, CSV, HTML to markdown
- Ingestion Module - Batch directory processing
- Tokenization - Character-based tokenization
- Normalization - Text normalization functions
- Masking - Character masking and features