Xization Package API Reference¶
Complete API reference for the dataknobs_xization package - text normalization and tokenization tools.
📖 Also see: Auto-generated API Reference - Complete documentation from source code docstrings
This page provides curated examples and usage patterns. The auto-generated reference provides exhaustive technical documentation with all methods, parameters, and type annotations.
Package Overview¶
```python
from dataknobs_xization import (
    annotations,
    authorities,
    lexicon,
    masking_tokenizer,
    normalize
)

# Import markdown chunking utilities
from dataknobs_xization import (
    parse_markdown,
    chunk_markdown_tree,
    stream_markdown_file,
    HeadingInclusion,
    ChunkFormat
)

# Import key classes
from dataknobs_xization.masking_tokenizer import CharacterFeatures, TextFeatures
from dataknobs_xization.markdown.md_parser import MarkdownParser, MarkdownNode
from dataknobs_xization.markdown.md_chunker import MarkdownChunker, Chunk, ChunkMetadata
from dataknobs_xization.markdown.md_streaming import StreamingMarkdownProcessor
```
Module Index¶
Core Modules¶
- markdown - Markdown parsing and chunking for RAG
- normalize - Text normalization and standardization
- masking_tokenizer - Character-level features and masking
- tokenization - Advanced tokenization capabilities
Supporting Modules¶
- annotations - Text annotation and markup tools
- authorities - Authority control and standardization
- lexicon - Lexical analysis and vocabulary management
Quick Reference¶
Markdown Chunking¶
```python
from dataknobs_xization import (
    parse_markdown,
    chunk_markdown_tree,
    stream_markdown_file,
    stream_markdown_string,
    HeadingInclusion,
    ChunkFormat
)

# Parse markdown into a tree
tree = parse_markdown("# Title\nBody text.")

# Generate chunks
chunks = chunk_markdown_tree(
    tree,
    max_chunk_size=500,
    heading_inclusion=HeadingInclusion.BOTH,
    chunk_format=ChunkFormat.MARKDOWN
)

# Stream large files
for chunk in stream_markdown_file("large_doc.md", max_chunk_size=1000):
    print(chunk.text, chunk.metadata.to_dict())

# Access chunk data
for chunk in chunks:
    print(f"Text: {chunk.text}")
    print(f"Headings: {chunk.metadata.get_heading_path()}")
    print(f"Size: {chunk.metadata.chunk_size}")
    print(f"Index: {chunk.metadata.chunk_index}")
```
Text Normalization¶
```python
from dataknobs_xization import normalize

# Basic normalization
normalized = normalize.basic_normalization_fn("Hello, WORLD!")

# CamelCase expansion
expanded = normalize.expand_camelcase_fn("firstName")

# Generate lexical variations
variations = normalize.get_lexical_variations(
    "multi-platform/cross-browser"
)

# Symbol handling
cleaned = normalize.drop_non_embedded_symbols_fn("!Hello world?")
embedded = normalize.drop_embedded_symbols_fn("user@domain.com", " ")

# Expand patterns
ampersand = normalize.expand_ampersand_fn("Research & Development")
hyphen_vars = normalize.get_hyphen_slash_expansions_fn("data-science")
```
Character Features and Masking¶
```python
from dataknobs_xization.masking_tokenizer import CharacterFeatures, TextFeatures
from dataknobs_structures import document as dk_doc

# Character-level analysis: subclass CharacterFeatures and implement cdf
class MyCharFeatures(CharacterFeatures):
    @property
    def cdf(self):
        # Implementation for character dataframe
        pass

features = MyCharFeatures("Sample text")
char_df = features.cdf

# Text-level features
text_features = TextFeatures("Analysis text")
```
Tokenization¶
```python
from dataknobs_xization import tokenization

# Different tokenization levels
chars = tokenization.tokenize_characters("Hello world!")
words = tokenization.tokenize_words("Hello, world!", lowercase=True)
sentences = tokenization.tokenize_sentences("Hello world. How are you?")

# Feature extraction
char_features = tokenization.extract_character_features("Text")
token_features = tokenization.extract_token_features(["word1", "word2"])

# N-gram generation
bigrams = tokenization.generate_ngrams(["a", "b", "c", "d"], 2)
trigrams = tokenization.generate_ngrams(["a", "b", "c", "d"], 3)
```
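`generate_ngrams` presumably follows the standard sliding-window construction; a minimal stdlib sketch of the idea (the package's actual return type may differ):

```python
def generate_ngrams(tokens, n):
    """Return the list of n-grams (length-n windows) over tokens."""
    # Slide a window of width n across the token list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
```

For example, `generate_ngrams(["a", "b", "c", "d"], 2)` yields the three bigrams `("a", "b")`, `("b", "c")`, `("c", "d")`.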
Detailed Module APIs¶
markdown Module¶
Parser Functions:
- parse_markdown(source, max_line_length=None, preserve_empty_lines=False) -> Tree
- Parse markdown content into tree structure
- Returns Tree with MarkdownNode data
Chunking Functions:
- chunk_markdown_tree(tree, max_chunk_size=1000, heading_inclusion=HeadingInclusion.BOTH, chunk_format=ChunkFormat.MARKDOWN, combine_under_heading=True) -> List[Chunk]
- Generate chunks from markdown tree
- Returns list of Chunk objects with text and metadata
Streaming Functions:
- stream_markdown_file(file_path, max_chunk_size=1000, heading_inclusion=HeadingInclusion.BOTH, chunk_format=ChunkFormat.MARKDOWN) -> Iterator[Chunk]
- Stream chunks from file
- Yields Chunk objects incrementally
- stream_markdown_string(content, ...) -> Iterator[Chunk]
- Stream chunks from string content
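The streaming functions avoid loading the whole document into memory. Their internals are not shown here, but the general pattern is a buffered generator that flushes at section boundaries; an illustrative stdlib sketch (not the package's implementation):

```python
def stream_sections(lines, max_chunk_size=1000):
    """Yield buffered text sections, splitting at headings or the size limit."""
    buf, size = [], 0
    for line in lines:
        # Start a new section at a heading, or when the buffer would overflow
        if buf and (line.startswith("#") or size + len(line) > max_chunk_size):
            yield "".join(buf)
            buf, size = [], 0
        buf.append(line)
        size += len(line)
    if buf:
        yield "".join(buf)
```

Because the input is consumed line by line, this works equally well over an open file handle or an in-memory string split into lines.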
Enums:
- HeadingInclusion - Control heading inclusion
- BOTH - Include in text and metadata
- IN_TEXT - Include only in text
- IN_METADATA - Include only in metadata
- NONE - Exclude headings
- ChunkFormat - Control output format
- MARKDOWN - Markdown formatted text
- PLAIN - Plain text without formatting
- DICT - Dictionary representation
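The effect of the HeadingInclusion options can be sketched with a plain function (string stand-ins for the enum members; illustrative only, not the library's code):

```python
def render_chunk(body, headings, inclusion):
    """Apply a HeadingInclusion-style policy to a chunk body."""
    # "BOTH" and "IN_TEXT" prepend the heading path to the chunk text
    text = "\n".join(headings + [body]) if inclusion in ("BOTH", "IN_TEXT") else body
    # "BOTH" and "IN_METADATA" record the headings in the metadata
    metadata = {"headings": headings} if inclusion in ("BOTH", "IN_METADATA") else {}
    return text, metadata
```

With `"BOTH"`, a chunk carries its heading context in both the text (useful for embedding) and the metadata (useful for filtering); `"NONE"` drops it entirely.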
Classes:
MarkdownParser
```python
class MarkdownParser:
    def __init__(self, max_line_length=None, preserve_empty_lines=False)
    def parse(self, source) -> Tree
```
MarkdownNode
```python
class MarkdownNode:
    text: str         # Node text content
    level: int        # Heading level (1-6) or 0 for body
    node_type: str    # 'heading', 'body', 'code', 'table', 'list', etc.
    line_number: int  # Source line number
    metadata: dict    # Additional metadata

    def is_heading(self) -> bool
    def is_body(self) -> bool
    def is_code(self) -> bool
```
Chunk
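The Chunk class is not expanded above; judging from its use in the quick reference (`chunk.text`, `chunk.metadata`), its shape is presumably:

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class Chunk:
    text: str      # Chunk text content
    metadata: Any  # ChunkMetadata instance (heading path, chunk index, size)
```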
ChunkMetadata
```python
class ChunkMetadata:
    headings: List[str]        # Heading texts from root to chunk
    heading_levels: List[int]  # Corresponding heading levels
    line_number: int           # Starting line number
    chunk_index: int           # Sequential chunk index
    chunk_size: int            # Size in characters
    custom: dict               # Custom metadata (node_type, language, etc.)

    def get_heading_path(self, separator=' / ') -> str
    def to_dict(self) -> dict
```
MarkdownChunker
```python
class MarkdownChunker:
    def __init__(
        self,
        max_chunk_size=1000,
        heading_inclusion=HeadingInclusion.BOTH,
        chunk_format=ChunkFormat.MARKDOWN,
        combine_under_heading=True
    )
    def chunk(self, tree: Tree) -> Iterator[Chunk]
```
StreamingMarkdownProcessor
```python
class StreamingMarkdownProcessor:
    def __init__(self, max_chunk_size=1000)
    def process_file(self, file_path: str) -> Iterator[Chunk]
    def process_string(self, content: str) -> Iterator[Chunk]
```
AdaptiveStreamingProcessor
```python
class AdaptiveStreamingProcessor(StreamingMarkdownProcessor):
    def __init__(
        self,
        max_chunk_size=1000,
        memory_limit_nodes=10000,
        adaptive_threshold=0.8
    )
```
For complete markdown chunking documentation, see Markdown Chunking Guide.
normalize Module¶
Regular Expression Patterns:
- SQUASH_WS_RE - Collapse whitespace
- ALL_SYMBOLS_RE - Match all symbols
- CAMELCASE_LU_RE - CamelCase lower-upper transitions
- CAMELCASE_UL_RE - CamelCase upper-lower transitions
- NON_EMBEDDED_WORD_SYMS_RE - Non-embedded symbols
- EMBEDDED_SYMS_RE - Embedded symbols
- HYPHEN_SLASH_RE - Hyphen/slash patterns
- HYPHEN_ONLY_RE - Hyphen-only patterns
- SLASH_ONLY_RE - Slash-only patterns
- PARENTHETICAL_RE - Parenthetical expressions
- AMPERSAND_RE - Ampersand patterns
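The two CamelCase patterns correspond to lower-to-upper and acronym-boundary transitions. Illustrative stdlib equivalents (not the package's exact regexes):

```python
import re

# Split before an uppercase letter that follows a lowercase letter or digit
LU = re.compile(r'(?<=[a-z0-9])(?=[A-Z])')
# Split inside an acronym, before its final capital when a lowercase follows
UL = re.compile(r'(?<=[A-Z])(?=[A-Z][a-z])')

def expand_camelcase(text: str) -> str:
    """Insert spaces at CamelCase boundaries."""
    return UL.sub(' ', LU.sub(' ', text))
```

This reproduces the behavior shown in the tests below: `expand_camelcase("firstName")` gives `"first Name"` and `expand_camelcase("XMLParser")` gives `"XML Parser"`.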
Core Functions:
- expand_camelcase_fn(text: str) -> str
- drop_non_embedded_symbols_fn(text: str, repl: str = "") -> str
- drop_embedded_symbols_fn(text: str, repl: str = "") -> str
- get_hyphen_slash_expansions_fn(text: str, subs: List[str] = ("-", " ", ""), add_self: bool = True, do_split: bool = True, min_split_token_len: int = 2, hyphen_slash_re=HYPHEN_SLASH_RE) -> Set[str]
- drop_parentheticals_fn(text: str) -> str
- expand_ampersand_fn(text: str) -> str
- get_lexical_variations(text: str, **kwargs) -> Set[str]
- basic_normalization_fn(text: str) -> str
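For intuition, the hyphen handling in `get_hyphen_slash_expansions_fn` amounts to substituting each candidate separator; a simplified stdlib sketch (hyphens only, ignoring the `do_split` and token-length options):

```python
def hyphen_variations(text, subs=("-", " ", "")):
    """Return the text with '-' replaced by each substitute separator."""
    return {text.replace("-", s) for s in subs}
```

So `"multi-platform"` expands to the set containing `"multi-platform"`, `"multi platform"`, and `"multiplatform"`.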
masking_tokenizer Module¶
Classes:
```python
class CharacterFeatures(ABC):
    def __init__(self, doctext: Union[dk_doc.Text, str], roll_padding: int = 0)

    @property
    @abstractmethod
    def cdf(self) -> pd.DataFrame:
        """Character dataframe with each padded text character as a row."""

    # Properties
    @property
    def doctext(self) -> dk_doc.Text
    @property
    def text_col(self) -> str
    @property
    def text(self) -> str
    @property
    def text_id(self) -> Any

class TextFeatures:
    def __init__(self, doctext: Union[dk_doc.Text, str])

    # Methods for text-level feature extraction
    def extract_features(self) -> Dict[str, Any]
    def analyze_patterns(self) -> Dict[str, Any]
```
annotations Module¶
Functions:
- Text annotation utilities
- Markup processing
- Annotation validation
- Format conversion
authorities Module¶
Functions:
- Authority control for names and terms
- Standardization utilities
- Controlled vocabulary management
- Cross-reference resolution
lexicon Module¶
Functions:
- Vocabulary analysis
- Lexical statistics
- Term frequency analysis
- Lexicon building utilities
Usage Patterns¶
Complete Text Processing Pipeline¶
```python
from typing import List

import pandas as pd

from dataknobs_xization import normalize, masking_tokenizer
from dataknobs_structures import document as dk_doc


class TextProcessingPipeline:
    """Complete text processing with normalization, tokenization, and masking."""

    def __init__(self, config: dict):
        self.config = config
        self.normalize_config = config.get('normalize', {})
        self.mask_config = config.get('masking', {})

    def process_text(self, text: str) -> dict:
        """Process text through the complete pipeline."""
        results = {'original': text}

        # Step 1: Normalization
        normalized = self._normalize_text(text)
        results['normalized'] = normalized

        # Step 2: Generate variations
        variations = normalize.get_lexical_variations(
            normalized, **self.normalize_config
        )
        results['variations'] = list(variations)

        # Step 3: Character-level analysis
        results['character_analysis'] = self._analyze_characters(normalized)

        # Step 4: Tokenization
        results['tokens'] = self._tokenize_text(normalized)

        return results

    def _normalize_text(self, text: str) -> str:
        """Apply the normalization pipeline."""
        # Expand camelCase
        text = normalize.expand_camelcase_fn(text)

        # Expand ampersands
        text = normalize.expand_ampersand_fn(text)

        # Drop parentheticals if configured
        if self.normalize_config.get('drop_parentheticals', True):
            text = normalize.drop_parentheticals_fn(text)

        # Handle symbols
        if self.normalize_config.get('drop_non_embedded_symbols', True):
            text = normalize.drop_non_embedded_symbols_fn(text)

        # Basic normalization
        return normalize.basic_normalization_fn(text)

    def _analyze_characters(self, text: str) -> dict:
        """Analyze character-level features."""

        # Concrete CharacterFeatures implementation for this pipeline
        class PipelineCharFeatures(masking_tokenizer.CharacterFeatures):
            @property
            def cdf(self):
                chars = list(self.text)
                return pd.DataFrame({
                    self.text_col: chars,
                    'position': range(len(chars)),
                    'is_alpha': [c.isalpha() for c in chars],
                    'is_digit': [c.isdigit() for c in chars],
                    'is_space': [c.isspace() for c in chars],
                    'is_punct': [not c.isalnum() and not c.isspace() for c in chars]
                })

        cdf = PipelineCharFeatures(text).cdf
        return {
            'total_chars': len(cdf),
            'alpha_chars': int(cdf['is_alpha'].sum()),
            'digit_chars': int(cdf['is_digit'].sum()),
            'space_chars': int(cdf['is_space'].sum()),
            'punct_chars': int(cdf['is_punct'].sum()),
            'alpha_ratio': float(cdf['is_alpha'].mean()),
            'digit_ratio': float(cdf['is_digit'].mean())
        }

    def _tokenize_text(self, text: str) -> dict:
        """Tokenize text at multiple levels."""
        from dataknobs_xization import tokenization
        return {
            'characters': tokenization.tokenize_characters(text),
            'words': tokenization.tokenize_words(text, lowercase=True),
            'sentences': tokenization.tokenize_sentences(text)
        }

    def process_documents(self, documents: List[dk_doc.Document]) -> List[dict]:
        """Process multiple documents."""
        results = []
        for doc in documents:
            doc_result = self.process_text(doc.text)
            doc_result['document_id'] = getattr(doc, 'text_id', None)
            doc_result['metadata'] = getattr(doc, 'metadata', {})
            results.append(doc_result)
        return results


# Usage
config = {
    'normalize': {
        'drop_parentheticals': True,
        'drop_non_embedded_symbols': True,
        'expand_camelcase': True,
        'expand_ampersands': True,
        'add_eng_plurals': True
    },
    'masking': {
        'mask_probability': 0.15,
        'preserve_structure': True
    }
}
pipeline = TextProcessingPipeline(config)

# Process a single text
text = "getUserName() & validateInput (required)"
result = pipeline.process_text(text)
print(f"Original: {result['original']}")
print(f"Normalized: {result['normalized']}")
print(f"Variations: {len(result['variations'])}")
print(f"Character analysis: {result['character_analysis']}")

# Process documents
documents = [
    dk_doc.Document("JavaScript & Node.js development", text_id="doc1"),
    dk_doc.Document("Python (programming language) tutorial", text_id="doc2")
]
for doc_result in pipeline.process_documents(documents):
    print(f"Doc {doc_result['document_id']}: {doc_result['normalized']}")
```
Privacy-Preserving Text Analytics¶
```python
import json
from typing import List

import numpy as np
import pandas as pd

from dataknobs_xization import normalize, masking_tokenizer


class PrivacyPreservingAnalytics:
    """Analytics with built-in privacy preservation."""

    def __init__(self, privacy_config: dict):
        self.mask_probability = privacy_config.get('mask_probability', 0.15)
        self.preserve_patterns = privacy_config.get('preserve_patterns', [])
        self.differential_privacy = privacy_config.get('differential_privacy', {})

    def analyze_corpus(self, texts: List[str]) -> dict:
        """Analyze a text corpus with privacy preservation."""
        # Step 1: Normalize all texts
        normalized_texts = []
        for text in texts:
            normalized = normalize.basic_normalization_fn(text)
            normalized = normalize.expand_camelcase_fn(normalized)
            normalized_texts.append(normalized)

        # Step 2: Apply privacy-preserving masking
        masked_analytics = self._masked_analysis(normalized_texts)

        # Step 3: Generate aggregate statistics
        aggregate_stats = self._compute_aggregates(normalized_texts)

        return {
            'corpus_size': len(texts),
            'privacy_parameters': {
                'mask_probability': self.mask_probability,
                'differential_privacy': self.differential_privacy
            },
            'masked_analytics': masked_analytics,
            'aggregate_statistics': aggregate_stats
        }

    def _masked_analysis(self, texts: List[str]) -> dict:
        """Perform analysis on masked text data."""

        # Character-level masking for each text
        class AnalyticsCharFeatures(masking_tokenizer.CharacterFeatures):
            def __init__(self, doctext, mask_prob):
                super().__init__(doctext)
                self.mask_prob = mask_prob

            @property
            def cdf(self):
                chars = list(self.text)
                rng = np.random.default_rng(42)  # Reproducible for testing
                return pd.DataFrame({
                    self.text_col: chars,
                    'position': range(len(chars)),
                    'is_alpha': [c.isalpha() for c in chars],
                    'is_digit': [c.isdigit() for c in chars],
                    'is_masked': rng.random(len(chars)) < self.mask_prob
                })

        total_chars = 0
        total_masked = 0
        feature_stats = {'alpha': 0, 'digit': 0, 'other': 0}

        for text in texts:
            cdf = AnalyticsCharFeatures(text, self.mask_probability).cdf
            n_alpha = int(cdf['is_alpha'].sum())
            n_digit = int(cdf['is_digit'].sum())
            total_chars += len(cdf)
            total_masked += int(cdf['is_masked'].sum())
            feature_stats['alpha'] += n_alpha
            feature_stats['digit'] += n_digit
            feature_stats['other'] += len(cdf) - n_alpha - n_digit

        return {
            'total_characters': total_chars,
            'total_masked': total_masked,
            'mask_ratio': total_masked / total_chars if total_chars > 0 else 0,
            'character_distribution': feature_stats
        }

    def _compute_aggregates(self, texts: List[str]) -> dict:
        """Compute aggregate statistics, adding Laplace noise for differential privacy."""
        # Basic statistics
        word_counts = [len(text.split()) for text in texts]
        char_counts = [len(text) for text in texts]

        # Add differential privacy noise if configured
        epsilon = self.differential_privacy.get('epsilon', 1.0)
        if epsilon > 0:
            scale = 1.0 / epsilon
            word_counts = [
                max(0, wc + noise)
                for wc, noise in zip(word_counts, np.random.laplace(0, scale, len(word_counts)))
            ]
            char_counts = [
                max(0, cc + noise)
                for cc, noise in zip(char_counts, np.random.laplace(0, scale, len(char_counts)))
            ]

        return {
            'avg_word_count': float(np.mean(word_counts)),
            'avg_char_count': float(np.mean(char_counts)),
            'word_count_std': float(np.std(word_counts)),
            'char_count_std': float(np.std(char_counts)),
            'privacy_noise_added': epsilon > 0
        }


# Usage
privacy_config = {
    'mask_probability': 0.2,
    'differential_privacy': {'epsilon': 0.5},
    'preserve_patterns': ['email', 'phone']
}
analytics = PrivacyPreservingAnalytics(privacy_config)
texts = [
    "This document contains sensitive information about users.",
    "Financial data and personal details are stored here.",
    "Public information that can be analyzed without privacy concerns."
]
results = analytics.analyze_corpus(texts)
print(json.dumps(results, indent=2))
```
Integration with Other Dataknobs Packages¶
```python
from typing import List

import pandas as pd

from dataknobs_xization import normalize, masking_tokenizer
from dataknobs_structures import document as dk_doc
from dataknobs_utils import elasticsearch_utils


def build_normalized_search_index(
    documents: List[dk_doc.Document],
    index_name: str
) -> elasticsearch_utils.ElasticsearchIndex:
    """Build a search index with normalized and varied text."""
    # Configure Elasticsearch
    table_settings = [
        elasticsearch_utils.TableSettings(
            index_name,
            {"number_of_shards": 1, "number_of_replicas": 0},
            {
                "properties": {
                    "original_text": {"type": "text"},
                    "normalized_text": {"type": "text", "analyzer": "english"},
                    "variations": {"type": "text"},
                    "character_features": {"type": "object"},
                    "document_id": {"type": "keyword"}
                }
            }
        )
    ]

    # Create index
    es_index = elasticsearch_utils.ElasticsearchIndex(None, table_settings)

    # Character features implementation used for indexing
    class IndexCharFeatures(masking_tokenizer.CharacterFeatures):
        @property
        def cdf(self):
            chars = list(self.text)
            return pd.DataFrame({
                self.text_col: chars,
                'is_alpha': [c.isalpha() for c in chars],
                'is_digit': [c.isdigit() for c in chars]
            })

    # Process documents into batch records
    def document_generator():
        for doc in documents:
            # Normalize text
            normalized = normalize.basic_normalization_fn(doc.text)
            normalized = normalize.expand_camelcase_fn(normalized)
            normalized = normalize.expand_ampersand_fn(normalized)

            # Generate variations
            variations = normalize.get_lexical_variations(normalized)

            # Extract character features
            cdf = IndexCharFeatures(normalized).cdf

            yield {
                'original_text': doc.text,
                'normalized_text': normalized,
                'variations': ' '.join(variations),
                'character_features': {
                    'total_chars': len(cdf),
                    'alpha_ratio': float(cdf['is_alpha'].mean()),
                    'digit_ratio': float(cdf['is_digit'].mean())
                },
                'document_id': getattr(doc, 'text_id', str(hash(doc.text)))
            }

    # Create the batch file and load
    with open('search_batch.jsonl', 'w') as f:
        elasticsearch_utils.add_batch_data(
            f, document_generator(), index_name
        )

    return es_index


# Usage
documents = [
    dk_doc.Document("JavaScript & Node.js Development", text_id="tech1"),
    dk_doc.Document("Machine Learning (ML) Algorithms", text_id="ai1"),
    dk_doc.Document("Data Science with Python/R", text_id="data1")
]
search_index = build_normalized_search_index(documents, "normalized_docs")
print("Search index created with normalized and varied text")
```
Testing Utilities¶
```python
import pandas as pd

from dataknobs_xization import normalize, masking_tokenizer


class TestXizationFunctions:
    """Tests for the xization package."""

    def test_normalization(self):
        """Test normalization functions."""
        # camelCase expansion
        assert normalize.expand_camelcase_fn("firstName") == "first Name"
        assert normalize.expand_camelcase_fn("XMLParser") == "XML Parser"

        # Symbol handling
        assert normalize.drop_non_embedded_symbols_fn("!Hello world?") == "Hello world"
        assert normalize.drop_embedded_symbols_fn("user@domain.com") == "userdomaincom"

        # Ampersand expansion
        assert normalize.expand_ampersand_fn("A & B") == "A and B"

        # Basic normalization
        result = normalize.basic_normalization_fn("  HELLO, WORLD!  ")
        assert result.strip().lower() == "hello, world!"

    def test_variations(self):
        """Test lexical variation generation."""
        variations = normalize.get_lexical_variations("multi-platform")
        assert "multi platform" in variations
        assert "multiplatform" in variations
        assert "multi-platform" in variations

    def test_character_features(self):
        """Test character feature extraction."""

        class TestCharFeatures(masking_tokenizer.CharacterFeatures):
            @property
            def cdf(self):
                chars = list(self.text)
                return pd.DataFrame({
                    self.text_col: chars,
                    'is_alpha': [c.isalpha() for c in chars]
                })

        cdf = TestCharFeatures("Hello123").cdf
        assert len(cdf) == 8
        assert cdf['is_alpha'].sum() == 5  # "Hello"

    def test_integration(self):
        """Test integration between modules."""
        text = "getUserName() & validateInput"

        # Normalize
        normalized = normalize.expand_camelcase_fn(text)
        normalized = normalize.expand_ampersand_fn(normalized)
        normalized = normalize.basic_normalization_fn(normalized)

        # Extract features
        class IntegrationFeatures(masking_tokenizer.CharacterFeatures):
            @property
            def cdf(self):
                return pd.DataFrame({self.text_col: list(self.text)})

        assert len(IntegrationFeatures(normalized).cdf) > 0


# Run tests
if __name__ == "__main__":
    test_suite = TestXizationFunctions()
    test_suite.test_normalization()
    test_suite.test_variations()
    test_suite.test_character_features()
    test_suite.test_integration()
    print("All tests passed!")
```
Performance Considerations¶
- Normalization functions use pre-compiled regex patterns for efficiency
- Character feature extraction can be memory-intensive for large texts
- Lexical variation generation may produce many variations - use selectively
- Consider caching normalized results for frequently accessed texts
- Use appropriate batch sizes for bulk processing
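Caching normalized results is straightforward with functools.lru_cache, since the normalization functions are pure string transforms. A sketch using a stand-in for normalize.basic_normalization_fn (wrap the real function the same way):

```python
from functools import lru_cache

def basic_normalization_fn(text: str) -> str:
    # Stand-in for normalize.basic_normalization_fn: squash whitespace, lowercase
    return " ".join(text.split()).lower()

@lru_cache(maxsize=4096)
def cached_normalize(text: str) -> str:
    return basic_normalization_fn(text)
```

Repeated calls with the same input are then served from the cache; `cached_normalize.cache_info()` reports hit and miss counts so the `maxsize` can be tuned.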
Best Practices¶
- Pipeline Design: Design normalization pipelines appropriate for your domain
- Configuration: Use configuration objects to manage normalization parameters
- Testing: Test normalization results on representative data
- Caching: Cache expensive operations like variation generation
- Memory Management: Monitor memory usage with large-scale character analysis
- Integration: Coordinate with other dataknobs packages for seamless workflows
- Documentation: Document normalization decisions and their rationale
Version Information¶
- Package Version: 1.0.0
- Python Compatibility: 3.8+
- Dependencies: pandas, numpy, dataknobs-structures, dataknobs-utils
For detailed documentation of individual modules, see their respective documentation pages.