Normalization API Documentation¶
The normalize module provides text normalization functions for cleaning, standardizing, and preprocessing text data.
Overview¶
Text normalization includes:
- Whitespace handling and cleanup
- Case conversion and standardization
- Symbol and punctuation processing
- CamelCase expansion
- Hyphen and slash expansion
- Lexical variation generation
- Parenthetical expression handling
- Ampersand expansion
Regular Expressions¶
The module provides pre-compiled regular expressions for common text patterns:
SQUASH_WS_RE¶
Collapses consecutive whitespace to a single space.
ALL_SYMBOLS_RE¶
Identifies strings with any symbols (non-word, non-space characters).
CAMELCASE_LU_RE¶
Splits between consecutive lower and upper case characters.
CAMELCASE_UL_RE¶
Splits between consecutive upper case and upper-lower case characters.
Symbol Handling Patterns¶
# Non-embedded symbols (without word char on both sides)
NON_EMBEDDED_WORD_SYMS_RE = re.compile(r"((?<!\w)[^\w\s]+)|([^\w\s]+(?!\w))")
# Embedded symbols (with word chars on both sides)
EMBEDDED_SYMS_RE = re.compile(r"(?<=\w)[^\w\s]+(?=\w)")
Delimiter Patterns¶
# Hyphen and/or slash between word characters
HYPHEN_SLASH_RE = re.compile(r"(?<=\w)[\-\/ ](?=\w)")
# Hyphen only between word characters
HYPHEN_ONLY_RE = re.compile(r"(?<=\w)[\- ](?=\w)")
# Slash only between word characters
SLASH_ONLY_RE = re.compile(r"(?<=\w)\/(?=\w)")
Other Patterns¶
# Parenthetical expressions
PARENTHETICAL_RE = re.compile(r"\(.*\)")
# Ampersand with optional whitespace
AMPERSAND_RE = re.compile(r"\s*\&\s*")
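These compiled patterns can be used directly with the standard re methods when you need finer control than the helper functions provide; for instance, SQUASH_WS_RE is applied with sub() in the pipelines later on this page. The snippet below is a minimal sketch: calling search() on ALL_SYMBOLS_RE is an assumption based on its documented behavior, not a documented usage.
from dataknobs_xization import normalize
# Collapse runs of whitespace to single spaces and trim the ends
messy = "too   many\t\n spaces "
print(normalize.SQUASH_WS_RE.sub(" ", messy).strip())  # "too many spaces"
# Check whether a string contains any symbol (non-word, non-space) characters
print(bool(normalize.ALL_SYMBOLS_RE.search("user@domain.com")))  # True
print(bool(normalize.ALL_SYMBOLS_RE.search("plain text")))  # False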
Core Functions¶
expand_camelcase_fn()¶
Expands both "lU" (lower-then-upper) and "UUl" (upper-then-upper-lower) camelCase boundaries by inserting spaces.
Parameters:
- text (str): Input text with camelCase patterns
Returns: Text with expanded camelCase (spaces added)
Example:
from dataknobs_xization import normalize
# Expand camelCase
text1 = "firstName"
result1 = normalize.expand_camelcase_fn(text1)
print(result1) # "first Name"
text2 = "XMLParser"
result2 = normalize.expand_camelcase_fn(text2)
print(result2) # "XML Parser"
text3 = "iPhone"
result3 = normalize.expand_camelcase_fn(text3)
print(result3) # "i Phone"
text4 = "getUserID"
result4 = normalize.expand_camelcase_fn(text4)
print(result4) # "get User ID"
drop_non_embedded_symbols_fn()¶
Removes symbols that are not embedded within word characters.
Parameters:
- text (str): Input text
- repl (str, default=""): Replacement string for dropped symbols
Returns: Text with non-embedded symbols removed
Example:
# Remove leading/trailing punctuation
text = "!Hello world?"
result = normalize.drop_non_embedded_symbols_fn(text)
print(result) # "Hello world"
# Keep embedded symbols
text2 = "user@domain.com"
result2 = normalize.drop_non_embedded_symbols_fn(text2)
print(result2) # "user@domain.com" (@ and . are embedded)
# Custom replacement
text3 = "*important*"
result3 = normalize.drop_non_embedded_symbols_fn(text3, " ")
print(result3) # " important "
drop_embedded_symbols_fn()¶
Removes symbols that are embedded within word characters.
Parameters:
- text (str): Input text
- repl (str, default=""): Replacement string for dropped symbols
Returns: Text with embedded symbols removed
Example:
# Remove embedded punctuation
text = "user@domain.com"
result = normalize.drop_embedded_symbols_fn(text)
print(result) # "userdomaincom"
# With replacement
text2 = "first-name"
result2 = normalize.drop_embedded_symbols_fn(text2, " ")
print(result2) # "first name"
# Multiple embedded symbols
text3 = "a@b#c$d"
result3 = normalize.drop_embedded_symbols_fn(text3)
print(result3) # "abcd"
get_hyphen_slash_expansions_fn()¶
def get_hyphen_slash_expansions_fn(
text: str,
subs: List[str] = ("-", " ", ""),
add_self: bool = True,
do_split: bool = True,
min_split_token_len: int = 2,
hyphen_slash_re=HYPHEN_SLASH_RE,
) -> Set[str]
Generate variations of hyphenated or slash-separated text.
Parameters:
- text (str): Input text with potential hyphens/slashes
- subs (List[str]): Characters to substitute for delimiters
- add_self (bool): Include original text in results
- do_split (bool): Include individual tokens
- min_split_token_len (int): Minimum token length for splitting
- hyphen_slash_re (Pattern): Regex pattern for matching delimiters
Returns: Set of text variations
Example:
# Generate hyphen variations
text = "multi-word-phrase"
variations = normalize.get_hyphen_slash_expansions_fn(text)
print(variations)
# {'multi-word-phrase', 'multi word phrase', 'multiwordphrase', 'multi', 'word', 'phrase'}
# Custom substitutions
text2 = "data/science"
variations2 = normalize.get_hyphen_slash_expansions_fn(
text2, subs=[" ", "_", ""], do_split=False
)
print(variations2)
# {'data/science', 'data science', 'data_science', 'datascience'}
# Without original text
text3 = "machine-learning"
variations3 = normalize.get_hyphen_slash_expansions_fn(
text3, add_self=False, do_split=False
)
print(variations3)
# {'machine learning', 'machinelearning'}
drop_parentheticals_fn()¶
Removes parenthetical expressions from text.
Parameters:
- text (str): Input text
Returns: Text with parentheticals removed
Example:
# Remove parenthetical information
text = "Python (programming language) is popular"
result = normalize.drop_parentheticals_fn(text)
print(result) # "Python is popular"
# Multiple parentheticals
text2 = "AI (Artificial Intelligence) and ML (Machine Learning)"
result2 = normalize.drop_parentheticals_fn(text2)
print(result2) # "AI and ML "
expand_ampersand_fn()¶
Replaces ampersands with " and ".
Parameters:
- text (str): Input text
Returns: Text with ampersands expanded
Example:
# Expand ampersands
text = "Research & Development"
result = normalize.expand_ampersand_fn(text)
print(result) # "Research and Development"
# Multiple ampersands
text2 = "A&B&C"
result2 = normalize.expand_ampersand_fn(text2)
print(result2) # "A and B and C"
# Handles whitespace
text3 = "cats&dogs"
result3 = normalize.expand_ampersand_fn(text3)
print(result3) # "cats and dogs"
Advanced Functions¶
get_lexical_variations()¶
def get_lexical_variations(
text: str,
include_self: bool = True,
expand_camelcase: bool = True,
drop_non_embedded_symbols: bool = True,
drop_embedded_symbols: bool = True,
spacify_embedded_symbols: bool = False,
do_hyphen_expansion: bool = True,
hyphen_subs: List[str] = (" ", ""),
do_hyphen_split: bool = True,
min_hyphen_split_token_len=2,
do_slash_expansion: bool = True,
slash_subs: List[str] = (" ", " or "),
do_slash_split: bool = True,
min_slash_split_token_len: int = 1,
drop_parentheticals: bool = True,
expand_ampersands: bool = True,
add_eng_plurals: bool = True,
) -> Set[str]
Generate comprehensive lexical variations of input text using multiple normalization techniques.
Parameters: see the signature above. Each boolean flag enables or disables the corresponding normalization step; the subs lists and minimum-token-length values are passed through to the hyphen and slash expansions.
Returns: Set of text variations
Example:
# Generate comprehensive variations
text = "multi-platform/cross-browser (compatible)"
variations = normalize.get_lexical_variations(text)
print(f"Generated {len(variations)} variations:")
for var in sorted(variations):
print(f" {var}")
# Custom configuration
text2 = "JavaScript"
variations2 = normalize.get_lexical_variations(
text2,
expand_camelcase=True,
drop_non_embedded_symbols=False,
add_eng_plurals=True
)
print(variations2)
# {'JavaScript', 'Java Script', 'JavaScripts', 'Java Scripts'}
basic_normalization_fn()¶
Applies common normalization steps to text.
Standard Operations:
- Lowercase conversion
- Whitespace squashing
- Basic punctuation handling
- Trimming
Example:
# Basic normalization
text = " Hello, WORLD! \n\t How are you? "
result = normalize.basic_normalization_fn(text)
print(repr(result)) # 'hello, world! how are you?'
Usage Patterns¶
Text Cleaning Pipeline¶
from dataknobs_xization import normalize
from typing import Dict
def clean_text_pipeline(text: str) -> Dict[str, str]:
"""Comprehensive text cleaning pipeline."""
results = {"original": text}
# Step 1: Expand camelCase
step1 = normalize.expand_camelcase_fn(text)
results["camelcase_expanded"] = step1
# Step 2: Expand ampersands
step2 = normalize.expand_ampersand_fn(step1)
results["ampersands_expanded"] = step2
# Step 3: Drop parentheticals
step3 = normalize.drop_parentheticals_fn(step2)
results["parentheticals_dropped"] = step3
# Step 4: Handle embedded symbols
step4 = normalize.drop_embedded_symbols_fn(step3, " ")
results["embedded_symbols_replaced"] = step4
# Step 5: Basic normalization
step5 = normalize.basic_normalization_fn(step4)
results["final_normalized"] = step5
return results
# Example usage
text = "getUserName() & validateInput (required)"
results = clean_text_pipeline(text)
for step, result in results.items():
print(f"{step}: {result}")
Lexical Variation Generation¶
from dataknobs_xization import normalize
from typing import List
def generate_search_terms(query: str) -> List[str]:
"""Generate search term variations for better matching."""
# Get all variations
variations = normalize.get_lexical_variations(
query,
expand_camelcase=True,
do_hyphen_expansion=True,
do_slash_expansion=True,
add_eng_plurals=True
)
# Filter and rank by relevance
filtered = []
for var in variations:
# Skip very short or very long variations
if 2 <= len(var.split()) <= 10:
filtered.append(var)
# Sort by length and complexity (prefer simpler terms)
filtered.sort(key=lambda x: (len(x.split()), len(x)))
return filtered[:20] # Return top 20 variations
# Example
query = "machine-learning/deep-learning"
search_terms = generate_search_terms(query)
print("Generated search terms:")
for i, term in enumerate(search_terms, 1):
print(f"{i:2d}. {term}")
Domain-Specific Normalization¶
from dataknobs_xization import normalize
import re
class TechnicalTextNormalizer:
"""Specialized normalizer for technical text."""
def __init__(self):
self.tech_patterns = {
            # Non-capturing group so findall() returns the full version string
            'version_numbers': re.compile(r'v?\d+\.\d+(?:\.\d+)?'),
'file_extensions': re.compile(r'\.[a-zA-Z0-9]+$'),
'urls': re.compile(r'https?://[^\s]+'),
'code_snippets': re.compile(r'`[^`]+`')
}
def normalize_technical(self, text: str) -> str:
"""Normalize technical text while preserving important patterns."""
# Store special patterns
preserved = {}
placeholder_count = 0
for pattern_name, pattern in self.tech_patterns.items():
matches = pattern.findall(text)
for match in matches:
placeholder = f"__PRESERVE_{placeholder_count}__"
preserved[placeholder] = match
text = text.replace(match, placeholder, 1)
placeholder_count += 1
# Apply standard normalization
normalized = normalize.basic_normalization_fn(text)
# Expand technical camelCase
normalized = normalize.expand_camelcase_fn(normalized)
# Handle technical symbols differently
normalized = normalize.drop_non_embedded_symbols_fn(normalized, " ")
# Restore preserved patterns
for placeholder, original in preserved.items():
normalized = normalized.replace(placeholder, original)
# Final cleanup
normalized = normalize.SQUASH_WS_RE.sub(" ", normalized).strip()
return normalized
# Usage
normalizer = TechnicalTextNormalizer()
tech_text = "Check out myLibrary v2.1.3 at https://github.com/user/repo and run `npm install`"
result = normalizer.normalize_technical(tech_text)
print(f"Original: {tech_text}")
print(f"Normalized: {result}")
Batch Text Processing¶
from dataknobs_xization import normalize
from dataknobs_utils import file_utils
from concurrent.futures import ThreadPoolExecutor
import json
def normalize_document(doc_data: dict) -> dict:
"""Normalize a single document."""
text = doc_data.get('content', '')
# Generate variations
variations = normalize.get_lexical_variations(
text,
expand_camelcase=True,
do_hyphen_expansion=True,
drop_parentheticals=True
)
# Basic normalization
normalized = normalize.basic_normalization_fn(text)
return {
**doc_data,
'normalized_content': normalized,
'variations': list(variations),
'variation_count': len(variations)
}
def batch_normalize_documents(input_dir: str, output_dir: str):
"""Process multiple documents in parallel."""
documents = []
# Load all documents
for filepath in file_utils.filepath_generator(input_dir):
if filepath.endswith('.json'):
for line in file_utils.fileline_generator(filepath):
try:
doc = json.loads(line)
documents.append(doc)
except json.JSONDecodeError:
continue
# Process in parallel
with ThreadPoolExecutor(max_workers=4) as executor:
normalized_docs = list(executor.map(normalize_document, documents))
# Save results
output_lines = []
for doc in normalized_docs:
output_lines.append(json.dumps(doc))
file_utils.write_lines(f"{output_dir}/normalized_documents.jsonl", output_lines)
print(f"Processed {len(normalized_docs)} documents")
print(f"Average variations per document: {sum(d['variation_count'] for d in normalized_docs) / len(normalized_docs):.1f}")
# Usage
batch_normalize_documents('/input/docs', '/output/normalized')
Error Handling¶
from dataknobs_xization import normalize
def safe_normalize(text: str, method: str = "basic") -> str:
"""Safely normalize text with error handling."""
if not text or not isinstance(text, str):
return ""
try:
if method == "basic":
return normalize.basic_normalization_fn(text)
elif method == "camelcase":
return normalize.expand_camelcase_fn(text)
elif method == "comprehensive":
variations = normalize.get_lexical_variations(text)
return min(variations, key=len) # Return shortest variation
else:
return text
except Exception as e:
print(f"Normalization failed for '{text[:50]}...': {e}")
return text # Return original on error
# Safe usage
result = safe_normalize("someWeirdText&Symbols(here)", "comprehensive")
print(result)
Performance Considerations¶
- Regular expressions are pre-compiled for efficiency
- get_lexical_variations() can generate many variations - use it selectively
- Consider caching results for frequently processed text (see the sketch below)
- Use appropriate batch sizes for parallel processing
- Monitor memory usage with large variation sets
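For repeated inputs, a simple memoization wrapper can avoid recomputing variation sets. The sketch below is illustrative only: cached_variations is a hypothetical helper, not part of the module, and it returns a frozenset so the cached value cannot be mutated by callers.
from functools import lru_cache
from dataknobs_xization import normalize
@lru_cache(maxsize=10_000)
def cached_variations(text: str) -> frozenset:
    """Memoize lexical variation generation for frequently seen text."""
    # frozenset keeps the cached result immutable and hashable
    return frozenset(normalize.get_lexical_variations(text))
# Repeated calls with identical text return the cached object
first = cached_variations("machine-learning")
second = cached_variations("machine-learning")
print(first is second)  # True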
Integration Examples¶
With Elasticsearch¶
from dataknobs_xization import normalize
from dataknobs_utils import elasticsearch_utils
def create_searchable_document(title: str, content: str) -> dict:
"""Create document with normalized search fields."""
# Generate title variations for better matching
title_variations = normalize.get_lexical_variations(
title, do_hyphen_expansion=True, expand_camelcase=True
)
# Normalize content
normalized_content = normalize.basic_normalization_fn(content)
# Expand technical terms
expanded_content = normalize.expand_camelcase_fn(content)
expanded_content = normalize.expand_ampersand_fn(expanded_content)
return {
'title': title,
'content': content,
'title_variations': list(title_variations),
'normalized_content': normalized_content,
'expanded_content': expanded_content,
'searchable_title': ' '.join(title_variations),
'searchable_content': f"{normalized_content} {expanded_content}"
}
# Usage with Elasticsearch
doc = create_searchable_document(
"JavaScript & Node.js",
"Learn JavaScript (programming language) and Node.js development"
)
# Index with enhanced searchability
query = elasticsearch_utils.build_field_query_dict(
['searchable_title', 'searchable_content'],
'java script nodejs'
)
With File Processing¶
from dataknobs_xization import normalize
from dataknobs_utils import file_utils
def normalize_text_files(input_dir: str, output_dir: str):
"""Normalize all text files in directory."""
for filepath in file_utils.filepath_generator(input_dir):
if filepath.endswith('.txt'):
# Read file
lines = list(file_utils.fileline_generator(filepath))
# Normalize each line
normalized_lines = []
for line in lines:
normalized = normalize.basic_normalization_fn(line)
if normalized.strip(): # Skip empty lines
normalized_lines.append(normalized)
# Save normalized version
basename = file_utils.get_basename(filepath)
output_path = f"{output_dir}/normalized_{basename}"
file_utils.write_lines(output_path, normalized_lines)
# Process directory
normalize_text_files('/raw/text', '/processed/text')
The normalization module provides comprehensive text preprocessing capabilities that work seamlessly with other dataknobs components for complete text processing pipelines.