Normalization API Documentation¶
The normalize module provides text normalization functions for cleaning, standardizing, and preprocessing text data.
Overview¶
Text normalization includes:
- Whitespace handling and cleanup
- Case conversion and standardization
- Symbol and punctuation processing
- CamelCase expansion
- Hyphen and slash expansion
- Lexical variation generation
- Parenthetical expression handling
- Ampersand expansion
Regular Expressions¶
The module provides pre-compiled regular expressions for common text patterns:
SQUASH_WS_RE¶
Collapses consecutive whitespace to a single space.
ALL_SYMBOLS_RE¶
Identifies strings with any symbols (non-word, non-space characters).
CAMELCASE_LU_RE¶
Splits between consecutive lower and upper case characters.
CAMELCASE_UL_RE¶
Splits between consecutive upper case and upper-lower case characters.
Symbol Handling Patterns¶
# Non-embedded symbols (without word char on both sides)
NON_EMBEDDED_WORD_SYMS_RE = re.compile(r"((?<!\w)[^\w\s]+)|([^\w\s]+(?!\w))")
# Embedded symbols (with word chars on both sides)
EMBEDDED_SYMS_RE = re.compile(r"(?<=\w)[^\w\s]+(?=\w)")
Delimiter Patterns¶
# Hyphen and/or slash between word characters
HYPHEN_SLASH_RE = re.compile(r"(?<=\w)[\-\/ ](?=\w)")
# Hyphen only between word characters
HYPHEN_ONLY_RE = re.compile(r"(?<=\w)[\- ](?=\w)")
# Slash only between word characters
SLASH_ONLY_RE = re.compile(r"(?<=\w)\/(?=\w)")
Other Patterns¶
# Parenthetical expressions
PARENTHETICAL_RE = re.compile(r"\(.*\)")
# Ampersand with optional whitespace
AMPERSAND_RE = re.compile(r"\s*\&\s*")
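These compiled patterns can be used directly with the standard re methods when you need finer control than the helper functions provide; for instance, SQUASH_WS_RE is applied with sub() in the pipelines later on this page. The snippet below is a minimal sketch: calling search() on ALL_SYMBOLS_RE is an assumption based on its documented behavior, not a documented usage.
from dataknobs_xization import normalize
# Collapse runs of whitespace to single spaces and trim the ends
messy = "too   many\t\n spaces "
print(normalize.SQUASH_WS_RE.sub(" ", messy).strip())  # "too many spaces"
# Check whether a string contains any symbol (non-word, non-space) characters
print(bool(normalize.ALL_SYMBOLS_RE.search("user@domain.com")))  # True
print(bool(normalize.ALL_SYMBOLS_RE.search("plain text")))  # False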
Core Functions¶
expand_camelcase_fn()¶
Expands both "lU" (lower-then-upper) and "UUl" (upper-then-upper-lower) camelCase boundaries by inserting spaces.
Parameters:
- text (str): Input text with camelCase patterns
Returns: Text with expanded camelCase (spaces added)
Example:
from dataknobs_xization import normalize
# Expand camelCase
text1 = "firstName"
result1 = normalize.expand_camelcase_fn(text1)
print(result1) # "first Name"
text2 = "XMLParser"
result2 = normalize.expand_camelcase_fn(text2)
print(result2) # "XML Parser"
text3 = "iPhone"
result3 = normalize.expand_camelcase_fn(text3)
print(result3) # "i Phone"
text4 = "getUserID"
result4 = normalize.expand_camelcase_fn(text4)
print(result4) # "get User ID"
drop_non_embedded_symbols_fn()¶
Removes symbols that are not embedded within word characters.
Parameters:
- text (str): Input text
- repl (str, default=""): Replacement string for dropped symbols
Returns: Text with non-embedded symbols removed
Example:
# Remove leading/trailing punctuation
text = "!Hello world?"
result = normalize.drop_non_embedded_symbols_fn(text)
print(result) # "Hello world"
# Keep embedded symbols
text2 = "user@domain.com"
result2 = normalize.drop_non_embedded_symbols_fn(text2)
print(result2) # "user@domain.com" (@ and . are embedded)
# Custom replacement
text3 = "*important*"
result3 = normalize.drop_non_embedded_symbols_fn(text3, " ")
print(result3) # " important "
drop_embedded_symbols_fn()¶
Removes symbols that are embedded within word characters.
Parameters:
- text (str): Input text
- repl (str, default=""): Replacement string for dropped symbols
Returns: Text with embedded symbols removed
Example:
# Remove embedded punctuation
text = "user@domain.com"
result = normalize.drop_embedded_symbols_fn(text)
print(result) # "userdomaincom"
# With replacement
text2 = "first-name"
result2 = normalize.drop_embedded_symbols_fn(text2, " ")
print(result2) # "first name"
# Multiple embedded symbols
text3 = "a@b#c$d"
result3 = normalize.drop_embedded_symbols_fn(text3)
print(result3) # "abcd"
get_hyphen_slash_expansions_fn()¶
def get_hyphen_slash_expansions_fn(
text: str,
subs: List[str] = ("-", " ", ""),
add_self: bool = True,
do_split: bool = True,
min_split_token_len: int = 2,
hyphen_slash_re=HYPHEN_SLASH_RE,
) -> Set[str]
Generate variations of hyphenated or slash-separated text.
Parameters:
- text (str): Input text with potential hyphens/slashes
- subs (List[str]): Characters to substitute for delimiters
- add_self (bool): Include original text in results
- do_split (bool): Include individual tokens
- min_split_token_len (int): Minimum token length for splitting
- hyphen_slash_re (Pattern): Regex pattern for matching delimiters
Returns: Set of text variations
Example:
# Generate hyphen variations
text = "multi-word-phrase"
variations = normalize.get_hyphen_slash_expansions_fn(text)
print(variations)
# {'multi-word-phrase', 'multi word phrase', 'multiwordphrase', 'multi', 'word', 'phrase'}
# Custom substitutions
text2 = "data/science"
variations2 = normalize.get_hyphen_slash_expansions_fn(
text2, subs=[" ", "_", ""], do_split=False
)
print(variations2)
# {'data/science', 'data science', 'data_science', 'datascience'}
# Without original text
text3 = "machine-learning"
variations3 = normalize.get_hyphen_slash_expansions_fn(
text3, add_self=False, do_split=False
)
print(variations3)
# {'machine learning', 'machinelearning'}
drop_parentheticals_fn()¶
Removes parenthetical expressions from text.
Parameters:
- text (str): Input text
Returns: Text with parentheticals removed
Example:
# Remove parenthetical information
text = "Python (programming language) is popular"
result = normalize.drop_parentheticals_fn(text)
print(result) # "Python is popular"
# Multiple parentheticals
text2 = "AI (Artificial Intelligence) and ML (Machine Learning)"
result2 = normalize.drop_parentheticals_fn(text2)
print(result2) # "AI and ML "
expand_ampersand_fn()¶
Replaces ampersands with " and ".
Parameters:
- text (str): Input text
Returns: Text with ampersands expanded
Example:
# Expand ampersands
text = "Research & Development"
result = normalize.expand_ampersand_fn(text)
print(result) # "Research and Development"
# Multiple ampersands
text2 = "A&B&C"
result2 = normalize.expand_ampersand_fn(text2)
print(result2) # "A and B and C"
# Handles whitespace
text3 = "cats&dogs"
result3 = normalize.expand_ampersand_fn(text3)
print(result3) # "cats and dogs"
Advanced Functions¶
get_lexical_variations()¶
def get_lexical_variations(
text: str,
include_self: bool = True,
expand_camelcase: bool = True,
drop_non_embedded_symbols: bool = True,
drop_embedded_symbols: bool = True,
spacify_embedded_symbols: bool = False,
do_hyphen_expansion: bool = True,
hyphen_subs: List[str] = (" ", ""),
do_hyphen_split: bool = True,
min_hyphen_split_token_len=2,
do_slash_expansion: bool = True,
slash_subs: List[str] = (" ", " or "),
do_slash_split: bool = True,
min_slash_split_token_len: int = 1,
drop_parentheticals: bool = True,
expand_ampersands: bool = True,
add_eng_plurals: bool = True,
) -> Set[str]
Generate comprehensive lexical variations of input text using multiple normalization techniques.
Parameters: see the signature above. Each boolean flag enables or disables the corresponding normalization step; the subs lists and minimum-token-length values are passed through to the hyphen and slash expansions.
Returns: Set of text variations
Example:
# Generate comprehensive variations
text = "multi-platform/cross-browser (compatible)"
variations = normalize.get_lexical_variations(text)
print(f"Generated {len(variations)} variations:")
for var in sorted(variations):
print(f" {var}")
# Custom configuration
text2 = "JavaScript"
variations2 = normalize.get_lexical_variations(
text2,
expand_camelcase=True,
drop_non_embedded_symbols=False,
add_eng_plurals=True
)
print(variations2)
# {'JavaScript', 'Java Script', 'JavaScripts', 'Java Scripts'}
basic_normalization_fn()¶
Applies common normalization steps to text.
Standard Operations:
- Lowercase conversion
- Whitespace squashing
- Basic punctuation handling
- Trimming
Example:
# Basic normalization
text = " Hello, WORLD! \n\t How are you? "
result = normalize.basic_normalization_fn(text)
print(repr(result)) # 'hello, world! how are you?'
Usage Patterns¶
Text Cleaning Pipeline¶
from dataknobs_xization import normalize
from typing import Dict
def clean_text_pipeline(text: str) -> Dict[str, str]:
"""Comprehensive text cleaning pipeline."""
results = {"original": text}
# Step 1: Expand camelCase
step1 = normalize.expand_camelcase_fn(text)
results["camelcase_expanded"] = step1
# Step 2: Expand ampersands
step2 = normalize.expand_ampersand_fn(step1)
results["ampersands_expanded"] = step2
# Step 3: Drop parentheticals
step3 = normalize.drop_parentheticals_fn(step2)
results["parentheticals_dropped"] = step3
# Step 4: Handle embedded symbols
step4 = normalize.drop_embedded_symbols_fn(step3, " ")
results["embedded_symbols_replaced"] = step4
# Step 5: Basic normalization
step5 = normalize.basic_normalization_fn(step4)
results["final_normalized"] = step5
return results
# Example usage
text = "getUserName() & validateInput (required)"
results = clean_text_pipeline(text)
for step, result in results.items():
print(f"{step}: {result}")
Lexical Variation Generation¶
from dataknobs_xization import normalize
from typing import List
def generate_search_terms(query: str) -> List[str]:
"""Generate search term variations for better matching."""
# Get all variations
variations = normalize.get_lexical_variations(
query,
expand_camelcase=True,
do_hyphen_expansion=True,
do_slash_expansion=True,
add_eng_plurals=True
)
# Filter and rank by relevance
filtered = []
for var in variations:
# Skip very short or very long variations
if 2 <= len(var.split()) <= 10:
filtered.append(var)
# Sort by length and complexity (prefer simpler terms)
filtered.sort(key=lambda x: (len(x.split()), len(x)))
return filtered[:20] # Return top 20 variations
# Example
query = "machine-learning/deep-learning"
search_terms = generate_search_terms(query)
print("Generated search terms:")
for i, term in enumerate(search_terms, 1):
print(f"{i:2d}. {term}")
Domain-Specific Normalization¶
from dataknobs_xization import normalize
import re
class TechnicalTextNormalizer:
"""Specialized normalizer for technical text."""
def __init__(self):
self.tech_patterns = {
            # Non-capturing group so findall() returns the full version string
            'version_numbers': re.compile(r'v?\d+\.\d+(?:\.\d+)?'),
'file_extensions': re.compile(r'\.[a-zA-Z0-9]+$'),
'urls': re.compile(r'https?://[^\s]+'),
'code_snippets': re.compile(r'`[^`]+`')
}
def normalize_technical(self, text: str) -> str:
"""Normalize technical text while preserving important patterns."""
# Store special patterns
preserved = {}
placeholder_count = 0
for pattern_name, pattern in self.tech_patterns.items():
matches = pattern.findall(text)
for match in matches:
placeholder = f"__PRESERVE_{placeholder_count}__"
preserved[placeholder] = match
text = text.replace(match, placeholder, 1)
placeholder_count += 1
# Apply standard normalization
normalized = normalize.basic_normalization_fn(text)
# Expand technical camelCase
normalized = normalize.expand_camelcase_fn(normalized)
# Handle technical symbols differently
normalized = normalize.drop_non_embedded_symbols_fn(normalized, " ")
# Restore preserved patterns
for placeholder, original in preserved.items():
normalized = normalized.replace(placeholder, original)
# Final cleanup
normalized = normalize.SQUASH_WS_RE.sub(" ", normalized).strip()
return normalized
# Usage
normalizer = TechnicalTextNormalizer()
tech_text = "Check out myLibrary v2.1.3 at https://github.com/user/repo and run `npm install`"
result = normalizer.normalize_technical(tech_text)
print(f"Original: {tech_text}")
print(f"Normalized: {result}")
Batch Text Processing¶
from dataknobs_xization import normalize
from dataknobs_utils import file_utils
from concurrent.futures import ThreadPoolExecutor
import json
def normalize_document(doc_data: dict) -> dict:
"""Normalize a single document."""
text = doc_data.get('content', '')
# Generate variations
variations = normalize.get_lexical_variations(
text,
expand_camelcase=True,
do_hyphen_expansion=True,
drop_parentheticals=True
)
# Basic normalization
normalized = normalize.basic_normalization_fn(text)
return {
**doc_data,
'normalized_content': normalized,
'variations': list(variations),
'variation_count': len(variations)
}
def batch_normalize_documents(input_dir: str, output_dir: str):
"""Process multiple documents in parallel."""
documents = []
# Load all documents
for filepath in file_utils.filepath_generator(input_dir):
if filepath.endswith('.json'):
for line in file_utils.fileline_generator(filepath):
try:
doc = json.loads(line)
documents.append(doc)
except json.JSONDecodeError:
continue
# Process in parallel
with ThreadPoolExecutor(max_workers=4) as executor:
normalized_docs = list(executor.map(normalize_document, documents))
# Save results
output_lines = []
for doc in normalized_docs:
output_lines.append(json.dumps(doc))
file_utils.write_lines(f"{output_dir}/normalized_documents.jsonl", output_lines)
print(f"Processed {len(normalized_docs)} documents")
print(f"Average variations per document: {sum(d['variation_count'] for d in normalized_docs) / len(normalized_docs):.1f}")
# Usage
batch_normalize_documents('/input/docs', '/output/normalized')
Error Handling¶
from dataknobs_xization import normalize
def safe_normalize(text: str, method: str = "basic") -> str:
"""Safely normalize text with error handling."""
if not text or not isinstance(text, str):
return ""
try:
if method == "basic":
return normalize.basic_normalization_fn(text)
elif method == "camelcase":
return normalize.expand_camelcase_fn(text)
elif method == "comprehensive":
variations = normalize.get_lexical_variations(text)
return min(variations, key=len) # Return shortest variation
else:
return text
except Exception as e:
print(f"Normalization failed for '{text[:50]}...': {e}")
return text # Return original on error
# Safe usage
result = safe_normalize("someWeirdText&Symbols(here)", "comprehensive")
print(result)
Performance Considerations¶
- Regular expressions are pre-compiled for efficiency
- get_lexical_variations() can generate many variations - use it selectively
- Consider caching results for frequently processed text (see the sketch below)
- Use appropriate batch sizes for parallel processing
- Monitor memory usage with large variation sets
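For repeated inputs, a simple memoization wrapper can avoid recomputing variation sets. The sketch below is illustrative only: cached_variations is a hypothetical helper, not part of the module, and it returns a frozenset so the cached value cannot be mutated by callers.
from functools import lru_cache
from dataknobs_xization import normalize
@lru_cache(maxsize=10_000)
def cached_variations(text: str) -> frozenset:
    """Memoize lexical variation generation for frequently seen text."""
    # frozenset keeps the cached result immutable and hashable
    return frozenset(normalize.get_lexical_variations(text))
# Repeated calls with identical text return the cached object
first = cached_variations("machine-learning")
second = cached_variations("machine-learning")
print(first is second)  # True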
Integration Examples¶
With Elasticsearch¶
from dataknobs_xization import normalize
from dataknobs_utils import elasticsearch_utils
def create_searchable_document(title: str, content: str) -> dict:
"""Create document with normalized search fields."""
# Generate title variations for better matching
title_variations = normalize.get_lexical_variations(
title, do_hyphen_expansion=True, expand_camelcase=True
)
# Normalize content
normalized_content = normalize.basic_normalization_fn(content)
# Expand technical terms
expanded_content = normalize.expand_camelcase_fn(content)
expanded_content = normalize.expand_ampersand_fn(expanded_content)
return {
'title': title,
'content': content,
'title_variations': list(title_variations),
'normalized_content': normalized_content,
'expanded_content': expanded_content,
'searchable_title': ' '.join(title_variations),
'searchable_content': f"{normalized_content} {expanded_content}"
}
# Usage with Elasticsearch
doc = create_searchable_document(
"JavaScript & Node.js",
"Learn JavaScript (programming language) and Node.js development"
)
# Index with enhanced searchability
query = elasticsearch_utils.build_field_query_dict(
['searchable_title', 'searchable_content'],
'java script nodejs'
)
With File Processing¶
from dataknobs_xization import normalize
from dataknobs_utils import file_utils
def normalize_text_files(input_dir: str, output_dir: str):
"""Normalize all text files in directory."""
for filepath in file_utils.filepath_generator(input_dir):
if filepath.endswith('.txt'):
# Read file
lines = list(file_utils.fileline_generator(filepath))
# Normalize each line
normalized_lines = []
for line in lines:
normalized = normalize.basic_normalization_fn(line)
if normalized.strip(): # Skip empty lines
normalized_lines.append(normalized)
# Save normalized version
basename = file_utils.get_basename(filepath)
output_path = f"{output_dir}/normalized_{basename}"
file_utils.write_lines(output_path, normalized_lines)
# Process directory
normalize_text_files('/raw/text', '/processed/text')
The normalization module provides comprehensive text preprocessing capabilities that work seamlessly with other dataknobs components for complete text processing pipelines.