Masking API Documentation¶

The masking functionality in the dataknobs_xization package provides advanced text masking and tokenization capabilities for privacy protection, data anonymization, and feature extraction.

Overview¶

The masking system provides:

Character-level feature extraction and analysis
Text-level feature computation
Masking tokenization for sensitive data
Integration with document structures
Advanced pattern recognition and masking
Privacy-preserving text processing

Core Classes¶

CharacterFeatures¶

class CharacterFeatures(ABC):
    def __init__(
        self, 
        doctext: Union[dk_doc.Text, str], 
        roll_padding: int = 0
    )

Abstract base class representing features of text as a DataFrame with each character as a row and columns representing character features.

Parameters:
- doctext (Union[dk_doc.Text, str]): Text to tokenize, or a Text document with metadata
- roll_padding (int, default=0): Number of pad characters added to each end of the text

Abstract Properties:
- cdf (pd.DataFrame): Character DataFrame with each padded text character as a row

Properties:
- doctext (dk_doc.Text): The document text wrapper
- text_col (str): Name of the cdf column holding text characters
- text (str): The text string
- text_id (Any): The ID of the text

Example Implementation:

from dataknobs_xization import masking_tokenizer
from dataknobs_structures import document as dk_doc
import pandas as pd
import numpy as np

class BasicCharacterFeatures(masking_tokenizer.CharacterFeatures):
    """Basic implementation of character features."""

    @property
    def cdf(self) -> pd.DataFrame:
        """Create character dataframe with basic features."""
        if not hasattr(self, '_cdf'):
            chars = list(self.text)

            # Add padding if specified
            if self._roll_padding > 0:
                pad_char = '<PAD>'
                chars = [pad_char] * self._roll_padding + chars + [pad_char] * self._roll_padding

            # Create feature dataframe
            self._cdf = pd.DataFrame({
                self.text_col: chars,
                'position': range(len(chars)),
                'is_alpha': [c.isalpha() if c != '<PAD>' else False for c in chars],
                'is_digit': [c.isdigit() if c != '<PAD>' else False for c in chars],
                'is_upper': [c.isupper() if c != '<PAD>' else False for c in chars],
                'is_lower': [c.islower() if c != '<PAD>' else False for c in chars],
                'is_space': [c.isspace() if c != '<PAD>' else False for c in chars],
                'is_punct': [not c.isalnum() and not c.isspace() if c != '<PAD>' else False for c in chars],
                'is_padding': [c == '<PAD>' for c in chars]
            })

        return self._cdf

# Usage example
text = "Hello, World! 123"
features = BasicCharacterFeatures(text, roll_padding=2)
print(features.cdf.head(10))

TextFeatures¶

class TextFeatures:
    def __init__(self, doctext: Union[dk_doc.Text, str])

Class for extracting and analyzing text-level features.

Parameters:
- doctext (Union[dk_doc.Text, str]): Input text or document

Example:

class AdvancedTextFeatures(masking_tokenizer.TextFeatures):
    """Advanced text-level features."""

    def __init__(self, doctext):
        self.doctext = doctext if hasattr(doctext, 'text') else dk_doc.Text(doctext)

    def extract_features(self) -> dict:
        """Extract comprehensive text features."""
        text = self.doctext.text
        return {
            'char_count': len(text),
            'word_count': len(text.split()),
            'sentence_count': len([s for s in text.split('.') if s.strip()]),
            'alpha_ratio': sum(c.isalpha() for c in text) / len(text) if text else 0,
            'digit_ratio': sum(c.isdigit() for c in text) / len(text) if text else 0,
            'space_ratio': sum(c.isspace() for c in text) / len(text) if text else 0,
            'punct_ratio': sum(not c.isalnum() and not c.isspace() for c in text) / len(text) if text else 0,
            'avg_word_length': np.mean([len(word) for word in text.split()]) if text.split() else 0,
            'contains_digits': any(c.isdigit() for c in text),
            'contains_special': any(not c.isalnum() and not c.isspace() for c in text),
            'is_uppercase': text.isupper(),
            'is_lowercase': text.islower(),
            'is_titlecase': text.istitle()
        }

# Usage
text_features = AdvancedTextFeatures("Hello, World! This has 123 numbers.")
features = text_features.extract_features()
print(features)

Masking Functions¶

Basic Masking¶

def mask_text(
    text: str,
    mask_char: str = '*',
    preserve_length: bool = True,
    preserve_structure: bool = True
) -> str

Apply basic masking to text while optionally preserving structure.

Parameters:
- text (str): Input text to mask
- mask_char (str, default='*'): Character to use for masking
- preserve_length (bool, default=True): Keep original text length
- preserve_structure (bool, default=True): Preserve spaces and punctuation

Returns: Masked text string

Example:

from dataknobs_xization import masking

# Basic masking
text = "John Doe, age 30"
masked = masking.mask_text(text)
print(masked)  # "**** ***, *** **"

# Custom mask character
masked2 = masking.mask_text(text, mask_char='X')
print(masked2)  # "XXXX XXX, XXX XX"

# Don't preserve structure
masked3 = masking.mask_text(text, preserve_structure=False)
print(masked3)  # "****************" (16 characters, since preserve_length defaults to True)
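
The package's own implementation is not shown on this page. As a rough illustration of what `preserve_structure` controls, the standalone sketch below masks only letters and digits when structure is preserved, and every character otherwise. The name `sketch_mask_text` is hypothetical and is not part of dataknobs_xization; the behavior is an assumption based on the parameter descriptions above.

```python
def sketch_mask_text(text: str, mask_char: str = "*", preserve_structure: bool = True) -> str:
    """Hypothetical stand-in for mask_text; not the library implementation."""
    if not preserve_structure:
        # Mask every character, so only the length of the input survives.
        return mask_char * len(text)
    # Keep whitespace and punctuation; replace letters and digits only.
    return "".join(mask_char if c.isalnum() else c for c in text)


print(sketch_mask_text("John Doe, age 30"))       # **** ***, *** **
print(sketch_mask_text("John Doe, age 30", "X"))  # XXXX XXX, XXX XX
```

Keeping punctuation and spacing makes the masked text easier to align with the original, at the cost of revealing word lengths and boundaries.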

Pattern-Based Masking¶

def mask_patterns(
    text: str,
    patterns: Dict[str, str],
    mask_char: str = '*'
) -> str

Mask text based on regex patterns for specific data types.

Parameters:
- text (str): Input text
- patterns (Dict[str, str]): Dictionary of pattern names to regex patterns
- mask_char (str, default='*'): Masking character

Returns: Text with patterns masked

Example:

import re

# Define patterns for sensitive data
patterns = {
    'email': r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b',
    'phone': r'\b\d{3}-\d{3}-\d{4}\b',
    'ssn': r'\b\d{3}-\d{2}-\d{4}\b',
    'credit_card': r'\b\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b'
}

text = "Contact John at john@example.com or 555-123-4567. SSN: 123-45-6789"
masked = masking.mask_patterns(text, patterns)
print(masked)
# "Contact John at **************** or ************. SSN: ***********"
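
One plausible way to implement length-preserving pattern masking is to substitute each regex match with a run of mask characters of the same length. The sketch below uses only the standard `re` module; `sketch_mask_patterns` is a hypothetical name, not the library function, and the sequential application of patterns is an assumption.

```python
import re
from typing import Dict


def sketch_mask_patterns(text: str, patterns: Dict[str, str], mask_char: str = "*") -> str:
    """Hypothetical stand-in for mask_patterns; not the library implementation."""
    for name, pattern in patterns.items():
        # Replace each match with a run of mask characters of equal length.
        text = re.sub(pattern, lambda m: mask_char * len(m.group()), text)
    return text


patterns = {
    "phone": r"\b\d{3}-\d{3}-\d{4}\b",
    "ssn": r"\b\d{3}-\d{2}-\d{4}\b",
}
masked = sketch_mask_patterns("Call 555-123-4567. SSN: 123-45-6789", patterns)
print(masked)  # Call ************. SSN: ***********
```

Because the patterns run one after another in this sketch, text masked by an earlier pattern is no longer visible to later ones, so pattern order can matter when patterns overlap.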

Entity-Based Masking¶

def mask_entities(
    text: str,
    entity_types: List[str],
    entity_detector: Callable = None
) -> Tuple[str, List[Dict]]

Mask named entities while preserving a mapping for later restoration.

Parameters:
- text (str): Input text
- entity_types (List[str]): Types of entities to mask
- entity_detector (Callable): Function to detect entities

Returns: Tuple of (masked_text, entity_mapping)

Example:

import re
from typing import Dict, List

def simple_entity_detector(text: str) -> List[Dict]:
    """Simple entity detector for demonstration."""
    entities = []

    # Detect names (capitalized words)
    for match in re.finditer(r'\b[A-Z][a-z]+\b', text):
        entities.append({
            'text': match.group(),
            'start': match.start(),
            'end': match.end(),
            'type': 'PERSON'
        })

    # Detect numbers
    for match in re.finditer(r'\b\d+\b', text):
        entities.append({
            'text': match.group(),
            'start': match.start(),
            'end': match.end(),
            'type': 'NUMBER'
        })

    return entities

text = "Alice has 25 apples and Bob has 30 oranges"
masked_text, entity_map = masking.mask_entities(
    text, 
    ['PERSON', 'NUMBER'],
    entity_detector=simple_entity_detector
)
print(f"Masked: {masked_text}")
print(f"Entities: {entity_map}")
# Masked: <PERSON_1> has <NUMBER_1> apples and <PERSON_2> has <NUMBER_2> oranges
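
To make the "mapping for later restoration" concrete, here is a minimal sketch of placeholder substitution and its inverse. Both helpers (`sketch_mask_entities`, `sketch_restore`) are hypothetical, and the mapping shape used here (placeholder to original string) is an assumption; the library's `entity_mapping` is documented only as a `List[Dict]`. Entity spans are assumed not to overlap.

```python
from typing import Dict, List, Tuple


def sketch_mask_entities(text: str, entities: List[Dict]) -> Tuple[str, Dict[str, str]]:
    """Replace each entity span with a numbered placeholder (hypothetical helper)."""
    counters: Dict[str, int] = {}
    mapping: Dict[str, str] = {}
    pieces: List[str] = []
    cursor = 0
    # Walk the spans left to right so the numbering follows reading order.
    for ent in sorted(entities, key=lambda e: e["start"]):
        counters[ent["type"]] = counters.get(ent["type"], 0) + 1
        placeholder = f"<{ent['type']}_{counters[ent['type']]}>"
        pieces.append(text[cursor:ent["start"]])
        pieces.append(placeholder)
        mapping[placeholder] = ent["text"]
        cursor = ent["end"]
    pieces.append(text[cursor:])
    return "".join(pieces), mapping


def sketch_restore(masked: str, mapping: Dict[str, str]) -> str:
    """Undo sketch_mask_entities by substituting the original strings back."""
    for placeholder, original in mapping.items():
        masked = masked.replace(placeholder, original)
    return masked


entities = [
    {"text": "Alice", "start": 0, "end": 5, "type": "PERSON"},
    {"text": "25", "start": 10, "end": 12, "type": "NUMBER"},
]
masked, mapping = sketch_mask_entities("Alice has 25 apples", entities)
print(masked)                           # <PERSON_1> has <NUMBER_1> apples
print(sketch_restore(masked, mapping))  # Alice has 25 apples
```

The mapping is as sensitive as the original text, so it would normally be stored separately from the masked output.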

Advanced Masking Techniques¶

Differential Privacy Masking¶

from typing import Dict

class DifferentialPrivacyMasker:
    """Apply differential privacy techniques to text."""

    def __init__(self, epsilon: float = 1.0, sensitivity: float = 1.0):
        self.epsilon = epsilon
        self.sensitivity = sensitivity

    def add_noise(self, value: float) -> float:
        """Add Laplace noise for differential privacy."""
        import numpy as np
        scale = self.sensitivity / self.epsilon
        noise = np.random.laplace(0, scale)
        return value + noise

    def mask_frequencies(self, text: str) -> Dict[str, int]:
        """Return word frequencies with differential privacy."""
        from collections import Counter

        words = text.lower().split()
        true_counts = Counter(words)

        # Add noise to counts
        noisy_counts = {}
        for word, count in true_counts.items():
            noisy_count = max(0, int(self.add_noise(count)))
            if noisy_count > 0:
                noisy_counts[word] = noisy_count

        return noisy_counts

# Usage
masker = DifferentialPrivacyMasker(epsilon=0.5)
text = "the cat sat on the mat the cat was happy"
noisy_freq = masker.mask_frequencies(text)
print(noisy_freq)
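
The noise scale in `add_noise` is `sensitivity / epsilon`, so a smaller epsilon means more noise and stronger privacy. The standard-library sketch below shows how the scale changes with epsilon and draws Laplace noise as the difference of two exponential samples. The helper names are hypothetical, and the fixed seed is there only to make the demonstration repeatable.

```python
import random


def laplace_scale(epsilon: float, sensitivity: float = 1.0) -> float:
    """Scale b of the Laplace distribution used by the Laplace mechanism."""
    return sensitivity / epsilon


def sample_laplace(scale: float, rng: random.Random) -> float:
    """Draw Laplace(0, scale) noise as the difference of two exponentials."""
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


rng = random.Random(0)  # seeded only so the demonstration is repeatable
for eps in (0.1, 0.5, 1.0, 5.0):
    b = laplace_scale(eps)
    # 3 is the true count of "the" in the example sentence above.
    noisy = 3 + sample_laplace(b, rng)
    print(f"epsilon={eps}: scale={b:.1f}, noisy count={noisy:.2f}")
```

With a scale of 10 (epsilon=0.1) the noise usually swamps a true count of 3, while at epsilon=5 the noisy count stays close to the true value. This is the privacy and utility trade-off in numeric form.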

Contextual Masking¶

from typing import List

class ContextualMasker:
    """Mask text while preserving grammatical structure."""

    def __init__(self):
        self.pos_tags = {}  # Part-of-speech tags

    def mask_by_pos(self, text: str, mask_pos: List[str]) -> str:
        """Mask words based on part-of-speech tags."""
        # Simplified POS tagging (in practice, use a proper NLP library)
        words = text.split()
        masked_words = []

        for word in words:
            # Simplified POS detection
            if word.lower() in ['john', 'mary', 'alice', 'bob']:  # Names
                if 'NOUN' in mask_pos:
                    masked_words.append('[NAME]')
                else:
                    masked_words.append(word)
            elif word.isdigit():  # Numbers
                if 'NUM' in mask_pos:
                    masked_words.append('[NUMBER]')
                else:
                    masked_words.append(word)
            else:
                masked_words.append(word)

        return ' '.join(masked_words)

    def preserve_syntax(self, text: str) -> str:
        """Mask content while preserving syntactic structure."""
        words = text.split()
        masked = []

        for word in words:
            if word.lower() in ['the', 'a', 'an', 'and', 'or', 'but', 'in', 'on', 'at', 'to']:
                masked.append(word)  # Keep function words
            else:
                masked.append('*' * len(word))  # Mask content words

        return ' '.join(masked)

# Usage
masker = ContextualMasker()
text = "John went to the store and bought 5 apples"
pos_masked = masker.mask_by_pos(text, ['NOUN', 'NUM'])
print(pos_masked)  # "[NAME] went to the store and bought [NUMBER] apples"

syntax_preserved = masker.preserve_syntax(text)
print(syntax_preserved)  # "**** **** to the ***** and ****** * ******"
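
`preserve_syntax` splits on whitespace, so punctuation attached to a word (for example "store," or "apples.") is treated as part of that word. A possible variant, sketched below with a hypothetical function name, masks only runs of word characters and leaves punctuation and spacing untouched.

```python
import re

FUNCTION_WORDS = {"the", "a", "an", "and", "or", "but", "in", "on", "at", "to"}


def preserve_syntax_with_punct(text: str) -> str:
    """Mask content words but leave function words, spacing, and punctuation intact."""
    def mask(match: re.Match) -> str:
        word = match.group()
        return word if word.lower() in FUNCTION_WORDS else "*" * len(word)

    # \w+ matches runs of letters, digits, and underscores; everything else is left alone.
    return re.sub(r"\w+", mask, text)


print(preserve_syntax_with_punct("John went to the store, and bought 5 apples."))
# **** **** to the *****, and ****** * ******.
```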

Character-Level Masking¶

Character Feature Masking¶

class CharacterLevelMasker(masking_tokenizer.CharacterFeatures):
    """Character-level masking with feature preservation."""

    def __init__(self, doctext, roll_padding=0, mask_probability=0.15):
        super().__init__(doctext, roll_padding)
        self.mask_probability = mask_probability

    @property
    def cdf(self) -> pd.DataFrame:
        """Character dataframe with masking features."""
        if not hasattr(self, '_cdf'):
            chars = list(self.text)

            # Add padding
            if self._roll_padding > 0:
                pad_char = '<PAD>'
                chars = [pad_char] * self._roll_padding + chars + [pad_char] * self._roll_padding

            # Create base features. A local seeded generator keeps the example
            # reproducible without reseeding NumPy's global random state.
            import numpy as np
            rng = np.random.default_rng(42)

            self._cdf = pd.DataFrame({
                self.text_col: chars,
                'original_char': chars,
                'position': range(len(chars)),
                'is_alpha': [c.isalpha() if c != '<PAD>' else False for c in chars],
                'is_digit': [c.isdigit() if c != '<PAD>' else False for c in chars],
                'is_upper': [c.isupper() if c != '<PAD>' else False for c in chars],
                'is_lower': [c.islower() if c != '<PAD>' else False for c in chars],
                'is_space': [c.isspace() if c != '<PAD>' else False for c in chars],
                'is_punct': [not c.isalnum() and not c.isspace() if c != '<PAD>' else False for c in chars],
                'should_mask': rng.random(len(chars)) < self.mask_probability,
                'is_padding': [c == '<PAD>' for c in chars]
            })

            # Apply masking
            mask_indices = self._cdf['should_mask'] & ~self._cdf['is_padding']
            self._cdf.loc[mask_indices, self.text_col] = '[MASK]'

        return self._cdf

    def get_masked_text(self) -> str:
        """Get the masked version of the text."""
        cdf = self.cdf
        masked_chars = cdf[~cdf['is_padding']][self.text_col].tolist()
        return ''.join(masked_chars)

    def get_mask_positions(self) -> List[int]:
        """Get positions of masked characters."""
        cdf = self.cdf
        mask_positions = cdf[cdf['should_mask'] & ~cdf['is_padding']]['position'].tolist()
        return [pos - self._roll_padding for pos in mask_positions]  # Adjust for padding

# Usage
text = "This is a sample text for masking demonstration."
masker = CharacterLevelMasker(text, roll_padding=1, mask_probability=0.2)
masked_text = masker.get_masked_text()
print(f"Original: {text}")
print(f"Masked:   {masked_text}")
print(f"Mask positions: {masker.get_mask_positions()}")

Integration Patterns¶

Document Processing Pipeline¶

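from dataknobs_xization import masking
from dataknobs_structures import document as dk_doc
from dataknobs_utils import file_utils
from typing import Any, Dict, List
import json

# CharacterLevelMasker is the class defined in the Character-Level Masking section.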

class DocumentMaskingPipeline:
    """Complete document masking pipeline."""

    def __init__(self, config: Dict[str, Any]):
        self.config = config
        self.entity_patterns = config.get('entity_patterns', {})
        self.mask_probability = config.get('mask_probability', 0.15)

    def process_document(self, doc: dk_doc.Document) -> Dict[str, Any]:
        """Process a single document with multiple masking strategies."""
        text = doc.text
        results = {
            'original_doc': doc,
            'original_text': text,
            'document_id': doc.text_id if hasattr(doc, 'text_id') else None
        }

        # Pattern-based masking
        if self.entity_patterns:
            pattern_masked = masking.mask_patterns(text, self.entity_patterns)
            results['pattern_masked'] = pattern_masked

        # Character-level masking
        char_masker = CharacterLevelMasker(text, mask_probability=self.mask_probability)
        char_masked = char_masker.get_masked_text()
        results['character_masked'] = char_masked
        results['mask_positions'] = char_masker.get_mask_positions()

        # Extract features
        features = char_masker.cdf
        # Cast NumPy integers to plain ints so json.dumps can serialize them
        results['character_features'] = {
            'total_chars': len(features),
            'masked_chars': int(features['should_mask'].sum()),
            'alpha_chars': int(features['is_alpha'].sum()),
            'digit_chars': int(features['is_digit'].sum()),
            'space_chars': int(features['is_space'].sum()),
            'punct_chars': int(features['is_punct'].sum())
        }

        return results

    def process_documents(self, documents: List[dk_doc.Document]) -> List[Dict[str, Any]]:
        """Process multiple documents."""
        return [self.process_document(doc) for doc in documents]

    def save_results(self, results: List[Dict[str, Any]], output_path: str):
        """Save masking results to file."""
        output_lines = []
        for result in results:
            # Convert to serializable format
            serializable_result = {
                'document_id': result['document_id'],
                'original_text': result['original_text'],
                'pattern_masked': result.get('pattern_masked'),
                'character_masked': result['character_masked'],
                'mask_positions': result['mask_positions'],
                'character_features': result['character_features']
            }
            output_lines.append(json.dumps(serializable_result))

        file_utils.write_lines(output_path, output_lines)

# Usage
config = {
    'entity_patterns': {
        'email': r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b',
        'phone': r'\b\d{3}-\d{3}-\d{4}\b'
    },
    'mask_probability': 0.1
}

pipeline = DocumentMaskingPipeline(config)

# Process documents
documents = [
    dk_doc.Document("Contact Alice at alice@example.com or 555-123-4567", text_id="doc1"),
    dk_doc.Document("Meeting with Bob tomorrow at 2 PM", text_id="doc2")
]

results = pipeline.process_documents(documents)
pipeline.save_results(results, "masked_documents.jsonl")

for result in results:
    print(f"Doc {result['document_id']}:")
    print(f"  Original: {result['original_text']}")
    print(f"  Pattern:  {result.get('pattern_masked', 'N/A')}")
    print(f"  Character: {result['character_masked']}")
    print()
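
The pipeline serializes one JSON object per result, which matches the JSON Lines convention suggested by the `.jsonl` file name. Assuming `file_utils.write_lines` writes each string on its own line (an assumption, since its behavior is not documented here), the output can be read back with the standard library alone. The records below are made-up stand-ins for real results.

```python
import json
import os
import tempfile

records = [
    {"document_id": "doc1", "character_masked": "Con[MASK]act", "mask_positions": [3]},
    {"document_id": "doc2", "character_masked": "Meeting", "mask_positions": []},
]

path = os.path.join(tempfile.mkdtemp(), "masked_documents.jsonl")

# Write one JSON object per line, mirroring what save_results is assumed to produce.
with open(path, "w", encoding="utf-8") as fh:
    for record in records:
        fh.write(json.dumps(record) + "\n")

# Read the file back line by line; blank lines are skipped.
with open(path, encoding="utf-8") as fh:
    loaded = [json.loads(line) for line in fh if line.strip()]

print(loaded[0]["mask_positions"])  # [3]
```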

Privacy-Preserving Analytics¶

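from dataknobs_xization import masking
from collections import Counter
from typing import Any, Dict, List
import numpy as np

# Reuses DifferentialPrivacyMasker and CharacterLevelMasker defined in earlier sections.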

class PrivacyPreservingAnalytics:
    """Perform analytics on masked text data."""

    def __init__(self, epsilon: float = 1.0):
        self.epsilon = epsilon
        self.dp_masker = DifferentialPrivacyMasker(epsilon)

    def analyze_masked_corpus(self, texts: List[str]) -> Dict[str, Any]:
        """Analyze a corpus of texts with privacy preservation."""
        # Mask all texts
        masked_texts = []
        total_masks = 0

        for text in texts:
            char_masker = CharacterLevelMasker(text, mask_probability=0.15)
            masked_text = char_masker.get_masked_text()
            masked_texts.append(masked_text)
            total_masks += len(char_masker.get_mask_positions())

        # Compute word frequencies with differential privacy. The counts are
        # taken from the original texts; only the noisy counts are returned.
        word_freq = self.dp_masker.mask_frequencies(' '.join(texts))

        return {
            'total_documents': len(texts),
            'total_masked_characters': total_masks,
            'average_mask_per_doc': total_masks / len(texts) if texts else 0.0,
            'vocabulary_size': len(word_freq),
            'top_words': sorted(word_freq.items(), key=lambda x: x[1], reverse=True)[:10],
            'privacy_epsilon': self.epsilon
        }

# Usage
analytics = PrivacyPreservingAnalytics(epsilon=0.5)
texts = [
    "This is a confidential document with sensitive information.",
    "Another document containing private data and personal details.",
    "Public information that can be shared without concerns."
]

analysis = analytics.analyze_masked_corpus(texts)
print(f"Analysis with privacy (ε={analysis['privacy_epsilon']}):")
print(f"Documents: {analysis['total_documents']}")
print(f"Masked characters: {analysis['total_masked_characters']}")
print(f"Top words: {analysis['top_words'][:5]}")

Performance and Security Considerations¶

Character-level masking builds a DataFrame with one row per character, so memory use and processing time grow with text length
Consider streaming processing for very large documents
Store masked data separately from original data for security
Use cryptographically secure random number generators for production masking
Implement proper key management for reversible masking
Consider the trade-off between privacy and utility when setting masking parameters
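
As a sketch of the random-number recommendation above, the snippet below selects mask positions with two interchangeable generators: a seeded `random.Random` for repeatable tests, and `secrets.SystemRandom`, which draws from the operating system's entropy source and cannot be seeded. The helper `choose_mask_positions` is hypothetical and independent of the classes on this page.

```python
import random
import secrets
from typing import List


def choose_mask_positions(length: int, probability: float, rng: random.Random) -> List[int]:
    """Return the indices selected for masking, each chosen independently."""
    return [i for i in range(length) if rng.random() < probability]


text = "This is a sample text for masking demonstration."

# Seeded generator: repeatable, useful for tests, but predictable.
test_positions = choose_mask_positions(len(text), 0.2, random.Random(42))

# OS-backed generator: not seedable, so positions cannot be replayed from a seed.
prod_positions = set(choose_mask_positions(len(text), 0.2, secrets.SystemRandom()))

masked = "".join("*" if i in prod_positions else c for i, c in enumerate(text))
print(masked)
```

Passing the generator in as a parameter keeps the masking logic identical in both settings; only the source of randomness changes.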

Best Practices¶

Data Classification: Classify data sensitivity before applying masking
Masking Strategy: Choose appropriate masking strategy based on use case
Testing: Test masking effectiveness with synthetic data
Documentation: Document masking procedures and parameters
Access Control: Implement proper access controls for unmasked data
Audit Trail: Maintain logs of masking operations
Regular Review: Regularly review and update masking strategies
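
The testing practice above can be approximated with a small leak check: run the same patterns against the masked output and report any that still match. The sketch below uses made-up synthetic values; `find_leaks` is a hypothetical helper, not part of the package.

```python
import re
from typing import Dict, List


def find_leaks(masked_text: str, patterns: Dict[str, str]) -> List[str]:
    """Return the names of patterns that still match the masked text."""
    return [name for name, pattern in patterns.items() if re.search(pattern, masked_text)]


patterns = {"phone": r"\b\d{3}-\d{3}-\d{4}\b", "ssn": r"\b\d{3}-\d{2}-\d{4}\b"}

# Synthetic records: made-up values that only follow the sensitive formats.
print(find_leaks("Call ************ today", patterns))             # []
print(find_leaks("Call 555-000-1234, SSN ***********", patterns))  # ['phone']
```

An empty result only shows that the known patterns no longer match; it does not rule out sensitive values in formats the patterns do not cover.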

The masking module provides comprehensive privacy-preserving text processing capabilities that integrate with the broader dataknobs ecosystem for secure data handling and analysis.