Regex Transformations with FSM¶
This guide demonstrates how to use regular expressions directly in FSM YAML configurations for powerful text transformations.
Overview¶
The FSM framework supports using Python's re module directly in inline transform blocks within YAML configurations. This enables complex text processing without writing separate Python functions.
Key Examples¶
1. Text Normalization Pipeline (normalize_file_example.py)¶
A simple example showing how to normalize text files using FSM with streaming support.
Features:
- Streaming file processing for memory efficiency
- Text normalization (whitespace, case conversion)
- Multiple processing methods (streaming, batch, individual lines)
- Integration with SimpleFSM.process_stream()
Basic Usage:
from dataknobs_fsm.api.simple import SimpleFSM
import yaml
WORKFLOW_YAML = '''
name: text_normalization
states:
- name: start
is_start: true
- name: normalize
- name: complete
is_end: true
arcs:
- from: start
to: normalize
- from: normalize
to: complete
transform:
type: inline
code: "lambda data, ctx: {**data, 'text': data.get('text', '').lower().strip()}"
'''
config = yaml.safe_load(WORKFLOW_YAML)
fsm = SimpleFSM(config)
# Process single line
result = fsm.process({'text': ' HELLO WORLD '})
print(result['data']['text']) # Output: 'hello world'
fsm.close()
2. Advanced Regex Transformations (normalize_file_with_regex.py)¶
Comprehensive examples of regex-based text processing with field preservation.
Features: - Multiple regex patterns in a pipeline - Field preservation (original text kept while adding transformation fields) - Pattern extraction (emails, URLs, phone numbers) - Sensitive data masking - Custom regex workflow generation
Key Patterns Demonstrated:
Whitespace Normalization¶
transform:
type: inline
code: |
lambda data, ctx: {
**data,
'clean_whitespace': __import__('re').sub(r'\\s+', ' ', data.get('text', '')).strip()
}
Email Normalization¶
transform:
type: inline
code: |
lambda data, ctx: {
**data,
'normalized_emails': __import__('re').sub(
r'\\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Z|a-z]{2,}\\b',
lambda m: m.group(0).lower(),
data.get('text', '')
)
}
Phone Number Masking¶
transform:
type: inline
code: |
lambda data, ctx: {
**data,
'phone_masked': __import__('re').sub(
r'\\b\\d{3}[-.]?\\d{3}[-.]?\\d{4}\\b',
'[PHONE]',
data.get('text', '')
)
}
Duplicate Word Removal¶
transform:
type: inline
code: |
lambda data, ctx: {
**data,
'deduped': __import__('re').sub(
r'\\b(\\w+)\\b(?:\\s+\\1\\b)+',
r'\\1',
data.get('text', '')
)
}
3. YAML-Based Regex Configurations (regex_transforms.yaml)¶
Two complete YAML workflows demonstrating different approaches:
- Field Transforms Workflow - Sequential transformations with field tracking
- All-in-One Transforms - Multiple transformations in a single step
Example Output:
# Input
{'text': 'Contact John at 555-123-4567 or email john@example.com'}
# Output (field transforms)
{
'original': 'Contact John at 555-123-4567 or email john@example.com',
'whitespace_normalized': 'Contact John at 555-123-4567 or email john@example.com',
'phone_masked': 'Contact John at [PHONE] or email john@example.com',
'emails_found': ['john@example.com'],
'urls_found': [],
'hashtags_found': [],
'processing_complete': True
}
4. Pattern Extraction Workflow (regex_workflow.yaml)¶
YAML configuration for extracting and transforming specific patterns.
Features: - Email, URL, hashtag, and mention extraction - SSN and credit card masking - Multiple format conversions (snake_case, kebab-case, CamelCase) - Pattern detection flags
Using Regular Expressions in YAML¶
Basic Pattern¶
The key to using regex in YAML configurations is the __import__('re') pattern:
transform:
type: inline
code: |
lambda data, ctx: {
**data,
'result': __import__('re').sub(
r'pattern',
r'replacement',
data.get('field', '')
)
}
Important Escaping Rules¶
When using regex in YAML:
1. Backslashes in regex patterns must be escaped: \\s instead of \s
2. Use raw strings (r'...') in the Python code
3. For backreferences, use \\1 in the pattern
Complex Patterns Example¶
# Chaining multiple regex operations
transform:
type: inline
code: |
lambda data, ctx: (lambda re, text: {
**data,
'processed': re.sub(
r'[^\\w\\s]', '', # Remove punctuation
re.sub(
r'\\b(\\w+)\\b(?:\\s+\\1\\b)+', r'\\1', # Remove duplicates
re.sub(
r'\\s+', ' ', # Normalize spaces
text.lower()
)
)
).strip()
})(__import__('re'), data.get('text', ''))
Field Preservation Pattern¶
Best practice is to preserve the original text while adding transformation fields:
arcs:
- from: start
to: step1
transform:
type: inline
code: |
lambda data, ctx: {
**data,
'original': data.get('text', ''), # Preserve original
'step1_result': transform_function(data.get('text', ''))
}
- from: step1
to: step2
transform:
type: inline
code: |
lambda data, ctx: {
**data, # Keep all previous fields
'step2_result': transform_function(data.get('step1_result', ''))
}
Common Use Cases¶
1. Data Cleaning Pipeline¶
- Remove extra whitespace
- Normalize punctuation
- Fix capitalization
- Remove duplicate words
2. Data Masking for Privacy¶
- Mask phone numbers, SSNs, credit cards
- Anonymize email addresses
- Redact sensitive patterns
3. Format Standardization¶
- Convert between naming conventions (snake_case, CamelCase)
- Normalize dates and times
- Standardize phone number formats
4. Content Extraction¶
- Extract emails, URLs, mentions
- Find and collect hashtags
- Identify specific patterns
Running the Examples¶
# Navigate to FSM package
cd packages/fsm
# Run text normalization example
python examples/normalize_file_example.py
# Run regex transformation examples
python examples/normalize_file_with_regex.py
# Test YAML-based transformations
python examples/test_regex_yaml.py
# Process a file with regex transformations
python examples/normalize_file_example.py
Processing Files¶
The examples support file processing:
from dataknobs_fsm.api.simple import SimpleFSM
# Load configuration
fsm = SimpleFSM("regex_transforms.yaml")
# Process file with streaming
results = fsm.process_stream(
source="input.txt",
sink="output.jsonl",
input_format='text',
text_field_name='text',
chunk_size=1000
)
print(f"Processed: {results['total_processed']} lines")
fsm.close()
Testing¶
Comprehensive unit tests are provided in tests/test_regex_examples.py:
# Run regex example tests
pytest tests/test_regex_examples.py -v
# Run specific test class
pytest tests/test_regex_examples.py::TestRegexNormalizationWorkflow -v
Best Practices¶
- Always preserve original data - Add new fields rather than overwriting
- Use descriptive field names - Make it clear what each transformation does
- Handle None/empty values - Use
data.get('field') or ''pattern - Test regex patterns - Verify patterns work as expected before deployment
- Document complex patterns - Add comments explaining what patterns do
- Consider performance - Chain operations efficiently
Limitations¶
- Complex regex patterns can impact performance on large datasets
- YAML escaping rules can make patterns harder to read
- Debug output may be needed for complex transformations
- Some regex features may require custom Python functions