Skip to content

Tokenization API Documentation

The tokenization functionality in the dataknobs_xization package provides advanced text tokenization capabilities with character-level features and masking support.

Overview

The tokenization system includes:

  • Character-level feature extraction
  • Text-level feature analysis
  • Masking tokenization for privacy and data processing
  • Integration with document structures
  • Emoji and Unicode support

Core Classes

CharacterFeatures

class CharacterFeatures(ABC):
    def __init__(
        self, 
        doctext: Union[dk_doc.Text, str], 
        roll_padding: int = 0
    )

Abstract base class for character-level text analysis and feature extraction.

Parameters: - doctext (Union[dk_doc.Text, str]): The text to tokenize or a Text document with metadata - roll_padding (int, default=0): Number of pad characters added to each end of text

Properties: - cdf (pd.DataFrame): Character dataframe with each padded text character as a row - doctext (dk_doc.Text): The document text wrapper - text_col (str): Name of the cdf column holding text characters - text (str): The text string - text_id (Any): The ID of the text

Example:

from dataknobs_xization import masking_tokenizer
from dataknobs_structures import document as dk_doc

# Create text document
text_doc = dk_doc.Text("Hello, World! 👋", text_id="greeting")

# Create character features (concrete implementation needed)
class SimpleCharFeatures(masking_tokenizer.CharacterFeatures):
    @property
    def cdf(self):
        import pandas as pd
        chars = list(self.text)
        return pd.DataFrame({
            self.text_col: chars,
            'position': range(len(chars)),
            'is_alpha': [c.isalpha() for c in chars],
            'is_digit': [c.isdigit() for c in chars]
        })

features = SimpleCharFeatures(text_doc)
print(f"Text: {features.text}")
print(f"Text ID: {features.text_id}")
print(features.cdf.head())

TextFeatures

class TextFeatures

Class for analyzing text-level features and patterns.

Features: - Character-level analysis - Token-level analysis
- Pattern detection - Statistical measures

Example:

# Text-level feature analysis
text = "Hello, World! This is a test sentence."
features = masking_tokenizer.TextFeatures(text)

# Analyze various text properties
print(f"Character count: {len(text)}")
print(f"Word count: {len(text.split())}")
print(f"Contains punctuation: {any(not c.isalnum() and not c.isspace() for c in text)}")

Tokenization Functions

Character-Level Tokenization

def tokenize_characters(
    text: str,
    include_whitespace: bool = True,
    include_punctuation: bool = True,
    pad_length: int = 0
) -> List[str]

Tokenize text into individual characters with optional filtering.

Parameters: - text (str): Input text to tokenize - include_whitespace (bool, default=True): Include whitespace characters - include_punctuation (bool, default=True): Include punctuation characters - pad_length (int, default=0): Add padding characters

Returns: List of character tokens

Example:

from dataknobs_xization import tokenization

text = "Hello, World!"
chars = tokenization.tokenize_characters(text)
print(chars)  # ['H', 'e', 'l', 'l', 'o', ',', ' ', 'W', 'o', 'r', 'l', 'd', '!']

# Without punctuation
chars_no_punct = tokenization.tokenize_characters(text, include_punctuation=False)
print(chars_no_punct)  # ['H', 'e', 'l', 'l', 'o', ' ', 'W', 'o', 'r', 'l', 'd']

Word-Level Tokenization

def tokenize_words(
    text: str,
    lowercase: bool = False,
    remove_punctuation: bool = True,
    split_pattern: str = None
) -> List[str]

Tokenize text into words with various preprocessing options.

Parameters: - text (str): Input text to tokenize - lowercase (bool, default=False): Convert to lowercase - remove_punctuation (bool, default=True): Remove punctuation - split_pattern (str, optional): Custom regex pattern for splitting

Returns: List of word tokens

Example:

text = "Hello, World! How are you today?"
words = tokenization.tokenize_words(text)
print(words)  # ['Hello', 'World', 'How', 'are', 'you', 'today']

# With lowercase
words_lower = tokenization.tokenize_words(text, lowercase=True)
print(words_lower)  # ['hello', 'world', 'how', 'are', 'you', 'today']

Sentence-Level Tokenization

def tokenize_sentences(
    text: str,
    sentence_endings: List[str] = None
) -> List[str]

Tokenize text into sentences.

Parameters: - text (str): Input text to tokenize - sentence_endings (List[str], optional): Custom sentence ending patterns

Returns: List of sentence tokens

Example:

text = "Hello world. How are you? I'm fine!"
sentences = tokenization.tokenize_sentences(text)
print(sentences)  # ['Hello world.', 'How are you?', "I'm fine!"]

Feature Extraction

Character Features

def extract_character_features(text: str) -> pd.DataFrame

Extract detailed features for each character in the text.

Features Extracted: - Character type (alphabetic, numeric, punctuation, whitespace) - Case information (upper, lower, title) - Unicode category - Position information - Emoji detection

Example:

import pandas as pd
from dataknobs_xization import tokenization

text = "Hello, 123! 👋"
char_features = tokenization.extract_character_features(text)

print(char_features.head())
# Output includes columns: char, position, is_alpha, is_digit, is_upper, is_lower, etc.

Token Features

def extract_token_features(
    tokens: List[str],
    include_position: bool = True,
    include_length: bool = True,
    include_case: bool = True
) -> pd.DataFrame

Extract features for a list of tokens.

Parameters: - tokens (List[str]): List of tokens to analyze - include_position (bool): Include position information - include_length (bool): Include length information - include_case (bool): Include case information

Returns: DataFrame with token features

Example:

tokens = ["Hello", "world", "123", "!"] 
token_features = tokenization.extract_token_features(tokens)
print(token_features)

Advanced Tokenization

Subword Tokenization

def subword_tokenize(
    text: str,
    method: str = "bpe",
    vocab_size: int = 10000
) -> List[str]

Perform subword tokenization using various algorithms.

Parameters: - text (str): Input text - method (str): Tokenization method ("bpe", "wordpiece") - vocab_size (int): Vocabulary size for training

Returns: List of subword tokens

Example:

text = "tokenization"
subwords = tokenization.subword_tokenize(text, method="bpe")
print(subwords)  # ['token', 'ization'] or similar

N-gram Generation

def generate_ngrams(
    tokens: List[str],
    n: int,
    pad_start: bool = False,
    pad_end: bool = False
) -> List[Tuple[str, ...]]

Generate n-grams from a list of tokens.

Parameters: - tokens (List[str]): Input tokens - n (int): N-gram size - pad_start (bool): Add start padding - pad_end (bool): Add end padding

Returns: List of n-gram tuples

Example:

tokens = ["hello", "world", "how", "are", "you"]
bigrams = tokenization.generate_ngrams(tokens, 2)
print(bigrams)  # [('hello', 'world'), ('world', 'how'), ('how', 'are'), ('are', 'you')]

trigrams = tokenization.generate_ngrams(tokens, 3)
print(trigrams)  # [('hello', 'world', 'how'), ('world', 'how', 'are'), ('how', 'are', 'you')]

Integration with Document Processing

Document Tokenization

def tokenize_document(
    document: dk_doc.Document,
    level: str = "word",
    preserve_metadata: bool = True
) -> dk_doc.Document

Tokenize a document while preserving its structure and metadata.

Parameters: - document (dk_doc.Document): Input document - level (str): Tokenization level ("char", "word", "sentence") - preserve_metadata (bool): Keep original metadata

Returns: Tokenized document

Example:

from dataknobs_structures import document as dk_doc
from dataknobs_xization import tokenization

# Create document
doc = dk_doc.Document(
    text="Hello world. How are you?",
    metadata={"title": "Greeting", "author": "Alice"}
)

# Tokenize at word level
tokenized_doc = tokenization.tokenize_document(doc, level="word")
print(tokenized_doc.text)  # Tokenized version
print(tokenized_doc.metadata)  # Original metadata preserved

Batch Document Processing

def batch_tokenize_documents(
    documents: List[dk_doc.Document],
    tokenizer_config: Dict[str, Any]
) -> List[dk_doc.Document]

Tokenize multiple documents efficiently.

Example:

documents = [doc1, doc2, doc3]
config = {
    "level": "word",
    "lowercase": True,
    "remove_punctuation": True
}

tokenized_docs = tokenization.batch_tokenize_documents(documents, config)

Usage Patterns

Text Preprocessing Pipeline

from dataknobs_xization import tokenization, normalize
from dataknobs_utils import file_utils

def preprocess_text_pipeline(text: str) -> Dict[str, Any]:
    """Complete text preprocessing pipeline."""
    # Normalize text first
    normalized = normalize.basic_normalization_fn(text)

    # Tokenize at different levels
    chars = tokenization.tokenize_characters(normalized)
    words = tokenization.tokenize_words(normalized, lowercase=True)
    sentences = tokenization.tokenize_sentences(normalized)

    # Extract features
    char_features = tokenization.extract_character_features(normalized)
    token_features = tokenization.extract_token_features(words)

    # Generate n-grams
    bigrams = tokenization.generate_ngrams(words, 2)
    trigrams = tokenization.generate_ngrams(words, 3)

    return {
        "original": text,
        "normalized": normalized,
        "tokens": {
            "characters": chars,
            "words": words,
            "sentences": sentences
        },
        "features": {
            "character": char_features,
            "token": token_features
        },
        "ngrams": {
            "bigrams": bigrams,
            "trigrams": trigrams
        }
    }

# Process text
text = "Hello, World! This is a test sentence."
result = preprocess_text_pipeline(text)
print(f"Words: {result['tokens']['words']}")
print(f"Bigrams: {result['ngrams']['bigrams'][:3]}")

Document Analysis

from dataknobs_xization import tokenization
from dataknobs_structures import document as dk_doc
from collections import Counter

def analyze_document_tokens(document: dk_doc.Document) -> Dict[str, Any]:
    """Analyze tokenization patterns in a document."""
    text = document.text

    # Get different token types
    words = tokenization.tokenize_words(text, lowercase=True)
    chars = tokenization.tokenize_characters(text)

    # Analyze patterns
    word_freq = Counter(words)
    char_freq = Counter(chars)

    # Extract features
    char_features = tokenization.extract_character_features(text)

    # Calculate statistics
    stats = {
        "word_count": len(words),
        "unique_words": len(word_freq),
        "char_count": len(chars),
        "avg_word_length": sum(len(w) for w in words) / len(words) if words else 0,
        "most_common_words": word_freq.most_common(10),
        "character_types": {
            "alpha": char_features['is_alpha'].sum(),
            "digit": char_features['is_digit'].sum(),
            "space": char_features['is_space'].sum(),
            "punct": char_features['is_punct'].sum()
        }
    }

    return stats

# Analyze document
doc = dk_doc.Document("The quick brown fox jumps over the lazy dog.")
analysis = analyze_document_tokens(doc)
print(f"Word count: {analysis['word_count']}")
print(f"Unique words: {analysis['unique_words']}")
print(f"Most common: {analysis['most_common_words'][:5]}")

Custom Tokenizer

from dataknobs_xization import tokenization
import re

class CustomTokenizer:
    """Custom tokenizer with domain-specific rules."""

    def __init__(self, preserve_entities: bool = True):
        self.preserve_entities = preserve_entities
        self.entity_pattern = re.compile(r'@\w+|#\w+|https?://\S+')

    def tokenize(self, text: str) -> List[str]:
        """Tokenize text with entity preservation."""
        if self.preserve_entities:
            # Find entities first
            entities = self.entity_pattern.findall(text)
            entity_placeholder = "__ENTITY__"

            # Replace entities with placeholders
            processed_text = self.entity_pattern.sub(entity_placeholder, text)

            # Regular tokenization
            tokens = tokenization.tokenize_words(processed_text, lowercase=True)

            # Restore entities
            entity_iter = iter(entities)
            final_tokens = []
            for token in tokens:
                if token == entity_placeholder:
                    try:
                        final_tokens.append(next(entity_iter))
                    except StopIteration:
                        final_tokens.append(token)
                else:
                    final_tokens.append(token)

            return final_tokens
        else:
            return tokenization.tokenize_words(text, lowercase=True)

# Usage
tokenizer = CustomTokenizer()
text = "Check out @username and visit https://example.com #hashtag"
tokens = tokenizer.tokenize(text)
print(tokens)  # Preserves @username, URL, and #hashtag

Performance Considerations

  • Character-level tokenization is memory intensive for large texts
  • Use generators for processing large document collections
  • Consider chunking very large texts for feature extraction
  • Cache tokenization results for repeated processing
  • Use appropriate data types for feature storage

Error Handling

from dataknobs_xization import tokenization

def safe_tokenization(text: str, method: str = "word") -> List[str]:
    """Safely tokenize text with error handling."""
    try:
        if not text or not isinstance(text, str):
            return []

        if method == "word":
            return tokenization.tokenize_words(text)
        elif method == "char":
            return tokenization.tokenize_characters(text)
        elif method == "sentence":
            return tokenization.tokenize_sentences(text)
        else:
            raise ValueError(f"Unknown tokenization method: {method}")

    except Exception as e:
        print(f"Tokenization failed: {e}")
        return []

# Safe usage
tokens = safe_tokenization("Hello world", "word")
print(tokens)

Integration Examples

With Tree Structures

from dataknobs_xization import tokenization
from dataknobs_structures import Tree

# Build token tree
def build_token_tree(text: str) -> Tree:
    root = Tree("document")

    sentences = tokenization.tokenize_sentences(text)
    for i, sentence in enumerate(sentences):
        sent_node = root.add_child(f"sentence_{i}")

        words = tokenization.tokenize_words(sentence)
        for j, word in enumerate(words):
            word_node = sent_node.add_child(f"word_{j}")

            chars = tokenization.tokenize_characters(word)
            for k, char in enumerate(chars):
                word_node.add_child(char)

    return root

text = "Hello world. How are you?"
token_tree = build_token_tree(text)
print(token_tree.as_string(multiline=True))

With File Processing

from dataknobs_xization import tokenization
from dataknobs_utils import file_utils
import json

# Process files and extract tokens
def process_text_files(input_dir: str, output_dir: str):
    for filepath in file_utils.filepath_generator(input_dir):
        if filepath.endswith('.txt'):
            # Read file content
            content_lines = list(file_utils.fileline_generator(filepath))
            full_text = '\n'.join(content_lines)

            # Tokenize
            words = tokenization.tokenize_words(full_text, lowercase=True)
            sentences = tokenization.tokenize_sentences(full_text)

            # Save results
            result = {
                'filename': filepath,
                'word_count': len(words),
                'sentence_count': len(sentences),
                'words': words,
                'sentences': sentences
            }

            output_file = f"{output_dir}/{os.path.basename(filepath)}.json"
            with open(output_file, 'w') as f:
                json.dump(result, f, indent=2)

process_text_files('/input/texts', '/output/tokens')

The tokenization module provides comprehensive text tokenization capabilities that integrate seamlessly with other dataknobs components for complete text processing pipelines.