dataknobs-xization API Reference¶
Complete API documentation for the dataknobs_xization package.
💡 Quick Links:

- Complete API Documentation - Full auto-generated reference
- Source Code - Browse on GitHub
- Package Guide - Detailed documentation
Package Information¶
- Package Name: dataknobs_xization
- Version: 1.0.0
- Description: Text normalization and tokenization tools
- Python Requirements: >=3.8
Installation¶
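No install command is shown on this page; assuming the package is published under the hyphenated distribution name from the title (an assumption — adjust to your package index if it differs):

```shell
pip install dataknobs-xization
```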
Import Statement¶
```python
from dataknobs_xization import (
    annotations,
    authorities,
    lexicon,
    masking_tokenizer,
    normalize,
)

# Import key classes
from dataknobs_xization.masking_tokenizer import CharacterFeatures, TextFeatures
```
Module Documentation¶
normalize¶
Regular Expression Patterns¶
The `normalize` module exposes the following precompiled regular expressions as module attributes:

- `SQUASH_WS_RE`
- `ALL_SYMBOLS_RE`
- `CAMELCASE_LU_RE`
- `CAMELCASE_UL_RE`
- `NON_EMBEDDED_WORD_SYMS_RE`
- `EMBEDDED_SYMS_RE`
- `HYPHEN_SLASH_RE`
- `HYPHEN_ONLY_RE`
- `SLASH_ONLY_RE`
- `PARENTHETICAL_RE`
- `AMPERSAND_RE`
Functions¶
expand_camelcase_fn¶
dataknobs_xization.normalize.expand_camelcase_fn ¶
Expand both "lU" and "UUl" camelcasing to "l U" and "U Ul".
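The expansion can be sketched with two regexes mirroring the documented `CAMELCASE_LU_RE` / `CAMELCASE_UL_RE` attributes. The exact library patterns are assumptions; this is an illustrative reimplementation, not the package's code:

```python
import re

# Assumed sketches of the two documented boundaries (not the library's actual patterns):
CAMELCASE_LU_RE = re.compile(r"([a-z])([A-Z])")       # "lU": lower followed by upper
CAMELCASE_UL_RE = re.compile(r"([A-Z])([A-Z][a-z])")  # "UUl": upper run followed by Ul

def expand_camelcase(text: str) -> str:
    """Insert spaces at camelCase boundaries, e.g. "camelCase" -> "camel Case"."""
    text = CAMELCASE_LU_RE.sub(r"\1 \2", text)
    return CAMELCASE_UL_RE.sub(r"\1 \2", text)
```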
drop_non_embedded_symbols_fn¶
dataknobs_xization.normalize.drop_non_embedded_symbols_fn ¶
drop_embedded_symbols_fn¶
dataknobs_xization.normalize.drop_embedded_symbols_fn ¶
get_hyphen_slash_expansions_fn¶
dataknobs_xization.normalize.get_hyphen_slash_expansions_fn ¶
```python
get_hyphen_slash_expansions_fn(
    text: str,
    subs: List[str] = ("-", " ", ""),
    add_self: bool = True,
    do_split: bool = True,
    min_split_token_len: int = 2,
    hyphen_slash_re: Pattern[str] = HYPHEN_SLASH_RE,
) -> Set[str]
```

Given text whose words may appear hyphenated, slash-delimited, or space-delimited, return the set of potential variations:

- the text as-is (add_self)
- with a hyphen between all words (if '-' in subs)
- with a space between all words (if ' ' in subs)
- with all words squashed together (empty string between, if '' in subs)
- with each word separately (do_split, as long as min_split_token_len is met for all tokens)
Note
- To add a variation with a slash, add '/' to subs.
- To not add any variations with symbols, leave them out of subs and don't add self.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| text | str | The hyphen-worthy snippet of text, either already hyphenated or slash- or space-delimited. | required |
| subs | List[str] | A string of characters or list of strings to insert between tokens. | ('-', ' ', '') |
| add_self | bool | True to include the text itself in the result. | True |
| do_split | bool | True to add split tokens separately. | True |
| min_split_token_len | int | If any of the split tokens fail to meet the min token length, don't add any of the splits. | 2 |
| hyphen_slash_re | Pattern[str] | The regex to identify hyphen/slash to expand. | HYPHEN_SLASH_RE |

Returns:

| Type | Description |
|---|---|
| Set[str] | The set of text variations. |
Source code in packages/xization/src/dataknobs_xization/normalize.py
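To make the behavior concrete, here is a minimal, self-contained sketch of the documented expansion logic — an illustrative reimplementation under assumptions (e.g., the delimiter regex), not the package's source:

```python
import re
from typing import List, Set

HYPHEN_SLASH_RE = re.compile(r"[-/]")  # assumption: tokens delimited by '-' or '/'

def hyphen_slash_expansions(
    text: str,
    subs: List[str] = ("-", " ", ""),
    add_self: bool = True,
    do_split: bool = True,
    min_split_token_len: int = 2,
) -> Set[str]:
    """Generate the documented variations of a hyphen/slash-delimited snippet."""
    result: Set[str] = set()
    if add_self:
        result.add(text)
    tokens = [t for t in HYPHEN_SLASH_RE.split(text) if t]
    if len(tokens) > 1:
        # Join all tokens with each requested substitute ('-', ' ', '', ...)
        for sub in subs:
            result.add(sub.join(tokens))
        # Add each token separately only if every token meets the minimum length
        if do_split and all(len(t) >= min_split_token_len for t in tokens):
            result.update(tokens)
    return result
```

For example, `hyphen_slash_expansions("e-mail")` yields `{"e-mail", "e mail", "email"}`; the split tokens are withheld because `"e"` is shorter than `min_split_token_len`.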
drop_parentheticals_fn¶
dataknobs_xization.normalize.drop_parentheticals_fn ¶
expand_ampersand_fn¶
dataknobs_xization.normalize.expand_ampersand_fn ¶
get_lexical_variations¶
dataknobs_xization.normalize.get_lexical_variations ¶
```python
get_lexical_variations(
    text: str,
    include_self: bool = True,
    expand_camelcase: bool = True,
    drop_non_embedded_symbols: bool = True,
    drop_embedded_symbols: bool = True,
    spacify_embedded_symbols: bool = False,
    do_hyphen_expansion: bool = True,
    hyphen_subs: List[str] = (" ", ""),
    do_hyphen_split: bool = True,
    min_hyphen_split_token_len: int = 2,
    do_slash_expansion: bool = True,
    slash_subs: List[str] = (" ", " or "),
    do_slash_split: bool = True,
    min_slash_split_token_len: int = 1,
    drop_parentheticals: bool = True,
    expand_ampersands: bool = True,
    add_eng_plurals: bool = True,
) -> Set[str]
```
Get all variations for the text (including the text itself).
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| text | str | The text to generate variations for. | required |
| include_self | bool | True to include the original text in the result. | True |
| expand_camelcase | bool | True to expand camelCase text. | True |
| drop_non_embedded_symbols | bool | True to drop symbols not embedded in words. | True |
| drop_embedded_symbols | bool | True to drop symbols embedded in words. | True |
| spacify_embedded_symbols | bool | True to replace embedded symbols with spaces. | False |
| do_hyphen_expansion | bool | True to expand hyphenated text. | True |
| hyphen_subs | List[str] | List of strings to substitute for hyphens. | (' ', '') |
| do_hyphen_split | bool | True to split on hyphens. | True |
| min_hyphen_split_token_len | int | Minimum token length for hyphen splits. | 2 |
| do_slash_expansion | bool | True to expand slashes. | True |
| slash_subs | List[str] | List of strings to substitute for slashes. | (' ', ' or ') |
| do_slash_split | bool | True to split on slashes. | True |
| min_slash_split_token_len | int | Minimum token length for slash splits. | 1 |
| drop_parentheticals | bool | True to drop parenthetical expressions. | True |
| expand_ampersands | bool | True to expand ampersands to ' and '. | True |
| add_eng_plurals | bool | True to add English plural forms. | True |

Returns:

| Type | Description |
|---|---|
| Set[str] | The set of all text variations. |
Source code in packages/xization/src/dataknobs_xization/normalize.py
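Among these options, add_eng_plurals is the least self-explanatory. Here is a naive sketch of English pluralization — the rules shown are an assumption; the package's actual logic may differ:

```python
from typing import Set

def add_eng_plurals(terms: Set[str]) -> Set[str]:
    """Return the terms plus naively generated English plural forms."""
    out = set(terms)
    for t in terms:
        if t.endswith(("s", "x", "z", "ch", "sh")):
            out.add(t + "es")        # box -> boxes
        elif t.endswith("y") and len(t) > 1 and t[-2] not in "aeiou":
            out.add(t[:-1] + "ies")  # city -> cities
        else:
            out.add(t + "s")         # cat -> cats
    return out
```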
masking_tokenizer¶
Classes¶
CharacterFeatures¶
dataknobs_xization.masking_tokenizer.CharacterFeatures ¶
Bases: ABC
Class representing features of text as a dataframe with each character as a row and columns representing character features.
Initialize with the text to tokenize.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| doctext | Union[Text, str] | The text to tokenize (or dk_doc.Text with its metadata). | required |
| roll_padding | int | The number of pad characters added to each end of the text. | 0 |

Attributes:

| Name | Type | Description |
|---|---|---|
| cdf | DataFrame | The character dataframe with each padded text character as a row. |
| doctext | Text | |
| text_col | str | The name of the cdf column holding the text characters. |
| text | str | The text string. |
| text_id | Any | The ID of the text. |
Source code in packages/xization/src/dataknobs_xization/masking_tokenizer.py
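The core idea — one row per character, one boolean column per feature — can be sketched with plain pandas. This is illustrative only; the actual cdf columns, padding, and metadata handling are defined by the package:

```python
import pandas as pd

def char_feature_frame(text: str) -> pd.DataFrame:
    """Build a per-character feature dataframe: one row per character."""
    return pd.DataFrame({
        "char": list(text),
        "alpha": [c.isalpha() for c in text],
        "digit": [c.isdigit() for c in text],
        "upper": [c.isupper() for c in text],
        "lower": [c.islower() for c in text],
    })
```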
TextFeatures¶
dataknobs_xization.masking_tokenizer.TextFeatures ¶
```python
TextFeatures(
    doctext: Union[Text, str],
    split_camelcase: bool = True,
    mark_alpha: bool = False,
    mark_digit: bool = False,
    mark_upper: bool = False,
    mark_lower: bool = False,
    emoji_data: EmojiData = None,
)
```
Bases: CharacterFeatures
Extracts text-specific character features for tokenization.
Extends CharacterFeatures to provide text tokenization with support for camelCase splitting, character type features (alpha, digit, upper, lower), and emoji handling. Builds a character DataFrame with features for token boundary detection.
Initialize with text tokenization parameters.
Note
If emoji_data is non-null:

- emojis will be treated as text (instead of as non-text)
- if split_camelcase is True, each emoji will be in its own token
- otherwise, each sequence of (adjacent) emojis will be treated as a single token
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| doctext | Union[Text, str] | The text to tokenize with its metadata. | required |
| split_camelcase | bool | True to mark camel-case features. | True |
| mark_alpha | bool | True to mark alpha features (separate from alnum). | False |
| mark_digit | bool | True to mark digit features (separate from alnum). | False |
| mark_upper | bool | True to mark upper features (auto-included for camel-case). | False |
| mark_lower | bool | True to mark lower features (auto-included for camel-case). | False |
| emoji_data | EmojiData | An EmojiData instance to mark emoji BIO features. | None |

Methods:

| Name | Description |
|---|---|
| build_first_token | Build the first token as the start of tokenization. |

Attributes:

| Name | Type | Description |
|---|---|---|
| cdf | DataFrame | The character dataframe with each padded text character as a row. |

Source code in packages/xization/src/dataknobs_xization/masking_tokenizer.py
Functions¶
build_first_token ¶
Build the first token as the start of tokenization.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| normalize_fn | Callable[[str], str] | A function to normalize a raw text term or any of its variations. If None, then the identity function is used. | required |

Returns:

| Type | Description |
|---|---|
| Token | The first text token. |
Source code in packages/xization/src/dataknobs_xization/masking_tokenizer.py
annotations¶
Functions and Classes¶
dataknobs_xization.annotations ¶
Text annotation data structures and interfaces.
Provides classes for managing text annotations with metadata, including position tracking, annotation types, and derived annotation columns.
Classes:

| Name | Description |
|---|---|
| AnnotatedText | A Text object that manages its own annotations. |
| Annotations | DAO for collecting and managing a table of annotations, where each row carries annotation information for an input token. |
| AnnotationsBuilder | A class for building annotations. |
| AnnotationsGroup | Container for annotation rows that belong together as a (consistent) group. |
| AnnotationsGroupList | Container for a list of annotation groups. |
| AnnotationsMetaData | Container for annotations meta-data, identifying key column names. |
| AnnotationsRowAccessor | A class that accesses row data according to the metadata and derived cols. |
| Annotator | Class for annotating text. |
| AnnotatorKernel | Class for encapsulating core annotation logic for multiple annotators. |
| BasicAnnotator | Class for extracting basic (possibly multi-level or multi-part) entities. |
| CompoundAnnotator | Class to apply a series of annotators through an AnnotatorKernel. |
| DerivedAnnotationColumns | Interface for injecting derived columns into AnnotationsMetaData. |
| EntityAnnotator | Class for extracting single (possibly multi-level or multi-part) entities. |
| HtmlHighlighter | Helper class to add HTML markup for highlighting spans of text. |
| MergeStrategy | A merge strategy to be injected based on entity types being merged. |
| OverlapGroupIterator | Given: |
| PositionalAnnotationsGroup | Container for annotations that either overlap with each other or don't. |
| RowData | A wrapper for an annotation row (pd.Series) to facilitate e.g., grouping. |
| SyntacticParser | Class for creating syntactic annotations for an input. |

Functions:

| Name | Description |
|---|---|
| merge | Merge the overlapping groups according to the given strategy. |
Classes¶
AnnotatedText ¶
```python
AnnotatedText(
    text_str: str,
    metadata: TextMetaData = None,
    annots: Annotations = None,
    bookmarks: Dict[str, DataFrame] = None,
    text_obj: Text = None,
    annots_metadata: AnnotationsMetaData = None,
)
```
Bases: Text
A Text object that manages its own annotations.
Initialize AnnotatedText.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| text_str | str | The text string. | required |
| metadata | TextMetaData | The text's metadata. | None |
| annots | Annotations | The annotations. | None |
| bookmarks | Dict[str, DataFrame] | The annotation bookmarks. | None |
| text_obj | Text | A text_obj to override text_str and metadata initialization. | None |
| annots_metadata | AnnotationsMetaData | Override for default annotations metadata (NOTE: ineffectual if an annots instance is provided). | None |

Methods:

| Name | Description |
|---|---|
| add_annotations | Add the annotations to this instance. |
| get_annot_mask | Get a True/False series marking annotated start-to-end positions. |
| get_text | Get the text object's string, masking if indicated. |
| get_text_series | Get the input text as a (padded) pandas series. |

Attributes:

| Name | Type | Description |
|---|---|---|
| annotations | Annotations | Get this object's annotations. |
| bookmarks | Dict[str, DataFrame] | Get this object's bookmarks. |
Source code in packages/xization/src/dataknobs_xization/annotations.py
Functions¶
add_annotations ¶
Add the annotations to this instance.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| annotations | Annotations | The annotations to add. | required |
Source code in packages/xization/src/dataknobs_xization/annotations.py
get_annot_mask ¶
```python
get_annot_mask(
    annot_col: str,
    pad_len: int = 0,
    annot_df: DataFrame = None,
    text: str = None,
) -> pd.Series
```

Get a True/False series for the input such that start to end positions for rows where the annotation column is non-null and non-empty are True.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| annot_col | str | The annotation column identifying chars to mask. | required |
| pad_len | int | The number of characters to pad the mask with False values at both the front and back. | 0 |
| annot_df | DataFrame | Override annotations dataframe. | None |
| text | str | Override text. | None |

Returns:

| Type | Description |
|---|---|
| Series | A pandas Series where annotated input character positions are True and non-annotated positions are False. |
Source code in packages/xization/src/dataknobs_xization/annotations.py
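The masking idea can be sketched independently of the class. This is a hypothetical standalone helper over explicit (start, end) spans; the real method reads the spans from the annotations dataframe:

```python
from typing import Iterable, Tuple
import pandas as pd

def annot_mask(
    text: str,
    spans: Iterable[Tuple[int, int]],
    pad_len: int = 0,
) -> pd.Series:
    """True at annotated character positions, False elsewhere (with padding)."""
    mask = [False] * (len(text) + 2 * pad_len)
    for start, end in spans:
        for i in range(start, end):
            mask[i + pad_len] = True
    return pd.Series(mask)
```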
get_text ¶
Get the text object's string, masking if indicated.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| annot2mask | Dict[str, str] | Mapping from annotation column (e.g., _num or _recsnum) to the replacement character(s) in the input text for masking already managed input. | None |
| annot_df | DataFrame | Override annotations dataframe. | None |
| text | str | Override text. | None |

Returns:

| Type | Description |
|---|---|
| str | The (masked) text. |
Source code in packages/xization/src/dataknobs_xization/annotations.py
get_text_series ¶
Get the input text as a (padded) pandas series.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| pad_len | int | The number of spaces to pad both front and back. | 0 |
| text | str | Override text. | None |

Returns:

| Type | Description |
|---|---|
| Series | The (padded) pandas series of input characters. |
Source code in packages/xization/src/dataknobs_xization/annotations.py
Annotations ¶
DAO for collecting and managing a table of annotations, where each row carries annotation information for an input token.
The data in this class is maintained either as a list of dicts, each dict representing a "row," or as a pandas DataFrame, depending on the latest access. Changes in either the lists or dataframe will be reflected in the alternate data structure.
Construct as empty or initialize with the dataframe form.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| metadata | AnnotationsMetaData | The annotations metadata. | required |
| df | DataFrame | A dataframe with annotation records. | None |

Methods:

| Name | Description |
|---|---|
| add_df | Add (concatenate) the annotation dataframe to the current annotations. |
| add_dict | Add the annotation dict. |
| add_dicts | Add the annotation dicts. |
| clear | Clear/empty out all annotations, returning the annotations df. |
| set_df | Set (or reset) this annotation's dataframe. |

Attributes:

| Name | Type | Description |
|---|---|---|
| ann_row_dicts | List[Dict[str, Any]] | Get the annotations as a list of dictionaries. |
| df | DataFrame | Get the annotations as a pandas dataframe. |
Source code in packages/xization/src/dataknobs_xization/annotations.py
Attributes¶
ann_row_dicts (property)¶
Get the annotations as a list of dictionaries.
Functions¶
add_df ¶
Add (concatenate) the annotation dataframe to the current annotations.
add_dict ¶
add_dicts ¶
clear ¶
Clear/empty out all annotations, returning the annotations df.
set_df ¶
Set (or reset) this annotation's dataframe.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | The new annotations dataframe. | required |
AnnotationsBuilder ¶
A class for building annotations.
Initialize AnnotationsBuilder.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| metadata | AnnotationsMetaData | The annotations metadata. | required |
| data_defaults | Dict[str, Any] | Dict[ann_colname, default_value] with default values for annotation columns. | required |

Methods:

| Name | Description |
|---|---|
| build_annotation_row | Build an annotation row with the mandatory key values and those from the remaining keyword arguments. |
| do_build_row | Do the row building with the key fields, followed by data defaults, followed by any extra kwargs. |

Source code in packages/xization/src/dataknobs_xization/annotations.py
Functions¶
build_annotation_row ¶

```python
build_annotation_row(
    start_pos: int, end_pos: int, text: str, ann_type: str, **kwargs: Any
) -> Dict[str, Any]
```

Build an annotation row with the mandatory key values and those from the remaining keyword arguments.
For those kwargs whose names match metadata column names, override the data_defaults and add remaining data_default attributes.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| start_pos | int | The token start position. | required |
| end_pos | int | The token end position. | required |
| text | str | The token text. | required |
| ann_type | str | The annotation type. | required |
| **kwargs | Any | Additional keyword arguments for extra annotation fields. | {} |

Returns:

| Type | Description |
|---|---|
| Dict[str, Any] | The result row dictionary. |
Source code in packages/xization/src/dataknobs_xization/annotations.py
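The precedence the docstring describes — key fields first, then data defaults, with matching kwargs overriding the defaults — can be sketched as a hypothetical standalone helper (the `data_defaults` parameter and key-column names here are assumptions; in the class, defaults come from the builder's state and metadata):

```python
from typing import Any, Dict

def build_annotation_row(
    start_pos: int,
    end_pos: int,
    text: str,
    ann_type: str,
    data_defaults: Dict[str, Any] = None,
    **kwargs: Any,
) -> Dict[str, Any]:
    """Key fields first, then defaults, then kwargs (kwargs win on conflicts)."""
    row = {
        "start_pos": start_pos,
        "end_pos": end_pos,
        "text": text,
        "ann_type": ann_type,
    }
    row.update(data_defaults or {})
    row.update(kwargs)  # kwargs override matching data_defaults
    return row
```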
do_build_row ¶
Do the row building with the key fields, followed by data defaults, followed by any extra kwargs.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| key_fields | Dict[str, Any] | The dictionary of key fields. | required |
| **kwargs | Any | Any extra fields to add. | {} |

Returns:

| Type | Description |
|---|---|
| Dict[str, Any] | The constructed row dictionary. |
Source code in packages/xization/src/dataknobs_xization/annotations.py
AnnotationsGroup ¶
```python
AnnotationsGroup(
    row_accessor: AnnotationsRowAccessor,
    field_col_type: str,
    accept_fn: Callable[[AnnotationsGroup, RowData], bool],
    group_type: str = None,
    group_num: int = None,
    valid: bool = True,
    autolock: bool = False,
)
```
Container for annotation rows that belong together as a (consistent) group.
NOTE: An instance will only accept rows on condition of consistency per its acceptance function.
Initialize AnnotationsGroup.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| row_accessor | AnnotationsRowAccessor | The annotations row_accessor. | required |
| field_col_type | str | The col_type for the group field_type for retrieval using the annotations row accessor. | required |
| accept_fn | Callable[[AnnotationsGroup, RowData], bool] | A fn(g, row_data) that returns True to accept the row data into this group g, or False to reject the row. If None, then all rows are always accepted. | required |
| group_type | str | An optional (override) type for identifying this group. | None |
| group_num | int | An optional number for identifying this group. | None |
| valid | bool | True if the group is valid, or False if not. | True |
| autolock | bool | True to automatically lock this group when (1) at least one row has been added and (2) a row is rejected. | False |

Methods:

| Name | Description |
|---|---|
| add | Add the row if the group is not locked and the row belongs in this group. |
| is_subset | Determine whether this group's text is contained within the other's. |
| is_subset_of_any | Determine whether this group is a subset of any of the given groups. |
| remove_row | Remove the row from this group and optionally update the annotations accordingly. |
| to_dict | Get this group (record) as a dictionary of field type to text values. |

Attributes:

| Name | Type | Description |
|---|---|---|
| ann_type | str | Get this record's annotation type. |
| autolock | bool | Get whether this group is currently set to autolock. |
| df | DataFrame | Get this group as a dataframe. |
| group_num | int | Get this group's number. |
| group_type | str | Get this group's type: an "override" value, or the "ann_type" of the first row added. |
| is_locked | bool | Get whether this group is locked from adding more rows. |
| is_valid | bool | Get whether this group is currently marked as valid. |
| key | str | A hash key for this group. |
| size | int | Get the number of rows in this group. |
Source code in packages/xization/src/dataknobs_xization/annotations.py
Attributes¶
group_type (property, writable)¶
Get this group's type, which is either an "override" value that has been set, or the "ann_type" value of the first row added.
is_locked (property, writable)¶
Get whether this group is locked from adding more rows.
Functions¶
add ¶
Add the row if the group is not locked and the row belongs in this group, or return False.
If autolock is True and a row fails to be added (after the first row has been added), "lock" the group and refuse to accept any more rows.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| rowdata | RowData | The row to add. | required |

Returns:

| Type | Description |
|---|---|
| bool | True if the row belongs and was added; otherwise, False. |
Source code in packages/xization/src/dataknobs_xization/annotations.py
is_subset ¶
Determine whether this group's text is contained within the other's.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| other | AnnotationsGroup | The other group. | required |

Returns:

| Type | Description |
|---|---|
| bool | True if this group's text is contained within the other group. |

Source code in packages/xization/src/dataknobs_xization/annotations.py
is_subset_of_any ¶
Determine whether this group is a subset of any of the given groups.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| groups | List[AnnotationsGroup] | List of annotation groups. | required |

Returns:

| Type | Description |
|---|---|
| AnnotationsGroup | The first AnnotationsGroup that this group is a subset of, or None. |

Source code in packages/xization/src/dataknobs_xization/annotations.py
remove_row ¶
Remove the row from this group and optionally update the annotations accordingly.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| row_idx | int | The positional index of the row to remove. | required |

Returns:

| Type | Description |
|---|---|
| RowData | The removed row data instance. |
Source code in packages/xization/src/dataknobs_xization/annotations.py
to_dict ¶
Get this group (record) as a dictionary of field type to text values.
AnnotationsGroupList ¶
```python
AnnotationsGroupList(
    groups: List[AnnotationsGroup] = None,
    accept_fn: Callable[
        [AnnotationsGroupList, AnnotationsGroup], bool
    ] = lambda lst, g: lst.size == 0 or not g.is_subset_of_any(lst.groups),
)
```
Container for a list of annotation groups.
Initialize AnnotationsGroupList.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| groups | List[AnnotationsGroup] | The initial groups for this list. | None |
| accept_fn | Callable[[AnnotationsGroupList, AnnotationsGroup], bool] | A fn(lst, g) that returns True to accept the group, g, into this list, lst, or False to reject the group. If None, then all groups are always accepted. The default function will reject any group that is a subset of any existing group in the list. | lambda lst, g: lst.size == 0 or not g.is_subset_of_any(lst.groups) |

Methods:

| Name | Description |
|---|---|
| add | Add the group if it belongs in this group list or return False. |
| is_subset | Determine whether this group's text spans are contained within all of the other's. |

Attributes:

| Name | Type | Description |
|---|---|---|
| coverage | int | Get the total number of (token) rows covered by the groups. |
| size | int | Get the number of groups in this list. |
Source code in packages/xization/src/dataknobs_xization/annotations.py
Functions¶
add ¶
Add the group if it belongs in this group list or return False.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| group | AnnotationsGroup | The group to add. | required |

Returns:

| Type | Description |
|---|---|
| bool | True if the group belongs and was added; otherwise, False. |

Source code in packages/xization/src/dataknobs_xization/annotations.py
is_subset ¶
Determine whether this group's text spans are contained within all of the other's.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| other | AnnotationsGroupList | The other group list. | required |

Returns:

| Type | Description |
|---|---|
| bool | True if this group list is a subset of the other group list. |
Source code in packages/xization/src/dataknobs_xization/annotations.py
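The span-containment test behind is_subset can be sketched over plain (start, end) tuples. This is illustrative; the real classes compare annotation rows, not bare tuples:

```python
from typing import Iterable, Tuple

Span = Tuple[int, int]

def spans_subset(spans_a: Iterable[Span], spans_b: Iterable[Span]) -> bool:
    """True if every span in spans_a lies within some span in spans_b."""
    spans_b = list(spans_b)  # allow repeated iteration
    return all(
        any(b_start <= a_start and a_end <= b_end for b_start, b_end in spans_b)
        for a_start, a_end in spans_a
    )
```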
AnnotationsMetaData ¶
```python
AnnotationsMetaData(
    start_pos_col: str = KEY_START_POS_COL,
    end_pos_col: str = KEY_END_POS_COL,
    text_col: str = KEY_TEXT_COL,
    ann_type_col: str = KEY_ANN_TYPE_COL,
    sort_fields: List[str] = (KEY_START_POS_COL, KEY_END_POS_COL),
    sort_fields_ascending: List[bool] = (True, False),
    **kwargs: Any,
)
```
Bases: MetaData
Container for annotations meta-data, identifying key column names.
NOTE: this object contains only information about annotation column names and not annotation table values.
Initialize with key (and more) column names and info.
Key column types
- start_pos
- end_pos
- text
- ann_type
Note
Actual table columns can be named arbitrarily, BUT interactions through annotations classes and interfaces relating to the "key" columns must use the key column constants.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| start_pos_col | str | Col name for the token starting position. | KEY_START_POS_COL |
| end_pos_col | str | Col name for the token ending position. | KEY_END_POS_COL |
| text_col | str | Col name for the token text. | KEY_TEXT_COL |
| ann_type_col | str | Col name for the annotation types. | KEY_ANN_TYPE_COL |
| sort_fields | List[str] | The col types relevant for sorting annotation rows. | (KEY_START_POS_COL, KEY_END_POS_COL) |
| sort_fields_ascending | List[bool] | To specify sort order of sort_fields. | (True, False) |
| **kwargs | Any | More column types mapped to column names. | {} |

Methods:

| Name | Description |
|---|---|
| get_col | Get the name of the column having the given type, or get the missing value. |
| sort_df | Sort an annotations dataframe according to this metadata. |

Attributes:

| Name | Type | Description |
|---|---|---|
| ann_type_col | str | Get the column name for the token annotation type. |
| end_pos_col | str | Get the column name for the token ending position. |
| start_pos_col | str | Get the column name for the token starting position. |
| text_col | str | Get the column name for the token text. |
Source code in packages/xization/src/dataknobs_xization/annotations.py
Functions¶
get_col ¶
Get the name of the column having the given type (including key column types but not derived ones), or get the missing value.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| col_type | str | The type of column name to get. | required |
| missing | str | The value to return for unknown column types. | None |

Returns:

| Type | Description |
|---|---|
| str | The column name or the missing value. |
Source code in packages/xization/src/dataknobs_xization/annotations.py
sort_df ¶
Sort an annotations dataframe according to this metadata.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| an_df | DataFrame | An annotations dataframe. | required |

Returns:

| Type | Description |
|---|---|
| DataFrame | The sorted annotations dataframe. |
Source code in packages/xization/src/dataknobs_xization/annotations.py
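Given the default sort_fields and sort_fields_ascending above (start ascending, then end descending, so longer spans sort first at the same start), the sort can be sketched with pandas. The literal column names here are assumptions; the real method resolves them through the metadata:

```python
import pandas as pd

def sort_annotations(an_df: pd.DataFrame) -> pd.DataFrame:
    """Sort by start_pos ascending, then end_pos descending (longest span first)."""
    return an_df.sort_values(
        by=["start_pos", "end_pos"], ascending=[True, False]
    ).reset_index(drop=True)
```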
AnnotationsRowAccessor ¶
```python
AnnotationsRowAccessor(
    metadata: AnnotationsMetaData, derived_cols: DerivedAnnotationColumns = None
)
```
A class that accesses row data according to the metadata and derived cols.
Initialize AnnotationsRowAccessor.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| metadata | AnnotationsMetaData | The metadata for annotation columns. | required |
| derived_cols | DerivedAnnotationColumns | A DerivedAnnotationColumns instance for injecting derived columns. | None |

Methods:

| Name | Description |
|---|---|
| get_col_value | Get the value of the column in the given row with the given type. |

Source code in packages/xization/src/dataknobs_xization/annotations.py
Functions¶
get_col_value ¶
Get the value of the column in the given row with the given type.
This gets the value from the first existing column in the row from
- The metadata.get_col(col_type) column
- col_type itself
- The columns derived from col_type
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| col_type | str | The type of column value to get. | required |
| row | Series | A row from which to get the value. | required |
| missing | str | The value to return for unknown or missing column. | None |

Returns:

| Type | Description |
|---|---|
| str | The row value or the missing value. |
Source code in packages/xization/src/dataknobs_xization/annotations.py
Annotator ¶
Bases: ABC
Class for annotating text
Initialize Annotator.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| name | str | The name of this annotator. | required |

Methods:

| Name | Description |
|---|---|
| annotate_input | Annotate this instance's text, additively updating its annotations. |
Source code in packages/xization/src/dataknobs_xization/annotations.py
Functions¶
annotate_input (abstractmethod)¶
Annotate this instance's text, additively updating its annotations.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| text_obj | AnnotatedText | The text object to annotate. | required |
| **kwargs | Any | Additional keyword arguments. | {} |

Returns:

| Type | Description |
|---|---|
| Annotations | The annotations added. |
Source code in packages/xization/src/dataknobs_xization/annotations.py
AnnotatorKernel ¶
Bases: ABC
Class for encapsulating core annotation logic for multiple annotators
Methods:

| Name | Description |
|---|---|
| annotate_input | Execute all annotations on the text_obj. |

Attributes:

| Name | Type | Description |
|---|---|---|
| annotators | List[EntityAnnotator] | Get the entity annotators. |
BasicAnnotator ¶
Bases: Annotator
Class for extracting basic (possibly multi-level or multi-part) entities.
Methods:

| Name | Description |
|---|---|
| annotate_input | Annotate the text obj, additively updating the annotations. |
| annotate_text | Build annotations for the text string. |
Source code in packages/xization/src/dataknobs_xization/annotations.py
Functions¶
annotate_input ¶
Annotate the text obj, additively updating the annotations.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| text_obj | AnnotatedText | The text to annotate. | required |
| **kwargs | Any | Additional keyword arguments. | {} |

Returns:

| Type | Description |
|---|---|
| Annotations | The annotations added to the text. |
Source code in packages/xization/src/dataknobs_xization/annotations.py
annotate_text (abstractmethod)¶
Build annotations for the text string.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| text_str | str | The text string to annotate. | required |

Returns:

| Type | Description |
|---|---|
| Annotations | Annotations for the text. |
Source code in packages/xization/src/dataknobs_xization/annotations.py
CompoundAnnotator ¶
Bases: Annotator
Class to apply a series of annotators through an AnnotatorKernel
Initialize with the annotators and this extractor's name.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| kernel | AnnotatorKernel | The annotations kernel to use. | required |
| name | str | The name of this information extractor, used as the annotations base column name. | 'entity' |
Methods:
| Name | Description |
|---|---|
| annotate_input | Annotate the text. |
| get_html_highlighted_text | Get html-highlighted text for the identified input's annotations from the given annotators (or all). |
Source code in packages/xization/src/dataknobs_xization/annotations.py
Functions¶
annotate_input ¶
Annotate the text.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| text_obj | AnnotatedText | The AnnotatedText object to annotate. | required |
| reset | bool | When True, reset and rebuild any existing annotations. | True |
| **kwargs | Any | Additional keyword arguments. | {} |
Returns:
| Type | Description |
|---|---|
| Annotations | The annotations added to the text_obj. |
Source code in packages/xization/src/dataknobs_xization/annotations.py
get_html_highlighted_text ¶
Get html-highlighted text for the identified input's annotations from the given annotators (or all).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| text_obj | AnnotatedText | The input text to highlight. | required |
| annotator_names | List[str] | The subset of annotators to highlight. | None |
Returns:
| Type | Description |
|---|---|
| str | HTML string with highlighted text. |
Source code in packages/xization/src/dataknobs_xization/annotations.py
DerivedAnnotationColumns ¶
Bases: ABC
Interface for injecting derived columns into AnnotationsMetaData.
Methods:
| Name | Description |
|---|---|
| get_col_value | Get the value of the column in the given row derived from col_type. |
Functions¶
get_col_value
abstractmethod
¶
get_col_value(
metadata: AnnotationsMetaData,
col_type: str,
row: Series,
missing: str = None,
) -> str
Get the value of the column in the given row derived from col_type.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| metadata | AnnotationsMetaData | The AnnotationsMetaData. | required |
| col_type | str | The type of column value to derive. | required |
| row | Series | A row from which to get the value. | required |
| missing | str | The value to return for unknown or missing column. | None |
Returns:
| Type | Description |
|---|---|
| str | The row value or the missing value. |
Source code in packages/xization/src/dataknobs_xization/annotations.py
EntityAnnotator ¶
Bases: BasicAnnotator
Class for extracting single (possibly multi-level or multi-part) entities.
Initialize EntityAnnotator.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| name | str | The name of this annotator. | required |
| mask_char | str | The character to use to mask out previously annotated spans of this annotator's text. | ' ' |
Methods:
| Name | Description |
|---|---|
| annotate_input | Annotate the text object (optionally) after masking out previously annotated spans. |
| compose_groups | Compose annotation rows into groups. |
| mark_records | Collect and mark annotation records. |
| validate_records | Validate annotated records. |
Attributes:
| Name | Type | Description |
|---|---|---|
| annotation_cols | Set[str] | Report the (final group or record) annotation columns that are filled by this annotator. |
| highlight_fieldstyles | Dict[str, Dict[str, Dict[str, str]]] | Get highlight field styles for this annotator's annotations. |
Source code in packages/xization/src/dataknobs_xization/annotations.py
Attributes¶
annotation_cols
abstractmethod
property
¶
Report the (final group or record) annotation columns that are filled by this annotator when its entities are annotated.
highlight_fieldstyles
abstractmethod
property
¶
Get highlight field styles for this annotator's annotations of the form:
{
Functions¶
annotate_input ¶
annotate_input(
text_obj: AnnotatedText,
annot_mask_cols: Set[str] = None,
merge_strategies: Dict[str, MergeStrategy] = None,
largest_only: bool = True,
**kwargs: Any,
) -> Annotations
Annotate the text object (optionally) after masking out previously annotated spans, additively updating the annotations in the text object.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| text_obj | AnnotatedText | The text object to annotate. | required |
| annot_mask_cols | Set[str] | The (possible) previous annotations whose spans to ignore in the text. | None |
| merge_strategies | Dict[str, MergeStrategy] | A dictionary of each input annotation bookmark tag mapped to a merge strategy for merging this annotator's annotations with the bookmarked dataframe. This is useful, for example, when merging syntactic information to refine ambiguities. | None |
| largest_only | bool | True to only mark the largest records. | True |
| **kwargs | Any | Additional keyword arguments. | {} |
Returns:
| Type | Description |
|---|---|
| Annotations | The annotations added to the text object. |
Source code in packages/xization/src/dataknobs_xization/annotations.py
compose_groups
abstractmethod
¶
Compose annotation rows into groups.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| annotations | Annotations | The annotations. | required |
Returns:
| Type | Description |
|---|---|
| Annotations | The composed annotations. |
Source code in packages/xization/src/dataknobs_xization/annotations.py
mark_records
abstractmethod
¶
Collect and mark annotation records.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| annotations | Annotations | The annotations. | required |
| largest_only | bool | True to only mark (keep) the largest records. | True |
Source code in packages/xization/src/dataknobs_xization/annotations.py
validate_records
abstractmethod
¶
Validate annotated records.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| annotations | Annotations | The annotations. | required |
HtmlHighlighter ¶
HtmlHighlighter(
field2style: Dict[str, Dict[str, str]],
tooltip_class: str = "tooltip",
tooltiptext_class: str = "tooltiptext",
)
Helper class to add HTML markup for highlighting spans of text.
Initialize HtmlHighlighter.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| field2style | Dict[str, Dict[str, str]] | The annotation column to highlight with its associated style, for example: { 'car_model_field': { 'year': {'background-color': 'lightyellow'}, 'make': {'background-color': 'lightgreen'}, 'model': {'background-color': 'cyan'}, 'style': {'background-color': 'magenta'}, }, } | required |
| tooltip_class | str | The css tooltip class. | 'tooltip' |
| tooltiptext_class | str | The css tooltiptext class. | 'tooltiptext' |
Methods:
| Name | Description |
|---|---|
| highlight | Return an html string with the given fields (annotation columns) highlighted with the associated styles. |
Source code in packages/xization/src/dataknobs_xization/annotations.py
Functions¶
highlight ¶
Return an html string with the given fields (annotation columns) highlighted with the associated styles.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| text_obj | AnnotatedText | The annotated text to markup. | required |
Returns:
| Type | Description |
|---|---|
| str | HTML string with highlighted annotations. |
Source code in packages/xization/src/dataknobs_xization/annotations.py
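The field2style mapping drives inline-style markup. A minimal, self-contained sketch of how a css-property dict could become a highlighted HTML span (illustrative only, not the library's implementation):

```python
def style_attr(style: dict) -> str:
    """Render a css property dict as an inline style attribute value."""
    return "; ".join(f"{k}: {v}" for k, v in style.items())

def wrap_span(text: str, style: dict) -> str:
    """Wrap text in a <span> carrying the inline style."""
    return f'<span style="{style_attr(style)}">{text}</span>'

print(wrap_span("2021", {"background-color": "lightyellow"}))
# → <span style="background-color: lightyellow">2021</span>
```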
MergeStrategy ¶
Bases: ABC
A merge strategy to be injected based on entity types being merged.
Methods:
| Name | Description |
|---|---|
| merge | Process the annotations in the given annotations group, returning the result. |
OverlapGroupIterator ¶
Given annotation rows (a dataframe) sorted by:
- start_pos (increasing, for input order), and
- end_pos (decreasing, for longest spans first)

collect overlapping consecutive annotations for processing.
Initialize OverlapGroupIterator.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| an_df | DataFrame | An annotations.as_df DataFrame, sliced and sorted. | required |
Source code in packages/xization/src/dataknobs_xization/annotations.py
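The grouping behavior can be illustrated with plain (start_pos, end_pos) tuples. This standalone sketch assumes the documented sort order (start_pos increasing, end_pos decreasing); it is not the library's implementation:

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start_pos, end_pos)

def group_overlaps(spans: List[Span]) -> List[List[Span]]:
    """Group consecutive overlapping spans, assuming the input is sorted
    by start_pos ascending, then end_pos descending."""
    groups: List[List[Span]] = []
    cur: List[Span] = []
    cur_end = -1
    for s in spans:
        if cur and s[0] < cur_end:   # overlaps the running group
            cur.append(s)
            cur_end = max(cur_end, s[1])
        else:                        # start a new group
            cur = [s]
            cur_end = s[1]
            groups.append(cur)
    return groups

print(group_overlaps([(0, 5), (3, 8), (10, 12)]))
# → [[(0, 5), (3, 8)], [(10, 12)]]
```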
Functions¶
PositionalAnnotationsGroup ¶
Bases: AnnotationsGroup
Container for annotations that either overlap with each other or don't.
Initialize PositionalAnnotationsGroup.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| overlap | bool | If False, then only accept rows that don't overlap; else only accept rows that do overlap. | required |
| rectype | str | The record type. | None |
| gnum | int | The group number. | -1 |
Methods:
| Name | Description |
|---|---|
| belongs | Determine if the row belongs in this instance based on its overlap or not. |
Source code in packages/xization/src/dataknobs_xization/annotations.py
Functions¶
belongs ¶
Determine if the row belongs in this instance based on its overlap or not.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| rowdata | RowData | The rowdata to test. | required |
Returns:
| Type | Description |
|---|---|
| bool | True if the rowdata belongs in this instance. |
Source code in packages/xization/src/dataknobs_xization/annotations.py
RowData ¶
A wrapper for an annotation row (pd.Series) to facilitate e.g., grouping.
Methods:
| Name | Description |
|---|---|
| is_subset | Determine whether this row's span is a subset of the other. |
| is_subset_of_any | Determine whether this row is a subset of any of the others |
Source code in packages/xization/src/dataknobs_xization/annotations.py
Functions¶
is_subset ¶
Determine whether this row's span is a subset of the other.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| other_row | RowData | The other row. | required |
Returns:
| Type | Description |
|---|---|
| bool | True if this row's span is a subset of the other row's span. |
Source code in packages/xization/src/dataknobs_xization/annotations.py
is_subset_of_any ¶
Determine whether this row is a subset of any of the others according to text span coverage.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| other_rows | List[RowData] | The rows to test for this to be a subset of any. | required |
Returns:
| Type | Description |
|---|---|
| bool | True if this row is a subset of any of the other rows. |
Source code in packages/xization/src/dataknobs_xization/annotations.py
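The span-coverage tests above can be illustrated with plain (start_pos, end_pos) tuples. This is a standalone sketch of the subset logic, not the library's code:

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start_pos, end_pos)

def is_subset(span: Span, other: Span) -> bool:
    """True when span is covered by the other's text span."""
    return other[0] <= span[0] and span[1] <= other[1]

def is_subset_of_any(span: Span, others: List[Span]) -> bool:
    """True when span is covered by at least one of the others."""
    return any(is_subset(span, o) for o in others)

print(is_subset((2, 5), (0, 10)))                  # → True
print(is_subset_of_any((2, 5), [(6, 9), (0, 4)]))  # → False
```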
SyntacticParser ¶
Bases: BasicAnnotator
Class for creating syntactic annotations for an input.
Methods:
| Name | Description |
|---|---|
| annotate_input | Annotate the text, additively updating the annotations. |
Source code in packages/xization/src/dataknobs_xization/annotations.py
Functions¶
annotate_input ¶
Annotate the text, additively updating the annotations.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| text_obj | AnnotatedText | The text to annotate. | required |
| **kwargs | Any | Additional keyword arguments. | {} |
Returns:
| Type | Description |
|---|---|
| Annotations | The annotations added to the text. |
Source code in packages/xization/src/dataknobs_xization/annotations.py
Functions¶
merge ¶
Merge the overlapping groups according to the given strategy.
Source code in packages/xization/src/dataknobs_xization/annotations.py
authorities¶
Functions and Classes¶
dataknobs_xization.authorities ¶
Authority-based annotation processing and field grouping.
Provides classes for managing authority-based annotations, field groups, and derived annotation columns for structured text extraction.
Classes:
| Name | Description |
|---|---|
| AnnotationsValidator | A base class with helper functions for performing validations on annotation rows. |
| AuthoritiesBundle | An authority for expressing values through multiple bundled "authorities". |
| Authority | A class for managing and defining tabular authoritative data for e.g., taxonomies. |
| AuthorityAnnotationsBuilder | An extension of an AnnotationsBuilder that adds the 'auth_id' column. |
| AuthorityAnnotationsMetaData | An extension of AnnotationsMetaData that adds an 'auth_id_col' to the standard (key) annotation columns. |
| AuthorityData | A wrapper for authority data. |
| AuthorityFactory | A factory class for building an authority. |
| DerivedFieldGroups | Defines derived column types (field_type, field_group, field_record). |
| LexicalAuthority | A class for managing named entities by ID with associated values and variations. |
| RegexAuthority | A class for managing named entities by ID with associated values and variations. |
Classes¶
AnnotationsValidator ¶
Bases: ABC
A base class with helper functions for performing validations on annotation rows.
Classes:
| Name | Description |
|---|---|
| AuthAnnotations | A wrapper class for convenient access to the entity annotations. |
Methods:
| Name | Description |
|---|---|
| __call__ | Call function to enable instances of this type of class to be passed in as an anns_validator function to an Authority. |
| validate_annotation_rows | Determine whether the proposed authority annotation rows are valid. |
Classes¶
AuthAnnotations ¶
A wrapper class for convenient access to the entity annotations.
Methods:
| Name | Description |
|---|---|
| colval | Get the column's value from the given row |
| get_field_type | Get the entity field type value |
| get_text | Get the entity text from the row |
Attributes:
| Name | Type | Description |
|---|---|---|
| anns | Annotations | Get this instance's annotation rows as an annotations object |
| attributes | Dict[str, str] | Get this instance's annotation entity attributes |
| df | DataFrame | Get the annotation's dataframe |
| row_accessor | AnnotationsRowAccessor | Get the row accessor for this instance's annotations. |
Source code in packages/xization/src/dataknobs_xization/authorities.py
Functions¶
__call__ ¶
Call function to enable instances of this type of class to be passed in as an anns_validator function to an Authority.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| auth | Authority | The authority proposing annotations. | required |
| ann_row_dicts | List[Dict[str, Any]] | The proposed annotations. | required |
Returns:
| Type | Description |
|---|---|
| bool | True if the annotations are valid; otherwise, False. |
Source code in packages/xization/src/dataknobs_xization/authorities.py
validate_annotation_rows
abstractmethod
¶
Determine whether the proposed authority annotation rows are valid.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| auth_annotations | AuthAnnotations | The AuthAnnotations instance with the proposed data. | required |
Returns:
| Type | Description |
|---|---|
| bool | True if valid; False if not. |
Source code in packages/xization/src/dataknobs_xization/authorities.py
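The __call__ hook lets a validator instance be passed wherever a plain anns_validator function is expected. A standalone sketch of the pattern (the MinLengthValidator name and the "text" key are illustrative assumptions, not part of the library):

```python
from typing import Any, Dict, List

class MinLengthValidator:
    """A callable validator instance, usable where a plain
    anns_validator function is expected (illustrative only)."""

    def __init__(self, min_len: int):
        self.min_len = min_len

    def __call__(self, auth: Any, ann_row_dicts: List[Dict[str, Any]]) -> bool:
        """Accept annotations only when every matched text is long enough."""
        return all(
            len(d.get("text", "")) >= self.min_len for d in ann_row_dicts
        )

validator = MinLengthValidator(min_len=3)
print(validator(None, [{"text": "abc"}, {"text": "hello"}]))  # → True
print(validator(None, [{"text": "ab"}]))                      # → False
```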
AuthoritiesBundle ¶
AuthoritiesBundle(
name: str,
auth_anns_builder: AuthorityAnnotationsBuilder = None,
authdata: AuthorityData = None,
field_groups: DerivedFieldGroups = None,
parent_auth: Authority = None,
anns_validator: Callable[[Authority, Dict[str, Any]], bool] = None,
auths: List[Authority] = None,
)
Bases: Authority
An authority for expressing values through multiple bundled "authorities" like dictionary-based and/or multiple regular expression patterns.
Initialize the AuthoritiesBundle.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| name | str | This authority's entity name. | required |
| auth_anns_builder | AuthorityAnnotationsBuilder | The authority annotations row builder to use for building annotation rows. | None |
| authdata | AuthorityData | The authority data. | None |
| field_groups | DerivedFieldGroups | The derived field groups to use. | None |
| anns_validator | Callable[[Authority, Dict[str, Any]], bool] | fn(auth, anns_dict_list) that returns True if the list of annotation row dicts are valid to be added as annotations for a single match or "entity". | None |
| parent_auth | Authority | This authority's parent authority (if any). | None |
| auths | List[Authority] | The authorities to bundle together. | None |
Methods:
| Name | Description |
|---|---|
| add | Add the authority to this bundle. |
| add_annotations | Method to do the work of finding, validating, and adding annotations. |
| has_value | Determine whether the given value is in this authority. |
Source code in packages/xization/src/dataknobs_xization/authorities.py
Functions¶
add ¶
Add the authority to this bundle.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| auth | Authority | The authority to add. | required |
add_annotations ¶
Method to do the work of finding, validating, and adding annotations.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| text_obj | AnnotatedText | The annotated text object to process and add annotations. | required |
Returns:
| Type | Description |
|---|---|
| Annotations | The added Annotations. |
Source code in packages/xization/src/dataknobs_xization/authorities.py
has_value ¶
Determine whether the given value is in this authority.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| value | Any | A possible authority value. | required |
Returns:
| Type | Description |
|---|---|
| bool | True if the value is a valid entity value. |
Source code in packages/xization/src/dataknobs_xization/authorities.py
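A bundle delegates lookups to its member authorities. A standalone sketch of the composite has_value behavior (SetAuthority and Bundle are illustrative stand-ins, not the real classes):

```python
from typing import Any, List, Optional

class SetAuthority:
    """A toy authority backed by a set of values (illustrative stand-in)."""
    def __init__(self, values: set):
        self.values = values
    def has_value(self, value: Any) -> bool:
        return value in self.values

class Bundle:
    """Delegates has_value to any bundled authority, mirroring the
    AuthoritiesBundle composite pattern (a sketch, not the real API)."""
    def __init__(self, auths: Optional[List[SetAuthority]] = None):
        self.auths = list(auths or [])
    def add(self, auth: SetAuthority) -> None:
        self.auths.append(auth)
    def has_value(self, value: Any) -> bool:
        return any(a.has_value(value) for a in self.auths)

bundle = Bundle([SetAuthority({"ford", "toyota"})])
bundle.add(SetAuthority({"honda"}))
print(bundle.has_value("honda"))  # → True
print(bundle.has_value("tesla"))  # → False
```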
Authority ¶
Authority(
name: str,
auth_anns_builder: AuthorityAnnotationsBuilder = None,
authdata: AuthorityData = None,
field_groups: DerivedFieldGroups = None,
anns_validator: Callable[[Authority, Dict[str, Any]], bool] = None,
parent_auth: Authority = None,
)
Bases: Annotator
A class for managing and defining tabular authoritative data for e.g., taxonomies, etc., and using them to annotate instances within text.
Initialize with this authority's metadata.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| name | str | This authority's entity name. | required |
| auth_anns_builder | AuthorityAnnotationsBuilder | The authority annotations row builder to use for building annotation rows. | None |
| authdata | AuthorityData | The authority data. | None |
| field_groups | DerivedFieldGroups | The derived field groups to use. | None |
| anns_validator | Callable[[Authority, Dict[str, Any]], bool] | fn(auth, anns_dict_list) that returns True if the list of annotation row dicts are valid to be added as annotations for a single match or "entity". | None |
| parent_auth | Authority | This authority's parent authority (if any). | None |
Methods:
| Name | Description |
|---|---|
| add_annotations | Method to do the work of finding, validating, and adding annotations. |
| annotate_input | Find and annotate this authority's entities in the document text. |
| build_annotation | Build annotations with the given components. |
| compose | Compose annotations into groups. |
| has_value | Determine whether the given value is in this authority. |
| validate_ann_dicts | Determine whether the annotation row dictionaries are valid. |
Attributes:
| Name | Type | Description |
|---|---|---|
| metadata | AuthorityAnnotationsMetaData | Get the meta-data |
| parent | Authority | Get this authority's parent, or None. |
Source code in packages/xization/src/dataknobs_xization/authorities.py
Attributes¶
Functions¶
add_annotations
abstractmethod
¶
Method to do the work of finding, validating, and adding annotations.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| text_obj | AnnotatedText | The annotated text object to process and add annotations. | required |
Returns:
| Type | Description |
|---|---|
| Annotations | The added Annotations. |
Source code in packages/xization/src/dataknobs_xization/authorities.py
annotate_input ¶
Find and annotate this authority's entities in the document text
as dictionaries like:
[
{
'input_id':
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| text_obj | Union[AnnotatedText, str] | The text object or string to process. | required |
| **kwargs | Any | Additional keyword arguments. | {} |
Returns:
| Type | Description |
|---|---|
| Annotations | An Annotations instance. |
Source code in packages/xization/src/dataknobs_xization/authorities.py
build_annotation ¶
build_annotation(
start_pos: int = None,
end_pos: int = None,
entity_text: str = None,
auth_value_id: Any = None,
conf: float = 1.0,
**kwargs,
) -> Dict[str, Any]
Build annotations with the given components.
Source code in packages/xization/src/dataknobs_xization/authorities.py
compose ¶
Compose annotations into groups.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| annotations | Annotations | The annotations. | required |
Returns:
| Type | Description |
|---|---|
| Annotations | Composed annotations. |
Source code in packages/xization/src/dataknobs_xization/authorities.py
has_value
abstractmethod
¶
Determine whether the given value is in this authority.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| value | Any | A possible authority value. | required |
Returns:
| Type | Description |
|---|---|
| bool | True if the value is a valid entity value. |
Source code in packages/xization/src/dataknobs_xization/authorities.py
validate_ann_dicts ¶
The annotation row dictionaries are valid if:
- they are non-empty, and
- either there is no annotations validator, or they are valid according to the validator.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| ann_dicts | List[Dict[str, Any]] | Annotation dictionaries. | required |
Returns:
| Type | Description |
|---|---|
| bool | True if valid. |
Source code in packages/xization/src/dataknobs_xization/authorities.py
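The validity rule can be sketched as a standalone function (simplified: the real validator callable also receives the authority as its first argument):

```python
from typing import Any, Callable, Dict, List, Optional

def validate_ann_dicts(
    ann_dicts: List[Dict[str, Any]],
    anns_validator: Optional[Callable[[List[Dict[str, Any]]], bool]] = None,
) -> bool:
    """Valid when non-empty and either no validator is set or the
    validator accepts the dicts (a simplified sketch of the rule)."""
    if not ann_dicts:
        return False
    return anns_validator is None or anns_validator(ann_dicts)

print(validate_ann_dicts([]))                                # → False
print(validate_ann_dicts([{"text": "x"}]))                   # → True
print(validate_ann_dicts([{"text": "x"}], lambda ds: False)) # → False
```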
AuthorityAnnotationsBuilder ¶
AuthorityAnnotationsBuilder(
metadata: AuthorityAnnotationsMetaData = None,
data_defaults: Dict[str, Any] = None,
)
Bases: AnnotationsBuilder
An extension of an AnnotationsBuilder that adds the 'auth_id' column.
Initialize AuthorityAnnotationsBuilder.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| metadata | AuthorityAnnotationsMetaData | The authority annotations metadata. | None |
| data_defaults | Dict[str, Any] | Dict[ann_colname, default_value] with default values for annotation columns. | None |
Methods:
| Name | Description |
|---|---|
| build_annotation_row | Build an annotation row with the mandatory key values and those from the remaining keyword arguments. |
Source code in packages/xization/src/dataknobs_xization/authorities.py
Functions¶
build_annotation_row ¶
build_annotation_row(
start_pos: int,
end_pos: int,
text: str,
ann_type: str,
auth_id: str,
**kwargs: Any,
) -> Dict[str, Any]
Build an annotation row with the mandatory key values and those from the remaining keyword arguments.
For those kwargs whose names match metadata column names, override the data_defaults and add remaining data_default attributes.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| start_pos | int | The token start position. | required |
| end_pos | int | The token end position. | required |
| text | str | The token text. | required |
| ann_type | str | The annotation type. | required |
| auth_id | str | The authority ID for the row. | required |
| **kwargs | Any | Additional keyword arguments. | {} |
Returns:
| Type | Description |
|---|---|
| Dict[str, Any] | The result row dictionary. |
Source code in packages/xization/src/dataknobs_xization/authorities.py
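The precedence described above (kwargs override the data_defaults; the mandatory key values always apply) can be sketched as a standalone function (illustrative, not the library's implementation):

```python
from typing import Any, Dict, Optional

def build_annotation_row(
    start_pos: int,
    end_pos: int,
    text: str,
    ann_type: str,
    auth_id: str,
    data_defaults: Optional[Dict[str, Any]] = None,
    **kwargs: Any,
) -> Dict[str, Any]:
    """Start from the defaults, let kwargs override them, then add
    the mandatory key columns (a simplified sketch of the behavior)."""
    row = dict(data_defaults or {})
    row.update(kwargs)
    row.update(
        start_pos=start_pos, end_pos=end_pos,
        text=text, ann_type=ann_type, auth_id=auth_id,
    )
    return row

row = build_annotation_row(0, 4, "ford", "make", "auth-1",
                           data_defaults={"conf": 1.0}, conf=0.9)
print(row["conf"])  # → 0.9 (the kwarg overrides the default)
```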
AuthorityAnnotationsMetaData ¶
AuthorityAnnotationsMetaData(
start_pos_col: str = dk_annots.KEY_START_POS_COL,
end_pos_col: str = dk_annots.KEY_END_POS_COL,
text_col: str = dk_annots.KEY_TEXT_COL,
ann_type_col: str = dk_annots.KEY_ANN_TYPE_COL,
auth_id_col: str = KEY_AUTH_ID_COL,
sort_fields: List[str] = (
dk_annots.KEY_START_POS_COL,
dk_annots.KEY_END_POS_COL,
),
sort_fields_ascending: List[bool] = (True, False),
**kwargs: Any,
)
Bases: AnnotationsMetaData
An extension of AnnotationsMetaData that adds an 'auth_id_col' to the standard (key) annotation columns (attributes).
Initialize with key (and more) column names and info.
Key column types
- start_pos
- end_pos
- text
- ann_type
- auth_id
Note
Actual table columns can be named arbitrarily, BUT interactions through annotations classes and interfaces relating to the "key" columns must use the key column constants.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| start_pos_col | str | Col name for the token starting position. | KEY_START_POS_COL |
| end_pos_col | str | Col name for the token ending position. | KEY_END_POS_COL |
| text_col | str | Col name for the token text. | KEY_TEXT_COL |
| ann_type_col | str | Col name for the annotation types. | KEY_ANN_TYPE_COL |
| auth_id_col | str | Col name for the authority value ID. | KEY_AUTH_ID_COL |
| sort_fields | List[str] | The col types relevant for sorting annotation rows. | (KEY_START_POS_COL, KEY_END_POS_COL) |
| sort_fields_ascending | List[bool] | To specify sort order of sort_fields. | (True, False) |
| **kwargs | Any | More column types mapped to column names. | {} |
Attributes:
| Name | Type | Description |
|---|---|---|
| auth_id_col | str | Get the column name for the auth_id |
Source code in packages/xization/src/dataknobs_xization/authorities.py
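The default sort (start_pos ascending, end_pos descending) puts longer spans before the shorter spans they contain. A quick standalone illustration with (start_pos, end_pos) tuples:

```python
# Default annotation sort: start_pos ascending, end_pos descending,
# so longer spans come before the shorter spans they contain.
rows = [(2, 5), (0, 3), (0, 10)]
ordered = sorted(rows, key=lambda r: (r[0], -r[1]))
print(ordered)  # → [(0, 10), (0, 3), (2, 5)]
```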
AuthorityData ¶
A wrapper for authority data.
Methods:
| Name | Description |
|---|---|
| lookup_values | Lookup authority value(s) for the given value or value id. |
Attributes:
| Name | Type | Description |
|---|---|---|
| df | DataFrame | Get the authority data in a dataframe |
Source code in packages/xization/src/dataknobs_xization/authorities.py
Attributes¶
Functions¶
lookup_values ¶
Lookup authority value(s) for the given value or value id.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| value | Any | A value or value_id for this authority. | required |
| is_id | bool | True if value is an ID. | False |
Returns:
| Type | Description |
|---|---|
| DataFrame | The applicable authority dataframe rows. |
Source code in packages/xization/src/dataknobs_xization/authorities.py
AuthorityFactory ¶
Bases: ABC
A factory class for building an authority.
Methods:
| Name | Description |
|---|---|
| build_authority | Build an authority with the given name and data. |
Functions¶
build_authority
abstractmethod
¶
build_authority(
name: str,
auth_anns_builder: AuthorityAnnotationsBuilder,
authdata: AuthorityData,
parent_auth: Authority = None,
) -> Authority
Build an authority with the given name and data.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| name | str | The authority name. | required |
| auth_anns_builder | AuthorityAnnotationsBuilder | The authority annotations row builder to use for building annotation rows. | required |
| authdata | AuthorityData | The authority data. | required |
| parent_auth | Authority | The parent authority. | None |
Returns:
| Type | Description |
|---|---|
| Authority | The authority. |
Source code in packages/xization/src/dataknobs_xization/authorities.py
DerivedFieldGroups ¶
DerivedFieldGroups(
field_type_suffix: str = "_field",
field_group_suffix: str = "_num",
field_record_suffix: str = "_recsnum",
)
Bases: DerivedAnnotationColumns
Defines derived column types:
- "field_type" -- The column holding the type of field of an annotation row
- "field_group" -- The column holding the group number(s) of the field
- "field_record" -- The column holding record number(s) of the field

Adds derived column types/names. Given an annotation row:
- field_type(row) == f'{row[ann_type_col]}_field'
- field_group(row) == f'{row[ann_type_col]}_num'
- field_record(row) == f'{row[ann_type_col]}_recsnum'

Where:
- A field_type column holds annotation "sub"-type values, or fields
- A field_group column identifies groups of annotation fields
- A field_record column identifies groups of annotation field groups
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| field_type_suffix | str | The field_type col name suffix (if not _field). | '_field' |
| field_group_suffix | str | The field_group col name suffix (if not _num). | '_num' |
| field_record_suffix | str | The field_record col name suffix (if not _recsnum). | '_recsnum' |
Methods:
| Name | Description |
|---|---|
| get_col_value | Get the value of the column in the given row derived from col_type. |
| get_field_group_col | Given a field name or a derived column name, get the name of the derived field group column. |
| get_field_name | Given a field name or a derived column name, get the field name. |
| get_field_record_col | Given a field name or a derived column name, get the name of the derived field record column. |
| get_field_type_col | Given a field name or a derived column name, get the name of the derived field type column. |
| unpack_field | Given a field in any of its derivatives, unpack and return the basic field value. |
Source code in packages/xization/src/dataknobs_xization/authorities.py
Functions¶
get_col_value ¶
get_col_value(
metadata: AnnotationsMetaData,
col_type: str,
row: Series,
missing: str = None,
) -> str
Get the value of the column in the given row derived from col_type, where col_type is one of:
- "field_type" == f"{field}_field"
- "field_group" == f"{field}_num"
- "field_record" == f"{field}_recsnum"
And "field" is the row_accessor's metadata's "ann_type" col's value.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| metadata | AnnotationsMetaData | The AnnotationsMetaData. | required |
| col_type | str | The type of column value to derive. | required |
| row | Series | A row from which to get the value. | required |
| missing | str | The value to return for unknown or missing column. | None |
Returns:
| Type | Description |
|---|---|
| str | The row value or the missing value. |
Source code in packages/xization/src/dataknobs_xization/authorities.py
get_field_group_col ¶
Given a field name or field col name, e.g., an annotation type col's value; or a field type, group, or record, get the name of the derived field group column.
Source code in packages/xization/src/dataknobs_xization/authorities.py
get_field_name ¶
Given a field name or field col name, e.g., an annotation type col's value (the field name); or a field type, group, or record column name, get the field name.
Source code in packages/xization/src/dataknobs_xization/authorities.py
get_field_record_col ¶
Given a field name or field col name, e.g., an annotation type col's value; or a field type, group, or record, get the name of the derived field record column.
Source code in packages/xization/src/dataknobs_xization/authorities.py
get_field_type_col ¶
Given a field name or field col name, e.g., an annotation type col's value; or a field type, group, or record column name, get the name of the derived field type column.
Source code in packages/xization/src/dataknobs_xization/authorities.py
unpack_field ¶
Given a field in any of its derivatives (like field type, field group, or field record), unpack and return the basic field value itself.
Source code in packages/xization/src/dataknobs_xization/authorities.py
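The suffix conventions above can be sketched as standalone helpers (illustrative only; the real methods work through the metadata's ann_type column):

```python
# Derived-column suffixes, as documented for DerivedFieldGroups.
SUFFIXES = {
    "field_type": "_field",
    "field_group": "_num",
    "field_record": "_recsnum",
}

def derived_col(field: str, col_type: str) -> str:
    """Build a derived column name from a base field name."""
    return field + SUFFIXES[col_type]

def unpack_field(col_name: str) -> str:
    """Strip any derived suffix to recover the base field name."""
    for sfx in SUFFIXES.values():
        if col_name.endswith(sfx):
            return col_name[: -len(sfx)]
    return col_name

print(derived_col("car_model", "field_type"))  # → car_model_field
print(unpack_field("car_model_recsnum"))       # → car_model
```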
LexicalAuthority ¶
LexicalAuthority(
name: str,
auth_anns_builder: AuthorityAnnotationsBuilder = None,
authdata: AuthorityData = None,
field_groups: DerivedFieldGroups = None,
anns_validator: Callable[[Authority, Dict[str, Any]], bool] = None,
parent_auth: Authority = None,
)
Bases: Authority
A class for managing named entities by ID with associated values and variations.
Initialize with this authority's metadata.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| name | str | This authority's entity name. | required |
| auth_anns_builder | AuthorityAnnotationsBuilder | The authority annotations row builder to use for building annotation rows. | None |
| authdata | AuthorityData | The authority data. | None |
| field_groups | DerivedFieldGroups | The derived field groups to use. | None |
| anns_validator | Callable[[Authority, Dict[str, Any]], bool] | fn(auth, anns_dict_list) that returns True if the list of annotation row dicts are valid to be added as annotations for a single match or "entity". | None |
| parent_auth | Authority | This authority's parent authority (if any). | None |
Methods:
| Name | Description |
|---|---|
| find_variations | Find all matches to the given variation. |
| get_id_by_variation | Get the IDs of the value(s) associated with the given variation. |
| get_value_ids | Get all IDs associated with the given value. |
| get_values_by_id | Get all values for the associated value ID. |
Source code in packages/xization/src/dataknobs_xization/authorities.py
Functions¶
find_variations
abstractmethod
¶
find_variations(
variation: str,
starts_with: bool = False,
ends_with: bool = False,
scope: str = "fullmatch",
) -> pd.Series
Find all matches to the given variation.
Note
Only the first true of starts_with, ends_with, and scope will be applied. If none of these are true, a full match on the pattern is performed.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| variation | str | The text to find; treated as a regular expression unless either starts_with or ends_with is True. | required |
| starts_with | bool | When True, find all terms that start with the variation text. | False |
| ends_with | bool | When True, find all terms that end with the variation text. | False |
| scope | str | 'fullmatch' (default), 'match', or 'contains' for strict, less strict, and least strict matching. | 'fullmatch' |

Returns:

| Type | Description |
|---|---|
| Series | The matching variations as a pd.Series. |
Source code in packages/xization/src/dataknobs_xization/authorities.py
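The three `scope` strictness levels correspond to Python's standard `re` matching functions (and to the analogous `pd.Series.str.fullmatch`/`.match`/`.contains` methods). The sketch below illustrates the matching semantics only; it does not call into the package, and the sample variation list is invented for demonstration.

```python
# Illustration of the three `scope` matching levels using plain `re`.
import re

variations = ["new york", "new york city", "york", "west new york"]

def find(pattern: str, scope: str = "fullmatch"):
    """Return the variations matched at the given strictness level."""
    fns = {
        "fullmatch": re.fullmatch,   # the whole string must match
        "match": re.match,           # the match must start at the beginning
        "contains": re.search,       # the match may occur anywhere
    }
    return [v for v in variations if fns[scope](pattern, v)]

print(find("new york"))                    # ['new york']
print(find("new york", scope="match"))     # ['new york', 'new york city']
print(find("new york", scope="contains"))  # ['new york', 'new york city', 'west new york']
```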
get_id_by_variation
abstractmethod
¶
Get the IDs of the value(s) associated with the given variation.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| variation | str | Variation text. | required |

Returns:

| Type | Description |
|---|---|
| Set[str] | The possibly empty set of associated value IDs. |
Source code in packages/xization/src/dataknobs_xization/authorities.py
get_value_ids
abstractmethod
¶
Get all IDs associated with the given value. Note that typically there is a single ID for any value, but this allows for inherent ambiguities in the authority.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| value | Any | An authority value. | required |

Returns:

| Type | Description |
|---|---|
| Set[Any] | The associated IDs or an empty set if the value is not valid. |
Source code in packages/xization/src/dataknobs_xization/authorities.py
get_values_by_id
abstractmethod
¶
Get all values for the associated value ID. Note that typically there is a single value for an ID, but this allows for inherent ambiguities in the authority.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| value_id | Any | An authority value ID. | required |

Returns:

| Type | Description |
|---|---|
| Set[Any] | The associated values or an empty set if the value ID is not valid. |
Source code in packages/xization/src/dataknobs_xization/authorities.py
RegexAuthority ¶
RegexAuthority(
name: str,
regex: Pattern,
canonical_fn: Callable[[str, str], Any] = None,
auth_anns_builder: AuthorityAnnotationsBuilder = None,
authdata: AuthorityData = None,
field_groups: DerivedFieldGroups = None,
anns_validator: Callable[[Authority, Dict[str, Any]], bool] = None,
parent_auth: Authority = None,
)
Bases: Authority
A class for managing named entities by ID with associated values and variations.
Initialize with this authority's entity name.
Note
If the regular expression has capturing groups, each group will result in a separate entity, using the group name if provided in the regular expression as ...(?P<name>...).
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| name | str | The authority name. | required |
| regex | Pattern | The regular expression to apply. | required |
| canonical_fn | Callable[[str, str], Any] | A function, fn(match_text, group_name), to transform input matches to a canonical form as a value_id, where group_name will be None and the full match text will be passed in if there are no group names. Note that the canonical form is computed before the match_validator is applied and its value will be found as the value to the | None |
| auth_anns_builder | AuthorityAnnotationsBuilder | The authority annotations row builder to use for building annotation rows. | None |
| authdata | AuthorityData | The authority data. | None |
| field_groups | DerivedFieldGroups | The derived field groups to use. | None |
| anns_validator | Callable[[Authority, Dict[str, Any]], bool] | A validation function for each regex match formed as a list of annotation row dictionaries, one row dictionary for each matching regex group. If the validator returns False, then the annotation rows will be rejected. The entity_text key will hold matched text and the | None |
| parent_auth | Authority | This authority's parent authority (if any). | None |
Methods:

| Name | Description |
|---|---|
| add_annotations | Method to do the work of finding, validating, and adding annotations. |
| has_value | Determine whether the given value is in this authority. |
Source code in packages/xization/src/dataknobs_xization/authorities.py
Functions¶
add_annotations ¶
Method to do the work of finding, validating, and adding annotations.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| text_obj | AnnotatedText | The annotated text object to process and add annotations. | required |

Returns:

| Type | Description |
|---|---|
| Annotations | The added Annotations. |
Source code in packages/xization/src/dataknobs_xization/authorities.py
has_value ¶
Determine whether the given value is in this authority.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| value | Any | A possible authority value. | required |

Returns:

| Type | Description |
|---|---|
| Match | None if the value is not a valid entity value; otherwise, the re.Match object. |
Source code in packages/xization/src/dataknobs_xization/authorities.py
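To make the canonical_fn parameter concrete, here is a package-free sketch of how a pattern with named groups and a canonical_fn(match_text, group_name) interact; the date pattern and the zero-padding rule are illustrative assumptions, not the package's behavior.

```python
# Sketch: named capturing groups feeding a canonical_fn (hypothetical rules).
import re

pattern = re.compile(r"(?P<month>\d{1,2})/(?P<day>\d{1,2})/(?P<year>\d{4})")

def canonical_fn(match_text: str, group_name: str):
    """Zero-pad date parts so '3' and '03' map to the same value_id."""
    if group_name in ("month", "day"):
        return match_text.zfill(2)
    return match_text

# Each named group yields a separate entity with a canonicalized value_id.
entities = {}
m = pattern.search("Due on 3/7/2024.")
for name, text in m.groupdict().items():
    entities[name] = canonical_fn(text, name)

print(entities)  # {'month': '03', 'day': '07', 'year': '2024'}
```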
lexicon¶
Functions and Classes¶
dataknobs_xization.lexicon ¶
Lexical matching and token alignment for text processing.
Provides classes for lexical expansion, normalization, token alignment, and pattern matching in text with support for variations and fuzzy matching.
Classes:

| Name | Description |
|---|---|
| CorrelatedAuthorityData | Container for authoritative data containing correlated data for multiple "sub" authorities. |
| DataframeAuthority | A pandas dataframe-based lexical authority. |
| LexicalExpander | A class to expand and/or normalize original lexical input terms. |
| MultiAuthorityData | Container for correlated authoritative data with explicit data for each "sub" authority. |
| MultiAuthorityFactory | A factory for building a "sub" authority directly or indirectly from MultiAuthorityData. |
| SimpleMultiAuthorityData | Data class for pulling a single column from the multi-authority data as a "sub" authority. |
| TokenAligner | Aligns tokens with a lexical authority to generate annotations. |
| TokenMatch | Represents a match between tokens and a lexical authority variation. |
Classes¶
CorrelatedAuthorityData ¶
Bases: AuthorityData
Container for authoritative data containing correlated data for multiple "sub" authorities.
Methods:

| Name | Description |
|---|---|
| auth_records_mask | Get a series identifying records in the full authority matching the given records. |
| auth_values_mask | Identify full-authority data corresponding to this sub-value. |
| combine_masks | Combine the masks if possible, returning the valid combination or None. |
| get_auth_records | Get the authority records identified by the mask. |
| sub_authority_names | Get the "sub" authority names. |
Source code in packages/xization/src/dataknobs_xization/lexicon.py
Functions¶
auth_records_mask
abstractmethod
¶
Get a series identifying records in the full authority matching the given records of the form {field_name: value_id, ...}.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| record_value_ids | Dict[str, int] | The dict of field names to value_ids. | required |
| filter_mask | Series | A pre-filter limiting records to consider and/or building records incrementally. | None |

Returns:

| Type | Description |
|---|---|
| Series | A series identifying where all fields exist. |
Source code in packages/xization/src/dataknobs_xization/lexicon.py
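The filter_mask parameter supports building a records mask incrementally. Below is a minimal stand-in using plain boolean lists in place of pd.Series; the record dicts and field names are invented for illustration.

```python
# Sketch: matching records on field/value_id pairs with an optional pre-filter.
records = [
    {"city": 1, "state": 7},
    {"city": 2, "state": 7},
    {"city": 1, "state": 9},
]

def auth_records_mask(record_value_ids, filter_mask=None):
    """True where every requested field has the requested value_id."""
    mask = [all(r.get(f) == vid for f, vid in record_value_ids.items())
            for r in records]
    if filter_mask is not None:
        # Narrow the result to records already admitted by the pre-filter.
        mask = [a and b for a, b in zip(mask, filter_mask)]
    return mask

m1 = auth_records_mask({"state": 7})                 # [True, True, False]
m2 = auth_records_mask({"city": 1}, filter_mask=m1)  # [True, False, False]
print(m1, m2)
```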
auth_values_mask
abstractmethod
¶
Identify full-authority data corresponding to this sub-value.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| name | str | The sub-authority name. | required |
| value_id | int | The sub-authority value_id. | required |

Returns:

| Type | Description |
|---|---|
| Series | A series representing relevant full-authority data. |
Source code in packages/xization/src/dataknobs_xization/lexicon.py
combine_masks
abstractmethod
¶
Combine the masks if possible, returning the valid combination or None.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| mask1 | Series | An auth_records_mask consistent with this data. | required |
| mask2 | Series | Another data auth_records_mask. | required |

Returns:

| Type | Description |
|---|---|
| Series | The combined consistent records_mask or None. |
Source code in packages/xization/src/dataknobs_xization/lexicon.py
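In spirit, combining two record masks is an element-wise AND that falls back to None when no record survives. A sketch with plain boolean lists standing in for pd.Series (this mirrors, but is not, the package implementation):

```python
# Sketch: combining two aligned boolean record masks.
def combine_masks(mask1, mask2):
    """AND the masks element-wise; return None when no record satisfies both."""
    combined = [a and b for a, b in zip(mask1, mask2)]
    return combined if any(combined) else None

mask1 = [True, True, False, True]   # e.g. records where one field matched
mask2 = [False, True, False, True]  # e.g. records where another field matched
print(combine_masks(mask1, mask2))                        # [False, True, False, True]
print(combine_masks(mask1, [False, False, True, False]))  # None
```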
get_auth_records
abstractmethod
¶
Get the authority records identified by the mask.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| records_mask | Series | A series identifying records in the full data. | required |

Returns:

| Type | Description |
|---|---|
| DataFrame | The records for which the mask is True. |
Source code in packages/xization/src/dataknobs_xization/lexicon.py
DataframeAuthority ¶
DataframeAuthority(
name: str,
lexical_expander: LexicalExpander,
authdata: AuthorityData,
auth_anns_builder: AuthorityAnnotationsBuilder = None,
field_groups: DerivedFieldGroups = None,
anns_validator: Callable[[Authority, Dict[str, Any]], bool] = None,
parent_auth: Authority = None,
)
Bases: LexicalAuthority
A pandas dataframe-based lexical authority.
Initialize with the name, values, and associated ids of the authority; and with the lexical expander for authoritative values.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| name | str | The authority name, if different from df.columns[0]. | required |
| lexical_expander | LexicalExpander | The lexical expander for the values. | required |
| authdata | AuthorityData | The data for this authority. | required |
| auth_anns_builder | AuthorityAnnotationsBuilder | The authority annotations row builder to use for building annotation rows. | None |
| field_groups | DerivedFieldGroups | The derived field groups to use. | None |
| anns_validator | Callable[[Authority, Dict[str, Any]], bool] | fn(auth, anns_dict_list) that returns True if the list of annotation row dicts are valid to be added as annotations for a single match or "entity". | None |
| parent_auth | Authority | This authority's parent authority (if any). | None |
Methods:

| Name | Description |
|---|---|
| add_annotations | Method to do the work of finding, validating, and adding annotations. |
| find_variations | Find all matches to the given variation. |
| get_id_by_variation | Get the IDs of the value(s) associated with the given variation. |
| get_value_ids | Get all IDs associated with the given value. |
| get_values_by_id | Get all values for the associated value ID. |
| get_variations | Convenience method to compute variations for the value. |
| get_variations_df | Create a DataFrame including associated ids for each variation. |
| has_value | Determine whether the given value is in this authority. |

Attributes:

| Name | Type | Description |
|---|---|---|
| prev_aligner | TokenAligner | Get the token aligner created in the latest call to annotate_text. |
| variations | Series | Get all lexical variations in a series whose index has associated value IDs. |
Source code in packages/xization/src/dataknobs_xization/lexicon.py
Attributes¶
prev_aligner
property
¶
Get the token aligner created in the latest call to annotate_text.
variations
property
¶
Get all lexical variations in a series whose index has associated value IDs.
Returns:

| Type | Description |
|---|---|
| Series | A pandas series with index-identified variations. |
Functions¶
add_annotations ¶
Method to do the work of finding, validating, and adding annotations.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| doctext | Text | The text to process. | required |
| annotations | Annotations | The annotations object to add annotations to. | required |

Returns:

| Type | Description |
|---|---|
| Annotations | The given or a new Annotations instance. |
Source code in packages/xization/src/dataknobs_xization/lexicon.py
find_variations ¶
find_variations(
variation: str,
starts_with: bool = False,
ends_with: bool = False,
scope: str = "fullmatch",
) -> pd.Series
Find all matches to the given variation.
Note
Only the first true of starts_with, ends_with, and scope will be applied. If none of these are true, a full match on the pattern is performed.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| variation | str | The text to find; treated as a regular expression unless either starts_with or ends_with is True. | required |
| starts_with | bool | When True, find all terms that start with the variation text. | False |
| ends_with | bool | When True, find all terms that end with the variation text. | False |
| scope | str | 'fullmatch' (default), 'match', or 'contains' for strict, less strict, and least strict matching. | 'fullmatch' |

Returns:

| Type | Description |
|---|---|
| Series | The matching variations as a pd.Series. |
Source code in packages/xization/src/dataknobs_xization/lexicon.py
get_id_by_variation ¶
Get the IDs of the value(s) associated with the given variation.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| variation | str | Variation text. | required |

Returns:

| Type | Description |
|---|---|
| Set[str] | The possibly empty set of associated value IDs. |
Source code in packages/xization/src/dataknobs_xization/lexicon.py
get_value_ids ¶
Get all IDs associated with the given value. Note that typically there is a single ID for any value, but this allows for inherent ambiguities in the authority.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| value | Any | An authority value. | required |

Returns:

| Type | Description |
|---|---|
| Set[Any] | The associated IDs or an empty set if the value is not valid. |
Source code in packages/xization/src/dataknobs_xization/lexicon.py
get_values_by_id ¶
Get all values for the associated value ID. Note that typically there is a single value for an ID, but this allows for inherent ambiguities in the authority.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| value_id | Any | An authority value ID. | required |

Returns:

| Type | Description |
|---|---|
| Set[Any] | The associated values or an empty set if the value ID is not valid. |
Source code in packages/xization/src/dataknobs_xization/lexicon.py
get_variations ¶
Convenience method to compute variations for the value.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| value | Any | The authority value, or term, whose variations to compute. | required |
| normalize | bool | True to normalize the variations. | True |

Returns:

| Type | Description |
|---|---|
| Set[Any] | The set of variations for the value. |
Source code in packages/xization/src/dataknobs_xization/lexicon.py
get_variations_df ¶
get_variations_df(
variations: Series,
variations_colname: str = "variation",
ids_colname: str = None,
lookup_values: bool = False,
) -> pd.DataFrame
Create a DataFrame including associated ids for each variation.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| variations | Series | The variations to include in the dataframe. | required |
| variations_colname | str | The name of the variations column. | 'variation' |
| ids_colname | str | The column name for value ids. | None |
| lookup_values | bool | When True, include a self.name column with associated values. | False |
Source code in packages/xization/src/dataknobs_xization/lexicon.py
has_value ¶
Determine whether the given value is in this authority.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| value | Any | A possible authority value. | required |

Returns:

| Type | Description |
|---|---|
| bool | True if the value is a valid entity value. |
Source code in packages/xization/src/dataknobs_xization/lexicon.py
LexicalExpander ¶
LexicalExpander(
variations_fn: Callable[[str], Set[str]],
normalize_fn: Callable[[str], str],
split_input_camelcase: bool = True,
detect_emojis: bool = False,
)
A class to expand and/or normalize original lexical input terms, to keep back-references from generated data to corresponding original input, and to build consistent tokens for lexical matching.
Initialize with the given functions.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| variations_fn | Callable[[str], Set[str]] | A function, f(t), to expand a raw input term to all of its variations (including itself if desired). If None, the default is to expand each term to itself. | required |
| normalize_fn | Callable[[str], str] | A function to normalize a raw input term or any of its variations. If None, then the identity function is used. | required |
| split_input_camelcase | bool | True to split input camelcase tokens. | True |
| detect_emojis | bool | True to detect emojis. If split_input_camelcase, then adjacent emojis will also be split; otherwise, adjacent emojis will appear as a single token. | False |
Methods:

| Name | Description |
|---|---|
| __call__ | Get all variations of the original term. |
| get_terms | Get the term ids for which the given variation was generated. |
| normalize | Normalize the given input term or variation. |
Source code in packages/xization/src/dataknobs_xization/lexicon.py
Functions¶
__call__ ¶
Get all variations of the original term.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| term | Any | The term whose variations to compute. | required |
| normalize | bool | True to normalize the resulting variations. | True |

Returns:

| Type | Description |
|---|---|
| Set[str] | All variations. |
Source code in packages/xization/src/dataknobs_xization/lexicon.py
get_terms ¶
Get the term ids for which the given variation was generated.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| variation | str | A variation whose reference term(s) to retrieve. | required |

Returns:

| Type | Description |
|---|---|
| Set[Any] | The set of term ids for the variation or the missing_value. |
Source code in packages/xization/src/dataknobs_xization/lexicon.py
normalize ¶
Normalize the given input term or variation.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| input_term | str | An input term to normalize. | required |

Returns:

| Type | Description |
|---|---|
| str | The normalized string of the input_term. |
Source code in packages/xization/src/dataknobs_xization/lexicon.py
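The two functions a LexicalExpander is constructed from can be illustrated with plain stand-ins; the specific normalization and hyphen-expansion rules below are examples chosen for the sketch, not the package defaults.

```python
# Sketch: illustrative variations_fn and normalize_fn for a LexicalExpander.
import re

def normalize_fn(term: str) -> str:
    """Lowercase and squash runs of whitespace."""
    return re.sub(r"\s+", " ", term.strip().lower())

def variations_fn(term: str) -> set:
    """Expand hyphens to spaces and to nothing, keeping the term itself."""
    return {term, term.replace("-", " "), term.replace("-", "")}

# With the real class this would be: LexicalExpander(variations_fn, normalize_fn)
raw = "E-Mail"
variations = {normalize_fn(v) for v in variations_fn(raw)}
print(sorted(variations))  # ['e mail', 'e-mail', 'email']
```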
MultiAuthorityData ¶
Bases: CorrelatedAuthorityData
Container for authoritative data containing correlated data for multiple "sub" authorities composed of explicit data for each component.
Methods:

| Name | Description |
|---|---|
| auth_records_mask | Get a boolean series identifying records in the full authority matching the given records. |
| auth_values_mask | Identify the rows in the full authority corresponding to this sub-value. |
| build_authority_data | Build an authority for the named sub-authority. |
| combine_masks | Combine the masks if possible, returning the valid combination or None. |
| get_auth_records | Get the authority records identified by the mask. |
| get_authority_data | Get AuthorityData for the named "sub" authority, building if needed. |
| get_unique_vals_df | Get a dataframe with the unique values from the column and the given column name. |
| lookup_auth_values | Lookup original authority data for the named "sub" authority value. |
| lookup_subauth_values | Lookup "sub" authority data for the named "sub" authority value. |

Attributes:

| Name | Type | Description |
|---|---|---|
| authority_data | AuthorityData | Retrieve the named authority data without building it, or None if it has not been built. |
Source code in packages/xization/src/dataknobs_xization/lexicon.py
Attributes¶
authority_data
property
¶
Retrieve the named authority data without building it, or None if it has not been built.
Functions¶
auth_records_mask ¶
Get a boolean series identifying records in the full authority matching the given records of the form {field_name: value_id, ...}.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| record_value_ids | Dict[str, int] | The dict of field names to value_ids. | required |
| filter_mask | Series | A pre-filter limiting records to consider and/or building records incrementally. | None |

Returns:

| Type | Description |
|---|---|
| Series | A boolean series where all fields exist or None. |
Source code in packages/xization/src/dataknobs_xization/lexicon.py
auth_values_mask ¶
Identify the rows in the full authority corresponding to this sub-value.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| name | str | The sub-authority name. | required |
| value_id | int | The sub-authority value_id. | required |

Returns:

| Type | Description |
|---|---|
| Series | A boolean series where the field exists. |
Source code in packages/xization/src/dataknobs_xization/lexicon.py
build_authority_data
abstractmethod
¶
Build an authority for the named sub-authority.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| name | str | The "sub" authority name. | required |

Returns:

| Type | Description |
|---|---|
| AuthorityData | The "sub" authority data. |
Source code in packages/xization/src/dataknobs_xization/lexicon.py
combine_masks ¶
Combine the masks if possible, returning the valid combination or None.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| mask1 | Series | An auth_records_mask consistent with this data. | required |
| mask2 | Series | Another data auth_records_mask. | required |

Returns:

| Type | Description |
|---|---|
| Series | The combined consistent records_mask or None. |
Source code in packages/xization/src/dataknobs_xization/lexicon.py
get_auth_records ¶
Get the authority records identified by the mask.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| records_mask | Series | A boolean series identifying records in the full df. | required |

Returns:

| Type | Description |
|---|---|
| DataFrame | The records/rows for which the mask is True. |
Source code in packages/xization/src/dataknobs_xization/lexicon.py
get_authority_data ¶
Get AuthorityData for the named "sub" authority, building if needed.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| name | str | The "sub" authority name. | required |

Returns:

| Type | Description |
|---|---|
| AuthorityData | The "sub" authority data. |
Source code in packages/xization/src/dataknobs_xization/lexicon.py
get_unique_vals_df
staticmethod
¶
Get a dataframe with the unique values from the column and the given column name.
Source code in packages/xization/src/dataknobs_xization/lexicon.py
lookup_auth_values ¶
Lookup original authority data for the named "sub" authority value.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| name | str | The sub-authority name. | required |
| value | str | The sub-authority value(s) (or dataframe row(s)). | required |

Returns:

| Type | Description |
|---|---|
| DataFrame | The original authority dataframe rows. |
Source code in packages/xization/src/dataknobs_xization/lexicon.py
lookup_subauth_values ¶
Lookup "sub" authority data for the named "sub" authority value.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| name | str | The sub-authority name. | required |
| value | int | The value for the sub-authority to lookup. | required |
| is_id | bool | True if value is an ID. | False |

Returns:

| Type | Description |
|---|---|
| DataFrame | The applicable authority dataframe rows. |
Source code in packages/xization/src/dataknobs_xization/lexicon.py
MultiAuthorityFactory ¶
Bases: AuthorityFactory
A factory for building a "sub" authority directly or indirectly from MultiAuthorityData.
Initialize the MultiAuthorityFactory.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| auth_name | str | The name of the dataframe authority to build. | required |
| lexical_expander | LexicalExpander | The lexical expander to use (default=identity). | None |
|
Methods:

| Name | Description |
|---|---|
| build_authority | Build a DataframeAuthority. |
| get_lexical_expander | Get the lexical expander for the named (column) data. |
Source code in packages/xization/src/dataknobs_xization/lexicon.py
Functions¶
build_authority ¶
build_authority(
name: str,
auth_anns_builder: AuthorityAnnotationsBuilder,
multiauthdata: MultiAuthorityData,
parent_auth: Authority = None,
) -> DataframeAuthority
Build a DataframeAuthority.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| name | str | The name of the authority to build. | required |
| auth_anns_builder | AuthorityAnnotationsBuilder | The authority annotations row builder to use for building annotation rows. | required |
| multiauthdata | MultiAuthorityData | The multi-authority source data. | required |
| parent_auth | Authority | The parent authority. | None |

Returns:

| Type | Description |
|---|---|
| DataframeAuthority | The DataframeAuthority instance. |
Source code in packages/xization/src/dataknobs_xization/lexicon.py
get_lexical_expander ¶
Get the lexical expander for the named (column) data.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| name | str | The name of the column to expand. | required |

Returns:

| Type | Description |
|---|---|
| LexicalExpander | The appropriate lexical_expander. |
Source code in packages/xization/src/dataknobs_xization/lexicon.py
SimpleMultiAuthorityData ¶
Bases: MultiAuthorityData
Data class for pulling a single column from the multi-authority data as a "sub" authority.
Methods:

| Name | Description |
|---|---|
| build_authority_data | Build an authority for the named column holding authority data. |
Source code in packages/xization/src/dataknobs_xization/lexicon.py
Functions¶
build_authority_data ¶
Build an authority for the named column holding authority data.
Note
Only unique values are kept and the full dataframe's index will not be preserved.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| name | str | The "sub" authority (and column) name. | required |

Returns:

| Type | Description |
|---|---|
| AuthorityData | The "sub" authority data. |
Source code in packages/xization/src/dataknobs_xization/lexicon.py
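The note above (unique values only, original index not preserved) can be illustrated without pandas: `dict.fromkeys` deduplicates while preserving first-seen order, and enumeration assigns fresh ids.

```python
# Sketch of the uniqueness note: dedupe a column's values and re-index from 0.
rows = ["red", "blue", "red", "green", "blue"]  # a column's raw values

# Deduplicate preserving first-seen order, then assign fresh sequential ids.
unique_vals = list(dict.fromkeys(rows))
sub_authority = {new_id: val for new_id, val in enumerate(unique_vals)}
print(sub_authority)  # {0: 'red', 1: 'blue', 2: 'green'}
```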
TokenAligner ¶
Aligns tokens with a lexical authority to generate annotations.
Processes a token stream, matching tokens against lexical authority variations and generating annotations for matches. Handles overlapping matches and tracks processed tokens.
Source code in packages/xization/src/dataknobs_xization/lexicon.py
TokenMatch ¶
Represents a match between tokens and a lexical authority variation.
Matches a sequence of tokens against a lexical authority variation, tracking whether the match is complete and providing access to matched text and annotation generation.
Attributes:

| Name | Type | Description |
|---|---|---|
| matched_text | | Get the matched original text. |
Source code in packages/xization/src/dataknobs_xization/lexicon.py
Usage Examples¶
Text Normalization Example¶
from dataknobs_xization import normalize
# Basic text normalization
text = " Hello, WORLD! \n\t How are you? "
normalized = normalize.basic_normalization_fn(text)
print(normalized) # "hello, world! how are you?"
# CamelCase expansion
camel_text = "firstName"
expanded = normalize.expand_camelcase_fn(camel_text)
print(expanded) # "first Name"
# Generate lexical variations
text_with_hyphens = "multi-platform/cross-browser"
variations = normalize.get_lexical_variations(text_with_hyphens)
print(f"Generated {len(variations)} variations:")
for var in sorted(variations):
    print(f"  {var}")
# Symbol handling
text_with_symbols = "!Hello world?"
cleaned = normalize.drop_non_embedded_symbols_fn(text_with_symbols)
print(cleaned) # "Hello world"
embedded_text = "user@domain.com"
processed = normalize.drop_embedded_symbols_fn(embedded_text, " ")
print(processed) # "user domain com"
# Ampersand expansion
ampersand_text = "Research & Development"
expanded_ampersand = normalize.expand_ampersand_fn(ampersand_text)
print(expanded_ampersand) # "Research and Development"
Character Features Example¶
from dataknobs_xization.masking_tokenizer import CharacterFeatures
from dataknobs_structures import document as dk_doc
import pandas as pd
# Create a concrete implementation of CharacterFeatures
class BasicCharacterFeatures(CharacterFeatures):
    """Basic character-level feature extraction."""

    @property
    def cdf(self) -> pd.DataFrame:
        """Create character dataframe with features."""
        if not hasattr(self, '_cdf'):
            chars = list(self.text)
            # Add padding if specified
            if self._roll_padding > 0:
                pad_char = '<PAD>'
                chars = ([pad_char] * self._roll_padding +
                         chars +
                         [pad_char] * self._roll_padding)
            # Create feature dataframe
            self._cdf = pd.DataFrame({
                self.text_col: chars,
                'position': range(len(chars)),
                'is_alpha': [c.isalpha() if c != '<PAD>' else False for c in chars],
                'is_digit': [c.isdigit() if c != '<PAD>' else False for c in chars],
                'is_upper': [c.isupper() if c != '<PAD>' else False for c in chars],
                'is_lower': [c.islower() if c != '<PAD>' else False for c in chars],
                'is_space': [c.isspace() if c != '<PAD>' else False for c in chars],
                'is_punct': [(not c.isalnum() and not c.isspace()) if c != '<PAD>' else False for c in chars],
                'is_padding': [c == '<PAD>' for c in chars]
            })
        return self._cdf
# Usage
text = "Hello, World! 123 👋"
features = BasicCharacterFeatures(text, roll_padding=2)
print(f"Text: {features.text}")
print(f"Text column: {features.text_col}")
print("\nCharacter DataFrame:")
print(features.cdf.head(10))
# Analyze character distribution
cdf = features.cdf
print("\nCharacter Analysis:")
print(f"Total characters: {len(cdf)}")
print(f"Alphabetic: {cdf['is_alpha'].sum()}")
print(f"Digits: {cdf['is_digit'].sum()}")
print(f"Spaces: {cdf['is_space'].sum()}")
print(f"Punctuation: {cdf['is_punct'].sum()}")
print(f"Padding: {cdf['is_padding'].sum()}")
Text Masking Example¶
from dataknobs_xization.masking_tokenizer import CharacterFeatures
import pandas as pd
import numpy as np
class MaskingCharacterFeatures(CharacterFeatures):
    """Character features with masking capability."""

    def __init__(self, doctext, roll_padding=0, mask_probability=0.15):
        super().__init__(doctext, roll_padding)
        self.mask_probability = mask_probability

    @property
    def cdf(self) -> pd.DataFrame:
        """Character dataframe with masking features."""
        if not hasattr(self, '_cdf'):
            chars = list(self.text)
            if self._roll_padding > 0:
                pad_char = '<PAD>'
                chars = ([pad_char] * self._roll_padding +
                         chars +
                         [pad_char] * self._roll_padding)
            # Set random seed for reproducibility
            np.random.seed(42)
            self._cdf = pd.DataFrame({
                self.text_col: chars,
                'original_char': chars,
                'position': range(len(chars)),
                'is_alpha': [c.isalpha() if c != '<PAD>' else False for c in chars],
                'is_digit': [c.isdigit() if c != '<PAD>' else False for c in chars],
                'should_mask': np.random.random(len(chars)) < self.mask_probability,
                'is_padding': [c == '<PAD>' for c in chars]
            })
            # Apply masking
            mask_indices = self._cdf['should_mask'] & ~self._cdf['is_padding']
            self._cdf.loc[mask_indices, self.text_col] = '[MASK]'
        return self._cdf

    def get_masked_text(self) -> str:
        """Get the masked version of the text."""
        cdf = self.cdf
        masked_chars = cdf[~cdf['is_padding']][self.text_col].tolist()
        return ''.join(masked_chars)
# Usage
original_text = "This is a sample text for demonstration."
masker = MaskingCharacterFeatures(original_text, mask_probability=0.2)
print(f"Original: {original_text}")
print(f"Masked: {masker.get_masked_text()}")
print(f"\nMask Statistics:")
cdf = masker.cdf
print(f"Total chars: {len(cdf)}")
print(f"Masked chars: {cdf['should_mask'].sum()}")
print(f"Mask ratio: {cdf['should_mask'].mean():.2%}")
Complete Text Processing Pipeline¶
from dataknobs_xization import normalize, masking_tokenizer
from dataknobs_structures import document as dk_doc
import pandas as pd
class TextProcessingPipeline:
    """Complete text processing with normalization and analysis."""

    def __init__(self, normalize_config=None, analysis_config=None):
        self.normalize_config = normalize_config or {}
        self.analysis_config = analysis_config or {}

    def process_document(self, doc: dk_doc.Document) -> dict:
        """Process a document through the complete pipeline."""
        original_text = doc.text
        results = {
            'document_id': getattr(doc, 'text_id', None),
            'original_text': original_text
        }
        # Step 1: Normalization
        normalized_text = self._normalize_text(original_text)
        results['normalized_text'] = normalized_text
        # Step 2: Generate variations
        variations = normalize.get_lexical_variations(
            normalized_text, **self.normalize_config
        )
        results['variations'] = list(variations)
        results['variation_count'] = len(variations)
        # Step 3: Character analysis
        char_analysis = self._analyze_characters(normalized_text)
        results['character_analysis'] = char_analysis
        return results

    def _normalize_text(self, text: str) -> str:
        """Apply normalization pipeline."""
        # Expand camelCase
        text = normalize.expand_camelcase_fn(text)
        # Expand ampersands
        text = normalize.expand_ampersand_fn(text)
        # Drop parentheticals
        if self.normalize_config.get('drop_parentheticals', True):
            text = normalize.drop_parentheticals_fn(text)
        # Handle symbols
        if self.normalize_config.get('drop_non_embedded_symbols', True):
            text = normalize.drop_non_embedded_symbols_fn(text)
        # Basic normalization
        text = normalize.basic_normalization_fn(text)
        return text

    def _analyze_characters(self, text: str) -> dict:
        """Analyze character-level features."""
        class AnalysisCharFeatures(masking_tokenizer.CharacterFeatures):
            @property
            def cdf(self):
                chars = list(self.text)
                return pd.DataFrame({
                    self.text_col: chars,
                    'position': range(len(chars)),
                    'is_alpha': [c.isalpha() for c in chars],
                    'is_digit': [c.isdigit() for c in chars],
                    'is_space': [c.isspace() for c in chars],
                    'is_punct': [not c.isalnum() and not c.isspace() for c in chars]
                })
        features = AnalysisCharFeatures(text)
        cdf = features.cdf
        return {
            'total_characters': len(cdf),
            'alphabetic_characters': cdf['is_alpha'].sum(),
            'digit_characters': cdf['is_digit'].sum(),
            'space_characters': cdf['is_space'].sum(),
            'punctuation_characters': cdf['is_punct'].sum(),
            'alphabetic_ratio': cdf['is_alpha'].mean(),
            'digit_ratio': cdf['is_digit'].mean(),
            'space_ratio': cdf['is_space'].mean(),
'punctuation_ratio': cdf['is_punct'].mean()
}
def process_batch(self, documents: list) -> list:
"""Process multiple documents."""
return [self.process_document(doc) for doc in documents]
# Usage example
config = {
'drop_parentheticals': True,
'drop_non_embedded_symbols': True,
'expand_camelcase': True,
'expand_ampersands': True,
'add_eng_plurals': True
}
pipeline = TextProcessingPipeline(normalize_config=config)
# Create sample documents
documents = [
    dk_doc.Text(
        "getUserName() & validateInput (required)",
        text_id="tech_doc_1"
    ),
    dk_doc.Text(
        "Machine Learning (ML) & Artificial Intelligence",
        text_id="ai_doc_1"
    )
]
# Process documents
results = pipeline.process_batch(documents)
# Display results
for result in results:
print(f"\nDocument: {result['document_id']}")
print(f"Original: {result['original_text']}")
print(f"Normalized: {result['normalized_text']}")
print(f"Variations: {result['variation_count']}")
print(f"Character Analysis: {result['character_analysis']}")
Integration with Other Packages¶
from dataknobs_xization import normalize, masking_tokenizer
from dataknobs_utils import file_utils, elasticsearch_utils
from dataknobs_structures import Tree, document as dk_doc
import json
def create_searchable_documents(input_dir: str) -> list:
"""Create searchable documents with normalized text."""
searchable_docs = []
# Process all text files
for filepath in file_utils.filepath_generator(input_dir):
if filepath.endswith('.txt'):
# Read file content
content_lines = list(file_utils.fileline_generator(filepath))
full_text = '\n'.join(content_lines)
# Normalize text
normalized = normalize.basic_normalization_fn(full_text)
normalized = normalize.expand_camelcase_fn(normalized)
normalized = normalize.expand_ampersand_fn(normalized)
# Generate search variations
variations = normalize.get_lexical_variations(
normalized,
expand_camelcase=True,
do_hyphen_expansion=True,
do_slash_expansion=True
)
# Create searchable document
searchable_doc = {
'filepath': filepath,
'original_text': full_text,
'normalized_text': normalized,
'search_variations': ' '.join(variations),
'variation_count': len(variations)
}
searchable_docs.append(searchable_doc)
return searchable_docs
# Create Elasticsearch index with normalized documents
def index_normalized_documents(documents: list, index_name: str):
"""Index normalized documents in Elasticsearch."""
table_settings = elasticsearch_utils.TableSettings(
index_name,
{"number_of_shards": 1, "number_of_replicas": 0},
{
"properties": {
"original_text": {"type": "text"},
"normalized_text": {"type": "text", "analyzer": "english"},
"search_variations": {"type": "text"},
"filepath": {"type": "keyword"},
"variation_count": {"type": "integer"}
}
}
)
index = elasticsearch_utils.ElasticsearchIndex(None, [table_settings])
# Create batch file
with open("normalized_batch.jsonl", "w") as f:
elasticsearch_utils.add_batch_data(
f, iter(documents), index_name
)
return index
# Usage
documents = create_searchable_documents("/path/to/text/files")
index = index_normalized_documents(documents, "normalized_texts")
print(f"Indexed {len(documents)} normalized documents")
Error Handling¶
from dataknobs_xization import normalize, masking_tokenizer
from dataknobs_structures import document as dk_doc
def safe_text_processing(text: str) -> dict:
"""Safely process text with comprehensive error handling."""
results = {'original': text, 'errors': []}
try:
# Normalization with error handling
normalized = normalize.basic_normalization_fn(text)
results['normalized'] = normalized
except Exception as e:
results['errors'].append(f"Normalization failed: {e}")
results['normalized'] = text
try:
# CamelCase expansion
expanded = normalize.expand_camelcase_fn(results['normalized'])
results['camelcase_expanded'] = expanded
except Exception as e:
results['errors'].append(f"CamelCase expansion failed: {e}")
results['camelcase_expanded'] = results['normalized']
try:
# Variation generation
variations = normalize.get_lexical_variations(results['camelcase_expanded'])
results['variations'] = list(variations)
except Exception as e:
results['errors'].append(f"Variation generation failed: {e}")
results['variations'] = [results['camelcase_expanded']]
try:
# Character analysis
class SafeCharFeatures(masking_tokenizer.CharacterFeatures):
@property
def cdf(self):
import pandas as pd
chars = list(self.text) if self.text else []
return pd.DataFrame({
self.text_col: chars,
'is_alpha': [c.isalpha() for c in chars]
})
features = SafeCharFeatures(results['camelcase_expanded'])
results['character_count'] = len(features.cdf)
except Exception as e:
results['errors'].append(f"Character analysis failed: {e}")
results['character_count'] = 0
results['success'] = len(results['errors']) == 0
return results
# Usage
test_texts = [
"Normal text for processing",
"camelCaseText & symbols!",
"", # Empty string
None, # None value
"Special unicode: 👋🌍"
]
for i, text in enumerate(test_texts):
try:
result = safe_text_processing(text or "")
print(f"\nTest {i+1}: {'SUCCESS' if result['success'] else 'ERRORS'}")
print(f"Original: {repr(text)}")
if result['success']:
print(f"Normalized: {result['normalized']}")
print(f"Variations: {len(result['variations'])}")
else:
print(f"Errors: {result['errors']}")
except Exception as e:
print(f"\nTest {i+1}: CRITICAL ERROR - {e}")
Testing¶
import pytest
from dataknobs_xization import normalize, masking_tokenizer
from dataknobs_structures import document as dk_doc
import pandas as pd
class TestXizationFunctions:
"""Test suite for xization functionality."""
def test_normalization_functions(self):
"""Test core normalization functions."""
# Test camelCase expansion
assert normalize.expand_camelcase_fn("firstName") == "first Name"
assert normalize.expand_camelcase_fn("XMLParser") == "XML Parser"
# Test symbol handling
assert normalize.drop_non_embedded_symbols_fn("!Hello world?") == "Hello world"
assert normalize.drop_embedded_symbols_fn("user@domain.com") == "userdomaincom"
# Test ampersand expansion
assert normalize.expand_ampersand_fn("A & B") == "A and B"
# Test parenthetical removal
assert normalize.drop_parentheticals_fn("Text (with note)") == "Text "
def test_lexical_variations(self):
"""Test lexical variation generation."""
variations = normalize.get_lexical_variations("multi-platform")
# Check expected variations are present
assert "multi platform" in variations
assert "multiplatform" in variations
assert "multi-platform" in variations
# Check it returns a set
assert isinstance(variations, set)
assert len(variations) > 1
def test_character_features(self):
"""Test character feature extraction."""
class TestCharFeatures(masking_tokenizer.CharacterFeatures):
@property
def cdf(self):
chars = list(self.text)
return pd.DataFrame({
self.text_col: chars,
'is_alpha': [c.isalpha() for c in chars],
'is_digit': [c.isdigit() for c in chars]
})
features = TestCharFeatures("Hello123")
cdf = features.cdf
# Test basic properties
assert len(cdf) == 8
assert cdf['is_alpha'].sum() == 5 # "Hello"
assert cdf['is_digit'].sum() == 3 # "123"
# Test text properties
assert features.text == "Hello123"
assert features.text_col == 'text' # Default column name
def test_document_integration(self):
"""Test integration with document structures."""
doc = dk_doc.Text("Test document", text_id="test1")
class DocCharFeatures(masking_tokenizer.CharacterFeatures):
@property
def cdf(self):
chars = list(self.text)
return pd.DataFrame({self.text_col: chars})
features = DocCharFeatures(doc)
assert features.text_id == "test1"
assert features.text == "Test document"
def test_error_handling(self):
"""Test error handling in various scenarios."""
# Test empty text
empty_variations = normalize.get_lexical_variations("")
assert isinstance(empty_variations, set)
        # Test empty-string handling in utility function
from dataknobs_xization.normalize import basic_normalization_fn
try:
result = basic_normalization_fn("")
assert isinstance(result, str)
except Exception:
pytest.fail("Should handle empty string gracefully")
# Run tests
if __name__ == "__main__":
test_suite = TestXizationFunctions()
test_suite.test_normalization_functions()
test_suite.test_lexical_variations()
test_suite.test_character_features()
test_suite.test_document_integration()
test_suite.test_error_handling()
print("All tests passed!")
Performance Notes¶
- Regular Expressions: patterns are pre-compiled at import time for efficient reuse
- Character Analysis: building a per-character DataFrame is memory-intensive; process long documents in chunks
- Variation Generation: can produce many variations; filter or cap the set before indexing
- Pandas DataFrames: efficient for character-level analysis, but monitor memory usage on large inputs
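For long documents, the chunked approach noted above can be sketched as follows. This is a minimal illustration, not a dataknobs_xization API; the chunk size and the aggregated counters are arbitrary choices:

```python
import pandas as pd

def analyze_in_chunks(text: str, chunk_size: int = 10_000) -> dict:
    """Aggregate character-level counts without materializing one big DataFrame."""
    totals = {'total': 0, 'alpha': 0, 'digit': 0, 'space': 0}
    for start in range(0, len(text), chunk_size):
        chunk = text[start:start + chunk_size]
        # Each chunk's DataFrame is small and discarded after aggregation
        cdf = pd.DataFrame({
            'char': list(chunk),
            'is_alpha': [c.isalpha() for c in chunk],
            'is_digit': [c.isdigit() for c in chunk],
            'is_space': [c.isspace() for c in chunk],
        })
        totals['total'] += len(cdf)
        totals['alpha'] += int(cdf['is_alpha'].sum())
        totals['digit'] += int(cdf['is_digit'].sum())
        totals['space'] += int(cdf['is_space'].sum())
    return totals

stats = analyze_in_chunks("Hello 123 " * 5, chunk_size=16)
print(stats)
```

Peak memory is bounded by `chunk_size` rather than the full document length, at the cost of one DataFrame construction per chunk.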
Dependencies¶
Core dependencies for dataknobs_xization:
- pandas: character-level DataFrame analysis
- numpy: random masking and numeric features
- dataknobs-structures: document and tree integration
Contributing¶
For contributing to dataknobs_xization:
- Fork the repository
- Create feature branch for text processing enhancements
- Add comprehensive tests for normalization functions
- Test with various text types and edge cases
- Submit pull request with documentation updates
See Contributing Guide for detailed information.
Changelog¶
Version 1.0.0¶
- Initial release
- Text normalization functions
- Character-level feature extraction
- Lexical variation generation
- Masking tokenizer framework
- Integration with dataknobs-structures
License¶
See License for license information.