dataknobs-xization API Reference

Complete API documentation for the dataknobs_xization package.

💡 Quick Links:

  • Complete API Documentation - Full auto-generated reference
  • Source Code - Browse on GitHub
  • Package Guide - Detailed documentation

Package Information

  • Package Name: dataknobs_xization
  • Version: 1.0.0
  • Description: Text normalization and tokenization tools
  • Python Requirements: >=3.8

Installation

pip install dataknobs-xization

Import Statement

from dataknobs_xization import (
    annotations,
    authorities,
    lexicon,
    masking_tokenizer,
    normalize
)

# Import key classes
from dataknobs_xization.masking_tokenizer import CharacterFeatures, TextFeatures

Module Documentation

normalize

Regular Expression Patterns

SQUASH_WS_RE

dataknobs_xization.normalize.SQUASH_WS_RE module-attribute

SQUASH_WS_RE = compile('\\s+')
ALL_SYMBOLS_RE

dataknobs_xization.normalize.ALL_SYMBOLS_RE module-attribute

ALL_SYMBOLS_RE = compile('[^\\w\\s]+')
CAMELCASE_LU_RE

dataknobs_xization.normalize.CAMELCASE_LU_RE module-attribute

CAMELCASE_LU_RE = compile('([a-z]+)([A-Z])')
CAMELCASE_UL_RE

dataknobs_xization.normalize.CAMELCASE_UL_RE module-attribute

CAMELCASE_UL_RE = compile('([A-Z]+)([A-Z][a-z])')
NON_EMBEDDED_WORD_SYMS_RE

dataknobs_xization.normalize.NON_EMBEDDED_WORD_SYMS_RE module-attribute

NON_EMBEDDED_WORD_SYMS_RE = compile('((?<!\\w)[^\\w\\s]+)|([^\\w\\s]+(?!\\w))')
EMBEDDED_SYMS_RE

dataknobs_xization.normalize.EMBEDDED_SYMS_RE module-attribute

EMBEDDED_SYMS_RE = compile('(?<=\\w)[^\\w\\s]+(?=\\w)')
HYPHEN_SLASH_RE

dataknobs_xization.normalize.HYPHEN_SLASH_RE module-attribute

HYPHEN_SLASH_RE = compile('(?<=\\w)[\\-\\/ ](?=\\w)')
HYPHEN_ONLY_RE

dataknobs_xization.normalize.HYPHEN_ONLY_RE module-attribute

HYPHEN_ONLY_RE = compile('(?<=\\w)[\\- ](?=\\w)')
SLASH_ONLY_RE

dataknobs_xization.normalize.SLASH_ONLY_RE module-attribute

SLASH_ONLY_RE = compile('(?<=\\w)\\/(?=\\w)')
PARENTHETICAL_RE

dataknobs_xization.normalize.PARENTHETICAL_RE module-attribute

PARENTHETICAL_RE = compile('\\(.*\\)')
AMPERSAND_RE

dataknobs_xization.normalize.AMPERSAND_RE module-attribute

AMPERSAND_RE = compile('\\s*\\&\\s*')
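As a quick illustration, a couple of the patterns above can be exercised directly with the standard `re` module. The compiled patterns below simply mirror the documented pattern strings; they are assumed to behave the same as the package's module attributes:

```python
import re

# Mirrors of the documented patterns (assumed to match the package's behavior).
SQUASH_WS_RE = re.compile(r"\s+")
ALL_SYMBOLS_RE = re.compile(r"[^\w\s]+")

# Collapse every run of whitespace down to a single space.
squashed = SQUASH_WS_RE.sub(" ", "too   many\t spaces")
# Strip every non-word, non-whitespace symbol.
no_syms = ALL_SYMBOLS_RE.sub("", "hello, world!")

print(squashed)  # too many spaces
print(no_syms)   # hello world
```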

Functions

expand_camelcase_fn

dataknobs_xization.normalize.expand_camelcase_fn

expand_camelcase_fn(text: str) -> str

Expand both "lU" and "UUl" camelcasing to "l U" and "U Ul"

Source code in packages/xization/src/dataknobs_xization/normalize.py
def expand_camelcase_fn(text: str) -> str:
    """Expand both "lU" and "UUl" camelcasing to "l U" and "U Ul" """
    text = CAMELCASE_LU_RE.sub(r"\1 \2", text)
    return CAMELCASE_UL_RE.sub(r"\1 \2", text)
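A standalone sketch of the same two-pass expansion, using the documented pattern strings (assumed equivalent to the package's compiled attributes):

```python
import re

# The documented camel-case patterns (assumed equivalent to the package's).
CAMELCASE_LU_RE = re.compile(r"([a-z]+)([A-Z])")
CAMELCASE_UL_RE = re.compile(r"([A-Z]+)([A-Z][a-z])")

def expand_camelcase(text: str) -> str:
    # First split lower->Upper boundaries ("lU"), then split an uppercase
    # run followed by a capitalized word ("UUl").
    text = CAMELCASE_LU_RE.sub(r"\1 \2", text)
    return CAMELCASE_UL_RE.sub(r"\1 \2", text)

print(expand_camelcase("camelCaseText"))  # camel Case Text
print(expand_camelcase("XMLParser"))      # XML Parser
```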
drop_non_embedded_symbols_fn

dataknobs_xization.normalize.drop_non_embedded_symbols_fn

drop_non_embedded_symbols_fn(text: str, repl: str = '') -> str

Drop symbols not embedded within word characters

Source code in packages/xization/src/dataknobs_xization/normalize.py
def drop_non_embedded_symbols_fn(text: str, repl: str = "") -> str:
    """Drop symbols not embedded within word characters"""
    return NON_EMBEDDED_WORD_SYMS_RE.sub(repl, text)
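To see what "non-embedded" means in practice, here is the documented pattern applied directly: symbols touching whitespace or a string edge are dropped, while symbols sandwiched between word characters survive.

```python
import re

# Documented pattern: symbol runs with no word character on at least one side.
NON_EMBEDDED_WORD_SYMS_RE = re.compile(r"((?<!\w)[^\w\s]+)|([^\w\s]+(?!\w))")

# The comma, "(" and ")!" are dropped; the embedded hyphen is kept.
cleaned = NON_EMBEDDED_WORD_SYMS_RE.sub("", "hello, well-known (text)!")
print(cleaned)  # hello well-known text
```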
drop_embedded_symbols_fn

dataknobs_xization.normalize.drop_embedded_symbols_fn

drop_embedded_symbols_fn(text: str, repl: str = '') -> str

Drop symbols embedded within word characters

Source code in packages/xization/src/dataknobs_xization/normalize.py
def drop_embedded_symbols_fn(text: str, repl: str = "") -> str:
    """Drop symbols embedded within word characters"""
    return EMBEDDED_SYMS_RE.sub(repl, text)
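Conversely, the embedded-symbols pattern targets only symbols sandwiched between word characters, with an optional replacement string (`repl`):

```python
import re

# Documented pattern: symbol runs sandwiched between word characters.
EMBEDDED_SYMS_RE = re.compile(r"(?<=\w)[^\w\s]+(?=\w)")

dropped = EMBEDDED_SYMS_RE.sub("", "e-mail don't")  # drop embedded symbols
spaced = EMBEDDED_SYMS_RE.sub(" ", "foo::bar")      # or space them out
print(dropped)  # email dont
print(spaced)   # foo bar
```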
get_hyphen_slash_expansions_fn

dataknobs_xization.normalize.get_hyphen_slash_expansions_fn

get_hyphen_slash_expansions_fn(
    text: str,
    subs: List[str] = ("-", " ", ""),
    add_self: bool = True,
    do_split: bool = True,
    min_split_token_len: int = 2,
    hyphen_slash_re: Pattern[str] = HYPHEN_SLASH_RE,
) -> Set[str]

Given text with words that may appear hyphenated, slash-delimited, or space-delimited, return the set of potential variations:

  • the text as-is (add_self)
  • with a hyphen between all words (if '-' in subs)
  • with a space between all words (if ' ' in subs)
  • with all words squashed together (empty string between, if '' in subs)
  • with each word separately (do_split, as long as min_split_token_len is met for all tokens)

Note
  • To add a variation with a slash, add '/' to subs.
  • To not add any variations with symbols, leave them out of subs and don't add self.

Parameters:

  • text (str, required): The hyphen-worthy snippet of text, either already hyphenated or with a slash or space delimited.
  • subs (List[str], default ('-', ' ', '')): A string of characters or list of strings to insert between tokens.
  • add_self (bool, default True): True to include the text itself in the result.
  • do_split (bool, default True): True to add split tokens separately.
  • min_split_token_len (int, default 2): If any of the split tokens fail to meet the min token length, don't add any of the splits.
  • hyphen_slash_re (Pattern[str], default HYPHEN_SLASH_RE): The regex to identify hyphen/slash to expand.

Returns:

  • Set[str]: The set of text variations.

Source code in packages/xization/src/dataknobs_xization/normalize.py
def get_hyphen_slash_expansions_fn(
    text: str,
    subs: List[str] = ("-", " ", ""),
    add_self: bool = True,
    do_split: bool = True,
    min_split_token_len: int = 2,
    hyphen_slash_re: re.Pattern[str] = HYPHEN_SLASH_RE,
) -> Set[str]:
    """Given text with words that may or may not appear as hyphenated or with a
    slash, return the set of potential variations:
        - the text as-is (add_self)
        - with a hyphen between all words (if '-' in subs)
        - with a space between all words (if ' ' in subs)
        - with all words squashed together (empty string between if '' in subs)
        - with each word separately (do_split as long as min_split_token_len is
              met for all tokens)

    Note:
        * To add a variation with a slash, add '/' to subs.
        * To not add any variations with symbols, leave them out of subs
          and don't add self.

    Args:
        text: The hyphen-worthy snippet of text, either already
            hyphenated or with a slash or space delimited.
        subs: A string of characters or list of strings to insert between
            tokens.
        add_self: True to include the text itself in the result.
        do_split: True to add split tokens separately.
        min_split_token_len: If any of the split tokens fail
            to meet the min token length, don't add any of the splits.
        hyphen_slash_re: The regex to identify hyphen/slash to expand.

    Returns:
        The set of text variations.
    """
    variations = {text} if add_self else set()
    if subs is not None and len(subs) > 0:
        # create variant with all <s>'s
        for s in subs:
            variations.add(hyphen_slash_re.sub(s, text))
    if do_split:
        # add each word separately
        tokens = set(hyphen_slash_re.split(text))
        if all(len(t) >= min_split_token_len for t in tokens):
            variations.update(tokens)
    return variations
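The expansion logic above can be sketched in a few self-contained lines, using the documented HYPHEN_SLASH_RE pattern string (assumed to match the package's):

```python
import re

# The documented hyphen/slash/space pattern (assumed to match the package's).
HYPHEN_SLASH_RE = re.compile(r"(?<=\w)[\-\/ ](?=\w)")

def expansions(text, subs=("-", " ", ""), add_self=True, do_split=True, min_len=2):
    """Minimal sketch of get_hyphen_slash_expansions_fn's documented behavior."""
    variations = {text} if add_self else set()
    for s in subs:                       # one variant per substitute string
        variations.add(HYPHEN_SLASH_RE.sub(s, text))
    if do_split:
        tokens = set(HYPHEN_SLASH_RE.split(text))
        if all(len(t) >= min_len for t in tokens):
            variations.update(tokens)    # add each word on its own
    return variations

print(sorted(expansions("data-knobs")))
# ['data', 'data knobs', 'data-knobs', 'dataknobs', 'knobs']
```

Note how `min_len` guards the splits: for "x-ray" the single-character token "x" blocks all splits, so only the joined variants are returned.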
drop_parentheticals_fn

dataknobs_xization.normalize.drop_parentheticals_fn

drop_parentheticals_fn(text: str) -> str

Drop parenthetical expressions from the text.

Source code in packages/xization/src/dataknobs_xization/normalize.py
def drop_parentheticals_fn(text: str) -> str:
    """Drop parenthetical expressions from the text."""
    return PARENTHETICAL_RE.sub("", text)
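A direct demonstration of the documented pattern; since `.*` is greedy, a single match spans from the first "(" to the last ")" in the text:

```python
import re

# Documented pattern; greedy, so it spans first "(" to last ")".
PARENTHETICAL_RE = re.compile(r"\(.*\)")

stripped = PARENTHETICAL_RE.sub("", "acetaminophen (paracetamol) tablets")
print(stripped)  # "acetaminophen  tablets" (the surrounding spaces remain)
```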
expand_ampersand_fn

dataknobs_xization.normalize.expand_ampersand_fn

expand_ampersand_fn(text: str) -> str

Replace '&' with ' and '.

Source code in packages/xization/src/dataknobs_xization/normalize.py
def expand_ampersand_fn(text: str) -> str:
    """Replace '&' with ' and '."""
    return AMPERSAND_RE.sub(" and ", text)
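Because the documented pattern consumes the whitespace around "&" as well, the replacement normalizes spacing in one step:

```python
import re

# Documented pattern: "&" plus any surrounding whitespace.
AMPERSAND_RE = re.compile(r"\s*\&\s*")

att = AMPERSAND_RE.sub(" and ", "AT&T")
pets = AMPERSAND_RE.sub(" and ", "cats & dogs")
print(att)   # AT and T
print(pets)  # cats and dogs
```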
get_lexical_variations

dataknobs_xization.normalize.get_lexical_variations

get_lexical_variations(
    text: str,
    include_self: bool = True,
    expand_camelcase: bool = True,
    drop_non_embedded_symbols: bool = True,
    drop_embedded_symbols: bool = True,
    spacify_embedded_symbols: bool = False,
    do_hyphen_expansion: bool = True,
    hyphen_subs: List[str] = (" ", ""),
    do_hyphen_split: bool = True,
    min_hyphen_split_token_len: int = 2,
    do_slash_expansion: bool = True,
    slash_subs: List[str] = (" ", " or "),
    do_slash_split: bool = True,
    min_slash_split_token_len: int = 1,
    drop_parentheticals: bool = True,
    expand_ampersands: bool = True,
    add_eng_plurals: bool = True,
) -> Set[str]

Get all variations for the text (including the text itself).

Parameters:

  • text (str, required): The text to generate variations for.
  • include_self (bool, default True): True to include the original text in the result.
  • expand_camelcase (bool, default True): True to expand camelCase text.
  • drop_non_embedded_symbols (bool, default True): True to drop symbols not embedded in words.
  • drop_embedded_symbols (bool, default True): True to drop symbols embedded in words.
  • spacify_embedded_symbols (bool, default False): True to replace embedded symbols with spaces.
  • do_hyphen_expansion (bool, default True): True to expand hyphenated text.
  • hyphen_subs (List[str], default (' ', '')): List of strings to substitute for hyphens.
  • do_hyphen_split (bool, default True): True to split on hyphens.
  • min_hyphen_split_token_len (int, default 2): Minimum token length for hyphen splits.
  • do_slash_expansion (bool, default True): True to expand slashes.
  • slash_subs (List[str], default (' ', ' or ')): List of strings to substitute for slashes.
  • do_slash_split (bool, default True): True to split on slashes.
  • min_slash_split_token_len (int, default 1): Minimum token length for slash splits.
  • drop_parentheticals (bool, default True): True to drop parenthetical expressions.
  • expand_ampersands (bool, default True): True to expand ampersands to ' and '.
  • add_eng_plurals (bool, default True): True to add English plural forms.

Returns:

  • Set[str]: The set of all text variations.

Source code in packages/xization/src/dataknobs_xization/normalize.py
def get_lexical_variations(
    text: str,
    include_self: bool = True,
    expand_camelcase: bool = True,
    drop_non_embedded_symbols: bool = True,
    drop_embedded_symbols: bool = True,
    spacify_embedded_symbols: bool = False,
    do_hyphen_expansion: bool = True,
    hyphen_subs: List[str] = (" ", ""),
    do_hyphen_split: bool = True,
    min_hyphen_split_token_len: int = 2,
    do_slash_expansion: bool = True,
    slash_subs: List[str] = (" ", " or "),
    do_slash_split: bool = True,
    min_slash_split_token_len: int = 1,
    drop_parentheticals: bool = True,
    expand_ampersands: bool = True,
    add_eng_plurals: bool = True,
) -> Set[str]:
    """Get all variations for the text (including the text itself).

    Args:
        text: The text to generate variations for.
        include_self: True to include the original text in the result.
        expand_camelcase: True to expand camelCase text.
        drop_non_embedded_symbols: True to drop symbols not embedded in words.
        drop_embedded_symbols: True to drop symbols embedded in words.
        spacify_embedded_symbols: True to replace embedded symbols with spaces.
        do_hyphen_expansion: True to expand hyphenated text.
        hyphen_subs: List of strings to substitute for hyphens.
        do_hyphen_split: True to split on hyphens.
        min_hyphen_split_token_len: Minimum token length for hyphen splits.
        do_slash_expansion: True to expand slashes.
        slash_subs: List of strings to substitute for slashes.
        do_slash_split: True to split on slashes.
        min_slash_split_token_len: Minimum token length for slash splits.
        drop_parentheticals: True to drop parenthetical expressions.
        expand_ampersands: True to expand ampersands to ' and '.
        add_eng_plurals: True to add English plural forms.

    Returns:
        The set of all text variations.
    """
    variations = {text} if include_self else set()
    if expand_camelcase:
        variations.add(expand_camelcase_fn(text))
    if drop_non_embedded_symbols:
        variations.add(drop_non_embedded_symbols_fn(text))
    if drop_embedded_symbols:
        variations.add(drop_embedded_symbols_fn(text))
    if spacify_embedded_symbols:
        variations.add(drop_embedded_symbols_fn(text, " "))
    if (
        do_hyphen_expansion and hyphen_subs is not None and len(hyphen_subs) > 0
    ) or do_hyphen_split:
        variations.update(
            get_hyphen_slash_expansions_fn(
                text,
                subs=hyphen_subs,
                add_self=False,
                do_split=do_hyphen_split,
                min_split_token_len=min_hyphen_split_token_len,
            )
        )
    if (do_slash_expansion and slash_subs is not None and len(slash_subs) > 0) or do_slash_split:
        variations.update(
            get_hyphen_slash_expansions_fn(
                text,
                subs=slash_subs,
                add_self=False,
                do_split=do_slash_split,
                min_split_token_len=min_slash_split_token_len,
            )
        )
    if drop_parentheticals:
        variations.add(drop_parentheticals_fn(text))
    if expand_ampersands:
        variations.add(expand_ampersand_fn(text))
    if add_eng_plurals:
        # TODO: Use a better pluralizer
        plurals = {f"{v}s" for v in variations}
        variations.update(plurals)
    return variations
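The overall shape of the pipeline, condensed into a self-contained sketch: each enabled step contributes one or more variants of the input, and naive "+s" plurals are added last. This reuses a few of the documented pattern strings and is an illustration only, not the package's full implementation:

```python
import re

# Documented pattern strings (assumed to match the package's attributes).
CAMELCASE_LU_RE = re.compile(r"([a-z]+)([A-Z])")
AMPERSAND_RE = re.compile(r"\s*\&\s*")
PARENTHETICAL_RE = re.compile(r"\(.*\)")

def lexical_variations(text: str) -> set:
    """Condensed sketch: a few of the documented variation steps."""
    variations = {text}                                    # include_self
    variations.add(CAMELCASE_LU_RE.sub(r"\1 \2", text))    # expand_camelcase
    variations.add(AMPERSAND_RE.sub(" and ", text))        # expand_ampersands
    variations.add(PARENTHETICAL_RE.sub("", text).strip()) # drop_parentheticals
    variations |= {f"{v}s" for v in variations}            # naive English plurals
    return variations

print(sorted(lexical_variations("tick&tock")))
```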

masking_tokenizer

Classes

CharacterFeatures

dataknobs_xization.masking_tokenizer.CharacterFeatures

CharacterFeatures(doctext: Union[Text, str], roll_padding: int = 0)

Bases: ABC

Class representing features of text as a dataframe with each character as a row and columns representing character features.

Initialize with the text to tokenize.

Parameters:

  • doctext (Union[Text, str], required): The text to tokenize (or dk_doc.Text with its metadata).
  • roll_padding (int, default 0): The number of pad characters added to each end of the text.

Attributes:

  • cdf (DataFrame): The character dataframe with each padded text character as a row.
  • doctext (Text)
  • text_col (str): The name of the cdf column holding the text characters.
  • text (str): The text string.
  • text_id (Any): The ID of the text.

Source code in packages/xization/src/dataknobs_xization/masking_tokenizer.py
def __init__(self, doctext: Union[dk_doc.Text, str], roll_padding: int = 0):
    """Initialize with the text to tokenize.

    Args:
        doctext: The text to tokenize (or dk_doc.Text with its metadata).
        roll_padding: The number of pad characters added to each end of
            the text.
    """
    self._doctext = doctext
    self._roll_padding = roll_padding
    self._padded_text = None

Attributes

cdf property

cdf: DataFrame

The character dataframe with each padded text character as a row.

doctext property

doctext: Text

text_col property

text_col: str

The name of the cdf column holding the text characters.

text property

text: str

The text string.

text_id property

text_id: Any

The ID of the text.

Functions

TextFeatures

dataknobs_xization.masking_tokenizer.TextFeatures

TextFeatures(
    doctext: Union[Text, str],
    split_camelcase: bool = True,
    mark_alpha: bool = False,
    mark_digit: bool = False,
    mark_upper: bool = False,
    mark_lower: bool = False,
    emoji_data: EmojiData = None,
)

Bases: CharacterFeatures

Extracts text-specific character features for tokenization.

Extends CharacterFeatures to provide text tokenization with support for camelCase splitting, character type features (alpha, digit, upper, lower), and emoji handling. Builds a character DataFrame with features for token boundary detection.

Initialize with text tokenization parameters.

Note

If emoji_data is non-null:

  • emojis will be treated as text (instead of as non-text)
  • if split_camelcase is True, then each emoji will be in its own token; otherwise, each sequence of (adjacent) emojis will be treated as a single token

Parameters:

  • doctext (Union[Text, str], required): The text to tokenize with its metadata.
  • split_camelcase (bool, default True): True to mark camel-case features.
  • mark_alpha (bool, default False): True to mark alpha features (separate from alnum).
  • mark_digit (bool, default False): True to mark digit features (separate from alnum).
  • mark_upper (bool, default False): True to mark upper features (auto-included for camel-case).
  • mark_lower (bool, default False): True to mark lower features (auto-included for camel-case).
  • emoji_data (EmojiData, default None): An EmojiData instance to mark emoji BIO features.

Methods:

  • build_first_token: Build the first token as the start of tokenization.

Attributes:

  • cdf (DataFrame): The character dataframe with each padded text character as a row.

Source code in packages/xization/src/dataknobs_xization/masking_tokenizer.py
def __init__(
    self,
    doctext: Union[dk_doc.Text, str],
    split_camelcase: bool = True,
    mark_alpha: bool = False,
    mark_digit: bool = False,
    mark_upper: bool = False,
    mark_lower: bool = False,
    emoji_data: emoji_utils.EmojiData = None,
):
    """Initialize with text tokenization parameters.

    Note:
        If emoji_data is non-null:
            * Then emojis will be treated as text (instead of as non-text)
            * If split_camelcase is True,
                * then each emoji will be in its own token
                * otherwise, each sequence of (adjacent) emojis will be treated
                  as a single token.

    Args:
        doctext: The text to tokenize with its metadata.
        split_camelcase: True to mark camel-case features.
        mark_alpha: True to mark alpha features (separate from alnum).
        mark_digit: True to mark digit features (separate from alnum).
        mark_upper: True to mark upper features (auto-included for
            camel-case).
        mark_lower: True to mark lower features (auto-included for
            camel-case).
        emoji_data: An EmojiData instance to mark emoji BIO features.
    """
    # NOTE: roll_padding is determined by "roll" feature needs. Currently 1.
    super().__init__(doctext, roll_padding=1)
    self.split_camelcase = split_camelcase
    self._cdf = self._build_character_dataframe(
        split_camelcase,
        mark_alpha,
        mark_digit,
        mark_upper,
        mark_lower,
        emoji_data,
    )
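To make the "character dataframe" idea concrete, here is a hand-rolled miniature of the kind of frame TextFeatures builds: one row per character, boolean feature columns, and a derived token-boundary column. The column names below are illustrative, not the package's actual internal names:

```python
import pandas as pd

# Illustrative only: a tiny character feature frame of the kind TextFeatures
# builds, with one row per character and boolean feature columns.
text = "abC1 d"
cdf = pd.DataFrame({
    "char": list(text),
    "alnum": [c.isalnum() for c in text],
    "upper": [c.isupper() for c in text],
    "lower": [c.islower() for c in text],
})
# A token starts wherever "alnum" turns True after a non-alnum position;
# the shift plays the role of the class's "roll" padding.
prev = cdf["alnum"].shift(1, fill_value=False)
cdf["tok_start"] = cdf["alnum"] & ~prev
print(cdf)
```

Here "abC1 d" yields two token starts: position 0 ("abC1") and position 5 ("d").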

Attributes

cdf property

cdf: DataFrame

The character dataframe with each padded text character as a row.

Functions

build_first_token

build_first_token(normalize_fn: Callable[[str], str]) -> Token

Build the first token as the start of tokenization.

Parameters:

  • normalize_fn (Callable[[str], str], required): A function to normalize a raw text term or any of its variations. If None, then the identity function is used.

Returns:

  • Token: The first text token.

Source code in packages/xization/src/dataknobs_xization/masking_tokenizer.py
def build_first_token(
    self,
    normalize_fn: Callable[[str], str],
) -> "Token":
    """Build the first token as the start of tokenization.

    Args:
        normalize_fn: A function to normalize a raw text term or any
            of its variations. If None, then the identity function is used.

    Returns:
        The first text token.
    """
    token_mask = (
        DualTokenMask(
            self,
            self.cdf["tok_start"],
            self.cdf["tok_end"],
        )
        if self.split_camelcase
        else SimpleTokenMask(self, self.cdf["alnum"])
    )
    token = Token(token_mask, normalize_fn=normalize_fn)
    return token

annotations

Functions and Classes

dataknobs_xization.annotations

Text annotation data structures and interfaces.

Provides classes for managing text annotations with metadata, including position tracking, annotation types, and derived annotation columns.

Classes:

  • AnnotatedText: A Text object that manages its own annotations.
  • Annotations: DAO for collecting and managing a table of annotations, where each row carries annotation information for an input token.
  • AnnotationsBuilder: A class for building annotations.
  • AnnotationsGroup: Container for annotation rows that belong together as a (consistent) group.
  • AnnotationsGroupList: Container for a list of annotation groups.
  • AnnotationsMetaData: Container for annotations meta-data, identifying key column names.
  • AnnotationsRowAccessor: A class that accesses row data according to the metadata and derived cols.
  • Annotator: Class for annotating text.
  • AnnotatorKernel: Class for encapsulating core annotation logic for multiple annotators.
  • BasicAnnotator: Class for extracting basic (possibly multi-level or multi-part) entities.
  • CompoundAnnotator: Class to apply a series of annotators through an AnnotatorKernel.
  • DerivedAnnotationColumns: Interface for injecting derived columns into AnnotationsMetaData.
  • EntityAnnotator: Class for extracting single (possibly multi-level or multi-part) entities.
  • HtmlHighlighter: Helper class to add HTML markup for highlighting spans of text.
  • MergeStrategy: A merge strategy to be injected based on entity types being merged.
  • OverlapGroupIterator: Given:
  • PositionalAnnotationsGroup: Container for annotations that either overlap with each other or don't.
  • RowData: A wrapper for an annotation row (pd.Series) to facilitate e.g., grouping.
  • SyntacticParser: Class for creating syntactic annotations for an input.

Functions:

  • merge: Merge the overlapping groups according to the given strategy.

Classes

AnnotatedText

AnnotatedText(
    text_str: str,
    metadata: TextMetaData = None,
    annots: Annotations = None,
    bookmarks: Dict[str, DataFrame] = None,
    text_obj: Text = None,
    annots_metadata: AnnotationsMetaData = None,
)

Bases: Text

A Text object that manages its own annotations.

Initialize AnnotatedText.

Parameters:

  • text_str (str, required): The text string.
  • metadata (TextMetaData, default None): The text's metadata.
  • annots (Annotations, default None): The annotations.
  • bookmarks (Dict[str, DataFrame], default None): The annotation bookmarks.
  • text_obj (Text, default None): A text_obj to override text_str and metadata initialization.
  • annots_metadata (AnnotationsMetaData, default None): Override for default annotations metadata (NOTE: ineffectual if an annots instance is provided).

Methods:

  • add_annotations: Add the annotations to this instance.
  • get_annot_mask: Get a True/False series for the input such that start to end positions for rows where the annotation column is non-null and non-empty are True.
  • get_text: Get the text object's string, masking if indicated.
  • get_text_series: Get the input text as a (padded) pandas series.

Attributes:

  • annotations (Annotations): Get this object's annotations.
  • bookmarks (Dict[str, DataFrame]): Get this object's bookmarks.

Source code in packages/xization/src/dataknobs_xization/annotations.py
def __init__(
    self,
    text_str: str,
    metadata: dk_doc.TextMetaData = None,
    annots: Annotations = None,
    bookmarks: Dict[str, pd.DataFrame] = None,
    text_obj: dk_doc.Text = None,
    annots_metadata: AnnotationsMetaData = None,
):
    """Initialize AnnotatedText.

    Args:
        text_str: The text string.
        metadata: The text's metadata.
        annots: The annotations.
        bookmarks: The annotation bookmarks.
        text_obj: A text_obj to override text_str and metadata initialization.
        annots_metadata: Override for default annotations metadata
            (NOTE: ineffectual if an annots instance is provided.)
    """
    super().__init__(
        text_obj.text if text_obj is not None else text_str,
        text_obj.metadata if text_obj is not None else metadata,
    )
    self._annots = annots
    self._bookmarks = bookmarks
    self._annots_metadata = annots_metadata
Attributes
annotations property
annotations: Annotations

Get this object's annotations

bookmarks property
bookmarks: Dict[str, DataFrame]

Get this object's bookmarks

Functions
add_annotations
add_annotations(annotations: Annotations)

Add the annotations to this instance.

Parameters:

  • annotations (Annotations, required): The annotations to add.
Source code in packages/xization/src/dataknobs_xization/annotations.py
def add_annotations(self, annotations: Annotations):
    """Add the annotations to this instance.

    Args:
        annotations: The annotations to add.
    """
    if annotations is not None and not annotations.is_empty():
        df = annotations.df
        if self._annots is None:
            self._annots = annotations
        elif self._annots.is_empty():
            if df is not None:
                self._annots.set_df(df.copy())
        elif df is not None:
            self._annots.add_df(df)
get_annot_mask
get_annot_mask(
    annot_col: str,
    pad_len: int = 0,
    annot_df: DataFrame = None,
    text: str = None,
) -> pd.Series

Get a True/False series for the input such that start to end positions for rows where the annotation column is non-null and non-empty are True.

Parameters:

  • annot_col (str, required): The annotation column identifying chars to mask.
  • pad_len (int, default 0): The number of characters to pad the mask with False values at both the front and back.
  • annot_df (DataFrame, default None): Override annotations dataframe.
  • text (str, default None): Override text.

Returns:

  • Series: A pandas Series where annotated input character positions are True and non-annotated positions are False.

Source code in packages/xization/src/dataknobs_xization/annotations.py
def get_annot_mask(
    self,
    annot_col: str,
    pad_len: int = 0,
    annot_df: pd.DataFrame = None,
    text: str = None,
) -> pd.Series:
    """Get a True/False series for the input such that start to end positions
    for rows where the annotation column is non-null and non-empty are
    True.

    Args:
        annot_col: The annotation column identifying chars to mask.
        pad_len: The number of characters to pad the mask with False
            values at both the front and back.
        annot_df: Override annotations dataframe.
        text: Override text.

    Returns:
        A pandas Series where annotated input character positions
        are True and non-annotated positions are False.
    """
    if annot_df is None:
        annot_df = self.annotations.as_df
    if text is None:
        text = self.text
    textlen = len(text)
    return self._get_annot_mask(annot_df, textlen, annot_col, pad_len=pad_len)
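The mask-building idea itself is simple and can be sketched without the package: given annotation (start, end) spans over a text, flip those positions to True. The span data below is hypothetical:

```python
# Illustrative only: build a True/False mask over a text from annotation
# (start, end) spans, the way get_annot_mask marks annotated positions.
text = "call me Ishmael"
spans = [(8, 15)]  # hypothetical annotation row covering "Ishmael"

mask = [False] * len(text)
for start, end in spans:
    for i in range(start, end):
        mask[i] = True

print("".join(text[i] for i in range(len(text)) if mask[i]))  # Ishmael
```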
get_text
get_text(
    annot2mask: Dict[str, str] = None,
    annot_df: DataFrame = None,
    text: str = None,
) -> str

Get the text object's string, masking if indicated.

Parameters:

  • annot2mask (Dict[str, str], default None): Mapping from annotation column (e.g., _num or _recsnum) to the replacement character(s) in the input text for masking already managed input.
  • annot_df (DataFrame, default None): Override annotations dataframe.
  • text (str, default None): Override text.

Returns:

  • str: The (masked) text.

Source code in packages/xization/src/dataknobs_xization/annotations.py
def get_text(
    self,
    annot2mask: Dict[str, str] = None,
    annot_df: pd.DataFrame = None,
    text: str = None,
) -> str:
    """Get the text object's string, masking if indicated.

    Args:
        annot2mask: Mapping from annotation column (e.g., _num or
            _recsnum) to the replacement character(s) in the input text
            for masking already managed input.
        annot_df: Override annotations dataframe.
        text: Override text.

    Returns:
        The (masked) text.
    """
    if annot2mask is None:
        return self.text
    # Apply the mask
    text_s = self.get_text_series(text=text)  # no padding
    if annot_df is None:
        annot_df = self.annotations.as_df
    text_s = self._apply_mask(text_s, annot2mask, annot_df)
    return "".join(text_s)
get_text_series
get_text_series(pad_len: int = 0, text: str = None) -> pd.Series

Get the input text as a (padded) pandas series.

Parameters:

  • pad_len (int, default 0): The number of spaces to pad both front and back.
  • text (str, default None): Override text.

Returns:

  • Series: The (padded) pandas series of input characters.

Source code in packages/xization/src/dataknobs_xization/annotations.py
def get_text_series(
    self,
    pad_len: int = 0,
    text: str = None,
) -> pd.Series:
    """Get the input text as a (padded) pandas series.

    Args:
        pad_len: The number of spaces to pad both front and back.
        text: Override text.

    Returns:
        The (padded) pandas series of input characters.
    """
    if text is None:
        text = self.text
    return pd.Series(list(" " * pad_len + text + " " * pad_len))
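The returned series is literally one character per row, with `pad_len` spaces on each side, as a standalone version shows:

```python
import pandas as pd

# What get_text_series computes: one character per row, optionally padded
# with pad_len spaces on each side.
def text_series(text: str, pad_len: int = 0) -> pd.Series:
    return pd.Series(list(" " * pad_len + text + " " * pad_len))

s = text_series("ab", pad_len=1)
print(list(s))  # [' ', 'a', 'b', ' ']
```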

Annotations

Annotations(metadata: AnnotationsMetaData, df: DataFrame = None)

DAO for collecting and managing a table of annotations, where each row carries annotation information for an input token.

The data in this class is maintained either as a list of dicts, each dict representing a "row," or as a pandas DataFrame, depending on the latest access. Changes in either the lists or dataframe will be reflected in the alternate data structure.

Construct as empty or initialize with the dataframe form.

Parameters:

  • metadata (AnnotationsMetaData, required): The annotations metadata.
  • df (DataFrame, default None): A dataframe with annotation records.

Methods:

  • add_df: Add (concatenate) the annotation dataframe to the current annotations.
  • add_dict: Add the annotation dict.
  • add_dicts: Add the annotation dicts.
  • clear: Clear/empty out all annotations, returning the annotations df.
  • set_df: Set (or reset) this annotation's dataframe.

Attributes:

  • ann_row_dicts (List[Dict[str, Any]]): Get the annotations as a list of dictionaries.
  • df (DataFrame): Get the annotations as a pandas dataframe.

Source code in packages/xization/src/dataknobs_xization/annotations.py
def __init__(
    self,
    metadata: AnnotationsMetaData,
    df: pd.DataFrame = None,
):
    """Construct as empty or initialize with the dataframe form.

    Args:
        metadata: The annotations metadata.
        df: A dataframe with annotation records.
    """
    self.metadata = metadata
    self._annotations_list = None
    self._df = df
Attributes
ann_row_dicts property
ann_row_dicts: List[Dict[str, Any]]

Get the annotations as a list of dictionaries.

df property
df: DataFrame

Get the annotations as a pandas dataframe.

Functions
add_df
add_df(an_df: DataFrame)

Add (concatenate) the annotation dataframe to the current annotations.

Source code in packages/xization/src/dataknobs_xization/annotations.py
def add_df(self, an_df: pd.DataFrame):
    """Add (concatenate) the annotation dataframe to the current annotations."""
    df = self.metadata.sort_df(pd.concat([self.df, an_df]))
    self.set_df(df)
add_dict
add_dict(annotation: Dict[str, Any])

Add the annotation dict.

Source code in packages/xization/src/dataknobs_xization/annotations.py
def add_dict(self, annotation: Dict[str, Any]):
    """Add the annotation dict."""
    self.ann_row_dicts.append(annotation)
add_dicts
add_dicts(annotations: List[Dict[str, Any]])

Add the annotation dicts.

Source code in packages/xization/src/dataknobs_xization/annotations.py
def add_dicts(self, annotations: List[Dict[str, Any]]):
    """Add the annotation dicts."""
    self.ann_row_dicts.extend(annotations)
clear
clear() -> pd.DataFrame

Clear/empty out all annotations, returning the annotations df

Source code in packages/xization/src/dataknobs_xization/annotations.py
def clear(self) -> pd.DataFrame:
    """Clear/empty out all annotations, returning the annotations df"""
    rv = self.df
    self._df = None
    self._annotations_list = None
    return rv
set_df
set_df(df: DataFrame)

Set (or reset) this annotation's dataframe.

Parameters:

  • df (DataFrame, required): The new annotations dataframe.
Source code in packages/xization/src/dataknobs_xization/annotations.py
def set_df(self, df: pd.DataFrame):
    """Set (or reset) this annotation's dataframe.

    Args:
        df: The new annotations dataframe.
    """
    self._df = df
    self._annotations_list = None

AnnotationsBuilder

AnnotationsBuilder(
    metadata: AnnotationsMetaData, data_defaults: Dict[str, Any]
)

A class for building annotations.

Initialize AnnotationsBuilder.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `metadata` | `AnnotationsMetaData` | The annotations metadata. | *required* |
| `data_defaults` | `Dict[str, Any]` | Dict[ann_colname, default_value] with default values for annotation columns. | *required* |

Methods:

| Name | Description |
|------|-------------|
| `build_annotation_row` | Build an annotation row with the mandatory key values and those from the remaining keyword arguments. |
| `do_build_row` | Do the row building with the key fields, followed by data defaults, followed by any extra kwargs. |

Source code in packages/xization/src/dataknobs_xization/annotations.py
def __init__(
    self,
    metadata: AnnotationsMetaData,
    data_defaults: Dict[str, Any],
):
    """Initialize AnnotationsBuilder.

    Args:
        metadata: The annotations metadata.
        data_defaults: Dict[ann_colname, default_value] with default
            values for annotation columns.
    """
    self.metadata = metadata if metadata is not None else AnnotationsMetaData()
    self.data_defaults = data_defaults
Functions
build_annotation_row
build_annotation_row(
    start_pos: int, end_pos: int, text: str, ann_type: str, **kwargs: Any
) -> Dict[str, Any]

Build an annotation row with the mandatory key values and those from the remaining keyword arguments.

For those kwargs whose names match metadata column names, override the data_defaults and add remaining data_default attributes.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `start_pos` | `int` | The token start position. | *required* |
| `end_pos` | `int` | The token end position. | *required* |
| `text` | `str` | The token text. | *required* |
| `ann_type` | `str` | The annotation type. | *required* |
| `**kwargs` | `Any` | Additional keyword arguments for extra annotation fields. | `{}` |

Returns:

| Type | Description |
|------|-------------|
| `Dict[str, Any]` | The result row dictionary. |

Source code in packages/xization/src/dataknobs_xization/annotations.py
def build_annotation_row(
    self, start_pos: int, end_pos: int, text: str, ann_type: str, **kwargs: Any
) -> Dict[str, Any]:
    """Build an annotation row with the mandatory key values and those from
    the remaining keyword arguments.

    For those kwargs whose names match metadata column names, override the
    data_defaults and add remaining data_default attributes.

    Args:
        start_pos: The token start position.
        end_pos: The token end position.
        text: The token text.
        ann_type: The annotation type.
        **kwargs: Additional keyword arguments for extra annotation fields.

    Returns:
        The result row dictionary.
    """
    return self.do_build_row(
        {
            self.metadata.start_pos_col: start_pos,
            self.metadata.end_pos_col: end_pos,
            self.metadata.text_col: text,
            self.metadata.ann_type_col: ann_type,
        },
        **kwargs,
    )
do_build_row
do_build_row(key_fields: Dict[str, Any], **kwargs: Any) -> Dict[str, Any]

Do the row building with the key fields, followed by data defaults, followed by any extra kwargs.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `key_fields` | `Dict[str, Any]` | The dictionary of key fields. | *required* |
| `**kwargs` | `Any` | Any extra fields to add. | `{}` |

Returns:

| Type | Description |
|------|-------------|
| `Dict[str, Any]` | The constructed row dictionary. |

Source code in packages/xization/src/dataknobs_xization/annotations.py
def do_build_row(self, key_fields: Dict[str, Any], **kwargs: Any) -> Dict[str, Any]:
    """Do the row building with the key fields, followed by data defaults,
    followed by any extra kwargs.

    Args:
        key_fields: The dictionary of key fields.
        **kwargs: Any extra fields to add.

    Returns:
        The constructed row dictionary.
    """
    result = {}
    result.update(key_fields)
    if self.data_defaults is not None:
        # Add data_defaults
        result.update(self.data_defaults)
    if kwargs is not None:
        # Override with extra kwargs
        result.update(kwargs)
    return result
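
The layering above can be sketched with plain dicts: each successive `update` wins, so extra kwargs override the data defaults, which in turn override any key field of the same name. The field names and values here are hypothetical.

```python
# Mandatory key values, column defaults, and per-call overrides.
key_fields = {"start_pos": 0, "end_pos": 4, "text": "2024", "ann_type": "year"}
data_defaults = {"confidence": 0.5, "source": "regex"}
extra_kwargs = {"confidence": 0.9}

row = {}
row.update(key_fields)     # mandatory key values first
row.update(data_defaults)  # then column defaults
row.update(extra_kwargs)   # then extra kwargs, which win on conflicts
print(row["confidence"])   # 0.9
```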

AnnotationsGroup

AnnotationsGroup(
    row_accessor: AnnotationsRowAccessor,
    field_col_type: str,
    accept_fn: Callable[[AnnotationsGroup, RowData], bool],
    group_type: str = None,
    group_num: int = None,
    valid: bool = True,
    autolock: bool = False,
)

Container for annotation rows that belong together as a (consistent) group.

NOTE: An instance will only accept rows that are consistent with the group, per its acceptance function.

Initialize AnnotationsGroup.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `row_accessor` | `AnnotationsRowAccessor` | The annotations row_accessor. | *required* |
| `field_col_type` | `str` | The col_type for the group field_type for retrieval using the annotations row accessor. | *required* |
| `accept_fn` | `Callable[[AnnotationsGroup, RowData], bool]` | A fn(g, row_data) that returns True to accept the row data into this group g, or False to reject the row. If None, then all rows are always accepted. | *required* |
| `group_type` | `str` | An optional (override) type for identifying this group. | `None` |
| `group_num` | `int` | An optional number for identifying this group. | `None` |
| `valid` | `bool` | True if the group is valid, or False if not. | `True` |
| `autolock` | `bool` | True to automatically lock this group when (1) at least one row has been added and (2) a row is rejected. | `False` |

Methods:

| Name | Description |
|------|-------------|
| `add` | Add the row if the group is not locked and the row belongs in this group. |
| `is_subset` | Determine whether this group's text is contained within the other's. |
| `is_subset_of_any` | Determine whether this group is a subset of any of the given groups. |
| `remove_row` | Remove the row from this group and optionally update the annotations accordingly. |
| `to_dict` | Get this group (record) as a dictionary of field type to text values. |

Attributes:

| Name | Type | Description |
|------|------|-------------|
| `ann_type` | `str` | Get this record's annotation type. |
| `autolock` | `bool` | Get whether this group is currently set to autolock. |
| `df` | `DataFrame` | Get this group as a dataframe. |
| `group_num` | `int` | Get this group's number. |
| `group_type` | `str` | Get this group's type: either an "override" value that has been set, or the "ann_type" value of the first row added. |
| `is_locked` | `bool` | Get whether this group is locked from adding more rows. |
| `is_valid` | `bool` | Get whether this group is currently marked as valid. |
| `key` | `str` | A hash key for this group. |
| `size` | `int` | Get the number of rows in this group. |

Source code in packages/xization/src/dataknobs_xization/annotations.py
def __init__(
    self,
    row_accessor: AnnotationsRowAccessor,
    field_col_type: str,
    accept_fn: Callable[["AnnotationsGroup", RowData], bool],
    group_type: str = None,
    group_num: int = None,
    valid: bool = True,
    autolock: bool = False,
):
    """Initialize AnnotationsGroup.

    Args:
        row_accessor: The annotations row_accessor.
        field_col_type: The col_type for the group field_type for retrieval
            using the annotations row accessor.
        accept_fn: A fn(g, row_data) that returns True to accept the row
            data into this group g, or False to reject the row. If None, then
            all rows are always accepted.
        group_type: An optional (override) type for identifying this group.
        group_num: An optional number for identifying this group.
        valid: True if the group is valid, or False if not.
        autolock: True to automatically lock this group when (1) at
            least one row has been added and (2) a row is rejected.
    """
    self.rows = []  # List[RowData]
    self.row_accessor = row_accessor
    self.field_col_type = field_col_type
    self.accept_fn = accept_fn
    self._group_type = group_type
    self._group_num = group_num
    self._valid = valid
    self._autolock = autolock
    self._locked = False
    self._locs = None  # track loc's for recognizing dupes
    self._key = None  # a hash key using the _locs
    self._df = None
    self._ann_type = None
Attributes
ann_type property
ann_type: str

Get this record's annotation type

autolock property writable
autolock: bool

Get whether this group is currently set to autolock.

df property
df: DataFrame

Get this group as a dataframe

group_num property writable
group_num: int

Get this group's number

group_type property writable
group_type: str

Get this group's type, which is either an "override" value that has been set, or the "ann_type" value of the first row added.

is_locked property writable
is_locked: bool

Get whether this group is locked from adding more rows.

is_valid property writable
is_valid: bool

Get whether this group is currently marked as valid.

key property
key: str

A hash key for this group.

size property
size: int

Get the number of rows in this group.

Functions
add
add(rowdata: RowData) -> bool

Add the row if the group is not locked and the row belongs in this group, or return False.

If autolock is True and a row fails to be added (after the first row has been added), "lock" the group and refuse to accept any more rows.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `rowdata` | `RowData` | The row to add. | *required* |

Returns:

| Type | Description |
|------|-------------|
| `bool` | True if the row belongs and was added; otherwise, False. |

Source code in packages/xization/src/dataknobs_xization/annotations.py
def add(self, rowdata: RowData) -> bool:
    """Add the row if the group is not locked and the row belongs in this
    group, or return False.

    If autolock is True and a row fails to be added (after the first
    row has been added), "lock" the group and refuse to accept any more
    rows.

    Args:
        rowdata: The row to add.

    Returns:
        True if the row belongs and was added; otherwise, False.
    """
    result = False
    if self._locked:
        return result

    if self.accept_fn is None or self.accept_fn(self, rowdata):
        self.rows.append(rowdata)
        self._df = None
        self._locs = None
        self._key = None
        if self._ann_type is None:
            self._ann_type = self.row_accessor.get_col_value(
                KEY_ANN_TYPE_COL,
                rowdata.row,
                missing=None,
            )
        result = True

    if not result and self.size > 0 and self.autolock:
        self._locked = True

    return result
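
The accept/autolock protocol above can be sketched with a minimal stand-in class (rows are plain ints here, and the accept function is a hypothetical adjacency check, not the package's actual logic):

```python
class Group:
    """Toy stand-in for AnnotationsGroup's accept/autolock behavior."""

    def __init__(self, accept_fn, autolock=False):
        self.rows = []
        self.accept_fn = accept_fn
        self.autolock = autolock
        self.locked = False

    def add(self, row):
        if self.locked:
            return False
        if self.accept_fn is None or self.accept_fn(self, row):
            self.rows.append(row)
            return True
        # Rejected after at least one row: lock if autolock is on.
        if self.rows and self.autolock:
            self.locked = True
        return False

# Accept only rows adjacent to the last one; autolock on first rejection.
g = Group(lambda grp, r: not grp.rows or r == grp.rows[-1] + 1, autolock=True)
results = [g.add(r) for r in [1, 2, 5, 3]]
print(results)  # [True, True, False, False] -- 3 is refused: group is locked
```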
is_subset
is_subset(other: AnnotationsGroup) -> bool

Determine whether this group's text is contained within the other's.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `other` | `AnnotationsGroup` | The other group. | *required* |

Returns:

| Type | Description |
|------|-------------|
| `bool` | True if this group's text is contained within the other group. |

Source code in packages/xization/src/dataknobs_xization/annotations.py
def is_subset(self, other: "AnnotationsGroup") -> bool:
    """Determine whether this group's text is contained within the other's.

    Args:
        other: The other group.

    Returns:
        True if this group's text is contained within the other group.
    """
    result = True
    for my_row in self.rows:
        if not my_row.is_subset_of_any(other.rows):
            result = False
            break
    return result
is_subset_of_any
is_subset_of_any(groups: List[AnnotationsGroup]) -> AnnotationsGroup

Determine whether this group is a subset of any of the given groups.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `groups` | `List[AnnotationsGroup]` | List of annotation groups. | *required* |

Returns:

| Type | Description |
|------|-------------|
| `AnnotationsGroup` | The first AnnotationsGroup that this group is a subset of, or None. |

Source code in packages/xization/src/dataknobs_xization/annotations.py
def is_subset_of_any(self, groups: List["AnnotationsGroup"]) -> "AnnotationsGroup":
    """Determine whether this group is a subset of any of the given groups.

    Args:
        groups: List of annotation groups.

    Returns:
        The first AnnotationsGroup that this group is a subset of, or None.
    """
    result = None
    for other_group in groups:
        if self.is_subset(other_group):
            result = other_group
            break
    return result
remove_row
remove_row(row_idx: int) -> RowData

Remove the row from this group and optionally update the annotations accordingly.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `row_idx` | `int` | The positional index of the row to remove. | *required* |

Returns:

| Type | Description |
|------|-------------|
| `RowData` | The removed row data instance. |

Source code in packages/xization/src/dataknobs_xization/annotations.py
def remove_row(
    self,
    row_idx: int,
) -> RowData:
    """Remove the row from this group and optionally update the annotations
    accordingly.

    Args:
        row_idx: The positional index of the row to remove.

    Returns:
        The removed row data instance.
    """
    rowdata = self.rows.pop(row_idx)

    # Reset cached values
    self._df = None
    self._locs = None
    self._key = None

    return rowdata
to_dict
to_dict() -> Dict[str, str]

Get this group (record) as a dictionary of field type to text values.

Source code in packages/xization/src/dataknobs_xization/annotations.py
def to_dict(self) -> Dict[str, str]:
    """Get this group (record) as a dictionary of field type to text values."""
    return {self.row_accessor.get_col_value(self.field_col_type): row.text for row in self.rows}

AnnotationsGroupList

AnnotationsGroupList(
    groups: List[AnnotationsGroup] = None,
    accept_fn: Callable[
        [AnnotationsGroupList, AnnotationsGroup], bool
    ] = lambda lst, g: lst.size == 0 or not g.is_subset_of_any(lst.groups),
)

Container for a list of annotation groups.

Initialize AnnotationsGroupList.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `groups` | `List[AnnotationsGroup]` | The initial groups for this list. | `None` |
| `accept_fn` | `Callable[[AnnotationsGroupList, AnnotationsGroup], bool]` | A fn(lst, g) that returns True to accept the group, g, into this list, lst, or False to reject the group. If None, then all groups are always accepted. The default function will reject any group that is a subset of any existing group in the list. | `lambda lst, g: lst.size == 0 or not g.is_subset_of_any(lst.groups)` |

Methods:

| Name | Description |
|------|-------------|
| `add` | Add the group if it belongs in this group list or return False. |
| `is_subset` | Determine whether this list's text spans are all contained within the other's. |

Attributes:

| Name | Type | Description |
|------|------|-------------|
| `coverage` | `int` | Get the total number of (token) rows covered by the groups. |
| `size` | `int` | Get the number of groups in this list. |

Source code in packages/xization/src/dataknobs_xization/annotations.py
def __init__(
    self,
    groups: List[AnnotationsGroup] = None,
    accept_fn: Callable[["AnnotationsGroupList", AnnotationsGroup], bool] = lambda lst, g: lst.size
    == 0
    or not g.is_subset_of_any(lst.groups),
):
    """Initialize AnnotationsGroupList.

    Args:
        groups: The initial groups for this list.
        accept_fn: A fn(lst, g) that returns True to accept the group, g,
            into this list, lst, or False to reject the group. If None, then all
            groups are always accepted. The default function will reject any
            group that is a subset of any existing group in the list.
    """
    self.groups = groups if groups is not None else []
    self.accept_fn = accept_fn
    self._coverage = None
Attributes
coverage property
coverage: int

Get the total number of (token) rows covered by the groups

size property
size: int

Get the number of groups in this list

Functions
add
add(group: AnnotationsGroup) -> bool

Add the group if it belongs in this group list or return False.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `group` | `AnnotationsGroup` | The group to add. | *required* |

Returns:

| Type | Description |
|------|-------------|
| `bool` | True if the group belongs and was added; otherwise, False. |

Source code in packages/xization/src/dataknobs_xization/annotations.py
def add(self, group: AnnotationsGroup) -> bool:
    """Add the group if it belongs in this group list or return False.

    Args:
        group: The group to add.

    Returns:
        True if the group belongs and was added; otherwise, False.
    """
    result = False
    if self.accept_fn is None or self.accept_fn(self, group):
        self.groups.append(group)
        self._coverage = None
        result = True
    return result
is_subset
is_subset(other: AnnotationsGroupList) -> bool

Determine whether this list's text spans are all contained within the other's.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `other` | `AnnotationsGroupList` | The other group list. | *required* |

Returns:

| Type | Description |
|------|-------------|
| `bool` | True if this group list is a subset of the other group list. |

Source code in packages/xization/src/dataknobs_xization/annotations.py
def is_subset(self, other: "AnnotationsGroupList") -> bool:
    """Determine whether this list's text spans are all contained within
    the other's.

    Args:
        other: The other group list.

    Returns:
        True if this group list is a subset of the other group list.
    """
    result = True
    for my_group in self.groups:
        if not my_group.is_subset_of_any(other.groups):
            result = False
            break
    return result
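
The default accept policy described for AnnotationsGroupList (reject a group whose spans are a subset of an existing group's) can be sketched with plain sets, modeling each group as a frozenset of hypothetical (start, end) span pairs:

```python
def accept(lst, g):
    # Accept when the list is empty or g is not a subset of any member.
    return len(lst) == 0 or not any(g <= other for other in lst)

groups = []
candidates = [
    frozenset({(0, 5), (6, 11)}),  # two spans
    frozenset({(0, 5)}),           # subset of the first group: rejected
    frozenset({(12, 15)}),         # new span: accepted
]
for g in candidates:
    if accept(groups, g):
        groups.append(g)

print(len(groups))  # 2 -- the single-span subset group was rejected
```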

AnnotationsMetaData

AnnotationsMetaData(
    start_pos_col: str = KEY_START_POS_COL,
    end_pos_col: str = KEY_END_POS_COL,
    text_col: str = KEY_TEXT_COL,
    ann_type_col: str = KEY_ANN_TYPE_COL,
    sort_fields: List[str] = (KEY_START_POS_COL, KEY_END_POS_COL),
    sort_fields_ascending: List[bool] = (True, False),
    **kwargs: Any,
)

Bases: MetaData

Container for annotations meta-data, identifying key column names.

NOTE: This object contains only information about annotation column names, not annotation table values.

Initialize with key (and more) column names and info.

Key column types:

- start_pos
- end_pos
- text
- ann_type
Note

Actual table columns can be named arbitrarily, BUT interactions through annotations classes and interfaces relating to the "key" columns must use the key column constants.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `start_pos_col` | `str` | Col name for the token starting position. | `KEY_START_POS_COL` |
| `end_pos_col` | `str` | Col name for the token ending position. | `KEY_END_POS_COL` |
| `text_col` | `str` | Col name for the token text. | `KEY_TEXT_COL` |
| `ann_type_col` | `str` | Col name for the annotation types. | `KEY_ANN_TYPE_COL` |
| `sort_fields` | `List[str]` | The col types relevant for sorting annotation rows. | `(KEY_START_POS_COL, KEY_END_POS_COL)` |
| `sort_fields_ascending` | `List[bool]` | To specify the sort order of sort_fields. | `(True, False)` |
| `**kwargs` | `Any` | More column types mapped to column names. | `{}` |

Methods:

| Name | Description |
|------|-------------|
| `get_col` | Get the name of the column having the given type (including key column types but not derived ones), or get the missing value. |
| `sort_df` | Sort an annotations dataframe according to this metadata. |

Attributes:

| Name | Type | Description |
|------|------|-------------|
| `ann_type_col` | `str` | Get the column name for the token annotation type. |
| `end_pos_col` | `str` | Get the column name for the token ending position. |
| `start_pos_col` | `str` | Get the column name for the token starting position. |
| `text_col` | `str` | Get the column name for the token text. |

Source code in packages/xization/src/dataknobs_xization/annotations.py
def __init__(
    self,
    start_pos_col: str = KEY_START_POS_COL,
    end_pos_col: str = KEY_END_POS_COL,
    text_col: str = KEY_TEXT_COL,
    ann_type_col: str = KEY_ANN_TYPE_COL,
    sort_fields: List[str] = (KEY_START_POS_COL, KEY_END_POS_COL),
    sort_fields_ascending: List[bool] = (True, False),
    **kwargs: Any
):
    """Initialize with key (and more) column names and info.

    Key column types:
      * start_pos
      * end_pos
      * text
      * ann_type

    Note:
        Actual table columns can be named arbitrarily, BUT interactions
        through annotations classes and interfaces relating to the "key"
        columns must use the key column constants.

    Args:
        start_pos_col: Col name for the token starting position.
        end_pos_col: Col name for the token ending position.
        text_col: Col name for the token text.
        ann_type_col: Col name for the annotation types.
        sort_fields: The col types relevant for sorting annotation rows.
        sort_fields_ascending: To specify sort order of sort_fields.
        **kwargs: More column types mapped to column names.
    """
    super().__init__(
        {
            KEY_START_POS_COL: start_pos_col,
            KEY_END_POS_COL: end_pos_col,
            KEY_TEXT_COL: text_col,
            KEY_ANN_TYPE_COL: ann_type_col,
        },
        **kwargs,
    )
    self.sort_fields = list(sort_fields)
    self.ascending = sort_fields_ascending
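
The key-column indirection described above (arbitrary table column names, accessed via key column constants) can be sketched as a simple mapping lookup. The constant value, mapping, and column names below are hypothetical, not the package's actual constants:

```python
import pandas as pd

# A key column constant and a metadata-style mapping from key constant
# to this table's actual column name.
KEY_TEXT_COL = "text"
col_mapping = {KEY_TEXT_COL: "token_str"}  # this table calls the column "token_str"

df = pd.DataFrame([{"token_str": "hello", "start": 0}])

# Code asks for the key constant; the mapping resolves the real column.
col = col_mapping.get(KEY_TEXT_COL, KEY_TEXT_COL)
print(df.iloc[0][col])
```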
Attributes
ann_type_col property
ann_type_col: str

Get the column name for the token annotation type

end_pos_col property
end_pos_col: str

Get the column name for the token ending position

start_pos_col property
start_pos_col: str

Get the column name for the token starting position

text_col property
text_col: str

Get the column name for the token text

Functions
get_col
get_col(col_type: str, missing: str = None) -> str

Get the name of the column having the given type (including key column types but not derived ones), or get the missing value.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `col_type` | `str` | The type of column name to get. | *required* |
| `missing` | `str` | The value to return for unknown column types. | `None` |

Returns:

| Type | Description |
|------|-------------|
| `str` | The column name or the missing value. |

Source code in packages/xization/src/dataknobs_xization/annotations.py
def get_col(self, col_type: str, missing: str = None) -> str:
    """Get the name of the column having the given type (including key column
    types but not derived ones), or get the missing value.

    Args:
        col_type: The type of column name to get.
        missing: The value to return for unknown column types.

    Returns:
        The column name or the missing value.
    """
    return self.get_value(col_type, missing)
sort_df
sort_df(an_df: DataFrame) -> pd.DataFrame

Sort an annotations dataframe according to this metadata.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `an_df` | `DataFrame` | An annotations dataframe. | *required* |

Returns:

| Type | Description |
|------|-------------|
| `DataFrame` | The sorted annotations dataframe. |

Source code in packages/xization/src/dataknobs_xization/annotations.py
def sort_df(self, an_df: pd.DataFrame) -> pd.DataFrame:
    """Sort an annotations dataframe according to this metadata.

    Args:
        an_df: An annotations dataframe.

    Returns:
        The sorted annotations dataframe.
    """
    if self.sort_fields is not None:
        an_df = an_df.sort_values(self.sort_fields, ascending=self.ascending)
    return an_df

AnnotationsRowAccessor

AnnotationsRowAccessor(
    metadata: AnnotationsMetaData, derived_cols: DerivedAnnotationColumns = None
)

A class that accesses row data according to the metadata and derived cols.

Initialize AnnotationsRowAccessor.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `metadata` | `AnnotationsMetaData` | The metadata for annotation columns. | *required* |
| `derived_cols` | `DerivedAnnotationColumns` | A DerivedAnnotationColumns instance for injecting derived columns. | `None` |

Methods:

| Name | Description |
|------|-------------|
| `get_col_value` | Get the value of the column in the given row with the given type. |

Source code in packages/xization/src/dataknobs_xization/annotations.py
def __init__(
    self, metadata: AnnotationsMetaData, derived_cols: DerivedAnnotationColumns = None
):
    """Initialize AnnotationsRowAccessor.

    Args:
        metadata: The metadata for annotation columns.
        derived_cols: A DerivedAnnotationColumns instance for injecting
            derived columns.
    """
    self.metadata = metadata
    self.derived_cols = derived_cols
Functions
get_col_value
get_col_value(col_type: str, row: Series, missing: str = None) -> str

Get the value of the column in the given row with the given type.

This gets the value from the first existing column in the row from:

- The metadata.get_col(col_type) column
- col_type itself
- The columns derived from col_type

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `col_type` | `str` | The type of column value to get. | *required* |
| `row` | `Series` | A row from which to get the value. | *required* |
| `missing` | `str` | The value to return for an unknown or missing column. | `None` |

Returns:

| Type | Description |
|------|-------------|
| `str` | The row value or the missing value. |

Source code in packages/xization/src/dataknobs_xization/annotations.py
def get_col_value(
    self,
    col_type: str,
    row: pd.Series,
    missing: str = None,
) -> str:
    """Get the value of the column in the given row with the given type.

    This gets the value from the first existing column in the row from:
      * The metadata.get_col(col_type) column
      * col_type itself
      * The columns derived from col_type

    Args:
        col_type: The type of column value to get.
        row: A row from which to get the value.
        missing: The value to return for unknown or missing column.

    Returns:
        The row value or the missing value.
    """
    value = missing
    col = self.metadata.get_col(col_type, None)
    if col is None or col not in row.index:
        if col_type in self.metadata.data:
            value = row[col_type]
        elif self.derived_cols is not None:
            value = self.derived_cols.get_col_value(self.metadata, col_type, row, missing)
    else:
        value = row[col]
    return value
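
The fallback chain above can be sketched with a simplified standalone function (it omits the derived-columns step and uses hypothetical column names): try the mapped column name first, then col_type as a literal column, else return the missing value.

```python
import pandas as pd

def get_col_value(mapping, col_type, row, missing=None):
    # 1) Try the column name the metadata maps col_type to.
    col = mapping.get(col_type)
    if col is not None and col in row.index:
        return row[col]
    # 2) Fall back to col_type itself as a literal column name.
    if col_type in row.index:
        return row[col_type]
    # 3) Nothing matched: return the missing value.
    return missing

row = pd.Series({"token_str": "hello", "ann_type": "token"})
print(get_col_value({"text": "token_str"}, "text", row))  # via the mapping
print(get_col_value({}, "ann_type", row))                 # literal col_type
print(get_col_value({}, "confidence", row, "n/a"))        # missing fallback
```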

Annotator

Annotator(name: str)

Bases: ABC

Class for annotating text

Initialize Annotator.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `name` | `str` | The name of this annotator. | *required* |

Methods:

| Name | Description |
|------|-------------|
| `annotate_input` | Annotate this instance's text, additively updating its annotations. |

Source code in packages/xization/src/dataknobs_xization/annotations.py
def __init__(
    self,
    name: str,
):
    """Initialize Annotator.

    Args:
        name: The name of this annotator.
    """
    self.name = name
Functions
annotate_input abstractmethod
annotate_input(text_obj: AnnotatedText, **kwargs: Any) -> Annotations

Annotate this instance's text, additively updating its annotations.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `text_obj` | `AnnotatedText` | The text object to annotate. | *required* |
| `**kwargs` | `Any` | Additional keyword arguments. | `{}` |

Returns:

| Type | Description |
|------|-------------|
| `Annotations` | The annotations added. |

Source code in packages/xization/src/dataknobs_xization/annotations.py
@abstractmethod
def annotate_input(
    self,
    text_obj: AnnotatedText,
    **kwargs: Any
) -> Annotations:
    """Annotate this instance's text, additively updating its annotations.

    Args:
        text_obj: The text object to annotate.
        **kwargs: Additional keyword arguments.

    Returns:
        The annotations added.
    """
    raise NotImplementedError

AnnotatorKernel

Bases: ABC

Class for encapsulating core annotation logic for multiple annotators

Methods:

| Name | Description |
|------|-------------|
| `annotate_input` | Execute all annotations on the text_obj. |

Attributes:

| Name | Type | Description |
|------|------|-------------|
| `annotators` | `List[EntityAnnotator]` | Get the entity annotators. |

Attributes
annotators abstractmethod property
annotators: List[EntityAnnotator]

Get the entity annotators

Functions
annotate_input abstractmethod
annotate_input(text_obj: AnnotatedText) -> Annotations

Execute all annotations on the text_obj

Source code in packages/xization/src/dataknobs_xization/annotations.py
@abstractmethod
def annotate_input(self, text_obj: AnnotatedText) -> Annotations:
    """Execute all annotations on the text_obj"""
    raise NotImplementedError

BasicAnnotator

BasicAnnotator(name: str)

Bases: Annotator

Class for extracting basic (possibly multi-level or multi-part) entities.

Methods:

| Name | Description |
|------|-------------|
| `annotate_input` | Annotate the text obj, additively updating the annotations. |
| `annotate_text` | Build annotations for the text string. |

Source code in packages/xization/src/dataknobs_xization/annotations.py
def __init__(
    self,
    name: str,
):
    """Initialize Annotator.

    Args:
        name: The name of this annotator.
    """
    self.name = name
Functions
annotate_input
annotate_input(text_obj: AnnotatedText, **kwargs: Any) -> Annotations

Annotate the text obj, additively updating the annotations.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `text_obj` | `AnnotatedText` | The text to annotate. | *required* |
| `**kwargs` | `Any` | Additional keyword arguments. | `{}` |

Returns:

| Type | Description |
|------|-------------|
| `Annotations` | The annotations added to the text. |

Source code in packages/xization/src/dataknobs_xization/annotations.py
def annotate_input(
    self,
    text_obj: AnnotatedText,
    **kwargs: Any
) -> Annotations:
    """Annotate the text obj, additively updating the annotations.

    Args:
        text_obj: The text to annotate.
        **kwargs: Additional keyword arguments.

    Returns:
        The annotations added to the text.
    """
    # Get new annotation with just the syntax
    annots = self.annotate_text(text_obj.text)

    # Add syntactic annotations only as a bookmark
    text_obj.annotations.add_df(annots.as_df)

    return annots
annotate_text abstractmethod
annotate_text(text_str: str) -> Annotations

Build annotations for the text string.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `text_str` | `str` | The text string to annotate. | *required* |

Returns:

| Type | Description |
|------|-------------|
| `Annotations` | Annotations for the text. |

Source code in packages/xization/src/dataknobs_xization/annotations.py
@abstractmethod
def annotate_text(self, text_str: str) -> Annotations:
    """Build annotations for the text string.

    Args:
        text_str: The text string to annotate.

    Returns:
        Annotations for the text.
    """
    raise NotImplementedError
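
A concrete `annotate_text` implementation typically scans the text and emits one annotation row per match. The sketch below shows the shape of that logic with a standalone regex function producing plain row dicts; the column names and the year-matching pattern are hypothetical, not part of the package:

```python
import re

def annotate_years(text):
    """Emit one annotation row dict per 4-digit year found in the text."""
    rows = []
    for m in re.finditer(r"\b(19|20)\d{2}\b", text):
        rows.append({
            "start_pos": m.start(),
            "end_pos": m.end(),
            "text": m.group(0),
            "ann_type": "year",
        })
    return rows

rows = annotate_years("Founded in 1999, relaunched in 2021.")
print([r["text"] for r in rows])  # ['1999', '2021']
```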

CompoundAnnotator

CompoundAnnotator(kernel: AnnotatorKernel, name: str = 'entity')

Bases: Annotator

Class to apply a series of annotators through an AnnotatorKernel

Initialize with the annotators and this extractor's name.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `kernel` | `AnnotatorKernel` | The annotations kernel to use. | *required* |
| `name` | `str` | The name of this information extractor, used as the annotations base column name for `<name>_num` and `<name>_recsnum`. | `'entity'` |

Methods:

| Name | Description |
|------|-------------|
| `annotate_input` | Annotate the text. |
| `get_html_highlighted_text` | Get html-highlighted text for the identified input's annotations. |

Source code in packages/xization/src/dataknobs_xization/annotations.py
def __init__(
    self,
    kernel: AnnotatorKernel,
    name: str = "entity",
):
    """Initialize with the annotators and this extractor's name.

    Args:
        kernel: The annotations kernel to use.
        name: The name of this information extractor to be the
            annotations base column name for <name>_num and <name>_recsnum.
    """
    super().__init__(name=name)
    self.kernel = kernel
Functions
annotate_input
annotate_input(
    text_obj: AnnotatedText, reset: bool = True, **kwargs: Any
) -> Annotations

Annotate the text.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `text_obj` | `AnnotatedText` | The AnnotatedText object to annotate. | *required* |
| `reset` | `bool` | When True, reset and rebuild any existing annotations. | `True` |
| `**kwargs` | `Any` | Additional keyword arguments. | `{}` |

Returns:

| Type | Description |
|------|-------------|
| `Annotations` | The annotations added to the text_obj. |

Source code in packages/xization/src/dataknobs_xization/annotations.py
def annotate_input(
    self,
    text_obj: AnnotatedText,
    reset: bool = True,
    **kwargs: Any
) -> Annotations:
    """Annotate the text.

    Args:
        text_obj: The AnnotatedText object to annotate.
        reset: When True, reset and rebuild any existing annotations.
        **kwargs: Additional keyword arguments.

    Returns:
        The annotations added to the text_obj.
    """
    if reset:
        text_obj.annotations.clear()
    annots = self.kernel.annotate_input(text_obj)
    return annots
get_html_highlighted_text
get_html_highlighted_text(
    text_obj: AnnotatedText, annotator_names: List[str] = None
) -> str

Get html-highlighted text for the identified input's annotations from the given annotators (or all).

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `text_obj` | `AnnotatedText` | The input text to highlight. | *required* |
| `annotator_names` | `List[str]` | The subset of annotators to highlight. | `None` |

Returns:

| Type | Description |
|------|-------------|
| `str` | HTML string with highlighted text. |

Source code in packages/xization/src/dataknobs_xization/annotations.py
def get_html_highlighted_text(
    self,
    text_obj: AnnotatedText,
    annotator_names: List[str] = None,
) -> str:
    """Get HTML-highlighted text for the identified input's annotations
    from the given annotators (or all).

    Args:
        text_obj: The input text to highlight.
        annotator_names: The subset of annotators to highlight.

    Returns:
        HTML string with highlighted text.
    """
    if annotator_names is None:
        annotator_names = [ann.name for ann in self.kernel.annotators]
    hfs = {
        ann.name: ann.highlight_fieldstyles
        for ann in self.kernel.annotators
        if ann.name in annotator_names
    }
    hh = HtmlHighlighter(hfs)
    return hh.highlight(text_obj)

DerivedAnnotationColumns

Bases: ABC

Interface for injecting derived columns into AnnotationsMetaData.

Methods:

Name Description
get_col_value

Get the value of the column in the given row derived from col_type.

Functions
get_col_value abstractmethod
get_col_value(
    metadata: AnnotationsMetaData,
    col_type: str,
    row: Series,
    missing: str = None,
) -> str

Get the value of the column in the given row derived from col_type.

Parameters:

Name Type Description Default
metadata AnnotationsMetaData

The AnnotationsMetaData.

required
col_type str

The type of column value to derive.

required
row Series

A row from which to get the value.

required
missing str

The value to return for unknown or missing column.

None

Returns:

Type Description
str

The row value or the missing value.

Source code in packages/xization/src/dataknobs_xization/annotations.py
@abstractmethod
def get_col_value(
    self,
    metadata: AnnotationsMetaData,
    col_type: str,
    row: pd.Series,
    missing: str = None,
) -> str:
    """Get the value of the column in the given row derived from col_type.

    Args:
        metadata: The AnnotationsMetaData.
        col_type: The type of column value to derive.
        row: A row from which to get the value.
        missing: The value to return for unknown or missing column.

    Returns:
        The row value or the missing value.
    """
    raise NotImplementedError
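The contract above can be sketched with a minimal concrete implementation. This standalone example does not import the package; the class name `SimpleDerivedColumns` and the direct row lookup are illustrative assumptions, not part of the API.

```python
class SimpleDerivedColumns:
    """A minimal sketch in the spirit of DerivedAnnotationColumns:
    derive the column value by looking col_type up directly on the row,
    falling back to `missing` when the column is absent or empty."""

    def get_col_value(self, metadata, col_type, row, missing=None):
        # `row` supports .get() like a pd.Series (a dict works for testing)
        value = row.get(col_type, missing)
        return missing if value is None else str(value)
```

An injected implementation like this lets `AnnotationsMetaData` expose computed columns without storing them in the underlying dataframe.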

EntityAnnotator

EntityAnnotator(name: str, mask_char: str = ' ')

Bases: BasicAnnotator

Class for extracting single (possibly multi-level or -part) entities.

Initialize EntityAnnotator.

Parameters:

Name Type Description Default
name str

The name of this annotator.

required
mask_char str

The character to use to mask out previously annotated spans of this annotator's text.

' '

Methods:

Name Description
annotate_input

Annotate the text object (optionally) after masking out previously

compose_groups

Compose annotation rows into groups.

mark_records

Collect and mark annotation records.

validate_records

Validate annotated records.

Attributes:

Name Type Description
annotation_cols Set[str]

Report the (final group or record) annotation columns that are filled

highlight_fieldstyles Dict[str, Dict[str, Dict[str, str]]]

Get highlight field styles for this annotator's annotations of the form:

Source code in packages/xization/src/dataknobs_xization/annotations.py
def __init__(
    self,
    name: str,
    mask_char: str = " ",
):
    """Initialize EntityAnnotator.

    Args:
        name: The name of this annotator.
        mask_char: The character to use to mask out previously annotated
            spans of this annotator's text.
    """
    super().__init__(name)
    self.mask_char = mask_char
Attributes
annotation_cols abstractmethod property
annotation_cols: Set[str]

Report the (final group or record) annotation columns that are filled by this annotator when its entities are annotated.

highlight_fieldstyles abstractmethod property
highlight_fieldstyles: Dict[str, Dict[str, Dict[str, str]]]

Get highlight field styles for this annotator's annotations of the form: { &lt;field&gt;: { &lt;value&gt;: { &lt;css-attr&gt;: &lt;css-value&gt; } } } for css-attrs like 'background-color', 'foreground-color', etc.

Functions
annotate_input
annotate_input(
    text_obj: AnnotatedText,
    annot_mask_cols: Set[str] = None,
    merge_strategies: Dict[str, MergeStrategy] = None,
    largest_only: bool = True,
    **kwargs: Any,
) -> Annotations

Annotate the text object (optionally) after masking out previously annotated spans, additively updating the annotations in the text object.

Parameters:

Name Type Description Default
text_obj AnnotatedText

The text object to annotate.

required
annot_mask_cols Set[str]

The (possible) previous annotations whose spans to ignore in the text.

None
merge_strategies Dict[str, MergeStrategy]

A dictionary of each input annotation bookmark tag mapped to a merge strategy for merging this annotator's annotations with the bookmarked dataframe. This is useful, for example, when merging syntactic information to refine ambiguities.

None
largest_only bool

True to only mark largest records.

True
**kwargs Any

Additional keyword arguments.

{}

Returns:

Type Description
Annotations

The annotations added to the text object.

Source code in packages/xization/src/dataknobs_xization/annotations.py
def annotate_input(
    self,
    text_obj: AnnotatedText,
    annot_mask_cols: Set[str] = None,
    merge_strategies: Dict[str, MergeStrategy] = None,
    largest_only: bool = True,
    **kwargs: Any
) -> Annotations:
    """Annotate the text object (optionally) after masking out previously
    annotated spans, additively updating the annotations in the text
    object.

    Args:
        text_obj: The text object to annotate.
        annot_mask_cols: The (possible) previous annotations whose
            spans to ignore in the text.
        merge_strategies: A dictionary of each input annotation bookmark
            tag mapped to a merge strategy for merging this annotator's
            annotations with the bookmarked dataframe. This is useful, for
            example, when merging syntactic information to refine ambiguities.
        largest_only: True to only mark largest records.
        **kwargs: Additional keyword arguments.

    Returns:
        The annotations added to the text object.
    """
    # TODO: Use annot_mask_cols to mask annotations
    # annot2mask = (
    #     None
    #     if annot_mask_cols is None
    #     else {
    #         col: self.mask_char for col in annot_mask_cols
    #     }
    # )

    annots = self.annotate_text(text_obj.text)
    if annots is None:
        return annots

    if merge_strategies is not None:
        bookmarks = text_obj.bookmarks
        if bookmarks is not None and len(bookmarks) > 0:
            for tag, merge_strategy in merge_strategies.items():
                if tag in bookmarks:
                    text_obj.bookmarks[f"{self.name}.pre-merge:{tag}"] = annots.df
                    annots.add_df(bookmarks[tag])
                    annots = merge(annots, merge_strategy)

    annots = self.compose_groups(annots)

    self.mark_records(annots, largest_only=largest_only)
    # NOTE: don't pass "text" here because it may be masked
    self.validate_records(annots)
    text_obj.annotations.add_df(annots.df)
    return annots
compose_groups abstractmethod
compose_groups(annotations: Annotations) -> Annotations

Compose annotation rows into groups.

Parameters:

Name Type Description Default
annotations Annotations

The annotations.

required

Returns:

Type Description
Annotations

The composed annotations.

Source code in packages/xization/src/dataknobs_xization/annotations.py
@abstractmethod
def compose_groups(self, annotations: Annotations) -> Annotations:
    """Compose annotation rows into groups.

    Args:
        annotations: The annotations.

    Returns:
        The composed annotations.
    """
    raise NotImplementedError
mark_records abstractmethod
mark_records(annotations: Annotations, largest_only: bool = True)

Collect and mark annotation records.

Parameters:

Name Type Description Default
annotations Annotations

The annotations.

required
largest_only bool

True to only mark (keep) the largest records.

True
Source code in packages/xization/src/dataknobs_xization/annotations.py
@abstractmethod
def mark_records(self, annotations: Annotations, largest_only: bool = True):
    """Collect and mark annotation records.

    Args:
        annotations: The annotations.
        largest_only: True to only mark (keep) the largest records.
    """
    raise NotImplementedError
validate_records abstractmethod
validate_records(annotations: Annotations)

Validate annotated records.

Parameters:

Name Type Description Default
annotations Annotations

The annotations.

required
Source code in packages/xization/src/dataknobs_xization/annotations.py
@abstractmethod
def validate_records(
    self,
    annotations: Annotations,
):
    """Validate annotated records.

    Args:
        annotations: The annotations.
    """
    raise NotImplementedError

HtmlHighlighter

HtmlHighlighter(
    field2style: Dict[str, Dict[str, str]],
    tooltip_class: str = "tooltip",
    tooltiptext_class: str = "tooltiptext",
)

Helper class to add HTML markup for highlighting spans of text.

Initialize HtmlHighlighter.

Parameters:

Name Type Description Default
field2style Dict[str, Dict[str, str]]

The annotation column to highlight with its associated style, for example: { 'car_model_field': { 'year': {'background-color': 'lightyellow'}, 'make': {'background-color': 'lightgreen'}, 'model': {'background-color': 'cyan'}, 'style': {'background-color': 'magenta'}, }, }

required
tooltip_class str

The css tooltip class.

'tooltip'
tooltiptext_class str

The css tooltiptext class.

'tooltiptext'

Methods:

Name Description
highlight

Return an html string with the given fields (annotation columns)

Source code in packages/xization/src/dataknobs_xization/annotations.py
def __init__(
    self,
    field2style: Dict[str, Dict[str, str]],
    tooltip_class: str = "tooltip",
    tooltiptext_class: str = "tooltiptext",
):
    """Initialize HtmlHighlighter.

    Args:
        field2style: The annotation column to highlight with its
            associated style, for example:
                {
                    'car_model_field': {
                        'year': {'background-color': 'lightyellow'},
                        'make': {'background-color': 'lightgreen'},
                        'model': {'background-color': 'cyan'},
                        'style': {'background-color': 'magenta'},
                    },
                }
        tooltip_class: The css tooltip class.
        tooltiptext_class: The css tooltiptext class.
    """
    self.field2style = field2style
    self.tooltip_class = tooltip_class
    self.tooltiptext_class = tooltiptext_class
Functions
highlight
highlight(text_obj: AnnotatedText) -> str

Return an html string with the given fields (annotation columns) highlighted with the associated styles.

Parameters:

Name Type Description Default
text_obj AnnotatedText

The annotated text to markup.

required

Returns:

Type Description
str

HTML string with highlighted annotations.

Source code in packages/xization/src/dataknobs_xization/annotations.py
def highlight(
    self,
    text_obj: AnnotatedText,
) -> str:
    """Return an html string with the given fields (annotation columns)
    highlighted with the associated styles.

    Args:
        text_obj: The annotated text to markup.

    Returns:
        HTML string with highlighted annotations.
    """
    result = ["<p>"]
    anns = text_obj.annotations
    an_df = anns.df
    for field, styles in self.field2style.items():
        # NOTE: the following line relies on an_df already being sorted
        df = an_df[an_df[field].isin(styles)]
        cur_pos = 0
        for _loc, row in df.iterrows():
            enttype = row[field]
            style = styles[enttype]
            style_str = " ".join([f"{key}: {value};" for key, value in style.items()])
            start_pos = row[anns.metadata.start_pos_col]
            if start_pos > cur_pos:
                result.append(text_obj.text[cur_pos:start_pos])
            end_pos = row[anns.metadata.end_pos_col]
            result.append(f'<mark class="{self.tooltip_class}" style="{style_str}">')
            result.append(text_obj.text[start_pos:end_pos])
            result.append(f'<span class="{self.tooltiptext_class}">{enttype}</span>')
            result.append("</mark>")
            cur_pos = end_pos
    result.append("</p>")
    return "\n".join(result)
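The highlighting loop above can be sketched independently of the package: walk sorted spans, copy the unmarked text between them, and wrap each span in a styled `<mark>` with a tooltip `<span>`. The function name `highlight_spans` and its `(start, end, label)` tuples are illustrative assumptions; the real method reads spans from the annotations dataframe.

```python
def highlight_spans(text, spans, styles, tooltip_class="tooltip",
                    tooltiptext_class="tooltiptext"):
    """Wrap each (start, end, label) span of `text` in a styled <mark>.

    `spans` must be sorted by start position; `styles` maps each label to
    a dict of css attributes, mirroring HtmlHighlighter.field2style values.
    """
    parts = ["<p>"]
    cur = 0
    for start, end, label in spans:
        style_str = " ".join(f"{k}: {v};" for k, v in styles[label].items())
        if start > cur:                       # copy text before the span
            parts.append(text[cur:start])
        parts.append(f'<mark class="{tooltip_class}" style="{style_str}">')
        parts.append(text[start:end])         # the highlighted span itself
        parts.append(f'<span class="{tooltiptext_class}">{label}</span>')
        parts.append("</mark>")
        cur = end
    parts.append("</p>")
    return "\n".join(parts)
```

Note that, like the real method, this assumes spans do not overlap within one field; overlapping spans would emit text out of order.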

MergeStrategy

Bases: ABC

A merge strategy to be injected based on entity types being merged.

Methods:

Name Description
merge

Process the annotations in the given annotations group, returning the

Functions
merge abstractmethod
merge(group: AnnotationsGroup) -> List[Dict[str, Any]]

Process the annotations in the given annotations group, returning the group's merged annotation dictionaries.

Source code in packages/xization/src/dataknobs_xization/annotations.py
@abstractmethod
def merge(self, group: AnnotationsGroup) -> List[Dict[str, Any]]:
    """Process the annotations in the given annotations group, returning the
    group's merged annotation dictionaries.
    """
    raise NotImplementedError

OverlapGroupIterator

OverlapGroupIterator(an_df: DataFrame)
Given
  • annotation rows (dataframe)
  • in order sorted by
    • start_pos (increasing for input order), and
    • end_pos (decreasing for longest spans first)

Collect overlapping consecutive annotations for processing.

Initialize OverlapGroupIterator.

Parameters:

Name Type Description Default
an_df DataFrame

An annotations.as_df DataFrame, sliced and sorted.

required
Source code in packages/xization/src/dataknobs_xization/annotations.py
def __init__(self, an_df: pd.DataFrame):
    """Initialize OverlapGroupIterator.

    Args:
        an_df: An annotations.as_df DataFrame, sliced and sorted.
    """
    self.an_df = an_df
    self._cur_iter = None
    self._queued_row_data = None
    self.cur_group = None
    self.reset()
Functions

PositionalAnnotationsGroup

PositionalAnnotationsGroup(overlap: bool, rectype: str = None, gnum: int = -1)

Bases: AnnotationsGroup

Container for annotations that either overlap with each other or don't.

Initialize PositionalAnnotationsGroup.

Parameters:

Name Type Description Default
overlap bool

If False, then only accept rows that don't overlap; else only accept rows that do overlap.

required
rectype str

The record type.

None
gnum int

The group number.

-1

Methods:

Name Description
belongs

Determine if the row belongs in this instance based on its overlap

Source code in packages/xization/src/dataknobs_xization/annotations.py
def __init__(self, overlap: bool, rectype: str = None, gnum: int = -1):
    """Initialize PositionalAnnotationsGroup.

    Args:
        overlap: If False, then only accept rows that don't overlap; else
            only accept rows that do overlap.
        rectype: The record type.
        gnum: The group number.
    """
    super().__init__(None, None, None, group_type=rectype, group_num=gnum)
    self.overlap = overlap
    self.start_pos = -1
    self.end_pos = -1
Functions
belongs
belongs(rowdata: RowData) -> bool

Determine if the row belongs in this instance based on its overlap or not.

Parameters:

Name Type Description Default
rowdata RowData

The rowdata to test.

required

Returns:

Type Description
bool

True if the rowdata belongs in this instance.

Source code in packages/xization/src/dataknobs_xization/annotations.py
def belongs(self, rowdata: RowData) -> bool:
    """Determine if the row belongs in this instance based on its overlap
    or not.

    Args:
        rowdata: The rowdata to test.

    Returns:
        True if the rowdata belongs in this instance.
    """
    result = True  # Anything belongs to an empty group
    if len(self.rows) > 0:
        start_overlaps = self._is_in_bounds(rowdata.start_pos)
        end_overlaps = self._is_in_bounds(rowdata.end_pos - 1)
        result = start_overlaps or end_overlaps
        if not self.overlap:
            result = not result
    if result:
        if self.start_pos < 0:
            self.start_pos = rowdata.start_pos
            self.end_pos = rowdata.end_pos
        else:
            self.start_pos = min(self.start_pos, rowdata.start_pos)
            self.end_pos = max(self.end_pos, rowdata.end_pos)
    return result
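The membership test above can be sketched as a standalone function for a group that already holds at least one row. The name `row_belongs` is illustrative; positions are half-open `[start, end)` spans as in the annotations dataframe.

```python
def row_belongs(group_start, group_end, row_start, row_end, overlap=True):
    """Sketch of PositionalAnnotationsGroup.belongs for a seeded group:
    a row joins an overlap group when either of its endpoints falls inside
    the group's current [group_start, group_end) bounds; a non-overlap
    group inverts the test."""
    def in_bounds(pos):
        return group_start <= pos < group_end
    result = in_bounds(row_start) or in_bounds(row_end - 1)
    return result if overlap else not result
```

As in the real class, an accepted row would then widen the group's bounds to `min(start)` and `max(end)`.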

RowData

RowData(metadata: AnnotationsMetaData, row: Series)

A wrapper for an annotation row (pd.Series) to facilitate e.g., grouping.

Methods:

Name Description
is_subset

Determine whether this row's span is a subset of the other.

is_subset_of_any

Determine whether this row is a subset of any of the others

Source code in packages/xization/src/dataknobs_xization/annotations.py
def __init__(
    self,
    metadata: AnnotationsMetaData,
    row: pd.Series,
):
    self.metadata = metadata
    self.row = row
Functions
is_subset
is_subset(other_row: RowData) -> bool

Determine whether this row's span is a subset of the other.

Parameters:

Name Type Description Default
other_row RowData

The other row.

required

Returns:

Type Description
bool

True if this row's span is a subset of the other row's span.

Source code in packages/xization/src/dataknobs_xization/annotations.py
def is_subset(self, other_row: "RowData") -> bool:
    """Determine whether this row's span is a subset of the other.

    Args:
        other_row: The other row.

    Returns:
        True if this row's span is a subset of the other row's span.
    """
    return self.start_pos >= other_row.start_pos and self.end_pos <= other_row.end_pos
is_subset_of_any
is_subset_of_any(other_rows: List[RowData]) -> bool

Determine whether this row is a subset of any of the others according to text span coverage.

Parameters:

Name Type Description Default
other_rows List[RowData]

The rows to test for this to be a subset of any.

required

Returns:

Type Description
bool

True if this row is a subset of any of the other rows.

Source code in packages/xization/src/dataknobs_xization/annotations.py
def is_subset_of_any(self, other_rows: List["RowData"]) -> bool:
    """Determine whether this row is a subset of any of the others
    according to text span coverage.

    Args:
        other_rows: The rows to test for this to be a subset of any.

    Returns:
        True if this row is a subset of any of the other rows.
    """
    result = False
    for other_row in other_rows:
        if self.is_subset(other_row):
            result = True
            break
    return result
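The two span tests above reduce to simple interval containment. This standalone sketch uses `(start, end)` tuples instead of `RowData` objects and expresses `is_subset_of_any` with `any()`:

```python
def is_subset(span, other):
    """Span containment as in RowData.is_subset: (start, end) tuples."""
    return span[0] >= other[0] and span[1] <= other[1]

def is_subset_of_any(span, others):
    """RowData.is_subset_of_any, expressed with any()."""
    return any(is_subset(span, o) for o in others)
```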

SyntacticParser

SyntacticParser(name: str)

Bases: BasicAnnotator

Class for creating syntactic annotations for an input.

Methods:

Name Description
annotate_input

Annotate the text, additively updating the annotations.

Source code in packages/xization/src/dataknobs_xization/annotations.py
def __init__(
    self,
    name: str,
):
    """Initialize Annotator.

    Args:
        name: The name of this annotator.
    """
    self.name = name
Functions
annotate_input
annotate_input(text_obj: AnnotatedText, **kwargs: Any) -> Annotations

Annotate the text, additively updating the annotations.

Parameters:

Name Type Description Default
text_obj AnnotatedText

The text to annotate.

required
**kwargs Any

Additional keyword arguments.

{}

Returns:

Type Description
Annotations

The annotations added to the text.

Source code in packages/xization/src/dataknobs_xization/annotations.py
def annotate_input(
    self,
    text_obj: AnnotatedText,
    **kwargs: Any
) -> Annotations:
    """Annotate the text, additively updating the annotations.

    Args:
        text_obj: The text to annotate.
        **kwargs: Additional keyword arguments.

    Returns:
        The annotations added to the text.
    """
    # Get new annotation with just the syntax
    annots = self.annotate_text(text_obj.text)

    # Add syntactic annotations only as a bookmark
    text_obj.bookmarks[self.name] = annots.as_df

    return annots

Functions

merge

merge(annotations: Annotations, merge_strategy: MergeStrategy) -> Annotations

Merge the overlapping groups according to the given strategy.

Source code in packages/xization/src/dataknobs_xization/annotations.py
def merge(
    annotations: Annotations,
    merge_strategy: MergeStrategy,
) -> Annotations:
    """Merge the overlapping groups according to the given strategy."""
    og_iter = OverlapGroupIterator(annotations.as_df)
    result = Annotations(annotations.metadata)
    while og_iter.has_next:
        og = og_iter.next_group()
        result.add_dicts(merge_strategy.merge(og))
    return result
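The grouping that `OverlapGroupIterator` feeds into `merge` can be sketched on bare spans: rows sorted by start ascending (and end descending for longest-first) fold into runs of consecutive overlapping spans, each run becoming one group for the merge strategy. The function name `overlap_groups` is illustrative.

```python
def overlap_groups(spans):
    """Fold (start, end) spans into groups of consecutive overlapping
    spans, after sorting by (start asc, end desc) as the iterator expects."""
    groups = []
    cur, cur_end = [], -1
    for start, end in sorted(spans, key=lambda s: (s[0], -s[1])):
        if cur and start < cur_end:      # overlaps the running group
            cur.append((start, end))
            cur_end = max(cur_end, end)
        else:                            # close the group, start a new one
            if cur:
                groups.append(cur)
            cur, cur_end = [(start, end)], end
    if cur:
        groups.append(cur)
    return groups
```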

authorities

Functions and Classes

dataknobs_xization.authorities

Authority-based annotation processing and field grouping.

Provides classes for managing authority-based annotations, field groups, and derived annotation columns for structured text extraction.

Classes:

Name Description
AnnotationsValidator

A base class with helper functions for performing validations on annotation

AuthoritiesBundle

An authority for expressing values through multiple bundled "authorities"

Authority

A class for managing and defining tabular authoritative data for e.g.,

AuthorityAnnotationsBuilder

An extension of an AnnotationsBuilder that adds the 'auth_id' column.

AuthorityAnnotationsMetaData

An extension of AnnotationsMetaData that adds an 'auth_id_col' to the

AuthorityData

A wrapper for authority data.

AuthorityFactory

A factory class for building an authority.

DerivedFieldGroups

Defines derived column types:

LexicalAuthority

A class for managing named entities by ID with associated values and

RegexAuthority

A class for managing named entities by ID with associated values and

Classes

AnnotationsValidator

Bases: ABC

A base class with helper functions for performing validations on annotation rows.

Classes:

Name Description
AuthAnnotations

A wrapper class for convenient access to the entity annotations.

Methods:

Name Description
__call__

Call function to enable instances of this type of class to be passed in

validate_annotation_rows

Determine whether the proposed authority annotation rows are valid.

Classes
AuthAnnotations
AuthAnnotations(auth: Authority, ann_row_dicts: List[Dict[str, Any]])

A wrapper class for convenient access to the entity annotations.

Methods:

Name Description
colval

Get the column's value from the given row

get_field_type

Get the entity field type value

get_text

Get the entity text from the row

Attributes:

Name Type Description
anns Annotations

Get this instance's annotation rows as an annotations object

attributes Dict[str, str]

Get this instance's annotation entity attributes

df DataFrame

Get the annotation's dataframe

row_accessor AnnotationsRowAccessor

Get the row accessor for this instance's annotations.

Source code in packages/xization/src/dataknobs_xization/authorities.py
def __init__(self, auth: Authority, ann_row_dicts: List[Dict[str, Any]]):
    self.auth = auth
    self.ann_row_dicts = ann_row_dicts
    self._row_accessor = None  # AnnotationsRowAccessor
    self._anns = None  # Annotations
    self._atts = None  # Dict[str, str]
Attributes
anns property
anns: Annotations

Get this instance's annotation rows as an annotations object

attributes property
attributes: Dict[str, str]

Get this instance's annotation entity attributes

df property
df: DataFrame

Get the annotation's dataframe

row_accessor property
row_accessor: AnnotationsRowAccessor

Get the row accessor for this instance's annotations.

Functions
colval
colval(col_name, row) -> Any

Get the column's value from the given row

Source code in packages/xization/src/dataknobs_xization/authorities.py
def colval(self, col_name, row) -> Any:
    """Get the column's value from the given row"""
    return self.row_accessor.get_col_value(col_name, row)
get_field_type
get_field_type(row: Series) -> str

Get the entity field type value

Source code in packages/xization/src/dataknobs_xization/authorities.py
def get_field_type(self, row: pd.Series) -> str:
    """Get the entity field type value"""
    return self.row_accessor.get_col_value("field_type", row, None)
get_text
get_text(row: Series) -> str

Get the entity text from the row

Source code in packages/xization/src/dataknobs_xization/authorities.py
def get_text(self, row: pd.Series) -> str:
    """Get the entity text from the row"""
    return self.row_accessor.get_col_value(self.auth.metadata.text_col, row, None)
Functions
__call__
__call__(auth: Authority, ann_row_dicts: List[Dict[str, Any]]) -> bool

Call function to enable instances of this type of class to be passed in as an anns_validator function to an Authority.

Parameters:

Name Type Description Default
auth Authority

The authority proposing annotations.

required
ann_row_dicts List[Dict[str, Any]]

The proposed annotations.

required

Returns:

Type Description
bool

True if the annotations are valid; otherwise, False.

Source code in packages/xization/src/dataknobs_xization/authorities.py
def __call__(
    self,
    auth: Authority,
    ann_row_dicts: List[Dict[str, Any]],
) -> bool:
    """Call function to enable instances of this type of class to be passed in
    as an anns_validator function to an Authority.

    Args:
        auth: The authority proposing annotations.
        ann_row_dicts: The proposed annotations.

    Returns:
        True if the annotations are valid; otherwise, False.
    """
    return self.validate_annotation_rows(
        AnnotationsValidator.AuthAnnotations(auth, ann_row_dicts)
    )
validate_annotation_rows abstractmethod
validate_annotation_rows(auth_annotations: AuthAnnotations) -> bool

Determine whether the proposed authority annotation rows are valid.

Parameters:

Name Type Description Default
auth_annotations AuthAnnotations

The AuthAnnotations instance with the proposed data.

required

Returns:

Type Description
bool

True if valid; False if not.

Source code in packages/xization/src/dataknobs_xization/authorities.py
@abstractmethod
def validate_annotation_rows(
    self,
    auth_annotations: "AnnotationsValidator.AuthAnnotations",
) -> bool:
    """Determine whether the proposed authority annotation rows are valid.

    Args:
        auth_annotations: The AuthAnnotations instance with the
            proposed data.

    Returns:
        True if valid; False if not.
    """
    raise NotImplementedError
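The callable-validator pattern above (an instance usable wherever a validation function is expected) can be sketched in isolation. The class name `MinSpanValidator`, the plain-dict rows, and the `start_pos`/`end_pos` keys are illustrative assumptions standing in for the real `AuthAnnotations` wrapper.

```python
class MinSpanValidator:
    """A callable validator in the style of AnnotationsValidator: an
    instance can be passed anywhere a validation function is expected."""

    def __init__(self, min_len):
        self.min_len = min_len

    def __call__(self, ann_row_dicts):
        # Accept the proposed rows only if every span is long enough
        return all(
            row["end_pos"] - row["start_pos"] >= self.min_len
            for row in ann_row_dicts
        )
```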

AuthoritiesBundle

AuthoritiesBundle(
    name: str,
    auth_anns_builder: AuthorityAnnotationsBuilder = None,
    authdata: AuthorityData = None,
    field_groups: DerivedFieldGroups = None,
    parent_auth: Authority = None,
    anns_validator: Callable[[Authority, Dict[str, Any]], bool] = None,
    auths: List[Authority] = None,
)

Bases: Authority

An authority for expressing values through multiple bundled "authorities" like dictionary-based and/or multiple regular expression patterns.

Initialize the AuthoritiesBundle.

Parameters:

Name Type Description Default
name str

This authority's entity name.

required
auth_anns_builder AuthorityAnnotationsBuilder

The authority annotations row builder to use for building annotation rows.

None
authdata AuthorityData

The authority data.

None
field_groups DerivedFieldGroups

The derived field groups to use.

None
anns_validator Callable[[Authority, Dict[str, Any]], bool]

fn(auth, anns_dict_list) that returns True if the list of annotation row dicts are valid to be added as annotations for a single match or "entity".

None
parent_auth Authority

This authority's parent authority (if any).

None
auths List[Authority]

The authorities to bundle together.

None

Methods:

Name Description
add

Add the authority to this bundle.

add_annotations

Method to do the work of finding, validating, and adding annotations.

has_value

Determine whether the given value is in this authority.

Source code in packages/xization/src/dataknobs_xization/authorities.py
def __init__(
    self,
    name: str,
    auth_anns_builder: AuthorityAnnotationsBuilder = None,
    authdata: AuthorityData = None,
    field_groups: DerivedFieldGroups = None,
    parent_auth: "Authority" = None,
    anns_validator: Callable[["Authority", Dict[str, Any]], bool] = None,
    auths: List[Authority] = None,
):
    """Initialize the AuthoritiesBundle.

    Args:
        name: This authority's entity name.
        auth_anns_builder: The authority annotations row builder to use
            for building annotation rows.
        authdata: The authority data.
        field_groups: The derived field groups to use.
        anns_validator: fn(auth, anns_dict_list) that returns True if
            the list of annotation row dicts are valid to be added as
            annotations for a single match or "entity".
        parent_auth: This authority's parent authority (if any).
        auths: The authorities to bundle together.
    """
    super().__init__(
        name,
        auth_anns_builder=auth_anns_builder,
        authdata=authdata,
        field_groups=field_groups,
        anns_validator=anns_validator,
        parent_auth=parent_auth,
    )
    self.auths = auths.copy() if auths is not None else []
Functions
add
add(auth: Authority)

Add the authority to this bundle.

Parameters:

Name Type Description Default
auth Authority

The authority to add.

required
Source code in packages/xization/src/dataknobs_xization/authorities.py
def add(self, auth: Authority):
    """Add the authority to this bundle.

    Args:
        auth: The authority to add.
    """
    self.auths.append(auth)
add_annotations
add_annotations(text_obj: AnnotatedText) -> dk_annots.Annotations

Method to do the work of finding, validating, and adding annotations.

Parameters:

Name Type Description Default
text_obj AnnotatedText

The annotated text object to process and add annotations.

required

Returns:

Type Description
Annotations

The added Annotations.

Source code in packages/xization/src/dataknobs_xization/authorities.py
def add_annotations(
    self,
    text_obj: dk_annots.AnnotatedText,
) -> dk_annots.Annotations:
    """Method to do the work of finding, validating, and adding annotations.

    Args:
        text_obj: The annotated text object to process and add annotations.

    Returns:
        The added Annotations.
    """
    for auth in self.auths:
        auth.annotate_input(text_obj)
    return text_obj.annotations
has_value
has_value(value: Any) -> bool

Determine whether the given value is in this authority.

Parameters:

Name Type Description Default
value Any

A possible authority value.

required

Returns:

Type Description
bool

True if the value is a valid entity value.

Source code in packages/xization/src/dataknobs_xization/authorities.py
def has_value(self, value: Any) -> bool:
    """Determine whether the given value is in this authority.

    Args:
        value: A possible authority value.

    Returns:
        True if the value is a valid entity value.
    """
    for auth in self.auths:
        if auth.has_value(value):
            return True
    return False
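`AuthoritiesBundle` is a composite: annotation and lookup calls delegate to each bundled authority in turn. That delegation can be sketched with plain predicates; the class name `Bundle` and the callable-authority simplification are illustrative assumptions.

```python
class Bundle:
    """Composite lookup in the style of AuthoritiesBundle.has_value:
    a value is valid if any bundled authority accepts it."""

    def __init__(self, auths=None):
        self.auths = list(auths) if auths is not None else []

    def add(self, auth):
        self.auths.append(auth)

    def has_value(self, value):
        return any(auth(value) for auth in self.auths)
```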

Authority

Authority(
    name: str,
    auth_anns_builder: AuthorityAnnotationsBuilder = None,
    authdata: AuthorityData = None,
    field_groups: DerivedFieldGroups = None,
    anns_validator: Callable[[Authority, Dict[str, Any]], bool] = None,
    parent_auth: Authority = None,
)

Bases: Annotator

A class for managing and defining tabular authoritative data for e.g., taxonomies, etc., and using them to annotate instances within text.

Initialize with this authority's metadata.

Parameters:

Name Type Description Default
name str

This authority's entity name.

required
auth_anns_builder AuthorityAnnotationsBuilder

The authority annotations row builder to use for building annotation rows.

None
authdata AuthorityData

The authority data.

None
field_groups DerivedFieldGroups

The derived field groups to use.

None
anns_validator Callable[[Authority, Dict[str, Any]], bool]

fn(auth, anns_dict_list) that returns True if the list of annotation row dicts are valid to be added as annotations for a single match or "entity".

None
parent_auth Authority

This authority's parent authority (if any).

None

Methods:

Name Description
add_annotations

Method to do the work of finding, validating, and adding annotations.

annotate_input

Find and annotate this authority's entities in the document text

build_annotation

Build annotations with the given components.

compose

Compose annotations into groups.

has_value

Determine whether the given value is in this authority.

validate_ann_dicts

The annotation row dictionaries are valid if:

Attributes:

Name Type Description
metadata AuthorityAnnotationsMetaData

Get the meta-data

parent Authority

Get this authority's parent, or None.

Source code in packages/xization/src/dataknobs_xization/authorities.py
def __init__(
    self,
    name: str,
    auth_anns_builder: AuthorityAnnotationsBuilder = None,
    authdata: AuthorityData = None,
    field_groups: DerivedFieldGroups = None,
    anns_validator: Callable[["Authority", Dict[str, Any]], bool] = None,
    parent_auth: "Authority" = None,
):
    """Initialize with this authority's metadata.

    Args:
        name: This authority's entity name.
        auth_anns_builder: The authority annotations row builder to use
            for building annotation rows.
        authdata: The authority data.
        field_groups: The derived field groups to use.
        anns_validator: fn(auth, anns_dict_list) that returns True if
            the list of annotation row dicts are valid to be added as
            annotations for a single match or "entity".
        parent_auth: This authority's parent authority (if any).
    """
    super().__init__(name)
    self.anns_builder = (
        auth_anns_builder if auth_anns_builder is not None else AuthorityAnnotationsBuilder()
    )
    self.authdata = authdata
    self.field_groups = field_groups if field_groups is not None else DerivedFieldGroups()
    self.anns_validator = anns_validator
    self._parent = parent_auth
Attributes
metadata property
metadata: AuthorityAnnotationsMetaData

Get the meta-data

parent property
parent: Authority

Get this authority's parent, or None.

Functions
add_annotations abstractmethod
add_annotations(text_obj: AnnotatedText) -> dk_annots.Annotations

Method to do the work of finding, validating, and adding annotations.

Parameters:

Name Type Description Default
text_obj AnnotatedText

The annotated text object to process and add annotations.

required

Returns:

Type Description
Annotations

The added Annotations.

Source code in packages/xization/src/dataknobs_xization/authorities.py
@abstractmethod
def add_annotations(
    self,
    text_obj: dk_annots.AnnotatedText,
) -> dk_annots.Annotations:
    """Method to do the work of finding, validating, and adding annotations.

    Args:
        text_obj: The annotated text object to process and add annotations.

    Returns:
        The added Annotations.
    """
    raise NotImplementedError
annotate_input
annotate_input(
    text_obj: Union[AnnotatedText, str], **kwargs: Any
) -> dk_annots.Annotations

Find and annotate this authority's entities in the document text as dictionaries like:

    [
        {
            'input_id': <id>,
            'start_pos': <start_char_pos>,
            'end_pos': <end_char_pos>,
            'entity_text': <entity_text>,
            'ann_type': <authority_name>,
            '<auth_id>': <auth_value_id_or_canonical_form>,
            'confidence': <confidence_if_available>,
        },
    ]

Parameters:

Name Type Description Default
text_obj Union[AnnotatedText, str]

The text object or string to process.

required
**kwargs Any

Additional keyword arguments.

{}

Returns:

Type Description
Annotations

An Annotations instance.

Source code in packages/xization/src/dataknobs_xization/authorities.py
def annotate_input(
    self,
    text_obj: Union[dk_annots.AnnotatedText, str],
    **kwargs: Any,
) -> dk_annots.Annotations:
    """Find and annotate this authority's entities in the document text
    as dictionaries like:
    [
        {
            'input_id': <id>,
            'start_pos': <start_char_pos>,
            'end_pos': <end_char_pos>,
            'entity_text': <entity_text>,
            'ann_type': <authority_name>,
            '<auth_id>': <auth_value_id_or_canonical_form>,
            'confidence': <confidence_if_available>,
        },
    ]

    Args:
        text_obj: The text object or string to process.
        **kwargs: Additional keyword arguments.

    Returns:
        An Annotations instance.
    """
    if text_obj is not None:
        if isinstance(text_obj, str) and len(text_obj.strip()) > 0:
            text_obj = dk_annots.AnnotatedText(
                text_obj,
                annots_metadata=self.metadata,
            )
    if text_obj is not None:
        annotations = self.add_annotations(text_obj)
    return annotations
build_annotation
build_annotation(
    start_pos: int = None,
    end_pos: int = None,
    entity_text: str = None,
    auth_value_id: Any = None,
    conf: float = 1.0,
    **kwargs,
) -> Dict[str, Any]

Build annotations with the given components.

Source code in packages/xization/src/dataknobs_xization/authorities.py
def build_annotation(
    self,
    start_pos: int = None,
    end_pos: int = None,
    entity_text: str = None,
    auth_value_id: Any = None,
    conf: float = 1.0,
    **kwargs,
) -> Dict[str, Any]:
    """Build annotations with the given components."""
    return self.anns_builder.build_annotation_row(
        start_pos, end_pos, entity_text, self.name, auth_value_id, auth_valconf=conf, **kwargs
    )
compose
compose(annotations: Annotations) -> dk_annots.Annotations

Compose annotations into groups.

Parameters:

Name Type Description Default
annotations Annotations

The annotations.

required

Returns:

Type Description
Annotations

Composed annotations.

Source code in packages/xization/src/dataknobs_xization/authorities.py
def compose(
    self,
    annotations: dk_annots.Annotations,
) -> dk_annots.Annotations:
    """Compose annotations into groups.

    Args:
        annotations: The annotations.

    Returns:
        Composed annotations.
    """
    return annotations
has_value abstractmethod
has_value(value: Any) -> bool

Determine whether the given value is in this authority.

Parameters:

Name Type Description Default
value Any

A possible authority value.

required

Returns:

Type Description
bool

True if the value is a valid entity value.

Source code in packages/xization/src/dataknobs_xization/authorities.py
@abstractmethod
def has_value(self, value: Any) -> bool:
    """Determine whether the given value is in this authority.

    Args:
        value: A possible authority value.

    Returns:
        True if the value is a valid entity value.
    """
    raise NotImplementedError
validate_ann_dicts
validate_ann_dicts(ann_dicts: List[Dict[str, Any]]) -> bool
The annotation row dictionaries are valid if:
  • they are non-empty, and
  • either there is no annotations validator, or
  • they are valid according to the validator

Parameters:

Name Type Description Default
ann_dicts List[Dict[str, Any]]

Annotation dictionaries.

required

Returns:

Type Description
bool

True if valid.

Source code in packages/xization/src/dataknobs_xization/authorities.py
def validate_ann_dicts(self, ann_dicts: List[Dict[str, Any]]) -> bool:
    """The annotation row dictionaries are valid if:
      * They are non-empty
      * and
         * either there is no annotations validator
         * or they are valid according to the validator

    Args:
        ann_dicts: Annotation dictionaries.

    Returns:
        True if valid.
    """
    return len(ann_dicts) > 0 and (
        self.anns_validator is None or self.anns_validator(self, ann_dicts)
    )
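The anns_validator contract can be illustrated with a self-contained sketch. The validation logic of validate_ann_dicts is reproduced inline here (not imported from dataknobs_xization), and require_entity_text is a hypothetical validator; the real validator receives the Authority instance as its first argument, passed as None below.

```python
from typing import Any, Callable, Dict, List

# Simplified stand-in for Authority.validate_ann_dicts: valid when the
# row dicts are non-empty and either no validator is set or it approves.
def validate_ann_dicts(
    auth: Any,
    ann_dicts: List[Dict[str, Any]],
    anns_validator: Callable[[Any, List[Dict[str, Any]]], bool] = None,
) -> bool:
    return len(ann_dicts) > 0 and (
        anns_validator is None or anns_validator(auth, ann_dicts)
    )

# Hypothetical validator: require every row to carry an 'entity_text' value.
def require_entity_text(auth, ann_dicts):
    return all(d.get("entity_text") for d in ann_dicts)

rows = [{"entity_text": "acme", "start_pos": 0, "end_pos": 4}]
print(validate_ann_dicts(None, []))                         # empty -> False
print(validate_ann_dicts(None, rows))                       # no validator -> True
print(validate_ann_dicts(None, rows, require_entity_text))  # approved -> True
```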

AuthorityAnnotationsBuilder

AuthorityAnnotationsBuilder(
    metadata: AuthorityAnnotationsMetaData = None,
    data_defaults: Dict[str, Any] = None,
)

Bases: AnnotationsBuilder

An extension of an AnnotationsBuilder that adds the 'auth_id' column.

Initialize AuthorityAnnotationsBuilder.

Parameters:

Name Type Description Default
metadata AuthorityAnnotationsMetaData

The authority annotations metadata.

None
data_defaults Dict[str, Any]

Dict[ann_colname, default_value] with default values for annotation columns.

None

Methods:

Name Description
build_annotation_row

Build an annotation row with the mandatory key values and those from

Source code in packages/xization/src/dataknobs_xization/authorities.py
def __init__(
    self,
    metadata: AuthorityAnnotationsMetaData = None,
    data_defaults: Dict[str, Any] = None,
):
    """Initialize AuthorityAnnotationsBuilder.

    Args:
        metadata: The authority annotations metadata.
        data_defaults: Dict[ann_colname, default_value] with default
            values for annotation columns.
    """
    super().__init__(
        metadata if metadata is not None else AuthorityAnnotationsMetaData(), data_defaults
    )
Functions
build_annotation_row
build_annotation_row(
    start_pos: int,
    end_pos: int,
    text: str,
    ann_type: str,
    auth_id: str,
    **kwargs: Any,
) -> Dict[str, Any]

Build an annotation row with the mandatory key values and those from the remaining keyword arguments.

For those kwargs whose names match metadata column names, override the data_defaults and add remaining data_default attributes.

Parameters:

Name Type Description Default
start_pos int

The token start position.

required
end_pos int

The token end position.

required
text str

The token text.

required
ann_type str

The annotation type.

required
auth_id str

The authority ID for the row.

required
**kwargs Any

Additional keyword arguments.

{}

Returns:

Type Description
Dict[str, Any]

The result row dictionary.

Source code in packages/xization/src/dataknobs_xization/authorities.py
def build_annotation_row(
    self, start_pos: int, end_pos: int, text: str, ann_type: str, auth_id: str, **kwargs: Any
) -> Dict[str, Any]:
    """Build an annotation row with the mandatory key values and those from
    the remaining keyword arguments.

    For those kwargs whose names match metadata column names, override the
    data_defaults and add remaining data_default attributes.

    Args:
        start_pos: The token start position.
        end_pos: The token end position.
        text: The token text.
        ann_type: The annotation type.
        auth_id: The authority ID for the row.
        **kwargs: Additional keyword arguments.

    Returns:
        The result row dictionary.
    """
    return self.do_build_row(
        {
            self.metadata.start_pos_col: start_pos,
            self.metadata.end_pos_col: end_pos,
            self.metadata.text_col: text,
            self.metadata.ann_type_col: ann_type,
            self.metadata.auth_id_col: auth_id,
        },
        **kwargs,
    )
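The precedence described above (data_defaults filled in first, kwargs matching column names overriding them, mandatory key columns always set) can be sketched with a simplified stand-alone function. This is illustrative only: the real builder delegates to do_build_row and resolves column names through its metadata, while the column names below are hardcoded for clarity.

```python
from typing import Any, Dict

def build_annotation_row(start_pos, end_pos, text, ann_type, auth_id,
                         data_defaults: Dict[str, Any] = None,
                         **kwargs: Any) -> Dict[str, Any]:
    row = dict(data_defaults or {})  # start from the configured defaults
    row.update(kwargs)               # kwargs override matching defaults
    row.update({                     # mandatory key columns always win
        "start_pos": start_pos,
        "end_pos": end_pos,
        "text": text,
        "ann_type": ann_type,
        "auth_id": auth_id,
    })
    return row

row = build_annotation_row(
    0, 4, "acme", "company", "C-1",
    data_defaults={"confidence": 1.0},
    confidence=0.8,  # overrides the 1.0 default
)
print(row["confidence"])  # 0.8
```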

AuthorityAnnotationsMetaData

AuthorityAnnotationsMetaData(
    start_pos_col: str = dk_annots.KEY_START_POS_COL,
    end_pos_col: str = dk_annots.KEY_END_POS_COL,
    text_col: str = dk_annots.KEY_TEXT_COL,
    ann_type_col: str = dk_annots.KEY_ANN_TYPE_COL,
    auth_id_col: str = KEY_AUTH_ID_COL,
    sort_fields: List[str] = (
        dk_annots.KEY_START_POS_COL,
        dk_annots.KEY_END_POS_COL,
    ),
    sort_fields_ascending: List[bool] = (True, False),
    **kwargs: Any,
)

Bases: AnnotationsMetaData

An extension of AnnotationsMetaData that adds an 'auth_id_col' to the standard (key) annotation columns (attributes).

Initialize with key (and more) column names and info.

Key column types
  • start_pos
  • end_pos
  • text
  • ann_type
  • auth_id
Note

Actual table columns can be named arbitrarily, BUT interactions through annotations classes and interfaces relating to the "key" columns must use the key column constants.

Parameters:

Name Type Description Default
start_pos_col str

Col name for the token starting position.

KEY_START_POS_COL
end_pos_col str

Col name for the token ending position.

KEY_END_POS_COL
text_col str

Col name for the token text.

KEY_TEXT_COL
ann_type_col str

Col name for the annotation types.

KEY_ANN_TYPE_COL
auth_id_col str

Col name for the authority value ID.

KEY_AUTH_ID_COL
sort_fields List[str]

The col types relevant for sorting annotation rows.

(KEY_START_POS_COL, KEY_END_POS_COL)
sort_fields_ascending List[bool]

To specify sort order of sort_fields.

(True, False)
**kwargs Any

More column types mapped to column names.

{}

Attributes:

Name Type Description
auth_id_col str

Get the column name for the auth_id

Source code in packages/xization/src/dataknobs_xization/authorities.py
def __init__(
    self,
    start_pos_col: str = dk_annots.KEY_START_POS_COL,
    end_pos_col: str = dk_annots.KEY_END_POS_COL,
    text_col: str = dk_annots.KEY_TEXT_COL,
    ann_type_col: str = dk_annots.KEY_ANN_TYPE_COL,
    auth_id_col: str = KEY_AUTH_ID_COL,
    sort_fields: List[str] = (dk_annots.KEY_START_POS_COL, dk_annots.KEY_END_POS_COL),
    sort_fields_ascending: List[bool] = (True, False),
    **kwargs: Any,
):
    """Initialize with key (and more) column names and info.

    Key column types:
      * start_pos
      * end_pos
      * text
      * ann_type
      * auth_id

    Note:
        Actual table columns can be named arbitrarily, BUT interactions
        through annotations classes and interfaces relating to the "key"
        columns must use the key column constants.

    Args:
        start_pos_col: Col name for the token starting position.
        end_pos_col: Col name for the token ending position.
        text_col: Col name for the token text.
        ann_type_col: Col name for the annotation types.
        auth_id_col: Col name for the authority value ID.
        sort_fields: The col types relevant for sorting annotation rows.
        sort_fields_ascending: To specify sort order of sort_fields.
        **kwargs: More column types mapped to column names.
    """
    super().__init__(
        start_pos_col=start_pos_col,
        end_pos_col=end_pos_col,
        text_col=text_col,
        ann_type_col=ann_type_col,
        sort_fields=sort_fields,
        sort_fields_ascending=sort_fields_ascending,
        auth_id=auth_id_col,
        **kwargs,
    )
Attributes
auth_id_col property
auth_id_col: str

Get the column name for the auth_id

Functions

AuthorityData

AuthorityData(df: DataFrame, name: str)

A wrapper for authority data.

Methods:

Name Description
lookup_values

Lookup authority value(s) for the given value or value id.

Attributes:

Name Type Description
df DataFrame

Get the authority data in a dataframe

Source code in packages/xization/src/dataknobs_xization/authorities.py
def __init__(self, df: pd.DataFrame, name: str):
    self._df = df
    self.name = name
Attributes
df property
df: DataFrame

Get the authority data in a dataframe

Functions
lookup_values
lookup_values(value: Any, is_id: bool = False) -> pd.DataFrame

Lookup authority value(s) for the given value or value id.

Parameters:

Name Type Description Default
value Any

A value or value_id for this authority.

required
is_id bool

True if value is an ID.

False

Returns:

Type Description
DataFrame

The applicable authority dataframe rows.

Source code in packages/xization/src/dataknobs_xization/authorities.py
def lookup_values(self, value: Any, is_id: bool = False) -> pd.DataFrame:
    """Lookup authority value(s) for the given value or value id.

    Args:
        value: A value or value_id for this authority.
        is_id: True if value is an ID.

    Returns:
        The applicable authority dataframe rows.
    """
    col = self.df.index if is_id else self.df[self.name]
    return self.df[col == value]
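The lookup logic shown above can be mirrored on a toy authority table. In this sketch the dataframe index serves as the value ID and the column named after the authority ("color" here, a hypothetical example) holds the values.

```python
import pandas as pd

# Toy authority data: index = value IDs, "color" column = values.
df = pd.DataFrame({"color": ["red", "green", "blue"]}, index=[10, 20, 30])
name = "color"

def lookup_values(value, is_id=False):
    # Same selection logic as AuthorityData.lookup_values.
    col = df.index if is_id else df[name]
    return df[col == value]

by_value = lookup_values("green")      # match on the value column
by_id = lookup_values(20, is_id=True)  # match on the index (value ID)
print(by_value.index.tolist())  # [20]
```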

AuthorityFactory

Bases: ABC

A factory class for building an authority.

Methods:

Name Description
build_authority

Build an authority with the given name and data.

Functions
build_authority abstractmethod
build_authority(
    name: str,
    auth_anns_builder: AuthorityAnnotationsBuilder,
    authdata: AuthorityData,
    parent_auth: Authority = None,
) -> Authority

Build an authority with the given name and data.

Parameters:

Name Type Description Default
name str

The authority name.

required
auth_anns_builder AuthorityAnnotationsBuilder

The authority annotations row builder to use for building annotation rows.

required
authdata AuthorityData

The authority data.

required
parent_auth Authority

The parent authority.

None

Returns:

Type Description
Authority

The authority.

Source code in packages/xization/src/dataknobs_xization/authorities.py
@abstractmethod
def build_authority(
    self,
    name: str,
    auth_anns_builder: AuthorityAnnotationsBuilder,
    authdata: AuthorityData,
    parent_auth: Authority = None,
) -> Authority:
    """Build an authority with the given name and data.

    Args:
        name: The authority name.
        auth_anns_builder: The authority annotations row builder to use
            for building annotation rows.
        authdata: The authority data.
        parent_auth: The parent authority.

    Returns:
        The authority.
    """
    raise NotImplementedError

DerivedFieldGroups

DerivedFieldGroups(
    field_type_suffix: str = "_field",
    field_group_suffix: str = "_num",
    field_record_suffix: str = "_recsnum",
)

Bases: DerivedAnnotationColumns

Defines derived column types:
  • "field_type" -- The column holding the type of field of an annotation row
  • "field_group" -- The column holding the group number(s) of the field
  • "field_record" -- The column holding the record number(s) of the field

Add derived column types/names: given an annotation row,
  • field_type(row) == f'{row[ann_type_col]}_field'
  • field_group(row) == f'{row[ann_type_col]}_num'
  • field_record(row) == f'{row[ann_type_col]}_recsnum'

Where
  • A field_type column holds annotation "sub"-type values, or fields
  • A field_group column identifies groups of annotation fields
  • A field_record column identifies groups of annotation field groups

Parameters:

Name Type Description Default
field_type_suffix str

The field_type col name suffix (if not _field).

'_field'
field_group_suffix str

The field_group col name suffix (if not _num).

'_num'
field_record_suffix str

field_record colname sfx (if not _recsnum).

'_recsnum'

Methods:

Name Description
get_col_value

Get the value of the column in the given row derived from col_type,

get_field_group_col

Given a field name or field col name, e.g., an annotation type col's

get_field_name

Given a field name or field col name, e.g., an annotation type col's

get_field_record_col

Given a field name or field col name, e.g., an annotation type col's

get_field_type_col

Given a field name or field col name, e.g., an annotation type col's

unpack_field

Given a field in any of its derivatives (like field type, field group

Source code in packages/xization/src/dataknobs_xization/authorities.py
def __init__(
    self,
    field_type_suffix: str = "_field",
    field_group_suffix: str = "_num",
    field_record_suffix: str = "_recsnum",
):
    """Add derived column types/names: Given an annnotation row,
      * field_type(row) == f'{row[ann_type_col]}_field'
      * field_group(row) == f'{row[ann_type_col]}_num'
      * field_record(row) == f'{row[ann_type_col])_recsnum'

    Where:
      * A field_type column holds annotation "sub"- type values, or fields
      * A field_group column identifies groups of annotation fields
      * A field_record column identifies groups of annotation field groups

    Args:
        field_type_suffix: The field_type col name suffix (if not _field).
        field_group_suffix: The field_group col name suffix (if not _num).
        field_record_suffix: field_record colname sfx (if not _recsnum).
    """
    self.field_type_suffix = field_type_suffix
    self.field_group_suffix = field_group_suffix
    self.field_record_suffix = field_record_suffix
Functions
get_col_value
get_col_value(
    metadata: AnnotationsMetaData,
    col_type: str,
    row: Series,
    missing: str = None,
) -> str

Get the value of the column in the given row derived from col_type, where col_type is one of:
  • "field_type" == f"{field}_field"
  • "field_group" == f"{field}_num"
  • "field_record" == f"{field}_recsnum"

And "field" is the row_accessor's metadata's "ann_type" col's value.

Parameters:

Name Type Description Default
metadata AnnotationsMetaData

The AnnotationsMetaData.

required
col_type str

The type of column value to derive.

required
row Series

A row from which to get the value.

required
missing str

The value to return for unknown or missing column.

None

Returns:

Type Description
str

The row value or the missing value.

Source code in packages/xization/src/dataknobs_xization/authorities.py
def get_col_value(
    self,
    metadata: dk_annots.AnnotationsMetaData,
    col_type: str,
    row: pd.Series,
    missing: str = None,
) -> str:
    """Get the value of the column in the given row derived from col_type,
    where col_type is one of:
      * "field_type" == f"{field}_field"
      * "field_group" == f"{field}_num"
      * "field_record" == f"{field}_recsnum"

    And "field" is the row_accessor's metadata's "ann_type" col's value.

    Args:
        metadata: The AnnotationsMetaData.
        col_type: The type of column value to derive.
        row: A row from which to get the value.
        missing: The value to return for unknown or missing column.

    Returns:
        The row value or the missing value.
    """
    value = missing
    if metadata.ann_type_col in row.index:
        field = row[metadata.ann_type_col]
        if field is not None:
            if col_type == "field_type":
                col_name = self.get_field_type_col(field)
            elif col_type == "field_group":
                col_name = self.get_field_group_col(field)
            elif col_type == "field_record":
                col_name = self.get_field_record_col(field)
            if col_name is not None and col_name in row.index:
                value = row[col_name]
    return value
get_field_group_col
get_field_group_col(field_value: str) -> str

Given a field name or field col name, e.g., an annotation type col's value; or a field type, group, or record, get the name of the derived field group column.

Source code in packages/xization/src/dataknobs_xization/authorities.py
def get_field_group_col(self, field_value: str) -> str:
    """Given a field name or field col name, e.g., an annotation type col's
    value; or a field type, group, or record, get the name of the derived
    field group column.
    """
    field = self.unpack_field(field_value)
    return f"{field}{self.field_group_suffix}"
get_field_name
get_field_name(field_value: str) -> str

Given a field name or field col name, e.g., an annotation type col's value (the field name); or a field type, group, or record column name, get the field name.

Source code in packages/xization/src/dataknobs_xization/authorities.py
def get_field_name(self, field_value: str) -> str:
    """Given a field name or field col name, e.g., an annotation type col's
    value (the field name); or a field type, group, or record column name,
    get the field name.
    """
    return self.unpack_field(field_value)
get_field_record_col
get_field_record_col(field_value: str) -> str

Given a field name or field col name, e.g., an annotation type col's value; or a field type, group, or record, get the name of the derived field record column.

Source code in packages/xization/src/dataknobs_xization/authorities.py
def get_field_record_col(self, field_value: str) -> str:
    """Given a field name or field col name, e.g., an annotation type col's
    value; or a field type, group, or record, get the name of the derived
    field record column.
    """
    field = self.unpack_field(field_value)
    return f"{field}{self.field_record_suffix}"
get_field_type_col
get_field_type_col(field_value: str) -> str

Given a field name or field col name, e.g., an annotation type col's value; or a field type, group, or record column name, get the name of the derived field type column.

Source code in packages/xization/src/dataknobs_xization/authorities.py
def get_field_type_col(self, field_value: str) -> str:
    """Given a field name or field col name, e.g., an annotation type col's
    value; or a field type, group, or record column name, get the field
    name.
    """
    field = self.unpack_field(field_value)
    return f"{field}{self.field_type_suffix}"
unpack_field
unpack_field(field_value: str) -> str

Given a field in any of its derivatives (like field type, field group, or field record), unpack and return the basic field value itself.

Source code in packages/xization/src/dataknobs_xization/authorities.py
def unpack_field(self, field_value: str) -> str:
    """Given a field in any of its derivatives (like field type, field group
    or field record,) unpack and return the basic field value itself.
    """
    field = field_value
    if field.endswith(self.field_record_suffix):
        field = field.replace(self.field_record_suffix, "")
    elif field.endswith(self.field_group_suffix):
        field = field.replace(self.field_group_suffix, "")
    elif field.endswith(self.field_type_suffix):
        field = field.replace(self.field_type_suffix, "")
    return field
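The naming scheme can be exercised with a self-contained replica using the default suffixes shown above. The "date" field is a hypothetical example; the suffix-stripping below is behaviorally equivalent to the unpack_field listing (checking the most specific suffix first).

```python
# Default suffixes from DerivedFieldGroups.
FIELD_TYPE_SUFFIX = "_field"
FIELD_GROUP_SUFFIX = "_num"
FIELD_RECORD_SUFFIX = "_recsnum"

def unpack_field(field_value: str) -> str:
    # Strip one derived suffix (record checked before group before type)
    # to recover the base field name.
    for suffix in (FIELD_RECORD_SUFFIX, FIELD_GROUP_SUFFIX, FIELD_TYPE_SUFFIX):
        if field_value.endswith(suffix):
            return field_value[: -len(suffix)]
    return field_value

def field_type_col(field: str) -> str:
    return f"{unpack_field(field)}{FIELD_TYPE_SUFFIX}"

def field_group_col(field: str) -> str:
    return f"{unpack_field(field)}{FIELD_GROUP_SUFFIX}"

def field_record_col(field: str) -> str:
    return f"{unpack_field(field)}{FIELD_RECORD_SUFFIX}"

# An annotation of type "date" yields these derived columns:
print(field_type_col("date"))        # date_field
print(field_group_col("date"))       # date_num
print(field_record_col("date"))      # date_recsnum
print(unpack_field("date_recsnum"))  # round-trips back to "date"
```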

LexicalAuthority

LexicalAuthority(
    name: str,
    auth_anns_builder: AuthorityAnnotationsBuilder = None,
    authdata: AuthorityData = None,
    field_groups: DerivedFieldGroups = None,
    anns_validator: Callable[[Authority, Dict[str, Any]], bool] = None,
    parent_auth: Authority = None,
)

Bases: Authority

A class for managing named entities by ID with associated values and variations.

Initialize with this authority's metadata.

Parameters:

Name Type Description Default
name str

This authority's entity name.

required
auth_anns_builder AuthorityAnnotationsBuilder

The authority annotations row builder to use for building annotation rows.

None
authdata AuthorityData

The authority data.

None
field_groups DerivedFieldGroups

The derived field groups to use.

None
anns_validator Callable[[Authority, Dict[str, Any]], bool]

fn(auth, anns_dict_list) that returns True if the list of annotation row dicts are valid to be added as annotations for a single match or "entity".

None
parent_auth Authority

This authority's parent authority (if any).

None

Methods:

Name Description
find_variations

Find all matches to the given variation.

get_id_by_variation

Get the IDs of the value(s) associated with the given variation.

get_value_ids

Get all IDs associated with the given value. Note that typically

get_values_by_id

Get all values for the associated value ID. Note that typically

Source code in packages/xization/src/dataknobs_xization/authorities.py
def __init__(
    self,
    name: str,
    auth_anns_builder: AuthorityAnnotationsBuilder = None,
    authdata: AuthorityData = None,
    field_groups: DerivedFieldGroups = None,
    anns_validator: Callable[["Authority", Dict[str, Any]], bool] = None,
    parent_auth: "Authority" = None,
):
    """Initialize with this authority's metadata.

    Args:
        name: This authority's entity name.
        auth_anns_builder: The authority annotations row builder to use
            for building annotation rows.
        authdata: The authority data.
        field_groups: The derived field groups to use.
        anns_validator: fn(auth, anns_dict_list) that returns True if
            the list of annotation row dicts are valid to be added as
            annotations for a single match or "entity".
        parent_auth: This authority's parent authority (if any).
    """
    super().__init__(
        name,
        auth_anns_builder=auth_anns_builder,
        authdata=authdata,
        field_groups=field_groups,
        anns_validator=anns_validator,
        parent_auth=parent_auth,
    )
Functions
find_variations abstractmethod
find_variations(
    variation: str,
    starts_with: bool = False,
    ends_with: bool = False,
    scope: str = "fullmatch",
) -> pd.Series

Find all matches to the given variation.

Note

Only the first true of starts_with, ends_with, and scope will be applied. If none of these are true, a full match on the pattern is performed.

Parameters:

Name Type Description Default
variation str

The text to find; treated as a regular expression unless either starts_with or ends_with is True.

required
starts_with bool

When True, find all terms that start with the variation text.

False
ends_with bool

When True, find all terms that end with the variation text.

False
scope str

'fullmatch' (default), 'match', or 'contains' for strict, less strict, and least strict matching.

'fullmatch'

Returns:

Type Description
Series

The matching variations as a pd.Series.

Source code in packages/xization/src/dataknobs_xization/authorities.py
@abstractmethod
def find_variations(
    self,
    variation: str,
    starts_with: bool = False,
    ends_with: bool = False,
    scope: str = "fullmatch",
) -> pd.Series:
    """Find all matches to the given variation.

    Note:
        Only the first true of starts_with, ends_with, and scope will
        be applied. If none of these are true, a full match on the pattern
        is performed.

    Args:
        variation: The text to find; treated as a regular expression
            unless either starts_with or ends_with is True.
        starts_with: When True, find all terms that start with the
            variation text.
        ends_with: When True, find all terms that end with the variation
            text.
        scope: 'fullmatch' (default), 'match', or 'contains' for
            strict, less strict, and least strict matching.

    Returns:
        The matching variations as a pd.Series.
    """
    raise NotImplementedError
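The three scope levels, ordered from strictest to least strict, can be sketched with their stdlib re analogues (a reasonable reading of the semantics, not the package's implementation): 'fullmatch' must cover the whole term, 'match' must cover its start, and 'contains' may match anywhere.

```python
import re

terms = ["red", "redwood", "infrared"]  # hypothetical variation terms
pattern = "red"

fullmatch = [t for t in terms if re.fullmatch(pattern, t)]  # whole term
match     = [t for t in terms if re.match(pattern, t)]      # term prefix
contains  = [t for t in terms if re.search(pattern, t)]     # anywhere

print(fullmatch)  # ['red']
print(match)      # ['red', 'redwood']
print(contains)   # ['red', 'redwood', 'infrared']
```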
get_id_by_variation abstractmethod
get_id_by_variation(variation: str) -> Set[str]

Get the IDs of the value(s) associated with the given variation.

Parameters:

Name Type Description Default
variation str

Variation text.

required

Returns:

Type Description
Set[str]

The possibly empty set of associated value IDS.

Source code in packages/xization/src/dataknobs_xization/authorities.py
@abstractmethod
def get_id_by_variation(self, variation: str) -> Set[str]:
    """Get the IDs of the value(s) associated with the given variation.

    Args:
        variation: Variation text.

    Returns:
        The possibly empty set of associated value IDS.
    """
    raise NotImplementedError
get_value_ids abstractmethod
get_value_ids(value: Any) -> Set[Any]

Get all IDs associated with the given value. Note that typically there is a single ID for any value, but this allows for inherent ambiguities in the authority.

Parameters:

Name Type Description Default
value Any

An authority value.

required

Returns:

Type Description
Set[Any]

The associated IDs or an empty set if the value is not valid.

Source code in packages/xization/src/dataknobs_xization/authorities.py
@abstractmethod
def get_value_ids(self, value: Any) -> Set[Any]:
    """Get all IDs associated with the given value. Note that typically
    there is a single ID for any value, but this allows for inherent
    ambiguities in the authority.

    Args:
        value: An authority value.

    Returns:
        The associated IDs or an empty set if the value is not valid.
    """
    raise NotImplementedError
get_values_by_id abstractmethod
get_values_by_id(value_id: Any) -> Set[Any]

Get all values for the associated value ID. Note that typically there is a single value for an ID, but this allows for inherent ambiguities in the authority.

Parameters:

Name Type Description Default
value_id Any

An authority value ID.

required

Returns:

Type Description
Set[Any]

The associated values or an empty set if the value is not valid.

Source code in packages/xization/src/dataknobs_xization/authorities.py
@abstractmethod
def get_values_by_id(self, value_id: Any) -> Set[Any]:
    """Get all values for the associated value ID. Note that typically
    there is a single value for an ID, but this allows for inherent
    ambiguities in the authority.

    Args:
        value_id: An authority value ID.

    Returns:
        The associated values or an empty set if the value is not valid.
    """
    raise NotImplementedError
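The ID/value lookups above are deliberately set-valued to allow for ambiguity in an authority. A minimal in-memory sketch (a hypothetical `ToyAuthority`, not part of the package) illustrates the contract:

```python
from typing import Any, Dict, Set

class ToyAuthority:
    """Toy stand-in for the abstract ID/value lookups.

    Shows why both methods return sets: an ambiguous value
    ("mercury" below) can map to more than one ID.
    """

    def __init__(self, id_to_value: Dict[Any, Any]):
        self._id_to_value = dict(id_to_value)

    def get_value_ids(self, value: Any) -> Set[Any]:
        # All IDs whose value matches; empty set if the value is unknown.
        return {i for i, v in self._id_to_value.items() if v == value}

    def get_values_by_id(self, value_id: Any) -> Set[Any]:
        # Wrapped in a set for symmetry with get_value_ids.
        if value_id in self._id_to_value:
            return {self._id_to_value[value_id]}
        return set()

auth = ToyAuthority({1: "mercury", 2: "mercury", 3: "venus"})
```

Here `auth.get_value_ids("mercury")` yields both IDs, while an unknown value yields the empty set rather than raising.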

RegexAuthority

RegexAuthority(
    name: str,
    regex: Pattern,
    canonical_fn: Callable[[str, str], Any] = None,
    auth_anns_builder: AuthorityAnnotationsBuilder = None,
    authdata: AuthorityData = None,
    field_groups: DerivedFieldGroups = None,
    anns_validator: Callable[[Authority, Dict[str, Any]], bool] = None,
    parent_auth: Authority = None,
)

Bases: Authority

A class for managing named entities by ID with associated values and variations.

Initialize with this authority's entity name.

Note

If the regular expression has capturing groups, each group will result in a separate entity, with the group name if provided in the regular expression as ...(?P&lt;group_name&gt;group_regex)...

Parameters:

Name Type Description Default
name str

The authority name.

required
regex Pattern

The regular expression to apply.

required
canonical_fn Callable[[str, str], Any]

A function, fn(match_text, group_name), to transform input matches to a canonical form used as a value_id. group_name will be None and the full match text will be passed in if there are no group names. Note that the canonical form is computed before the match_validator is applied and its value will be found as the value of the &lt;auth_id&gt; key.

None
auth_anns_builder AuthorityAnnotationsBuilder

The authority annotations row builder to use for building annotation rows.

None
authdata AuthorityData

The authority data.

None
field_groups DerivedFieldGroups

The derived field groups to use.

None
anns_validator Callable[[Authority, Dict[str, Any]], bool]

A validation function for each regex match, formed as a list of annotation row dictionaries, one row dictionary for each matching regex group. If the validator returns False, then the annotation rows will be rejected. The entity_text key will hold the matched text and the &lt;auth_name&gt;_field key will hold the group name or number (if there are groups, with or without names) or the &lt;auth_name&gt; if there are no groups in the regular expression. Note that the validator function takes the regex authority instance as its first parameter to provide access to the field_groups, etc. The validation_fn signature is fn(regexAuthority, ann_row_dicts) and returns a boolean.

None
parent_auth Authority

This authority's parent authority (if any).

None

Methods:

Name Description
add_annotations

Method to do the work of finding, validating, and adding annotations.

has_value

Determine whether the given value is in this authority.

Source code in packages/xization/src/dataknobs_xization/authorities.py
def __init__(
    self,
    name: str,
    regex: re.Pattern,
    canonical_fn: Callable[[str, str], Any] = None,
    auth_anns_builder: AuthorityAnnotationsBuilder = None,
    authdata: AuthorityData = None,
    field_groups: DerivedFieldGroups = None,
    anns_validator: Callable[[Authority, Dict[str, Any]], bool] = None,
    parent_auth: "Authority" = None,
):
    """Initialize with this authority's entity name.

    Note:
        If the regular expression has capturing groups, each group
        will result in a separate entity, with the group name if provided
        in the regular expression as ...(?P<group_name>group_regex)...

    Args:
        name: The authority name.
        regex: The regular expression to apply.
        canonical_fn: A function, fn(match_text, group_name), to
            transform input matches to a canonical form as a value_id.
            Where group_name will be None and the full match text will be
            passed in if there are no group names. Note that the canonical form
            is computed before the match_validator is applied and its value
            will be found as the value to the <auth_id> key.
        auth_anns_builder: The authority annotations row builder to use
            for building annotation rows.
        authdata: The authority data.
        field_groups: The derived field groups to use.
        anns_validator: A validation function for each regex match
            formed as a list of annotation row dictionaries, one row dictionary
            for each matching regex group. If the validator returns False,
            then the annotation rows will be rejected. The entity_text key
            will hold matched text and the <auth_name>_field key will hold
            the group name or number (if there are groups with or without names)
            or the <auth_name> if there are no groups in the regular expression.
            Note that the validator function takes the regex authority instance
            as its first parameter to provide access to the field_groups, etc.
            The validation_fn signature is: fn(regexAuthority, ann_row_dicts)
            and returns a boolean.
        parent_auth: This authority's parent authority (if any).
    """
    super().__init__(
        name,
        auth_anns_builder=auth_anns_builder,
        authdata=authdata,
        field_groups=field_groups,
        anns_validator=anns_validator,
        parent_auth=parent_auth,
    )
    self.regex = regex
    self.canonical_fn = canonical_fn
Functions
add_annotations
add_annotations(text_obj: AnnotatedText) -> dk_annots.Annotations

Method to do the work of finding, validating, and adding annotations.

Parameters:

Name Type Description Default
text_obj AnnotatedText

The annotated text object to process and add annotations.

required

Returns:

Type Description
Annotations

The added Annotations.

Source code in packages/xization/src/dataknobs_xization/authorities.py
def add_annotations(
    self,
    text_obj: dk_annots.AnnotatedText,
) -> dk_annots.Annotations:
    """Method to do the work of finding, validating, and adding annotations.

    Args:
        text_obj: The annotated text object to process and add annotations.

    Returns:
        The added Annotations.
    """
    for match in re.finditer(self.regex, text_obj.text):
        ann_dicts = []
        if match.lastindex is not None:
            if len(self.regex.groupindex) > 0:  # we have named groups
                for group_name, group_num in self.regex.groupindex.items():
                    group_text = match.group(group_num)
                    kwargs = {self.field_groups.get_field_type_col(self.name): group_name}
                    ann_dicts.append(
                        self.build_annotation(
                            start_pos=match.start(group_name),
                            end_pos=match.end(group_name),
                            entity_text=group_text,
                            auth_value_id=self.get_canonical_form(group_text, group_name),
                            **kwargs,
                        )
                    )
            else:  # we have only numbers for groups
                for group_num, group_text in enumerate(match.groups()):
                    group_num += 1
                    kwargs = {self.field_groups.get_field_type_col(self.name): group_num}
                    ann_dicts.append(
                        self.build_annotation(
                            start_pos=match.start(group_num),
                            end_pos=match.end(group_num),
                            entity_text=group_text,
                            auth_value_id=self.get_canonical_form(group_text, group_num),
                            **kwargs,
                        )
                    )
        else:  # we have no groups
            ann_dicts.append(
                self.build_annotation(
                    start_pos=match.start(),
                    end_pos=match.end(),
                    entity_text=match.group(),
                    auth_value_id=self.get_canonical_form(match.group(), self.name),
                )
            )
        if self.validate_ann_dicts(ann_dicts):
            # Add non-empty, valid annotation dicts to the result
            text_obj.annotations.add_dicts(ann_dicts)
    return text_obj.annotations
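The branching in add_annotations above (named groups, numbered groups, or no groups at all) follows directly from `re` semantics. The following standalone sketch (a hypothetical `group_spans` helper, not part of the package) mirrors that walk without the annotation machinery:

```python
import re

def group_spans(regex: re.Pattern, text: str):
    """Sketch of how add_annotations walks matches: named groups when
    groupindex is populated, numbered groups otherwise, or the whole
    match when the pattern has no groups at all."""
    rows = []
    for match in re.finditer(regex, text):
        if match.lastindex is not None:
            if regex.groupindex:  # we have named groups
                for name in regex.groupindex:
                    rows.append((name, match.group(name)))
            else:  # numbered groups only
                for num, part in enumerate(match.groups(), start=1):
                    rows.append((num, part))
        else:  # no groups: the full match stands alone
            rows.append((None, match.group()))
    return rows

named = re.compile(r"(?P<day>\d{1,2})/(?P<month>\d{1,2})")
```

For example, `group_spans(named, "due 3/12")` produces one row per named group, keyed by the group name, just as each group becomes a separate annotation row above.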
has_value
has_value(value: Any) -> re.Match

Determine whether the given value is in this authority.

Parameters:

Name Type Description Default
value Any

A possible authority value.

required

Returns:

Type Description
Match

None if the value is not a valid entity value; otherwise,

Match

return the re.Match object.

Source code in packages/xization/src/dataknobs_xization/authorities.py
def has_value(self, value: Any) -> re.Match:
    """Determine whether the given value is in this authority.

    Args:
        value: A possible authority value.

    Returns:
        None if the value is not a valid entity value; otherwise,
        return the re.Match object.
    """
    return self.regex.match(str(value))

lexicon

Functions and Classes

dataknobs_xization.lexicon

Lexical matching and token alignment for text processing.

Provides classes for lexical expansion, normalization, token alignment, and pattern matching in text with support for variations and fuzzy matching.

Classes:

Name Description
CorrelatedAuthorityData

Container for authoritative data containing correlated data for multiple "sub" authorities.

DataframeAuthority

A pandas dataframe-based lexical authority.

LexicalExpander

A class to expand and/or normalize original lexical input terms, to keep back-references from generated data to corresponding original input, and to build consistent tokens for lexical matching.

MultiAuthorityData

Container for authoritative data containing correlated data for multiple "sub" authorities composed of explicit data for each component.

MultiAuthorityFactory

A factory for building a "sub" authority directly or indirectly.

SimpleMultiAuthorityData

Data class for pulling a single column from the multi-authority data

TokenAligner

Aligns tokens with a lexical authority to generate annotations.

TokenMatch

Represents a match between tokens and a lexical authority variation.

Classes

CorrelatedAuthorityData

CorrelatedAuthorityData(df: DataFrame, name: str)

Bases: AuthorityData

Container for authoritative data containing correlated data for multiple "sub" authorities.

Methods:

Name Description
auth_records_mask

Get a series identifying records in the full authority matching the given records.

auth_values_mask

Identify full-authority data corresponding to this sub-value.

combine_masks

Combine the masks if possible, returning the valid combination or None.

get_auth_records

Get the authority records identified by the mask.

sub_authority_names

Get the "sub" authority names.

Source code in packages/xization/src/dataknobs_xization/lexicon.py
def __init__(self, df: pd.DataFrame, name: str):
    super().__init__(df, name)
    self._authority_data = {}
Functions
auth_records_mask abstractmethod
auth_records_mask(
    record_value_ids: Dict[str, int], filter_mask: Series = None
) -> pd.Series

Get a series identifying records in the full authority matching the given records of the form {&lt;sub-name&gt;: &lt;sub-value-id&gt;}.

Parameters:

Name Type Description Default
record_value_ids Dict[str, int]

The dict of field names to value_ids.

required
filter_mask Series

A pre-filter limiting records to consider and/or building records incrementally.

None

Returns:

Type Description
Series

A series identifying where all fields exist.

Source code in packages/xization/src/dataknobs_xization/lexicon.py
@abstractmethod
def auth_records_mask(
    self,
    record_value_ids: Dict[str, int],
    filter_mask: pd.Series = None,
) -> pd.Series:
    """Get a series identifying records in the full authority matching
    the given records of the form {<sub-name>: <sub-value-id>}.

    Args:
        record_value_ids: The dict of field names to value_ids.
        filter_mask: A pre-filter limiting records to consider and/or
            building records incrementally.

    Returns:
        A series identifying where all fields exist.
    """
    raise NotImplementedError
auth_values_mask abstractmethod
auth_values_mask(name: str, value_id: int) -> pd.Series

Identify full-authority data corresponding to this sub-value.

Parameters:

Name Type Description Default
name str

The sub-authority name.

required
value_id int

The sub-authority value_id.

required

Returns:

Type Description
Series

A series representing relevant full-authority data.

Source code in packages/xization/src/dataknobs_xization/lexicon.py
@abstractmethod
def auth_values_mask(self, name: str, value_id: int) -> pd.Series:
    """Identify full-authority data corresponding to this sub-value.

    Args:
        name: The sub-authority name.
        value_id: The sub-authority value_id.

    Returns:
        A series representing relevant full-authority data.
    """
    raise NotImplementedError
combine_masks abstractmethod
combine_masks(mask1: Series, mask2: Series) -> pd.Series

Combine the masks if possible, returning the valid combination or None.

Parameters:

Name Type Description Default
mask1 Series

An auth_records_mask consistent with this data.

required
mask2 Series

Another data auth_records_mask.

required

Returns:

Type Description
Series

The combined consistent records_mask or None.

Source code in packages/xization/src/dataknobs_xization/lexicon.py
@abstractmethod
def combine_masks(self, mask1: pd.Series, mask2: pd.Series) -> pd.Series:
    """Combine the masks if possible, returning the valid combination or None.

    Args:
        mask1: An auth_records_mask consistent with this data.
        mask2: Another data auth_records_mask.

    Returns:
        The combined consistent records_mask or None.
    """
    raise NotImplementedError
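The combine_masks contract can be sketched without pandas. In this minimal illustration (plain boolean lists stand in for the pd.Series masks the real implementations operate on), combining means intersecting, and an empty intersection signals an invalid combination by returning None:

```python
def combine_masks(mask1, mask2):
    """Sketch of the combine_masks contract with boolean lists standing
    in for index-aligned pandas Series: intersect the two masks and
    return None when no record survives, signalling that the masks are
    not a valid combination."""
    combined = [a and b for a, b in zip(mask1, mask2)]
    return combined if any(combined) else None
```

A concrete implementation would also need to handle Series alignment and dtype concerns that this list-based sketch ignores.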
get_auth_records abstractmethod
get_auth_records(records_mask: Series) -> pd.DataFrame

Get the authority records identified by the mask.

Parameters:

Name Type Description Default
records_mask Series

A series identifying records in the full data.

required

Returns:

Type Description
DataFrame

The records for which the mask is True.

Source code in packages/xization/src/dataknobs_xization/lexicon.py
@abstractmethod
def get_auth_records(self, records_mask: pd.Series) -> pd.DataFrame:
    """Get the authority records identified by the mask.

    Args:
        records_mask: A series identifying records in the full data.

    Returns:
        The records for which the mask is True.
    """
    raise NotImplementedError
sub_authority_names
sub_authority_names() -> List[str]

Get the "sub" authority names.

Source code in packages/xization/src/dataknobs_xization/lexicon.py
def sub_authority_names(self) -> List[str]:
    """Get the "sub" authority names."""
    return None

DataframeAuthority

DataframeAuthority(
    name: str,
    lexical_expander: LexicalExpander,
    authdata: AuthorityData,
    auth_anns_builder: AuthorityAnnotationsBuilder = None,
    field_groups: DerivedFieldGroups = None,
    anns_validator: Callable[[Authority, Dict[str, Any]], bool] = None,
    parent_auth: Authority = None,
)

Bases: LexicalAuthority

A pandas dataframe-based lexical authority.

Initialize with the name, values, and associated ids of the authority; and with the lexical expander for authoritative values.

Parameters:

Name Type Description Default
name str

The authority name, if different from df.columns[0].

required
lexical_expander LexicalExpander

The lexical expander for the values.

required
authdata AuthorityData

The data for this authority.

required
auth_anns_builder AuthorityAnnotationsBuilder

The authority annotations row builder to use for building annotation rows.

None
field_groups DerivedFieldGroups

The derived field groups to use.

None
anns_validator Callable[[Authority, Dict[str, Any]], bool]

fn(auth, anns_dict_list) that returns True if the list of annotation row dicts are valid to be added as annotations for a single match or "entity".

None
parent_auth Authority

This authority's parent authority (if any).

None

Methods:

Name Description
add_annotations

Method to do the work of finding, validating, and adding annotations.

find_variations

Find all matches to the given variation.

get_id_by_variation

Get the IDs of the value(s) associated with the given variation.

get_value_ids

Get all IDs associated with the given value.

get_values_by_id

Get all values for the associated value ID.

get_variations

Convenience method to compute variations for the value.

get_variations_df

Create a DataFrame including associated ids for each variation.

has_value

Determine whether the given value is in this authority.

Attributes:

Name Type Description
prev_aligner TokenAligner

Get the token aligner created in the latest call to annotate_text.

variations Series

Get all lexical variations in a series whose index has associated value IDs.

Source code in packages/xization/src/dataknobs_xization/lexicon.py
def __init__(
    self,
    name: str,
    lexical_expander: LexicalExpander,
    authdata: dk_auth.AuthorityData,
    auth_anns_builder: dk_auth.AuthorityAnnotationsBuilder = None,
    field_groups: dk_auth.DerivedFieldGroups = None,
    anns_validator: Callable[[dk_auth.Authority, Dict[str, Any]], bool] = None,
    parent_auth: dk_auth.Authority = None,
):
    """Initialize with the name, values, and associated ids of the authority;
    and with the lexical expander for authoritative values.

    Args:
        name: The authority name, if different from df.columns[0].
        lexical_expander: The lexical expander for the values.
        authdata: The data for this authority.
        auth_anns_builder: The authority annotations row builder to use
            for building annotation rows.
        field_groups: The derived field groups to use.
        anns_validator: fn(auth, anns_dict_list) that returns True if
            the list of annotation row dicts are valid to be added as
            annotations for a single match or "entity".
        parent_auth: This authority's parent authority (if any).
    """
    super().__init__(
        name if name else authdata.df.columns[0],
        auth_anns_builder=auth_anns_builder,
        authdata=authdata,
        field_groups=field_groups,
        anns_validator=anns_validator,
        parent_auth=parent_auth,
    )
    self.lexical_expander = lexical_expander
    self._variations = None
    self._prev_aligner = None
Attributes
prev_aligner property
prev_aligner: TokenAligner

Get the token aligner created in the latest call to annotate_text.

variations property
variations: Series

Get all lexical variations in a series whose index has associated value IDs.

Returns:

Type Description
Series

A pandas series with index-identified variations.

Functions
add_annotations
add_annotations(doctext: Text, annotations: Annotations) -> dk_anns.Annotations

Method to do the work of finding, validating, and adding annotations.

Parameters:

Name Type Description Default
doctext Text

The text to process.

required
annotations Annotations

The annotations object to add annotations to.

required

Returns:

Type Description
Annotations

The given or a new Annotations instance.

Source code in packages/xization/src/dataknobs_xization/lexicon.py
def add_annotations(
    self,
    doctext: dk_doc.Text,
    annotations: dk_anns.Annotations,
) -> dk_anns.Annotations:
    """Method to do the work of finding, validating, and adding annotations.

    Args:
        doctext: The text to process.
        annotations: The annotations object to add annotations to.

    Returns:
        The given or a new Annotations instance.
    """
    first_token = self.lexical_expander.build_first_token(
        doctext.text, input_id=doctext.text_id
    )
    token_aligner = TokenAligner(first_token, self)
    self._prev_aligner = token_aligner
    if self.validate_ann_dicts(token_aligner.annotations):
        annotations.add_dicts(token_aligner.annotations)
    return annotations
find_variations
find_variations(
    variation: str,
    starts_with: bool = False,
    ends_with: bool = False,
    scope: str = "fullmatch",
) -> pd.Series

Find all matches to the given variation.

Note

Only the first true of starts_with, ends_with, and scope will be applied. If none of these are true, a full match on the pattern is performed.

Parameters:

Name Type Description Default
variation str

The text to find; treated as a regular expression unless either starts_with or ends_with is True.

required
starts_with bool

When True, find all terms that start with the variation text.

False
ends_with bool

When True, find all terms that end with the variation text.

False
scope str

'fullmatch' (default), 'match', or 'contains' for strict, less strict, and least strict matching.

'fullmatch'

Returns:

Type Description
Series

The matching variations as a pd.Series.

Source code in packages/xization/src/dataknobs_xization/lexicon.py
def find_variations(
    self,
    variation: str,
    starts_with: bool = False,
    ends_with: bool = False,
    scope: str = "fullmatch",
) -> pd.Series:
    """Find all matches to the given variation.

    Note:
        Only the first true of starts_with, ends_with, and scope will
        be applied. If none of these are true, a full match on the pattern
        is performed.

    Args:
        variation: The text to find; treated as a regular expression
            unless either starts_with or ends_with is True.
        starts_with: When True, find all terms that start with the
            variation text.
        ends_with: When True, find all terms that end with the variation
            text.
        scope: 'fullmatch' (default), 'match', or 'contains' for
            strict, less strict, and least strict matching.

    Returns:
        The matching variations as a pd.Series.
    """
    vs = self.variations
    if starts_with:
        vs = vs[vs.str.startswith(variation)]
    elif ends_with:
        vs = vs[vs.str.endswith(variation)]
    else:
        if scope == "fullmatch":
            hits = vs.str.fullmatch(variation)
        elif scope == "match":
            hits = vs.str.match(variation)
        else:
            hits = vs.str.contains(variation)
        vs = vs[hits]
    vs = vs.drop_duplicates()
    return vs
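The three scope values of find_variations map onto the standard strictness ladder of regex matching: fullmatch anchors both ends, match anchors only the start, and contains matches anywhere. A stdlib-only sketch (a hypothetical `hits` helper over a plain list, rather than a pandas Series) makes the ordering concrete:

```python
import re

def hits(pattern: str, terms, scope: str = "fullmatch"):
    """Mirror the scope semantics of find_variations on a plain list:
    'fullmatch' anchors both ends, 'match' anchors the start only,
    and 'contains' matches anywhere (re.search)."""
    checks = {
        "fullmatch": re.fullmatch,
        "match": re.match,
        "contains": re.search,
    }
    check = checks[scope]
    return [t for t in terms if check(pattern, t)]

terms = ["red wine", "red", "bored"]
```

With pattern "red", fullmatch keeps only "red", match also keeps "red wine", and contains keeps all three, which is exactly the strict-to-least-strict ordering the scope argument describes.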
get_id_by_variation
get_id_by_variation(variation: str) -> Set[str]

Get the IDs of the value(s) associated with the given variation.

Parameters:

Name Type Description Default
variation str

Variation text.

required

Returns:

Type Description
Set[str]

The possibly empty set of associated value IDs.

Source code in packages/xization/src/dataknobs_xization/lexicon.py
def get_id_by_variation(self, variation: str) -> Set[str]:
    """Get the IDs of the value(s) associated with the given variation.

    Args:
        variation: Variation text.

    Returns:
        The possibly empty set of associated value IDs.
    """
    ids = set()
    for value in self.lexical_expander.get_terms(variation):
        ids.update(self.get_value_ids(value))
    return ids
get_value_ids
get_value_ids(value: Any) -> Set[Any]

Get all IDs associated with the given value. Note that typically there is a single ID for any value, but this allows for inherent ambiguities in the authority.

Parameters:

Name Type Description Default
value Any

An authority value.

required

Returns:

Type Description
Set[Any]

The associated IDs or an empty set if the value is not valid.

Source code in packages/xization/src/dataknobs_xization/lexicon.py
def get_value_ids(self, value: Any) -> Set[Any]:
    """Get all IDs associated with the given value. Note that typically
    there is a single ID for any value, but this allows for inherent
    ambiguities in the authority.

    Args:
        value: An authority value.

    Returns:
        The associated IDs or an empty set if the value is not valid.
    """
    return set(self.authdata.lookup_values(value).index.tolist())
get_values_by_id
get_values_by_id(value_id: Any) -> Set[Any]

Get all values for the associated value ID. Note that typically there is a single value for an ID, but this allows for inherent ambiguities in the authority.

Parameters:

Name Type Description Default
value_id Any

An authority value ID.

required

Returns:

Type Description
Set[Any]

The associated values or an empty set if the value ID is not valid.

Source code in packages/xization/src/dataknobs_xization/lexicon.py
def get_values_by_id(self, value_id: Any) -> Set[Any]:
    """Get all values for the associated value ID. Note that typically
    there is a single value for an ID, but this allows for inherent
    ambiguities in the authority.

    Args:
        value_id: An authority value ID.

    Returns:
        The associated values or an empty set if the value ID is not valid.
    """
    return set(self.authdata.lookup_values(value_id, is_id=True)[self.name].tolist())
get_variations
get_variations(value: Any, normalize: bool = True) -> Set[Any]

Convenience method to compute variations for the value.

Parameters:

Name Type Description Default
value Any

The authority value, or term, whose variations to compute.

required
normalize bool

True to normalize the variations.

True

Returns:

Type Description
Set[Any]

The set of variations for the value.

Source code in packages/xization/src/dataknobs_xization/lexicon.py
def get_variations(self, value: Any, normalize: bool = True) -> Set[Any]:
    """Convenience method to compute variations for the value.

    Args:
        value: The authority value, or term, whose variations to compute.
        normalize: True to normalize the variations.

    Returns:
        The set of variations for the value.
    """
    return self.lexical_expander(value, normalize=normalize)
get_variations_df
get_variations_df(
    variations: Series,
    variations_colname: str = "variation",
    ids_colname: str = None,
    lookup_values: bool = False,
) -> pd.DataFrame

Create a DataFrame including associated ids for each variation.

Parameters:

Name Type Description Default
variations Series

The variations to include in the dataframe.

required
variations_colname str

The name of the variations column.

'variation'
ids_colname str

The column name for value ids.

None
lookup_values bool

When True, include a self.name column with associated values.

False
Source code in packages/xization/src/dataknobs_xization/lexicon.py
def get_variations_df(
    self,
    variations: pd.Series,
    variations_colname: str = "variation",
    ids_colname: str = None,
    lookup_values: bool = False,
) -> pd.DataFrame:
    """Create a DataFrame including associated ids for each variation.

    Args:
        variations: The variations to include in the dataframe.
        variations_colname: The name of the variations column.
        ids_colname: The column name for value ids.
        lookup_values: When True, include a self.name column
            with associated values.
    """
    if ids_colname is None:
        ids_colname = f"{self.name}_id"
    df = pd.DataFrame(
        {
            variations_colname: variations,
            ids_colname: variations.apply(self.get_id_by_variation),
        }
    ).explode(ids_colname)
    if lookup_values:
        df[self.name] = df[ids_colname].apply(self.get_values_by_id)
        df = df.explode(self.name)
    return df
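The core move in get_variations_df is pairing each variation with its set of ids and then exploding one row per id. A pure-Python sketch (a hypothetical `variations_rows` helper; the real method returns a pandas DataFrame via explode) shows the shape of the result:

```python
def variations_rows(variations, get_ids):
    """Pure-Python sketch of what get_variations_df assembles:
    one (variation, id) row per associated id, mirroring the
    DataFrame.explode step. Ids are sorted here only to make the
    row order deterministic for the example."""
    rows = []
    for v in variations:
        for i in sorted(get_ids(v), key=str):
            rows.append((v, i))
    return rows

# Hypothetical variation-to-id lookup standing in for get_id_by_variation.
lookup = {"springfield": {"US-IL", "US-MA"}, "ny": {"US-NY"}}
rows = variations_rows(["springfield", "ny"], lambda v: lookup.get(v, set()))
```

An ambiguous variation ("springfield") contributes one row per id, which is why the real dataframe can have more rows than input variations.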
has_value
has_value(value: Any) -> bool

Determine whether the given value is in this authority.

Parameters:

Name Type Description Default
value Any

A possible authority value.

required

Returns:

Type Description
bool

True if the value is a valid entity value.

Source code in packages/xization/src/dataknobs_xization/lexicon.py
def has_value(self, value: Any) -> bool:
    """Determine whether the given value is in this authority.

    Args:
        value: A possible authority value.

    Returns:
        True if the value is a valid entity value.
    """
    return np.any(self.authdata.df[self.name] == value)

LexicalExpander

LexicalExpander(
    variations_fn: Callable[[str], Set[str]],
    normalize_fn: Callable[[str], str],
    split_input_camelcase: bool = True,
    detect_emojis: bool = False,
)

A class to expand and/or normalize original lexical input terms, to keep back-references from generated data to corresponding original input, and to build consistent tokens for lexical matching.

Initialize with the given functions.

Parameters:

Name Type Description Default
variations_fn Callable[[str], Set[str]]

A function, f(t), to expand a raw input term to all of its variations (including itself if desired). If None, the default is to expand each term to itself.

required
normalize_fn Callable[[str], str]

A function to normalize a raw input term or any of its variations. If None, then the identity function is used.

required
split_input_camelcase bool

True to split input camelcase tokens.

True
detect_emojis bool

True to detect emojis. If split_input_camelcase, then adjacent emojis will also be split; otherwise, adjacent emojis will appear as a single token.

False

Methods:

Name Description
__call__

Get all variations of the original term.

get_terms

Get the term ids for which the given variation was generated.

normalize

Normalize the given input term or variation.

Source code in packages/xization/src/dataknobs_xization/lexicon.py
def __init__(
    self,
    variations_fn: Callable[[str], Set[str]],
    normalize_fn: Callable[[str], str],
    split_input_camelcase: bool = True,
    detect_emojis: bool = False,
):
    """Initialize with the given functions.

    Args:
        variations_fn: A function, f(t), to expand a raw input term to
            all of its variations (including itself if desired). If None, the
            default is to expand each term to itself.
        normalize_fn: A function to normalize a raw input term or any
            of its variations. If None, then the identity function is used.
        split_input_camelcase: True to split input camelcase tokens.
        detect_emojis: True to detect emojis. If split_input_camelcase,
            then adjacent emojis will also be split; otherwise, adjacent
            emojis will appear as a single token.
    """
    self.variations_fn = variations_fn if variations_fn else lambda x: {x}
    self.normalize_fn = normalize_fn if normalize_fn else lambda x: x
    self.split_input_camelcase = split_input_camelcase
    self.emoji_data = emoji_utils.load_emoji_data() if detect_emojis else None
    self.v2t = defaultdict(set)
Functions
__call__
__call__(term: Any, normalize: bool = True) -> Set[str]

Get all variations of the original term.

Parameters:

Name Type Description Default
term Any

The term whose variations to compute.

required
normalize bool

True to normalize the resulting variations.

True

Returns:

Type Description
Set[str]

All variations.

Source code in packages/xization/src/dataknobs_xization/lexicon.py
def __call__(self, term: Any, normalize: bool = True) -> Set[str]:
    """Get all variations of the original term.

    Args:
        term: The term whose variations to compute.
        normalize: True to normalize the resulting variations.

    Returns:
        All variations.
    """
    variations = self.variations_fn(term)
    if normalize:
        variations = {self.normalize_fn(v) for v in variations}
    # Add a mapping from each variation to its original term
    if variations is not None and len(variations) > 0:
        more_itertools.consume(self.v2t[v].add(term) for v in variations)
    return variations
get_terms
get_terms(variation: str) -> Set[Any]

Get the term ids for which the given variation was generated.

Parameters:

Name Type Description Default
variation str

A variation whose reference term(s) to retrieve.

required

Returns:

Type Description
Set[Any]

The set of term ids for the variation, or an empty set if the variation is unknown.

Source code in packages/xization/src/dataknobs_xization/lexicon.py
def get_terms(self, variation: str) -> Set[Any]:
    """Get the term ids for which the given variation was generated.

    Args:
        variation: A variation whose reference term(s) to retrieve.

    Returns:
        The set of term ids for the variation, or an empty set if the variation is unknown.
    """
    return self.v2t.get(variation, set())
normalize
normalize(input_term: str) -> str

Normalize the given input term or variation.

Parameters:

Name Type Description Default
input_term str

An input term to normalize.

required

Returns:

Type Description
str

The normalized string of the input_term.

Source code in packages/xization/src/dataknobs_xization/lexicon.py
def normalize(self, input_term: str) -> str:
    """Normalize the given input term or variation.

    Args:
        input_term: An input term to normalize.

    Returns:
        The normalized string of the input_term.
    """
    return self.normalize_fn(input_term)

MultiAuthorityData

MultiAuthorityData(df: DataFrame, name: str)

Bases: CorrelatedAuthorityData

Container for correlated authoritative data covering multiple "sub" authorities, with explicit data for each component.

Methods:

Name Description
auth_records_mask

Get a boolean series identifying records in the full authority matching the given record values.

auth_values_mask

Identify the rows in the full authority corresponding to this sub-value.

build_authority_data

Build an authority for the named sub-authority.

combine_masks

Combine the masks if possible, returning the valid combination or None.

get_auth_records

Get the authority records identified by the mask.

get_authority_data

Get AuthorityData for the named "sub" authority, building if needed.

get_unique_vals_df

Get a dataframe with the unique values from the column and the given column name.

lookup_auth_values

Lookup original authority data for the named "sub" authority value.

lookup_subauth_values

Lookup "sub" authority data for the named "sub" authority value.

Attributes:

Name Type Description
authority_data AuthorityData

Retrieve the named authority data without building it, or None if not yet built.

Source code in packages/xization/src/dataknobs_xization/lexicon.py
def __init__(self, df: pd.DataFrame, name: str):
    super().__init__(df, name)
    self._authority_data = {}
Attributes
authority_data property
authority_data: AuthorityData

Retrieve the named authority data without building it, or None if not yet built.

Functions
auth_records_mask
auth_records_mask(
    record_value_ids: Dict[str, int], filter_mask: Series = None
) -> pd.Series

Get a boolean series identifying records in the full authority matching the given records of the form {&lt;sub-name&gt;: &lt;sub-value-id&gt;}.

Parameters:

Name Type Description Default
record_value_ids Dict[str, int]

The dict of field names to value_ids.

required
filter_mask Series

A pre-filter limiting records to consider and/or building records incrementally.

None

Returns:

Type Description
Series

A boolean series where all fields exist or None.

Source code in packages/xization/src/dataknobs_xization/lexicon.py
def auth_records_mask(
    self,
    record_value_ids: Dict[str, int],
    filter_mask: pd.Series = None,
) -> pd.Series:
    """Get a boolean series identifying records in the full authority matching
    the given records of the form {<sub-name>: <sub-value-id>}.

    Args:
        record_value_ids: The dict of field names to value_ids.
        filter_mask: A pre-filter limiting records to consider and/or
            building records incrementally.

    Returns:
        A boolean series where all fields exist or None.
    """
    has_fields = filter_mask
    for name, value_id in record_value_ids.items():
        has_field = self.auth_values_mask(name, value_id)
        if has_fields is None:
            has_fields = has_field
        else:
            has_fields &= has_field
    return has_fields
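
The record mask is built by AND-ing one boolean series per field, seeded from an optional pre-filter. A self-contained pandas sketch of the same accumulation (toy dataframe, with a simple equality test standing in for `auth_values_mask`):

```python
import pandas as pd

df = pd.DataFrame({"city": [1, 1, 2], "state": [10, 20, 10]})

def records_mask(record_value_ids: dict, filter_mask=None) -> pd.Series:
    has_fields = filter_mask
    for name, value_id in record_value_ids.items():
        has_field = df[name] == value_id   # stand-in for auth_values_mask
        has_fields = has_field if has_fields is None else has_fields & has_field
    return has_fields

mask = records_mask({"city": 1, "state": 10})
print(mask.tolist())   # [True, False, False]
print(df[mask])        # only the row matching every field
```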
auth_values_mask
auth_values_mask(name: str, value_id: int) -> pd.Series

Identify the rows in the full authority corresponding to this sub-value.

Parameters:

Name Type Description Default
name str

The sub-authority name.

required
value_id int

The sub-authority value_id.

required

Returns:

Type Description
Series

A boolean series where the field exists.

Source code in packages/xization/src/dataknobs_xization/lexicon.py
def auth_values_mask(self, name: str, value_id: int) -> pd.Series:
    """Identify the rows in the full authority corresponding to this sub-value.

    Args:
        name: The sub-authority name.
        value_id: The sub-authority value_id.

    Returns:
        A boolean series where the field exists.
    """
    field_values = self.lookup_subauth_values(name, value_id, is_id=True)
    return self.df[name].isin(field_values[name].tolist())
build_authority_data abstractmethod
build_authority_data(name: str) -> dk_auth.AuthorityData

Build an authority for the named sub-authority.

Parameters:

Name Type Description Default
name str

The "sub" authority name.

required

Returns:

Type Description
AuthorityData

The "sub" authority data.

Source code in packages/xization/src/dataknobs_xization/lexicon.py
@abstractmethod
def build_authority_data(self, name: str) -> dk_auth.AuthorityData:
    """Build an authority for the named sub-authority.

    Args:
        name: The "sub" authority name.

    Returns:
        The "sub" authority data.
    """
    raise NotImplementedError
combine_masks
combine_masks(mask1: Series, mask2: Series) -> pd.Series

Combine the masks if possible, returning the valid combination or None.

Parameters:

Name Type Description Default
mask1 Series

An auth_records_mask consistent with this data.

required
mask2 Series

Another data auth_records_mask.

required

Returns:

Type Description
Series

The combined consistent records_mask or None.

Source code in packages/xization/src/dataknobs_xization/lexicon.py
def combine_masks(self, mask1: pd.Series, mask2: pd.Series) -> pd.Series:
    """Combine the masks if possible, returning the valid combination or None.

    Args:
        mask1: An auth_records_mask consistent with this data.
        mask2: Another data auth_records_mask.

    Returns:
        The combined consistent records_mask or None.
    """
    result = None
    if mask1 is not None and mask2 is not None:
        result = mask1 & mask2
    elif mask1 is not None:
        result = mask1
    elif mask2 is not None:
        result = mask2
    return result if np.any(result) else None
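
`combine_masks` treats None as "no constraint" and an all-False combination as "no consistent records". The same logic as the shown source, exercised on small Series:

```python
import numpy as np
import pandas as pd

def combine_masks(mask1, mask2):
    # Prefer the AND of both; fall back to whichever mask exists.
    if mask1 is not None and mask2 is not None:
        result = mask1 & mask2
    else:
        result = mask1 if mask1 is not None else mask2
    # An all-False (or absent) combination is reported as None.
    return result if np.any(result) else None

a = pd.Series([True, True, False])
b = pd.Series([False, True, True])
print(combine_masks(a, b).tolist())     # [False, True, False]
print(combine_masks(a, None).tolist())  # [True, True, False]
print(combine_masks(a, ~a))             # None: masks are mutually exclusive
```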
get_auth_records
get_auth_records(records_mask: Series) -> pd.DataFrame

Get the authority records identified by the mask.

Parameters:

Name Type Description Default
records_mask Series

A boolean series identifying records in the full df.

required

Returns:

Type Description
DataFrame

The records/rows for which the mask is True.

Source code in packages/xization/src/dataknobs_xization/lexicon.py
def get_auth_records(self, records_mask: pd.Series) -> pd.DataFrame:
    """Get the authority records identified by the mask.

    Args:
        records_mask: A boolean series identifying records in the full df.

    Returns:
        The records/rows for which the mask is True.
    """
    return self.df[records_mask]
get_authority_data
get_authority_data(name: str) -> dk_auth.AuthorityData

Get AuthorityData for the named "sub" authority, building if needed.

Parameters:

Name Type Description Default
name str

The "sub" authority name.

required

Returns:

Type Description
AuthorityData

The "sub" authority data.

Source code in packages/xization/src/dataknobs_xization/lexicon.py
def get_authority_data(self, name: str) -> dk_auth.AuthorityData:
    """Get AuthorityData for the named "sub" authority, building if needed.

    Args:
        name: The "sub" authority name.

    Returns:
        The "sub" authority data.
    """
    if name not in self._authority_data:
        self._authority_data[name] = self.build_authority_data(name)
    return self._authority_data[name]
get_unique_vals_df staticmethod
get_unique_vals_df(col: Series, name: str) -> pd.DataFrame

Get a dataframe with the unique values from the column and the given column name.

Source code in packages/xization/src/dataknobs_xization/lexicon.py
@staticmethod
def get_unique_vals_df(col: pd.Series, name: str) -> pd.DataFrame:
    """Get a dataframe with the unique values from the column and the given
    column name.
    """
    data = np.sort(pd.unique(col.dropna()))
    if np.issubdtype(col.dtype, np.integer):
        # IDs for an integer column are the integers themselves
        col_df = pd.DataFrame({name: data}, index=data)
    else:
        # IDs for other columns are auto-generated from 0 to n-1
        col_df = pd.DataFrame({name: data})
    return col_df
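
The ID scheme differs by dtype: an integer column's values double as their own IDs (used as the index), while other columns get auto-generated positional IDs 0..n-1. A quick illustration of the two cases, reusing the logic of the shown `get_unique_vals_df`:

```python
import numpy as np
import pandas as pd

def unique_vals_df(col: pd.Series, name: str) -> pd.DataFrame:
    data = np.sort(pd.unique(col.dropna()))
    if np.issubdtype(col.dtype, np.integer):
        # Integer values serve as their own IDs via the index
        return pd.DataFrame({name: data}, index=data)
    # Other dtypes get default 0..n-1 IDs
    return pd.DataFrame({name: data})

ints = pd.Series([30, 10, 10, 20])
strs = pd.Series(["b", "a", "b"])
print(unique_vals_df(ints, "code").index.tolist())   # [10, 20, 30]
print(unique_vals_df(strs, "label").index.tolist())  # [0, 1]
```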
lookup_auth_values
lookup_auth_values(name: str, value: str) -> pd.DataFrame

Lookup original authority data for the named "sub" authority value.

Parameters:

Name Type Description Default
name str

The sub-authority name.

required
value str

The sub-authority value(s) (or dataframe row(s)).

required

Returns:

Type Description
DataFrame

The original authority dataframe rows.

Source code in packages/xization/src/dataknobs_xization/lexicon.py
def lookup_auth_values(
    self,
    name: str,
    value: str,
) -> pd.DataFrame:
    """Lookup original authority data for the named "sub" authority value.

    Args:
        name: The sub-authority name.
        value: The sub-authority value(s) (or dataframe row(s)).

    Returns:
        The original authority dataframe rows.
    """
    return self.df[self.df[name] == value]
lookup_subauth_values
lookup_subauth_values(
    name: str, value: int, is_id: bool = False
) -> pd.DataFrame

Lookup "sub" authority data for the named "sub" authority value.

Parameters:

Name Type Description Default
name str

The sub-authority name.

required
value int

The value for the sub-authority to lookup.

required
is_id bool

True if value is an ID.

False

Returns:

Type Description
DataFrame

The applicable authority dataframe rows.

Source code in packages/xization/src/dataknobs_xization/lexicon.py
def lookup_subauth_values(self, name: str, value: int, is_id: bool = False) -> pd.DataFrame:
    """Lookup "sub" authority data for the named "sub" authority value.

    Args:
        name: The sub-authority name.
        value: The value for the sub-authority to lookup.
        is_id: True if value is an ID.

    Returns:
        The applicable authority dataframe rows.
    """
    values_df = None
    authdata = self._authority_data.get(name, None)
    if authdata is not None:
        values_df = authdata.lookup_values(value, is_id=is_id)
    return values_df

MultiAuthorityFactory

MultiAuthorityFactory(auth_name: str, lexical_expander: LexicalExpander = None)

Bases: AuthorityFactory

A factory for building a "sub" authority directly or indirectly from MultiAuthorityData.

Initialize the MultiAuthorityFactory.

Parameters:

Name Type Description Default
auth_name str

The name of the dataframe authority to build.

required
lexical_expander LexicalExpander

The lexical expander to use (default=identity).

None

Methods:

Name Description
build_authority

Build a DataframeAuthority.

get_lexical_expander

Get the lexical expander for the named (column) data.

Source code in packages/xization/src/dataknobs_xization/lexicon.py
def __init__(
    self,
    auth_name: str,
    lexical_expander: LexicalExpander = None,
):
    """Initialize the MultiAuthorityFactory.

    Args:
        auth_name: The name of the dataframe authority to build.
        lexical_expander: The lexical expander to use (default=identity).
    """
    self.auth_name = auth_name
    self._lexical_expander = lexical_expander
Functions
build_authority
build_authority(
    name: str,
    auth_anns_builder: AuthorityAnnotationsBuilder,
    multiauthdata: MultiAuthorityData,
    parent_auth: Authority = None,
) -> DataframeAuthority

Build a DataframeAuthority.

Parameters:

Name Type Description Default
name str

The name of the authority to build.

required
auth_anns_builder AuthorityAnnotationsBuilder

The authority annotations row builder to use for building annotation rows.

required
multiauthdata MultiAuthorityData

The multi-authority source data.

required
parent_auth Authority

The parent authority.

None

Returns:

Type Description
DataframeAuthority

The DataframeAuthority instance.

Source code in packages/xization/src/dataknobs_xization/lexicon.py
def build_authority(
    self,
    name: str,
    auth_anns_builder: dk_auth.AuthorityAnnotationsBuilder,
    multiauthdata: MultiAuthorityData,
    parent_auth: dk_auth.Authority = None,
) -> DataframeAuthority:
    """Build a DataframeAuthority.

    Args:
        name: The name of the authority to build.
        auth_anns_builder: The authority annotations row builder to use
            for building annotation rows.
        multiauthdata: The multi-authority source data.
        parent_auth: The parent authority.

    Returns:
        The DataframeAuthority instance.
    """
    authdata = multiauthdata.get_authority_data(name)
    field_groups = None  # TODO: get from instance var set on construction?
    anns_validator = None  # TODO: get from multiauthdata?
    return DataframeAuthority(
        name,
        self.get_lexical_expander(name),
        authdata,
        field_groups=field_groups,
        anns_validator=anns_validator,
        parent_auth=parent_auth,
    )
get_lexical_expander
get_lexical_expander(name: str) -> LexicalExpander

Get the lexical expander for the named (column) data.

Parameters:

Name Type Description Default
name str

The name of the column to expand.

required

Returns:

Type Description
LexicalExpander

The appropriate lexical_expander.

Source code in packages/xization/src/dataknobs_xization/lexicon.py
def get_lexical_expander(self, name: str) -> LexicalExpander:
    """Get the lexical expander for the named (column) data.

    Args:
        name: The name of the column to expand.

    Returns:
        The appropriate lexical_expander.
    """
    if self._lexical_expander is None:
        self._lexical_expander = LexicalExpander(None, None)
    return self._lexical_expander

SimpleMultiAuthorityData

SimpleMultiAuthorityData(df: DataFrame, name: str)

Bases: MultiAuthorityData

Data class for pulling a single column from the multi-authority data as a "sub" authority.

Methods:

Name Description
build_authority_data

Build an authority for the named column holding authority data.

Source code in packages/xization/src/dataknobs_xization/lexicon.py
def __init__(self, df: pd.DataFrame, name: str):
    super().__init__(df, name)
    self._authority_data = {}
Functions
build_authority_data
build_authority_data(name: str) -> dk_auth.AuthorityData

Build an authority for the named column holding authority data.

Note

Only unique values are kept and the full dataframe's index will not be preserved.

Parameters:

Name Type Description Default
name str

The "sub" authority (and column) name.

required

Returns:

Type Description
AuthorityData

The "sub" authority data.

Source code in packages/xization/src/dataknobs_xization/lexicon.py
def build_authority_data(self, name: str) -> dk_auth.AuthorityData:
    """Build an authority for the named column holding authority data.

    Note:
        Only unique values are kept and the full dataframe's index
        will not be preserved.

    Args:
        name: The "sub" authority (and column) name.

    Returns:
        The "sub" authority data.
    """
    col = self.df[name]
    col_df = self.get_unique_vals_df(col, name)
    return dk_auth.AuthorityData(col_df, name)

TokenAligner

TokenAligner(first_token: Token, authority: LexicalAuthority)

Aligns tokens with a lexical authority to generate annotations.

Processes a token stream, matching tokens against lexical authority variations and generating annotations for matches. Handles overlapping matches and tracks processed tokens.

Source code in packages/xization/src/dataknobs_xization/lexicon.py
def __init__(self, first_token: dk_tok.Token, authority: dk_auth.LexicalAuthority):
    self.first_token = first_token
    self.auth = authority
    self.annotations = []  # List[Dict[str, Any]]
    self._processed_idx = set()
    self._process(self.first_token)

TokenMatch

TokenMatch(auth: LexicalAuthority, val_idx: int, var: str, token: Token)

Represents a match between tokens and a lexical authority variation.

Matches a sequence of tokens against a lexical authority variation, tracking whether the match is complete and providing access to matched text and annotation generation.

Attributes:

Name Type Description
matched_text

Get the matched original text.

Source code in packages/xization/src/dataknobs_xization/lexicon.py
def __init__(self, auth: dk_auth.LexicalAuthority, val_idx: int, var: str, token: dk_tok.Token):
    self.auth = auth
    self.val_idx = val_idx
    self.var = var
    self.token = token

    self.varparts = var.split()
    self.matches = True
    self.tokens = []
    t = token
    for v in self.varparts:
        if t is not None and v == t.norm_text:
            self.tokens.append(t)
            t = t.next_token
        else:
            self.matches = False
            break
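
The matching loop walks the linked token stream in lockstep with the whitespace-split variation parts, comparing each part against a token's `norm_text`. A self-contained sketch with a minimal hypothetical `Token` (the real `dk_tok.Token` carries more state):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Token:
    norm_text: str
    next_token: Optional["Token"] = None

def link(words):
    """Build a singly linked token chain and return its head."""
    tokens = [Token(w) for w in words]
    for a, b in zip(tokens, tokens[1:]):
        a.next_token = b
    return tokens[0]

def match(var: str, token: Token):
    """Return the matched tokens, or None if the variation doesn't align."""
    tokens = []
    t = token
    for part in var.split():           # compare part-by-part, advancing tokens
        if t is not None and part == t.norm_text:
            tokens.append(t)
            t = t.next_token
        else:
            return None                # mismatch, or token stream exhausted
    return tokens

head = link(["new", "york", "city", "council"])
print([t.norm_text for t in match("new york city", head)])  # ['new', 'york', 'city']
print(match("new jersey", head))                            # None
```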
Attributes
matched_text property
matched_text

Get the matched original text.

Usage Examples

Text Normalization Example

from dataknobs_xization import normalize

# Basic text normalization
text = "  Hello,    WORLD!  \n\t How   are you?  "
normalized = normalize.basic_normalization_fn(text)
print(normalized)  # "hello, world! how are you?"

# CamelCase expansion
camel_text = "firstName"
expanded = normalize.expand_camelcase_fn(camel_text)
print(expanded)  # "first Name"

# Generate lexical variations
text_with_hyphens = "multi-platform/cross-browser"
variations = normalize.get_lexical_variations(text_with_hyphens)
print(f"Generated {len(variations)} variations:")
for var in sorted(variations):
    print(f"  {var}")

# Symbol handling
text_with_symbols = "!Hello world?"
cleaned = normalize.drop_non_embedded_symbols_fn(text_with_symbols)
print(cleaned)  # "Hello world"

embedded_text = "user@domain.com"
processed = normalize.drop_embedded_symbols_fn(embedded_text, " ")
print(processed)  # "user domain com"

# Ampersand expansion
ampersand_text = "Research & Development"
expanded_ampersand = normalize.expand_ampersand_fn(ampersand_text)
print(expanded_ampersand)  # "Research and Development"

Character Features Example

from dataknobs_xization.masking_tokenizer import CharacterFeatures
from dataknobs_structures import document as dk_doc
import pandas as pd

# Create a concrete implementation of CharacterFeatures
class BasicCharacterFeatures(CharacterFeatures):
    """Basic character-level feature extraction."""

    @property
    def cdf(self) -> pd.DataFrame:
        """Create character dataframe with features."""
        if not hasattr(self, '_cdf'):
            chars = list(self.text)

            # Add padding if specified
            if self._roll_padding > 0:
                pad_char = '<PAD>'
                chars = ([pad_char] * self._roll_padding + 
                        chars + 
                        [pad_char] * self._roll_padding)

            # Create feature dataframe
            self._cdf = pd.DataFrame({
                self.text_col: chars,
                'position': range(len(chars)),
                'is_alpha': [c.isalpha() if c != '<PAD>' else False for c in chars],
                'is_digit': [c.isdigit() if c != '<PAD>' else False for c in chars],
                'is_upper': [c.isupper() if c != '<PAD>' else False for c in chars],
                'is_lower': [c.islower() if c != '<PAD>' else False for c in chars],
                'is_space': [c.isspace() if c != '<PAD>' else False for c in chars],
                'is_punct': [not c.isalnum() and not c.isspace() if c != '<PAD>' else False for c in chars],
                'is_padding': [c == '<PAD>' for c in chars]
            })

        return self._cdf

# Usage
text = "Hello, World! 123 👋"
features = BasicCharacterFeatures(text, roll_padding=2)

print(f"Text: {features.text}")
print(f"Text column: {features.text_col}")
print("\nCharacter DataFrame:")
print(features.cdf.head(10))

# Analyze character distribution
cdf = features.cdf
print("\nCharacter Analysis:")
print(f"Total characters: {len(cdf)}")
print(f"Alphabetic: {cdf['is_alpha'].sum()}")
print(f"Digits: {cdf['is_digit'].sum()}")
print(f"Spaces: {cdf['is_space'].sum()}")
print(f"Punctuation: {cdf['is_punct'].sum()}")
print(f"Padding: {cdf['is_padding'].sum()}")

Text Masking Example

from dataknobs_xization.masking_tokenizer import CharacterFeatures
import pandas as pd
import numpy as np

class MaskingCharacterFeatures(CharacterFeatures):
    """Character features with masking capability."""

    def __init__(self, doctext, roll_padding=0, mask_probability=0.15):
        super().__init__(doctext, roll_padding)
        self.mask_probability = mask_probability

    @property
    def cdf(self) -> pd.DataFrame:
        """Character dataframe with masking features."""
        if not hasattr(self, '_cdf'):
            chars = list(self.text)

            if self._roll_padding > 0:
                pad_char = '<PAD>'
                chars = ([pad_char] * self._roll_padding + 
                        chars + 
                        [pad_char] * self._roll_padding)

            # Set random seed for reproducibility
            np.random.seed(42)

            self._cdf = pd.DataFrame({
                self.text_col: chars,
                'original_char': chars,
                'position': range(len(chars)),
                'is_alpha': [c.isalpha() if c != '<PAD>' else False for c in chars],
                'is_digit': [c.isdigit() if c != '<PAD>' else False for c in chars],
                'should_mask': np.random.random(len(chars)) < self.mask_probability,
                'is_padding': [c == '<PAD>' for c in chars]
            })

            # Apply masking
            mask_indices = self._cdf['should_mask'] & ~self._cdf['is_padding']
            self._cdf.loc[mask_indices, self.text_col] = '[MASK]'

        return self._cdf

    def get_masked_text(self) -> str:
        """Get the masked version of the text."""
        cdf = self.cdf
        masked_chars = cdf[~cdf['is_padding']][self.text_col].tolist()
        return ''.join(masked_chars)

# Usage
original_text = "This is a sample text for demonstration."
masker = MaskingCharacterFeatures(original_text, mask_probability=0.2)

print(f"Original: {original_text}")
print(f"Masked:   {masker.get_masked_text()}")
print("\nMask Statistics:")
cdf = masker.cdf
print(f"Total chars: {len(cdf)}")
print(f"Masked chars: {cdf['should_mask'].sum()}")
print(f"Mask ratio: {cdf['should_mask'].mean():.2%}")

Complete Text Processing Pipeline

from dataknobs_xization import normalize, masking_tokenizer
from dataknobs_structures import document as dk_doc
import pandas as pd

class TextProcessingPipeline:
    """Complete text processing with normalization and analysis."""

    def __init__(self, normalize_config=None, analysis_config=None):
        self.normalize_config = normalize_config or {}
        self.analysis_config = analysis_config or {}

    def process_document(self, doc: dk_doc.Document) -> dict:
        """Process a document through the complete pipeline."""
        original_text = doc.text
        results = {
            'document_id': getattr(doc, 'text_id', None),
            'original_text': original_text
        }

        # Step 1: Normalization
        normalized_text = self._normalize_text(original_text)
        results['normalized_text'] = normalized_text

        # Step 2: Generate variations
        variations = normalize.get_lexical_variations(
            normalized_text, **self.normalize_config
        )
        results['variations'] = list(variations)
        results['variation_count'] = len(variations)

        # Step 3: Character analysis
        char_analysis = self._analyze_characters(normalized_text)
        results['character_analysis'] = char_analysis

        return results

    def _normalize_text(self, text: str) -> str:
        """Apply normalization pipeline."""
        # Expand camelCase
        text = normalize.expand_camelcase_fn(text)

        # Expand ampersands
        text = normalize.expand_ampersand_fn(text)

        # Drop parentheticals
        if self.normalize_config.get('drop_parentheticals', True):
            text = normalize.drop_parentheticals_fn(text)

        # Handle symbols
        if self.normalize_config.get('drop_non_embedded_symbols', True):
            text = normalize.drop_non_embedded_symbols_fn(text)

        # Basic normalization
        text = normalize.basic_normalization_fn(text)

        return text

    def _analyze_characters(self, text: str) -> dict:
        """Analyze character-level features."""
        class AnalysisCharFeatures(masking_tokenizer.CharacterFeatures):
            @property
            def cdf(self):
                chars = list(self.text)
                return pd.DataFrame({
                    self.text_col: chars,
                    'position': range(len(chars)),
                    'is_alpha': [c.isalpha() for c in chars],
                    'is_digit': [c.isdigit() for c in chars],
                    'is_space': [c.isspace() for c in chars],
                    'is_punct': [not c.isalnum() and not c.isspace() for c in chars]
                })

        features = AnalysisCharFeatures(text)
        cdf = features.cdf

        return {
            'total_characters': len(cdf),
            'alphabetic_characters': cdf['is_alpha'].sum(),
            'digit_characters': cdf['is_digit'].sum(),
            'space_characters': cdf['is_space'].sum(),
            'punctuation_characters': cdf['is_punct'].sum(),
            'alphabetic_ratio': cdf['is_alpha'].mean(),
            'digit_ratio': cdf['is_digit'].mean(),
            'space_ratio': cdf['is_space'].mean(),
            'punctuation_ratio': cdf['is_punct'].mean()
        }

    def process_batch(self, documents: list) -> list:
        """Process multiple documents."""
        return [self.process_document(doc) for doc in documents]

# Usage example
config = {
    'drop_parentheticals': True,
    'drop_non_embedded_symbols': True,
    'expand_camelcase': True,
    'expand_ampersands': True,
    'add_eng_plurals': True
}

pipeline = TextProcessingPipeline(normalize_config=config)

# Create sample documents
documents = [
    dk_doc.Document(
        "getUserName() & validateInput (required)", 
        text_id="tech_doc_1"
    ),
    dk_doc.Document(
        "Machine Learning (ML) & Artificial Intelligence",
        text_id="ai_doc_1" 
    )
]

# Process documents
results = pipeline.process_batch(documents)

# Display results
for result in results:
    print(f"\nDocument: {result['document_id']}")
    print(f"Original: {result['original_text']}")
    print(f"Normalized: {result['normalized_text']}")
    print(f"Variations: {result['variation_count']}")
    print(f"Character Analysis: {result['character_analysis']}")

Integration with Other Packages

from dataknobs_xization import normalize, masking_tokenizer
from dataknobs_utils import file_utils, elasticsearch_utils
from dataknobs_structures import Tree, document as dk_doc
import json

def create_searchable_documents(input_dir: str) -> list:
    """Create searchable documents with normalized text."""
    searchable_docs = []

    # Process all text files
    for filepath in file_utils.filepath_generator(input_dir):
        if filepath.endswith('.txt'):
            # Read file content
            content_lines = list(file_utils.fileline_generator(filepath))
            full_text = '\n'.join(content_lines)

            # Normalize text
            normalized = normalize.basic_normalization_fn(full_text)
            normalized = normalize.expand_camelcase_fn(normalized)
            normalized = normalize.expand_ampersand_fn(normalized)

            # Generate search variations
            variations = normalize.get_lexical_variations(
                normalized,
                expand_camelcase=True,
                do_hyphen_expansion=True,
                do_slash_expansion=True
            )

            # Create searchable document
            searchable_doc = {
                'filepath': filepath,
                'original_text': full_text,
                'normalized_text': normalized,
                'search_variations': ' '.join(variations),
                'variation_count': len(variations)
            }

            searchable_docs.append(searchable_doc)

    return searchable_docs

# Create Elasticsearch index with normalized documents
def index_normalized_documents(documents: list, index_name: str):
    """Index normalized documents in Elasticsearch."""
    table_settings = elasticsearch_utils.TableSettings(
        index_name,
        {"number_of_shards": 1, "number_of_replicas": 0},
        {
            "properties": {
                "original_text": {"type": "text"},
                "normalized_text": {"type": "text", "analyzer": "english"},
                "search_variations": {"type": "text"},
                "filepath": {"type": "keyword"},
                "variation_count": {"type": "integer"}
            }
        }
    )

    index = elasticsearch_utils.ElasticsearchIndex(None, [table_settings])

    # Create batch file
    with open("normalized_batch.jsonl", "w") as f:
        elasticsearch_utils.add_batch_data(
            f, iter(documents), index_name
        )

    return index

# Usage
documents = create_searchable_documents("/path/to/text/files")
index = index_normalized_documents(documents, "normalized_texts")
print(f"Indexed {len(documents)} normalized documents")

Error Handling

from dataknobs_xization import normalize, masking_tokenizer
from dataknobs_structures import document as dk_doc

def safe_text_processing(text: str) -> dict:
    """Safely process text with comprehensive error handling."""
    results = {'original': text, 'errors': []}

    try:
        # Normalization with error handling
        normalized = normalize.basic_normalization_fn(text)
        results['normalized'] = normalized
    except Exception as e:
        results['errors'].append(f"Normalization failed: {e}")
        results['normalized'] = text

    try:
        # CamelCase expansion
        expanded = normalize.expand_camelcase_fn(results['normalized'])
        results['camelcase_expanded'] = expanded
    except Exception as e:
        results['errors'].append(f"CamelCase expansion failed: {e}")
        results['camelcase_expanded'] = results['normalized']

    try:
        # Variation generation
        variations = normalize.get_lexical_variations(results['camelcase_expanded'])
        results['variations'] = list(variations)
    except Exception as e:
        results['errors'].append(f"Variation generation failed: {e}")
        results['variations'] = [results['camelcase_expanded']]

    try:
        # Character analysis
        class SafeCharFeatures(masking_tokenizer.CharacterFeatures):
            @property
            def cdf(self):
                import pandas as pd
                chars = list(self.text) if self.text else []
                return pd.DataFrame({
                    self.text_col: chars,
                    'is_alpha': [c.isalpha() for c in chars]
                })

        features = SafeCharFeatures(results['camelcase_expanded'])
        results['character_count'] = len(features.cdf)
    except Exception as e:
        results['errors'].append(f"Character analysis failed: {e}")
        results['character_count'] = 0

    results['success'] = len(results['errors']) == 0
    return results

# Usage
test_texts = [
    "Normal text for processing",
    "camelCaseText & symbols!",
    "",  # Empty string
    None,  # None value
    "Special unicode: 👋🌍"
]

for i, text in enumerate(test_texts):
    try:
        result = safe_text_processing(text or "")
        print(f"\nTest {i+1}: {'SUCCESS' if result['success'] else 'ERRORS'}")
        print(f"Original: {repr(text)}")
        if result['success']:
            print(f"Normalized: {result['normalized']}")
            print(f"Variations: {len(result['variations'])}")
        else:
            print(f"Errors: {result['errors']}")
    except Exception as e:
        print(f"\nTest {i+1}: CRITICAL ERROR - {e}")

Testing

import pytest
from dataknobs_xization import normalize, masking_tokenizer
from dataknobs_structures import document as dk_doc
import pandas as pd

class TestXizationFunctions:
    """Test suite for xization functionality."""

    def test_normalization_functions(self):
        """Test core normalization functions."""
        # Test camelCase expansion
        assert normalize.expand_camelcase_fn("firstName") == "first Name"
        assert normalize.expand_camelcase_fn("XMLParser") == "XML Parser"

        # Test symbol handling
        assert normalize.drop_non_embedded_symbols_fn("!Hello world?") == "Hello world"
        assert normalize.drop_embedded_symbols_fn("user@domain.com") == "userdomaincom"

        # Test ampersand expansion
        assert normalize.expand_ampersand_fn("A & B") == "A and B"

        # Test parenthetical removal
        assert normalize.drop_parentheticals_fn("Text (with note)") == "Text "

    def test_lexical_variations(self):
        """Test lexical variation generation."""
        variations = normalize.get_lexical_variations("multi-platform")

        # Check expected variations are present
        assert "multi platform" in variations
        assert "multiplatform" in variations
        assert "multi-platform" in variations

        # Check it returns a set
        assert isinstance(variations, set)
        assert len(variations) > 1

    def test_character_features(self):
        """Test character feature extraction."""
        class TestCharFeatures(masking_tokenizer.CharacterFeatures):
            @property
            def cdf(self):
                chars = list(self.text)
                return pd.DataFrame({
                    self.text_col: chars,
                    'is_alpha': [c.isalpha() for c in chars],
                    'is_digit': [c.isdigit() for c in chars]
                })

        features = TestCharFeatures("Hello123")
        cdf = features.cdf

        # Test basic properties
        assert len(cdf) == 8
        assert cdf['is_alpha'].sum() == 5  # "Hello"
        assert cdf['is_digit'].sum() == 3  # "123"

        # Test text properties
        assert features.text == "Hello123"
        assert features.text_col == 'text'  # Default column name

    def test_document_integration(self):
        """Test integration with document structures."""
        doc = dk_doc.Text("Test document", text_id="test1")

        class DocCharFeatures(masking_tokenizer.CharacterFeatures):
            @property
            def cdf(self):
                chars = list(self.text)
                return pd.DataFrame({self.text_col: chars})

        features = DocCharFeatures(doc)
        assert features.text_id == "test1"
        assert features.text == "Test document"

    def test_error_handling(self):
        """Test error handling in various scenarios."""
        # Test empty text
        empty_variations = normalize.get_lexical_variations("")
        assert isinstance(empty_variations, set)

        # Test empty-string handling in utility function
        from dataknobs_xization.normalize import basic_normalization_fn
        try:
            result = basic_normalization_fn("")
            assert isinstance(result, str)
        except Exception:
            pytest.fail("Should handle empty string gracefully")

# Run tests
if __name__ == "__main__":
    test_suite = TestXizationFunctions()
    test_suite.test_normalization_functions()
    test_suite.test_lexical_variations()
    test_suite.test_character_features()
    test_suite.test_document_integration()
    test_suite.test_error_handling()
    print("All tests passed!")

Performance Notes

  • Regular Expressions: Patterns are pre-compiled at module load, so repeated calls avoid recompilation overhead
  • Character Analysis: Builds one DataFrame row per character, so it is memory-intensive for large texts - process long documents in chunks
  • Variation Generation: Can produce many variations per input - prune the result set before downstream use
  • Pandas DataFrames: Efficient for character-level analysis, but watch memory usage on large inputs
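As a minimal sketch of the chunking and pruning advice above (the helper names, chunk size, and limits are illustrative, not part of the dataknobs_xization API), character analysis can be done in bounded memory and variation sets can be capped before downstream use:

```python
import pandas as pd

def char_features_chunked(text, chunk_size=4096):
    """Yield one small character-level DataFrame per chunk instead of
    materializing a single frame for the whole text, keeping peak
    memory proportional to chunk_size rather than len(text)."""
    for start in range(0, len(text), chunk_size):
        chunk = text[start:start + chunk_size]
        yield pd.DataFrame({
            'text': list(chunk),
            'is_alpha': [c.isalpha() for c in chunk],
        })

def prune_variations(variations, max_len=64, max_count=50):
    """Keep only non-empty variations up to max_len characters,
    capped at max_count entries (sorted first for determinism)."""
    kept = sorted(v for v in variations if v and len(v) <= max_len)
    return set(kept[:max_count])

# Aggregate a statistic across chunks without holding them all in memory
big_text = "Hello World! " * 1000
alpha_count = sum(
    df['is_alpha'].sum()
    for df in char_features_chunked(big_text, chunk_size=512)
)
print(alpha_count)  # 10000 alphabetic characters

# Drop empty and overly long variations before further processing
pruned = prune_variations({"multi platform", "multiplatform", "", "x" * 100})
print(sorted(pruned))
```

The same chunked-aggregation pattern applies to any per-character statistic: compute it per chunk, then combine the partial results.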

Dependencies

Core dependencies for dataknobs_xization:

pandas>=1.3.0
numpy>=1.20.0
dataknobs-structures>=1.0.0
dataknobs-utils>=1.0.0

Contributing

To contribute to dataknobs_xization:

  1. Fork the repository
  2. Create feature branch for text processing enhancements
  3. Add comprehensive tests for normalization functions
  4. Test with various text types and edge cases
  5. Submit pull request with documentation updates

See Contributing Guide for detailed information.

Changelog

Version 1.0.0

  • Initial release
  • Text normalization functions
  • Character-level feature extraction
  • Lexical variation generation
  • Masking tokenizer framework
  • Integration with dataknobs-structures

License

See License for license information.