dataknobs-xization API Reference¶
Complete API documentation for the dataknobs_xization package.
💡 Quick Links:

- Complete API Documentation - Full auto-generated reference
- Source Code - Browse on GitHub
- Package Guide - Detailed documentation
Package Information¶
- Package Name: dataknobs_xization
- Version: 1.0.0
- Description: Text normalization and tokenization tools
- Python Requirements: >=3.8
Installation¶
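No install command is shown on this page; assuming the package is published under the hyphenated distribution name from the title (an assumption — adjust to your package index if it differs):

```shell
pip install dataknobs-xization
```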
Import Statement¶
```python
from dataknobs_xization import (
    annotations,
    authorities,
    lexicon,
    masking_tokenizer,
    normalize,
)

# Import key classes
from dataknobs_xization.masking_tokenizer import CharacterFeatures, TextFeatures
```
Module Documentation¶
normalize¶
Regular Expression Patterns¶
The `normalize` module exposes the following precompiled regular expressions as module attributes:

- `SQUASH_WS_RE`
- `ALL_SYMBOLS_RE`
- `CAMELCASE_LU_RE`
- `CAMELCASE_UL_RE`
- `NON_EMBEDDED_WORD_SYMS_RE`
- `EMBEDDED_SYMS_RE`
- `HYPHEN_SLASH_RE`
- `HYPHEN_ONLY_RE`
- `SLASH_ONLY_RE`
- `PARENTHETICAL_RE`
- `AMPERSAND_RE`
Functions¶
expand_camelcase_fn¶
dataknobs_xization.normalize.expand_camelcase_fn ¶
Expand both "lU" and "UUl" camelcasing to "l U" and "U Ul".
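The expansion can be sketched with two regexes mirroring the documented `CAMELCASE_LU_RE` / `CAMELCASE_UL_RE` attributes. The exact library patterns are assumptions; this is an illustrative reimplementation, not the package's code:

```python
import re

# Assumed sketches of the two documented boundaries (not the library's actual patterns):
CAMELCASE_LU_RE = re.compile(r"([a-z])([A-Z])")       # "lU": lower followed by upper
CAMELCASE_UL_RE = re.compile(r"([A-Z])([A-Z][a-z])")  # "UUl": upper run followed by Ul

def expand_camelcase(text: str) -> str:
    """Insert spaces at camelCase boundaries, e.g. "camelCase" -> "camel Case"."""
    text = CAMELCASE_LU_RE.sub(r"\1 \2", text)
    return CAMELCASE_UL_RE.sub(r"\1 \2", text)
```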
drop_non_embedded_symbols_fn¶
dataknobs_xization.normalize.drop_non_embedded_symbols_fn ¶
drop_embedded_symbols_fn¶
dataknobs_xization.normalize.drop_embedded_symbols_fn ¶
get_hyphen_slash_expansions_fn¶
dataknobs_xization.normalize.get_hyphen_slash_expansions_fn ¶
```python
get_hyphen_slash_expansions_fn(
    text: str,
    subs: List[str] = ("-", " ", ""),
    add_self: bool = True,
    do_split: bool = True,
    min_split_token_len: int = 2,
    hyphen_slash_re: Pattern[str] = HYPHEN_SLASH_RE,
) -> Set[str]
```

Given text whose words may appear hyphenated, slash-delimited, or space-delimited, return the set of potential variations:

- the text as-is (add_self)
- with a hyphen between all words (if '-' in subs)
- with a space between all words (if ' ' in subs)
- with all words squashed together (empty string between, if '' in subs)
- with each word separately (do_split, as long as min_split_token_len is met for all tokens)
Note
- To add a variation with a slash, add '/' to subs.
- To not add any variations with symbols, leave them out of subs and don't add self.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| text | str | The hyphen-worthy snippet of text, either already hyphenated or slash- or space-delimited. | required |
| subs | List[str] | A string of characters or list of strings to insert between tokens. | ('-', ' ', '') |
| add_self | bool | True to include the text itself in the result. | True |
| do_split | bool | True to add split tokens separately. | True |
| min_split_token_len | int | If any of the split tokens fail to meet the min token length, don't add any of the splits. | 2 |
| hyphen_slash_re | Pattern[str] | The regex to identify hyphen/slash to expand. | HYPHEN_SLASH_RE |

Returns:

| Type | Description |
|---|---|
| Set[str] | The set of text variations. |
Source code in packages/xization/src/dataknobs_xization/normalize.py
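To make the behavior concrete, here is a minimal, self-contained sketch of the documented expansion logic — an illustrative reimplementation under assumptions (e.g., the delimiter regex), not the package's source:

```python
import re
from typing import List, Set

HYPHEN_SLASH_RE = re.compile(r"[-/]")  # assumption: tokens delimited by '-' or '/'

def hyphen_slash_expansions(
    text: str,
    subs: List[str] = ("-", " ", ""),
    add_self: bool = True,
    do_split: bool = True,
    min_split_token_len: int = 2,
) -> Set[str]:
    """Generate the documented variations of a hyphen/slash-delimited snippet."""
    result: Set[str] = set()
    if add_self:
        result.add(text)
    tokens = [t for t in HYPHEN_SLASH_RE.split(text) if t]
    if len(tokens) > 1:
        # Join all tokens with each requested substitute ('-', ' ', '', ...)
        for sub in subs:
            result.add(sub.join(tokens))
        # Add each token separately only if every token meets the minimum length
        if do_split and all(len(t) >= min_split_token_len for t in tokens):
            result.update(tokens)
    return result
```

For example, `hyphen_slash_expansions("e-mail")` yields `{"e-mail", "e mail", "email"}`; the split tokens are withheld because `"e"` is shorter than `min_split_token_len`.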
drop_parentheticals_fn¶
dataknobs_xization.normalize.drop_parentheticals_fn ¶
expand_ampersand_fn¶
dataknobs_xization.normalize.expand_ampersand_fn ¶
get_lexical_variations¶
dataknobs_xization.normalize.get_lexical_variations ¶
```python
get_lexical_variations(
    text: str,
    include_self: bool = True,
    expand_camelcase: bool = True,
    drop_non_embedded_symbols: bool = True,
    drop_embedded_symbols: bool = True,
    spacify_embedded_symbols: bool = False,
    do_hyphen_expansion: bool = True,
    hyphen_subs: List[str] = (" ", ""),
    do_hyphen_split: bool = True,
    min_hyphen_split_token_len: int = 2,
    do_slash_expansion: bool = True,
    slash_subs: List[str] = (" ", " or "),
    do_slash_split: bool = True,
    min_slash_split_token_len: int = 1,
    drop_parentheticals: bool = True,
    expand_ampersands: bool = True,
    add_eng_plurals: bool = True,
) -> Set[str]
```
Get all variations for the text (including the text itself).
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| text | str | The text to generate variations for. | required |
| include_self | bool | True to include the original text in the result. | True |
| expand_camelcase | bool | True to expand camelCase text. | True |
| drop_non_embedded_symbols | bool | True to drop symbols not embedded in words. | True |
| drop_embedded_symbols | bool | True to drop symbols embedded in words. | True |
| spacify_embedded_symbols | bool | True to replace embedded symbols with spaces. | False |
| do_hyphen_expansion | bool | True to expand hyphenated text. | True |
| hyphen_subs | List[str] | List of strings to substitute for hyphens. | (' ', '') |
| do_hyphen_split | bool | True to split on hyphens. | True |
| min_hyphen_split_token_len | int | Minimum token length for hyphen splits. | 2 |
| do_slash_expansion | bool | True to expand slashes. | True |
| slash_subs | List[str] | List of strings to substitute for slashes. | (' ', ' or ') |
| do_slash_split | bool | True to split on slashes. | True |
| min_slash_split_token_len | int | Minimum token length for slash splits. | 1 |
| drop_parentheticals | bool | True to drop parenthetical expressions. | True |
| expand_ampersands | bool | True to expand ampersands to ' and '. | True |
| add_eng_plurals | bool | True to add English plural forms. | True |

Returns:

| Type | Description |
|---|---|
| Set[str] | The set of all text variations. |
Source code in packages/xization/src/dataknobs_xization/normalize.py
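Among these options, add_eng_plurals is the least self-explanatory. Here is a naive sketch of English pluralization — the rules shown are an assumption; the package's actual logic may differ:

```python
from typing import Set

def add_eng_plurals(terms: Set[str]) -> Set[str]:
    """Return the terms plus naively generated English plural forms."""
    out = set(terms)
    for t in terms:
        if t.endswith(("s", "x", "z", "ch", "sh")):
            out.add(t + "es")        # box -> boxes
        elif t.endswith("y") and len(t) > 1 and t[-2] not in "aeiou":
            out.add(t[:-1] + "ies")  # city -> cities
        else:
            out.add(t + "s")         # cat -> cats
    return out
```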
masking_tokenizer¶
Classes¶
CharacterFeatures¶
dataknobs_xization.masking_tokenizer.CharacterFeatures ¶
Bases: ABC
Class representing features of text as a dataframe with each character as a row and columns representing character features.
Initialize with the text to tokenize.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| doctext | Union[Text, str] | The text to tokenize (or dk_doc.Text with its metadata). | required |
| roll_padding | int | The number of pad characters added to each end of the text. | 0 |

Attributes:

| Name | Type | Description |
|---|---|---|
| cdf | DataFrame | The character dataframe with each padded text character as a row. |
| doctext | Text | |
| text_col | str | The name of the cdf column holding the text characters. |
| text | str | The text string. |
| text_id | Any | The ID of the text. |
Source code in packages/xization/src/dataknobs_xization/masking_tokenizer.py
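The core idea — one row per character, one boolean column per feature — can be sketched with plain pandas. This is illustrative only; the actual cdf columns, padding, and metadata handling are defined by the package:

```python
import pandas as pd

def char_feature_frame(text: str) -> pd.DataFrame:
    """Build a per-character feature dataframe: one row per character."""
    return pd.DataFrame({
        "char": list(text),
        "alpha": [c.isalpha() for c in text],
        "digit": [c.isdigit() for c in text],
        "upper": [c.isupper() for c in text],
        "lower": [c.islower() for c in text],
    })
```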
TextFeatures¶
dataknobs_xization.masking_tokenizer.TextFeatures ¶
```python
TextFeatures(
    doctext: Union[Text, str],
    split_camelcase: bool = True,
    mark_alpha: bool = False,
    mark_digit: bool = False,
    mark_upper: bool = False,
    mark_lower: bool = False,
    emoji_data: EmojiData = None,
)
```
Bases: CharacterFeatures
Extracts text-specific character features for tokenization.
Extends CharacterFeatures to provide text tokenization with support for camelCase splitting, character type features (alpha, digit, upper, lower), and emoji handling. Builds a character DataFrame with features for token boundary detection.
Initialize with text tokenization parameters.
Note
If emoji_data is non-null:

- emojis will be treated as text (instead of as non-text)
- if split_camelcase is True, each emoji will be in its own token
- otherwise, each sequence of (adjacent) emojis will be treated as a single token
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| doctext | Union[Text, str] | The text to tokenize with its metadata. | required |
| split_camelcase | bool | True to mark camel-case features. | True |
| mark_alpha | bool | True to mark alpha features (separate from alnum). | False |
| mark_digit | bool | True to mark digit features (separate from alnum). | False |
| mark_upper | bool | True to mark upper features (auto-included for camel-case). | False |
| mark_lower | bool | True to mark lower features (auto-included for camel-case). | False |
| emoji_data | EmojiData | An EmojiData instance to mark emoji BIO features. | None |

Methods:

| Name | Description |
|---|---|
| build_first_token | Build the first token as the start of tokenization. |

Attributes:

| Name | Type | Description |
|---|---|---|
| cdf | DataFrame | The character dataframe with each padded text character as a row. |

Source code in packages/xization/src/dataknobs_xization/masking_tokenizer.py
Functions¶
build_first_token ¶
Build the first token as the start of tokenization.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| normalize_fn | Callable[[str], str] | A function to normalize a raw text term or any of its variations. If None, then the identity function is used. | required |

Returns:

| Type | Description |
|---|---|
| Token | The first text token. |
Source code in packages/xization/src/dataknobs_xization/masking_tokenizer.py
annotations¶
Functions and Classes¶
dataknobs_xization.annotations ¶
Text annotation data structures and interfaces.
Provides classes for managing text annotations with metadata, including position tracking, annotation types, and derived annotation columns.
Classes:

| Name | Description |
|---|---|
| AnnotatedText | A Text object that manages its own annotations. |
| Annotations | DAO for collecting and managing a table of annotations, where each row carries annotation information for an input token. |
| AnnotationsBuilder | A class for building annotations. |
| AnnotationsGroup | Container for annotation rows that belong together as a (consistent) group. |
| AnnotationsGroupList | Container for a list of annotation groups. |
| AnnotationsMetaData | Container for annotations meta-data, identifying key column names. |
| AnnotationsRowAccessor | A class that accesses row data according to the metadata and derived cols. |
| Annotator | Class for annotating text. |
| AnnotatorKernel | Class for encapsulating core annotation logic for multiple annotators. |
| BasicAnnotator | Class for extracting basic (possibly multi-level or multi-part) entities. |
| CompoundAnnotator | Class to apply a series of annotators through an AnnotatorKernel. |
| DerivedAnnotationColumns | Interface for injecting derived columns into AnnotationsMetaData. |
| EntityAnnotator | Class for extracting single (possibly multi-level or multi-part) entities. |
| HtmlHighlighter | Helper class to add HTML markup for highlighting spans of text. |
| MergeStrategy | A merge strategy to be injected based on entity types being merged. |
| OverlapGroupIterator | Given: |
| PositionalAnnotationsGroup | Container for annotations that either overlap with each other or don't. |
| RowData | A wrapper for an annotation row (pd.Series) to facilitate e.g., grouping. |
| SyntacticParser | Class for creating syntactic annotations for an input. |

Functions:

| Name | Description |
|---|---|
| merge | Merge the overlapping groups according to the given strategy. |
Classes¶
AnnotatedText ¶
```python
AnnotatedText(
    text_str: str,
    metadata: TextMetaData = None,
    annots: Annotations = None,
    bookmarks: Dict[str, DataFrame] = None,
    text_obj: Text = None,
    annots_metadata: AnnotationsMetaData = None,
)
```
Bases: Text
A Text object that manages its own annotations.
Initialize AnnotatedText.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| text_str | str | The text string. | required |
| metadata | TextMetaData | The text's metadata. | None |
| annots | Annotations | The annotations. | None |
| bookmarks | Dict[str, DataFrame] | The annotation bookmarks. | None |
| text_obj | Text | A text_obj to override text_str and metadata initialization. | None |
| annots_metadata | AnnotationsMetaData | Override for default annotations metadata (NOTE: ineffectual if an annots instance is provided). | None |

Methods:

| Name | Description |
|---|---|
| add_annotations | Add the annotations to this instance. |
| get_annot_mask | Get a True/False series marking annotated start-to-end positions. |
| get_text | Get the text object's string, masking if indicated. |
| get_text_series | Get the input text as a (padded) pandas series. |

Attributes:

| Name | Type | Description |
|---|---|---|
| annotations | Annotations | Get this object's annotations. |
| bookmarks | Dict[str, DataFrame] | Get this object's bookmarks. |
Source code in packages/xization/src/dataknobs_xization/annotations.py
Functions¶
add_annotations ¶
Add the annotations to this instance.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| annotations | Annotations | The annotations to add. | required |
Source code in packages/xization/src/dataknobs_xization/annotations.py
get_annot_mask ¶
```python
get_annot_mask(
    annot_col: str,
    pad_len: int = 0,
    annot_df: DataFrame = None,
    text: str = None,
) -> pd.Series
```

Get a True/False series for the input such that start to end positions for rows where the annotation column is non-null and non-empty are True.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| annot_col | str | The annotation column identifying chars to mask. | required |
| pad_len | int | The number of characters to pad the mask with False values at both the front and back. | 0 |
| annot_df | DataFrame | Override annotations dataframe. | None |
| text | str | Override text. | None |

Returns:

| Type | Description |
|---|---|
| Series | A pandas Series where annotated input character positions are True and non-annotated positions are False. |
Source code in packages/xization/src/dataknobs_xization/annotations.py
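The masking idea can be sketched independently of the class. This is a hypothetical standalone helper over explicit (start, end) spans; the real method reads the spans from the annotations dataframe:

```python
from typing import Iterable, Tuple
import pandas as pd

def annot_mask(
    text: str,
    spans: Iterable[Tuple[int, int]],
    pad_len: int = 0,
) -> pd.Series:
    """True at annotated character positions, False elsewhere (with padding)."""
    mask = [False] * (len(text) + 2 * pad_len)
    for start, end in spans:
        for i in range(start, end):
            mask[i + pad_len] = True
    return pd.Series(mask)
```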
get_text ¶
Get the text object's string, masking if indicated.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| annot2mask | Dict[str, str] | Mapping from annotation column (e.g., _num or _recsnum) to the replacement character(s) in the input text for masking already managed input. | None |
| annot_df | DataFrame | Override annotations dataframe. | None |
| text | str | Override text. | None |

Returns:

| Type | Description |
|---|---|
| str | The (masked) text. |
Source code in packages/xization/src/dataknobs_xization/annotations.py
get_text_series ¶
Get the input text as a (padded) pandas series.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| pad_len | int | The number of spaces to pad both front and back. | 0 |
| text | str | Override text. | None |

Returns:

| Type | Description |
|---|---|
| Series | The (padded) pandas series of input characters. |
Source code in packages/xization/src/dataknobs_xization/annotations.py
Annotations ¶
DAO for collecting and managing a table of annotations, where each row carries annotation information for an input token.
The data in this class is maintained either as a list of dicts, each dict representing a "row," or as a pandas DataFrame, depending on the latest access. Changes in either the lists or dataframe will be reflected in the alternate data structure.
Construct as empty or initialize with the dataframe form.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| metadata | AnnotationsMetaData | The annotations metadata. | required |
| df | DataFrame | A dataframe with annotation records. | None |

Methods:

| Name | Description |
|---|---|
| add_df | Add (concatenate) the annotation dataframe to the current annotations. |
| add_dict | Add the annotation dict. |
| add_dicts | Add the annotation dicts. |
| clear | Clear/empty out all annotations, returning the annotations df. |
| set_df | Set (or reset) this annotation's dataframe. |

Attributes:

| Name | Type | Description |
|---|---|---|
| ann_row_dicts | List[Dict[str, Any]] | Get the annotations as a list of dictionaries. |
| df | DataFrame | Get the annotations as a pandas dataframe. |
Source code in packages/xization/src/dataknobs_xization/annotations.py
Attributes¶
ann_row_dicts (property)¶
Get the annotations as a list of dictionaries.
Functions¶
add_df ¶
Add (concatenate) the annotation dataframe to the current annotations.
add_dict ¶
add_dicts ¶
clear ¶
Clear/empty out all annotations, returning the annotations df.
set_df ¶
Set (or reset) this annotation's dataframe.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | The new annotations dataframe. | required |
AnnotationsBuilder ¶
A class for building annotations.
Initialize AnnotationsBuilder.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| metadata | AnnotationsMetaData | The annotations metadata. | required |
| data_defaults | Dict[str, Any] | Dict[ann_colname, default_value] with default values for annotation columns. | required |

Methods:

| Name | Description |
|---|---|
| build_annotation_row | Build an annotation row with the mandatory key values and those from the remaining keyword arguments. |
| do_build_row | Do the row building with the key fields, followed by data defaults, followed by any extra kwargs. |

Source code in packages/xization/src/dataknobs_xization/annotations.py
Functions¶
build_annotation_row ¶

```python
build_annotation_row(
    start_pos: int, end_pos: int, text: str, ann_type: str, **kwargs: Any
) -> Dict[str, Any]
```

Build an annotation row with the mandatory key values and those from the remaining keyword arguments.
For those kwargs whose names match metadata column names, override the data_defaults and add remaining data_default attributes.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| start_pos | int | The token start position. | required |
| end_pos | int | The token end position. | required |
| text | str | The token text. | required |
| ann_type | str | The annotation type. | required |
| **kwargs | Any | Additional keyword arguments for extra annotation fields. | {} |

Returns:

| Type | Description |
|---|---|
| Dict[str, Any] | The result row dictionary. |
Source code in packages/xization/src/dataknobs_xization/annotations.py
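The precedence the docstring describes — key fields first, then data defaults, with matching kwargs overriding the defaults — can be sketched as a hypothetical standalone helper (the `data_defaults` parameter and key-column names here are assumptions; in the class, defaults come from the builder's state and metadata):

```python
from typing import Any, Dict

def build_annotation_row(
    start_pos: int,
    end_pos: int,
    text: str,
    ann_type: str,
    data_defaults: Dict[str, Any] = None,
    **kwargs: Any,
) -> Dict[str, Any]:
    """Key fields first, then defaults, then kwargs (kwargs win on conflicts)."""
    row = {
        "start_pos": start_pos,
        "end_pos": end_pos,
        "text": text,
        "ann_type": ann_type,
    }
    row.update(data_defaults or {})
    row.update(kwargs)  # kwargs override matching data_defaults
    return row
```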
do_build_row ¶
Do the row building with the key fields, followed by data defaults, followed by any extra kwargs.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| key_fields | Dict[str, Any] | The dictionary of key fields. | required |
| **kwargs | Any | Any extra fields to add. | {} |

Returns:

| Type | Description |
|---|---|
| Dict[str, Any] | The constructed row dictionary. |
Source code in packages/xization/src/dataknobs_xization/annotations.py
AnnotationsGroup ¶
```python
AnnotationsGroup(
    row_accessor: AnnotationsRowAccessor,
    field_col_type: str,
    accept_fn: Callable[[AnnotationsGroup, RowData], bool],
    group_type: str = None,
    group_num: int = None,
    valid: bool = True,
    autolock: bool = False,
)
```
Container for annotation rows that belong together as a (consistent) group.
NOTE: An instance will only accept rows on condition of consistency per its acceptance function.
Initialize AnnotationsGroup.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| row_accessor | AnnotationsRowAccessor | The annotations row_accessor. | required |
| field_col_type | str | The col_type for the group field_type for retrieval using the annotations row accessor. | required |
| accept_fn | Callable[[AnnotationsGroup, RowData], bool] | A fn(g, row_data) that returns True to accept the row data into this group g, or False to reject the row. If None, then all rows are always accepted. | required |
| group_type | str | An optional (override) type for identifying this group. | None |
| group_num | int | An optional number for identifying this group. | None |
| valid | bool | True if the group is valid, or False if not. | True |
| autolock | bool | True to automatically lock this group when (1) at least one row has been added and (2) a row is rejected. | False |

Methods:

| Name | Description |
|---|---|
| add | Add the row if the group is not locked and the row belongs in this group. |
| is_subset | Determine whether this group's text is contained within the other's. |
| is_subset_of_any | Determine whether this group is a subset of any of the given groups. |
| remove_row | Remove the row from this group and optionally update the annotations accordingly. |
| to_dict | Get this group (record) as a dictionary of field type to text values. |

Attributes:

| Name | Type | Description |
|---|---|---|
| ann_type | str | Get this record's annotation type. |
| autolock | bool | Get whether this group is currently set to autolock. |
| df | DataFrame | Get this group as a dataframe. |
| group_num | int | Get this group's number. |
| group_type | str | Get this group's type: an "override" value, or the "ann_type" of the first row added. |
| is_locked | bool | Get whether this group is locked from adding more rows. |
| is_valid | bool | Get whether this group is currently marked as valid. |
| key | str | A hash key for this group. |
| size | int | Get the number of rows in this group. |
Source code in packages/xization/src/dataknobs_xization/annotations.py
Attributes¶
group_type (property, writable)¶
Get this group's type, which is either an "override" value that has been set, or the "ann_type" value of the first row added.
is_locked (property, writable)¶
Get whether this group is locked from adding more rows.
Functions¶
add ¶
Add the row if the group is not locked and the row belongs in this group, or return False.
If autolock is True and a row fails to be added (after the first row has been added), "lock" the group and refuse to accept any more rows.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| rowdata | RowData | The row to add. | required |

Returns:

| Type | Description |
|---|---|
| bool | True if the row belongs and was added; otherwise, False. |
Source code in packages/xization/src/dataknobs_xization/annotations.py
is_subset ¶
Determine whether this group's text is contained within the other's.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| other | AnnotationsGroup | The other group. | required |

Returns:

| Type | Description |
|---|---|
| bool | True if this group's text is contained within the other group. |

Source code in packages/xization/src/dataknobs_xization/annotations.py
is_subset_of_any ¶
Determine whether this group is a subset of any of the given groups.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| groups | List[AnnotationsGroup] | List of annotation groups. | required |

Returns:

| Type | Description |
|---|---|
| AnnotationsGroup | The first AnnotationsGroup that this group is a subset of, or None. |

Source code in packages/xization/src/dataknobs_xization/annotations.py
remove_row ¶
Remove the row from this group and optionally update the annotations accordingly.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| row_idx | int | The positional index of the row to remove. | required |

Returns:

| Type | Description |
|---|---|
| RowData | The removed row data instance. |
Source code in packages/xization/src/dataknobs_xization/annotations.py
to_dict ¶
Get this group (record) as a dictionary of field type to text values.
AnnotationsGroupList ¶
```python
AnnotationsGroupList(
    groups: List[AnnotationsGroup] = None,
    accept_fn: Callable[
        [AnnotationsGroupList, AnnotationsGroup], bool
    ] = lambda lst, g: lst.size == 0 or not g.is_subset_of_any(lst.groups),
)
```
Container for a list of annotation groups.
Initialize AnnotationsGroupList.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| groups | List[AnnotationsGroup] | The initial groups for this list. | None |
| accept_fn | Callable[[AnnotationsGroupList, AnnotationsGroup], bool] | A fn(lst, g) that returns True to accept the group, g, into this list, lst, or False to reject the group. If None, then all groups are always accepted. The default function will reject any group that is a subset of any existing group in the list. | lambda lst, g: lst.size == 0 or not g.is_subset_of_any(lst.groups) |

Methods:

| Name | Description |
|---|---|
| add | Add the group if it belongs in this group list or return False. |
| is_subset | Determine whether this group's text spans are contained within all of the other's. |

Attributes:

| Name | Type | Description |
|---|---|---|
| coverage | int | Get the total number of (token) rows covered by the groups. |
| size | int | Get the number of groups in this list. |
Source code in packages/xization/src/dataknobs_xization/annotations.py
Functions¶
add ¶
Add the group if it belongs in this group list or return False.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| group | AnnotationsGroup | The group to add. | required |

Returns:

| Type | Description |
|---|---|
| bool | True if the group belongs and was added; otherwise, False. |

Source code in packages/xization/src/dataknobs_xization/annotations.py
is_subset ¶
Determine whether this group's text spans are contained within all of the other's.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| other | AnnotationsGroupList | The other group list. | required |

Returns:

| Type | Description |
|---|---|
| bool | True if this group list is a subset of the other group list. |
Source code in packages/xization/src/dataknobs_xization/annotations.py
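The span-containment test behind is_subset can be sketched over plain (start, end) tuples. This is illustrative; the real classes compare annotation rows, not bare tuples:

```python
from typing import Iterable, Tuple

Span = Tuple[int, int]

def spans_subset(spans_a: Iterable[Span], spans_b: Iterable[Span]) -> bool:
    """True if every span in spans_a lies within some span in spans_b."""
    spans_b = list(spans_b)  # allow repeated iteration
    return all(
        any(b_start <= a_start and a_end <= b_end for b_start, b_end in spans_b)
        for a_start, a_end in spans_a
    )
```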
AnnotationsMetaData ¶
```python
AnnotationsMetaData(
    start_pos_col: str = KEY_START_POS_COL,
    end_pos_col: str = KEY_END_POS_COL,
    text_col: str = KEY_TEXT_COL,
    ann_type_col: str = KEY_ANN_TYPE_COL,
    sort_fields: List[str] = (KEY_START_POS_COL, KEY_END_POS_COL),
    sort_fields_ascending: List[bool] = (True, False),
    **kwargs: Any,
)
```
Bases: MetaData
Container for annotations meta-data, identifying key column names.
NOTE: this object contains only information about annotation column names and not annotation table values.
Initialize with key (and more) column names and info.
Key column types
- start_pos
- end_pos
- text
- ann_type
Note
Actual table columns can be named arbitrarily, BUT interactions through annotations classes and interfaces relating to the "key" columns must use the key column constants.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| start_pos_col | str | Col name for the token starting position. | KEY_START_POS_COL |
| end_pos_col | str | Col name for the token ending position. | KEY_END_POS_COL |
| text_col | str | Col name for the token text. | KEY_TEXT_COL |
| ann_type_col | str | Col name for the annotation types. | KEY_ANN_TYPE_COL |
| sort_fields | List[str] | The col types relevant for sorting annotation rows. | (KEY_START_POS_COL, KEY_END_POS_COL) |
| sort_fields_ascending | List[bool] | To specify sort order of sort_fields. | (True, False) |
| **kwargs | Any | More column types mapped to column names. | {} |

Methods:

| Name | Description |
|---|---|
| get_col | Get the name of the column having the given type, or get the missing value. |
| sort_df | Sort an annotations dataframe according to this metadata. |

Attributes:

| Name | Type | Description |
|---|---|---|
| ann_type_col | str | Get the column name for the token annotation type. |
| end_pos_col | str | Get the column name for the token ending position. |
| start_pos_col | str | Get the column name for the token starting position. |
| text_col | str | Get the column name for the token text. |
Source code in packages/xization/src/dataknobs_xization/annotations.py
Functions¶
get_col ¶
Get the name of the column having the given type (including key column types but not derived ones), or get the missing value.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| col_type | str | The type of column name to get. | required |
| missing | str | The value to return for unknown column types. | None |

Returns:

| Type | Description |
|---|---|
| str | The column name or the missing value. |
Source code in packages/xization/src/dataknobs_xization/annotations.py
sort_df ¶
Sort an annotations dataframe according to this metadata.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| an_df | DataFrame | An annotations dataframe. | required |

Returns:

| Type | Description |
|---|---|
| DataFrame | The sorted annotations dataframe. |
Source code in packages/xization/src/dataknobs_xization/annotations.py
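Given the default sort_fields and sort_fields_ascending above (start ascending, then end descending, so longer spans sort first at the same start), the sort can be sketched with pandas. The literal column names here are assumptions; the real method resolves them through the metadata:

```python
import pandas as pd

def sort_annotations(an_df: pd.DataFrame) -> pd.DataFrame:
    """Sort by start_pos ascending, then end_pos descending (longest span first)."""
    return an_df.sort_values(
        by=["start_pos", "end_pos"], ascending=[True, False]
    ).reset_index(drop=True)
```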
AnnotationsRowAccessor ¶
```python
AnnotationsRowAccessor(
    metadata: AnnotationsMetaData, derived_cols: DerivedAnnotationColumns = None
)
```
A class that accesses row data according to the metadata and derived cols.
Initialize AnnotationsRowAccessor.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| metadata | AnnotationsMetaData | The metadata for annotation columns. | required |
| derived_cols | DerivedAnnotationColumns | A DerivedAnnotationColumns instance for injecting derived columns. | None |

Methods:

| Name | Description |
|---|---|
| get_col_value | Get the value of the column in the given row with the given type. |

Source code in packages/xization/src/dataknobs_xization/annotations.py
Functions¶
get_col_value ¶
Get the value of the column in the given row with the given type.
This gets the value from the first existing column in the row from
- The metadata.get_col(col_type) column
- col_type itself
- The columns derived from col_type
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| col_type | str | The type of column value to get. | required |
| row | Series | A row from which to get the value. | required |
| missing | str | The value to return for unknown or missing column. | None |

Returns:

| Type | Description |
|---|---|
| str | The row value or the missing value. |
Source code in packages/xization/src/dataknobs_xization/annotations.py
Annotator ¶
Bases: ABC
Class for annotating text
Initialize Annotator.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| name | str | The name of this annotator. | required |

Methods:

| Name | Description |
|---|---|
| annotate_input | Annotate this instance's text, additively updating its annotations. |
Source code in packages/xization/src/dataknobs_xization/annotations.py
Functions¶
annotate_input (abstractmethod)¶
Annotate this instance's text, additively updating its annotations.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| text_obj | AnnotatedText | The text object to annotate. | required |
| **kwargs | Any | Additional keyword arguments. | {} |

Returns:

| Type | Description |
|---|---|
| Annotations | The annotations added. |
Source code in packages/xization/src/dataknobs_xization/annotations.py
AnnotatorKernel ¶
Bases: ABC
Class for encapsulating core annotation logic for multiple annotators
Methods:

| Name | Description |
|---|---|
| annotate_input | Execute all annotations on the text_obj. |

Attributes:

| Name | Type | Description |
|---|---|---|
| annotators | List[EntityAnnotator] | Get the entity annotators. |
BasicAnnotator ¶
Bases: Annotator
Class for extracting basic (possibly multi-level or multi-part) entities.
Methods:

| Name | Description |
|---|---|
| annotate_input | Annotate the text obj, additively updating the annotations. |
| annotate_text | Build annotations for the text string. |
Source code in packages/xization/src/dataknobs_xization/annotations.py
Functions¶
annotate_input ¶
Annotate the text obj, additively updating the annotations.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| text_obj | AnnotatedText | The text to annotate. | required |
| **kwargs | Any | Additional keyword arguments. | {} |

Returns:

| Type | Description |
|---|---|
| Annotations | The annotations added to the text. |
Source code in packages/xization/src/dataknobs_xization/annotations.py
annotate_text (abstractmethod)¶
Build annotations for the text string.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| text_str | str | The text string to annotate. | required |

Returns:

| Type | Description |
|---|---|
| Annotations | Annotations for the text. |
Source code in packages/xization/src/dataknobs_xization/annotations.py
CompoundAnnotator ¶
Bases: Annotator
Class to apply a series of annotators through an AnnotatorKernel
Initialize with the annotators and this extractor's name.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| kernel | AnnotatorKernel | The annotations kernel to use. | required |
| name | str | The name of this information extractor, used as the annotations base column name. | 'entity' |
Methods:
| Name | Description |
|---|---|
| annotate_input | Annotate the text. |
| get_html_highlighted_text | Get html-highlighted text for the identified input's annotations from the given annotators (or all). |
Source code in packages/xization/src/dataknobs_xization/annotations.py
Functions¶
annotate_input ¶
Annotate the text.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| text_obj | AnnotatedText | The AnnotatedText object to annotate. | required |
| reset | bool | When True, reset and rebuild any existing annotations. | True |
| **kwargs | Any | Additional keyword arguments. | {} |
Returns:
| Type | Description |
|---|---|
| Annotations | The annotations added to the text_obj. |
Source code in packages/xization/src/dataknobs_xization/annotations.py
get_html_highlighted_text ¶
Get html-highlighted text for the identified input's annotations from the given annotators (or all).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| text_obj | AnnotatedText | The input text to highlight. | required |
| annotator_names | List[str] | The subset of annotators to highlight. | None |
Returns:
| Type | Description |
|---|---|
| str | HTML string with highlighted text. |
Source code in packages/xization/src/dataknobs_xization/annotations.py
DerivedAnnotationColumns ¶
Bases: ABC
Interface for injecting derived columns into AnnotationsMetaData.
Methods:
| Name | Description |
|---|---|
| get_col_value | Get the value of the column in the given row derived from col_type. |
Functions¶
get_col_value
abstractmethod
¶
get_col_value(
metadata: AnnotationsMetaData,
col_type: str,
row: Series,
missing: str = None,
) -> str
Get the value of the column in the given row derived from col_type.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| metadata | AnnotationsMetaData | The AnnotationsMetaData. | required |
| col_type | str | The type of column value to derive. | required |
| row | Series | A row from which to get the value. | required |
| missing | str | The value to return for unknown or missing column. | None |
Returns:
| Type | Description |
|---|---|
| str | The row value or the missing value. |
Source code in packages/xization/src/dataknobs_xization/annotations.py
EntityAnnotator ¶
Bases: BasicAnnotator
Class for extracting single (possibly multi-level or multi-part) entities.
Initialize EntityAnnotator.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| name | str | The name of this annotator. | required |
| mask_char | str | The character to use to mask out previously annotated spans of this annotator's text. | ' ' |
Methods:
| Name | Description |
|---|---|
| annotate_input | Annotate the text object (optionally) after masking out previously annotated spans. |
| compose_groups | Compose annotation rows into groups. |
| mark_records | Collect and mark annotation records. |
| validate_records | Validate annotated records. |
Attributes:
| Name | Type | Description |
|---|---|---|
| annotation_cols | Set[str] | Report the (final group or record) annotation columns that are filled by this annotator. |
| highlight_fieldstyles | Dict[str, Dict[str, Dict[str, str]]] | Get highlight field styles for this annotator's annotations. |
Source code in packages/xization/src/dataknobs_xization/annotations.py
Attributes¶
annotation_cols
abstractmethod
property
¶
Report the (final group or record) annotation columns that are filled by this annotator when its entities are annotated.
highlight_fieldstyles
abstractmethod
property
¶
Get highlight field styles for this annotator's annotations of the form:
{
Functions¶
annotate_input ¶
annotate_input(
text_obj: AnnotatedText,
annot_mask_cols: Set[str] = None,
merge_strategies: Dict[str, MergeStrategy] = None,
largest_only: bool = True,
**kwargs: Any,
) -> Annotations
Annotate the text object (optionally) after masking out previously annotated spans, additively updating the annotations in the text object.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| text_obj | AnnotatedText | The text object to annotate. | required |
| annot_mask_cols | Set[str] | The (possible) previous annotations whose spans to ignore in the text. | None |
| merge_strategies | Dict[str, MergeStrategy] | A dictionary of each input annotation bookmark tag mapped to a merge strategy for merging this annotator's annotations with the bookmarked dataframe. This is useful, for example, when merging syntactic information to refine ambiguities. | None |
| largest_only | bool | True to only mark the largest records. | True |
| **kwargs | Any | Additional keyword arguments. | {} |
Returns:
| Type | Description |
|---|---|
| Annotations | The annotations added to the text object. |
Source code in packages/xization/src/dataknobs_xization/annotations.py
compose_groups
abstractmethod
¶
Compose annotation rows into groups.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| annotations | Annotations | The annotations. | required |
Returns:
| Type | Description |
|---|---|
| Annotations | The composed annotations. |
Source code in packages/xization/src/dataknobs_xization/annotations.py
mark_records
abstractmethod
¶
Collect and mark annotation records.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| annotations | Annotations | The annotations. | required |
| largest_only | bool | True to only mark (keep) the largest records. | True |
Source code in packages/xization/src/dataknobs_xization/annotations.py
validate_records
abstractmethod
¶
Validate annotated records.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| annotations | Annotations | The annotations. | required |
HtmlHighlighter ¶
HtmlHighlighter(
field2style: Dict[str, Dict[str, str]],
tooltip_class: str = "tooltip",
tooltiptext_class: str = "tooltiptext",
)
Helper class to add HTML markup for highlighting spans of text.
Initialize HtmlHighlighter.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| field2style | Dict[str, Dict[str, str]] | The annotation column to highlight with its associated style, for example: { 'car_model_field': { 'year': {'background-color': 'lightyellow'}, 'make': {'background-color': 'lightgreen'}, 'model': {'background-color': 'cyan'}, 'style': {'background-color': 'magenta'}, }, } | required |
| tooltip_class | str | The css tooltip class. | 'tooltip' |
| tooltiptext_class | str | The css tooltiptext class. | 'tooltiptext' |
Methods:
| Name | Description |
|---|---|
| highlight | Return an html string with the given fields (annotation columns) highlighted with the associated styles. |
Source code in packages/xization/src/dataknobs_xization/annotations.py
Functions¶
highlight ¶
Return an html string with the given fields (annotation columns) highlighted with the associated styles.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| text_obj | AnnotatedText | The annotated text to markup. | required |
Returns:
| Type | Description |
|---|---|
| str | HTML string with highlighted annotations. |
Source code in packages/xization/src/dataknobs_xization/annotations.py
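The field2style mapping drives inline-style markup. A minimal, self-contained sketch of how a css-property dict could become a highlighted HTML span (illustrative only, not the library's implementation):

```python
def style_attr(style: dict) -> str:
    """Render a css property dict as an inline style attribute value."""
    return "; ".join(f"{k}: {v}" for k, v in style.items())

def wrap_span(text: str, style: dict) -> str:
    """Wrap text in a <span> carrying the inline style."""
    return f'<span style="{style_attr(style)}">{text}</span>'

print(wrap_span("2021", {"background-color": "lightyellow"}))
# → <span style="background-color: lightyellow">2021</span>
```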
MergeStrategy ¶
Bases: ABC
A merge strategy to be injected based on entity types being merged.
Methods:
| Name | Description |
|---|---|
| merge | Process the annotations in the given annotations group, returning the result. |
OverlapGroupIterator ¶
Given annotation rows (a dataframe) sorted by:
- start_pos (increasing, for input order), and
- end_pos (decreasing, for longest spans first)

collect overlapping consecutive annotations for processing.
Initialize OverlapGroupIterator.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| an_df | DataFrame | An annotations.as_df DataFrame, sliced and sorted. | required |
Source code in packages/xization/src/dataknobs_xization/annotations.py
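The grouping behavior can be illustrated with plain (start_pos, end_pos) tuples. This standalone sketch assumes the documented sort order (start_pos increasing, end_pos decreasing); it is not the library's implementation:

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start_pos, end_pos)

def group_overlaps(spans: List[Span]) -> List[List[Span]]:
    """Group consecutive overlapping spans, assuming the input is sorted
    by start_pos ascending, then end_pos descending."""
    groups: List[List[Span]] = []
    cur: List[Span] = []
    cur_end = -1
    for s in spans:
        if cur and s[0] < cur_end:   # overlaps the running group
            cur.append(s)
            cur_end = max(cur_end, s[1])
        else:                        # start a new group
            cur = [s]
            cur_end = s[1]
            groups.append(cur)
    return groups

print(group_overlaps([(0, 5), (3, 8), (10, 12)]))
# → [[(0, 5), (3, 8)], [(10, 12)]]
```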
Functions¶
PositionalAnnotationsGroup ¶
Bases: AnnotationsGroup
Container for annotations that either overlap with each other or don't.
Initialize PositionalAnnotationsGroup.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| overlap | bool | If False, then only accept rows that don't overlap; else only accept rows that do overlap. | required |
| rectype | str | The record type. | None |
| gnum | int | The group number. | -1 |
Methods:
| Name | Description |
|---|---|
| belongs | Determine if the row belongs in this instance based on its overlap or not. |
Source code in packages/xization/src/dataknobs_xization/annotations.py
Functions¶
belongs ¶
Determine if the row belongs in this instance based on its overlap or not.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| rowdata | RowData | The rowdata to test. | required |
Returns:
| Type | Description |
|---|---|
| bool | True if the rowdata belongs in this instance. |
Source code in packages/xization/src/dataknobs_xization/annotations.py
RowData ¶
A wrapper for an annotation row (pd.Series) to facilitate e.g., grouping.
Methods:
| Name | Description |
|---|---|
| is_subset | Determine whether this row's span is a subset of the other. |
| is_subset_of_any | Determine whether this row is a subset of any of the others |
Source code in packages/xization/src/dataknobs_xization/annotations.py
Functions¶
is_subset ¶
Determine whether this row's span is a subset of the other.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| other_row | RowData | The other row. | required |
Returns:
| Type | Description |
|---|---|
| bool | True if this row's span is a subset of the other row's span. |
Source code in packages/xization/src/dataknobs_xization/annotations.py
is_subset_of_any ¶
Determine whether this row is a subset of any of the others according to text span coverage.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| other_rows | List[RowData] | The rows to test for this to be a subset of any. | required |
Returns:
| Type | Description |
|---|---|
| bool | True if this row is a subset of any of the other rows. |
Source code in packages/xization/src/dataknobs_xization/annotations.py
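The span-coverage tests above can be illustrated with plain (start_pos, end_pos) tuples. This is a standalone sketch of the subset logic, not the library's code:

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start_pos, end_pos)

def is_subset(span: Span, other: Span) -> bool:
    """True when span is covered by the other's text span."""
    return other[0] <= span[0] and span[1] <= other[1]

def is_subset_of_any(span: Span, others: List[Span]) -> bool:
    """True when span is covered by at least one of the others."""
    return any(is_subset(span, o) for o in others)

print(is_subset((2, 5), (0, 10)))                  # → True
print(is_subset_of_any((2, 5), [(6, 9), (0, 4)]))  # → False
```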
SyntacticParser ¶
Bases: BasicAnnotator
Class for creating syntactic annotations for an input.
Methods:
| Name | Description |
|---|---|
| annotate_input | Annotate the text, additively updating the annotations. |
Source code in packages/xization/src/dataknobs_xization/annotations.py
Functions¶
annotate_input ¶
Annotate the text, additively updating the annotations.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| text_obj | AnnotatedText | The text to annotate. | required |
| **kwargs | Any | Additional keyword arguments. | {} |
Returns:
| Type | Description |
|---|---|
| Annotations | The annotations added to the text. |
Source code in packages/xization/src/dataknobs_xization/annotations.py
Functions¶
merge ¶
Merge the overlapping groups according to the given strategy.
Source code in packages/xization/src/dataknobs_xization/annotations.py
authorities¶
Functions and Classes¶
dataknobs_xization.authorities ¶
Authority-based annotation processing and field grouping.
Provides classes for managing authority-based annotations, field groups, and derived annotation columns for structured text extraction.
Classes:
| Name | Description |
|---|---|
| AnnotationsValidator | A base class with helper functions for performing validations on annotation rows. |
| AuthoritiesBundle | An authority for expressing values through multiple bundled "authorities". |
| Authority | A class for managing and defining tabular authoritative data for e.g., taxonomies. |
| AuthorityAnnotationsBuilder | An extension of an AnnotationsBuilder that adds the 'auth_id' column. |
| AuthorityAnnotationsMetaData | An extension of AnnotationsMetaData that adds an 'auth_id_col' to the standard (key) annotation columns. |
| AuthorityData | A wrapper for authority data. |
| AuthorityFactory | A factory class for building an authority. |
| DerivedFieldGroups | Defines derived column types (field_type, field_group, field_record). |
| LexicalAuthority | A class for managing named entities by ID with associated values and variations. |
| RegexAuthority | A class for managing named entities by ID with associated values and variations. |
Classes¶
AnnotationsValidator ¶
Bases: ABC
A base class with helper functions for performing validations on annotation rows.
Classes:
| Name | Description |
|---|---|
| AuthAnnotations | A wrapper class for convenient access to the entity annotations. |
Methods:
| Name | Description |
|---|---|
| __call__ | Call function to enable instances of this type of class to be passed in as an anns_validator function to an Authority. |
| validate_annotation_rows | Determine whether the proposed authority annotation rows are valid. |
Classes¶
AuthAnnotations ¶
A wrapper class for convenient access to the entity annotations.
Methods:
| Name | Description |
|---|---|
| colval | Get the column's value from the given row |
| get_field_type | Get the entity field type value |
| get_text | Get the entity text from the row |
Attributes:
| Name | Type | Description |
|---|---|---|
| anns | Annotations | Get this instance's annotation rows as an annotations object |
| attributes | Dict[str, str] | Get this instance's annotation entity attributes |
| df | DataFrame | Get the annotation's dataframe |
| row_accessor | AnnotationsRowAccessor | Get the row accessor for this instance's annotations. |
Source code in packages/xization/src/dataknobs_xization/authorities.py
Functions¶
__call__ ¶
Call function to enable instances of this type of class to be passed in as an anns_validator function to an Authority.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| auth | Authority | The authority proposing annotations. | required |
| ann_row_dicts | List[Dict[str, Any]] | The proposed annotations. | required |
Returns:
| Type | Description |
|---|---|
| bool | True if the annotations are valid; otherwise, False. |
Source code in packages/xization/src/dataknobs_xization/authorities.py
validate_annotation_rows
abstractmethod
¶
Determine whether the proposed authority annotation rows are valid.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| auth_annotations | AuthAnnotations | The AuthAnnotations instance with the proposed data. | required |
Returns:
| Type | Description |
|---|---|
| bool | True if valid; False if not. |
Source code in packages/xization/src/dataknobs_xization/authorities.py
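The __call__ hook lets a validator instance be passed wherever a plain anns_validator function is expected. A standalone sketch of the pattern (the MinLengthValidator name and the "text" key are illustrative assumptions, not part of the library):

```python
from typing import Any, Dict, List

class MinLengthValidator:
    """A callable validator instance, usable where a plain
    anns_validator function is expected (illustrative only)."""

    def __init__(self, min_len: int):
        self.min_len = min_len

    def __call__(self, auth: Any, ann_row_dicts: List[Dict[str, Any]]) -> bool:
        """Accept annotations only when every matched text is long enough."""
        return all(
            len(d.get("text", "")) >= self.min_len for d in ann_row_dicts
        )

validator = MinLengthValidator(min_len=3)
print(validator(None, [{"text": "abc"}, {"text": "hello"}]))  # → True
print(validator(None, [{"text": "ab"}]))                      # → False
```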
AuthoritiesBundle ¶
AuthoritiesBundle(
name: str,
auth_anns_builder: AuthorityAnnotationsBuilder = None,
authdata: AuthorityData = None,
field_groups: DerivedFieldGroups = None,
parent_auth: Authority = None,
anns_validator: Callable[[Authority, Dict[str, Any]], bool] = None,
auths: List[Authority] = None,
)
Bases: Authority
An authority for expressing values through multiple bundled "authorities" like dictionary-based and/or multiple regular expression patterns.
Initialize the AuthoritiesBundle.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| name | str | This authority's entity name. | required |
| auth_anns_builder | AuthorityAnnotationsBuilder | The authority annotations row builder to use for building annotation rows. | None |
| authdata | AuthorityData | The authority data. | None |
| field_groups | DerivedFieldGroups | The derived field groups to use. | None |
| anns_validator | Callable[[Authority, Dict[str, Any]], bool] | fn(auth, anns_dict_list) that returns True if the list of annotation row dicts are valid to be added as annotations for a single match or "entity". | None |
| parent_auth | Authority | This authority's parent authority (if any). | None |
| auths | List[Authority] | The authorities to bundle together. | None |
Methods:
| Name | Description |
|---|---|
| add | Add the authority to this bundle. |
| add_annotations | Method to do the work of finding, validating, and adding annotations. |
| has_value | Determine whether the given value is in this authority. |
Source code in packages/xization/src/dataknobs_xization/authorities.py
Functions¶
add ¶
Add the authority to this bundle.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| auth | Authority | The authority to add. | required |
add_annotations ¶
Method to do the work of finding, validating, and adding annotations.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| text_obj | AnnotatedText | The annotated text object to process and add annotations. | required |
Returns:
| Type | Description |
|---|---|
| Annotations | The added Annotations. |
Source code in packages/xization/src/dataknobs_xization/authorities.py
has_value ¶
Determine whether the given value is in this authority.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| value | Any | A possible authority value. | required |
Returns:
| Type | Description |
|---|---|
| bool | True if the value is a valid entity value. |
Source code in packages/xization/src/dataknobs_xization/authorities.py
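A bundle delegates lookups to its member authorities. A standalone sketch of the composite has_value behavior (SetAuthority and Bundle are illustrative stand-ins, not the real classes):

```python
from typing import Any, List, Optional

class SetAuthority:
    """A toy authority backed by a set of values (illustrative stand-in)."""
    def __init__(self, values: set):
        self.values = values
    def has_value(self, value: Any) -> bool:
        return value in self.values

class Bundle:
    """Delegates has_value to any bundled authority, mirroring the
    AuthoritiesBundle composite pattern (a sketch, not the real API)."""
    def __init__(self, auths: Optional[List[SetAuthority]] = None):
        self.auths = list(auths or [])
    def add(self, auth: SetAuthority) -> None:
        self.auths.append(auth)
    def has_value(self, value: Any) -> bool:
        return any(a.has_value(value) for a in self.auths)

bundle = Bundle([SetAuthority({"ford", "toyota"})])
bundle.add(SetAuthority({"honda"}))
print(bundle.has_value("honda"))  # → True
print(bundle.has_value("tesla"))  # → False
```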
Authority ¶
Authority(
name: str,
auth_anns_builder: AuthorityAnnotationsBuilder = None,
authdata: AuthorityData = None,
field_groups: DerivedFieldGroups = None,
anns_validator: Callable[[Authority, Dict[str, Any]], bool] = None,
parent_auth: Authority = None,
)
Bases: Annotator
A class for managing and defining tabular authoritative data for e.g., taxonomies, etc., and using them to annotate instances within text.
Initialize with this authority's metadata.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| name | str | This authority's entity name. | required |
| auth_anns_builder | AuthorityAnnotationsBuilder | The authority annotations row builder to use for building annotation rows. | None |
| authdata | AuthorityData | The authority data. | None |
| field_groups | DerivedFieldGroups | The derived field groups to use. | None |
| anns_validator | Callable[[Authority, Dict[str, Any]], bool] | fn(auth, anns_dict_list) that returns True if the list of annotation row dicts are valid to be added as annotations for a single match or "entity". | None |
| parent_auth | Authority | This authority's parent authority (if any). | None |
Methods:
| Name | Description |
|---|---|
| add_annotations | Method to do the work of finding, validating, and adding annotations. |
| annotate_input | Find and annotate this authority's entities in the document text. |
| build_annotation | Build annotations with the given components. |
| compose | Compose annotations into groups. |
| has_value | Determine whether the given value is in this authority. |
| validate_ann_dicts | Determine whether the annotation row dictionaries are valid. |
Attributes:
| Name | Type | Description |
|---|---|---|
| metadata | AuthorityAnnotationsMetaData | Get the meta-data |
| parent | Authority | Get this authority's parent, or None. |
Source code in packages/xization/src/dataknobs_xization/authorities.py
Attributes¶
Functions¶
add_annotations
abstractmethod
¶
Method to do the work of finding, validating, and adding annotations.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| text_obj | AnnotatedText | The annotated text object to process and add annotations. | required |
Returns:
| Type | Description |
|---|---|
| Annotations | The added Annotations. |
Source code in packages/xization/src/dataknobs_xization/authorities.py
annotate_input ¶
Find and annotate this authority's entities in the document text
as dictionaries like:
[
{
'input_id':
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| text_obj | Union[AnnotatedText, str] | The text object or string to process. | required |
| **kwargs | Any | Additional keyword arguments. | {} |
Returns:
| Type | Description |
|---|---|
| Annotations | An Annotations instance. |
Source code in packages/xization/src/dataknobs_xization/authorities.py
build_annotation ¶
build_annotation(
start_pos: int = None,
end_pos: int = None,
entity_text: str = None,
auth_value_id: Any = None,
conf: float = 1.0,
**kwargs,
) -> Dict[str, Any]
Build annotations with the given components.
Source code in packages/xization/src/dataknobs_xization/authorities.py
compose ¶
Compose annotations into groups.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| annotations | Annotations | The annotations. | required |
Returns:
| Type | Description |
|---|---|
| Annotations | Composed annotations. |
Source code in packages/xization/src/dataknobs_xization/authorities.py
has_value
abstractmethod
¶
Determine whether the given value is in this authority.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| value | Any | A possible authority value. | required |
Returns:
| Type | Description |
|---|---|
| bool | True if the value is a valid entity value. |
Source code in packages/xization/src/dataknobs_xization/authorities.py
validate_ann_dicts ¶
The annotation row dictionaries are valid if:
- they are non-empty, and
- either there is no annotations validator, or they are valid according to the validator.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| ann_dicts | List[Dict[str, Any]] | Annotation dictionaries. | required |
Returns:
| Type | Description |
|---|---|
| bool | True if valid. |
Source code in packages/xization/src/dataknobs_xization/authorities.py
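The validity rule can be sketched as a standalone function (simplified: the real validator callable also receives the authority as its first argument):

```python
from typing import Any, Callable, Dict, List, Optional

def validate_ann_dicts(
    ann_dicts: List[Dict[str, Any]],
    anns_validator: Optional[Callable[[List[Dict[str, Any]]], bool]] = None,
) -> bool:
    """Valid when non-empty and either no validator is set or the
    validator accepts the dicts (a simplified sketch of the rule)."""
    if not ann_dicts:
        return False
    return anns_validator is None or anns_validator(ann_dicts)

print(validate_ann_dicts([]))                                # → False
print(validate_ann_dicts([{"text": "x"}]))                   # → True
print(validate_ann_dicts([{"text": "x"}], lambda ds: False)) # → False
```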
AuthorityAnnotationsBuilder ¶
AuthorityAnnotationsBuilder(
metadata: AuthorityAnnotationsMetaData = None,
data_defaults: Dict[str, Any] = None,
)
Bases: AnnotationsBuilder
An extension of an AnnotationsBuilder that adds the 'auth_id' column.
Initialize AuthorityAnnotationsBuilder.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| metadata | AuthorityAnnotationsMetaData | The authority annotations metadata. | None |
| data_defaults | Dict[str, Any] | Dict[ann_colname, default_value] with default values for annotation columns. | None |
Methods:
| Name | Description |
|---|---|
| build_annotation_row | Build an annotation row with the mandatory key values and those from the remaining keyword arguments. |
Source code in packages/xization/src/dataknobs_xization/authorities.py
Functions¶
build_annotation_row ¶
build_annotation_row(
start_pos: int,
end_pos: int,
text: str,
ann_type: str,
auth_id: str,
**kwargs: Any,
) -> Dict[str, Any]
Build an annotation row with the mandatory key values and those from the remaining keyword arguments.
For those kwargs whose names match metadata column names, override the data_defaults and add remaining data_default attributes.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| start_pos | int | The token start position. | required |
| end_pos | int | The token end position. | required |
| text | str | The token text. | required |
| ann_type | str | The annotation type. | required |
| auth_id | str | The authority ID for the row. | required |
| **kwargs | Any | Additional keyword arguments. | {} |
Returns:
| Type | Description |
|---|---|
| Dict[str, Any] | The result row dictionary. |
Source code in packages/xization/src/dataknobs_xization/authorities.py
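The precedence described above (kwargs override the data_defaults; the mandatory key values always apply) can be sketched as a standalone function (illustrative, not the library's implementation):

```python
from typing import Any, Dict, Optional

def build_annotation_row(
    start_pos: int,
    end_pos: int,
    text: str,
    ann_type: str,
    auth_id: str,
    data_defaults: Optional[Dict[str, Any]] = None,
    **kwargs: Any,
) -> Dict[str, Any]:
    """Start from the defaults, let kwargs override them, then add
    the mandatory key columns (a simplified sketch of the behavior)."""
    row = dict(data_defaults or {})
    row.update(kwargs)
    row.update(
        start_pos=start_pos, end_pos=end_pos,
        text=text, ann_type=ann_type, auth_id=auth_id,
    )
    return row

row = build_annotation_row(0, 4, "ford", "make", "auth-1",
                           data_defaults={"conf": 1.0}, conf=0.9)
print(row["conf"])  # → 0.9 (the kwarg overrides the default)
```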
AuthorityAnnotationsMetaData ¶
AuthorityAnnotationsMetaData(
start_pos_col: str = dk_annots.KEY_START_POS_COL,
end_pos_col: str = dk_annots.KEY_END_POS_COL,
text_col: str = dk_annots.KEY_TEXT_COL,
ann_type_col: str = dk_annots.KEY_ANN_TYPE_COL,
auth_id_col: str = KEY_AUTH_ID_COL,
sort_fields: List[str] = (
dk_annots.KEY_START_POS_COL,
dk_annots.KEY_END_POS_COL,
),
sort_fields_ascending: List[bool] = (True, False),
**kwargs: Any,
)
Bases: AnnotationsMetaData
An extension of AnnotationsMetaData that adds an 'auth_id_col' to the standard (key) annotation columns (attributes).
Initialize with key (and more) column names and info.
Key column types
- start_pos
- end_pos
- text
- ann_type
- auth_id
Note
Actual table columns can be named arbitrarily, BUT interactions through annotations classes and interfaces relating to the "key" columns must use the key column constants.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| start_pos_col | str | Col name for the token starting position. | KEY_START_POS_COL |
| end_pos_col | str | Col name for the token ending position. | KEY_END_POS_COL |
| text_col | str | Col name for the token text. | KEY_TEXT_COL |
| ann_type_col | str | Col name for the annotation types. | KEY_ANN_TYPE_COL |
| auth_id_col | str | Col name for the authority value ID. | KEY_AUTH_ID_COL |
| sort_fields | List[str] | The col types relevant for sorting annotation rows. | (KEY_START_POS_COL, KEY_END_POS_COL) |
| sort_fields_ascending | List[bool] | To specify sort order of sort_fields. | (True, False) |
| **kwargs | Any | More column types mapped to column names. | {} |
Attributes:
| Name | Type | Description |
|---|---|---|
| auth_id_col | str | Get the column name for the auth_id |
Source code in packages/xization/src/dataknobs_xization/authorities.py
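The default sort (start_pos ascending, end_pos descending) puts longer spans before the shorter spans they contain. A quick standalone illustration with (start_pos, end_pos) tuples:

```python
# Default annotation sort: start_pos ascending, end_pos descending,
# so longer spans come before the shorter spans they contain.
rows = [(2, 5), (0, 3), (0, 10)]
ordered = sorted(rows, key=lambda r: (r[0], -r[1]))
print(ordered)  # → [(0, 10), (0, 3), (2, 5)]
```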
AuthorityData ¶
A wrapper for authority data.
Methods:
| Name | Description |
|---|---|
| lookup_values | Lookup authority value(s) for the given value or value id. |
Attributes:
| Name | Type | Description |
|---|---|---|
| df | DataFrame | Get the authority data in a dataframe |
Source code in packages/xization/src/dataknobs_xization/authorities.py
Attributes¶
Functions¶
lookup_values ¶
Lookup authority value(s) for the given value or value id.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| value | Any | A value or value_id for this authority. | required |
| is_id | bool | True if value is an ID. | False |
Returns:
| Type | Description |
|---|---|
| DataFrame | The applicable authority dataframe rows. |
Source code in packages/xization/src/dataknobs_xization/authorities.py
AuthorityFactory ¶
Bases: ABC
A factory class for building an authority.
Methods:
| Name | Description |
|---|---|
| build_authority | Build an authority with the given name and data. |
Functions¶
build_authority
abstractmethod
¶
build_authority(
name: str,
auth_anns_builder: AuthorityAnnotationsBuilder,
authdata: AuthorityData,
parent_auth: Authority = None,
) -> Authority
Build an authority with the given name and data.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| name | str | The authority name. | required |
| auth_anns_builder | AuthorityAnnotationsBuilder | The authority annotations row builder to use for building annotation rows. | required |
| authdata | AuthorityData | The authority data. | required |
| parent_auth | Authority | The parent authority. | None |
Returns:
| Type | Description |
|---|---|
| Authority | The authority. |
Source code in packages/xization/src/dataknobs_xization/authorities.py
DerivedFieldGroups ¶
DerivedFieldGroups(
field_type_suffix: str = "_field",
field_group_suffix: str = "_num",
field_record_suffix: str = "_recsnum",
)
Bases: DerivedAnnotationColumns
Defines derived column types:
- "field_type" -- The column holding the type of field of an annotation row
- "field_group" -- The column holding the group number(s) of the field
- "field_record" -- The column holding record number(s) of the field

Adds derived column types/names. Given an annotation row:
- field_type(row) == f'{row[ann_type_col]}_field'
- field_group(row) == f'{row[ann_type_col]}_num'
- field_record(row) == f'{row[ann_type_col]}_recsnum'

Where:
- A field_type column holds annotation "sub"-type values, or fields
- A field_group column identifies groups of annotation fields
- A field_record column identifies groups of annotation field groups
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| field_type_suffix | str | The field_type col name suffix (if not _field). | '_field' |
| field_group_suffix | str | The field_group col name suffix (if not _num). | '_num' |
| field_record_suffix | str | The field_record col name suffix (if not _recsnum). | '_recsnum' |
Methods:
| Name | Description |
|---|---|
| get_col_value | Get the value of the column in the given row derived from col_type. |
| get_field_group_col | Given a field name or a derived column name, get the name of the derived field group column. |
| get_field_name | Given a field name or a derived column name, get the field name. |
| get_field_record_col | Given a field name or a derived column name, get the name of the derived field record column. |
| get_field_type_col | Given a field name or a derived column name, get the name of the derived field type column. |
| unpack_field | Given a field in any of its derivatives, unpack and return the basic field value. |
Source code in packages/xization/src/dataknobs_xization/authorities.py
Functions¶
get_col_value ¶
get_col_value(
metadata: AnnotationsMetaData,
col_type: str,
row: Series,
missing: str = None,
) -> str
Get the value of the column in the given row derived from col_type, where col_type is one of:
- "field_type" == f"{field}_field"
- "field_group" == f"{field}_num"
- "field_record" == f"{field}_recsnum"
And "field" is the row_accessor's metadata's "ann_type" col's value.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| metadata | AnnotationsMetaData | The AnnotationsMetaData. | required |
| col_type | str | The type of column value to derive. | required |
| row | Series | A row from which to get the value. | required |
| missing | str | The value to return for unknown or missing column. | None |
Returns:
| Type | Description |
|---|---|
| str | The row value or the missing value. |
Source code in packages/xization/src/dataknobs_xization/authorities.py
get_field_group_col ¶
Given a field name or field col name, e.g., an annotation type col's value; or a field type, group, or record, get the name of the derived field group column.
Source code in packages/xization/src/dataknobs_xization/authorities.py
get_field_name ¶
Given a field name or field col name, e.g., an annotation type col's value (the field name); or a field type, group, or record column name, get the field name.
Source code in packages/xization/src/dataknobs_xization/authorities.py
get_field_record_col ¶
Given a field name or field col name, e.g., an annotation type col's value; or a field type, group, or record, get the name of the derived field record column.
Source code in packages/xization/src/dataknobs_xization/authorities.py
get_field_type_col ¶
Given a field name or field col name, e.g., an annotation type col's value; or a field type, group, or record column name, get the name of the derived field type column.
Source code in packages/xization/src/dataknobs_xization/authorities.py
unpack_field ¶
Given a field in any of its derivatives (like field type, field group, or field record), unpack and return the basic field value itself.
Source code in packages/xization/src/dataknobs_xization/authorities.py
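The suffix conventions above can be sketched as standalone helpers (illustrative only; the real methods work through the metadata's ann_type column):

```python
# Derived-column suffixes, as documented for DerivedFieldGroups.
SUFFIXES = {
    "field_type": "_field",
    "field_group": "_num",
    "field_record": "_recsnum",
}

def derived_col(field: str, col_type: str) -> str:
    """Build a derived column name from a base field name."""
    return field + SUFFIXES[col_type]

def unpack_field(col_name: str) -> str:
    """Strip any derived suffix to recover the base field name."""
    for sfx in SUFFIXES.values():
        if col_name.endswith(sfx):
            return col_name[: -len(sfx)]
    return col_name

print(derived_col("car_model", "field_type"))  # → car_model_field
print(unpack_field("car_model_recsnum"))       # → car_model
```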
LexicalAuthority ¶
LexicalAuthority(
name: str,
auth_anns_builder: AuthorityAnnotationsBuilder = None,
authdata: AuthorityData = None,
field_groups: DerivedFieldGroups = None,
anns_validator: Callable[[Authority, Dict[str, Any]], bool] = None,
parent_auth: Authority = None,
)
Bases: Authority
A class for managing named entities by ID with associated values and variations.
Initialize with this authority's metadata.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| name | str | This authority's entity name. | required |
| auth_anns_builder | AuthorityAnnotationsBuilder | The authority annotations row builder to use for building annotation rows. | None |
| authdata | AuthorityData | The authority data. | None |
| field_groups | DerivedFieldGroups | The derived field groups to use. | None |
| anns_validator | Callable[[Authority, Dict[str, Any]], bool] | fn(auth, anns_dict_list) that returns True if the list of annotation row dicts are valid to be added as annotations for a single match or "entity". | None |
| parent_auth | Authority | This authority's parent authority (if any). | None |
Methods:
| Name | Description |
|---|---|
| find_variations | Find all matches to the given variation. |
| get_id_by_variation | Get the IDs of the value(s) associated with the given variation. |
| get_value_ids | Get all IDs associated with the given value. |
| get_values_by_id | Get all values for the associated value ID. |
Source code in packages/xization/src/dataknobs_xization/authorities.py
Functions¶
find_variations
abstractmethod
¶
find_variations(
variation: str,
starts_with: bool = False,
ends_with: bool = False,
scope: str = "fullmatch",
) -> pd.Series
Find all matches to the given variation.
Note
Only the first true of starts_with, ends_with, and scope will be applied. If none of these are true, a full match on the pattern is performed.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| variation | str | The text to find; treated as a regular expression unless either starts_with or ends_with is True. | required |
| starts_with | bool | When True, find all terms that start with the variation text. | False |
| ends_with | bool | When True, find all terms that end with the variation text. | False |
| scope | str | 'fullmatch' (default), 'match', or 'contains' for strict, less strict, and least strict matching. | 'fullmatch' |

Returns:

| Type | Description |
|---|---|
| Series | The matching variations as a pd.Series. |
Source code in packages/xization/src/dataknobs_xization/authorities.py
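The three `scope` strictness levels correspond to Python's standard `re` matching functions (and to the analogous `pd.Series.str.fullmatch`/`.match`/`.contains` methods). The sketch below illustrates the matching semantics only; it does not call into the package, and the sample variation list is invented for demonstration.

```python
# Illustration of the three `scope` matching levels using plain `re`.
import re

variations = ["new york", "new york city", "york", "west new york"]

def find(pattern: str, scope: str = "fullmatch"):
    """Return the variations matched at the given strictness level."""
    fns = {
        "fullmatch": re.fullmatch,   # the whole string must match
        "match": re.match,           # the match must start at the beginning
        "contains": re.search,       # the match may occur anywhere
    }
    return [v for v in variations if fns[scope](pattern, v)]

print(find("new york"))                    # ['new york']
print(find("new york", scope="match"))     # ['new york', 'new york city']
print(find("new york", scope="contains"))  # ['new york', 'new york city', 'west new york']
```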
get_id_by_variation
abstractmethod
¶
Get the IDs of the value(s) associated with the given variation.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| variation | str | Variation text. | required |

Returns:

| Type | Description |
|---|---|
| Set[str] | The possibly empty set of associated value IDs. |
Source code in packages/xization/src/dataknobs_xization/authorities.py
get_value_ids
abstractmethod
¶
Get all IDs associated with the given value. Note that typically there is a single ID for any value, but this allows for inherent ambiguities in the authority.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| value | Any | An authority value. | required |

Returns:

| Type | Description |
|---|---|
| Set[Any] | The associated IDs or an empty set if the value is not valid. |
Source code in packages/xization/src/dataknobs_xization/authorities.py
get_values_by_id
abstractmethod
¶
Get all values for the associated value ID. Note that typically there is a single value for an ID, but this allows for inherent ambiguities in the authority.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| value_id | Any | An authority value ID. | required |

Returns:

| Type | Description |
|---|---|
| Set[Any] | The associated values or an empty set if the value ID is not valid. |
Source code in packages/xization/src/dataknobs_xization/authorities.py
RegexAuthority ¶
RegexAuthority(
name: str,
regex: Pattern,
canonical_fn: Callable[[str, str], Any] = None,
auth_anns_builder: AuthorityAnnotationsBuilder = None,
authdata: AuthorityData = None,
field_groups: DerivedFieldGroups = None,
anns_validator: Callable[[Authority, Dict[str, Any]], bool] = None,
parent_auth: Authority = None,
)
Bases: Authority
A class for managing named entities by ID with associated values and variations.
Initialize with this authority's entity name.
Note
If the regular expression has capturing groups, each group will result in a separate entity, using the group name if provided in the regular expression as ...(?P<name>...).
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| name | str | The authority name. | required |
| regex | Pattern | The regular expression to apply. | required |
| canonical_fn | Callable[[str, str], Any] | A function, fn(match_text, group_name), to transform input matches to a canonical form as a value_id, where group_name will be None and the full match text will be passed in if there are no group names. Note that the canonical form is computed before the match_validator is applied and its value will be found as the value to the | None |
| auth_anns_builder | AuthorityAnnotationsBuilder | The authority annotations row builder to use for building annotation rows. | None |
| authdata | AuthorityData | The authority data. | None |
| field_groups | DerivedFieldGroups | The derived field groups to use. | None |
| anns_validator | Callable[[Authority, Dict[str, Any]], bool] | A validation function for each regex match formed as a list of annotation row dictionaries, one row dictionary for each matching regex group. If the validator returns False, then the annotation rows will be rejected. The entity_text key will hold matched text and the | None |
| parent_auth | Authority | This authority's parent authority (if any). | None |
Methods:

| Name | Description |
|---|---|
| add_annotations | Method to do the work of finding, validating, and adding annotations. |
| has_value | Determine whether the given value is in this authority. |
Source code in packages/xization/src/dataknobs_xization/authorities.py
Functions¶
add_annotations ¶
Method to do the work of finding, validating, and adding annotations.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| text_obj | AnnotatedText | The annotated text object to process and add annotations. | required |

Returns:

| Type | Description |
|---|---|
| Annotations | The added Annotations. |
Source code in packages/xization/src/dataknobs_xization/authorities.py
has_value ¶
Determine whether the given value is in this authority.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| value | Any | A possible authority value. | required |

Returns:

| Type | Description |
|---|---|
| Match | None if the value is not a valid entity value; otherwise, the re.Match object. |
Source code in packages/xization/src/dataknobs_xization/authorities.py
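To make the canonical_fn parameter concrete, here is a package-free sketch of how a pattern with named groups and a canonical_fn(match_text, group_name) interact; the date pattern and the zero-padding rule are illustrative assumptions, not the package's behavior.

```python
# Sketch: named capturing groups feeding a canonical_fn (hypothetical rules).
import re

pattern = re.compile(r"(?P<month>\d{1,2})/(?P<day>\d{1,2})/(?P<year>\d{4})")

def canonical_fn(match_text: str, group_name: str):
    """Zero-pad date parts so '3' and '03' map to the same value_id."""
    if group_name in ("month", "day"):
        return match_text.zfill(2)
    return match_text

# Each named group yields a separate entity with a canonicalized value_id.
entities = {}
m = pattern.search("Due on 3/7/2024.")
for name, text in m.groupdict().items():
    entities[name] = canonical_fn(text, name)

print(entities)  # {'month': '03', 'day': '07', 'year': '2024'}
```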
lexicon¶
Functions and Classes¶
dataknobs_xization.lexicon ¶
Lexical matching and token alignment for text processing.
Provides classes for lexical expansion, normalization, token alignment, and pattern matching in text with support for variations and fuzzy matching.
Classes:

| Name | Description |
|---|---|
| CorrelatedAuthorityData | Container for authoritative data containing correlated data for multiple "sub" authorities. |
| DataframeAuthority | A pandas dataframe-based lexical authority. |
| LexicalExpander | A class to expand and/or normalize original lexical input terms. |
| MultiAuthorityData | Container for correlated authoritative data with explicit data for each "sub" authority. |
| MultiAuthorityFactory | A factory for building a "sub" authority directly or indirectly from MultiAuthorityData. |
| SimpleMultiAuthorityData | Data class for pulling a single column from the multi-authority data as a "sub" authority. |
| TokenAligner | Aligns tokens with a lexical authority to generate annotations. |
| TokenMatch | Represents a match between tokens and a lexical authority variation. |
Classes¶
CorrelatedAuthorityData ¶
Bases: AuthorityData
Container for authoritative data containing correlated data for multiple "sub" authorities.
Methods:

| Name | Description |
|---|---|
| auth_records_mask | Get a series identifying records in the full authority matching the given records. |
| auth_values_mask | Identify full-authority data corresponding to this sub-value. |
| combine_masks | Combine the masks if possible, returning the valid combination or None. |
| get_auth_records | Get the authority records identified by the mask. |
| sub_authority_names | Get the "sub" authority names. |
Source code in packages/xization/src/dataknobs_xization/lexicon.py
Functions¶
auth_records_mask
abstractmethod
¶
Get a series identifying records in the full authority matching the given records of the form {field_name: value_id, ...}.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| record_value_ids | Dict[str, int] | The dict of field names to value_ids. | required |
| filter_mask | Series | A pre-filter limiting records to consider and/or building records incrementally. | None |

Returns:

| Type | Description |
|---|---|
| Series | A series identifying where all fields exist. |
Source code in packages/xization/src/dataknobs_xization/lexicon.py
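The filter_mask parameter supports building a records mask incrementally. Below is a minimal stand-in using plain boolean lists in place of pd.Series; the record dicts and field names are invented for illustration.

```python
# Sketch: matching records on field/value_id pairs with an optional pre-filter.
records = [
    {"city": 1, "state": 7},
    {"city": 2, "state": 7},
    {"city": 1, "state": 9},
]

def auth_records_mask(record_value_ids, filter_mask=None):
    """True where every requested field has the requested value_id."""
    mask = [all(r.get(f) == vid for f, vid in record_value_ids.items())
            for r in records]
    if filter_mask is not None:
        # Narrow the result to records already admitted by the pre-filter.
        mask = [a and b for a, b in zip(mask, filter_mask)]
    return mask

m1 = auth_records_mask({"state": 7})                 # [True, True, False]
m2 = auth_records_mask({"city": 1}, filter_mask=m1)  # [True, False, False]
print(m1, m2)
```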
auth_values_mask
abstractmethod
¶
Identify full-authority data corresponding to this sub-value.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| name | str | The sub-authority name. | required |
| value_id | int | The sub-authority value_id. | required |

Returns:

| Type | Description |
|---|---|
| Series | A series representing relevant full-authority data. |
Source code in packages/xization/src/dataknobs_xization/lexicon.py
combine_masks
abstractmethod
¶
Combine the masks if possible, returning the valid combination or None.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| mask1 | Series | An auth_records_mask consistent with this data. | required |
| mask2 | Series | Another data auth_records_mask. | required |

Returns:

| Type | Description |
|---|---|
| Series | The combined consistent records_mask or None. |
Source code in packages/xization/src/dataknobs_xization/lexicon.py
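In spirit, combining two record masks is an element-wise AND that falls back to None when no record survives. A sketch with plain boolean lists standing in for pd.Series (this mirrors, but is not, the package implementation):

```python
# Sketch: combining two aligned boolean record masks.
def combine_masks(mask1, mask2):
    """AND the masks element-wise; return None when no record satisfies both."""
    combined = [a and b for a, b in zip(mask1, mask2)]
    return combined if any(combined) else None

mask1 = [True, True, False, True]   # e.g. records where one field matched
mask2 = [False, True, False, True]  # e.g. records where another field matched
print(combine_masks(mask1, mask2))                        # [False, True, False, True]
print(combine_masks(mask1, [False, False, True, False]))  # None
```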
get_auth_records
abstractmethod
¶
Get the authority records identified by the mask.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| records_mask | Series | A series identifying records in the full data. | required |

Returns:

| Type | Description |
|---|---|
| DataFrame | The records for which the mask is True. |
Source code in packages/xization/src/dataknobs_xization/lexicon.py
DataframeAuthority ¶
DataframeAuthority(
name: str,
lexical_expander: LexicalExpander,
authdata: AuthorityData,
auth_anns_builder: AuthorityAnnotationsBuilder = None,
field_groups: DerivedFieldGroups = None,
anns_validator: Callable[[Authority, Dict[str, Any]], bool] = None,
parent_auth: Authority = None,
)
Bases: LexicalAuthority
A pandas dataframe-based lexical authority.
Initialize with the name, values, and associated ids of the authority; and with the lexical expander for authoritative values.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| name | str | The authority name, if different from df.columns[0]. | required |
| lexical_expander | LexicalExpander | The lexical expander for the values. | required |
| authdata | AuthorityData | The data for this authority. | required |
| auth_anns_builder | AuthorityAnnotationsBuilder | The authority annotations row builder to use for building annotation rows. | None |
| field_groups | DerivedFieldGroups | The derived field groups to use. | None |
| anns_validator | Callable[[Authority, Dict[str, Any]], bool] | fn(auth, anns_dict_list) that returns True if the list of annotation row dicts are valid to be added as annotations for a single match or "entity". | None |
| parent_auth | Authority | This authority's parent authority (if any). | None |
Methods:

| Name | Description |
|---|---|
| add_annotations | Method to do the work of finding, validating, and adding annotations. |
| find_variations | Find all matches to the given variation. |
| get_id_by_variation | Get the IDs of the value(s) associated with the given variation. |
| get_value_ids | Get all IDs associated with the given value. |
| get_values_by_id | Get all values for the associated value ID. |
| get_variations | Convenience method to compute variations for the value. |
| get_variations_df | Create a DataFrame including associated ids for each variation. |
| has_value | Determine whether the given value is in this authority. |

Attributes:

| Name | Type | Description |
|---|---|---|
| prev_aligner | TokenAligner | Get the token aligner created in the latest call to annotate_text. |
| variations | Series | Get all lexical variations in a series whose index has associated value IDs. |
Source code in packages/xization/src/dataknobs_xization/lexicon.py
Attributes¶
prev_aligner
property
¶
Get the token aligner created in the latest call to annotate_text.
variations
property
¶
Get all lexical variations in a series whose index has associated value IDs.
Returns:

| Type | Description |
|---|---|
| Series | A pandas series with index-identified variations. |
Functions¶
add_annotations ¶
Method to do the work of finding, validating, and adding annotations.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| doctext | Text | The text to process. | required |
| annotations | Annotations | The annotations object to add annotations to. | required |

Returns:

| Type | Description |
|---|---|
| Annotations | The given or a new Annotations instance. |
Source code in packages/xization/src/dataknobs_xization/lexicon.py
find_variations ¶
find_variations(
variation: str,
starts_with: bool = False,
ends_with: bool = False,
scope: str = "fullmatch",
) -> pd.Series
Find all matches to the given variation.
Note
Only the first true of starts_with, ends_with, and scope will be applied. If none of these are true, a full match on the pattern is performed.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| variation | str | The text to find; treated as a regular expression unless either starts_with or ends_with is True. | required |
| starts_with | bool | When True, find all terms that start with the variation text. | False |
| ends_with | bool | When True, find all terms that end with the variation text. | False |
| scope | str | 'fullmatch' (default), 'match', or 'contains' for strict, less strict, and least strict matching. | 'fullmatch' |

Returns:

| Type | Description |
|---|---|
| Series | The matching variations as a pd.Series. |
Source code in packages/xization/src/dataknobs_xization/lexicon.py
get_id_by_variation ¶
Get the IDs of the value(s) associated with the given variation.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| variation | str | Variation text. | required |

Returns:

| Type | Description |
|---|---|
| Set[str] | The possibly empty set of associated value IDs. |
Source code in packages/xization/src/dataknobs_xization/lexicon.py
get_value_ids ¶
Get all IDs associated with the given value. Note that typically there is a single ID for any value, but this allows for inherent ambiguities in the authority.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| value | Any | An authority value. | required |

Returns:

| Type | Description |
|---|---|
| Set[Any] | The associated IDs or an empty set if the value is not valid. |
Source code in packages/xization/src/dataknobs_xization/lexicon.py
get_values_by_id ¶
Get all values for the associated value ID. Note that typically there is a single value for an ID, but this allows for inherent ambiguities in the authority.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| value_id | Any | An authority value ID. | required |

Returns:

| Type | Description |
|---|---|
| Set[Any] | The associated values or an empty set if the value ID is not valid. |
Source code in packages/xization/src/dataknobs_xization/lexicon.py
get_variations ¶
Convenience method to compute variations for the value.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| value | Any | The authority value, or term, whose variations to compute. | required |
| normalize | bool | True to normalize the variations. | True |

Returns:

| Type | Description |
|---|---|
| Set[Any] | The set of variations for the value. |
Source code in packages/xization/src/dataknobs_xization/lexicon.py
get_variations_df ¶
get_variations_df(
variations: Series,
variations_colname: str = "variation",
ids_colname: str = None,
lookup_values: bool = False,
) -> pd.DataFrame
Create a DataFrame including associated ids for each variation.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| variations | Series | The variations to include in the dataframe. | required |
| variations_colname | str | The name of the variations column. | 'variation' |
| ids_colname | str | The column name for value ids. | None |
| lookup_values | bool | When True, include a self.name column with associated values. | False |
Source code in packages/xization/src/dataknobs_xization/lexicon.py
has_value ¶
Determine whether the given value is in this authority.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| value | Any | A possible authority value. | required |

Returns:

| Type | Description |
|---|---|
| bool | True if the value is a valid entity value. |
Source code in packages/xization/src/dataknobs_xization/lexicon.py
LexicalExpander ¶
LexicalExpander(
variations_fn: Callable[[str], Set[str]],
normalize_fn: Callable[[str], str],
split_input_camelcase: bool = True,
detect_emojis: bool = False,
)
A class to expand and/or normalize original lexical input terms, to keep back-references from generated data to corresponding original input, and to build consistent tokens for lexical matching.
Initialize with the given functions.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| variations_fn | Callable[[str], Set[str]] | A function, f(t), to expand a raw input term to all of its variations (including itself if desired). If None, the default is to expand each term to itself. | required |
| normalize_fn | Callable[[str], str] | A function to normalize a raw input term or any of its variations. If None, then the identity function is used. | required |
| split_input_camelcase | bool | True to split input camelcase tokens. | True |
| detect_emojis | bool | True to detect emojis. If split_input_camelcase, then adjacent emojis will also be split; otherwise, adjacent emojis will appear as a single token. | False |
Methods:

| Name | Description |
|---|---|
| __call__ | Get all variations of the original term. |
| get_terms | Get the term ids for which the given variation was generated. |
| normalize | Normalize the given input term or variation. |
Source code in packages/xization/src/dataknobs_xization/lexicon.py
Functions¶
__call__ ¶
Get all variations of the original term.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| term | Any | The term whose variations to compute. | required |
| normalize | bool | True to normalize the resulting variations. | True |

Returns:

| Type | Description |
|---|---|
| Set[str] | All variations. |
Source code in packages/xization/src/dataknobs_xization/lexicon.py
get_terms ¶
Get the term ids for which the given variation was generated.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| variation | str | A variation whose reference term(s) to retrieve. | required |

Returns:

| Type | Description |
|---|---|
| Set[Any] | The set of term ids for the variation or the missing_value. |
Source code in packages/xization/src/dataknobs_xization/lexicon.py
normalize ¶
Normalize the given input term or variation.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| input_term | str | An input term to normalize. | required |

Returns:

| Type | Description |
|---|---|
| str | The normalized string of the input_term. |
Source code in packages/xization/src/dataknobs_xization/lexicon.py
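The two functions a LexicalExpander is constructed from can be illustrated with plain stand-ins; the specific normalization and hyphen-expansion rules below are examples chosen for the sketch, not the package defaults.

```python
# Sketch: illustrative variations_fn and normalize_fn for a LexicalExpander.
import re

def normalize_fn(term: str) -> str:
    """Lowercase and squash runs of whitespace."""
    return re.sub(r"\s+", " ", term.strip().lower())

def variations_fn(term: str) -> set:
    """Expand hyphens to spaces and to nothing, keeping the term itself."""
    return {term, term.replace("-", " "), term.replace("-", "")}

# With the real class this would be: LexicalExpander(variations_fn, normalize_fn)
raw = "E-Mail"
variations = {normalize_fn(v) for v in variations_fn(raw)}
print(sorted(variations))  # ['e mail', 'e-mail', 'email']
```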
MultiAuthorityData ¶
Bases: CorrelatedAuthorityData
Container for authoritative data containing correlated data for multiple "sub" authorities composed of explicit data for each component.
Methods:

| Name | Description |
|---|---|
| auth_records_mask | Get a boolean series identifying records in the full authority matching the given records. |
| auth_values_mask | Identify the rows in the full authority corresponding to this sub-value. |
| build_authority_data | Build an authority for the named sub-authority. |
| combine_masks | Combine the masks if possible, returning the valid combination or None. |
| get_auth_records | Get the authority records identified by the mask. |
| get_authority_data | Get AuthorityData for the named "sub" authority, building if needed. |
| get_unique_vals_df | Get a dataframe with the unique values from the column and the given column name. |
| lookup_auth_values | Lookup original authority data for the named "sub" authority value. |
| lookup_subauth_values | Lookup "sub" authority data for the named "sub" authority value. |

Attributes:

| Name | Type | Description |
|---|---|---|
| authority_data | AuthorityData | Retrieve the named authority data without building it, or None if it has not been built. |
Source code in packages/xization/src/dataknobs_xization/lexicon.py
Attributes¶
authority_data
property
¶
Retrieve the named authority data without building it, or None if it has not been built.
Functions¶
auth_records_mask ¶
Get a boolean series identifying records in the full authority matching the given records of the form {field_name: value_id, ...}.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| record_value_ids | Dict[str, int] | The dict of field names to value_ids. | required |
| filter_mask | Series | A pre-filter limiting records to consider and/or building records incrementally. | None |

Returns:

| Type | Description |
|---|---|
| Series | A boolean series where all fields exist or None. |
Source code in packages/xization/src/dataknobs_xization/lexicon.py
auth_values_mask ¶
Identify the rows in the full authority corresponding to this sub-value.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| name | str | The sub-authority name. | required |
| value_id | int | The sub-authority value_id. | required |

Returns:

| Type | Description |
|---|---|
| Series | A boolean series where the field exists. |
Source code in packages/xization/src/dataknobs_xization/lexicon.py
build_authority_data
abstractmethod
¶
Build an authority for the named sub-authority.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| name | str | The "sub" authority name. | required |

Returns:

| Type | Description |
|---|---|
| AuthorityData | The "sub" authority data. |
Source code in packages/xization/src/dataknobs_xization/lexicon.py
combine_masks ¶
Combine the masks if possible, returning the valid combination or None.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| mask1 | Series | An auth_records_mask consistent with this data. | required |
| mask2 | Series | Another data auth_records_mask. | required |

Returns:

| Type | Description |
|---|---|
| Series | The combined consistent records_mask or None. |
Source code in packages/xization/src/dataknobs_xization/lexicon.py
get_auth_records ¶
Get the authority records identified by the mask.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| records_mask | Series | A boolean series identifying records in the full df. | required |

Returns:

| Type | Description |
|---|---|
| DataFrame | The records/rows for which the mask is True. |
Source code in packages/xization/src/dataknobs_xization/lexicon.py
get_authority_data ¶
Get AuthorityData for the named "sub" authority, building if needed.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| name | str | The "sub" authority name. | required |

Returns:

| Type | Description |
|---|---|
| AuthorityData | The "sub" authority data. |
Source code in packages/xization/src/dataknobs_xization/lexicon.py
get_unique_vals_df
staticmethod
¶
Get a dataframe with the unique values from the column and the given column name.
Source code in packages/xization/src/dataknobs_xization/lexicon.py
lookup_auth_values ¶
Lookup original authority data for the named "sub" authority value.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| name | str | The sub-authority name. | required |
| value | str | The sub-authority value(s) (or dataframe row(s)). | required |

Returns:

| Type | Description |
|---|---|
| DataFrame | The original authority dataframe rows. |
Source code in packages/xization/src/dataknobs_xization/lexicon.py
lookup_subauth_values ¶
Lookup "sub" authority data for the named "sub" authority value.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| name | str | The sub-authority name. | required |
| value | int | The value for the sub-authority to lookup. | required |
| is_id | bool | True if value is an ID. | False |

Returns:

| Type | Description |
|---|---|
| DataFrame | The applicable authority dataframe rows. |
Source code in packages/xization/src/dataknobs_xization/lexicon.py
MultiAuthorityFactory ¶
Bases: AuthorityFactory
A factory for building a "sub" authority directly or indirectly from MultiAuthorityData.
Initialize the MultiAuthorityFactory.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| auth_name | str | The name of the dataframe authority to build. | required |
| lexical_expander | LexicalExpander | The lexical expander to use (default=identity). | None |
|
Methods:

| Name | Description |
|---|---|
| build_authority | Build a DataframeAuthority. |
| get_lexical_expander | Get the lexical expander for the named (column) data. |
Source code in packages/xization/src/dataknobs_xization/lexicon.py
Functions¶
build_authority ¶
build_authority(
name: str,
auth_anns_builder: AuthorityAnnotationsBuilder,
multiauthdata: MultiAuthorityData,
parent_auth: Authority = None,
) -> DataframeAuthority
Build a DataframeAuthority.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| name | str | The name of the authority to build. | required |
| auth_anns_builder | AuthorityAnnotationsBuilder | The authority annotations row builder to use for building annotation rows. | required |
| multiauthdata | MultiAuthorityData | The multi-authority source data. | required |
| parent_auth | Authority | The parent authority. | None |

Returns:

| Type | Description |
|---|---|
| DataframeAuthority | The DataframeAuthority instance. |
Source code in packages/xization/src/dataknobs_xization/lexicon.py
get_lexical_expander ¶
Get the lexical expander for the named (column) data.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| name | str | The name of the column to expand. | required |

Returns:

| Type | Description |
|---|---|
| LexicalExpander | The appropriate lexical_expander. |
Source code in packages/xization/src/dataknobs_xization/lexicon.py
SimpleMultiAuthorityData ¶
Bases: MultiAuthorityData
Data class for pulling a single column from the multi-authority data as a "sub" authority.
Methods:

| Name | Description |
|---|---|
| build_authority_data | Build an authority for the named column holding authority data. |
Source code in packages/xization/src/dataknobs_xization/lexicon.py
Functions¶
build_authority_data ¶
Build an authority for the named column holding authority data.
Note
Only unique values are kept and the full dataframe's index will not be preserved.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| name | str | The "sub" authority (and column) name. | required |

Returns:

| Type | Description |
|---|---|
| AuthorityData | The "sub" authority data. |
Source code in packages/xization/src/dataknobs_xization/lexicon.py
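The note above (unique values only, original index not preserved) can be illustrated without pandas: `dict.fromkeys` deduplicates while preserving first-seen order, and enumeration assigns fresh ids.

```python
# Sketch of the uniqueness note: dedupe a column's values and re-index from 0.
rows = ["red", "blue", "red", "green", "blue"]  # a column's raw values

# Deduplicate preserving first-seen order, then assign fresh sequential ids.
unique_vals = list(dict.fromkeys(rows))
sub_authority = {new_id: val for new_id, val in enumerate(unique_vals)}
print(sub_authority)  # {0: 'red', 1: 'blue', 2: 'green'}
```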
TokenAligner ¶
Aligns tokens with a lexical authority to generate annotations.
Processes a token stream, matching tokens against lexical authority variations and generating annotations for matches. Handles overlapping matches and tracks processed tokens.
Source code in packages/xization/src/dataknobs_xization/lexicon.py
TokenMatch ¶
Represents a match between tokens and a lexical authority variation.
Matches a sequence of tokens against a lexical authority variation, tracking whether the match is complete and providing access to matched text and annotation generation.
Attributes:

| Name | Type | Description |
|---|---|---|
| matched_text | | Get the matched original text. |
Source code in packages/xization/src/dataknobs_xization/lexicon.py
Usage Examples¶
Text Normalization Example¶
from dataknobs_xization import normalize
# Basic text normalization
text = " Hello, WORLD! \n\t How are you? "
normalized = normalize.basic_normalization_fn(text)
print(normalized) # "hello, world! how are you?"
# CamelCase expansion
camel_text = "firstName"
expanded = normalize.expand_camelcase_fn(camel_text)
print(expanded) # "first Name"
# Generate lexical variations
text_with_hyphens = "multi-platform/cross-browser"
variations = normalize.get_lexical_variations(text_with_hyphens)
print(f"Generated {len(variations)} variations:")
for var in sorted(variations):
    print(f"  {var}")
# Symbol handling
text_with_symbols = "!Hello world?"
cleaned = normalize.drop_non_embedded_symbols_fn(text_with_symbols)
print(cleaned) # "Hello world"
embedded_text = "user@domain.com"
processed = normalize.drop_embedded_symbols_fn(embedded_text, " ")
print(processed) # "user domain com"
# Ampersand expansion
ampersand_text = "Research & Development"
expanded_ampersand = normalize.expand_ampersand_fn(ampersand_text)
print(expanded_ampersand) # "Research and Development"
Character Features Example¶
from dataknobs_xization.masking_tokenizer import CharacterFeatures
from dataknobs_structures import document as dk_doc
import pandas as pd
# Create a concrete implementation of CharacterFeatures
class BasicCharacterFeatures(CharacterFeatures):
    """Basic character-level feature extraction."""

    @property
    def cdf(self) -> pd.DataFrame:
        """Create character dataframe with features."""
        if not hasattr(self, '_cdf'):
            chars = list(self.text)
            # Add padding if specified
            if self._roll_padding > 0:
                pad_char = '<PAD>'
                chars = ([pad_char] * self._roll_padding +
                         chars +
                         [pad_char] * self._roll_padding)
            # Create feature dataframe
            self._cdf = pd.DataFrame({
                self.text_col: chars,
                'position': range(len(chars)),
                'is_alpha': [c.isalpha() if c != '<PAD>' else False for c in chars],
                'is_digit': [c.isdigit() if c != '<PAD>' else False for c in chars],
                'is_upper': [c.isupper() if c != '<PAD>' else False for c in chars],
                'is_lower': [c.islower() if c != '<PAD>' else False for c in chars],
                'is_space': [c.isspace() if c != '<PAD>' else False for c in chars],
                'is_punct': [(not c.isalnum() and not c.isspace()) if c != '<PAD>' else False for c in chars],
                'is_padding': [c == '<PAD>' for c in chars]
            })
        return self._cdf
# Usage
text = "Hello, World! 123 👋"
features = BasicCharacterFeatures(text, roll_padding=2)
print(f"Text: {features.text}")
print(f"Text column: {features.text_col}")
print("\nCharacter DataFrame:")
print(features.cdf.head(10))
# Analyze character distribution
cdf = features.cdf
print("\nCharacter Analysis:")
print(f"Total characters: {len(cdf)}")
print(f"Alphabetic: {cdf['is_alpha'].sum()}")
print(f"Digits: {cdf['is_digit'].sum()}")
print(f"Spaces: {cdf['is_space'].sum()}")
print(f"Punctuation: {cdf['is_punct'].sum()}")
print(f"Padding: {cdf['is_padding'].sum()}")
Text Masking Example¶
from dataknobs_xization.masking_tokenizer import CharacterFeatures
import pandas as pd
import numpy as np
class MaskingCharacterFeatures(CharacterFeatures):
    """Character features with masking capability."""

    def __init__(self, doctext, roll_padding=0, mask_probability=0.15):
        super().__init__(doctext, roll_padding)
        self.mask_probability = mask_probability

    @property
    def cdf(self) -> pd.DataFrame:
        """Character dataframe with masking features."""
        if not hasattr(self, '_cdf'):
            chars = list(self.text)
            if self._roll_padding > 0:
                pad_char = '<PAD>'
                chars = ([pad_char] * self._roll_padding +
                         chars +
                         [pad_char] * self._roll_padding)
            # Set random seed for reproducibility
            np.random.seed(42)
            self._cdf = pd.DataFrame({
                self.text_col: chars,
                'original_char': chars,
                'position': range(len(chars)),
                'is_alpha': [c.isalpha() if c != '<PAD>' else False for c in chars],
                'is_digit': [c.isdigit() if c != '<PAD>' else False for c in chars],
                'should_mask': np.random.random(len(chars)) < self.mask_probability,
                'is_padding': [c == '<PAD>' for c in chars]
            })
            # Apply masking
            mask_indices = self._cdf['should_mask'] & ~self._cdf['is_padding']
            self._cdf.loc[mask_indices, self.text_col] = '[MASK]'
        return self._cdf

    def get_masked_text(self) -> str:
        """Get the masked version of the text."""
        cdf = self.cdf
        masked_chars = cdf[~cdf['is_padding']][self.text_col].tolist()
        return ''.join(masked_chars)
# Usage
original_text = "This is a sample text for demonstration."
masker = MaskingCharacterFeatures(original_text, mask_probability=0.2)
print(f"Original: {original_text}")
print(f"Masked: {masker.get_masked_text()}")
print(f"\nMask Statistics:")
cdf = masker.cdf
print(f"Total chars: {len(cdf)}")
print(f"Masked chars: {cdf['should_mask'].sum()}")
print(f"Mask ratio: {cdf['should_mask'].mean():.2%}")
Complete Text Processing Pipeline¶
from dataknobs_xization import normalize, masking_tokenizer
from dataknobs_structures import document as dk_doc
import pandas as pd
class TextProcessingPipeline:
    """Complete text processing with normalization and analysis."""

    def __init__(self, normalize_config=None, analysis_config=None):
        self.normalize_config = normalize_config or {}
        self.analysis_config = analysis_config or {}

    def process_document(self, doc: dk_doc.Document) -> dict:
        """Process a document through the complete pipeline."""
        original_text = doc.text
        results = {
            'document_id': getattr(doc, 'text_id', None),
            'original_text': original_text
        }
        # Step 1: Normalization
        normalized_text = self._normalize_text(original_text)
        results['normalized_text'] = normalized_text
        # Step 2: Generate variations
        variations = normalize.get_lexical_variations(
            normalized_text, **self.normalize_config
        )
        results['variations'] = list(variations)
        results['variation_count'] = len(variations)
        # Step 3: Character analysis
        char_analysis = self._analyze_characters(normalized_text)
        results['character_analysis'] = char_analysis
        return results

    def _normalize_text(self, text: str) -> str:
        """Apply normalization pipeline."""
        # Expand camelCase
        text = normalize.expand_camelcase_fn(text)
        # Expand ampersands
        text = normalize.expand_ampersand_fn(text)
        # Drop parentheticals
        if self.normalize_config.get('drop_parentheticals', True):
            text = normalize.drop_parentheticals_fn(text)
        # Handle symbols
        if self.normalize_config.get('drop_non_embedded_symbols', True):
            text = normalize.drop_non_embedded_symbols_fn(text)
        # Basic normalization
        text = normalize.basic_normalization_fn(text)
        return text

    def _analyze_characters(self, text: str) -> dict:
        """Analyze character-level features."""
        class AnalysisCharFeatures(masking_tokenizer.CharacterFeatures):
            @property
            def cdf(self):
                chars = list(self.text)
                return pd.DataFrame({
                    self.text_col: chars,
                    'position': range(len(chars)),
                    'is_alpha': [c.isalpha() for c in chars],
                    'is_digit': [c.isdigit() for c in chars],
                    'is_space': [c.isspace() for c in chars],
                    'is_punct': [not c.isalnum() and not c.isspace() for c in chars]
                })
        features = AnalysisCharFeatures(text)
        cdf = features.cdf
        return {
            'total_characters': len(cdf),
            'alphabetic_characters': cdf['is_alpha'].sum(),
            'digit_characters': cdf['is_digit'].sum(),
            'space_characters': cdf['is_space'].sum(),
            'punctuation_characters': cdf['is_punct'].sum(),
            'alphabetic_ratio': cdf['is_alpha'].mean(),
            'digit_ratio': cdf['is_digit'].mean(),
            'space_ratio': cdf['is_space'].mean(),
'punctuation_ratio': cdf['is_punct'].mean()
}
def process_batch(self, documents: list) -> list:
"""Process multiple documents."""
return [self.process_document(doc) for doc in documents]
# Usage example
config = {
'drop_parentheticals': True,
'drop_non_embedded_symbols': True,
'expand_camelcase': True,
'expand_ampersands': True,
'add_eng_plurals': True
}
pipeline = TextProcessingPipeline(normalize_config=config)
# Create sample documents
documents = [
    dk_doc.Text(
        "getUserName() & validateInput (required)",
        text_id="tech_doc_1"
    ),
    dk_doc.Text(
        "Machine Learning (ML) & Artificial Intelligence",
        text_id="ai_doc_1"
    )
]
# Process documents
results = pipeline.process_batch(documents)
# Display results
for result in results:
print(f"\nDocument: {result['document_id']}")
print(f"Original: {result['original_text']}")
print(f"Normalized: {result['normalized_text']}")
print(f"Variations: {result['variation_count']}")
print(f"Character Analysis: {result['character_analysis']}")
Integration with Other Packages¶
from dataknobs_xization import normalize, masking_tokenizer
from dataknobs_utils import file_utils, elasticsearch_utils
from dataknobs_structures import Tree, document as dk_doc
import json
def create_searchable_documents(input_dir: str) -> list:
"""Create searchable documents with normalized text."""
searchable_docs = []
# Process all text files
for filepath in file_utils.filepath_generator(input_dir):
if filepath.endswith('.txt'):
# Read file content
content_lines = list(file_utils.fileline_generator(filepath))
full_text = '\n'.join(content_lines)
# Normalize text
normalized = normalize.basic_normalization_fn(full_text)
normalized = normalize.expand_camelcase_fn(normalized)
normalized = normalize.expand_ampersand_fn(normalized)
# Generate search variations
variations = normalize.get_lexical_variations(
normalized,
expand_camelcase=True,
do_hyphen_expansion=True,
do_slash_expansion=True
)
# Create searchable document
searchable_doc = {
'filepath': filepath,
'original_text': full_text,
'normalized_text': normalized,
'search_variations': ' '.join(variations),
'variation_count': len(variations)
}
searchable_docs.append(searchable_doc)
return searchable_docs
# Create Elasticsearch index with normalized documents
def index_normalized_documents(documents: list, index_name: str):
"""Index normalized documents in Elasticsearch."""
table_settings = elasticsearch_utils.TableSettings(
index_name,
{"number_of_shards": 1, "number_of_replicas": 0},
{
"properties": {
"original_text": {"type": "text"},
"normalized_text": {"type": "text", "analyzer": "english"},
"search_variations": {"type": "text"},
"filepath": {"type": "keyword"},
"variation_count": {"type": "integer"}
}
}
)
index = elasticsearch_utils.ElasticsearchIndex(None, [table_settings])
# Create batch file
with open("normalized_batch.jsonl", "w") as f:
elasticsearch_utils.add_batch_data(
f, iter(documents), index_name
)
return index
# Usage
documents = create_searchable_documents("/path/to/text/files")
index = index_normalized_documents(documents, "normalized_texts")
print(f"Indexed {len(documents)} normalized documents")
Error Handling¶
from dataknobs_xization import normalize, masking_tokenizer
from dataknobs_structures import document as dk_doc
def safe_text_processing(text: str) -> dict:
"""Safely process text with comprehensive error handling."""
results = {'original': text, 'errors': []}
try:
# Normalization with error handling
normalized = normalize.basic_normalization_fn(text)
results['normalized'] = normalized
except Exception as e:
results['errors'].append(f"Normalization failed: {e}")
results['normalized'] = text
try:
# CamelCase expansion
expanded = normalize.expand_camelcase_fn(results['normalized'])
results['camelcase_expanded'] = expanded
except Exception as e:
results['errors'].append(f"CamelCase expansion failed: {e}")
results['camelcase_expanded'] = results['normalized']
try:
# Variation generation
variations = normalize.get_lexical_variations(results['camelcase_expanded'])
results['variations'] = list(variations)
except Exception as e:
results['errors'].append(f"Variation generation failed: {e}")
results['variations'] = [results['camelcase_expanded']]
try:
# Character analysis
class SafeCharFeatures(masking_tokenizer.CharacterFeatures):
@property
def cdf(self):
import pandas as pd
chars = list(self.text) if self.text else []
return pd.DataFrame({
self.text_col: chars,
'is_alpha': [c.isalpha() for c in chars]
})
features = SafeCharFeatures(results['camelcase_expanded'])
results['character_count'] = len(features.cdf)
except Exception as e:
results['errors'].append(f"Character analysis failed: {e}")
results['character_count'] = 0
results['success'] = len(results['errors']) == 0
return results
# Usage
test_texts = [
"Normal text for processing",
"camelCaseText & symbols!",
"", # Empty string
None, # None value
"Special unicode: 👋🌍"
]
for i, text in enumerate(test_texts):
try:
result = safe_text_processing(text or "")
print(f"\nTest {i+1}: {'SUCCESS' if result['success'] else 'ERRORS'}")
print(f"Original: {repr(text)}")
if result['success']:
print(f"Normalized: {result['normalized']}")
print(f"Variations: {len(result['variations'])}")
else:
print(f"Errors: {result['errors']}")
except Exception as e:
print(f"\nTest {i+1}: CRITICAL ERROR - {e}")
Testing¶
import pytest
from dataknobs_xization import normalize, masking_tokenizer
from dataknobs_structures import document as dk_doc
import pandas as pd
class TestXizationFunctions:
"""Test suite for xization functionality."""
def test_normalization_functions(self):
"""Test core normalization functions."""
# Test camelCase expansion
assert normalize.expand_camelcase_fn("firstName") == "first Name"
assert normalize.expand_camelcase_fn("XMLParser") == "XML Parser"
# Test symbol handling
assert normalize.drop_non_embedded_symbols_fn("!Hello world?") == "Hello world"
assert normalize.drop_embedded_symbols_fn("user@domain.com") == "userdomaincom"
# Test ampersand expansion
assert normalize.expand_ampersand_fn("A & B") == "A and B"
# Test parenthetical removal
assert normalize.drop_parentheticals_fn("Text (with note)") == "Text "
def test_lexical_variations(self):
"""Test lexical variation generation."""
variations = normalize.get_lexical_variations("multi-platform")
# Check expected variations are present
assert "multi platform" in variations
assert "multiplatform" in variations
assert "multi-platform" in variations
# Check it returns a set
assert isinstance(variations, set)
assert len(variations) > 1
def test_character_features(self):
"""Test character feature extraction."""
class TestCharFeatures(masking_tokenizer.CharacterFeatures):
@property
def cdf(self):
chars = list(self.text)
return pd.DataFrame({
self.text_col: chars,
'is_alpha': [c.isalpha() for c in chars],
'is_digit': [c.isdigit() for c in chars]
})
features = TestCharFeatures("Hello123")
cdf = features.cdf
# Test basic properties
assert len(cdf) == 8
assert cdf['is_alpha'].sum() == 5 # "Hello"
assert cdf['is_digit'].sum() == 3 # "123"
# Test text properties
assert features.text == "Hello123"
assert features.text_col == 'text' # Default column name
def test_document_integration(self):
"""Test integration with document structures."""
doc = dk_doc.Text("Test document", text_id="test1")
class DocCharFeatures(masking_tokenizer.CharacterFeatures):
@property
def cdf(self):
chars = list(self.text)
return pd.DataFrame({self.text_col: chars})
features = DocCharFeatures(doc)
assert features.text_id == "test1"
assert features.text == "Test document"
def test_error_handling(self):
"""Test error handling in various scenarios."""
# Test empty text
empty_variations = normalize.get_lexical_variations("")
assert isinstance(empty_variations, set)
        # Test empty-string handling in utility function
from dataknobs_xization.normalize import basic_normalization_fn
try:
result = basic_normalization_fn("")
assert isinstance(result, str)
except Exception:
pytest.fail("Should handle empty string gracefully")
# Run tests
if __name__ == "__main__":
test_suite = TestXizationFunctions()
test_suite.test_normalization_functions()
test_suite.test_lexical_variations()
test_suite.test_character_features()
test_suite.test_document_integration()
test_suite.test_error_handling()
print("All tests passed!")
Performance Notes¶
- Regular Expressions: patterns are pre-compiled at import time for efficient reuse
- Character Analysis: building a per-character DataFrame is memory-intensive; process long documents in chunks
- Variation Generation: can produce many variations; filter or cap the set before indexing
- Pandas DataFrames: efficient for character-level analysis, but monitor memory usage on large inputs
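For long documents, the chunked approach noted above can be sketched as follows. This is a minimal illustration, not a dataknobs_xization API; the chunk size and the aggregated counters are arbitrary choices:

```python
import pandas as pd

def analyze_in_chunks(text: str, chunk_size: int = 10_000) -> dict:
    """Aggregate character-level counts without materializing one big DataFrame."""
    totals = {'total': 0, 'alpha': 0, 'digit': 0, 'space': 0}
    for start in range(0, len(text), chunk_size):
        chunk = text[start:start + chunk_size]
        # Each chunk's DataFrame is small and discarded after aggregation
        cdf = pd.DataFrame({
            'char': list(chunk),
            'is_alpha': [c.isalpha() for c in chunk],
            'is_digit': [c.isdigit() for c in chunk],
            'is_space': [c.isspace() for c in chunk],
        })
        totals['total'] += len(cdf)
        totals['alpha'] += int(cdf['is_alpha'].sum())
        totals['digit'] += int(cdf['is_digit'].sum())
        totals['space'] += int(cdf['is_space'].sum())
    return totals

stats = analyze_in_chunks("Hello 123 " * 5, chunk_size=16)
print(stats)
```

Peak memory is bounded by `chunk_size` rather than the full document length, at the cost of one DataFrame construction per chunk.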
Dependencies¶
Core dependencies for dataknobs_xization:
- pandas: character-level DataFrame analysis
- numpy: random masking and numeric features
- dataknobs-structures: document and tree integration
Contributing¶
For contributing to dataknobs_xization:
- Fork the repository
- Create feature branch for text processing enhancements
- Add comprehensive tests for normalization functions
- Test with various text types and edge cases
- Submit pull request with documentation updates
See Contributing Guide for detailed information.
Changelog¶
Version 1.0.0¶
- Initial release
- Text normalization functions
- Character-level feature extraction
- Lexical variation generation
- Masking tokenizer framework
- Integration with dataknobs-structures
License¶
See License for license information.