Skip to content

Architecture

Overview

Dataknobs is designed as a modular monorepo with clear separation of concerns. Each package has a specific responsibility and can be used independently or as part of the complete ecosystem.

System Architecture

graph TB
    subgraph "Application Layer"
        APP[Your Application]
    end

    subgraph "Dataknobs Packages"
        FSM[dataknobs-fsm<br/>Finite State Machine]
        DATA[dataknobs-data<br/>Data Abstraction Layer]
        CONFIG[dataknobs-config<br/>Configuration Management]
        STRUCT[dataknobs-structures<br/>Core Data Structures]
        UTILS[dataknobs-utils<br/>Utility Functions]
        XIZ[dataknobs-xization<br/>Text Processing]
        COMMON[dataknobs-common<br/>Shared Components]
        LEGACY[dataknobs-legacy<br/>Compatibility Layer]
    end

    subgraph "External Services"
        ES[Elasticsearch]
        DB[Database]
        FS[File System]
    end

    APP --> FSM
    APP --> DATA
    APP --> CONFIG
    APP --> STRUCT
    APP --> UTILS
    APP --> XIZ
    APP --> LEGACY

    FSM --> DATA
    FSM --> CONFIG
    FSM --> COMMON

    DATA --> CONFIG
    DATA --> COMMON

    CONFIG --> COMMON

    LEGACY --> STRUCT
    LEGACY --> UTILS
    LEGACY --> XIZ

    STRUCT --> COMMON
    UTILS --> COMMON
    XIZ --> COMMON
    XIZ --> STRUCT

    UTILS --> ES
    UTILS --> DB
    UTILS --> FS

Package Architecture

Core Design Principles

  1. Modularity: Each package is self-contained with minimal dependencies
  2. Composability: Packages can be combined to build complex solutions
  3. Extensibility: Easy to extend without modifying core code
  4. Testability: Clear interfaces enable comprehensive testing
  5. Performance: Optimized for large-scale data processing

Package Responsibilities

dataknobs-common

  • Purpose: Shared utilities and base classes
  • Key Components:
  • Base exceptions
  • Type definitions
  • Shared constants
  • Common utilities

dataknobs-fsm

  • Purpose: Finite State Machine framework for workflow orchestration
  • Key Components:
  • SimpleFSM: Synchronous state machine
  • AsyncSimpleFSM: Asynchronous state machine
  • AdvancedFSM: Debugging and advanced features
  • Data handling modes (COPY, REFERENCE, DIRECT)
  • Resource management
  • Streaming support

dataknobs-data

  • Purpose: Unified data abstraction layer
  • Key Components:
  • Database backends (Memory, File, PostgreSQL, S3, Elasticsearch)
  • Record and Query abstractions
  • Factory pattern for backend selection
  • Transaction support

dataknobs-config

  • Purpose: Configuration management with environment variables
  • Key Components:
  • YAML/JSON configuration loading
  • Environment variable substitution
  • Factory registration
  • Cross-references

dataknobs-structures

  • Purpose: Core data structures for knowledge representation
  • Key Components:
  • Tree: Hierarchical data structure
  • Document: Document abstraction
  • RecordStore: Key-value storage
  • ConditionalDict: Rule-based dictionary

dataknobs-utils

  • Purpose: Utility functions for various operations
  • Key Components:
  • File I/O utilities
  • JSON processing
  • Elasticsearch integration
  • LLM utilities
  • Request handling

dataknobs-xization

  • Purpose: Text processing and normalization
  • Key Components:
  • Tokenization
  • Text normalization
  • Pattern masking
  • Language processing

dataknobs-legacy

  • Purpose: Backward compatibility layer
  • Status: Deprecated, for migration only

Data Flow Architecture

Typical Processing Pipeline

graph LR
    INPUT[Input Data] --> READ[File Reading<br/>utils]
    READ --> PARSE[Parsing<br/>utils]
    PARSE --> NORM[Normalization<br/>xization]
    NORM --> STRUCT[Structure Creation<br/>structures]
    STRUCT --> PROCESS[Processing<br/>Application Logic]
    PROCESS --> INDEX[Indexing<br/>utils]
    INDEX --> STORE[Storage<br/>Elasticsearch/DB]

Component Interactions

# Example: Document processing pipeline
from dataknobs_utils import file_utils
from dataknobs_xization import normalize
from dataknobs_structures import Document, Tree
from dataknobs_utils import elasticsearch_utils

# 1. Read input
content = file_utils.read_file("input.txt")

# 2. Normalize text
normalized = normalize.basic_normalization_fn(content)

# 3. Create structure
doc = Document(normalized, metadata={"source": "input.txt"})
tree = Tree(doc)

# 4. Process (application-specific)
tree.process_nodes(custom_function)

# 5. Index for search
index = elasticsearch_utils.ElasticsearchIndex(...)
index.index_document(doc)

FSM-Based Processing Pipeline

# Example: FSM-based ETL pipeline
from dataknobs_fsm import SimpleFSM
from dataknobs_data import DatabaseFactory
from dataknobs_config import Config

# Load configuration
config = Config("etl_config.yaml")

# Create FSM for orchestration
fsm = SimpleFSM(config.get("fsm_definition"))

# Setup database connections
db_factory = DatabaseFactory()
source_db = db_factory.create(backend="postgresql", **config.get("source_db"))
target_db = db_factory.create(backend="elasticsearch", **config.get("target_db"))

# Run ETL pipeline through FSM
result = fsm.process({
    "source": source_db,
    "target": target_db,
    "batch_size": 1000
})

Deployment Architecture

Monorepo Structure

dataknobs/
├── packages/           # Independent packages
│   ├── common/
│   ├── config/
│   ├── data/
│   ├── fsm/
│   ├── structures/
│   ├── utils/
│   ├── xization/
│   └── legacy/
├── docs/              # Documentation
├── tests/             # Integration tests
├── bin/               # Utility scripts
└── pyproject.toml     # Workspace configuration

Deployment Options

  1. Individual Package Installation:

    pip install dataknobs-structures
    

  2. Complete Installation:

    pip install dataknobs-structures dataknobs-utils dataknobs-xization
    

  3. Development Installation:

    uv pip install -e packages/structures
    

Performance Considerations

Memory Management

  • Lazy loading for large datasets
  • Streaming interfaces for file processing
  • Efficient tree traversal algorithms
  • Caching strategies for repeated operations

Scalability

  • Horizontal scaling through parallel processing
  • Batch processing capabilities
  • Async I/O for network operations
  • Connection pooling for database access

Optimization Techniques

  1. Vectorization: NumPy/Pandas for numerical operations
  2. Caching: LRU caches for expensive computations
  3. Indexing: Efficient data structures for lookups
  4. Streaming: Process data without loading entirely into memory

Security Architecture

Security Layers

  1. Input Validation: Sanitize all external inputs
  2. Access Control: Permission-based file/resource access
  3. Data Encryption: Support for encrypted storage
  4. Secure Communication: HTTPS for external services
  5. Dependency Management: Regular security updates

Best Practices

  • No hardcoded credentials
  • Environment-based configuration
  • Secure token storage
  • Input sanitization
  • Output encoding

Extension Points

Plugin Architecture

Dataknobs supports extensions through:

  1. Custom Processors: Implement processing interfaces
  2. Storage Backends: Add new storage adapters
  3. Normalization Rules: Define custom text normalization
  4. Tree Traversal: Custom traversal strategies

Example Extension

from dataknobs_structures import Tree

class CustomProcessor:
    """Custom tree processor implementation."""

    def process(self, tree: Tree) -> Tree:
        """Process tree with custom logic."""
        # Custom implementation
        return tree

# Register and use
processor = CustomProcessor()
result = processor.process(my_tree)

Future Architecture Goals

Short Term (Q1-Q2)

  • Async/await support throughout
  • GraphQL API layer
  • WebAssembly bindings

Medium Term (Q3-Q4)

  • Distributed processing support
  • Real-time streaming capabilities
  • Cloud-native deployments

Long Term (Next Year)

  • AI/ML integration layer
  • Multi-language support
  • Federated architecture