Architecture¶
Overview¶
Dataknobs is designed as a modular monorepo with clear separation of concerns. Each package has a specific responsibility and can be used independently or as part of the complete ecosystem.
System Architecture¶
```mermaid
graph TB
    subgraph "Application Layer"
        APP[Your Application]
    end

    subgraph "Dataknobs Packages"
        FSM[dataknobs-fsm<br/>Finite State Machine]
        DATA[dataknobs-data<br/>Data Abstraction Layer]
        CONFIG[dataknobs-config<br/>Configuration Management]
        STRUCT[dataknobs-structures<br/>Core Data Structures]
        UTILS[dataknobs-utils<br/>Utility Functions]
        XIZ[dataknobs-xization<br/>Text Processing]
        COMMON[dataknobs-common<br/>Shared Components]
        LEGACY[dataknobs-legacy<br/>Compatibility Layer]
    end

    subgraph "External Services"
        ES[Elasticsearch]
        DB[Database]
        FS[File System]
    end

    APP --> FSM
    APP --> DATA
    APP --> CONFIG
    APP --> STRUCT
    APP --> UTILS
    APP --> XIZ
    APP --> LEGACY

    FSM --> DATA
    FSM --> CONFIG
    FSM --> COMMON
    DATA --> CONFIG
    DATA --> COMMON
    CONFIG --> COMMON
    LEGACY --> STRUCT
    LEGACY --> UTILS
    LEGACY --> XIZ
    STRUCT --> COMMON
    UTILS --> COMMON
    XIZ --> COMMON
    XIZ --> STRUCT

    UTILS --> ES
    UTILS --> DB
    UTILS --> FS
```
Package Architecture¶
Core Design Principles¶
- Modularity: Each package is self-contained with minimal dependencies
- Composability: Packages can be combined to build complex solutions
- Extensibility: Easy to extend without modifying core code
- Testability: Clear interfaces enable comprehensive testing
- Performance: Optimized for large-scale data processing
Package Responsibilities¶
dataknobs-common¶
- Purpose: Shared utilities and base classes
- Key Components:
- Base exceptions
- Type definitions
- Shared constants
- Common utilities
dataknobs-fsm¶
- Purpose: Finite State Machine framework for workflow orchestration
- Key Components:
- `SimpleFSM`: Synchronous state machine
- `AsyncSimpleFSM`: Asynchronous state machine
- `AdvancedFSM`: Debugging and advanced features
- Data handling modes (COPY, REFERENCE, DIRECT)
- Resource management
- Streaming support
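To make the orchestration idea concrete, here is a bare-bones state machine in plain Python. This is only a conceptual sketch; the real `SimpleFSM` is configuration-driven and far richer (the `MiniFSM` name and its transition-table shape are illustrative assumptions, not the dataknobs-fsm API):

```python
# Minimal illustration of state-machine orchestration; NOT the SimpleFSM API.
class MiniFSM:
    def __init__(self, transitions, start):
        # transitions maps (current_state, event) -> next_state
        self.transitions = transitions
        self.state = start

    def step(self, event):
        """Advance to the next state for the given event."""
        self.state = self.transitions[(self.state, event)]
        return self.state

fsm = MiniFSM({("idle", "load"): "loaded", ("loaded", "run"): "done"}, "idle")
fsm.step("load")
fsm.step("run")  # state is now "done"
```

A production FSM adds what the bullets above describe: async execution, data handling modes, resource management, and streaming.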
dataknobs-data¶
- Purpose: Unified data abstraction layer
- Key Components:
- Database backends (Memory, File, PostgreSQL, S3, Elasticsearch)
- `Record` and `Query` abstractions
- Factory pattern for backend selection
- Transaction support
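The factory pattern for backend selection can be sketched in plain Python. The backend classes below are stand-ins to show the pattern; the actual dataknobs-data backends and their constructors differ:

```python
# Registry-based factory sketch; class names here are illustrative only.
class MemoryBackend:
    def __init__(self, **options):
        self.options = options
        self.records = {}

class FileBackend:
    def __init__(self, **options):
        self.options = options

class DatabaseFactory:
    _backends = {"memory": MemoryBackend, "file": FileBackend}

    def create(self, backend, **options):
        """Look up the backend by name and instantiate it."""
        try:
            cls = self._backends[backend]
        except KeyError:
            raise ValueError(f"unknown backend: {backend!r}") from None
        return cls(**options)

db = DatabaseFactory().create(backend="memory")
```

The payoff of this pattern is that application code selects a backend by configuration string alone, which is exactly how the ETL example later on this page swaps PostgreSQL and Elasticsearch.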
dataknobs-config¶
- Purpose: Configuration management with environment variables
- Key Components:
- YAML/JSON configuration loading
- Environment variable substitution
- Factory registration
- Cross-references
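Environment variable substitution can be illustrated with a small self-contained helper. This mimics the idea (replacing `${VAR}` placeholders in loaded config values), not the dataknobs-config implementation:

```python
# Toy ${VAR} substitution over nested config values; illustrative only.
import os
import re

_VAR = re.compile(r"\$\{([A-Za-z0-9_]+)\}")

def substitute_env(value):
    """Replace ${VAR} placeholders in strings, recursing into containers."""
    if isinstance(value, str):
        # Leave the placeholder untouched if the variable is unset
        return _VAR.sub(lambda m: os.environ.get(m.group(1), m.group(0)), value)
    if isinstance(value, dict):
        return {k: substitute_env(v) for k, v in value.items()}
    if isinstance(value, list):
        return [substitute_env(v) for v in value]
    return value

os.environ["DB_HOST"] = "localhost"
config = substitute_env({"database": {"host": "${DB_HOST}", "port": 5432}})
# config["database"]["host"] is now "localhost"
```

Keeping host names, credentials, and paths out of the YAML/JSON files and in the environment is also what enables the environment-based configuration recommended under Security Architecture below.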
dataknobs-structures¶
- Purpose: Core data structures for knowledge representation
- Key Components:
- `Tree`: Hierarchical data structure
- `Document`: Document abstraction
- `RecordStore`: Key-value storage
- `ConditionalDict`: Rule-based dictionary
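The kind of hierarchical structure `Tree` provides can be pictured with a bare-bones node class and a depth-first traversal. The `Node`/`walk` names are invented for this sketch; the real `Tree` API differs:

```python
# Minimal hierarchical structure with depth-first traversal; NOT the Tree API.
class Node:
    def __init__(self, data, children=None):
        self.data = data
        self.children = children or []

    def walk(self):
        """Yield each node's data in depth-first (pre-order) order."""
        yield self.data
        for child in self.children:
            yield from child.walk()

root = Node("doc", [Node("section", [Node("paragraph")]), Node("appendix")])
order = list(root.walk())  # ['doc', 'section', 'paragraph', 'appendix']
```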
dataknobs-utils¶
- Purpose: Utility functions for various operations
- Key Components:
- File I/O utilities
- JSON processing
- Elasticsearch integration
- LLM utilities
- Request handling
dataknobs-xization¶
- Purpose: Text processing and normalization
- Key Components:
- Tokenization
- Text normalization
- Pattern masking
- Language processing
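A toy pipeline shows how these stages compose: case folding and whitespace cleanup (normalization), then pattern masking, then tokenization. The functions below are illustrative stand-ins; the actual dataknobs-xization rule sets and APIs are richer:

```python
# Toy normalization + masking + tokenization pipeline; illustrative only.
import re

def normalize_text(text):
    text = text.lower()                       # case folding
    text = re.sub(r"\s+", " ", text).strip()  # whitespace normalization
    return text

def mask_patterns(text):
    # Pattern masking: hide digit runs behind a placeholder token
    return re.sub(r"\d+", "<NUM>", text)

tokens = mask_patterns(normalize_text("  Order  #12345 SHIPPED ")).split()
# tokens == ['order', '#<NUM>', 'shipped']
```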
dataknobs-legacy¶
- Purpose: Backward compatibility layer
- Status: Deprecated, for migration only
Data Flow Architecture¶
Typical Processing Pipeline¶
```mermaid
graph LR
    INPUT[Input Data] --> READ[File Reading<br/>utils]
    READ --> PARSE[Parsing<br/>utils]
    PARSE --> NORM[Normalization<br/>xization]
    NORM --> STRUCT[Structure Creation<br/>structures]
    STRUCT --> PROCESS[Processing<br/>Application Logic]
    PROCESS --> INDEX[Indexing<br/>utils]
    INDEX --> STORE[Storage<br/>Elasticsearch/DB]
```
Component Interactions¶
```python
# Example: Document processing pipeline
from dataknobs_structures import Document, Tree
from dataknobs_utils import elasticsearch_utils, file_utils
from dataknobs_xization import normalize

# 1. Read input
content = file_utils.read_file("input.txt")

# 2. Normalize text
normalized = normalize.basic_normalization_fn(content)

# 3. Create structure
doc = Document(normalized, metadata={"source": "input.txt"})
tree = Tree(doc)

# 4. Process (application-specific)
tree.process_nodes(custom_function)

# 5. Index for search
index = elasticsearch_utils.ElasticsearchIndex(...)
index.index_document(doc)
```
FSM-Based Processing Pipeline¶
```python
# Example: FSM-based ETL pipeline
from dataknobs_config import Config
from dataknobs_data import DatabaseFactory
from dataknobs_fsm import SimpleFSM

# Load configuration
config = Config("etl_config.yaml")

# Create FSM for orchestration
fsm = SimpleFSM(config.get("fsm_definition"))

# Set up database connections
db_factory = DatabaseFactory()
source_db = db_factory.create(backend="postgresql", **config.get("source_db"))
target_db = db_factory.create(backend="elasticsearch", **config.get("target_db"))

# Run the ETL pipeline through the FSM
result = fsm.process({
    "source": source_db,
    "target": target_db,
    "batch_size": 1000,
})
```
Deployment Architecture¶
Monorepo Structure¶
```
dataknobs/
├── packages/          # Independent packages
│   ├── common/
│   ├── config/
│   ├── data/
│   ├── fsm/
│   ├── structures/
│   ├── utils/
│   ├── xization/
│   └── legacy/
├── docs/              # Documentation
├── tests/             # Integration tests
├── bin/               # Utility scripts
└── pyproject.toml     # Workspace configuration
```
Deployment Options¶
- Individual Package Installation: install only the packages your application needs
- Complete Installation: install the full Dataknobs suite at once
- Development Installation: an editable install from the monorepo, for contributing or local development
Performance Considerations¶
Memory Management¶
- Lazy loading for large datasets
- Streaming interfaces for file processing
- Efficient tree traversal algorithms
- Caching strategies for repeated operations
Scalability¶
- Horizontal scaling through parallel processing
- Batch processing capabilities
- Async I/O for network operations
- Connection pooling for database access
Optimization Techniques¶
- Vectorization: NumPy/Pandas for numerical operations
- Caching: LRU caches for expensive computations
- Indexing: Efficient data structures for lookups
- Streaming: Process data without loading entirely into memory
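Two of the techniques above are cheap to demonstrate in standard-library Python: an LRU cache in front of an expensive computation, and a generator that streams a file line by line instead of loading it whole. This is a generic sketch, not dataknobs code:

```python
# Caching and streaming sketches using only the standard library.
from functools import lru_cache

@lru_cache(maxsize=256)
def expensive(n):
    """Stand-in for an expensive computation; repeated calls hit the cache."""
    return sum(i * i for i in range(n))

def stream_lines(path):
    """Yield lines from a file without reading it entirely into memory."""
    with open(path, "r", encoding="utf-8") as fh:
        for line in fh:
            yield line.rstrip("\n")

expensive(1000)  # computed once...
expensive(1000)  # ...then served from the cache (cache_info().hits == 1)
```

`lru_cache` trades memory for repeated-call latency, and generators keep peak memory proportional to one line rather than the whole file; both choices matter at the data volumes this section targets.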
Security Architecture¶
Security Layers¶
- Input Validation: Sanitize all external inputs
- Access Control: Permission-based file/resource access
- Data Encryption: Support for encrypted storage
- Secure Communication: HTTPS for external services
- Dependency Management: Regular security updates
Best Practices¶
- No hardcoded credentials
- Environment-based configuration
- Secure token storage
- Input sanitization
- Output encoding
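The first two practices, no hardcoded credentials and environment-based configuration, amount to reading secrets from the environment and failing loudly when they are absent. A minimal illustration (the `SERVICE_API_TOKEN` name is hypothetical, not a dataknobs setting):

```python
# Environment-based credential lookup instead of hardcoded secrets.
import os

def get_api_token():
    """Fetch the token from the environment; never embed it in source."""
    token = os.environ.get("SERVICE_API_TOKEN")
    if not token:
        raise RuntimeError("SERVICE_API_TOKEN is not set")
    return token

# In production the deployment sets this variable; it is set here only
# so the sketch runs end to end.
os.environ["SERVICE_API_TOKEN"] = "example-token"
token = get_api_token()
```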
Extension Points¶
Plugin Architecture¶
Dataknobs supports extensions through:
- Custom Processors: Implement processing interfaces
- Storage Backends: Add new storage adapters
- Normalization Rules: Define custom text normalization
- Tree Traversal: Custom traversal strategies
Example Extension¶
```python
from dataknobs_structures import Tree

class CustomProcessor:
    """Custom tree processor implementation."""

    def process(self, tree: Tree) -> Tree:
        """Process tree with custom logic."""
        # Custom implementation
        return tree

# Register and use
processor = CustomProcessor()
result = processor.process(my_tree)
```
Future Architecture Goals¶
Short Term (Q1-Q2)¶
- Async/await support throughout
- GraphQL API layer
- WebAssembly bindings
Medium Term (Q3-Q4)¶
- Distributed processing support
- Real-time streaming capabilities
- Cloud-native deployments
Long Term (Next Year)¶
- AI/ML integration layer
- Multi-language support
- Federated architecture