Development Guide

Welcome to the Dataknobs development documentation. This section provides comprehensive information for developers who want to contribute to, extend, or understand the internal workings of the Dataknobs ecosystem.

Overview

Dataknobs is a modular Python ecosystem for AI knowledge base structures and text processing. The project is organized as a monorepo with multiple interconnected packages that work together to provide comprehensive data processing capabilities.

Quick Start with dk Command

New Developer? Start Here!

The dk command is your unified interface for all development tasks. Install it with ./setup-dk.sh and use simple commands like dk pr to prepare for pull requests or dk test to run tests.

→ Learn the dk Command

Creating New Packages

Package Creation Automation

Creating a new DataKnobs package is fully automated! Use ./bin/create-package.py to generate the package structure and integrate it into the ecosystem automatically. The script handles the tedious integration work so you can focus on implementing the functionality.

Quick commands:

  • ./bin/create-package.py <name> -d "description" - Create new package
  • ./bin/create-package.py --help - See all options
  • ./bin/validate-package-references.py - Validate integration

→ New Package Checklist - Complete guide with automated and manual steps

Release Management

Simplified Release Process

The release process has been streamlined with automated tools that handle version bumping, changelog generation, and publishing. Use dk release for an interactive guided process or check the Release Process Guide for detailed documentation and FAQ.

Quick commands:

  • dk release - Interactive complete release
  • dk release-check - See what changed
  • dk release-bump - Update versions
  • dk release-notes - Generate changelog

Getting Started

If you're new to Dataknobs development:

  1. Developer Workflow (dk) - 🚀 Start here - The easy way to develop
  2. Linux Setup Guide - 🐧 Linux users - Platform-specific setup and troubleshooting
  3. Contributing Guide - Learn how to contribute
  4. New Package Checklist - 📦 Create and integrate new packages
  5. Configuration System - Understand the DataKnobs configuration patterns
  6. UV Virtual Environment Guide - How to work with UV package manager
  7. Quality Checks Process - Developer-driven quality assurance
  8. Architecture Overview - Understand the system design
  9. Testing Guide - Learn about our testing approach
  10. Integration Testing & CI - Integration testing in CI/CD pipeline
  11. CI/CD Pipeline - Understand our deployment process

Development Topics

Core Development

Testing

Operations

Project Structure

dataknobs/
├── packages/          # Individual packages
│   ├── common/          # Shared utilities
│   ├── config/          # Configuration system
│   ├── data/            # Database abstractions
│   ├── fsm/             # FSM processing
│   ├── llm/             # LLM integration
│   ├── bots/            # AI agents and chatbots
│   ├── structures/      # Core data structures
│   ├── utils/           # Utility functions
│   ├── xization/        # Text processing
│   └── legacy/          # Legacy compatibility
├── docs/              # Documentation
├── tests/             # Integration tests
├── docker/            # Docker configurations
├── bin/               # Scripts and tools
└── resources/         # Shared resources

Development Environment

Prerequisites

  • Python: 3.12 or higher
  • Package Manager: UV (fast Python package manager)
  • Version Control: Git
  • Docker: For running PostgreSQL, Elasticsearch, and LocalStack services

Quick Setup with UV

# Clone the repository
git clone https://github.com/yourusername/dataknobs.git
cd dataknobs

# Install all dependencies
uv sync --all-packages

# Activate virtual environment
source .venv/bin/activate  # On Linux/macOS
# or
.venv\Scripts\activate  # On Windows

# Run quality checks before PRs (includes integration tests)
./bin/run-quality-checks.sh

# Or run specific test types with the new test infrastructure
./bin/test.sh                          # Run all tests (unit + integration)
./bin/test.sh -t unit                  # Unit tests only
./bin/test.sh -t integration           # Integration tests with services
./bin/test.sh data                     # Test specific package
./bin/run-integration-tests.sh -s      # Start services for manual testing

Development Services

Docker-based Services

Most development services run via Docker:

# Start all services
docker-compose up -d postgres elasticsearch localstack

# Check service status
docker-compose ps

# Stop services
docker-compose down

Ollama (Local Installation Required)

Unlike other services, Ollama runs locally due to hardware requirements (GPU access).

Installation:

  • macOS: brew install ollama
  • Linux: curl -fsSL https://ollama.ai/install.sh | sh
  • Windows: Download from https://ollama.ai/download

Starting Ollama:

ollama serve

Verifying Ollama:

./bin/check-ollama.sh
# Or manually:
curl http://localhost:11434/api/tags

Running Tests Without Ollama:

export TEST_OLLAMA=false
dk test
# Or use quick test mode (skips all integration tests)
dk testquick

For more details, see the UV Virtual Environment Guide and Quality Checks Process.

Package Overview

dataknobs-common

Purpose: Shared utilities and base classes used across all packages.

Key Components:

  • Base classes and interfaces
  • Common configuration management
  • Standardized logging
  • Error handling framework

dataknobs-config

Purpose: Modular configuration system with environment variable support.

Key Components:

  • YAML/JSON configuration loading
  • Environment variable substitution (${VAR:default})
  • Factory registration for dynamic object creation
  • Cross-reference resolution
  • Layered configuration merging
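
The ${VAR:default} substitution pattern can be sketched in a few lines. This is an illustrative implementation of the syntax described above, not the actual dataknobs-config code; the function name substitute_env is an assumption.

```python
import os
import re

# Matches ${VAR} or ${VAR:default}; the default part is optional.
_PATTERN = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)(?::([^}]*))?\}")

def substitute_env(value: str) -> str:
    """Replace ${VAR} or ${VAR:default} with the environment value."""
    def _repl(match: re.Match) -> str:
        name, default = match.group(1), match.group(2)
        return os.environ.get(name, default if default is not None else "")
    return _PATTERN.sub(_repl, value)

os.environ["DB_HOST"] = "db.internal"
print(substitute_env("postgres://${DB_HOST}:${DB_PORT:5432}/app"))
# prints postgres://db.internal:5432/app
```

The default after the colon applies only when the variable is unset, which is what makes layered configuration safe to run in environments with partial overrides.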

dataknobs-data

Purpose: Unified data abstraction layer for consistent operations across storage backends.

Key Components:

  • Multiple backend support (Memory, File, PostgreSQL, Elasticsearch, S3)
  • Unified Record and Query abstractions
  • Factory pattern for dynamic backend selection
  • Transaction management (Single, Batch, Manual)
  • Vector store integration
  • Streaming support for large datasets

dataknobs-fsm

Purpose: Finite State Machine framework for workflow orchestration and data processing.

Key Components:

  • Three API levels: SimpleFSM (sync), AsyncSimpleFSM (async), AdvancedFSM (debugging)
  • Data handling modes (COPY, REFERENCE, DIRECT) for different performance/safety tradeoffs
  • Built-in resource management (databases, files, HTTP, LLMs, vector stores)
  • Streaming support with backpressure handling
  • YAML/JSON configuration with inline transforms
  • Step-by-step debugging with breakpoints and execution hooks
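
To make the state/transition idea concrete, here is a minimal synchronous FSM sketch. The real SimpleFSM API differs; TinyFSM and its methods are assumptions used only to illustrate how states, events, and data-transforming actions fit together.

```python
from typing import Callable

class TinyFSM:
    def __init__(self, start: str):
        self.state = start
        # (source state, event) -> (destination state, action on data)
        self._transitions: dict[tuple[str, str], tuple[str, Callable[[dict], dict]]] = {}

    def add(self, src: str, event: str, dst: str, action=lambda d: d):
        self._transitions[(src, event)] = (dst, action)

    def fire(self, event: str, data: dict) -> dict:
        dst, action = self._transitions[(self.state, event)]
        self.state = dst
        return action(data)

fsm = TinyFSM("raw")
fsm.add("raw", "clean", "cleaned", lambda d: {**d, "text": d["text"].strip()})
fsm.add("cleaned", "tag", "tagged", lambda d: {**d, "tags": ["ok"]})
out = fsm.fire("clean", {"text": "  hello "})
out = fsm.fire("tag", out)
print(fsm.state, out)  # tagged {'text': 'hello', 'tags': ['ok']}
```

Each transition copies the data dict before mutating it, which corresponds loosely to the COPY data handling mode; REFERENCE and DIRECT trade that safety for speed.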

dataknobs-llm

Purpose: LLM integration with prompt management, conversations, versioning, and tools.

Key Components:

  • Multi-provider LLM support (OpenAI, Anthropic, Ollama, etc.)
  • Prompt template management with versioning
  • Conversation history and context management
  • Tool/function calling support
  • Cost tracking and token usage monitoring
  • Async/await support for concurrent requests
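
Multi-provider support usually means coding against a common interface rather than a specific vendor SDK. The sketch below shows that shape with a structural Protocol; LLMProvider, EchoProvider, and summarize are invented names, not the dataknobs-llm API.

```python
from typing import Protocol

class LLMProvider(Protocol):
    """Anything with a complete() method counts as a provider."""
    def complete(self, prompt: str) -> str: ...

class EchoProvider:
    """Stand-in provider for tests; a real one would call OpenAI or Ollama."""
    def complete(self, prompt: str) -> str:
        return f"echo: {prompt}"

def summarize(provider: LLMProvider, text: str) -> str:
    # Application code depends only on the protocol, not a vendor.
    return provider.complete(f"Summarize: {text}")

print(summarize(EchoProvider(), "hello"))  # echo: Summarize: hello
```

Swapping OpenAI for Anthropic or a local Ollama model then becomes a configuration choice, and unit tests can use a cheap stub like EchoProvider.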

dataknobs-bots

Purpose: Configuration-driven AI agents and chatbots for building intelligent applications.

Key Components:

  • Multi-tenant bot architecture with BotRegistry
  • Memory systems (buffer, vector) for conversation context
  • RAG (Retrieval Augmented Generation) with knowledge base integration
  • Reasoning strategies (Simple, ReAct) for tool-using agents
  • Configuration-driven tool loading without code changes
  • Production-ready with PostgreSQL storage and horizontal scaling

dataknobs-structures

Purpose: Core data structures for hierarchical and document-based data.

Key Components:

  • Tree data structure with advanced navigation
  • Document and text processing classes
  • Record storage and retrieval
  • Conditional dictionary implementations
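
A minimal tree node with parent/child navigation looks like this. The actual dataknobs-structures Tree API is richer; Node and its methods here are illustrative names.

```python
class Node:
    def __init__(self, data, parent=None):
        self.data = data
        self.parent = parent
        self.children: list["Node"] = []

    def add_child(self, data) -> "Node":
        child = Node(data, parent=self)
        self.children.append(child)
        return child

    def depth(self) -> int:
        # Root is depth 0; each parent link adds one level.
        return 0 if self.parent is None else self.parent.depth() + 1

    def walk(self):
        # Depth-first traversal: node first, then its subtrees.
        yield self
        for child in self.children:
            yield from child.walk()

root = Node("doc")
sec = root.add_child("section")
sec.add_child("paragraph")
print([n.data for n in root.walk()], sec.depth())
# ['doc', 'section', 'paragraph'] 1
```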

dataknobs-utils

Purpose: Utility functions for various data processing tasks.

Key Components:

  • File operations and I/O utilities
  • Elasticsearch integration
  • JSON processing tools
  • LLM prompt management
  • Database and statistical utilities

dataknobs-xization

Purpose: Text normalization, tokenization, and processing.

Key Components:

  • Text normalization functions
  • Character-level analysis
  • Tokenization and masking
  • Lexical variation generation
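
As a flavor of what text normalization involves, here is a small helper that lowercases, strips accents, and collapses whitespace. It is written in the spirit of dataknobs-xization; the package's real function names and behavior may differ.

```python
import unicodedata

def normalize(text: str) -> str:
    """Lowercase, strip combining accents, and collapse whitespace."""
    # NFKD splits accented characters into base + combining marks.
    decomposed = unicodedata.normalize("NFKD", text.lower())
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return " ".join(stripped.split())

print(normalize("  Café   du\tMonde "))  # cafe du monde
```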

Development Workflow

1. Issue Creation

  • Use GitHub issues to track bugs, features, and improvements
  • Follow issue templates for consistency
  • Label issues appropriately
  • Assign to milestones when relevant

2. Branch Management

  • main: Stable, production-ready code
  • develop: Integration branch for features
  • feature/*: Individual feature development
  • bugfix/*: Bug fixes
  • hotfix/*: Critical production fixes

3. Code Development

  • Follow Python PEP 8 style guidelines
  • Write comprehensive docstrings
  • Include type hints
  • Add appropriate tests
  • Update documentation

4. Testing

  • Write unit tests for all new functionality
  • Run integration tests with real services (PostgreSQL, Elasticsearch)
  • Ensure all tests pass with ./bin/run-quality-checks.sh
  • Achieve minimum code coverage targets (70% overall, 90% for new code)
  • Test across supported Python versions (3.12+)

5. Review Process

  • Create pull request with detailed description
  • Request review from maintainers
  • Address feedback and make necessary changes
  • Ensure all checks pass

Code Standards

Python Style

  • Follow PEP 8 coding style
  • Use black for code formatting
  • Use isort for import organization
  • Use pylint for code quality checks

Documentation

  • Write clear, comprehensive docstrings
  • Follow Google docstring style
  • Include usage examples
  • Update README files as needed

Testing

  • Aim for >90% code coverage
  • Write both unit and integration tests
  • Use descriptive test names
  • Include edge cases and error conditions
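
The conventions above can be illustrated with a small test file: descriptive names covering the happy path, an edge case, and an error condition. The function under test, chunk, is invented for this example.

```python
def chunk(items: list, size: int) -> list[list]:
    """Split items into consecutive groups of at most `size`."""
    if size < 1:
        raise ValueError("size must be >= 1")
    return [items[i:i + size] for i in range(0, len(items), size)]

def test_chunk_splits_list_into_even_groups():
    assert chunk([1, 2, 3, 4], 2) == [[1, 2], [3, 4]]

def test_chunk_returns_empty_list_for_empty_input():
    assert chunk([], 3) == []

def test_chunk_rejects_non_positive_size():
    try:
        chunk([1], 0)
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError")
```

Each test name states the behavior it verifies, so a failure report reads as a sentence describing what broke.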

Type Hints

  • Use type hints for all public functions
  • Import types from typing module
  • Use Union, Optional, and generics appropriately
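
A short example of the guidance above (note that on Python 3.12 the built-in generics and `float | None` syntax are equally acceptable; find_score is a made-up function):

```python
from typing import Optional, Union

def find_score(scores: dict[str, float], name: str,
               default: Optional[float] = None) -> Union[float, None]:
    """Return the score for `name`, or `default` when it is absent."""
    return scores.get(name, default)

print(find_score({"a": 0.9}, "a"), find_score({"a": 0.9}, "b", 0.0))
# 0.9 0.0
```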

Tools and Utilities

Code Quality

# Format code
black packages/

# Sort imports
isort packages/

# Check style
flake8 packages/

# Type checking
mypy packages/

# Security scanning
bandit -r packages/

Testing

# Run all tests (unit + integration) with new infrastructure
./bin/test.sh

# Run unit tests only
./bin/test.sh -t unit

# Run integration tests with services
./bin/test.sh -t integration

# Test specific package
./bin/test.sh data                     # All tests for data package
./bin/test.sh -t unit config          # Unit tests for config package
./bin/test.sh -t integration data      # Integration tests for data package

# Advanced options
./bin/test.sh -v                      # Verbose output
./bin/test.sh -k test_s3              # Run tests matching pattern
./bin/test.sh -x                      # Stop on first failure
./bin/test.sh -n -t integration       # Run integration tests without starting services

# Service management
./bin/run-integration-tests.sh -s     # Start services only
./bin/run-integration-tests.sh -k     # Keep services running after tests

# Legacy pytest commands (still available)
pytest                                 # Run all tests
pytest -m "not integration"            # Unit tests only
pytest --cov=packages/                # Run with coverage
pytest packages/structures/tests/     # Run specific package tests

# Run quality checks (linting + tests + coverage)
./bin/run-quality-checks.sh

Documentation

# Build documentation
mkdocs serve

# Generate API docs
mkdocs build

# Test documentation links
mkdocs build --strict

Performance Considerations

Memory Management

  • Use generators for large dataset processing
  • Implement proper resource cleanup
  • Monitor memory usage in tests
  • Consider lazy loading for large data structures
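
The generator advice above can be sketched as follows: lines are streamed one at a time instead of loading the whole file into memory (the file name in the usage comment is made up).

```python
from typing import Iterator

def non_empty_lines(path: str) -> Iterator[str]:
    """Yield stripped, non-empty lines without reading the file at once."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:            # one line in memory at a time
            stripped = line.strip()
            if stripped:
                yield stripped

# Usage: counts lines lazily without materializing the whole file.
# total = sum(1 for _ in non_empty_lines("big_dataset.txt"))
```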

Processing Efficiency

  • Profile code performance regularly
  • Use appropriate data structures
  • Implement caching where beneficial
  • Consider parallel processing for CPU-intensive tasks

Scalability

  • Design for horizontal scaling
  • Use streaming processing for large files
  • Implement proper error handling and recovery
  • Consider database connection pooling

Security Guidelines

Input Validation

  • Validate all user inputs
  • Sanitize data before processing
  • Use parameterized queries for databases
  • Implement proper authentication and authorization
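
Parameterized queries are worth showing concretely. In the sketch below, sqlite3 stands in for PostgreSQL: user input is passed as a bound parameter rather than interpolated into the SQL string, so injection attempts are inert.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, role TEXT)")
conn.execute("INSERT INTO users VALUES (?, ?)", ("alice", "admin"))

user_input = "alice'; DROP TABLE users; --"   # hostile input stays inert
rows = conn.execute(
    "SELECT role FROM users WHERE name = ?", (user_input,)
).fetchall()
print(rows)  # [] -- no match, and no injection
```

The same pattern applies with psycopg2/asyncpg placeholders; the key point is that the driver, not string formatting, handles quoting.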

Data Handling

  • Encrypt sensitive data at rest
  • Use HTTPS for all network communications
  • Implement proper logging (avoid logging sensitive data)
  • Follow data retention policies

Dependencies

  • Regularly update dependencies
  • Use security scanning tools
  • Pin dependency versions
  • Review new dependencies for security issues

Debugging and Troubleshooting

Common Issues

  1. Import Errors
     • Ensure packages are installed in development mode
     • Check Python path configuration
     • Verify virtual environment activation

  2. Test Failures
     • Run tests individually to isolate issues
     • Check for test data dependencies
     • Verify mock configurations

  3. Performance Issues
     • Use profiling tools to identify bottlenecks
     • Check memory usage patterns
     • Review algorithm complexity

Debugging Tools

# Python debugger
import pdb; pdb.set_trace()

# Performance profiling
import cProfile
cProfile.run('your_function()')

# Memory profiling
from memory_profiler import profile
@profile
def your_function():
    pass

Communication and Community

Channels

  • GitHub Issues: Bug reports and feature requests
  • GitHub Discussions: General questions and community discussions
  • Pull Requests: Code contributions and reviews

Guidelines

  • Be respectful and constructive
  • Provide clear and detailed information
  • Follow up on your contributions
  • Help others when possible

Resources

Documentation

External Resources

Getting Help

If you need help with development:

  1. Check existing documentation and examples
  2. Search GitHub issues for similar problems
  3. Create a new issue with detailed information
  4. Join community discussions for general questions

Next Steps

Ready to contribute? Start with:

  1. Read the Contributing Guide
  2. Set up your development environment
  3. Pick a "good first issue" from GitHub
  4. Make your first contribution!

We welcome contributions of all types: code, documentation, testing, and community support. Thank you for helping make Dataknobs better!