Development Guide¶
Welcome to the Dataknobs development documentation. This section provides comprehensive information for developers who want to contribute to, extend, or understand the internal workings of the Dataknobs ecosystem.
Overview¶
Dataknobs is a modular Python ecosystem for AI knowledge base structures and text processing. The project is organized as a monorepo with multiple interconnected packages that work together to provide comprehensive data processing capabilities.
Quick Start with dk Command¶
New Developer? Start Here!
The dk command is your unified interface for all development tasks. Install it with ./setup-dk.sh and use simple commands like dk pr to prepare for pull requests or dk test to run tests.
Creating New Packages¶
Package Creation Automation
Creating a new DataKnobs package is fully automated! Use ./bin/create-package.py to generate package structure and automatically integrate it into the ecosystem. The script handles all the tedious integration work - you just focus on implementing the functionality.
Quick commands:
- ./bin/create-package.py <name> -d "description" - Create new package
- ./bin/create-package.py --help - See all options
- ./bin/validate-package-references.py - Validate integration
→ New Package Checklist - Complete guide with automated and manual steps
Release Management¶
Simplified Release Process
The release process has been streamlined with automated tools that handle version bumping, changelog generation, and publishing. Use dk release for an interactive guided process or check the Release Process Guide for detailed documentation and FAQ.
Quick commands:
- dk release - Interactive complete release
- dk release-check - See what changed
- dk release-bump - Update versions
- dk release-notes - Generate changelog
Getting Started¶
If you're new to Dataknobs development:
- Developer Workflow (dk) - 🚀 Start here - The easy way to develop
- Linux Setup Guide - 🐧 Linux users - Platform-specific setup and troubleshooting
- Contributing Guide - Learn how to contribute
- New Package Checklist - 📦 Create and integrate new packages
- Configuration System - Understand the DataKnobs configuration patterns
- UV Virtual Environment Guide - How to work with UV package manager
- Quality Checks Process - Developer-driven quality assurance
- Architecture Overview - Understand the system design
- Testing Guide - Learn about our testing approach
- Integration Testing & CI - Integration testing in CI/CD pipeline
- CI/CD Pipeline - Understand our deployment process
Development Topics¶
Core Development¶
- Contributing Guide - How to contribute code, documentation, and report issues
- New Package Checklist - 📦 Automated package creation and integration guide
- Configuration System - DataKnobs configuration patterns and best practices
- Adding Config Support - Step-by-step guide to add configuration support to packages
- UV Virtual Environment Guide - Working with UV package manager and virtual environments
- Quality Checks Process - Running quality checks locally before PRs
- Architecture Overview - System architecture and design principles
- Documentation Guide - How to write and maintain documentation
Testing¶
- Testing Guide - Testing strategies, frameworks, and best practices
- Testing Commands - Practical guide to running tests with the new test infrastructure
- Integration Testing & CI - Integration testing with real services and CI/CD quality gates
Operations¶
- CI/CD Pipeline - Continuous integration and deployment processes
- Dependency Updates - Automated weekly updates and the review process
- Release Process - Streamlined release workflow with automated tools and comprehensive FAQ
Project Structure¶
dataknobs/
├── packages/ # Individual packages
│ ├── common/ # Shared utilities
│ ├── config/ # Configuration system
│ ├── data/ # Database abstractions
│ ├── fsm/ # FSM processing
│ ├── llm/ # LLM integration
│ ├── bots/ # AI agents and chatbots
│ ├── structures/ # Core data structures
│ ├── utils/ # Utility functions
│ ├── xization/ # Text processing
│ └── legacy/ # Legacy compatibility
├── docs/ # Documentation
├── tests/ # Integration tests
├── docker/ # Docker configurations
├── bin/ # Scripts and tools
└── resources/ # Shared resources
Development Environment¶
Prerequisites¶
- Python: 3.12 or higher
- Package Manager: UV (fast Python package manager)
- Version Control: Git
- Docker: For running PostgreSQL, Elasticsearch, and LocalStack services
Quick Setup with UV¶
# Clone the repository
git clone https://github.com/yourusername/dataknobs.git
cd dataknobs
# Install all dependencies
uv sync --all-packages
# Activate virtual environment
source .venv/bin/activate # On Linux/macOS
# or
.venv\Scripts\activate # On Windows
# Run quality checks before PRs (includes integration tests)
./bin/run-quality-checks.sh
# Or run specific test types with the new test infrastructure
./bin/test.sh # Run all tests (unit + integration)
./bin/test.sh -t unit # Unit tests only
./bin/test.sh -t integration # Integration tests with services
./bin/test.sh data # Test specific package
./bin/run-integration-tests.sh -s # Start services for manual testing
Development Services¶
Docker-based Services¶
Most development services run via Docker:
# Start all services
docker-compose up -d postgres elasticsearch localstack
# Check service status
docker-compose ps
# Stop services
docker-compose down
Ollama (Local Installation Required)¶
Unlike the other services, Ollama is installed and run directly on the host rather than in Docker, because it needs direct access to local hardware (notably the GPU).
Installation:
- macOS: brew install ollama
- Linux: curl -fsSL https://ollama.ai/install.sh | sh
- Windows: Download from https://ollama.ai/download
Starting Ollama:
# Start the Ollama server (keep it running in a separate terminal,
# or install it as a system service)
ollama serve
Verifying Ollama:
# List installed models; a response (even an empty table) confirms
# the server is reachable
ollama list
Running Tests Without Ollama:
export TEST_OLLAMA=false
dk test
# Or use quick test mode (skips all integration tests)
dk testquick
For more details, see the UV Virtual Environment Guide and Quality Checks Process.
Package Overview¶
dataknobs-common¶
Purpose: Shared utilities and base classes used across all packages.
Key Components:
- Base classes and interfaces
- Common configuration management
- Standardized logging
- Error handling framework
dataknobs-config¶
Purpose: Modular configuration system with environment variable support.
Key Components:
- YAML/JSON configuration loading
- Environment variable substitution (${VAR:default})
- Factory registration for dynamic object creation
- Cross-reference resolution
- Layered configuration merging
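The `${VAR:default}` substitution pattern can be illustrated with a small standalone function. This is a sketch of the pattern only, not the dataknobs-config implementation:

```python
import os
import re

# Matches ${VAR} or ${VAR:default}; group 1 = name, group 2 = default (or None)
_VAR_PATTERN = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)(?::([^}]*))?\}")

def substitute_env_vars(value: str) -> str:
    """Replace ${VAR:default} placeholders with environment values."""
    def _replace(match: re.Match) -> str:
        name, default = match.group(1), match.group(2)
        resolved = os.environ.get(name, default)
        if resolved is None:
            raise KeyError(
                f"Environment variable {name!r} is not set and no default was given"
            )
        return resolved
    return _VAR_PATTERN.sub(_replace, value)
```

With `DB_HOST` unset, `substitute_env_vars("postgres://${DB_HOST:localhost}:5432")` falls back to the default and yields `"postgres://localhost:5432"`.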
dataknobs-data¶
Purpose: Unified data abstraction layer for consistent operations across storage backends.
Key Components:
- Multiple backend support (Memory, File, PostgreSQL, Elasticsearch, S3)
- Unified Record and Query abstractions
- Factory pattern for dynamic backend selection
- Transaction management (Single, Batch, Manual)
- Vector store integration
- Streaming support for large datasets
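The idea behind the unified Record/Query abstraction can be sketched with an in-memory backend. Class and method names below are illustrative, not the actual dataknobs-data API:

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class Record:
    """A schemaless record: an id plus arbitrary named fields."""
    id: str
    fields: dict[str, Any] = field(default_factory=dict)

@dataclass
class Query:
    """A simple equality filter over record fields."""
    filters: dict[str, Any] = field(default_factory=dict)

    def matches(self, record: Record) -> bool:
        return all(record.fields.get(k) == v for k, v in self.filters.items())

class MemoryBackend:
    """Minimal in-memory backend exposing the same store/search surface
    that a file, PostgreSQL, or Elasticsearch backend would implement."""

    def __init__(self) -> None:
        self._records: dict[str, Record] = {}

    def store(self, record: Record) -> None:
        self._records[record.id] = record

    def search(self, query: Query) -> list[Record]:
        return [r for r in self._records.values() if query.matches(r)]
```

Because callers only depend on the `store`/`search` surface, a factory can swap the memory backend for a persistent one without changing application code.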
dataknobs-fsm¶
Purpose: Finite State Machine framework for workflow orchestration and data processing.
Key Components:
- Three API levels: SimpleFSM (sync), AsyncSimpleFSM (async), AdvancedFSM (debugging)
- Data handling modes (COPY, REFERENCE, DIRECT) for different performance/safety tradeoffs
- Built-in resource management (databases, files, HTTP, LLMs, vector stores)
- Streaming support with backpressure handling
- YAML/JSON configuration with inline transforms
- Step-by-step debugging with breakpoints and execution hooks
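A finite state machine of this kind can be sketched in a few lines: named states, handlers that receive the data payload, and a terminal state. This toy class is not the real SimpleFSM API, just an illustration of the execution model:

```python
from collections.abc import Callable
from typing import Any

class FSMSketch:
    """Toy FSM: each state maps to a handler that mutates the payload
    and returns the name of the next state."""

    def __init__(self, start: str) -> None:
        self.state = start
        self._handlers: dict[str, Callable[[Any], str]] = {}

    def add_state(self, name: str, handler: Callable[[Any], str]) -> None:
        self._handlers[name] = handler

    def run(self, data: Any, end: str = "done") -> Any:
        """Drive the machine until the terminal state is reached."""
        while self.state != end:
            self.state = self._handlers[self.state](data)
        return data
```

A two-step validate/transform pipeline would register a handler per state and call `run(payload)`; the real framework adds async execution, resource management, and debugging hooks on top of this core loop.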
dataknobs-llm¶
Purpose: LLM integration with prompt management, conversations, versioning, and tools.
Key Components:
- Multi-provider LLM support (OpenAI, Anthropic, Ollama, etc.)
- Prompt template management with versioning
- Conversation history and context management
- Tool/function calling support
- Cost tracking and token usage monitoring
- Async/await support for concurrent requests
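Token and cost tracking boils down to accumulating per-call usage and pricing it at the end. The class below is an illustrative sketch (not part of dataknobs-llm), and the prices used in examples are placeholders, not real provider rates:

```python
from dataclasses import dataclass

@dataclass
class UsageTracker:
    """Accumulate token counts across LLM calls and estimate cost."""
    prompt_tokens: int = 0
    completion_tokens: int = 0

    def record(self, prompt: int, completion: int) -> None:
        """Add the usage reported by one LLM call."""
        self.prompt_tokens += prompt
        self.completion_tokens += completion

    def cost(self, prompt_price: float, completion_price: float) -> float:
        """Estimated cost given per-1K-token prices for each token class."""
        return (self.prompt_tokens * prompt_price
                + self.completion_tokens * completion_price) / 1000
```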
dataknobs-bots¶
Purpose: Configuration-driven AI agents and chatbots for building intelligent applications.
Key Components:
- Multi-tenant bot architecture with BotRegistry
- Memory systems (buffer, vector) for conversation context
- RAG (Retrieval Augmented Generation) with knowledge base integration
- Reasoning strategies (Simple, ReAct) for tool-using agents
- Configuration-driven tool loading without code changes
- Production-ready with PostgreSQL storage and horizontal scaling
dataknobs-structures¶
Purpose: Core data structures for hierarchical and document-based data.
Key Components:
- Tree data structure with advanced navigation
- Document and text processing classes
- Record storage and retrieval
- Conditional dictionary implementations
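The kind of navigation a tree structure provides can be sketched with a minimal node class (illustrative only, not the dataknobs-structures Tree API):

```python
from __future__ import annotations
from collections.abc import Iterator

class Node:
    """Minimal tree node with parent links, depth, and DFS traversal."""

    def __init__(self, data: str, parent: Node | None = None) -> None:
        self.data = data
        self.parent = parent
        self.children: list[Node] = []

    def add_child(self, data: str) -> Node:
        child = Node(data, parent=self)
        self.children.append(child)
        return child

    def walk(self) -> Iterator[Node]:
        """Yield this node and all descendants, depth-first."""
        yield self
        for child in self.children:
            yield from child.walk()

    @property
    def depth(self) -> int:
        """Root is depth 0; each level of nesting adds one."""
        return 0 if self.parent is None else self.parent.depth + 1
```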
dataknobs-utils¶
Purpose: Utility functions for various data processing tasks.
Key Components:
- File operations and I/O utilities
- Elasticsearch integration
- JSON processing tools
- LLM prompt management
- Database and statistical utilities
dataknobs-xization¶
Purpose: Text normalization, tokenization, and processing.
Key Components:
- Text normalization functions
- Character-level analysis
- Tokenization and masking
- Lexical variation generation
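The normalization, tokenization, and masking steps can be illustrated with small standalone functions (a sketch of the concepts, not the dataknobs-xization API):

```python
import re
import unicodedata

def normalize(text: str) -> str:
    """Lowercase, strip accents, and collapse whitespace."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(c for c in text if not unicodedata.combining(c))
    return re.sub(r"\s+", " ", text).strip().lower()

def tokenize(text: str) -> list[str]:
    """Split normalized text into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", normalize(text))

def mask_digits(text: str, mask: str = "#") -> str:
    """Replace every digit with a mask character (a common masking step)."""
    return re.sub(r"\d", mask, text)
```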
Development Workflow¶
1. Issue Creation¶
- Use GitHub issues to track bugs, features, and improvements
- Follow issue templates for consistency
- Label issues appropriately
- Assign to milestones when relevant
2. Branch Management¶
- main: Stable, production-ready code
- develop: Integration branch for features
- feature/*: Individual feature development
- bugfix/*: Bug fixes
- hotfix/*: Critical production fixes
3. Code Development¶
- Follow Python PEP 8 style guidelines
- Write comprehensive docstrings
- Include type hints
- Add appropriate tests
- Update documentation
4. Testing¶
- Write unit tests for all new functionality
- Run integration tests with real services (PostgreSQL, Elasticsearch)
- Ensure all tests pass with ./bin/run-quality-checks.sh
- Achieve minimum code coverage targets (70% overall, 90% for new code)
- Test across supported Python versions (3.12+)
5. Review Process¶
- Create pull request with detailed description
- Request review from maintainers
- Address feedback and make necessary changes
- Ensure all checks pass
Code Standards¶
Python Style¶
- Follow PEP 8 coding style
- Use black for code formatting
- Use isort for import organization
- Use pylint for code quality checks
Documentation¶
- Write clear, comprehensive docstrings
- Follow Google docstring style
- Include usage examples
- Update README files as needed
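A Google-style docstring with a usage example looks like this (the function itself is just an illustration):

```python
def count_tokens(text: str, separator: str = " ") -> int:
    """Count the tokens in a string.

    Args:
        text: The input string to tokenize.
        separator: The delimiter used to split the string.

    Returns:
        The number of non-empty tokens.

    Raises:
        TypeError: If ``text`` is not a string.

    Example:
        >>> count_tokens("hello world")
        2
    """
    if not isinstance(text, str):
        raise TypeError("text must be a string")
    return len([t for t in text.split(separator) if t])
```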
Testing¶
- Aim for >90% code coverage
- Write both unit and integration tests
- Use descriptive test names
- Include edge cases and error conditions
Type Hints¶
- Use type hints for all public functions
- Import types from typing module
- Use Union, Optional, and generics appropriately
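Putting those conventions together, a fully annotated pair of helpers might look like this (illustrative functions, not project APIs):

```python
from collections.abc import Iterable
from typing import Optional, TypeVar, Union

T = TypeVar("T")

def first_match(items: Iterable[T], target: T) -> Optional[int]:
    """Return the index of the first occurrence of target, or None.

    Optional[int] makes the "not found" case explicit in the signature.
    """
    for i, item in enumerate(items):
        if item == target:
            return i
    return None

def to_number(value: Union[int, float, str]) -> float:
    """Coerce an int, float, or numeric string to float.

    Union documents exactly which input types are accepted.
    """
    return float(value)
```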
Tools and Utilities¶
Code Quality¶
# Format code
black packages/
# Sort imports
isort packages/
# Check style
flake8 packages/
# Type checking
mypy packages/
# Security scanning
bandit -r packages/
Testing¶
# Run all tests (unit + integration) with new infrastructure
./bin/test.sh
# Run unit tests only
./bin/test.sh -t unit
# Run integration tests with services
./bin/test.sh -t integration
# Test specific package
./bin/test.sh data # All tests for data package
./bin/test.sh -t unit config # Unit tests for config package
./bin/test.sh -t integration data # Integration tests for data package
# Advanced options
./bin/test.sh -v # Verbose output
./bin/test.sh -k test_s3 # Run tests matching pattern
./bin/test.sh -x # Stop on first failure
./bin/test.sh -n -t integration # Run integration tests without starting services
# Service management
./bin/run-integration-tests.sh -s # Start services only
./bin/run-integration-tests.sh -k # Keep services running after tests
# Legacy pytest commands (still available)
pytest # Run all tests
pytest -m "not integration" # Unit tests only
pytest --cov=packages/ # Run with coverage
pytest packages/structures/tests/ # Run specific package tests
# Run quality checks (linting + tests + coverage)
./bin/run-quality-checks.sh
Documentation¶
# Build documentation
mkdocs serve
# Generate API docs
mkdocs build
# Test documentation links
mkdocs build --strict
Performance Considerations¶
Memory Management¶
- Use generators for large dataset processing
- Implement proper resource cleanup
- Monitor memory usage in tests
- Consider lazy loading for large data structures
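The generator pattern above keeps memory bounded regardless of file size. A minimal sketch (the function name and chunking policy are illustrative):

```python
from collections.abc import Iterator

def read_records(path: str, chunk_size: int = 1000) -> Iterator[list[str]]:
    """Yield a large file in bounded chunks instead of loading it whole.

    Memory usage is O(chunk_size), not O(file size): the file handle is
    iterated lazily and each chunk is released once the caller moves on.
    """
    chunk: list[str] = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            chunk.append(line.rstrip("\n"))
            if len(chunk) >= chunk_size:
                yield chunk
                chunk = []
    if chunk:  # flush the final partial chunk
        yield chunk
```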
Processing Efficiency¶
- Profile code performance regularly
- Use appropriate data structures
- Implement caching where beneficial
- Consider parallel processing for CPU-intensive tasks
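For pure functions, caching can be as simple as the standard library's `functools.lru_cache`; here memoization turns an exponential recursion into a linear one:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def fib(n: int) -> int:
    """Naive Fibonacci recursion made linear-time by memoization:
    each fib(k) is computed once and served from the cache afterwards."""
    return n if n < 2 else fib(n - 1) + fib(n - 2)
```

`fib.cache_info()` reports hits and misses, which is useful when deciding whether a cache is actually paying for itself.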
Scalability¶
- Design for horizontal scaling
- Use streaming processing for large files
- Implement proper error handling and recovery
- Consider database connection pooling
Security Guidelines¶
Input Validation¶
- Validate all user inputs
- Sanitize data before processing
- Use parameterized queries for databases
- Implement proper authentication and authorization
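Parameterized queries, as listed above, bind user input as data rather than interpolating it into SQL. A sketch using the standard library's sqlite3 (the table and function are hypothetical):

```python
import sqlite3

def find_user(conn: sqlite3.Connection, username: str):
    """Look up a user safely: the value is passed as a bound parameter,
    never formatted into the SQL string, so hostile input such as
    "x'; DROP TABLE users; --" is treated as a plain string value."""
    cursor = conn.execute(
        "SELECT id, name FROM users WHERE name = ?", (username,)
    )
    return cursor.fetchone()
```

The same principle applies to PostgreSQL drivers (e.g. placeholder binding in psycopg) and to query builders: never build SQL with string formatting on untrusted input.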
Data Handling¶
- Encrypt sensitive data at rest
- Use HTTPS for all network communications
- Implement proper logging (avoid logging sensitive data)
- Follow data retention policies
Dependencies¶
- Regularly update dependencies
- Use security scanning tools
- Pin dependency versions
- Review new dependencies for security issues
Debugging and Troubleshooting¶
Common Issues¶
- Import Errors
    - Ensure packages are installed in development mode
    - Check Python path configuration
    - Verify virtual environment activation
- Test Failures
    - Run tests individually to isolate issues
    - Check for test data dependencies
    - Verify mock configurations
- Performance Issues
    - Use profiling tools to identify bottlenecks
    - Check memory usage patterns
    - Review algorithm complexity
Debugging Tools¶
# Python debugger
import pdb; pdb.set_trace()
# Performance profiling
import cProfile
cProfile.run('your_function()')
# Memory profiling
from memory_profiler import profile
@profile
def your_function():
pass
Communication and Community¶
Channels¶
- GitHub Issues: Bug reports and feature requests
- GitHub Discussions: General questions and community discussions
- Pull Requests: Code contributions and reviews
Guidelines¶
- Be respectful and constructive
- Provide clear and detailed information
- Follow up on your contributions
- Help others when possible
Resources¶
Documentation¶
- User Guide - End-user documentation
- API Reference - Detailed API documentation
- Examples - Usage examples and tutorials
Getting Help¶
If you need help with development:
- Check existing documentation and examples
- Search GitHub issues for similar problems
- Create a new issue with detailed information
- Join community discussions for general questions
Next Steps¶
Ready to contribute? Start with:
- Read the Contributing Guide
- Set up your development environment
- Pick a "good first issue" from GitHub
- Make your first contribution!
We welcome contributions of all types - code, documentation, testing, and community support. Thank you for helping make Dataknobs better!