AgentUnit Architecture Guide

This document supplements the high-level overview in the AgentUnit README. If you are setting up your first evaluation run, start with the Quickstart before diving into architectural internals.

Overview

AgentUnit is a modular, extensible framework for testing and monitoring multi-agent AI systems. The architecture follows a clean separation of concerns, with pluggable adapters, asynchronous processing, and monitoring for production deployments.

Core Architecture

┌─────────────────────────────────────────────────────────────┐
│                    AgentUnit Framework                      │
├─────────────────────────────────────────────────────────────┤
│ CLI Interface                                               │
│ ├── multiagent     ├── monitoring    ├── config           │
│ ├── analyze        └── help          └── templates        │
├─────────────────────────────────────────────────────────────┤
│ Core Framework                                              │
│ ├── Scenario       ├── Runner        ├── DatasetSource     │
│ ├── TraceLog       └── Exceptions    └── Results           │
├─────────────────────────────────────────────────────────────┤
│ Multi-Agent Orchestration                                   │
│ ├── MultiAgentAdapter    ├── AgentRole                     │
│ ├── SessionManager       ├── InteractionPattern           │
│ └── CommunicationModes   └── OrchestrationPattern          │
├─────────────────────────────────────────────────────────────┤
│ Platform Adapters                                          │
│ ├── AutoGen AG2    ├── OpenAI Swarm  ├── LangSmith        │
│ ├── AgentOps       └── Wandb         └── BaseAdapter      │
├─────────────────────────────────────────────────────────────┤
│ Monitoring & Production                                     │
│ ├── ProductionMonitor    ├── MetricsCollector             │
│ ├── AlertManager         └── PerformanceAnalytics         │
├─────────────────────────────────────────────────────────────┤
│ Reporting & Analytics                                       │
│ ├── ScenarioResult      ├── Exporters (JSON/XML/HTML)     │
│ ├── PerformanceMetrics  └── ComplianceReports             │
└─────────────────────────────────────────────────────────────┘

Component Details

1. Core Framework (src/agentunit/core/)

The core framework is the foundation of AgentUnit. It provides the essential abstractions and the execution logic.

Key Components:

  • Scenario: Defines a reproducible test scenario with adapter, dataset, and configuration
  • Runner: Executes scenarios and manages test execution lifecycle
  • DatasetSource: Provides test cases and manages data loading
  • TraceLog: Captures detailed execution traces for debugging and analysis
  • Exceptions: Custom exception hierarchy for error handling

Design Patterns:

  • Factory Pattern: For creating scenarios and adapters
  • Strategy Pattern: For different execution strategies
  • Observer Pattern: For monitoring execution progress

2. Multi-Agent Orchestration (src/agentunit/multiagent/)

Handles complex multi-agent interactions and coordination patterns.

Core Classes:

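from abc import ABC, abstractmethod
from dataclasses import dataclass
from datetime import datetime
from typing import Any, Dict, List

# AgentID, InteractionType, CommunicationMode, Scenario, and ScenarioResult
# are AgentUnit types defined elsewhere in the package.

@dataclass
class AgentRole:
    name: str
    description: str
    responsibilities: List[str]
    capabilities: List[str]

@dataclass
class AgentInteraction:
    source_agent: AgentID
    target_agent: AgentID
    interaction_type: InteractionType
    content: str
    timestamp: datetime
    metadata: Dict[str, Any]

class MultiAgentAdapter(ABC):
    """Abstract base for all multi-agent platform integrations."""

    @abstractmethod
    async def run_scenario(self, scenario: Scenario) -> ScenarioResult:
        """Run a scenario and return its result."""

    @abstractmethod
    def get_agent_roles(self) -> Dict[AgentID, AgentRole]:
        """Return the available agent roles."""

    @abstractmethod
    def get_communication_modes(self) -> List[CommunicationMode]:
        """Return the supported communication modes."""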

Orchestration Patterns:

  • Sequential: Agents interact in a predefined order
  • Parallel: Multiple agents work simultaneously
  • Hierarchical: Structured command-and-control relationships
  • Peer-to-Peer: Direct agent-to-agent communication
  • Hub-and-Spoke: Central coordinator manages all interactions
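
The difference between the Sequential and Parallel patterns can be sketched with plain asyncio. This is an illustration only: the Agent type alias and the make_agent, run_sequential, and run_parallel helpers are hypothetical names, not part of the AgentUnit API.

```python
import asyncio
from typing import Awaitable, Callable, List

# Hypothetical agent signature: takes a message, returns a reply.
Agent = Callable[[str], Awaitable[str]]

async def run_sequential(agents: List[Agent], message: str) -> str:
    """Sequential: each agent receives the previous agent's output."""
    for agent in agents:
        message = await agent(message)
    return message

async def run_parallel(agents: List[Agent], message: str) -> List[str]:
    """Parallel: every agent receives the same input concurrently."""
    return list(await asyncio.gather(*(agent(message) for agent in agents)))

def make_agent(name: str) -> Agent:
    """Build a toy agent that appends its name to the message."""
    async def agent(message: str) -> str:
        await asyncio.sleep(0)  # stand-in for real work
        return f"{message}>{name}"
    return agent
```

In the sequential case the output of one agent feeds the next, so order matters. In the parallel case asyncio.gather returns the replies in the same order as the agents were passed in.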

3. Platform Adapters (src/agentunit/adapters/)

Platform-specific integrations that bridge AgentUnit with external frameworks.

Adapter Hierarchy:

BaseAdapter (ABC)
├── MultiAgentAdapter (ABC)
│   ├── AG2Adapter           # AutoGen AG2 integration
│   ├── SwarmAdapter         # OpenAI Swarm integration
│   ├── LangSmithAdapter     # LangSmith monitoring
│   ├── AgentOpsAdapter      # AgentOps production monitoring  
│   └── WandbAdapter         # Wandb experiment tracking
└── SingleAgentAdapter (ABC)
    ├── LangGraphAdapter     # LangGraph workflows
    ├── CrewAIAdapter        # CrewAI agent teams
    └── CustomAdapter        # User-defined adapters

Adapter Contract:

from abc import ABC, abstractmethod

class BaseAdapter(ABC):
    @abstractmethod
    def prepare(self) -> None:
        """Perform lazy setup (loading models, connecting to services)"""
    
    @abstractmethod  
    def execute(self, case: DatasetCase, trace: TraceLog) -> AdapterOutcome:
        """Execute a single test case and return results"""
    
    def cleanup(self) -> None:
        """Clean up resources"""
        
    def supports_replay(self) -> bool:
        """Whether adapter supports replay functionality"""

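One way a caller might drive the adapter contract is to call prepare once, execute once per case, and cleanup in a finally block. The sketch below is self-contained, so DatasetCase, TraceLog, and AdapterOutcome are simplified stand-ins with assumed fields, and EchoAdapter and run_cases are hypothetical names rather than AgentUnit code.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class DatasetCase:          # stand-in with assumed fields
    id: str
    query: str

@dataclass
class TraceLog:             # stand-in: records event strings only
    events: List[str] = field(default_factory=list)

@dataclass
class AdapterOutcome:       # stand-in with assumed fields
    success: bool
    output: str

class EchoAdapter:
    """Toy adapter that follows the prepare/execute/cleanup contract."""

    def __init__(self) -> None:
        self.ready = False

    def prepare(self) -> None:
        self.ready = True   # a real adapter would load models or connect here

    def execute(self, case: DatasetCase, trace: TraceLog) -> AdapterOutcome:
        trace.events.append(f"execute:{case.id}")
        return AdapterOutcome(success=True, output=case.query.upper())

    def cleanup(self) -> None:
        self.ready = False

def run_cases(adapter: EchoAdapter, cases: List[DatasetCase]) -> List[AdapterOutcome]:
    """Call prepare once, execute per case, and always clean up."""
    trace = TraceLog()
    adapter.prepare()
    try:
        return [adapter.execute(case, trace) for case in cases]
    finally:
        adapter.cleanup()
```

Putting cleanup in a finally block means resources are released even if a case raises.
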
4. Production Monitoring (src/agentunit/monitoring/)

Real-time monitoring and observability for production deployments.

Monitoring Architecture:

class ProductionMonitor:
    def __init__(self, integrations: List[ProductionIntegration]):
        self.integrations = integrations
        self.metrics_collector = MetricsCollector()
        self.alert_manager = AlertManager()
    
    async def start_monitoring(self) -> None:
        """Start monitoring all configured integrations"""
    
    async def create_session(self, session_name: str) -> SessionID:
        """Create a new monitoring session"""
    
    async def track_interaction(self, interaction: AgentInteraction) -> None:
        """Track a single agent interaction"""

Production Integrations:

  • LangSmith: Project monitoring, trace collection, evaluation
  • AgentOps: Real-time observability, performance analytics
  • Wandb: Experiment tracking, model comparison, collaboration
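
To make the session and tracking calls concrete, here is a minimal in-memory sketch. It is not the ProductionMonitor implementation: InMemoryMonitor and Interaction are hypothetical stand-ins, and unlike the signature above, track_interaction takes an explicit session ID to keep the example small.

```python
import asyncio
from collections import Counter
from dataclasses import dataclass
from typing import Dict

@dataclass
class Interaction:            # simplified stand-in for AgentInteraction
    source_agent: str
    target_agent: str
    content: str

class InMemoryMonitor:
    """Counts interactions per (source, target) pair within each session."""

    def __init__(self) -> None:
        self.sessions: Dict[str, Counter] = {}

    async def create_session(self, session_name: str) -> str:
        self.sessions[session_name] = Counter()
        return session_name   # the session name doubles as its ID here

    async def track_interaction(self, session_id: str, interaction: Interaction) -> None:
        key = (interaction.source_agent, interaction.target_agent)
        self.sessions[session_id][key] += 1

async def demo() -> Counter:
    """Create a session, track two interactions, and return the counts."""
    monitor = InMemoryMonitor()
    session = await monitor.create_session("smoke-test")
    for content in ("plan step 1", "plan step 2"):
        await monitor.track_interaction(session, Interaction("planner", "coder", content))
    return monitor.sessions[session]
```

A production integration would forward each interaction to an external service instead of a local counter.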

5. Reporting & Analytics (src/agentunit/reporting/)

Result data structures and exporters for reporting and analytics.

Result Types:

from dataclasses import dataclass
from typing import Dict, List, Optional

# TraceLog is defined in the core framework.
# ScenarioRun is declared first because ScenarioResult refers to it.

@dataclass
class ScenarioRun:
    scenario_name: str
    case_id: str
    success: bool
    metrics: Dict[str, float]
    duration_ms: float
    trace: TraceLog
    error: Optional[str] = None

@dataclass
class ScenarioResult:
    name: str
    runs: List[ScenarioRun]

    @property
    def success_rate(self) -> float:
        """Calculate overall success rate"""

    def aggregate_metric(self, name: str) -> float:
        """Aggregate a specific metric across all runs"""
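
The docstrings above do not say how the two calculations work. A minimal sketch, assuming the success rate is the fraction of successful runs and aggregation is an arithmetic mean over the runs that report the metric (both are assumptions, as is the simplified Run stand-in):

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import Dict, List

@dataclass
class Run:                    # simplified stand-in for ScenarioRun
    success: bool
    metrics: Dict[str, float] = field(default_factory=dict)

def success_rate(runs: List[Run]) -> float:
    """Fraction of successful runs; 0.0 when there are no runs."""
    if not runs:
        return 0.0
    return sum(1 for run in runs if run.success) / len(runs)

def aggregate_metric(runs: List[Run], name: str) -> float:
    """Mean of a metric over the runs that report it; NaN if none do."""
    values = [run.metrics[name] for run in runs if name in run.metrics]
    return mean(values) if values else float("nan")
```

Returning NaN for a metric that no run reports keeps a missing metric distinguishable from a metric whose mean is zero.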

Export Formats:

  • JSON: Machine-readable results for CI/CD integration
  • XML: JUnit-compatible format for test runners
  • HTML: Rich, interactive reports with charts and graphs
  • Markdown: Human-readable summaries for documentation
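
As an illustration of the XML format, the sketch below writes a small JUnit-style document with the standard library. The to_junit_xml function and the tuple layout are hypothetical. The element and attribute names follow the common JUnit convention (testsuite, testcase, failure) and may not match AgentUnit's actual exporter output.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional, Tuple

# (case_id, duration_ms, error) tuples stand in for ScenarioRun objects.
RunRow = Tuple[str, float, Optional[str]]

def to_junit_xml(scenario_name: str, runs: List[RunRow]) -> str:
    """Serialize runs as one <testsuite> with one <testcase> per run."""
    failures = sum(1 for _, _, error in runs if error is not None)
    suite = ET.Element(
        "testsuite",
        name=scenario_name,
        tests=str(len(runs)),
        failures=str(failures),
    )
    for case_id, duration_ms, error in runs:
        case = ET.SubElement(
            suite,
            "testcase",
            name=case_id,
            classname=scenario_name,
            time=f"{duration_ms / 1000:.3f}",  # JUnit reports seconds
        )
        if error is not None:
            ET.SubElement(case, "failure", message=error)
    return ET.tostring(suite, encoding="unicode")
```

Durations are converted from milliseconds to seconds because JUnit consumers generally expect the time attribute in seconds.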

6. CLI Interface (src/agentunit/cli/)

Command-line interface providing easy access to all framework capabilities.

Command Structure:

agentunit
├── multiagent
│   ├── run          # Execute multi-agent scenarios
│   ├── validate     # Validate scenario configurations
│   └── template     # Generate scenario templates
├── monitoring  
│   ├── start        # Start production monitoring
│   ├── stop         # Stop monitoring
│   ├── status       # Check monitoring status
│   └── report       # Generate monitoring reports
├── analyze
│   ├── results      # Analyze test results
│   ├── performance  # Performance analysis
│   └── compare      # Compare multiple runs
└── config
    ├── set          # Set configuration values
    ├── get          # Get configuration values
    └── reset        # Reset to defaults
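
The command tree above maps naturally onto nested subcommands. The following sketch uses argparse purely for illustration. The library the real CLI is built on, and the options each command accepts, are not specified in this guide, so build_parser and COMMANDS are hypothetical.

```python
import argparse

# Command tree copied from the structure above.
COMMANDS = {
    "multiagent": ["run", "validate", "template"],
    "monitoring": ["start", "stop", "status", "report"],
    "analyze": ["results", "performance", "compare"],
    "config": ["set", "get", "reset"],
}

def build_parser() -> argparse.ArgumentParser:
    """Build a two-level parser: agentunit <group> <action>."""
    parser = argparse.ArgumentParser(prog="agentunit")
    groups = parser.add_subparsers(dest="group", required=True)
    for group, actions in COMMANDS.items():
        group_parser = groups.add_parser(group)
        sub = group_parser.add_subparsers(dest="action", required=True)
        for action in actions:
            sub.add_parser(action)
    return parser
```

For example, build_parser().parse_args(["monitoring", "status"]) yields a namespace with group set to "monitoring" and action set to "status".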

Data Flow

1. Test Execution Flow

graph TD
    A[CLI Command] --> B[Load Scenario]
    B --> C[Initialize Adapter] 
    C --> D[Load Dataset]
    D --> E[Start Monitoring]
    E --> F[Execute Test Cases]
    F --> G[Collect Results]
    G --> H[Generate Reports]
    H --> I[Cleanup Resources]

2. Multi-Agent Interaction Flow

graph TD
    A[Agent A] --> B[Interaction Message]
    B --> C[Orchestration Layer]
    C --> D[Route to Agent B]
    D --> E[Agent B Processing]
    E --> F[Response Message]
    F --> G[Track Interaction]
    G --> H[Update Metrics]

3. Production Monitoring Flow

graph TD
    A[Agent Interaction] --> B[Monitor Hook]
    B --> C[Metrics Collection]
    C --> D[Real-time Dashboard]
    B --> E[Alert Evaluation]
    E --> F[Notification System]
    C --> G[Data Storage]
    G --> H[Analytics & Reports]
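
The Alert Evaluation step in the flow above could, for instance, compare a recent error rate against a threshold such as the error_threshold setting in the configuration. The sketch below assumes a sliding window over the most recent outcomes. ErrorRateAlert is a hypothetical name, and the windowing strategy is an assumption, not documented AlertManager behavior.

```python
from collections import deque

class ErrorRateAlert:
    """Fires when the failure rate over the last `window` outcomes exceeds a threshold."""

    def __init__(self, threshold: float, window: int = 100) -> None:
        self.threshold = threshold
        self.outcomes = deque(maxlen=window)  # oldest outcomes fall off automatically

    def record(self, success: bool) -> bool:
        """Record one outcome; return True if the alert should fire."""
        self.outcomes.append(success)
        failures = sum(1 for ok in self.outcomes if not ok)
        return failures / len(self.outcomes) > self.threshold
```

A bounded deque keeps memory use constant regardless of how long monitoring runs.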

Extensibility Points

1. Custom Adapters

Create new platform integrations by extending BaseAdapter or MultiAgentAdapter:

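class CustomAdapter(MultiAgentAdapter):
    def __init__(self, config: Dict[str, Any]):
        super().__init__(config)
        # Initialize your platform connection

    async def run_scenario(self, scenario: Scenario) -> ScenarioResult:
        # Implement scenario execution logic
        raise NotImplementedError

    def get_agent_roles(self) -> Dict[AgentID, AgentRole]:
        # Return available agent roles
        raise NotImplementedError

    def get_communication_modes(self) -> List[CommunicationMode]:
        # Return supported communication modes (also abstract in MultiAgentAdapter)
        raise NotImplementedError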

2. Custom Metrics

Add new metrics by implementing the metrics interface:

class CustomMetric:
    def __init__(self, name: str):
        self.name = name
    
    def calculate(self, expected: str, actual: str) -> float:
        # Implement metric calculation
        pass
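
As one possible concrete metric, the sketch below scores the fraction of expected tokens that appear in the actual output. TokenOverlapMetric is a hypothetical example that follows the calculate(expected, actual) shape shown above. How metrics are registered with the framework is not covered here.

```python
class TokenOverlapMetric:
    """Fraction of expected tokens that appear in the actual output."""

    def __init__(self, name: str = "token_overlap") -> None:
        self.name = name

    def calculate(self, expected: str, actual: str) -> float:
        expected_tokens = set(expected.lower().split())
        if not expected_tokens:
            # Nothing was expected: score 1.0 only if nothing was produced.
            return 1.0 if not actual.strip() else 0.0
        actual_tokens = set(actual.lower().split())
        return len(expected_tokens & actual_tokens) / len(expected_tokens)
```

The comparison is case-insensitive and ignores word order, which makes it lenient. Stricter metrics would compare sequences rather than sets.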

3. Custom Exporters

Create new report formats by extending the exporter base:

from pathlib import Path

class CustomExporter:
    def export(self, results: ScenarioResult, path: Path) -> None:
        # Implement custom export logic
        pass

Performance Considerations

1. Async Architecture

  • All I/O operations are asynchronous
  • Concurrent execution of multiple scenarios
  • Non-blocking monitoring and reporting
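
Concurrent scenario execution usually needs an upper bound so that rate-limited services are not overwhelmed. The sketch below shows one common approach with asyncio.Semaphore. The run_bounded helper and the ConcurrencyProbe class are hypothetical and only illustrate the idea.

```python
import asyncio
from typing import Awaitable, Callable, List

async def run_bounded(tasks: List[Callable[[], Awaitable[str]]], limit: int) -> List[str]:
    """Run coroutine factories concurrently, at most `limit` at a time."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(task: Callable[[], Awaitable[str]]) -> str:
        async with semaphore:
            return await task()

    return list(await asyncio.gather(*(guarded(task) for task in tasks)))

class ConcurrencyProbe:
    """Toy scenario runner that records the peak number of concurrent runs."""

    def __init__(self) -> None:
        self.active = 0
        self.peak = 0

    async def scenario(self, name: str) -> str:
        self.active += 1
        self.peak = max(self.peak, self.active)
        await asyncio.sleep(0.01)  # stand-in for a non-blocking I/O call
        self.active -= 1
        return name
```

Passing factories (callables that create coroutines) rather than coroutine objects means no work starts until a semaphore slot is free.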

2. Memory Management

  • Lazy loading of datasets and models
  • Streaming processing for large datasets
  • Configurable memory limits and cleanup
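
Streaming processing can be as simple as a generator that yields one case at a time. The sketch below assumes a JSON Lines input, which is an assumption for illustration. The dataset formats AgentUnit actually supports are not listed in this guide, and iter_cases is a hypothetical helper.

```python
import io
import json
from typing import Any, Dict, Iterator, TextIO

def iter_cases(stream: TextIO) -> Iterator[Dict[str, Any]]:
    """Yield one JSON object per non-empty line without loading the whole file."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)
```

Because the function is a generator, only the current line is held in memory, and nothing is read until the caller starts iterating.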

3. Scalability

  • Horizontal scaling through multiple workers
  • Load balancing across available resources
  • Efficient resource pooling and reuse

Security & Privacy

1. Credential Management

  • Secure storage of API keys and credentials
  • Support for environment variables and key management systems
  • Automatic credential rotation capabilities
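
Reading a credential from an environment variable, and failing loudly when it is missing, can be sketched as follows. The load_api_key helper and MissingCredentialError are hypothetical names, and the variable name used by a caller is up to the deployment.

```python
import os

class MissingCredentialError(RuntimeError):
    """Raised when a required credential is not configured."""

def load_api_key(var_name: str) -> str:
    """Return the credential stored in `var_name`, or raise if it is unset or blank."""
    value = os.environ.get(var_name, "").strip()
    if not value:
        # Report the variable name only; never include a secret value in an error.
        raise MissingCredentialError(f"environment variable {var_name} is not set")
    return value
```

Failing at startup is generally preferable to discovering a missing key partway through an evaluation run.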

2. Data Privacy

  • No sensitive data logged by default
  • Configurable data masking and filtering
  • Data handling designed with GDPR and similar compliance requirements in mind
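
Data masking is often implemented as a set of regular-expression substitutions applied before text is logged. The patterns below are illustrative assumptions: the "sk-" key prefix is a hypothetical format, and the mask function is not AgentUnit's masking API.

```python
import re

# Illustrative patterns only; real deployments would tune these to their data.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
API_KEY = re.compile(r"\bsk-[A-Za-z0-9]{8,}\b")

def mask(text: str) -> str:
    """Replace email addresses and key-like tokens with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return API_KEY.sub("[API_KEY]", text)
```

Regex-based masking is best-effort. It reduces accidental exposure but does not guarantee that all sensitive values are caught.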

3. Audit Trail

  • Complete audit log of all operations
  • Tamper-evident logging with checksums
  • Integration with enterprise audit systems
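
Checksum-based audit logging is commonly built as a hash chain, where each entry's hash covers the previous entry's hash, so any later modification is detectable. The sketch below shows the general technique with SHA-256. It is an assumption about how such a log could work, not a description of AgentUnit's implementation, and all function names are hypothetical.

```python
import hashlib
import json
from typing import Any, Dict, List

GENESIS = "0" * 64  # placeholder hash that precedes the first entry

def _digest(prev_hash: str, event: Dict[str, Any]) -> str:
    """Hash the previous hash together with a canonical form of the event."""
    payload = json.dumps(event, sort_keys=True).encode("utf-8")
    return hashlib.sha256(prev_hash.encode("ascii") + payload).hexdigest()

def append_entry(log: List[Dict[str, Any]], event: Dict[str, Any]) -> None:
    """Append an event, chaining it to the hash of the previous entry."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({"event": event, "prev": prev_hash, "hash": _digest(prev_hash, event)})

def verify(log: List[Dict[str, Any]]) -> bool:
    """Recompute the chain and report whether every entry is intact."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev"] != prev_hash or entry["hash"] != _digest(prev_hash, entry["event"]):
            return False
        prev_hash = entry["hash"]
    return True
```

A hash chain detects edits but does not prevent them. An attacker who can rewrite the whole log can rebuild the chain, so the latest hash is typically also stored somewhere the log writer cannot modify.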

Configuration Management

1. Configuration Hierarchy

# Global defaults
defaults:
  timeout: 60
  retries: 3
  
# Platform-specific settings  
adapters:
  autogen_ag2:
    model: "gpt-4"
    max_turns: 10
  langsmith:
    project: "production"
    
# Monitoring configuration
monitoring:
  enabled: true
  real_time: true
  alerts:
    error_threshold: 0.05

2. Environment Support

  • Development, staging, production configurations
  • Environment-specific overrides
  • Validation and schema checking
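
Environment-specific overrides are typically applied by recursively merging an override mapping onto the defaults. The sketch below shows that merge on plain dictionaries, using keys from the configuration example above. The merge_config helper is hypothetical, and how AgentUnit actually loads and layers its files is not specified here.

```python
from copy import deepcopy
from typing import Any, Dict

def merge_config(base: Dict[str, Any], override: Dict[str, Any]) -> Dict[str, Any]:
    """Return a new config where `override` values replace `base` values, key by key."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)  # merge nested sections
        else:
            merged[key] = deepcopy(value)                   # scalars and lists replace
    return merged
```

For example, merging {"defaults": {"timeout": 120}} onto the defaults shown earlier changes timeout to 120 while leaving retries at 3, and the original mapping is not mutated.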

Together, these layers keep AgentUnit modular: adapters, metrics, and exporters can be added without changing the core framework, while monitoring and reporting cover both testing and production use of multi-agent AI systems.