Aetherra Maintenance System

Maintained and officially operated by Aetherra Labs.

Aetherra's Maintenance System provides autonomous system stability, continuous improvement, and adaptive evolution for the entire Aetherra OS. Like a self-healing organism with adaptive intelligence, it continuously monitors system health, learns from operational patterns, discovers and integrates new capabilities, and automatically maintains optimal system performance without human intervention.

The Maintenance System is composed of three integrated subsystems working in harmony:

Homeostasis System: Real-time stability control and error correction
Self-Improvement Engine: Pattern analysis and optimization proposal generation
Self-Incorporation Service: Code discovery, classification, and safe integration

Together, these systems form a complete autonomous loop that ensures Aetherra remains stable, performant, and continuously evolving.

For detailed Self-Improvement API documentation, see AETHERRA_SELF_IMPROVEMENT_API.md.

Architecture overview

The Maintenance System operates as a closed-loop autonomous control system with three primary layers:

Stability Layer (Homeostasis): Real-time monitoring, error detection, and immediate corrective actions
Intelligence Layer (Self-Improvement): Pattern analysis, trend detection, and improvement proposal generation
Evolution Layer (Self-Incorporation): Code discovery, security evaluation, and safe capability integration

Key properties:

Fully autonomous operation with zero human intervention required
Real-time monitoring with sub-second error detection and correction
Continuous learning from system performance and operational patterns
Safe code integration with multi-tier trust model and ethics evaluation
Complete audit trail for all actions and decisions
Graceful degradation when components are unavailable
Policy-driven safety constraints preventing destructive actions
Integration with all major Aetherra subsystems

System integration flow

┌──────────────────────────────────────────────────────────────┐
│                  AETHERRA MAINTENANCE SYSTEM                  │
│                                                               │
│  ┌────────────────────────────────────────────────────────┐  │
│  │              HOMEOSTASIS SYSTEM                         │  │
│  │  • Collects metrics (15+ types)                        │  │
│  │  • Detects errors in real-time                         │  │
│  │  • Applies immediate fixes                             │  │
│  │  • Monitors system health                              │  │
│  └────────┬────────────────────────────────────┬──────────┘  │
│           │ Forwards metrics                   │              │
│           │ every 60s                          │ Reports      │
│           ↓                                    │ health       │
│  ┌────────────────────────────────────────────┴──────────┐  │
│  │         SELF-IMPROVEMENT ENGINE                        │  │
│  │  • Analyzes performance patterns                       │  │
│  │  • Identifies optimization opportunities               │  │
│  │  • Generates improvement proposals                     │  │
│  │  • Tracks trends and anomalies                         │  │
│  └────────┬───────────────────────────────────────────────┘  │
│           │ Sends proposals                                   │
│           │ for evaluation                                    │
│           ↓                                                   │
│  ┌──────────────────────────────────────────────────────┐  │
│  │         SELF-INCORPORATION SERVICE                     │  │
│  │  • Discovers code in project                          │  │
│  │  • Classifies by type and risk                        │  │
│  │  • Evaluates safety and ethics                        │  │
│  │  • Integrates approved capabilities                   │  │
│  │  • Night cycle learning during idle                   │  │
│  └────────┬───────────────────────────────────────────────┘  │
│           │ Forwards insights                                 │
│           │ and integration metrics                           │
│           ↓                                                   │
│         (Back to Homeostasis)                                 │
│                                                               │
└──────────────────────────────────────────────────────────────┘

Boot sequence

The Maintenance System is initialized during OS startup in a coordinated sequence:

Phase 2: System Loading

Self-Improvement Engine loads and starts analysis cycle
Self-Incorporation Service loads with configuration
Self-Repair Service loads (legacy support)
Homeostasis System loads all 8 phases

Phase 3: System Injection

Kernel loop initialized with core systems
HMR controller wired for hot-swap capabilities
Self-Incorporation receives references to service registry, kernel, plugin manager, and agent orchestrator

Phase 4: System Activation

Memory and plugin systems activate
Homeostasis starts all 8 phases (including error correction and metrics bridge)
Self-Incorporation starts and triggers initial code discovery scan
All systems mark themselves as healthy in service registry

Core components

1) Homeostasis System (Stability Layer)

File: Aetherra/homeostasis/homeostasis_integration.py

The Homeostasis System maintains system stability through 8 coordinated phases:

Phase 1: Stability Metrics Collection

Continuous gathering of health signals from all Aetherra components
Plugin load success rates, memory RTT, task latency, hub connectivity
Exception rates, queue depths, service availability
Collects 15+ metric types every monitoring cycle

Phase 2: Adaptive Controller

PID-based control decisions for corrective actions
Calculates deviation from target setpoints
Determines appropriate response magnitudes
Rate-limits actions to prevent oscillation
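
As an illustration of the control step described above, here is a minimal sketch of a textbook discrete PID update combined with a per-component rate limiter. It is not the Homeostasis implementation: the class names, the output clamp, and the 60-second minimum interval are assumptions made for this example. Only the default gains follow the `pid_gains` values documented for setpoints.yaml.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class PIDController:
    """Textbook discrete PID controller (illustrative, not the Aetherra class)."""

    kp: float = 1.0
    ki: float = 0.1
    kd: float = 0.05
    output_limit: float = 1.0  # assumed clamp on the response magnitude
    _integral: float = 0.0
    _prev_error: Optional[float] = None

    def update(self, setpoint: float, measured: float, dt: float) -> float:
        """Return a clamped correction for one control step of dt seconds."""
        error = setpoint - measured  # deviation from the target setpoint
        self._integral += error * dt
        if self._prev_error is None:
            derivative = 0.0  # no history on the first step
        else:
            derivative = (error - self._prev_error) / dt
        self._prev_error = error
        raw = self.kp * error + self.ki * self._integral + self.kd * derivative
        return max(-self.output_limit, min(self.output_limit, raw))


@dataclass
class ActionRateLimiter:
    """Allows at most one action per component per min_interval_s seconds."""

    min_interval_s: float = 60.0  # assumed value
    _last_action: Dict[str, float] = field(default_factory=dict)

    def allow(self, component: str, now: float) -> bool:
        last = self._last_action.get(component)
        if last is not None and now - last < self.min_interval_s:
            return False
        self._last_action[component] = now
        return True
```

Clamping the output bounds the response magnitude, and the limiter keeps the controller from issuing back-to-back actions on the same component, which is one common way to avoid oscillation.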

Phase 3: Multi-Level Actuators

Executes corrective actions across system components
Plugin reload, memory cleanup, task prioritization
Service restart, resource allocation adjustments
Graceful degradation when actuators unavailable

Phase 4: Supervisor & Health Monitoring

Maintains global system health score (0.0-1.0)
Tracks OS runlevel and component states
Aggregates health from all subsystems
Provides unified health API for dashboards

Phase 5: Feedback Loop

Validates effectiveness of corrective actions
Adjusts controller parameters based on outcomes
Learns optimal response strategies
Tracks action success rates

Phase 6: Validation & Observability

Comprehensive metrics export for monitoring
Action effectiveness tracking
Performance statistics and trends
Integration with observability platforms

Phase 7: Autonomous Error Correction

File: Aetherra/homeostasis/autonomous_error_corrector.py

Real-time log monitoring and automatic error correction:

Custom logging handler captures all WARNING+ messages
Pattern-based detection for 6 error categories:
- Service registration API mismatches
- Deprecated module imports
- Missing Python modules
- Missing system capabilities
- Plugin load failures
- Expected data file missing
Intelligent cooldown (5-10 minutes) prevents fix spam
Automatic fix handlers for each error type
Statistics tracking: detected, attempted, successful, failed

Key APIs:

start(): Begin log monitoring and background processing
stop(): Clean shutdown of monitoring
get_statistics(): Retrieve correction statistics
process_log_message(record): Analyze log entry for errors
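
The log-monitoring approach above can be sketched with the standard `logging.Handler` interface. This is a hedged illustration, not the code in autonomous_error_corrector.py: the regular expressions, the `suppressed` counter, and the injectable `clock` parameter are assumptions, and the point where a real fix handler would run is only marked by a comment.

```python
import logging
import re
import time

# Hypothetical patterns; the real corrector's expressions are not shown here.
ERROR_PATTERNS = {
    "missing_module": re.compile(r"No module named '(?P<name>[\w.]+)'"),
    "plugin_load_failure": re.compile(r"Failed to load plugin (?P<name>\S+)"),
}


class LogMonitorHandler(logging.Handler):
    """Captures WARNING+ records and applies a per-error cooldown."""

    def __init__(self, cooldown_s=300.0, clock=time.monotonic):
        super().__init__(level=logging.WARNING)
        self.cooldown_s = cooldown_s
        self.clock = clock
        self._last_attempt = {}
        self.stats = {"detected": 0, "attempted": 0, "suppressed": 0}

    def emit(self, record):
        message = record.getMessage()
        for category, pattern in ERROR_PATTERNS.items():
            match = pattern.search(message)
            if match is None:
                continue
            self.stats["detected"] += 1
            key = (category, match.group("name"))
            now = self.clock()
            last = self._last_attempt.get(key)
            if last is not None and now - last < self.cooldown_s:
                self.stats["suppressed"] += 1  # still cooling down
                return
            self._last_attempt[key] = now
            self.stats["attempted"] += 1  # a real fix handler would run here
            return
```

Installing the handler with `logging.getLogger().addHandler(LogMonitorHandler())` would route every WARNING or higher record on the root logger through `emit`.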

Phase 8: Self-Improvement Metrics Bridge

File: Aetherra/homeostasis/self_improvement_metrics_bridge.py

Data pipeline from homeostasis to self-improvement engine:

Polls homeostasis metrics every 60 seconds
Forwards 15+ metrics to self-improvement engine:
- plugin_load_success: Plugin loading success rate
- memory_rtt: Memory system response time
- task_latency: Task processing latency
- hub_connection: Hub connectivity health
- controller_active: Controller operational state
- actions_executed: Corrective actions taken
- system_health_score: Overall system health (0.0-1.0)
- effectiveness_*: 5 effectiveness metrics from validator
- errors_detected: Error correction detections
- fixes_successful: Successful auto-fixes
- fix_success_rate: Fix effectiveness percentage
Statistics tracking: metrics_forwarded, forward_failures, success_rate

Key APIs:

start(): Begin metrics collection and forwarding
stop(): Clean shutdown
get_status(): Bridge health and statistics
_collect_homeostasis_metrics(): Gather current metrics
_forward_metrics(): Send to self-improvement engine
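
To make the forwarding and statistics behaviour concrete, the following sketch shows one way a bridge could poll, forward, and count outcomes. It is an assumption-laden illustration: the `MetricsBridge` class, its `collect` and `send` callables, and the `forward_once` method are hypothetical names, and the real bridge sends through the service registry rather than a plain callable.

```python
import asyncio
from typing import Callable, Dict


class MetricsBridge:
    """Polls a metrics source and forwards each value, counting outcomes."""

    def __init__(
        self,
        collect: Callable[[], Dict[str, float]],
        send: Callable[[str, float], None],
    ) -> None:
        self.collect = collect  # returns the current homeostasis metrics
        self.send = send        # delivers one metric to the consumer
        self.metrics_forwarded = 0
        self.forward_failures = 0

    def forward_once(self) -> None:
        for name, value in self.collect().items():
            try:
                self.send(name, value)
                self.metrics_forwarded += 1
            except Exception:
                # One failed metric must not block the rest of the batch.
                self.forward_failures += 1

    def get_status(self) -> Dict[str, float]:
        total = self.metrics_forwarded + self.forward_failures
        return {
            "metrics_forwarded": self.metrics_forwarded,
            "forward_failures": self.forward_failures,
            "success_rate": self.metrics_forwarded / total if total else 1.0,
        }

    async def run(self, interval_s: float = 60.0) -> None:
        """Forward metrics every interval_s seconds until cancelled."""
        while True:
            self.forward_once()
            await asyncio.sleep(interval_s)
```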

2) Self-Improvement Engine (Intelligence Layer)

File: Aetherra/aetherra_core/engine/self_improvement_engine.py

The Self-Improvement Engine provides continuous learning and optimization through pattern analysis:

Metrics Collection

Receives performance metrics from homeostasis bridge every 60 seconds
Stores metrics history with timestamps for trend analysis
Maintains rolling window of recent performance data
Tracks metric trends, anomalies, and patterns

Pattern Analysis

Analyzes metrics every 5 minutes for patterns
Identifies performance trends (improving, degrading, stable)
Detects anomalies and outliers
Correlates metrics to find relationships
Calculates statistical measures (mean, variance, percentiles)
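
A simple way to picture trend and anomaly detection is a least-squares slope over the recent window plus a z-score test. The sketch below assumes this approach for illustration only; the engine's actual statistics are not documented here, and the function names, the `tolerance` value, and the `z_threshold` default are hypothetical.

```python
import statistics
from typing import List, Sequence


def slope(values: Sequence[float]) -> float:
    """Least-squares slope of values against their sample index."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = statistics.fmean(values)
    num = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def classify_trend(
    values: Sequence[float],
    higher_is_better: bool = True,
    tolerance: float = 0.01,
) -> str:
    """Label a metric window as improving, degrading, or stable."""
    if len(values) < 2:
        return "stable"
    s = slope(values)
    if abs(s) <= tolerance:
        return "stable"
    return "improving" if (s > 0) == higher_is_better else "degrading"


def find_anomalies(values: Sequence[float], z_threshold: float = 3.0) -> List[int]:
    """Return indices whose z-score exceeds z_threshold."""
    if len(values) < 2:
        return []
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []
    return [i for i, v in enumerate(values) if abs(v - mean) / stdev > z_threshold]
```

The `higher_is_better` flag matters because a rising success rate is good while a rising latency is not.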

Improvement Generation

Generates optimization proposals based on patterns:
- scale_up: Increase resource allocation when performance degrading
- optimize: Fine-tune parameters when inefficiencies detected
- degrade: Reduce resource usage when overprovisioned
- change_strategy: Switch approaches when current strategy suboptimal
Includes rationale, confidence score, and supporting evidence
Prioritizes proposals by impact and feasibility
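
The filtering and prioritization of proposals could look roughly like the sketch below. The `Proposal` fields and the ranking key (impact multiplied by confidence) are assumptions chosen for illustration; the defaults for `confidence_threshold` and `max_proposals` mirror the documented self_improvement configuration values.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Proposal:
    kind: str          # scale_up, optimize, degrade, or change_strategy
    rationale: str
    confidence: float  # 0.0-1.0
    impact: float      # hypothetical 0.0-1.0 estimate of expected benefit


def select_proposals(
    candidates: List[Proposal],
    confidence_threshold: float = 0.6,
    max_proposals: int = 10,
) -> List[Proposal]:
    """Drop low-confidence proposals, then keep the highest-ranked ones."""
    eligible = [p for p in candidates if p.confidence > confidence_threshold]
    # Assumed ranking: expected benefit weighted by confidence.
    eligible.sort(key=lambda p: p.impact * p.confidence, reverse=True)
    return eligible[:max_proposals]
```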

Strategy Library

Repository of proven improvement patterns
Successful strategies learned from past actions
Context-aware strategy selection
Continuous refinement based on outcomes

Key APIs:

record_performance_metric(name, value, unit, context): Add metric
get_improvement_status(): Current state and proposals
get_metric_trends(metric_name, time_window): Trend analysis
start_improvement_cycle(loop): Begin continuous analysis
stop_improvement_cycle(): Clean shutdown

Message API (Service Registry)

selfimprovement.record_metric: Receive metrics from other systems
selfimprovement.status: Get engine state and proposals
selfimprovement.trends: Retrieve metric trends

3) Self-Incorporation Service (Evolution Layer)

File: aetherra_self_incorporation.py

The Self-Incorporation Service provides autonomous code discovery, evaluation, and safe integration:

Code Discovery

Component: CodeIndex

Scans project roots for Python files (configurable paths)
Tracks file hash (SHA-256), size, modification time
Identifies entry points and code structure
Stores in dual format: SQLite database + JSONL backup
Incremental scanning detects new/modified files
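
Incremental scanning by content hash can be sketched with `hashlib` and `pathlib` alone. This example is an illustration under assumptions: the function names and the in-memory `known_hashes` dictionary are hypothetical stand-ins for the SQLite and JSONL storage that CodeIndex uses.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def file_sha256(path: Path) -> str:
    """Hash a file in chunks so large files are not read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def incremental_scan(root: Path, known_hashes: Dict[str, str]) -> Dict[str, List[str]]:
    """Report new and modified Python files under root, updating known_hashes."""
    changes: Dict[str, List[str]] = {"new": [], "modified": []}
    for path in sorted(root.rglob("*.py")):
        key = path.relative_to(root).as_posix()
        digest = file_sha256(path)
        previous = known_hashes.get(key)
        if previous is None:
            changes["new"].append(key)
        elif previous != digest:
            changes["modified"].append(key)
        known_hashes[key] = digest
    return changes
```

Comparing hashes rather than modification times means a file that was touched but not changed is not reported as modified.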

Heuristic Classification

Component: HeuristicClassifier

Analyzes code to determine type and purpose
Classification types:
- PLUGIN: Aetherra plugin modules
- AGENT: Agent implementations
- AETHER: Aether script files
- WORKFLOW: Workflow definitions
- UTILITY: Helper/utility code
- DATASET: Data files and datasets
- DOCS: Documentation files
- UNKNOWN: Unclassified items
Confidence scoring (0.0-1.0) for classification quality
Feature extraction: imports, classes, functions, patterns

Policy & Safety Gate

Components: PolicyEngine, SecurityGate, SafetyIndex

Policy engine enforces integration policies from JSON configuration
Security gate analyzes risk factors:
- Dangerous imports (subprocess, eval, exec, network)
- File system operations
- Network access patterns
- Code execution capabilities
- External command invocation
Trust tier assignment:
- VERIFIED: Signed by Aetherra Labs, full trust
- TRUSTED: Known good, reviewed and approved
- STANDARD: Default trust, standard checks
- EXPERIMENTAL: Unproven code, extra scrutiny
- QUARANTINED: Suspicious or blocked
Safety decisions stored with rationale and risk scores
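
Static risk analysis of the kind the security gate performs can be approximated with Python's `ast` module. The sketch below is not the SecurityGate implementation: the weight tables and function names are hypothetical, and taking the maximum factor as the overall score is an assumed aggregation rule.

```python
import ast
from typing import Dict

# Hypothetical weights for illustration only.
RISKY_IMPORTS = {"subprocess": 0.8, "socket": 0.7, "urllib": 0.7, "requests": 0.7}
RISKY_CALLS = {"eval": 0.8, "exec": 0.8}


def risk_factors(source: str) -> Dict[str, float]:
    """Collect risky imports and calls found in Python source text."""
    factors: Dict[str, float] = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]  # module is None for "from . import x"
        elif (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id in RISKY_CALLS
        ):
            factors[f"call:{node.func.id}"] = RISKY_CALLS[node.func.id]
            continue
        else:
            continue
        for name in names:
            top_level = name.split(".")[0]
            if top_level in RISKY_IMPORTS:
                factors[f"import:{top_level}"] = RISKY_IMPORTS[top_level]
    return factors


def risk_score(source: str) -> float:
    """Overall score is the highest single factor (assumed rule)."""
    return max(risk_factors(source).values(), default=0.0)
```

Because the source is parsed and never executed, the analysis itself cannot trigger the operations it is looking for.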

Integration Planning

Component: IntegrationPlanner

Creates integration plans from classified and approved code
Conflict detection:
- Duplicate capability names
- Namespace collisions
- Version incompatibilities
- Dependency conflicts
Generates integration actions:
- load_plugin: Load plugin into plugin manager
- register_agent: Register agent with orchestrator
- import_utility: Import utility module
- execute_workflow: Run aether workflow
- index_dataset: Index data for retrieval
Prioritizes actions by dependencies and risk

Core Integration

Component: CoreIntegrator

Executes integration plans safely
Hot-swap capabilities via HMR controller
Rollback support with unique tokens
Validates integration success
Reports outcomes to audit ledger

Ethics & Audit

Components: EthicsEngine, AuditLedger

Ethics evaluation using multiple frameworks:
- Utilitarian: Maximum benefit, minimum harm
- Deontological: Rule-based ethical principles
- Virtue ethics: Character and excellence focus
- Care ethics: Relationship and empathy focus
Evaluates ethical implications of integrations
Risk factor identification (privacy, security, autonomy)
Benefit analysis (capability enhancement, performance)
Complete audit trail in SQLite database
Immutable record of all actions and decisions

Quarantine Management

Component: QuarantineManager

Isolates suspicious or untrusted code
Policy-based quarantine triggers
Manual review interface for quarantined items
Recovery and reintegration workflow
Statistics tracking for quarantine events

Night Cycle Learning

Component: NightCycleProcessor

Autonomous learning during system idle periods:

7-Phase Night Cycle:
1. INACTIVE: Waiting for idle trigger
2. MONITORING: Watching for user activity
3. DISCOVERY_ANALYSIS: Analyzing discovered code
4. PATTERN_LEARNING: Learning from code patterns
5. OPTIMIZATION: Optimizing integrations
6. VALIDATION: Validating learned patterns
7. REPORTING: Generating insights report
User activity detection:
- Last interaction timestamp
- CPU usage monitoring
- Memory usage patterns
- Network activity levels
- Idle threshold: 30+ minutes no activity
Learning insights generation:
- Code quality patterns
- Performance optimization opportunities
- Security vulnerability patterns
- Common integration conflicts
- Best practices identification
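
The seven phases and the idle trigger can be modelled as a small state machine. The transition rules in this sketch are assumptions, since the document lists the phases but not how NightCycleProcessor moves between them: here the cycle advances one phase per step while the system is idle and aborts to INACTIVE on activity. The `max_cpu_percent` limit is also a hypothetical value.

```python
from enum import Enum


class NightCyclePhase(Enum):
    INACTIVE = 1
    MONITORING = 2
    DISCOVERY_ANALYSIS = 3
    PATTERN_LEARNING = 4
    OPTIMIZATION = 5
    VALIDATION = 6
    REPORTING = 7


def is_idle(
    seconds_since_interaction: float,
    cpu_percent: float,
    idle_threshold_minutes: int = 30,
    max_cpu_percent: float = 20.0,  # assumed limit
) -> bool:
    """Idle means no recent interaction and low CPU usage."""
    long_enough = seconds_since_interaction >= idle_threshold_minutes * 60
    return long_enough and cpu_percent <= max_cpu_percent


def next_phase(current: NightCyclePhase, idle: bool) -> NightCyclePhase:
    """Advance one phase while idle; abort the cycle when activity resumes."""
    if not idle:
        return NightCyclePhase.INACTIVE
    if current is NightCyclePhase.REPORTING:
        return NightCyclePhase.INACTIVE  # cycle complete
    return NightCyclePhase(current.value + 1)
```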

Key APIs:

start(): Start service and register with service registry
stop(): Clean shutdown
inject_systems(registry, kernel, plugins, agents): Inject core systems
trigger_scan(root_filter): Discover code in project roots
trigger_classify(type_filter): Classify discovered items
trigger_security_eval(trust_filter): Evaluate code safety
trigger_planning(experimental): Create integration plan
trigger_integrate(plan_id): Execute integration plan
trigger_rollback(token): Rollback integration
get_status(): Service health and metrics
health_check(): Detailed health status

Configuration

Component: SelfIncorporationConfig

Default settings:

enabled = True
roots = [Path("."), Path("Aetherra")]  # Project roots to scan
trust_mode = "standard"  # or "strict", "permissive"
index_db_path = Path("data/selfinc_index.db")
audit_db_path = Path("data/selfinc_audit.db")

Policy file: config/self_incorporation_policy.json

Data flows

Current operational flows

Homeostasis → Self-Improvement

Frequency: Every 60 seconds
Method: Phase 8 Metrics Bridge
Data: 15+ metrics (stability, performance, health, errors)
Direction: Homeostasis polls own metrics → forwards to SI Engine via service registry messages
Purpose: Provides performance data for pattern analysis and improvement proposals

Homeostasis → Error Correction

Frequency: Real-time (every log message)
Method: Phase 7 Log Monitor (custom logging.Handler)
Data: WARNING+ log messages
Direction: Python logging system → Error Corrector → Fix handlers
Purpose: Immediate error detection and automatic correction

Planned flows (Phase 2)

Self-Incorporation → Self-Improvement

Frequency: Every 60 seconds (planned)
Method: Phase 9 Metrics Bridge (to be implemented)
Data: Discovery metrics, classification success, integration stats, night cycle insights
Direction: Self-Incorporation → SI Engine via service registry messages
Purpose: Inform SI Engine about code evolution and integration effectiveness

Self-Improvement → Self-Incorporation

Frequency: On proposal generation (as needed)
Method: Proposal consumer in Self-Incorporation (to be implemented)
Data: Improvement proposals (scale_up, optimize, integrate_capability, etc.)
Direction: SI Engine → Self-Incorporation via service registry messages
Purpose: Execute optimization proposals safely through integration pipeline

Self-Incorporation → Homeostasis

Frequency: Every 60 seconds (planned)
Method: Extended Phase 8 Bridge
Data: Self-Incorporation health metrics, quarantine status, integration success rate
Direction: Self-Incorporation → Homeostasis metrics
Purpose: Include Self-Incorporation health in overall system health score

Operational modes

The Maintenance System operates in coordinated modes across all three subsystems:

Normal mode (default)

Homeostasis monitors continuously with standard thresholds
Self-Improvement analyzes every 5 minutes
Self-Incorporation performs daily scheduled scans
Night cycle learning during detected idle periods
Full audit trail enabled
Standard safety policies enforced

Quiet mode (`AETHERRA_QUIET=1`)

Reduced logging verbosity
Shorter stabilization delays (0.5s vs 2.0s)
Error correction continues silently
Metrics bridge operates normally
Suitable for automated testing and CI/CD

Strict mode (`AETHERRA_PROFILE=prod`)

Enhanced security policies
All integrations require manual approval
Trust mode set to "strict"
Network allowlisting enforced
Audit ledger required for all actions
Suitable for production environments

Test mode (`AETHERRA_PROFILE=test`)

Relaxed thresholds for faster testing
Mock fallbacks for unavailable services
Shorter collection intervals
Suitable for development and testing

Health monitoring

The Maintenance System provides comprehensive health monitoring through multiple interfaces:

Service registry health

All three subsystems register with the service registry and report health status:

# Query health via service registry
status = service_registry.get_service_info("homeostasis_system")
# Returns: {status: "HEALTHY", uptime: 3600, ...}

status = service_registry.get_service_info("self_improvement_engine")
# Returns: {status: "HEALTHY", proposals: 5, ...}

status = service_registry.get_service_info("self_incorporation")
# Returns: {status: "HEALTHY", files_discovered: 150, ...}

Metrics endpoints

Each subsystem exposes metrics for observability platforms:

Homeostasis metrics:

system_health_score: Overall health (0.0-1.0)
controller_active: Controller operational (true/false)
actions_executed: Count of corrective actions
errors_detected: Count of detected errors
fixes_successful: Count of successful fixes
metrics_collected: Total metrics collected

Self-Improvement metrics:

metrics_recorded: Total metrics received
patterns_detected: Patterns identified
proposals_generated: Improvement proposals created
trends_analyzed: Trend analysis operations
analysis_cycles: Total analysis cycles completed

Self-Incorporation metrics:

files_discovered: Total files found
files_classified: Files classified by type
files_integrated: Successfully integrated
files_quarantined: Quarantined for review
night_cycles_completed: Learning cycles completed
insights_generated: Learning insights created

Health check APIs

Direct health check endpoints for detailed status:

# Homeostasis health
health = await homeostasis.get_status()
# Returns: {status, uptime, phases: {phase1: {...}, ...}, metrics: {...}}

# Self-Improvement health
health = await self_improvement.handle_message("status", {})
# Returns: {status: "active", proposals: [...], trends: {...}}

# Self-Incorporation health
health = await self_incorporation.health_check()
# Returns: {status: "HEALTHY", running: true, metrics: {...}}

Configuration

Homeostasis configuration

File: Aetherra/homeostasis/configs/setpoints.yaml

setpoints:
  plugin_load_success:
    target: 0.95
    min_acceptable: 0.85
  memory_rtt:
    target: 50.0
    max_acceptable: 120.0
  task_latency_p95:
    target: 100.0
    max_acceptable: 250.0

controller:
  pid_gains:
    kp: 1.0
    ki: 0.1
    kd: 0.05
  collection_interval: 60  # seconds

error_correction:
  enabled: true
  cooldown_min: 300  # 5 minutes
  cooldown_max: 600  # 10 minutes
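
One plausible reading of the setpoints above is a three-band check per metric: at or better than `target`, between `target` and the acceptable limit, or outside the acceptable limit. The sketch below encodes that reading as an assumption. The band labels and the function name are hypothetical, and the values are copied from the setpoints.yaml example rather than loaded from the file.

```python
from typing import Dict

# Values copied from the setpoints.yaml example above.
SETPOINTS: Dict[str, Dict[str, float]] = {
    "plugin_load_success": {"target": 0.95, "min_acceptable": 0.85},
    "memory_rtt": {"target": 50.0, "max_acceptable": 120.0},
    "task_latency_p95": {"target": 100.0, "max_acceptable": 250.0},
}


def classify_against_setpoint(name: str, value: float) -> str:
    """Return "ok", "degraded", or "violation" for one metric reading."""
    setpoint = SETPOINTS[name]
    if "min_acceptable" in setpoint:
        # Higher is better (for example, a success rate).
        if value >= setpoint["target"]:
            return "ok"
        return "degraded" if value >= setpoint["min_acceptable"] else "violation"
    # Lower is better (for example, a latency).
    if value <= setpoint["target"]:
        return "ok"
    return "degraded" if value <= setpoint["max_acceptable"] else "violation"
```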

Self-Improvement configuration

File: config.json (section: self_improvement)

{
  "self_improvement": {
    "enabled": true,
    "db_path": "self_improvement.db",
    "analysis_interval": 300,
    "confidence_threshold": 0.6,
    "max_proposals": 10
  }
}

Self-Incorporation configuration

File: config.json (section: self_incorporation)

{
  "self_incorporation": {
    "enabled": true,
    "roots": [".", "Aetherra"],
    "trust_mode": "standard",
    "night_cycle_enabled": true,
    "idle_threshold_minutes": 30,
    "auto_integrate": false
  }
}

Policy file: config/self_incorporation_policy.json

{
  "trust_tiers": {
    "verified": {
      "auto_approve": true,
      "require_review": false
    },
    "trusted": {
      "auto_approve": true,
      "require_review": false
    },
    "standard": {
      "auto_approve": false,
      "require_review": true
    }
  },
  "risk_thresholds": {
    "network_access": 0.7,
    "file_operations": 0.6,
    "code_execution": 0.8
  }
}

Health check APIs (Hub endpoint)

The Hub exposes a unified, best-effort maintenance status endpoint that aggregates Homeostasis, Self-Improvement, and Self-Incorporation.

- Route: `GET /api/maintenance/status`
- Behavior: Always returns HTTP 200 with availability flags; missing subsystems are reported as `available: false` and do not fail the endpoint.
- Headline fields:
  - `overall.runlevel`: OS runlevel from supervisor if available, else `"UNKNOWN"`
  - `overall.health_percent`: Global health percent if reported by Homeostasis supervisor
  - `overall.critical_health_percent`: Critical health percent if reported
  - `homeostasis.si_health_contribution`: SI health contribution via Phase 9 bridge (if available)

Example response:

```json
{
  "ok": true,
  "ts": "2025-10-23T12:34:56.789123",
  "overall": {
    "runlevel": "ONLINE",
    "health_percent": 92.5,
    "critical_health_percent": 98.0,
    "overall_running": true
  },
  "homeostasis": {
    "available": true,
    "running": true,
    "orchestrator": {"running": true, "initialized": true},
    "health": {"supervisor": {"runlevel": "ONLINE"}},
    "si_health_contribution": {"score": 0.12}
  },
  "self_improvement": {
    "available": true,
    "status": {"improvement_active": true, "total_proposals": 0}
  },
  "self_incorporation": {
    "available": true,
    "status": {"status": "ok", "running": true}
  }
}
```

Quick check (with Hub running on 3001):

```powershell
# Optional: start via VS Code task "Run Hub (AI API 3001)"
Invoke-RestMethod -Uri "http://localhost:3001/api/maintenance/status" -Method GET | ConvertTo-Json -Depth 6
```
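
For scripted checks, the Hub route `GET /api/maintenance/status` can also be queried from Python with the standard library. This is a minimal sketch assuming the Hub listens on localhost:3001 and returns the documented `available` flags; the function names are hypothetical. Because the endpoint always answers HTTP 200, callers have to inspect the flags rather than rely on the status code.

```python
import json
from typing import Any, Dict, List
from urllib.request import urlopen

SUBSYSTEMS = ("homeostasis", "self_improvement", "self_incorporation")


def fetch_status(base_url: str = "http://localhost:3001") -> Dict[str, Any]:
    """Fetch the aggregated maintenance status from a running Hub."""
    with urlopen(f"{base_url}/api/maintenance/status", timeout=5) as response:
        return json.load(response)


def unavailable_subsystems(payload: Dict[str, Any]) -> List[str]:
    """List subsystems that are missing or report available: false."""
    return [
        name
        for name in SUBSYSTEMS
        if not payload.get(name, {}).get("available", False)
    ]
```

A monitoring script could call `unavailable_subsystems(fetch_status())` and alert when the returned list is not empty.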

Maintenance system lifecycle

Startup sequence

Phase 2: System Loading
- Self-Improvement Engine: Load and register with service registry
- Self-Incorporation Service: Load with configuration, register with service registry
- Homeostasis System: Load all 8 phases, register with service registry
Phase 3: System Injection
- Self-Incorporation: Receive references to service registry, kernel loop, plugin manager, agent orchestrator
- Homeostasis: Initialize all phase components (metrics collector, controller, actuators, supervisor, validator, error corrector, metrics bridge)
Phase 4: System Activation
- Homeostasis: Start all 8 phases in sequence
  - Phase 7 (Error Correction): Install log handler, start background processing
  - Phase 8 (Metrics Bridge): Start 60-second polling loop
- Self-Incorporation: Start service, trigger initial code discovery scan
- All systems: Mark as HEALTHY in service registry

Runtime operation

Continuous monitoring (Homeostasis Phase 1)

Collects metrics every 60 seconds from all system components
Stores in time-series for trend analysis
Provides current snapshot on demand

Real-time error correction (Homeostasis Phase 7)

Monitors all Python logs (WARNING+) in real-time
Matches against error patterns
Attempts automatic fixes with intelligent cooldown
Tracks statistics for effectiveness analysis

Metrics forwarding (Homeostasis Phase 8)

Every 60 seconds, collects homeostasis metrics
Formats for self-improvement engine consumption
Sends via service registry message bus
Tracks forwarding success rate

Pattern analysis (Self-Improvement)

Every 5 minutes, analyzes accumulated metrics
Identifies trends: improving, degrading, stable, anomalous
Generates improvement proposals with rationale
Stores proposals for future action

Code discovery (Self-Incorporation)

Initial scan at boot (non-blocking background task)
Periodic rescans detect new/modified files
Night cycle during idle periods for deep analysis
Maintains index of all discovered code

Safety evaluation (Self-Incorporation)

Classifies discovered code by type
Analyzes risk factors and assigns trust tier
Ethics evaluation for integration decisions
Quarantines suspicious code

Shutdown sequence

Stop metrics forwarding
- Homeostasis Phase 8: Stop bridge loop, final metrics flush
Stop error correction
- Homeostasis Phase 7: Remove log handler, stop background processing
Stop self-incorporation
- Save current discovery state
- Flush audit ledger
- Unregister from service registry
Stop self-improvement
- Complete current analysis cycle
- Save proposals and trends
- Unregister from service registry
Stop homeostasis
- Stop all 8 phases in reverse order
- Final metrics snapshot
- Unregister from service registry

Observability and metrics

Prometheus/OpenMetrics export

Homeostasis exposes metrics in Prometheus format:

# HELP aetherra_system_health System health score (0.0-1.0)
# TYPE aetherra_system_health gauge
aetherra_system_health 0.95

# HELP aetherra_errors_detected Total errors detected
# TYPE aetherra_errors_detected counter
aetherra_errors_detected 42

# HELP aetherra_fixes_successful Successful automatic fixes
# TYPE aetherra_fixes_successful counter
aetherra_fixes_successful 38

# HELP aetherra_metrics_forwarded Metrics forwarded to SI Engine
# TYPE aetherra_metrics_forwarded counter
aetherra_metrics_forwarded 1440
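
The exposition text shown above follows the Prometheus text format: a HELP line, a TYPE line, and a sample line per metric. As a hedged sketch of how such output might be produced, the helper below renders simple unlabelled metrics; the function names are hypothetical, and a production exporter would more likely use an existing Prometheus client library.

```python
from typing import Iterable, Tuple, Union

Number = Union[int, float]


def render_metric(name: str, help_text: str, metric_type: str, value: Number) -> str:
    """Render one unlabelled metric in Prometheus text exposition format."""
    return (
        f"# HELP {name} {help_text}\n"
        f"# TYPE {name} {metric_type}\n"
        f"{name} {value}\n"
    )


def render_all(metrics: Iterable[Tuple[str, str, str, Number]]) -> str:
    """Render several metrics separated by blank lines."""
    return "\n".join(render_metric(*metric) for metric in metrics)
```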

Logging

All three subsystems log to standard Python logging:

Homeostasis: [HOMEOSTASIS] prefix, logs phase transitions, actions, errors
Self-Improvement: [SI] prefix, logs analysis cycles, proposals, trends
Self-Incorporation: [SELFINC] prefix, logs discoveries, integrations, night cycles

Log levels:

INFO: Normal operational events (starts, stops, routine actions)
WARNING: Anomalies, degradations, non-critical failures
ERROR: Critical failures, unable to perform core functions
DEBUG: Detailed diagnostics (metric values, pattern details, decision rationale)

Dashboards

Recommended dashboard panels:

System Health Overview
- Overall health score (gauge)
- Health by component (bar chart)
- Health trend (time series)
Error Correction
- Errors detected (counter)
- Fixes attempted (counter)
- Fix success rate (percentage)
- Errors by category (pie chart)
Self-Improvement
- Metrics received (counter)
- Proposals generated (counter)
- Active proposals (list)
- Trend analysis results (table)
Self-Incorporation
- Files discovered (counter)
- Files by type (pie chart)
- Files by trust tier (bar chart)
- Integration success rate (percentage)
- Night cycle status (indicator)

Safety and security

Safety constraints

Homeostasis actuators:

Rate limiting: Maximum 1 action per component per minute
Cooldown periods: 5-10 minutes between repeated actions
Policy validation: All actions checked against safety policies
Rollback capability: All actuator actions reversible
Human override: Manual intervention can disable actuators

Self-Improvement proposals:

Confidence thresholds: Proposals require >60% confidence
Impact assessment: High-impact proposals flagged for review
Simulation mode: Test proposals before applying
Audit trail: All proposals logged with rationale

Self-Incorporation integrations:

Multi-tier trust model: Unverified code requires approval
Risk analysis: Dangerous operations flagged
Ethics evaluation: Ethical implications assessed
Quarantine system: Suspicious code isolated
Rollback tokens: All integrations reversible
Audit ledger: Immutable record of all integrations

Security policies

Network access:

Self-Incorporation analyzes network imports
Risk score increases for socket, urllib, requests usage
Network allowlisting in strict mode
Audit trail for all network-accessing code

Code execution:

eval(), exec(), subprocess flagged as high risk
Trust tier downgrade for dangerous operations
Sandboxing for experimental code (future)
Requires manual approval in strict mode

File operations:

File I/O operations tracked and audited
Path validation prevents directory traversal
Sensitive file paths protected
Audit trail for all file operations

Data privacy:

No sensitive data logged to audit trail
Policy-aware anonymization
Configurable data retention periods
GDPR-compliant by design

Troubleshooting

Common issues

Issue: Homeostasis not starting

Symptoms: No [HOMEOSTASIS] log messages, service not in registry

Diagnosis:

# Check if homeostasis is registered
status = service_registry.get_service_info("homeostasis_system")
if status is None:
    # Not registered - check OS launcher logs for exceptions
    print("homeostasis_system is not registered")

Solutions:

Check OS launcher startup logs for exceptions during Phase 2 loading
Verify Aetherra/homeostasis/ directory exists and has required files
Check Python environment has required dependencies

Issue: Metrics not forwarding to Self-Improvement

Symptoms: Self-Improvement shows 0 metrics received, no proposals generated

Diagnosis:

# Check metrics bridge status
bridge_status = homeostasis.metrics_bridge.get_status()
# Check forward_failures > 0 or success_rate < 100%

Solutions:

Verify Self-Improvement Engine is registered: service_registry.get_service_info("self_improvement_engine")
Check service registry is operational and routing messages
Review homeostasis logs for [BRIDGE] messages indicating failures

Issue: Error correction not detecting errors

Symptoms: Errors in logs but no fix attempts, statistics show 0 detections

Diagnosis:

# Check error corrector is running
stats = homeostasis.error_corrector.get_statistics()
# If errors_detected == 0, log handler may not be installed

Solutions:

Verify Phase 7 started: Check for [ERRCORR] log messages
Confirm logging.Handler installed: logging.root.handlers includes LogMonitorHandler
Check error patterns match actual log messages

Issue: Self-Incorporation not discovering files

Symptoms: No files discovered after boot, files_discovered metric shows 0

Diagnosis:

# Check initial scan was triggered
status = await self_incorporation.get_status()
# Check last_scan_timestamp > 0

Solutions:

Verify service started: service_registry.get_service_info("self_incorporation")
Check configured roots are valid paths: self_incorporation.config.roots
Review logs for [SELFINC] messages about scan completion
Manually trigger scan: await self_incorporation.trigger_scan()

Issue: Night cycle not running

Symptoms: night_cycles_completed remains 0, no learning insights

Diagnosis:

# Check night cycle configuration
config = self_incorporation.config
if not config.night_cycle_enabled:
    # Disabled in configuration
    print("night cycle is disabled in configuration")

Solutions:

Enable night cycle in config: "night_cycle_enabled": true
Ensure system detects idle: User activity must be low for 30+ minutes
Check logs for night cycle phase transitions
Verify CPU/memory usage is low (night cycle waits for low resource usage)
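
The preconditions in this list can be checked together. The following is a sketch under stated assumptions: the 30-minute idle requirement comes from the list above, while the CPU and memory thresholds, the IdleSnapshot type, and the night_cycle_blockers name are illustrative and not taken from Aetherra.

```python
from dataclasses import dataclass


@dataclass
class IdleSnapshot:
    idle_minutes: float    # minutes since the last significant user activity
    cpu_percent: float     # current CPU utilisation, 0-100
    memory_percent: float  # current memory utilisation, 0-100


def night_cycle_blockers(
    enabled: bool,
    snap: IdleSnapshot,
    min_idle_minutes: float = 30.0,    # from the troubleshooting list
    max_cpu_percent: float = 25.0,     # assumed threshold
    max_memory_percent: float = 70.0,  # assumed threshold
) -> list[str]:
    """Return the reasons the night cycle would not start (empty if none)."""
    blockers = []
    if not enabled:
        blockers.append("night_cycle_enabled is false")
    if snap.idle_minutes < min_idle_minutes:
        blockers.append(f"idle for {snap.idle_minutes:.0f} min, need {min_idle_minutes:.0f}")
    if snap.cpu_percent > max_cpu_percent:
        blockers.append(f"CPU at {snap.cpu_percent:.0f}%")
    if snap.memory_percent > max_memory_percent:
        blockers.append(f"memory at {snap.memory_percent:.0f}%")
    return blockers
```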

Debug mode

Enable verbose debug logging:

import logging
logging.getLogger("Aetherra.homeostasis").setLevel(logging.DEBUG)
logging.getLogger("Aetherra.aetherra_core.engine").setLevel(logging.DEBUG)
logging.getLogger("aetherra_self_incorporation").setLevel(logging.DEBUG)

Or via environment variable:

export AETHERRA_LOG_LEVEL=DEBUG
python aetherra_os_launcher.py --mode full -v

Performance tuning

Reduce metrics collection overhead:

# In setpoints.yaml
controller:
  collection_interval: 120  # Increase from 60 to 120 seconds

Reduce self-improvement analysis frequency by raising analysis_interval from 300 to 600 seconds (10 minutes):

{
  "self_improvement": {
    "analysis_interval": 600
  }
}

Disable night cycle learning:

{
  "self_incorporation": {
    "night_cycle_enabled": false
  }
}

Disable error correction:

# In setpoints.yaml
error_correction:
  enabled: false

Future enhancements

Phase 2: Complete data flows

Priority 2: Self-Incorporation Metrics Bridge

Forward discovery and integration metrics to Self-Improvement Engine
Include Self-Incorporation health in Homeostasis system health score
Enable Self-Improvement to learn from code evolution patterns

Priority 3: Proposal Consumer

Implement message handler in Self-Incorporation for improvement proposals
Evaluate proposals against safety policies
Execute safe proposals through integration pipeline
Report outcomes back to Self-Improvement Engine

Priority 4: Coordinated Night Cycle

Homeostasis broadcasts system-wide idle state
Self-Incorporation synchronizes night cycle with system idle
Optimal resource utilization for learning activities

Phase 3: Advanced capabilities

Predictive maintenance:

Self-Improvement predicts failures before they occur
Proactive corrective actions prevent issues
Statistical anomaly detection for early warning

A/B testing framework:

Self-Improvement proposes multiple strategies
Self-Incorporation tests strategies in parallel
Homeostasis monitors outcomes and selects winner

Automated rollback:

Homeostasis detects degraded health after integration
Self-Incorporation automatically rolls back recent changes
Self-Improvement learns from failed integrations

Distributed maintenance:

Maintenance System coordinates across multiple Aetherra instances
Shared learning and proposal exchange
Federated pattern analysis

Explainability interface:

Visual dashboard showing maintenance decisions
Rationale and evidence for all actions
Human-in-the-loop for complex decisions

API reference

Homeostasis System API

from Aetherra.homeostasis.homeostasis_integration import HomeostasisOrchestrator

# Create and initialize
homeostasis = HomeostasisOrchestrator()
await homeostasis.initialize()

# Start all phases
await homeostasis.start()

# Get system status
status = homeostasis.get_status()
# Returns: {status: "active", health_score: 0.95, phases: {...}}

# Stop all phases
await homeostasis.stop()

Self-Improvement Engine API

import asyncio

from Aetherra.aetherra_core.engine.self_improvement_engine import SelfImprovementEngine

# Create and start
engine = SelfImprovementEngine(db_path="self_improvement.db")
await engine.start_improvement_cycle(loop=asyncio.get_running_loop())

# Record metric
engine.record_performance_metric(
    name="plugin_load_success",
    value=0.95,
    unit="percentage",
    context={"component": "plugin_manager"}
)

# Get improvement status
status = engine.get_improvement_status()
# Returns: {proposals: [...], metrics_count: 1440, last_analysis: ...}

# Get metric trends
trends = engine.get_metric_trends("plugin_load_success", time_window=3600)
# Returns: {trend: "improving", values: [...], statistics: {...}}

# Stop engine
await engine.stop_improvement_cycle()

Self-Incorporation Service API

from pathlib import Path

from aetherra_self_incorporation import SelfIncorporationService, SelfIncorporationConfig

# Create with configuration
config = SelfIncorporationConfig(
    enabled=True,
    roots=[Path("."), Path("Aetherra")],
    trust_mode="standard"
)
service = SelfIncorporationService(config)

# Inject core systems
service.inject_systems(
    service_registry,
    kernel_loop,
    plugin_manager,
    agent_orchestrator
)

# Start service
await service.start()

# Trigger code discovery
result = await service.trigger_scan(root_filter=None)
# Returns: {ok: true, discovered: 150, duration: 2.3, timestamp: ...}

# Classify discovered files
result = await service.trigger_classify(type_filter=None)
# Returns: {ok: true, classified: 150, duration: 1.5, timestamp: ...}

# Security evaluation
result = await service.trigger_security_eval(trust_filter=None)
# Returns: {ok: true, evaluated: 150, duration: 1.2, timestamp: ...}

# Create integration plan
result = await service.trigger_planning(include_experimental=False)
# Returns: {ok: true, plan_id: "abc123", total_components: 25, ...}

# Execute integration
result = await service.trigger_integrate(plan_id="abc123")
# Returns: {ok: true, integrated: 25, rollback_token: "xyz789", ...}

# Get service status
status = await service.get_status()
# Returns: {status: "HEALTHY", files_discovered: 150, ...}

# Health check
health = await service.health_check()
# Returns: {status: "HEALTHY", running: true, config_enabled: true, ...}

# Stop service
await service.stop()

Production hardening

SLO promotion and guardrails

Elevate setpoints to OS-level SLOs

Homeostasis already defines target operating parameters. Promote these to formal SLOs with automated enforcement:

# Aetherra/homeostasis/configs/slos.yaml
slos:
  plugin_load_success:
    target: 0.95
    minimum: 0.85
    breach_action: "alert_and_degrade"
    grace_period: 300  # seconds before enforcement

  memory_rtt_p95:
    target: 50.0
    maximum: 120.0
    breach_action: "trigger_maintenance"
    grace_period: 120

  task_latency_p95:
    target: 100.0
    maximum: 250.0
    breach_action: "auto_rollback"
    grace_period: 60

breach_policies:
  alert_and_degrade:
    - log_breach_event
    - notify_hub_dashboard
    - reduce_task_concurrency

  trigger_maintenance:
    - log_breach_event
    - trigger_memory_cleanup
    - notify_admin_if_persistent

  auto_rollback:
    - log_breach_event
    - identify_recent_changes
    - execute_automatic_rollback
    - notify_hub_dashboard
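
To make the grace_period semantics concrete, here is a minimal evaluation sketch. It assumes that an SLO is breached when the value falls below minimum or rises above maximum, and that enforcement starts only after the value has stayed out of bounds continuously for grace_period seconds. Both readings are interpretations of the YAML above, and the function names are hypothetical.

```python
from typing import Optional

# Mirrors two entries of slos.yaml; only the fields the check needs.
SLOS = {
    "plugin_load_success": {
        "minimum": 0.85, "breach_action": "alert_and_degrade", "grace_period": 300,
    },
    "memory_rtt_p95": {
        "maximum": 120.0, "breach_action": "trigger_maintenance", "grace_period": 120,
    },
}


def is_out_of_bounds(slo: dict, value: float) -> bool:
    """True if value violates the SLO's minimum or maximum bound."""
    if "minimum" in slo and value < slo["minimum"]:
        return True
    if "maximum" in slo and value > slo["maximum"]:
        return True
    return False


def breach_action(slo: dict, samples: list[tuple[float, float]]) -> Optional[str]:
    """Return the breach action once the grace period has elapsed, else None.

    samples holds (timestamp_sec, value) pairs ordered oldest first.
    """
    breach_started = None
    for ts, value in samples:
        if is_out_of_bounds(slo, value):
            if breach_started is None:
                breach_started = ts
        else:
            breach_started = None  # a healthy sample resets the grace timer
    if breach_started is None:
        return None
    latest_ts = samples[-1][0]
    if latest_ts - breach_started >= slo["grace_period"]:
        return slo["breach_action"]
    return None
```

Resetting the timer on any healthy sample keeps short spikes from triggering actions such as auto_rollback.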

Guard policies for autonomous actions

# Aetherra/homeostasis/configs/guard_policies.yaml
guards:
  integration_velocity:
    max_integrations_per_hour: 5
    max_quarantines_per_hour: 2
    breach_action: "pause_self_incorporation"

  actuator_frequency:
    max_actions_per_component: 1
    cooldown_minutes: 5
    breach_action: "disable_actuator"

  rollback_cascade:
    max_rollbacks_per_hour: 3
    breach_action: "require_human_approval"

escalation:
  persistent_slo_breach:
    threshold_minutes: 15
    action: "notify_chat_agents"
    message: "System health degraded, manual review required"
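
A limit such as max_integrations_per_hour can be enforced with a sliding window over event timestamps. The sketch below illustrates that technique only; it is not the GuardPolicyEnforcer implementation, and the class name and the explicit now argument are assumptions made so the example is deterministic.

```python
from collections import deque


class SlidingWindowGuard:
    """Allow at most max_events within any window_sec-long window."""

    def __init__(self, max_events: int, window_sec: float = 3600.0):
        self.max_events = max_events
        self.window_sec = window_sec
        self._events = deque()  # timestamps of accepted events, oldest first

    def allow(self, now: float) -> bool:
        # Drop events that have aged out of the window.
        while self._events and now - self._events[0] >= self.window_sec:
            self._events.popleft()
        if len(self._events) >= self.max_events:
            return False  # the caller would apply the breach_action here
        self._events.append(now)
        return True


guard = SlidingWindowGuard(max_events=5)
decisions = [guard.allow(now=t) for t in (0, 10, 20, 30, 40, 50)]
print(decisions)
```

The sixth request inside the same hour is rejected; once the oldest event is an hour old, capacity frees up again.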

Security system integration

Wire Self-Incorporation through Security System

All autonomous integrations must pass through existing security controls:

# In aetherra_self_incorporation.py
async def _validate_integration_security(self, file_item: FileItem, plan: dict) -> bool:
    """
    Validate integration against Security System policies.
    Deny-by-default: requires explicit approval.
    """
    from Aetherra.security.security_policy import SecurityPolicy

    # 1. Check signature requirement (VERIFIED/TRUSTED tiers only)
    if self.config.trust_mode == "strict":
        if not await self._verify_code_signature(file_item):
            logger.warning(f"[SELFINC] Integration blocked: no valid signature for {file_item.path}")
            return False

    # 2. Check capability grants
    required_caps = plan.get("required_capabilities", [])
    for cap in required_caps:
        if not await SecurityPolicy.check_capability_grant(cap):
            logger.warning(f"[SELFINC] Integration blocked: capability '{cap}' not granted")
            return False

    # 3. Check network policy compliance
    if self._requires_network_access(file_item):
        if not await SecurityPolicy.check_network_policy(file_item.path):
            logger.warning(f"[SELFINC] Integration blocked: network policy violation for {file_item.path}")
            return False

    # 4. Policy drift detection
    drift = await SecurityPolicy.detect_policy_drift(file_item)
    if drift.severity == "critical":
        logger.error(f"[SELFINC] Integration blocked: critical policy drift detected for {file_item.path}")
        await self.quarantine_file(file_item.id, "critical_policy_drift", drift.details)
        return False

    return True

Strict mode enforcement

import os

# Environment-based security escalation.
# `overrides` is an illustrative dict; apply it to the real config object.
overrides = {}

if os.getenv("AETHERRA_PROFILE") == "prod":
    # Production mode: strictest policies
    overrides.update(
        require_signatures=True,
        auto_integrate=False,  # Manual approval required
        trust_mode="strict",
        capability_checks="mandatory",
    )

elif os.getenv("AETHERRA_NET_STRICT") == "1":
    # Network strict mode
    overrides.update(
        network_allowlist_required=True,
        block_unsigned_network_code=True,
        audit_all_network_ops=True,
    )

Kernel-native action safety

Adopt kernel action envelopes for all actuator operations

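# Aetherra/homeostasis/actuators.py
import time

from Aetherra.kernel.action_envelope import ActionEnvelope, ActionPriority
# ActionExpiredError and ActionQueueFullError are assumed to be importable
# from the kernel package as well; their exact module is not shown here.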

async def execute_actuator_action(self, action_type: str, target: str, params: dict) -> bool:
    """
    Execute actuator action using kernel-native action envelope.
    Provides: trace_id, deadline, timeout, priority, DLQ on failure.
    """
    # Create action envelope
    envelope = ActionEnvelope(
        action_type=action_type,
        target=target,
        params=params,
        trace_id=self._generate_trace_id(),
        deadline_ts=time.time() + 30.0,  # 30-second deadline
        timeout_sec=25.0,  # 25-second timeout (< deadline)
        priority=ActionPriority.MAINTENANCE,
        retry_policy={
            "max_attempts": 2,
            "backoff_sec": 5.0,
            "exponential": False
        }
    )

    # Submit to kernel action queue
    try:
        result = await self.kernel_loop.submit_action(envelope)

        if result.success:
            self.metrics["actions_successful"] += 1
            return True
        else:
            self.metrics["actions_failed"] += 1
            logger.warning(f"[ACTUATOR] Action failed: {result.error}")
            return False

    except ActionExpiredError:
        # Action exceeded deadline, moved to DLQ
        self.metrics["actions_expired"] += 1
        logger.error(f"[ACTUATOR] Action expired and moved to DLQ: {envelope.trace_id}")
        return False

    except ActionQueueFullError:
        # Backpressure detected
        self.metrics["actions_dropped"] += 1
        logger.warning("[ACTUATOR] Action dropped due to queue backpressure")
        return False
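
It can help to check a retry policy against the envelope deadline before submitting. The sketch below computes a worst-case duration under the assumption that attempts run sequentially and each may use its full timeout; how the kernel actually combines retries with deadline_ts is not specified here, and worst_case_duration is a hypothetical helper.

```python
def worst_case_duration(timeout_sec: float, max_attempts: int,
                        backoff_sec: float, exponential: bool = False) -> float:
    """Upper bound on wall-clock time if every attempt runs to its timeout."""
    total = 0.0
    for attempt in range(max_attempts):
        total += timeout_sec
        if attempt < max_attempts - 1:  # no backoff after the final attempt
            total += backoff_sec * (2 ** attempt if exponential else 1)
    return total


# Values taken from the envelope above.
budget = worst_case_duration(timeout_sec=25.0, max_attempts=2, backoff_sec=5.0)
deadline_sec = 30.0
fits = budget <= deadline_sec
print(budget, fits)
```

Under these assumptions the values used above would not leave room for a second full attempt within the 30-second deadline, so a retry that starts late may expire and land in the DLQ.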

DLQ monitoring and recovery

async def monitor_dlq(self):
    """Monitor Dead Letter Queue for failed actions."""
    dlq_items = await self.kernel_loop.get_dlq_items(limit=100)

    if len(dlq_items) > 10:
        logger.error(f"[ACTUATOR] DLQ depth high: {len(dlq_items)} items")

        # Analyze failure patterns
        failure_patterns = self._analyze_dlq_patterns(dlq_items)

        # Disable problematic actuators
        for actuator_type, failure_rate in failure_patterns.items():
            if failure_rate > 0.5:  # 50% failure rate
                logger.error(f"[ACTUATOR] Disabling {actuator_type} due to high failure rate")
                await self.disable_actuator(actuator_type)
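
The pattern analysis step is not shown above. One possible shape for it, assuming each DLQ item is a dict with action_type and reason keys, is sketched below. Because the DLQ contains only failures, the sketch reports each type's share of DLQ entries as a stand-in for a true failure rate, which would also need the number of submitted actions. All names here are hypothetical.

```python
from collections import Counter, defaultdict


def analyze_dlq_patterns(dlq_items: list[dict]) -> dict[str, dict]:
    """Group DLQ entries by action_type and summarise their failure reasons."""
    reasons_by_type = defaultdict(Counter)
    for item in dlq_items:
        action_type = item.get("action_type", "unknown")
        reasons_by_type[action_type][item.get("reason", "unknown")] += 1

    total = len(dlq_items)
    summary = {}
    for action_type, reasons in reasons_by_type.items():
        failures = sum(reasons.values())
        summary[action_type] = {
            "failures": failures,
            "share_of_dlq": failures / total,
            "top_reason": reasons.most_common(1)[0][0],
        }
    return summary


items = [{"action_type": "restart_service", "reason": "timeout"}] * 3
items.append({"action_type": "clear_cache", "reason": "queue_full"})
summary = analyze_dlq_patterns(items)
```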

Memory and STORM integration

Feed memory health into Homeostasis

# Aetherra/homeostasis/stability_metrics.py
async def collect_memory_health_metrics(self) -> dict[str, float]:
    """Collect health metrics from Memory System including STORM."""
    from Aetherra.memory.memory_advanced import AetherraMemoryEngineAdvanced

    memory = AetherraMemoryEngineAdvanced.get_instance()

    metrics = {}

    # Core memory health
    health = await memory.get_health_snapshot()
    metrics["memory_recall_latency_p95"] = health.get("recall_latency_p95", 0.0)
    metrics["memory_store_success_rate"] = health.get("store_success_rate", 1.0)
    metrics["memory_index_size"] = health.get("index_size", 0)

    # STORM health (if enabled)
    if memory.storm_enabled:
        storm_health = await memory.storm.get_health_metrics()
        metrics["storm_sheaf_inconsistency"] = storm_health.get("sheaf_inconsistency", 0.0)
        metrics["storm_tt_rank_avg"] = storm_health.get("tt_rank_avg", 0.0)
        metrics["storm_ot_cost_avg"] = storm_health.get("ot_cost_avg", 0.0)
        metrics["storm_coherence_score"] = storm_health.get("coherence_score", 1.0)

    # Pulse health
    pulse = await memory.get_pulse_status()
    metrics["memory_pulse_healthy"] = 1.0 if pulse.get("status") == "healthy" else 0.0

    # Narrative health
    narrative_health = await memory.get_narrative_health()
    metrics["memory_narrative_completeness"] = narrative_health.get("completeness", 1.0)

    return metrics

STORM maintenance triggers

# In setpoints.yaml
memory_slos:
  storm_sheaf_inconsistency:
    target: 0.0
    maximum: 0.1
    breach_action: "trigger_storm_maintenance"

  storm_coherence_score:
    target: 1.0
    minimum: 0.9
    breach_action: "trigger_storm_reindex"

storm_maintenance:
  inconsistency_threshold: 0.1
  actions:
    - recompute_sheaf_topology
    - rebuild_transport_maps
    - validate_tt_decompositions

  cooldown_minutes: 60  # Prevent excessive maintenance

Canary deployments and HMR integration

Default canary strategy for new capabilities

# In aetherra_self_incorporation.py
async def integrate_with_canary(self, plan_id: str, canary_percent: float = 0.1) -> dict:
    """
    Integrate new capability using canary deployment.

    Flow:
    1. Deploy to canary_percent of traffic
    2. Monitor health for canary_duration
    3. Auto-rollback if health degrades
    4. Full rollout if health stable
    """
    plan = self.integration_planner.get_plan(plan_id)

    # Generate rollback token before integration
    rollback_token = self._generate_rollback_token()

    # Create HMR canary configuration
    hmr_config = {
        "canary_percent": canary_percent,
        "canary_duration": 300,  # 5 minutes
        "health_check_interval": 10,  # 10 seconds
        "rollback_threshold": 0.9,  # Rollback if health < 0.9
        "rollback_token": rollback_token
    }

    # Execute canary deployment via HMR
    canary_result = await self.core_integrator.integrate_canary(
        plan=plan,
        hmr_config=hmr_config,
        kernel_loop=self.kernel_loop
    )

    if canary_result["status"] == "canary_stable":
        # Health stable during canary, proceed to full rollout
        logger.info("[SELFINC] Canary stable, proceeding to full rollout")
        await self.core_integrator.integrate_full(plan)

        return {
            "ok": True,
            "deployment": "canary_promoted",
            "rollback_token": rollback_token,
            "health_delta": canary_result["health_delta"]
        }

    elif canary_result["status"] == "auto_rollback":
        # Health degraded, automatic rollback executed
        logger.warning("[SELFINC] Canary failed, automatic rollback executed")

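        return {
            "ok": False,
            "deployment": "canary_failed",
            "rollback_token": rollback_token,
            "rollback_reason": canary_result["rollback_reason"],
            "health_delta": canary_result["health_delta"]
        }

    # Any other status is unexpected: report it instead of returning None.
    # The "canary_unknown" label is illustrative.
    logger.error(f"[SELFINC] Unexpected canary status: {canary_result['status']}")
    return {
        "ok": False,
        "deployment": "canary_unknown",
        "rollback_token": rollback_token
    }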
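
The health decision inside the canary window can be isolated from deployment mechanics. The sketch below takes a baseline and a sequence of sampled health scores and returns either canary_stable or auto_rollback, stopping at the first sample below the threshold. It illustrates the decision rule only; the function names and result keys are assumptions rather than the CoreIntegrator contract.

```python
from statistics import mean
from typing import Optional


def summarize(status: str, baseline: float, observed: list[float],
              rollback_reason: Optional[str] = None) -> dict:
    """Build a result record from the samples seen so far."""
    return {
        "status": status,
        "baseline_health": baseline,
        "min_health": min(observed),
        "max_health": max(observed),
        "avg_health": mean(observed),
        "health_delta": observed[-1] - baseline,
        "rollback_reason": rollback_reason,
    }


def evaluate_canary(baseline: float, samples: list[float],
                    rollback_threshold: float = 0.9) -> dict:
    """Decide promote vs rollback from health scores sampled during the canary."""
    if not samples:
        raise ValueError("at least one health sample is required")
    observed = []
    for score in samples:
        observed.append(score)
        if score < rollback_threshold:
            # Stop at the first unhealthy sample; later samples are not used.
            reason = f"health {score:.2f} below threshold {rollback_threshold:.2f}"
            return summarize("auto_rollback", baseline, observed, reason)
    return summarize("canary_stable", baseline, observed)
```

Refusing to decide on an empty sample list avoids promoting a canary that was never observed.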

Golden path testing

Acceptance tests for autonomous error correction

# tests/acceptance/test_autonomous_error_correction.py
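import asyncio
import logging

import pytest
from Aetherra.homeostasis.autonomous_error_corrector import AutonomousErrorCorrector

logger = logging.getLogger(__name__)

@pytest.mark.acceptance
@pytest.mark.asyncio  # assumes pytest-asyncio; omit if asyncio_mode = "auto"
async def test_service_registration_error_detection_and_fix():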
    """
    Test: Service registration API mismatch error is detected and fixed.

    Golden path:
    1. Error appears in logs (service registration API mismatch)
    2. Error corrector detects pattern
    3. Cooldown respected (no immediate retry)
    4. Fix handler applies correction
    5. Metrics increment: errors_detected, fixes_attempted, fixes_successful
    """
    corrector = AutonomousErrorCorrector()
    await corrector.start()

    # Inject error into logs
    logger.warning("Service 'test_service' failed to register: API mismatch")

    # Wait for detection
    await asyncio.sleep(0.5)

    stats = corrector.get_statistics()
    assert stats["errors_detected"] >= 1, "Error should be detected"
    assert stats["fixes_attempted"] >= 1, "Fix should be attempted"

    # Inject same error immediately (should be blocked by cooldown)
    logger.warning("Service 'test_service' failed to register: API mismatch")
    await asyncio.sleep(0.5)

    stats_after = corrector.get_statistics()
    assert stats_after["fixes_attempted"] == stats["fixes_attempted"], \
        "Second fix should be blocked by cooldown"

    # Verify fix was successful (use the latest snapshot, since the fix may
    # complete after the first statistics read)
    assert stats_after["fixes_successful"] >= 1, "Fix should succeed"

    await corrector.stop()

# Similar tests cover the other five of the 6 error categories:
# - test_deprecated_import_detection_and_fix()
# - test_missing_module_detection_and_fix()
# - test_missing_capability_detection_and_fix()
# - test_plugin_load_failure_detection_and_fix()
# - test_missing_data_detection_and_fix()
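
The cooldown behaviour that the test above asserts can be modelled with a small gate keyed by error category. This is a sketch of the technique, not the AutonomousErrorCorrector internals; the CooldownGate name and the injectable clock are assumptions, the latter added so the example runs without real waiting.

```python
import time


class CooldownGate:
    """Allow at most one fix attempt per error key within cooldown_sec."""

    def __init__(self, cooldown_sec: float, clock=time.monotonic):
        self.cooldown_sec = cooldown_sec
        self._clock = clock
        self._last_attempt = {}  # error key -> time of the last allowed attempt

    def try_acquire(self, error_key: str) -> bool:
        now = self._clock()
        last = self._last_attempt.get(error_key)
        if last is not None and now - last < self.cooldown_sec:
            return False  # still cooling down; skip this fix attempt
        self._last_attempt[error_key] = now
        return True


# A fake clock makes the behaviour deterministic.
fake_now = [0.0]
gate = CooldownGate(cooldown_sec=60.0, clock=lambda: fake_now[0])
first = gate.try_acquire("service_registration")
second = gate.try_acquire("service_registration")
fake_now[0] = 61.0
third = gate.try_acquire("service_registration")
print(first, second, third)
```

Injecting the clock is also what lets acceptance tests exercise cooldowns quickly, in the same spirit as patching sleep.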

Concrete implementation roadmap

STATUS UPDATE (2025-10-23): Phases 2A, 2B, 2C, 2D, 2E, and 2F Complete! ✅

Phase 2A: Metrics triangle closed with Self-Incorporation bridge, proposal consumer, and unified status API

Phase 2B: Security hardening complete with trust modes, guard policies, audit immutability, and strict-mode enforcement

Phase 2C: Kernel integration complete with actuator action envelopes, DLQ monitoring, and backpressure handling

Phase 2D: Memory and STORM integration complete with health metrics collection and maintenance triggers

Phase 2E: Canary deployment strategy implemented with health monitoring and automatic rollback

Phase 2F: Testing and validation complete (golden paths, canary E2E, load + security)

See docs/PHASE_2A_IMPLEMENTATION.md for Phase 2A details.

Phase 2A: Close the metrics triangle (1-2 weeks)

Week 1: Bridge implementation ✅ COMPLETE

Implement Phase 9: Self-Incorporation Metrics Bridge
- File: Aetherra/homeostasis/self_incorporation_metrics_bridge.py (453 lines)
- Forward discovery/integration metrics to SI Engine (9+ metrics)
- Include Self-Incorporation health in Homeostasis health score
- Statistics: metrics_forwarded, forward_failures, success_rate
- Status: ✅ Implemented, tested, integrated
Implement Proposal Consumer in Self-Incorporation
- Message handler: handle_improvement_proposal(proposal) in aetherra_self_incorporation.py
- Validates proposal type (scale_up, optimize, degrade, change_strategy)
- Adjusts runtime knobs: processing_velocity, optimization_hints
- Optional integration execution: accepts actions or integration_plan in params
- Executes via core_integrator with HMR support where applicable
- Records proposals_executed and proposals_accepted metrics
- Appends to audit ledger with trace_id for observability
- Reports results back to SI Engine via service registry ("selfimprovement.proposal_result")
- Unit test: tests/unit/test_selfinc_proposal_consumer.py
- Status: ✅ Implemented, tested, integrated
Unified Maintenance Status API
- Endpoint: /api/maintenance/status in aetherra_hub/blueprints/maintenance.py
- Aggregates: system_health_score, actions_executed, proposals_generated, proposals_executed, proposals_accepted, files_integrated, files_quarantined, last_rollback_token
- Best-effort: returns HTTP 200 with availability flags; missing subsystems reported as available: false
- KPIs extraction: reads from Homeostasis health, SIE status/metrics, Self-Inc status/metrics
- OpenAPI schema: aetherra_hub/blueprints/openapi.py includes MaintenanceStatus and kpis
- Unit tests: tests/unit/test_hub_maintenance_status.py, tests/unit/test_openapi_maintenance_spec.py
- Status: ✅ Implemented and tested
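
The best-effort behaviour described for the status endpoint can be sketched as follows: every subsystem is queried independently, and a failure is reported as available: false instead of failing the whole response. The aggregate_status function, the stub fetchers, and the error field are hypothetical and only illustrate the pattern.

```python
from typing import Callable


def aggregate_status(sources: dict[str, Callable[[], dict]]) -> dict:
    """Collect status from each subsystem; never raise if one is missing."""
    report = {}
    for name, fetch in sources.items():
        try:
            report[name] = {"available": True, **fetch()}
        except Exception as exc:  # broad on purpose: any failure means unavailable
            report[name] = {"available": False, "error": type(exc).__name__}
    return report


def _homeostasis_status() -> dict:
    return {"system_health_score": 0.95}  # illustrative value


def _self_incorporation_status() -> dict:
    raise ConnectionError("service not registered")  # simulate a missing subsystem


report = aggregate_status({
    "homeostasis": _homeostasis_status,
    "self_incorporation": _self_incorporation_status,
})
```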

Week 2: Testing and validation ✅ COMPLETE

End-to-end flow testing
- Homeostasis detects performance issue → SI Engine generates proposal → Self-Incorporation evaluates → Integration executed → Health improves
- Acceptance tests: tests/acceptance/test_maintenance_e2e_flow.py
- Golden path test: proposal consumption → metrics increment → audit trail → feedback
- Integration test: proposal with actions → dry-run execution → metrics tracking
- Status: ✅ 2/2 tests passing in 19.24s
Metrics validation
- Tool: tools/validate_maintenance_metrics.py
- Validates: metrics increment, audit trail trace_ids, API extraction consistency
- Checks: proposals_executed, proposals_accepted, last_rollback_token
- Status: ✅ All validation checks passed

Phase 2A Overall: ✅ COMPLETE

Comprehensive completion summary: docs/PHASE_2A_COMPLETION_SUMMARY.md

Optional Enhancements (deferred to Phase 2C):

Dashboard deployment
- Unified maintenance status on Hub
- Real-time metrics visualization

Phase 2B: Security hardening (1 week) ✅ COMPLETE

Security System integration
- Wire Self-Incorporation through Security System checks
- File: Aetherra/homeostasis/self_incorporation_security.py (413 lines)
- Signature verification for code integration (strict vs permissive modes)
- Capability grant validation for integration plans
- Network policy compliance checks (detects network imports, requires capability grants)
- Policy drift detection (30% risk threshold)
- Status: ✅ Implemented with 18/18 unit tests passing
Proposal authentication and authorization
- Authenticate proposal sender (required in strict mode)
- Authorize based on capability grants (maintenance:proposal:type)
- Rate limiting: 10 proposals per minute per sender
- Window-based rate limiting with automatic reset
- Status: ✅ Implemented and tested
Strict mode enforcement
- Production profile with mandatory security (AETHERRA_PROFILE=prod)
- Network strict mode (AETHERRA_NET_STRICT=1)
- trust_mode configuration: "strict" (prod), "standard" (default), "permissive" (dev)
- Signature requirements enforced in strict mode
- Anonymous proposals rejected in strict mode
- Status: ✅ Implemented with environment-based activation
Guard policy implementation (foundation)
- Define SLOs with breach actions
- Integration velocity limits (env-overridable)
- Actuator frequency guards per component
- Rollback cascade prevention baseline
- Config: Aetherra/homeostasis/configs/guard_policies.yaml
- Runtime: GuardPolicyEnforcer wired into proposal flow (pre-check + record)
- Tests: unit + acceptance for velocity enforcement
Guard policy metrics exposure
- Status surfaces guard policy snapshot: policies (thresholds+windows), windows (accepted/rollbacks/components), and rejection counters
- Available via Self-Incorporation.get_status() under guards
- Included in Hub Maintenance Status aggregation (best-effort)
- Tests: unit coverage for metrics presence and rejection increments
Audit trail immutability (hash chain)
- Enhance audit ledger with tamper detection via SHA-256 hash chaining
- New columns: prev_hash, entry_hash with automatic migration
- Method: AuditLedger.verify_integrity() to validate chain
- Tests: unit test detects tampering
Strict-mode and rate limit acceptance
- Strict profile rejects unknown senders end-to-end
- Proposal rate limiting enforced per-sender (10/min)
- Tests: acceptance tests for strict-mode auth and rate limiting
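
SHA-256 hash chaining, as used for the audit ledger, can be illustrated in a few lines. The sketch below keeps the ledger as an in-memory list and reuses the prev_hash and entry_hash field names from the list above; the payload serialization, the all-zero genesis value, and the function signatures are assumptions rather than the AuditLedger implementation.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # assumed starting value for the first entry


def entry_hash(prev_hash: str, payload: dict) -> str:
    """Hash the previous hash together with a canonical form of the payload."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_entry(ledger: list[dict], payload: dict) -> None:
    prev_hash = ledger[-1]["entry_hash"] if ledger else GENESIS_HASH
    ledger.append({
        "payload": payload,
        "prev_hash": prev_hash,
        "entry_hash": entry_hash(prev_hash, payload),
    })


def verify_integrity(ledger: list[dict]) -> bool:
    """Recompute the chain and report whether every link still matches."""
    prev_hash = GENESIS_HASH
    for entry in ledger:
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["entry_hash"] != entry_hash(prev_hash, entry["payload"]):
            return False
        prev_hash = entry["entry_hash"]
    return True


ledger = []
append_entry(ledger, {"action": "integrate", "trace_id": "t-1"})
append_entry(ledger, {"action": "rollback", "trace_id": "t-2"})
intact_before = verify_integrity(ledger)
ledger[0]["payload"]["action"] = "tampered"  # simulate an edit to history
intact_after = verify_integrity(ledger)
```

Because each entry's hash covers the previous hash, changing any earlier payload invalidates that entry and every entry after it.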

Phase 2C: Kernel integration (1 week) ✅ COMPLETE

Action envelope adoption
- Kernel-native actuator_action task type with trace_id, deadline_ts, timeout_sec, priority
- File: aetherra_kernel_loop.py - Added actuator_action handler in _execute_task
- Submission helpers: submit_actuator_action() and submit_actuator_action_and_wait()
- DLQ support: Failed actions written to .aetherra/kernel_dlq.jsonl
- Priority mapping: ActionPriority (EMERGENCY/CRITICAL/HIGH → high queue, MEDIUM → normal, LOW → background)
- Retry policy: Exponential backoff with jitter; timeout triggers retry; structural failure → DLQ
- Actuators integration: execute_action_via_kernel() method submits via kernel with fallback to direct execution
- Controller routing: execute_pending_actions() prefers kernel-aware path with backward compatibility
- Status: ✅ Implemented, tested via smoke tests
DLQ monitoring
- File: Aetherra/homeostasis/homeostasis_integration.py - Added DLQMonitor class
- Periodic polling: get_dlq_items(limit=100) every 60 seconds
- Failure pattern analysis: Groups by action_type, reason; calculates failure rates
- Auto-disable: Quarantines actuator types with ≥5 failures (configurable threshold)
- Metrics exposure: dlq_count, top_failure_reasons, quarantined_actuators via get_system_health_status()
- Background task integration: Started with homeostasis background tasks; kernel reference injected dynamically
- Status: ✅ Implemented, integrated, tested
Backpressure handling
- Kernel queue limits enforced: high_priority, normal_priority, background queues
- Drop to DLQ: Actions dropped when queue full; metrics track drops_high, drops_normal, drops_background
- Graceful degradation: DLQ monitor detects high failure rates and quarantines problematic actuators
- Metrics tracking: DLQ depth, failure patterns, quarantine events exposed in homeostasis status
- Status: ✅ Implemented via kernel task queue infrastructure
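
The priority mapping and retry policy from this phase can be sketched compactly. The enum member names and queue names follow the list above; the enum values, the base and cap delays, and the choice of the full-jitter variant are assumptions, since the list only states exponential backoff with jitter.

```python
import random
from enum import Enum
from typing import Optional


class ActionPriority(Enum):
    EMERGENCY = "emergency"
    CRITICAL = "critical"
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"


# EMERGENCY/CRITICAL/HIGH share the high queue, as described above.
QUEUE_FOR_PRIORITY = {
    ActionPriority.EMERGENCY: "high",
    ActionPriority.CRITICAL: "high",
    ActionPriority.HIGH: "high",
    ActionPriority.MEDIUM: "normal",
    ActionPriority.LOW: "background",
}


def backoff_delay(attempt: int, base_sec: float = 1.0, cap_sec: float = 30.0,
                  rng: Optional[random.Random] = None) -> float:
    """Full-jitter backoff: uniform in [0, min(cap_sec, base_sec * 2**attempt)]."""
    rng = rng or random.Random()
    ceiling = min(cap_sec, base_sec * (2 ** attempt))
    return rng.uniform(0.0, ceiling)
```

Jitter spreads retries from many actuators over time so that they do not hit a recovering queue in lockstep.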

Phase 2C Overall: ✅ COMPLETE

All actuator actions now flow through kernel-native envelopes with:

Full observability (trace_id, timestamps, DLQ entries)
Safety (timeout enforcement, retry logic, failure isolation)
Resilience (auto-disable failing actuators, backpressure handling)
Backward compatibility (fallback to direct execution if kernel unavailable)

Remaining work (optional enhancements):

Unit/acceptance tests for DLQ monitoring and actuator quarantine behavior
Per-actuator retry policy configuration (currently uses kernel defaults)
DLQ analysis dashboard UI

Phase 2D: STORM and memory integration (3-5 days) ✅ COMPLETE

Memory health metrics collection
- File: Aetherra/homeostasis/stability_metrics.py - Added _collect_memory_health_metrics()
- Recall latency p95: Extracted from STORM metrics (storm_recall_latency_ms_p95)
- STORM sheaf inconsistency: Tracked via aetherra_storm_sheaf_inconsistency metric
- STORM coherence score: Calculated as 1.0 / (1.0 + sheaf_inconsistency)
- STORM OT cost average: Tracked via aetherra_storm_ot_cost_avg metric
- STORM TT rank: Tracked via aetherra_storm_tt_rank metric
- Pulse health status: Coherence score, contradiction count, orphaned fragments from get_memory_health()
- Narrative completeness: Optional from memory engine's narrative cache
- Shadow mode metrics: Agreement rate, divergences, comparison count
- Status: ✅ Implemented, integrated into metrics collection cycle
STORM maintenance triggers
- File: Aetherra/homeostasis/configs/setpoints.yaml - Added memory/STORM SLOs and maintenance config
- STORM setpoints: storm_sheaf_inconsistency (target: 0.0, max: 0.1), storm_coherence_score (target: 1.0, min: 0.9), storm_ot_cost_avg (max: 5.0), storm_recall_latency_ms_p95 (target: 50ms, max: 150ms)
- Memory pulse setpoints: memory_coherence_score (target: 0.9, min: 0.7), memory_contradiction_count (max: 5), memory_orphaned_fragments (max: 10)
- Narrative setpoint: narrative_completeness (target: 1.0, min: 0.8)
- Maintenance actions: STORM (recompute sheaf topology, rebuild transport maps, validate TT decompositions), Memory (resolve contradictions, cleanup orphaned fragments, rebuild concept clusters), Narrative (regenerate narrative, fill gaps, validate temporal consistency)
- Cooldown periods: STORM (60 min), Memory (30 min), Narrative (120 min)
- Rate limits: Max 2 STORM actions/hour, 3 memory actions/hour, 1 narrative action/hour
- Emergency thresholds: STORM coherence < 0.75, Memory coherence < 0.5
- Status: ✅ Configured with safety limits and cooldowns
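
The coherence formula above, 1.0 / (1.0 + sheaf_inconsistency), ties the inconsistency and coherence setpoints together, and a short calculation shows how they line up. The function names below are illustrative.

```python
def storm_coherence(sheaf_inconsistency: float) -> float:
    """Coherence score derived from sheaf inconsistency, as defined above."""
    if sheaf_inconsistency < 0:
        raise ValueError("sheaf_inconsistency must be non-negative")
    return 1.0 / (1.0 + sheaf_inconsistency)


def inconsistency_for(coherence: float) -> float:
    """Inverse mapping: the inconsistency that yields a given coherence score."""
    return 1.0 / coherence - 1.0


at_max_inconsistency = storm_coherence(0.1)  # inconsistency setpoint maximum
at_min_coherence = inconsistency_for(0.9)    # coherence setpoint minimum
at_emergency = inconsistency_for(0.75)       # emergency coherence threshold
print(round(at_max_inconsistency, 3), round(at_min_coherence, 3), round(at_emergency, 3))
```

With these setpoints, an inconsistency of 0.1 corresponds to a coherence of about 0.909, so the inconsistency maximum trips slightly before the coherence minimum of 0.9 (which corresponds to about 0.111). The emergency threshold of 0.75 corresponds to an inconsistency of about 0.333.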

Phase 2D Overall: ✅ COMPLETE

Memory and STORM health now integrated into Homeostasis monitoring with:

Full observability of STORM sheaf coherence, optimal transport costs, and recall performance
Memory pulse tracking (coherence, contradictions, orphaned fragments)
Narrative completeness monitoring
Automated maintenance triggers with safety limits (cooldowns, rate limits, emergency thresholds)
Ready for Phase 2E canary deployments and Phase 2F testing

Phase 2E: Canary deployments (1 week) ✅ COMPLETE

Canary integration strategy
- File: aetherra_self_incorporation.py - Added integrate_with_canary() method
- Default 10% canary rollout (configurable via canary_percent parameter)
- Baseline health recording before deployment
- Health monitoring during canary period (default 5 minutes, configurable)
- Periodic health checks at configurable intervals (default 10 seconds)
- Automatic rollback on health degradation below threshold (default 0.9)
- Canary promotion when all health checks pass
- Comprehensive result tracking: baseline/min/max/avg health, health delta, rollback reason
- Metrics: canary_deployments_successful, canary_deployments_failed
- Status: ✅ Implemented with configurable parameters and metrics tracking
HMR integration
- Uses existing HMR controller from service registry
- Integrates with CoreIntegrator's HMR-aware execution path
- Rollback tokens generated automatically during integration
- Automatic rollback via trigger_rollback() when health degrades
- Leverages kernel HMR lifecycle (quiesce → swap → verify → rollback)
- Status: ✅ Integrated with existing HMR infrastructure
Canary deployment tests
- File: tests/unit/test_canary_deployment.py - 7 unit tests covering all canary logic paths
- Test coverage: HMR disabled, plan not ready, baseline health too low, stable dry-run, auto-rollback on health drop, configurable parameters, metrics tracking
- Mock-based tests for integration scenarios (real end-to-end tests deferred to Phase 2F)
- Status: ✅ Unit tests passing (7/7)

Phase 2E Overall: ✅ COMPLETE

Canary deployment infrastructure is fully implemented and tested:

Configurable canary strategy with health-based rollback
Seamless HMR integration for hot-swapping
Comprehensive unit test coverage
Validated by Phase 2F end-to-end acceptance testing

Phase 2F: Testing and validation (1 week) ✅ COMPLETE

Golden path tests
- Acceptance tests for all 6 error correction categories
- File: tests/acceptance/test_autonomous_error_correction_golden_paths.py
- Status: ✅ 8/8 passing (cooldown semantics validated; regex-aligned messages)
Canary E2E acceptance tests (promotion + rollback)
- File: tests/acceptance/test_canary_e2e.py
- Scenarios: canary promotion on stable health; auto-rollback on health degradation below threshold
- Health monitoring: exercised via Homeostasis health API path used by integrate_with_canary() (service registry stubs supply dynamic health sequences)
- HMR: rollback_token generation validated through HMR path (register_plugin action)
- Status: ✅ 2/2 passing (fast checks with patched sleep; no external services required)
Load testing
- Stress test with high error rates
- Validate actuator rate limiting (via DLQMonitor quarantine path)
- Test DLQ behavior under load
- File: tests/acceptance/test_load_and_security_phase2f.py::test_dlq_monitor_quarantines_actuator_on_high_failure_rate
- Status: ✅ Pass — repeated actuator_action failures in DLQ trigger auto-quarantine; metrics (dlq_count, top_failure_reasons) exposed
Security testing
- Verify strict mode enforcement (prod profile, deny-by-default without grants)
- Test policy drift detection (critical drift blocks)
- Validate capability grant checks
- File: tests/acceptance/test_load_and_security_phase2f.py
  - test_security_capability_grant_required_strict_mode
  - test_security_policy_drift_detection_critical
- Status: ✅ 2/2 passing — strict capability denial and critical drift detection

Maintenance Guardrail Pack

Create .aether scripts for common maintenance scenarios:

// maintenance_guardrails.aether
// Policy enforcement and safety checks

metadata:
  name: "Maintenance Guardrails"
  version: "1.0.0"
  requires: ["homeostasis", "self_incorporation", "security"]

policy:
  max_integrations_per_hour: 5
  require_signatures_strict: true
  auto_rollback_on_health_drop: true

action check_slo_compliance:
  inputs: [slo_name, current_value]

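  let target = homeostasis.get_slo(slo_name).target
  // Assumes a higher-is-better SLO (e.g. plugin load success rate);
  // invert the comparison for lower-is-better SLOs such as task latency p95.
  let breach = current_value < target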

  if breach:
    emit "slo_breach" {slo: slo_name, value: current_value}
    trigger homeostasis.actuate_breach_policy(slo_name)

  return breach

action validate_integration_security:
  inputs: [file_item, plan]

  let signature_valid = security.verify_signature(file_item)
  let capabilities_granted = security.check_capabilities(plan.required_capabilities)
  let network_allowed = security.check_network_policy(file_item)

  return signature_valid && capabilities_granted && network_allowed

// maintenance_canary.aether
// Canary deployment with automatic rollback

metadata:
  name: "Maintenance Canary Deployment"
  version: "1.0.0"
  requires: ["self_incorporation", "homeostasis", "kernel"]

action deploy_canary:
  inputs: [plan_id, canary_percent = 0.1]

  // Capture baseline health
  let baseline_health = homeostasis.get_health_score()

  // Generate rollback token
  let rollback_token = self_incorporation.generate_rollback_token()

  // Deploy to canary percent
  let canary_result = self_incorporation.integrate_canary(plan_id, canary_percent)

  // Wait 5 minutes, then sample health again
  sleep 300

  let canary_health = homeostasis.get_health_score()
  let health_delta = canary_health - baseline_health

  if health_delta < -0.1:  // Score fell by more than 0.1 on the 0.0-1.0 scale
    emit "canary_failed" {health_delta: health_delta}
    trigger self_incorporation.rollback(rollback_token)
    return {success: false, reason: "health_degradation"}

  // Health stable, promote to full
  let full_result = self_incorporation.integrate_full(plan_id)

  return {success: true, rollback_token: rollback_token, health_delta: health_delta}

// maintenance_rollback.aether
// Trace-driven automatic rollback

metadata:
  name: "Maintenance Rollback"
  version: "1.0.0"
  requires: ["self_incorporation", "homeostasis", "kernel"]

action auto_rollback_on_breach:
  // breach_event is the breach event passed to kernel.correlate_traces below
  inputs: [slo_name, breach_severity, breach_event]

  if breach_severity != "critical":
    return {action: "none", reason: "breach not critical"}

  // Find recent integrations (last hour)
  let recent_integrations = self_incorporation.get_recent_integrations(time_window = 3600)

  if recent_integrations.count == 0:
    return {action: "none", reason: "no recent integrations"}

  // Correlate breach with integrations using kernel trace_ids
  let suspected_integration = kernel.correlate_traces(breach_event, recent_integrations)

  if suspected_integration:
    emit "auto_rollback_triggered" {integration: suspected_integration, slo: slo_name}

    let rollback_result = self_incorporation.rollback(suspected_integration.rollback_token)

    // Verify health recovery
    sleep 60
    let health_recovered = homeostasis.check_slo(slo_name)

    return {
      action: "rollback",
      integration: suspected_integration.id,
      health_recovered: health_recovered
    }

  return {action: "none", reason: "no integration correlated with breach"}
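
The promote-or-rollback decision in deploy_canary above reduces to a comparison of two health scores. As a minimal sketch, assuming scores on the 0.0-1.0 scale and an absolute drop threshold of 0.1, the same gate can be written as a pure Python function that is easy to unit test. The function name and the max_drop parameter are illustrative and not part of any Aetherra API; unlike the script, the sketch returns health_delta in both branches to make logging easier.

```python
def evaluate_canary(baseline_health: float, canary_health: float,
                    max_drop: float = 0.1) -> dict:
    """Decide whether to promote or roll back a canary.

    Scores are on the 0.0-1.0 scale; max_drop is an absolute difference,
    not a percentage of the baseline.
    """
    health_delta = canary_health - baseline_health
    if health_delta < -max_drop:
        return {"success": False, "reason": "health_degradation",
                "health_delta": health_delta}
    return {"success": True, "health_delta": health_delta}


# A small dip stays within the gate; a large drop triggers rollback.
promote = evaluate_canary(baseline_health=0.92, canary_health=0.90)
rollback = evaluate_canary(baseline_health=0.92, canary_health=0.70)
```

Keeping the decision separate from the sleep and integration calls means the threshold logic can be tested without patching time or stubbing services.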

Hub dashboard KPIs

System Health Panel

Global health score (0.0-1.0) with trend sparkline
Task latency p95 with SLO threshold
Plugin load success rate with target line
Memory RTT with acceptable range

Autonomy Quality Panel

Error correction effectiveness: fixes_successful / fixes_attempted
Quarantine rate: files_quarantined / files_discovered
Rollback count (last 24 hours)
Auto-fix category breakdown (pie chart)
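
The two ratio KPIs in this panel are undefined when their denominators are zero, for example right after boot before any fix has been attempted. A minimal sketch of how a dashboard backend might compute them follows; the helper name, the dictionary keys, and the sample counter values are made up for illustration.

```python
from typing import Optional


def safe_ratio(numerator: int, denominator: int) -> Optional[float]:
    """Return numerator / denominator, or None when there is no data yet."""
    if denominator == 0:
        return None
    return numerator / denominator


# Sample counters (illustrative values only).
counters = {
    "fixes_attempted": 40,
    "fixes_successful": 34,
    "files_discovered": 200,
    "files_quarantined": 6,
}

autonomy_panel = {
    "error_correction_effectiveness": safe_ratio(
        counters["fixes_successful"], counters["fixes_attempted"]),
    "quarantine_rate": safe_ratio(
        counters["files_quarantined"], counters["files_discovered"]),
}
```

Returning None instead of 0.0 lets the panel show "no data" rather than a misleading zero.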

Evolution Velocity Panel

Proposals generated (counter)
Proposals accepted (counter)
Time-to-canary average (seconds)
Time-to-rollback average (seconds)
Integration success rate

Memory Integrity Panel

STORM OT cost average
STORM sheaf coherence score
Recall latency p95
Pulse health indicator

Risk Indicators

SLO breaches (last 24 hours)
DLQ depth (actionable threshold)
Policy drift alerts
Actuator disable events

Risk mitigation strategies

Runaway auto-integration

Risk: Self-Incorporation integrates too many capabilities too quickly, destabilizing the system.

Mitigations:

Velocity limits: Maximum 5 integrations per hour (configurable)
Strict capability checks: All integrations validated against Security System
Signed manifests: Require valid signatures in strict mode
Canary rollouts: Default 10% canary with health gates
Automatic rollback: Trigger on SLO breach within 5 minutes of integration
Human-in-the-loop: Chat/Agents API approval in strict mode

Detection:

if metrics["integrations_last_hour"] > config["max_integrations_per_hour"]:
    logger.error("[GUARD] Integration velocity limit exceeded")
    await self_incorporation.pause(duration=3600)  # Pause for 1 hour
    await chat_agents.notify("Integration velocity limit exceeded, paused for 1 hour")
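
The check above depends on an integrations_last_hour metric. One way such a metric could be produced is a sliding-window counter, sketched below. The IntegrationVelocityWindow class is hypothetical, and timestamps are passed in explicitly (and assumed to arrive in order) so the example runs deterministically.

```python
from collections import deque


class IntegrationVelocityWindow:
    """Hypothetical helper: counts integrations inside a sliding time window."""

    def __init__(self, window_seconds: float = 3600.0):
        self.window_seconds = window_seconds
        self._timestamps = deque()  # integration times, oldest first

    def record(self, now: float) -> None:
        """Record one integration at time `now` (seconds)."""
        self._timestamps.append(now)

    def count(self, now: float) -> int:
        """Drop entries older than the window, then return how many remain."""
        cutoff = now - self.window_seconds
        while self._timestamps and self._timestamps[0] <= cutoff:
            self._timestamps.popleft()
        return len(self._timestamps)


window = IntegrationVelocityWindow(window_seconds=3600)
for t in (0, 600, 1200, 1800, 2400, 3000):  # six integrations in 50 minutes
    window.record(t)

metrics = {"integrations_last_hour": window.count(now=3000)}
config = {"max_integrations_per_hour": 5}
limit_exceeded = (
    metrics["integrations_last_hour"] > config["max_integrations_per_hour"]
)
later_count = window.count(now=5000)  # older entries have aged out by now
```

A sliding window avoids the burst that a fixed hourly bucket allows at bucket boundaries, at the cost of storing one timestamp per integration.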

Oscillation from aggressive actuators

Risk: Controller oscillates between corrective actions, causing instability.

Mitigations:

PID tuning: Properly tuned gains (Kp=1.0, Ki=0.1, Kd=0.05)
Rate limiting: Maximum 1 action per component per minute
Cooldown periods: 5-10 minute cooldown between repeated actions
Effectiveness learning: Controller learns optimal dampening gains over time
Oscillation detection: Detect rapid state changes and increase dampening
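
To make the gain values concrete, here is a textbook discrete PID update using the gains listed above as defaults. This is a generic sketch, not the Homeostasis controller: the class and parameter names are assumptions, and it omits anti-windup and output clamping, which a production controller would normally need.

```python
class PIDController:
    """Textbook discrete PID; not the Homeostasis controller itself."""

    def __init__(self, kp: float = 1.0, ki: float = 0.1, kd: float = 0.05):
        self.kp, self.ki, self.kd = kp, ki, kd
        self._integral = 0.0
        self._previous_error = None

    def update(self, setpoint: float, measurement: float, dt: float) -> float:
        """Return the control output for one time step of length dt seconds."""
        error = setpoint - measurement
        self._integral += error * dt
        if self._previous_error is None:
            derivative = 0.0  # no history on the first step
        else:
            derivative = (error - self._previous_error) / dt
        self._previous_error = error
        return (self.kp * error
                + self.ki * self._integral
                + self.kd * derivative)


# Health target 1.0; measured health improves from 0.8 to 0.9.
pid = PIDController()
first = pid.update(setpoint=1.0, measurement=0.8, dt=1.0)
second = pid.update(setpoint=1.0, measurement=0.9, dt=1.0)
```

The derivative term is what counteracts overshoot: as the error shrinks between steps, it subtracts from the output, which is the damping effect the mitigations above rely on.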

Detection:

action_history = actuator.get_action_history(time_window=600)  # Last 10 minutes
if self._detect_oscillation(action_history):
    logger.warning("[ACTUATOR] Oscillation detected, increasing dampening")
    controller.increase_dampening(factor=2.0)
    actuator.extend_cooldown(minutes=5)
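
The snippet above calls _detect_oscillation without showing it. One plausible heuristic, sketched here under stated assumptions, is to count how often consecutive actions on the same component reverse direction. The (component, direction) tuple shape, the function name, and the max_reversals threshold are all hypothetical.

```python
from collections import defaultdict


def detect_oscillation(action_history, max_reversals: int = 3) -> bool:
    """Return True if any component flips direction more than max_reversals times.

    Each entry is a (component, direction) pair, oldest first, where
    direction is +1 or -1. This data shape is an assumption of the sketch.
    """
    last_direction = {}
    reversals = defaultdict(int)
    for component, direction in action_history:
        previous = last_direction.get(component)
        if previous is not None and previous != direction:
            reversals[component] += 1
        last_direction[component] = direction
    return any(count > max_reversals for count in reversals.values())


# Alternating corrections on one component versus steady corrections.
flapping = [("cache", +1), ("cache", -1), ("cache", +1),
            ("cache", -1), ("cache", +1)]
steady = [("cache", +1), ("cache", +1), ("scheduler", -1)]
```

Counting reversals per component keeps unrelated actions on different components from being mistaken for oscillation.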

Adaptive dampening:

# Track effectiveness of each action
for action in action_history:
    if action.effectiveness < 0.5:  # Action not effective
        controller.adjust_gain(action.type, factor=0.9)  # Reduce gain by 10%
    elif action.effectiveness > 0.9:  # Action very effective
        controller.adjust_gain(action.type, factor=1.05)  # Increase gain by 5%
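
Multiplicative updates like the ones above can drive a gain toward zero or let it grow without limit if an action is repeatedly rated ineffective or effective. A minimal sketch of an adjust_gain that clamps the result is shown below; the GainTable class and its bounds are assumptions for illustration, not the controller's actual implementation.

```python
class GainTable:
    """Hypothetical per-action-type gain store with clamped updates."""

    def __init__(self, initial: float = 1.0,
                 min_gain: float = 0.1, max_gain: float = 5.0):
        self.initial = initial
        self.min_gain = min_gain
        self.max_gain = max_gain
        self.gains = {}

    def adjust_gain(self, action_type: str, factor: float) -> float:
        """Scale the gain for action_type by factor, clamped to the bounds."""
        current = self.gains.get(action_type, self.initial)
        updated = min(self.max_gain, max(self.min_gain, current * factor))
        self.gains[action_type] = updated
        return updated


table = GainTable()
boosted = table.adjust_gain("restart_plugin", factor=1.05)

# Fifty consecutive reductions would shrink the gain below the floor without clamping.
for _ in range(50):
    floor_value = table.adjust_gain("clear_cache", factor=0.9)
```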

Conclusion

The Aetherra Maintenance System represents a paradigm shift in operating system design: from reactive manual maintenance to proactive autonomous self-management. By integrating the Homeostasis System (stability), Self-Improvement Engine (intelligence), and Self-Incorporation Service (evolution), Aetherra achieves:

Zero-touch operation: System maintains itself without human intervention
Continuous learning: Performance improves autonomously over time
Self-healing: Errors detected and corrected in real-time
Safe evolution: New capabilities integrated with safety guarantees
Complete auditability: All decisions and actions fully traceable

Production hardening complete: With SLO promotion, security integration, kernel-native safety, STORM monitoring, canary deployments, and comprehensive testing, the Maintenance System is production-ready and auditable.

This autonomous maintenance architecture ensures Aetherra remains stable, performant, and continuously evolving to meet user needs and adapt to changing conditions.

The Maintenance System operates continuously in Aetherra OS installations worldwide.

For technical support or questions about the Maintenance System:

GitHub Issues: https://github.com/AetherraLabs/Aetherra
Documentation: https://docs.aetherra.ai
Email: support@aetherraalabs.com
