|
| 1 | +# HelixAgent Comprehensive Monitoring System |
| 2 | + |
| 3 | +## Overview |
| 4 | + |
| 5 | +The HelixAgent Monitoring System monitors all challenge executions and provides: |
| 6 | + |
| 7 | +- Real-time resource monitoring (CPU, memory, disk, network) |
| 8 | +- Log collection from all system components |
| 9 | +- Memory leak detection |
| 10 | +- Warning/error detection and analysis |
| 11 | +- Automatic issue investigation |
| 12 | +- Comprehensive HTML/JSON reports |
| 13 | + |
| 14 | +## Architecture |
| 15 | + |
| 16 | +``` |
| 17 | +┌─────────────────────────────────────────────────────────────┐ |
| 18 | +│               MONITORING SYSTEM ARCHITECTURE                │ |
| 19 | +├─────────────────────────────────────────────────────────────┤ |
| 20 | +│                                                             │ |
| 21 | +│   ┌─────────────────┐     ┌─────────────────┐               │ |
| 22 | +│   │ monitoring_lib  │     │ report_generator│               │ |
| 23 | +│   │       .sh       │     │       .sh       │               │ |
| 24 | +│   └────────┬────────┘     └────────┬────────┘               │ |
| 25 | +│            │                       │                        │ |
| 26 | +│            ▼                       ▼                        │ |
| 27 | +│   ┌─────────────────────────────────────────┐               │ |
| 28 | +│   │       run_monitored_challenges.sh       │               │ |
| 29 | +│   │  • Challenge orchestration              │               │ |
| 30 | +│   │  • Resource sampling                    │               │ |
| 31 | +│   │  • Log collection                       │               │ |
| 32 | +│   │  • Issue tracking                       │               │ |
| 33 | +│   └─────────────────────────────────────────┘               │ |
| 34 | +│                                                             │ |
| 35 | +│   Output:                                                   │ |
| 36 | +│   ┌──────────────┬──────────────┬─────────────┐             │ |
| 37 | +│   │ JSON Report  │ HTML Report  │ Issue Files │             │ |
| 38 | +│   └──────────────┴──────────────┴─────────────┘             │ |
| 39 | +│                                                             │ |
| 40 | +└─────────────────────────────────────────────────────────────┘ |
| 41 | +``` |
| 42 | + |
| 43 | +## Components |
| 44 | + |
| 45 | +### Core Library (`challenges/monitoring/lib/monitoring_lib.sh`) |
| 46 | + |
| 47 | +The core monitoring library provides all monitoring functions: |
| 48 | + |
| 49 | +| Function | Description | |
| 50 | +|----------|-------------| |
| 51 | +| `mon_init` | Initialize monitoring session | |
| 52 | +| `mon_log` | Log messages with severity levels | |
| 53 | +| `mon_sample_resources` | Collect CPU/memory/disk/network stats | |
| 54 | +| `mon_collect_all_logs` | Gather logs from all components | |
| 55 | +| `mon_detect_memory_leaks` | Detect memory leaks against baseline | |
| 56 | +| `mon_analyze_log_file` | Analyze logs for errors/warnings | |
| 57 | +| `mon_record_issue` | Record detected issues with severity | |
| 58 | +| `mon_record_fix` | Record applied fixes with test references | |
| 59 | +| `mon_finalize` | Finalize monitoring and generate summary | |
| 60 | + |
| 61 | +### Report Generator (`challenges/monitoring/lib/report_generator.sh`) |
| 62 | + |
| 63 | +Generates comprehensive reports in multiple formats: |
| 64 | + |
| 65 | +- **JSON Report**: Machine-readable format for CI/CD integration |
| 66 | +- **HTML Report**: Human-readable format with visualizations |
| 67 | + |
| 68 | +### Main Runner (`challenges/monitoring/run_monitored_challenges.sh`) |
| 69 | + |
| 70 | +The main entry point that: |
| 71 | +1. Initializes monitoring |
| 72 | +2. Starts infrastructure (HelixAgent, PostgreSQL, Redis) |
| 73 | +3. Runs all challenges with monitoring |
| 74 | +4. Investigates detected errors |
| 75 | +5. Generates final reports |
| 76 | + |
| 77 | +## Usage |
| 78 | + |
| 79 | +### Running Monitored Challenges |
| 80 | + |
| 81 | +```bash |
| 82 | +# Run all challenges with monitoring |
| 83 | +./challenges/monitoring/run_monitored_challenges.sh |
| 84 | + |
| 85 | +# Run specific challenges |
| 86 | +./challenges/monitoring/run_monitored_challenges.sh --challenges "health_monitoring,provider_verification" |
| 87 | + |
| 88 | +# Skip infrastructure checks |
| 89 | +./challenges/monitoring/run_monitored_challenges.sh --skip-infra |
| 90 | + |
| 91 | +# Continue on failures |
| 92 | +./challenges/monitoring/run_monitored_challenges.sh --continue-on-failure |
| 93 | +``` |
| 94 | + |
| 95 | +### Using the Monitoring Library |
| 96 | + |
| 97 | +```bash |
| 98 | +#!/bin/bash |
| 99 | +source "./challenges/monitoring/lib/monitoring_lib.sh" |
| 100 | + |
| 101 | +# Initialize monitoring |
| 102 | +mon_init "my_test_session" |
| 103 | + |
| 104 | +# Log events |
| 105 | +mon_log "INFO" "Starting test..." |
| 106 | + |
| 107 | +# Sample resources periodically |
| 108 | +mon_sample_resources |
| 109 | + |
| 110 | +# Analyze logs |
| 111 | +mon_analyze_log_file "/var/log/helixagent.log" "helixagent" || true |
| 112 | + |
| 113 | +# Record issues and fixes |
| 114 | +mon_record_issue "high" "Memory usage exceeded threshold" |
| 115 | +mon_record_fix "issue_001" "Optimized memory allocation" "TestMemoryOptimization" |
| 116 | + |
| 117 | +# Finalize and generate report |
| 118 | +mon_finalize |
| 119 | +``` |
| 120 | + |
| 121 | +## Error Detection Patterns |
| 122 | + |
| 123 | +### Error Patterns (14 patterns) |
| 124 | +``` |
| 125 | +panic:|fatal error:|FATAL|ERROR|runtime error:|nil pointer| |
| 126 | +index out of range|deadlock|timeout|connection refused| |
| 127 | +context deadline exceeded|too many open files|out of memory|OOMKilled |
| 128 | +``` |
| 129 | + |
| 130 | +### Warning Patterns (11 patterns) |
| 131 | +``` |
| 132 | +WARN|WARNING|deprecated|retry|reconnect|circuit breaker| |
| 133 | +rate limit|slow query|high latency|token expired|invalid.*format |
| 134 | +``` |
| 135 | + |
| 136 | +### Ignored Patterns (6 patterns) |
| 137 | +``` |
| 138 | +TestError|error.*test|expected.*error|mock.*error| |
| 139 | +PASS|ok.*dev.helix |
| 140 | +``` |
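
To illustrate how the three pattern groups could interact, here is a minimal sketch of a line classifier. The function name `classify_line` and the precedence order (ignored first, then error, then warning) are assumptions made for illustration; they are not taken from `monitoring_lib.sh`.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: classify one log line with the pattern groups listed above.
# Assumption: ignored patterns win over error patterns, which win over warnings.

ERROR_RE='panic:|fatal error:|FATAL|ERROR|runtime error:|nil pointer|index out of range|deadlock|timeout|connection refused|context deadline exceeded|too many open files|out of memory|OOMKilled'
WARN_RE='WARN|WARNING|deprecated|retry|reconnect|circuit breaker|rate limit|slow query|high latency|token expired|invalid.*format'
IGNORE_RE='TestError|error.*test|expected.*error|mock.*error|PASS|ok.*dev.helix'

classify_line() {
    local line="$1"
    if printf '%s\n' "$line" | grep -Eq -- "$IGNORE_RE"; then
        echo "ignored"
    elif printf '%s\n' "$line" | grep -Eq -- "$ERROR_RE"; then
        echo "error"
    elif printf '%s\n' "$line" | grep -Eq -- "$WARN_RE"; then
        echo "warning"
    else
        echo "ok"
    fi
}
```

Checking the ignored group first keeps expected test output such as `--- PASS: TestError` from being reported as an error. Matching is case-sensitive here, so `ERROR` matches but `error` alone does not.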
| 141 | + |
| 142 | +## Memory Leak Detection |
| 143 | + |
| 144 | +The monitoring system detects memory leaks by: |
| 145 | + |
| 146 | +1. **Baseline Collection**: Capture initial memory usage at startup |
| 147 | +2. **Periodic Sampling**: Sample memory usage during execution |
| 148 | +3. **Threshold Comparison**: Alert when memory usage exceeds the configured percentage of the baseline |
| 149 | +4. **FD Monitoring**: Track file descriptor count for leak detection |
| 150 | + |
| 151 | +Default threshold: 150% of baseline memory usage |
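
The threshold comparison can be sketched with integer arithmetic. This assumes the threshold means "current usage above N% of the baseline" (so 150 flags anything above 1.5 times the baseline); the function name `exceeds_memory_threshold` is hypothetical and not part of the library.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the baseline comparison.
# Usage: exceeds_memory_threshold BASELINE_KB CURRENT_KB [THRESHOLD_PERCENT]
# Returns 0 (true) when current usage is above THRESHOLD_PERCENT of the baseline.
exceeds_memory_threshold() {
    local baseline_kb="$1"
    local current_kb="$2"
    local threshold_pct="${3:-${MON_MEMORY_THRESHOLD:-150}}"
    # current/baseline > pct/100 is rewritten as current*100 > baseline*pct
    # to stay in integer arithmetic.
    (( current_kb * 100 > baseline_kb * threshold_pct ))
}
```

For example, `exceeds_memory_threshold 1000 1600` succeeds with the default threshold, while a current value of exactly 1500 does not, because the comparison is strict.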
| 152 | + |
| 153 | +## Resource Monitoring |
| 154 | + |
| 155 | +### Collected Metrics |
| 156 | + |
| 157 | +| Metric | Source | Frequency | |
| 158 | +|--------|--------|-----------| |
| 159 | +| CPU Usage | `/proc/stat` | Every sample | |
| 160 | +| Memory Usage | `/proc/meminfo` | Every sample | |
| 161 | +| Disk I/O | `iostat` | Every sample | |
| 162 | +| Network I/O | `/proc/net/dev` | Every sample | |
| 163 | +| File Descriptors | `/proc/[pid]/fd` | Every sample | |
| 164 | +| Goroutine Count | pprof endpoint | Every sample | |
| 165 | + |
| 166 | +### Sample Interval |
| 167 | + |
| 168 | +Default: 5 seconds (configurable via `MON_SAMPLE_INTERVAL`) |
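
As an illustration of one sampling step, the sketch below derives used memory from text in `/proc/meminfo` format. The helper name `used_memory_mb` and the choice of `MemTotal - MemAvailable` as the "used" figure are assumptions; the library may compute the value differently.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: read /proc/meminfo-style text on stdin and print
# used memory in MB, computed as (MemTotal - MemAvailable) / 1024.
used_memory_mb() {
    awk '
        $1 == "MemTotal:"     { total = $2 }
        $1 == "MemAvailable:" { avail = $2 }
        END { printf "%d\n", (total - avail) / 1024 }
    '
}

# Example sampling loop on Linux (commented out so the sketch has no side effects):
# while true; do
#     used_memory_mb < /proc/meminfo
#     sleep "${MON_SAMPLE_INTERVAL:-5}"
# done
```

Reading from stdin keeps the parser testable with canned input and independent of the host it runs on.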
| 169 | + |
| 170 | +## Report Output |
| 171 | + |
| 172 | +### JSON Report Structure |
| 173 | + |
| 174 | +```json |
| 175 | +{ |
| 176 | +  "session_id": "challenges_20260115_013435", |
| 177 | +  "start_time": "2026-01-15T01:34:35Z", |
| 178 | +  "end_time": "2026-01-15T02:15:22Z", |
| 179 | +  "duration_seconds": 2447, |
| 180 | +  "challenges": { |
| 181 | +    "total": 45, |
| 182 | +    "passed": 43, |
| 183 | +    "failed": 2, |
| 184 | +    "skipped": 0 |
| 185 | +  }, |
| 186 | +  "issues": { |
| 187 | +    "total": 0, |
| 188 | +    "high": 0, |
| 189 | +    "medium": 0, |
| 190 | +    "low": 0 |
| 191 | +  }, |
| 192 | +  "fixes": { |
| 193 | +    "total": 0 |
| 194 | +  }, |
| 195 | +  "resources": { |
| 196 | +    "peak_memory_mb": 512, |
| 197 | +    "peak_cpu_percent": 85, |
| 198 | +    "peak_disk_io_mb": 150 |
| 199 | +  } |
| 200 | +} |
| 201 | +``` |
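
Because the report is machine-readable, a script can pull individual fields out of it. The sketch below reads the `high` issue count without requiring `jq`; it assumes the report contains a single `"high"` key with an integer value, as in the structure shown above. The function name `high_issue_count` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: print the integer value of the "high" field in a report.
# Usage: high_issue_count path/to/report.json
high_issue_count() {
    sed -n 's/.*"high"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p' "$1" | head -n 1
}
```

A gate could then compare the number, for example `[ "$(high_issue_count report.json)" -gt 0 ]`, rather than testing only for the presence of the key, which appears in the report even when the count is zero.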
| 202 | + |
| 203 | +### HTML Report Features |
| 204 | + |
| 205 | +- Executive summary with pass/fail counts |
| 206 | +- Resource usage graphs |
| 207 | +- Issue timeline |
| 208 | +- Challenge results table |
| 209 | +- Fix history with test references |
| 210 | + |
| 211 | +## Test Coverage |
| 212 | + |
| 213 | +The monitoring system is covered by comprehensive tests: |
| 214 | + |
| 215 | +### Integration Tests (`tests/integration/monitoring_system_test.go`) |
| 216 | + |
| 217 | +- `TestMonitoringLibInit` - Initialization |
| 218 | +- `TestMonitoringSampleResources` - Resource sampling |
| 219 | +- `TestMonitoringLogCollection` - Log collection |
| 220 | +- `TestMonitoringMemoryLeakDetection` - Memory leak detection |
| 221 | +- `TestMonitoringLogAnalysis` - Log analysis |
| 222 | +- `TestMonitoringIssueTracking` - Issue tracking |
| 223 | +- `TestMonitoringFixRecording` - Fix recording |
| 224 | +- `TestMonitoringFinalization` - Finalization |
| 225 | + |
| 226 | +### Challenge Script (`challenges/scripts/monitoring_system_challenge.sh`) |
| 227 | + |
| 228 | +21 comprehensive tests covering: |
| 229 | +- Library initialization |
| 230 | +- Resource sampling |
| 231 | +- Log collection |
| 232 | +- Memory leak detection |
| 233 | +- Log analysis |
| 234 | +- Issue tracking |
| 235 | +- Fix recording |
| 236 | +- Report generation |
| 237 | +- Concurrent operations |
| 238 | + |
| 239 | +## Configuration |
| 240 | + |
| 241 | +### Environment Variables |
| 242 | + |
| 243 | +| Variable | Description | Default | |
| 244 | +|----------|-------------|---------| |
| 245 | +| `MON_LOG_DIR` | Log directory | `/tmp/helixagent_monitoring` | |
| 246 | +| `MON_SAMPLE_INTERVAL` | Sample interval (seconds) | `5` | |
| 247 | +| `MON_MEMORY_THRESHOLD` | Memory leak threshold (%) | `150` | |
| 248 | +| `MON_RESOURCE_MONITOR` | Enable resource monitoring | `true` | |
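
One common way for a shell script to honor these variables is the `${VAR:=default}` expansion, which keeps any value already set in the environment. The sketch below is an assumption about how such defaults could be applied, not a copy of the library's code; `apply_monitoring_defaults` is a hypothetical name.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: apply the documented defaults without overriding
# values the caller has already exported.
apply_monitoring_defaults() {
    : "${MON_LOG_DIR:=/tmp/helixagent_monitoring}"
    : "${MON_SAMPLE_INTERVAL:=5}"
    : "${MON_MEMORY_THRESHOLD:=150}"
    : "${MON_RESOURCE_MONITOR:=true}"
    export MON_LOG_DIR MON_SAMPLE_INTERVAL MON_MEMORY_THRESHOLD MON_RESOURCE_MONITOR
}
```

With this pattern, a one-off override such as `MON_SAMPLE_INTERVAL=10 ./challenges/monitoring/run_monitored_challenges.sh` would take effect without editing any script.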
| 249 | + |
| 250 | +### Output Directories |
| 251 | + |
| 252 | +``` |
| 253 | +challenges/monitoring/ |
| 254 | +├── logs/                    # Session logs |
| 255 | +│   └── [session_id]/ |
| 256 | +│       ├── master.log       # Combined log |
| 257 | +│       ├── resources/       # Resource samples |
| 258 | +│       ├── components/      # Component logs |
| 259 | +│       └── issues/          # Issue records |
| 260 | +└── reports/                 # Generated reports |
| 261 | +    └── [session_id]/ |
| 262 | +        ├── report.json |
| 263 | +        └── report.html |
| 264 | +``` |
| 265 | + |
| 266 | +## Integration with CI/CD |
| 267 | + |
| 268 | +### GitHub Actions Example |
| 269 | + |
| 270 | +```yaml |
| 271 | +- name: Run Monitored Challenges |
| 272 | +  run: | |
| 273 | +    ./challenges/monitoring/run_monitored_challenges.sh --continue-on-failure |
| 274 | + |
| 275 | +- name: Upload Monitoring Reports |
| 276 | +  uses: actions/upload-artifact@v4 |
| 277 | +  with: |
| 278 | +    name: monitoring-reports |
| 279 | +    path: challenges/monitoring/reports/ |
| 280 | + |
| 281 | +- name: Check for Critical Issues |
| 282 | +  run: | |
| 283 | +    if grep -Eq '"high": *[1-9]' challenges/monitoring/reports/*/report.json; then |
| 284 | +      echo "Critical issues detected!" |
| 285 | +      exit 1 |
| 286 | +    fi |
| 287 | +``` |
| 288 | + |
| 289 | +## Troubleshooting |
| 290 | + |
| 291 | +### Common Issues |
| 292 | + |
| 293 | +1. **ANSI color codes in output** |
| 294 | +   - Solution: Write colored output to stderr rather than stdout, so captured stdout stays free of escape codes |
| 295 | + |
| 296 | +2. **Exit codes from analysis functions** |
| 297 | +   - Solution: Append `|| true` to `mon_analyze_log_file` calls so a non-zero exit code does not abort the calling script (for example under `set -e`) |
| 298 | + |
| 299 | +3. **Double counting in issue/fix recording** |
| 300 | +   - Solution: Write issue and fix records directly to their files instead of routing them through `mon_log` |
| 301 | + |
| 302 | +4. **Report generator stdout pollution** |
| 303 | +   - Solution: Redirect status messages to stderr so stdout carries only the report output |
| 304 | + |
| 305 | +## Future Enhancements |
| 306 | + |
| 307 | +- [ ] Integration with Prometheus for metrics export |
| 308 | +- [ ] Real-time dashboard with WebSocket updates |
| 309 | +- [ ] Automated issue correlation |
| 310 | +- [ ] Machine learning-based anomaly detection |
| 311 | +- [ ] Integration with alerting systems (Slack, PagerDuty) |