Commit 60d74ab

Auto-commit
1 parent 997063f commit 60d74ab

2 files changed

+492
-0
lines changed
Lines changed: 311 additions & 0 deletions
@@ -0,0 +1,311 @@
# HelixAgent Comprehensive Monitoring System

## Overview

The HelixAgent Monitoring System monitors every challenge execution and provides:

- Real-time resource monitoring (CPU, memory, disk, network)
- Log collection from all system components
- Memory leak detection
- Warning/error detection and analysis
- Automatic issue investigation
- Comprehensive HTML/JSON reports

## Architecture

```
┌─────────────────────────────────────────────────────────────┐
│               MONITORING SYSTEM ARCHITECTURE                │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│   ┌─────────────────┐     ┌─────────────────┐               │
│   │  monitoring_lib │     │ report_generator│               │
│   │       .sh       │     │       .sh       │               │
│   └────────┬────────┘     └────────┬────────┘               │
│            │                       │                        │
│            ▼                       ▼                        │
│   ┌─────────────────────────────────────────┐               │
│   │       run_monitored_challenges.sh       │               │
│   │       • Challenge orchestration         │               │
│   │       • Resource sampling               │               │
│   │       • Log collection                  │               │
│   │       • Issue tracking                  │               │
│   └─────────────────────────────────────────┘               │
│                                                             │
│   Output:                                                   │
│   ┌──────────────┬──────────────┬─────────────┐             │
│   │ JSON Report  │ HTML Report  │ Issue Files │             │
│   └──────────────┴──────────────┴─────────────┘             │
│                                                             │
└─────────────────────────────────────────────────────────────┘
```

## Components

### Core Library (`challenges/monitoring/lib/monitoring_lib.sh`)

The core monitoring library provides all monitoring functions:

| Function | Description |
|----------|-------------|
| `mon_init` | Initialize monitoring session |
| `mon_log` | Log messages with severity levels |
| `mon_sample_resources` | Collect CPU/memory/disk/network stats |
| `mon_collect_all_logs` | Gather logs from all components |
| `mon_detect_memory_leaks` | Detect memory leaks against baseline |
| `mon_analyze_log_file` | Analyze logs for errors/warnings |
| `mon_record_issue` | Record detected issues with severity |
| `mon_record_fix` | Record applied fixes with test references |
| `mon_finalize` | Finalize monitoring and generate summary |

### Report Generator (`challenges/monitoring/lib/report_generator.sh`)

Generates reports in two formats:

- **JSON Report**: Machine-readable format for CI/CD integration
- **HTML Report**: Human-readable format with visualizations

### Main Runner (`challenges/monitoring/run_monitored_challenges.sh`)

The main entry point. It:
1. Initializes monitoring
2. Starts infrastructure (HelixAgent, PostgreSQL, Redis)
3. Runs all challenges with monitoring
4. Investigates detected errors
5. Generates final reports

## Usage

### Running Monitored Challenges

```bash
# Run all challenges with monitoring
./challenges/monitoring/run_monitored_challenges.sh

# Run specific challenges
./challenges/monitoring/run_monitored_challenges.sh --challenges "health_monitoring,provider_verification"

# Skip infrastructure checks
./challenges/monitoring/run_monitored_challenges.sh --skip-infra

# Continue on failures
./challenges/monitoring/run_monitored_challenges.sh --continue-on-failure
```

### Using the Monitoring Library

```bash
#!/bin/bash
source "./challenges/monitoring/lib/monitoring_lib.sh"

# Initialize monitoring
mon_init "my_test_session"

# Log events
mon_log "INFO" "Starting test..."

# Sample resources periodically
mon_sample_resources

# Analyze logs
mon_analyze_log_file "/var/log/helixagent.log" "helixagent" || true

# Record issues and fixes
mon_record_issue "high" "Memory usage exceeded threshold"
mon_record_fix "issue_001" "Optimized memory allocation" "TestMemoryOptimization"

# Finalize and generate report
mon_finalize
```

## Error Detection Patterns

### Error Patterns (14 patterns)
```
panic:|fatal error:|FATAL|ERROR|runtime error:|nil pointer|
index out of range|deadlock|timeout|connection refused|
context deadline exceeded|too many open files|out of memory|OOMKilled
```

### Warning Patterns (11 patterns)
```
WARN|WARNING|deprecated|retry|reconnect|circuit breaker|
rate limit|slow query|high latency|token expired|invalid.*format
```

### Ignored Patterns (6 patterns)
```
TestError|error.*test|expected.*error|mock.*error|
PASS|ok.*dev.helix
```

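
The pattern lists above can be combined into a small line classifier. This is a minimal sketch, not the library's implementation: `classify_line` and the three `*_RE` variables are hypothetical names, and the precedence (ignored patterns first, then errors, then warnings) is an assumption about how `mon_analyze_log_file` might apply them.

```bash
#!/usr/bin/env bash
# Hypothetical helper, not part of monitoring_lib.sh.
# The regexes are the documented pattern lists joined onto one line each.
ERROR_RE='panic:|fatal error:|FATAL|ERROR|runtime error:|nil pointer|index out of range|deadlock|timeout|connection refused|context deadline exceeded|too many open files|out of memory|OOMKilled'
WARN_RE='WARN|WARNING|deprecated|retry|reconnect|circuit breaker|rate limit|slow query|high latency|token expired|invalid.*format'
IGNORE_RE='TestError|error.*test|expected.*error|mock.*error|PASS|ok.*dev.helix'

# Print one of: ignored, error, warning, clean.
classify_line() {
  local line="$1"
  # Assumed precedence: ignored patterns win, so expected errors in
  # test output are not reported as issues.
  if printf '%s\n' "$line" | grep -Eq -- "$IGNORE_RE"; then
    echo "ignored"
  elif printf '%s\n' "$line" | grep -Eq -- "$ERROR_RE"; then
    echo "error"
  elif printf '%s\n' "$line" | grep -Eq -- "$WARN_RE"; then
    echo "warning"
  else
    echo "clean"
  fi
}

classify_line "panic: runtime error: nil pointer dereference"
classify_line "WARN retry attempt 3 for provider"
```

Matching is case-sensitive here, as the mixed-case patterns suggest; whether the library matches case-insensitively is not stated in this document.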
## Memory Leak Detection

The monitoring system detects memory leaks in four steps:

1. **Baseline Collection**: Capture initial memory usage at startup
2. **Periodic Sampling**: Sample memory usage during execution
3. **Threshold Comparison**: Alert if memory usage exceeds the configured percentage of the baseline
4. **FD Monitoring**: Track the file descriptor count to catch descriptor leaks

Default threshold: 150% of baseline memory usage (configurable via `MON_MEMORY_THRESHOLD`)

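
The threshold comparison in step 3 can be sketched with integer arithmetic. This is a hedged illustration, not the code in `mon_detect_memory_leaks`: the function name `mem_exceeds_threshold` is hypothetical, and it assumes the threshold is a percentage of the baseline (so `150` means 1.5 times the baseline).

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeeds (exit 0) when current usage is above
# threshold_pct percent of the baseline, i.e. a leak is suspected.
mem_exceeds_threshold() {
  local baseline_kb="$1" current_kb="$2"
  local threshold_pct="${3:-${MON_MEMORY_THRESHOLD:-150}}"
  # current / baseline > pct / 100  <=>  current * 100 > baseline * pct
  # (cross-multiplied to stay within shell integer arithmetic)
  [ $(( current_kb * 100 )) -gt $(( baseline_kb * threshold_pct )) ]
}

if mem_exceeds_threshold 204800 350000; then
  echo "possible memory leak: usage above threshold" >&2
fi
```

With a 200 MB baseline and the default threshold, the check trips only once usage goes above 300 MB; a reading exactly at the threshold does not count as a leak in this sketch.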
## Resource Monitoring

### Collected Metrics

| Metric | Source | Frequency |
|--------|--------|-----------|
| CPU Usage | `/proc/stat` | Every sample |
| Memory Usage | `/proc/meminfo` | Every sample |
| Disk I/O | `iostat` | Every sample |
| Network I/O | `/proc/net/dev` | Every sample |
| File Descriptors | `/proc/[pid]/fd` | Every sample |
| Goroutine Count | pprof endpoint | Every sample |

### Sample Interval

Default: 5 seconds (configurable via `MON_SAMPLE_INTERVAL`)

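
As an illustration of one row of the metrics table, the sketch below reads memory usage from `/proc/meminfo` at the configured interval. It is an assumption-laden example rather than the body of `mon_sample_resources`: `mem_used_kb` and `sample_memory` are hypothetical names, "used" is taken to mean `MemTotal - MemAvailable`, and the code is Linux-only.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print used memory in kB from a meminfo-style file.
# The file argument defaults to /proc/meminfo and exists mainly for testing.
mem_used_kb() {
  local meminfo="${1:-/proc/meminfo}"
  awk '/^MemTotal:/     { total = $2 }
       /^MemAvailable:/ { avail = $2 }
       END              { print total - avail }' "$meminfo"
}

# Hypothetical helper: take N samples, one every MON_SAMPLE_INTERVAL seconds,
# and print "<UTC timestamp> <used kB>" for each.
sample_memory() {
  local samples="$1" interval="${MON_SAMPLE_INTERVAL:-5}" i=0
  while [ "$i" -lt "$samples" ]; do
    printf '%s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$(mem_used_kb)"
    i=$(( i + 1 ))
    # Sleep between samples, but not after the last one.
    if [ "$i" -lt "$samples" ]; then
      sleep "$interval"
    fi
  done
}
```

Calling `sample_memory 3` on a Linux host prints three timestamped readings; the other metrics in the table would need their own parsers.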
## Report Output

### JSON Report Structure

```json
{
  "session_id": "challenges_20260115_013435",
  "start_time": "2026-01-15T01:34:35Z",
  "end_time": "2026-01-15T02:15:22Z",
  "duration_seconds": 2447,
  "challenges": {
    "total": 45,
    "passed": 43,
    "failed": 2,
    "skipped": 0
  },
  "issues": {
    "total": 0,
    "high": 0,
    "medium": 0,
    "low": 0
  },
  "fixes": {
    "total": 0
  },
  "resources": {
    "peak_memory_mb": 512,
    "peak_cpu_percent": 85,
    "peak_disk_io_mb": 150
  }
}
```

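
Because the report is meant for CI/CD consumption, a pipeline typically needs only a few fields from it. The sketch below reads the `issues.high` counter using `awk` alone. It is a hypothetical helper (`high_issue_count` is not part of the library) and it assumes the report is pretty-printed with one key per line, as in the structure above; a real JSON parser such as `jq` is the more robust choice where it is available.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the value of issues.high from a report file.
# Assumes one key per line, as in the documented report layout.
high_issue_count() {
  awk '
    /"issues"[[:space:]]*:/ { in_issues = 1 }       # entering the issues object
    in_issues && /"high"[[:space:]]*:/ {
      gsub(/[^0-9]/, "", $0)                         # keep only the digits
      print $0
      exit
    }
    in_issues && /}/ { in_issues = 0 }               # leaving the issues object
  ' "$1"
}
```

A gate could then be written as `[ "$(high_issue_count report.json)" -gt 0 ] && exit 1`. If the file has no `issues` object the function prints nothing, so callers should treat empty output as an error rather than as zero.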
### HTML Report Features

- Executive summary with pass/fail counts
- Resource usage graphs
- Issue timeline
- Challenge results table
- Fix history with test references

## Test Coverage

The monitoring system is covered by two test suites:

### Go Tests (`tests/integration/monitoring_system_test.go`)

- `TestMonitoringLibInit` - Initialization
- `TestMonitoringSampleResources` - Resource sampling
- `TestMonitoringLogCollection` - Log collection
- `TestMonitoringMemoryLeakDetection` - Memory leak detection
- `TestMonitoringLogAnalysis` - Log analysis
- `TestMonitoringIssueTracking` - Issue tracking
- `TestMonitoringFixRecording` - Fix recording
- `TestMonitoringFinalization` - Finalization

### Challenge Script (`challenges/scripts/monitoring_system_challenge.sh`)

21 tests covering:
- Library initialization
- Resource sampling
- Log collection
- Memory leak detection
- Log analysis
- Issue tracking
- Fix recording
- Report generation
- Concurrent operations

## Configuration

### Environment Variables

| Variable | Description | Default |
|----------|-------------|---------|
| `MON_LOG_DIR` | Log directory | `/tmp/helixagent_monitoring` |
| `MON_SAMPLE_INTERVAL` | Sample interval (seconds) | `5` |
| `MON_MEMORY_THRESHOLD` | Memory leak threshold (%) | `150` |
| `MON_RESOURCE_MONITOR` | Enable resource monitoring | `true` |

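
A common way to implement defaults like these in a shell library is the `${VAR:=default}` expansion, which assigns only when the variable is unset or empty. The following is a sketch under that assumption; `mon_load_defaults` is a hypothetical name and the validation step is an addition for illustration, not documented library behavior.

```bash
#!/usr/bin/env bash
# Hypothetical helper: apply the documented defaults to any variable
# the caller has not already set.
mon_load_defaults() {
  : "${MON_LOG_DIR:=/tmp/helixagent_monitoring}"
  : "${MON_SAMPLE_INTERVAL:=5}"
  : "${MON_MEMORY_THRESHOLD:=150}"
  : "${MON_RESOURCE_MONITOR:=true}"

  # Illustrative check: reject a non-numeric interval early instead of
  # failing later inside a sleep call.
  case "$MON_SAMPLE_INTERVAL" in
    ''|*[!0-9]*)
      echo "MON_SAMPLE_INTERVAL must be a whole number of seconds" >&2
      return 1
      ;;
  esac

  export MON_LOG_DIR MON_SAMPLE_INTERVAL MON_MEMORY_THRESHOLD MON_RESOURCE_MONITOR
}
```

Overrides then work in the usual way, for example `MON_SAMPLE_INTERVAL=10 ./challenges/monitoring/run_monitored_challenges.sh`.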
### Output Directories

```
challenges/monitoring/
├── logs/                     # Session logs
│   └── [session_id]/
│       ├── master.log        # Combined log
│       ├── resources/        # Resource samples
│       ├── components/       # Component logs
│       └── issues/           # Issue records
└── reports/                  # Generated reports
    └── [session_id]/
        ├── report.json
        └── report.html
```

## Integration with CI/CD

### GitHub Actions Example

```yaml
- name: Run Monitored Challenges
  run: |
    ./challenges/monitoring/run_monitored_challenges.sh --continue-on-failure

- name: Upload Monitoring Reports
  uses: actions/upload-artifact@v4
  with:
    name: monitoring-reports
    path: challenges/monitoring/reports/

- name: Check for Critical Issues
  run: |
    # Fail only when a report lists a non-zero count of high-severity issues;
    # matching the bare "high": key would always succeed, even for a count of 0.
    if grep -Eq '"high":[[:space:]]*[1-9]' challenges/monitoring/reports/*/report.json; then
      echo "Critical issues detected!"
      exit 1
    fi
```

## Troubleshooting

### Common Issues

1. **ANSI color codes in output**
   - Solution: Write colored output to stderr, not stdout

2. **Exit codes from analysis functions**
   - Solution: Append `|| true` to `mon_analyze_log_file` calls so a non-zero exit does not abort the calling script

3. **Double counting in issue/fix recording**
   - Solution: Write issue and fix records directly to their files instead of going through `mon_log`

4. **Report generator stdout pollution**
   - Solution: Redirect status messages to stderr

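
Items 1 and 4 share the same fix: keep stdout for data and send everything meant for a human to stderr. The sketch below shows the pattern with two hypothetical functions (`mon_status` and `generate_json_report` are illustrative names, not the library's API).

```bash
#!/usr/bin/env bash
# Hypothetical helper: colored, human-facing status line, stderr only.
mon_status() {
  printf '\033[0;32m[report]\033[0m %s\n' "$*" >&2
}

# Hypothetical generator: the JSON payload is the only thing on stdout,
# so callers can capture or redirect it safely.
generate_json_report() {
  mon_status "Generating report for session $1"
  printf '{"session_id": "%s"}\n' "$1"
}

# The captured value contains no status text and no ANSI escapes.
report="$(generate_json_report "demo_session")"
printf '%s\n' "$report"
```

With this split, `generate_json_report demo > report.json` produces a clean file while progress messages still appear in the terminal.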
## Future Enhancements

- [ ] Integration with Prometheus for metrics export
- [ ] Real-time dashboard with WebSocket updates
- [ ] Automated issue correlation
- [ ] Machine learning-based anomaly detection
- [ ] Integration with alerting systems (Slack, PagerDuty)
