A robust, scalable distributed job scheduling system built with Go, Python, Redis, PostgreSQL, and monitored with Prometheus & Grafana.
- Distributed Job Processing: Horizontal scaling with multiple worker nodes
- Job Leasing: Prevents duplicate processing with timeout-based leasing
- Dead Letter Queue (DLQ): Handles permanently failed jobs after max retries
- Exponential Backoff: Retry mechanism with exponentially increasing delays plus jitter to prevent a thundering herd
- Load Balancing: Least Recently Used (LRU) worker selection algorithm
- Worker Health Monitoring: Automatic heartbeat verification and state management
- Lease Timeout Recovery: Automatic job recovery when worker becomes unavailable
- Graceful Failure Handling: Comprehensive error handling and job retry logic
- Atomic Operations: Database transactions ensure data consistency
- Prometheus Metrics: Comprehensive system metrics collection
- Grafana Dashboard: Real-time visualization of system performance and health
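The backoff strategy listed above can be sketched in a few lines. This is a minimal illustration of exponential backoff with full jitter, not the coordinator's actual implementation; the `base` and `cap` defaults are assumptions for the example.

```python
import random


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Exponential backoff with full jitter.

    The nominal delay doubles with each attempt (base * 2**attempt) and is
    capped at `cap`; a uniform random fraction of that delay is then used so
    that retrying workers spread out instead of stampeding the queue at once.
    """
    nominal = min(cap, base * (2 ** attempt))
    return random.uniform(0, nominal)
```

With full jitter the retry times are decorrelated across workers, which is what prevents the thundering-herd effect when many jobs fail at the same moment.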
- Coordinator (Go): Central orchestrator managing job distribution and worker coordination
- Worker (Python): Asynchronous job processors with realistic workload simulations
- Submitter (Go): REST API for job submission
- Redis: Message queue for job distribution, result collection and dead letter queue
- PostgreSQL: Persistent storage for jobs and workers
- Prometheus: Metrics collection server
- Grafana: Visualization and monitoring dashboard
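The timeout-based leasing that ties the coordinator, workers, and lease recovery together can be illustrated with a small in-memory model. This is a hypothetical sketch only: the real system persists leases in PostgreSQL/Redis, and the class and method names here are invented for the example.

```python
import time


class LeaseTable:
    """In-memory sketch of timeout-based job leasing.

    A job may be held by at most one worker at a time. If the holder's lease
    expires (e.g. the worker stopped heartbeating), the job becomes
    reclaimable, which models the lease-timeout recovery described above.
    """

    def __init__(self, lease_seconds=30.0):
        self.lease_seconds = lease_seconds
        self.leases = {}  # job_id -> (worker_id, expiry timestamp)

    def acquire(self, job_id, worker_id, now=None):
        """Try to lease a job; returns False if another worker holds it."""
        now = time.monotonic() if now is None else now
        holder = self.leases.get(job_id)
        if holder is not None and holder[1] > now:
            return False  # lease still valid, prevents duplicate processing
        self.leases[job_id] = (worker_id, now + self.lease_seconds)
        return True

    def expired_jobs(self, now=None):
        """Jobs whose lease has lapsed and should be re-queued or retried."""
        now = time.monotonic() if now is None else now
        return [job for job, (_, expiry) in self.leases.items() if expiry <= now]
```

A second worker attempting `acquire` on a live lease is rejected; once the lease lapses, the coordinator can hand the job to someone else.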
- Docker & Docker Compose
- Go 1.21+ (for local development)
- Python 3.9+ (for local development)
git clone https://github.com/soum-sr/distributed_job_scheduler.git
cd distributed_job_scheduler
make up
- Submitter API: http://localhost:8000
- Coordinator: http://localhost:9000
- Grafana Dashboard: http://localhost:3000 (admin/admin)
- Prometheus: http://localhost:9090
- PostgreSQL: localhost:5432
- Redis: localhost:6379
Use the curl command below to submit a cpu_intensive job. More test job scripts are available under distributed_job_scheduler/scripts
curl -X POST http://localhost:8000/submit_job \
-H "Content-Type: application/json" \
-d '{"name": "cpu_intensive", "payload": "test task"}'
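The same submission can be done programmatically. This is a hedged sketch using only the Python standard library; it assumes the endpoint and JSON shape shown in the curl example, and the function names are illustrative, not part of the project.

```python
import json
import urllib.request


def job_request_body(name, payload):
    """Encode a job submission in the JSON shape the curl example uses."""
    return json.dumps({"name": name, "payload": payload}).encode()


def submit_job(name, payload, base_url="http://localhost:8000"):
    """POST a job to the submitter API and return its decoded JSON response."""
    req = urllib.request.Request(
        f"{base_url}/submit_job",
        data=job_request_body(name, payload),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```

For example, `submit_job("cpu_intensive", "test task")` mirrors the curl call above (the stack must be running via `make up` for it to succeed).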
- Job Processing Rates: Real-time job completion/failure rates
- Job Total Counts: Cumulative completed, failed, and timeout jobs
- Worker Status: Active workers by state (available/busy/unavailable)
- Queue Metrics: Jobs in queue and dead letter queue
- Processing Duration: Job execution time percentiles
- Retry Patterns: Retry attempt distributions by failure reason
- Lease Timeouts: Worker unresponsiveness incidents
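Metrics like these are typically derived in Grafana with PromQL over the raw counters and histograms. The queries below are illustrative only; the metric names (`jobs_completed_total`, `job_duration_seconds`) are assumptions, not necessarily what this project's coordinator exports.

```promql
# Per-second job completion rate over the last 5 minutes (hypothetical metric name)
rate(jobs_completed_total[5m])

# 95th-percentile job processing duration from a histogram (hypothetical metric name)
histogram_quantile(0.95, rate(job_duration_seconds_bucket[5m]))
```

Substitute the metric names actually exported by the coordinator (visible at its /metrics endpoint) when building dashboard panels.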
.
├── Makefile
├── README.md
├── coordinator
│ ├── Dockerfile
│ ├── go.mod
│ ├── go.sum
│ ├── handlers.go
│ ├── jobs.go
│ ├── main.go
│ ├── metrics.go
│ ├── utils.go
│ └── worker.go
├── deploy
│ ├── docker-compose.yml
│ ├── grafana
│ │ └── provisioning
│ │ ├── dashboards
│ │ │ ├── dashboard.yml
│ │ │ └── dashboard_content.json
│ │ └── datasources
│ │ └── prometheus.yml
│ ├── initdb
│ │ └── 01_schema.sql
│ └── prometheus
│ └── prometheus.yml
├── docs
│ └── architecture-diagram.png
├── go.work
├── go.work.sum
├── scripts
│ ├── high_volume_stress_test.sh
│ ├── send_cpu_intensive_jobs.sh
│ ├── send_failing_jobs.sh
│ ├── send_io_intensive_jobs.sh
│ └── send_network_intensive_jobs.sh
├── submitter
│ ├── Dockerfile
│ ├── go.mod
│ ├── go.sum
│ └── main.go
└── worker
├── Dockerfile
└── main.py
12 directories, 31 files

