A high-throughput, fault-tolerant webhook delivery system built on Kafka + Redis.
- Kafka is the source of truth — every webhook and delivery result is a Kafka record
- Redis is rebuildable hot state — if Redis dies, restore from compacted topic (zero maintenance)
- Zero maintenance recovery — distributed lock + canary pattern enables automatic Redis rebuild on multi-instance deployments
- Last-Write-Wins (LWW) — nanosecond timestamps prevent race conditions during concurrent reconciliation
- Circuit breakers protect endpoints — failing endpoints don't burn your retry budget
- Penalty box isolates slow endpoints — excess in-flight traffic is rate-limited to protect the fast lane
- At-least-once delivery — we never lose a webhook, duplicates are the receiver's problem
┌─────────────────────────────────────┐
│ API Gateway │
│ POST /webhooks/:config/enqueue │
│ validate → batch produce │
│ → 202 Accepted │
└───────────────┬─────────────────────┘
│
Kafka Produce (batched)
│
▼
┌──────────────────────────────────────────┐
│ Kafka Cluster │
│ │
│ howk.pending → webhooks to deliver │
│ howk.slow → rate-limited lane │
│ howk.results → delivery outcomes │
│ howk.deadletter → exhausted retries │
│ │
│ retention: 7 days │
└────┬─────────────────────────┬──────────┘
│ │
┌─────────┘ └──────────┐
▼ ▼
┌──────────────────────┐ ┌───────────────────────┐
│ Worker Pool │ │ Results Consumer │
│ (N consumers) │ │ │
│ │ │ • Update Redis │
│ • Read pending │ │ status/stats │
│ • Check circuit │ │ • Feed ClickHouse │
│ • Fire HTTP │ │ (optional) │
│ • Produce result │ │ │
│ • Schedule retry │ └───────────────────────┘
│ if needed │
└──────────┬───────────┘
│
▼
┌──────────────────────────────────────────────────────────────────┐
│ Redis │
│ │
│ Circuit Breaker (per endpoint): │
│ HSET circuit:{endpoint_hash} state=OPEN failures=5 last=... │
│ │
│ Concurrency Control (Penalty Box): │
│ INCR concurrency:{endpoint_hash} (with TTL) │
│ Lua: DECR with floor at 0 to prevent drift │
│ │
│ Retry Scheduling: │
│ ZADD retries <next_at_unix> <webhook_id:attempt> │
│ SET retry_data:{id} <compressed_webhook> │
│ │
│ Status (per webhook) - LWW Hash Structure: │
│ HSET status:{webhook_id} │
│ data={json_blob} ← WebhookStatus JSON │
│ ts={nanoseconds} ← UpdatedAtNs for LWW resolution │
│ EXPIRE status:{webhook_id} 7d │
│ │
│ Stats (hourly buckets): │
│ INCR stats:delivered:2026013015 │
│ PFADD stats:hll:endpoints:2026013015 {endpoint} │
│ │
│ System Keys: │
│ howk:system:initialized ← Canary (Redis initialized?) │
│ howk:reconciler:lock ← Distributed lock for rebuild │
│ │
│ ══════════════════════════════════════════════════════════════ │
│ ALL OF THIS IS REBUILDABLE FROM KAFKA REPLAY │
└──────────────────────────────────────────────────────────────────┘
flowchart TB
subgraph Kafka["Kafka Topics"]
PENDING[("howk.pending")]
SLOW[("howk.slow<br/>Rate-limited lane")]
RESULTS[("howk.results")]
DLQ[("howk.deadletter")]
STATE[("howk.state<br/>Compacted topic<br/>Active webhook state")]
end
subgraph Redis["Redis Keys"]
DATA[("retry_data:{webhook_id}<br/>Compressed JSON<br/>TTL: 7 days")]
ZSET[("retries ZSET<br/>score: unix_timestamp<br/>member: webhook_id:attempt")]
META[("retry_meta:{id}:{attempt}<br/>reason, scheduled_at")]
CONC[("concurrency:{endpoint_hash}<br/>In-flight counter<br/>TTL: 2min")]
end
subgraph Worker["Worker Process (Fast Lane)"]
W_CONSUME["1. Consume from<br/>howk.pending"]
W_CHECK_CB{"Circuit<br/>Allows?"}
W_CHECK_CONC{"Inflight <<br/>threshold?"}
W_INCR["IncrInflight()<br/>INCR + EXPIRE"]
W_DELIVER["2. Deliver HTTP"]
W_DECR["DecrInflight()<br/>Lua: DECR ≥0"]
W_SUCCESS{"Success?"}
W_RETRY{"Should<br/>Retry?"}
W_STORE["StoreRetryData()<br/>SET retry_data:{id}"]
W_SCHEDULE["ScheduleRetry()<br/>ZADD + SET meta"]
W_CLEANUP["DeleteRetryData()<br/>DEL retry_data:{id}"]
W_PUBLISH_OK["PublishResult()"]
W_PUBLISH_DLQ["PublishDeadLetter()"]
W_PUBLISH_SLOW["PublishToSlow()<br/>Divert to slow lane"]
end
subgraph SlowWorker["Slow Worker Process"]
SW_CONSUME["1. Consume from<br/>howk.slow"]
SW_RATE["2. Rate limit<br/>(5/sec default)"]
SW_REUSE["3. Same logic as<br/>fast lane"]
end
subgraph Scheduler["Scheduler Process"]
S_POLL["1. Poll every 1s"]
S_POP["2. PopAndLockRetries()<br/>Lua: ZRANGEBYSCORE + ZADD"]
S_PARSE["3. Parse reference"]
S_FETCH["4. GetRetryData()"]
S_PUBLISH["5. Publish to Kafka"]
S_ACK["6. AckRetry()<br/>ZREM + DEL meta"]
end
PENDING --> W_CONSUME
W_CONSUME --> W_CHECK_CB
W_CHECK_CB -->|No| W_SCHEDULE
W_CHECK_CB -->|Yes| W_INCR
W_INCR --> CONC
W_INCR --> W_CHECK_CONC
W_CHECK_CONC -->|Yes| W_DELIVER
W_CHECK_CONC -->|No| W_PUBLISH_SLOW
W_PUBLISH_SLOW --> SLOW
W_PUBLISH_SLOW -.->|decr| CONC
W_DELIVER --> W_SUCCESS
W_DELIVER -.->|defer| W_DECR
W_DECR --> CONC
W_SUCCESS -->|Yes| W_CLEANUP
W_CLEANUP --> W_PUBLISH_OK
W_PUBLISH_OK --> RESULTS
W_SUCCESS -->|No| W_RETRY
W_RETRY -->|Yes| W_STORE
W_STORE --> DATA
W_STORE --> W_SCHEDULE
W_SCHEDULE --> ZSET
W_SCHEDULE --> META
W_RETRY -->|No/Exhausted| W_CLEANUP
W_CLEANUP --> W_PUBLISH_DLQ
W_PUBLISH_DLQ --> DLQ
SLOW --> SW_CONSUME
SW_CONSUME --> SW_RATE
SW_RATE --> SW_REUSE
SW_REUSE --> W_CHECK_CB
S_POLL --> S_POP
S_POP --> ZSET
ZSET --> S_PARSE
S_PARSE --> S_FETCH
S_FETCH --> DATA
S_FETCH --> S_PUBLISH
S_PUBLISH --> PENDING
S_PUBLISH --> S_ACK
S_ACK --> ZSET
S_ACK --> META
classDef kafka fill:#ff9800,stroke:#e65100,color:#000
classDef redis fill:#dc382d,stroke:#a41e11,color:#fff
classDef worker fill:#2196f3,stroke:#1565c0,color:#fff
classDef slowworker fill:#ffeb3b,stroke:#f9a825,color:#000
classDef scheduler fill:#4caf50,stroke:#2e7d32,color:#fff
classDef decision fill:#ff9800,stroke:#e65100,color:#000
class PENDING,SLOW,RESULTS,DLQ kafka
class DATA,ZSET,META,CONC redis
class W_CONSUME,W_DELIVER,W_STORE,W_SCHEDULE,W_CLEANUP,W_PUBLISH_OK,W_PUBLISH_DLQ,W_INCR,W_DECR,W_PUBLISH_SLOW worker
class SW_CONSUME,SW_RATE,SW_REUSE slowworker
class S_POLL,S_POP,S_PARSE,S_FETCH,S_PUBLISH,S_ACK scheduler
class W_SUCCESS,W_RETRY,W_CHECK_CB,W_CHECK_CONC decision
sequenceDiagram
autonumber
participant K as Kafka<br/>howk.pending
participant W as Worker
participant R as Redis
participant E as Endpoint
participant S as Scheduler
Note over K,S: INITIAL DELIVERY (Attempt 0)
K->>W: Consume webhook (attempt=0)
W->>E: HTTP POST
E-->>W: 503 Service Unavailable
Note over W,R: Store data & schedule retry
W->>R: SET retry_data:wh_123 [compressed]
W->>R: ZADD retries score "wh_123:1"
W->>R: SET retry_meta:wh_123:1
Note over K,S: SCHEDULER PICKS UP
rect rgb(220, 240, 220)
Note over S,R: Visibility Timeout (Pop & Lock)
S->>R: Lua: ZRANGEBYSCORE + ZADD future
R-->>S: ["wh_123:1"]
end
S->>R: GET retry_data:wh_123
S->>K: Publish (attempt=1)
S->>R: ZREM + DEL meta (NOT data)
Note over K,S: SECOND DELIVERY (Attempt 1)
K->>W: Consume (attempt=1)
W->>E: HTTP POST
E-->>W: 503 Again
W->>R: SET retry_data:wh_123 (overwrite)
W->>R: ZADD retries "wh_123:2"
Note over K,S: ... Scheduler cycle ...
Note over K,S: FINAL DELIVERY - SUCCESS
K->>W: Consume (attempt=2)
W->>E: HTTP POST
E-->>W: 200 OK
rect rgb(200, 255, 200)
Note over W,R: Terminal State - Cleanup
W->>R: DEL retry_data:wh_123
end
flowchart LR
subgraph DataKeys["Data Keys (One Per Webhook)"]
D1["<b>retry_data:wh_123</b><br/>━━━━━━━━━━━━━━━<br/>Value: gzip(webhook JSON)<br/>TTL: 7 days<br/>Overwritten each retry"]
D2["<b>retry_data:wh_456</b><br/>━━━━━━━━━━━━━━━<br/>Different webhook"]
end
subgraph ZSet["Sorted Set (References)"]
Z["<b>retries</b><br/>━━━━━━━━━━━━━━━<br/>1706745600 → wh_123:1<br/>1706752800 → wh_123:2<br/>1706760000 → wh_456:1"]
end
subgraph MetaKeys["Metadata (Per Attempt)"]
M1["<b>retry_meta:wh_123:1</b><br/>━━━━━━━━━━━━━━━<br/>{reason, scheduled_at}"]
M2["<b>retry_meta:wh_123:2</b>"]
M3["<b>retry_meta:wh_456:1</b>"]
end
Z -->|"parse id"| D1
Z -->|"parse id"| D2
Z -.->|"same ref"| M1
Z -.->|"same ref"| M2
Z -.->|"same ref"| M3
classDef data fill:#e3f2fd,stroke:#1565c0,color:#000
classDef zset fill:#fff3e0,stroke:#e65100,color:#000
classDef meta fill:#f3e5f5,stroke:#7b1fa2,color:#000
class D1,D2 data
class Z zset
class M1,M2,M3 meta
stateDiagram-v2
[*] --> Pending: API Enqueue
Pending --> Delivering: Worker Consume
state Delivering {
[*] --> CheckCircuit: Receive message
CheckCircuit --> CircuitOpen: circuit open
CheckCircuit --> CheckConcurrency: circuit allows
CheckConcurrency --> FastLane: inflight < threshold
CheckConcurrency --> DivertToSlow: inflight ≥ threshold
FastLane --> HTTPDeliver: INCR concurrency
DivertToSlow --> [*]: Publish to howk.slow
HTTPDeliver --> Success: 2xx response
HTTPDeliver --> Failure: error/timeout
Success --> [*]: DECR + publish result
Failure --> ScheduleRetry: retryable
Failure --> DLQ: non-retryable
ScheduleRetry --> [*]: DECR + schedule
DLQ --> [*]: DECR + dead letter
CircuitOpen --> ScheduleRetryCircuit: schedule far future
ScheduleRetryCircuit --> [*]
}
state SlowLane {
[*] --> RateLimitedConsume: Consume from howk.slow
RateLimitedConsume --> ReCheck: Rate limited
ReCheck --> RetryDeliver: Re-check concurrency
RetryDeliver --> Delivering: Process normally
}
Delivering --> Delivered: HTTP 2xx
Delivering --> Failed: HTTP 5xx/408/429
Delivering --> Exhausted: HTTP 4xx
Failed --> Pending: Scheduler Re-enqueue
Failed --> Exhausted: Max Attempts
Delivered --> [*]: ✓ Success
Exhausted --> [*]: ✗ DLQ
note right of Delivering
Concurrency Control:
• INCR concurrency:{hash}
• Check against threshold
• DECR on all exits
• Floor at 0 (Lua script)
end note
note right of SlowLane
Self-healing:
• Rate limited (5/sec)
• Re-checks concurrency
• Returns to fast lane
when endpoint recovers
end note
note right of Failed
Redis State:
• retry_data:{id} = compressed
• retries ZSET = scheduled
• retry_meta:{ref} = metadata
end note
note right of Delivered
Cleanup:
DEL retry_data:{id}
end note
note right of Exhausted
Cleanup:
DEL retry_data:{id}
Publish to DLQ
end note
| Operation | Component | Redis Commands | When Called |
|---|---|---|---|
IncrInflight() |
Worker | INCR concurrency:{hash}EXPIRE concurrency:{hash} {ttl} |
Before delivery attempt |
DecrInflight() |
Worker | Lua: DECR if > 0 |
After delivery (success/fail/DLQ) |
PublishToSlow() |
Worker | Kafka Produce to howk.slow |
When inflight ≥ threshold |
StoreRetryData() |
Worker | SET retry_data:{id} compressed EX 604800 |
Before scheduling retry |
ScheduleRetry() |
Worker | ZADD retries score memberSET retry_meta:{ref} |
After storing data |
PopAndLockRetries() |
Scheduler | Lua: ZRANGEBYSCORE + ZADD future |
Poll loop (every 1s) |
GetRetryData() |
Scheduler | GET retry_data:{id} |
After parsing reference |
AckRetry() |
Scheduler | ZREM retries memberDEL retry_meta:{ref} |
After Kafka publish |
DeleteRetryData() |
Worker | DEL retry_data:{id} |
Terminal state (success/DLQ) |
Per-endpoint circuit breaker with three states:
┌─────────────────────────────────────────────────────────────┐
│ │
▼ │
┌────────┐ failure_threshold ┌────────┐ recovery_timeout ┌───────────┐
│ CLOSED │ ────────────────────▶ │ OPEN │ ─────────────────▶ │ HALF_OPEN │
│ │ exceeded │ │ expired │ │
└────────┘ └────────┘ └───────────┘
▲ ▲ │
│ │ │
│ success │ probe fails │
└──────────────────────────────────┴──────────────────────────────┘
│
│ probe succeeds
└────────────────────▶ CLOSED
When circuit is OPEN:
- Don't attempt delivery (save resources)
- Schedule retry far in the future (respect the endpoint)
- Periodically allow ONE probe request (HALF_OPEN)
Circuit state is per-endpoint, stored in Redis, rebuildable from Kafka results.
Prevents slow/timing-out endpoints from starving the fast delivery path by routing excess in-flight traffic to a rate-limited slow topic.
Fast Lane (howk.pending) Slow Lane (howk.slow)
┌─────────────────────┐ ┌─────────────────────┐
│ Consume webhook │ │ Rate-limited consume│
│ INCR concurrency │ │ (20/sec per worker) │
│ Check threshold │ │ │
│ (< 50 by default) │ │ Re-check concurrency│
│ │ │ If recovered → fast │
│ If over threshold ──┼────────►│ If still slow ──────┼──► (backpressure)
│ │ │ │
│ HTTP POST │ │ HTTP POST │
│ DECR concurrency │ │ DECR concurrency │
└─────────────────────┘ └─────────────────────┘
Key behaviors:
| Component | Failure Mode | Behavior |
|---|---|---|
| Concurrency Check | Fail-open | If Redis is unavailable, delivery proceeds normally without throttling |
| Circuit Breaker | Fail-closed | If Redis is unavailable, requests are blocked (safety over availability) |
| Idempotency Check | Fail-open | If Redis is unavailable, duplicate delivery is possible |
| Slow Lane Divert | Fail-open | If divert to slow lane fails, delivery proceeds in fast lane |
| Stats Recording | Fail-silent | Stats errors are logged but don't block delivery |
- Fail-open vs Fail-closed:
- Fail-open (concurrency, idempotency): Better to deliver duplicates than drop webhooks
- Fail-closed (circuit breaker): Better to pause delivery than overwhelm a failing endpoint
- Self-healing: When endpoint recovers, traffic automatically returns to fast lane
- Crash recovery: TTL on concurrency keys (2min default) auto-corrects leaked counts
- Floor protection: Lua script ensures counter never goes below 0
| Setting | Default | Description |
|---|---|---|
concurrency.max_inflight_per_endpoint |
50 | Threshold above which webhooks are diverted |
concurrency.inflight_ttl |
2m | TTL for concurrency counter (crash recovery) |
concurrency.slow_lane_rate |
20 | Max deliveries/sec from slow lane per worker |
kafka.topics.slow |
howk.slow | Slow lane Kafka topic name |
ttl.retry_data_ttl |
7d | TTL for compressed retry data in Redis |
ttl.status_ttl |
7d | TTL for webhook status records |
ttl.circuit_state_ttl |
24h | TTL for circuit breaker state |
ttl.stats_ttl |
48h | TTL for hourly stats counters |
ttl.idempotency_ttl |
24h | TTL for idempotency keys |
Environment variables:
export HOWK_CONCURRENCY_MAX_INFLIGHT_PER_ENDPOINT=50
export HOWK_CONCURRENCY_INFLIGHT_TTL=2m
export HOWK_CONCURRENCY_SLOW_LANE_RATE=20
export HOWK_KAFKA_TOPICS_SLOW=howk.slow
export HOWK_TTL_RETRY_DATA_TTL=168h
export HOWK_TTL_STATUS_TTL=168h
export HOWK_TTL_CIRCUIT_STATE_TTL=24h
export HOWK_TTL_STATS_TTL=48h
export HOWK_TTL_IDEMPOTENCY_TTL=24hAggregates in-flight requests by domain hostname to prevent overwhelming a single destination, regardless of how many different endpoint URLs point to it.
Without domain limiting:
api.stripe.com/hook1andapi.stripe.com/hook2have independent inflight budgets- Could accidentally send 50 + 50 = 100 concurrent requests to
api.stripe.com - Stripe (or any destination) may rate-limit or block the traffic
With domain limiting:
- Both endpoints share a per-domain budget (e.g., 100 for
api.stripe.com) - Total concurrent requests to stripe.com are capped
┌─────────────────────────────────────────────────────────────┐
│ Domain Limiter (Redis-backed) │
│ │
│ INCR domain_concurrency:api.stripe.com ──┐ │
│ (check against max, default: disabled) │ │
│ ▼ │
│ If under limit ──────────────────────────► Proceed │
│ If over limit ───────────────────────────► Divert to slow │
│ │ │
│ DECR domain_concurrency:api.stripe.com ◄──┘ (on complete)│
└─────────────────────────────────────────────────────────────┘
Integration point: Domain check runs after circuit breaker check, before endpoint inflight check.
| Setting | Default | Description |
|---|---|---|
concurrency.max_inflight_per_domain |
0 | Max concurrent requests per domain (0 = disabled) |
concurrency.domain_overrides |
{} | Per-domain limits: {"api.stripe.com": 200} |
Environment variables:
export HOWK_CONCURRENCY_MAX_INFLIGHT_PER_DOMAIN=100
export HOWK_CONCURRENCY_DOMAIN_OVERRIDES='{"api.stripe.com":200,"hooks.slack.com":30}'Safety features:
- Fail-open: On Redis error, allows the request (logs warning)
- TTL: Uses same TTL as endpoint inflight counters (2min default)
- Lua DECR: Never goes below zero (prevents counter drift)
Controls how many goroutines process messages for the same partition key (ConfigID) concurrently.
| Value | Behavior | Use Case |
|---|---|---|
| 1 (default) | Sequential per ConfigID | Need strict ordering per tenant |
| N > 1 | Parallel per ConfigID | Maximize throughput, idempotent webhooks |
kafka:
per_key_parallelism: 1 # Default: sequentialEnvironment variable:
export HOWK_KAFKA_PER_KEY_PARALLELISM=4Important: With N > 1, messages from the same ConfigID may be delivered concurrently/out of order. This is safe if webhooks are idempotent (each has unique ID, receivers should handle duplicates).
Exponential backoff with circuit-aware delays:
Base delay: 10s
Max delay: 24h
Max attempts: 20
Jitter: ±20%
Circuit CLOSED: delay = base * (2 ^ min(attempt, 10)) + jitter
Circuit OPEN: delay = recovery_timeout (e.g., 5 minutes)
Circuit HALF_OPEN: immediate (it's a probe)
| Binary | Purpose |
|---|---|
howk-api |
HTTP API for enqueueing webhooks |
howk-worker |
Consumes pending, delivers, produces results. Includes both fast lane and slow lane workers |
howk-scheduler |
Pops due retries from Redis, re-enqueues to Kafka |
howk-reconciler |
Rebuilds Redis state from Kafka replay |
howk-dev |
Single-process dev mode — no Kafka/Redis needed (see Dev Mode) |
# Start infrastructure
docker-compose up -d
# Run all components
make run-api
make run-worker
make run-scheduler
# Enqueue a webhook
curl -X POST http://localhost:8080/webhooks/tenant123/enqueue \
-H "Content-Type: application/json" \
-d '{
"endpoint": "https://example.com/webhook",
"payload": {"event": "user.created", "data": {"id": 123}},
"idempotency_key": "user-created-123"
}'# Single binary, zero infrastructure
go run ./cmd/dev
# Or with dry-run (simulates delivery, no HTTP calls)
go run ./cmd/dev --dry
# With Lua scripts loaded from disk
go run ./cmd/dev --scripts-dir=./scripts
# Test it
curl -X POST http://localhost:8080/webhooks/tenant123/enqueue \
-H "Content-Type: application/json" \
-d '{"endpoint": "https://httpbin.org/post", "payload": {"event": "test"}}'See Dev Mode for full documentation.
HOWK supports flexible configuration through:
- Environment Variables (highest priority) -
HOWK_*prefixed - Config File (YAML format) - specified via
--configflag or auto-discovered - Defaults (lowest priority) - sensible built-in defaults
Override any configuration setting using environment variables with the HOWK_ prefix:
export HOWK_API_PORT=9090
export HOWK_KAFKA_BROKERS=kafka1:9092,kafka2:9092
export HOWK_REDIS_ADDR=redis.example.com:6379
export HOWK_REDIS_PASSWORD=secret
export HOWK_TTL_STATUS_TTL=72h
bin/howk-apiSee .env.example for a complete list of environment variables.
Use a YAML config file for complex configurations:
bin/howk-api --config=/etc/howk/config.yamlExample config.yaml:
api:
port: 8080
read_timeout: 10s
write_timeout: 10s
kafka:
brokers:
- localhost:19092
topics:
pending: howk.pending
slow: howk.slow
results: howk.results
deadletter: howk.deadletter
consumer_group: howk-workers
retention: 168h
# per_key_parallelism: 1 # Uncomment for per-key parallelism (default: 1 = sequential)
redis:
addr: "localhost:6379"
password: ""
pool_size: 100
delivery:
timeout: 30s
max_idle_conns: 100
max_conns_per_host: 10
retry:
base_delay: 10s
max_delay: 24h
max_attempts: 20
jitter: 0.2
circuit_breaker:
failure_threshold: 5
failure_window: 60s
recovery_timeout: 5m
probe_interval: 60s
success_threshold: 2
concurrency:
max_inflight_per_endpoint: 50
inflight_ttl: 2m
slow_lane_rate: 20
# max_inflight_per_domain: 0 # Uncomment to enable domain limiting (0 = disabled)
# domain_overrides: # Optional per-domain limits
# api.stripe.com: 200
# hooks.slack.com: 30
scheduler:
poll_interval: 1s
batch_size: 500
ttl:
circuit_state_ttl: 24h
status_ttl: 168h
stats_ttl: 48h
idempotency_ttl: 24h
retry_data_ttl: 168h # Compressed webhook data for retriesEnvironment variables override config file settings, which override defaults:
# config.yaml has: api.port: 7070
# Environment variable overrides it:
export HOWK_API_PORT=9090
bin/howk-api --config=config.yaml
# Result: API listens on port 9090POST /webhooks/:config/enqueue
Request:
{
"endpoint": "https://customer.com/webhook",
"payload": {"event": "order.completed"},
"headers": {"X-Custom": "value"},
"idempotency_key": "order-123-completed",
"signing_secret": "whsec_..."
}Response: 202 Accepted
{
"webhook_id": "wh_01HQXYZ...",
"status": "pending"
}GET /webhooks/:webhook_id/status
Response:
{
"webhook_id": "wh_01HQXYZ...",
"state": "delivered",
"attempts": 2,
"last_attempt_at": "2026-01-30T10:00:00Z",
"last_status_code": 200,
"next_retry_at": null
}GET /stats
Response:
{
"last_1h": {
"enqueued": 7200,
"delivered": 7150,
"failed": 50,
"unique_endpoints": 1200
},
"last_24h": {
"enqueued": 172800,
"delivered": 170000,
"failed": 2800,
"unique_endpoints": 45000
}
}POST /incoming/:script_name
Execute a Lua transformer script to fan out incoming webhooks. See docs/transformers.md for full documentation.
Request:
curl -X POST http://localhost:8080/incoming/stripe-router \
-H "Content-Type: application/json" \
-u admin:password \
-d '{"type": "charge.succeeded", "amount": 1000}'Response: 200 OK
{
"webhooks": [
{"id": "wh_01HQXYZ...", "endpoint": "https://billing.internal/webhook"},
{"id": "wh_01HQABC...", "endpoint": "https://analytics.internal/track"}
],
"count": 2
}Features:
- Lua scripting for payload transformation and routing
- Fan-out: 1 incoming request → N outgoing webhooks
- Basic Auth support (bcrypt or plaintext)
- Domain allowlists for security
- Hot-reload via SIGHUP
- KV store access for deduplication/state
- HTTP client for external API calls
HOWK includes a single-process dev mode that runs the full webhook lifecycle (API + worker + scheduler) with no external dependencies. Kafka is replaced by an in-memory channel-based broker, and Redis is replaced by miniredis (a full Redis implementation in Go).
Running Kafka + Redis locally requires Docker and adds friction for developers who just need to test webhook delivery flows. Dev mode eliminates that:
go run ./cmd/dev # That's it. No Docker, no infra.| Feature | Dev Mode | Production |
|---|---|---|
| Webhook enqueue + delivery | In-memory broker | Kafka |
| Status tracking | miniredis | Redis |
| Circuit breaker | miniredis (full state machine) | Redis |
| Retry scheduling | miniredis | Redis |
Lua scripts (kv.get/kv.set) |
miniredis | Redis |
| HTTP delivery | Real HTTP (or --dry) |
Real HTTP |
| Slow lane / penalty box | In-memory broker | Kafka |
| Transformer scripts | Supported | Supported |
--config Config file path (optional, uses defaults)
--scripts-dir Directory with .lua/.json script files
--port API port (default: 8080, overrides config)
--dry Dry-run mode: log deliveries, return 200 without HTTP calls
Scripts can be loaded from disk at startup. Two formats are supported:
.lua files — raw Lua code, filename becomes config_id:
# scripts/wh.lua → config_id "wh"
# scripts/caesb.lua → config_id "caesb"
go run ./cmd/dev --scripts-dir=./scripts.json files — full script config:
{
"config_id": "wh",
"lua_code": "request.headers['X-Custom'] = 'value'",
"hash": "abc123",
"version": "1.0"
}Scripts can also be uploaded at runtime via the API (PUT /config/:id/script) — they flow through the in-memory broker just like production.
With --dry, deliveries are logged but no HTTP requests are made. The worker assumes a 200 OK response for every webhook:
go run ./cmd/dev --dry --scripts-dir=./scriptsINF DRY: would deliver webhook_id=wh_01KNN... endpoint=https://example.com/hook payload_bytes=42
INF Delivery succeeded status_code=200 duration=1ms
This is useful for testing Lua script transformations, payload routing, and retry logic without needing a live endpoint.
make run-dev # Live delivery, no scripts
make run-dev-dry # Dry-run mode
make run-dev-scripts # With scripts from ./scripts/- No persistence — all state is lost on restart (miniredis + in-memory broker)
- No reconciler — not needed without real Kafka to replay from
- Single-process — no partition parallelism or consumer groups
- No Kafka compaction — script topic is simple pub/sub
┌────────────────────────────────────────────────────┐
│ cmd/dev/main.go │
│ │
│ ┌──────────┐ ┌──────────┐ ┌───────────────┐ │
│ │ API │ │ Worker │ │ Scheduler │ │
│ │ Server │ │ + Slow │ │ │ │
│ └────┬─────┘ └────┬─────┘ └───────┬───────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌─────────────────────────────────────────────┐ │
│ │ MemBroker (channels) │ │
│ │ howk.pending → howk.results → howk.dl │ │
│ └─────────────────────────────────────────────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌─────────────────────────────────────────────┐ │
│ │ miniredis (in-process Redis) │ │
│ │ circuit breaker, retries, status, KV, ... │ │
│ └─────────────────────────────────────────────┘ │
└────────────────────────────────────────────────────┘
HOWK implements automatic self-healing for Redis loss using a distributed coordination pattern:
When Redis is lost in a multi-instance deployment (e.g., 3 workers + 2 publishers):
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Worker │ │ Worker │ │ Worker │
│ Instance 1 │ │ Instance 2 │ │ Instance N │
└──────┬──────┘ └──────┬──────┘ └──────┬──────┘
│ │ │
└─────────┬─────────┴─────────┬─────────┘
▼ ▼
┌───────────────────────────────────────┐
│ Distributed Lock │
│ SET howk:reconciler:lock NX EX 60 │
│ (Only 1 instance wins) │
└───────────────┬───────────────────────┘
▼
┌───────────────────────────────────────┐
│ Reconciler (Winner) │
│ 1. Flush Redis │
│ 2. Consume howk.state → HWM │
│ 3. Restore status + retries │
│ 4. SET howk:system:initialized │
└───────────────┬───────────────────────┘
▼
┌───────────────────────────────────────┐
│ All Other Instances │
│ WaitForCanary() then resume │
└───────────────────────────────────────┘
Startup Sequence (Per Instance):
- Check
howk:system:initializedcanary key - If missing: try to acquire
howk:reconciler:lock - If lock acquired: run reconciler, set canary, release lock
- If lock NOT acquired:
WaitForCanary()until peer finishes - Proceed with normal operation
Runtime Sentinel (Auto-Recovery):
- Background goroutine checks canary every 30s
- If canary missing (Redis flushed/lost):
- Pause Kafka consumer
- Try to acquire lock → reconcile → set canary
- Resume consumer
To prevent race conditions during concurrent reconciliation:
-- Every SetStatus uses Lua script with nanosecond timestamp check
local old_ts = tonumber(redis.call('HGET', KEYS[1], 'ts') or '0')
if new_ts > old_ts then
redis.call('HSET', key, 'data', data, 'ts', new_ts)
return 1 -- updated
end
return 0 -- skipped (old data)- Workers set
UpdatedAtNs = time.Now().UnixNano()on every status change - Reconciler restores timestamps from Kafka snapshots
- Result: Even if reconciler replays stale data, it won't overwrite newer writes
For single-instance deployments or forced rebuild:
# Stop workers, flush Redis, run reconciler, start workers
redis-cli FLUSHDB
./bin/howk-reconciler
./bin/howk-worker &Why this works:
- Workers continuously publish state snapshots to
howk.statetopic - Failed webhooks (pending retry) → full state snapshot with
UpdatedAtNs - Terminal webhooks (delivered/exhausted) → tombstone published
- Kafka compaction retains only the latest state per webhook
- LWW ensures safe concurrent writes during recovery
No data loss: Redis state is fully reconstructible from Kafka's compacted topic.
Kafka handles this internally via replication. If you lose all replicas... you have bigger problems.
MIT