HOWK - High Opinionated Webhook Kit

A high-throughput, fault-tolerant webhook delivery system built on Kafka + Redis.

Philosophy

Kafka is the source of truth — every webhook and delivery result is a Kafka record
Redis is rebuildable hot state — if Redis dies, restore from compacted topic (zero maintenance)
Zero maintenance recovery — distributed lock + canary pattern enables automatic Redis rebuild on multi-instance deployments
Last-Write-Wins (LWW) — nanosecond timestamps prevent race conditions during concurrent reconciliation
Circuit breakers protect endpoints — failing endpoints don't burn your retry budget
Penalty box isolates slow endpoints — excess in-flight traffic is rate-limited to protect the fast lane
At-least-once delivery — we never lose a webhook, duplicates are the receiver's problem

Architecture

High-Level Overview

                    ┌─────────────────────────────────────┐
                    │           API Gateway               │
                    │  POST /webhooks/:config/enqueue     │
                    │  validate → batch produce           │
                    │  → 202 Accepted                     │
                    └───────────────┬─────────────────────┘
                                    │
                          Kafka Produce (batched)
                                    │
                                    ▼
                 ┌──────────────────────────────────────────┐
                 │            Kafka Cluster                 │
                 │                                          │
                 │  howk.pending    → webhooks to deliver   │
                 │  howk.slow       → rate-limited lane     │
                 │  howk.results    → delivery outcomes     │
                 │  howk.deadletter → exhausted retries     │
                 │                                          │
                 │  retention: 7 days                       │
                 └────┬─────────────────────────┬──────────┘
                      │                         │
            ┌─────────┘                         └──────────┐
            ▼                                              ▼
  ┌──────────────────────┐                    ┌───────────────────────┐
  │   Worker Pool        │                    │   Results Consumer    │
  │   (N consumers)      │                    │                       │
  │                      │                    │   • Update Redis      │
  │   • Read pending     │                    │     status/stats      │
  │   • Check circuit    │                    │   • Feed ClickHouse   │
  │   • Fire HTTP        │                    │     (optional)        │
  │   • Produce result   │                    │                       │
  │   • Schedule retry   │                    └───────────────────────┘
  │     if needed        │
  └──────────┬───────────┘
             │
             ▼
  ┌──────────────────────────────────────────────────────────────────┐
  │                           Redis                                   │
  │                                                                   │
  │  Circuit Breaker (per endpoint):                                  │
  │    HSET circuit:{endpoint_hash} state=OPEN failures=5 last=...   │
  │                                                                   │
  │  Concurrency Control (Penalty Box):                               │
  │    INCR concurrency:{endpoint_hash}  (with TTL)                  │
  │    Lua: DECR with floor at 0 to prevent drift                     │
  │                                                                   │
  │  Retry Scheduling:                                                │
  │    ZADD retries <next_at_unix> <webhook_id:attempt>              │
  │    SET retry_data:{id} <compressed_webhook>                      │
  │                                                                   │
  │  Status (per webhook) - LWW Hash Structure:                       │
  │    HSET status:{webhook_id}                                       │
  │      data={json_blob}     ← WebhookStatus JSON                    │
  │      ts={nanoseconds}     ← UpdatedAtNs for LWW resolution        │
  │    EXPIRE status:{webhook_id} 7d                                  │
  │                                                                   │
  │  Stats (hourly buckets):                                          │
  │    INCR stats:delivered:2026013015                               │
  │    PFADD stats:hll:endpoints:2026013015 {endpoint}               │
  │                                                                   │
  │  System Keys:                                                     │
  │    howk:system:initialized     ← Canary (Redis initialized?)     │
  │    howk:reconciler:lock        ← Distributed lock for rebuild    │
  │                                                                   │
  │  ══════════════════════════════════════════════════════════════  │
  │  ALL OF THIS IS REBUILDABLE FROM KAFKA REPLAY                    │
  └──────────────────────────────────────────────────────────────────┘

System Data Flow

flowchart TB
    subgraph Kafka["Kafka Topics"]
        PENDING[("howk.pending")]
        SLOW[("howk.slow<br/>Rate-limited lane")]
        RESULTS[("howk.results")]
        DLQ[("howk.deadletter")]
        STATE[("howk.state<br/>Compacted topic<br/>Active webhook state")]
    end

    subgraph Redis["Redis Keys"]
        DATA[("retry_data:{webhook_id}<br/>Compressed JSON<br/>TTL: 7 days")]
        ZSET[("retries ZSET<br/>score: unix_timestamp<br/>member: webhook_id:attempt")]
        META[("retry_meta:{id}:{attempt}<br/>reason, scheduled_at")]
        CONC[("concurrency:{endpoint_hash}<br/>In-flight counter<br/>TTL: 2min")]
    end

    subgraph Worker["Worker Process (Fast Lane)"]
        W_CONSUME["1. Consume from<br/>howk.pending"]
        W_CHECK_CB{"Circuit<br/>Allows?"}
        W_CHECK_CONC{"Inflight &lt;<br/>threshold?"}
        W_INCR["IncrInflight()<br/>INCR + EXPIRE"]
        W_DELIVER["2. Deliver HTTP"]
        W_DECR["DecrInflight()<br/>Lua: DECR ≥0"]
        W_SUCCESS{"Success?"}
        W_RETRY{"Should<br/>Retry?"}
        W_STORE["StoreRetryData()<br/>SET retry_data:{id}"]
        W_SCHEDULE["ScheduleRetry()<br/>ZADD + SET meta"]
        W_CLEANUP["DeleteRetryData()<br/>DEL retry_data:{id}"]
        W_PUBLISH_OK["PublishResult()"]
        W_PUBLISH_DLQ["PublishDeadLetter()"]
        W_PUBLISH_SLOW["PublishToSlow()<br/>Divert to slow lane"]
    end

    subgraph SlowWorker["Slow Worker Process"]
        SW_CONSUME["1. Consume from<br/>howk.slow"]
        SW_RATE["2. Rate limit<br/>(5/sec default)"]
        SW_REUSE["3. Same logic as<br/>fast lane"]
    end

    subgraph Scheduler["Scheduler Process"]
        S_POLL["1. Poll every 1s"]
        S_POP["2. PopAndLockRetries()<br/>Lua: ZRANGEBYSCORE + ZADD"]
        S_PARSE["3. Parse reference"]
        S_FETCH["4. GetRetryData()"]
        S_PUBLISH["5. Publish to Kafka"]
        S_ACK["6. AckRetry()<br/>ZREM + DEL meta"]
    end

    PENDING --> W_CONSUME
    W_CONSUME --> W_CHECK_CB
    W_CHECK_CB -->|No| W_SCHEDULE
    W_CHECK_CB -->|Yes| W_INCR
    W_INCR --> CONC
    W_INCR --> W_CHECK_CONC
    
    W_CHECK_CONC -->|Yes| W_DELIVER
    W_CHECK_CONC -->|No| W_PUBLISH_SLOW
    W_PUBLISH_SLOW --> SLOW
    W_PUBLISH_SLOW -.->|decr| CONC
    
    W_DELIVER --> W_SUCCESS
    W_DELIVER -.->|defer| W_DECR
    W_DECR --> CONC
    
    W_SUCCESS -->|Yes| W_CLEANUP
    W_CLEANUP --> W_PUBLISH_OK
    W_PUBLISH_OK --> RESULTS
    
    W_SUCCESS -->|No| W_RETRY
    W_RETRY -->|Yes| W_STORE
    W_STORE --> DATA
    W_STORE --> W_SCHEDULE
    W_SCHEDULE --> ZSET
    W_SCHEDULE --> META
    
    W_RETRY -->|No/Exhausted| W_CLEANUP
    W_CLEANUP --> W_PUBLISH_DLQ
    W_PUBLISH_DLQ --> DLQ

    SLOW --> SW_CONSUME
    SW_CONSUME --> SW_RATE
    SW_RATE --> SW_REUSE
    SW_REUSE --> W_CHECK_CB

    S_POLL --> S_POP
    S_POP --> ZSET
    ZSET --> S_PARSE
    S_PARSE --> S_FETCH
    S_FETCH --> DATA
    S_FETCH --> S_PUBLISH
    S_PUBLISH --> PENDING
    S_PUBLISH --> S_ACK
    S_ACK --> ZSET
    S_ACK --> META

    classDef kafka fill:#ff9800,stroke:#e65100,color:#000
    classDef redis fill:#dc382d,stroke:#a41e11,color:#fff
    classDef worker fill:#2196f3,stroke:#1565c0,color:#fff
    classDef slowworker fill:#ffeb3b,stroke:#f9a825,color:#000
    classDef scheduler fill:#4caf50,stroke:#2e7d32,color:#fff
    classDef decision fill:#ff9800,stroke:#e65100,color:#000

    class PENDING,SLOW,RESULTS,DLQ kafka
    class DATA,ZSET,META,CONC redis
    class W_CONSUME,W_DELIVER,W_STORE,W_SCHEDULE,W_CLEANUP,W_PUBLISH_OK,W_PUBLISH_DLQ,W_INCR,W_DECR,W_PUBLISH_SLOW worker
    class SW_CONSUME,SW_RATE,SW_REUSE slowworker
    class S_POLL,S_POP,S_PARSE,S_FETCH,S_PUBLISH,S_ACK scheduler
    class W_SUCCESS,W_RETRY,W_CHECK_CB,W_CHECK_CONC decision

Retry Lifecycle Sequence

sequenceDiagram
    autonumber
    participant K as Kafka<br/>howk.pending
    participant W as Worker
    participant R as Redis
    participant E as Endpoint
    participant S as Scheduler

    Note over K,S: INITIAL DELIVERY (Attempt 0)
    
    K->>W: Consume webhook (attempt=0)
    W->>E: HTTP POST
    E-->>W: 503 Service Unavailable
    
    Note over W,R: Store data & schedule retry
    
    W->>R: SET retry_data:wh_123 [compressed]
    W->>R: ZADD retries score "wh_123:1"
    W->>R: SET retry_meta:wh_123:1

    Note over K,S: SCHEDULER PICKS UP
    
    rect rgb(220, 240, 220)
        Note over S,R: Visibility Timeout (Pop & Lock)
        S->>R: Lua: ZRANGEBYSCORE + ZADD future
        R-->>S: ["wh_123:1"]
    end
    
    S->>R: GET retry_data:wh_123
    S->>K: Publish (attempt=1)
    S->>R: ZREM + DEL meta (NOT data)

    Note over K,S: SECOND DELIVERY (Attempt 1)
    
    K->>W: Consume (attempt=1)
    W->>E: HTTP POST
    E-->>W: 503 Again
    
    W->>R: SET retry_data:wh_123 (overwrite)
    W->>R: ZADD retries "wh_123:2"

    Note over K,S: ... Scheduler cycle ...

    Note over K,S: FINAL DELIVERY - SUCCESS
    
    K->>W: Consume (attempt=2)
    W->>E: HTTP POST
    E-->>W: 200 OK
    
    rect rgb(200, 255, 200)
        Note over W,R: Terminal State - Cleanup
        W->>R: DEL retry_data:wh_123
    end

Redis Key Structure

flowchart LR
    subgraph DataKeys["Data Keys (One Per Webhook)"]
        D1["<b>retry_data:wh_123</b><br/>━━━━━━━━━━━━━━━<br/>Value: gzip(webhook JSON)<br/>TTL: 7 days<br/>Overwritten each retry"]
        D2["<b>retry_data:wh_456</b><br/>━━━━━━━━━━━━━━━<br/>Different webhook"]
    end
    
    subgraph ZSet["Sorted Set (References)"]
        Z["<b>retries</b><br/>━━━━━━━━━━━━━━━<br/>1706745600 → wh_123:1<br/>1706752800 → wh_123:2<br/>1706760000 → wh_456:1"]
    end
    
    subgraph MetaKeys["Metadata (Per Attempt)"]
        M1["<b>retry_meta:wh_123:1</b><br/>━━━━━━━━━━━━━━━<br/>{reason, scheduled_at}"]
        M2["<b>retry_meta:wh_123:2</b>"]
        M3["<b>retry_meta:wh_456:1</b>"]
    end

    Z -->|"parse id"| D1
    Z -->|"parse id"| D2
    Z -.->|"same ref"| M1
    Z -.->|"same ref"| M2
    Z -.->|"same ref"| M3

    classDef data fill:#e3f2fd,stroke:#1565c0,color:#000
    classDef zset fill:#fff3e0,stroke:#e65100,color:#000
    classDef meta fill:#f3e5f5,stroke:#7b1fa2,color:#000

    class D1,D2 data
    class Z zset
    class M1,M2,M3 meta

Webhook State Machine

stateDiagram-v2
    [*] --> Pending: API Enqueue
    
    Pending --> Delivering: Worker Consume
    
    state Delivering {
        [*] --> CheckCircuit: Receive message
        CheckCircuit --> CircuitOpen: circuit open
        CheckCircuit --> CheckConcurrency: circuit allows
        
        CheckConcurrency --> FastLane: inflight < threshold
        CheckConcurrency --> DivertToSlow: inflight ≥ threshold
        
        FastLane --> HTTPDeliver: INCR concurrency
        DivertToSlow --> [*]: Publish to howk.slow
        
        HTTPDeliver --> Success: 2xx response
        HTTPDeliver --> Failure: error/timeout
        
        Success --> [*]: DECR + publish result
        Failure --> ScheduleRetry: retryable
        Failure --> DLQ: non-retryable
        ScheduleRetry --> [*]: DECR + schedule
        DLQ --> [*]: DECR + dead letter
        
        CircuitOpen --> ScheduleRetryCircuit: schedule far future
        ScheduleRetryCircuit --> [*]
    }
    
    state SlowLane {
        [*] --> RateLimitedConsume: Consume from howk.slow
        RateLimitedConsume --> ReCheck: Rate limited
        ReCheck --> RetryDeliver: Re-check concurrency
        RetryDeliver --> Delivering: Process normally
    }
    
    Delivering --> Delivered: HTTP 2xx
    Delivering --> Failed: HTTP 5xx/408/429
    Delivering --> Exhausted: HTTP 4xx
    
    Failed --> Pending: Scheduler Re-enqueue
    Failed --> Exhausted: Max Attempts
    
    Delivered --> [*]: ✓ Success
    Exhausted --> [*]: ✗ DLQ

    note right of Delivering
        Concurrency Control:
        • INCR concurrency:{hash}
        • Check against threshold
        • DECR on all exits
        • Floor at 0 (Lua script)
    end note
    
    note right of SlowLane
        Self-healing:
        • Rate limited (5/sec)
        • Re-checks concurrency
        • Returns to fast lane
          when endpoint recovers
    end note
    
    note right of Failed
        Redis State:
        • retry_data:{id} = compressed
        • retries ZSET = scheduled
        • retry_meta:{ref} = metadata
    end note
    
    note right of Delivered
        Cleanup:
        DEL retry_data:{id}
    end note
    
    note right of Exhausted
        Cleanup:
        DEL retry_data:{id}
        Publish to DLQ
    end note

Operations Summary

Operation	Component	Redis Commands	When Called
`IncrInflight()`	Worker	`INCR concurrency:{hash}` `EXPIRE concurrency:{hash} {ttl}`	Before delivery attempt
`DecrInflight()`	Worker	`Lua: DECR if > 0`	After delivery (success/fail/DLQ)
`PublishToSlow()`	Worker	Kafka Produce to `howk.slow`	When inflight ≥ threshold
`StoreRetryData()`	Worker	`SET retry_data:{id} compressed EX 604800`	Before scheduling retry
`ScheduleRetry()`	Worker	`ZADD retries score member` `SET retry_meta:{ref}`	After storing data
`PopAndLockRetries()`	Scheduler	`Lua: ZRANGEBYSCORE + ZADD future`	Poll loop (every 1s)
`GetRetryData()`	Scheduler	`GET retry_data:{id}`	After parsing reference
`AckRetry()`	Scheduler	`ZREM retries member` `DEL retry_meta:{ref}`	After Kafka publish
`DeleteRetryData()`	Worker	`DEL retry_data:{id}`	Terminal state (success/DLQ)

Circuit Breaker Design

Per-endpoint circuit breaker with three states:

    ┌─────────────────────────────────────────────────────────────┐
    │                                                             │
    ▼                                                             │
┌────────┐   failure_threshold    ┌────────┐   recovery_timeout  ┌───────────┐
│ CLOSED │ ────────────────────▶  │  OPEN  │ ─────────────────▶  │ HALF_OPEN │
│        │      exceeded          │        │     expired         │           │
└────────┘                        └────────┘                     └───────────┘
    ▲                                  ▲                              │
    │                                  │                              │
    │         success                  │        probe fails           │
    └──────────────────────────────────┴──────────────────────────────┘
                                       │
                                       │ probe succeeds
                                       └────────────────────▶ CLOSED

When circuit is OPEN:

Don't attempt delivery (save resources)
Schedule retry far in the future (respect the endpoint)
Periodically allow ONE probe request (HALF_OPEN)

Circuit state is per-endpoint, stored in Redis, rebuildable from Kafka results.

Penalty Box / Slow Lane

Prevents slow/timing-out endpoints from starving the fast delivery path by routing excess in-flight traffic to a rate-limited slow topic.

How It Works

Fast Lane (howk.pending)          Slow Lane (howk.slow)
┌─────────────────────┐          ┌─────────────────────┐
│ Consume webhook     │          │ Rate-limited consume│
│ INCR concurrency    │          │ (20/sec per worker) │
│ Check threshold     │          │                     │
│ (< 50 by default)   │          │ Re-check concurrency│
│                     │          │ If recovered → fast │
│ If over threshold ──┼────────►│ If still slow ──────┼──► (backpressure)
│                     │          │                     │
│ HTTP POST           │          │ HTTP POST           │
│ DECR concurrency    │          │ DECR concurrency    │
└─────────────────────┘          └─────────────────────┘

Key behaviors:

Component	Failure Mode	Behavior
Concurrency Check	Fail-open	If Redis is unavailable, delivery proceeds normally without throttling
Circuit Breaker	Fail-closed	If Redis is unavailable, requests are blocked (safety over availability)
Idempotency Check	Fail-open	If Redis is unavailable, duplicate delivery is possible
Slow Lane Divert	Fail-open	If divert to slow lane fails, delivery proceeds in fast lane
Stats Recording	Fail-silent	Stats errors are logged but don't block delivery

Fail-open vs Fail-closed:
- Fail-open (concurrency, idempotency): Better to deliver duplicates than drop webhooks
- Fail-closed (circuit breaker): Better to pause delivery than overwhelm a failing endpoint
Self-healing: When endpoint recovers, traffic automatically returns to fast lane
Crash recovery: TTL on concurrency keys (2min default) auto-corrects leaked counts
Floor protection: Lua script ensures counter never goes below 0

Configuration

Setting	Default	Description
`concurrency.max_inflight_per_endpoint`	50	Threshold above which webhooks are diverted
`concurrency.inflight_ttl`	2m	TTL for concurrency counter (crash recovery)
`concurrency.slow_lane_rate`	20	Max deliveries/sec from slow lane per worker
`kafka.topics.slow`	howk.slow	Slow lane Kafka topic name
`ttl.retry_data_ttl`	7d	TTL for compressed retry data in Redis
`ttl.status_ttl`	7d	TTL for webhook status records
`ttl.circuit_state_ttl`	24h	TTL for circuit breaker state
`ttl.stats_ttl`	48h	TTL for hourly stats counters
`ttl.idempotency_ttl`	24h	TTL for idempotency keys

Environment variables:

export HOWK_CONCURRENCY_MAX_INFLIGHT_PER_ENDPOINT=50
export HOWK_CONCURRENCY_INFLIGHT_TTL=2m
export HOWK_CONCURRENCY_SLOW_LANE_RATE=20
export HOWK_KAFKA_TOPICS_SLOW=howk.slow
export HOWK_TTL_RETRY_DATA_TTL=168h
export HOWK_TTL_STATUS_TTL=168h
export HOWK_TTL_CIRCUIT_STATE_TTL=24h
export HOWK_TTL_STATS_TTL=48h
export HOWK_TTL_IDEMPOTENCY_TTL=24h

Domain Concurrency Limiter (NEW)

Aggregates in-flight requests by domain hostname to prevent overwhelming a single destination, regardless of how many different endpoint URLs point to it.

Problem It Solves

Without domain limiting:

api.stripe.com/hook1 and api.stripe.com/hook2 have independent inflight budgets
Could accidentally send 50 + 50 = 100 concurrent requests to api.stripe.com
Stripe (or any destination) may rate-limit or block the traffic

With domain limiting:

Both endpoints share a per-domain budget (e.g., 100 for api.stripe.com)
Total concurrent requests to stripe.com are capped

How It Works

┌─────────────────────────────────────────────────────────────┐
│  Domain Limiter (Redis-backed)                              │
│                                                             │
│  INCR domain_concurrency:api.stripe.com  ──┐               │
│  (check against max, default: disabled)    │               │
│                                            ▼               │
│  If under limit ──────────────────────────►  Proceed        │
│  If over limit ───────────────────────────►  Divert to slow │
│                                            │               │
│  DECR domain_concurrency:api.stripe.com  ◄──┘  (on complete)│
└─────────────────────────────────────────────────────────────┘

Integration point: Domain check runs after circuit breaker check, before endpoint inflight check.

Configuration

Setting	Default	Description
`concurrency.max_inflight_per_domain`	0	Max concurrent requests per domain (0 = disabled)
`concurrency.domain_overrides`	{}	Per-domain limits: `{"api.stripe.com": 200}`

Environment variables:

export HOWK_CONCURRENCY_MAX_INFLIGHT_PER_DOMAIN=100
export HOWK_CONCURRENCY_DOMAIN_OVERRIDES='{"api.stripe.com":200,"hooks.slack.com":30}'

Safety features:

Fail-open: On Redis error, allows the request (logs warning)
TTL: Uses same TTL as endpoint inflight counters (2min default)
Lua DECR: Never goes below zero (prevents counter drift)

Per-Key Parallelism (NEW)

Controls how many goroutines process messages for the same partition key (ConfigID) concurrently.

Trade-off

Value	Behavior	Use Case
1 (default)	Sequential per ConfigID	Need strict ordering per tenant
N > 1	Parallel per ConfigID	Maximize throughput, idempotent webhooks

Configuration

kafka:
  per_key_parallelism: 1  # Default: sequential

Environment variable:

export HOWK_KAFKA_PER_KEY_PARALLELISM=4

Important: With N > 1, messages from the same ConfigID may be delivered concurrently/out of order. This is safe if webhooks are idempotent (each has unique ID, receivers should handle duplicates).

Retry Strategy

Exponential backoff with circuit-aware delays:

Base delay: 10s
Max delay: 24h
Max attempts: 20
Jitter: ±20%

Circuit CLOSED:  delay = base * (2 ^ min(attempt, 10)) + jitter
Circuit OPEN:    delay = recovery_timeout (e.g., 5 minutes)
Circuit HALF_OPEN: immediate (it's a probe)

Components

Binary	Purpose
`howk-api`	HTTP API for enqueueing webhooks
`howk-worker`	Consumes pending, delivers, produces results. Includes both fast lane and slow lane workers
`howk-scheduler`	Pops due retries from Redis, re-enqueues to Kafka
`howk-reconciler`	Rebuilds Redis state from Kafka replay
`howk-dev`	Single-process dev mode — no Kafka/Redis needed (see Dev Mode)

Quick Start

# Start infrastructure
docker-compose up -d

# Run all components
make run-api
make run-worker
make run-scheduler

# Enqueue a webhook
curl -X POST http://localhost:8080/webhooks/tenant123/enqueue \
  -H "Content-Type: application/json" \
  -d '{
    "endpoint": "https://example.com/webhook",
    "payload": {"event": "user.created", "data": {"id": 123}},
    "idempotency_key": "user-created-123"
  }'

Quick Start (Dev Mode — No Docker)

# Single binary, zero infrastructure
go run ./cmd/dev

# Or with dry-run (simulates delivery, no HTTP calls)
go run ./cmd/dev --dry

# With Lua scripts loaded from disk
go run ./cmd/dev --scripts-dir=./scripts

# Test it
curl -X POST http://localhost:8080/webhooks/tenant123/enqueue \
  -H "Content-Type: application/json" \
  -d '{"endpoint": "https://httpbin.org/post", "payload": {"event": "test"}}'

See Dev Mode for full documentation.

Configuration

HOWK supports flexible configuration through:

Environment Variables (highest priority) - HOWK_* prefixed
Config File (YAML format) - specified via --config flag or auto-discovered
Defaults (lowest priority) - sensible built-in defaults

Environment Variables

Override any configuration setting using environment variables with the HOWK_ prefix:

export HOWK_API_PORT=9090
export HOWK_KAFKA_BROKERS=kafka1:9092,kafka2:9092
export HOWK_REDIS_ADDR=redis.example.com:6379
export HOWK_REDIS_PASSWORD=secret
export HOWK_TTL_STATUS_TTL=72h
bin/howk-api

See .env.example for a complete list of environment variables.

Config File

Use a YAML config file for complex configurations:

bin/howk-api --config=/etc/howk/config.yaml

Example config.yaml:

api:
  port: 8080
  read_timeout: 10s
  write_timeout: 10s

kafka:
  brokers:
    - localhost:19092
  topics:
    pending: howk.pending
    slow: howk.slow
    results: howk.results
    deadletter: howk.deadletter
  consumer_group: howk-workers
  retention: 168h
  # per_key_parallelism: 1  # Uncomment for per-key parallelism (default: 1 = sequential)

redis:
  addr: "localhost:6379"
  password: ""
  pool_size: 100

delivery:
  timeout: 30s
  max_idle_conns: 100
  max_conns_per_host: 10

retry:
  base_delay: 10s
  max_delay: 24h
  max_attempts: 20
  jitter: 0.2

circuit_breaker:
  failure_threshold: 5
  failure_window: 60s
  recovery_timeout: 5m
  probe_interval: 60s
  success_threshold: 2

concurrency:
  max_inflight_per_endpoint: 50
  inflight_ttl: 2m
  slow_lane_rate: 20
  # max_inflight_per_domain: 0  # Uncomment to enable domain limiting (0 = disabled)
  # domain_overrides:           # Optional per-domain limits
  #   api.stripe.com: 200
  #   hooks.slack.com: 30

scheduler:
  poll_interval: 1s
  batch_size: 500

ttl:
  circuit_state_ttl: 24h
  status_ttl: 168h
  stats_ttl: 48h
  idempotency_ttl: 24h
  retry_data_ttl: 168h  # Compressed webhook data for retries

Configuration Precedence

Environment variables override config file settings, which override defaults:

# config.yaml has: api.port: 7070
# Environment variable overrides it:
export HOWK_API_PORT=9090
bin/howk-api --config=config.yaml
# Result: API listens on port 9090

API

Enqueue Webhook

POST /webhooks/:config/enqueue

Request:

{
  "endpoint": "https://customer.com/webhook",
  "payload": {"event": "order.completed"},
  "headers": {"X-Custom": "value"},
  "idempotency_key": "order-123-completed",
  "signing_secret": "whsec_..."
}

Response: 202 Accepted

{
  "webhook_id": "wh_01HQXYZ...",
  "status": "pending"
}

Get Status

GET /webhooks/:webhook_id/status

Response:

{
  "webhook_id": "wh_01HQXYZ...",
  "state": "delivered",
  "attempts": 2,
  "last_attempt_at": "2026-01-30T10:00:00Z",
  "last_status_code": 200,
  "next_retry_at": null
}

Get Stats

GET /stats

Response:

{
  "last_1h": {
    "enqueued": 7200,
    "delivered": 7150,
    "failed": 50,
    "unique_endpoints": 1200
  },
  "last_24h": {
    "enqueued": 172800,
    "delivered": 170000,
    "failed": 2800,
    "unique_endpoints": 45000
  }
}

Incoming Webhook Transformer

POST /incoming/:script_name

Execute a Lua transformer script to fan out incoming webhooks. See docs/transformers.md for full documentation.

Request:

curl -X POST http://localhost:8080/incoming/stripe-router \
  -H "Content-Type: application/json" \
  -u admin:password \
  -d '{"type": "charge.succeeded", "amount": 1000}'

Response: 200 OK

{
  "webhooks": [
    {"id": "wh_01HQXYZ...", "endpoint": "https://billing.internal/webhook"},
    {"id": "wh_01HQABC...", "endpoint": "https://analytics.internal/track"}
  ],
  "count": 2
}

Features:

Lua scripting for payload transformation and routing
Fan-out: 1 incoming request → N outgoing webhooks
Basic Auth support (bcrypt or plaintext)
Domain allowlists for security
Hot-reload via SIGHUP
KV store access for deduplication/state
HTTP client for external API calls

Dev Mode

HOWK includes a single-process dev mode that runs the full webhook lifecycle (API + worker + scheduler) with no external dependencies. Kafka is replaced by an in-memory channel-based broker, and Redis is replaced by miniredis (a full Redis implementation in Go).

Why

Running Kafka + Redis locally requires Docker and adds friction for developers who just need to test webhook delivery flows. Dev mode eliminates that:

go run ./cmd/dev        # That's it. No Docker, no infra.

What Works

Feature	Dev Mode	Production
Webhook enqueue + delivery	In-memory broker	Kafka
Status tracking	miniredis	Redis
Circuit breaker	miniredis (full state machine)	Redis
Retry scheduling	miniredis	Redis
Lua scripts (`kv.get`/`kv.set`)	miniredis	Redis
HTTP delivery	Real HTTP (or `--dry`)	Real HTTP
Slow lane / penalty box	In-memory broker	Kafka
Transformer scripts	Supported	Supported

Flags

--config        Config file path (optional, uses defaults)
--scripts-dir   Directory with .lua/.json script files
--port          API port (default: 8080, overrides config)
--dry           Dry-run mode: log deliveries, return 200 without HTTP calls

Loading Scripts

Scripts can be loaded from disk at startup. Two formats are supported:

.lua files — raw Lua code, filename becomes config_id:

# scripts/wh.lua → config_id "wh"
# scripts/caesb.lua → config_id "caesb"
go run ./cmd/dev --scripts-dir=./scripts

.json files — full script config:

{
  "config_id": "wh",
  "lua_code": "request.headers['X-Custom'] = 'value'",
  "hash": "abc123",
  "version": "1.0"
}

Scripts can also be uploaded at runtime via the API (PUT /config/:id/script) — they flow through the in-memory broker just like production.

Dry-Run Mode

With --dry, deliveries are logged but no HTTP requests are made. The worker assumes a 200 OK response for every webhook:

go run ./cmd/dev --dry --scripts-dir=./scripts

INF DRY: would deliver webhook_id=wh_01KNN... endpoint=https://example.com/hook payload_bytes=42
INF Delivery succeeded status_code=200 duration=1ms

This is useful for testing Lua script transformations, payload routing, and retry logic without needing a live endpoint.

Makefile Targets

make run-dev           # Live delivery, no scripts
make run-dev-dry       # Dry-run mode
make run-dev-scripts   # With scripts from ./scripts/

Limitations

No persistence — all state is lost on restart (miniredis + in-memory broker)
No reconciler — not needed without real Kafka to replay from
Single-process — no partition parallelism or consumer groups
No Kafka compaction — script topic is simple pub/sub

Architecture

┌────────────────────────────────────────────────────┐
│                  cmd/dev/main.go                    │
│                                                    │
│  ┌──────────┐  ┌──────────┐  ┌───────────────┐    │
│  │   API    │  │  Worker  │  │   Scheduler   │    │
│  │  Server  │  │  + Slow  │  │               │    │
│  └────┬─────┘  └────┬─────┘  └───────┬───────┘    │
│       │              │                │            │
│       ▼              ▼                ▼            │
│  ┌─────────────────────────────────────────────┐   │
│  │          MemBroker (channels)               │   │
│  │  howk.pending → howk.results → howk.dl      │   │
│  └─────────────────────────────────────────────┘   │
│       │              │                │            │
│       ▼              ▼                ▼            │
│  ┌─────────────────────────────────────────────┐   │
│  │       miniredis (in-process Redis)          │   │
│  │  circuit breaker, retries, status, KV, ...  │   │
│  └─────────────────────────────────────────────┘   │
└────────────────────────────────────────────────────┘

Recovery

Redis Dies (Zero Maintenance Recovery)

HOWK implements automatic self-healing for Redis loss using a distributed coordination pattern:

Automatic Recovery (Multi-Instance)

When Redis is lost in a multi-instance deployment (e.g., 3 workers + 2 publishers):

┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Worker    │     │   Worker    │     │   Worker    │
│  Instance 1 │     │  Instance 2 │     │  Instance N │
└──────┬──────┘     └──────┬──────┘     └──────┬──────┘
       │                   │                   │
       └─────────┬─────────┴─────────┬─────────┘
                 ▼                   ▼
       ┌───────────────────────────────────────┐
       │         Distributed Lock              │
       │   SET howk:reconciler:lock NX EX 60   │
       │   (Only 1 instance wins)              │
       └───────────────┬───────────────────────┘
                       ▼
       ┌───────────────────────────────────────┐
       │         Reconciler (Winner)           │
       │   1. Flush Redis                      │
       │   2. Consume howk.state → HWM         │
       │   3. Restore status + retries         │
       │   4. SET howk:system:initialized      │
       └───────────────┬───────────────────────┘
                       ▼
       ┌───────────────────────────────────────┐
       │        All Other Instances            │
       │   WaitForCanary() then resume         │
       └───────────────────────────────────────┘

Startup Sequence (Per Instance):

Check howk:system:initialized canary key
If missing: try to acquire howk:reconciler:lock
If lock acquired: run reconciler, set canary, release lock
If lock NOT acquired: WaitForCanary() until peer finishes
Proceed with normal operation

Runtime Sentinel (Auto-Recovery):

Background goroutine checks canary every 30s
If canary missing (Redis flushed/lost):
- Pause Kafka consumer
- Try to acquire lock → reconcile → set canary
- Resume consumer

Last-Write-Wins (LWW) Conflict Resolution

To prevent race conditions during concurrent reconciliation:

-- Every SetStatus uses Lua script with nanosecond timestamp check
local old_ts = tonumber(redis.call('HGET', KEYS[1], 'ts') or '0')
if new_ts > old_ts then
    redis.call('HSET', key, 'data', data, 'ts', new_ts)
    return 1  -- updated
end
return 0  -- skipped (old data)

Workers set UpdatedAtNs = time.Now().UnixNano() on every status change
Reconciler restores timestamps from Kafka snapshots
Result: Even if reconciler replays stale data, it won't overwrite newer writes

Manual Recovery (Optional)

For single-instance deployments or forced rebuild:

# Stop workers, flush Redis, run reconciler, start workers
redis-cli FLUSHDB
./bin/howk-reconciler
./bin/howk-worker &

Why this works:

Workers continuously publish state snapshots to howk.state topic
Failed webhooks (pending retry) → full state snapshot with UpdatedAtNs
Terminal webhooks (delivered/exhausted) → tombstone published
Kafka compaction retains only the latest state per webhook
LWW ensures safe concurrent writes during recovery

No data loss: Redis state is fully reconstructible from Kafka's compacted topic.

Kafka Broker Dies

Kafka handles this internally via replication. If you lose all replicas... you have bigger problems.

License

MIT

Name		Name	Last commit message	Last commit date
Latest commit History 111 Commits
.claude		.claude
.github/workflows		.github/workflows
bench		bench
charts		charts
cmd		cmd
docs		docs
internal		internal
scripts		scripts
.env.example		.env.example
.gitignore		.gitignore
AGENTS.md		AGENTS.md
CHANGELOG.md		CHANGELOG.md
CLAUDE.md		CLAUDE.md
Dockerfile		Dockerfile
GEMINI.md		GEMINI.md
Makefile		Makefile
README.md		README.md
TESTING_IMPLEMENTATION_NOTES.md		TESTING_IMPLEMENTATION_NOTES.md
TESTING_LUA_IMPLEMENTATION_NOTES.md		TESTING_LUA_IMPLEMENTATION_NOTES.md
TESTING_LUA_SUMMARY.md		TESTING_LUA_SUMMARY.md
TESTING_SUMMARY.md		TESTING_SUMMARY.md
codecov.yml		codecov.yml
config.example.yaml		config.example.yaml
docker-compose.yml		docker-compose.yml
go.mod		go.mod
go.sum		go.sum

Folders and files

Latest commit

History

Repository files navigation

HOWK - High Opinionated Webhook Kit

Philosophy

Architecture

High-Level Overview

System Data Flow

Retry Lifecycle Sequence

Redis Key Structure

Webhook State Machine

Operations Summary

Circuit Breaker Design

Penalty Box / Slow Lane

How It Works

Configuration

Domain Concurrency Limiter (NEW)

Problem It Solves

How It Works

Configuration

Per-Key Parallelism (NEW)

Trade-off

Configuration

Retry Strategy

Components

Quick Start

Quick Start (Dev Mode — No Docker)

Configuration

Environment Variables

Config File

Configuration Precedence

API

Enqueue Webhook

Get Status

Get Stats

Incoming Webhook Transformer

Dev Mode

Why

What Works

Flags

Loading Scripts

Dry-Run Mode

Makefile Targets

Limitations

Architecture

Recovery

Redis Dies (Zero Maintenance Recovery)

Automatic Recovery (Multi-Instance)

Last-Write-Wins (LWW) Conflict Resolution

Manual Recovery (Optional)

Kafka Broker Dies

License

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases 5

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages